From 11295055526308ee71d82dc97f0a9ca2dd61c3b9 Mon Sep 17 00:00:00 2001 From: Laurence Oberman Date: Thu, 1 Nov 2018 09:30:18 -0400 Subject: watchdog/core: Add watchdog_thresh command line parameter The hard and soft lockup detector threshold has a default value of 10 seconds which can only be changed via sysctl. During early boot lockup detection can trigger when noisy debugging emits a large amount of messages to the console, but there is no way to set a larger threshold on the kernel command line. The detector can only be completely disabled. Add a new watchdog_thresh= command line parameter to allow boot time control over the threshold. It works in the same way as the sysctl and affects both the soft and the hard lockup detectors. Signed-off-by: Laurence Oberman Signed-off-by: Thomas Gleixner Cc: rdunlap@infradead.org Cc: prarit@redhat.com Link: https://lkml.kernel.org/r/1541079018-13953-1-git-send-email-loberman@redhat.com --- kernel/watchdog.c | 7 +++++++ 1 file changed, 7 insertions(+) (limited to 'kernel') diff --git a/kernel/watchdog.c b/kernel/watchdog.c index 977918d5d350..8fbfda94a67b 100644 --- a/kernel/watchdog.c +++ b/kernel/watchdog.c @@ -199,6 +199,13 @@ static int __init nosoftlockup_setup(char *str) } __setup("nosoftlockup", nosoftlockup_setup); +static int __init watchdog_thresh_setup(char *str) +{ + get_option(&str, &watchdog_thresh); + return 1; +} +__setup("watchdog_thresh=", watchdog_thresh_setup); + #ifdef CONFIG_SMP int __read_mostly sysctl_softlockup_all_cpu_backtrace; -- cgit v1.3-7-g2ca7 From 47b7aee14fd7e453370a5d15dfb11c958ca360f2 Mon Sep 17 00:00:00 2001 From: Valentin Schneider Date: Wed, 26 Sep 2018 16:12:06 +0100 Subject: sched/fair: Clean up load_balance() condition The alignment of the condition is off, clean that up. Also, logical operators have lower precedence than bitwise/relational operators, so remove one layer of parentheses to make the condition a bit simpler to follow. Signed-off-by: Valentin Schneider Signed-off-by: Peter Zijlstra (Intel) Cc: Dietmar.Eggemann@arm.com Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: patrick.bellasi@arm.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1537974727-30788-1-git-send-email-valentin.schneider@arm.com Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index ee271bb661cc..4e298931a715 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -8877,9 +8877,9 @@ out_all_pinned: out_one_pinned: /* tune up the balancing interval */ - if (((env.flags & LBF_ALL_PINNED) && - sd->balance_interval < MAX_PINNED_INTERVAL) || - (sd->balance_interval < sd->max_interval)) + if ((env.flags & LBF_ALL_PINNED && + sd->balance_interval < MAX_PINNED_INTERVAL) || + sd->balance_interval < sd->max_interval) sd->balance_interval *= 2; ld_moved = 0; -- cgit v1.3-7-g2ca7 From 3f130a37c442d5c4d66531b240ebe9abfef426b5 Mon Sep 17 00:00:00 2001 From: Valentin Schneider Date: Wed, 26 Sep 2018 16:12:07 +0100 Subject: sched/fair: Don't increase sd->balance_interval on newidle balance When load_balance() fails to move some load because of task affinity, we end up increasing sd->balance_interval to delay the next periodic balance in the hopes that next time we look, that annoying pinned task(s) will be gone. However, idle_balance() pays no attention to sd->balance_interval, yet it will still lead to an increase in balance_interval in case of pinned tasks. If we're going through several newidle balances (e.g. we have a periodic task), this can lead to a huge increase of the balance_interval in a very small amount of time. To prevent that, don't increase the balance interval when going through a newidle balance. This is a similar approach to what is done in commit 58b26c4c0257 ("sched: Increment cache_nice_tries only on periodic lb"), where we disregard newidle balance and rely on periodic balance for more stable results. Signed-off-by: Valentin Schneider Signed-off-by: Peter Zijlstra (Intel) Cc: Dietmar.Eggemann@arm.com Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: patrick.bellasi@arm.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1537974727-30788-2-git-send-email-valentin.schneider@arm.com Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 13 +++++++++++-- 1 file changed, 11 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 4e298931a715..a17ca4254427 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -8876,13 +8876,22 @@ out_all_pinned: sd->nr_balance_failed = 0; out_one_pinned: + ld_moved = 0; + + /* + * idle_balance() disregards balance intervals, so we could repeatedly + * reach this code, which would lead to balance_interval skyrocketting + * in a short amount of time. Skip the balance_interval increase logic + * to avoid that. + */ + if (env.idle == CPU_NEWLY_IDLE) + goto out; + /* tune up the balancing interval */ if ((env.flags & LBF_ALL_PINNED && sd->balance_interval < MAX_PINNED_INTERVAL) || sd->balance_interval < sd->max_interval) sd->balance_interval *= 2; - - ld_moved = 0; out: return ld_moved; } -- cgit v1.3-7-g2ca7 From ff1cdc94de4d336be45336d70709dfcf3d682514 Mon Sep 17 00:00:00 2001 From: Muchun Song Date: Fri, 26 Oct 2018 21:17:43 +0800 Subject: sched/core: Introduce set_next_task() helper for better code readability When we pick the next task, we will do the following for the task: 1) p->se.exec_start = rq_clock_task(rq); 2) dequeue_pushable(_dl)_task(rq, p); When we call set_curr_task(), we also need to do the same thing above. In rt.c, the code at 1) is in the _pick_next_task_rt() and the code at 2) is in the pick_next_task_rt(). If we put two operations in one function, maybe better. So, we introduce a new function set_next_task(), which is responsible for doing the above. By introducing the function we can get rid of calling the dequeue_pushable(_dl)_task() directly(We can call set_next_task()) in pick_next_task() and have better code readability and reuse. In set_curr_task_rt(), we also can call set_next_task(). Do this things such that we end up with: static struct task_struct *pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) { /* do something else ... */ put_prev_task(rq, prev); /* pick next task p */ set_next_task(rq, p); /* do something else ... */ } put_prev_task() can match set_next_task(), which can make the code more readable. Signed-off-by: Muchun Song Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/20181026131743.21786-1-smuchun@gmail.com Signed-off-by: Ingo Molnar --- kernel/sched/deadline.c | 19 ++++++++++--------- kernel/sched/rt.c | 24 +++++++++++------------- 2 files changed, 21 insertions(+), 22 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 91e4202b0634..470ba6b464fe 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -1695,6 +1695,14 @@ static void start_hrtick_dl(struct rq *rq, struct task_struct *p) } #endif +static inline void set_next_task(struct rq *rq, struct task_struct *p) +{ + p->se.exec_start = rq_clock_task(rq); + + /* You can't push away the running task */ + dequeue_pushable_dl_task(rq, p); +} + static struct sched_dl_entity *pick_next_dl_entity(struct rq *rq, struct dl_rq *dl_rq) { @@ -1750,10 +1758,8 @@ pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) BUG_ON(!dl_se); p = dl_task_of(dl_se); - p->se.exec_start = rq_clock_task(rq); - /* Running task will never be pushed. */ - dequeue_pushable_dl_task(rq, p); + set_next_task(rq, p); if (hrtick_enabled(rq)) start_hrtick_dl(rq, p); @@ -1808,12 +1814,7 @@ static void task_fork_dl(struct task_struct *p) static void set_curr_task_dl(struct rq *rq) { - struct task_struct *p = rq->curr; - - p->se.exec_start = rq_clock_task(rq); - - /* You can't push away the running task */ - dequeue_pushable_dl_task(rq, p); + set_next_task(rq, rq->curr); } #ifdef CONFIG_SMP diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index a21ea6021929..9aa3287ce301 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -1498,6 +1498,14 @@ static void check_preempt_curr_rt(struct rq *rq, struct task_struct *p, int flag #endif } +static inline void set_next_task(struct rq *rq, struct task_struct *p) +{ + p->se.exec_start = rq_clock_task(rq); + + /* The running task is never eligible for pushing */ + dequeue_pushable_task(rq, p); +} + static struct sched_rt_entity *pick_next_rt_entity(struct rq *rq, struct rt_rq *rt_rq) { @@ -1518,7 +1526,6 @@ static struct sched_rt_entity *pick_next_rt_entity(struct rq *rq, static struct task_struct *_pick_next_task_rt(struct rq *rq) { struct sched_rt_entity *rt_se; - struct task_struct *p; struct rt_rq *rt_rq = &rq->rt; do { @@ -1527,10 +1534,7 @@ static struct task_struct *_pick_next_task_rt(struct rq *rq) rt_rq = group_rt_rq(rt_se); } while (rt_rq); - p = rt_task_of(rt_se); - p->se.exec_start = rq_clock_task(rq); - - return p; + return rt_task_of(rt_se); } static struct task_struct * @@ -1573,8 +1577,7 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) p = _pick_next_task_rt(rq); - /* The running task is never eligible for pushing */ - dequeue_pushable_task(rq, p); + set_next_task(rq, p); rt_queue_push_tasks(rq); @@ -2355,12 +2358,7 @@ static void task_tick_rt(struct rq *rq, struct task_struct *p, int queued) static void set_curr_task_rt(struct rq *rq) { - struct task_struct *p = rq->curr; - - p->se.exec_start = rq_clock_task(rq); - - /* The running task is never eligible for pushing */ - dequeue_pushable_task(rq, p); + set_next_task(rq, rq->curr); } static unsigned int get_rr_interval_rt(struct rq *rq, struct task_struct *task) -- cgit v1.3-7-g2ca7 From b82592199032bf7c778f861b936287e37ebc9f62 Mon Sep 17 00:00:00 2001 From: Long Li Date: Fri, 2 Nov 2018 18:02:48 +0000 Subject: genirq/affinity: Spread IRQs to all available NUMA nodes If the number of NUMA nodes exceeds the number of MSI/MSI-X interrupts which are allocated for a device, the interrupt affinity spreading code fails to spread them across all nodes. The reason is, that the spreading code starts from node 0 and continues up to the number of interrupts requested for allocation. This leaves the nodes past the last interrupt unused. This results in interrupt concentration on the first nodes which violates the assumption of the block layer that all nodes are covered evenly. As a consequence the NUMA nodes above the number of interrupts are all assigned to hardware queue 0 and therefore NUMA node 0, which results in bad performance and has CPU hotplug implications, because queue 0 gets shut down when the last CPU of node 0 is offlined. Go over all NUMA nodes and assign them round-robin to all requested interrupts to solve this. [ tglx: Massaged changelog ] Signed-off-by: Long Li Signed-off-by: Thomas Gleixner Reviewed-by: Ming Lei Cc: Michael Kelley Link: https://lkml.kernel.org/r/20181102180248.13583-1-longli@linuxonhyperv.com --- kernel/irq/affinity.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c index f4f29b9d90ee..e12cdf637c71 100644 --- a/kernel/irq/affinity.c +++ b/kernel/irq/affinity.c @@ -117,12 +117,11 @@ static int irq_build_affinity_masks(const struct irq_affinity *affd, */ if (numvecs <= nodes) { for_each_node_mask(n, nodemsk) { - cpumask_copy(masks + curvec, node_to_cpumask[n]); - if (++done == numvecs) - break; + cpumask_or(masks + curvec, masks + curvec, node_to_cpumask[n]); if (++curvec == last_affv) curvec = affd->pre_vectors; } + done = numvecs; goto out; } -- cgit v1.3-7-g2ca7 From 5c903e108d0b005cf59904ca3520934fca4b9439 Mon Sep 17 00:00:00 2001 From: Ming Lei Date: Fri, 2 Nov 2018 22:59:49 +0800 Subject: genirq/affinity: Move two stage affinity spreading into a helper function No functional change. Prepares for supporting allocating and affinitizing interrupt sets. [ tglx: Minor changelog tweaks ] Signed-off-by: Ming Lei Signed-off-by: Thomas Gleixner Cc: Jens Axboe Cc: linux-block@vger.kernel.org Cc: Hannes Reinecke Cc: Keith Busch Cc: Sagi Grimberg Link: https://lkml.kernel.org/r/20181102145951.31979-3-ming.lei@redhat.com --- kernel/irq/affinity.c | 92 +++++++++++++++++++++++++++++++-------------------- 1 file changed, 56 insertions(+), 36 deletions(-) (limited to 'kernel') diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c index e12cdf637c71..2f9812b6035e 100644 --- a/kernel/irq/affinity.c +++ b/kernel/irq/affinity.c @@ -94,7 +94,7 @@ static int get_nodes_in_cpumask(cpumask_var_t *node_to_cpumask, return nodes; } -static int irq_build_affinity_masks(const struct irq_affinity *affd, +static int __irq_build_affinity_masks(const struct irq_affinity *affd, int startvec, int numvecs, cpumask_var_t *node_to_cpumask, const struct cpumask *cpu_mask, @@ -165,6 +165,58 @@ out: return done; } +/* + * build affinity in two stages: + * 1) spread present CPU on these vectors + * 2) spread other possible CPUs on these vectors + */ +static int irq_build_affinity_masks(const struct irq_affinity *affd, + int startvec, int numvecs, + cpumask_var_t *node_to_cpumask, + struct cpumask *masks) +{ + int curvec = startvec, usedvecs = -1; + cpumask_var_t nmsk, npresmsk; + + if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL)) + return usedvecs; + + if (!zalloc_cpumask_var(&npresmsk, GFP_KERNEL)) + goto fail; + + /* Stabilize the cpumasks */ + get_online_cpus(); + build_node_to_cpumask(node_to_cpumask); + + /* Spread on present CPUs starting from affd->pre_vectors */ + usedvecs = __irq_build_affinity_masks(affd, curvec, numvecs, + node_to_cpumask, cpu_present_mask, + nmsk, masks); + + /* + * Spread on non present CPUs starting from the next vector to be + * handled. If the spreading of present CPUs already exhausted the + * vector space, assign the non present CPUs to the already spread + * out vectors. + */ + if (usedvecs >= numvecs) + curvec = affd->pre_vectors; + else + curvec = affd->pre_vectors + usedvecs; + cpumask_andnot(npresmsk, cpu_possible_mask, cpu_present_mask); + usedvecs += __irq_build_affinity_masks(affd, curvec, numvecs, + node_to_cpumask, npresmsk, + nmsk, masks); + put_online_cpus(); + + free_cpumask_var(npresmsk); + + fail: + free_cpumask_var(nmsk); + + return usedvecs; +} + /** * irq_create_affinity_masks - Create affinity masks for multiqueue spreading * @nvecs: The total number of vectors @@ -177,7 +229,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd) { int affvecs = nvecs - affd->pre_vectors - affd->post_vectors; int curvec, usedvecs; - cpumask_var_t nmsk, npresmsk, *node_to_cpumask; + cpumask_var_t *node_to_cpumask; struct cpumask *masks = NULL; /* @@ -187,15 +239,9 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd) if (nvecs == affd->pre_vectors + affd->post_vectors) return NULL; - if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL)) - return NULL; - - if (!zalloc_cpumask_var(&npresmsk, GFP_KERNEL)) - goto outcpumsk; - node_to_cpumask = alloc_node_to_cpumask(); if (!node_to_cpumask) - goto outnpresmsk; + return NULL; masks = kcalloc(nvecs, sizeof(*masks), GFP_KERNEL); if (!masks) @@ -205,30 +251,8 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd) for (curvec = 0; curvec < affd->pre_vectors; curvec++) cpumask_copy(masks + curvec, irq_default_affinity); - /* Stabilize the cpumasks */ - get_online_cpus(); - build_node_to_cpumask(node_to_cpumask); - - /* Spread on present CPUs starting from affd->pre_vectors */ usedvecs = irq_build_affinity_masks(affd, curvec, affvecs, - node_to_cpumask, cpu_present_mask, - nmsk, masks); - - /* - * Spread on non present CPUs starting from the next vector to be - * handled. If the spreading of present CPUs already exhausted the - * vector space, assign the non present CPUs to the already spread - * out vectors. - */ - if (usedvecs >= affvecs) - curvec = affd->pre_vectors; - else - curvec = affd->pre_vectors + usedvecs; - cpumask_andnot(npresmsk, cpu_possible_mask, cpu_present_mask); - usedvecs += irq_build_affinity_masks(affd, curvec, affvecs, - node_to_cpumask, npresmsk, - nmsk, masks); - put_online_cpus(); + node_to_cpumask, masks); /* Fill out vectors at the end that don't need affinity */ if (usedvecs >= affvecs) @@ -240,10 +264,6 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd) outnodemsk: free_node_to_cpumask(node_to_cpumask); -outnpresmsk: - free_cpumask_var(npresmsk); -outcpumsk: - free_cpumask_var(nmsk); return masks; } -- cgit v1.3-7-g2ca7 From 060746d9e394084b7401e7532f2de528ecbfb521 Mon Sep 17 00:00:00 2001 From: Ming Lei Date: Fri, 2 Nov 2018 22:59:50 +0800 Subject: genirq/affinity: Pass first vector to __irq_build_affinity_masks() No functional change. Prepares for support of allocating and affinitizing sets of interrupts, in which each set of interrupts needs a full two stage spreading. The first vector argument is necessary for this so the affinitizing starts from the first vector of each set. [ tglx: Minor changelog tweaks ] Signed-off-by: Ming Lei Signed-off-by: Thomas Gleixner Cc: Jens Axboe Cc: linux-block@vger.kernel.org Cc: Hannes Reinecke Cc: Keith Busch Cc: Sagi Grimberg Link: https://lkml.kernel.org/r/20181102145951.31979-4-ming.lei@redhat.com --- kernel/irq/affinity.c | 20 +++++++++++--------- 1 file changed, 11 insertions(+), 9 deletions(-) (limited to 'kernel') diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c index 2f9812b6035e..e028b773e38a 100644 --- a/kernel/irq/affinity.c +++ b/kernel/irq/affinity.c @@ -95,14 +95,14 @@ static int get_nodes_in_cpumask(cpumask_var_t *node_to_cpumask, } static int __irq_build_affinity_masks(const struct irq_affinity *affd, - int startvec, int numvecs, + int startvec, int numvecs, int firstvec, cpumask_var_t *node_to_cpumask, const struct cpumask *cpu_mask, struct cpumask *nmsk, struct cpumask *masks) { int n, nodes, cpus_per_vec, extra_vecs, done = 0; - int last_affv = affd->pre_vectors + numvecs; + int last_affv = firstvec + numvecs; int curvec = startvec; nodemask_t nodemsk = NODE_MASK_NONE; @@ -119,7 +119,7 @@ static int __irq_build_affinity_masks(const struct irq_affinity *affd, for_each_node_mask(n, nodemsk) { cpumask_or(masks + curvec, masks + curvec, node_to_cpumask[n]); if (++curvec == last_affv) - curvec = affd->pre_vectors; + curvec = firstvec; } done = numvecs; goto out; @@ -129,7 +129,7 @@ static int __irq_build_affinity_masks(const struct irq_affinity *affd, int ncpus, v, vecs_to_assign, vecs_per_node; /* Spread the vectors per node */ - vecs_per_node = (numvecs - (curvec - affd->pre_vectors)) / nodes; + vecs_per_node = (numvecs - (curvec - firstvec)) / nodes; /* Get the cpus on this node which are in the mask */ cpumask_and(nmsk, cpu_mask, node_to_cpumask[n]); @@ -157,7 +157,7 @@ static int __irq_build_affinity_masks(const struct irq_affinity *affd, if (done >= numvecs) break; if (curvec >= last_affv) - curvec = affd->pre_vectors; + curvec = firstvec; --nodes; } @@ -190,8 +190,9 @@ static int irq_build_affinity_masks(const struct irq_affinity *affd, /* Spread on present CPUs starting from affd->pre_vectors */ usedvecs = __irq_build_affinity_masks(affd, curvec, numvecs, - node_to_cpumask, cpu_present_mask, - nmsk, masks); + affd->pre_vectors, + node_to_cpumask, + cpu_present_mask, nmsk, masks); /* * Spread on non present CPUs starting from the next vector to be @@ -205,8 +206,9 @@ static int irq_build_affinity_masks(const struct irq_affinity *affd, curvec = affd->pre_vectors + usedvecs; cpumask_andnot(npresmsk, cpu_possible_mask, cpu_present_mask); usedvecs += __irq_build_affinity_masks(affd, curvec, numvecs, - node_to_cpumask, npresmsk, - nmsk, masks); + affd->pre_vectors, + node_to_cpumask, npresmsk, + nmsk, masks); put_online_cpus(); free_cpumask_var(npresmsk); -- cgit v1.3-7-g2ca7 From 6da4b3ab9a6e9b1b5f90322ab3fa3a7dd18edb19 Mon Sep 17 00:00:00 2001 From: Jens Axboe Date: Fri, 2 Nov 2018 22:59:51 +0800 Subject: genirq/affinity: Add support for allocating interrupt sets A driver may have a need to allocate multiple sets of MSI/MSI-X interrupts, and have them appropriately affinitized. Add support for defining a number of sets in the irq_affinity structure, of varying sizes, and get each set affinitized correctly across the machine. [ tglx: Minor changelog tweaks ] Signed-off-by: Jens Axboe Signed-off-by: Ming Lei Signed-off-by: Thomas Gleixner Reviewed-by: Hannes Reinecke Reviewed-by: Ming Lei Reviewed-by: Keith Busch Reviewed-by: Sagi Grimberg Cc: linux-block@vger.kernel.org Link: https://lkml.kernel.org/r/20181102145951.31979-5-ming.lei@redhat.com --- drivers/pci/msi.c | 14 +++++++++ include/linux/interrupt.h | 4 +++ kernel/irq/affinity.c | 77 +++++++++++++++++++++++++++++++++-------------- 3 files changed, 72 insertions(+), 23 deletions(-) (limited to 'kernel') diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c index af24ed50a245..265ed3e4c920 100644 --- a/drivers/pci/msi.c +++ b/drivers/pci/msi.c @@ -1036,6 +1036,13 @@ static int __pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec, if (maxvec < minvec) return -ERANGE; + /* + * If the caller is passing in sets, we can't support a range of + * vectors. The caller needs to handle that. + */ + if (affd && affd->nr_sets && minvec != maxvec) + return -EINVAL; + if (WARN_ON_ONCE(dev->msi_enabled)) return -EINVAL; @@ -1087,6 +1094,13 @@ static int __pci_enable_msix_range(struct pci_dev *dev, if (maxvec < minvec) return -ERANGE; + /* + * If the caller is passing in sets, we can't support a range of + * supported vectors. The caller needs to handle that. + */ + if (affd && affd->nr_sets && minvec != maxvec) + return -EINVAL; + if (WARN_ON_ONCE(dev->msix_enabled)) return -EINVAL; diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h index 1d6711c28271..ca397ff40836 100644 --- a/include/linux/interrupt.h +++ b/include/linux/interrupt.h @@ -247,10 +247,14 @@ struct irq_affinity_notify { * the MSI(-X) vector space * @post_vectors: Don't apply affinity to @post_vectors at end of * the MSI(-X) vector space + * @nr_sets: Length of passed in *sets array + * @sets: Number of affinitized sets */ struct irq_affinity { int pre_vectors; int post_vectors; + int nr_sets; + int *sets; }; #if defined(CONFIG_SMP) diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c index e028b773e38a..08c904eb7279 100644 --- a/kernel/irq/affinity.c +++ b/kernel/irq/affinity.c @@ -171,28 +171,29 @@ out: * 2) spread other possible CPUs on these vectors */ static int irq_build_affinity_masks(const struct irq_affinity *affd, - int startvec, int numvecs, + int startvec, int numvecs, int firstvec, cpumask_var_t *node_to_cpumask, struct cpumask *masks) { - int curvec = startvec, usedvecs = -1; + int curvec = startvec, nr_present, nr_others; + int ret = -ENOMEM; cpumask_var_t nmsk, npresmsk; if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL)) - return usedvecs; + return ret; if (!zalloc_cpumask_var(&npresmsk, GFP_KERNEL)) goto fail; + ret = 0; /* Stabilize the cpumasks */ get_online_cpus(); build_node_to_cpumask(node_to_cpumask); /* Spread on present CPUs starting from affd->pre_vectors */ - usedvecs = __irq_build_affinity_masks(affd, curvec, numvecs, - affd->pre_vectors, - node_to_cpumask, - cpu_present_mask, nmsk, masks); + nr_present = __irq_build_affinity_masks(affd, curvec, numvecs, + firstvec, node_to_cpumask, + cpu_present_mask, nmsk, masks); /* * Spread on non present CPUs starting from the next vector to be @@ -200,23 +201,24 @@ static int irq_build_affinity_masks(const struct irq_affinity *affd, * vector space, assign the non present CPUs to the already spread * out vectors. */ - if (usedvecs >= numvecs) - curvec = affd->pre_vectors; + if (nr_present >= numvecs) + curvec = firstvec; else - curvec = affd->pre_vectors + usedvecs; + curvec = firstvec + nr_present; cpumask_andnot(npresmsk, cpu_possible_mask, cpu_present_mask); - usedvecs += __irq_build_affinity_masks(affd, curvec, numvecs, - affd->pre_vectors, - node_to_cpumask, npresmsk, - nmsk, masks); + nr_others = __irq_build_affinity_masks(affd, curvec, numvecs, + firstvec, node_to_cpumask, + npresmsk, nmsk, masks); put_online_cpus(); + if (nr_present < numvecs) + WARN_ON(nr_present + nr_others < numvecs); + free_cpumask_var(npresmsk); fail: free_cpumask_var(nmsk); - - return usedvecs; + return ret; } /** @@ -233,6 +235,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd) int curvec, usedvecs; cpumask_var_t *node_to_cpumask; struct cpumask *masks = NULL; + int i, nr_sets; /* * If there aren't any vectors left after applying the pre/post @@ -253,8 +256,28 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd) for (curvec = 0; curvec < affd->pre_vectors; curvec++) cpumask_copy(masks + curvec, irq_default_affinity); - usedvecs = irq_build_affinity_masks(affd, curvec, affvecs, - node_to_cpumask, masks); + /* + * Spread on present CPUs starting from affd->pre_vectors. If we + * have multiple sets, build each sets affinity mask separately. + */ + nr_sets = affd->nr_sets; + if (!nr_sets) + nr_sets = 1; + + for (i = 0, usedvecs = 0; i < nr_sets; i++) { + int this_vecs = affd->sets ? affd->sets[i] : affvecs; + int ret; + + ret = irq_build_affinity_masks(affd, curvec, this_vecs, + curvec, node_to_cpumask, masks); + if (ret) { + kfree(masks); + masks = NULL; + goto outnodemsk; + } + curvec += this_vecs; + usedvecs += this_vecs; + } /* Fill out vectors at the end that don't need affinity */ if (usedvecs >= affvecs) @@ -279,13 +302,21 @@ int irq_calc_affinity_vectors(int minvec, int maxvec, const struct irq_affinity { int resv = affd->pre_vectors + affd->post_vectors; int vecs = maxvec - resv; - int ret; + int set_vecs; if (resv > minvec) return 0; - get_online_cpus(); - ret = min_t(int, cpumask_weight(cpu_possible_mask), vecs) + resv; - put_online_cpus(); - return ret; + if (affd->nr_sets) { + int i; + + for (i = 0, set_vecs = 0; i < affd->nr_sets; i++) + set_vecs += affd->sets[i]; + } else { + get_online_cpus(); + set_vecs = cpumask_weight(cpu_possible_mask); + put_online_cpus(); + } + + return resv + min(set_vecs, vecs); } -- cgit v1.3-7-g2ca7 From 7d9df98be66fec64349f9f1c9d3e896293fe7b45 Mon Sep 17 00:00:00 2001 From: Yangtao Li Date: Sat, 3 Nov 2018 22:31:04 -0400 Subject: clockevents: Remove unnecessary unlikely() WARN_ON() and WARN_ON_ONCE() already contains an unlikely(), so it's not necessary to use unlikely. Signed-off-by: Yangtao Li Signed-off-by: Thomas Gleixner Link: https://lkml.kernel.org/r/20181104023104.2572-1-tiny.windzz@gmail.com --- kernel/time/clockevents.c | 12 +++--------- 1 file changed, 3 insertions(+), 9 deletions(-) (limited to 'kernel') diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c index 8c0e4092f661..af58898d9ebf 100644 --- a/kernel/time/clockevents.c +++ b/kernel/time/clockevents.c @@ -39,10 +39,8 @@ static u64 cev_delta2ns(unsigned long latch, struct clock_event_device *evt, u64 clc = (u64) latch << evt->shift; u64 rnd; - if (unlikely(!evt->mult)) { + if (WARN_ON(!evt->mult)) evt->mult = 1; - WARN_ON(1); - } rnd = (u64) evt->mult - 1; /* @@ -164,10 +162,8 @@ void clockevents_switch_state(struct clock_event_device *dev, * on it, so fix it up and emit a warning: */ if (clockevent_state_oneshot(dev)) { - if (unlikely(!dev->mult)) { + if (WARN_ON(!dev->mult)) dev->mult = 1; - WARN_ON(1); - } } } } @@ -315,10 +311,8 @@ int clockevents_program_event(struct clock_event_device *dev, ktime_t expires, int64_t delta; int rc; - if (unlikely(expires < 0)) { - WARN_ON_ONCE(1); + if (WARN_ON_ONCE(expires < 0)) return -ETIME; - } dev->next_event = expires; -- cgit v1.3-7-g2ca7 From 4d9ebbe2b061a9c25e12ba8539ba172533132eb6 Mon Sep 17 00:00:00 2001 From: Yangtao Li Date: Sat, 3 Nov 2018 22:27:41 -0400 Subject: cgroup: remove unnecessary unlikely() WARN_ON() already contains an unlikely(), so it's not necessary to use unlikely. Signed-off-by: Yangtao Li Signed-off-by: Tejun Heo --- kernel/cgroup/cgroup.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 6aaf5dd5383b..2e5d90dfcb49 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -5978,10 +5978,8 @@ static ssize_t show_delegatable_files(struct cftype *files, char *buf, ret += snprintf(buf + ret, size - ret, "%s\n", cft->name); - if (unlikely(ret >= size)) { - WARN_ON(1); + if (WARN_ON(ret >= size)) break; - } } return ret; -- cgit v1.3-7-g2ca7 From ea956d8be91edc702a98b7fe1f9463e7ca8c42ab Mon Sep 17 00:00:00 2001 From: Richard Guy Briggs Date: Wed, 10 Oct 2018 16:22:57 -0400 Subject: audit: print empty EXECVE args Empty executable arguments were being skipped when printing out the list of arguments in an EXECVE record, making it appear they were somehow lost. Include empty arguments as an itemized empty string. Reproducer: autrace /bin/ls "" "/etc" ausearch --start recent -m execve -i | grep EXECVE type=EXECVE msg=audit(10/03/2018 13:04:03.208:1391) : argc=3 a0=/bin/ls a2=/etc With fix: type=EXECVE msg=audit(10/03/2018 21:51:38.290:194) : argc=3 a0=/bin/ls a1= a2=/etc type=EXECVE msg=audit(1538617898.290:194): argc=3 a0="/bin/ls" a1="" a2="/etc" Passes audit-testsuite. GH issue tracker at https://github.com/linux-audit/audit-kernel/issues/99 Signed-off-by: Richard Guy Briggs [PM: cleaned up the commit metadata] Signed-off-by: Paul Moore --- kernel/auditsc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/auditsc.c b/kernel/auditsc.c index b2d1f043f17f..1513873e23bd 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -1107,7 +1107,7 @@ static void audit_log_execve_info(struct audit_context *context, } /* write as much as we can to the audit log */ - if (len_buf > 0) { + if (len_buf >= 0) { /* NOTE: some magic numbers here - basically if we * can't fit a reasonable amount of data into the * existing audit buffer, flush it and start with -- cgit v1.3-7-g2ca7 From e8da8794a7fd9eef1ec9a07f0d4897c68581c72b Mon Sep 17 00:00:00 2001 From: Long Li Date: Tue, 6 Nov 2018 04:00:00 +0000 Subject: genirq/matrix: Improve target CPU selection for managed interrupts. On large systems with multiple devices of the same class (e.g. NVMe disks, using managed interrupts), the kernel can affinitize these interrupts to a small subset of CPUs instead of spreading them out evenly. irq_matrix_alloc_managed() tries to select the CPU in the supplied cpumask of possible target CPUs which has the lowest number of interrupt vectors allocated. This is done by searching the CPU with the highest number of available vectors. While this is correct for non-managed CPUs it can select the wrong CPU for managed interrupts. Under certain constellations this results in affinitizing the managed interrupts of several devices to a single CPU in a set. The book keeping of available vectors works the following way: 1) Non-managed interrupts: available is decremented when the interrupt is actually requested by the device driver and a vector is assigned. It's incremented when the interrupt and the vector are freed. 2) Managed interrupts: Managed interrupts guarantee vector reservation when the MSI/MSI-X functionality of a device is enabled, which is achieved by reserving vectors in the bitmaps of the possible target CPUs. This reservation decrements the available count on each possible target CPU. When the interrupt is requested by the device driver then a vector is allocated from the reserved region. The operation is reversed when the interrupt is freed by the device driver. Neither of these operations affect the available count. The reservation persist up to the point where the MSI/MSI-X functionality is disabled and only this operation increments the available count again. For non-managed interrupts the available count is the correct selection criterion because the guaranteed reservations need to be taken into account. Using the allocated counter could lead to a failing allocation in the following situation (total vector space of 10 assumed): CPU0 CPU1 available: 2 0 allocated: 5 3 <--- CPU1 is selected, but available space = 0 managed reserved: 3 7 while available yields the correct result. For managed interrupts the available count is not the appropriate selection criterion because as explained above the available count is not affected by the actual vector allocation. The following example illustrates that. Total vector space of 10 assumed. The starting point is: CPU0 CPU1 available: 5 4 allocated: 2 3 managed reserved: 3 3 Allocating vectors for three non-managed interrupts will result in affinitizing the first two to CPU0 and the third one to CPU1 because the available count is adjusted with each allocation: CPU0 CPU1 available: 5 4 <- Select CPU0 for 1st allocation --> allocated: 3 3 available: 4 4 <- Select CPU0 for 2nd allocation --> allocated: 4 3 available: 3 4 <- Select CPU1 for 3rd allocation --> allocated: 4 4 But the allocation of three managed interrupts starting from the same point will affinitize all of them to CPU0 because the available count is not affected by the allocation (see above). So the end result is: CPU0 CPU1 available: 5 4 allocated: 5 3 Introduce a "managed_allocated" field in struct cpumap to track the vector allocation for managed interrupts separately. Use this information to select the target CPU when a vector is allocated for a managed interrupt, which results in more evenly distributed vector assignments. The above example results in the following allocations: CPU0 CPU1 managed_allocated: 0 0 <- Select CPU0 for 1st allocation --> allocated: 3 3 managed_allocated: 1 0 <- Select CPU1 for 2nd allocation --> allocated: 3 4 managed_allocated: 1 1 <- Select CPU0 for 3rd allocation --> allocated: 4 4 The allocation of non-managed interrupts is not affected by this change and is still evaluating the available count. The overall distribution of interrupt vectors for both types of interrupts might still not be perfectly even depending on the number of non-managed and managed interrupts in a system, but due to the reservation guarantee for managed interrupts this cannot be avoided. Expose the new field in debugfs as well. [ tglx: Clarified the background of the problem in the changelog and described it independent of NVME ] Signed-off-by: Long Li Signed-off-by: Thomas Gleixner Cc: Michael Kelley Link: https://lkml.kernel.org/r/20181106040000.27316-1-longli@linuxonhyperv.com --- kernel/irq/matrix.c | 34 ++++++++++++++++++++++++++++++---- 1 file changed, 30 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/irq/matrix.c b/kernel/irq/matrix.c index 1f0985adf193..30cc217b8631 100644 --- a/kernel/irq/matrix.c +++ b/kernel/irq/matrix.c @@ -14,6 +14,7 @@ struct cpumap { unsigned int available; unsigned int allocated; unsigned int managed; + unsigned int managed_allocated; bool initialized; bool online; unsigned long alloc_map[IRQ_MATRIX_SIZE]; @@ -145,6 +146,27 @@ static unsigned int matrix_find_best_cpu(struct irq_matrix *m, return best_cpu; } +/* Find the best CPU which has the lowest number of managed IRQs allocated */ +static unsigned int matrix_find_best_cpu_managed(struct irq_matrix *m, + const struct cpumask *msk) +{ + unsigned int cpu, best_cpu, allocated = UINT_MAX; + struct cpumap *cm; + + best_cpu = UINT_MAX; + + for_each_cpu(cpu, msk) { + cm = per_cpu_ptr(m->maps, cpu); + + if (!cm->online || cm->managed_allocated > allocated) + continue; + + best_cpu = cpu; + allocated = cm->managed_allocated; + } + return best_cpu; +} + /** * irq_matrix_assign_system - Assign system wide entry in the matrix * @m: Matrix pointer @@ -269,7 +291,7 @@ int irq_matrix_alloc_managed(struct irq_matrix *m, const struct cpumask *msk, if (cpumask_empty(msk)) return -EINVAL; - cpu = matrix_find_best_cpu(m, msk); + cpu = matrix_find_best_cpu_managed(m, msk); if (cpu == UINT_MAX) return -ENOSPC; @@ -282,6 +304,7 @@ int irq_matrix_alloc_managed(struct irq_matrix *m, const struct cpumask *msk, return -ENOSPC; set_bit(bit, cm->alloc_map); cm->allocated++; + cm->managed_allocated++; m->total_allocated++; *mapped_cpu = cpu; trace_irq_matrix_alloc_managed(bit, cpu, m, cm); @@ -395,6 +418,8 @@ void irq_matrix_free(struct irq_matrix *m, unsigned int cpu, clear_bit(bit, cm->alloc_map); cm->allocated--; + if(managed) + cm->managed_allocated--; if (cm->online) m->total_allocated--; @@ -464,13 +489,14 @@ void irq_matrix_debug_show(struct seq_file *sf, struct irq_matrix *m, int ind) seq_printf(sf, "Total allocated: %6u\n", m->total_allocated); seq_printf(sf, "System: %u: %*pbl\n", nsys, m->matrix_bits, m->system_map); - seq_printf(sf, "%*s| CPU | avl | man | act | vectors\n", ind, " "); + seq_printf(sf, "%*s| CPU | avl | man | mac | act | vectors\n", ind, " "); cpus_read_lock(); for_each_online_cpu(cpu) { struct cpumap *cm = per_cpu_ptr(m->maps, cpu); - seq_printf(sf, "%*s %4d %4u %4u %4u %*pbl\n", ind, " ", - cpu, cm->available, cm->managed, cm->allocated, + seq_printf(sf, "%*s %4d %4u %4u %4u %4u %*pbl\n", ind, " ", + cpu, cm->available, cm->managed, + cm->managed_allocated, cm->allocated, m->matrix_bits, cm->alloc_map); } cpus_read_unlock(); -- cgit v1.3-7-g2ca7 From e84cd7ee630e44a2cc8ae49e85920a271b214cb3 Mon Sep 17 00:00:00 2001 From: Ke Wu Date: Tue, 6 Nov 2018 15:21:30 -0800 Subject: modsign: use all trusted keys to verify module signature Make mod_verify_sig to use all trusted keys. This allows keys in secondary_trusted_keys to be used to verify PKCS#7 signature on a kernel module. Signed-off-by: Ke Wu Signed-off-by: Jessica Yu --- kernel/module_signing.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/module_signing.c b/kernel/module_signing.c index f2075ce8e4b3..6b9a926fd86b 100644 --- a/kernel/module_signing.c +++ b/kernel/module_signing.c @@ -83,6 +83,7 @@ int mod_verify_sig(const void *mod, struct load_info *info) } return verify_pkcs7_signature(mod, modlen, mod + modlen, sig_len, - NULL, VERIFYING_MODULE_SIGNATURE, + VERIFY_USE_SECONDARY_KEYRING, + VERIFYING_MODULE_SIGNATURE, NULL, NULL); } -- cgit v1.3-7-g2ca7 From 4ec22e9c5a90e3809dd52014d5d239af8831a520 Mon Sep 17 00:00:00 2001 From: Waiman Long Date: Thu, 8 Nov 2018 10:08:35 -0500 Subject: cpuset: Enable cpuset controller in default hierarchy Given the fact that thread mode had been merged into 4.14, it is now time to enable cpuset to be used in the default hierarchy (cgroup v2) as it is clearly threaded. The cpuset controller had experienced feature creep since its introduction more than a decade ago. Besides the core cpus and mems control files to limit cpus and memory nodes, there are a bunch of additional features that can be controlled from the userspace. Some of the features are of doubtful usefulness and may not be actively used. This patch enables cpuset controller in the default hierarchy with a minimal set of features, namely just the cpus and mems and their effective_* counterparts. We can certainly add more features to the default hierarchy in the future if there is a real user need for them later on. Alternatively, with the unified hiearachy, it may make more sense to move some of those additional cpuset features, if desired, to memory controller or may be to the cpu controller instead of staying with cpuset. Signed-off-by: Waiman Long Acked-by: Peter Zijlstra (Intel) Signed-off-by: Tejun Heo --- Documentation/admin-guide/cgroup-v2.rst | 109 ++++++++++++++++++++++++++++++-- kernel/cgroup/cpuset.c | 48 +++++++++++++- 2 files changed, 149 insertions(+), 8 deletions(-) (limited to 'kernel') diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 476722b7b636..01b70f69304e 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -56,11 +56,13 @@ v1 is available under Documentation/cgroup-v1/. 5-3-3-2. IO Latency Interface Files 5-4. PID 5-4-1. PID Interface Files - 5-5. Device - 5-6. RDMA - 5-6-1. RDMA Interface Files - 5-7. Misc - 5-7-1. perf_event + 5-5. Cpuset + 5.5-1. Cpuset Interface Files + 5-6. Device + 5-7. RDMA + 5-7-1. RDMA Interface Files + 5-8. Misc + 5-8-1. perf_event 5-N. Non-normative information 5-N-1. CPU controller root cgroup process behaviour 5-N-2. IO controller root cgroup process behaviour @@ -1610,6 +1612,103 @@ through fork() or clone(). These will return -EAGAIN if the creation of a new process would cause a cgroup policy to be violated. +Cpuset +------ + +The "cpuset" controller provides a mechanism for constraining +the CPU and memory node placement of tasks to only the resources +specified in the cpuset interface files in a task's current cgroup. +This is especially valuable on large NUMA systems where placing jobs +on properly sized subsets of the systems with careful processor and +memory placement to reduce cross-node memory access and contention +can improve overall system performance. + +The "cpuset" controller is hierarchical. That means the controller +cannot use CPUs or memory nodes not allowed in its parent. + + +Cpuset Interface Files +~~~~~~~~~~~~~~~~~~~~~~ + + cpuset.cpus + A read-write multiple values file which exists on non-root + cpuset-enabled cgroups. + + It lists the requested CPUs to be used by tasks within this + cgroup. The actual list of CPUs to be granted, however, is + subjected to constraints imposed by its parent and can differ + from the requested CPUs. + + The CPU numbers are comma-separated numbers or ranges. + For example: + + # cat cpuset.cpus + 0-4,6,8-10 + + An empty value indicates that the cgroup is using the same + setting as the nearest cgroup ancestor with a non-empty + "cpuset.cpus" or all the available CPUs if none is found. + + The value of "cpuset.cpus" stays constant until the next update + and won't be affected by any CPU hotplug events. + + cpuset.cpus.effective + A read-only multiple values file which exists on non-root + cpuset-enabled cgroups. + + It lists the onlined CPUs that are actually granted to this + cgroup by its parent. These CPUs are allowed to be used by + tasks within the current cgroup. + + If "cpuset.cpus" is empty, the "cpuset.cpus.effective" file shows + all the CPUs from the parent cgroup that can be available to + be used by this cgroup. Otherwise, it should be a subset of + "cpuset.cpus" unless none of the CPUs listed in "cpuset.cpus" + can be granted. In this case, it will be treated just like an + empty "cpuset.cpus". + + Its value will be affected by CPU hotplug events. + + cpuset.mems + A read-write multiple values file which exists on non-root + cpuset-enabled cgroups. + + It lists the requested memory nodes to be used by tasks within + this cgroup. The actual list of memory nodes granted, however, + is subjected to constraints imposed by its parent and can differ + from the requested memory nodes. + + The memory node numbers are comma-separated numbers or ranges. + For example: + + # cat cpuset.mems + 0-1,3 + + An empty value indicates that the cgroup is using the same + setting as the nearest cgroup ancestor with a non-empty + "cpuset.mems" or all the available memory nodes if none + is found. + + The value of "cpuset.mems" stays constant until the next update + and won't be affected by any memory nodes hotplug events. + + cpuset.mems.effective + A read-only multiple values file which exists on non-root + cpuset-enabled cgroups. + + It lists the onlined memory nodes that are actually granted to + this cgroup by its parent. These memory nodes are allowed to + be used by tasks within the current cgroup. + + If "cpuset.mems" is empty, it shows all the memory nodes from the + parent cgroup that will be available to be used by this cgroup. + Otherwise, it should be a subset of "cpuset.mems" unless none of + the memory nodes listed in "cpuset.mems" can be granted. In this + case, it will be treated just like an empty "cpuset.mems". + + Its value will be affected by memory nodes hotplug events. + + Device controller ----------------- diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index 266f10cb7222..2b5c4477d969 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -1824,12 +1824,11 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft) return 0; } - /* * for the common functions, 'private' gives the type of file */ -static struct cftype files[] = { +static struct cftype legacy_files[] = { { .name = "cpus", .seq_show = cpuset_common_seq_show, @@ -1931,6 +1930,47 @@ static struct cftype files[] = { { } /* terminate */ }; +/* + * This is currently a minimal set for the default hierarchy. It can be + * expanded later on by migrating more features and control files from v1. + */ +static struct cftype dfl_files[] = { + { + .name = "cpus", + .seq_show = cpuset_common_seq_show, + .write = cpuset_write_resmask, + .max_write_len = (100U + 6 * NR_CPUS), + .private = FILE_CPULIST, + .flags = CFTYPE_NOT_ON_ROOT, + }, + + { + .name = "mems", + .seq_show = cpuset_common_seq_show, + .write = cpuset_write_resmask, + .max_write_len = (100U + 6 * MAX_NUMNODES), + .private = FILE_MEMLIST, + .flags = CFTYPE_NOT_ON_ROOT, + }, + + { + .name = "cpus.effective", + .seq_show = cpuset_common_seq_show, + .private = FILE_EFFECTIVE_CPULIST, + .flags = CFTYPE_NOT_ON_ROOT, + }, + + { + .name = "mems.effective", + .seq_show = cpuset_common_seq_show, + .private = FILE_EFFECTIVE_MEMLIST, + .flags = CFTYPE_NOT_ON_ROOT, + }, + + { } /* terminate */ +}; + + /* * cpuset_css_alloc - allocate a cpuset css * cgrp: control group that the new cpuset will be part of @@ -2105,8 +2145,10 @@ struct cgroup_subsys cpuset_cgrp_subsys = { .post_attach = cpuset_post_attach, .bind = cpuset_bind, .fork = cpuset_fork, - .legacy_cftypes = files, + .legacy_cftypes = legacy_files, + .dfl_cftypes = dfl_files, .early_init = true, + .threaded = true, }; /** -- cgit v1.3-7-g2ca7 From 58b7484250db8da0d49bdd289c877ccd3319575c Mon Sep 17 00:00:00 2001 From: Waiman Long Date: Thu, 8 Nov 2018 10:08:36 -0500 Subject: cpuset: Define data structures to support scheduling partition >From a cpuset point of view, a scheduling partition is a group of cpusets with their own set of exclusive CPUs that are not shared by other tasks outside the scheduling partition. In the legacy hierarchy, scheduling partitions are supported indirectly via the right use of the load balancing and the exclusive CPUs flag which is not intuitive and can be hard to use. To fully support the concept of scheduling partitions in the default hierarchy, we need to add some new field into the cpuset structure as well as a new tmpmasks structure that is used to pre-allocate cpumasks at the top level cpuset functions to avoid memory allocation in inner functions as memory allocation failure in those inner functions may cause a cpuset to have inconsistent states. Signed-off-by: Waiman Long Acked-by: Peter Zijlstra (Intel) Signed-off-by: Tejun Heo --- kernel/cgroup/cpuset.c | 37 +++++++++++++++++++++++++++++++++++++ 1 file changed, 37 insertions(+) (limited to 'kernel') diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index 2b5c4477d969..29a2bdc671fd 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -109,6 +109,13 @@ struct cpuset { cpumask_var_t effective_cpus; nodemask_t effective_mems; + /* + * CPUs allocated to child sub-partitions (default hierarchy only) + * - CPUs granted by the parent = effective_cpus U subparts_cpus + * - effective_cpus and subparts_cpus are mutually exclusive. + */ + cpumask_var_t subparts_cpus; + /* * This is old Memory Nodes tasks took on. * @@ -134,6 +141,30 @@ struct cpuset { /* for custom sched domain */ int relax_domain_level; + + /* number of CPUs in subparts_cpus */ + int nr_subparts_cpus; + + /* partition root state */ + int partition_root_state; +}; + +/* + * Partition root states: + * + * 0 - not a partition root + * 1 - partition root + */ +#define PRS_DISABLED 0 +#define PRS_ENABLED 1 + +/* + * Temporary cpumasks for working with partitions that are passed among + * functions to avoid memory allocation in inner functions. + */ +struct tmpmasks { + cpumask_var_t addmask, delmask; /* For partition root */ + cpumask_var_t new_cpus; /* For update_cpumasks_hier() */ }; static inline struct cpuset *css_cs(struct cgroup_subsys_state *css) @@ -218,9 +249,15 @@ static inline int is_spread_slab(const struct cpuset *cs) return test_bit(CS_SPREAD_SLAB, &cs->flags); } +static inline int is_partition_root(const struct cpuset *cs) +{ + return cs->partition_root_state; +} + static struct cpuset top_cpuset = { .flags = ((1 << CS_ONLINE) | (1 << CS_CPU_EXCLUSIVE) | (1 << CS_MEM_EXCLUSIVE)), + .partition_root_state = PRS_ENABLED, }; /** -- cgit v1.3-7-g2ca7 From bf92370c035d5ed73a90927450c20a07adf08cfd Mon Sep 17 00:00:00 2001 From: Waiman Long Date: Thu, 8 Nov 2018 10:08:37 -0500 Subject: cpuset: Simply allocation and freeing of cpumasks The previous commit introduces a new subparts_cpus mask into the cpuset data structure and a new tmpmasks structure. Managing the allocation and freeing of those cpumasks is becoming more complex. So a number of helper functions are added to simplify and streamline the management of those cpumasks. To make it simple, all the cpumasks are now pre-cleared on allocation. Signed-off-by: Waiman Long Acked-by: Peter Zijlstra (Intel) Signed-off-by: Tejun Heo --- kernel/cgroup/cpuset.c | 110 ++++++++++++++++++++++++++++++++++--------------- 1 file changed, 77 insertions(+), 33 deletions(-) (limited to 'kernel') diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index 29a2bdc671fd..2ac9437ce8f1 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -455,6 +455,65 @@ static int is_cpuset_subset(const struct cpuset *p, const struct cpuset *q) is_mem_exclusive(p) <= is_mem_exclusive(q); } +/** + * alloc_cpumasks - allocate three cpumasks for cpuset + * @cs: the cpuset that have cpumasks to be allocated. + * @tmp: the tmpmasks structure pointer + * Return: 0 if successful, -ENOMEM otherwise. + * + * Only one of the two input arguments should be non-NULL. + */ +static inline int alloc_cpumasks(struct cpuset *cs, struct tmpmasks *tmp) +{ + cpumask_var_t *pmask1, *pmask2, *pmask3; + + if (cs) { + pmask1 = &cs->cpus_allowed; + pmask2 = &cs->effective_cpus; + pmask3 = &cs->subparts_cpus; + } else { + pmask1 = &tmp->new_cpus; + pmask2 = &tmp->addmask; + pmask3 = &tmp->delmask; + } + + if (!zalloc_cpumask_var(pmask1, GFP_KERNEL)) + return -ENOMEM; + + if (!zalloc_cpumask_var(pmask2, GFP_KERNEL)) + goto free_one; + + if (!zalloc_cpumask_var(pmask3, GFP_KERNEL)) + goto free_two; + + return 0; + +free_two: + free_cpumask_var(*pmask2); +free_one: + free_cpumask_var(*pmask1); + return -ENOMEM; +} + +/** + * free_cpumasks - free cpumasks in a tmpmasks structure + * @cs: the cpuset that have cpumasks to be free. + * @tmp: the tmpmasks structure pointer + */ +static inline void free_cpumasks(struct cpuset *cs, struct tmpmasks *tmp) +{ + if (cs) { + free_cpumask_var(cs->cpus_allowed); + free_cpumask_var(cs->effective_cpus); + free_cpumask_var(cs->subparts_cpus); + } + if (tmp) { + free_cpumask_var(tmp->new_cpus); + free_cpumask_var(tmp->addmask); + free_cpumask_var(tmp->delmask); + } +} + /** * alloc_trial_cpuset - allocate a trial cpuset * @cs: the cpuset that the trial cpuset duplicates @@ -467,31 +526,24 @@ static struct cpuset *alloc_trial_cpuset(struct cpuset *cs) if (!trial) return NULL; - if (!alloc_cpumask_var(&trial->cpus_allowed, GFP_KERNEL)) - goto free_cs; - if (!alloc_cpumask_var(&trial->effective_cpus, GFP_KERNEL)) - goto free_cpus; + if (alloc_cpumasks(trial, NULL)) { + kfree(trial); + return NULL; + } cpumask_copy(trial->cpus_allowed, cs->cpus_allowed); cpumask_copy(trial->effective_cpus, cs->effective_cpus); return trial; - -free_cpus: - free_cpumask_var(trial->cpus_allowed); -free_cs: - kfree(trial); - return NULL; } /** - * free_trial_cpuset - free the trial cpuset - * @trial: the trial cpuset to be freed + * free_cpuset - free the cpuset + * @cs: the cpuset to be freed */ -static void free_trial_cpuset(struct cpuset *trial) +static inline void free_cpuset(struct cpuset *cs) { - free_cpumask_var(trial->effective_cpus); - free_cpumask_var(trial->cpus_allowed); - kfree(trial); + free_cpumasks(cs, NULL); + kfree(cs); } /* @@ -1385,7 +1437,7 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs, if (spread_flag_changed) update_tasks_flags(cs); out: - free_trial_cpuset(trialcs); + free_cpuset(trialcs); return err; } @@ -1769,7 +1821,7 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of, break; } - free_trial_cpuset(trialcs); + free_cpuset(trialcs); out_unlock: mutex_unlock(&cpuset_mutex); kernfs_unbreak_active_protection(of->kn); @@ -2024,26 +2076,19 @@ cpuset_css_alloc(struct cgroup_subsys_state *parent_css) cs = kzalloc(sizeof(*cs), GFP_KERNEL); if (!cs) return ERR_PTR(-ENOMEM); - if (!alloc_cpumask_var(&cs->cpus_allowed, GFP_KERNEL)) - goto free_cs; - if (!alloc_cpumask_var(&cs->effective_cpus, GFP_KERNEL)) - goto free_cpus; + + if (alloc_cpumasks(cs, NULL)) { + kfree(cs); + return ERR_PTR(-ENOMEM); + } set_bit(CS_SCHED_LOAD_BALANCE, &cs->flags); - cpumask_clear(cs->cpus_allowed); nodes_clear(cs->mems_allowed); - cpumask_clear(cs->effective_cpus); nodes_clear(cs->effective_mems); fmeter_init(&cs->fmeter); cs->relax_domain_level = -1; return &cs->css; - -free_cpus: - free_cpumask_var(cs->cpus_allowed); -free_cs: - kfree(cs); - return ERR_PTR(-ENOMEM); } static int cpuset_css_online(struct cgroup_subsys_state *css) @@ -2134,9 +2179,7 @@ static void cpuset_css_free(struct cgroup_subsys_state *css) { struct cpuset *cs = css_cs(css); - free_cpumask_var(cs->effective_cpus); - free_cpumask_var(cs->cpus_allowed); - kfree(cs); + free_cpuset(cs); } static void cpuset_bind(struct cgroup_subsys_state *root_css) @@ -2200,6 +2243,7 @@ int __init cpuset_init(void) BUG_ON(!alloc_cpumask_var(&top_cpuset.cpus_allowed, GFP_KERNEL)); BUG_ON(!alloc_cpumask_var(&top_cpuset.effective_cpus, GFP_KERNEL)); + BUG_ON(!zalloc_cpumask_var(&top_cpuset.subparts_cpus, GFP_KERNEL)); cpumask_setall(top_cpuset.cpus_allowed); nodes_setall(top_cpuset.mems_allowed); -- cgit v1.3-7-g2ca7 From ee8dde0cd2ce78b62d16aec1c29960b64380e634 Mon Sep 17 00:00:00 2001 From: Waiman Long Date: Thu, 8 Nov 2018 10:08:38 -0500 Subject: cpuset: Add new v2 cpuset.sched.partition flag A new cpuset.sched.partition boolean flag is added to cpuset v2. This new flag, if set, indicates that the cgroup is the root of a new scheduling domain or partition that includes itself and all its descendants except those that are scheduling domain roots themselves and their descendants. With this new flag, one can directly create as many partitions as necessary without ever using the v1 trick of turning off load balancing in specific cpusets to create partitions as a side effect. This new flag is owned by the parent and will cause the CPUs in the cpuset to be removed from the effective CPUs of its parent. This is implemented internally by adding a new subparts_cpus mask that holds the CPUs belonging to child partitions so that: subparts_cpus | effective_cpus = cpus_allowed subparts_cpus & effective_cpus = 0 This new flag can only be turned on in a cpuset if its parent is a partition root itself. The state of this flag cannot be changed if the cpuset has children. Once turned on, further changes to "cpuset.cpus" is allowed as long as there is at least one CPU left that can be granted from the parent and a child partition root cannot use up all the CPUs in the parent's effective_cpus. Signed-off-by: Waiman Long Acked-by: Peter Zijlstra (Intel) Signed-off-by: Tejun Heo --- kernel/cgroup/cpuset.c | 365 +++++++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 352 insertions(+), 13 deletions(-) (limited to 'kernel') diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index 2ac9437ce8f1..85e04162adcb 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -970,10 +970,191 @@ static void update_tasks_cpumask(struct cpuset *cs) css_task_iter_end(&it); } +/** + * compute_effective_cpumask - Compute the effective cpumask of the cpuset + * @new_cpus: the temp variable for the new effective_cpus mask + * @cs: the cpuset the need to recompute the new effective_cpus mask + * @parent: the parent cpuset + * + * If the parent has subpartition CPUs, include them in the list of + * allowable CPUs in computing the new effective_cpus mask. + */ +static void compute_effective_cpumask(struct cpumask *new_cpus, + struct cpuset *cs, struct cpuset *parent) +{ + if (parent->nr_subparts_cpus) { + cpumask_or(new_cpus, parent->effective_cpus, + parent->subparts_cpus); + cpumask_and(new_cpus, new_cpus, cs->cpus_allowed); + } else { + cpumask_and(new_cpus, cs->cpus_allowed, parent->effective_cpus); + } +} + +/* + * Commands for update_parent_subparts_cpumask + */ +enum subparts_cmd { + partcmd_enable, /* Enable partition root */ + partcmd_disable, /* Disable partition root */ + partcmd_update, /* Update parent's subparts_cpus */ +}; + +/** + * update_parent_subparts_cpumask - update subparts_cpus mask of parent cpuset + * @cpuset: The cpuset that requests change in partition root state + * @cmd: Partition root state change command + * @newmask: Optional new cpumask for partcmd_update + * @tmp: Temporary addmask and delmask + * Return: 0, 1 or an error code + * + * For partcmd_enable, the cpuset is being transformed from a non-partition + * root to a partition root. The cpus_allowed mask of the given cpuset will + * be put into parent's subparts_cpus and taken away from parent's + * effective_cpus. The function will return 0 if all the CPUs listed in + * cpus_allowed can be granted or an error code will be returned. + * + * For partcmd_disable, the cpuset is being transofrmed from a partition + * root back to a non-partition root. any CPUs in cpus_allowed that are in + * parent's subparts_cpus will be taken away from that cpumask and put back + * into parent's effective_cpus. 0 should always be returned. + * + * For partcmd_update, if the optional newmask is specified, the cpu + * list is to be changed from cpus_allowed to newmask. Otherwise, + * cpus_allowed is assumed to remain the same. The function will return + * 1 if changes to parent's subparts_cpus and effective_cpus happen or 0 + * otherwise. In case of error, an error code will be returned. + * + * The partcmd_enable and partcmd_disable commands are used by + * update_prstate(). The partcmd_update command is used by + * update_cpumasks_hier() with newmask NULL and update_cpumask() with + * newmask set. + * + * The checking is more strict when enabling partition root than the + * other two commands. + * + * Because of the implicit cpu exclusive nature of a partition root, + * cpumask changes that violates the cpu exclusivity rule will not be + * permitted when checked by validate_change(). The validate_change() + * function will also prevent any changes to the cpu list if it is not + * a superset of children's cpu lists. + */ +static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd, + struct cpumask *newmask, + struct tmpmasks *tmp) +{ + struct cpuset *parent = parent_cs(cpuset); + int adding; /* Moving cpus from effective_cpus to subparts_cpus */ + int deleting; /* Moving cpus from subparts_cpus to effective_cpus */ + + lockdep_assert_held(&cpuset_mutex); + + /* + * The parent must be a partition root. + * The new cpumask, if present, or the current cpus_allowed must + * not be empty. + */ + if (!is_partition_root(parent) || + (newmask && cpumask_empty(newmask)) || + (!newmask && cpumask_empty(cpuset->cpus_allowed))) + return -EINVAL; + + /* + * Enabling/disabling partition root is not allowed if there are + * online children. + */ + if ((cmd != partcmd_update) && css_has_online_children(&cpuset->css)) + return -EBUSY; + + /* + * Enabling partition root is not allowed if not all the CPUs + * can be granted from parent's effective_cpus or at least one + * CPU will be left after that. + */ + if ((cmd == partcmd_enable) && + (!cpumask_subset(cpuset->cpus_allowed, parent->effective_cpus) || + cpumask_equal(cpuset->cpus_allowed, parent->effective_cpus))) + return -EINVAL; + + /* + * A cpumask update cannot make parent's effective_cpus become empty. + */ + adding = deleting = false; + if (cmd == partcmd_enable) { + cpumask_copy(tmp->addmask, cpuset->cpus_allowed); + adding = true; + } else if (cmd == partcmd_disable) { + deleting = cpumask_and(tmp->delmask, cpuset->cpus_allowed, + parent->subparts_cpus); + } else if (newmask) { + /* + * partcmd_update with newmask: + * + * delmask = cpus_allowed & ~newmask & parent->subparts_cpus + * addmask = newmask & parent->effective_cpus + * & ~parent->subparts_cpus + */ + cpumask_andnot(tmp->delmask, cpuset->cpus_allowed, newmask); + deleting = cpumask_and(tmp->delmask, tmp->delmask, + parent->subparts_cpus); + + cpumask_and(tmp->addmask, newmask, parent->effective_cpus); + adding = cpumask_andnot(tmp->addmask, tmp->addmask, + parent->subparts_cpus); + /* + * Return error if the new effective_cpus could become empty. + */ + if (adding && !deleting && + cpumask_equal(parent->effective_cpus, tmp->addmask)) + return -EINVAL; + } else { + /* + * partcmd_update w/o newmask: + * + * addmask = cpus_allowed & parent->effectiveb_cpus + * + * Note that parent's subparts_cpus may have been + * pre-shrunk in case the CPUs granted to the parent + * by the grandparent changes. So no deletion is needed. + */ + adding = cpumask_and(tmp->addmask, cpuset->cpus_allowed, + parent->effective_cpus); + if (cpumask_equal(tmp->addmask, parent->effective_cpus)) + return -EINVAL; + } + + if (!adding && !deleting) + return 0; + + /* + * Change the parent's subparts_cpus. + * Newly added CPUs will be removed from effective_cpus and + * newly deleted ones will be added back to effective_cpus. + */ + spin_lock_irq(&callback_lock); + if (adding) { + cpumask_or(parent->subparts_cpus, + parent->subparts_cpus, tmp->addmask); + cpumask_andnot(parent->effective_cpus, + parent->effective_cpus, tmp->addmask); + } + if (deleting) { + cpumask_andnot(parent->subparts_cpus, + parent->subparts_cpus, tmp->delmask); + cpumask_or(parent->effective_cpus, + parent->effective_cpus, tmp->delmask); + } + + parent->nr_subparts_cpus = cpumask_weight(parent->subparts_cpus); + spin_unlock_irq(&callback_lock); + + return cmd == partcmd_update; +} + /* * update_cpumasks_hier - Update effective cpumasks and tasks in the subtree - * @cs: the cpuset to consider - * @new_cpus: temp variable for calculating new effective_cpus + * @cs: the cpuset to consider + * @tmp: temp variables for calculating effective_cpus & partition setup * * When congifured cpumask is changed, the effective cpumasks of this cpuset * and all its descendants need to be updated. @@ -982,7 +1163,7 @@ static void update_tasks_cpumask(struct cpuset *cs) * * Called with cpuset_mutex held */ -static void update_cpumasks_hier(struct cpuset *cs, struct cpumask *new_cpus) +static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp) { struct cpuset *cp; struct cgroup_subsys_state *pos_css; @@ -991,28 +1172,61 @@ static void update_cpumasks_hier(struct cpuset *cs, struct cpumask *new_cpus) rcu_read_lock(); cpuset_for_each_descendant_pre(cp, pos_css, cs) { struct cpuset *parent = parent_cs(cp); + bool cs_empty; + + compute_effective_cpumask(tmp->new_cpus, cp, parent); + cs_empty = cpumask_empty(tmp->new_cpus); - cpumask_and(new_cpus, cp->cpus_allowed, parent->effective_cpus); + /* + * A partition root cannot have empty effective_cpus + */ + WARN_ON_ONCE(cs_empty && is_partition_root(cp)); /* * If it becomes empty, inherit the effective mask of the * parent, which is guaranteed to have some CPUs. */ - if (is_in_v2_mode() && cpumask_empty(new_cpus)) - cpumask_copy(new_cpus, parent->effective_cpus); + if (is_in_v2_mode() && cs_empty) + cpumask_copy(tmp->new_cpus, parent->effective_cpus); - /* Skip the whole subtree if the cpumask remains the same. */ - if (cpumask_equal(new_cpus, cp->effective_cpus)) { + /* + * Skip the whole subtree if the cpumask remains the same + * and has no partition root state. + */ + if (!is_partition_root(cp) && + cpumask_equal(tmp->new_cpus, cp->effective_cpus)) { pos_css = css_rightmost_descendant(pos_css); continue; } + /* + * update_parent_subparts_cpumask() should have been called + * for cs already in update_cpumask(). We should also call + * update_tasks_cpumask() again for tasks in the parent + * cpuset if the parent's subparts_cpus changes. + */ + if ((cp != cs) && cp->partition_root_state && + update_parent_subparts_cpumask(cp, partcmd_update, + NULL, tmp)) { + if (parent != &top_cpuset) + update_tasks_cpumask(parent); + } + if (!css_tryget_online(&cp->css)) continue; rcu_read_unlock(); spin_lock_irq(&callback_lock); - cpumask_copy(cp->effective_cpus, new_cpus); + + cpumask_copy(cp->effective_cpus, tmp->new_cpus); + if (cp->nr_subparts_cpus) { + /* + * Make sure that effective_cpus & subparts_cpus + * are mutually exclusive. + */ + cpumask_andnot(cp->effective_cpus, cp->effective_cpus, + cp->subparts_cpus); + } spin_unlock_irq(&callback_lock); WARN_ON(!is_in_v2_mode() && @@ -1047,6 +1261,7 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs, const char *buf) { int retval; + struct tmpmasks tmp; /* top_cpuset.cpus_allowed tracks cpu_online_mask; it's read-only */ if (cs == &top_cpuset) @@ -1078,12 +1293,39 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs, if (retval < 0) return retval; +#ifdef CONFIG_CPUMASK_OFFSTACK + /* + * Use the cpumasks in trialcs for tmpmasks when they are pointers + * to allocated cpumasks. + */ + tmp.addmask = trialcs->subparts_cpus; + tmp.delmask = trialcs->effective_cpus; + tmp.new_cpus = trialcs->cpus_allowed; +#endif + + if (cs->partition_root_state) { + /* Cpumask of a partition root cannot be empty */ + if (cpumask_empty(trialcs->cpus_allowed)) + return -EINVAL; + if (update_parent_subparts_cpumask(cs, partcmd_update, + trialcs->cpus_allowed, &tmp) < 0) + return -EINVAL; + } + spin_lock_irq(&callback_lock); cpumask_copy(cs->cpus_allowed, trialcs->cpus_allowed); + + /* + * Make sure that subparts_cpus is a subset of cpus_allowed. + */ + if (cs->nr_subparts_cpus) { + cpumask_andnot(cs->subparts_cpus, cs->subparts_cpus, + cs->cpus_allowed); + cs->nr_subparts_cpus = cpumask_weight(cs->subparts_cpus); + } spin_unlock_irq(&callback_lock); - /* use trialcs->cpus_allowed as a temp variable */ - update_cpumasks_hier(cs, trialcs->cpus_allowed); + update_cpumasks_hier(cs, &tmp); return 0; } @@ -1441,6 +1683,80 @@ out: return err; } +/* + * update_prstate - update partititon_root_state + * cs: the cpuset to update + * val: 0 - disabled, 1 - enabled + * + * Call with cpuset_mutex held. + */ +static int update_prstate(struct cpuset *cs, int val) +{ + int err; + struct cpuset *parent = parent_cs(cs); + struct tmpmasks tmp; + + if ((val != 0) && (val != 1)) + return -EINVAL; + if (val == cs->partition_root_state) + return 0; + + /* + * Cannot force a partial or erroneous partition root to a full + * partition root. + */ + if (val && cs->partition_root_state) + return -EINVAL; + + if (alloc_cpumasks(NULL, &tmp)) + return -ENOMEM; + + err = -EINVAL; + if (!cs->partition_root_state) { + /* + * Turning on partition root requires setting the + * CS_CPU_EXCLUSIVE bit implicitly as well and cpus_allowed + * cannot be NULL. + */ + if (cpumask_empty(cs->cpus_allowed)) + goto out; + + err = update_flag(CS_CPU_EXCLUSIVE, cs, 1); + if (err) + goto out; + + err = update_parent_subparts_cpumask(cs, partcmd_enable, + NULL, &tmp); + if (err) { + update_flag(CS_CPU_EXCLUSIVE, cs, 0); + goto out; + } + cs->partition_root_state = PRS_ENABLED; + } else { + err = update_parent_subparts_cpumask(cs, partcmd_disable, + NULL, &tmp); + if (err) + goto out; + + cs->partition_root_state = 0; + + /* Turning off CS_CPU_EXCLUSIVE will not return error */ + update_flag(CS_CPU_EXCLUSIVE, cs, 0); + } + + /* + * Update cpumask of parent's tasks except when it is the top + * cpuset as some system daemons cannot be mapped to other CPUs. + */ + if (parent != &top_cpuset) + update_tasks_cpumask(parent); + + rebuild_sched_domains_locked(); +out: + free_cpumasks(NULL, &tmp); + return err; +} + /* * Frequency meter - How fast is some event occurring? * @@ -1686,6 +2002,7 @@ typedef enum { FILE_MEM_EXCLUSIVE, FILE_MEM_HARDWALL, FILE_SCHED_LOAD_BALANCE, + FILE_PARTITION_ROOT, FILE_SCHED_RELAX_DOMAIN_LEVEL, FILE_MEMORY_PRESSURE_ENABLED, FILE_MEMORY_PRESSURE, @@ -1755,6 +2072,9 @@ static int cpuset_write_s64(struct cgroup_subsys_state *css, struct cftype *cft, case FILE_SCHED_RELAX_DOMAIN_LEVEL: retval = update_relax_domain_level(cs, val); break; + case FILE_PARTITION_ROOT: + retval = update_prstate(cs, val); + break; default: retval = -EINVAL; break; @@ -1905,6 +2225,8 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft) switch (type) { case FILE_SCHED_RELAX_DOMAIN_LEVEL: return cs->relax_domain_level; + case FILE_PARTITION_ROOT: + return cs->partition_root_state; default: BUG(); } @@ -2056,6 +2378,14 @@ static struct cftype dfl_files[] = { .flags = CFTYPE_NOT_ON_ROOT, }, + { + .name = "sched.partition", + .read_s64 = cpuset_read_s64, + .write_s64 = cpuset_write_s64, + .private = FILE_PARTITION_ROOT, + .flags = CFTYPE_NOT_ON_ROOT, + }, + { } /* terminate */ }; @@ -2157,7 +2487,12 @@ out_unlock: /* * If the cpuset being removed has its flag 'sched_load_balance' * enabled, then simulate turning sched_load_balance off, which - * will call rebuild_sched_domains_locked(). + * will call rebuild_sched_domains_locked(). That is not needed + * in the default hierarchy where only changes in partition + * will cause repartitioning. + * + * If the cpuset has the 'sched.partition' flag enabled, simulate + * turning 'sched.partition" off. */ static void cpuset_css_offline(struct cgroup_subsys_state *css) @@ -2166,7 +2501,11 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css) mutex_lock(&cpuset_mutex); - if (is_sched_load_balance(cs)) + if (is_partition_root(cs)) + update_prstate(cs, 0); + + if (!cgroup_subsys_on_dfl(cpuset_cgrp_subsys) && + is_sched_load_balance(cs)) update_flag(CS_SCHED_LOAD_BALANCE, cs, 0); cpuset_dec(); -- cgit v1.3-7-g2ca7 From 3881b86128d0be22e8947ac1fca0429c74bf055e Mon Sep 17 00:00:00 2001 From: Waiman Long Date: Thu, 8 Nov 2018 10:08:39 -0500 Subject: cpuset: Add an error state to cpuset.sched.partition When external events like CPU offlining or user events like changing the cpu list of an ancestor cpuset happen, update_cpumasks_hier() will be called to update the effective cpus of each of the affected cpusets. That will then call update_parent_subparts_cpumask() if partitions are impacted. Currently, these events may cause update_parent_subparts_cpumask() to return error if none of the requested cpus are available or it will consume all the cpus in the parent partition root. Handling these errors is problematic as the states may become inconsistent. Instead of letting update_parent_subparts_cpumask() return error, a new error state (-1) is added to the partition_root_state flag to designate the fact that the partition is no longer valid. IOW, it is no longer a real partition root, but the CS_CPU_EXCLUSIVE flag will still be set as it can be changed back to a real one if favorable change happens later on. This new error state is set internally and user cannot write this new value to "cpuset.sched.partition". Signed-off-by: Waiman Long Acked-by: Peter Zijlstra (Intel) Signed-off-by: Tejun Heo --- kernel/cgroup/cpuset.c | 153 +++++++++++++++++++++++++++++++++++++++++-------- 1 file changed, 129 insertions(+), 24 deletions(-) (limited to 'kernel') diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index 85e04162adcb..ef41f58d7cdf 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -153,10 +153,19 @@ struct cpuset { * Partition root states: * * 0 - not a partition root + * * 1 - partition root + * + * -1 - invalid partition root + * None of the cpus in cpus_allowed can be put into the parent's + * subparts_cpus. In this case, the cpuset is not a real partition + * root anymore. However, the CPU_EXCLUSIVE bit will still be set + * and the cpuset can be restored back to a partition root if the + * parent cpuset can give more CPUs back to this child cpuset. */ #define PRS_DISABLED 0 #define PRS_ENABLED 1 +#define PRS_ERROR -1 /* * Temporary cpumasks for working with partitions that are passed among @@ -251,7 +260,7 @@ static inline int is_spread_slab(const struct cpuset *cs) static inline int is_partition_root(const struct cpuset *cs) { - return cs->partition_root_state; + return cs->partition_root_state > 0; } static struct cpuset top_cpuset = { @@ -1021,9 +1030,12 @@ enum subparts_cmd { * * For partcmd_update, if the optional newmask is specified, the cpu * list is to be changed from cpus_allowed to newmask. Otherwise, - * cpus_allowed is assumed to remain the same. The function will return - * 1 if changes to parent's subparts_cpus and effective_cpus happen or 0 - * otherwise. In case of error, an error code will be returned. + * cpus_allowed is assumed to remain the same. The cpuset should either + * be a partition root or an invalid partition root. The partition root + * state may change if newmask is NULL and none of the requested CPUs can + * be granted by the parent. The function will return 1 if changes to + * parent's subparts_cpus and effective_cpus happen or 0 otherwise. + * Error code should only be returned when newmask is non-NULL. * * The partcmd_enable and partcmd_disable commands are used by * update_prstate(). The partcmd_update command is used by @@ -1046,6 +1058,7 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd, struct cpuset *parent = parent_cs(cpuset); int adding; /* Moving cpus from effective_cpus to subparts_cpus */ int deleting; /* Moving cpus from subparts_cpus to effective_cpus */ + bool part_error = false; /* Partition error? */ lockdep_assert_held(&cpuset_mutex); @@ -1114,13 +1127,48 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd, * addmask = cpus_allowed & parent->effectiveb_cpus * * Note that parent's subparts_cpus may have been - * pre-shrunk in case the CPUs granted to the parent - * by the grandparent changes. So no deletion is needed. + * pre-shrunk in case there is a change in the cpu list. + * So no deletion is needed. */ adding = cpumask_and(tmp->addmask, cpuset->cpus_allowed, parent->effective_cpus); - if (cpumask_equal(tmp->addmask, parent->effective_cpus)) - return -EINVAL; + part_error = cpumask_equal(tmp->addmask, + parent->effective_cpus); + } + + if (cmd == partcmd_update) { + int prev_prs = cpuset->partition_root_state; + + /* + * Check for possible transition between PRS_ENABLED + * and PRS_ERROR. + */ + switch (cpuset->partition_root_state) { + case PRS_ENABLED: + if (part_error) + cpuset->partition_root_state = PRS_ERROR; + break; + case PRS_ERROR: + if (!part_error) + cpuset->partition_root_state = PRS_ENABLED; + break; + } + /* + * Set part_error if previously in invalid state. + */ + part_error = (prev_prs == PRS_ERROR); + } + + if (!part_error && (cpuset->partition_root_state == PRS_ERROR)) + return 0; /* Nothing need to be done */ + + if (cpuset->partition_root_state == PRS_ERROR) { + /* + * Remove all its cpus from parent's subparts_cpus. + */ + adding = false; + deleting = cpumask_and(tmp->delmask, cpuset->cpus_allowed, + parent->subparts_cpus); } if (!adding && !deleting) @@ -1172,28 +1220,21 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp) rcu_read_lock(); cpuset_for_each_descendant_pre(cp, pos_css, cs) { struct cpuset *parent = parent_cs(cp); - bool cs_empty; compute_effective_cpumask(tmp->new_cpus, cp, parent); - cs_empty = cpumask_empty(tmp->new_cpus); - - /* - * A partition root cannot have empty effective_cpus - */ - WARN_ON_ONCE(cs_empty && is_partition_root(cp)); /* * If it becomes empty, inherit the effective mask of the * parent, which is guaranteed to have some CPUs. */ - if (is_in_v2_mode() && cs_empty) + if (is_in_v2_mode() && cpumask_empty(tmp->new_cpus)) cpumask_copy(tmp->new_cpus, parent->effective_cpus); /* * Skip the whole subtree if the cpumask remains the same * and has no partition root state. */ - if (!is_partition_root(cp) && + if (!cp->partition_root_state && cpumask_equal(tmp->new_cpus, cp->effective_cpus)) { pos_css = css_rightmost_descendant(pos_css); continue; @@ -1205,11 +1246,44 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp) * update_tasks_cpumask() again for tasks in the parent * cpuset if the parent's subparts_cpus changes. */ - if ((cp != cs) && cp->partition_root_state && - update_parent_subparts_cpumask(cp, partcmd_update, - NULL, tmp)) { - if (parent != &top_cpuset) - update_tasks_cpumask(parent); + if ((cp != cs) && cp->partition_root_state) { + switch (parent->partition_root_state) { + case PRS_DISABLED: + /* + * If parent is not a partition root or an + * invalid partition root, clear the state + * state and the CS_CPU_EXCLUSIVE flag. + */ + WARN_ON_ONCE(cp->partition_root_state + != PRS_ERROR); + cp->partition_root_state = 0; + + /* + * clear_bit() is an atomic operation and + * readers aren't interested in the state + * of CS_CPU_EXCLUSIVE anyway. So we can + * just update the flag without holding + * the callback_lock. + */ + clear_bit(CS_CPU_EXCLUSIVE, &cp->flags); + break; + + case PRS_ENABLED: + if (update_parent_subparts_cpumask(cp, partcmd_update, NULL, tmp)) + update_tasks_cpumask(parent); + break; + + case PRS_ERROR: + /* + * When parent is invalid, it has to be too. + */ + cp->partition_root_state = PRS_ERROR; + if (cp->nr_subparts_cpus) { + cp->nr_subparts_cpus = 0; + cpumask_clear(cp->subparts_cpus); + } + break; + } } if (!css_tryget_online(&cp->css)) @@ -1219,13 +1293,33 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp) spin_lock_irq(&callback_lock); cpumask_copy(cp->effective_cpus, tmp->new_cpus); - if (cp->nr_subparts_cpus) { + if (cp->nr_subparts_cpus && + (cp->partition_root_state != PRS_ENABLED)) { + cp->nr_subparts_cpus = 0; + cpumask_clear(cp->subparts_cpus); + } else if (cp->nr_subparts_cpus) { /* * Make sure that effective_cpus & subparts_cpus * are mutually exclusive. + * + * In the unlikely event that effective_cpus + * becomes empty. we clear cp->nr_subparts_cpus and + * let its child partition roots to compete for + * CPUs again. */ cpumask_andnot(cp->effective_cpus, cp->effective_cpus, cp->subparts_cpus); + if (cpumask_empty(cp->effective_cpus)) { + cpumask_copy(cp->effective_cpus, tmp->new_cpus); + cpumask_clear(cp->subparts_cpus); + cp->nr_subparts_cpus = 0; + } else if (!cpumask_subset(cp->subparts_cpus, + tmp->new_cpus)) { + cpumask_andnot(cp->subparts_cpus, + cp->subparts_cpus, tmp->new_cpus); + cp->nr_subparts_cpus + = cpumask_weight(cp->subparts_cpus); + } } spin_unlock_irq(&callback_lock); @@ -1702,7 +1796,7 @@ static int update_prstate(struct cpuset *cs, int val) return 0; /* - * Cannot force a partial or erroneous partition root to a full + * Cannot force a partial or invalid partition root to a full * partition root. */ if (val && cs->partition_root_state) @@ -1733,6 +1827,17 @@ static int update_prstate(struct cpuset *cs, int val) } cs->partition_root_state = PRS_ENABLED; } else { + /* + * Turning off partition root will clear the + * CS_CPU_EXCLUSIVE bit. + */ + if (cs->partition_root_state == PRS_ERROR) { + cs->partition_root_state = 0; + update_flag(CS_CPU_EXCLUSIVE, cs, 0); + err = 0; + goto out; + } + err = update_parent_subparts_cpumask(cs, partcmd_disable, NULL, &tmp); if (err) -- cgit v1.3-7-g2ca7 From 4716909cc5c566e946a3acc884bf5dc469812007 Mon Sep 17 00:00:00 2001 From: Waiman Long Date: Thu, 8 Nov 2018 10:08:40 -0500 Subject: cpuset: Track cpusets that use parent's effective_cpus In the default hierarchy, a cpuset will use the parent's effective_cpus if none of the requested CPUs can be granted from the parent. That can be a problem if a parent is a partition root with children partition roots. Changes to a parent's effective_cpus list due to changes in a child partition root may not be properly reflected in a child cpuset that use parent's effective_cpus because the cpu_exclusive rule of a partition root will not guard against that. In order to avoid the mismatch, two new tracking variables are added to the cpuset structure to track if a cpuset uses parent's effective_cpus and the number of children cpusets that use its effective_cpus. So whenever cpumask changes are made to a parent, it will also check to see if it has other children cpusets that use its effective_cpus and call update_cpumasks_hier() if that is the case. Signed-off-by: Waiman Long Acked-by: Peter Zijlstra (Intel) Signed-off-by: Tejun Heo --- kernel/cgroup/cpuset.c | 71 +++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 70 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index ef41f58d7cdf..21eaa896c1a9 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -147,6 +147,14 @@ struct cpuset { /* partition root state */ int partition_root_state; + + /* + * Default hierarchy only: + * use_parent_ecpus - set if using parent's effective_cpus + * child_ecpus_count - # of children with use_parent_ecpus set + */ + int use_parent_ecpus; + int child_ecpus_count; }; /* @@ -1227,8 +1235,17 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp) * If it becomes empty, inherit the effective mask of the * parent, which is guaranteed to have some CPUs. */ - if (is_in_v2_mode() && cpumask_empty(tmp->new_cpus)) + if (is_in_v2_mode() && cpumask_empty(tmp->new_cpus)) { cpumask_copy(tmp->new_cpus, parent->effective_cpus); + if (!cp->use_parent_ecpus) { + cp->use_parent_ecpus = true; + parent->child_ecpus_count++; + } + } else if (cp->use_parent_ecpus) { + cp->use_parent_ecpus = false; + WARN_ON_ONCE(!parent->child_ecpus_count); + parent->child_ecpus_count--; + } /* * Skip the whole subtree if the cpumask remains the same @@ -1345,6 +1362,35 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp) rebuild_sched_domains_locked(); } +/** + * update_sibling_cpumasks - Update siblings cpumasks + * @parent: Parent cpuset + * @cs: Current cpuset + * @tmp: Temp variables + */ +static void update_sibling_cpumasks(struct cpuset *parent, struct cpuset *cs, + struct tmpmasks *tmp) +{ + struct cpuset *sibling; + struct cgroup_subsys_state *pos_css; + + /* + * Check all its siblings and call update_cpumasks_hier() + * if their use_parent_ecpus flag is set in order for them + * to use the right effective_cpus value. + */ + rcu_read_lock(); + cpuset_for_each_child(sibling, pos_css, parent) { + if (sibling == cs) + continue; + if (!sibling->use_parent_ecpus) + continue; + + update_cpumasks_hier(sibling, tmp); + } + rcu_read_unlock(); +} + /** * update_cpumask - update the cpus_allowed mask of a cpuset and all tasks in it * @cs: the cpuset to consider @@ -1420,6 +1466,17 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs, spin_unlock_irq(&callback_lock); update_cpumasks_hier(cs, &tmp); + + if (cs->partition_root_state) { + struct cpuset *parent = parent_cs(cs); + + /* + * For partition root, update the cpumasks of sibling + * cpusets if they use parent's effective_cpus. + */ + if (parent->child_ecpus_count) + update_sibling_cpumasks(parent, cs, &tmp); + } return 0; } @@ -1856,6 +1913,9 @@ static int update_prstate(struct cpuset *cs, int val) if (parent != &top_cpuset) update_tasks_cpumask(parent); + if (parent->child_ecpus_count) + update_sibling_cpumasks(parent, cs, &tmp); + rebuild_sched_domains_locked(); out: free_cpumasks(NULL, &tmp); @@ -2550,6 +2610,8 @@ static int cpuset_css_online(struct cgroup_subsys_state *css) if (is_in_v2_mode()) { cpumask_copy(cs->effective_cpus, parent->effective_cpus); cs->effective_mems = parent->effective_mems; + cs->use_parent_ecpus = true; + parent->child_ecpus_count++; } spin_unlock_irq(&callback_lock); @@ -2613,6 +2675,13 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css) is_sched_load_balance(cs)) update_flag(CS_SCHED_LOAD_BALANCE, cs, 0); + if (cs->use_parent_ecpus) { + struct cpuset *parent = parent_cs(cs); + + cs->use_parent_ecpus = false; + parent->child_ecpus_count--; + } + cpuset_dec(); clear_bit(CS_ONLINE, &cs->flags); -- cgit v1.3-7-g2ca7 From 4b842da276a8a1057aed7af6b2a5da471f840dd0 Mon Sep 17 00:00:00 2001 From: Waiman Long Date: Thu, 8 Nov 2018 10:08:41 -0500 Subject: cpuset: Make CPU hotplug work with partition When there is a cpu hotplug event (CPU online or offline), the partitions may need to be reconfigured and regenerated. So code is added to the hotplug functions to make them work with new subparts_cpus mask to compute the right effective_cpus for each of the affected cpusets. It may also change the state of a partition root from real one to an erroneous one or vice versa. Signed-off-by: Waiman Long Acked-by: Peter Zijlstra (Intel) Signed-off-by: Tejun Heo --- kernel/cgroup/cpuset.c | 131 +++++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 116 insertions(+), 15 deletions(-) (limited to 'kernel') diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index 21eaa896c1a9..82c88c60c83d 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -113,6 +113,9 @@ struct cpuset { * CPUs allocated to child sub-partitions (default hierarchy only) * - CPUs granted by the parent = effective_cpus U subparts_cpus * - effective_cpus and subparts_cpus are mutually exclusive. + * + * effective_cpus contains only onlined CPUs, but subparts_cpus + * may have offlined ones. */ cpumask_var_t subparts_cpus; @@ -994,7 +997,9 @@ static void update_tasks_cpumask(struct cpuset *cs) * @parent: the parent cpuset * * If the parent has subpartition CPUs, include them in the list of - * allowable CPUs in computing the new effective_cpus mask. + * allowable CPUs in computing the new effective_cpus mask. Since offlined + * CPUs are not removed from subparts_cpus, we have to use cpu_active_mask + * to mask those out. */ static void compute_effective_cpumask(struct cpumask *new_cpus, struct cpuset *cs, struct cpuset *parent) @@ -1003,6 +1008,7 @@ static void compute_effective_cpumask(struct cpumask *new_cpus, cpumask_or(new_cpus, parent->effective_cpus, parent->subparts_cpus); cpumask_and(new_cpus, new_cpus, cs->cpus_allowed); + cpumask_and(new_cpus, new_cpus, cpu_active_mask); } else { cpumask_and(new_cpus, cs->cpus_allowed, parent->effective_cpus); } @@ -1125,9 +1131,20 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd, /* * Return error if the new effective_cpus could become empty. */ - if (adding && !deleting && - cpumask_equal(parent->effective_cpus, tmp->addmask)) - return -EINVAL; + if (adding && + cpumask_equal(parent->effective_cpus, tmp->addmask)) { + if (!deleting) + return -EINVAL; + /* + * As some of the CPUs in subparts_cpus might have + * been offlined, we need to compute the real delmask + * to confirm that. + */ + if (!cpumask_and(tmp->addmask, tmp->delmask, + cpu_active_mask)) + return -EINVAL; + cpumask_copy(tmp->addmask, parent->effective_cpus); + } } else { /* * partcmd_update w/o newmask: @@ -1197,6 +1214,10 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd, if (deleting) { cpumask_andnot(parent->subparts_cpus, parent->subparts_cpus, tmp->delmask); + /* + * Some of the CPUs in subparts_cpus might have been offlined. + */ + cpumask_and(tmp->delmask, tmp->delmask, cpu_active_mask); cpumask_or(parent->effective_cpus, parent->effective_cpus, tmp->delmask); } @@ -2863,20 +2884,29 @@ hotplug_update_tasks(struct cpuset *cs, update_tasks_nodemask(cs); } +static bool force_rebuild; + +void cpuset_force_rebuild(void) +{ + force_rebuild = true; +} + /** * cpuset_hotplug_update_tasks - update tasks in a cpuset for hotunplug * @cs: cpuset in interest + * @tmp: the tmpmasks structure pointer * * Compare @cs's cpu and mem masks against top_cpuset and if some have gone * offline, update @cs accordingly. If @cs ends up with no CPU or memory, * all its tasks are moved to the nearest ancestor with both resources. */ -static void cpuset_hotplug_update_tasks(struct cpuset *cs) +static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp) { static cpumask_t new_cpus; static nodemask_t new_mems; bool cpus_updated; bool mems_updated; + struct cpuset *parent; retry: wait_event(cpuset_attach_wq, cs->attach_in_progress == 0); @@ -2891,9 +2921,60 @@ retry: goto retry; } - cpumask_and(&new_cpus, cs->cpus_allowed, parent_cs(cs)->effective_cpus); - nodes_and(new_mems, cs->mems_allowed, parent_cs(cs)->effective_mems); + parent = parent_cs(cs); + compute_effective_cpumask(&new_cpus, cs, parent); + nodes_and(new_mems, cs->mems_allowed, parent->effective_mems); + + if (cs->nr_subparts_cpus) + /* + * Make sure that CPUs allocated to child partitions + * do not show up in effective_cpus. + */ + cpumask_andnot(&new_cpus, &new_cpus, cs->subparts_cpus); + + if (!tmp || !cs->partition_root_state) + goto update_tasks; + + /* + * In the unlikely event that a partition root has empty + * effective_cpus or its parent becomes erroneous, we have to + * transition it to the erroneous state. + */ + if (is_partition_root(cs) && (cpumask_empty(&new_cpus) || + (parent->partition_root_state == PRS_ERROR))) { + if (cs->nr_subparts_cpus) { + cs->nr_subparts_cpus = 0; + cpumask_clear(cs->subparts_cpus); + compute_effective_cpumask(&new_cpus, cs, parent); + } + /* + * If the effective_cpus is empty because the child + * partitions take away all the CPUs, we can keep + * the current partition and let the child partitions + * fight for available CPUs. + */ + if ((parent->partition_root_state == PRS_ERROR) || + cpumask_empty(&new_cpus)) { + update_parent_subparts_cpumask(cs, partcmd_disable, + NULL, tmp); + cs->partition_root_state = PRS_ERROR; + } + cpuset_force_rebuild(); + } + + /* + * On the other hand, an erroneous partition root may be transitioned + * back to a regular one or a partition root with no CPU allocated + * from the parent may change to erroneous. + */ + if (is_partition_root(parent) && + ((cs->partition_root_state == PRS_ERROR) || + !cpumask_intersects(&new_cpus, parent->subparts_cpus)) && + update_parent_subparts_cpumask(cs, partcmd_update, NULL, tmp)) + cpuset_force_rebuild(); + +update_tasks: cpus_updated = !cpumask_equal(&new_cpus, cs->effective_cpus); mems_updated = !nodes_equal(new_mems, cs->effective_mems); @@ -2907,13 +2988,6 @@ retry: mutex_unlock(&cpuset_mutex); } -static bool force_rebuild; - -void cpuset_force_rebuild(void) -{ - force_rebuild = true; -} - /** * cpuset_hotplug_workfn - handle CPU/memory hotunplug for a cpuset * @@ -2936,6 +3010,10 @@ static void cpuset_hotplug_workfn(struct work_struct *work) static nodemask_t new_mems; bool cpus_updated, mems_updated; bool on_dfl = is_in_v2_mode(); + struct tmpmasks tmp, *ptmp = NULL; + + if (on_dfl && !alloc_cpumasks(NULL, &tmp)) + ptmp = &tmp; mutex_lock(&cpuset_mutex); @@ -2943,6 +3021,11 @@ static void cpuset_hotplug_workfn(struct work_struct *work) cpumask_copy(&new_cpus, cpu_active_mask); new_mems = node_states[N_MEMORY]; + /* + * If subparts_cpus is populated, it is likely that the check below + * will produce a false positive on cpus_updated when the cpu list + * isn't changed. It is extra work, but it is better to be safe. + */ cpus_updated = !cpumask_equal(top_cpuset.effective_cpus, &new_cpus); mems_updated = !nodes_equal(top_cpuset.effective_mems, new_mems); @@ -2951,6 +3034,22 @@ static void cpuset_hotplug_workfn(struct work_struct *work) spin_lock_irq(&callback_lock); if (!on_dfl) cpumask_copy(top_cpuset.cpus_allowed, &new_cpus); + /* + * Make sure that CPUs allocated to child partitions + * do not show up in effective_cpus. If no CPU is left, + * we clear the subparts_cpus & let the child partitions + * fight for the CPUs again. + */ + if (top_cpuset.nr_subparts_cpus) { + if (cpumask_subset(&new_cpus, + top_cpuset.subparts_cpus)) { + top_cpuset.nr_subparts_cpus = 0; + cpumask_clear(top_cpuset.subparts_cpus); + } else { + cpumask_andnot(&new_cpus, &new_cpus, + top_cpuset.subparts_cpus); + } + } cpumask_copy(top_cpuset.effective_cpus, &new_cpus); spin_unlock_irq(&callback_lock); /* we don't mess with cpumasks of tasks in top_cpuset */ @@ -2979,7 +3078,7 @@ static void cpuset_hotplug_workfn(struct work_struct *work) continue; rcu_read_unlock(); - cpuset_hotplug_update_tasks(cs); + cpuset_hotplug_update_tasks(cs, ptmp); rcu_read_lock(); css_put(&cs->css); @@ -2992,6 +3091,8 @@ static void cpuset_hotplug_workfn(struct work_struct *work) force_rebuild = false; rebuild_sched_domains(); } + + free_cpumasks(NULL, ptmp); } void cpuset_update_active_cpus(void) -- cgit v1.3-7-g2ca7 From 0ccea8feb9807ba87b0405a826f6830a386706f5 Mon Sep 17 00:00:00 2001 From: Waiman Long Date: Thu, 8 Nov 2018 10:08:42 -0500 Subject: cpuset: Make generate_sched_domains() work with partition The generate_sched_domains() function is modified to make it work correctly with the newly introduced subparts_cpus mask for scheduling domains generation. Signed-off-by: Waiman Long Acked-by: Peter Zijlstra (Intel) Signed-off-by: Tejun Heo --- kernel/cgroup/cpuset.c | 34 +++++++++++++++++++++++++++------- 1 file changed, 27 insertions(+), 7 deletions(-) (limited to 'kernel') diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index 82c88c60c83d..3960de7a75cc 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -769,13 +769,14 @@ static int generate_sched_domains(cpumask_var_t **domains, int ndoms = 0; /* number of sched domains in result */ int nslot; /* next empty doms[] struct cpumask slot */ struct cgroup_subsys_state *pos_css; + bool root_load_balance = is_sched_load_balance(&top_cpuset); doms = NULL; dattr = NULL; csa = NULL; /* Special case for the 99% of systems with one, full, sched domain */ - if (is_sched_load_balance(&top_cpuset)) { + if (root_load_balance && !top_cpuset.nr_subparts_cpus) { ndoms = 1; doms = alloc_sched_domains(ndoms); if (!doms) @@ -798,6 +799,8 @@ static int generate_sched_domains(cpumask_var_t **domains, csn = 0; rcu_read_lock(); + if (root_load_balance) + csa[csn++] = &top_cpuset; cpuset_for_each_descendant_pre(cp, pos_css, &top_cpuset) { if (cp == &top_cpuset) continue; @@ -808,6 +811,9 @@ static int generate_sched_domains(cpumask_var_t **domains, * parent's cpus, so just skip them, and then we call * update_domain_attr_tree() to calc relax_domain_level of * the corresponding sched domain. + * + * If root is load-balancing, we can skip @cp if it + * is a subset of the root's effective_cpus. */ if (!cpumask_empty(cp->cpus_allowed) && !(is_sched_load_balance(cp) && @@ -815,11 +821,16 @@ static int generate_sched_domains(cpumask_var_t **domains, housekeeping_cpumask(HK_FLAG_DOMAIN)))) continue; + if (root_load_balance && + cpumask_subset(cp->cpus_allowed, top_cpuset.effective_cpus)) + continue; + if (is_sched_load_balance(cp)) csa[csn++] = cp; - /* skip @cp's subtree */ - pos_css = css_rightmost_descendant(pos_css); + /* skip @cp's subtree if not a partition root */ + if (!is_partition_root(cp)) + pos_css = css_rightmost_descendant(pos_css); } rcu_read_unlock(); @@ -947,7 +958,12 @@ static void rebuild_sched_domains_locked(void) * passing doms with offlined cpu to partition_sched_domains(). * Anyways, hotplug work item will rebuild sched domains. */ - if (!cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask)) + if (!top_cpuset.nr_subparts_cpus && + !cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask)) + goto out; + + if (top_cpuset.nr_subparts_cpus && + !cpumask_subset(top_cpuset.effective_cpus, cpu_active_mask)) goto out; /* Generate domain masks and attrs */ @@ -1367,11 +1383,15 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp) update_tasks_cpumask(cp); /* - * If the effective cpumask of any non-empty cpuset is changed, - * we need to rebuild sched domains. + * On legacy hierarchy, if the effective cpumask of any non- + * empty cpuset is changed, we need to rebuild sched domains. + * On default hierarchy, the cpuset needs to be a partition + * root as well. */ if (!cpumask_empty(cp->cpus_allowed) && - is_sched_load_balance(cp)) + is_sched_load_balance(cp) && + (!cgroup_subsys_on_dfl(cpuset_cgrp_subsys) || + is_partition_root(cp))) need_rebuild_sched_domains = true; rcu_read_lock(); -- cgit v1.3-7-g2ca7 From 5776ceccd4de2a53dec740422a409e9e588c5a70 Mon Sep 17 00:00:00 2001 From: Waiman Long Date: Thu, 8 Nov 2018 10:08:43 -0500 Subject: cpuset: Expose cpus.effective and mems.effective on cgroup v2 root Because of the fact that setting the "cpuset.sched.partition" in a direct child of root can remove CPUs from the root's effective CPU list, it makes sense to know what CPUs are left in the root cgroup for scheduling purpose. So the "cpuset.cpus.effective" control file is now exposed in the v2 cgroup root. For consistency, the "cpuset.mems.effective" control file is exposed as well. Signed-off-by: Waiman Long Acked-by: Peter Zijlstra (Intel) Signed-off-by: Tejun Heo --- Documentation/admin-guide/cgroup-v2.rst | 4 ++-- kernel/cgroup/cpuset.c | 2 -- 2 files changed, 2 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 01b70f69304e..595b0757ad2b 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1653,7 +1653,7 @@ Cpuset Interface Files and won't be affected by any CPU hotplug events. cpuset.cpus.effective - A read-only multiple values file which exists on non-root + A read-only multiple values file which exists on all cpuset-enabled cgroups. It lists the onlined CPUs that are actually granted to this @@ -1693,7 +1693,7 @@ Cpuset Interface Files and won't be affected by any memory nodes hotplug events. cpuset.mems.effective - A read-only multiple values file which exists on non-root + A read-only multiple values file which exists on all cpuset-enabled cgroups. It lists the onlined memory nodes that are actually granted to diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index 3960de7a75cc..fc1a809cd5bb 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -2574,14 +2574,12 @@ static struct cftype dfl_files[] = { .name = "cpus.effective", .seq_show = cpuset_common_seq_show, .private = FILE_EFFECTIVE_CPULIST, - .flags = CFTYPE_NOT_ON_ROOT, }, { .name = "mems.effective", .seq_show = cpuset_common_seq_show, .private = FILE_EFFECTIVE_MEMLIST, - .flags = CFTYPE_NOT_ON_ROOT, }, { -- cgit v1.3-7-g2ca7 From bb5b553c33cb3393f604483b103158bf7d01ca1c Mon Sep 17 00:00:00 2001 From: Waiman Long Date: Thu, 8 Nov 2018 10:08:44 -0500 Subject: cpuset: Use descriptive text when reading/writing cpuset.sched.partition Currently, cpuset.sched.partition returns the values, 0, 1 or -1 on read. A person who is not familiar with the partition code may not understand what they mean. In order to make cpuset.sched.partition more user-friendly, it will now display the following descriptive text on read: "root" - A partition root (top cpuset of a partition) "member" - A non-root member of a partition "root invalid" - An invalid partition root Note that there is at least one partition in the whole cgroup hierarchy. The top cpuset is the root of that partition. The rests are either a root if it starts a new partition or a member of a partition. The cpuset.sched.partition file will now also accept "root" and "member" besides 1 and 0 as valid input values. The "root invalid" value is internal only and cannot be written to the file. Suggested-by: Tejun Heo Signed-off-by: Waiman Long Acked-by: Peter Zijlstra (Intel) Signed-off-by: Tejun Heo --- kernel/cgroup/cpuset.c | 58 ++++++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 51 insertions(+), 7 deletions(-) (limited to 'kernel') diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index fc1a809cd5bb..c739fda805e0 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -2278,9 +2278,6 @@ static int cpuset_write_s64(struct cgroup_subsys_state *css, struct cftype *cft, case FILE_SCHED_RELAX_DOMAIN_LEVEL: retval = update_relax_domain_level(cs, val); break; - case FILE_PARTITION_ROOT: - retval = update_prstate(cs, val); - break; default: retval = -EINVAL; break; @@ -2431,8 +2428,6 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft) switch (type) { case FILE_SCHED_RELAX_DOMAIN_LEVEL: return cs->relax_domain_level; - case FILE_PARTITION_ROOT: - return cs->partition_root_state; default: BUG(); } @@ -2441,6 +2436,55 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft) return 0; } +static int sched_partition_show(struct seq_file *seq, void *v) +{ + struct cpuset *cs = css_cs(seq_css(seq)); + + switch (cs->partition_root_state) { + case PRS_ENABLED: + seq_puts(seq, "root\n"); + break; + case PRS_DISABLED: + seq_puts(seq, "member\n"); + break; + case PRS_ERROR: + seq_puts(seq, "root invalid\n"); + break; + } + return 0; +} + +static ssize_t sched_partition_write(struct kernfs_open_file *of, char *buf, + size_t nbytes, loff_t off) +{ + struct cpuset *cs = css_cs(of_css(of)); + int val; + int retval = -ENODEV; + + buf = strstrip(buf); + + /* + * Convert "root"/"1" to 1, and convert "member"/"0" to 0. + */ + if (!strcmp(buf, "root") || !strcmp(buf, "1")) + val = PRS_ENABLED; + else if (!strcmp(buf, "member") || !strcmp(buf, "0")) + val = PRS_DISABLED; + else + return -EINVAL; + + css_get(&cs->css); + mutex_lock(&cpuset_mutex); + if (!is_cpuset_online(cs)) + goto out_unlock; + + retval = update_prstate(cs, val); +out_unlock: + mutex_unlock(&cpuset_mutex); + css_put(&cs->css); + return retval ?: nbytes; +} + /* * for the common functions, 'private' gives the type of file */ @@ -2584,8 +2628,8 @@ static struct cftype dfl_files[] = { { .name = "sched.partition", - .read_s64 = cpuset_read_s64, - .write_s64 = cpuset_write_s64, + .seq_show = sched_partition_show, + .write = sched_partition_write, .private = FILE_PARTITION_ROOT, .flags = CFTYPE_NOT_ON_ROOT, }, -- cgit v1.3-7-g2ca7 From 5cf8114d6e90b3822be5eb6a2faedf99d1c08f77 Mon Sep 17 00:00:00 2001 From: Waiman Long Date: Thu, 8 Nov 2018 10:08:46 -0500 Subject: cpuset: Expose cpuset.cpus.subpartitions with cgroup_debug For debugging purpose, it will be useful to expose the content of the subparts_cpus as a read-only file to see if the code work correctly. However, subparts_cpus will not be used at all in most use cases. So adding a new cpuset file that clutters the cgroup directory may not be desirable. This is now being done by using the hidden "cgroup_debug" kernel command line option to expose a new "cpuset.cpus.subpartitions" file. That option was originally used by the debug controller to expose itself when configured into the kernel. This is now extended to set an internal flag used by cgroup_addrm_files(). A new CFTYPE_DEBUG flag can now be used to specify that a cgroup file should only be created when the "cgroup_debug" option is specified. Signed-off-by: Waiman Long Acked-by: Peter Zijlstra (Intel) Signed-off-by: Tejun Heo --- include/linux/cgroup-defs.h | 1 + kernel/cgroup/cgroup-internal.h | 2 ++ kernel/cgroup/cgroup.c | 14 +++++++++++++- kernel/cgroup/cpuset.c | 11 +++++++++++ kernel/cgroup/debug.c | 4 +--- 5 files changed, 28 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index 5e1694fe035b..8fcbae1b8db0 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -92,6 +92,7 @@ enum { CFTYPE_NO_PREFIX = (1 << 3), /* (DON'T USE FOR NEW FILES) no subsys prefix */ CFTYPE_WORLD_WRITABLE = (1 << 4), /* (DON'T USE FOR NEW FILES) S_IWUGO */ + CFTYPE_DEBUG = (1 << 5), /* create when cgroup_debug */ /* internal flags, do not use outside cgroup core proper */ __CFTYPE_ONLY_ON_DFL = (1 << 16), /* only on default hierarchy */ diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h index 75568fcf2180..c950864016e2 100644 --- a/kernel/cgroup/cgroup-internal.h +++ b/kernel/cgroup/cgroup-internal.h @@ -11,6 +11,8 @@ #define TRACE_CGROUP_PATH_LEN 1024 extern spinlock_t trace_cgroup_path_lock; extern char trace_cgroup_path[TRACE_CGROUP_PATH_LEN]; +extern bool cgroup_debug; +extern void __init enable_debug_cgroup(void); /* * cgroup_path() takes a spin lock. It is good practice not to take diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 2e5d90dfcb49..ed7f0bfe6429 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -86,6 +86,7 @@ EXPORT_SYMBOL_GPL(css_set_lock); DEFINE_SPINLOCK(trace_cgroup_path_lock); char trace_cgroup_path[TRACE_CGROUP_PATH_LEN]; +bool cgroup_debug __read_mostly; /* * Protects cgroup_idr and css_idr so that IDs can be released without @@ -3639,7 +3640,8 @@ restart: continue; if ((cft->flags & CFTYPE_ONLY_ON_ROOT) && cgroup_parent(cgrp)) continue; - + if ((cft->flags & CFTYPE_DEBUG) && !cgroup_debug) + continue; if (is_add) { ret = cgroup_add_file(css, cgrp, cft); if (ret) { @@ -5743,6 +5745,16 @@ static int __init cgroup_disable(char *str) } __setup("cgroup_disable=", cgroup_disable); +void __init __weak enable_debug_cgroup(void) { } + +static int __init enable_cgroup_debug(char *str) +{ + cgroup_debug = true; + enable_debug_cgroup(); + return 1; +} +__setup("cgroup_debug", enable_cgroup_debug); + /** * css_tryget_online_from_dir - get corresponding css from a cgroup dentry * @dentry: directory dentry of interest diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index c739fda805e0..b897314bab53 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -2204,6 +2204,7 @@ typedef enum { FILE_MEMLIST, FILE_EFFECTIVE_CPULIST, FILE_EFFECTIVE_MEMLIST, + FILE_SUBPARTS_CPULIST, FILE_CPU_EXCLUSIVE, FILE_MEM_EXCLUSIVE, FILE_MEM_HARDWALL, @@ -2382,6 +2383,9 @@ static int cpuset_common_seq_show(struct seq_file *sf, void *v) case FILE_EFFECTIVE_MEMLIST: seq_printf(sf, "%*pbl\n", nodemask_pr_args(&cs->effective_mems)); break; + case FILE_SUBPARTS_CPULIST: + seq_printf(sf, "%*pbl\n", cpumask_pr_args(cs->subparts_cpus)); + break; default: ret = -EINVAL; } @@ -2634,6 +2638,13 @@ static struct cftype dfl_files[] = { .flags = CFTYPE_NOT_ON_ROOT, }, + { + .name = "cpus.subpartitions", + .seq_show = cpuset_common_seq_show, + .private = FILE_SUBPARTS_CPULIST, + .flags = CFTYPE_DEBUG, + }, + { } /* terminate */ }; diff --git a/kernel/cgroup/debug.c b/kernel/cgroup/debug.c index 9caeda610249..5f1b87330bee 100644 --- a/kernel/cgroup/debug.c +++ b/kernel/cgroup/debug.c @@ -373,11 +373,9 @@ struct cgroup_subsys debug_cgrp_subsys = { * On v2, debug is an implicit controller enabled by "cgroup_debug" boot * parameter. */ -static int __init enable_cgroup_debug(char *str) +void __init enable_debug_cgroup(void) { debug_cgrp_subsys.dfl_cftypes = debug_files; debug_cgrp_subsys.implicit_on_dfl = true; debug_cgrp_subsys.threaded = true; - return 1; } -__setup("cgroup_debug", enable_cgroup_debug); -- cgit v1.3-7-g2ca7 From 042d4c70a203998697b34eaad1a99f6f09d09e4d Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 22 Oct 2018 07:43:22 -0700 Subject: rcu: Eliminate BUG_ON() for sync.c The sync.c file has a number of calls to BUG_ON(), which panics the kernel, which is not a good strategy for devices (like embedded) that don't have a way to capture console output. This commit therefore changes these BUG_ON() calls to WARN_ON_ONCE(), but does so quite naively. Reported-by: Linus Torvalds Signed-off-by: Paul E. McKenney Acked-by: Oleg Nesterov Cc: Peter Zijlstra --- kernel/rcu/sync.c | 13 ++++++------- 1 file changed, 6 insertions(+), 7 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/sync.c b/kernel/rcu/sync.c index 3f943efcf61c..a6ba446a9693 100644 --- a/kernel/rcu/sync.c +++ b/kernel/rcu/sync.c @@ -125,8 +125,7 @@ void rcu_sync_enter(struct rcu_sync *rsp) rsp->gp_state = GP_PENDING; spin_unlock_irq(&rsp->rss_lock); - BUG_ON(need_wait && need_sync); - + WARN_ON_ONCE(need_wait && need_sync); if (need_sync) { gp_ops[rsp->gp_type].sync(); rsp->gp_state = GP_PASSED; @@ -139,7 +138,7 @@ void rcu_sync_enter(struct rcu_sync *rsp) * Nobody has yet been allowed the 'fast' path and thus we can * avoid doing any sync(). The callback will get 'dropped'. */ - BUG_ON(rsp->gp_state != GP_PASSED); + WARN_ON_ONCE(rsp->gp_state != GP_PASSED); } } @@ -166,8 +165,8 @@ static void rcu_sync_func(struct rcu_head *rhp) struct rcu_sync *rsp = container_of(rhp, struct rcu_sync, cb_head); unsigned long flags; - BUG_ON(rsp->gp_state != GP_PASSED); - BUG_ON(rsp->cb_state == CB_IDLE); + WARN_ON_ONCE(rsp->gp_state != GP_PASSED); + WARN_ON_ONCE(rsp->cb_state == CB_IDLE); spin_lock_irqsave(&rsp->rss_lock, flags); if (rsp->gp_count) { @@ -225,7 +224,7 @@ void rcu_sync_dtor(struct rcu_sync *rsp) { int cb_state; - BUG_ON(rsp->gp_count); + WARN_ON_ONCE(rsp->gp_count); spin_lock_irq(&rsp->rss_lock); if (rsp->cb_state == CB_REPLAY) @@ -235,6 +234,6 @@ void rcu_sync_dtor(struct rcu_sync *rsp) if (cb_state != CB_IDLE) { gp_ops[rsp->gp_type].wait(); - BUG_ON(rsp->cb_state != CB_IDLE); + WARN_ON_ONCE(rsp->cb_state != CB_IDLE); } } -- cgit v1.3-7-g2ca7 From 08543bda42ef06c5ee4cd74501c894aa7cc13ea8 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 22 Oct 2018 08:04:03 -0700 Subject: rcu: Eliminate BUG_ON() for kernel/rcu/tree.c The tree.c file has a number of calls to BUG_ON(), which panics the kernel, which is not a good strategy for devices (like embedded) that don't have a way to capture console output. This commit therefore converts these BUG_ON() calls to WARN_ON_ONCE() and WARN_ONCE(). Reported-by: Linus Torvalds Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 121f833acd04..bdb6659a8dbc 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -2826,7 +2826,7 @@ __call_rcu(struct rcu_head *head, rcu_callback_t func, int cpu, bool lazy) * Very early boot, before rcu_init(). Initialize if needed * and then drop through to queue the callback. */ - BUG_ON(cpu != -1); + WARN_ON_ONCE(cpu != -1); WARN_ON_ONCE(!rcu_is_watching()); if (rcu_segcblist_empty(&rdp->cblist)) rcu_segcblist_init(&rdp->cblist); @@ -3485,7 +3485,8 @@ static int __init rcu_spawn_gp_kthread(void) rcu_scheduler_fully_active = 1; t = kthread_create(rcu_gp_kthread, NULL, "%s", rcu_state.name); - BUG_ON(IS_ERR(t)); + if (WARN_ONCE(IS_ERR(t), "%s: Could not start grace-period kthread, OOM is now expected behavior\n", __func__)) + return 0; rnp = rcu_get_root(); raw_spin_lock_irqsave_rcu_node(rnp, flags); rcu_state.gp_kthread = t; -- cgit v1.3-7-g2ca7 From 75a8f72245223cce1571b0405b0881ca3c046df7 Mon Sep 17 00:00:00 2001 From: "Joel Fernandes (Google)" Date: Sat, 22 Sep 2018 19:41:25 -0400 Subject: rcu: Remove unused rcu_state externs The rcu_bh_state and rcu_sched_state variables were removed during the RCU flavor consolidations, but external declarations remain in tree.h. This commit therefore removes these obsolete declarations. Signed-off-by: Joel Fernandes (Google) Cc: Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.h | 11 ----------- 1 file changed, 11 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index 703e19ff532d..57a937ac51c2 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -398,17 +398,6 @@ static const char *tp_rcu_varname __used __tracepoint_string = rcu_name; #define RCU_NAME rcu_name #endif /* #else #ifdef CONFIG_TRACING */ -/* - * RCU implementation internal declarations: - */ -extern struct rcu_state rcu_sched_state; - -extern struct rcu_state rcu_bh_state; - -#ifdef CONFIG_PREEMPT_RCU -extern struct rcu_state rcu_preempt_state; -#endif /* #ifdef CONFIG_PREEMPT_RCU */ - int rcu_dynticks_snap(struct rcu_data *rdp); #ifdef CONFIG_RCU_BOOST -- cgit v1.3-7-g2ca7 From adbccddb4a16b1dbf047d330ae1e78fd1ec80352 Mon Sep 17 00:00:00 2001 From: "Joel Fernandes (Google)" Date: Sat, 22 Sep 2018 19:41:26 -0400 Subject: rcu: Fix rcu_{node,data} comments about gp_seq_needed Recent changes have removed the old ->gp_seq_needed field from the rcu_state structure, which in turn obsoleted a couple of comments in the rcu_node and rcu_data structures. This commit therefore updates these comments accordingly. Signed-off-by: Joel Fernandes (Google) Cc: Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index 57a937ac51c2..c3e2807a834a 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -57,7 +57,7 @@ struct rcu_node { /* some rcu_state fields as well as */ /* following. */ unsigned long gp_seq; /* Track rsp->rcu_gp_seq. */ - unsigned long gp_seq_needed; /* Track rsp->rcu_gp_seq_needed. */ + unsigned long gp_seq_needed; /* Track furthest future GP request. */ unsigned long completedqs; /* All QSes done for this node. */ unsigned long qsmask; /* CPUs or groups that need to switch in */ /* order for current grace period to proceed.*/ @@ -163,7 +163,7 @@ union rcu_noqs { struct rcu_data { /* 1) quiescent-state and grace-period handling : */ unsigned long gp_seq; /* Track rsp->rcu_gp_seq counter. */ - unsigned long gp_seq_needed; /* Track rsp->rcu_gp_seq_needed ctr. */ + unsigned long gp_seq_needed; /* Track furthest future GP request. */ union rcu_noqs cpu_no_qs; /* No QSes yet for this CPU. */ bool core_needs_qs; /* Core waits for quiesc state. */ bool beenonline; /* CPU online at least once. */ -- cgit v1.3-7-g2ca7 From 309ba859b95085f61f4f2a154df6be9cb9713a12 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 11 Jul 2018 14:36:49 -0700 Subject: rcu: Eliminate synchronize_rcu_mult() Now that synchronize_rcu() waits for both RCU read-side critical sections and preempt-disabled regions of code, the sole caller of synchronize_rcu_mult() can be replaced by synchronize_rcu(). This patch makes this change and removes synchronize_rcu_mult(). Note that _wait_rcu_gp() still supports synchronize_rcu_mult(), and thus might be simplified in the future to take only take a single call_rcu() function rather than the current list of them. Signed-off-by: Paul E. McKenney --- include/linux/rcupdate_wait.h | 17 ----------------- kernel/rcu/update.c | 6 ++---- kernel/sched/core.c | 2 +- 3 files changed, 3 insertions(+), 22 deletions(-) (limited to 'kernel') diff --git a/include/linux/rcupdate_wait.h b/include/linux/rcupdate_wait.h index 8a16c3eb3dd0..c0578ba23c1a 100644 --- a/include/linux/rcupdate_wait.h +++ b/include/linux/rcupdate_wait.h @@ -31,21 +31,4 @@ do { \ #define wait_rcu_gp(...) _wait_rcu_gp(false, __VA_ARGS__) -/** - * synchronize_rcu_mult - Wait concurrently for multiple grace periods - * @...: List of call_rcu() functions for different grace periods to wait on - * - * This macro waits concurrently for multiple types of RCU grace periods. - * For example, synchronize_rcu_mult(call_rcu, call_rcu_tasks) would wait - * on concurrent RCU and RCU-tasks grace periods. Waiting on a give SRCU - * domain requires you to write a wrapper function for that SRCU domain's - * call_srcu() function, supplying the corresponding srcu_struct. - * - * If Tiny RCU, tell _wait_rcu_gp() does not bother waiting for RCU, - * given that anywhere synchronize_rcu_mult() can be called is automatically - * a grace period. - */ -#define synchronize_rcu_mult(...) \ - _wait_rcu_gp(IS_ENABLED(CONFIG_TINY_RCU), __VA_ARGS__) - #endif /* _LINUX_SCHED_RCUPDATE_WAIT_H */ diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c index f203b94f6b5b..c729ca5e6ee2 100644 --- a/kernel/rcu/update.c +++ b/kernel/rcu/update.c @@ -335,8 +335,7 @@ void __wait_rcu_gp(bool checktiny, int n, call_rcu_func_t *crcu_array, /* Initialize and register callbacks for each crcu_array element. */ for (i = 0; i < n; i++) { if (checktiny && - (crcu_array[i] == call_rcu || - crcu_array[i] == call_rcu_bh)) { + (crcu_array[i] == call_rcu)) { might_sleep(); continue; } @@ -352,8 +351,7 @@ void __wait_rcu_gp(bool checktiny, int n, call_rcu_func_t *crcu_array, /* Wait for all callbacks to be invoked. */ for (i = 0; i < n; i++) { if (checktiny && - (crcu_array[i] == call_rcu || - crcu_array[i] == call_rcu_bh)) + (crcu_array[i] == call_rcu)) continue; for (j = 0; j < i; j++) if (crcu_array[j] == crcu_array[i]) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index f12225f26b70..ea12ebc57840 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5788,7 +5788,7 @@ int sched_cpu_deactivate(unsigned int cpu) * * Do sync before park smpboot threads to take care the rcu boost case. */ - synchronize_rcu_mult(call_rcu, call_rcu_sched); + synchronize_rcu(); if (!sched_smp_initialized) return 0; -- cgit v1.3-7-g2ca7 From d3ff3891b2edba63a7dee9023306bb66878fc3d8 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 11 Jul 2018 14:42:53 -0700 Subject: rcu: Consolidate the RCU update functions invoked by sync.c This commit retains all the various gp_ops[] entries, but makes their update functions all be synchronize_rcu(), call_rcu() and rcu_barrier(). The read-side checks remain consistent with the various RCU flavors, which still exist on the read side. Signed-off-by: Paul E. McKenney Cc: Oleg Nesterov Cc: Peter Zijlstra --- kernel/rcu/sync.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/sync.c b/kernel/rcu/sync.c index 3f943efcf61c..9d570b1892b0 100644 --- a/kernel/rcu/sync.c +++ b/kernel/rcu/sync.c @@ -44,15 +44,15 @@ static const struct { __INIT_HELD(rcu_read_lock_held) }, [RCU_SCHED_SYNC] = { - .sync = synchronize_sched, - .call = call_rcu_sched, - .wait = rcu_barrier_sched, + .sync = synchronize_rcu, + .call = call_rcu, + .wait = rcu_barrier, __INIT_HELD(rcu_read_lock_sched_held) }, [RCU_BH_SYNC] = { - .sync = synchronize_rcu_bh, - .call = call_rcu_bh, - .wait = rcu_barrier_bh, + .sync = synchronize_rcu, + .call = call_rcu, + .wait = rcu_barrier, __INIT_HELD(rcu_read_lock_bh_held) }, }; -- cgit v1.3-7-g2ca7 From 78d125d33858d00756baf0f40e75e77dcbfbea55 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 11 Jul 2018 15:36:43 -0700 Subject: sched/membarrier: Replace synchronize_sched() with synchronize_rcu() Now that synchronize_rcu() waits for preempt-disable regions of code as well as RCU read-side critical sections, the synchronize_sched() in sys_membarrier() can be replaced by synchronize_rcu(). This commit therefore makes this change. Signed-off-by: Paul E. McKenney Cc: Mathieu Desnoyers --- kernel/sched/membarrier.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c index 76e0eaf4654e..388a7a6c1aa2 100644 --- a/kernel/sched/membarrier.c +++ b/kernel/sched/membarrier.c @@ -298,7 +298,7 @@ SYSCALL_DEFINE2(membarrier, int, cmd, int, flags) if (tick_nohz_full_enabled()) return -EINVAL; if (num_online_cpus() > 1) - synchronize_sched(); + synchronize_rcu(); return 0; case MEMBARRIER_CMD_GLOBAL_EXPEDITED: return membarrier_global_expedited(); -- cgit v1.3-7-g2ca7 From 0607ba8403c4cdb253f8c5200ecf654dfb7790cc Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 25 Apr 2018 13:01:25 -0700 Subject: srcu: Prevent __call_srcu() counter wrap with read-side critical section Ever since cdf7abc4610a ("srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context"), it has been permissible to use SRCU read-side critical sections in interrupt context. This allows __call_srcu() to use SRCU read-side critical sections to prevent a new SRCU grace period from ending before the call to either srcu_funnel_gp_start() or srcu_funnel_exp_start completes, thus preventing SRCU grace-period counter overflow during that time. Note that this does not permit removal of the counter-wrap checks in srcu_gp_end(). These check are necessary to handle the case where a given CPU does not interact at all with SRCU for an extended time period. This commit therefore adds an SRCU read-side critical section to __call_srcu() in order to prevent grace period counter wrap during the funnel-locking process. Signed-off-by: Paul E. McKenney --- kernel/rcu/srcutree.c | 3 +++ 1 file changed, 3 insertions(+) (limited to 'kernel') diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c index a8846ed7f352..60f3236beaf7 100644 --- a/kernel/rcu/srcutree.c +++ b/kernel/rcu/srcutree.c @@ -858,6 +858,7 @@ void __call_srcu(struct srcu_struct *sp, struct rcu_head *rhp, rcu_callback_t func, bool do_norm) { unsigned long flags; + int idx; bool needexp = false; bool needgp = false; unsigned long s; @@ -871,6 +872,7 @@ void __call_srcu(struct srcu_struct *sp, struct rcu_head *rhp, return; } rhp->func = func; + idx = srcu_read_lock(sp); local_irq_save(flags); sdp = this_cpu_ptr(sp->sda); spin_lock_rcu_node(sdp); @@ -892,6 +894,7 @@ void __call_srcu(struct srcu_struct *sp, struct rcu_head *rhp, srcu_funnel_gp_start(sp, sdp, s, do_norm); else if (needexp) srcu_funnel_exp_start(sp, sdp->mynode, s); + srcu_read_unlock(sp, idx); } /** -- cgit v1.3-7-g2ca7 From e647815a4d3b3be9d85b5750ed0f2947fd78fac7 Mon Sep 17 00:00:00 2001 From: Jiong Wang Date: Thu, 8 Nov 2018 04:08:42 -0500 Subject: bpf: let verifier to calculate and record max_pkt_offset In check_packet_access, update max_pkt_offset after the offset has passed __check_packet_access. It should be safe to use u32 for max_pkt_offset as explained in code comment. Also, when there is tail call, the max_pkt_offset of the called program is unknown, so conservatively set max_pkt_offset to MAX_PACKET_OFF for such case. Reviewed-by: Jakub Kicinski Signed-off-by: Jiong Wang Signed-off-by: Daniel Borkmann --- include/linux/bpf.h | 1 + kernel/bpf/verifier.c | 12 ++++++++++++ 2 files changed, 13 insertions(+) (limited to 'kernel') diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 33014ae73103..b6a296e01f6a 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -293,6 +293,7 @@ struct bpf_prog_aux { atomic_t refcnt; u32 used_map_cnt; u32 max_ctx_offset; + u32 max_pkt_offset; u32 stack_depth; u32 id; u32 func_cnt; diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 1971ca325fb4..75dab40b19a3 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -1455,6 +1455,17 @@ static int check_packet_access(struct bpf_verifier_env *env, u32 regno, int off, verbose(env, "R%d offset is outside of the packet\n", regno); return err; } + + /* __check_packet_access has made sure "off + size - 1" is within u16. + * reg->umax_value can't be bigger than MAX_PACKET_OFF which is 0xffff, + * otherwise find_good_pkt_pointers would have refused to set range info + * that __check_packet_access would have rejected this pkt access. + * Therefore, "off + reg->umax_value + size - 1" won't overflow u32. + */ + env->prog->aux->max_pkt_offset = + max_t(u32, env->prog->aux->max_pkt_offset, + off + reg->umax_value + size - 1); + return err; } @@ -6138,6 +6149,7 @@ static int fixup_bpf_calls(struct bpf_verifier_env *env) */ prog->cb_access = 1; env->prog->aux->stack_depth = MAX_BPF_STACK; + env->prog->aux->max_pkt_offset = MAX_PACKET_OFF; /* mark bpf_tail_call as different opcode to avoid * conditional branch in the interpeter for every normal -- cgit v1.3-7-g2ca7 From 1385d755cfb42f596ef1cf9f5c761010ff3b34e7 Mon Sep 17 00:00:00 2001 From: Quentin Monnet Date: Fri, 9 Nov 2018 13:03:25 +0000 Subject: bpf: pass a struct with offload callbacks to bpf_offload_dev_create() For passing device functions for offloaded eBPF programs, there used to be no place where to store the pointer without making the non-offloaded programs pay a memory price. As a consequence, three functions were called with ndo_bpf() through specific commands. Now that we have struct bpf_offload_dev, and since none of those operations rely on RTNL, we can turn these three commands into hooks inside the struct bpf_prog_offload_ops, and pass them as part of bpf_offload_dev_create(). This commit effectively passes a pointer to the struct to bpf_offload_dev_create(). We temporarily have two struct bpf_prog_offload_ops instances, one under offdev->ops and one under offload->dev_ops. The next patches will make the transition towards the former, so that offload->dev_ops can be removed, and callbacks relying on ndo_bpf() added to offdev->ops as well. While at it, rename "nfp_bpf_analyzer_ops" as "nfp_bpf_dev_ops" (and similarly for netdevsim). Suggested-by: Jakub Kicinski Signed-off-by: Quentin Monnet Reviewed-by: Jakub Kicinski Signed-off-by: Alexei Starovoitov --- drivers/net/ethernet/netronome/nfp/bpf/main.c | 2 +- drivers/net/ethernet/netronome/nfp/bpf/main.h | 2 +- drivers/net/ethernet/netronome/nfp/bpf/offload.c | 4 ++-- drivers/net/netdevsim/bpf.c | 6 +++--- include/linux/bpf.h | 3 ++- kernel/bpf/offload.c | 5 ++++- 6 files changed, 13 insertions(+), 9 deletions(-) (limited to 'kernel') diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.c b/drivers/net/ethernet/netronome/nfp/bpf/main.c index 6243af0ab025..dccae0319204 100644 --- a/drivers/net/ethernet/netronome/nfp/bpf/main.c +++ b/drivers/net/ethernet/netronome/nfp/bpf/main.c @@ -465,7 +465,7 @@ static int nfp_bpf_init(struct nfp_app *app) app->ctrl_mtu = nfp_bpf_ctrl_cmsg_mtu(bpf); } - bpf->bpf_dev = bpf_offload_dev_create(); + bpf->bpf_dev = bpf_offload_dev_create(&nfp_bpf_dev_ops); err = PTR_ERR_OR_ZERO(bpf->bpf_dev); if (err) goto err_free_neutral_maps; diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.h b/drivers/net/ethernet/netronome/nfp/bpf/main.h index abdd93d14439..941277936475 100644 --- a/drivers/net/ethernet/netronome/nfp/bpf/main.h +++ b/drivers/net/ethernet/netronome/nfp/bpf/main.h @@ -513,7 +513,7 @@ int nfp_verify_insn(struct bpf_verifier_env *env, int insn_idx, int prev_insn_idx); int nfp_bpf_finalize(struct bpf_verifier_env *env); -extern const struct bpf_prog_offload_ops nfp_bpf_analyzer_ops; +extern const struct bpf_prog_offload_ops nfp_bpf_dev_ops; struct netdev_bpf; struct nfp_app; diff --git a/drivers/net/ethernet/netronome/nfp/bpf/offload.c b/drivers/net/ethernet/netronome/nfp/bpf/offload.c index dc548bb4089e..2fca996a7e77 100644 --- a/drivers/net/ethernet/netronome/nfp/bpf/offload.c +++ b/drivers/net/ethernet/netronome/nfp/bpf/offload.c @@ -209,7 +209,7 @@ nfp_bpf_verifier_prep(struct nfp_app *app, struct nfp_net *nn, goto err_free; nfp_prog->verifier_meta = nfp_prog_first_meta(nfp_prog); - bpf->verifier.ops = &nfp_bpf_analyzer_ops; + bpf->verifier.ops = &nfp_bpf_dev_ops; return 0; @@ -602,7 +602,7 @@ int nfp_net_bpf_offload(struct nfp_net *nn, struct bpf_prog *prog, return 0; } -const struct bpf_prog_offload_ops nfp_bpf_analyzer_ops = { +const struct bpf_prog_offload_ops nfp_bpf_dev_ops = { .insn_hook = nfp_verify_insn, .finalize = nfp_bpf_finalize, }; diff --git a/drivers/net/netdevsim/bpf.c b/drivers/net/netdevsim/bpf.c index cb3518474f0e..135aee864162 100644 --- a/drivers/net/netdevsim/bpf.c +++ b/drivers/net/netdevsim/bpf.c @@ -91,7 +91,7 @@ static int nsim_bpf_finalize(struct bpf_verifier_env *env) return 0; } -static const struct bpf_prog_offload_ops nsim_bpf_analyzer_ops = { +static const struct bpf_prog_offload_ops nsim_bpf_dev_ops = { .insn_hook = nsim_bpf_verify_insn, .finalize = nsim_bpf_finalize, }; @@ -547,7 +547,7 @@ int nsim_bpf(struct net_device *dev, struct netdev_bpf *bpf) if (err) return err; - bpf->verifier.ops = &nsim_bpf_analyzer_ops; + bpf->verifier.ops = &nsim_bpf_dev_ops; return 0; case BPF_OFFLOAD_TRANSLATE: state = bpf->offload.prog->aux->offload->dev_priv; @@ -599,7 +599,7 @@ int nsim_bpf_init(struct netdevsim *ns) if (IS_ERR_OR_NULL(ns->sdev->ddir_bpf_bound_progs)) return -ENOMEM; - ns->sdev->bpf_dev = bpf_offload_dev_create(); + ns->sdev->bpf_dev = bpf_offload_dev_create(&nsim_bpf_dev_ops); err = PTR_ERR_OR_ZERO(ns->sdev->bpf_dev); if (err) return err; diff --git a/include/linux/bpf.h b/include/linux/bpf.h index b6a296e01f6a..c0197c37b2b2 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -692,7 +692,8 @@ int bpf_map_offload_get_next_key(struct bpf_map *map, bool bpf_offload_prog_map_match(struct bpf_prog *prog, struct bpf_map *map); -struct bpf_offload_dev *bpf_offload_dev_create(void); +struct bpf_offload_dev * +bpf_offload_dev_create(const struct bpf_prog_offload_ops *ops); void bpf_offload_dev_destroy(struct bpf_offload_dev *offdev); int bpf_offload_dev_netdev_register(struct bpf_offload_dev *offdev, struct net_device *netdev); diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c index 8e93c47f0779..d513fbf9ca53 100644 --- a/kernel/bpf/offload.c +++ b/kernel/bpf/offload.c @@ -33,6 +33,7 @@ static DECLARE_RWSEM(bpf_devs_lock); struct bpf_offload_dev { + const struct bpf_prog_offload_ops *ops; struct list_head netdevs; }; @@ -655,7 +656,8 @@ unlock: } EXPORT_SYMBOL_GPL(bpf_offload_dev_netdev_unregister); -struct bpf_offload_dev *bpf_offload_dev_create(void) +struct bpf_offload_dev * +bpf_offload_dev_create(const struct bpf_prog_offload_ops *ops) { struct bpf_offload_dev *offdev; int err; @@ -673,6 +675,7 @@ struct bpf_offload_dev *bpf_offload_dev_create(void) if (!offdev) return ERR_PTR(-ENOMEM); + offdev->ops = ops; INIT_LIST_HEAD(&offdev->netdevs); return offdev; -- cgit v1.3-7-g2ca7 From 341b3e7b7b89315c43d262da3199098bcf9bbe57 Mon Sep 17 00:00:00 2001 From: Quentin Monnet Date: Fri, 9 Nov 2018 13:03:26 +0000 Subject: bpf: call verify_insn from its callback in struct bpf_offload_dev We intend to remove the dev_ops in struct bpf_prog_offload, and to only keep the ops in struct bpf_offload_dev instead, which is accessible from more locations for passing function pointers. But dev_ops is used for calling the verify_insn hook. Switch to the newly added ops in struct bpf_prog_offload instead. To avoid table lookups for each eBPF instruction to verify, we remember the offdev attached to a netdev and modify bpf_offload_find_netdev() to avoid performing more than once a lookup for a given offload object. Signed-off-by: Quentin Monnet Reviewed-by: Jakub Kicinski Signed-off-by: Alexei Starovoitov --- include/linux/bpf.h | 1 + kernel/bpf/offload.c | 4 +++- 2 files changed, 4 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/include/linux/bpf.h b/include/linux/bpf.h index c0197c37b2b2..672714cd904f 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -273,6 +273,7 @@ struct bpf_prog_offload_ops { struct bpf_prog_offload { struct bpf_prog *prog; struct net_device *netdev; + struct bpf_offload_dev *offdev; void *dev_priv; struct list_head offloads; bool dev_state; diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c index d513fbf9ca53..2cd3c0d0417b 100644 --- a/kernel/bpf/offload.c +++ b/kernel/bpf/offload.c @@ -107,6 +107,7 @@ int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr) err = -EINVAL; goto err_unlock; } + offload->offdev = ondev->offdev; prog->aux->offload = offload; list_add_tail(&offload->offloads, &ondev->progs); dev_put(offload->netdev); @@ -167,7 +168,8 @@ int bpf_prog_offload_verify_insn(struct bpf_verifier_env *env, down_read(&bpf_devs_lock); offload = env->prog->aux->offload; if (offload) - ret = offload->dev_ops->insn_hook(env, insn_idx, prev_insn_idx); + ret = offload->offdev->ops->insn_hook(env, insn_idx, + prev_insn_idx); up_read(&bpf_devs_lock); return ret; -- cgit v1.3-7-g2ca7 From 6dc18fa6f4cad69c892d6fb9499f7e41c6a88a8e Mon Sep 17 00:00:00 2001 From: Quentin Monnet Date: Fri, 9 Nov 2018 13:03:27 +0000 Subject: bpf: call finalize() from its callback in struct bpf_offload_dev In a way similar to the change previously brought to the verify_insn hook, switch to the newly added ops in struct bpf_prog_offload for calling the functions used to perform final verification steps for offloaded programs. Signed-off-by: Quentin Monnet Reviewed-by: Jakub Kicinski Signed-off-by: Alexei Starovoitov --- kernel/bpf/offload.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c index 2cd3c0d0417b..2c88cb4ddfd8 100644 --- a/kernel/bpf/offload.c +++ b/kernel/bpf/offload.c @@ -183,8 +183,8 @@ int bpf_prog_offload_finalize(struct bpf_verifier_env *env) down_read(&bpf_devs_lock); offload = env->prog->aux->offload; if (offload) { - if (offload->dev_ops->finalize) - ret = offload->dev_ops->finalize(env); + if (offload->offdev->ops->finalize) + ret = offload->offdev->ops->finalize(env); else ret = 0; } -- cgit v1.3-7-g2ca7 From 00db12c3d141356a4d1e6b6f688e0d5ed3b1f757 Mon Sep 17 00:00:00 2001 From: Quentin Monnet Date: Fri, 9 Nov 2018 13:03:28 +0000 Subject: bpf: call verifier_prep from its callback in struct bpf_offload_dev In a way similar to the change previously brought to the verify_insn hook and to the finalize callback, switch to the newly added ops in struct bpf_prog_offload for calling the functions used to prepare driver verifiers. Since the dev_ops pointer in struct bpf_prog_offload is no longer used by any callback, we can now remove it from struct bpf_prog_offload. Signed-off-by: Quentin Monnet Reviewed-by: Jakub Kicinski Signed-off-by: Alexei Starovoitov --- drivers/net/ethernet/netronome/nfp/bpf/offload.c | 11 ++++---- drivers/net/netdevsim/bpf.c | 32 +++++++++++++----------- include/linux/bpf.h | 2 +- include/linux/netdevice.h | 6 ----- kernel/bpf/offload.c | 22 +++++++--------- 5 files changed, 32 insertions(+), 41 deletions(-) (limited to 'kernel') diff --git a/drivers/net/ethernet/netronome/nfp/bpf/offload.c b/drivers/net/ethernet/netronome/nfp/bpf/offload.c index 2fca996a7e77..16a3a9c55852 100644 --- a/drivers/net/ethernet/netronome/nfp/bpf/offload.c +++ b/drivers/net/ethernet/netronome/nfp/bpf/offload.c @@ -188,10 +188,11 @@ static void nfp_prog_free(struct nfp_prog *nfp_prog) } static int -nfp_bpf_verifier_prep(struct nfp_app *app, struct nfp_net *nn, - struct netdev_bpf *bpf) +nfp_bpf_verifier_prep(struct net_device *netdev, struct bpf_verifier_env *env) { - struct bpf_prog *prog = bpf->verifier.prog; + struct nfp_net *nn = netdev_priv(netdev); + struct bpf_prog *prog = env->prog; + struct nfp_app *app = nn->app; struct nfp_prog *nfp_prog; int ret; @@ -209,7 +210,6 @@ nfp_bpf_verifier_prep(struct nfp_app *app, struct nfp_net *nn, goto err_free; nfp_prog->verifier_meta = nfp_prog_first_meta(nfp_prog); - bpf->verifier.ops = &nfp_bpf_dev_ops; return 0; @@ -422,8 +422,6 @@ nfp_bpf_map_free(struct nfp_app_bpf *bpf, struct bpf_offloaded_map *offmap) int nfp_ndo_bpf(struct nfp_app *app, struct nfp_net *nn, struct netdev_bpf *bpf) { switch (bpf->command) { - case BPF_OFFLOAD_VERIFIER_PREP: - return nfp_bpf_verifier_prep(app, nn, bpf); case BPF_OFFLOAD_TRANSLATE: return nfp_bpf_translate(nn, bpf->offload.prog); case BPF_OFFLOAD_DESTROY: @@ -605,4 +603,5 @@ int nfp_net_bpf_offload(struct nfp_net *nn, struct bpf_prog *prog, const struct bpf_prog_offload_ops nfp_bpf_dev_ops = { .insn_hook = nfp_verify_insn, .finalize = nfp_bpf_finalize, + .prepare = nfp_bpf_verifier_prep, }; diff --git a/drivers/net/netdevsim/bpf.c b/drivers/net/netdevsim/bpf.c index 135aee864162..d045b7d666d9 100644 --- a/drivers/net/netdevsim/bpf.c +++ b/drivers/net/netdevsim/bpf.c @@ -91,11 +91,6 @@ static int nsim_bpf_finalize(struct bpf_verifier_env *env) return 0; } -static const struct bpf_prog_offload_ops nsim_bpf_dev_ops = { - .insn_hook = nsim_bpf_verify_insn, - .finalize = nsim_bpf_finalize, -}; - static bool nsim_xdp_offload_active(struct netdevsim *ns) { return ns->xdp_hw.prog; @@ -263,6 +258,17 @@ static int nsim_bpf_create_prog(struct netdevsim *ns, struct bpf_prog *prog) return 0; } +static int +nsim_bpf_verifier_prep(struct net_device *dev, struct bpf_verifier_env *env) +{ + struct netdevsim *ns = netdev_priv(dev); + + if (!ns->bpf_bind_accept) + return -EOPNOTSUPP; + + return nsim_bpf_create_prog(ns, env->prog); +} + static void nsim_bpf_destroy_prog(struct bpf_prog *prog) { struct nsim_bpf_bound_prog *state; @@ -275,6 +281,12 @@ static void nsim_bpf_destroy_prog(struct bpf_prog *prog) kfree(state); } +static const struct bpf_prog_offload_ops nsim_bpf_dev_ops = { + .insn_hook = nsim_bpf_verify_insn, + .finalize = nsim_bpf_finalize, + .prepare = nsim_bpf_verifier_prep, +}; + static int nsim_setup_prog_checks(struct netdevsim *ns, struct netdev_bpf *bpf) { if (bpf->prog && bpf->prog->aux->offload) { @@ -539,16 +551,6 @@ int nsim_bpf(struct net_device *dev, struct netdev_bpf *bpf) ASSERT_RTNL(); switch (bpf->command) { - case BPF_OFFLOAD_VERIFIER_PREP: - if (!ns->bpf_bind_accept) - return -EOPNOTSUPP; - - err = nsim_bpf_create_prog(ns, bpf->verifier.prog); - if (err) - return err; - - bpf->verifier.ops = &nsim_bpf_dev_ops; - return 0; case BPF_OFFLOAD_TRANSLATE: state = bpf->offload.prog->aux->offload->dev_priv; diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 672714cd904f..f250494a4f56 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -268,6 +268,7 @@ struct bpf_prog_offload_ops { int (*insn_hook)(struct bpf_verifier_env *env, int insn_idx, int prev_insn_idx); int (*finalize)(struct bpf_verifier_env *env); + int (*prepare)(struct net_device *netdev, struct bpf_verifier_env *env); }; struct bpf_prog_offload { @@ -277,7 +278,6 @@ struct bpf_prog_offload { void *dev_priv; struct list_head offloads; bool dev_state; - const struct bpf_prog_offload_ops *dev_ops; void *jited_image; u32 jited_len; }; diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 857f8abf7b91..0fa2c2744928 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -863,7 +863,6 @@ enum bpf_netdev_command { XDP_QUERY_PROG, XDP_QUERY_PROG_HW, /* BPF program for offload callbacks, invoked at program load time. */ - BPF_OFFLOAD_VERIFIER_PREP, BPF_OFFLOAD_TRANSLATE, BPF_OFFLOAD_DESTROY, BPF_OFFLOAD_MAP_ALLOC, @@ -891,11 +890,6 @@ struct netdev_bpf { /* flags with which program was installed */ u32 prog_flags; }; - /* BPF_OFFLOAD_VERIFIER_PREP */ - struct { - struct bpf_prog *prog; - const struct bpf_prog_offload_ops *ops; /* callee set */ - } verifier; /* BPF_OFFLOAD_TRANSLATE, BPF_OFFLOAD_DESTROY */ struct { struct bpf_prog *prog; diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c index 2c88cb4ddfd8..1f7ac00a494d 100644 --- a/kernel/bpf/offload.c +++ b/kernel/bpf/offload.c @@ -142,21 +142,17 @@ static int __bpf_offload_ndo(struct bpf_prog *prog, enum bpf_netdev_command cmd, int bpf_prog_offload_verifier_prep(struct bpf_verifier_env *env) { - struct netdev_bpf data = {}; - int err; - - data.verifier.prog = env->prog; + struct bpf_prog_offload *offload; + int ret = -ENODEV; - rtnl_lock(); - err = __bpf_offload_ndo(env->prog, BPF_OFFLOAD_VERIFIER_PREP, &data); - if (err) - goto exit_unlock; + down_read(&bpf_devs_lock); + offload = env->prog->aux->offload; + if (offload) + ret = offload->offdev->ops->prepare(offload->netdev, env); + offload->dev_state = !ret; + up_read(&bpf_devs_lock); - env->prog->aux->offload->dev_ops = data.verifier.ops; - env->prog->aux->offload->dev_state = true; -exit_unlock: - rtnl_unlock(); - return err; + return ret; } int bpf_prog_offload_verify_insn(struct bpf_verifier_env *env, -- cgit v1.3-7-g2ca7 From b07ade27e93360197e453e5ca80eebdc9099dcb5 Mon Sep 17 00:00:00 2001 From: Quentin Monnet Date: Fri, 9 Nov 2018 13:03:29 +0000 Subject: bpf: pass translate() as a callback and remove its ndo_bpf subcommand As part of the transition from ndo_bpf() to callbacks attached to struct bpf_offload_dev for some of the eBPF offload operations, move the functions related to code translation to the struct and remove the subcommand that was used to call them through the NDO. Signed-off-by: Quentin Monnet Reviewed-by: Jakub Kicinski Signed-off-by: Alexei Starovoitov --- drivers/net/ethernet/netronome/nfp/bpf/offload.c | 11 +++-------- drivers/net/netdevsim/bpf.c | 14 +++++++++----- include/linux/bpf.h | 1 + include/linux/netdevice.h | 3 +-- kernel/bpf/offload.c | 14 +++++++------- 5 files changed, 21 insertions(+), 22 deletions(-) (limited to 'kernel') diff --git a/drivers/net/ethernet/netronome/nfp/bpf/offload.c b/drivers/net/ethernet/netronome/nfp/bpf/offload.c index 16a3a9c55852..8653a2189c19 100644 --- a/drivers/net/ethernet/netronome/nfp/bpf/offload.c +++ b/drivers/net/ethernet/netronome/nfp/bpf/offload.c @@ -33,9 +33,6 @@ nfp_map_ptr_record(struct nfp_app_bpf *bpf, struct nfp_prog *nfp_prog, struct nfp_bpf_neutral_map *record; int err; - /* Map record paths are entered via ndo, update side is protected. */ - ASSERT_RTNL(); - /* Reuse path - other offloaded program is already tracking this map. */ record = rhashtable_lookup_fast(&bpf->maps_neutral, &map->id, nfp_bpf_maps_neutral_params); @@ -84,8 +81,6 @@ nfp_map_ptrs_forget(struct nfp_app_bpf *bpf, struct nfp_prog *nfp_prog) bool freed = false; int i; - ASSERT_RTNL(); - for (i = 0; i < nfp_prog->map_records_cnt; i++) { if (--nfp_prog->map_records[i]->count) { nfp_prog->map_records[i] = NULL; @@ -219,9 +214,10 @@ err_free: return ret; } -static int nfp_bpf_translate(struct nfp_net *nn, struct bpf_prog *prog) +static int nfp_bpf_translate(struct net_device *netdev, struct bpf_prog *prog) { struct nfp_prog *nfp_prog = prog->aux->offload->dev_priv; + struct nfp_net *nn = netdev_priv(netdev); unsigned int max_instr; int err; @@ -422,8 +418,6 @@ nfp_bpf_map_free(struct nfp_app_bpf *bpf, struct bpf_offloaded_map *offmap) int nfp_ndo_bpf(struct nfp_app *app, struct nfp_net *nn, struct netdev_bpf *bpf) { switch (bpf->command) { - case BPF_OFFLOAD_TRANSLATE: - return nfp_bpf_translate(nn, bpf->offload.prog); case BPF_OFFLOAD_DESTROY: return nfp_bpf_destroy(nn, bpf->offload.prog); case BPF_OFFLOAD_MAP_ALLOC: @@ -604,4 +598,5 @@ const struct bpf_prog_offload_ops nfp_bpf_dev_ops = { .insn_hook = nfp_verify_insn, .finalize = nfp_bpf_finalize, .prepare = nfp_bpf_verifier_prep, + .translate = nfp_bpf_translate, }; diff --git a/drivers/net/netdevsim/bpf.c b/drivers/net/netdevsim/bpf.c index d045b7d666d9..30c2cd516d1c 100644 --- a/drivers/net/netdevsim/bpf.c +++ b/drivers/net/netdevsim/bpf.c @@ -269,6 +269,14 @@ nsim_bpf_verifier_prep(struct net_device *dev, struct bpf_verifier_env *env) return nsim_bpf_create_prog(ns, env->prog); } +static int nsim_bpf_translate(struct net_device *dev, struct bpf_prog *prog) +{ + struct nsim_bpf_bound_prog *state = prog->aux->offload->dev_priv; + + state->state = "xlated"; + return 0; +} + static void nsim_bpf_destroy_prog(struct bpf_prog *prog) { struct nsim_bpf_bound_prog *state; @@ -285,6 +293,7 @@ static const struct bpf_prog_offload_ops nsim_bpf_dev_ops = { .insn_hook = nsim_bpf_verify_insn, .finalize = nsim_bpf_finalize, .prepare = nsim_bpf_verifier_prep, + .translate = nsim_bpf_translate, }; static int nsim_setup_prog_checks(struct netdevsim *ns, struct netdev_bpf *bpf) @@ -551,11 +560,6 @@ int nsim_bpf(struct net_device *dev, struct netdev_bpf *bpf) ASSERT_RTNL(); switch (bpf->command) { - case BPF_OFFLOAD_TRANSLATE: - state = bpf->offload.prog->aux->offload->dev_priv; - - state->state = "xlated"; - return 0; case BPF_OFFLOAD_DESTROY: nsim_bpf_destroy_prog(bpf->offload.prog); return 0; diff --git a/include/linux/bpf.h b/include/linux/bpf.h index f250494a4f56..d1eb3c8a3fa9 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -269,6 +269,7 @@ struct bpf_prog_offload_ops { int insn_idx, int prev_insn_idx); int (*finalize)(struct bpf_verifier_env *env); int (*prepare)(struct net_device *netdev, struct bpf_verifier_env *env); + int (*translate)(struct net_device *netdev, struct bpf_prog *prog); }; struct bpf_prog_offload { diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 0fa2c2744928..27499127e038 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -863,7 +863,6 @@ enum bpf_netdev_command { XDP_QUERY_PROG, XDP_QUERY_PROG_HW, /* BPF program for offload callbacks, invoked at program load time. */ - BPF_OFFLOAD_TRANSLATE, BPF_OFFLOAD_DESTROY, BPF_OFFLOAD_MAP_ALLOC, BPF_OFFLOAD_MAP_FREE, @@ -890,7 +889,7 @@ struct netdev_bpf { /* flags with which program was installed */ u32 prog_flags; }; - /* BPF_OFFLOAD_TRANSLATE, BPF_OFFLOAD_DESTROY */ + /* BPF_OFFLOAD_DESTROY */ struct { struct bpf_prog *prog; } offload; diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c index 1f7ac00a494d..ae0167366c12 100644 --- a/kernel/bpf/offload.c +++ b/kernel/bpf/offload.c @@ -219,14 +219,14 @@ void bpf_prog_offload_destroy(struct bpf_prog *prog) static int bpf_prog_offload_translate(struct bpf_prog *prog) { - struct netdev_bpf data = {}; - int ret; - - data.offload.prog = prog; + struct bpf_prog_offload *offload; + int ret = -ENODEV; - rtnl_lock(); - ret = __bpf_offload_ndo(prog, BPF_OFFLOAD_TRANSLATE, &data); - rtnl_unlock(); + down_read(&bpf_devs_lock); + offload = prog->aux->offload; + if (offload) + ret = offload->offdev->ops->translate(offload->netdev, prog); + up_read(&bpf_devs_lock); return ret; } -- cgit v1.3-7-g2ca7 From eb9119471efbf730c8f830f706026b486eb701dd Mon Sep 17 00:00:00 2001 From: Quentin Monnet Date: Fri, 9 Nov 2018 13:03:30 +0000 Subject: bpf: pass destroy() as a callback and remove its ndo_bpf subcommand As part of the transition from ndo_bpf() to callbacks attached to struct bpf_offload_dev for some of the eBPF offload operations, move the functions related to program destruction to the struct and remove the subcommand that was used to call them through the NDO. Remove function __bpf_offload_ndo(), which is no longer used. Signed-off-by: Quentin Monnet Reviewed-by: Jakub Kicinski Signed-off-by: Alexei Starovoitov --- drivers/net/ethernet/netronome/nfp/bpf/offload.c | 7 ++----- drivers/net/netdevsim/bpf.c | 4 +--- include/linux/bpf.h | 1 + include/linux/netdevice.h | 5 ----- kernel/bpf/offload.c | 24 +----------------------- 5 files changed, 5 insertions(+), 36 deletions(-) (limited to 'kernel') diff --git a/drivers/net/ethernet/netronome/nfp/bpf/offload.c b/drivers/net/ethernet/netronome/nfp/bpf/offload.c index 8653a2189c19..91085cc3c843 100644 --- a/drivers/net/ethernet/netronome/nfp/bpf/offload.c +++ b/drivers/net/ethernet/netronome/nfp/bpf/offload.c @@ -238,15 +238,13 @@ static int nfp_bpf_translate(struct net_device *netdev, struct bpf_prog *prog) return nfp_map_ptrs_record(nfp_prog->bpf, nfp_prog, prog); } -static int nfp_bpf_destroy(struct nfp_net *nn, struct bpf_prog *prog) +static void nfp_bpf_destroy(struct bpf_prog *prog) { struct nfp_prog *nfp_prog = prog->aux->offload->dev_priv; kvfree(nfp_prog->prog); nfp_map_ptrs_forget(nfp_prog->bpf, nfp_prog); nfp_prog_free(nfp_prog); - - return 0; } /* Atomic engine requires values to be in big endian, we need to byte swap @@ -418,8 +416,6 @@ nfp_bpf_map_free(struct nfp_app_bpf *bpf, struct bpf_offloaded_map *offmap) int nfp_ndo_bpf(struct nfp_app *app, struct nfp_net *nn, struct netdev_bpf *bpf) { switch (bpf->command) { - case BPF_OFFLOAD_DESTROY: - return nfp_bpf_destroy(nn, bpf->offload.prog); case BPF_OFFLOAD_MAP_ALLOC: return nfp_bpf_map_alloc(app->priv, bpf->offmap); case BPF_OFFLOAD_MAP_FREE: @@ -599,4 +595,5 @@ const struct bpf_prog_offload_ops nfp_bpf_dev_ops = { .finalize = nfp_bpf_finalize, .prepare = nfp_bpf_verifier_prep, .translate = nfp_bpf_translate, + .destroy = nfp_bpf_destroy, }; diff --git a/drivers/net/netdevsim/bpf.c b/drivers/net/netdevsim/bpf.c index 30c2cd516d1c..33e3d54c3a0a 100644 --- a/drivers/net/netdevsim/bpf.c +++ b/drivers/net/netdevsim/bpf.c @@ -294,6 +294,7 @@ static const struct bpf_prog_offload_ops nsim_bpf_dev_ops = { .finalize = nsim_bpf_finalize, .prepare = nsim_bpf_verifier_prep, .translate = nsim_bpf_translate, + .destroy = nsim_bpf_destroy_prog, }; static int nsim_setup_prog_checks(struct netdevsim *ns, struct netdev_bpf *bpf) @@ -560,9 +561,6 @@ int nsim_bpf(struct net_device *dev, struct netdev_bpf *bpf) ASSERT_RTNL(); switch (bpf->command) { - case BPF_OFFLOAD_DESTROY: - nsim_bpf_destroy_prog(bpf->offload.prog); - return 0; case XDP_QUERY_PROG: return xdp_attachment_query(&ns->xdp, bpf); case XDP_QUERY_PROG_HW: diff --git a/include/linux/bpf.h b/include/linux/bpf.h index d1eb3c8a3fa9..867d2801db64 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -270,6 +270,7 @@ struct bpf_prog_offload_ops { int (*finalize)(struct bpf_verifier_env *env); int (*prepare)(struct net_device *netdev, struct bpf_verifier_env *env); int (*translate)(struct net_device *netdev, struct bpf_prog *prog); + void (*destroy)(struct bpf_prog *prog); }; struct bpf_prog_offload { diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 27499127e038..17d52a647fe5 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -863,7 +863,6 @@ enum bpf_netdev_command { XDP_QUERY_PROG, XDP_QUERY_PROG_HW, /* BPF program for offload callbacks, invoked at program load time. */ - BPF_OFFLOAD_DESTROY, BPF_OFFLOAD_MAP_ALLOC, BPF_OFFLOAD_MAP_FREE, XDP_QUERY_XSK_UMEM, @@ -889,10 +888,6 @@ struct netdev_bpf { /* flags with which program was installed */ u32 prog_flags; }; - /* BPF_OFFLOAD_DESTROY */ - struct { - struct bpf_prog *prog; - } offload; /* BPF_OFFLOAD_MAP_ALLOC, BPF_OFFLOAD_MAP_FREE */ struct { struct bpf_offloaded_map *offmap; diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c index ae0167366c12..d665e75a0ac3 100644 --- a/kernel/bpf/offload.c +++ b/kernel/bpf/offload.c @@ -123,23 +123,6 @@ err_maybe_put: return err; } -static int __bpf_offload_ndo(struct bpf_prog *prog, enum bpf_netdev_command cmd, - struct netdev_bpf *data) -{ - struct bpf_prog_offload *offload = prog->aux->offload; - struct net_device *netdev; - - ASSERT_RTNL(); - - if (!offload) - return -ENODEV; - netdev = offload->netdev; - - data->command = cmd; - - return netdev->netdev_ops->ndo_bpf(netdev, data); -} - int bpf_prog_offload_verifier_prep(struct bpf_verifier_env *env) { struct bpf_prog_offload *offload; @@ -192,12 +175,9 @@ int bpf_prog_offload_finalize(struct bpf_verifier_env *env) static void __bpf_prog_offload_destroy(struct bpf_prog *prog) { struct bpf_prog_offload *offload = prog->aux->offload; - struct netdev_bpf data = {}; - - data.offload.prog = prog; if (offload->dev_state) - WARN_ON(__bpf_offload_ndo(prog, BPF_OFFLOAD_DESTROY, &data)); + offload->offdev->ops->destroy(prog); /* Make sure BPF_PROG_GET_NEXT_ID can't find this dead program */ bpf_prog_free_id(prog, true); @@ -209,12 +189,10 @@ static void __bpf_prog_offload_destroy(struct bpf_prog *prog) void bpf_prog_offload_destroy(struct bpf_prog *prog) { - rtnl_lock(); down_write(&bpf_devs_lock); if (prog->aux->offload) __bpf_prog_offload_destroy(prog); up_write(&bpf_devs_lock); - rtnl_unlock(); } static int bpf_prog_offload_translate(struct bpf_prog *prog) -- cgit v1.3-7-g2ca7 From a40a26322a83d4a26a99ad2616cbd77394c19587 Mon Sep 17 00:00:00 2001 From: Quentin Monnet Date: Fri, 9 Nov 2018 13:03:31 +0000 Subject: bpf: pass prog instead of env to bpf_prog_offload_verifier_prep() Function bpf_prog_offload_verifier_prep(), called from the kernel BPF verifier to run a driver-specific callback for preparing for the verification step for offloaded programs, takes a pointer to a struct bpf_verifier_env object. However, no driver callback needs the whole structure at this time: the two drivers supporting this, nfp and netdevsim, only need a pointer to the struct bpf_prog instance held by env. Update the callback accordingly, on kernel side and in these two drivers. Signed-off-by: Quentin Monnet Reviewed-by: Jakub Kicinski Signed-off-by: Alexei Starovoitov --- drivers/net/ethernet/netronome/nfp/bpf/offload.c | 3 +-- drivers/net/netdevsim/bpf.c | 4 ++-- include/linux/bpf.h | 2 +- include/linux/bpf_verifier.h | 2 +- kernel/bpf/offload.c | 6 +++--- kernel/bpf/verifier.c | 2 +- 6 files changed, 9 insertions(+), 10 deletions(-) (limited to 'kernel') diff --git a/drivers/net/ethernet/netronome/nfp/bpf/offload.c b/drivers/net/ethernet/netronome/nfp/bpf/offload.c index 91085cc3c843..e6b26d2f651d 100644 --- a/drivers/net/ethernet/netronome/nfp/bpf/offload.c +++ b/drivers/net/ethernet/netronome/nfp/bpf/offload.c @@ -183,10 +183,9 @@ static void nfp_prog_free(struct nfp_prog *nfp_prog) } static int -nfp_bpf_verifier_prep(struct net_device *netdev, struct bpf_verifier_env *env) +nfp_bpf_verifier_prep(struct net_device *netdev, struct bpf_prog *prog) { struct nfp_net *nn = netdev_priv(netdev); - struct bpf_prog *prog = env->prog; struct nfp_app *app = nn->app; struct nfp_prog *nfp_prog; int ret; diff --git a/drivers/net/netdevsim/bpf.c b/drivers/net/netdevsim/bpf.c index 33e3d54c3a0a..560bdaf1c98b 100644 --- a/drivers/net/netdevsim/bpf.c +++ b/drivers/net/netdevsim/bpf.c @@ -259,14 +259,14 @@ static int nsim_bpf_create_prog(struct netdevsim *ns, struct bpf_prog *prog) } static int -nsim_bpf_verifier_prep(struct net_device *dev, struct bpf_verifier_env *env) +nsim_bpf_verifier_prep(struct net_device *dev, struct bpf_prog *prog) { struct netdevsim *ns = netdev_priv(dev); if (!ns->bpf_bind_accept) return -EOPNOTSUPP; - return nsim_bpf_create_prog(ns, env->prog); + return nsim_bpf_create_prog(ns, prog); } static int nsim_bpf_translate(struct net_device *dev, struct bpf_prog *prog) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 867d2801db64..888111350d0e 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -268,7 +268,7 @@ struct bpf_prog_offload_ops { int (*insn_hook)(struct bpf_verifier_env *env, int insn_idx, int prev_insn_idx); int (*finalize)(struct bpf_verifier_env *env); - int (*prepare)(struct net_device *netdev, struct bpf_verifier_env *env); + int (*prepare)(struct net_device *netdev, struct bpf_prog *prog); int (*translate)(struct net_device *netdev, struct bpf_prog *prog); void (*destroy)(struct bpf_prog *prog); }; diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h index d93e89761a8b..11f5df1092d9 100644 --- a/include/linux/bpf_verifier.h +++ b/include/linux/bpf_verifier.h @@ -245,7 +245,7 @@ static inline struct bpf_reg_state *cur_regs(struct bpf_verifier_env *env) return cur_func(env)->regs; } -int bpf_prog_offload_verifier_prep(struct bpf_verifier_env *env); +int bpf_prog_offload_verifier_prep(struct bpf_prog *prog); int bpf_prog_offload_verify_insn(struct bpf_verifier_env *env, int insn_idx, int prev_insn_idx); int bpf_prog_offload_finalize(struct bpf_verifier_env *env); diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c index d665e75a0ac3..397d206e184b 100644 --- a/kernel/bpf/offload.c +++ b/kernel/bpf/offload.c @@ -123,15 +123,15 @@ err_maybe_put: return err; } -int bpf_prog_offload_verifier_prep(struct bpf_verifier_env *env) +int bpf_prog_offload_verifier_prep(struct bpf_prog *prog) { struct bpf_prog_offload *offload; int ret = -ENODEV; down_read(&bpf_devs_lock); - offload = env->prog->aux->offload; + offload = prog->aux->offload; if (offload) - ret = offload->offdev->ops->prepare(offload->netdev, env); + ret = offload->offdev->ops->prepare(offload->netdev, prog); offload->dev_state = !ret; up_read(&bpf_devs_lock); diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 75dab40b19a3..8d0977980cfa 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -6368,7 +6368,7 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr) goto skip_full_check; if (bpf_prog_is_dev_bound(env->prog->aux)) { - ret = bpf_prog_offload_verifier_prep(env); + ret = bpf_prog_offload_verifier_prep(env->prog); if (ret) goto skip_full_check; } -- cgit v1.3-7-g2ca7 From 16a8cb5cffd0a2929ae97bc258d2d9c92a4e7f6d Mon Sep 17 00:00:00 2001 From: Quentin Monnet Date: Fri, 9 Nov 2018 13:03:32 +0000 Subject: bpf: do not pass netdev to translate() and prepare() offload callbacks The kernel functions to prepare verifier and translate for offloaded program retrieve "offload" from "prog", and "netdev" from "offload". Then both "prog" and "netdev" are passed to the callbacks. Simplify this by letting the drivers retrieve the net device themselves from the offload object attached to prog - if they need it at all. There is currently no need to pass the netdev as an argument to those functions. Signed-off-by: Quentin Monnet Reviewed-by: Jakub Kicinski Signed-off-by: Alexei Starovoitov --- drivers/net/ethernet/netronome/nfp/bpf/offload.c | 9 ++++----- drivers/net/netdevsim/bpf.c | 7 +++---- include/linux/bpf.h | 4 ++-- kernel/bpf/offload.c | 4 ++-- 4 files changed, 11 insertions(+), 13 deletions(-) (limited to 'kernel') diff --git a/drivers/net/ethernet/netronome/nfp/bpf/offload.c b/drivers/net/ethernet/netronome/nfp/bpf/offload.c index e6b26d2f651d..f0283854fade 100644 --- a/drivers/net/ethernet/netronome/nfp/bpf/offload.c +++ b/drivers/net/ethernet/netronome/nfp/bpf/offload.c @@ -182,10 +182,9 @@ static void nfp_prog_free(struct nfp_prog *nfp_prog) kfree(nfp_prog); } -static int -nfp_bpf_verifier_prep(struct net_device *netdev, struct bpf_prog *prog) +static int nfp_bpf_verifier_prep(struct bpf_prog *prog) { - struct nfp_net *nn = netdev_priv(netdev); + struct nfp_net *nn = netdev_priv(prog->aux->offload->netdev); struct nfp_app *app = nn->app; struct nfp_prog *nfp_prog; int ret; @@ -213,10 +212,10 @@ err_free: return ret; } -static int nfp_bpf_translate(struct net_device *netdev, struct bpf_prog *prog) +static int nfp_bpf_translate(struct bpf_prog *prog) { + struct nfp_net *nn = netdev_priv(prog->aux->offload->netdev); struct nfp_prog *nfp_prog = prog->aux->offload->dev_priv; - struct nfp_net *nn = netdev_priv(netdev); unsigned int max_instr; int err; diff --git a/drivers/net/netdevsim/bpf.c b/drivers/net/netdevsim/bpf.c index 560bdaf1c98b..6a5b7bd9a1f9 100644 --- a/drivers/net/netdevsim/bpf.c +++ b/drivers/net/netdevsim/bpf.c @@ -258,10 +258,9 @@ static int nsim_bpf_create_prog(struct netdevsim *ns, struct bpf_prog *prog) return 0; } -static int -nsim_bpf_verifier_prep(struct net_device *dev, struct bpf_prog *prog) +static int nsim_bpf_verifier_prep(struct bpf_prog *prog) { - struct netdevsim *ns = netdev_priv(dev); + struct netdevsim *ns = netdev_priv(prog->aux->offload->netdev); if (!ns->bpf_bind_accept) return -EOPNOTSUPP; @@ -269,7 +268,7 @@ nsim_bpf_verifier_prep(struct net_device *dev, struct bpf_prog *prog) return nsim_bpf_create_prog(ns, prog); } -static int nsim_bpf_translate(struct net_device *dev, struct bpf_prog *prog) +static int nsim_bpf_translate(struct bpf_prog *prog) { struct nsim_bpf_bound_prog *state = prog->aux->offload->dev_priv; diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 888111350d0e..987815152629 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -268,8 +268,8 @@ struct bpf_prog_offload_ops { int (*insn_hook)(struct bpf_verifier_env *env, int insn_idx, int prev_insn_idx); int (*finalize)(struct bpf_verifier_env *env); - int (*prepare)(struct net_device *netdev, struct bpf_prog *prog); - int (*translate)(struct net_device *netdev, struct bpf_prog *prog); + int (*prepare)(struct bpf_prog *prog); + int (*translate)(struct bpf_prog *prog); void (*destroy)(struct bpf_prog *prog); }; diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c index 397d206e184b..52c5617e3716 100644 --- a/kernel/bpf/offload.c +++ b/kernel/bpf/offload.c @@ -131,7 +131,7 @@ int bpf_prog_offload_verifier_prep(struct bpf_prog *prog) down_read(&bpf_devs_lock); offload = prog->aux->offload; if (offload) - ret = offload->offdev->ops->prepare(offload->netdev, prog); + ret = offload->offdev->ops->prepare(prog); offload->dev_state = !ret; up_read(&bpf_devs_lock); @@ -203,7 +203,7 @@ static int bpf_prog_offload_translate(struct bpf_prog *prog) down_read(&bpf_devs_lock); offload = prog->aux->offload; if (offload) - ret = offload->offdev->ops->translate(offload->netdev, prog); + ret = offload->offdev->ops->translate(prog); up_read(&bpf_devs_lock); return ret; -- cgit v1.3-7-g2ca7 From 46f53a65d2de3e1591636c22b626b09d8684fd71 Mon Sep 17 00:00:00 2001 From: Andrey Ignatov Date: Sat, 10 Nov 2018 22:15:13 -0800 Subject: bpf: Allow narrow loads with offset > 0 Currently BPF verifier allows narrow loads for a context field only with offset zero. E.g. if there is a __u32 field then only the following loads are permitted: * off=0, size=1 (narrow); * off=0, size=2 (narrow); * off=0, size=4 (full). On the other hand LLVM can generate a load with offset different than zero that make sense from program logic point of view, but verifier doesn't accept it. E.g. tools/testing/selftests/bpf/sendmsg4_prog.c has code: #define DST_IP4 0xC0A801FEU /* 192.168.1.254 */ ... if ((ctx->user_ip4 >> 24) == (bpf_htonl(DST_IP4) >> 24) && where ctx is struct bpf_sock_addr. Some versions of LLVM can produce the following byte code for it: 8: 71 12 07 00 00 00 00 00 r2 = *(u8 *)(r1 + 7) 9: 67 02 00 00 18 00 00 00 r2 <<= 24 10: 18 03 00 00 00 00 00 fe 00 00 00 00 00 00 00 00 r3 = 4261412864 ll 12: 5d 32 07 00 00 00 00 00 if r2 != r3 goto +7 where `*(u8 *)(r1 + 7)` means narrow load for ctx->user_ip4 with size=1 and offset=3 (7 - sizeof(ctx->user_family) = 3). This load is currently rejected by verifier. Verifier code that rejects such loads is in bpf_ctx_narrow_access_ok() what means any is_valid_access implementation, that uses the function, works this way, e.g. bpf_skb_is_valid_access() for __sk_buff or sock_addr_is_valid_access() for bpf_sock_addr. The patch makes such loads supported. Offset can be in [0; size_default) but has to be multiple of load size. E.g. for __u32 field the following loads are supported now: * off=0, size=1 (narrow); * off=1, size=1 (narrow); * off=2, size=1 (narrow); * off=3, size=1 (narrow); * off=0, size=2 (narrow); * off=2, size=2 (narrow); * off=0, size=4 (full). Reported-by: Yonghong Song Signed-off-by: Andrey Ignatov Signed-off-by: Alexei Starovoitov --- include/linux/filter.h | 16 +--------------- kernel/bpf/verifier.c | 21 ++++++++++++++++----- 2 files changed, 17 insertions(+), 20 deletions(-) (limited to 'kernel') diff --git a/include/linux/filter.h b/include/linux/filter.h index de629b706d1d..cc17f5f32fbb 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -668,24 +668,10 @@ static inline u32 bpf_ctx_off_adjust_machine(u32 size) return size; } -static inline bool bpf_ctx_narrow_align_ok(u32 off, u32 size_access, - u32 size_default) -{ - size_default = bpf_ctx_off_adjust_machine(size_default); - size_access = bpf_ctx_off_adjust_machine(size_access); - -#ifdef __LITTLE_ENDIAN - return (off & (size_default - 1)) == 0; -#else - return (off & (size_default - 1)) + size_access == size_default; -#endif -} - static inline bool bpf_ctx_narrow_access_ok(u32 off, u32 size, u32 size_default) { - return bpf_ctx_narrow_align_ok(off, size, size_default) && - size <= size_default && (size & (size - 1)) == 0; + return size <= size_default && (size & (size - 1)) == 0; } #define bpf_classic_proglen(fprog) (fprog->len * sizeof(fprog->filter[0])) diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 8d0977980cfa..b5222aa61d54 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -5718,10 +5718,10 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env) int i, cnt, size, ctx_field_size, delta = 0; const int insn_cnt = env->prog->len; struct bpf_insn insn_buf[16], *insn; + u32 target_size, size_default, off; struct bpf_prog *new_prog; enum bpf_access_type type; bool is_narrower_load; - u32 target_size; if (ops->gen_prologue || env->seen_direct_write) { if (!ops->gen_prologue) { @@ -5814,9 +5814,9 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env) * we will apply proper mask to the result. */ is_narrower_load = size < ctx_field_size; + size_default = bpf_ctx_off_adjust_machine(ctx_field_size); + off = insn->off; if (is_narrower_load) { - u32 size_default = bpf_ctx_off_adjust_machine(ctx_field_size); - u32 off = insn->off; u8 size_code; if (type == BPF_WRITE) { @@ -5844,12 +5844,23 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env) } if (is_narrower_load && size < target_size) { - if (ctx_field_size <= 4) + u8 shift = (off & (size_default - 1)) * 8; + + if (ctx_field_size <= 4) { + if (shift) + insn_buf[cnt++] = BPF_ALU32_IMM(BPF_RSH, + insn->dst_reg, + shift); insn_buf[cnt++] = BPF_ALU32_IMM(BPF_AND, insn->dst_reg, (1 << size * 8) - 1); - else + } else { + if (shift) + insn_buf[cnt++] = BPF_ALU64_IMM(BPF_RSH, + insn->dst_reg, + shift); insn_buf[cnt++] = BPF_ALU64_IMM(BPF_AND, insn->dst_reg, (1 << size * 8) - 1); + } } new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt); -- cgit v1.3-7-g2ca7 From 9cac83a57e99e4692315b7a91a81bab787961d97 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 11 Sep 2018 08:57:48 -0700 Subject: rcu: Stop expedited grace periods from relying on stop-machine The CPU-selection code in sync_rcu_exp_select_cpus() disables preemption to prevent the cpu_online_mask from changing. However, this relies on the stop-machine mechanism in the CPU-hotplug offline code, which is not desirable (it would be good to someday remove the stop-machine mechanism). This commit therefore instead uses the relevant leaf rcu_node structure's ->ffmask, which has a bit set for all CPUs that are fully functional. A given CPU's bit is cleared very early during offline processing by rcutree_offline_cpu() and set very late during online processing by rcutree_online_cpu(). Therefore, if a CPU's bit is set in this mask, and preemption is disabled, we have to be before the synchronize_sched() in the CPU-hotplug offline code, which means that the CPU is guaranteed to be workqueue-ready throughout the duration of the enclosing preempt_disable() region of code. This also has the side-effect of using WORK_CPU_UNBOUND if all the CPUs for this leaf rcu_node structure are offline, which is an acceptable difference in behavior. Reported-by: Sebastian Andrzej Siewior Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_exp.h | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h index 8d18c1014e2b..e669ccf3751b 100644 --- a/kernel/rcu/tree_exp.h +++ b/kernel/rcu/tree_exp.h @@ -450,10 +450,12 @@ static void sync_rcu_exp_select_cpus(smp_call_func_t func) } INIT_WORK(&rnp->rew.rew_work, sync_rcu_exp_select_node_cpus); preempt_disable(); - cpu = cpumask_next(rnp->grplo - 1, cpu_online_mask); + cpu = find_next_bit(&rnp->ffmask, BITS_PER_LONG, -1); /* If all offline, queue the work on an unbound CPU. */ - if (unlikely(cpu > rnp->grphi)) + if (unlikely(cpu > rnp->grphi - rnp->grplo)) cpu = WORK_CPU_UNBOUND; + else + cpu += rnp->grplo; queue_work_on(cpu, rcu_par_gp_wq, &rnp->rew.rew_work); preempt_enable(); rnp->exp_need_flush = true; -- cgit v1.3-7-g2ca7 From 92a801e5d5b7a893881c1676b15dd246727ccd16 Mon Sep 17 00:00:00 2001 From: Patrick Bellasi Date: Mon, 5 Nov 2018 14:53:59 +0000 Subject: sched/fair: Mask UTIL_AVG_UNCHANGED usages The _task_util_est() is mainly used to add/remove the task contribution to/from the rq's estimated utilization at task enqueue/dequeue time. In both cases we ensure the UTIL_AVG_UNCHANGED flag is set to keep consistency between enqueue and dequeue time while still being transparent to update_load_avg calls which will eventually reset the flag. Let's move the flag forcing within _task_util_est() itself so that we can simplify calling code by hiding that estimated utilization implementation detail into one of its internal functions. This will affect also the "public" API task_util_est() but we know that the flag will (eventually) impact just on the LSB of the estimated utilization, thus it's certainly acceptable. Signed-off-by: Patrick Bellasi Signed-off-by: Peter Zijlstra (Intel) Cc: Dietmar Eggemann Cc: Juri Lelli Cc: Linus Torvalds Cc: Morten Rasmussen Cc: Peter Zijlstra Cc: Quentin Perret Cc: Steve Muckle Cc: Suren Baghdasaryan Cc: Thomas Gleixner Cc: Todd Kjos Cc: Vincent Guittot Link: http://lkml.kernel.org/r/20181105145400.935-3-patrick.bellasi@arm.com Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index a1ccf1ddd37a..28ee60cabba1 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3604,7 +3604,7 @@ static inline unsigned long _task_util_est(struct task_struct *p) { struct util_est ue = READ_ONCE(p->se.avg.util_est); - return max(ue.ewma, ue.enqueued); + return (max(ue.ewma, ue.enqueued) | UTIL_AVG_UNCHANGED); } static inline unsigned long task_util_est(struct task_struct *p) @@ -3622,7 +3622,7 @@ static inline void util_est_enqueue(struct cfs_rq *cfs_rq, /* Update root cfs_rq's estimated utilization */ enqueued = cfs_rq->avg.util_est.enqueued; - enqueued += (_task_util_est(p) | UTIL_AVG_UNCHANGED); + enqueued += _task_util_est(p); WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued); } @@ -3650,8 +3650,7 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep) /* Update root cfs_rq's estimated utilization */ ue.enqueued = cfs_rq->avg.util_est.enqueued; - ue.enqueued -= min_t(unsigned int, ue.enqueued, - (_task_util_est(p) | UTIL_AVG_UNCHANGED)); + ue.enqueued -= min_t(unsigned int, ue.enqueued, _task_util_est(p)); WRITE_ONCE(cfs_rq->avg.util_est.enqueued, ue.enqueued); /* @@ -6292,7 +6291,7 @@ static unsigned long cpu_util_without(int cpu, struct task_struct *p) */ if (unlikely(task_on_rq_queued(p) || current == p)) { estimated -= min_t(unsigned int, estimated, - (_task_util_est(p) | UTIL_AVG_UNCHANGED)); + _task_util_est(p)); } util = max(util, estimated); } -- cgit v1.3-7-g2ca7 From b5c0ce7bd1848892e2930f481828b6d7750231ed Mon Sep 17 00:00:00 2001 From: Patrick Bellasi Date: Mon, 5 Nov 2018 14:54:00 +0000 Subject: sched/fair: Add lsub_positive() and use it consistently The following pattern: var -= min_t(typeof(var), var, val); is used multiple times in fair.c. The existing sub_positive() already captures that pattern, but it also adds an explicit load-store to properly support lockless observations. In other cases the pattern above is used to update local, and/or not concurrently accessed, variables. Let's add a simpler version of sub_positive(), targeted at local variables updates, which gives the same readability benefits at calling sites, without enforcing {READ,WRITE}_ONCE() barriers. Signed-off-by: Patrick Bellasi Signed-off-by: Peter Zijlstra (Intel) Cc: Dietmar Eggemann Cc: Juri Lelli Cc: Linus Torvalds Cc: Morten Rasmussen Cc: Peter Zijlstra Cc: Quentin Perret Cc: Steve Muckle Cc: Suren Baghdasaryan Cc: Thomas Gleixner Cc: Todd Kjos Cc: Vincent Guittot Link: https://lore.kernel.org/lkml/20181031184527.GA3178@hirez.programming.kicks-ass.net Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 24 +++++++++++++++++------- 1 file changed, 17 insertions(+), 7 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 28ee60cabba1..2cac9a469df4 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -2734,6 +2734,17 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se) WRITE_ONCE(*ptr, res); \ } while (0) +/* + * Remove and clamp on negative, from a local variable. + * + * A variant of sub_positive(), which does not use explicit load-store + * and is thus optimized for local variable updates. + */ +#define lsub_positive(_ptr, _val) do { \ + typeof(_ptr) ptr = (_ptr); \ + *ptr -= min_t(typeof(*ptr), *ptr, _val); \ +} while (0) + #ifdef CONFIG_SMP static inline void enqueue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) @@ -4639,7 +4650,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun) cfs_b->distribute_running = 0; throttled = !list_empty(&cfs_b->throttled_cfs_rq); - cfs_b->runtime -= min(runtime, cfs_b->runtime); + lsub_positive(&cfs_b->runtime, runtime); } /* @@ -4773,7 +4784,7 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b) raw_spin_lock(&cfs_b->lock); if (expires == cfs_b->runtime_expires) - cfs_b->runtime -= min(runtime, cfs_b->runtime); + lsub_positive(&cfs_b->runtime, runtime); cfs_b->distribute_running = 0; raw_spin_unlock(&cfs_b->lock); } @@ -6240,7 +6251,7 @@ static unsigned long cpu_util_without(int cpu, struct task_struct *p) util = READ_ONCE(cfs_rq->avg.util_avg); /* Discount task's util from CPU's util */ - util -= min_t(unsigned int, util, task_util(p)); + lsub_positive(&util, task_util(p)); /* * Covered cases: @@ -6289,10 +6300,9 @@ static unsigned long cpu_util_without(int cpu, struct task_struct *p) * properly fix the execl regression and it helps in further * reducing the chances for the above race. */ - if (unlikely(task_on_rq_queued(p) || current == p)) { - estimated -= min_t(unsigned int, estimated, - _task_util_est(p)); - } + if (unlikely(task_on_rq_queued(p) || current == p)) + lsub_positive(&estimated, _task_util_est(p)); + util = max(util, estimated); } -- cgit v1.3-7-g2ca7 From 1da1843f9f0334e2428308945d396ffecc2acfe1 Mon Sep 17 00:00:00 2001 From: Viresh Kumar Date: Mon, 5 Nov 2018 16:51:55 +0530 Subject: sched/core: Create task_has_idle_policy() helper We already have task_has_rt_policy() and task_has_dl_policy() helpers, create task_has_idle_policy() as well and update sched core to start using it. While at it, use task_has_dl_policy() at one more place. Signed-off-by: Viresh Kumar Signed-off-by: Peter Zijlstra (Intel) Acked-by: Daniel Lezcano Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Vincent Guittot Link: http://lkml.kernel.org/r/ce3915d5b490fc81af926a3b6bfb775e7188e005.1541416894.git.viresh.kumar@linaro.org Signed-off-by: Ingo Molnar --- kernel/sched/core.c | 4 ++-- kernel/sched/debug.c | 2 +- kernel/sched/fair.c | 10 +++++----- kernel/sched/sched.h | 5 +++++ 4 files changed, 13 insertions(+), 8 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 02a20ef196a6..5afb868f7339 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -697,7 +697,7 @@ static void set_load_weight(struct task_struct *p, bool update_load) /* * SCHED_IDLE tasks get minimal weight: */ - if (idle_policy(p->policy)) { + if (task_has_idle_policy(p)) { load->weight = scale_load(WEIGHT_IDLEPRIO); load->inv_weight = WMULT_IDLEPRIO; p->se.runnable_weight = load->weight; @@ -4199,7 +4199,7 @@ recheck: * Treat SCHED_IDLE as nice 20. Only allow a switch to * SCHED_NORMAL if the RLIMIT_NICE would normally permit it. */ - if (idle_policy(p->policy) && !idle_policy(policy)) { + if (task_has_idle_policy(p) && !idle_policy(policy)) { if (!can_nice(p, task_nice(p))) return -EPERM; } diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index 6383aa6a60ca..02bd5f969b21 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -974,7 +974,7 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns, #endif P(policy); P(prio); - if (p->policy == SCHED_DEADLINE) { + if (task_has_dl_policy(p)) { P(dl.runtime); P(dl.deadline); } diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 2cac9a469df4..d1f91e6efe51 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6529,7 +6529,7 @@ wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se) static void set_last_buddy(struct sched_entity *se) { - if (entity_is_task(se) && unlikely(task_of(se)->policy == SCHED_IDLE)) + if (entity_is_task(se) && unlikely(task_has_idle_policy(task_of(se)))) return; for_each_sched_entity(se) { @@ -6541,7 +6541,7 @@ static void set_last_buddy(struct sched_entity *se) static void set_next_buddy(struct sched_entity *se) { - if (entity_is_task(se) && unlikely(task_of(se)->policy == SCHED_IDLE)) + if (entity_is_task(se) && unlikely(task_has_idle_policy(task_of(se)))) return; for_each_sched_entity(se) { @@ -6599,8 +6599,8 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_ return; /* Idle tasks are by definition preempted by non-idle tasks. */ - if (unlikely(curr->policy == SCHED_IDLE) && - likely(p->policy != SCHED_IDLE)) + if (unlikely(task_has_idle_policy(curr)) && + likely(!task_has_idle_policy(p))) goto preempt; /* @@ -7021,7 +7021,7 @@ static int task_hot(struct task_struct *p, struct lb_env *env) if (p->sched_class != &fair_sched_class) return 0; - if (unlikely(p->policy == SCHED_IDLE)) + if (unlikely(task_has_idle_policy(p))) return 0; /* diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 618577fc9aa8..b7a3147874e3 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -176,6 +176,11 @@ static inline bool valid_policy(int policy) rt_policy(policy) || dl_policy(policy); } +static inline int task_has_idle_policy(struct task_struct *p) +{ + return idle_policy(p->policy); +} + static inline int task_has_rt_policy(struct task_struct *p) { return rt_policy(p->policy); -- cgit v1.3-7-g2ca7 From ed8885a14433aec04067463493051eaaeef3255f Mon Sep 17 00:00:00 2001 From: Muchun Song Date: Sat, 10 Nov 2018 15:52:02 +0800 Subject: sched/fair: Make some variables static The variables are local to the source and do not need to be in global scope, so make them static. Signed-off-by: Muchun Song Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/20181110075202.61172-1-smuchun@gmail.com Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index d1f91e6efe51..e30dea59d215 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -38,7 +38,7 @@ * (default: 6ms * (1 + ilog(ncpus)), units: nanoseconds) */ unsigned int sysctl_sched_latency = 6000000ULL; -unsigned int normalized_sysctl_sched_latency = 6000000ULL; +static unsigned int normalized_sysctl_sched_latency = 6000000ULL; /* * The initial- and re-scaling of tunables is configurable @@ -58,8 +58,8 @@ enum sched_tunable_scaling sysctl_sched_tunable_scaling = SCHED_TUNABLESCALING_L * * (default: 0.75 msec * (1 + ilog(ncpus)), units: nanoseconds) */ -unsigned int sysctl_sched_min_granularity = 750000ULL; -unsigned int normalized_sysctl_sched_min_granularity = 750000ULL; +unsigned int sysctl_sched_min_granularity = 750000ULL; +static unsigned int normalized_sysctl_sched_min_granularity = 750000ULL; /* * This value is kept at sysctl_sched_latency/sysctl_sched_min_granularity @@ -81,8 +81,8 @@ unsigned int sysctl_sched_child_runs_first __read_mostly; * * (default: 1 msec * (1 + ilog(ncpus)), units: nanoseconds) */ -unsigned int sysctl_sched_wakeup_granularity = 1000000UL; -unsigned int normalized_sysctl_sched_wakeup_granularity = 1000000UL; +unsigned int sysctl_sched_wakeup_granularity = 1000000UL; +static unsigned int normalized_sysctl_sched_wakeup_granularity = 1000000UL; const_debug unsigned int sysctl_sched_migration_cost = 500000UL; @@ -116,7 +116,7 @@ unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL; * * (default: ~20%) */ -unsigned int capacity_margin = 1280; +static unsigned int capacity_margin = 1280; static inline void update_load_add(struct load_weight *lw, unsigned long inc) { -- cgit v1.3-7-g2ca7 From 3e184501083c38fa091f640acb13af17a21fd228 Mon Sep 17 00:00:00 2001 From: Viresh Kumar Date: Tue, 6 Nov 2018 11:12:57 +0530 Subject: sched/core: Clean up the #ifdef block in add_nr_running() There is no point in keeping the conditional statement of the #if block outside of the #ifdef block, while all of its body is contained within the #ifdef block. Move the conditional statement under the #ifdef block as well. Signed-off-by: Viresh Kumar Cc: Daniel Lezcano Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Vincent Guittot Link: http://lkml.kernel.org/r/78cbd78a615d6f9fdcd3327f1ead68470f92593e.1541482935.git.viresh.kumar@linaro.org Signed-off-by: Ingo Molnar --- kernel/sched/sched.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index b7a3147874e3..e0e052a50fcd 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1801,12 +1801,12 @@ static inline void add_nr_running(struct rq *rq, unsigned count) rq->nr_running = prev_nr + count; - if (prev_nr < 2 && rq->nr_running >= 2) { #ifdef CONFIG_SMP + if (prev_nr < 2 && rq->nr_running >= 2) { if (!READ_ONCE(rq->rd->overload)) WRITE_ONCE(rq->rd->overload, 1); -#endif } +#endif sched_update_tick_dependency(rq); } -- cgit v1.3-7-g2ca7 From 9f16d2e6241b2fc664523f17d74adda7489f496b Mon Sep 17 00:00:00 2001 From: Jan Kara Date: Wed, 17 Oct 2018 12:14:52 +0200 Subject: audit_tree: Remove mark->lock locking Currently, audit_tree code uses mark->lock to protect against detaching of mark from an inode. In most places it however also uses mark->group->mark_mutex (as we need to atomically replace attached marks) and this provides protection against mark detaching as well. So just remove protection with mark->lock from audit tree code and replace it with mark->group->mark_mutex protection in all the places. It simplifies the code and gets rid of some ugly catches like calling fsnotify_add_mark_locked() with mark->lock held (which cannot sleep only because we hold a reference to another mark attached to the same inode). Reviewed-by: Richard Guy Briggs Signed-off-by: Jan Kara Signed-off-by: Paul Moore --- kernel/audit_tree.c | 24 ++++-------------------- 1 file changed, 4 insertions(+), 20 deletions(-) (limited to 'kernel') diff --git a/kernel/audit_tree.c b/kernel/audit_tree.c index ea43181cde4a..1b55b1026a36 100644 --- a/kernel/audit_tree.c +++ b/kernel/audit_tree.c @@ -193,7 +193,7 @@ static inline struct list_head *chunk_hash(unsigned long key) return chunk_hash_heads + n % HASH_SIZE; } -/* hash_lock & entry->lock is held by caller */ +/* hash_lock & entry->group->mark_mutex is held by caller */ static void insert_hash(struct audit_chunk *chunk) { unsigned long key = chunk_to_key(chunk); @@ -256,13 +256,11 @@ static void untag_chunk(struct node *p) new = alloc_chunk(size); mutex_lock(&entry->group->mark_mutex); - spin_lock(&entry->lock); /* * mark_mutex protects mark from getting detached and thus also from * mark->connector->obj getting NULL. */ if (chunk->dead || !(entry->flags & FSNOTIFY_MARK_FLAG_ATTACHED)) { - spin_unlock(&entry->lock); mutex_unlock(&entry->group->mark_mutex); if (new) fsnotify_put_mark(&new->mark); @@ -280,7 +278,6 @@ static void untag_chunk(struct node *p) list_del_init(&p->list); list_del_rcu(&chunk->hash); spin_unlock(&hash_lock); - spin_unlock(&entry->lock); mutex_unlock(&entry->group->mark_mutex); fsnotify_destroy_mark(entry, audit_tree_group); goto out; @@ -323,7 +320,6 @@ static void untag_chunk(struct node *p) list_for_each_entry(owner, &new->trees, same_root) owner->root = new; spin_unlock(&hash_lock); - spin_unlock(&entry->lock); mutex_unlock(&entry->group->mark_mutex); fsnotify_destroy_mark(entry, audit_tree_group); fsnotify_put_mark(&new->mark); /* drop initial reference */ @@ -340,7 +336,6 @@ Fallback: p->owner = NULL; put_tree(owner); spin_unlock(&hash_lock); - spin_unlock(&entry->lock); mutex_unlock(&entry->group->mark_mutex); out: fsnotify_put_mark(entry); @@ -360,12 +355,12 @@ static int create_chunk(struct inode *inode, struct audit_tree *tree) return -ENOSPC; } - spin_lock(&entry->lock); + mutex_lock(&entry->group->mark_mutex); spin_lock(&hash_lock); if (tree->goner) { spin_unlock(&hash_lock); chunk->dead = 1; - spin_unlock(&entry->lock); + mutex_unlock(&entry->group->mark_mutex); fsnotify_destroy_mark(entry, audit_tree_group); fsnotify_put_mark(entry); return 0; @@ -380,7 +375,7 @@ static int create_chunk(struct inode *inode, struct audit_tree *tree) } insert_hash(chunk); spin_unlock(&hash_lock); - spin_unlock(&entry->lock); + mutex_unlock(&entry->group->mark_mutex); fsnotify_put_mark(entry); /* drop initial reference */ return 0; } @@ -421,14 +416,12 @@ static int tag_chunk(struct inode *inode, struct audit_tree *tree) chunk_entry = &chunk->mark; mutex_lock(&old_entry->group->mark_mutex); - spin_lock(&old_entry->lock); /* * mark_mutex protects mark from getting detached and thus also from * mark->connector->obj getting NULL. */ if (!(old_entry->flags & FSNOTIFY_MARK_FLAG_ATTACHED)) { /* old_entry is being shot, lets just lie */ - spin_unlock(&old_entry->lock); mutex_unlock(&old_entry->group->mark_mutex); fsnotify_put_mark(old_entry); fsnotify_put_mark(&chunk->mark); @@ -437,23 +430,16 @@ static int tag_chunk(struct inode *inode, struct audit_tree *tree) if (fsnotify_add_mark_locked(chunk_entry, old_entry->connector->obj, FSNOTIFY_OBJ_TYPE_INODE, 1)) { - spin_unlock(&old_entry->lock); mutex_unlock(&old_entry->group->mark_mutex); fsnotify_put_mark(chunk_entry); fsnotify_put_mark(old_entry); return -ENOSPC; } - /* even though we hold old_entry->lock, this is safe since chunk_entry->lock could NEVER have been grabbed before */ - spin_lock(&chunk_entry->lock); spin_lock(&hash_lock); - - /* we now hold old_entry->lock, chunk_entry->lock, and hash_lock */ if (tree->goner) { spin_unlock(&hash_lock); chunk->dead = 1; - spin_unlock(&chunk_entry->lock); - spin_unlock(&old_entry->lock); mutex_unlock(&old_entry->group->mark_mutex); fsnotify_destroy_mark(chunk_entry, audit_tree_group); @@ -485,8 +471,6 @@ static int tag_chunk(struct inode *inode, struct audit_tree *tree) list_add(&tree->same_root, &chunk->trees); } spin_unlock(&hash_lock); - spin_unlock(&chunk_entry->lock); - spin_unlock(&old_entry->lock); mutex_unlock(&old_entry->group->mark_mutex); fsnotify_destroy_mark(old_entry, audit_tree_group); fsnotify_put_mark(chunk_entry); /* drop initial reference */ -- cgit v1.3-7-g2ca7 From a5789b07b35aa56569dff762bfc063303a9ccb95 Mon Sep 17 00:00:00 2001 From: Jan Kara Date: Mon, 12 Nov 2018 09:54:48 -0500 Subject: audit: Fix possible spurious -ENOSPC error When an inode is tagged with a tree, tag_chunk() checks whether there is audit_tree_group mark attached to the inode and adds one if not. However nothing protects another tag_chunk() to add the mark between we've checked and try to add the fsnotify mark thus resulting in an error from fsnotify_add_mark() and consequently an ENOSPC error from tag_chunk(). Fix the problem by holding mark_mutex over the whole check-insert code sequence. Reviewed-by: Richard Guy Briggs Signed-off-by: Jan Kara Signed-off-by: Paul Moore --- kernel/audit_tree.c | 26 ++++++++++++++++---------- 1 file changed, 16 insertions(+), 10 deletions(-) (limited to 'kernel') diff --git a/kernel/audit_tree.c b/kernel/audit_tree.c index 1b55b1026a36..8a74b468b666 100644 --- a/kernel/audit_tree.c +++ b/kernel/audit_tree.c @@ -342,25 +342,29 @@ out: spin_lock(&hash_lock); } +/* Call with group->mark_mutex held, releases it */ static int create_chunk(struct inode *inode, struct audit_tree *tree) { struct fsnotify_mark *entry; struct audit_chunk *chunk = alloc_chunk(1); - if (!chunk) + + if (!chunk) { + mutex_unlock(&audit_tree_group->mark_mutex); return -ENOMEM; + } entry = &chunk->mark; - if (fsnotify_add_inode_mark(entry, inode, 0)) { + if (fsnotify_add_inode_mark_locked(entry, inode, 0)) { + mutex_unlock(&audit_tree_group->mark_mutex); fsnotify_put_mark(entry); return -ENOSPC; } - mutex_lock(&entry->group->mark_mutex); spin_lock(&hash_lock); if (tree->goner) { spin_unlock(&hash_lock); chunk->dead = 1; - mutex_unlock(&entry->group->mark_mutex); + mutex_unlock(&audit_tree_group->mark_mutex); fsnotify_destroy_mark(entry, audit_tree_group); fsnotify_put_mark(entry); return 0; @@ -375,7 +379,7 @@ static int create_chunk(struct inode *inode, struct audit_tree *tree) } insert_hash(chunk); spin_unlock(&hash_lock); - mutex_unlock(&entry->group->mark_mutex); + mutex_unlock(&audit_tree_group->mark_mutex); fsnotify_put_mark(entry); /* drop initial reference */ return 0; } @@ -389,6 +393,7 @@ static int tag_chunk(struct inode *inode, struct audit_tree *tree) struct node *p; int n; + mutex_lock(&audit_tree_group->mark_mutex); old_entry = fsnotify_find_mark(&inode->i_fsnotify_marks, audit_tree_group); if (!old_entry) @@ -401,6 +406,7 @@ static int tag_chunk(struct inode *inode, struct audit_tree *tree) for (n = 0; n < old->count; n++) { if (old->owners[n].owner == tree) { spin_unlock(&hash_lock); + mutex_unlock(&audit_tree_group->mark_mutex); fsnotify_put_mark(old_entry); return 0; } @@ -409,20 +415,20 @@ static int tag_chunk(struct inode *inode, struct audit_tree *tree) chunk = alloc_chunk(old->count + 1); if (!chunk) { + mutex_unlock(&audit_tree_group->mark_mutex); fsnotify_put_mark(old_entry); return -ENOMEM; } chunk_entry = &chunk->mark; - mutex_lock(&old_entry->group->mark_mutex); /* * mark_mutex protects mark from getting detached and thus also from * mark->connector->obj getting NULL. */ if (!(old_entry->flags & FSNOTIFY_MARK_FLAG_ATTACHED)) { /* old_entry is being shot, lets just lie */ - mutex_unlock(&old_entry->group->mark_mutex); + mutex_unlock(&audit_tree_group->mark_mutex); fsnotify_put_mark(old_entry); fsnotify_put_mark(&chunk->mark); return -ENOENT; @@ -430,7 +436,7 @@ static int tag_chunk(struct inode *inode, struct audit_tree *tree) if (fsnotify_add_mark_locked(chunk_entry, old_entry->connector->obj, FSNOTIFY_OBJ_TYPE_INODE, 1)) { - mutex_unlock(&old_entry->group->mark_mutex); + mutex_unlock(&audit_tree_group->mark_mutex); fsnotify_put_mark(chunk_entry); fsnotify_put_mark(old_entry); return -ENOSPC; @@ -440,7 +446,7 @@ static int tag_chunk(struct inode *inode, struct audit_tree *tree) if (tree->goner) { spin_unlock(&hash_lock); chunk->dead = 1; - mutex_unlock(&old_entry->group->mark_mutex); + mutex_unlock(&audit_tree_group->mark_mutex); fsnotify_destroy_mark(chunk_entry, audit_tree_group); @@ -471,7 +477,7 @@ static int tag_chunk(struct inode *inode, struct audit_tree *tree) list_add(&tree->same_root, &chunk->trees); } spin_unlock(&hash_lock); - mutex_unlock(&old_entry->group->mark_mutex); + mutex_unlock(&audit_tree_group->mark_mutex); fsnotify_destroy_mark(old_entry, audit_tree_group); fsnotify_put_mark(chunk_entry); /* drop initial reference */ fsnotify_put_mark(old_entry); /* pair to fsnotify_find mark_entry */ -- cgit v1.3-7-g2ca7 From b1e4603b92d8aef8776e5673dc13fedb68d32ea4 Mon Sep 17 00:00:00 2001 From: Jan Kara Date: Mon, 12 Nov 2018 09:54:48 -0500 Subject: audit: Fix possible tagging failures Audit tree code is replacing marks attached to inodes in non-atomic way. Thus fsnotify_find_mark() in tag_chunk() may find a mark that belongs to a chunk that is no longer valid one and will soon be destroyed. Tags added to such chunk will be simply lost. Fix the problem by making sure old mark is marked as going away (through fsnotify_detach_mark()) before dropping mark_mutex and thus in an atomic way wrt tag_chunk(). Note that this does not fix the problem completely as if tag_chunk() finds a mark that is going away, it fails with -ENOENT. But at least the failure is not silent and currently there's no way to search for another fsnotify mark attached to the inode. We'll fix this problem in later patch. Reviewed-by: Richard Guy Briggs Signed-off-by: Jan Kara Signed-off-by: Paul Moore --- kernel/audit_tree.c | 17 ++++++++++------- 1 file changed, 10 insertions(+), 7 deletions(-) (limited to 'kernel') diff --git a/kernel/audit_tree.c b/kernel/audit_tree.c index 8a74b468b666..c194dbd53753 100644 --- a/kernel/audit_tree.c +++ b/kernel/audit_tree.c @@ -278,8 +278,9 @@ static void untag_chunk(struct node *p) list_del_init(&p->list); list_del_rcu(&chunk->hash); spin_unlock(&hash_lock); + fsnotify_detach_mark(entry); mutex_unlock(&entry->group->mark_mutex); - fsnotify_destroy_mark(entry, audit_tree_group); + fsnotify_free_mark(entry); goto out; } @@ -320,8 +321,9 @@ static void untag_chunk(struct node *p) list_for_each_entry(owner, &new->trees, same_root) owner->root = new; spin_unlock(&hash_lock); + fsnotify_detach_mark(entry); mutex_unlock(&entry->group->mark_mutex); - fsnotify_destroy_mark(entry, audit_tree_group); + fsnotify_free_mark(entry); fsnotify_put_mark(&new->mark); /* drop initial reference */ goto out; @@ -364,8 +366,9 @@ static int create_chunk(struct inode *inode, struct audit_tree *tree) if (tree->goner) { spin_unlock(&hash_lock); chunk->dead = 1; + fsnotify_detach_mark(entry); mutex_unlock(&audit_tree_group->mark_mutex); - fsnotify_destroy_mark(entry, audit_tree_group); + fsnotify_free_mark(entry); fsnotify_put_mark(entry); return 0; } @@ -446,10 +449,9 @@ static int tag_chunk(struct inode *inode, struct audit_tree *tree) if (tree->goner) { spin_unlock(&hash_lock); chunk->dead = 1; + fsnotify_detach_mark(chunk_entry); mutex_unlock(&audit_tree_group->mark_mutex); - - fsnotify_destroy_mark(chunk_entry, audit_tree_group); - + fsnotify_free_mark(chunk_entry); fsnotify_put_mark(chunk_entry); fsnotify_put_mark(old_entry); return 0; @@ -477,8 +479,9 @@ static int tag_chunk(struct inode *inode, struct audit_tree *tree) list_add(&tree->same_root, &chunk->trees); } spin_unlock(&hash_lock); + fsnotify_detach_mark(old_entry); mutex_unlock(&audit_tree_group->mark_mutex); - fsnotify_destroy_mark(old_entry, audit_tree_group); + fsnotify_free_mark(old_entry); fsnotify_put_mark(chunk_entry); /* drop initial reference */ fsnotify_put_mark(old_entry); /* pair to fsnotify_find mark_entry */ return 0; -- cgit v1.3-7-g2ca7 From 8d20d6e9301d7b3777d66d47dd5b89acd645cd39 Mon Sep 17 00:00:00 2001 From: Jan Kara Date: Mon, 12 Nov 2018 09:54:48 -0500 Subject: audit: Embed key into chunk Currently chunk hash key (which is in fact pointer to the inode) is derived as chunk->mark.conn->obj. It is tricky to make this dereference reliable for hash table lookups only under RCU as mark can get detached from the connector and connector gets freed independently of the running lookup. Thus there is a possible use after free / NULL ptr dereference issue: CPU1 CPU2 untag_chunk() ... audit_tree_lookup() list_for_each_entry_rcu(p, list, hash) { list_del_rcu(&chunk->hash); fsnotify_destroy_mark(entry); fsnotify_put_mark(entry) chunk_to_key(p) if (!chunk->mark.connector) ... hlist_del_init_rcu(&mark->obj_list); if (hlist_empty(&conn->list)) { inode = fsnotify_detach_connector_from_object(conn); mark->connector = NULL; ... frees connector from workqueue chunk->mark.connector->obj This race is probably impossible to hit in practice as the race window on CPU1 is very narrow and CPU2 has a lot of code to execute. Still it's better to have this fixed. Since the inode the chunk is attached to is constant during chunk's lifetime it is easy to cache the key in the chunk itself and thus avoid these issues. Reviewed-by: Richard Guy Briggs Signed-off-by: Jan Kara Signed-off-by: Paul Moore --- kernel/audit_tree.c | 27 ++++++++------------------- 1 file changed, 8 insertions(+), 19 deletions(-) (limited to 'kernel') diff --git a/kernel/audit_tree.c b/kernel/audit_tree.c index c194dbd53753..bac5dd90c629 100644 --- a/kernel/audit_tree.c +++ b/kernel/audit_tree.c @@ -24,6 +24,7 @@ struct audit_tree { struct audit_chunk { struct list_head hash; + unsigned long key; struct fsnotify_mark mark; struct list_head trees; /* with root here */ int dead; @@ -172,21 +173,6 @@ static unsigned long inode_to_key(const struct inode *inode) return (unsigned long)&inode->i_fsnotify_marks; } -/* - * Function to return search key in our hash from chunk. Key 0 is special and - * should never be present in the hash. - */ -static unsigned long chunk_to_key(struct audit_chunk *chunk) -{ - /* - * We have a reference to the mark so it should be attached to a - * connector. - */ - if (WARN_ON_ONCE(!chunk->mark.connector)) - return 0; - return (unsigned long)chunk->mark.connector->obj; -} - static inline struct list_head *chunk_hash(unsigned long key) { unsigned long n = key / L1_CACHE_BYTES; @@ -196,12 +182,12 @@ static inline struct list_head *chunk_hash(unsigned long key) /* hash_lock & entry->group->mark_mutex is held by caller */ static void insert_hash(struct audit_chunk *chunk) { - unsigned long key = chunk_to_key(chunk); struct list_head *list; if (!(chunk->mark.flags & FSNOTIFY_MARK_FLAG_ATTACHED)) return; - list = chunk_hash(key); + WARN_ON_ONCE(!chunk->key); + list = chunk_hash(chunk->key); list_add_rcu(&chunk->hash, list); } @@ -213,7 +199,7 @@ struct audit_chunk *audit_tree_lookup(const struct inode *inode) struct audit_chunk *p; list_for_each_entry_rcu(p, list, hash) { - if (chunk_to_key(p) == key) { + if (p->key == key) { atomic_long_inc(&p->refs); return p; } @@ -295,6 +281,7 @@ static void untag_chunk(struct node *p) chunk->dead = 1; spin_lock(&hash_lock); + new->key = chunk->key; list_replace_init(&chunk->trees, &new->trees); if (owner->root == chunk) { list_del_init(&owner->same_root); @@ -380,6 +367,7 @@ static int create_chunk(struct inode *inode, struct audit_tree *tree) tree->root = chunk; list_add(&tree->same_root, &chunk->trees); } + chunk->key = inode_to_key(inode); insert_hash(chunk); spin_unlock(&hash_lock); mutex_unlock(&audit_tree_group->mark_mutex); @@ -456,6 +444,7 @@ static int tag_chunk(struct inode *inode, struct audit_tree *tree) fsnotify_put_mark(old_entry); return 0; } + chunk->key = old->key; list_replace_init(&old->trees, &chunk->trees); for (n = 0, p = chunk->owners; n < old->count; n++, p++) { struct audit_tree *s = old->owners[n].owner; @@ -654,7 +643,7 @@ void audit_trim_trees(void) /* this could be NULL if the watch is dying else where... */ node->index |= 1U<<31; if (iterate_mounts(compare_root, - (void *)chunk_to_key(chunk), + (void *)(chunk->key), root_mnt)) node->index &= ~(1U<<31); } -- cgit v1.3-7-g2ca7 From 1635e5722350597b6a149bdb131358fcd7e34906 Mon Sep 17 00:00:00 2001 From: Jan Kara Date: Mon, 12 Nov 2018 09:54:48 -0500 Subject: audit: Make hash table insertion safe against concurrent lookups Currently, the audit tree code does not make sure that when a chunk is inserted into the hash table, it is fully initialized. So in theory a user of RCU lookup could see uninitialized structure in the hash table and crash. Add appropriate barriers between initialization of the structure and its insertion into hash table. Reviewed-by: Richard Guy Briggs Signed-off-by: Jan Kara Signed-off-by: Paul Moore --- kernel/audit_tree.c | 32 +++++++++++++++++++++++++++++--- 1 file changed, 29 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/audit_tree.c b/kernel/audit_tree.c index bac5dd90c629..307749d6773c 100644 --- a/kernel/audit_tree.c +++ b/kernel/audit_tree.c @@ -186,6 +186,12 @@ static void insert_hash(struct audit_chunk *chunk) if (!(chunk->mark.flags & FSNOTIFY_MARK_FLAG_ATTACHED)) return; + /* + * Make sure chunk is fully initialized before making it visible in the + * hash. Pairs with a data dependency barrier in READ_ONCE() in + * audit_tree_lookup(). + */ + smp_wmb(); WARN_ON_ONCE(!chunk->key); list = chunk_hash(chunk->key); list_add_rcu(&chunk->hash, list); @@ -199,7 +205,11 @@ struct audit_chunk *audit_tree_lookup(const struct inode *inode) struct audit_chunk *p; list_for_each_entry_rcu(p, list, hash) { - if (p->key == key) { + /* + * We use a data dependency barrier in READ_ONCE() to make sure + * the chunk we see is fully initialized. + */ + if (READ_ONCE(p->key) == key) { atomic_long_inc(&p->refs); return p; } @@ -304,9 +314,15 @@ static void untag_chunk(struct node *p) list_replace_init(&chunk->owners[j].list, &new->owners[i].list); } - list_replace_rcu(&chunk->hash, &new->hash); list_for_each_entry(owner, &new->trees, same_root) owner->root = new; + /* + * Make sure chunk is fully initialized before making it visible in the + * hash. Pairs with a data dependency barrier in READ_ONCE() in + * audit_tree_lookup(). + */ + smp_wmb(); + list_replace_rcu(&chunk->hash, &new->hash); spin_unlock(&hash_lock); fsnotify_detach_mark(entry); mutex_unlock(&entry->group->mark_mutex); @@ -368,6 +384,10 @@ static int create_chunk(struct inode *inode, struct audit_tree *tree) list_add(&tree->same_root, &chunk->trees); } chunk->key = inode_to_key(inode); + /* + * Inserting into the hash table has to go last as once we do that RCU + * readers can see the chunk. + */ insert_hash(chunk); spin_unlock(&hash_lock); mutex_unlock(&audit_tree_group->mark_mutex); @@ -459,7 +479,6 @@ static int tag_chunk(struct inode *inode, struct audit_tree *tree) p->owner = tree; get_tree(tree); list_add(&p->list, &tree->chunks); - list_replace_rcu(&old->hash, &chunk->hash); list_for_each_entry(owner, &chunk->trees, same_root) owner->root = chunk; old->dead = 1; @@ -467,6 +486,13 @@ static int tag_chunk(struct inode *inode, struct audit_tree *tree) tree->root = chunk; list_add(&tree->same_root, &chunk->trees); } + /* + * Make sure chunk is fully initialized before making it visible in the + * hash. Pairs with a data dependency barrier in READ_ONCE() in + * audit_tree_lookup(). + */ + smp_wmb(); + list_replace_rcu(&old->hash, &chunk->hash); spin_unlock(&hash_lock); fsnotify_detach_mark(old_entry); mutex_unlock(&audit_tree_group->mark_mutex); -- cgit v1.3-7-g2ca7 From d31b326d3ce7b1ff2ec36470dfcccb14a6c3e04e Mon Sep 17 00:00:00 2001 From: Jan Kara Date: Mon, 12 Nov 2018 09:54:48 -0500 Subject: audit: Factor out chunk replacement code Chunk replacement code is very similar for the cases where we grow or shrink chunk. Factor the code out into a common helper function. Signed-off-by: Jan Kara Reviewed-by: Richard Guy Briggs Signed-off-by: Paul Moore --- kernel/audit_tree.c | 86 +++++++++++++++++++++++++---------------------------- 1 file changed, 40 insertions(+), 46 deletions(-) (limited to 'kernel') diff --git a/kernel/audit_tree.c b/kernel/audit_tree.c index 307749d6773c..d8f6cfa0005b 100644 --- a/kernel/audit_tree.c +++ b/kernel/audit_tree.c @@ -235,6 +235,38 @@ static struct audit_chunk *find_chunk(struct node *p) return container_of(p, struct audit_chunk, owners[0]); } +static void replace_chunk(struct audit_chunk *new, struct audit_chunk *old, + struct node *skip) +{ + struct audit_tree *owner; + int i, j; + + new->key = old->key; + list_splice_init(&old->trees, &new->trees); + list_for_each_entry(owner, &new->trees, same_root) + owner->root = new; + for (i = j = 0; j < old->count; i++, j++) { + if (&old->owners[j] == skip) { + i--; + continue; + } + owner = old->owners[j].owner; + new->owners[i].owner = owner; + new->owners[i].index = old->owners[j].index - j + i; + if (!owner) /* result of earlier fallback */ + continue; + get_tree(owner); + list_replace_init(&old->owners[j].list, &new->owners[i].list); + } + /* + * Make sure chunk is fully initialized before making it visible in the + * hash. Pairs with a data dependency barrier in READ_ONCE() in + * audit_tree_lookup(). + */ + smp_wmb(); + list_replace_rcu(&old->hash, &new->hash); +} + static void untag_chunk(struct node *p) { struct audit_chunk *chunk = find_chunk(p); @@ -242,7 +274,6 @@ static void untag_chunk(struct node *p) struct audit_chunk *new = NULL; struct audit_tree *owner; int size = chunk->count - 1; - int i, j; fsnotify_get_mark(entry); @@ -291,38 +322,16 @@ static void untag_chunk(struct node *p) chunk->dead = 1; spin_lock(&hash_lock); - new->key = chunk->key; - list_replace_init(&chunk->trees, &new->trees); if (owner->root == chunk) { list_del_init(&owner->same_root); owner->root = NULL; } - - for (i = j = 0; j <= size; i++, j++) { - struct audit_tree *s; - if (&chunk->owners[j] == p) { - list_del_init(&p->list); - i--; - continue; - } - s = chunk->owners[j].owner; - new->owners[i].owner = s; - new->owners[i].index = chunk->owners[j].index - j + i; - if (!s) /* result of earlier fallback */ - continue; - get_tree(s); - list_replace_init(&chunk->owners[j].list, &new->owners[i].list); - } - - list_for_each_entry(owner, &new->trees, same_root) - owner->root = new; + list_del_init(&p->list); /* - * Make sure chunk is fully initialized before making it visible in the - * hash. Pairs with a data dependency barrier in READ_ONCE() in - * audit_tree_lookup(). + * This has to go last when updating chunk as once replace_chunk() is + * called, new RCU readers can see the new chunk. */ - smp_wmb(); - list_replace_rcu(&chunk->hash, &new->hash); + replace_chunk(new, chunk, p); spin_unlock(&hash_lock); fsnotify_detach_mark(entry); mutex_unlock(&entry->group->mark_mutex); @@ -399,7 +408,6 @@ static int create_chunk(struct inode *inode, struct audit_tree *tree) static int tag_chunk(struct inode *inode, struct audit_tree *tree) { struct fsnotify_mark *old_entry, *chunk_entry; - struct audit_tree *owner; struct audit_chunk *chunk, *old; struct node *p; int n; @@ -464,35 +472,21 @@ static int tag_chunk(struct inode *inode, struct audit_tree *tree) fsnotify_put_mark(old_entry); return 0; } - chunk->key = old->key; - list_replace_init(&old->trees, &chunk->trees); - for (n = 0, p = chunk->owners; n < old->count; n++, p++) { - struct audit_tree *s = old->owners[n].owner; - p->owner = s; - p->index = old->owners[n].index; - if (!s) /* result of fallback in untag */ - continue; - get_tree(s); - list_replace_init(&old->owners[n].list, &p->list); - } + p = &chunk->owners[chunk->count - 1]; p->index = (chunk->count - 1) | (1U<<31); p->owner = tree; get_tree(tree); list_add(&p->list, &tree->chunks); - list_for_each_entry(owner, &chunk->trees, same_root) - owner->root = chunk; old->dead = 1; if (!tree->root) { tree->root = chunk; list_add(&tree->same_root, &chunk->trees); } /* - * Make sure chunk is fully initialized before making it visible in the - * hash. Pairs with a data dependency barrier in READ_ONCE() in - * audit_tree_lookup(). + * This has to go last when updating chunk as once replace_chunk() is + * called, new RCU readers can see the new chunk. */ - smp_wmb(); - list_replace_rcu(&old->hash, &chunk->hash); + replace_chunk(chunk, old, NULL); spin_unlock(&hash_lock); fsnotify_detach_mark(old_entry); mutex_unlock(&audit_tree_group->mark_mutex); -- cgit v1.3-7-g2ca7 From 8cd0feb5234ccda3c15de35b40c8010a406dfc03 Mon Sep 17 00:00:00 2001 From: Jan Kara Date: Mon, 12 Nov 2018 09:54:48 -0500 Subject: audit: Remove pointless check in insert_hash() The audit_tree_group->mark_mutex is held all the time while we create the fsnotify mark, add it to the inode, and insert chunk into the hash. Hence mark cannot get detached during this time and so the check whether the mark is attached in insert_hash() is pointless. Reviewed-by: Richard Guy Briggs Signed-off-by: Jan Kara Signed-off-by: Paul Moore --- kernel/audit_tree.c | 2 -- 1 file changed, 2 deletions(-) (limited to 'kernel') diff --git a/kernel/audit_tree.c b/kernel/audit_tree.c index d8f6cfa0005b..d150514ff15e 100644 --- a/kernel/audit_tree.c +++ b/kernel/audit_tree.c @@ -184,8 +184,6 @@ static void insert_hash(struct audit_chunk *chunk) { struct list_head *list; - if (!(chunk->mark.flags & FSNOTIFY_MARK_FLAG_ATTACHED)) - return; /* * Make sure chunk is fully initialized before making it visible in the * hash. Pairs with a data dependency barrier in READ_ONCE() in -- cgit v1.3-7-g2ca7 From a8375713fb1ff28ec718b601895958f1db775774 Mon Sep 17 00:00:00 2001 From: Jan Kara Date: Mon, 12 Nov 2018 09:54:49 -0500 Subject: audit: Provide helper for dropping mark's chunk reference Provide a helper function audit_mark_put_chunk() for dropping mark's reference (which has to happen only after RCU grace period expires). Currently that happens only from a single place but in later patches we introduce more callers. Reviewed-by: Richard Guy Briggs Signed-off-by: Jan Kara Signed-off-by: Paul Moore --- kernel/audit_tree.c | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/audit_tree.c b/kernel/audit_tree.c index d150514ff15e..35c031ebcc12 100644 --- a/kernel/audit_tree.c +++ b/kernel/audit_tree.c @@ -132,10 +132,20 @@ static void __put_chunk(struct rcu_head *rcu) audit_put_chunk(chunk); } +/* + * Drop reference to the chunk that was held by the mark. This is the reference + * that gets dropped after we've removed the chunk from the hash table and we + * use it to make sure chunk cannot be freed before RCU grace period expires. + */ +static void audit_mark_put_chunk(struct audit_chunk *chunk) +{ + call_rcu(&chunk->head, __put_chunk); +} + static void audit_tree_destroy_watch(struct fsnotify_mark *entry) { struct audit_chunk *chunk = container_of(entry, struct audit_chunk, mark); - call_rcu(&chunk->head, __put_chunk); + audit_mark_put_chunk(chunk); } static struct audit_chunk *alloc_chunk(int count) -- cgit v1.3-7-g2ca7 From 5f5161300d7bd530e062428ac694824832960cf5 Mon Sep 17 00:00:00 2001 From: Jan Kara Date: Mon, 12 Nov 2018 09:54:49 -0500 Subject: audit: Allocate fsnotify mark independently of chunk Allocate fsnotify mark independently instead of embedding it inside chunk. This will allow us to just replace chunk attached to mark when growing / shrinking chunk instead of replacing mark attached to inode which is a more complex operation. Reviewed-by: Richard Guy Briggs Signed-off-by: Jan Kara Signed-off-by: Paul Moore --- kernel/audit_tree.c | 64 +++++++++++++++++++++++++++++++++++++++++------------ 1 file changed, 50 insertions(+), 14 deletions(-) (limited to 'kernel') diff --git a/kernel/audit_tree.c b/kernel/audit_tree.c index 35c031ebcc12..c98ab2d68a1c 100644 --- a/kernel/audit_tree.c +++ b/kernel/audit_tree.c @@ -25,7 +25,7 @@ struct audit_tree { struct audit_chunk { struct list_head hash; unsigned long key; - struct fsnotify_mark mark; + struct fsnotify_mark *mark; struct list_head trees; /* with root here */ int dead; int count; @@ -38,6 +38,11 @@ struct audit_chunk { } owners[]; }; +struct audit_tree_mark { + struct fsnotify_mark mark; + struct audit_chunk *chunk; +}; + static LIST_HEAD(tree_list); static LIST_HEAD(prune_list); static struct task_struct *prune_thread; @@ -73,6 +78,7 @@ static struct task_struct *prune_thread; */ static struct fsnotify_group *audit_tree_group; +static struct kmem_cache *audit_tree_mark_cachep __read_mostly; static struct audit_tree *alloc_tree(const char *s) { @@ -142,10 +148,33 @@ static void audit_mark_put_chunk(struct audit_chunk *chunk) call_rcu(&chunk->head, __put_chunk); } +static inline struct audit_tree_mark *audit_mark(struct fsnotify_mark *entry) +{ + return container_of(entry, struct audit_tree_mark, mark); +} + +static struct audit_chunk *mark_chunk(struct fsnotify_mark *mark) +{ + return audit_mark(mark)->chunk; +} + static void audit_tree_destroy_watch(struct fsnotify_mark *entry) { - struct audit_chunk *chunk = container_of(entry, struct audit_chunk, mark); + struct audit_chunk *chunk = mark_chunk(entry); audit_mark_put_chunk(chunk); + kmem_cache_free(audit_tree_mark_cachep, audit_mark(entry)); +} + +static struct fsnotify_mark *alloc_mark(void) +{ + struct audit_tree_mark *amark; + + amark = kmem_cache_zalloc(audit_tree_mark_cachep, GFP_KERNEL); + if (!amark) + return NULL; + fsnotify_init_mark(&amark->mark, audit_tree_group); + amark->mark.mask = FS_IN_IGNORED; + return &amark->mark; } static struct audit_chunk *alloc_chunk(int count) @@ -159,6 +188,13 @@ static struct audit_chunk *alloc_chunk(int count) if (!chunk) return NULL; + chunk->mark = alloc_mark(); + if (!chunk->mark) { + kfree(chunk); + return NULL; + } + audit_mark(chunk->mark)->chunk = chunk; + INIT_LIST_HEAD(&chunk->hash); INIT_LIST_HEAD(&chunk->trees); chunk->count = count; @@ -167,8 +203,6 @@ static struct audit_chunk *alloc_chunk(int count) INIT_LIST_HEAD(&chunk->owners[i].list); chunk->owners[i].index = i; } - fsnotify_init_mark(&chunk->mark, audit_tree_group); - chunk->mark.mask = FS_IN_IGNORED; return chunk; } @@ -278,7 +312,7 @@ static void replace_chunk(struct audit_chunk *new, struct audit_chunk *old, static void untag_chunk(struct node *p) { struct audit_chunk *chunk = find_chunk(p); - struct fsnotify_mark *entry = &chunk->mark; + struct fsnotify_mark *entry = chunk->mark; struct audit_chunk *new = NULL; struct audit_tree *owner; int size = chunk->count - 1; @@ -298,7 +332,7 @@ static void untag_chunk(struct node *p) if (chunk->dead || !(entry->flags & FSNOTIFY_MARK_FLAG_ATTACHED)) { mutex_unlock(&entry->group->mark_mutex); if (new) - fsnotify_put_mark(&new->mark); + fsnotify_put_mark(new->mark); goto out; } @@ -322,9 +356,9 @@ static void untag_chunk(struct node *p) if (!new) goto Fallback; - if (fsnotify_add_mark_locked(&new->mark, entry->connector->obj, + if (fsnotify_add_mark_locked(new->mark, entry->connector->obj, FSNOTIFY_OBJ_TYPE_INODE, 1)) { - fsnotify_put_mark(&new->mark); + fsnotify_put_mark(new->mark); goto Fallback; } @@ -344,7 +378,7 @@ static void untag_chunk(struct node *p) fsnotify_detach_mark(entry); mutex_unlock(&entry->group->mark_mutex); fsnotify_free_mark(entry); - fsnotify_put_mark(&new->mark); /* drop initial reference */ + fsnotify_put_mark(new->mark); /* drop initial reference */ goto out; Fallback: @@ -375,7 +409,7 @@ static int create_chunk(struct inode *inode, struct audit_tree *tree) return -ENOMEM; } - entry = &chunk->mark; + entry = chunk->mark; if (fsnotify_add_inode_mark_locked(entry, inode, 0)) { mutex_unlock(&audit_tree_group->mark_mutex); fsnotify_put_mark(entry); @@ -426,7 +460,7 @@ static int tag_chunk(struct inode *inode, struct audit_tree *tree) if (!old_entry) return create_chunk(inode, tree); - old = container_of(old_entry, struct audit_chunk, mark); + old = mark_chunk(old_entry); /* are we already there? */ spin_lock(&hash_lock); @@ -447,7 +481,7 @@ static int tag_chunk(struct inode *inode, struct audit_tree *tree) return -ENOMEM; } - chunk_entry = &chunk->mark; + chunk_entry = chunk->mark; /* * mark_mutex protects mark from getting detached and thus also from @@ -457,7 +491,7 @@ static int tag_chunk(struct inode *inode, struct audit_tree *tree) /* old_entry is being shot, lets just lie */ mutex_unlock(&audit_tree_group->mark_mutex); fsnotify_put_mark(old_entry); - fsnotify_put_mark(&chunk->mark); + fsnotify_put_mark(chunk->mark); return -ENOENT; } @@ -1011,7 +1045,7 @@ static int audit_tree_handle_event(struct fsnotify_group *group, static void audit_tree_freeing_mark(struct fsnotify_mark *entry, struct fsnotify_group *group) { - struct audit_chunk *chunk = container_of(entry, struct audit_chunk, mark); + struct audit_chunk *chunk = mark_chunk(entry); evict_chunk(chunk); @@ -1032,6 +1066,8 @@ static int __init audit_tree_init(void) { int i; + audit_tree_mark_cachep = KMEM_CACHE(audit_tree_mark, SLAB_PANIC); + audit_tree_group = fsnotify_alloc_group(&audit_tree_ops); if (IS_ERR(audit_tree_group)) audit_panic("cannot initialize fsnotify group for rectree watches"); -- cgit v1.3-7-g2ca7 From 49a4ee7d98dbe34cfed90b930664c8a9fa73b24c Mon Sep 17 00:00:00 2001 From: Jan Kara Date: Mon, 12 Nov 2018 09:54:49 -0500 Subject: audit: Guarantee forward progress of chunk untagging When removing chunk from a tree, we do shrink the chunk. This can fail for various reasons (due to races, ENOMEM, etc.) and in some cases we just bail from untag_chunk() relying on someone else to cleanup. Although this currently works, later we will need to add new failure situation which would break. Also this simplifies the code and will allow us to make locking around untag_chunk() less awkward. Signed-off-by: Jan Kara Reviewed-by: Richard Guy Briggs Signed-off-by: Paul Moore --- kernel/audit_tree.c | 42 +++++++++++++++++------------------------- 1 file changed, 17 insertions(+), 25 deletions(-) (limited to 'kernel') diff --git a/kernel/audit_tree.c b/kernel/audit_tree.c index c98ab2d68a1c..ca2b6baff7aa 100644 --- a/kernel/audit_tree.c +++ b/kernel/audit_tree.c @@ -309,16 +309,28 @@ static void replace_chunk(struct audit_chunk *new, struct audit_chunk *old, list_replace_rcu(&old->hash, &new->hash); } +static void remove_chunk_node(struct audit_chunk *chunk, struct node *p) +{ + struct audit_tree *owner = p->owner; + + if (owner->root == chunk) { + list_del_init(&owner->same_root); + owner->root = NULL; + } + list_del_init(&p->list); + p->owner = NULL; + put_tree(owner); +} + static void untag_chunk(struct node *p) { struct audit_chunk *chunk = find_chunk(p); struct fsnotify_mark *entry = chunk->mark; struct audit_chunk *new = NULL; - struct audit_tree *owner; int size = chunk->count - 1; + remove_chunk_node(chunk, p); fsnotify_get_mark(entry); - spin_unlock(&hash_lock); if (size) @@ -336,15 +348,10 @@ static void untag_chunk(struct node *p) goto out; } - owner = p->owner; - if (!size) { chunk->dead = 1; spin_lock(&hash_lock); list_del_init(&chunk->trees); - if (owner->root == chunk) - owner->root = NULL; - list_del_init(&p->list); list_del_rcu(&chunk->hash); spin_unlock(&hash_lock); fsnotify_detach_mark(entry); @@ -354,21 +361,16 @@ static void untag_chunk(struct node *p) } if (!new) - goto Fallback; + goto out_mutex; if (fsnotify_add_mark_locked(new->mark, entry->connector->obj, FSNOTIFY_OBJ_TYPE_INODE, 1)) { fsnotify_put_mark(new->mark); - goto Fallback; + goto out_mutex; } chunk->dead = 1; spin_lock(&hash_lock); - if (owner->root == chunk) { - list_del_init(&owner->same_root); - owner->root = NULL; - } - list_del_init(&p->list); /* * This has to go last when updating chunk as once replace_chunk() is * called, new RCU readers can see the new chunk. @@ -381,17 +383,7 @@ static void untag_chunk(struct node *p) fsnotify_put_mark(new->mark); /* drop initial reference */ goto out; -Fallback: - // do the best we can - spin_lock(&hash_lock); - if (owner->root == chunk) { - list_del_init(&owner->same_root); - owner->root = NULL; - } - list_del_init(&p->list); - p->owner = NULL; - put_tree(owner); - spin_unlock(&hash_lock); +out_mutex: mutex_unlock(&entry->group->mark_mutex); out: fsnotify_put_mark(entry); -- cgit v1.3-7-g2ca7 From c22fcde775dcc9f46d73d694061441efdc7bdaad Mon Sep 17 00:00:00 2001 From: Jan Kara Date: Mon, 12 Nov 2018 09:54:49 -0500 Subject: audit: Drop all unused chunk nodes during deletion When deleting chunk from a tree, drop all unused nodes in a chunk instead of just the one used by the tree. This gets rid of possibly lingering unused nodes (created due to fallback path in untag_chunk()) and also removes some special cases and will allow us to simplify locking in untag_chunk(). Signed-off-by: Jan Kara Reviewed-by: Richard Guy Briggs Signed-off-by: Paul Moore --- kernel/audit_tree.c | 27 ++++++++++++++++++--------- 1 file changed, 18 insertions(+), 9 deletions(-) (limited to 'kernel') diff --git a/kernel/audit_tree.c b/kernel/audit_tree.c index ca2b6baff7aa..145e8c92dd31 100644 --- a/kernel/audit_tree.c +++ b/kernel/audit_tree.c @@ -277,8 +277,7 @@ static struct audit_chunk *find_chunk(struct node *p) return container_of(p, struct audit_chunk, owners[0]); } -static void replace_chunk(struct audit_chunk *new, struct audit_chunk *old, - struct node *skip) +static void replace_chunk(struct audit_chunk *new, struct audit_chunk *old) { struct audit_tree *owner; int i, j; @@ -288,7 +287,7 @@ static void replace_chunk(struct audit_chunk *new, struct audit_chunk *old, list_for_each_entry(owner, &new->trees, same_root) owner->root = new; for (i = j = 0; j < old->count; i++, j++) { - if (&old->owners[j] == skip) { + if (!old->owners[j].owner) { i--; continue; } @@ -322,20 +321,28 @@ static void remove_chunk_node(struct audit_chunk *chunk, struct node *p) put_tree(owner); } +static int chunk_count_trees(struct audit_chunk *chunk) +{ + int i; + int ret = 0; + + for (i = 0; i < chunk->count; i++) + if (chunk->owners[i].owner) + ret++; + return ret; +} + static void untag_chunk(struct node *p) { struct audit_chunk *chunk = find_chunk(p); struct fsnotify_mark *entry = chunk->mark; struct audit_chunk *new = NULL; - int size = chunk->count - 1; + int size; remove_chunk_node(chunk, p); fsnotify_get_mark(entry); spin_unlock(&hash_lock); - if (size) - new = alloc_chunk(size); - mutex_lock(&entry->group->mark_mutex); /* * mark_mutex protects mark from getting detached and thus also from @@ -348,6 +355,7 @@ static void untag_chunk(struct node *p) goto out; } + size = chunk_count_trees(chunk); if (!size) { chunk->dead = 1; spin_lock(&hash_lock); @@ -360,6 +368,7 @@ static void untag_chunk(struct node *p) goto out; } + new = alloc_chunk(size); if (!new) goto out_mutex; @@ -375,7 +384,7 @@ static void untag_chunk(struct node *p) * This has to go last when updating chunk as once replace_chunk() is * called, new RCU readers can see the new chunk. */ - replace_chunk(new, chunk, p); + replace_chunk(new, chunk); spin_unlock(&hash_lock); fsnotify_detach_mark(entry); mutex_unlock(&entry->group->mark_mutex); @@ -520,7 +529,7 @@ static int tag_chunk(struct inode *inode, struct audit_tree *tree) * This has to go last when updating chunk as once replace_chunk() is * called, new RCU readers can see the new chunk. */ - replace_chunk(chunk, old, NULL); + replace_chunk(chunk, old); spin_unlock(&hash_lock); fsnotify_detach_mark(old_entry); mutex_unlock(&audit_tree_group->mark_mutex); -- cgit v1.3-7-g2ca7 From 8432c70062978d9a57bde6715496d585ec520c3e Mon Sep 17 00:00:00 2001 From: Jan Kara Date: Mon, 12 Nov 2018 09:54:56 -0500 Subject: audit: Simplify locking around untag_chunk() untag_chunk() has to be called with hash_lock, it drops it and reacquires it when returning. The unlocking of hash_lock is thus hidden from the callers of untag_chunk() with is rather error prone. Reorganize the code so that untag_chunk() is called without hash_lock, only with mark reference preventing the chunk from going away. Since this requires some more code in the caller of untag_chunk() to assure forward progress, factor out loop pruning tree from all chunks into a common helper function. Signed-off-by: Jan Kara Reviewed-by: Richard Guy Briggs Signed-off-by: Paul Moore --- kernel/audit_tree.c | 77 ++++++++++++++++++++++++++++------------------------- 1 file changed, 40 insertions(+), 37 deletions(-) (limited to 'kernel') diff --git a/kernel/audit_tree.c b/kernel/audit_tree.c index 145e8c92dd31..82b27da7031c 100644 --- a/kernel/audit_tree.c +++ b/kernel/audit_tree.c @@ -332,28 +332,18 @@ static int chunk_count_trees(struct audit_chunk *chunk) return ret; } -static void untag_chunk(struct node *p) +static void untag_chunk(struct audit_chunk *chunk, struct fsnotify_mark *entry) { - struct audit_chunk *chunk = find_chunk(p); - struct fsnotify_mark *entry = chunk->mark; - struct audit_chunk *new = NULL; + struct audit_chunk *new; int size; - remove_chunk_node(chunk, p); - fsnotify_get_mark(entry); - spin_unlock(&hash_lock); - - mutex_lock(&entry->group->mark_mutex); + mutex_lock(&audit_tree_group->mark_mutex); /* * mark_mutex protects mark from getting detached and thus also from * mark->connector->obj getting NULL. */ - if (chunk->dead || !(entry->flags & FSNOTIFY_MARK_FLAG_ATTACHED)) { - mutex_unlock(&entry->group->mark_mutex); - if (new) - fsnotify_put_mark(new->mark); - goto out; - } + if (chunk->dead || !(entry->flags & FSNOTIFY_MARK_FLAG_ATTACHED)) + goto out_mutex; size = chunk_count_trees(chunk); if (!size) { @@ -363,9 +353,9 @@ static void untag_chunk(struct node *p) list_del_rcu(&chunk->hash); spin_unlock(&hash_lock); fsnotify_detach_mark(entry); - mutex_unlock(&entry->group->mark_mutex); + mutex_unlock(&audit_tree_group->mark_mutex); fsnotify_free_mark(entry); - goto out; + return; } new = alloc_chunk(size); @@ -387,16 +377,13 @@ static void untag_chunk(struct node *p) replace_chunk(new, chunk); spin_unlock(&hash_lock); fsnotify_detach_mark(entry); - mutex_unlock(&entry->group->mark_mutex); + mutex_unlock(&audit_tree_group->mark_mutex); fsnotify_free_mark(entry); fsnotify_put_mark(new->mark); /* drop initial reference */ - goto out; + return; out_mutex: - mutex_unlock(&entry->group->mark_mutex); -out: - fsnotify_put_mark(entry); - spin_lock(&hash_lock); + mutex_unlock(&audit_tree_group->mark_mutex); } /* Call with group->mark_mutex held, releases it */ @@ -579,22 +566,45 @@ static void kill_rules(struct audit_tree *tree) } /* - * finish killing struct audit_tree + * Remove tree from chunks. If 'tagged' is set, remove tree only from tagged + * chunks. The function expects tagged chunks are all at the beginning of the + * chunks list. */ -static void prune_one(struct audit_tree *victim) +static void prune_tree_chunks(struct audit_tree *victim, bool tagged) { spin_lock(&hash_lock); while (!list_empty(&victim->chunks)) { struct node *p; + struct audit_chunk *chunk; + struct fsnotify_mark *mark; + + p = list_first_entry(&victim->chunks, struct node, list); + /* have we run out of marked? */ + if (tagged && !(p->index & (1U<<31))) + break; + chunk = find_chunk(p); + mark = chunk->mark; + remove_chunk_node(chunk, p); + fsnotify_get_mark(mark); + spin_unlock(&hash_lock); - p = list_entry(victim->chunks.next, struct node, list); + untag_chunk(chunk, mark); + fsnotify_put_mark(mark); - untag_chunk(p); + spin_lock(&hash_lock); } spin_unlock(&hash_lock); put_tree(victim); } +/* + * finish killing struct audit_tree + */ +static void prune_one(struct audit_tree *victim) +{ + prune_tree_chunks(victim, false); +} + /* trim the uncommitted chunks from tree */ static void trim_marked(struct audit_tree *tree) @@ -614,18 +624,11 @@ static void trim_marked(struct audit_tree *tree) list_add(p, &tree->chunks); } } + spin_unlock(&hash_lock); - while (!list_empty(&tree->chunks)) { - struct node *node; - - node = list_entry(tree->chunks.next, struct node, list); - - /* have we run out of marked? */ - if (!(node->index & (1U<<31))) - break; + prune_tree_chunks(tree, true); - untag_chunk(node); - } + spin_lock(&hash_lock); if (!tree->root && !tree->goner) { tree->goner = 1; spin_unlock(&hash_lock); -- cgit v1.3-7-g2ca7 From 83d23bc8aedc51fc40078026e9fae6e349d83b2a Mon Sep 17 00:00:00 2001 From: Jan Kara Date: Mon, 12 Nov 2018 09:55:16 -0500 Subject: audit: Replace chunk attached to mark instead of replacing mark Audit tree code currently associates new fsnotify mark with each new chunk. As chunk attached to an inode is replaced when new tag is added / removed, we also need to remove old fsnotify mark and add a new one on such occasion. This is cumbersome and makes locking rules somewhat difficult to follow. Fix these problems by allocating fsnotify mark independently of chunk and keeping it all the time while there is some chunk attached to an inode. Also add documentation about the locking rules so that things are easier to follow. Signed-off-by: Jan Kara Reviewed-by: Richard Guy Briggs [PM: minor merge fuzz due to updated patches previously in the series] Signed-off-by: Paul Moore --- kernel/audit_tree.c | 160 +++++++++++++++++++++++++++------------------------- 1 file changed, 83 insertions(+), 77 deletions(-) (limited to 'kernel') diff --git a/kernel/audit_tree.c b/kernel/audit_tree.c index 82b27da7031c..1d8dc20296fb 100644 --- a/kernel/audit_tree.c +++ b/kernel/audit_tree.c @@ -27,7 +27,6 @@ struct audit_chunk { unsigned long key; struct fsnotify_mark *mark; struct list_head trees; /* with root here */ - int dead; int count; atomic_long_t refs; struct rcu_head head; @@ -48,8 +47,15 @@ static LIST_HEAD(prune_list); static struct task_struct *prune_thread; /* - * One struct chunk is attached to each inode of interest. - * We replace struct chunk on tagging/untagging. + * One struct chunk is attached to each inode of interest through + * audit_tree_mark (fsnotify mark). We replace struct chunk on tagging / + * untagging, the mark is stable as long as there is chunk attached. The + * association between mark and chunk is protected by hash_lock and + * audit_tree_group->mark_mutex. Thus as long as we hold + * audit_tree_group->mark_mutex and check that the mark is alive by + * FSNOTIFY_MARK_FLAG_ATTACHED flag check, we are sure the mark points to + * the current chunk. + * * Rules have pointer to struct audit_tree. * Rules have struct list_head rlist forming a list of rules over * the same tree. @@ -68,8 +74,12 @@ static struct task_struct *prune_thread; * tree is refcounted; one reference for "some rules on rules_list refer to * it", one for each chunk with pointer to it. * - * chunk is refcounted by embedded fsnotify_mark + .refs (non-zero refcount - * of watch contributes 1 to .refs). + * chunk is refcounted by embedded .refs. Mark associated with the chunk holds + * one chunk reference. This reference is dropped either when a mark is going + * to be freed (corresponding inode goes away) or when chunk attached to the + * mark gets replaced. This reference must be dropped using + * audit_mark_put_chunk() to make sure the reference is dropped only after RCU + * grace period as it protects RCU readers of the hash table. * * node.index allows to get from node.list to containing chunk. * MSB of that sucker is stolen to mark taggings that we might have to @@ -160,8 +170,6 @@ static struct audit_chunk *mark_chunk(struct fsnotify_mark *mark) static void audit_tree_destroy_watch(struct fsnotify_mark *entry) { - struct audit_chunk *chunk = mark_chunk(entry); - audit_mark_put_chunk(chunk); kmem_cache_free(audit_tree_mark_cachep, audit_mark(entry)); } @@ -188,13 +196,6 @@ static struct audit_chunk *alloc_chunk(int count) if (!chunk) return NULL; - chunk->mark = alloc_mark(); - if (!chunk->mark) { - kfree(chunk); - return NULL; - } - audit_mark(chunk->mark)->chunk = chunk; - INIT_LIST_HEAD(&chunk->hash); INIT_LIST_HEAD(&chunk->trees); chunk->count = count; @@ -277,6 +278,20 @@ static struct audit_chunk *find_chunk(struct node *p) return container_of(p, struct audit_chunk, owners[0]); } +static void replace_mark_chunk(struct fsnotify_mark *entry, + struct audit_chunk *chunk) +{ + struct audit_chunk *old; + + assert_spin_locked(&hash_lock); + old = mark_chunk(entry); + audit_mark(entry)->chunk = chunk; + if (chunk) + chunk->mark = entry; + if (old) + old->mark = NULL; +} + static void replace_chunk(struct audit_chunk *new, struct audit_chunk *old) { struct audit_tree *owner; @@ -299,6 +314,7 @@ static void replace_chunk(struct audit_chunk *new, struct audit_chunk *old) get_tree(owner); list_replace_init(&old->owners[j].list, &new->owners[i].list); } + replace_mark_chunk(old->mark, new); /* * Make sure chunk is fully initialized before making it visible in the * hash. Pairs with a data dependency barrier in READ_ONCE() in @@ -339,21 +355,23 @@ static void untag_chunk(struct audit_chunk *chunk, struct fsnotify_mark *entry) mutex_lock(&audit_tree_group->mark_mutex); /* - * mark_mutex protects mark from getting detached and thus also from - * mark->connector->obj getting NULL. + * mark_mutex stabilizes chunk attached to the mark so we can check + * whether it didn't change while we've dropped hash_lock. */ - if (chunk->dead || !(entry->flags & FSNOTIFY_MARK_FLAG_ATTACHED)) + if (!(entry->flags & FSNOTIFY_MARK_FLAG_ATTACHED) || + mark_chunk(entry) != chunk) goto out_mutex; size = chunk_count_trees(chunk); if (!size) { - chunk->dead = 1; spin_lock(&hash_lock); list_del_init(&chunk->trees); list_del_rcu(&chunk->hash); + replace_mark_chunk(entry, NULL); spin_unlock(&hash_lock); fsnotify_detach_mark(entry); mutex_unlock(&audit_tree_group->mark_mutex); + audit_mark_put_chunk(chunk); fsnotify_free_mark(entry); return; } @@ -362,13 +380,6 @@ static void untag_chunk(struct audit_chunk *chunk, struct fsnotify_mark *entry) if (!new) goto out_mutex; - if (fsnotify_add_mark_locked(new->mark, entry->connector->obj, - FSNOTIFY_OBJ_TYPE_INODE, 1)) { - fsnotify_put_mark(new->mark); - goto out_mutex; - } - - chunk->dead = 1; spin_lock(&hash_lock); /* * This has to go last when updating chunk as once replace_chunk() is @@ -376,10 +387,8 @@ static void untag_chunk(struct audit_chunk *chunk, struct fsnotify_mark *entry) */ replace_chunk(new, chunk); spin_unlock(&hash_lock); - fsnotify_detach_mark(entry); mutex_unlock(&audit_tree_group->mark_mutex); - fsnotify_free_mark(entry); - fsnotify_put_mark(new->mark); /* drop initial reference */ + audit_mark_put_chunk(chunk); return; out_mutex: @@ -397,23 +406,31 @@ static int create_chunk(struct inode *inode, struct audit_tree *tree) return -ENOMEM; } - entry = chunk->mark; + entry = alloc_mark(); + if (!entry) { + mutex_unlock(&audit_tree_group->mark_mutex); + kfree(chunk); + return -ENOMEM; + } + if (fsnotify_add_inode_mark_locked(entry, inode, 0)) { mutex_unlock(&audit_tree_group->mark_mutex); fsnotify_put_mark(entry); + kfree(chunk); return -ENOSPC; } spin_lock(&hash_lock); if (tree->goner) { spin_unlock(&hash_lock); - chunk->dead = 1; fsnotify_detach_mark(entry); mutex_unlock(&audit_tree_group->mark_mutex); fsnotify_free_mark(entry); fsnotify_put_mark(entry); + kfree(chunk); return 0; } + replace_mark_chunk(entry, chunk); chunk->owners[0].index = (1U << 31); chunk->owners[0].owner = tree; get_tree(tree); @@ -430,33 +447,41 @@ static int create_chunk(struct inode *inode, struct audit_tree *tree) insert_hash(chunk); spin_unlock(&hash_lock); mutex_unlock(&audit_tree_group->mark_mutex); - fsnotify_put_mark(entry); /* drop initial reference */ + /* + * Drop our initial reference. When mark we point to is getting freed, + * we get notification through ->freeing_mark callback and cleanup + * chunk pointing to this mark. + */ + fsnotify_put_mark(entry); return 0; } /* the first tagged inode becomes root of tree */ static int tag_chunk(struct inode *inode, struct audit_tree *tree) { - struct fsnotify_mark *old_entry, *chunk_entry; + struct fsnotify_mark *entry; struct audit_chunk *chunk, *old; struct node *p; int n; mutex_lock(&audit_tree_group->mark_mutex); - old_entry = fsnotify_find_mark(&inode->i_fsnotify_marks, - audit_tree_group); - if (!old_entry) + entry = fsnotify_find_mark(&inode->i_fsnotify_marks, audit_tree_group); + if (!entry) return create_chunk(inode, tree); - old = mark_chunk(old_entry); - + /* + * Found mark is guaranteed to be attached and mark_mutex protects mark + * from getting detached and thus it makes sure there is chunk attached + * to the mark. + */ /* are we already there? */ spin_lock(&hash_lock); + old = mark_chunk(entry); for (n = 0; n < old->count; n++) { if (old->owners[n].owner == tree) { spin_unlock(&hash_lock); mutex_unlock(&audit_tree_group->mark_mutex); - fsnotify_put_mark(old_entry); + fsnotify_put_mark(entry); return 0; } } @@ -465,41 +490,16 @@ static int tag_chunk(struct inode *inode, struct audit_tree *tree) chunk = alloc_chunk(old->count + 1); if (!chunk) { mutex_unlock(&audit_tree_group->mark_mutex); - fsnotify_put_mark(old_entry); + fsnotify_put_mark(entry); return -ENOMEM; } - chunk_entry = chunk->mark; - - /* - * mark_mutex protects mark from getting detached and thus also from - * mark->connector->obj getting NULL. - */ - if (!(old_entry->flags & FSNOTIFY_MARK_FLAG_ATTACHED)) { - /* old_entry is being shot, lets just lie */ - mutex_unlock(&audit_tree_group->mark_mutex); - fsnotify_put_mark(old_entry); - fsnotify_put_mark(chunk->mark); - return -ENOENT; - } - - if (fsnotify_add_mark_locked(chunk_entry, old_entry->connector->obj, - FSNOTIFY_OBJ_TYPE_INODE, 1)) { - mutex_unlock(&audit_tree_group->mark_mutex); - fsnotify_put_mark(chunk_entry); - fsnotify_put_mark(old_entry); - return -ENOSPC; - } - spin_lock(&hash_lock); if (tree->goner) { spin_unlock(&hash_lock); - chunk->dead = 1; - fsnotify_detach_mark(chunk_entry); mutex_unlock(&audit_tree_group->mark_mutex); - fsnotify_free_mark(chunk_entry); - fsnotify_put_mark(chunk_entry); - fsnotify_put_mark(old_entry); + fsnotify_put_mark(entry); + kfree(chunk); return 0; } p = &chunk->owners[chunk->count - 1]; @@ -507,7 +507,6 @@ static int tag_chunk(struct inode *inode, struct audit_tree *tree) p->owner = tree; get_tree(tree); list_add(&p->list, &tree->chunks); - old->dead = 1; if (!tree->root) { tree->root = chunk; list_add(&tree->same_root, &chunk->trees); @@ -518,11 +517,10 @@ static int tag_chunk(struct inode *inode, struct audit_tree *tree) */ replace_chunk(chunk, old); spin_unlock(&hash_lock); - fsnotify_detach_mark(old_entry); mutex_unlock(&audit_tree_group->mark_mutex); - fsnotify_free_mark(old_entry); - fsnotify_put_mark(chunk_entry); /* drop initial reference */ - fsnotify_put_mark(old_entry); /* pair to fsnotify_find mark_entry */ + fsnotify_put_mark(entry); /* pair to fsnotify_find mark_entry */ + audit_mark_put_chunk(old); + return 0; } @@ -585,6 +583,9 @@ static void prune_tree_chunks(struct audit_tree *victim, bool tagged) chunk = find_chunk(p); mark = chunk->mark; remove_chunk_node(chunk, p); + /* Racing with audit_tree_freeing_mark()? */ + if (!mark) + continue; fsnotify_get_mark(mark); spin_unlock(&hash_lock); @@ -1007,10 +1008,6 @@ static void evict_chunk(struct audit_chunk *chunk) int need_prune = 0; int n; - if (chunk->dead) - return; - - chunk->dead = 1; mutex_lock(&audit_filter_mutex); spin_lock(&hash_lock); while (!list_empty(&chunk->trees)) { @@ -1049,9 +1046,18 @@ static int audit_tree_handle_event(struct fsnotify_group *group, static void audit_tree_freeing_mark(struct fsnotify_mark *entry, struct fsnotify_group *group) { - struct audit_chunk *chunk = mark_chunk(entry); + struct audit_chunk *chunk; - evict_chunk(chunk); + mutex_lock(&entry->group->mark_mutex); + spin_lock(&hash_lock); + chunk = mark_chunk(entry); + replace_mark_chunk(entry, NULL); + spin_unlock(&hash_lock); + mutex_unlock(&entry->group->mark_mutex); + if (chunk) { + evict_chunk(chunk); + audit_mark_put_chunk(chunk); + } /* * We are guaranteed to have at least one reference to the mark from -- cgit v1.3-7-g2ca7 From f905c2fc3980a41aeccb8673ab10ed5e616391fd Mon Sep 17 00:00:00 2001 From: Jan Kara Date: Mon, 12 Nov 2018 09:55:16 -0500 Subject: audit: Use 'mark' name for fsnotify_mark variables Variables pointing to fsnotify_mark are sometimes called 'entry' and sometimes 'mark'. Use 'mark' in all places. Reviewed-by: Richard Guy Briggs Signed-off-by: Jan Kara [PM: minor merge fuzz due to updated patches previously in the series] Signed-off-by: Paul Moore --- kernel/audit_tree.c | 79 +++++++++++++++++++++++++++-------------------------- 1 file changed, 40 insertions(+), 39 deletions(-) (limited to 'kernel') diff --git a/kernel/audit_tree.c b/kernel/audit_tree.c index 1d8dc20296fb..58e84eb5d826 100644 --- a/kernel/audit_tree.c +++ b/kernel/audit_tree.c @@ -158,9 +158,9 @@ static void audit_mark_put_chunk(struct audit_chunk *chunk) call_rcu(&chunk->head, __put_chunk); } -static inline struct audit_tree_mark *audit_mark(struct fsnotify_mark *entry) +static inline struct audit_tree_mark *audit_mark(struct fsnotify_mark *mark) { - return container_of(entry, struct audit_tree_mark, mark); + return container_of(mark, struct audit_tree_mark, mark); } static struct audit_chunk *mark_chunk(struct fsnotify_mark *mark) @@ -168,9 +168,9 @@ static struct audit_chunk *mark_chunk(struct fsnotify_mark *mark) return audit_mark(mark)->chunk; } -static void audit_tree_destroy_watch(struct fsnotify_mark *entry) +static void audit_tree_destroy_watch(struct fsnotify_mark *mark) { - kmem_cache_free(audit_tree_mark_cachep, audit_mark(entry)); + kmem_cache_free(audit_tree_mark_cachep, audit_mark(mark)); } static struct fsnotify_mark *alloc_mark(void) @@ -224,7 +224,7 @@ static inline struct list_head *chunk_hash(unsigned long key) return chunk_hash_heads + n % HASH_SIZE; } -/* hash_lock & entry->group->mark_mutex is held by caller */ +/* hash_lock & mark->group->mark_mutex is held by caller */ static void insert_hash(struct audit_chunk *chunk) { struct list_head *list; @@ -278,16 +278,16 @@ static struct audit_chunk *find_chunk(struct node *p) return container_of(p, struct audit_chunk, owners[0]); } -static void replace_mark_chunk(struct fsnotify_mark *entry, +static void replace_mark_chunk(struct fsnotify_mark *mark, struct audit_chunk *chunk) { struct audit_chunk *old; assert_spin_locked(&hash_lock); - old = mark_chunk(entry); - audit_mark(entry)->chunk = chunk; + old = mark_chunk(mark); + audit_mark(mark)->chunk = chunk; if (chunk) - chunk->mark = entry; + chunk->mark = mark; if (old) old->mark = NULL; } @@ -348,7 +348,7 @@ static int chunk_count_trees(struct audit_chunk *chunk) return ret; } -static void untag_chunk(struct audit_chunk *chunk, struct fsnotify_mark *entry) +static void untag_chunk(struct audit_chunk *chunk, struct fsnotify_mark *mark) { struct audit_chunk *new; int size; @@ -358,8 +358,8 @@ static void untag_chunk(struct audit_chunk *chunk, struct fsnotify_mark *entry) * mark_mutex stabilizes chunk attached to the mark so we can check * whether it didn't change while we've dropped hash_lock. */ - if (!(entry->flags & FSNOTIFY_MARK_FLAG_ATTACHED) || - mark_chunk(entry) != chunk) + if (!(mark->flags & FSNOTIFY_MARK_FLAG_ATTACHED) || + mark_chunk(mark) != chunk) goto out_mutex; size = chunk_count_trees(chunk); @@ -367,12 +367,12 @@ static void untag_chunk(struct audit_chunk *chunk, struct fsnotify_mark *entry) spin_lock(&hash_lock); list_del_init(&chunk->trees); list_del_rcu(&chunk->hash); - replace_mark_chunk(entry, NULL); + replace_mark_chunk(mark, NULL); spin_unlock(&hash_lock); - fsnotify_detach_mark(entry); + fsnotify_detach_mark(mark); mutex_unlock(&audit_tree_group->mark_mutex); audit_mark_put_chunk(chunk); - fsnotify_free_mark(entry); + fsnotify_free_mark(mark); return; } @@ -398,7 +398,7 @@ out_mutex: /* Call with group->mark_mutex held, releases it */ static int create_chunk(struct inode *inode, struct audit_tree *tree) { - struct fsnotify_mark *entry; + struct fsnotify_mark *mark; struct audit_chunk *chunk = alloc_chunk(1); if (!chunk) { @@ -406,16 +406,16 @@ static int create_chunk(struct inode *inode, struct audit_tree *tree) return -ENOMEM; } - entry = alloc_mark(); - if (!entry) { + mark = alloc_mark(); + if (!mark) { mutex_unlock(&audit_tree_group->mark_mutex); kfree(chunk); return -ENOMEM; } - if (fsnotify_add_inode_mark_locked(entry, inode, 0)) { + if (fsnotify_add_inode_mark_locked(mark, inode, 0)) { mutex_unlock(&audit_tree_group->mark_mutex); - fsnotify_put_mark(entry); + fsnotify_put_mark(mark); kfree(chunk); return -ENOSPC; } @@ -423,14 +423,14 @@ static int create_chunk(struct inode *inode, struct audit_tree *tree) spin_lock(&hash_lock); if (tree->goner) { spin_unlock(&hash_lock); - fsnotify_detach_mark(entry); + fsnotify_detach_mark(mark); mutex_unlock(&audit_tree_group->mark_mutex); - fsnotify_free_mark(entry); - fsnotify_put_mark(entry); + fsnotify_free_mark(mark); + fsnotify_put_mark(mark); kfree(chunk); return 0; } - replace_mark_chunk(entry, chunk); + replace_mark_chunk(mark, chunk); chunk->owners[0].index = (1U << 31); chunk->owners[0].owner = tree; get_tree(tree); @@ -452,21 +452,21 @@ static int create_chunk(struct inode *inode, struct audit_tree *tree) * we get notification through ->freeing_mark callback and cleanup * chunk pointing to this mark. */ - fsnotify_put_mark(entry); + fsnotify_put_mark(mark); return 0; } /* the first tagged inode becomes root of tree */ static int tag_chunk(struct inode *inode, struct audit_tree *tree) { - struct fsnotify_mark *entry; + struct fsnotify_mark *mark; struct audit_chunk *chunk, *old; struct node *p; int n; mutex_lock(&audit_tree_group->mark_mutex); - entry = fsnotify_find_mark(&inode->i_fsnotify_marks, audit_tree_group); - if (!entry) + mark = fsnotify_find_mark(&inode->i_fsnotify_marks, audit_tree_group); + if (!mark) return create_chunk(inode, tree); /* @@ -476,12 +476,12 @@ static int tag_chunk(struct inode *inode, struct audit_tree *tree) */ /* are we already there? */ spin_lock(&hash_lock); - old = mark_chunk(entry); + old = mark_chunk(mark); for (n = 0; n < old->count; n++) { if (old->owners[n].owner == tree) { spin_unlock(&hash_lock); mutex_unlock(&audit_tree_group->mark_mutex); - fsnotify_put_mark(entry); + fsnotify_put_mark(mark); return 0; } } @@ -490,7 +490,7 @@ static int tag_chunk(struct inode *inode, struct audit_tree *tree) chunk = alloc_chunk(old->count + 1); if (!chunk) { mutex_unlock(&audit_tree_group->mark_mutex); - fsnotify_put_mark(entry); + fsnotify_put_mark(mark); return -ENOMEM; } @@ -498,7 +498,7 @@ static int tag_chunk(struct inode *inode, struct audit_tree *tree) if (tree->goner) { spin_unlock(&hash_lock); mutex_unlock(&audit_tree_group->mark_mutex); - fsnotify_put_mark(entry); + fsnotify_put_mark(mark); kfree(chunk); return 0; } @@ -518,7 +518,7 @@ static int tag_chunk(struct inode *inode, struct audit_tree *tree) replace_chunk(chunk, old); spin_unlock(&hash_lock); mutex_unlock(&audit_tree_group->mark_mutex); - fsnotify_put_mark(entry); /* pair to fsnotify_find mark_entry */ + fsnotify_put_mark(mark); /* pair to fsnotify_find_mark */ audit_mark_put_chunk(old); return 0; @@ -1044,16 +1044,17 @@ static int audit_tree_handle_event(struct fsnotify_group *group, return 0; } -static void audit_tree_freeing_mark(struct fsnotify_mark *entry, struct fsnotify_group *group) +static void audit_tree_freeing_mark(struct fsnotify_mark *mark, + struct fsnotify_group *group) { struct audit_chunk *chunk; - mutex_lock(&entry->group->mark_mutex); + mutex_lock(&mark->group->mark_mutex); spin_lock(&hash_lock); - chunk = mark_chunk(entry); - replace_mark_chunk(entry, NULL); + chunk = mark_chunk(mark); + replace_mark_chunk(mark, NULL); spin_unlock(&hash_lock); - mutex_unlock(&entry->group->mark_mutex); + mutex_unlock(&mark->group->mark_mutex); if (chunk) { evict_chunk(chunk); audit_mark_put_chunk(chunk); @@ -1063,7 +1064,7 @@ static void audit_tree_freeing_mark(struct fsnotify_mark *entry, struct fsnotify * We are guaranteed to have at least one reference to the mark from * either the inode or the caller of fsnotify_destroy_mark(). */ - BUG_ON(refcount_read(&entry->refcnt) < 1); + BUG_ON(refcount_read(&mark->refcnt) < 1); } static const struct fsnotify_ops audit_tree_ops = { -- cgit v1.3-7-g2ca7 From 9213784b48f8ba666b4695ca3f5d34f583daab83 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 22 Oct 2018 08:26:00 -0700 Subject: rcu: Eliminate BUG_ON() for kernel/rcu/tree_plugin.h The tree_plugin.h file has a number of calls to BUG_ON(), which panics the kernel, which is not a good strategy for devices (like embedded) that don't have a way to capture console output. This commit therefore converts these BUG_ON() calls to WARN_ON_ONCE() and WARN_ONCE(). Reported-by: Linus Torvalds Signed-off-by: Paul E. McKenney [ paulmck: Fix typo: s/rcuo/rcub/. ] --- kernel/rcu/tree_plugin.h | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 05915e536336..adda608874a8 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1464,7 +1464,8 @@ static void __init rcu_spawn_boost_kthreads(void) for_each_possible_cpu(cpu) per_cpu(rcu_cpu_has_work, cpu) = 0; - BUG_ON(smpboot_register_percpu_thread(&rcu_cpu_thread_spec)); + if (WARN_ONCE(smpboot_register_percpu_thread(&rcu_cpu_thread_spec), "%s: Could not start rcub kthread, OOM is now expected behavior\n", __func__)) + return; rcu_for_each_leaf_node(rnp) (void)rcu_spawn_one_boost_kthread(rnp); } @@ -2322,7 +2323,8 @@ static int rcu_nocb_kthread(void *arg) tail = rdp->nocb_follower_tail; rdp->nocb_follower_tail = &rdp->nocb_follower_head; raw_spin_unlock_irqrestore(&rdp->nocb_lock, flags); - BUG_ON(!list); + if (WARN_ON_ONCE(!list)) + continue; trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WokeNonEmpty")); /* Each pass through the following loop invokes a callback. */ @@ -2495,7 +2497,8 @@ static void rcu_spawn_one_nocb_kthread(int cpu) /* Spawn the kthread for this CPU. */ t = kthread_run(rcu_nocb_kthread, rdp_spawn, "rcuo%c/%d", rcu_state.abbr, cpu); - BUG_ON(IS_ERR(t)); + if (WARN_ONCE(IS_ERR(t), "%s: Could not start rcuo kthread, OOM is now expected behavior\n", __func__)) + return; WRITE_ONCE(rdp_spawn->nocb_kthread, t); } -- cgit v1.3-7-g2ca7 From f0ad56e876cdd67730065274625edbcfe0cca278 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 22 Oct 2018 08:33:06 -0700 Subject: rcu: Eliminate BUG_ON() for kernel/rcu/update.c The update.c file has a number of calls to BUG_ON(), which panics the kernel, which is not a good strategy for devices (like embedded) that don't have a way to capture console output. This commit therefore converts these BUG_ON() calls to WARN_ON_ONCE() and WARN_ONCE(). Reported-by: Linus Torvalds Signed-off-by: Paul E. McKenney --- kernel/rcu/update.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c index f203b94f6b5b..cb26de25658a 100644 --- a/kernel/rcu/update.c +++ b/kernel/rcu/update.c @@ -822,7 +822,8 @@ static int __init rcu_spawn_tasks_kthread(void) struct task_struct *t; t = kthread_run(rcu_tasks_kthread, NULL, "rcu_tasks_kthread"); - BUG_ON(IS_ERR(t)); + if (WARN_ONCE(IS_ERR(t), "%s: Could not start Tasks-RCU grace-period kthread, OOM is now expected behavior\n", __func__)) + return 0; smp_mb(); /* Ensure others see full kthread. */ WRITE_ONCE(rcu_tasks_kthread_ptr, t); return 0; -- cgit v1.3-7-g2ca7 From b3c1d9ec7c59feadd22693f43145d8285bd35b04 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 1 Oct 2018 13:25:32 -0700 Subject: rcu: Avoid double multiply by HZ The rcu_check_gp_start_stall() function multiplies the return value from rcu_jiffies_till_stall_check() by HZ, but the units are already in jiffies. This commit therefore avoids the need for introduction of a jiffies-squared unit by removing the extraneous multiplication. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 121f833acd04..4933f5d9d992 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -2603,7 +2603,7 @@ static void force_quiescent_state(void) static void rcu_check_gp_start_stall(struct rcu_node *rnp, struct rcu_data *rdp) { - const unsigned long gpssdelay = rcu_jiffies_till_stall_check() * HZ; + const unsigned long gpssdelay = rcu_jiffies_till_stall_check(); unsigned long flags; unsigned long j; struct rcu_node *rnp_root = rcu_get_root(); -- cgit v1.3-7-g2ca7 From 791416c47153b45f640d52baaf30995d9d396a08 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 1 Oct 2018 15:42:44 -0700 Subject: rcu: Parameterize rcu_check_gp_start_stall() In order to debug forward-progress stalls, it is necessary to check for excessively delayed grace-period starts. This is currently done for RCU CPU stall warnings by rcu_check_gp_start_stall(), which checks to see if the start of a requested grace period has been delayed by an RCU CPU stall warning period. Because rcutorture will need to check for the time consumed by an RCU forward-progress delay, this commit promotes gpssdelay from a local variable to a formal parameter. It is not necessary to export rcu_check_gp_start_stall() because rcutorture will access it via a wrapper function. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 4933f5d9d992..36e30150e1e9 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -2600,10 +2600,10 @@ static void force_quiescent_state(void) * This function checks for grace-period requests that fail to motivate * RCU to come out of its idle mode. */ -static void -rcu_check_gp_start_stall(struct rcu_node *rnp, struct rcu_data *rdp) +void +rcu_check_gp_start_stall(struct rcu_node *rnp, struct rcu_data *rdp, + const unsigned long gpssdelay) { - const unsigned long gpssdelay = rcu_jiffies_till_stall_check(); unsigned long flags; unsigned long j; struct rcu_node *rnp_root = rcu_get_root(); @@ -2690,7 +2690,7 @@ static __latent_entropy void rcu_process_callbacks(struct softirq_action *unused local_irq_restore(flags); } - rcu_check_gp_start_stall(rnp, rdp); + rcu_check_gp_start_stall(rnp, rdp, rcu_jiffies_till_stall_check()); /* If there are callbacks ready, invoke them. */ if (rcu_segcblist_ready_cbs(&rdp->cblist)) -- cgit v1.3-7-g2ca7 From 691960197e8daa39bf89886ba2e39de1e33f1ce4 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 2 Oct 2018 11:24:08 -0700 Subject: rcu: Add state name to show_rcu_gp_kthreads() output This commit adds the name of the RCU grace-period state to the show_rcu_gp_kthreads() output in order to ease debugging. This commit also moves gp_state_getname() up in the code so that show_rcu_gp_kthreads() can use it. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 25 +++++++++++++------------ 1 file changed, 13 insertions(+), 12 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 36e30150e1e9..ea78532183ac 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -499,6 +499,16 @@ void rcu_force_quiescent_state(void) } EXPORT_SYMBOL_GPL(rcu_force_quiescent_state); +/* + * Convert a ->gp_state value to a character string. + */ +static const char *gp_state_getname(short gs) +{ + if (gs < 0 || gs >= ARRAY_SIZE(gp_state_names)) + return "???"; + return gp_state_names[gs]; +} + /* * Show the state of the grace-period kthreads. */ @@ -508,8 +518,9 @@ void show_rcu_gp_kthreads(void) struct rcu_data *rdp; struct rcu_node *rnp; - pr_info("%s: wait state: %d ->state: %#lx\n", rcu_state.name, - rcu_state.gp_state, rcu_state.gp_kthread->state); + pr_info("%s: wait state: %s(%d) ->state: %#lx\n", rcu_state.name, + gp_state_getname(rcu_state.gp_state), rcu_state.gp_state, + rcu_state.gp_kthread->state); rcu_for_each_node_breadth_first(rnp) { if (ULONG_CMP_GE(rcu_state.gp_seq, rnp->gp_seq_needed)) continue; @@ -1142,16 +1153,6 @@ static void record_gp_stall_check_time(void) rcu_state.n_force_qs_gpstart = READ_ONCE(rcu_state.n_force_qs); } -/* - * Convert a ->gp_state value to a character string. - */ -static const char *gp_state_getname(short gs) -{ - if (gs < 0 || gs >= ARRAY_SIZE(gp_state_names)) - return "???"; - return gp_state_names[gs]; -} - /* * Complain about starvation of grace-period kthread. */ -- cgit v1.3-7-g2ca7 From c669c014d1dae9c7cdbfff049c798722f8650829 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 2 Oct 2018 12:42:21 -0700 Subject: rcu: Add jiffies-since-GP-activity to show_rcu_gp_kthreads() This commit adds a printout of the number of jiffies since the last time that the RCU grace-period kthread did any processing. This can be useful when tracking down forward-progress issues. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index ea78532183ac..e7c9848d1e1b 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -515,12 +515,14 @@ static const char *gp_state_getname(short gs) void show_rcu_gp_kthreads(void) { int cpu; + unsigned long j; struct rcu_data *rdp; struct rcu_node *rnp; - pr_info("%s: wait state: %s(%d) ->state: %#lx\n", rcu_state.name, - gp_state_getname(rcu_state.gp_state), rcu_state.gp_state, - rcu_state.gp_kthread->state); + j = jiffies - READ_ONCE(rcu_state.gp_activity); + pr_info("%s: wait state: %s(%d) ->state: %#lx delta ->gp_activity %ld\n", + rcu_state.name, gp_state_getname(rcu_state.gp_state), + rcu_state.gp_state, rcu_state.gp_kthread->state, j); rcu_for_each_node_breadth_first(rnp) { if (ULONG_CMP_GE(rcu_state.gp_seq, rnp->gp_seq_needed)) continue; -- cgit v1.3-7-g2ca7 From 2320bda26df75c02f86c9fa42c3355ebcd16d8ed Mon Sep 17 00:00:00 2001 From: Zhouyi Zhou Date: Mon, 8 Oct 2018 06:50:41 +0000 Subject: rcu: Adjust the comment of function rcu_is_watching Because RCU avoids interrupting idle CPUs, rcu_is_watching() is used to test whether or not it is currently legal to run RCU read-side critical sections on this CPU. However, the first sentence and last sentences of current comment for rcu_is_watching have opposite meaning of what is expected. This commit therefore fixes this header comment. Signed-off-by: Zhouyi Zhou Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index e7c9848d1e1b..429a46fb087a 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -904,12 +904,12 @@ void rcu_irq_enter_irqson(void) } /** - * rcu_is_watching - see if RCU thinks that the current CPU is idle + * rcu_is_watching - see if RCU thinks that the current CPU is not idle * * Return true if RCU is watching the running CPU, which means that this * CPU can safely enter RCU read-side critical sections. In other words, - * if the current CPU is in its idle loop and is neither in an interrupt - * or NMI handler, return true. + * if the current CPU is not in its idle loop or is in an interrupt or + * NMI handler, return true. */ bool notrace rcu_is_watching(void) { -- cgit v1.3-7-g2ca7 From 0a89e5a402e957a97129423599a6ef6342da2764 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 15 Oct 2018 10:00:58 -0700 Subject: rcu: Trace end of grace period before end of grace period Currently, rcu_gp_cleanup() traces the end of the old grace period after the old grace period has officially ended. This might make intuitive sense, but it also makes for confusing event-trace output because the "end" trace displays not the old but instead the new grace-period number. This commit therefore traces the end of an old grace period just before that grace period officially ends. Reported-by: Aravinda Prasad Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 429a46fb087a..61bae41b1afe 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -2035,9 +2035,9 @@ static void rcu_gp_cleanup(void) rnp = rcu_get_root(); raw_spin_lock_irq_rcu_node(rnp); /* GP before ->gp_seq update. */ - /* Declare grace period done. */ - rcu_seq_end(&rcu_state.gp_seq); + /* Declare grace period done, trace first to use old GP number. */ trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("end")); + rcu_seq_end(&rcu_state.gp_seq); rcu_state.gp_state = RCU_GP_IDLE; /* Check for GP requests since above loop. */ rdp = this_cpu_ptr(&rcu_data); -- cgit v1.3-7-g2ca7 From 05f415715ce45da07a0b1a5eac842765b733157f Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 16 Oct 2018 04:12:58 -0700 Subject: rcu: Speed up expedited GPs when interrupting RCU reader In PREEMPT kernels, an expedited grace period might send an IPI to a CPU that is executing an RCU read-side critical section. In that case, it would be nice if the rcu_read_unlock() directly interacted with the RCU core code to immediately report the quiescent state. And this does happen in the case where the reader has been preempted. But it would also be a nice performance optimization if immediate reporting also happened in the preemption-free case. This commit therefore adds an ->exp_hint field to the task_struct structure's ->rcu_read_unlock_special field. The IPI handler sets this hint when it has interrupted an RCU read-side critical section, and this causes the outermost rcu_read_unlock() call to invoke rcu_read_unlock_special(), which, if preemption is enabled, reports the quiescent state immediately. If preemption is disabled, then the report is required to be deferred until preemption (or bottom halves or interrupts or whatever) is re-enabled. Because this is a hint, it does nothing for more complicated cases. For example, if the IPI interrupts an RCU reader, but interrupts are disabled across the rcu_read_unlock(), but another rcu_read_lock() is executed before interrupts are re-enabled, the hint will already have been cleared. If you do crazy things like this, reporting will be deferred until some later RCU_SOFTIRQ handler, context switch, cond_resched(), or similar. Reported-by: Joel Fernandes Signed-off-by: Paul E. McKenney Acked-by: Joel Fernandes (Google) --- include/linux/sched.h | 4 +++- kernel/rcu/tree_exp.h | 4 +++- kernel/rcu/tree_plugin.h | 14 +++++++++++--- 3 files changed, 17 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/include/linux/sched.h b/include/linux/sched.h index a51c13c2b1a0..e4c7b6241088 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -572,8 +572,10 @@ union rcu_special { struct { u8 blocked; u8 need_qs; + u8 exp_hint; /* Hint for performance. */ + u8 pad; /* No garbage from compiler! */ } b; /* Bits. */ - u16 s; /* Set of bits. */ + u32 s; /* Set of bits. */ }; enum perf_event_task_context { diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h index e669ccf3751b..928fe5893a57 100644 --- a/kernel/rcu/tree_exp.h +++ b/kernel/rcu/tree_exp.h @@ -692,8 +692,10 @@ static void sync_rcu_exp_handler(void *unused) */ if (t->rcu_read_lock_nesting > 0) { raw_spin_lock_irqsave_rcu_node(rnp, flags); - if (rnp->expmask & rdp->grpmask) + if (rnp->expmask & rdp->grpmask) { rdp->deferred_qs = true; + WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, true); + } raw_spin_unlock_irqrestore_rcu_node(rnp, flags); } diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 05915e536336..618956cc7a55 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -642,13 +642,21 @@ static void rcu_read_unlock_special(struct task_struct *t) local_irq_save(flags); irqs_were_disabled = irqs_disabled_flags(flags); - if ((preempt_bh_were_disabled || irqs_were_disabled) && - t->rcu_read_unlock_special.b.blocked) { + if (preempt_bh_were_disabled || irqs_were_disabled) { + WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false); /* Need to defer quiescent state until everything is enabled. */ - raise_softirq_irqoff(RCU_SOFTIRQ); + if (irqs_were_disabled) { + /* Enabling irqs does not reschedule, so... */ + raise_softirq_irqoff(RCU_SOFTIRQ); + } else { + /* Enabling BH or preempt does reschedule, so... */ + set_tsk_need_resched(current); + set_preempt_need_resched(); + } local_irq_restore(flags); return; } + WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false); rcu_preempt_deferred_qs_irqrestore(t, flags); } -- cgit v1.3-7-g2ca7 From 117f683c6e0104e1d6dfe8f143ea9c24ab069044 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 5 Nov 2018 14:20:57 -0800 Subject: rcu: Replace this_cpu_ptr() with __this_cpu_read() Because __this_cpu_read() can be lighter weight than equivalent uses of this_cpu_ptr(), this commit replaces the latter with the former. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 618956cc7a55..0bb1c1593ca4 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -597,7 +597,7 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags) */ static bool rcu_preempt_need_deferred_qs(struct task_struct *t) { - return (this_cpu_ptr(&rcu_data)->deferred_qs || + return (__this_cpu_read(rcu_data.deferred_qs) || READ_ONCE(t->rcu_read_unlock_special.s)) && t->rcu_read_lock_nesting <= 0; } -- cgit v1.3-7-g2ca7 From 5f1a6ef3746f536157922197d98676fa21154549 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 29 Oct 2018 07:36:50 -0700 Subject: rcu: Avoid signed integer overflow in rcu_preempt_deferred_qs() Subtracting INT_MIN can be interpreted as unconditional signed integer overflow, which according to the C standard is undefined behavior. Therefore, kernel build arguments notwithstanding, it would be good to future-proof the code. This commit therefore substitutes INT_MAX for INT_MIN in order to avoid undefined behavior. While in the neighborhood, this commit also creates some meaningful names for INT_MAX and friends in order to improve readability, as suggested by Joel Fernandes. Reported-by: Ran Rozenstein Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 21 +++++++++++++-------- 1 file changed, 13 insertions(+), 8 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 0bb1c1593ca4..3ed43f8cb029 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -397,6 +397,11 @@ static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp) return rnp->gp_tasks != NULL; } +/* Bias and limit values for ->rcu_read_lock_nesting. */ +#define RCU_NEST_BIAS INT_MAX +#define RCU_NEST_NMAX (-INT_MAX / 2) +#define RCU_NEST_PMAX (INT_MAX / 2) + /* * Preemptible RCU implementation for rcu_read_lock(). * Just increment ->rcu_read_lock_nesting, shared state will be updated @@ -405,6 +410,8 @@ static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp) void __rcu_read_lock(void) { current->rcu_read_lock_nesting++; + if (IS_ENABLED(CONFIG_PROVE_LOCKING)) + WARN_ON_ONCE(current->rcu_read_lock_nesting > RCU_NEST_PMAX); barrier(); /* critical section after entry code. */ } EXPORT_SYMBOL_GPL(__rcu_read_lock); @@ -424,20 +431,18 @@ void __rcu_read_unlock(void) --t->rcu_read_lock_nesting; } else { barrier(); /* critical section before exit code. */ - t->rcu_read_lock_nesting = INT_MIN; + t->rcu_read_lock_nesting = -RCU_NEST_BIAS; barrier(); /* assign before ->rcu_read_unlock_special load */ if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s))) rcu_read_unlock_special(t); barrier(); /* ->rcu_read_unlock_special load before assign */ t->rcu_read_lock_nesting = 0; } -#ifdef CONFIG_PROVE_LOCKING - { - int rrln = READ_ONCE(t->rcu_read_lock_nesting); + if (IS_ENABLED(CONFIG_PROVE_LOCKING)) { + int rrln = t->rcu_read_lock_nesting; - WARN_ON_ONCE(rrln < 0 && rrln > INT_MIN / 2); + WARN_ON_ONCE(rrln < 0 && rrln > RCU_NEST_NMAX); } -#endif /* #ifdef CONFIG_PROVE_LOCKING */ } EXPORT_SYMBOL_GPL(__rcu_read_unlock); @@ -617,11 +622,11 @@ static void rcu_preempt_deferred_qs(struct task_struct *t) if (!rcu_preempt_need_deferred_qs(t)) return; if (couldrecurse) - t->rcu_read_lock_nesting -= INT_MIN; + t->rcu_read_lock_nesting -= RCU_NEST_BIAS; local_irq_save(flags); rcu_preempt_deferred_qs_irqrestore(t, flags); if (couldrecurse) - t->rcu_read_lock_nesting += INT_MIN; + t->rcu_read_lock_nesting += RCU_NEST_BIAS; } /* -- cgit v1.3-7-g2ca7 From 04547728b7b775333a4f6fbb3c55102a79dc4596 Mon Sep 17 00:00:00 2001 From: Lance Roy Date: Thu, 4 Oct 2018 23:45:46 -0700 Subject: locking/mutex: Replace spin_is_locked() with lockdep lockdep_assert_held() is better suited to checking locking requirements, since it only checks if the current thread holds the lock regardless of whether someone else does. This is also a step towards possibly removing spin_is_locked(). Signed-off-by: Lance Roy Cc: Peter Zijlstra Cc: Ingo Molnar Cc: Will Deacon Signed-off-by: Paul E. McKenney --- kernel/locking/mutex-debug.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/locking/mutex-debug.c b/kernel/locking/mutex-debug.c index 9aa713629387..771d4ca96dda 100644 --- a/kernel/locking/mutex-debug.c +++ b/kernel/locking/mutex-debug.c @@ -36,7 +36,7 @@ void debug_mutex_lock_common(struct mutex *lock, struct mutex_waiter *waiter) void debug_mutex_wake_waiter(struct mutex *lock, struct mutex_waiter *waiter) { - SMP_DEBUG_LOCKS_WARN_ON(!spin_is_locked(&lock->wait_lock)); + lockdep_assert_held(&lock->wait_lock); DEBUG_LOCKS_WARN_ON(list_empty(&lock->wait_list)); DEBUG_LOCKS_WARN_ON(waiter->magic != waiter); DEBUG_LOCKS_WARN_ON(list_empty(&waiter->list)); @@ -51,7 +51,7 @@ void debug_mutex_free_waiter(struct mutex_waiter *waiter) void debug_mutex_add_waiter(struct mutex *lock, struct mutex_waiter *waiter, struct task_struct *task) { - SMP_DEBUG_LOCKS_WARN_ON(!spin_is_locked(&lock->wait_lock)); + lockdep_assert_held(&lock->wait_lock); /* Mark the current thread as blocked on the lock: */ task->blocked_on = waiter; -- cgit v1.3-7-g2ca7 From b1e3aeb11c5e86ee0988a038c4e7682d6beaa977 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Tue, 13 Nov 2018 12:03:33 -0800 Subject: cpuset: Minor cgroup2 interface updates * Rename the partition file from "cpuset.sched.partition" to "cpuset.cpus.partition". * When writing to the partition file, drop "0" and "1" and only accept "member" and "root". Signed-off-by: Tejun Heo Cc: Peter Zijlstra (Intel) Cc: Waiman Long --- Documentation/admin-guide/cgroup-v2.rst | 6 +++--- kernel/cgroup/cpuset.c | 8 ++++---- 2 files changed, 7 insertions(+), 7 deletions(-) (limited to 'kernel') diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index f83a5231bbe3..07e06136a550 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1708,15 +1708,15 @@ Cpuset Interface Files Its value will be affected by memory nodes hotplug events. - cpuset.sched.partition + cpuset.cpus.partition A read-write single value file which exists on non-root cpuset-enabled cgroups. This flag is owned by the parent cgroup and is not delegatable. It accepts only the following input values when written to. - "root" or "1" - a paritition root - "member" or "0" - a non-root member of a partition + "root" - a paritition root + "member" - a non-root member of a partition When set to be a partition root, the current cgroup is the root of a new partition or scheduling domain that comprises diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index b897314bab53..1151e93d71b6 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -2468,11 +2468,11 @@ static ssize_t sched_partition_write(struct kernfs_open_file *of, char *buf, buf = strstrip(buf); /* - * Convert "root"/"1" to 1, and convert "member"/"0" to 0. + * Convert "root" to ENABLED, and convert "member" to DISABLED. */ - if (!strcmp(buf, "root") || !strcmp(buf, "1")) + if (!strcmp(buf, "root")) val = PRS_ENABLED; - else if (!strcmp(buf, "member") || !strcmp(buf, "0")) + else if (!strcmp(buf, "member")) val = PRS_DISABLED; else return -EINVAL; @@ -2631,7 +2631,7 @@ static struct cftype dfl_files[] = { }, { - .name = "sched.partition", + .name = "cpus.partition", .seq_show = sched_partition_show, .write = sched_partition_write, .private = FILE_PARTITION_ROOT, -- cgit v1.3-7-g2ca7 From c1bbd933e5fae83f96acd3f748bb01a0300b315d Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Tue, 13 Nov 2018 12:06:41 -0800 Subject: cgroup: Add .__DEBUG__. prefix to debug file names Clearly mark the debug files and hide them by default by prefixing ".__DEBUG__.". Signed-off-by: Tejun Heo Cc: Peter Zijlstra (Intel) Cc: Waiman Long --- kernel/cgroup/cgroup.c | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index ed7f0bfe6429..e06994fd4e34 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -1400,12 +1400,15 @@ static char *cgroup_file_name(struct cgroup *cgrp, const struct cftype *cft, struct cgroup_subsys *ss = cft->ss; if (cft->ss && !(cft->flags & CFTYPE_NO_PREFIX) && - !(cgrp->root->flags & CGRP_ROOT_NOPREFIX)) - snprintf(buf, CGROUP_FILE_NAME_MAX, "%s.%s", - cgroup_on_dfl(cgrp) ? ss->name : ss->legacy_name, + !(cgrp->root->flags & CGRP_ROOT_NOPREFIX)) { + const char *dbg = (cft->flags & CFTYPE_DEBUG) ? ".__DEBUG__." : ""; + + snprintf(buf, CGROUP_FILE_NAME_MAX, "%s%s.%s", + dbg, cgroup_on_dfl(cgrp) ? ss->name : ss->legacy_name, cft->name); - else + } else { strscpy(buf, cft->name, CGROUP_FILE_NAME_MAX); + } return buf; } -- cgit v1.3-7-g2ca7 From 8ddab428730d66e86ebe90aef15d5f7c5c45fe0d Mon Sep 17 00:00:00 2001 From: Colin Ian King Date: Fri, 9 Nov 2018 13:16:39 +0000 Subject: padata: clean an indentation issue, remove extraneous space Trivial fix to clean up an indentation issue Signed-off-by: Colin Ian King Signed-off-by: Herbert Xu --- kernel/padata.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/padata.c b/kernel/padata.c index d568cc56405f..3e2633ae3bca 100644 --- a/kernel/padata.c +++ b/kernel/padata.c @@ -720,7 +720,7 @@ int padata_start(struct padata_instance *pinst) if (pinst->flags & PADATA_INVALID) err = -EINVAL; - __padata_start(pinst); + __padata_start(pinst); mutex_unlock(&pinst->lock); -- cgit v1.3-7-g2ca7 From 592ee43faf860c1f2c0a4c11838db6fdb974bb78 Mon Sep 17 00:00:00 2001 From: Colin Ian King Date: Tue, 13 Nov 2018 09:29:26 +0000 Subject: bpf: fix null pointer dereference on pointer offload Pointer offload is being null checked however the following statement dereferences the potentially null pointer offload when assigning offload->dev_state. Fix this by only assigning it if offload is not null. Detected by CoverityScan, CID#1475437 ("Dereference after null check") Fixes: 00db12c3d141 ("bpf: call verifier_prep from its callback in struct bpf_offload_dev") Signed-off-by: Colin Ian King Acked-by: Jakub Kicinski Signed-off-by: Alexei Starovoitov --- kernel/bpf/offload.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c index 52c5617e3716..54cf2b9c44a4 100644 --- a/kernel/bpf/offload.c +++ b/kernel/bpf/offload.c @@ -130,9 +130,10 @@ int bpf_prog_offload_verifier_prep(struct bpf_prog *prog) down_read(&bpf_devs_lock); offload = prog->aux->offload; - if (offload) + if (offload) { ret = offload->offdev->ops->prepare(prog); - offload->dev_state = !ret; + offload->dev_state = !ret; + } up_read(&bpf_devs_lock); return ret; -- cgit v1.3-7-g2ca7 From 0fe3c7fceb500de2d0adfb9dcf292580cd43ea38 Mon Sep 17 00:00:00 2001 From: Richard Guy Briggs Date: Fri, 16 Nov 2018 12:16:35 -0500 Subject: audit: localize audit_log_session_info prototype The audit_log_session_info() function is only used in kernel/audit*, so move its prototype to kernel/audit.h Signed-off-by: Richard Guy Briggs Signed-off-by: Paul Moore --- include/linux/audit.h | 2 -- kernel/audit.h | 2 ++ 2 files changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/include/linux/audit.h b/include/linux/audit.h index 9334fbef7bae..58cf665f597e 100644 --- a/include/linux/audit.h +++ b/include/linux/audit.h @@ -115,8 +115,6 @@ extern int audit_classify_compat_syscall(int abi, unsigned syscall); struct filename; -extern void audit_log_session_info(struct audit_buffer *ab); - #define AUDIT_OFF 0 #define AUDIT_ON 1 #define AUDIT_LOCKED 2 diff --git a/kernel/audit.h b/kernel/audit.h index 214e14948370..9a3828bd387b 100644 --- a/kernel/audit.h +++ b/kernel/audit.h @@ -210,6 +210,8 @@ struct audit_context { extern bool audit_ever_enabled; +extern void audit_log_session_info(struct audit_buffer *ab); + extern void audit_copy_inode(struct audit_names *name, const struct dentry *dentry, struct inode *inode); -- cgit v1.3-7-g2ca7 From a2c97da11cdb973b752dd434aee9636ce10ee737 Mon Sep 17 00:00:00 2001 From: Richard Guy Briggs Date: Fri, 16 Nov 2018 16:30:10 -0500 Subject: audit: use session_info helper There are still a couple of places (mark and watch config changes) that open code auid and ses fields in sequence in records instead of using the audit_log_session_info() helper. Use the helper. Adjust the helper to accommodate being the first fields. Passes audit-testsuite. Signed-off-by: Richard Guy Briggs [PM: fixed misspellings in the description] Signed-off-by: Paul Moore --- kernel/audit.c | 6 +++--- kernel/audit_fsnotify.c | 5 ++--- kernel/audit_watch.c | 5 ++--- 3 files changed, 7 insertions(+), 9 deletions(-) (limited to 'kernel') diff --git a/kernel/audit.c b/kernel/audit.c index 2a8058764aa6..6c53e373b828 100644 --- a/kernel/audit.c +++ b/kernel/audit.c @@ -400,7 +400,7 @@ static int audit_log_config_change(char *function_name, u32 new, u32 old, ab = audit_log_start(NULL, GFP_KERNEL, AUDIT_CONFIG_CHANGE); if (unlikely(!ab)) return rc; - audit_log_format(ab, "%s=%u old=%u", function_name, new, old); + audit_log_format(ab, "%s=%u old=%u ", function_name, new, old); audit_log_session_info(ab); rc = audit_log_task_context(ab); if (rc) @@ -1067,7 +1067,7 @@ static void audit_log_common_recv_msg(struct audit_buffer **ab, u16 msg_type) *ab = audit_log_start(NULL, GFP_KERNEL, msg_type); if (unlikely(!*ab)) return; - audit_log_format(*ab, "pid=%d uid=%u", pid, uid); + audit_log_format(*ab, "pid=%d uid=%u ", pid, uid); audit_log_session_info(*ab); audit_log_task_context(*ab); } @@ -2042,7 +2042,7 @@ void audit_log_session_info(struct audit_buffer *ab) unsigned int sessionid = audit_get_sessionid(current); uid_t auid = from_kuid(&init_user_ns, audit_get_loginuid(current)); - audit_log_format(ab, " auid=%u ses=%u", auid, sessionid); + audit_log_format(ab, "auid=%u ses=%u", auid, sessionid); } void audit_log_key(struct audit_buffer *ab, char *key) diff --git a/kernel/audit_fsnotify.c b/kernel/audit_fsnotify.c index fba78047fb37..f90ffa699e5b 100644 --- a/kernel/audit_fsnotify.c +++ b/kernel/audit_fsnotify.c @@ -130,9 +130,8 @@ static void audit_mark_log_rule_change(struct audit_fsnotify_mark *audit_mark, c ab = audit_log_start(NULL, GFP_NOFS, AUDIT_CONFIG_CHANGE); if (unlikely(!ab)) return; - audit_log_format(ab, "auid=%u ses=%u op=%s", - from_kuid(&init_user_ns, audit_get_loginuid(current)), - audit_get_sessionid(current), op); + audit_log_session_info(ab); + audit_log_format(ab, " op=%s", op); audit_log_format(ab, " path="); audit_log_untrustedstring(ab, audit_mark->path); audit_log_key(ab, rule->filterkey); diff --git a/kernel/audit_watch.c b/kernel/audit_watch.c index 787c7afdf829..568e48d1d0ab 100644 --- a/kernel/audit_watch.c +++ b/kernel/audit_watch.c @@ -245,9 +245,8 @@ static void audit_watch_log_rule_change(struct audit_krule *r, struct audit_watc ab = audit_log_start(NULL, GFP_NOFS, AUDIT_CONFIG_CHANGE); if (!ab) return; - audit_log_format(ab, "auid=%u ses=%u op=%s", - from_kuid(&init_user_ns, audit_get_loginuid(current)), - audit_get_sessionid(current), op); + audit_log_session_info(ab); + audit_log_format(ab, "op=%s", op); audit_log_format(ab, " path="); audit_log_untrustedstring(ab, w->path); audit_log_key(ab, r->filterkey); -- cgit v1.3-7-g2ca7 From c8fc5d49c341805fee7fc295f2ea8a709f78aec4 Mon Sep 17 00:00:00 2001 From: Richard Guy Briggs Date: Fri, 16 Nov 2018 12:17:35 -0500 Subject: audit: remove WATCH and TREE config options Remove the CONFIG_AUDIT_WATCH and CONFIG_AUDIT_TREE config options since they are both dependent on CONFIG_AUDITSYSCALL and force CONFIG_FSNOTIFY. Signed-off-by: Richard Guy Briggs Signed-off-by: Paul Moore --- init/Kconfig | 9 --------- kernel/Makefile | 4 +--- kernel/audit.h | 6 +++--- kernel/auditsc.c | 10 ---------- 4 files changed, 4 insertions(+), 25 deletions(-) (limited to 'kernel') diff --git a/init/Kconfig b/init/Kconfig index a4112e95724a..7eb2538e6ca0 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -335,15 +335,6 @@ config HAVE_ARCH_AUDITSYSCALL config AUDITSYSCALL def_bool y depends on AUDIT && HAVE_ARCH_AUDITSYSCALL - -config AUDIT_WATCH - def_bool y - depends on AUDITSYSCALL - select FSNOTIFY - -config AUDIT_TREE - def_bool y - depends on AUDITSYSCALL select FSNOTIFY source "kernel/irq/Kconfig" diff --git a/kernel/Makefile b/kernel/Makefile index 7343b3a9bff0..9dc7f519129d 100644 --- a/kernel/Makefile +++ b/kernel/Makefile @@ -76,9 +76,7 @@ obj-$(CONFIG_IKCONFIG) += configs.o obj-$(CONFIG_SMP) += stop_machine.o obj-$(CONFIG_KPROBES_SANITY_TEST) += test_kprobes.o obj-$(CONFIG_AUDIT) += audit.o auditfilter.o -obj-$(CONFIG_AUDITSYSCALL) += auditsc.o -obj-$(CONFIG_AUDIT_WATCH) += audit_watch.o audit_fsnotify.o -obj-$(CONFIG_AUDIT_TREE) += audit_tree.o +obj-$(CONFIG_AUDITSYSCALL) += auditsc.o audit_watch.o audit_fsnotify.o audit_tree.o obj-$(CONFIG_GCOV_KERNEL) += gcov/ obj-$(CONFIG_KCOV) += kcov.o obj-$(CONFIG_KPROBES) += kprobes.o diff --git a/kernel/audit.h b/kernel/audit.h index 9a3828bd387b..0b5295aeaebb 100644 --- a/kernel/audit.h +++ b/kernel/audit.h @@ -268,7 +268,7 @@ extern struct tty_struct *audit_get_tty(struct task_struct *tsk); extern void audit_put_tty(struct tty_struct *tty); /* audit watch functions */ -#ifdef CONFIG_AUDIT_WATCH +#ifdef CONFIG_AUDITSYSCALL extern void audit_put_watch(struct audit_watch *watch); extern void audit_get_watch(struct audit_watch *watch); extern int audit_to_watch(struct audit_krule *krule, char *path, int len, u32 op); @@ -301,9 +301,9 @@ extern int audit_exe_compare(struct task_struct *tsk, struct audit_fsnotify_mark #define audit_mark_compare(m, i, d) 0 #define audit_exe_compare(t, m) (-EINVAL) #define audit_dupe_exe(n, o) (-EINVAL) -#endif /* CONFIG_AUDIT_WATCH */ +#endif /* CONFIG_AUDITSYSCALL */ -#ifdef CONFIG_AUDIT_TREE +#ifdef CONFIG_AUDITSYSCALL extern struct audit_chunk *audit_tree_lookup(const struct inode *inode); extern void audit_put_chunk(struct audit_chunk *chunk); extern bool audit_tree_match(struct audit_chunk *chunk, struct audit_tree *tree); diff --git a/kernel/auditsc.c b/kernel/auditsc.c index 1513873e23bd..605f2d825204 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -200,7 +200,6 @@ static int audit_match_filetype(struct audit_context *ctx, int val) * References in it _are_ dropped - at the same time we free/drop aux stuff. */ -#ifdef CONFIG_AUDIT_TREE static void audit_set_auditable(struct audit_context *ctx) { if (!ctx->prio) { @@ -245,12 +244,10 @@ static int grow_tree_refs(struct audit_context *ctx) ctx->tree_count = 31; return 1; } -#endif static void unroll_tree_refs(struct audit_context *ctx, struct audit_tree_refs *p, int count) { -#ifdef CONFIG_AUDIT_TREE struct audit_tree_refs *q; int n; if (!p) { @@ -274,7 +271,6 @@ static void unroll_tree_refs(struct audit_context *ctx, } ctx->trees = p; ctx->tree_count = count; -#endif } static void free_tree_refs(struct audit_context *ctx) @@ -288,7 +284,6 @@ static void free_tree_refs(struct audit_context *ctx) static int match_tree_refs(struct audit_context *ctx, struct audit_tree *tree) { -#ifdef CONFIG_AUDIT_TREE struct audit_tree_refs *p; int n; if (!tree) @@ -305,7 +300,6 @@ static int match_tree_refs(struct audit_context *ctx, struct audit_tree *tree) if (audit_tree_match(p->c[n], tree)) return 1; } -#endif return 0; } @@ -1602,7 +1596,6 @@ void __audit_syscall_exit(int success, long return_code) static inline void handle_one(const struct inode *inode) { -#ifdef CONFIG_AUDIT_TREE struct audit_context *context; struct audit_tree_refs *p; struct audit_chunk *chunk; @@ -1627,12 +1620,10 @@ static inline void handle_one(const struct inode *inode) return; } put_tree_ref(context, chunk); -#endif } static void handle_path(const struct dentry *dentry) { -#ifdef CONFIG_AUDIT_TREE struct audit_context *context; struct audit_tree_refs *p; const struct dentry *d, *parent; @@ -1685,7 +1676,6 @@ retry: return; } rcu_read_unlock(); -#endif } static struct audit_names *audit_alloc_name(struct audit_context *context, -- cgit v1.3-7-g2ca7 From 96b3b6c9091d23289721350e32c63cc8749686be Mon Sep 17 00:00:00 2001 From: Lorenz Bauer Date: Fri, 16 Nov 2018 11:41:08 +0000 Subject: bpf: allow zero-initializing hash map seed Add a new flag BPF_F_ZERO_SEED, which forces a hash map to initialize the seed to zero. This is useful when doing performance analysis both on individual BPF programs, as well as the kernel's hash table implementation. Signed-off-by: Lorenz Bauer Signed-off-by: Daniel Borkmann --- include/uapi/linux/bpf.h | 3 +++ kernel/bpf/hashtab.c | 13 +++++++++++-- 2 files changed, 14 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 47d606d744cc..8c01b89a4cb4 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -269,6 +269,9 @@ enum bpf_attach_type { /* Flag for stack_map, store build_id+offset instead of pointer */ #define BPF_F_STACK_BUILD_ID (1U << 5) +/* Zero-initialize hash function seed. This should only be used for testing. */ +#define BPF_F_ZERO_SEED (1U << 6) + enum bpf_stack_build_id_status { /* user space need an empty entry to identify end of a trace */ BPF_STACK_BUILD_ID_EMPTY = 0, diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c index 2c1790288138..4b7c76765d9d 100644 --- a/kernel/bpf/hashtab.c +++ b/kernel/bpf/hashtab.c @@ -23,7 +23,7 @@ #define HTAB_CREATE_FLAG_MASK \ (BPF_F_NO_PREALLOC | BPF_F_NO_COMMON_LRU | BPF_F_NUMA_NODE | \ - BPF_F_RDONLY | BPF_F_WRONLY) + BPF_F_RDONLY | BPF_F_WRONLY | BPF_F_ZERO_SEED) struct bucket { struct hlist_nulls_head head; @@ -244,6 +244,7 @@ static int htab_map_alloc_check(union bpf_attr *attr) */ bool percpu_lru = (attr->map_flags & BPF_F_NO_COMMON_LRU); bool prealloc = !(attr->map_flags & BPF_F_NO_PREALLOC); + bool zero_seed = (attr->map_flags & BPF_F_ZERO_SEED); int numa_node = bpf_map_attr_numa_node(attr); BUILD_BUG_ON(offsetof(struct htab_elem, htab) != @@ -257,6 +258,10 @@ static int htab_map_alloc_check(union bpf_attr *attr) */ return -EPERM; + if (zero_seed && !capable(CAP_SYS_ADMIN)) + /* Guard against local DoS, and discourage production use. */ + return -EPERM; + if (attr->map_flags & ~HTAB_CREATE_FLAG_MASK) /* reserved bits should not be used */ return -EINVAL; @@ -373,7 +378,11 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr) if (!htab->buckets) goto free_htab; - htab->hashrnd = get_random_int(); + if (htab->map.map_flags & BPF_F_ZERO_SEED) + htab->hashrnd = 0; + else + htab->hashrnd = get_random_int(); + for (i = 0; i < htab->n_buckets; i++) { INIT_HLIST_NULLS_HEAD(&htab->buckets[i].head, i); raw_spin_lock_init(&htab->buckets[i].lock); -- cgit v1.3-7-g2ca7 From e9d81a1bc2c48ea9782e3e8b53875f419766ef47 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Thu, 8 Nov 2018 12:15:15 -0800 Subject: cgroup: fix CSS_TASK_ITER_PROCS CSS_TASK_ITER_PROCS implements process-only iteration by making css_task_iter_advance() skip tasks which aren't threadgroup leaders; however, when an iteration is started css_task_iter_start() calls the inner helper function css_task_iter_advance_css_set() instead of css_task_iter_advance(). As the helper doesn't have the skip logic, when the first task to visit is a non-leader thread, it doesn't get skipped correctly as shown in the following example. # ps -L 2030 PID LWP TTY STAT TIME COMMAND 2030 2030 pts/0 Sl+ 0:00 ./test-thread 2030 2031 pts/0 Sl+ 0:00 ./test-thread # mkdir -p /sys/fs/cgroup/x/a/b # echo threaded > /sys/fs/cgroup/x/a/cgroup.type # echo threaded > /sys/fs/cgroup/x/a/b/cgroup.type # echo 2030 > /sys/fs/cgroup/x/a/cgroup.procs # cat /sys/fs/cgroup/x/a/cgroup.threads 2030 2031 # cat /sys/fs/cgroup/x/cgroup.procs 2030 # echo 2030 > /sys/fs/cgroup/x/a/b/cgroup.threads # cat /sys/fs/cgroup/x/cgroup.procs 2031 2030 The last read of cgroup.procs is incorrectly showing non-leader 2031 in cgroup.procs output. This can be fixed by updating css_task_iter_advance() to handle the first advance and css_task_iters_tart() to call css_task_iter_advance() instead of the inner helper. After the fix, the same commands result in the following (correct) result: # ps -L 2062 PID LWP TTY STAT TIME COMMAND 2062 2062 pts/0 Sl+ 0:00 ./test-thread 2062 2063 pts/0 Sl+ 0:00 ./test-thread # mkdir -p /sys/fs/cgroup/x/a/b # echo threaded > /sys/fs/cgroup/x/a/cgroup.type # echo threaded > /sys/fs/cgroup/x/a/b/cgroup.type # echo 2062 > /sys/fs/cgroup/x/a/cgroup.procs # cat /sys/fs/cgroup/x/a/cgroup.threads 2062 2063 # cat /sys/fs/cgroup/x/cgroup.procs 2062 # echo 2062 > /sys/fs/cgroup/x/a/b/cgroup.threads # cat /sys/fs/cgroup/x/cgroup.procs 2062 Signed-off-by: Tejun Heo Reported-by: "Michael Kerrisk (man-pages)" Fixes: 8cfd8147df67 ("cgroup: implement cgroup v2 thread support") Cc: stable@vger.kernel.org # v4.14+ --- kernel/cgroup/cgroup.c | 29 +++++++++++++++++------------ 1 file changed, 17 insertions(+), 12 deletions(-) (limited to 'kernel') diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 6aaf5dd5383b..1f84977fab47 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -4202,20 +4202,25 @@ static void css_task_iter_advance(struct css_task_iter *it) lockdep_assert_held(&css_set_lock); repeat: - /* - * Advance iterator to find next entry. cset->tasks is consumed - * first and then ->mg_tasks. After ->mg_tasks, we move onto the - * next cset. - */ - next = it->task_pos->next; + if (it->task_pos) { + /* + * Advance iterator to find next entry. cset->tasks is + * consumed first and then ->mg_tasks. After ->mg_tasks, + * we move onto the next cset. + */ + next = it->task_pos->next; - if (next == it->tasks_head) - next = it->mg_tasks_head->next; + if (next == it->tasks_head) + next = it->mg_tasks_head->next; - if (next == it->mg_tasks_head) + if (next == it->mg_tasks_head) + css_task_iter_advance_css_set(it); + else + it->task_pos = next; + } else { + /* called from start, proceed to the first cset */ css_task_iter_advance_css_set(it); - else - it->task_pos = next; + } /* if PROCS, skip over tasks which aren't group leaders */ if ((it->flags & CSS_TASK_ITER_PROCS) && it->task_pos && @@ -4255,7 +4260,7 @@ void css_task_iter_start(struct cgroup_subsys_state *css, unsigned int flags, it->cset_head = it->cset_pos; - css_task_iter_advance_css_set(it); + css_task_iter_advance(it); spin_unlock_irq(&css_set_lock); } -- cgit v1.3-7-g2ca7 From b47a0bd23e34022aa0d4b812fcebe85cb0c54d49 Mon Sep 17 00:00:00 2001 From: Martin KaFai Lau Date: Mon, 19 Nov 2018 15:29:06 -0800 Subject: bpf: btf: Break up btf_type_is_void() This patch breaks up btf_type_is_void() into btf_type_is_void() and btf_type_is_fwd(). It also adds btf_type_nosize() to better describe it is testing a type has nosize info. Signed-off-by: Martin KaFai Lau Signed-off-by: Alexei Starovoitov --- kernel/bpf/btf.c | 37 ++++++++++++++++++++++--------------- 1 file changed, 22 insertions(+), 15 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index ee4c82667d65..2a50d87de485 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -306,15 +306,22 @@ static bool btf_type_is_modifier(const struct btf_type *t) static bool btf_type_is_void(const struct btf_type *t) { - /* void => no type and size info. - * Hence, FWD is also treated as void. - */ - return t == &btf_void || BTF_INFO_KIND(t->info) == BTF_KIND_FWD; + return t == &btf_void; +} + +static bool btf_type_is_fwd(const struct btf_type *t) +{ + return BTF_INFO_KIND(t->info) == BTF_KIND_FWD; +} + +static bool btf_type_nosize(const struct btf_type *t) +{ + return btf_type_is_void(t) || btf_type_is_fwd(t); } -static bool btf_type_is_void_or_null(const struct btf_type *t) +static bool btf_type_nosize_or_null(const struct btf_type *t) { - return !t || btf_type_is_void(t); + return !t || btf_type_nosize(t); } /* union is only a special case of struct: @@ -826,7 +833,7 @@ const struct btf_type *btf_type_id_size(const struct btf *btf, u32 size = 0; size_type = btf_type_by_id(btf, size_type_id); - if (btf_type_is_void_or_null(size_type)) + if (btf_type_nosize_or_null(size_type)) return NULL; if (btf_type_has_size(size_type)) { @@ -842,7 +849,7 @@ const struct btf_type *btf_type_id_size(const struct btf *btf, size = btf->resolved_sizes[size_type_id]; size_type_id = btf->resolved_ids[size_type_id]; size_type = btf_type_by_id(btf, size_type_id); - if (btf_type_is_void(size_type)) + if (btf_type_nosize_or_null(size_type)) return NULL; } @@ -1164,7 +1171,7 @@ static int btf_modifier_resolve(struct btf_verifier_env *env, } /* "typedef void new_void", "const void"...etc */ - if (btf_type_is_void(next_type)) + if (btf_type_is_void(next_type) || btf_type_is_fwd(next_type)) goto resolved; if (!env_type_is_resolve_sink(env, next_type) && @@ -1178,7 +1185,7 @@ static int btf_modifier_resolve(struct btf_verifier_env *env, * pretty print). */ if (!btf_type_id_size(btf, &next_type_id, &next_type_size) && - !btf_type_is_void(btf_type_id_resolve(btf, &next_type_id))) { + !btf_type_nosize(btf_type_id_resolve(btf, &next_type_id))) { btf_verifier_log_type(env, v->t, "Invalid type_id"); return -EINVAL; } @@ -1205,7 +1212,7 @@ static int btf_ptr_resolve(struct btf_verifier_env *env, } /* "void *" */ - if (btf_type_is_void(next_type)) + if (btf_type_is_void(next_type) || btf_type_is_fwd(next_type)) goto resolved; if (!env_type_is_resolve_sink(env, next_type) && @@ -1235,7 +1242,7 @@ static int btf_ptr_resolve(struct btf_verifier_env *env, } if (!btf_type_id_size(btf, &next_type_id, &next_type_size) && - !btf_type_is_void(btf_type_id_resolve(btf, &next_type_id))) { + !btf_type_nosize(btf_type_id_resolve(btf, &next_type_id))) { btf_verifier_log_type(env, v->t, "Invalid type_id"); return -EINVAL; } @@ -1396,7 +1403,7 @@ static int btf_array_resolve(struct btf_verifier_env *env, /* Check array->index_type */ index_type_id = array->index_type; index_type = btf_type_by_id(btf, index_type_id); - if (btf_type_is_void_or_null(index_type)) { + if (btf_type_nosize_or_null(index_type)) { btf_verifier_log_type(env, v->t, "Invalid index"); return -EINVAL; } @@ -1415,7 +1422,7 @@ static int btf_array_resolve(struct btf_verifier_env *env, /* Check array->type */ elem_type_id = array->type; elem_type = btf_type_by_id(btf, elem_type_id); - if (btf_type_is_void_or_null(elem_type)) { + if (btf_type_nosize_or_null(elem_type)) { btf_verifier_log_type(env, v->t, "Invalid elem"); return -EINVAL; @@ -1615,7 +1622,7 @@ static int btf_struct_resolve(struct btf_verifier_env *env, const struct btf_type *member_type = btf_type_by_id(env->btf, member_type_id); - if (btf_type_is_void_or_null(member_type)) { + if (btf_type_nosize_or_null(member_type)) { btf_verifier_log_member(env, v->t, member, "Invalid member"); return -EINVAL; -- cgit v1.3-7-g2ca7 From 2667a2626f4da370409c2830552f6e8c8b8c41e2 Mon Sep 17 00:00:00 2001 From: Martin KaFai Lau Date: Mon, 19 Nov 2018 15:29:08 -0800 Subject: bpf: btf: Add BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO This patch adds BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO to support the function debug info. BTF_KIND_FUNC_PROTO must not have a name (i.e. !t->name_off) and it is followed by >= 0 'struct bpf_param' objects to describe the function arguments. The BTF_KIND_FUNC must have a valid name and it must refer back to a BTF_KIND_FUNC_PROTO. The above is the conclusion after the discussion between Edward Cree, Alexei, Daniel, Yonghong and Martin. By combining BTF_KIND_FUNC and BTF_LIND_FUNC_PROTO, a complete function signature can be obtained. It will be used in the later patches to learn the function signature of a running bpf program. Signed-off-by: Martin KaFai Lau Signed-off-by: Yonghong Song Signed-off-by: Alexei Starovoitov --- include/uapi/linux/btf.h | 18 ++- kernel/bpf/btf.c | 389 +++++++++++++++++++++++++++++++++++++++++------ 2 files changed, 354 insertions(+), 53 deletions(-) (limited to 'kernel') diff --git a/include/uapi/linux/btf.h b/include/uapi/linux/btf.h index 972265f32871..14f66948fc95 100644 --- a/include/uapi/linux/btf.h +++ b/include/uapi/linux/btf.h @@ -40,7 +40,8 @@ struct btf_type { /* "size" is used by INT, ENUM, STRUCT and UNION. * "size" tells the size of the type it is describing. * - * "type" is used by PTR, TYPEDEF, VOLATILE, CONST and RESTRICT. + * "type" is used by PTR, TYPEDEF, VOLATILE, CONST, RESTRICT, + * FUNC and FUNC_PROTO. * "type" is a type_id referring to another type. */ union { @@ -64,8 +65,10 @@ struct btf_type { #define BTF_KIND_VOLATILE 9 /* Volatile */ #define BTF_KIND_CONST 10 /* Const */ #define BTF_KIND_RESTRICT 11 /* Restrict */ -#define BTF_KIND_MAX 11 -#define NR_BTF_KINDS 12 +#define BTF_KIND_FUNC 12 /* Function */ +#define BTF_KIND_FUNC_PROTO 13 /* Function Proto */ +#define BTF_KIND_MAX 13 +#define NR_BTF_KINDS 14 /* For some specific BTF_KIND, "struct btf_type" is immediately * followed by extra data. @@ -110,4 +113,13 @@ struct btf_member { __u32 offset; /* offset in bits */ }; +/* BTF_KIND_FUNC_PROTO is followed by multiple "struct btf_param". + * The exact number of btf_param is stored in the vlen (of the + * info in "struct btf_type"). + */ +struct btf_param { + __u32 name_off; + __u32 type; +}; + #endif /* _UAPI__LINUX_BTF_H__ */ diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index 2a50d87de485..6a2be79b73fc 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -5,6 +5,7 @@ #include #include #include +#include #include #include #include @@ -259,6 +260,8 @@ static const char * const btf_kind_str[NR_BTF_KINDS] = { [BTF_KIND_VOLATILE] = "VOLATILE", [BTF_KIND_CONST] = "CONST", [BTF_KIND_RESTRICT] = "RESTRICT", + [BTF_KIND_FUNC] = "FUNC", + [BTF_KIND_FUNC_PROTO] = "FUNC_PROTO", }; struct btf_kind_operations { @@ -281,6 +284,9 @@ struct btf_kind_operations { static const struct btf_kind_operations * const kind_ops[NR_BTF_KINDS]; static struct btf_type btf_void; +static int btf_resolve(struct btf_verifier_env *env, + const struct btf_type *t, u32 type_id); + static bool btf_type_is_modifier(const struct btf_type *t) { /* Some of them is not strictly a C modifier @@ -314,9 +320,20 @@ static bool btf_type_is_fwd(const struct btf_type *t) return BTF_INFO_KIND(t->info) == BTF_KIND_FWD; } +static bool btf_type_is_func(const struct btf_type *t) +{ + return BTF_INFO_KIND(t->info) == BTF_KIND_FUNC; +} + +static bool btf_type_is_func_proto(const struct btf_type *t) +{ + return BTF_INFO_KIND(t->info) == BTF_KIND_FUNC_PROTO; +} + static bool btf_type_nosize(const struct btf_type *t) { - return btf_type_is_void(t) || btf_type_is_fwd(t); + return btf_type_is_void(t) || btf_type_is_fwd(t) || + btf_type_is_func(t) || btf_type_is_func_proto(t); } static bool btf_type_nosize_or_null(const struct btf_type *t) @@ -433,6 +450,30 @@ static bool btf_name_offset_valid(const struct btf *btf, u32 offset) offset < btf->hdr.str_len; } +/* Only C-style identifier is permitted. This can be relaxed if + * necessary. + */ +static bool btf_name_valid_identifier(const struct btf *btf, u32 offset) +{ + /* offset must be valid */ + const char *src = &btf->strings[offset]; + const char *src_limit; + + if (!isalpha(*src) && *src != '_') + return false; + + /* set a limit on identifier length */ + src_limit = src + KSYM_NAME_LEN; + src++; + while (*src && src < src_limit) { + if (!isalnum(*src) && *src != '_') + return false; + src++; + } + + return !*src; +} + static const char *btf_name_by_offset(const struct btf *btf, u32 offset) { if (!offset) @@ -747,11 +788,15 @@ static bool env_type_is_resolve_sink(const struct btf_verifier_env *env, /* int, enum or void is a sink */ return !btf_type_needs_resolve(next_type); case RESOLVE_PTR: - /* int, enum, void, struct or array is a sink for ptr */ + /* int, enum, void, struct, array, func or func_proto is a sink + * for ptr + */ return !btf_type_is_modifier(next_type) && !btf_type_is_ptr(next_type); case RESOLVE_STRUCT_OR_ARRAY: - /* int, enum, void or ptr is a sink for struct and array */ + /* int, enum, void, ptr, func or func_proto is a sink + * for struct and array + */ return !btf_type_is_modifier(next_type) && !btf_type_is_array(next_type) && !btf_type_is_struct(next_type); @@ -1170,10 +1215,6 @@ static int btf_modifier_resolve(struct btf_verifier_env *env, return -EINVAL; } - /* "typedef void new_void", "const void"...etc */ - if (btf_type_is_void(next_type) || btf_type_is_fwd(next_type)) - goto resolved; - if (!env_type_is_resolve_sink(env, next_type) && !env_type_is_resolved(env, next_type_id)) return env_stack_push(env, next_type, next_type_id); @@ -1184,13 +1225,18 @@ static int btf_modifier_resolve(struct btf_verifier_env *env, * save us a few type-following when we use it later (e.g. in * pretty print). */ - if (!btf_type_id_size(btf, &next_type_id, &next_type_size) && - !btf_type_nosize(btf_type_id_resolve(btf, &next_type_id))) { - btf_verifier_log_type(env, v->t, "Invalid type_id"); - return -EINVAL; + if (!btf_type_id_size(btf, &next_type_id, &next_type_size)) { + if (env_type_is_resolved(env, next_type_id)) + next_type = btf_type_id_resolve(btf, &next_type_id); + + /* "typedef void new_void", "const void"...etc */ + if (!btf_type_is_void(next_type) && + !btf_type_is_fwd(next_type)) { + btf_verifier_log_type(env, v->t, "Invalid type_id"); + return -EINVAL; + } } -resolved: env_stack_pop_resolved(env, next_type_id, next_type_size); return 0; @@ -1203,7 +1249,6 @@ static int btf_ptr_resolve(struct btf_verifier_env *env, const struct btf_type *t = v->t; u32 next_type_id = t->type; struct btf *btf = env->btf; - u32 next_type_size = 0; next_type = btf_type_by_id(btf, next_type_id); if (!next_type) { @@ -1211,10 +1256,6 @@ static int btf_ptr_resolve(struct btf_verifier_env *env, return -EINVAL; } - /* "void *" */ - if (btf_type_is_void(next_type) || btf_type_is_fwd(next_type)) - goto resolved; - if (!env_type_is_resolve_sink(env, next_type) && !env_type_is_resolved(env, next_type_id)) return env_stack_push(env, next_type, next_type_id); @@ -1241,13 +1282,18 @@ static int btf_ptr_resolve(struct btf_verifier_env *env, resolved_type_id); } - if (!btf_type_id_size(btf, &next_type_id, &next_type_size) && - !btf_type_nosize(btf_type_id_resolve(btf, &next_type_id))) { - btf_verifier_log_type(env, v->t, "Invalid type_id"); - return -EINVAL; + if (!btf_type_id_size(btf, &next_type_id, NULL)) { + if (env_type_is_resolved(env, next_type_id)) + next_type = btf_type_id_resolve(btf, &next_type_id); + + if (!btf_type_is_void(next_type) && + !btf_type_is_fwd(next_type) && + !btf_type_is_func_proto(next_type)) { + btf_verifier_log_type(env, v->t, "Invalid type_id"); + return -EINVAL; + } } -resolved: env_stack_pop_resolved(env, next_type_id, 0); return 0; @@ -1787,6 +1833,232 @@ static struct btf_kind_operations enum_ops = { .seq_show = btf_enum_seq_show, }; +static s32 btf_func_proto_check_meta(struct btf_verifier_env *env, + const struct btf_type *t, + u32 meta_left) +{ + u32 meta_needed = btf_type_vlen(t) * sizeof(struct btf_param); + + if (meta_left < meta_needed) { + btf_verifier_log_basic(env, t, + "meta_left:%u meta_needed:%u", + meta_left, meta_needed); + return -EINVAL; + } + + if (t->name_off) { + btf_verifier_log_type(env, t, "Invalid name"); + return -EINVAL; + } + + btf_verifier_log_type(env, t, NULL); + + return meta_needed; +} + +static void btf_func_proto_log(struct btf_verifier_env *env, + const struct btf_type *t) +{ + const struct btf_param *args = (const struct btf_param *)(t + 1); + u16 nr_args = btf_type_vlen(t), i; + + btf_verifier_log(env, "return=%u args=(", t->type); + if (!nr_args) { + btf_verifier_log(env, "void"); + goto done; + } + + if (nr_args == 1 && !args[0].type) { + /* Only one vararg */ + btf_verifier_log(env, "vararg"); + goto done; + } + + btf_verifier_log(env, "%u %s", args[0].type, + btf_name_by_offset(env->btf, + args[0].name_off)); + for (i = 1; i < nr_args - 1; i++) + btf_verifier_log(env, ", %u %s", args[i].type, + btf_name_by_offset(env->btf, + args[i].name_off)); + + if (nr_args > 1) { + const struct btf_param *last_arg = &args[nr_args - 1]; + + if (last_arg->type) + btf_verifier_log(env, ", %u %s", last_arg->type, + btf_name_by_offset(env->btf, + last_arg->name_off)); + else + btf_verifier_log(env, ", vararg"); + } + +done: + btf_verifier_log(env, ")"); +} + +static struct btf_kind_operations func_proto_ops = { + .check_meta = btf_func_proto_check_meta, + .resolve = btf_df_resolve, + /* + * BTF_KIND_FUNC_PROTO cannot be directly referred by + * a struct's member. + * + * It should be a funciton pointer instead. + * (i.e. struct's member -> BTF_KIND_PTR -> BTF_KIND_FUNC_PROTO) + * + * Hence, there is no btf_func_check_member(). + */ + .check_member = btf_df_check_member, + .log_details = btf_func_proto_log, + .seq_show = btf_df_seq_show, +}; + +static s32 btf_func_check_meta(struct btf_verifier_env *env, + const struct btf_type *t, + u32 meta_left) +{ + if (!t->name_off || + !btf_name_valid_identifier(env->btf, t->name_off)) { + btf_verifier_log_type(env, t, "Invalid name"); + return -EINVAL; + } + + if (btf_type_vlen(t)) { + btf_verifier_log_type(env, t, "vlen != 0"); + return -EINVAL; + } + + btf_verifier_log_type(env, t, NULL); + + return 0; +} + +static struct btf_kind_operations func_ops = { + .check_meta = btf_func_check_meta, + .resolve = btf_df_resolve, + .check_member = btf_df_check_member, + .log_details = btf_ref_type_log, + .seq_show = btf_df_seq_show, +}; + +static int btf_func_proto_check(struct btf_verifier_env *env, + const struct btf_type *t) +{ + const struct btf_type *ret_type; + const struct btf_param *args; + const struct btf *btf; + u16 nr_args, i; + int err; + + btf = env->btf; + args = (const struct btf_param *)(t + 1); + nr_args = btf_type_vlen(t); + + /* Check func return type which could be "void" (t->type == 0) */ + if (t->type) { + u32 ret_type_id = t->type; + + ret_type = btf_type_by_id(btf, ret_type_id); + if (!ret_type) { + btf_verifier_log_type(env, t, "Invalid return type"); + return -EINVAL; + } + + if (btf_type_needs_resolve(ret_type) && + !env_type_is_resolved(env, ret_type_id)) { + err = btf_resolve(env, ret_type, ret_type_id); + if (err) + return err; + } + + /* Ensure the return type is a type that has a size */ + if (!btf_type_id_size(btf, &ret_type_id, NULL)) { + btf_verifier_log_type(env, t, "Invalid return type"); + return -EINVAL; + } + } + + if (!nr_args) + return 0; + + /* Last func arg type_id could be 0 if it is a vararg */ + if (!args[nr_args - 1].type) { + if (args[nr_args - 1].name_off) { + btf_verifier_log_type(env, t, "Invalid arg#%u", + nr_args); + return -EINVAL; + } + nr_args--; + } + + err = 0; + for (i = 0; i < nr_args; i++) { + const struct btf_type *arg_type; + u32 arg_type_id; + + arg_type_id = args[i].type; + arg_type = btf_type_by_id(btf, arg_type_id); + if (!arg_type) { + btf_verifier_log_type(env, t, "Invalid arg#%u", i + 1); + err = -EINVAL; + break; + } + + if (args[i].name_off && + (!btf_name_offset_valid(btf, args[i].name_off) || + !btf_name_valid_identifier(btf, args[i].name_off))) { + btf_verifier_log_type(env, t, + "Invalid arg#%u", i + 1); + err = -EINVAL; + break; + } + + if (btf_type_needs_resolve(arg_type) && + !env_type_is_resolved(env, arg_type_id)) { + err = btf_resolve(env, arg_type, arg_type_id); + if (err) + break; + } + + if (!btf_type_id_size(btf, &arg_type_id, NULL)) { + btf_verifier_log_type(env, t, "Invalid arg#%u", i + 1); + err = -EINVAL; + break; + } + } + + return err; +} + +static int btf_func_check(struct btf_verifier_env *env, + const struct btf_type *t) +{ + const struct btf_type *proto_type; + const struct btf_param *args; + const struct btf *btf; + u16 nr_args, i; + + btf = env->btf; + proto_type = btf_type_by_id(btf, t->type); + + if (!proto_type || !btf_type_is_func_proto(proto_type)) { + btf_verifier_log_type(env, t, "Invalid type_id"); + return -EINVAL; + } + + args = (const struct btf_param *)(proto_type + 1); + nr_args = btf_type_vlen(proto_type); + for (i = 0; i < nr_args; i++) { + if (!args[i].name_off && args[i].type) { + btf_verifier_log_type(env, t, "Invalid arg#%u", i + 1); + return -EINVAL; + } + } + + return 0; +} + static const struct btf_kind_operations * const kind_ops[NR_BTF_KINDS] = { [BTF_KIND_INT] = &int_ops, [BTF_KIND_PTR] = &ptr_ops, @@ -1799,6 +2071,8 @@ static const struct btf_kind_operations * const kind_ops[NR_BTF_KINDS] = { [BTF_KIND_VOLATILE] = &modifier_ops, [BTF_KIND_CONST] = &modifier_ops, [BTF_KIND_RESTRICT] = &modifier_ops, + [BTF_KIND_FUNC] = &func_ops, + [BTF_KIND_FUNC_PROTO] = &func_proto_ops, }; static s32 btf_check_meta(struct btf_verifier_env *env, @@ -1870,30 +2144,6 @@ static int btf_check_all_metas(struct btf_verifier_env *env) return 0; } -static int btf_resolve(struct btf_verifier_env *env, - const struct btf_type *t, u32 type_id) -{ - const struct resolve_vertex *v; - int err = 0; - - env->resolve_mode = RESOLVE_TBD; - env_stack_push(env, t, type_id); - while (!err && (v = env_stack_peak(env))) { - env->log_type_id = v->type_id; - err = btf_type_ops(v->t)->resolve(env, v); - } - - env->log_type_id = type_id; - if (err == -E2BIG) - btf_verifier_log_type(env, t, - "Exceeded max resolving depth:%u", - MAX_RESOLVE_DEPTH); - else if (err == -EEXIST) - btf_verifier_log_type(env, t, "Loop detected"); - - return err; -} - static bool btf_resolve_valid(struct btf_verifier_env *env, const struct btf_type *t, u32 type_id) @@ -1927,6 +2177,39 @@ static bool btf_resolve_valid(struct btf_verifier_env *env, return false; } +static int btf_resolve(struct btf_verifier_env *env, + const struct btf_type *t, u32 type_id) +{ + u32 save_log_type_id = env->log_type_id; + const struct resolve_vertex *v; + int err = 0; + + env->resolve_mode = RESOLVE_TBD; + env_stack_push(env, t, type_id); + while (!err && (v = env_stack_peak(env))) { + env->log_type_id = v->type_id; + err = btf_type_ops(v->t)->resolve(env, v); + } + + env->log_type_id = type_id; + if (err == -E2BIG) { + btf_verifier_log_type(env, t, + "Exceeded max resolving depth:%u", + MAX_RESOLVE_DEPTH); + } else if (err == -EEXIST) { + btf_verifier_log_type(env, t, "Loop detected"); + } + + /* Final sanity check */ + if (!err && !btf_resolve_valid(env, t, type_id)) { + btf_verifier_log_type(env, t, "Invalid resolve state"); + err = -EINVAL; + } + + env->log_type_id = save_log_type_id; + return err; +} + static int btf_check_all_types(struct btf_verifier_env *env) { struct btf *btf = env->btf; @@ -1949,10 +2232,16 @@ static int btf_check_all_types(struct btf_verifier_env *env) return err; } - if (btf_type_needs_resolve(t) && - !btf_resolve_valid(env, t, type_id)) { - btf_verifier_log_type(env, t, "Invalid resolve state"); - return -EINVAL; + if (btf_type_is_func_proto(t)) { + err = btf_func_proto_check(env, t); + if (err) + return err; + } + + if (btf_type_is_func(t)) { + err = btf_func_check(env, t); + if (err) + return err; } } -- cgit v1.3-7-g2ca7 From 838e96904ff3fc6c30e5ebbc611474669856e3c0 Mon Sep 17 00:00:00 2001 From: Yonghong Song Date: Mon, 19 Nov 2018 15:29:11 -0800 Subject: bpf: Introduce bpf_func_info This patch added interface to load a program with the following additional information: . prog_btf_fd . func_info, func_info_rec_size and func_info_cnt where func_info will provide function range and type_id corresponding to each function. The func_info_rec_size is introduced in the UAPI to specify struct bpf_func_info size passed from user space. This intends to make bpf_func_info structure growable in the future. If the kernel gets a different bpf_func_info size from userspace, it will try to handle user request with part of bpf_func_info it can understand. In this patch, kernel can understand struct bpf_func_info { __u32 insn_offset; __u32 type_id; }; If user passed a bpf func_info record size of 16 bytes, the kernel can still handle part of records with the above definition. If verifier agrees with function range provided by the user, the bpf_prog ksym for each function will use the func name provided in the type_id, which is supposed to provide better encoding as it is not limited by 16 bytes program name limitation and this is better for bpf program which contains multiple subprograms. The bpf_prog_info interface is also extended to return btf_id, func_info, func_info_rec_size and func_info_cnt to userspace, so userspace can print out the function prototype for each xlated function. The insn_offset in the returned func_info corresponds to the insn offset for xlated functions. With other jit related fields in bpf_prog_info, userspace can also print out function prototypes for each jited function. Signed-off-by: Yonghong Song Signed-off-by: Martin KaFai Lau Signed-off-by: Alexei Starovoitov --- include/linux/bpf.h | 5 +- include/linux/bpf_verifier.h | 1 + include/linux/btf.h | 2 + include/uapi/linux/bpf.h | 13 +++++ kernel/bpf/btf.c | 4 +- kernel/bpf/core.c | 13 +++++ kernel/bpf/syscall.c | 59 +++++++++++++++++++-- kernel/bpf/verifier.c | 120 ++++++++++++++++++++++++++++++++++++++++++- 8 files changed, 209 insertions(+), 8 deletions(-) (limited to 'kernel') diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 987815152629..7f0e225bf630 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -316,6 +316,8 @@ struct bpf_prog_aux { void *security; #endif struct bpf_prog_offload *offload; + struct btf *btf; + u32 type_id; /* type id for this prog/func */ union { struct work_struct work; struct rcu_head rcu; @@ -527,7 +529,8 @@ static inline void bpf_long_memcpy(void *dst, const void *src, u32 size) } /* verify correctness of eBPF program */ -int bpf_check(struct bpf_prog **fp, union bpf_attr *attr); +int bpf_check(struct bpf_prog **fp, union bpf_attr *attr, + union bpf_attr __user *uattr); void bpf_patch_call_args(struct bpf_insn *insn, u32 stack_depth); /* Map specifics */ diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h index 11f5df1092d9..204382f46fd8 100644 --- a/include/linux/bpf_verifier.h +++ b/include/linux/bpf_verifier.h @@ -204,6 +204,7 @@ static inline bool bpf_verifier_log_needed(const struct bpf_verifier_log *log) struct bpf_subprog_info { u32 start; /* insn idx of function entry point */ u16 stack_depth; /* max. stack depth used by this function */ + u32 type_id; /* btf type_id for this subprog */ }; /* single container for all structs diff --git a/include/linux/btf.h b/include/linux/btf.h index e076c4697049..7f2c0a4a45ea 100644 --- a/include/linux/btf.h +++ b/include/linux/btf.h @@ -46,5 +46,7 @@ void btf_type_seq_show(const struct btf *btf, u32 type_id, void *obj, struct seq_file *m); int btf_get_fd_by_id(u32 id); u32 btf_id(const struct btf *btf); +const struct btf_type *btf_type_by_id(const struct btf *btf, u32 type_id); +const char *btf_name_by_offset(const struct btf *btf, u32 offset); #endif diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 05d95290b848..c1554aa07465 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -338,6 +338,10 @@ union bpf_attr { * (context accesses, allowed helpers, etc). */ __u32 expected_attach_type; + __u32 prog_btf_fd; /* fd pointing to BTF type data */ + __u32 func_info_rec_size; /* userspace bpf_func_info size */ + __aligned_u64 func_info; /* func info */ + __u32 func_info_cnt; /* number of bpf_func_info records */ }; struct { /* anonymous struct used by BPF_OBJ_* commands */ @@ -2638,6 +2642,10 @@ struct bpf_prog_info { __u32 nr_jited_func_lens; __aligned_u64 jited_ksyms; __aligned_u64 jited_func_lens; + __u32 btf_id; + __u32 func_info_rec_size; + __aligned_u64 func_info; + __u32 func_info_cnt; } __attribute__((aligned(8))); struct bpf_map_info { @@ -2949,4 +2957,9 @@ struct bpf_flow_keys { }; }; +struct bpf_func_info { + __u32 insn_offset; + __u32 type_id; +}; + #endif /* _UAPI__LINUX_BPF_H__ */ diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index 6a2be79b73fc..69da9169819a 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -474,7 +474,7 @@ static bool btf_name_valid_identifier(const struct btf *btf, u32 offset) return !*src; } -static const char *btf_name_by_offset(const struct btf *btf, u32 offset) +const char *btf_name_by_offset(const struct btf *btf, u32 offset) { if (!offset) return "(anon)"; @@ -484,7 +484,7 @@ static const char *btf_name_by_offset(const struct btf *btf, u32 offset) return "(invalid-name-offset)"; } -static const struct btf_type *btf_type_by_id(const struct btf *btf, u32 type_id) +const struct btf_type *btf_type_by_id(const struct btf *btf, u32 type_id) { if (type_id > btf->nr_types) return NULL; diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index 1a796e0799ec..16d77012ad3e 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -21,12 +21,14 @@ * Kris Katterjohn - Added many additional checks in bpf_check_classic() */ +#include #include #include #include #include #include #include +#include #include #include #include @@ -390,6 +392,8 @@ bpf_get_prog_addr_region(const struct bpf_prog *prog, static void bpf_get_prog_name(const struct bpf_prog *prog, char *sym) { const char *end = sym + KSYM_NAME_LEN; + const struct btf_type *type; + const char *func_name; BUILD_BUG_ON(sizeof("bpf_prog_") + sizeof(prog->tag) * 2 + @@ -404,6 +408,15 @@ static void bpf_get_prog_name(const struct bpf_prog *prog, char *sym) sym += snprintf(sym, KSYM_NAME_LEN, "bpf_prog_"); sym = bin2hex(sym, prog->tag, sizeof(prog->tag)); + + /* prog->aux->name will be ignored if full btf name is available */ + if (prog->aux->btf) { + type = btf_type_by_id(prog->aux->btf, prog->aux->type_id); + func_name = btf_name_by_offset(prog->aux->btf, type->name_off); + snprintf(sym, (size_t)(end - sym), "_%s", func_name); + return; + } + if (prog->aux->name[0]) snprintf(sym, (size_t)(end - sym), "_%s", prog->aux->name); else diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index cf5040fd5434..998377808102 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -1213,6 +1213,7 @@ static void __bpf_prog_put(struct bpf_prog *prog, bool do_idr_lock) /* bpf_prog_free_id() must be called first */ bpf_prog_free_id(prog, do_idr_lock); bpf_prog_kallsyms_del_all(prog); + btf_put(prog->aux->btf); call_rcu(&prog->aux->rcu, __bpf_prog_put_rcu); } @@ -1437,9 +1438,9 @@ bpf_prog_load_check_attach_type(enum bpf_prog_type prog_type, } /* last field in 'union bpf_attr' used by this command */ -#define BPF_PROG_LOAD_LAST_FIELD expected_attach_type +#define BPF_PROG_LOAD_LAST_FIELD func_info_cnt -static int bpf_prog_load(union bpf_attr *attr) +static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr) { enum bpf_prog_type type = attr->prog_type; struct bpf_prog *prog; @@ -1525,7 +1526,7 @@ static int bpf_prog_load(union bpf_attr *attr) goto free_prog; /* run eBPF verifier */ - err = bpf_check(&prog, attr); + err = bpf_check(&prog, attr, uattr); if (err < 0) goto free_used_maps; @@ -2079,6 +2080,7 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog, info.xlated_prog_len = 0; info.nr_jited_ksyms = 0; info.nr_jited_func_lens = 0; + info.func_info_cnt = 0; goto done; } @@ -2216,6 +2218,55 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog, } } + if (prog->aux->btf) { + u32 ucnt, urec_size; + + info.btf_id = btf_id(prog->aux->btf); + + ucnt = info.func_info_cnt; + info.func_info_cnt = prog->aux->func_cnt ? : 1; + urec_size = info.func_info_rec_size; + info.func_info_rec_size = sizeof(struct bpf_func_info); + if (ucnt) { + /* expect passed-in urec_size is what the kernel expects */ + if (urec_size != info.func_info_rec_size) + return -EINVAL; + + if (bpf_dump_raw_ok()) { + struct bpf_func_info kern_finfo; + char __user *user_finfo; + u32 i, insn_offset; + + user_finfo = u64_to_user_ptr(info.func_info); + if (prog->aux->func_cnt) { + ucnt = min_t(u32, info.func_info_cnt, ucnt); + insn_offset = 0; + for (i = 0; i < ucnt; i++) { + kern_finfo.insn_offset = insn_offset; + kern_finfo.type_id = prog->aux->func[i]->aux->type_id; + if (copy_to_user(user_finfo, &kern_finfo, + sizeof(kern_finfo))) + return -EFAULT; + + /* func[i]->len holds the prog len */ + insn_offset += prog->aux->func[i]->len; + user_finfo += urec_size; + } + } else { + kern_finfo.insn_offset = 0; + kern_finfo.type_id = prog->aux->type_id; + if (copy_to_user(user_finfo, &kern_finfo, + sizeof(kern_finfo))) + return -EFAULT; + } + } else { + info.func_info_cnt = 0; + } + } + } else { + info.func_info_cnt = 0; + } + done: if (copy_to_user(uinfo, &info, info_len) || put_user(info_len, &uattr->info.info_len)) @@ -2501,7 +2552,7 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz err = map_get_next_key(&attr); break; case BPF_PROG_LOAD: - err = bpf_prog_load(&attr); + err = bpf_prog_load(&attr, uattr); break; case BPF_OBJ_PIN: err = bpf_obj_pin(&attr); diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index b5222aa61d54..f102c4fd0c5a 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -11,10 +11,12 @@ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU * General Public License for more details. */ +#include #include #include #include #include +#include #include #include #include @@ -4639,6 +4641,114 @@ err_free: return ret; } +/* The minimum supported BTF func info size */ +#define MIN_BPF_FUNCINFO_SIZE 8 +#define MAX_FUNCINFO_REC_SIZE 252 + +static int check_btf_func(struct bpf_prog *prog, struct bpf_verifier_env *env, + union bpf_attr *attr, union bpf_attr __user *uattr) +{ + u32 i, nfuncs, urec_size, min_size, prev_offset; + u32 krec_size = sizeof(struct bpf_func_info); + struct bpf_func_info krecord = {}; + const struct btf_type *type; + void __user *urecord; + struct btf *btf; + int ret = 0; + + nfuncs = attr->func_info_cnt; + if (!nfuncs) + return 0; + + if (nfuncs != env->subprog_cnt) { + verbose(env, "number of funcs in func_info doesn't match number of subprogs\n"); + return -EINVAL; + } + + urec_size = attr->func_info_rec_size; + if (urec_size < MIN_BPF_FUNCINFO_SIZE || + urec_size > MAX_FUNCINFO_REC_SIZE || + urec_size % sizeof(u32)) { + verbose(env, "invalid func info rec size %u\n", urec_size); + return -EINVAL; + } + + btf = btf_get_by_fd(attr->prog_btf_fd); + if (IS_ERR(btf)) { + verbose(env, "unable to get btf from fd\n"); + return PTR_ERR(btf); + } + + urecord = u64_to_user_ptr(attr->func_info); + min_size = min_t(u32, krec_size, urec_size); + + for (i = 0; i < nfuncs; i++) { + ret = bpf_check_uarg_tail_zero(urecord, krec_size, urec_size); + if (ret) { + if (ret == -E2BIG) { + verbose(env, "nonzero tailing record in func info"); + /* set the size kernel expects so loader can zero + * out the rest of the record. + */ + if (put_user(min_size, &uattr->func_info_rec_size)) + ret = -EFAULT; + } + goto free_btf; + } + + if (copy_from_user(&krecord, urecord, min_size)) { + ret = -EFAULT; + goto free_btf; + } + + /* check insn_offset */ + if (i == 0) { + if (krecord.insn_offset) { + verbose(env, + "nonzero insn_offset %u for the first func info record", + krecord.insn_offset); + ret = -EINVAL; + goto free_btf; + } + } else if (krecord.insn_offset <= prev_offset) { + verbose(env, + "same or smaller insn offset (%u) than previous func info record (%u)", + krecord.insn_offset, prev_offset); + ret = -EINVAL; + goto free_btf; + } + + if (env->subprog_info[i].start != krecord.insn_offset) { + verbose(env, "func_info BTF section doesn't match subprog layout in BPF program\n"); + ret = -EINVAL; + goto free_btf; + } + + /* check type_id */ + type = btf_type_by_id(btf, krecord.type_id); + if (!type || BTF_INFO_KIND(type->info) != BTF_KIND_FUNC) { + verbose(env, "invalid type id %d in func info", + krecord.type_id); + ret = -EINVAL; + goto free_btf; + } + + if (i == 0) + prog->aux->type_id = krecord.type_id; + env->subprog_info[i].type_id = krecord.type_id; + + prev_offset = krecord.insn_offset; + urecord += urec_size; + } + + prog->aux->btf = btf; + return 0; + +free_btf: + btf_put(btf); + return ret; +} + /* check %cur's range satisfies %old's */ static bool range_within(struct bpf_reg_state *old, struct bpf_reg_state *cur) @@ -5939,6 +6049,9 @@ static int jit_subprogs(struct bpf_verifier_env *env) func[i]->aux->name[0] = 'F'; func[i]->aux->stack_depth = env->subprog_info[i].stack_depth; func[i]->jit_requested = 1; + /* the btf will be freed only at prog->aux */ + func[i]->aux->btf = prog->aux->btf; + func[i]->aux->type_id = env->subprog_info[i].type_id; func[i] = bpf_int_jit_compile(func[i]); if (!func[i]->jited) { err = -ENOTSUPP; @@ -6325,7 +6438,8 @@ static void free_states(struct bpf_verifier_env *env) kfree(env->explored_states); } -int bpf_check(struct bpf_prog **prog, union bpf_attr *attr) +int bpf_check(struct bpf_prog **prog, union bpf_attr *attr, + union bpf_attr __user *uattr) { struct bpf_verifier_env *env; struct bpf_verifier_log *log; @@ -6397,6 +6511,10 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr) if (ret < 0) goto skip_full_check; + ret = check_btf_func(env->prog, env, attr, uattr); + if (ret < 0) + goto skip_full_check; + ret = do_check(env); if (env->cur_state) { free_verifier_state(env->cur_state, true); -- cgit v1.3-7-g2ca7 From 8d75839b843ae0ef8d9db97ed05b493e687e6b75 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Wed, 21 Nov 2018 21:39:52 -0800 Subject: bpf, lpm: make longest_prefix_match() faster At LPC 2018 in Vancouver, Vlad Dumitrescu mentioned that longest_prefix_match() has a high cost [1]. One reason for that cost is a loop handling one byte at a time. We can handle more bytes at a time, if enough attention is paid to endianness. I was able to remove ~55 % of longest_prefix_match() cpu costs. [1] https://linuxplumbersconf.org/event/2/contributions/88/attachments/76/87/lpc-bpf-2018-shaping.pdf Signed-off-by: Eric Dumazet Cc: Vlad Dumitrescu Cc: Alexei Starovoitov Cc: Daniel Borkmann Signed-off-by: Daniel Borkmann --- kernel/bpf/lpm_trie.c | 59 ++++++++++++++++++++++++++++++++++++++++++--------- 1 file changed, 49 insertions(+), 10 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c index 9058317ba9de..bfd4882e1106 100644 --- a/kernel/bpf/lpm_trie.c +++ b/kernel/bpf/lpm_trie.c @@ -168,20 +168,59 @@ static size_t longest_prefix_match(const struct lpm_trie *trie, const struct lpm_trie_node *node, const struct bpf_lpm_trie_key *key) { - size_t prefixlen = 0; - size_t i; + u32 limit = min(node->prefixlen, key->prefixlen); + u32 prefixlen = 0, i = 0; - for (i = 0; i < trie->data_size; i++) { - size_t b; + BUILD_BUG_ON(offsetof(struct lpm_trie_node, data) % sizeof(u32)); + BUILD_BUG_ON(offsetof(struct bpf_lpm_trie_key, data) % sizeof(u32)); - b = 8 - fls(node->data[i] ^ key->data[i]); - prefixlen += b; +#if defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) && defined(CONFIG_64BIT) - if (prefixlen >= node->prefixlen || prefixlen >= key->prefixlen) - return min(node->prefixlen, key->prefixlen); + /* data_size >= 16 has very small probability. + * We do not use a loop for optimal code generation. + */ + if (trie->data_size >= 8) { + u64 diff = be64_to_cpu(*(__be64 *)node->data ^ + *(__be64 *)key->data); + + prefixlen = 64 - fls64(diff); + if (prefixlen >= limit) + return limit; + if (diff) + return prefixlen; + i = 8; + } +#endif + + while (trie->data_size >= i + 4) { + u32 diff = be32_to_cpu(*(__be32 *)&node->data[i] ^ + *(__be32 *)&key->data[i]); + + prefixlen += 32 - fls(diff); + if (prefixlen >= limit) + return limit; + if (diff) + return prefixlen; + i += 4; + } - if (b < 8) - break; + if (trie->data_size >= i + 2) { + u16 diff = be16_to_cpu(*(__be16 *)&node->data[i] ^ + *(__be16 *)&key->data[i]); + + prefixlen += 16 - fls(diff); + if (prefixlen >= limit) + return limit; + if (diff) + return prefixlen; + i += 2; + } + + if (trie->data_size >= i + 1) { + prefixlen += 8 - fls(node->data[i] ^ key->data[i]); + + if (prefixlen >= limit) + return limit; } return prefixlen; -- cgit v1.3-7-g2ca7 From c7c3f05e341a9a2bd1a92993d4f996cfd6e7348e Mon Sep 17 00:00:00 2001 From: Sergey Senozhatsky Date: Thu, 25 Oct 2018 19:10:36 +0900 Subject: panic: avoid deadlocks in re-entrant console drivers From printk()/serial console point of view panic() is special, because it may force CPU to re-enter printk() or/and serial console driver. Therefore, some of serial consoles drivers are re-entrant. E.g. 8250: serial8250_console_write() { if (port->sysrq) locked = 0; else if (oops_in_progress) locked = spin_trylock_irqsave(&port->lock, flags); else spin_lock_irqsave(&port->lock, flags); ... } panic() does set oops_in_progress via bust_spinlocks(1), so in theory we should be able to re-enter serial console driver from panic(): CPU0 uart_console_write() serial8250_console_write() // if (oops_in_progress) // spin_trylock_irqsave() call_console_drivers() console_unlock() console_flush_on_panic() bust_spinlocks(1) // oops_in_progress++ panic() spin_lock_irqsave(&port->lock, flags) // spin_lock_irqsave() serial8250_console_write() call_console_drivers() console_unlock() printk() ... However, this does not happen and we deadlock in serial console on port->lock spinlock. And the problem is that console_flush_on_panic() called after bust_spinlocks(0): void panic(const char *fmt, ...) { bust_spinlocks(1); ... bust_spinlocks(0); console_flush_on_panic(); ... } bust_spinlocks(0) decrements oops_in_progress, so oops_in_progress can go back to zero. Thus even re-entrant console drivers will simply spin on port->lock spinlock. Given that port->lock may already be locked either by a stopped CPU, or by the very same CPU we execute panic() on (for instance, NMI panic() on printing CPU) the system deadlocks and does not reboot. Fix this by removing bust_spinlocks(0), so oops_in_progress is always set in panic() now and, thus, re-entrant console drivers will trylock the port->lock instead of spinning on it forever, when we call them from console_flush_on_panic(). Link: http://lkml.kernel.org/r/20181025101036.6823-1-sergey.senozhatsky@gmail.com Cc: Steven Rostedt Cc: Daniel Wang Cc: Peter Zijlstra Cc: Andrew Morton Cc: Linus Torvalds Cc: Greg Kroah-Hartman Cc: Alan Cox Cc: Jiri Slaby Cc: Peter Feiner Cc: linux-serial@vger.kernel.org Cc: Sergey Senozhatsky Cc: stable@vger.kernel.org Signed-off-by: Sergey Senozhatsky Signed-off-by: Petr Mladek --- kernel/panic.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/panic.c b/kernel/panic.c index 8b2e002d52eb..6a6df23acd1a 100644 --- a/kernel/panic.c +++ b/kernel/panic.c @@ -14,6 +14,7 @@ #include #include #include +#include #include #include #include @@ -233,7 +234,10 @@ void panic(const char *fmt, ...) if (_crash_kexec_post_notifiers) __crash_kexec(NULL); - bust_spinlocks(0); +#ifdef CONFIG_VT + unblank_screen(); +#endif + console_unblank(); /* * We may have ended up stopping the CPU holding the lock (in -- cgit v1.3-7-g2ca7 From 58c5fc2b96e4ae65068d815a1c3ca81da92fa1c9 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Wed, 31 Oct 2018 19:21:08 +0100 Subject: time: Remove useless filenames in top level comments Remove the pointless filenames in the top level comments. They have no value at all and just occupy space. While at it tidy up some of the comments and remove a stale one. Signed-off-by: Thomas Gleixner Acked-by: Nicolas Pitre Acked-by: Kees Cook Acked-by: Ingo Molnar Acked-by: John Stultz Acked-by: Corey Minyard Cc: Peter Zijlstra Cc: Kate Stewart Cc: Philippe Ombredanne Cc: Peter Anvin Cc: Russell King Cc: Richard Cochran Cc: "Paul E. McKenney" Cc: David Riley Cc: Colin Cross Cc: Mark Brown Link: https://lkml.kernel.org/r/20181031182252.794898238@linutronix.de --- include/linux/hrtimer.h | 2 -- kernel/time/clockevents.c | 2 -- kernel/time/clocksource.c | 5 ----- kernel/time/hrtimer.c | 16 ++++------------ kernel/time/itimer.c | 2 -- kernel/time/jiffies.c | 2 -- kernel/time/posix-clock.c | 2 +- kernel/time/posix-timers.c | 4 ---- kernel/time/sched_clock.c | 4 ++-- kernel/time/tick-broadcast-hrtimer.c | 4 +--- kernel/time/tick-broadcast.c | 2 -- kernel/time/tick-common.c | 2 -- kernel/time/tick-oneshot.c | 2 -- kernel/time/tick-sched.c | 2 -- kernel/time/time.c | 12 ++++-------- kernel/time/timecounter.c | 6 +----- kernel/time/timekeeping.c | 10 ++-------- kernel/time/timer.c | 2 -- kernel/time/timer_list.c | 2 -- 19 files changed, 15 insertions(+), 68 deletions(-) (limited to 'kernel') diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h index 3892e9c8b2de..50ebe2ad43e0 100644 --- a/include/linux/hrtimer.h +++ b/include/linux/hrtimer.h @@ -1,6 +1,4 @@ /* - * include/linux/hrtimer.h - * * hrtimers - High-resolution kernel timers * * Copyright(C) 2005, Thomas Gleixner diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c index af58898d9ebf..9b8c7c0fd113 100644 --- a/kernel/time/clockevents.c +++ b/kernel/time/clockevents.c @@ -1,6 +1,4 @@ /* - * linux/kernel/time/clockevents.c - * * This file contains functions which manage clock event devices. * * Copyright(C) 2005-2006, Thomas Gleixner diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c index ffe081623aec..1c5273fbd500 100644 --- a/kernel/time/clocksource.c +++ b/kernel/time/clocksource.c @@ -1,6 +1,4 @@ /* - * linux/kernel/time/clocksource.c - * * This file contains the functions which manage clocksource drivers. * * Copyright (C) 2004, 2005 IBM, John Stultz (johnstul@us.ibm.com) @@ -18,9 +16,6 @@ * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. - * - * TODO WishList: - * o Allow clocksource drivers to be unregistered */ #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c index 9cdd74bd2d27..223548bb81c6 100644 --- a/kernel/time/hrtimer.c +++ b/kernel/time/hrtimer.c @@ -1,26 +1,18 @@ /* - * linux/kernel/hrtimer.c - * * Copyright(C) 2005-2006, Thomas Gleixner * Copyright(C) 2005-2007, Red Hat, Inc., Ingo Molnar * Copyright(C) 2006-2007 Timesys Corp., Thomas Gleixner * * High-resolution kernel timers * - * In contrast to the low-resolution timeout API implemented in - * kernel/timer.c, hrtimers provide finer resolution and accuracy - * depending on system configuration and capabilities. - * - * These timers are currently used for: - * - itimers - * - POSIX timers - * - nanosleep - * - precise in-kernel timing + * In contrast to the low-resolution timeout API, aka timer wheel, + * hrtimers provide finer resolution and accuracy depending on system + * configuration and capabilities. * * Started by: Thomas Gleixner and Ingo Molnar * * Credits: - * based on kernel/timer.c + * Based on the original timer wheel code * * Help, testing, suggestions, bugfixes, improvements were * provided by: diff --git a/kernel/time/itimer.c b/kernel/time/itimer.c index 9a65713c8309..02068b2d5862 100644 --- a/kernel/time/itimer.c +++ b/kernel/time/itimer.c @@ -1,7 +1,5 @@ // SPDX-License-Identifier: GPL-2.0 /* - * linux/kernel/itimer.c - * * Copyright (C) 1992 Darren Senn */ diff --git a/kernel/time/jiffies.c b/kernel/time/jiffies.c index 497719127bf9..9c3957fe9317 100644 --- a/kernel/time/jiffies.c +++ b/kernel/time/jiffies.c @@ -1,6 +1,4 @@ /*********************************************************************** -* linux/kernel/time/jiffies.c -* * This file contains the jiffies based clocksource. * * Copyright (C) 2004, 2005 IBM, John Stultz (johnstul@us.ibm.com) diff --git a/kernel/time/posix-clock.c b/kernel/time/posix-clock.c index fe56c4e06c51..4959815f4fd7 100644 --- a/kernel/time/posix-clock.c +++ b/kernel/time/posix-clock.c @@ -1,5 +1,5 @@ /* - * posix-clock.c - support for dynamic clock devices + * Support for dynamic clock devices * * Copyright (C) 2010 OMICRON electronics GmbH * diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c index bd62b5eeb5a0..c72307c119d9 100644 --- a/kernel/time/posix-timers.c +++ b/kernel/time/posix-timers.c @@ -1,10 +1,6 @@ /* - * linux/kernel/posix-timers.c - * - * * 2002-10-15 Posix Clocks & timers * by George Anzinger george@mvista.com - * * Copyright (C) 2002 2003 by MontaVista Software. * * 2004-06-01 Fix CLOCK_REALTIME clock/timer TIMER_ABSTIME bug. diff --git a/kernel/time/sched_clock.c b/kernel/time/sched_clock.c index cbc72c2c1fca..b38b6628f89b 100644 --- a/kernel/time/sched_clock.c +++ b/kernel/time/sched_clock.c @@ -1,6 +1,6 @@ /* - * sched_clock.c: Generic sched_clock() support, to extend low level - * hardware time counters to full 64-bit ns values. + * Generic sched_clock() support, to extend low level hardware time + * counters to full 64-bit ns values. * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License version 2 as diff --git a/kernel/time/tick-broadcast-hrtimer.c b/kernel/time/tick-broadcast-hrtimer.c index a59641fb88b6..5be6154e2fd2 100644 --- a/kernel/time/tick-broadcast-hrtimer.c +++ b/kernel/time/tick-broadcast-hrtimer.c @@ -1,8 +1,6 @@ // SPDX-License-Identifier: GPL-2.0 /* - * linux/kernel/time/tick-broadcast-hrtimer.c - * This file emulates a local clock event device - * via a pseudo clock device. + * Emulate a local clock event device via a pseudo clock device. */ #include #include diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c index be0aac2b4300..4f5abde2dfa7 100644 --- a/kernel/time/tick-broadcast.c +++ b/kernel/time/tick-broadcast.c @@ -1,6 +1,4 @@ /* - * linux/kernel/time/tick-broadcast.c - * * This file contains functions which emulate a local clock-event * device via a broadcast event source. * diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c index 14de3727b18e..7b5008039c2d 100644 --- a/kernel/time/tick-common.c +++ b/kernel/time/tick-common.c @@ -1,6 +1,4 @@ /* - * linux/kernel/time/tick-common.c - * * This file contains the base functions to manage periodic tick * related events. * diff --git a/kernel/time/tick-oneshot.c b/kernel/time/tick-oneshot.c index 6fe615d57ebb..77989efe13d2 100644 --- a/kernel/time/tick-oneshot.c +++ b/kernel/time/tick-oneshot.c @@ -1,6 +1,4 @@ /* - * linux/kernel/time/tick-oneshot.c - * * This file contains functions which manage high resolution tick * related events. * diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index 69e673b88474..cb557e56a19f 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -1,6 +1,4 @@ /* - * linux/kernel/time/tick-sched.c - * * Copyright(C) 2005-2006, Thomas Gleixner * Copyright(C) 2005-2007, Red Hat, Inc., Ingo Molnar * Copyright(C) 2006-2007 Timesys Corp., Thomas Gleixner diff --git a/kernel/time/time.c b/kernel/time/time.c index ad204cf6d001..13ffa9950ffc 100644 --- a/kernel/time/time.c +++ b/kernel/time/time.c @@ -1,14 +1,10 @@ /* - * linux/kernel/time.c - * * Copyright (C) 1991, 1992 Linus Torvalds * - * This file contains the interface functions for the various - * time related system calls: time, stime, gettimeofday, settimeofday, - * adjtime - */ -/* - * Modification history kernel/time.c + * This file contains the interface functions for the various time related + * system calls: time, stime, gettimeofday, settimeofday, adjtime + * + * Modification history: * * 1993-09-02 Philip Gladstone * Created file with time related functions from sched/core.c and adjtimex() diff --git a/kernel/time/timecounter.c b/kernel/time/timecounter.c index 8afd78932bdf..400f3456d564 100644 --- a/kernel/time/timecounter.c +++ b/kernel/time/timecounter.c @@ -1,8 +1,5 @@ /* - * linux/kernel/time/timecounter.c - * - * based on code that migrated away from - * linux/kernel/time/clocksource.c + * Based on clocksource code. See commit 74d23cc704d1 * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by @@ -14,7 +11,6 @@ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. */ - #include #include diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c index 2d110c948805..30fdf48f50c2 100644 --- a/kernel/time/timekeeping.c +++ b/kernel/time/timekeeping.c @@ -1,13 +1,7 @@ /* - * linux/kernel/time/timekeeping.c - * - * Kernel timekeeping code and accessor functions - * - * This code was moved from linux/kernel/timer.c. - * Please see that file for copyright and history logs. - * + * Kernel timekeeping code and accessor functions. Based on code from + * timer.c, moved in commit 8524070b7982. */ - #include #include #include diff --git a/kernel/time/timer.c b/kernel/time/timer.c index fa49cd753dea..2f248bbedb4a 100644 --- a/kernel/time/timer.c +++ b/kernel/time/timer.c @@ -1,6 +1,4 @@ /* - * linux/kernel/timer.c - * * Kernel internal timers * * Copyright (C) 1991, 1992 Linus Torvalds diff --git a/kernel/time/timer_list.c b/kernel/time/timer_list.c index d647dabdac97..5d64fff384c8 100644 --- a/kernel/time/timer_list.c +++ b/kernel/time/timer_list.c @@ -1,6 +1,4 @@ /* - * kernel/time/timer_list.c - * * List pending timers * * Copyright(C) 2006, Red Hat, Inc., Ingo Molnar -- cgit v1.3-7-g2ca7 From 35728b8209ee7d25b6241a56304ee926469bd154 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Wed, 31 Oct 2018 19:21:09 +0100 Subject: time: Add SPDX license identifiers Update the time(r) core files files with the correct SPDX license identifier based on the license text in the file itself. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This work is based on a script and data from Philippe Ombredanne, Kate Stewart and myself. The data has been created with two independent license scanners and manual inspection. The following files do not contain any direct license information and have been omitted from the big initial SPDX changes: timeconst.bc: The .bc files were not touched time.c, timer.c, timekeeping.c: Licence was deduced from EXPORT_SYMBOL_GPL As those files do not contain direct license references they fall under the project license, i.e. GPL V2 only. Signed-off-by: Thomas Gleixner Acked-by: Kees Cook Acked-by: Ingo Molnar Acked-by: John Stultz Acked-by: Corey Minyard Cc: Peter Zijlstra Cc: Kate Stewart Cc: Philippe Ombredanne Cc: Russell King Cc: Richard Cochran Cc: Nicolas Pitre Cc: David Riley Cc: Colin Cross Cc: Mark Brown Cc: H. Peter Anvin Cc: Paul E. McKenney Link: https://lkml.kernel.org/r/20181031182252.879109557@linutronix.de --- include/linux/hrtimer.h | 1 + kernel/time/alarmtimer.c | 1 + kernel/time/clockevents.c | 1 + kernel/time/clocksource.c | 1 + kernel/time/hrtimer.c | 1 + kernel/time/jiffies.c | 1 + kernel/time/posix-clock.c | 1 + kernel/time/posix-stubs.c | 1 + kernel/time/posix-timers.c | 1 + kernel/time/sched_clock.c | 1 + kernel/time/test_udelay.c | 1 + kernel/time/tick-broadcast.c | 1 + kernel/time/tick-common.c | 1 + kernel/time/tick-oneshot.c | 1 + kernel/time/tick-sched.c | 1 + kernel/time/time.c | 1 + kernel/time/timeconst.bc | 2 ++ kernel/time/timeconv.c | 1 + kernel/time/timecounter.c | 1 + kernel/time/timekeeping.c | 1 + kernel/time/timekeeping_debug.c | 1 + kernel/time/timer.c | 1 + kernel/time/timer_list.c | 1 + 23 files changed, 24 insertions(+) (limited to 'kernel') diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h index 50ebe2ad43e0..851e4231d3ab 100644 --- a/include/linux/hrtimer.h +++ b/include/linux/hrtimer.h @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0 /* * hrtimers - High-resolution kernel timers * diff --git a/kernel/time/alarmtimer.c b/kernel/time/alarmtimer.c index fa5de5e8de61..69070d399d70 100644 --- a/kernel/time/alarmtimer.c +++ b/kernel/time/alarmtimer.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0 /* * Alarmtimer interface * diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c index 9b8c7c0fd113..0fdbdf17f8a2 100644 --- a/kernel/time/clockevents.c +++ b/kernel/time/clockevents.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0 /* * This file contains functions which manage clock event devices. * diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c index 1c5273fbd500..b1abeac5f3f7 100644 --- a/kernel/time/clocksource.c +++ b/kernel/time/clocksource.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0+ /* * This file contains the functions which manage clocksource drivers. * diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c index 223548bb81c6..16dacc8d3ca2 100644 --- a/kernel/time/hrtimer.c +++ b/kernel/time/hrtimer.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0 /* * Copyright(C) 2005-2006, Thomas Gleixner * Copyright(C) 2005-2007, Red Hat, Inc., Ingo Molnar diff --git a/kernel/time/jiffies.c b/kernel/time/jiffies.c index 9c3957fe9317..0deb0be2c445 100644 --- a/kernel/time/jiffies.c +++ b/kernel/time/jiffies.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0+ /*********************************************************************** * This file contains the jiffies based clocksource. * diff --git a/kernel/time/posix-clock.c b/kernel/time/posix-clock.c index 4959815f4fd7..339e35e4605f 100644 --- a/kernel/time/posix-clock.c +++ b/kernel/time/posix-clock.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0+ /* * Support for dynamic clock devices * diff --git a/kernel/time/posix-stubs.c b/kernel/time/posix-stubs.c index 989ccf028bde..b9f9f6f02e11 100644 --- a/kernel/time/posix-stubs.c +++ b/kernel/time/posix-stubs.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0 /* * Dummy stubs used when CONFIG_POSIX_TIMERS=n * diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c index c72307c119d9..e8cd9aa6c9cf 100644 --- a/kernel/time/posix-timers.c +++ b/kernel/time/posix-timers.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0+ /* * 2002-10-15 Posix Clocks & timers * by George Anzinger george@mvista.com diff --git a/kernel/time/sched_clock.c b/kernel/time/sched_clock.c index b38b6628f89b..11570ba451cc 100644 --- a/kernel/time/sched_clock.c +++ b/kernel/time/sched_clock.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0 /* * Generic sched_clock() support, to extend low level hardware time * counters to full 64-bit ns values. diff --git a/kernel/time/test_udelay.c b/kernel/time/test_udelay.c index b0928ab3270f..d6a87bb2040f 100644 --- a/kernel/time/test_udelay.c +++ b/kernel/time/test_udelay.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0 /* * udelay() test kernel module * diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c index 4f5abde2dfa7..f4725f53d852 100644 --- a/kernel/time/tick-broadcast.c +++ b/kernel/time/tick-broadcast.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0 /* * This file contains functions which emulate a local clock-event * device via a broadcast event source. diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c index 7b5008039c2d..455b8d65a2b7 100644 --- a/kernel/time/tick-common.c +++ b/kernel/time/tick-common.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0 /* * This file contains the base functions to manage periodic tick * related events. diff --git a/kernel/time/tick-oneshot.c b/kernel/time/tick-oneshot.c index 77989efe13d2..1c8ad0fb33c0 100644 --- a/kernel/time/tick-oneshot.c +++ b/kernel/time/tick-oneshot.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0 /* * This file contains functions which manage high resolution tick * related events. diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index cb557e56a19f..62ecb2a802ca 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0 /* * Copyright(C) 2005-2006, Thomas Gleixner * Copyright(C) 2005-2007, Red Hat, Inc., Ingo Molnar diff --git a/kernel/time/time.c b/kernel/time/time.c index 13ffa9950ffc..5aa0a156e331 100644 --- a/kernel/time/time.c +++ b/kernel/time/time.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0 /* * Copyright (C) 1991, 1992 Linus Torvalds * diff --git a/kernel/time/timeconst.bc b/kernel/time/timeconst.bc index f83bbb81600b..7ed0e0fb5831 100644 --- a/kernel/time/timeconst.bc +++ b/kernel/time/timeconst.bc @@ -1,3 +1,5 @@ +/* SPDX-License-Identifier: GPL-2.0 */ + scale=0 define gcd(a,b) { diff --git a/kernel/time/timeconv.c b/kernel/time/timeconv.c index 7142580ad94f..589e0a552129 100644 --- a/kernel/time/timeconv.c +++ b/kernel/time/timeconv.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: LGPL-2.0+ /* * Copyright (C) 1993, 1994, 1995, 1996, 1997 Free Software Foundation, Inc. * This file is part of the GNU C Library. diff --git a/kernel/time/timecounter.c b/kernel/time/timecounter.c index 400f3456d564..933462326489 100644 --- a/kernel/time/timecounter.c +++ b/kernel/time/timecounter.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0+ /* * Based on clocksource code. See commit 74d23cc704d1 * diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c index 30fdf48f50c2..cd02bd38cf2d 100644 --- a/kernel/time/timekeeping.c +++ b/kernel/time/timekeeping.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0 /* * Kernel timekeeping code and accessor functions. Based on code from * timer.c, moved in commit 8524070b7982. diff --git a/kernel/time/timekeeping_debug.c b/kernel/time/timekeeping_debug.c index 238e4be60229..d06f09209fb7 100644 --- a/kernel/time/timekeeping_debug.c +++ b/kernel/time/timekeeping_debug.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0+ /* * debugfs file to track time spent in suspend * diff --git a/kernel/time/timer.c b/kernel/time/timer.c index 2f248bbedb4a..444156debfa0 100644 --- a/kernel/time/timer.c +++ b/kernel/time/timer.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0 /* * Kernel internal timers * diff --git a/kernel/time/timer_list.c b/kernel/time/timer_list.c index 5d64fff384c8..f81693cdf981 100644 --- a/kernel/time/timer_list.c +++ b/kernel/time/timer_list.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0 /* * List pending timers * -- cgit v1.3-7-g2ca7 From f49c174b5f431db9fa17315269e288d4548b651c Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Wed, 31 Oct 2018 19:21:10 +0100 Subject: hrtimers/tick/clockevents: Remove sloppy license references "For licencing details see kernel-base/COPYING" and similar license references have no value over the SPDX identifier. Remove them. Signed-off-by: Thomas Gleixner Acked-by: Kees Cook Acked-by: Ingo Molnar Acked-by: John Stultz Acked-by: Corey Minyard Cc: Peter Zijlstra Cc: Kate Stewart Cc: Philippe Ombredanne Cc: Peter Anvin Cc: Russell King Cc: Richard Cochran Cc: "Paul E. McKenney" Cc: Nicolas Pitre Cc: David Riley Cc: Colin Cross Cc: Mark Brown Link: https://lkml.kernel.org/r/20181031182252.963632760@linutronix.de --- include/linux/hrtimer.h | 2 -- kernel/time/clockevents.c | 3 --- kernel/time/hrtimer.c | 2 -- kernel/time/tick-broadcast.c | 3 --- kernel/time/tick-common.c | 3 --- kernel/time/tick-oneshot.c | 3 --- kernel/time/tick-sched.c | 2 -- kernel/time/timer_list.c | 4 ---- 8 files changed, 22 deletions(-) (limited to 'kernel') diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h index 851e4231d3ab..2e8957eac4d4 100644 --- a/include/linux/hrtimer.h +++ b/include/linux/hrtimer.h @@ -8,8 +8,6 @@ * data type definitions, declarations, prototypes * * Started by: Thomas Gleixner and Ingo Molnar - * - * For licencing details see kernel-base/COPYING */ #ifndef _LINUX_HRTIMER_H #define _LINUX_HRTIMER_H diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c index 0fdbdf17f8a2..5e77662dd2d9 100644 --- a/kernel/time/clockevents.c +++ b/kernel/time/clockevents.c @@ -5,9 +5,6 @@ * Copyright(C) 2005-2006, Thomas Gleixner * Copyright(C) 2005-2007, Red Hat, Inc., Ingo Molnar * Copyright(C) 2006-2007, Timesys Corp., Thomas Gleixner - * - * This code is licenced under the GPL version 2. For details see - * kernel-base/COPYING. */ #include diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c index 16dacc8d3ca2..f5cfa1b73d6f 100644 --- a/kernel/time/hrtimer.c +++ b/kernel/time/hrtimer.c @@ -20,8 +20,6 @@ * * George Anzinger, Andrew Morton, Steven Rostedt, Roman Zippel * et. al. - * - * For licencing details see kernel-base/COPYING */ #include diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c index f4725f53d852..803fa67aace9 100644 --- a/kernel/time/tick-broadcast.c +++ b/kernel/time/tick-broadcast.c @@ -6,9 +6,6 @@ * Copyright(C) 2005-2006, Thomas Gleixner * Copyright(C) 2005-2007, Red Hat, Inc., Ingo Molnar * Copyright(C) 2006-2007, Timesys Corp., Thomas Gleixner - * - * This code is licenced under the GPL version 2. For details see - * kernel-base/COPYING. */ #include #include diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c index 455b8d65a2b7..529143b4c8d2 100644 --- a/kernel/time/tick-common.c +++ b/kernel/time/tick-common.c @@ -6,9 +6,6 @@ * Copyright(C) 2005-2006, Thomas Gleixner * Copyright(C) 2005-2007, Red Hat, Inc., Ingo Molnar * Copyright(C) 2006-2007, Timesys Corp., Thomas Gleixner - * - * This code is licenced under the GPL version 2. For details see - * kernel-base/COPYING. */ #include #include diff --git a/kernel/time/tick-oneshot.c b/kernel/time/tick-oneshot.c index 1c8ad0fb33c0..f9745d47425a 100644 --- a/kernel/time/tick-oneshot.c +++ b/kernel/time/tick-oneshot.c @@ -6,9 +6,6 @@ * Copyright(C) 2005-2006, Thomas Gleixner * Copyright(C) 2005-2007, Red Hat, Inc., Ingo Molnar * Copyright(C) 2006-2007, Timesys Corp., Thomas Gleixner - * - * This code is licenced under the GPL version 2. For details see - * kernel-base/COPYING. */ #include #include diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index 62ecb2a802ca..6fa52cd6df0b 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -7,8 +7,6 @@ * No idle tick implementation for low and high resolution timers * * Started by: Thomas Gleixner and Ingo Molnar - * - * Distribute under GPLv2. */ #include #include diff --git a/kernel/time/timer_list.c b/kernel/time/timer_list.c index f81693cdf981..98ba50dcb1b2 100644 --- a/kernel/time/timer_list.c +++ b/kernel/time/timer_list.c @@ -3,10 +3,6 @@ * List pending timers * * Copyright(C) 2006, Red Hat, Inc., Ingo Molnar - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License version 2 as - * published by the Free Software Foundation. */ #include -- cgit v1.3-7-g2ca7 From 9281a7857b91cf4d283be7c86d80e5d091bfb3d9 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Wed, 31 Oct 2018 19:21:11 +0100 Subject: time/debug: Remove license boilerplate The SPDX identifier is enough. Remove the license boilerplate. Signed-off-by: Thomas Gleixner Acked-by: Kees Cook Acked-by: Ingo Molnar Acked-by: John Stultz Acked-by: Corey Minyard Cc: Peter Zijlstra Cc: Kate Stewart Cc: Philippe Ombredanne Cc: Peter Anvin Cc: Russell King Cc: Richard Cochran Cc: "Paul E. McKenney" Cc: Nicolas Pitre Cc: David Riley Cc: Colin Cross Cc: Mark Brown Link: https://lkml.kernel.org/r/20181031182253.047449481@linutronix.de --- kernel/time/test_udelay.c | 9 --------- kernel/time/timekeeping_debug.c | 10 ---------- 2 files changed, 19 deletions(-) (limited to 'kernel') diff --git a/kernel/time/test_udelay.c b/kernel/time/test_udelay.c index d6a87bb2040f..77c63005dc4e 100644 --- a/kernel/time/test_udelay.c +++ b/kernel/time/test_udelay.c @@ -8,15 +8,6 @@ * Specifying usecs of 0 or negative values will run multiples tests. * * Copyright (C) 2014 Google, Inc. - * - * This software is licensed under the terms of the GNU General Public - * License version 2, as published by the Free Software Foundation, and - * may be copied, distributed, and modified under those terms. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. */ #include diff --git a/kernel/time/timekeeping_debug.c b/kernel/time/timekeeping_debug.c index d06f09209fb7..f811882cfd13 100644 --- a/kernel/time/timekeeping_debug.c +++ b/kernel/time/timekeeping_debug.c @@ -3,16 +3,6 @@ * debugfs file to track time spent in suspend * * Copyright (c) 2011, Google, Inc. - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - * - * This program is distributed in the hope that it will be useful, but WITHOUT - * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or - * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for - * more details. */ #include -- cgit v1.3-7-g2ca7 From 6c7811c628a96fe00965622f3aa1cf272cb898f7 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Wed, 31 Oct 2018 19:21:12 +0100 Subject: time: Remove license boilerplate The SPDX identifier defines the license of the files already. No need for the boilerplates. Signed-off-by: Thomas Gleixner Acked-by: Kees Cook Acked-by: Ingo Molnar Acked-by: John Stultz Acked-by: Corey Minyard Acked-by: Paul E. McKenney Cc: Peter Zijlstra Cc: Kate Stewart Cc: Philippe Ombredanne Cc: Peter Anvin Cc: Russell King Cc: Richard Cochran Cc: Nicolas Pitre Cc: David Riley Cc: Colin Cross Cc: Mark Brown Cc: Paul E. McKenney Link: https://lkml.kernel.org/r/20181031182253.132458951@linutronix.de --- kernel/time/alarmtimer.c | 4 ---- kernel/time/clocksource.c | 14 -------------- kernel/time/jiffies.c | 25 +++++-------------------- kernel/time/timecounter.c | 10 ---------- 4 files changed, 5 insertions(+), 48 deletions(-) (limited to 'kernel') diff --git a/kernel/time/alarmtimer.c b/kernel/time/alarmtimer.c index 69070d399d70..2c97e8c2d29f 100644 --- a/kernel/time/alarmtimer.c +++ b/kernel/time/alarmtimer.c @@ -11,10 +11,6 @@ * Copyright (C) 2010 IBM Corperation * * Author: John Stultz - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License version 2 as - * published by the Free Software Foundation. */ #include #include diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c index b1abeac5f3f7..3bcc19ceb073 100644 --- a/kernel/time/clocksource.c +++ b/kernel/time/clocksource.c @@ -3,20 +3,6 @@ * This file contains the functions which manage clocksource drivers. * * Copyright (C) 2004, 2005 IBM, John Stultz (johnstul@us.ibm.com) - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program; if not, write to the Free Software - * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. */ #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt diff --git a/kernel/time/jiffies.c b/kernel/time/jiffies.c index 0deb0be2c445..dc1b6f1929f9 100644 --- a/kernel/time/jiffies.c +++ b/kernel/time/jiffies.c @@ -1,24 +1,9 @@ // SPDX-License-Identifier: GPL-2.0+ -/*********************************************************************** -* This file contains the jiffies based clocksource. -* -* Copyright (C) 2004, 2005 IBM, John Stultz (johnstul@us.ibm.com) -* -* This program is free software; you can redistribute it and/or modify -* it under the terms of the GNU General Public License as published by -* the Free Software Foundation; either version 2 of the License, or -* (at your option) any later version. -* -* This program is distributed in the hope that it will be useful, -* but WITHOUT ANY WARRANTY; without even the implied warranty of -* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the -* GNU General Public License for more details. -* -* You should have received a copy of the GNU General Public License -* along with this program; if not, write to the Free Software -* Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. -* -************************************************************************/ +/* + * This file contains the jiffies based clocksource. + * + * Copyright (C) 2004, 2005 IBM, John Stultz (johnstul@us.ibm.com) + */ #include #include #include diff --git a/kernel/time/timecounter.c b/kernel/time/timecounter.c index 933462326489..85b98e727306 100644 --- a/kernel/time/timecounter.c +++ b/kernel/time/timecounter.c @@ -1,16 +1,6 @@ // SPDX-License-Identifier: GPL-2.0+ /* * Based on clocksource code. See commit 74d23cc704d1 - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. */ #include #include -- cgit v1.3-7-g2ca7 From 3c8f2515ac0a986c8af6ba17c376f874306e71f4 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Wed, 31 Oct 2018 19:21:13 +0100 Subject: posix-timers/stubs: Remove license boilerplate The SPDX identifier defines the license of the file already. No need for the boilerplate. Signed-off-by: Thomas Gleixner Acked-by: Nicolas Pitre Acked-by: Kees Cook Acked-by: Ingo Molnar Acked-by: John Stultz Acked-by: Corey Minyard Cc: Peter Zijlstra Cc: Kate Stewart Cc: Philippe Ombredanne Cc: Peter Anvin Cc: Russell King Cc: Richard Cochran Cc: "Paul E. McKenney" Cc: David Riley Cc: Colin Cross Cc: Mark Brown Cc: Arnd Bergmann Link: https://lkml.kernel.org/r/20181031182253.215825217@linutronix.de --- kernel/time/posix-stubs.c | 4 ---- 1 file changed, 4 deletions(-) (limited to 'kernel') diff --git a/kernel/time/posix-stubs.c b/kernel/time/posix-stubs.c index b9f9f6f02e11..a51895486e5e 100644 --- a/kernel/time/posix-stubs.c +++ b/kernel/time/posix-stubs.c @@ -4,10 +4,6 @@ * * Created by: Nicolas Pitre, July 2016 * Copyright: (C) 2016 Linaro Limited - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License version 2 as - * published by the Free Software Foundation. */ #include -- cgit v1.3-7-g2ca7 From 2fa6d420c22268b30dc46c8cbfe8bec0e361e03e Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Wed, 31 Oct 2018 19:21:14 +0100 Subject: sched/clock: Remove license boilerplate The SPDX identifier defines the license of the file already. No need for the boilerplate. Signed-off-by: Thomas Gleixner Acked-by: Kees Cook Acked-by: Ingo Molnar Acked-by: John Stultz Acked-by: Corey Minyard Cc: Peter Zijlstra Cc: Kate Stewart Cc: Philippe Ombredanne Cc: Peter Anvin Cc: Russell King Cc: Richard Cochran Cc: "Paul E. McKenney" Cc: Nicolas Pitre Cc: David Riley Cc: Colin Cross Cc: Mark Brown Link: https://lkml.kernel.org/r/20181031182253.300140921@linutronix.de --- kernel/time/sched_clock.c | 4 ---- 1 file changed, 4 deletions(-) (limited to 'kernel') diff --git a/kernel/time/sched_clock.c b/kernel/time/sched_clock.c index 11570ba451cc..094b82ca95e5 100644 --- a/kernel/time/sched_clock.c +++ b/kernel/time/sched_clock.c @@ -2,10 +2,6 @@ /* * Generic sched_clock() support, to extend low level hardware time * counters to full 64-bit ns values. - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License version 2 as - * published by the Free Software Foundation. */ #include #include -- cgit v1.3-7-g2ca7 From c804efeb58229e6040b9a200cbab1fc8c150f99d Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Wed, 31 Oct 2018 19:21:15 +0100 Subject: posix-clocks: Remove license boiler plate The SPDX identifier defines the license of the file already. No need for the boilerplate. Signed-off-by: Thomas Gleixner Acked-by: Richard Cochran Acked-by: Kees Cook Acked-by: Ingo Molnar Acked-by: Manfred Rudigier Acked-by: John Stultz Acked-by: Corey Minyard Cc: Peter Zijlstra Cc: Kate Stewart Cc: Philippe Ombredanne Cc: Peter Anvin Cc: Russell King Cc: "Paul E. McKenney" Cc: Nicolas Pitre Cc: David Riley Cc: Colin Cross Cc: Mark Brown Link: https://lkml.kernel.org/r/20181031182253.385909804@linutronix.de --- kernel/time/posix-clock.c | 14 -------------- 1 file changed, 14 deletions(-) (limited to 'kernel') diff --git a/kernel/time/posix-clock.c b/kernel/time/posix-clock.c index 339e35e4605f..425bbfce6819 100644 --- a/kernel/time/posix-clock.c +++ b/kernel/time/posix-clock.c @@ -3,20 +3,6 @@ * Support for dynamic clock devices * * Copyright (C) 2010 OMICRON electronics GmbH - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program; if not, write to the Free Software - * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. */ #include #include -- cgit v1.3-7-g2ca7 From 0141de741e0710d5e2e68087577329606f59ed71 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Wed, 31 Oct 2018 19:21:16 +0100 Subject: posix-timers: Remove license boilerplate The SPDX identifier defines the license of the file already. No need for the boilerplate. Remove also the completely outdated Montavista snail mail address. Signed-off-by: Thomas Gleixner Acked-by: Kees Cook Acked-by: Ingo Molnar Acked-by: John Stultz Acked-by: Corey Minyard Cc: Peter Zijlstra Cc: Kate Stewart Cc: Philippe Ombredanne Cc: Peter Anvin Cc: Russell King Cc: Richard Cochran Cc: "Paul E. McKenney" Cc: Nicolas Pitre Cc: David Riley Cc: Colin Cross Cc: Mark Brown Link: https://lkml.kernel.org/r/20181031182253.479792883@linutronix.de --- kernel/time/posix-timers.c | 20 +------------------- 1 file changed, 1 insertion(+), 19 deletions(-) (limited to 'kernel') diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c index e8cd9aa6c9cf..dd70ced15a36 100644 --- a/kernel/time/posix-timers.c +++ b/kernel/time/posix-timers.c @@ -7,25 +7,7 @@ * 2004-06-01 Fix CLOCK_REALTIME clock/timer TIMER_ABSTIME bug. * Copyright (C) 2004 Boris Hu * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or (at - * your option) any later version. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. - - * You should have received a copy of the GNU General Public License - * along with this program; if not, write to the Free Software - * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. - * - * MontaVista Software | 1237 East Arques Avenue | Sunnyvale | CA 94085 | USA - */ - -/* These are all the functions necessary to implement - * POSIX clocks & timers + * These are all the functions necessary to implement POSIX clocks & timers */ #include #include -- cgit v1.3-7-g2ca7 From cf0dd411e80f7066cabf69899724e48dd3192b99 Mon Sep 17 00:00:00 2001 From: Rustam Kovhaev Date: Fri, 23 Nov 2018 15:48:16 -0800 Subject: bpf, tags: Fix DEFINE_PER_CPU expansion Building tags produces warning: ctags: Warning: kernel/bpf/local_storage.c:10: null expansion of name pattern "\1" Let's use the same fix as in commit 25528213fe9f ("tags: Fix DEFINE_PER_CPU expansions"), even though it violates the usual code style. Signed-off-by: Rustam Kovhaev Signed-off-by: Daniel Borkmann --- kernel/bpf/local_storage.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/local_storage.c b/kernel/bpf/local_storage.c index c97a8f968638..9e94b1cc6cf2 100644 --- a/kernel/bpf/local_storage.c +++ b/kernel/bpf/local_storage.c @@ -7,8 +7,7 @@ #include #include -DEFINE_PER_CPU(struct bpf_cgroup_storage*, - bpf_cgroup_storage[MAX_BPF_CGROUP_STORAGE_TYPE]); +DEFINE_PER_CPU(struct bpf_cgroup_storage*, bpf_cgroup_storage[MAX_BPF_CGROUP_STORAGE_TYPE]); #ifdef CONFIG_CGROUP_BPF -- cgit v1.3-7-g2ca7 From 311fe1a813324ea6d8172a3e9eefb1b274c72fea Mon Sep 17 00:00:00 2001 From: Colin Ian King Date: Sun, 25 Nov 2018 23:32:51 +0000 Subject: bpf: btf: fix spelling mistake "Memmber" -> "Member" There is a spelling mistake in a btf_verifier_log_member message, fix it. Signed-off-by: Colin Ian King Signed-off-by: Daniel Borkmann --- kernel/bpf/btf.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index 69da9169819a..a09b2f94ab25 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -1621,7 +1621,7 @@ static s32 btf_struct_check_meta(struct btf_verifier_env *env, if (BITS_ROUNDUP_BYTES(member->offset) > struct_size) { btf_verifier_log_member(env, t, member, - "Memmber bits_offset exceeds its struct size"); + "Member bits_offset exceeds its struct size"); return -EINVAL; } -- cgit v1.3-7-g2ca7 From d0a3f18a70f2d9700bf9f5e9c4a505472388a7c1 Mon Sep 17 00:00:00 2001 From: Paul Moore Date: Thu, 2 Aug 2018 17:56:50 -0400 Subject: audit: minimize our use of audit_log_format() There are some cases where we are making multiple audit_log_format() calls in a row, for no apparent reason. Squash these down to a single audit_log_format() call whenever possible. Acked-by: Richard Guy Briggs Signed-off-by: Paul Moore --- kernel/audit.c | 11 +++++------ kernel/audit_fsnotify.c | 3 +-- kernel/audit_tree.c | 3 +-- kernel/audit_watch.c | 3 +-- kernel/auditsc.c | 7 +++---- 5 files changed, 11 insertions(+), 16 deletions(-) (limited to 'kernel') diff --git a/kernel/audit.c b/kernel/audit.c index 6c53e373b828..d09298d3c2d2 100644 --- a/kernel/audit.c +++ b/kernel/audit.c @@ -2177,22 +2177,21 @@ void audit_log_name(struct audit_context *context, struct audit_names *n, } /* log the audit_names record type */ - audit_log_format(ab, " nametype="); switch(n->type) { case AUDIT_TYPE_NORMAL: - audit_log_format(ab, "NORMAL"); + audit_log_format(ab, " nametype=NORMAL"); break; case AUDIT_TYPE_PARENT: - audit_log_format(ab, "PARENT"); + audit_log_format(ab, " nametype=PARENT"); break; case AUDIT_TYPE_CHILD_DELETE: - audit_log_format(ab, "DELETE"); + audit_log_format(ab, " nametype=DELETE"); break; case AUDIT_TYPE_CHILD_CREATE: - audit_log_format(ab, "CREATE"); + audit_log_format(ab, " nametype=CREATE"); break; default: - audit_log_format(ab, "UNKNOWN"); + audit_log_format(ab, " nametype=UNKNOWN"); break; } diff --git a/kernel/audit_fsnotify.c b/kernel/audit_fsnotify.c index f90ffa699e5b..cf4512a33675 100644 --- a/kernel/audit_fsnotify.c +++ b/kernel/audit_fsnotify.c @@ -131,8 +131,7 @@ static void audit_mark_log_rule_change(struct audit_fsnotify_mark *audit_mark, c if (unlikely(!ab)) return; audit_log_session_info(ab); - audit_log_format(ab, " op=%s", op); - audit_log_format(ab, " path="); + audit_log_format(ab, " op=%s path=", op); audit_log_untrustedstring(ab, audit_mark->path); audit_log_key(ab, rule->filterkey); audit_log_format(ab, " list=%d res=1", rule->listnr); diff --git a/kernel/audit_tree.c b/kernel/audit_tree.c index 58e84eb5d826..d4af4d97f847 100644 --- a/kernel/audit_tree.c +++ b/kernel/audit_tree.c @@ -533,8 +533,7 @@ static void audit_tree_log_remove_rule(struct audit_krule *rule) ab = audit_log_start(NULL, GFP_KERNEL, AUDIT_CONFIG_CHANGE); if (unlikely(!ab)) return; - audit_log_format(ab, "op=remove_rule"); - audit_log_format(ab, " dir="); + audit_log_format(ab, "op=remove_rule dir="); audit_log_untrustedstring(ab, rule->tree->pathname); audit_log_key(ab, rule->filterkey); audit_log_format(ab, " list=%d res=1", rule->listnr); diff --git a/kernel/audit_watch.c b/kernel/audit_watch.c index 568e48d1d0ab..20ef9ba134b0 100644 --- a/kernel/audit_watch.c +++ b/kernel/audit_watch.c @@ -246,8 +246,7 @@ static void audit_watch_log_rule_change(struct audit_krule *r, struct audit_watc if (!ab) return; audit_log_session_info(ab); - audit_log_format(ab, "op=%s", op); - audit_log_format(ab, " path="); + audit_log_format(ab, "op=%s path=", op); audit_log_untrustedstring(ab, w->path); audit_log_key(ab, r->filterkey); audit_log_format(ab, " list=%d res=1", r->listnr); diff --git a/kernel/auditsc.c b/kernel/auditsc.c index 605f2d825204..51e735aedf58 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -2503,10 +2503,9 @@ void audit_seccomp_actions_logged(const char *names, const char *old_names, if (unlikely(!ab)) return; - audit_log_format(ab, "op=seccomp-logging"); - audit_log_format(ab, " actions=%s", names); - audit_log_format(ab, " old-actions=%s", old_names); - audit_log_format(ab, " res=%d", res); + audit_log_format(ab, + "op=seccomp-logging actions=%s old-actions=%s res=%d", + names, old_names, res); audit_log_end(ab); } -- cgit v1.3-7-g2ca7 From 2a1fe215e7300c7ebd6a7a24afcab71db5107bb0 Mon Sep 17 00:00:00 2001 From: Paul Moore Date: Mon, 26 Nov 2018 18:40:07 -0500 Subject: audit: use current whenever possible There are many places, notably audit_log_task_info() and audit_log_exit(), that take task_struct pointers but in reality they are always working on the current task. This patch eliminates the task_struct arguments and uses current directly which allows a number of cleanups as well. Acked-by: Richard Guy Briggs Signed-off-by: Paul Moore --- drivers/tty/tty_audit.c | 13 ++-- include/linux/audit.h | 6 +- kernel/audit.c | 34 +++++----- kernel/audit.h | 2 +- kernel/auditsc.c | 131 +++++++++++++++++++-------------------- security/integrity/ima/ima_api.c | 2 +- 6 files changed, 90 insertions(+), 98 deletions(-) (limited to 'kernel') diff --git a/drivers/tty/tty_audit.c b/drivers/tty/tty_audit.c index 50f567b6a66e..28f87fd6a28e 100644 --- a/drivers/tty/tty_audit.c +++ b/drivers/tty/tty_audit.c @@ -61,20 +61,19 @@ static void tty_audit_log(const char *description, dev_t dev, unsigned char *data, size_t size) { struct audit_buffer *ab; - struct task_struct *tsk = current; - pid_t pid = task_pid_nr(tsk); - uid_t uid = from_kuid(&init_user_ns, task_uid(tsk)); - uid_t loginuid = from_kuid(&init_user_ns, audit_get_loginuid(tsk)); - unsigned int sessionid = audit_get_sessionid(tsk); + pid_t pid = task_pid_nr(current); + uid_t uid = from_kuid(&init_user_ns, task_uid(current)); + uid_t loginuid = from_kuid(&init_user_ns, audit_get_loginuid(current)); + unsigned int sessionid = audit_get_sessionid(current); ab = audit_log_start(NULL, GFP_KERNEL, AUDIT_TTY); if (ab) { - char name[sizeof(tsk->comm)]; + char name[sizeof(current->comm)]; audit_log_format(ab, "%s pid=%u uid=%u auid=%u ses=%u major=%d" " minor=%d comm=", description, pid, uid, loginuid, sessionid, MAJOR(dev), MINOR(dev)); - get_task_comm(name, tsk); + get_task_comm(name, current); audit_log_untrustedstring(ab, name); audit_log_format(ab, " data="); audit_log_n_hex(ab, data, size); diff --git a/include/linux/audit.h b/include/linux/audit.h index 58cf665f597e..a625c29a2ea2 100644 --- a/include/linux/audit.h +++ b/include/linux/audit.h @@ -151,8 +151,7 @@ extern void audit_log_link_denied(const char *operation); extern void audit_log_lost(const char *message); extern int audit_log_task_context(struct audit_buffer *ab); -extern void audit_log_task_info(struct audit_buffer *ab, - struct task_struct *tsk); +extern void audit_log_task_info(struct audit_buffer *ab); extern int audit_update_lsm_rules(void); @@ -200,8 +199,7 @@ static inline int audit_log_task_context(struct audit_buffer *ab) { return 0; } -static inline void audit_log_task_info(struct audit_buffer *ab, - struct task_struct *tsk) +static inline void audit_log_task_info(struct audit_buffer *ab) { } #define audit_enabled AUDIT_OFF #endif /* CONFIG_AUDIT */ diff --git a/kernel/audit.c b/kernel/audit.c index d09298d3c2d2..779671883349 100644 --- a/kernel/audit.c +++ b/kernel/audit.c @@ -1096,10 +1096,11 @@ static void audit_log_feature_change(int which, u32 old_feature, u32 new_feature if (audit_enabled == AUDIT_OFF) return; + ab = audit_log_start(audit_context(), GFP_KERNEL, AUDIT_FEATURE_CHANGE); if (!ab) return; - audit_log_task_info(ab, current); + audit_log_task_info(ab); audit_log_format(ab, " feature=%s old=%u new=%u old_lock=%u new_lock=%u res=%d", audit_feature_names[which], !!old_feature, !!new_feature, !!old_lock, !!new_lock, res); @@ -2246,15 +2247,15 @@ out_null: audit_log_format(ab, " exe=(null)"); } -struct tty_struct *audit_get_tty(struct task_struct *tsk) +struct tty_struct *audit_get_tty(void) { struct tty_struct *tty = NULL; unsigned long flags; - spin_lock_irqsave(&tsk->sighand->siglock, flags); - if (tsk->signal) - tty = tty_kref_get(tsk->signal->tty); - spin_unlock_irqrestore(&tsk->sighand->siglock, flags); + spin_lock_irqsave(¤t->sighand->siglock, flags); + if (current->signal) + tty = tty_kref_get(current->signal->tty); + spin_unlock_irqrestore(¤t->sighand->siglock, flags); return tty; } @@ -2263,25 +2264,24 @@ void audit_put_tty(struct tty_struct *tty) tty_kref_put(tty); } -void audit_log_task_info(struct audit_buffer *ab, struct task_struct *tsk) +void audit_log_task_info(struct audit_buffer *ab) { const struct cred *cred; - char comm[sizeof(tsk->comm)]; + char comm[sizeof(current->comm)]; struct tty_struct *tty; if (!ab) return; - /* tsk == current */ cred = current_cred(); - tty = audit_get_tty(tsk); + tty = audit_get_tty(); audit_log_format(ab, " ppid=%d pid=%d auid=%u uid=%u gid=%u" " euid=%u suid=%u fsuid=%u" " egid=%u sgid=%u fsgid=%u tty=%s ses=%u", - task_ppid_nr(tsk), - task_tgid_nr(tsk), - from_kuid(&init_user_ns, audit_get_loginuid(tsk)), + task_ppid_nr(current), + task_tgid_nr(current), + from_kuid(&init_user_ns, audit_get_loginuid(current)), from_kuid(&init_user_ns, cred->uid), from_kgid(&init_user_ns, cred->gid), from_kuid(&init_user_ns, cred->euid), @@ -2291,11 +2291,11 @@ void audit_log_task_info(struct audit_buffer *ab, struct task_struct *tsk) from_kgid(&init_user_ns, cred->sgid), from_kgid(&init_user_ns, cred->fsgid), tty ? tty_name(tty) : "(none)", - audit_get_sessionid(tsk)); + audit_get_sessionid(current)); audit_put_tty(tty); audit_log_format(ab, " comm="); - audit_log_untrustedstring(ab, get_task_comm(comm, tsk)); - audit_log_d_path_exe(ab, tsk->mm); + audit_log_untrustedstring(ab, get_task_comm(comm, current)); + audit_log_d_path_exe(ab, current->mm); audit_log_task_context(ab); } EXPORT_SYMBOL(audit_log_task_info); @@ -2316,7 +2316,7 @@ void audit_log_link_denied(const char *operation) if (!ab) return; audit_log_format(ab, "op=%s", operation); - audit_log_task_info(ab, current); + audit_log_task_info(ab); audit_log_format(ab, " res=0"); audit_log_end(ab); } diff --git a/kernel/audit.h b/kernel/audit.h index 0b5295aeaebb..91421679a168 100644 --- a/kernel/audit.h +++ b/kernel/audit.h @@ -264,7 +264,7 @@ extern struct audit_entry *audit_dupe_rule(struct audit_krule *old); extern void audit_log_d_path_exe(struct audit_buffer *ab, struct mm_struct *mm); -extern struct tty_struct *audit_get_tty(struct task_struct *tsk); +extern struct tty_struct *audit_get_tty(void); extern void audit_put_tty(struct tty_struct *tty); /* audit watch functions */ diff --git a/kernel/auditsc.c b/kernel/auditsc.c index 51e735aedf58..6593a5207fb0 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -830,44 +830,6 @@ void audit_filter_inodes(struct task_struct *tsk, struct audit_context *ctx) rcu_read_unlock(); } -/* Transfer the audit context pointer to the caller, clearing it in the tsk's struct */ -static inline struct audit_context *audit_take_context(struct task_struct *tsk, - int return_valid, - long return_code) -{ - struct audit_context *context = tsk->audit_context; - - if (!context) - return NULL; - context->return_valid = return_valid; - - /* - * we need to fix up the return code in the audit logs if the actual - * return codes are later going to be fixed up by the arch specific - * signal handlers - * - * This is actually a test for: - * (rc == ERESTARTSYS ) || (rc == ERESTARTNOINTR) || - * (rc == ERESTARTNOHAND) || (rc == ERESTART_RESTARTBLOCK) - * - * but is faster than a bunch of || - */ - if (unlikely(return_code <= -ERESTARTSYS) && - (return_code >= -ERESTART_RESTARTBLOCK) && - (return_code != -ENOIOCTLCMD)) - context->return_code = -EINTR; - else - context->return_code = return_code; - - if (context->in_syscall && !context->dummy) { - audit_filter_syscall(tsk, context, &audit_filter_list[AUDIT_FILTER_EXIT]); - audit_filter_inodes(tsk, context); - } - - audit_set_context(tsk, NULL); - return context; -} - static inline void audit_proctitle_free(struct audit_context *context) { kfree(context->proctitle.value); @@ -1296,15 +1258,18 @@ static inline int audit_proctitle_rtrim(char *proctitle, int len) return len; } -static void audit_log_proctitle(struct task_struct *tsk, - struct audit_context *context) +static void audit_log_proctitle(void) { int res; char *buf; char *msg = "(null)"; int len = strlen(msg); + struct audit_context *context = audit_context(); struct audit_buffer *ab; + if (!context || context->dummy) + return; + ab = audit_log_start(context, GFP_KERNEL, AUDIT_PROCTITLE); if (!ab) return; /* audit_panic or being filtered */ @@ -1317,7 +1282,7 @@ static void audit_log_proctitle(struct task_struct *tsk, if (!buf) goto out; /* Historically called this from procfs naming */ - res = get_cmdline(tsk, buf, MAX_PROCTITLE_AUDIT_LEN); + res = get_cmdline(current, buf, MAX_PROCTITLE_AUDIT_LEN); if (res == 0) { kfree(buf); goto out; @@ -1337,15 +1302,15 @@ out: audit_log_end(ab); } -static void audit_log_exit(struct audit_context *context, struct task_struct *tsk) +static void audit_log_exit(void) { int i, call_panic = 0; + struct audit_context *context = audit_context(); struct audit_buffer *ab; struct audit_aux_data *aux; struct audit_names *n; - /* tsk == current */ - context->personality = tsk->personality; + context->personality = current->personality; ab = audit_log_start(context, GFP_KERNEL, AUDIT_SYSCALL); if (!ab) @@ -1367,7 +1332,7 @@ static void audit_log_exit(struct audit_context *context, struct task_struct *ts context->argv[3], context->name_count); - audit_log_task_info(ab, tsk); + audit_log_task_info(ab); audit_log_key(ab, context->filterkey); audit_log_end(ab); @@ -1456,7 +1421,7 @@ static void audit_log_exit(struct audit_context *context, struct task_struct *ts audit_log_name(context, n, NULL, i++, &call_panic); } - audit_log_proctitle(tsk, context); + audit_log_proctitle(); /* Send end of event record to help user space know we are finished */ ab = audit_log_start(context, GFP_KERNEL, AUDIT_EOE); @@ -1474,22 +1439,31 @@ static void audit_log_exit(struct audit_context *context, struct task_struct *ts */ void __audit_free(struct task_struct *tsk) { - struct audit_context *context; + struct audit_context *context = tsk->audit_context; - context = audit_take_context(tsk, 0, 0); if (!context) return; - /* Check for system calls that do not go through the exit - * function (e.g., exit_group), then free context block. - * We use GFP_ATOMIC here because we might be doing this - * in the context of the idle thread */ - /* that can happen only if we are called from do_exit() */ - if (context->in_syscall && context->current_state == AUDIT_RECORD_CONTEXT) - audit_log_exit(context, tsk); + /* We are called either by do_exit() or the fork() error handling code; + * in the former case tsk == current and in the latter tsk is a + * random task_struct that doesn't doesn't have any meaningful data we + * need to log via audit_log_exit(). + */ + if (tsk == current && !context->dummy && context->in_syscall) { + context->return_valid = 0; + context->return_code = 0; + + audit_filter_syscall(tsk, context, + &audit_filter_list[AUDIT_FILTER_EXIT]); + audit_filter_inodes(tsk, context); + if (context->current_state == AUDIT_RECORD_CONTEXT) + audit_log_exit(); + } + if (!list_empty(&context->killed_trees)) audit_kill_trees(&context->killed_trees); + audit_set_context(tsk, NULL); audit_free_context(context); } @@ -1559,17 +1533,40 @@ void __audit_syscall_exit(int success, long return_code) { struct audit_context *context; - if (success) - success = AUDITSC_SUCCESS; - else - success = AUDITSC_FAILURE; - - context = audit_take_context(current, success, return_code); + context = audit_context(); if (!context) return; - if (context->in_syscall && context->current_state == AUDIT_RECORD_CONTEXT) - audit_log_exit(context, current); + if (!context->dummy && context->in_syscall) { + if (success) + context->return_valid = AUDITSC_SUCCESS; + else + context->return_valid = AUDITSC_FAILURE; + + /* + * we need to fix up the return code in the audit logs if the + * actual return codes are later going to be fixed up by the + * arch specific signal handlers + * + * This is actually a test for: + * (rc == ERESTARTSYS ) || (rc == ERESTARTNOINTR) || + * (rc == ERESTARTNOHAND) || (rc == ERESTART_RESTARTBLOCK) + * + * but is faster than a bunch of || + */ + if (unlikely(return_code <= -ERESTARTSYS) && + (return_code >= -ERESTART_RESTARTBLOCK) && + (return_code != -ENOIOCTLCMD)) + context->return_code = -EINTR; + else + context->return_code = return_code; + + audit_filter_syscall(current, context, + &audit_filter_list[AUDIT_FILTER_EXIT]); + audit_filter_inodes(current, context); + if (context->current_state == AUDIT_RECORD_CONTEXT) + audit_log_exit(); + } context->in_syscall = 0; context->prio = context->state == AUDIT_RECORD_CONTEXT ? ~0ULL : 0; @@ -1591,7 +1588,6 @@ void __audit_syscall_exit(int success, long return_code) kfree(context->filterkey); context->filterkey = NULL; } - audit_set_context(current, context); } static inline void handle_one(const struct inode *inode) @@ -2025,7 +2021,7 @@ static void audit_log_set_loginuid(kuid_t koldloginuid, kuid_t kloginuid, uid = from_kuid(&init_user_ns, task_uid(current)); oldloginuid = from_kuid(&init_user_ns, koldloginuid); loginuid = from_kuid(&init_user_ns, kloginuid), - tty = audit_get_tty(current); + tty = audit_get_tty(); audit_log_format(ab, "pid=%d uid=%u", task_tgid_nr(current), uid); audit_log_task_context(ab); @@ -2046,7 +2042,6 @@ static void audit_log_set_loginuid(kuid_t koldloginuid, kuid_t kloginuid, */ int audit_set_loginuid(kuid_t loginuid) { - struct task_struct *task = current; unsigned int oldsessionid, sessionid = AUDIT_SID_UNSET; kuid_t oldloginuid; int rc; @@ -2065,8 +2060,8 @@ int audit_set_loginuid(kuid_t loginuid) sessionid = (unsigned int)atomic_inc_return(&session_id); } - task->sessionid = sessionid; - task->loginuid = loginuid; + current->sessionid = sessionid; + current->loginuid = loginuid; out: audit_log_set_loginuid(oldloginuid, loginuid, oldsessionid, sessionid, rc); return rc; diff --git a/security/integrity/ima/ima_api.c b/security/integrity/ima/ima_api.c index 99dd1d53fc35..af134588ab4e 100644 --- a/security/integrity/ima/ima_api.c +++ b/security/integrity/ima/ima_api.c @@ -336,7 +336,7 @@ void ima_audit_measurement(struct integrity_iint_cache *iint, audit_log_untrustedstring(ab, filename); audit_log_format(ab, " hash=\"%s:%s\"", algo_name, hash); - audit_log_task_info(ab, current); + audit_log_task_info(ab); audit_log_end(ab); iint->flags |= IMA_AUDITED; -- cgit v1.3-7-g2ca7 From ba64e7d8525236aa56ab58ba3a3a71615c4ee289 Mon Sep 17 00:00:00 2001 From: Yonghong Song Date: Sat, 24 Nov 2018 23:20:44 -0800 Subject: bpf: btf: support proper non-jit func info Commit 838e96904ff3 ("bpf: Introduce bpf_func_info") added bpf func info support. The userspace is able to get better ksym's for bpf programs with jit, and is able to print out func prototypes. For a program containing func-to-func calls, the existing implementation returns user specified number of function calls and BTF types if jit is enabled. If the jit is not enabled, it only returns the type for the main function. This is undesirable. Interpreter may still be used and we should keep feature identical regardless of whether jit is enabled or not. This patch fixed this discrepancy. Fixes: 838e96904ff3 ("bpf: Introduce bpf_func_info") Signed-off-by: Yonghong Song Acked-by: Martin KaFai Lau Signed-off-by: Alexei Starovoitov --- include/linux/bpf.h | 6 +++-- include/linux/bpf_verifier.h | 1 - kernel/bpf/core.c | 3 ++- kernel/bpf/syscall.c | 33 +++++++------------------- kernel/bpf/verifier.c | 55 ++++++++++++++++++++++++++++++-------------- 5 files changed, 52 insertions(+), 46 deletions(-) (limited to 'kernel') diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 7f0e225bf630..e82b7039fc66 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -299,7 +299,8 @@ struct bpf_prog_aux { u32 max_pkt_offset; u32 stack_depth; u32 id; - u32 func_cnt; + u32 func_cnt; /* used by non-func prog as the number of func progs */ + u32 func_idx; /* 0 for non-func prog, the index in func array for func prog */ bool offload_requested; struct bpf_prog **func; void *jit_data; /* JIT specific data. arch dependent */ @@ -317,7 +318,8 @@ struct bpf_prog_aux { #endif struct bpf_prog_offload *offload; struct btf *btf; - u32 type_id; /* type id for this prog/func */ + struct bpf_func_info *func_info; + u32 func_info_cnt; union { struct work_struct work; struct rcu_head rcu; diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h index 204382f46fd8..11f5df1092d9 100644 --- a/include/linux/bpf_verifier.h +++ b/include/linux/bpf_verifier.h @@ -204,7 +204,6 @@ static inline bool bpf_verifier_log_needed(const struct bpf_verifier_log *log) struct bpf_subprog_info { u32 start; /* insn idx of function entry point */ u16 stack_depth; /* max. stack depth used by this function */ - u32 type_id; /* btf type_id for this subprog */ }; /* single container for all structs diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index 16d77012ad3e..002d67c62c8b 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -411,7 +411,8 @@ static void bpf_get_prog_name(const struct bpf_prog *prog, char *sym) /* prog->aux->name will be ignored if full btf name is available */ if (prog->aux->btf) { - type = btf_type_by_id(prog->aux->btf, prog->aux->type_id); + type = btf_type_by_id(prog->aux->btf, + prog->aux->func_info[prog->aux->func_idx].type_id); func_name = btf_name_by_offset(prog->aux->btf, type->name_off); snprintf(sym, (size_t)(end - sym), "_%s", func_name); return; diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 998377808102..85cbeec06e50 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -1214,6 +1214,7 @@ static void __bpf_prog_put(struct bpf_prog *prog, bool do_idr_lock) bpf_prog_free_id(prog, do_idr_lock); bpf_prog_kallsyms_del_all(prog); btf_put(prog->aux->btf); + kvfree(prog->aux->func_info); call_rcu(&prog->aux->rcu, __bpf_prog_put_rcu); } @@ -2219,46 +2220,28 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog, } if (prog->aux->btf) { + u32 krec_size = sizeof(struct bpf_func_info); u32 ucnt, urec_size; info.btf_id = btf_id(prog->aux->btf); ucnt = info.func_info_cnt; - info.func_info_cnt = prog->aux->func_cnt ? : 1; + info.func_info_cnt = prog->aux->func_info_cnt; urec_size = info.func_info_rec_size; - info.func_info_rec_size = sizeof(struct bpf_func_info); + info.func_info_rec_size = krec_size; if (ucnt) { /* expect passed-in urec_size is what the kernel expects */ if (urec_size != info.func_info_rec_size) return -EINVAL; if (bpf_dump_raw_ok()) { - struct bpf_func_info kern_finfo; char __user *user_finfo; - u32 i, insn_offset; user_finfo = u64_to_user_ptr(info.func_info); - if (prog->aux->func_cnt) { - ucnt = min_t(u32, info.func_info_cnt, ucnt); - insn_offset = 0; - for (i = 0; i < ucnt; i++) { - kern_finfo.insn_offset = insn_offset; - kern_finfo.type_id = prog->aux->func[i]->aux->type_id; - if (copy_to_user(user_finfo, &kern_finfo, - sizeof(kern_finfo))) - return -EFAULT; - - /* func[i]->len holds the prog len */ - insn_offset += prog->aux->func[i]->len; - user_finfo += urec_size; - } - } else { - kern_finfo.insn_offset = 0; - kern_finfo.type_id = prog->aux->type_id; - if (copy_to_user(user_finfo, &kern_finfo, - sizeof(kern_finfo))) - return -EFAULT; - } + ucnt = min_t(u32, info.func_info_cnt, ucnt); + if (copy_to_user(user_finfo, prog->aux->func_info, + krec_size * ucnt)) + return -EFAULT; } else { info.func_info_cnt = 0; } diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index f102c4fd0c5a..05d95c0e4a26 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -4650,7 +4650,7 @@ static int check_btf_func(struct bpf_prog *prog, struct bpf_verifier_env *env, { u32 i, nfuncs, urec_size, min_size, prev_offset; u32 krec_size = sizeof(struct bpf_func_info); - struct bpf_func_info krecord = {}; + struct bpf_func_info *krecord = NULL; const struct btf_type *type; void __user *urecord; struct btf *btf; @@ -4682,6 +4682,12 @@ static int check_btf_func(struct bpf_prog *prog, struct bpf_verifier_env *env, urecord = u64_to_user_ptr(attr->func_info); min_size = min_t(u32, krec_size, urec_size); + krecord = kvcalloc(nfuncs, krec_size, GFP_KERNEL | __GFP_NOWARN); + if (!krecord) { + ret = -ENOMEM; + goto free_btf; + } + for (i = 0; i < nfuncs; i++) { ret = bpf_check_uarg_tail_zero(urecord, krec_size, urec_size); if (ret) { @@ -4696,59 +4702,69 @@ static int check_btf_func(struct bpf_prog *prog, struct bpf_verifier_env *env, goto free_btf; } - if (copy_from_user(&krecord, urecord, min_size)) { + if (copy_from_user(&krecord[i], urecord, min_size)) { ret = -EFAULT; goto free_btf; } /* check insn_offset */ if (i == 0) { - if (krecord.insn_offset) { + if (krecord[i].insn_offset) { verbose(env, "nonzero insn_offset %u for the first func info record", - krecord.insn_offset); + krecord[i].insn_offset); ret = -EINVAL; goto free_btf; } - } else if (krecord.insn_offset <= prev_offset) { + } else if (krecord[i].insn_offset <= prev_offset) { verbose(env, "same or smaller insn offset (%u) than previous func info record (%u)", - krecord.insn_offset, prev_offset); + krecord[i].insn_offset, prev_offset); ret = -EINVAL; goto free_btf; } - if (env->subprog_info[i].start != krecord.insn_offset) { + if (env->subprog_info[i].start != krecord[i].insn_offset) { verbose(env, "func_info BTF section doesn't match subprog layout in BPF program\n"); ret = -EINVAL; goto free_btf; } /* check type_id */ - type = btf_type_by_id(btf, krecord.type_id); + type = btf_type_by_id(btf, krecord[i].type_id); if (!type || BTF_INFO_KIND(type->info) != BTF_KIND_FUNC) { verbose(env, "invalid type id %d in func info", - krecord.type_id); + krecord[i].type_id); ret = -EINVAL; goto free_btf; } - if (i == 0) - prog->aux->type_id = krecord.type_id; - env->subprog_info[i].type_id = krecord.type_id; - - prev_offset = krecord.insn_offset; + prev_offset = krecord[i].insn_offset; urecord += urec_size; } prog->aux->btf = btf; + prog->aux->func_info = krecord; + prog->aux->func_info_cnt = nfuncs; return 0; free_btf: btf_put(btf); + kvfree(krecord); return ret; } +static void adjust_btf_func(struct bpf_verifier_env *env) +{ + int i; + + if (!env->prog->aux->func_info) + return; + + for (i = 0; i < env->subprog_cnt; i++) + env->prog->aux->func_info[i].insn_offset = env->subprog_info[i].start; +} + /* check %cur's range satisfies %old's */ static bool range_within(struct bpf_reg_state *old, struct bpf_reg_state *cur) @@ -6043,15 +6059,17 @@ static int jit_subprogs(struct bpf_verifier_env *env) if (bpf_prog_calc_tag(func[i])) goto out_free; func[i]->is_func = 1; + func[i]->aux->func_idx = i; + /* the btf and func_info will be freed only at prog->aux */ + func[i]->aux->btf = prog->aux->btf; + func[i]->aux->func_info = prog->aux->func_info; + /* Use bpf_prog_F_tag to indicate functions in stack traces. * Long term would need debug info to populate names */ func[i]->aux->name[0] = 'F'; func[i]->aux->stack_depth = env->subprog_info[i].stack_depth; func[i]->jit_requested = 1; - /* the btf will be freed only at prog->aux */ - func[i]->aux->btf = prog->aux->btf; - func[i]->aux->type_id = env->subprog_info[i].type_id; func[i] = bpf_int_jit_compile(func[i]); if (!func[i]->jited) { err = -ENOTSUPP; @@ -6572,6 +6590,9 @@ skip_full_check: convert_pseudo_ld_imm64(env); } + if (ret == 0) + adjust_btf_func(env); + err_release_maps: if (!env->prog->aux->used_maps) /* if we didn't copy map pointers into bpf_prog_info, release -- cgit v1.3-7-g2ca7 From 7440172974e85b1828bdd84ac6b23b5bcad9c5eb Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 6 Nov 2018 18:44:52 -0800 Subject: tracing: Replace synchronize_sched() and call_rcu_sched() Now that synchronize_rcu() waits for preempt-disable regions of code as well as RCU read-side critical sections, synchronize_sched() can be replaced by synchronize_rcu(). Similarly, call_rcu_sched() can be replaced by call_rcu(). This commit therefore makes these changes. Signed-off-by: Paul E. McKenney Cc: Ingo Molnar Cc: Acked-by: Steven Rostedt (VMware) --- include/linux/tracepoint.h | 2 +- kernel/trace/ftrace.c | 24 ++++++++++++------------ kernel/trace/ring_buffer.c | 12 ++++++------ kernel/trace/trace.c | 10 +++++----- kernel/trace/trace_events_filter.c | 4 ++-- kernel/trace/trace_kprobe.c | 2 +- kernel/tracepoint.c | 4 ++-- 7 files changed, 29 insertions(+), 29 deletions(-) (limited to 'kernel') diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h index 538ba1a58f5b..432080b59c26 100644 --- a/include/linux/tracepoint.h +++ b/include/linux/tracepoint.h @@ -82,7 +82,7 @@ int unregister_tracepoint_module_notifier(struct notifier_block *nb) static inline void tracepoint_synchronize_unregister(void) { synchronize_srcu(&tracepoint_srcu); - synchronize_sched(); + synchronize_rcu(); } #else static inline void tracepoint_synchronize_unregister(void) diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index f536f601bd46..5b4f73e4fd56 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -173,7 +173,7 @@ static void ftrace_sync(struct work_struct *work) { /* * This function is just a stub to implement a hard force - * of synchronize_sched(). This requires synchronizing + * of synchronize_rcu(). This requires synchronizing * tasks even in userspace and idle. * * Yes, function tracing is rude. @@ -934,7 +934,7 @@ ftrace_profile_write(struct file *filp, const char __user *ubuf, ftrace_profile_enabled = 0; /* * unregister_ftrace_profiler calls stop_machine - * so this acts like an synchronize_sched. + * so this acts like an synchronize_rcu. */ unregister_ftrace_profiler(); } @@ -1086,7 +1086,7 @@ struct ftrace_ops *ftrace_ops_trampoline(unsigned long addr) /* * Some of the ops may be dynamically allocated, - * they are freed after a synchronize_sched(). + * they are freed after a synchronize_rcu(). */ preempt_disable_notrace(); @@ -1286,7 +1286,7 @@ static void free_ftrace_hash_rcu(struct ftrace_hash *hash) { if (!hash || hash == EMPTY_HASH) return; - call_rcu_sched(&hash->rcu, __free_ftrace_hash_rcu); + call_rcu(&hash->rcu, __free_ftrace_hash_rcu); } void ftrace_free_filter(struct ftrace_ops *ops) @@ -1501,7 +1501,7 @@ static bool hash_contains_ip(unsigned long ip, * the ip is not in the ops->notrace_hash. * * This needs to be called with preemption disabled as - * the hashes are freed with call_rcu_sched(). + * the hashes are freed with call_rcu(). */ static int ftrace_ops_test(struct ftrace_ops *ops, unsigned long ip, void *regs) @@ -4496,7 +4496,7 @@ unregister_ftrace_function_probe_func(char *glob, struct trace_array *tr, if (ftrace_enabled && !ftrace_hash_empty(hash)) ftrace_run_modify_code(&probe->ops, FTRACE_UPDATE_CALLS, &old_hash_ops); - synchronize_sched(); + synchronize_rcu(); hlist_for_each_entry_safe(entry, tmp, &hhd, hlist) { hlist_del(&entry->hlist); @@ -5314,7 +5314,7 @@ ftrace_graph_release(struct inode *inode, struct file *file) mutex_unlock(&graph_lock); /* Wait till all users are no longer using the old hash */ - synchronize_sched(); + synchronize_rcu(); free_ftrace_hash(old_hash); } @@ -5707,7 +5707,7 @@ void ftrace_release_mod(struct module *mod) list_for_each_entry_safe(mod_map, n, &ftrace_mod_maps, list) { if (mod_map->mod == mod) { list_del_rcu(&mod_map->list); - call_rcu_sched(&mod_map->rcu, ftrace_free_mod_map); + call_rcu(&mod_map->rcu, ftrace_free_mod_map); break; } } @@ -5927,7 +5927,7 @@ ftrace_mod_address_lookup(unsigned long addr, unsigned long *size, struct ftrace_mod_map *mod_map; const char *ret = NULL; - /* mod_map is freed via call_rcu_sched() */ + /* mod_map is freed via call_rcu() */ preempt_disable(); list_for_each_entry_rcu(mod_map, &ftrace_mod_maps, list) { ret = ftrace_func_address_lookup(mod_map, addr, size, off, sym); @@ -6262,7 +6262,7 @@ __ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip, /* * Some of the ops may be dynamically allocated, - * they must be freed after a synchronize_sched(). + * they must be freed after a synchronize_rcu(). */ preempt_disable_notrace(); @@ -6433,7 +6433,7 @@ static void clear_ftrace_pids(struct trace_array *tr) rcu_assign_pointer(tr->function_pids, NULL); /* Wait till all users are no longer using pid filtering */ - synchronize_sched(); + synchronize_rcu(); trace_free_pid_list(pid_list); } @@ -6580,7 +6580,7 @@ ftrace_pid_write(struct file *filp, const char __user *ubuf, rcu_assign_pointer(tr->function_pids, pid_list); if (filtered_pids) { - synchronize_sched(); + synchronize_rcu(); trace_free_pid_list(filtered_pids); } else if (pid_list) { /* Register a probe to set whether to ignore the tracing of a task */ diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 65bd4616220d..4f3247a53259 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -1834,7 +1834,7 @@ int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size, * There could have been a race between checking * record_disable and incrementing it. */ - synchronize_sched(); + synchronize_rcu(); for_each_buffer_cpu(buffer, cpu) { cpu_buffer = buffer->buffers[cpu]; rb_check_pages(cpu_buffer); @@ -3151,7 +3151,7 @@ static bool rb_per_cpu_empty(struct ring_buffer_per_cpu *cpu_buffer) * This prevents all writes to the buffer. Any attempt to write * to the buffer after this will fail and return NULL. * - * The caller should call synchronize_sched() after this. + * The caller should call synchronize_rcu() after this. */ void ring_buffer_record_disable(struct ring_buffer *buffer) { @@ -3253,7 +3253,7 @@ bool ring_buffer_record_is_set_on(struct ring_buffer *buffer) * This prevents all writes to the buffer. Any attempt to write * to the buffer after this will fail and return NULL. * - * The caller should call synchronize_sched() after this. + * The caller should call synchronize_rcu() after this. */ void ring_buffer_record_disable_cpu(struct ring_buffer *buffer, int cpu) { @@ -4191,7 +4191,7 @@ EXPORT_SYMBOL_GPL(ring_buffer_read_prepare); void ring_buffer_read_prepare_sync(void) { - synchronize_sched(); + synchronize_rcu(); } EXPORT_SYMBOL_GPL(ring_buffer_read_prepare_sync); @@ -4363,7 +4363,7 @@ void ring_buffer_reset_cpu(struct ring_buffer *buffer, int cpu) atomic_inc(&cpu_buffer->record_disabled); /* Make sure all commits have finished */ - synchronize_sched(); + synchronize_rcu(); raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags); @@ -4496,7 +4496,7 @@ int ring_buffer_swap_cpu(struct ring_buffer *buffer_a, goto out; /* - * We can't do a synchronize_sched here because this + * We can't do a synchronize_rcu here because this * function can be called in atomic context. * Normally this will be called from the same CPU as cpu. * If not it's up to the caller to protect this. diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index ff1c4b20cd0a..51612b4a603f 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -1681,7 +1681,7 @@ void tracing_reset(struct trace_buffer *buf, int cpu) ring_buffer_record_disable(buffer); /* Make sure all commits have finished */ - synchronize_sched(); + synchronize_rcu(); ring_buffer_reset_cpu(buffer, cpu); ring_buffer_record_enable(buffer); @@ -1698,7 +1698,7 @@ void tracing_reset_online_cpus(struct trace_buffer *buf) ring_buffer_record_disable(buffer); /* Make sure all commits have finished */ - synchronize_sched(); + synchronize_rcu(); buf->time_start = buffer_ftrace_now(buf, buf->cpu); @@ -2250,7 +2250,7 @@ void trace_buffered_event_disable(void) preempt_enable(); /* Wait for all current users to finish */ - synchronize_sched(); + synchronize_rcu(); for_each_tracing_cpu(cpu) { free_page((unsigned long)per_cpu(trace_buffered_event, cpu)); @@ -5398,7 +5398,7 @@ static int tracing_set_tracer(struct trace_array *tr, const char *buf) if (tr->current_trace->reset) tr->current_trace->reset(tr); - /* Current trace needs to be nop_trace before synchronize_sched */ + /* Current trace needs to be nop_trace before synchronize_rcu */ tr->current_trace = &nop_trace; #ifdef CONFIG_TRACER_MAX_TRACE @@ -5412,7 +5412,7 @@ static int tracing_set_tracer(struct trace_array *tr, const char *buf) * The update_max_tr is called from interrupts disabled * so a synchronized_sched() is sufficient. */ - synchronize_sched(); + synchronize_rcu(); free_snapshot(tr); } #endif diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c index 84a65173b1e9..35f3aa55be85 100644 --- a/kernel/trace/trace_events_filter.c +++ b/kernel/trace/trace_events_filter.c @@ -1614,7 +1614,7 @@ static int process_system_preds(struct trace_subsystem_dir *dir, /* * The calls can still be using the old filters. - * Do a synchronize_sched() and to ensure all calls are + * Do a synchronize_rcu() and to ensure all calls are * done with them before we free them. */ tracepoint_synchronize_unregister(); @@ -1845,7 +1845,7 @@ int apply_subsystem_event_filter(struct trace_subsystem_dir *dir, if (filter) { /* * No event actually uses the system filter - * we can free it without synchronize_sched(). + * we can free it without synchronize_rcu(). */ __free_filter(system->filter); system->filter = filter; diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c index fec67188c4d2..adc153ab51c0 100644 --- a/kernel/trace/trace_kprobe.c +++ b/kernel/trace/trace_kprobe.c @@ -333,7 +333,7 @@ disable_trace_kprobe(struct trace_kprobe *tk, struct trace_event_file *file) * event_call related objects, which will be accessed in * the kprobe_trace_func/kretprobe_trace_func. */ - synchronize_sched(); + synchronize_rcu(); kfree(link); /* Ignored if link == NULL */ } diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c index a3be42304485..46f2ab1e08a9 100644 --- a/kernel/tracepoint.c +++ b/kernel/tracepoint.c @@ -92,7 +92,7 @@ static __init int release_early_probes(void) while (early_probes) { tmp = early_probes; early_probes = tmp->next; - call_rcu_sched(tmp, rcu_free_old_probes); + call_rcu(tmp, rcu_free_old_probes); } return 0; @@ -123,7 +123,7 @@ static inline void release_probes(struct tracepoint_func *old) * cover both cases. So let us chain the SRCU and sched RCU * callbacks to wait for both grace periods. */ - call_rcu_sched(&tp_probes->rcu, rcu_free_old_probes); + call_rcu(&tp_probes->rcu, rcu_free_old_probes); } } -- cgit v1.3-7-g2ca7 From ae8b7ce7647bc5c56a9a9fc32b03e1cd0ae49629 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 6 Nov 2018 19:04:39 -0800 Subject: kprobes: Replace synchronize_sched() with synchronize_rcu() Now that synchronize_rcu() waits for preempt-disable regions of code as well as RCU read-side critical sections, synchronize_sched() can be replaced by synchronize_rcu(). This commit therefore makes this change. Signed-off-by: Paul E. McKenney Cc: "Naveen N. Rao" Cc: Anil S Keshavamurthy Cc: "David S. Miller" Acked-by: Masami Hiramatsu --- kernel/kprobes.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/kernel/kprobes.c b/kernel/kprobes.c index 90e98e233647..08e31d863191 100644 --- a/kernel/kprobes.c +++ b/kernel/kprobes.c @@ -229,7 +229,7 @@ static int collect_garbage_slots(struct kprobe_insn_cache *c) struct kprobe_insn_page *kip, *next; /* Ensure no-one is interrupted on the garbages */ - synchronize_sched(); + synchronize_rcu(); list_for_each_entry_safe(kip, next, &c->pages, list) { int i; @@ -1382,7 +1382,7 @@ out: if (ret) { ap->flags |= KPROBE_FLAG_DISABLED; list_del_rcu(&p->list); - synchronize_sched(); + synchronize_rcu(); } } } @@ -1597,7 +1597,7 @@ int register_kprobe(struct kprobe *p) ret = arm_kprobe(p); if (ret) { hlist_del_rcu(&p->hlist); - synchronize_sched(); + synchronize_rcu(); goto out; } } @@ -1776,7 +1776,7 @@ void unregister_kprobes(struct kprobe **kps, int num) kps[i]->addr = NULL; mutex_unlock(&kprobe_mutex); - synchronize_sched(); + synchronize_rcu(); for (i = 0; i < num; i++) if (kps[i]->addr) __unregister_kprobe_bottom(kps[i]); @@ -1966,7 +1966,7 @@ void unregister_kretprobes(struct kretprobe **rps, int num) rps[i]->kp.addr = NULL; mutex_unlock(&kprobe_mutex); - synchronize_sched(); + synchronize_rcu(); for (i = 0; i < num; i++) { if (rps[i]->kp.addr) { __unregister_kprobe_bottom(&rps[i]->kp); -- cgit v1.3-7-g2ca7 From 51959d85f32dde9041655936eef206cc3323dc12 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 6 Nov 2018 19:06:51 -0800 Subject: lockdep: Replace synchronize_sched() with synchronize_rcu() Now that synchronize_rcu() waits for preempt-disable regions of code as well as RCU read-side critical sections, synchronize_sched() can be replaced by synchronize_rcu(). This commit therefore makes this change. Signed-off-by: Paul E. McKenney Cc: Peter Zijlstra Cc: Ingo Molnar Cc: Will Deacon --- kernel/locking/lockdep.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 1efada2dd9dd..ef27f98714c0 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -4195,7 +4195,7 @@ void lockdep_free_key_range(void *start, unsigned long size) * * sync_sched() is sufficient because the read-side is IRQ disable. */ - synchronize_sched(); + synchronize_rcu(); /* * XXX at this point we could return the resources to the pool; -- cgit v1.3-7-g2ca7 From c9a863bbb1620dca3af57934cdbb002e852449fc Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 6 Nov 2018 19:09:14 -0800 Subject: sched/membarrier: synchronize_sched() with synchronize_rcu() Now that synchronize_rcu() waits for preempt-disable regions of code as well as RCU read-side critical sections, synchronize_sched() can be replaced by synchronize_rcu(). This commit therefore makes this change. Signed-off-by: Paul E. McKenney Cc: Mathieu Desnoyers Cc: Ingo Molnar Cc: Peter Zijlstra --- kernel/sched/membarrier.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c index 388a7a6c1aa2..3cd8a3a795d2 100644 --- a/kernel/sched/membarrier.c +++ b/kernel/sched/membarrier.c @@ -210,7 +210,7 @@ static int membarrier_register_global_expedited(void) * future scheduler executions will observe the new * thread flag state for this mm. */ - synchronize_sched(); + synchronize_rcu(); } atomic_or(MEMBARRIER_STATE_GLOBAL_EXPEDITED_READY, &mm->membarrier_state); @@ -246,7 +246,7 @@ static int membarrier_register_private_expedited(int flags) * Ensure all future scheduler executions will observe the * new thread flag state for this process. */ - synchronize_sched(); + synchronize_rcu(); } atomic_or(state, &mm->membarrier_state); -- cgit v1.3-7-g2ca7 From cb2f55369d3a9e6cf5c34d2da39eb242279a582d Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 6 Nov 2018 19:17:01 -0800 Subject: modules: Replace synchronize_sched() and call_rcu_sched() Now that synchronize_rcu() waits for preempt-disable regions of code as well as RCU read-side critical sections, synchronize_sched() can be replaced by synchronize_rcu(). Similarly, call_rcu_sched() can be replaced by call_rcu(). This commit therefore makes these changes. Signed-off-by: Paul E. McKenney Acked-by: Jessica Yu --- kernel/module.c | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) (limited to 'kernel') diff --git a/kernel/module.c b/kernel/module.c index 49a405891587..99b46c32d579 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -2159,7 +2159,7 @@ static void free_module(struct module *mod) /* Remove this module from bug list, this uses list_del_rcu */ module_bug_cleanup(mod); /* Wait for RCU-sched synchronizing before releasing mod->list and buglist. */ - synchronize_sched(); + synchronize_rcu(); mutex_unlock(&module_mutex); /* This may be empty, but that's OK */ @@ -3507,15 +3507,15 @@ static noinline int do_init_module(struct module *mod) /* * We want to free module_init, but be aware that kallsyms may be * walking this with preempt disabled. In all the failure paths, we - * call synchronize_sched(), but we don't want to slow down the success + * call synchronize_rcu(), but we don't want to slow down the success * path, so use actual RCU here. * Note that module_alloc() on most architectures creates W+X page * mappings which won't be cleaned up until do_free_init() runs. Any * code such as mark_rodata_ro() which depends on those mappings to * be cleaned up needs to sync with the queued work - ie - * rcu_barrier_sched() + * rcu_barrier() */ - call_rcu_sched(&freeinit->rcu, do_free_init); + call_rcu(&freeinit->rcu, do_free_init); mutex_unlock(&module_mutex); wake_up_all(&module_wq); @@ -3526,7 +3526,7 @@ fail_free_freeinit: fail: /* Try to protect us from buggy refcounters. */ mod->state = MODULE_STATE_GOING; - synchronize_sched(); + synchronize_rcu(); module_put(mod); blocking_notifier_call_chain(&module_notify_list, MODULE_STATE_GOING, mod); @@ -3819,7 +3819,7 @@ static int load_module(struct load_info *info, const char __user *uargs, ddebug_cleanup: ftrace_release_mod(mod); dynamic_debug_remove(mod, info->debug); - synchronize_sched(); + synchronize_rcu(); kfree(mod->args); free_arch_cleanup: module_arch_cleanup(mod); @@ -3834,7 +3834,7 @@ static int load_module(struct load_info *info, const char __user *uargs, mod_tree_remove(mod); wake_up_all(&module_wq); /* Wait for RCU-sched synchronizing before releasing mod->list. */ - synchronize_sched(); + synchronize_rcu(); mutex_unlock(&module_mutex); free_module: /* Free lock-classes; relies on the preceding sync_rcu() */ -- cgit v1.3-7-g2ca7 From 25b0077511fe7cf1b876174f8481fb1742f4fb4d Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 6 Nov 2018 19:18:45 -0800 Subject: workqueue: Replace call_rcu_sched() with call_rcu() Now that call_rcu()'s callback is not invoked until after all preempt-disable regions of code have completed (in addition to explicitly marked RCU read-side critical sections), call_rcu() can be used in place of call_rcu_sched(). This commit therefore makes that change. Signed-off-by: Paul E. McKenney Cc: Lai Jiangshan Acked-by: Tejun Heo --- kernel/workqueue.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 0280deac392e..392be4b252f6 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -3396,7 +3396,7 @@ static void put_unbound_pool(struct worker_pool *pool) del_timer_sync(&pool->mayday_timer); /* sched-RCU protected to allow dereferences from get_work_pool() */ - call_rcu_sched(&pool->rcu, rcu_free_pool); + call_rcu(&pool->rcu, rcu_free_pool); } /** @@ -3503,14 +3503,14 @@ static void pwq_unbound_release_workfn(struct work_struct *work) put_unbound_pool(pool); mutex_unlock(&wq_pool_mutex); - call_rcu_sched(&pwq->rcu, rcu_free_pwq); + call_rcu(&pwq->rcu, rcu_free_pwq); /* * If we're the last pwq going away, @wq is already dead and no one * is gonna access it anymore. Schedule RCU free. */ if (is_last) - call_rcu_sched(&wq->rcu, rcu_free_wq); + call_rcu(&wq->rcu, rcu_free_wq); } /** @@ -4195,7 +4195,7 @@ void destroy_workqueue(struct workqueue_struct *wq) * The base ref is never dropped on per-cpu pwqs. Directly * schedule RCU free. */ - call_rcu_sched(&wq->rcu, rcu_free_wq); + call_rcu(&wq->rcu, rcu_free_wq); } else { /* * We're the sole accessor of @wq at this point. Directly -- cgit v1.3-7-g2ca7 From 0809d95451f7d867d37cf2b526b8da923fd72891 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 6 Nov 2018 19:20:05 -0800 Subject: events: Replace synchronize_sched() with synchronize_rcu() Now that synchronize_rcu() waits for preempt-disable regions of code as well as RCU read-side critical sections, synchronize_sched() can be replaced by synchronize_rcu(). This commit therefore makes this change. Signed-off-by: Paul E. McKenney Cc: Peter Zijlstra Cc: Ingo Molnar Cc: Arnaldo Carvalho de Melo Cc: Alexander Shishkin Cc: Jiri Olsa Cc: Namhyung Kim --- kernel/events/core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/events/core.c b/kernel/events/core.c index 84530ab358c3..c4b90cf7734a 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -9918,7 +9918,7 @@ static void account_event(struct perf_event *event) * call the perf scheduling hooks before proceeding to * install events that need them. */ - synchronize_sched(); + synchronize_rcu(); } /* * Now that we have waited for the sync_sched(), allow further -- cgit v1.3-7-g2ca7 From eb4c2382272ae7ae5d81fdfa5b7a6c86146eaaa4 Mon Sep 17 00:00:00 2001 From: Dennis Krein Date: Fri, 26 Oct 2018 07:38:24 -0700 Subject: srcu: Lock srcu_data structure in srcu_gp_start() The srcu_gp_start() function is called with the srcu_struct structure's ->lock held, but not with the srcu_data structure's ->lock. This is problematic because this function accesses and updates the srcu_data structure's ->srcu_cblist, which is protected by that lock. Failing to hold this lock can result in corruption of the SRCU callback lists, which in turn can result in arbitrarily bad results. This commit therefore makes srcu_gp_start() acquire the srcu_data structure's ->lock across the calls to rcu_segcblist_advance() and rcu_segcblist_accelerate(), thus preventing this corruption. Reported-by: Bart Van Assche Reported-by: Christoph Hellwig Reported-by: Sebastian Kuzminsky Signed-off-by: Dennis Krein Signed-off-by: Paul E. McKenney Tested-by: Dennis Krein Cc: # 4.16.x --- kernel/rcu/srcutree.c | 2 ++ 1 file changed, 2 insertions(+) (limited to 'kernel') diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c index 60f3236beaf7..697a2d7e8e8a 100644 --- a/kernel/rcu/srcutree.c +++ b/kernel/rcu/srcutree.c @@ -451,10 +451,12 @@ static void srcu_gp_start(struct srcu_struct *sp) lockdep_assert_held(&ACCESS_PRIVATE(sp, lock)); WARN_ON_ONCE(ULONG_CMP_GE(sp->srcu_gp_seq, sp->srcu_gp_seq_needed)); + spin_lock_rcu_node(sdp); /* Interrupts already disabled. */ rcu_segcblist_advance(&sdp->srcu_cblist, rcu_seq_current(&sp->srcu_gp_seq)); (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, rcu_seq_snap(&sp->srcu_gp_seq)); + spin_unlock_rcu_node(sdp); /* Interrupts remain disabled. */ smp_mb(); /* Order prior store to ->srcu_gp_seq_needed vs. GP start. */ rcu_seq_start(&sp->srcu_gp_seq); state = rcu_seq_state(READ_ONCE(sp->srcu_gp_seq)); -- cgit v1.3-7-g2ca7 From aacb5d91ab1bfbb0e8123da59a2e333d52ba7f60 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Sun, 28 Oct 2018 10:32:51 -0700 Subject: srcu: Use "ssp" instead of "sp" for srcu_struct pointer In RCU, the distinction between "rsp", "rnp", and "rdp" has served well for a great many years, but in SRCU, "sp" vs. "sdp" has proven confusing. This commit therefore renames SRCU's "sp" pointers to "ssp", so that there is "ssp" for srcu_struct pointer, "snp" for srcu_node pointer, and "sdp" for srcu_data pointer. Signed-off-by: Paul E. McKenney --- include/linux/srcu.h | 78 ++++---- include/linux/srcutiny.h | 24 +-- include/linux/srcutree.h | 8 +- kernel/rcu/srcutiny.c | 120 ++++++------ kernel/rcu/srcutree.c | 488 +++++++++++++++++++++++------------------------ 5 files changed, 359 insertions(+), 359 deletions(-) (limited to 'kernel') diff --git a/include/linux/srcu.h b/include/linux/srcu.h index ebd5f1511690..c614375cd264 100644 --- a/include/linux/srcu.h +++ b/include/linux/srcu.h @@ -38,20 +38,20 @@ struct srcu_struct; #ifdef CONFIG_DEBUG_LOCK_ALLOC -int __init_srcu_struct(struct srcu_struct *sp, const char *name, +int __init_srcu_struct(struct srcu_struct *ssp, const char *name, struct lock_class_key *key); -#define init_srcu_struct(sp) \ +#define init_srcu_struct(ssp) \ ({ \ static struct lock_class_key __srcu_key; \ \ - __init_srcu_struct((sp), #sp, &__srcu_key); \ + __init_srcu_struct((ssp), #ssp, &__srcu_key); \ }) #define __SRCU_DEP_MAP_INIT(srcu_name) .dep_map = { .name = #srcu_name }, #else /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */ -int init_srcu_struct(struct srcu_struct *sp); +int init_srcu_struct(struct srcu_struct *ssp); #define __SRCU_DEP_MAP_INIT(srcu_name) #endif /* #else #ifdef CONFIG_DEBUG_LOCK_ALLOC */ @@ -67,28 +67,28 @@ int init_srcu_struct(struct srcu_struct *sp); struct srcu_struct { }; #endif -void call_srcu(struct srcu_struct *sp, struct rcu_head *head, +void call_srcu(struct srcu_struct *ssp, struct rcu_head *head, void (*func)(struct rcu_head *head)); -void _cleanup_srcu_struct(struct srcu_struct *sp, bool quiesced); -int __srcu_read_lock(struct srcu_struct *sp) __acquires(sp); -void __srcu_read_unlock(struct srcu_struct *sp, int idx) __releases(sp); -void synchronize_srcu(struct srcu_struct *sp); +void _cleanup_srcu_struct(struct srcu_struct *ssp, bool quiesced); +int __srcu_read_lock(struct srcu_struct *ssp) __acquires(ssp); +void __srcu_read_unlock(struct srcu_struct *ssp, int idx) __releases(ssp); +void synchronize_srcu(struct srcu_struct *ssp); /** * cleanup_srcu_struct - deconstruct a sleep-RCU structure - * @sp: structure to clean up. + * @ssp: structure to clean up. * * Must invoke this after you are finished using a given srcu_struct that * was initialized via init_srcu_struct(), else you leak memory. */ -static inline void cleanup_srcu_struct(struct srcu_struct *sp) +static inline void cleanup_srcu_struct(struct srcu_struct *ssp) { - _cleanup_srcu_struct(sp, false); + _cleanup_srcu_struct(ssp, false); } /** * cleanup_srcu_struct_quiesced - deconstruct a quiesced sleep-RCU structure - * @sp: structure to clean up. + * @ssp: structure to clean up. * * Must invoke this after you are finished using a given srcu_struct that * was initialized via init_srcu_struct(), else you leak memory. Also, @@ -103,16 +103,16 @@ static inline void cleanup_srcu_struct(struct srcu_struct *sp) * (with high probability, anyway), and will also cause the srcu_struct * to be leaked. */ -static inline void cleanup_srcu_struct_quiesced(struct srcu_struct *sp) +static inline void cleanup_srcu_struct_quiesced(struct srcu_struct *ssp) { - _cleanup_srcu_struct(sp, true); + _cleanup_srcu_struct(ssp, true); } #ifdef CONFIG_DEBUG_LOCK_ALLOC /** * srcu_read_lock_held - might we be in SRCU read-side critical section? - * @sp: The srcu_struct structure to check + * @ssp: The srcu_struct structure to check * * If CONFIG_DEBUG_LOCK_ALLOC is selected, returns nonzero iff in an SRCU * read-side critical section. In absence of CONFIG_DEBUG_LOCK_ALLOC, @@ -126,16 +126,16 @@ static inline void cleanup_srcu_struct_quiesced(struct srcu_struct *sp) * relies on normal RCU, it can be called from the CPU which * is in the idle loop from an RCU point of view or offline. */ -static inline int srcu_read_lock_held(const struct srcu_struct *sp) +static inline int srcu_read_lock_held(const struct srcu_struct *ssp) { if (!debug_lockdep_rcu_enabled()) return 1; - return lock_is_held(&sp->dep_map); + return lock_is_held(&ssp->dep_map); } #else /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */ -static inline int srcu_read_lock_held(const struct srcu_struct *sp) +static inline int srcu_read_lock_held(const struct srcu_struct *ssp) { return 1; } @@ -145,7 +145,7 @@ static inline int srcu_read_lock_held(const struct srcu_struct *sp) /** * srcu_dereference_check - fetch SRCU-protected pointer for later dereferencing * @p: the pointer to fetch and protect for later dereferencing - * @sp: pointer to the srcu_struct, which is used to check that we + * @ssp: pointer to the srcu_struct, which is used to check that we * really are in an SRCU read-side critical section. * @c: condition to check for update-side use * @@ -154,32 +154,32 @@ static inline int srcu_read_lock_held(const struct srcu_struct *sp) * to 1. The @c argument will normally be a logical expression containing * lockdep_is_held() calls. */ -#define srcu_dereference_check(p, sp, c) \ - __rcu_dereference_check((p), (c) || srcu_read_lock_held(sp), __rcu) +#define srcu_dereference_check(p, ssp, c) \ + __rcu_dereference_check((p), (c) || srcu_read_lock_held(ssp), __rcu) /** * srcu_dereference - fetch SRCU-protected pointer for later dereferencing * @p: the pointer to fetch and protect for later dereferencing - * @sp: pointer to the srcu_struct, which is used to check that we + * @ssp: pointer to the srcu_struct, which is used to check that we * really are in an SRCU read-side critical section. * * Makes rcu_dereference_check() do the dirty work. If PROVE_RCU * is enabled, invoking this outside of an RCU read-side critical * section will result in an RCU-lockdep splat. */ -#define srcu_dereference(p, sp) srcu_dereference_check((p), (sp), 0) +#define srcu_dereference(p, ssp) srcu_dereference_check((p), (ssp), 0) /** * srcu_dereference_notrace - no tracing and no lockdep calls from here * @p: the pointer to fetch and protect for later dereferencing - * @sp: pointer to the srcu_struct, which is used to check that we + * @ssp: pointer to the srcu_struct, which is used to check that we * really are in an SRCU read-side critical section. */ -#define srcu_dereference_notrace(p, sp) srcu_dereference_check((p), (sp), 1) +#define srcu_dereference_notrace(p, ssp) srcu_dereference_check((p), (ssp), 1) /** * srcu_read_lock - register a new reader for an SRCU-protected structure. - * @sp: srcu_struct in which to register the new reader. + * @ssp: srcu_struct in which to register the new reader. * * Enter an SRCU read-side critical section. Note that SRCU read-side * critical sections may be nested. However, it is illegal to @@ -194,44 +194,44 @@ static inline int srcu_read_lock_held(const struct srcu_struct *sp) * srcu_read_unlock() in an irq handler if the matching srcu_read_lock() * was invoked in process context. */ -static inline int srcu_read_lock(struct srcu_struct *sp) __acquires(sp) +static inline int srcu_read_lock(struct srcu_struct *ssp) __acquires(ssp) { int retval; - retval = __srcu_read_lock(sp); - rcu_lock_acquire(&(sp)->dep_map); + retval = __srcu_read_lock(ssp); + rcu_lock_acquire(&(ssp)->dep_map); return retval; } /* Used by tracing, cannot be traced and cannot invoke lockdep. */ static inline notrace int -srcu_read_lock_notrace(struct srcu_struct *sp) __acquires(sp) +srcu_read_lock_notrace(struct srcu_struct *ssp) __acquires(ssp) { int retval; - retval = __srcu_read_lock(sp); + retval = __srcu_read_lock(ssp); return retval; } /** * srcu_read_unlock - unregister a old reader from an SRCU-protected structure. - * @sp: srcu_struct in which to unregister the old reader. + * @ssp: srcu_struct in which to unregister the old reader. * @idx: return value from corresponding srcu_read_lock(). * * Exit an SRCU read-side critical section. */ -static inline void srcu_read_unlock(struct srcu_struct *sp, int idx) - __releases(sp) +static inline void srcu_read_unlock(struct srcu_struct *ssp, int idx) + __releases(ssp) { - rcu_lock_release(&(sp)->dep_map); - __srcu_read_unlock(sp, idx); + rcu_lock_release(&(ssp)->dep_map); + __srcu_read_unlock(ssp, idx); } /* Used by tracing, cannot be traced and cannot call lockdep. */ static inline notrace void -srcu_read_unlock_notrace(struct srcu_struct *sp, int idx) __releases(sp) +srcu_read_unlock_notrace(struct srcu_struct *ssp, int idx) __releases(ssp) { - __srcu_read_unlock(sp, idx); + __srcu_read_unlock(ssp, idx); } /** diff --git a/include/linux/srcutiny.h b/include/linux/srcutiny.h index f41d2fb09f87..b19216aaaef2 100644 --- a/include/linux/srcutiny.h +++ b/include/linux/srcutiny.h @@ -60,7 +60,7 @@ void srcu_drive_gp(struct work_struct *wp); #define DEFINE_STATIC_SRCU(name) \ static struct srcu_struct name = __SRCU_STRUCT_INIT(name, name) -void synchronize_srcu(struct srcu_struct *sp); +void synchronize_srcu(struct srcu_struct *ssp); /* * Counts the new reader in the appropriate per-CPU element of the @@ -68,36 +68,36 @@ void synchronize_srcu(struct srcu_struct *sp); * __srcu_read_unlock() must be in the same handler instance. Returns an * index that must be passed to the matching srcu_read_unlock(). */ -static inline int __srcu_read_lock(struct srcu_struct *sp) +static inline int __srcu_read_lock(struct srcu_struct *ssp) { int idx; - idx = READ_ONCE(sp->srcu_idx); - WRITE_ONCE(sp->srcu_lock_nesting[idx], sp->srcu_lock_nesting[idx] + 1); + idx = READ_ONCE(ssp->srcu_idx); + WRITE_ONCE(ssp->srcu_lock_nesting[idx], ssp->srcu_lock_nesting[idx] + 1); return idx; } -static inline void synchronize_srcu_expedited(struct srcu_struct *sp) +static inline void synchronize_srcu_expedited(struct srcu_struct *ssp) { - synchronize_srcu(sp); + synchronize_srcu(ssp); } -static inline void srcu_barrier(struct srcu_struct *sp) +static inline void srcu_barrier(struct srcu_struct *ssp) { - synchronize_srcu(sp); + synchronize_srcu(ssp); } /* Defined here to avoid size increase for non-torture kernels. */ -static inline void srcu_torture_stats_print(struct srcu_struct *sp, +static inline void srcu_torture_stats_print(struct srcu_struct *ssp, char *tt, char *tf) { int idx; - idx = READ_ONCE(sp->srcu_idx) & 0x1; + idx = READ_ONCE(ssp->srcu_idx) & 0x1; pr_alert("%s%s Tiny SRCU per-CPU(idx=%d): (%hd,%hd)\n", tt, tf, idx, - READ_ONCE(sp->srcu_lock_nesting[!idx]), - READ_ONCE(sp->srcu_lock_nesting[idx])); + READ_ONCE(ssp->srcu_lock_nesting[!idx]), + READ_ONCE(ssp->srcu_lock_nesting[idx])); } #endif diff --git a/include/linux/srcutree.h b/include/linux/srcutree.h index 0ae91b3a7406..6f292bd3e7db 100644 --- a/include/linux/srcutree.h +++ b/include/linux/srcutree.h @@ -51,7 +51,7 @@ struct srcu_data { unsigned long grpmask; /* Mask for leaf srcu_node */ /* ->srcu_data_have_cbs[]. */ int cpu; - struct srcu_struct *sp; + struct srcu_struct *ssp; }; /* @@ -138,8 +138,8 @@ struct srcu_struct { #define DEFINE_SRCU(name) __DEFINE_SRCU(name, /* not static */) #define DEFINE_STATIC_SRCU(name) __DEFINE_SRCU(name, static) -void synchronize_srcu_expedited(struct srcu_struct *sp); -void srcu_barrier(struct srcu_struct *sp); -void srcu_torture_stats_print(struct srcu_struct *sp, char *tt, char *tf); +void synchronize_srcu_expedited(struct srcu_struct *ssp); +void srcu_barrier(struct srcu_struct *ssp); +void srcu_torture_stats_print(struct srcu_struct *ssp, char *tt, char *tf); #endif diff --git a/kernel/rcu/srcutiny.c b/kernel/rcu/srcutiny.c index b46e6683f8c9..32dfd6522548 100644 --- a/kernel/rcu/srcutiny.c +++ b/kernel/rcu/srcutiny.c @@ -37,30 +37,30 @@ int rcu_scheduler_active __read_mostly; static LIST_HEAD(srcu_boot_list); static bool srcu_init_done; -static int init_srcu_struct_fields(struct srcu_struct *sp) +static int init_srcu_struct_fields(struct srcu_struct *ssp) { - sp->srcu_lock_nesting[0] = 0; - sp->srcu_lock_nesting[1] = 0; - init_swait_queue_head(&sp->srcu_wq); - sp->srcu_cb_head = NULL; - sp->srcu_cb_tail = &sp->srcu_cb_head; - sp->srcu_gp_running = false; - sp->srcu_gp_waiting = false; - sp->srcu_idx = 0; - INIT_WORK(&sp->srcu_work, srcu_drive_gp); - INIT_LIST_HEAD(&sp->srcu_work.entry); + ssp->srcu_lock_nesting[0] = 0; + ssp->srcu_lock_nesting[1] = 0; + init_swait_queue_head(&ssp->srcu_wq); + ssp->srcu_cb_head = NULL; + ssp->srcu_cb_tail = &ssp->srcu_cb_head; + ssp->srcu_gp_running = false; + ssp->srcu_gp_waiting = false; + ssp->srcu_idx = 0; + INIT_WORK(&ssp->srcu_work, srcu_drive_gp); + INIT_LIST_HEAD(&ssp->srcu_work.entry); return 0; } #ifdef CONFIG_DEBUG_LOCK_ALLOC -int __init_srcu_struct(struct srcu_struct *sp, const char *name, +int __init_srcu_struct(struct srcu_struct *ssp, const char *name, struct lock_class_key *key) { /* Don't re-initialize a lock while it is held. */ - debug_check_no_locks_freed((void *)sp, sizeof(*sp)); - lockdep_init_map(&sp->dep_map, name, key, 0); - return init_srcu_struct_fields(sp); + debug_check_no_locks_freed((void *)ssp, sizeof(*ssp)); + lockdep_init_map(&ssp->dep_map, name, key, 0); + return init_srcu_struct_fields(ssp); } EXPORT_SYMBOL_GPL(__init_srcu_struct); @@ -68,15 +68,15 @@ EXPORT_SYMBOL_GPL(__init_srcu_struct); /* * init_srcu_struct - initialize a sleep-RCU structure - * @sp: structure to initialize. + * @ssp: structure to initialize. * * Must invoke this on a given srcu_struct before passing that srcu_struct * to any other function. Each srcu_struct represents a separate domain * of SRCU protection. */ -int init_srcu_struct(struct srcu_struct *sp) +int init_srcu_struct(struct srcu_struct *ssp) { - return init_srcu_struct_fields(sp); + return init_srcu_struct_fields(ssp); } EXPORT_SYMBOL_GPL(init_srcu_struct); @@ -84,22 +84,22 @@ EXPORT_SYMBOL_GPL(init_srcu_struct); /* * cleanup_srcu_struct - deconstruct a sleep-RCU structure - * @sp: structure to clean up. + * @ssp: structure to clean up. * * Must invoke this after you are finished using a given srcu_struct that * was initialized via init_srcu_struct(), else you leak memory. */ -void _cleanup_srcu_struct(struct srcu_struct *sp, bool quiesced) +void _cleanup_srcu_struct(struct srcu_struct *ssp, bool quiesced) { - WARN_ON(sp->srcu_lock_nesting[0] || sp->srcu_lock_nesting[1]); + WARN_ON(ssp->srcu_lock_nesting[0] || ssp->srcu_lock_nesting[1]); if (quiesced) - WARN_ON(work_pending(&sp->srcu_work)); + WARN_ON(work_pending(&ssp->srcu_work)); else - flush_work(&sp->srcu_work); - WARN_ON(sp->srcu_gp_running); - WARN_ON(sp->srcu_gp_waiting); - WARN_ON(sp->srcu_cb_head); - WARN_ON(&sp->srcu_cb_head != sp->srcu_cb_tail); + flush_work(&ssp->srcu_work); + WARN_ON(ssp->srcu_gp_running); + WARN_ON(ssp->srcu_gp_waiting); + WARN_ON(ssp->srcu_cb_head); + WARN_ON(&ssp->srcu_cb_head != ssp->srcu_cb_tail); } EXPORT_SYMBOL_GPL(_cleanup_srcu_struct); @@ -107,13 +107,13 @@ EXPORT_SYMBOL_GPL(_cleanup_srcu_struct); * Removes the count for the old reader from the appropriate element of * the srcu_struct. */ -void __srcu_read_unlock(struct srcu_struct *sp, int idx) +void __srcu_read_unlock(struct srcu_struct *ssp, int idx) { - int newval = sp->srcu_lock_nesting[idx] - 1; + int newval = ssp->srcu_lock_nesting[idx] - 1; - WRITE_ONCE(sp->srcu_lock_nesting[idx], newval); - if (!newval && READ_ONCE(sp->srcu_gp_waiting)) - swake_up_one(&sp->srcu_wq); + WRITE_ONCE(ssp->srcu_lock_nesting[idx], newval); + if (!newval && READ_ONCE(ssp->srcu_gp_waiting)) + swake_up_one(&ssp->srcu_wq); } EXPORT_SYMBOL_GPL(__srcu_read_unlock); @@ -127,24 +127,24 @@ void srcu_drive_gp(struct work_struct *wp) int idx; struct rcu_head *lh; struct rcu_head *rhp; - struct srcu_struct *sp; + struct srcu_struct *ssp; - sp = container_of(wp, struct srcu_struct, srcu_work); - if (sp->srcu_gp_running || !READ_ONCE(sp->srcu_cb_head)) + ssp = container_of(wp, struct srcu_struct, srcu_work); + if (ssp->srcu_gp_running || !READ_ONCE(ssp->srcu_cb_head)) return; /* Already running or nothing to do. */ /* Remove recently arrived callbacks and wait for readers. */ - WRITE_ONCE(sp->srcu_gp_running, true); + WRITE_ONCE(ssp->srcu_gp_running, true); local_irq_disable(); - lh = sp->srcu_cb_head; - sp->srcu_cb_head = NULL; - sp->srcu_cb_tail = &sp->srcu_cb_head; + lh = ssp->srcu_cb_head; + ssp->srcu_cb_head = NULL; + ssp->srcu_cb_tail = &ssp->srcu_cb_head; local_irq_enable(); - idx = sp->srcu_idx; - WRITE_ONCE(sp->srcu_idx, !sp->srcu_idx); - WRITE_ONCE(sp->srcu_gp_waiting, true); /* srcu_read_unlock() wakes! */ - swait_event_exclusive(sp->srcu_wq, !READ_ONCE(sp->srcu_lock_nesting[idx])); - WRITE_ONCE(sp->srcu_gp_waiting, false); /* srcu_read_unlock() cheap. */ + idx = ssp->srcu_idx; + WRITE_ONCE(ssp->srcu_idx, !ssp->srcu_idx); + WRITE_ONCE(ssp->srcu_gp_waiting, true); /* srcu_read_unlock() wakes! */ + swait_event_exclusive(ssp->srcu_wq, !READ_ONCE(ssp->srcu_lock_nesting[idx])); + WRITE_ONCE(ssp->srcu_gp_waiting, false); /* srcu_read_unlock() cheap. */ /* Invoke the callbacks we removed above. */ while (lh) { @@ -161,9 +161,9 @@ void srcu_drive_gp(struct work_struct *wp) * at interrupt level, but the ->srcu_gp_running checks will * straighten that out. */ - WRITE_ONCE(sp->srcu_gp_running, false); - if (READ_ONCE(sp->srcu_cb_head)) - schedule_work(&sp->srcu_work); + WRITE_ONCE(ssp->srcu_gp_running, false); + if (READ_ONCE(ssp->srcu_cb_head)) + schedule_work(&ssp->srcu_work); } EXPORT_SYMBOL_GPL(srcu_drive_gp); @@ -171,7 +171,7 @@ EXPORT_SYMBOL_GPL(srcu_drive_gp); * Enqueue an SRCU callback on the specified srcu_struct structure, * initiating grace-period processing if it is not already running. */ -void call_srcu(struct srcu_struct *sp, struct rcu_head *rhp, +void call_srcu(struct srcu_struct *ssp, struct rcu_head *rhp, rcu_callback_t func) { unsigned long flags; @@ -179,14 +179,14 @@ void call_srcu(struct srcu_struct *sp, struct rcu_head *rhp, rhp->func = func; rhp->next = NULL; local_irq_save(flags); - *sp->srcu_cb_tail = rhp; - sp->srcu_cb_tail = &rhp->next; + *ssp->srcu_cb_tail = rhp; + ssp->srcu_cb_tail = &rhp->next; local_irq_restore(flags); - if (!READ_ONCE(sp->srcu_gp_running)) { + if (!READ_ONCE(ssp->srcu_gp_running)) { if (likely(srcu_init_done)) - schedule_work(&sp->srcu_work); - else if (list_empty(&sp->srcu_work.entry)) - list_add(&sp->srcu_work.entry, &srcu_boot_list); + schedule_work(&ssp->srcu_work); + else if (list_empty(&ssp->srcu_work.entry)) + list_add(&ssp->srcu_work.entry, &srcu_boot_list); } } EXPORT_SYMBOL_GPL(call_srcu); @@ -194,13 +194,13 @@ EXPORT_SYMBOL_GPL(call_srcu); /* * synchronize_srcu - wait for prior SRCU read-side critical-section completion */ -void synchronize_srcu(struct srcu_struct *sp) +void synchronize_srcu(struct srcu_struct *ssp) { struct rcu_synchronize rs; init_rcu_head_on_stack(&rs.head); init_completion(&rs.completion); - call_srcu(sp, &rs.head, wakeme_after_rcu); + call_srcu(ssp, &rs.head, wakeme_after_rcu); wait_for_completion(&rs.completion); destroy_rcu_head_on_stack(&rs.head); } @@ -219,13 +219,13 @@ void __init rcu_scheduler_starting(void) */ void __init srcu_init(void) { - struct srcu_struct *sp; + struct srcu_struct *ssp; srcu_init_done = true; while (!list_empty(&srcu_boot_list)) { - sp = list_first_entry(&srcu_boot_list, + ssp = list_first_entry(&srcu_boot_list, struct srcu_struct, srcu_work.entry); - list_del_init(&sp->srcu_work.entry); - schedule_work(&sp->srcu_work); + list_del_init(&ssp->srcu_work.entry); + schedule_work(&ssp->srcu_work); } } diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c index 697a2d7e8e8a..3600d88d8956 100644 --- a/kernel/rcu/srcutree.c +++ b/kernel/rcu/srcutree.c @@ -56,7 +56,7 @@ static LIST_HEAD(srcu_boot_list); static bool __read_mostly srcu_init_done; static void srcu_invoke_callbacks(struct work_struct *work); -static void srcu_reschedule(struct srcu_struct *sp, unsigned long delay); +static void srcu_reschedule(struct srcu_struct *ssp, unsigned long delay); static void process_srcu(struct work_struct *work); /* Wrappers for lock acquisition and release, see raw_spin_lock_rcu_node(). */ @@ -92,7 +92,7 @@ do { \ * srcu_read_unlock() running against them. So if the is_static parameter * is set, don't initialize ->srcu_lock_count[] and ->srcu_unlock_count[]. */ -static void init_srcu_struct_nodes(struct srcu_struct *sp, bool is_static) +static void init_srcu_struct_nodes(struct srcu_struct *ssp, bool is_static) { int cpu; int i; @@ -103,13 +103,13 @@ static void init_srcu_struct_nodes(struct srcu_struct *sp, bool is_static) struct srcu_node *snp_first; /* Work out the overall tree geometry. */ - sp->level[0] = &sp->node[0]; + ssp->level[0] = &ssp->node[0]; for (i = 1; i < rcu_num_lvls; i++) - sp->level[i] = sp->level[i - 1] + num_rcu_lvl[i - 1]; + ssp->level[i] = ssp->level[i - 1] + num_rcu_lvl[i - 1]; rcu_init_levelspread(levelspread, num_rcu_lvl); /* Each pass through this loop initializes one srcu_node structure. */ - srcu_for_each_node_breadth_first(sp, snp) { + srcu_for_each_node_breadth_first(ssp, snp) { spin_lock_init(&ACCESS_PRIVATE(snp, lock)); WARN_ON_ONCE(ARRAY_SIZE(snp->srcu_have_cbs) != ARRAY_SIZE(snp->srcu_data_have_cbs)); @@ -120,17 +120,17 @@ static void init_srcu_struct_nodes(struct srcu_struct *sp, bool is_static) snp->srcu_gp_seq_needed_exp = 0; snp->grplo = -1; snp->grphi = -1; - if (snp == &sp->node[0]) { + if (snp == &ssp->node[0]) { /* Root node, special case. */ snp->srcu_parent = NULL; continue; } /* Non-root node. */ - if (snp == sp->level[level + 1]) + if (snp == ssp->level[level + 1]) level++; - snp->srcu_parent = sp->level[level - 1] + - (snp - sp->level[level]) / + snp->srcu_parent = ssp->level[level - 1] + + (snp - ssp->level[level]) / levelspread[level - 1]; } @@ -141,14 +141,14 @@ static void init_srcu_struct_nodes(struct srcu_struct *sp, bool is_static) WARN_ON_ONCE(ARRAY_SIZE(sdp->srcu_lock_count) != ARRAY_SIZE(sdp->srcu_unlock_count)); level = rcu_num_lvls - 1; - snp_first = sp->level[level]; + snp_first = ssp->level[level]; for_each_possible_cpu(cpu) { - sdp = per_cpu_ptr(sp->sda, cpu); + sdp = per_cpu_ptr(ssp->sda, cpu); spin_lock_init(&ACCESS_PRIVATE(sdp, lock)); rcu_segcblist_init(&sdp->srcu_cblist); sdp->srcu_cblist_invoking = false; - sdp->srcu_gp_seq_needed = sp->srcu_gp_seq; - sdp->srcu_gp_seq_needed_exp = sp->srcu_gp_seq; + sdp->srcu_gp_seq_needed = ssp->srcu_gp_seq; + sdp->srcu_gp_seq_needed_exp = ssp->srcu_gp_seq; sdp->mynode = &snp_first[cpu / levelspread[level]]; for (snp = sdp->mynode; snp != NULL; snp = snp->srcu_parent) { if (snp->grplo < 0) @@ -157,7 +157,7 @@ static void init_srcu_struct_nodes(struct srcu_struct *sp, bool is_static) } sdp->cpu = cpu; INIT_DELAYED_WORK(&sdp->work, srcu_invoke_callbacks); - sdp->sp = sp; + sdp->ssp = ssp; sdp->grpmask = 1 << (cpu - sdp->mynode->grplo); if (is_static) continue; @@ -176,35 +176,35 @@ static void init_srcu_struct_nodes(struct srcu_struct *sp, bool is_static) * parameter is passed through to init_srcu_struct_nodes(), and * also tells us that ->sda has already been wired up to srcu_data. */ -static int init_srcu_struct_fields(struct srcu_struct *sp, bool is_static) +static int init_srcu_struct_fields(struct srcu_struct *ssp, bool is_static) { - mutex_init(&sp->srcu_cb_mutex); - mutex_init(&sp->srcu_gp_mutex); - sp->srcu_idx = 0; - sp->srcu_gp_seq = 0; - sp->srcu_barrier_seq = 0; - mutex_init(&sp->srcu_barrier_mutex); - atomic_set(&sp->srcu_barrier_cpu_cnt, 0); - INIT_DELAYED_WORK(&sp->work, process_srcu); + mutex_init(&ssp->srcu_cb_mutex); + mutex_init(&ssp->srcu_gp_mutex); + ssp->srcu_idx = 0; + ssp->srcu_gp_seq = 0; + ssp->srcu_barrier_seq = 0; + mutex_init(&ssp->srcu_barrier_mutex); + atomic_set(&ssp->srcu_barrier_cpu_cnt, 0); + INIT_DELAYED_WORK(&ssp->work, process_srcu); if (!is_static) - sp->sda = alloc_percpu(struct srcu_data); - init_srcu_struct_nodes(sp, is_static); - sp->srcu_gp_seq_needed_exp = 0; - sp->srcu_last_gp_end = ktime_get_mono_fast_ns(); - smp_store_release(&sp->srcu_gp_seq_needed, 0); /* Init done. */ - return sp->sda ? 0 : -ENOMEM; + ssp->sda = alloc_percpu(struct srcu_data); + init_srcu_struct_nodes(ssp, is_static); + ssp->srcu_gp_seq_needed_exp = 0; + ssp->srcu_last_gp_end = ktime_get_mono_fast_ns(); + smp_store_release(&ssp->srcu_gp_seq_needed, 0); /* Init done. */ + return ssp->sda ? 0 : -ENOMEM; } #ifdef CONFIG_DEBUG_LOCK_ALLOC -int __init_srcu_struct(struct srcu_struct *sp, const char *name, +int __init_srcu_struct(struct srcu_struct *ssp, const char *name, struct lock_class_key *key) { /* Don't re-initialize a lock while it is held. */ - debug_check_no_locks_freed((void *)sp, sizeof(*sp)); - lockdep_init_map(&sp->dep_map, name, key, 0); - spin_lock_init(&ACCESS_PRIVATE(sp, lock)); - return init_srcu_struct_fields(sp, false); + debug_check_no_locks_freed((void *)ssp, sizeof(*ssp)); + lockdep_init_map(&ssp->dep_map, name, key, 0); + spin_lock_init(&ACCESS_PRIVATE(ssp, lock)); + return init_srcu_struct_fields(ssp, false); } EXPORT_SYMBOL_GPL(__init_srcu_struct); @@ -212,16 +212,16 @@ EXPORT_SYMBOL_GPL(__init_srcu_struct); /** * init_srcu_struct - initialize a sleep-RCU structure - * @sp: structure to initialize. + * @ssp: structure to initialize. * * Must invoke this on a given srcu_struct before passing that srcu_struct * to any other function. Each srcu_struct represents a separate domain * of SRCU protection. */ -int init_srcu_struct(struct srcu_struct *sp) +int init_srcu_struct(struct srcu_struct *ssp) { - spin_lock_init(&ACCESS_PRIVATE(sp, lock)); - return init_srcu_struct_fields(sp, false); + spin_lock_init(&ACCESS_PRIVATE(ssp, lock)); + return init_srcu_struct_fields(ssp, false); } EXPORT_SYMBOL_GPL(init_srcu_struct); @@ -231,37 +231,37 @@ EXPORT_SYMBOL_GPL(init_srcu_struct); * First-use initialization of statically allocated srcu_struct * structure. Wiring up the combining tree is more than can be * done with compile-time initialization, so this check is added - * to each update-side SRCU primitive. Use sp->lock, which -is- + * to each update-side SRCU primitive. Use ssp->lock, which -is- * compile-time initialized, to resolve races involving multiple * CPUs trying to garner first-use privileges. */ -static void check_init_srcu_struct(struct srcu_struct *sp) +static void check_init_srcu_struct(struct srcu_struct *ssp) { unsigned long flags; /* The smp_load_acquire() pairs with the smp_store_release(). */ - if (!rcu_seq_state(smp_load_acquire(&sp->srcu_gp_seq_needed))) /*^^^*/ + if (!rcu_seq_state(smp_load_acquire(&ssp->srcu_gp_seq_needed))) /*^^^*/ return; /* Already initialized. */ - spin_lock_irqsave_rcu_node(sp, flags); - if (!rcu_seq_state(sp->srcu_gp_seq_needed)) { - spin_unlock_irqrestore_rcu_node(sp, flags); + spin_lock_irqsave_rcu_node(ssp, flags); + if (!rcu_seq_state(ssp->srcu_gp_seq_needed)) { + spin_unlock_irqrestore_rcu_node(ssp, flags); return; } - init_srcu_struct_fields(sp, true); - spin_unlock_irqrestore_rcu_node(sp, flags); + init_srcu_struct_fields(ssp, true); + spin_unlock_irqrestore_rcu_node(ssp, flags); } /* * Returns approximate total of the readers' ->srcu_lock_count[] values * for the rank of per-CPU counters specified by idx. */ -static unsigned long srcu_readers_lock_idx(struct srcu_struct *sp, int idx) +static unsigned long srcu_readers_lock_idx(struct srcu_struct *ssp, int idx) { int cpu; unsigned long sum = 0; for_each_possible_cpu(cpu) { - struct srcu_data *cpuc = per_cpu_ptr(sp->sda, cpu); + struct srcu_data *cpuc = per_cpu_ptr(ssp->sda, cpu); sum += READ_ONCE(cpuc->srcu_lock_count[idx]); } @@ -272,13 +272,13 @@ static unsigned long srcu_readers_lock_idx(struct srcu_struct *sp, int idx) * Returns approximate total of the readers' ->srcu_unlock_count[] values * for the rank of per-CPU counters specified by idx. */ -static unsigned long srcu_readers_unlock_idx(struct srcu_struct *sp, int idx) +static unsigned long srcu_readers_unlock_idx(struct srcu_struct *ssp, int idx) { int cpu; unsigned long sum = 0; for_each_possible_cpu(cpu) { - struct srcu_data *cpuc = per_cpu_ptr(sp->sda, cpu); + struct srcu_data *cpuc = per_cpu_ptr(ssp->sda, cpu); sum += READ_ONCE(cpuc->srcu_unlock_count[idx]); } @@ -289,11 +289,11 @@ static unsigned long srcu_readers_unlock_idx(struct srcu_struct *sp, int idx) * Return true if the number of pre-existing readers is determined to * be zero. */ -static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx) +static bool srcu_readers_active_idx_check(struct srcu_struct *ssp, int idx) { unsigned long unlocks; - unlocks = srcu_readers_unlock_idx(sp, idx); + unlocks = srcu_readers_unlock_idx(ssp, idx); /* * Make sure that a lock is always counted if the corresponding @@ -329,25 +329,25 @@ static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx) * of floor(ULONG_MAX/NR_CPUS/2), which should be sufficient, * especially on 64-bit systems. */ - return srcu_readers_lock_idx(sp, idx) == unlocks; + return srcu_readers_lock_idx(ssp, idx) == unlocks; } /** * srcu_readers_active - returns true if there are readers. and false * otherwise - * @sp: which srcu_struct to count active readers (holding srcu_read_lock). + * @ssp: which srcu_struct to count active readers (holding srcu_read_lock). * * Note that this is not an atomic primitive, and can therefore suffer * severe errors when invoked on an active srcu_struct. That said, it * can be useful as an error check at cleanup time. */ -static bool srcu_readers_active(struct srcu_struct *sp) +static bool srcu_readers_active(struct srcu_struct *ssp) { int cpu; unsigned long sum = 0; for_each_possible_cpu(cpu) { - struct srcu_data *cpuc = per_cpu_ptr(sp->sda, cpu); + struct srcu_data *cpuc = per_cpu_ptr(ssp->sda, cpu); sum += READ_ONCE(cpuc->srcu_lock_count[0]); sum += READ_ONCE(cpuc->srcu_lock_count[1]); @@ -363,44 +363,44 @@ static bool srcu_readers_active(struct srcu_struct *sp) * Return grace-period delay, zero if there are expedited grace * periods pending, SRCU_INTERVAL otherwise. */ -static unsigned long srcu_get_delay(struct srcu_struct *sp) +static unsigned long srcu_get_delay(struct srcu_struct *ssp) { - if (ULONG_CMP_LT(READ_ONCE(sp->srcu_gp_seq), - READ_ONCE(sp->srcu_gp_seq_needed_exp))) + if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), + READ_ONCE(ssp->srcu_gp_seq_needed_exp))) return 0; return SRCU_INTERVAL; } /* Helper for cleanup_srcu_struct() and cleanup_srcu_struct_quiesced(). */ -void _cleanup_srcu_struct(struct srcu_struct *sp, bool quiesced) +void _cleanup_srcu_struct(struct srcu_struct *ssp, bool quiesced) { int cpu; - if (WARN_ON(!srcu_get_delay(sp))) + if (WARN_ON(!srcu_get_delay(ssp))) return; /* Just leak it! */ - if (WARN_ON(srcu_readers_active(sp))) + if (WARN_ON(srcu_readers_active(ssp))) return; /* Just leak it! */ if (quiesced) { - if (WARN_ON(delayed_work_pending(&sp->work))) + if (WARN_ON(delayed_work_pending(&ssp->work))) return; /* Just leak it! */ } else { - flush_delayed_work(&sp->work); + flush_delayed_work(&ssp->work); } for_each_possible_cpu(cpu) if (quiesced) { - if (WARN_ON(delayed_work_pending(&per_cpu_ptr(sp->sda, cpu)->work))) + if (WARN_ON(delayed_work_pending(&per_cpu_ptr(ssp->sda, cpu)->work))) return; /* Just leak it! */ } else { - flush_delayed_work(&per_cpu_ptr(sp->sda, cpu)->work); + flush_delayed_work(&per_cpu_ptr(ssp->sda, cpu)->work); } - if (WARN_ON(rcu_seq_state(READ_ONCE(sp->srcu_gp_seq)) != SRCU_STATE_IDLE) || - WARN_ON(srcu_readers_active(sp))) { + if (WARN_ON(rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)) != SRCU_STATE_IDLE) || + WARN_ON(srcu_readers_active(ssp))) { pr_info("%s: Active srcu_struct %p state: %d\n", - __func__, sp, rcu_seq_state(READ_ONCE(sp->srcu_gp_seq))); + __func__, ssp, rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))); return; /* Caller forgot to stop doing call_srcu()? */ } - free_percpu(sp->sda); - sp->sda = NULL; + free_percpu(ssp->sda); + ssp->sda = NULL; } EXPORT_SYMBOL_GPL(_cleanup_srcu_struct); @@ -409,12 +409,12 @@ EXPORT_SYMBOL_GPL(_cleanup_srcu_struct); * srcu_struct. * Returns an index that must be passed to the matching srcu_read_unlock(). */ -int __srcu_read_lock(struct srcu_struct *sp) +int __srcu_read_lock(struct srcu_struct *ssp) { int idx; - idx = READ_ONCE(sp->srcu_idx) & 0x1; - this_cpu_inc(sp->sda->srcu_lock_count[idx]); + idx = READ_ONCE(ssp->srcu_idx) & 0x1; + this_cpu_inc(ssp->sda->srcu_lock_count[idx]); smp_mb(); /* B */ /* Avoid leaking the critical section. */ return idx; } @@ -425,10 +425,10 @@ EXPORT_SYMBOL_GPL(__srcu_read_lock); * element of the srcu_struct. Note that this may well be a different * CPU than that which was incremented by the corresponding srcu_read_lock(). */ -void __srcu_read_unlock(struct srcu_struct *sp, int idx) +void __srcu_read_unlock(struct srcu_struct *ssp, int idx) { smp_mb(); /* C */ /* Avoid leaking the critical section. */ - this_cpu_inc(sp->sda->srcu_unlock_count[idx]); + this_cpu_inc(ssp->sda->srcu_unlock_count[idx]); } EXPORT_SYMBOL_GPL(__srcu_read_unlock); @@ -444,22 +444,22 @@ EXPORT_SYMBOL_GPL(__srcu_read_unlock); /* * Start an SRCU grace period. */ -static void srcu_gp_start(struct srcu_struct *sp) +static void srcu_gp_start(struct srcu_struct *ssp) { - struct srcu_data *sdp = this_cpu_ptr(sp->sda); + struct srcu_data *sdp = this_cpu_ptr(ssp->sda); int state; - lockdep_assert_held(&ACCESS_PRIVATE(sp, lock)); - WARN_ON_ONCE(ULONG_CMP_GE(sp->srcu_gp_seq, sp->srcu_gp_seq_needed)); + lockdep_assert_held(&ACCESS_PRIVATE(ssp, lock)); + WARN_ON_ONCE(ULONG_CMP_GE(ssp->srcu_gp_seq, ssp->srcu_gp_seq_needed)); spin_lock_rcu_node(sdp); /* Interrupts already disabled. */ rcu_segcblist_advance(&sdp->srcu_cblist, - rcu_seq_current(&sp->srcu_gp_seq)); + rcu_seq_current(&ssp->srcu_gp_seq)); (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, - rcu_seq_snap(&sp->srcu_gp_seq)); + rcu_seq_snap(&ssp->srcu_gp_seq)); spin_unlock_rcu_node(sdp); /* Interrupts remain disabled. */ smp_mb(); /* Order prior store to ->srcu_gp_seq_needed vs. GP start. */ - rcu_seq_start(&sp->srcu_gp_seq); - state = rcu_seq_state(READ_ONCE(sp->srcu_gp_seq)); + rcu_seq_start(&ssp->srcu_gp_seq); + state = rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)); WARN_ON_ONCE(state != SRCU_STATE_SCAN1); } @@ -513,7 +513,7 @@ static void srcu_schedule_cbs_sdp(struct srcu_data *sdp, unsigned long delay) * just-completed grace period, the one corresponding to idx. If possible, * schedule this invocation on the corresponding CPUs. */ -static void srcu_schedule_cbs_snp(struct srcu_struct *sp, struct srcu_node *snp, +static void srcu_schedule_cbs_snp(struct srcu_struct *ssp, struct srcu_node *snp, unsigned long mask, unsigned long delay) { int cpu; @@ -521,7 +521,7 @@ static void srcu_schedule_cbs_snp(struct srcu_struct *sp, struct srcu_node *snp, for (cpu = snp->grplo; cpu <= snp->grphi; cpu++) { if (!(mask & (1 << (cpu - snp->grplo)))) continue; - srcu_schedule_cbs_sdp(per_cpu_ptr(sp->sda, cpu), delay); + srcu_schedule_cbs_sdp(per_cpu_ptr(ssp->sda, cpu), delay); } } @@ -534,7 +534,7 @@ static void srcu_schedule_cbs_snp(struct srcu_struct *sp, struct srcu_node *snp, * are initiating callback invocation. This allows the ->srcu_have_cbs[] * array to have a finite number of elements. */ -static void srcu_gp_end(struct srcu_struct *sp) +static void srcu_gp_end(struct srcu_struct *ssp) { unsigned long cbdelay; bool cbs; @@ -548,28 +548,28 @@ static void srcu_gp_end(struct srcu_struct *sp) struct srcu_node *snp; /* Prevent more than one additional grace period. */ - mutex_lock(&sp->srcu_cb_mutex); + mutex_lock(&ssp->srcu_cb_mutex); /* End the current grace period. */ - spin_lock_irq_rcu_node(sp); - idx = rcu_seq_state(sp->srcu_gp_seq); + spin_lock_irq_rcu_node(ssp); + idx = rcu_seq_state(ssp->srcu_gp_seq); WARN_ON_ONCE(idx != SRCU_STATE_SCAN2); - cbdelay = srcu_get_delay(sp); - sp->srcu_last_gp_end = ktime_get_mono_fast_ns(); - rcu_seq_end(&sp->srcu_gp_seq); - gpseq = rcu_seq_current(&sp->srcu_gp_seq); - if (ULONG_CMP_LT(sp->srcu_gp_seq_needed_exp, gpseq)) - sp->srcu_gp_seq_needed_exp = gpseq; - spin_unlock_irq_rcu_node(sp); - mutex_unlock(&sp->srcu_gp_mutex); + cbdelay = srcu_get_delay(ssp); + ssp->srcu_last_gp_end = ktime_get_mono_fast_ns(); + rcu_seq_end(&ssp->srcu_gp_seq); + gpseq = rcu_seq_current(&ssp->srcu_gp_seq); + if (ULONG_CMP_LT(ssp->srcu_gp_seq_needed_exp, gpseq)) + ssp->srcu_gp_seq_needed_exp = gpseq; + spin_unlock_irq_rcu_node(ssp); + mutex_unlock(&ssp->srcu_gp_mutex); /* A new grace period can start at this point. But only one. */ /* Initiate callback invocation as needed. */ idx = rcu_seq_ctr(gpseq) % ARRAY_SIZE(snp->srcu_have_cbs); - srcu_for_each_node_breadth_first(sp, snp) { + srcu_for_each_node_breadth_first(ssp, snp) { spin_lock_irq_rcu_node(snp); cbs = false; - last_lvl = snp >= sp->level[rcu_num_lvls - 1]; + last_lvl = snp >= ssp->level[rcu_num_lvls - 1]; if (last_lvl) cbs = snp->srcu_have_cbs[idx] == gpseq; snp->srcu_have_cbs[idx] = gpseq; @@ -580,12 +580,12 @@ static void srcu_gp_end(struct srcu_struct *sp) snp->srcu_data_have_cbs[idx] = 0; spin_unlock_irq_rcu_node(snp); if (cbs) - srcu_schedule_cbs_snp(sp, snp, mask, cbdelay); + srcu_schedule_cbs_snp(ssp, snp, mask, cbdelay); /* Occasionally prevent srcu_data counter wrap. */ if (!(gpseq & counter_wrap_check) && last_lvl) for (cpu = snp->grplo; cpu <= snp->grphi; cpu++) { - sdp = per_cpu_ptr(sp->sda, cpu); + sdp = per_cpu_ptr(ssp->sda, cpu); spin_lock_irqsave_rcu_node(sdp, flags); if (ULONG_CMP_GE(gpseq, sdp->srcu_gp_seq_needed + 100)) @@ -598,18 +598,18 @@ static void srcu_gp_end(struct srcu_struct *sp) } /* Callback initiation done, allow grace periods after next. */ - mutex_unlock(&sp->srcu_cb_mutex); + mutex_unlock(&ssp->srcu_cb_mutex); /* Start a new grace period if needed. */ - spin_lock_irq_rcu_node(sp); - gpseq = rcu_seq_current(&sp->srcu_gp_seq); + spin_lock_irq_rcu_node(ssp); + gpseq = rcu_seq_current(&ssp->srcu_gp_seq); if (!rcu_seq_state(gpseq) && - ULONG_CMP_LT(gpseq, sp->srcu_gp_seq_needed)) { - srcu_gp_start(sp); - spin_unlock_irq_rcu_node(sp); - srcu_reschedule(sp, 0); + ULONG_CMP_LT(gpseq, ssp->srcu_gp_seq_needed)) { + srcu_gp_start(ssp); + spin_unlock_irq_rcu_node(ssp); + srcu_reschedule(ssp, 0); } else { - spin_unlock_irq_rcu_node(sp); + spin_unlock_irq_rcu_node(ssp); } } @@ -620,13 +620,13 @@ static void srcu_gp_end(struct srcu_struct *sp) * but without expediting. To start a completely new grace period, * whether expedited or not, use srcu_funnel_gp_start() instead. */ -static void srcu_funnel_exp_start(struct srcu_struct *sp, struct srcu_node *snp, +static void srcu_funnel_exp_start(struct srcu_struct *ssp, struct srcu_node *snp, unsigned long s) { unsigned long flags; for (; snp != NULL; snp = snp->srcu_parent) { - if (rcu_seq_done(&sp->srcu_gp_seq, s) || + if (rcu_seq_done(&ssp->srcu_gp_seq, s) || ULONG_CMP_GE(READ_ONCE(snp->srcu_gp_seq_needed_exp), s)) return; spin_lock_irqsave_rcu_node(snp, flags); @@ -637,10 +637,10 @@ static void srcu_funnel_exp_start(struct srcu_struct *sp, struct srcu_node *snp, WRITE_ONCE(snp->srcu_gp_seq_needed_exp, s); spin_unlock_irqrestore_rcu_node(snp, flags); } - spin_lock_irqsave_rcu_node(sp, flags); - if (ULONG_CMP_LT(sp->srcu_gp_seq_needed_exp, s)) - sp->srcu_gp_seq_needed_exp = s; - spin_unlock_irqrestore_rcu_node(sp, flags); + spin_lock_irqsave_rcu_node(ssp, flags); + if (ULONG_CMP_LT(ssp->srcu_gp_seq_needed_exp, s)) + ssp->srcu_gp_seq_needed_exp = s; + spin_unlock_irqrestore_rcu_node(ssp, flags); } /* @@ -653,7 +653,7 @@ static void srcu_funnel_exp_start(struct srcu_struct *sp, struct srcu_node *snp, * Note that this function also does the work of srcu_funnel_exp_start(), * in some cases by directly invoking it. */ -static void srcu_funnel_gp_start(struct srcu_struct *sp, struct srcu_data *sdp, +static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp, unsigned long s, bool do_norm) { unsigned long flags; @@ -663,7 +663,7 @@ static void srcu_funnel_gp_start(struct srcu_struct *sp, struct srcu_data *sdp, /* Each pass through the loop does one level of the srcu_node tree. */ for (; snp != NULL; snp = snp->srcu_parent) { - if (rcu_seq_done(&sp->srcu_gp_seq, s) && snp != sdp->mynode) + if (rcu_seq_done(&ssp->srcu_gp_seq, s) && snp != sdp->mynode) return; /* GP already done and CBs recorded. */ spin_lock_irqsave_rcu_node(snp, flags); if (ULONG_CMP_GE(snp->srcu_have_cbs[idx], s)) { @@ -678,7 +678,7 @@ static void srcu_funnel_gp_start(struct srcu_struct *sp, struct srcu_data *sdp, return; } if (!do_norm) - srcu_funnel_exp_start(sp, snp, s); + srcu_funnel_exp_start(ssp, snp, s); return; } snp->srcu_have_cbs[idx] = s; @@ -690,29 +690,29 @@ static void srcu_funnel_gp_start(struct srcu_struct *sp, struct srcu_data *sdp, } /* Top of tree, must ensure the grace period will be started. */ - spin_lock_irqsave_rcu_node(sp, flags); - if (ULONG_CMP_LT(sp->srcu_gp_seq_needed, s)) { + spin_lock_irqsave_rcu_node(ssp, flags); + if (ULONG_CMP_LT(ssp->srcu_gp_seq_needed, s)) { /* * Record need for grace period s. Pair with load * acquire setting up for initialization. */ - smp_store_release(&sp->srcu_gp_seq_needed, s); /*^^^*/ + smp_store_release(&ssp->srcu_gp_seq_needed, s); /*^^^*/ } - if (!do_norm && ULONG_CMP_LT(sp->srcu_gp_seq_needed_exp, s)) - sp->srcu_gp_seq_needed_exp = s; + if (!do_norm && ULONG_CMP_LT(ssp->srcu_gp_seq_needed_exp, s)) + ssp->srcu_gp_seq_needed_exp = s; /* If grace period not already done and none in progress, start it. */ - if (!rcu_seq_done(&sp->srcu_gp_seq, s) && - rcu_seq_state(sp->srcu_gp_seq) == SRCU_STATE_IDLE) { - WARN_ON_ONCE(ULONG_CMP_GE(sp->srcu_gp_seq, sp->srcu_gp_seq_needed)); - srcu_gp_start(sp); + if (!rcu_seq_done(&ssp->srcu_gp_seq, s) && + rcu_seq_state(ssp->srcu_gp_seq) == SRCU_STATE_IDLE) { + WARN_ON_ONCE(ULONG_CMP_GE(ssp->srcu_gp_seq, ssp->srcu_gp_seq_needed)); + srcu_gp_start(ssp); if (likely(srcu_init_done)) - queue_delayed_work(rcu_gp_wq, &sp->work, - srcu_get_delay(sp)); - else if (list_empty(&sp->work.work.entry)) - list_add(&sp->work.work.entry, &srcu_boot_list); + queue_delayed_work(rcu_gp_wq, &ssp->work, + srcu_get_delay(ssp)); + else if (list_empty(&ssp->work.work.entry)) + list_add(&ssp->work.work.entry, &srcu_boot_list); } - spin_unlock_irqrestore_rcu_node(sp, flags); + spin_unlock_irqrestore_rcu_node(ssp, flags); } /* @@ -720,12 +720,12 @@ static void srcu_funnel_gp_start(struct srcu_struct *sp, struct srcu_data *sdp, * loop an additional time if there is an expedited grace period pending. * The caller must ensure that ->srcu_idx is not changed while checking. */ -static bool try_check_zero(struct srcu_struct *sp, int idx, int trycount) +static bool try_check_zero(struct srcu_struct *ssp, int idx, int trycount) { for (;;) { - if (srcu_readers_active_idx_check(sp, idx)) + if (srcu_readers_active_idx_check(ssp, idx)) return true; - if (--trycount + !srcu_get_delay(sp) <= 0) + if (--trycount + !srcu_get_delay(ssp) <= 0) return false; udelay(SRCU_RETRY_CHECK_DELAY); } @@ -736,7 +736,7 @@ static bool try_check_zero(struct srcu_struct *sp, int idx, int trycount) * use the other rank of the ->srcu_(un)lock_count[] arrays. This allows * us to wait for pre-existing readers in a starvation-free manner. */ -static void srcu_flip(struct srcu_struct *sp) +static void srcu_flip(struct srcu_struct *ssp) { /* * Ensure that if this updater saw a given reader's increment @@ -748,7 +748,7 @@ static void srcu_flip(struct srcu_struct *sp) */ smp_mb(); /* E */ /* Pairs with B and C. */ - WRITE_ONCE(sp->srcu_idx, sp->srcu_idx + 1); + WRITE_ONCE(ssp->srcu_idx, ssp->srcu_idx + 1); /* * Ensure that if the updater misses an __srcu_read_unlock() @@ -781,7 +781,7 @@ static void srcu_flip(struct srcu_struct *sp) * negligible when amoritized over that time period, and the extra latency * of a needlessly non-expedited grace period is similarly negligible. */ -static bool srcu_might_be_idle(struct srcu_struct *sp) +static bool srcu_might_be_idle(struct srcu_struct *ssp) { unsigned long curseq; unsigned long flags; @@ -790,7 +790,7 @@ static bool srcu_might_be_idle(struct srcu_struct *sp) /* If the local srcu_data structure has callbacks, not idle. */ local_irq_save(flags); - sdp = this_cpu_ptr(sp->sda); + sdp = this_cpu_ptr(ssp->sda); if (rcu_segcblist_pend_cbs(&sdp->srcu_cblist)) { local_irq_restore(flags); return false; /* Callbacks already present, so not idle. */ @@ -806,17 +806,17 @@ static bool srcu_might_be_idle(struct srcu_struct *sp) /* First, see if enough time has passed since the last GP. */ t = ktime_get_mono_fast_ns(); if (exp_holdoff == 0 || - time_in_range_open(t, sp->srcu_last_gp_end, - sp->srcu_last_gp_end + exp_holdoff)) + time_in_range_open(t, ssp->srcu_last_gp_end, + ssp->srcu_last_gp_end + exp_holdoff)) return false; /* Too soon after last GP. */ /* Next, check for probable idleness. */ - curseq = rcu_seq_current(&sp->srcu_gp_seq); + curseq = rcu_seq_current(&ssp->srcu_gp_seq); smp_mb(); /* Order ->srcu_gp_seq with ->srcu_gp_seq_needed. */ - if (ULONG_CMP_LT(curseq, READ_ONCE(sp->srcu_gp_seq_needed))) + if (ULONG_CMP_LT(curseq, READ_ONCE(ssp->srcu_gp_seq_needed))) return false; /* Grace period in progress, so not idle. */ smp_mb(); /* Order ->srcu_gp_seq with prior access. */ - if (curseq != rcu_seq_current(&sp->srcu_gp_seq)) + if (curseq != rcu_seq_current(&ssp->srcu_gp_seq)) return false; /* GP # changed, so not idle. */ return true; /* With reasonable probability, idle! */ } @@ -856,7 +856,7 @@ static void srcu_leak_callback(struct rcu_head *rhp) * srcu_read_lock(), and srcu_read_unlock() that are all passed the same * srcu_struct structure. */ -void __call_srcu(struct srcu_struct *sp, struct rcu_head *rhp, +void __call_srcu(struct srcu_struct *ssp, struct rcu_head *rhp, rcu_callback_t func, bool do_norm) { unsigned long flags; @@ -866,7 +866,7 @@ void __call_srcu(struct srcu_struct *sp, struct rcu_head *rhp, unsigned long s; struct srcu_data *sdp; - check_init_srcu_struct(sp); + check_init_srcu_struct(ssp); if (debug_rcu_head_queue(rhp)) { /* Probable double call_srcu(), so leak the callback. */ WRITE_ONCE(rhp->func, srcu_leak_callback); @@ -874,14 +874,14 @@ void __call_srcu(struct srcu_struct *sp, struct rcu_head *rhp, return; } rhp->func = func; - idx = srcu_read_lock(sp); + idx = srcu_read_lock(ssp); local_irq_save(flags); - sdp = this_cpu_ptr(sp->sda); + sdp = this_cpu_ptr(ssp->sda); spin_lock_rcu_node(sdp); rcu_segcblist_enqueue(&sdp->srcu_cblist, rhp, false); rcu_segcblist_advance(&sdp->srcu_cblist, - rcu_seq_current(&sp->srcu_gp_seq)); - s = rcu_seq_snap(&sp->srcu_gp_seq); + rcu_seq_current(&ssp->srcu_gp_seq)); + s = rcu_seq_snap(&ssp->srcu_gp_seq); (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, s); if (ULONG_CMP_LT(sdp->srcu_gp_seq_needed, s)) { sdp->srcu_gp_seq_needed = s; @@ -893,15 +893,15 @@ void __call_srcu(struct srcu_struct *sp, struct rcu_head *rhp, } spin_unlock_irqrestore_rcu_node(sdp, flags); if (needgp) - srcu_funnel_gp_start(sp, sdp, s, do_norm); + srcu_funnel_gp_start(ssp, sdp, s, do_norm); else if (needexp) - srcu_funnel_exp_start(sp, sdp->mynode, s); - srcu_read_unlock(sp, idx); + srcu_funnel_exp_start(ssp, sdp->mynode, s); + srcu_read_unlock(ssp, idx); } /** * call_srcu() - Queue a callback for invocation after an SRCU grace period - * @sp: srcu_struct in queue the callback + * @ssp: srcu_struct in queue the callback * @rhp: structure to be used for queueing the SRCU callback. * @func: function to be invoked after the SRCU grace period * @@ -916,21 +916,21 @@ void __call_srcu(struct srcu_struct *sp, struct rcu_head *rhp, * The callback will be invoked from process context, but must nevertheless * be fast and must not block. */ -void call_srcu(struct srcu_struct *sp, struct rcu_head *rhp, +void call_srcu(struct srcu_struct *ssp, struct rcu_head *rhp, rcu_callback_t func) { - __call_srcu(sp, rhp, func, true); + __call_srcu(ssp, rhp, func, true); } EXPORT_SYMBOL_GPL(call_srcu); /* * Helper function for synchronize_srcu() and synchronize_srcu_expedited(). */ -static void __synchronize_srcu(struct srcu_struct *sp, bool do_norm) +static void __synchronize_srcu(struct srcu_struct *ssp, bool do_norm) { struct rcu_synchronize rcu; - RCU_LOCKDEP_WARN(lock_is_held(&sp->dep_map) || + RCU_LOCKDEP_WARN(lock_is_held(&ssp->dep_map) || lock_is_held(&rcu_bh_lock_map) || lock_is_held(&rcu_lock_map) || lock_is_held(&rcu_sched_lock_map), @@ -939,10 +939,10 @@ static void __synchronize_srcu(struct srcu_struct *sp, bool do_norm) if (rcu_scheduler_active == RCU_SCHEDULER_INACTIVE) return; might_sleep(); - check_init_srcu_struct(sp); + check_init_srcu_struct(ssp); init_completion(&rcu.completion); init_rcu_head_on_stack(&rcu.head); - __call_srcu(sp, &rcu.head, wakeme_after_rcu, do_norm); + __call_srcu(ssp, &rcu.head, wakeme_after_rcu, do_norm); wait_for_completion(&rcu.completion); destroy_rcu_head_on_stack(&rcu.head); @@ -958,7 +958,7 @@ static void __synchronize_srcu(struct srcu_struct *sp, bool do_norm) /** * synchronize_srcu_expedited - Brute-force SRCU grace period - * @sp: srcu_struct with which to synchronize. + * @ssp: srcu_struct with which to synchronize. * * Wait for an SRCU grace period to elapse, but be more aggressive about * spinning rather than blocking when waiting. @@ -966,15 +966,15 @@ static void __synchronize_srcu(struct srcu_struct *sp, bool do_norm) * Note that synchronize_srcu_expedited() has the same deadlock and * memory-ordering properties as does synchronize_srcu(). */ -void synchronize_srcu_expedited(struct srcu_struct *sp) +void synchronize_srcu_expedited(struct srcu_struct *ssp) { - __synchronize_srcu(sp, rcu_gp_is_normal()); + __synchronize_srcu(ssp, rcu_gp_is_normal()); } EXPORT_SYMBOL_GPL(synchronize_srcu_expedited); /** * synchronize_srcu - wait for prior SRCU read-side critical-section completion - * @sp: srcu_struct with which to synchronize. + * @ssp: srcu_struct with which to synchronize. * * Wait for the count to drain to zero of both indexes. To avoid the * possible starvation of synchronize_srcu(), it waits for the count of @@ -1016,12 +1016,12 @@ EXPORT_SYMBOL_GPL(synchronize_srcu_expedited); * SRCU must also provide it. Note that detecting idleness is heuristic * and subject to both false positives and negatives. */ -void synchronize_srcu(struct srcu_struct *sp) +void synchronize_srcu(struct srcu_struct *ssp) { - if (srcu_might_be_idle(sp) || rcu_gp_is_expedited()) - synchronize_srcu_expedited(sp); + if (srcu_might_be_idle(ssp) || rcu_gp_is_expedited()) + synchronize_srcu_expedited(ssp); else - __synchronize_srcu(sp, true); + __synchronize_srcu(ssp, true); } EXPORT_SYMBOL_GPL(synchronize_srcu); @@ -1031,36 +1031,36 @@ EXPORT_SYMBOL_GPL(synchronize_srcu); static void srcu_barrier_cb(struct rcu_head *rhp) { struct srcu_data *sdp; - struct srcu_struct *sp; + struct srcu_struct *ssp; sdp = container_of(rhp, struct srcu_data, srcu_barrier_head); - sp = sdp->sp; - if (atomic_dec_and_test(&sp->srcu_barrier_cpu_cnt)) - complete(&sp->srcu_barrier_completion); + ssp = sdp->ssp; + if (atomic_dec_and_test(&ssp->srcu_barrier_cpu_cnt)) + complete(&ssp->srcu_barrier_completion); } /** * srcu_barrier - Wait until all in-flight call_srcu() callbacks complete. - * @sp: srcu_struct on which to wait for in-flight callbacks. + * @ssp: srcu_struct on which to wait for in-flight callbacks. */ -void srcu_barrier(struct srcu_struct *sp) +void srcu_barrier(struct srcu_struct *ssp) { int cpu; struct srcu_data *sdp; - unsigned long s = rcu_seq_snap(&sp->srcu_barrier_seq); + unsigned long s = rcu_seq_snap(&ssp->srcu_barrier_seq); - check_init_srcu_struct(sp); - mutex_lock(&sp->srcu_barrier_mutex); - if (rcu_seq_done(&sp->srcu_barrier_seq, s)) { + check_init_srcu_struct(ssp); + mutex_lock(&ssp->srcu_barrier_mutex); + if (rcu_seq_done(&ssp->srcu_barrier_seq, s)) { smp_mb(); /* Force ordering following return. */ - mutex_unlock(&sp->srcu_barrier_mutex); + mutex_unlock(&ssp->srcu_barrier_mutex); return; /* Someone else did our work for us. */ } - rcu_seq_start(&sp->srcu_barrier_seq); - init_completion(&sp->srcu_barrier_completion); + rcu_seq_start(&ssp->srcu_barrier_seq); + init_completion(&ssp->srcu_barrier_completion); /* Initial count prevents reaching zero until all CBs are posted. */ - atomic_set(&sp->srcu_barrier_cpu_cnt, 1); + atomic_set(&ssp->srcu_barrier_cpu_cnt, 1); /* * Each pass through this loop enqueues a callback, but only @@ -1071,39 +1071,39 @@ void srcu_barrier(struct srcu_struct *sp) * grace period as the last callback already in the queue. */ for_each_possible_cpu(cpu) { - sdp = per_cpu_ptr(sp->sda, cpu); + sdp = per_cpu_ptr(ssp->sda, cpu); spin_lock_irq_rcu_node(sdp); - atomic_inc(&sp->srcu_barrier_cpu_cnt); + atomic_inc(&ssp->srcu_barrier_cpu_cnt); sdp->srcu_barrier_head.func = srcu_barrier_cb; debug_rcu_head_queue(&sdp->srcu_barrier_head); if (!rcu_segcblist_entrain(&sdp->srcu_cblist, &sdp->srcu_barrier_head, 0)) { debug_rcu_head_unqueue(&sdp->srcu_barrier_head); - atomic_dec(&sp->srcu_barrier_cpu_cnt); + atomic_dec(&ssp->srcu_barrier_cpu_cnt); } spin_unlock_irq_rcu_node(sdp); } /* Remove the initial count, at which point reaching zero can happen. */ - if (atomic_dec_and_test(&sp->srcu_barrier_cpu_cnt)) - complete(&sp->srcu_barrier_completion); - wait_for_completion(&sp->srcu_barrier_completion); + if (atomic_dec_and_test(&ssp->srcu_barrier_cpu_cnt)) + complete(&ssp->srcu_barrier_completion); + wait_for_completion(&ssp->srcu_barrier_completion); - rcu_seq_end(&sp->srcu_barrier_seq); - mutex_unlock(&sp->srcu_barrier_mutex); + rcu_seq_end(&ssp->srcu_barrier_seq); + mutex_unlock(&ssp->srcu_barrier_mutex); } EXPORT_SYMBOL_GPL(srcu_barrier); /** * srcu_batches_completed - return batches completed. - * @sp: srcu_struct on which to report batch completion. + * @ssp: srcu_struct on which to report batch completion. * * Report the number of batches, correlated with, but not necessarily * precisely the same as, the number of grace periods that have elapsed. */ -unsigned long srcu_batches_completed(struct srcu_struct *sp) +unsigned long srcu_batches_completed(struct srcu_struct *ssp) { - return sp->srcu_idx; + return ssp->srcu_idx; } EXPORT_SYMBOL_GPL(srcu_batches_completed); @@ -1112,11 +1112,11 @@ EXPORT_SYMBOL_GPL(srcu_batches_completed); * to SRCU_STATE_SCAN2, and invoke srcu_gp_end() when scan has * completed in that state. */ -static void srcu_advance_state(struct srcu_struct *sp) +static void srcu_advance_state(struct srcu_struct *ssp) { int idx; - mutex_lock(&sp->srcu_gp_mutex); + mutex_lock(&ssp->srcu_gp_mutex); /* * Because readers might be delayed for an extended period after @@ -1128,47 +1128,47 @@ static void srcu_advance_state(struct srcu_struct *sp) * The load-acquire ensures that we see the accesses performed * by the prior grace period. */ - idx = rcu_seq_state(smp_load_acquire(&sp->srcu_gp_seq)); /* ^^^ */ + idx = rcu_seq_state(smp_load_acquire(&ssp->srcu_gp_seq)); /* ^^^ */ if (idx == SRCU_STATE_IDLE) { - spin_lock_irq_rcu_node(sp); - if (ULONG_CMP_GE(sp->srcu_gp_seq, sp->srcu_gp_seq_needed)) { - WARN_ON_ONCE(rcu_seq_state(sp->srcu_gp_seq)); - spin_unlock_irq_rcu_node(sp); - mutex_unlock(&sp->srcu_gp_mutex); + spin_lock_irq_rcu_node(ssp); + if (ULONG_CMP_GE(ssp->srcu_gp_seq, ssp->srcu_gp_seq_needed)) { + WARN_ON_ONCE(rcu_seq_state(ssp->srcu_gp_seq)); + spin_unlock_irq_rcu_node(ssp); + mutex_unlock(&ssp->srcu_gp_mutex); return; } - idx = rcu_seq_state(READ_ONCE(sp->srcu_gp_seq)); + idx = rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)); if (idx == SRCU_STATE_IDLE) - srcu_gp_start(sp); - spin_unlock_irq_rcu_node(sp); + srcu_gp_start(ssp); + spin_unlock_irq_rcu_node(ssp); if (idx != SRCU_STATE_IDLE) { - mutex_unlock(&sp->srcu_gp_mutex); + mutex_unlock(&ssp->srcu_gp_mutex); return; /* Someone else started the grace period. */ } } - if (rcu_seq_state(READ_ONCE(sp->srcu_gp_seq)) == SRCU_STATE_SCAN1) { - idx = 1 ^ (sp->srcu_idx & 1); - if (!try_check_zero(sp, idx, 1)) { - mutex_unlock(&sp->srcu_gp_mutex); + if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)) == SRCU_STATE_SCAN1) { + idx = 1 ^ (ssp->srcu_idx & 1); + if (!try_check_zero(ssp, idx, 1)) { + mutex_unlock(&ssp->srcu_gp_mutex); return; /* readers present, retry later. */ } - srcu_flip(sp); - rcu_seq_set_state(&sp->srcu_gp_seq, SRCU_STATE_SCAN2); + srcu_flip(ssp); + rcu_seq_set_state(&ssp->srcu_gp_seq, SRCU_STATE_SCAN2); } - if (rcu_seq_state(READ_ONCE(sp->srcu_gp_seq)) == SRCU_STATE_SCAN2) { + if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)) == SRCU_STATE_SCAN2) { /* * SRCU read-side critical sections are normally short, * so check at least twice in quick succession after a flip. */ - idx = 1 ^ (sp->srcu_idx & 1); - if (!try_check_zero(sp, idx, 2)) { - mutex_unlock(&sp->srcu_gp_mutex); + idx = 1 ^ (ssp->srcu_idx & 1); + if (!try_check_zero(ssp, idx, 2)) { + mutex_unlock(&ssp->srcu_gp_mutex); return; /* readers present, retry later. */ } - srcu_gp_end(sp); /* Releases ->srcu_gp_mutex. */ + srcu_gp_end(ssp); /* Releases ->srcu_gp_mutex. */ } } @@ -1184,14 +1184,14 @@ static void srcu_invoke_callbacks(struct work_struct *work) struct rcu_cblist ready_cbs; struct rcu_head *rhp; struct srcu_data *sdp; - struct srcu_struct *sp; + struct srcu_struct *ssp; sdp = container_of(work, struct srcu_data, work.work); - sp = sdp->sp; + ssp = sdp->ssp; rcu_cblist_init(&ready_cbs); spin_lock_irq_rcu_node(sdp); rcu_segcblist_advance(&sdp->srcu_cblist, - rcu_seq_current(&sp->srcu_gp_seq)); + rcu_seq_current(&ssp->srcu_gp_seq)); if (sdp->srcu_cblist_invoking || !rcu_segcblist_ready_cbs(&sdp->srcu_cblist)) { spin_unlock_irq_rcu_node(sdp); @@ -1217,7 +1217,7 @@ static void srcu_invoke_callbacks(struct work_struct *work) spin_lock_irq_rcu_node(sdp); rcu_segcblist_insert_count(&sdp->srcu_cblist, &ready_cbs); (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, - rcu_seq_snap(&sp->srcu_gp_seq)); + rcu_seq_snap(&ssp->srcu_gp_seq)); sdp->srcu_cblist_invoking = false; more = rcu_segcblist_ready_cbs(&sdp->srcu_cblist); spin_unlock_irq_rcu_node(sdp); @@ -1229,24 +1229,24 @@ static void srcu_invoke_callbacks(struct work_struct *work) * Finished one round of SRCU grace period. Start another if there are * more SRCU callbacks queued, otherwise put SRCU into not-running state. */ -static void srcu_reschedule(struct srcu_struct *sp, unsigned long delay) +static void srcu_reschedule(struct srcu_struct *ssp, unsigned long delay) { bool pushgp = true; - spin_lock_irq_rcu_node(sp); - if (ULONG_CMP_GE(sp->srcu_gp_seq, sp->srcu_gp_seq_needed)) { - if (!WARN_ON_ONCE(rcu_seq_state(sp->srcu_gp_seq))) { + spin_lock_irq_rcu_node(ssp); + if (ULONG_CMP_GE(ssp->srcu_gp_seq, ssp->srcu_gp_seq_needed)) { + if (!WARN_ON_ONCE(rcu_seq_state(ssp->srcu_gp_seq))) { /* All requests fulfilled, time to go idle. */ pushgp = false; } - } else if (!rcu_seq_state(sp->srcu_gp_seq)) { + } else if (!rcu_seq_state(ssp->srcu_gp_seq)) { /* Outstanding request and no GP. Start one. */ - srcu_gp_start(sp); + srcu_gp_start(ssp); } - spin_unlock_irq_rcu_node(sp); + spin_unlock_irq_rcu_node(ssp); if (pushgp) - queue_delayed_work(rcu_gp_wq, &sp->work, delay); + queue_delayed_work(rcu_gp_wq, &ssp->work, delay); } /* @@ -1254,41 +1254,41 @@ static void srcu_reschedule(struct srcu_struct *sp, unsigned long delay) */ static void process_srcu(struct work_struct *work) { - struct srcu_struct *sp; + struct srcu_struct *ssp; - sp = container_of(work, struct srcu_struct, work.work); + ssp = container_of(work, struct srcu_struct, work.work); - srcu_advance_state(sp); - srcu_reschedule(sp, srcu_get_delay(sp)); + srcu_advance_state(ssp); + srcu_reschedule(ssp, srcu_get_delay(ssp)); } void srcutorture_get_gp_data(enum rcutorture_type test_type, - struct srcu_struct *sp, int *flags, + struct srcu_struct *ssp, int *flags, unsigned long *gp_seq) { if (test_type != SRCU_FLAVOR) return; *flags = 0; - *gp_seq = rcu_seq_current(&sp->srcu_gp_seq); + *gp_seq = rcu_seq_current(&ssp->srcu_gp_seq); } EXPORT_SYMBOL_GPL(srcutorture_get_gp_data); -void srcu_torture_stats_print(struct srcu_struct *sp, char *tt, char *tf) +void srcu_torture_stats_print(struct srcu_struct *ssp, char *tt, char *tf) { int cpu; int idx; unsigned long s0 = 0, s1 = 0; - idx = sp->srcu_idx & 0x1; + idx = ssp->srcu_idx & 0x1; pr_alert("%s%s Tree SRCU g%ld per-CPU(idx=%d):", - tt, tf, rcu_seq_current(&sp->srcu_gp_seq), idx); + tt, tf, rcu_seq_current(&ssp->srcu_gp_seq), idx); for_each_possible_cpu(cpu) { unsigned long l0, l1; unsigned long u0, u1; long c0, c1; struct srcu_data *sdp; - sdp = per_cpu_ptr(sp->sda, cpu); + sdp = per_cpu_ptr(ssp->sda, cpu); u0 = sdp->srcu_unlock_count[!idx]; u1 = sdp->srcu_unlock_count[idx]; @@ -1323,14 +1323,14 @@ early_initcall(srcu_bootup_announce); void __init srcu_init(void) { - struct srcu_struct *sp; + struct srcu_struct *ssp; srcu_init_done = true; while (!list_empty(&srcu_boot_list)) { - sp = list_first_entry(&srcu_boot_list, struct srcu_struct, + ssp = list_first_entry(&srcu_boot_list, struct srcu_struct, work.work.entry); - check_init_srcu_struct(sp); - list_del_init(&sp->work.work.entry); - queue_work(rcu_gp_wq, &sp->work.work); + check_init_srcu_struct(ssp); + list_del_init(&ssp->work.work.entry); + queue_work(rcu_gp_wq, &ssp->work.work); } } -- cgit v1.3-7-g2ca7 From 9adcfaffc34d53e498637237fb3701560359d50b Mon Sep 17 00:00:00 2001 From: Tetsuo Handa Date: Sat, 24 Nov 2018 13:10:25 +0900 Subject: printk: Make printk_emit() local function. printk_emit() is called from only devkmsg_write() in the same file. Save object size by making it a local function. Link: http://lkml.kernel.org/r/5cc99d2c-c408-34f7-d1fc-e1cd2a9e31da@i-love.sakura.ne.jp Cc: Steven Rostedt Signed-off-by: Tetsuo Handa Reviewed-by: Sergey Senozhatsky Signed-off-by: Petr Mladek --- include/linux/printk.h | 5 ----- kernel/printk/printk.c | 30 ++++++++++++++---------------- 2 files changed, 14 insertions(+), 21 deletions(-) (limited to 'kernel') diff --git a/include/linux/printk.h b/include/linux/printk.h index cf3eccfe1543..55aa96975fa2 100644 --- a/include/linux/printk.h +++ b/include/linux/printk.h @@ -166,11 +166,6 @@ int vprintk_emit(int facility, int level, asmlinkage __printf(1, 0) int vprintk(const char *fmt, va_list args); -asmlinkage __printf(5, 6) __cold -int printk_emit(int facility, int level, - const char *dict, size_t dictlen, - const char *fmt, ...); - asmlinkage __printf(1, 2) __cold int printk(const char *fmt, ...); diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index b77150ad1965..a1d88212a5d2 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -753,6 +753,19 @@ struct devkmsg_user { char buf[CONSOLE_EXT_LOG_MAX]; }; +static __printf(3, 4) __cold +int devkmsg_emit(int facility, int level, const char *fmt, ...) +{ + va_list args; + int r; + + va_start(args, fmt); + r = vprintk_emit(facility, level, NULL, 0, fmt, args); + va_end(args); + + return r; +} + static ssize_t devkmsg_write(struct kiocb *iocb, struct iov_iter *from) { char *buf, *line; @@ -811,7 +824,7 @@ static ssize_t devkmsg_write(struct kiocb *iocb, struct iov_iter *from) } } - printk_emit(facility, level, NULL, 0, "%s", line); + devkmsg_emit(facility, level, "%s", line); kfree(buf); return ret; } @@ -1936,21 +1949,6 @@ asmlinkage int vprintk(const char *fmt, va_list args) } EXPORT_SYMBOL(vprintk); -asmlinkage int printk_emit(int facility, int level, - const char *dict, size_t dictlen, - const char *fmt, ...) -{ - va_list args; - int r; - - va_start(args, fmt); - r = vprintk_emit(facility, level, dict, dictlen, fmt, args); - va_end(args); - - return r; -} -EXPORT_SYMBOL(printk_emit); - int vprintk_default(const char *fmt, va_list args) { int r; -- cgit v1.3-7-g2ca7 From 2d25bc55235314d869459c574be14e8faa73aca3 Mon Sep 17 00:00:00 2001 From: Jessica Yu Date: Mon, 19 Nov 2018 17:43:58 +0100 Subject: module: make it clearer when we're handling kallsyms symbols vs exported symbols The module loader internally works with both exported symbols represented as struct kernel_symbol, as well as Elf symbols from a module's symbol table. It's hard to distinguish sometimes which type of symbol we're handling given that some helper function names are not consistent or helpful. Take get_ksymbol() for instance - are we looking for an exported symbol or a kallsyms symbol here? Or symname() and kernel_symbol_name() - which function handles an exported symbol and which one an Elf symbol? Clean up and unify the function naming scheme a bit to make it clear which kind of symbol we're handling. This change only affects static functions internal to the module loader. Reviewed-by: Miroslav Benes Signed-off-by: Jessica Yu --- kernel/module.c | 78 ++++++++++++++++++++++++++++++++------------------------- 1 file changed, 44 insertions(+), 34 deletions(-) (limited to 'kernel') diff --git a/kernel/module.c b/kernel/module.c index 49a405891587..1b5edf78694c 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -495,9 +495,9 @@ struct find_symbol_arg { const struct kernel_symbol *sym; }; -static bool check_symbol(const struct symsearch *syms, - struct module *owner, - unsigned int symnum, void *data) +static bool check_exported_symbol(const struct symsearch *syms, + struct module *owner, + unsigned int symnum, void *data) { struct find_symbol_arg *fsa = data; @@ -555,9 +555,9 @@ static int cmp_name(const void *va, const void *vb) return strcmp(a, kernel_symbol_name(b)); } -static bool find_symbol_in_section(const struct symsearch *syms, - struct module *owner, - void *data) +static bool find_exported_symbol_in_section(const struct symsearch *syms, + struct module *owner, + void *data) { struct find_symbol_arg *fsa = data; struct kernel_symbol *sym; @@ -565,13 +565,14 @@ static bool find_symbol_in_section(const struct symsearch *syms, sym = bsearch(fsa->name, syms->start, syms->stop - syms->start, sizeof(struct kernel_symbol), cmp_name); - if (sym != NULL && check_symbol(syms, owner, sym - syms->start, data)) + if (sym != NULL && check_exported_symbol(syms, owner, + sym - syms->start, data)) return true; return false; } -/* Find a symbol and return it, along with, (optional) crc and +/* Find an exported symbol and return it, along with, (optional) crc and * (optional) module which owns it. Needs preempt disabled or module_mutex. */ const struct kernel_symbol *find_symbol(const char *name, struct module **owner, @@ -585,7 +586,7 @@ const struct kernel_symbol *find_symbol(const char *name, fsa.gplok = gplok; fsa.warn = warn; - if (each_symbol_section(find_symbol_in_section, &fsa)) { + if (each_symbol_section(find_exported_symbol_in_section, &fsa)) { if (owner) *owner = fsa.owner; if (crc) @@ -2198,7 +2199,7 @@ EXPORT_SYMBOL_GPL(__symbol_get); * * You must hold the module_mutex. */ -static int verify_export_symbols(struct module *mod) +static int verify_exported_symbols(struct module *mod) { unsigned int i; struct module *owner; @@ -2519,10 +2520,10 @@ static void free_modinfo(struct module *mod) #ifdef CONFIG_KALLSYMS -/* lookup symbol in given range of kernel_symbols */ -static const struct kernel_symbol *lookup_symbol(const char *name, - const struct kernel_symbol *start, - const struct kernel_symbol *stop) +/* Lookup exported symbol in given range of kernel_symbols */ +static const struct kernel_symbol *lookup_exported_symbol(const char *name, + const struct kernel_symbol *start, + const struct kernel_symbol *stop) { return bsearch(name, start, stop - start, sizeof(struct kernel_symbol), cmp_name); @@ -2533,9 +2534,10 @@ static int is_exported(const char *name, unsigned long value, { const struct kernel_symbol *ks; if (!mod) - ks = lookup_symbol(name, __start___ksymtab, __stop___ksymtab); + ks = lookup_exported_symbol(name, __start___ksymtab, __stop___ksymtab); else - ks = lookup_symbol(name, mod->syms, mod->syms + mod->num_syms); + ks = lookup_exported_symbol(name, mod->syms, mod->syms + mod->num_syms); + return ks != NULL && kernel_symbol_value(ks) == value; } @@ -3592,7 +3594,7 @@ static int complete_formation(struct module *mod, struct load_info *info) mutex_lock(&module_mutex); /* Find duplicate symbols (must be called under lock). */ - err = verify_export_symbols(mod); + err = verify_exported_symbols(mod); if (err < 0) goto out; @@ -3911,15 +3913,19 @@ static inline int is_arm_mapping_symbol(const char *str) && (str[2] == '\0' || str[2] == '.'); } -static const char *symname(struct mod_kallsyms *kallsyms, unsigned int symnum) +static const char *kallsyms_symbol_name(struct mod_kallsyms *kallsyms, unsigned int symnum) { return kallsyms->strtab + kallsyms->symtab[symnum].st_name; } -static const char *get_ksymbol(struct module *mod, - unsigned long addr, - unsigned long *size, - unsigned long *offset) +/* + * Given a module and address, find the corresponding symbol and return its name + * while providing its size and offset if needed. + */ +static const char *find_kallsyms_symbol(struct module *mod, + unsigned long addr, + unsigned long *size, + unsigned long *offset) { unsigned int i, best = 0; unsigned long nextval; @@ -3939,8 +3945,8 @@ static const char *get_ksymbol(struct module *mod, /* We ignore unnamed symbols: they're uninformative * and inserted at a whim. */ - if (*symname(kallsyms, i) == '\0' - || is_arm_mapping_symbol(symname(kallsyms, i))) + if (*kallsyms_symbol_name(kallsyms, i) == '\0' + || is_arm_mapping_symbol(kallsyms_symbol_name(kallsyms, i))) continue; if (kallsyms->symtab[i].st_value <= addr @@ -3958,7 +3964,8 @@ static const char *get_ksymbol(struct module *mod, *size = nextval - kallsyms->symtab[best].st_value; if (offset) *offset = addr - kallsyms->symtab[best].st_value; - return symname(kallsyms, best); + + return kallsyms_symbol_name(kallsyms, best); } void * __weak dereference_module_function_descriptor(struct module *mod, @@ -3983,7 +3990,8 @@ const char *module_address_lookup(unsigned long addr, if (mod) { if (modname) *modname = mod->name; - ret = get_ksymbol(mod, addr, size, offset); + + ret = find_kallsyms_symbol(mod, addr, size, offset); } /* Make a copy in here where it's safe */ if (ret) { @@ -4006,9 +4014,10 @@ int lookup_module_symbol_name(unsigned long addr, char *symname) if (within_module(addr, mod)) { const char *sym; - sym = get_ksymbol(mod, addr, NULL, NULL); + sym = find_kallsyms_symbol(mod, addr, NULL, NULL); if (!sym) goto out; + strlcpy(symname, sym, KSYM_NAME_LEN); preempt_enable(); return 0; @@ -4031,7 +4040,7 @@ int lookup_module_symbol_attrs(unsigned long addr, unsigned long *size, if (within_module(addr, mod)) { const char *sym; - sym = get_ksymbol(mod, addr, size, offset); + sym = find_kallsyms_symbol(mod, addr, size, offset); if (!sym) goto out; if (modname) @@ -4062,7 +4071,7 @@ int module_get_kallsym(unsigned int symnum, unsigned long *value, char *type, if (symnum < kallsyms->num_symtab) { *value = kallsyms->symtab[symnum].st_value; *type = kallsyms->symtab[symnum].st_info; - strlcpy(name, symname(kallsyms, symnum), KSYM_NAME_LEN); + strlcpy(name, kallsyms_symbol_name(kallsyms, symnum), KSYM_NAME_LEN); strlcpy(module_name, mod->name, MODULE_NAME_LEN); *exported = is_exported(name, *value, mod); preempt_enable(); @@ -4074,13 +4083,14 @@ int module_get_kallsym(unsigned int symnum, unsigned long *value, char *type, return -ERANGE; } -static unsigned long mod_find_symname(struct module *mod, const char *name) +/* Given a module and name of symbol, find and return the symbol's value */ +static unsigned long find_kallsyms_symbol_value(struct module *mod, const char *name) { unsigned int i; struct mod_kallsyms *kallsyms = rcu_dereference_sched(mod->kallsyms); for (i = 0; i < kallsyms->num_symtab; i++) - if (strcmp(name, symname(kallsyms, i)) == 0 && + if (strcmp(name, kallsyms_symbol_name(kallsyms, i)) == 0 && kallsyms->symtab[i].st_shndx != SHN_UNDEF) return kallsyms->symtab[i].st_value; return 0; @@ -4097,12 +4107,12 @@ unsigned long module_kallsyms_lookup_name(const char *name) preempt_disable(); if ((colon = strnchr(name, MODULE_NAME_LEN, ':')) != NULL) { if ((mod = find_module_all(name, colon - name, false)) != NULL) - ret = mod_find_symname(mod, colon+1); + ret = find_kallsyms_symbol_value(mod, colon+1); } else { list_for_each_entry_rcu(mod, &modules, list) { if (mod->state == MODULE_STATE_UNFORMED) continue; - if ((ret = mod_find_symname(mod, name)) != 0) + if ((ret = find_kallsyms_symbol_value(mod, name)) != 0) break; } } @@ -4131,7 +4141,7 @@ int module_kallsyms_on_each_symbol(int (*fn)(void *, const char *, if (kallsyms->symtab[i].st_shndx == SHN_UNDEF) continue; - ret = fn(data, symname(kallsyms, i), + ret = fn(data, kallsyms_symbol_name(kallsyms, i), mod, kallsyms->symtab[i].st_value); if (ret != 0) return ret; -- cgit v1.3-7-g2ca7 From 96c6935212d632cb78db9149c61005b839e03b75 Mon Sep 17 00:00:00 2001 From: Yangtao Li Date: Tue, 6 Nov 2018 09:38:06 -0500 Subject: PM / QoS: Change to use DEFINE_SHOW_ATTRIBUTE macro Use DEFINE_SHOW_ATTRIBUTE macro to simplify the code. Signed-off-by: Yangtao Li Acked-by: Pavel Machek Signed-off-by: Rafael J. Wysocki --- kernel/power/qos.c | 15 ++------------- 1 file changed, 2 insertions(+), 13 deletions(-) (limited to 'kernel') diff --git a/kernel/power/qos.c b/kernel/power/qos.c index 86d72ffb811b..b7a82502857a 100644 --- a/kernel/power/qos.c +++ b/kernel/power/qos.c @@ -184,7 +184,7 @@ static inline void pm_qos_set_value(struct pm_qos_constraints *c, s32 value) c->target_value = value; } -static int pm_qos_dbg_show_requests(struct seq_file *s, void *unused) +static int pm_qos_debug_show(struct seq_file *s, void *unused) { struct pm_qos_object *qos = (struct pm_qos_object *)s->private; struct pm_qos_constraints *c; @@ -245,18 +245,7 @@ out: return 0; } -static int pm_qos_dbg_open(struct inode *inode, struct file *file) -{ - return single_open(file, pm_qos_dbg_show_requests, - inode->i_private); -} - -static const struct file_operations pm_qos_debug_fops = { - .open = pm_qos_dbg_open, - .read = seq_read, - .llseek = seq_lseek, - .release = single_release, -}; +DEFINE_SHOW_ATTRIBUTE(pm_qos_debug); /** * pm_qos_update_target - manages the constraints list and calls the notifiers -- cgit v1.3-7-g2ca7 From c43ac4a5301986c015137bb89568979f9b3264ca Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Tue, 27 Nov 2018 09:36:37 -0500 Subject: tracing: Do not line wrap short line in function_graph_enter() Commit 588ca1786f2dd ("function_graph: Use new curr_ret_depth to manage depth instead of curr_ret_stack") removed a parameter from the call ftrace_push_return_trace() that made it so that the entire call was under 80 characters, but it did not remove the line break. There's no reason to break that line up, so make it a single line. Link: http://lkml.kernel.org/r/20181122100322.GN2131@hirez.programming.kicks-ass.net Reported-by: Peter Zijlstra Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_functions_graph.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace_functions_graph.c b/kernel/trace/trace_functions_graph.c index 086af4f5c3e8..0d235e44d08e 100644 --- a/kernel/trace/trace_functions_graph.c +++ b/kernel/trace/trace_functions_graph.c @@ -188,8 +188,7 @@ int function_graph_enter(unsigned long ret, unsigned long func, trace.func = func; trace.depth = ++current->curr_ret_depth; - if (ftrace_push_return_trace(ret, func, - frame_pointer, retp)) + if (ftrace_push_return_trace(ret, func, frame_pointer, retp)) goto out; /* Only trace if the calling function expects to */ -- cgit v1.3-7-g2ca7 From d864a3ca883095aa12575b84841ebd52b3d808fa Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Mon, 12 Nov 2018 15:21:22 -0500 Subject: fgraph: Create a fgraph.c file to store function graph infrastructure As the function graph infrastructure can be used by thing other than tracing, moving the code to its own file out of the trace_functions_graph.c code makes more sense. The fgraph.c file will only contain the infrastructure required to hook into functions and their return code. Reviewed-by: Joel Fernandes (Google) Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/Makefile | 1 + kernel/trace/fgraph.c | 232 +++++++++++++++++++++++++++++++++++ kernel/trace/trace_functions_graph.c | 220 --------------------------------- 3 files changed, 233 insertions(+), 220 deletions(-) create mode 100644 kernel/trace/fgraph.c (limited to 'kernel') diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile index f81dadbc7c4a..c7ade7965464 100644 --- a/kernel/trace/Makefile +++ b/kernel/trace/Makefile @@ -57,6 +57,7 @@ obj-$(CONFIG_MMIOTRACE) += trace_mmiotrace.o obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += trace_functions_graph.o obj-$(CONFIG_TRACE_BRANCH_PROFILING) += trace_branch.o obj-$(CONFIG_BLK_DEV_IO_TRACE) += blktrace.o +obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += fgraph.o ifeq ($(CONFIG_BLOCK),y) obj-$(CONFIG_EVENT_TRACING) += blktrace.o endif diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c new file mode 100644 index 000000000000..5ad9c0e88b80 --- /dev/null +++ b/kernel/trace/fgraph.c @@ -0,0 +1,232 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Infrastructure to took into function calls and returns. + * Copyright (c) 2008-2009 Frederic Weisbecker + * Mostly borrowed from function tracer which + * is Copyright (c) Steven Rostedt + * + * Highly modified by Steven Rostedt (VMware). + */ +#include + +#include "trace.h" + +static bool kill_ftrace_graph; + +/** + * ftrace_graph_is_dead - returns true if ftrace_graph_stop() was called + * + * ftrace_graph_stop() is called when a severe error is detected in + * the function graph tracing. This function is called by the critical + * paths of function graph to keep those paths from doing any more harm. + */ +bool ftrace_graph_is_dead(void) +{ + return kill_ftrace_graph; +} + +/** + * ftrace_graph_stop - set to permanently disable function graph tracincg + * + * In case of an error int function graph tracing, this is called + * to try to keep function graph tracing from causing any more harm. + * Usually this is pretty severe and this is called to try to at least + * get a warning out to the user. + */ +void ftrace_graph_stop(void) +{ + kill_ftrace_graph = true; +} + +/* Add a function return address to the trace stack on thread info.*/ +static int +ftrace_push_return_trace(unsigned long ret, unsigned long func, + unsigned long frame_pointer, unsigned long *retp) +{ + unsigned long long calltime; + int index; + + if (unlikely(ftrace_graph_is_dead())) + return -EBUSY; + + if (!current->ret_stack) + return -EBUSY; + + /* + * We must make sure the ret_stack is tested before we read + * anything else. + */ + smp_rmb(); + + /* The return trace stack is full */ + if (current->curr_ret_stack == FTRACE_RETFUNC_DEPTH - 1) { + atomic_inc(¤t->trace_overrun); + return -EBUSY; + } + + /* + * The curr_ret_stack is an index to ftrace return stack of + * current task. Its value should be in [0, FTRACE_RETFUNC_ + * DEPTH) when the function graph tracer is used. To support + * filtering out specific functions, it makes the index + * negative by subtracting huge value (FTRACE_NOTRACE_DEPTH) + * so when it sees a negative index the ftrace will ignore + * the record. And the index gets recovered when returning + * from the filtered function by adding the FTRACE_NOTRACE_ + * DEPTH and then it'll continue to record functions normally. + * + * The curr_ret_stack is initialized to -1 and get increased + * in this function. So it can be less than -1 only if it was + * filtered out via ftrace_graph_notrace_addr() which can be + * set from set_graph_notrace file in tracefs by user. + */ + if (current->curr_ret_stack < -1) + return -EBUSY; + + calltime = trace_clock_local(); + + index = ++current->curr_ret_stack; + if (ftrace_graph_notrace_addr(func)) + current->curr_ret_stack -= FTRACE_NOTRACE_DEPTH; + barrier(); + current->ret_stack[index].ret = ret; + current->ret_stack[index].func = func; + current->ret_stack[index].calltime = calltime; +#ifdef HAVE_FUNCTION_GRAPH_FP_TEST + current->ret_stack[index].fp = frame_pointer; +#endif +#ifdef HAVE_FUNCTION_GRAPH_RET_ADDR_PTR + current->ret_stack[index].retp = retp; +#endif + return 0; +} + +int function_graph_enter(unsigned long ret, unsigned long func, + unsigned long frame_pointer, unsigned long *retp) +{ + struct ftrace_graph_ent trace; + + trace.func = func; + trace.depth = ++current->curr_ret_depth; + + if (ftrace_push_return_trace(ret, func, frame_pointer, retp)) + goto out; + + /* Only trace if the calling function expects to */ + if (!ftrace_graph_entry(&trace)) + goto out_ret; + + return 0; + out_ret: + current->curr_ret_stack--; + out: + current->curr_ret_depth--; + return -EBUSY; +} + +/* Retrieve a function return address to the trace stack on thread info.*/ +static void +ftrace_pop_return_trace(struct ftrace_graph_ret *trace, unsigned long *ret, + unsigned long frame_pointer) +{ + int index; + + index = current->curr_ret_stack; + + /* + * A negative index here means that it's just returned from a + * notrace'd function. Recover index to get an original + * return address. See ftrace_push_return_trace(). + * + * TODO: Need to check whether the stack gets corrupted. + */ + if (index < 0) + index += FTRACE_NOTRACE_DEPTH; + + if (unlikely(index < 0 || index >= FTRACE_RETFUNC_DEPTH)) { + ftrace_graph_stop(); + WARN_ON(1); + /* Might as well panic, otherwise we have no where to go */ + *ret = (unsigned long)panic; + return; + } + +#ifdef HAVE_FUNCTION_GRAPH_FP_TEST + /* + * The arch may choose to record the frame pointer used + * and check it here to make sure that it is what we expect it + * to be. If gcc does not set the place holder of the return + * address in the frame pointer, and does a copy instead, then + * the function graph trace will fail. This test detects this + * case. + * + * Currently, x86_32 with optimize for size (-Os) makes the latest + * gcc do the above. + * + * Note, -mfentry does not use frame pointers, and this test + * is not needed if CC_USING_FENTRY is set. + */ + if (unlikely(current->ret_stack[index].fp != frame_pointer)) { + ftrace_graph_stop(); + WARN(1, "Bad frame pointer: expected %lx, received %lx\n" + " from func %ps return to %lx\n", + current->ret_stack[index].fp, + frame_pointer, + (void *)current->ret_stack[index].func, + current->ret_stack[index].ret); + *ret = (unsigned long)panic; + return; + } +#endif + + *ret = current->ret_stack[index].ret; + trace->func = current->ret_stack[index].func; + trace->calltime = current->ret_stack[index].calltime; + trace->overrun = atomic_read(¤t->trace_overrun); + trace->depth = current->curr_ret_depth--; + /* + * We still want to trace interrupts coming in if + * max_depth is set to 1. Make sure the decrement is + * seen before ftrace_graph_return. + */ + barrier(); +} + +/* + * Send the trace to the ring-buffer. + * @return the original return address. + */ +unsigned long ftrace_return_to_handler(unsigned long frame_pointer) +{ + struct ftrace_graph_ret trace; + unsigned long ret; + + ftrace_pop_return_trace(&trace, &ret, frame_pointer); + trace.rettime = trace_clock_local(); + ftrace_graph_return(&trace); + /* + * The ftrace_graph_return() may still access the current + * ret_stack structure, we need to make sure the update of + * curr_ret_stack is after that. + */ + barrier(); + current->curr_ret_stack--; + /* + * The curr_ret_stack can be less than -1 only if it was + * filtered out and it's about to return from the function. + * Recover the index and continue to trace normal functions. + */ + if (current->curr_ret_stack < -1) { + current->curr_ret_stack += FTRACE_NOTRACE_DEPTH; + return ret; + } + + if (unlikely(!ret)) { + ftrace_graph_stop(); + WARN_ON(1); + /* Might as well panic. What else to do? */ + ret = (unsigned long)panic; + } + + return ret; +} diff --git a/kernel/trace/trace_functions_graph.c b/kernel/trace/trace_functions_graph.c index 0d235e44d08e..b846d82c2f95 100644 --- a/kernel/trace/trace_functions_graph.c +++ b/kernel/trace/trace_functions_graph.c @@ -16,33 +16,6 @@ #include "trace.h" #include "trace_output.h" -static bool kill_ftrace_graph; - -/** - * ftrace_graph_is_dead - returns true if ftrace_graph_stop() was called - * - * ftrace_graph_stop() is called when a severe error is detected in - * the function graph tracing. This function is called by the critical - * paths of function graph to keep those paths from doing any more harm. - */ -bool ftrace_graph_is_dead(void) -{ - return kill_ftrace_graph; -} - -/** - * ftrace_graph_stop - set to permanently disable function graph tracincg - * - * In case of an error int function graph tracing, this is called - * to try to keep function graph tracing from causing any more harm. - * Usually this is pretty severe and this is called to try to at least - * get a warning out to the user. - */ -void ftrace_graph_stop(void) -{ - kill_ftrace_graph = true; -} - /* When set, irq functions will be ignored */ static int ftrace_graph_skip_irqs; @@ -117,199 +90,6 @@ static void print_graph_duration(struct trace_array *tr, unsigned long long duration, struct trace_seq *s, u32 flags); -/* Add a function return address to the trace stack on thread info.*/ -static int -ftrace_push_return_trace(unsigned long ret, unsigned long func, - unsigned long frame_pointer, unsigned long *retp) -{ - unsigned long long calltime; - int index; - - if (unlikely(ftrace_graph_is_dead())) - return -EBUSY; - - if (!current->ret_stack) - return -EBUSY; - - /* - * We must make sure the ret_stack is tested before we read - * anything else. - */ - smp_rmb(); - - /* The return trace stack is full */ - if (current->curr_ret_stack == FTRACE_RETFUNC_DEPTH - 1) { - atomic_inc(¤t->trace_overrun); - return -EBUSY; - } - - /* - * The curr_ret_stack is an index to ftrace return stack of - * current task. Its value should be in [0, FTRACE_RETFUNC_ - * DEPTH) when the function graph tracer is used. To support - * filtering out specific functions, it makes the index - * negative by subtracting huge value (FTRACE_NOTRACE_DEPTH) - * so when it sees a negative index the ftrace will ignore - * the record. And the index gets recovered when returning - * from the filtered function by adding the FTRACE_NOTRACE_ - * DEPTH and then it'll continue to record functions normally. - * - * The curr_ret_stack is initialized to -1 and get increased - * in this function. So it can be less than -1 only if it was - * filtered out via ftrace_graph_notrace_addr() which can be - * set from set_graph_notrace file in tracefs by user. - */ - if (current->curr_ret_stack < -1) - return -EBUSY; - - calltime = trace_clock_local(); - - index = ++current->curr_ret_stack; - if (ftrace_graph_notrace_addr(func)) - current->curr_ret_stack -= FTRACE_NOTRACE_DEPTH; - barrier(); - current->ret_stack[index].ret = ret; - current->ret_stack[index].func = func; - current->ret_stack[index].calltime = calltime; -#ifdef HAVE_FUNCTION_GRAPH_FP_TEST - current->ret_stack[index].fp = frame_pointer; -#endif -#ifdef HAVE_FUNCTION_GRAPH_RET_ADDR_PTR - current->ret_stack[index].retp = retp; -#endif - return 0; -} - -int function_graph_enter(unsigned long ret, unsigned long func, - unsigned long frame_pointer, unsigned long *retp) -{ - struct ftrace_graph_ent trace; - - trace.func = func; - trace.depth = ++current->curr_ret_depth; - - if (ftrace_push_return_trace(ret, func, frame_pointer, retp)) - goto out; - - /* Only trace if the calling function expects to */ - if (!ftrace_graph_entry(&trace)) - goto out_ret; - - return 0; - out_ret: - current->curr_ret_stack--; - out: - current->curr_ret_depth--; - return -EBUSY; -} - -/* Retrieve a function return address to the trace stack on thread info.*/ -static void -ftrace_pop_return_trace(struct ftrace_graph_ret *trace, unsigned long *ret, - unsigned long frame_pointer) -{ - int index; - - index = current->curr_ret_stack; - - /* - * A negative index here means that it's just returned from a - * notrace'd function. Recover index to get an original - * return address. See ftrace_push_return_trace(). - * - * TODO: Need to check whether the stack gets corrupted. - */ - if (index < 0) - index += FTRACE_NOTRACE_DEPTH; - - if (unlikely(index < 0 || index >= FTRACE_RETFUNC_DEPTH)) { - ftrace_graph_stop(); - WARN_ON(1); - /* Might as well panic, otherwise we have no where to go */ - *ret = (unsigned long)panic; - return; - } - -#ifdef HAVE_FUNCTION_GRAPH_FP_TEST - /* - * The arch may choose to record the frame pointer used - * and check it here to make sure that it is what we expect it - * to be. If gcc does not set the place holder of the return - * address in the frame pointer, and does a copy instead, then - * the function graph trace will fail. This test detects this - * case. - * - * Currently, x86_32 with optimize for size (-Os) makes the latest - * gcc do the above. - * - * Note, -mfentry does not use frame pointers, and this test - * is not needed if CC_USING_FENTRY is set. - */ - if (unlikely(current->ret_stack[index].fp != frame_pointer)) { - ftrace_graph_stop(); - WARN(1, "Bad frame pointer: expected %lx, received %lx\n" - " from func %ps return to %lx\n", - current->ret_stack[index].fp, - frame_pointer, - (void *)current->ret_stack[index].func, - current->ret_stack[index].ret); - *ret = (unsigned long)panic; - return; - } -#endif - - *ret = current->ret_stack[index].ret; - trace->func = current->ret_stack[index].func; - trace->calltime = current->ret_stack[index].calltime; - trace->overrun = atomic_read(¤t->trace_overrun); - trace->depth = current->curr_ret_depth--; - /* - * We still want to trace interrupts coming in if - * max_depth is set to 1. Make sure the decrement is - * seen before ftrace_graph_return. - */ - barrier(); -} - -/* - * Send the trace to the ring-buffer. - * @return the original return address. - */ -unsigned long ftrace_return_to_handler(unsigned long frame_pointer) -{ - struct ftrace_graph_ret trace; - unsigned long ret; - - ftrace_pop_return_trace(&trace, &ret, frame_pointer); - trace.rettime = trace_clock_local(); - ftrace_graph_return(&trace); - /* - * The ftrace_graph_return() may still access the current - * ret_stack structure, we need to make sure the update of - * curr_ret_stack is after that. - */ - barrier(); - current->curr_ret_stack--; - /* - * The curr_ret_stack can be less than -1 only if it was - * filtered out and it's about to return from the function. - * Recover the index and continue to trace normal functions. - */ - if (current->curr_ret_stack < -1) { - current->curr_ret_stack += FTRACE_NOTRACE_DEPTH; - return ret; - } - - if (unlikely(!ret)) { - ftrace_graph_stop(); - WARN_ON(1); - /* Might as well panic. What else to do? */ - ret = (unsigned long)panic; - } - - return ret; -} - /** * ftrace_graph_ret_addr - convert a potentially modified stack return address * to its original value -- cgit v1.3-7-g2ca7 From 9cd2992f2d6c8df54c5b937d5d1f8a23b684cc1d Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Wed, 14 Nov 2018 13:14:58 -0500 Subject: fgraph: Have set_graph_notrace only affect function_graph tracer In order to make the function graph infrastructure more generic, there can not be code specific for the function_graph tracer in the generic code. This includes the set_graph_notrace logic, that stops all graph calls when a function in the set_graph_notrace is hit. By using the trace_recursion mask, we can use a bit in the current task_struct to implement the notrace code, and move the logic out of fgraph.c and into trace_functions_graph.c and keeps it affecting only the tracer and not all call graph callbacks. Acked-by: Namhyung Kim Reviewed-by: Joel Fernandes (Google) Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/fgraph.c | 21 --------------------- kernel/trace/trace.h | 7 +++++++ kernel/trace/trace_functions_graph.c | 22 ++++++++++++++++++++++ 3 files changed, 29 insertions(+), 21 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c index 5ad9c0e88b80..e852b69c0e64 100644 --- a/kernel/trace/fgraph.c +++ b/kernel/trace/fgraph.c @@ -64,30 +64,9 @@ ftrace_push_return_trace(unsigned long ret, unsigned long func, return -EBUSY; } - /* - * The curr_ret_stack is an index to ftrace return stack of - * current task. Its value should be in [0, FTRACE_RETFUNC_ - * DEPTH) when the function graph tracer is used. To support - * filtering out specific functions, it makes the index - * negative by subtracting huge value (FTRACE_NOTRACE_DEPTH) - * so when it sees a negative index the ftrace will ignore - * the record. And the index gets recovered when returning - * from the filtered function by adding the FTRACE_NOTRACE_ - * DEPTH and then it'll continue to record functions normally. - * - * The curr_ret_stack is initialized to -1 and get increased - * in this function. So it can be less than -1 only if it was - * filtered out via ftrace_graph_notrace_addr() which can be - * set from set_graph_notrace file in tracefs by user. - */ - if (current->curr_ret_stack < -1) - return -EBUSY; - calltime = trace_clock_local(); index = ++current->curr_ret_stack; - if (ftrace_graph_notrace_addr(func)) - current->curr_ret_stack -= FTRACE_NOTRACE_DEPTH; barrier(); current->ret_stack[index].ret = ret; current->ret_stack[index].func = func; diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h index 447bd96ee658..f67060a75f38 100644 --- a/kernel/trace/trace.h +++ b/kernel/trace/trace.h @@ -534,6 +534,13 @@ enum { TRACE_GRAPH_DEPTH_START_BIT, TRACE_GRAPH_DEPTH_END_BIT, + + /* + * To implement set_graph_notrace, if this bit is set, we ignore + * function graph tracing of called functions, until the return + * function is called to clear it. + */ + TRACE_GRAPH_NOTRACE_BIT, }; #define trace_recursion_set(bit) do { (current)->trace_recursion |= (1<<(bit)); } while (0) diff --git a/kernel/trace/trace_functions_graph.c b/kernel/trace/trace_functions_graph.c index b846d82c2f95..ecf543df943b 100644 --- a/kernel/trace/trace_functions_graph.c +++ b/kernel/trace/trace_functions_graph.c @@ -188,6 +188,18 @@ int trace_graph_entry(struct ftrace_graph_ent *trace) int cpu; int pc; + if (trace_recursion_test(TRACE_GRAPH_NOTRACE_BIT)) + return 0; + + if (ftrace_graph_notrace_addr(trace->func)) { + trace_recursion_set(TRACE_GRAPH_NOTRACE_BIT); + /* + * Need to return 1 to have the return called + * that will clear the NOTRACE bit. + */ + return 1; + } + if (!ftrace_trace_task(tr)) return 0; @@ -290,6 +302,11 @@ void trace_graph_return(struct ftrace_graph_ret *trace) ftrace_graph_addr_finish(trace); + if (trace_recursion_test(TRACE_GRAPH_NOTRACE_BIT)) { + trace_recursion_clear(TRACE_GRAPH_NOTRACE_BIT); + return; + } + local_irq_save(flags); cpu = raw_smp_processor_id(); data = per_cpu_ptr(tr->trace_buffer.data, cpu); @@ -315,6 +332,11 @@ static void trace_graph_thresh_return(struct ftrace_graph_ret *trace) { ftrace_graph_addr_finish(trace); + if (trace_recursion_test(TRACE_GRAPH_NOTRACE_BIT)) { + trace_recursion_clear(TRACE_GRAPH_NOTRACE_BIT); + return; + } + if (tracing_thresh && (trace->rettime - trace->calltime < tracing_thresh)) return; -- cgit v1.3-7-g2ca7 From e9ee9efc0d176512cdce9d27ff8549d7ffa2bfcd Mon Sep 17 00:00:00 2001 From: David Miller Date: Fri, 30 Nov 2018 21:08:14 -0800 Subject: bpf: Add BPF_F_ANY_ALIGNMENT. Often we want to write tests cases that check things like bad context offset accesses. And one way to do this is to use an odd offset on, for example, a 32-bit load. This unfortunately triggers the alignment checks first on platforms that do not set CONFIG_EFFICIENT_UNALIGNED_ACCESS. So the test case see the alignment failure rather than what it was testing for. It is often not completely possible to respect the original intention of the test, or even test the same exact thing, while solving the alignment issue. Another option could have been to check the alignment after the context and other validations are performed by the verifier, but that is a non-trivial change to the verifier. Signed-off-by: David S. Miller Signed-off-by: Alexei Starovoitov --- include/uapi/linux/bpf.h | 14 ++++++++++++++ kernel/bpf/syscall.c | 7 ++++++- kernel/bpf/verifier.c | 2 ++ tools/include/uapi/linux/bpf.h | 14 ++++++++++++++ tools/lib/bpf/bpf.c | 8 ++++---- tools/lib/bpf/bpf.h | 2 +- tools/testing/selftests/bpf/test_align.c | 4 ++-- tools/testing/selftests/bpf/test_verifier.c | 3 ++- 8 files changed, 45 insertions(+), 9 deletions(-) (limited to 'kernel') diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 597afdbc1ab9..8050caea7495 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -232,6 +232,20 @@ enum bpf_attach_type { */ #define BPF_F_STRICT_ALIGNMENT (1U << 0) +/* If BPF_F_ANY_ALIGNMENT is used in BPF_PROF_LOAD command, the + * verifier will allow any alignment whatsoever. On platforms + * with strict alignment requirements for loads ands stores (such + * as sparc and mips) the verifier validates that all loads and + * stores provably follow this requirement. This flag turns that + * checking and enforcement off. + * + * It is mostly used for testing when we want to validate the + * context and memory access aspects of the verifier, but because + * of an unaligned access the alignment check would trigger before + * the one we are interested in. + */ +#define BPF_F_ANY_ALIGNMENT (1U << 1) + /* when bpf_ldimm64->src_reg == BPF_PSEUDO_MAP_FD, bpf_ldimm64->imm == fd */ #define BPF_PSEUDO_MAP_FD 1 diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 85cbeec06e50..f9554d9a14e1 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -1452,9 +1452,14 @@ static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr) if (CHECK_ATTR(BPF_PROG_LOAD)) return -EINVAL; - if (attr->prog_flags & ~BPF_F_STRICT_ALIGNMENT) + if (attr->prog_flags & ~(BPF_F_STRICT_ALIGNMENT | BPF_F_ANY_ALIGNMENT)) return -EINVAL; + if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) && + (attr->prog_flags & BPF_F_ANY_ALIGNMENT) && + !capable(CAP_SYS_ADMIN)) + return -EPERM; + /* copy eBPF program license from user space */ if (strncpy_from_user(license, u64_to_user_ptr(attr->license), sizeof(license) - 1) < 0) diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 9584438fa2cc..71988337ac14 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -6505,6 +6505,8 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr, env->strict_alignment = !!(attr->prog_flags & BPF_F_STRICT_ALIGNMENT); if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)) env->strict_alignment = true; + if (attr->prog_flags & BPF_F_ANY_ALIGNMENT) + env->strict_alignment = false; ret = replace_map_fd_with_map_ptr(env); if (ret < 0) diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 597afdbc1ab9..8050caea7495 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -232,6 +232,20 @@ enum bpf_attach_type { */ #define BPF_F_STRICT_ALIGNMENT (1U << 0) +/* If BPF_F_ANY_ALIGNMENT is used in BPF_PROF_LOAD command, the + * verifier will allow any alignment whatsoever. On platforms + * with strict alignment requirements for loads ands stores (such + * as sparc and mips) the verifier validates that all loads and + * stores provably follow this requirement. This flag turns that + * checking and enforcement off. + * + * It is mostly used for testing when we want to validate the + * context and memory access aspects of the verifier, but because + * of an unaligned access the alignment check would trigger before + * the one we are interested in. + */ +#define BPF_F_ANY_ALIGNMENT (1U << 1) + /* when bpf_ldimm64->src_reg == BPF_PSEUDO_MAP_FD, bpf_ldimm64->imm == fd */ #define BPF_PSEUDO_MAP_FD 1 diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c index ce1822194590..c19226cccf39 100644 --- a/tools/lib/bpf/bpf.c +++ b/tools/lib/bpf/bpf.c @@ -279,9 +279,9 @@ int bpf_load_program(enum bpf_prog_type type, const struct bpf_insn *insns, } int bpf_verify_program(enum bpf_prog_type type, const struct bpf_insn *insns, - size_t insns_cnt, int strict_alignment, - const char *license, __u32 kern_version, - char *log_buf, size_t log_buf_sz, int log_level) + size_t insns_cnt, __u32 prog_flags, const char *license, + __u32 kern_version, char *log_buf, size_t log_buf_sz, + int log_level) { union bpf_attr attr; @@ -295,7 +295,7 @@ int bpf_verify_program(enum bpf_prog_type type, const struct bpf_insn *insns, attr.log_level = log_level; log_buf[0] = 0; attr.kern_version = kern_version; - attr.prog_flags = strict_alignment ? BPF_F_STRICT_ALIGNMENT : 0; + attr.prog_flags = prog_flags; return sys_bpf(BPF_PROG_LOAD, &attr, sizeof(attr)); } diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h index 09e8bbe111d4..60392b70587c 100644 --- a/tools/lib/bpf/bpf.h +++ b/tools/lib/bpf/bpf.h @@ -98,7 +98,7 @@ LIBBPF_API int bpf_load_program(enum bpf_prog_type type, char *log_buf, size_t log_buf_sz); LIBBPF_API int bpf_verify_program(enum bpf_prog_type type, const struct bpf_insn *insns, - size_t insns_cnt, int strict_alignment, + size_t insns_cnt, __u32 prog_flags, const char *license, __u32 kern_version, char *log_buf, size_t log_buf_sz, int log_level); diff --git a/tools/testing/selftests/bpf/test_align.c b/tools/testing/selftests/bpf/test_align.c index 5f377ec53f2f..3c789d03b629 100644 --- a/tools/testing/selftests/bpf/test_align.c +++ b/tools/testing/selftests/bpf/test_align.c @@ -620,8 +620,8 @@ static int do_test_single(struct bpf_align_test *test) prog_len = probe_filter_length(prog); fd_prog = bpf_verify_program(prog_type ? : BPF_PROG_TYPE_SOCKET_FILTER, - prog, prog_len, 1, "GPL", 0, - bpf_vlog, sizeof(bpf_vlog), 2); + prog, prog_len, BPF_F_STRICT_ALIGNMENT, + "GPL", 0, bpf_vlog, sizeof(bpf_vlog), 2); if (fd_prog < 0 && test->result != REJECT) { printf("Failed to load program.\n"); printf("%s", bpf_vlog); diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c index 5eace1f606fb..78e779c35869 100644 --- a/tools/testing/selftests/bpf/test_verifier.c +++ b/tools/testing/selftests/bpf/test_verifier.c @@ -14275,7 +14275,8 @@ static void do_test_single(struct bpf_test *test, bool unpriv, prog_len = probe_filter_length(prog); fd_prog = bpf_verify_program(prog_type, prog, prog_len, - test->flags & F_LOAD_WITH_STRICT_ALIGNMENT, + test->flags & F_LOAD_WITH_STRICT_ALIGNMENT ? + BPF_F_STRICT_ALIGNMENT : 0, "GPL", 0, bpf_vlog, sizeof(bpf_vlog), 1); expected_ret = unpriv && test->result_unpriv != UNDEF ? -- cgit v1.3-7-g2ca7 From b18814e767a445534ab9ccba02e82a31208f85d6 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Sun, 4 Nov 2018 17:27:56 +0100 Subject: dma-direct: provide page based alloc/free helpers Some architectures support remapping highmem into DMA coherent allocations. To use the common code for them we need variants of dma_direct_{alloc,free}_pages that do not use kernel virtual addresses. Signed-off-by: Christoph Hellwig Reviewed-by: Robin Murphy --- include/linux/dma-direct.h | 3 +++ kernel/dma/direct.c | 32 ++++++++++++++++++++++---------- 2 files changed, 25 insertions(+), 10 deletions(-) (limited to 'kernel') diff --git a/include/linux/dma-direct.h b/include/linux/dma-direct.h index 9e66bfe369aa..61b78f934f64 100644 --- a/include/linux/dma-direct.h +++ b/include/linux/dma-direct.h @@ -67,6 +67,9 @@ void *dma_direct_alloc_pages(struct device *dev, size_t size, dma_addr_t *dma_handle, gfp_t gfp, unsigned long attrs); void dma_direct_free_pages(struct device *dev, size_t size, void *cpu_addr, dma_addr_t dma_addr, unsigned long attrs); +struct page *__dma_direct_alloc_pages(struct device *dev, size_t size, + dma_addr_t *dma_handle, gfp_t gfp, unsigned long attrs); +void __dma_direct_free_pages(struct device *dev, size_t size, struct page *page); dma_addr_t dma_direct_map_page(struct device *dev, struct page *page, unsigned long offset, size_t size, enum dma_data_direction dir, unsigned long attrs); diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c index 22a12ab5a5e9..680287779b0a 100644 --- a/kernel/dma/direct.c +++ b/kernel/dma/direct.c @@ -103,14 +103,13 @@ static bool dma_coherent_ok(struct device *dev, phys_addr_t phys, size_t size) min_not_zero(dev->coherent_dma_mask, dev->bus_dma_mask); } -void *dma_direct_alloc_pages(struct device *dev, size_t size, +struct page *__dma_direct_alloc_pages(struct device *dev, size_t size, dma_addr_t *dma_handle, gfp_t gfp, unsigned long attrs) { unsigned int count = PAGE_ALIGN(size) >> PAGE_SHIFT; int page_order = get_order(size); struct page *page = NULL; u64 phys_mask; - void *ret; if (attrs & DMA_ATTR_NO_WARN) gfp |= __GFP_NOWARN; @@ -150,11 +149,22 @@ again: } } + return page; +} + +void *dma_direct_alloc_pages(struct device *dev, size_t size, + dma_addr_t *dma_handle, gfp_t gfp, unsigned long attrs) +{ + struct page *page; + void *ret; + + page = __dma_direct_alloc_pages(dev, size, dma_handle, gfp, attrs); if (!page) return NULL; + ret = page_address(page); if (force_dma_unencrypted()) { - set_memory_decrypted((unsigned long)ret, 1 << page_order); + set_memory_decrypted((unsigned long)ret, 1 << get_order(size)); *dma_handle = __phys_to_dma(dev, page_to_phys(page)); } else { *dma_handle = phys_to_dma(dev, page_to_phys(page)); @@ -163,20 +173,22 @@ again: return ret; } -/* - * NOTE: this function must never look at the dma_addr argument, because we want - * to be able to use it as a helper for iommu implementations as well. - */ +void __dma_direct_free_pages(struct device *dev, size_t size, struct page *page) +{ + unsigned int count = PAGE_ALIGN(size) >> PAGE_SHIFT; + + if (!dma_release_from_contiguous(dev, page, count)) + __free_pages(page, get_order(size)); +} + void dma_direct_free_pages(struct device *dev, size_t size, void *cpu_addr, dma_addr_t dma_addr, unsigned long attrs) { - unsigned int count = PAGE_ALIGN(size) >> PAGE_SHIFT; unsigned int page_order = get_order(size); if (force_dma_unencrypted()) set_memory_encrypted((unsigned long)cpu_addr, 1 << page_order); - if (!dma_release_from_contiguous(dev, virt_to_page(cpu_addr), count)) - free_pages((unsigned long)cpu_addr, page_order); + __dma_direct_free_pages(dev, size, virt_to_page(cpu_addr)); } void *dma_direct_alloc(struct device *dev, size_t size, -- cgit v1.3-7-g2ca7 From 704f2c20eaa566f6906e8812b6e2115889bd753d Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Sat, 22 Sep 2018 20:47:26 +0200 Subject: dma-direct: reject highmem pages from dma_alloc_from_contiguous dma_alloc_from_contiguous can return highmem pages depending on the setup, which a plain non-remapping DMA allocator can't handle. Detect this case and fail the allocation. Signed-off-by: Christoph Hellwig Reviewed-by: Robin Murphy --- kernel/dma/direct.c | 12 ++++++++++++ 1 file changed, 12 insertions(+) (limited to 'kernel') diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c index 680287779b0a..c49849bcced6 100644 --- a/kernel/dma/direct.c +++ b/kernel/dma/direct.c @@ -162,6 +162,18 @@ void *dma_direct_alloc_pages(struct device *dev, size_t size, if (!page) return NULL; + if (PageHighMem(page)) { + /* + * Depending on the cma= arguments and per-arch setup + * dma_alloc_from_contiguous could return highmem pages. + * Without remapping there is no way to return them here, + * so log an error and fail. + */ + dev_info(dev, "Rejecting highmem page from CMA.\n"); + __dma_direct_free_pages(dev, size, page); + return NULL; + } + ret = page_address(page); if (force_dma_unencrypted()) { set_memory_decrypted((unsigned long)ret, 1 << get_order(size)); -- cgit v1.3-7-g2ca7 From f0edfea8ef93ed6cc5f747c46c85c8e53e0798a0 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Fri, 24 Aug 2018 10:31:08 +0200 Subject: dma-mapping: move the remap helpers to a separate file The dma remap code only makes sense for not cache coherent architectures (or possibly the corner case of highmem CMA allocations) and currently is only used by arm, arm64, csky and xtensa. Split it out into a separate file with a separate Kconfig symbol, which gets the right copyright notice given that this code was written by Laura Abbott working for Code Aurora at that point. Signed-off-by: Christoph Hellwig Acked-by: Laura Abbott Reviewed-by: Robin Murphy --- arch/arm/Kconfig | 1 + arch/arm64/Kconfig | 1 + arch/csky/Kconfig | 1 + arch/xtensa/Kconfig | 1 + kernel/dma/Kconfig | 4 +++ kernel/dma/Makefile | 2 +- kernel/dma/mapping.c | 84 ------------------------------------------------- kernel/dma/remap.c | 88 ++++++++++++++++++++++++++++++++++++++++++++++++++++ 8 files changed, 97 insertions(+), 85 deletions(-) create mode 100644 kernel/dma/remap.c (limited to 'kernel') diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index 91be74d8df65..3b2852df6eb3 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -30,6 +30,7 @@ config ARM select CPU_PM if (SUSPEND || CPU_IDLE) select DCACHE_WORD_ACCESS if HAVE_EFFICIENT_UNALIGNED_ACCESS select DMA_DIRECT_OPS if !MMU + select DMA_REMAP if MMU select EDAC_SUPPORT select EDAC_ATOMIC_SCRUB select GENERIC_ALLOCATOR diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 787d7850e064..5d065acb6d10 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -82,6 +82,7 @@ config ARM64 select CRC32 select DCACHE_WORD_ACCESS select DMA_DIRECT_OPS + select DMA_REMAP select EDAC_SUPPORT select FRAME_POINTER select GENERIC_ALLOCATOR diff --git a/arch/csky/Kconfig b/arch/csky/Kconfig index cb64f8dacd08..8a30e006a845 100644 --- a/arch/csky/Kconfig +++ b/arch/csky/Kconfig @@ -9,6 +9,7 @@ config CSKY select CLKSRC_OF select DMA_DIRECT_OPS select DMA_NONCOHERENT_OPS + select DMA_REMAP select IRQ_DOMAIN select HANDLE_DOMAIN_IRQ select DW_APB_TIMER_OF diff --git a/arch/xtensa/Kconfig b/arch/xtensa/Kconfig index d29b7365da8d..239bfb16c58b 100644 --- a/arch/xtensa/Kconfig +++ b/arch/xtensa/Kconfig @@ -11,6 +11,7 @@ config XTENSA select CLONE_BACKWARDS select COMMON_CLK select DMA_DIRECT_OPS + select DMA_REMAP if MMU select GENERIC_ATOMIC64 select GENERIC_CLOCKEVENTS select GENERIC_IRQ_SHOW diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig index 645c7a2ecde8..c92e08173ed8 100644 --- a/kernel/dma/Kconfig +++ b/kernel/dma/Kconfig @@ -51,3 +51,7 @@ config SWIOTLB bool select DMA_DIRECT_OPS select NEED_DMA_MAP_STATE + +config DMA_REMAP + depends on MMU + bool diff --git a/kernel/dma/Makefile b/kernel/dma/Makefile index 7d581e4eea4a..f4feeceb8020 100644 --- a/kernel/dma/Makefile +++ b/kernel/dma/Makefile @@ -7,4 +7,4 @@ obj-$(CONFIG_DMA_DIRECT_OPS) += direct.o obj-$(CONFIG_DMA_VIRT_OPS) += virt.o obj-$(CONFIG_DMA_API_DEBUG) += debug.o obj-$(CONFIG_SWIOTLB) += swiotlb.o - +obj-$(CONFIG_DMA_REMAP) += remap.o diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c index 58dec7a92b7b..dfbc3deb95cd 100644 --- a/kernel/dma/mapping.c +++ b/kernel/dma/mapping.c @@ -262,87 +262,3 @@ int dma_common_mmap(struct device *dev, struct vm_area_struct *vma, #endif /* !CONFIG_ARCH_NO_COHERENT_DMA_MMAP */ } EXPORT_SYMBOL(dma_common_mmap); - -#ifdef CONFIG_MMU -static struct vm_struct *__dma_common_pages_remap(struct page **pages, - size_t size, unsigned long vm_flags, pgprot_t prot, - const void *caller) -{ - struct vm_struct *area; - - area = get_vm_area_caller(size, vm_flags, caller); - if (!area) - return NULL; - - if (map_vm_area(area, prot, pages)) { - vunmap(area->addr); - return NULL; - } - - return area; -} - -/* - * remaps an array of PAGE_SIZE pages into another vm_area - * Cannot be used in non-sleeping contexts - */ -void *dma_common_pages_remap(struct page **pages, size_t size, - unsigned long vm_flags, pgprot_t prot, - const void *caller) -{ - struct vm_struct *area; - - area = __dma_common_pages_remap(pages, size, vm_flags, prot, caller); - if (!area) - return NULL; - - area->pages = pages; - - return area->addr; -} - -/* - * remaps an allocated contiguous region into another vm_area. - * Cannot be used in non-sleeping contexts - */ - -void *dma_common_contiguous_remap(struct page *page, size_t size, - unsigned long vm_flags, - pgprot_t prot, const void *caller) -{ - int i; - struct page **pages; - struct vm_struct *area; - - pages = kmalloc(sizeof(struct page *) << get_order(size), GFP_KERNEL); - if (!pages) - return NULL; - - for (i = 0; i < (size >> PAGE_SHIFT); i++) - pages[i] = nth_page(page, i); - - area = __dma_common_pages_remap(pages, size, vm_flags, prot, caller); - - kfree(pages); - - if (!area) - return NULL; - return area->addr; -} - -/* - * unmaps a range previously mapped by dma_common_*_remap - */ -void dma_common_free_remap(void *cpu_addr, size_t size, unsigned long vm_flags) -{ - struct vm_struct *area = find_vm_area(cpu_addr); - - if (!area || (area->flags & vm_flags) != vm_flags) { - WARN(1, "trying to free invalid coherent area: %p\n", cpu_addr); - return; - } - - unmap_kernel_range((unsigned long)cpu_addr, PAGE_ALIGN(size)); - vunmap(cpu_addr); -} -#endif diff --git a/kernel/dma/remap.c b/kernel/dma/remap.c new file mode 100644 index 000000000000..a15c393ea4e5 --- /dev/null +++ b/kernel/dma/remap.c @@ -0,0 +1,88 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (c) 2014 The Linux Foundation + */ +#include +#include +#include + +static struct vm_struct *__dma_common_pages_remap(struct page **pages, + size_t size, unsigned long vm_flags, pgprot_t prot, + const void *caller) +{ + struct vm_struct *area; + + area = get_vm_area_caller(size, vm_flags, caller); + if (!area) + return NULL; + + if (map_vm_area(area, prot, pages)) { + vunmap(area->addr); + return NULL; + } + + return area; +} + +/* + * Remaps an array of PAGE_SIZE pages into another vm_area. + * Cannot be used in non-sleeping contexts + */ +void *dma_common_pages_remap(struct page **pages, size_t size, + unsigned long vm_flags, pgprot_t prot, + const void *caller) +{ + struct vm_struct *area; + + area = __dma_common_pages_remap(pages, size, vm_flags, prot, caller); + if (!area) + return NULL; + + area->pages = pages; + + return area->addr; +} + +/* + * Remaps an allocated contiguous region into another vm_area. + * Cannot be used in non-sleeping contexts + */ +void *dma_common_contiguous_remap(struct page *page, size_t size, + unsigned long vm_flags, + pgprot_t prot, const void *caller) +{ + int i; + struct page **pages; + struct vm_struct *area; + + pages = kmalloc(sizeof(struct page *) << get_order(size), GFP_KERNEL); + if (!pages) + return NULL; + + for (i = 0; i < (size >> PAGE_SHIFT); i++) + pages[i] = nth_page(page, i); + + area = __dma_common_pages_remap(pages, size, vm_flags, prot, caller); + + kfree(pages); + + if (!area) + return NULL; + return area->addr; +} + +/* + * Unmaps a range previously mapped by dma_common_*_remap + */ +void dma_common_free_remap(void *cpu_addr, size_t size, unsigned long vm_flags) +{ + struct vm_struct *area = find_vm_area(cpu_addr); + + if (!area || (area->flags & vm_flags) != vm_flags) { + WARN(1, "trying to free invalid coherent area: %p\n", cpu_addr); + return; + } + + unmap_kernel_range((unsigned long)cpu_addr, PAGE_ALIGN(size)); + vunmap(cpu_addr); +} -- cgit v1.3-7-g2ca7 From 0c3b3171ceccb8830c2bb5adff1b4e9b204c1450 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Sun, 4 Nov 2018 20:29:28 +0100 Subject: dma-mapping: move the arm64 noncoherent alloc/free support to common code The arm64 codebase to implement coherent dma allocation for architectures with non-coherent DMA is a good start for a generic implementation, given that is uses the generic remap helpers, provides the atomic pool for allocations that can't sleep and still is realtively simple and well tested. Move it to kernel/dma and allow architectures to opt into it using a config symbol. Architectures just need to provide a new arch_dma_prep_coherent helper to writeback an invalidate the caches for any memory that gets remapped for uncached access. Signed-off-by: Christoph Hellwig Reviewed-by: Will Deacon Reviewed-by: Robin Murphy --- arch/arm64/Kconfig | 2 +- arch/arm64/mm/dma-mapping.c | 184 +++------------------------------------- include/linux/dma-mapping.h | 5 ++ include/linux/dma-noncoherent.h | 2 + kernel/dma/Kconfig | 5 ++ kernel/dma/remap.c | 158 +++++++++++++++++++++++++++++++++- 6 files changed, 180 insertions(+), 176 deletions(-) (limited to 'kernel') diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 5d065acb6d10..2e645ea693ea 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -82,7 +82,7 @@ config ARM64 select CRC32 select DCACHE_WORD_ACCESS select DMA_DIRECT_OPS - select DMA_REMAP + select DMA_DIRECT_REMAP select EDAC_SUPPORT select FRAME_POINTER select GENERIC_ALLOCATOR diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c index a3ac26284845..e2e7e5d0f94e 100644 --- a/arch/arm64/mm/dma-mapping.c +++ b/arch/arm64/mm/dma-mapping.c @@ -33,113 +33,6 @@ #include -static struct gen_pool *atomic_pool __ro_after_init; - -#define DEFAULT_DMA_COHERENT_POOL_SIZE SZ_256K -static size_t atomic_pool_size __initdata = DEFAULT_DMA_COHERENT_POOL_SIZE; - -static int __init early_coherent_pool(char *p) -{ - atomic_pool_size = memparse(p, &p); - return 0; -} -early_param("coherent_pool", early_coherent_pool); - -static void *__alloc_from_pool(size_t size, struct page **ret_page, gfp_t flags) -{ - unsigned long val; - void *ptr = NULL; - - if (!atomic_pool) { - WARN(1, "coherent pool not initialised!\n"); - return NULL; - } - - val = gen_pool_alloc(atomic_pool, size); - if (val) { - phys_addr_t phys = gen_pool_virt_to_phys(atomic_pool, val); - - *ret_page = phys_to_page(phys); - ptr = (void *)val; - memset(ptr, 0, size); - } - - return ptr; -} - -static bool __in_atomic_pool(void *start, size_t size) -{ - return addr_in_gen_pool(atomic_pool, (unsigned long)start, size); -} - -static int __free_from_pool(void *start, size_t size) -{ - if (!__in_atomic_pool(start, size)) - return 0; - - gen_pool_free(atomic_pool, (unsigned long)start, size); - - return 1; -} - -void *arch_dma_alloc(struct device *dev, size_t size, dma_addr_t *dma_handle, - gfp_t flags, unsigned long attrs) -{ - struct page *page; - void *ptr, *coherent_ptr; - pgprot_t prot = pgprot_writecombine(PAGE_KERNEL); - - size = PAGE_ALIGN(size); - - if (!gfpflags_allow_blocking(flags)) { - struct page *page = NULL; - void *addr = __alloc_from_pool(size, &page, flags); - - if (addr) - *dma_handle = phys_to_dma(dev, page_to_phys(page)); - - return addr; - } - - ptr = dma_direct_alloc_pages(dev, size, dma_handle, flags, attrs); - if (!ptr) - goto no_mem; - - /* remove any dirty cache lines on the kernel alias */ - __dma_flush_area(ptr, size); - - /* create a coherent mapping */ - page = virt_to_page(ptr); - coherent_ptr = dma_common_contiguous_remap(page, size, VM_USERMAP, - prot, __builtin_return_address(0)); - if (!coherent_ptr) - goto no_map; - - return coherent_ptr; - -no_map: - dma_direct_free_pages(dev, size, ptr, *dma_handle, attrs); -no_mem: - return NULL; -} - -void arch_dma_free(struct device *dev, size_t size, void *vaddr, - dma_addr_t dma_handle, unsigned long attrs) -{ - if (!__free_from_pool(vaddr, PAGE_ALIGN(size))) { - void *kaddr = phys_to_virt(dma_to_phys(dev, dma_handle)); - - vunmap(vaddr); - dma_direct_free_pages(dev, size, kaddr, dma_handle, attrs); - } -} - -long arch_dma_coherent_to_pfn(struct device *dev, void *cpu_addr, - dma_addr_t dma_addr) -{ - return __phys_to_pfn(dma_to_phys(dev, dma_addr)); -} - pgprot_t arch_dma_mmap_pgprot(struct device *dev, pgprot_t prot, unsigned long attrs) { @@ -160,6 +53,11 @@ void arch_sync_dma_for_cpu(struct device *dev, phys_addr_t paddr, __dma_unmap_area(phys_to_virt(paddr), size, dir); } +void arch_dma_prep_coherent(struct page *page, size_t size) +{ + __dma_flush_area(page_address(page), size); +} + #ifdef CONFIG_IOMMU_DMA static int __swiotlb_get_sgtable_page(struct sg_table *sgt, struct page *page, size_t size) @@ -191,67 +89,6 @@ static int __swiotlb_mmap_pfn(struct vm_area_struct *vma, } #endif /* CONFIG_IOMMU_DMA */ -static int __init atomic_pool_init(void) -{ - pgprot_t prot = __pgprot(PROT_NORMAL_NC); - unsigned long nr_pages = atomic_pool_size >> PAGE_SHIFT; - struct page *page; - void *addr; - unsigned int pool_size_order = get_order(atomic_pool_size); - - if (dev_get_cma_area(NULL)) - page = dma_alloc_from_contiguous(NULL, nr_pages, - pool_size_order, false); - else - page = alloc_pages(GFP_DMA32, pool_size_order); - - if (page) { - int ret; - void *page_addr = page_address(page); - - memset(page_addr, 0, atomic_pool_size); - __dma_flush_area(page_addr, atomic_pool_size); - - atomic_pool = gen_pool_create(PAGE_SHIFT, -1); - if (!atomic_pool) - goto free_page; - - addr = dma_common_contiguous_remap(page, atomic_pool_size, - VM_USERMAP, prot, atomic_pool_init); - - if (!addr) - goto destroy_genpool; - - ret = gen_pool_add_virt(atomic_pool, (unsigned long)addr, - page_to_phys(page), - atomic_pool_size, -1); - if (ret) - goto remove_mapping; - - gen_pool_set_algo(atomic_pool, - gen_pool_first_fit_order_align, - NULL); - - pr_info("DMA: preallocated %zu KiB pool for atomic allocations\n", - atomic_pool_size / 1024); - return 0; - } - goto out; - -remove_mapping: - dma_common_free_remap(addr, atomic_pool_size, VM_USERMAP); -destroy_genpool: - gen_pool_destroy(atomic_pool); - atomic_pool = NULL; -free_page: - if (!dma_release_from_contiguous(NULL, page, nr_pages)) - __free_pages(page, pool_size_order); -out: - pr_err("DMA: failed to allocate %zu KiB pool for atomic coherent allocation\n", - atomic_pool_size / 1024); - return -ENOMEM; -} - /******************************************** * The following APIs are for dummy DMA ops * ********************************************/ @@ -350,8 +187,7 @@ static int __init arm64_dma_init(void) TAINT_CPU_OUT_OF_SPEC, "ARCH_DMA_MINALIGN smaller than CTR_EL0.CWG (%d < %d)", ARCH_DMA_MINALIGN, cache_line_size()); - - return atomic_pool_init(); + return dma_atomic_pool_init(GFP_DMA32, __pgprot(PROT_NORMAL_NC)); } arch_initcall(arm64_dma_init); @@ -397,7 +233,7 @@ static void *__iommu_alloc_attrs(struct device *dev, size_t size, page = alloc_pages(gfp, get_order(size)); addr = page ? page_address(page) : NULL; } else { - addr = __alloc_from_pool(size, &page, gfp); + addr = dma_alloc_from_pool(size, &page, gfp); } if (!addr) return NULL; @@ -407,7 +243,7 @@ static void *__iommu_alloc_attrs(struct device *dev, size_t size, if (coherent) __free_pages(page, get_order(size)); else - __free_from_pool(addr, size); + dma_free_from_pool(addr, size); addr = NULL; } } else if (attrs & DMA_ATTR_FORCE_CONTIGUOUS) { @@ -471,9 +307,9 @@ static void __iommu_free_attrs(struct device *dev, size_t size, void *cpu_addr, * coherent devices. * Hence how dodgy the below logic looks... */ - if (__in_atomic_pool(cpu_addr, size)) { + if (dma_in_atomic_pool(cpu_addr, size)) { iommu_dma_unmap_page(dev, handle, iosize, 0, 0); - __free_from_pool(cpu_addr, size); + dma_free_from_pool(cpu_addr, size); } else if (attrs & DMA_ATTR_FORCE_CONTIGUOUS) { struct page *page = vmalloc_to_page(cpu_addr); diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h index 0f81c713f6e9..1a0edcde7d14 100644 --- a/include/linux/dma-mapping.h +++ b/include/linux/dma-mapping.h @@ -455,6 +455,11 @@ void *dma_common_pages_remap(struct page **pages, size_t size, const void *caller); void dma_common_free_remap(void *cpu_addr, size_t size, unsigned long vm_flags); +int __init dma_atomic_pool_init(gfp_t gfp, pgprot_t prot); +bool dma_in_atomic_pool(void *start, size_t size); +void *dma_alloc_from_pool(size_t size, struct page **ret_page, gfp_t flags); +bool dma_free_from_pool(void *start, size_t size); + /** * dma_mmap_attrs - map a coherent DMA allocation into user space * @dev: valid struct device pointer, or NULL for ISA and EISA-like devices diff --git a/include/linux/dma-noncoherent.h b/include/linux/dma-noncoherent.h index 9051b055beec..306557331d7d 100644 --- a/include/linux/dma-noncoherent.h +++ b/include/linux/dma-noncoherent.h @@ -69,4 +69,6 @@ static inline void arch_sync_dma_for_cpu_all(struct device *dev) } #endif /* CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL */ +void arch_dma_prep_coherent(struct page *page, size_t size); + #endif /* _LINUX_DMA_NONCOHERENT_H */ diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig index c92e08173ed8..41c3b1df70eb 100644 --- a/kernel/dma/Kconfig +++ b/kernel/dma/Kconfig @@ -55,3 +55,8 @@ config SWIOTLB config DMA_REMAP depends on MMU bool + +config DMA_DIRECT_REMAP + bool + depends on DMA_DIRECT_OPS + select DMA_REMAP diff --git a/kernel/dma/remap.c b/kernel/dma/remap.c index a15c393ea4e5..b32bb08f96ae 100644 --- a/kernel/dma/remap.c +++ b/kernel/dma/remap.c @@ -1,8 +1,13 @@ // SPDX-License-Identifier: GPL-2.0 /* + * Copyright (C) 2012 ARM Ltd. * Copyright (c) 2014 The Linux Foundation */ -#include +#include +#include +#include +#include +#include #include #include @@ -86,3 +91,154 @@ void dma_common_free_remap(void *cpu_addr, size_t size, unsigned long vm_flags) unmap_kernel_range((unsigned long)cpu_addr, PAGE_ALIGN(size)); vunmap(cpu_addr); } + +#ifdef CONFIG_DMA_DIRECT_REMAP +static struct gen_pool *atomic_pool __ro_after_init; + +#define DEFAULT_DMA_COHERENT_POOL_SIZE SZ_256K +static size_t atomic_pool_size __initdata = DEFAULT_DMA_COHERENT_POOL_SIZE; + +static int __init early_coherent_pool(char *p) +{ + atomic_pool_size = memparse(p, &p); + return 0; +} +early_param("coherent_pool", early_coherent_pool); + +int __init dma_atomic_pool_init(gfp_t gfp, pgprot_t prot) +{ + unsigned int pool_size_order = get_order(atomic_pool_size); + unsigned long nr_pages = atomic_pool_size >> PAGE_SHIFT; + struct page *page; + void *addr; + int ret; + + if (dev_get_cma_area(NULL)) + page = dma_alloc_from_contiguous(NULL, nr_pages, + pool_size_order, false); + else + page = alloc_pages(gfp, pool_size_order); + if (!page) + goto out; + + memset(page_address(page), 0, atomic_pool_size); + arch_dma_prep_coherent(page, atomic_pool_size); + + atomic_pool = gen_pool_create(PAGE_SHIFT, -1); + if (!atomic_pool) + goto free_page; + + addr = dma_common_contiguous_remap(page, atomic_pool_size, VM_USERMAP, + prot, __builtin_return_address(0)); + if (!addr) + goto destroy_genpool; + + ret = gen_pool_add_virt(atomic_pool, (unsigned long)addr, + page_to_phys(page), atomic_pool_size, -1); + if (ret) + goto remove_mapping; + gen_pool_set_algo(atomic_pool, gen_pool_first_fit_order_align, NULL); + + pr_info("DMA: preallocated %zu KiB pool for atomic allocations\n", + atomic_pool_size / 1024); + return 0; + +remove_mapping: + dma_common_free_remap(addr, atomic_pool_size, VM_USERMAP); +destroy_genpool: + gen_pool_destroy(atomic_pool); + atomic_pool = NULL; +free_page: + if (!dma_release_from_contiguous(NULL, page, nr_pages)) + __free_pages(page, pool_size_order); +out: + pr_err("DMA: failed to allocate %zu KiB pool for atomic coherent allocation\n", + atomic_pool_size / 1024); + return -ENOMEM; +} + +bool dma_in_atomic_pool(void *start, size_t size) +{ + return addr_in_gen_pool(atomic_pool, (unsigned long)start, size); +} + +void *dma_alloc_from_pool(size_t size, struct page **ret_page, gfp_t flags) +{ + unsigned long val; + void *ptr = NULL; + + if (!atomic_pool) { + WARN(1, "coherent pool not initialised!\n"); + return NULL; + } + + val = gen_pool_alloc(atomic_pool, size); + if (val) { + phys_addr_t phys = gen_pool_virt_to_phys(atomic_pool, val); + + *ret_page = pfn_to_page(__phys_to_pfn(phys)); + ptr = (void *)val; + memset(ptr, 0, size); + } + + return ptr; +} + +bool dma_free_from_pool(void *start, size_t size) +{ + if (!dma_in_atomic_pool(start, size)) + return false; + gen_pool_free(atomic_pool, (unsigned long)start, size); + return true; +} + +void *arch_dma_alloc(struct device *dev, size_t size, dma_addr_t *dma_handle, + gfp_t flags, unsigned long attrs) +{ + struct page *page = NULL; + void *ret, *kaddr; + + size = PAGE_ALIGN(size); + + if (!gfpflags_allow_blocking(flags)) { + ret = dma_alloc_from_pool(size, &page, flags); + if (!ret) + return NULL; + *dma_handle = phys_to_dma(dev, page_to_phys(page)); + return ret; + } + + kaddr = dma_direct_alloc_pages(dev, size, dma_handle, flags, attrs); + if (!kaddr) + return NULL; + page = virt_to_page(kaddr); + + /* remove any dirty cache lines on the kernel alias */ + arch_dma_prep_coherent(page, size); + + /* create a coherent mapping */ + ret = dma_common_contiguous_remap(page, size, VM_USERMAP, + arch_dma_mmap_pgprot(dev, PAGE_KERNEL, attrs), + __builtin_return_address(0)); + if (!ret) + dma_direct_free_pages(dev, size, kaddr, *dma_handle, attrs); + return ret; +} + +void arch_dma_free(struct device *dev, size_t size, void *vaddr, + dma_addr_t dma_handle, unsigned long attrs) +{ + if (!dma_free_from_pool(vaddr, PAGE_ALIGN(size))) { + void *kaddr = phys_to_virt(dma_to_phys(dev, dma_handle)); + + vunmap(vaddr); + dma_direct_free_pages(dev, size, kaddr, dma_handle, attrs); + } +} + +long arch_dma_coherent_to_pfn(struct device *dev, void *cpu_addr, + dma_addr_t dma_addr) +{ + return __phys_to_pfn(dma_to_phys(dev, dma_addr)); +} +#endif /* CONFIG_DMA_DIRECT_REMAP */ -- cgit v1.3-7-g2ca7 From bfd56cd605219d90b210a5377fca31a644efe95c Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Sun, 4 Nov 2018 17:38:39 +0100 Subject: dma-mapping: support highmem in the generic remap allocator By using __dma_direct_alloc_pages we can deal entirely with struct page instead of having to derive a kernel virtual address. Signed-off-by: Christoph Hellwig Reviewed-by: Robin Murphy --- kernel/dma/remap.c | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) (limited to 'kernel') diff --git a/kernel/dma/remap.c b/kernel/dma/remap.c index b32bb08f96ae..dcc82dd668f8 100644 --- a/kernel/dma/remap.c +++ b/kernel/dma/remap.c @@ -196,7 +196,7 @@ void *arch_dma_alloc(struct device *dev, size_t size, dma_addr_t *dma_handle, gfp_t flags, unsigned long attrs) { struct page *page = NULL; - void *ret, *kaddr; + void *ret; size = PAGE_ALIGN(size); @@ -208,10 +208,9 @@ void *arch_dma_alloc(struct device *dev, size_t size, dma_addr_t *dma_handle, return ret; } - kaddr = dma_direct_alloc_pages(dev, size, dma_handle, flags, attrs); - if (!kaddr) + page = __dma_direct_alloc_pages(dev, size, dma_handle, flags, attrs); + if (!page) return NULL; - page = virt_to_page(kaddr); /* remove any dirty cache lines on the kernel alias */ arch_dma_prep_coherent(page, size); @@ -221,7 +220,7 @@ void *arch_dma_alloc(struct device *dev, size_t size, dma_addr_t *dma_handle, arch_dma_mmap_pgprot(dev, PAGE_KERNEL, attrs), __builtin_return_address(0)); if (!ret) - dma_direct_free_pages(dev, size, kaddr, *dma_handle, attrs); + __dma_direct_free_pages(dev, size, page); return ret; } @@ -229,10 +228,11 @@ void arch_dma_free(struct device *dev, size_t size, void *vaddr, dma_addr_t dma_handle, unsigned long attrs) { if (!dma_free_from_pool(vaddr, PAGE_ALIGN(size))) { - void *kaddr = phys_to_virt(dma_to_phys(dev, dma_handle)); + phys_addr_t phys = dma_to_phys(dev, dma_handle); + struct page *page = pfn_to_page(__phys_to_pfn(phys)); vunmap(vaddr); - dma_direct_free_pages(dev, size, kaddr, dma_handle, attrs); + __dma_direct_free_pages(dev, size, page); } } -- cgit v1.3-7-g2ca7 From e440e26a0251b054243b3db6356f3869a9f9c2d1 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Sun, 4 Nov 2018 17:51:50 +0100 Subject: dma-remap: support DMA_ATTR_NO_KERNEL_MAPPING Do not waste vmalloc space on allocations that do not require a mapping into the kernel address space. Signed-off-by: Christoph Hellwig Reviewed-by: Robin Murphy --- kernel/dma/remap.c | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/dma/remap.c b/kernel/dma/remap.c index dcc82dd668f8..68a64e3ff6a1 100644 --- a/kernel/dma/remap.c +++ b/kernel/dma/remap.c @@ -200,7 +200,8 @@ void *arch_dma_alloc(struct device *dev, size_t size, dma_addr_t *dma_handle, size = PAGE_ALIGN(size); - if (!gfpflags_allow_blocking(flags)) { + if (!gfpflags_allow_blocking(flags) && + !(attrs & DMA_ATTR_NO_KERNEL_MAPPING)) { ret = dma_alloc_from_pool(size, &page, flags); if (!ret) return NULL; @@ -215,6 +216,9 @@ void *arch_dma_alloc(struct device *dev, size_t size, dma_addr_t *dma_handle, /* remove any dirty cache lines on the kernel alias */ arch_dma_prep_coherent(page, size); + if (attrs & DMA_ATTR_NO_KERNEL_MAPPING) + return page; /* opaque cookie */ + /* create a coherent mapping */ ret = dma_common_contiguous_remap(page, size, VM_USERMAP, arch_dma_mmap_pgprot(dev, PAGE_KERNEL, attrs), @@ -227,7 +231,10 @@ void *arch_dma_alloc(struct device *dev, size_t size, dma_addr_t *dma_handle, void arch_dma_free(struct device *dev, size_t size, void *vaddr, dma_addr_t dma_handle, unsigned long attrs) { - if (!dma_free_from_pool(vaddr, PAGE_ALIGN(size))) { + if (attrs & DMA_ATTR_NO_KERNEL_MAPPING) { + /* vaddr is a struct page cookie, not a kernel address */ + __dma_direct_free_pages(dev, size, vaddr); + } else if (!dma_free_from_pool(vaddr, PAGE_ALIGN(size))) { phys_addr_t phys = dma_to_phys(dev, dma_handle); struct page *page = pfn_to_page(__phys_to_pfn(phys)); -- cgit v1.3-7-g2ca7 From 2af3024cd78f120d027cb44b454186ba9d7dab24 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 7 Nov 2018 14:11:40 -0800 Subject: cgroups: Replace synchronize_sched() with synchronize_rcu() Now that synchronize_rcu() waits for preempt-disable regions of code as well as RCU read-side critical sections, synchronize_sched() can be replaced by synchronize_rcu(). This commit therefore makes this change, even though it is but a comment. Signed-off-by: Paul E. McKenney Cc: Jens Axboe Cc: Dennis Zhou Cc: Johannes Weiner Cc: "Dennis Zhou (Facebook)" Acked-by: Tejun Heo --- kernel/cgroup/cgroup.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 6aaf5dd5383b..7a8429f8e280 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -5343,7 +5343,7 @@ int __init cgroup_init(void) cgroup_rstat_boot(); /* - * The latency of the synchronize_sched() is too high for cgroups, + * The latency of the synchronize_rcu() is too high for cgroups, * avoid it at the cost of forcing all readers into the slow path. */ rcu_sync_enter_start(&cgroup_threadgroup_rwsem.rss); -- cgit v1.3-7-g2ca7 From 6932689e4145f545062ca8c86cf76f38854d63d0 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 7 Nov 2018 14:16:57 -0800 Subject: livepatch: Replace synchronize_sched() with synchronize_rcu() Now that synchronize_rcu() waits for preempt-disable regions of code as well as RCU read-side critical sections, synchronize_sched() can be replaced by synchronize_rcu(). This commit therefore makes this change, even though it is but a comment. Signed-off-by: Paul E. McKenney --- kernel/livepatch/patch.c | 4 ++-- kernel/livepatch/transition.c | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/livepatch/patch.c b/kernel/livepatch/patch.c index 82d584225dc6..7702cb4064fc 100644 --- a/kernel/livepatch/patch.c +++ b/kernel/livepatch/patch.c @@ -61,7 +61,7 @@ static void notrace klp_ftrace_handler(unsigned long ip, ops = container_of(fops, struct klp_ops, fops); /* - * A variant of synchronize_sched() is used to allow patching functions + * A variant of synchronize_rcu() is used to allow patching functions * where RCU is not watching, see klp_synchronize_transition(). */ preempt_disable_notrace(); @@ -72,7 +72,7 @@ static void notrace klp_ftrace_handler(unsigned long ip, /* * func should never be NULL because preemption should be disabled here * and unregister_ftrace_function() does the equivalent of a - * synchronize_sched() before the func_stack removal. + * synchronize_rcu() before the func_stack removal. */ if (WARN_ON_ONCE(!func)) goto unlock; diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c index 5bc349805e03..304d5eb8a98c 100644 --- a/kernel/livepatch/transition.c +++ b/kernel/livepatch/transition.c @@ -52,7 +52,7 @@ static DECLARE_DELAYED_WORK(klp_transition_work, klp_transition_work_fn); /* * This function is just a stub to implement a hard force - * of synchronize_sched(). This requires synchronizing + * of synchronize_rcu(). This requires synchronizing * tasks even in userspace and idle. */ static void klp_sync(struct work_struct *work) @@ -175,7 +175,7 @@ void klp_cancel_transition(void) void klp_update_patch_state(struct task_struct *task) { /* - * A variant of synchronize_sched() is used to allow patching functions + * A variant of synchronize_rcu() is used to allow patching functions * where RCU is not watching, see klp_synchronize_transition(). */ preempt_disable_notrace(); -- cgit v1.3-7-g2ca7 From 4871848531af1d62f30032bfb872c43b9afe03ad Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 15 Aug 2018 15:32:51 -0700 Subject: rcutorture: Add call_rcu() flooding forward-progress tests This commit adds a call_rcu() flooding loop to the forward-progress test. This emulates tight userspace loops that force call_rcu() invocations, for example, the infamous loop containing close(open()) that instigated the addition of blimit. If RCU does not make sufficient forward progress in invoking the resulting flood of callbacks, rcutorture emits a warning. Signed-off-by: Paul E. McKenney --- kernel/rcu/rcutorture.c | 129 +++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 127 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index 210c77460365..8cf700ca7845 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -259,6 +259,8 @@ static atomic_t barrier_cbs_invoked; /* Barrier callbacks invoked. */ static wait_queue_head_t *barrier_cbs_wq; /* Coordinate barrier testing. */ static DECLARE_WAIT_QUEUE_HEAD(barrier_wq); +static bool rcu_fwd_cb_nodelay; /* Short rcu_torture_delay() delays. */ + /* * Allocate an element from the rcu_tortures pool. */ @@ -348,7 +350,8 @@ rcu_read_delay(struct torture_random_state *rrsp, struct rt_read_seg *rtrsp) * period, and we want a long delay occasionally to trigger * force_quiescent_state. */ - if (!(torture_random(rrsp) % (nrealreaders * 2000 * longdelay_ms))) { + if (!rcu_fwd_cb_nodelay && + !(torture_random(rrsp) % (nrealreaders * 2000 * longdelay_ms))) { started = cur_ops->get_gp_seq(); ts = rcu_trace_clock_local(); if (preempt_count() & (SOFTIRQ_MASK | HARDIRQ_MASK)) @@ -1674,6 +1677,43 @@ static void rcu_torture_fwd_prog_cb(struct rcu_head *rhp) cur_ops->call(&fcsp->rh, rcu_torture_fwd_prog_cb); } +/* State for continuous-flood RCU callbacks. */ +struct rcu_fwd_cb { + struct rcu_head rh; + struct rcu_fwd_cb *rfc_next; + int rfc_gps; +}; +static DEFINE_SPINLOCK(rcu_fwd_lock); +static struct rcu_fwd_cb *rcu_fwd_cb_head; +static struct rcu_fwd_cb **rcu_fwd_cb_tail = &rcu_fwd_cb_head; +static long n_launders_cb; +static unsigned long rcu_fwd_startat; +#define MAX_FWD_CB_JIFFIES (8 * HZ) /* Maximum CB test duration. */ +#define MIN_FWD_CB_LAUNDERS 3 /* This many CB invocations to count. */ +#define MIN_FWD_CBS_LAUNDERED 100 /* Number of counted CBs. */ +static long n_launders_hist[2 * MAX_FWD_CB_JIFFIES / HZ]; + +/* Callback function for continuous-flood RCU callbacks. */ +static void rcu_torture_fwd_cb_cr(struct rcu_head *rhp) +{ + int i; + struct rcu_fwd_cb *rfcp = container_of(rhp, struct rcu_fwd_cb, rh); + struct rcu_fwd_cb **rfcpp; + + rfcp->rfc_next = NULL; + rfcp->rfc_gps++; + spin_lock(&rcu_fwd_lock); + rfcpp = rcu_fwd_cb_tail; + rcu_fwd_cb_tail = &rfcp->rfc_next; + WRITE_ONCE(*rfcpp, rfcp); + WRITE_ONCE(n_launders_cb, n_launders_cb + 1); + i = ((jiffies - rcu_fwd_startat) / HZ); + if (i >= ARRAY_SIZE(n_launders_hist)) + i = ARRAY_SIZE(n_launders_hist) - 1; + n_launders_hist[i]++; + spin_unlock(&rcu_fwd_lock); +} + /* Carry out grace-period forward-progress testing. */ static int rcu_torture_fwd_prog(void *args) { @@ -1681,11 +1721,21 @@ static int rcu_torture_fwd_prog(void *args) unsigned long dur; struct fwd_cb_state fcs; unsigned long gps; + int i; int idx; + int j; + long n_launders; + long n_launders_cb_snap; + long n_launders_sa; + long n_max_cbs; + long n_max_gps; + struct rcu_fwd_cb *rfcp; + struct rcu_fwd_cb *rfcpn; int sd; int sd4; bool selfpropcb = false; unsigned long stopat; + unsigned long stoppedat; int tested = 0; int tested_tries = 0; static DEFINE_TORTURE_RANDOM(trs); @@ -1699,6 +1749,8 @@ static int rcu_torture_fwd_prog(void *args) } do { schedule_timeout_interruptible(fwd_progress_holdoff * HZ); + + /* Tight loop containing cond_resched(). */ if (selfpropcb) { WRITE_ONCE(fcs.stop, 0); cur_ops->call(&fcs.rh, rcu_torture_fwd_prog_cb); @@ -1708,7 +1760,8 @@ static int rcu_torture_fwd_prog(void *args) sd = cur_ops->stall_dur() + 1; sd4 = (sd + fwd_progress_div - 1) / fwd_progress_div; dur = sd4 + torture_random(&trs) % (sd - sd4); - stopat = jiffies + dur; + rcu_fwd_startat = jiffies; + stopat = rcu_fwd_startat + dur; while (time_before(jiffies, stopat) && !torture_must_stop()) { idx = cur_ops->readlock(); udelay(10); @@ -1729,6 +1782,78 @@ static int rcu_torture_fwd_prog(void *args) cur_ops->sync(); /* Wait for running CB to complete. */ cur_ops->cb_barrier(); /* Wait for queued callbacks. */ } + + /* Loop continuously posting RCU callbacks. */ + WRITE_ONCE(rcu_fwd_cb_nodelay, true); + cur_ops->sync(); /* Later readers see above write. */ + rcu_fwd_startat = jiffies; + stopat = rcu_fwd_startat + MAX_FWD_CB_JIFFIES; + n_launders = 0; + n_launders_cb = 0; + n_launders_sa = 0; + n_max_cbs = 0; + n_max_gps = 0; + for (i = 0; i < ARRAY_SIZE(n_launders_hist); i++) + n_launders_hist[i] = 0; + cver = READ_ONCE(rcu_torture_current_version); + gps = cur_ops->get_gp_seq(); + while (time_before(jiffies, stopat) && !torture_must_stop()) { + rfcp = READ_ONCE(rcu_fwd_cb_head); + rfcpn = NULL; + if (rfcp) + rfcpn = READ_ONCE(rfcp->rfc_next); + if (rfcpn) { + if (rfcp->rfc_gps >= MIN_FWD_CB_LAUNDERS && + ++n_max_gps >= MIN_FWD_CBS_LAUNDERED) + break; + rcu_fwd_cb_head = rfcpn; + n_launders++; + n_launders_sa++; + } else { + rfcp = kmalloc(sizeof(*rfcp), GFP_KERNEL); + if (WARN_ON_ONCE(!rfcp)) { + schedule_timeout_interruptible(1); + continue; + } + n_max_cbs++; + n_launders_sa = 0; + rfcp->rfc_gps = 0; + } + cur_ops->call(&rfcp->rh, rcu_torture_fwd_cb_cr); + cond_resched(); + } + stoppedat = jiffies; + n_launders_cb_snap = READ_ONCE(n_launders_cb); + cver = READ_ONCE(rcu_torture_current_version) - cver; + gps = rcutorture_seq_diff(cur_ops->get_gp_seq(), gps); + cur_ops->cb_barrier(); /* Wait for callbacks to be invoked. */ + for (;;) { + rfcp = rcu_fwd_cb_head; + if (!rfcp) + break; + rcu_fwd_cb_head = rfcp->rfc_next; + kfree(rfcp); + } + rcu_fwd_cb_tail = &rcu_fwd_cb_head; + WRITE_ONCE(rcu_fwd_cb_nodelay, false); + if (!torture_must_stop()) { + WARN_ON(n_max_gps < MIN_FWD_CBS_LAUNDERED); + pr_alert("%s Duration %lu barrier: %lu pending %ld n_launders: %ld n_launders_sa: %ld n_max_gps: %ld n_max_cbs: %ld cver %ld gps %ld\n", + __func__, + stoppedat - rcu_fwd_startat, + jiffies - stoppedat, + n_launders + n_max_cbs - n_launders_cb_snap, + n_launders, n_launders_sa, + n_max_gps, n_max_cbs, cver, gps); + for (i = ARRAY_SIZE(n_launders_hist) - 1; i > 0; i--) + if (n_launders_hist[i] > 0) + break; + pr_alert("Callback-invocation histogram:"); + for (j = 0; j <= i; j++) + pr_cont(" %ds: %ld", j + 1, n_launders_hist[j]); + pr_cont("\n"); + } + /* Avoid slow periods, better to test when busy. */ stutter_wait("rcu_torture_fwd_prog"); } while (!torture_must_stop()); -- cgit v1.3-7-g2ca7 From 28cf5952f56005325f269ccfe402a880cd741189 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 21 Aug 2018 15:27:16 -0700 Subject: torture: Bring any extra CPUs online during kernel startup Currently, the torture scripts rely on the initrd/init script to bring any extra CPUs online, for example, in the case where the kernel and qemu have different ideas about how many CPUs are present. This works, but is an unnecessary dependency on initrd, which needs to vary depending on the distro. This commit therefore causes torture_onoff() to check for additional CPUs, attempting to bring any found online. Errors are ignored, just as they are by the initrd/init script. Signed-off-by: Paul E. McKenney --- kernel/torture.c | 12 ++++++++++++ 1 file changed, 12 insertions(+) (limited to 'kernel') diff --git a/kernel/torture.c b/kernel/torture.c index 17d91f5fba2a..9410d1bf84d6 100644 --- a/kernel/torture.c +++ b/kernel/torture.c @@ -194,11 +194,23 @@ torture_onoff(void *arg) int cpu; int maxcpu = -1; DEFINE_TORTURE_RANDOM(rand); + int ret; VERBOSE_TOROUT_STRING("torture_onoff task started"); for_each_online_cpu(cpu) maxcpu = cpu; WARN_ON(maxcpu < 0); + if (!IS_MODULE(CONFIG_TORTURE_TEST)) + for_each_possible_cpu(cpu) { + if (cpu_online(cpu)) + continue; + ret = cpu_up(cpu); + if (ret && verbose) { + pr_alert("%s" TORTURE_FLAG + "%s: Initial online %d: errno %d\n", + __func__, torture_type, cpu, ret); + } + } if (maxcpu == 0) { VERBOSE_TOROUT_STRING("Only one CPU, so CPU-hotplug testing is disabled"); -- cgit v1.3-7-g2ca7 From fc6f9c57787e578473d47b7bbc846e317d17c1df Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 27 Aug 2018 14:43:05 -0700 Subject: rcutorture: Remove cbflood facility Now that the forward-progress code does a full-bore continuous callback flood lasting multiple seconds, there is little point in also posting a mere 60,000 callbacks every second or so. This commit therefore removes the old cbflood testing. Over time, it may be desirable to concurrently do full-bore continuous callback floods on all CPUs simultaneously, but one dragon at a time. Signed-off-by: Paul E. McKenney --- Documentation/admin-guide/kernel-parameters.txt | 18 ------ kernel/rcu/rcutorture.c | 86 +------------------------ 2 files changed, 1 insertion(+), 103 deletions(-) (limited to 'kernel') diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 3823679deea5..86e825e0927a 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -3743,24 +3743,6 @@ in microseconds. The default of zero says no holdoff. - rcutorture.cbflood_inter_holdoff= [KNL] - Set holdoff time (jiffies) between successive - callback-flood tests. - - rcutorture.cbflood_intra_holdoff= [KNL] - Set holdoff time (jiffies) between successive - bursts of callbacks within a given callback-flood - test. - - rcutorture.cbflood_n_burst= [KNL] - Set the number of bursts making up a given - callback-flood test. Set this to zero to - disable callback-flood testing. - - rcutorture.cbflood_n_per_burst= [KNL] - Set the number of callbacks to be registered - in a given burst of a callback-flood test. - rcutorture.fqs_duration= [KNL] Set duration of force_quiescent_state bursts in microseconds. diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index 8cf700ca7845..17f480129a78 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -80,13 +80,6 @@ MODULE_AUTHOR("Paul E. McKenney and Josh Triplett 0 && - cbflood_inter_holdoff > 0 && - cbflood_intra_holdoff > 0 && - cur_ops->call && - cur_ops->cb_barrier) { - rhp = vmalloc(array3_size(cbflood_n_burst, - cbflood_n_per_burst, - sizeof(*rhp))); - err = !rhp; - } - if (err) { - VERBOSE_TOROUT_STRING("rcu_torture_cbflood disabled: Bad args or OOM"); - goto wait_for_stop; - } - VERBOSE_TOROUT_STRING("rcu_torture_cbflood task started"); - do { - schedule_timeout_interruptible(cbflood_inter_holdoff); - atomic_long_inc(&n_cbfloods); - WARN_ON(signal_pending(current)); - for (i = 0; i < cbflood_n_burst; i++) { - for (j = 0; j < cbflood_n_per_burst; j++) { - cur_ops->call(&rhp[i * cbflood_n_per_burst + j], - rcu_torture_cbflood_cb); - } - schedule_timeout_interruptible(cbflood_intra_holdoff); - WARN_ON(signal_pending(current)); - } - cur_ops->cb_barrier(); - stutter_wait("rcu_torture_cbflood"); - } while (!torture_must_stop()); - vfree(rhp); -wait_for_stop: - torture_kthread_stopping("rcu_torture_cbflood"); - return 0; -} - /* * RCU torture force-quiescent-state kthread. Repeatedly induces * bursts of calls to force_quiescent_state(), increasing the probability @@ -1460,11 +1397,10 @@ rcu_torture_stats_print(void) n_rcu_torture_boosts, atomic_long_read(&n_rcu_torture_timers)); torture_onoff_stats(); - pr_cont("barrier: %ld/%ld:%ld ", + pr_cont("barrier: %ld/%ld:%ld\n", n_barrier_successes, n_barrier_attempts, n_rcu_torture_barrier_error); - pr_cont("cbflood: %ld\n", atomic_long_read(&n_cbfloods)); pr_alert("%s%s ", torture_type, TORTURE_FLAG); if (atomic_read(&n_rcu_torture_mberror) != 0 || @@ -2093,8 +2029,6 @@ rcu_torture_cleanup(void) cur_ops->name, gp_seq, flags); torture_stop_kthread(rcu_torture_stats, stats_task); torture_stop_kthread(rcu_torture_fqs, fqs_task); - for (i = 0; i < ncbflooders; i++) - torture_stop_kthread(rcu_torture_cbflood, cbflood_task[i]); if (rcu_torture_can_boost()) cpuhp_remove_state(rcutor_hp); @@ -2377,24 +2311,6 @@ rcu_torture_init(void) goto unwind; if (object_debug) rcu_test_debug_objects(); - if (cbflood_n_burst > 0) { - /* Create the cbflood threads */ - ncbflooders = (num_online_cpus() + 3) / 4; - cbflood_task = kcalloc(ncbflooders, sizeof(*cbflood_task), - GFP_KERNEL); - if (!cbflood_task) { - VERBOSE_TOROUT_ERRSTRING("out of memory"); - firsterr = -ENOMEM; - goto unwind; - } - for (i = 0; i < ncbflooders; i++) { - firsterr = torture_create_kthread(rcu_torture_cbflood, - NULL, - cbflood_task[i]); - if (firsterr) - goto unwind; - } - } torture_init_end(); return 0; -- cgit v1.3-7-g2ca7 From 6b3de7a172bc59010a9d8e425877d98c1f24555e Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 28 Aug 2018 14:38:43 -0700 Subject: rcutorture: Break up too-long rcu_torture_fwd_prog() function This commit splits rcu_torture_fwd_prog_nr() and rcu_torture_fwd_prog_cr() functions out of rcu_torture_fwd_prog() in order to reduce indentation pain and because rcu_torture_fwd_prog() was getting a bit too long. In addition, this will enable easier conditional execution of the rcu_torture_fwd_prog_cr() function, which can give false-positive failures in some NO_HZ_FULL configurations due to overloading the housekeeping CPUs. Signed-off-by: Paul E. McKenney --- kernel/rcu/rcutorture.c | 254 +++++++++++++++++++++++++----------------------- 1 file changed, 135 insertions(+), 119 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index 17f480129a78..bcc33bb8d9a6 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -1650,15 +1650,70 @@ static void rcu_torture_fwd_cb_cr(struct rcu_head *rhp) spin_unlock(&rcu_fwd_lock); } -/* Carry out grace-period forward-progress testing. */ -static int rcu_torture_fwd_prog(void *args) +/* Carry out need_resched()/cond_resched() forward-progress testing. */ +static void rcu_torture_fwd_prog_nr(int *tested, int *tested_tries) { unsigned long cver; unsigned long dur; struct fwd_cb_state fcs; unsigned long gps; - int i; int idx; + int sd; + int sd4; + bool selfpropcb = false; + unsigned long stopat; + static DEFINE_TORTURE_RANDOM(trs); + + if (cur_ops->call && cur_ops->sync && cur_ops->cb_barrier) { + init_rcu_head_on_stack(&fcs.rh); + selfpropcb = true; + } + + /* Tight loop containing cond_resched(). */ + if (selfpropcb) { + WRITE_ONCE(fcs.stop, 0); + cur_ops->call(&fcs.rh, rcu_torture_fwd_prog_cb); + } + cver = READ_ONCE(rcu_torture_current_version); + gps = cur_ops->get_gp_seq(); + sd = cur_ops->stall_dur() + 1; + sd4 = (sd + fwd_progress_div - 1) / fwd_progress_div; + dur = sd4 + torture_random(&trs) % (sd - sd4); + rcu_fwd_startat = jiffies; + stopat = rcu_fwd_startat + dur; + while (time_before(jiffies, stopat) && !torture_must_stop()) { + idx = cur_ops->readlock(); + udelay(10); + cur_ops->readunlock(idx); + if (!fwd_progress_need_resched || need_resched()) + cond_resched(); + } + (*tested_tries)++; + if (!time_before(jiffies, stopat) && !torture_must_stop()) { + (*tested)++; + cver = READ_ONCE(rcu_torture_current_version) - cver; + gps = rcutorture_seq_diff(cur_ops->get_gp_seq(), gps); + WARN_ON(!cver && gps < 2); + pr_alert("%s: Duration %ld cver %ld gps %ld\n", __func__, dur, cver, gps); + } + if (selfpropcb) { + WRITE_ONCE(fcs.stop, 1); + cur_ops->sync(); /* Wait for running CB to complete. */ + cur_ops->cb_barrier(); /* Wait for queued callbacks. */ + } + + if (selfpropcb) { + WARN_ON(READ_ONCE(fcs.stop) != 2); + destroy_rcu_head_on_stack(&fcs.rh); + } +} + +/* Carry out call_rcu() forward-progress testing. */ +static void rcu_torture_fwd_prog_cr(void) +{ + unsigned long cver; + unsigned long gps; + int i; int j; long n_launders; long n_launders_cb_snap; @@ -1667,136 +1722,97 @@ static int rcu_torture_fwd_prog(void *args) long n_max_gps; struct rcu_fwd_cb *rfcp; struct rcu_fwd_cb *rfcpn; - int sd; - int sd4; - bool selfpropcb = false; unsigned long stopat; unsigned long stoppedat; + + /* Loop continuously posting RCU callbacks. */ + WRITE_ONCE(rcu_fwd_cb_nodelay, true); + cur_ops->sync(); /* Later readers see above write. */ + rcu_fwd_startat = jiffies; + stopat = rcu_fwd_startat + MAX_FWD_CB_JIFFIES; + n_launders = 0; + n_launders_cb = 0; + n_launders_sa = 0; + n_max_cbs = 0; + n_max_gps = 0; + for (i = 0; i < ARRAY_SIZE(n_launders_hist); i++) + n_launders_hist[i] = 0; + cver = READ_ONCE(rcu_torture_current_version); + gps = cur_ops->get_gp_seq(); + while (time_before(jiffies, stopat) && !torture_must_stop()) { + rfcp = READ_ONCE(rcu_fwd_cb_head); + rfcpn = NULL; + if (rfcp) + rfcpn = READ_ONCE(rfcp->rfc_next); + if (rfcpn) { + if (rfcp->rfc_gps >= MIN_FWD_CB_LAUNDERS && + ++n_max_gps >= MIN_FWD_CBS_LAUNDERED) + break; + rcu_fwd_cb_head = rfcpn; + n_launders++; + n_launders_sa++; + } else { + rfcp = kmalloc(sizeof(*rfcp), GFP_KERNEL); + if (WARN_ON_ONCE(!rfcp)) { + schedule_timeout_interruptible(1); + continue; + } + n_max_cbs++; + n_launders_sa = 0; + rfcp->rfc_gps = 0; + } + cur_ops->call(&rfcp->rh, rcu_torture_fwd_cb_cr); + cond_resched(); + } + stoppedat = jiffies; + n_launders_cb_snap = READ_ONCE(n_launders_cb); + cver = READ_ONCE(rcu_torture_current_version) - cver; + gps = rcutorture_seq_diff(cur_ops->get_gp_seq(), gps); + cur_ops->cb_barrier(); /* Wait for callbacks to be invoked. */ + for (;;) { + rfcp = rcu_fwd_cb_head; + if (!rfcp) + break; + rcu_fwd_cb_head = rfcp->rfc_next; + kfree(rfcp); + } + rcu_fwd_cb_tail = &rcu_fwd_cb_head; + WRITE_ONCE(rcu_fwd_cb_nodelay, false); + if (!torture_must_stop()) { + WARN_ON(n_max_gps < MIN_FWD_CBS_LAUNDERED); + pr_alert("%s Duration %lu barrier: %lu pending %ld n_launders: %ld n_launders_sa: %ld n_max_gps: %ld n_max_cbs: %ld cver %ld gps %ld\n", + __func__, + stoppedat - rcu_fwd_startat, jiffies - stoppedat, + n_launders + n_max_cbs - n_launders_cb_snap, + n_launders, n_launders_sa, + n_max_gps, n_max_cbs, cver, gps); + for (i = ARRAY_SIZE(n_launders_hist) - 1; i > 0; i--) + if (n_launders_hist[i] > 0) + break; + pr_alert("Callback-invocation histogram:"); + for (j = 0; j <= i; j++) + pr_cont(" %ds: %ld", j + 1, n_launders_hist[j]); + pr_cont("\n"); + } +} + +/* Carry out grace-period forward-progress testing. */ +static int rcu_torture_fwd_prog(void *args) +{ int tested = 0; int tested_tries = 0; - static DEFINE_TORTURE_RANDOM(trs); VERBOSE_TOROUT_STRING("rcu_torture_fwd_progress task started"); if (!IS_ENABLED(CONFIG_SMP) || !IS_ENABLED(CONFIG_RCU_BOOST)) set_user_nice(current, MAX_NICE); - if (cur_ops->call && cur_ops->sync && cur_ops->cb_barrier) { - init_rcu_head_on_stack(&fcs.rh); - selfpropcb = true; - } do { schedule_timeout_interruptible(fwd_progress_holdoff * HZ); - - /* Tight loop containing cond_resched(). */ - if (selfpropcb) { - WRITE_ONCE(fcs.stop, 0); - cur_ops->call(&fcs.rh, rcu_torture_fwd_prog_cb); - } - cver = READ_ONCE(rcu_torture_current_version); - gps = cur_ops->get_gp_seq(); - sd = cur_ops->stall_dur() + 1; - sd4 = (sd + fwd_progress_div - 1) / fwd_progress_div; - dur = sd4 + torture_random(&trs) % (sd - sd4); - rcu_fwd_startat = jiffies; - stopat = rcu_fwd_startat + dur; - while (time_before(jiffies, stopat) && !torture_must_stop()) { - idx = cur_ops->readlock(); - udelay(10); - cur_ops->readunlock(idx); - if (!fwd_progress_need_resched || need_resched()) - cond_resched(); - } - tested_tries++; - if (!time_before(jiffies, stopat) && !torture_must_stop()) { - tested++; - cver = READ_ONCE(rcu_torture_current_version) - cver; - gps = rcutorture_seq_diff(cur_ops->get_gp_seq(), gps); - WARN_ON(!cver && gps < 2); - pr_alert("%s: Duration %ld cver %ld gps %ld\n", __func__, dur, cver, gps); - } - if (selfpropcb) { - WRITE_ONCE(fcs.stop, 1); - cur_ops->sync(); /* Wait for running CB to complete. */ - cur_ops->cb_barrier(); /* Wait for queued callbacks. */ - } - - /* Loop continuously posting RCU callbacks. */ - WRITE_ONCE(rcu_fwd_cb_nodelay, true); - cur_ops->sync(); /* Later readers see above write. */ - rcu_fwd_startat = jiffies; - stopat = rcu_fwd_startat + MAX_FWD_CB_JIFFIES; - n_launders = 0; - n_launders_cb = 0; - n_launders_sa = 0; - n_max_cbs = 0; - n_max_gps = 0; - for (i = 0; i < ARRAY_SIZE(n_launders_hist); i++) - n_launders_hist[i] = 0; - cver = READ_ONCE(rcu_torture_current_version); - gps = cur_ops->get_gp_seq(); - while (time_before(jiffies, stopat) && !torture_must_stop()) { - rfcp = READ_ONCE(rcu_fwd_cb_head); - rfcpn = NULL; - if (rfcp) - rfcpn = READ_ONCE(rfcp->rfc_next); - if (rfcpn) { - if (rfcp->rfc_gps >= MIN_FWD_CB_LAUNDERS && - ++n_max_gps >= MIN_FWD_CBS_LAUNDERED) - break; - rcu_fwd_cb_head = rfcpn; - n_launders++; - n_launders_sa++; - } else { - rfcp = kmalloc(sizeof(*rfcp), GFP_KERNEL); - if (WARN_ON_ONCE(!rfcp)) { - schedule_timeout_interruptible(1); - continue; - } - n_max_cbs++; - n_launders_sa = 0; - rfcp->rfc_gps = 0; - } - cur_ops->call(&rfcp->rh, rcu_torture_fwd_cb_cr); - cond_resched(); - } - stoppedat = jiffies; - n_launders_cb_snap = READ_ONCE(n_launders_cb); - cver = READ_ONCE(rcu_torture_current_version) - cver; - gps = rcutorture_seq_diff(cur_ops->get_gp_seq(), gps); - cur_ops->cb_barrier(); /* Wait for callbacks to be invoked. */ - for (;;) { - rfcp = rcu_fwd_cb_head; - if (!rfcp) - break; - rcu_fwd_cb_head = rfcp->rfc_next; - kfree(rfcp); - } - rcu_fwd_cb_tail = &rcu_fwd_cb_head; - WRITE_ONCE(rcu_fwd_cb_nodelay, false); - if (!torture_must_stop()) { - WARN_ON(n_max_gps < MIN_FWD_CBS_LAUNDERED); - pr_alert("%s Duration %lu barrier: %lu pending %ld n_launders: %ld n_launders_sa: %ld n_max_gps: %ld n_max_cbs: %ld cver %ld gps %ld\n", - __func__, - stoppedat - rcu_fwd_startat, - jiffies - stoppedat, - n_launders + n_max_cbs - n_launders_cb_snap, - n_launders, n_launders_sa, - n_max_gps, n_max_cbs, cver, gps); - for (i = ARRAY_SIZE(n_launders_hist) - 1; i > 0; i--) - if (n_launders_hist[i] > 0) - break; - pr_alert("Callback-invocation histogram:"); - for (j = 0; j <= i; j++) - pr_cont(" %ds: %ld", j + 1, n_launders_hist[j]); - pr_cont("\n"); - } + rcu_torture_fwd_prog_nr(&tested, &tested_tries); + rcu_torture_fwd_prog_cr(); /* Avoid slow periods, better to test when busy. */ stutter_wait("rcu_torture_fwd_prog"); } while (!torture_must_stop()); - if (selfpropcb) { - WARN_ON(READ_ONCE(fcs.stop) != 2); - destroy_rcu_head_on_stack(&fcs.rh); - } /* Short runs might not contain a valid forward-progress attempt. */ WARN_ON(!tested && tested_tries >= 5); pr_alert("%s: tested %d tested_tries %d\n", __func__, tested, tested_tries); -- cgit v1.3-7-g2ca7 From 5ab7ab8362fa8a4f7995d65ea05edf71530e8004 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Fri, 21 Sep 2018 18:08:09 -0700 Subject: rcutorture: Affinity forward-progress test to avoid housekeeping CPUs This commit affinities the forward-progress tests to avoid hogging a housekeeping CPU on the theory that the offloaded callbacks will be running on those housekeeping CPUs. Signed-off-by: Paul E. McKenney [ paulmck: Fix NULL-pointer issue located by kbuild test robot. ] Tested-by: Rong Chen --- kernel/rcu/rcu.h | 2 ++ kernel/rcu/rcutorture.c | 1 + kernel/rcu/tree_plugin.h | 11 +++++++++++ 3 files changed, 14 insertions(+) (limited to 'kernel') diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h index 2866166863f0..0f0f5ae8c3d4 100644 --- a/kernel/rcu/rcu.h +++ b/kernel/rcu/rcu.h @@ -539,8 +539,10 @@ extern struct workqueue_struct *rcu_par_gp_wq; #ifdef CONFIG_RCU_NOCB_CPU bool rcu_is_nocb_cpu(int cpu); +void rcu_bind_current_to_nocb(void); #else static inline bool rcu_is_nocb_cpu(int cpu) { return false; } +static inline void rcu_bind_current_to_nocb(void) { } #endif #endif /* __LINUX_RCU_H */ diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index bcc33bb8d9a6..36a3bc42782d 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -1803,6 +1803,7 @@ static int rcu_torture_fwd_prog(void *args) int tested_tries = 0; VERBOSE_TOROUT_STRING("rcu_torture_fwd_progress task started"); + rcu_bind_current_to_nocb(); if (!IS_ENABLED(CONFIG_SMP) || !IS_ENABLED(CONFIG_RCU_BOOST)) set_user_nice(current, MAX_NICE); do { diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 605ff3b06098..25fe26c4e9c1 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -2603,6 +2603,17 @@ static bool init_nocb_callback_list(struct rcu_data *rdp) return true; } +/* + * Bind the current task to the offloaded CPUs. If there are no offloaded + * CPUs, leave the task unbound. Splat if the bind attempt fails. + */ +void rcu_bind_current_to_nocb(void) +{ + if (cpumask_available(rcu_nocb_mask) && cpumask_weight(rcu_nocb_mask)) + WARN_ON(sched_setaffinity(current->pid, rcu_nocb_mask)); +} +EXPORT_SYMBOL_GPL(rcu_bind_current_to_nocb); + #else /* #ifdef CONFIG_RCU_NOCB_CPU */ static bool rcu_nocb_cpu_needs_barrier(int cpu) -- cgit v1.3-7-g2ca7 From 2a7d968816a94a4c52f0082c085c6714a5b3f1ec Mon Sep 17 00:00:00 2001 From: Pierce Griffiths Date: Fri, 21 Sep 2018 20:21:31 -0500 Subject: torture: Remove unnecessary "ret" variables Remove return variables (declared as "ret") in cases where, depending on whether a condition evaluates as true, the result of a function call can be immediately returned instead of storing the result in the return variable. When the condition evaluates as false, the constant initially stored in the return variable at declaration is returned instead. Signed-off-by: Pierce Griffiths Signed-off-by: Paul E. McKenney --- kernel/torture.c | 22 ++++++++-------------- 1 file changed, 8 insertions(+), 14 deletions(-) (limited to 'kernel') diff --git a/kernel/torture.c b/kernel/torture.c index 9410d1bf84d6..bbf6d473e50c 100644 --- a/kernel/torture.c +++ b/kernel/torture.c @@ -245,16 +245,15 @@ stop: */ int torture_onoff_init(long ooholdoff, long oointerval) { - int ret = 0; - #ifdef CONFIG_HOTPLUG_CPU onoff_holdoff = ooholdoff; onoff_interval = oointerval; if (onoff_interval <= 0) return 0; - ret = torture_create_kthread(torture_onoff, NULL, onoff_task); -#endif /* #ifdef CONFIG_HOTPLUG_CPU */ - return ret; + return torture_create_kthread(torture_onoff, NULL, onoff_task); +#else /* #ifdef CONFIG_HOTPLUG_CPU */ + return 0; +#endif /* #else #ifdef CONFIG_HOTPLUG_CPU */ } EXPORT_SYMBOL_GPL(torture_onoff_init); @@ -525,15 +524,13 @@ static int torture_shutdown(void *arg) */ int torture_shutdown_init(int ssecs, void (*cleanup)(void)) { - int ret = 0; - torture_shutdown_hook = cleanup; if (ssecs > 0) { shutdown_time = ktime_add(ktime_get(), ktime_set(ssecs, 0)); - ret = torture_create_kthread(torture_shutdown, NULL, + return torture_create_kthread(torture_shutdown, NULL, shutdown_task); } - return ret; + return 0; } EXPORT_SYMBOL_GPL(torture_shutdown_init); @@ -632,13 +629,10 @@ static int torture_stutter(void *arg) /* * Initialize and kick off the torture_stutter kthread. */ -int torture_stutter_init(int s) +int torture_stutter_init(const int s) { - int ret; - stutter = s; - ret = torture_create_kthread(torture_stutter, NULL, stutter_task); - return ret; + return torture_create_kthread(torture_stutter, NULL, stutter_task); } EXPORT_SYMBOL_GPL(torture_stutter_init); -- cgit v1.3-7-g2ca7 From 61670adcb4a9f66ff3fa8a9e846a623d9a9e1553 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 1 Oct 2018 13:27:41 -0700 Subject: rcutorture: Prepare for asynchronous access to rcu_fwd_startat Because rcutorture's forward-progress checking will trigger from an OOM notifier, this notifier will introduce asynchronous concurrent access to the rcu_fwd_startat variable. This commit therefore prepares for this by converting updates to WRITE_ONCE(). Signed-off-by: Paul E. McKenney --- kernel/rcu/rcutorture.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index 36a3bc42782d..c4fd61dccedb 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -1679,7 +1679,7 @@ static void rcu_torture_fwd_prog_nr(int *tested, int *tested_tries) sd = cur_ops->stall_dur() + 1; sd4 = (sd + fwd_progress_div - 1) / fwd_progress_div; dur = sd4 + torture_random(&trs) % (sd - sd4); - rcu_fwd_startat = jiffies; + WRITE_ONCE(rcu_fwd_startat, jiffies); stopat = rcu_fwd_startat + dur; while (time_before(jiffies, stopat) && !torture_must_stop()) { idx = cur_ops->readlock(); @@ -1728,7 +1728,7 @@ static void rcu_torture_fwd_prog_cr(void) /* Loop continuously posting RCU callbacks. */ WRITE_ONCE(rcu_fwd_cb_nodelay, true); cur_ops->sync(); /* Later readers see above write. */ - rcu_fwd_startat = jiffies; + WRITE_ONCE(rcu_fwd_startat, jiffies); stopat = rcu_fwd_startat + MAX_FWD_CB_JIFFIES; n_launders = 0; n_launders_cb = 0; -- cgit v1.3-7-g2ca7 From e0aff97355575ac6a28a48a4217533a3953095c5 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 1 Oct 2018 17:40:54 -0700 Subject: rcutorture: Dump grace-period diagnostics upon forward-progress OOM This commit adds an OOM notifier during rcutorture forward-progress testing. If this notifier is invoked, it dumps out some grace-period state to help debug the forward-progress problem. Signed-off-by: Paul E. McKenney --- kernel/rcu/rcu.h | 2 ++ kernel/rcu/rcutorture.c | 31 ++++++++++++++++++++++++++++--- kernel/rcu/tree.c | 20 ++++++++++++++++++++ 3 files changed, 50 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h index 0f0f5ae8c3d4..a393e24a9195 100644 --- a/kernel/rcu/rcu.h +++ b/kernel/rcu/rcu.h @@ -526,12 +526,14 @@ srcu_batches_completed(struct srcu_struct *sp) { return 0; } static inline void rcu_force_quiescent_state(void) { } static inline void show_rcu_gp_kthreads(void) { } static inline int rcu_get_gp_kthreads_prio(void) { return 0; } +static inline void rcu_fwd_progress_check(unsigned long j) { } #else /* #ifdef CONFIG_TINY_RCU */ unsigned long rcu_get_gp_seq(void); unsigned long rcu_exp_batches_completed(void); unsigned long srcu_batches_completed(struct srcu_struct *sp); void show_rcu_gp_kthreads(void); int rcu_get_gp_kthreads_prio(void); +void rcu_fwd_progress_check(unsigned long j); void rcu_force_quiescent_state(void); extern struct workqueue_struct *rcu_gp_wq; extern struct workqueue_struct *rcu_par_gp_wq; diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index c4fd61dccedb..f28b88ecb47a 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -56,6 +56,7 @@ #include #include #include +#include #include "rcu.h" @@ -1624,6 +1625,7 @@ static struct rcu_fwd_cb *rcu_fwd_cb_head; static struct rcu_fwd_cb **rcu_fwd_cb_tail = &rcu_fwd_cb_head; static long n_launders_cb; static unsigned long rcu_fwd_startat; +static bool rcu_fwd_emergency_stop; #define MAX_FWD_CB_JIFFIES (8 * HZ) /* Maximum CB test duration. */ #define MIN_FWD_CB_LAUNDERS 3 /* This many CB invocations to count. */ #define MIN_FWD_CBS_LAUNDERED 100 /* Number of counted CBs. */ @@ -1681,7 +1683,8 @@ static void rcu_torture_fwd_prog_nr(int *tested, int *tested_tries) dur = sd4 + torture_random(&trs) % (sd - sd4); WRITE_ONCE(rcu_fwd_startat, jiffies); stopat = rcu_fwd_startat + dur; - while (time_before(jiffies, stopat) && !torture_must_stop()) { + while (time_before(jiffies, stopat) && + !READ_ONCE(rcu_fwd_emergency_stop) && !torture_must_stop()) { idx = cur_ops->readlock(); udelay(10); cur_ops->readunlock(idx); @@ -1689,7 +1692,8 @@ static void rcu_torture_fwd_prog_nr(int *tested, int *tested_tries) cond_resched(); } (*tested_tries)++; - if (!time_before(jiffies, stopat) && !torture_must_stop()) { + if (!time_before(jiffies, stopat) && + !READ_ONCE(rcu_fwd_emergency_stop) && !torture_must_stop()) { (*tested)++; cver = READ_ONCE(rcu_torture_current_version) - cver; gps = rcutorture_seq_diff(cur_ops->get_gp_seq(), gps); @@ -1739,7 +1743,8 @@ static void rcu_torture_fwd_prog_cr(void) n_launders_hist[i] = 0; cver = READ_ONCE(rcu_torture_current_version); gps = cur_ops->get_gp_seq(); - while (time_before(jiffies, stopat) && !torture_must_stop()) { + while (time_before(jiffies, stopat) && + !READ_ONCE(rcu_fwd_emergency_stop) && !torture_must_stop()) { rfcp = READ_ONCE(rcu_fwd_cb_head); rfcpn = NULL; if (rfcp) @@ -1796,6 +1801,23 @@ static void rcu_torture_fwd_prog_cr(void) } } + +/* + * OOM notifier, but this only prints diagnostic information for the + * current forward-progress test. + */ +static int rcutorture_oom_notify(struct notifier_block *self, + unsigned long notused, void *nfreed) +{ + rcu_fwd_progress_check(1 + (jiffies - READ_ONCE(rcu_fwd_startat) / 2)); + WRITE_ONCE(rcu_fwd_emergency_stop, true); + return NOTIFY_OK; +} + +static struct notifier_block rcutorture_oom_nb = { + .notifier_call = rcutorture_oom_notify +}; + /* Carry out grace-period forward-progress testing. */ static int rcu_torture_fwd_prog(void *args) { @@ -1808,8 +1830,11 @@ static int rcu_torture_fwd_prog(void *args) set_user_nice(current, MAX_NICE); do { schedule_timeout_interruptible(fwd_progress_holdoff * HZ); + WRITE_ONCE(rcu_fwd_emergency_stop, false); + register_oom_notifier(&rcutorture_oom_nb); rcu_torture_fwd_prog_nr(&tested, &tested_tries); rcu_torture_fwd_prog_cr(); + unregister_oom_notifier(&rcutorture_oom_nb); /* Avoid slow periods, better to test when busy. */ stutter_wait("rcu_torture_fwd_prog"); diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 6ec3abbe90e2..853b79a6ff10 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -2657,6 +2657,26 @@ rcu_check_gp_start_stall(struct rcu_node *rnp, struct rcu_data *rdp, raw_spin_unlock_irqrestore_rcu_node(rnp, flags); } +/* + * Do a forward-progress check for rcutorture. This is normally invoked + * due to an OOM event. The argument "j" gives the time period during + * which rcutorture would like progress to have been made. + */ +void rcu_fwd_progress_check(unsigned long j) +{ + struct rcu_data *rdp; + + if (rcu_gp_in_progress()) { + show_rcu_gp_kthreads(); + } else { + preempt_disable(); + rdp = this_cpu_ptr(&rcu_data); + rcu_check_gp_start_stall(rdp->mynode, rdp, j); + preempt_enable(); + } +} +EXPORT_SYMBOL_GPL(rcu_fwd_progress_check); + /* * This does the RCU core processing work for the specified rcu_data * structures. This may be called only from the CPU to whom the rdp -- cgit v1.3-7-g2ca7 From 903ee83d91776bc72d856147743687d4b6c99286 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 2 Oct 2018 16:05:46 -0700 Subject: rcu: Account for nocb-CPU callback counts in RCU CPU stall warnings The RCU CPU stall warnings print an estimate of the total number of RCU callbacks queued in the system, but this estimate leaves out the callbacks queued for nocbs CPUs. This commit therefore introduces rcu_get_n_cbs_cpu(), which gives an accurate callback estimate for both nocbs and normal CPUs, and uses this new function as needed. This commit also introduces a rcu_get_n_cbs_nocb_cpu() helper function that returns the number of callbacks for nocbs CPUs or zero otherwise, and also uses this function in place of direct access to ->nocb_q_count while in the area (fewer characters, you see). Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 19 +++++++++++++++---- kernel/rcu/tree.h | 1 + kernel/rcu/tree_plugin.h | 24 +++++++++++++++++++----- 3 files changed, 35 insertions(+), 9 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 853b79a6ff10..0933d8650890 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -207,6 +207,19 @@ static int rcu_gp_in_progress(void) return rcu_seq_state(rcu_seq_current(&rcu_state.gp_seq)); } +/* + * Return the number of callbacks queued on the specified CPU. + * Handles both the nocbs and normal cases. + */ +static long rcu_get_n_cbs_cpu(int cpu) +{ + struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu); + + if (rcu_segcblist_is_enabled(&rdp->cblist)) /* Online normal CPU? */ + return rcu_segcblist_n_cbs(&rdp->cblist); + return rcu_get_n_cbs_nocb_cpu(rdp); /* Works for offline, too. */ +} + void rcu_softirq_qs(void) { rcu_qs(); @@ -1265,8 +1278,7 @@ static void print_other_cpu_stall(unsigned long gp_seq) print_cpu_stall_info_end(); for_each_possible_cpu(cpu) - totqlen += rcu_segcblist_n_cbs(&per_cpu_ptr(&rcu_data, - cpu)->cblist); + totqlen += rcu_get_n_cbs_cpu(cpu); pr_cont("(detected by %d, t=%ld jiffies, g=%ld, q=%lu)\n", smp_processor_id(), (long)(jiffies - rcu_state.gp_start), (long)rcu_seq_current(&rcu_state.gp_seq), totqlen); @@ -1326,8 +1338,7 @@ static void print_cpu_stall(void) raw_spin_unlock_irqrestore_rcu_node(rdp->mynode, flags); print_cpu_stall_info_end(); for_each_possible_cpu(cpu) - totqlen += rcu_segcblist_n_cbs(&per_cpu_ptr(&rcu_data, - cpu)->cblist); + totqlen += rcu_get_n_cbs_cpu(cpu); pr_cont(" (t=%lu jiffies g=%ld q=%lu)\n", jiffies - rcu_state.gp_start, (long)rcu_seq_current(&rcu_state.gp_seq), totqlen); diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index c3e2807a834a..a8f82b7dc5e2 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -455,6 +455,7 @@ static void __init rcu_spawn_nocb_kthreads(void); static void __init rcu_organize_nocb_kthreads(void); #endif /* #ifdef CONFIG_RCU_NOCB_CPU */ static bool init_nocb_callback_list(struct rcu_data *rdp); +static unsigned long rcu_get_n_cbs_nocb_cpu(struct rcu_data *rdp); static void rcu_bind_gp_kthread(void); static bool rcu_nohz_full_cpu(void); static void rcu_dynticks_task_enter(void); diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 25fe26c4e9c1..1b3dd2fc0cd6 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -2011,7 +2011,7 @@ static bool rcu_nocb_cpu_needs_barrier(int cpu) * (if a callback is in fact needed). This is associated with an * atomic_inc() in the caller. */ - ret = atomic_long_read(&rdp->nocb_q_count); + ret = rcu_get_n_cbs_nocb_cpu(rdp); #ifdef CONFIG_PROVE_RCU rhp = READ_ONCE(rdp->nocb_head); @@ -2066,7 +2066,7 @@ static void __call_rcu_nocb_enqueue(struct rcu_data *rdp, TPS("WakeNotPoll")); return; } - len = atomic_long_read(&rdp->nocb_q_count); + len = rcu_get_n_cbs_nocb_cpu(rdp); if (old_rhpp == &rdp->nocb_head) { if (!irqs_disabled_flags(flags)) { /* ... if queue was empty ... */ @@ -2115,11 +2115,11 @@ static bool __call_rcu_nocb(struct rcu_data *rdp, struct rcu_head *rhp, trace_rcu_kfree_callback(rcu_state.name, rhp, (unsigned long)rhp->func, -atomic_long_read(&rdp->nocb_q_count_lazy), - -atomic_long_read(&rdp->nocb_q_count)); + -rcu_get_n_cbs_nocb_cpu(rdp)); else trace_rcu_callback(rcu_state.name, rhp, -atomic_long_read(&rdp->nocb_q_count_lazy), - -atomic_long_read(&rdp->nocb_q_count)); + -rcu_get_n_cbs_nocb_cpu(rdp)); /* * If called from an extended quiescent state with interrupts @@ -2343,7 +2343,7 @@ static int rcu_nocb_kthread(void *arg) /* Each pass through the following loop invokes a callback. */ trace_rcu_batch_start(rcu_state.name, atomic_long_read(&rdp->nocb_q_count_lazy), - atomic_long_read(&rdp->nocb_q_count), -1); + rcu_get_n_cbs_nocb_cpu(rdp), -1); c = cl = 0; while (list) { next = list->next; @@ -2614,6 +2614,15 @@ void rcu_bind_current_to_nocb(void) } EXPORT_SYMBOL_GPL(rcu_bind_current_to_nocb); +/* + * Return the number of RCU callbacks still queued from the specified + * CPU, which must be a nocbs CPU. + */ +static unsigned long rcu_get_n_cbs_nocb_cpu(struct rcu_data *rdp) +{ + return atomic_long_read(&rdp->nocb_q_count); +} + #else /* #ifdef CONFIG_RCU_NOCB_CPU */ static bool rcu_nocb_cpu_needs_barrier(int cpu) @@ -2674,6 +2683,11 @@ static bool init_nocb_callback_list(struct rcu_data *rdp) return false; } +static unsigned long rcu_get_n_cbs_nocb_cpu(struct rcu_data *rdp) +{ + return 0; +} + #endif /* #else #ifdef CONFIG_RCU_NOCB_CPU */ /* -- cgit v1.3-7-g2ca7 From bfcfcffc5f23e76a1b88f7412ee7efaec5107b28 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 2 Oct 2018 17:09:56 -0700 Subject: rcu: Print per-CPU callback counts for forward-progress failures This commit prints out the non-zero per-CPU callback counts when a forware-progress error (OOM event) occurs. Signed-off-by: Paul E. McKenney [ paulmck: Fix a pair of uninitialized locals spotted by kbuild test robot. ] --- kernel/rcu/tree.c | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+) (limited to 'kernel') diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 0933d8650890..90a67c75a447 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -2675,6 +2675,10 @@ rcu_check_gp_start_stall(struct rcu_node *rnp, struct rcu_data *rdp, */ void rcu_fwd_progress_check(unsigned long j) { + unsigned long cbs; + int cpu; + unsigned long max_cbs = 0; + int max_cpu = -1; struct rcu_data *rdp; if (rcu_gp_in_progress()) { @@ -2685,6 +2689,20 @@ void rcu_fwd_progress_check(unsigned long j) rcu_check_gp_start_stall(rdp->mynode, rdp, j); preempt_enable(); } + for_each_possible_cpu(cpu) { + cbs = rcu_get_n_cbs_cpu(cpu); + if (!cbs) + continue; + if (max_cpu < 0) + pr_info("%s: callbacks", __func__); + pr_cont(" %d: %lu", cpu, cbs); + if (cbs <= max_cbs) + continue; + max_cbs = cbs; + max_cpu = cpu; + } + if (max_cpu >= 0) + pr_cont("\n"); } EXPORT_SYMBOL_GPL(rcu_fwd_progress_check); -- cgit v1.3-7-g2ca7 From 8dd3b54689d9a2103c0817f2b7adc51760a45551 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 3 Oct 2018 11:02:05 -0700 Subject: rcutorture: Print GP age upon forward-progress failure Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 2 ++ 1 file changed, 2 insertions(+) (limited to 'kernel') diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 90a67c75a447..f91631541965 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -2682,6 +2682,8 @@ void rcu_fwd_progress_check(unsigned long j) struct rcu_data *rdp; if (rcu_gp_in_progress()) { + pr_info("%s: GP age %lu jiffies\n", + __func__, jiffies - rcu_state.gp_start); show_rcu_gp_kthreads(); } else { preempt_disable(); -- cgit v1.3-7-g2ca7 From 1a682754c7ed9df213069d5a0d3981f8360a32c2 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 3 Oct 2018 12:33:41 -0700 Subject: rcutorture: Print histogram of CB invocation at OOM time One reason why a forward-progress test might fail would be if something prevented or delayed callback invocation. This commit therefore adds a callback-invocation histogram printout when OOM is reported to rcutorture. Signed-off-by: Paul E. McKenney --- kernel/rcu/rcutorture.c | 24 ++++++++++++++++-------- 1 file changed, 16 insertions(+), 8 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index f28b88ecb47a..329f4fb13125 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -1631,6 +1631,20 @@ static bool rcu_fwd_emergency_stop; #define MIN_FWD_CBS_LAUNDERED 100 /* Number of counted CBs. */ static long n_launders_hist[2 * MAX_FWD_CB_JIFFIES / HZ]; +static void rcu_torture_fwd_cb_hist(void) +{ + int i; + int j; + + for (i = ARRAY_SIZE(n_launders_hist) - 1; i > 0; i--) + if (n_launders_hist[i] > 0) + break; + pr_alert("%s: Callback-invocation histogram:", __func__); + for (j = 0; j <= i; j++) + pr_cont(" %ds: %ld", j + 1, n_launders_hist[j]); + pr_cont("\n"); +} + /* Callback function for continuous-flood RCU callbacks. */ static void rcu_torture_fwd_cb_cr(struct rcu_head *rhp) { @@ -1718,7 +1732,6 @@ static void rcu_torture_fwd_prog_cr(void) unsigned long cver; unsigned long gps; int i; - int j; long n_launders; long n_launders_cb_snap; long n_launders_sa; @@ -1791,13 +1804,7 @@ static void rcu_torture_fwd_prog_cr(void) n_launders + n_max_cbs - n_launders_cb_snap, n_launders, n_launders_sa, n_max_gps, n_max_cbs, cver, gps); - for (i = ARRAY_SIZE(n_launders_hist) - 1; i > 0; i--) - if (n_launders_hist[i] > 0) - break; - pr_alert("Callback-invocation histogram:"); - for (j = 0; j <= i; j++) - pr_cont(" %ds: %ld", j + 1, n_launders_hist[j]); - pr_cont("\n"); + rcu_torture_fwd_cb_hist(); } } @@ -1809,6 +1816,7 @@ static void rcu_torture_fwd_prog_cr(void) static int rcutorture_oom_notify(struct notifier_block *self, unsigned long notused, void *nfreed) { + rcu_torture_fwd_cb_hist(); rcu_fwd_progress_check(1 + (jiffies - READ_ONCE(rcu_fwd_startat) / 2)); WRITE_ONCE(rcu_fwd_emergency_stop, true); return NOTIFY_OK; -- cgit v1.3-7-g2ca7 From c51d7b5e6c94aa6b554c27bd2b0eb64ebef02334 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 3 Oct 2018 17:25:33 -0700 Subject: rcutorture: Print time since GP end upon forward-progress failure If rcutorture's forward-progress tests fail while a grace period is not in progress, it is useful to print the time since the last grace period ended as a way to detect failure to launch a new grace period. This commit therefore makes this change. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 5 ++++- kernel/rcu/tree.h | 2 ++ 2 files changed, 6 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index f91631541965..9180158756d2 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -2000,7 +2000,8 @@ static void rcu_gp_cleanup(void) WRITE_ONCE(rcu_state.gp_activity, jiffies); raw_spin_lock_irq_rcu_node(rnp); - gp_duration = jiffies - rcu_state.gp_start; + rcu_state.gp_end = jiffies; + gp_duration = rcu_state.gp_end - rcu_state.gp_start; if (gp_duration > rcu_state.gp_max) rcu_state.gp_max = gp_duration; @@ -2686,6 +2687,8 @@ void rcu_fwd_progress_check(unsigned long j) __func__, jiffies - rcu_state.gp_start); show_rcu_gp_kthreads(); } else { + pr_info("%s: Last GP end %lu jiffies ago\n", + __func__, jiffies - rcu_state.gp_end); preempt_disable(); rdp = this_cpu_ptr(&rcu_data); rcu_check_gp_start_stall(rdp->mynode, rdp, j); diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index a8f82b7dc5e2..d90b02b53c0e 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -328,6 +328,8 @@ struct rcu_state { /* force_quiescent_state(). */ unsigned long gp_start; /* Time at which GP started, */ /* but in jiffies. */ + unsigned long gp_end; /* Time last GP ended, again */ + /* in jiffies. */ unsigned long gp_activity; /* Time of last GP kthread */ /* activity in jiffies. */ unsigned long gp_req_activity; /* Time of last GP request */ -- cgit v1.3-7-g2ca7 From 73d665b1410afae405309ad4475a98924776ab13 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 4 Oct 2018 10:54:22 -0700 Subject: rcutorture: Print forward-progress test age upon failure This commit prints the age of the forward-progress test in jiffies, in order to allow better interpretation of the callback-invocation histograms. Signed-off-by: Paul E. McKenney --- kernel/rcu/rcutorture.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index 329f4fb13125..080b5ac6340c 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -1639,7 +1639,8 @@ static void rcu_torture_fwd_cb_hist(void) for (i = ARRAY_SIZE(n_launders_hist) - 1; i > 0; i--) if (n_launders_hist[i] > 0) break; - pr_alert("%s: Callback-invocation histogram:", __func__); + pr_alert("%s: Callback-invocation histogram (duration %lu jiffies):", + __func__, jiffies - rcu_fwd_startat); for (j = 0; j <= i; j++) pr_cont(" %ds: %ld", j + 1, n_launders_hist[j]); pr_cont("\n"); -- cgit v1.3-7-g2ca7 From 2667ccce9328e4e25ed77a83291c066d5e11e65a Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Fri, 5 Oct 2018 09:09:49 -0700 Subject: rcutorture: Recover from OOM during forward-progress tests This commit causes the OOM handler to do rcu_barrier() calls and to free up forward-progress callbacks in order to recover from OOM events. The current test is terminated, but subsequent forward-progress tests can proceed. This allows a long test to result in multiple forward-progress failures, greatly reducing the required testing time. Signed-off-by: Paul E. McKenney --- kernel/rcu/rcutorture.c | 60 ++++++++++++++++++++++++++++++++++++++++--------- 1 file changed, 49 insertions(+), 11 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index 080b5ac6340c..afa98162575d 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -1649,13 +1649,14 @@ static void rcu_torture_fwd_cb_hist(void) /* Callback function for continuous-flood RCU callbacks. */ static void rcu_torture_fwd_cb_cr(struct rcu_head *rhp) { + unsigned long flags; int i; struct rcu_fwd_cb *rfcp = container_of(rhp, struct rcu_fwd_cb, rh); struct rcu_fwd_cb **rfcpp; rfcp->rfc_next = NULL; rfcp->rfc_gps++; - spin_lock(&rcu_fwd_lock); + spin_lock_irqsave(&rcu_fwd_lock, flags); rfcpp = rcu_fwd_cb_tail; rcu_fwd_cb_tail = &rfcp->rfc_next; WRITE_ONCE(*rfcpp, rfcp); @@ -1664,7 +1665,33 @@ static void rcu_torture_fwd_cb_cr(struct rcu_head *rhp) if (i >= ARRAY_SIZE(n_launders_hist)) i = ARRAY_SIZE(n_launders_hist) - 1; n_launders_hist[i]++; - spin_unlock(&rcu_fwd_lock); + spin_unlock_irqrestore(&rcu_fwd_lock, flags); +} + +/* + * Free all callbacks on the rcu_fwd_cb_head list, either because the + * test is over or because we hit an OOM event. + */ +static unsigned long rcu_torture_fwd_prog_cbfree(void) +{ + unsigned long flags; + unsigned long freed = 0; + struct rcu_fwd_cb *rfcp; + + for (;;) { + spin_lock_irqsave(&rcu_fwd_lock, flags); + rfcp = rcu_fwd_cb_head; + if (!rfcp) + break; + rcu_fwd_cb_head = rfcp->rfc_next; + if (!rcu_fwd_cb_head) + rcu_fwd_cb_tail = &rcu_fwd_cb_head; + spin_unlock_irqrestore(&rcu_fwd_lock, flags); + kfree(rfcp); + freed++; + } + spin_unlock_irqrestore(&rcu_fwd_lock, flags); + return freed; } /* Carry out need_resched()/cond_resched() forward-progress testing. */ @@ -1743,6 +1770,9 @@ static void rcu_torture_fwd_prog_cr(void) unsigned long stopat; unsigned long stoppedat; + if (READ_ONCE(rcu_fwd_emergency_stop)) + return; /* Get out of the way quickly, no GP wait! */ + /* Loop continuously posting RCU callbacks. */ WRITE_ONCE(rcu_fwd_cb_nodelay, true); cur_ops->sync(); /* Later readers see above write. */ @@ -1788,16 +1818,10 @@ static void rcu_torture_fwd_prog_cr(void) cver = READ_ONCE(rcu_torture_current_version) - cver; gps = rcutorture_seq_diff(cur_ops->get_gp_seq(), gps); cur_ops->cb_barrier(); /* Wait for callbacks to be invoked. */ - for (;;) { - rfcp = rcu_fwd_cb_head; - if (!rfcp) - break; - rcu_fwd_cb_head = rfcp->rfc_next; - kfree(rfcp); - } - rcu_fwd_cb_tail = &rcu_fwd_cb_head; + (void)rcu_torture_fwd_prog_cbfree(); + WRITE_ONCE(rcu_fwd_cb_nodelay, false); - if (!torture_must_stop()) { + if (!torture_must_stop() && !READ_ONCE(rcu_fwd_emergency_stop)) { WARN_ON(n_max_gps < MIN_FWD_CBS_LAUNDERED); pr_alert("%s Duration %lu barrier: %lu pending %ld n_launders: %ld n_launders_sa: %ld n_max_gps: %ld n_max_cbs: %ld cver %ld gps %ld\n", __func__, @@ -1817,9 +1841,23 @@ static void rcu_torture_fwd_prog_cr(void) static int rcutorture_oom_notify(struct notifier_block *self, unsigned long notused, void *nfreed) { + WARN(1, "%s invoked upon OOM during forward-progress testing.\n", + __func__); rcu_torture_fwd_cb_hist(); rcu_fwd_progress_check(1 + (jiffies - READ_ONCE(rcu_fwd_startat) / 2)); WRITE_ONCE(rcu_fwd_emergency_stop, true); + smp_mb(); /* Emergency stop before free and wait to avoid hangs. */ + pr_info("%s: Freed %lu RCU callbacks.\n", + __func__, rcu_torture_fwd_prog_cbfree()); + rcu_barrier(); + pr_info("%s: Freed %lu RCU callbacks.\n", + __func__, rcu_torture_fwd_prog_cbfree()); + rcu_barrier(); + pr_info("%s: Freed %lu RCU callbacks.\n", + __func__, rcu_torture_fwd_prog_cbfree()); + smp_mb(); /* Frees before return to avoid redoing OOM. */ + (*(unsigned long *)nfreed)++; /* Forward progress CBs freed! */ + pr_info("%s returning after OOM processing.\n", __func__); return NOTIFY_OK; } -- cgit v1.3-7-g2ca7 From 2e57bf97a6856f2dc10fb4377c452cb08f844047 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Fri, 5 Oct 2018 16:43:09 -0700 Subject: rcutorture: Use 100ms buckets for forward-progress callback histograms This commit narrows the scope of each bucket of the forward-progress callback-invocation histograms from one second to 100 milliseconds, which aids debugging of forward-progress problems by making shorter-duration callback-invocation stalls visible. Signed-off-by: Paul E. McKenney --- kernel/rcu/rcutorture.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index afa98162575d..a4c4a24bdcaa 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -1629,7 +1629,8 @@ static bool rcu_fwd_emergency_stop; #define MAX_FWD_CB_JIFFIES (8 * HZ) /* Maximum CB test duration. */ #define MIN_FWD_CB_LAUNDERS 3 /* This many CB invocations to count. */ #define MIN_FWD_CBS_LAUNDERED 100 /* Number of counted CBs. */ -static long n_launders_hist[2 * MAX_FWD_CB_JIFFIES / HZ]; +#define FWD_CBS_HIST_DIV 10 /* Histogram buckets/second. */ +static long n_launders_hist[2 * MAX_FWD_CB_JIFFIES / (HZ / FWD_CBS_HIST_DIV)]; static void rcu_torture_fwd_cb_hist(void) { @@ -1642,7 +1643,8 @@ static void rcu_torture_fwd_cb_hist(void) pr_alert("%s: Callback-invocation histogram (duration %lu jiffies):", __func__, jiffies - rcu_fwd_startat); for (j = 0; j <= i; j++) - pr_cont(" %ds: %ld", j + 1, n_launders_hist[j]); + pr_cont(" %ds/%d: %ld", + j + 1, FWD_CBS_HIST_DIV, n_launders_hist[j]); pr_cont("\n"); } @@ -1661,7 +1663,7 @@ static void rcu_torture_fwd_cb_cr(struct rcu_head *rhp) rcu_fwd_cb_tail = &rfcp->rfc_next; WRITE_ONCE(*rfcpp, rfcp); WRITE_ONCE(n_launders_cb, n_launders_cb + 1); - i = ((jiffies - rcu_fwd_startat) / HZ); + i = ((jiffies - rcu_fwd_startat) / (HZ / FWD_CBS_HIST_DIV)); if (i >= ARRAY_SIZE(n_launders_hist)) i = ARRAY_SIZE(n_launders_hist) - 1; n_launders_hist[i]++; -- cgit v1.3-7-g2ca7 From 5ac7cdc29897e5fc3f5e214f3f8c8b03ef8d7029 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 16 Oct 2018 05:46:58 -0700 Subject: rcutorture: Don't do busted forward-progress testing The "busted" rcutorture type is an intentionally broken implementation of RCU. Doing forward-progress testing on this implementation is not particularly meaningful on the one hand and can result in fatal abuse of the memory allocator on the other. This commit therefore disables forward-progress testing of the "busted" rcutorture type. Reported-by: kernel test robot Signed-off-by: Paul E. McKenney --- kernel/rcu/rcutorture.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index a4c4a24bdcaa..f6e85faa4ff4 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -1900,7 +1900,8 @@ static int __init rcu_torture_fwd_prog_init(void) { if (!fwd_progress) return 0; /* Not requested, so don't do it. */ - if (!cur_ops->stall_dur || cur_ops->stall_dur() <= 0) { + if (!cur_ops->stall_dur || cur_ops->stall_dur() <= 0 || + cur_ops == &rcu_busted_ops) { VERBOSE_TOROUT_STRING("rcu_torture_fwd_prog_init: Disabled, unsupported by RCU flavor under test"); return 0; } -- cgit v1.3-7-g2ca7 From 5482e9a93c83839f94e75db712e6837e6a39962c Mon Sep 17 00:00:00 2001 From: Martin KaFai Lau Date: Sat, 1 Dec 2018 17:08:44 -0800 Subject: bpf: Fix memleak in aux->func_info and aux->btf The aux->func_info and aux->btf are leaked in the error out cases during bpf_prog_load(). This patch fixes it. Fixes: ba64e7d85252 ("bpf: btf: support proper non-jit func info") Cc: Yonghong Song Signed-off-by: Martin KaFai Lau Acked-by: Yonghong Song Signed-off-by: Alexei Starovoitov --- kernel/bpf/syscall.c | 2 ++ 1 file changed, 2 insertions(+) (limited to 'kernel') diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index f9554d9a14e1..4445d0d084d8 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -1560,6 +1560,8 @@ static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr) return err; free_used_maps: + kvfree(prog->aux->func_info); + btf_put(prog->aux->btf); bpf_prog_kallsyms_del_subprogs(prog); free_used_maps(prog->aux); free_prog: -- cgit v1.3-7-g2ca7 From fca0c116504e1c5c72963bf28b1ede0932228fef Mon Sep 17 00:00:00 2001 From: Ingo Molnar Date: Mon, 3 Dec 2018 10:52:21 +0100 Subject: perf: Fix typos in comments Fix two typos in kernel/events/*. No change in functionality intended. Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Linus Torvalds Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar --- kernel/events/core.c | 2 +- kernel/events/hw_breakpoint.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/events/core.c b/kernel/events/core.c index 84530ab358c3..a2a6e0b4a881 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -5541,7 +5541,7 @@ out_put: static const struct vm_operations_struct perf_mmap_vmops = { .open = perf_mmap_open, - .close = perf_mmap_close, /* non mergable */ + .close = perf_mmap_close, /* non mergeable */ .fault = perf_mmap_fault, .page_mkwrite = perf_mmap_fault, }; diff --git a/kernel/events/hw_breakpoint.c b/kernel/events/hw_breakpoint.c index d6b56180827c..5befb338a18d 100644 --- a/kernel/events/hw_breakpoint.c +++ b/kernel/events/hw_breakpoint.c @@ -238,7 +238,7 @@ __weak void arch_unregister_hw_breakpoint(struct perf_event *bp) } /* - * Contraints to check before allowing this new breakpoint counter: + * Constraints to check before allowing this new breakpoint counter: * * == Non-pinned counter == (Considered as pinned for now) * -- cgit v1.3-7-g2ca7 From dfcb245e28481256a10a9133441baf2a93d26642 Mon Sep 17 00:00:00 2001 From: Ingo Molnar Date: Mon, 3 Dec 2018 10:05:56 +0100 Subject: sched: Fix various typos in comments Go over the scheduler source code and fix common typos in comments - and a typo in an actual variable name. No change in functionality intended. Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Linus Torvalds Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar --- include/linux/sched.h | 4 ++-- include/linux/sched/isolation.h | 4 ++-- include/linux/sched/mm.h | 2 +- include/linux/sched/stat.h | 2 +- kernel/sched/core.c | 2 +- kernel/sched/cputime.c | 2 +- kernel/sched/deadline.c | 2 +- kernel/sched/fair.c | 8 ++++---- kernel/sched/isolation.c | 14 +++++++------- kernel/sched/sched.h | 4 ++-- 10 files changed, 22 insertions(+), 22 deletions(-) (limited to 'kernel') diff --git a/include/linux/sched.h b/include/linux/sched.h index 291a9bd5b97f..b8c7ba0e3796 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -176,7 +176,7 @@ struct task_group; * TASK_RUNNING store which can collide with __set_current_state(TASK_RUNNING). * * However, with slightly different timing the wakeup TASK_RUNNING store can - * also collide with the TASK_UNINTERRUPTIBLE store. Loosing that store is not + * also collide with the TASK_UNINTERRUPTIBLE store. Losing that store is not * a problem either because that will result in one extra go around the loop * and our @cond test will save the day. * @@ -515,7 +515,7 @@ struct sched_dl_entity { /* * Actual scheduling parameters. Initialized with the values above, - * they are continously updated during task execution. Note that + * they are continuously updated during task execution. Note that * the remaining runtime could be < 0 in case we are in overrun. */ s64 runtime; /* Remaining runtime for this instance */ diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h index 4a6582c27dea..b0fb1446fe04 100644 --- a/include/linux/sched/isolation.h +++ b/include/linux/sched/isolation.h @@ -16,7 +16,7 @@ enum hk_flags { }; #ifdef CONFIG_CPU_ISOLATION -DECLARE_STATIC_KEY_FALSE(housekeeping_overriden); +DECLARE_STATIC_KEY_FALSE(housekeeping_overridden); extern int housekeeping_any_cpu(enum hk_flags flags); extern const struct cpumask *housekeeping_cpumask(enum hk_flags flags); extern void housekeeping_affine(struct task_struct *t, enum hk_flags flags); @@ -43,7 +43,7 @@ static inline void housekeeping_init(void) { } static inline bool housekeeping_cpu(int cpu, enum hk_flags flags) { #ifdef CONFIG_CPU_ISOLATION - if (static_branch_unlikely(&housekeeping_overriden)) + if (static_branch_unlikely(&housekeeping_overridden)) return housekeeping_test_cpu(cpu, flags); #endif return true; diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h index aebb370a0006..3bfa6a0cbba4 100644 --- a/include/linux/sched/mm.h +++ b/include/linux/sched/mm.h @@ -153,7 +153,7 @@ static inline gfp_t current_gfp_context(gfp_t flags) { /* * NOIO implies both NOIO and NOFS and it is a weaker context - * so always make sure it makes precendence + * so always make sure it makes precedence */ if (unlikely(current->flags & PF_MEMALLOC_NOIO)) flags &= ~(__GFP_IO | __GFP_FS); diff --git a/include/linux/sched/stat.h b/include/linux/sched/stat.h index f30954cc059d..568286411b43 100644 --- a/include/linux/sched/stat.h +++ b/include/linux/sched/stat.h @@ -8,7 +8,7 @@ * Various counters maintained by the scheduler and fork(), * exposed via /proc, sys.c or used by drivers via these APIs. * - * ( Note that all these values are aquired without locking, + * ( Note that all these values are acquired without locking, * so they can only be relied on in narrow circumstances. ) */ diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 8050f266751a..e4ca15d75541 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2857,7 +2857,7 @@ unsigned long nr_running(void) * preemption, thus the result might have a time-of-check-to-time-of-use * race. The caller is responsible to use it correctly, for example: * - * - from a non-preemptable section (of course) + * - from a non-preemptible section (of course) * * - from a thread that is bound to a single CPU * diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index 0796f938c4f0..ba4a143bdcf3 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -525,7 +525,7 @@ void account_idle_ticks(unsigned long ticks) /* * Perform (stime * rtime) / total, but avoid multiplication overflow by - * loosing precision when the numbers are big. + * losing precision when the numbers are big. */ static u64 scale_stime(u64 stime, u64 rtime, u64 total) { diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 470ba6b464fe..b32bc1f7cd14 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -727,7 +727,7 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se, * refill the runtime and set the deadline a period in the future, * because keeping the current (absolute) deadline of the task would * result in breaking guarantees promised to other tasks (refer to - * Documentation/scheduler/sched-deadline.txt for more informations). + * Documentation/scheduler/sched-deadline.txt for more information). * * This function returns true if: * diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index e30dea59d215..fdc8356ea742 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -703,9 +703,9 @@ void init_entity_runnable_average(struct sched_entity *se) memset(sa, 0, sizeof(*sa)); /* - * Tasks are intialized with full load to be seen as heavy tasks until + * Tasks are initialized with full load to be seen as heavy tasks until * they get a chance to stabilize to their real load level. - * Group entities are intialized with zero load to reflect the fact that + * Group entities are initialized with zero load to reflect the fact that * nothing has been attached to the task group yet. */ if (entity_is_task(se)) @@ -3976,8 +3976,8 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) /* * When dequeuing a sched_entity, we must: * - Update loads to have both entity and cfs_rq synced with now. - * - Substract its load from the cfs_rq->runnable_avg. - * - Substract its previous weight from cfs_rq->load.weight. + * - Subtract its load from the cfs_rq->runnable_avg. + * - Subtract its previous weight from cfs_rq->load.weight. * - For group entity, update its weight to reflect the new share * of its group cfs_rq. */ diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c index e6802181900f..81faddba9e20 100644 --- a/kernel/sched/isolation.c +++ b/kernel/sched/isolation.c @@ -8,14 +8,14 @@ */ #include "sched.h" -DEFINE_STATIC_KEY_FALSE(housekeeping_overriden); -EXPORT_SYMBOL_GPL(housekeeping_overriden); +DEFINE_STATIC_KEY_FALSE(housekeeping_overridden); +EXPORT_SYMBOL_GPL(housekeeping_overridden); static cpumask_var_t housekeeping_mask; static unsigned int housekeeping_flags; int housekeeping_any_cpu(enum hk_flags flags) { - if (static_branch_unlikely(&housekeeping_overriden)) + if (static_branch_unlikely(&housekeeping_overridden)) if (housekeeping_flags & flags) return cpumask_any_and(housekeeping_mask, cpu_online_mask); return smp_processor_id(); @@ -24,7 +24,7 @@ EXPORT_SYMBOL_GPL(housekeeping_any_cpu); const struct cpumask *housekeeping_cpumask(enum hk_flags flags) { - if (static_branch_unlikely(&housekeeping_overriden)) + if (static_branch_unlikely(&housekeeping_overridden)) if (housekeeping_flags & flags) return housekeeping_mask; return cpu_possible_mask; @@ -33,7 +33,7 @@ EXPORT_SYMBOL_GPL(housekeeping_cpumask); void housekeeping_affine(struct task_struct *t, enum hk_flags flags) { - if (static_branch_unlikely(&housekeeping_overriden)) + if (static_branch_unlikely(&housekeeping_overridden)) if (housekeeping_flags & flags) set_cpus_allowed_ptr(t, housekeeping_mask); } @@ -41,7 +41,7 @@ EXPORT_SYMBOL_GPL(housekeeping_affine); bool housekeeping_test_cpu(int cpu, enum hk_flags flags) { - if (static_branch_unlikely(&housekeeping_overriden)) + if (static_branch_unlikely(&housekeeping_overridden)) if (housekeeping_flags & flags) return cpumask_test_cpu(cpu, housekeeping_mask); return true; @@ -53,7 +53,7 @@ void __init housekeeping_init(void) if (!housekeeping_flags) return; - static_branch_enable(&housekeeping_overriden); + static_branch_enable(&housekeeping_overridden); if (housekeeping_flags & HK_FLAG_TICK) sched_tick_offload_init(); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 71cd8b710599..9bde60a11805 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -637,7 +637,7 @@ struct dl_rq { /* * Deadline values of the currently executing and the * earliest ready task on this rq. Caching these facilitates - * the decision wether or not a ready but not running task + * the decision whether or not a ready but not running task * should migrate somewhere else. */ struct { @@ -1434,7 +1434,7 @@ static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu) #ifdef CONFIG_SMP /* * After ->cpu is set up to a new value, task_rq_lock(p, ...) can be - * successfuly executed on another CPU. We must ensure that updates of + * successfully executed on another CPU. We must ensure that updates of * per-task data have been completed by this moment. */ smp_wmb(); -- cgit v1.3-7-g2ca7 From 1e7eacaf1db2f0f5f62fceda4e6c5a8869f00c13 Mon Sep 17 00:00:00 2001 From: YueHaibing Date: Sat, 1 Dec 2018 03:12:56 +0000 Subject: cpuset: Remove set but not used variable 'cs' Fixes gcc '-Wunused-but-set-variable' warning: kernel/cgroup/cpuset.c: In function 'cpuset_cancel_attach': kernel/cgroup/cpuset.c:2167:17: warning: variable 'cs' set but not used [-Wunused-but-set-variable] It never used since introduction in commit 1f7dd3e5a6e4 ("cgroup: fix handling of multi-destination migration from subtree_control enabling") Signed-off-by: YueHaibing Signed-off-by: Tejun Heo --- kernel/cgroup/cpuset.c | 2 -- 1 file changed, 2 deletions(-) (limited to 'kernel') diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index 1151e93d71b6..f0decd8165e7 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -2109,10 +2109,8 @@ out_unlock: static void cpuset_cancel_attach(struct cgroup_taskset *tset) { struct cgroup_subsys_state *css; - struct cpuset *cs; cgroup_taskset_first(tset, &css); - cs = css_cs(css); mutex_lock(&cpuset_mutex); css_cs(css)->attach_in_progress--; -- cgit v1.3-7-g2ca7 From 9a547c7e575fc2501c12081558fda3027d0f2a5e Mon Sep 17 00:00:00 2001 From: Richard Guy Briggs Date: Fri, 30 Nov 2018 16:13:16 -0500 Subject: audit: shorten PATH cap values when zero Since the vast majority of files (99.993% on a typical system) have no fcaps, display "0" instead of the full zero-padded 16 hex digits in the two PATH record cap_f* fields to save netlink bandwidth and disk space. Simply changing the format to %x won't work since the value is two (or possibly more in the future) 32-bit hexadecimal values concatenated and bits in higher order values will be misrepresented. Passes audit-testsuite and userspace tools already work fine. Please see the github issue tracker for more details https://github.com/linux-audit/audit-kernel/issues/101 Signed-off-by: Richard Guy Briggs Acked-by: Steve Grubb Signed-off-by: Paul Moore --- kernel/audit.c | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/audit.c b/kernel/audit.c index 779671883349..a0a4544e69ca 100644 --- a/kernel/audit.c +++ b/kernel/audit.c @@ -2059,11 +2059,13 @@ void audit_log_cap(struct audit_buffer *ab, char *prefix, kernel_cap_t *cap) { int i; - audit_log_format(ab, " %s=", prefix); - CAP_FOR_EACH_U32(i) { - audit_log_format(ab, "%08x", - cap->cap[CAP_LAST_U32 - i]); + if (cap_isclear(*cap)) { + audit_log_format(ab, " %s=0", prefix); + return; } + audit_log_format(ab, " %s=", prefix); + CAP_FOR_EACH_U32(i) + audit_log_format(ab, "%08x", cap->cap[CAP_LAST_U32 - i]); } static void audit_log_fcaps(struct audit_buffer *ab, struct audit_names *name) -- cgit v1.3-7-g2ca7 From ce10a5b3954f2514af726beb78ed8d7350c5e41c Mon Sep 17 00:00:00 2001 From: Bart Van Assche Date: Wed, 28 Nov 2018 15:43:09 -0800 Subject: timekeeping: Use proper seqcount initializer tk_core.seq is initialized open coded, but that misses to initialize the lockdep map when lockdep is enabled. Lockdep splats involving tk_core seq consequently lack a name and are hard to read. Use the proper initializer which takes care of the lockdep map initialization. [ tglx: Massaged changelog ] Signed-off-by: Bart Van Assche Signed-off-by: Thomas Gleixner Cc: peterz@infradead.org Cc: tj@kernel.org Cc: johannes.berg@intel.com Link: https://lkml.kernel.org/r/20181128234325.110011-12-bvanassche@acm.org --- kernel/time/timekeeping.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c index cd02bd38cf2d..c801e25875a3 100644 --- a/kernel/time/timekeeping.c +++ b/kernel/time/timekeeping.c @@ -45,7 +45,9 @@ enum timekeeping_adv_mode { static struct { seqcount_t seq; struct timekeeper timekeeper; -} tk_core ____cacheline_aligned; +} tk_core ____cacheline_aligned = { + .seq = SEQCNT_ZERO(tk_core.seq), +}; static DEFINE_RAW_SPINLOCK(timekeeper_lock); static struct timekeeper shadow_timekeeper; -- cgit v1.3-7-g2ca7 From a1da439cc0d929891f0f7372c9e03530c809068c Mon Sep 17 00:00:00 2001 From: Marek Szyprowski Date: Wed, 5 Dec 2018 11:14:01 +0100 Subject: dma-mapping: fix lack of DMA address assignment in generic remap allocator Commit bfd56cd60521 ("dma-mapping: support highmem in the generic remap allocator") replaced dma_direct_alloc_pages() with __dma_direct_alloc_pages(), which doesn't set dma_handle and zero allocated memory. Fix it by doing this directly in the caller function. Fixes: bfd56cd60521 ("dma-mapping: support highmem in the generic remap allocator") Signed-off-by: Marek Szyprowski Tested-by: Thierry Reding Signed-off-by: Christoph Hellwig --- kernel/dma/remap.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/dma/remap.c b/kernel/dma/remap.c index 68a64e3ff6a1..8a44317cfc1a 100644 --- a/kernel/dma/remap.c +++ b/kernel/dma/remap.c @@ -223,8 +223,14 @@ void *arch_dma_alloc(struct device *dev, size_t size, dma_addr_t *dma_handle, ret = dma_common_contiguous_remap(page, size, VM_USERMAP, arch_dma_mmap_pgprot(dev, PAGE_KERNEL, attrs), __builtin_return_address(0)); - if (!ret) + if (!ret) { __dma_direct_free_pages(dev, size, page); + return ret; + } + + *dma_handle = phys_to_dma(dev, page_to_phys(page)); + memset(ret, 0, size); + return ret; } -- cgit v1.3-7-g2ca7 From dc002bb62f10c5905420f8b8a7d5ec0da567fc82 Mon Sep 17 00:00:00 2001 From: Ard Biesheuvel Date: Fri, 23 Nov 2018 23:18:03 +0100 Subject: bpf: add __weak hook for allocating executable memory By default, BPF uses module_alloc() to allocate executable memory, but this is not necessary on all arches and potentially undesirable on some of them. So break out the module_alloc() and module_memfree() calls into __weak functions to allow them to be overridden in arch code. Signed-off-by: Ard Biesheuvel Signed-off-by: Daniel Borkmann --- kernel/bpf/core.c | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index f93ed667546f..86817ab204e8 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -623,6 +623,16 @@ static void bpf_jit_uncharge_modmem(u32 pages) atomic_long_sub(pages, &bpf_jit_current); } +void *__weak bpf_jit_alloc_exec(unsigned long size) +{ + return module_alloc(size); +} + +void __weak bpf_jit_free_exec(void *addr) +{ + module_memfree(addr); +} + struct bpf_binary_header * bpf_jit_binary_alloc(unsigned int proglen, u8 **image_ptr, unsigned int alignment, @@ -640,7 +650,7 @@ bpf_jit_binary_alloc(unsigned int proglen, u8 **image_ptr, if (bpf_jit_charge_modmem(pages)) return NULL; - hdr = module_alloc(size); + hdr = bpf_jit_alloc_exec(size); if (!hdr) { bpf_jit_uncharge_modmem(pages); return NULL; @@ -664,7 +674,7 @@ void bpf_jit_binary_free(struct bpf_binary_header *hdr) { u32 pages = hdr->pages; - module_memfree(hdr); + bpf_jit_free_exec(hdr); bpf_jit_uncharge_modmem(pages); } -- cgit v1.3-7-g2ca7 From 7337224fc150b3b762190425399ac0e8dee380d1 Mon Sep 17 00:00:00 2001 From: Martin KaFai Lau Date: Wed, 5 Dec 2018 17:35:43 -0800 Subject: bpf: Improve the info.func_info and info.func_info_rec_size behavior 1) When bpf_dump_raw_ok() == false and the kernel can provide >=1 func_info to the userspace, the current behavior is setting the info.func_info_cnt to 0 instead of setting info.func_info to 0. It is different from the behavior in jited_func_lens/nr_jited_func_lens, jited_ksyms/nr_jited_ksyms...etc. This patch fixes it. (i.e. set func_info to 0 instead of func_info_cnt to 0 when bpf_dump_raw_ok() == false). 2) When the userspace passed in info.func_info_cnt == 0, the kernel will set the expected func_info size back to the info.func_info_rec_size. It is a way for the userspace to learn the kernel expected func_info_rec_size introduced in commit 838e96904ff3 ("bpf: Introduce bpf_func_info"). An exception is the kernel expected size is not set when func_info is not available for a bpf_prog. This makes the returned info.func_info_rec_size has different values depending on the returned value of info.func_info_cnt. This patch sets the kernel expected size to info.func_info_rec_size independent of the info.func_info_cnt. 3) The current logic only rejects invalid func_info_rec_size if func_info_cnt is non zero. This patch also rejects invalid nonzero info.func_info_rec_size and not equal to the kernel expected size. 4) Set info.btf_id as long as prog->aux->btf != NULL. That will setup the later copy_to_user() codes look the same as others which then easier to understand and maintain. prog->aux->btf is not NULL only if prog->aux->func_info_cnt > 0. Breaking up info.btf_id from prog->aux->func_info_cnt is needed for the later line info patch anyway. A similar change is made to bpf_get_prog_name(). Fixes: 838e96904ff3 ("bpf: Introduce bpf_func_info") Signed-off-by: Martin KaFai Lau Acked-by: Yonghong Song Signed-off-by: Alexei Starovoitov --- kernel/bpf/core.c | 2 +- kernel/bpf/syscall.c | 46 ++++++++++++++++++++-------------------------- 2 files changed, 21 insertions(+), 27 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index 86817ab204e8..628b3970a49b 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -410,7 +410,7 @@ static void bpf_get_prog_name(const struct bpf_prog *prog, char *sym) sym = bin2hex(sym, prog->tag, sizeof(prog->tag)); /* prog->aux->name will be ignored if full btf name is available */ - if (prog->aux->btf) { + if (prog->aux->func_info_cnt) { type = btf_type_by_id(prog->aux->btf, prog->aux->func_info[prog->aux->func_idx].type_id); func_name = btf_name_by_offset(prog->aux->btf, type->name_off); diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 4445d0d084d8..aa05aa38f4a8 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -2083,6 +2083,12 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog, return -EFAULT; } + if ((info.func_info_cnt || info.func_info_rec_size) && + info.func_info_rec_size != sizeof(struct bpf_func_info)) + return -EINVAL; + + info.func_info_rec_size = sizeof(struct bpf_func_info); + if (!capable(CAP_SYS_ADMIN)) { info.jited_prog_len = 0; info.xlated_prog_len = 0; @@ -2226,35 +2232,23 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog, } } - if (prog->aux->btf) { - u32 krec_size = sizeof(struct bpf_func_info); - u32 ucnt, urec_size; - + if (prog->aux->btf) info.btf_id = btf_id(prog->aux->btf); - ucnt = info.func_info_cnt; - info.func_info_cnt = prog->aux->func_info_cnt; - urec_size = info.func_info_rec_size; - info.func_info_rec_size = krec_size; - if (ucnt) { - /* expect passed-in urec_size is what the kernel expects */ - if (urec_size != info.func_info_rec_size) - return -EINVAL; - - if (bpf_dump_raw_ok()) { - char __user *user_finfo; - - user_finfo = u64_to_user_ptr(info.func_info); - ucnt = min_t(u32, info.func_info_cnt, ucnt); - if (copy_to_user(user_finfo, prog->aux->func_info, - krec_size * ucnt)) - return -EFAULT; - } else { - info.func_info_cnt = 0; - } + ulen = info.func_info_cnt; + info.func_info_cnt = prog->aux->func_info_cnt; + if (info.func_info_cnt && ulen) { + if (bpf_dump_raw_ok()) { + char __user *user_finfo; + + user_finfo = u64_to_user_ptr(info.func_info); + ulen = min_t(u32, info.func_info_cnt, ulen); + if (copy_to_user(user_finfo, prog->aux->func_info, + info.func_info_rec_size * ulen)) + return -EFAULT; + } else { + info.func_info = 0; } - } else { - info.func_info_cnt = 0; } done: -- cgit v1.3-7-g2ca7 From d30d42e08c76cb9323ec6121190eb026b07f773b Mon Sep 17 00:00:00 2001 From: Martin KaFai Lau Date: Wed, 5 Dec 2018 17:35:44 -0800 Subject: bpf: Change insn_offset to insn_off in bpf_func_info The later patch will introduce "struct bpf_line_info" which has member "line_off" and "file_off" referring back to the string section in btf. The line_"off" and file_"off" are more consistent to the naming convention in btf.h that means "offset" (e.g. name_off in "struct btf_type"). The to-be-added "struct bpf_line_info" also has another member, "insn_off" which is the same as the "insn_offset" in "struct bpf_func_info". Hence, this patch renames "insn_offset" to "insn_off" for "struct bpf_func_info". Signed-off-by: Martin KaFai Lau Acked-by: Yonghong Song Signed-off-by: Alexei Starovoitov --- include/uapi/linux/bpf.h | 2 +- kernel/bpf/verifier.c | 18 +++++++++--------- 2 files changed, 10 insertions(+), 10 deletions(-) (limited to 'kernel') diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index c8e1eeee2c5f..a84fd232d934 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -2991,7 +2991,7 @@ struct bpf_flow_keys { }; struct bpf_func_info { - __u32 insn_offset; + __u32 insn_off; __u32 type_id; }; diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 71988337ac14..7658c61c1a88 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -4707,24 +4707,24 @@ static int check_btf_func(struct bpf_prog *prog, struct bpf_verifier_env *env, goto free_btf; } - /* check insn_offset */ + /* check insn_off */ if (i == 0) { - if (krecord[i].insn_offset) { + if (krecord[i].insn_off) { verbose(env, - "nonzero insn_offset %u for the first func info record", - krecord[i].insn_offset); + "nonzero insn_off %u for the first func info record", + krecord[i].insn_off); ret = -EINVAL; goto free_btf; } - } else if (krecord[i].insn_offset <= prev_offset) { + } else if (krecord[i].insn_off <= prev_offset) { verbose(env, "same or smaller insn offset (%u) than previous func info record (%u)", - krecord[i].insn_offset, prev_offset); + krecord[i].insn_off, prev_offset); ret = -EINVAL; goto free_btf; } - if (env->subprog_info[i].start != krecord[i].insn_offset) { + if (env->subprog_info[i].start != krecord[i].insn_off) { verbose(env, "func_info BTF section doesn't match subprog layout in BPF program\n"); ret = -EINVAL; goto free_btf; @@ -4739,7 +4739,7 @@ static int check_btf_func(struct bpf_prog *prog, struct bpf_verifier_env *env, goto free_btf; } - prev_offset = krecord[i].insn_offset; + prev_offset = krecord[i].insn_off; urecord += urec_size; } @@ -4762,7 +4762,7 @@ static void adjust_btf_func(struct bpf_verifier_env *env) return; for (i = 0; i < env->subprog_cnt; i++) - env->prog->aux->func_info[i].insn_offset = env->subprog_info[i].start; + env->prog->aux->func_info[i].insn_off = env->subprog_info[i].start; } /* check %cur's range satisfies %old's */ -- cgit v1.3-7-g2ca7 From 92a98a2b9f64a8b3c200a7709ceae04d09c39451 Mon Sep 17 00:00:00 2001 From: AKASHI Takahiro Date: Thu, 15 Nov 2018 14:52:41 +0900 Subject: kexec_file: make kexec_image_post_load_cleanup_default() global Change this function from static to global so that arm64 can implement its own arch_kimage_file_post_load_cleanup() later using kexec_image_post_load_cleanup_default(). Signed-off-by: AKASHI Takahiro Acked-by: Dave Young Cc: Vivek Goyal Cc: Baoquan He Signed-off-by: Will Deacon --- include/linux/kexec.h | 1 + kernel/kexec_file.c | 2 +- 2 files changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/include/linux/kexec.h b/include/linux/kexec.h index 9e4e638fb505..49ab758f4d91 100644 --- a/include/linux/kexec.h +++ b/include/linux/kexec.h @@ -143,6 +143,7 @@ extern const struct kexec_file_ops * const kexec_file_loaders[]; int kexec_image_probe_default(struct kimage *image, void *buf, unsigned long buf_len); +int kexec_image_post_load_cleanup_default(struct kimage *image); /** * struct kexec_buf - parameters for finding a place for a buffer in memory diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c index 35cf0ad29718..9ce6672f4fa3 100644 --- a/kernel/kexec_file.c +++ b/kernel/kexec_file.c @@ -76,7 +76,7 @@ void * __weak arch_kexec_kernel_image_load(struct kimage *image) return kexec_image_load_default(image); } -static int kexec_image_post_load_cleanup_default(struct kimage *image) +int kexec_image_post_load_cleanup_default(struct kimage *image) { if (!image->fops || !image->fops->cleanup) return 0; -- cgit v1.3-7-g2ca7 From b6664ba42f1424d2768b605dd60cecc4428d9364 Mon Sep 17 00:00:00 2001 From: AKASHI Takahiro Date: Thu, 15 Nov 2018 14:52:42 +0900 Subject: s390, kexec_file: drop arch_kexec_mem_walk() Since s390 already knows where to locate buffers, calling arch_kexec_mem_walk() has no sense. So we can just drop it as kbuf->mem indicates this while all other architectures sets it to 0 initially. This change is a preparatory work for the next patch, where all the variant memory walks, either on system resource or memblock, will be put in one common place so that it will satisfy all the architectures' need. Signed-off-by: AKASHI Takahiro Reviewed-by: Philipp Rudo Cc: Martin Schwidefsky Cc: Heiko Carstens Cc: Dave Young Cc: Vivek Goyal Cc: Baoquan He Signed-off-by: Will Deacon --- arch/s390/kernel/machine_kexec_file.c | 10 ---------- include/linux/kexec.h | 8 ++++++++ kernel/kexec_file.c | 4 ++++ 3 files changed, 12 insertions(+), 10 deletions(-) (limited to 'kernel') diff --git a/arch/s390/kernel/machine_kexec_file.c b/arch/s390/kernel/machine_kexec_file.c index f413f57f8d20..32023b4f9dc0 100644 --- a/arch/s390/kernel/machine_kexec_file.c +++ b/arch/s390/kernel/machine_kexec_file.c @@ -134,16 +134,6 @@ int kexec_file_add_initrd(struct kimage *image, struct s390_load_data *data, return ret; } -/* - * The kernel is loaded to a fixed location. Turn off kexec_locate_mem_hole - * and provide kbuf->mem by hand. - */ -int arch_kexec_walk_mem(struct kexec_buf *kbuf, - int (*func)(struct resource *, void *)) -{ - return 1; -} - int arch_kexec_apply_relocations_add(struct purgatory_info *pi, Elf_Shdr *section, const Elf_Shdr *relsec, diff --git a/include/linux/kexec.h b/include/linux/kexec.h index 49ab758f4d91..f378cb786f1b 100644 --- a/include/linux/kexec.h +++ b/include/linux/kexec.h @@ -145,6 +145,14 @@ int kexec_image_probe_default(struct kimage *image, void *buf, unsigned long buf_len); int kexec_image_post_load_cleanup_default(struct kimage *image); +/* + * If kexec_buf.mem is set to this value, kexec_locate_mem_hole() + * will try to allocate free memory. Arch may overwrite it. + */ +#ifndef KEXEC_BUF_MEM_UNKNOWN +#define KEXEC_BUF_MEM_UNKNOWN 0 +#endif + /** * struct kexec_buf - parameters for finding a place for a buffer in memory * @image: kexec image in which memory to search. diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c index 9ce6672f4fa3..9e6529da12ed 100644 --- a/kernel/kexec_file.c +++ b/kernel/kexec_file.c @@ -532,6 +532,10 @@ int kexec_locate_mem_hole(struct kexec_buf *kbuf) { int ret; + /* Arch knows where to place */ + if (kbuf->mem != KEXEC_BUF_MEM_UNKNOWN) + return 0; + ret = arch_kexec_walk_mem(kbuf, locate_mem_hole_callback); return ret == 1 ? 0 : -EADDRNOTAVAIL; -- cgit v1.3-7-g2ca7 From 735c2f90e333b3d0adee52a8e7e855a0c0eca284 Mon Sep 17 00:00:00 2001 From: AKASHI Takahiro Date: Thu, 15 Nov 2018 14:52:43 +0900 Subject: powerpc, kexec_file: factor out memblock-based arch_kexec_walk_mem() Memblock list is another source for usable system memory layout. So move powerpc's arch_kexec_walk_mem() to common code so that other memblock-based architectures, particularly arm64, can also utilise it. A moved function is now renamed to kexec_walk_memblock() and integrated into kexec_locate_mem_hole(), which will now be usable for all architectures with no need for overriding arch_kexec_walk_mem(). With this change, arch_kexec_walk_mem() need no longer be a weak function, and was now renamed to kexec_walk_resources(). Since powerpc doesn't support kdump in its kexec_file_load(), the current kexec_walk_memblock() won't work for kdump either in this form, this will be fixed in the next patch. Signed-off-by: AKASHI Takahiro Cc: "Eric W. Biederman" Acked-by: Dave Young Cc: Vivek Goyal Cc: Baoquan He Acked-by: James Morse Signed-off-by: Will Deacon --- arch/powerpc/kernel/machine_kexec_file_64.c | 54 ------------------------- include/linux/kexec.h | 2 - kernel/kexec_file.c | 61 +++++++++++++++++++++++++++-- 3 files changed, 57 insertions(+), 60 deletions(-) (limited to 'kernel') diff --git a/arch/powerpc/kernel/machine_kexec_file_64.c b/arch/powerpc/kernel/machine_kexec_file_64.c index c77e95e9b384..0d20c7ad40fa 100644 --- a/arch/powerpc/kernel/machine_kexec_file_64.c +++ b/arch/powerpc/kernel/machine_kexec_file_64.c @@ -24,7 +24,6 @@ #include #include -#include #include #include #include @@ -46,59 +45,6 @@ int arch_kexec_kernel_image_probe(struct kimage *image, void *buf, return kexec_image_probe_default(image, buf, buf_len); } -/** - * arch_kexec_walk_mem - call func(data) for each unreserved memory block - * @kbuf: Context info for the search. Also passed to @func. - * @func: Function to call for each memory block. - * - * This function is used by kexec_add_buffer and kexec_locate_mem_hole - * to find unreserved memory to load kexec segments into. - * - * Return: The memory walk will stop when func returns a non-zero value - * and that value will be returned. If all free regions are visited without - * func returning non-zero, then zero will be returned. - */ -int arch_kexec_walk_mem(struct kexec_buf *kbuf, - int (*func)(struct resource *, void *)) -{ - int ret = 0; - u64 i; - phys_addr_t mstart, mend; - struct resource res = { }; - - if (kbuf->top_down) { - for_each_free_mem_range_reverse(i, NUMA_NO_NODE, 0, - &mstart, &mend, NULL) { - /* - * In memblock, end points to the first byte after the - * range while in kexec, end points to the last byte - * in the range. - */ - res.start = mstart; - res.end = mend - 1; - ret = func(&res, kbuf); - if (ret) - break; - } - } else { - for_each_free_mem_range(i, NUMA_NO_NODE, 0, &mstart, &mend, - NULL) { - /* - * In memblock, end points to the first byte after the - * range while in kexec, end points to the last byte - * in the range. - */ - res.start = mstart; - res.end = mend - 1; - ret = func(&res, kbuf); - if (ret) - break; - } - } - - return ret; -} - /** * setup_purgatory - initialize the purgatory's global variables * @image: kexec image. diff --git a/include/linux/kexec.h b/include/linux/kexec.h index f378cb786f1b..d58d1f2fab10 100644 --- a/include/linux/kexec.h +++ b/include/linux/kexec.h @@ -192,8 +192,6 @@ int __weak arch_kexec_apply_relocations(struct purgatory_info *pi, const Elf_Shdr *relsec, const Elf_Shdr *symtab); -int __weak arch_kexec_walk_mem(struct kexec_buf *kbuf, - int (*func)(struct resource *, void *)); extern int kexec_add_buffer(struct kexec_buf *kbuf); int kexec_locate_mem_hole(struct kexec_buf *kbuf); diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c index 9e6529da12ed..d03195a8cb6e 100644 --- a/kernel/kexec_file.c +++ b/kernel/kexec_file.c @@ -16,6 +16,7 @@ #include #include #include +#include #include #include #include @@ -499,8 +500,57 @@ static int locate_mem_hole_callback(struct resource *res, void *arg) return locate_mem_hole_bottom_up(start, end, kbuf); } +#ifdef CONFIG_ARCH_DISCARD_MEMBLOCK +static int kexec_walk_memblock(struct kexec_buf *kbuf, + int (*func)(struct resource *, void *)) +{ + return 0; +} +#else +static int kexec_walk_memblock(struct kexec_buf *kbuf, + int (*func)(struct resource *, void *)) +{ + int ret = 0; + u64 i; + phys_addr_t mstart, mend; + struct resource res = { }; + + if (kbuf->top_down) { + for_each_free_mem_range_reverse(i, NUMA_NO_NODE, 0, + &mstart, &mend, NULL) { + /* + * In memblock, end points to the first byte after the + * range while in kexec, end points to the last byte + * in the range. + */ + res.start = mstart; + res.end = mend - 1; + ret = func(&res, kbuf); + if (ret) + break; + } + } else { + for_each_free_mem_range(i, NUMA_NO_NODE, 0, &mstart, &mend, + NULL) { + /* + * In memblock, end points to the first byte after the + * range while in kexec, end points to the last byte + * in the range. + */ + res.start = mstart; + res.end = mend - 1; + ret = func(&res, kbuf); + if (ret) + break; + } + } + + return ret; +} +#endif + /** - * arch_kexec_walk_mem - call func(data) on free memory regions + * kexec_walk_resources - call func(data) on free memory regions * @kbuf: Context info for the search. Also passed to @func. * @func: Function to call for each memory region. * @@ -508,8 +558,8 @@ static int locate_mem_hole_callback(struct resource *res, void *arg) * and that value will be returned. If all free regions are visited without * func returning non-zero, then zero will be returned. */ -int __weak arch_kexec_walk_mem(struct kexec_buf *kbuf, - int (*func)(struct resource *, void *)) +static int kexec_walk_resources(struct kexec_buf *kbuf, + int (*func)(struct resource *, void *)) { if (kbuf->image->type == KEXEC_TYPE_CRASH) return walk_iomem_res_desc(crashk_res.desc, @@ -536,7 +586,10 @@ int kexec_locate_mem_hole(struct kexec_buf *kbuf) if (kbuf->mem != KEXEC_BUF_MEM_UNKNOWN) return 0; - ret = arch_kexec_walk_mem(kbuf, locate_mem_hole_callback); + if (IS_ENABLED(CONFIG_ARCH_DISCARD_MEMBLOCK)) + ret = kexec_walk_resources(kbuf, locate_mem_hole_callback); + else + ret = kexec_walk_memblock(kbuf, locate_mem_hole_callback); return ret == 1 ? 0 : -EADDRNOTAVAIL; } -- cgit v1.3-7-g2ca7 From 497e1858647a0b859076dc96300527375977d76a Mon Sep 17 00:00:00 2001 From: AKASHI Takahiro Date: Thu, 15 Nov 2018 14:52:44 +0900 Subject: kexec_file: kexec_walk_memblock() only walks a dedicated region at kdump In kdump case, there exists only one dedicated memblock region as usable memory (crashk_res). With this patch, kexec_walk_memblock() runs a given callback function on this region. Cosmetic change: 0 to MEMBLOCK_NONE at for_each_free_mem_range*() Signed-off-by: AKASHI Takahiro Acked-by: Dave Young Cc: Vivek Goyal Cc: Baoquan He Signed-off-by: Will Deacon --- kernel/kexec_file.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c index d03195a8cb6e..f1d0e00a3971 100644 --- a/kernel/kexec_file.c +++ b/kernel/kexec_file.c @@ -515,8 +515,11 @@ static int kexec_walk_memblock(struct kexec_buf *kbuf, phys_addr_t mstart, mend; struct resource res = { }; + if (kbuf->image->type == KEXEC_TYPE_CRASH) + return func(&crashk_res, kbuf); + if (kbuf->top_down) { - for_each_free_mem_range_reverse(i, NUMA_NO_NODE, 0, + for_each_free_mem_range_reverse(i, NUMA_NO_NODE, MEMBLOCK_NONE, &mstart, &mend, NULL) { /* * In memblock, end points to the first byte after the @@ -530,8 +533,8 @@ static int kexec_walk_memblock(struct kexec_buf *kbuf, break; } } else { - for_each_free_mem_range(i, NUMA_NO_NODE, 0, &mstart, &mend, - NULL) { + for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE, + &mstart, &mend, NULL) { /* * In memblock, end points to the first byte after the * range while in kexec, end points to the last byte -- cgit v1.3-7-g2ca7 From b0cbeae4944924640bf550b75487729a20204c14 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Wed, 21 Nov 2018 18:52:35 +0100 Subject: dma-direct: remove the mapping_error dma_map_ops method The dma-direct code already returns (~(dma_addr_t)0x0) on mapping failures, so we can switch over to returning DMA_MAPPING_ERROR and let the core dma-mapping code handle the rest. Signed-off-by: Christoph Hellwig Acked-by: Linus Torvalds --- arch/powerpc/kernel/dma-swiotlb.c | 1 - include/linux/dma-direct.h | 3 --- kernel/dma/direct.c | 8 +------- kernel/dma/swiotlb.c | 11 +++++------ 4 files changed, 6 insertions(+), 17 deletions(-) (limited to 'kernel') diff --git a/arch/powerpc/kernel/dma-swiotlb.c b/arch/powerpc/kernel/dma-swiotlb.c index 5fc335f4d9cd..3d8df2cf8be9 100644 --- a/arch/powerpc/kernel/dma-swiotlb.c +++ b/arch/powerpc/kernel/dma-swiotlb.c @@ -59,7 +59,6 @@ const struct dma_map_ops powerpc_swiotlb_dma_ops = { .sync_single_for_device = swiotlb_sync_single_for_device, .sync_sg_for_cpu = swiotlb_sync_sg_for_cpu, .sync_sg_for_device = swiotlb_sync_sg_for_device, - .mapping_error = dma_direct_mapping_error, .get_required_mask = swiotlb_powerpc_get_required, }; diff --git a/include/linux/dma-direct.h b/include/linux/dma-direct.h index 61b78f934f64..6e5a47ae7d64 100644 --- a/include/linux/dma-direct.h +++ b/include/linux/dma-direct.h @@ -5,8 +5,6 @@ #include #include -#define DIRECT_MAPPING_ERROR (~(dma_addr_t)0) - #ifdef CONFIG_ARCH_HAS_PHYS_TO_DMA #include #else @@ -76,5 +74,4 @@ dma_addr_t dma_direct_map_page(struct device *dev, struct page *page, int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents, enum dma_data_direction dir, unsigned long attrs); int dma_direct_supported(struct device *dev, u64 mask); -int dma_direct_mapping_error(struct device *dev, dma_addr_t dma_addr); #endif /* _LINUX_DMA_DIRECT_H */ diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c index c49849bcced6..308f88a750c8 100644 --- a/kernel/dma/direct.c +++ b/kernel/dma/direct.c @@ -289,7 +289,7 @@ dma_addr_t dma_direct_map_page(struct device *dev, struct page *page, dma_addr_t dma_addr = phys_to_dma(dev, phys); if (!check_addr(dev, dma_addr, size, __func__)) - return DIRECT_MAPPING_ERROR; + return DMA_MAPPING_ERROR; if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC)) dma_direct_sync_single_for_device(dev, dma_addr, size, dir); @@ -336,11 +336,6 @@ int dma_direct_supported(struct device *dev, u64 mask) return mask >= phys_to_dma(dev, min_mask); } -int dma_direct_mapping_error(struct device *dev, dma_addr_t dma_addr) -{ - return dma_addr == DIRECT_MAPPING_ERROR; -} - const struct dma_map_ops dma_direct_ops = { .alloc = dma_direct_alloc, .free = dma_direct_free, @@ -359,7 +354,6 @@ const struct dma_map_ops dma_direct_ops = { #endif .get_required_mask = dma_direct_get_required_mask, .dma_supported = dma_direct_supported, - .mapping_error = dma_direct_mapping_error, .cache_sync = arch_dma_cache_sync, }; EXPORT_SYMBOL(dma_direct_ops); diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c index 045930e32c0e..ff1ce81bb623 100644 --- a/kernel/dma/swiotlb.c +++ b/kernel/dma/swiotlb.c @@ -631,21 +631,21 @@ static dma_addr_t swiotlb_bounce_page(struct device *dev, phys_addr_t *phys, if (unlikely(swiotlb_force == SWIOTLB_NO_FORCE)) { dev_warn_ratelimited(dev, "Cannot do DMA to address %pa\n", phys); - return DIRECT_MAPPING_ERROR; + return DMA_MAPPING_ERROR; } /* Oh well, have to allocate and map a bounce buffer. */ *phys = swiotlb_tbl_map_single(dev, __phys_to_dma(dev, io_tlb_start), *phys, size, dir, attrs); if (*phys == SWIOTLB_MAP_ERROR) - return DIRECT_MAPPING_ERROR; + return DMA_MAPPING_ERROR; /* Ensure that the address returned is DMA'ble */ dma_addr = __phys_to_dma(dev, *phys); if (unlikely(!dma_capable(dev, dma_addr, size))) { swiotlb_tbl_unmap_single(dev, *phys, size, dir, attrs | DMA_ATTR_SKIP_CPU_SYNC); - return DIRECT_MAPPING_ERROR; + return DMA_MAPPING_ERROR; } return dma_addr; @@ -680,7 +680,7 @@ dma_addr_t swiotlb_map_page(struct device *dev, struct page *page, if (!dev_is_dma_coherent(dev) && (attrs & DMA_ATTR_SKIP_CPU_SYNC) == 0 && - dev_addr != DIRECT_MAPPING_ERROR) + dev_addr != DMA_MAPPING_ERROR) arch_sync_dma_for_device(dev, phys, size, dir); return dev_addr; @@ -789,7 +789,7 @@ swiotlb_map_sg_attrs(struct device *dev, struct scatterlist *sgl, int nelems, for_each_sg(sgl, sg, nelems, i) { sg->dma_address = swiotlb_map_page(dev, sg_page(sg), sg->offset, sg->length, dir, attrs); - if (sg->dma_address == DIRECT_MAPPING_ERROR) + if (sg->dma_address == DMA_MAPPING_ERROR) goto out_error; sg_dma_len(sg) = sg->length; } @@ -869,7 +869,6 @@ swiotlb_dma_supported(struct device *hwdev, u64 mask) } const struct dma_map_ops swiotlb_dma_ops = { - .mapping_error = dma_direct_mapping_error, .alloc = dma_direct_alloc, .free = dma_direct_free, .sync_single_for_cpu = swiotlb_sync_single_for_cpu, -- cgit v1.3-7-g2ca7 From df44b479654f62b478c18ee4d8bc4e9f897a9844 Mon Sep 17 00:00:00 2001 From: Peter Rajnoha Date: Wed, 5 Dec 2018 12:27:44 +0100 Subject: kobject: return error code if writing /sys/.../uevent fails Propagate error code back to userspace if writing the /sys/.../uevent file fails. Before, the write operation always returned with success, even if we failed to recognize the input string or if we failed to generate the uevent itself. With the error codes properly propagated back to userspace, we are able to react in userspace accordingly by not assuming and awaiting a uevent that is not delivered. Signed-off-by: Peter Rajnoha Signed-off-by: Greg Kroah-Hartman --- drivers/base/bus.c | 12 ++++++++---- drivers/base/core.c | 8 +++++++- kernel/module.c | 6 ++++-- 3 files changed, 19 insertions(+), 7 deletions(-) (limited to 'kernel') diff --git a/drivers/base/bus.c b/drivers/base/bus.c index 8bfd27ec73d6..b886b15cb53b 100644 --- a/drivers/base/bus.c +++ b/drivers/base/bus.c @@ -611,8 +611,10 @@ static void remove_probe_files(struct bus_type *bus) static ssize_t uevent_store(struct device_driver *drv, const char *buf, size_t count) { - kobject_synth_uevent(&drv->p->kobj, buf, count); - return count; + int rc; + + rc = kobject_synth_uevent(&drv->p->kobj, buf, count); + return rc ? rc : count; } static DRIVER_ATTR_WO(uevent); @@ -828,8 +830,10 @@ static void klist_devices_put(struct klist_node *n) static ssize_t bus_uevent_store(struct bus_type *bus, const char *buf, size_t count) { - kobject_synth_uevent(&bus->p->subsys.kobj, buf, count); - return count; + int rc; + + rc = kobject_synth_uevent(&bus->p->subsys.kobj, buf, count); + return rc ? rc : count; } static BUS_ATTR(uevent, S_IWUSR, NULL, bus_uevent_store); diff --git a/drivers/base/core.c b/drivers/base/core.c index e2285059161d..a4ced331bc50 100644 --- a/drivers/base/core.c +++ b/drivers/base/core.c @@ -1073,8 +1073,14 @@ out: static ssize_t uevent_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count) { - if (kobject_synth_uevent(&dev->kobj, buf, count)) + int rc; + + rc = kobject_synth_uevent(&dev->kobj, buf, count); + + if (rc) { dev_err(dev, "uevent: failed to send synthetic uevent\n"); + return rc; + } return count; } diff --git a/kernel/module.c b/kernel/module.c index 49a405891587..0812a7f80fa7 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -1207,8 +1207,10 @@ static ssize_t store_uevent(struct module_attribute *mattr, struct module_kobject *mk, const char *buffer, size_t count) { - kobject_synth_uevent(&mk->kobj, buffer, count); - return count; + int rc; + + rc = kobject_synth_uevent(&mk->kobj, buffer, count); + return rc ? rc : count; } struct module_attribute module_uevent = -- cgit v1.3-7-g2ca7 From ded653ccbec0335a78fa7a7aff3ec9870349fafb Mon Sep 17 00:00:00 2001 From: Deepa Dinamani Date: Wed, 19 Sep 2018 21:41:04 -0700 Subject: signal: Add set_user_sigmask() Refactor reading sigset from userspace and updating sigmask into an api. This is useful for versions of syscalls that pass in the sigmask and expect the current->sigmask to be changed during, and restored after, the execution of the syscall. With the advent of new y2038 syscalls in the subsequent patches, we add two more new versions of the syscalls (for pselect, ppoll, and io_pgetevents) in addition to the existing native and compat versions. Adding such an api reduces the logic that would need to be replicated otherwise. Note that the calls to sigprocmask() ignored the return value from the api as the function only returns an error on an invalid first argument that is hardcoded at these call sites. The updated logic uses set_current_blocked() instead. Signed-off-by: Deepa Dinamani Signed-off-by: Arnd Bergmann --- fs/aio.c | 23 +++++++---------------- fs/eventpoll.c | 22 ++++++---------------- fs/select.c | 50 ++++++++++++-------------------------------------- include/linux/compat.h | 4 ++++ include/linux/signal.h | 2 ++ kernel/signal.c | 45 +++++++++++++++++++++++++++++++++++++++++++++ 6 files changed, 76 insertions(+), 70 deletions(-) (limited to 'kernel') diff --git a/fs/aio.c b/fs/aio.c index 301e6314183b..6ddb63ee8eb6 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -2104,14 +2104,10 @@ SYSCALL_DEFINE6(io_pgetevents, if (usig && copy_from_user(&ksig, usig, sizeof(ksig))) return -EFAULT; - if (ksig.sigmask) { - if (ksig.sigsetsize != sizeof(sigset_t)) - return -EINVAL; - if (copy_from_user(&ksigmask, ksig.sigmask, sizeof(ksigmask))) - return -EFAULT; - sigdelsetmask(&ksigmask, sigmask(SIGKILL) | sigmask(SIGSTOP)); - sigprocmask(SIG_SETMASK, &ksigmask, &sigsaved); - } + + ret = set_user_sigmask(ksig.sigmask, &ksigmask, &sigsaved, ksig.sigsetsize); + if (ret) + return ret; ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &ts : NULL); if (signal_pending(current)) { @@ -2174,14 +2170,9 @@ COMPAT_SYSCALL_DEFINE6(io_pgetevents, if (usig && copy_from_user(&ksig, usig, sizeof(ksig))) return -EFAULT; - if (ksig.sigmask) { - if (ksig.sigsetsize != sizeof(compat_sigset_t)) - return -EINVAL; - if (get_compat_sigset(&ksigmask, ksig.sigmask)) - return -EFAULT; - sigdelsetmask(&ksigmask, sigmask(SIGKILL) | sigmask(SIGSTOP)); - sigprocmask(SIG_SETMASK, &ksigmask, &sigsaved); - } + ret = set_compat_user_sigmask(ksig.sigmask, &ksigmask, &sigsaved, ksig.sigsetsize); + if (ret) + return ret; ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &t : NULL); if (signal_pending(current)) { diff --git a/fs/eventpoll.c b/fs/eventpoll.c index 42bbe6824b4b..2d86eeba837b 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -2223,14 +2223,9 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events, * If the caller wants a certain signal mask to be set during the wait, * we apply it here. */ - if (sigmask) { - if (sigsetsize != sizeof(sigset_t)) - return -EINVAL; - if (copy_from_user(&ksigmask, sigmask, sizeof(ksigmask))) - return -EFAULT; - sigsaved = current->blocked; - set_current_blocked(&ksigmask); - } + error = set_user_sigmask(sigmask, &ksigmask, &sigsaved, sigsetsize); + if (error) + return error; error = do_epoll_wait(epfd, events, maxevents, timeout); @@ -2266,14 +2261,9 @@ COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd, * If the caller wants a certain signal mask to be set during the wait, * we apply it here. */ - if (sigmask) { - if (sigsetsize != sizeof(compat_sigset_t)) - return -EINVAL; - if (get_compat_sigset(&ksigmask, sigmask)) - return -EFAULT; - sigsaved = current->blocked; - set_current_blocked(&ksigmask); - } + err = set_compat_user_sigmask(sigmask, &ksigmask, &sigsaved, sigsetsize); + if (err) + return err; err = do_epoll_wait(epfd, events, maxevents, timeout); diff --git a/fs/select.c b/fs/select.c index 22b3bf89f051..65c78b4147a2 100644 --- a/fs/select.c +++ b/fs/select.c @@ -717,16 +717,9 @@ static long do_pselect(int n, fd_set __user *inp, fd_set __user *outp, return -EINVAL; } - if (sigmask) { - /* XXX: Don't preclude handling different sized sigset_t's. */ - if (sigsetsize != sizeof(sigset_t)) - return -EINVAL; - if (copy_from_user(&ksigmask, sigmask, sizeof(ksigmask))) - return -EFAULT; - - sigdelsetmask(&ksigmask, sigmask(SIGKILL)|sigmask(SIGSTOP)); - sigprocmask(SIG_SETMASK, &ksigmask, &sigsaved); - } + ret = set_user_sigmask(sigmask, &ksigmask, &sigsaved, sigsetsize); + if (ret) + return ret; ret = core_sys_select(n, inp, outp, exp, to); ret = poll_select_copy_remaining(&end_time, tsp, 0, ret); @@ -1061,16 +1054,9 @@ SYSCALL_DEFINE5(ppoll, struct pollfd __user *, ufds, unsigned int, nfds, return -EINVAL; } - if (sigmask) { - /* XXX: Don't preclude handling different sized sigset_t's. */ - if (sigsetsize != sizeof(sigset_t)) - return -EINVAL; - if (copy_from_user(&ksigmask, sigmask, sizeof(ksigmask))) - return -EFAULT; - - sigdelsetmask(&ksigmask, sigmask(SIGKILL)|sigmask(SIGSTOP)); - sigprocmask(SIG_SETMASK, &ksigmask, &sigsaved); - } + ret = set_user_sigmask(sigmask, &ksigmask, &sigsaved, sigsetsize); + if (ret) + return ret; ret = do_sys_poll(ufds, nfds, to); @@ -1323,15 +1309,9 @@ static long do_compat_pselect(int n, compat_ulong_t __user *inp, return -EINVAL; } - if (sigmask) { - if (sigsetsize != sizeof(compat_sigset_t)) - return -EINVAL; - if (get_compat_sigset(&ksigmask, sigmask)) - return -EFAULT; - - sigdelsetmask(&ksigmask, sigmask(SIGKILL)|sigmask(SIGSTOP)); - sigprocmask(SIG_SETMASK, &ksigmask, &sigsaved); - } + ret = set_compat_user_sigmask(sigmask, &ksigmask, &sigsaved, sigsetsize); + if (ret) + return ret; ret = compat_core_sys_select(n, inp, outp, exp, to); ret = compat_poll_select_copy_remaining(&end_time, tsp, 0, ret); @@ -1389,15 +1369,9 @@ COMPAT_SYSCALL_DEFINE5(ppoll, struct pollfd __user *, ufds, return -EINVAL; } - if (sigmask) { - if (sigsetsize != sizeof(compat_sigset_t)) - return -EINVAL; - if (get_compat_sigset(&ksigmask, sigmask)) - return -EFAULT; - - sigdelsetmask(&ksigmask, sigmask(SIGKILL)|sigmask(SIGSTOP)); - sigprocmask(SIG_SETMASK, &ksigmask, &sigsaved); - } + ret = set_compat_user_sigmask(sigmask, &ksigmask, &sigsaved, sigsetsize); + if (ret) + return ret; ret = do_sys_poll(ufds, nfds, to); diff --git a/include/linux/compat.h b/include/linux/compat.h index 88720b443cd6..17c497b82690 100644 --- a/include/linux/compat.h +++ b/include/linux/compat.h @@ -169,6 +169,10 @@ typedef struct { compat_sigset_word sig[_COMPAT_NSIG_WORDS]; } compat_sigset_t; +int set_compat_user_sigmask(const compat_sigset_t __user *usigmask, + sigset_t *set, sigset_t *oldset, + size_t sigsetsize); + struct compat_sigaction { #ifndef __ARCH_HAS_IRIX_SIGACTION compat_uptr_t sa_handler; diff --git a/include/linux/signal.h b/include/linux/signal.h index f428e86f4800..ce14b951befb 100644 --- a/include/linux/signal.h +++ b/include/linux/signal.h @@ -273,6 +273,8 @@ extern int group_send_sig_info(int sig, struct kernel_siginfo *info, struct task_struct *p, enum pid_type type); extern int __group_send_sig_info(int, struct kernel_siginfo *, struct task_struct *); extern int sigprocmask(int, sigset_t *, sigset_t *); +extern int set_user_sigmask(const sigset_t __user *usigmask, sigset_t *set, + sigset_t *oldset, size_t sigsetsize); extern void set_current_blocked(sigset_t *); extern void __set_current_blocked(const sigset_t *); extern int show_unhandled_signals; diff --git a/kernel/signal.c b/kernel/signal.c index 9a32bc2088c9..811b5d440617 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -2735,6 +2735,51 @@ int sigprocmask(int how, sigset_t *set, sigset_t *oldset) } EXPORT_SYMBOL(sigprocmask); +/* + * The api helps set app-provided sigmasks. + * + * This is useful for syscalls such as ppoll, pselect, io_pgetevents and + * epoll_pwait where a new sigmask is passed from userland for the syscalls. + */ +int set_user_sigmask(const sigset_t __user *usigmask, sigset_t *set, + sigset_t *oldset, size_t sigsetsize) +{ + if (!usigmask) + return 0; + + if (sigsetsize != sizeof(sigset_t)) + return -EINVAL; + if (copy_from_user(set, usigmask, sizeof(sigset_t))) + return -EFAULT; + + *oldset = current->blocked; + set_current_blocked(set); + + return 0; +} +EXPORT_SYMBOL(set_user_sigmask); + +#ifdef CONFIG_COMPAT +int set_compat_user_sigmask(const compat_sigset_t __user *usigmask, + sigset_t *set, sigset_t *oldset, + size_t sigsetsize) +{ + if (!usigmask) + return 0; + + if (sigsetsize != sizeof(compat_sigset_t)) + return -EINVAL; + if (get_compat_sigset(set, usigmask)) + return -EFAULT; + + *oldset = current->blocked; + set_current_blocked(set); + + return 0; +} +EXPORT_SYMBOL(set_compat_user_sigmask); +#endif + /** * sys_rt_sigprocmask - change the list of currently blocked signals * @how: whether to add, remove, or set signals -- cgit v1.3-7-g2ca7 From 854a6ed56839a40f6b5d02a2962f48841482eec4 Mon Sep 17 00:00:00 2001 From: Deepa Dinamani Date: Wed, 19 Sep 2018 21:41:05 -0700 Subject: signal: Add restore_user_sigmask() Refactor the logic to restore the sigmask before the syscall returns into an api. This is useful for versions of syscalls that pass in the sigmask and expect the current->sigmask to be changed during the execution and restored after the execution of the syscall. With the advent of new y2038 syscalls in the subsequent patches, we add two more new versions of the syscalls (for pselect, ppoll and io_pgetevents) in addition to the existing native and compat versions. Adding such an api reduces the logic that would need to be replicated otherwise. Signed-off-by: Deepa Dinamani Signed-off-by: Arnd Bergmann --- fs/aio.c | 29 +++++------------------- fs/eventpoll.c | 30 ++----------------------- fs/select.c | 60 +++++++------------------------------------------- include/linux/signal.h | 2 ++ kernel/signal.c | 33 +++++++++++++++++++++++++++ 5 files changed, 51 insertions(+), 103 deletions(-) (limited to 'kernel') diff --git a/fs/aio.c b/fs/aio.c index 6ddb63ee8eb6..39a1f2df6805 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -2110,18 +2110,9 @@ SYSCALL_DEFINE6(io_pgetevents, return ret; ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &ts : NULL); - if (signal_pending(current)) { - if (ksig.sigmask) { - current->saved_sigmask = sigsaved; - set_restore_sigmask(); - } - - if (!ret) - ret = -ERESTARTNOHAND; - } else { - if (ksig.sigmask) - sigprocmask(SIG_SETMASK, &sigsaved, NULL); - } + restore_user_sigmask(ksig.sigmask, &sigsaved); + if (signal_pending(current) && !ret) + ret = -ERESTARTNOHAND; return ret; } @@ -2175,17 +2166,9 @@ COMPAT_SYSCALL_DEFINE6(io_pgetevents, return ret; ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &t : NULL); - if (signal_pending(current)) { - if (ksig.sigmask) { - current->saved_sigmask = sigsaved; - set_restore_sigmask(); - } - if (!ret) - ret = -ERESTARTNOHAND; - } else { - if (ksig.sigmask) - sigprocmask(SIG_SETMASK, &sigsaved, NULL); - } + restore_user_sigmask(ksig.sigmask, &sigsaved); + if (signal_pending(current) && !ret) + ret = -ERESTARTNOHAND; return ret; } diff --git a/fs/eventpoll.c b/fs/eventpoll.c index 2d86eeba837b..8a5a1010886b 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -2229,20 +2229,7 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events, error = do_epoll_wait(epfd, events, maxevents, timeout); - /* - * If we changed the signal mask, we need to restore the original one. - * In case we've got a signal while waiting, we do not restore the - * signal mask yet, and we allow do_signal() to deliver the signal on - * the way back to userspace, before the signal mask is restored. - */ - if (sigmask) { - if (error == -EINTR) { - memcpy(¤t->saved_sigmask, &sigsaved, - sizeof(sigsaved)); - set_restore_sigmask(); - } else - set_current_blocked(&sigsaved); - } + restore_user_sigmask(sigmask, &sigsaved); return error; } @@ -2267,20 +2254,7 @@ COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd, err = do_epoll_wait(epfd, events, maxevents, timeout); - /* - * If we changed the signal mask, we need to restore the original one. - * In case we've got a signal while waiting, we do not restore the - * signal mask yet, and we allow do_signal() to deliver the signal on - * the way back to userspace, before the signal mask is restored. - */ - if (sigmask) { - if (err == -EINTR) { - memcpy(¤t->saved_sigmask, &sigsaved, - sizeof(sigsaved)); - set_restore_sigmask(); - } else - set_current_blocked(&sigsaved); - } + restore_user_sigmask(sigmask, &sigsaved); return err; } diff --git a/fs/select.c b/fs/select.c index 65c78b4147a2..eb9132520197 100644 --- a/fs/select.c +++ b/fs/select.c @@ -724,19 +724,7 @@ static long do_pselect(int n, fd_set __user *inp, fd_set __user *outp, ret = core_sys_select(n, inp, outp, exp, to); ret = poll_select_copy_remaining(&end_time, tsp, 0, ret); - if (ret == -ERESTARTNOHAND) { - /* - * Don't restore the signal mask yet. Let do_signal() deliver - * the signal on the way back to userspace, before the signal - * mask is restored. - */ - if (sigmask) { - memcpy(¤t->saved_sigmask, &sigsaved, - sizeof(sigsaved)); - set_restore_sigmask(); - } - } else if (sigmask) - sigprocmask(SIG_SETMASK, &sigsaved, NULL); + restore_user_sigmask(sigmask, &sigsaved); return ret; } @@ -1060,21 +1048,11 @@ SYSCALL_DEFINE5(ppoll, struct pollfd __user *, ufds, unsigned int, nfds, ret = do_sys_poll(ufds, nfds, to); + restore_user_sigmask(sigmask, &sigsaved); + /* We can restart this syscall, usually */ - if (ret == -EINTR) { - /* - * Don't restore the signal mask yet. Let do_signal() deliver - * the signal on the way back to userspace, before the signal - * mask is restored. - */ - if (sigmask) { - memcpy(¤t->saved_sigmask, &sigsaved, - sizeof(sigsaved)); - set_restore_sigmask(); - } + if (ret == -EINTR) ret = -ERESTARTNOHAND; - } else if (sigmask) - sigprocmask(SIG_SETMASK, &sigsaved, NULL); ret = poll_select_copy_remaining(&end_time, tsp, 0, ret); @@ -1316,19 +1294,7 @@ static long do_compat_pselect(int n, compat_ulong_t __user *inp, ret = compat_core_sys_select(n, inp, outp, exp, to); ret = compat_poll_select_copy_remaining(&end_time, tsp, 0, ret); - if (ret == -ERESTARTNOHAND) { - /* - * Don't restore the signal mask yet. Let do_signal() deliver - * the signal on the way back to userspace, before the signal - * mask is restored. - */ - if (sigmask) { - memcpy(¤t->saved_sigmask, &sigsaved, - sizeof(sigsaved)); - set_restore_sigmask(); - } - } else if (sigmask) - sigprocmask(SIG_SETMASK, &sigsaved, NULL); + restore_user_sigmask(sigmask, &sigsaved); return ret; } @@ -1375,21 +1341,11 @@ COMPAT_SYSCALL_DEFINE5(ppoll, struct pollfd __user *, ufds, ret = do_sys_poll(ufds, nfds, to); + restore_user_sigmask(sigmask, &sigsaved); + /* We can restart this syscall, usually */ - if (ret == -EINTR) { - /* - * Don't restore the signal mask yet. Let do_signal() deliver - * the signal on the way back to userspace, before the signal - * mask is restored. - */ - if (sigmask) { - memcpy(¤t->saved_sigmask, &sigsaved, - sizeof(sigsaved)); - set_restore_sigmask(); - } + if (ret == -EINTR) ret = -ERESTARTNOHAND; - } else if (sigmask) - sigprocmask(SIG_SETMASK, &sigsaved, NULL); ret = compat_poll_select_copy_remaining(&end_time, tsp, 0, ret); diff --git a/include/linux/signal.h b/include/linux/signal.h index ce14b951befb..cc7e2c1cd444 100644 --- a/include/linux/signal.h +++ b/include/linux/signal.h @@ -275,6 +275,8 @@ extern int __group_send_sig_info(int, struct kernel_siginfo *, struct task_struc extern int sigprocmask(int, sigset_t *, sigset_t *); extern int set_user_sigmask(const sigset_t __user *usigmask, sigset_t *set, sigset_t *oldset, size_t sigsetsize); +extern void restore_user_sigmask(const void __user *usigmask, + sigset_t *sigsaved); extern void set_current_blocked(sigset_t *); extern void __set_current_blocked(const sigset_t *); extern int show_unhandled_signals; diff --git a/kernel/signal.c b/kernel/signal.c index 811b5d440617..3c8ea7a328e0 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -2780,6 +2780,39 @@ int set_compat_user_sigmask(const compat_sigset_t __user *usigmask, EXPORT_SYMBOL(set_compat_user_sigmask); #endif +/* + * restore_user_sigmask: + * usigmask: sigmask passed in from userland. + * sigsaved: saved sigmask when the syscall started and changed the sigmask to + * usigmask. + * + * This is useful for syscalls such as ppoll, pselect, io_pgetevents and + * epoll_pwait where a new sigmask is passed in from userland for the syscalls. + */ +void restore_user_sigmask(const void __user *usigmask, sigset_t *sigsaved) +{ + + if (!usigmask) + return; + /* + * When signals are pending, do not restore them here. + * Restoring sigmask here can lead to delivering signals that the above + * syscalls are intended to block because of the sigmask passed in. + */ + if (signal_pending(current)) { + current->saved_sigmask = *sigsaved; + set_restore_sigmask(); + return; + } + + /* + * This is needed because the fast syscall return path does not restore + * saved_sigmask when signals are not pending. + */ + set_current_blocked(sigsaved); +} +EXPORT_SYMBOL(restore_user_sigmask); + /** * sys_rt_sigprocmask - change the list of currently blocked signals * @how: whether to add, remove, or set signals -- cgit v1.3-7-g2ca7 From 04e7712f4460585e5eed5b853fd8b82a9943958f Mon Sep 17 00:00:00 2001 From: Arnd Bergmann Date: Tue, 17 Apr 2018 16:31:07 +0200 Subject: y2038: futex: Move compat implementation into futex.c We are going to share the compat_sys_futex() handler between 64-bit architectures and 32-bit architectures that need to deal with both 32-bit and 64-bit time_t, and this is easier if both entry points are in the same file. In fact, most other system call handlers do the same thing these days, so let's follow the trend here and merge all of futex_compat.c into futex.c. In the process, a few minor changes have to be done to make sure everything still makes sense: handle_futex_death() and futex_cmpxchg_enabled() become local symbol, and the compat version of the fetch_robust_entry() function gets renamed to compat_fetch_robust_entry() to avoid a symbol clash. This is intended as a purely cosmetic patch, no behavior should change. Signed-off-by: Arnd Bergmann --- include/linux/futex.h | 8 -- kernel/Makefile | 3 - kernel/futex.c | 195 +++++++++++++++++++++++++++++++++++++++++++++++- kernel/futex_compat.c | 202 -------------------------------------------------- 4 files changed, 192 insertions(+), 216 deletions(-) delete mode 100644 kernel/futex_compat.c (limited to 'kernel') diff --git a/include/linux/futex.h b/include/linux/futex.h index 821ae502d3d8..ccaef0097785 100644 --- a/include/linux/futex.h +++ b/include/linux/futex.h @@ -9,9 +9,6 @@ struct inode; struct mm_struct; struct task_struct; -extern int -handle_futex_death(u32 __user *uaddr, struct task_struct *curr, int pi); - /* * Futexes are matched on equal values of this key. * The key type depends on whether it's a shared or private mapping. @@ -55,11 +52,6 @@ extern void exit_robust_list(struct task_struct *curr); long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout, u32 __user *uaddr2, u32 val2, u32 val3); -#ifdef CONFIG_HAVE_FUTEX_CMPXCHG -#define futex_cmpxchg_enabled 1 -#else -extern int futex_cmpxchg_enabled; -#endif #else static inline void exit_robust_list(struct task_struct *curr) { diff --git a/kernel/Makefile b/kernel/Makefile index 7343b3a9bff0..8e40a6742d23 100644 --- a/kernel/Makefile +++ b/kernel/Makefile @@ -49,9 +49,6 @@ obj-$(CONFIG_PROFILING) += profile.o obj-$(CONFIG_STACKTRACE) += stacktrace.o obj-y += time/ obj-$(CONFIG_FUTEX) += futex.o -ifeq ($(CONFIG_COMPAT),y) -obj-$(CONFIG_FUTEX) += futex_compat.o -endif obj-$(CONFIG_GENERIC_ISA_DMA) += dma.o obj-$(CONFIG_SMP) += smp.o ifneq ($(CONFIG_SMP),y) diff --git a/kernel/futex.c b/kernel/futex.c index f423f9b6577e..5cc7c3b098e9 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -44,6 +44,7 @@ * along with this program; if not, write to the Free Software * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */ +#include #include #include #include @@ -173,8 +174,10 @@ * double_lock_hb() and double_unlock_hb(), respectively. */ -#ifndef CONFIG_HAVE_FUTEX_CMPXCHG -int __read_mostly futex_cmpxchg_enabled; +#ifdef CONFIG_HAVE_FUTEX_CMPXCHG +#define futex_cmpxchg_enabled 1 +#else +static int __read_mostly futex_cmpxchg_enabled; #endif /* @@ -3360,7 +3363,7 @@ err_unlock: * Process a futex-list entry, check whether it's owned by the * dying task, and do notification if so: */ -int handle_futex_death(u32 __user *uaddr, struct task_struct *curr, int pi) +static int handle_futex_death(u32 __user *uaddr, struct task_struct *curr, int pi) { u32 uval, uninitialized_var(nval), mval; @@ -3589,6 +3592,192 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val, return do_futex(uaddr, op, val, tp, uaddr2, val2, val3); } +#ifdef CONFIG_COMPAT +/* + * Fetch a robust-list pointer. Bit 0 signals PI futexes: + */ +static inline int +compat_fetch_robust_entry(compat_uptr_t *uentry, struct robust_list __user **entry, + compat_uptr_t __user *head, unsigned int *pi) +{ + if (get_user(*uentry, head)) + return -EFAULT; + + *entry = compat_ptr((*uentry) & ~1); + *pi = (unsigned int)(*uentry) & 1; + + return 0; +} + +static void __user *futex_uaddr(struct robust_list __user *entry, + compat_long_t futex_offset) +{ + compat_uptr_t base = ptr_to_compat(entry); + void __user *uaddr = compat_ptr(base + futex_offset); + + return uaddr; +} + +/* + * Walk curr->robust_list (very carefully, it's a userspace list!) + * and mark any locks found there dead, and notify any waiters. + * + * We silently return on any sign of list-walking problem. + */ +void compat_exit_robust_list(struct task_struct *curr) +{ + struct compat_robust_list_head __user *head = curr->compat_robust_list; + struct robust_list __user *entry, *next_entry, *pending; + unsigned int limit = ROBUST_LIST_LIMIT, pi, pip; + unsigned int uninitialized_var(next_pi); + compat_uptr_t uentry, next_uentry, upending; + compat_long_t futex_offset; + int rc; + + if (!futex_cmpxchg_enabled) + return; + + /* + * Fetch the list head (which was registered earlier, via + * sys_set_robust_list()): + */ + if (compat_fetch_robust_entry(&uentry, &entry, &head->list.next, &pi)) + return; + /* + * Fetch the relative futex offset: + */ + if (get_user(futex_offset, &head->futex_offset)) + return; + /* + * Fetch any possibly pending lock-add first, and handle it + * if it exists: + */ + if (compat_fetch_robust_entry(&upending, &pending, + &head->list_op_pending, &pip)) + return; + + next_entry = NULL; /* avoid warning with gcc */ + while (entry != (struct robust_list __user *) &head->list) { + /* + * Fetch the next entry in the list before calling + * handle_futex_death: + */ + rc = compat_fetch_robust_entry(&next_uentry, &next_entry, + (compat_uptr_t __user *)&entry->next, &next_pi); + /* + * A pending lock might already be on the list, so + * dont process it twice: + */ + if (entry != pending) { + void __user *uaddr = futex_uaddr(entry, futex_offset); + + if (handle_futex_death(uaddr, curr, pi)) + return; + } + if (rc) + return; + uentry = next_uentry; + entry = next_entry; + pi = next_pi; + /* + * Avoid excessively long or circular lists: + */ + if (!--limit) + break; + + cond_resched(); + } + if (pending) { + void __user *uaddr = futex_uaddr(pending, futex_offset); + + handle_futex_death(uaddr, curr, pip); + } +} + +COMPAT_SYSCALL_DEFINE2(set_robust_list, + struct compat_robust_list_head __user *, head, + compat_size_t, len) +{ + if (!futex_cmpxchg_enabled) + return -ENOSYS; + + if (unlikely(len != sizeof(*head))) + return -EINVAL; + + current->compat_robust_list = head; + + return 0; +} + +COMPAT_SYSCALL_DEFINE3(get_robust_list, int, pid, + compat_uptr_t __user *, head_ptr, + compat_size_t __user *, len_ptr) +{ + struct compat_robust_list_head __user *head; + unsigned long ret; + struct task_struct *p; + + if (!futex_cmpxchg_enabled) + return -ENOSYS; + + rcu_read_lock(); + + ret = -ESRCH; + if (!pid) + p = current; + else { + p = find_task_by_vpid(pid); + if (!p) + goto err_unlock; + } + + ret = -EPERM; + if (!ptrace_may_access(p, PTRACE_MODE_READ_REALCREDS)) + goto err_unlock; + + head = p->compat_robust_list; + rcu_read_unlock(); + + if (put_user(sizeof(*head), len_ptr)) + return -EFAULT; + return put_user(ptr_to_compat(head), head_ptr); + +err_unlock: + rcu_read_unlock(); + + return ret; +} + +COMPAT_SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val, + struct old_timespec32 __user *, utime, u32 __user *, uaddr2, + u32, val3) +{ + struct timespec ts; + ktime_t t, *tp = NULL; + int val2 = 0; + int cmd = op & FUTEX_CMD_MASK; + + if (utime && (cmd == FUTEX_WAIT || cmd == FUTEX_LOCK_PI || + cmd == FUTEX_WAIT_BITSET || + cmd == FUTEX_WAIT_REQUEUE_PI)) { + if (compat_get_timespec(&ts, utime)) + return -EFAULT; + if (!timespec_valid(&ts)) + return -EINVAL; + + t = timespec_to_ktime(ts); + if (cmd == FUTEX_WAIT) + t = ktime_add_safe(ktime_get(), t); + tp = &t; + } + if (cmd == FUTEX_REQUEUE || cmd == FUTEX_CMP_REQUEUE || + cmd == FUTEX_CMP_REQUEUE_PI || cmd == FUTEX_WAKE_OP) + val2 = (int) (unsigned long) utime; + + return do_futex(uaddr, op, val, tp, uaddr2, val2, val3); +} +#endif /* CONFIG_COMPAT */ + static void __init futex_detect_cmpxchg(void) { #ifndef CONFIG_HAVE_FUTEX_CMPXCHG diff --git a/kernel/futex_compat.c b/kernel/futex_compat.c deleted file mode 100644 index 410a77a8f6e2..000000000000 --- a/kernel/futex_compat.c +++ /dev/null @@ -1,202 +0,0 @@ -// SPDX-License-Identifier: GPL-2.0 -/* - * linux/kernel/futex_compat.c - * - * Futex compatibililty routines. - * - * Copyright 2006, Red Hat, Inc., Ingo Molnar - */ - -#include -#include -#include -#include -#include -#include - -#include - - -/* - * Fetch a robust-list pointer. Bit 0 signals PI futexes: - */ -static inline int -fetch_robust_entry(compat_uptr_t *uentry, struct robust_list __user **entry, - compat_uptr_t __user *head, unsigned int *pi) -{ - if (get_user(*uentry, head)) - return -EFAULT; - - *entry = compat_ptr((*uentry) & ~1); - *pi = (unsigned int)(*uentry) & 1; - - return 0; -} - -static void __user *futex_uaddr(struct robust_list __user *entry, - compat_long_t futex_offset) -{ - compat_uptr_t base = ptr_to_compat(entry); - void __user *uaddr = compat_ptr(base + futex_offset); - - return uaddr; -} - -/* - * Walk curr->robust_list (very carefully, it's a userspace list!) - * and mark any locks found there dead, and notify any waiters. - * - * We silently return on any sign of list-walking problem. - */ -void compat_exit_robust_list(struct task_struct *curr) -{ - struct compat_robust_list_head __user *head = curr->compat_robust_list; - struct robust_list __user *entry, *next_entry, *pending; - unsigned int limit = ROBUST_LIST_LIMIT, pi, pip; - unsigned int uninitialized_var(next_pi); - compat_uptr_t uentry, next_uentry, upending; - compat_long_t futex_offset; - int rc; - - if (!futex_cmpxchg_enabled) - return; - - /* - * Fetch the list head (which was registered earlier, via - * sys_set_robust_list()): - */ - if (fetch_robust_entry(&uentry, &entry, &head->list.next, &pi)) - return; - /* - * Fetch the relative futex offset: - */ - if (get_user(futex_offset, &head->futex_offset)) - return; - /* - * Fetch any possibly pending lock-add first, and handle it - * if it exists: - */ - if (fetch_robust_entry(&upending, &pending, - &head->list_op_pending, &pip)) - return; - - next_entry = NULL; /* avoid warning with gcc */ - while (entry != (struct robust_list __user *) &head->list) { - /* - * Fetch the next entry in the list before calling - * handle_futex_death: - */ - rc = fetch_robust_entry(&next_uentry, &next_entry, - (compat_uptr_t __user *)&entry->next, &next_pi); - /* - * A pending lock might already be on the list, so - * dont process it twice: - */ - if (entry != pending) { - void __user *uaddr = futex_uaddr(entry, futex_offset); - - if (handle_futex_death(uaddr, curr, pi)) - return; - } - if (rc) - return; - uentry = next_uentry; - entry = next_entry; - pi = next_pi; - /* - * Avoid excessively long or circular lists: - */ - if (!--limit) - break; - - cond_resched(); - } - if (pending) { - void __user *uaddr = futex_uaddr(pending, futex_offset); - - handle_futex_death(uaddr, curr, pip); - } -} - -COMPAT_SYSCALL_DEFINE2(set_robust_list, - struct compat_robust_list_head __user *, head, - compat_size_t, len) -{ - if (!futex_cmpxchg_enabled) - return -ENOSYS; - - if (unlikely(len != sizeof(*head))) - return -EINVAL; - - current->compat_robust_list = head; - - return 0; -} - -COMPAT_SYSCALL_DEFINE3(get_robust_list, int, pid, - compat_uptr_t __user *, head_ptr, - compat_size_t __user *, len_ptr) -{ - struct compat_robust_list_head __user *head; - unsigned long ret; - struct task_struct *p; - - if (!futex_cmpxchg_enabled) - return -ENOSYS; - - rcu_read_lock(); - - ret = -ESRCH; - if (!pid) - p = current; - else { - p = find_task_by_vpid(pid); - if (!p) - goto err_unlock; - } - - ret = -EPERM; - if (!ptrace_may_access(p, PTRACE_MODE_READ_REALCREDS)) - goto err_unlock; - - head = p->compat_robust_list; - rcu_read_unlock(); - - if (put_user(sizeof(*head), len_ptr)) - return -EFAULT; - return put_user(ptr_to_compat(head), head_ptr); - -err_unlock: - rcu_read_unlock(); - - return ret; -} - -COMPAT_SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val, - struct old_timespec32 __user *, utime, u32 __user *, uaddr2, - u32, val3) -{ - struct timespec ts; - ktime_t t, *tp = NULL; - int val2 = 0; - int cmd = op & FUTEX_CMD_MASK; - - if (utime && (cmd == FUTEX_WAIT || cmd == FUTEX_LOCK_PI || - cmd == FUTEX_WAIT_BITSET || - cmd == FUTEX_WAIT_REQUEUE_PI)) { - if (compat_get_timespec(&ts, utime)) - return -EFAULT; - if (!timespec_valid(&ts)) - return -EINVAL; - - t = timespec_to_ktime(ts); - if (cmd == FUTEX_WAIT) - t = ktime_add_safe(ktime_get(), t); - tp = &t; - } - if (cmd == FUTEX_REQUEUE || cmd == FUTEX_CMP_REQUEUE || - cmd == FUTEX_CMP_REQUEUE_PI || cmd == FUTEX_WAKE_OP) - val2 = (int) (unsigned long) utime; - - return do_futex(uaddr, op, val, tp, uaddr2, val2, val3); -} -- cgit v1.3-7-g2ca7 From bec2f7cbb73eadf5e1cc7d54ecb0980ede244257 Mon Sep 17 00:00:00 2001 From: Arnd Bergmann Date: Tue, 17 Apr 2018 17:23:35 +0200 Subject: y2038: futex: Add support for __kernel_timespec This prepares sys_futex for y2038 safe calling: the native syscall is changed to receive a __kernel_timespec argument, which will be switched to 64-bit time_t in the future. All the internal time handling gets changed to timespec64, and the compat_sys_futex entry point is moved under the CONFIG_COMPAT_32BIT_TIME check to provide compatibility for existing 32-bit architectures. Signed-off-by: Arnd Bergmann --- include/linux/syscalls.h | 2 +- kernel/futex.c | 22 ++++++++++++---------- 2 files changed, 13 insertions(+), 11 deletions(-) (limited to 'kernel') diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index a27cf407de92..247ad9eca955 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -553,7 +553,7 @@ asmlinkage long sys_unshare(unsigned long unshare_flags); /* kernel/futex.c */ asmlinkage long sys_futex(u32 __user *uaddr, int op, u32 val, - struct timespec __user *utime, u32 __user *uaddr2, + struct __kernel_timespec __user *utime, u32 __user *uaddr2, u32 val3); asmlinkage long sys_get_robust_list(int pid, struct robust_list_head __user * __user *head_ptr, diff --git a/kernel/futex.c b/kernel/futex.c index 5cc7c3b098e9..b305beaab739 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -3558,10 +3558,10 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout, SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val, - struct timespec __user *, utime, u32 __user *, uaddr2, + struct __kernel_timespec __user *, utime, u32 __user *, uaddr2, u32, val3) { - struct timespec ts; + struct timespec64 ts; ktime_t t, *tp = NULL; u32 val2 = 0; int cmd = op & FUTEX_CMD_MASK; @@ -3571,12 +3571,12 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val, cmd == FUTEX_WAIT_REQUEUE_PI)) { if (unlikely(should_fail_futex(!(op & FUTEX_PRIVATE_FLAG)))) return -EFAULT; - if (copy_from_user(&ts, utime, sizeof(ts)) != 0) + if (get_timespec64(&ts, utime)) return -EFAULT; - if (!timespec_valid(&ts)) + if (!timespec64_valid(&ts)) return -EINVAL; - t = timespec_to_ktime(ts); + t = timespec64_to_ktime(ts); if (cmd == FUTEX_WAIT) t = ktime_add_safe(ktime_get(), t); tp = &t; @@ -3747,12 +3747,14 @@ err_unlock: return ret; } +#endif /* CONFIG_COMPAT */ +#ifdef CONFIG_COMPAT_32BIT_TIME COMPAT_SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val, struct old_timespec32 __user *, utime, u32 __user *, uaddr2, u32, val3) { - struct timespec ts; + struct timespec64 ts; ktime_t t, *tp = NULL; int val2 = 0; int cmd = op & FUTEX_CMD_MASK; @@ -3760,12 +3762,12 @@ COMPAT_SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val, if (utime && (cmd == FUTEX_WAIT || cmd == FUTEX_LOCK_PI || cmd == FUTEX_WAIT_BITSET || cmd == FUTEX_WAIT_REQUEUE_PI)) { - if (compat_get_timespec(&ts, utime)) + if (get_old_timespec32(&ts, utime)) return -EFAULT; - if (!timespec_valid(&ts)) + if (!timespec64_valid(&ts)) return -EINVAL; - t = timespec_to_ktime(ts); + t = timespec64_to_ktime(ts); if (cmd == FUTEX_WAIT) t = ktime_add_safe(ktime_get(), t); tp = &t; @@ -3776,7 +3778,7 @@ COMPAT_SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val, return do_futex(uaddr, op, val, tp, uaddr2, val2, val3); } -#endif /* CONFIG_COMPAT */ +#endif /* CONFIG_COMPAT_32BIT_TIME */ static void __init futex_detect_cmpxchg(void) { -- cgit v1.3-7-g2ca7 From 2dc6b100f928aac8d7532bf7112d3f8d3f952bad Mon Sep 17 00:00:00 2001 From: Jiong Wang Date: Wed, 5 Dec 2018 13:52:34 -0500 Subject: bpf: interpreter support BPF_ALU | BPF_ARSH This patch implements interpreting BPF_ALU | BPF_ARSH. Do arithmetic right shift on low 32-bit sub-register, and zero the high 32 bits. Reviewed-by: Jakub Kicinski Signed-off-by: Jiong Wang Signed-off-by: Alexei Starovoitov --- kernel/bpf/core.c | 52 ++++++++++++++++++++++++++++++---------------------- 1 file changed, 30 insertions(+), 22 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index 628b3970a49b..a5b223ef7131 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -933,32 +933,34 @@ EXPORT_SYMBOL_GPL(__bpf_call_base); #define BPF_INSN_MAP(INSN_2, INSN_3) \ /* 32 bit ALU operations. */ \ /* Register based. */ \ - INSN_3(ALU, ADD, X), \ - INSN_3(ALU, SUB, X), \ - INSN_3(ALU, AND, X), \ - INSN_3(ALU, OR, X), \ - INSN_3(ALU, LSH, X), \ - INSN_3(ALU, RSH, X), \ - INSN_3(ALU, XOR, X), \ - INSN_3(ALU, MUL, X), \ - INSN_3(ALU, MOV, X), \ - INSN_3(ALU, DIV, X), \ - INSN_3(ALU, MOD, X), \ + INSN_3(ALU, ADD, X), \ + INSN_3(ALU, SUB, X), \ + INSN_3(ALU, AND, X), \ + INSN_3(ALU, OR, X), \ + INSN_3(ALU, LSH, X), \ + INSN_3(ALU, RSH, X), \ + INSN_3(ALU, XOR, X), \ + INSN_3(ALU, MUL, X), \ + INSN_3(ALU, MOV, X), \ + INSN_3(ALU, ARSH, X), \ + INSN_3(ALU, DIV, X), \ + INSN_3(ALU, MOD, X), \ INSN_2(ALU, NEG), \ INSN_3(ALU, END, TO_BE), \ INSN_3(ALU, END, TO_LE), \ /* Immediate based. */ \ - INSN_3(ALU, ADD, K), \ - INSN_3(ALU, SUB, K), \ - INSN_3(ALU, AND, K), \ - INSN_3(ALU, OR, K), \ - INSN_3(ALU, LSH, K), \ - INSN_3(ALU, RSH, K), \ - INSN_3(ALU, XOR, K), \ - INSN_3(ALU, MUL, K), \ - INSN_3(ALU, MOV, K), \ - INSN_3(ALU, DIV, K), \ - INSN_3(ALU, MOD, K), \ + INSN_3(ALU, ADD, K), \ + INSN_3(ALU, SUB, K), \ + INSN_3(ALU, AND, K), \ + INSN_3(ALU, OR, K), \ + INSN_3(ALU, LSH, K), \ + INSN_3(ALU, RSH, K), \ + INSN_3(ALU, XOR, K), \ + INSN_3(ALU, MUL, K), \ + INSN_3(ALU, MOV, K), \ + INSN_3(ALU, ARSH, K), \ + INSN_3(ALU, DIV, K), \ + INSN_3(ALU, MOD, K), \ /* 64 bit ALU operations. */ \ /* Register based. */ \ INSN_3(ALU64, ADD, X), \ @@ -1137,6 +1139,12 @@ select_insn: DST = (u64) (u32) insn[0].imm | ((u64) (u32) insn[1].imm) << 32; insn++; CONT; + ALU_ARSH_X: + DST = (u64) (u32) ((*(s32 *) &DST) >> SRC); + CONT; + ALU_ARSH_K: + DST = (u64) (u32) ((*(s32 *) &DST) >> IMM); + CONT; ALU64_ARSH_X: (*(s64 *) &DST) >>= SRC; CONT; -- cgit v1.3-7-g2ca7 From c49f7dbd4f9c2c49df7fc0f5b50c1350ee7e01ee Mon Sep 17 00:00:00 2001 From: Jiong Wang Date: Wed, 5 Dec 2018 13:52:35 -0500 Subject: bpf: verifier remove the rejection on BPF_ALU | BPF_ARSH This patch remove the rejection on BPF_ALU | BPF_ARSH as we have supported them on interpreter and all JIT back-ends Reviewed-by: Jakub Kicinski Signed-off-by: Jiong Wang Signed-off-by: Alexei Starovoitov --- kernel/bpf/verifier.c | 5 ----- 1 file changed, 5 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 7658c61c1a88..2752d35ad073 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -3649,11 +3649,6 @@ static int check_alu_op(struct bpf_verifier_env *env, struct bpf_insn *insn) return -EINVAL; } - if (opcode == BPF_ARSH && BPF_CLASS(insn->code) != BPF_ALU64) { - verbose(env, "BPF_ARSH not supported for 32 bit ALU\n"); - return -EINVAL; - } - if ((opcode == BPF_LSH || opcode == BPF_RSH || opcode == BPF_ARSH) && BPF_SRC(insn->code) == BPF_K) { int size = BPF_CLASS(insn->code) == BPF_ALU64 ? 64 : 32; -- cgit v1.3-7-g2ca7 From db6638d7d177a8bc74c9e539e2e0d7d061c767b1 Mon Sep 17 00:00:00 2001 From: Dennis Zhou Date: Wed, 5 Dec 2018 12:10:35 -0500 Subject: blkcg: remove bio->bi_css and instead use bio->bi_blkg Prior patches ensured that any bio that interacts with a request_queue is properly associated with a blkg. This makes bio->bi_css unnecessary as blkg maintains a reference to blkcg already. This removes the bio field bi_css and transfers corresponding uses to access via bi_blkg. Signed-off-by: Dennis Zhou Reviewed-by: Josef Bacik Acked-by: Tejun Heo Signed-off-by: Jens Axboe --- block/bio.c | 59 ++++++++++------------------------------------ block/bounce.c | 2 +- drivers/block/loop.c | 5 ++-- drivers/md/raid0.c | 2 +- include/linux/bio.h | 11 ++++----- include/linux/blk-cgroup.h | 8 +++---- include/linux/blk_types.h | 7 +++--- kernel/trace/blktrace.c | 4 ++-- 8 files changed, 32 insertions(+), 66 deletions(-) (limited to 'kernel') diff --git a/block/bio.c b/block/bio.c index b42477b6a225..2b6bc7b805ec 100644 --- a/block/bio.c +++ b/block/bio.c @@ -610,7 +610,7 @@ void __bio_clone_fast(struct bio *bio, struct bio *bio_src) bio->bi_iter = bio_src->bi_iter; bio->bi_io_vec = bio_src->bi_io_vec; - bio_clone_blkcg_association(bio, bio_src); + bio_clone_blkg_association(bio, bio_src); blkcg_bio_issue_init(bio); } EXPORT_SYMBOL(__bio_clone_fast); @@ -1957,34 +1957,6 @@ EXPORT_SYMBOL(bioset_init_from_src); #ifdef CONFIG_BLK_CGROUP -/** - * bio_associate_blkcg - associate a bio with the specified blkcg - * @bio: target bio - * @blkcg_css: css of the blkcg to associate - * - * Associate @bio with the blkcg specified by @blkcg_css. Block layer will - * treat @bio as if it were issued by a task which belongs to the blkcg. - * - * This function takes an extra reference of @blkcg_css which will be put - * when @bio is released. The caller must own @bio and is responsible for - * synchronizing calls to this function. If @blkcg_css is %NULL, a call to - * blkcg_get_css() finds the current css from the kthread or task. - */ -int bio_associate_blkcg(struct bio *bio, struct cgroup_subsys_state *blkcg_css) -{ - if (unlikely(bio->bi_css)) - return -EBUSY; - - if (blkcg_css) - css_get(blkcg_css); - else - blkcg_css = blkcg_get_css(); - - bio->bi_css = blkcg_css; - return 0; -} -EXPORT_SYMBOL_GPL(bio_associate_blkcg); - /** * bio_disassociate_blkg - puts back the blkg reference if associated * @bio: target bio @@ -1994,6 +1966,8 @@ EXPORT_SYMBOL_GPL(bio_associate_blkcg); void bio_disassociate_blkg(struct bio *bio) { if (bio->bi_blkg) { + /* a ref is always taken on css */ + css_put(&bio_blkcg(bio)->css); blkg_put(bio->bi_blkg); bio->bi_blkg = NULL; } @@ -2047,7 +2021,6 @@ void bio_associate_blkg_from_css(struct bio *bio, struct cgroup_subsys_state *css) { css_get(css); - bio->bi_css = css; __bio_associate_blkg_from_css(bio, css); } EXPORT_SYMBOL_GPL(bio_associate_blkg_from_css); @@ -2066,13 +2039,10 @@ void bio_associate_blkg_from_page(struct bio *bio, struct page *page) { struct cgroup_subsys_state *css; - if (unlikely(bio->bi_css)) - return; if (!page->mem_cgroup) return; css = cgroup_get_e_css(page->mem_cgroup->css.cgroup, &io_cgrp_subsys); - bio->bi_css = css; __bio_associate_blkg_from_css(bio, css); } #endif /* CONFIG_MEMCG */ @@ -2094,8 +2064,10 @@ void bio_associate_blkg(struct bio *bio) rcu_read_lock(); - bio_associate_blkcg(bio, NULL); - blkcg = bio_blkcg(bio); + if (bio->bi_blkg) + blkcg = bio->bi_blkg->blkcg; + else + blkcg = css_to_blkcg(blkcg_get_css()); if (!blkcg->css.parent) { __bio_associate_blkg(bio, q->root_blkg); @@ -2115,27 +2087,22 @@ EXPORT_SYMBOL_GPL(bio_associate_blkg); */ void bio_disassociate_task(struct bio *bio) { - if (bio->bi_css) { - css_put(bio->bi_css); - bio->bi_css = NULL; - } bio_disassociate_blkg(bio); } /** - * bio_clone_blkcg_association - clone blkcg association from src to dst bio + * bio_clone_blkg_association - clone blkg association from src to dst bio * @dst: destination bio * @src: source bio */ -void bio_clone_blkcg_association(struct bio *dst, struct bio *src) +void bio_clone_blkg_association(struct bio *dst, struct bio *src) { - if (src->bi_css) - WARN_ON(bio_associate_blkcg(dst, src->bi_css)); - - if (src->bi_blkg) + if (src->bi_blkg) { + css_get(&bio_blkcg(src)->css); __bio_associate_blkg(dst, src->bi_blkg); + } } -EXPORT_SYMBOL_GPL(bio_clone_blkcg_association); +EXPORT_SYMBOL_GPL(bio_clone_blkg_association); #endif /* CONFIG_BLK_CGROUP */ static void __init biovec_init_slabs(void) diff --git a/block/bounce.c b/block/bounce.c index cfb96d5170d0..ffb9e9ecfa7e 100644 --- a/block/bounce.c +++ b/block/bounce.c @@ -277,7 +277,7 @@ static struct bio *bounce_clone_bio(struct bio *bio_src, gfp_t gfp_mask, } } - bio_clone_blkcg_association(bio, bio_src); + bio_clone_blkg_association(bio, bio_src); blkcg_bio_issue_init(bio); return bio; diff --git a/drivers/block/loop.c b/drivers/block/loop.c index 176ab1f28eca..0770004616de 100644 --- a/drivers/block/loop.c +++ b/drivers/block/loop.c @@ -77,6 +77,7 @@ #include #include #include +#include #include "loop.h" @@ -1820,8 +1821,8 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx, /* always use the first bio's css */ #ifdef CONFIG_BLK_CGROUP - if (cmd->use_aio && rq->bio && rq->bio->bi_css) { - cmd->css = rq->bio->bi_css; + if (cmd->use_aio && rq->bio && rq->bio->bi_blkg) { + cmd->css = &bio_blkcg(rq->bio)->css; css_get(cmd->css); } else #endif diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c index ac1cffd2a09b..f3fb5bb8c82a 100644 --- a/drivers/md/raid0.c +++ b/drivers/md/raid0.c @@ -542,7 +542,7 @@ static void raid0_handle_discard(struct mddev *mddev, struct bio *bio) !discard_bio) continue; bio_chain(discard_bio, bio); - bio_clone_blkcg_association(discard_bio, bio); + bio_clone_blkg_association(discard_bio, bio); if (mddev->gendisk) trace_block_bio_remap(bdev_get_queue(rdev->bdev), discard_bio, disk_devt(mddev->gendisk), diff --git a/include/linux/bio.h b/include/linux/bio.h index f0438061a5a3..84e1c4dc703a 100644 --- a/include/linux/bio.h +++ b/include/linux/bio.h @@ -498,7 +498,7 @@ do { \ do { \ (dst)->bi_disk = (src)->bi_disk; \ (dst)->bi_partno = (src)->bi_partno; \ - bio_clone_blkcg_association(dst, src); \ + bio_clone_blkg_association(dst, src); \ } while (0) #define bio_dev(bio) \ @@ -512,24 +512,21 @@ static inline void bio_associate_blkg_from_page(struct bio *bio, #endif #ifdef CONFIG_BLK_CGROUP -int bio_associate_blkcg(struct bio *bio, struct cgroup_subsys_state *blkcg_css); void bio_disassociate_blkg(struct bio *bio); void bio_associate_blkg(struct bio *bio); void bio_associate_blkg_from_css(struct bio *bio, struct cgroup_subsys_state *css); void bio_disassociate_task(struct bio *bio); -void bio_clone_blkcg_association(struct bio *dst, struct bio *src); +void bio_clone_blkg_association(struct bio *dst, struct bio *src); #else /* CONFIG_BLK_CGROUP */ -static inline int bio_associate_blkcg(struct bio *bio, - struct cgroup_subsys_state *blkcg_css) { return 0; } static inline void bio_disassociate_blkg(struct bio *bio) { } static inline void bio_associate_blkg(struct bio *bio) { } static inline void bio_associate_blkg_from_css(struct bio *bio, struct cgroup_subsys_state *css) { } static inline void bio_disassociate_task(struct bio *bio) { } -static inline void bio_clone_blkcg_association(struct bio *dst, - struct bio *src) { } +static inline void bio_clone_blkg_association(struct bio *dst, + struct bio *src) { } #endif /* CONFIG_BLK_CGROUP */ #ifdef CONFIG_HIGHMEM diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h index 8b069c3775ee..f11c37f8ce09 100644 --- a/include/linux/blk-cgroup.h +++ b/include/linux/blk-cgroup.h @@ -309,8 +309,8 @@ static inline struct blkcg *css_to_blkcg(struct cgroup_subsys_state *css) */ static inline struct blkcg *__bio_blkcg(struct bio *bio) { - if (bio && bio->bi_css) - return css_to_blkcg(bio->bi_css); + if (bio && bio->bi_blkg) + return bio->bi_blkg->blkcg; return css_to_blkcg(blkcg_css()); } @@ -324,8 +324,8 @@ static inline struct blkcg *__bio_blkcg(struct bio *bio) */ static inline struct blkcg *bio_blkcg(struct bio *bio) { - if (bio && bio->bi_css) - return css_to_blkcg(bio->bi_css); + if (bio && bio->bi_blkg) + return bio->bi_blkg->blkcg; return NULL; } diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h index c0ba1a038ff3..46c005d601ac 100644 --- a/include/linux/blk_types.h +++ b/include/linux/blk_types.h @@ -174,10 +174,11 @@ struct bio { void *bi_private; #ifdef CONFIG_BLK_CGROUP /* - * Optional css associated with this bio. Put on bio - * release. Read comment on top of bio_associate_current(). + * Represents the association of the css and request_queue for the bio. + * If a bio goes direct to device, it will not have a blkg as it will + * not have a request_queue associated with it. The reference is put + * on release of the bio. */ - struct cgroup_subsys_state *bi_css; struct blkcg_gq *bi_blkg; struct bio_issue bi_issue; #endif diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c index 2868d85f1fb1..fac0ddf8a8e2 100644 --- a/kernel/trace/blktrace.c +++ b/kernel/trace/blktrace.c @@ -764,9 +764,9 @@ blk_trace_bio_get_cgid(struct request_queue *q, struct bio *bio) if (!bt || !(blk_tracer_flags.val & TRACE_BLK_OPT_CGROUP)) return NULL; - if (!bio->bi_css) + if (!bio->bi_blkg) return NULL; - return cgroup_get_kernfs_id(bio->bi_css->cgroup); + return cgroup_get_kernfs_id(bio_blkcg(bio)->css.cgroup); } #else static union kernfs_node_id * -- cgit v1.3-7-g2ca7 From fc5a828bfad628c1092194f2814604943561c52d Mon Sep 17 00:00:00 2001 From: Dennis Zhou Date: Wed, 5 Dec 2018 12:10:36 -0500 Subject: blkcg: remove additional reference to the css The previous patch in this series removed carrying around a pointer to the css in blkg. However, the blkg association logic still relied on taking a reference on the css to ensure we wouldn't fail in getting a reference for the blkg. Here the implicit dependency on the css is removed. The association continues to rely on the tryget logic walking up the blkg tree. This streamlines the three ways that association can happen: normal, swap, and writeback. Signed-off-by: Dennis Zhou Acked-by: Tejun Heo Reviewed-by: Josef Bacik Signed-off-by: Jens Axboe --- block/bio.c | 66 ++++++++++++++++++++-------------------------- include/linux/blk-cgroup.h | 41 ---------------------------- include/linux/cgroup.h | 2 ++ kernel/cgroup/cgroup.c | 48 ++++++++++++++++++++++++++------- 4 files changed, 69 insertions(+), 88 deletions(-) (limited to 'kernel') diff --git a/block/bio.c b/block/bio.c index 2b6bc7b805ec..ce1e512dca5a 100644 --- a/block/bio.c +++ b/block/bio.c @@ -1966,8 +1966,6 @@ EXPORT_SYMBOL(bioset_init_from_src); void bio_disassociate_blkg(struct bio *bio) { if (bio->bi_blkg) { - /* a ref is always taken on css */ - css_put(&bio_blkcg(bio)->css); blkg_put(bio->bi_blkg); bio->bi_blkg = NULL; } @@ -1995,33 +1993,31 @@ static void __bio_associate_blkg(struct bio *bio, struct blkcg_gq *blkg) bio->bi_blkg = blkg_try_get_closest(blkg); } -static void __bio_associate_blkg_from_css(struct bio *bio, - struct cgroup_subsys_state *css) -{ - struct blkcg_gq *blkg; - - rcu_read_lock(); - - blkg = blkg_lookup_create(css_to_blkcg(css), bio->bi_disk->queue); - __bio_associate_blkg(bio, blkg); - - rcu_read_unlock(); -} - /** * bio_associate_blkg_from_css - associate a bio with a specified css * @bio: target bio * @css: target css * * Associate @bio with the blkg found by combining the css's blkg and the - * request_queue of the @bio. This takes a reference on the css that will - * be put upon freeing of @bio. + * request_queue of the @bio. This falls back to the queue's root_blkg if + * the association fails with the css. */ void bio_associate_blkg_from_css(struct bio *bio, struct cgroup_subsys_state *css) { - css_get(css); - __bio_associate_blkg_from_css(bio, css); + struct request_queue *q = bio->bi_disk->queue; + struct blkcg_gq *blkg; + + rcu_read_lock(); + + if (!css || !css->parent) + blkg = q->root_blkg; + else + blkg = blkg_lookup_create(css_to_blkcg(css), q); + + __bio_associate_blkg(bio, blkg); + + rcu_read_unlock(); } EXPORT_SYMBOL_GPL(bio_associate_blkg_from_css); @@ -2032,8 +2028,8 @@ EXPORT_SYMBOL_GPL(bio_associate_blkg_from_css); * @page: the page to lookup the blkcg from * * Associate @bio with the blkg from @page's owning memcg and the respective - * request_queue. This works like every other associate function wrt - * references. + * request_queue. If cgroup_e_css returns %NULL, fall back to the queue's + * root_blkg. */ void bio_associate_blkg_from_page(struct bio *bio, struct page *page) { @@ -2042,8 +2038,12 @@ void bio_associate_blkg_from_page(struct bio *bio, struct page *page) if (!page->mem_cgroup) return; - css = cgroup_get_e_css(page->mem_cgroup->css.cgroup, &io_cgrp_subsys); - __bio_associate_blkg_from_css(bio, css); + rcu_read_lock(); + + css = cgroup_e_css(page->mem_cgroup->css.cgroup, &io_cgrp_subsys); + bio_associate_blkg_from_css(bio, css); + + rcu_read_unlock(); } #endif /* CONFIG_MEMCG */ @@ -2058,24 +2058,16 @@ void bio_associate_blkg_from_page(struct bio *bio, struct page *page) */ void bio_associate_blkg(struct bio *bio) { - struct request_queue *q = bio->bi_disk->queue; - struct blkcg *blkcg; - struct blkcg_gq *blkg; + struct cgroup_subsys_state *css; rcu_read_lock(); if (bio->bi_blkg) - blkcg = bio->bi_blkg->blkcg; + css = &bio_blkcg(bio)->css; else - blkcg = css_to_blkcg(blkcg_get_css()); + css = blkcg_css(); - if (!blkcg->css.parent) { - __bio_associate_blkg(bio, q->root_blkg); - } else { - blkg = blkg_lookup_create(blkcg, q); - - __bio_associate_blkg(bio, blkg); - } + bio_associate_blkg_from_css(bio, css); rcu_read_unlock(); } @@ -2097,10 +2089,8 @@ void bio_disassociate_task(struct bio *bio) */ void bio_clone_blkg_association(struct bio *dst, struct bio *src) { - if (src->bi_blkg) { - css_get(&bio_blkcg(src)->css); + if (src->bi_blkg) __bio_associate_blkg(dst, src->bi_blkg); - } } EXPORT_SYMBOL_GPL(bio_clone_blkg_association); #endif /* CONFIG_BLK_CGROUP */ diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h index f11c37f8ce09..284819a4d122 100644 --- a/include/linux/blk-cgroup.h +++ b/include/linux/blk-cgroup.h @@ -247,47 +247,6 @@ static inline struct cgroup_subsys_state *blkcg_css(void) return task_css(current, io_cgrp_id); } -/** - * blkcg_get_css - find and get a reference to the css - * - * Find the css associated with either the kthread or the current task. - * This takes a reference on the blkcg which will need to be managed by the - * caller. - */ -static inline struct cgroup_subsys_state *blkcg_get_css(void) -{ - struct cgroup_subsys_state *css; - - rcu_read_lock(); - - css = kthread_blkcg(); - if (css) { - css_get(css); - } else { - /* - * This is a bit complicated. It is possible task_css() is - * seeing an old css pointer here. This is caused by the - * current thread migrating away from this cgroup and this - * cgroup dying. css_tryget() will fail when trying to take a - * ref on a cgroup that's ref count has hit 0. - * - * Therefore, if it does fail, this means current must have - * been swapped away already and this is waiting for it to - * propagate on the polling cpu. Hence the use of cpu_relax(). - */ - while (true) { - css = task_css(current, io_cgrp_id); - if (likely(css_tryget(css))) - break; - cpu_relax(); - } - } - - rcu_read_unlock(); - - return css; -} - static inline struct blkcg *css_to_blkcg(struct cgroup_subsys_state *css) { return css ? container_of(css, struct blkcg, css) : NULL; diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index 9d12757a65b0..9968332cceed 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -93,6 +93,8 @@ extern struct css_set init_css_set; bool css_has_online_children(struct cgroup_subsys_state *css); struct cgroup_subsys_state *css_from_id(int id, struct cgroup_subsys *ss); +struct cgroup_subsys_state *cgroup_e_css(struct cgroup *cgroup, + struct cgroup_subsys *ss); struct cgroup_subsys_state *cgroup_get_e_css(struct cgroup *cgroup, struct cgroup_subsys *ss); struct cgroup_subsys_state *css_tryget_online_from_dir(struct dentry *dentry, diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 6aaf5dd5383b..8b79318810ad 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -493,7 +493,7 @@ static struct cgroup_subsys_state *cgroup_tryget_css(struct cgroup *cgrp, } /** - * cgroup_e_css - obtain a cgroup's effective css for the specified subsystem + * cgroup_e_css_by_mask - obtain a cgroup's effective css for the specified ss * @cgrp: the cgroup of interest * @ss: the subsystem of interest (%NULL returns @cgrp->self) * @@ -502,8 +502,8 @@ static struct cgroup_subsys_state *cgroup_tryget_css(struct cgroup *cgrp, * enabled. If @ss is associated with the hierarchy @cgrp is on, this * function is guaranteed to return non-NULL css. */ -static struct cgroup_subsys_state *cgroup_e_css(struct cgroup *cgrp, - struct cgroup_subsys *ss) +static struct cgroup_subsys_state *cgroup_e_css_by_mask(struct cgroup *cgrp, + struct cgroup_subsys *ss) { lockdep_assert_held(&cgroup_mutex); @@ -523,6 +523,35 @@ static struct cgroup_subsys_state *cgroup_e_css(struct cgroup *cgrp, return cgroup_css(cgrp, ss); } +/** + * cgroup_e_css - obtain a cgroup's effective css for the specified subsystem + * @cgrp: the cgroup of interest + * @ss: the subsystem of interest + * + * Find and get the effective css of @cgrp for @ss. The effective css is + * defined as the matching css of the nearest ancestor including self which + * has @ss enabled. If @ss is not mounted on the hierarchy @cgrp is on, + * the root css is returned, so this function always returns a valid css. + * + * The returned css is not guaranteed to be online, and therefore it is the + * callers responsiblity to tryget a reference for it. + */ +struct cgroup_subsys_state *cgroup_e_css(struct cgroup *cgrp, + struct cgroup_subsys *ss) +{ + struct cgroup_subsys_state *css; + + do { + css = cgroup_css(cgrp, ss); + + if (css) + return css; + cgrp = cgroup_parent(cgrp); + } while (cgrp); + + return init_css_set.subsys[ss->id]; +} + /** * cgroup_get_e_css - get a cgroup's effective css for the specified subsystem * @cgrp: the cgroup of interest @@ -605,10 +634,11 @@ EXPORT_SYMBOL_GPL(of_css); * * Should be called under cgroup_[tree_]mutex. */ -#define for_each_e_css(css, ssid, cgrp) \ - for ((ssid) = 0; (ssid) < CGROUP_SUBSYS_COUNT; (ssid)++) \ - if (!((css) = cgroup_e_css(cgrp, cgroup_subsys[(ssid)]))) \ - ; \ +#define for_each_e_css(css, ssid, cgrp) \ + for ((ssid) = 0; (ssid) < CGROUP_SUBSYS_COUNT; (ssid)++) \ + if (!((css) = cgroup_e_css_by_mask(cgrp, \ + cgroup_subsys[(ssid)]))) \ + ; \ else /** @@ -1007,7 +1037,7 @@ static struct css_set *find_existing_css_set(struct css_set *old_cset, * @ss is in this hierarchy, so we want the * effective css from @cgrp. */ - template[i] = cgroup_e_css(cgrp, ss); + template[i] = cgroup_e_css_by_mask(cgrp, ss); } else { /* * @ss is not in this hierarchy, so we don't want @@ -3024,7 +3054,7 @@ static int cgroup_apply_control(struct cgroup *cgrp) return ret; /* - * At this point, cgroup_e_css() results reflect the new csses + * At this point, cgroup_e_css_by_mask() results reflect the new csses * making the following cgroup_update_dfl_csses() properly update * css associations of all tasks in the subtree. */ -- cgit v1.3-7-g2ca7 From 761efe8a94cfcd0a3dd90f2008411550f3520b63 Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Sun, 18 Nov 2018 18:44:04 -0500 Subject: function_graph: Remove the use of FTRACE_NOTRACE_DEPTH The curr_ret_stack is no longer set to a negative value when a function is not to be traced by the function graph tracer. Remove the usage of FTRACE_NOTRACE_DEPTH, as it is no longer needed. Reviewed-by: Joel Fernandes (Google) Signed-off-by: Steven Rostedt (VMware) --- include/linux/ftrace.h | 1 - kernel/trace/fgraph.c | 19 ------------------- kernel/trace/trace_functions_graph.c | 11 ----------- 3 files changed, 31 deletions(-) (limited to 'kernel') diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h index 10bd46434908..98625f10d982 100644 --- a/include/linux/ftrace.h +++ b/include/linux/ftrace.h @@ -790,7 +790,6 @@ unsigned long ftrace_graph_ret_addr(struct task_struct *task, int *idx, */ #define __notrace_funcgraph notrace -#define FTRACE_NOTRACE_DEPTH 65536 #define FTRACE_RETFUNC_DEPTH 50 #define FTRACE_RETSTACK_ALLOC_SIZE 32 extern int register_ftrace_graph(trace_func_graph_ret_t retfunc, diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c index e852b69c0e64..de887a983ac7 100644 --- a/kernel/trace/fgraph.c +++ b/kernel/trace/fgraph.c @@ -112,16 +112,6 @@ ftrace_pop_return_trace(struct ftrace_graph_ret *trace, unsigned long *ret, index = current->curr_ret_stack; - /* - * A negative index here means that it's just returned from a - * notrace'd function. Recover index to get an original - * return address. See ftrace_push_return_trace(). - * - * TODO: Need to check whether the stack gets corrupted. - */ - if (index < 0) - index += FTRACE_NOTRACE_DEPTH; - if (unlikely(index < 0 || index >= FTRACE_RETFUNC_DEPTH)) { ftrace_graph_stop(); WARN_ON(1); @@ -190,15 +180,6 @@ unsigned long ftrace_return_to_handler(unsigned long frame_pointer) */ barrier(); current->curr_ret_stack--; - /* - * The curr_ret_stack can be less than -1 only if it was - * filtered out and it's about to return from the function. - * Recover the index and continue to trace normal functions. - */ - if (current->curr_ret_stack < -1) { - current->curr_ret_stack += FTRACE_NOTRACE_DEPTH; - return ret; - } if (unlikely(!ret)) { ftrace_graph_stop(); diff --git a/kernel/trace/trace_functions_graph.c b/kernel/trace/trace_functions_graph.c index ecf543df943b..eaf9b1629956 100644 --- a/kernel/trace/trace_functions_graph.c +++ b/kernel/trace/trace_functions_graph.c @@ -115,9 +115,6 @@ unsigned long ftrace_graph_ret_addr(struct task_struct *task, int *idx, if (ret != (unsigned long)return_to_handler) return ret; - if (index < -1) - index += FTRACE_NOTRACE_DEPTH; - if (index < 0) return ret; @@ -675,10 +672,6 @@ print_graph_entry_leaf(struct trace_iterator *iter, cpu_data = per_cpu_ptr(data->cpu_data, cpu); - /* If a graph tracer ignored set_graph_notrace */ - if (call->depth < -1) - call->depth += FTRACE_NOTRACE_DEPTH; - /* * Comments display at + 1 to depth. Since * this is a leaf function, keep the comments @@ -721,10 +714,6 @@ print_graph_entry_nested(struct trace_iterator *iter, struct fgraph_cpu_data *cpu_data; int cpu = iter->cpu; - /* If a graph tracer ignored set_graph_notrace */ - if (call->depth < -1) - call->depth += FTRACE_NOTRACE_DEPTH; - cpu_data = per_cpu_ptr(data->cpu_data, cpu); cpu_data->depth = call->depth; -- cgit v1.3-7-g2ca7 From 3306fc4aff464f9c08c8899695a218f4b1125d4a Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Thu, 15 Nov 2018 12:32:38 -0500 Subject: ftrace: Create new ftrace_internal.h header In order to move function graph infrastructure into its own file (fgraph.h) it needs to access various functions and variables in ftrace.c that are currently static. Create a new file called ftrace-internal.h that holds the function prototypes and the extern declarations of the variables needed by fgraph.c as well, and make them global in ftrace.c such that they can be used outside that file. Reviewed-by: Joel Fernandes (Google) Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/ftrace.c | 76 ++++++++---------------------------------- kernel/trace/ftrace_internal.h | 75 +++++++++++++++++++++++++++++++++++++++++ 2 files changed, 89 insertions(+), 62 deletions(-) create mode 100644 kernel/trace/ftrace_internal.h (limited to 'kernel') diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index 77734451cb05..52c89428b0db 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -40,6 +40,7 @@ #include #include +#include "ftrace_internal.h" #include "trace_output.h" #include "trace_stat.h" @@ -77,7 +78,7 @@ #define ASSIGN_OPS_HASH(opsname, val) #endif -static struct ftrace_ops ftrace_list_end __read_mostly = { +struct ftrace_ops ftrace_list_end __read_mostly = { .func = ftrace_stub, .flags = FTRACE_OPS_FL_RECURSION_SAFE | FTRACE_OPS_FL_STUB, INIT_OPS_HASH(ftrace_list_end) @@ -112,11 +113,11 @@ static void ftrace_update_trampoline(struct ftrace_ops *ops); */ static int ftrace_disabled __read_mostly; -static DEFINE_MUTEX(ftrace_lock); +DEFINE_MUTEX(ftrace_lock); -static struct ftrace_ops __rcu *ftrace_ops_list __read_mostly = &ftrace_list_end; +struct ftrace_ops __rcu *ftrace_ops_list __read_mostly = &ftrace_list_end; ftrace_func_t ftrace_trace_function __read_mostly = ftrace_stub; -static struct ftrace_ops global_ops; +struct ftrace_ops global_ops; #if ARCH_SUPPORTS_FTRACE_OPS static void ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip, @@ -127,26 +128,6 @@ static void ftrace_ops_no_ops(unsigned long ip, unsigned long parent_ip); #define ftrace_ops_list_func ((ftrace_func_t)ftrace_ops_no_ops) #endif -/* - * Traverse the ftrace_global_list, invoking all entries. The reason that we - * can use rcu_dereference_raw_notrace() is that elements removed from this list - * are simply leaked, so there is no need to interact with a grace-period - * mechanism. The rcu_dereference_raw_notrace() calls are needed to handle - * concurrent insertions into the ftrace_global_list. - * - * Silly Alpha and silly pointer-speculation compiler optimizations! - */ -#define do_for_each_ftrace_op(op, list) \ - op = rcu_dereference_raw_notrace(list); \ - do - -/* - * Optimized for just a single item in the list (as that is the normal case). - */ -#define while_for_each_ftrace_op(op) \ - while (likely(op = rcu_dereference_raw_notrace((op)->next)) && \ - unlikely((op) != &ftrace_list_end)) - static inline void ftrace_ops_init(struct ftrace_ops *ops) { #ifdef CONFIG_DYNAMIC_FTRACE @@ -187,17 +168,11 @@ static void ftrace_sync_ipi(void *data) } #ifdef CONFIG_FUNCTION_GRAPH_TRACER -static void update_function_graph_func(void); - /* Both enabled by default (can be cleared by function_graph tracer flags */ static bool fgraph_sleep_time = true; static bool fgraph_graph_time = true; - -#else -static inline void update_function_graph_func(void) { } #endif - static ftrace_func_t ftrace_ops_get_list_func(struct ftrace_ops *ops) { /* @@ -334,7 +309,7 @@ static int remove_ftrace_ops(struct ftrace_ops __rcu **list, static void ftrace_update_trampoline(struct ftrace_ops *ops); -static int __register_ftrace_function(struct ftrace_ops *ops) +int __register_ftrace_function(struct ftrace_ops *ops) { if (ops->flags & FTRACE_OPS_FL_DELETED) return -EINVAL; @@ -375,7 +350,7 @@ static int __register_ftrace_function(struct ftrace_ops *ops) return 0; } -static int __unregister_ftrace_function(struct ftrace_ops *ops) +int __unregister_ftrace_function(struct ftrace_ops *ops) { int ret; @@ -1022,9 +997,7 @@ static __init void ftrace_profile_tracefs(struct dentry *d_tracer) #endif /* CONFIG_FUNCTION_PROFILER */ #ifdef CONFIG_FUNCTION_GRAPH_TRACER -static int ftrace_graph_active; -#else -# define ftrace_graph_active 0 +int ftrace_graph_active; #endif #ifdef CONFIG_DYNAMIC_FTRACE @@ -1067,7 +1040,7 @@ static const struct ftrace_hash empty_hash = { }; #define EMPTY_HASH ((struct ftrace_hash *)&empty_hash) -static struct ftrace_ops global_ops = { +struct ftrace_ops global_ops = { .func = ftrace_stub, .local_hash.notrace_hash = EMPTY_HASH, .local_hash.filter_hash = EMPTY_HASH, @@ -1503,7 +1476,7 @@ static bool hash_contains_ip(unsigned long ip, * This needs to be called with preemption disabled as * the hashes are freed with call_rcu_sched(). */ -static int +int ftrace_ops_test(struct ftrace_ops *ops, unsigned long ip, void *regs) { struct ftrace_ops_hash hash; @@ -2682,7 +2655,7 @@ static void ftrace_startup_all(int command) update_all_ops = false; } -static int ftrace_startup(struct ftrace_ops *ops, int command) +int ftrace_startup(struct ftrace_ops *ops, int command) { int ret; @@ -2724,7 +2697,7 @@ static int ftrace_startup(struct ftrace_ops *ops, int command) return 0; } -static int ftrace_shutdown(struct ftrace_ops *ops, int command) +int ftrace_shutdown(struct ftrace_ops *ops, int command) { int ret; @@ -6177,7 +6150,7 @@ void ftrace_init_trace_array(struct trace_array *tr) } #else -static struct ftrace_ops global_ops = { +struct ftrace_ops global_ops = { .func = ftrace_stub, .flags = FTRACE_OPS_FL_RECURSION_SAFE | FTRACE_OPS_FL_INITIALIZED | @@ -6194,31 +6167,10 @@ core_initcall(ftrace_nodyn_init); static inline int ftrace_init_dyn_tracefs(struct dentry *d_tracer) { return 0; } static inline void ftrace_startup_enable(int command) { } static inline void ftrace_startup_all(int command) { } -/* Keep as macros so we do not need to define the commands */ -# define ftrace_startup(ops, command) \ - ({ \ - int ___ret = __register_ftrace_function(ops); \ - if (!___ret) \ - (ops)->flags |= FTRACE_OPS_FL_ENABLED; \ - ___ret; \ - }) -# define ftrace_shutdown(ops, command) \ - ({ \ - int ___ret = __unregister_ftrace_function(ops); \ - if (!___ret) \ - (ops)->flags &= ~FTRACE_OPS_FL_ENABLED; \ - ___ret; \ - }) # define ftrace_startup_sysctl() do { } while (0) # define ftrace_shutdown_sysctl() do { } while (0) -static inline int -ftrace_ops_test(struct ftrace_ops *ops, unsigned long ip, void *regs) -{ - return 1; -} - static void ftrace_update_trampoline(struct ftrace_ops *ops) { } @@ -6930,7 +6882,7 @@ static int ftrace_graph_entry_test(struct ftrace_graph_ent *trace) * function against the global ops, and not just trace any function * that any ftrace_ops registered. */ -static void update_function_graph_func(void) +void update_function_graph_func(void) { struct ftrace_ops *op; bool do_test = false; diff --git a/kernel/trace/ftrace_internal.h b/kernel/trace/ftrace_internal.h new file mode 100644 index 000000000000..0515a2096f90 --- /dev/null +++ b/kernel/trace/ftrace_internal.h @@ -0,0 +1,75 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_KERNEL_FTRACE_INTERNAL_H +#define _LINUX_KERNEL_FTRACE_INTERNAL_H + +#ifdef CONFIG_FUNCTION_TRACER + +/* + * Traverse the ftrace_global_list, invoking all entries. The reason that we + * can use rcu_dereference_raw_notrace() is that elements removed from this list + * are simply leaked, so there is no need to interact with a grace-period + * mechanism. The rcu_dereference_raw_notrace() calls are needed to handle + * concurrent insertions into the ftrace_global_list. + * + * Silly Alpha and silly pointer-speculation compiler optimizations! + */ +#define do_for_each_ftrace_op(op, list) \ + op = rcu_dereference_raw_notrace(list); \ + do + +/* + * Optimized for just a single item in the list (as that is the normal case). + */ +#define while_for_each_ftrace_op(op) \ + while (likely(op = rcu_dereference_raw_notrace((op)->next)) && \ + unlikely((op) != &ftrace_list_end)) + +extern struct ftrace_ops __rcu *ftrace_ops_list; +extern struct ftrace_ops ftrace_list_end; +extern struct mutex ftrace_lock; +extern struct ftrace_ops global_ops; + +#ifdef CONFIG_DYNAMIC_FTRACE + +int ftrace_startup(struct ftrace_ops *ops, int command); +int ftrace_shutdown(struct ftrace_ops *ops, int command); +int ftrace_ops_test(struct ftrace_ops *ops, unsigned long ip, void *regs); + +#else /* !CONFIG_DYNAMIC_FTRACE */ + +int __register_ftrace_function(struct ftrace_ops *ops); +int __unregister_ftrace_function(struct ftrace_ops *ops); +/* Keep as macros so we do not need to define the commands */ +# define ftrace_startup(ops, command) \ + ({ \ + int ___ret = __register_ftrace_function(ops); \ + if (!___ret) \ + (ops)->flags |= FTRACE_OPS_FL_ENABLED; \ + ___ret; \ + }) +# define ftrace_shutdown(ops, command) \ + ({ \ + int ___ret = __unregister_ftrace_function(ops); \ + if (!___ret) \ + (ops)->flags &= ~FTRACE_OPS_FL_ENABLED; \ + ___ret; \ + }) +static inline int +ftrace_ops_test(struct ftrace_ops *ops, unsigned long ip, void *regs) +{ + return 1; +} +#endif /* CONFIG_DYNAMIC_FTRACE */ + +#ifdef CONFIG_FUNCTION_GRAPH_TRACER +extern int ftrace_graph_active; +void update_function_graph_func(void); +#else /* !CONFIG_FUNCTION_GRAPH_TRACER */ +# define ftrace_graph_active 0 +static inline void update_function_graph_func(void) { } +#endif /* CONFIG_FUNCTION_GRAPH_TRACER */ + +#else /* !CONFIG_FUNCTION_TRACER */ +#endif /* CONFIG_FUNCTION_TRACER */ + +#endif -- cgit v1.3-7-g2ca7 From c8dd0f45874547e6e77bab03d71feb16c4cb98a8 Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Fri, 23 Nov 2018 13:06:07 -0500 Subject: function_graph: Do not expose the graph_time option when profiler is not configured When the function profiler is not configured, the "graph_time" option is meaningless, as the function profiler is the only thing that makes use of it. Do not expose it if the profiler is not configured. Link: http://lkml.kernel.org/r/20181123061133.GA195223@google.com Reported-by: Joel Fernandes Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace.h | 5 +++++ kernel/trace/trace_functions_graph.c | 4 ++++ 2 files changed, 9 insertions(+) (limited to 'kernel') diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h index f67060a75f38..ab16eca76e59 100644 --- a/kernel/trace/trace.h +++ b/kernel/trace/trace.h @@ -862,7 +862,12 @@ static __always_inline bool ftrace_hash_empty(struct ftrace_hash *hash) #define TRACE_GRAPH_PRINT_FILL_MASK (0x3 << TRACE_GRAPH_PRINT_FILL_SHIFT) extern void ftrace_graph_sleep_time_control(bool enable); + +#ifdef CONFIG_FUNCTION_PROFILER extern void ftrace_graph_graph_time_control(bool enable); +#else +static inline void ftrace_graph_graph_time_control(bool enable) { } +#endif extern enum print_line_t print_graph_function_flags(struct trace_iterator *iter, u32 flags); diff --git a/kernel/trace/trace_functions_graph.c b/kernel/trace/trace_functions_graph.c index eaf9b1629956..855c13c61e77 100644 --- a/kernel/trace/trace_functions_graph.c +++ b/kernel/trace/trace_functions_graph.c @@ -60,8 +60,12 @@ static struct tracer_opt trace_opts[] = { { TRACER_OPT(funcgraph-tail, TRACE_GRAPH_PRINT_TAIL) }, /* Include sleep time (scheduled out) between entry and return */ { TRACER_OPT(sleep-time, TRACE_GRAPH_SLEEP_TIME) }, + +#ifdef CONFIG_FUNCTION_PROFILER /* Include time within nested functions */ { TRACER_OPT(graph-time, TRACE_GRAPH_GRAPH_TIME) }, +#endif + { } /* Empty entry */ }; -- cgit v1.3-7-g2ca7 From e73e679f656e678b0e7f8961094201f3544f4541 Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Thu, 15 Nov 2018 12:35:13 -0500 Subject: fgraph: Move function graph specific code into fgraph.c To make the function graph infrastructure more managable, the code needs to be in its own file (fgraph.c). Move the code that is specific for managing the function graph infrastructure out of ftrace.c and into fgraph.c Reviewed-by: Joel Fernandes (Google) Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/fgraph.c | 360 +++++++++++++++++++++++++++++++++++++++++++++++- kernel/trace/ftrace.c | 368 +------------------------------------------------- 2 files changed, 366 insertions(+), 362 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c index de887a983ac7..374f3e42e29e 100644 --- a/kernel/trace/fgraph.c +++ b/kernel/trace/fgraph.c @@ -7,11 +7,27 @@ * * Highly modified by Steven Rostedt (VMware). */ +#include #include +#include -#include "trace.h" +#include + +#include "ftrace_internal.h" + +#ifdef CONFIG_DYNAMIC_FTRACE +#define ASSIGN_OPS_HASH(opsname, val) \ + .func_hash = val, \ + .local_hash.regex_lock = __MUTEX_INITIALIZER(opsname.local_hash.regex_lock), +#else +#define ASSIGN_OPS_HASH(opsname, val) +#endif static bool kill_ftrace_graph; +int ftrace_graph_active; + +/* Both enabled by default (can be cleared by function_graph tracer flags */ +static bool fgraph_sleep_time = true; /** * ftrace_graph_is_dead - returns true if ftrace_graph_stop() was called @@ -161,6 +177,31 @@ ftrace_pop_return_trace(struct ftrace_graph_ret *trace, unsigned long *ret, barrier(); } +/* + * Hibernation protection. + * The state of the current task is too much unstable during + * suspend/restore to disk. We want to protect against that. + */ +static int +ftrace_suspend_notifier_call(struct notifier_block *bl, unsigned long state, + void *unused) +{ + switch (state) { + case PM_HIBERNATION_PREPARE: + pause_graph_tracing(); + break; + + case PM_POST_HIBERNATION: + unpause_graph_tracing(); + break; + } + return NOTIFY_DONE; +} + +static struct notifier_block ftrace_suspend_notifier = { + .notifier_call = ftrace_suspend_notifier_call, +}; + /* * Send the trace to the ring-buffer. * @return the original return address. @@ -190,3 +231,320 @@ unsigned long ftrace_return_to_handler(unsigned long frame_pointer) return ret; } + +static struct ftrace_ops graph_ops = { + .func = ftrace_stub, + .flags = FTRACE_OPS_FL_RECURSION_SAFE | + FTRACE_OPS_FL_INITIALIZED | + FTRACE_OPS_FL_PID | + FTRACE_OPS_FL_STUB, +#ifdef FTRACE_GRAPH_TRAMP_ADDR + .trampoline = FTRACE_GRAPH_TRAMP_ADDR, + /* trampoline_size is only needed for dynamically allocated tramps */ +#endif + ASSIGN_OPS_HASH(graph_ops, &global_ops.local_hash) +}; + +void ftrace_graph_sleep_time_control(bool enable) +{ + fgraph_sleep_time = enable; +} + +int ftrace_graph_entry_stub(struct ftrace_graph_ent *trace) +{ + return 0; +} + +/* The callbacks that hook a function */ +trace_func_graph_ret_t ftrace_graph_return = + (trace_func_graph_ret_t)ftrace_stub; +trace_func_graph_ent_t ftrace_graph_entry = ftrace_graph_entry_stub; +static trace_func_graph_ent_t __ftrace_graph_entry = ftrace_graph_entry_stub; + +/* Try to assign a return stack array on FTRACE_RETSTACK_ALLOC_SIZE tasks. */ +static int alloc_retstack_tasklist(struct ftrace_ret_stack **ret_stack_list) +{ + int i; + int ret = 0; + int start = 0, end = FTRACE_RETSTACK_ALLOC_SIZE; + struct task_struct *g, *t; + + for (i = 0; i < FTRACE_RETSTACK_ALLOC_SIZE; i++) { + ret_stack_list[i] = + kmalloc_array(FTRACE_RETFUNC_DEPTH, + sizeof(struct ftrace_ret_stack), + GFP_KERNEL); + if (!ret_stack_list[i]) { + start = 0; + end = i; + ret = -ENOMEM; + goto free; + } + } + + read_lock(&tasklist_lock); + do_each_thread(g, t) { + if (start == end) { + ret = -EAGAIN; + goto unlock; + } + + if (t->ret_stack == NULL) { + atomic_set(&t->tracing_graph_pause, 0); + atomic_set(&t->trace_overrun, 0); + t->curr_ret_stack = -1; + t->curr_ret_depth = -1; + /* Make sure the tasks see the -1 first: */ + smp_wmb(); + t->ret_stack = ret_stack_list[start++]; + } + } while_each_thread(g, t); + +unlock: + read_unlock(&tasklist_lock); +free: + for (i = start; i < end; i++) + kfree(ret_stack_list[i]); + return ret; +} + +static void +ftrace_graph_probe_sched_switch(void *ignore, bool preempt, + struct task_struct *prev, struct task_struct *next) +{ + unsigned long long timestamp; + int index; + + /* + * Does the user want to count the time a function was asleep. + * If so, do not update the time stamps. + */ + if (fgraph_sleep_time) + return; + + timestamp = trace_clock_local(); + + prev->ftrace_timestamp = timestamp; + + /* only process tasks that we timestamped */ + if (!next->ftrace_timestamp) + return; + + /* + * Update all the counters in next to make up for the + * time next was sleeping. + */ + timestamp -= next->ftrace_timestamp; + + for (index = next->curr_ret_stack; index >= 0; index--) + next->ret_stack[index].calltime += timestamp; +} + +static int ftrace_graph_entry_test(struct ftrace_graph_ent *trace) +{ + if (!ftrace_ops_test(&global_ops, trace->func, NULL)) + return 0; + return __ftrace_graph_entry(trace); +} + +/* + * The function graph tracer should only trace the functions defined + * by set_ftrace_filter and set_ftrace_notrace. If another function + * tracer ops is registered, the graph tracer requires testing the + * function against the global ops, and not just trace any function + * that any ftrace_ops registered. + */ +void update_function_graph_func(void) +{ + struct ftrace_ops *op; + bool do_test = false; + + /* + * The graph and global ops share the same set of functions + * to test. If any other ops is on the list, then + * the graph tracing needs to test if its the function + * it should call. + */ + do_for_each_ftrace_op(op, ftrace_ops_list) { + if (op != &global_ops && op != &graph_ops && + op != &ftrace_list_end) { + do_test = true; + /* in double loop, break out with goto */ + goto out; + } + } while_for_each_ftrace_op(op); + out: + if (do_test) + ftrace_graph_entry = ftrace_graph_entry_test; + else + ftrace_graph_entry = __ftrace_graph_entry; +} + +static DEFINE_PER_CPU(struct ftrace_ret_stack *, idle_ret_stack); + +static void +graph_init_task(struct task_struct *t, struct ftrace_ret_stack *ret_stack) +{ + atomic_set(&t->tracing_graph_pause, 0); + atomic_set(&t->trace_overrun, 0); + t->ftrace_timestamp = 0; + /* make curr_ret_stack visible before we add the ret_stack */ + smp_wmb(); + t->ret_stack = ret_stack; +} + +/* + * Allocate a return stack for the idle task. May be the first + * time through, or it may be done by CPU hotplug online. + */ +void ftrace_graph_init_idle_task(struct task_struct *t, int cpu) +{ + t->curr_ret_stack = -1; + t->curr_ret_depth = -1; + /* + * The idle task has no parent, it either has its own + * stack or no stack at all. + */ + if (t->ret_stack) + WARN_ON(t->ret_stack != per_cpu(idle_ret_stack, cpu)); + + if (ftrace_graph_active) { + struct ftrace_ret_stack *ret_stack; + + ret_stack = per_cpu(idle_ret_stack, cpu); + if (!ret_stack) { + ret_stack = + kmalloc_array(FTRACE_RETFUNC_DEPTH, + sizeof(struct ftrace_ret_stack), + GFP_KERNEL); + if (!ret_stack) + return; + per_cpu(idle_ret_stack, cpu) = ret_stack; + } + graph_init_task(t, ret_stack); + } +} + +/* Allocate a return stack for newly created task */ +void ftrace_graph_init_task(struct task_struct *t) +{ + /* Make sure we do not use the parent ret_stack */ + t->ret_stack = NULL; + t->curr_ret_stack = -1; + t->curr_ret_depth = -1; + + if (ftrace_graph_active) { + struct ftrace_ret_stack *ret_stack; + + ret_stack = kmalloc_array(FTRACE_RETFUNC_DEPTH, + sizeof(struct ftrace_ret_stack), + GFP_KERNEL); + if (!ret_stack) + return; + graph_init_task(t, ret_stack); + } +} + +void ftrace_graph_exit_task(struct task_struct *t) +{ + struct ftrace_ret_stack *ret_stack = t->ret_stack; + + t->ret_stack = NULL; + /* NULL must become visible to IRQs before we free it: */ + barrier(); + + kfree(ret_stack); +} + +/* Allocate a return stack for each task */ +static int start_graph_tracing(void) +{ + struct ftrace_ret_stack **ret_stack_list; + int ret, cpu; + + ret_stack_list = kmalloc_array(FTRACE_RETSTACK_ALLOC_SIZE, + sizeof(struct ftrace_ret_stack *), + GFP_KERNEL); + + if (!ret_stack_list) + return -ENOMEM; + + /* The cpu_boot init_task->ret_stack will never be freed */ + for_each_online_cpu(cpu) { + if (!idle_task(cpu)->ret_stack) + ftrace_graph_init_idle_task(idle_task(cpu), cpu); + } + + do { + ret = alloc_retstack_tasklist(ret_stack_list); + } while (ret == -EAGAIN); + + if (!ret) { + ret = register_trace_sched_switch(ftrace_graph_probe_sched_switch, NULL); + if (ret) + pr_info("ftrace_graph: Couldn't activate tracepoint" + " probe to kernel_sched_switch\n"); + } + + kfree(ret_stack_list); + return ret; +} + +int register_ftrace_graph(trace_func_graph_ret_t retfunc, + trace_func_graph_ent_t entryfunc) +{ + int ret = 0; + + mutex_lock(&ftrace_lock); + + /* we currently allow only one tracer registered at a time */ + if (ftrace_graph_active) { + ret = -EBUSY; + goto out; + } + + register_pm_notifier(&ftrace_suspend_notifier); + + ftrace_graph_active++; + ret = start_graph_tracing(); + if (ret) { + ftrace_graph_active--; + goto out; + } + + ftrace_graph_return = retfunc; + + /* + * Update the indirect function to the entryfunc, and the + * function that gets called to the entry_test first. Then + * call the update fgraph entry function to determine if + * the entryfunc should be called directly or not. + */ + __ftrace_graph_entry = entryfunc; + ftrace_graph_entry = ftrace_graph_entry_test; + update_function_graph_func(); + + ret = ftrace_startup(&graph_ops, FTRACE_START_FUNC_RET); +out: + mutex_unlock(&ftrace_lock); + return ret; +} + +void unregister_ftrace_graph(void) +{ + mutex_lock(&ftrace_lock); + + if (unlikely(!ftrace_graph_active)) + goto out; + + ftrace_graph_active--; + ftrace_graph_return = (trace_func_graph_ret_t)ftrace_stub; + ftrace_graph_entry = ftrace_graph_entry_stub; + __ftrace_graph_entry = ftrace_graph_entry_stub; + ftrace_shutdown(&graph_ops, FTRACE_STOP_FUNC_RET); + unregister_pm_notifier(&ftrace_suspend_notifier); + unregister_trace_sched_switch(ftrace_graph_probe_sched_switch, NULL); + + out: + mutex_unlock(&ftrace_lock); +} diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index 52c89428b0db..c53533b833cf 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -19,7 +19,6 @@ #include #include #include -#include #include #include #include @@ -167,12 +166,6 @@ static void ftrace_sync_ipi(void *data) smp_rmb(); } -#ifdef CONFIG_FUNCTION_GRAPH_TRACER -/* Both enabled by default (can be cleared by function_graph tracer flags */ -static bool fgraph_sleep_time = true; -static bool fgraph_graph_time = true; -#endif - static ftrace_func_t ftrace_ops_get_list_func(struct ftrace_ops *ops) { /* @@ -790,6 +783,13 @@ function_profile_call(unsigned long ip, unsigned long parent_ip, } #ifdef CONFIG_FUNCTION_GRAPH_TRACER +static bool fgraph_graph_time = true; + +void ftrace_graph_graph_time_control(bool enable) +{ + fgraph_graph_time = enable; +} + static int profile_graph_entry(struct ftrace_graph_ent *trace) { int index = current->curr_ret_stack; @@ -996,10 +996,6 @@ static __init void ftrace_profile_tracefs(struct dentry *d_tracer) } #endif /* CONFIG_FUNCTION_PROFILER */ -#ifdef CONFIG_FUNCTION_GRAPH_TRACER -int ftrace_graph_active; -#endif - #ifdef CONFIG_DYNAMIC_FTRACE static struct ftrace_ops *removed_ops; @@ -6697,353 +6693,3 @@ ftrace_enable_sysctl(struct ctl_table *table, int write, mutex_unlock(&ftrace_lock); return ret; } - -#ifdef CONFIG_FUNCTION_GRAPH_TRACER - -static struct ftrace_ops graph_ops = { - .func = ftrace_stub, - .flags = FTRACE_OPS_FL_RECURSION_SAFE | - FTRACE_OPS_FL_INITIALIZED | - FTRACE_OPS_FL_PID | - FTRACE_OPS_FL_STUB, -#ifdef FTRACE_GRAPH_TRAMP_ADDR - .trampoline = FTRACE_GRAPH_TRAMP_ADDR, - /* trampoline_size is only needed for dynamically allocated tramps */ -#endif - ASSIGN_OPS_HASH(graph_ops, &global_ops.local_hash) -}; - -void ftrace_graph_sleep_time_control(bool enable) -{ - fgraph_sleep_time = enable; -} - -void ftrace_graph_graph_time_control(bool enable) -{ - fgraph_graph_time = enable; -} - -int ftrace_graph_entry_stub(struct ftrace_graph_ent *trace) -{ - return 0; -} - -/* The callbacks that hook a function */ -trace_func_graph_ret_t ftrace_graph_return = - (trace_func_graph_ret_t)ftrace_stub; -trace_func_graph_ent_t ftrace_graph_entry = ftrace_graph_entry_stub; -static trace_func_graph_ent_t __ftrace_graph_entry = ftrace_graph_entry_stub; - -/* Try to assign a return stack array on FTRACE_RETSTACK_ALLOC_SIZE tasks. */ -static int alloc_retstack_tasklist(struct ftrace_ret_stack **ret_stack_list) -{ - int i; - int ret = 0; - int start = 0, end = FTRACE_RETSTACK_ALLOC_SIZE; - struct task_struct *g, *t; - - for (i = 0; i < FTRACE_RETSTACK_ALLOC_SIZE; i++) { - ret_stack_list[i] = - kmalloc_array(FTRACE_RETFUNC_DEPTH, - sizeof(struct ftrace_ret_stack), - GFP_KERNEL); - if (!ret_stack_list[i]) { - start = 0; - end = i; - ret = -ENOMEM; - goto free; - } - } - - read_lock(&tasklist_lock); - do_each_thread(g, t) { - if (start == end) { - ret = -EAGAIN; - goto unlock; - } - - if (t->ret_stack == NULL) { - atomic_set(&t->tracing_graph_pause, 0); - atomic_set(&t->trace_overrun, 0); - t->curr_ret_stack = -1; - t->curr_ret_depth = -1; - /* Make sure the tasks see the -1 first: */ - smp_wmb(); - t->ret_stack = ret_stack_list[start++]; - } - } while_each_thread(g, t); - -unlock: - read_unlock(&tasklist_lock); -free: - for (i = start; i < end; i++) - kfree(ret_stack_list[i]); - return ret; -} - -static void -ftrace_graph_probe_sched_switch(void *ignore, bool preempt, - struct task_struct *prev, struct task_struct *next) -{ - unsigned long long timestamp; - int index; - - /* - * Does the user want to count the time a function was asleep. - * If so, do not update the time stamps. - */ - if (fgraph_sleep_time) - return; - - timestamp = trace_clock_local(); - - prev->ftrace_timestamp = timestamp; - - /* only process tasks that we timestamped */ - if (!next->ftrace_timestamp) - return; - - /* - * Update all the counters in next to make up for the - * time next was sleeping. - */ - timestamp -= next->ftrace_timestamp; - - for (index = next->curr_ret_stack; index >= 0; index--) - next->ret_stack[index].calltime += timestamp; -} - -/* Allocate a return stack for each task */ -static int start_graph_tracing(void) -{ - struct ftrace_ret_stack **ret_stack_list; - int ret, cpu; - - ret_stack_list = kmalloc_array(FTRACE_RETSTACK_ALLOC_SIZE, - sizeof(struct ftrace_ret_stack *), - GFP_KERNEL); - - if (!ret_stack_list) - return -ENOMEM; - - /* The cpu_boot init_task->ret_stack will never be freed */ - for_each_online_cpu(cpu) { - if (!idle_task(cpu)->ret_stack) - ftrace_graph_init_idle_task(idle_task(cpu), cpu); - } - - do { - ret = alloc_retstack_tasklist(ret_stack_list); - } while (ret == -EAGAIN); - - if (!ret) { - ret = register_trace_sched_switch(ftrace_graph_probe_sched_switch, NULL); - if (ret) - pr_info("ftrace_graph: Couldn't activate tracepoint" - " probe to kernel_sched_switch\n"); - } - - kfree(ret_stack_list); - return ret; -} - -/* - * Hibernation protection. - * The state of the current task is too much unstable during - * suspend/restore to disk. We want to protect against that. - */ -static int -ftrace_suspend_notifier_call(struct notifier_block *bl, unsigned long state, - void *unused) -{ - switch (state) { - case PM_HIBERNATION_PREPARE: - pause_graph_tracing(); - break; - - case PM_POST_HIBERNATION: - unpause_graph_tracing(); - break; - } - return NOTIFY_DONE; -} - -static int ftrace_graph_entry_test(struct ftrace_graph_ent *trace) -{ - if (!ftrace_ops_test(&global_ops, trace->func, NULL)) - return 0; - return __ftrace_graph_entry(trace); -} - -/* - * The function graph tracer should only trace the functions defined - * by set_ftrace_filter and set_ftrace_notrace. If another function - * tracer ops is registered, the graph tracer requires testing the - * function against the global ops, and not just trace any function - * that any ftrace_ops registered. - */ -void update_function_graph_func(void) -{ - struct ftrace_ops *op; - bool do_test = false; - - /* - * The graph and global ops share the same set of functions - * to test. If any other ops is on the list, then - * the graph tracing needs to test if its the function - * it should call. - */ - do_for_each_ftrace_op(op, ftrace_ops_list) { - if (op != &global_ops && op != &graph_ops && - op != &ftrace_list_end) { - do_test = true; - /* in double loop, break out with goto */ - goto out; - } - } while_for_each_ftrace_op(op); - out: - if (do_test) - ftrace_graph_entry = ftrace_graph_entry_test; - else - ftrace_graph_entry = __ftrace_graph_entry; -} - -static struct notifier_block ftrace_suspend_notifier = { - .notifier_call = ftrace_suspend_notifier_call, -}; - -int register_ftrace_graph(trace_func_graph_ret_t retfunc, - trace_func_graph_ent_t entryfunc) -{ - int ret = 0; - - mutex_lock(&ftrace_lock); - - /* we currently allow only one tracer registered at a time */ - if (ftrace_graph_active) { - ret = -EBUSY; - goto out; - } - - register_pm_notifier(&ftrace_suspend_notifier); - - ftrace_graph_active++; - ret = start_graph_tracing(); - if (ret) { - ftrace_graph_active--; - goto out; - } - - ftrace_graph_return = retfunc; - - /* - * Update the indirect function to the entryfunc, and the - * function that gets called to the entry_test first. Then - * call the update fgraph entry function to determine if - * the entryfunc should be called directly or not. - */ - __ftrace_graph_entry = entryfunc; - ftrace_graph_entry = ftrace_graph_entry_test; - update_function_graph_func(); - - ret = ftrace_startup(&graph_ops, FTRACE_START_FUNC_RET); -out: - mutex_unlock(&ftrace_lock); - return ret; -} - -void unregister_ftrace_graph(void) -{ - mutex_lock(&ftrace_lock); - - if (unlikely(!ftrace_graph_active)) - goto out; - - ftrace_graph_active--; - ftrace_graph_return = (trace_func_graph_ret_t)ftrace_stub; - ftrace_graph_entry = ftrace_graph_entry_stub; - __ftrace_graph_entry = ftrace_graph_entry_stub; - ftrace_shutdown(&graph_ops, FTRACE_STOP_FUNC_RET); - unregister_pm_notifier(&ftrace_suspend_notifier); - unregister_trace_sched_switch(ftrace_graph_probe_sched_switch, NULL); - - out: - mutex_unlock(&ftrace_lock); -} - -static DEFINE_PER_CPU(struct ftrace_ret_stack *, idle_ret_stack); - -static void -graph_init_task(struct task_struct *t, struct ftrace_ret_stack *ret_stack) -{ - atomic_set(&t->tracing_graph_pause, 0); - atomic_set(&t->trace_overrun, 0); - t->ftrace_timestamp = 0; - /* make curr_ret_stack visible before we add the ret_stack */ - smp_wmb(); - t->ret_stack = ret_stack; -} - -/* - * Allocate a return stack for the idle task. May be the first - * time through, or it may be done by CPU hotplug online. - */ -void ftrace_graph_init_idle_task(struct task_struct *t, int cpu) -{ - t->curr_ret_stack = -1; - t->curr_ret_depth = -1; - /* - * The idle task has no parent, it either has its own - * stack or no stack at all. - */ - if (t->ret_stack) - WARN_ON(t->ret_stack != per_cpu(idle_ret_stack, cpu)); - - if (ftrace_graph_active) { - struct ftrace_ret_stack *ret_stack; - - ret_stack = per_cpu(idle_ret_stack, cpu); - if (!ret_stack) { - ret_stack = - kmalloc_array(FTRACE_RETFUNC_DEPTH, - sizeof(struct ftrace_ret_stack), - GFP_KERNEL); - if (!ret_stack) - return; - per_cpu(idle_ret_stack, cpu) = ret_stack; - } - graph_init_task(t, ret_stack); - } -} - -/* Allocate a return stack for newly created task */ -void ftrace_graph_init_task(struct task_struct *t) -{ - /* Make sure we do not use the parent ret_stack */ - t->ret_stack = NULL; - t->curr_ret_stack = -1; - t->curr_ret_depth = -1; - - if (ftrace_graph_active) { - struct ftrace_ret_stack *ret_stack; - - ret_stack = kmalloc_array(FTRACE_RETFUNC_DEPTH, - sizeof(struct ftrace_ret_stack), - GFP_KERNEL); - if (!ret_stack) - return; - graph_init_task(t, ret_stack); - } -} - -void ftrace_graph_exit_task(struct task_struct *t) -{ - struct ftrace_ret_stack *ret_stack = t->ret_stack; - - t->ret_stack = NULL; - /* NULL must become visible to IRQs before we free it: */ - barrier(); - - kfree(ret_stack); -} -#endif -- cgit v1.3-7-g2ca7 From 317e04ca905ac6c4b33cb879e9a107c412125f14 Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Wed, 28 Nov 2018 10:26:27 -0500 Subject: tracing: Rearrange functions in trace_sched_wakeup.c Rearrange the functions in trace_sched_wakeup.c so that there are fewer #ifdef CONFIG_FUNCTION_TRACER and #ifdef CONFIG_FUNCTION_GRAPH_TRACER, instead of having the #ifdefs spread all over. No functional change is made. Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_sched_wakeup.c | 272 ++++++++++++++++++-------------------- 1 file changed, 130 insertions(+), 142 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace_sched_wakeup.c b/kernel/trace/trace_sched_wakeup.c index 7d04b9890755..2ce78100b4d3 100644 --- a/kernel/trace/trace_sched_wakeup.c +++ b/kernel/trace/trace_sched_wakeup.c @@ -35,26 +35,19 @@ static arch_spinlock_t wakeup_lock = static void wakeup_reset(struct trace_array *tr); static void __wakeup_reset(struct trace_array *tr); +static int start_func_tracer(struct trace_array *tr, int graph); +static void stop_func_tracer(struct trace_array *tr, int graph); static int save_flags; #ifdef CONFIG_FUNCTION_GRAPH_TRACER -static int wakeup_display_graph(struct trace_array *tr, int set); # define is_graph(tr) ((tr)->trace_flags & TRACE_ITER_DISPLAY_GRAPH) #else -static inline int wakeup_display_graph(struct trace_array *tr, int set) -{ - return 0; -} # define is_graph(tr) false #endif - #ifdef CONFIG_FUNCTION_TRACER -static int wakeup_graph_entry(struct ftrace_graph_ent *trace); -static void wakeup_graph_return(struct ftrace_graph_ret *trace); - static bool function_enabled; /* @@ -104,122 +97,8 @@ out_enable: return 0; } -/* - * wakeup uses its own tracer function to keep the overhead down: - */ -static void -wakeup_tracer_call(unsigned long ip, unsigned long parent_ip, - struct ftrace_ops *op, struct pt_regs *pt_regs) -{ - struct trace_array *tr = wakeup_trace; - struct trace_array_cpu *data; - unsigned long flags; - int pc; - - if (!func_prolog_preempt_disable(tr, &data, &pc)) - return; - - local_irq_save(flags); - trace_function(tr, ip, parent_ip, flags, pc); - local_irq_restore(flags); - - atomic_dec(&data->disabled); - preempt_enable_notrace(); -} - -static int register_wakeup_function(struct trace_array *tr, int graph, int set) -{ - int ret; - - /* 'set' is set if TRACE_ITER_FUNCTION is about to be set */ - if (function_enabled || (!set && !(tr->trace_flags & TRACE_ITER_FUNCTION))) - return 0; - - if (graph) - ret = register_ftrace_graph(&wakeup_graph_return, - &wakeup_graph_entry); - else - ret = register_ftrace_function(tr->ops); - - if (!ret) - function_enabled = true; - - return ret; -} - -static void unregister_wakeup_function(struct trace_array *tr, int graph) -{ - if (!function_enabled) - return; - - if (graph) - unregister_ftrace_graph(); - else - unregister_ftrace_function(tr->ops); - - function_enabled = false; -} - -static int wakeup_function_set(struct trace_array *tr, u32 mask, int set) -{ - if (!(mask & TRACE_ITER_FUNCTION)) - return 0; - - if (set) - register_wakeup_function(tr, is_graph(tr), 1); - else - unregister_wakeup_function(tr, is_graph(tr)); - return 1; -} -#else -static int register_wakeup_function(struct trace_array *tr, int graph, int set) -{ - return 0; -} -static void unregister_wakeup_function(struct trace_array *tr, int graph) { } -static int wakeup_function_set(struct trace_array *tr, u32 mask, int set) -{ - return 0; -} -#endif /* CONFIG_FUNCTION_TRACER */ - -static int wakeup_flag_changed(struct trace_array *tr, u32 mask, int set) -{ - struct tracer *tracer = tr->current_trace; - - if (wakeup_function_set(tr, mask, set)) - return 0; - #ifdef CONFIG_FUNCTION_GRAPH_TRACER - if (mask & TRACE_ITER_DISPLAY_GRAPH) - return wakeup_display_graph(tr, set); -#endif - - return trace_keep_overwrite(tracer, mask, set); -} -static int start_func_tracer(struct trace_array *tr, int graph) -{ - int ret; - - ret = register_wakeup_function(tr, graph, 0); - - if (!ret && tracing_is_enabled()) - tracer_enabled = 1; - else - tracer_enabled = 0; - - return ret; -} - -static void stop_func_tracer(struct trace_array *tr, int graph) -{ - tracer_enabled = 0; - - unregister_wakeup_function(tr, graph); -} - -#ifdef CONFIG_FUNCTION_GRAPH_TRACER static int wakeup_display_graph(struct trace_array *tr, int set) { if (!(is_graph(tr) ^ set)) @@ -318,20 +197,94 @@ static void wakeup_print_header(struct seq_file *s) else trace_default_header(s); } +#else /* CONFIG_FUNCTION_GRAPH_TRACER */ +static int wakeup_graph_entry(struct ftrace_graph_ent *trace) +{ + return -1; +} +static void wakeup_graph_return(struct ftrace_graph_ret *trace) { } +#endif /* else CONFIG_FUNCTION_GRAPH_TRACER */ +/* + * wakeup uses its own tracer function to keep the overhead down: + */ static void -__trace_function(struct trace_array *tr, - unsigned long ip, unsigned long parent_ip, - unsigned long flags, int pc) +wakeup_tracer_call(unsigned long ip, unsigned long parent_ip, + struct ftrace_ops *op, struct pt_regs *pt_regs) { - if (is_graph(tr)) - trace_graph_function(tr, ip, parent_ip, flags, pc); + struct trace_array *tr = wakeup_trace; + struct trace_array_cpu *data; + unsigned long flags; + int pc; + + if (!func_prolog_preempt_disable(tr, &data, &pc)) + return; + + local_irq_save(flags); + trace_function(tr, ip, parent_ip, flags, pc); + local_irq_restore(flags); + + atomic_dec(&data->disabled); + preempt_enable_notrace(); +} + +static int register_wakeup_function(struct trace_array *tr, int graph, int set) +{ + int ret; + + /* 'set' is set if TRACE_ITER_FUNCTION is about to be set */ + if (function_enabled || (!set && !(tr->trace_flags & TRACE_ITER_FUNCTION))) + return 0; + + if (graph) + ret = register_ftrace_graph(&wakeup_graph_return, + &wakeup_graph_entry); else - trace_function(tr, ip, parent_ip, flags, pc); + ret = register_ftrace_function(tr->ops); + + if (!ret) + function_enabled = true; + + return ret; } -#else -#define __trace_function trace_function +static void unregister_wakeup_function(struct trace_array *tr, int graph) +{ + if (!function_enabled) + return; + + if (graph) + unregister_ftrace_graph(); + else + unregister_ftrace_function(tr->ops); + + function_enabled = false; +} + +static int wakeup_function_set(struct trace_array *tr, u32 mask, int set) +{ + if (!(mask & TRACE_ITER_FUNCTION)) + return 0; + + if (set) + register_wakeup_function(tr, is_graph(tr), 1); + else + unregister_wakeup_function(tr, is_graph(tr)); + return 1; +} +#else /* CONFIG_FUNCTION_TRACER */ +static int register_wakeup_function(struct trace_array *tr, int graph, int set) +{ + return 0; +} +static void unregister_wakeup_function(struct trace_array *tr, int graph) { } +static int wakeup_function_set(struct trace_array *tr, u32 mask, int set) +{ + return 0; +} +#endif /* else CONFIG_FUNCTION_TRACER */ + +#ifndef CONFIG_FUNCTION_GRAPH_TRACER static enum print_line_t wakeup_print_line(struct trace_iterator *iter) { return TRACE_TYPE_UNHANDLED; @@ -340,23 +293,58 @@ static enum print_line_t wakeup_print_line(struct trace_iterator *iter) static void wakeup_trace_open(struct trace_iterator *iter) { } static void wakeup_trace_close(struct trace_iterator *iter) { } -#ifdef CONFIG_FUNCTION_TRACER -static int wakeup_graph_entry(struct ftrace_graph_ent *trace) -{ - return -1; -} -static void wakeup_graph_return(struct ftrace_graph_ret *trace) { } static void wakeup_print_header(struct seq_file *s) { trace_default_header(s); } -#else -static void wakeup_print_header(struct seq_file *s) +#endif /* !CONFIG_FUNCTION_GRAPH_TRACER */ + +static void +__trace_function(struct trace_array *tr, + unsigned long ip, unsigned long parent_ip, + unsigned long flags, int pc) +{ + if (is_graph(tr)) + trace_graph_function(tr, ip, parent_ip, flags, pc); + else + trace_function(tr, ip, parent_ip, flags, pc); +} + +static int wakeup_flag_changed(struct trace_array *tr, u32 mask, int set) { - trace_latency_header(s); + struct tracer *tracer = tr->current_trace; + + if (wakeup_function_set(tr, mask, set)) + return 0; + +#ifdef CONFIG_FUNCTION_GRAPH_TRACER + if (mask & TRACE_ITER_DISPLAY_GRAPH) + return wakeup_display_graph(tr, set); +#endif + + return trace_keep_overwrite(tracer, mask, set); +} + +static int start_func_tracer(struct trace_array *tr, int graph) +{ + int ret; + + ret = register_wakeup_function(tr, graph, 0); + + if (!ret && tracing_is_enabled()) + tracer_enabled = 1; + else + tracer_enabled = 0; + + return ret; +} + +static void stop_func_tracer(struct trace_array *tr, int graph) +{ + tracer_enabled = 0; + + unregister_wakeup_function(tr, graph); } -#endif /* CONFIG_FUNCTION_TRACER */ -#endif /* CONFIG_FUNCTION_GRAPH_TRACER */ /* * Should this new latency be reported/recorded? -- cgit v1.3-7-g2ca7 From 688f7089d8851b1a81106f0c0b9b29181b2f2dc8 Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Thu, 15 Nov 2018 14:06:47 -0500 Subject: fgraph: Add new fgraph_ops structure to enable function graph hooks Currently the registering of function graph is to pass in a entry and return function. We need to have a way to associate those functions together where the entry can determine to run the return hook. Having a structure that contains both functions will facilitate the process of converting the code to be able to do such. This is similar to the way function hooks are enabled (it passes in ftrace_ops). Instead of passing in the functions to use, a single structure is passed in to the registering function. The unregister function is now passed in the fgraph_ops handle. When we allow more than one callback to the function graph hooks, this will let the system know which one to remove. Reviewed-by: Joel Fernandes (Google) Signed-off-by: Steven Rostedt (VMware) --- include/linux/ftrace.h | 21 +++++++++++---------- kernel/trace/fgraph.c | 9 ++++----- kernel/trace/ftrace.c | 10 +++++++--- kernel/trace/trace_functions_graph.c | 21 ++++++++++++++++----- kernel/trace/trace_irqsoff.c | 18 +++++++----------- kernel/trace/trace_sched_wakeup.c | 16 +++++++--------- kernel/trace/trace_selftest.c | 8 ++++++-- 7 files changed, 58 insertions(+), 45 deletions(-) (limited to 'kernel') diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h index 98625f10d982..21c80491ccde 100644 --- a/include/linux/ftrace.h +++ b/include/linux/ftrace.h @@ -749,6 +749,11 @@ typedef int (*trace_func_graph_ent_t)(struct ftrace_graph_ent *); /* entry */ #ifdef CONFIG_FUNCTION_GRAPH_TRACER +struct fgraph_ops { + trace_func_graph_ent_t entryfunc; + trace_func_graph_ret_t retfunc; +}; + /* * Stack of return addresses for functions * of a thread. @@ -792,8 +797,9 @@ unsigned long ftrace_graph_ret_addr(struct task_struct *task, int *idx, #define FTRACE_RETFUNC_DEPTH 50 #define FTRACE_RETSTACK_ALLOC_SIZE 32 -extern int register_ftrace_graph(trace_func_graph_ret_t retfunc, - trace_func_graph_ent_t entryfunc); + +extern int register_ftrace_graph(struct fgraph_ops *ops); +extern void unregister_ftrace_graph(struct fgraph_ops *ops); extern bool ftrace_graph_is_dead(void); extern void ftrace_graph_stop(void); @@ -802,8 +808,6 @@ extern void ftrace_graph_stop(void); extern trace_func_graph_ret_t ftrace_graph_return; extern trace_func_graph_ent_t ftrace_graph_entry; -extern void unregister_ftrace_graph(void); - extern void ftrace_graph_init_task(struct task_struct *t); extern void ftrace_graph_exit_task(struct task_struct *t); extern void ftrace_graph_init_idle_task(struct task_struct *t, int cpu); @@ -825,12 +829,9 @@ static inline void ftrace_graph_init_task(struct task_struct *t) { } static inline void ftrace_graph_exit_task(struct task_struct *t) { } static inline void ftrace_graph_init_idle_task(struct task_struct *t, int cpu) { } -static inline int register_ftrace_graph(trace_func_graph_ret_t retfunc, - trace_func_graph_ent_t entryfunc) -{ - return -1; -} -static inline void unregister_ftrace_graph(void) { } +/* Define as macros as fgraph_ops may not be defined */ +#define register_ftrace_graph(ops) ({ -1; }) +#define unregister_ftrace_graph(ops) do { } while (0) static inline unsigned long ftrace_graph_ret_addr(struct task_struct *task, int *idx, unsigned long ret, diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c index 374f3e42e29e..cc35606e9a3e 100644 --- a/kernel/trace/fgraph.c +++ b/kernel/trace/fgraph.c @@ -490,8 +490,7 @@ static int start_graph_tracing(void) return ret; } -int register_ftrace_graph(trace_func_graph_ret_t retfunc, - trace_func_graph_ent_t entryfunc) +int register_ftrace_graph(struct fgraph_ops *gops) { int ret = 0; @@ -512,7 +511,7 @@ int register_ftrace_graph(trace_func_graph_ret_t retfunc, goto out; } - ftrace_graph_return = retfunc; + ftrace_graph_return = gops->retfunc; /* * Update the indirect function to the entryfunc, and the @@ -520,7 +519,7 @@ int register_ftrace_graph(trace_func_graph_ret_t retfunc, * call the update fgraph entry function to determine if * the entryfunc should be called directly or not. */ - __ftrace_graph_entry = entryfunc; + __ftrace_graph_entry = gops->entryfunc; ftrace_graph_entry = ftrace_graph_entry_test; update_function_graph_func(); @@ -530,7 +529,7 @@ out: return ret; } -void unregister_ftrace_graph(void) +void unregister_ftrace_graph(struct fgraph_ops *gops) { mutex_lock(&ftrace_lock); diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index c53533b833cf..d06fe588e650 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -849,15 +849,19 @@ static void profile_graph_return(struct ftrace_graph_ret *trace) local_irq_restore(flags); } +static struct fgraph_ops fprofiler_ops = { + .entryfunc = &profile_graph_entry, + .retfunc = &profile_graph_return, +}; + static int register_ftrace_profiler(void) { - return register_ftrace_graph(&profile_graph_return, - &profile_graph_entry); + return register_ftrace_graph(&fprofiler_ops); } static void unregister_ftrace_profiler(void) { - unregister_ftrace_graph(); + unregister_ftrace_graph(&fprofiler_ops); } #else static struct ftrace_ops ftrace_profile_ops __read_mostly = { diff --git a/kernel/trace/trace_functions_graph.c b/kernel/trace/trace_functions_graph.c index 855c13c61e77..140b4b51ab34 100644 --- a/kernel/trace/trace_functions_graph.c +++ b/kernel/trace/trace_functions_graph.c @@ -345,17 +345,25 @@ static void trace_graph_thresh_return(struct ftrace_graph_ret *trace) trace_graph_return(trace); } +static struct fgraph_ops funcgraph_thresh_ops = { + .entryfunc = &trace_graph_entry, + .retfunc = &trace_graph_thresh_return, +}; + +static struct fgraph_ops funcgraph_ops = { + .entryfunc = &trace_graph_entry, + .retfunc = &trace_graph_return, +}; + static int graph_trace_init(struct trace_array *tr) { int ret; set_graph_array(tr); if (tracing_thresh) - ret = register_ftrace_graph(&trace_graph_thresh_return, - &trace_graph_entry); + ret = register_ftrace_graph(&funcgraph_thresh_ops); else - ret = register_ftrace_graph(&trace_graph_return, - &trace_graph_entry); + ret = register_ftrace_graph(&funcgraph_ops); if (ret) return ret; tracing_start_cmdline_record(); @@ -366,7 +374,10 @@ static int graph_trace_init(struct trace_array *tr) static void graph_trace_reset(struct trace_array *tr) { tracing_stop_cmdline_record(); - unregister_ftrace_graph(); + if (tracing_thresh) + unregister_ftrace_graph(&funcgraph_thresh_ops); + else + unregister_ftrace_graph(&funcgraph_ops); } static int graph_trace_update_thresh(struct trace_array *tr) diff --git a/kernel/trace/trace_irqsoff.c b/kernel/trace/trace_irqsoff.c index 98ea6d28df15..d3294721f119 100644 --- a/kernel/trace/trace_irqsoff.c +++ b/kernel/trace/trace_irqsoff.c @@ -218,6 +218,11 @@ static void irqsoff_graph_return(struct ftrace_graph_ret *trace) atomic_dec(&data->disabled); } +static struct fgraph_ops fgraph_ops = { + .entryfunc = &irqsoff_graph_entry, + .retfunc = &irqsoff_graph_return, +}; + static void irqsoff_trace_open(struct trace_iterator *iter) { if (is_graph(iter->tr)) @@ -272,13 +277,6 @@ __trace_function(struct trace_array *tr, #else #define __trace_function trace_function -#ifdef CONFIG_FUNCTION_TRACER -static int irqsoff_graph_entry(struct ftrace_graph_ent *trace) -{ - return -1; -} -#endif - static enum print_line_t irqsoff_print_line(struct trace_iterator *iter) { return TRACE_TYPE_UNHANDLED; @@ -288,7 +286,6 @@ static void irqsoff_trace_open(struct trace_iterator *iter) { } static void irqsoff_trace_close(struct trace_iterator *iter) { } #ifdef CONFIG_FUNCTION_TRACER -static void irqsoff_graph_return(struct ftrace_graph_ret *trace) { } static void irqsoff_print_header(struct seq_file *s) { trace_default_header(s); @@ -468,8 +465,7 @@ static int register_irqsoff_function(struct trace_array *tr, int graph, int set) return 0; if (graph) - ret = register_ftrace_graph(&irqsoff_graph_return, - &irqsoff_graph_entry); + ret = register_ftrace_graph(&fgraph_ops); else ret = register_ftrace_function(tr->ops); @@ -485,7 +481,7 @@ static void unregister_irqsoff_function(struct trace_array *tr, int graph) return; if (graph) - unregister_ftrace_graph(); + unregister_ftrace_graph(&fgraph_ops); else unregister_ftrace_function(tr->ops); diff --git a/kernel/trace/trace_sched_wakeup.c b/kernel/trace/trace_sched_wakeup.c index 2ce78100b4d3..4ea7e6845efb 100644 --- a/kernel/trace/trace_sched_wakeup.c +++ b/kernel/trace/trace_sched_wakeup.c @@ -162,6 +162,11 @@ static void wakeup_graph_return(struct ftrace_graph_ret *trace) return; } +static struct fgraph_ops fgraph_wakeup_ops = { + .entryfunc = &wakeup_graph_entry, + .retfunc = &wakeup_graph_return, +}; + static void wakeup_trace_open(struct trace_iterator *iter) { if (is_graph(iter->tr)) @@ -197,12 +202,6 @@ static void wakeup_print_header(struct seq_file *s) else trace_default_header(s); } -#else /* CONFIG_FUNCTION_GRAPH_TRACER */ -static int wakeup_graph_entry(struct ftrace_graph_ent *trace) -{ - return -1; -} -static void wakeup_graph_return(struct ftrace_graph_ret *trace) { } #endif /* else CONFIG_FUNCTION_GRAPH_TRACER */ /* @@ -237,8 +236,7 @@ static int register_wakeup_function(struct trace_array *tr, int graph, int set) return 0; if (graph) - ret = register_ftrace_graph(&wakeup_graph_return, - &wakeup_graph_entry); + ret = register_ftrace_graph(&fgraph_wakeup_ops); else ret = register_ftrace_function(tr->ops); @@ -254,7 +252,7 @@ static void unregister_wakeup_function(struct trace_array *tr, int graph) return; if (graph) - unregister_ftrace_graph(); + unregister_ftrace_graph(&fgraph_wakeup_ops); else unregister_ftrace_function(tr->ops); diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c index 11e9daa4a568..9d402e7fc949 100644 --- a/kernel/trace/trace_selftest.c +++ b/kernel/trace/trace_selftest.c @@ -741,6 +741,11 @@ static int trace_graph_entry_watchdog(struct ftrace_graph_ent *trace) return trace_graph_entry(trace); } +static struct fgraph_ops fgraph_ops __initdata = { + .entryfunc = &trace_graph_entry_watchdog, + .retfunc = &trace_graph_return, +}; + /* * Pretty much the same than for the function tracer from which the selftest * has been borrowed. @@ -765,8 +770,7 @@ trace_selftest_startup_function_graph(struct tracer *trace, */ tracing_reset_online_cpus(&tr->trace_buffer); set_graph_array(tr); - ret = register_ftrace_graph(&trace_graph_return, - &trace_graph_entry_watchdog); + ret = register_ftrace_graph(&fgraph_ops); if (ret) { warn_failed_init_tracer(trace, ret); goto out; -- cgit v1.3-7-g2ca7 From 76b42b63ed0d004961097d3a3cd979129d4afd26 Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Sun, 18 Nov 2018 18:36:19 -0500 Subject: function_graph: Move ftrace_graph_ret_addr() to fgraph.c Move the function function_graph_ret_addr() to fgraph.c, as the management of the curr_ret_stack is going to change, and all the accesses to ret_stack needs to be done in fgraph.c. Reviewed-by: Joel Fernandes (Google) Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/fgraph.c | 55 ++++++++++++++++++++++++++++++++++++ kernel/trace/trace_functions_graph.c | 55 ------------------------------------ 2 files changed, 55 insertions(+), 55 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c index cc35606e9a3e..90fcefcaff2a 100644 --- a/kernel/trace/fgraph.c +++ b/kernel/trace/fgraph.c @@ -232,6 +232,61 @@ unsigned long ftrace_return_to_handler(unsigned long frame_pointer) return ret; } +/** + * ftrace_graph_ret_addr - convert a potentially modified stack return address + * to its original value + * + * This function can be called by stack unwinding code to convert a found stack + * return address ('ret') to its original value, in case the function graph + * tracer has modified it to be 'return_to_handler'. If the address hasn't + * been modified, the unchanged value of 'ret' is returned. + * + * 'idx' is a state variable which should be initialized by the caller to zero + * before the first call. + * + * 'retp' is a pointer to the return address on the stack. It's ignored if + * the arch doesn't have HAVE_FUNCTION_GRAPH_RET_ADDR_PTR defined. + */ +#ifdef HAVE_FUNCTION_GRAPH_RET_ADDR_PTR +unsigned long ftrace_graph_ret_addr(struct task_struct *task, int *idx, + unsigned long ret, unsigned long *retp) +{ + int index = task->curr_ret_stack; + int i; + + if (ret != (unsigned long)return_to_handler) + return ret; + + if (index < 0) + return ret; + + for (i = 0; i <= index; i++) + if (task->ret_stack[i].retp == retp) + return task->ret_stack[i].ret; + + return ret; +} +#else /* !HAVE_FUNCTION_GRAPH_RET_ADDR_PTR */ +unsigned long ftrace_graph_ret_addr(struct task_struct *task, int *idx, + unsigned long ret, unsigned long *retp) +{ + int task_idx; + + if (ret != (unsigned long)return_to_handler) + return ret; + + task_idx = task->curr_ret_stack; + + if (!task->ret_stack || task_idx < *idx) + return ret; + + task_idx -= *idx; + (*idx)++; + + return task->ret_stack[task_idx].ret; +} +#endif /* HAVE_FUNCTION_GRAPH_RET_ADDR_PTR */ + static struct ftrace_ops graph_ops = { .func = ftrace_stub, .flags = FTRACE_OPS_FL_RECURSION_SAFE | diff --git a/kernel/trace/trace_functions_graph.c b/kernel/trace/trace_functions_graph.c index 140b4b51ab34..c2af1560e856 100644 --- a/kernel/trace/trace_functions_graph.c +++ b/kernel/trace/trace_functions_graph.c @@ -94,61 +94,6 @@ static void print_graph_duration(struct trace_array *tr, unsigned long long duration, struct trace_seq *s, u32 flags); -/** - * ftrace_graph_ret_addr - convert a potentially modified stack return address - * to its original value - * - * This function can be called by stack unwinding code to convert a found stack - * return address ('ret') to its original value, in case the function graph - * tracer has modified it to be 'return_to_handler'. If the address hasn't - * been modified, the unchanged value of 'ret' is returned. - * - * 'idx' is a state variable which should be initialized by the caller to zero - * before the first call. - * - * 'retp' is a pointer to the return address on the stack. It's ignored if - * the arch doesn't have HAVE_FUNCTION_GRAPH_RET_ADDR_PTR defined. - */ -#ifdef HAVE_FUNCTION_GRAPH_RET_ADDR_PTR -unsigned long ftrace_graph_ret_addr(struct task_struct *task, int *idx, - unsigned long ret, unsigned long *retp) -{ - int index = task->curr_ret_stack; - int i; - - if (ret != (unsigned long)return_to_handler) - return ret; - - if (index < 0) - return ret; - - for (i = 0; i <= index; i++) - if (task->ret_stack[i].retp == retp) - return task->ret_stack[i].ret; - - return ret; -} -#else /* !HAVE_FUNCTION_GRAPH_RET_ADDR_PTR */ -unsigned long ftrace_graph_ret_addr(struct task_struct *task, int *idx, - unsigned long ret, unsigned long *retp) -{ - int task_idx; - - if (ret != (unsigned long)return_to_handler) - return ret; - - task_idx = task->curr_ret_stack; - - if (!task->ret_stack || task_idx < *idx) - return ret; - - task_idx -= *idx; - (*idx)++; - - return task->ret_stack[task_idx].ret; -} -#endif /* HAVE_FUNCTION_GRAPH_RET_ADDR_PTR */ - int __trace_graph_entry(struct trace_array *tr, struct ftrace_graph_ent *trace, unsigned long flags, -- cgit v1.3-7-g2ca7 From b0e21a61d3196762b61f43ae994ffd255f646774 Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Mon, 19 Nov 2018 20:54:08 -0500 Subject: function_graph: Have profiler use new helper ftrace_graph_get_ret_stack() The ret_stack processing is going to change, and that is going to break anything that is accessing the ret_stack directly. One user is the function graph profiler. By using the ftrace_graph_get_ret_stack() helper function, the profiler can access the ret_stack entry without relying on the implementation details of the stack itself. Reviewed-by: Joel Fernandes (Google) Signed-off-by: Steven Rostedt (VMware) --- include/linux/ftrace.h | 3 +++ kernel/trace/fgraph.c | 11 +++++++++++ kernel/trace/ftrace.c | 21 +++++++++++---------- 3 files changed, 25 insertions(+), 10 deletions(-) (limited to 'kernel') diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h index 21c80491ccde..98e141c71ad0 100644 --- a/include/linux/ftrace.h +++ b/include/linux/ftrace.h @@ -785,6 +785,9 @@ extern int function_graph_enter(unsigned long ret, unsigned long func, unsigned long frame_pointer, unsigned long *retp); +struct ftrace_ret_stack * +ftrace_graph_get_ret_stack(struct task_struct *task, int idx); + unsigned long ftrace_graph_ret_addr(struct task_struct *task, int *idx, unsigned long ret, unsigned long *retp); diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c index 90fcefcaff2a..a3704ec8b599 100644 --- a/kernel/trace/fgraph.c +++ b/kernel/trace/fgraph.c @@ -232,6 +232,17 @@ unsigned long ftrace_return_to_handler(unsigned long frame_pointer) return ret; } +struct ftrace_ret_stack * +ftrace_graph_get_ret_stack(struct task_struct *task, int idx) +{ + idx = current->curr_ret_stack - idx; + + if (idx >= 0 && idx <= task->curr_ret_stack) + return ¤t->ret_stack[idx]; + + return NULL; +} + /** * ftrace_graph_ret_addr - convert a potentially modified stack return address * to its original value diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index d06fe588e650..8ef9fc226037 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -792,7 +792,7 @@ void ftrace_graph_graph_time_control(bool enable) static int profile_graph_entry(struct ftrace_graph_ent *trace) { - int index = current->curr_ret_stack; + struct ftrace_ret_stack *ret_stack; function_profile_call(trace->func, 0, NULL, NULL); @@ -800,14 +800,16 @@ static int profile_graph_entry(struct ftrace_graph_ent *trace) if (!current->ret_stack) return 0; - if (index >= 0 && index < FTRACE_RETFUNC_DEPTH) - current->ret_stack[index].subtime = 0; + ret_stack = ftrace_graph_get_ret_stack(current, 0); + if (ret_stack) + ret_stack->subtime = 0; return 1; } static void profile_graph_return(struct ftrace_graph_ret *trace) { + struct ftrace_ret_stack *ret_stack; struct ftrace_profile_stat *stat; unsigned long long calltime; struct ftrace_profile *rec; @@ -825,16 +827,15 @@ static void profile_graph_return(struct ftrace_graph_ret *trace) calltime = trace->rettime - trace->calltime; if (!fgraph_graph_time) { - int index; - - index = current->curr_ret_stack; /* Append this call time to the parent time to subtract */ - if (index) - current->ret_stack[index - 1].subtime += calltime; + ret_stack = ftrace_graph_get_ret_stack(current, 1); + if (ret_stack) + ret_stack->subtime += calltime; - if (current->ret_stack[index].subtime < calltime) - calltime -= current->ret_stack[index].subtime; + ret_stack = ftrace_graph_get_ret_stack(current, 0); + if (ret_stack && ret_stack->subtime < calltime) + calltime -= ret_stack->subtime; else calltime = 0; } -- cgit v1.3-7-g2ca7 From ca16b0fbb05242f18da9d810c07d3882ffed831c Mon Sep 17 00:00:00 2001 From: Dan Carpenter Date: Wed, 20 Jun 2018 14:08:00 +0300 Subject: tracing: Have trace_stack nr_entries compare not be so subtle Dan Carpenter reviewed the trace_stack.c code and figured he found an off by one bug. "From reviewing the code, it seems possible for stack_trace_max.nr_entries to be set to .max_entries and in that case we would be reading one element beyond the end of the stack_dump_trace[] array. If it's not set to .max_entries then the bug doesn't affect runtime." Although it looks to be the case, it is not. Because we have: static unsigned long stack_dump_trace[STACK_TRACE_ENTRIES+1] = { [0 ... (STACK_TRACE_ENTRIES)] = ULONG_MAX }; struct stack_trace stack_trace_max = { .max_entries = STACK_TRACE_ENTRIES - 1, .entries = &stack_dump_trace[0], }; And: stack_trace_max.nr_entries = x; for (; x < i; x++) stack_dump_trace[x] = ULONG_MAX; Even if nr_entries equals max_entries, indexing with it into the stack_dump_trace[] array will not overflow the array. But if it is the case, the second part of the conditional that tests stack_dump_trace[nr_entries] to ULONG_MAX will always be true. By applying Dan's patch, it removes the subtle aspect of it and makes the if conditional slightly more efficient. Link: http://lkml.kernel.org/r/20180620110758.crunhd5bfep7zuiz@kili.mountain Signed-off-by: Dan Carpenter Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_stack.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/trace/trace_stack.c b/kernel/trace/trace_stack.c index 2b0d1ee3241c..e2a153fc1afc 100644 --- a/kernel/trace/trace_stack.c +++ b/kernel/trace/trace_stack.c @@ -286,7 +286,7 @@ __next(struct seq_file *m, loff_t *pos) { long n = *pos - 1; - if (n > stack_trace_max.nr_entries || stack_dump_trace[n] == ULONG_MAX) + if (n >= stack_trace_max.nr_entries || stack_dump_trace[n] == ULONG_MAX) return NULL; m->private = (void *)n; -- cgit v1.3-7-g2ca7 From 2c2b0a78b373908926e4683ea5571332f63c0eb5 Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Thu, 29 Nov 2018 20:32:26 -0500 Subject: ring-buffer: Add percentage of ring buffer full to wake up reader Instead of just waiting for a page to be full before waking up a pending reader, allow the reader to pass in a "percentage" of pages that have content before waking up a reader. This should help keep the process of reading the events not cause wake ups that constantly cause reading of the buffer. Signed-off-by: Steven Rostedt (VMware) --- include/linux/ring_buffer.h | 4 ++- kernel/trace/ring_buffer.c | 71 +++++++++++++++++++++++++++++++++++++++++---- kernel/trace/trace.c | 8 ++--- 3 files changed, 73 insertions(+), 10 deletions(-) (limited to 'kernel') diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h index 0940fda59872..5b9ae62272bb 100644 --- a/include/linux/ring_buffer.h +++ b/include/linux/ring_buffer.h @@ -97,7 +97,7 @@ __ring_buffer_alloc(unsigned long size, unsigned flags, struct lock_class_key *k __ring_buffer_alloc((size), (flags), &__key); \ }) -int ring_buffer_wait(struct ring_buffer *buffer, int cpu, bool full); +int ring_buffer_wait(struct ring_buffer *buffer, int cpu, int full); __poll_t ring_buffer_poll_wait(struct ring_buffer *buffer, int cpu, struct file *filp, poll_table *poll_table); @@ -189,6 +189,8 @@ bool ring_buffer_time_stamp_abs(struct ring_buffer *buffer); size_t ring_buffer_page_len(void *page); +size_t ring_buffer_nr_pages(struct ring_buffer *buffer, int cpu); +size_t ring_buffer_nr_dirty_pages(struct ring_buffer *buffer, int cpu); void *ring_buffer_alloc_read_page(struct ring_buffer *buffer, int cpu); void ring_buffer_free_read_page(struct ring_buffer *buffer, int cpu, void *data); diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 65bd4616220d..9edb628603ab 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -487,6 +487,9 @@ struct ring_buffer_per_cpu { local_t dropped_events; local_t committing; local_t commits; + local_t pages_touched; + local_t pages_read; + size_t shortest_full; unsigned long read; unsigned long read_bytes; u64 write_stamp; @@ -529,6 +532,41 @@ struct ring_buffer_iter { u64 read_stamp; }; +/** + * ring_buffer_nr_pages - get the number of buffer pages in the ring buffer + * @buffer: The ring_buffer to get the number of pages from + * @cpu: The cpu of the ring_buffer to get the number of pages from + * + * Returns the number of pages used by a per_cpu buffer of the ring buffer. + */ +size_t ring_buffer_nr_pages(struct ring_buffer *buffer, int cpu) +{ + return buffer->buffers[cpu]->nr_pages; +} + +/** + * ring_buffer_nr_pages_dirty - get the number of used pages in the ring buffer + * @buffer: The ring_buffer to get the number of pages from + * @cpu: The cpu of the ring_buffer to get the number of pages from + * + * Returns the number of pages that have content in the ring buffer. + */ +size_t ring_buffer_nr_dirty_pages(struct ring_buffer *buffer, int cpu) +{ + size_t read; + size_t cnt; + + read = local_read(&buffer->buffers[cpu]->pages_read); + cnt = local_read(&buffer->buffers[cpu]->pages_touched); + /* The reader can read an empty page, but not more than that */ + if (cnt < read) { + WARN_ON_ONCE(read > cnt + 1); + return 0; + } + + return cnt - read; +} + /* * rb_wake_up_waiters - wake up tasks waiting for ring buffer input * @@ -556,7 +594,7 @@ static void rb_wake_up_waiters(struct irq_work *work) * as data is added to any of the @buffer's cpu buffers. Otherwise * it will wait for data to be added to a specific cpu buffer. */ -int ring_buffer_wait(struct ring_buffer *buffer, int cpu, bool full) +int ring_buffer_wait(struct ring_buffer *buffer, int cpu, int full) { struct ring_buffer_per_cpu *uninitialized_var(cpu_buffer); DEFINE_WAIT(wait); @@ -571,7 +609,7 @@ int ring_buffer_wait(struct ring_buffer *buffer, int cpu, bool full) if (cpu == RING_BUFFER_ALL_CPUS) { work = &buffer->irq_work; /* Full only makes sense on per cpu reads */ - full = false; + full = 0; } else { if (!cpumask_test_cpu(cpu, buffer->cpumask)) return -ENODEV; @@ -623,15 +661,22 @@ int ring_buffer_wait(struct ring_buffer *buffer, int cpu, bool full) !ring_buffer_empty_cpu(buffer, cpu)) { unsigned long flags; bool pagebusy; + size_t nr_pages; + size_t dirty; if (!full) break; raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags); pagebusy = cpu_buffer->reader_page == cpu_buffer->commit_page; + nr_pages = cpu_buffer->nr_pages; + dirty = ring_buffer_nr_dirty_pages(buffer, cpu); + if (!cpu_buffer->shortest_full || + cpu_buffer->shortest_full < full) + cpu_buffer->shortest_full = full; raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags); - - if (!pagebusy) + if (!pagebusy && + (!nr_pages || (dirty * 100) > full * nr_pages)) break; } @@ -1054,6 +1099,7 @@ static void rb_tail_page_update(struct ring_buffer_per_cpu *cpu_buffer, old_write = local_add_return(RB_WRITE_INTCNT, &next_page->write); old_entries = local_add_return(RB_WRITE_INTCNT, &next_page->entries); + local_inc(&cpu_buffer->pages_touched); /* * Just make sure we have seen our old_write and synchronize * with any interrupts that come in. @@ -2603,6 +2649,16 @@ rb_wakeups(struct ring_buffer *buffer, struct ring_buffer_per_cpu *cpu_buffer) pagebusy = cpu_buffer->reader_page == cpu_buffer->commit_page; if (!pagebusy && cpu_buffer->irq_work.full_waiters_pending) { + size_t nr_pages; + size_t dirty; + size_t full; + + full = cpu_buffer->shortest_full; + nr_pages = cpu_buffer->nr_pages; + dirty = ring_buffer_nr_dirty_pages(buffer, cpu_buffer->cpu); + if (full && nr_pages && (dirty * 100) <= full * nr_pages) + return; + cpu_buffer->irq_work.wakeup_full = true; cpu_buffer->irq_work.full_waiters_pending = false; /* irq_work_queue() supplies it's own memory barriers */ @@ -3732,13 +3788,15 @@ rb_get_reader_page(struct ring_buffer_per_cpu *cpu_buffer) goto spin; /* - * Yeah! We succeeded in replacing the page. + * Yay! We succeeded in replacing the page. * * Now make the new head point back to the reader page. */ rb_list_head(reader->list.next)->prev = &cpu_buffer->reader_page->list; rb_inc_page(cpu_buffer, &cpu_buffer->head_page); + local_inc(&cpu_buffer->pages_read); + /* Finally update the reader page to the new head */ cpu_buffer->reader_page = reader; cpu_buffer->reader_page->read = 0; @@ -4334,6 +4392,9 @@ rb_reset_cpu(struct ring_buffer_per_cpu *cpu_buffer) local_set(&cpu_buffer->entries, 0); local_set(&cpu_buffer->committing, 0); local_set(&cpu_buffer->commits, 0); + local_set(&cpu_buffer->pages_touched, 0); + local_set(&cpu_buffer->pages_read, 0); + cpu_buffer->shortest_full = 0; cpu_buffer->read = 0; cpu_buffer->read_bytes = 0; diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index ff1c4b20cd0a..48d5eb22ff33 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -1431,7 +1431,7 @@ update_max_tr_single(struct trace_array *tr, struct task_struct *tsk, int cpu) } #endif /* CONFIG_TRACER_MAX_TRACE */ -static int wait_on_pipe(struct trace_iterator *iter, bool full) +static int wait_on_pipe(struct trace_iterator *iter, int full) { /* Iterators are static, they should be filled or empty */ if (trace_buffer_iter(iter, iter->cpu_file)) @@ -5693,7 +5693,7 @@ static int tracing_wait_pipe(struct file *filp) mutex_unlock(&iter->mutex); - ret = wait_on_pipe(iter, false); + ret = wait_on_pipe(iter, 0); mutex_lock(&iter->mutex); @@ -6751,7 +6751,7 @@ tracing_buffers_read(struct file *filp, char __user *ubuf, if ((filp->f_flags & O_NONBLOCK)) return -EAGAIN; - ret = wait_on_pipe(iter, false); + ret = wait_on_pipe(iter, 0); if (ret) return ret; @@ -6948,7 +6948,7 @@ tracing_buffers_splice_read(struct file *file, loff_t *ppos, if ((file->f_flags & O_NONBLOCK) || (flags & SPLICE_F_NONBLOCK)) goto out; - ret = wait_on_pipe(iter, true); + ret = wait_on_pipe(iter, 1); if (ret) goto out; -- cgit v1.3-7-g2ca7 From 03329f9939781041424edd29487b9603a704fcd9 Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Thu, 29 Nov 2018 21:38:42 -0500 Subject: tracing: Add tracefs file buffer_percentage Add a "buffer_percentage" file, that allows users to specify how much of the buffer (percentage of pages) need to be filled before waking up a task blocked on a per cpu trace_pipe_raw file. Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/ring_buffer.c | 39 +++++++++++++++++++-------------- kernel/trace/trace.c | 54 +++++++++++++++++++++++++++++++++++++++++++++- kernel/trace/trace.h | 1 + 3 files changed, 77 insertions(+), 17 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 9edb628603ab..5434c16f2192 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -489,6 +489,7 @@ struct ring_buffer_per_cpu { local_t commits; local_t pages_touched; local_t pages_read; + long last_pages_touch; size_t shortest_full; unsigned long read; unsigned long read_bytes; @@ -2632,7 +2633,9 @@ static void rb_commit(struct ring_buffer_per_cpu *cpu_buffer, static __always_inline void rb_wakeups(struct ring_buffer *buffer, struct ring_buffer_per_cpu *cpu_buffer) { - bool pagebusy; + size_t nr_pages; + size_t dirty; + size_t full; if (buffer->irq_work.waiters_pending) { buffer->irq_work.waiters_pending = false; @@ -2646,24 +2649,27 @@ rb_wakeups(struct ring_buffer *buffer, struct ring_buffer_per_cpu *cpu_buffer) irq_work_queue(&cpu_buffer->irq_work.work); } - pagebusy = cpu_buffer->reader_page == cpu_buffer->commit_page; + if (cpu_buffer->last_pages_touch == local_read(&cpu_buffer->pages_touched)) + return; - if (!pagebusy && cpu_buffer->irq_work.full_waiters_pending) { - size_t nr_pages; - size_t dirty; - size_t full; + if (cpu_buffer->reader_page == cpu_buffer->commit_page) + return; - full = cpu_buffer->shortest_full; - nr_pages = cpu_buffer->nr_pages; - dirty = ring_buffer_nr_dirty_pages(buffer, cpu_buffer->cpu); - if (full && nr_pages && (dirty * 100) <= full * nr_pages) - return; + if (!cpu_buffer->irq_work.full_waiters_pending) + return; - cpu_buffer->irq_work.wakeup_full = true; - cpu_buffer->irq_work.full_waiters_pending = false; - /* irq_work_queue() supplies it's own memory barriers */ - irq_work_queue(&cpu_buffer->irq_work.work); - } + cpu_buffer->last_pages_touch = local_read(&cpu_buffer->pages_touched); + + full = cpu_buffer->shortest_full; + nr_pages = cpu_buffer->nr_pages; + dirty = ring_buffer_nr_dirty_pages(buffer, cpu_buffer->cpu); + if (full && nr_pages && (dirty * 100) <= full * nr_pages) + return; + + cpu_buffer->irq_work.wakeup_full = true; + cpu_buffer->irq_work.full_waiters_pending = false; + /* irq_work_queue() supplies it's own memory barriers */ + irq_work_queue(&cpu_buffer->irq_work.work); } /* @@ -4394,6 +4400,7 @@ rb_reset_cpu(struct ring_buffer_per_cpu *cpu_buffer) local_set(&cpu_buffer->commits, 0); local_set(&cpu_buffer->pages_touched, 0); local_set(&cpu_buffer->pages_read, 0); + cpu_buffer->last_pages_touch = 0; cpu_buffer->shortest_full = 0; cpu_buffer->read = 0; cpu_buffer->read_bytes = 0; diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 48d5eb22ff33..d382fd1aa4a6 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -6948,7 +6948,7 @@ tracing_buffers_splice_read(struct file *file, loff_t *ppos, if ((file->f_flags & O_NONBLOCK) || (flags & SPLICE_F_NONBLOCK)) goto out; - ret = wait_on_pipe(iter, 1); + ret = wait_on_pipe(iter, iter->tr->buffer_percent); if (ret) goto out; @@ -7662,6 +7662,53 @@ static const struct file_operations rb_simple_fops = { .llseek = default_llseek, }; +static ssize_t +buffer_percent_read(struct file *filp, char __user *ubuf, + size_t cnt, loff_t *ppos) +{ + struct trace_array *tr = filp->private_data; + char buf[64]; + int r; + + r = tr->buffer_percent; + r = sprintf(buf, "%d\n", r); + + return simple_read_from_buffer(ubuf, cnt, ppos, buf, r); +} + +static ssize_t +buffer_percent_write(struct file *filp, const char __user *ubuf, + size_t cnt, loff_t *ppos) +{ + struct trace_array *tr = filp->private_data; + unsigned long val; + int ret; + + ret = kstrtoul_from_user(ubuf, cnt, 10, &val); + if (ret) + return ret; + + if (val > 100) + return -EINVAL; + + if (!val) + val = 1; + + tr->buffer_percent = val; + + (*ppos)++; + + return cnt; +} + +static const struct file_operations buffer_percent_fops = { + .open = tracing_open_generic_tr, + .read = buffer_percent_read, + .write = buffer_percent_write, + .release = tracing_release_generic_tr, + .llseek = default_llseek, +}; + struct dentry *trace_instance_dir; static void @@ -7970,6 +8017,11 @@ init_tracer_tracefs(struct trace_array *tr, struct dentry *d_tracer) trace_create_file("timestamp_mode", 0444, d_tracer, tr, &trace_time_stamp_mode_fops); + tr->buffer_percent = 1; + + trace_create_file("buffer_percent", 0444, d_tracer, + tr, &buffer_percent_fops); + create_trace_options_dir(tr); #if defined(CONFIG_TRACER_MAX_TRACE) || defined(CONFIG_HWLAT_TRACER) diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h index ab16eca76e59..08900828d282 100644 --- a/kernel/trace/trace.h +++ b/kernel/trace/trace.h @@ -247,6 +247,7 @@ struct trace_array { int clock_id; int nr_topts; bool clear_trace; + int buffer_percent; struct tracer *current_trace; unsigned int trace_flags; unsigned char trace_flags_index[TRACE_FLAGS_MAX_SIZE]; -- cgit v1.3-7-g2ca7 From a7b1d74e872a4299e1aef66a648316c9c23e5ab4 Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Thu, 29 Nov 2018 22:36:47 -0500 Subject: tracing: Change default buffer_percent to 50 After running several tests, it appears that having the reader wait till half the buffer is full before starting to read (and causing its own events to fill up the ring buffer constantly), works well. It keeps trace-cmd (the main user of this interface) from dominating the traces it records. Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index d382fd1aa4a6..194c01838e3f 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -8017,7 +8017,7 @@ init_tracer_tracefs(struct trace_array *tr, struct dentry *d_tracer) trace_create_file("timestamp_mode", 0444, d_tracer, tr, &trace_time_stamp_mode_fops); - tr->buffer_percent = 1; + tr->buffer_percent = 50; trace_create_file("buffer_percent", 0444, d_tracer, tr, &buffer_percent_fops); -- cgit v1.3-7-g2ca7 From 547cd9eacd1c699c8d1fa77c95c6cdb33b2eba6a Mon Sep 17 00:00:00 2001 From: Masami Hiramatsu Date: Mon, 5 Nov 2018 18:00:15 +0900 Subject: tracing/uprobes: Add busy check when cleanup all uprobes Add a busy check loop in cleanup_all_probes() before trying to remove all events in uprobe_events, the same way that kprobe_events does. Without this change, writing null to uprobe_events will try to remove events but if one of them is enabled, it will stop there leaving some events cleared and others not clceared. With this change, writing null to uprobe_events makes sure all events are not enabled before removing events. So, it clears all events, or returns an error (-EBUSY) with keeping all events. Link: http://lkml.kernel.org/r/154140841557.17322.12653952888762532401.stgit@devbox Reviewed-by: Tom Zanussi Tested-by: Tom Zanussi Signed-off-by: Masami Hiramatsu Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_uprobe.c | 7 +++++++ 1 file changed, 7 insertions(+) (limited to 'kernel') diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c index 31ea48eceda1..b708e4ff7ea7 100644 --- a/kernel/trace/trace_uprobe.c +++ b/kernel/trace/trace_uprobe.c @@ -587,12 +587,19 @@ static int cleanup_all_probes(void) int ret = 0; mutex_lock(&uprobe_lock); + /* Ensure no probe is in use. */ + list_for_each_entry(tu, &uprobe_list, list) + if (trace_probe_is_enabled(&tu->tp)) { + ret = -EBUSY; + goto end; + } while (!list_empty(&uprobe_list)) { tu = list_entry(uprobe_list.next, struct trace_uprobe, list); ret = unregister_trace_uprobe(tu); if (ret) break; } +end: mutex_unlock(&uprobe_lock); return ret; } -- cgit v1.3-7-g2ca7 From fc800a10be26017f8f338bc8e500d48e3e6429d9 Mon Sep 17 00:00:00 2001 From: Masami Hiramatsu Date: Mon, 5 Nov 2018 18:00:43 +0900 Subject: tracing: Lock event_mutex before synth_event_mutex synthetic event is using synth_event_mutex for protecting synth_event_list, and event_trigger_write() path acquires locks as below order. event_trigger_write(event_mutex) ->trigger_process_regex(trigger_cmd_mutex) ->event_hist_trigger_func(synth_event_mutex) On the other hand, synthetic event creation and deletion paths call trace_add_event_call() and trace_remove_event_call() which acquires event_mutex. In that case, if we keep the synth_event_mutex locked while registering/unregistering synthetic events, its dependency will be inversed. To avoid this issue, current synthetic event is using a 2 phase process to create/delete events. For example, it searches existing events under synth_event_mutex to check for event-name conflicts, and unlocks synth_event_mutex, then registers a new event under event_mutex locked. Finally, it locks synth_event_mutex and tries to add the new event to the list. But it can introduce complexity and a chance for name conflicts. To solve this simpler, this introduces trace_add_event_call_nolock() and trace_remove_event_call_nolock() which don't acquire event_mutex inside. synthetic event can lock event_mutex before synth_event_mutex to solve the lock dependency issue simpler. Link: http://lkml.kernel.org/r/154140844377.17322.13781091165954002713.stgit@devbox Reviewed-by: Tom Zanussi Tested-by: Tom Zanussi Signed-off-by: Masami Hiramatsu Signed-off-by: Steven Rostedt (VMware) --- include/linux/trace_events.h | 2 ++ kernel/trace/trace_events.c | 34 ++++++++++++++++++++++++++++------ kernel/trace/trace_events_hist.c | 24 ++++++++++-------------- 3 files changed, 40 insertions(+), 20 deletions(-) (limited to 'kernel') diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h index 4130a5497d40..3aa05593a53f 100644 --- a/include/linux/trace_events.h +++ b/include/linux/trace_events.h @@ -529,6 +529,8 @@ extern int trace_event_raw_init(struct trace_event_call *call); extern int trace_define_field(struct trace_event_call *call, const char *type, const char *name, int offset, int size, int is_signed, int filter_type); +extern int trace_add_event_call_nolock(struct trace_event_call *call); +extern int trace_remove_event_call_nolock(struct trace_event_call *call); extern int trace_add_event_call(struct trace_event_call *call); extern int trace_remove_event_call(struct trace_event_call *call); extern int trace_event_get_offsets(struct trace_event_call *call); diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c index f94be0c2827b..a3b157f689ee 100644 --- a/kernel/trace/trace_events.c +++ b/kernel/trace/trace_events.c @@ -2305,11 +2305,11 @@ __trace_early_add_new_event(struct trace_event_call *call, struct ftrace_module_file_ops; static void __add_event_to_tracers(struct trace_event_call *call); -/* Add an additional event_call dynamically */ -int trace_add_event_call(struct trace_event_call *call) +int trace_add_event_call_nolock(struct trace_event_call *call) { int ret; - mutex_lock(&event_mutex); + lockdep_assert_held(&event_mutex); + mutex_lock(&trace_types_lock); ret = __register_event(call, NULL); @@ -2317,6 +2317,16 @@ int trace_add_event_call(struct trace_event_call *call) __add_event_to_tracers(call); mutex_unlock(&trace_types_lock); + return ret; +} + +/* Add an additional event_call dynamically */ +int trace_add_event_call(struct trace_event_call *call) +{ + int ret; + + mutex_lock(&event_mutex); + ret = trace_add_event_call_nolock(call); mutex_unlock(&event_mutex); return ret; } @@ -2366,17 +2376,29 @@ static int probe_remove_event_call(struct trace_event_call *call) return 0; } -/* Remove an event_call */ -int trace_remove_event_call(struct trace_event_call *call) +/* no event_mutex version */ +int trace_remove_event_call_nolock(struct trace_event_call *call) { int ret; - mutex_lock(&event_mutex); + lockdep_assert_held(&event_mutex); + mutex_lock(&trace_types_lock); down_write(&trace_event_sem); ret = probe_remove_event_call(call); up_write(&trace_event_sem); mutex_unlock(&trace_types_lock); + + return ret; +} + +/* Remove an event_call */ +int trace_remove_event_call(struct trace_event_call *call) +{ + int ret; + + mutex_lock(&event_mutex); + ret = trace_remove_event_call_nolock(call); mutex_unlock(&event_mutex); return ret; diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c index eb908ef2ecec..1670c65389fe 100644 --- a/kernel/trace/trace_events_hist.c +++ b/kernel/trace/trace_events_hist.c @@ -912,7 +912,7 @@ static int register_synth_event(struct synth_event *event) call->data = event; call->tp = event->tp; - ret = trace_add_event_call(call); + ret = trace_add_event_call_nolock(call); if (ret) { pr_warn("Failed to register synthetic event: %s\n", trace_event_name(call)); @@ -936,7 +936,7 @@ static int unregister_synth_event(struct synth_event *event) struct trace_event_call *call = &event->call; int ret; - ret = trace_remove_event_call(call); + ret = trace_remove_event_call_nolock(call); return ret; } @@ -1013,12 +1013,10 @@ static void add_or_delete_synth_event(struct synth_event *event, int delete) if (delete) free_synth_event(event); else { - mutex_lock(&synth_event_mutex); if (!find_synth_event(event->name)) list_add(&event->list, &synth_event_list); else free_synth_event(event); - mutex_unlock(&synth_event_mutex); } } @@ -1030,6 +1028,7 @@ static int create_synth_event(int argc, char **argv) int i, consumed = 0, n_fields = 0, ret = 0; char *name; + mutex_lock(&event_mutex); mutex_lock(&synth_event_mutex); /* @@ -1102,8 +1101,6 @@ static int create_synth_event(int argc, char **argv) goto err; } out: - mutex_unlock(&synth_event_mutex); - if (event) { if (delete_event) { ret = unregister_synth_event(event); @@ -1113,10 +1110,13 @@ static int create_synth_event(int argc, char **argv) add_or_delete_synth_event(event, ret); } } + mutex_unlock(&synth_event_mutex); + mutex_unlock(&event_mutex); return ret; err: mutex_unlock(&synth_event_mutex); + mutex_unlock(&event_mutex); for (i = 0; i < n_fields; i++) free_synth_field(fields[i]); @@ -1127,12 +1127,10 @@ static int create_synth_event(int argc, char **argv) static int release_all_synth_events(void) { - struct list_head release_events; struct synth_event *event, *e; int ret = 0; - INIT_LIST_HEAD(&release_events); - + mutex_lock(&event_mutex); mutex_lock(&synth_event_mutex); list_for_each_entry(event, &synth_event_list, list) { @@ -1142,16 +1140,14 @@ static int release_all_synth_events(void) } } - list_splice_init(&event->list, &release_events); - - mutex_unlock(&synth_event_mutex); - - list_for_each_entry_safe(event, e, &release_events, list) { + list_for_each_entry_safe(event, e, &synth_event_list, list) { list_del(&event->list); ret = unregister_synth_event(event); add_or_delete_synth_event(event, !ret); } + mutex_unlock(&synth_event_mutex); + mutex_unlock(&event_mutex); return ret; } -- cgit v1.3-7-g2ca7 From faacb361f271be4baf2d807e2eeaba87e059225f Mon Sep 17 00:00:00 2001 From: Masami Hiramatsu Date: Mon, 5 Nov 2018 18:01:12 +0900 Subject: tracing: Simplify creation and deletion of synthetic events Since the event_mutex and synth_event_mutex ordering issue is gone, we can skip existing event check when adding or deleting events, and some redundant code in error path. This changes release_all_synth_events() to abort the process when it hits any error and returns the error code. It succeeds only if it has no error. Link: http://lkml.kernel.org/r/154140847194.17322.17960275728005067803.stgit@devbox Reviewed-by: Tom Zanussi Tested-by: Tom Zanussi Signed-off-by: Masami Hiramatsu Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_events_hist.c | 53 ++++++++++++++-------------------------- 1 file changed, 18 insertions(+), 35 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c index 1670c65389fe..0feb7f460123 100644 --- a/kernel/trace/trace_events_hist.c +++ b/kernel/trace/trace_events_hist.c @@ -1008,18 +1008,6 @@ struct hist_var_data { struct hist_trigger_data *hist_data; }; -static void add_or_delete_synth_event(struct synth_event *event, int delete) -{ - if (delete) - free_synth_event(event); - else { - if (!find_synth_event(event->name)) - list_add(&event->list, &synth_event_list); - else - free_synth_event(event); - } -} - static int create_synth_event(int argc, char **argv) { struct synth_field *field, *fields[SYNTH_FIELDS_MAX]; @@ -1052,15 +1040,16 @@ static int create_synth_event(int argc, char **argv) if (event) { if (delete_event) { if (event->ref) { - event = NULL; ret = -EBUSY; goto out; } - list_del(&event->list); - goto out; - } - event = NULL; - ret = -EEXIST; + ret = unregister_synth_event(event); + if (!ret) { + list_del(&event->list); + free_synth_event(event); + } + } else + ret = -EEXIST; goto out; } else if (delete_event) { ret = -ENOENT; @@ -1100,29 +1089,21 @@ static int create_synth_event(int argc, char **argv) event = NULL; goto err; } + ret = register_synth_event(event); + if (!ret) + list_add(&event->list, &synth_event_list); + else + free_synth_event(event); out: - if (event) { - if (delete_event) { - ret = unregister_synth_event(event); - add_or_delete_synth_event(event, !ret); - } else { - ret = register_synth_event(event); - add_or_delete_synth_event(event, ret); - } - } mutex_unlock(&synth_event_mutex); mutex_unlock(&event_mutex); return ret; err: - mutex_unlock(&synth_event_mutex); - mutex_unlock(&event_mutex); - for (i = 0; i < n_fields; i++) free_synth_field(fields[i]); - free_synth_event(event); - return ret; + goto out; } static int release_all_synth_events(void) @@ -1141,10 +1122,12 @@ static int release_all_synth_events(void) } list_for_each_entry_safe(event, e, &synth_event_list, list) { - list_del(&event->list); - ret = unregister_synth_event(event); - add_or_delete_synth_event(event, !ret); + if (!ret) { + list_del(&event->list); + free_synth_event(event); + } else + break; } mutex_unlock(&synth_event_mutex); mutex_unlock(&event_mutex); -- cgit v1.3-7-g2ca7 From d00bbea9456f35fb34310d454e561f05d68d07fe Mon Sep 17 00:00:00 2001 From: Masami Hiramatsu Date: Mon, 5 Nov 2018 18:01:40 +0900 Subject: tracing: Integrate similar probe argument parsers Integrate similar argument parsers for kprobes and uprobes events into traceprobe_parse_probe_arg(). Link: http://lkml.kernel.org/r/154140850016.17322.9836787731210512176.stgit@devbox Reviewed-by: Tom Zanussi Tested-by: Tom Zanussi Signed-off-by: Masami Hiramatsu Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_kprobe.c | 48 ++------------------------------------------- kernel/trace/trace_probe.c | 47 +++++++++++++++++++++++++++++++++++++++++--- kernel/trace/trace_probe.h | 7 ++----- kernel/trace/trace_uprobe.c | 44 ++--------------------------------------- 4 files changed, 50 insertions(+), 96 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c index fec67188c4d2..d313bcc259dc 100644 --- a/kernel/trace/trace_kprobe.c +++ b/kernel/trace/trace_kprobe.c @@ -548,7 +548,6 @@ static int create_trace_kprobe(int argc, char **argv) bool is_return = false, is_delete = false; char *symbol = NULL, *event = NULL, *group = NULL; int maxactive = 0; - char *arg; long offset = 0; void *addr = NULL; char buf[MAX_EVENT_NAME_LEN]; @@ -676,53 +675,10 @@ static int create_trace_kprobe(int argc, char **argv) } /* parse arguments */ - ret = 0; for (i = 0; i < argc && i < MAX_TRACE_ARGS; i++) { - struct probe_arg *parg = &tk->tp.args[i]; - - /* Increment count for freeing args in error case */ - tk->tp.nr_args++; - - /* Parse argument name */ - arg = strchr(argv[i], '='); - if (arg) { - *arg++ = '\0'; - parg->name = kstrdup(argv[i], GFP_KERNEL); - } else { - arg = argv[i]; - /* If argument name is omitted, set "argN" */ - snprintf(buf, MAX_EVENT_NAME_LEN, "arg%d", i + 1); - parg->name = kstrdup(buf, GFP_KERNEL); - } - - if (!parg->name) { - pr_info("Failed to allocate argument[%d] name.\n", i); - ret = -ENOMEM; - goto error; - } - - if (!is_good_name(parg->name)) { - pr_info("Invalid argument[%d] name: %s\n", - i, parg->name); - ret = -EINVAL; - goto error; - } - - if (traceprobe_conflict_field_name(parg->name, - tk->tp.args, i)) { - pr_info("Argument[%d] name '%s' conflicts with " - "another field.\n", i, argv[i]); - ret = -EINVAL; - goto error; - } - - /* Parse fetch argument */ - ret = traceprobe_parse_probe_arg(arg, &tk->tp.size, parg, - flags); - if (ret) { - pr_info("Parse error at argument[%d]. (%d)\n", i, ret); + ret = traceprobe_parse_probe_arg(&tk->tp, i, argv[i], flags); + if (ret) goto error; - } } ret = register_trace_kprobe(tk); diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c index bd30e9398d2a..449150c6a87f 100644 --- a/kernel/trace/trace_probe.c +++ b/kernel/trace/trace_probe.c @@ -348,7 +348,7 @@ static int __parse_bitfield_probe_arg(const char *bf, } /* String length checking wrapper */ -int traceprobe_parse_probe_arg(char *arg, ssize_t *size, +static int traceprobe_parse_probe_arg_body(char *arg, ssize_t *size, struct probe_arg *parg, unsigned int flags) { struct fetch_insn *code, *scode, *tmp = NULL; @@ -491,8 +491,8 @@ fail: } /* Return 1 if name is reserved or already used by another argument */ -int traceprobe_conflict_field_name(const char *name, - struct probe_arg *args, int narg) +static int traceprobe_conflict_field_name(const char *name, + struct probe_arg *args, int narg) { int i; @@ -507,6 +507,47 @@ int traceprobe_conflict_field_name(const char *name, return 0; } +int traceprobe_parse_probe_arg(struct trace_probe *tp, int i, char *arg, + unsigned int flags) +{ + struct probe_arg *parg = &tp->args[i]; + char *body; + int ret; + + /* Increment count for freeing args in error case */ + tp->nr_args++; + + body = strchr(arg, '='); + if (body) { + parg->name = kmemdup_nul(arg, body - arg, GFP_KERNEL); + body++; + } else { + /* If argument name is omitted, set "argN" */ + parg->name = kasprintf(GFP_KERNEL, "arg%d", i + 1); + body = arg; + } + if (!parg->name) + return -ENOMEM; + + if (!is_good_name(parg->name)) { + pr_info("Invalid argument[%d] name: %s\n", + i, parg->name); + return -EINVAL; + } + + if (traceprobe_conflict_field_name(parg->name, tp->args, i)) { + pr_info("Argument[%d]: '%s' conflicts with another field.\n", + i, parg->name); + return -EINVAL; + } + + /* Parse fetch argument */ + ret = traceprobe_parse_probe_arg_body(body, &tp->size, parg, flags); + if (ret) + pr_info("Parse error at argument[%d]. (%d)\n", i, ret); + return ret; +} + void traceprobe_free_probe_arg(struct probe_arg *arg) { struct fetch_insn *code = arg->code; diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h index 974afc1a3e73..feeec261b356 100644 --- a/kernel/trace/trace_probe.h +++ b/kernel/trace/trace_probe.h @@ -272,11 +272,8 @@ find_event_file_link(struct trace_probe *tp, struct trace_event_file *file) #define TPARG_FL_FENTRY BIT(2) #define TPARG_FL_MASK GENMASK(2, 0) -extern int traceprobe_parse_probe_arg(char *arg, ssize_t *size, - struct probe_arg *parg, unsigned int flags); - -extern int traceprobe_conflict_field_name(const char *name, - struct probe_arg *args, int narg); +extern int traceprobe_parse_probe_arg(struct trace_probe *tp, int i, + char *arg, unsigned int flags); extern int traceprobe_update_arg(struct probe_arg *arg); extern void traceprobe_free_probe_arg(struct probe_arg *arg); diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c index b708e4ff7ea7..6eaaa2150685 100644 --- a/kernel/trace/trace_uprobe.c +++ b/kernel/trace/trace_uprobe.c @@ -517,51 +517,11 @@ static int create_trace_uprobe(int argc, char **argv) } /* parse arguments */ - ret = 0; for (i = 0; i < argc && i < MAX_TRACE_ARGS; i++) { - struct probe_arg *parg = &tu->tp.args[i]; - - /* Increment count for freeing args in error case */ - tu->tp.nr_args++; - - /* Parse argument name */ - arg = strchr(argv[i], '='); - if (arg) { - *arg++ = '\0'; - parg->name = kstrdup(argv[i], GFP_KERNEL); - } else { - arg = argv[i]; - /* If argument name is omitted, set "argN" */ - snprintf(buf, MAX_EVENT_NAME_LEN, "arg%d", i + 1); - parg->name = kstrdup(buf, GFP_KERNEL); - } - - if (!parg->name) { - pr_info("Failed to allocate argument[%d] name.\n", i); - ret = -ENOMEM; - goto error; - } - - if (!is_good_name(parg->name)) { - pr_info("Invalid argument[%d] name: %s\n", i, parg->name); - ret = -EINVAL; - goto error; - } - - if (traceprobe_conflict_field_name(parg->name, tu->tp.args, i)) { - pr_info("Argument[%d] name '%s' conflicts with " - "another field.\n", i, argv[i]); - ret = -EINVAL; - goto error; - } - - /* Parse fetch argument */ - ret = traceprobe_parse_probe_arg(arg, &tu->tp.size, parg, + ret = traceprobe_parse_probe_arg(&tu->tp, i, argv[i], is_return ? TPARG_FL_RETURN : 0); - if (ret) { - pr_info("Parse error at argument[%d]. (%d)\n", i, ret); + if (ret) goto error; - } } ret = register_trace_uprobe(tu); -- cgit v1.3-7-g2ca7 From 5448d44c38557fc15d1c53b608a9c9f0e1ca8f86 Mon Sep 17 00:00:00 2001 From: Masami Hiramatsu Date: Mon, 5 Nov 2018 18:02:08 +0900 Subject: tracing: Add unified dynamic event framework Add unified dynamic event framework for ftrace kprobes, uprobes and synthetic events. Those dynamic events can be co-exist on same file because those syntax doesn't overlap. This introduces a framework part which provides a unified tracefs interface and operations. Link: http://lkml.kernel.org/r/154140852824.17322.12250362185969352095.stgit@devbox Reviewed-by: Tom Zanussi Tested-by: Tom Zanussi Signed-off-by: Masami Hiramatsu Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/Kconfig | 3 + kernel/trace/Makefile | 1 + kernel/trace/trace.c | 4 + kernel/trace/trace_dynevent.c | 210 ++++++++++++++++++++++++++++++++++++++++++ kernel/trace/trace_dynevent.h | 119 ++++++++++++++++++++++++ 5 files changed, 337 insertions(+) create mode 100644 kernel/trace/trace_dynevent.c create mode 100644 kernel/trace/trace_dynevent.h (limited to 'kernel') diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig index 5e3de28c7677..bf2e8a5a91f1 100644 --- a/kernel/trace/Kconfig +++ b/kernel/trace/Kconfig @@ -518,6 +518,9 @@ config BPF_EVENTS help This allows the user to attach BPF programs to kprobe events. +config DYNAMIC_EVENTS + def_bool n + config PROBE_EVENTS def_bool n diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile index c7ade7965464..c2b2148bb1d2 100644 --- a/kernel/trace/Makefile +++ b/kernel/trace/Makefile @@ -79,6 +79,7 @@ endif ifeq ($(CONFIG_TRACING),y) obj-$(CONFIG_KGDB_KDB) += trace_kdb.o endif +obj-$(CONFIG_DYNAMIC_EVENTS) += trace_dynevent.o obj-$(CONFIG_PROBE_EVENTS) += trace_probe.o obj-$(CONFIG_UPROBE_EVENTS) += trace_uprobe.o diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 194c01838e3f..7e0332f90ed4 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -4604,6 +4604,10 @@ static const char readme_msg[] = "\t\t\t traces\n" #endif #endif /* CONFIG_STACK_TRACER */ +#ifdef CONFIG_DYNAMIC_EVENTS + " dynamic_events\t\t- Add/remove/show the generic dynamic events\n" + "\t\t\t Write into this file to define/undefine new trace events.\n" +#endif #ifdef CONFIG_KPROBE_EVENTS " kprobe_events\t\t- Add/remove/show the kernel dynamic events\n" "\t\t\t Write into this file to define/undefine new trace events.\n" diff --git a/kernel/trace/trace_dynevent.c b/kernel/trace/trace_dynevent.c new file mode 100644 index 000000000000..f17a887abb66 --- /dev/null +++ b/kernel/trace/trace_dynevent.c @@ -0,0 +1,210 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Generic dynamic event control interface + * + * Copyright (C) 2018 Masami Hiramatsu + */ + +#include +#include +#include +#include +#include +#include + +#include "trace.h" +#include "trace_dynevent.h" + +static DEFINE_MUTEX(dyn_event_ops_mutex); +static LIST_HEAD(dyn_event_ops_list); + +int dyn_event_register(struct dyn_event_operations *ops) +{ + if (!ops || !ops->create || !ops->show || !ops->is_busy || + !ops->free || !ops->match) + return -EINVAL; + + INIT_LIST_HEAD(&ops->list); + mutex_lock(&dyn_event_ops_mutex); + list_add_tail(&ops->list, &dyn_event_ops_list); + mutex_unlock(&dyn_event_ops_mutex); + return 0; +} + +int dyn_event_release(int argc, char **argv, struct dyn_event_operations *type) +{ + struct dyn_event *pos, *n; + char *system = NULL, *event, *p; + int ret = -ENOENT; + + if (argv[0][1] != ':') + return -EINVAL; + + event = &argv[0][2]; + p = strchr(event, '/'); + if (p) { + system = event; + event = p + 1; + *p = '\0'; + } + if (event[0] == '\0') + return -EINVAL; + + mutex_lock(&event_mutex); + for_each_dyn_event_safe(pos, n) { + if (type && type != pos->ops) + continue; + if (pos->ops->match(system, event, pos)) { + ret = pos->ops->free(pos); + break; + } + } + mutex_unlock(&event_mutex); + + return ret; +} + +static int create_dyn_event(int argc, char **argv) +{ + struct dyn_event_operations *ops; + int ret; + + if (argv[0][0] == '-') + return dyn_event_release(argc, argv, NULL); + + mutex_lock(&dyn_event_ops_mutex); + list_for_each_entry(ops, &dyn_event_ops_list, list) { + ret = ops->create(argc, (const char **)argv); + if (!ret || ret != -ECANCELED) + break; + } + mutex_unlock(&dyn_event_ops_mutex); + if (ret == -ECANCELED) + ret = -EINVAL; + + return ret; +} + +/* Protected by event_mutex */ +LIST_HEAD(dyn_event_list); + +void *dyn_event_seq_start(struct seq_file *m, loff_t *pos) +{ + mutex_lock(&event_mutex); + return seq_list_start(&dyn_event_list, *pos); +} + +void *dyn_event_seq_next(struct seq_file *m, void *v, loff_t *pos) +{ + return seq_list_next(v, &dyn_event_list, pos); +} + +void dyn_event_seq_stop(struct seq_file *m, void *v) +{ + mutex_unlock(&event_mutex); +} + +static int dyn_event_seq_show(struct seq_file *m, void *v) +{ + struct dyn_event *ev = v; + + if (ev && ev->ops) + return ev->ops->show(m, ev); + + return 0; +} + +static const struct seq_operations dyn_event_seq_op = { + .start = dyn_event_seq_start, + .next = dyn_event_seq_next, + .stop = dyn_event_seq_stop, + .show = dyn_event_seq_show +}; + +/* + * dyn_events_release_all - Release all specific events + * @type: the dyn_event_operations * which filters releasing events + * + * This releases all events which ->ops matches @type. If @type is NULL, + * all events are released. + * Return -EBUSY if any of them are in use, and return other errors when + * it failed to free the given event. Except for -EBUSY, event releasing + * process will be aborted at that point and there may be some other + * releasable events on the list. + */ +int dyn_events_release_all(struct dyn_event_operations *type) +{ + struct dyn_event *ev, *tmp; + int ret = 0; + + mutex_lock(&event_mutex); + for_each_dyn_event(ev) { + if (type && ev->ops != type) + continue; + if (ev->ops->is_busy(ev)) { + ret = -EBUSY; + goto out; + } + } + for_each_dyn_event_safe(ev, tmp) { + if (type && ev->ops != type) + continue; + ret = ev->ops->free(ev); + if (ret) + break; + } +out: + mutex_unlock(&event_mutex); + + return ret; +} + +static int dyn_event_open(struct inode *inode, struct file *file) +{ + int ret; + + if ((file->f_mode & FMODE_WRITE) && (file->f_flags & O_TRUNC)) { + ret = dyn_events_release_all(NULL); + if (ret < 0) + return ret; + } + + return seq_open(file, &dyn_event_seq_op); +} + +static ssize_t dyn_event_write(struct file *file, const char __user *buffer, + size_t count, loff_t *ppos) +{ + return trace_parse_run_command(file, buffer, count, ppos, + create_dyn_event); +} + +static const struct file_operations dynamic_events_ops = { + .owner = THIS_MODULE, + .open = dyn_event_open, + .read = seq_read, + .llseek = seq_lseek, + .release = seq_release, + .write = dyn_event_write, +}; + +/* Make a tracefs interface for controlling dynamic events */ +static __init int init_dynamic_event(void) +{ + struct dentry *d_tracer; + struct dentry *entry; + + d_tracer = tracing_init_dentry(); + if (IS_ERR(d_tracer)) + return 0; + + entry = tracefs_create_file("dynamic_events", 0644, d_tracer, + NULL, &dynamic_events_ops); + + /* Event list interface */ + if (!entry) + pr_warn("Could not create tracefs 'dynamic_events' entry\n"); + + return 0; +} +fs_initcall(init_dynamic_event); diff --git a/kernel/trace/trace_dynevent.h b/kernel/trace/trace_dynevent.h new file mode 100644 index 000000000000..8c334064e4d6 --- /dev/null +++ b/kernel/trace/trace_dynevent.h @@ -0,0 +1,119 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Common header file for generic dynamic events. + */ + +#ifndef _TRACE_DYNEVENT_H +#define _TRACE_DYNEVENT_H + +#include +#include +#include +#include + +#include "trace.h" + +struct dyn_event; + +/** + * struct dyn_event_operations - Methods for each type of dynamic events + * + * These methods must be set for each type, since there is no default method. + * Before using this for dyn_event_init(), it must be registered by + * dyn_event_register(). + * + * @create: Parse and create event method. This is invoked when user passes + * a event definition to dynamic_events interface. This must not destruct + * the arguments and return -ECANCELED if given arguments doesn't match its + * command prefix. + * @show: Showing method. This is invoked when user reads the event definitions + * via dynamic_events interface. + * @is_busy: Check whether given event is busy so that it can not be deleted. + * Return true if it is busy, otherwides false. + * @free: Delete the given event. Return 0 if success, otherwides error. + * @match: Check whether given event and system name match this event. + * Return true if it matches, otherwides false. + * + * Except for @create, these methods are called under holding event_mutex. + */ +struct dyn_event_operations { + struct list_head list; + int (*create)(int argc, const char *argv[]); + int (*show)(struct seq_file *m, struct dyn_event *ev); + bool (*is_busy)(struct dyn_event *ev); + int (*free)(struct dyn_event *ev); + bool (*match)(const char *system, const char *event, + struct dyn_event *ev); +}; + +/* Register new dyn_event type -- must be called at first */ +int dyn_event_register(struct dyn_event_operations *ops); + +/** + * struct dyn_event - Dynamic event list header + * + * The dyn_event structure encapsulates a list and a pointer to the operators + * for making a global list of dynamic events. + * User must includes this in each event structure, so that those events can + * be added/removed via dynamic_events interface. + */ +struct dyn_event { + struct list_head list; + struct dyn_event_operations *ops; +}; + +extern struct list_head dyn_event_list; + +static inline +int dyn_event_init(struct dyn_event *ev, struct dyn_event_operations *ops) +{ + if (!ev || !ops) + return -EINVAL; + + INIT_LIST_HEAD(&ev->list); + ev->ops = ops; + return 0; +} + +static inline int dyn_event_add(struct dyn_event *ev) +{ + lockdep_assert_held(&event_mutex); + + if (!ev || !ev->ops) + return -EINVAL; + + list_add_tail(&ev->list, &dyn_event_list); + return 0; +} + +static inline void dyn_event_remove(struct dyn_event *ev) +{ + lockdep_assert_held(&event_mutex); + list_del_init(&ev->list); +} + +void *dyn_event_seq_start(struct seq_file *m, loff_t *pos); +void *dyn_event_seq_next(struct seq_file *m, void *v, loff_t *pos); +void dyn_event_seq_stop(struct seq_file *m, void *v); +int dyn_events_release_all(struct dyn_event_operations *type); +int dyn_event_release(int argc, char **argv, struct dyn_event_operations *type); + +/* + * for_each_dyn_event - iterate over the dyn_event list + * @pos: the struct dyn_event * to use as a loop cursor + * + * This is just a basement of for_each macro. Wrap this for + * each actual event structure with ops filtering. + */ +#define for_each_dyn_event(pos) \ + list_for_each_entry(pos, &dyn_event_list, list) + +/* + * for_each_dyn_event - iterate over the dyn_event list safely + * @pos: the struct dyn_event * to use as a loop cursor + * @n: the struct dyn_event * to use as temporary storage + */ +#define for_each_dyn_event_safe(pos, n) \ + list_for_each_entry_safe(pos, n, &dyn_event_list, list) + +#endif -- cgit v1.3-7-g2ca7 From 6212dd29683eec51d6d05374a66ac953e81250e9 Mon Sep 17 00:00:00 2001 From: Masami Hiramatsu Date: Mon, 5 Nov 2018 18:02:36 +0900 Subject: tracing/kprobes: Use dyn_event framework for kprobe events Use dyn_event framework for kprobe events. This shows kprobe events on "tracing/dynamic_events" file. User can also define new events via tracing/dynamic_events. Link: http://lkml.kernel.org/r/154140855646.17322.6619219995865980392.stgit@devbox Reviewed-by: Tom Zanussi Tested-by: Tom Zanussi Signed-off-by: Masami Hiramatsu Signed-off-by: Steven Rostedt (VMware) --- Documentation/trace/kprobetrace.rst | 3 + kernel/trace/Kconfig | 1 + kernel/trace/trace_kprobe.c | 319 ++++++++++++++++++++---------------- kernel/trace/trace_probe.c | 27 +++ kernel/trace/trace_probe.h | 2 + 5 files changed, 207 insertions(+), 145 deletions(-) (limited to 'kernel') diff --git a/Documentation/trace/kprobetrace.rst b/Documentation/trace/kprobetrace.rst index 47e765c2f2c3..235ce2ab131a 100644 --- a/Documentation/trace/kprobetrace.rst +++ b/Documentation/trace/kprobetrace.rst @@ -20,6 +20,9 @@ current_tracer. Instead of that, add probe points via /sys/kernel/debug/tracing/kprobe_events, and enable it via /sys/kernel/debug/tracing/events/kprobes//enable. +You can also use /sys/kernel/debug/tracing/dynamic_events instead of +kprobe_events. That interface will provide unified access to other +dynamic events too. Synopsis of kprobe_events ------------------------- diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig index bf2e8a5a91f1..c0f6b0105609 100644 --- a/kernel/trace/Kconfig +++ b/kernel/trace/Kconfig @@ -461,6 +461,7 @@ config KPROBE_EVENTS bool "Enable kprobes-based dynamic events" select TRACING select PROBE_EVENTS + select DYNAMIC_EVENTS default y help This allows the user to add tracing events (similar to tracepoints) diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c index d313bcc259dc..bdf8c2ad5152 100644 --- a/kernel/trace/trace_kprobe.c +++ b/kernel/trace/trace_kprobe.c @@ -12,6 +12,7 @@ #include #include +#include "trace_dynevent.h" #include "trace_kprobe_selftest.h" #include "trace_probe.h" #include "trace_probe_tmpl.h" @@ -19,17 +20,51 @@ #define KPROBE_EVENT_SYSTEM "kprobes" #define KRETPROBE_MAXACTIVE_MAX 4096 +static int trace_kprobe_create(int argc, const char **argv); +static int trace_kprobe_show(struct seq_file *m, struct dyn_event *ev); +static int trace_kprobe_release(struct dyn_event *ev); +static bool trace_kprobe_is_busy(struct dyn_event *ev); +static bool trace_kprobe_match(const char *system, const char *event, + struct dyn_event *ev); + +static struct dyn_event_operations trace_kprobe_ops = { + .create = trace_kprobe_create, + .show = trace_kprobe_show, + .is_busy = trace_kprobe_is_busy, + .free = trace_kprobe_release, + .match = trace_kprobe_match, +}; + /** * Kprobe event core functions */ struct trace_kprobe { - struct list_head list; + struct dyn_event devent; struct kretprobe rp; /* Use rp.kp for kprobe use */ unsigned long __percpu *nhit; const char *symbol; /* symbol name */ struct trace_probe tp; }; +static bool is_trace_kprobe(struct dyn_event *ev) +{ + return ev->ops == &trace_kprobe_ops; +} + +static struct trace_kprobe *to_trace_kprobe(struct dyn_event *ev) +{ + return container_of(ev, struct trace_kprobe, devent); +} + +/** + * for_each_trace_kprobe - iterate over the trace_kprobe list + * @pos: the struct trace_kprobe * for each entry + * @dpos: the struct dyn_event * to use as a loop cursor + */ +#define for_each_trace_kprobe(pos, dpos) \ + for_each_dyn_event(dpos) \ + if (is_trace_kprobe(dpos) && (pos = to_trace_kprobe(dpos))) + #define SIZEOF_TRACE_KPROBE(n) \ (offsetof(struct trace_kprobe, tp.args) + \ (sizeof(struct probe_arg) * (n))) @@ -81,6 +116,22 @@ static nokprobe_inline bool trace_kprobe_module_exist(struct trace_kprobe *tk) return ret; } +static bool trace_kprobe_is_busy(struct dyn_event *ev) +{ + struct trace_kprobe *tk = to_trace_kprobe(ev); + + return trace_probe_is_enabled(&tk->tp); +} + +static bool trace_kprobe_match(const char *system, const char *event, + struct dyn_event *ev) +{ + struct trace_kprobe *tk = to_trace_kprobe(ev); + + return strcmp(trace_event_name(&tk->tp.call), event) == 0 && + (!system || strcmp(tk->tp.call.class->system, system) == 0); +} + static nokprobe_inline unsigned long trace_kprobe_nhit(struct trace_kprobe *tk) { unsigned long nhit = 0; @@ -128,9 +179,6 @@ bool trace_kprobe_error_injectable(struct trace_event_call *call) static int register_kprobe_event(struct trace_kprobe *tk); static int unregister_kprobe_event(struct trace_kprobe *tk); -static DEFINE_MUTEX(probe_lock); -static LIST_HEAD(probe_list); - static int kprobe_dispatcher(struct kprobe *kp, struct pt_regs *regs); static int kretprobe_dispatcher(struct kretprobe_instance *ri, struct pt_regs *regs); @@ -192,7 +240,7 @@ static struct trace_kprobe *alloc_trace_kprobe(const char *group, if (!tk->tp.class.system) goto error; - INIT_LIST_HEAD(&tk->list); + dyn_event_init(&tk->devent, &trace_kprobe_ops); INIT_LIST_HEAD(&tk->tp.files); return tk; error: @@ -207,6 +255,9 @@ static void free_trace_kprobe(struct trace_kprobe *tk) { int i; + if (!tk) + return; + for (i = 0; i < tk->tp.nr_args; i++) traceprobe_free_probe_arg(&tk->tp.args[i]); @@ -220,9 +271,10 @@ static void free_trace_kprobe(struct trace_kprobe *tk) static struct trace_kprobe *find_trace_kprobe(const char *event, const char *group) { + struct dyn_event *pos; struct trace_kprobe *tk; - list_for_each_entry(tk, &probe_list, list) + for_each_trace_kprobe(tk, pos) if (strcmp(trace_event_name(&tk->tp.call), event) == 0 && strcmp(tk->tp.call.class->system, group) == 0) return tk; @@ -321,7 +373,7 @@ disable_trace_kprobe(struct trace_kprobe *tk, struct trace_event_file *file) * created with perf_event_open. We don't need to wait for these * trace_kprobes */ - if (list_empty(&tk->list)) + if (list_empty(&tk->devent.list)) wait = 0; out: if (wait) { @@ -419,7 +471,7 @@ static void __unregister_trace_kprobe(struct trace_kprobe *tk) } } -/* Unregister a trace_probe and probe_event: call with locking probe_lock */ +/* Unregister a trace_probe and probe_event */ static int unregister_trace_kprobe(struct trace_kprobe *tk) { /* Enabled event can not be unregistered */ @@ -431,7 +483,7 @@ static int unregister_trace_kprobe(struct trace_kprobe *tk) return -EBUSY; __unregister_trace_kprobe(tk); - list_del(&tk->list); + dyn_event_remove(&tk->devent); return 0; } @@ -442,7 +494,7 @@ static int register_trace_kprobe(struct trace_kprobe *tk) struct trace_kprobe *old_tk; int ret; - mutex_lock(&probe_lock); + mutex_lock(&event_mutex); /* Delete old (same name) event if exist */ old_tk = find_trace_kprobe(trace_event_name(&tk->tp.call), @@ -471,10 +523,10 @@ static int register_trace_kprobe(struct trace_kprobe *tk) if (ret < 0) unregister_kprobe_event(tk); else - list_add_tail(&tk->list, &probe_list); + dyn_event_add(&tk->devent); end: - mutex_unlock(&probe_lock); + mutex_unlock(&event_mutex); return ret; } @@ -483,6 +535,7 @@ static int trace_kprobe_module_callback(struct notifier_block *nb, unsigned long val, void *data) { struct module *mod = data; + struct dyn_event *pos; struct trace_kprobe *tk; int ret; @@ -490,8 +543,8 @@ static int trace_kprobe_module_callback(struct notifier_block *nb, return NOTIFY_DONE; /* Update probes on coming module */ - mutex_lock(&probe_lock); - list_for_each_entry(tk, &probe_list, list) { + mutex_lock(&event_mutex); + for_each_trace_kprobe(tk, pos) { if (trace_kprobe_within_module(tk, mod)) { /* Don't need to check busy - this should have gone. */ __unregister_trace_kprobe(tk); @@ -502,7 +555,7 @@ static int trace_kprobe_module_callback(struct notifier_block *nb, mod->name, ret); } } - mutex_unlock(&probe_lock); + mutex_unlock(&event_mutex); return NOTIFY_DONE; } @@ -520,7 +573,7 @@ static inline void sanitize_event_name(char *name) *name = '_'; } -static int create_trace_kprobe(int argc, char **argv) +static int trace_kprobe_create(int argc, const char *argv[]) { /* * Argument syntax: @@ -544,9 +597,10 @@ static int create_trace_kprobe(int argc, char **argv) * FETCHARG:TYPE : use TYPE instead of unsigned long. */ struct trace_kprobe *tk; - int i, ret = 0; - bool is_return = false, is_delete = false; - char *symbol = NULL, *event = NULL, *group = NULL; + int i, len, ret = 0; + bool is_return = false; + char *symbol = NULL, *tmp = NULL; + const char *event = NULL, *group = KPROBE_EVENT_SYSTEM; int maxactive = 0; long offset = 0; void *addr = NULL; @@ -554,26 +608,26 @@ static int create_trace_kprobe(int argc, char **argv) unsigned int flags = TPARG_FL_KERNEL; /* argc must be >= 1 */ - if (argv[0][0] == 'p') - is_return = false; - else if (argv[0][0] == 'r') { + if (argv[0][0] == 'r') { is_return = true; flags |= TPARG_FL_RETURN; - } else if (argv[0][0] == '-') - is_delete = true; - else { - pr_info("Probe definition must be started with 'p', 'r' or" - " '-'.\n"); - return -EINVAL; - } + } else if (argv[0][0] != 'p' || argc < 2) + return -ECANCELED; event = strchr(&argv[0][1], ':'); - if (event) { - event[0] = '\0'; + if (event) event++; - } + if (is_return && isdigit(argv[0][1])) { - ret = kstrtouint(&argv[0][1], 0, &maxactive); + if (event) + len = event - &argv[0][1] - 1; + else + len = strlen(&argv[0][1]); + if (len > MAX_EVENT_NAME_LEN - 1) + return -E2BIG; + memcpy(buf, &argv[0][1], len); + buf[len] = '\0'; + ret = kstrtouint(buf, 0, &maxactive); if (ret) { pr_info("Failed to parse maxactive.\n"); return ret; @@ -588,74 +642,37 @@ static int create_trace_kprobe(int argc, char **argv) } } - if (event) { - char *slash; - - slash = strchr(event, '/'); - if (slash) { - group = event; - event = slash + 1; - slash[0] = '\0'; - if (strlen(group) == 0) { - pr_info("Group name is not specified\n"); - return -EINVAL; - } - } - if (strlen(event) == 0) { - pr_info("Event name is not specified\n"); - return -EINVAL; - } - } - if (!group) - group = KPROBE_EVENT_SYSTEM; - - if (is_delete) { - if (!event) { - pr_info("Delete command needs an event name.\n"); - return -EINVAL; - } - mutex_lock(&probe_lock); - tk = find_trace_kprobe(event, group); - if (!tk) { - mutex_unlock(&probe_lock); - pr_info("Event %s/%s doesn't exist.\n", group, event); - return -ENOENT; - } - /* delete an event */ - ret = unregister_trace_kprobe(tk); - if (ret == 0) - free_trace_kprobe(tk); - mutex_unlock(&probe_lock); - return ret; - } - - if (argc < 2) { - pr_info("Probe point is not specified.\n"); - return -EINVAL; - } - /* try to parse an address. if that fails, try to read the * input as a symbol. */ if (kstrtoul(argv[1], 0, (unsigned long *)&addr)) { + /* Check whether uprobe event specified */ + if (strchr(argv[1], '/') && strchr(argv[1], ':')) + return -ECANCELED; /* a symbol specified */ - symbol = argv[1]; + symbol = kstrdup(argv[1], GFP_KERNEL); + if (!symbol) + return -ENOMEM; /* TODO: support .init module functions */ ret = traceprobe_split_symbol_offset(symbol, &offset); if (ret || offset < 0 || offset > UINT_MAX) { pr_info("Failed to parse either an address or a symbol.\n"); - return ret; + goto out; } if (kprobe_on_func_entry(NULL, symbol, offset)) flags |= TPARG_FL_FENTRY; if (offset && is_return && !(flags & TPARG_FL_FENTRY)) { pr_info("Given offset is not valid for return probe.\n"); - return -EINVAL; + ret = -EINVAL; + goto out; } } argc -= 2; argv += 2; - /* setup a probe */ - if (!event) { + if (event) { + ret = traceprobe_parse_event_name(&event, &group, buf); + if (ret) + goto out; + } else { /* Make a new event name */ if (symbol) snprintf(buf, MAX_EVENT_NAME_LEN, "%c_%s_%ld", @@ -666,17 +683,27 @@ static int create_trace_kprobe(int argc, char **argv) sanitize_event_name(buf); event = buf; } + + /* setup a probe */ tk = alloc_trace_kprobe(group, event, addr, symbol, offset, maxactive, argc, is_return); if (IS_ERR(tk)) { pr_info("Failed to allocate trace_probe.(%d)\n", (int)PTR_ERR(tk)); - return PTR_ERR(tk); + ret = PTR_ERR(tk); + goto out; } /* parse arguments */ for (i = 0; i < argc && i < MAX_TRACE_ARGS; i++) { - ret = traceprobe_parse_probe_arg(&tk->tp, i, argv[i], flags); + tmp = kstrdup(argv[i], GFP_KERNEL); + if (!tmp) { + ret = -ENOMEM; + goto error; + } + + ret = traceprobe_parse_probe_arg(&tk->tp, i, tmp, flags); + kfree(tmp); if (ret) goto error; } @@ -684,60 +711,39 @@ static int create_trace_kprobe(int argc, char **argv) ret = register_trace_kprobe(tk); if (ret) goto error; - return 0; +out: + kfree(symbol); + return ret; error: free_trace_kprobe(tk); - return ret; + goto out; } -static int release_all_trace_kprobes(void) +static int create_or_delete_trace_kprobe(int argc, char **argv) { - struct trace_kprobe *tk; - int ret = 0; - - mutex_lock(&probe_lock); - /* Ensure no probe is in use. */ - list_for_each_entry(tk, &probe_list, list) - if (trace_probe_is_enabled(&tk->tp)) { - ret = -EBUSY; - goto end; - } - /* TODO: Use batch unregistration */ - while (!list_empty(&probe_list)) { - tk = list_entry(probe_list.next, struct trace_kprobe, list); - ret = unregister_trace_kprobe(tk); - if (ret) - goto end; - free_trace_kprobe(tk); - } - -end: - mutex_unlock(&probe_lock); + int ret; - return ret; -} + if (argv[0][0] == '-') + return dyn_event_release(argc, argv, &trace_kprobe_ops); -/* Probes listing interfaces */ -static void *probes_seq_start(struct seq_file *m, loff_t *pos) -{ - mutex_lock(&probe_lock); - return seq_list_start(&probe_list, *pos); + ret = trace_kprobe_create(argc, (const char **)argv); + return ret == -ECANCELED ? -EINVAL : ret; } -static void *probes_seq_next(struct seq_file *m, void *v, loff_t *pos) +static int trace_kprobe_release(struct dyn_event *ev) { - return seq_list_next(v, &probe_list, pos); -} + struct trace_kprobe *tk = to_trace_kprobe(ev); + int ret = unregister_trace_kprobe(tk); -static void probes_seq_stop(struct seq_file *m, void *v) -{ - mutex_unlock(&probe_lock); + if (!ret) + free_trace_kprobe(tk); + return ret; } -static int probes_seq_show(struct seq_file *m, void *v) +static int trace_kprobe_show(struct seq_file *m, struct dyn_event *ev) { - struct trace_kprobe *tk = v; + struct trace_kprobe *tk = to_trace_kprobe(ev); int i; seq_putc(m, trace_kprobe_is_return(tk) ? 'r' : 'p'); @@ -759,10 +765,20 @@ static int probes_seq_show(struct seq_file *m, void *v) return 0; } +static int probes_seq_show(struct seq_file *m, void *v) +{ + struct dyn_event *ev = v; + + if (!is_trace_kprobe(ev)) + return 0; + + return trace_kprobe_show(m, ev); +} + static const struct seq_operations probes_seq_op = { - .start = probes_seq_start, - .next = probes_seq_next, - .stop = probes_seq_stop, + .start = dyn_event_seq_start, + .next = dyn_event_seq_next, + .stop = dyn_event_seq_stop, .show = probes_seq_show }; @@ -771,7 +787,7 @@ static int probes_open(struct inode *inode, struct file *file) int ret; if ((file->f_mode & FMODE_WRITE) && (file->f_flags & O_TRUNC)) { - ret = release_all_trace_kprobes(); + ret = dyn_events_release_all(&trace_kprobe_ops); if (ret < 0) return ret; } @@ -783,7 +799,7 @@ static ssize_t probes_write(struct file *file, const char __user *buffer, size_t count, loff_t *ppos) { return trace_parse_run_command(file, buffer, count, ppos, - create_trace_kprobe); + create_or_delete_trace_kprobe); } static const struct file_operations kprobe_events_ops = { @@ -798,8 +814,13 @@ static const struct file_operations kprobe_events_ops = { /* Probes profiling interfaces */ static int probes_profile_seq_show(struct seq_file *m, void *v) { - struct trace_kprobe *tk = v; + struct dyn_event *ev = v; + struct trace_kprobe *tk; + if (!is_trace_kprobe(ev)) + return 0; + + tk = to_trace_kprobe(ev); seq_printf(m, " %-44s %15lu %15lu\n", trace_event_name(&tk->tp.call), trace_kprobe_nhit(tk), @@ -809,9 +830,9 @@ static int probes_profile_seq_show(struct seq_file *m, void *v) } static const struct seq_operations profile_seq_op = { - .start = probes_seq_start, - .next = probes_seq_next, - .stop = probes_seq_stop, + .start = dyn_event_seq_start, + .next = dyn_event_seq_next, + .stop = dyn_event_seq_stop, .show = probes_profile_seq_show }; @@ -1332,7 +1353,7 @@ static int register_kprobe_event(struct trace_kprobe *tk) kfree(call->print_fmt); return -ENODEV; } - ret = trace_add_event_call(call); + ret = trace_add_event_call_nolock(call); if (ret) { pr_info("Failed to register kprobe event: %s\n", trace_event_name(call)); @@ -1347,7 +1368,7 @@ static int unregister_kprobe_event(struct trace_kprobe *tk) int ret; /* tp->event is unregistered in trace_remove_event_call() */ - ret = trace_remove_event_call(&tk->tp.call); + ret = trace_remove_event_call_nolock(&tk->tp.call); if (!ret) kfree(tk->tp.call.print_fmt); return ret; @@ -1364,7 +1385,7 @@ create_local_trace_kprobe(char *func, void *addr, unsigned long offs, char *event; /* - * local trace_kprobes are not added to probe_list, so they are never + * local trace_kprobes are not added to dyn_event, so they are never * searched in find_trace_kprobe(). Therefore, there is no concern of * duplicated name here. */ @@ -1422,6 +1443,11 @@ static __init int init_kprobe_trace(void) { struct dentry *d_tracer; struct dentry *entry; + int ret; + + ret = dyn_event_register(&trace_kprobe_ops); + if (ret) + return ret; if (register_module_notifier(&trace_kprobe_module_nb)) return -EINVAL; @@ -1479,9 +1505,8 @@ static __init int kprobe_trace_self_tests_init(void) pr_info("Testing kprobe tracing: "); - ret = trace_run_command("p:testprobe kprobe_trace_selftest_target " - "$stack $stack0 +0($stack)", - create_trace_kprobe); + ret = trace_run_command("p:testprobe kprobe_trace_selftest_target $stack $stack0 +0($stack)", + create_or_delete_trace_kprobe); if (WARN_ON_ONCE(ret)) { pr_warn("error on probing function entry.\n"); warn++; @@ -1501,8 +1526,8 @@ static __init int kprobe_trace_self_tests_init(void) } } - ret = trace_run_command("r:testprobe2 kprobe_trace_selftest_target " - "$retval", create_trace_kprobe); + ret = trace_run_command("r:testprobe2 kprobe_trace_selftest_target $retval", + create_or_delete_trace_kprobe); if (WARN_ON_ONCE(ret)) { pr_warn("error on probing function return.\n"); warn++; @@ -1572,20 +1597,24 @@ static __init int kprobe_trace_self_tests_init(void) disable_trace_kprobe(tk, file); } - ret = trace_run_command("-:testprobe", create_trace_kprobe); + ret = trace_run_command("-:testprobe", create_or_delete_trace_kprobe); if (WARN_ON_ONCE(ret)) { pr_warn("error on deleting a probe.\n"); warn++; } - ret = trace_run_command("-:testprobe2", create_trace_kprobe); + ret = trace_run_command("-:testprobe2", create_or_delete_trace_kprobe); if (WARN_ON_ONCE(ret)) { pr_warn("error on deleting a probe.\n"); warn++; } end: - release_all_trace_kprobes(); + ret = dyn_events_release_all(&trace_kprobe_ops); + if (WARN_ON_ONCE(ret)) { + pr_warn("error on cleaning up probes.\n"); + warn++; + } /* * Wait for the optimizer work to finish. Otherwise it might fiddle * with probes in already freed __init text. diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c index 449150c6a87f..ff86417c0149 100644 --- a/kernel/trace/trace_probe.c +++ b/kernel/trace/trace_probe.c @@ -154,6 +154,33 @@ int traceprobe_split_symbol_offset(char *symbol, long *offset) return 0; } +/* @buf must has MAX_EVENT_NAME_LEN size */ +int traceprobe_parse_event_name(const char **pevent, const char **pgroup, + char *buf) +{ + const char *slash, *event = *pevent; + + slash = strchr(event, '/'); + if (slash) { + if (slash == event) { + pr_info("Group name is not specified\n"); + return -EINVAL; + } + if (slash - event + 1 > MAX_EVENT_NAME_LEN) { + pr_info("Group name is too long\n"); + return -E2BIG; + } + strlcpy(buf, event, slash - event + 1); + *pgroup = buf; + *pevent = slash + 1; + } + if (strlen(event) == 0) { + pr_info("Event name is not specified\n"); + return -EINVAL; + } + return 0; +} + #define PARAM_MAX_STACK (THREAD_SIZE / sizeof(unsigned long)) static int parse_probe_vars(char *arg, const struct fetch_type *t, diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h index feeec261b356..8a63f8bc01bc 100644 --- a/kernel/trace/trace_probe.h +++ b/kernel/trace/trace_probe.h @@ -279,6 +279,8 @@ extern int traceprobe_update_arg(struct probe_arg *arg); extern void traceprobe_free_probe_arg(struct probe_arg *arg); extern int traceprobe_split_symbol_offset(char *symbol, long *offset); +extern int traceprobe_parse_event_name(const char **pevent, + const char **pgroup, char *buf); extern int traceprobe_set_print_fmt(struct trace_probe *tp, bool is_return); -- cgit v1.3-7-g2ca7 From 0597c49c69d53f00df99421d832453f4e92a5006 Mon Sep 17 00:00:00 2001 From: Masami Hiramatsu Date: Mon, 5 Nov 2018 18:03:04 +0900 Subject: tracing/uprobes: Use dyn_event framework for uprobe events Use dyn_event framework for uprobe events. This shows uprobe events on "dynamic_events" file. User can also define new uprobe events via dynamic_events. Link: http://lkml.kernel.org/r/154140858481.17322.9091293846515154065.stgit@devbox Reviewed-by: Tom Zanussi Tested-by: Tom Zanussi Signed-off-by: Masami Hiramatsu Signed-off-by: Steven Rostedt (VMware) --- Documentation/trace/uprobetracer.rst | 4 + kernel/trace/Kconfig | 1 + kernel/trace/trace_uprobe.c | 278 +++++++++++++++++++---------------- 3 files changed, 153 insertions(+), 130 deletions(-) (limited to 'kernel') diff --git a/Documentation/trace/uprobetracer.rst b/Documentation/trace/uprobetracer.rst index d0822811527a..4c3bfde2ba47 100644 --- a/Documentation/trace/uprobetracer.rst +++ b/Documentation/trace/uprobetracer.rst @@ -18,6 +18,10 @@ current_tracer. Instead of that, add probe points via However unlike kprobe-event tracer, the uprobe event interface expects the user to calculate the offset of the probepoint in the object. +You can also use /sys/kernel/debug/tracing/dynamic_events instead of +uprobe_events. That interface will provide unified access to other +dynamic events too. + Synopsis of uprobe_tracer ------------------------- :: diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig index c0f6b0105609..2cab3c5dfe2c 100644 --- a/kernel/trace/Kconfig +++ b/kernel/trace/Kconfig @@ -501,6 +501,7 @@ config UPROBE_EVENTS depends on PERF_EVENTS select UPROBES select PROBE_EVENTS + select DYNAMIC_EVENTS select TRACING default y help diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c index 6eaaa2150685..4a7b21c891f3 100644 --- a/kernel/trace/trace_uprobe.c +++ b/kernel/trace/trace_uprobe.c @@ -7,6 +7,7 @@ */ #define pr_fmt(fmt) "trace_kprobe: " fmt +#include #include #include #include @@ -14,6 +15,7 @@ #include #include +#include "trace_dynevent.h" #include "trace_probe.h" #include "trace_probe_tmpl.h" @@ -37,11 +39,26 @@ struct trace_uprobe_filter { struct list_head perf_events; }; +static int trace_uprobe_create(int argc, const char **argv); +static int trace_uprobe_show(struct seq_file *m, struct dyn_event *ev); +static int trace_uprobe_release(struct dyn_event *ev); +static bool trace_uprobe_is_busy(struct dyn_event *ev); +static bool trace_uprobe_match(const char *system, const char *event, + struct dyn_event *ev); + +static struct dyn_event_operations trace_uprobe_ops = { + .create = trace_uprobe_create, + .show = trace_uprobe_show, + .is_busy = trace_uprobe_is_busy, + .free = trace_uprobe_release, + .match = trace_uprobe_match, +}; + /* * uprobe event core functions */ struct trace_uprobe { - struct list_head list; + struct dyn_event devent; struct trace_uprobe_filter filter; struct uprobe_consumer consumer; struct path path; @@ -53,6 +70,25 @@ struct trace_uprobe { struct trace_probe tp; }; +static bool is_trace_uprobe(struct dyn_event *ev) +{ + return ev->ops == &trace_uprobe_ops; +} + +static struct trace_uprobe *to_trace_uprobe(struct dyn_event *ev) +{ + return container_of(ev, struct trace_uprobe, devent); +} + +/** + * for_each_trace_uprobe - iterate over the trace_uprobe list + * @pos: the struct trace_uprobe * for each entry + * @dpos: the struct dyn_event * to use as a loop cursor + */ +#define for_each_trace_uprobe(pos, dpos) \ + for_each_dyn_event(dpos) \ + if (is_trace_uprobe(dpos) && (pos = to_trace_uprobe(dpos))) + #define SIZEOF_TRACE_UPROBE(n) \ (offsetof(struct trace_uprobe, tp.args) + \ (sizeof(struct probe_arg) * (n))) @@ -60,9 +96,6 @@ struct trace_uprobe { static int register_uprobe_event(struct trace_uprobe *tu); static int unregister_uprobe_event(struct trace_uprobe *tu); -static DEFINE_MUTEX(uprobe_lock); -static LIST_HEAD(uprobe_list); - struct uprobe_dispatch_data { struct trace_uprobe *tu; unsigned long bp_addr; @@ -209,6 +242,22 @@ static inline bool is_ret_probe(struct trace_uprobe *tu) return tu->consumer.ret_handler != NULL; } +static bool trace_uprobe_is_busy(struct dyn_event *ev) +{ + struct trace_uprobe *tu = to_trace_uprobe(ev); + + return trace_probe_is_enabled(&tu->tp); +} + +static bool trace_uprobe_match(const char *system, const char *event, + struct dyn_event *ev) +{ + struct trace_uprobe *tu = to_trace_uprobe(ev); + + return strcmp(trace_event_name(&tu->tp.call), event) == 0 && + (!system || strcmp(tu->tp.call.class->system, system) == 0); +} + /* * Allocate new trace_uprobe and initialize it (including uprobes). */ @@ -236,7 +285,7 @@ alloc_trace_uprobe(const char *group, const char *event, int nargs, bool is_ret) if (!tu->tp.class.system) goto error; - INIT_LIST_HEAD(&tu->list); + dyn_event_init(&tu->devent, &trace_uprobe_ops); INIT_LIST_HEAD(&tu->tp.files); tu->consumer.handler = uprobe_dispatcher; if (is_ret) @@ -255,6 +304,9 @@ static void free_trace_uprobe(struct trace_uprobe *tu) { int i; + if (!tu) + return; + for (i = 0; i < tu->tp.nr_args; i++) traceprobe_free_probe_arg(&tu->tp.args[i]); @@ -267,9 +319,10 @@ static void free_trace_uprobe(struct trace_uprobe *tu) static struct trace_uprobe *find_probe_event(const char *event, const char *group) { + struct dyn_event *pos; struct trace_uprobe *tu; - list_for_each_entry(tu, &uprobe_list, list) + for_each_trace_uprobe(tu, pos) if (strcmp(trace_event_name(&tu->tp.call), event) == 0 && strcmp(tu->tp.call.class->system, group) == 0) return tu; @@ -277,7 +330,7 @@ static struct trace_uprobe *find_probe_event(const char *event, const char *grou return NULL; } -/* Unregister a trace_uprobe and probe_event: call with locking uprobe_lock */ +/* Unregister a trace_uprobe and probe_event */ static int unregister_trace_uprobe(struct trace_uprobe *tu) { int ret; @@ -286,7 +339,7 @@ static int unregister_trace_uprobe(struct trace_uprobe *tu) if (ret) return ret; - list_del(&tu->list); + dyn_event_remove(&tu->devent); free_trace_uprobe(tu); return 0; } @@ -302,13 +355,14 @@ static int unregister_trace_uprobe(struct trace_uprobe *tu) */ static struct trace_uprobe *find_old_trace_uprobe(struct trace_uprobe *new) { + struct dyn_event *pos; struct trace_uprobe *tmp, *old = NULL; struct inode *new_inode = d_real_inode(new->path.dentry); old = find_probe_event(trace_event_name(&new->tp.call), new->tp.call.class->system); - list_for_each_entry(tmp, &uprobe_list, list) { + for_each_trace_uprobe(tmp, pos) { if ((old ? old != tmp : true) && new_inode == d_real_inode(tmp->path.dentry) && new->offset == tmp->offset && @@ -326,7 +380,7 @@ static int register_trace_uprobe(struct trace_uprobe *tu) struct trace_uprobe *old_tu; int ret; - mutex_lock(&uprobe_lock); + mutex_lock(&event_mutex); /* register as an event */ old_tu = find_old_trace_uprobe(tu); @@ -348,10 +402,10 @@ static int register_trace_uprobe(struct trace_uprobe *tu) goto end; } - list_add_tail(&tu->list, &uprobe_list); + dyn_event_add(&tu->devent); end: - mutex_unlock(&uprobe_lock); + mutex_unlock(&event_mutex); return ret; } @@ -362,91 +416,49 @@ end: * * - Remove uprobe: -:[GRP/]EVENT */ -static int create_trace_uprobe(int argc, char **argv) +static int trace_uprobe_create(int argc, const char **argv) { struct trace_uprobe *tu; - char *arg, *event, *group, *filename, *rctr, *rctr_end; + const char *event = NULL, *group = UPROBE_EVENT_SYSTEM; + char *arg, *filename, *rctr, *rctr_end, *tmp; char buf[MAX_EVENT_NAME_LEN]; struct path path; unsigned long offset, ref_ctr_offset; - bool is_delete, is_return; + bool is_return = false; int i, ret; ret = 0; - is_delete = false; - is_return = false; - event = NULL; - group = NULL; ref_ctr_offset = 0; /* argc must be >= 1 */ - if (argv[0][0] == '-') - is_delete = true; - else if (argv[0][0] == 'r') + if (argv[0][0] == 'r') is_return = true; - else if (argv[0][0] != 'p') { - pr_info("Probe definition must be started with 'p', 'r' or '-'.\n"); - return -EINVAL; - } + else if (argv[0][0] != 'p' || argc < 2) + return -ECANCELED; - if (argv[0][1] == ':') { + if (argv[0][1] == ':') event = &argv[0][2]; - arg = strchr(event, '/'); - if (arg) { - group = event; - event = arg + 1; - event[-1] = '\0'; + if (!strchr(argv[1], '/')) + return -ECANCELED; - if (strlen(group) == 0) { - pr_info("Group name is not specified\n"); - return -EINVAL; - } - } - if (strlen(event) == 0) { - pr_info("Event name is not specified\n"); - return -EINVAL; - } - } - if (!group) - group = UPROBE_EVENT_SYSTEM; - - if (is_delete) { - int ret; - - if (!event) { - pr_info("Delete command needs an event name.\n"); - return -EINVAL; - } - mutex_lock(&uprobe_lock); - tu = find_probe_event(event, group); - - if (!tu) { - mutex_unlock(&uprobe_lock); - pr_info("Event %s/%s doesn't exist.\n", group, event); - return -ENOENT; - } - /* delete an event */ - ret = unregister_trace_uprobe(tu); - mutex_unlock(&uprobe_lock); - return ret; - } + filename = kstrdup(argv[1], GFP_KERNEL); + if (!filename) + return -ENOMEM; - if (argc < 2) { - pr_info("Probe point is not specified.\n"); - return -EINVAL; - } /* Find the last occurrence, in case the path contains ':' too. */ - arg = strrchr(argv[1], ':'); - if (!arg) - return -EINVAL; + arg = strrchr(filename, ':'); + if (!arg || !isdigit(arg[1])) { + kfree(filename); + return -ECANCELED; + } *arg++ = '\0'; - filename = argv[1]; ret = kern_path(filename, LOOKUP_FOLLOW, &path); - if (ret) + if (ret) { + kfree(filename); return ret; - + } if (!d_is_reg(path.dentry)) { ret = -EINVAL; goto fail_address_parse; @@ -480,7 +492,11 @@ static int create_trace_uprobe(int argc, char **argv) argv += 2; /* setup a probe */ - if (!event) { + if (event) { + ret = traceprobe_parse_event_name(&event, &group, buf); + if (ret) + goto fail_address_parse; + } else { char *tail; char *ptr; @@ -508,18 +524,19 @@ static int create_trace_uprobe(int argc, char **argv) tu->offset = offset; tu->ref_ctr_offset = ref_ctr_offset; tu->path = path; - tu->filename = kstrdup(filename, GFP_KERNEL); - - if (!tu->filename) { - pr_info("Failed to allocate filename.\n"); - ret = -ENOMEM; - goto error; - } + tu->filename = filename; /* parse arguments */ for (i = 0; i < argc && i < MAX_TRACE_ARGS; i++) { - ret = traceprobe_parse_probe_arg(&tu->tp, i, argv[i], + tmp = kstrdup(argv[i], GFP_KERNEL); + if (!tmp) { + ret = -ENOMEM; + goto error; + } + + ret = traceprobe_parse_probe_arg(&tu->tp, i, tmp, is_return ? TPARG_FL_RETURN : 0); + kfree(tmp); if (ret) goto error; } @@ -535,55 +552,35 @@ error: fail_address_parse: path_put(&path); + kfree(filename); pr_info("Failed to parse address or file.\n"); return ret; } -static int cleanup_all_probes(void) +static int create_or_delete_trace_uprobe(int argc, char **argv) { - struct trace_uprobe *tu; - int ret = 0; + int ret; - mutex_lock(&uprobe_lock); - /* Ensure no probe is in use. */ - list_for_each_entry(tu, &uprobe_list, list) - if (trace_probe_is_enabled(&tu->tp)) { - ret = -EBUSY; - goto end; - } - while (!list_empty(&uprobe_list)) { - tu = list_entry(uprobe_list.next, struct trace_uprobe, list); - ret = unregister_trace_uprobe(tu); - if (ret) - break; - } -end: - mutex_unlock(&uprobe_lock); - return ret; -} + if (argv[0][0] == '-') + return dyn_event_release(argc, argv, &trace_uprobe_ops); -/* Probes listing interfaces */ -static void *probes_seq_start(struct seq_file *m, loff_t *pos) -{ - mutex_lock(&uprobe_lock); - return seq_list_start(&uprobe_list, *pos); + ret = trace_uprobe_create(argc, (const char **)argv); + return ret == -ECANCELED ? -EINVAL : ret; } -static void *probes_seq_next(struct seq_file *m, void *v, loff_t *pos) +static int trace_uprobe_release(struct dyn_event *ev) { - return seq_list_next(v, &uprobe_list, pos); -} + struct trace_uprobe *tu = to_trace_uprobe(ev); -static void probes_seq_stop(struct seq_file *m, void *v) -{ - mutex_unlock(&uprobe_lock); + return unregister_trace_uprobe(tu); } -static int probes_seq_show(struct seq_file *m, void *v) +/* Probes listing interfaces */ +static int trace_uprobe_show(struct seq_file *m, struct dyn_event *ev) { - struct trace_uprobe *tu = v; + struct trace_uprobe *tu = to_trace_uprobe(ev); char c = is_ret_probe(tu) ? 'r' : 'p'; int i; @@ -601,11 +598,21 @@ static int probes_seq_show(struct seq_file *m, void *v) return 0; } +static int probes_seq_show(struct seq_file *m, void *v) +{ + struct dyn_event *ev = v; + + if (!is_trace_uprobe(ev)) + return 0; + + return trace_uprobe_show(m, ev); +} + static const struct seq_operations probes_seq_op = { - .start = probes_seq_start, - .next = probes_seq_next, - .stop = probes_seq_stop, - .show = probes_seq_show + .start = dyn_event_seq_start, + .next = dyn_event_seq_next, + .stop = dyn_event_seq_stop, + .show = probes_seq_show }; static int probes_open(struct inode *inode, struct file *file) @@ -613,7 +620,7 @@ static int probes_open(struct inode *inode, struct file *file) int ret; if ((file->f_mode & FMODE_WRITE) && (file->f_flags & O_TRUNC)) { - ret = cleanup_all_probes(); + ret = dyn_events_release_all(&trace_uprobe_ops); if (ret) return ret; } @@ -624,7 +631,8 @@ static int probes_open(struct inode *inode, struct file *file) static ssize_t probes_write(struct file *file, const char __user *buffer, size_t count, loff_t *ppos) { - return trace_parse_run_command(file, buffer, count, ppos, create_trace_uprobe); + return trace_parse_run_command(file, buffer, count, ppos, + create_or_delete_trace_uprobe); } static const struct file_operations uprobe_events_ops = { @@ -639,17 +647,22 @@ static const struct file_operations uprobe_events_ops = { /* Probes profiling interfaces */ static int probes_profile_seq_show(struct seq_file *m, void *v) { - struct trace_uprobe *tu = v; + struct dyn_event *ev = v; + struct trace_uprobe *tu; + + if (!is_trace_uprobe(ev)) + return 0; + tu = to_trace_uprobe(ev); seq_printf(m, " %s %-44s %15lu\n", tu->filename, trace_event_name(&tu->tp.call), tu->nhit); return 0; } static const struct seq_operations profile_seq_op = { - .start = probes_seq_start, - .next = probes_seq_next, - .stop = probes_seq_stop, + .start = dyn_event_seq_start, + .next = dyn_event_seq_next, + .stop = dyn_event_seq_stop, .show = probes_profile_seq_show }; @@ -1307,7 +1320,7 @@ static int register_uprobe_event(struct trace_uprobe *tu) return -ENODEV; } - ret = trace_add_event_call(call); + ret = trace_add_event_call_nolock(call); if (ret) { pr_info("Failed to register uprobe event: %s\n", @@ -1324,7 +1337,7 @@ static int unregister_uprobe_event(struct trace_uprobe *tu) int ret; /* tu->event is unregistered in trace_remove_event_call() */ - ret = trace_remove_event_call(&tu->tp.call); + ret = trace_remove_event_call_nolock(&tu->tp.call); if (ret) return ret; kfree(tu->tp.call.print_fmt); @@ -1351,7 +1364,7 @@ create_local_trace_uprobe(char *name, unsigned long offs, } /* - * local trace_kprobes are not added to probe_list, so they are never + * local trace_kprobes are not added to dyn_event, so they are never * searched in find_trace_kprobe(). Therefore, there is no concern of * duplicated name "DUMMY_EVENT" here. */ @@ -1399,6 +1412,11 @@ void destroy_local_trace_uprobe(struct trace_event_call *event_call) static __init int init_uprobe_trace(void) { struct dentry *d_tracer; + int ret; + + ret = dyn_event_register(&trace_uprobe_ops); + if (ret) + return ret; d_tracer = tracing_init_dentry(); if (IS_ERR(d_tracer)) -- cgit v1.3-7-g2ca7 From 7bbab38d07f3185fddf6fce126e2239010efdfce Mon Sep 17 00:00:00 2001 From: Masami Hiramatsu Date: Mon, 5 Nov 2018 18:03:33 +0900 Subject: tracing: Use dyn_event framework for synthetic events Use dyn_event framework for synthetic events. This shows synthetic events on "tracing/dynamic_events" file in addition to tracing/synthetic_events interface. User can also define new events via tracing/dynamic_events with "s:" prefix. So, the new syntax is below; s:[synthetic/]EVENT_NAME TYPE ARG; [TYPE ARG;]... To remove events via tracing/dynamic_events, you can use "-:" prefix as same as other events. Link: http://lkml.kernel.org/r/154140861301.17322.15454611233735614508.stgit@devbox Reviewed-by: Tom Zanussi Tested-by: Tom Zanussi Signed-off-by: Masami Hiramatsu Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/Kconfig | 1 + kernel/trace/trace.c | 8 ++ kernel/trace/trace_events_hist.c | 265 ++++++++++++++++++++++++--------------- 3 files changed, 176 insertions(+), 98 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig index 2cab3c5dfe2c..fa8b1fe824f3 100644 --- a/kernel/trace/Kconfig +++ b/kernel/trace/Kconfig @@ -635,6 +635,7 @@ config HIST_TRIGGERS depends on ARCH_HAVE_NMI_SAFE_CMPXCHG select TRACING_MAP select TRACING + select DYNAMIC_EVENTS default n help Hist triggers allow one or more arbitrary trace event fields diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 7e0332f90ed4..911470ad9e94 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -4620,6 +4620,9 @@ static const char readme_msg[] = "\t accepts: event-definitions (one definition per line)\n" "\t Format: p[:[/]] []\n" "\t r[maxactive][:[/]] []\n" +#ifdef CONFIG_HIST_TRIGGERS + "\t s:[synthetic/] []\n" +#endif "\t -:[/]\n" #ifdef CONFIG_KPROBE_EVENTS "\t place: [:][+]|\n" @@ -4638,6 +4641,11 @@ static const char readme_msg[] = "\t type: s8/16/32/64, u8/16/32/64, x8/16/32/64, string, symbol,\n" "\t b@/,\n" "\t \\[\\]\n" +#ifdef CONFIG_HIST_TRIGGERS + "\t field: ;\n" + "\t stype: u8/u16/u32/u64, s8/s16/s32/s64, pid_t,\n" + "\t [unsigned] char/int/long\n" +#endif #endif " events/\t\t- Directory containing all trace event subsystems:\n" " enable\t\t- Write 0/1 to enable/disable tracing of all events\n" diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c index 0feb7f460123..414aabd67d1f 100644 --- a/kernel/trace/trace_events_hist.c +++ b/kernel/trace/trace_events_hist.c @@ -15,6 +15,7 @@ #include "tracing_map.h" #include "trace.h" +#include "trace_dynevent.h" #define SYNTH_SYSTEM "synthetic" #define SYNTH_FIELDS_MAX 16 @@ -292,6 +293,21 @@ struct hist_trigger_data { unsigned int n_max_var_str; }; +static int synth_event_create(int argc, const char **argv); +static int synth_event_show(struct seq_file *m, struct dyn_event *ev); +static int synth_event_release(struct dyn_event *ev); +static bool synth_event_is_busy(struct dyn_event *ev); +static bool synth_event_match(const char *system, const char *event, + struct dyn_event *ev); + +static struct dyn_event_operations synth_event_ops = { + .create = synth_event_create, + .show = synth_event_show, + .is_busy = synth_event_is_busy, + .free = synth_event_release, + .match = synth_event_match, +}; + struct synth_field { char *type; char *name; @@ -301,7 +317,7 @@ struct synth_field { }; struct synth_event { - struct list_head list; + struct dyn_event devent; int ref; char *name; struct synth_field **fields; @@ -312,6 +328,32 @@ struct synth_event { struct tracepoint *tp; }; +static bool is_synth_event(struct dyn_event *ev) +{ + return ev->ops == &synth_event_ops; +} + +static struct synth_event *to_synth_event(struct dyn_event *ev) +{ + return container_of(ev, struct synth_event, devent); +} + +static bool synth_event_is_busy(struct dyn_event *ev) +{ + struct synth_event *event = to_synth_event(ev); + + return event->ref != 0; +} + +static bool synth_event_match(const char *system, const char *event, + struct dyn_event *ev) +{ + struct synth_event *sev = to_synth_event(ev); + + return strcmp(sev->name, event) == 0 && + (!system || strcmp(system, SYNTH_SYSTEM) == 0); +} + struct action_data; typedef void (*action_fn_t) (struct hist_trigger_data *hist_data, @@ -402,7 +444,6 @@ static bool have_hist_err(void) return false; } -static LIST_HEAD(synth_event_list); static DEFINE_MUTEX(synth_event_mutex); struct synth_trace_event { @@ -738,14 +779,12 @@ static void free_synth_field(struct synth_field *field) kfree(field); } -static struct synth_field *parse_synth_field(int argc, char **argv, +static struct synth_field *parse_synth_field(int argc, const char **argv, int *consumed) { struct synth_field *field; - const char *prefix = NULL; - char *field_type = argv[0], *field_name; + const char *prefix = NULL, *field_type = argv[0], *field_name, *array; int len, ret = 0; - char *array; if (field_type[0] == ';') field_type++; @@ -762,20 +801,31 @@ static struct synth_field *parse_synth_field(int argc, char **argv, *consumed = 2; } - len = strlen(field_name); - if (field_name[len - 1] == ';') - field_name[len - 1] = '\0'; - field = kzalloc(sizeof(*field), GFP_KERNEL); if (!field) return ERR_PTR(-ENOMEM); - len = strlen(field_type) + 1; + len = strlen(field_name); array = strchr(field_name, '['); + if (array) + len -= strlen(array); + else if (field_name[len - 1] == ';') + len--; + + field->name = kmemdup_nul(field_name, len, GFP_KERNEL); + if (!field->name) { + ret = -ENOMEM; + goto free; + } + + if (field_type[0] == ';') + field_type++; + len = strlen(field_type) + 1; if (array) len += strlen(array); if (prefix) len += strlen(prefix); + field->type = kzalloc(len, GFP_KERNEL); if (!field->type) { ret = -ENOMEM; @@ -786,7 +836,8 @@ static struct synth_field *parse_synth_field(int argc, char **argv, strcat(field->type, field_type); if (array) { strcat(field->type, array); - *array = '\0'; + if (field->type[len - 1] == ';') + field->type[len - 1] = '\0'; } field->size = synth_field_size(field->type); @@ -800,11 +851,6 @@ static struct synth_field *parse_synth_field(int argc, char **argv, field->is_signed = synth_field_signed(field->type); - field->name = kstrdup(field_name, GFP_KERNEL); - if (!field->name) { - ret = -ENOMEM; - goto free; - } out: return field; free: @@ -868,9 +914,13 @@ static inline void trace_synth(struct synth_event *event, u64 *var_ref_vals, static struct synth_event *find_synth_event(const char *name) { + struct dyn_event *pos; struct synth_event *event; - list_for_each_entry(event, &synth_event_list, list) { + for_each_dyn_event(pos) { + if (!is_synth_event(pos)) + continue; + event = to_synth_event(pos); if (strcmp(event->name, name) == 0) return event; } @@ -921,7 +971,7 @@ static int register_synth_event(struct synth_event *event) ret = set_synth_event_print_fmt(call); if (ret < 0) { - trace_remove_event_call(call); + trace_remove_event_call_nolock(call); goto err; } out: @@ -959,7 +1009,7 @@ static void free_synth_event(struct synth_event *event) kfree(event); } -static struct synth_event *alloc_synth_event(char *event_name, int n_fields, +static struct synth_event *alloc_synth_event(const char *name, int n_fields, struct synth_field **fields) { struct synth_event *event; @@ -971,7 +1021,7 @@ static struct synth_event *alloc_synth_event(char *event_name, int n_fields, goto out; } - event->name = kstrdup(event_name, GFP_KERNEL); + event->name = kstrdup(name, GFP_KERNEL); if (!event->name) { kfree(event); event = ERR_PTR(-ENOMEM); @@ -985,6 +1035,8 @@ static struct synth_event *alloc_synth_event(char *event_name, int n_fields, goto out; } + dyn_event_init(&event->devent, &synth_event_ops); + for (i = 0; i < n_fields; i++) event->fields[i] = fields[i]; @@ -1008,16 +1060,11 @@ struct hist_var_data { struct hist_trigger_data *hist_data; }; -static int create_synth_event(int argc, char **argv) +static int __create_synth_event(int argc, const char *name, const char **argv) { struct synth_field *field, *fields[SYNTH_FIELDS_MAX]; struct synth_event *event = NULL; - bool delete_event = false; int i, consumed = 0, n_fields = 0, ret = 0; - char *name; - - mutex_lock(&event_mutex); - mutex_lock(&synth_event_mutex); /* * Argument syntax: @@ -1025,43 +1072,20 @@ static int create_synth_event(int argc, char **argv) * - Remove synthetic event: ! field[;field] ... * where 'field' = type field_name */ - if (argc < 1) { - ret = -EINVAL; - goto out; - } - name = argv[0]; - if (name[0] == '!') { - delete_event = true; - name++; - } + if (name[0] == '\0' || argc < 1) + return -EINVAL; + + mutex_lock(&event_mutex); + mutex_lock(&synth_event_mutex); event = find_synth_event(name); if (event) { - if (delete_event) { - if (event->ref) { - ret = -EBUSY; - goto out; - } - ret = unregister_synth_event(event); - if (!ret) { - list_del(&event->list); - free_synth_event(event); - } - } else - ret = -EEXIST; - goto out; - } else if (delete_event) { - ret = -ENOENT; + ret = -EEXIST; goto out; } - if (argc < 2) { - ret = -EINVAL; - goto out; - } - - for (i = 1; i < argc - 1; i++) { + for (i = 0; i < argc - 1; i++) { if (strcmp(argv[i], ";") == 0) continue; if (n_fields == SYNTH_FIELDS_MAX) { @@ -1091,7 +1115,7 @@ static int create_synth_event(int argc, char **argv) } ret = register_synth_event(event); if (!ret) - list_add(&event->list, &synth_event_list); + dyn_event_add(&event->devent); else free_synth_event(event); out: @@ -1106,57 +1130,77 @@ static int create_synth_event(int argc, char **argv) goto out; } -static int release_all_synth_events(void) +static int create_or_delete_synth_event(int argc, char **argv) { - struct synth_event *event, *e; - int ret = 0; - - mutex_lock(&event_mutex); - mutex_lock(&synth_event_mutex); - - list_for_each_entry(event, &synth_event_list, list) { - if (event->ref) { - mutex_unlock(&synth_event_mutex); - return -EBUSY; - } - } + const char *name = argv[0]; + struct synth_event *event = NULL; + int ret; - list_for_each_entry_safe(event, e, &synth_event_list, list) { - ret = unregister_synth_event(event); - if (!ret) { - list_del(&event->list); - free_synth_event(event); + /* trace_run_command() ensures argc != 0 */ + if (name[0] == '!') { + mutex_lock(&event_mutex); + mutex_lock(&synth_event_mutex); + event = find_synth_event(name + 1); + if (event) { + if (event->ref) + ret = -EBUSY; + else { + ret = unregister_synth_event(event); + if (!ret) { + dyn_event_remove(&event->devent); + free_synth_event(event); + } + } } else - break; + ret = -ENOENT; + mutex_unlock(&synth_event_mutex); + mutex_unlock(&event_mutex); + return ret; } - mutex_unlock(&synth_event_mutex); - mutex_unlock(&event_mutex); - return ret; + ret = __create_synth_event(argc - 1, name, (const char **)argv + 1); + return ret == -ECANCELED ? -EINVAL : ret; } - -static void *synth_events_seq_start(struct seq_file *m, loff_t *pos) +static int synth_event_create(int argc, const char **argv) { - mutex_lock(&synth_event_mutex); + const char *name = argv[0]; + int len; - return seq_list_start(&synth_event_list, *pos); -} + if (name[0] != 's' || name[1] != ':') + return -ECANCELED; + name += 2; -static void *synth_events_seq_next(struct seq_file *m, void *v, loff_t *pos) -{ - return seq_list_next(v, &synth_event_list, pos); + /* This interface accepts group name prefix */ + if (strchr(name, '/')) { + len = sizeof(SYNTH_SYSTEM "/") - 1; + if (strncmp(name, SYNTH_SYSTEM "/", len)) + return -EINVAL; + name += len; + } + return __create_synth_event(argc - 1, name, argv + 1); } -static void synth_events_seq_stop(struct seq_file *m, void *v) +static int synth_event_release(struct dyn_event *ev) { - mutex_unlock(&synth_event_mutex); + struct synth_event *event = to_synth_event(ev); + int ret; + + if (event->ref) + return -EBUSY; + + ret = unregister_synth_event(event); + if (ret) + return ret; + + dyn_event_remove(ev); + free_synth_event(event); + return 0; } -static int synth_events_seq_show(struct seq_file *m, void *v) +static int __synth_event_show(struct seq_file *m, struct synth_event *event) { struct synth_field *field; - struct synth_event *event = v; unsigned int i; seq_printf(m, "%s\t", event->name); @@ -1174,11 +1218,30 @@ static int synth_events_seq_show(struct seq_file *m, void *v) return 0; } +static int synth_event_show(struct seq_file *m, struct dyn_event *ev) +{ + struct synth_event *event = to_synth_event(ev); + + seq_printf(m, "s:%s/", event->class.system); + + return __synth_event_show(m, event); +} + +static int synth_events_seq_show(struct seq_file *m, void *v) +{ + struct dyn_event *ev = v; + + if (!is_synth_event(ev)) + return 0; + + return __synth_event_show(m, to_synth_event(ev)); +} + static const struct seq_operations synth_events_seq_op = { - .start = synth_events_seq_start, - .next = synth_events_seq_next, - .stop = synth_events_seq_stop, - .show = synth_events_seq_show + .start = dyn_event_seq_start, + .next = dyn_event_seq_next, + .stop = dyn_event_seq_stop, + .show = synth_events_seq_show, }; static int synth_events_open(struct inode *inode, struct file *file) @@ -1186,7 +1249,7 @@ static int synth_events_open(struct inode *inode, struct file *file) int ret; if ((file->f_mode & FMODE_WRITE) && (file->f_flags & O_TRUNC)) { - ret = release_all_synth_events(); + ret = dyn_events_release_all(&synth_event_ops); if (ret < 0) return ret; } @@ -1199,7 +1262,7 @@ static ssize_t synth_events_write(struct file *file, size_t count, loff_t *ppos) { return trace_parse_run_command(file, buffer, count, ppos, - create_synth_event); + create_or_delete_synth_event); } static const struct file_operations synth_events_fops = { @@ -5791,6 +5854,12 @@ static __init int trace_events_hist_init(void) struct dentry *d_tracer; int err = 0; + err = dyn_event_register(&synth_event_ops); + if (err) { + pr_warn("Could not register synth_event_ops\n"); + return err; + } + d_tracer = tracing_init_dentry(); if (IS_ERR(d_tracer)) { err = PTR_ERR(d_tracer); -- cgit v1.3-7-g2ca7 From 0e2b81f7b52a1c1a8c46986f9ca01eb7b3c421f8 Mon Sep 17 00:00:00 2001 From: Masami Hiramatsu Date: Mon, 5 Nov 2018 18:04:01 +0900 Subject: tracing: Remove unneeded synth_event_mutex Rmove unneeded synth_event_mutex. This mutex protects the reference count in synth_event, however, those operational points are already protected by event_mutex. 1. In __create_synth_event() and create_or_delete_synth_event(), those synth_event_mutex clearly obtained right after event_mutex. 2. event_hist_trigger_func() is trigger_hist_cmd.func() which is called by trigger_process_regex(), which is a part of event_trigger_regex_write() and this function takes event_mutex. 3. hist_unreg_all() is trigger_hist_cmd.unreg_all() which is called by event_trigger_regex_open() and it takes event_mutex. 4. onmatch_destroy() and onmatch_create() have long call tree, but both are finally invoked from event_trigger_regex_write() and event_trace_del_tracer(), former takes event_mutex, and latter ensures called under event_mutex locked. Finally, I ensured there is no resource conflict. For safety, I added lockdep_assert_held(&event_mutex) for each function. Link: http://lkml.kernel.org/r/154140864134.17322.4796059721306031894.stgit@devbox Reviewed-by: Tom Zanussi Tested-by: Tom Zanussi Signed-off-by: Masami Hiramatsu Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_events_hist.c | 30 +++++++----------------------- 1 file changed, 7 insertions(+), 23 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c index 414aabd67d1f..21e4954375a1 100644 --- a/kernel/trace/trace_events_hist.c +++ b/kernel/trace/trace_events_hist.c @@ -444,8 +444,6 @@ static bool have_hist_err(void) return false; } -static DEFINE_MUTEX(synth_event_mutex); - struct synth_trace_event { struct trace_entry ent; u64 fields[]; @@ -1077,7 +1075,6 @@ static int __create_synth_event(int argc, const char *name, const char **argv) return -EINVAL; mutex_lock(&event_mutex); - mutex_lock(&synth_event_mutex); event = find_synth_event(name); if (event) { @@ -1119,7 +1116,6 @@ static int __create_synth_event(int argc, const char *name, const char **argv) else free_synth_event(event); out: - mutex_unlock(&synth_event_mutex); mutex_unlock(&event_mutex); return ret; @@ -1139,7 +1135,6 @@ static int create_or_delete_synth_event(int argc, char **argv) /* trace_run_command() ensures argc != 0 */ if (name[0] == '!') { mutex_lock(&event_mutex); - mutex_lock(&synth_event_mutex); event = find_synth_event(name + 1); if (event) { if (event->ref) @@ -1153,7 +1148,6 @@ static int create_or_delete_synth_event(int argc, char **argv) } } else ret = -ENOENT; - mutex_unlock(&synth_event_mutex); mutex_unlock(&event_mutex); return ret; } @@ -3535,7 +3529,7 @@ static void onmatch_destroy(struct action_data *data) { unsigned int i; - mutex_lock(&synth_event_mutex); + lockdep_assert_held(&event_mutex); kfree(data->onmatch.match_event); kfree(data->onmatch.match_event_system); @@ -3548,8 +3542,6 @@ static void onmatch_destroy(struct action_data *data) data->onmatch.synth_event->ref--; kfree(data); - - mutex_unlock(&synth_event_mutex); } static void destroy_field_var(struct field_var *field_var) @@ -3700,15 +3692,14 @@ static int onmatch_create(struct hist_trigger_data *hist_data, struct synth_event *event; int ret = 0; - mutex_lock(&synth_event_mutex); + lockdep_assert_held(&event_mutex); + event = find_synth_event(data->onmatch.synth_event_name); if (!event) { hist_err("onmatch: Couldn't find synthetic event: ", data->onmatch.synth_event_name); - mutex_unlock(&synth_event_mutex); return -EINVAL; } event->ref++; - mutex_unlock(&synth_event_mutex); var_ref_idx = hist_data->n_var_refs; @@ -3782,9 +3773,7 @@ static int onmatch_create(struct hist_trigger_data *hist_data, out: return ret; err: - mutex_lock(&synth_event_mutex); event->ref--; - mutex_unlock(&synth_event_mutex); goto out; } @@ -5492,6 +5481,8 @@ static void hist_unreg_all(struct trace_event_file *file) struct synth_event *se; const char *se_name; + lockdep_assert_held(&event_mutex); + if (hist_file_check_refs(file)) return; @@ -5501,12 +5492,10 @@ static void hist_unreg_all(struct trace_event_file *file) list_del_rcu(&test->list); trace_event_trigger_enable_disable(file, 0); - mutex_lock(&synth_event_mutex); se_name = trace_event_name(file->event_call); se = find_synth_event(se_name); if (se) se->ref--; - mutex_unlock(&synth_event_mutex); update_cond_flag(file); if (hist_data->enable_timestamps) @@ -5532,6 +5521,8 @@ static int event_hist_trigger_func(struct event_command *cmd_ops, char *trigger, *p; int ret = 0; + lockdep_assert_held(&event_mutex); + if (glob && strlen(glob)) { last_cmd_set(param); hist_err_clear(); @@ -5622,14 +5613,10 @@ static int event_hist_trigger_func(struct event_command *cmd_ops, } cmd_ops->unreg(glob+1, trigger_ops, trigger_data, file); - - mutex_lock(&synth_event_mutex); se_name = trace_event_name(file->event_call); se = find_synth_event(se_name); if (se) se->ref--; - mutex_unlock(&synth_event_mutex); - ret = 0; goto out_free; } @@ -5665,13 +5652,10 @@ enable: if (ret) goto out_unreg; - mutex_lock(&synth_event_mutex); se_name = trace_event_name(file->event_call); se = find_synth_event(se_name); if (se) se->ref++; - mutex_unlock(&synth_event_mutex); - /* Just return zero, not the number of registered triggers */ ret = 0; out: -- cgit v1.3-7-g2ca7 From c454a46b5efd8eff8880e88ece2976e60a26bf35 Mon Sep 17 00:00:00 2001 From: Martin KaFai Lau Date: Fri, 7 Dec 2018 16:42:25 -0800 Subject: bpf: Add bpf_line_info support This patch adds bpf_line_info support. It accepts an array of bpf_line_info objects during BPF_PROG_LOAD. The "line_info", "line_info_cnt" and "line_info_rec_size" are added to the "union bpf_attr". The "line_info_rec_size" makes bpf_line_info extensible in the future. The new "check_btf_line()" ensures the userspace line_info is valid for the kernel to use. When the verifier is translating/patching the bpf_prog (through "bpf_patch_insn_single()"), the line_infos' insn_off is also adjusted by the newly added "bpf_adj_linfo()". If the bpf_prog is jited, this patch also provides the jited addrs (in aux->jited_linfo) for the corresponding line_info.insn_off. "bpf_prog_fill_jited_linfo()" is added to fill the aux->jited_linfo. It is currently called by the x86 jit. Other jits can also use "bpf_prog_fill_jited_linfo()" and it will be done in the followup patches. In the future, if it deemed necessary, a particular jit could also provide its own "bpf_prog_fill_jited_linfo()" implementation. A few "*line_info*" fields are added to the bpf_prog_info such that the user can get the xlated line_info back (i.e. the line_info with its insn_off reflecting the translated prog). The jited_line_info is available if the prog is jited. It is an array of __u64. If the prog is not jited, jited_line_info_cnt is 0. The verifier's verbose log with line_info will be done in a follow up patch. Signed-off-by: Martin KaFai Lau Acked-by: Yonghong Song Signed-off-by: Alexei Starovoitov --- arch/x86/net/bpf_jit_comp.c | 2 + include/linux/bpf.h | 21 +++++ include/linux/bpf_verifier.h | 1 + include/linux/btf.h | 1 + include/linux/filter.h | 7 ++ include/uapi/linux/bpf.h | 19 +++++ kernel/bpf/btf.c | 2 +- kernel/bpf/core.c | 118 +++++++++++++++++++++++++- kernel/bpf/syscall.c | 83 ++++++++++++++++-- kernel/bpf/verifier.c | 198 +++++++++++++++++++++++++++++++++++++------ 10 files changed, 419 insertions(+), 33 deletions(-) (limited to 'kernel') diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c index 2580cd2e98b1..5542303c43d9 100644 --- a/arch/x86/net/bpf_jit_comp.c +++ b/arch/x86/net/bpf_jit_comp.c @@ -1181,6 +1181,8 @@ out_image: } if (!image || !prog->is_func || extra_pass) { + if (image) + bpf_prog_fill_jited_linfo(prog, addrs); out_addrs: kfree(addrs); kfree(jit_data); diff --git a/include/linux/bpf.h b/include/linux/bpf.h index e82b7039fc66..0c992b86eb2c 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -319,7 +319,28 @@ struct bpf_prog_aux { struct bpf_prog_offload *offload; struct btf *btf; struct bpf_func_info *func_info; + /* bpf_line_info loaded from userspace. linfo->insn_off + * has the xlated insn offset. + * Both the main and sub prog share the same linfo. + * The subprog can access its first linfo by + * using the linfo_idx. + */ + struct bpf_line_info *linfo; + /* jited_linfo is the jited addr of the linfo. It has a + * one to one mapping to linfo: + * jited_linfo[i] is the jited addr for the linfo[i]->insn_off. + * Both the main and sub prog share the same jited_linfo. + * The subprog can access its first jited_linfo by + * using the linfo_idx. + */ + void **jited_linfo; u32 func_info_cnt; + u32 nr_linfo; + /* subprog can use linfo_idx to access its first linfo and + * jited_linfo. + * main prog always has linfo_idx == 0 + */ + u32 linfo_idx; union { struct work_struct work; struct rcu_head rcu; diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h index 11f5df1092d9..c736945be7c5 100644 --- a/include/linux/bpf_verifier.h +++ b/include/linux/bpf_verifier.h @@ -203,6 +203,7 @@ static inline bool bpf_verifier_log_needed(const struct bpf_verifier_log *log) struct bpf_subprog_info { u32 start; /* insn idx of function entry point */ + u32 linfo_idx; /* The idx to the main_prog->aux->linfo */ u16 stack_depth; /* max. stack depth used by this function */ }; diff --git a/include/linux/btf.h b/include/linux/btf.h index 8c2199b5d250..b98405a56383 100644 --- a/include/linux/btf.h +++ b/include/linux/btf.h @@ -46,6 +46,7 @@ void btf_type_seq_show(const struct btf *btf, u32 type_id, void *obj, struct seq_file *m); int btf_get_fd_by_id(u32 id); u32 btf_id(const struct btf *btf); +bool btf_name_offset_valid(const struct btf *btf, u32 offset); #ifdef CONFIG_BPF_SYSCALL const struct btf_type *btf_type_by_id(const struct btf *btf, u32 type_id); diff --git a/include/linux/filter.h b/include/linux/filter.h index d16deead65c6..29f21f9d7f68 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -718,6 +718,13 @@ void bpf_prog_free(struct bpf_prog *fp); bool bpf_opcode_in_insntable(u8 code); +void bpf_prog_free_linfo(struct bpf_prog *prog); +void bpf_prog_fill_jited_linfo(struct bpf_prog *prog, + const u32 *insn_to_jit_off); +int bpf_prog_alloc_jited_linfo(struct bpf_prog *prog); +void bpf_prog_free_jited_linfo(struct bpf_prog *prog); +void bpf_prog_free_unused_jited_linfo(struct bpf_prog *prog); + struct bpf_prog *bpf_prog_alloc(unsigned int size, gfp_t gfp_extra_flags); struct bpf_prog *bpf_prog_realloc(struct bpf_prog *fp_old, unsigned int size, gfp_t gfp_extra_flags); diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index a84fd232d934..7a66db8d15d5 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -356,6 +356,9 @@ union bpf_attr { __u32 func_info_rec_size; /* userspace bpf_func_info size */ __aligned_u64 func_info; /* func info */ __u32 func_info_cnt; /* number of bpf_func_info records */ + __u32 line_info_rec_size; /* userspace bpf_line_info size */ + __aligned_u64 line_info; /* line info */ + __u32 line_info_cnt; /* number of bpf_line_info records */ }; struct { /* anonymous struct used by BPF_OBJ_* commands */ @@ -2679,6 +2682,12 @@ struct bpf_prog_info { __u32 func_info_rec_size; __aligned_u64 func_info; __u32 func_info_cnt; + __u32 line_info_cnt; + __aligned_u64 line_info; + __aligned_u64 jited_line_info; + __u32 jited_line_info_cnt; + __u32 line_info_rec_size; + __u32 jited_line_info_rec_size; } __attribute__((aligned(8))); struct bpf_map_info { @@ -2995,4 +3004,14 @@ struct bpf_func_info { __u32 type_id; }; +#define BPF_LINE_INFO_LINE_NUM(line_col) ((line_col) >> 10) +#define BPF_LINE_INFO_LINE_COL(line_col) ((line_col) & 0x3ff) + +struct bpf_line_info { + __u32 insn_off; + __u32 file_name_off; + __u32 line_off; + __u32 line_col; +}; + #endif /* _UAPI__LINUX_BPF_H__ */ diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index a09b2f94ab25..e0a827f95e19 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -444,7 +444,7 @@ static const struct btf_kind_operations *btf_type_ops(const struct btf_type *t) return kind_ops[BTF_INFO_KIND(t->info)]; } -static bool btf_name_offset_valid(const struct btf *btf, u32 offset) +bool btf_name_offset_valid(const struct btf *btf, u32 offset) { return BTF_STR_OFFSET_VALID(offset) && offset < btf->hdr.str_len; diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index a5b223ef7131..5cdd8da0e7f2 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -105,6 +105,91 @@ struct bpf_prog *bpf_prog_alloc(unsigned int size, gfp_t gfp_extra_flags) } EXPORT_SYMBOL_GPL(bpf_prog_alloc); +int bpf_prog_alloc_jited_linfo(struct bpf_prog *prog) +{ + if (!prog->aux->nr_linfo || !prog->jit_requested) + return 0; + + prog->aux->jited_linfo = kcalloc(prog->aux->nr_linfo, + sizeof(*prog->aux->jited_linfo), + GFP_KERNEL | __GFP_NOWARN); + if (!prog->aux->jited_linfo) + return -ENOMEM; + + return 0; +} + +void bpf_prog_free_jited_linfo(struct bpf_prog *prog) +{ + kfree(prog->aux->jited_linfo); + prog->aux->jited_linfo = NULL; +} + +void bpf_prog_free_unused_jited_linfo(struct bpf_prog *prog) +{ + if (prog->aux->jited_linfo && !prog->aux->jited_linfo[0]) + bpf_prog_free_jited_linfo(prog); +} + +/* The jit engine is responsible to provide an array + * for insn_off to the jited_off mapping (insn_to_jit_off). + * + * The idx to this array is the insn_off. Hence, the insn_off + * here is relative to the prog itself instead of the main prog. + * This array has one entry for each xlated bpf insn. + * + * jited_off is the byte off to the last byte of the jited insn. + * + * Hence, with + * insn_start: + * The first bpf insn off of the prog. The insn off + * here is relative to the main prog. + * e.g. if prog is a subprog, insn_start > 0 + * linfo_idx: + * The prog's idx to prog->aux->linfo and jited_linfo + * + * jited_linfo[linfo_idx] = prog->bpf_func + * + * For i > linfo_idx, + * + * jited_linfo[i] = prog->bpf_func + + * insn_to_jit_off[linfo[i].insn_off - insn_start - 1] + */ +void bpf_prog_fill_jited_linfo(struct bpf_prog *prog, + const u32 *insn_to_jit_off) +{ + u32 linfo_idx, insn_start, insn_end, nr_linfo, i; + const struct bpf_line_info *linfo; + void **jited_linfo; + + if (!prog->aux->jited_linfo) + /* Userspace did not provide linfo */ + return; + + linfo_idx = prog->aux->linfo_idx; + linfo = &prog->aux->linfo[linfo_idx]; + insn_start = linfo[0].insn_off; + insn_end = insn_start + prog->len; + + jited_linfo = &prog->aux->jited_linfo[linfo_idx]; + jited_linfo[0] = prog->bpf_func; + + nr_linfo = prog->aux->nr_linfo - linfo_idx; + + for (i = 1; i < nr_linfo && linfo[i].insn_off < insn_end; i++) + /* The verifier ensures that linfo[i].insn_off is + * strictly increasing + */ + jited_linfo[i] = prog->bpf_func + + insn_to_jit_off[linfo[i].insn_off - insn_start - 1]; +} + +void bpf_prog_free_linfo(struct bpf_prog *prog) +{ + bpf_prog_free_jited_linfo(prog); + kvfree(prog->aux->linfo); +} + struct bpf_prog *bpf_prog_realloc(struct bpf_prog *fp_old, unsigned int size, gfp_t gfp_extra_flags) { @@ -294,6 +379,26 @@ static int bpf_adj_branches(struct bpf_prog *prog, u32 pos, u32 delta, return ret; } +static void bpf_adj_linfo(struct bpf_prog *prog, u32 off, u32 delta) +{ + struct bpf_line_info *linfo; + u32 i, nr_linfo; + + nr_linfo = prog->aux->nr_linfo; + if (!nr_linfo || !delta) + return; + + linfo = prog->aux->linfo; + + for (i = 0; i < nr_linfo; i++) + if (off < linfo[i].insn_off) + break; + + /* Push all off < linfo[i].insn_off by delta */ + for (; i < nr_linfo; i++) + linfo[i].insn_off += delta; +} + struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off, const struct bpf_insn *patch, u32 len) { @@ -349,6 +454,8 @@ struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off, */ BUG_ON(bpf_adj_branches(prog_adj, off, insn_delta, false)); + bpf_adj_linfo(prog_adj, off, insn_delta); + return prog_adj; } @@ -1591,13 +1698,20 @@ struct bpf_prog *bpf_prog_select_runtime(struct bpf_prog *fp, int *err) * be JITed, but falls back to the interpreter. */ if (!bpf_prog_is_dev_bound(fp->aux)) { + *err = bpf_prog_alloc_jited_linfo(fp); + if (*err) + return fp; + fp = bpf_int_jit_compile(fp); -#ifdef CONFIG_BPF_JIT_ALWAYS_ON if (!fp->jited) { + bpf_prog_free_jited_linfo(fp); +#ifdef CONFIG_BPF_JIT_ALWAYS_ON *err = -ENOTSUPP; return fp; - } #endif + } else { + bpf_prog_free_unused_jited_linfo(fp); + } } else { *err = bpf_prog_offload_compile(fp); if (*err) diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index aa05aa38f4a8..19c88cff7880 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -1215,6 +1215,7 @@ static void __bpf_prog_put(struct bpf_prog *prog, bool do_idr_lock) bpf_prog_kallsyms_del_all(prog); btf_put(prog->aux->btf); kvfree(prog->aux->func_info); + bpf_prog_free_linfo(prog); call_rcu(&prog->aux->rcu, __bpf_prog_put_rcu); } @@ -1439,7 +1440,7 @@ bpf_prog_load_check_attach_type(enum bpf_prog_type prog_type, } /* last field in 'union bpf_attr' used by this command */ -#define BPF_PROG_LOAD_LAST_FIELD func_info_cnt +#define BPF_PROG_LOAD_LAST_FIELD line_info_cnt static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr) { @@ -1560,6 +1561,7 @@ static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr) return err; free_used_maps: + bpf_prog_free_linfo(prog); kvfree(prog->aux->func_info); btf_put(prog->aux->btf); bpf_prog_kallsyms_del_subprogs(prog); @@ -2041,6 +2043,37 @@ static struct bpf_insn *bpf_insn_prepare_dump(const struct bpf_prog *prog) return insns; } +static int set_info_rec_size(struct bpf_prog_info *info) +{ + /* + * Ensure info.*_rec_size is the same as kernel expected size + * + * or + * + * Only allow zero *_rec_size if both _rec_size and _cnt are + * zero. In this case, the kernel will set the expected + * _rec_size back to the info. + */ + + if ((info->func_info_cnt || info->func_info_rec_size) && + info->func_info_rec_size != sizeof(struct bpf_func_info)) + return -EINVAL; + + if ((info->line_info_cnt || info->line_info_rec_size) && + info->line_info_rec_size != sizeof(struct bpf_line_info)) + return -EINVAL; + + if ((info->jited_line_info_cnt || info->jited_line_info_rec_size) && + info->jited_line_info_rec_size != sizeof(__u64)) + return -EINVAL; + + info->func_info_rec_size = sizeof(struct bpf_func_info); + info->line_info_rec_size = sizeof(struct bpf_line_info); + info->jited_line_info_rec_size = sizeof(__u64); + + return 0; +} + static int bpf_prog_get_info_by_fd(struct bpf_prog *prog, const union bpf_attr *attr, union bpf_attr __user *uattr) @@ -2083,11 +2116,9 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog, return -EFAULT; } - if ((info.func_info_cnt || info.func_info_rec_size) && - info.func_info_rec_size != sizeof(struct bpf_func_info)) - return -EINVAL; - - info.func_info_rec_size = sizeof(struct bpf_func_info); + err = set_info_rec_size(&info); + if (err) + return err; if (!capable(CAP_SYS_ADMIN)) { info.jited_prog_len = 0; @@ -2095,6 +2126,8 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog, info.nr_jited_ksyms = 0; info.nr_jited_func_lens = 0; info.func_info_cnt = 0; + info.line_info_cnt = 0; + info.jited_line_info_cnt = 0; goto done; } @@ -2251,6 +2284,44 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog, } } + ulen = info.line_info_cnt; + info.line_info_cnt = prog->aux->nr_linfo; + if (info.line_info_cnt && ulen) { + if (bpf_dump_raw_ok()) { + __u8 __user *user_linfo; + + user_linfo = u64_to_user_ptr(info.line_info); + ulen = min_t(u32, info.line_info_cnt, ulen); + if (copy_to_user(user_linfo, prog->aux->linfo, + info.line_info_rec_size * ulen)) + return -EFAULT; + } else { + info.line_info = 0; + } + } + + ulen = info.jited_line_info_cnt; + if (prog->aux->jited_linfo) + info.jited_line_info_cnt = prog->aux->nr_linfo; + else + info.jited_line_info_cnt = 0; + if (info.jited_line_info_cnt && ulen) { + if (bpf_dump_raw_ok()) { + __u64 __user *user_linfo; + u32 i; + + user_linfo = u64_to_user_ptr(info.jited_line_info); + ulen = min_t(u32, info.jited_line_info_cnt, ulen); + for (i = 0; i < ulen; i++) { + if (put_user((__u64)(long)prog->aux->jited_linfo[i], + &user_linfo[i])) + return -EFAULT; + } + } else { + info.jited_line_info = 0; + } + } + done: if (copy_to_user(uinfo, &info, info_len) || put_user(info_len, &uattr->info.info_len)) diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 2752d35ad073..9d25506bd55a 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -4640,15 +4640,17 @@ err_free: #define MIN_BPF_FUNCINFO_SIZE 8 #define MAX_FUNCINFO_REC_SIZE 252 -static int check_btf_func(struct bpf_prog *prog, struct bpf_verifier_env *env, - union bpf_attr *attr, union bpf_attr __user *uattr) +static int check_btf_func(struct bpf_verifier_env *env, + const union bpf_attr *attr, + union bpf_attr __user *uattr) { u32 i, nfuncs, urec_size, min_size, prev_offset; u32 krec_size = sizeof(struct bpf_func_info); - struct bpf_func_info *krecord = NULL; + struct bpf_func_info *krecord; const struct btf_type *type; + struct bpf_prog *prog; + const struct btf *btf; void __user *urecord; - struct btf *btf; int ret = 0; nfuncs = attr->func_info_cnt; @@ -4668,20 +4670,15 @@ static int check_btf_func(struct bpf_prog *prog, struct bpf_verifier_env *env, return -EINVAL; } - btf = btf_get_by_fd(attr->prog_btf_fd); - if (IS_ERR(btf)) { - verbose(env, "unable to get btf from fd\n"); - return PTR_ERR(btf); - } + prog = env->prog; + btf = prog->aux->btf; urecord = u64_to_user_ptr(attr->func_info); min_size = min_t(u32, krec_size, urec_size); krecord = kvcalloc(nfuncs, krec_size, GFP_KERNEL | __GFP_NOWARN); - if (!krecord) { - ret = -ENOMEM; - goto free_btf; - } + if (!krecord) + return -ENOMEM; for (i = 0; i < nfuncs; i++) { ret = bpf_check_uarg_tail_zero(urecord, krec_size, urec_size); @@ -4694,12 +4691,12 @@ static int check_btf_func(struct bpf_prog *prog, struct bpf_verifier_env *env, if (put_user(min_size, &uattr->func_info_rec_size)) ret = -EFAULT; } - goto free_btf; + goto err_free; } if (copy_from_user(&krecord[i], urecord, min_size)) { ret = -EFAULT; - goto free_btf; + goto err_free; } /* check insn_off */ @@ -4709,20 +4706,20 @@ static int check_btf_func(struct bpf_prog *prog, struct bpf_verifier_env *env, "nonzero insn_off %u for the first func info record", krecord[i].insn_off); ret = -EINVAL; - goto free_btf; + goto err_free; } } else if (krecord[i].insn_off <= prev_offset) { verbose(env, "same or smaller insn offset (%u) than previous func info record (%u)", krecord[i].insn_off, prev_offset); ret = -EINVAL; - goto free_btf; + goto err_free; } if (env->subprog_info[i].start != krecord[i].insn_off) { verbose(env, "func_info BTF section doesn't match subprog layout in BPF program\n"); ret = -EINVAL; - goto free_btf; + goto err_free; } /* check type_id */ @@ -4731,20 +4728,18 @@ static int check_btf_func(struct bpf_prog *prog, struct bpf_verifier_env *env, verbose(env, "invalid type id %d in func info", krecord[i].type_id); ret = -EINVAL; - goto free_btf; + goto err_free; } prev_offset = krecord[i].insn_off; urecord += urec_size; } - prog->aux->btf = btf; prog->aux->func_info = krecord; prog->aux->func_info_cnt = nfuncs; return 0; -free_btf: - btf_put(btf); +err_free: kvfree(krecord); return ret; } @@ -4760,6 +4755,150 @@ static void adjust_btf_func(struct bpf_verifier_env *env) env->prog->aux->func_info[i].insn_off = env->subprog_info[i].start; } +#define MIN_BPF_LINEINFO_SIZE (offsetof(struct bpf_line_info, line_col) + \ + sizeof(((struct bpf_line_info *)(0))->line_col)) +#define MAX_LINEINFO_REC_SIZE MAX_FUNCINFO_REC_SIZE + +static int check_btf_line(struct bpf_verifier_env *env, + const union bpf_attr *attr, + union bpf_attr __user *uattr) +{ + u32 i, s, nr_linfo, ncopy, expected_size, rec_size, prev_offset = 0; + struct bpf_subprog_info *sub; + struct bpf_line_info *linfo; + struct bpf_prog *prog; + const struct btf *btf; + void __user *ulinfo; + int err; + + nr_linfo = attr->line_info_cnt; + if (!nr_linfo) + return 0; + + rec_size = attr->line_info_rec_size; + if (rec_size < MIN_BPF_LINEINFO_SIZE || + rec_size > MAX_LINEINFO_REC_SIZE || + rec_size & (sizeof(u32) - 1)) + return -EINVAL; + + /* Need to zero it in case the userspace may + * pass in a smaller bpf_line_info object. + */ + linfo = kvcalloc(nr_linfo, sizeof(struct bpf_line_info), + GFP_KERNEL | __GFP_NOWARN); + if (!linfo) + return -ENOMEM; + + prog = env->prog; + btf = prog->aux->btf; + + s = 0; + sub = env->subprog_info; + ulinfo = u64_to_user_ptr(attr->line_info); + expected_size = sizeof(struct bpf_line_info); + ncopy = min_t(u32, expected_size, rec_size); + for (i = 0; i < nr_linfo; i++) { + err = bpf_check_uarg_tail_zero(ulinfo, expected_size, rec_size); + if (err) { + if (err == -E2BIG) { + verbose(env, "nonzero tailing record in line_info"); + if (put_user(expected_size, + &uattr->line_info_rec_size)) + err = -EFAULT; + } + goto err_free; + } + + if (copy_from_user(&linfo[i], ulinfo, ncopy)) { + err = -EFAULT; + goto err_free; + } + + /* + * Check insn_off to ensure + * 1) strictly increasing AND + * 2) bounded by prog->len + * + * The linfo[0].insn_off == 0 check logically falls into + * the later "missing bpf_line_info for func..." case + * because the first linfo[0].insn_off must be the + * first sub also and the first sub must have + * subprog_info[0].start == 0. + */ + if ((i && linfo[i].insn_off <= prev_offset) || + linfo[i].insn_off >= prog->len) { + verbose(env, "Invalid line_info[%u].insn_off:%u (prev_offset:%u prog->len:%u)\n", + i, linfo[i].insn_off, prev_offset, + prog->len); + err = -EINVAL; + goto err_free; + } + + if (!btf_name_offset_valid(btf, linfo[i].line_off) || + !btf_name_offset_valid(btf, linfo[i].file_name_off)) { + verbose(env, "Invalid line_info[%u].line_off or .file_name_off\n", i); + err = -EINVAL; + goto err_free; + } + + if (s != env->subprog_cnt) { + if (linfo[i].insn_off == sub[s].start) { + sub[s].linfo_idx = i; + s++; + } else if (sub[s].start < linfo[i].insn_off) { + verbose(env, "missing bpf_line_info for func#%u\n", s); + err = -EINVAL; + goto err_free; + } + } + + prev_offset = linfo[i].insn_off; + ulinfo += rec_size; + } + + if (s != env->subprog_cnt) { + verbose(env, "missing bpf_line_info for %u funcs starting from func#%u\n", + env->subprog_cnt - s, s); + err = -EINVAL; + goto err_free; + } + + prog->aux->linfo = linfo; + prog->aux->nr_linfo = nr_linfo; + + return 0; + +err_free: + kvfree(linfo); + return err; +} + +static int check_btf_info(struct bpf_verifier_env *env, + const union bpf_attr *attr, + union bpf_attr __user *uattr) +{ + struct btf *btf; + int err; + + if (!attr->func_info_cnt && !attr->line_info_cnt) + return 0; + + btf = btf_get_by_fd(attr->prog_btf_fd); + if (IS_ERR(btf)) + return PTR_ERR(btf); + env->prog->aux->btf = btf; + + err = check_btf_func(env, attr, uattr); + if (err) + return err; + + err = check_btf_line(env, attr, uattr); + if (err) + return err; + + return 0; +} + /* check %cur's range satisfies %old's */ static bool range_within(struct bpf_reg_state *old, struct bpf_reg_state *cur) @@ -6004,7 +6143,7 @@ static int jit_subprogs(struct bpf_verifier_env *env) int i, j, subprog_start, subprog_end = 0, len, subprog; struct bpf_insn *insn; void *old_bpf_func; - int err = -ENOMEM; + int err; if (env->subprog_cnt <= 1) return 0; @@ -6035,6 +6174,11 @@ static int jit_subprogs(struct bpf_verifier_env *env) insn->imm = 1; } + err = bpf_prog_alloc_jited_linfo(prog); + if (err) + goto out_undo_insn; + + err = -ENOMEM; func = kcalloc(env->subprog_cnt, sizeof(prog), GFP_KERNEL); if (!func) goto out_undo_insn; @@ -6065,6 +6209,10 @@ static int jit_subprogs(struct bpf_verifier_env *env) func[i]->aux->name[0] = 'F'; func[i]->aux->stack_depth = env->subprog_info[i].stack_depth; func[i]->jit_requested = 1; + func[i]->aux->linfo = prog->aux->linfo; + func[i]->aux->nr_linfo = prog->aux->nr_linfo; + func[i]->aux->jited_linfo = prog->aux->jited_linfo; + func[i]->aux->linfo_idx = env->subprog_info[i].linfo_idx; func[i] = bpf_int_jit_compile(func[i]); if (!func[i]->jited) { err = -ENOTSUPP; @@ -6138,6 +6286,7 @@ static int jit_subprogs(struct bpf_verifier_env *env) prog->bpf_func = func[0]->bpf_func; prog->aux->func = func; prog->aux->func_cnt = env->subprog_cnt; + bpf_prog_free_unused_jited_linfo(prog); return 0; out_free: for (i = 0; i < env->subprog_cnt; i++) @@ -6154,6 +6303,7 @@ out_undo_insn: insn->off = 0; insn->imm = env->insn_aux_data[i].call_imm; } + bpf_prog_free_jited_linfo(prog); return err; } @@ -6526,7 +6676,7 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr, if (ret < 0) goto skip_full_check; - ret = check_btf_func(env->prog, env, attr, uattr); + ret = check_btf_info(env, attr, uattr); if (ret < 0) goto skip_full_check; -- cgit v1.3-7-g2ca7 From e80c1a9d5f514ce5134c6c4263a11607341466c9 Mon Sep 17 00:00:00 2001 From: Tetsuo Handa Date: Tue, 4 Dec 2018 19:00:01 +0900 Subject: printk: fix printk_time race. Since printk_time can be toggled via /sys/module/printk/parameters/time , it is not safe to assume that output length does not change across multiple msg_print_text() calls. If we hit this race, we can observe failures such as SYSLOG_ACTION_READ_ALL writes more bytes than userspace has supplied, SYSLOG_ACTION_SIZE_UNREAD returns -EFAULT when succeeded, SYSLOG_ACTION_READ reads garbage memory or even triggers an kernel oops at _copy_to_user() due to integer overflow. To close this race, get a snapshot value of printk_time and pass it to SYSLOG_ACTION_READ, SYSLOG_ACTION_READ_ALL, SYSLOG_ACTION_SIZE_UNREAD and kmsg_dump_get_buffer(). Link: http://lkml.kernel.org/r/555af37c-b9e0-f940-cb73-a78eba2d4944@i-love.sakura.ne.jp To: Sergey Senozhatsky Cc: linux-kernel@vger.kernel.org Signed-off-by: Tetsuo Handa Reviewed-by: Sergey Senozhatsky Signed-off-by: Petr Mladek --- kernel/printk/printk.c | 70 ++++++++++++++++++++++++++++---------------------- 1 file changed, 39 insertions(+), 31 deletions(-) (limited to 'kernel') diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index a1d88212a5d2..c7d217764db3 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -404,6 +404,7 @@ DECLARE_WAIT_QUEUE_HEAD(log_wait); static u64 syslog_seq; static u32 syslog_idx; static size_t syslog_partial; +static bool syslog_time; /* index and sequence number of the first record stored in the buffer */ static u64 log_first_seq; @@ -1229,12 +1230,7 @@ module_param_named(time, printk_time, bool, S_IRUGO | S_IWUSR); static size_t print_time(u64 ts, char *buf) { - unsigned long rem_nsec; - - if (!printk_time) - return 0; - - rem_nsec = do_div(ts, 1000000000); + unsigned long rem_nsec = do_div(ts, 1000000000); if (!buf) return snprintf(NULL, 0, "[%5lu.000000] ", (unsigned long)ts); @@ -1243,7 +1239,8 @@ static size_t print_time(u64 ts, char *buf) (unsigned long)ts, rem_nsec / 1000); } -static size_t print_prefix(const struct printk_log *msg, bool syslog, char *buf) +static size_t print_prefix(const struct printk_log *msg, bool syslog, + bool time, char *buf) { size_t len = 0; unsigned int prefix = (msg->facility << 3) | msg->level; @@ -1262,11 +1259,13 @@ static size_t print_prefix(const struct printk_log *msg, bool syslog, char *buf) } } - len += print_time(msg->ts_nsec, buf ? buf + len : NULL); + if (time) + len += print_time(msg->ts_nsec, buf ? buf + len : NULL); return len; } -static size_t msg_print_text(const struct printk_log *msg, bool syslog, char *buf, size_t size) +static size_t msg_print_text(const struct printk_log *msg, bool syslog, + bool time, char *buf, size_t size) { const char *text = log_text(msg); size_t text_size = msg->text_len; @@ -1285,17 +1284,17 @@ static size_t msg_print_text(const struct printk_log *msg, bool syslog, char *bu } if (buf) { - if (print_prefix(msg, syslog, NULL) + + if (print_prefix(msg, syslog, time, NULL) + text_len + 1 >= size - len) break; - len += print_prefix(msg, syslog, buf + len); + len += print_prefix(msg, syslog, time, buf + len); memcpy(buf + len, text, text_len); len += text_len; buf[len++] = '\n'; } else { /* SYSLOG_ACTION_* buffer size only calculation */ - len += print_prefix(msg, syslog, NULL); + len += print_prefix(msg, syslog, time, NULL); len += text_len; len++; } @@ -1332,9 +1331,17 @@ static int syslog_print(char __user *buf, int size) break; } + /* + * To keep reading/counting partial line consistent, + * use printk_time value as of the beginning of a line. + */ + if (!syslog_partial) + syslog_time = printk_time; + skip = syslog_partial; msg = log_from_idx(syslog_idx); - n = msg_print_text(msg, true, text, LOG_LINE_MAX + PREFIX_MAX); + n = msg_print_text(msg, true, syslog_time, text, + LOG_LINE_MAX + PREFIX_MAX); if (n - syslog_partial <= size) { /* message fits into buffer, move forward */ syslog_idx = log_next(syslog_idx); @@ -1374,11 +1381,13 @@ static int syslog_print_all(char __user *buf, int size, bool clear) u64 next_seq; u64 seq; u32 idx; + bool time; text = kmalloc(LOG_LINE_MAX + PREFIX_MAX, GFP_KERNEL); if (!text) return -ENOMEM; + time = printk_time; logbuf_lock_irq(); /* * Find first record that fits, including all following records, @@ -1389,7 +1398,7 @@ static int syslog_print_all(char __user *buf, int size, bool clear) while (seq < log_next_seq) { struct printk_log *msg = log_from_idx(idx); - len += msg_print_text(msg, true, NULL, 0); + len += msg_print_text(msg, true, time, NULL, 0); idx = log_next(idx); seq++; } @@ -1400,7 +1409,7 @@ static int syslog_print_all(char __user *buf, int size, bool clear) while (len > size && seq < log_next_seq) { struct printk_log *msg = log_from_idx(idx); - len -= msg_print_text(msg, true, NULL, 0); + len -= msg_print_text(msg, true, time, NULL, 0); idx = log_next(idx); seq++; } @@ -1411,14 +1420,9 @@ static int syslog_print_all(char __user *buf, int size, bool clear) len = 0; while (len >= 0 && seq < next_seq) { struct printk_log *msg = log_from_idx(idx); - int textlen; + int textlen = msg_print_text(msg, true, time, text, + LOG_LINE_MAX + PREFIX_MAX); - textlen = msg_print_text(msg, true, text, - LOG_LINE_MAX + PREFIX_MAX); - if (textlen < 0) { - len = textlen; - break; - } idx = log_next(idx); seq++; @@ -1542,11 +1546,14 @@ int do_syslog(int type, char __user *buf, int len, int source) } else { u64 seq = syslog_seq; u32 idx = syslog_idx; + bool time = syslog_partial ? syslog_time : printk_time; while (seq < log_next_seq) { struct printk_log *msg = log_from_idx(idx); - error += msg_print_text(msg, true, NULL, 0); + error += msg_print_text(msg, true, time, NULL, + 0); + time = printk_time; idx = log_next(idx); seq++; } @@ -2004,6 +2011,7 @@ EXPORT_SYMBOL(printk); #define LOG_LINE_MAX 0 #define PREFIX_MAX 0 +#define printk_time false static u64 syslog_seq; static u32 syslog_idx; @@ -2027,8 +2035,8 @@ static void console_lock_spinning_enable(void) { } static int console_lock_spinning_disable_and_check(void) { return 0; } static void call_console_drivers(const char *ext_text, size_t ext_len, const char *text, size_t len) {} -static size_t msg_print_text(const struct printk_log *msg, - bool syslog, char *buf, size_t size) { return 0; } +static size_t msg_print_text(const struct printk_log *msg, bool syslog, + bool time, char *buf, size_t size) { return 0; } static bool suppress_message_printing(int level) { return false; } #endif /* CONFIG_PRINTK */ @@ -2386,8 +2394,7 @@ skip: len += msg_print_text(msg, console_msg_format & MSG_FORMAT_SYSLOG, - text + len, - sizeof(text) - len); + printk_time, text + len, sizeof(text) - len); if (nr_ext_console_drivers) { ext_len = msg_print_ext_header(ext_text, sizeof(ext_text), @@ -3111,7 +3118,7 @@ bool kmsg_dump_get_line_nolock(struct kmsg_dumper *dumper, bool syslog, goto out; msg = log_from_idx(dumper->cur_idx); - l = msg_print_text(msg, syslog, line, size); + l = msg_print_text(msg, syslog, printk_time, line, size); dumper->cur_idx = log_next(dumper->cur_idx); dumper->cur_seq++; @@ -3182,6 +3189,7 @@ bool kmsg_dump_get_buffer(struct kmsg_dumper *dumper, bool syslog, u32 next_idx; size_t l = 0; bool ret = false; + bool time = printk_time; if (!dumper->active) goto out; @@ -3205,7 +3213,7 @@ bool kmsg_dump_get_buffer(struct kmsg_dumper *dumper, bool syslog, while (seq < dumper->next_seq) { struct printk_log *msg = log_from_idx(idx); - l += msg_print_text(msg, true, NULL, 0); + l += msg_print_text(msg, true, time, NULL, 0); idx = log_next(idx); seq++; } @@ -3216,7 +3224,7 @@ bool kmsg_dump_get_buffer(struct kmsg_dumper *dumper, bool syslog, while (l > size && seq < dumper->next_seq) { struct printk_log *msg = log_from_idx(idx); - l -= msg_print_text(msg, true, NULL, 0); + l -= msg_print_text(msg, true, time, NULL, 0); idx = log_next(idx); seq++; } @@ -3229,7 +3237,7 @@ bool kmsg_dump_get_buffer(struct kmsg_dumper *dumper, bool syslog, while (seq < dumper->next_seq) { struct printk_log *msg = log_from_idx(idx); - l += msg_print_text(msg, syslog, buf + l, size - l); + l += msg_print_text(msg, syslog, time, buf + l, size - l); idx = log_next(idx); seq++; } -- cgit v1.3-7-g2ca7 From 7e1413edd6194a9807aa5f3ac0378b9b4b9da879 Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Tue, 4 Dec 2018 13:35:45 -0500 Subject: tracing: Consolidate trace_add/remove_event_call back to the nolock functions The trace_add/remove_event_call_nolock() functions were added to allow the tace_add/remove_event_call() code be called when the event_mutex lock was already taken. Now that all callers are done within the event_mutex, there's no reason to have two different interfaces. Remove the current wrapper trace_add/remove_event_call()s and rename the _nolock versions back to the original names. Link: http://lkml.kernel.org/r/154140866955.17322.2081425494660638846.stgit@devbox Acked-by: Masami Hiramatsu Signed-off-by: Steven Rostedt (VMware) --- include/linux/trace_events.h | 2 -- kernel/trace/trace_events.c | 30 ++++-------------------------- kernel/trace/trace_events_hist.c | 6 +++--- kernel/trace/trace_kprobe.c | 4 ++-- kernel/trace/trace_uprobe.c | 4 ++-- 5 files changed, 11 insertions(+), 35 deletions(-) (limited to 'kernel') diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h index 3aa05593a53f..4130a5497d40 100644 --- a/include/linux/trace_events.h +++ b/include/linux/trace_events.h @@ -529,8 +529,6 @@ extern int trace_event_raw_init(struct trace_event_call *call); extern int trace_define_field(struct trace_event_call *call, const char *type, const char *name, int offset, int size, int is_signed, int filter_type); -extern int trace_add_event_call_nolock(struct trace_event_call *call); -extern int trace_remove_event_call_nolock(struct trace_event_call *call); extern int trace_add_event_call(struct trace_event_call *call); extern int trace_remove_event_call(struct trace_event_call *call); extern int trace_event_get_offsets(struct trace_event_call *call); diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c index a3b157f689ee..bd0162c0467c 100644 --- a/kernel/trace/trace_events.c +++ b/kernel/trace/trace_events.c @@ -2305,7 +2305,8 @@ __trace_early_add_new_event(struct trace_event_call *call, struct ftrace_module_file_ops; static void __add_event_to_tracers(struct trace_event_call *call); -int trace_add_event_call_nolock(struct trace_event_call *call) +/* Add an additional event_call dynamically */ +int trace_add_event_call(struct trace_event_call *call) { int ret; lockdep_assert_held(&event_mutex); @@ -2320,17 +2321,6 @@ int trace_add_event_call_nolock(struct trace_event_call *call) return ret; } -/* Add an additional event_call dynamically */ -int trace_add_event_call(struct trace_event_call *call) -{ - int ret; - - mutex_lock(&event_mutex); - ret = trace_add_event_call_nolock(call); - mutex_unlock(&event_mutex); - return ret; -} - /* * Must be called under locking of trace_types_lock, event_mutex and * trace_event_sem. @@ -2376,8 +2366,8 @@ static int probe_remove_event_call(struct trace_event_call *call) return 0; } -/* no event_mutex version */ -int trace_remove_event_call_nolock(struct trace_event_call *call) +/* Remove an event_call */ +int trace_remove_event_call(struct trace_event_call *call) { int ret; @@ -2392,18 +2382,6 @@ int trace_remove_event_call_nolock(struct trace_event_call *call) return ret; } -/* Remove an event_call */ -int trace_remove_event_call(struct trace_event_call *call) -{ - int ret; - - mutex_lock(&event_mutex); - ret = trace_remove_event_call_nolock(call); - mutex_unlock(&event_mutex); - - return ret; -} - #define for_each_event(event, start, end) \ for (event = start; \ (unsigned long)event < (unsigned long)end; \ diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c index 21e4954375a1..82e72c48a5a9 100644 --- a/kernel/trace/trace_events_hist.c +++ b/kernel/trace/trace_events_hist.c @@ -960,7 +960,7 @@ static int register_synth_event(struct synth_event *event) call->data = event; call->tp = event->tp; - ret = trace_add_event_call_nolock(call); + ret = trace_add_event_call(call); if (ret) { pr_warn("Failed to register synthetic event: %s\n", trace_event_name(call)); @@ -969,7 +969,7 @@ static int register_synth_event(struct synth_event *event) ret = set_synth_event_print_fmt(call); if (ret < 0) { - trace_remove_event_call_nolock(call); + trace_remove_event_call(call); goto err; } out: @@ -984,7 +984,7 @@ static int unregister_synth_event(struct synth_event *event) struct trace_event_call *call = &event->call; int ret; - ret = trace_remove_event_call_nolock(call); + ret = trace_remove_event_call(call); return ret; } diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c index bdf8c2ad5152..0e0f7b8024fb 100644 --- a/kernel/trace/trace_kprobe.c +++ b/kernel/trace/trace_kprobe.c @@ -1353,7 +1353,7 @@ static int register_kprobe_event(struct trace_kprobe *tk) kfree(call->print_fmt); return -ENODEV; } - ret = trace_add_event_call_nolock(call); + ret = trace_add_event_call(call); if (ret) { pr_info("Failed to register kprobe event: %s\n", trace_event_name(call)); @@ -1368,7 +1368,7 @@ static int unregister_kprobe_event(struct trace_kprobe *tk) int ret; /* tp->event is unregistered in trace_remove_event_call() */ - ret = trace_remove_event_call_nolock(&tk->tp.call); + ret = trace_remove_event_call(&tk->tp.call); if (!ret) kfree(tk->tp.call.print_fmt); return ret; diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c index 4a7b21c891f3..e335576b9411 100644 --- a/kernel/trace/trace_uprobe.c +++ b/kernel/trace/trace_uprobe.c @@ -1320,7 +1320,7 @@ static int register_uprobe_event(struct trace_uprobe *tu) return -ENODEV; } - ret = trace_add_event_call_nolock(call); + ret = trace_add_event_call(call); if (ret) { pr_info("Failed to register uprobe event: %s\n", @@ -1337,7 +1337,7 @@ static int unregister_uprobe_event(struct trace_uprobe *tu) int ret; /* tu->event is unregistered in trace_remove_event_call() */ - ret = trace_remove_event_call_nolock(&tu->tp.call); + ret = trace_remove_event_call(&tu->tp.call); if (ret) return ret; kfree(tu->tp.call.print_fmt); -- cgit v1.3-7-g2ca7 From 1ce25e9f6fff7633955e8ce776e5e9d7a780d34b Mon Sep 17 00:00:00 2001 From: Masami Hiramatsu Date: Mon, 5 Nov 2018 18:04:57 +0900 Subject: tracing: Add generic event-name based remove event method Add a generic method to remove event from dynamic event list. This is same as other system under ftrace. You just need to pass the event name with '!', e.g. # echo p:new_grp/new_event _do_fork > dynamic_events This creates an event, and # echo '!p:new_grp/new_event _do_fork' > dynamic_events Or, # echo '!p:new_grp/new_event' > dynamic_events will remove new_grp/new_event event. Note that this doesn't check the event prefix (e.g. "p:") strictly, because the "group/event" name must be unique. Link: http://lkml.kernel.org/r/154140869774.17322.8887303560398645347.stgit@devbox Reviewed-by: Tom Zanussi Tested-by: Tom Zanussi Signed-off-by: Masami Hiramatsu Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_dynevent.c | 15 +++++++++++---- 1 file changed, 11 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace_dynevent.c b/kernel/trace/trace_dynevent.c index f17a887abb66..dd1f43588d70 100644 --- a/kernel/trace/trace_dynevent.c +++ b/kernel/trace/trace_dynevent.c @@ -37,10 +37,17 @@ int dyn_event_release(int argc, char **argv, struct dyn_event_operations *type) char *system = NULL, *event, *p; int ret = -ENOENT; - if (argv[0][1] != ':') - return -EINVAL; + if (argv[0][0] == '-') { + if (argv[0][1] != ':') + return -EINVAL; + event = &argv[0][2]; + } else { + event = strchr(argv[0], ':'); + if (!event) + return -EINVAL; + event++; + } - event = &argv[0][2]; p = strchr(event, '/'); if (p) { system = event; @@ -69,7 +76,7 @@ static int create_dyn_event(int argc, char **argv) struct dyn_event_operations *ops; int ret; - if (argv[0][0] == '-') + if (argv[0][0] == '-' || argv[0][0] == '!') return dyn_event_release(argc, argv, NULL); mutex_lock(&dyn_event_ops_mutex); -- cgit v1.3-7-g2ca7 From a0572f687fb3c46e15554f4789797a077cc393b4 Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Wed, 5 Dec 2018 12:48:53 -0500 Subject: ftrace: Allow ftrace_replace_code() to be schedulable The function ftrace_replace_code() is the ftrace engine that does the work to modify all the nops into the calls to the function callback in all the functions being traced. The generic version which is normally called from stop machine, but an architecture can implement a non stop machine version and still use the generic ftrace_replace_code(). When an architecture does this, ftrace_replace_code() may be called from a schedulable context, where it can allow the code to be preemptible, and schedule out. In order to allow an architecture to make ftrace_replace_code() schedulable, a new command flag is added called: FTRACE_MAY_SLEEP Which can be or'd to the command that is passed to ftrace_modify_all_code() that calls ftrace_replace_code() and will have it call cond_resched() in the loop that modifies the nops into the calls to the ftrace trampolines. Link: http://lkml.kernel.org/r/20181204192903.8193-1-anders.roxell@linaro.org Link: http://lkml.kernel.org/r/20181205183303.828422192@goodmis.org Reported-by: Anders Roxell Tested-by: Anders Roxell Signed-off-by: Steven Rostedt (VMware) --- include/linux/ftrace.h | 1 + kernel/trace/ftrace.c | 19 ++++++++++++++++--- 2 files changed, 17 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h index 98e141c71ad0..13485a19e964 100644 --- a/include/linux/ftrace.h +++ b/include/linux/ftrace.h @@ -389,6 +389,7 @@ enum { FTRACE_UPDATE_TRACE_FUNC = (1 << 2), FTRACE_START_FUNC_RET = (1 << 3), FTRACE_STOP_FUNC_RET = (1 << 4), + FTRACE_MAY_SLEEP = (1 << 5), }; /* diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index 8ef9fc226037..ab3e8b995e12 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -77,6 +77,11 @@ #define ASSIGN_OPS_HASH(opsname, val) #endif +enum { + FTRACE_MODIFY_ENABLE_FL = (1 << 0), + FTRACE_MODIFY_MAY_SLEEP_FL = (1 << 1), +}; + struct ftrace_ops ftrace_list_end __read_mostly = { .func = ftrace_stub, .flags = FTRACE_OPS_FL_RECURSION_SAFE | FTRACE_OPS_FL_STUB, @@ -2389,10 +2394,12 @@ __ftrace_replace_code(struct dyn_ftrace *rec, int enable) return -1; /* unknow ftrace bug */ } -void __weak ftrace_replace_code(int enable) +void __weak ftrace_replace_code(int mod_flags) { struct dyn_ftrace *rec; struct ftrace_page *pg; + int enable = mod_flags & FTRACE_MODIFY_ENABLE_FL; + int schedulable = mod_flags & FTRACE_MODIFY_MAY_SLEEP_FL; int failed; if (unlikely(ftrace_disabled)) @@ -2409,6 +2416,8 @@ void __weak ftrace_replace_code(int enable) /* Stop processing */ return; } + if (schedulable) + cond_resched(); } while_for_each_ftrace_rec(); } @@ -2522,8 +2531,12 @@ int __weak ftrace_arch_code_modify_post_process(void) void ftrace_modify_all_code(int command) { int update = command & FTRACE_UPDATE_TRACE_FUNC; + int mod_flags = 0; int err = 0; + if (command & FTRACE_MAY_SLEEP) + mod_flags = FTRACE_MODIFY_MAY_SLEEP_FL; + /* * If the ftrace_caller calls a ftrace_ops func directly, * we need to make sure that it only traces functions it @@ -2541,9 +2554,9 @@ void ftrace_modify_all_code(int command) } if (command & FTRACE_UPDATE_CALLS) - ftrace_replace_code(1); + ftrace_replace_code(mod_flags | FTRACE_MODIFY_ENABLE_FL); else if (command & FTRACE_DISABLE_CALLS) - ftrace_replace_code(0); + ftrace_replace_code(mod_flags); if (update && ftrace_trace_function != ftrace_ops_list_func) { function_trace_op = set_function_trace_op; -- cgit v1.3-7-g2ca7 From 45fe439bc3699851802062c05acf4cbac4d672bc Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Fri, 7 Dec 2018 12:25:52 -0500 Subject: fgraph: Add comment to describe ftrace_graph_get_ret_stack The ret_stack should not be accessed directly via the curr_ret_stack variable on the task_struct. This is because the ret_stack is going to be converted into a series of longs and not an array of ret_stack structures. The way that a ret_stack should be retrieved is via the ftrace_graph_get_ret_stack structure, but it needs to be documented on how to use it. Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/fgraph.c | 11 +++++++++++ 1 file changed, 11 insertions(+) (limited to 'kernel') diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c index a3704ec8b599..d4f04f0ca646 100644 --- a/kernel/trace/fgraph.c +++ b/kernel/trace/fgraph.c @@ -232,6 +232,17 @@ unsigned long ftrace_return_to_handler(unsigned long frame_pointer) return ret; } +/** + * ftrace_graph_get_ret_stack - return the entry of the shadow stack + * @task: The task to read the shadow stack from + * @idx: Index down the shadow stack + * + * Return the ret_struct on the shadow stack of the @task at the + * call graph at @idx starting with zero. If @idx is zero, it + * will return the last saved ret_stack entry. If it is greater than + * zero, it will return the corresponding ret_stack for the depth + * of saved return addresses. + */ struct ftrace_ret_stack * ftrace_graph_get_ret_stack(struct task_struct *task, int idx) { -- cgit v1.3-7-g2ca7 From e434b8cdf788568ba65a0a0fd9f3cb41f3ca1803 Mon Sep 17 00:00:00 2001 From: Jiong Wang Date: Fri, 7 Dec 2018 12:16:18 -0500 Subject: bpf: relax verifier restriction on BPF_MOV | BPF_ALU Currently, the destination register is marked as unknown for 32-bit sub-register move (BPF_MOV | BPF_ALU) whenever the source register type is SCALAR_VALUE. This is too conservative that some valid cases will be rejected. Especially, this may turn a constant scalar value into unknown value that could break some assumptions of verifier. For example, test_l4lb_noinline.c has the following C code: struct real_definition *dst 1: if (!get_packet_dst(&dst, &pckt, vip_info, is_ipv6)) 2: return TC_ACT_SHOT; 3: 4: if (dst->flags & F_IPV6) { get_packet_dst is responsible for initializing "dst" into valid pointer and return true (1), otherwise return false (0). The compiled instruction sequence using alu32 will be: 412: (54) (u32) r7 &= (u32) 1 413: (bc) (u32) r0 = (u32) r7 414: (95) exit insn 413, a BPF_MOV | BPF_ALU, however will turn r0 into unknown value even r7 contains SCALAR_VALUE 1. This causes trouble when verifier is walking the code path that hasn't initialized "dst" inside get_packet_dst, for which case 0 is returned and we would then expect verifier concluding line 1 in the above C code pass the "if" check, therefore would skip fall through path starting at line 4. Now, because r0 returned from callee has became unknown value, so verifier won't skip analyzing path starting at line 4 and "dst->flags" requires dereferencing the pointer "dst" which actually hasn't be initialized for this path. This patch relaxed the code marking sub-register move destination. For a SCALAR_VALUE, it is safe to just copy the value from source then truncate it into 32-bit. A unit test also included to demonstrate this issue. This test will fail before this patch. This relaxation could let verifier skipping more paths for conditional comparison against immediate. It also let verifier recording a more accurate/strict value for one register at one state, if this state end up with going through exit without rejection and it is used for state comparison later, then it is possible an inaccurate/permissive value is better. So the real impact on verifier processed insn number is complex. But in all, without this fix, valid program could be rejected. >From real benchmarking on kernel selftests and Cilium bpf tests, there is no impact on processed instruction number when tests ares compiled with default compilation options. There is slightly improvements when they are compiled with -mattr=+alu32 after this patch. Also, test_xdp_noinline/-mattr=+alu32 now passed verification. It is rejected before this fix. Insn processed before/after this patch: default -mattr=+alu32 Kernel selftest === test_xdp.o 371/371 369/369 test_l4lb.o 6345/6345 5623/5623 test_xdp_noinline.o 2971/2971 rejected/2727 test_tcp_estates.o 429/429 430/430 Cilium bpf === bpf_lb-DLB_L3.o: 2085/2085 1685/1687 bpf_lb-DLB_L4.o: 2287/2287 1986/1982 bpf_lb-DUNKNOWN.o: 690/690 622/622 bpf_lxc.o: 95033/95033 N/A bpf_netdev.o: 7245/7245 N/A bpf_overlay.o: 2898/2898 3085/2947 NOTE: - bpf_lxc.o and bpf_netdev.o compiled by -mattr=+alu32 are rejected by verifier due to another issue inside verifier on supporting alu32 binary. - Each cilium bpf program could generate several processed insn number, above number is sum of them. v1->v2: - Restrict the change on SCALAR_VALUE. - Update benchmark numbers on Cilium bpf tests. Signed-off-by: Jiong Wang Signed-off-by: Alexei Starovoitov --- kernel/bpf/verifier.c | 16 ++++++++++++---- tools/testing/selftests/bpf/test_verifier.c | 13 +++++++++++++ 2 files changed, 25 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 9d25506bd55a..2e70b813a1a7 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -3583,12 +3583,15 @@ static int check_alu_op(struct bpf_verifier_env *env, struct bpf_insn *insn) return err; if (BPF_SRC(insn->code) == BPF_X) { + struct bpf_reg_state *src_reg = regs + insn->src_reg; + struct bpf_reg_state *dst_reg = regs + insn->dst_reg; + if (BPF_CLASS(insn->code) == BPF_ALU64) { /* case: R1 = R2 * copy register state to dest reg */ - regs[insn->dst_reg] = regs[insn->src_reg]; - regs[insn->dst_reg].live |= REG_LIVE_WRITTEN; + *dst_reg = *src_reg; + dst_reg->live |= REG_LIVE_WRITTEN; } else { /* R1 = (u32) R2 */ if (is_pointer_value(env, insn->src_reg)) { @@ -3596,9 +3599,14 @@ static int check_alu_op(struct bpf_verifier_env *env, struct bpf_insn *insn) "R%d partial copy of pointer\n", insn->src_reg); return -EACCES; + } else if (src_reg->type == SCALAR_VALUE) { + *dst_reg = *src_reg; + dst_reg->live |= REG_LIVE_WRITTEN; + } else { + mark_reg_unknown(env, regs, + insn->dst_reg); } - mark_reg_unknown(env, regs, insn->dst_reg); - coerce_reg_to_size(®s[insn->dst_reg], 4); + coerce_reg_to_size(dst_reg, 4); } } else { /* case: R = imm diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c index 36ce58b4933e..957e4711c46c 100644 --- a/tools/testing/selftests/bpf/test_verifier.c +++ b/tools/testing/selftests/bpf/test_verifier.c @@ -2959,6 +2959,19 @@ static struct bpf_test tests[] = { .result_unpriv = REJECT, .result = ACCEPT, }, + { + "alu32: mov u32 const", + .insns = { + BPF_MOV32_IMM(BPF_REG_7, 0), + BPF_ALU32_IMM(BPF_AND, BPF_REG_7, 1), + BPF_MOV32_REG(BPF_REG_0, BPF_REG_7), + BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 1), + BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_7, 0), + BPF_EXIT_INSN(), + }, + .result = ACCEPT, + .retval = 0, + }, { "unpriv: partial copy of pointer", .insns = { -- cgit v1.3-7-g2ca7 From 7a5725ddc6e18aab61f647c0996d3b7eb03ff5cb Mon Sep 17 00:00:00 2001 From: Song Liu Date: Mon, 10 Dec 2018 11:17:50 -0800 Subject: bpf: clean up bpf_prog_get_info_by_fd() info.nr_jited_ksyms and info.nr_jited_func_lens cannot be 0 in these two statements, so we don't need to check them. Signed-off-by: Song Liu Acked-by: Martin KaFai Lau Signed-off-by: Alexei Starovoitov --- kernel/bpf/syscall.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 19c88cff7880..a99a23bf5910 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -2209,7 +2209,7 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog, ulen = info.nr_jited_ksyms; info.nr_jited_ksyms = prog->aux->func_cnt ? : 1; - if (info.nr_jited_ksyms && ulen) { + if (ulen) { if (bpf_dump_raw_ok()) { unsigned long ksym_addr; u64 __user *user_ksyms; @@ -2240,7 +2240,7 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog, ulen = info.nr_jited_func_lens; info.nr_jited_func_lens = prog->aux->func_cnt ? : 1; - if (info.nr_jited_func_lens && ulen) { + if (ulen) { if (bpf_dump_raw_ok()) { u32 __user *user_lens; u32 func_len, i; -- cgit v1.3-7-g2ca7 From 11d8b82d2222cade12caad2c125f23023777dcbc Mon Sep 17 00:00:00 2001 From: Yonghong Song Date: Mon, 10 Dec 2018 14:14:08 -0800 Subject: bpf: rename *_info_cnt to nr_*_info in bpf_prog_info In uapi bpf.h, currently we have the following fields in the struct bpf_prog_info: __u32 func_info_cnt; __u32 line_info_cnt; __u32 jited_line_info_cnt; The above field names "func_info_cnt" and "line_info_cnt" also appear in union bpf_attr for program loading. The original intention is to keep the names the same between bpf_prog_info and bpf_attr so it will imply what we returned to user space will be the same as what the user space passed to the kernel. Such a naming convention in bpf_prog_info is not consistent with other fields like: __u32 nr_jited_ksyms; __u32 nr_jited_func_lens; This patch made this adjustment so in bpf_prog_info newly introduced *_info_cnt becomes nr_*_info. Acked-by: Song Liu Acked-by: Martin KaFai Lau Signed-off-by: Yonghong Song Signed-off-by: Alexei Starovoitov --- include/uapi/linux/bpf.h | 6 +++--- kernel/bpf/syscall.c | 38 +++++++++++++++++++------------------- 2 files changed, 22 insertions(+), 22 deletions(-) (limited to 'kernel') diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 1bee1135866a..f943ed803309 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -2696,11 +2696,11 @@ struct bpf_prog_info { __u32 btf_id; __u32 func_info_rec_size; __aligned_u64 func_info; - __u32 func_info_cnt; - __u32 line_info_cnt; + __u32 nr_func_info; + __u32 nr_line_info; __aligned_u64 line_info; __aligned_u64 jited_line_info; - __u32 jited_line_info_cnt; + __u32 nr_jited_line_info; __u32 line_info_rec_size; __u32 jited_line_info_rec_size; } __attribute__((aligned(8))); diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index a99a23bf5910..5745c7837621 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -2055,15 +2055,15 @@ static int set_info_rec_size(struct bpf_prog_info *info) * _rec_size back to the info. */ - if ((info->func_info_cnt || info->func_info_rec_size) && + if ((info->nr_func_info || info->func_info_rec_size) && info->func_info_rec_size != sizeof(struct bpf_func_info)) return -EINVAL; - if ((info->line_info_cnt || info->line_info_rec_size) && + if ((info->nr_line_info || info->line_info_rec_size) && info->line_info_rec_size != sizeof(struct bpf_line_info)) return -EINVAL; - if ((info->jited_line_info_cnt || info->jited_line_info_rec_size) && + if ((info->nr_jited_line_info || info->jited_line_info_rec_size) && info->jited_line_info_rec_size != sizeof(__u64)) return -EINVAL; @@ -2125,9 +2125,9 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog, info.xlated_prog_len = 0; info.nr_jited_ksyms = 0; info.nr_jited_func_lens = 0; - info.func_info_cnt = 0; - info.line_info_cnt = 0; - info.jited_line_info_cnt = 0; + info.nr_func_info = 0; + info.nr_line_info = 0; + info.nr_jited_line_info = 0; goto done; } @@ -2268,14 +2268,14 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog, if (prog->aux->btf) info.btf_id = btf_id(prog->aux->btf); - ulen = info.func_info_cnt; - info.func_info_cnt = prog->aux->func_info_cnt; - if (info.func_info_cnt && ulen) { + ulen = info.nr_func_info; + info.nr_func_info = prog->aux->func_info_cnt; + if (info.nr_func_info && ulen) { if (bpf_dump_raw_ok()) { char __user *user_finfo; user_finfo = u64_to_user_ptr(info.func_info); - ulen = min_t(u32, info.func_info_cnt, ulen); + ulen = min_t(u32, info.nr_func_info, ulen); if (copy_to_user(user_finfo, prog->aux->func_info, info.func_info_rec_size * ulen)) return -EFAULT; @@ -2284,14 +2284,14 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog, } } - ulen = info.line_info_cnt; - info.line_info_cnt = prog->aux->nr_linfo; - if (info.line_info_cnt && ulen) { + ulen = info.nr_line_info; + info.nr_line_info = prog->aux->nr_linfo; + if (info.nr_line_info && ulen) { if (bpf_dump_raw_ok()) { __u8 __user *user_linfo; user_linfo = u64_to_user_ptr(info.line_info); - ulen = min_t(u32, info.line_info_cnt, ulen); + ulen = min_t(u32, info.nr_line_info, ulen); if (copy_to_user(user_linfo, prog->aux->linfo, info.line_info_rec_size * ulen)) return -EFAULT; @@ -2300,18 +2300,18 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog, } } - ulen = info.jited_line_info_cnt; + ulen = info.nr_jited_line_info; if (prog->aux->jited_linfo) - info.jited_line_info_cnt = prog->aux->nr_linfo; + info.nr_jited_line_info = prog->aux->nr_linfo; else - info.jited_line_info_cnt = 0; - if (info.jited_line_info_cnt && ulen) { + info.nr_jited_line_info = 0; + if (info.nr_jited_line_info && ulen) { if (bpf_dump_raw_ok()) { __u64 __user *user_linfo; u32 i; user_linfo = u64_to_user_ptr(info.jited_line_info); - ulen = min_t(u32, info.jited_line_info_cnt, ulen); + ulen = min_t(u32, info.nr_jited_line_info, ulen); for (i = 0; i < ulen; i++) { if (put_user((__u64)(long)prog->aux->jited_linfo[i], &user_linfo[i])) -- cgit v1.3-7-g2ca7 From 108c35a908d484df094f46a1e9d961d732737013 Mon Sep 17 00:00:00 2001 From: Daniel Lezcano Date: Mon, 3 Dec 2018 11:29:29 +0100 Subject: sched/cpufreq: Add the SPDX tags The SPDX tags are not present in cpufreq.c and cpufreq_schedutil.c. Add them and remove the license descriptions Signed-off-by: Daniel Lezcano Acked-by: Viresh Kumar Signed-off-by: Rafael J. Wysocki --- kernel/sched/cpufreq.c | 5 +---- kernel/sched/cpufreq_schedutil.c | 5 +---- 2 files changed, 2 insertions(+), 8 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/cpufreq.c b/kernel/sched/cpufreq.c index 5e54cbcae673..22bd8980f32f 100644 --- a/kernel/sched/cpufreq.c +++ b/kernel/sched/cpufreq.c @@ -1,12 +1,9 @@ +// SPDX-License-Identifier: GPL-2.0 /* * Scheduler code and data structures related to cpufreq. * * Copyright (C) 2016, Intel Corporation * Author: Rafael J. Wysocki - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License version 2 as - * published by the Free Software Foundation. */ #include "sched.h" diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c index 3fffad3bc8a8..626ddd4ffa43 100644 --- a/kernel/sched/cpufreq_schedutil.c +++ b/kernel/sched/cpufreq_schedutil.c @@ -1,12 +1,9 @@ +// SPDX-License-Identifier: GPL-2.0 /* * CPUFreq governor based on scheduler-provided CPU utilization data. * * Copyright (C) 2016, Intel Corporation * Author: Rafael J. Wysocki - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License version 2 as - * published by the Free Software Foundation. */ #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt -- cgit v1.3-7-g2ca7 From 9f191555ba4ba8fc82e589670e46a7f79b72a157 Mon Sep 17 00:00:00 2001 From: Robin Murphy Date: Mon, 10 Dec 2018 14:00:28 +0000 Subject: dma-debug: Expose nr_total_entries in debugfs Expose nr_total_entries in debugfs, so that {num,min}_free_entries become even more meaningful to users interested in current/maximum utilisation. This becomes even more relevant once nr_total_entries may change at runtime beyond just the existing AMD GART debug code. Suggested-by: John Garry Signed-off-by: Robin Murphy Tested-by: Qian Cai Signed-off-by: Christoph Hellwig --- Documentation/DMA-API.txt | 3 +++ kernel/dma/debug.c | 7 +++++++ 2 files changed, 10 insertions(+) (limited to 'kernel') diff --git a/Documentation/DMA-API.txt b/Documentation/DMA-API.txt index ac66ae2509a9..6bdb095393b0 100644 --- a/Documentation/DMA-API.txt +++ b/Documentation/DMA-API.txt @@ -723,6 +723,9 @@ dma-api/min_free_entries This read-only file can be read to get the dma-api/num_free_entries The current number of free dma_debug_entries in the allocator. +dma-api/nr_total_entries The total number of dma_debug_entries in the + allocator, both free and used. + dma-api/driver-filter You can write a name of a driver into this file to limit the debug output to requests from that particular driver. Write an empty string to diff --git a/kernel/dma/debug.c b/kernel/dma/debug.c index 231ca4628062..f6a141eb9438 100644 --- a/kernel/dma/debug.c +++ b/kernel/dma/debug.c @@ -142,6 +142,7 @@ static struct dentry *show_all_errors_dent __read_mostly; static struct dentry *show_num_errors_dent __read_mostly; static struct dentry *num_free_entries_dent __read_mostly; static struct dentry *min_free_entries_dent __read_mostly; +static struct dentry *nr_total_entries_dent __read_mostly; static struct dentry *filter_dent __read_mostly; /* per-driver filter related state */ @@ -926,6 +927,12 @@ static int dma_debug_fs_init(void) if (!min_free_entries_dent) goto out_err; + nr_total_entries_dent = debugfs_create_u32("nr_total_entries", 0444, + dma_debug_dent, + &nr_total_entries); + if (!nr_total_entries_dent) + goto out_err; + filter_dent = debugfs_create_file("driver_filter", 0644, dma_debug_dent, NULL, &filter_fops); if (!filter_dent) -- cgit v1.3-7-g2ca7 From f737b095c60c635db260e02fdb9f0efb9f3360c4 Mon Sep 17 00:00:00 2001 From: Robin Murphy Date: Mon, 10 Dec 2018 14:00:27 +0000 Subject: dma-debug: Use pr_fmt() Use pr_fmt() to generate the "DMA-API: " prefix consistently. This results in it being added to a couple of pr_*() messages which were missing it before, and for the err_printk() calls moves it to the actual start of the message instead of somewhere in the middle. Signed-off-by: Robin Murphy Tested-by: Qian Cai Signed-off-by: Christoph Hellwig --- kernel/dma/debug.c | 74 ++++++++++++++++++++++++++++-------------------------- 1 file changed, 38 insertions(+), 36 deletions(-) (limited to 'kernel') diff --git a/kernel/dma/debug.c b/kernel/dma/debug.c index f6a141eb9438..29486eb9d1dc 100644 --- a/kernel/dma/debug.c +++ b/kernel/dma/debug.c @@ -17,6 +17,8 @@ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */ +#define pr_fmt(fmt) "DMA-API: " fmt + #include #include #include @@ -235,7 +237,7 @@ static bool driver_filter(struct device *dev) error_count += 1; \ if (driver_filter(dev) && \ (show_all_errors || show_num_errors > 0)) { \ - WARN(1, "%s %s: " format, \ + WARN(1, pr_fmt("%s %s: ") format, \ dev ? dev_driver_string(dev) : "NULL", \ dev ? dev_name(dev) : "NULL", ## arg); \ dump_entry_trace(entry); \ @@ -520,7 +522,7 @@ static void active_cacheline_inc_overlap(phys_addr_t cln) * prematurely. */ WARN_ONCE(overlap > ACTIVE_CACHELINE_MAX_OVERLAP, - "DMA-API: exceeded %d overlapping mappings of cacheline %pa\n", + pr_fmt("exceeded %d overlapping mappings of cacheline %pa\n"), ACTIVE_CACHELINE_MAX_OVERLAP, &cln); } @@ -615,7 +617,7 @@ void debug_dma_assert_idle(struct page *page) cln = to_cacheline_number(entry); err_printk(entry->dev, entry, - "DMA-API: cpu touching an active dma mapped cacheline [cln=%pa]\n", + "cpu touching an active dma mapped cacheline [cln=%pa]\n", &cln); } @@ -635,7 +637,7 @@ static void add_dma_entry(struct dma_debug_entry *entry) rc = active_cacheline_insert(entry); if (rc == -ENOMEM) { - pr_err("DMA-API: cacheline tracking ENOMEM, dma-debug disabled\n"); + pr_err("cacheline tracking ENOMEM, dma-debug disabled\n"); global_disable = true; } @@ -674,7 +676,7 @@ static struct dma_debug_entry *dma_entry_alloc(void) if (list_empty(&free_entries)) { global_disable = true; spin_unlock_irqrestore(&free_entries_lock, flags); - pr_err("DMA-API: debugging out of memory - disabling\n"); + pr_err("debugging out of memory - disabling\n"); return NULL; } @@ -778,7 +780,7 @@ static int prealloc_memory(u32 num_entries) num_free_entries = num_entries; min_free_entries = num_entries; - pr_info("DMA-API: preallocated %d debug entries\n", num_entries); + pr_info("preallocated %d debug entries\n", num_entries); return 0; @@ -851,7 +853,7 @@ static ssize_t filter_write(struct file *file, const char __user *userbuf, * switched off. */ if (current_driver_name[0]) - pr_info("DMA-API: switching off dma-debug driver filter\n"); + pr_info("switching off dma-debug driver filter\n"); current_driver_name[0] = 0; current_driver = NULL; goto out_unlock; @@ -869,7 +871,7 @@ static ssize_t filter_write(struct file *file, const char __user *userbuf, current_driver_name[i] = 0; current_driver = NULL; - pr_info("DMA-API: enable driver filter for driver [%s]\n", + pr_info("enable driver filter for driver [%s]\n", current_driver_name); out_unlock: @@ -888,7 +890,7 @@ static int dma_debug_fs_init(void) { dma_debug_dent = debugfs_create_dir("dma-api", NULL); if (!dma_debug_dent) { - pr_err("DMA-API: can not create debugfs directory\n"); + pr_err("can not create debugfs directory\n"); return -ENOMEM; } @@ -980,7 +982,7 @@ static int dma_debug_device_change(struct notifier_block *nb, unsigned long acti count = device_dma_allocations(dev, &entry); if (count == 0) break; - err_printk(dev, entry, "DMA-API: device driver has pending " + err_printk(dev, entry, "device driver has pending " "DMA allocations while released from device " "[count=%d]\n" "One of leaked entries details: " @@ -1030,14 +1032,14 @@ static int dma_debug_init(void) } if (dma_debug_fs_init() != 0) { - pr_err("DMA-API: error creating debugfs entries - disabling\n"); + pr_err("error creating debugfs entries - disabling\n"); global_disable = true; return 0; } if (prealloc_memory(nr_prealloc_entries) != 0) { - pr_err("DMA-API: debugging out of memory error - disabled\n"); + pr_err("debugging out of memory error - disabled\n"); global_disable = true; return 0; @@ -1047,7 +1049,7 @@ static int dma_debug_init(void) dma_debug_initialized = true; - pr_info("DMA-API: debugging enabled by kernel config\n"); + pr_info("debugging enabled by kernel config\n"); return 0; } core_initcall(dma_debug_init); @@ -1058,7 +1060,7 @@ static __init int dma_debug_cmdline(char *str) return -EINVAL; if (strncmp(str, "off", 3) == 0) { - pr_info("DMA-API: debugging disabled on kernel command line\n"); + pr_info("debugging disabled on kernel command line\n"); global_disable = true; } @@ -1092,11 +1094,11 @@ static void check_unmap(struct dma_debug_entry *ref) if (dma_mapping_error(ref->dev, ref->dev_addr)) { err_printk(ref->dev, NULL, - "DMA-API: device driver tries to free an " + "device driver tries to free an " "invalid DMA memory address\n"); } else { err_printk(ref->dev, NULL, - "DMA-API: device driver tries to free DMA " + "device driver tries to free DMA " "memory it has not allocated [device " "address=0x%016llx] [size=%llu bytes]\n", ref->dev_addr, ref->size); @@ -1105,7 +1107,7 @@ static void check_unmap(struct dma_debug_entry *ref) } if (ref->size != entry->size) { - err_printk(ref->dev, entry, "DMA-API: device driver frees " + err_printk(ref->dev, entry, "device driver frees " "DMA memory with different size " "[device address=0x%016llx] [map size=%llu bytes] " "[unmap size=%llu bytes]\n", @@ -1113,7 +1115,7 @@ static void check_unmap(struct dma_debug_entry *ref) } if (ref->type != entry->type) { - err_printk(ref->dev, entry, "DMA-API: device driver frees " + err_printk(ref->dev, entry, "device driver frees " "DMA memory with wrong function " "[device address=0x%016llx] [size=%llu bytes] " "[mapped as %s] [unmapped as %s]\n", @@ -1121,7 +1123,7 @@ static void check_unmap(struct dma_debug_entry *ref) type2name[entry->type], type2name[ref->type]); } else if ((entry->type == dma_debug_coherent) && (phys_addr(ref) != phys_addr(entry))) { - err_printk(ref->dev, entry, "DMA-API: device driver frees " + err_printk(ref->dev, entry, "device driver frees " "DMA memory with different CPU address " "[device address=0x%016llx] [size=%llu bytes] " "[cpu alloc address=0x%016llx] " @@ -1133,7 +1135,7 @@ static void check_unmap(struct dma_debug_entry *ref) if (ref->sg_call_ents && ref->type == dma_debug_sg && ref->sg_call_ents != entry->sg_call_ents) { - err_printk(ref->dev, entry, "DMA-API: device driver frees " + err_printk(ref->dev, entry, "device driver frees " "DMA sg list with different entry count " "[map count=%d] [unmap count=%d]\n", entry->sg_call_ents, ref->sg_call_ents); @@ -1144,7 +1146,7 @@ static void check_unmap(struct dma_debug_entry *ref) * DMA API don't handle this properly, so check for it here */ if (ref->direction != entry->direction) { - err_printk(ref->dev, entry, "DMA-API: device driver frees " + err_printk(ref->dev, entry, "device driver frees " "DMA memory with different direction " "[device address=0x%016llx] [size=%llu bytes] " "[mapped with %s] [unmapped with %s]\n", @@ -1160,7 +1162,7 @@ static void check_unmap(struct dma_debug_entry *ref) */ if (entry->map_err_type == MAP_ERR_NOT_CHECKED) { err_printk(ref->dev, entry, - "DMA-API: device driver failed to check map error" + "device driver failed to check map error" "[device address=0x%016llx] [size=%llu bytes] " "[mapped as %s]", ref->dev_addr, ref->size, @@ -1185,7 +1187,7 @@ static void check_for_stack(struct device *dev, return; addr = page_address(page) + offset; if (object_is_on_stack(addr)) - err_printk(dev, NULL, "DMA-API: device driver maps memory from stack [addr=%p]\n", addr); + err_printk(dev, NULL, "device driver maps memory from stack [addr=%p]\n", addr); } else { /* Stack is vmalloced. */ int i; @@ -1195,7 +1197,7 @@ static void check_for_stack(struct device *dev, continue; addr = (u8 *)current->stack + i * PAGE_SIZE + offset; - err_printk(dev, NULL, "DMA-API: device driver maps memory from stack [probable addr=%p]\n", addr); + err_printk(dev, NULL, "device driver maps memory from stack [probable addr=%p]\n", addr); break; } } @@ -1215,7 +1217,7 @@ static void check_for_illegal_area(struct device *dev, void *addr, unsigned long { if (overlap(addr, len, _stext, _etext) || overlap(addr, len, __start_rodata, __end_rodata)) - err_printk(dev, NULL, "DMA-API: device driver maps memory from kernel text or rodata [addr=%p] [len=%lu]\n", addr, len); + err_printk(dev, NULL, "device driver maps memory from kernel text or rodata [addr=%p] [len=%lu]\n", addr, len); } static void check_sync(struct device *dev, @@ -1231,7 +1233,7 @@ static void check_sync(struct device *dev, entry = bucket_find_contain(&bucket, ref, &flags); if (!entry) { - err_printk(dev, NULL, "DMA-API: device driver tries " + err_printk(dev, NULL, "device driver tries " "to sync DMA memory it has not allocated " "[device address=0x%016llx] [size=%llu bytes]\n", (unsigned long long)ref->dev_addr, ref->size); @@ -1239,7 +1241,7 @@ static void check_sync(struct device *dev, } if (ref->size > entry->size) { - err_printk(dev, entry, "DMA-API: device driver syncs" + err_printk(dev, entry, "device driver syncs" " DMA memory outside allocated range " "[device address=0x%016llx] " "[allocation size=%llu bytes] " @@ -1252,7 +1254,7 @@ static void check_sync(struct device *dev, goto out; if (ref->direction != entry->direction) { - err_printk(dev, entry, "DMA-API: device driver syncs " + err_printk(dev, entry, "device driver syncs " "DMA memory with different direction " "[device address=0x%016llx] [size=%llu bytes] " "[mapped with %s] [synced with %s]\n", @@ -1263,7 +1265,7 @@ static void check_sync(struct device *dev, if (to_cpu && !(entry->direction == DMA_FROM_DEVICE) && !(ref->direction == DMA_TO_DEVICE)) - err_printk(dev, entry, "DMA-API: device driver syncs " + err_printk(dev, entry, "device driver syncs " "device read-only DMA memory for cpu " "[device address=0x%016llx] [size=%llu bytes] " "[mapped with %s] [synced with %s]\n", @@ -1273,7 +1275,7 @@ static void check_sync(struct device *dev, if (!to_cpu && !(entry->direction == DMA_TO_DEVICE) && !(ref->direction == DMA_FROM_DEVICE)) - err_printk(dev, entry, "DMA-API: device driver syncs " + err_printk(dev, entry, "device driver syncs " "device write-only DMA memory to device " "[device address=0x%016llx] [size=%llu bytes] " "[mapped with %s] [synced with %s]\n", @@ -1283,7 +1285,7 @@ static void check_sync(struct device *dev, if (ref->sg_call_ents && ref->type == dma_debug_sg && ref->sg_call_ents != entry->sg_call_ents) { - err_printk(ref->dev, entry, "DMA-API: device driver syncs " + err_printk(ref->dev, entry, "device driver syncs " "DMA sg list with different entry count " "[map count=%d] [sync count=%d]\n", entry->sg_call_ents, ref->sg_call_ents); @@ -1304,7 +1306,7 @@ static void check_sg_segment(struct device *dev, struct scatterlist *sg) * whoever generated the list forgot to check them. */ if (sg->length > max_seg) - err_printk(dev, NULL, "DMA-API: mapping sg segment longer than device claims to support [len=%u] [max=%u]\n", + err_printk(dev, NULL, "mapping sg segment longer than device claims to support [len=%u] [max=%u]\n", sg->length, max_seg); /* * In some cases this could potentially be the DMA API @@ -1314,7 +1316,7 @@ static void check_sg_segment(struct device *dev, struct scatterlist *sg) start = sg_dma_address(sg); end = start + sg_dma_len(sg) - 1; if ((start ^ end) & ~boundary) - err_printk(dev, NULL, "DMA-API: mapping sg segment across boundary [start=0x%016llx] [end=0x%016llx] [boundary=0x%016llx]\n", + err_printk(dev, NULL, "mapping sg segment across boundary [start=0x%016llx] [end=0x%016llx] [boundary=0x%016llx]\n", start, end, boundary); #endif } @@ -1326,11 +1328,11 @@ void debug_dma_map_single(struct device *dev, const void *addr, return; if (!virt_addr_valid(addr)) - err_printk(dev, NULL, "DMA-API: device driver maps memory from invalid area [addr=%p] [len=%lu]\n", + err_printk(dev, NULL, "device driver maps memory from invalid area [addr=%p] [len=%lu]\n", addr, len); if (is_vmalloc_addr(addr)) - err_printk(dev, NULL, "DMA-API: device driver maps memory from vmalloc area [addr=%p] [len=%lu]\n", + err_printk(dev, NULL, "device driver maps memory from vmalloc area [addr=%p] [len=%lu]\n", addr, len); } EXPORT_SYMBOL(debug_dma_map_single); @@ -1787,7 +1789,7 @@ static int __init dma_debug_driver_setup(char *str) } if (current_driver_name[0]) - pr_info("DMA-API: enable driver filter for driver [%s]\n", + pr_info("enable driver filter for driver [%s]\n", current_driver_name); -- cgit v1.3-7-g2ca7 From 2b9d9ac02b9d8d32c515c82bb17401c429f160ab Mon Sep 17 00:00:00 2001 From: Robin Murphy Date: Mon, 10 Dec 2018 14:00:29 +0000 Subject: dma-debug: Dynamically expand the dma_debug_entry pool Certain drivers such as large multi-queue network adapters can use pools of mapped DMA buffers larger than the default dma_debug_entry pool of 65536 entries, with the result that merely probing such a device can cause DMA debug to disable itself during boot unless explicitly given an appropriate "dma_debug_entries=..." option. Developers trying to debug some other driver on such a system may not be immediately aware of this, and at worst it can hide bugs if they fail to realise that dma-debug has already disabled itself unexpectedly by the time their code of interest gets to run. Even once they do realise, it can be a bit of a pain to emprirically determine a suitable number of preallocated entries to configure, short of massively over-allocating. There's really no need for such a static limit, though, since we can quite easily expand the pool at runtime in those rare cases that the preallocated entries are insufficient, which is arguably the least surprising and most useful behaviour. To that end, refactor the prealloc_memory() logic a little bit to generalise it for runtime reallocations as well. Signed-off-by: Robin Murphy Tested-by: Qian Cai Signed-off-by: Christoph Hellwig --- Documentation/DMA-API.txt | 11 +++---- kernel/dma/debug.c | 79 ++++++++++++++++++++++++----------------------- 2 files changed, 46 insertions(+), 44 deletions(-) (limited to 'kernel') diff --git a/Documentation/DMA-API.txt b/Documentation/DMA-API.txt index 6bdb095393b0..0fcb7561af1e 100644 --- a/Documentation/DMA-API.txt +++ b/Documentation/DMA-API.txt @@ -717,8 +717,8 @@ dma-api/num_errors The number in this file shows how many dma-api/min_free_entries This read-only file can be read to get the minimum number of free dma_debug_entries the allocator has ever seen. If this value goes - down to zero the code will disable itself - because it is not longer reliable. + down to zero the code will attempt to increase + nr_total_entries to compensate. dma-api/num_free_entries The current number of free dma_debug_entries in the allocator. @@ -745,10 +745,9 @@ driver filter at boot time. The debug code will only print errors for that driver afterwards. This filter can be disabled or changed later using debugfs. When the code disables itself at runtime this is most likely because it ran -out of dma_debug_entries. These entries are preallocated at boot. The number -of preallocated entries is defined per architecture. If it is too low for you -boot with 'dma_debug_entries=' to overwrite the -architectural default. +out of dma_debug_entries and was unable to allocate more on-demand. 65536 +entries are preallocated at boot - if this is too low for you boot with +'dma_debug_entries=' to overwrite the default. :: diff --git a/kernel/dma/debug.c b/kernel/dma/debug.c index 29486eb9d1dc..ef7c90b7a346 100644 --- a/kernel/dma/debug.c +++ b/kernel/dma/debug.c @@ -47,6 +47,8 @@ #ifndef PREALLOC_DMA_DEBUG_ENTRIES #define PREALLOC_DMA_DEBUG_ENTRIES (1 << 16) #endif +/* If the pool runs out, add this many new entries at once */ +#define DMA_DEBUG_DYNAMIC_ENTRIES 256 enum { dma_debug_single, @@ -646,6 +648,34 @@ static void add_dma_entry(struct dma_debug_entry *entry) */ } +static int dma_debug_create_entries(u32 num_entries, gfp_t gfp) +{ + struct dma_debug_entry *entry, *next_entry; + int i; + + for (i = 0; i < num_entries; ++i) { + entry = kzalloc(sizeof(*entry), gfp); + if (!entry) + goto out_err; + + list_add_tail(&entry->list, &free_entries); + } + + num_free_entries += num_entries; + nr_total_entries += num_entries; + + return 0; + +out_err: + + list_for_each_entry_safe(entry, next_entry, &free_entries, list) { + list_del(&entry->list); + kfree(entry); + } + + return -ENOMEM; +} + static struct dma_debug_entry *__dma_entry_alloc(void) { struct dma_debug_entry *entry; @@ -672,12 +702,14 @@ static struct dma_debug_entry *dma_entry_alloc(void) unsigned long flags; spin_lock_irqsave(&free_entries_lock, flags); - - if (list_empty(&free_entries)) { - global_disable = true; - spin_unlock_irqrestore(&free_entries_lock, flags); - pr_err("debugging out of memory - disabling\n"); - return NULL; + if (num_free_entries == 0) { + if (dma_debug_create_entries(DMA_DEBUG_DYNAMIC_ENTRIES, + GFP_ATOMIC)) { + global_disable = true; + spin_unlock_irqrestore(&free_entries_lock, flags); + pr_err("debugging out of memory - disabling\n"); + return NULL; + } } entry = __dma_entry_alloc(); @@ -764,36 +796,6 @@ int dma_debug_resize_entries(u32 num_entries) * 2. Preallocate a given number of dma_debug_entry structs */ -static int prealloc_memory(u32 num_entries) -{ - struct dma_debug_entry *entry, *next_entry; - int i; - - for (i = 0; i < num_entries; ++i) { - entry = kzalloc(sizeof(*entry), GFP_KERNEL); - if (!entry) - goto out_err; - - list_add_tail(&entry->list, &free_entries); - } - - num_free_entries = num_entries; - min_free_entries = num_entries; - - pr_info("preallocated %d debug entries\n", num_entries); - - return 0; - -out_err: - - list_for_each_entry_safe(entry, next_entry, &free_entries, list) { - list_del(&entry->list); - kfree(entry); - } - - return -ENOMEM; -} - static ssize_t filter_read(struct file *file, char __user *user_buf, size_t count, loff_t *ppos) { @@ -1038,14 +1040,15 @@ static int dma_debug_init(void) return 0; } - if (prealloc_memory(nr_prealloc_entries) != 0) { + if (dma_debug_create_entries(nr_prealloc_entries, GFP_KERNEL) != 0) { pr_err("debugging out of memory error - disabled\n"); global_disable = true; return 0; } - nr_total_entries = num_free_entries; + min_free_entries = num_free_entries; + pr_info("preallocated %d debug entries\n", nr_total_entries); dma_debug_initialized = true; -- cgit v1.3-7-g2ca7 From ceb51173b2b5bd7af0d99079f6bbacdcfc58fe08 Mon Sep 17 00:00:00 2001 From: Robin Murphy Date: Mon, 10 Dec 2018 14:00:30 +0000 Subject: dma-debug: Make leak-like behaviour apparent Now that we can dynamically allocate DMA debug entries to cope with drivers maintaining excessively large numbers of live mappings, a driver which *does* actually have a bug leaking mappings (and is not unloaded) will no longer trigger the "DMA-API: debugging out of memory - disabling" message until it gets to actual kernel OOM conditions, which means it could go unnoticed for a while. To that end, let's inform the user each time the pool has grown to a multiple of its initial size, which should make it apparent that they either have a leak or might want to increase the preallocation size. Signed-off-by: Robin Murphy Tested-by: Qian Cai Signed-off-by: Christoph Hellwig --- Documentation/DMA-API.txt | 6 +++++- kernel/dma/debug.c | 13 +++++++++++++ 2 files changed, 18 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/Documentation/DMA-API.txt b/Documentation/DMA-API.txt index 0fcb7561af1e..7a7d8a415ce8 100644 --- a/Documentation/DMA-API.txt +++ b/Documentation/DMA-API.txt @@ -747,7 +747,11 @@ driver afterwards. This filter can be disabled or changed later using debugfs. When the code disables itself at runtime this is most likely because it ran out of dma_debug_entries and was unable to allocate more on-demand. 65536 entries are preallocated at boot - if this is too low for you boot with -'dma_debug_entries=' to overwrite the default. +'dma_debug_entries=' to overwrite the default. The +code will print to the kernel log each time it has dynamically allocated +as many entries as were initially preallocated. This is to indicate that a +larger preallocation size may be appropriate, or if it happens continually +that a driver may be leaking mappings. :: diff --git a/kernel/dma/debug.c b/kernel/dma/debug.c index ef7c90b7a346..912c23f4c177 100644 --- a/kernel/dma/debug.c +++ b/kernel/dma/debug.c @@ -691,6 +691,18 @@ static struct dma_debug_entry *__dma_entry_alloc(void) return entry; } +void __dma_entry_alloc_check_leak(void) +{ + u32 tmp = nr_total_entries % nr_prealloc_entries; + + /* Shout each time we tick over some multiple of the initial pool */ + if (tmp < DMA_DEBUG_DYNAMIC_ENTRIES) { + pr_info("dma_debug_entry pool grown to %u (%u00%%)\n", + nr_total_entries, + (nr_total_entries / nr_prealloc_entries)); + } +} + /* struct dma_entry allocator * * The next two functions implement the allocator for @@ -710,6 +722,7 @@ static struct dma_debug_entry *dma_entry_alloc(void) pr_err("debugging out of memory - disabling\n"); return NULL; } + __dma_entry_alloc_check_leak(); } entry = __dma_entry_alloc(); -- cgit v1.3-7-g2ca7 From 0cb0e25e421436a83ee39857923e4213b983e463 Mon Sep 17 00:00:00 2001 From: Robin Murphy Date: Mon, 10 Dec 2018 14:00:32 +0000 Subject: dma/debug: Remove dma_debug_resize_entries() With the only caller now gone, we can clean up this part of dma-debug's exposed internals and make way to tweak the allocation behaviour. Signed-off-by: Robin Murphy Tested-by: Qian Cai Signed-off-by: Christoph Hellwig --- include/linux/dma-debug.h | 7 ------- kernel/dma/debug.c | 46 ---------------------------------------------- 2 files changed, 53 deletions(-) (limited to 'kernel') diff --git a/include/linux/dma-debug.h b/include/linux/dma-debug.h index 30213adbb6b9..46e6131a72b6 100644 --- a/include/linux/dma-debug.h +++ b/include/linux/dma-debug.h @@ -30,8 +30,6 @@ struct bus_type; extern void dma_debug_add_bus(struct bus_type *bus); -extern int dma_debug_resize_entries(u32 num_entries); - extern void debug_dma_map_single(struct device *dev, const void *addr, unsigned long len); @@ -101,11 +99,6 @@ static inline void dma_debug_add_bus(struct bus_type *bus) { } -static inline int dma_debug_resize_entries(u32 num_entries) -{ - return 0; -} - static inline void debug_dma_map_single(struct device *dev, const void *addr, unsigned long len) { diff --git a/kernel/dma/debug.c b/kernel/dma/debug.c index 912c23f4c177..36a42874b05f 100644 --- a/kernel/dma/debug.c +++ b/kernel/dma/debug.c @@ -755,52 +755,6 @@ static void dma_entry_free(struct dma_debug_entry *entry) spin_unlock_irqrestore(&free_entries_lock, flags); } -int dma_debug_resize_entries(u32 num_entries) -{ - int i, delta, ret = 0; - unsigned long flags; - struct dma_debug_entry *entry; - LIST_HEAD(tmp); - - spin_lock_irqsave(&free_entries_lock, flags); - - if (nr_total_entries < num_entries) { - delta = num_entries - nr_total_entries; - - spin_unlock_irqrestore(&free_entries_lock, flags); - - for (i = 0; i < delta; i++) { - entry = kzalloc(sizeof(*entry), GFP_KERNEL); - if (!entry) - break; - - list_add_tail(&entry->list, &tmp); - } - - spin_lock_irqsave(&free_entries_lock, flags); - - list_splice(&tmp, &free_entries); - nr_total_entries += i; - num_free_entries += i; - } else { - delta = nr_total_entries - num_entries; - - for (i = 0; i < delta && !list_empty(&free_entries); i++) { - entry = __dma_entry_alloc(); - kfree(entry); - } - - nr_total_entries -= i; - } - - if (nr_total_entries != num_entries) - ret = 1; - - spin_unlock_irqrestore(&free_entries_lock, flags); - - return ret; -} - /* * DMA-API debugging init code * -- cgit v1.3-7-g2ca7 From ad78dee0b630527bdfed809d1f5ed95c601886ae Mon Sep 17 00:00:00 2001 From: Robin Murphy Date: Mon, 10 Dec 2018 14:00:33 +0000 Subject: dma-debug: Batch dma_debug_entry allocation DMA debug entries are one of those things which aren't that useful individually - we will always want some larger quantity of them - and which we don't really need to manage the exact number of - we only care about having 'enough'. In that regard, the current behaviour of creating them one-by-one leads to a lot of unwarranted function call overhead and memory wasted on alignment padding. Now that we don't have to worry about freeing anything via dma_debug_resize_entries(), we can optimise the allocation behaviour by grabbing whole pages at once, which will save considerably on the aforementioned overheads, and probably offer a little more cache/TLB locality benefit for traversing the lists under normal operation. This should also give even less reason for an architecture-level override of the preallocation size, so make the definition unconditional - if there is still any desire to change the compile-time value for some platforms it would be better off as a Kconfig option anyway. Since freeing a whole page of entries at once becomes enough of a challenge that it's not really worth complicating dma_debug_init(), we may as well tweak the preallocation behaviour such that as long as we manage to allocate *some* pages, we can leave debugging enabled on a best-effort basis rather than otherwise wasting them. Signed-off-by: Robin Murphy Tested-by: Qian Cai Signed-off-by: Christoph Hellwig --- Documentation/DMA-API.txt | 4 +++- kernel/dma/debug.c | 50 ++++++++++++++++++++--------------------------- 2 files changed, 24 insertions(+), 30 deletions(-) (limited to 'kernel') diff --git a/Documentation/DMA-API.txt b/Documentation/DMA-API.txt index 7a7d8a415ce8..016eb6909b8a 100644 --- a/Documentation/DMA-API.txt +++ b/Documentation/DMA-API.txt @@ -747,7 +747,9 @@ driver afterwards. This filter can be disabled or changed later using debugfs. When the code disables itself at runtime this is most likely because it ran out of dma_debug_entries and was unable to allocate more on-demand. 65536 entries are preallocated at boot - if this is too low for you boot with -'dma_debug_entries=' to overwrite the default. The +'dma_debug_entries=' to overwrite the default. Note +that the code allocates entries in batches, so the exact number of +preallocated entries may be greater than the actual number requested. The code will print to the kernel log each time it has dynamically allocated as many entries as were initially preallocated. This is to indicate that a larger preallocation size may be appropriate, or if it happens continually diff --git a/kernel/dma/debug.c b/kernel/dma/debug.c index 36a42874b05f..20ab0f6c1b70 100644 --- a/kernel/dma/debug.c +++ b/kernel/dma/debug.c @@ -43,12 +43,9 @@ #define HASH_FN_SHIFT 13 #define HASH_FN_MASK (HASH_SIZE - 1) -/* allow architectures to override this if absolutely required */ -#ifndef PREALLOC_DMA_DEBUG_ENTRIES #define PREALLOC_DMA_DEBUG_ENTRIES (1 << 16) -#endif /* If the pool runs out, add this many new entries at once */ -#define DMA_DEBUG_DYNAMIC_ENTRIES 256 +#define DMA_DEBUG_DYNAMIC_ENTRIES (PAGE_SIZE / sizeof(struct dma_debug_entry)) enum { dma_debug_single, @@ -648,32 +645,22 @@ static void add_dma_entry(struct dma_debug_entry *entry) */ } -static int dma_debug_create_entries(u32 num_entries, gfp_t gfp) +static int dma_debug_create_entries(gfp_t gfp) { - struct dma_debug_entry *entry, *next_entry; + struct dma_debug_entry *entry; int i; - for (i = 0; i < num_entries; ++i) { - entry = kzalloc(sizeof(*entry), gfp); - if (!entry) - goto out_err; + entry = (void *)get_zeroed_page(gfp); + if (!entry) + return -ENOMEM; - list_add_tail(&entry->list, &free_entries); - } + for (i = 0; i < DMA_DEBUG_DYNAMIC_ENTRIES; i++) + list_add_tail(&entry[i].list, &free_entries); - num_free_entries += num_entries; - nr_total_entries += num_entries; + num_free_entries += DMA_DEBUG_DYNAMIC_ENTRIES; + nr_total_entries += DMA_DEBUG_DYNAMIC_ENTRIES; return 0; - -out_err: - - list_for_each_entry_safe(entry, next_entry, &free_entries, list) { - list_del(&entry->list); - kfree(entry); - } - - return -ENOMEM; } static struct dma_debug_entry *__dma_entry_alloc(void) @@ -715,8 +702,7 @@ static struct dma_debug_entry *dma_entry_alloc(void) spin_lock_irqsave(&free_entries_lock, flags); if (num_free_entries == 0) { - if (dma_debug_create_entries(DMA_DEBUG_DYNAMIC_ENTRIES, - GFP_ATOMIC)) { + if (dma_debug_create_entries(GFP_ATOMIC)) { global_disable = true; spin_unlock_irqrestore(&free_entries_lock, flags); pr_err("debugging out of memory - disabling\n"); @@ -987,7 +973,7 @@ void dma_debug_add_bus(struct bus_type *bus) static int dma_debug_init(void) { - int i; + int i, nr_pages; /* Do not use dma_debug_initialized here, since we really want to be * called to set dma_debug_initialized @@ -1007,15 +993,21 @@ static int dma_debug_init(void) return 0; } - if (dma_debug_create_entries(nr_prealloc_entries, GFP_KERNEL) != 0) { + nr_pages = DIV_ROUND_UP(nr_prealloc_entries, DMA_DEBUG_DYNAMIC_ENTRIES); + for (i = 0; i < nr_pages; ++i) + dma_debug_create_entries(GFP_KERNEL); + if (num_free_entries >= nr_prealloc_entries) { + pr_info("preallocated %d debug entries\n", nr_total_entries); + } else if (num_free_entries > 0) { + pr_warn("%d debug entries requested but only %d allocated\n", + nr_prealloc_entries, nr_total_entries); + } else { pr_err("debugging out of memory error - disabled\n"); global_disable = true; return 0; } - min_free_entries = num_free_entries; - pr_info("preallocated %d debug entries\n", nr_total_entries); dma_debug_initialized = true; -- cgit v1.3-7-g2ca7 From 1431a5d2cfa18d7006d9b0e7ab4548d9bb19ce55 Mon Sep 17 00:00:00 2001 From: Bart Van Assche Date: Thu, 6 Dec 2018 17:11:32 -0800 Subject: locking/lockdep: Declare local symbols static This patch avoids that sparse complains about a missing declaration for the lock_classes array when building with CONFIG_DEBUG_LOCKDEP=n. Signed-off-by: Bart Van Assche Signed-off-by: Peter Zijlstra (Intel) Cc: Johannes Berg Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Sasha Levin Cc: Thomas Gleixner Cc: Waiman Long Cc: johannes.berg@intel.com Cc: tj@kernel.org Link: https://lkml.kernel.org/r/20181207011148.251812-9-bvanassche@acm.org Signed-off-by: Ingo Molnar --- kernel/locking/lockdep.c | 3 +++ 1 file changed, 3 insertions(+) (limited to 'kernel') diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 1efada2dd9dd..7434a00b2b2f 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -138,6 +138,9 @@ static struct lock_list list_entries[MAX_LOCKDEP_ENTRIES]; * get freed - this significantly simplifies the debugging code. */ unsigned long nr_lock_classes; +#ifndef CONFIG_DEBUG_LOCKDEP +static +#endif struct lock_class lock_classes[MAX_LOCKDEP_KEYS]; static inline struct lock_class *hlock_class(struct held_lock *hlock) -- cgit v1.3-7-g2ca7 From d35568bdb6ce4be3f885f8f189bbde5adc7e0160 Mon Sep 17 00:00:00 2001 From: Bart Van Assche Date: Thu, 6 Dec 2018 17:11:33 -0800 Subject: locking/lockdep: Inline __lockdep_init_map() Since the function __lockdep_init_map() only has one caller, inline it into its caller. This patch does not change any functionality. Signed-off-by: Bart Van Assche Signed-off-by: Peter Zijlstra (Intel) Cc: Johannes Berg Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Sasha Levin Cc: Thomas Gleixner Cc: Waiman Long Cc: johannes.berg@intel.com Cc: tj@kernel.org Link: https://lkml.kernel.org/r/20181207011148.251812-10-bvanassche@acm.org Signed-off-by: Ingo Molnar --- kernel/locking/lockdep.c | 8 +------- 1 file changed, 1 insertion(+), 7 deletions(-) (limited to 'kernel') diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 7434a00b2b2f..b5c8fcb6c070 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -3091,7 +3091,7 @@ static int mark_lock(struct task_struct *curr, struct held_lock *this, /* * Initialize a lock instance's lock-class mapping info: */ -static void __lockdep_init_map(struct lockdep_map *lock, const char *name, +void lockdep_init_map(struct lockdep_map *lock, const char *name, struct lock_class_key *key, int subclass) { int i; @@ -3147,12 +3147,6 @@ static void __lockdep_init_map(struct lockdep_map *lock, const char *name, raw_local_irq_restore(flags); } } - -void lockdep_init_map(struct lockdep_map *lock, const char *name, - struct lock_class_key *key, int subclass) -{ - __lockdep_init_map(lock, name, key, subclass); -} EXPORT_SYMBOL_GPL(lockdep_init_map); struct lock_class_key __lockdep_no_validate__; -- cgit v1.3-7-g2ca7 From 2904d9fa45d3ce7153f1e10d78c570ecf7f19c35 Mon Sep 17 00:00:00 2001 From: Bart Van Assche Date: Thu, 6 Dec 2018 17:11:34 -0800 Subject: locking/lockdep: Introduce lock_class_cache_is_registered() This patch does not change any functionality but makes the lockdep_reset_lock() function easier to read. Signed-off-by: Bart Van Assche Signed-off-by: Peter Zijlstra (Intel) Cc: Johannes Berg Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Sasha Levin Cc: Thomas Gleixner Cc: Waiman Long Cc: johannes.berg@intel.com Cc: tj@kernel.org Link: https://lkml.kernel.org/r/20181207011148.251812-11-bvanassche@acm.org Signed-off-by: Ingo Molnar --- kernel/locking/lockdep.c | 50 +++++++++++++++++++++++++++++------------------- 1 file changed, 30 insertions(+), 20 deletions(-) (limited to 'kernel') diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index b5c8fcb6c070..81388d028ac7 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -4201,13 +4201,33 @@ void lockdep_free_key_range(void *start, unsigned long size) */ } -void lockdep_reset_lock(struct lockdep_map *lock) +/* + * Check whether any element of the @lock->class_cache[] array refers to a + * registered lock class. The caller must hold either the graph lock or the + * RCU read lock. + */ +static bool lock_class_cache_is_registered(struct lockdep_map *lock) { struct lock_class *class; struct hlist_head *head; - unsigned long flags; int i, j; - int locked; + + for (i = 0; i < CLASSHASH_SIZE; i++) { + head = classhash_table + i; + hlist_for_each_entry_rcu(class, head, hash_entry) { + for (j = 0; j < NR_LOCKDEP_CACHING_CLASSES; j++) + if (lock->class_cache[j] == class) + return true; + } + } + return false; +} + +void lockdep_reset_lock(struct lockdep_map *lock) +{ + struct lock_class *class; + unsigned long flags; + int j, locked; raw_local_irq_save(flags); @@ -4227,24 +4247,14 @@ void lockdep_reset_lock(struct lockdep_map *lock) * be gone. */ locked = graph_lock(); - for (i = 0; i < CLASSHASH_SIZE; i++) { - head = classhash_table + i; - hlist_for_each_entry_rcu(class, head, hash_entry) { - int match = 0; - - for (j = 0; j < NR_LOCKDEP_CACHING_CLASSES; j++) - match |= class == lock->class_cache[j]; - - if (unlikely(match)) { - if (debug_locks_off_graph_unlock()) { - /* - * We all just reset everything, how did it match? - */ - WARN_ON(1); - } - goto out_restore; - } + if (unlikely(lock_class_cache_is_registered(lock))) { + if (debug_locks_off_graph_unlock()) { + /* + * We all just reset everything, how did it match? + */ + WARN_ON(1); } + goto out_restore; } if (locked) graph_unlock(); -- cgit v1.3-7-g2ca7 From a66b6922dc6a5ece60ea9326153da3b062977a4d Mon Sep 17 00:00:00 2001 From: Bart Van Assche Date: Thu, 6 Dec 2018 17:11:35 -0800 Subject: locking/lockdep: Remove a superfluous INIT_LIST_HEAD() statement Initializing a list entry just before it is passed to list_add_tail_rcu() is not necessary because list_add_tail_rcu() overwrites the next and prev pointers anyway. Hence remove the INIT_LIST_HEAD() statement. Signed-off-by: Bart Van Assche Signed-off-by: Peter Zijlstra (Intel) Cc: Johannes Berg Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Sasha Levin Cc: Thomas Gleixner Cc: Waiman Long Cc: johannes.berg@intel.com Cc: tj@kernel.org Link: https://lkml.kernel.org/r/20181207011148.251812-12-bvanassche@acm.org Signed-off-by: Ingo Molnar --- kernel/locking/lockdep.c | 1 - 1 file changed, 1 deletion(-) (limited to 'kernel') diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 81388d028ac7..346b5a1fd062 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -792,7 +792,6 @@ register_lock_class(struct lockdep_map *lock, unsigned int subclass, int force) class->key = key; class->name = lock->name; class->subclass = subclass; - INIT_LIST_HEAD(&class->lock_entry); INIT_LIST_HEAD(&class->locks_before); INIT_LIST_HEAD(&class->locks_after); class->name_version = count_matching_names(class); -- cgit v1.3-7-g2ca7 From 786fa29e9cb6810e21ab0d9c41a81d81d54d1d1b Mon Sep 17 00:00:00 2001 From: Bart Van Assche Date: Thu, 6 Dec 2018 17:11:36 -0800 Subject: locking/lockdep: Make concurrent lockdep_reset_lock() calls safe Since zap_class() removes items from the all_lock_classes list and the classhash_table, protect all zap_class() calls against concurrent data structure modifications with the graph lock. Signed-off-by: Bart Van Assche Signed-off-by: Peter Zijlstra (Intel) Cc: Johannes Berg Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Sasha Levin Cc: Thomas Gleixner Cc: Waiman Long Cc: johannes.berg@intel.com Cc: tj@kernel.org Link: https://lkml.kernel.org/r/20181207011148.251812-13-bvanassche@acm.org Signed-off-by: Ingo Molnar --- kernel/locking/lockdep.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 346b5a1fd062..737d2dd3ea56 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -4122,6 +4122,9 @@ void lockdep_reset(void) raw_local_irq_restore(flags); } +/* + * Remove all references to a lock class. The caller must hold the graph lock. + */ static void zap_class(struct lock_class *class) { int i; @@ -4229,6 +4232,7 @@ void lockdep_reset_lock(struct lockdep_map *lock) int j, locked; raw_local_irq_save(flags); + locked = graph_lock(); /* * Remove all classes this lock might have: @@ -4245,7 +4249,6 @@ void lockdep_reset_lock(struct lockdep_map *lock) * Debug check: in the end all mapped classes should * be gone. */ - locked = graph_lock(); if (unlikely(lock_class_cache_is_registered(lock))) { if (debug_locks_off_graph_unlock()) { /* -- cgit v1.3-7-g2ca7 From fe27b0de8dfcdf8482558ce5d25e697fe74d851e Mon Sep 17 00:00:00 2001 From: Bart Van Assche Date: Thu, 6 Dec 2018 17:11:37 -0800 Subject: locking/lockdep: Stop using RCU primitives to access 'all_lock_classes' Due to the previous patch all code that accesses the 'all_lock_classes' list holds the graph lock. Hence use regular list primitives instead of their RCU variants to access this list. Signed-off-by: Bart Van Assche Signed-off-by: Peter Zijlstra (Intel) Cc: Johannes Berg Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Sasha Levin Cc: Thomas Gleixner Cc: Waiman Long Cc: johannes.berg@intel.com Cc: tj@kernel.org Link: https://lkml.kernel.org/r/20181207011148.251812-14-bvanassche@acm.org Signed-off-by: Ingo Molnar --- kernel/locking/lockdep.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 737d2dd3ea56..5c837a537273 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -629,7 +629,8 @@ static int static_obj(void *obj) /* * To make lock name printouts unique, we calculate a unique - * class->name_version generation counter: + * class->name_version generation counter. The caller must hold the graph + * lock. */ static int count_matching_names(struct lock_class *new_class) { @@ -639,7 +640,7 @@ static int count_matching_names(struct lock_class *new_class) if (!new_class->name) return 0; - list_for_each_entry_rcu(class, &all_lock_classes, lock_entry) { + list_for_each_entry(class, &all_lock_classes, lock_entry) { if (new_class->key - new_class->subclass == class->key) return class->name_version; if (class->name && !strcmp(class->name, new_class->name)) @@ -803,7 +804,7 @@ register_lock_class(struct lockdep_map *lock, unsigned int subclass, int force) /* * Add it to the global list of classes: */ - list_add_tail_rcu(&class->lock_entry, &all_lock_classes); + list_add_tail(&class->lock_entry, &all_lock_classes); if (verbose(class)) { graph_unlock(); @@ -4141,7 +4142,7 @@ static void zap_class(struct lock_class *class) * Unhash the class and remove it from the all_lock_classes list: */ hlist_del_rcu(&class->hash_entry); - list_del_rcu(&class->lock_entry); + list_del(&class->lock_entry); RCU_INIT_POINTER(class->key, NULL); RCU_INIT_POINTER(class->name, NULL); -- cgit v1.3-7-g2ca7 From 80eb865768703c0f85a0603762742ae1dedf21f0 Mon Sep 17 00:00:00 2001 From: Andrea Parri Date: Tue, 27 Nov 2018 12:01:10 +0100 Subject: sched/fair: Clean up comment in nohz_idle_balance() Concerning the comment associated to the atomic_fetch_andnot() in nohz_idle_balance(), Vincent explains [1]: "[...] the comment is useless and can be removed [...] it was referring to a line code above the comment that was present in a previous iteration of the patchset. This line disappeared in final version but the comment has stayed." So remove the comment. Vincent also points out that the full ordering associated to the atomic_fetch_andnot() primitive could be relaxed, but this patch insists on the current more conservative/fully ordered solution: "Performance" isn't a concern, stay away from "correctness"/subtle relaxed (re)ordering if possible..., just make sure not to confuse the next reader with misleading/out-of-date comments. [1] http://lkml.kernel.org/r/CAKfTPtBjA-oCBRkO6__npQwL3+HLjzk7riCcPU1R7YdO-EpuZg@mail.gmail.com Suggested-by: Vincent Guittot Signed-off-by: Andrea Parri Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Link: https://lkml.kernel.org/r/20181127110110.5533-1-andrea.parri@amarulasolutions.com Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index ac855b2f4774..db514993565b 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -9533,9 +9533,7 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) return false; } - /* - * barrier, pairs with nohz_balance_enter_idle(), ensures ... - */ + /* could be _relaxed() */ flags = atomic_fetch_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu)); if (!(flags & NOHZ_KICK_MASK)) return false; -- cgit v1.3-7-g2ca7 From 765d0af19f5f388a34bf4533378f8398b72ded46 Mon Sep 17 00:00:00 2001 From: Vincent Guittot Date: Wed, 29 Aug 2018 15:19:11 +0200 Subject: sched/topology: Remove the ::smt_gain field from 'struct sched_domain' ::smt_gain is used to compute the capacity of CPUs of a SMT core with the constraint 1 < ::smt_gain < 2 in order to be able to compute number of CPUs per core. The field has_free_capacity of struct numa_stat, which was the last user of this computation of number of CPUs per core, has been removed by: 2d4056fafa19 ("sched/numa: Remove numa_has_capacity()") We can now remove this constraint on core capacity and use the defautl value SCHED_CAPACITY_SCALE for SMT CPUs. With this remove, SCHED_CAPACITY_SCALE becomes the maximum compute capacity of CPUs on every systems. This should help to simplify some code and remove fields like rd->max_cpu_capacity Furthermore, arch_scale_cpu_capacity() is used with a NULL sd in several other places in the code when it wants the capacity of a CPUs to scale some metrics like in pelt, deadline or schedutil. In case on SMT, the value returned is not the capacity of SMT CPUs but default SCHED_CAPACITY_SCALE. So remove it. Signed-off-by: Vincent Guittot Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/1535548752-4434-4-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar --- include/linux/sched/topology.h | 1 - kernel/sched/sched.h | 3 --- kernel/sched/topology.c | 2 -- 3 files changed, 6 deletions(-) (limited to 'kernel') diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h index 6b9976180c1e..7fa0bc17cd8c 100644 --- a/include/linux/sched/topology.h +++ b/include/linux/sched/topology.h @@ -89,7 +89,6 @@ struct sched_domain { unsigned int newidle_idx; unsigned int wake_idx; unsigned int forkexec_idx; - unsigned int smt_gain; int nohz_idle; /* NOHZ IDLE status */ int flags; /* See SD_* */ diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 9bde60a11805..ceb896404869 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1864,9 +1864,6 @@ unsigned long arch_scale_freq_capacity(int cpu) static __always_inline unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu) { - if (sd && (sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1)) - return sd->smt_gain / sd->span_weight; - return SCHED_CAPACITY_SCALE; } #endif diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index 8d7f15ba5916..7364e0b427b7 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -1133,7 +1133,6 @@ sd_init(struct sched_domain_topology_level *tl, .last_balance = jiffies, .balance_interval = sd_weight, - .smt_gain = 0, .max_newidle_lb_cost = 0, .next_decay_max_lb_cost = jiffies, .child = child, @@ -1164,7 +1163,6 @@ sd_init(struct sched_domain_topology_level *tl, if (sd->flags & SD_SHARE_CPUCAPACITY) { sd->imbalance_pct = 110; - sd->smt_gain = 1178; /* ~15% */ } else if (sd->flags & SD_SHARE_PKG_RESOURCES) { sd->imbalance_pct = 117; -- cgit v1.3-7-g2ca7 From 9ebc6053814d37b9de8cc291fba28f30a729c929 Mon Sep 17 00:00:00 2001 From: Yangtao Li Date: Sat, 3 Nov 2018 13:26:02 -0400 Subject: sched/core: Remove unnecessary unlikely() in push_*_task() WARN_ON() already contains an unlikely(), so it's not necessary to use WARN_ON(1). Signed-off-by: Yangtao Li Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/20181103172602.1917-1-tiny.windzz@gmail.com Signed-off-by: Ingo Molnar --- kernel/sched/deadline.c | 4 +--- kernel/sched/rt.c | 4 +--- 2 files changed, 2 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index b32bc1f7cd14..fb8b7b5d745d 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -2042,10 +2042,8 @@ static int push_dl_task(struct rq *rq) return 0; retry: - if (unlikely(next_task == rq->curr)) { - WARN_ON(1); + if (WARN_ON(next_task == rq->curr)) return 0; - } /* * If next_task preempts rq->curr, and rq->curr diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index 9aa3287ce301..e4f398ad9e73 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -1813,10 +1813,8 @@ static int push_rt_task(struct rq *rq) return 0; retry: - if (unlikely(next_task == rq->curr)) { - WARN_ON(1); + if (WARN_ON(next_task == rq->curr)) return 0; - } /* * It's possible that the next_task slipped in of -- cgit v1.3-7-g2ca7 From 5bd0988be12733a42a1a3d50e3e2ddfd79e57518 Mon Sep 17 00:00:00 2001 From: Quentin Perret Date: Mon, 3 Dec 2018 09:56:14 +0000 Subject: sched/topology: Relocate arch_scale_cpu_capacity() to the internal header By default, arch_scale_cpu_capacity() is only visible from within the kernel/sched folder. Relocate it to include/linux/sched/topology.h to make it visible to other clients needing to know about the capacity of CPUs, such as the Energy Model framework. This also shrinks the public header. Signed-off-by: Quentin Perret Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: adharmap@codeaurora.org Cc: chris.redpath@arm.com Cc: currojerez@riseup.net Cc: dietmar.eggemann@arm.com Cc: edubezval@gmail.com Cc: gregkh@linuxfoundation.org Cc: javi.merino@kernel.org Cc: joel@joelfernandes.org Cc: juri.lelli@redhat.com Cc: morten.rasmussen@arm.com Cc: patrick.bellasi@arm.com Cc: pkondeti@codeaurora.org Cc: rjw@rjwysocki.net Cc: skannan@codeaurora.org Cc: smuckle@google.com Cc: srinivas.pandruvada@linux.intel.com Cc: thara.gopinath@linaro.org Cc: tkjos@google.com Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Cc: viresh.kumar@linaro.org Link: https://lkml.kernel.org/r/20181203095628.11858-2-quentin.perret@arm.com Signed-off-by: Ingo Molnar --- include/linux/sched/topology.h | 16 ++++++++++++++++ kernel/sched/sched.h | 18 ------------------ 2 files changed, 16 insertions(+), 18 deletions(-) (limited to 'kernel') diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h index 7fa0bc17cd8c..c31d3a47a47c 100644 --- a/include/linux/sched/topology.h +++ b/include/linux/sched/topology.h @@ -201,6 +201,14 @@ extern void set_sched_topology(struct sched_domain_topology_level *tl); # define SD_INIT_NAME(type) #endif +#ifndef arch_scale_cpu_capacity +static __always_inline +unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu) +{ + return SCHED_CAPACITY_SCALE; +} +#endif + #else /* CONFIG_SMP */ struct sched_domain_attr; @@ -216,6 +224,14 @@ static inline bool cpus_share_cache(int this_cpu, int that_cpu) return true; } +#ifndef arch_scale_cpu_capacity +static __always_inline +unsigned long arch_scale_cpu_capacity(void __always_unused *sd, int cpu) +{ + return SCHED_CAPACITY_SCALE; +} +#endif + #endif /* !CONFIG_SMP */ static inline int task_node(const struct task_struct *p) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index ceb896404869..66067152a831 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1859,24 +1859,6 @@ unsigned long arch_scale_freq_capacity(int cpu) } #endif -#ifdef CONFIG_SMP -#ifndef arch_scale_cpu_capacity -static __always_inline -unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu) -{ - return SCHED_CAPACITY_SCALE; -} -#endif -#else -#ifndef arch_scale_cpu_capacity -static __always_inline -unsigned long arch_scale_cpu_capacity(void __always_unused *sd, int cpu) -{ - return SCHED_CAPACITY_SCALE; -} -#endif -#endif - #ifdef CONFIG_SMP #ifdef CONFIG_PREEMPT -- cgit v1.3-7-g2ca7 From 938e5e4b0d1502a93e787985cb95b136b40717b7 Mon Sep 17 00:00:00 2001 From: Quentin Perret Date: Mon, 3 Dec 2018 09:56:15 +0000 Subject: sched/cpufreq: Prepare schedutil for Energy Aware Scheduling Schedutil requests frequency by aggregating utilization signals from the scheduler (CFS, RT, DL, IRQ) and applying a 25% margin on top of them. Since Energy Aware Scheduling (EAS) needs to be able to predict the frequency requests, it needs to forecast the decisions made by the governor. In order to prepare the introduction of EAS, introduce schedutil_freq_util() to centralize the aforementioned signal aggregation and make it available to both schedutil and EAS. Since frequency selection and energy estimation still need to deal with RT and DL signals slightly differently, schedutil_freq_util() is called with a different 'type' parameter in those two contexts, and returns an aggregated utilization signal accordingly. While at it, introduce the map_util_freq() function which is designed to make schedutil's 25% margin usable easily for both sugov and EAS. As EAS will be able to predict schedutil's frequency requests more accurately than any other governor by design, it'd be sensible to make sure EAS cannot be used without schedutil. This will be done later, once EAS has actually been introduced. Suggested-by: Peter Zijlstra Signed-off-by: Quentin Perret Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Mike Galbraith Cc: Thomas Gleixner Cc: adharmap@codeaurora.org Cc: chris.redpath@arm.com Cc: currojerez@riseup.net Cc: dietmar.eggemann@arm.com Cc: edubezval@gmail.com Cc: gregkh@linuxfoundation.org Cc: javi.merino@kernel.org Cc: joel@joelfernandes.org Cc: juri.lelli@redhat.com Cc: morten.rasmussen@arm.com Cc: patrick.bellasi@arm.com Cc: pkondeti@codeaurora.org Cc: rjw@rjwysocki.net Cc: skannan@codeaurora.org Cc: smuckle@google.com Cc: srinivas.pandruvada@linux.intel.com Cc: thara.gopinath@linaro.org Cc: tkjos@google.com Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Cc: viresh.kumar@linaro.org Link: https://lkml.kernel.org/r/20181203095628.11858-3-quentin.perret@arm.com Signed-off-by: Ingo Molnar --- include/linux/sched/cpufreq.h | 6 +++++ kernel/sched/cpufreq_schedutil.c | 53 ++++++++++++++++++++++++++++------------ kernel/sched/sched.h | 30 +++++++++++++++++++++++ 3 files changed, 74 insertions(+), 15 deletions(-) (limited to 'kernel') diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h index 59667444669f..afa940cd50dc 100644 --- a/include/linux/sched/cpufreq.h +++ b/include/linux/sched/cpufreq.h @@ -20,6 +20,12 @@ void cpufreq_add_update_util_hook(int cpu, struct update_util_data *data, void (*func)(struct update_util_data *data, u64 time, unsigned int flags)); void cpufreq_remove_update_util_hook(int cpu); + +static inline unsigned long map_util_freq(unsigned long util, + unsigned long freq, unsigned long cap) +{ + return (freq + (freq >> 2)) * util / cap; +} #endif /* CONFIG_CPU_FREQ */ #endif /* _LINUX_SCHED_CPUFREQ_H */ diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c index 3fffad3bc8a8..90128be27712 100644 --- a/kernel/sched/cpufreq_schedutil.c +++ b/kernel/sched/cpufreq_schedutil.c @@ -13,6 +13,7 @@ #include "sched.h" +#include #include struct sugov_tunables { @@ -167,7 +168,7 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy, unsigned int freq = arch_scale_freq_invariant() ? policy->cpuinfo.max_freq : policy->cur; - freq = (freq + (freq >> 2)) * util / max; + freq = map_util_freq(util, freq, max); if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update) return sg_policy->next_freq; @@ -197,15 +198,13 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy, * based on the task model parameters and gives the minimal utilization * required to meet deadlines. */ -static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu) +unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs, + unsigned long max, enum schedutil_type type) { - struct rq *rq = cpu_rq(sg_cpu->cpu); - unsigned long util, irq, max; + unsigned long dl_util, util, irq; + struct rq *rq = cpu_rq(cpu); - sg_cpu->max = max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu); - sg_cpu->bw_dl = cpu_bw_dl(rq); - - if (rt_rq_is_runnable(&rq->rt)) + if (type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt)) return max; /* @@ -223,21 +222,30 @@ static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu) * utilization (PELT windows are synchronized) we can directly add them * to obtain the CPU's actual utilization. */ - util = cpu_util_cfs(rq); + util = util_cfs; util += cpu_util_rt(rq); + dl_util = cpu_util_dl(rq); + /* - * We do not make cpu_util_dl() a permanent part of this sum because we - * want to use cpu_bw_dl() later on, but we need to check if the - * CFS+RT+DL sum is saturated (ie. no idle time) such that we select - * f_max when there is no idle time. + * For frequency selection we do not make cpu_util_dl() a permanent part + * of this sum because we want to use cpu_bw_dl() later on, but we need + * to check if the CFS+RT+DL sum is saturated (ie. no idle time) such + * that we select f_max when there is no idle time. * * NOTE: numerical errors or stop class might cause us to not quite hit * saturation when we should -- something for later. */ - if ((util + cpu_util_dl(rq)) >= max) + if (util + dl_util >= max) return max; + /* + * OTOH, for energy computation we need the estimated running time, so + * include util_dl and ignore dl_bw. + */ + if (type == ENERGY_UTIL) + util += dl_util; + /* * There is still idle time; further improve the number by using the * irq metric. Because IRQ/steal time is hidden from the task clock we @@ -260,7 +268,22 @@ static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu) * bw_dl as requested freq. However, cpufreq is not yet ready for such * an interface. So, we only do the latter for now. */ - return min(max, util + sg_cpu->bw_dl); + if (type == FREQUENCY_UTIL) + util += cpu_bw_dl(rq); + + return min(max, util); +} + +static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu) +{ + struct rq *rq = cpu_rq(sg_cpu->cpu); + unsigned long util = cpu_util_cfs(rq); + unsigned long max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu); + + sg_cpu->max = max; + sg_cpu->bw_dl = cpu_bw_dl(rq); + + return schedutil_freq_util(sg_cpu->cpu, util, max, FREQUENCY_UTIL); } /** diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 66067152a831..2eafa228aebf 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2191,6 +2191,31 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {} #endif #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL +/** + * enum schedutil_type - CPU utilization type + * @FREQUENCY_UTIL: Utilization used to select frequency + * @ENERGY_UTIL: Utilization used during energy calculation + * + * The utilization signals of all scheduling classes (CFS/RT/DL) and IRQ time + * need to be aggregated differently depending on the usage made of them. This + * enum is used within schedutil_freq_util() to differentiate the types of + * utilization expected by the callers, and adjust the aggregation accordingly. + */ +enum schedutil_type { + FREQUENCY_UTIL, + ENERGY_UTIL, +}; + +unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs, + unsigned long max, enum schedutil_type type); + +static inline unsigned long schedutil_energy_util(int cpu, unsigned long cfs) +{ + unsigned long max = arch_scale_cpu_capacity(NULL, cpu); + + return schedutil_freq_util(cpu, cfs, max, ENERGY_UTIL); +} + static inline unsigned long cpu_bw_dl(struct rq *rq) { return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT; @@ -2217,6 +2242,11 @@ static inline unsigned long cpu_util_rt(struct rq *rq) { return READ_ONCE(rq->avg_rt.util_avg); } +#else /* CONFIG_CPU_FREQ_GOV_SCHEDUTIL */ +static inline unsigned long schedutil_energy_util(int cpu, unsigned long cfs) +{ + return cfs; +} #endif #ifdef CONFIG_HAVE_SCHED_AVG_IRQ -- cgit v1.3-7-g2ca7 From 27871f7a8a341ef5c636a337856369acf8013e4e Mon Sep 17 00:00:00 2001 From: Quentin Perret Date: Mon, 3 Dec 2018 09:56:16 +0000 Subject: PM: Introduce an Energy Model management framework Several subsystems in the kernel (task scheduler and/or thermal at the time of writing) can benefit from knowing about the energy consumed by CPUs. Yet, this information can come from different sources (DT or firmware for example), in different formats, hence making it hard to exploit without a standard API. As an attempt to address this, introduce a centralized Energy Model (EM) management framework which aggregates the power values provided by drivers into a table for each performance domain in the system. The power cost tables are made available to interested clients (e.g. task scheduler or thermal) via platform-agnostic APIs. The overall design is represented by the diagram below (focused on Arm-related drivers as an example, but applicable to any architecture): +---------------+ +-----------------+ +-------------+ | Thermal (IPA) | | Scheduler (EAS) | | Other | +---------------+ +-----------------+ +-------------+ | | em_pd_energy() | | | em_cpu_get() | +-----------+ | +--------+ | | | v v v +---------------------+ | | | Energy Model | | | | Framework | | | +---------------------+ ^ ^ ^ | | | em_register_perf_domain() +----------+ | +---------+ | | | +---------------+ +---------------+ +--------------+ | cpufreq-dt | | arm_scmi | | Other | +---------------+ +---------------+ +--------------+ ^ ^ ^ | | | +--------------+ +---------------+ +--------------+ | Device Tree | | Firmware | | ? | +--------------+ +---------------+ +--------------+ Drivers (typically, but not limited to, CPUFreq drivers) can register data in the EM framework using the em_register_perf_domain() API. The calling driver must provide a callback function with a standardized signature that will be used by the EM framework to build the power cost tables of the performance domain. This design should offer a lot of flexibility to calling drivers which are free of reading information from any location and to use any technique to compute power costs. Moreover, the capacity states registered by drivers in the EM framework are not required to match real performance states of the target. This is particularly important on targets where the performance states are not known by the OS. The power cost coefficients managed by the EM framework are specified in milli-watts. Although the two potential users of those coefficients (IPA and EAS) only need relative correctness, IPA specifically needs to compare the power of CPUs with the power of other components (GPUs, for example), which are still expressed in absolute terms in their respective subsystems. Hence, specifying the power of CPUs in milli-watts should help transitioning IPA to using the EM framework without introducing new problems by keeping units comparable across sub-systems. On the longer term, the EM of other devices than CPUs could also be managed by the EM framework, which would enable to remove the absolute unit. However, this is not absolutely required as a first step, so this extension of the EM framework is left for later. On the client side, the EM framework offers APIs to access the power cost tables of a CPU (em_cpu_get()), and to estimate the energy consumed by the CPUs of a performance domain (em_pd_energy()). Clients such as the task scheduler can then use these APIs to access the shared data structures holding the Energy Model of CPUs. Signed-off-by: Quentin Perret Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Rafael J. Wysocki Cc: Thomas Gleixner Cc: adharmap@codeaurora.org Cc: chris.redpath@arm.com Cc: currojerez@riseup.net Cc: dietmar.eggemann@arm.com Cc: edubezval@gmail.com Cc: gregkh@linuxfoundation.org Cc: javi.merino@kernel.org Cc: joel@joelfernandes.org Cc: juri.lelli@redhat.com Cc: morten.rasmussen@arm.com Cc: patrick.bellasi@arm.com Cc: pkondeti@codeaurora.org Cc: skannan@codeaurora.org Cc: smuckle@google.com Cc: srinivas.pandruvada@linux.intel.com Cc: thara.gopinath@linaro.org Cc: tkjos@google.com Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Cc: viresh.kumar@linaro.org Link: https://lkml.kernel.org/r/20181203095628.11858-4-quentin.perret@arm.com Signed-off-by: Ingo Molnar --- include/linux/energy_model.h | 187 ++++++++++++++++++++++++++++++++++++++++ kernel/power/Kconfig | 15 ++++ kernel/power/Makefile | 2 + kernel/power/energy_model.c | 201 +++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 405 insertions(+) create mode 100644 include/linux/energy_model.h create mode 100644 kernel/power/energy_model.c (limited to 'kernel') diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h new file mode 100644 index 000000000000..aa027f7bcb3e --- /dev/null +++ b/include/linux/energy_model.h @@ -0,0 +1,187 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_ENERGY_MODEL_H +#define _LINUX_ENERGY_MODEL_H +#include +#include +#include +#include +#include +#include +#include + +#ifdef CONFIG_ENERGY_MODEL +/** + * em_cap_state - Capacity state of a performance domain + * @frequency: The CPU frequency in KHz, for consistency with CPUFreq + * @power: The power consumed by 1 CPU at this level, in milli-watts + * @cost: The cost coefficient associated with this level, used during + * energy calculation. Equal to: power * max_frequency / frequency + */ +struct em_cap_state { + unsigned long frequency; + unsigned long power; + unsigned long cost; +}; + +/** + * em_perf_domain - Performance domain + * @table: List of capacity states, in ascending order + * @nr_cap_states: Number of capacity states + * @cpus: Cpumask covering the CPUs of the domain + * + * A "performance domain" represents a group of CPUs whose performance is + * scaled together. All CPUs of a performance domain must have the same + * micro-architecture. Performance domains often have a 1-to-1 mapping with + * CPUFreq policies. + */ +struct em_perf_domain { + struct em_cap_state *table; + int nr_cap_states; + unsigned long cpus[0]; +}; + +#define EM_CPU_MAX_POWER 0xFFFF + +struct em_data_callback { + /** + * active_power() - Provide power at the next capacity state of a CPU + * @power : Active power at the capacity state in mW (modified) + * @freq : Frequency at the capacity state in kHz (modified) + * @cpu : CPU for which we do this operation + * + * active_power() must find the lowest capacity state of 'cpu' above + * 'freq' and update 'power' and 'freq' to the matching active power + * and frequency. + * + * The power is the one of a single CPU in the domain, expressed in + * milli-watts. It is expected to fit in the [0, EM_CPU_MAX_POWER] + * range. + * + * Return 0 on success. + */ + int (*active_power)(unsigned long *power, unsigned long *freq, int cpu); +}; +#define EM_DATA_CB(_active_power_cb) { .active_power = &_active_power_cb } + +struct em_perf_domain *em_cpu_get(int cpu); +int em_register_perf_domain(cpumask_t *span, unsigned int nr_states, + struct em_data_callback *cb); + +/** + * em_pd_energy() - Estimates the energy consumed by the CPUs of a perf. domain + * @pd : performance domain for which energy has to be estimated + * @max_util : highest utilization among CPUs of the domain + * @sum_util : sum of the utilization of all CPUs in the domain + * + * Return: the sum of the energy consumed by the CPUs of the domain assuming + * a capacity state satisfying the max utilization of the domain. + */ +static inline unsigned long em_pd_energy(struct em_perf_domain *pd, + unsigned long max_util, unsigned long sum_util) +{ + unsigned long freq, scale_cpu; + struct em_cap_state *cs; + int i, cpu; + + /* + * In order to predict the capacity state, map the utilization of the + * most utilized CPU of the performance domain to a requested frequency, + * like schedutil. + */ + cpu = cpumask_first(to_cpumask(pd->cpus)); + scale_cpu = arch_scale_cpu_capacity(NULL, cpu); + cs = &pd->table[pd->nr_cap_states - 1]; + freq = map_util_freq(max_util, cs->frequency, scale_cpu); + + /* + * Find the lowest capacity state of the Energy Model above the + * requested frequency. + */ + for (i = 0; i < pd->nr_cap_states; i++) { + cs = &pd->table[i]; + if (cs->frequency >= freq) + break; + } + + /* + * The capacity of a CPU in the domain at that capacity state (cs) + * can be computed as: + * + * cs->freq * scale_cpu + * cs->cap = -------------------- (1) + * cpu_max_freq + * + * So, ignoring the costs of idle states (which are not available in + * the EM), the energy consumed by this CPU at that capacity state is + * estimated as: + * + * cs->power * cpu_util + * cpu_nrg = -------------------- (2) + * cs->cap + * + * since 'cpu_util / cs->cap' represents its percentage of busy time. + * + * NOTE: Although the result of this computation actually is in + * units of power, it can be manipulated as an energy value + * over a scheduling period, since it is assumed to be + * constant during that interval. + * + * By injecting (1) in (2), 'cpu_nrg' can be re-expressed as a product + * of two terms: + * + * cs->power * cpu_max_freq cpu_util + * cpu_nrg = ------------------------ * --------- (3) + * cs->freq scale_cpu + * + * The first term is static, and is stored in the em_cap_state struct + * as 'cs->cost'. + * + * Since all CPUs of the domain have the same micro-architecture, they + * share the same 'cs->cost', and the same CPU capacity. Hence, the + * total energy of the domain (which is the simple sum of the energy of + * all of its CPUs) can be factorized as: + * + * cs->cost * \Sum cpu_util + * pd_nrg = ------------------------ (4) + * scale_cpu + */ + return cs->cost * sum_util / scale_cpu; +} + +/** + * em_pd_nr_cap_states() - Get the number of capacity states of a perf. domain + * @pd : performance domain for which this must be done + * + * Return: the number of capacity states in the performance domain table + */ +static inline int em_pd_nr_cap_states(struct em_perf_domain *pd) +{ + return pd->nr_cap_states; +} + +#else +struct em_perf_domain {}; +struct em_data_callback {}; +#define EM_DATA_CB(_active_power_cb) { } + +static inline int em_register_perf_domain(cpumask_t *span, + unsigned int nr_states, struct em_data_callback *cb) +{ + return -EINVAL; +} +static inline struct em_perf_domain *em_cpu_get(int cpu) +{ + return NULL; +} +static inline unsigned long em_pd_energy(struct em_perf_domain *pd, + unsigned long max_util, unsigned long sum_util) +{ + return 0; +} +static inline int em_pd_nr_cap_states(struct em_perf_domain *pd) +{ + return 0; +} +#endif + +#endif diff --git a/kernel/power/Kconfig b/kernel/power/Kconfig index 3a6c2f87699e..f8fe57d1022e 100644 --- a/kernel/power/Kconfig +++ b/kernel/power/Kconfig @@ -298,3 +298,18 @@ config PM_GENERIC_DOMAINS_OF config CPU_PM bool + +config ENERGY_MODEL + bool "Energy Model for CPUs" + depends on SMP + depends on CPU_FREQ + default n + help + Several subsystems (thermal and/or the task scheduler for example) + can leverage information about the energy consumed by CPUs to make + smarter decisions. This config option enables the framework from + which subsystems can access the energy models. + + The exact usage of the energy model is subsystem-dependent. + + If in doubt, say N. diff --git a/kernel/power/Makefile b/kernel/power/Makefile index a3f79f0eef36..e7e47d9be1e5 100644 --- a/kernel/power/Makefile +++ b/kernel/power/Makefile @@ -15,3 +15,5 @@ obj-$(CONFIG_PM_AUTOSLEEP) += autosleep.o obj-$(CONFIG_PM_WAKELOCKS) += wakelock.o obj-$(CONFIG_MAGIC_SYSRQ) += poweroff.o + +obj-$(CONFIG_ENERGY_MODEL) += energy_model.o diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c new file mode 100644 index 000000000000..d9dc2c38764a --- /dev/null +++ b/kernel/power/energy_model.c @@ -0,0 +1,201 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Energy Model of CPUs + * + * Copyright (c) 2018, Arm ltd. + * Written by: Quentin Perret, Arm ltd. + */ + +#define pr_fmt(fmt) "energy_model: " fmt + +#include +#include +#include +#include +#include + +/* Mapping of each CPU to the performance domain to which it belongs. */ +static DEFINE_PER_CPU(struct em_perf_domain *, em_data); + +/* + * Mutex serializing the registrations of performance domains and letting + * callbacks defined by drivers sleep. + */ +static DEFINE_MUTEX(em_pd_mutex); + +static struct em_perf_domain *em_create_pd(cpumask_t *span, int nr_states, + struct em_data_callback *cb) +{ + unsigned long opp_eff, prev_opp_eff = ULONG_MAX; + unsigned long power, freq, prev_freq = 0; + int i, ret, cpu = cpumask_first(span); + struct em_cap_state *table; + struct em_perf_domain *pd; + u64 fmax; + + if (!cb->active_power) + return NULL; + + pd = kzalloc(sizeof(*pd) + cpumask_size(), GFP_KERNEL); + if (!pd) + return NULL; + + table = kcalloc(nr_states, sizeof(*table), GFP_KERNEL); + if (!table) + goto free_pd; + + /* Build the list of capacity states for this performance domain */ + for (i = 0, freq = 0; i < nr_states; i++, freq++) { + /* + * active_power() is a driver callback which ceils 'freq' to + * lowest capacity state of 'cpu' above 'freq' and updates + * 'power' and 'freq' accordingly. + */ + ret = cb->active_power(&power, &freq, cpu); + if (ret) { + pr_err("pd%d: invalid cap. state: %d\n", cpu, ret); + goto free_cs_table; + } + + /* + * We expect the driver callback to increase the frequency for + * higher capacity states. + */ + if (freq <= prev_freq) { + pr_err("pd%d: non-increasing freq: %lu\n", cpu, freq); + goto free_cs_table; + } + + /* + * The power returned by active_state() is expected to be + * positive, in milli-watts and to fit into 16 bits. + */ + if (!power || power > EM_CPU_MAX_POWER) { + pr_err("pd%d: invalid power: %lu\n", cpu, power); + goto free_cs_table; + } + + table[i].power = power; + table[i].frequency = prev_freq = freq; + + /* + * The hertz/watts efficiency ratio should decrease as the + * frequency grows on sane platforms. But this isn't always + * true in practice so warn the user if a higher OPP is more + * power efficient than a lower one. + */ + opp_eff = freq / power; + if (opp_eff >= prev_opp_eff) + pr_warn("pd%d: hertz/watts ratio non-monotonically decreasing: em_cap_state %d >= em_cap_state%d\n", + cpu, i, i - 1); + prev_opp_eff = opp_eff; + } + + /* Compute the cost of each capacity_state. */ + fmax = (u64) table[nr_states - 1].frequency; + for (i = 0; i < nr_states; i++) { + table[i].cost = div64_u64(fmax * table[i].power, + table[i].frequency); + } + + pd->table = table; + pd->nr_cap_states = nr_states; + cpumask_copy(to_cpumask(pd->cpus), span); + + return pd; + +free_cs_table: + kfree(table); +free_pd: + kfree(pd); + + return NULL; +} + +/** + * em_cpu_get() - Return the performance domain for a CPU + * @cpu : CPU to find the performance domain for + * + * Return: the performance domain to which 'cpu' belongs, or NULL if it doesn't + * exist. + */ +struct em_perf_domain *em_cpu_get(int cpu) +{ + return READ_ONCE(per_cpu(em_data, cpu)); +} +EXPORT_SYMBOL_GPL(em_cpu_get); + +/** + * em_register_perf_domain() - Register the Energy Model of a performance domain + * @span : Mask of CPUs in the performance domain + * @nr_states : Number of capacity states to register + * @cb : Callback functions providing the data of the Energy Model + * + * Create Energy Model tables for a performance domain using the callbacks + * defined in cb. + * + * If multiple clients register the same performance domain, all but the first + * registration will be ignored. + * + * Return 0 on success + */ +int em_register_perf_domain(cpumask_t *span, unsigned int nr_states, + struct em_data_callback *cb) +{ + unsigned long cap, prev_cap = 0; + struct em_perf_domain *pd; + int cpu, ret = 0; + + if (!span || !nr_states || !cb) + return -EINVAL; + + /* + * Use a mutex to serialize the registration of performance domains and + * let the driver-defined callback functions sleep. + */ + mutex_lock(&em_pd_mutex); + + for_each_cpu(cpu, span) { + /* Make sure we don't register again an existing domain. */ + if (READ_ONCE(per_cpu(em_data, cpu))) { + ret = -EEXIST; + goto unlock; + } + + /* + * All CPUs of a domain must have the same micro-architecture + * since they all share the same table. + */ + cap = arch_scale_cpu_capacity(NULL, cpu); + if (prev_cap && prev_cap != cap) { + pr_err("CPUs of %*pbl must have the same capacity\n", + cpumask_pr_args(span)); + ret = -EINVAL; + goto unlock; + } + prev_cap = cap; + } + + /* Create the performance domain and add it to the Energy Model. */ + pd = em_create_pd(span, nr_states, cb); + if (!pd) { + ret = -EINVAL; + goto unlock; + } + + for_each_cpu(cpu, span) { + /* + * The per-cpu array can be read concurrently from em_cpu_get(). + * The barrier enforces the ordering needed to make sure readers + * can only access well formed em_perf_domain structs. + */ + smp_store_release(per_cpu_ptr(&em_data, cpu), pd); + } + + pr_debug("Created perf domain %*pbl\n", cpumask_pr_args(span)); +unlock: + mutex_unlock(&em_pd_mutex); + + return ret; +} +EXPORT_SYMBOL_GPL(em_register_perf_domain); -- cgit v1.3-7-g2ca7 From 6aa140fa4508933a6ac6717d65a403eb904d6c02 Mon Sep 17 00:00:00 2001 From: Quentin Perret Date: Mon, 3 Dec 2018 09:56:18 +0000 Subject: sched/topology: Reference the Energy Model of CPUs when available The existing scheduling domain hierarchy is defined to map to the cache topology of the system. However, Energy Aware Scheduling (EAS) requires more knowledge about the platform, and specifically needs to know about the span of Performance Domains (PD), which do not always align with caches. To address this issue, use the Energy Model (EM) of the system to extend the scheduler topology code with a representation of the PDs, alongside the scheduling domains. More specifically, a linked list of PDs is attached to each root domain. When multiple root domains are in use, each list contains only the PDs covering the CPUs of its root domain. If a PD spans over CPUs of multiple different root domains, it will be duplicated in all lists. The lists are fully maintained by the scheduler from partition_sched_domains() in order to cope with hotplug and cpuset changes. As for scheduling domains, the list are protected by RCU to ensure safe concurrent updates. Signed-off-by: Quentin Perret Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: adharmap@codeaurora.org Cc: chris.redpath@arm.com Cc: currojerez@riseup.net Cc: dietmar.eggemann@arm.com Cc: edubezval@gmail.com Cc: gregkh@linuxfoundation.org Cc: javi.merino@kernel.org Cc: joel@joelfernandes.org Cc: juri.lelli@redhat.com Cc: morten.rasmussen@arm.com Cc: patrick.bellasi@arm.com Cc: pkondeti@codeaurora.org Cc: rjw@rjwysocki.net Cc: skannan@codeaurora.org Cc: smuckle@google.com Cc: srinivas.pandruvada@linux.intel.com Cc: thara.gopinath@linaro.org Cc: tkjos@google.com Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Cc: viresh.kumar@linaro.org Link: https://lkml.kernel.org/r/20181203095628.11858-6-quentin.perret@arm.com Signed-off-by: Ingo Molnar --- kernel/sched/sched.h | 21 ++++++++ kernel/sched/topology.c | 134 ++++++++++++++++++++++++++++++++++++++++++++++-- 2 files changed, 151 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 2eafa228aebf..808a565187b1 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -45,6 +45,7 @@ #include #include #include +#include #include #include #include @@ -709,6 +710,12 @@ static inline bool sched_asym_prefer(int a, int b) return arch_asym_cpu_priority(a) > arch_asym_cpu_priority(b); } +struct perf_domain { + struct em_perf_domain *em_pd; + struct perf_domain *next; + struct rcu_head rcu; +}; + /* * We add the notion of a root-domain which will be used to define per-domain * variables. Each exclusive cpuset essentially defines an island domain by @@ -761,6 +768,12 @@ struct root_domain { struct cpupri cpupri; unsigned long max_cpu_capacity; + + /* + * NULL-terminated list of performance domains intersecting with the + * CPUs of the rd. Protected by RCU. + */ + struct perf_domain *pd; }; extern struct root_domain def_root_domain; @@ -2276,3 +2289,11 @@ unsigned long scale_irq_capacity(unsigned long util, unsigned long irq, unsigned return util; } #endif + +#ifdef CONFIG_SMP +#ifdef CONFIG_ENERGY_MODEL +#define perf_domain_span(pd) (to_cpumask(((pd)->em_pd->cpus))) +#else +#define perf_domain_span(pd) NULL +#endif +#endif diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index 7364e0b427b7..169d25cafab5 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -201,6 +201,116 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent) return 1; } +#ifdef CONFIG_ENERGY_MODEL +static void free_pd(struct perf_domain *pd) +{ + struct perf_domain *tmp; + + while (pd) { + tmp = pd->next; + kfree(pd); + pd = tmp; + } +} + +static struct perf_domain *find_pd(struct perf_domain *pd, int cpu) +{ + while (pd) { + if (cpumask_test_cpu(cpu, perf_domain_span(pd))) + return pd; + pd = pd->next; + } + + return NULL; +} + +static struct perf_domain *pd_init(int cpu) +{ + struct em_perf_domain *obj = em_cpu_get(cpu); + struct perf_domain *pd; + + if (!obj) { + if (sched_debug()) + pr_info("%s: no EM found for CPU%d\n", __func__, cpu); + return NULL; + } + + pd = kzalloc(sizeof(*pd), GFP_KERNEL); + if (!pd) + return NULL; + pd->em_pd = obj; + + return pd; +} + +static void perf_domain_debug(const struct cpumask *cpu_map, + struct perf_domain *pd) +{ + if (!sched_debug() || !pd) + return; + + printk(KERN_DEBUG "root_domain %*pbl:", cpumask_pr_args(cpu_map)); + + while (pd) { + printk(KERN_CONT " pd%d:{ cpus=%*pbl nr_cstate=%d }", + cpumask_first(perf_domain_span(pd)), + cpumask_pr_args(perf_domain_span(pd)), + em_pd_nr_cap_states(pd->em_pd)); + pd = pd->next; + } + + printk(KERN_CONT "\n"); +} + +static void destroy_perf_domain_rcu(struct rcu_head *rp) +{ + struct perf_domain *pd; + + pd = container_of(rp, struct perf_domain, rcu); + free_pd(pd); +} + +static void build_perf_domains(const struct cpumask *cpu_map) +{ + struct perf_domain *pd = NULL, *tmp; + int cpu = cpumask_first(cpu_map); + struct root_domain *rd = cpu_rq(cpu)->rd; + int i; + + for_each_cpu(i, cpu_map) { + /* Skip already covered CPUs. */ + if (find_pd(pd, i)) + continue; + + /* Create the new pd and add it to the local list. */ + tmp = pd_init(i); + if (!tmp) + goto free; + tmp->next = pd; + pd = tmp; + } + + perf_domain_debug(cpu_map, pd); + + /* Attach the new list of performance domains to the root domain. */ + tmp = rd->pd; + rcu_assign_pointer(rd->pd, pd); + if (tmp) + call_rcu(&tmp->rcu, destroy_perf_domain_rcu); + + return; + +free: + free_pd(pd); + tmp = rd->pd; + rcu_assign_pointer(rd->pd, NULL); + if (tmp) + call_rcu(&tmp->rcu, destroy_perf_domain_rcu); +} +#else +static void free_pd(struct perf_domain *pd) { } +#endif /* CONFIG_ENERGY_MODEL */ + static void free_rootdomain(struct rcu_head *rcu) { struct root_domain *rd = container_of(rcu, struct root_domain, rcu); @@ -211,6 +321,7 @@ static void free_rootdomain(struct rcu_head *rcu) free_cpumask_var(rd->rto_mask); free_cpumask_var(rd->online); free_cpumask_var(rd->span); + free_pd(rd->pd); kfree(rd); } @@ -1959,8 +2070,8 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[], /* Destroy deleted domains: */ for (i = 0; i < ndoms_cur; i++) { for (j = 0; j < n && !new_topology; j++) { - if (cpumask_equal(doms_cur[i], doms_new[j]) - && dattrs_equal(dattr_cur, i, dattr_new, j)) + if (cpumask_equal(doms_cur[i], doms_new[j]) && + dattrs_equal(dattr_cur, i, dattr_new, j)) goto match1; } /* No match - a current sched domain not in new doms_new[] */ @@ -1980,8 +2091,8 @@ match1: /* Build new domains: */ for (i = 0; i < ndoms_new; i++) { for (j = 0; j < n && !new_topology; j++) { - if (cpumask_equal(doms_new[i], doms_cur[j]) - && dattrs_equal(dattr_new, i, dattr_cur, j)) + if (cpumask_equal(doms_new[i], doms_cur[j]) && + dattrs_equal(dattr_new, i, dattr_cur, j)) goto match2; } /* No match - add a new doms_new */ @@ -1990,6 +2101,21 @@ match2: ; } +#ifdef CONFIG_ENERGY_MODEL + /* Build perf. domains: */ + for (i = 0; i < ndoms_new; i++) { + for (j = 0; j < n; j++) { + if (cpumask_equal(doms_new[i], doms_cur[j]) && + cpu_rq(cpumask_first(doms_cur[j]))->rd->pd) + goto match3; + } + /* No match - add perf. domains for a new rd */ + build_perf_domains(doms_new[i]); +match3: + ; + } +#endif + /* Remember the new sched domains: */ if (doms_cur != &fallback_doms) free_sched_domains(doms_cur, ndoms_cur); -- cgit v1.3-7-g2ca7 From 011b27bb5d3139e8b5fe9ceff1fc7f6dc3145071 Mon Sep 17 00:00:00 2001 From: Quentin Perret Date: Mon, 3 Dec 2018 09:56:19 +0000 Subject: sched/topology: Add lowest CPU asymmetry sched_domain level pointer Add another member to the family of per-cpu sched_domain shortcut pointers. This one, sd_asym_cpucapacity, points to the lowest level at which the SD_ASYM_CPUCAPACITY flag is set. While at it, rename the sd_asym shortcut to sd_asym_packing to avoid confusions. Generally speaking, the largest opportunity to save energy via scheduling comes from a smarter exploitation of heterogeneous platforms (i.e. big.LITTLE). Consequently, the sd_asym_cpucapacity shortcut will be used at first as the lowest domain where Energy-Aware Scheduling (EAS) should be applied. For example, it is possible to apply EAS within a socket on a multi-socket system, as long as each socket has an asymmetric topology. Energy-aware cross-sockets wake-up balancing will only happen when the system is over-utilized, or this_cpu and prev_cpu are in different sockets. Suggested-by: Morten Rasmussen Signed-off-by: Quentin Perret Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: adharmap@codeaurora.org Cc: chris.redpath@arm.com Cc: currojerez@riseup.net Cc: dietmar.eggemann@arm.com Cc: edubezval@gmail.com Cc: gregkh@linuxfoundation.org Cc: javi.merino@kernel.org Cc: joel@joelfernandes.org Cc: juri.lelli@redhat.com Cc: patrick.bellasi@arm.com Cc: pkondeti@codeaurora.org Cc: rjw@rjwysocki.net Cc: skannan@codeaurora.org Cc: smuckle@google.com Cc: srinivas.pandruvada@linux.intel.com Cc: thara.gopinath@linaro.org Cc: tkjos@google.com Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Cc: viresh.kumar@linaro.org Link: https://lkml.kernel.org/r/20181203095628.11858-7-quentin.perret@arm.com Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 2 +- kernel/sched/sched.h | 3 ++- kernel/sched/topology.c | 8 ++++++-- 3 files changed, 9 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index fdc8356ea742..a31a6d325901 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -9299,7 +9299,7 @@ static void nohz_balancer_kick(struct rq *rq) } } - sd = rcu_dereference(per_cpu(sd_asym, cpu)); + sd = rcu_dereference(per_cpu(sd_asym_packing, cpu)); if (sd) { for_each_cpu(i, sched_domain_span(sd)) { if (i == cpu || diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 808a565187b1..75c403674706 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1303,7 +1303,8 @@ DECLARE_PER_CPU(int, sd_llc_size); DECLARE_PER_CPU(int, sd_llc_id); DECLARE_PER_CPU(struct sched_domain_shared *, sd_llc_shared); DECLARE_PER_CPU(struct sched_domain *, sd_numa); -DECLARE_PER_CPU(struct sched_domain *, sd_asym); +DECLARE_PER_CPU(struct sched_domain *, sd_asym_packing); +DECLARE_PER_CPU(struct sched_domain *, sd_asym_cpucapacity); extern struct static_key_false sched_asym_cpucapacity; struct sched_group_capacity { diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index 169d25cafab5..137ccfed9a43 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -508,7 +508,8 @@ DEFINE_PER_CPU(int, sd_llc_size); DEFINE_PER_CPU(int, sd_llc_id); DEFINE_PER_CPU(struct sched_domain_shared *, sd_llc_shared); DEFINE_PER_CPU(struct sched_domain *, sd_numa); -DEFINE_PER_CPU(struct sched_domain *, sd_asym); +DEFINE_PER_CPU(struct sched_domain *, sd_asym_packing); +DEFINE_PER_CPU(struct sched_domain *, sd_asym_cpucapacity); DEFINE_STATIC_KEY_FALSE(sched_asym_cpucapacity); static void update_top_cache_domain(int cpu) @@ -534,7 +535,10 @@ static void update_top_cache_domain(int cpu) rcu_assign_pointer(per_cpu(sd_numa, cpu), sd); sd = highest_flag_domain(cpu, SD_ASYM_PACKING); - rcu_assign_pointer(per_cpu(sd_asym, cpu), sd); + rcu_assign_pointer(per_cpu(sd_asym_packing, cpu), sd); + + sd = lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY); + rcu_assign_pointer(per_cpu(sd_asym_cpucapacity, cpu), sd); } /* -- cgit v1.3-7-g2ca7 From b68a4c0dba3b1e1dda1ede49f3c2fc72d3b54567 Mon Sep 17 00:00:00 2001 From: Quentin Perret Date: Mon, 3 Dec 2018 09:56:20 +0000 Subject: sched/topology: Disable EAS on inappropriate platforms Energy Aware Scheduling (EAS) in its current form is most relevant on platforms with asymmetric CPU topologies (e.g. Arm big.LITTLE) since this is where there is a lot of potential for saving energy through scheduling. This is particularly true since the Energy Model only includes the active power costs of CPUs, hence not providing enough data to compare packing-vs-spreading strategies. As such, disable EAS on root domains where the SD_ASYM_CPUCAPACITY flag is not set. While at it, disable EAS on systems where the complexity of the Energy Model is too high since that could lead to unacceptable scheduling overhead. All in all, EAS can be used on a root domain if and only if: 1. an Energy Model is available; 2. the root domain has an asymmetric CPU capacity topology; 3. the complexity of the root domain's EM is low enough to keep scheduling overheads low. Signed-off-by: Quentin Perret Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: adharmap@codeaurora.org Cc: chris.redpath@arm.com Cc: currojerez@riseup.net Cc: dietmar.eggemann@arm.com Cc: edubezval@gmail.com Cc: gregkh@linuxfoundation.org Cc: javi.merino@kernel.org Cc: joel@joelfernandes.org Cc: juri.lelli@redhat.com Cc: morten.rasmussen@arm.com Cc: patrick.bellasi@arm.com Cc: pkondeti@codeaurora.org Cc: rjw@rjwysocki.net Cc: skannan@codeaurora.org Cc: smuckle@google.com Cc: srinivas.pandruvada@linux.intel.com Cc: thara.gopinath@linaro.org Cc: tkjos@google.com Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Cc: viresh.kumar@linaro.org Link: https://lkml.kernel.org/r/20181203095628.11858-8-quentin.perret@arm.com Signed-off-by: Ingo Molnar --- kernel/sched/topology.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 48 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index 137ccfed9a43..6ddb804b2dec 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -270,12 +270,45 @@ static void destroy_perf_domain_rcu(struct rcu_head *rp) free_pd(pd); } +/* + * EAS can be used on a root domain if it meets all the following conditions: + * 1. an Energy Model (EM) is available; + * 2. the SD_ASYM_CPUCAPACITY flag is set in the sched_domain hierarchy. + * 3. the EM complexity is low enough to keep scheduling overheads low; + * + * The complexity of the Energy Model is defined as: + * + * C = nr_pd * (nr_cpus + nr_cs) + * + * with parameters defined as: + * - nr_pd: the number of performance domains + * - nr_cpus: the number of CPUs + * - nr_cs: the sum of the number of capacity states of all performance + * domains (for example, on a system with 2 performance domains, + * with 10 capacity states each, nr_cs = 2 * 10 = 20). + * + * It is generally not a good idea to use such a model in the wake-up path on + * very complex platforms because of the associated scheduling overheads. The + * arbitrary constraint below prevents that. It makes EAS usable up to 16 CPUs + * with per-CPU DVFS and less than 8 capacity states each, for example. + */ +#define EM_MAX_COMPLEXITY 2048 + static void build_perf_domains(const struct cpumask *cpu_map) { + int i, nr_pd = 0, nr_cs = 0, nr_cpus = cpumask_weight(cpu_map); struct perf_domain *pd = NULL, *tmp; int cpu = cpumask_first(cpu_map); struct root_domain *rd = cpu_rq(cpu)->rd; - int i; + + /* EAS is enabled for asymmetric CPU capacity topologies. */ + if (!per_cpu(sd_asym_cpucapacity, cpu)) { + if (sched_debug()) { + pr_info("rd %*pbl: CPUs do not have asymmetric capacities\n", + cpumask_pr_args(cpu_map)); + } + goto free; + } for_each_cpu(i, cpu_map) { /* Skip already covered CPUs. */ @@ -288,6 +321,20 @@ static void build_perf_domains(const struct cpumask *cpu_map) goto free; tmp->next = pd; pd = tmp; + + /* + * Count performance domains and capacity states for the + * complexity check. + */ + nr_pd++; + nr_cs += em_pd_nr_cap_states(pd->em_pd); + } + + /* Bail out if the Energy Model complexity is too high. */ + if (nr_pd * (nr_cs + nr_cpus) > EM_MAX_COMPLEXITY) { + WARN(1, "rd %*pbl: Failed to start EAS, EM complexity is too high\n", + cpumask_pr_args(cpu_map)); + goto free; } perf_domain_debug(cpu_map, pd); -- cgit v1.3-7-g2ca7 From 531b5c9f5cd05ead53324f419b32685a22eebe8b Mon Sep 17 00:00:00 2001 From: Quentin Perret Date: Mon, 3 Dec 2018 09:56:21 +0000 Subject: sched/topology: Make Energy Aware Scheduling depend on schedutil Energy Aware Scheduling (EAS) is designed with the assumption that frequencies of CPUs follow their utilization value. When using a CPUFreq governor other than schedutil, the chances of this assumption being true are small, if any. When schedutil is being used, EAS' predictions are at least consistent with the frequency requests. Although those requests have no guarantees to be honored by the hardware, they should at least guide DVFS in the right direction and provide some hope in regards to the EAS model being accurate. To make sure EAS is only used in a sane configuration, create a strong dependency on schedutil being used. Since having sugov compiled-in does not provide that guarantee, make CPUFreq call a scheduler function on governor changes hence letting it rebuild the scheduling domains, check the governors of the online CPUs, and enable/disable EAS accordingly. Signed-off-by: Quentin Perret Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Rafael J. Wysocki Cc: Thomas Gleixner Cc: adharmap@codeaurora.org Cc: chris.redpath@arm.com Cc: currojerez@riseup.net Cc: dietmar.eggemann@arm.com Cc: edubezval@gmail.com Cc: gregkh@linuxfoundation.org Cc: javi.merino@kernel.org Cc: joel@joelfernandes.org Cc: juri.lelli@redhat.com Cc: morten.rasmussen@arm.com Cc: patrick.bellasi@arm.com Cc: pkondeti@codeaurora.org Cc: skannan@codeaurora.org Cc: smuckle@google.com Cc: srinivas.pandruvada@linux.intel.com Cc: thara.gopinath@linaro.org Cc: tkjos@google.com Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Cc: viresh.kumar@linaro.org Link: https://lkml.kernel.org/r/20181203095628.11858-9-quentin.perret@arm.com Signed-off-by: Ingo Molnar --- drivers/cpufreq/cpufreq.c | 1 + include/linux/cpufreq.h | 8 ++++++++ kernel/sched/cpufreq_schedutil.c | 37 +++++++++++++++++++++++++++++++++++-- kernel/sched/sched.h | 4 +--- kernel/sched/topology.c | 28 ++++++++++++++++++++++++---- 5 files changed, 69 insertions(+), 9 deletions(-) (limited to 'kernel') diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c index 7aa3dcad2175..6f23ebb395f1 100644 --- a/drivers/cpufreq/cpufreq.c +++ b/drivers/cpufreq/cpufreq.c @@ -2277,6 +2277,7 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy, ret = cpufreq_start_governor(policy); if (!ret) { pr_debug("cpufreq: governor change\n"); + sched_cpufreq_governor_change(policy, old_gov); return 0; } cpufreq_exit_governor(policy); diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h index 882a9b9e34bc..c86d6d8bdfed 100644 --- a/include/linux/cpufreq.h +++ b/include/linux/cpufreq.h @@ -950,6 +950,14 @@ static inline bool policy_has_boost_freq(struct cpufreq_policy *policy) } #endif +#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) +void sched_cpufreq_governor_change(struct cpufreq_policy *policy, + struct cpufreq_governor *old_gov); +#else +static inline void sched_cpufreq_governor_change(struct cpufreq_policy *policy, + struct cpufreq_governor *old_gov) { } +#endif + extern void arch_freq_prepare_all(void); extern unsigned int arch_freq_get_on_cpu(int cpu); diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c index 90128be27712..c2e53d1a3143 100644 --- a/kernel/sched/cpufreq_schedutil.c +++ b/kernel/sched/cpufreq_schedutil.c @@ -624,7 +624,7 @@ static struct kobj_type sugov_tunables_ktype = { /********************** cpufreq governor interface *********************/ -static struct cpufreq_governor schedutil_gov; +struct cpufreq_governor schedutil_gov; static struct sugov_policy *sugov_policy_alloc(struct cpufreq_policy *policy) { @@ -883,7 +883,7 @@ static void sugov_limits(struct cpufreq_policy *policy) sg_policy->need_freq_update = true; } -static struct cpufreq_governor schedutil_gov = { +struct cpufreq_governor schedutil_gov = { .name = "schedutil", .owner = THIS_MODULE, .dynamic_switching = true, @@ -906,3 +906,36 @@ static int __init sugov_register(void) return cpufreq_register_governor(&schedutil_gov); } fs_initcall(sugov_register); + +#ifdef CONFIG_ENERGY_MODEL +extern bool sched_energy_update; +extern struct mutex sched_energy_mutex; + +static void rebuild_sd_workfn(struct work_struct *work) +{ + mutex_lock(&sched_energy_mutex); + sched_energy_update = true; + rebuild_sched_domains(); + sched_energy_update = false; + mutex_unlock(&sched_energy_mutex); +} +static DECLARE_WORK(rebuild_sd_work, rebuild_sd_workfn); + +/* + * EAS shouldn't be attempted without sugov, so rebuild the sched_domains + * on governor changes to make sure the scheduler knows about it. + */ +void sched_cpufreq_governor_change(struct cpufreq_policy *policy, + struct cpufreq_governor *old_gov) +{ + if (old_gov == &schedutil_gov || policy->governor == &schedutil_gov) { + /* + * When called from the cpufreq_register_driver() path, the + * cpu_hotplug_lock is already held, so use a work item to + * avoid nested locking in rebuild_sched_domains(). + */ + schedule_work(&rebuild_sd_work); + } + +} +#endif diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 75c403674706..fd84900b0b21 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2291,10 +2291,8 @@ unsigned long scale_irq_capacity(unsigned long util, unsigned long irq, unsigned } #endif -#ifdef CONFIG_SMP -#ifdef CONFIG_ENERGY_MODEL +#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) #define perf_domain_span(pd) (to_cpumask(((pd)->em_pd->cpus))) #else #define perf_domain_span(pd) NULL #endif -#endif diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index 6ddb804b2dec..0a5a1d3a4eae 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -201,7 +201,10 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent) return 1; } -#ifdef CONFIG_ENERGY_MODEL +#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) +DEFINE_MUTEX(sched_energy_mutex); +bool sched_energy_update; + static void free_pd(struct perf_domain *pd) { struct perf_domain *tmp; @@ -275,6 +278,7 @@ static void destroy_perf_domain_rcu(struct rcu_head *rp) * 1. an Energy Model (EM) is available; * 2. the SD_ASYM_CPUCAPACITY flag is set in the sched_domain hierarchy. * 3. the EM complexity is low enough to keep scheduling overheads low; + * 4. schedutil is driving the frequency of all CPUs of the rd; * * The complexity of the Energy Model is defined as: * @@ -294,12 +298,15 @@ static void destroy_perf_domain_rcu(struct rcu_head *rp) */ #define EM_MAX_COMPLEXITY 2048 +extern struct cpufreq_governor schedutil_gov; static void build_perf_domains(const struct cpumask *cpu_map) { int i, nr_pd = 0, nr_cs = 0, nr_cpus = cpumask_weight(cpu_map); struct perf_domain *pd = NULL, *tmp; int cpu = cpumask_first(cpu_map); struct root_domain *rd = cpu_rq(cpu)->rd; + struct cpufreq_policy *policy; + struct cpufreq_governor *gov; /* EAS is enabled for asymmetric CPU capacity topologies. */ if (!per_cpu(sd_asym_cpucapacity, cpu)) { @@ -315,6 +322,19 @@ static void build_perf_domains(const struct cpumask *cpu_map) if (find_pd(pd, i)) continue; + /* Do not attempt EAS if schedutil is not being used. */ + policy = cpufreq_cpu_get(i); + if (!policy) + goto free; + gov = policy->governor; + cpufreq_cpu_put(policy); + if (gov != &schedutil_gov) { + if (rd->pd) + pr_warn("rd %*pbl: Disabling EAS, schedutil is mandatory\n", + cpumask_pr_args(cpu_map)); + goto free; + } + /* Create the new pd and add it to the local list. */ tmp = pd_init(i); if (!tmp) @@ -356,7 +376,7 @@ free: } #else static void free_pd(struct perf_domain *pd) { } -#endif /* CONFIG_ENERGY_MODEL */ +#endif /* CONFIG_ENERGY_MODEL && CONFIG_CPU_FREQ_GOV_SCHEDUTIL*/ static void free_rootdomain(struct rcu_head *rcu) { @@ -2152,10 +2172,10 @@ match2: ; } -#ifdef CONFIG_ENERGY_MODEL +#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) /* Build perf. domains: */ for (i = 0; i < ndoms_new; i++) { - for (j = 0; j < n; j++) { + for (j = 0; j < n && !sched_energy_update; j++) { if (cpumask_equal(doms_new[i], doms_cur[j]) && cpu_rq(cpumask_first(doms_cur[j]))->rd->pd) goto match3; -- cgit v1.3-7-g2ca7 From 1f74de8798c93ce14801cc4e772603e51c841c33 Mon Sep 17 00:00:00 2001 From: Quentin Perret Date: Mon, 3 Dec 2018 09:56:22 +0000 Subject: sched/toplogy: Introduce the 'sched_energy_present' static key In order to make sure Energy Aware Scheduling (EAS) will not impact systems where no Energy Model is available, introduce a static key guarding the access to EAS code. Since EAS is enabled on a per-root-domain basis, the static key is enabled when at least one root domain meets all conditions for EAS. Signed-off-by: Quentin Perret Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: adharmap@codeaurora.org Cc: chris.redpath@arm.com Cc: currojerez@riseup.net Cc: dietmar.eggemann@arm.com Cc: edubezval@gmail.com Cc: gregkh@linuxfoundation.org Cc: javi.merino@kernel.org Cc: joel@joelfernandes.org Cc: juri.lelli@redhat.com Cc: morten.rasmussen@arm.com Cc: patrick.bellasi@arm.com Cc: pkondeti@codeaurora.org Cc: rjw@rjwysocki.net Cc: skannan@codeaurora.org Cc: smuckle@google.com Cc: srinivas.pandruvada@linux.intel.com Cc: thara.gopinath@linaro.org Cc: tkjos@google.com Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Cc: viresh.kumar@linaro.org Link: https://lkml.kernel.org/r/20181203095628.11858-10-quentin.perret@arm.com Signed-off-by: Ingo Molnar --- kernel/sched/sched.h | 4 ++++ kernel/sched/topology.c | 28 ++++++++++++++++++++++++---- 2 files changed, 28 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index fd84900b0b21..2b3cf356e958 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2296,3 +2296,7 @@ unsigned long scale_irq_capacity(unsigned long util, unsigned long irq, unsigned #else #define perf_domain_span(pd) NULL #endif + +#ifdef CONFIG_SMP +extern struct static_key_false sched_energy_present; +#endif diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index 0a5a1d3a4eae..3f35ba1d8fde 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -201,6 +201,7 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent) return 1; } +DEFINE_STATIC_KEY_FALSE(sched_energy_present); #if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) DEFINE_MUTEX(sched_energy_mutex); bool sched_energy_update; @@ -273,6 +274,19 @@ static void destroy_perf_domain_rcu(struct rcu_head *rp) free_pd(pd); } +static void sched_energy_set(bool has_eas) +{ + if (!has_eas && static_branch_unlikely(&sched_energy_present)) { + if (sched_debug()) + pr_info("%s: stopping EAS\n", __func__); + static_branch_disable_cpuslocked(&sched_energy_present); + } else if (has_eas && !static_branch_unlikely(&sched_energy_present)) { + if (sched_debug()) + pr_info("%s: starting EAS\n", __func__); + static_branch_enable_cpuslocked(&sched_energy_present); + } +} + /* * EAS can be used on a root domain if it meets all the following conditions: * 1. an Energy Model (EM) is available; @@ -299,7 +313,7 @@ static void destroy_perf_domain_rcu(struct rcu_head *rp) #define EM_MAX_COMPLEXITY 2048 extern struct cpufreq_governor schedutil_gov; -static void build_perf_domains(const struct cpumask *cpu_map) +static bool build_perf_domains(const struct cpumask *cpu_map) { int i, nr_pd = 0, nr_cs = 0, nr_cpus = cpumask_weight(cpu_map); struct perf_domain *pd = NULL, *tmp; @@ -365,7 +379,7 @@ static void build_perf_domains(const struct cpumask *cpu_map) if (tmp) call_rcu(&tmp->rcu, destroy_perf_domain_rcu); - return; + return !!pd; free: free_pd(pd); @@ -373,6 +387,8 @@ free: rcu_assign_pointer(rd->pd, NULL); if (tmp) call_rcu(&tmp->rcu, destroy_perf_domain_rcu); + + return false; } #else static void free_pd(struct perf_domain *pd) { } @@ -2114,6 +2130,7 @@ static int dattrs_equal(struct sched_domain_attr *cur, int idx_cur, void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[], struct sched_domain_attr *dattr_new) { + bool __maybe_unused has_eas = false; int i, j, n; int new_topology; @@ -2177,14 +2194,17 @@ match2: for (i = 0; i < ndoms_new; i++) { for (j = 0; j < n && !sched_energy_update; j++) { if (cpumask_equal(doms_new[i], doms_cur[j]) && - cpu_rq(cpumask_first(doms_cur[j]))->rd->pd) + cpu_rq(cpumask_first(doms_cur[j]))->rd->pd) { + has_eas = true; goto match3; + } } /* No match - add perf. domains for a new rd */ - build_perf_domains(doms_new[i]); + has_eas |= build_perf_domains(doms_new[i]); match3: ; } + sched_energy_set(has_eas); #endif /* Remember the new sched domains: */ -- cgit v1.3-7-g2ca7 From 630246a06ae2a7a12d1fce85f1e5681032982791 Mon Sep 17 00:00:00 2001 From: Quentin Perret Date: Mon, 3 Dec 2018 09:56:24 +0000 Subject: sched/fair: Clean-up update_sg_lb_stats parameters In preparation for the introduction of a new root domain flag which can be set during load balance (the 'overutilized' flag), clean-up the set of parameters passed to update_sg_lb_stats(). More specifically, the 'local_group' and 'local_idx' parameters can be removed since they can easily be reconstructed from within the function. While at it, transform the 'overload' parameter into a flag stored in the 'sg_status' parameter hence facilitating the definition of new flags when needed. Suggested-by: Peter Zijlstra Suggested-by: Valentin Schneider Signed-off-by: Quentin Perret Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Mike Galbraith Cc: Thomas Gleixner Cc: adharmap@codeaurora.org Cc: chris.redpath@arm.com Cc: currojerez@riseup.net Cc: dietmar.eggemann@arm.com Cc: edubezval@gmail.com Cc: gregkh@linuxfoundation.org Cc: javi.merino@kernel.org Cc: joel@joelfernandes.org Cc: juri.lelli@redhat.com Cc: morten.rasmussen@arm.com Cc: patrick.bellasi@arm.com Cc: pkondeti@codeaurora.org Cc: rjw@rjwysocki.net Cc: skannan@codeaurora.org Cc: smuckle@google.com Cc: srinivas.pandruvada@linux.intel.com Cc: thara.gopinath@linaro.org Cc: tkjos@google.com Cc: vincent.guittot@linaro.org Cc: viresh.kumar@linaro.org Link: https://lkml.kernel.org/r/20181203095628.11858-12-quentin.perret@arm.com Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 27 +++++++++++---------------- kernel/sched/sched.h | 3 +++ 2 files changed, 14 insertions(+), 16 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index a31a6d325901..e04f29098ec7 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7905,16 +7905,16 @@ static bool update_nohz_stats(struct rq *rq, bool force) * update_sg_lb_stats - Update sched_group's statistics for load balancing. * @env: The load balancing environment. * @group: sched_group whose statistics are to be updated. - * @load_idx: Load index of sched_domain of this_cpu for load calc. - * @local_group: Does group contain this_cpu. * @sgs: variable to hold the statistics for this group. - * @overload: Indicate pullable load (e.g. >1 runnable task). + * @sg_status: Holds flag indicating the status of the sched_group */ static inline void update_sg_lb_stats(struct lb_env *env, - struct sched_group *group, int load_idx, - int local_group, struct sg_lb_stats *sgs, - bool *overload) + struct sched_group *group, + struct sg_lb_stats *sgs, + int *sg_status) { + int local_group = cpumask_test_cpu(env->dst_cpu, sched_group_span(group)); + int load_idx = get_sd_load_idx(env->sd, env->idle); unsigned long load; int i, nr_running; @@ -7938,7 +7938,7 @@ static inline void update_sg_lb_stats(struct lb_env *env, nr_running = rq->nr_running; if (nr_running > 1) - *overload = true; + *sg_status |= SG_OVERLOAD; #ifdef CONFIG_NUMA_BALANCING sgs->nr_numa_running += rq->nr_numa_running; @@ -7954,7 +7954,7 @@ static inline void update_sg_lb_stats(struct lb_env *env, if (env->sd->flags & SD_ASYM_CPUCAPACITY && sgs->group_misfit_task_load < rq->misfit_task_load) { sgs->group_misfit_task_load = rq->misfit_task_load; - *overload = 1; + *sg_status |= SG_OVERLOAD; } } @@ -8099,17 +8099,14 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd struct sched_group *sg = env->sd->groups; struct sg_lb_stats *local = &sds->local_stat; struct sg_lb_stats tmp_sgs; - int load_idx; - bool overload = false; bool prefer_sibling = child && child->flags & SD_PREFER_SIBLING; + int sg_status = 0; #ifdef CONFIG_NO_HZ_COMMON if (env->idle == CPU_NEWLY_IDLE && READ_ONCE(nohz.has_blocked)) env->flags |= LBF_NOHZ_STATS; #endif - load_idx = get_sd_load_idx(env->sd, env->idle); - do { struct sg_lb_stats *sgs = &tmp_sgs; int local_group; @@ -8124,8 +8121,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd update_group_capacity(env->sd, env->dst_cpu); } - update_sg_lb_stats(env, sg, load_idx, local_group, sgs, - &overload); + update_sg_lb_stats(env, sg, sgs, &sg_status); if (local_group) goto next_group; @@ -8175,8 +8171,7 @@ next_group: if (!env->sd->parent) { /* update overload indicator if we are at root domain */ - if (READ_ONCE(env->dst_rq->rd->overload) != overload) - WRITE_ONCE(env->dst_rq->rd->overload, overload); + WRITE_ONCE(env->dst_rq->rd->overload, sg_status & SG_OVERLOAD); } } diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 2b3cf356e958..d4d984846924 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -716,6 +716,9 @@ struct perf_domain { struct rcu_head rcu; }; +/* Scheduling group status flags */ +#define SG_OVERLOAD 0x1 /* More than one runnable task on a CPU. */ + /* * We add the notion of a root-domain which will be used to define per-domain * variables. Each exclusive cpuset essentially defines an island domain by -- cgit v1.3-7-g2ca7 From 2802bf3cd936fe2c8033a696d375a4d9d3974de4 Mon Sep 17 00:00:00 2001 From: Morten Rasmussen Date: Mon, 3 Dec 2018 09:56:25 +0000 Subject: sched/fair: Add over-utilization/tipping point indicator Energy-aware scheduling is only meant to be active while the system is _not_ over-utilized. That is, there are spare cycles available to shift tasks around based on their actual utilization to get a more energy-efficient task distribution without depriving any tasks. When above the tipping point task placement is done the traditional way based on load_avg, spreading the tasks across as many cpus as possible based on priority scaled load to preserve smp_nice. Below the tipping point we want to use util_avg instead. We need to define a criteria for when we make the switch. The util_avg for each cpu converges towards 100% regardless of how many additional tasks we may put on it. If we define over-utilized as: sum_{cpus}(rq.cfs.avg.util_avg) + margin > sum_{cpus}(rq.capacity) some individual cpus may be over-utilized running multiple tasks even when the above condition is false. That should be okay as long as we try to spread the tasks out to avoid per-cpu over-utilization as much as possible and if all tasks have the _same_ priority. If the latter isn't true, we have to consider priority to preserve smp_nice. For example, we could have n_cpus nice=-10 util_avg=55% tasks and n_cpus/2 nice=0 util_avg=60% tasks. Balancing based on util_avg we are likely to end up with nice=-10 tasks sharing cpus and nice=0 tasks getting their own as we 1.5*n_cpus tasks in total and 55%+55% is less over-utilized than 55%+60% for those cpus that have to be shared. The system utilization is only 85% of the system capacity, but we are breaking smp_nice. To be sure not to break smp_nice, we have defined over-utilization conservatively as when any cpu in the system is fully utilized at its highest frequency instead: cpu_rq(any).cfs.avg.util_avg + margin > cpu_rq(any).capacity IOW, as soon as one cpu is (nearly) 100% utilized, we switch to load_avg to factor in priority to preserve smp_nice. With this definition, we can skip periodic load-balance as no cpu has an always-running task when the system is not over-utilized. All tasks will be periodic and we can balance them at wake-up. This conservative condition does however mean that some scenarios that could benefit from energy-aware decisions even if one cpu is fully utilized would not get those benefits. For systems where some cpus might have reduced capacity on some cpus (RT-pressure and/or big.LITTLE), we want periodic load-balance checks as soon a just a single cpu is fully utilized as it might one of those with reduced capacity and in that case we want to migrate it. [ peterz: Added a comment explaining why new tasks are not accounted during overutilization detection. ] Signed-off-by: Morten Rasmussen Signed-off-by: Quentin Perret Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: adharmap@codeaurora.org Cc: chris.redpath@arm.com Cc: currojerez@riseup.net Cc: dietmar.eggemann@arm.com Cc: edubezval@gmail.com Cc: gregkh@linuxfoundation.org Cc: javi.merino@kernel.org Cc: joel@joelfernandes.org Cc: juri.lelli@redhat.com Cc: patrick.bellasi@arm.com Cc: pkondeti@codeaurora.org Cc: rjw@rjwysocki.net Cc: skannan@codeaurora.org Cc: smuckle@google.com Cc: srinivas.pandruvada@linux.intel.com Cc: thara.gopinath@linaro.org Cc: tkjos@google.com Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Cc: viresh.kumar@linaro.org Link: https://lkml.kernel.org/r/20181203095628.11858-13-quentin.perret@arm.com Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++-- kernel/sched/sched.h | 4 ++++ 2 files changed, 61 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index e04f29098ec7..767e7675774b 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5082,6 +5082,24 @@ static inline void hrtick_update(struct rq *rq) } #endif +#ifdef CONFIG_SMP +static inline unsigned long cpu_util(int cpu); +static unsigned long capacity_of(int cpu); + +static inline bool cpu_overutilized(int cpu) +{ + return (capacity_of(cpu) * 1024) < (cpu_util(cpu) * capacity_margin); +} + +static inline void update_overutilized_status(struct rq *rq) +{ + if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu)) + WRITE_ONCE(rq->rd->overutilized, SG_OVERUTILIZED); +} +#else +static inline void update_overutilized_status(struct rq *rq) { } +#endif + /* * The enqueue_task method is called before nr_running is * increased. Here we update the fair scheduling stats and @@ -5139,8 +5157,26 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) update_cfs_group(se); } - if (!se) + if (!se) { add_nr_running(rq, 1); + /* + * Since new tasks are assigned an initial util_avg equal to + * half of the spare capacity of their CPU, tiny tasks have the + * ability to cross the overutilized threshold, which will + * result in the load balancer ruining all the task placement + * done by EAS. As a way to mitigate that effect, do not account + * for the first enqueue operation of new tasks during the + * overutilized flag detection. + * + * A better way of solving this problem would be to wait for + * the PELT signals of tasks to converge before taking them + * into account, but that is not straightforward to implement, + * and the following generally works well enough in practice. + */ + if (flags & ENQUEUE_WAKEUP) + update_overutilized_status(rq); + + } hrtick_update(rq); } @@ -7940,6 +7976,9 @@ static inline void update_sg_lb_stats(struct lb_env *env, if (nr_running > 1) *sg_status |= SG_OVERLOAD; + if (cpu_overutilized(i)) + *sg_status |= SG_OVERUTILIZED; + #ifdef CONFIG_NUMA_BALANCING sgs->nr_numa_running += rq->nr_numa_running; sgs->nr_preferred_running += rq->nr_preferred_running; @@ -8170,8 +8209,15 @@ next_group: env->fbq_type = fbq_classify_group(&sds->busiest_stat); if (!env->sd->parent) { + struct root_domain *rd = env->dst_rq->rd; + /* update overload indicator if we are at root domain */ - WRITE_ONCE(env->dst_rq->rd->overload, sg_status & SG_OVERLOAD); + WRITE_ONCE(rd->overload, sg_status & SG_OVERLOAD); + + /* Update over-utilization (tipping point, U >= 0) indicator */ + WRITE_ONCE(rd->overutilized, sg_status & SG_OVERUTILIZED); + } else if (sg_status & SG_OVERUTILIZED) { + WRITE_ONCE(env->dst_rq->rd->overutilized, SG_OVERUTILIZED); } } @@ -8398,6 +8444,14 @@ static struct sched_group *find_busiest_group(struct lb_env *env) * this level. */ update_sd_lb_stats(env, &sds); + + if (static_branch_unlikely(&sched_energy_present)) { + struct root_domain *rd = env->dst_rq->rd; + + if (rcu_dereference(rd->pd) && !READ_ONCE(rd->overutilized)) + goto out_balanced; + } + local = &sds.local_stat; busiest = &sds.busiest_stat; @@ -9798,6 +9852,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued) task_tick_numa(rq, curr); update_misfit_status(curr, rq); + update_overutilized_status(task_rq(curr)); } /* diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index d4d984846924..0ba08924e017 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -718,6 +718,7 @@ struct perf_domain { /* Scheduling group status flags */ #define SG_OVERLOAD 0x1 /* More than one runnable task on a CPU. */ +#define SG_OVERUTILIZED 0x2 /* One or more CPUs are over-utilized. */ /* * We add the notion of a root-domain which will be used to define per-domain @@ -741,6 +742,9 @@ struct root_domain { */ int overload; + /* Indicate one or more cpus over-utilized (tipping point) */ + int overutilized; + /* * The bit corresponding to a CPU gets set here if such CPU has more * than one runnable -deadline task (as it is below for RT tasks). -- cgit v1.3-7-g2ca7 From 390031e4c309c94ecc07a558187eb5185200df83 Mon Sep 17 00:00:00 2001 From: Quentin Perret Date: Mon, 3 Dec 2018 09:56:26 +0000 Subject: sched/fair: Introduce an energy estimation helper function In preparation for the definition of an energy-aware wakeup path, introduce a helper function to estimate the consequence on system energy when a specific task wakes-up on a specific CPU. compute_energy() estimates the capacity state to be reached by all performance domains and estimates the consumption of each online CPU according to its Energy Model and its percentage of busy time. Signed-off-by: Quentin Perret Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: adharmap@codeaurora.org Cc: chris.redpath@arm.com Cc: currojerez@riseup.net Cc: dietmar.eggemann@arm.com Cc: edubezval@gmail.com Cc: gregkh@linuxfoundation.org Cc: javi.merino@kernel.org Cc: joel@joelfernandes.org Cc: juri.lelli@redhat.com Cc: morten.rasmussen@arm.com Cc: patrick.bellasi@arm.com Cc: pkondeti@codeaurora.org Cc: rjw@rjwysocki.net Cc: skannan@codeaurora.org Cc: smuckle@google.com Cc: srinivas.pandruvada@linux.intel.com Cc: thara.gopinath@linaro.org Cc: tkjos@google.com Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Cc: viresh.kumar@linaro.org Link: https://lkml.kernel.org/r/20181203095628.11858-14-quentin.perret@arm.com Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 76 +++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 76 insertions(+) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 767e7675774b..b3c94584d947 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6377,6 +6377,82 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu) return !task_fits_capacity(p, min_cap); } +/* + * Predicts what cpu_util(@cpu) would return if @p was migrated (and enqueued) + * to @dst_cpu. + */ +static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu) +{ + struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs; + unsigned long util_est, util = READ_ONCE(cfs_rq->avg.util_avg); + + /* + * If @p migrates from @cpu to another, remove its contribution. Or, + * if @p migrates from another CPU to @cpu, add its contribution. In + * the other cases, @cpu is not impacted by the migration, so the + * util_avg should already be correct. + */ + if (task_cpu(p) == cpu && dst_cpu != cpu) + sub_positive(&util, task_util(p)); + else if (task_cpu(p) != cpu && dst_cpu == cpu) + util += task_util(p); + + if (sched_feat(UTIL_EST)) { + util_est = READ_ONCE(cfs_rq->avg.util_est.enqueued); + + /* + * During wake-up, the task isn't enqueued yet and doesn't + * appear in the cfs_rq->avg.util_est.enqueued of any rq, + * so just add it (if needed) to "simulate" what will be + * cpu_util() after the task has been enqueued. + */ + if (dst_cpu == cpu) + util_est += _task_util_est(p); + + util = max(util, util_est); + } + + return min(util, capacity_orig_of(cpu)); +} + +/* + * compute_energy(): Estimates the energy that would be consumed if @p was + * migrated to @dst_cpu. compute_energy() predicts what will be the utilization + * landscape of the * CPUs after the task migration, and uses the Energy Model + * to compute what would be the energy if we decided to actually migrate that + * task. + */ +static long +compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd) +{ + long util, max_util, sum_util, energy = 0; + int cpu; + + for (; pd; pd = pd->next) { + max_util = sum_util = 0; + /* + * The capacity state of CPUs of the current rd can be driven by + * CPUs of another rd if they belong to the same performance + * domain. So, account for the utilization of these CPUs too + * by masking pd with cpu_online_mask instead of the rd span. + * + * If an entire performance domain is outside of the current rd, + * it will not appear in its pd list and will not be accounted + * by compute_energy(). + */ + for_each_cpu_and(cpu, perf_domain_span(pd), cpu_online_mask) { + util = cpu_util_next(cpu, p, dst_cpu); + util = schedutil_energy_util(cpu, util); + max_util = max(util, max_util); + sum_util += util; + } + + energy += em_pd_energy(pd->em_pd, max_util, sum_util); + } + + return energy; +} + /* * select_task_rq_fair: Select target runqueue for the waking task in domains * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE, -- cgit v1.3-7-g2ca7 From 732cd75b8c920d3727e69957b14faa7c2d7c3b75 Mon Sep 17 00:00:00 2001 From: Quentin Perret Date: Mon, 3 Dec 2018 09:56:27 +0000 Subject: sched/fair: Select an energy-efficient CPU on task wake-up If an Energy Model (EM) is available and if the system isn't overutilized, re-route waking tasks into an energy-aware placement algorithm. The selection of an energy-efficient CPU for a task is achieved by estimating the impact on system-level active energy resulting from the placement of the task on the CPU with the highest spare capacity in each performance domain. This strategy spreads tasks in a performance domain and avoids overly aggressive task packing. The best CPU energy-wise is then selected if it saves a large enough amount of energy with respect to prev_cpu. Although it has already shown significant benefits on some existing targets, this approach cannot scale to platforms with numerous CPUs. This is an attempt to do something useful as writing a fast heuristic that performs reasonably well on a broad spectrum of architectures isn't an easy task. As such, the scope of usability of the energy-aware wake-up path is restricted to systems with the SD_ASYM_CPUCAPACITY flag set, and where the EM isn't too complex. Signed-off-by: Quentin Perret Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: adharmap@codeaurora.org Cc: chris.redpath@arm.com Cc: currojerez@riseup.net Cc: dietmar.eggemann@arm.com Cc: edubezval@gmail.com Cc: gregkh@linuxfoundation.org Cc: javi.merino@kernel.org Cc: joel@joelfernandes.org Cc: juri.lelli@redhat.com Cc: morten.rasmussen@arm.com Cc: patrick.bellasi@arm.com Cc: pkondeti@codeaurora.org Cc: rjw@rjwysocki.net Cc: skannan@codeaurora.org Cc: smuckle@google.com Cc: srinivas.pandruvada@linux.intel.com Cc: thara.gopinath@linaro.org Cc: tkjos@google.com Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Cc: viresh.kumar@linaro.org Link: https://lkml.kernel.org/r/20181203095628.11858-15-quentin.perret@arm.com Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 143 +++++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 141 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index b3c94584d947..ca469646ebe1 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6453,6 +6453,137 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd) return energy; } +/* + * find_energy_efficient_cpu(): Find most energy-efficient target CPU for the + * waking task. find_energy_efficient_cpu() looks for the CPU with maximum + * spare capacity in each performance domain and uses it as a potential + * candidate to execute the task. Then, it uses the Energy Model to figure + * out which of the CPU candidates is the most energy-efficient. + * + * The rationale for this heuristic is as follows. In a performance domain, + * all the most energy efficient CPU candidates (according to the Energy + * Model) are those for which we'll request a low frequency. When there are + * several CPUs for which the frequency request will be the same, we don't + * have enough data to break the tie between them, because the Energy Model + * only includes active power costs. With this model, if we assume that + * frequency requests follow utilization (e.g. using schedutil), the CPU with + * the maximum spare capacity in a performance domain is guaranteed to be among + * the best candidates of the performance domain. + * + * In practice, it could be preferable from an energy standpoint to pack + * small tasks on a CPU in order to let other CPUs go in deeper idle states, + * but that could also hurt our chances to go cluster idle, and we have no + * ways to tell with the current Energy Model if this is actually a good + * idea or not. So, find_energy_efficient_cpu() basically favors + * cluster-packing, and spreading inside a cluster. That should at least be + * a good thing for latency, and this is consistent with the idea that most + * of the energy savings of EAS come from the asymmetry of the system, and + * not so much from breaking the tie between identical CPUs. That's also the + * reason why EAS is enabled in the topology code only for systems where + * SD_ASYM_CPUCAPACITY is set. + * + * NOTE: Forkees are not accepted in the energy-aware wake-up path because + * they don't have any useful utilization data yet and it's not possible to + * forecast their impact on energy consumption. Consequently, they will be + * placed by find_idlest_cpu() on the least loaded CPU, which might turn out + * to be energy-inefficient in some use-cases. The alternative would be to + * bias new tasks towards specific types of CPUs first, or to try to infer + * their util_avg from the parent task, but those heuristics could hurt + * other use-cases too. So, until someone finds a better way to solve this, + * let's keep things simple by re-using the existing slow path. + */ + +static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu) +{ + unsigned long prev_energy = ULONG_MAX, best_energy = ULONG_MAX; + struct root_domain *rd = cpu_rq(smp_processor_id())->rd; + int cpu, best_energy_cpu = prev_cpu; + struct perf_domain *head, *pd; + unsigned long cpu_cap, util; + struct sched_domain *sd; + + rcu_read_lock(); + pd = rcu_dereference(rd->pd); + if (!pd || READ_ONCE(rd->overutilized)) + goto fail; + head = pd; + + /* + * Energy-aware wake-up happens on the lowest sched_domain starting + * from sd_asym_cpucapacity spanning over this_cpu and prev_cpu. + */ + sd = rcu_dereference(*this_cpu_ptr(&sd_asym_cpucapacity)); + while (sd && !cpumask_test_cpu(prev_cpu, sched_domain_span(sd))) + sd = sd->parent; + if (!sd) + goto fail; + + sync_entity_load_avg(&p->se); + if (!task_util_est(p)) + goto unlock; + + for (; pd; pd = pd->next) { + unsigned long cur_energy, spare_cap, max_spare_cap = 0; + int max_spare_cap_cpu = -1; + + for_each_cpu_and(cpu, perf_domain_span(pd), sched_domain_span(sd)) { + if (!cpumask_test_cpu(cpu, &p->cpus_allowed)) + continue; + + /* Skip CPUs that will be overutilized. */ + util = cpu_util_next(cpu, p, cpu); + cpu_cap = capacity_of(cpu); + if (cpu_cap * 1024 < util * capacity_margin) + continue; + + /* Always use prev_cpu as a candidate. */ + if (cpu == prev_cpu) { + prev_energy = compute_energy(p, prev_cpu, head); + best_energy = min(best_energy, prev_energy); + continue; + } + + /* + * Find the CPU with the maximum spare capacity in + * the performance domain + */ + spare_cap = cpu_cap - util; + if (spare_cap > max_spare_cap) { + max_spare_cap = spare_cap; + max_spare_cap_cpu = cpu; + } + } + + /* Evaluate the energy impact of using this CPU. */ + if (max_spare_cap_cpu >= 0) { + cur_energy = compute_energy(p, max_spare_cap_cpu, head); + if (cur_energy < best_energy) { + best_energy = cur_energy; + best_energy_cpu = max_spare_cap_cpu; + } + } + } +unlock: + rcu_read_unlock(); + + /* + * Pick the best CPU if prev_cpu cannot be used, or if it saves at + * least 6% of the energy used by prev_cpu. + */ + if (prev_energy == ULONG_MAX) + return best_energy_cpu; + + if ((prev_energy - best_energy) > (prev_energy >> 4)) + return best_energy_cpu; + + return prev_cpu; + +fail: + rcu_read_unlock(); + + return -1; +} + /* * select_task_rq_fair: Select target runqueue for the waking task in domains * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE, @@ -6476,8 +6607,16 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f if (sd_flag & SD_BALANCE_WAKE) { record_wakee(p); - want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu) - && cpumask_test_cpu(cpu, &p->cpus_allowed); + + if (static_branch_unlikely(&sched_energy_present)) { + new_cpu = find_energy_efficient_cpu(p, prev_cpu); + if (new_cpu >= 0) + return new_cpu; + new_cpu = prev_cpu; + } + + want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu) && + cpumask_test_cpu(cpu, &p->cpus_allowed); } rcu_read_lock(); -- cgit v1.3-7-g2ca7 From db5113911abaa7eb20cf115d4339959c1aecea95 Mon Sep 17 00:00:00 2001 From: Tycho Andersen Date: Sun, 9 Dec 2018 11:24:11 -0700 Subject: seccomp: hoist struct seccomp_data recalculation higher In the next patch, we're going to use the sd pointer passed to __seccomp_filter() as the data to pass to userspace. Except that in some cases (__seccomp_filter(SECCOMP_RET_TRACE), emulate_vsyscall(), every time seccomp is inovked on power, etc.) the sd pointer will be NULL in order to force seccomp to recompute the register data. Previously this recomputation happened one level lower, in seccomp_run_filters(); this patch just moves it up a level higher to __seccomp_filter(). Thanks Oleg for spotting this. Signed-off-by: Tycho Andersen CC: Kees Cook CC: Andy Lutomirski CC: Oleg Nesterov CC: Eric W. Biederman CC: "Serge E. Hallyn" Acked-by: Serge Hallyn CC: Christian Brauner CC: Tyler Hicks CC: Akihiro Suda Signed-off-by: Kees Cook --- kernel/seccomp.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/seccomp.c b/kernel/seccomp.c index f2ae2324c232..96afc32e041d 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -188,7 +188,6 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen) static u32 seccomp_run_filters(const struct seccomp_data *sd, struct seccomp_filter **match) { - struct seccomp_data sd_local; u32 ret = SECCOMP_RET_ALLOW; /* Make sure cross-thread synced filter points somewhere sane. */ struct seccomp_filter *f = @@ -198,11 +197,6 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd, if (WARN_ON(f == NULL)) return SECCOMP_RET_KILL_PROCESS; - if (!sd) { - populate_seccomp_data(&sd_local); - sd = &sd_local; - } - /* * All filters in the list are evaluated and the lowest BPF return * value always takes priority (ignoring the DATA). @@ -658,6 +652,7 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd, u32 filter_ret, action; struct seccomp_filter *match = NULL; int data; + struct seccomp_data sd_local; /* * Make sure that any changes to mode from another thread have @@ -665,6 +660,11 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd, */ rmb(); + if (!sd) { + populate_seccomp_data(&sd_local); + sd = &sd_local; + } + filter_ret = seccomp_run_filters(sd, &match); data = filter_ret & SECCOMP_RET_DATA; action = filter_ret & SECCOMP_RET_ACTION_FULL; -- cgit v1.3-7-g2ca7 From a5662e4d81c4d4b08140c625d0f3c50b15786252 Mon Sep 17 00:00:00 2001 From: Tycho Andersen Date: Sun, 9 Dec 2018 11:24:12 -0700 Subject: seccomp: switch system call argument type to void * The const qualifier causes problems for any code that wants to write to the third argument of the seccomp syscall, as we will do in a future patch in this series. The third argument to the seccomp syscall is documented as void *, so rather than just dropping the const, let's switch everything to use void * as well. I believe this is safe because of 1. the documentation above, 2. there's no real type information exported about syscalls anywhere besides the man pages. Signed-off-by: Tycho Andersen CC: Kees Cook CC: Andy Lutomirski CC: Oleg Nesterov CC: Eric W. Biederman CC: "Serge E. Hallyn" Acked-by: Serge Hallyn CC: Christian Brauner CC: Tyler Hicks CC: Akihiro Suda Signed-off-by: Kees Cook --- include/linux/seccomp.h | 2 +- include/linux/syscalls.h | 2 +- kernel/seccomp.c | 8 ++++---- 3 files changed, 6 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h index e5320f6c8654..b5103c019cf4 100644 --- a/include/linux/seccomp.h +++ b/include/linux/seccomp.h @@ -43,7 +43,7 @@ extern void secure_computing_strict(int this_syscall); #endif extern long prctl_get_seccomp(void); -extern long prctl_set_seccomp(unsigned long, char __user *); +extern long prctl_set_seccomp(unsigned long, void __user *); static inline int seccomp_mode(struct seccomp *s) { diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 2ac3d13a915b..a60694fb0f58 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -879,7 +879,7 @@ asmlinkage long sys_renameat2(int olddfd, const char __user *oldname, int newdfd, const char __user *newname, unsigned int flags); asmlinkage long sys_seccomp(unsigned int op, unsigned int flags, - const char __user *uargs); + void __user *uargs); asmlinkage long sys_getrandom(char __user *buf, size_t count, unsigned int flags); asmlinkage long sys_memfd_create(const char __user *uname_ptr, unsigned int flags); diff --git a/kernel/seccomp.c b/kernel/seccomp.c index 96afc32e041d..393e029f778a 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -924,7 +924,7 @@ static long seccomp_get_action_avail(const char __user *uaction) /* Common entry point for both prctl and syscall. */ static long do_seccomp(unsigned int op, unsigned int flags, - const char __user *uargs) + void __user *uargs) { switch (op) { case SECCOMP_SET_MODE_STRICT: @@ -944,7 +944,7 @@ static long do_seccomp(unsigned int op, unsigned int flags, } SYSCALL_DEFINE3(seccomp, unsigned int, op, unsigned int, flags, - const char __user *, uargs) + void __user *, uargs) { return do_seccomp(op, flags, uargs); } @@ -956,10 +956,10 @@ SYSCALL_DEFINE3(seccomp, unsigned int, op, unsigned int, flags, * * Returns 0 on success or -EINVAL on failure. */ -long prctl_set_seccomp(unsigned long seccomp_mode, char __user *filter) +long prctl_set_seccomp(unsigned long seccomp_mode, void __user *filter) { unsigned int op; - char __user *uargs; + void __user *uargs; switch (seccomp_mode) { case SECCOMP_MODE_STRICT: -- cgit v1.3-7-g2ca7 From 6a21cc50f0c7f87dae5259f6cfefe024412313f6 Mon Sep 17 00:00:00 2001 From: Tycho Andersen Date: Sun, 9 Dec 2018 11:24:13 -0700 Subject: seccomp: add a return code to trap to userspace This patch introduces a means for syscalls matched in seccomp to notify some other task that a particular filter has been triggered. The motivation for this is primarily for use with containers. For example, if a container does an init_module(), we obviously don't want to load this untrusted code, which may be compiled for the wrong version of the kernel anyway. Instead, we could parse the module image, figure out which module the container is trying to load and load it on the host. As another example, containers cannot mount() in general since various filesystems assume a trusted image. However, if an orchestrator knows that e.g. a particular block device has not been exposed to a container for writing, it want to allow the container to mount that block device (that is, handle the mount for it). This patch adds functionality that is already possible via at least two other means that I know about, both of which involve ptrace(): first, one could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL. Unfortunately this is slow, so a faster version would be to install a filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP. Since ptrace allows only one tracer, if the container runtime is that tracer, users inside the container (or outside) trying to debug it will not be able to use ptrace, which is annoying. It also means that older distributions based on Upstart cannot boot inside containers using ptrace, since upstart itself uses ptrace to monitor services while starting. The actual implementation of this is fairly small, although getting the synchronization right was/is slightly complex. Finally, it's worth noting that the classic seccomp TOCTOU of reading memory data from the task still applies here, but can be avoided with careful design of the userspace handler: if the userspace handler reads all of the task memory that is necessary before applying its security policy, the tracee's subsequent memory edits will not be read by the tracer. Signed-off-by: Tycho Andersen CC: Kees Cook CC: Andy Lutomirski CC: Oleg Nesterov CC: Eric W. Biederman CC: "Serge E. Hallyn" Acked-by: Serge Hallyn CC: Christian Brauner CC: Tyler Hicks CC: Akihiro Suda Signed-off-by: Kees Cook --- Documentation/ioctl/ioctl-number.txt | 1 + Documentation/userspace-api/seccomp_filter.rst | 84 +++++ include/linux/seccomp.h | 7 +- include/uapi/linux/seccomp.h | 40 ++- kernel/seccomp.c | 448 ++++++++++++++++++++++++- tools/testing/selftests/seccomp/seccomp_bpf.c | 447 +++++++++++++++++++++++- 6 files changed, 1017 insertions(+), 10 deletions(-) (limited to 'kernel') diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt index af6f6ba1fe80..c9558146ac58 100644 --- a/Documentation/ioctl/ioctl-number.txt +++ b/Documentation/ioctl/ioctl-number.txt @@ -79,6 +79,7 @@ Code Seq#(hex) Include File Comments 0x1b all InfiniBand Subsystem 0x20 all drivers/cdrom/cm206.h 0x22 all scsi/sg.h +'!' 00-1F uapi/linux/seccomp.h '#' 00-3F IEEE 1394 Subsystem Block for the entire subsystem '$' 00-0F linux/perf_counter.h, linux/perf_event.h '%' 00-0F include/uapi/linux/stm.h diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst index 82a468bc7560..b1b846d8a094 100644 --- a/Documentation/userspace-api/seccomp_filter.rst +++ b/Documentation/userspace-api/seccomp_filter.rst @@ -122,6 +122,11 @@ In precedence order, they are: Results in the lower 16-bits of the return value being passed to userland as the errno without executing the system call. +``SECCOMP_RET_USER_NOTIF``: + Results in a ``struct seccomp_notif`` message sent on the userspace + notification fd, if it is attached, or ``-ENOSYS`` if it is not. See below + on discussion of how to handle user notifications. + ``SECCOMP_RET_TRACE``: When returned, this value will cause the kernel to attempt to notify a ``ptrace()``-based tracer prior to executing the system @@ -183,6 +188,85 @@ The ``samples/seccomp/`` directory contains both an x86-specific example and a more generic example of a higher level macro interface for BPF program generation. +Userspace Notification +====================== + +The ``SECCOMP_RET_USER_NOTIF`` return code lets seccomp filters pass a +particular syscall to userspace to be handled. This may be useful for +applications like container managers, which wish to intercept particular +syscalls (``mount()``, ``finit_module()``, etc.) and change their behavior. + +To acquire a notification FD, use the ``SECCOMP_FILTER_FLAG_NEW_LISTENER`` +argument to the ``seccomp()`` syscall: + +.. code-block:: c + + fd = seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog); + +which (on success) will return a listener fd for the filter, which can then be +passed around via ``SCM_RIGHTS`` or similar. Note that filter fds correspond to +a particular filter, and not a particular task. So if this task then forks, +notifications from both tasks will appear on the same filter fd. Reads and +writes to/from a filter fd are also synchronized, so a filter fd can safely +have many readers. + +The interface for a seccomp notification fd consists of two structures: + +.. code-block:: c + + struct seccomp_notif_sizes { + __u16 seccomp_notif; + __u16 seccomp_notif_resp; + __u16 seccomp_data; + }; + + struct seccomp_notif { + __u64 id; + __u32 pid; + __u32 flags; + struct seccomp_data data; + }; + + struct seccomp_notif_resp { + __u64 id; + __s64 val; + __s32 error; + __u32 flags; + }; + +The ``struct seccomp_notif_sizes`` structure can be used to determine the size +of the various structures used in seccomp notifications. The size of ``struct +seccomp_data`` may change in the future, so code should use: + +.. code-block:: c + + struct seccomp_notif_sizes sizes; + seccomp(SECCOMP_GET_NOTIF_SIZES, 0, &sizes); + +to determine the size of the various structures to allocate. See +samples/seccomp/user-trap.c for an example. + +Users can read via ``ioctl(SECCOMP_IOCTL_NOTIF_RECV)`` (or ``poll()``) on a +seccomp notification fd to receive a ``struct seccomp_notif``, which contains +five members: the input length of the structure, a unique-per-filter ``id``, +the ``pid`` of the task which triggered this request (which may be 0 if the +task is in a pid ns not visible from the listener's pid namespace), a ``flags`` +member which for now only has ``SECCOMP_NOTIF_FLAG_SIGNALED``, representing +whether or not the notification is a result of a non-fatal signal, and the +``data`` passed to seccomp. Userspace can then make a decision based on this +information about what to do, and ``ioctl(SECCOMP_IOCTL_NOTIF_SEND)`` a +response, indicating what should be returned to userspace. The ``id`` member of +``struct seccomp_notif_resp`` should be the same ``id`` as in ``struct +seccomp_notif``. + +It is worth noting that ``struct seccomp_data`` contains the values of register +arguments to the syscall, but does not contain pointers to memory. The task's +memory is accessible to suitably privileged traces via ``ptrace()`` or +``/proc/pid/mem``. However, care should be taken to avoid the TOCTOU mentioned +above in this document: all arguments being read from the tracee's memory +should be read into the tracer's memory before any policy decisions are made. +This allows for an atomic decision on syscall arguments. + Sysctls ======= diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h index b5103c019cf4..84868d37b35d 100644 --- a/include/linux/seccomp.h +++ b/include/linux/seccomp.h @@ -4,9 +4,10 @@ #include -#define SECCOMP_FILTER_FLAG_MASK (SECCOMP_FILTER_FLAG_TSYNC | \ - SECCOMP_FILTER_FLAG_LOG | \ - SECCOMP_FILTER_FLAG_SPEC_ALLOW) +#define SECCOMP_FILTER_FLAG_MASK (SECCOMP_FILTER_FLAG_TSYNC | \ + SECCOMP_FILTER_FLAG_LOG | \ + SECCOMP_FILTER_FLAG_SPEC_ALLOW | \ + SECCOMP_FILTER_FLAG_NEW_LISTENER) #ifdef CONFIG_SECCOMP diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h index 9efc0e73d50b..90734aa5aa36 100644 --- a/include/uapi/linux/seccomp.h +++ b/include/uapi/linux/seccomp.h @@ -15,11 +15,13 @@ #define SECCOMP_SET_MODE_STRICT 0 #define SECCOMP_SET_MODE_FILTER 1 #define SECCOMP_GET_ACTION_AVAIL 2 +#define SECCOMP_GET_NOTIF_SIZES 3 /* Valid flags for SECCOMP_SET_MODE_FILTER */ -#define SECCOMP_FILTER_FLAG_TSYNC (1UL << 0) -#define SECCOMP_FILTER_FLAG_LOG (1UL << 1) -#define SECCOMP_FILTER_FLAG_SPEC_ALLOW (1UL << 2) +#define SECCOMP_FILTER_FLAG_TSYNC (1UL << 0) +#define SECCOMP_FILTER_FLAG_LOG (1UL << 1) +#define SECCOMP_FILTER_FLAG_SPEC_ALLOW (1UL << 2) +#define SECCOMP_FILTER_FLAG_NEW_LISTENER (1UL << 3) /* * All BPF programs must return a 32-bit value. @@ -35,6 +37,7 @@ #define SECCOMP_RET_KILL SECCOMP_RET_KILL_THREAD #define SECCOMP_RET_TRAP 0x00030000U /* disallow and force a SIGSYS */ #define SECCOMP_RET_ERRNO 0x00050000U /* returns an errno */ +#define SECCOMP_RET_USER_NOTIF 0x7fc00000U /* notifies userspace */ #define SECCOMP_RET_TRACE 0x7ff00000U /* pass to a tracer or disallow */ #define SECCOMP_RET_LOG 0x7ffc0000U /* allow after logging */ #define SECCOMP_RET_ALLOW 0x7fff0000U /* allow */ @@ -60,4 +63,35 @@ struct seccomp_data { __u64 args[6]; }; +struct seccomp_notif_sizes { + __u16 seccomp_notif; + __u16 seccomp_notif_resp; + __u16 seccomp_data; +}; + +struct seccomp_notif { + __u64 id; + __u32 pid; + __u32 flags; + struct seccomp_data data; +}; + +struct seccomp_notif_resp { + __u64 id; + __s64 val; + __s32 error; + __u32 flags; +}; + +#define SECCOMP_IOC_MAGIC '!' +#define SECCOMP_IO(nr) _IO(SECCOMP_IOC_MAGIC, nr) +#define SECCOMP_IOR(nr, type) _IOR(SECCOMP_IOC_MAGIC, nr, type) +#define SECCOMP_IOW(nr, type) _IOW(SECCOMP_IOC_MAGIC, nr, type) +#define SECCOMP_IOWR(nr, type) _IOWR(SECCOMP_IOC_MAGIC, nr, type) + +/* Flags for seccomp notification fd ioctl. */ +#define SECCOMP_IOCTL_NOTIF_RECV SECCOMP_IOWR(0, struct seccomp_notif) +#define SECCOMP_IOCTL_NOTIF_SEND SECCOMP_IOWR(1, \ + struct seccomp_notif_resp) +#define SECCOMP_IOCTL_NOTIF_ID_VALID SECCOMP_IOR(2, __u64) #endif /* _UAPI_LINUX_SECCOMP_H */ diff --git a/kernel/seccomp.c b/kernel/seccomp.c index 393e029f778a..15b6be97fc09 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -33,12 +33,74 @@ #endif #ifdef CONFIG_SECCOMP_FILTER +#include #include #include #include #include #include #include +#include + +enum notify_state { + SECCOMP_NOTIFY_INIT, + SECCOMP_NOTIFY_SENT, + SECCOMP_NOTIFY_REPLIED, +}; + +struct seccomp_knotif { + /* The struct pid of the task whose filter triggered the notification */ + struct task_struct *task; + + /* The "cookie" for this request; this is unique for this filter. */ + u64 id; + + /* + * The seccomp data. This pointer is valid the entire time this + * notification is active, since it comes from __seccomp_filter which + * eclipses the entire lifecycle here. + */ + const struct seccomp_data *data; + + /* + * Notification states. When SECCOMP_RET_USER_NOTIF is returned, a + * struct seccomp_knotif is created and starts out in INIT. Once the + * handler reads the notification off of an FD, it transitions to SENT. + * If a signal is received the state transitions back to INIT and + * another message is sent. When the userspace handler replies, state + * transitions to REPLIED. + */ + enum notify_state state; + + /* The return values, only valid when in SECCOMP_NOTIFY_REPLIED */ + int error; + long val; + + /* Signals when this has entered SECCOMP_NOTIFY_REPLIED */ + struct completion ready; + + struct list_head list; +}; + +/** + * struct notification - container for seccomp userspace notifications. Since + * most seccomp filters will not have notification listeners attached and this + * structure is fairly large, we store the notification-specific stuff in a + * separate structure. + * + * @request: A semaphore that users of this notification can wait on for + * changes. Actual reads and writes are still controlled with + * filter->notify_lock. + * @next_id: The id of the next request. + * @notifications: A list of struct seccomp_knotif elements. + * @wqh: A wait queue for poll. + */ +struct notification { + struct semaphore request; + u64 next_id; + struct list_head notifications; + wait_queue_head_t wqh; +}; /** * struct seccomp_filter - container for seccomp BPF programs @@ -50,6 +112,8 @@ * @log: true if all actions except for SECCOMP_RET_ALLOW should be logged * @prev: points to a previously installed, or inherited, filter * @prog: the BPF program to evaluate + * @notif: the struct that holds all notification related information + * @notify_lock: A lock for all notification-related accesses. * * seccomp_filter objects are organized in a tree linked via the @prev * pointer. For any task, it appears to be a singly-linked list starting @@ -66,6 +130,8 @@ struct seccomp_filter { bool log; struct seccomp_filter *prev; struct bpf_prog *prog; + struct notification *notif; + struct mutex notify_lock; }; /* Limit any path through the tree to 256KB worth of instructions. */ @@ -386,6 +452,7 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog) if (!sfilter) return ERR_PTR(-ENOMEM); + mutex_init(&sfilter->notify_lock); ret = bpf_prog_create_from_user(&sfilter->prog, fprog, seccomp_check_filter, save_orig); if (ret < 0) { @@ -479,7 +546,6 @@ static long seccomp_attach_filter(unsigned int flags, static void __get_seccomp_filter(struct seccomp_filter *filter) { - /* Reference count is bounded by the number of total processes. */ refcount_inc(&filter->usage); } @@ -550,11 +616,13 @@ static void seccomp_send_sigsys(int syscall, int reason) #define SECCOMP_LOG_TRACE (1 << 4) #define SECCOMP_LOG_LOG (1 << 5) #define SECCOMP_LOG_ALLOW (1 << 6) +#define SECCOMP_LOG_USER_NOTIF (1 << 7) static u32 seccomp_actions_logged = SECCOMP_LOG_KILL_PROCESS | SECCOMP_LOG_KILL_THREAD | SECCOMP_LOG_TRAP | SECCOMP_LOG_ERRNO | + SECCOMP_LOG_USER_NOTIF | SECCOMP_LOG_TRACE | SECCOMP_LOG_LOG; @@ -575,6 +643,9 @@ static inline void seccomp_log(unsigned long syscall, long signr, u32 action, case SECCOMP_RET_TRACE: log = requested && seccomp_actions_logged & SECCOMP_LOG_TRACE; break; + case SECCOMP_RET_USER_NOTIF: + log = requested && seccomp_actions_logged & SECCOMP_LOG_USER_NOTIF; + break; case SECCOMP_RET_LOG: log = seccomp_actions_logged & SECCOMP_LOG_LOG; break; @@ -646,6 +717,68 @@ void secure_computing_strict(int this_syscall) #else #ifdef CONFIG_SECCOMP_FILTER +static u64 seccomp_next_notify_id(struct seccomp_filter *filter) +{ + /* + * Note: overflow is ok here, the id just needs to be unique per + * filter. + */ + lockdep_assert_held(&filter->notify_lock); + return filter->notif->next_id++; +} + +static void seccomp_do_user_notification(int this_syscall, + struct seccomp_filter *match, + const struct seccomp_data *sd) +{ + int err; + long ret = 0; + struct seccomp_knotif n = {}; + + mutex_lock(&match->notify_lock); + err = -ENOSYS; + if (!match->notif) + goto out; + + n.task = current; + n.state = SECCOMP_NOTIFY_INIT; + n.data = sd; + n.id = seccomp_next_notify_id(match); + init_completion(&n.ready); + list_add(&n.list, &match->notif->notifications); + + up(&match->notif->request); + wake_up_poll(&match->notif->wqh, EPOLLIN | EPOLLRDNORM); + mutex_unlock(&match->notify_lock); + + /* + * This is where we wait for a reply from userspace. + */ + err = wait_for_completion_interruptible(&n.ready); + mutex_lock(&match->notify_lock); + if (err == 0) { + ret = n.val; + err = n.error; + } + + /* + * Note that it's possible the listener died in between the time when + * we were notified of a respons (or a signal) and when we were able to + * re-acquire the lock, so only delete from the list if the + * notification actually exists. + * + * Also note that this test is only valid because there's no way to + * *reattach* to a notifier right now. If one is added, we'll need to + * keep track of the notif itself and make sure they match here. + */ + if (match->notif) + list_del(&n.list); +out: + mutex_unlock(&match->notify_lock); + syscall_set_return_value(current, task_pt_regs(current), + err, ret); +} + static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd, const bool recheck_after_trace) { @@ -728,6 +861,10 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd, return 0; + case SECCOMP_RET_USER_NOTIF: + seccomp_do_user_notification(this_syscall, match, sd); + goto skip; + case SECCOMP_RET_LOG: seccomp_log(this_syscall, 0, action, true); return 0; @@ -834,6 +971,263 @@ out: } #ifdef CONFIG_SECCOMP_FILTER +static int seccomp_notify_release(struct inode *inode, struct file *file) +{ + struct seccomp_filter *filter = file->private_data; + struct seccomp_knotif *knotif; + + mutex_lock(&filter->notify_lock); + + /* + * If this file is being closed because e.g. the task who owned it + * died, let's wake everyone up who was waiting on us. + */ + list_for_each_entry(knotif, &filter->notif->notifications, list) { + if (knotif->state == SECCOMP_NOTIFY_REPLIED) + continue; + + knotif->state = SECCOMP_NOTIFY_REPLIED; + knotif->error = -ENOSYS; + knotif->val = 0; + + complete(&knotif->ready); + } + + kfree(filter->notif); + filter->notif = NULL; + mutex_unlock(&filter->notify_lock); + __put_seccomp_filter(filter); + return 0; +} + +static long seccomp_notify_recv(struct seccomp_filter *filter, + void __user *buf) +{ + struct seccomp_knotif *knotif = NULL, *cur; + struct seccomp_notif unotif; + ssize_t ret; + + memset(&unotif, 0, sizeof(unotif)); + + ret = down_interruptible(&filter->notif->request); + if (ret < 0) + return ret; + + mutex_lock(&filter->notify_lock); + list_for_each_entry(cur, &filter->notif->notifications, list) { + if (cur->state == SECCOMP_NOTIFY_INIT) { + knotif = cur; + break; + } + } + + /* + * If we didn't find a notification, it could be that the task was + * interrupted by a fatal signal between the time we were woken and + * when we were able to acquire the rw lock. + */ + if (!knotif) { + ret = -ENOENT; + goto out; + } + + unotif.id = knotif->id; + unotif.pid = task_pid_vnr(knotif->task); + unotif.data = *(knotif->data); + + knotif->state = SECCOMP_NOTIFY_SENT; + wake_up_poll(&filter->notif->wqh, EPOLLOUT | EPOLLWRNORM); + ret = 0; +out: + mutex_unlock(&filter->notify_lock); + + if (ret == 0 && copy_to_user(buf, &unotif, sizeof(unotif))) { + ret = -EFAULT; + + /* + * Userspace screwed up. To make sure that we keep this + * notification alive, let's reset it back to INIT. It + * may have died when we released the lock, so we need to make + * sure it's still around. + */ + knotif = NULL; + mutex_lock(&filter->notify_lock); + list_for_each_entry(cur, &filter->notif->notifications, list) { + if (cur->id == unotif.id) { + knotif = cur; + break; + } + } + + if (knotif) { + knotif->state = SECCOMP_NOTIFY_INIT; + up(&filter->notif->request); + } + mutex_unlock(&filter->notify_lock); + } + + return ret; +} + +static long seccomp_notify_send(struct seccomp_filter *filter, + void __user *buf) +{ + struct seccomp_notif_resp resp = {}; + struct seccomp_knotif *knotif = NULL, *cur; + long ret; + + if (copy_from_user(&resp, buf, sizeof(resp))) + return -EFAULT; + + if (resp.flags) + return -EINVAL; + + ret = mutex_lock_interruptible(&filter->notify_lock); + if (ret < 0) + return ret; + + list_for_each_entry(cur, &filter->notif->notifications, list) { + if (cur->id == resp.id) { + knotif = cur; + break; + } + } + + if (!knotif) { + ret = -ENOENT; + goto out; + } + + /* Allow exactly one reply. */ + if (knotif->state != SECCOMP_NOTIFY_SENT) { + ret = -EINPROGRESS; + goto out; + } + + ret = 0; + knotif->state = SECCOMP_NOTIFY_REPLIED; + knotif->error = resp.error; + knotif->val = resp.val; + complete(&knotif->ready); +out: + mutex_unlock(&filter->notify_lock); + return ret; +} + +static long seccomp_notify_id_valid(struct seccomp_filter *filter, + void __user *buf) +{ + struct seccomp_knotif *knotif = NULL; + u64 id; + long ret; + + if (copy_from_user(&id, buf, sizeof(id))) + return -EFAULT; + + ret = mutex_lock_interruptible(&filter->notify_lock); + if (ret < 0) + return ret; + + ret = -ENOENT; + list_for_each_entry(knotif, &filter->notif->notifications, list) { + if (knotif->id == id) { + if (knotif->state == SECCOMP_NOTIFY_SENT) + ret = 0; + goto out; + } + } + +out: + mutex_unlock(&filter->notify_lock); + return ret; +} + +static long seccomp_notify_ioctl(struct file *file, unsigned int cmd, + unsigned long arg) +{ + struct seccomp_filter *filter = file->private_data; + void __user *buf = (void __user *)arg; + + switch (cmd) { + case SECCOMP_IOCTL_NOTIF_RECV: + return seccomp_notify_recv(filter, buf); + case SECCOMP_IOCTL_NOTIF_SEND: + return seccomp_notify_send(filter, buf); + case SECCOMP_IOCTL_NOTIF_ID_VALID: + return seccomp_notify_id_valid(filter, buf); + default: + return -EINVAL; + } +} + +static __poll_t seccomp_notify_poll(struct file *file, + struct poll_table_struct *poll_tab) +{ + struct seccomp_filter *filter = file->private_data; + __poll_t ret = 0; + struct seccomp_knotif *cur; + + poll_wait(file, &filter->notif->wqh, poll_tab); + + ret = mutex_lock_interruptible(&filter->notify_lock); + if (ret < 0) + return EPOLLERR; + + list_for_each_entry(cur, &filter->notif->notifications, list) { + if (cur->state == SECCOMP_NOTIFY_INIT) + ret |= EPOLLIN | EPOLLRDNORM; + if (cur->state == SECCOMP_NOTIFY_SENT) + ret |= EPOLLOUT | EPOLLWRNORM; + if ((ret & EPOLLIN) && (ret & EPOLLOUT)) + break; + } + + mutex_unlock(&filter->notify_lock); + + return ret; +} + +static const struct file_operations seccomp_notify_ops = { + .poll = seccomp_notify_poll, + .release = seccomp_notify_release, + .unlocked_ioctl = seccomp_notify_ioctl, +}; + +static struct file *init_listener(struct seccomp_filter *filter) +{ + struct file *ret = ERR_PTR(-EBUSY); + struct seccomp_filter *cur; + + for (cur = current->seccomp.filter; cur; cur = cur->prev) { + if (cur->notif) + goto out; + } + + ret = ERR_PTR(-ENOMEM); + filter->notif = kzalloc(sizeof(*(filter->notif)), GFP_KERNEL); + if (!filter->notif) + goto out; + + sema_init(&filter->notif->request, 0); + filter->notif->next_id = get_random_u64(); + INIT_LIST_HEAD(&filter->notif->notifications); + init_waitqueue_head(&filter->notif->wqh); + + ret = anon_inode_getfile("seccomp notify", &seccomp_notify_ops, + filter, O_RDWR); + if (IS_ERR(ret)) + goto out_notif; + + /* The file has a reference to it now */ + __get_seccomp_filter(filter); + +out_notif: + if (IS_ERR(ret)) + kfree(filter->notif); +out: + return ret; +} + /** * seccomp_set_mode_filter: internal function for setting seccomp filter * @flags: flags to change filter behavior @@ -853,6 +1247,8 @@ static long seccomp_set_mode_filter(unsigned int flags, const unsigned long seccomp_mode = SECCOMP_MODE_FILTER; struct seccomp_filter *prepared = NULL; long ret = -EINVAL; + int listener = -1; + struct file *listener_f = NULL; /* Validate flags. */ if (flags & ~SECCOMP_FILTER_FLAG_MASK) @@ -863,13 +1259,28 @@ static long seccomp_set_mode_filter(unsigned int flags, if (IS_ERR(prepared)) return PTR_ERR(prepared); + if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) { + listener = get_unused_fd_flags(O_CLOEXEC); + if (listener < 0) { + ret = listener; + goto out_free; + } + + listener_f = init_listener(prepared); + if (IS_ERR(listener_f)) { + put_unused_fd(listener); + ret = PTR_ERR(listener_f); + goto out_free; + } + } + /* * Make sure we cannot change seccomp or nnp state via TSYNC * while another thread is in the middle of calling exec. */ if (flags & SECCOMP_FILTER_FLAG_TSYNC && mutex_lock_killable(¤t->signal->cred_guard_mutex)) - goto out_free; + goto out_put_fd; spin_lock_irq(¤t->sighand->siglock); @@ -887,6 +1298,16 @@ out: spin_unlock_irq(¤t->sighand->siglock); if (flags & SECCOMP_FILTER_FLAG_TSYNC) mutex_unlock(¤t->signal->cred_guard_mutex); +out_put_fd: + if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) { + if (ret < 0) { + fput(listener_f); + put_unused_fd(listener); + } else { + fd_install(listener, listener_f); + ret = listener; + } + } out_free: seccomp_filter_free(prepared); return ret; @@ -911,6 +1332,7 @@ static long seccomp_get_action_avail(const char __user *uaction) case SECCOMP_RET_KILL_THREAD: case SECCOMP_RET_TRAP: case SECCOMP_RET_ERRNO: + case SECCOMP_RET_USER_NOTIF: case SECCOMP_RET_TRACE: case SECCOMP_RET_LOG: case SECCOMP_RET_ALLOW: @@ -922,6 +1344,20 @@ static long seccomp_get_action_avail(const char __user *uaction) return 0; } +static long seccomp_get_notif_sizes(void __user *usizes) +{ + struct seccomp_notif_sizes sizes = { + .seccomp_notif = sizeof(struct seccomp_notif), + .seccomp_notif_resp = sizeof(struct seccomp_notif_resp), + .seccomp_data = sizeof(struct seccomp_data), + }; + + if (copy_to_user(usizes, &sizes, sizeof(sizes))) + return -EFAULT; + + return 0; +} + /* Common entry point for both prctl and syscall. */ static long do_seccomp(unsigned int op, unsigned int flags, void __user *uargs) @@ -938,6 +1374,11 @@ static long do_seccomp(unsigned int op, unsigned int flags, return -EINVAL; return seccomp_get_action_avail(uargs); + case SECCOMP_GET_NOTIF_SIZES: + if (flags != 0) + return -EINVAL; + + return seccomp_get_notif_sizes(uargs); default: return -EINVAL; } @@ -1111,6 +1552,7 @@ long seccomp_get_metadata(struct task_struct *task, #define SECCOMP_RET_KILL_THREAD_NAME "kill_thread" #define SECCOMP_RET_TRAP_NAME "trap" #define SECCOMP_RET_ERRNO_NAME "errno" +#define SECCOMP_RET_USER_NOTIF_NAME "user_notif" #define SECCOMP_RET_TRACE_NAME "trace" #define SECCOMP_RET_LOG_NAME "log" #define SECCOMP_RET_ALLOW_NAME "allow" @@ -1120,6 +1562,7 @@ static const char seccomp_actions_avail[] = SECCOMP_RET_KILL_THREAD_NAME " " SECCOMP_RET_TRAP_NAME " " SECCOMP_RET_ERRNO_NAME " " + SECCOMP_RET_USER_NOTIF_NAME " " SECCOMP_RET_TRACE_NAME " " SECCOMP_RET_LOG_NAME " " SECCOMP_RET_ALLOW_NAME; @@ -1134,6 +1577,7 @@ static const struct seccomp_log_name seccomp_log_names[] = { { SECCOMP_LOG_KILL_THREAD, SECCOMP_RET_KILL_THREAD_NAME }, { SECCOMP_LOG_TRAP, SECCOMP_RET_TRAP_NAME }, { SECCOMP_LOG_ERRNO, SECCOMP_RET_ERRNO_NAME }, + { SECCOMP_LOG_USER_NOTIF, SECCOMP_RET_USER_NOTIF_NAME }, { SECCOMP_LOG_TRACE, SECCOMP_RET_TRACE_NAME }, { SECCOMP_LOG_LOG, SECCOMP_RET_LOG_NAME }, { SECCOMP_LOG_ALLOW, SECCOMP_RET_ALLOW_NAME }, diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c index e1473234968d..5c9768a1b8cd 100644 --- a/tools/testing/selftests/seccomp/seccomp_bpf.c +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c @@ -5,6 +5,7 @@ * Test code for seccomp bpf. */ +#define _GNU_SOURCE #include /* @@ -40,10 +41,12 @@ #include #include #include +#include +#include -#define _GNU_SOURCE #include #include +#include #include "../kselftest_harness.h" @@ -133,6 +136,10 @@ struct seccomp_data { #define SECCOMP_GET_ACTION_AVAIL 2 #endif +#ifndef SECCOMP_GET_NOTIF_SIZES +#define SECCOMP_GET_NOTIF_SIZES 3 +#endif + #ifndef SECCOMP_FILTER_FLAG_TSYNC #define SECCOMP_FILTER_FLAG_TSYNC (1UL << 0) #endif @@ -154,6 +161,44 @@ struct seccomp_metadata { }; #endif +#ifndef SECCOMP_FILTER_FLAG_NEW_LISTENER +#define SECCOMP_FILTER_FLAG_NEW_LISTENER (1UL << 3) + +#define SECCOMP_RET_USER_NOTIF 0x7fc00000U + +#define SECCOMP_IOC_MAGIC '!' +#define SECCOMP_IO(nr) _IO(SECCOMP_IOC_MAGIC, nr) +#define SECCOMP_IOR(nr, type) _IOR(SECCOMP_IOC_MAGIC, nr, type) +#define SECCOMP_IOW(nr, type) _IOW(SECCOMP_IOC_MAGIC, nr, type) +#define SECCOMP_IOWR(nr, type) _IOWR(SECCOMP_IOC_MAGIC, nr, type) + +/* Flags for seccomp notification fd ioctl. */ +#define SECCOMP_IOCTL_NOTIF_RECV SECCOMP_IOWR(0, struct seccomp_notif) +#define SECCOMP_IOCTL_NOTIF_SEND SECCOMP_IOWR(1, \ + struct seccomp_notif_resp) +#define SECCOMP_IOCTL_NOTIF_ID_VALID SECCOMP_IOR(2, __u64) + +struct seccomp_notif { + __u64 id; + __u32 pid; + __u32 flags; + struct seccomp_data data; +}; + +struct seccomp_notif_resp { + __u64 id; + __s64 val; + __s32 error; + __u32 flags; +}; + +struct seccomp_notif_sizes { + __u16 seccomp_notif; + __u16 seccomp_notif_resp; + __u16 seccomp_data; +}; +#endif + #ifndef seccomp int seccomp(unsigned int op, unsigned int flags, void *args) { @@ -2077,7 +2122,8 @@ TEST(detect_seccomp_filter_flags) { unsigned int flags[] = { SECCOMP_FILTER_FLAG_TSYNC, SECCOMP_FILTER_FLAG_LOG, - SECCOMP_FILTER_FLAG_SPEC_ALLOW }; + SECCOMP_FILTER_FLAG_SPEC_ALLOW, + SECCOMP_FILTER_FLAG_NEW_LISTENER }; unsigned int flag, all_flags; int i; long ret; @@ -2933,6 +2979,403 @@ skip: ASSERT_EQ(0, kill(pid, SIGKILL)); } +static int user_trap_syscall(int nr, unsigned int flags) +{ + struct sock_filter filter[] = { + BPF_STMT(BPF_LD+BPF_W+BPF_ABS, + offsetof(struct seccomp_data, nr)), + BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, nr, 0, 1), + BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_USER_NOTIF), + BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW), + }; + + struct sock_fprog prog = { + .len = (unsigned short)ARRAY_SIZE(filter), + .filter = filter, + }; + + return seccomp(SECCOMP_SET_MODE_FILTER, flags, &prog); +} + +#define USER_NOTIF_MAGIC 116983961184613L +TEST(user_notification_basic) +{ + pid_t pid; + long ret; + int status, listener; + struct seccomp_notif req = {}; + struct seccomp_notif_resp resp = {}; + struct pollfd pollfd; + + struct sock_filter filter[] = { + BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW), + }; + struct sock_fprog prog = { + .len = (unsigned short)ARRAY_SIZE(filter), + .filter = filter, + }; + + pid = fork(); + ASSERT_GE(pid, 0); + + /* Check that we get -ENOSYS with no listener attached */ + if (pid == 0) { + if (user_trap_syscall(__NR_getpid, 0) < 0) + exit(1); + ret = syscall(__NR_getpid); + exit(ret >= 0 || errno != ENOSYS); + } + + EXPECT_EQ(waitpid(pid, &status, 0), pid); + EXPECT_EQ(true, WIFEXITED(status)); + EXPECT_EQ(0, WEXITSTATUS(status)); + + /* Add some no-op filters so for grins. */ + EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0); + EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0); + EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0); + EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0); + + /* Check that the basic notification machinery works */ + listener = user_trap_syscall(__NR_getpid, + SECCOMP_FILTER_FLAG_NEW_LISTENER); + EXPECT_GE(listener, 0); + + /* Installing a second listener in the chain should EBUSY */ + EXPECT_EQ(user_trap_syscall(__NR_getpid, + SECCOMP_FILTER_FLAG_NEW_LISTENER), + -1); + EXPECT_EQ(errno, EBUSY); + + pid = fork(); + ASSERT_GE(pid, 0); + + if (pid == 0) { + ret = syscall(__NR_getpid); + exit(ret != USER_NOTIF_MAGIC); + } + + pollfd.fd = listener; + pollfd.events = POLLIN | POLLOUT; + + EXPECT_GT(poll(&pollfd, 1, -1), 0); + EXPECT_EQ(pollfd.revents, POLLIN); + + EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0); + + pollfd.fd = listener; + pollfd.events = POLLIN | POLLOUT; + + EXPECT_GT(poll(&pollfd, 1, -1), 0); + EXPECT_EQ(pollfd.revents, POLLOUT); + + EXPECT_EQ(req.data.nr, __NR_getpid); + + resp.id = req.id; + resp.error = 0; + resp.val = USER_NOTIF_MAGIC; + + /* check that we make sure flags == 0 */ + resp.flags = 1; + EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), -1); + EXPECT_EQ(errno, EINVAL); + + resp.flags = 0; + EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), 0); + + EXPECT_EQ(waitpid(pid, &status, 0), pid); + EXPECT_EQ(true, WIFEXITED(status)); + EXPECT_EQ(0, WEXITSTATUS(status)); +} + +TEST(user_notification_kill_in_middle) +{ + pid_t pid; + long ret; + int listener; + struct seccomp_notif req = {}; + struct seccomp_notif_resp resp = {}; + + listener = user_trap_syscall(__NR_getpid, + SECCOMP_FILTER_FLAG_NEW_LISTENER); + EXPECT_GE(listener, 0); + + /* + * Check that nothing bad happens when we kill the task in the middle + * of a syscall. + */ + pid = fork(); + ASSERT_GE(pid, 0); + + if (pid == 0) { + ret = syscall(__NR_getpid); + exit(ret != USER_NOTIF_MAGIC); + } + + EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0); + EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_ID_VALID, &req.id), 0); + + EXPECT_EQ(kill(pid, SIGKILL), 0); + EXPECT_EQ(waitpid(pid, NULL, 0), pid); + + EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_ID_VALID, &req.id), -1); + + resp.id = req.id; + ret = ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp); + EXPECT_EQ(ret, -1); + EXPECT_EQ(errno, ENOENT); +} + +static int handled = -1; + +static void signal_handler(int signal) +{ + if (write(handled, "c", 1) != 1) + perror("write from signal"); +} + +TEST(user_notification_signal) +{ + pid_t pid; + long ret; + int status, listener, sk_pair[2]; + struct seccomp_notif req = {}; + struct seccomp_notif_resp resp = {}; + char c; + + ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0); + + listener = user_trap_syscall(__NR_gettid, + SECCOMP_FILTER_FLAG_NEW_LISTENER); + EXPECT_GE(listener, 0); + + pid = fork(); + ASSERT_GE(pid, 0); + + if (pid == 0) { + close(sk_pair[0]); + handled = sk_pair[1]; + if (signal(SIGUSR1, signal_handler) == SIG_ERR) { + perror("signal"); + exit(1); + } + /* + * ERESTARTSYS behavior is a bit hard to test, because we need + * to rely on a signal that has not yet been handled. Let's at + * least check that the error code gets propagated through, and + * hope that it doesn't break when there is actually a signal :) + */ + ret = syscall(__NR_gettid); + exit(!(ret == -1 && errno == 512)); + } + + close(sk_pair[1]); + + EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0); + + EXPECT_EQ(kill(pid, SIGUSR1), 0); + + /* + * Make sure the signal really is delivered, which means we're not + * stuck in the user notification code any more and the notification + * should be dead. + */ + EXPECT_EQ(read(sk_pair[0], &c, 1), 1); + + resp.id = req.id; + resp.error = -EPERM; + resp.val = 0; + + EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), -1); + EXPECT_EQ(errno, ENOENT); + + EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0); + + resp.id = req.id; + resp.error = -512; /* -ERESTARTSYS */ + resp.val = 0; + + EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), 0); + + EXPECT_EQ(waitpid(pid, &status, 0), pid); + EXPECT_EQ(true, WIFEXITED(status)); + EXPECT_EQ(0, WEXITSTATUS(status)); +} + +TEST(user_notification_closed_listener) +{ + pid_t pid; + long ret; + int status, listener; + + listener = user_trap_syscall(__NR_getpid, + SECCOMP_FILTER_FLAG_NEW_LISTENER); + EXPECT_GE(listener, 0); + + /* + * Check that we get an ENOSYS when the listener is closed. + */ + pid = fork(); + ASSERT_GE(pid, 0); + if (pid == 0) { + close(listener); + ret = syscall(__NR_getpid); + exit(ret != -1 && errno != ENOSYS); + } + + close(listener); + + EXPECT_EQ(waitpid(pid, &status, 0), pid); + EXPECT_EQ(true, WIFEXITED(status)); + EXPECT_EQ(0, WEXITSTATUS(status)); +} + +/* + * Check that a pid in a child namespace still shows up as valid in ours. + */ +TEST(user_notification_child_pid_ns) +{ + pid_t pid; + int status, listener; + struct seccomp_notif req = {}; + struct seccomp_notif_resp resp = {}; + + ASSERT_EQ(unshare(CLONE_NEWPID), 0); + + listener = user_trap_syscall(__NR_getpid, SECCOMP_FILTER_FLAG_NEW_LISTENER); + ASSERT_GE(listener, 0); + + pid = fork(); + ASSERT_GE(pid, 0); + + if (pid == 0) + exit(syscall(__NR_getpid) != USER_NOTIF_MAGIC); + + EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0); + EXPECT_EQ(req.pid, pid); + + resp.id = req.id; + resp.error = 0; + resp.val = USER_NOTIF_MAGIC; + + EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), 0); + + EXPECT_EQ(waitpid(pid, &status, 0), pid); + EXPECT_EQ(true, WIFEXITED(status)); + EXPECT_EQ(0, WEXITSTATUS(status)); + close(listener); +} + +/* + * Check that a pid in a sibling (i.e. unrelated) namespace shows up as 0, i.e. + * invalid. + */ +TEST(user_notification_sibling_pid_ns) +{ + pid_t pid, pid2; + int status, listener; + struct seccomp_notif req = {}; + struct seccomp_notif_resp resp = {}; + + listener = user_trap_syscall(__NR_getpid, SECCOMP_FILTER_FLAG_NEW_LISTENER); + ASSERT_GE(listener, 0); + + pid = fork(); + ASSERT_GE(pid, 0); + + if (pid == 0) { + ASSERT_EQ(unshare(CLONE_NEWPID), 0); + + pid2 = fork(); + ASSERT_GE(pid2, 0); + + if (pid2 == 0) + exit(syscall(__NR_getpid) != USER_NOTIF_MAGIC); + + EXPECT_EQ(waitpid(pid2, &status, 0), pid2); + EXPECT_EQ(true, WIFEXITED(status)); + EXPECT_EQ(0, WEXITSTATUS(status)); + exit(WEXITSTATUS(status)); + } + + /* Create the sibling ns, and sibling in it. */ + EXPECT_EQ(unshare(CLONE_NEWPID), 0); + EXPECT_EQ(errno, 0); + + pid2 = fork(); + EXPECT_GE(pid2, 0); + + if (pid2 == 0) { + ASSERT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0); + /* + * The pid should be 0, i.e. the task is in some namespace that + * we can't "see". + */ + ASSERT_EQ(req.pid, 0); + + resp.id = req.id; + resp.error = 0; + resp.val = USER_NOTIF_MAGIC; + + ASSERT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), 0); + exit(0); + } + + close(listener); + + EXPECT_EQ(waitpid(pid, &status, 0), pid); + EXPECT_EQ(true, WIFEXITED(status)); + EXPECT_EQ(0, WEXITSTATUS(status)); + + EXPECT_EQ(waitpid(pid2, &status, 0), pid2); + EXPECT_EQ(true, WIFEXITED(status)); + EXPECT_EQ(0, WEXITSTATUS(status)); +} + +TEST(user_notification_fault_recv) +{ + pid_t pid; + int status, listener; + struct seccomp_notif req = {}; + struct seccomp_notif_resp resp = {}; + + listener = user_trap_syscall(__NR_getpid, SECCOMP_FILTER_FLAG_NEW_LISTENER); + ASSERT_GE(listener, 0); + + pid = fork(); + ASSERT_GE(pid, 0); + + if (pid == 0) + exit(syscall(__NR_getpid) != USER_NOTIF_MAGIC); + + /* Do a bad recv() */ + EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, NULL), -1); + EXPECT_EQ(errno, EFAULT); + + /* We should still be able to receive this notification, though. */ + EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0); + EXPECT_EQ(req.pid, pid); + + resp.id = req.id; + resp.error = 0; + resp.val = USER_NOTIF_MAGIC; + + EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), 0); + + EXPECT_EQ(waitpid(pid, &status, 0), pid); + EXPECT_EQ(true, WIFEXITED(status)); + EXPECT_EQ(0, WEXITSTATUS(status)); +} + +TEST(seccomp_get_notif_sizes) +{ + struct seccomp_notif_sizes sizes; + + EXPECT_EQ(seccomp(SECCOMP_GET_NOTIF_SIZES, 0, &sizes), 0); + EXPECT_EQ(sizes.seccomp_notif, sizeof(struct seccomp_notif)); + EXPECT_EQ(sizes.seccomp_notif_resp, sizeof(struct seccomp_notif_resp)); +} + /* * TODO: * - add microbenchmarks -- cgit v1.3-7-g2ca7 From 5b20c6fd6a60e182243da31c47f2ebff5b0e3d57 Mon Sep 17 00:00:00 2001 From: Yangtao Li Date: Tue, 11 Dec 2018 11:37:44 -0500 Subject: timekeeping: Convert to DEFINE_SHOW_ATTRIBUTE Use DEFINE_SHOW_ATTRIBUTE macro to simplify the code. Signed-off-by: Yangtao Li Signed-off-by: Thomas Gleixner Cc: john.stultz@linaro.org Cc: sboyd@kernel.org Link: https://lkml.kernel.org/r/20181211163744.22133-1-tiny.windzz@gmail.com --- kernel/time/timekeeping_debug.c | 15 ++------------- 1 file changed, 2 insertions(+), 13 deletions(-) (limited to 'kernel') diff --git a/kernel/time/timekeeping_debug.c b/kernel/time/timekeeping_debug.c index f811882cfd13..86489950d690 100644 --- a/kernel/time/timekeeping_debug.c +++ b/kernel/time/timekeeping_debug.c @@ -19,7 +19,7 @@ static unsigned int sleep_time_bin[NUM_BINS] = {0}; -static int tk_debug_show_sleep_time(struct seq_file *s, void *data) +static int tk_debug_sleep_time_show(struct seq_file *s, void *data) { unsigned int bin; seq_puts(s, " time (secs) count\n"); @@ -33,18 +33,7 @@ static int tk_debug_show_sleep_time(struct seq_file *s, void *data) } return 0; } - -static int tk_debug_sleep_time_open(struct inode *inode, struct file *file) -{ - return single_open(file, tk_debug_show_sleep_time, NULL); -} - -static const struct file_operations tk_debug_sleep_time_fops = { - .open = tk_debug_sleep_time_open, - .read = seq_read, - .llseek = seq_lseek, - .release = single_release, -}; +DEFINE_SHOW_ATTRIBUTE(tk_debug_sleep_time); static int __init tk_debug_sleep_time_init(void) { -- cgit v1.3-7-g2ca7 From 07c17732bd687567802aaa5fa5c101c2776565d1 Mon Sep 17 00:00:00 2001 From: Tetsuo Handa Date: Tue, 11 Dec 2018 18:49:05 +0900 Subject: printk: Remove print_prefix() calls with NULL buffer. We can save lines/size by removing print_prefix() with buf == NULL. This patch makes no functional change. Link: http://lkml.kernel.org/r/1544521745-11925-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp To: Steven Rostedt Cc: linux-kernel@vger.kernel.org Signed-off-by: Tetsuo Handa Reviewed-by: Sergey Senozhatsky Signed-off-by: Petr Mladek --- kernel/printk/printk.c | 39 ++++++++++++++------------------------- 1 file changed, 14 insertions(+), 25 deletions(-) (limited to 'kernel') diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index c7d217764db3..91db332ccf4d 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -1228,13 +1228,15 @@ static inline void boot_delay_msec(int level) static bool printk_time = IS_ENABLED(CONFIG_PRINTK_TIME); module_param_named(time, printk_time, bool, S_IRUGO | S_IWUSR); +static size_t print_syslog(unsigned int level, char *buf) +{ + return sprintf(buf, "<%u>", level); +} + static size_t print_time(u64 ts, char *buf) { unsigned long rem_nsec = do_div(ts, 1000000000); - if (!buf) - return snprintf(NULL, 0, "[%5lu.000000] ", (unsigned long)ts); - return sprintf(buf, "[%5lu.%06lu] ", (unsigned long)ts, rem_nsec / 1000); } @@ -1243,24 +1245,11 @@ static size_t print_prefix(const struct printk_log *msg, bool syslog, bool time, char *buf) { size_t len = 0; - unsigned int prefix = (msg->facility << 3) | msg->level; - - if (syslog) { - if (buf) { - len += sprintf(buf, "<%u>", prefix); - } else { - len += 3; - if (prefix > 999) - len += 3; - else if (prefix > 99) - len += 2; - else if (prefix > 9) - len++; - } - } + if (syslog) + len = print_syslog((msg->facility << 3) | msg->level, buf); if (time) - len += print_time(msg->ts_nsec, buf ? buf + len : NULL); + len += print_time(msg->ts_nsec, buf + len); return len; } @@ -1270,6 +1259,8 @@ static size_t msg_print_text(const struct printk_log *msg, bool syslog, const char *text = log_text(msg); size_t text_size = msg->text_len; size_t len = 0; + char prefix[PREFIX_MAX]; + const size_t prefix_len = print_prefix(msg, syslog, time, prefix); do { const char *next = memchr(text, '\n', text_size); @@ -1284,19 +1275,17 @@ static size_t msg_print_text(const struct printk_log *msg, bool syslog, } if (buf) { - if (print_prefix(msg, syslog, time, NULL) + - text_len + 1 >= size - len) + if (prefix_len + text_len + 1 >= size - len) break; - len += print_prefix(msg, syslog, time, buf + len); + memcpy(buf + len, prefix, prefix_len); + len += prefix_len; memcpy(buf + len, text, text_len); len += text_len; buf[len++] = '\n'; } else { /* SYSLOG_ACTION_* buffer size only calculation */ - len += print_prefix(msg, syslog, time, NULL); - len += text_len; - len++; + len += prefix_len + text_len + 1; } text = next; -- cgit v1.3-7-g2ca7 From 943a10f85265d4c14753d445003b6f5d8e3d959a Mon Sep 17 00:00:00 2001 From: Yangtao Li Date: Tue, 11 Dec 2018 11:20:48 -0500 Subject: PM / sleep: convert to DEFINE_SHOW_ATTRIBUTE Use DEFINE_SHOW_ATTRIBUTE macro to simplify the code. Signed-off-by: Yangtao Li Acked-by: Pavel Machek Signed-off-by: Rafael J. Wysocki --- kernel/power/main.c | 15 ++------------- 1 file changed, 2 insertions(+), 13 deletions(-) (limited to 'kernel') diff --git a/kernel/power/main.c b/kernel/power/main.c index 35b50823d83b..98e76cad128b 100644 --- a/kernel/power/main.c +++ b/kernel/power/main.c @@ -318,23 +318,12 @@ static int suspend_stats_show(struct seq_file *s, void *unused) return 0; } - -static int suspend_stats_open(struct inode *inode, struct file *file) -{ - return single_open(file, suspend_stats_show, NULL); -} - -static const struct file_operations suspend_stats_operations = { - .open = suspend_stats_open, - .read = seq_read, - .llseek = seq_lseek, - .release = single_release, -}; +DEFINE_SHOW_ATTRIBUTE(suspend_stats); static int __init pm_debugfs_init(void) { debugfs_create_file("suspend_stats", S_IFREG | S_IRUGO, - NULL, NULL, &suspend_stats_operations); + NULL, NULL, &suspend_stats_fops); return 0; } -- cgit v1.3-7-g2ca7 From 1b2b234b1318afb3775d4c6624fd5a96558f19df Mon Sep 17 00:00:00 2001 From: Roman Gushchin Date: Mon, 10 Dec 2018 15:43:00 -0800 Subject: bpf: pass struct btf pointer to the map_check_btf() callback If key_type or value_type are of non-trivial data types (e.g. structure or typedef), it's not possible to check them without the additional information, which can't be obtained without a pointer to the btf structure. So, let's pass btf pointer to the map_check_btf() callbacks. Signed-off-by: Roman Gushchin Cc: Alexei Starovoitov Cc: Daniel Borkmann Acked-by: Martin KaFai Lau Signed-off-by: Daniel Borkmann Signed-off-by: Alexei Starovoitov --- include/linux/bpf.h | 3 +++ kernel/bpf/arraymap.c | 1 + kernel/bpf/lpm_trie.c | 1 + kernel/bpf/syscall.c | 3 ++- 4 files changed, 7 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 0c992b86eb2c..e734f163bd0b 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -23,6 +23,7 @@ struct bpf_prog; struct bpf_map; struct sock; struct seq_file; +struct btf; struct btf_type; /* map is generic key/value storage optionally accesible by eBPF programs */ @@ -52,6 +53,7 @@ struct bpf_map_ops { void (*map_seq_show_elem)(struct bpf_map *map, void *key, struct seq_file *m); int (*map_check_btf)(const struct bpf_map *map, + const struct btf *btf, const struct btf_type *key_type, const struct btf_type *value_type); }; @@ -126,6 +128,7 @@ static inline bool bpf_map_support_seq_show(const struct bpf_map *map) } int map_check_no_btf(const struct bpf_map *map, + const struct btf *btf, const struct btf_type *key_type, const struct btf_type *value_type); diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c index 24583da9ffd1..25632a75d630 100644 --- a/kernel/bpf/arraymap.c +++ b/kernel/bpf/arraymap.c @@ -382,6 +382,7 @@ static void percpu_array_map_seq_show_elem(struct bpf_map *map, void *key, } static int array_map_check_btf(const struct bpf_map *map, + const struct btf *btf, const struct btf_type *key_type, const struct btf_type *value_type) { diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c index bfd4882e1106..abf1002080df 100644 --- a/kernel/bpf/lpm_trie.c +++ b/kernel/bpf/lpm_trie.c @@ -728,6 +728,7 @@ free_stack: } static int trie_check_btf(const struct bpf_map *map, + const struct btf *btf, const struct btf_type *key_type, const struct btf_type *value_type) { diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 5745c7837621..70fb11106fc2 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -456,6 +456,7 @@ static int bpf_obj_name_cpy(char *dst, const char *src) } int map_check_no_btf(const struct bpf_map *map, + const struct btf *btf, const struct btf_type *key_type, const struct btf_type *value_type) { @@ -478,7 +479,7 @@ static int map_check_btf(const struct bpf_map *map, const struct btf *btf, return -EINVAL; if (map->ops->map_check_btf) - ret = map->ops->map_check_btf(map, key_type, value_type); + ret = map->ops->map_check_btf(map, btf, key_type, value_type); return ret; } -- cgit v1.3-7-g2ca7 From 9a1126b63190e2541dd5d643f4bfeb5a7f493729 Mon Sep 17 00:00:00 2001 From: Roman Gushchin Date: Mon, 10 Dec 2018 15:43:01 -0800 Subject: bpf: add bpffs pretty print for cgroup local storage maps Implement bpffs pretty printing for cgroup local storage maps (both shared and per-cpu). Output example (captured for tools/testing/selftests/bpf/netcnt_prog.c): Shared: $ cat /sys/fs/bpf/map_2 # WARNING!! The output is for debug purpose only # WARNING!! The output format will change {4294968594,1}: {9999,1039896} Per-cpu: $ cat /sys/fs/bpf/map_1 # WARNING!! The output is for debug purpose only # WARNING!! The output format will change {4294968594,1}: { cpu0: {0,0,0,0,0} cpu1: {0,0,0,0,0} cpu2: {1,104,0,0,0} cpu3: {0,0,0,0,0} } Signed-off-by: Roman Gushchin Cc: Alexei Starovoitov Cc: Daniel Borkmann Acked-by: Martin KaFai Lau Signed-off-by: Daniel Borkmann Signed-off-by: Alexei Starovoitov --- include/linux/btf.h | 1 + kernel/bpf/btf.c | 22 +++++++++++ kernel/bpf/local_storage.c | 93 +++++++++++++++++++++++++++++++++++++++++++++- 3 files changed, 115 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/include/linux/btf.h b/include/linux/btf.h index b98405a56383..a4cf075b89eb 100644 --- a/include/linux/btf.h +++ b/include/linux/btf.h @@ -47,6 +47,7 @@ void btf_type_seq_show(const struct btf *btf, u32 type_id, void *obj, int btf_get_fd_by_id(u32 id); u32 btf_id(const struct btf *btf); bool btf_name_offset_valid(const struct btf *btf, u32 offset); +bool btf_type_is_reg_int(const struct btf_type *t, u32 expected_size); #ifdef CONFIG_BPF_SYSCALL const struct btf_type *btf_type_by_id(const struct btf *btf, u32 type_id); diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index bf34933cc413..1545ddfb6fa5 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -514,6 +514,28 @@ static bool btf_type_int_is_regular(const struct btf_type *t) return true; } +/* + * Check that given type is a regular int and has the expected size. + */ +bool btf_type_is_reg_int(const struct btf_type *t, u32 expected_size) +{ + u8 nr_bits, nr_bytes; + u32 int_data; + + if (!btf_type_is_int(t)) + return false; + + int_data = btf_type_int(t); + nr_bits = BTF_INT_BITS(int_data); + nr_bytes = BITS_ROUNDUP_BYTES(nr_bits); + if (BITS_PER_BYTE_MASKED(nr_bits) || + BTF_INT_OFFSET(int_data) || + nr_bytes != expected_size) + return false; + + return true; +} + __printf(2, 3) static void __btf_verifier_log(struct bpf_verifier_log *log, const char *fmt, ...) { diff --git a/kernel/bpf/local_storage.c b/kernel/bpf/local_storage.c index b65017dead44..5eca03da0989 100644 --- a/kernel/bpf/local_storage.c +++ b/kernel/bpf/local_storage.c @@ -1,11 +1,13 @@ //SPDX-License-Identifier: GPL-2.0 #include #include +#include #include #include #include #include #include +#include DEFINE_PER_CPU(struct bpf_cgroup_storage*, bpf_cgroup_storage[MAX_BPF_CGROUP_STORAGE_TYPE]); @@ -308,6 +310,94 @@ static int cgroup_storage_delete_elem(struct bpf_map *map, void *key) return -EINVAL; } +static int cgroup_storage_check_btf(const struct bpf_map *map, + const struct btf *btf, + const struct btf_type *key_type, + const struct btf_type *value_type) +{ + const struct btf_type *t; + struct btf_member *m; + u32 id, size; + + /* Key is expected to be of struct bpf_cgroup_storage_key type, + * which is: + * struct bpf_cgroup_storage_key { + * __u64 cgroup_inode_id; + * __u32 attach_type; + * }; + */ + + /* + * Key_type must be a structure with two fields. + */ + if (BTF_INFO_KIND(key_type->info) != BTF_KIND_STRUCT || + BTF_INFO_VLEN(key_type->info) != 2) + return -EINVAL; + + /* + * The first field must be a 64 bit integer at 0 offset. + */ + m = (struct btf_member *)(key_type + 1); + if (m->offset) + return -EINVAL; + id = m->type; + t = btf_type_id_size(btf, &id, NULL); + size = FIELD_SIZEOF(struct bpf_cgroup_storage_key, cgroup_inode_id); + if (!t || !btf_type_is_reg_int(t, size)) + return -EINVAL; + + /* + * The second field must be a 32 bit integer at 64 bit offset. + */ + m++; + if (m->offset != offsetof(struct bpf_cgroup_storage_key, attach_type) * + BITS_PER_BYTE) + return -EINVAL; + id = m->type; + t = btf_type_id_size(btf, &id, NULL); + size = FIELD_SIZEOF(struct bpf_cgroup_storage_key, attach_type); + if (!t || !btf_type_is_reg_int(t, size)) + return -EINVAL; + + return 0; +} + +static void cgroup_storage_seq_show_elem(struct bpf_map *map, void *_key, + struct seq_file *m) +{ + enum bpf_cgroup_storage_type stype = cgroup_storage_type(map); + struct bpf_cgroup_storage_key *key = _key; + struct bpf_cgroup_storage *storage; + int cpu; + + rcu_read_lock(); + storage = cgroup_storage_lookup(map_to_storage(map), key, false); + if (!storage) { + rcu_read_unlock(); + return; + } + + btf_type_seq_show(map->btf, map->btf_key_type_id, key, m); + stype = cgroup_storage_type(map); + if (stype == BPF_CGROUP_STORAGE_SHARED) { + seq_puts(m, ": "); + btf_type_seq_show(map->btf, map->btf_value_type_id, + &READ_ONCE(storage->buf)->data[0], m); + seq_puts(m, "\n"); + } else { + seq_puts(m, ": {\n"); + for_each_possible_cpu(cpu) { + seq_printf(m, "\tcpu%d: ", cpu); + btf_type_seq_show(map->btf, map->btf_value_type_id, + per_cpu_ptr(storage->percpu_buf, cpu), + m); + seq_puts(m, "\n"); + } + seq_puts(m, "}\n"); + } + rcu_read_unlock(); +} + const struct bpf_map_ops cgroup_storage_map_ops = { .map_alloc = cgroup_storage_map_alloc, .map_free = cgroup_storage_map_free, @@ -315,7 +405,8 @@ const struct bpf_map_ops cgroup_storage_map_ops = { .map_lookup_elem = cgroup_storage_lookup_elem, .map_update_elem = cgroup_storage_update_elem, .map_delete_elem = cgroup_storage_delete_elem, - .map_check_btf = map_check_no_btf, + .map_check_btf = cgroup_storage_check_btf, + .map_seq_show_elem = cgroup_storage_seq_show_elem, }; int bpf_cgroup_storage_assign(struct bpf_prog *prog, struct bpf_map *_map) -- cgit v1.3-7-g2ca7 From 06459901d55ee2f690b8e1fe084fb03061d617cf Mon Sep 17 00:00:00 2001 From: Bartosz Golaszewski Date: Fri, 9 Nov 2018 18:21:32 +0100 Subject: irq/irq_sim: Store multiple interrupt offsets in a bitmap Two threads can try to fire the irq_sim with different offsets and will end up fighting for the irq_work asignment. Thomas Gleixner suggested a solution based on a bitfield where we set a bit for every offset associated with an interrupt that should be fired and then iterate over all set bits in the interrupt handler. This is a slightly modified solution using a bitmap so that we don't impose a limit on the number of interrupts one can allocate with irq_sim. Suggested-by: Thomas Gleixner Signed-off-by: Bartosz Golaszewski Signed-off-by: Marc Zyngier --- include/linux/irq_sim.h | 2 +- kernel/irq/irq_sim.c | 23 +++++++++++++++++++++-- 2 files changed, 22 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/include/linux/irq_sim.h b/include/linux/irq_sim.h index 630a57e55db6..4500d453a63e 100644 --- a/include/linux/irq_sim.h +++ b/include/linux/irq_sim.h @@ -16,7 +16,7 @@ struct irq_sim_work_ctx { struct irq_work work; - int irq; + unsigned long *pending; }; struct irq_sim_irq_ctx { diff --git a/kernel/irq/irq_sim.c b/kernel/irq/irq_sim.c index dd20d0d528d4..98a20e1594ce 100644 --- a/kernel/irq/irq_sim.c +++ b/kernel/irq/irq_sim.c @@ -34,9 +34,20 @@ static struct irq_chip irq_sim_irqchip = { static void irq_sim_handle_irq(struct irq_work *work) { struct irq_sim_work_ctx *work_ctx; + unsigned int offset = 0; + struct irq_sim *sim; + int irqnum; work_ctx = container_of(work, struct irq_sim_work_ctx, work); - handle_simple_irq(irq_to_desc(work_ctx->irq)); + sim = container_of(work_ctx, struct irq_sim, work_ctx); + + while (!bitmap_empty(work_ctx->pending, sim->irq_count)) { + offset = find_next_bit(work_ctx->pending, + sim->irq_count, offset); + clear_bit(offset, work_ctx->pending); + irqnum = irq_sim_irqnum(sim, offset); + handle_simple_irq(irq_to_desc(irqnum)); + } } /** @@ -63,6 +74,13 @@ int irq_sim_init(struct irq_sim *sim, unsigned int num_irqs) return sim->irq_base; } + sim->work_ctx.pending = bitmap_zalloc(num_irqs, GFP_KERNEL); + if (!sim->work_ctx.pending) { + kfree(sim->irqs); + irq_free_descs(sim->irq_base, num_irqs); + return -ENOMEM; + } + for (i = 0; i < num_irqs; i++) { sim->irqs[i].irqnum = sim->irq_base + i; sim->irqs[i].enabled = false; @@ -89,6 +107,7 @@ EXPORT_SYMBOL_GPL(irq_sim_init); void irq_sim_fini(struct irq_sim *sim) { irq_work_sync(&sim->work_ctx.work); + bitmap_free(sim->work_ctx.pending); irq_free_descs(sim->irq_base, sim->irq_count); kfree(sim->irqs); } @@ -143,7 +162,7 @@ EXPORT_SYMBOL_GPL(devm_irq_sim_init); void irq_sim_fire(struct irq_sim *sim, unsigned int offset) { if (sim->irqs[offset].enabled) { - sim->work_ctx.irq = irq_sim_irqnum(sim, offset); + set_bit(offset, sim->work_ctx.pending); irq_work_queue(&sim->work_ctx.work); } } -- cgit v1.3-7-g2ca7 From 9e794163a69c103633fefb10a3879408d4e4e2c8 Mon Sep 17 00:00:00 2001 From: Martin KaFai Lau Date: Wed, 12 Dec 2018 10:18:21 -0800 Subject: bpf: Remove bpf_dump_raw_ok() check for func_info and line_info The func_info and line_info have the bpf insn offset but they do not contain kernel address. They will still be useful for the userspace tool to annotate the xlated insn. This patch removes the bpf_dump_raw_ok() guard for the func_info and line_info during bpf_prog_get_info_by_fd(). The guard stays for jited_line_info which contains the kernel address. Although this bpf_dump_raw_ok() guard behavior has started since the earlier func_info patch series, I marked the Fixes tag to the latest line_info patch series which contains both func_info and line_info and this patch is fixing for both of them. Fixes: c454a46b5efd ("bpf: Add bpf_line_info support") Signed-off-by: Martin KaFai Lau Signed-off-by: Daniel Borkmann --- kernel/bpf/syscall.c | 32 ++++++++++++-------------------- 1 file changed, 12 insertions(+), 20 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 70fb11106fc2..b7c585838c72 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -2272,33 +2272,25 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog, ulen = info.nr_func_info; info.nr_func_info = prog->aux->func_info_cnt; if (info.nr_func_info && ulen) { - if (bpf_dump_raw_ok()) { - char __user *user_finfo; + char __user *user_finfo; - user_finfo = u64_to_user_ptr(info.func_info); - ulen = min_t(u32, info.nr_func_info, ulen); - if (copy_to_user(user_finfo, prog->aux->func_info, - info.func_info_rec_size * ulen)) - return -EFAULT; - } else { - info.func_info = 0; - } + user_finfo = u64_to_user_ptr(info.func_info); + ulen = min_t(u32, info.nr_func_info, ulen); + if (copy_to_user(user_finfo, prog->aux->func_info, + info.func_info_rec_size * ulen)) + return -EFAULT; } ulen = info.nr_line_info; info.nr_line_info = prog->aux->nr_linfo; if (info.nr_line_info && ulen) { - if (bpf_dump_raw_ok()) { - __u8 __user *user_linfo; + __u8 __user *user_linfo; - user_linfo = u64_to_user_ptr(info.line_info); - ulen = min_t(u32, info.nr_line_info, ulen); - if (copy_to_user(user_linfo, prog->aux->linfo, - info.line_info_rec_size * ulen)) - return -EFAULT; - } else { - info.line_info = 0; - } + user_linfo = u64_to_user_ptr(info.line_info); + ulen = min_t(u32, info.nr_line_info, ulen); + if (copy_to_user(user_linfo, prog->aux->linfo, + info.line_info_rec_size * ulen)) + return -EFAULT; } ulen = info.nr_jited_line_info; -- cgit v1.3-7-g2ca7 From c872bdb38febb4c31ece3599c52cf1f833b89f4e Mon Sep 17 00:00:00 2001 From: Song Liu Date: Wed, 12 Dec 2018 09:37:46 -0800 Subject: bpf: include sub program tags in bpf_prog_info Changes v2 -> v3: 1. remove check for bpf_dump_raw_ok(). Changes v1 -> v2: 1. Fix error path as Martin suggested. This patch adds nr_prog_tags and prog_tags to bpf_prog_info. This is a reliable way for user space to get tags of all sub programs. Before this patch, user space need to find sub program tags via kallsyms. This feature will be used in BPF introspection, where user space queries information about BPF programs via sys_bpf. Signed-off-by: Song Liu Acked-by: Martin KaFai Lau Signed-off-by: Daniel Borkmann --- include/uapi/linux/bpf.h | 2 ++ kernel/bpf/syscall.c | 22 ++++++++++++++++++++++ 2 files changed, 24 insertions(+) (limited to 'kernel') diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index aa582cd5bfcf..e7d57e89f25f 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -2717,6 +2717,8 @@ struct bpf_prog_info { __u32 nr_jited_line_info; __u32 line_info_rec_size; __u32 jited_line_info_rec_size; + __u32 nr_prog_tags; + __aligned_u64 prog_tags; } __attribute__((aligned(8))); struct bpf_map_info { diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index b7c585838c72..7f1410d6fbe9 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -2315,6 +2315,28 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog, } } + ulen = info.nr_prog_tags; + info.nr_prog_tags = prog->aux->func_cnt ? : 1; + if (ulen) { + __u8 __user (*user_prog_tags)[BPF_TAG_SIZE]; + u32 i; + + user_prog_tags = u64_to_user_ptr(info.prog_tags); + ulen = min_t(u32, info.nr_prog_tags, ulen); + if (prog->aux->func_cnt) { + for (i = 0; i < ulen; i++) { + if (copy_to_user(user_prog_tags[i], + prog->aux->func[i]->tag, + BPF_TAG_SIZE)) + return -EFAULT; + } + } else { + if (copy_to_user(user_prog_tags[0], + prog->tag, BPF_TAG_SIZE)) + return -EFAULT; + } + } + done: if (copy_to_user(uinfo, &info, info_len) || put_user(info_len, &uattr->info.info_len)) -- cgit v1.3-7-g2ca7 From ba830885656414101b2f8ca88786524d4bb5e8c1 Mon Sep 17 00:00:00 2001 From: Kristina Martsenko Date: Fri, 7 Dec 2018 18:39:28 +0000 Subject: arm64: add prctl control for resetting ptrauth keys Add an arm64-specific prctl to allow a thread to reinitialize its pointer authentication keys to random values. This can be useful when exec() is not used for starting new processes, to ensure that different processes still have different keys. Signed-off-by: Kristina Martsenko Signed-off-by: Will Deacon --- arch/arm64/include/asm/pointer_auth.h | 3 +++ arch/arm64/include/asm/processor.h | 4 +++ arch/arm64/kernel/Makefile | 1 + arch/arm64/kernel/pointer_auth.c | 47 +++++++++++++++++++++++++++++++++++ include/uapi/linux/prctl.h | 8 ++++++ kernel/sys.c | 8 ++++++ 6 files changed, 71 insertions(+) create mode 100644 arch/arm64/kernel/pointer_auth.c (limited to 'kernel') diff --git a/arch/arm64/include/asm/pointer_auth.h b/arch/arm64/include/asm/pointer_auth.h index 5ccf49b4dac3..80eb03afd677 100644 --- a/arch/arm64/include/asm/pointer_auth.h +++ b/arch/arm64/include/asm/pointer_auth.h @@ -63,6 +63,8 @@ static inline void ptrauth_keys_switch(struct ptrauth_keys *keys) __ptrauth_key_install(APGA, keys->apga); } +extern int ptrauth_prctl_reset_keys(struct task_struct *tsk, unsigned long arg); + /* * The EL0 pointer bits used by a pointer authentication code. * This is dependent on TBI0 being enabled, or bits 63:56 would also apply. @@ -86,6 +88,7 @@ do { \ ptrauth_keys_switch(&(tsk)->thread_info.keys_user) #else /* CONFIG_ARM64_PTR_AUTH */ +#define ptrauth_prctl_reset_keys(tsk, arg) (-EINVAL) #define ptrauth_strip_insn_pac(lr) (lr) #define ptrauth_thread_init_user(tsk) #define ptrauth_thread_switch(tsk) diff --git a/arch/arm64/include/asm/processor.h b/arch/arm64/include/asm/processor.h index f4b8e09aff56..142c708cb429 100644 --- a/arch/arm64/include/asm/processor.h +++ b/arch/arm64/include/asm/processor.h @@ -44,6 +44,7 @@ #include #include #include +#include #include #include @@ -289,6 +290,9 @@ extern void __init minsigstksz_setup(void); #define SVE_SET_VL(arg) sve_set_current_vl(arg) #define SVE_GET_VL() sve_get_current_vl() +/* PR_PAC_RESET_KEYS prctl */ +#define PAC_RESET_KEYS(tsk, arg) ptrauth_prctl_reset_keys(tsk, arg) + /* * For CONFIG_GCC_PLUGIN_STACKLEAK * diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile index 583334ce1c2c..df08d735b21d 100644 --- a/arch/arm64/kernel/Makefile +++ b/arch/arm64/kernel/Makefile @@ -58,6 +58,7 @@ arm64-obj-$(CONFIG_CRASH_DUMP) += crash_dump.o arm64-obj-$(CONFIG_CRASH_CORE) += crash_core.o arm64-obj-$(CONFIG_ARM_SDE_INTERFACE) += sdei.o arm64-obj-$(CONFIG_ARM64_SSBD) += ssbd.o +arm64-obj-$(CONFIG_ARM64_PTR_AUTH) += pointer_auth.o obj-y += $(arm64-obj-y) vdso/ probes/ obj-m += $(arm64-obj-m) diff --git a/arch/arm64/kernel/pointer_auth.c b/arch/arm64/kernel/pointer_auth.c new file mode 100644 index 000000000000..b9f6f5f3409a --- /dev/null +++ b/arch/arm64/kernel/pointer_auth.c @@ -0,0 +1,47 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include +#include +#include +#include +#include +#include + +int ptrauth_prctl_reset_keys(struct task_struct *tsk, unsigned long arg) +{ + struct ptrauth_keys *keys = &tsk->thread_info.keys_user; + unsigned long addr_key_mask = PR_PAC_APIAKEY | PR_PAC_APIBKEY | + PR_PAC_APDAKEY | PR_PAC_APDBKEY; + unsigned long key_mask = addr_key_mask | PR_PAC_APGAKEY; + + if (!system_supports_address_auth() && !system_supports_generic_auth()) + return -EINVAL; + + if (!arg) { + ptrauth_keys_init(keys); + ptrauth_keys_switch(keys); + return 0; + } + + if (arg & ~key_mask) + return -EINVAL; + + if (((arg & addr_key_mask) && !system_supports_address_auth()) || + ((arg & PR_PAC_APGAKEY) && !system_supports_generic_auth())) + return -EINVAL; + + if (arg & PR_PAC_APIAKEY) + get_random_bytes(&keys->apia, sizeof(keys->apia)); + if (arg & PR_PAC_APIBKEY) + get_random_bytes(&keys->apib, sizeof(keys->apib)); + if (arg & PR_PAC_APDAKEY) + get_random_bytes(&keys->apda, sizeof(keys->apda)); + if (arg & PR_PAC_APDBKEY) + get_random_bytes(&keys->apdb, sizeof(keys->apdb)); + if (arg & PR_PAC_APGAKEY) + get_random_bytes(&keys->apga, sizeof(keys->apga)); + + ptrauth_keys_switch(keys); + + return 0; +} diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index c0d7ea0bf5b6..0f535a501391 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -219,4 +219,12 @@ struct prctl_mm_map { # define PR_SPEC_DISABLE (1UL << 2) # define PR_SPEC_FORCE_DISABLE (1UL << 3) +/* Reset arm64 pointer authentication keys */ +#define PR_PAC_RESET_KEYS 54 +# define PR_PAC_APIAKEY (1UL << 0) +# define PR_PAC_APIBKEY (1UL << 1) +# define PR_PAC_APDAKEY (1UL << 2) +# define PR_PAC_APDBKEY (1UL << 3) +# define PR_PAC_APGAKEY (1UL << 4) + #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/sys.c b/kernel/sys.c index 123bd73046ec..64b5a230f38d 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -121,6 +121,9 @@ #ifndef SVE_GET_VL # define SVE_GET_VL() (-EINVAL) #endif +#ifndef PAC_RESET_KEYS +# define PAC_RESET_KEYS(a, b) (-EINVAL) +#endif /* * this is where the system-wide overflow UID and GID are defined, for @@ -2476,6 +2479,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, return -EINVAL; error = arch_prctl_spec_ctrl_set(me, arg2, arg3); break; + case PR_PAC_RESET_KEYS: + if (arg3 || arg4 || arg5) + return -EINVAL; + error = PAC_RESET_KEYS(me, arg2); + break; default: error = -EINVAL; break; -- cgit v1.3-7-g2ca7 From 20b105feda8d42360bd690b8fdf6b2f00ba4a993 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Tue, 4 Dec 2018 08:04:31 -0800 Subject: dma-mapping: remove a pointless memset in dma_atomic_pool_init We already zero the memory after allocating it from the pool that this function fills, and having the memset here in this form means we can't support CMA highmem allocations. Signed-off-by: Christoph Hellwig Reported-by: Russell King - ARM Linux --- kernel/dma/remap.c | 1 - 1 file changed, 1 deletion(-) (limited to 'kernel') diff --git a/kernel/dma/remap.c b/kernel/dma/remap.c index 8a44317cfc1a..18cc09fc27b9 100644 --- a/kernel/dma/remap.c +++ b/kernel/dma/remap.c @@ -121,7 +121,6 @@ int __init dma_atomic_pool_init(gfp_t gfp, pgprot_t prot) if (!page) goto out; - memset(page_address(page), 0, atomic_pool_size); arch_dma_prep_coherent(page, atomic_pool_size); atomic_pool = gen_pool_create(PAGE_SHIFT, -1); -- cgit v1.3-7-g2ca7 From 8d59b5f2a44611d7327a2a14b36090d692186f60 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Mon, 3 Dec 2018 14:58:59 +0100 Subject: dma-mapping: simplify the dma_sync_single_range_for_{cpu,device} implementation We can just call the regular calls after adding offset the the address instead of reimplementing them. Signed-off-by: Christoph Hellwig Acked-by: Jesper Dangaard Brouer Tested-by: Jesper Dangaard Brouer Tested-by: Tony Luck --- include/linux/dma-debug.h | 27 --------------------------- include/linux/dma-mapping.h | 34 ++++++++++------------------------ kernel/dma/debug.c | 42 ------------------------------------------ 3 files changed, 10 insertions(+), 93 deletions(-) (limited to 'kernel') diff --git a/include/linux/dma-debug.h b/include/linux/dma-debug.h index 46e6131a72b6..2ad5c363d7d5 100644 --- a/include/linux/dma-debug.h +++ b/include/linux/dma-debug.h @@ -70,17 +70,6 @@ extern void debug_dma_sync_single_for_device(struct device *dev, dma_addr_t dma_handle, size_t size, int direction); -extern void debug_dma_sync_single_range_for_cpu(struct device *dev, - dma_addr_t dma_handle, - unsigned long offset, - size_t size, - int direction); - -extern void debug_dma_sync_single_range_for_device(struct device *dev, - dma_addr_t dma_handle, - unsigned long offset, - size_t size, int direction); - extern void debug_dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sg, int nelems, int direction); @@ -167,22 +156,6 @@ static inline void debug_dma_sync_single_for_device(struct device *dev, { } -static inline void debug_dma_sync_single_range_for_cpu(struct device *dev, - dma_addr_t dma_handle, - unsigned long offset, - size_t size, - int direction) -{ -} - -static inline void debug_dma_sync_single_range_for_device(struct device *dev, - dma_addr_t dma_handle, - unsigned long offset, - size_t size, - int direction) -{ -} - static inline void debug_dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sg, int nelems, int direction) diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h index 7799c2b27849..8916499d2805 100644 --- a/include/linux/dma-mapping.h +++ b/include/linux/dma-mapping.h @@ -360,6 +360,13 @@ static inline void dma_sync_single_for_cpu(struct device *dev, dma_addr_t addr, debug_dma_sync_single_for_cpu(dev, addr, size, dir); } +static inline void dma_sync_single_range_for_cpu(struct device *dev, + dma_addr_t addr, unsigned long offset, size_t size, + enum dma_data_direction dir) +{ + return dma_sync_single_for_cpu(dev, addr + offset, size, dir); +} + static inline void dma_sync_single_for_device(struct device *dev, dma_addr_t addr, size_t size, enum dma_data_direction dir) @@ -372,32 +379,11 @@ static inline void dma_sync_single_for_device(struct device *dev, debug_dma_sync_single_for_device(dev, addr, size, dir); } -static inline void dma_sync_single_range_for_cpu(struct device *dev, - dma_addr_t addr, - unsigned long offset, - size_t size, - enum dma_data_direction dir) -{ - const struct dma_map_ops *ops = get_dma_ops(dev); - - BUG_ON(!valid_dma_direction(dir)); - if (ops->sync_single_for_cpu) - ops->sync_single_for_cpu(dev, addr + offset, size, dir); - debug_dma_sync_single_range_for_cpu(dev, addr, offset, size, dir); -} - static inline void dma_sync_single_range_for_device(struct device *dev, - dma_addr_t addr, - unsigned long offset, - size_t size, - enum dma_data_direction dir) + dma_addr_t addr, unsigned long offset, size_t size, + enum dma_data_direction dir) { - const struct dma_map_ops *ops = get_dma_ops(dev); - - BUG_ON(!valid_dma_direction(dir)); - if (ops->sync_single_for_device) - ops->sync_single_for_device(dev, addr + offset, size, dir); - debug_dma_sync_single_range_for_device(dev, addr, offset, size, dir); + return dma_sync_single_for_device(dev, addr + offset, size, dir); } static inline void diff --git a/kernel/dma/debug.c b/kernel/dma/debug.c index 20ab0f6c1b70..164706da2a73 100644 --- a/kernel/dma/debug.c +++ b/kernel/dma/debug.c @@ -1633,48 +1633,6 @@ void debug_dma_sync_single_for_device(struct device *dev, } EXPORT_SYMBOL(debug_dma_sync_single_for_device); -void debug_dma_sync_single_range_for_cpu(struct device *dev, - dma_addr_t dma_handle, - unsigned long offset, size_t size, - int direction) -{ - struct dma_debug_entry ref; - - if (unlikely(dma_debug_disabled())) - return; - - ref.type = dma_debug_single; - ref.dev = dev; - ref.dev_addr = dma_handle; - ref.size = offset + size; - ref.direction = direction; - ref.sg_call_ents = 0; - - check_sync(dev, &ref, true); -} -EXPORT_SYMBOL(debug_dma_sync_single_range_for_cpu); - -void debug_dma_sync_single_range_for_device(struct device *dev, - dma_addr_t dma_handle, - unsigned long offset, - size_t size, int direction) -{ - struct dma_debug_entry ref; - - if (unlikely(dma_debug_disabled())) - return; - - ref.type = dma_debug_single; - ref.dev = dev; - ref.dev_addr = dma_handle; - ref.size = offset + size; - ref.direction = direction; - ref.sg_call_ents = 0; - - check_sync(dev, &ref, false); -} -EXPORT_SYMBOL(debug_dma_sync_single_range_for_device); - void debug_dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sg, int nelems, int direction) { -- cgit v1.3-7-g2ca7 From 05887cb610a54bf568de7f0bc07c4a64e45ac6f9 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Thu, 6 Dec 2018 12:25:54 -0800 Subject: dma-mapping: move dma_get_required_mask to kernel/dma dma_get_required_mask should really be with the rest of the DMA mapping implementation instead of in drivers/base as a lone outlier. Signed-off-by: Christoph Hellwig Acked-by: Jesper Dangaard Brouer Tested-by: Jesper Dangaard Brouer Tested-by: Tony Luck --- drivers/base/platform.c | 31 ------------------------------- kernel/dma/mapping.c | 34 +++++++++++++++++++++++++++++++++- 2 files changed, 33 insertions(+), 32 deletions(-) (limited to 'kernel') diff --git a/drivers/base/platform.c b/drivers/base/platform.c index 41b91af95afb..eae841935a45 100644 --- a/drivers/base/platform.c +++ b/drivers/base/platform.c @@ -1179,37 +1179,6 @@ int __init platform_bus_init(void) return error; } -#ifndef ARCH_HAS_DMA_GET_REQUIRED_MASK -static u64 dma_default_get_required_mask(struct device *dev) -{ - u32 low_totalram = ((max_pfn - 1) << PAGE_SHIFT); - u32 high_totalram = ((max_pfn - 1) >> (32 - PAGE_SHIFT)); - u64 mask; - - if (!high_totalram) { - /* convert to mask just covering totalram */ - low_totalram = (1 << (fls(low_totalram) - 1)); - low_totalram += low_totalram - 1; - mask = low_totalram; - } else { - high_totalram = (1 << (fls(high_totalram) - 1)); - high_totalram += high_totalram - 1; - mask = (((u64)high_totalram) << 32) + 0xffffffff; - } - return mask; -} - -u64 dma_get_required_mask(struct device *dev) -{ - const struct dma_map_ops *ops = get_dma_ops(dev); - - if (ops->get_required_mask) - return ops->get_required_mask(dev); - return dma_default_get_required_mask(dev); -} -EXPORT_SYMBOL_GPL(dma_get_required_mask); -#endif - static __initdata LIST_HEAD(early_platform_driver_list); static __initdata LIST_HEAD(early_platform_device_list); diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c index dfbc3deb95cd..dfe29d18dba1 100644 --- a/kernel/dma/mapping.c +++ b/kernel/dma/mapping.c @@ -5,7 +5,7 @@ * Copyright (c) 2006 SUSE Linux Products GmbH * Copyright (c) 2006 Tejun Heo */ - +#include /* for max_pfn */ #include #include #include @@ -262,3 +262,35 @@ int dma_common_mmap(struct device *dev, struct vm_area_struct *vma, #endif /* !CONFIG_ARCH_NO_COHERENT_DMA_MMAP */ } EXPORT_SYMBOL(dma_common_mmap); + +#ifndef ARCH_HAS_DMA_GET_REQUIRED_MASK +static u64 dma_default_get_required_mask(struct device *dev) +{ + u32 low_totalram = ((max_pfn - 1) << PAGE_SHIFT); + u32 high_totalram = ((max_pfn - 1) >> (32 - PAGE_SHIFT)); + u64 mask; + + if (!high_totalram) { + /* convert to mask just covering totalram */ + low_totalram = (1 << (fls(low_totalram) - 1)); + low_totalram += low_totalram - 1; + mask = low_totalram; + } else { + high_totalram = (1 << (fls(high_totalram) - 1)); + high_totalram += high_totalram - 1; + mask = (((u64)high_totalram) << 32) + 0xffffffff; + } + return mask; +} + +u64 dma_get_required_mask(struct device *dev) +{ + const struct dma_map_ops *ops = get_dma_ops(dev); + + if (ops->get_required_mask) + return ops->get_required_mask(dev); + return dma_default_get_required_mask(dev); +} +EXPORT_SYMBOL_GPL(dma_get_required_mask); +#endif + -- cgit v1.3-7-g2ca7 From 7249c1a52df9967cd23550f3dc24fb6ca43cdc6a Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Thu, 6 Dec 2018 12:43:30 -0800 Subject: dma-mapping: move various slow path functions out of line There is no need to have all setup and coherent allocation / freeing routines inline. Move them out of line to keep the implemeation nicely encapsulated and save some kernel text size. Signed-off-by: Christoph Hellwig Acked-by: Jesper Dangaard Brouer Tested-by: Jesper Dangaard Brouer Tested-by: Tony Luck --- arch/powerpc/include/asm/dma-mapping.h | 1 - include/linux/dma-mapping.h | 150 +++------------------------------ kernel/dma/mapping.c | 140 +++++++++++++++++++++++++++++- 3 files changed, 151 insertions(+), 140 deletions(-) (limited to 'kernel') diff --git a/arch/powerpc/include/asm/dma-mapping.h b/arch/powerpc/include/asm/dma-mapping.h index 8fa394520af6..5201f2b7838c 100644 --- a/arch/powerpc/include/asm/dma-mapping.h +++ b/arch/powerpc/include/asm/dma-mapping.h @@ -108,7 +108,6 @@ static inline void set_dma_offset(struct device *dev, dma_addr_t off) } #define HAVE_ARCH_DMA_SET_MASK 1 -extern int dma_set_mask(struct device *dev, u64 dma_mask); extern u64 __dma_get_required_mask(struct device *dev); diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h index 3b431cc58794..0bbce52606c2 100644 --- a/include/linux/dma-mapping.h +++ b/include/linux/dma-mapping.h @@ -440,107 +440,24 @@ bool dma_in_atomic_pool(void *start, size_t size); void *dma_alloc_from_pool(size_t size, struct page **ret_page, gfp_t flags); bool dma_free_from_pool(void *start, size_t size); -/** - * dma_mmap_attrs - map a coherent DMA allocation into user space - * @dev: valid struct device pointer, or NULL for ISA and EISA-like devices - * @vma: vm_area_struct describing requested user mapping - * @cpu_addr: kernel CPU-view address returned from dma_alloc_attrs - * @handle: device-view address returned from dma_alloc_attrs - * @size: size of memory originally requested in dma_alloc_attrs - * @attrs: attributes of mapping properties requested in dma_alloc_attrs - * - * Map a coherent DMA buffer previously allocated by dma_alloc_attrs - * into user space. The coherent DMA buffer must not be freed by the - * driver until the user space mapping has been released. - */ -static inline int -dma_mmap_attrs(struct device *dev, struct vm_area_struct *vma, void *cpu_addr, - dma_addr_t dma_addr, size_t size, unsigned long attrs) -{ - const struct dma_map_ops *ops = get_dma_ops(dev); - BUG_ON(!ops); - if (ops->mmap) - return ops->mmap(dev, vma, cpu_addr, dma_addr, size, attrs); - return dma_common_mmap(dev, vma, cpu_addr, dma_addr, size, attrs); -} - +int dma_mmap_attrs(struct device *dev, struct vm_area_struct *vma, + void *cpu_addr, dma_addr_t dma_addr, size_t size, + unsigned long attrs); #define dma_mmap_coherent(d, v, c, h, s) dma_mmap_attrs(d, v, c, h, s, 0) int dma_common_get_sgtable(struct device *dev, struct sg_table *sgt, void *cpu_addr, dma_addr_t dma_addr, size_t size, unsigned long attrs); -static inline int -dma_get_sgtable_attrs(struct device *dev, struct sg_table *sgt, void *cpu_addr, - dma_addr_t dma_addr, size_t size, - unsigned long attrs) -{ - const struct dma_map_ops *ops = get_dma_ops(dev); - BUG_ON(!ops); - if (ops->get_sgtable) - return ops->get_sgtable(dev, sgt, cpu_addr, dma_addr, size, - attrs); - return dma_common_get_sgtable(dev, sgt, cpu_addr, dma_addr, size, - attrs); -} - +int dma_get_sgtable_attrs(struct device *dev, struct sg_table *sgt, + void *cpu_addr, dma_addr_t dma_addr, size_t size, + unsigned long attrs); #define dma_get_sgtable(d, t, v, h, s) dma_get_sgtable_attrs(d, t, v, h, s, 0) -#ifndef arch_dma_alloc_attrs -#define arch_dma_alloc_attrs(dev) (true) -#endif - -static inline void *dma_alloc_attrs(struct device *dev, size_t size, - dma_addr_t *dma_handle, gfp_t flag, - unsigned long attrs) -{ - const struct dma_map_ops *ops = get_dma_ops(dev); - void *cpu_addr; - - BUG_ON(!ops); - WARN_ON_ONCE(dev && !dev->coherent_dma_mask); - - if (dma_alloc_from_dev_coherent(dev, size, dma_handle, &cpu_addr)) - return cpu_addr; - - /* let the implementation decide on the zone to allocate from: */ - flag &= ~(__GFP_DMA | __GFP_DMA32 | __GFP_HIGHMEM); - - if (!arch_dma_alloc_attrs(&dev)) - return NULL; - if (!ops->alloc) - return NULL; - - cpu_addr = ops->alloc(dev, size, dma_handle, flag, attrs); - debug_dma_alloc_coherent(dev, size, *dma_handle, cpu_addr); - return cpu_addr; -} - -static inline void dma_free_attrs(struct device *dev, size_t size, - void *cpu_addr, dma_addr_t dma_handle, - unsigned long attrs) -{ - const struct dma_map_ops *ops = get_dma_ops(dev); - - BUG_ON(!ops); - - if (dma_release_from_dev_coherent(dev, get_order(size), cpu_addr)) - return; - /* - * On non-coherent platforms which implement DMA-coherent buffers via - * non-cacheable remaps, ops->free() may call vunmap(). Thus getting - * this far in IRQ context is a) at risk of a BUG_ON() or trying to - * sleep on some machines, and b) an indication that the driver is - * probably misusing the coherent API anyway. - */ - WARN_ON(irqs_disabled()); - - if (!ops->free || !cpu_addr) - return; - - debug_dma_free_coherent(dev, size, cpu_addr, dma_handle); - ops->free(dev, size, cpu_addr, dma_handle, attrs); -} +void *dma_alloc_attrs(struct device *dev, size_t size, dma_addr_t *dma_handle, + gfp_t flag, unsigned long attrs); +void dma_free_attrs(struct device *dev, size_t size, void *cpu_addr, + dma_addr_t dma_handle, unsigned long attrs); static inline void *dma_alloc_coherent(struct device *dev, size_t size, dma_addr_t *dma_handle, gfp_t gfp) @@ -565,35 +482,9 @@ static inline int dma_mapping_error(struct device *dev, dma_addr_t dma_addr) return 0; } -static inline void dma_check_mask(struct device *dev, u64 mask) -{ - if (sme_active() && (mask < (((u64)sme_get_me_mask() << 1) - 1))) - dev_warn(dev, "SME is active, device will require DMA bounce buffers\n"); -} - -static inline int dma_supported(struct device *dev, u64 mask) -{ - const struct dma_map_ops *ops = get_dma_ops(dev); - - if (!ops) - return 0; - if (!ops->dma_supported) - return 1; - return ops->dma_supported(dev, mask); -} - -#ifndef HAVE_ARCH_DMA_SET_MASK -static inline int dma_set_mask(struct device *dev, u64 mask) -{ - if (!dev->dma_mask || !dma_supported(dev, mask)) - return -EIO; - - dma_check_mask(dev, mask); - - *dev->dma_mask = mask; - return 0; -} -#endif +int dma_supported(struct device *dev, u64 mask); +int dma_set_mask(struct device *dev, u64 mask); +int dma_set_coherent_mask(struct device *dev, u64 mask); static inline u64 dma_get_mask(struct device *dev) { @@ -602,21 +493,6 @@ static inline u64 dma_get_mask(struct device *dev) return DMA_BIT_MASK(32); } -#ifdef CONFIG_ARCH_HAS_DMA_SET_COHERENT_MASK -int dma_set_coherent_mask(struct device *dev, u64 mask); -#else -static inline int dma_set_coherent_mask(struct device *dev, u64 mask) -{ - if (!dma_supported(dev, mask)) - return -EIO; - - dma_check_mask(dev, mask); - - dev->coherent_dma_mask = mask; - return 0; -} -#endif - /* * Set both the DMA mask and the coherent DMA mask to the same thing. * Note that we don't check the return value from dma_set_coherent_mask() diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c index dfe29d18dba1..176ae3e08916 100644 --- a/kernel/dma/mapping.c +++ b/kernel/dma/mapping.c @@ -223,7 +223,20 @@ int dma_common_get_sgtable(struct device *dev, struct sg_table *sgt, sg_set_page(sgt->sgl, page, PAGE_ALIGN(size), 0); return ret; } -EXPORT_SYMBOL(dma_common_get_sgtable); + +int dma_get_sgtable_attrs(struct device *dev, struct sg_table *sgt, + void *cpu_addr, dma_addr_t dma_addr, size_t size, + unsigned long attrs) +{ + const struct dma_map_ops *ops = get_dma_ops(dev); + BUG_ON(!ops); + if (ops->get_sgtable) + return ops->get_sgtable(dev, sgt, cpu_addr, dma_addr, size, + attrs); + return dma_common_get_sgtable(dev, sgt, cpu_addr, dma_addr, size, + attrs); +} +EXPORT_SYMBOL(dma_get_sgtable_attrs); /* * Create userspace mapping for the DMA-coherent memory. @@ -261,7 +274,31 @@ int dma_common_mmap(struct device *dev, struct vm_area_struct *vma, return -ENXIO; #endif /* !CONFIG_ARCH_NO_COHERENT_DMA_MMAP */ } -EXPORT_SYMBOL(dma_common_mmap); + +/** + * dma_mmap_attrs - map a coherent DMA allocation into user space + * @dev: valid struct device pointer, or NULL for ISA and EISA-like devices + * @vma: vm_area_struct describing requested user mapping + * @cpu_addr: kernel CPU-view address returned from dma_alloc_attrs + * @dma_addr: device-view address returned from dma_alloc_attrs + * @size: size of memory originally requested in dma_alloc_attrs + * @attrs: attributes of mapping properties requested in dma_alloc_attrs + * + * Map a coherent DMA buffer previously allocated by dma_alloc_attrs into user + * space. The coherent DMA buffer must not be freed by the driver until the + * user space mapping has been released. + */ +int dma_mmap_attrs(struct device *dev, struct vm_area_struct *vma, + void *cpu_addr, dma_addr_t dma_addr, size_t size, + unsigned long attrs) +{ + const struct dma_map_ops *ops = get_dma_ops(dev); + BUG_ON(!ops); + if (ops->mmap) + return ops->mmap(dev, vma, cpu_addr, dma_addr, size, attrs); + return dma_common_mmap(dev, vma, cpu_addr, dma_addr, size, attrs); +} +EXPORT_SYMBOL(dma_mmap_attrs); #ifndef ARCH_HAS_DMA_GET_REQUIRED_MASK static u64 dma_default_get_required_mask(struct device *dev) @@ -294,3 +331,102 @@ u64 dma_get_required_mask(struct device *dev) EXPORT_SYMBOL_GPL(dma_get_required_mask); #endif +#ifndef arch_dma_alloc_attrs +#define arch_dma_alloc_attrs(dev) (true) +#endif + +void *dma_alloc_attrs(struct device *dev, size_t size, dma_addr_t *dma_handle, + gfp_t flag, unsigned long attrs) +{ + const struct dma_map_ops *ops = get_dma_ops(dev); + void *cpu_addr; + + BUG_ON(!ops); + WARN_ON_ONCE(dev && !dev->coherent_dma_mask); + + if (dma_alloc_from_dev_coherent(dev, size, dma_handle, &cpu_addr)) + return cpu_addr; + + /* let the implementation decide on the zone to allocate from: */ + flag &= ~(__GFP_DMA | __GFP_DMA32 | __GFP_HIGHMEM); + + if (!arch_dma_alloc_attrs(&dev)) + return NULL; + if (!ops->alloc) + return NULL; + + cpu_addr = ops->alloc(dev, size, dma_handle, flag, attrs); + debug_dma_alloc_coherent(dev, size, *dma_handle, cpu_addr); + return cpu_addr; +} +EXPORT_SYMBOL(dma_alloc_attrs); + +void dma_free_attrs(struct device *dev, size_t size, void *cpu_addr, + dma_addr_t dma_handle, unsigned long attrs) +{ + const struct dma_map_ops *ops = get_dma_ops(dev); + + BUG_ON(!ops); + + if (dma_release_from_dev_coherent(dev, get_order(size), cpu_addr)) + return; + /* + * On non-coherent platforms which implement DMA-coherent buffers via + * non-cacheable remaps, ops->free() may call vunmap(). Thus getting + * this far in IRQ context is a) at risk of a BUG_ON() or trying to + * sleep on some machines, and b) an indication that the driver is + * probably misusing the coherent API anyway. + */ + WARN_ON(irqs_disabled()); + + if (!ops->free || !cpu_addr) + return; + + debug_dma_free_coherent(dev, size, cpu_addr, dma_handle); + ops->free(dev, size, cpu_addr, dma_handle, attrs); +} +EXPORT_SYMBOL(dma_free_attrs); + +static inline void dma_check_mask(struct device *dev, u64 mask) +{ + if (sme_active() && (mask < (((u64)sme_get_me_mask() << 1) - 1))) + dev_warn(dev, "SME is active, device will require DMA bounce buffers\n"); +} + +int dma_supported(struct device *dev, u64 mask) +{ + const struct dma_map_ops *ops = get_dma_ops(dev); + + if (!ops) + return 0; + if (!ops->dma_supported) + return 1; + return ops->dma_supported(dev, mask); +} +EXPORT_SYMBOL(dma_supported); + +#ifndef HAVE_ARCH_DMA_SET_MASK +int dma_set_mask(struct device *dev, u64 mask) +{ + if (!dev->dma_mask || !dma_supported(dev, mask)) + return -EIO; + + dma_check_mask(dev, mask); + *dev->dma_mask = mask; + return 0; +} +EXPORT_SYMBOL(dma_set_mask); +#endif + +#ifndef CONFIG_ARCH_HAS_DMA_SET_COHERENT_MASK +int dma_set_coherent_mask(struct device *dev, u64 mask) +{ + if (!dma_supported(dev, mask)) + return -EIO; + + dma_check_mask(dev, mask); + dev->coherent_dma_mask = mask; + return 0; +} +EXPORT_SYMBOL(dma_set_coherent_mask); +#endif -- cgit v1.3-7-g2ca7 From 8ddbe5943c0b1259b5ddb6dc1729863433fc256c Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Thu, 6 Dec 2018 12:47:50 -0800 Subject: dma-mapping: move dma_cache_sync out of line This isn't exactly a slow path routine, but it is not super critical either, and moving it out of line will help to keep the include chain clean for the following DMA indirection bypass work. Signed-off-by: Christoph Hellwig Acked-by: Jesper Dangaard Brouer Tested-by: Jesper Dangaard Brouer Tested-by: Tony Luck --- include/linux/dma-mapping.h | 12 ++---------- kernel/dma/mapping.c | 11 +++++++++++ 2 files changed, 13 insertions(+), 10 deletions(-) (limited to 'kernel') diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h index 0bbce52606c2..0f0078490df4 100644 --- a/include/linux/dma-mapping.h +++ b/include/linux/dma-mapping.h @@ -411,16 +411,8 @@ dma_sync_sg_for_device(struct device *dev, struct scatterlist *sg, #define dma_map_page(d, p, o, s, r) dma_map_page_attrs(d, p, o, s, r, 0) #define dma_unmap_page(d, a, s, r) dma_unmap_page_attrs(d, a, s, r, 0) -static inline void -dma_cache_sync(struct device *dev, void *vaddr, size_t size, - enum dma_data_direction dir) -{ - const struct dma_map_ops *ops = get_dma_ops(dev); - - BUG_ON(!valid_dma_direction(dir)); - if (ops->cache_sync) - ops->cache_sync(dev, vaddr, size, dir); -} +void dma_cache_sync(struct device *dev, void *vaddr, size_t size, + enum dma_data_direction dir); extern int dma_common_mmap(struct device *dev, struct vm_area_struct *vma, void *cpu_addr, dma_addr_t dma_addr, size_t size, diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c index 176ae3e08916..0b18cfbdde95 100644 --- a/kernel/dma/mapping.c +++ b/kernel/dma/mapping.c @@ -430,3 +430,14 @@ int dma_set_coherent_mask(struct device *dev, u64 mask) } EXPORT_SYMBOL(dma_set_coherent_mask); #endif + +void dma_cache_sync(struct device *dev, void *vaddr, size_t size, + enum dma_data_direction dir) +{ + const struct dma_map_ops *ops = get_dma_ops(dev); + + BUG_ON(!valid_dma_direction(dir)); + if (ops->cache_sync) + ops->cache_sync(dev, vaddr, size, dir); +} +EXPORT_SYMBOL(dma_cache_sync); -- cgit v1.3-7-g2ca7 From 3731c3d4774e38b9d91c01943e1e6a243c1776be Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Thu, 6 Dec 2018 12:50:26 -0800 Subject: dma-mapping: always build the direct mapping code All architectures except for sparc64 use the dma-direct code in some form, and even for sparc64 we had the discussion of a direct mapping mode a while ago. In preparation for directly calling the direct mapping code don't bother having it optionally but always build the code in. This is a minor hardship for some powerpc and arm configs that don't pull it in yet (although they should in a relase ot two), and sparc64 which currently doesn't need it at all, but it will reduce the ifdef mess we'd otherwise need significantly. Signed-off-by: Christoph Hellwig Acked-by: Jesper Dangaard Brouer Tested-by: Jesper Dangaard Brouer Tested-by: Tony Luck --- arch/alpha/Kconfig | 1 - arch/arc/Kconfig | 1 - arch/arm/Kconfig | 1 - arch/arm64/Kconfig | 1 - arch/c6x/Kconfig | 1 - arch/csky/Kconfig | 1 - arch/h8300/Kconfig | 1 - arch/hexagon/Kconfig | 1 - arch/m68k/Kconfig | 1 - arch/microblaze/Kconfig | 1 - arch/mips/Kconfig | 1 - arch/nds32/Kconfig | 1 - arch/nios2/Kconfig | 1 - arch/openrisc/Kconfig | 1 - arch/parisc/Kconfig | 1 - arch/riscv/Kconfig | 1 - arch/s390/Kconfig | 1 - arch/sh/Kconfig | 1 - arch/sparc/Kconfig | 1 - arch/unicore32/Kconfig | 1 - arch/x86/Kconfig | 1 - arch/xtensa/Kconfig | 1 - kernel/dma/Kconfig | 7 ------- kernel/dma/Makefile | 3 +-- 24 files changed, 1 insertion(+), 31 deletions(-) (limited to 'kernel') diff --git a/arch/alpha/Kconfig b/arch/alpha/Kconfig index a7e748a46c18..5da6ff54b3e7 100644 --- a/arch/alpha/Kconfig +++ b/arch/alpha/Kconfig @@ -203,7 +203,6 @@ config ALPHA_EIGER config ALPHA_JENSEN bool "Jensen" depends on BROKEN - select DMA_DIRECT_OPS help DEC PC 150 AXP (aka Jensen): This is a very old Digital system - one of the first-generation Alpha systems. A number of these systems diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig index fd48d698da29..7deaabeb531a 100644 --- a/arch/arc/Kconfig +++ b/arch/arc/Kconfig @@ -17,7 +17,6 @@ config ARC select BUILDTIME_EXTABLE_SORT select CLONE_BACKWARDS select COMMON_CLK - select DMA_DIRECT_OPS select GENERIC_ATOMIC64 if !ISA_ARCV2 || !(ARC_HAS_LL64 && ARC_HAS_LLSC) select GENERIC_CLOCKEVENTS select GENERIC_FIND_FIRST_BIT diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index a858ee791ef0..586fc30b23bd 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -30,7 +30,6 @@ config ARM select CLONE_BACKWARDS select CPU_PM if (SUSPEND || CPU_IDLE) select DCACHE_WORD_ACCESS if HAVE_EFFICIENT_UNALIGNED_ACCESS - select DMA_DIRECT_OPS if !MMU select DMA_REMAP if MMU select EDAC_SUPPORT select EDAC_ATOMIC_SCRUB diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 06cf0ef24367..2092080240b0 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -80,7 +80,6 @@ config ARM64 select CPU_PM if (SUSPEND || CPU_IDLE) select CRC32 select DCACHE_WORD_ACCESS - select DMA_DIRECT_OPS select DMA_DIRECT_REMAP select EDAC_SUPPORT select FRAME_POINTER diff --git a/arch/c6x/Kconfig b/arch/c6x/Kconfig index 84420109113d..456e154674d1 100644 --- a/arch/c6x/Kconfig +++ b/arch/c6x/Kconfig @@ -9,7 +9,6 @@ config C6X select ARCH_HAS_SYNC_DMA_FOR_CPU select ARCH_HAS_SYNC_DMA_FOR_DEVICE select CLKDEV_LOOKUP - select DMA_DIRECT_OPS select GENERIC_ATOMIC64 select GENERIC_IRQ_SHOW select HAVE_ARCH_TRACEHOOK diff --git a/arch/csky/Kconfig b/arch/csky/Kconfig index ea74f3a9eeaf..37bed8aadf95 100644 --- a/arch/csky/Kconfig +++ b/arch/csky/Kconfig @@ -7,7 +7,6 @@ config CSKY select COMMON_CLK select CLKSRC_MMIO select CLKSRC_OF - select DMA_DIRECT_OPS select DMA_DIRECT_REMAP select IRQ_DOMAIN select HANDLE_DOMAIN_IRQ diff --git a/arch/h8300/Kconfig b/arch/h8300/Kconfig index d19c6b16cd5d..6472a0685470 100644 --- a/arch/h8300/Kconfig +++ b/arch/h8300/Kconfig @@ -22,7 +22,6 @@ config H8300 select HAVE_ARCH_KGDB select HAVE_ARCH_HASH select CPU_NO_EFFICIENT_FFS - select DMA_DIRECT_OPS config CPU_BIG_ENDIAN def_bool y diff --git a/arch/hexagon/Kconfig b/arch/hexagon/Kconfig index 2b688af379e6..d71036c598de 100644 --- a/arch/hexagon/Kconfig +++ b/arch/hexagon/Kconfig @@ -31,7 +31,6 @@ config HEXAGON select GENERIC_CLOCKEVENTS_BROADCAST select MODULES_USE_ELF_RELA select GENERIC_CPU_DEVICES - select DMA_DIRECT_OPS ---help--- Qualcomm Hexagon is a processor architecture designed for high performance and low power across a wide variety of applications. diff --git a/arch/m68k/Kconfig b/arch/m68k/Kconfig index 1bc9f1ba759a..8a5868e9a3a0 100644 --- a/arch/m68k/Kconfig +++ b/arch/m68k/Kconfig @@ -26,7 +26,6 @@ config M68K select MODULES_USE_ELF_RELA select OLD_SIGSUSPEND3 select OLD_SIGACTION - select DMA_DIRECT_OPS if HAS_DMA select ARCH_DISCARD_MEMBLOCK config CPU_BIG_ENDIAN diff --git a/arch/microblaze/Kconfig b/arch/microblaze/Kconfig index effed2efd306..eda9e2315ef5 100644 --- a/arch/microblaze/Kconfig +++ b/arch/microblaze/Kconfig @@ -12,7 +12,6 @@ config MICROBLAZE select TIMER_OF select CLONE_BACKWARDS3 select COMMON_CLK - select DMA_DIRECT_OPS select GENERIC_ATOMIC64 select GENERIC_CLOCKEVENTS select GENERIC_CPU_DEVICES diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig index 8272ea4c7264..2993aa9842c0 100644 --- a/arch/mips/Kconfig +++ b/arch/mips/Kconfig @@ -16,7 +16,6 @@ config MIPS select BUILDTIME_EXTABLE_SORT select CLONE_BACKWARDS select CPU_PM if CPU_IDLE - select DMA_DIRECT_OPS select GENERIC_ATOMIC64 if !64BIT select GENERIC_CLOCKEVENTS select GENERIC_CMOS_UPDATE diff --git a/arch/nds32/Kconfig b/arch/nds32/Kconfig index 7a04adacb2f0..1af6bbae7220 100644 --- a/arch/nds32/Kconfig +++ b/arch/nds32/Kconfig @@ -11,7 +11,6 @@ config NDS32 select CLKSRC_MMIO select CLONE_BACKWARDS select COMMON_CLK - select DMA_DIRECT_OPS select GENERIC_ATOMIC64 select GENERIC_CPU_DEVICES select GENERIC_CLOCKEVENTS diff --git a/arch/nios2/Kconfig b/arch/nios2/Kconfig index 7e95506e957a..f6c4b0f49997 100644 --- a/arch/nios2/Kconfig +++ b/arch/nios2/Kconfig @@ -4,7 +4,6 @@ config NIOS2 select ARCH_HAS_SYNC_DMA_FOR_CPU select ARCH_HAS_SYNC_DMA_FOR_DEVICE select ARCH_NO_SWAP - select DMA_DIRECT_OPS select TIMER_OF select GENERIC_ATOMIC64 select GENERIC_CLOCKEVENTS diff --git a/arch/openrisc/Kconfig b/arch/openrisc/Kconfig index 285f7d05c8ed..d0feebad5a8f 100644 --- a/arch/openrisc/Kconfig +++ b/arch/openrisc/Kconfig @@ -7,7 +7,6 @@ config OPENRISC def_bool y select ARCH_HAS_SYNC_DMA_FOR_DEVICE - select DMA_DIRECT_OPS select OF select OF_EARLY_FLATTREE select IRQ_DOMAIN diff --git a/arch/parisc/Kconfig b/arch/parisc/Kconfig index 428ee50fc3db..6e1b71da0e71 100644 --- a/arch/parisc/Kconfig +++ b/arch/parisc/Kconfig @@ -185,7 +185,6 @@ config PA11 depends on PA7000 || PA7100LC || PA7200 || PA7300LC select ARCH_HAS_SYNC_DMA_FOR_CPU select ARCH_HAS_SYNC_DMA_FOR_DEVICE - select DMA_DIRECT_OPS select DMA_NONCOHERENT_CACHE_SYNC config PREFETCH diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig index 55da93f4e818..51d89c4b1dca 100644 --- a/arch/riscv/Kconfig +++ b/arch/riscv/Kconfig @@ -19,7 +19,6 @@ config RISCV select ARCH_WANT_FRAME_POINTERS select CLONE_BACKWARDS select COMMON_CLK - select DMA_DIRECT_OPS select GENERIC_CLOCKEVENTS select GENERIC_CPU_DEVICES select GENERIC_IRQ_SHOW diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig index 5624e8607054..21d271d04ca6 100644 --- a/arch/s390/Kconfig +++ b/arch/s390/Kconfig @@ -139,7 +139,6 @@ config S390 select HAVE_COPY_THREAD_TLS select HAVE_DEBUG_KMEMLEAK select HAVE_DMA_CONTIGUOUS - select DMA_DIRECT_OPS select HAVE_DYNAMIC_FTRACE select HAVE_DYNAMIC_FTRACE_WITH_REGS select HAVE_EFFICIENT_UNALIGNED_ACCESS diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig index f82a4da7adf3..10fd4e9c454b 100644 --- a/arch/sh/Kconfig +++ b/arch/sh/Kconfig @@ -7,7 +7,6 @@ config SUPERH select ARCH_NO_COHERENT_DMA_MMAP if !MMU select HAVE_PATA_PLATFORM select CLKDEV_LOOKUP - select DMA_DIRECT_OPS select HAVE_IDE if HAS_IOPORT_MAP select HAVE_MEMBLOCK_NODE_MAP select ARCH_DISCARD_MEMBLOCK diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig index 8853b6ceae17..f5bb9ded1d18 100644 --- a/arch/sparc/Kconfig +++ b/arch/sparc/Kconfig @@ -48,7 +48,6 @@ config SPARC config SPARC32 def_bool !64BIT select ARCH_HAS_SYNC_DMA_FOR_CPU - select DMA_DIRECT_OPS select GENERIC_ATOMIC64 select CLZ_TAB select HAVE_UID16 diff --git a/arch/unicore32/Kconfig b/arch/unicore32/Kconfig index a4c05159dca5..2681027d7bff 100644 --- a/arch/unicore32/Kconfig +++ b/arch/unicore32/Kconfig @@ -4,7 +4,6 @@ config UNICORE32 select ARCH_HAS_DEVMEM_IS_ALLOWED select ARCH_MIGHT_HAVE_PC_PARPORT select ARCH_MIGHT_HAVE_PC_SERIO - select DMA_DIRECT_OPS select HAVE_GENERIC_DMA_COHERENT select HAVE_KERNEL_GZIP select HAVE_KERNEL_BZIP2 diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index adc845b66f01..c14d4a35be13 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -89,7 +89,6 @@ config X86 select CLOCKSOURCE_VALIDATE_LAST_CYCLE select CLOCKSOURCE_WATCHDOG select DCACHE_WORD_ACCESS - select DMA_DIRECT_OPS select EDAC_ATOMIC_SCRUB select EDAC_SUPPORT select GENERIC_CLOCKEVENTS diff --git a/arch/xtensa/Kconfig b/arch/xtensa/Kconfig index 75488b606edc..36338e7564a3 100644 --- a/arch/xtensa/Kconfig +++ b/arch/xtensa/Kconfig @@ -9,7 +9,6 @@ config XTENSA select BUILDTIME_EXTABLE_SORT select CLONE_BACKWARDS select COMMON_CLK - select DMA_DIRECT_OPS select DMA_REMAP if MMU select GENERIC_ATOMIC64 select GENERIC_CLOCKEVENTS diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig index 41c3b1df70eb..ca88b867e7fe 100644 --- a/kernel/dma/Kconfig +++ b/kernel/dma/Kconfig @@ -35,13 +35,8 @@ config ARCH_HAS_DMA_COHERENT_TO_PFN config ARCH_HAS_DMA_MMAP_PGPROT bool -config DMA_DIRECT_OPS - bool - depends on HAS_DMA - config DMA_NONCOHERENT_CACHE_SYNC bool - depends on DMA_DIRECT_OPS config DMA_VIRT_OPS bool @@ -49,7 +44,6 @@ config DMA_VIRT_OPS config SWIOTLB bool - select DMA_DIRECT_OPS select NEED_DMA_MAP_STATE config DMA_REMAP @@ -58,5 +52,4 @@ config DMA_REMAP config DMA_DIRECT_REMAP bool - depends on DMA_DIRECT_OPS select DMA_REMAP diff --git a/kernel/dma/Makefile b/kernel/dma/Makefile index f4feeceb8020..a626f643cd63 100644 --- a/kernel/dma/Makefile +++ b/kernel/dma/Makefile @@ -1,9 +1,8 @@ # SPDX-License-Identifier: GPL-2.0 -obj-$(CONFIG_HAS_DMA) += mapping.o +obj-$(CONFIG_HAS_DMA) += mapping.o direct.o obj-$(CONFIG_DMA_CMA) += contiguous.o obj-$(CONFIG_HAVE_GENERIC_DMA_COHERENT) += coherent.o -obj-$(CONFIG_DMA_DIRECT_OPS) += direct.o obj-$(CONFIG_DMA_VIRT_OPS) += virt.o obj-$(CONFIG_DMA_API_DEBUG) += debug.o obj-$(CONFIG_SWIOTLB) += swiotlb.o -- cgit v1.3-7-g2ca7 From 90ac706e98fcb24fb0b0a259558987f33cc2f0f6 Mon Sep 17 00:00:00 2001 From: Robin Murphy Date: Thu, 6 Dec 2018 13:14:44 -0800 Subject: dma-mapping: factor out dummy DMA ops The dummy DMA ops are currently used by arm64 for any device which has an invalid ACPI description and is thus barred from using DMA due to not knowing whether is is cache-coherent or not. Factor these out into general dma-mapping code so that they can be referenced from other common code paths. In the process, we can prune all the optional callbacks which just do the same thing as the default behaviour, and fill in .map_resource for completeness. Signed-off-by: Robin Murphy [hch: moved to a separate source file] Reviewed-by: Rafael J. Wysocki Acked-by: Jesper Dangaard Brouer Tested-by: Jesper Dangaard Brouer Tested-by: Tony Luck Signed-off-by: Christoph Hellwig --- arch/arm64/include/asm/dma-mapping.h | 4 +- arch/arm64/mm/dma-mapping.c | 86 ------------------------------------ include/linux/dma-mapping.h | 1 + kernel/dma/Makefile | 2 +- kernel/dma/dummy.c | 39 ++++++++++++++++ 5 files changed, 42 insertions(+), 90 deletions(-) create mode 100644 kernel/dma/dummy.c (limited to 'kernel') diff --git a/arch/arm64/include/asm/dma-mapping.h b/arch/arm64/include/asm/dma-mapping.h index c41f3fb1446c..273e778f7de2 100644 --- a/arch/arm64/include/asm/dma-mapping.h +++ b/arch/arm64/include/asm/dma-mapping.h @@ -24,15 +24,13 @@ #include #include -extern const struct dma_map_ops dummy_dma_ops; - static inline const struct dma_map_ops *get_arch_dma_ops(struct bus_type *bus) { /* * We expect no ISA devices, and all other DMA masters are expected to * have someone call arch_setup_dma_ops at device creation time. */ - return &dummy_dma_ops; + return &dma_dummy_ops; } void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size, diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c index 4c0f498069e8..6ff6ec8806c1 100644 --- a/arch/arm64/mm/dma-mapping.c +++ b/arch/arm64/mm/dma-mapping.c @@ -89,92 +89,6 @@ static int __swiotlb_mmap_pfn(struct vm_area_struct *vma, } #endif /* CONFIG_IOMMU_DMA */ -/******************************************** - * The following APIs are for dummy DMA ops * - ********************************************/ - -static void *__dummy_alloc(struct device *dev, size_t size, - dma_addr_t *dma_handle, gfp_t flags, - unsigned long attrs) -{ - return NULL; -} - -static void __dummy_free(struct device *dev, size_t size, - void *vaddr, dma_addr_t dma_handle, - unsigned long attrs) -{ -} - -static int __dummy_mmap(struct device *dev, - struct vm_area_struct *vma, - void *cpu_addr, dma_addr_t dma_addr, size_t size, - unsigned long attrs) -{ - return -ENXIO; -} - -static dma_addr_t __dummy_map_page(struct device *dev, struct page *page, - unsigned long offset, size_t size, - enum dma_data_direction dir, - unsigned long attrs) -{ - return DMA_MAPPING_ERROR; -} - -static void __dummy_unmap_page(struct device *dev, dma_addr_t dev_addr, - size_t size, enum dma_data_direction dir, - unsigned long attrs) -{ -} - -static int __dummy_map_sg(struct device *dev, struct scatterlist *sgl, - int nelems, enum dma_data_direction dir, - unsigned long attrs) -{ - return 0; -} - -static void __dummy_unmap_sg(struct device *dev, - struct scatterlist *sgl, int nelems, - enum dma_data_direction dir, - unsigned long attrs) -{ -} - -static void __dummy_sync_single(struct device *dev, - dma_addr_t dev_addr, size_t size, - enum dma_data_direction dir) -{ -} - -static void __dummy_sync_sg(struct device *dev, - struct scatterlist *sgl, int nelems, - enum dma_data_direction dir) -{ -} - -static int __dummy_dma_supported(struct device *hwdev, u64 mask) -{ - return 0; -} - -const struct dma_map_ops dummy_dma_ops = { - .alloc = __dummy_alloc, - .free = __dummy_free, - .mmap = __dummy_mmap, - .map_page = __dummy_map_page, - .unmap_page = __dummy_unmap_page, - .map_sg = __dummy_map_sg, - .unmap_sg = __dummy_unmap_sg, - .sync_single_for_cpu = __dummy_sync_single, - .sync_single_for_device = __dummy_sync_single, - .sync_sg_for_cpu = __dummy_sync_sg, - .sync_sg_for_device = __dummy_sync_sg, - .dma_supported = __dummy_dma_supported, -}; -EXPORT_SYMBOL(dummy_dma_ops); - static int __init arm64_dma_init(void) { WARN_TAINT(ARCH_DMA_MINALIGN < cache_line_size(), diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h index 0f0078490df4..269ee27fc3d9 100644 --- a/include/linux/dma-mapping.h +++ b/include/linux/dma-mapping.h @@ -136,6 +136,7 @@ struct dma_map_ops { extern const struct dma_map_ops dma_direct_ops; extern const struct dma_map_ops dma_virt_ops; +extern const struct dma_map_ops dma_dummy_ops; #define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL<<(n))-1)) diff --git a/kernel/dma/Makefile b/kernel/dma/Makefile index a626f643cd63..72ff6e46aa86 100644 --- a/kernel/dma/Makefile +++ b/kernel/dma/Makefile @@ -1,6 +1,6 @@ # SPDX-License-Identifier: GPL-2.0 -obj-$(CONFIG_HAS_DMA) += mapping.o direct.o +obj-$(CONFIG_HAS_DMA) += mapping.o direct.o dummy.o obj-$(CONFIG_DMA_CMA) += contiguous.o obj-$(CONFIG_HAVE_GENERIC_DMA_COHERENT) += coherent.o obj-$(CONFIG_DMA_VIRT_OPS) += virt.o diff --git a/kernel/dma/dummy.c b/kernel/dma/dummy.c new file mode 100644 index 000000000000..05607642c888 --- /dev/null +++ b/kernel/dma/dummy.c @@ -0,0 +1,39 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Dummy DMA ops that always fail. + */ +#include + +static int dma_dummy_mmap(struct device *dev, struct vm_area_struct *vma, + void *cpu_addr, dma_addr_t dma_addr, size_t size, + unsigned long attrs) +{ + return -ENXIO; +} + +static dma_addr_t dma_dummy_map_page(struct device *dev, struct page *page, + unsigned long offset, size_t size, enum dma_data_direction dir, + unsigned long attrs) +{ + return DMA_MAPPING_ERROR; +} + +static int dma_dummy_map_sg(struct device *dev, struct scatterlist *sgl, + int nelems, enum dma_data_direction dir, + unsigned long attrs) +{ + return 0; +} + +static int dma_dummy_supported(struct device *hwdev, u64 mask) +{ + return 0; +} + +const struct dma_map_ops dma_dummy_ops = { + .mmap = dma_dummy_mmap, + .map_page = dma_dummy_map_page, + .map_sg = dma_dummy_map_sg, + .dma_supported = dma_dummy_supported, +}; +EXPORT_SYMBOL(dma_dummy_ops); -- cgit v1.3-7-g2ca7 From b907e20508d02462a50c2841da0a5e3883fdab39 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Mon, 3 Dec 2018 11:42:52 +0100 Subject: swiotlb: remove SWIOTLB_MAP_ERROR We can use DMA_MAPPING_ERROR instead, which already maps to the same value. Signed-off-by: Christoph Hellwig Acked-by: Jesper Dangaard Brouer Tested-by: Jesper Dangaard Brouer Tested-by: Tony Luck --- drivers/xen/swiotlb-xen.c | 4 ++-- include/linux/swiotlb.h | 3 --- kernel/dma/swiotlb.c | 4 ++-- 3 files changed, 4 insertions(+), 7 deletions(-) (limited to 'kernel') diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c index 6dc969d5ea2f..833e80b46eb2 100644 --- a/drivers/xen/swiotlb-xen.c +++ b/drivers/xen/swiotlb-xen.c @@ -403,7 +403,7 @@ static dma_addr_t xen_swiotlb_map_page(struct device *dev, struct page *page, map = swiotlb_tbl_map_single(dev, start_dma_addr, phys, size, dir, attrs); - if (map == SWIOTLB_MAP_ERROR) + if (map == DMA_MAPPING_ERROR) return DMA_MAPPING_ERROR; dev_addr = xen_phys_to_bus(map); @@ -572,7 +572,7 @@ xen_swiotlb_map_sg_attrs(struct device *hwdev, struct scatterlist *sgl, sg_phys(sg), sg->length, dir, attrs); - if (map == SWIOTLB_MAP_ERROR) { + if (map == DMA_MAPPING_ERROR) { dev_warn(hwdev, "swiotlb buffer is full\n"); /* Don't panic here, we expect map_sg users to do proper error handling. */ diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h index a387b59640a4..14aec0b70dd9 100644 --- a/include/linux/swiotlb.h +++ b/include/linux/swiotlb.h @@ -46,9 +46,6 @@ enum dma_sync_target { SYNC_FOR_DEVICE = 1, }; -/* define the last possible byte of physical address space as a mapping error */ -#define SWIOTLB_MAP_ERROR (~(phys_addr_t)0x0) - extern phys_addr_t swiotlb_tbl_map_single(struct device *hwdev, dma_addr_t tbl_dma_addr, phys_addr_t phys, size_t size, diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c index ff1ce81bb623..19ba8e473d71 100644 --- a/kernel/dma/swiotlb.c +++ b/kernel/dma/swiotlb.c @@ -526,7 +526,7 @@ not_found: spin_unlock_irqrestore(&io_tlb_lock, flags); if (!(attrs & DMA_ATTR_NO_WARN) && printk_ratelimit()) dev_warn(hwdev, "swiotlb buffer is full (sz: %zd bytes)\n", size); - return SWIOTLB_MAP_ERROR; + return DMA_MAPPING_ERROR; found: spin_unlock_irqrestore(&io_tlb_lock, flags); @@ -637,7 +637,7 @@ static dma_addr_t swiotlb_bounce_page(struct device *dev, phys_addr_t *phys, /* Oh well, have to allocate and map a bounce buffer. */ *phys = swiotlb_tbl_map_single(dev, __phys_to_dma(dev, io_tlb_start), *phys, size, dir, attrs); - if (*phys == SWIOTLB_MAP_ERROR) + if (*phys == DMA_MAPPING_ERROR) return DMA_MAPPING_ERROR; /* Ensure that the address returned is DMA'ble */ -- cgit v1.3-7-g2ca7 From 68c608345cc569bcfa1c1b2add4c00c343ecf933 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Thu, 6 Dec 2018 07:06:04 -0800 Subject: swiotlb: remove dma_mark_clean Instead of providing a special dma_mark_clean hook just for ia64, switch ia64 to use the normal arch_sync_dma_for_cpu hooks instead. This means that we now also set the PG_arch_1 bit for pages in the swiotlb buffer, which isn't stricly needed as we will never execute code out of the swiotlb buffer, but otherwise harmless. Signed-off-by: Christoph Hellwig Acked-by: Jesper Dangaard Brouer Tested-by: Jesper Dangaard Brouer Tested-by: Tony Luck --- arch/ia64/Kconfig | 3 ++- arch/ia64/kernel/dma-mapping.c | 20 +++++++++++++++++++- arch/ia64/mm/init.c | 19 ++++++++----------- drivers/xen/swiotlb-xen.c | 20 +------------------- include/linux/dma-direct.h | 8 -------- kernel/dma/swiotlb.c | 18 +----------------- 6 files changed, 31 insertions(+), 57 deletions(-) (limited to 'kernel') diff --git a/arch/ia64/Kconfig b/arch/ia64/Kconfig index d6f203658994..c587e3316c38 100644 --- a/arch/ia64/Kconfig +++ b/arch/ia64/Kconfig @@ -28,7 +28,8 @@ config IA64 select HAVE_ARCH_TRACEHOOK select HAVE_MEMBLOCK_NODE_MAP select HAVE_VIRT_CPU_ACCOUNTING - select ARCH_HAS_DMA_MARK_CLEAN + select ARCH_HAS_DMA_COHERENT_TO_PFN + select ARCH_HAS_SYNC_DMA_FOR_CPU select VIRT_TO_BUS select ARCH_DISCARD_MEMBLOCK select GENERIC_IRQ_PROBE diff --git a/arch/ia64/kernel/dma-mapping.c b/arch/ia64/kernel/dma-mapping.c index 7a471d8d67d4..36dd6aa6d759 100644 --- a/arch/ia64/kernel/dma-mapping.c +++ b/arch/ia64/kernel/dma-mapping.c @@ -1,5 +1,5 @@ // SPDX-License-Identifier: GPL-2.0 -#include +#include #include #include @@ -16,6 +16,24 @@ const struct dma_map_ops *dma_get_ops(struct device *dev) EXPORT_SYMBOL(dma_get_ops); #ifdef CONFIG_SWIOTLB +void *arch_dma_alloc(struct device *dev, size_t size, + dma_addr_t *dma_handle, gfp_t gfp, unsigned long attrs) +{ + return dma_direct_alloc_pages(dev, size, dma_handle, gfp, attrs); +} + +void arch_dma_free(struct device *dev, size_t size, void *cpu_addr, + dma_addr_t dma_addr, unsigned long attrs) +{ + dma_direct_free_pages(dev, size, cpu_addr, dma_addr, attrs); +} + +long arch_dma_coherent_to_pfn(struct device *dev, void *cpu_addr, + dma_addr_t dma_addr) +{ + return page_to_pfn(virt_to_page(cpu_addr)); +} + void __init swiotlb_dma_init(void) { dma_ops = &swiotlb_dma_ops; diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c index d5e12ff1d73c..0cf43bb13d6e 100644 --- a/arch/ia64/mm/init.c +++ b/arch/ia64/mm/init.c @@ -8,6 +8,7 @@ #include #include +#include #include #include #include @@ -71,18 +72,14 @@ __ia64_sync_icache_dcache (pte_t pte) * DMA can be marked as "clean" so that lazy_mmu_prot_update() doesn't have to * flush them when they get mapped into an executable vm-area. */ -void -dma_mark_clean(void *addr, size_t size) +void arch_sync_dma_for_cpu(struct device *dev, phys_addr_t paddr, + size_t size, enum dma_data_direction dir) { - unsigned long pg_addr, end; - - pg_addr = PAGE_ALIGN((unsigned long) addr); - end = (unsigned long) addr + size; - while (pg_addr + PAGE_SIZE <= end) { - struct page *page = virt_to_page(pg_addr); - set_bit(PG_arch_1, &page->flags); - pg_addr += PAGE_SIZE; - } + unsigned long pfn = PHYS_PFN(paddr); + + do { + set_bit(PG_arch_1, &pfn_to_page(pfn)->flags); + } while (++pfn <= PHYS_PFN(paddr + size - 1)); } inline void diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c index 833e80b46eb2..989cf872b98c 100644 --- a/drivers/xen/swiotlb-xen.c +++ b/drivers/xen/swiotlb-xen.c @@ -441,21 +441,8 @@ static void xen_unmap_single(struct device *hwdev, dma_addr_t dev_addr, xen_dma_unmap_page(hwdev, dev_addr, size, dir, attrs); /* NOTE: We use dev_addr here, not paddr! */ - if (is_xen_swiotlb_buffer(dev_addr)) { + if (is_xen_swiotlb_buffer(dev_addr)) swiotlb_tbl_unmap_single(hwdev, paddr, size, dir, attrs); - return; - } - - if (dir != DMA_FROM_DEVICE) - return; - - /* - * phys_to_virt doesn't work with hihgmem page but we could - * call dma_mark_clean() with hihgmem page here. However, we - * are fine since dma_mark_clean() is null on POWERPC. We can - * make dma_mark_clean() take a physical address if necessary. - */ - dma_mark_clean(phys_to_virt(paddr), size); } static void xen_swiotlb_unmap_page(struct device *hwdev, dma_addr_t dev_addr, @@ -493,11 +480,6 @@ xen_swiotlb_sync_single(struct device *hwdev, dma_addr_t dev_addr, if (target == SYNC_FOR_DEVICE) xen_dma_sync_single_for_device(hwdev, dev_addr, size, dir); - - if (dir != DMA_FROM_DEVICE) - return; - - dma_mark_clean(phys_to_virt(paddr), size); } void diff --git a/include/linux/dma-direct.h b/include/linux/dma-direct.h index 6e5a47ae7d64..1aa73f4907ae 100644 --- a/include/linux/dma-direct.h +++ b/include/linux/dma-direct.h @@ -48,14 +48,6 @@ static inline phys_addr_t dma_to_phys(struct device *dev, dma_addr_t daddr) return __sme_clr(__dma_to_phys(dev, daddr)); } -#ifdef CONFIG_ARCH_HAS_DMA_MARK_CLEAN -void dma_mark_clean(void *addr, size_t size); -#else -static inline void dma_mark_clean(void *addr, size_t size) -{ -} -#endif /* CONFIG_ARCH_HAS_DMA_MARK_CLEAN */ - u64 dma_direct_get_required_mask(struct device *dev); void *dma_direct_alloc(struct device *dev, size_t size, dma_addr_t *dma_handle, gfp_t gfp, unsigned long attrs); diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c index 19ba8e473d71..2e126bac5d7d 100644 --- a/kernel/dma/swiotlb.c +++ b/kernel/dma/swiotlb.c @@ -706,21 +706,8 @@ void swiotlb_unmap_page(struct device *hwdev, dma_addr_t dev_addr, (attrs & DMA_ATTR_SKIP_CPU_SYNC) == 0) arch_sync_dma_for_cpu(hwdev, paddr, size, dir); - if (is_swiotlb_buffer(paddr)) { + if (is_swiotlb_buffer(paddr)) swiotlb_tbl_unmap_single(hwdev, paddr, size, dir, attrs); - return; - } - - if (dir != DMA_FROM_DEVICE) - return; - - /* - * phys_to_virt doesn't work with hihgmem page but we could - * call dma_mark_clean() with hihgmem page here. However, we - * are fine since dma_mark_clean() is null on POWERPC. We can - * make dma_mark_clean() take a physical address if necessary. - */ - dma_mark_clean(phys_to_virt(paddr), size); } /* @@ -750,9 +737,6 @@ swiotlb_sync_single(struct device *hwdev, dma_addr_t dev_addr, if (!dev_is_dma_coherent(hwdev) && target == SYNC_FOR_DEVICE) arch_sync_dma_for_device(hwdev, paddr, size, dir); - - if (!is_swiotlb_buffer(paddr) && dir == DMA_FROM_DEVICE) - dma_mark_clean(phys_to_virt(paddr), size); } void -- cgit v1.3-7-g2ca7 From 58dfd4ac022037c6a562e92fc6d2a778819b2162 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Mon, 3 Dec 2018 07:43:05 +0100 Subject: dma-direct: improve addressability error reporting Only report report a DMA addressability report once to avoid spewing the kernel log with repeated message. Also provide a stack trace to make it easy to find the actual caller that caused the problem. Last but not least move the actual check into the fast path and only leave the error reporting in a helper. Signed-off-by: Christoph Hellwig Acked-by: Jesper Dangaard Brouer Tested-by: Jesper Dangaard Brouer Tested-by: Tony Luck --- kernel/dma/direct.c | 36 +++++++++++++++--------------------- 1 file changed, 15 insertions(+), 21 deletions(-) (limited to 'kernel') diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c index 308f88a750c8..edb24f94ea1e 100644 --- a/kernel/dma/direct.c +++ b/kernel/dma/direct.c @@ -30,27 +30,16 @@ static inline bool force_dma_unencrypted(void) return sev_active(); } -static bool -check_addr(struct device *dev, dma_addr_t dma_addr, size_t size, - const char *caller) +static void report_addr(struct device *dev, dma_addr_t dma_addr, size_t size) { - if (unlikely(dev && !dma_capable(dev, dma_addr, size))) { - if (!dev->dma_mask) { - dev_err(dev, - "%s: call on device without dma_mask\n", - caller); - return false; - } - - if (*dev->dma_mask >= DMA_BIT_MASK(32) || dev->bus_dma_mask) { - dev_err(dev, - "%s: overflow %pad+%zu of device mask %llx bus mask %llx\n", - caller, &dma_addr, size, - *dev->dma_mask, dev->bus_dma_mask); - } - return false; + if (!dev->dma_mask) { + dev_err_once(dev, "DMA map on device without dma_mask\n"); + } else if (*dev->dma_mask >= DMA_BIT_MASK(32) || dev->bus_dma_mask) { + dev_err_once(dev, + "overflow %pad+%zu of DMA mask %llx bus mask %llx\n", + &dma_addr, size, *dev->dma_mask, dev->bus_dma_mask); } - return true; + WARN_ON_ONCE(1); } static inline dma_addr_t phys_to_dma_direct(struct device *dev, @@ -288,8 +277,10 @@ dma_addr_t dma_direct_map_page(struct device *dev, struct page *page, phys_addr_t phys = page_to_phys(page) + offset; dma_addr_t dma_addr = phys_to_dma(dev, phys); - if (!check_addr(dev, dma_addr, size, __func__)) + if (unlikely(dev && !dma_capable(dev, dma_addr, size))) { + report_addr(dev, dma_addr, size); return DMA_MAPPING_ERROR; + } if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC)) dma_direct_sync_single_for_device(dev, dma_addr, size, dir); @@ -306,8 +297,11 @@ int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents, BUG_ON(!sg_page(sg)); sg_dma_address(sg) = phys_to_dma(dev, sg_phys(sg)); - if (!check_addr(dev, sg_dma_address(sg), sg->length, __func__)) + if (unlikely(dev && !dma_capable(dev, sg_dma_address(sg), + sg->length))) { + report_addr(dev, sg_dma_address(sg), sg->length); return 0; + } sg_dma_len(sg) = sg->length; } -- cgit v1.3-7-g2ca7 From 17ac524719f3fc88c1a90528f4789e4b4f618512 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Mon, 3 Dec 2018 11:14:09 +0100 Subject: dma-direct: use dma_direct_map_page to implement dma_direct_map_sg No need to duplicate the mapping logic. Signed-off-by: Christoph Hellwig Acked-by: Jesper Dangaard Brouer Tested-by: Jesper Dangaard Brouer Tested-by: Tony Luck --- kernel/dma/direct.c | 14 +++++--------- 1 file changed, 5 insertions(+), 9 deletions(-) (limited to 'kernel') diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c index edb24f94ea1e..d45306473c90 100644 --- a/kernel/dma/direct.c +++ b/kernel/dma/direct.c @@ -217,6 +217,7 @@ static void dma_direct_sync_single_for_device(struct device *dev, arch_sync_dma_for_device(dev, dma_to_phys(dev, addr), size, dir); } +#if defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_DEVICE) static void dma_direct_sync_sg_for_device(struct device *dev, struct scatterlist *sgl, int nents, enum dma_data_direction dir) { @@ -229,6 +230,7 @@ static void dma_direct_sync_sg_for_device(struct device *dev, for_each_sg(sgl, sg, nents, i) arch_sync_dma_for_device(dev, sg_phys(sg), sg->length, dir); } +#endif #if defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU) || \ defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL) @@ -294,19 +296,13 @@ int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents, struct scatterlist *sg; for_each_sg(sgl, sg, nents, i) { - BUG_ON(!sg_page(sg)); - - sg_dma_address(sg) = phys_to_dma(dev, sg_phys(sg)); - if (unlikely(dev && !dma_capable(dev, sg_dma_address(sg), - sg->length))) { - report_addr(dev, sg_dma_address(sg), sg->length); + sg->dma_address = dma_direct_map_page(dev, sg_page(sg), + sg->offset, sg->length, dir, attrs); + if (sg->dma_address == DMA_MAPPING_ERROR) return 0; - } sg_dma_len(sg) = sg->length; } - if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC)) - dma_direct_sync_sg_for_device(dev, sgl, nents, dir); return nents; } -- cgit v1.3-7-g2ca7 From 55897af63091ebc2c3f239c6a6666f748113ac50 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Mon, 3 Dec 2018 11:43:54 +0100 Subject: dma-direct: merge swiotlb_dma_ops into the dma_direct code While the dma-direct code is (relatively) clean and simple we actually have to use the swiotlb ops for the mapping on many architectures due to devices with addressing limits. Instead of keeping two implementations around this commit allows the dma-direct implementation to call the swiotlb bounce buffering functions and thus share the guts of the mapping implementation. This also simplified the dma-mapping setup on a few architectures where we don't have to differenciate which implementation to use. Signed-off-by: Christoph Hellwig Acked-by: Jesper Dangaard Brouer Tested-by: Jesper Dangaard Brouer Tested-by: Tony Luck --- arch/arm64/mm/dma-mapping.c | 2 +- arch/ia64/hp/common/hwsw_iommu.c | 2 +- arch/ia64/hp/common/sba_iommu.c | 6 +- arch/ia64/kernel/dma-mapping.c | 2 +- arch/mips/include/asm/dma-mapping.h | 2 - arch/powerpc/kernel/dma-swiotlb.c | 16 +-- arch/riscv/include/asm/dma-mapping.h | 15 --- arch/x86/kernel/pci-swiotlb.c | 4 +- arch/x86/mm/mem_encrypt.c | 7 -- arch/x86/pci/sta2x11-fixup.c | 1 - include/linux/dma-direct.h | 12 ++ include/linux/swiotlb.h | 74 +++++------ kernel/dma/direct.c | 113 ++++++++++++----- kernel/dma/swiotlb.c | 232 ++--------------------------------- 14 files changed, 150 insertions(+), 338 deletions(-) delete mode 100644 arch/riscv/include/asm/dma-mapping.h (limited to 'kernel') diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c index 6ff6ec8806c1..ab1e417204d0 100644 --- a/arch/arm64/mm/dma-mapping.c +++ b/arch/arm64/mm/dma-mapping.c @@ -463,7 +463,7 @@ void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size, const struct iommu_ops *iommu, bool coherent) { if (!dev->dma_ops) - dev->dma_ops = &swiotlb_dma_ops; + dev->dma_ops = &dma_direct_ops; dev->dma_coherent = coherent; __iommu_setup_dma_ops(dev, dma_base, size, iommu); diff --git a/arch/ia64/hp/common/hwsw_iommu.c b/arch/ia64/hp/common/hwsw_iommu.c index 58969039bed2..f40ca499b246 100644 --- a/arch/ia64/hp/common/hwsw_iommu.c +++ b/arch/ia64/hp/common/hwsw_iommu.c @@ -38,7 +38,7 @@ static inline int use_swiotlb(struct device *dev) const struct dma_map_ops *hwsw_dma_get_ops(struct device *dev) { if (use_swiotlb(dev)) - return &swiotlb_dma_ops; + return &dma_direct_ops; return &sba_dma_ops; } EXPORT_SYMBOL(hwsw_dma_get_ops); diff --git a/arch/ia64/hp/common/sba_iommu.c b/arch/ia64/hp/common/sba_iommu.c index 0d21c0b5b23d..5ee74820a0f6 100644 --- a/arch/ia64/hp/common/sba_iommu.c +++ b/arch/ia64/hp/common/sba_iommu.c @@ -2065,8 +2065,6 @@ static int __init acpi_sba_ioc_init_acpi(void) /* This has to run before acpi_scan_init(). */ arch_initcall(acpi_sba_ioc_init_acpi); -extern const struct dma_map_ops swiotlb_dma_ops; - static int __init sba_init(void) { @@ -2080,7 +2078,7 @@ sba_init(void) * a successful kdump kernel boot is to use the swiotlb. */ if (is_kdump_kernel()) { - dma_ops = &swiotlb_dma_ops; + dma_ops = &dma_direct_ops; if (swiotlb_late_init_with_default_size(64 * (1<<20)) != 0) panic("Unable to initialize software I/O TLB:" " Try machvec=dig boot option"); @@ -2102,7 +2100,7 @@ sba_init(void) * If we didn't find something sba_iommu can claim, we * need to setup the swiotlb and switch to the dig machvec. */ - dma_ops = &swiotlb_dma_ops; + dma_ops = &dma_direct_ops; if (swiotlb_late_init_with_default_size(64 * (1<<20)) != 0) panic("Unable to find SBA IOMMU or initialize " "software I/O TLB: Try machvec=dig boot option"); diff --git a/arch/ia64/kernel/dma-mapping.c b/arch/ia64/kernel/dma-mapping.c index 36dd6aa6d759..80cd3e1ea95a 100644 --- a/arch/ia64/kernel/dma-mapping.c +++ b/arch/ia64/kernel/dma-mapping.c @@ -36,7 +36,7 @@ long arch_dma_coherent_to_pfn(struct device *dev, void *cpu_addr, void __init swiotlb_dma_init(void) { - dma_ops = &swiotlb_dma_ops; + dma_ops = &dma_direct_ops; swiotlb_init(1); } #endif diff --git a/arch/mips/include/asm/dma-mapping.h b/arch/mips/include/asm/dma-mapping.h index b4c477eb46ce..69f914667f3e 100644 --- a/arch/mips/include/asm/dma-mapping.h +++ b/arch/mips/include/asm/dma-mapping.h @@ -10,8 +10,6 @@ static inline const struct dma_map_ops *get_arch_dma_ops(struct bus_type *bus) { #if defined(CONFIG_MACH_JAZZ) return &jazz_dma_ops; -#elif defined(CONFIG_SWIOTLB) - return &swiotlb_dma_ops; #else return &dma_direct_ops; #endif diff --git a/arch/powerpc/kernel/dma-swiotlb.c b/arch/powerpc/kernel/dma-swiotlb.c index 3d8df2cf8be9..430a7d0aa2cb 100644 --- a/arch/powerpc/kernel/dma-swiotlb.c +++ b/arch/powerpc/kernel/dma-swiotlb.c @@ -50,15 +50,15 @@ const struct dma_map_ops powerpc_swiotlb_dma_ops = { .alloc = __dma_nommu_alloc_coherent, .free = __dma_nommu_free_coherent, .mmap = dma_nommu_mmap_coherent, - .map_sg = swiotlb_map_sg_attrs, - .unmap_sg = swiotlb_unmap_sg_attrs, + .map_sg = dma_direct_map_sg, + .unmap_sg = dma_direct_unmap_sg, .dma_supported = swiotlb_dma_supported, - .map_page = swiotlb_map_page, - .unmap_page = swiotlb_unmap_page, - .sync_single_for_cpu = swiotlb_sync_single_for_cpu, - .sync_single_for_device = swiotlb_sync_single_for_device, - .sync_sg_for_cpu = swiotlb_sync_sg_for_cpu, - .sync_sg_for_device = swiotlb_sync_sg_for_device, + .map_page = dma_direct_map_page, + .unmap_page = dma_direct_unmap_page, + .sync_single_for_cpu = dma_direct_sync_single_for_cpu, + .sync_single_for_device = dma_direct_sync_single_for_device, + .sync_sg_for_cpu = dma_direct_sync_sg_for_cpu, + .sync_sg_for_device = dma_direct_sync_sg_for_device, .get_required_mask = swiotlb_powerpc_get_required, }; diff --git a/arch/riscv/include/asm/dma-mapping.h b/arch/riscv/include/asm/dma-mapping.h deleted file mode 100644 index 8facc1c8fa05..000000000000 --- a/arch/riscv/include/asm/dma-mapping.h +++ /dev/null @@ -1,15 +0,0 @@ -// SPDX-License-Identifier: GPL-2.0 -#ifndef _RISCV_ASM_DMA_MAPPING_H -#define _RISCV_ASM_DMA_MAPPING_H 1 - -#ifdef CONFIG_SWIOTLB -#include -static inline const struct dma_map_ops *get_arch_dma_ops(struct bus_type *bus) -{ - return &swiotlb_dma_ops; -} -#else -#include -#endif /* CONFIG_SWIOTLB */ - -#endif /* _RISCV_ASM_DMA_MAPPING_H */ diff --git a/arch/x86/kernel/pci-swiotlb.c b/arch/x86/kernel/pci-swiotlb.c index bd08b9e1c9e2..5f5302028a9a 100644 --- a/arch/x86/kernel/pci-swiotlb.c +++ b/arch/x86/kernel/pci-swiotlb.c @@ -62,10 +62,8 @@ IOMMU_INIT(pci_swiotlb_detect_4gb, void __init pci_swiotlb_init(void) { - if (swiotlb) { + if (swiotlb) swiotlb_init(0); - dma_ops = &swiotlb_dma_ops; - } } void __init pci_swiotlb_late_init(void) diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c index 006f373f54ab..385afa2b9e17 100644 --- a/arch/x86/mm/mem_encrypt.c +++ b/arch/x86/mm/mem_encrypt.c @@ -380,13 +380,6 @@ void __init mem_encrypt_init(void) /* Call into SWIOTLB to update the SWIOTLB DMA buffers */ swiotlb_update_mem_attributes(); - /* - * With SEV, DMA operations cannot use encryption, we need to use - * SWIOTLB to bounce buffer DMA operation. - */ - if (sev_active()) - dma_ops = &swiotlb_dma_ops; - /* * With SEV, we need to unroll the rep string I/O instructions. */ diff --git a/arch/x86/pci/sta2x11-fixup.c b/arch/x86/pci/sta2x11-fixup.c index 7a5bafb76d77..3cdafea55ab6 100644 --- a/arch/x86/pci/sta2x11-fixup.c +++ b/arch/x86/pci/sta2x11-fixup.c @@ -168,7 +168,6 @@ static void sta2x11_setup_pdev(struct pci_dev *pdev) return; pci_set_consistent_dma_mask(pdev, STA2X11_AMBA_SIZE - 1); pci_set_dma_mask(pdev, STA2X11_AMBA_SIZE - 1); - pdev->dev.dma_ops = &swiotlb_dma_ops; pdev->dev.archdata.is_sta2x11 = true; /* We must enable all devices as master, for audio DMA to work */ diff --git a/include/linux/dma-direct.h b/include/linux/dma-direct.h index 1aa73f4907ae..3b0a3ea3876d 100644 --- a/include/linux/dma-direct.h +++ b/include/linux/dma-direct.h @@ -63,7 +63,19 @@ void __dma_direct_free_pages(struct device *dev, size_t size, struct page *page) dma_addr_t dma_direct_map_page(struct device *dev, struct page *page, unsigned long offset, size_t size, enum dma_data_direction dir, unsigned long attrs); +void dma_direct_unmap_page(struct device *dev, dma_addr_t addr, + size_t size, enum dma_data_direction dir, unsigned long attrs); int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents, enum dma_data_direction dir, unsigned long attrs); +void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl, + int nents, enum dma_data_direction dir, unsigned long attrs); +void dma_direct_sync_single_for_device(struct device *dev, + dma_addr_t addr, size_t size, enum dma_data_direction dir); +void dma_direct_sync_sg_for_device(struct device *dev, + struct scatterlist *sgl, int nents, enum dma_data_direction dir); +void dma_direct_sync_single_for_cpu(struct device *dev, + dma_addr_t addr, size_t size, enum dma_data_direction dir); +void dma_direct_sync_sg_for_cpu(struct device *dev, + struct scatterlist *sgl, int nents, enum dma_data_direction dir); int dma_direct_supported(struct device *dev, u64 mask); #endif /* _LINUX_DMA_DIRECT_H */ diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h index 14aec0b70dd9..7c007ed7505f 100644 --- a/include/linux/swiotlb.h +++ b/include/linux/swiotlb.h @@ -16,8 +16,6 @@ enum swiotlb_force { SWIOTLB_NO_FORCE, /* swiotlb=noforce */ }; -extern enum swiotlb_force swiotlb_force; - /* * Maximum allowable number of contiguous slabs to map, * must be a power of 2. What is the appropriate value ? @@ -62,56 +60,44 @@ extern void swiotlb_tbl_sync_single(struct device *hwdev, size_t size, enum dma_data_direction dir, enum dma_sync_target target); -/* Accessory functions. */ - -extern dma_addr_t swiotlb_map_page(struct device *dev, struct page *page, - unsigned long offset, size_t size, - enum dma_data_direction dir, - unsigned long attrs); -extern void swiotlb_unmap_page(struct device *hwdev, dma_addr_t dev_addr, - size_t size, enum dma_data_direction dir, - unsigned long attrs); - -extern int -swiotlb_map_sg_attrs(struct device *hwdev, struct scatterlist *sgl, int nelems, - enum dma_data_direction dir, - unsigned long attrs); - -extern void -swiotlb_unmap_sg_attrs(struct device *hwdev, struct scatterlist *sgl, - int nelems, enum dma_data_direction dir, - unsigned long attrs); - -extern void -swiotlb_sync_single_for_cpu(struct device *hwdev, dma_addr_t dev_addr, - size_t size, enum dma_data_direction dir); - -extern void -swiotlb_sync_sg_for_cpu(struct device *hwdev, struct scatterlist *sg, - int nelems, enum dma_data_direction dir); - -extern void -swiotlb_sync_single_for_device(struct device *hwdev, dma_addr_t dev_addr, - size_t size, enum dma_data_direction dir); - -extern void -swiotlb_sync_sg_for_device(struct device *hwdev, struct scatterlist *sg, - int nelems, enum dma_data_direction dir); - extern int swiotlb_dma_supported(struct device *hwdev, u64 mask); #ifdef CONFIG_SWIOTLB -extern void __init swiotlb_exit(void); +extern enum swiotlb_force swiotlb_force; +extern phys_addr_t io_tlb_start, io_tlb_end; + +static inline bool is_swiotlb_buffer(phys_addr_t paddr) +{ + return paddr >= io_tlb_start && paddr < io_tlb_end; +} + +bool swiotlb_map(struct device *dev, phys_addr_t *phys, dma_addr_t *dma_addr, + size_t size, enum dma_data_direction dir, unsigned long attrs); +void __init swiotlb_exit(void); unsigned int swiotlb_max_segment(void); #else -static inline void swiotlb_exit(void) { } -static inline unsigned int swiotlb_max_segment(void) { return 0; } -#endif +#define swiotlb_force SWIOTLB_NO_FORCE +static inline bool is_swiotlb_buffer(phys_addr_t paddr) +{ + return false; +} +static inline bool swiotlb_map(struct device *dev, phys_addr_t *phys, + dma_addr_t *dma_addr, size_t size, enum dma_data_direction dir, + unsigned long attrs) +{ + return false; +} +static inline void swiotlb_exit(void) +{ +} +static inline unsigned int swiotlb_max_segment(void) +{ + return 0; +} +#endif /* CONFIG_SWIOTLB */ extern void swiotlb_print_info(void); extern void swiotlb_set_max_segment(unsigned int); -extern const struct dma_map_ops swiotlb_dma_ops; - #endif /* __LINUX_SWIOTLB_H */ diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c index d45306473c90..85d8286a0ba2 100644 --- a/kernel/dma/direct.c +++ b/kernel/dma/direct.c @@ -13,6 +13,7 @@ #include #include #include +#include /* * Most architectures use ZONE_DMA for the first 16 Megabytes, but @@ -209,69 +210,110 @@ void dma_direct_free(struct device *dev, size_t size, dma_direct_free_pages(dev, size, cpu_addr, dma_addr, attrs); } -static void dma_direct_sync_single_for_device(struct device *dev, +#if defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_DEVICE) || \ + defined(CONFIG_SWIOTLB) +void dma_direct_sync_single_for_device(struct device *dev, dma_addr_t addr, size_t size, enum dma_data_direction dir) { - if (dev_is_dma_coherent(dev)) - return; - arch_sync_dma_for_device(dev, dma_to_phys(dev, addr), size, dir); + phys_addr_t paddr = dma_to_phys(dev, addr); + + if (unlikely(is_swiotlb_buffer(paddr))) + swiotlb_tbl_sync_single(dev, paddr, size, dir, SYNC_FOR_DEVICE); + + if (!dev_is_dma_coherent(dev)) + arch_sync_dma_for_device(dev, paddr, size, dir); } -#if defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_DEVICE) -static void dma_direct_sync_sg_for_device(struct device *dev, +void dma_direct_sync_sg_for_device(struct device *dev, struct scatterlist *sgl, int nents, enum dma_data_direction dir) { struct scatterlist *sg; int i; - if (dev_is_dma_coherent(dev)) - return; + for_each_sg(sgl, sg, nents, i) { + if (unlikely(is_swiotlb_buffer(sg_phys(sg)))) + swiotlb_tbl_sync_single(dev, sg_phys(sg), sg->length, + dir, SYNC_FOR_DEVICE); - for_each_sg(sgl, sg, nents, i) - arch_sync_dma_for_device(dev, sg_phys(sg), sg->length, dir); + if (!dev_is_dma_coherent(dev)) + arch_sync_dma_for_device(dev, sg_phys(sg), sg->length, + dir); + } } #endif #if defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU) || \ - defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL) -static void dma_direct_sync_single_for_cpu(struct device *dev, + defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL) || \ + defined(CONFIG_SWIOTLB) +void dma_direct_sync_single_for_cpu(struct device *dev, dma_addr_t addr, size_t size, enum dma_data_direction dir) { - if (dev_is_dma_coherent(dev)) - return; - arch_sync_dma_for_cpu(dev, dma_to_phys(dev, addr), size, dir); - arch_sync_dma_for_cpu_all(dev); + phys_addr_t paddr = dma_to_phys(dev, addr); + + if (!dev_is_dma_coherent(dev)) { + arch_sync_dma_for_cpu(dev, paddr, size, dir); + arch_sync_dma_for_cpu_all(dev); + } + + if (unlikely(is_swiotlb_buffer(paddr))) + swiotlb_tbl_sync_single(dev, paddr, size, dir, SYNC_FOR_CPU); } -static void dma_direct_sync_sg_for_cpu(struct device *dev, +void dma_direct_sync_sg_for_cpu(struct device *dev, struct scatterlist *sgl, int nents, enum dma_data_direction dir) { struct scatterlist *sg; int i; - if (dev_is_dma_coherent(dev)) - return; + for_each_sg(sgl, sg, nents, i) { + if (!dev_is_dma_coherent(dev)) + arch_sync_dma_for_cpu(dev, sg_phys(sg), sg->length, dir); + + if (unlikely(is_swiotlb_buffer(sg_phys(sg)))) + swiotlb_tbl_sync_single(dev, sg_phys(sg), sg->length, dir, + SYNC_FOR_CPU); + } - for_each_sg(sgl, sg, nents, i) - arch_sync_dma_for_cpu(dev, sg_phys(sg), sg->length, dir); - arch_sync_dma_for_cpu_all(dev); + if (!dev_is_dma_coherent(dev)) + arch_sync_dma_for_cpu_all(dev); } -static void dma_direct_unmap_page(struct device *dev, dma_addr_t addr, +void dma_direct_unmap_page(struct device *dev, dma_addr_t addr, size_t size, enum dma_data_direction dir, unsigned long attrs) { + phys_addr_t phys = dma_to_phys(dev, addr); + if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC)) dma_direct_sync_single_for_cpu(dev, addr, size, dir); + + if (unlikely(is_swiotlb_buffer(phys))) + swiotlb_tbl_unmap_single(dev, phys, size, dir, attrs); } -static void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl, +void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl, + int nents, enum dma_data_direction dir, unsigned long attrs) +{ + struct scatterlist *sg; + int i; + + for_each_sg(sgl, sg, nents, i) + dma_direct_unmap_page(dev, sg->dma_address, sg_dma_len(sg), dir, + attrs); +} +#else +void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl, int nents, enum dma_data_direction dir, unsigned long attrs) { - if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC)) - dma_direct_sync_sg_for_cpu(dev, sgl, nents, dir); } #endif +static inline bool dma_direct_possible(struct device *dev, dma_addr_t dma_addr, + size_t size) +{ + return swiotlb_force != SWIOTLB_FORCE && + (!dev || dma_capable(dev, dma_addr, size)); +} + dma_addr_t dma_direct_map_page(struct device *dev, struct page *page, unsigned long offset, size_t size, enum dma_data_direction dir, unsigned long attrs) @@ -279,13 +321,14 @@ dma_addr_t dma_direct_map_page(struct device *dev, struct page *page, phys_addr_t phys = page_to_phys(page) + offset; dma_addr_t dma_addr = phys_to_dma(dev, phys); - if (unlikely(dev && !dma_capable(dev, dma_addr, size))) { + if (unlikely(!dma_direct_possible(dev, dma_addr, size)) && + !swiotlb_map(dev, &phys, &dma_addr, size, dir, attrs)) { report_addr(dev, dma_addr, size); return DMA_MAPPING_ERROR; } - if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC)) - dma_direct_sync_single_for_device(dev, dma_addr, size, dir); + if (!dev_is_dma_coherent(dev) && !(attrs & DMA_ATTR_SKIP_CPU_SYNC)) + arch_sync_dma_for_device(dev, phys, size, dir); return dma_addr; } @@ -299,11 +342,15 @@ int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents, sg->dma_address = dma_direct_map_page(dev, sg_page(sg), sg->offset, sg->length, dir, attrs); if (sg->dma_address == DMA_MAPPING_ERROR) - return 0; + goto out_unmap; sg_dma_len(sg) = sg->length; } return nents; + +out_unmap: + dma_direct_unmap_sg(dev, sgl, i, dir, attrs | DMA_ATTR_SKIP_CPU_SYNC); + return 0; } /* @@ -331,12 +378,14 @@ const struct dma_map_ops dma_direct_ops = { .free = dma_direct_free, .map_page = dma_direct_map_page, .map_sg = dma_direct_map_sg, -#if defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_DEVICE) +#if defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_DEVICE) || \ + defined(CONFIG_SWIOTLB) .sync_single_for_device = dma_direct_sync_single_for_device, .sync_sg_for_device = dma_direct_sync_sg_for_device, #endif #if defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU) || \ - defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL) + defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL) || \ + defined(CONFIG_SWIOTLB) .sync_single_for_cpu = dma_direct_sync_single_for_cpu, .sync_sg_for_cpu = dma_direct_sync_sg_for_cpu, .unmap_page = dma_direct_unmap_page, diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c index 2e126bac5d7d..d6361776dc5c 100644 --- a/kernel/dma/swiotlb.c +++ b/kernel/dma/swiotlb.c @@ -21,7 +21,6 @@ #include #include -#include #include #include #include @@ -65,7 +64,7 @@ enum swiotlb_force swiotlb_force; * swiotlb_tbl_sync_single_*, to see if the memory was in fact allocated by this * API. */ -static phys_addr_t io_tlb_start, io_tlb_end; +phys_addr_t io_tlb_start, io_tlb_end; /* * The number of IO TLB blocks (in groups of 64) between io_tlb_start and @@ -383,11 +382,6 @@ void __init swiotlb_exit(void) max_segment = 0; } -static int is_swiotlb_buffer(phys_addr_t paddr) -{ - return paddr >= io_tlb_start && paddr < io_tlb_end; -} - /* * Bounce: copy the swiotlb buffer back to the original dma location */ @@ -623,221 +617,36 @@ void swiotlb_tbl_sync_single(struct device *hwdev, phys_addr_t tlb_addr, } } -static dma_addr_t swiotlb_bounce_page(struct device *dev, phys_addr_t *phys, +/* + * Create a swiotlb mapping for the buffer at @phys, and in case of DMAing + * to the device copy the data into it as well. + */ +bool swiotlb_map(struct device *dev, phys_addr_t *phys, dma_addr_t *dma_addr, size_t size, enum dma_data_direction dir, unsigned long attrs) { - dma_addr_t dma_addr; + trace_swiotlb_bounced(dev, *dma_addr, size, swiotlb_force); if (unlikely(swiotlb_force == SWIOTLB_NO_FORCE)) { dev_warn_ratelimited(dev, "Cannot do DMA to address %pa\n", phys); - return DMA_MAPPING_ERROR; + return false; } /* Oh well, have to allocate and map a bounce buffer. */ *phys = swiotlb_tbl_map_single(dev, __phys_to_dma(dev, io_tlb_start), *phys, size, dir, attrs); if (*phys == DMA_MAPPING_ERROR) - return DMA_MAPPING_ERROR; + return false; /* Ensure that the address returned is DMA'ble */ - dma_addr = __phys_to_dma(dev, *phys); - if (unlikely(!dma_capable(dev, dma_addr, size))) { + *dma_addr = __phys_to_dma(dev, *phys); + if (unlikely(!dma_capable(dev, *dma_addr, size))) { swiotlb_tbl_unmap_single(dev, *phys, size, dir, attrs | DMA_ATTR_SKIP_CPU_SYNC); - return DMA_MAPPING_ERROR; - } - - return dma_addr; -} - -/* - * Map a single buffer of the indicated size for DMA in streaming mode. The - * physical address to use is returned. - * - * Once the device is given the dma address, the device owns this memory until - * either swiotlb_unmap_page or swiotlb_dma_sync_single is performed. - */ -dma_addr_t swiotlb_map_page(struct device *dev, struct page *page, - unsigned long offset, size_t size, - enum dma_data_direction dir, - unsigned long attrs) -{ - phys_addr_t phys = page_to_phys(page) + offset; - dma_addr_t dev_addr = phys_to_dma(dev, phys); - - BUG_ON(dir == DMA_NONE); - /* - * If the address happens to be in the device's DMA window, - * we can safely return the device addr and not worry about bounce - * buffering it. - */ - if (!dma_capable(dev, dev_addr, size) || - swiotlb_force == SWIOTLB_FORCE) { - trace_swiotlb_bounced(dev, dev_addr, size, swiotlb_force); - dev_addr = swiotlb_bounce_page(dev, &phys, size, dir, attrs); + return false; } - if (!dev_is_dma_coherent(dev) && - (attrs & DMA_ATTR_SKIP_CPU_SYNC) == 0 && - dev_addr != DMA_MAPPING_ERROR) - arch_sync_dma_for_device(dev, phys, size, dir); - - return dev_addr; -} - -/* - * Unmap a single streaming mode DMA translation. The dma_addr and size must - * match what was provided for in a previous swiotlb_map_page call. All - * other usages are undefined. - * - * After this call, reads by the cpu to the buffer are guaranteed to see - * whatever the device wrote there. - */ -void swiotlb_unmap_page(struct device *hwdev, dma_addr_t dev_addr, - size_t size, enum dma_data_direction dir, - unsigned long attrs) -{ - phys_addr_t paddr = dma_to_phys(hwdev, dev_addr); - - BUG_ON(dir == DMA_NONE); - - if (!dev_is_dma_coherent(hwdev) && - (attrs & DMA_ATTR_SKIP_CPU_SYNC) == 0) - arch_sync_dma_for_cpu(hwdev, paddr, size, dir); - - if (is_swiotlb_buffer(paddr)) - swiotlb_tbl_unmap_single(hwdev, paddr, size, dir, attrs); -} - -/* - * Make physical memory consistent for a single streaming mode DMA translation - * after a transfer. - * - * If you perform a swiotlb_map_page() but wish to interrogate the buffer - * using the cpu, yet do not wish to teardown the dma mapping, you must - * call this function before doing so. At the next point you give the dma - * address back to the card, you must first perform a - * swiotlb_dma_sync_for_device, and then the device again owns the buffer - */ -static void -swiotlb_sync_single(struct device *hwdev, dma_addr_t dev_addr, - size_t size, enum dma_data_direction dir, - enum dma_sync_target target) -{ - phys_addr_t paddr = dma_to_phys(hwdev, dev_addr); - - BUG_ON(dir == DMA_NONE); - - if (!dev_is_dma_coherent(hwdev) && target == SYNC_FOR_CPU) - arch_sync_dma_for_cpu(hwdev, paddr, size, dir); - - if (is_swiotlb_buffer(paddr)) - swiotlb_tbl_sync_single(hwdev, paddr, size, dir, target); - - if (!dev_is_dma_coherent(hwdev) && target == SYNC_FOR_DEVICE) - arch_sync_dma_for_device(hwdev, paddr, size, dir); -} - -void -swiotlb_sync_single_for_cpu(struct device *hwdev, dma_addr_t dev_addr, - size_t size, enum dma_data_direction dir) -{ - swiotlb_sync_single(hwdev, dev_addr, size, dir, SYNC_FOR_CPU); -} - -void -swiotlb_sync_single_for_device(struct device *hwdev, dma_addr_t dev_addr, - size_t size, enum dma_data_direction dir) -{ - swiotlb_sync_single(hwdev, dev_addr, size, dir, SYNC_FOR_DEVICE); -} - -/* - * Map a set of buffers described by scatterlist in streaming mode for DMA. - * This is the scatter-gather version of the above swiotlb_map_page - * interface. Here the scatter gather list elements are each tagged with the - * appropriate dma address and length. They are obtained via - * sg_dma_{address,length}(SG). - * - * Device ownership issues as mentioned above for swiotlb_map_page are the - * same here. - */ -int -swiotlb_map_sg_attrs(struct device *dev, struct scatterlist *sgl, int nelems, - enum dma_data_direction dir, unsigned long attrs) -{ - struct scatterlist *sg; - int i; - - for_each_sg(sgl, sg, nelems, i) { - sg->dma_address = swiotlb_map_page(dev, sg_page(sg), sg->offset, - sg->length, dir, attrs); - if (sg->dma_address == DMA_MAPPING_ERROR) - goto out_error; - sg_dma_len(sg) = sg->length; - } - - return nelems; - -out_error: - swiotlb_unmap_sg_attrs(dev, sgl, i, dir, - attrs | DMA_ATTR_SKIP_CPU_SYNC); - sg_dma_len(sgl) = 0; - return 0; -} - -/* - * Unmap a set of streaming mode DMA translations. Again, cpu read rules - * concerning calls here are the same as for swiotlb_unmap_page() above. - */ -void -swiotlb_unmap_sg_attrs(struct device *hwdev, struct scatterlist *sgl, - int nelems, enum dma_data_direction dir, - unsigned long attrs) -{ - struct scatterlist *sg; - int i; - - BUG_ON(dir == DMA_NONE); - - for_each_sg(sgl, sg, nelems, i) - swiotlb_unmap_page(hwdev, sg->dma_address, sg_dma_len(sg), dir, - attrs); -} - -/* - * Make physical memory consistent for a set of streaming mode DMA translations - * after a transfer. - * - * The same as swiotlb_sync_single_* but for a scatter-gather list, same rules - * and usage. - */ -static void -swiotlb_sync_sg(struct device *hwdev, struct scatterlist *sgl, - int nelems, enum dma_data_direction dir, - enum dma_sync_target target) -{ - struct scatterlist *sg; - int i; - - for_each_sg(sgl, sg, nelems, i) - swiotlb_sync_single(hwdev, sg->dma_address, - sg_dma_len(sg), dir, target); -} - -void -swiotlb_sync_sg_for_cpu(struct device *hwdev, struct scatterlist *sg, - int nelems, enum dma_data_direction dir) -{ - swiotlb_sync_sg(hwdev, sg, nelems, dir, SYNC_FOR_CPU); -} - -void -swiotlb_sync_sg_for_device(struct device *hwdev, struct scatterlist *sg, - int nelems, enum dma_data_direction dir) -{ - swiotlb_sync_sg(hwdev, sg, nelems, dir, SYNC_FOR_DEVICE); + return true; } /* @@ -851,18 +660,3 @@ swiotlb_dma_supported(struct device *hwdev, u64 mask) { return __phys_to_dma(hwdev, io_tlb_end - 1) <= mask; } - -const struct dma_map_ops swiotlb_dma_ops = { - .alloc = dma_direct_alloc, - .free = dma_direct_free, - .sync_single_for_cpu = swiotlb_sync_single_for_cpu, - .sync_single_for_device = swiotlb_sync_single_for_device, - .sync_sg_for_cpu = swiotlb_sync_sg_for_cpu, - .sync_sg_for_device = swiotlb_sync_sg_for_device, - .map_sg = swiotlb_map_sg_attrs, - .unmap_sg = swiotlb_unmap_sg_attrs, - .map_page = swiotlb_map_page, - .unmap_page = swiotlb_unmap_page, - .dma_supported = dma_direct_supported, -}; -EXPORT_SYMBOL(swiotlb_dma_ops); -- cgit v1.3-7-g2ca7 From 356da6d0cde3323236977fce54c1f9612a742036 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Thu, 6 Dec 2018 13:39:32 -0800 Subject: dma-mapping: bypass indirect calls for dma-direct Avoid expensive indirect calls in the fast path DMA mapping operations by directly calling the dma_direct_* ops if we are using the directly mapped DMA operations. Signed-off-by: Christoph Hellwig Acked-by: Jesper Dangaard Brouer Tested-by: Jesper Dangaard Brouer Tested-by: Tony Luck --- arch/alpha/include/asm/dma-mapping.h | 2 +- arch/arc/mm/cache.c | 2 +- arch/arm/include/asm/dma-mapping.h | 2 +- arch/arm/mm/dma-mapping-nommu.c | 14 +---- arch/arm64/mm/dma-mapping.c | 3 - arch/ia64/hp/common/hwsw_iommu.c | 2 +- arch/ia64/hp/common/sba_iommu.c | 4 +- arch/ia64/kernel/dma-mapping.c | 1 - arch/mips/include/asm/dma-mapping.h | 2 +- arch/parisc/kernel/setup.c | 4 -- arch/sparc/include/asm/dma-mapping.h | 4 +- arch/x86/kernel/pci-dma.c | 2 +- drivers/gpu/drm/vmwgfx/vmwgfx_drv.c | 2 +- drivers/iommu/amd_iommu.c | 13 +--- include/asm-generic/dma-mapping.h | 2 +- include/linux/dma-direct.h | 17 ------ include/linux/dma-mapping.h | 111 ++++++++++++++++++++++++++++++----- include/linux/dma-noncoherent.h | 5 +- kernel/dma/direct.c | 37 +++--------- kernel/dma/mapping.c | 40 ++++++++----- 20 files changed, 150 insertions(+), 119 deletions(-) (limited to 'kernel') diff --git a/arch/alpha/include/asm/dma-mapping.h b/arch/alpha/include/asm/dma-mapping.h index 8beeafd4f68e..0ee6a5c99b16 100644 --- a/arch/alpha/include/asm/dma-mapping.h +++ b/arch/alpha/include/asm/dma-mapping.h @@ -7,7 +7,7 @@ extern const struct dma_map_ops alpha_pci_ops; static inline const struct dma_map_ops *get_arch_dma_ops(struct bus_type *bus) { #ifdef CONFIG_ALPHA_JENSEN - return &dma_direct_ops; + return NULL; #else return &alpha_pci_ops; #endif diff --git a/arch/arc/mm/cache.c b/arch/arc/mm/cache.c index f2701c13a66b..e188bb3ede53 100644 --- a/arch/arc/mm/cache.c +++ b/arch/arc/mm/cache.c @@ -1280,7 +1280,7 @@ void __init arc_cache_init_master(void) /* * In case of IOC (say IOC+SLC case), pointers above could still be set * but end up not being relevant as the first function in chain is not - * called at all for @dma_direct_ops + * called at all for devices using coherent DMA. * arch_sync_dma_for_cpu() -> dma_cache_*() -> __dma_cache_*() */ } diff --git a/arch/arm/include/asm/dma-mapping.h b/arch/arm/include/asm/dma-mapping.h index 965b7c846ecb..31d3b96f0f4b 100644 --- a/arch/arm/include/asm/dma-mapping.h +++ b/arch/arm/include/asm/dma-mapping.h @@ -18,7 +18,7 @@ extern const struct dma_map_ops arm_coherent_dma_ops; static inline const struct dma_map_ops *get_arch_dma_ops(struct bus_type *bus) { - return IS_ENABLED(CONFIG_MMU) ? &arm_dma_ops : &dma_direct_ops; + return IS_ENABLED(CONFIG_MMU) ? &arm_dma_ops : NULL; } #ifdef __arch_page_to_dma diff --git a/arch/arm/mm/dma-mapping-nommu.c b/arch/arm/mm/dma-mapping-nommu.c index 712416ecd8e6..f304b10e23a4 100644 --- a/arch/arm/mm/dma-mapping-nommu.c +++ b/arch/arm/mm/dma-mapping-nommu.c @@ -22,7 +22,7 @@ #include "dma.h" /* - * dma_direct_ops is used if + * The generic direct mapping code is used if * - MMU/MPU is off * - cpu is v7m w/o cache support * - device is coherent @@ -209,16 +209,9 @@ const struct dma_map_ops arm_nommu_dma_ops = { }; EXPORT_SYMBOL(arm_nommu_dma_ops); -static const struct dma_map_ops *arm_nommu_get_dma_map_ops(bool coherent) -{ - return coherent ? &dma_direct_ops : &arm_nommu_dma_ops; -} - void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size, const struct iommu_ops *iommu, bool coherent) { - const struct dma_map_ops *dma_ops; - if (IS_ENABLED(CONFIG_CPU_V7M)) { /* * Cache support for v7m is optional, so can be treated as @@ -234,7 +227,6 @@ void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size, dev->archdata.dma_coherent = (get_cr() & CR_M) ? coherent : true; } - dma_ops = arm_nommu_get_dma_map_ops(dev->archdata.dma_coherent); - - set_dma_ops(dev, dma_ops); + if (!dev->archdata.dma_coherent) + set_dma_ops(dev, &arm_nommu_dma_ops); } diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c index ab1e417204d0..95eda81e3f2d 100644 --- a/arch/arm64/mm/dma-mapping.c +++ b/arch/arm64/mm/dma-mapping.c @@ -462,9 +462,6 @@ static void __iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 size, void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size, const struct iommu_ops *iommu, bool coherent) { - if (!dev->dma_ops) - dev->dma_ops = &dma_direct_ops; - dev->dma_coherent = coherent; __iommu_setup_dma_ops(dev, dma_base, size, iommu); diff --git a/arch/ia64/hp/common/hwsw_iommu.c b/arch/ia64/hp/common/hwsw_iommu.c index f40ca499b246..8840ed97712f 100644 --- a/arch/ia64/hp/common/hwsw_iommu.c +++ b/arch/ia64/hp/common/hwsw_iommu.c @@ -38,7 +38,7 @@ static inline int use_swiotlb(struct device *dev) const struct dma_map_ops *hwsw_dma_get_ops(struct device *dev) { if (use_swiotlb(dev)) - return &dma_direct_ops; + return NULL; return &sba_dma_ops; } EXPORT_SYMBOL(hwsw_dma_get_ops); diff --git a/arch/ia64/hp/common/sba_iommu.c b/arch/ia64/hp/common/sba_iommu.c index 5ee74820a0f6..5a361e51cb1e 100644 --- a/arch/ia64/hp/common/sba_iommu.c +++ b/arch/ia64/hp/common/sba_iommu.c @@ -2078,7 +2078,7 @@ sba_init(void) * a successful kdump kernel boot is to use the swiotlb. */ if (is_kdump_kernel()) { - dma_ops = &dma_direct_ops; + dma_ops = NULL; if (swiotlb_late_init_with_default_size(64 * (1<<20)) != 0) panic("Unable to initialize software I/O TLB:" " Try machvec=dig boot option"); @@ -2100,7 +2100,7 @@ sba_init(void) * If we didn't find something sba_iommu can claim, we * need to setup the swiotlb and switch to the dig machvec. */ - dma_ops = &dma_direct_ops; + dma_ops = NULL; if (swiotlb_late_init_with_default_size(64 * (1<<20)) != 0) panic("Unable to find SBA IOMMU or initialize " "software I/O TLB: Try machvec=dig boot option"); diff --git a/arch/ia64/kernel/dma-mapping.c b/arch/ia64/kernel/dma-mapping.c index 80cd3e1ea95a..ad7d9963de34 100644 --- a/arch/ia64/kernel/dma-mapping.c +++ b/arch/ia64/kernel/dma-mapping.c @@ -36,7 +36,6 @@ long arch_dma_coherent_to_pfn(struct device *dev, void *cpu_addr, void __init swiotlb_dma_init(void) { - dma_ops = &dma_direct_ops; swiotlb_init(1); } #endif diff --git a/arch/mips/include/asm/dma-mapping.h b/arch/mips/include/asm/dma-mapping.h index 69f914667f3e..20dfaad3a55d 100644 --- a/arch/mips/include/asm/dma-mapping.h +++ b/arch/mips/include/asm/dma-mapping.h @@ -11,7 +11,7 @@ static inline const struct dma_map_ops *get_arch_dma_ops(struct bus_type *bus) #if defined(CONFIG_MACH_JAZZ) return &jazz_dma_ops; #else - return &dma_direct_ops; + return NULL; #endif } diff --git a/arch/parisc/kernel/setup.c b/arch/parisc/kernel/setup.c index cd227f1cf629..54818cd78bd0 100644 --- a/arch/parisc/kernel/setup.c +++ b/arch/parisc/kernel/setup.c @@ -99,10 +99,6 @@ void __init dma_ops_init(void) case pcxl2: pa7300lc_init(); - case pcxl: /* falls through */ - case pcxs: - case pcxt: - hppa_dma_ops = &dma_direct_ops; break; default: break; diff --git a/arch/sparc/include/asm/dma-mapping.h b/arch/sparc/include/asm/dma-mapping.h index 55a44f08a9a4..ed32845bd2d2 100644 --- a/arch/sparc/include/asm/dma-mapping.h +++ b/arch/sparc/include/asm/dma-mapping.h @@ -12,11 +12,11 @@ static inline const struct dma_map_ops *get_arch_dma_ops(struct bus_type *bus) { #ifdef CONFIG_SPARC_LEON if (sparc_cpu_model == sparc_leon) - return &dma_direct_ops; + return NULL; #endif #if defined(CONFIG_SPARC32) && defined(CONFIG_PCI) if (bus == &pci_bus_type) - return &dma_direct_ops; + return NULL; #endif return dma_ops; } diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c index f4562fcec681..d460998ae828 100644 --- a/arch/x86/kernel/pci-dma.c +++ b/arch/x86/kernel/pci-dma.c @@ -17,7 +17,7 @@ static bool disable_dac_quirk __read_mostly; -const struct dma_map_ops *dma_ops = &dma_direct_ops; +const struct dma_map_ops *dma_ops; EXPORT_SYMBOL(dma_ops); #ifdef CONFIG_IOMMU_DEBUG diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_drv.c b/drivers/gpu/drm/vmwgfx/vmwgfx_drv.c index 61a84b958d67..50637f372e9f 100644 --- a/drivers/gpu/drm/vmwgfx/vmwgfx_drv.c +++ b/drivers/gpu/drm/vmwgfx/vmwgfx_drv.c @@ -581,7 +581,7 @@ static int vmw_dma_select_mode(struct vmw_private *dev_priv) dev_priv->map_mode = vmw_dma_map_populate; - if (dma_ops->sync_single_for_cpu) + if (dma_ops && dma_ops->sync_single_for_cpu) dev_priv->map_mode = vmw_dma_alloc_coherent; #ifdef CONFIG_SWIOTLB if (swiotlb_nr_tbl() == 0) diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c index c5d6c7c42b0a..567221cca13c 100644 --- a/drivers/iommu/amd_iommu.c +++ b/drivers/iommu/amd_iommu.c @@ -2184,7 +2184,7 @@ static int amd_iommu_add_device(struct device *dev) dev_name(dev)); iommu_ignore_device(dev); - dev->dma_ops = &dma_direct_ops; + dev->dma_ops = NULL; goto out; } init_iommu_group(dev); @@ -2770,17 +2770,6 @@ int __init amd_iommu_init_dma_ops(void) swiotlb = (iommu_pass_through || sme_me_mask) ? 1 : 0; iommu_detected = 1; - /* - * In case we don't initialize SWIOTLB (actually the common case - * when AMD IOMMU is enabled and SME is not active), make sure there - * are global dma_ops set as a fall-back for devices not handled by - * this driver (for example non-PCI devices). When SME is active, - * make sure that swiotlb variable remains set so the global dma_ops - * continue to be SWIOTLB. - */ - if (!swiotlb) - dma_ops = &dma_direct_ops; - if (amd_iommu_unmap_flush) pr_info("AMD-Vi: IO/TLB flush on unmap enabled\n"); else diff --git a/include/asm-generic/dma-mapping.h b/include/asm-generic/dma-mapping.h index 880a292d792f..c13f46109e88 100644 --- a/include/asm-generic/dma-mapping.h +++ b/include/asm-generic/dma-mapping.h @@ -4,7 +4,7 @@ static inline const struct dma_map_ops *get_arch_dma_ops(struct bus_type *bus) { - return &dma_direct_ops; + return NULL; } #endif /* _ASM_GENERIC_DMA_MAPPING_H */ diff --git a/include/linux/dma-direct.h b/include/linux/dma-direct.h index 3b0a3ea3876d..b7338702592a 100644 --- a/include/linux/dma-direct.h +++ b/include/linux/dma-direct.h @@ -60,22 +60,5 @@ void dma_direct_free_pages(struct device *dev, size_t size, void *cpu_addr, struct page *__dma_direct_alloc_pages(struct device *dev, size_t size, dma_addr_t *dma_handle, gfp_t gfp, unsigned long attrs); void __dma_direct_free_pages(struct device *dev, size_t size, struct page *page); -dma_addr_t dma_direct_map_page(struct device *dev, struct page *page, - unsigned long offset, size_t size, enum dma_data_direction dir, - unsigned long attrs); -void dma_direct_unmap_page(struct device *dev, dma_addr_t addr, - size_t size, enum dma_data_direction dir, unsigned long attrs); -int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents, - enum dma_data_direction dir, unsigned long attrs); -void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl, - int nents, enum dma_data_direction dir, unsigned long attrs); -void dma_direct_sync_single_for_device(struct device *dev, - dma_addr_t addr, size_t size, enum dma_data_direction dir); -void dma_direct_sync_sg_for_device(struct device *dev, - struct scatterlist *sgl, int nents, enum dma_data_direction dir); -void dma_direct_sync_single_for_cpu(struct device *dev, - dma_addr_t addr, size_t size, enum dma_data_direction dir); -void dma_direct_sync_sg_for_cpu(struct device *dev, - struct scatterlist *sgl, int nents, enum dma_data_direction dir); int dma_direct_supported(struct device *dev, u64 mask); #endif /* _LINUX_DMA_DIRECT_H */ diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h index 269ee27fc3d9..f422aec0f53c 100644 --- a/include/linux/dma-mapping.h +++ b/include/linux/dma-mapping.h @@ -134,7 +134,6 @@ struct dma_map_ops { #define DMA_MAPPING_ERROR (~(dma_addr_t)0) -extern const struct dma_map_ops dma_direct_ops; extern const struct dma_map_ops dma_virt_ops; extern const struct dma_map_ops dma_dummy_ops; @@ -222,6 +221,69 @@ static inline const struct dma_map_ops *get_dma_ops(struct device *dev) } #endif +static inline bool dma_is_direct(const struct dma_map_ops *ops) +{ + return likely(!ops); +} + +/* + * All the dma_direct_* declarations are here just for the indirect call bypass, + * and must not be used directly drivers! + */ +dma_addr_t dma_direct_map_page(struct device *dev, struct page *page, + unsigned long offset, size_t size, enum dma_data_direction dir, + unsigned long attrs); +int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents, + enum dma_data_direction dir, unsigned long attrs); + +#if defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_DEVICE) || \ + defined(CONFIG_SWIOTLB) +void dma_direct_sync_single_for_device(struct device *dev, + dma_addr_t addr, size_t size, enum dma_data_direction dir); +void dma_direct_sync_sg_for_device(struct device *dev, + struct scatterlist *sgl, int nents, enum dma_data_direction dir); +#else +static inline void dma_direct_sync_single_for_device(struct device *dev, + dma_addr_t addr, size_t size, enum dma_data_direction dir) +{ +} +static inline void dma_direct_sync_sg_for_device(struct device *dev, + struct scatterlist *sgl, int nents, enum dma_data_direction dir) +{ +} +#endif + +#if defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU) || \ + defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL) || \ + defined(CONFIG_SWIOTLB) +void dma_direct_unmap_page(struct device *dev, dma_addr_t addr, + size_t size, enum dma_data_direction dir, unsigned long attrs); +void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl, + int nents, enum dma_data_direction dir, unsigned long attrs); +void dma_direct_sync_single_for_cpu(struct device *dev, + dma_addr_t addr, size_t size, enum dma_data_direction dir); +void dma_direct_sync_sg_for_cpu(struct device *dev, + struct scatterlist *sgl, int nents, enum dma_data_direction dir); +#else +static inline void dma_direct_unmap_page(struct device *dev, dma_addr_t addr, + size_t size, enum dma_data_direction dir, unsigned long attrs) +{ +} +static inline void dma_direct_unmap_sg(struct device *dev, + struct scatterlist *sgl, int nents, enum dma_data_direction dir, + unsigned long attrs) +{ +} +static inline void dma_direct_sync_single_for_cpu(struct device *dev, + dma_addr_t addr, size_t size, enum dma_data_direction dir) +{ +} +static inline void dma_direct_sync_sg_for_cpu(struct device *dev, + struct scatterlist *sgl, int nents, enum dma_data_direction dir) +{ +} +#endif + static inline dma_addr_t dma_map_single_attrs(struct device *dev, void *ptr, size_t size, enum dma_data_direction dir, @@ -232,9 +294,12 @@ static inline dma_addr_t dma_map_single_attrs(struct device *dev, void *ptr, BUG_ON(!valid_dma_direction(dir)); debug_dma_map_single(dev, ptr, size); - addr = ops->map_page(dev, virt_to_page(ptr), - offset_in_page(ptr), size, - dir, attrs); + if (dma_is_direct(ops)) + addr = dma_direct_map_page(dev, virt_to_page(ptr), + offset_in_page(ptr), size, dir, attrs); + else + addr = ops->map_page(dev, virt_to_page(ptr), + offset_in_page(ptr), size, dir, attrs); debug_dma_map_page(dev, virt_to_page(ptr), offset_in_page(ptr), size, dir, addr, true); @@ -249,7 +314,9 @@ static inline void dma_unmap_single_attrs(struct device *dev, dma_addr_t addr, const struct dma_map_ops *ops = get_dma_ops(dev); BUG_ON(!valid_dma_direction(dir)); - if (ops->unmap_page) + if (dma_is_direct(ops)) + dma_direct_unmap_page(dev, addr, size, dir, attrs); + else if (ops->unmap_page) ops->unmap_page(dev, addr, size, dir, attrs); debug_dma_unmap_page(dev, addr, size, dir, true); } @@ -272,7 +339,10 @@ static inline int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg, int ents; BUG_ON(!valid_dma_direction(dir)); - ents = ops->map_sg(dev, sg, nents, dir, attrs); + if (dma_is_direct(ops)) + ents = dma_direct_map_sg(dev, sg, nents, dir, attrs); + else + ents = ops->map_sg(dev, sg, nents, dir, attrs); BUG_ON(ents < 0); debug_dma_map_sg(dev, sg, nents, ents, dir); @@ -287,7 +357,9 @@ static inline void dma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg BUG_ON(!valid_dma_direction(dir)); debug_dma_unmap_sg(dev, sg, nents, dir); - if (ops->unmap_sg) + if (dma_is_direct(ops)) + dma_direct_unmap_sg(dev, sg, nents, dir, attrs); + else if (ops->unmap_sg) ops->unmap_sg(dev, sg, nents, dir, attrs); } @@ -301,7 +373,10 @@ static inline dma_addr_t dma_map_page_attrs(struct device *dev, dma_addr_t addr; BUG_ON(!valid_dma_direction(dir)); - addr = ops->map_page(dev, page, offset, size, dir, attrs); + if (dma_is_direct(ops)) + addr = dma_direct_map_page(dev, page, offset, size, dir, attrs); + else + addr = ops->map_page(dev, page, offset, size, dir, attrs); debug_dma_map_page(dev, page, offset, size, dir, addr, false); return addr; @@ -322,7 +397,7 @@ static inline dma_addr_t dma_map_resource(struct device *dev, BUG_ON(pfn_valid(PHYS_PFN(phys_addr))); addr = phys_addr; - if (ops->map_resource) + if (ops && ops->map_resource) addr = ops->map_resource(dev, phys_addr, size, dir, attrs); debug_dma_map_resource(dev, phys_addr, size, dir, addr); @@ -337,7 +412,7 @@ static inline void dma_unmap_resource(struct device *dev, dma_addr_t addr, const struct dma_map_ops *ops = get_dma_ops(dev); BUG_ON(!valid_dma_direction(dir)); - if (ops->unmap_resource) + if (ops && ops->unmap_resource) ops->unmap_resource(dev, addr, size, dir, attrs); debug_dma_unmap_resource(dev, addr, size, dir); } @@ -349,7 +424,9 @@ static inline void dma_sync_single_for_cpu(struct device *dev, dma_addr_t addr, const struct dma_map_ops *ops = get_dma_ops(dev); BUG_ON(!valid_dma_direction(dir)); - if (ops->sync_single_for_cpu) + if (dma_is_direct(ops)) + dma_direct_sync_single_for_cpu(dev, addr, size, dir); + else if (ops->sync_single_for_cpu) ops->sync_single_for_cpu(dev, addr, size, dir); debug_dma_sync_single_for_cpu(dev, addr, size, dir); } @@ -368,7 +445,9 @@ static inline void dma_sync_single_for_device(struct device *dev, const struct dma_map_ops *ops = get_dma_ops(dev); BUG_ON(!valid_dma_direction(dir)); - if (ops->sync_single_for_device) + if (dma_is_direct(ops)) + dma_direct_sync_single_for_device(dev, addr, size, dir); + else if (ops->sync_single_for_device) ops->sync_single_for_device(dev, addr, size, dir); debug_dma_sync_single_for_device(dev, addr, size, dir); } @@ -387,7 +466,9 @@ dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sg, const struct dma_map_ops *ops = get_dma_ops(dev); BUG_ON(!valid_dma_direction(dir)); - if (ops->sync_sg_for_cpu) + if (dma_is_direct(ops)) + dma_direct_sync_sg_for_cpu(dev, sg, nelems, dir); + else if (ops->sync_sg_for_cpu) ops->sync_sg_for_cpu(dev, sg, nelems, dir); debug_dma_sync_sg_for_cpu(dev, sg, nelems, dir); } @@ -399,7 +480,9 @@ dma_sync_sg_for_device(struct device *dev, struct scatterlist *sg, const struct dma_map_ops *ops = get_dma_ops(dev); BUG_ON(!valid_dma_direction(dir)); - if (ops->sync_sg_for_device) + if (dma_is_direct(ops)) + dma_direct_sync_sg_for_device(dev, sg, nelems, dir); + else if (ops->sync_sg_for_device) ops->sync_sg_for_device(dev, sg, nelems, dir); debug_dma_sync_sg_for_device(dev, sg, nelems, dir); diff --git a/include/linux/dma-noncoherent.h b/include/linux/dma-noncoherent.h index 306557331d7d..69b36ed31a99 100644 --- a/include/linux/dma-noncoherent.h +++ b/include/linux/dma-noncoherent.h @@ -38,7 +38,10 @@ pgprot_t arch_dma_mmap_pgprot(struct device *dev, pgprot_t prot, void arch_dma_cache_sync(struct device *dev, void *vaddr, size_t size, enum dma_data_direction direction); #else -#define arch_dma_cache_sync NULL +static inline void arch_dma_cache_sync(struct device *dev, void *vaddr, + size_t size, enum dma_data_direction direction) +{ +} #endif /* CONFIG_DMA_NONCOHERENT_CACHE_SYNC */ #ifdef CONFIG_ARCH_HAS_SYNC_DMA_FOR_DEVICE diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c index 85d8286a0ba2..79da61b49fa4 100644 --- a/kernel/dma/direct.c +++ b/kernel/dma/direct.c @@ -223,6 +223,7 @@ void dma_direct_sync_single_for_device(struct device *dev, if (!dev_is_dma_coherent(dev)) arch_sync_dma_for_device(dev, paddr, size, dir); } +EXPORT_SYMBOL(dma_direct_sync_single_for_device); void dma_direct_sync_sg_for_device(struct device *dev, struct scatterlist *sgl, int nents, enum dma_data_direction dir) @@ -240,6 +241,7 @@ void dma_direct_sync_sg_for_device(struct device *dev, dir); } } +EXPORT_SYMBOL(dma_direct_sync_sg_for_device); #endif #if defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU) || \ @@ -258,6 +260,7 @@ void dma_direct_sync_single_for_cpu(struct device *dev, if (unlikely(is_swiotlb_buffer(paddr))) swiotlb_tbl_sync_single(dev, paddr, size, dir, SYNC_FOR_CPU); } +EXPORT_SYMBOL(dma_direct_sync_single_for_cpu); void dma_direct_sync_sg_for_cpu(struct device *dev, struct scatterlist *sgl, int nents, enum dma_data_direction dir) @@ -277,6 +280,7 @@ void dma_direct_sync_sg_for_cpu(struct device *dev, if (!dev_is_dma_coherent(dev)) arch_sync_dma_for_cpu_all(dev); } +EXPORT_SYMBOL(dma_direct_sync_sg_for_cpu); void dma_direct_unmap_page(struct device *dev, dma_addr_t addr, size_t size, enum dma_data_direction dir, unsigned long attrs) @@ -289,6 +293,7 @@ void dma_direct_unmap_page(struct device *dev, dma_addr_t addr, if (unlikely(is_swiotlb_buffer(phys))) swiotlb_tbl_unmap_single(dev, phys, size, dir, attrs); } +EXPORT_SYMBOL(dma_direct_unmap_page); void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl, int nents, enum dma_data_direction dir, unsigned long attrs) @@ -300,11 +305,7 @@ void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl, dma_direct_unmap_page(dev, sg->dma_address, sg_dma_len(sg), dir, attrs); } -#else -void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl, - int nents, enum dma_data_direction dir, unsigned long attrs) -{ -} +EXPORT_SYMBOL(dma_direct_unmap_sg); #endif static inline bool dma_direct_possible(struct device *dev, dma_addr_t dma_addr, @@ -331,6 +332,7 @@ dma_addr_t dma_direct_map_page(struct device *dev, struct page *page, arch_sync_dma_for_device(dev, phys, size, dir); return dma_addr; } +EXPORT_SYMBOL(dma_direct_map_page); int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents, enum dma_data_direction dir, unsigned long attrs) @@ -352,6 +354,7 @@ out_unmap: dma_direct_unmap_sg(dev, sgl, i, dir, attrs | DMA_ATTR_SKIP_CPU_SYNC); return 0; } +EXPORT_SYMBOL(dma_direct_map_sg); /* * Because 32-bit DMA masks are so common we expect every architecture to be @@ -372,27 +375,3 @@ int dma_direct_supported(struct device *dev, u64 mask) return mask >= phys_to_dma(dev, min_mask); } - -const struct dma_map_ops dma_direct_ops = { - .alloc = dma_direct_alloc, - .free = dma_direct_free, - .map_page = dma_direct_map_page, - .map_sg = dma_direct_map_sg, -#if defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_DEVICE) || \ - defined(CONFIG_SWIOTLB) - .sync_single_for_device = dma_direct_sync_single_for_device, - .sync_sg_for_device = dma_direct_sync_sg_for_device, -#endif -#if defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU) || \ - defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL) || \ - defined(CONFIG_SWIOTLB) - .sync_single_for_cpu = dma_direct_sync_single_for_cpu, - .sync_sg_for_cpu = dma_direct_sync_sg_for_cpu, - .unmap_page = dma_direct_unmap_page, - .unmap_sg = dma_direct_unmap_sg, -#endif - .get_required_mask = dma_direct_get_required_mask, - .dma_supported = dma_direct_supported, - .cache_sync = arch_dma_cache_sync, -}; -EXPORT_SYMBOL(dma_direct_ops); diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c index 0b18cfbdde95..fc84c81029d9 100644 --- a/kernel/dma/mapping.c +++ b/kernel/dma/mapping.c @@ -7,6 +7,7 @@ */ #include /* for max_pfn */ #include +#include #include #include #include @@ -229,8 +230,8 @@ int dma_get_sgtable_attrs(struct device *dev, struct sg_table *sgt, unsigned long attrs) { const struct dma_map_ops *ops = get_dma_ops(dev); - BUG_ON(!ops); - if (ops->get_sgtable) + + if (!dma_is_direct(ops) && ops->get_sgtable) return ops->get_sgtable(dev, sgt, cpu_addr, dma_addr, size, attrs); return dma_common_get_sgtable(dev, sgt, cpu_addr, dma_addr, size, @@ -293,8 +294,8 @@ int dma_mmap_attrs(struct device *dev, struct vm_area_struct *vma, unsigned long attrs) { const struct dma_map_ops *ops = get_dma_ops(dev); - BUG_ON(!ops); - if (ops->mmap) + + if (!dma_is_direct(ops) && ops->mmap) return ops->mmap(dev, vma, cpu_addr, dma_addr, size, attrs); return dma_common_mmap(dev, vma, cpu_addr, dma_addr, size, attrs); } @@ -324,6 +325,8 @@ u64 dma_get_required_mask(struct device *dev) { const struct dma_map_ops *ops = get_dma_ops(dev); + if (dma_is_direct(ops)) + return dma_direct_get_required_mask(dev); if (ops->get_required_mask) return ops->get_required_mask(dev); return dma_default_get_required_mask(dev); @@ -341,7 +344,6 @@ void *dma_alloc_attrs(struct device *dev, size_t size, dma_addr_t *dma_handle, const struct dma_map_ops *ops = get_dma_ops(dev); void *cpu_addr; - BUG_ON(!ops); WARN_ON_ONCE(dev && !dev->coherent_dma_mask); if (dma_alloc_from_dev_coherent(dev, size, dma_handle, &cpu_addr)) @@ -352,10 +354,14 @@ void *dma_alloc_attrs(struct device *dev, size_t size, dma_addr_t *dma_handle, if (!arch_dma_alloc_attrs(&dev)) return NULL; - if (!ops->alloc) + + if (dma_is_direct(ops)) + cpu_addr = dma_direct_alloc(dev, size, dma_handle, flag, attrs); + else if (ops->alloc) + cpu_addr = ops->alloc(dev, size, dma_handle, flag, attrs); + else return NULL; - cpu_addr = ops->alloc(dev, size, dma_handle, flag, attrs); debug_dma_alloc_coherent(dev, size, *dma_handle, cpu_addr); return cpu_addr; } @@ -366,8 +372,6 @@ void dma_free_attrs(struct device *dev, size_t size, void *cpu_addr, { const struct dma_map_ops *ops = get_dma_ops(dev); - BUG_ON(!ops); - if (dma_release_from_dev_coherent(dev, get_order(size), cpu_addr)) return; /* @@ -379,11 +383,14 @@ void dma_free_attrs(struct device *dev, size_t size, void *cpu_addr, */ WARN_ON(irqs_disabled()); - if (!ops->free || !cpu_addr) + if (!cpu_addr) return; debug_dma_free_coherent(dev, size, cpu_addr, dma_handle); - ops->free(dev, size, cpu_addr, dma_handle, attrs); + if (dma_is_direct(ops)) + dma_direct_free(dev, size, cpu_addr, dma_handle, attrs); + else if (ops->free) + ops->free(dev, size, cpu_addr, dma_handle, attrs); } EXPORT_SYMBOL(dma_free_attrs); @@ -397,9 +404,9 @@ int dma_supported(struct device *dev, u64 mask) { const struct dma_map_ops *ops = get_dma_ops(dev); - if (!ops) - return 0; - if (!ops->dma_supported) + if (dma_is_direct(ops)) + return dma_direct_supported(dev, mask); + if (ops->dma_supported) return 1; return ops->dma_supported(dev, mask); } @@ -437,7 +444,10 @@ void dma_cache_sync(struct device *dev, void *vaddr, size_t size, const struct dma_map_ops *ops = get_dma_ops(dev); BUG_ON(!valid_dma_direction(dir)); - if (ops->cache_sync) + + if (dma_is_direct(ops)) + arch_dma_cache_sync(dev, vaddr, size, dir); + else if (ops->cache_sync) ops->cache_sync(dev, vaddr, size, dir); } EXPORT_SYMBOL(dma_cache_sync); -- cgit v1.3-7-g2ca7 From 9f8c1c5712954f9d8877ac55b18adbdf03e51e1f Mon Sep 17 00:00:00 2001 From: Daniel Borkmann Date: Wed, 12 Dec 2018 10:45:38 +0100 Subject: bpf: remove obsolete prog->aux sanitation in bpf_insn_prepare_dump This logic is not needed anymore since we got rid of the verifier rewrite that was using prog->aux address in f6069b9aa993 ("bpf: fix redirect to map under tail calls"). Signed-off-by: Daniel Borkmann Signed-off-by: Alexei Starovoitov --- kernel/bpf/syscall.c | 7 ------- 1 file changed, 7 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 7f1410d6fbe9..6ae062f1cf20 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -2032,13 +2032,6 @@ static struct bpf_insn *bpf_insn_prepare_dump(const struct bpf_prog *prog) insns[i + 1].imm = 0; continue; } - - if (!bpf_dump_raw_ok() && - imm == (unsigned long)prog->aux) { - insns[i].imm = 0; - insns[i + 1].imm = 0; - continue; - } } return insns; -- cgit v1.3-7-g2ca7 From 319deec7db6c0aab276d2447f778e7cffed24c7c Mon Sep 17 00:00:00 2001 From: Tycho Andersen Date: Wed, 12 Dec 2018 19:46:54 -0700 Subject: seccomp: fix poor type promotion sparse complains, kernel/seccomp.c:1172:13: warning: incorrect type in assignment (different base types) kernel/seccomp.c:1172:13: expected restricted __poll_t [usertype] ret kernel/seccomp.c:1172:13: got int kernel/seccomp.c:1173:13: warning: restricted __poll_t degrades to integer Instead of assigning this to ret, since we don't use this anywhere, let's just test it against 0 directly. Signed-off-by: Tycho Andersen Reported-by: 0day robot Fixes: 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace") Signed-off-by: Kees Cook --- kernel/seccomp.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/seccomp.c b/kernel/seccomp.c index 15b6be97fc09..d7f538847b84 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -1169,8 +1169,7 @@ static __poll_t seccomp_notify_poll(struct file *file, poll_wait(file, &filter->notif->wqh, poll_tab); - ret = mutex_lock_interruptible(&filter->notify_lock); - if (ret < 0) + if (mutex_lock_interruptible(&filter->notify_lock) < 0) return EPOLLERR; list_for_each_entry(cur, &filter->notif->notifications, list) { -- cgit v1.3-7-g2ca7 From d406db524c32ca35bd85cada28a547fff3115715 Mon Sep 17 00:00:00 2001 From: YueHaibing Date: Sun, 9 Dec 2018 14:25:02 +0800 Subject: audit: remove duplicated include from audit.c Remove duplicated include. Signed-off-by: YueHaibing Signed-off-by: Paul Moore --- kernel/audit.c | 1 - 1 file changed, 1 deletion(-) (limited to 'kernel') diff --git a/kernel/audit.c b/kernel/audit.c index a0a4544e69ca..632d36059556 100644 --- a/kernel/audit.c +++ b/kernel/audit.c @@ -60,7 +60,6 @@ #include #include #include -#include #include -- cgit v1.3-7-g2ca7 From 5439c985c5a83a8419f762115afdf560ab72a452 Mon Sep 17 00:00:00 2001 From: Vincent Whitchurch Date: Fri, 14 Dec 2018 17:05:54 +0100 Subject: module: Overwrite st_size instead of st_info st_info is currently overwritten after relocation and used to store the elf_type(). However, we're going to need it fix kallsyms on ARM's Thumb-2 kernels, so preserve st_info and overwrite the st_size field instead. st_size is neither used by the module core nor by any architecture. Reviewed-by: Miroslav Benes Reviewed-by: Dave Martin Signed-off-by: Vincent Whitchurch Signed-off-by: Jessica Yu --- kernel/module.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/module.c b/kernel/module.c index 1b5edf78694c..b36ff8a3d562 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -2684,7 +2684,7 @@ static void add_kallsyms(struct module *mod, const struct load_info *info) /* Set types up while we still have access to sections. */ for (i = 0; i < mod->kallsyms->num_symtab; i++) - mod->kallsyms->symtab[i].st_info + mod->kallsyms->symtab[i].st_size = elf_type(&mod->kallsyms->symtab[i], info); /* Now populate the cut down core kallsyms for after init. */ @@ -4070,7 +4070,7 @@ int module_get_kallsym(unsigned int symnum, unsigned long *value, char *type, kallsyms = rcu_dereference_sched(mod->kallsyms); if (symnum < kallsyms->num_symtab) { *value = kallsyms->symtab[symnum].st_value; - *type = kallsyms->symtab[symnum].st_info; + *type = kallsyms->symtab[symnum].st_size; strlcpy(name, kallsyms_symbol_name(kallsyms, symnum), KSYM_NAME_LEN); strlcpy(module_name, mod->name, MODULE_NAME_LEN); *exported = is_exported(name, *value, mod); -- cgit v1.3-7-g2ca7 From 93d77e7f1410c366050d6035dcba1a5167c7cf0b Mon Sep 17 00:00:00 2001 From: Vincent Whitchurch Date: Fri, 14 Dec 2018 17:05:55 +0100 Subject: ARM: module: Fix function kallsyms on Thumb-2 Thumb-2 functions have the lowest bit set in the symbol value in the symtab. When kallsyms are generated for the vmlinux, the kallsyms are generated from the output of nm, and nm clears the lowest bit. $ arm-linux-gnueabihf-readelf -a vmlinux | grep show_interrupts 95947: 8015dc89 686 FUNC GLOBAL DEFAULT 2 show_interrupts $ arm-linux-gnueabihf-nm vmlinux | grep show_interrupts 8015dc88 T show_interrupts $ cat /proc/kallsyms | grep show_interrupts 8015dc88 T show_interrupts However, for modules, the kallsyms uses the values in the symbol table without modification, so for functions in modules, the lowest bit is set in kallsyms. $ arm-linux-gnueabihf-readelf -a drivers/net/tun.ko | grep tun_get_socket 333: 00002d4d 36 FUNC GLOBAL DEFAULT 1 tun_get_socket $ arm-linux-gnueabihf-nm drivers/net/tun.ko | grep tun_get_socket 00002d4c T tun_get_socket $ cat /proc/kallsyms | grep tun_get_socket 7f802d4d t tun_get_socket [tun] Because of this, the symbol+offset of the crashing instruction shown in oopses is incorrect when the crash is in a module. For example, given a tun_get_socket which starts like this, 00002d4c : 2d4c: 6943 ldr r3, [r0, #20] 2d4e: 4a07 ldr r2, [pc, #28] 2d50: 4293 cmp r3, r2 a crash when tun_get_socket is called with NULL results in: PC is at tun_xdp+0xa3/0xa4 [tun] pc : [<7f802d4c>] As can be seen, the "PC is at" line reports the wrong symbol name, and the symbol+offset will point to the wrong source line if it is passed to gdb. To solve this, add a way for archs to fixup the reading of these module kallsyms values, and use that to clear the lowest bit for function symbols on Thumb-2. After the fix: # cat /proc/kallsyms | grep tun_get_socket 7f802d4c t tun_get_socket [tun] PC is at tun_get_socket+0x0/0x24 [tun] pc : [<7f802d4c>] Signed-off-by: Vincent Whitchurch Signed-off-by: Jessica Yu --- arch/arm/include/asm/module.h | 11 +++++++++++ include/linux/module.h | 7 +++++++ kernel/module.c | 43 +++++++++++++++++++++++++++---------------- 3 files changed, 45 insertions(+), 16 deletions(-) (limited to 'kernel') diff --git a/arch/arm/include/asm/module.h b/arch/arm/include/asm/module.h index 9e81b7c498d8..182163b55546 100644 --- a/arch/arm/include/asm/module.h +++ b/arch/arm/include/asm/module.h @@ -61,4 +61,15 @@ u32 get_module_plt(struct module *mod, unsigned long loc, Elf32_Addr val); MODULE_ARCH_VERMAGIC_ARMTHUMB \ MODULE_ARCH_VERMAGIC_P2V +#ifdef CONFIG_THUMB2_KERNEL +#define HAVE_ARCH_KALLSYMS_SYMBOL_VALUE +static inline unsigned long kallsyms_symbol_value(const Elf_Sym *sym) +{ + if (ELF_ST_TYPE(sym->st_info) == STT_FUNC) + return sym->st_value & ~1; + + return sym->st_value; +} +#endif + #endif /* _ASM_ARM_MODULE_H */ diff --git a/include/linux/module.h b/include/linux/module.h index fce6b4335e36..c0b4b7840b57 100644 --- a/include/linux/module.h +++ b/include/linux/module.h @@ -486,6 +486,13 @@ struct module { #define MODULE_ARCH_INIT {} #endif +#ifndef HAVE_ARCH_KALLSYMS_SYMBOL_VALUE +static inline unsigned long kallsyms_symbol_value(const Elf_Sym *sym) +{ + return sym->st_value; +} +#endif + extern struct mutex module_mutex; /* FIXME: It'd be nice to isolate modules during init, too, so they diff --git a/kernel/module.c b/kernel/module.c index b36ff8a3d562..164bf201eae4 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -3928,7 +3928,7 @@ static const char *find_kallsyms_symbol(struct module *mod, unsigned long *offset) { unsigned int i, best = 0; - unsigned long nextval; + unsigned long nextval, bestval; struct mod_kallsyms *kallsyms = rcu_dereference_sched(mod->kallsyms); /* At worse, next value is at end of module */ @@ -3937,10 +3937,15 @@ static const char *find_kallsyms_symbol(struct module *mod, else nextval = (unsigned long)mod->core_layout.base+mod->core_layout.text_size; + bestval = kallsyms_symbol_value(&kallsyms->symtab[best]); + /* Scan for closest preceding symbol, and next symbol. (ELF starts real symbols at 1). */ for (i = 1; i < kallsyms->num_symtab; i++) { - if (kallsyms->symtab[i].st_shndx == SHN_UNDEF) + const Elf_Sym *sym = &kallsyms->symtab[i]; + unsigned long thisval = kallsyms_symbol_value(sym); + + if (sym->st_shndx == SHN_UNDEF) continue; /* We ignore unnamed symbols: they're uninformative @@ -3949,21 +3954,21 @@ static const char *find_kallsyms_symbol(struct module *mod, || is_arm_mapping_symbol(kallsyms_symbol_name(kallsyms, i))) continue; - if (kallsyms->symtab[i].st_value <= addr - && kallsyms->symtab[i].st_value > kallsyms->symtab[best].st_value) + if (thisval <= addr && thisval > bestval) { best = i; - if (kallsyms->symtab[i].st_value > addr - && kallsyms->symtab[i].st_value < nextval) - nextval = kallsyms->symtab[i].st_value; + bestval = thisval; + } + if (thisval > addr && thisval < nextval) + nextval = thisval; } if (!best) return NULL; if (size) - *size = nextval - kallsyms->symtab[best].st_value; + *size = nextval - bestval; if (offset) - *offset = addr - kallsyms->symtab[best].st_value; + *offset = addr - bestval; return kallsyms_symbol_name(kallsyms, best); } @@ -4069,8 +4074,10 @@ int module_get_kallsym(unsigned int symnum, unsigned long *value, char *type, continue; kallsyms = rcu_dereference_sched(mod->kallsyms); if (symnum < kallsyms->num_symtab) { - *value = kallsyms->symtab[symnum].st_value; - *type = kallsyms->symtab[symnum].st_size; + const Elf_Sym *sym = &kallsyms->symtab[symnum]; + + *value = kallsyms_symbol_value(sym); + *type = sym->st_size; strlcpy(name, kallsyms_symbol_name(kallsyms, symnum), KSYM_NAME_LEN); strlcpy(module_name, mod->name, MODULE_NAME_LEN); *exported = is_exported(name, *value, mod); @@ -4089,10 +4096,13 @@ static unsigned long find_kallsyms_symbol_value(struct module *mod, const char * unsigned int i; struct mod_kallsyms *kallsyms = rcu_dereference_sched(mod->kallsyms); - for (i = 0; i < kallsyms->num_symtab; i++) + for (i = 0; i < kallsyms->num_symtab; i++) { + const Elf_Sym *sym = &kallsyms->symtab[i]; + if (strcmp(name, kallsyms_symbol_name(kallsyms, i)) == 0 && - kallsyms->symtab[i].st_shndx != SHN_UNDEF) - return kallsyms->symtab[i].st_value; + sym->st_shndx != SHN_UNDEF) + return kallsyms_symbol_value(sym); + } return 0; } @@ -4137,12 +4147,13 @@ int module_kallsyms_on_each_symbol(int (*fn)(void *, const char *, if (mod->state == MODULE_STATE_UNFORMED) continue; for (i = 0; i < kallsyms->num_symtab; i++) { + const Elf_Sym *sym = &kallsyms->symtab[i]; - if (kallsyms->symtab[i].st_shndx == SHN_UNDEF) + if (sym->st_shndx == SHN_UNDEF) continue; ret = fn(data, kallsyms_symbol_name(kallsyms, i), - mod, kallsyms->symtab[i].st_value); + mod, kallsyms_symbol_value(sym)); if (ret != 0) return ret; } -- cgit v1.3-7-g2ca7 From 23127b33ec80e656921362d7dc82a0064bac20a2 Mon Sep 17 00:00:00 2001 From: Martin KaFai Lau Date: Thu, 13 Dec 2018 10:41:46 -0800 Subject: bpf: Create a new btf_name_by_offset() for non type name use case The current btf_name_by_offset() is returning "(anon)" type name for the offset == 0 case and "(invalid-name-offset)" for the out-of-bound offset case. It fits well for the internal BTF verbose log purpose which is focusing on type. For example, offset == 0 => "(anon)" => anonymous type/name. Returning non-NULL for the bad offset case is needed during the BTF verification process because the BTF verifier may complain about another field first before discovering the name_off is invalid. However, it may not be ideal for the newer use case which does not necessary mean type name. For example, when logging line_info in the BPF verifier in the next patch, it is better to log an empty src line instead of logging "(anon)". The existing bpf_name_by_offset() is renamed to __bpf_name_by_offset() and static to btf.c. A new bpf_name_by_offset() is added for generic context usage. It returns "\0" for name_off == 0 (note that btf->strings[0] is "\0") and NULL for invalid offset. It allows the caller to decide what is the best output in its context. The new btf_name_by_offset() is overlapped with btf_name_offset_valid(). Hence, btf_name_offset_valid() is removed from btf.h to keep the btf.h API minimal. The existing btf_name_offset_valid() usage in btf.c could also be replaced later. Signed-off-by: Martin KaFai Lau Acked-by: Yonghong Song Signed-off-by: Alexei Starovoitov --- include/linux/btf.h | 1 - kernel/bpf/btf.c | 31 ++++++++++++++++++++----------- kernel/bpf/verifier.c | 4 ++-- 3 files changed, 22 insertions(+), 14 deletions(-) (limited to 'kernel') diff --git a/include/linux/btf.h b/include/linux/btf.h index a4cf075b89eb..58000d7e06e3 100644 --- a/include/linux/btf.h +++ b/include/linux/btf.h @@ -46,7 +46,6 @@ void btf_type_seq_show(const struct btf *btf, u32 type_id, void *obj, struct seq_file *m); int btf_get_fd_by_id(u32 id); u32 btf_id(const struct btf *btf); -bool btf_name_offset_valid(const struct btf *btf, u32 offset); bool btf_type_is_reg_int(const struct btf_type *t, u32 expected_size); #ifdef CONFIG_BPF_SYSCALL diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index 1545ddfb6fa5..8fa0bf1c33fd 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -474,7 +474,7 @@ static bool btf_name_valid_identifier(const struct btf *btf, u32 offset) return !*src; } -const char *btf_name_by_offset(const struct btf *btf, u32 offset) +static const char *__btf_name_by_offset(const struct btf *btf, u32 offset) { if (!offset) return "(anon)"; @@ -484,6 +484,14 @@ const char *btf_name_by_offset(const struct btf *btf, u32 offset) return "(invalid-name-offset)"; } +const char *btf_name_by_offset(const struct btf *btf, u32 offset) +{ + if (offset < btf->hdr.str_len) + return &btf->strings[offset]; + + return NULL; +} + const struct btf_type *btf_type_by_id(const struct btf *btf, u32 type_id) { if (type_id > btf->nr_types) @@ -576,7 +584,7 @@ __printf(4, 5) static void __btf_verifier_log_type(struct btf_verifier_env *env, __btf_verifier_log(log, "[%u] %s %s%s", env->log_type_id, btf_kind_str[kind], - btf_name_by_offset(btf, t->name_off), + __btf_name_by_offset(btf, t->name_off), log_details ? " " : ""); if (log_details) @@ -620,7 +628,7 @@ static void btf_verifier_log_member(struct btf_verifier_env *env, btf_verifier_log_type(env, struct_type, NULL); __btf_verifier_log(log, "\t%s type_id=%u bits_offset=%u", - btf_name_by_offset(btf, member->name_off), + __btf_name_by_offset(btf, member->name_off), member->type, member->offset); if (fmt && *fmt) { @@ -1872,7 +1880,7 @@ static s32 btf_enum_check_meta(struct btf_verifier_env *env, btf_verifier_log(env, "\t%s val=%d\n", - btf_name_by_offset(btf, enums[i].name_off), + __btf_name_by_offset(btf, enums[i].name_off), enums[i].val); } @@ -1896,7 +1904,8 @@ static void btf_enum_seq_show(const struct btf *btf, const struct btf_type *t, for (i = 0; i < nr_enums; i++) { if (v == enums[i].val) { seq_printf(m, "%s", - btf_name_by_offset(btf, enums[i].name_off)); + __btf_name_by_offset(btf, + enums[i].name_off)); return; } } @@ -1954,20 +1963,20 @@ static void btf_func_proto_log(struct btf_verifier_env *env, } btf_verifier_log(env, "%u %s", args[0].type, - btf_name_by_offset(env->btf, - args[0].name_off)); + __btf_name_by_offset(env->btf, + args[0].name_off)); for (i = 1; i < nr_args - 1; i++) btf_verifier_log(env, ", %u %s", args[i].type, - btf_name_by_offset(env->btf, - args[i].name_off)); + __btf_name_by_offset(env->btf, + args[i].name_off)); if (nr_args > 1) { const struct btf_param *last_arg = &args[nr_args - 1]; if (last_arg->type) btf_verifier_log(env, ", %u %s", last_arg->type, - btf_name_by_offset(env->btf, - last_arg->name_off)); + __btf_name_by_offset(env->btf, + last_arg->name_off)); else btf_verifier_log(env, ", vararg"); } diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 8b511a4fe84a..89ce2613fdb0 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -4910,8 +4910,8 @@ static int check_btf_line(struct bpf_verifier_env *env, goto err_free; } - if (!btf_name_offset_valid(btf, linfo[i].line_off) || - !btf_name_offset_valid(btf, linfo[i].file_name_off)) { + if (!btf_name_by_offset(btf, linfo[i].line_off) || + !btf_name_by_offset(btf, linfo[i].file_name_off)) { verbose(env, "Invalid line_info[%u].line_off or .file_name_off\n", i); err = -EINVAL; goto err_free; -- cgit v1.3-7-g2ca7 From d9762e84ede3eae9636f5dbbe0c8f0390d37e114 Mon Sep 17 00:00:00 2001 From: Martin KaFai Lau Date: Thu, 13 Dec 2018 10:41:48 -0800 Subject: bpf: verbose log bpf_line_info in verifier This patch adds bpf_line_info during the verifier's verbose. It can give error context for debug purpose. ~~~~~~~~~~ Here is the verbose log for backedge: while (a) { a += bpf_get_smp_processor_id(); bpf_trace_printk(fmt, sizeof(fmt), a); } ~> bpftool prog load ./test_loop.o /sys/fs/bpf/test_loop type tracepoint 13: while (a) { 3: a += bpf_get_smp_processor_id(); back-edge from insn 13 to 3 ~~~~~~~~~~ Here is the verbose log for invalid pkt access: Modification to test_xdp_noinline.c: data = (void *)(long)xdp->data; data_end = (void *)(long)xdp->data_end; /* if (data + 4 > data_end) return XDP_DROP; */ *(u32 *)data = dst->dst; ~> bpftool prog load ./test_xdp_noinline.o /sys/fs/bpf/test_xdp_noinline type xdp ; data = (void *)(long)xdp->data; 224: (79) r2 = *(u64 *)(r10 -112) 225: (61) r2 = *(u32 *)(r2 +0) ; *(u32 *)data = dst->dst; 226: (63) *(u32 *)(r2 +0) = r1 invalid access to packet, off=0 size=4, R2(id=0,off=0,r=0) R2 offset is outside of the packet Signed-off-by: Martin KaFai Lau Acked-by: Yonghong Song Signed-off-by: Alexei Starovoitov --- include/linux/bpf_verifier.h | 1 + kernel/bpf/verifier.c | 74 +++++++++++++++++++++++++++++++++++++++++--- 2 files changed, 70 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h index c736945be7c5..548dcbdb7111 100644 --- a/include/linux/bpf_verifier.h +++ b/include/linux/bpf_verifier.h @@ -224,6 +224,7 @@ struct bpf_verifier_env { bool allow_ptr_leaks; bool seen_direct_write; struct bpf_insn_aux_data *insn_aux_data; /* array of per-insn state */ + const struct bpf_line_info *prev_linfo; struct bpf_verifier_log log; struct bpf_subprog_info subprog_info[BPF_MAX_SUBPROGS + 1]; u32 subprog_cnt; diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 89ce2613fdb0..ba8e3134bbc2 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -26,6 +26,7 @@ #include #include #include +#include #include "disasm.h" @@ -216,6 +217,27 @@ struct bpf_call_arg_meta { static DEFINE_MUTEX(bpf_verifier_lock); +static const struct bpf_line_info * +find_linfo(const struct bpf_verifier_env *env, u32 insn_off) +{ + const struct bpf_line_info *linfo; + const struct bpf_prog *prog; + u32 i, nr_linfo; + + prog = env->prog; + nr_linfo = prog->aux->nr_linfo; + + if (!nr_linfo || insn_off >= prog->len) + return NULL; + + linfo = prog->aux->linfo; + for (i = 1; i < nr_linfo; i++) + if (insn_off < linfo[i].insn_off) + break; + + return &linfo[i - 1]; +} + void bpf_verifier_vlog(struct bpf_verifier_log *log, const char *fmt, va_list args) { @@ -266,6 +288,42 @@ __printf(2, 3) static void verbose(void *private_data, const char *fmt, ...) va_end(args); } +static const char *ltrim(const char *s) +{ + while (isspace(*s)) + s++; + + return s; +} + +__printf(3, 4) static void verbose_linfo(struct bpf_verifier_env *env, + u32 insn_off, + const char *prefix_fmt, ...) +{ + const struct bpf_line_info *linfo; + + if (!bpf_verifier_log_needed(&env->log)) + return; + + linfo = find_linfo(env, insn_off); + if (!linfo || linfo == env->prev_linfo) + return; + + if (prefix_fmt) { + va_list args; + + va_start(args, prefix_fmt); + bpf_verifier_vlog(&env->log, prefix_fmt, args); + va_end(args); + } + + verbose(env, "%s\n", + ltrim(btf_name_by_offset(env->prog->aux->btf, + linfo->line_off))); + + env->prev_linfo = linfo; +} + static bool type_is_pkt_pointer(enum bpf_reg_type type) { return type == PTR_TO_PACKET || @@ -4561,6 +4619,7 @@ static int push_insn(int t, int w, int e, struct bpf_verifier_env *env) return 0; if (w < 0 || w >= env->prog->len) { + verbose_linfo(env, t, "%d: ", t); verbose(env, "jump out of range from insn %d to %d\n", t, w); return -EINVAL; } @@ -4578,6 +4637,8 @@ static int push_insn(int t, int w, int e, struct bpf_verifier_env *env) insn_stack[cur_stack++] = w; return 1; } else if ((insn_state[w] & 0xF0) == DISCOVERED) { + verbose_linfo(env, t, "%d: ", t); + verbose_linfo(env, w, "%d: ", w); verbose(env, "back-edge from insn %d to %d\n", t, w); return -EINVAL; } else if (insn_state[w] == EXPLORED) { @@ -4600,10 +4661,6 @@ static int check_cfg(struct bpf_verifier_env *env) int ret = 0; int i, t; - ret = check_subprogs(env); - if (ret < 0) - return ret; - insn_state = kcalloc(insn_cnt, sizeof(int), GFP_KERNEL); if (!insn_state) return -ENOMEM; @@ -5448,6 +5505,8 @@ static int do_check(struct bpf_verifier_env *env) int insn_processed = 0; bool do_print_state = false; + env->prev_linfo = NULL; + state = kzalloc(sizeof(struct bpf_verifier_state), GFP_KERNEL); if (!state) return -ENOMEM; @@ -5521,6 +5580,7 @@ static int do_check(struct bpf_verifier_env *env) .private_data = env, }; + verbose_linfo(env, insn_idx, "; "); verbose(env, "%d: ", insn_idx); print_bpf_insn(&cbs, insn, env->allow_ptr_leaks); } @@ -6755,7 +6815,7 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr, env->allow_ptr_leaks = capable(CAP_SYS_ADMIN); - ret = check_cfg(env); + ret = check_subprogs(env); if (ret < 0) goto skip_full_check; @@ -6763,6 +6823,10 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr, if (ret < 0) goto skip_full_check; + ret = check_cfg(env); + if (ret < 0) + goto skip_full_check; + ret = do_check(env); if (env->cur_state) { free_verifier_state(env->cur_state, true); -- cgit v1.3-7-g2ca7 From b233920c97a6201eb47fbafd78365c6946e6d7b6 Mon Sep 17 00:00:00 2001 From: Alexei Starovoitov Date: Thu, 13 Dec 2018 11:42:31 -0800 Subject: bpf: speed up stacksafe check Don't check the same stack liveness condition 8 times. once is enough. Signed-off-by: Alexei Starovoitov Acked-by: Edward Cree Acked-by: Jakub Kicinski Signed-off-by: Daniel Borkmann --- kernel/bpf/verifier.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index ba8e3134bbc2..0c6deb3f2be4 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -5204,9 +5204,11 @@ static bool stacksafe(struct bpf_func_state *old, for (i = 0; i < old->allocated_stack; i++) { spi = i / BPF_REG_SIZE; - if (!(old->stack[spi].spilled_ptr.live & REG_LIVE_READ)) + if (!(old->stack[spi].spilled_ptr.live & REG_LIVE_READ)) { + i += BPF_REG_SIZE - 1; /* explored state didn't use this */ continue; + } if (old->stack[spi].slot_type[i % BPF_REG_SIZE] == STACK_INVALID) continue; -- cgit v1.3-7-g2ca7 From 19e2dbb7dd978d24505e918ac54d6f7dfdc88b1d Mon Sep 17 00:00:00 2001 From: Alexei Starovoitov Date: Thu, 13 Dec 2018 11:42:33 -0800 Subject: bpf: improve stacksafe state comparison "if (old->allocated_stack > cur->allocated_stack)" check is too conservative. In some cases explored stack could have allocated more space, but that stack space was not live. The test case improves from 19 to 15 processed insns and improvement on real programs is significant as well: before after bpf_lb-DLB_L3.o 1940 1831 bpf_lb-DLB_L4.o 3089 3029 bpf_lb-DUNKNOWN.o 1065 1064 bpf_lxc-DDROP_ALL.o 28052 26309 bpf_lxc-DUNKNOWN.o 35487 33517 bpf_netdev.o 10864 9713 bpf_overlay.o 6643 6184 bpf_lcx_jit.o 38437 37335 Signed-off-by: Alexei Starovoitov Acked-by: Edward Cree Acked-by: Jakub Kicinski Signed-off-by: Daniel Borkmann --- kernel/bpf/verifier.c | 13 +++++++------ tools/testing/selftests/bpf/test_verifier.c | 22 ++++++++++++++++++++++ 2 files changed, 29 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 0c6deb3f2be4..e4724fe8120f 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -5191,12 +5191,6 @@ static bool stacksafe(struct bpf_func_state *old, { int i, spi; - /* if explored stack has more populated slots than current stack - * such stacks are not equivalent - */ - if (old->allocated_stack > cur->allocated_stack) - return false; - /* walk slots of the explored stack and ignore any additional * slots in the current stack, since explored(safe) state * didn't use them @@ -5212,6 +5206,13 @@ static bool stacksafe(struct bpf_func_state *old, if (old->stack[spi].slot_type[i % BPF_REG_SIZE] == STACK_INVALID) continue; + + /* explored stack has more populated slots than current stack + * and these slots were used + */ + if (i >= cur->allocated_stack) + return false; + /* if old state was safe with misc data in the stack * it will be safe with zero-initialized stack. * The opposite is not true diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c index 82359cdbc805..f9de7fe0c26d 100644 --- a/tools/testing/selftests/bpf/test_verifier.c +++ b/tools/testing/selftests/bpf/test_verifier.c @@ -13647,6 +13647,28 @@ static struct bpf_test tests[] = { .prog_type = BPF_PROG_TYPE_SCHED_CLS, .result = ACCEPT, }, + { + "allocated_stack", + .insns = { + BPF_ALU64_REG(BPF_MOV, BPF_REG_6, BPF_REG_1), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_get_prandom_u32), + BPF_ALU64_REG(BPF_MOV, BPF_REG_7, BPF_REG_0), + BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 5), + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_6, -8), + BPF_LDX_MEM(BPF_DW, BPF_REG_6, BPF_REG_10, -8), + BPF_STX_MEM(BPF_B, BPF_REG_10, BPF_REG_7, -9), + BPF_LDX_MEM(BPF_B, BPF_REG_7, BPF_REG_10, -9), + BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 0), + BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 0), + BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 0), + BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 0), + BPF_EXIT_INSN(), + }, + .result = ACCEPT, + .result_unpriv = ACCEPT, + .insn_processed = 15, + }, { "reference tracking in call: free reference in subprog and outside", .insns = { -- cgit v1.3-7-g2ca7 From 9242b5f5615c823bfc1e9aea284617ff25a55f10 Mon Sep 17 00:00:00 2001 From: Alexei Starovoitov Date: Thu, 13 Dec 2018 11:42:34 -0800 Subject: bpf: add self-check logic to liveness analysis Introduce REG_LIVE_DONE to check the liveness propagation and prepare the states for merging. See algorithm description in clean_live_states(). Signed-off-by: Alexei Starovoitov Acked-by: Jakub Kicinski Signed-off-by: Daniel Borkmann --- include/linux/bpf_verifier.h | 1 + kernel/bpf/verifier.c | 108 ++++++++++++++++++++++++++++++++++++++++++- 2 files changed, 108 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h index 548dcbdb7111..c233efc106c6 100644 --- a/include/linux/bpf_verifier.h +++ b/include/linux/bpf_verifier.h @@ -38,6 +38,7 @@ enum bpf_reg_liveness { REG_LIVE_NONE = 0, /* reg hasn't been read or written this branch */ REG_LIVE_READ, /* reg was read, so we're sensitive to initial value */ REG_LIVE_WRITTEN, /* reg was written first, screening off later reads */ + REG_LIVE_DONE = 4, /* liveness won't be updating this register anymore */ }; struct bpf_reg_state { diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index e4724fe8120f..0125731e2512 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -397,12 +397,14 @@ static char slot_type_char[] = { static void print_liveness(struct bpf_verifier_env *env, enum bpf_reg_liveness live) { - if (live & (REG_LIVE_READ | REG_LIVE_WRITTEN)) + if (live & (REG_LIVE_READ | REG_LIVE_WRITTEN | REG_LIVE_DONE)) verbose(env, "_"); if (live & REG_LIVE_READ) verbose(env, "r"); if (live & REG_LIVE_WRITTEN) verbose(env, "w"); + if (live & REG_LIVE_DONE) + verbose(env, "D"); } static struct bpf_func_state *func(struct bpf_verifier_env *env, @@ -1132,6 +1134,12 @@ static int mark_reg_read(struct bpf_verifier_env *env, /* if read wasn't screened by an earlier write ... */ if (writes && state->live & REG_LIVE_WRITTEN) break; + if (parent->live & REG_LIVE_DONE) { + verbose(env, "verifier BUG type %s var_off %lld off %d\n", + reg_type_str[parent->type], + parent->var_off.value, parent->off); + return -EFAULT; + } /* ... then we depend on parent's value */ parent->live |= REG_LIVE_READ; state = parent; @@ -5078,6 +5086,102 @@ static bool check_ids(u32 old_id, u32 cur_id, struct idpair *idmap) return false; } +static void clean_func_state(struct bpf_verifier_env *env, + struct bpf_func_state *st) +{ + enum bpf_reg_liveness live; + int i, j; + + for (i = 0; i < BPF_REG_FP; i++) { + live = st->regs[i].live; + /* liveness must not touch this register anymore */ + st->regs[i].live |= REG_LIVE_DONE; + if (!(live & REG_LIVE_READ)) + /* since the register is unused, clear its state + * to make further comparison simpler + */ + __mark_reg_not_init(&st->regs[i]); + } + + for (i = 0; i < st->allocated_stack / BPF_REG_SIZE; i++) { + live = st->stack[i].spilled_ptr.live; + /* liveness must not touch this stack slot anymore */ + st->stack[i].spilled_ptr.live |= REG_LIVE_DONE; + if (!(live & REG_LIVE_READ)) { + __mark_reg_not_init(&st->stack[i].spilled_ptr); + for (j = 0; j < BPF_REG_SIZE; j++) + st->stack[i].slot_type[j] = STACK_INVALID; + } + } +} + +static void clean_verifier_state(struct bpf_verifier_env *env, + struct bpf_verifier_state *st) +{ + int i; + + if (st->frame[0]->regs[0].live & REG_LIVE_DONE) + /* all regs in this state in all frames were already marked */ + return; + + for (i = 0; i <= st->curframe; i++) + clean_func_state(env, st->frame[i]); +} + +/* the parentage chains form a tree. + * the verifier states are added to state lists at given insn and + * pushed into state stack for future exploration. + * when the verifier reaches bpf_exit insn some of the verifer states + * stored in the state lists have their final liveness state already, + * but a lot of states will get revised from liveness point of view when + * the verifier explores other branches. + * Example: + * 1: r0 = 1 + * 2: if r1 == 100 goto pc+1 + * 3: r0 = 2 + * 4: exit + * when the verifier reaches exit insn the register r0 in the state list of + * insn 2 will be seen as !REG_LIVE_READ. Then the verifier pops the other_branch + * of insn 2 and goes exploring further. At the insn 4 it will walk the + * parentage chain from insn 4 into insn 2 and will mark r0 as REG_LIVE_READ. + * + * Since the verifier pushes the branch states as it sees them while exploring + * the program the condition of walking the branch instruction for the second + * time means that all states below this branch were already explored and + * their final liveness markes are already propagated. + * Hence when the verifier completes the search of state list in is_state_visited() + * we can call this clean_live_states() function to mark all liveness states + * as REG_LIVE_DONE to indicate that 'parent' pointers of 'struct bpf_reg_state' + * will not be used. + * This function also clears the registers and stack for states that !READ + * to simplify state merging. + * + * Important note here that walking the same branch instruction in the callee + * doesn't meant that the states are DONE. The verifier has to compare + * the callsites + */ +static void clean_live_states(struct bpf_verifier_env *env, int insn, + struct bpf_verifier_state *cur) +{ + struct bpf_verifier_state_list *sl; + int i; + + sl = env->explored_states[insn]; + if (!sl) + return; + + while (sl != STATE_LIST_MARK) { + if (sl->state.curframe != cur->curframe) + goto next; + for (i = 0; i <= cur->curframe; i++) + if (sl->state.frame[i]->callsite != cur->frame[i]->callsite) + goto next; + clean_verifier_state(env, &sl->state); +next: + sl = sl->next; + } +} + /* Returns true if (rold safe implies rcur safe) */ static bool regsafe(struct bpf_reg_state *rold, struct bpf_reg_state *rcur, struct idpair *idmap) @@ -5396,6 +5500,8 @@ static int is_state_visited(struct bpf_verifier_env *env, int insn_idx) */ return 0; + clean_live_states(env, insn_idx, cur); + while (sl != STATE_LIST_MARK) { if (states_equal(env, &sl->state, cur)) { /* reached equivalent register/stack state, -- cgit v1.3-7-g2ca7 From 723679339d087d79e36c0af67f4be84d866fee20 Mon Sep 17 00:00:00 2001 From: Masahiro Yamada Date: Tue, 11 Dec 2018 20:00:51 +0900 Subject: kconfig: warn no new line at end of file It would be nice to warn if a new line is missing at end of file. We could do this by checkpatch.pl for arbitrary files, but new line is rather essential as a statement terminator in Kconfig. The warning message looks like this: kernel/Kconfig.preempt:60:warning: no new line at end of file Currently, kernel/Kconfig.preempt is the only file with no new line at end of file. Fix it. I know there are some false negative cases. For example, no warning is displayed when the last line contains some whitespaces/comments, but no new line. Yet, this commit works well for most cases. Signed-off-by: Masahiro Yamada --- kernel/Kconfig.preempt | 2 +- scripts/kconfig/zconf.l | 4 ++++ 2 files changed, 5 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt index cd1655122ec0..0fee5fe6c899 100644 --- a/kernel/Kconfig.preempt +++ b/kernel/Kconfig.preempt @@ -57,4 +57,4 @@ config PREEMPT endchoice config PREEMPT_COUNT - bool \ No newline at end of file + bool diff --git a/scripts/kconfig/zconf.l b/scripts/kconfig/zconf.l index 9038e9736bf0..8e856f9e6da9 100644 --- a/scripts/kconfig/zconf.l +++ b/scripts/kconfig/zconf.l @@ -261,6 +261,10 @@ n [A-Za-z0-9_-] <> { BEGIN(INITIAL); + if (prev_token != T_EOL && prev_token != T_HELPTEXT) + fprintf(stderr, "%s:%d:warning: no new line at end of file\n", + current_file->name, yylineno); + if (current_file) { zconf_endfile(); return T_EOL; -- cgit v1.3-7-g2ca7 From fb1a59fae8baa3f3c69b72a87ff94fc4fa5683ec Mon Sep 17 00:00:00 2001 From: Masami Hiramatsu Date: Mon, 17 Dec 2018 17:20:55 +0900 Subject: kprobes: Blacklist symbols in arch-defined prohibited area Blacklist symbols in arch-defined probe-prohibited areas. With this change, user can see all symbols which are prohibited to probe in debugfs. All archtectures which have custom prohibit areas should define its own arch_populate_kprobe_blacklist() function, but unless that, all symbols marked __kprobes are blacklisted. Reported-by: Andrea Righi Tested-by: Andrea Righi Signed-off-by: Masami Hiramatsu Cc: Andy Lutomirski Cc: Anil S Keshavamurthy Cc: Borislav Petkov Cc: David S. Miller Cc: Linus Torvalds Cc: Naveen N. Rao Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Yonghong Song Link: http://lkml.kernel.org/r/154503485491.26176.15823229545155174796.stgit@devbox Signed-off-by: Ingo Molnar --- include/linux/kprobes.h | 3 +++ kernel/kprobes.c | 67 ++++++++++++++++++++++++++++++++++++++----------- 2 files changed, 56 insertions(+), 14 deletions(-) (limited to 'kernel') diff --git a/include/linux/kprobes.h b/include/linux/kprobes.h index e909413e4e38..5da8a1de2187 100644 --- a/include/linux/kprobes.h +++ b/include/linux/kprobes.h @@ -242,10 +242,13 @@ extern int arch_init_kprobes(void); extern void show_registers(struct pt_regs *regs); extern void kprobes_inc_nmissed_count(struct kprobe *p); extern bool arch_within_kprobe_blacklist(unsigned long addr); +extern int arch_populate_kprobe_blacklist(void); extern bool arch_kprobe_on_func_entry(unsigned long offset); extern bool kprobe_on_func_entry(kprobe_opcode_t *addr, const char *sym, unsigned long offset); extern bool within_kprobe_blacklist(unsigned long addr); +extern int kprobe_add_ksym_blacklist(unsigned long entry); +extern int kprobe_add_area_blacklist(unsigned long start, unsigned long end); struct kprobe_insn_cache { struct mutex mutex; diff --git a/kernel/kprobes.c b/kernel/kprobes.c index 90e98e233647..90569aec0f24 100644 --- a/kernel/kprobes.c +++ b/kernel/kprobes.c @@ -2093,6 +2093,47 @@ void dump_kprobe(struct kprobe *kp) } NOKPROBE_SYMBOL(dump_kprobe); +int kprobe_add_ksym_blacklist(unsigned long entry) +{ + struct kprobe_blacklist_entry *ent; + unsigned long offset = 0, size = 0; + + if (!kernel_text_address(entry) || + !kallsyms_lookup_size_offset(entry, &size, &offset)) + return -EINVAL; + + ent = kmalloc(sizeof(*ent), GFP_KERNEL); + if (!ent) + return -ENOMEM; + ent->start_addr = entry; + ent->end_addr = entry + size; + INIT_LIST_HEAD(&ent->list); + list_add_tail(&ent->list, &kprobe_blacklist); + + return (int)size; +} + +/* Add all symbols in given area into kprobe blacklist */ +int kprobe_add_area_blacklist(unsigned long start, unsigned long end) +{ + unsigned long entry; + int ret = 0; + + for (entry = start; entry < end; entry += ret) { + ret = kprobe_add_ksym_blacklist(entry); + if (ret < 0) + return ret; + if (ret == 0) /* In case of alias symbol */ + ret = 1; + } + return 0; +} + +int __init __weak arch_populate_kprobe_blacklist(void) +{ + return 0; +} + /* * Lookup and populate the kprobe_blacklist. * @@ -2104,26 +2145,24 @@ NOKPROBE_SYMBOL(dump_kprobe); static int __init populate_kprobe_blacklist(unsigned long *start, unsigned long *end) { + unsigned long entry; unsigned long *iter; - struct kprobe_blacklist_entry *ent; - unsigned long entry, offset = 0, size = 0; + int ret; for (iter = start; iter < end; iter++) { entry = arch_deref_entry_point((void *)*iter); - - if (!kernel_text_address(entry) || - !kallsyms_lookup_size_offset(entry, &size, &offset)) + ret = kprobe_add_ksym_blacklist(entry); + if (ret == -EINVAL) continue; - - ent = kmalloc(sizeof(*ent), GFP_KERNEL); - if (!ent) - return -ENOMEM; - ent->start_addr = entry; - ent->end_addr = entry + size; - INIT_LIST_HEAD(&ent->list); - list_add_tail(&ent->list, &kprobe_blacklist); + if (ret < 0) + return ret; } - return 0; + + /* Symbols in __kprobes_text are blacklisted */ + ret = kprobe_add_area_blacklist((unsigned long)__kprobes_text_start, + (unsigned long)__kprobes_text_end); + + return ret ? : arch_populate_kprobe_blacklist(); } /* Module notifier call back, checking kprobes on the module */ -- cgit v1.3-7-g2ca7 From 6c4fc209fcf9d27efbaa48368773e4d2bfbd59aa Mon Sep 17 00:00:00 2001 From: Daniel Borkmann Date: Sun, 16 Dec 2018 00:49:47 +0100 Subject: bpf: remove useless version check for prog load Existing libraries and tracing frameworks work around this kernel version check by automatically deriving the kernel version from uname(3) or similar such that the user does not need to do it manually; these workarounds also make the version check useless at the same time. Moreover, most other BPF tracing types enabling bpf_probe_read()-like functionality have /not/ adapted this check, and in general these days it is well understood anyway that all the tracing programs are not stable with regards to future kernels as kernel internal data structures are subject to change from release to release. Back at last netconf we discussed [0] and agreed to remove this check from bpf_prog_load() and instead document it here in the uapi header that there is no such guarantee for stable API for these programs. [0] http://vger.kernel.org/netconf2018_files/DanielBorkmann_netconf2018.pdf Signed-off-by: Daniel Borkmann Acked-by: Alexei Starovoitov Acked-by: Quentin Monnet Signed-off-by: Alexei Starovoitov --- include/uapi/linux/bpf.h | 10 +++++++++- kernel/bpf/syscall.c | 5 ----- tools/include/uapi/linux/bpf.h | 10 +++++++++- 3 files changed, 18 insertions(+), 7 deletions(-) (limited to 'kernel') diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index e7d57e89f25f..1d324c2cbca2 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -133,6 +133,14 @@ enum bpf_map_type { BPF_MAP_TYPE_STACK, }; +/* Note that tracing related programs such as + * BPF_PROG_TYPE_{KPROBE,TRACEPOINT,PERF_EVENT,RAW_TRACEPOINT} + * are not subject to a stable API since kernel internal data + * structures can change from release to release and may + * therefore break existing tracing BPF programs. Tracing BPF + * programs correspond to /a/ specific kernel which is to be + * analyzed, and not /a/ specific kernel /and/ all future ones. + */ enum bpf_prog_type { BPF_PROG_TYPE_UNSPEC, BPF_PROG_TYPE_SOCKET_FILTER, @@ -343,7 +351,7 @@ union bpf_attr { __u32 log_level; /* verbosity level of verifier */ __u32 log_size; /* size of user buffer */ __aligned_u64 log_buf; /* user supplied buffer */ - __u32 kern_version; /* checked when prog_type=kprobe */ + __u32 kern_version; /* not used */ __u32 prog_flags; char prog_name[BPF_OBJ_NAME_LEN]; __u32 prog_ifindex; /* ifindex of netdev to prep for */ diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 6ae062f1cf20..5db31067d85e 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -1473,11 +1473,6 @@ static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr) if (attr->insn_cnt == 0 || attr->insn_cnt > BPF_MAXINSNS) return -E2BIG; - - if (type == BPF_PROG_TYPE_KPROBE && - attr->kern_version != LINUX_VERSION_CODE) - return -EINVAL; - if (type != BPF_PROG_TYPE_SOCKET_FILTER && type != BPF_PROG_TYPE_CGROUP_SKB && !capable(CAP_SYS_ADMIN)) diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index e7d57e89f25f..1d324c2cbca2 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -133,6 +133,14 @@ enum bpf_map_type { BPF_MAP_TYPE_STACK, }; +/* Note that tracing related programs such as + * BPF_PROG_TYPE_{KPROBE,TRACEPOINT,PERF_EVENT,RAW_TRACEPOINT} + * are not subject to a stable API since kernel internal data + * structures can change from release to release and may + * therefore break existing tracing BPF programs. Tracing BPF + * programs correspond to /a/ specific kernel which is to be + * analyzed, and not /a/ specific kernel /and/ all future ones. + */ enum bpf_prog_type { BPF_PROG_TYPE_UNSPEC, BPF_PROG_TYPE_SOCKET_FILTER, @@ -343,7 +351,7 @@ union bpf_attr { __u32 log_level; /* verbosity level of verifier */ __u32 log_size; /* size of user buffer */ __aligned_u64 log_buf; /* user supplied buffer */ - __u32 kern_version; /* checked when prog_type=kprobe */ + __u32 kern_version; /* not used */ __u32 prog_flags; char prog_name[BPF_OBJ_NAME_LEN]; __u32 prog_ifindex; /* ifindex of netdev to prep for */ -- cgit v1.3-7-g2ca7 From f97be3ab044cf6dd342fed7668c977ba07a7cd95 Mon Sep 17 00:00:00 2001 From: Yonghong Song Date: Sat, 15 Dec 2018 22:13:50 -0800 Subject: bpf: btf: refactor btf_int_bits_seq_show() Refactor function btf_int_bits_seq_show() by creating function btf_bitfield_seq_show() which has no dependence on btf and btf_type. The function btf_bitfield_seq_show() will be in later patch to directly dump bitfield member values. Acked-by: Martin KaFai Lau Signed-off-by: Yonghong Song Signed-off-by: Daniel Borkmann --- kernel/bpf/btf.c | 35 +++++++++++++++++++++-------------- 1 file changed, 21 insertions(+), 14 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index 8fa0bf1c33fd..72caa799e82f 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -1068,26 +1068,16 @@ static void btf_int_log(struct btf_verifier_env *env, btf_int_encoding_str(BTF_INT_ENCODING(int_data))); } -static void btf_int_bits_seq_show(const struct btf *btf, - const struct btf_type *t, - void *data, u8 bits_offset, - struct seq_file *m) +static void btf_bitfield_seq_show(void *data, u8 bits_offset, + u8 nr_bits, struct seq_file *m) { u16 left_shift_bits, right_shift_bits; - u32 int_data = btf_type_int(t); - u8 nr_bits = BTF_INT_BITS(int_data); - u8 total_bits_offset; u8 nr_copy_bytes; u8 nr_copy_bits; u64 print_num; - /* - * bits_offset is at most 7. - * BTF_INT_OFFSET() cannot exceed 64 bits. - */ - total_bits_offset = bits_offset + BTF_INT_OFFSET(int_data); - data += BITS_ROUNDDOWN_BYTES(total_bits_offset); - bits_offset = BITS_PER_BYTE_MASKED(total_bits_offset); + data += BITS_ROUNDDOWN_BYTES(bits_offset); + bits_offset = BITS_PER_BYTE_MASKED(bits_offset); nr_copy_bits = nr_bits + bits_offset; nr_copy_bytes = BITS_ROUNDUP_BYTES(nr_copy_bits); @@ -1107,6 +1097,23 @@ static void btf_int_bits_seq_show(const struct btf *btf, seq_printf(m, "0x%llx", print_num); } +static void btf_int_bits_seq_show(const struct btf *btf, + const struct btf_type *t, + void *data, u8 bits_offset, + struct seq_file *m) +{ + u32 int_data = btf_type_int(t); + u8 nr_bits = BTF_INT_BITS(int_data); + u8 total_bits_offset; + + /* + * bits_offset is at most 7. + * BTF_INT_OFFSET() cannot exceed 64 bits. + */ + total_bits_offset = bits_offset + BTF_INT_OFFSET(int_data); + btf_bitfield_seq_show(data, total_bits_offset, nr_bits, m); +} + static void btf_int_seq_show(const struct btf *btf, const struct btf_type *t, u32 type_id, void *data, u8 bits_offset, struct seq_file *m) -- cgit v1.3-7-g2ca7 From 9d5f9f701b1891466fb3dbb1806ad97716f95cc3 Mon Sep 17 00:00:00 2001 From: Yonghong Song Date: Sat, 15 Dec 2018 22:13:51 -0800 Subject: bpf: btf: fix struct/union/fwd types with kind_flag This patch fixed two issues with BTF. One is related to struct/union bitfield encoding and the other is related to forward type. Issue #1 and solution: ====================== Current btf encoding of bitfield follows what pahole generates. For each bitfield, pahole will duplicate the type chain and put the bitfield size at the final int or enum type. Since the BTF enum type cannot encode bit size, pahole workarounds the issue by generating an int type whenever the enum bit size is not 32. For example, -bash-4.4$ cat t.c typedef int ___int; enum A { A1, A2, A3 }; struct t { int a[5]; ___int b:4; volatile enum A c:4; } g; -bash-4.4$ gcc -c -O2 -g t.c The current kernel supports the following BTF encoding: $ pahole -JV t.o [1] TYPEDEF ___int type_id=2 [2] INT int size=4 bit_offset=0 nr_bits=32 encoding=SIGNED [3] ENUM A size=4 vlen=3 A1 val=0 A2 val=1 A3 val=2 [4] STRUCT t size=24 vlen=3 a type_id=5 bits_offset=0 b type_id=9 bits_offset=160 c type_id=11 bits_offset=164 [5] ARRAY (anon) type_id=2 index_type_id=2 nr_elems=5 [6] INT sizetype size=8 bit_offset=0 nr_bits=64 encoding=(none) [7] VOLATILE (anon) type_id=3 [8] INT int size=1 bit_offset=0 nr_bits=4 encoding=(none) [9] TYPEDEF ___int type_id=8 [10] INT (anon) size=1 bit_offset=0 nr_bits=4 encoding=SIGNED [11] VOLATILE (anon) type_id=10 Two issues are in the above: . by changing enum type to int, we lost the original type information and this will not be ideal later when we try to convert BTF to a header file. . the type duplication for bitfields will cause BTF bloat. Duplicated types cannot be deduplicated later if the bitfield size is different. To fix this issue, this patch implemented a compatible change for BTF struct type encoding: . the bit 31 of struct_type->info, previously reserved, now is used to indicate whether bitfield_size is encoded in btf_member or not. . if bit 31 of struct_type->info is set, btf_member->offset will encode like: bit 0 - 23: bit offset bit 24 - 31: bitfield size if bit 31 is not set, the old behavior is preserved: bit 0 - 31: bit offset So if the struct contains a bit field, the maximum bit offset will be reduced to (2^24 - 1) instead of MAX_UINT. The maximum bitfield size will be 256 which is enough for today as maximum bitfield in compiler can be 128 where int128 type is supported. This kernel patch intends to support the new BTF encoding: $ pahole -JV t.o [1] TYPEDEF ___int type_id=2 [2] INT int size=4 bit_offset=0 nr_bits=32 encoding=SIGNED [3] ENUM A size=4 vlen=3 A1 val=0 A2 val=1 A3 val=2 [4] STRUCT t kind_flag=1 size=24 vlen=3 a type_id=5 bitfield_size=0 bits_offset=0 b type_id=1 bitfield_size=4 bits_offset=160 c type_id=7 bitfield_size=4 bits_offset=164 [5] ARRAY (anon) type_id=2 index_type_id=2 nr_elems=5 [6] INT sizetype size=8 bit_offset=0 nr_bits=64 encoding=(none) [7] VOLATILE (anon) type_id=3 Issue #2 and solution: ====================== Current forward type in BTF does not specify whether the original type is struct or union. This will not work for type pretty print and BTF-to-header-file conversion as struct/union must be specified. $ cat tt.c struct t; union u; int foo(struct t *t, union u *u) { return 0; } $ gcc -c -g -O2 tt.c $ pahole -JV tt.o [1] INT int size=4 bit_offset=0 nr_bits=32 encoding=SIGNED [2] FWD t type_id=0 [3] PTR (anon) type_id=2 [4] FWD u type_id=0 [5] PTR (anon) type_id=4 To fix this issue, similar to issue #1, type->info bit 31 is used. If the bit is set, it is union type. Otherwise, it is a struct type. $ pahole -JV tt.o [1] INT int size=4 bit_offset=0 nr_bits=32 encoding=SIGNED [2] FWD t kind_flag=0 type_id=0 [3] PTR (anon) kind_flag=0 type_id=2 [4] FWD u kind_flag=1 type_id=0 [5] PTR (anon) kind_flag=0 type_id=4 Pahole/LLVM change: =================== The new kind_flag functionality has been implemented in pahole and llvm: https://github.com/yonghong-song/pahole/tree/bitfield https://github.com/yonghong-song/llvm/tree/bitfield Note that pahole hasn't implemented func/func_proto kind and .BTF.ext. So to print function signature with bpftool, the llvm compiler should be used. Fixes: 69b693f0aefa ("bpf: btf: Introduce BPF Type Format (BTF)") Acked-by: Martin KaFai Lau Signed-off-by: Martin KaFai Lau Signed-off-by: Yonghong Song Signed-off-by: Daniel Borkmann --- include/uapi/linux/btf.h | 20 +++- kernel/bpf/btf.c | 280 +++++++++++++++++++++++++++++++++++++++++++---- 2 files changed, 278 insertions(+), 22 deletions(-) (limited to 'kernel') diff --git a/include/uapi/linux/btf.h b/include/uapi/linux/btf.h index 14f66948fc95..7b7475ef2f17 100644 --- a/include/uapi/linux/btf.h +++ b/include/uapi/linux/btf.h @@ -34,7 +34,9 @@ struct btf_type { * bits 0-15: vlen (e.g. # of struct's members) * bits 16-23: unused * bits 24-27: kind (e.g. int, ptr, array...etc) - * bits 28-31: unused + * bits 28-30: unused + * bit 31: kind_flag, currently used by + * struct, union and fwd */ __u32 info; /* "size" is used by INT, ENUM, STRUCT and UNION. @@ -52,6 +54,7 @@ struct btf_type { #define BTF_INFO_KIND(info) (((info) >> 24) & 0x0f) #define BTF_INFO_VLEN(info) ((info) & 0xffff) +#define BTF_INFO_KFLAG(info) ((info) >> 31) #define BTF_KIND_UNKN 0 /* Unknown */ #define BTF_KIND_INT 1 /* Integer */ @@ -110,9 +113,22 @@ struct btf_array { struct btf_member { __u32 name_off; __u32 type; - __u32 offset; /* offset in bits */ + /* If the type info kind_flag is set, the btf_member offset + * contains both member bitfield size and bit offset. The + * bitfield size is set for bitfield members. If the type + * info kind_flag is not set, the offset contains only bit + * offset. + */ + __u32 offset; }; +/* If the struct/union type info kind_flag is set, the + * following two macros are used to access bitfield_size + * and bit_offset from btf_member.offset. + */ +#define BTF_MEMBER_BITFIELD_SIZE(val) ((val) >> 24) +#define BTF_MEMBER_BIT_OFFSET(val) ((val) & 0xffffff) + /* BTF_KIND_FUNC_PROTO is followed by multiple "struct btf_param". * The exact number of btf_param is stored in the vlen (of the * info in "struct btf_type"). diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index 72caa799e82f..93b6905e3a9b 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -164,7 +164,7 @@ #define BITS_ROUNDUP_BYTES(bits) \ (BITS_ROUNDDOWN_BYTES(bits) + !!BITS_PER_BYTE_MASKED(bits)) -#define BTF_INFO_MASK 0x0f00ffff +#define BTF_INFO_MASK 0x8f00ffff #define BTF_INT_MASK 0x0fffffff #define BTF_TYPE_ID_VALID(type_id) ((type_id) <= BTF_MAX_TYPE) #define BTF_STR_OFFSET_VALID(name_off) ((name_off) <= BTF_MAX_NAME_OFFSET) @@ -274,6 +274,10 @@ struct btf_kind_operations { const struct btf_type *struct_type, const struct btf_member *member, const struct btf_type *member_type); + int (*check_kflag_member)(struct btf_verifier_env *env, + const struct btf_type *struct_type, + const struct btf_member *member, + const struct btf_type *member_type); void (*log_details)(struct btf_verifier_env *env, const struct btf_type *t); void (*seq_show)(const struct btf *btf, const struct btf_type *t, @@ -419,6 +423,25 @@ static u16 btf_type_vlen(const struct btf_type *t) return BTF_INFO_VLEN(t->info); } +static bool btf_type_kflag(const struct btf_type *t) +{ + return BTF_INFO_KFLAG(t->info); +} + +static u32 btf_member_bit_offset(const struct btf_type *struct_type, + const struct btf_member *member) +{ + return btf_type_kflag(struct_type) ? BTF_MEMBER_BIT_OFFSET(member->offset) + : member->offset; +} + +static u32 btf_member_bitfield_size(const struct btf_type *struct_type, + const struct btf_member *member) +{ + return btf_type_kflag(struct_type) ? BTF_MEMBER_BITFIELD_SIZE(member->offset) + : 0; +} + static u32 btf_type_int(const struct btf_type *t) { return *(u32 *)(t + 1); @@ -627,9 +650,17 @@ static void btf_verifier_log_member(struct btf_verifier_env *env, if (env->phase != CHECK_META) btf_verifier_log_type(env, struct_type, NULL); - __btf_verifier_log(log, "\t%s type_id=%u bits_offset=%u", - __btf_name_by_offset(btf, member->name_off), - member->type, member->offset); + if (btf_type_kflag(struct_type)) + __btf_verifier_log(log, + "\t%s type_id=%u bitfield_size=%u bits_offset=%u", + __btf_name_by_offset(btf, member->name_off), + member->type, + BTF_MEMBER_BITFIELD_SIZE(member->offset), + BTF_MEMBER_BIT_OFFSET(member->offset)); + else + __btf_verifier_log(log, "\t%s type_id=%u bits_offset=%u", + __btf_name_by_offset(btf, member->name_off), + member->type, member->offset); if (fmt && *fmt) { __btf_verifier_log(log, " "); @@ -945,6 +976,38 @@ static int btf_df_check_member(struct btf_verifier_env *env, return -EINVAL; } +static int btf_df_check_kflag_member(struct btf_verifier_env *env, + const struct btf_type *struct_type, + const struct btf_member *member, + const struct btf_type *member_type) +{ + btf_verifier_log_basic(env, struct_type, + "Unsupported check_kflag_member"); + return -EINVAL; +} + +/* Used for ptr, array and struct/union type members. + * int, enum and modifier types have their specific callback functions. + */ +static int btf_generic_check_kflag_member(struct btf_verifier_env *env, + const struct btf_type *struct_type, + const struct btf_member *member, + const struct btf_type *member_type) +{ + if (BTF_MEMBER_BITFIELD_SIZE(member->offset)) { + btf_verifier_log_member(env, struct_type, member, + "Invalid member bitfield_size"); + return -EINVAL; + } + + /* bitfield size is 0, so member->offset represents bit offset only. + * It is safe to call non kflag check_member variants. + */ + return btf_type_ops(member_type)->check_member(env, struct_type, + member, + member_type); +} + static int btf_df_resolve(struct btf_verifier_env *env, const struct resolve_vertex *v) { @@ -997,6 +1060,62 @@ static int btf_int_check_member(struct btf_verifier_env *env, return 0; } +static int btf_int_check_kflag_member(struct btf_verifier_env *env, + const struct btf_type *struct_type, + const struct btf_member *member, + const struct btf_type *member_type) +{ + u32 struct_bits_off, nr_bits, nr_int_data_bits, bytes_offset; + u32 int_data = btf_type_int(member_type); + u32 struct_size = struct_type->size; + u32 nr_copy_bits; + + /* a regular int type is required for the kflag int member */ + if (!btf_type_int_is_regular(member_type)) { + btf_verifier_log_member(env, struct_type, member, + "Invalid member base type"); + return -EINVAL; + } + + /* check sanity of bitfield size */ + nr_bits = BTF_MEMBER_BITFIELD_SIZE(member->offset); + struct_bits_off = BTF_MEMBER_BIT_OFFSET(member->offset); + nr_int_data_bits = BTF_INT_BITS(int_data); + if (!nr_bits) { + /* Not a bitfield member, member offset must be at byte + * boundary. + */ + if (BITS_PER_BYTE_MASKED(struct_bits_off)) { + btf_verifier_log_member(env, struct_type, member, + "Invalid member offset"); + return -EINVAL; + } + + nr_bits = nr_int_data_bits; + } else if (nr_bits > nr_int_data_bits) { + btf_verifier_log_member(env, struct_type, member, + "Invalid member bitfield_size"); + return -EINVAL; + } + + bytes_offset = BITS_ROUNDDOWN_BYTES(struct_bits_off); + nr_copy_bits = nr_bits + BITS_PER_BYTE_MASKED(struct_bits_off); + if (nr_copy_bits > BITS_PER_U64) { + btf_verifier_log_member(env, struct_type, member, + "nr_copy_bits exceeds 64"); + return -EINVAL; + } + + if (struct_size < bytes_offset || + struct_size - bytes_offset < BITS_ROUNDUP_BYTES(nr_copy_bits)) { + btf_verifier_log_member(env, struct_type, member, + "Member exceeds struct_size"); + return -EINVAL; + } + + return 0; +} + static s32 btf_int_check_meta(struct btf_verifier_env *env, const struct btf_type *t, u32 meta_left) @@ -1016,6 +1135,11 @@ static s32 btf_int_check_meta(struct btf_verifier_env *env, return -EINVAL; } + if (btf_type_kflag(t)) { + btf_verifier_log_type(env, t, "Invalid btf_info kind_flag"); + return -EINVAL; + } + int_data = btf_type_int(t); if (int_data & ~BTF_INT_MASK) { btf_verifier_log_basic(env, t, "Invalid int_data:%x", @@ -1097,6 +1221,7 @@ static void btf_bitfield_seq_show(void *data, u8 bits_offset, seq_printf(m, "0x%llx", print_num); } + static void btf_int_bits_seq_show(const struct btf *btf, const struct btf_type *t, void *data, u8 bits_offset, @@ -1163,6 +1288,7 @@ static const struct btf_kind_operations int_ops = { .check_meta = btf_int_check_meta, .resolve = btf_df_resolve, .check_member = btf_int_check_member, + .check_kflag_member = btf_int_check_kflag_member, .log_details = btf_int_log, .seq_show = btf_int_seq_show, }; @@ -1192,6 +1318,31 @@ static int btf_modifier_check_member(struct btf_verifier_env *env, resolved_type); } +static int btf_modifier_check_kflag_member(struct btf_verifier_env *env, + const struct btf_type *struct_type, + const struct btf_member *member, + const struct btf_type *member_type) +{ + const struct btf_type *resolved_type; + u32 resolved_type_id = member->type; + struct btf_member resolved_member; + struct btf *btf = env->btf; + + resolved_type = btf_type_id_size(btf, &resolved_type_id, NULL); + if (!resolved_type) { + btf_verifier_log_member(env, struct_type, member, + "Invalid member"); + return -EINVAL; + } + + resolved_member = *member; + resolved_member.type = resolved_type_id; + + return btf_type_ops(resolved_type)->check_kflag_member(env, struct_type, + &resolved_member, + resolved_type); +} + static int btf_ptr_check_member(struct btf_verifier_env *env, const struct btf_type *struct_type, const struct btf_member *member, @@ -1227,6 +1378,11 @@ static int btf_ref_type_check_meta(struct btf_verifier_env *env, return -EINVAL; } + if (btf_type_kflag(t)) { + btf_verifier_log_type(env, t, "Invalid btf_info kind_flag"); + return -EINVAL; + } + if (!BTF_TYPE_ID_VALID(t->type)) { btf_verifier_log_type(env, t, "Invalid type_id"); return -EINVAL; @@ -1380,6 +1536,7 @@ static struct btf_kind_operations modifier_ops = { .check_meta = btf_ref_type_check_meta, .resolve = btf_modifier_resolve, .check_member = btf_modifier_check_member, + .check_kflag_member = btf_modifier_check_kflag_member, .log_details = btf_ref_type_log, .seq_show = btf_modifier_seq_show, }; @@ -1388,6 +1545,7 @@ static struct btf_kind_operations ptr_ops = { .check_meta = btf_ref_type_check_meta, .resolve = btf_ptr_resolve, .check_member = btf_ptr_check_member, + .check_kflag_member = btf_generic_check_kflag_member, .log_details = btf_ref_type_log, .seq_show = btf_ptr_seq_show, }; @@ -1422,6 +1580,7 @@ static struct btf_kind_operations fwd_ops = { .check_meta = btf_fwd_check_meta, .resolve = btf_df_resolve, .check_member = btf_df_check_member, + .check_kflag_member = btf_df_check_kflag_member, .log_details = btf_ref_type_log, .seq_show = btf_df_seq_show, }; @@ -1480,6 +1639,11 @@ static s32 btf_array_check_meta(struct btf_verifier_env *env, return -EINVAL; } + if (btf_type_kflag(t)) { + btf_verifier_log_type(env, t, "Invalid btf_info kind_flag"); + return -EINVAL; + } + if (t->size) { btf_verifier_log_type(env, t, "size != 0"); return -EINVAL; @@ -1603,6 +1767,7 @@ static struct btf_kind_operations array_ops = { .check_meta = btf_array_check_meta, .resolve = btf_array_resolve, .check_member = btf_array_check_member, + .check_kflag_member = btf_generic_check_kflag_member, .log_details = btf_array_log, .seq_show = btf_array_seq_show, }; @@ -1641,6 +1806,7 @@ static s32 btf_struct_check_meta(struct btf_verifier_env *env, u32 meta_needed, last_offset; struct btf *btf = env->btf; u32 struct_size = t->size; + u32 offset; u16 i; meta_needed = btf_type_vlen(t) * sizeof(*member); @@ -1682,7 +1848,8 @@ static s32 btf_struct_check_meta(struct btf_verifier_env *env, return -EINVAL; } - if (is_union && member->offset) { + offset = btf_member_bit_offset(t, member); + if (is_union && offset) { btf_verifier_log_member(env, t, member, "Invalid member bits_offset"); return -EINVAL; @@ -1692,20 +1859,20 @@ static s32 btf_struct_check_meta(struct btf_verifier_env *env, * ">" instead of ">=" because the last member could be * "char a[0];" */ - if (last_offset > member->offset) { + if (last_offset > offset) { btf_verifier_log_member(env, t, member, "Invalid member bits_offset"); return -EINVAL; } - if (BITS_ROUNDUP_BYTES(member->offset) > struct_size) { + if (BITS_ROUNDUP_BYTES(offset) > struct_size) { btf_verifier_log_member(env, t, member, "Member bits_offset exceeds its struct size"); return -EINVAL; } btf_verifier_log_member(env, t, member, NULL); - last_offset = member->offset; + last_offset = offset; } return meta_needed; @@ -1735,9 +1902,14 @@ static int btf_struct_resolve(struct btf_verifier_env *env, last_member_type = btf_type_by_id(env->btf, last_member_type_id); - err = btf_type_ops(last_member_type)->check_member(env, v->t, - last_member, - last_member_type); + if (btf_type_kflag(v->t)) + err = btf_type_ops(last_member_type)->check_kflag_member(env, v->t, + last_member, + last_member_type); + else + err = btf_type_ops(last_member_type)->check_member(env, v->t, + last_member, + last_member_type); if (err) return err; } @@ -1759,9 +1931,14 @@ static int btf_struct_resolve(struct btf_verifier_env *env, return env_stack_push(env, member_type, member_type_id); } - err = btf_type_ops(member_type)->check_member(env, v->t, - member, - member_type); + if (btf_type_kflag(v->t)) + err = btf_type_ops(member_type)->check_kflag_member(env, v->t, + member, + member_type); + else + err = btf_type_ops(member_type)->check_member(env, v->t, + member, + member_type); if (err) return err; } @@ -1789,17 +1966,26 @@ static void btf_struct_seq_show(const struct btf *btf, const struct btf_type *t, for_each_member(i, t, member) { const struct btf_type *member_type = btf_type_by_id(btf, member->type); - u32 member_offset = member->offset; - u32 bytes_offset = BITS_ROUNDDOWN_BYTES(member_offset); - u8 bits8_offset = BITS_PER_BYTE_MASKED(member_offset); const struct btf_kind_operations *ops; + u32 member_offset, bitfield_size; + u32 bytes_offset; + u8 bits8_offset; if (i) seq_puts(m, seq); - ops = btf_type_ops(member_type); - ops->seq_show(btf, member_type, member->type, - data + bytes_offset, bits8_offset, m); + member_offset = btf_member_bit_offset(t, member); + bitfield_size = btf_member_bitfield_size(t, member); + if (bitfield_size) { + btf_bitfield_seq_show(data, member_offset, + bitfield_size, m); + } else { + bytes_offset = BITS_ROUNDDOWN_BYTES(member_offset); + bits8_offset = BITS_PER_BYTE_MASKED(member_offset); + ops = btf_type_ops(member_type); + ops->seq_show(btf, member_type, member->type, + data + bytes_offset, bits8_offset, m); + } } seq_puts(m, "}"); } @@ -1808,6 +1994,7 @@ static struct btf_kind_operations struct_ops = { .check_meta = btf_struct_check_meta, .resolve = btf_struct_resolve, .check_member = btf_struct_check_member, + .check_kflag_member = btf_generic_check_kflag_member, .log_details = btf_struct_log, .seq_show = btf_struct_seq_show, }; @@ -1837,6 +2024,41 @@ static int btf_enum_check_member(struct btf_verifier_env *env, return 0; } +static int btf_enum_check_kflag_member(struct btf_verifier_env *env, + const struct btf_type *struct_type, + const struct btf_member *member, + const struct btf_type *member_type) +{ + u32 struct_bits_off, nr_bits, bytes_end, struct_size; + u32 int_bitsize = sizeof(int) * BITS_PER_BYTE; + + struct_bits_off = BTF_MEMBER_BIT_OFFSET(member->offset); + nr_bits = BTF_MEMBER_BITFIELD_SIZE(member->offset); + if (!nr_bits) { + if (BITS_PER_BYTE_MASKED(struct_bits_off)) { + btf_verifier_log_member(env, struct_type, member, + "Member is not byte aligned"); + return -EINVAL; + } + + nr_bits = int_bitsize; + } else if (nr_bits > int_bitsize) { + btf_verifier_log_member(env, struct_type, member, + "Invalid member bitfield_size"); + return -EINVAL; + } + + struct_size = struct_type->size; + bytes_end = BITS_ROUNDUP_BYTES(struct_bits_off + nr_bits); + if (struct_size < bytes_end) { + btf_verifier_log_member(env, struct_type, member, + "Member exceeds struct_size"); + return -EINVAL; + } + + return 0; +} + static s32 btf_enum_check_meta(struct btf_verifier_env *env, const struct btf_type *t, u32 meta_left) @@ -1856,6 +2078,11 @@ static s32 btf_enum_check_meta(struct btf_verifier_env *env, return -EINVAL; } + if (btf_type_kflag(t)) { + btf_verifier_log_type(env, t, "Invalid btf_info kind_flag"); + return -EINVAL; + } + if (t->size != sizeof(int)) { btf_verifier_log_type(env, t, "Expected size:%zu", sizeof(int)); @@ -1924,6 +2151,7 @@ static struct btf_kind_operations enum_ops = { .check_meta = btf_enum_check_meta, .resolve = btf_df_resolve, .check_member = btf_enum_check_member, + .check_kflag_member = btf_enum_check_kflag_member, .log_details = btf_enum_log, .seq_show = btf_enum_seq_show, }; @@ -1946,6 +2174,11 @@ static s32 btf_func_proto_check_meta(struct btf_verifier_env *env, return -EINVAL; } + if (btf_type_kflag(t)) { + btf_verifier_log_type(env, t, "Invalid btf_info kind_flag"); + return -EINVAL; + } + btf_verifier_log_type(env, t, NULL); return meta_needed; @@ -2005,6 +2238,7 @@ static struct btf_kind_operations func_proto_ops = { * Hence, there is no btf_func_check_member(). */ .check_member = btf_df_check_member, + .check_kflag_member = btf_df_check_kflag_member, .log_details = btf_func_proto_log, .seq_show = btf_df_seq_show, }; @@ -2024,6 +2258,11 @@ static s32 btf_func_check_meta(struct btf_verifier_env *env, return -EINVAL; } + if (btf_type_kflag(t)) { + btf_verifier_log_type(env, t, "Invalid btf_info kind_flag"); + return -EINVAL; + } + btf_verifier_log_type(env, t, NULL); return 0; @@ -2033,6 +2272,7 @@ static struct btf_kind_operations func_ops = { .check_meta = btf_func_check_meta, .resolve = btf_df_resolve, .check_member = btf_df_check_member, + .check_kflag_member = btf_df_check_kflag_member, .log_details = btf_ref_type_log, .seq_show = btf_df_seq_show, }; -- cgit v1.3-7-g2ca7 From ffa0c1cf59596fba54546ea828305acfcc2cf55e Mon Sep 17 00:00:00 2001 From: Yonghong Song Date: Sat, 15 Dec 2018 22:13:52 -0800 Subject: bpf: enable cgroup local storage map pretty print with kind_flag Commit 970289fc0a83 ("bpf: add bpffs pretty print for cgroup local storage maps") added bpffs pretty print for cgroup local storage maps. The commit worked for struct without kind_flag set. This patch refactored and made pretty print also work with kind_flag set for the struct. Acked-by: Martin KaFai Lau Signed-off-by: Yonghong Song Signed-off-by: Daniel Borkmann --- include/linux/btf.h | 5 ++++- kernel/bpf/btf.c | 37 ++++++++++++++++++++++++++++--------- kernel/bpf/local_storage.c | 17 ++++------------- 3 files changed, 36 insertions(+), 23 deletions(-) (limited to 'kernel') diff --git a/include/linux/btf.h b/include/linux/btf.h index 58000d7e06e3..12502e25e767 100644 --- a/include/linux/btf.h +++ b/include/linux/btf.h @@ -7,6 +7,7 @@ #include struct btf; +struct btf_member; struct btf_type; union bpf_attr; @@ -46,7 +47,9 @@ void btf_type_seq_show(const struct btf *btf, u32 type_id, void *obj, struct seq_file *m); int btf_get_fd_by_id(u32 id); u32 btf_id(const struct btf *btf); -bool btf_type_is_reg_int(const struct btf_type *t, u32 expected_size); +bool btf_member_is_reg_int(const struct btf *btf, const struct btf_type *s, + const struct btf_member *m, + u32 expected_offset, u32 expected_size); #ifdef CONFIG_BPF_SYSCALL const struct btf_type *btf_type_by_id(const struct btf *btf, u32 type_id); diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index 93b6905e3a9b..e804b26a0506 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -546,22 +546,41 @@ static bool btf_type_int_is_regular(const struct btf_type *t) } /* - * Check that given type is a regular int and has the expected size. + * Check that given struct member is a regular int with expected + * offset and size. */ -bool btf_type_is_reg_int(const struct btf_type *t, u32 expected_size) +bool btf_member_is_reg_int(const struct btf *btf, const struct btf_type *s, + const struct btf_member *m, + u32 expected_offset, u32 expected_size) { - u8 nr_bits, nr_bytes; - u32 int_data; + const struct btf_type *t; + u32 id, int_data; + u8 nr_bits; - if (!btf_type_is_int(t)) + id = m->type; + t = btf_type_id_size(btf, &id, NULL); + if (!t || !btf_type_is_int(t)) return false; int_data = btf_type_int(t); nr_bits = BTF_INT_BITS(int_data); - nr_bytes = BITS_ROUNDUP_BYTES(nr_bits); - if (BITS_PER_BYTE_MASKED(nr_bits) || - BTF_INT_OFFSET(int_data) || - nr_bytes != expected_size) + if (btf_type_kflag(s)) { + u32 bitfield_size = BTF_MEMBER_BITFIELD_SIZE(m->offset); + u32 bit_offset = BTF_MEMBER_BIT_OFFSET(m->offset); + + /* if kflag set, int should be a regular int and + * bit offset should be at byte boundary. + */ + return !bitfield_size && + BITS_ROUNDUP_BYTES(bit_offset) == expected_offset && + BITS_ROUNDUP_BYTES(nr_bits) == expected_size; + } + + if (BTF_INT_OFFSET(int_data) || + BITS_PER_BYTE_MASKED(m->offset) || + BITS_ROUNDUP_BYTES(m->offset) != expected_offset || + BITS_PER_BYTE_MASKED(nr_bits) || + BITS_ROUNDUP_BYTES(nr_bits) != expected_size) return false; return true; diff --git a/kernel/bpf/local_storage.c b/kernel/bpf/local_storage.c index 5eca03da0989..07a34ef562a0 100644 --- a/kernel/bpf/local_storage.c +++ b/kernel/bpf/local_storage.c @@ -315,9 +315,8 @@ static int cgroup_storage_check_btf(const struct bpf_map *map, const struct btf_type *key_type, const struct btf_type *value_type) { - const struct btf_type *t; struct btf_member *m; - u32 id, size; + u32 offset, size; /* Key is expected to be of struct bpf_cgroup_storage_key type, * which is: @@ -338,25 +337,17 @@ static int cgroup_storage_check_btf(const struct bpf_map *map, * The first field must be a 64 bit integer at 0 offset. */ m = (struct btf_member *)(key_type + 1); - if (m->offset) - return -EINVAL; - id = m->type; - t = btf_type_id_size(btf, &id, NULL); size = FIELD_SIZEOF(struct bpf_cgroup_storage_key, cgroup_inode_id); - if (!t || !btf_type_is_reg_int(t, size)) + if (!btf_member_is_reg_int(btf, key_type, m, 0, size)) return -EINVAL; /* * The second field must be a 32 bit integer at 64 bit offset. */ m++; - if (m->offset != offsetof(struct bpf_cgroup_storage_key, attach_type) * - BITS_PER_BYTE) - return -EINVAL; - id = m->type; - t = btf_type_id_size(btf, &id, NULL); + offset = offsetof(struct bpf_cgroup_storage_key, attach_type); size = FIELD_SIZEOF(struct bpf_cgroup_storage_key, attach_type); - if (!t || !btf_type_is_reg_int(t, size)) + if (!btf_member_is_reg_int(btf, key_type, m, offset, size)) return -EINVAL; return 0; -- cgit v1.3-7-g2ca7 From 15ff2069cb7f967dae6a8f8c176ba51447c75f00 Mon Sep 17 00:00:00 2001 From: Tetsuo Handa Date: Tue, 18 Dec 2018 06:05:04 +0900 Subject: printk: Add caller information to printk() output. Sometimes we want to print a series of printk() messages to consoles without being disturbed by concurrent printk() from interrupts and/or other threads. But we can't enforce printk() callers to use their local buffers because we need to ask them to make too much changes. Also, even buffering up to one line inside printk() might cause failing to emit an important clue under critical situation. Therefore, instead of trying to help buffering, let's try to help reconstructing messages by saving caller information as of calling log_store() and adding it as "[T$thread_id]" or "[C$processor_id]" upon printing to consoles. Some examples for console output: [ 1.222773][ T1] x86: Booting SMP configuration: [ 2.779635][ T1] pci 0000:00:01.0: PCI bridge to [bus 01] [ 5.069193][ T268] Fusion MPT base driver 3.04.20 [ 9.316504][ C2] random: fast init done [ 13.413336][ T3355] Initialized host personality Some examples for /dev/kmsg output: 6,496,1222773,-,caller=T1;x86: Booting SMP configuration: 6,968,2779635,-,caller=T1;pci 0000:00:01.0: PCI bridge to [bus 01] SUBSYSTEM=pci DEVICE=+pci:0000:00:01.0 6,1353,5069193,-,caller=T268;Fusion MPT base driver 3.04.20 5,1526,9316504,-,caller=C2;random: fast init done 6,1575,13413336,-,caller=T3355;Initialized host personality Note that this patch changes max length of messages which can be printed by printk() or written to /dev/kmsg interface from 992 bytes to 976 bytes, based on an assumption that userspace won't try to write messages hitting that border line to /dev/kmsg interface. Link: http://lkml.kernel.org/r/93f19e57-5051-c67d-9af4-b17624062d44@i-love.sakura.ne.jp Cc: Dmitry Vyukov Cc: Sergey Senozhatsky Cc: Steven Rostedt Cc: Linus Torvalds Cc: Andrew Morton Cc: LKML Cc: syzkaller Signed-off-by: Tetsuo Handa Acked-by: Sergey Senozhatsky Signed-off-by: Petr Mladek --- kernel/printk/printk.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++---- lib/Kconfig.debug | 17 ++++++++++++++++ 2 files changed, 68 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index 91db332ccf4d..e3d977bbc7ab 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -357,6 +357,9 @@ struct printk_log { u8 facility; /* syslog facility */ u8 flags:5; /* internal record flags */ u8 level:3; /* syslog level */ +#ifdef CONFIG_PRINTK_CALLER + u32 caller_id; /* thread id or processor id */ +#endif } #ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS __packed __aligned(4) @@ -423,7 +426,11 @@ static u64 exclusive_console_stop_seq; static u64 clear_seq; static u32 clear_idx; +#ifdef CONFIG_PRINTK_CALLER +#define PREFIX_MAX 48 +#else #define PREFIX_MAX 32 +#endif #define LOG_LINE_MAX (1024 - PREFIX_MAX) #define LOG_LEVEL(v) ((v) & 0x07) @@ -626,6 +633,12 @@ static int log_store(int facility, int level, msg->ts_nsec = ts_nsec; else msg->ts_nsec = local_clock(); +#ifdef CONFIG_PRINTK_CALLER + if (in_task()) + msg->caller_id = task_pid_nr(current); + else + msg->caller_id = 0x80000000 + raw_smp_processor_id(); +#endif memset(log_dict(msg) + dict_len, 0, pad_len); msg->len = size; @@ -689,12 +702,21 @@ static ssize_t msg_print_ext_header(char *buf, size_t size, struct printk_log *msg, u64 seq) { u64 ts_usec = msg->ts_nsec; + char caller[20]; +#ifdef CONFIG_PRINTK_CALLER + u32 id = msg->caller_id; + + snprintf(caller, sizeof(caller), ",caller=%c%u", + id & 0x80000000 ? 'C' : 'T', id & ~0x80000000); +#else + caller[0] = '\0'; +#endif do_div(ts_usec, 1000); - return scnprintf(buf, size, "%u,%llu,%llu,%c;", - (msg->facility << 3) | msg->level, seq, ts_usec, - msg->flags & LOG_CONT ? 'c' : '-'); + return scnprintf(buf, size, "%u,%llu,%llu,%c%s;", + (msg->facility << 3) | msg->level, seq, ts_usec, + msg->flags & LOG_CONT ? 'c' : '-', caller); } static ssize_t msg_print_ext_body(char *buf, size_t size, @@ -1039,6 +1061,9 @@ void log_buf_vmcoreinfo_setup(void) VMCOREINFO_OFFSET(printk_log, len); VMCOREINFO_OFFSET(printk_log, text_len); VMCOREINFO_OFFSET(printk_log, dict_len); +#ifdef CONFIG_PRINTK_CALLER + VMCOREINFO_OFFSET(printk_log, caller_id); +#endif } #endif @@ -1237,10 +1262,23 @@ static size_t print_time(u64 ts, char *buf) { unsigned long rem_nsec = do_div(ts, 1000000000); - return sprintf(buf, "[%5lu.%06lu] ", + return sprintf(buf, "[%5lu.%06lu]", (unsigned long)ts, rem_nsec / 1000); } +#ifdef CONFIG_PRINTK_CALLER +static size_t print_caller(u32 id, char *buf) +{ + char caller[12]; + + snprintf(caller, sizeof(caller), "%c%u", + id & 0x80000000 ? 'C' : 'T', id & ~0x80000000); + return sprintf(buf, "[%6s]", caller); +} +#else +#define print_caller(id, buf) 0 +#endif + static size_t print_prefix(const struct printk_log *msg, bool syslog, bool time, char *buf) { @@ -1248,8 +1286,17 @@ static size_t print_prefix(const struct printk_log *msg, bool syslog, if (syslog) len = print_syslog((msg->facility << 3) | msg->level, buf); + if (time) len += print_time(msg->ts_nsec, buf + len); + + len += print_caller(msg->caller_id, buf + len); + + if (IS_ENABLED(CONFIG_PRINTK_CALLER) || time) { + buf[len++] = ' '; + buf[len] = '\0'; + } + return len; } diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index 8d24f4ed66fd..89df6750b365 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -17,6 +17,23 @@ config PRINTK_TIME The behavior is also controlled by the kernel command line parameter printk.time=1. See Documentation/admin-guide/kernel-parameters.rst +config PRINTK_CALLER + bool "Show caller information on printks" + depends on PRINTK + help + Selecting this option causes printk() to add a caller "thread id" (if + in task context) or a caller "processor id" (if not in task context) + to every message. + + This option is intended for environments where multiple threads + concurrently call printk() for many times, for it is difficult to + interpret without knowing where these lines (or sometimes individual + line which was divided into multiple lines due to race) came from. + + Since toggling after boot makes the code racy, currently there is + no option to enable/disable at the kernel command line parameter or + sysfs interface. + config CONSOLE_LOGLEVEL_DEFAULT int "Default console loglevel (1-15)" range 1 15 -- cgit v1.3-7-g2ca7 From 07daef8b41e0d9e7802a448f6766504e7641a234 Mon Sep 17 00:00:00 2001 From: YueHaibing Date: Sun, 9 Dec 2018 14:22:25 +0800 Subject: ntp: Remove duplicated include Signed-off-by: YueHaibing Signed-off-by: Thomas Gleixner Cc: Cc: Link: https://lkml.kernel.org/r/20181209062225.4344-1-yuehaibing@huawei.com --- kernel/time/ntp.c | 1 - 1 file changed, 1 deletion(-) (limited to 'kernel') diff --git a/kernel/time/ntp.c b/kernel/time/ntp.c index c5e0cba3b39c..bc3a3c37ec9c 100644 --- a/kernel/time/ntp.c +++ b/kernel/time/ntp.c @@ -17,7 +17,6 @@ #include #include #include -#include #include "ntp_internal.h" #include "timekeeping_internal.h" -- cgit v1.3-7-g2ca7 From c5f48c0a7aa1a8c82d81cdf27e63aa0a5544c6e6 Mon Sep 17 00:00:00 2001 From: Ingo Molnar Date: Mon, 3 Dec 2018 11:44:51 +0100 Subject: genirq: Fix various typos in comments Go over the IRQ subsystem source code (including irqchip drivers) and fix common typos in comments. No change in functionality intended. Signed-off-by: Ingo Molnar Cc: Thomas Gleixner Cc: Jason Cooper Cc: Marc Zyngier Cc: Peter Zijlstra Cc: Linus Torvalds Cc: linux-kernel@vger.kernel.org --- drivers/irqchip/irq-dw-apb-ictl.c | 2 +- drivers/irqchip/irq-gic.c | 6 +++--- drivers/irqchip/irq-renesas-h8s.c | 2 +- drivers/irqchip/irq-s3c24xx.c | 2 +- include/linux/irqchip.h | 4 ++-- kernel/irq/chip.c | 2 +- kernel/irq/ipi.c | 4 ++-- kernel/irq/manage.c | 2 +- kernel/irq/spurious.c | 6 +++--- 9 files changed, 15 insertions(+), 15 deletions(-) (limited to 'kernel') diff --git a/drivers/irqchip/irq-dw-apb-ictl.c b/drivers/irqchip/irq-dw-apb-ictl.c index 0a19618ce2c8..e4550e9c810b 100644 --- a/drivers/irqchip/irq-dw-apb-ictl.c +++ b/drivers/irqchip/irq-dw-apb-ictl.c @@ -105,7 +105,7 @@ static int __init dw_apb_ictl_init(struct device_node *np, * DW IP can be configured to allow 2-64 irqs. We can determine * the number of irqs supported by writing into enable register * and look for bits not set, as corresponding flip-flops will - * have been removed by sythesis tool. + * have been removed by synthesis tool. */ /* mask and enable all interrupts */ diff --git a/drivers/irqchip/irq-gic.c b/drivers/irqchip/irq-gic.c index ced10c44b68a..ba2a37a27a54 100644 --- a/drivers/irqchip/irq-gic.c +++ b/drivers/irqchip/irq-gic.c @@ -604,8 +604,8 @@ void gic_dist_save(struct gic_chip_data *gic) /* * Restores the GIC distributor registers during resume or when coming out of * idle. Must be called before enabling interrupts. If a level interrupt - * that occured while the GIC was suspended is still present, it will be - * handled normally, but any edge interrupts that occured will not be seen by + * that occurred while the GIC was suspended is still present, it will be + * handled normally, but any edge interrupts that occurred will not be seen by * the GIC and need to be handled by the platform-specific wakeup source. */ void gic_dist_restore(struct gic_chip_data *gic) @@ -899,7 +899,7 @@ void gic_migrate_target(unsigned int new_cpu_id) gic_cpu_map[cpu] = 1 << new_cpu_id; /* - * Find all the peripheral interrupts targetting the current + * Find all the peripheral interrupts targeting the current * CPU interface and migrate them to the new CPU interface. * We skip DIST_TARGET 0 to 7 as they are read-only. */ diff --git a/drivers/irqchip/irq-renesas-h8s.c b/drivers/irqchip/irq-renesas-h8s.c index 85234d456638..4e2461bae944 100644 --- a/drivers/irqchip/irq-renesas-h8s.c +++ b/drivers/irqchip/irq-renesas-h8s.c @@ -1,6 +1,6 @@ // SPDX-License-Identifier: GPL-2.0 /* - * H8S interrupt contoller driver + * H8S interrupt controller driver * * Copyright 2015 Yoshinori Sato */ diff --git a/drivers/irqchip/irq-s3c24xx.c b/drivers/irqchip/irq-s3c24xx.c index c19766fe8a1a..b623f300f1b1 100644 --- a/drivers/irqchip/irq-s3c24xx.c +++ b/drivers/irqchip/irq-s3c24xx.c @@ -58,7 +58,7 @@ struct s3c_irq_data { }; /* - * Sructure holding the controller data + * Structure holding the controller data * @reg_pending register holding pending irqs * @reg_intpnd special register intpnd in main intc * @reg_mask mask register diff --git a/include/linux/irqchip.h b/include/linux/irqchip.h index 89c34b200671..950e4b2458f0 100644 --- a/include/linux/irqchip.h +++ b/include/linux/irqchip.h @@ -19,7 +19,7 @@ * the association between their DT compatible string and their * initialization function. * - * @name: name that must be unique accross all IRQCHIP_DECLARE of the + * @name: name that must be unique across all IRQCHIP_DECLARE of the * same file. * @compstr: compatible string of the irqchip driver * @fn: initialization function @@ -30,7 +30,7 @@ * This macro must be used by the different irqchip drivers to declare * the association between their version and their initialization function. * - * @name: name that must be unique accross all IRQCHIP_ACPI_DECLARE of the + * @name: name that must be unique across all IRQCHIP_ACPI_DECLARE of the * same file. * @subtable: Subtable to be identified in MADT * @validate: Function to be called on that subtable to check its validity. diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c index a2b3d9de999c..34e969069488 100644 --- a/kernel/irq/chip.c +++ b/kernel/irq/chip.c @@ -929,7 +929,7 @@ __irq_do_set_handler(struct irq_desc *desc, irq_flow_handler_t handle, break; /* * Bail out if the outer chip is not set up - * and the interrrupt supposed to be started + * and the interrupt supposed to be started * right away. */ if (WARN_ON(is_chained)) diff --git a/kernel/irq/ipi.c b/kernel/irq/ipi.c index 8b778e37dc6d..43e3d1be622c 100644 --- a/kernel/irq/ipi.c +++ b/kernel/irq/ipi.c @@ -56,7 +56,7 @@ int irq_reserve_ipi(struct irq_domain *domain, unsigned int next; /* - * The IPI requires a seperate HW irq on each CPU. We require + * The IPI requires a separate HW irq on each CPU. We require * that the destination mask is consecutive. If an * implementation needs to support holes, it can reserve * several IPI ranges. @@ -172,7 +172,7 @@ irq_hw_number_t ipi_get_hwirq(unsigned int irq, unsigned int cpu) /* * Get the real hardware irq number if the underlying implementation - * uses a seperate irq per cpu. If the underlying implementation uses + * uses a separate irq per cpu. If the underlying implementation uses * a single hardware irq for all cpus then the IPI send mechanism * needs to take care of the cpu destinations. */ diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c index 9dbdccab3b6a..a4888ce4667a 100644 --- a/kernel/irq/manage.c +++ b/kernel/irq/manage.c @@ -915,7 +915,7 @@ irq_thread_check_affinity(struct irq_desc *desc, struct irqaction *action) { } #endif /* - * Interrupts which are not explicitely requested as threaded + * Interrupts which are not explicitly requested as threaded * interrupts rely on the implicit bh/preempt disable of the hard irq * context. So we need to disable bh here to avoid deadlocks and other * side effects. diff --git a/kernel/irq/spurious.c b/kernel/irq/spurious.c index d867d6ddafdd..6d2fa6914b30 100644 --- a/kernel/irq/spurious.c +++ b/kernel/irq/spurious.c @@ -66,7 +66,7 @@ static int try_one_irq(struct irq_desc *desc, bool force) raw_spin_lock(&desc->lock); /* - * PER_CPU, nested thread interrupts and interrupts explicitely + * PER_CPU, nested thread interrupts and interrupts explicitly * marked polled are excluded from polling. */ if (irq_settings_is_per_cpu(desc) || @@ -76,7 +76,7 @@ static int try_one_irq(struct irq_desc *desc, bool force) /* * Do not poll disabled interrupts unless the spurious - * disabled poller asks explicitely. + * disabled poller asks explicitly. */ if (irqd_irq_disabled(&desc->irq_data) && !force) goto out; @@ -292,7 +292,7 @@ void note_interrupt(struct irq_desc *desc, irqreturn_t action_ret) * So in case a thread is woken, we just note the fact and * defer the analysis to the next hardware interrupt. * - * The threaded handlers store whether they sucessfully + * The threaded handlers store whether they successfully * handled an interrupt and we check whether that number * changed versus the last invocation. * -- cgit v1.3-7-g2ca7 From e11d4284e2f4de5048c6d1787c82226f0a198292 Mon Sep 17 00:00:00 2001 From: Arnd Bergmann Date: Wed, 18 Apr 2018 13:43:52 +0200 Subject: y2038: socket: Add compat_sys_recvmmsg_time64 recvmmsg() takes two arguments to pointers of structures that differ between 32-bit and 64-bit architectures: mmsghdr and timespec. For y2038 compatbility, we are changing the native system call from timespec to __kernel_timespec with a 64-bit time_t (in another patch), and use the existing compat system call on both 32-bit and 64-bit architectures for compatibility with traditional 32-bit user space. As we now have two variants of recvmmsg() for 32-bit tasks that are both different from the variant that we use on 64-bit tasks, this means we also require two compat system calls! The solution I picked is to flip things around: The existing compat_sys_recvmmsg() call gets moved from net/compat.c into net/socket.c and now handles the case for old user space on all architectures that have set CONFIG_COMPAT_32BIT_TIME. A new compat_sys_recvmmsg_time64() call gets added in the old place for 64-bit architectures only, this one handles the case of a compat mmsghdr structure combined with __kernel_timespec. In the indirect sys_socketcall(), we now need to call either do_sys_recvmmsg() or __compat_sys_recvmmsg(), depending on what kind of architecture we are on. For compat_sys_socketcall(), no such change is needed, we always call __compat_sys_recvmmsg(). I decided to not add a new SYS_RECVMMSG_TIME64 socketcall: Any libc implementation for 64-bit time_t will need significant changes including an updated asm/unistd.h, and it seems better to consistently use the separate syscalls that configuration, leaving the socketcall only for backward compatibility with 32-bit time_t based libc. The naming is asymmetric for the moment, so both existing syscalls entry points keep their names, while the new ones are recvmmsg_time32 and compat_recvmmsg_time64 respectively. I expect that we will rename the compat syscalls later as we start using generated syscall tables everywhere and add these entry points. Signed-off-by: Arnd Bergmann --- include/linux/compat.h | 3 +++ include/linux/socket.h | 9 ++++--- include/linux/syscalls.h | 3 +++ kernel/sys_ni.c | 2 ++ net/compat.c | 34 ++++++++++---------------- net/socket.c | 62 +++++++++++++++++++++++++++++++++++------------- 6 files changed, 72 insertions(+), 41 deletions(-) (limited to 'kernel') diff --git a/include/linux/compat.h b/include/linux/compat.h index 8be8daa38c9a..4b0463608589 100644 --- a/include/linux/compat.h +++ b/include/linux/compat.h @@ -893,6 +893,9 @@ asmlinkage long compat_sys_move_pages(pid_t pid, compat_ulong_t nr_pages, asmlinkage long compat_sys_rt_tgsigqueueinfo(compat_pid_t tgid, compat_pid_t pid, int sig, struct compat_siginfo __user *uinfo); +asmlinkage long compat_sys_recvmmsg_time64(int fd, struct compat_mmsghdr __user *mmsg, + unsigned vlen, unsigned int flags, + struct __kernel_timespec __user *timeout); asmlinkage long compat_sys_recvmmsg(int fd, struct compat_mmsghdr __user *mmsg, unsigned vlen, unsigned int flags, struct old_timespec32 __user *timeout); diff --git a/include/linux/socket.h b/include/linux/socket.h index 8b571e9b9f76..333b5df8a1b2 100644 --- a/include/linux/socket.h +++ b/include/linux/socket.h @@ -348,7 +348,8 @@ struct ucred { extern int move_addr_to_kernel(void __user *uaddr, int ulen, struct sockaddr_storage *kaddr); extern int put_cmsg(struct msghdr*, int level, int type, int len, void *data); -struct timespec64; +struct __kernel_timespec; +struct old_timespec32; /* The __sys_...msg variants allow MSG_CMSG_COMPAT iff * forbid_cmsg_compat==false @@ -357,8 +358,10 @@ extern long __sys_recvmsg(int fd, struct user_msghdr __user *msg, unsigned int flags, bool forbid_cmsg_compat); extern long __sys_sendmsg(int fd, struct user_msghdr __user *msg, unsigned int flags, bool forbid_cmsg_compat); -extern int __sys_recvmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen, - unsigned int flags, struct timespec64 *timeout); +extern int __sys_recvmmsg(int fd, struct mmsghdr __user *mmsg, + unsigned int vlen, unsigned int flags, + struct __kernel_timespec __user *timeout, + struct old_timespec32 __user *timeout32); extern int __sys_sendmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen, unsigned int flags, bool forbid_cmsg_compat); diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 247ad9eca955..03cda6793be3 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -843,6 +843,9 @@ asmlinkage long sys_accept4(int, struct sockaddr __user *, int __user *, int); asmlinkage long sys_recvmmsg(int fd, struct mmsghdr __user *msg, unsigned int vlen, unsigned flags, struct __kernel_timespec __user *timeout); +asmlinkage long sys_recvmmsg_time32(int fd, struct mmsghdr __user *msg, + unsigned int vlen, unsigned flags, + struct old_timespec32 __user *timeout); asmlinkage long sys_wait4(pid_t pid, int __user *stat_addr, int options, struct rusage __user *ru); diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index df556175be50..ab9d0e3c6d50 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -284,7 +284,9 @@ COND_SYSCALL_COMPAT(move_pages); COND_SYSCALL(perf_event_open); COND_SYSCALL(accept4); COND_SYSCALL(recvmmsg); +COND_SYSCALL(recvmmsg_time32); COND_SYSCALL_COMPAT(recvmmsg); +COND_SYSCALL_COMPAT(recvmmsg_time64); /* * Architecture specific syscalls: see further below diff --git a/net/compat.c b/net/compat.c index 47a614b370cd..f7084780a8f8 100644 --- a/net/compat.c +++ b/net/compat.c @@ -810,34 +810,23 @@ COMPAT_SYSCALL_DEFINE6(recvfrom, int, fd, void __user *, buf, compat_size_t, len return __compat_sys_recvfrom(fd, buf, len, flags, addr, addrlen); } -static int __compat_sys_recvmmsg(int fd, struct compat_mmsghdr __user *mmsg, - unsigned int vlen, unsigned int flags, - struct old_timespec32 __user *timeout) +COMPAT_SYSCALL_DEFINE5(recvmmsg_time64, int, fd, struct compat_mmsghdr __user *, mmsg, + unsigned int, vlen, unsigned int, flags, + struct __kernel_timespec __user *, timeout) { - int datagrams; - struct timespec64 ktspec; - - if (timeout == NULL) - return __sys_recvmmsg(fd, (struct mmsghdr __user *)mmsg, vlen, - flags | MSG_CMSG_COMPAT, NULL); - - if (compat_get_timespec64(&ktspec, timeout)) - return -EFAULT; - - datagrams = __sys_recvmmsg(fd, (struct mmsghdr __user *)mmsg, vlen, - flags | MSG_CMSG_COMPAT, &ktspec); - if (datagrams > 0 && compat_put_timespec64(&ktspec, timeout)) - datagrams = -EFAULT; - - return datagrams; + return __sys_recvmmsg(fd, (struct mmsghdr __user *)mmsg, vlen, + flags | MSG_CMSG_COMPAT, timeout, NULL); } +#ifdef CONFIG_COMPAT_32BIT_TIME COMPAT_SYSCALL_DEFINE5(recvmmsg, int, fd, struct compat_mmsghdr __user *, mmsg, unsigned int, vlen, unsigned int, flags, struct old_timespec32 __user *, timeout) { - return __compat_sys_recvmmsg(fd, mmsg, vlen, flags, timeout); + return __sys_recvmmsg(fd, (struct mmsghdr __user *)mmsg, vlen, + flags | MSG_CMSG_COMPAT, NULL, timeout); } +#endif COMPAT_SYSCALL_DEFINE2(socketcall, int, call, u32 __user *, args) { @@ -925,8 +914,9 @@ COMPAT_SYSCALL_DEFINE2(socketcall, int, call, u32 __user *, args) ret = __compat_sys_recvmsg(a0, compat_ptr(a1), a[2]); break; case SYS_RECVMMSG: - ret = __compat_sys_recvmmsg(a0, compat_ptr(a1), a[2], a[3], - compat_ptr(a[4])); + ret = __sys_recvmmsg(a0, compat_ptr(a1), a[2], + a[3] | MSG_CMSG_COMPAT, NULL, + compat_ptr(a[4])); break; case SYS_ACCEPT4: ret = __sys_accept4(a0, compat_ptr(a1), compat_ptr(a[2]), a[3]); diff --git a/net/socket.c b/net/socket.c index 593826e11a53..f137a96628f1 100644 --- a/net/socket.c +++ b/net/socket.c @@ -2341,8 +2341,9 @@ SYSCALL_DEFINE3(recvmsg, int, fd, struct user_msghdr __user *, msg, * Linux recvmmsg interface */ -int __sys_recvmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen, - unsigned int flags, struct timespec64 *timeout) +static int do_recvmmsg(int fd, struct mmsghdr __user *mmsg, + unsigned int vlen, unsigned int flags, + struct timespec64 *timeout) { int fput_needed, err, datagrams; struct socket *sock; @@ -2451,25 +2452,32 @@ out_put: return datagrams; } -static int do_sys_recvmmsg(int fd, struct mmsghdr __user *mmsg, - unsigned int vlen, unsigned int flags, - struct __kernel_timespec __user *timeout) +int __sys_recvmmsg(int fd, struct mmsghdr __user *mmsg, + unsigned int vlen, unsigned int flags, + struct __kernel_timespec __user *timeout, + struct old_timespec32 __user *timeout32) { int datagrams; struct timespec64 timeout_sys; - if (flags & MSG_CMSG_COMPAT) - return -EINVAL; - - if (!timeout) - return __sys_recvmmsg(fd, mmsg, vlen, flags, NULL); + if (timeout && get_timespec64(&timeout_sys, timeout)) + return -EFAULT; - if (get_timespec64(&timeout_sys, timeout)) + if (timeout32 && get_old_timespec32(&timeout_sys, timeout32)) return -EFAULT; - datagrams = __sys_recvmmsg(fd, mmsg, vlen, flags, &timeout_sys); + if (!timeout && !timeout32) + return do_recvmmsg(fd, mmsg, vlen, flags, NULL); + + datagrams = do_recvmmsg(fd, mmsg, vlen, flags, &timeout_sys); - if (datagrams > 0 && put_timespec64(&timeout_sys, timeout)) + if (datagrams <= 0) + return datagrams; + + if (timeout && put_timespec64(&timeout_sys, timeout)) + datagrams = -EFAULT; + + if (timeout32 && put_old_timespec32(&timeout_sys, timeout32)) datagrams = -EFAULT; return datagrams; @@ -2479,8 +2487,23 @@ SYSCALL_DEFINE5(recvmmsg, int, fd, struct mmsghdr __user *, mmsg, unsigned int, vlen, unsigned int, flags, struct __kernel_timespec __user *, timeout) { - return do_sys_recvmmsg(fd, mmsg, vlen, flags, timeout); + if (flags & MSG_CMSG_COMPAT) + return -EINVAL; + + return __sys_recvmmsg(fd, mmsg, vlen, flags, timeout, NULL); +} + +#ifdef CONFIG_COMPAT_32BIT_TIME +SYSCALL_DEFINE5(recvmmsg_time32, int, fd, struct mmsghdr __user *, mmsg, + unsigned int, vlen, unsigned int, flags, + struct old_timespec32 __user *, timeout) +{ + if (flags & MSG_CMSG_COMPAT) + return -EINVAL; + + return __sys_recvmmsg(fd, mmsg, vlen, flags, NULL, timeout); } +#endif #ifdef __ARCH_WANT_SYS_SOCKETCALL /* Argument list sizes for sys_socketcall */ @@ -2600,8 +2623,15 @@ SYSCALL_DEFINE2(socketcall, int, call, unsigned long __user *, args) a[2], true); break; case SYS_RECVMMSG: - err = do_sys_recvmmsg(a0, (struct mmsghdr __user *)a1, a[2], - a[3], (struct __kernel_timespec __user *)a[4]); + if (IS_ENABLED(CONFIG_64BIT) || !IS_ENABLED(CONFIG_64BIT_TIME)) + err = __sys_recvmmsg(a0, (struct mmsghdr __user *)a1, + a[2], a[3], + (struct __kernel_timespec __user *)a[4], + NULL); + else + err = __sys_recvmmsg(a0, (struct mmsghdr __user *)a1, + a[2], a[3], NULL, + (struct old_timespec32 __user *)a[4]); break; case SYS_ACCEPT4: err = __sys_accept4(a0, (struct sockaddr __user *)a1, -- cgit v1.3-7-g2ca7 From df8522a340ee4ccb725036e1f9145f5646939aed Mon Sep 17 00:00:00 2001 From: Arnd Bergmann Date: Wed, 18 Apr 2018 16:15:37 +0200 Subject: y2038: signal: Add sys_rt_sigtimedwait_time32 Once sys_rt_sigtimedwait() gets changed to a 64-bit time_t, we have to provide compatibility support for existing binaries. An earlier version of this patch reused the compat_sys_rt_sigtimedwait entry point to avoid code duplication, but this newer approach duplicates the existing native entry point instead, which seems a bit cleaner. Signed-off-by: Arnd Bergmann --- include/linux/syscalls.h | 4 ++++ kernel/signal.c | 33 +++++++++++++++++++++++++++++++++ 2 files changed, 37 insertions(+) (limited to 'kernel') diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 03cda6793be3..251979d2e709 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -649,6 +649,10 @@ asmlinkage long sys_rt_sigtimedwait(const sigset_t __user *uthese, siginfo_t __user *uinfo, const struct __kernel_timespec __user *uts, size_t sigsetsize); +asmlinkage long sys_rt_sigtimedwait_time32(const sigset_t __user *uthese, + siginfo_t __user *uinfo, + const struct old_timespec32 __user *uts, + size_t sigsetsize); asmlinkage long sys_rt_sigqueueinfo(pid_t pid, int sig, siginfo_t __user *uinfo); /* kernel/sys.c */ diff --git a/kernel/signal.c b/kernel/signal.c index 3c8ea7a328e0..be6744cd0a11 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -3332,6 +3332,39 @@ SYSCALL_DEFINE4(rt_sigtimedwait, const sigset_t __user *, uthese, return ret; } +#ifdef CONFIG_COMPAT_32BIT_TIME +SYSCALL_DEFINE4(rt_sigtimedwait_time32, const sigset_t __user *, uthese, + siginfo_t __user *, uinfo, + const struct old_timespec32 __user *, uts, + size_t, sigsetsize) +{ + sigset_t these; + struct timespec64 ts; + kernel_siginfo_t info; + int ret; + + if (sigsetsize != sizeof(sigset_t)) + return -EINVAL; + + if (copy_from_user(&these, uthese, sizeof(these))) + return -EFAULT; + + if (uts) { + if (get_old_timespec32(&ts, uts)) + return -EFAULT; + } + + ret = do_sigtimedwait(&these, &info, uts ? &ts : NULL); + + if (ret > 0 && uinfo) { + if (copy_siginfo_to_user(uinfo, &info)) + ret = -EFAULT; + } + + return ret; +} +#endif + #ifdef CONFIG_COMPAT COMPAT_SYSCALL_DEFINE4(rt_sigtimedwait, compat_sigset_t __user *, uthese, struct compat_siginfo __user *, uinfo, -- cgit v1.3-7-g2ca7 From 2367c4b5fa09b2947d03c5cd23d7bc0200b7fe4f Mon Sep 17 00:00:00 2001 From: Arnd Bergmann Date: Wed, 18 Apr 2018 16:18:35 +0200 Subject: y2038: signal: Add compat_sys_rt_sigtimedwait_time64 Now that 32-bit architectures have two variants of sys_rt_sigtimedwaid() for 32-bit and 64-bit time_t, we also need to have a second compat system call entry point on the corresponding 64-bit architectures. The traditional system call keeps getting handled by compat_sys_rt_sigtimedwait(), and this adds a new compat_sys_rt_sigtimedwait_time64() that differs only in the timeout argument type. The naming remains a bit asymmetric for the moment. Ideally we would want to have compat_sys_rt_sigtimedwait_time32() for the old version and compat_sys_rt_sigtimedwait() for the new one to mirror the names of the native entry points, but renaming the existing system call tables causes unnecessary churn. I would suggest renaming all such system calls together at a later point. Signed-off-by: Arnd Bergmann --- include/linux/compat.h | 3 +++ kernel/signal.c | 32 ++++++++++++++++++++++++++++++++ 2 files changed, 35 insertions(+) (limited to 'kernel') diff --git a/include/linux/compat.h b/include/linux/compat.h index 4b0463608589..056be0d03722 100644 --- a/include/linux/compat.h +++ b/include/linux/compat.h @@ -788,6 +788,9 @@ asmlinkage long compat_sys_rt_sigpending(compat_sigset_t __user *uset, asmlinkage long compat_sys_rt_sigtimedwait(compat_sigset_t __user *uthese, struct compat_siginfo __user *uinfo, struct old_timespec32 __user *uts, compat_size_t sigsetsize); +asmlinkage long compat_sys_rt_sigtimedwait_time64(compat_sigset_t __user *uthese, + struct compat_siginfo __user *uinfo, + struct __kernel_timespec __user *uts, compat_size_t sigsetsize); asmlinkage long compat_sys_rt_sigqueueinfo(compat_pid_t pid, int sig, struct compat_siginfo __user *uinfo); /* No generic prototype for rt_sigreturn */ diff --git a/kernel/signal.c b/kernel/signal.c index be6744cd0a11..53e07d97ffe0 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -3366,6 +3366,37 @@ SYSCALL_DEFINE4(rt_sigtimedwait_time32, const sigset_t __user *, uthese, #endif #ifdef CONFIG_COMPAT +COMPAT_SYSCALL_DEFINE4(rt_sigtimedwait_time64, compat_sigset_t __user *, uthese, + struct compat_siginfo __user *, uinfo, + struct __kernel_timespec __user *, uts, compat_size_t, sigsetsize) +{ + sigset_t s; + struct timespec64 t; + kernel_siginfo_t info; + long ret; + + if (sigsetsize != sizeof(sigset_t)) + return -EINVAL; + + if (get_compat_sigset(&s, uthese)) + return -EFAULT; + + if (uts) { + if (get_timespec64(&t, uts)) + return -EFAULT; + } + + ret = do_sigtimedwait(&s, &info, uts ? &t : NULL); + + if (ret > 0 && uinfo) { + if (copy_siginfo_to_user32(uinfo, &info)) + ret = -EFAULT; + } + + return ret; +} + +#ifdef CONFIG_COMPAT_32BIT_TIME COMPAT_SYSCALL_DEFINE4(rt_sigtimedwait, compat_sigset_t __user *, uthese, struct compat_siginfo __user *, uinfo, struct old_timespec32 __user *, uts, compat_size_t, sigsetsize) @@ -3396,6 +3427,7 @@ COMPAT_SYSCALL_DEFINE4(rt_sigtimedwait, compat_sigset_t __user *, uthese, return ret; } #endif +#endif /** * sys_kill - send a signal to a process -- cgit v1.3-7-g2ca7 From 926617889dc8383a120c66a2ecf7959a69f96950 Mon Sep 17 00:00:00 2001 From: Arnd Bergmann Date: Tue, 14 Aug 2018 14:15:23 +0200 Subject: timekeeping: remove unused {read,update}_persistent_clock After arch/sh has removed the last reference to these functions, we can remove them completely and just rely on the 64-bit time_t based versions. This cleans up a rather ugly use of __weak functions. Signed-off-by: Arnd Bergmann Acked-by: John Stultz --- include/linux/timekeeping32.h | 6 ------ kernel/time/ntp.c | 10 +--------- kernel/time/timekeeping.c | 12 ++---------- 3 files changed, 3 insertions(+), 25 deletions(-) (limited to 'kernel') diff --git a/include/linux/timekeeping32.h b/include/linux/timekeeping32.h index a502616f7e1c..0036ff314ac5 100644 --- a/include/linux/timekeeping32.h +++ b/include/linux/timekeeping32.h @@ -52,10 +52,4 @@ static inline void getboottime(struct timespec *ts) *ts = timespec64_to_timespec(ts64); } -/* - * Persistent clock related interfaces - */ -extern void read_persistent_clock(struct timespec *ts); -extern int update_persistent_clock(struct timespec now); - #endif diff --git a/kernel/time/ntp.c b/kernel/time/ntp.c index c5e0cba3b39c..e23be418d015 100644 --- a/kernel/time/ntp.c +++ b/kernel/time/ntp.c @@ -555,17 +555,9 @@ static void sync_rtc_clock(void) } #ifdef CONFIG_GENERIC_CMOS_UPDATE -int __weak update_persistent_clock(struct timespec now) -{ - return -ENODEV; -} - int __weak update_persistent_clock64(struct timespec64 now64) { - struct timespec now; - - now = timespec64_to_timespec(now64); - return update_persistent_clock(now); + return -ENODEV; } #endif diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c index 2d110c948805..eb09be4871b3 100644 --- a/kernel/time/timekeeping.c +++ b/kernel/time/timekeeping.c @@ -1467,7 +1467,7 @@ u64 timekeeping_max_deferment(void) } /** - * read_persistent_clock - Return time from the persistent clock. + * read_persistent_clock64 - Return time from the persistent clock. * * Weak dummy function for arches that do not yet support it. * Reads the time from the battery backed persistent clock. @@ -1475,20 +1475,12 @@ u64 timekeeping_max_deferment(void) * * XXX - Do be sure to remove it once all arches implement it. */ -void __weak read_persistent_clock(struct timespec *ts) +void __weak read_persistent_clock64(struct timespec64 *ts) { ts->tv_sec = 0; ts->tv_nsec = 0; } -void __weak read_persistent_clock64(struct timespec64 *ts64) -{ - struct timespec ts; - - read_persistent_clock(&ts); - *ts64 = timespec_to_timespec64(ts); -} - /** * read_persistent_wall_and_boot_offset - Read persistent clock, and also offset * from the boot. -- cgit v1.3-7-g2ca7 From 437e78d3fd6d35e6d56230962e6d03bb5dcda7f6 Mon Sep 17 00:00:00 2001 From: Arnd Bergmann Date: Fri, 7 Dec 2018 13:41:02 +0100 Subject: timekeeping: remove timespec_add/timespec_del The last users were removed a while ago since everyone moved to ktime_t, so we can remove the two unused interfaces for old timespec structures. With those two gone, set_normalized_timespec() is also unused, so remove that as well. Signed-off-by: Arnd Bergmann Acked-by: John Stultz --- include/linux/time32.h | 25 ------------------------- kernel/time/time.c | 36 ------------------------------------ 2 files changed, 61 deletions(-) (limited to 'kernel') diff --git a/include/linux/time32.h b/include/linux/time32.h index 61904a6c098f..118b9977080c 100644 --- a/include/linux/time32.h +++ b/include/linux/time32.h @@ -96,31 +96,6 @@ static inline int timespec_compare(const struct timespec *lhs, const struct time return lhs->tv_nsec - rhs->tv_nsec; } -extern void set_normalized_timespec(struct timespec *ts, time_t sec, s64 nsec); - -static inline struct timespec timespec_add(struct timespec lhs, - struct timespec rhs) -{ - struct timespec ts_delta; - - set_normalized_timespec(&ts_delta, lhs.tv_sec + rhs.tv_sec, - lhs.tv_nsec + rhs.tv_nsec); - return ts_delta; -} - -/* - * sub = lhs - rhs, in normalized form - */ -static inline struct timespec timespec_sub(struct timespec lhs, - struct timespec rhs) -{ - struct timespec ts_delta; - - set_normalized_timespec(&ts_delta, lhs.tv_sec - rhs.tv_sec, - lhs.tv_nsec - rhs.tv_nsec); - return ts_delta; -} - /* * Returns true if the timespec is norm, false if denorm: */ diff --git a/kernel/time/time.c b/kernel/time/time.c index ad204cf6d001..532bb560252d 100644 --- a/kernel/time/time.c +++ b/kernel/time/time.c @@ -386,42 +386,6 @@ time64_t mktime64(const unsigned int year0, const unsigned int mon0, } EXPORT_SYMBOL(mktime64); -/** - * set_normalized_timespec - set timespec sec and nsec parts and normalize - * - * @ts: pointer to timespec variable to be set - * @sec: seconds to set - * @nsec: nanoseconds to set - * - * Set seconds and nanoseconds field of a timespec variable and - * normalize to the timespec storage format - * - * Note: The tv_nsec part is always in the range of - * 0 <= tv_nsec < NSEC_PER_SEC - * For negative values only the tv_sec field is negative ! - */ -void set_normalized_timespec(struct timespec *ts, time_t sec, s64 nsec) -{ - while (nsec >= NSEC_PER_SEC) { - /* - * The following asm() prevents the compiler from - * optimising this loop into a modulo operation. See - * also __iter_div_u64_rem() in include/linux/time.h - */ - asm("" : "+rm"(nsec)); - nsec -= NSEC_PER_SEC; - ++sec; - } - while (nsec < 0) { - asm("" : "+rm"(nsec)); - nsec += NSEC_PER_SEC; - --sec; - } - ts->tv_sec = sec; - ts->tv_nsec = nsec; -} -EXPORT_SYMBOL(set_normalized_timespec); - /** * ns_to_timespec - Convert nanoseconds to timespec * @nsec: the nanoseconds value to be converted -- cgit v1.3-7-g2ca7 From a38d1107f937ca95dcf820161ef44ea683d6a0b1 Mon Sep 17 00:00:00 2001 From: Matt Mullins Date: Wed, 12 Dec 2018 16:42:37 -0800 Subject: bpf: support raw tracepoints in modules Distributions build drivers as modules, including network and filesystem drivers which export numerous tracepoints. This enables bpf(BPF_RAW_TRACEPOINT_OPEN) to attach to those tracepoints. Signed-off-by: Matt Mullins Acked-by: Martin KaFai Lau Signed-off-by: Alexei Starovoitov --- include/linux/module.h | 4 ++ include/linux/trace_events.h | 8 +++- kernel/bpf/syscall.c | 11 +++-- kernel/module.c | 5 +++ kernel/trace/bpf_trace.c | 99 +++++++++++++++++++++++++++++++++++++++++++- 5 files changed, 120 insertions(+), 7 deletions(-) (limited to 'kernel') diff --git a/include/linux/module.h b/include/linux/module.h index fce6b4335e36..5f147dd5e709 100644 --- a/include/linux/module.h +++ b/include/linux/module.h @@ -432,6 +432,10 @@ struct module { unsigned int num_tracepoints; tracepoint_ptr_t *tracepoints_ptrs; #endif +#ifdef CONFIG_BPF_EVENTS + unsigned int num_bpf_raw_events; + struct bpf_raw_event_map *bpf_raw_events; +#endif #ifdef HAVE_JUMP_LABEL struct jump_entry *jump_entries; unsigned int num_jump_entries; diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h index 4130a5497d40..8a62731673f7 100644 --- a/include/linux/trace_events.h +++ b/include/linux/trace_events.h @@ -471,7 +471,8 @@ void perf_event_detach_bpf_prog(struct perf_event *event); int perf_event_query_prog_array(struct perf_event *event, void __user *info); int bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_prog *prog); int bpf_probe_unregister(struct bpf_raw_event_map *btp, struct bpf_prog *prog); -struct bpf_raw_event_map *bpf_find_raw_tracepoint(const char *name); +struct bpf_raw_event_map *bpf_get_raw_tracepoint(const char *name); +void bpf_put_raw_tracepoint(struct bpf_raw_event_map *btp); int bpf_get_perf_event_info(const struct perf_event *event, u32 *prog_id, u32 *fd_type, const char **buf, u64 *probe_offset, u64 *probe_addr); @@ -502,10 +503,13 @@ static inline int bpf_probe_unregister(struct bpf_raw_event_map *btp, struct bpf { return -EOPNOTSUPP; } -static inline struct bpf_raw_event_map *bpf_find_raw_tracepoint(const char *name) +static inline struct bpf_raw_event_map *bpf_get_raw_tracepoint(const char *name) { return NULL; } +static inline void bpf_put_raw_tracepoint(struct bpf_raw_event_map *btp) +{ +} static inline int bpf_get_perf_event_info(const struct perf_event *event, u32 *prog_id, u32 *fd_type, const char **buf, u64 *probe_offset, diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 5db31067d85e..0607db304def 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -1604,6 +1604,7 @@ static int bpf_raw_tracepoint_release(struct inode *inode, struct file *filp) bpf_probe_unregister(raw_tp->btp, raw_tp->prog); bpf_prog_put(raw_tp->prog); } + bpf_put_raw_tracepoint(raw_tp->btp); kfree(raw_tp); return 0; } @@ -1629,13 +1630,15 @@ static int bpf_raw_tracepoint_open(const union bpf_attr *attr) return -EFAULT; tp_name[sizeof(tp_name) - 1] = 0; - btp = bpf_find_raw_tracepoint(tp_name); + btp = bpf_get_raw_tracepoint(tp_name); if (!btp) return -ENOENT; raw_tp = kzalloc(sizeof(*raw_tp), GFP_USER); - if (!raw_tp) - return -ENOMEM; + if (!raw_tp) { + err = -ENOMEM; + goto out_put_btp; + } raw_tp->btp = btp; prog = bpf_prog_get_type(attr->raw_tracepoint.prog_fd, @@ -1663,6 +1666,8 @@ out_put_prog: bpf_prog_put(prog); out_free_tp: kfree(raw_tp); +out_put_btp: + bpf_put_raw_tracepoint(btp); return err; } diff --git a/kernel/module.c b/kernel/module.c index 49a405891587..06ec68f08387 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -3093,6 +3093,11 @@ static int find_module_sections(struct module *mod, struct load_info *info) sizeof(*mod->tracepoints_ptrs), &mod->num_tracepoints); #endif +#ifdef CONFIG_BPF_EVENTS + mod->bpf_raw_events = section_objs(info, "__bpf_raw_tp_map", + sizeof(*mod->bpf_raw_events), + &mod->num_bpf_raw_events); +#endif #ifdef HAVE_JUMP_LABEL mod->jump_entries = section_objs(info, "__jump_table", sizeof(*mod->jump_entries), diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index 9864a35c8bb5..9ddb6fddb4e0 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -17,6 +17,43 @@ #include "trace_probe.h" #include "trace.h" +#ifdef CONFIG_MODULES +struct bpf_trace_module { + struct module *module; + struct list_head list; +}; + +static LIST_HEAD(bpf_trace_modules); +static DEFINE_MUTEX(bpf_module_mutex); + +static struct bpf_raw_event_map *bpf_get_raw_tracepoint_module(const char *name) +{ + struct bpf_raw_event_map *btp, *ret = NULL; + struct bpf_trace_module *btm; + unsigned int i; + + mutex_lock(&bpf_module_mutex); + list_for_each_entry(btm, &bpf_trace_modules, list) { + for (i = 0; i < btm->module->num_bpf_raw_events; ++i) { + btp = &btm->module->bpf_raw_events[i]; + if (!strcmp(btp->tp->name, name)) { + if (try_module_get(btm->module)) + ret = btp; + goto out; + } + } + } +out: + mutex_unlock(&bpf_module_mutex); + return ret; +} +#else +static struct bpf_raw_event_map *bpf_get_raw_tracepoint_module(const char *name) +{ + return NULL; +} +#endif /* CONFIG_MODULES */ + u64 bpf_get_stackid(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5); u64 bpf_get_stack(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5); @@ -1076,7 +1113,7 @@ int perf_event_query_prog_array(struct perf_event *event, void __user *info) extern struct bpf_raw_event_map __start__bpf_raw_tp[]; extern struct bpf_raw_event_map __stop__bpf_raw_tp[]; -struct bpf_raw_event_map *bpf_find_raw_tracepoint(const char *name) +struct bpf_raw_event_map *bpf_get_raw_tracepoint(const char *name) { struct bpf_raw_event_map *btp = __start__bpf_raw_tp; @@ -1084,7 +1121,16 @@ struct bpf_raw_event_map *bpf_find_raw_tracepoint(const char *name) if (!strcmp(btp->tp->name, name)) return btp; } - return NULL; + + return bpf_get_raw_tracepoint_module(name); +} + +void bpf_put_raw_tracepoint(struct bpf_raw_event_map *btp) +{ + struct module *mod = __module_address((unsigned long)btp); + + if (mod) + module_put(mod); } static __always_inline @@ -1222,3 +1268,52 @@ int bpf_get_perf_event_info(const struct perf_event *event, u32 *prog_id, return err; } + +#ifdef CONFIG_MODULES +int bpf_event_notify(struct notifier_block *nb, unsigned long op, void *module) +{ + struct bpf_trace_module *btm, *tmp; + struct module *mod = module; + + if (mod->num_bpf_raw_events == 0 || + (op != MODULE_STATE_COMING && op != MODULE_STATE_GOING)) + return 0; + + mutex_lock(&bpf_module_mutex); + + switch (op) { + case MODULE_STATE_COMING: + btm = kzalloc(sizeof(*btm), GFP_KERNEL); + if (btm) { + btm->module = module; + list_add(&btm->list, &bpf_trace_modules); + } + break; + case MODULE_STATE_GOING: + list_for_each_entry_safe(btm, tmp, &bpf_trace_modules, list) { + if (btm->module == module) { + list_del(&btm->list); + kfree(btm); + break; + } + } + break; + } + + mutex_unlock(&bpf_module_mutex); + + return 0; +} + +static struct notifier_block bpf_module_nb = { + .notifier_call = bpf_event_notify, +}; + +int __init bpf_event_init(void) +{ + register_module_notifier(&bpf_module_nb); + return 0; +} + +fs_initcall(bpf_event_init); +#endif /* CONFIG_MODULES */ -- cgit v1.3-7-g2ca7 From 0bae2d4d62d523f06ff1a8e88ce38b45400acd28 Mon Sep 17 00:00:00 2001 From: Jiong Wang Date: Sat, 15 Dec 2018 03:34:40 -0500 Subject: bpf: correct slot_type marking logic to allow more stack slot sharing Verifier is supposed to support sharing stack slot allocated to ptr with SCALAR_VALUE for privileged program. However this doesn't happen for some cases. The reason is verifier is not clearing slot_type STACK_SPILL for all bytes, it only clears part of them, while verifier is using: slot_type[0] == STACK_SPILL as a convention to check one slot is ptr type. So, the consequence of partial clearing slot_type is verifier could treat a partially overridden ptr slot, which should now be a SCALAR_VALUE slot, still as ptr slot, and rejects some valid programs. Before this patch, test_xdp_noinline.o under bpf selftests, bpf_lxc.o and bpf_netdev.o under Cilium bpf repo, when built with -mattr=+alu32 are rejected due to this issue. After this patch, they all accepted. There is no processed insn number change before and after this patch on Cilium bpf programs. Reviewed-by: Jakub Kicinski Signed-off-by: Jiong Wang Reviewed-by: Daniel Borkmann Signed-off-by: Alexei Starovoitov --- kernel/bpf/verifier.c | 5 +++++ tools/testing/selftests/bpf/test_verifier.c | 34 +++++++++++++++++++++++++++-- 2 files changed, 37 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 0125731e2512..e0e77ffeefb8 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -1286,6 +1286,10 @@ static int check_stack_write(struct bpf_verifier_env *env, /* regular write of data into stack destroys any spilled ptr */ state->stack[spi].spilled_ptr.type = NOT_INIT; + /* Mark slots as STACK_MISC if they belonged to spilled ptr. */ + if (state->stack[spi].slot_type[0] == STACK_SPILL) + for (i = 0; i < BPF_REG_SIZE; i++) + state->stack[spi].slot_type[i] = STACK_MISC; /* only mark the slot as written if all 8 bytes were written * otherwise read propagation may incorrectly stop too soon @@ -1303,6 +1307,7 @@ static int check_stack_write(struct bpf_verifier_env *env, register_is_null(&cur->regs[value_regno])) type = STACK_ZERO; + /* Mark slots affected by this stack write. */ for (i = 0; i < size; i++) state->stack[spi].slot_type[(slot - i) % BPF_REG_SIZE] = type; diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c index f9de7fe0c26d..cf242734e2eb 100644 --- a/tools/testing/selftests/bpf/test_verifier.c +++ b/tools/testing/selftests/bpf/test_verifier.c @@ -1001,14 +1001,44 @@ static struct bpf_test tests[] = { BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_1, -8), /* mess up with R1 pointer on stack */ BPF_ST_MEM(BPF_B, BPF_REG_10, -7, 0x23), - /* fill back into R0 should fail */ + /* fill back into R0 is fine for priv. + * R0 now becomes SCALAR_VALUE. + */ BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_10, -8), + /* Load from R0 should fail. */ + BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_0, 8), BPF_EXIT_INSN(), }, .errstr_unpriv = "attempt to corrupt spilled", - .errstr = "corrupted spill", + .errstr = "R0 invalid mem access 'inv", .result = REJECT, }, + { + "check corrupted spill/fill, LSB", + .insns = { + BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_1, -8), + BPF_ST_MEM(BPF_H, BPF_REG_10, -8, 0xcafe), + BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_10, -8), + BPF_EXIT_INSN(), + }, + .errstr_unpriv = "attempt to corrupt spilled", + .result_unpriv = REJECT, + .result = ACCEPT, + .retval = POINTER_VALUE, + }, + { + "check corrupted spill/fill, MSB", + .insns = { + BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_1, -8), + BPF_ST_MEM(BPF_W, BPF_REG_10, -4, 0x12345678), + BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_10, -8), + BPF_EXIT_INSN(), + }, + .errstr_unpriv = "attempt to corrupt spilled", + .result_unpriv = REJECT, + .result = ACCEPT, + .retval = POINTER_VALUE, + }, { "invalid src register in STX", .insns = { -- cgit v1.3-7-g2ca7 From 76c43ae84e3f455e0b460ed0c43799e018d09ee9 Mon Sep 17 00:00:00 2001 From: Yonghong Song Date: Tue, 18 Dec 2018 13:43:58 -0800 Subject: bpf: log struct/union attribute for forward type Current btf internal verbose logger logs fwd type as [2] FWD A type_id=0 where A is the type name. Commit 9d5f9f701b18 ("bpf: btf: fix struct/union/fwd types with kind_flag") introduced kind_flag which can be used to distinguish whether a forward type is a struct or union. Also, "type_id=0" does not carry any meaningful information for fwd type as btf_type.type = 0 is simply enforced during btf verification and is not used anywhere else. This commit changed the log to [2] FWD A struct if kind_flag = 0, or [2] FWD A union if kind_flag = 1. Acked-by: Martin KaFai Lau Signed-off-by: Yonghong Song Signed-off-by: Daniel Borkmann --- kernel/bpf/btf.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index e804b26a0506..715f9fcf4712 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -1595,12 +1595,18 @@ static s32 btf_fwd_check_meta(struct btf_verifier_env *env, return 0; } +static void btf_fwd_type_log(struct btf_verifier_env *env, + const struct btf_type *t) +{ + btf_verifier_log(env, "%s", btf_type_kflag(t) ? "union" : "struct"); +} + static struct btf_kind_operations fwd_ops = { .check_meta = btf_fwd_check_meta, .resolve = btf_df_resolve, .check_member = btf_df_check_member, .check_kflag_member = btf_df_check_kflag_member, - .log_details = btf_ref_type_log, + .log_details = btf_fwd_type_log, .seq_show = btf_df_seq_show, }; -- cgit v1.3-7-g2ca7 From c2899c3470de04823870b28646981695c0046efe Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Tue, 18 Dec 2018 16:06:53 +0100 Subject: genirq/affinity: Remove excess indentation Plus other coding style issues which stood out while staring at that code. Signed-off-by: Thomas Gleixner --- kernel/irq/affinity.c | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-) (limited to 'kernel') diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c index 08c904eb7279..e423bff1928c 100644 --- a/kernel/irq/affinity.c +++ b/kernel/irq/affinity.c @@ -95,11 +95,11 @@ static int get_nodes_in_cpumask(cpumask_var_t *node_to_cpumask, } static int __irq_build_affinity_masks(const struct irq_affinity *affd, - int startvec, int numvecs, int firstvec, - cpumask_var_t *node_to_cpumask, - const struct cpumask *cpu_mask, - struct cpumask *nmsk, - struct cpumask *masks) + int startvec, int numvecs, int firstvec, + cpumask_var_t *node_to_cpumask, + const struct cpumask *cpu_mask, + struct cpumask *nmsk, + struct cpumask *masks) { int n, nodes, cpus_per_vec, extra_vecs, done = 0; int last_affv = firstvec + numvecs; @@ -180,10 +180,10 @@ static int irq_build_affinity_masks(const struct irq_affinity *affd, cpumask_var_t nmsk, npresmsk; if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL)) - return ret; + return ret; if (!zalloc_cpumask_var(&npresmsk, GFP_KERNEL)) - goto fail; + goto fail; ret = 0; /* Stabilize the cpumasks */ @@ -212,7 +212,7 @@ static int irq_build_affinity_masks(const struct irq_affinity *affd, put_online_cpus(); if (nr_present < numvecs) - WARN_ON(nr_present + nr_others < numvecs); + WARN_ON(nr_present + nr_others < numvecs); free_cpumask_var(npresmsk); @@ -271,9 +271,9 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd) ret = irq_build_affinity_masks(affd, curvec, this_vecs, curvec, node_to_cpumask, masks); if (ret) { - kfree(masks); - masks = NULL; - goto outnodemsk; + kfree(masks); + masks = NULL; + goto outnodemsk; } curvec += this_vecs; usedvecs += this_vecs; -- cgit v1.3-7-g2ca7 From bec04037e4e484f41ee4d9409e40616874169d20 Mon Sep 17 00:00:00 2001 From: Dou Liyang Date: Tue, 4 Dec 2018 23:51:20 +0800 Subject: genirq/core: Introduce struct irq_affinity_desc The interrupt affinity management uses straight cpumask pointers to convey the automatically assigned affinity masks for managed interrupts. The core interrupt descriptor allocation also decides based on the pointer being non NULL whether an interrupt is managed or not. Devices which use managed interrupts usually have two classes of interrupts: - Interrupts for multiple device queues - Interrupts for general device management Currently both classes are treated the same way, i.e. as managed interrupts. The general interrupts get the default affinity mask assigned while the device queue interrupts are spread out over the possible CPUs. Treating the general interrupts as managed is both a limitation and under certain circumstances a bug. Assume the following situation: default_irq_affinity = 4..7 So if CPUs 4-7 are offlined, then the core code will shut down the device management interrupts because the last CPU in their affinity mask went offline. It's also a limitation because it's desired to allow manual placement of the general device interrupts for various reasons. If they are marked managed then the interrupt affinity setting from both user and kernel space is disabled. To remedy that situation it's required to convey more information than the cpumasks through various interfaces related to interrupt descriptor allocation. Instead of adding yet another argument, create a new data structure 'irq_affinity_desc' which for now just contains the cpumask. This struct can be expanded to convey auxilliary information in the next step. No functional change, just preparatory work. [ tglx: Simplified logic and clarified changelog ] Suggested-by: Thomas Gleixner Suggested-by: Bjorn Helgaas Signed-off-by: Dou Liyang Signed-off-by: Thomas Gleixner Cc: linux-pci@vger.kernel.org Cc: kashyap.desai@broadcom.com Cc: shivasharan.srikanteshwara@broadcom.com Cc: sumit.saxena@broadcom.com Cc: ming.lei@redhat.com Cc: hch@lst.de Cc: douliyang1@huawei.com Link: https://lkml.kernel.org/r/20181204155122.6327-2-douliyangs@gmail.com --- drivers/pci/msi.c | 9 ++++----- include/linux/interrupt.h | 14 ++++++++++++-- include/linux/irq.h | 6 ++++-- include/linux/irqdomain.h | 6 ++++-- include/linux/msi.h | 4 ++-- kernel/irq/affinity.c | 22 ++++++++++++---------- kernel/irq/devres.c | 4 ++-- kernel/irq/irqdesc.c | 17 +++++++++-------- kernel/irq/irqdomain.c | 4 ++-- kernel/irq/msi.c | 8 ++++---- 10 files changed, 55 insertions(+), 39 deletions(-) (limited to 'kernel') diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c index 265ed3e4c920..7a1c8a09efa5 100644 --- a/drivers/pci/msi.c +++ b/drivers/pci/msi.c @@ -534,14 +534,13 @@ error_attrs: static struct msi_desc * msi_setup_entry(struct pci_dev *dev, int nvec, const struct irq_affinity *affd) { - struct cpumask *masks = NULL; + struct irq_affinity_desc *masks = NULL; struct msi_desc *entry; u16 control; if (affd) masks = irq_create_affinity_masks(nvec, affd); - /* MSI Entry Initialization */ entry = alloc_msi_entry(&dev->dev, nvec, masks); if (!entry) @@ -672,7 +671,7 @@ static int msix_setup_entries(struct pci_dev *dev, void __iomem *base, struct msix_entry *entries, int nvec, const struct irq_affinity *affd) { - struct cpumask *curmsk, *masks = NULL; + struct irq_affinity_desc *curmsk, *masks = NULL; struct msi_desc *entry; int ret, i; @@ -1264,7 +1263,7 @@ const struct cpumask *pci_irq_get_affinity(struct pci_dev *dev, int nr) for_each_pci_msi_entry(entry, dev) { if (i == nr) - return entry->affinity; + return &entry->affinity->mask; i++; } WARN_ON_ONCE(1); @@ -1276,7 +1275,7 @@ const struct cpumask *pci_irq_get_affinity(struct pci_dev *dev, int nr) nr >= entry->nvec_used)) return NULL; - return &entry->affinity[nr]; + return &entry->affinity[nr].mask; } else { return cpu_possible_mask; } diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h index ca397ff40836..c44b7844dc83 100644 --- a/include/linux/interrupt.h +++ b/include/linux/interrupt.h @@ -257,6 +257,14 @@ struct irq_affinity { int *sets; }; +/** + * struct irq_affinity_desc - Interrupt affinity descriptor + * @mask: cpumask to hold the affinity assignment + */ +struct irq_affinity_desc { + struct cpumask mask; +}; + #if defined(CONFIG_SMP) extern cpumask_var_t irq_default_affinity; @@ -303,7 +311,9 @@ extern int irq_set_affinity_hint(unsigned int irq, const struct cpumask *m); extern int irq_set_affinity_notifier(unsigned int irq, struct irq_affinity_notify *notify); -struct cpumask *irq_create_affinity_masks(int nvec, const struct irq_affinity *affd); +struct irq_affinity_desc * +irq_create_affinity_masks(int nvec, const struct irq_affinity *affd); + int irq_calc_affinity_vectors(int minvec, int maxvec, const struct irq_affinity *affd); #else /* CONFIG_SMP */ @@ -337,7 +347,7 @@ irq_set_affinity_notifier(unsigned int irq, struct irq_affinity_notify *notify) return 0; } -static inline struct cpumask * +static inline struct irq_affinity_desc * irq_create_affinity_masks(int nvec, const struct irq_affinity *affd) { return NULL; diff --git a/include/linux/irq.h b/include/linux/irq.h index c9bffda04a45..def2b2aac8b1 100644 --- a/include/linux/irq.h +++ b/include/linux/irq.h @@ -27,6 +27,7 @@ struct seq_file; struct module; struct msi_msg; +struct irq_affinity_desc; enum irqchip_irq_state; /* @@ -834,11 +835,12 @@ struct cpumask *irq_data_get_effective_affinity_mask(struct irq_data *d) unsigned int arch_dynirq_lower_bound(unsigned int from); int __irq_alloc_descs(int irq, unsigned int from, unsigned int cnt, int node, - struct module *owner, const struct cpumask *affinity); + struct module *owner, + const struct irq_affinity_desc *affinity); int __devm_irq_alloc_descs(struct device *dev, int irq, unsigned int from, unsigned int cnt, int node, struct module *owner, - const struct cpumask *affinity); + const struct irq_affinity_desc *affinity); /* use macros to avoid needing export.h for THIS_MODULE */ #define irq_alloc_descs(irq, from, cnt, node) \ diff --git a/include/linux/irqdomain.h b/include/linux/irqdomain.h index 068aa46f0d55..35965f41d7be 100644 --- a/include/linux/irqdomain.h +++ b/include/linux/irqdomain.h @@ -43,6 +43,7 @@ struct irq_chip; struct irq_data; struct cpumask; struct seq_file; +struct irq_affinity_desc; /* Number of irqs reserved for a legacy isa controller */ #define NUM_ISA_INTERRUPTS 16 @@ -266,7 +267,7 @@ extern bool irq_domain_check_msi_remap(void); extern void irq_set_default_host(struct irq_domain *host); extern int irq_domain_alloc_descs(int virq, unsigned int nr_irqs, irq_hw_number_t hwirq, int node, - const struct cpumask *affinity); + const struct irq_affinity_desc *affinity); static inline struct fwnode_handle *of_node_to_fwnode(struct device_node *node) { @@ -449,7 +450,8 @@ static inline struct irq_domain *irq_domain_add_hierarchy(struct irq_domain *par extern int __irq_domain_alloc_irqs(struct irq_domain *domain, int irq_base, unsigned int nr_irqs, int node, void *arg, - bool realloc, const struct cpumask *affinity); + bool realloc, + const struct irq_affinity_desc *affinity); extern void irq_domain_free_irqs(unsigned int virq, unsigned int nr_irqs); extern int irq_domain_activate_irq(struct irq_data *irq_data, bool early); extern void irq_domain_deactivate_irq(struct irq_data *irq_data); diff --git a/include/linux/msi.h b/include/linux/msi.h index eb213b87617c..784fb52b9900 100644 --- a/include/linux/msi.h +++ b/include/linux/msi.h @@ -76,7 +76,7 @@ struct msi_desc { unsigned int nvec_used; struct device *dev; struct msi_msg msg; - struct cpumask *affinity; + struct irq_affinity_desc *affinity; union { /* PCI MSI/X specific data */ @@ -138,7 +138,7 @@ static inline void pci_write_msi_msg(unsigned int irq, struct msi_msg *msg) #endif /* CONFIG_PCI_MSI */ struct msi_desc *alloc_msi_entry(struct device *dev, int nvec, - const struct cpumask *affinity); + const struct irq_affinity_desc *affinity); void free_msi_entry(struct msi_desc *entry); void __pci_read_msi_msg(struct msi_desc *entry, struct msi_msg *msg); void __pci_write_msi_msg(struct msi_desc *entry, struct msi_msg *msg); diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c index e423bff1928c..c0fe591b0dc9 100644 --- a/kernel/irq/affinity.c +++ b/kernel/irq/affinity.c @@ -99,7 +99,7 @@ static int __irq_build_affinity_masks(const struct irq_affinity *affd, cpumask_var_t *node_to_cpumask, const struct cpumask *cpu_mask, struct cpumask *nmsk, - struct cpumask *masks) + struct irq_affinity_desc *masks) { int n, nodes, cpus_per_vec, extra_vecs, done = 0; int last_affv = firstvec + numvecs; @@ -117,7 +117,9 @@ static int __irq_build_affinity_masks(const struct irq_affinity *affd, */ if (numvecs <= nodes) { for_each_node_mask(n, nodemsk) { - cpumask_or(masks + curvec, masks + curvec, node_to_cpumask[n]); + cpumask_or(&masks[curvec].mask, + &masks[curvec].mask, + node_to_cpumask[n]); if (++curvec == last_affv) curvec = firstvec; } @@ -150,7 +152,8 @@ static int __irq_build_affinity_masks(const struct irq_affinity *affd, cpus_per_vec++; --extra_vecs; } - irq_spread_init_one(masks + curvec, nmsk, cpus_per_vec); + irq_spread_init_one(&masks[curvec].mask, nmsk, + cpus_per_vec); } done += v; @@ -173,7 +176,7 @@ out: static int irq_build_affinity_masks(const struct irq_affinity *affd, int startvec, int numvecs, int firstvec, cpumask_var_t *node_to_cpumask, - struct cpumask *masks) + struct irq_affinity_desc *masks) { int curvec = startvec, nr_present, nr_others; int ret = -ENOMEM; @@ -226,15 +229,15 @@ static int irq_build_affinity_masks(const struct irq_affinity *affd, * @nvecs: The total number of vectors * @affd: Description of the affinity requirements * - * Returns the masks pointer or NULL if allocation failed. + * Returns the irq_affinity_desc pointer or NULL if allocation failed. */ -struct cpumask * +struct irq_affinity_desc * irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd) { int affvecs = nvecs - affd->pre_vectors - affd->post_vectors; int curvec, usedvecs; cpumask_var_t *node_to_cpumask; - struct cpumask *masks = NULL; + struct irq_affinity_desc *masks = NULL; int i, nr_sets; /* @@ -254,8 +257,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd) /* Fill out vectors at the beginning that don't need affinity */ for (curvec = 0; curvec < affd->pre_vectors; curvec++) - cpumask_copy(masks + curvec, irq_default_affinity); - + cpumask_copy(&masks[curvec].mask, irq_default_affinity); /* * Spread on present CPUs starting from affd->pre_vectors. If we * have multiple sets, build each sets affinity mask separately. @@ -285,7 +287,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd) else curvec = affd->pre_vectors + usedvecs; for (; curvec < nvecs; curvec++) - cpumask_copy(masks + curvec, irq_default_affinity); + cpumask_copy(&masks[curvec].mask, irq_default_affinity); outnodemsk: free_node_to_cpumask(node_to_cpumask); diff --git a/kernel/irq/devres.c b/kernel/irq/devres.c index 6a682c229e10..5d5378ea0afe 100644 --- a/kernel/irq/devres.c +++ b/kernel/irq/devres.c @@ -169,7 +169,7 @@ static void devm_irq_desc_release(struct device *dev, void *res) * @cnt: Number of consecutive irqs to allocate * @node: Preferred node on which the irq descriptor should be allocated * @owner: Owning module (can be NULL) - * @affinity: Optional pointer to an affinity mask array of size @cnt + * @affinity: Optional pointer to an irq_affinity_desc array of size @cnt * which hints where the irq descriptors should be allocated * and which default affinities to use * @@ -179,7 +179,7 @@ static void devm_irq_desc_release(struct device *dev, void *res) */ int __devm_irq_alloc_descs(struct device *dev, int irq, unsigned int from, unsigned int cnt, int node, struct module *owner, - const struct cpumask *affinity) + const struct irq_affinity_desc *affinity) { struct irq_desc_devres *dr; int base; diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c index 578d0e5f1b5b..cb401d6c5040 100644 --- a/kernel/irq/irqdesc.c +++ b/kernel/irq/irqdesc.c @@ -449,28 +449,29 @@ static void free_desc(unsigned int irq) } static int alloc_descs(unsigned int start, unsigned int cnt, int node, - const struct cpumask *affinity, struct module *owner) + const struct irq_affinity_desc *affinity, + struct module *owner) { - const struct cpumask *mask = NULL; struct irq_desc *desc; unsigned int flags; int i; /* Validate affinity mask(s) */ if (affinity) { - for (i = 0, mask = affinity; i < cnt; i++, mask++) { - if (cpumask_empty(mask)) + for (i = 0; i < cnt; i++) { + if (cpumask_empty(&affinity[i].mask)) return -EINVAL; } } flags = affinity ? IRQD_AFFINITY_MANAGED | IRQD_MANAGED_SHUTDOWN : 0; - mask = NULL; for (i = 0; i < cnt; i++) { + const struct cpumask *mask = NULL; + if (affinity) { node = cpu_to_node(cpumask_first(affinity)); - mask = affinity; + mask = &affinity->mask; affinity++; } desc = alloc_desc(start + i, node, flags, mask, owner); @@ -575,7 +576,7 @@ static void free_desc(unsigned int irq) } static inline int alloc_descs(unsigned int start, unsigned int cnt, int node, - const struct cpumask *affinity, + const struct irq_affinity_desc *affinity, struct module *owner) { u32 i; @@ -705,7 +706,7 @@ EXPORT_SYMBOL_GPL(irq_free_descs); */ int __ref __irq_alloc_descs(int irq, unsigned int from, unsigned int cnt, int node, - struct module *owner, const struct cpumask *affinity) + struct module *owner, const struct irq_affinity_desc *affinity) { int start, ret; diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c index 3366d11c3e02..8b0be4bd6565 100644 --- a/kernel/irq/irqdomain.c +++ b/kernel/irq/irqdomain.c @@ -969,7 +969,7 @@ const struct irq_domain_ops irq_domain_simple_ops = { EXPORT_SYMBOL_GPL(irq_domain_simple_ops); int irq_domain_alloc_descs(int virq, unsigned int cnt, irq_hw_number_t hwirq, - int node, const struct cpumask *affinity) + int node, const struct irq_affinity_desc *affinity) { unsigned int hint; @@ -1281,7 +1281,7 @@ int irq_domain_alloc_irqs_hierarchy(struct irq_domain *domain, */ int __irq_domain_alloc_irqs(struct irq_domain *domain, int irq_base, unsigned int nr_irqs, int node, void *arg, - bool realloc, const struct cpumask *affinity) + bool realloc, const struct irq_affinity_desc *affinity) { int i, ret, virq; diff --git a/kernel/irq/msi.c b/kernel/irq/msi.c index 4ca2fd46645d..ad26fbcfbfc8 100644 --- a/kernel/irq/msi.c +++ b/kernel/irq/msi.c @@ -23,11 +23,11 @@ * @nvec: The number of vectors used in this entry * @affinity: Optional pointer to an affinity mask array size of @nvec * - * If @affinity is not NULL then a an affinity array[@nvec] is allocated - * and the affinity masks from @affinity are copied. + * If @affinity is not NULL then an affinity array[@nvec] is allocated + * and the affinity masks and flags from @affinity are copied. */ -struct msi_desc * -alloc_msi_entry(struct device *dev, int nvec, const struct cpumask *affinity) +struct msi_desc *alloc_msi_entry(struct device *dev, int nvec, + const struct irq_affinity_desc *affinity) { struct msi_desc *desc; -- cgit v1.3-7-g2ca7 From c410abbbacb9b378365ba17a30df08b4b9eec64f Mon Sep 17 00:00:00 2001 From: Dou Liyang Date: Tue, 4 Dec 2018 23:51:21 +0800 Subject: genirq/affinity: Add is_managed to struct irq_affinity_desc Devices which use managed interrupts usually have two classes of interrupts: - Interrupts for multiple device queues - Interrupts for general device management Currently both classes are treated the same way, i.e. as managed interrupts. The general interrupts get the default affinity mask assigned while the device queue interrupts are spread out over the possible CPUs. Treating the general interrupts as managed is both a limitation and under certain circumstances a bug. Assume the following situation: default_irq_affinity = 4..7 So if CPUs 4-7 are offlined, then the core code will shut down the device management interrupts because the last CPU in their affinity mask went offline. It's also a limitation because it's desired to allow manual placement of the general device interrupts for various reasons. If they are marked managed then the interrupt affinity setting from both user and kernel space is disabled. That limitation was reported by Kashyap and Sumit. Expand struct irq_affinity_desc with a new bit 'is_managed' which is set for truly managed interrupts (queue interrupts) and cleared for the general device interrupts. [ tglx: Simplify code and massage changelog ] Reported-by: Kashyap Desai Reported-by: Sumit Saxena Signed-off-by: Dou Liyang Signed-off-by: Thomas Gleixner Cc: linux-pci@vger.kernel.org Cc: shivasharan.srikanteshwara@broadcom.com Cc: ming.lei@redhat.com Cc: hch@lst.de Cc: bhelgaas@google.com Cc: douliyang1@huawei.com Link: https://lkml.kernel.org/r/20181204155122.6327-3-douliyangs@gmail.com --- include/linux/interrupt.h | 1 + kernel/irq/affinity.c | 4 ++++ kernel/irq/irqdesc.c | 13 ++++++++----- 3 files changed, 13 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h index c44b7844dc83..c672f34235e7 100644 --- a/include/linux/interrupt.h +++ b/include/linux/interrupt.h @@ -263,6 +263,7 @@ struct irq_affinity { */ struct irq_affinity_desc { struct cpumask mask; + unsigned int is_managed : 1; }; #if defined(CONFIG_SMP) diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c index c0fe591b0dc9..45b68b4ea48b 100644 --- a/kernel/irq/affinity.c +++ b/kernel/irq/affinity.c @@ -289,6 +289,10 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd) for (; curvec < nvecs; curvec++) cpumask_copy(&masks[curvec].mask, irq_default_affinity); + /* Mark the managed interrupts */ + for (i = affd->pre_vectors; i < nvecs - affd->post_vectors; i++) + masks[i].is_managed = 1; + outnodemsk: free_node_to_cpumask(node_to_cpumask); return masks; diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c index cb401d6c5040..ee062b7939d3 100644 --- a/kernel/irq/irqdesc.c +++ b/kernel/irq/irqdesc.c @@ -453,27 +453,30 @@ static int alloc_descs(unsigned int start, unsigned int cnt, int node, struct module *owner) { struct irq_desc *desc; - unsigned int flags; int i; /* Validate affinity mask(s) */ if (affinity) { - for (i = 0; i < cnt; i++) { + for (i = 0; i < cnt; i++, i++) { if (cpumask_empty(&affinity[i].mask)) return -EINVAL; } } - flags = affinity ? IRQD_AFFINITY_MANAGED | IRQD_MANAGED_SHUTDOWN : 0; - for (i = 0; i < cnt; i++) { const struct cpumask *mask = NULL; + unsigned int flags = 0; if (affinity) { - node = cpu_to_node(cpumask_first(affinity)); + if (affinity->is_managed) { + flags = IRQD_AFFINITY_MANAGED | + IRQD_MANAGED_SHUTDOWN; + } mask = &affinity->mask; + node = cpu_to_node(cpumask_first(mask)); affinity++; } + desc = alloc_desc(start + i, node, flags, mask, owner); if (!desc) goto err; -- cgit v1.3-7-g2ca7 From d89b22d46a40da3a1630ecea111beaf3ef10bc21 Mon Sep 17 00:00:00 2001 From: NeilBrown Date: Mon, 3 Dec 2018 11:30:30 +1100 Subject: cred: add cred_fscmp() for comparing creds. NFS needs to compare to credentials, to see if they can be treated the same w.r.t. filesystem access. Sometimes an ordering is needed when credentials are used as a key to an rbtree. NFS currently has its own private credential management from before 'struct cred' existed. To move it over to more consistent use of 'struct cred' we need a comparison function. This patch adds that function. Signed-off-by: NeilBrown Signed-off-by: Anna Schumaker --- include/linux/cred.h | 1 + kernel/cred.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 56 insertions(+) (limited to 'kernel') diff --git a/include/linux/cred.h b/include/linux/cred.h index 7eed6101c791..f1085767e1b3 100644 --- a/include/linux/cred.h +++ b/include/linux/cred.h @@ -169,6 +169,7 @@ extern int change_create_files_as(struct cred *, struct inode *); extern int set_security_override(struct cred *, u32); extern int set_security_override_from_ctx(struct cred *, const char *); extern int set_create_files_as(struct cred *, struct inode *); +extern int cred_fscmp(const struct cred *, const struct cred *); extern void __init cred_init(void); /* diff --git a/kernel/cred.c b/kernel/cred.c index ecf03657e71c..0b3ac72bd717 100644 --- a/kernel/cred.c +++ b/kernel/cred.c @@ -19,6 +19,7 @@ #include #include #include +#include #if 0 #define kdebug(FMT, ...) \ @@ -564,6 +565,60 @@ void revert_creds(const struct cred *old) } EXPORT_SYMBOL(revert_creds); +/** + * cred_fscmp - Compare two credentials with respect to filesystem access. + * @a: The first credential + * @b: The second credential + * + * cred_cmp() will return zero if both credentials have the same + * fsuid, fsgid, and supplementary groups. That is, if they will both + * provide the same access to files based on mode/uid/gid. + * If the credentials are different, then either -1 or 1 will + * be returned depending on whether @a comes before or after @b + * respectively in an arbitrary, but stable, ordering of credentials. + * + * Return: -1, 0, or 1 depending on comparison + */ +int cred_fscmp(const struct cred *a, const struct cred *b) +{ + struct group_info *ga, *gb; + int g; + + if (a == b) + return 0; + if (uid_lt(a->fsuid, b->fsuid)) + return -1; + if (uid_gt(a->fsuid, b->fsuid)) + return 1; + + if (gid_lt(a->fsgid, b->fsgid)) + return -1; + if (gid_gt(a->fsgid, b->fsgid)) + return 1; + + ga = a->group_info; + gb = b->group_info; + if (ga == gb) + return 0; + if (ga == NULL) + return -1; + if (gb == NULL) + return 1; + if (ga->ngroups < gb->ngroups) + return -1; + if (ga->ngroups > gb->ngroups) + return 1; + + for (g = 0; g < ga->ngroups; g++) { + if (gid_lt(ga->gid[g], gb->gid[g])) + return -1; + if (gid_gt(ga->gid[g], gb->gid[g])) + return 1; + } + return 0; +} +EXPORT_SYMBOL(cred_fscmp); + /* * initialise the credentials stuff */ -- cgit v1.3-7-g2ca7 From 97d0fb239c041f5f99655af74812c3ab75cc4346 Mon Sep 17 00:00:00 2001 From: NeilBrown Date: Mon, 3 Dec 2018 11:30:30 +1100 Subject: cred: add get_cred_rcu() Sometimes we want to opportunistically get a ref to a cred in an rcu_read_lock protected section. get_task_cred() does this, and NFS does as similar thing with its own credential structures. To prepare for NFS converting to use 'struct cred' more uniformly, define get_cred_rcu(), and use it in get_task_cred(). Signed-off-by: NeilBrown Signed-off-by: Anna Schumaker --- include/linux/cred.h | 11 +++++++++++ kernel/cred.c | 2 +- 2 files changed, 12 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/include/linux/cred.h b/include/linux/cred.h index f1085767e1b3..48979fcb95cf 100644 --- a/include/linux/cred.h +++ b/include/linux/cred.h @@ -252,6 +252,17 @@ static inline const struct cred *get_cred(const struct cred *cred) return get_new_cred(nonconst_cred); } +static inline const struct cred *get_cred_rcu(const struct cred *cred) +{ + struct cred *nonconst_cred = (struct cred *) cred; + if (!cred) + return NULL; + if (!atomic_inc_not_zero(&nonconst_cred->usage)) + return NULL; + validate_creds(cred); + return cred; +} + /** * put_cred - Release a reference to a set of credentials * @cred: The credentials to release diff --git a/kernel/cred.c b/kernel/cred.c index 0b3ac72bd717..ba60162249e8 100644 --- a/kernel/cred.c +++ b/kernel/cred.c @@ -195,7 +195,7 @@ const struct cred *get_task_cred(struct task_struct *task) do { cred = __task_cred((task)); BUG_ON(!cred); - } while (!atomic_inc_not_zero(&((struct cred *)cred)->usage)); + } while (!get_cred_rcu(cred)); rcu_read_unlock(); return cred; -- cgit v1.3-7-g2ca7 From a6d8e7637faac93f362b694311bff64d5a8c701e Mon Sep 17 00:00:00 2001 From: NeilBrown Date: Mon, 3 Dec 2018 11:30:30 +1100 Subject: cred: export get_task_cred(). There is no reason that modules should not be able to use this, and NFS will need it when converted to use 'struct cred'. Signed-off-by: NeilBrown Signed-off-by: Anna Schumaker --- kernel/cred.c | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/cred.c b/kernel/cred.c index ba60162249e8..21f4a97085b4 100644 --- a/kernel/cred.c +++ b/kernel/cred.c @@ -200,6 +200,7 @@ const struct cred *get_task_cred(struct task_struct *task) rcu_read_unlock(); return cred; } +EXPORT_SYMBOL(get_task_cred); /* * Allocate blank credentials, such that the credentials can be filled in at a -- cgit v1.3-7-g2ca7 From fdbaa0beb78b7c8847e261fe2c32816e9d1c54cc Mon Sep 17 00:00:00 2001 From: Martin KaFai Lau Date: Wed, 19 Dec 2018 13:01:01 -0800 Subject: bpf: Ensure line_info.insn_off cannot point to insn with zero code This patch rejects a line_info if the bpf insn code referred by line_info.insn_off is 0. F.e. a broken userspace tool might generate a line_info.insn_off that points to the second 8 bytes of a BPF_LD_IMM64. Signed-off-by: Martin KaFai Lau Signed-off-by: Alexei Starovoitov --- kernel/bpf/verifier.c | 8 ++++++++ 1 file changed, 8 insertions(+) (limited to 'kernel') diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index e0e77ffeefb8..5c64281d566e 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -4980,6 +4980,14 @@ static int check_btf_line(struct bpf_verifier_env *env, goto err_free; } + if (!prog->insnsi[linfo[i].insn_off].code) { + verbose(env, + "Invalid insn code at line_info[%u].insn_off\n", + i); + err = -EINVAL; + goto err_free; + } + if (!btf_name_by_offset(btf, linfo[i].line_off) || !btf_name_by_offset(btf, linfo[i].file_name_off)) { verbose(env, "Invalid line_info[%u].line_off or .file_name_off\n", i); -- cgit v1.3-7-g2ca7 From 518a2f1925c3165befbf06b75e07636549d92c1c Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Fri, 14 Dec 2018 09:00:40 +0100 Subject: dma-mapping: zero memory returned from dma_alloc_* If we want to map memory from the DMA allocator to userspace it must be zeroed at allocation time to prevent stale data leaks. We already do this on most common architectures, but some architectures don't do this yet, fix them up, either by passing GFP_ZERO when we use the normal page allocator or doing a manual memset otherwise. Signed-off-by: Christoph Hellwig Acked-by: Geert Uytterhoeven [m68k] Acked-by: Sam Ravnborg [sparc] --- arch/alpha/kernel/pci_iommu.c | 2 +- arch/arc/mm/dma.c | 2 +- arch/c6x/mm/dma-coherent.c | 5 ++++- arch/m68k/kernel/dma.c | 2 +- arch/microblaze/mm/consistent.c | 2 +- arch/openrisc/kernel/dma.c | 2 +- arch/parisc/kernel/pci-dma.c | 4 ++-- arch/s390/pci/pci_dma.c | 2 +- arch/sparc/kernel/ioport.c | 2 +- arch/sparc/mm/io-unit.c | 2 +- arch/sparc/mm/iommu.c | 2 +- arch/xtensa/kernel/pci-dma.c | 2 +- drivers/misc/mic/host/mic_boot.c | 2 +- kernel/dma/virt.c | 2 +- 14 files changed, 18 insertions(+), 15 deletions(-) (limited to 'kernel') diff --git a/arch/alpha/kernel/pci_iommu.c b/arch/alpha/kernel/pci_iommu.c index e1716e0d92fd..aa0f50d0f823 100644 --- a/arch/alpha/kernel/pci_iommu.c +++ b/arch/alpha/kernel/pci_iommu.c @@ -443,7 +443,7 @@ static void *alpha_pci_alloc_coherent(struct device *dev, size_t size, gfp &= ~GFP_DMA; try_again: - cpu_addr = (void *)__get_free_pages(gfp, order); + cpu_addr = (void *)__get_free_pages(gfp | __GFP_ZERO, order); if (! cpu_addr) { printk(KERN_INFO "pci_alloc_consistent: " "get_free_pages failed from %pf\n", diff --git a/arch/arc/mm/dma.c b/arch/arc/mm/dma.c index db203ff69ccf..1525ac00fd02 100644 --- a/arch/arc/mm/dma.c +++ b/arch/arc/mm/dma.c @@ -33,7 +33,7 @@ void *arch_dma_alloc(struct device *dev, size_t size, dma_addr_t *dma_handle, */ BUG_ON(gfp & __GFP_HIGHMEM); - page = alloc_pages(gfp, order); + page = alloc_pages(gfp | __GFP_ZERO, order); if (!page) return NULL; diff --git a/arch/c6x/mm/dma-coherent.c b/arch/c6x/mm/dma-coherent.c index 01305c787201..75b79571732c 100644 --- a/arch/c6x/mm/dma-coherent.c +++ b/arch/c6x/mm/dma-coherent.c @@ -78,6 +78,7 @@ static void __free_dma_pages(u32 addr, int order) void *arch_dma_alloc(struct device *dev, size_t size, dma_addr_t *handle, gfp_t gfp, unsigned long attrs) { + void *ret; u32 paddr; int order; @@ -94,7 +95,9 @@ void *arch_dma_alloc(struct device *dev, size_t size, dma_addr_t *handle, if (!paddr) return NULL; - return phys_to_virt(paddr); + ret = phys_to_virt(paddr); + memset(ret, 0, 1 << order); + return ret; } /* diff --git a/arch/m68k/kernel/dma.c b/arch/m68k/kernel/dma.c index e99993c57d6b..b4aa853051bd 100644 --- a/arch/m68k/kernel/dma.c +++ b/arch/m68k/kernel/dma.c @@ -32,7 +32,7 @@ void *arch_dma_alloc(struct device *dev, size_t size, dma_addr_t *handle, size = PAGE_ALIGN(size); order = get_order(size); - page = alloc_pages(flag, order); + page = alloc_pages(flag | __GFP_ZERO, order); if (!page) return NULL; diff --git a/arch/microblaze/mm/consistent.c b/arch/microblaze/mm/consistent.c index 45e0a1aa9357..3002cbca3059 100644 --- a/arch/microblaze/mm/consistent.c +++ b/arch/microblaze/mm/consistent.c @@ -81,7 +81,7 @@ void *arch_dma_alloc(struct device *dev, size_t size, dma_addr_t *dma_handle, size = PAGE_ALIGN(size); order = get_order(size); - vaddr = __get_free_pages(gfp, order); + vaddr = __get_free_pages(gfp | __GFP_ZERO, order); if (!vaddr) return NULL; diff --git a/arch/openrisc/kernel/dma.c b/arch/openrisc/kernel/dma.c index 159336adfa2f..f79457cb3741 100644 --- a/arch/openrisc/kernel/dma.c +++ b/arch/openrisc/kernel/dma.c @@ -89,7 +89,7 @@ arch_dma_alloc(struct device *dev, size_t size, dma_addr_t *dma_handle, .mm = &init_mm }; - page = alloc_pages_exact(size, gfp); + page = alloc_pages_exact(size, gfp | __GFP_ZERO); if (!page) return NULL; diff --git a/arch/parisc/kernel/pci-dma.c b/arch/parisc/kernel/pci-dma.c index 04c48f1ef3fb..239162355b58 100644 --- a/arch/parisc/kernel/pci-dma.c +++ b/arch/parisc/kernel/pci-dma.c @@ -404,7 +404,7 @@ static void *pcxl_dma_alloc(struct device *dev, size_t size, order = get_order(size); size = 1 << (order + PAGE_SHIFT); vaddr = pcxl_alloc_range(size); - paddr = __get_free_pages(flag, order); + paddr = __get_free_pages(flag | __GFP_ZERO, order); flush_kernel_dcache_range(paddr, size); paddr = __pa(paddr); map_uncached_pages(vaddr, size, paddr); @@ -429,7 +429,7 @@ static void *pcx_dma_alloc(struct device *dev, size_t size, if ((attrs & DMA_ATTR_NON_CONSISTENT) == 0) return NULL; - addr = (void *)__get_free_pages(flag, get_order(size)); + addr = (void *)__get_free_pages(flag | __GFP_ZERO, get_order(size)); if (addr) *dma_handle = (dma_addr_t)virt_to_phys(addr); diff --git a/arch/s390/pci/pci_dma.c b/arch/s390/pci/pci_dma.c index 346ba382193a..9e52d1527f71 100644 --- a/arch/s390/pci/pci_dma.c +++ b/arch/s390/pci/pci_dma.c @@ -404,7 +404,7 @@ static void *s390_dma_alloc(struct device *dev, size_t size, dma_addr_t map; size = PAGE_ALIGN(size); - page = alloc_pages(flag, get_order(size)); + page = alloc_pages(flag | __GFP_ZERO, get_order(size)); if (!page) return NULL; diff --git a/arch/sparc/kernel/ioport.c b/arch/sparc/kernel/ioport.c index baa235652c27..f89603855f1e 100644 --- a/arch/sparc/kernel/ioport.c +++ b/arch/sparc/kernel/ioport.c @@ -325,7 +325,7 @@ void *arch_dma_alloc(struct device *dev, size_t size, dma_addr_t *dma_handle, return NULL; size = PAGE_ALIGN(size); - va = (void *) __get_free_pages(gfp, get_order(size)); + va = (void *) __get_free_pages(gfp | __GFP_ZERO, get_order(size)); if (!va) { printk("%s: no %zd pages\n", __func__, size >> PAGE_SHIFT); return NULL; diff --git a/arch/sparc/mm/io-unit.c b/arch/sparc/mm/io-unit.c index 91be13935d40..f770ee7229d8 100644 --- a/arch/sparc/mm/io-unit.c +++ b/arch/sparc/mm/io-unit.c @@ -224,7 +224,7 @@ static void *iounit_alloc(struct device *dev, size_t len, return NULL; len = PAGE_ALIGN(len); - va = __get_free_pages(gfp, get_order(len)); + va = __get_free_pages(gfp | __GFP_ZERO, get_order(len)); if (!va) return NULL; diff --git a/arch/sparc/mm/iommu.c b/arch/sparc/mm/iommu.c index fb771a634452..e8d5d73ca40d 100644 --- a/arch/sparc/mm/iommu.c +++ b/arch/sparc/mm/iommu.c @@ -344,7 +344,7 @@ static void *sbus_iommu_alloc(struct device *dev, size_t len, return NULL; len = PAGE_ALIGN(len); - va = __get_free_pages(gfp, get_order(len)); + va = __get_free_pages(gfp | __GFP_ZERO, get_order(len)); if (va == 0) return NULL; diff --git a/arch/xtensa/kernel/pci-dma.c b/arch/xtensa/kernel/pci-dma.c index 1fc138b6bc0a..9171bff76fc4 100644 --- a/arch/xtensa/kernel/pci-dma.c +++ b/arch/xtensa/kernel/pci-dma.c @@ -160,7 +160,7 @@ void *arch_dma_alloc(struct device *dev, size_t size, dma_addr_t *handle, flag & __GFP_NOWARN); if (!page) - page = alloc_pages(flag, get_order(size)); + page = alloc_pages(flag | __GFP_ZERO, get_order(size)); if (!page) return NULL; diff --git a/drivers/misc/mic/host/mic_boot.c b/drivers/misc/mic/host/mic_boot.c index c327985c9523..6479435ac96b 100644 --- a/drivers/misc/mic/host/mic_boot.c +++ b/drivers/misc/mic/host/mic_boot.c @@ -149,7 +149,7 @@ static void *__mic_dma_alloc(struct device *dev, size_t size, struct scif_hw_dev *scdev = dev_get_drvdata(dev); struct mic_device *mdev = scdev_to_mdev(scdev); dma_addr_t tmp; - void *va = kmalloc(size, gfp); + void *va = kmalloc(size, gfp | __GFP_ZERO); if (va) { tmp = mic_map_single(mdev, va, size); diff --git a/kernel/dma/virt.c b/kernel/dma/virt.c index 631ddec4b60a..ebe128833af7 100644 --- a/kernel/dma/virt.c +++ b/kernel/dma/virt.c @@ -13,7 +13,7 @@ static void *dma_virt_alloc(struct device *dev, size_t size, { void *ret; - ret = (void *)__get_free_pages(gfp, get_order(size)); + ret = (void *)__get_free_pages(gfp | __GFP_ZERO, get_order(size)); if (ret) *dma_handle = (uintptr_t)ret; return ret; -- cgit v1.3-7-g2ca7 From 960ea056561a08e2b837b2f02d22c53226414a84 Mon Sep 17 00:00:00 2001 From: Jakub Kicinski Date: Wed, 19 Dec 2018 22:13:04 -0800 Subject: bpf: verifier: teach the verifier to reason about the BPF_JSET instruction Some JITs (nfp) try to optimize code on their own. It could make sense in case of BPF_JSET instruction which is currently not interpreted by the verifier, meaning for instance that dead could would not be detected if it was under BPF_JSET branch. Teach the verifier basics of BPF_JSET, JIT optimizations will be removed shortly. Signed-off-by: Jakub Kicinski Reviewed-by: Jiong Wang Acked-by: Edward Cree Signed-off-by: Daniel Borkmann --- kernel/bpf/verifier.c | 20 ++++++++++++++++++++ 1 file changed, 20 insertions(+) (limited to 'kernel') diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 5c64281d566e..98ed27bbd045 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -3859,6 +3859,12 @@ static int is_branch_taken(struct bpf_reg_state *reg, u64 val, u8 opcode) if (tnum_is_const(reg->var_off)) return !tnum_equals_const(reg->var_off, val); break; + case BPF_JSET: + if ((~reg->var_off.mask & reg->var_off.value) & val) + return 1; + if (!((reg->var_off.mask | reg->var_off.value) & val)) + return 0; + break; case BPF_JGT: if (reg->umin_value > val) return 1; @@ -3943,6 +3949,13 @@ static void reg_set_min_max(struct bpf_reg_state *true_reg, */ __mark_reg_known(false_reg, val); break; + case BPF_JSET: + false_reg->var_off = tnum_and(false_reg->var_off, + tnum_const(~val)); + if (is_power_of_2(val)) + true_reg->var_off = tnum_or(true_reg->var_off, + tnum_const(val)); + break; case BPF_JGT: false_reg->umax_value = min(false_reg->umax_value, val); true_reg->umin_value = max(true_reg->umin_value, val + 1); @@ -4015,6 +4028,13 @@ static void reg_set_min_max_inv(struct bpf_reg_state *true_reg, */ __mark_reg_known(false_reg, val); break; + case BPF_JSET: + false_reg->var_off = tnum_and(false_reg->var_off, + tnum_const(~val)); + if (is_power_of_2(val)) + true_reg->var_off = tnum_or(true_reg->var_off, + tnum_const(val)); + break; case BPF_JGT: true_reg->umax_value = min(true_reg->umax_value, val - 1); false_reg->umin_value = max(false_reg->umin_value, val); -- cgit v1.3-7-g2ca7 From 9b38c4056b2736bb5902e8b0911832db666fd19b Mon Sep 17 00:00:00 2001 From: Jakub Kicinski Date: Wed, 19 Dec 2018 22:13:06 -0800 Subject: bpf: verifier: reorder stack size check with dead code sanitization Reorder the calls to check_max_stack_depth() and sanitize_dead_code() to separate functions which can rewrite instructions from pure checks. No functional changes. Signed-off-by: Jakub Kicinski Reviewed-by: Jiong Wang Signed-off-by: Daniel Borkmann --- kernel/bpf/verifier.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 98ed27bbd045..d27d5a880015 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -6983,10 +6983,11 @@ skip_full_check: free_states(env); if (ret == 0) - sanitize_dead_code(env); + ret = check_max_stack_depth(env); + /* instruction rewrites happen after this point */ if (ret == 0) - ret = check_max_stack_depth(env); + sanitize_dead_code(env); if (ret == 0) /* program is valid, convert *(u32*)(ctx + off) accesses */ -- cgit v1.3-7-g2ca7 From 8b1cce9f5832a8eda17d37a3c49fb7dd2d650f46 Mon Sep 17 00:00:00 2001 From: Thierry Reding Date: Thu, 20 Dec 2018 17:35:47 +0100 Subject: dma-mapping: fix inverted logic in dma_supported The cleanup in commit 356da6d0cde3 ("dma-mapping: bypass indirect calls for dma-direct") accidentally inverted the logic in the check for the presence of a ->dma_supported() callback. Switch this back to the way it was to prevent a crash on boot. Fixes: 356da6d0cde3 ("dma-mapping: bypass indirect calls for dma-direct") Signed-off-by: Thierry Reding Signed-off-by: Christoph Hellwig --- kernel/dma/mapping.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c index fc84c81029d9..d7c34d2d1ba5 100644 --- a/kernel/dma/mapping.c +++ b/kernel/dma/mapping.c @@ -406,7 +406,7 @@ int dma_supported(struct device *dev, u64 mask) if (dma_is_direct(ops)) return dma_direct_supported(dev, mask); - if (ops->dma_supported) + if (!ops->dma_supported) return 1; return ops->dma_supported(dev, mask); } -- cgit v1.3-7-g2ca7 From 77ea5f4cbe2084db9ab021ba73fb7eadf1610884 Mon Sep 17 00:00:00 2001 From: Jesper Dangaard Brouer Date: Wed, 19 Dec 2018 17:00:23 +0100 Subject: bpf/cpumap: make sure frame_size for build_skb is aligned if headroom isn't The frame_size passed to build_skb must be aligned, else it is possible that the embedded struct skb_shared_info gets unaligned. For correctness make sure that xdpf->headroom in included in the alignment. No upstream drivers can hit this, as all XDP drivers provide an aligned headroom. This was discovered when playing with implementing XDP support for mvneta, which have a 2 bytes DSA header, and this Marvell ARM64 platform didn't like doing atomic operations on an unaligned skb_shinfo(skb)->dataref addresses. Fixes: 1c601d829ab0 ("bpf: cpumap xdp_buff to skb conversion and allocation") Signed-off-by: Jesper Dangaard Brouer Signed-off-by: Daniel Borkmann --- kernel/bpf/cpumap.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c index 24aac0d0f412..8974b3755670 100644 --- a/kernel/bpf/cpumap.c +++ b/kernel/bpf/cpumap.c @@ -183,7 +183,7 @@ static struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu, * is not at a fixed memory location, with mixed length * packets, which is bad for cache-line hotness. */ - frame_size = SKB_DATA_ALIGN(xdpf->len) + xdpf->headroom + + frame_size = SKB_DATA_ALIGN(xdpf->len + xdpf->headroom) + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); pkt_data_start = xdpf->data - xdpf->headroom; -- cgit v1.3-7-g2ca7 From e8d086ddb5339d72c60e6c7b8d28810f26960f9a Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Tue, 18 Dec 2018 15:50:02 -0500 Subject: tracing: Fix ftrace_graph_get_ret_stack() to use task and not current The function ftrace_graph_get_ret_stack() takes a task struct descriptor but uses current as the task to perform the operations on. In pretty much all cases the task decriptor is the same as current, so this wasn't an issue. But there is a case in the ARM architecture that passes in a task that is not current, and expects a result from that task, and this code breaks it. Fixes: 51584396cff5 ("arm64: Use ftrace_graph_get_ret_stack() instead of curr_ret_stack") Reported-by: James Morse Tested-by: James Morse Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/fgraph.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c index d4f04f0ca646..8dfd5021b933 100644 --- a/kernel/trace/fgraph.c +++ b/kernel/trace/fgraph.c @@ -246,10 +246,10 @@ unsigned long ftrace_return_to_handler(unsigned long frame_pointer) struct ftrace_ret_stack * ftrace_graph_get_ret_stack(struct task_struct *task, int idx) { - idx = current->curr_ret_stack - idx; + idx = task->curr_ret_stack - idx; if (idx >= 0 && idx <= task->curr_ret_stack) - return ¤t->ret_stack[idx]; + return &task->ret_stack[idx]; return NULL; } -- cgit v1.3-7-g2ca7 From 6801f0d5ca00bed159c7f87870cc6f5411837544 Mon Sep 17 00:00:00 2001 From: Tom Zanussi Date: Tue, 18 Dec 2018 14:33:20 -0600 Subject: tracing: Remove unnecessary hist trigger struct field hist_field.var_idx is completely unused, so remove it. Link: http://lkml.kernel.org/r/d4e066c0f509f5f13ad3babc8c33ca6e7ddc439a.1545161087.git.tom.zanussi@linux.intel.com Acked-by: Namhyung Kim Reviewed-by: Masami Hiramatsu Signed-off-by: Tom Zanussi Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_events_hist.c | 1 - 1 file changed, 1 deletion(-) (limited to 'kernel') diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c index 82e72c48a5a9..d29bf8a8e1dd 100644 --- a/kernel/trace/trace_events_hist.c +++ b/kernel/trace/trace_events_hist.c @@ -61,7 +61,6 @@ struct hist_field { char *system; char *event_name; char *name; - unsigned int var_idx; unsigned int var_ref_idx; bool read_once; }; -- cgit v1.3-7-g2ca7 From 2f31ed9308cc9e95c149078b276a47360ab507bb Mon Sep 17 00:00:00 2001 From: Tom Zanussi Date: Tue, 18 Dec 2018 14:33:21 -0600 Subject: tracing: Change strlen to sizeof for hist trigger static strings There's no need to use strlen() for static strings when the length is already known, so update trace_events_hist.c with sizeof() for those cases. Link: http://lkml.kernel.org/r/e3e754f2bd18e56eaa8baf79bee619316ebf4cfc.1545161087.git.tom.zanussi@linux.intel.com Acked-by: Namhyung Kim Reviewed-by: Masami Hiramatsu Signed-off-by: Tom Zanussi Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_events_hist.c | 38 +++++++++++++++++++------------------- 1 file changed, 19 insertions(+), 19 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c index d29bf8a8e1dd..25d06b3ae1f6 100644 --- a/kernel/trace/trace_events_hist.c +++ b/kernel/trace/trace_events_hist.c @@ -507,7 +507,7 @@ static int synth_field_string_size(char *type) start = strstr(type, "char["); if (start == NULL) return -EINVAL; - start += strlen("char["); + start += sizeof("char[") - 1; end = strchr(type, ']'); if (!end || end < start) @@ -1843,8 +1843,8 @@ static int parse_action(char *str, struct hist_trigger_attrs *attrs) if (attrs->n_actions >= HIST_ACTIONS_MAX) return ret; - if ((strncmp(str, "onmatch(", strlen("onmatch(")) == 0) || - (strncmp(str, "onmax(", strlen("onmax(")) == 0)) { + if ((strncmp(str, "onmatch(", sizeof("onmatch(") - 1) == 0) || + (strncmp(str, "onmax(", sizeof("onmax(") - 1) == 0)) { attrs->action_str[attrs->n_actions] = kstrdup(str, GFP_KERNEL); if (!attrs->action_str[attrs->n_actions]) { ret = -ENOMEM; @@ -1861,34 +1861,34 @@ static int parse_assignment(char *str, struct hist_trigger_attrs *attrs) { int ret = 0; - if ((strncmp(str, "key=", strlen("key=")) == 0) || - (strncmp(str, "keys=", strlen("keys=")) == 0)) { + if ((strncmp(str, "key=", sizeof("key=") - 1) == 0) || + (strncmp(str, "keys=", sizeof("keys=") - 1) == 0)) { attrs->keys_str = kstrdup(str, GFP_KERNEL); if (!attrs->keys_str) { ret = -ENOMEM; goto out; } - } else if ((strncmp(str, "val=", strlen("val=")) == 0) || - (strncmp(str, "vals=", strlen("vals=")) == 0) || - (strncmp(str, "values=", strlen("values=")) == 0)) { + } else if ((strncmp(str, "val=", sizeof("val=") - 1) == 0) || + (strncmp(str, "vals=", sizeof("vals=") - 1) == 0) || + (strncmp(str, "values=", sizeof("values=") - 1) == 0)) { attrs->vals_str = kstrdup(str, GFP_KERNEL); if (!attrs->vals_str) { ret = -ENOMEM; goto out; } - } else if (strncmp(str, "sort=", strlen("sort=")) == 0) { + } else if (strncmp(str, "sort=", sizeof("sort=") - 1) == 0) { attrs->sort_key_str = kstrdup(str, GFP_KERNEL); if (!attrs->sort_key_str) { ret = -ENOMEM; goto out; } - } else if (strncmp(str, "name=", strlen("name=")) == 0) { + } else if (strncmp(str, "name=", sizeof("name=") - 1) == 0) { attrs->name = kstrdup(str, GFP_KERNEL); if (!attrs->name) { ret = -ENOMEM; goto out; } - } else if (strncmp(str, "clock=", strlen("clock=")) == 0) { + } else if (strncmp(str, "clock=", sizeof("clock=") - 1) == 0) { strsep(&str, "="); if (!str) { ret = -EINVAL; @@ -1901,7 +1901,7 @@ static int parse_assignment(char *str, struct hist_trigger_attrs *attrs) ret = -ENOMEM; goto out; } - } else if (strncmp(str, "size=", strlen("size=")) == 0) { + } else if (strncmp(str, "size=", sizeof("size=") - 1) == 0) { int map_bits = parse_map_size(str); if (map_bits < 0) { @@ -3497,7 +3497,7 @@ static struct action_data *onmax_parse(char *str) if (!onmax_fn_name || !str) goto free; - if (strncmp(onmax_fn_name, "save", strlen("save")) == 0) { + if (strncmp(onmax_fn_name, "save", sizeof("save") - 1) == 0) { char *params = strsep(&str, ")"); if (!params) { @@ -4302,8 +4302,8 @@ static int parse_actions(struct hist_trigger_data *hist_data) for (i = 0; i < hist_data->attrs->n_actions; i++) { str = hist_data->attrs->action_str[i]; - if (strncmp(str, "onmatch(", strlen("onmatch(")) == 0) { - char *action_str = str + strlen("onmatch("); + if (strncmp(str, "onmatch(", sizeof("onmatch(") - 1) == 0) { + char *action_str = str + sizeof("onmatch(") - 1; data = onmatch_parse(tr, action_str); if (IS_ERR(data)) { @@ -4311,8 +4311,8 @@ static int parse_actions(struct hist_trigger_data *hist_data) break; } data->fn = action_trace; - } else if (strncmp(str, "onmax(", strlen("onmax(")) == 0) { - char *action_str = str + strlen("onmax("); + } else if (strncmp(str, "onmax(", sizeof("onmax(") - 1) == 0) { + char *action_str = str + sizeof("onmax(") - 1; data = onmax_parse(action_str); if (IS_ERR(data)) { @@ -5548,9 +5548,9 @@ static int event_hist_trigger_func(struct event_command *cmd_ops, p++; continue; } - if (p >= param + strlen(param) - strlen("if") - 1) + if (p >= param + strlen(param) - (sizeof("if") - 1) - 1) return -EINVAL; - if (*(p + strlen("if")) != ' ' && *(p + strlen("if")) != '\t') { + if (*(p + sizeof("if") - 1) != ' ' && *(p + sizeof("if") - 1) != '\t') { p++; continue; } -- cgit v1.3-7-g2ca7 From e4f6d245031e04bdd12db390298acec0474a1a46 Mon Sep 17 00:00:00 2001 From: Tom Zanussi Date: Wed, 19 Dec 2018 13:09:16 -0600 Subject: tracing: Use var_refs[] for hist trigger reference checking Since all the variable reference hist_fields are collected into hist_data->var_refs[] array, there's no need to go through all the fields looking for them, or in separate arrays like synth_var_refs[], which will be going away soon anyway. This also allows us to get rid of some unnecessary code and functions currently used for the same purpose. Link: http://lkml.kernel.org/r/1545246556.4239.7.camel@gmail.com Acked-by: Namhyung Kim Reviewed-by: Masami Hiramatsu Signed-off-by: Tom Zanussi Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_events_hist.c | 68 +++++++--------------------------------- 1 file changed, 11 insertions(+), 57 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c index 25d06b3ae1f6..f3caf6e484f4 100644 --- a/kernel/trace/trace_events_hist.c +++ b/kernel/trace/trace_events_hist.c @@ -1297,75 +1297,29 @@ check_field_for_var_ref(struct hist_field *hist_field, struct hist_trigger_data *var_data, unsigned int var_idx) { - struct hist_field *found = NULL; - - if (hist_field && hist_field->flags & HIST_FIELD_FL_VAR_REF) { - if (hist_field->var.idx == var_idx && - hist_field->var.hist_data == var_data) { - found = hist_field; - } - } - - return found; -} - -static struct hist_field * -check_field_for_var_refs(struct hist_trigger_data *hist_data, - struct hist_field *hist_field, - struct hist_trigger_data *var_data, - unsigned int var_idx, - unsigned int level) -{ - struct hist_field *found = NULL; - unsigned int i; - - if (level > 3) - return found; - - if (!hist_field) - return found; - - found = check_field_for_var_ref(hist_field, var_data, var_idx); - if (found) - return found; - - for (i = 0; i < HIST_FIELD_OPERANDS_MAX; i++) { - struct hist_field *operand; + WARN_ON(!(hist_field && hist_field->flags & HIST_FIELD_FL_VAR_REF)); - operand = hist_field->operands[i]; - found = check_field_for_var_refs(hist_data, operand, var_data, - var_idx, level + 1); - if (found) - return found; - } + if (hist_field && hist_field->var.idx == var_idx && + hist_field->var.hist_data == var_data) + return hist_field; - return found; + return NULL; } static struct hist_field *find_var_ref(struct hist_trigger_data *hist_data, struct hist_trigger_data *var_data, unsigned int var_idx) { - struct hist_field *hist_field, *found = NULL; + struct hist_field *hist_field; unsigned int i; - for_each_hist_field(i, hist_data) { - hist_field = hist_data->fields[i]; - found = check_field_for_var_refs(hist_data, hist_field, - var_data, var_idx, 0); - if (found) - return found; - } - - for (i = 0; i < hist_data->n_synth_var_refs; i++) { - hist_field = hist_data->synth_var_refs[i]; - found = check_field_for_var_refs(hist_data, hist_field, - var_data, var_idx, 0); - if (found) - return found; + for (i = 0; i < hist_data->n_var_refs; i++) { + hist_field = hist_data->var_refs[i]; + if (check_field_for_var_ref(hist_field, var_data, var_idx)) + return hist_field; } - return found; + return NULL; } static struct hist_field *find_any_var_ref(struct hist_trigger_data *hist_data, -- cgit v1.3-7-g2ca7 From de40f033d4e84e843d6a12266e3869015ea9097c Mon Sep 17 00:00:00 2001 From: Tom Zanussi Date: Tue, 18 Dec 2018 14:33:23 -0600 Subject: tracing: Remove open-coding of hist trigger var_ref management Have create_var_ref() manage the hist trigger's var_ref list, rather than having similar code doing it in multiple places. This cleans up the code and makes sure var_refs are always accounted properly. Also, document the var_ref-related functions to make what their purpose clearer. Link: http://lkml.kernel.org/r/05ddae93ff514e66fc03897d6665231892939913.1545161087.git.tom.zanussi@linux.intel.com Acked-by: Namhyung Kim Reviewed-by: Masami Hiramatsu Signed-off-by: Tom Zanussi Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_events_hist.c | 93 ++++++++++++++++++++++++++++++++-------- 1 file changed, 75 insertions(+), 18 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c index f3caf6e484f4..6309f4dbfb9c 100644 --- a/kernel/trace/trace_events_hist.c +++ b/kernel/trace/trace_events_hist.c @@ -1292,6 +1292,17 @@ static u64 hist_field_cpu(struct hist_field *hist_field, return cpu; } +/** + * check_field_for_var_ref - Check if a VAR_REF field references a variable + * @hist_field: The VAR_REF field to check + * @var_data: The hist trigger that owns the variable + * @var_idx: The trigger variable identifier + * + * Check the given VAR_REF field to see whether or not it references + * the given variable associated with the given trigger. + * + * Return: The VAR_REF field if it does reference the variable, NULL if not + */ static struct hist_field * check_field_for_var_ref(struct hist_field *hist_field, struct hist_trigger_data *var_data, @@ -1306,6 +1317,18 @@ check_field_for_var_ref(struct hist_field *hist_field, return NULL; } +/** + * find_var_ref - Check if a trigger has a reference to a trigger variable + * @hist_data: The hist trigger that might have a reference to the variable + * @var_data: The hist trigger that owns the variable + * @var_idx: The trigger variable identifier + * + * Check the list of var_refs[] on the first hist trigger to see + * whether any of them are references to the variable on the second + * trigger. + * + * Return: The VAR_REF field referencing the variable if so, NULL if not + */ static struct hist_field *find_var_ref(struct hist_trigger_data *hist_data, struct hist_trigger_data *var_data, unsigned int var_idx) @@ -1322,6 +1345,20 @@ static struct hist_field *find_var_ref(struct hist_trigger_data *hist_data, return NULL; } +/** + * find_any_var_ref - Check if there is a reference to a given trigger variable + * @hist_data: The hist trigger + * @var_idx: The trigger variable identifier + * + * Check to see whether the given variable is currently referenced by + * any other trigger. + * + * The trigger the variable is defined on is explicitly excluded - the + * assumption being that a self-reference doesn't prevent a trigger + * from being removed. + * + * Return: The VAR_REF field referencing the variable if so, NULL if not + */ static struct hist_field *find_any_var_ref(struct hist_trigger_data *hist_data, unsigned int var_idx) { @@ -1340,6 +1377,19 @@ static struct hist_field *find_any_var_ref(struct hist_trigger_data *hist_data, return found; } +/** + * check_var_refs - Check if there is a reference to any of trigger's variables + * @hist_data: The hist trigger + * + * A trigger can define one or more variables. If any one of them is + * currently referenced by any other trigger, this function will + * determine that. + + * Typically used to determine whether or not a trigger can be removed + * - if there are any references to a trigger's variables, it cannot. + * + * Return: True if there is a reference to any of trigger's variables + */ static bool check_var_refs(struct hist_trigger_data *hist_data) { struct hist_field *field; @@ -2343,7 +2393,23 @@ static int init_var_ref(struct hist_field *ref_field, goto out; } -static struct hist_field *create_var_ref(struct hist_field *var_field, +/** + * create_var_ref - Create a variable reference and attach it to trigger + * @hist_data: The trigger that will be referencing the variable + * @var_field: The VAR field to create a reference to + * @system: The optional system string + * @event_name: The optional event_name string + * + * Given a variable hist_field, create a VAR_REF hist_field that + * represents a reference to it. + * + * This function also adds the reference to the trigger that + * now references the variable. + * + * Return: The VAR_REF field if successful, NULL if not + */ +static struct hist_field *create_var_ref(struct hist_trigger_data *hist_data, + struct hist_field *var_field, char *system, char *event_name) { unsigned long flags = HIST_FIELD_FL_VAR_REF; @@ -2355,6 +2421,9 @@ static struct hist_field *create_var_ref(struct hist_field *var_field, destroy_hist_field(ref_field, 0); return NULL; } + + hist_data->var_refs[hist_data->n_var_refs] = ref_field; + ref_field->var_ref_idx = hist_data->n_var_refs++; } return ref_field; @@ -2428,7 +2497,8 @@ static struct hist_field *parse_var_ref(struct hist_trigger_data *hist_data, var_field = find_event_var(hist_data, system, event_name, var_name); if (var_field) - ref_field = create_var_ref(var_field, system, event_name); + ref_field = create_var_ref(hist_data, var_field, + system, event_name); if (!ref_field) hist_err_event("Couldn't find variable: $", @@ -2546,8 +2616,6 @@ static struct hist_field *parse_atom(struct hist_trigger_data *hist_data, if (!s) { hist_field = parse_var_ref(hist_data, ref_system, ref_event, ref_var); if (hist_field) { - hist_data->var_refs[hist_data->n_var_refs] = hist_field; - hist_field->var_ref_idx = hist_data->n_var_refs++; if (var_name) { hist_field = create_alias(hist_data, hist_field, var_name); if (!hist_field) { @@ -3321,7 +3389,6 @@ static int onmax_create(struct hist_trigger_data *hist_data, unsigned int var_ref_idx = hist_data->n_var_refs; struct field_var *field_var; char *onmax_var_str, *param; - unsigned long flags; unsigned int i; int ret = 0; @@ -3338,18 +3405,10 @@ static int onmax_create(struct hist_trigger_data *hist_data, return -EINVAL; } - flags = HIST_FIELD_FL_VAR_REF; - ref_field = create_hist_field(hist_data, NULL, flags, NULL); + ref_field = create_var_ref(hist_data, var_field, NULL, NULL); if (!ref_field) return -ENOMEM; - if (init_var_ref(ref_field, var_field, NULL, NULL)) { - destroy_hist_field(ref_field, 0); - ret = -ENOMEM; - goto out; - } - hist_data->var_refs[hist_data->n_var_refs] = ref_field; - ref_field->var_ref_idx = hist_data->n_var_refs++; data->onmax.var = ref_field; data->fn = onmax_save; @@ -3538,9 +3597,6 @@ static void save_synth_var_ref(struct hist_trigger_data *hist_data, struct hist_field *var_ref) { hist_data->synth_var_refs[hist_data->n_synth_var_refs++] = var_ref; - - hist_data->var_refs[hist_data->n_var_refs] = var_ref; - var_ref->var_ref_idx = hist_data->n_var_refs++; } static int check_synth_field(struct synth_event *event, @@ -3694,7 +3750,8 @@ static int onmatch_create(struct hist_trigger_data *hist_data, } if (check_synth_field(event, hist_field, field_pos) == 0) { - var_ref = create_var_ref(hist_field, system, event_name); + var_ref = create_var_ref(hist_data, hist_field, + system, event_name); if (!var_ref) { kfree(p); ret = -ENOMEM; -- cgit v1.3-7-g2ca7 From 656fe2ba85e81d00e4447bf77b8da2be3c47acb2 Mon Sep 17 00:00:00 2001 From: Tom Zanussi Date: Tue, 18 Dec 2018 14:33:24 -0600 Subject: tracing: Use hist trigger's var_ref array to destroy var_refs Since every var ref for a trigger has an entry in the var_ref[] array, use that to destroy the var_refs, instead of piecemeal via the field expressions. This allows us to avoid having to keep and treat differently separate lists for the action-related references, which future patches will remove. Link: http://lkml.kernel.org/r/fad1a164f0e257c158e70d6eadbf6c586e04b2a2.1545161087.git.tom.zanussi@linux.intel.com Acked-by: Namhyung Kim Reviewed-by: Masami Hiramatsu Signed-off-by: Tom Zanussi Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_events_hist.c | 24 +++++++++++++++++++----- 1 file changed, 19 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c index 6309f4dbfb9c..923c572ee68d 100644 --- a/kernel/trace/trace_events_hist.c +++ b/kernel/trace/trace_events_hist.c @@ -2190,6 +2190,15 @@ static int contains_operator(char *str) return field_op; } +static void __destroy_hist_field(struct hist_field *hist_field) +{ + kfree(hist_field->var.name); + kfree(hist_field->name); + kfree(hist_field->type); + + kfree(hist_field); +} + static void destroy_hist_field(struct hist_field *hist_field, unsigned int level) { @@ -2201,14 +2210,13 @@ static void destroy_hist_field(struct hist_field *hist_field, if (!hist_field) return; + if (hist_field->flags & HIST_FIELD_FL_VAR_REF) + return; /* var refs will be destroyed separately */ + for (i = 0; i < HIST_FIELD_OPERANDS_MAX; i++) destroy_hist_field(hist_field->operands[i], level + 1); - kfree(hist_field->var.name); - kfree(hist_field->name); - kfree(hist_field->type); - - kfree(hist_field); + __destroy_hist_field(hist_field); } static struct hist_field *create_hist_field(struct hist_trigger_data *hist_data, @@ -2335,6 +2343,12 @@ static void destroy_hist_fields(struct hist_trigger_data *hist_data) hist_data->fields[i] = NULL; } } + + for (i = 0; i < hist_data->n_var_refs; i++) { + WARN_ON(!(hist_data->var_refs[i]->flags & HIST_FIELD_FL_VAR_REF)); + __destroy_hist_field(hist_data->var_refs[i]); + hist_data->var_refs[i] = NULL; + } } static int init_var_ref(struct hist_field *ref_field, -- cgit v1.3-7-g2ca7 From 912201345f7c39e6b0ac283207be2b6641fa47b9 Mon Sep 17 00:00:00 2001 From: Tom Zanussi Date: Tue, 18 Dec 2018 14:33:25 -0600 Subject: tracing: Remove hist trigger synth_var_refs All var_refs are now handled uniformly and there's no reason to treat the synth_refs in a special way now, so remove them and associated functions. Link: http://lkml.kernel.org/r/b4d3470526b8f0426dcec125399dad9ad9b8589d.1545161087.git.tom.zanussi@linux.intel.com Acked-by: Namhyung Kim Reviewed-by: Masami Hiramatsu Signed-off-by: Tom Zanussi Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_events_hist.c | 18 ------------------ 1 file changed, 18 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c index 923c572ee68d..1b1aee944588 100644 --- a/kernel/trace/trace_events_hist.c +++ b/kernel/trace/trace_events_hist.c @@ -279,8 +279,6 @@ struct hist_trigger_data { struct action_data *actions[HIST_ACTIONS_MAX]; unsigned int n_actions; - struct hist_field *synth_var_refs[SYNTH_FIELDS_MAX]; - unsigned int n_synth_var_refs; struct field_var *field_vars[SYNTH_FIELDS_MAX]; unsigned int n_field_vars; unsigned int n_field_var_str; @@ -3599,20 +3597,6 @@ static void save_field_var(struct hist_trigger_data *hist_data, } -static void destroy_synth_var_refs(struct hist_trigger_data *hist_data) -{ - unsigned int i; - - for (i = 0; i < hist_data->n_synth_var_refs; i++) - destroy_hist_field(hist_data->synth_var_refs[i], 0); -} - -static void save_synth_var_ref(struct hist_trigger_data *hist_data, - struct hist_field *var_ref) -{ - hist_data->synth_var_refs[hist_data->n_synth_var_refs++] = var_ref; -} - static int check_synth_field(struct synth_event *event, struct hist_field *hist_field, unsigned int field_pos) @@ -3772,7 +3756,6 @@ static int onmatch_create(struct hist_trigger_data *hist_data, goto err; } - save_synth_var_ref(hist_data, var_ref); field_pos++; kfree(p); continue; @@ -4516,7 +4499,6 @@ static void destroy_hist_data(struct hist_trigger_data *hist_data) destroy_actions(hist_data); destroy_field_vars(hist_data); destroy_field_var_hists(hist_data); - destroy_synth_var_refs(hist_data); kfree(hist_data); } -- cgit v1.3-7-g2ca7 From 05ddb25cb31491b0901d057c40ad50d01b3db783 Mon Sep 17 00:00:00 2001 From: Tom Zanussi Date: Tue, 18 Dec 2018 14:33:26 -0600 Subject: tracing: Add hist trigger comments for variable-related fields Add a few comments to help clarify how variable and variable reference fields are used in the code. Link: http://lkml.kernel.org/r/ea857ce948531d7bec712bbb0f38360aa1d378ec.1545161087.git.tom.zanussi@linux.intel.com Acked-by: Namhyung Kim Reviewed-by: Masami Hiramatsu Signed-off-by: Tom Zanussi Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_events_hist.c | 36 ++++++++++++++++++++++++++++++++++++ 1 file changed, 36 insertions(+) (limited to 'kernel') diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c index 1b1aee944588..c5448c103770 100644 --- a/kernel/trace/trace_events_hist.c +++ b/kernel/trace/trace_events_hist.c @@ -40,6 +40,16 @@ enum field_op_id { FIELD_OP_UNARY_MINUS, }; +/* + * A hist_var (histogram variable) contains variable information for + * hist_fields having the HIST_FIELD_FL_VAR or HIST_FIELD_FL_VAR_REF + * flag set. A hist_var has a variable name e.g. ts0, and is + * associated with a given histogram trigger, as specified by + * hist_data. The hist_var idx is the unique index assigned to the + * variable by the hist trigger's tracing_map. The idx is what is + * used to set a variable's value and, by a variable reference, to + * retrieve it. + */ struct hist_var { char *name; struct hist_trigger_data *hist_data; @@ -56,11 +66,29 @@ struct hist_field { const char *type; struct hist_field *operands[HIST_FIELD_OPERANDS_MAX]; struct hist_trigger_data *hist_data; + + /* + * Variable fields contain variable-specific info in var. + */ struct hist_var var; enum field_op_id operator; char *system; char *event_name; + + /* + * The name field is used for EXPR and VAR_REF fields. VAR + * fields contain the variable name in var.name. + */ char *name; + + /* + * When a histogram trigger is hit, if it has any references + * to variables, the values of those variables are collected + * into a var_ref_vals array by resolve_var_refs(). The + * current value of each variable is read from the tracing_map + * using the hist field's hist_var.idx and entered into the + * var_ref_idx entry i.e. var_ref_vals[var_ref_idx]. + */ unsigned int var_ref_idx; bool read_once; }; @@ -365,6 +393,14 @@ struct action_data { union { struct { + /* + * When a histogram trigger is hit, the values of any + * references to variables, including variables being passed + * as parameters to synthetic events, are collected into a + * var_ref_vals array. This var_ref_idx is the index of the + * first param in the array to be passed to the synthetic + * event invocation. + */ unsigned int var_ref_idx; char *match_event; char *match_event_system; -- cgit v1.3-7-g2ca7 From 59dd974bc079079c23b5429cba841696fa7fae41 Mon Sep 17 00:00:00 2001 From: Rasmus Villemoes Date: Mon, 29 Oct 2018 23:35:40 +0100 Subject: tracing: Merge seq_print_sym_short() and seq_print_sym_offset() These two functions are nearly identical, so we can avoid some code duplication by moving the conditional into a common implementation. Link: http://lkml.kernel.org/r/20181029223542.26175-2-linux@rasmusvillemoes.dk Signed-off-by: Rasmus Villemoes Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_output.c | 34 +++++++--------------------------- 1 file changed, 7 insertions(+), 27 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c index 6e6cc64faa38..85ecd061c7be 100644 --- a/kernel/trace/trace_output.c +++ b/kernel/trace/trace_output.c @@ -339,34 +339,17 @@ static inline const char *kretprobed(const char *name) #endif /* CONFIG_KRETPROBES */ static void -seq_print_sym_short(struct trace_seq *s, const char *fmt, unsigned long address) +seq_print_sym(struct trace_seq *s, const char *fmt, unsigned long address, + bool offset) { char str[KSYM_SYMBOL_LEN]; #ifdef CONFIG_KALLSYMS const char *name; - kallsyms_lookup(address, NULL, NULL, NULL, str); - - name = kretprobed(str); - - if (name && strlen(name)) { - trace_seq_printf(s, fmt, name); - return; - } -#endif - snprintf(str, KSYM_SYMBOL_LEN, "0x%08lx", address); - trace_seq_printf(s, fmt, str); -} - -static void -seq_print_sym_offset(struct trace_seq *s, const char *fmt, - unsigned long address) -{ - char str[KSYM_SYMBOL_LEN]; -#ifdef CONFIG_KALLSYMS - const char *name; - - sprint_symbol(str, address); + if (offset) + sprint_symbol(str, address); + else + kallsyms_lookup(address, NULL, NULL, NULL, str); name = kretprobed(str); if (name && strlen(name)) { @@ -424,10 +407,7 @@ seq_print_ip_sym(struct trace_seq *s, unsigned long ip, unsigned long sym_flags) goto out; } - if (sym_flags & TRACE_ITER_SYM_OFFSET) - seq_print_sym_offset(s, "%s", ip); - else - seq_print_sym_short(s, "%s", ip); + seq_print_sym(s, "%s", ip, sym_flags & TRACE_ITER_SYM_OFFSET); if (sym_flags & TRACE_ITER_SYM_ADDR) trace_seq_printf(s, " <" IP_FMT ">", ip); -- cgit v1.3-7-g2ca7 From cc9f59fb3bc455984404dbab8be16f156a47d023 Mon Sep 17 00:00:00 2001 From: Rasmus Villemoes Date: Mon, 29 Oct 2018 23:35:41 +0100 Subject: tracing: Avoid -Wformat-nonliteral warning MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Building with -Wformat-nonliteral, gcc complains kernel/trace/trace_output.c: In function ‘seq_print_sym’: kernel/trace/trace_output.c:356:3: warning: format not a string literal, argument types not checked [-Wformat-nonliteral] trace_seq_printf(s, fmt, name); But seq_print_sym only has a single caller which passes "%s" as fmt, so we might as well just use that directly. That also paves the way for further cleanups that will actually make that format string go away entirely. Link: http://lkml.kernel.org/r/20181029223542.26175-3-linux@rasmusvillemoes.dk Signed-off-by: Rasmus Villemoes Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_output.c | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c index 85ecd061c7be..f06fb899b746 100644 --- a/kernel/trace/trace_output.c +++ b/kernel/trace/trace_output.c @@ -339,8 +339,7 @@ static inline const char *kretprobed(const char *name) #endif /* CONFIG_KRETPROBES */ static void -seq_print_sym(struct trace_seq *s, const char *fmt, unsigned long address, - bool offset) +seq_print_sym(struct trace_seq *s, unsigned long address, bool offset) { char str[KSYM_SYMBOL_LEN]; #ifdef CONFIG_KALLSYMS @@ -353,12 +352,12 @@ seq_print_sym(struct trace_seq *s, const char *fmt, unsigned long address, name = kretprobed(str); if (name && strlen(name)) { - trace_seq_printf(s, fmt, name); + trace_seq_printf(s, "%s", name); return; } #endif snprintf(str, KSYM_SYMBOL_LEN, "0x%08lx", address); - trace_seq_printf(s, fmt, str); + trace_seq_printf(s, "%s", str); } #ifndef CONFIG_64BIT @@ -407,7 +406,7 @@ seq_print_ip_sym(struct trace_seq *s, unsigned long ip, unsigned long sym_flags) goto out; } - seq_print_sym(s, "%s", ip, sym_flags & TRACE_ITER_SYM_OFFSET); + seq_print_sym(s, ip, sym_flags & TRACE_ITER_SYM_OFFSET); if (sym_flags & TRACE_ITER_SYM_ADDR) trace_seq_printf(s, " <" IP_FMT ">", ip); -- cgit v1.3-7-g2ca7 From bea6957d5cd7cbb327083d3bcf080133ee63ef65 Mon Sep 17 00:00:00 2001 From: Rasmus Villemoes Date: Mon, 29 Oct 2018 23:35:42 +0100 Subject: tracing: Simplify printf'ing in seq_print_sym trace_seq_printf(..., "%s", ...) can be done with trace_seq_puts() instead, avoiding printf overhead. In the second instance, the string we're copying was just created from an snprintf() to a stack buffer, so we might as well do that printf directly. This naturally leads to moving the declaration of the str buffer inside the CONFIG_KALLSYMS guard, which in turn will make gcc inline the function for !CONFIG_KALLSYMS (it only has a single caller, but the huge stack frame seems to make gcc not inline it for CONFIG_KALLSYMS). Link: http://lkml.kernel.org/r/20181029223542.26175-4-linux@rasmusvillemoes.dk Signed-off-by: Rasmus Villemoes Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_output.c | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c index f06fb899b746..54373d93e251 100644 --- a/kernel/trace/trace_output.c +++ b/kernel/trace/trace_output.c @@ -341,8 +341,8 @@ static inline const char *kretprobed(const char *name) static void seq_print_sym(struct trace_seq *s, unsigned long address, bool offset) { - char str[KSYM_SYMBOL_LEN]; #ifdef CONFIG_KALLSYMS + char str[KSYM_SYMBOL_LEN]; const char *name; if (offset) @@ -352,12 +352,11 @@ seq_print_sym(struct trace_seq *s, unsigned long address, bool offset) name = kretprobed(str); if (name && strlen(name)) { - trace_seq_printf(s, "%s", name); + trace_seq_puts(s, name); return; } #endif - snprintf(str, KSYM_SYMBOL_LEN, "0x%08lx", address); - trace_seq_printf(s, "%s", str); + trace_seq_printf(s, "0x%08lx", address); } #ifndef CONFIG_64BIT -- cgit v1.3-7-g2ca7 From 1cce377df1800154301525a8ee04100dd83c9633 Mon Sep 17 00:00:00 2001 From: Mathieu Malaterre Date: Wed, 16 May 2018 21:30:12 +0200 Subject: tracing: Make function ‘ftrace_exports’ static MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit In commit 478409dd683d ("tracing: Add hook to function tracing for other subsystems to use"), a new function ‘ftrace_exports’ was added. Since this function can be made static, make it so. Silence the following warning triggered using W=1: kernel/trace/trace.c:2451:6: warning: no previous prototype for ‘ftrace_exports’ [-Wmissing-prototypes] Link: http://lkml.kernel.org/r/20180516193012.25390-1-malat@debian.org Signed-off-by: Mathieu Malaterre Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 911470ad9e94..5afcfecb4bc2 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -2452,7 +2452,7 @@ static inline void ftrace_exports_disable(void) static_branch_disable(&ftrace_exports_enabled); } -void ftrace_exports(struct ring_buffer_event *event) +static void ftrace_exports(struct ring_buffer_event *event) { struct trace_export *export; -- cgit v1.3-7-g2ca7 From 754481e6954cbef53f8bc4412ad48dde611e21d3 Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Wed, 19 Dec 2018 22:38:21 -0500 Subject: tracing: Use str_has_prefix() helper for histogram code The tracing histogram code contains a lot of instances of the construct: strncmp(str, "const", sizeof("const") - 1) This can be prone to bugs due to typos or bad cut and paste. Use the str_has_prefix() helper macro instead that removes the need for having two copies of the constant string. Cc: Tom Zanussi Acked-by: Namhyung Kim Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_events_hist.c | 28 ++++++++++++++-------------- 1 file changed, 14 insertions(+), 14 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c index c5448c103770..9d590138f870 100644 --- a/kernel/trace/trace_events_hist.c +++ b/kernel/trace/trace_events_hist.c @@ -1881,8 +1881,8 @@ static int parse_action(char *str, struct hist_trigger_attrs *attrs) if (attrs->n_actions >= HIST_ACTIONS_MAX) return ret; - if ((strncmp(str, "onmatch(", sizeof("onmatch(") - 1) == 0) || - (strncmp(str, "onmax(", sizeof("onmax(") - 1) == 0)) { + if ((str_has_prefix(str, "onmatch(")) || + (str_has_prefix(str, "onmax("))) { attrs->action_str[attrs->n_actions] = kstrdup(str, GFP_KERNEL); if (!attrs->action_str[attrs->n_actions]) { ret = -ENOMEM; @@ -1899,34 +1899,34 @@ static int parse_assignment(char *str, struct hist_trigger_attrs *attrs) { int ret = 0; - if ((strncmp(str, "key=", sizeof("key=") - 1) == 0) || - (strncmp(str, "keys=", sizeof("keys=") - 1) == 0)) { + if ((str_has_prefix(str, "key=")) || + (str_has_prefix(str, "keys="))) { attrs->keys_str = kstrdup(str, GFP_KERNEL); if (!attrs->keys_str) { ret = -ENOMEM; goto out; } - } else if ((strncmp(str, "val=", sizeof("val=") - 1) == 0) || - (strncmp(str, "vals=", sizeof("vals=") - 1) == 0) || - (strncmp(str, "values=", sizeof("values=") - 1) == 0)) { + } else if ((str_has_prefix(str, "val=")) || + (str_has_prefix(str, "vals=")) || + (str_has_prefix(str, "values="))) { attrs->vals_str = kstrdup(str, GFP_KERNEL); if (!attrs->vals_str) { ret = -ENOMEM; goto out; } - } else if (strncmp(str, "sort=", sizeof("sort=") - 1) == 0) { + } else if (str_has_prefix(str, "sort=")) { attrs->sort_key_str = kstrdup(str, GFP_KERNEL); if (!attrs->sort_key_str) { ret = -ENOMEM; goto out; } - } else if (strncmp(str, "name=", sizeof("name=") - 1) == 0) { + } else if (str_has_prefix(str, "name=")) { attrs->name = kstrdup(str, GFP_KERNEL); if (!attrs->name) { ret = -ENOMEM; goto out; } - } else if (strncmp(str, "clock=", sizeof("clock=") - 1) == 0) { + } else if (str_has_prefix(str, "clock=")) { strsep(&str, "="); if (!str) { ret = -EINVAL; @@ -1939,7 +1939,7 @@ static int parse_assignment(char *str, struct hist_trigger_attrs *attrs) ret = -ENOMEM; goto out; } - } else if (strncmp(str, "size=", sizeof("size=") - 1) == 0) { + } else if (str_has_prefix(str, "size=")) { int map_bits = parse_map_size(str); if (map_bits < 0) { @@ -3558,7 +3558,7 @@ static struct action_data *onmax_parse(char *str) if (!onmax_fn_name || !str) goto free; - if (strncmp(onmax_fn_name, "save", sizeof("save") - 1) == 0) { + if (str_has_prefix(onmax_fn_name, "save")) { char *params = strsep(&str, ")"); if (!params) { @@ -4346,7 +4346,7 @@ static int parse_actions(struct hist_trigger_data *hist_data) for (i = 0; i < hist_data->attrs->n_actions; i++) { str = hist_data->attrs->action_str[i]; - if (strncmp(str, "onmatch(", sizeof("onmatch(") - 1) == 0) { + if (str_has_prefix(str, "onmatch(")) { char *action_str = str + sizeof("onmatch(") - 1; data = onmatch_parse(tr, action_str); @@ -4355,7 +4355,7 @@ static int parse_actions(struct hist_trigger_data *hist_data) break; } data->fn = action_trace; - } else if (strncmp(str, "onmax(", sizeof("onmax(") - 1) == 0) { + } else if (str_has_prefix(str, "onmax(")) { char *action_str = str + sizeof("onmax(") - 1; data = onmax_parse(action_str); -- cgit v1.3-7-g2ca7 From b6b2735514bcd70ad1556a33892a636b20ece671 Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Thu, 20 Dec 2018 13:20:07 -0500 Subject: tracing: Use str_has_prefix() instead of using fixed sizes There are several instances of strncmp(str, "const", 123), where 123 is the strlen of the const string to check if "const" is the prefix of str. But this can be error prone. Use str_has_prefix() instead. Acked-by: Namhyung Kim Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace.c | 2 +- kernel/trace/trace_events.c | 2 +- kernel/trace/trace_events_hist.c | 2 +- kernel/trace/trace_probe.c | 4 ++-- kernel/trace/trace_stack.c | 2 +- 5 files changed, 6 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 5afcfecb4bc2..eac2824a18ab 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -4411,7 +4411,7 @@ static int trace_set_options(struct trace_array *tr, char *option) cmp = strstrip(option); - if (strncmp(cmp, "no", 2) == 0) { + if (str_has_prefix(cmp, "no")) { neg = 1; cmp += 2; } diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c index bd0162c0467c..5b3b0c3c8a47 100644 --- a/kernel/trace/trace_events.c +++ b/kernel/trace/trace_events.c @@ -1251,7 +1251,7 @@ static int f_show(struct seq_file *m, void *v) */ array_descriptor = strchr(field->type, '['); - if (!strncmp(field->type, "__data_loc", 10)) + if (str_has_prefix(field->type, "__data_loc")) array_descriptor = NULL; if (!array_descriptor) diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c index 9d590138f870..0d878dcd1e4b 100644 --- a/kernel/trace/trace_events_hist.c +++ b/kernel/trace/trace_events_hist.c @@ -518,7 +518,7 @@ static int synth_event_define_fields(struct trace_event_call *call) static bool synth_field_signed(char *type) { - if (strncmp(type, "u", 1) == 0) + if (str_has_prefix(type, "u")) return false; return true; diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c index ff86417c0149..541375737403 100644 --- a/kernel/trace/trace_probe.c +++ b/kernel/trace/trace_probe.c @@ -194,7 +194,7 @@ static int parse_probe_vars(char *arg, const struct fetch_type *t, code->op = FETCH_OP_RETVAL; else ret = -EINVAL; - } else if (strncmp(arg, "stack", 5) == 0) { + } else if (str_has_prefix(arg, "stack")) { if (arg[5] == '\0') { code->op = FETCH_OP_STACKP; } else if (isdigit(arg[5])) { @@ -213,7 +213,7 @@ static int parse_probe_vars(char *arg, const struct fetch_type *t, #ifdef CONFIG_HAVE_FUNCTION_ARG_ACCESS_API } else if (((flags & TPARG_FL_MASK) == (TPARG_FL_KERNEL | TPARG_FL_FENTRY)) && - strncmp(arg, "arg", 3) == 0) { + str_has_prefix(arg, "arg")) { if (!isdigit(arg[3])) return -EINVAL; ret = kstrtoul(arg + 3, 10, ¶m); diff --git a/kernel/trace/trace_stack.c b/kernel/trace/trace_stack.c index e2a153fc1afc..3641f28c343f 100644 --- a/kernel/trace/trace_stack.c +++ b/kernel/trace/trace_stack.c @@ -448,7 +448,7 @@ static char stack_trace_filter_buf[COMMAND_LINE_SIZE+1] __initdata; static __init int enable_stacktrace(char *str) { - if (strncmp(str, "_filter=", 8) == 0) + if (str_has_prefix(str, "_filter=")) strncpy(stack_trace_filter_buf, str+8, COMMAND_LINE_SIZE); stack_tracer_enabled = 1; -- cgit v1.3-7-g2ca7 From 036876fa56204ae0fa59045bd6bbb2691a060633 Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Fri, 21 Dec 2018 18:40:46 -0500 Subject: tracing: Have the historgram use the result of str_has_prefix() for len of prefix As str_has_prefix() returns the length on match, we can use that for the updating of the string pointer instead of recalculating the prefix size. Cc: Tom Zanussi Acked-by: Namhyung Kim Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_events_hist.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c index 0d878dcd1e4b..449d90cfa151 100644 --- a/kernel/trace/trace_events_hist.c +++ b/kernel/trace/trace_events_hist.c @@ -4342,12 +4342,13 @@ static int parse_actions(struct hist_trigger_data *hist_data) unsigned int i; int ret = 0; char *str; + int len; for (i = 0; i < hist_data->attrs->n_actions; i++) { str = hist_data->attrs->action_str[i]; - if (str_has_prefix(str, "onmatch(")) { - char *action_str = str + sizeof("onmatch(") - 1; + if ((len = str_has_prefix(str, "onmatch("))) { + char *action_str = str + len; data = onmatch_parse(tr, action_str); if (IS_ERR(data)) { @@ -4355,8 +4356,8 @@ static int parse_actions(struct hist_trigger_data *hist_data) break; } data->fn = action_trace; - } else if (str_has_prefix(str, "onmax(")) { - char *action_str = str + sizeof("onmax(") - 1; + } else if ((len = str_has_prefix(str, "onmax("))) { + char *action_str = str + len; data = onmax_parse(action_str); if (IS_ERR(data)) { -- cgit v1.3-7-g2ca7 From 3d739c1f6156c70eb0548aa288dcfbac9e0bd162 Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Fri, 21 Dec 2018 23:10:26 -0500 Subject: tracing: Use the return of str_has_prefix() to remove open coded numbers There are several locations that compare constants to the beginning of string variables to determine what commands should be done, then the constant length is used to index into the string. This is error prone as the hard coded numbers have to match the size of the constants. Instead, use the len returned from str_has_prefix() and remove the open coded string length sizes. Cc: Joe Perches Acked-by: Masami Hiramatsu (for trace_probe part) Acked-by: Namhyung Kim Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace.c | 8 +++++--- kernel/trace/trace_probe.c | 17 +++++++++-------- kernel/trace/trace_stack.c | 6 ++++-- 3 files changed, 18 insertions(+), 13 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index eac2824a18ab..18b86c3974e1 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -4408,13 +4408,15 @@ static int trace_set_options(struct trace_array *tr, char *option) int neg = 0; int ret; size_t orig_len = strlen(option); + int len; cmp = strstrip(option); - if (str_has_prefix(cmp, "no")) { + len = str_has_prefix(cmp, "no"); + if (len) neg = 1; - cmp += 2; - } + + cmp += len; mutex_lock(&trace_types_lock); diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c index 541375737403..9962cb5da8ac 100644 --- a/kernel/trace/trace_probe.c +++ b/kernel/trace/trace_probe.c @@ -186,19 +186,20 @@ int traceprobe_parse_event_name(const char **pevent, const char **pgroup, static int parse_probe_vars(char *arg, const struct fetch_type *t, struct fetch_insn *code, unsigned int flags) { - int ret = 0; unsigned long param; + int ret = 0; + int len; if (strcmp(arg, "retval") == 0) { if (flags & TPARG_FL_RETURN) code->op = FETCH_OP_RETVAL; else ret = -EINVAL; - } else if (str_has_prefix(arg, "stack")) { - if (arg[5] == '\0') { + } else if ((len = str_has_prefix(arg, "stack"))) { + if (arg[len] == '\0') { code->op = FETCH_OP_STACKP; - } else if (isdigit(arg[5])) { - ret = kstrtoul(arg + 5, 10, ¶m); + } else if (isdigit(arg[len])) { + ret = kstrtoul(arg + len, 10, ¶m); if (ret || ((flags & TPARG_FL_KERNEL) && param > PARAM_MAX_STACK)) ret = -EINVAL; @@ -213,10 +214,10 @@ static int parse_probe_vars(char *arg, const struct fetch_type *t, #ifdef CONFIG_HAVE_FUNCTION_ARG_ACCESS_API } else if (((flags & TPARG_FL_MASK) == (TPARG_FL_KERNEL | TPARG_FL_FENTRY)) && - str_has_prefix(arg, "arg")) { - if (!isdigit(arg[3])) + (len = str_has_prefix(arg, "arg"))) { + if (!isdigit(arg[len])) return -EINVAL; - ret = kstrtoul(arg + 3, 10, ¶m); + ret = kstrtoul(arg + len, 10, ¶m); if (ret || !param || param > PARAM_MAX_STACK) return -EINVAL; code->op = FETCH_OP_ARG; diff --git a/kernel/trace/trace_stack.c b/kernel/trace/trace_stack.c index 3641f28c343f..eec648a0d673 100644 --- a/kernel/trace/trace_stack.c +++ b/kernel/trace/trace_stack.c @@ -448,8 +448,10 @@ static char stack_trace_filter_buf[COMMAND_LINE_SIZE+1] __initdata; static __init int enable_stacktrace(char *str) { - if (str_has_prefix(str, "_filter=")) - strncpy(stack_trace_filter_buf, str+8, COMMAND_LINE_SIZE); + int len; + + if ((len = str_has_prefix(str, "_filter="))) + strncpy(stack_trace_filter_buf, str + len, COMMAND_LINE_SIZE); stack_tracer_enabled = 1; last_stack_tracer_enabled = 1; -- cgit v1.3-7-g2ca7 From 6d101ba6be2a26a3e1f513b5e293f0fd2b79ec5c Mon Sep 17 00:00:00 2001 From: Olof Johansson Date: Sun, 25 Nov 2018 14:41:05 -0800 Subject: sched/fair: Fix warning on non-SMP build Caused by making the variable static: kernel/sched/fair.c:119:21: warning: 'capacity_margin' defined but not used [-Wunused-variable] Seems easiest to just move it up under the existing ifdef CONFIG_SMP that's a few lines above. Fixes: ed8885a14433a ('sched/fair: Make some variables static') Signed-off-by: Olof Johansson Signed-off-by: Linus Torvalds --- kernel/sched/fair.c | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 1c1cfbf6ba0c..d1907506318a 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -94,6 +94,14 @@ int __weak arch_asym_cpu_priority(int cpu) { return -cpu; } + +/* + * The margin used when comparing utilization with CPU capacity: + * util * margin < capacity * 1024 + * + * (default: ~20%) + */ +static unsigned int capacity_margin = 1280; #endif #ifdef CONFIG_CFS_BANDWIDTH @@ -110,14 +118,6 @@ int __weak arch_asym_cpu_priority(int cpu) unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL; #endif -/* - * The margin used when comparing utilization with CPU capacity: - * util * margin < capacity * 1024 - * - * (default: ~20%) - */ -static unsigned int capacity_margin = 1280; - static inline void update_load_add(struct load_weight *lw, unsigned long inc) { lw->weight += inc; -- cgit v1.3-7-g2ca7 From e250d91d65750a0c0c62483ac4f9f357e7317617 Mon Sep 17 00:00:00 2001 From: Ondrej Mosnacek Date: Thu, 13 Dec 2018 15:17:37 +0100 Subject: cgroup: fix parsing empty mount option string This fixes the case where all mount options specified are consumed by an LSM and all that's left is an empty string. In this case cgroupfs should accept the string and not fail. How to reproduce (with SELinux enabled): # umount /sys/fs/cgroup/unified # mount -o context=system_u:object_r:cgroup_t:s0 -t cgroup2 cgroup2 /sys/fs/cgroup/unified mount: /sys/fs/cgroup/unified: wrong fs type, bad option, bad superblock on cgroup2, missing codepage or helper program, or other error. # dmesg | tail -n 1 [ 31.575952] cgroup: cgroup2: unknown option "" Fixes: 67e9c74b8a87 ("cgroup: replace __DEVEL__sane_behavior with cgroup2 fs type") [NOTE: should apply on top of commit 5136f6365ce3 ("cgroup: implement "nsdelegate" mount option"), older versions need manual rebase] Suggested-by: Stephen Smalley Signed-off-by: Ondrej Mosnacek Signed-off-by: Tejun Heo --- kernel/cgroup/cgroup.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 8da9bf5e422f..879c9f191f66 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -1748,7 +1748,7 @@ static int parse_cgroup_root_flags(char *data, unsigned int *root_flags) *root_flags = 0; - if (!data) + if (!data || *data == '\0') return 0; while ((token = strsep(&data, ",")) != NULL) { -- cgit v1.3-7-g2ca7 From 3fc9c12d27b4ded4f1f761a800558dab2e6bbac5 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 28 Dec 2018 10:31:07 -0800 Subject: cgroup: Add named hierarchy disabling to cgroup_no_v1 boot param It can be useful to inhibit all cgroup1 hierarchies especially during transition and for debugging. cgroup_no_v1 can block hierarchies with controllers which leaves out the named hierarchies. Expand it to cover the named hierarchies so that "cgroup_no_v1=all,named" disables all cgroup1 hierarchies. Signed-off-by: Tejun Heo Suggested-by: Marcin Pawlowski Signed-off-by: Tejun Heo --- Documentation/admin-guide/kernel-parameters.txt | 8 ++++++-- kernel/cgroup/cgroup-v1.c | 14 +++++++++++++- 2 files changed, 19 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 81d1d5a74728..8b7652449228 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -486,10 +486,14 @@ cut the overhead, others just disable the usage. So only cgroup_disable=memory is actually worthy} - cgroup_no_v1= [KNL] Disable one, multiple, all cgroup controllers in v1 - Format: { controller[,controller...] | "all" } + cgroup_no_v1= [KNL] Disable cgroup controllers and named hierarchies in v1 + Format: { { controller | "all" | "named" } + [,{ controller | "all" | "named" }...] } Like cgroup_disable, but only applies to cgroup v1; the blacklisted controllers remain available in cgroup2. + "all" blacklists all controllers and "named" disables + named mounts. Specifying both "all" and "named" disables + all v1 hierarchies. cgroup.memory= [KNL] Pass options to the cgroup memory controller. Format: diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c index 51063e7a93c2..583b969b0c0e 100644 --- a/kernel/cgroup/cgroup-v1.c +++ b/kernel/cgroup/cgroup-v1.c @@ -27,6 +27,9 @@ /* Controllers blocked by the commandline in v1 */ static u16 cgroup_no_v1_mask; +/* disable named v1 mounts */ +static bool cgroup_no_v1_named; + /* * pidlist destructions need to be flushed on cgroup destruction. Use a * separate workqueue as flush domain. @@ -963,6 +966,10 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts) } if (!strncmp(token, "name=", 5)) { const char *name = token + 5; + + /* blocked by boot param? */ + if (cgroup_no_v1_named) + return -ENOENT; /* Can't specify an empty name */ if (!strlen(name)) return -EINVAL; @@ -1292,7 +1299,12 @@ static int __init cgroup_no_v1(char *str) if (!strcmp(token, "all")) { cgroup_no_v1_mask = U16_MAX; - break; + continue; + } + + if (!strcmp(token, "named")) { + cgroup_no_v1_named = true; + continue; } for_each_subsys(ss, i) { -- cgit v1.3-7-g2ca7 From 3d6357de8aa09e1966770dc1171c72679946464f Mon Sep 17 00:00:00 2001 From: Arun KS Date: Fri, 28 Dec 2018 00:34:20 -0800 Subject: mm: reference totalram_pages and managed_pages once per function Patch series "mm: convert totalram_pages, totalhigh_pages and managed pages to atomic", v5. This series converts totalram_pages, totalhigh_pages and zone->managed_pages to atomic variables. totalram_pages, zone->managed_pages and totalhigh_pages updates are protected by managed_page_count_lock, but readers never care about it. Convert these variables to atomic to avoid readers potentially seeing a store tear. Main motivation was that managed_page_count_lock handling was complicating things. It was discussed in length here, https://lore.kernel.org/patchwork/patch/995739/#1181785 It seemes better to remove the lock and convert variables to atomic. With the change, preventing poteintial store-to-read tearing comes as a bonus. This patch (of 4): This is in preparation to a later patch which converts totalram_pages and zone->managed_pages to atomic variables. Please note that re-reading the value might lead to a different value and as such it could lead to unexpected behavior. There are no known bugs as a result of the current code but it is better to prevent from them in principle. Link: http://lkml.kernel.org/r/1542090790-21750-2-git-send-email-arunks@codeaurora.org Signed-off-by: Arun KS Reviewed-by: Konstantin Khlebnikov Reviewed-by: David Hildenbrand Acked-by: Michal Hocko Acked-by: Vlastimil Babka Reviewed-by: Pavel Tatashin Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/um/kernel/mem.c | 2 +- arch/x86/kernel/cpu/microcode/core.c | 5 +++-- drivers/hv/hv_balloon.c | 19 ++++++++++--------- fs/file_table.c | 7 ++++--- kernel/fork.c | 5 +++-- kernel/kexec_core.c | 5 +++-- mm/page_alloc.c | 5 +++-- mm/shmem.c | 3 ++- net/dccp/proto.c | 7 ++++--- net/netfilter/nf_conntrack_core.c | 7 ++++--- net/netfilter/xt_hashlimit.c | 5 +++-- net/sctp/protocol.c | 7 ++++--- 12 files changed, 44 insertions(+), 33 deletions(-) (limited to 'kernel') diff --git a/arch/um/kernel/mem.c b/arch/um/kernel/mem.c index 1067469ba2ea..2da209687a22 100644 --- a/arch/um/kernel/mem.c +++ b/arch/um/kernel/mem.c @@ -52,7 +52,7 @@ void __init mem_init(void) /* this will put all low memory onto the freelists */ memblock_free_all(); max_low_pfn = totalram_pages; - max_pfn = totalram_pages; + max_pfn = max_low_pfn; mem_init_print_info(NULL); kmalloc_ok = 1; } diff --git a/arch/x86/kernel/cpu/microcode/core.c b/arch/x86/kernel/cpu/microcode/core.c index 2637ff09d6a0..168fa272cc3e 100644 --- a/arch/x86/kernel/cpu/microcode/core.c +++ b/arch/x86/kernel/cpu/microcode/core.c @@ -434,9 +434,10 @@ static ssize_t microcode_write(struct file *file, const char __user *buf, size_t len, loff_t *ppos) { ssize_t ret = -EINVAL; + unsigned long nr_pages = totalram_pages; - if ((len >> PAGE_SHIFT) > totalram_pages) { - pr_err("too much data (max %ld pages)\n", totalram_pages); + if ((len >> PAGE_SHIFT) > nr_pages) { + pr_err("too much data (max %ld pages)\n", nr_pages); return ret; } diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c index 41631512ae97..f3e7da981610 100644 --- a/drivers/hv/hv_balloon.c +++ b/drivers/hv/hv_balloon.c @@ -1090,6 +1090,7 @@ static void process_info(struct hv_dynmem_device *dm, struct dm_info_msg *msg) static unsigned long compute_balloon_floor(void) { unsigned long min_pages; + unsigned long nr_pages = totalram_pages; #define MB2PAGES(mb) ((mb) << (20 - PAGE_SHIFT)) /* Simple continuous piecewiese linear function: * max MiB -> min MiB gradient @@ -1102,16 +1103,16 @@ static unsigned long compute_balloon_floor(void) * 8192 744 (1/16) * 32768 1512 (1/32) */ - if (totalram_pages < MB2PAGES(128)) - min_pages = MB2PAGES(8) + (totalram_pages >> 1); - else if (totalram_pages < MB2PAGES(512)) - min_pages = MB2PAGES(40) + (totalram_pages >> 2); - else if (totalram_pages < MB2PAGES(2048)) - min_pages = MB2PAGES(104) + (totalram_pages >> 3); - else if (totalram_pages < MB2PAGES(8192)) - min_pages = MB2PAGES(232) + (totalram_pages >> 4); + if (nr_pages < MB2PAGES(128)) + min_pages = MB2PAGES(8) + (nr_pages >> 1); + else if (nr_pages < MB2PAGES(512)) + min_pages = MB2PAGES(40) + (nr_pages >> 2); + else if (nr_pages < MB2PAGES(2048)) + min_pages = MB2PAGES(104) + (nr_pages >> 3); + else if (nr_pages < MB2PAGES(8192)) + min_pages = MB2PAGES(232) + (nr_pages >> 4); else - min_pages = MB2PAGES(488) + (totalram_pages >> 5); + min_pages = MB2PAGES(488) + (nr_pages >> 5); #undef MB2PAGES return min_pages; } diff --git a/fs/file_table.c b/fs/file_table.c index e49af4caf15d..b6e9587f05c7 100644 --- a/fs/file_table.c +++ b/fs/file_table.c @@ -380,10 +380,11 @@ void __init files_init(void) void __init files_maxfiles_init(void) { unsigned long n; - unsigned long memreserve = (totalram_pages - nr_free_pages()) * 3/2; + unsigned long nr_pages = totalram_pages; + unsigned long memreserve = (nr_pages - nr_free_pages()) * 3/2; - memreserve = min(memreserve, totalram_pages - 1); - n = ((totalram_pages - memreserve) * (PAGE_SIZE / 1024)) / 10; + memreserve = min(memreserve, nr_pages - 1); + n = ((nr_pages - memreserve) * (PAGE_SIZE / 1024)) / 10; files_stat.max_files = max_t(unsigned long, n, NR_FILE); } diff --git a/kernel/fork.c b/kernel/fork.c index e2a5156bc9c3..8617a326e9f5 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -744,15 +744,16 @@ void __init __weak arch_task_cache_init(void) { } static void set_max_threads(unsigned int max_threads_suggested) { u64 threads; + unsigned long nr_pages = totalram_pages; /* * The number of threads shall be limited such that the thread * structures may only consume a small part of the available memory. */ - if (fls64(totalram_pages) + fls64(PAGE_SIZE) > 64) + if (fls64(nr_pages) + fls64(PAGE_SIZE) > 64) threads = MAX_THREADS; else - threads = div64_u64((u64) totalram_pages * (u64) PAGE_SIZE, + threads = div64_u64((u64) nr_pages * (u64) PAGE_SIZE, (u64) THREAD_SIZE * 8UL); if (threads > max_threads_suggested) diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c index 86ef06d3dbe3..7e967ca98d92 100644 --- a/kernel/kexec_core.c +++ b/kernel/kexec_core.c @@ -152,6 +152,7 @@ int sanity_check_segment_list(struct kimage *image) int i; unsigned long nr_segments = image->nr_segments; unsigned long total_pages = 0; + unsigned long nr_pages = totalram_pages; /* * Verify we have good destination addresses. The caller is @@ -217,13 +218,13 @@ int sanity_check_segment_list(struct kimage *image) * wasted allocating pages, which can cause a soft lockup. */ for (i = 0; i < nr_segments; i++) { - if (PAGE_COUNT(image->segment[i].memsz) > totalram_pages / 2) + if (PAGE_COUNT(image->segment[i].memsz) > nr_pages / 2) return -EINVAL; total_pages += PAGE_COUNT(image->segment[i].memsz); } - if (total_pages > totalram_pages / 2) + if (total_pages > nr_pages / 2) return -EINVAL; /* diff --git a/mm/page_alloc.c b/mm/page_alloc.c index c82c41edfe22..b79e79caea99 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -7257,6 +7257,7 @@ static void calculate_totalreserve_pages(void) for (i = 0; i < MAX_NR_ZONES; i++) { struct zone *zone = pgdat->node_zones + i; long max = 0; + unsigned long managed_pages = zone->managed_pages; /* Find valid and maximum lowmem_reserve in the zone */ for (j = i; j < MAX_NR_ZONES; j++) { @@ -7267,8 +7268,8 @@ static void calculate_totalreserve_pages(void) /* we treat the high watermark as reserved pages. */ max += high_wmark_pages(zone); - if (max > zone->managed_pages) - max = zone->managed_pages; + if (max > managed_pages) + max = managed_pages; pgdat->totalreserve_pages += max; diff --git a/mm/shmem.c b/mm/shmem.c index 375f3ac19bb8..b1f0f54470fb 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -114,7 +114,8 @@ static unsigned long shmem_default_max_blocks(void) static unsigned long shmem_default_max_inodes(void) { - return min(totalram_pages - totalhigh_pages, totalram_pages / 2); + unsigned long nr_pages = totalram_pages; + return min(nr_pages - totalhigh_pages, nr_pages / 2); } #endif diff --git a/net/dccp/proto.c b/net/dccp/proto.c index 2cc5fbb1b29e..ff727ff61b5b 100644 --- a/net/dccp/proto.c +++ b/net/dccp/proto.c @@ -1131,6 +1131,7 @@ EXPORT_SYMBOL_GPL(dccp_debug); static int __init dccp_init(void) { unsigned long goal; + unsigned long nr_pages = totalram_pages; int ehash_order, bhash_order, i; int rc; @@ -1157,10 +1158,10 @@ static int __init dccp_init(void) * * The methodology is similar to that of the buffer cache. */ - if (totalram_pages >= (128 * 1024)) - goal = totalram_pages >> (21 - PAGE_SHIFT); + if (nr_pages >= (128 * 1024)) + goal = nr_pages >> (21 - PAGE_SHIFT); else - goal = totalram_pages >> (23 - PAGE_SHIFT); + goal = nr_pages >> (23 - PAGE_SHIFT); if (thash_entries) goal = (thash_entries * diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c index e87c21e47efe..5eb990830348 100644 --- a/net/netfilter/nf_conntrack_core.c +++ b/net/netfilter/nf_conntrack_core.c @@ -2248,6 +2248,7 @@ static __always_inline unsigned int total_extension_size(void) int nf_conntrack_init_start(void) { + unsigned long nr_pages = totalram_pages; int max_factor = 8; int ret = -ENOMEM; int i; @@ -2267,11 +2268,11 @@ int nf_conntrack_init_start(void) * >= 4GB machines have 65536 buckets. */ nf_conntrack_htable_size - = (((totalram_pages << PAGE_SHIFT) / 16384) + = (((nr_pages << PAGE_SHIFT) / 16384) / sizeof(struct hlist_head)); - if (totalram_pages > (4 * (1024 * 1024 * 1024 / PAGE_SIZE))) + if (nr_pages > (4 * (1024 * 1024 * 1024 / PAGE_SIZE))) nf_conntrack_htable_size = 65536; - else if (totalram_pages > (1024 * 1024 * 1024 / PAGE_SIZE)) + else if (nr_pages > (1024 * 1024 * 1024 / PAGE_SIZE)) nf_conntrack_htable_size = 16384; if (nf_conntrack_htable_size < 32) nf_conntrack_htable_size = 32; diff --git a/net/netfilter/xt_hashlimit.c b/net/netfilter/xt_hashlimit.c index 28e27a32d9b9..88b520ba2abc 100644 --- a/net/netfilter/xt_hashlimit.c +++ b/net/netfilter/xt_hashlimit.c @@ -274,14 +274,15 @@ static int htable_create(struct net *net, struct hashlimit_cfg3 *cfg, struct xt_hashlimit_htable *hinfo; const struct seq_operations *ops; unsigned int size, i; + unsigned long nr_pages = totalram_pages; int ret; if (cfg->size) { size = cfg->size; } else { - size = (totalram_pages << PAGE_SHIFT) / 16384 / + size = (nr_pages << PAGE_SHIFT) / 16384 / sizeof(struct hlist_head); - if (totalram_pages > 1024 * 1024 * 1024 / PAGE_SIZE) + if (nr_pages > 1024 * 1024 * 1024 / PAGE_SIZE) size = 8192; if (size < 16) size = 16; diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c index 9b277bd36d1a..a5b24182b3cc 100644 --- a/net/sctp/protocol.c +++ b/net/sctp/protocol.c @@ -1368,6 +1368,7 @@ static __init int sctp_init(void) int status = -EINVAL; unsigned long goal; unsigned long limit; + unsigned long nr_pages = totalram_pages; int max_share; int order; int num_entries; @@ -1426,10 +1427,10 @@ static __init int sctp_init(void) * The methodology is similar to that of the tcp hash tables. * Though not identical. Start by getting a goal size */ - if (totalram_pages >= (128 * 1024)) - goal = totalram_pages >> (22 - PAGE_SHIFT); + if (nr_pages >= (128 * 1024)) + goal = nr_pages >> (22 - PAGE_SHIFT); else - goal = totalram_pages >> (24 - PAGE_SHIFT); + goal = nr_pages >> (24 - PAGE_SHIFT); /* Then compute the page order for said goal */ order = get_order(goal); -- cgit v1.3-7-g2ca7 From ca79b0c211af63fa3276f0e3fd7dd9ada2439839 Mon Sep 17 00:00:00 2001 From: Arun KS Date: Fri, 28 Dec 2018 00:34:29 -0800 Subject: mm: convert totalram_pages and totalhigh_pages variables to atomic totalram_pages and totalhigh_pages are made static inline function. Main motivation was that managed_page_count_lock handling was complicating things. It was discussed in length here, https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seemes better to remove the lock and convert variables to atomic, with preventing poteintial store-to-read tearing as a bonus. [akpm@linux-foundation.org: coding style fixes] Link: http://lkml.kernel.org/r/1542090790-21750-4-git-send-email-arunks@codeaurora.org Signed-off-by: Arun KS Suggested-by: Michal Hocko Suggested-by: Vlastimil Babka Reviewed-by: Konstantin Khlebnikov Reviewed-by: Pavel Tatashin Acked-by: Michal Hocko Acked-by: Vlastimil Babka Cc: David Hildenbrand Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/csky/mm/init.c | 4 ++-- arch/powerpc/platforms/pseries/cmm.c | 10 +++++----- arch/s390/mm/init.c | 2 +- arch/um/kernel/mem.c | 2 +- arch/x86/kernel/cpu/microcode/core.c | 2 +- drivers/char/agp/backend.c | 4 ++-- drivers/gpu/drm/i915/i915_gem.c | 2 +- drivers/gpu/drm/i915/selftests/i915_gem_gtt.c | 4 ++-- drivers/hv/hv_balloon.c | 2 +- drivers/md/dm-bufio.c | 2 +- drivers/md/dm-crypt.c | 2 +- drivers/md/dm-integrity.c | 2 +- drivers/md/dm-stats.c | 2 +- drivers/media/platform/mtk-vpu/mtk_vpu.c | 2 +- drivers/misc/vmw_balloon.c | 2 +- drivers/parisc/ccio-dma.c | 4 ++-- drivers/parisc/sba_iommu.c | 4 ++-- drivers/staging/android/ion/ion_system_heap.c | 2 +- drivers/xen/xen-selfballoon.c | 6 +++--- fs/ceph/super.h | 2 +- fs/file_table.c | 2 +- fs/fuse/inode.c | 2 +- fs/nfs/write.c | 2 +- fs/nfsd/nfscache.c | 2 +- fs/ntfs/malloc.h | 2 +- fs/proc/base.c | 2 +- include/linux/highmem.h | 28 +++++++++++++++++++++++++-- include/linux/mm.h | 27 +++++++++++++++++++++++++- include/linux/swap.h | 1 - kernel/fork.c | 2 +- kernel/kexec_core.c | 2 +- kernel/power/snapshot.c | 2 +- mm/highmem.c | 5 ++--- mm/huge_memory.c | 2 +- mm/kasan/quarantine.c | 2 +- mm/memblock.c | 4 ++-- mm/mm_init.c | 2 +- mm/oom_kill.c | 2 +- mm/page_alloc.c | 20 ++++++++++--------- mm/shmem.c | 9 +++++---- mm/slab.c | 2 +- mm/swap.c | 2 +- mm/util.c | 2 +- mm/vmalloc.c | 4 ++-- mm/workingset.c | 2 +- mm/zswap.c | 4 ++-- net/dccp/proto.c | 2 +- net/decnet/dn_route.c | 2 +- net/ipv4/tcp_metrics.c | 2 +- net/netfilter/nf_conntrack_core.c | 2 +- net/netfilter/xt_hashlimit.c | 2 +- net/sctp/protocol.c | 2 +- security/integrity/ima/ima_kexec.c | 2 +- 53 files changed, 131 insertions(+), 81 deletions(-) (limited to 'kernel') diff --git a/arch/csky/mm/init.c b/arch/csky/mm/init.c index dc07c078f9b8..66e597053488 100644 --- a/arch/csky/mm/init.c +++ b/arch/csky/mm/init.c @@ -71,7 +71,7 @@ void free_initrd_mem(unsigned long start, unsigned long end) ClearPageReserved(virt_to_page(start)); init_page_count(virt_to_page(start)); free_page(start); - totalram_pages++; + totalram_pages_inc(); } } #endif @@ -88,7 +88,7 @@ void free_initmem(void) ClearPageReserved(virt_to_page(addr)); init_page_count(virt_to_page(addr)); free_page(addr); - totalram_pages++; + totalram_pages_inc(); addr += PAGE_SIZE; } diff --git a/arch/powerpc/platforms/pseries/cmm.c b/arch/powerpc/platforms/pseries/cmm.c index 25427a48feae..e8d63a6a9002 100644 --- a/arch/powerpc/platforms/pseries/cmm.c +++ b/arch/powerpc/platforms/pseries/cmm.c @@ -208,7 +208,7 @@ static long cmm_alloc_pages(long nr) pa->page[pa->index++] = addr; loaned_pages++; - totalram_pages--; + totalram_pages_dec(); spin_unlock(&cmm_lock); nr--; } @@ -247,7 +247,7 @@ static long cmm_free_pages(long nr) free_page(addr); loaned_pages--; nr--; - totalram_pages++; + totalram_pages_inc(); } spin_unlock(&cmm_lock); cmm_dbg("End request with %ld pages unfulfilled\n", nr); @@ -291,7 +291,7 @@ static void cmm_get_mpp(void) int rc; struct hvcall_mpp_data mpp_data; signed long active_pages_target, page_loan_request, target; - signed long total_pages = totalram_pages + loaned_pages; + signed long total_pages = totalram_pages() + loaned_pages; signed long min_mem_pages = (min_mem_mb * 1024 * 1024) / PAGE_SIZE; rc = h_get_mpp(&mpp_data); @@ -322,7 +322,7 @@ static void cmm_get_mpp(void) cmm_dbg("delta = %ld, loaned = %lu, target = %lu, oom = %lu, totalram = %lu\n", page_loan_request, loaned_pages, loaned_pages_target, - oom_freed_pages, totalram_pages); + oom_freed_pages, totalram_pages()); } static struct notifier_block cmm_oom_nb = { @@ -581,7 +581,7 @@ static int cmm_mem_going_offline(void *arg) free_page(pa_curr->page[idx]); freed++; loaned_pages--; - totalram_pages++; + totalram_pages_inc(); pa_curr->page[idx] = pa_last->page[--pa_last->index]; if (pa_last->index == 0) { if (pa_curr == pa_last) diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c index 76d0708438e9..50388190b393 100644 --- a/arch/s390/mm/init.c +++ b/arch/s390/mm/init.c @@ -59,7 +59,7 @@ static void __init setup_zero_pages(void) order = 7; /* Limit number of empty zero pages for small memory sizes */ - while (order > 2 && (totalram_pages >> 10) < (1UL << order)) + while (order > 2 && (totalram_pages() >> 10) < (1UL << order)) order--; empty_zero_page = __get_free_pages(GFP_KERNEL | __GFP_ZERO, order); diff --git a/arch/um/kernel/mem.c b/arch/um/kernel/mem.c index 2da209687a22..8d21a83dd289 100644 --- a/arch/um/kernel/mem.c +++ b/arch/um/kernel/mem.c @@ -51,7 +51,7 @@ void __init mem_init(void) /* this will put all low memory onto the freelists */ memblock_free_all(); - max_low_pfn = totalram_pages; + max_low_pfn = totalram_pages(); max_pfn = max_low_pfn; mem_init_print_info(NULL); kmalloc_ok = 1; diff --git a/arch/x86/kernel/cpu/microcode/core.c b/arch/x86/kernel/cpu/microcode/core.c index 168fa272cc3e..97f9ada9ceda 100644 --- a/arch/x86/kernel/cpu/microcode/core.c +++ b/arch/x86/kernel/cpu/microcode/core.c @@ -434,7 +434,7 @@ static ssize_t microcode_write(struct file *file, const char __user *buf, size_t len, loff_t *ppos) { ssize_t ret = -EINVAL; - unsigned long nr_pages = totalram_pages; + unsigned long nr_pages = totalram_pages(); if ((len >> PAGE_SHIFT) > nr_pages) { pr_err("too much data (max %ld pages)\n", nr_pages); diff --git a/drivers/char/agp/backend.c b/drivers/char/agp/backend.c index 38ffb281df97..004a3ce8ba72 100644 --- a/drivers/char/agp/backend.c +++ b/drivers/char/agp/backend.c @@ -115,9 +115,9 @@ static int agp_find_max(void) long memory, index, result; #if PAGE_SHIFT < 20 - memory = totalram_pages >> (20 - PAGE_SHIFT); + memory = totalram_pages() >> (20 - PAGE_SHIFT); #else - memory = totalram_pages << (PAGE_SHIFT - 20); + memory = totalram_pages() << (PAGE_SHIFT - 20); #endif index = 1; diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c index d36a9755ad91..a9de07bb72c8 100644 --- a/drivers/gpu/drm/i915/i915_gem.c +++ b/drivers/gpu/drm/i915/i915_gem.c @@ -2559,7 +2559,7 @@ static int i915_gem_object_get_pages_gtt(struct drm_i915_gem_object *obj) * If there's no chance of allocating enough pages for the whole * object, bail early. */ - if (page_count > totalram_pages) + if (page_count > totalram_pages()) return -ENOMEM; st = kmalloc(sizeof(*st), GFP_KERNEL); diff --git a/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c b/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c index 69fe86b30fbb..a9ed0ecc94e2 100644 --- a/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c +++ b/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c @@ -170,7 +170,7 @@ static int igt_ppgtt_alloc(void *arg) * This should ensure that we do not run into the oomkiller during * the test and take down the machine wilfully. */ - limit = totalram_pages << PAGE_SHIFT; + limit = totalram_pages() << PAGE_SHIFT; limit = min(ppgtt->vm.total, limit); /* Check we can allocate the entire range */ @@ -1244,7 +1244,7 @@ static int exercise_mock(struct drm_i915_private *i915, u64 hole_start, u64 hole_end, unsigned long end_time)) { - const u64 limit = totalram_pages << PAGE_SHIFT; + const u64 limit = totalram_pages() << PAGE_SHIFT; struct i915_gem_context *ctx; struct i915_hw_ppgtt *ppgtt; IGT_TIMEOUT(end_time); diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c index f3e7da981610..5301fef16c31 100644 --- a/drivers/hv/hv_balloon.c +++ b/drivers/hv/hv_balloon.c @@ -1090,7 +1090,7 @@ static void process_info(struct hv_dynmem_device *dm, struct dm_info_msg *msg) static unsigned long compute_balloon_floor(void) { unsigned long min_pages; - unsigned long nr_pages = totalram_pages; + unsigned long nr_pages = totalram_pages(); #define MB2PAGES(mb) ((mb) << (20 - PAGE_SHIFT)) /* Simple continuous piecewiese linear function: * max MiB -> min MiB gradient diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c index dc385b70e4c3..8b0b628e5d1c 100644 --- a/drivers/md/dm-bufio.c +++ b/drivers/md/dm-bufio.c @@ -1887,7 +1887,7 @@ static int __init dm_bufio_init(void) dm_bufio_allocated_vmalloc = 0; dm_bufio_current_allocated = 0; - mem = (__u64)mult_frac(totalram_pages - totalhigh_pages, + mem = (__u64)mult_frac(totalram_pages() - totalhigh_pages(), DM_BUFIO_MEMORY_PERCENT, 100) << PAGE_SHIFT; if (mem > ULONG_MAX) diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c index a7195eb5b8d8..a8c32de29e3f 100644 --- a/drivers/md/dm-crypt.c +++ b/drivers/md/dm-crypt.c @@ -2158,7 +2158,7 @@ static int crypt_wipe_key(struct crypt_config *cc) static void crypt_calculate_pages_per_client(void) { - unsigned long pages = (totalram_pages - totalhigh_pages) * DM_CRYPT_MEMORY_PERCENT / 100; + unsigned long pages = (totalram_pages() - totalhigh_pages()) * DM_CRYPT_MEMORY_PERCENT / 100; if (!dm_crypt_clients_n) return; diff --git a/drivers/md/dm-integrity.c b/drivers/md/dm-integrity.c index d4ad0bfee251..62baa3214cc7 100644 --- a/drivers/md/dm-integrity.c +++ b/drivers/md/dm-integrity.c @@ -2843,7 +2843,7 @@ static int create_journal(struct dm_integrity_c *ic, char **error) journal_pages = roundup((__u64)ic->journal_sections * ic->journal_section_sectors, PAGE_SIZE >> SECTOR_SHIFT) >> (PAGE_SHIFT - SECTOR_SHIFT); journal_desc_size = journal_pages * sizeof(struct page_list); - if (journal_pages >= totalram_pages - totalhigh_pages || journal_desc_size > ULONG_MAX) { + if (journal_pages >= totalram_pages() - totalhigh_pages() || journal_desc_size > ULONG_MAX) { *error = "Journal doesn't fit into memory"; r = -ENOMEM; goto bad; diff --git a/drivers/md/dm-stats.c b/drivers/md/dm-stats.c index 21de30b4e2a1..45b92a3d9d8e 100644 --- a/drivers/md/dm-stats.c +++ b/drivers/md/dm-stats.c @@ -85,7 +85,7 @@ static bool __check_shared_memory(size_t alloc_size) a = shared_memory_amount + alloc_size; if (a < shared_memory_amount) return false; - if (a >> PAGE_SHIFT > totalram_pages / DM_STATS_MEMORY_FACTOR) + if (a >> PAGE_SHIFT > totalram_pages() / DM_STATS_MEMORY_FACTOR) return false; #ifdef CONFIG_MMU if (a > (VMALLOC_END - VMALLOC_START) / DM_STATS_VMALLOC_FACTOR) diff --git a/drivers/media/platform/mtk-vpu/mtk_vpu.c b/drivers/media/platform/mtk-vpu/mtk_vpu.c index 616f78b24a79..b6602490a247 100644 --- a/drivers/media/platform/mtk-vpu/mtk_vpu.c +++ b/drivers/media/platform/mtk-vpu/mtk_vpu.c @@ -855,7 +855,7 @@ static int mtk_vpu_probe(struct platform_device *pdev) /* Set PTCM to 96K and DTCM to 32K */ vpu_cfg_writel(vpu, 0x2, VPU_TCM_CFG); - vpu->enable_4GB = !!(totalram_pages > (SZ_2G >> PAGE_SHIFT)); + vpu->enable_4GB = !!(totalram_pages() > (SZ_2G >> PAGE_SHIFT)); dev_info(dev, "4GB mode %u\n", vpu->enable_4GB); if (vpu->enable_4GB) { diff --git a/drivers/misc/vmw_balloon.c b/drivers/misc/vmw_balloon.c index 9b0b3fa4f836..e6126a4b95d3 100644 --- a/drivers/misc/vmw_balloon.c +++ b/drivers/misc/vmw_balloon.c @@ -570,7 +570,7 @@ static int vmballoon_send_get_target(struct vmballoon *b) unsigned long status; unsigned long limit; - limit = totalram_pages; + limit = totalram_pages(); /* Ensure limit fits in 32-bits */ if (limit != (u32)limit) diff --git a/drivers/parisc/ccio-dma.c b/drivers/parisc/ccio-dma.c index 701a7d6a74d5..358e380eb7fa 100644 --- a/drivers/parisc/ccio-dma.c +++ b/drivers/parisc/ccio-dma.c @@ -1251,7 +1251,7 @@ ccio_ioc_init(struct ioc *ioc) ** Hot-Plug/Removal of PCI cards. (aka PCI OLARD). */ - iova_space_size = (u32) (totalram_pages / count_parisc_driver(&ccio_driver)); + iova_space_size = (u32) (totalram_pages() / count_parisc_driver(&ccio_driver)); /* limit IOVA space size to 1MB-1GB */ @@ -1290,7 +1290,7 @@ ccio_ioc_init(struct ioc *ioc) DBG_INIT("%s() hpa 0x%p mem %luMB IOV %dMB (%d bits)\n", __func__, ioc->ioc_regs, - (unsigned long) totalram_pages >> (20 - PAGE_SHIFT), + (unsigned long) totalram_pages() >> (20 - PAGE_SHIFT), iova_space_size>>20, iov_order + PAGE_SHIFT); diff --git a/drivers/parisc/sba_iommu.c b/drivers/parisc/sba_iommu.c index c1e599a429af..e0655949480a 100644 --- a/drivers/parisc/sba_iommu.c +++ b/drivers/parisc/sba_iommu.c @@ -1414,7 +1414,7 @@ sba_ioc_init(struct parisc_device *sba, struct ioc *ioc, int ioc_num) ** for DMA hints - ergo only 30 bits max. */ - iova_space_size = (u32) (totalram_pages/global_ioc_cnt); + iova_space_size = (u32) (totalram_pages()/global_ioc_cnt); /* limit IOVA space size to 1MB-1GB */ if (iova_space_size < (1 << (20 - PAGE_SHIFT))) { @@ -1439,7 +1439,7 @@ sba_ioc_init(struct parisc_device *sba, struct ioc *ioc, int ioc_num) DBG_INIT("%s() hpa 0x%lx mem %ldMB IOV %dMB (%d bits)\n", __func__, ioc->ioc_hpa, - (unsigned long) totalram_pages >> (20 - PAGE_SHIFT), + (unsigned long) totalram_pages() >> (20 - PAGE_SHIFT), iova_space_size>>20, iov_order + PAGE_SHIFT); diff --git a/drivers/staging/android/ion/ion_system_heap.c b/drivers/staging/android/ion/ion_system_heap.c index 548bb02c0ca6..6cb0eebdff89 100644 --- a/drivers/staging/android/ion/ion_system_heap.c +++ b/drivers/staging/android/ion/ion_system_heap.c @@ -110,7 +110,7 @@ static int ion_system_heap_allocate(struct ion_heap *heap, unsigned long size_remaining = PAGE_ALIGN(size); unsigned int max_order = orders[0]; - if (size / PAGE_SIZE > totalram_pages / 2) + if (size / PAGE_SIZE > totalram_pages() / 2) return -ENOMEM; INIT_LIST_HEAD(&pages); diff --git a/drivers/xen/xen-selfballoon.c b/drivers/xen/xen-selfballoon.c index 5165aa82bf7d..246f6122c9ee 100644 --- a/drivers/xen/xen-selfballoon.c +++ b/drivers/xen/xen-selfballoon.c @@ -189,7 +189,7 @@ static void selfballoon_process(struct work_struct *work) bool reset_timer = false; if (xen_selfballooning_enabled) { - cur_pages = totalram_pages; + cur_pages = totalram_pages(); tgt_pages = cur_pages; /* default is no change */ goal_pages = vm_memory_committed() + totalreserve_pages + @@ -227,7 +227,7 @@ static void selfballoon_process(struct work_struct *work) if (tgt_pages < floor_pages) tgt_pages = floor_pages; balloon_set_new_target(tgt_pages + - balloon_stats.current_pages - totalram_pages); + balloon_stats.current_pages - totalram_pages()); reset_timer = true; } #ifdef CONFIG_FRONTSWAP @@ -569,7 +569,7 @@ int xen_selfballoon_init(bool use_selfballooning, bool use_frontswap_selfshrink) * much more reliably and response faster in some cases. */ if (!selfballoon_reserved_mb) { - reserve_pages = totalram_pages / 10; + reserve_pages = totalram_pages() / 10; selfballoon_reserved_mb = PAGES2MB(reserve_pages); } schedule_delayed_work(&selfballoon_worker, selfballoon_interval * HZ); diff --git a/fs/ceph/super.h b/fs/ceph/super.h index 79a265ba9200..dfb64a5211b6 100644 --- a/fs/ceph/super.h +++ b/fs/ceph/super.h @@ -810,7 +810,7 @@ static inline int default_congestion_kb(void) * This allows larger machines to have larger/more transfers. * Limit the default to 256M */ - congestion_kb = (16*int_sqrt(totalram_pages)) << (PAGE_SHIFT-10); + congestion_kb = (16*int_sqrt(totalram_pages())) << (PAGE_SHIFT-10); if (congestion_kb > 256*1024) congestion_kb = 256*1024; diff --git a/fs/file_table.c b/fs/file_table.c index b6e9587f05c7..5679e7fcb6b0 100644 --- a/fs/file_table.c +++ b/fs/file_table.c @@ -380,7 +380,7 @@ void __init files_init(void) void __init files_maxfiles_init(void) { unsigned long n; - unsigned long nr_pages = totalram_pages; + unsigned long nr_pages = totalram_pages(); unsigned long memreserve = (nr_pages - nr_free_pages()) * 3/2; memreserve = min(memreserve, nr_pages - 1); diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c index 568abed20eb2..76baaa6be393 100644 --- a/fs/fuse/inode.c +++ b/fs/fuse/inode.c @@ -824,7 +824,7 @@ static const struct super_operations fuse_super_operations = { static void sanitize_global_limit(unsigned *limit) { if (*limit == 0) - *limit = ((totalram_pages << PAGE_SHIFT) >> 13) / + *limit = ((totalram_pages() << PAGE_SHIFT) >> 13) / sizeof(struct fuse_req); if (*limit >= 1 << 16) diff --git a/fs/nfs/write.c b/fs/nfs/write.c index 586726a590d8..4f15665f0ad1 100644 --- a/fs/nfs/write.c +++ b/fs/nfs/write.c @@ -2121,7 +2121,7 @@ int __init nfs_init_writepagecache(void) * This allows larger machines to have larger/more transfers. * Limit the default to 256M */ - nfs_congestion_kb = (16*int_sqrt(totalram_pages)) << (PAGE_SHIFT-10); + nfs_congestion_kb = (16*int_sqrt(totalram_pages())) << (PAGE_SHIFT-10); if (nfs_congestion_kb > 256*1024) nfs_congestion_kb = 256*1024; diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c index e2fe0e9ce0df..da52b594362a 100644 --- a/fs/nfsd/nfscache.c +++ b/fs/nfsd/nfscache.c @@ -99,7 +99,7 @@ static unsigned int nfsd_cache_size_limit(void) { unsigned int limit; - unsigned long low_pages = totalram_pages - totalhigh_pages; + unsigned long low_pages = totalram_pages() - totalhigh_pages(); limit = (16 * int_sqrt(low_pages)) << (PAGE_SHIFT-10); return min_t(unsigned int, limit, 256*1024); diff --git a/fs/ntfs/malloc.h b/fs/ntfs/malloc.h index ab172e5f51d9..5becc8acc8f4 100644 --- a/fs/ntfs/malloc.h +++ b/fs/ntfs/malloc.h @@ -47,7 +47,7 @@ static inline void *__ntfs_malloc(unsigned long size, gfp_t gfp_mask) return kmalloc(PAGE_SIZE, gfp_mask & ~__GFP_HIGHMEM); /* return (void *)__get_free_page(gfp_mask); */ } - if (likely((size >> PAGE_SHIFT) < totalram_pages)) + if (likely((size >> PAGE_SHIFT) < totalram_pages())) return __vmalloc(size, gfp_mask, PAGE_KERNEL); return NULL; } diff --git a/fs/proc/base.c b/fs/proc/base.c index ce3465479447..d7fd1ca807d2 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -530,7 +530,7 @@ static const struct file_operations proc_lstats_operations = { static int proc_oom_score(struct seq_file *m, struct pid_namespace *ns, struct pid *pid, struct task_struct *task) { - unsigned long totalpages = totalram_pages + total_swap_pages; + unsigned long totalpages = totalram_pages() + total_swap_pages; unsigned long points = 0; points = oom_badness(task, NULL, NULL, totalpages) * diff --git a/include/linux/highmem.h b/include/linux/highmem.h index 0690679832d4..ea5cdbd8c2c3 100644 --- a/include/linux/highmem.h +++ b/include/linux/highmem.h @@ -36,7 +36,31 @@ static inline void invalidate_kernel_vmap_range(void *vaddr, int size) /* declarations for linux/mm/highmem.c */ unsigned int nr_free_highpages(void); -extern unsigned long totalhigh_pages; +extern atomic_long_t _totalhigh_pages; +static inline unsigned long totalhigh_pages(void) +{ + return (unsigned long)atomic_long_read(&_totalhigh_pages); +} + +static inline void totalhigh_pages_inc(void) +{ + atomic_long_inc(&_totalhigh_pages); +} + +static inline void totalhigh_pages_dec(void) +{ + atomic_long_dec(&_totalhigh_pages); +} + +static inline void totalhigh_pages_add(long count) +{ + atomic_long_add(count, &_totalhigh_pages); +} + +static inline void totalhigh_pages_set(long val) +{ + atomic_long_set(&_totalhigh_pages, val); +} void kmap_flush_unused(void); @@ -51,7 +75,7 @@ static inline struct page *kmap_to_page(void *addr) return virt_to_page(addr); } -#define totalhigh_pages 0UL +static inline unsigned long totalhigh_pages(void) { return 0UL; } #ifndef ARCH_HAS_KMAP static inline void *kmap(struct page *page) diff --git a/include/linux/mm.h b/include/linux/mm.h index b4d01969e700..1d2be4c2d34a 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -48,7 +48,32 @@ static inline void set_max_mapnr(unsigned long limit) static inline void set_max_mapnr(unsigned long limit) { } #endif -extern unsigned long totalram_pages; +extern atomic_long_t _totalram_pages; +static inline unsigned long totalram_pages(void) +{ + return (unsigned long)atomic_long_read(&_totalram_pages); +} + +static inline void totalram_pages_inc(void) +{ + atomic_long_inc(&_totalram_pages); +} + +static inline void totalram_pages_dec(void) +{ + atomic_long_dec(&_totalram_pages); +} + +static inline void totalram_pages_add(long count) +{ + atomic_long_add(count, &_totalram_pages); +} + +static inline void totalram_pages_set(long val) +{ + atomic_long_set(&_totalram_pages, val); +} + extern void * high_memory; extern int page_cluster; diff --git a/include/linux/swap.h b/include/linux/swap.h index a8f6d5d89524..77459d695010 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -310,7 +310,6 @@ void workingset_update_node(struct xa_node *node); } while (0) /* linux/mm/page_alloc.c */ -extern unsigned long totalram_pages; extern unsigned long totalreserve_pages; extern unsigned long nr_free_buffer_pages(void); extern unsigned long nr_free_pagecache_pages(void); diff --git a/kernel/fork.c b/kernel/fork.c index 8617a326e9f5..c979605fe806 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -744,7 +744,7 @@ void __init __weak arch_task_cache_init(void) { } static void set_max_threads(unsigned int max_threads_suggested) { u64 threads; - unsigned long nr_pages = totalram_pages; + unsigned long nr_pages = totalram_pages(); /* * The number of threads shall be limited such that the thread diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c index 7e967ca98d92..d7140447be75 100644 --- a/kernel/kexec_core.c +++ b/kernel/kexec_core.c @@ -152,7 +152,7 @@ int sanity_check_segment_list(struct kimage *image) int i; unsigned long nr_segments = image->nr_segments; unsigned long total_pages = 0; - unsigned long nr_pages = totalram_pages; + unsigned long nr_pages = totalram_pages(); /* * Verify we have good destination addresses. The caller is diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c index b0308a2c6000..640b2034edd6 100644 --- a/kernel/power/snapshot.c +++ b/kernel/power/snapshot.c @@ -105,7 +105,7 @@ unsigned long image_size; void __init hibernate_image_size_init(void) { - image_size = ((totalram_pages * 2) / 5) * PAGE_SIZE; + image_size = ((totalram_pages() * 2) / 5) * PAGE_SIZE; } /* diff --git a/mm/highmem.c b/mm/highmem.c index 59db3223a5d6..107b10f9878e 100644 --- a/mm/highmem.c +++ b/mm/highmem.c @@ -105,9 +105,8 @@ static inline wait_queue_head_t *get_pkmap_wait_queue_head(unsigned int color) } #endif -unsigned long totalhigh_pages __read_mostly; -EXPORT_SYMBOL(totalhigh_pages); - +atomic_long_t _totalhigh_pages __read_mostly; +EXPORT_SYMBOL(_totalhigh_pages); EXPORT_PER_CPU_SYMBOL(__kmap_atomic_idx); diff --git a/mm/huge_memory.c b/mm/huge_memory.c index e84a10b0d310..da6682bb69aa 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -420,7 +420,7 @@ static int __init hugepage_init(void) * where the extra memory used could hurt more than TLB overhead * is likely to save. The admin can still enable it through /sys. */ - if (totalram_pages < (512 << (20 - PAGE_SHIFT))) { + if (totalram_pages() < (512 << (20 - PAGE_SHIFT))) { transparent_hugepage_flags = 0; return 0; } diff --git a/mm/kasan/quarantine.c b/mm/kasan/quarantine.c index 57334ef2d7ef..978bc4a3eb51 100644 --- a/mm/kasan/quarantine.c +++ b/mm/kasan/quarantine.c @@ -237,7 +237,7 @@ void quarantine_reduce(void) * Update quarantine size in case of hotplug. Allocate a fraction of * the installed memory to quarantine minus per-cpu queue limits. */ - total_size = (READ_ONCE(totalram_pages) << PAGE_SHIFT) / + total_size = (totalram_pages() << PAGE_SHIFT) / QUARANTINE_FRACTION; percpu_quarantines = QUARANTINE_PERCPU_SIZE * num_online_cpus(); new_quarantine_size = (total_size < percpu_quarantines) ? diff --git a/mm/memblock.c b/mm/memblock.c index 0068f87af1e8..a53d8697612c 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -1576,7 +1576,7 @@ void __init __memblock_free_late(phys_addr_t base, phys_addr_t size) for (; cursor < end; cursor++) { memblock_free_pages(pfn_to_page(cursor), cursor, 0); - totalram_pages++; + totalram_pages_inc(); } } @@ -1978,7 +1978,7 @@ unsigned long __init memblock_free_all(void) reset_all_zones_managed_pages(); pages = free_low_memory_core_early(); - totalram_pages += pages; + totalram_pages_add(pages); return pages; } diff --git a/mm/mm_init.c b/mm/mm_init.c index 6838a530789b..33917105a3a2 100644 --- a/mm/mm_init.c +++ b/mm/mm_init.c @@ -146,7 +146,7 @@ static void __meminit mm_compute_batch(void) s32 batch = max_t(s32, nr*2, 32); /* batch size set to 0.4% of (total memory/#cpus), or max int32 */ - memsized_batch = min_t(u64, (totalram_pages/nr)/256, 0x7fffffff); + memsized_batch = min_t(u64, (totalram_pages()/nr)/256, 0x7fffffff); vm_committed_as_batch = max_t(s32, memsized_batch, batch); } diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 6589f60d5018..21d487749e1d 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -269,7 +269,7 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc) } /* Default to all available memory */ - oc->totalpages = totalram_pages + total_swap_pages; + oc->totalpages = totalram_pages() + total_swap_pages; if (!IS_ENABLED(CONFIG_NUMA)) return CONSTRAINT_NONE; diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 4b5c4ff68f18..eb2027892ef9 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -16,6 +16,7 @@ #include #include +#include #include #include #include @@ -124,7 +125,8 @@ EXPORT_SYMBOL(node_states); /* Protect totalram_pages and zone->managed_pages */ static DEFINE_SPINLOCK(managed_page_count_lock); -unsigned long totalram_pages __read_mostly; +atomic_long_t _totalram_pages __read_mostly; +EXPORT_SYMBOL(_totalram_pages); unsigned long totalreserve_pages __read_mostly; unsigned long totalcma_pages __read_mostly; @@ -4747,11 +4749,11 @@ EXPORT_SYMBOL_GPL(si_mem_available); void si_meminfo(struct sysinfo *val) { - val->totalram = totalram_pages; + val->totalram = totalram_pages(); val->sharedram = global_node_page_state(NR_SHMEM); val->freeram = global_zone_page_state(NR_FREE_PAGES); val->bufferram = nr_blockdev_pages(); - val->totalhigh = totalhigh_pages; + val->totalhigh = totalhigh_pages(); val->freehigh = nr_free_highpages(); val->mem_unit = PAGE_SIZE; } @@ -7077,10 +7079,10 @@ void adjust_managed_page_count(struct page *page, long count) { spin_lock(&managed_page_count_lock); atomic_long_add(count, &page_zone(page)->managed_pages); - totalram_pages += count; + totalram_pages_add(count); #ifdef CONFIG_HIGHMEM if (PageHighMem(page)) - totalhigh_pages += count; + totalhigh_pages_add(count); #endif spin_unlock(&managed_page_count_lock); } @@ -7123,9 +7125,9 @@ EXPORT_SYMBOL(free_reserved_area); void free_highmem_page(struct page *page) { __free_reserved_page(page); - totalram_pages++; + totalram_pages_inc(); atomic_long_inc(&page_zone(page)->managed_pages); - totalhigh_pages++; + totalhigh_pages_inc(); } #endif @@ -7174,10 +7176,10 @@ void __init mem_init_print_info(const char *str) physpages << (PAGE_SHIFT - 10), codesize >> 10, datasize >> 10, rosize >> 10, (init_data_size + init_code_size) >> 10, bss_size >> 10, - (physpages - totalram_pages - totalcma_pages) << (PAGE_SHIFT - 10), + (physpages - totalram_pages() - totalcma_pages) << (PAGE_SHIFT - 10), totalcma_pages << (PAGE_SHIFT - 10), #ifdef CONFIG_HIGHMEM - totalhigh_pages << (PAGE_SHIFT - 10), + totalhigh_pages() << (PAGE_SHIFT - 10), #endif str ? ", " : "", str ? str : ""); } diff --git a/mm/shmem.c b/mm/shmem.c index b1f0f54470fb..6ece1e2fe76e 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -109,13 +109,14 @@ struct shmem_falloc { #ifdef CONFIG_TMPFS static unsigned long shmem_default_max_blocks(void) { - return totalram_pages / 2; + return totalram_pages() / 2; } static unsigned long shmem_default_max_inodes(void) { - unsigned long nr_pages = totalram_pages; - return min(nr_pages - totalhigh_pages, nr_pages / 2); + unsigned long nr_pages = totalram_pages(); + + return min(nr_pages - totalhigh_pages(), nr_pages / 2); } #endif @@ -3302,7 +3303,7 @@ static int shmem_parse_options(char *options, struct shmem_sb_info *sbinfo, size = memparse(value,&rest); if (*rest == '%') { size <<= PAGE_SHIFT; - size *= totalram_pages; + size *= totalram_pages(); do_div(size, 100); rest++; } diff --git a/mm/slab.c b/mm/slab.c index 01991060714c..73fe23e649c9 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -1235,7 +1235,7 @@ void __init kmem_cache_init(void) * page orders on machines with more than 32MB of memory if * not overridden on the command line. */ - if (!slab_max_order_set && totalram_pages > (32 << 20) >> PAGE_SHIFT) + if (!slab_max_order_set && totalram_pages() > (32 << 20) >> PAGE_SHIFT) slab_max_order = SLAB_MAX_ORDER_HI; /* Bootstrap is tricky, because several objects are allocated diff --git a/mm/swap.c b/mm/swap.c index 5d786019eab9..4d8a1f1afaab 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -1022,7 +1022,7 @@ EXPORT_SYMBOL(pagevec_lookup_range_nr_tag); */ void __init swap_setup(void) { - unsigned long megs = totalram_pages >> (20 - PAGE_SHIFT); + unsigned long megs = totalram_pages() >> (20 - PAGE_SHIFT); /* Use a smaller cluster for small-memory machines */ if (megs < 16) diff --git a/mm/util.c b/mm/util.c index 8bf08b5b5760..4df23d64aac7 100644 --- a/mm/util.c +++ b/mm/util.c @@ -593,7 +593,7 @@ unsigned long vm_commit_limit(void) if (sysctl_overcommit_kbytes) allowed = sysctl_overcommit_kbytes >> (PAGE_SHIFT - 10); else - allowed = ((totalram_pages - hugetlb_total_pages()) + allowed = ((totalram_pages() - hugetlb_total_pages()) * sysctl_overcommit_ratio / 100); allowed += total_swap_pages; diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 97d4b25d0373..871e41c55e23 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -1634,7 +1634,7 @@ void *vmap(struct page **pages, unsigned int count, might_sleep(); - if (count > totalram_pages) + if (count > totalram_pages()) return NULL; size = (unsigned long)count << PAGE_SHIFT; @@ -1739,7 +1739,7 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align, unsigned long real_size = size; size = PAGE_ALIGN(size); - if (!size || (size >> PAGE_SHIFT) > totalram_pages) + if (!size || (size >> PAGE_SHIFT) > totalram_pages()) goto fail; area = __get_vm_area_node(size, align, VM_ALLOC | VM_UNINITIALIZED | diff --git a/mm/workingset.c b/mm/workingset.c index d46f8c92aa2f..dcb994f2acc2 100644 --- a/mm/workingset.c +++ b/mm/workingset.c @@ -549,7 +549,7 @@ static int __init workingset_init(void) * double the initial memory by using totalram_pages as-is. */ timestamp_bits = BITS_PER_LONG - EVICTION_SHIFT; - max_order = fls_long(totalram_pages - 1); + max_order = fls_long(totalram_pages() - 1); if (max_order > timestamp_bits) bucket_order = max_order - timestamp_bits; pr_info("workingset: timestamp_bits=%d max_order=%d bucket_order=%u\n", diff --git a/mm/zswap.c b/mm/zswap.c index cd91fd9d96b8..a4e4d36ec085 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -219,8 +219,8 @@ static const struct zpool_ops zswap_zpool_ops = { static bool zswap_is_full(void) { - return totalram_pages * zswap_max_pool_percent / 100 < - DIV_ROUND_UP(zswap_pool_total_size, PAGE_SIZE); + return totalram_pages() * zswap_max_pool_percent / 100 < + DIV_ROUND_UP(zswap_pool_total_size, PAGE_SIZE); } static void zswap_update_total_size(void) diff --git a/net/dccp/proto.c b/net/dccp/proto.c index ff727ff61b5b..0e2f71ab8367 100644 --- a/net/dccp/proto.c +++ b/net/dccp/proto.c @@ -1131,7 +1131,7 @@ EXPORT_SYMBOL_GPL(dccp_debug); static int __init dccp_init(void) { unsigned long goal; - unsigned long nr_pages = totalram_pages; + unsigned long nr_pages = totalram_pages(); int ehash_order, bhash_order, i; int rc; diff --git a/net/decnet/dn_route.c b/net/decnet/dn_route.c index 1c002c0fb712..950613ee7881 100644 --- a/net/decnet/dn_route.c +++ b/net/decnet/dn_route.c @@ -1866,7 +1866,7 @@ void __init dn_route_init(void) dn_route_timer.expires = jiffies + decnet_dst_gc_interval * HZ; add_timer(&dn_route_timer); - goal = totalram_pages >> (26 - PAGE_SHIFT); + goal = totalram_pages() >> (26 - PAGE_SHIFT); for(order = 0; (1UL << order) < goal; order++) /* NOTHING */; diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c index 03b51cdcc731..b467a7cabf40 100644 --- a/net/ipv4/tcp_metrics.c +++ b/net/ipv4/tcp_metrics.c @@ -1000,7 +1000,7 @@ static int __net_init tcp_net_metrics_init(struct net *net) slots = tcpmhash_entries; if (!slots) { - if (totalram_pages >= 128 * 1024) + if (totalram_pages() >= 128 * 1024) slots = 16 * 1024; else slots = 8 * 1024; diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c index 5eb990830348..741b533148ba 100644 --- a/net/netfilter/nf_conntrack_core.c +++ b/net/netfilter/nf_conntrack_core.c @@ -2248,7 +2248,7 @@ static __always_inline unsigned int total_extension_size(void) int nf_conntrack_init_start(void) { - unsigned long nr_pages = totalram_pages; + unsigned long nr_pages = totalram_pages(); int max_factor = 8; int ret = -ENOMEM; int i; diff --git a/net/netfilter/xt_hashlimit.c b/net/netfilter/xt_hashlimit.c index 88b520ba2abc..8d86e39d6280 100644 --- a/net/netfilter/xt_hashlimit.c +++ b/net/netfilter/xt_hashlimit.c @@ -274,7 +274,7 @@ static int htable_create(struct net *net, struct hashlimit_cfg3 *cfg, struct xt_hashlimit_htable *hinfo; const struct seq_operations *ops; unsigned int size, i; - unsigned long nr_pages = totalram_pages; + unsigned long nr_pages = totalram_pages(); int ret; if (cfg->size) { diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c index a5b24182b3cc..d5878ae55840 100644 --- a/net/sctp/protocol.c +++ b/net/sctp/protocol.c @@ -1368,7 +1368,7 @@ static __init int sctp_init(void) int status = -EINVAL; unsigned long goal; unsigned long limit; - unsigned long nr_pages = totalram_pages; + unsigned long nr_pages = totalram_pages(); int max_share; int order; int num_entries; diff --git a/security/integrity/ima/ima_kexec.c b/security/integrity/ima/ima_kexec.c index 16bd18747cfa..d6f32807b347 100644 --- a/security/integrity/ima/ima_kexec.c +++ b/security/integrity/ima/ima_kexec.c @@ -106,7 +106,7 @@ void ima_add_kexec_buffer(struct kimage *image) kexec_segment_size = ALIGN(ima_get_binary_runtime_size() + PAGE_SIZE / 2, PAGE_SIZE); if ((kexec_segment_size == ULONG_MAX) || - ((kexec_segment_size >> PAGE_SHIFT) > totalram_pages / 2)) { + ((kexec_segment_size >> PAGE_SHIFT) > totalram_pages() / 2)) { pr_err("Binary measurement list too large.\n"); return; } -- cgit v1.3-7-g2ca7 From 808153e1187fa77ac7d7dad261ff476888dcf398 Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Fri, 28 Dec 2018 00:34:50 -0800 Subject: mm, devm_memremap_pages: mark devm_memremap_pages() EXPORT_SYMBOL_GPL MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit devm_memremap_pages() is a facility that can create struct page entries for any arbitrary range and give drivers the ability to subvert core aspects of page management. Specifically the facility is tightly integrated with the kernel's memory hotplug functionality. It injects an altmap argument deep into the architecture specific vmemmap implementation to allow allocating from specific reserved pages, and it has Linux specific assumptions about page structure reference counting relative to get_user_pages() and get_user_pages_fast(). It was an oversight and a mistake that this was not marked EXPORT_SYMBOL_GPL from the outset. Again, devm_memremap_pagex() exposes and relies upon core kernel internal assumptions and will continue to evolve along with 'struct page', memory hotplug, and support for new memory types / topologies. Only an in-kernel GPL-only driver is expected to keep up with this ongoing evolution. This interface, and functionality derived from this interface, is not suitable for kernel-external drivers. Link: http://lkml.kernel.org/r/154275557457.76910.16923571232582744134.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams Reviewed-by: Christoph Hellwig Acked-by: Michal Hocko Cc: "Jérôme Glisse" Cc: Balbir Singh Cc: Logan Gunthorpe Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/memremap.c | 2 +- tools/testing/nvdimm/test/iomap.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/memremap.c b/kernel/memremap.c index 9eced2cc9f94..61dbcaa95530 100644 --- a/kernel/memremap.c +++ b/kernel/memremap.c @@ -233,7 +233,7 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap) err_array: return ERR_PTR(error); } -EXPORT_SYMBOL(devm_memremap_pages); +EXPORT_SYMBOL_GPL(devm_memremap_pages); unsigned long vmem_altmap_offset(struct vmem_altmap *altmap) { diff --git a/tools/testing/nvdimm/test/iomap.c b/tools/testing/nvdimm/test/iomap.c index ff9d3a5825e1..ed18a0cbc0c8 100644 --- a/tools/testing/nvdimm/test/iomap.c +++ b/tools/testing/nvdimm/test/iomap.c @@ -113,7 +113,7 @@ void *__wrap_devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap) return nfit_res->buf + offset - nfit_res->res.start; return devm_memremap_pages(dev, pgmap); } -EXPORT_SYMBOL(__wrap_devm_memremap_pages); +EXPORT_SYMBOL_GPL(__wrap_devm_memremap_pages); pfn_t __wrap_phys_to_pfn_t(phys_addr_t addr, unsigned long flags) { -- cgit v1.3-7-g2ca7 From 06489cfbd915ff36c8e36df27f1c2dc60f97ca56 Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Fri, 28 Dec 2018 00:34:54 -0800 Subject: mm, devm_memremap_pages: kill mapping "System RAM" support MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Given the fact that devm_memremap_pages() requires a percpu_ref that is torn down by devm_memremap_pages_release() the current support for mapping RAM is broken. Support for remapping "System RAM" has been broken since the beginning and there is no existing user of this this code path, so just kill the support and make it an explicit error. This cleanup also simplifies a follow-on patch to fix the error path when setting a devm release action for devm_memremap_pages_release() fails. Link: http://lkml.kernel.org/r/154275557997.76910.14689813630968180480.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams Reviewed-by: "Jérôme Glisse" Reviewed-by: Christoph Hellwig Reviewed-by: Logan Gunthorpe Cc: Balbir Singh Cc: Michal Hocko Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/memremap.c | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/memremap.c b/kernel/memremap.c index 61dbcaa95530..99d14940acfa 100644 --- a/kernel/memremap.c +++ b/kernel/memremap.c @@ -167,15 +167,12 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap) is_ram = region_intersects(align_start, align_size, IORESOURCE_SYSTEM_RAM, IORES_DESC_NONE); - if (is_ram == REGION_MIXED) { - WARN_ONCE(1, "%s attempted on mixed region %pr\n", - __func__, res); + if (is_ram != REGION_DISJOINT) { + WARN_ONCE(1, "%s attempted on %s region %pr\n", __func__, + is_ram == REGION_MIXED ? "mixed" : "ram", res); return ERR_PTR(-ENXIO); } - if (is_ram == REGION_INTERSECTS) - return __va(res->start); - if (!pgmap->ref) return ERR_PTR(-EINVAL); -- cgit v1.3-7-g2ca7 From a95c90f1e2c253b280385ecf3d4ebfe476926b28 Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Fri, 28 Dec 2018 00:34:57 -0800 Subject: mm, devm_memremap_pages: fix shutdown handling MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The last step before devm_memremap_pages() returns success is to allocate a release action, devm_memremap_pages_release(), to tear the entire setup down. However, the result from devm_add_action() is not checked. Checking the error from devm_add_action() is not enough. The api currently relies on the fact that the percpu_ref it is using is killed by the time the devm_memremap_pages_release() is run. Rather than continue this awkward situation, offload the responsibility of killing the percpu_ref to devm_memremap_pages_release() directly. This allows devm_memremap_pages() to do the right thing relative to init failures and shutdown. Without this change we could fail to register the teardown of devm_memremap_pages(). The likelihood of hitting this failure is tiny as small memory allocations almost always succeed. However, the impact of the failure is large given any future reconfiguration, or disable/enable, of an nvdimm namespace will fail forever as subsequent calls to devm_memremap_pages() will fail to setup the pgmap_radix since there will be stale entries for the physical address range. An argument could be made to require that the ->kill() operation be set in the @pgmap arg rather than passed in separately. However, it helps code readability, tracking the lifetime of a given instance, to be able to grep the kill routine directly at the devm_memremap_pages() call site. Link: http://lkml.kernel.org/r/154275558526.76910.7535251937849268605.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams Fixes: e8d513483300 ("memremap: change devm_memremap_pages interface...") Reviewed-by: "Jérôme Glisse" Reported-by: Logan Gunthorpe Reviewed-by: Logan Gunthorpe Reviewed-by: Christoph Hellwig Cc: Balbir Singh Cc: Michal Hocko Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- drivers/dax/pmem.c | 14 +++----------- drivers/nvdimm/pmem.c | 13 +++++-------- include/linux/memremap.h | 2 ++ kernel/memremap.c | 30 ++++++++++++++---------------- tools/testing/nvdimm/test/iomap.c | 15 ++++++++++++++- 5 files changed, 38 insertions(+), 36 deletions(-) (limited to 'kernel') diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c index 99e2aace8078..2c1f459c0c63 100644 --- a/drivers/dax/pmem.c +++ b/drivers/dax/pmem.c @@ -48,9 +48,8 @@ static void dax_pmem_percpu_exit(void *data) percpu_ref_exit(ref); } -static void dax_pmem_percpu_kill(void *data) +static void dax_pmem_percpu_kill(struct percpu_ref *ref) { - struct percpu_ref *ref = data; struct dax_pmem *dax_pmem = to_dax_pmem(ref); dev_dbg(dax_pmem->dev, "trace\n"); @@ -112,17 +111,10 @@ static int dax_pmem_probe(struct device *dev) } dax_pmem->pgmap.ref = &dax_pmem->ref; + dax_pmem->pgmap.kill = dax_pmem_percpu_kill; addr = devm_memremap_pages(dev, &dax_pmem->pgmap); - if (IS_ERR(addr)) { - devm_remove_action(dev, dax_pmem_percpu_exit, &dax_pmem->ref); - percpu_ref_exit(&dax_pmem->ref); + if (IS_ERR(addr)) return PTR_ERR(addr); - } - - rc = devm_add_action_or_reset(dev, dax_pmem_percpu_kill, - &dax_pmem->ref); - if (rc) - return rc; /* adjust the dax_region resource to the start of data */ memcpy(&res, &dax_pmem->pgmap.res, sizeof(res)); diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c index 0e39e3d1846f..d28418b05a04 100644 --- a/drivers/nvdimm/pmem.c +++ b/drivers/nvdimm/pmem.c @@ -309,8 +309,11 @@ static void pmem_release_queue(void *q) blk_cleanup_queue(q); } -static void pmem_freeze_queue(void *q) +static void pmem_freeze_queue(struct percpu_ref *ref) { + struct request_queue *q; + + q = container_of(ref, typeof(*q), q_usage_counter); blk_freeze_queue_start(q); } @@ -402,6 +405,7 @@ static int pmem_attach_disk(struct device *dev, pmem->pfn_flags = PFN_DEV; pmem->pgmap.ref = &q->q_usage_counter; + pmem->pgmap.kill = pmem_freeze_queue; if (is_nd_pfn(dev)) { if (setup_pagemap_fsdax(dev, &pmem->pgmap)) return -ENOMEM; @@ -427,13 +431,6 @@ static int pmem_attach_disk(struct device *dev, memcpy(&bb_res, &nsio->res, sizeof(bb_res)); } - /* - * At release time the queue must be frozen before - * devm_memremap_pages is unwound - */ - if (devm_add_action_or_reset(dev, pmem_freeze_queue, q)) - return -ENOMEM; - if (IS_ERR(addr)) return PTR_ERR(addr); pmem->virt_addr = addr; diff --git a/include/linux/memremap.h b/include/linux/memremap.h index 0ac69ddf5fc4..55db66b3716f 100644 --- a/include/linux/memremap.h +++ b/include/linux/memremap.h @@ -111,6 +111,7 @@ typedef void (*dev_page_free_t)(struct page *page, void *data); * @altmap: pre-allocated/reserved memory for vmemmap allocations * @res: physical address range covered by @ref * @ref: reference count that pins the devm_memremap_pages() mapping + * @kill: callback to transition @ref to the dead state * @dev: host device of the mapping for debug * @data: private data pointer for page_free() * @type: memory type: see MEMORY_* in memory_hotplug.h @@ -122,6 +123,7 @@ struct dev_pagemap { bool altmap_valid; struct resource res; struct percpu_ref *ref; + void (*kill)(struct percpu_ref *ref); struct device *dev; void *data; enum memory_type type; diff --git a/kernel/memremap.c b/kernel/memremap.c index 99d14940acfa..5e45f0c327a5 100644 --- a/kernel/memremap.c +++ b/kernel/memremap.c @@ -88,14 +88,10 @@ static void devm_memremap_pages_release(void *data) resource_size_t align_start, align_size; unsigned long pfn; + pgmap->kill(pgmap->ref); for_each_device_pfn(pfn, pgmap) put_page(pfn_to_page(pfn)); - if (percpu_ref_tryget_live(pgmap->ref)) { - dev_WARN(dev, "%s: page mapping is still live!\n", __func__); - percpu_ref_put(pgmap->ref); - } - /* pages are dead and unused, undo the arch mapping */ align_start = res->start & ~(SECTION_SIZE - 1); align_size = ALIGN(res->start + resource_size(res), SECTION_SIZE) @@ -116,7 +112,7 @@ static void devm_memremap_pages_release(void *data) /** * devm_memremap_pages - remap and provide memmap backing for the given resource * @dev: hosting device for @res - * @pgmap: pointer to a struct dev_pgmap + * @pgmap: pointer to a struct dev_pagemap * * Notes: * 1/ At a minimum the res, ref and type members of @pgmap must be initialized @@ -125,11 +121,8 @@ static void devm_memremap_pages_release(void *data) * 2/ The altmap field may optionally be initialized, in which case altmap_valid * must be set to true * - * 3/ pgmap.ref must be 'live' on entry and 'dead' before devm_memunmap_pages() - * time (or devm release event). The expected order of events is that ref has - * been through percpu_ref_kill() before devm_memremap_pages_release(). The - * wait for the completion of all references being dropped and - * percpu_ref_exit() must occur after devm_memremap_pages_release(). + * 3/ pgmap->ref must be 'live' on entry and will be killed at + * devm_memremap_pages_release() time, or if this routine fails. * * 4/ res is expected to be a host memory range that could feasibly be * treated as a "System RAM" range, i.e. not a device mmio range, but @@ -145,6 +138,9 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap) pgprot_t pgprot = PAGE_KERNEL; int error, nid, is_ram; + if (!pgmap->ref || !pgmap->kill) + return ERR_PTR(-EINVAL); + align_start = res->start & ~(SECTION_SIZE - 1); align_size = ALIGN(res->start + resource_size(res), SECTION_SIZE) - align_start; @@ -170,12 +166,10 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap) if (is_ram != REGION_DISJOINT) { WARN_ONCE(1, "%s attempted on %s region %pr\n", __func__, is_ram == REGION_MIXED ? "mixed" : "ram", res); - return ERR_PTR(-ENXIO); + error = -ENXIO; + goto err_array; } - if (!pgmap->ref) - return ERR_PTR(-EINVAL); - pgmap->dev = dev; error = xa_err(xa_store_range(&pgmap_array, PHYS_PFN(res->start), @@ -217,7 +211,10 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap) align_size >> PAGE_SHIFT, pgmap); percpu_ref_get_many(pgmap->ref, pfn_end(pgmap) - pfn_first(pgmap)); - devm_add_action(dev, devm_memremap_pages_release, pgmap); + error = devm_add_action_or_reset(dev, devm_memremap_pages_release, + pgmap); + if (error) + return ERR_PTR(error); return __va(res->start); @@ -228,6 +225,7 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap) err_pfn_remap: pgmap_array_delete(res); err_array: + pgmap->kill(pgmap->ref); return ERR_PTR(error); } EXPORT_SYMBOL_GPL(devm_memremap_pages); diff --git a/tools/testing/nvdimm/test/iomap.c b/tools/testing/nvdimm/test/iomap.c index ed18a0cbc0c8..c6635fee27d8 100644 --- a/tools/testing/nvdimm/test/iomap.c +++ b/tools/testing/nvdimm/test/iomap.c @@ -104,13 +104,26 @@ void *__wrap_devm_memremap(struct device *dev, resource_size_t offset, } EXPORT_SYMBOL(__wrap_devm_memremap); +static void nfit_test_kill(void *_pgmap) +{ + struct dev_pagemap *pgmap = _pgmap; + + pgmap->kill(pgmap->ref); +} + void *__wrap_devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap) { resource_size_t offset = pgmap->res.start; struct nfit_test_resource *nfit_res = get_nfit_res(offset); - if (nfit_res) + if (nfit_res) { + int rc; + + rc = devm_add_action_or_reset(dev, nfit_test_kill, pgmap); + if (rc) + return ERR_PTR(rc); return nfit_res->buf + offset - nfit_res->res.start; + } return devm_memremap_pages(dev, pgmap); } EXPORT_SYMBOL_GPL(__wrap_devm_memremap_pages); -- cgit v1.3-7-g2ca7 From 69324b8f48339de2f90fdf2f774687fc6c47629a Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Fri, 28 Dec 2018 00:35:01 -0800 Subject: mm, devm_memremap_pages: add MEMORY_DEVICE_PRIVATE support MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit In preparation for consolidating all ZONE_DEVICE enabling via devm_memremap_pages(), teach it how to handle the constraints of MEMORY_DEVICE_PRIVATE ranges. [jglisse@redhat.com: call move_pfn_range_to_zone for MEMORY_DEVICE_PRIVATE] Link: http://lkml.kernel.org/r/154275559036.76910.12434636179931292607.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams Reviewed-by: Jérôme Glisse Acked-by: Christoph Hellwig Reported-by: Logan Gunthorpe Reviewed-by: Logan Gunthorpe Cc: Balbir Singh Cc: Michal Hocko Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/memremap.c | 53 +++++++++++++++++++++++++++++++++++++++++------------ 1 file changed, 41 insertions(+), 12 deletions(-) (limited to 'kernel') diff --git a/kernel/memremap.c b/kernel/memremap.c index 5e45f0c327a5..3eef989ef035 100644 --- a/kernel/memremap.c +++ b/kernel/memremap.c @@ -98,9 +98,15 @@ static void devm_memremap_pages_release(void *data) - align_start; mem_hotplug_begin(); - arch_remove_memory(align_start, align_size, pgmap->altmap_valid ? - &pgmap->altmap : NULL); - kasan_remove_zero_shadow(__va(align_start), align_size); + if (pgmap->type == MEMORY_DEVICE_PRIVATE) { + pfn = align_start >> PAGE_SHIFT; + __remove_pages(page_zone(pfn_to_page(pfn)), pfn, + align_size >> PAGE_SHIFT, NULL); + } else { + arch_remove_memory(align_start, align_size, + pgmap->altmap_valid ? &pgmap->altmap : NULL); + kasan_remove_zero_shadow(__va(align_start), align_size); + } mem_hotplug_done(); untrack_pfn(NULL, PHYS_PFN(align_start), align_size); @@ -187,17 +193,40 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap) goto err_pfn_remap; mem_hotplug_begin(); - error = kasan_add_zero_shadow(__va(align_start), align_size); - if (error) { - mem_hotplug_done(); - goto err_kasan; + + /* + * For device private memory we call add_pages() as we only need to + * allocate and initialize struct page for the device memory. More- + * over the device memory is un-accessible thus we do not want to + * create a linear mapping for the memory like arch_add_memory() + * would do. + * + * For all other device memory types, which are accessible by + * the CPU, we do want the linear mapping and thus use + * arch_add_memory(). + */ + if (pgmap->type == MEMORY_DEVICE_PRIVATE) { + error = add_pages(nid, align_start >> PAGE_SHIFT, + align_size >> PAGE_SHIFT, NULL, false); + } else { + error = kasan_add_zero_shadow(__va(align_start), align_size); + if (error) { + mem_hotplug_done(); + goto err_kasan; + } + + error = arch_add_memory(nid, align_start, align_size, altmap, + false); + } + + if (!error) { + struct zone *zone; + + zone = &NODE_DATA(nid)->node_zones[ZONE_DEVICE]; + move_pfn_range_to_zone(zone, align_start >> PAGE_SHIFT, + align_size >> PAGE_SHIFT, altmap); } - error = arch_add_memory(nid, align_start, align_size, altmap, false); - if (!error) - move_pfn_range_to_zone(&NODE_DATA(nid)->node_zones[ZONE_DEVICE], - align_start >> PAGE_SHIFT, - align_size >> PAGE_SHIFT, altmap); mem_hotplug_done(); if (error) goto err_add_memory; -- cgit v1.3-7-g2ca7 From 1c30844d2dfe272d58c8fc000960b835d13aa2ac Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Fri, 28 Dec 2018 00:35:52 -0800 Subject: mm: reclaim small amounts of memory when an external fragmentation event occurs An external fragmentation event was previously described as When the page allocator fragments memory, it records the event using the mm_page_alloc_extfrag event. If the fallback_order is smaller than a pageblock order (order-9 on 64-bit x86) then it's considered an event that will cause external fragmentation issues in the future. The kernel reduces the probability of such events by increasing the watermark sizes by calling set_recommended_min_free_kbytes early in the lifetime of the system. This works reasonably well in general but if there are enough sparsely populated pageblocks then the problem can still occur as enough memory is free overall and kswapd stays asleep. This patch introduces a watermark_boost_factor sysctl that allows a zone watermark to be temporarily boosted when an external fragmentation causing events occurs. The boosting will stall allocations that would decrease free memory below the boosted low watermark and kswapd is woken if the calling context allows to reclaim an amount of memory relative to the size of the high watermark and the watermark_boost_factor until the boost is cleared. When kswapd finishes, it wakes kcompactd at the pageblock order to clean some of the pageblocks that may have been affected by the fragmentation event. kswapd avoids any writeback, slab shrinkage and swap from reclaim context during this operation to avoid excessive system disruption in the name of fragmentation avoidance. Care is taken so that kswapd will do normal reclaim work if the system is really low on memory. This was evaluated using the same workloads as "mm, page_alloc: Spread allocations across zones before introducing fragmentation". 1-socket Skylake machine config-global-dhp__workload_thpfioscale XFS (no special madvise) 4 fio threads, 1 THP allocating thread -------------------------------------- 4.20-rc3 extfrag events < order 9: 804694 4.20-rc3+patch: 408912 (49% reduction) 4.20-rc3+patch1-4: 18421 (98% reduction) 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Amean fault-base-1 653.58 ( 0.00%) 652.71 ( 0.13%) Amean fault-huge-1 0.00 ( 0.00%) 178.93 * -99.00%* 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Percentage huge-1 0.00 ( 0.00%) 5.12 ( 100.00%) Note that external fragmentation causing events are massively reduced by this path whether in comparison to the previous kernel or the vanilla kernel. The fault latency for huge pages appears to be increased but that is only because THP allocations were successful with the patch applied. 1-socket Skylake machine global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE) ----------------------------------------------------------------- 4.20-rc3 extfrag events < order 9: 291392 4.20-rc3+patch: 191187 (34% reduction) 4.20-rc3+patch1-4: 13464 (95% reduction) thpfioscale Fault Latencies 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Min fault-base-1 912.00 ( 0.00%) 905.00 ( 0.77%) Min fault-huge-1 127.00 ( 0.00%) 135.00 ( -6.30%) Amean fault-base-1 1467.55 ( 0.00%) 1481.67 ( -0.96%) Amean fault-huge-1 1127.11 ( 0.00%) 1063.88 * 5.61%* 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Percentage huge-1 77.64 ( 0.00%) 83.46 ( 7.49%) As before, massive reduction in external fragmentation events, some jitter on latencies and an increase in THP allocation success rates. 2-socket Haswell machine config-global-dhp__workload_thpfioscale XFS (no special madvise) 4 fio threads, 5 THP allocating threads ---------------------------------------------------------------- 4.20-rc3 extfrag events < order 9: 215698 4.20-rc3+patch: 200210 (7% reduction) 4.20-rc3+patch1-4: 14263 (93% reduction) 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Amean fault-base-5 1346.45 ( 0.00%) 1306.87 ( 2.94%) Amean fault-huge-5 3418.60 ( 0.00%) 1348.94 ( 60.54%) 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Percentage huge-5 0.78 ( 0.00%) 7.91 ( 910.64%) There is a 93% reduction in fragmentation causing events, there is a big reduction in the huge page fault latency and allocation success rate is higher. 2-socket Haswell machine global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE) ----------------------------------------------------------------- 4.20-rc3 extfrag events < order 9: 166352 4.20-rc3+patch: 147463 (11% reduction) 4.20-rc3+patch1-4: 11095 (93% reduction) thpfioscale Fault Latencies 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Amean fault-base-5 6217.43 ( 0.00%) 7419.67 * -19.34%* Amean fault-huge-5 3163.33 ( 0.00%) 3263.80 ( -3.18%) 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Percentage huge-5 95.14 ( 0.00%) 87.98 ( -7.53%) There is a large reduction in fragmentation events with some jitter around the latencies and success rates. As before, the high THP allocation success rate does mean the system is under a lot of pressure. However, as the fragmentation events are reduced, it would be expected that the long-term allocation success rate would be higher. Link: http://lkml.kernel.org/r/20181123114528.28802-5-mgorman@techsingularity.net Signed-off-by: Mel Gorman Acked-by: Vlastimil Babka Cc: Andrea Arcangeli Cc: David Rientjes Cc: Michal Hocko Cc: Zi Yan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/sysctl/vm.txt | 21 +++++++ include/linux/mm.h | 1 + include/linux/mmzone.h | 11 ++-- kernel/sysctl.c | 8 +++ mm/page_alloc.c | 43 +++++++++++++- mm/vmscan.c | 133 +++++++++++++++++++++++++++++++++++++++++--- 6 files changed, 202 insertions(+), 15 deletions(-) (limited to 'kernel') diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index 7d73882e2c27..187ce4f599a2 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -63,6 +63,7 @@ Currently, these files are in /proc/sys/vm: - swappiness - user_reserve_kbytes - vfs_cache_pressure +- watermark_boost_factor - watermark_scale_factor - zone_reclaim_mode @@ -856,6 +857,26 @@ ten times more freeable objects than there are. ============================================================= +watermark_boost_factor: + +This factor controls the level of reclaim when memory is being fragmented. +It defines the percentage of the high watermark of a zone that will be +reclaimed if pages of different mobility are being mixed within pageblocks. +The intent is that compaction has less work to do in the future and to +increase the success rate of future high-order allocations such as SLUB +allocations, THP and hugetlbfs pages. + +To make it sensible with respect to the watermark_scale_factor parameter, +the unit is in fractions of 10,000. The default value of 15,000 means +that up to 150% of the high watermark will be reclaimed in the event of +a pageblock being mixed due to fragmentation. The level of reclaim is +determined by the number of fragmentation events that occurred in the +recent past. If this value is smaller than a pageblock then a pageblocks +worth of pages will be reclaimed (e.g. 2MB on 64-bit x86). A boost factor +of 0 will disable the feature. + +============================================================= + watermark_scale_factor: This factor controls the aggressiveness of kswapd. It defines the diff --git a/include/linux/mm.h b/include/linux/mm.h index 1d2be4c2d34a..031b2ce983f9 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2256,6 +2256,7 @@ extern void zone_pcp_reset(struct zone *zone); /* page_alloc.c */ extern int min_free_kbytes; +extern int watermark_boost_factor; extern int watermark_scale_factor; /* nommu.c */ diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index dcf1b66a96ab..5b4bfb90fb94 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -269,10 +269,10 @@ enum zone_watermarks { NR_WMARK }; -#define min_wmark_pages(z) (z->_watermark[WMARK_MIN]) -#define low_wmark_pages(z) (z->_watermark[WMARK_LOW]) -#define high_wmark_pages(z) (z->_watermark[WMARK_HIGH]) -#define wmark_pages(z, i) (z->_watermark[i]) +#define min_wmark_pages(z) (z->_watermark[WMARK_MIN] + z->watermark_boost) +#define low_wmark_pages(z) (z->_watermark[WMARK_LOW] + z->watermark_boost) +#define high_wmark_pages(z) (z->_watermark[WMARK_HIGH] + z->watermark_boost) +#define wmark_pages(z, i) (z->_watermark[i] + z->watermark_boost) struct per_cpu_pages { int count; /* number of pages in the list */ @@ -364,6 +364,7 @@ struct zone { /* zone watermarks, access with *_wmark_pages(zone) macros */ unsigned long _watermark[NR_WMARK]; + unsigned long watermark_boost; unsigned long nr_reserved_highatomic; @@ -890,6 +891,8 @@ static inline int is_highmem(struct zone *zone) struct ctl_table; int min_free_kbytes_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *); +int watermark_boost_factor_sysctl_handler(struct ctl_table *, int, + void __user *, size_t *, loff_t *); int watermark_scale_factor_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *); extern int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES]; diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 5fc724e4e454..1825f712e73b 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1462,6 +1462,14 @@ static struct ctl_table vm_table[] = { .proc_handler = min_free_kbytes_sysctl_handler, .extra1 = &zero, }, + { + .procname = "watermark_boost_factor", + .data = &watermark_boost_factor, + .maxlen = sizeof(watermark_boost_factor), + .mode = 0644, + .proc_handler = watermark_boost_factor_sysctl_handler, + .extra1 = &zero, + }, { .procname = "watermark_scale_factor", .data = &watermark_scale_factor, diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 32b3e121a388..80373eca453d 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -262,6 +262,7 @@ compound_page_dtor * const compound_page_dtors[] = { int min_free_kbytes = 1024; int user_min_free_kbytes = -1; +int watermark_boost_factor __read_mostly = 15000; int watermark_scale_factor = 10; static unsigned long nr_kernel_pages __meminitdata; @@ -2129,6 +2130,21 @@ static bool can_steal_fallback(unsigned int order, int start_mt) return false; } +static inline void boost_watermark(struct zone *zone) +{ + unsigned long max_boost; + + if (!watermark_boost_factor) + return; + + max_boost = mult_frac(zone->_watermark[WMARK_HIGH], + watermark_boost_factor, 10000); + max_boost = max(pageblock_nr_pages, max_boost); + + zone->watermark_boost = min(zone->watermark_boost + pageblock_nr_pages, + max_boost); +} + /* * This function implements actual steal behaviour. If order is large enough, * we can steal whole pageblock. If not, we first move freepages in this @@ -2138,7 +2154,7 @@ static bool can_steal_fallback(unsigned int order, int start_mt) * itself, so pages freed in the future will be put on the correct free list. */ static void steal_suitable_fallback(struct zone *zone, struct page *page, - int start_type, bool whole_block) + unsigned int alloc_flags, int start_type, bool whole_block) { unsigned int current_order = page_order(page); struct free_area *area; @@ -2160,6 +2176,15 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page, goto single_page; } + /* + * Boost watermarks to increase reclaim pressure to reduce the + * likelihood of future fallbacks. Wake kswapd now as the node + * may be balanced overall and kswapd will not wake naturally. + */ + boost_watermark(zone); + if (alloc_flags & ALLOC_KSWAPD) + wakeup_kswapd(zone, 0, 0, zone_idx(zone)); + /* We are not allowed to try stealing from the whole block */ if (!whole_block) goto single_page; @@ -2443,7 +2468,8 @@ do_steal: page = list_first_entry(&area->free_list[fallback_mt], struct page, lru); - steal_suitable_fallback(zone, page, start_migratetype, can_steal); + steal_suitable_fallback(zone, page, alloc_flags, start_migratetype, + can_steal); trace_mm_page_alloc_extfrag(page, order, current_order, start_migratetype, fallback_mt); @@ -7454,6 +7480,7 @@ static void __setup_per_zone_wmarks(void) zone->_watermark[WMARK_LOW] = min_wmark_pages(zone) + tmp; zone->_watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2; + zone->watermark_boost = 0; spin_unlock_irqrestore(&zone->lock, flags); } @@ -7554,6 +7581,18 @@ int min_free_kbytes_sysctl_handler(struct ctl_table *table, int write, return 0; } +int watermark_boost_factor_sysctl_handler(struct ctl_table *table, int write, + void __user *buffer, size_t *length, loff_t *ppos) +{ + int rc; + + rc = proc_dointvec_minmax(table, write, buffer, length, ppos); + if (rc) + return rc; + + return 0; +} + int watermark_scale_factor_sysctl_handler(struct ctl_table *table, int write, void __user *buffer, size_t *length, loff_t *ppos) { diff --git a/mm/vmscan.c b/mm/vmscan.c index 24ab1f7394ab..bd8971a29204 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -88,6 +88,9 @@ struct scan_control { /* Can pages be swapped as part of reclaim? */ unsigned int may_swap:1; + /* e.g. boosted watermark reclaim leaves slabs alone */ + unsigned int may_shrinkslab:1; + /* * Cgroups are not reclaimed below their configured memory.low, * unless we threaten to OOM. If any cgroups are skipped due to @@ -2756,8 +2759,10 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc) shrink_node_memcg(pgdat, memcg, sc, &lru_pages); node_lru_pages += lru_pages; - shrink_slab(sc->gfp_mask, pgdat->node_id, + if (sc->may_shrinkslab) { + shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority); + } /* Record the group's reclaim efficiency */ vmpressure(sc->gfp_mask, memcg, false, @@ -3239,6 +3244,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order, .may_writepage = !laptop_mode, .may_unmap = 1, .may_swap = 1, + .may_shrinkslab = 1, }; /* @@ -3283,6 +3289,7 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg, .may_unmap = 1, .reclaim_idx = MAX_NR_ZONES - 1, .may_swap = !noswap, + .may_shrinkslab = 1, }; unsigned long lru_pages; @@ -3329,6 +3336,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, .may_writepage = !laptop_mode, .may_unmap = 1, .may_swap = may_swap, + .may_shrinkslab = 1, }; /* @@ -3379,6 +3387,30 @@ static void age_active_anon(struct pglist_data *pgdat, } while (memcg); } +static bool pgdat_watermark_boosted(pg_data_t *pgdat, int classzone_idx) +{ + int i; + struct zone *zone; + + /* + * Check for watermark boosts top-down as the higher zones + * are more likely to be boosted. Both watermarks and boosts + * should not be checked at the time time as reclaim would + * start prematurely when there is no boosting and a lower + * zone is balanced. + */ + for (i = classzone_idx; i >= 0; i--) { + zone = pgdat->node_zones + i; + if (!managed_zone(zone)) + continue; + + if (zone->watermark_boost) + return true; + } + + return false; +} + /* * Returns true if there is an eligible zone balanced for the request order * and classzone_idx @@ -3389,6 +3421,10 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx) unsigned long mark = -1; struct zone *zone; + /* + * Check watermarks bottom-up as lower zones are more likely to + * meet watermarks. + */ for (i = 0; i <= classzone_idx; i++) { zone = pgdat->node_zones + i; @@ -3517,14 +3553,14 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) unsigned long nr_soft_reclaimed; unsigned long nr_soft_scanned; unsigned long pflags; + unsigned long nr_boost_reclaim; + unsigned long zone_boosts[MAX_NR_ZONES] = { 0, }; + bool boosted; struct zone *zone; struct scan_control sc = { .gfp_mask = GFP_KERNEL, .order = order, - .priority = DEF_PRIORITY, - .may_writepage = !laptop_mode, .may_unmap = 1, - .may_swap = 1, }; psi_memstall_enter(&pflags); @@ -3532,9 +3568,28 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) count_vm_event(PAGEOUTRUN); + /* + * Account for the reclaim boost. Note that the zone boost is left in + * place so that parallel allocations that are near the watermark will + * stall or direct reclaim until kswapd is finished. + */ + nr_boost_reclaim = 0; + for (i = 0; i <= classzone_idx; i++) { + zone = pgdat->node_zones + i; + if (!managed_zone(zone)) + continue; + + nr_boost_reclaim += zone->watermark_boost; + zone_boosts[i] = zone->watermark_boost; + } + boosted = nr_boost_reclaim; + +restart: + sc.priority = DEF_PRIORITY; do { unsigned long nr_reclaimed = sc.nr_reclaimed; bool raise_priority = true; + bool balanced; bool ret; sc.reclaim_idx = classzone_idx; @@ -3561,13 +3616,40 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) } /* - * Only reclaim if there are no eligible zones. Note that - * sc.reclaim_idx is not used as buffer_heads_over_limit may - * have adjusted it. + * If the pgdat is imbalanced then ignore boosting and preserve + * the watermarks for a later time and restart. Note that the + * zone watermarks will be still reset at the end of balancing + * on the grounds that the normal reclaim should be enough to + * re-evaluate if boosting is required when kswapd next wakes. + */ + balanced = pgdat_balanced(pgdat, sc.order, classzone_idx); + if (!balanced && nr_boost_reclaim) { + nr_boost_reclaim = 0; + goto restart; + } + + /* + * If boosting is not active then only reclaim if there are no + * eligible zones. Note that sc.reclaim_idx is not used as + * buffer_heads_over_limit may have adjusted it. */ - if (pgdat_balanced(pgdat, sc.order, classzone_idx)) + if (!nr_boost_reclaim && balanced) goto out; + /* Limit the priority of boosting to avoid reclaim writeback */ + if (nr_boost_reclaim && sc.priority == DEF_PRIORITY - 2) + raise_priority = false; + + /* + * Do not writeback or swap pages for boosted reclaim. The + * intent is to relieve pressure not issue sub-optimal IO + * from reclaim context. If no pages are reclaimed, the + * reclaim will be aborted. + */ + sc.may_writepage = !laptop_mode && !nr_boost_reclaim; + sc.may_swap = !nr_boost_reclaim; + sc.may_shrinkslab = !nr_boost_reclaim; + /* * Do some background aging of the anon list, to give * pages a chance to be referenced before reclaiming. All @@ -3619,6 +3701,16 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) * progress in reclaiming pages */ nr_reclaimed = sc.nr_reclaimed - nr_reclaimed; + nr_boost_reclaim -= min(nr_boost_reclaim, nr_reclaimed); + + /* + * If reclaim made no progress for a boost, stop reclaim as + * IO cannot be queued and it could be an infinite loop in + * extreme circumstances. + */ + if (nr_boost_reclaim && !nr_reclaimed) + break; + if (raise_priority || !nr_reclaimed) sc.priority--; } while (sc.priority >= 1); @@ -3627,6 +3719,28 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) pgdat->kswapd_failures++; out: + /* If reclaim was boosted, account for the reclaim done in this pass */ + if (boosted) { + unsigned long flags; + + for (i = 0; i <= classzone_idx; i++) { + if (!zone_boosts[i]) + continue; + + /* Increments are under the zone lock */ + zone = pgdat->node_zones + i; + spin_lock_irqsave(&zone->lock, flags); + zone->watermark_boost -= min(zone->watermark_boost, zone_boosts[i]); + spin_unlock_irqrestore(&zone->lock, flags); + } + + /* + * As there is now likely space, wakeup kcompact to defragment + * pageblocks. + */ + wakeup_kcompactd(pgdat, pageblock_order, classzone_idx); + } + snapshot_refaults(NULL, pgdat); __fs_reclaim_release(); psi_memstall_leave(&pflags); @@ -3855,7 +3969,8 @@ void wakeup_kswapd(struct zone *zone, gfp_t gfp_flags, int order, /* Hopeless node, leave it to direct reclaim if possible */ if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES || - pgdat_balanced(pgdat, order, classzone_idx)) { + (pgdat_balanced(pgdat, order, classzone_idx) && + !pgdat_watermark_boosted(pgdat, classzone_idx))) { /* * There may be plenty of free memory available, but it's too * fragmented for high-order allocations. Wake up kcompactd -- cgit v1.3-7-g2ca7 From ef8444ea01d7442652f8e1b8a8b94278cb57eafd Mon Sep 17 00:00:00 2001 From: yuzhoujian Date: Fri, 28 Dec 2018 00:36:07 -0800 Subject: mm, oom: reorganize the oom report in dump_header OOM report contains several sections. The first one is the allocation context that has triggered the OOM. Then we have cpuset context followed by the stack trace of the OOM path. The tird one is the OOM memory information. Followed by the current memory state of all system tasks. At last, we will show oom eligible tasks and the information about the chosen oom victim. One thing that makes parsing more awkward than necessary is that we do not have a single and easily parsable line about the oom context. This patch is reorganizing the oom report to 1) who invoked oom and what was the allocation request [ 515.902945] tuned invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0 2) OOM stack trace [ 515.904273] CPU: 24 PID: 1809 Comm: tuned Not tainted 4.20.0-rc3+ #3 [ 515.905518] Hardware name: Inspur SA5212M4/YZMB-00370-107, BIOS 4.1.10 11/14/2016 [ 515.906821] Call Trace: [ 515.908062] dump_stack+0x5a/0x73 [ 515.909311] dump_header+0x55/0x28c [ 515.914260] oom_kill_process+0x2d8/0x300 [ 515.916708] out_of_memory+0x145/0x4a0 [ 515.917932] __alloc_pages_slowpath+0x7d2/0xa16 [ 515.919157] __alloc_pages_nodemask+0x277/0x290 [ 515.920367] filemap_fault+0x3d0/0x6c0 [ 515.921529] ? filemap_map_pages+0x2b8/0x420 [ 515.922709] ext4_filemap_fault+0x2c/0x40 [ext4] [ 515.923884] __do_fault+0x20/0x80 [ 515.925032] __handle_mm_fault+0xbc0/0xe80 [ 515.926195] handle_mm_fault+0xfa/0x210 [ 515.927357] __do_page_fault+0x233/0x4c0 [ 515.928506] do_page_fault+0x32/0x140 [ 515.929646] ? page_fault+0x8/0x30 [ 515.930770] page_fault+0x1e/0x30 3) OOM memory information [ 515.958093] Mem-Info: [ 515.959647] active_anon:26501758 inactive_anon:1179809 isolated_anon:0 active_file:4402672 inactive_file:483963 isolated_file:1344 unevictable:0 dirty:4886753 writeback:0 unstable:0 slab_reclaimable:148442 slab_unreclaimable:18741 mapped:1347 shmem:1347 pagetables:58669 bounce:0 free:88663 free_pcp:0 free_cma:0 ... 4) current memory state of all system tasks [ 516.079544] [ 744] 0 744 9211 1345 114688 82 0 systemd-journal [ 516.082034] [ 787] 0 787 31764 0 143360 92 0 lvmetad [ 516.084465] [ 792] 0 792 10930 1 110592 208 -1000 systemd-udevd [ 516.086865] [ 1199] 0 1199 13866 0 131072 112 -1000 auditd [ 516.089190] [ 1222] 0 1222 31990 1 110592 157 0 smartd [ 516.091477] [ 1225] 0 1225 4864 85 81920 43 0 irqbalance [ 516.093712] [ 1226] 0 1226 52612 0 258048 426 0 abrtd [ 516.112128] [ 1280] 0 1280 109774 55 299008 400 0 NetworkManager [ 516.113998] [ 1295] 0 1295 28817 37 69632 24 0 ksmtuned [ 516.144596] [ 10718] 0 10718 2622484 1721372 15998976 267219 0 panic [ 516.145792] [ 10719] 0 10719 2622484 1164767 9818112 53576 0 panic [ 516.146977] [ 10720] 0 10720 2622484 1174361 9904128 53709 0 panic [ 516.148163] [ 10721] 0 10721 2622484 1209070 10194944 54824 0 panic [ 516.149329] [ 10722] 0 10722 2622484 1745799 14774272 91138 0 panic 5) oom context (contrains and the chosen victim). oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,task=panic,pid=10737,uid=0 An admin can easily get the full oom context at a single line which makes parsing much easier. Link: http://lkml.kernel.org/r/1542799799-36184-1-git-send-email-ufo19890607@gmail.com Signed-off-by: yuzhoujian Acked-by: Michal Hocko Cc: Andrea Arcangeli Cc: David Rientjes Cc: "Kirill A . Shutemov" Cc: Roman Gushchin Cc: Tetsuo Handa Cc: Yang Shi Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/oom.h | 10 ++++++++++ kernel/cgroup/cpuset.c | 4 ++-- mm/oom_kill.c | 29 ++++++++++++++++++++--------- mm/page_alloc.c | 4 ++-- 4 files changed, 34 insertions(+), 13 deletions(-) (limited to 'kernel') diff --git a/include/linux/oom.h b/include/linux/oom.h index 69864a547663..d07992009265 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -15,6 +15,13 @@ struct notifier_block; struct mem_cgroup; struct task_struct; +enum oom_constraint { + CONSTRAINT_NONE, + CONSTRAINT_CPUSET, + CONSTRAINT_MEMORY_POLICY, + CONSTRAINT_MEMCG, +}; + /* * Details of the page allocation that triggered the oom killer that are used to * determine what should be killed. @@ -42,6 +49,9 @@ struct oom_control { unsigned long totalpages; struct task_struct *chosen; unsigned long chosen_points; + + /* Used to print the constraint info. */ + enum oom_constraint constraint; }; extern struct mutex oom_lock; diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index 266f10cb7222..9510a5b32eaf 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -2666,9 +2666,9 @@ void cpuset_print_current_mems_allowed(void) rcu_read_lock(); cgrp = task_cs(current)->css.cgroup; - pr_info("%s cpuset=", current->comm); + pr_cont(",cpuset="); pr_cont_cgroup_name(cgrp); - pr_cont(" mems_allowed=%*pbl\n", + pr_cont(",mems_allowed=%*pbl", nodemask_pr_args(¤t->mems_allowed)); rcu_read_unlock(); diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 21d487749e1d..d90253f1ff93 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -245,11 +245,11 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg, return points > 0 ? points : 1; } -enum oom_constraint { - CONSTRAINT_NONE, - CONSTRAINT_CPUSET, - CONSTRAINT_MEMORY_POLICY, - CONSTRAINT_MEMCG, +static const char * const oom_constraint_text[] = { + [CONSTRAINT_NONE] = "CONSTRAINT_NONE", + [CONSTRAINT_CPUSET] = "CONSTRAINT_CPUSET", + [CONSTRAINT_MEMORY_POLICY] = "CONSTRAINT_MEMORY_POLICY", + [CONSTRAINT_MEMCG] = "CONSTRAINT_MEMCG", }; /* @@ -428,16 +428,25 @@ static void dump_tasks(struct mem_cgroup *memcg, const nodemask_t *nodemask) rcu_read_unlock(); } +static void dump_oom_summary(struct oom_control *oc, struct task_struct *victim) +{ + /* one line summary of the oom killer context. */ + pr_info("oom-kill:constraint=%s,nodemask=%*pbl", + oom_constraint_text[oc->constraint], + nodemask_pr_args(oc->nodemask)); + cpuset_print_current_mems_allowed(); + pr_cont(",task=%s,pid=%d,uid=%d\n", victim->comm, victim->pid, + from_kuid(&init_user_ns, task_uid(victim))); +} + static void dump_header(struct oom_control *oc, struct task_struct *p) { - pr_warn("%s invoked oom-killer: gfp_mask=%#x(%pGg), nodemask=%*pbl, order=%d, oom_score_adj=%hd\n", - current->comm, oc->gfp_mask, &oc->gfp_mask, - nodemask_pr_args(oc->nodemask), oc->order, + pr_warn("%s invoked oom-killer: gfp_mask=%#x(%pGg), order=%d, oom_score_adj=%hd\n", + current->comm, oc->gfp_mask, &oc->gfp_mask, oc->order, current->signal->oom_score_adj); if (!IS_ENABLED(CONFIG_COMPACTION) && oc->order) pr_warn("COMPACTION is disabled!!!\n"); - cpuset_print_current_mems_allowed(); dump_stack(); if (is_memcg_oom(oc)) mem_cgroup_print_oom_info(oc->memcg, p); @@ -448,6 +457,8 @@ static void dump_header(struct oom_control *oc, struct task_struct *p) } if (sysctl_oom_dump_tasks) dump_tasks(oc->memcg, oc->nodemask); + if (p) + dump_oom_summary(oc, p); } /* diff --git a/mm/page_alloc.c b/mm/page_alloc.c index e97ebaf5ba26..a48db99da7b5 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3517,13 +3517,13 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...) va_start(args, fmt); vaf.fmt = fmt; vaf.va = &args; - pr_warn("%s: %pV, mode:%#x(%pGg), nodemask=%*pbl\n", + pr_warn("%s: %pV, mode:%#x(%pGg), nodemask=%*pbl", current->comm, &vaf, gfp_mask, &gfp_mask, nodemask_pr_args(nodemask)); va_end(args); cpuset_print_current_mems_allowed(); - + pr_cont("\n"); dump_stack(); warn_alloc_show_mem(gfp_mask, nodemask); } -- cgit v1.3-7-g2ca7 From 2c2a5af6fed20cf74401c9d64319c76c5ff81309 Mon Sep 17 00:00:00 2001 From: Oscar Salvador Date: Fri, 28 Dec 2018 00:36:22 -0800 Subject: mm, memory_hotplug: add nid parameter to arch_remove_memory Patch series "Do not touch pages in hot-remove path", v2. This patchset aims for two things: 1) A better definition about offline and hot-remove stage 2) Solving bugs where we can access non-initialized pages during hot-remove operations [2] [3]. This is achieved by moving all page/zone handling to the offline stage, so we do not need to access pages when hot-removing memory. [1] https://patchwork.kernel.org/cover/10691415/ [2] https://patchwork.kernel.org/patch/10547445/ [3] https://www.spinics.net/lists/linux-mm/msg161316.html This patch (of 5): This is a preparation for the following-up patches. The idea of passing the nid is that it will allow us to get rid of the zone parameter afterwards. Link: http://lkml.kernel.org/r/20181127162005.15833-2-osalvador@suse.de Signed-off-by: Oscar Salvador Reviewed-by: David Hildenbrand Reviewed-by: Pavel Tatashin Cc: Michal Hocko Cc: Dan Williams Cc: Jerome Glisse Cc: Jonathan Cameron Cc: "Rafael J. Wysocki" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/ia64/mm/init.c | 2 +- arch/powerpc/mm/mem.c | 3 ++- arch/s390/mm/init.c | 2 +- arch/sh/mm/init.c | 2 +- arch/x86/mm/init_32.c | 2 +- arch/x86/mm/init_64.c | 3 ++- include/linux/memory_hotplug.h | 4 ++-- kernel/memremap.c | 5 ++++- mm/memory_hotplug.c | 2 +- 9 files changed, 15 insertions(+), 10 deletions(-) (limited to 'kernel') diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c index d5e12ff1d73c..904fe55e10fc 100644 --- a/arch/ia64/mm/init.c +++ b/arch/ia64/mm/init.c @@ -661,7 +661,7 @@ int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap, } #ifdef CONFIG_MEMORY_HOTREMOVE -int arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap) +int arch_remove_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap) { unsigned long start_pfn = start >> PAGE_SHIFT; unsigned long nr_pages = size >> PAGE_SHIFT; diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c index 20394e52fe27..33cc6f676fa6 100644 --- a/arch/powerpc/mm/mem.c +++ b/arch/powerpc/mm/mem.c @@ -139,7 +139,8 @@ int __meminit arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap * } #ifdef CONFIG_MEMORY_HOTREMOVE -int __meminit arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap) +int __meminit arch_remove_memory(int nid, u64 start, u64 size, + struct vmem_altmap *altmap) { unsigned long start_pfn = start >> PAGE_SHIFT; unsigned long nr_pages = size >> PAGE_SHIFT; diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c index 50388190b393..3e82f66d5c61 100644 --- a/arch/s390/mm/init.c +++ b/arch/s390/mm/init.c @@ -242,7 +242,7 @@ int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap, } #ifdef CONFIG_MEMORY_HOTREMOVE -int arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap) +int arch_remove_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap) { /* * There is no hardware or firmware interface which could trigger a diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c index c8c13c777162..a8e5c0e00fca 100644 --- a/arch/sh/mm/init.c +++ b/arch/sh/mm/init.c @@ -443,7 +443,7 @@ EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid); #endif #ifdef CONFIG_MEMORY_HOTREMOVE -int arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap) +int arch_remove_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap) { unsigned long start_pfn = PFN_DOWN(start); unsigned long nr_pages = size >> PAGE_SHIFT; diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c index 49ecf5ecf6d3..85c94f9a87f8 100644 --- a/arch/x86/mm/init_32.c +++ b/arch/x86/mm/init_32.c @@ -860,7 +860,7 @@ int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap, } #ifdef CONFIG_MEMORY_HOTREMOVE -int arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap) +int arch_remove_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap) { unsigned long start_pfn = start >> PAGE_SHIFT; unsigned long nr_pages = size >> PAGE_SHIFT; diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c index 484c1b92f078..bccff68e3267 100644 --- a/arch/x86/mm/init_64.c +++ b/arch/x86/mm/init_64.c @@ -1141,7 +1141,8 @@ kernel_physical_mapping_remove(unsigned long start, unsigned long end) remove_pagetable(start, end, true, NULL); } -int __ref arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap) +int __ref arch_remove_memory(int nid, u64 start, u64 size, + struct vmem_altmap *altmap) { unsigned long start_pfn = start >> PAGE_SHIFT; unsigned long nr_pages = size >> PAGE_SHIFT; diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h index 7383a7a76d69..9e4d9b9b93ea 100644 --- a/include/linux/memory_hotplug.h +++ b/include/linux/memory_hotplug.h @@ -107,8 +107,8 @@ static inline bool movable_node_is_enabled(void) } #ifdef CONFIG_MEMORY_HOTREMOVE -extern int arch_remove_memory(u64 start, u64 size, - struct vmem_altmap *altmap); +extern int arch_remove_memory(int nid, u64 start, u64 size, + struct vmem_altmap *altmap); extern int __remove_pages(struct zone *zone, unsigned long start_pfn, unsigned long nr_pages, struct vmem_altmap *altmap); #endif /* CONFIG_MEMORY_HOTREMOVE */ diff --git a/kernel/memremap.c b/kernel/memremap.c index 3eef989ef035..0d5603d76c37 100644 --- a/kernel/memremap.c +++ b/kernel/memremap.c @@ -87,6 +87,7 @@ static void devm_memremap_pages_release(void *data) struct resource *res = &pgmap->res; resource_size_t align_start, align_size; unsigned long pfn; + int nid; pgmap->kill(pgmap->ref); for_each_device_pfn(pfn, pgmap) @@ -97,13 +98,15 @@ static void devm_memremap_pages_release(void *data) align_size = ALIGN(res->start + resource_size(res), SECTION_SIZE) - align_start; + nid = page_to_nid(pfn_to_page(align_start >> PAGE_SHIFT)); + mem_hotplug_begin(); if (pgmap->type == MEMORY_DEVICE_PRIVATE) { pfn = align_start >> PAGE_SHIFT; __remove_pages(page_zone(pfn_to_page(pfn)), pfn, align_size >> PAGE_SHIFT, NULL); } else { - arch_remove_memory(align_start, align_size, + arch_remove_memory(nid, align_start, align_size, pgmap->altmap_valid ? &pgmap->altmap : NULL); kasan_remove_zero_shadow(__va(align_start), align_size); } diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 6258e0e923cc..0718cf7427b2 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1841,7 +1841,7 @@ void __ref __remove_memory(int nid, u64 start, u64 size) memblock_free(start, size); memblock_remove(start, size); - arch_remove_memory(start, size, NULL); + arch_remove_memory(nid, start, size, NULL); try_offline_node(nid); -- cgit v1.3-7-g2ca7 From 65c78784135f847e49eb98e6b976e453e71100c3 Mon Sep 17 00:00:00 2001 From: Oscar Salvador Date: Fri, 28 Dec 2018 00:36:26 -0800 Subject: kernel, resource: check for IORESOURCE_SYSRAM in release_mem_region_adjustable This is a preparation for the next patch. Currently, we only call release_mem_region_adjustable() in __remove_pages if the zone is not ZONE_DEVICE, because resources that belong to HMM/devm are being released by themselves with devm_release_mem_region. Since we do not want to touch any zone/page stuff during the removing of the memory (but during the offlining), we do not want to check for the zone here. So we need another way to tell release_mem_region_adjustable() to not realease the resource in case it belongs to HMM/devm. HMM/devm acquires/releases a resource through devm_request_mem_region/devm_release_mem_region. These resources have the flag IORESOURCE_MEM, while resources acquired by hot-add memory path (register_memory_resource()) contain IORESOURCE_SYSTEM_RAM. So, we can check for this flag in release_mem_region_adjustable, and if the resource does not contain such flag, we know that we are dealing with a HMM/devm resource, so we can back off. Link: http://lkml.kernel.org/r/20181127162005.15833-3-osalvador@suse.de Signed-off-by: Oscar Salvador Reviewed-by: David Hildenbrand Reviewed-by: Pavel Tatashin Cc: Dan Williams Cc: Jerome Glisse Cc: Jonathan Cameron Cc: Michal Hocko Cc: Oscar Salvador Cc: "Rafael J. Wysocki" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/resource.c | 15 +++++++++++++++ 1 file changed, 15 insertions(+) (limited to 'kernel') diff --git a/kernel/resource.c b/kernel/resource.c index b0fbf685c77a..915c02e8e5dd 100644 --- a/kernel/resource.c +++ b/kernel/resource.c @@ -1256,6 +1256,21 @@ int release_mem_region_adjustable(struct resource *parent, continue; } + /* + * All memory regions added from memory-hotplug path have the + * flag IORESOURCE_SYSTEM_RAM. If the resource does not have + * this flag, we know that we are dealing with a resource coming + * from HMM/devm. HMM/devm use another mechanism to add/release + * a resource. This goes via devm_request_mem_region and + * devm_release_mem_region. + * HMM/devm take care to release their resources when they want, + * so if we are dealing with them, let us just back off here. + */ + if (!(res->flags & IORESOURCE_SYSRAM)) { + ret = 0; + break; + } + if (!(res->flags & IORESOURCE_MEM)) break; -- cgit v1.3-7-g2ca7 From ac46d4f3c43241ffa23d5bf36153a0830c0e02cc Mon Sep 17 00:00:00 2001 From: Jérôme Glisse Date: Fri, 28 Dec 2018 00:38:09 -0800 Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v2 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit To avoid having to change many call sites everytime we want to add a parameter use a structure to group all parameters for the mmu_notifier invalidate_range_start/end cakks. No functional changes with this patch. [akpm@linux-foundation.org: coding style fixes] Link: http://lkml.kernel.org/r/20181205053628.3210-3-jglisse@redhat.com Signed-off-by: Jérôme Glisse Acked-by: Christian König Acked-by: Jan Kara Cc: Matthew Wilcox Cc: Ross Zwisler Cc: Dan Williams Cc: Paolo Bonzini Cc: Radim Krcmar Cc: Michal Hocko Cc: Felix Kuehling Cc: Ralph Campbell Cc: John Hubbard From: Jérôme Glisse Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v3 fix build warning in migrate.c when CONFIG_MMU_NOTIFIER=n Link: http://lkml.kernel.org/r/20181213171330.8489-3-jglisse@redhat.com Signed-off-by: Jérôme Glisse Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/dax.c | 8 ++-- fs/proc/task_mmu.c | 7 +++- include/linux/mm.h | 6 ++- include/linux/mmu_notifier.h | 87 +++++++++++++++++++++++++------------- kernel/events/uprobes.c | 10 ++--- mm/huge_memory.c | 54 +++++++++++------------- mm/hugetlb.c | 52 +++++++++++------------ mm/khugepaged.c | 10 ++--- mm/ksm.c | 21 ++++------ mm/madvise.c | 21 +++++----- mm/memory.c | 99 ++++++++++++++++++++++---------------------- mm/migrate.c | 28 ++++++------- mm/mmu_notifier.c | 37 ++++------------- mm/mprotect.c | 15 +++---- mm/mremap.c | 10 ++--- mm/oom_kill.c | 17 ++++---- mm/rmap.c | 30 ++++++++------ 17 files changed, 262 insertions(+), 250 deletions(-) (limited to 'kernel') diff --git a/fs/dax.c b/fs/dax.c index 48132eca3761..262e14f29933 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -779,7 +779,8 @@ static void dax_entry_mkclean(struct address_space *mapping, pgoff_t index, i_mmap_lock_read(mapping); vma_interval_tree_foreach(vma, &mapping->i_mmap, index, index) { - unsigned long address, start, end; + struct mmu_notifier_range range; + unsigned long address; cond_resched(); @@ -793,7 +794,8 @@ static void dax_entry_mkclean(struct address_space *mapping, pgoff_t index, * call mmu_notifier_invalidate_range_start() on our behalf * before taking any lock. */ - if (follow_pte_pmd(vma->vm_mm, address, &start, &end, &ptep, &pmdp, &ptl)) + if (follow_pte_pmd(vma->vm_mm, address, &range, + &ptep, &pmdp, &ptl)) continue; /* @@ -835,7 +837,7 @@ unlock_pte: pte_unmap_unlock(ptep, ptl); } - mmu_notifier_invalidate_range_end(vma->vm_mm, start, end); + mmu_notifier_invalidate_range_end(&range); } i_mmap_unlock_read(mapping); } diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 47c3764c469b..b3ddceb003bc 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -1096,6 +1096,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf, return -ESRCH; mm = get_task_mm(task); if (mm) { + struct mmu_notifier_range range; struct clear_refs_private cp = { .type = type, }; @@ -1139,11 +1140,13 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf, downgrade_write(&mm->mmap_sem); break; } - mmu_notifier_invalidate_range_start(mm, 0, -1); + + mmu_notifier_range_init(&range, mm, 0, -1UL); + mmu_notifier_invalidate_range_start(&range); } walk_page_range(0, mm->highest_vm_end, &clear_refs_walk); if (type == CLEAR_REFS_SOFT_DIRTY) - mmu_notifier_invalidate_range_end(mm, 0, -1); + mmu_notifier_invalidate_range_end(&range); tlb_finish_mmu(&tlb, 0, -1); up_read(&mm->mmap_sem); out_mm: diff --git a/include/linux/mm.h b/include/linux/mm.h index 3c39b9dc7a90..ea1f12d15365 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1451,6 +1451,8 @@ struct mm_walk { void *private; }; +struct mmu_notifier_range; + int walk_page_range(unsigned long addr, unsigned long end, struct mm_walk *walk); int walk_page_vma(struct vm_area_struct *vma, struct mm_walk *walk); @@ -1459,8 +1461,8 @@ void free_pgd_range(struct mmu_gather *tlb, unsigned long addr, int copy_page_range(struct mm_struct *dst, struct mm_struct *src, struct vm_area_struct *vma); int follow_pte_pmd(struct mm_struct *mm, unsigned long address, - unsigned long *start, unsigned long *end, - pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp); + struct mmu_notifier_range *range, + pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp); int follow_pfn(struct vm_area_struct *vma, unsigned long address, unsigned long *pfn); int follow_phys(struct vm_area_struct *vma, unsigned long address, diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h index 3d377805b29c..4050ec1c3b45 100644 --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -220,11 +220,8 @@ extern int __mmu_notifier_test_young(struct mm_struct *mm, unsigned long address); extern void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address, pte_t pte); -extern int __mmu_notifier_invalidate_range_start(struct mm_struct *mm, - unsigned long start, unsigned long end, - bool blockable); -extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm, - unsigned long start, unsigned long end, +extern int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *r); +extern void __mmu_notifier_invalidate_range_end(struct mmu_notifier_range *r, bool only_end); extern void __mmu_notifier_invalidate_range(struct mm_struct *mm, unsigned long start, unsigned long end); @@ -268,33 +265,37 @@ static inline void mmu_notifier_change_pte(struct mm_struct *mm, __mmu_notifier_change_pte(mm, address, pte); } -static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm, - unsigned long start, unsigned long end) +static inline void +mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range) { - if (mm_has_notifiers(mm)) - __mmu_notifier_invalidate_range_start(mm, start, end, true); + if (mm_has_notifiers(range->mm)) { + range->blockable = true; + __mmu_notifier_invalidate_range_start(range); + } } -static inline int mmu_notifier_invalidate_range_start_nonblock(struct mm_struct *mm, - unsigned long start, unsigned long end) +static inline int +mmu_notifier_invalidate_range_start_nonblock(struct mmu_notifier_range *range) { - if (mm_has_notifiers(mm)) - return __mmu_notifier_invalidate_range_start(mm, start, end, false); + if (mm_has_notifiers(range->mm)) { + range->blockable = false; + return __mmu_notifier_invalidate_range_start(range); + } return 0; } -static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm, - unsigned long start, unsigned long end) +static inline void +mmu_notifier_invalidate_range_end(struct mmu_notifier_range *range) { - if (mm_has_notifiers(mm)) - __mmu_notifier_invalidate_range_end(mm, start, end, false); + if (mm_has_notifiers(range->mm)) + __mmu_notifier_invalidate_range_end(range, false); } -static inline void mmu_notifier_invalidate_range_only_end(struct mm_struct *mm, - unsigned long start, unsigned long end) +static inline void +mmu_notifier_invalidate_range_only_end(struct mmu_notifier_range *range) { - if (mm_has_notifiers(mm)) - __mmu_notifier_invalidate_range_end(mm, start, end, true); + if (mm_has_notifiers(range->mm)) + __mmu_notifier_invalidate_range_end(range, true); } static inline void mmu_notifier_invalidate_range(struct mm_struct *mm, @@ -315,6 +316,17 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm) __mmu_notifier_mm_destroy(mm); } + +static inline void mmu_notifier_range_init(struct mmu_notifier_range *range, + struct mm_struct *mm, + unsigned long start, + unsigned long end) +{ + range->mm = mm; + range->start = start; + range->end = end; +} + #define ptep_clear_flush_young_notify(__vma, __address, __ptep) \ ({ \ int __young; \ @@ -427,6 +439,23 @@ extern void mmu_notifier_call_srcu(struct rcu_head *rcu, #else /* CONFIG_MMU_NOTIFIER */ +struct mmu_notifier_range { + unsigned long start; + unsigned long end; +}; + +static inline void _mmu_notifier_range_init(struct mmu_notifier_range *range, + unsigned long start, + unsigned long end) +{ + range->start = start; + range->end = end; +} + +#define mmu_notifier_range_init(range, mm, start, end) \ + _mmu_notifier_range_init(range, start, end) + + static inline int mm_has_notifiers(struct mm_struct *mm) { return 0; @@ -454,24 +483,24 @@ static inline void mmu_notifier_change_pte(struct mm_struct *mm, { } -static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm, - unsigned long start, unsigned long end) +static inline void +mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range) { } -static inline int mmu_notifier_invalidate_range_start_nonblock(struct mm_struct *mm, - unsigned long start, unsigned long end) +static inline int +mmu_notifier_invalidate_range_start_nonblock(struct mmu_notifier_range *range) { return 0; } -static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm, - unsigned long start, unsigned long end) +static inline +void mmu_notifier_invalidate_range_end(struct mmu_notifier_range *range) { } -static inline void mmu_notifier_invalidate_range_only_end(struct mm_struct *mm, - unsigned long start, unsigned long end) +static inline void +mmu_notifier_invalidate_range_only_end(struct mmu_notifier_range *range) { } diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index abbd8da9ac21..8aef47ee7bfa 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -171,11 +171,11 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr, .address = addr, }; int err; - /* For mmu_notifiers */ - const unsigned long mmun_start = addr; - const unsigned long mmun_end = addr + PAGE_SIZE; + struct mmu_notifier_range range; struct mem_cgroup *memcg; + mmu_notifier_range_init(&range, mm, addr, addr + PAGE_SIZE); + VM_BUG_ON_PAGE(PageTransHuge(old_page), old_page); err = mem_cgroup_try_charge(new_page, vma->vm_mm, GFP_KERNEL, &memcg, @@ -186,7 +186,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr, /* For try_to_free_swap() and munlock_vma_page() below */ lock_page(old_page); - mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end); + mmu_notifier_invalidate_range_start(&range); err = -EAGAIN; if (!page_vma_mapped_walk(&pvmw)) { mem_cgroup_cancel_charge(new_page, memcg, false); @@ -220,7 +220,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr, err = 0; unlock: - mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end); + mmu_notifier_invalidate_range_end(&range); unlock_page(old_page); return err; } diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 0c0e18409fde..05136ad0f325 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1134,8 +1134,7 @@ static vm_fault_t do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, int i; vm_fault_t ret = 0; struct page **pages; - unsigned long mmun_start; /* For mmu_notifiers */ - unsigned long mmun_end; /* For mmu_notifiers */ + struct mmu_notifier_range range; pages = kmalloc_array(HPAGE_PMD_NR, sizeof(struct page *), GFP_KERNEL); @@ -1173,9 +1172,9 @@ static vm_fault_t do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, cond_resched(); } - mmun_start = haddr; - mmun_end = haddr + HPAGE_PMD_SIZE; - mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start, mmun_end); + mmu_notifier_range_init(&range, vma->vm_mm, haddr, + haddr + HPAGE_PMD_SIZE); + mmu_notifier_invalidate_range_start(&range); vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd); if (unlikely(!pmd_same(*vmf->pmd, orig_pmd))) @@ -1220,8 +1219,7 @@ static vm_fault_t do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, * No need to double call mmu_notifier->invalidate_range() callback as * the above pmdp_huge_clear_flush_notify() did already call it. */ - mmu_notifier_invalidate_range_only_end(vma->vm_mm, mmun_start, - mmun_end); + mmu_notifier_invalidate_range_only_end(&range); ret |= VM_FAULT_WRITE; put_page(page); @@ -1231,7 +1229,7 @@ out: out_free_pages: spin_unlock(vmf->ptl); - mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end); + mmu_notifier_invalidate_range_end(&range); for (i = 0; i < HPAGE_PMD_NR; i++) { memcg = (void *)page_private(pages[i]); set_page_private(pages[i], 0); @@ -1248,8 +1246,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd) struct page *page = NULL, *new_page; struct mem_cgroup *memcg; unsigned long haddr = vmf->address & HPAGE_PMD_MASK; - unsigned long mmun_start; /* For mmu_notifiers */ - unsigned long mmun_end; /* For mmu_notifiers */ + struct mmu_notifier_range range; gfp_t huge_gfp; /* for allocation and charge */ vm_fault_t ret = 0; @@ -1338,9 +1335,9 @@ alloc: vma, HPAGE_PMD_NR); __SetPageUptodate(new_page); - mmun_start = haddr; - mmun_end = haddr + HPAGE_PMD_SIZE; - mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start, mmun_end); + mmu_notifier_range_init(&range, vma->vm_mm, haddr, + haddr + HPAGE_PMD_SIZE); + mmu_notifier_invalidate_range_start(&range); spin_lock(vmf->ptl); if (page) @@ -1375,8 +1372,7 @@ out_mn: * No need to double call mmu_notifier->invalidate_range() callback as * the above pmdp_huge_clear_flush_notify() did already call it. */ - mmu_notifier_invalidate_range_only_end(vma->vm_mm, mmun_start, - mmun_end); + mmu_notifier_invalidate_range_only_end(&range); out: return ret; out_unlock: @@ -2015,14 +2011,15 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud, unsigned long address) { spinlock_t *ptl; - struct mm_struct *mm = vma->vm_mm; - unsigned long haddr = address & HPAGE_PUD_MASK; + struct mmu_notifier_range range; - mmu_notifier_invalidate_range_start(mm, haddr, haddr + HPAGE_PUD_SIZE); - ptl = pud_lock(mm, pud); + mmu_notifier_range_init(&range, vma->vm_mm, address & HPAGE_PUD_MASK, + (address & HPAGE_PUD_MASK) + HPAGE_PUD_SIZE); + mmu_notifier_invalidate_range_start(&range); + ptl = pud_lock(vma->vm_mm, pud); if (unlikely(!pud_trans_huge(*pud) && !pud_devmap(*pud))) goto out; - __split_huge_pud_locked(vma, pud, haddr); + __split_huge_pud_locked(vma, pud, range.start); out: spin_unlock(ptl); @@ -2030,8 +2027,7 @@ out: * No need to double call mmu_notifier->invalidate_range() callback as * the above pudp_huge_clear_flush_notify() did already call it. */ - mmu_notifier_invalidate_range_only_end(mm, haddr, haddr + - HPAGE_PUD_SIZE); + mmu_notifier_invalidate_range_only_end(&range); } #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ @@ -2233,11 +2229,12 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, unsigned long address, bool freeze, struct page *page) { spinlock_t *ptl; - struct mm_struct *mm = vma->vm_mm; - unsigned long haddr = address & HPAGE_PMD_MASK; + struct mmu_notifier_range range; - mmu_notifier_invalidate_range_start(mm, haddr, haddr + HPAGE_PMD_SIZE); - ptl = pmd_lock(mm, pmd); + mmu_notifier_range_init(&range, vma->vm_mm, address & HPAGE_PMD_MASK, + (address & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE); + mmu_notifier_invalidate_range_start(&range); + ptl = pmd_lock(vma->vm_mm, pmd); /* * If caller asks to setup a migration entries, we need a page to check @@ -2253,7 +2250,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, clear_page_mlock(page); } else if (!(pmd_devmap(*pmd) || is_pmd_migration_entry(*pmd))) goto out; - __split_huge_pmd_locked(vma, pmd, haddr, freeze); + __split_huge_pmd_locked(vma, pmd, range.start, freeze); out: spin_unlock(ptl); /* @@ -2269,8 +2266,7 @@ out: * any further changes to individual pte will notify. So no need * to call mmu_notifier->invalidate_range() */ - mmu_notifier_invalidate_range_only_end(mm, haddr, haddr + - HPAGE_PMD_SIZE); + mmu_notifier_invalidate_range_only_end(&range); } void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address, diff --git a/mm/hugetlb.c b/mm/hugetlb.c index a80832487981..12000ba5c868 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -3240,16 +3240,16 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, int cow; struct hstate *h = hstate_vma(vma); unsigned long sz = huge_page_size(h); - unsigned long mmun_start; /* For mmu_notifiers */ - unsigned long mmun_end; /* For mmu_notifiers */ + struct mmu_notifier_range range; int ret = 0; cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE; - mmun_start = vma->vm_start; - mmun_end = vma->vm_end; - if (cow) - mmu_notifier_invalidate_range_start(src, mmun_start, mmun_end); + if (cow) { + mmu_notifier_range_init(&range, src, vma->vm_start, + vma->vm_end); + mmu_notifier_invalidate_range_start(&range); + } for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) { spinlock_t *src_ptl, *dst_ptl; @@ -3325,7 +3325,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, } if (cow) - mmu_notifier_invalidate_range_end(src, mmun_start, mmun_end); + mmu_notifier_invalidate_range_end(&range); return ret; } @@ -3342,8 +3342,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma, struct page *page; struct hstate *h = hstate_vma(vma); unsigned long sz = huge_page_size(h); - unsigned long mmun_start = start; /* For mmu_notifiers */ - unsigned long mmun_end = end; /* For mmu_notifiers */ + struct mmu_notifier_range range; WARN_ON(!is_vm_hugetlb_page(vma)); BUG_ON(start & ~huge_page_mask(h)); @@ -3359,8 +3358,9 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma, /* * If sharing possible, alert mmu notifiers of worst case. */ - adjust_range_if_pmd_sharing_possible(vma, &mmun_start, &mmun_end); - mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end); + mmu_notifier_range_init(&range, mm, start, end); + adjust_range_if_pmd_sharing_possible(vma, &range.start, &range.end); + mmu_notifier_invalidate_range_start(&range); address = start; for (; address < end; address += sz) { ptep = huge_pte_offset(mm, address, sz); @@ -3428,7 +3428,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma, if (ref_page) break; } - mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end); + mmu_notifier_invalidate_range_end(&range); tlb_end_vma(tlb, vma); } @@ -3546,9 +3546,8 @@ static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma, struct page *old_page, *new_page; int outside_reserve = 0; vm_fault_t ret = 0; - unsigned long mmun_start; /* For mmu_notifiers */ - unsigned long mmun_end; /* For mmu_notifiers */ unsigned long haddr = address & huge_page_mask(h); + struct mmu_notifier_range range; pte = huge_ptep_get(ptep); old_page = pte_page(pte); @@ -3627,9 +3626,8 @@ retry_avoidcopy: __SetPageUptodate(new_page); set_page_huge_active(new_page); - mmun_start = haddr; - mmun_end = mmun_start + huge_page_size(h); - mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end); + mmu_notifier_range_init(&range, mm, haddr, haddr + huge_page_size(h)); + mmu_notifier_invalidate_range_start(&range); /* * Retake the page table lock to check for racing updates @@ -3642,7 +3640,7 @@ retry_avoidcopy: /* Break COW */ huge_ptep_clear_flush(vma, haddr, ptep); - mmu_notifier_invalidate_range(mm, mmun_start, mmun_end); + mmu_notifier_invalidate_range(mm, range.start, range.end); set_huge_pte_at(mm, haddr, ptep, make_huge_pte(vma, new_page, 1)); page_remove_rmap(old_page, true); @@ -3651,7 +3649,7 @@ retry_avoidcopy: new_page = old_page; } spin_unlock(ptl); - mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end); + mmu_notifier_invalidate_range_end(&range); out_release_all: restore_reserve_on_error(h, vma, haddr, new_page); put_page(new_page); @@ -4340,21 +4338,21 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma, pte_t pte; struct hstate *h = hstate_vma(vma); unsigned long pages = 0; - unsigned long f_start = start; - unsigned long f_end = end; bool shared_pmd = false; + struct mmu_notifier_range range; /* * In the case of shared PMDs, the area to flush could be beyond - * start/end. Set f_start/f_end to cover the maximum possible + * start/end. Set range.start/range.end to cover the maximum possible * range if PMD sharing is possible. */ - adjust_range_if_pmd_sharing_possible(vma, &f_start, &f_end); + mmu_notifier_range_init(&range, mm, start, end); + adjust_range_if_pmd_sharing_possible(vma, &range.start, &range.end); BUG_ON(address >= end); - flush_cache_range(vma, f_start, f_end); + flush_cache_range(vma, range.start, range.end); - mmu_notifier_invalidate_range_start(mm, f_start, f_end); + mmu_notifier_invalidate_range_start(&range); i_mmap_lock_write(vma->vm_file->f_mapping); for (; address < end; address += huge_page_size(h)) { spinlock_t *ptl; @@ -4405,7 +4403,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma, * did unshare a page of pmds, flush the range corresponding to the pud. */ if (shared_pmd) - flush_hugetlb_tlb_range(vma, f_start, f_end); + flush_hugetlb_tlb_range(vma, range.start, range.end); else flush_hugetlb_tlb_range(vma, start, end); /* @@ -4415,7 +4413,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma, * See Documentation/vm/mmu_notifier.rst */ i_mmap_unlock_write(vma->vm_file->f_mapping); - mmu_notifier_invalidate_range_end(mm, f_start, f_end); + mmu_notifier_invalidate_range_end(&range); return pages << h->order; } diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 43ce2f4d2551..4f017339ddb2 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -944,8 +944,7 @@ static void collapse_huge_page(struct mm_struct *mm, int isolated = 0, result = 0; struct mem_cgroup *memcg; struct vm_area_struct *vma; - unsigned long mmun_start; /* For mmu_notifiers */ - unsigned long mmun_end; /* For mmu_notifiers */ + struct mmu_notifier_range range; gfp_t gfp; VM_BUG_ON(address & ~HPAGE_PMD_MASK); @@ -1017,9 +1016,8 @@ static void collapse_huge_page(struct mm_struct *mm, pte = pte_offset_map(pmd, address); pte_ptl = pte_lockptr(mm, pmd); - mmun_start = address; - mmun_end = address + HPAGE_PMD_SIZE; - mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end); + mmu_notifier_range_init(&range, mm, address, address + HPAGE_PMD_SIZE); + mmu_notifier_invalidate_range_start(&range); pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */ /* * After this gup_fast can't run anymore. This also removes @@ -1029,7 +1027,7 @@ static void collapse_huge_page(struct mm_struct *mm, */ _pmd = pmdp_collapse_flush(vma, address, pmd); spin_unlock(pmd_ptl); - mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end); + mmu_notifier_invalidate_range_end(&range); spin_lock(pte_ptl); isolated = __collapse_huge_page_isolate(vma, address, pte); diff --git a/mm/ksm.c b/mm/ksm.c index 1a088306ef81..38c0360482fa 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -1042,8 +1042,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, }; int swapped; int err = -EFAULT; - unsigned long mmun_start; /* For mmu_notifiers */ - unsigned long mmun_end; /* For mmu_notifiers */ + struct mmu_notifier_range range; pvmw.address = page_address_in_vma(page, vma); if (pvmw.address == -EFAULT) @@ -1051,9 +1050,9 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, BUG_ON(PageTransCompound(page)); - mmun_start = pvmw.address; - mmun_end = pvmw.address + PAGE_SIZE; - mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end); + mmu_notifier_range_init(&range, mm, pvmw.address, + pvmw.address + PAGE_SIZE); + mmu_notifier_invalidate_range_start(&range); if (!page_vma_mapped_walk(&pvmw)) goto out_mn; @@ -1105,7 +1104,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, out_unlock: page_vma_mapped_walk_done(&pvmw); out_mn: - mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end); + mmu_notifier_invalidate_range_end(&range); out: return err; } @@ -1129,8 +1128,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page, spinlock_t *ptl; unsigned long addr; int err = -EFAULT; - unsigned long mmun_start; /* For mmu_notifiers */ - unsigned long mmun_end; /* For mmu_notifiers */ + struct mmu_notifier_range range; addr = page_address_in_vma(page, vma); if (addr == -EFAULT) @@ -1140,9 +1138,8 @@ static int replace_page(struct vm_area_struct *vma, struct page *page, if (!pmd) goto out; - mmun_start = addr; - mmun_end = addr + PAGE_SIZE; - mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end); + mmu_notifier_range_init(&range, mm, addr, addr + PAGE_SIZE); + mmu_notifier_invalidate_range_start(&range); ptep = pte_offset_map_lock(mm, pmd, addr, &ptl); if (!pte_same(*ptep, orig_pte)) { @@ -1188,7 +1185,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page, pte_unmap_unlock(ptep, ptl); err = 0; out_mn: - mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end); + mmu_notifier_invalidate_range_end(&range); out: return err; } diff --git a/mm/madvise.c b/mm/madvise.c index 6cb1ca93e290..21a7881a2db4 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -458,29 +458,30 @@ static void madvise_free_page_range(struct mmu_gather *tlb, static int madvise_free_single_vma(struct vm_area_struct *vma, unsigned long start_addr, unsigned long end_addr) { - unsigned long start, end; struct mm_struct *mm = vma->vm_mm; + struct mmu_notifier_range range; struct mmu_gather tlb; /* MADV_FREE works for only anon vma at the moment */ if (!vma_is_anonymous(vma)) return -EINVAL; - start = max(vma->vm_start, start_addr); - if (start >= vma->vm_end) + range.start = max(vma->vm_start, start_addr); + if (range.start >= vma->vm_end) return -EINVAL; - end = min(vma->vm_end, end_addr); - if (end <= vma->vm_start) + range.end = min(vma->vm_end, end_addr); + if (range.end <= vma->vm_start) return -EINVAL; + mmu_notifier_range_init(&range, mm, range.start, range.end); lru_add_drain(); - tlb_gather_mmu(&tlb, mm, start, end); + tlb_gather_mmu(&tlb, mm, range.start, range.end); update_hiwater_rss(mm); - mmu_notifier_invalidate_range_start(mm, start, end); - madvise_free_page_range(&tlb, vma, start, end); - mmu_notifier_invalidate_range_end(mm, start, end); - tlb_finish_mmu(&tlb, start, end); + mmu_notifier_invalidate_range_start(&range); + madvise_free_page_range(&tlb, vma, range.start, range.end); + mmu_notifier_invalidate_range_end(&range); + tlb_finish_mmu(&tlb, range.start, range.end); return 0; } diff --git a/mm/memory.c b/mm/memory.c index 4ad2d293ddc2..b7a8bfe5f5ec 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -973,8 +973,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm, unsigned long next; unsigned long addr = vma->vm_start; unsigned long end = vma->vm_end; - unsigned long mmun_start; /* For mmu_notifiers */ - unsigned long mmun_end; /* For mmu_notifiers */ + struct mmu_notifier_range range; bool is_cow; int ret; @@ -1008,11 +1007,11 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm, * is_cow_mapping() returns true. */ is_cow = is_cow_mapping(vma->vm_flags); - mmun_start = addr; - mmun_end = end; - if (is_cow) - mmu_notifier_invalidate_range_start(src_mm, mmun_start, - mmun_end); + + if (is_cow) { + mmu_notifier_range_init(&range, src_mm, addr, end); + mmu_notifier_invalidate_range_start(&range); + } ret = 0; dst_pgd = pgd_offset(dst_mm, addr); @@ -1029,7 +1028,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm, } while (dst_pgd++, src_pgd++, addr = next, addr != end); if (is_cow) - mmu_notifier_invalidate_range_end(src_mm, mmun_start, mmun_end); + mmu_notifier_invalidate_range_end(&range); return ret; } @@ -1332,12 +1331,13 @@ void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *vma, unsigned long start_addr, unsigned long end_addr) { - struct mm_struct *mm = vma->vm_mm; + struct mmu_notifier_range range; - mmu_notifier_invalidate_range_start(mm, start_addr, end_addr); + mmu_notifier_range_init(&range, vma->vm_mm, start_addr, end_addr); + mmu_notifier_invalidate_range_start(&range); for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) unmap_single_vma(tlb, vma, start_addr, end_addr, NULL); - mmu_notifier_invalidate_range_end(mm, start_addr, end_addr); + mmu_notifier_invalidate_range_end(&range); } /** @@ -1351,18 +1351,18 @@ void unmap_vmas(struct mmu_gather *tlb, void zap_page_range(struct vm_area_struct *vma, unsigned long start, unsigned long size) { - struct mm_struct *mm = vma->vm_mm; + struct mmu_notifier_range range; struct mmu_gather tlb; - unsigned long end = start + size; lru_add_drain(); - tlb_gather_mmu(&tlb, mm, start, end); - update_hiwater_rss(mm); - mmu_notifier_invalidate_range_start(mm, start, end); - for ( ; vma && vma->vm_start < end; vma = vma->vm_next) - unmap_single_vma(&tlb, vma, start, end, NULL); - mmu_notifier_invalidate_range_end(mm, start, end); - tlb_finish_mmu(&tlb, start, end); + mmu_notifier_range_init(&range, vma->vm_mm, start, start + size); + tlb_gather_mmu(&tlb, vma->vm_mm, start, range.end); + update_hiwater_rss(vma->vm_mm); + mmu_notifier_invalidate_range_start(&range); + for ( ; vma && vma->vm_start < range.end; vma = vma->vm_next) + unmap_single_vma(&tlb, vma, start, range.end, NULL); + mmu_notifier_invalidate_range_end(&range); + tlb_finish_mmu(&tlb, start, range.end); } /** @@ -1377,17 +1377,17 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start, static void zap_page_range_single(struct vm_area_struct *vma, unsigned long address, unsigned long size, struct zap_details *details) { - struct mm_struct *mm = vma->vm_mm; + struct mmu_notifier_range range; struct mmu_gather tlb; - unsigned long end = address + size; lru_add_drain(); - tlb_gather_mmu(&tlb, mm, address, end); - update_hiwater_rss(mm); - mmu_notifier_invalidate_range_start(mm, address, end); - unmap_single_vma(&tlb, vma, address, end, details); - mmu_notifier_invalidate_range_end(mm, address, end); - tlb_finish_mmu(&tlb, address, end); + mmu_notifier_range_init(&range, vma->vm_mm, address, address + size); + tlb_gather_mmu(&tlb, vma->vm_mm, address, range.end); + update_hiwater_rss(vma->vm_mm); + mmu_notifier_invalidate_range_start(&range); + unmap_single_vma(&tlb, vma, address, range.end, details); + mmu_notifier_invalidate_range_end(&range); + tlb_finish_mmu(&tlb, address, range.end); } /** @@ -2247,9 +2247,8 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) struct page *new_page = NULL; pte_t entry; int page_copied = 0; - const unsigned long mmun_start = vmf->address & PAGE_MASK; - const unsigned long mmun_end = mmun_start + PAGE_SIZE; struct mem_cgroup *memcg; + struct mmu_notifier_range range; if (unlikely(anon_vma_prepare(vma))) goto oom; @@ -2272,7 +2271,9 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) __SetPageUptodate(new_page); - mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end); + mmu_notifier_range_init(&range, mm, vmf->address & PAGE_MASK, + (vmf->address & PAGE_MASK) + PAGE_SIZE); + mmu_notifier_invalidate_range_start(&range); /* * Re-check the pte - we dropped the lock @@ -2349,7 +2350,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) * No need to double call mmu_notifier->invalidate_range() callback as * the above ptep_clear_flush_notify() did already call it. */ - mmu_notifier_invalidate_range_only_end(mm, mmun_start, mmun_end); + mmu_notifier_invalidate_range_only_end(&range); if (old_page) { /* * Don't let another task, with possibly unlocked vma, @@ -4030,7 +4031,7 @@ int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address) #endif /* __PAGETABLE_PMD_FOLDED */ static int __follow_pte_pmd(struct mm_struct *mm, unsigned long address, - unsigned long *start, unsigned long *end, + struct mmu_notifier_range *range, pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp) { pgd_t *pgd; @@ -4058,10 +4059,10 @@ static int __follow_pte_pmd(struct mm_struct *mm, unsigned long address, if (!pmdpp) goto out; - if (start && end) { - *start = address & PMD_MASK; - *end = *start + PMD_SIZE; - mmu_notifier_invalidate_range_start(mm, *start, *end); + if (range) { + mmu_notifier_range_init(range, mm, address & PMD_MASK, + (address & PMD_MASK) + PMD_SIZE); + mmu_notifier_invalidate_range_start(range); } *ptlp = pmd_lock(mm, pmd); if (pmd_huge(*pmd)) { @@ -4069,17 +4070,17 @@ static int __follow_pte_pmd(struct mm_struct *mm, unsigned long address, return 0; } spin_unlock(*ptlp); - if (start && end) - mmu_notifier_invalidate_range_end(mm, *start, *end); + if (range) + mmu_notifier_invalidate_range_end(range); } if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd))) goto out; - if (start && end) { - *start = address & PAGE_MASK; - *end = *start + PAGE_SIZE; - mmu_notifier_invalidate_range_start(mm, *start, *end); + if (range) { + range->start = address & PAGE_MASK; + range->end = range->start + PAGE_SIZE; + mmu_notifier_invalidate_range_start(range); } ptep = pte_offset_map_lock(mm, pmd, address, ptlp); if (!pte_present(*ptep)) @@ -4088,8 +4089,8 @@ static int __follow_pte_pmd(struct mm_struct *mm, unsigned long address, return 0; unlock: pte_unmap_unlock(ptep, *ptlp); - if (start && end) - mmu_notifier_invalidate_range_end(mm, *start, *end); + if (range) + mmu_notifier_invalidate_range_end(range); out: return -EINVAL; } @@ -4101,20 +4102,20 @@ static inline int follow_pte(struct mm_struct *mm, unsigned long address, /* (void) is needed to make gcc happy */ (void) __cond_lock(*ptlp, - !(res = __follow_pte_pmd(mm, address, NULL, NULL, + !(res = __follow_pte_pmd(mm, address, NULL, ptepp, NULL, ptlp))); return res; } int follow_pte_pmd(struct mm_struct *mm, unsigned long address, - unsigned long *start, unsigned long *end, - pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp) + struct mmu_notifier_range *range, + pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp) { int res; /* (void) is needed to make gcc happy */ (void) __cond_lock(*ptlp, - !(res = __follow_pte_pmd(mm, address, start, end, + !(res = __follow_pte_pmd(mm, address, range, ptepp, pmdpp, ptlp))); return res; } diff --git a/mm/migrate.c b/mm/migrate.c index acda06f99754..462163f5f278 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2299,6 +2299,7 @@ next: */ static void migrate_vma_collect(struct migrate_vma *migrate) { + struct mmu_notifier_range range; struct mm_walk mm_walk; mm_walk.pmd_entry = migrate_vma_collect_pmd; @@ -2310,13 +2311,11 @@ static void migrate_vma_collect(struct migrate_vma *migrate) mm_walk.mm = migrate->vma->vm_mm; mm_walk.private = migrate; - mmu_notifier_invalidate_range_start(mm_walk.mm, - migrate->start, - migrate->end); + mmu_notifier_range_init(&range, mm_walk.mm, migrate->start, + migrate->end); + mmu_notifier_invalidate_range_start(&range); walk_page_range(migrate->start, migrate->end, &mm_walk); - mmu_notifier_invalidate_range_end(mm_walk.mm, - migrate->start, - migrate->end); + mmu_notifier_invalidate_range_end(&range); migrate->end = migrate->start + (migrate->npages << PAGE_SHIFT); } @@ -2697,9 +2696,8 @@ static void migrate_vma_pages(struct migrate_vma *migrate) { const unsigned long npages = migrate->npages; const unsigned long start = migrate->start; - struct vm_area_struct *vma = migrate->vma; - struct mm_struct *mm = vma->vm_mm; - unsigned long addr, i, mmu_start; + struct mmu_notifier_range range; + unsigned long addr, i; bool notified = false; for (i = 0, addr = start; i < npages; addr += PAGE_SIZE, i++) { @@ -2718,11 +2716,12 @@ static void migrate_vma_pages(struct migrate_vma *migrate) continue; } if (!notified) { - mmu_start = addr; notified = true; - mmu_notifier_invalidate_range_start(mm, - mmu_start, - migrate->end); + + mmu_notifier_range_init(&range, + migrate->vma->vm_mm, + addr, migrate->end); + mmu_notifier_invalidate_range_start(&range); } migrate_vma_insert_page(migrate, addr, newpage, &migrate->src[i], @@ -2763,8 +2762,7 @@ static void migrate_vma_pages(struct migrate_vma *migrate) * did already call it. */ if (notified) - mmu_notifier_invalidate_range_only_end(mm, mmu_start, - migrate->end); + mmu_notifier_invalidate_range_only_end(&range); } /* diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c index 74a7dc3d11c8..9c884abc7850 100644 --- a/mm/mmu_notifier.c +++ b/mm/mmu_notifier.c @@ -167,28 +167,20 @@ void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address, srcu_read_unlock(&srcu, id); } -int __mmu_notifier_invalidate_range_start(struct mm_struct *mm, - unsigned long start, unsigned long end, - bool blockable) +int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range) { - struct mmu_notifier_range _range, *range = &_range; struct mmu_notifier *mn; int ret = 0; int id; - range->blockable = blockable; - range->start = start; - range->end = end; - range->mm = mm; - id = srcu_read_lock(&srcu); - hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) { + hlist_for_each_entry_rcu(mn, &range->mm->mmu_notifier_mm->list, hlist) { if (mn->ops->invalidate_range_start) { int _ret = mn->ops->invalidate_range_start(mn, range); if (_ret) { pr_info("%pS callback failed with %d in %sblockable context.\n", - mn->ops->invalidate_range_start, _ret, - !blockable ? "non-" : ""); + mn->ops->invalidate_range_start, _ret, + !range->blockable ? "non-" : ""); ret = _ret; } } @@ -199,27 +191,14 @@ int __mmu_notifier_invalidate_range_start(struct mm_struct *mm, } EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start); -void __mmu_notifier_invalidate_range_end(struct mm_struct *mm, - unsigned long start, - unsigned long end, +void __mmu_notifier_invalidate_range_end(struct mmu_notifier_range *range, bool only_end) { - struct mmu_notifier_range _range, *range = &_range; struct mmu_notifier *mn; int id; - /* - * The end call back will never be call if the start refused to go - * through because of blockable was false so here assume that we - * can block. - */ - range->blockable = true; - range->start = start; - range->end = end; - range->mm = mm; - id = srcu_read_lock(&srcu); - hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) { + hlist_for_each_entry_rcu(mn, &range->mm->mmu_notifier_mm->list, hlist) { /* * Call invalidate_range here too to avoid the need for the * subsystem of having to register an invalidate_range_end @@ -234,7 +213,9 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm, * already happen under page table lock. */ if (!only_end && mn->ops->invalidate_range) - mn->ops->invalidate_range(mn, mm, start, end); + mn->ops->invalidate_range(mn, range->mm, + range->start, + range->end); if (mn->ops->invalidate_range_end) mn->ops->invalidate_range_end(mn, range); } diff --git a/mm/mprotect.c b/mm/mprotect.c index 6d331620b9e5..36cb358db170 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -167,11 +167,12 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma, pgprot_t newprot, int dirty_accountable, int prot_numa) { pmd_t *pmd; - struct mm_struct *mm = vma->vm_mm; unsigned long next; unsigned long pages = 0; unsigned long nr_huge_updates = 0; - unsigned long mni_start = 0; + struct mmu_notifier_range range; + + range.start = 0; pmd = pmd_offset(pud, addr); do { @@ -183,9 +184,9 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma, goto next; /* invoke the mmu notifier if the pmd is populated */ - if (!mni_start) { - mni_start = addr; - mmu_notifier_invalidate_range_start(mm, mni_start, end); + if (!range.start) { + mmu_notifier_range_init(&range, vma->vm_mm, addr, end); + mmu_notifier_invalidate_range_start(&range); } if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) { @@ -214,8 +215,8 @@ next: cond_resched(); } while (pmd++, addr = next, addr != end); - if (mni_start) - mmu_notifier_invalidate_range_end(mm, mni_start, end); + if (range.start) + mmu_notifier_invalidate_range_end(&range); if (nr_huge_updates) count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates); diff --git a/mm/mremap.c b/mm/mremap.c index 7f9f9180e401..def01d86e36f 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -197,16 +197,14 @@ unsigned long move_page_tables(struct vm_area_struct *vma, bool need_rmap_locks) { unsigned long extent, next, old_end; + struct mmu_notifier_range range; pmd_t *old_pmd, *new_pmd; - unsigned long mmun_start; /* For mmu_notifiers */ - unsigned long mmun_end; /* For mmu_notifiers */ old_end = old_addr + len; flush_cache_range(vma, old_addr, old_end); - mmun_start = old_addr; - mmun_end = old_end; - mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start, mmun_end); + mmu_notifier_range_init(&range, vma->vm_mm, old_addr, old_end); + mmu_notifier_invalidate_range_start(&range); for (; old_addr < old_end; old_addr += extent, new_addr += extent) { cond_resched(); @@ -247,7 +245,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma, new_pmd, new_addr, need_rmap_locks); } - mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end); + mmu_notifier_invalidate_range_end(&range); return len + old_addr - old_end; /* how much done */ } diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 5442cb12e4ed..f0e8cd9edb1a 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -528,19 +528,20 @@ bool __oom_reap_task_mm(struct mm_struct *mm) * count elevated without a good reason. */ if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED)) { - const unsigned long start = vma->vm_start; - const unsigned long end = vma->vm_end; + struct mmu_notifier_range range; struct mmu_gather tlb; - tlb_gather_mmu(&tlb, mm, start, end); - if (mmu_notifier_invalidate_range_start_nonblock(mm, start, end)) { - tlb_finish_mmu(&tlb, start, end); + mmu_notifier_range_init(&range, mm, vma->vm_start, + vma->vm_end); + tlb_gather_mmu(&tlb, mm, range.start, range.end); + if (mmu_notifier_invalidate_range_start_nonblock(&range)) { + tlb_finish_mmu(&tlb, range.start, range.end); ret = false; continue; } - unmap_page_range(&tlb, vma, start, end, NULL); - mmu_notifier_invalidate_range_end(mm, start, end); - tlb_finish_mmu(&tlb, start, end); + unmap_page_range(&tlb, vma, range.start, range.end, NULL); + mmu_notifier_invalidate_range_end(&range); + tlb_finish_mmu(&tlb, range.start, range.end); } } diff --git a/mm/rmap.c b/mm/rmap.c index 85b7f9423352..c75f72f6fe0e 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -889,15 +889,17 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma, .address = address, .flags = PVMW_SYNC, }; - unsigned long start = address, end; + struct mmu_notifier_range range; int *cleaned = arg; /* * We have to assume the worse case ie pmd for invalidation. Note that * the page can not be free from this function. */ - end = min(vma->vm_end, start + (PAGE_SIZE << compound_order(page))); - mmu_notifier_invalidate_range_start(vma->vm_mm, start, end); + mmu_notifier_range_init(&range, vma->vm_mm, address, + min(vma->vm_end, address + + (PAGE_SIZE << compound_order(page)))); + mmu_notifier_invalidate_range_start(&range); while (page_vma_mapped_walk(&pvmw)) { unsigned long cstart; @@ -949,7 +951,7 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma, (*cleaned)++; } - mmu_notifier_invalidate_range_end(vma->vm_mm, start, end); + mmu_notifier_invalidate_range_end(&range); return true; } @@ -1345,7 +1347,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, pte_t pteval; struct page *subpage; bool ret = true; - unsigned long start = address, end; + struct mmu_notifier_range range; enum ttu_flags flags = (enum ttu_flags)arg; /* munlock has nothing to gain from examining un-locked vmas */ @@ -1369,15 +1371,18 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, * Note that the page can not be free in this function as call of * try_to_unmap() must hold a reference on the page. */ - end = min(vma->vm_end, start + (PAGE_SIZE << compound_order(page))); + mmu_notifier_range_init(&range, vma->vm_mm, vma->vm_start, + min(vma->vm_end, vma->vm_start + + (PAGE_SIZE << compound_order(page)))); if (PageHuge(page)) { /* * If sharing is possible, start and end will be adjusted * accordingly. */ - adjust_range_if_pmd_sharing_possible(vma, &start, &end); + adjust_range_if_pmd_sharing_possible(vma, &range.start, + &range.end); } - mmu_notifier_invalidate_range_start(vma->vm_mm, start, end); + mmu_notifier_invalidate_range_start(&range); while (page_vma_mapped_walk(&pvmw)) { #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION @@ -1428,9 +1433,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, * we must flush them all. start/end were * already adjusted above to cover this range. */ - flush_cache_range(vma, start, end); - flush_tlb_range(vma, start, end); - mmu_notifier_invalidate_range(mm, start, end); + flush_cache_range(vma, range.start, range.end); + flush_tlb_range(vma, range.start, range.end); + mmu_notifier_invalidate_range(mm, range.start, + range.end); /* * The ref count of the PMD page was dropped @@ -1650,7 +1656,7 @@ discard: put_page(page); } - mmu_notifier_invalidate_range_end(vma->vm_mm, start, end); + mmu_notifier_invalidate_range_end(&range); return ret; } -- cgit v1.3-7-g2ca7 From 063a7d1d3623db31ca5d2309cab6030ebf93b72f Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Fri, 28 Dec 2018 00:39:46 -0800 Subject: mm/hmm: fix memremap.h, move dev_page_fault_t callback to hmm MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The kbuild robot reported the following on a development branch that used memremap.h in a new path: In file included from arch/m68k/include/asm/pgtable_mm.h:148:0, from arch/m68k/include/asm/pgtable.h:5, from include/linux/memremap.h:7, from drivers//dax/bus.c:3: arch/m68k/include/asm/motorola_pgtable.h: In function 'pgd_offset': >> arch/m68k/include/asm/motorola_pgtable.h:199:11: error: dereferencing pointer to incomplete type 'const struct mm_struct' return mm->pgd + pgd_index(address); ^~ The ->page_fault() callback is specific to HMM. Move it to 'struct hmm_devmem' where the unusual asm/pgtable.h dependency can be contained in include/linux/hmm.h. Longer term refactoring this dependency out of HMM is recommended, but in the meantime memremap.h remains generic. Link: http://lkml.kernel.org/r/154534090899.3120190.6652620807617715272.stgit@dwillia2-desk3.amr.corp.intel.com Fixes: 5042db43cc26 ("mm/ZONE_DEVICE: new type of ZONE_DEVICE memory...") Signed-off-by: Dan Williams Reviewed-by: "Jérôme Glisse" Cc: Logan Gunthorpe Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/hmm.h | 24 ++++++++++++++++++++++++ include/linux/memremap.h | 32 -------------------------------- kernel/memremap.c | 6 +++++- mm/hmm.c | 4 ++-- 4 files changed, 31 insertions(+), 35 deletions(-) (limited to 'kernel') diff --git a/include/linux/hmm.h b/include/linux/hmm.h index ed89fbc525d2..66f9ebbb1df3 100644 --- a/include/linux/hmm.h +++ b/include/linux/hmm.h @@ -69,6 +69,7 @@ #define LINUX_HMM_H #include +#include #if IS_ENABLED(CONFIG_HMM) @@ -486,6 +487,7 @@ struct hmm_devmem_ops { * @device: device to bind resource to * @ops: memory operations callback * @ref: per CPU refcount + * @page_fault: callback when CPU fault on an unaddressable device page * * This an helper structure for device drivers that do not wish to implement * the gory details related to hotplugging new memoy and allocating struct @@ -493,7 +495,28 @@ struct hmm_devmem_ops { * * Device drivers can directly use ZONE_DEVICE memory on their own if they * wish to do so. + * + * The page_fault() callback must migrate page back, from device memory to + * system memory, so that the CPU can access it. This might fail for various + * reasons (device issues, device have been unplugged, ...). When such error + * conditions happen, the page_fault() callback must return VM_FAULT_SIGBUS and + * set the CPU page table entry to "poisoned". + * + * Note that because memory cgroup charges are transferred to the device memory, + * this should never fail due to memory restrictions. However, allocation + * of a regular system page might still fail because we are out of memory. If + * that happens, the page_fault() callback must return VM_FAULT_OOM. + * + * The page_fault() callback can also try to migrate back multiple pages in one + * chunk, as an optimization. It must, however, prioritize the faulting address + * over all the others. */ +typedef int (*dev_page_fault_t)(struct vm_area_struct *vma, + unsigned long addr, + const struct page *page, + unsigned int flags, + pmd_t *pmdp); + struct hmm_devmem { struct completion completion; unsigned long pfn_first; @@ -503,6 +526,7 @@ struct hmm_devmem { struct dev_pagemap pagemap; const struct hmm_devmem_ops *ops; struct percpu_ref ref; + dev_page_fault_t page_fault; }; /* diff --git a/include/linux/memremap.h b/include/linux/memremap.h index 55db66b3716f..f0628660d541 100644 --- a/include/linux/memremap.h +++ b/include/linux/memremap.h @@ -4,8 +4,6 @@ #include #include -#include - struct resource; struct device; @@ -66,47 +64,18 @@ enum memory_type { }; /* - * For MEMORY_DEVICE_PRIVATE we use ZONE_DEVICE and extend it with two - * callbacks: - * page_fault() - * page_free() - * * Additional notes about MEMORY_DEVICE_PRIVATE may be found in * include/linux/hmm.h and Documentation/vm/hmm.rst. There is also a brief * explanation in include/linux/memory_hotplug.h. * - * The page_fault() callback must migrate page back, from device memory to - * system memory, so that the CPU can access it. This might fail for various - * reasons (device issues, device have been unplugged, ...). When such error - * conditions happen, the page_fault() callback must return VM_FAULT_SIGBUS and - * set the CPU page table entry to "poisoned". - * - * Note that because memory cgroup charges are transferred to the device memory, - * this should never fail due to memory restrictions. However, allocation - * of a regular system page might still fail because we are out of memory. If - * that happens, the page_fault() callback must return VM_FAULT_OOM. - * - * The page_fault() callback can also try to migrate back multiple pages in one - * chunk, as an optimization. It must, however, prioritize the faulting address - * over all the others. - * - * * The page_free() callback is called once the page refcount reaches 1 * (ZONE_DEVICE pages never reach 0 refcount unless there is a refcount bug. * This allows the device driver to implement its own memory management.) - * - * For MEMORY_DEVICE_PUBLIC only the page_free() callback matter. */ -typedef int (*dev_page_fault_t)(struct vm_area_struct *vma, - unsigned long addr, - const struct page *page, - unsigned int flags, - pmd_t *pmdp); typedef void (*dev_page_free_t)(struct page *page, void *data); /** * struct dev_pagemap - metadata for ZONE_DEVICE mappings - * @page_fault: callback when CPU fault on an unaddressable device page * @page_free: free page callback when page refcount reaches 1 * @altmap: pre-allocated/reserved memory for vmemmap allocations * @res: physical address range covered by @ref @@ -117,7 +86,6 @@ typedef void (*dev_page_free_t)(struct page *page, void *data); * @type: memory type: see MEMORY_* in memory_hotplug.h */ struct dev_pagemap { - dev_page_fault_t page_fault; dev_page_free_t page_free; struct vmem_altmap altmap; bool altmap_valid; diff --git a/kernel/memremap.c b/kernel/memremap.c index 0d5603d76c37..a856cb5ff192 100644 --- a/kernel/memremap.c +++ b/kernel/memremap.c @@ -11,6 +11,7 @@ #include #include #include +#include static DEFINE_XARRAY(pgmap_array); #define SECTION_MASK ~((1UL << PA_SECTION_SHIFT) - 1) @@ -24,6 +25,9 @@ vm_fault_t device_private_entry_fault(struct vm_area_struct *vma, pmd_t *pmdp) { struct page *page = device_private_entry_to_page(entry); + struct hmm_devmem *devmem; + + devmem = container_of(page->pgmap, typeof(*devmem), pagemap); /* * The page_fault() callback must migrate page back to system memory @@ -39,7 +43,7 @@ vm_fault_t device_private_entry_fault(struct vm_area_struct *vma, * There is a more in-depth description of what that callback can and * cannot do, in include/linux/memremap.h */ - return page->pgmap->page_fault(vma, addr, page, flags, pmdp); + return devmem->page_fault(vma, addr, page, flags, pmdp); } EXPORT_SYMBOL(device_private_entry_fault); #endif /* CONFIG_DEVICE_PRIVATE */ diff --git a/mm/hmm.c b/mm/hmm.c index 789587731217..a04e4b810610 100644 --- a/mm/hmm.c +++ b/mm/hmm.c @@ -1087,10 +1087,10 @@ struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops, devmem->pfn_first = devmem->resource->start >> PAGE_SHIFT; devmem->pfn_last = devmem->pfn_first + (resource_size(devmem->resource) >> PAGE_SHIFT); + devmem->page_fault = hmm_devmem_fault; devmem->pagemap.type = MEMORY_DEVICE_PRIVATE; devmem->pagemap.res = *devmem->resource; - devmem->pagemap.page_fault = hmm_devmem_fault; devmem->pagemap.page_free = hmm_devmem_free; devmem->pagemap.altmap_valid = false; devmem->pagemap.ref = &devmem->ref; @@ -1141,10 +1141,10 @@ struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops, devmem->pfn_first = devmem->resource->start >> PAGE_SHIFT; devmem->pfn_last = devmem->pfn_first + (resource_size(devmem->resource) >> PAGE_SHIFT); + devmem->page_fault = hmm_devmem_fault; devmem->pagemap.type = MEMORY_DEVICE_PUBLIC; devmem->pagemap.res = *devmem->resource; - devmem->pagemap.page_fault = hmm_devmem_fault; devmem->pagemap.page_free = hmm_devmem_free; devmem->pagemap.altmap_valid = false; devmem->pagemap.ref = &devmem->ref; -- cgit v1.3-7-g2ca7 From 0f4991e8fd48987ae476a92cdee6bfec4aff31b8 Mon Sep 17 00:00:00 2001 From: YueHaibing Date: Fri, 28 Dec 2018 00:40:00 -0800 Subject: kernel/fork.c: mark 'stack_vm_area' with __maybe_unused Fixes gcc '-Wunused-but-set-variable' warning when CONFIG_VMAP_STACK is not set: kernel/fork.c: In function 'dup_task_struct': kernel/fork.c:843:20: warning: variable 'stack_vm_area' set but not used [-Wunused-but-set-variable] Link: http://lkml.kernel.org/r/1545965190-2381-1-git-send-email-yuehaibing@huawei.com Signed-off-by: YueHaibing Reviewed-by: Andrew Morton Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/fork.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/fork.c b/kernel/fork.c index c979605fe806..d439c48ecf18 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -841,7 +841,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node) { struct task_struct *tsk; unsigned long *stack; - struct vm_struct *stack_vm_area; + struct vm_struct *stack_vm_area __maybe_unused; int err; if (node == NUMA_NO_NODE) -- cgit v1.3-7-g2ca7 From 1a80dade010c7a7f4885a4c4c2a7ac22cc7b34df Mon Sep 17 00:00:00 2001 From: Matthew Wilcox Date: Fri, 28 Dec 2018 07:22:26 -0800 Subject: Fix failure path in alloc_pid() The failure path removes the allocated PIDs from the wrong namespace. This could lead to us inadvertently reusing PIDs in the leaf namespace and leaking PIDs in parent namespaces. Fixes: 95846ecf9dac ("pid: replace pid bitmap implementation with IDR API") Cc: Signed-off-by: Matthew Wilcox Acked-by: "Eric W. Biederman" Reviewed-by: Oleg Nesterov Signed-off-by: Linus Torvalds --- kernel/pid.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/pid.c b/kernel/pid.c index b2f6c506035d..20881598bdfa 100644 --- a/kernel/pid.c +++ b/kernel/pid.c @@ -233,8 +233,10 @@ out_unlock: out_free: spin_lock_irq(&pidmap_lock); - while (++i <= ns->level) - idr_remove(&ns->idr, (pid->numbers + i)->nr); + while (++i <= ns->level) { + upid = pid->numbers + i; + idr_remove(&upid->ns->idr, upid->nr); + } /* On failure to allocate the first pid, reset the state */ if (ns->pid_allocated == PIDNS_ADDING) -- cgit v1.3-7-g2ca7 From 9ef7fa507d6b53a96de4da3298c5f01bde603c0a Mon Sep 17 00:00:00 2001 From: Douglas Anderson Date: Tue, 4 Dec 2018 19:38:25 -0800 Subject: kgdb: Remove irq flags from roundup The function kgdb_roundup_cpus() was passed a parameter that was documented as: > the flags that will be used when restoring the interrupts. There is > local_irq_save() call before kgdb_roundup_cpus(). Nobody used those flags. Anyone who wanted to temporarily turn on interrupts just did local_irq_enable() and local_irq_disable() without looking at them. So we can definitely remove the flags. Signed-off-by: Douglas Anderson Cc: Vineet Gupta Cc: Russell King Cc: Catalin Marinas Cc: Will Deacon Cc: Richard Kuo Cc: Ralf Baechle Cc: Paul Burton Cc: James Hogan Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: Michael Ellerman Cc: Yoshinori Sato Cc: Rich Felker Cc: "David S. Miller" Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: "H. Peter Anvin" Acked-by: Will Deacon Signed-off-by: Daniel Thompson --- arch/arc/kernel/kgdb.c | 2 +- arch/arm/kernel/kgdb.c | 2 +- arch/arm64/kernel/kgdb.c | 2 +- arch/hexagon/kernel/kgdb.c | 9 ++------- arch/mips/kernel/kgdb.c | 2 +- arch/powerpc/kernel/kgdb.c | 2 +- arch/sh/kernel/kgdb.c | 2 +- arch/sparc/kernel/smp_64.c | 2 +- arch/x86/kernel/kgdb.c | 9 ++------- include/linux/kgdb.h | 9 ++------- kernel/debug/debug_core.c | 2 +- 11 files changed, 14 insertions(+), 29 deletions(-) (limited to 'kernel') diff --git a/arch/arc/kernel/kgdb.c b/arch/arc/kernel/kgdb.c index 9a3c34af2ae8..0932851028e0 100644 --- a/arch/arc/kernel/kgdb.c +++ b/arch/arc/kernel/kgdb.c @@ -197,7 +197,7 @@ static void kgdb_call_nmi_hook(void *ignored) kgdb_nmicallback(raw_smp_processor_id(), NULL); } -void kgdb_roundup_cpus(unsigned long flags) +void kgdb_roundup_cpus(void) { local_irq_enable(); smp_call_function(kgdb_call_nmi_hook, NULL, 0); diff --git a/arch/arm/kernel/kgdb.c b/arch/arm/kernel/kgdb.c index caa0dbe3dc61..f21077b077be 100644 --- a/arch/arm/kernel/kgdb.c +++ b/arch/arm/kernel/kgdb.c @@ -175,7 +175,7 @@ static void kgdb_call_nmi_hook(void *ignored) kgdb_nmicallback(raw_smp_processor_id(), get_irq_regs()); } -void kgdb_roundup_cpus(unsigned long flags) +void kgdb_roundup_cpus(void) { local_irq_enable(); smp_call_function(kgdb_call_nmi_hook, NULL, 0); diff --git a/arch/arm64/kernel/kgdb.c b/arch/arm64/kernel/kgdb.c index a20de58061a8..12c339ff6e75 100644 --- a/arch/arm64/kernel/kgdb.c +++ b/arch/arm64/kernel/kgdb.c @@ -289,7 +289,7 @@ static void kgdb_call_nmi_hook(void *ignored) kgdb_nmicallback(raw_smp_processor_id(), get_irq_regs()); } -void kgdb_roundup_cpus(unsigned long flags) +void kgdb_roundup_cpus(void) { local_irq_enable(); smp_call_function(kgdb_call_nmi_hook, NULL, 0); diff --git a/arch/hexagon/kernel/kgdb.c b/arch/hexagon/kernel/kgdb.c index 16c24b22d0b2..012e0e230ac2 100644 --- a/arch/hexagon/kernel/kgdb.c +++ b/arch/hexagon/kernel/kgdb.c @@ -119,17 +119,12 @@ void kgdb_arch_set_pc(struct pt_regs *regs, unsigned long pc) /** * kgdb_roundup_cpus - Get other CPUs into a holding pattern - * @flags: Current IRQ state * * On SMP systems, we need to get the attention of the other CPUs * and get them be in a known state. This should do what is needed * to get the other CPUs to call kgdb_wait(). Note that on some arches, * the NMI approach is not used for rounding up all the CPUs. For example, - * in case of MIPS, smp_call_function() is used to roundup CPUs. In - * this case, we have to make sure that interrupts are enabled before - * calling smp_call_function(). The argument to this function is - * the flags that will be used when restoring the interrupts. There is - * local_irq_save() call before kgdb_roundup_cpus(). + * in case of MIPS, smp_call_function() is used to roundup CPUs. * * On non-SMP systems, this is not called. */ @@ -139,7 +134,7 @@ static void hexagon_kgdb_nmi_hook(void *ignored) kgdb_nmicallback(raw_smp_processor_id(), get_irq_regs()); } -void kgdb_roundup_cpus(unsigned long flags) +void kgdb_roundup_cpus(void) { local_irq_enable(); smp_call_function(hexagon_kgdb_nmi_hook, NULL, 0); diff --git a/arch/mips/kernel/kgdb.c b/arch/mips/kernel/kgdb.c index eb6c0d582626..2b05effc17b4 100644 --- a/arch/mips/kernel/kgdb.c +++ b/arch/mips/kernel/kgdb.c @@ -219,7 +219,7 @@ static void kgdb_call_nmi_hook(void *ignored) set_fs(old_fs); } -void kgdb_roundup_cpus(unsigned long flags) +void kgdb_roundup_cpus(void) { local_irq_enable(); smp_call_function(kgdb_call_nmi_hook, NULL, 0); diff --git a/arch/powerpc/kernel/kgdb.c b/arch/powerpc/kernel/kgdb.c index 59c578f865aa..b0e804844be0 100644 --- a/arch/powerpc/kernel/kgdb.c +++ b/arch/powerpc/kernel/kgdb.c @@ -124,7 +124,7 @@ static int kgdb_call_nmi_hook(struct pt_regs *regs) } #ifdef CONFIG_SMP -void kgdb_roundup_cpus(unsigned long flags) +void kgdb_roundup_cpus(void) { smp_send_debugger_break(); } diff --git a/arch/sh/kernel/kgdb.c b/arch/sh/kernel/kgdb.c index 4f04c6638a4d..cc57630f6bf2 100644 --- a/arch/sh/kernel/kgdb.c +++ b/arch/sh/kernel/kgdb.c @@ -319,7 +319,7 @@ static void kgdb_call_nmi_hook(void *ignored) kgdb_nmicallback(raw_smp_processor_id(), get_irq_regs()); } -void kgdb_roundup_cpus(unsigned long flags) +void kgdb_roundup_cpus(void) { local_irq_enable(); smp_call_function(kgdb_call_nmi_hook, NULL, 0); diff --git a/arch/sparc/kernel/smp_64.c b/arch/sparc/kernel/smp_64.c index 4792e08ad36b..f45d876983f1 100644 --- a/arch/sparc/kernel/smp_64.c +++ b/arch/sparc/kernel/smp_64.c @@ -1014,7 +1014,7 @@ void flush_dcache_page_all(struct mm_struct *mm, struct page *page) } #ifdef CONFIG_KGDB -void kgdb_roundup_cpus(unsigned long flags) +void kgdb_roundup_cpus(void) { smp_cross_call(&xcall_kgdb_capture, 0, 0, 0); } diff --git a/arch/x86/kernel/kgdb.c b/arch/x86/kernel/kgdb.c index 8e36f249646e..ac6291a4178d 100644 --- a/arch/x86/kernel/kgdb.c +++ b/arch/x86/kernel/kgdb.c @@ -422,21 +422,16 @@ static void kgdb_disable_hw_debug(struct pt_regs *regs) #ifdef CONFIG_SMP /** * kgdb_roundup_cpus - Get other CPUs into a holding pattern - * @flags: Current IRQ state * * On SMP systems, we need to get the attention of the other CPUs * and get them be in a known state. This should do what is needed * to get the other CPUs to call kgdb_wait(). Note that on some arches, * the NMI approach is not used for rounding up all the CPUs. For example, - * in case of MIPS, smp_call_function() is used to roundup CPUs. In - * this case, we have to make sure that interrupts are enabled before - * calling smp_call_function(). The argument to this function is - * the flags that will be used when restoring the interrupts. There is - * local_irq_save() call before kgdb_roundup_cpus(). + * in case of MIPS, smp_call_function() is used to roundup CPUs. * * On non-SMP systems, this is not called. */ -void kgdb_roundup_cpus(unsigned long flags) +void kgdb_roundup_cpus(void) { apic->send_IPI_allbutself(APIC_DM_NMI); } diff --git a/include/linux/kgdb.h b/include/linux/kgdb.h index e465bb15912d..05e5b2eb0d32 100644 --- a/include/linux/kgdb.h +++ b/include/linux/kgdb.h @@ -178,21 +178,16 @@ kgdb_arch_handle_exception(int vector, int signo, int err_code, /** * kgdb_roundup_cpus - Get other CPUs into a holding pattern - * @flags: Current IRQ state * * On SMP systems, we need to get the attention of the other CPUs * and get them into a known state. This should do what is needed * to get the other CPUs to call kgdb_wait(). Note that on some arches, * the NMI approach is not used for rounding up all the CPUs. For example, - * in case of MIPS, smp_call_function() is used to roundup CPUs. In - * this case, we have to make sure that interrupts are enabled before - * calling smp_call_function(). The argument to this function is - * the flags that will be used when restoring the interrupts. There is - * local_irq_save() call before kgdb_roundup_cpus(). + * in case of MIPS, smp_call_function() is used to roundup CPUs. * * On non-SMP systems, this is not called. */ -extern void kgdb_roundup_cpus(unsigned long flags); +extern void kgdb_roundup_cpus(void); /** * kgdb_arch_set_pc - Generic call back to the program counter diff --git a/kernel/debug/debug_core.c b/kernel/debug/debug_core.c index 65c0f1363788..f3cadda45f07 100644 --- a/kernel/debug/debug_core.c +++ b/kernel/debug/debug_core.c @@ -593,7 +593,7 @@ return_normal: /* Signal the other CPUs to enter kgdb_wait() */ else if ((!kgdb_single_step) && kgdb_do_roundup) - kgdb_roundup_cpus(flags); + kgdb_roundup_cpus(); #endif /* -- cgit v1.3-7-g2ca7 From 3cd99ac3559855f69afbc1d5080e17eaa12394ff Mon Sep 17 00:00:00 2001 From: Douglas Anderson Date: Tue, 4 Dec 2018 19:38:26 -0800 Subject: kgdb: Fix kgdb_roundup_cpus() for arches who used smp_call_function() When I had lockdep turned on and dropped into kgdb I got a nice splat on my system. Specifically it hit: DEBUG_LOCKS_WARN_ON(current->hardirq_context) Specifically it looked like this: sysrq: SysRq : DEBUG ------------[ cut here ]------------ DEBUG_LOCKS_WARN_ON(current->hardirq_context) WARNING: CPU: 0 PID: 0 at .../kernel/locking/lockdep.c:2875 lockdep_hardirqs_on+0xf0/0x160 CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.19.0 #27 pstate: 604003c9 (nZCv DAIF +PAN -UAO) pc : lockdep_hardirqs_on+0xf0/0x160 ... Call trace: lockdep_hardirqs_on+0xf0/0x160 trace_hardirqs_on+0x188/0x1ac kgdb_roundup_cpus+0x14/0x3c kgdb_cpu_enter+0x53c/0x5cc kgdb_handle_exception+0x180/0x1d4 kgdb_compiled_brk_fn+0x30/0x3c brk_handler+0x134/0x178 do_debug_exception+0xfc/0x178 el1_dbg+0x18/0x78 kgdb_breakpoint+0x34/0x58 sysrq_handle_dbg+0x54/0x5c __handle_sysrq+0x114/0x21c handle_sysrq+0x30/0x3c qcom_geni_serial_isr+0x2dc/0x30c ... ... irq event stamp: ...45 hardirqs last enabled at (...44): [...] __do_softirq+0xd8/0x4e4 hardirqs last disabled at (...45): [...] el1_irq+0x74/0x130 softirqs last enabled at (...42): [...] _local_bh_enable+0x2c/0x34 softirqs last disabled at (...43): [...] irq_exit+0xa8/0x100 ---[ end trace adf21f830c46e638 ]--- Looking closely at it, it seems like a really bad idea to be calling local_irq_enable() in kgdb_roundup_cpus(). If nothing else that seems like it could violate spinlock semantics and cause a deadlock. Instead, let's use a private csd alongside smp_call_function_single_async() to round up the other CPUs. Using smp_call_function_single_async() doesn't require interrupts to be enabled so we can remove the offending bit of code. In order to avoid duplicating this across all the architectures that use the default kgdb_roundup_cpus(), we'll add a "weak" implementation to debug_core.c. Looking at all the people who previously had copies of this code, there were a few variants. I've attempted to keep the variants working like they used to. Specifically: * For arch/arc we passed NULL to kgdb_nmicallback() instead of get_irq_regs(). * For arch/mips there was a bit of extra code around kgdb_nmicallback() NOTE: In this patch we will still get into trouble if we try to round up a CPU that failed to round up before. We'll try to round it up again and potentially hang when we try to grab the csd lock. That's not new behavior but we'll still try to do better in a future patch. Suggested-by: Daniel Thompson Signed-off-by: Douglas Anderson Cc: Vineet Gupta Cc: Russell King Cc: Catalin Marinas Cc: Will Deacon Cc: Richard Kuo Cc: Ralf Baechle Cc: Paul Burton Cc: James Hogan Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: Michael Ellerman Cc: Yoshinori Sato Cc: Rich Felker Cc: "David S. Miller" Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: "H. Peter Anvin" Acked-by: Will Deacon Signed-off-by: Daniel Thompson --- arch/arc/kernel/kgdb.c | 10 ++-------- arch/arm/kernel/kgdb.c | 12 ------------ arch/arm64/kernel/kgdb.c | 12 ------------ arch/hexagon/kernel/kgdb.c | 27 --------------------------- arch/mips/kernel/kgdb.c | 9 +-------- arch/powerpc/kernel/kgdb.c | 4 ++-- arch/sh/kernel/kgdb.c | 12 ------------ include/linux/kgdb.h | 15 +++++++++++++-- kernel/debug/debug_core.c | 41 +++++++++++++++++++++++++++++++++++++++++ 9 files changed, 59 insertions(+), 83 deletions(-) (limited to 'kernel') diff --git a/arch/arc/kernel/kgdb.c b/arch/arc/kernel/kgdb.c index 0932851028e0..68d9fe4b5aa7 100644 --- a/arch/arc/kernel/kgdb.c +++ b/arch/arc/kernel/kgdb.c @@ -192,18 +192,12 @@ void kgdb_arch_set_pc(struct pt_regs *regs, unsigned long ip) instruction_pointer(regs) = ip; } -static void kgdb_call_nmi_hook(void *ignored) +void kgdb_call_nmi_hook(void *ignored) { + /* Default implementation passes get_irq_regs() but we don't */ kgdb_nmicallback(raw_smp_processor_id(), NULL); } -void kgdb_roundup_cpus(void) -{ - local_irq_enable(); - smp_call_function(kgdb_call_nmi_hook, NULL, 0); - local_irq_disable(); -} - struct kgdb_arch arch_kgdb_ops = { /* breakpoint instruction: TRAP_S 0x3 */ #ifdef CONFIG_CPU_BIG_ENDIAN diff --git a/arch/arm/kernel/kgdb.c b/arch/arm/kernel/kgdb.c index f21077b077be..d9a69e941463 100644 --- a/arch/arm/kernel/kgdb.c +++ b/arch/arm/kernel/kgdb.c @@ -170,18 +170,6 @@ static struct undef_hook kgdb_compiled_brkpt_hook = { .fn = kgdb_compiled_brk_fn }; -static void kgdb_call_nmi_hook(void *ignored) -{ - kgdb_nmicallback(raw_smp_processor_id(), get_irq_regs()); -} - -void kgdb_roundup_cpus(void) -{ - local_irq_enable(); - smp_call_function(kgdb_call_nmi_hook, NULL, 0); - local_irq_disable(); -} - static int __kgdb_notify(struct die_args *args, unsigned long cmd) { struct pt_regs *regs = args->regs; diff --git a/arch/arm64/kernel/kgdb.c b/arch/arm64/kernel/kgdb.c index 12c339ff6e75..da880247c734 100644 --- a/arch/arm64/kernel/kgdb.c +++ b/arch/arm64/kernel/kgdb.c @@ -284,18 +284,6 @@ static struct step_hook kgdb_step_hook = { .fn = kgdb_step_brk_fn }; -static void kgdb_call_nmi_hook(void *ignored) -{ - kgdb_nmicallback(raw_smp_processor_id(), get_irq_regs()); -} - -void kgdb_roundup_cpus(void) -{ - local_irq_enable(); - smp_call_function(kgdb_call_nmi_hook, NULL, 0); - local_irq_disable(); -} - static int __kgdb_notify(struct die_args *args, unsigned long cmd) { struct pt_regs *regs = args->regs; diff --git a/arch/hexagon/kernel/kgdb.c b/arch/hexagon/kernel/kgdb.c index 012e0e230ac2..b95d12038a4e 100644 --- a/arch/hexagon/kernel/kgdb.c +++ b/arch/hexagon/kernel/kgdb.c @@ -115,33 +115,6 @@ void kgdb_arch_set_pc(struct pt_regs *regs, unsigned long pc) instruction_pointer(regs) = pc; } -#ifdef CONFIG_SMP - -/** - * kgdb_roundup_cpus - Get other CPUs into a holding pattern - * - * On SMP systems, we need to get the attention of the other CPUs - * and get them be in a known state. This should do what is needed - * to get the other CPUs to call kgdb_wait(). Note that on some arches, - * the NMI approach is not used for rounding up all the CPUs. For example, - * in case of MIPS, smp_call_function() is used to roundup CPUs. - * - * On non-SMP systems, this is not called. - */ - -static void hexagon_kgdb_nmi_hook(void *ignored) -{ - kgdb_nmicallback(raw_smp_processor_id(), get_irq_regs()); -} - -void kgdb_roundup_cpus(void) -{ - local_irq_enable(); - smp_call_function(hexagon_kgdb_nmi_hook, NULL, 0); - local_irq_disable(); -} -#endif - /* Not yet working */ void sleeping_thread_to_gdb_regs(unsigned long *gdb_regs, diff --git a/arch/mips/kernel/kgdb.c b/arch/mips/kernel/kgdb.c index 2b05effc17b4..42f057a6c215 100644 --- a/arch/mips/kernel/kgdb.c +++ b/arch/mips/kernel/kgdb.c @@ -207,7 +207,7 @@ void arch_kgdb_breakpoint(void) ".set\treorder"); } -static void kgdb_call_nmi_hook(void *ignored) +void kgdb_call_nmi_hook(void *ignored) { mm_segment_t old_fs; @@ -219,13 +219,6 @@ static void kgdb_call_nmi_hook(void *ignored) set_fs(old_fs); } -void kgdb_roundup_cpus(void) -{ - local_irq_enable(); - smp_call_function(kgdb_call_nmi_hook, NULL, 0); - local_irq_disable(); -} - static int compute_signal(int tt) { struct hard_trap_info *ht; diff --git a/arch/powerpc/kernel/kgdb.c b/arch/powerpc/kernel/kgdb.c index b0e804844be0..b4ce54d73337 100644 --- a/arch/powerpc/kernel/kgdb.c +++ b/arch/powerpc/kernel/kgdb.c @@ -117,7 +117,7 @@ int kgdb_skipexception(int exception, struct pt_regs *regs) return kgdb_isremovedbreak(regs->nip); } -static int kgdb_call_nmi_hook(struct pt_regs *regs) +static int kgdb_debugger_ipi(struct pt_regs *regs) { kgdb_nmicallback(raw_smp_processor_id(), regs); return 0; @@ -502,7 +502,7 @@ int kgdb_arch_init(void) old__debugger_break_match = __debugger_break_match; old__debugger_fault_handler = __debugger_fault_handler; - __debugger_ipi = kgdb_call_nmi_hook; + __debugger_ipi = kgdb_debugger_ipi; __debugger = kgdb_debugger; __debugger_bpt = kgdb_handle_breakpoint; __debugger_sstep = kgdb_singlestep; diff --git a/arch/sh/kernel/kgdb.c b/arch/sh/kernel/kgdb.c index cc57630f6bf2..14e012ad7c57 100644 --- a/arch/sh/kernel/kgdb.c +++ b/arch/sh/kernel/kgdb.c @@ -314,18 +314,6 @@ BUILD_TRAP_HANDLER(singlestep) local_irq_restore(flags); } -static void kgdb_call_nmi_hook(void *ignored) -{ - kgdb_nmicallback(raw_smp_processor_id(), get_irq_regs()); -} - -void kgdb_roundup_cpus(void) -{ - local_irq_enable(); - smp_call_function(kgdb_call_nmi_hook, NULL, 0); - local_irq_disable(); -} - static int __kgdb_notify(struct die_args *args, unsigned long cmd) { int ret; diff --git a/include/linux/kgdb.h b/include/linux/kgdb.h index 05e5b2eb0d32..24422865cd18 100644 --- a/include/linux/kgdb.h +++ b/include/linux/kgdb.h @@ -176,14 +176,25 @@ kgdb_arch_handle_exception(int vector, int signo, int err_code, char *remcom_out_buffer, struct pt_regs *regs); +/** + * kgdb_call_nmi_hook - Call kgdb_nmicallback() on the current CPU + * @ignored: This parameter is only here to match the prototype. + * + * If you're using the default implementation of kgdb_roundup_cpus() + * this function will be called per CPU. If you don't implement + * kgdb_call_nmi_hook() a default will be used. + */ + +extern void kgdb_call_nmi_hook(void *ignored); + /** * kgdb_roundup_cpus - Get other CPUs into a holding pattern * * On SMP systems, we need to get the attention of the other CPUs * and get them into a known state. This should do what is needed * to get the other CPUs to call kgdb_wait(). Note that on some arches, - * the NMI approach is not used for rounding up all the CPUs. For example, - * in case of MIPS, smp_call_function() is used to roundup CPUs. + * the NMI approach is not used for rounding up all the CPUs. Normally + * those architectures can just not implement this and get the default. * * On non-SMP systems, this is not called. */ diff --git a/kernel/debug/debug_core.c b/kernel/debug/debug_core.c index f3cadda45f07..10db2833a423 100644 --- a/kernel/debug/debug_core.c +++ b/kernel/debug/debug_core.c @@ -55,6 +55,7 @@ #include #include #include +#include #include #include @@ -220,6 +221,46 @@ int __weak kgdb_skipexception(int exception, struct pt_regs *regs) return 0; } +#ifdef CONFIG_SMP + +/* + * Default (weak) implementation for kgdb_roundup_cpus + */ + +static DEFINE_PER_CPU(call_single_data_t, kgdb_roundup_csd); + +void __weak kgdb_call_nmi_hook(void *ignored) +{ + /* + * NOTE: get_irq_regs() is supposed to get the registers from + * before the IPI interrupt happened and so is supposed to + * show where the processor was. In some situations it's + * possible we might be called without an IPI, so it might be + * safer to figure out how to make kgdb_breakpoint() work + * properly here. + */ + kgdb_nmicallback(raw_smp_processor_id(), get_irq_regs()); +} + +void __weak kgdb_roundup_cpus(void) +{ + call_single_data_t *csd; + int this_cpu = raw_smp_processor_id(); + int cpu; + + for_each_online_cpu(cpu) { + /* No need to roundup ourselves */ + if (cpu == this_cpu) + continue; + + csd = &per_cpu(kgdb_roundup_csd, cpu); + csd->func = kgdb_call_nmi_hook; + smp_call_function_single_async(cpu, csd); + } +} + +#endif + /* * Some architectures need cache flushes when we set/clear a * breakpoint: -- cgit v1.3-7-g2ca7 From 87b095928584da7d5cb3149016f00b0b139c2292 Mon Sep 17 00:00:00 2001 From: Douglas Anderson Date: Tue, 4 Dec 2018 19:38:27 -0800 Subject: kgdb: Don't round up a CPU that failed rounding up before If we're using the default implementation of kgdb_roundup_cpus() that uses smp_call_function_single_async() we can end up hanging kgdb_roundup_cpus() if we try to round up a CPU that failed to round up before. Specifically smp_call_function_single_async() will try to wait on the csd lock for the CPU that we're trying to round up. If the previous round up never finished then that lock could still be held and we'll just sit there hanging. There's not a lot of use trying to round up a CPU that failed to round up before. Let's keep a flag that indicates whether the CPU started but didn't finish to round up before. If we see that flag set then we'll skip the next round up. In general we have a few goals here: - We never want to end up calling smp_call_function_single_async() when the csd is still locked. This is accomplished because flush_smp_call_function_queue() unlocks the csd _before_ invoking the callback. That means that when kgdb_nmicallback() runs we know for sure the the csd is no longer locked. Thus when we set "rounding_up = false" we know for sure that the csd is unlocked. - If there are no timeouts rounding up we should never skip a round up. NOTE #1: In general trying to continue running after failing to round up CPUs doesn't appear to be supported in the debugger. When I simulate this I find that kdb reports "Catastrophic error detected" when I try to continue. I can overrule and continue anyway, but it should be noted that we may be entering the land of dragons here. Possibly the "Catastrophic error detected" was added _because_ of the future failure to round up, but even so this is an area of the code that hasn't been strongly tested. NOTE #2: I did a bit of testing before and after this change. I introduced a 10 second hang in the kernel while holding a spinlock that I could invoke on a certain CPU with 'taskset -c 3 cat /sys/...". Before this change if I did: - Invoke hang - Enter debugger - g (which warns about Catastrophic error, g again to go anyway) - g - Enter debugger ...I'd hang the rest of the 10 seconds without getting a debugger prompt. After this change I end up in the debugger the 2nd time after only 1 second with the standard warning about 'Timed out waiting for secondary CPUs.' I'll also note that once the CPU finished waiting I could actually debug it (aka "btc" worked) I won't promise that everything works perfectly if the errant CPU comes back at just the wrong time (like as we're entering or exiting the debugger) but it certainly seems like an improvement. NOTE #3: setting 'kgdb_info[cpu].rounding_up = false' is in kgdb_nmicallback() instead of kgdb_call_nmi_hook() because some implementations override kgdb_call_nmi_hook(). It shouldn't hurt to have it in kgdb_nmicallback() in any case. NOTE #4: this logic is really only needed because there is no API call like "smp_try_call_function_single_async()" or "smp_csd_is_locked()". If such an API existed then we'd use it instead, but it seemed a bit much to add an API like this just for kgdb. Signed-off-by: Douglas Anderson Acked-by: Daniel Thompson Signed-off-by: Daniel Thompson --- kernel/debug/debug_core.c | 20 +++++++++++++++++++- kernel/debug/debug_core.h | 1 + 2 files changed, 20 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/debug/debug_core.c b/kernel/debug/debug_core.c index 10db2833a423..1fb8b239e567 100644 --- a/kernel/debug/debug_core.c +++ b/kernel/debug/debug_core.c @@ -247,6 +247,7 @@ void __weak kgdb_roundup_cpus(void) call_single_data_t *csd; int this_cpu = raw_smp_processor_id(); int cpu; + int ret; for_each_online_cpu(cpu) { /* No need to roundup ourselves */ @@ -254,8 +255,23 @@ void __weak kgdb_roundup_cpus(void) continue; csd = &per_cpu(kgdb_roundup_csd, cpu); + + /* + * If it didn't round up last time, don't try again + * since smp_call_function_single_async() will block. + * + * If rounding_up is false then we know that the + * previous call must have at least started and that + * means smp_call_function_single_async() won't block. + */ + if (kgdb_info[cpu].rounding_up) + continue; + kgdb_info[cpu].rounding_up = true; + csd->func = kgdb_call_nmi_hook; - smp_call_function_single_async(cpu, csd); + ret = smp_call_function_single_async(cpu, csd); + if (ret) + kgdb_info[cpu].rounding_up = false; } } @@ -788,6 +804,8 @@ int kgdb_nmicallback(int cpu, void *regs) struct kgdb_state kgdb_var; struct kgdb_state *ks = &kgdb_var; + kgdb_info[cpu].rounding_up = false; + memset(ks, 0, sizeof(struct kgdb_state)); ks->cpu = cpu; ks->linux_regs = regs; diff --git a/kernel/debug/debug_core.h b/kernel/debug/debug_core.h index 127d9bc49fb4..b4a7c326d546 100644 --- a/kernel/debug/debug_core.h +++ b/kernel/debug/debug_core.h @@ -42,6 +42,7 @@ struct debuggerinfo_struct { int ret_state; int irq_depth; int enter_kgdb; + bool rounding_up; }; extern struct debuggerinfo_struct kgdb_info[]; -- cgit v1.3-7-g2ca7 From 162bc7f5afd75b72acbe3c5f3488ef7e64a3fe36 Mon Sep 17 00:00:00 2001 From: Douglas Anderson Date: Tue, 4 Dec 2018 19:38:28 -0800 Subject: kdb: Don't back trace on a cpu that didn't round up If you have a CPU that fails to round up and then run 'btc' you'll end up crashing in kdb becaue we dereferenced NULL. Let's add a check. It's wise to also set the task to NULL when leaving the debugger so that if we fail to round up on a later entry into the debugger we won't backtrace a stale task. Signed-off-by: Douglas Anderson Acked-by: Daniel Thompson Signed-off-by: Daniel Thompson --- kernel/debug/debug_core.c | 4 ++++ kernel/debug/kdb/kdb_bt.c | 11 ++++++++++- kernel/debug/kdb/kdb_debugger.c | 7 ------- 3 files changed, 14 insertions(+), 8 deletions(-) (limited to 'kernel') diff --git a/kernel/debug/debug_core.c b/kernel/debug/debug_core.c index 1fb8b239e567..5cc608de6883 100644 --- a/kernel/debug/debug_core.c +++ b/kernel/debug/debug_core.c @@ -592,6 +592,8 @@ return_normal: arch_kgdb_ops.correct_hw_break(); if (trace_on) tracing_on(); + kgdb_info[cpu].debuggerinfo = NULL; + kgdb_info[cpu].task = NULL; kgdb_info[cpu].exception_state &= ~(DCPU_WANT_MASTER | DCPU_IS_SLAVE); kgdb_info[cpu].enter_kgdb--; @@ -724,6 +726,8 @@ kgdb_restore: if (trace_on) tracing_on(); + kgdb_info[cpu].debuggerinfo = NULL; + kgdb_info[cpu].task = NULL; kgdb_info[cpu].exception_state &= ~(DCPU_WANT_MASTER | DCPU_IS_SLAVE); kgdb_info[cpu].enter_kgdb--; diff --git a/kernel/debug/kdb/kdb_bt.c b/kernel/debug/kdb/kdb_bt.c index 7921ae4fca8d..7e2379aa0a1e 100644 --- a/kernel/debug/kdb/kdb_bt.c +++ b/kernel/debug/kdb/kdb_bt.c @@ -186,7 +186,16 @@ kdb_bt(int argc, const char **argv) kdb_printf("btc: cpu status: "); kdb_parse("cpu\n"); for_each_online_cpu(cpu) { - sprintf(buf, "btt 0x%px\n", KDB_TSK(cpu)); + void *kdb_tsk = KDB_TSK(cpu); + + /* If a CPU failed to round up we could be here */ + if (!kdb_tsk) { + kdb_printf("WARNING: no task for cpu %ld\n", + cpu); + continue; + } + + sprintf(buf, "btt 0x%px\n", kdb_tsk); kdb_parse(buf); touch_nmi_watchdog(); } diff --git a/kernel/debug/kdb/kdb_debugger.c b/kernel/debug/kdb/kdb_debugger.c index 15e1a7af5dd0..53a0df6e4d92 100644 --- a/kernel/debug/kdb/kdb_debugger.c +++ b/kernel/debug/kdb/kdb_debugger.c @@ -118,13 +118,6 @@ int kdb_stub(struct kgdb_state *ks) kdb_bp_remove(); KDB_STATE_CLEAR(DOING_SS); KDB_STATE_SET(PAGER); - /* zero out any offline cpu data */ - for_each_present_cpu(i) { - if (!cpu_online(i)) { - kgdb_info[i].debuggerinfo = NULL; - kgdb_info[i].task = NULL; - } - } if (ks->err_code == DIE_OOPS || reason == KDB_REASON_OOPS) { ks->pass_exception = 1; KDB_FLAG_SET(CATASTROPHIC); -- cgit v1.3-7-g2ca7 From 7faedcd4de4356dfcda30841a5ffef3bf35fa06c Mon Sep 17 00:00:00 2001 From: Nicholas Mc Guire Date: Fri, 20 Jul 2018 11:23:37 +0200 Subject: kdb: use bool for binary state indicators defcmd_in_progress is the state trace for command group processing - within a command group or not - usable is an indicator if a command set is valid (allocated/non-empty) - so use a bool for those binary indication here. Signed-off-by: Nicholas Mc Guire Reviewed-by: Daniel Thompson Signed-off-by: Daniel Thompson --- kernel/debug/kdb/kdb_main.c | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) (limited to 'kernel') diff --git a/kernel/debug/kdb/kdb_main.c b/kernel/debug/kdb/kdb_main.c index d72b32c66f7d..82a3b32a7cfc 100644 --- a/kernel/debug/kdb/kdb_main.c +++ b/kernel/debug/kdb/kdb_main.c @@ -658,7 +658,7 @@ static void kdb_cmderror(int diag) */ struct defcmd_set { int count; - int usable; + bool usable; char *name; char *usage; char *help; @@ -666,7 +666,7 @@ struct defcmd_set { }; static struct defcmd_set *defcmd_set; static int defcmd_set_count; -static int defcmd_in_progress; +static bool defcmd_in_progress; /* Forward references */ static int kdb_exec_defcmd(int argc, const char **argv); @@ -676,9 +676,9 @@ static int kdb_defcmd2(const char *cmdstr, const char *argv0) struct defcmd_set *s = defcmd_set + defcmd_set_count - 1; char **save_command = s->command; if (strcmp(argv0, "endefcmd") == 0) { - defcmd_in_progress = 0; + defcmd_in_progress = false; if (!s->count) - s->usable = 0; + s->usable = false; if (s->usable) /* macros are always safe because when executed each * internal command re-enters kdb_parse() and is @@ -695,7 +695,7 @@ static int kdb_defcmd2(const char *cmdstr, const char *argv0) if (!s->command) { kdb_printf("Could not allocate new kdb_defcmd table for %s\n", cmdstr); - s->usable = 0; + s->usable = false; return KDB_NOTIMP; } memcpy(s->command, save_command, s->count * sizeof(*(s->command))); @@ -737,7 +737,7 @@ static int kdb_defcmd(int argc, const char **argv) defcmd_set_count * sizeof(*defcmd_set)); s = defcmd_set + defcmd_set_count; memset(s, 0, sizeof(*s)); - s->usable = 1; + s->usable = true; s->name = kdb_strdup(argv[1], GFP_KDB); if (!s->name) goto fail_name; @@ -756,7 +756,7 @@ static int kdb_defcmd(int argc, const char **argv) s->help[strlen(s->help)-1] = '\0'; } ++defcmd_set_count; - defcmd_in_progress = 1; + defcmd_in_progress = true; kfree(save_defcmd_set); return 0; fail_help: -- cgit v1.3-7-g2ca7 From c40f7d74c741a907cfaeb73a7697081881c497d0 Mon Sep 17 00:00:00 2001 From: Linus Torvalds Date: Thu, 27 Dec 2018 13:46:17 -0800 Subject: sched/fair: Fix infinite loop in update_blocked_averages() by reverting a9e7f6544b9c Zhipeng Xie, Xie XiuQi and Sargun Dhillon reported lockups in the scheduler under high loads, starting at around the v4.18 time frame, and Zhipeng Xie tracked it down to bugs in the rq->leaf_cfs_rq_list manipulation. Do a (manual) revert of: a9e7f6544b9c ("sched/fair: Fix O(nr_cgroups) in load balance path") It turns out that the list_del_leaf_cfs_rq() introduced by this commit is a surprising property that was not considered in followup commits such as: 9c2791f936ef ("sched/fair: Fix hierarchical order in rq->leaf_cfs_rq_list") As Vincent Guittot explains: "I think that there is a bigger problem with commit a9e7f6544b9c and cfs_rq throttling: Let take the example of the following topology TG2 --> TG1 --> root: 1) The 1st time a task is enqueued, we will add TG2 cfs_rq then TG1 cfs_rq to leaf_cfs_rq_list and we are sure to do the whole branch in one path because it has never been used and can't be throttled so tmp_alone_branch will point to leaf_cfs_rq_list at the end. 2) Then TG1 is throttled 3) and we add TG3 as a new child of TG1. 4) The 1st enqueue of a task on TG3 will add TG3 cfs_rq just before TG1 cfs_rq and tmp_alone_branch will stay on rq->leaf_cfs_rq_list. With commit a9e7f6544b9c, we can del a cfs_rq from rq->leaf_cfs_rq_list. So if the load of TG1 cfs_rq becomes NULL before step 2) above, TG1 cfs_rq is removed from the list. Then at step 4), TG3 cfs_rq is added at the beginning of rq->leaf_cfs_rq_list but tmp_alone_branch still points to TG3 cfs_rq because its throttled parent can't be enqueued when the lock is released. tmp_alone_branch doesn't point to rq->leaf_cfs_rq_list whereas it should. So if TG3 cfs_rq is removed or destroyed before tmp_alone_branch points on another TG cfs_rq, the next TG cfs_rq that will be added, will be linked outside rq->leaf_cfs_rq_list - which is bad. In addition, we can break the ordering of the cfs_rq in rq->leaf_cfs_rq_list but this ordering is used to update and propagate the update from leaf down to root." Instead of trying to work through all these cases and trying to reproduce the very high loads that produced the lockup to begin with, simplify the code temporarily by reverting a9e7f6544b9c - which change was clearly not thought through completely. This (hopefully) gives us a kernel that doesn't lock up so people can continue to enjoy their holidays without worrying about regressions. ;-) [ mingo: Wrote changelog, fixed weird spelling in code comment while at it. ] Analyzed-by: Xie XiuQi Analyzed-by: Vincent Guittot Reported-by: Zhipeng Xie Reported-by: Sargun Dhillon Reported-by: Xie XiuQi Tested-by: Zhipeng Xie Tested-by: Sargun Dhillon Signed-off-by: Linus Torvalds Acked-by: Vincent Guittot Cc: # v4.13+ Cc: Bin Li Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Tejun Heo Cc: Thomas Gleixner Fixes: a9e7f6544b9c ("sched/fair: Fix O(nr_cgroups) in load balance path") Link: http://lkml.kernel.org/r/1545879866-27809-1-git-send-email-xiexiuqi@huawei.com Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 43 +++++++++---------------------------------- 1 file changed, 9 insertions(+), 34 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index d1907506318a..6483834f1278 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -352,10 +352,9 @@ static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq) } } -/* Iterate thr' all leaf cfs_rq's on a runqueue */ -#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) \ - list_for_each_entry_safe(cfs_rq, pos, &rq->leaf_cfs_rq_list, \ - leaf_cfs_rq_list) +/* Iterate through all leaf cfs_rq's on a runqueue: */ +#define for_each_leaf_cfs_rq(rq, cfs_rq) \ + list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list) /* Do the two (enqueued) entities belong to the same group ? */ static inline struct cfs_rq * @@ -447,8 +446,8 @@ static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq) { } -#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) \ - for (cfs_rq = &rq->cfs, pos = NULL; cfs_rq; cfs_rq = pos) +#define for_each_leaf_cfs_rq(rq, cfs_rq) \ + for (cfs_rq = &rq->cfs; cfs_rq; cfs_rq = NULL) static inline struct sched_entity *parent_entity(struct sched_entity *se) { @@ -7647,27 +7646,10 @@ static inline bool others_have_blocked(struct rq *rq) #ifdef CONFIG_FAIR_GROUP_SCHED -static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq) -{ - if (cfs_rq->load.weight) - return false; - - if (cfs_rq->avg.load_sum) - return false; - - if (cfs_rq->avg.util_sum) - return false; - - if (cfs_rq->avg.runnable_load_sum) - return false; - - return true; -} - static void update_blocked_averages(int cpu) { struct rq *rq = cpu_rq(cpu); - struct cfs_rq *cfs_rq, *pos; + struct cfs_rq *cfs_rq; const struct sched_class *curr_class; struct rq_flags rf; bool done = true; @@ -7679,7 +7661,7 @@ static void update_blocked_averages(int cpu) * Iterates the task_group tree in a bottom up fashion, see * list_add_leaf_cfs_rq() for details. */ - for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) { + for_each_leaf_cfs_rq(rq, cfs_rq) { struct sched_entity *se; /* throttled entities do not contribute to load */ @@ -7694,13 +7676,6 @@ static void update_blocked_averages(int cpu) if (se && !skip_blocked_update(se)) update_load_avg(cfs_rq_of(se), se, 0); - /* - * There can be a lot of idle CPU cgroups. Don't let fully - * decayed cfs_rqs linger on the list. - */ - if (cfs_rq_is_decayed(cfs_rq)) - list_del_leaf_cfs_rq(cfs_rq); - /* Don't need periodic decay once load/util_avg are null */ if (cfs_rq_has_blocked(cfs_rq)) done = false; @@ -10570,10 +10545,10 @@ const struct sched_class fair_sched_class = { #ifdef CONFIG_SCHED_DEBUG void print_cfs_stats(struct seq_file *m, int cpu) { - struct cfs_rq *cfs_rq, *pos; + struct cfs_rq *cfs_rq; rcu_read_lock(); - for_each_leaf_cfs_rq_safe(cpu_rq(cpu), cfs_rq, pos) + for_each_leaf_cfs_rq(cpu_rq(cpu), cfs_rq) print_cfs_rq(m, cpu, cfs_rq); rcu_read_unlock(); } -- cgit v1.3-7-g2ca7 From c08435ec7f2bc8f4109401f696fd55159b4b40cb Mon Sep 17 00:00:00 2001 From: Daniel Borkmann Date: Thu, 3 Jan 2019 00:58:27 +0100 Subject: bpf: move {prev_,}insn_idx into verifier env Move prev_insn_idx and insn_idx from the do_check() function into the verifier environment, so they can be read inside the various helper functions for handling the instructions. It's easier to put this into the environment rather than changing all call-sites only to pass it along. insn_idx is useful in particular since this later on allows to hold state in env->insn_aux_data[env->insn_idx]. Signed-off-by: Daniel Borkmann Acked-by: Alexei Starovoitov Signed-off-by: Alexei Starovoitov --- include/linux/bpf_verifier.h | 2 ++ kernel/bpf/verifier.c | 76 ++++++++++++++++++++++---------------------- 2 files changed, 40 insertions(+), 38 deletions(-) (limited to 'kernel') diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h index c233efc106c6..3f84f3e87704 100644 --- a/include/linux/bpf_verifier.h +++ b/include/linux/bpf_verifier.h @@ -212,6 +212,8 @@ struct bpf_subprog_info { * one verifier_env per bpf_check() call */ struct bpf_verifier_env { + u32 insn_idx; + u32 prev_insn_idx; struct bpf_prog *prog; /* eBPF program being verified */ const struct bpf_verifier_ops *ops; struct bpf_verifier_stack_elem *head; /* stack of verifier states to be processed */ diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 71d86e3024ae..afa8515bbb34 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -5650,7 +5650,6 @@ static int do_check(struct bpf_verifier_env *env) struct bpf_insn *insns = env->prog->insnsi; struct bpf_reg_state *regs; int insn_cnt = env->prog->len, i; - int insn_idx, prev_insn_idx = 0; int insn_processed = 0; bool do_print_state = false; @@ -5670,19 +5669,19 @@ static int do_check(struct bpf_verifier_env *env) BPF_MAIN_FUNC /* callsite */, 0 /* frameno */, 0 /* subprogno, zero == main subprog */); - insn_idx = 0; + for (;;) { struct bpf_insn *insn; u8 class; int err; - if (insn_idx >= insn_cnt) { + if (env->insn_idx >= insn_cnt) { verbose(env, "invalid insn idx %d insn_cnt %d\n", - insn_idx, insn_cnt); + env->insn_idx, insn_cnt); return -EFAULT; } - insn = &insns[insn_idx]; + insn = &insns[env->insn_idx]; class = BPF_CLASS(insn->code); if (++insn_processed > BPF_COMPLEXITY_LIMIT_INSNS) { @@ -5692,7 +5691,7 @@ static int do_check(struct bpf_verifier_env *env) return -E2BIG; } - err = is_state_visited(env, insn_idx); + err = is_state_visited(env, env->insn_idx); if (err < 0) return err; if (err == 1) { @@ -5700,9 +5699,9 @@ static int do_check(struct bpf_verifier_env *env) if (env->log.level) { if (do_print_state) verbose(env, "\nfrom %d to %d: safe\n", - prev_insn_idx, insn_idx); + env->prev_insn_idx, env->insn_idx); else - verbose(env, "%d: safe\n", insn_idx); + verbose(env, "%d: safe\n", env->insn_idx); } goto process_bpf_exit; } @@ -5715,10 +5714,10 @@ static int do_check(struct bpf_verifier_env *env) if (env->log.level > 1 || (env->log.level && do_print_state)) { if (env->log.level > 1) - verbose(env, "%d:", insn_idx); + verbose(env, "%d:", env->insn_idx); else verbose(env, "\nfrom %d to %d:", - prev_insn_idx, insn_idx); + env->prev_insn_idx, env->insn_idx); print_verifier_state(env, state->frame[state->curframe]); do_print_state = false; } @@ -5729,20 +5728,20 @@ static int do_check(struct bpf_verifier_env *env) .private_data = env, }; - verbose_linfo(env, insn_idx, "; "); - verbose(env, "%d: ", insn_idx); + verbose_linfo(env, env->insn_idx, "; "); + verbose(env, "%d: ", env->insn_idx); print_bpf_insn(&cbs, insn, env->allow_ptr_leaks); } if (bpf_prog_is_dev_bound(env->prog->aux)) { - err = bpf_prog_offload_verify_insn(env, insn_idx, - prev_insn_idx); + err = bpf_prog_offload_verify_insn(env, env->insn_idx, + env->prev_insn_idx); if (err) return err; } regs = cur_regs(env); - env->insn_aux_data[insn_idx].seen = true; + env->insn_aux_data[env->insn_idx].seen = true; if (class == BPF_ALU || class == BPF_ALU64) { err = check_alu_op(env, insn); @@ -5768,13 +5767,13 @@ static int do_check(struct bpf_verifier_env *env) /* check that memory (src_reg + off) is readable, * the state of dst_reg will be updated by this func */ - err = check_mem_access(env, insn_idx, insn->src_reg, insn->off, - BPF_SIZE(insn->code), BPF_READ, - insn->dst_reg, false); + err = check_mem_access(env, env->insn_idx, insn->src_reg, + insn->off, BPF_SIZE(insn->code), + BPF_READ, insn->dst_reg, false); if (err) return err; - prev_src_type = &env->insn_aux_data[insn_idx].ptr_type; + prev_src_type = &env->insn_aux_data[env->insn_idx].ptr_type; if (*prev_src_type == NOT_INIT) { /* saw a valid insn @@ -5799,10 +5798,10 @@ static int do_check(struct bpf_verifier_env *env) enum bpf_reg_type *prev_dst_type, dst_reg_type; if (BPF_MODE(insn->code) == BPF_XADD) { - err = check_xadd(env, insn_idx, insn); + err = check_xadd(env, env->insn_idx, insn); if (err) return err; - insn_idx++; + env->insn_idx++; continue; } @@ -5818,13 +5817,13 @@ static int do_check(struct bpf_verifier_env *env) dst_reg_type = regs[insn->dst_reg].type; /* check that memory (dst_reg + off) is writeable */ - err = check_mem_access(env, insn_idx, insn->dst_reg, insn->off, - BPF_SIZE(insn->code), BPF_WRITE, - insn->src_reg, false); + err = check_mem_access(env, env->insn_idx, insn->dst_reg, + insn->off, BPF_SIZE(insn->code), + BPF_WRITE, insn->src_reg, false); if (err) return err; - prev_dst_type = &env->insn_aux_data[insn_idx].ptr_type; + prev_dst_type = &env->insn_aux_data[env->insn_idx].ptr_type; if (*prev_dst_type == NOT_INIT) { *prev_dst_type = dst_reg_type; @@ -5852,9 +5851,9 @@ static int do_check(struct bpf_verifier_env *env) } /* check that memory (dst_reg + off) is writeable */ - err = check_mem_access(env, insn_idx, insn->dst_reg, insn->off, - BPF_SIZE(insn->code), BPF_WRITE, - -1, false); + err = check_mem_access(env, env->insn_idx, insn->dst_reg, + insn->off, BPF_SIZE(insn->code), + BPF_WRITE, -1, false); if (err) return err; @@ -5872,9 +5871,9 @@ static int do_check(struct bpf_verifier_env *env) } if (insn->src_reg == BPF_PSEUDO_CALL) - err = check_func_call(env, insn, &insn_idx); + err = check_func_call(env, insn, &env->insn_idx); else - err = check_helper_call(env, insn->imm, insn_idx); + err = check_helper_call(env, insn->imm, env->insn_idx); if (err) return err; @@ -5887,7 +5886,7 @@ static int do_check(struct bpf_verifier_env *env) return -EINVAL; } - insn_idx += insn->off + 1; + env->insn_idx += insn->off + 1; continue; } else if (opcode == BPF_EXIT) { @@ -5901,8 +5900,8 @@ static int do_check(struct bpf_verifier_env *env) if (state->curframe) { /* exit from nested function */ - prev_insn_idx = insn_idx; - err = prepare_func_exit(env, &insn_idx); + env->prev_insn_idx = env->insn_idx; + err = prepare_func_exit(env, &env->insn_idx); if (err) return err; do_print_state = true; @@ -5932,7 +5931,8 @@ static int do_check(struct bpf_verifier_env *env) if (err) return err; process_bpf_exit: - err = pop_stack(env, &prev_insn_idx, &insn_idx); + err = pop_stack(env, &env->prev_insn_idx, + &env->insn_idx); if (err < 0) { if (err != -ENOENT) return err; @@ -5942,7 +5942,7 @@ process_bpf_exit: continue; } } else { - err = check_cond_jmp_op(env, insn, &insn_idx); + err = check_cond_jmp_op(env, insn, &env->insn_idx); if (err) return err; } @@ -5959,8 +5959,8 @@ process_bpf_exit: if (err) return err; - insn_idx++; - env->insn_aux_data[insn_idx].seen = true; + env->insn_idx++; + env->insn_aux_data[env->insn_idx].seen = true; } else { verbose(env, "invalid BPF_LD mode\n"); return -EINVAL; @@ -5970,7 +5970,7 @@ process_bpf_exit: return -EINVAL; } - insn_idx++; + env->insn_idx++; } verbose(env, "processed %d insns (limit %d), stack depth ", -- cgit v1.3-7-g2ca7 From 144cd91c4c2bced6eb8a7e25e590f6618a11e854 Mon Sep 17 00:00:00 2001 From: Daniel Borkmann Date: Thu, 3 Jan 2019 00:58:28 +0100 Subject: bpf: move tmp variable into ax register in interpreter This change moves the on-stack 64 bit tmp variable in ___bpf_prog_run() into the hidden ax register. The latter is currently only used in JITs for constant blinding as a temporary scratch register, meaning the BPF interpreter will never see the use of ax. Therefore it is safe to use it for the cases where tmp has been used earlier. This is needed to later on allow restricted hidden use of ax in both interpreter and JITs. Signed-off-by: Daniel Borkmann Acked-by: Alexei Starovoitov Signed-off-by: Alexei Starovoitov --- include/linux/filter.h | 3 ++- kernel/bpf/core.c | 34 +++++++++++++++++----------------- 2 files changed, 19 insertions(+), 18 deletions(-) (limited to 'kernel') diff --git a/include/linux/filter.h b/include/linux/filter.h index 8c8544b375eb..84a6a98f8328 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -60,7 +60,8 @@ struct sock_reuseport; * constants. See JIT pre-step in bpf_jit_blind_constants(). */ #define BPF_REG_AX MAX_BPF_REG -#define MAX_BPF_JIT_REG (MAX_BPF_REG + 1) +#define MAX_BPF_EXT_REG (MAX_BPF_REG + 1) +#define MAX_BPF_JIT_REG MAX_BPF_EXT_REG /* unused opcode to mark special call to bpf_tail_call() helper */ #define BPF_TAIL_CALL 0xf0 diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index 38de580abcc2..a34312a5eea2 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -54,6 +54,7 @@ #define DST regs[insn->dst_reg] #define SRC regs[insn->src_reg] #define FP regs[BPF_REG_FP] +#define AX regs[BPF_REG_AX] #define ARG1 regs[BPF_REG_ARG1] #define CTX regs[BPF_REG_CTX] #define IMM insn->imm @@ -1188,7 +1189,6 @@ bool bpf_opcode_in_insntable(u8 code) */ static u64 ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn, u64 *stack) { - u64 tmp; #define BPF_INSN_2_LBL(x, y) [BPF_##x | BPF_##y] = &&x##_##y #define BPF_INSN_3_LBL(x, y, z) [BPF_##x | BPF_##y | BPF_##z] = &&x##_##y##_##z static const void *jumptable[256] = { @@ -1268,36 +1268,36 @@ select_insn: (*(s64 *) &DST) >>= IMM; CONT; ALU64_MOD_X: - div64_u64_rem(DST, SRC, &tmp); - DST = tmp; + div64_u64_rem(DST, SRC, &AX); + DST = AX; CONT; ALU_MOD_X: - tmp = (u32) DST; - DST = do_div(tmp, (u32) SRC); + AX = (u32) DST; + DST = do_div(AX, (u32) SRC); CONT; ALU64_MOD_K: - div64_u64_rem(DST, IMM, &tmp); - DST = tmp; + div64_u64_rem(DST, IMM, &AX); + DST = AX; CONT; ALU_MOD_K: - tmp = (u32) DST; - DST = do_div(tmp, (u32) IMM); + AX = (u32) DST; + DST = do_div(AX, (u32) IMM); CONT; ALU64_DIV_X: DST = div64_u64(DST, SRC); CONT; ALU_DIV_X: - tmp = (u32) DST; - do_div(tmp, (u32) SRC); - DST = (u32) tmp; + AX = (u32) DST; + do_div(AX, (u32) SRC); + DST = (u32) AX; CONT; ALU64_DIV_K: DST = div64_u64(DST, IMM); CONT; ALU_DIV_K: - tmp = (u32) DST; - do_div(tmp, (u32) IMM); - DST = (u32) tmp; + AX = (u32) DST; + do_div(AX, (u32) IMM); + DST = (u32) AX; CONT; ALU_END_TO_BE: switch (IMM) { @@ -1553,7 +1553,7 @@ STACK_FRAME_NON_STANDARD(___bpf_prog_run); /* jump table */ static unsigned int PROG_NAME(stack_size)(const void *ctx, const struct bpf_insn *insn) \ { \ u64 stack[stack_size / sizeof(u64)]; \ - u64 regs[MAX_BPF_REG]; \ + u64 regs[MAX_BPF_EXT_REG]; \ \ FP = (u64) (unsigned long) &stack[ARRAY_SIZE(stack)]; \ ARG1 = (u64) (unsigned long) ctx; \ @@ -1566,7 +1566,7 @@ static u64 PROG_NAME_ARGS(stack_size)(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5, \ const struct bpf_insn *insn) \ { \ u64 stack[stack_size / sizeof(u64)]; \ - u64 regs[MAX_BPF_REG]; \ + u64 regs[MAX_BPF_EXT_REG]; \ \ FP = (u64) (unsigned long) &stack[ARRAY_SIZE(stack)]; \ BPF_R1 = r1; \ -- cgit v1.3-7-g2ca7 From 9b73bfdd08e73231d6a90ae6db4b46b3fbf56c30 Mon Sep 17 00:00:00 2001 From: Daniel Borkmann Date: Thu, 3 Jan 2019 00:58:29 +0100 Subject: bpf: enable access to ax register also from verifier rewrite Right now we are using BPF ax register in JIT for constant blinding as well as in interpreter as temporary variable. Verifier will not be able to use it simply because its use will get overridden from the former in bpf_jit_blind_insn(). However, it can be made to work in that blinding will be skipped if there is prior use in either source or destination register on the instruction. Taking constraints of ax into account, the verifier is then open to use it in rewrites under some constraints. Note, ax register already has mappings in every eBPF JIT. Signed-off-by: Daniel Borkmann Acked-by: Alexei Starovoitov Signed-off-by: Alexei Starovoitov --- include/linux/filter.h | 7 +------ kernel/bpf/core.c | 20 ++++++++++++++++++++ 2 files changed, 21 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/include/linux/filter.h b/include/linux/filter.h index 84a6a98f8328..ad106d845b22 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -53,12 +53,7 @@ struct sock_reuseport; #define BPF_REG_D BPF_REG_8 /* data, callee-saved */ #define BPF_REG_H BPF_REG_9 /* hlen, callee-saved */ -/* Kernel hidden auxiliary/helper register for hardening step. - * Only used by eBPF JITs. It's nothing more than a temporary - * register that JITs use internally, only that here it's part - * of eBPF instructions that have been rewritten for blinding - * constants. See JIT pre-step in bpf_jit_blind_constants(). - */ +/* Kernel hidden auxiliary/helper register. */ #define BPF_REG_AX MAX_BPF_REG #define MAX_BPF_EXT_REG (MAX_BPF_REG + 1) #define MAX_BPF_JIT_REG MAX_BPF_EXT_REG diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index a34312a5eea2..f908b9356025 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -858,6 +858,26 @@ static int bpf_jit_blind_insn(const struct bpf_insn *from, BUILD_BUG_ON(BPF_REG_AX + 1 != MAX_BPF_JIT_REG); BUILD_BUG_ON(MAX_BPF_REG + 1 != MAX_BPF_JIT_REG); + /* Constraints on AX register: + * + * AX register is inaccessible from user space. It is mapped in + * all JITs, and used here for constant blinding rewrites. It is + * typically "stateless" meaning its contents are only valid within + * the executed instruction, but not across several instructions. + * There are a few exceptions however which are further detailed + * below. + * + * Constant blinding is only used by JITs, not in the interpreter. + * The interpreter uses AX in some occasions as a local temporary + * register e.g. in DIV or MOD instructions. + * + * In restricted circumstances, the verifier can also use the AX + * register for rewrites as long as they do not interfere with + * the above cases! + */ + if (from->dst_reg == BPF_REG_AX || from->src_reg == BPF_REG_AX) + goto out; + if (from->imm == 0 && (from->code == (BPF_ALU | BPF_MOV | BPF_K) || from->code == (BPF_ALU64 | BPF_MOV | BPF_K))) { -- cgit v1.3-7-g2ca7 From 0d6303db7970e6f56ae700fa07e11eb510cda125 Mon Sep 17 00:00:00 2001 From: Daniel Borkmann Date: Thu, 3 Jan 2019 00:58:30 +0100 Subject: bpf: restrict map value pointer arithmetic for unprivileged Restrict map value pointer arithmetic for unprivileged users in that arithmetic itself must not go out of bounds as opposed to the actual access later on. Therefore after each adjust_ptr_min_max_vals() with a map value pointer as a destination it will simulate a check_map_access() of 1 byte on the destination and once that fails the program is rejected for unprivileged program loads. We use this later on for masking any pointer arithmetic with the remainder of the map value space. The likelihood of breaking any existing real-world unprivileged eBPF program is very small for this corner case. Signed-off-by: Daniel Borkmann Acked-by: Alexei Starovoitov Signed-off-by: Alexei Starovoitov --- kernel/bpf/verifier.c | 11 +++++++++++ 1 file changed, 11 insertions(+) (limited to 'kernel') diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index afa8515bbb34..4da8c73e168f 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -3249,6 +3249,17 @@ static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env, __update_reg_bounds(dst_reg); __reg_deduce_bounds(dst_reg); __reg_bound_offset(dst_reg); + + /* For unprivileged we require that resulting offset must be in bounds + * in order to be able to sanitize access later on. + */ + if (!env->allow_ptr_leaks && dst_reg->type == PTR_TO_MAP_VALUE && + check_map_access(env, dst, dst_reg->off, 1, false)) { + verbose(env, "R%d pointer arithmetic of map value goes out of range, prohibited for !root\n", + dst); + return -EACCES; + } + return 0; } -- cgit v1.3-7-g2ca7 From e4298d25830a866cc0f427d4bccb858e76715859 Mon Sep 17 00:00:00 2001 From: Daniel Borkmann Date: Thu, 3 Jan 2019 00:58:31 +0100 Subject: bpf: restrict stack pointer arithmetic for unprivileged Restrict stack pointer arithmetic for unprivileged users in that arithmetic itself must not go out of bounds as opposed to the actual access later on. Therefore after each adjust_ptr_min_max_vals() with a stack pointer as a destination we simulate a check_stack_access() of 1 byte on the destination and once that fails the program is rejected for unprivileged program loads. This is analog to map value pointer arithmetic and needed for masking later on. Signed-off-by: Daniel Borkmann Acked-by: Alexei Starovoitov Signed-off-by: Alexei Starovoitov --- kernel/bpf/verifier.c | 63 +++++++++++++++++++++++++++++++++------------------ 1 file changed, 41 insertions(+), 22 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 4da8c73e168f..9ac205d1b8b7 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -1387,6 +1387,31 @@ static int check_stack_read(struct bpf_verifier_env *env, } } +static int check_stack_access(struct bpf_verifier_env *env, + const struct bpf_reg_state *reg, + int off, int size) +{ + /* Stack accesses must be at a fixed offset, so that we + * can determine what type of data were returned. See + * check_stack_read(). + */ + if (!tnum_is_const(reg->var_off)) { + char tn_buf[48]; + + tnum_strn(tn_buf, sizeof(tn_buf), reg->var_off); + verbose(env, "variable stack access var_off=%s off=%d size=%d", + tn_buf, off, size); + return -EACCES; + } + + if (off >= 0 || off < -MAX_BPF_STACK) { + verbose(env, "invalid stack off=%d size=%d\n", off, size); + return -EACCES; + } + + return 0; +} + /* check read/write into map element returned by bpf_map_lookup_elem() */ static int __check_map_access(struct bpf_verifier_env *env, u32 regno, int off, int size, bool zero_size_allowed) @@ -1954,24 +1979,10 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn } } else if (reg->type == PTR_TO_STACK) { - /* stack accesses must be at a fixed offset, so that we can - * determine what type of data were returned. - * See check_stack_read(). - */ - if (!tnum_is_const(reg->var_off)) { - char tn_buf[48]; - - tnum_strn(tn_buf, sizeof(tn_buf), reg->var_off); - verbose(env, "variable stack access var_off=%s off=%d size=%d", - tn_buf, off, size); - return -EACCES; - } off += reg->var_off.value; - if (off >= 0 || off < -MAX_BPF_STACK) { - verbose(env, "invalid stack off=%d size=%d\n", off, - size); - return -EACCES; - } + err = check_stack_access(env, reg, off, size); + if (err) + return err; state = func(env, reg); err = update_stack_depth(env, state, off); @@ -3253,11 +3264,19 @@ static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env, /* For unprivileged we require that resulting offset must be in bounds * in order to be able to sanitize access later on. */ - if (!env->allow_ptr_leaks && dst_reg->type == PTR_TO_MAP_VALUE && - check_map_access(env, dst, dst_reg->off, 1, false)) { - verbose(env, "R%d pointer arithmetic of map value goes out of range, prohibited for !root\n", - dst); - return -EACCES; + if (!env->allow_ptr_leaks) { + if (dst_reg->type == PTR_TO_MAP_VALUE && + check_map_access(env, dst, dst_reg->off, 1, false)) { + verbose(env, "R%d pointer arithmetic of map value goes out of range, " + "prohibited for !root\n", dst); + return -EACCES; + } else if (dst_reg->type == PTR_TO_STACK && + check_stack_access(env, dst_reg, dst_reg->off + + dst_reg->var_off.value, 1)) { + verbose(env, "R%d stack pointer arithmetic goes out of range, " + "prohibited for !root\n", dst); + return -EACCES; + } } return 0; -- cgit v1.3-7-g2ca7 From 9d7eceede769f90b66cfa06ad5b357140d5141ed Mon Sep 17 00:00:00 2001 From: Daniel Borkmann Date: Thu, 3 Jan 2019 00:58:32 +0100 Subject: bpf: restrict unknown scalars of mixed signed bounds for unprivileged For unknown scalars of mixed signed bounds, meaning their smin_value is negative and their smax_value is positive, we need to reject arithmetic with pointer to map value. For unprivileged the goal is to mask every map pointer arithmetic and this cannot reliably be done when it is unknown at verification time whether the scalar value is negative or positive. Given this is a corner case, the likelihood of breaking should be very small. Signed-off-by: Daniel Borkmann Acked-by: Alexei Starovoitov Signed-off-by: Alexei Starovoitov --- kernel/bpf/verifier.c | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 9ac205d1b8b7..eebbc03e5af2 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -3081,8 +3081,8 @@ static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env, smin_ptr = ptr_reg->smin_value, smax_ptr = ptr_reg->smax_value; u64 umin_val = off_reg->umin_value, umax_val = off_reg->umax_value, umin_ptr = ptr_reg->umin_value, umax_ptr = ptr_reg->umax_value; + u32 dst = insn->dst_reg, src = insn->src_reg; u8 opcode = BPF_OP(insn->code); - u32 dst = insn->dst_reg; dst_reg = ®s[dst]; @@ -3115,6 +3115,13 @@ static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env, verbose(env, "R%d pointer arithmetic on %s prohibited\n", dst, reg_type_str[ptr_reg->type]); return -EACCES; + case PTR_TO_MAP_VALUE: + if (!env->allow_ptr_leaks && !known && (smin_val < 0) != (smax_val < 0)) { + verbose(env, "R%d has unknown scalar with mixed signed bounds, pointer arithmetic with it prohibited for !root\n", + off_reg == dst_reg ? dst : src); + return -EACCES; + } + /* fall-through */ default: break; } -- cgit v1.3-7-g2ca7 From b7137c4eab85c1cf3d46acdde90ce1163b28c873 Mon Sep 17 00:00:00 2001 From: Daniel Borkmann Date: Thu, 3 Jan 2019 00:58:33 +0100 Subject: bpf: fix check_map_access smin_value test when pointer contains offset In check_map_access() we probe actual bounds through __check_map_access() with offset of reg->smin_value + off for lower bound and offset of reg->umax_value + off for the upper bound. However, even though the reg->smin_value could have a negative value, the final result of the sum with off could be positive when pointer arithmetic with known and unknown scalars is combined. In this case we reject the program with an error such as "R min value is negative, either use unsigned index or do a if (index >=0) check." even though the access itself would be fine. Therefore extend the check to probe whether the actual resulting reg->smin_value + off is less than zero. Signed-off-by: Daniel Borkmann Acked-by: Alexei Starovoitov Signed-off-by: Alexei Starovoitov --- kernel/bpf/verifier.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index eebbc03e5af2..8e5da1ce5da4 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -1443,13 +1443,17 @@ static int check_map_access(struct bpf_verifier_env *env, u32 regno, */ if (env->log.level) print_verifier_state(env, state); + /* The minimum value is only important with signed * comparisons where we can't assume the floor of a * value is 0. If we are using signed variables for our * index'es we need to make sure that whatever we use * will have a set floor within our range. */ - if (reg->smin_value < 0) { + if (reg->smin_value < 0 && + (reg->smin_value == S64_MIN || + (off + reg->smin_value != (s64)(s32)(off + reg->smin_value)) || + reg->smin_value + off < 0)) { verbose(env, "R%d min value is negative, either use unsigned index or do a if (index >=0) check.\n", regno); return -EACCES; -- cgit v1.3-7-g2ca7 From 979d63d50c0c0f7bc537bf821e056cc9fe5abd38 Mon Sep 17 00:00:00 2001 From: Daniel Borkmann Date: Thu, 3 Jan 2019 00:58:34 +0100 Subject: bpf: prevent out of bounds speculation on pointer arithmetic Jann reported that the original commit back in b2157399cc98 ("bpf: prevent out-of-bounds speculation") was not sufficient to stop CPU from speculating out of bounds memory access: While b2157399cc98 only focussed on masking array map access for unprivileged users for tail calls and data access such that the user provided index gets sanitized from BPF program and syscall side, there is still a more generic form affected from BPF programs that applies to most maps that hold user data in relation to dynamic map access when dealing with unknown scalars or "slow" known scalars as access offset, for example: - Load a map value pointer into R6 - Load an index into R7 - Do a slow computation (e.g. with a memory dependency) that loads a limit into R8 (e.g. load the limit from a map for high latency, then mask it to make the verifier happy) - Exit if R7 >= R8 (mispredicted branch) - Load R0 = R6[R7] - Load R0 = R6[R0] For unknown scalars there are two options in the BPF verifier where we could derive knowledge from in order to guarantee safe access to the memory: i) While /<=/>= variants won't allow to derive any lower or upper bounds from the unknown scalar where it would be safe to add it to the map value pointer, it is possible through ==/!= test however. ii) another option is to transform the unknown scalar into a known scalar, for example, through ALU ops combination such as R &= followed by R |= or any similar combination where the original information from the unknown scalar would be destroyed entirely leaving R with a constant. The initial slow load still precedes the latter ALU ops on that register, so the CPU executes speculatively from that point. Once we have the known scalar, any compare operation would work then. A third option only involving registers with known scalars could be crafted as described in [0] where a CPU port (e.g. Slow Int unit) would be filled with many dependent computations such that the subsequent condition depending on its outcome has to wait for evaluation on its execution port and thereby executing speculatively if the speculated code can be scheduled on a different execution port, or any other form of mistraining as described in [1], for example. Given this is not limited to only unknown scalars, not only map but also stack access is affected since both is accessible for unprivileged users and could potentially be used for out of bounds access under speculation. In order to prevent any of these cases, the verifier is now sanitizing pointer arithmetic on the offset such that any out of bounds speculation would be masked in a way where the pointer arithmetic result in the destination register will stay unchanged, meaning offset masked into zero similar as in array_index_nospec() case. With regards to implementation, there are three options that were considered: i) new insn for sanitation, ii) push/pop insn and sanitation as inlined BPF, iii) reuse of ax register and sanitation as inlined BPF. Option i) has the downside that we end up using from reserved bits in the opcode space, but also that we would require each JIT to emit masking as native arch opcodes meaning mitigation would have slow adoption till everyone implements it eventually which is counter-productive. Option ii) and iii) have both in common that a temporary register is needed in order to implement the sanitation as inlined BPF since we are not allowed to modify the source register. While a push / pop insn in ii) would be useful to have in any case, it requires once again that every JIT needs to implement it first. While possible, amount of changes needed would also be unsuitable for a -stable patch. Therefore, the path which has fewer changes, less BPF instructions for the mitigation and does not require anything to be changed in the JITs is option iii) which this work is pursuing. The ax register is already mapped to a register in all JITs (modulo arm32 where it's mapped to stack as various other BPF registers there) and used in constant blinding for JITs-only so far. It can be reused for verifier rewrites under certain constraints. The interpreter's tmp "register" has therefore been remapped into extending the register set with hidden ax register and reusing that for a number of instructions that needed the prior temporary variable internally (e.g. div, mod). This allows for zero increase in stack space usage in the interpreter, and enables (restricted) generic use in rewrites otherwise as long as such a patchlet does not make use of these instructions. The sanitation mask is dynamic and relative to the offset the map value or stack pointer currently holds. There are various cases that need to be taken under consideration for the masking, e.g. such operation could look as follows: ptr += val or val += ptr or ptr -= val. Thus, the value to be sanitized could reside either in source or in destination register, and the limit is different depending on whether the ALU op is addition or subtraction and depending on the current known and bounded offset. The limit is derived as follows: limit := max_value_size - (smin_value + off). For subtraction: limit := umax_value + off. This holds because we do not allow any pointer arithmetic that would temporarily go out of bounds or would have an unknown value with mixed signed bounds where it is unclear at verification time whether the actual runtime value would be either negative or positive. For example, we have a derived map pointer value with constant offset and bounded one, so limit based on smin_value works because the verifier requires that statically analyzed arithmetic on the pointer must be in bounds, and thus it checks if resulting smin_value + off and umax_value + off is still within map value bounds at time of arithmetic in addition to time of access. Similarly, for the case of stack access we derive the limit as follows: MAX_BPF_STACK + off for subtraction and -off for the case of addition where off := ptr_reg->off + ptr_reg->var_off.value. Subtraction is a special case for the masking which can be in form of ptr += -val, ptr -= -val, or ptr -= val. In the first two cases where we know that the value is negative, we need to temporarily negate the value in order to do the sanitation on a positive value where we later swap the ALU op, and restore original source register if the value was in source. The sanitation of pointer arithmetic alone is still not fully sufficient as is, since a scenario like the following could happen ... PTR += 0x1000 (e.g. K-based imm) PTR -= BIG_NUMBER_WITH_SLOW_COMPARISON PTR += 0x1000 PTR -= BIG_NUMBER_WITH_SLOW_COMPARISON [...] ... which under speculation could end up as ... PTR += 0x1000 PTR -= 0 [ truncated by mitigation ] PTR += 0x1000 PTR -= 0 [ truncated by mitigation ] [...] ... and therefore still access out of bounds. To prevent such case, the verifier is also analyzing safety for potential out of bounds access under speculative execution. Meaning, it is also simulating pointer access under truncation. We therefore "branch off" and push the current verification state after the ALU operation with known 0 to the verification stack for later analysis. Given the current path analysis succeeded it is likely that the one under speculation can be pruned. In any case, it is also subject to existing complexity limits and therefore anything beyond this point will be rejected. In terms of pruning, it needs to be ensured that the verification state from speculative execution simulation must never prune a non-speculative execution path, therefore, we mark verifier state accordingly at the time of push_stack(). If verifier detects out of bounds access under speculative execution from one of the possible paths that includes a truncation, it will reject such program. Given we mask every reg-based pointer arithmetic for unprivileged programs, we've been looking into how it could affect real-world programs in terms of size increase. As the majority of programs are targeted for privileged-only use case, we've unconditionally enabled masking (with its alu restrictions on top of it) for privileged programs for the sake of testing in order to check i) whether they get rejected in its current form, and ii) by how much the number of instructions and size will increase. We've tested this by using Katran, Cilium and test_l4lb from the kernel selftests. For Katran we've evaluated balancer_kern.o, Cilium bpf_lxc.o and an older test object bpf_lxc_opt_-DUNKNOWN.o and l4lb we've used test_l4lb.o as well as test_l4lb_noinline.o. We found that none of the programs got rejected by the verifier with this change, and that impact is rather minimal to none. balancer_kern.o had 13,904 bytes (1,738 insns) xlated and 7,797 bytes JITed before and after the change. Most complex program in bpf_lxc.o had 30,544 bytes (3,817 insns) xlated and 18,538 bytes JITed before and after and none of the other tail call programs in bpf_lxc.o had any changes either. For the older bpf_lxc_opt_-DUNKNOWN.o object we found a small increase from 20,616 bytes (2,576 insns) and 12,536 bytes JITed before to 20,664 bytes (2,582 insns) and 12,558 bytes JITed after the change. Other programs from that object file had similar small increase. Both test_l4lb.o had no change and remained at 6,544 bytes (817 insns) xlated and 3,401 bytes JITed and for test_l4lb_noinline.o constant at 5,080 bytes (634 insns) xlated and 3,313 bytes JITed. This can be explained in that LLVM typically optimizes stack based pointer arithmetic by using K-based operations and that use of dynamic map access is not overly frequent. However, in future we may decide to optimize the algorithm further under known guarantees from branch and value speculation. Latter seems also unclear in terms of prediction heuristics that today's CPUs apply as well as whether there could be collisions in e.g. the predictor's Value History/Pattern Table for triggering out of bounds access, thus masking is performed unconditionally at this point but could be subject to relaxation later on. We were generally also brainstorming various other approaches for mitigation, but the blocker was always lack of available registers at runtime and/or overhead for runtime tracking of limits belonging to a specific pointer. Thus, we found this to be minimally intrusive under given constraints. With that in place, a simple example with sanitized access on unprivileged load at post-verification time looks as follows: # bpftool prog dump xlated id 282 [...] 28: (79) r1 = *(u64 *)(r7 +0) 29: (79) r2 = *(u64 *)(r7 +8) 30: (57) r1 &= 15 31: (79) r3 = *(u64 *)(r0 +4608) 32: (57) r3 &= 1 33: (47) r3 |= 1 34: (2d) if r2 > r3 goto pc+19 35: (b4) (u32) r11 = (u32) 20479 | 36: (1f) r11 -= r2 | Dynamic sanitation for pointer 37: (4f) r11 |= r2 | arithmetic with registers 38: (87) r11 = -r11 | containing bounded or known 39: (c7) r11 s>>= 63 | scalars in order to prevent 40: (5f) r11 &= r2 | out of bounds speculation. 41: (0f) r4 += r11 | 42: (71) r4 = *(u8 *)(r4 +0) 43: (6f) r4 <<= r1 [...] For the case where the scalar sits in the destination register as opposed to the source register, the following code is emitted for the above example: [...] 16: (b4) (u32) r11 = (u32) 20479 17: (1f) r11 -= r2 18: (4f) r11 |= r2 19: (87) r11 = -r11 20: (c7) r11 s>>= 63 21: (5f) r2 &= r11 22: (0f) r2 += r0 23: (61) r0 = *(u32 *)(r2 +0) [...] JIT blinding example with non-conflicting use of r10: [...] d5: je 0x0000000000000106 _ d7: mov 0x0(%rax),%edi | da: mov $0xf153246,%r10d | Index load from map value and e0: xor $0xf153259,%r10 | (const blinded) mask with 0x1f. e7: and %r10,%rdi |_ ea: mov $0x2f,%r10d | f0: sub %rdi,%r10 | Sanitized addition. Both use r10 f3: or %rdi,%r10 | but do not interfere with each f6: neg %r10 | other. (Neither do these instructions f9: sar $0x3f,%r10 | interfere with the use of ax as temp fd: and %r10,%rdi | in interpreter.) 100: add %rax,%rdi |_ 103: mov 0x0(%rdi),%eax [...] Tested that it fixes Jann's reproducer, and also checked that test_verifier and test_progs suite with interpreter, JIT and JIT with hardening enabled on x86-64 and arm64 runs successfully. [0] Speculose: Analyzing the Security Implications of Speculative Execution in CPUs, Giorgi Maisuradze and Christian Rossow, https://arxiv.org/pdf/1801.04084.pdf [1] A Systematic Evaluation of Transient Execution Attacks and Defenses, Claudio Canella, Jo Van Bulck, Michael Schwarz, Moritz Lipp, Benjamin von Berg, Philipp Ortner, Frank Piessens, Dmitry Evtyushkin, Daniel Gruss, https://arxiv.org/pdf/1811.05441.pdf Fixes: b2157399cc98 ("bpf: prevent out-of-bounds speculation") Reported-by: Jann Horn Signed-off-by: Daniel Borkmann Acked-by: Alexei Starovoitov Signed-off-by: Alexei Starovoitov --- include/linux/bpf_verifier.h | 10 +++ kernel/bpf/verifier.c | 185 +++++++++++++++++++++++++++++++++++++++++-- 2 files changed, 189 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h index 3f84f3e87704..27b74947cd2b 100644 --- a/include/linux/bpf_verifier.h +++ b/include/linux/bpf_verifier.h @@ -148,6 +148,7 @@ struct bpf_verifier_state { /* call stack tracking */ struct bpf_func_state *frame[MAX_CALL_FRAMES]; u32 curframe; + bool speculative; }; #define bpf_get_spilled_reg(slot, frame) \ @@ -167,15 +168,24 @@ struct bpf_verifier_state_list { struct bpf_verifier_state_list *next; }; +/* Possible states for alu_state member. */ +#define BPF_ALU_SANITIZE_SRC 1U +#define BPF_ALU_SANITIZE_DST 2U +#define BPF_ALU_NEG_VALUE (1U << 2) +#define BPF_ALU_SANITIZE (BPF_ALU_SANITIZE_SRC | \ + BPF_ALU_SANITIZE_DST) + struct bpf_insn_aux_data { union { enum bpf_reg_type ptr_type; /* pointer type for load/store insns */ unsigned long map_state; /* pointer/poison value for maps */ s32 call_imm; /* saved imm field of call insn */ + u32 alu_limit; /* limit for add/sub register with pointer */ }; int ctx_field_size; /* the ctx field size for load insn, maybe 0 */ int sanitize_stack_off; /* stack slot to be cleared */ bool seen; /* this insn was processed by the verifier */ + u8 alu_state; /* used in combination with alu_limit */ }; #define MAX_USED_MAPS 64 /* max number of maps accessed by one eBPF program */ diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 8e5da1ce5da4..f6bc62a9ee8e 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -710,6 +710,7 @@ static int copy_verifier_state(struct bpf_verifier_state *dst_state, free_func_state(dst_state->frame[i]); dst_state->frame[i] = NULL; } + dst_state->speculative = src->speculative; dst_state->curframe = src->curframe; for (i = 0; i <= src->curframe; i++) { dst = dst_state->frame[i]; @@ -754,7 +755,8 @@ static int pop_stack(struct bpf_verifier_env *env, int *prev_insn_idx, } static struct bpf_verifier_state *push_stack(struct bpf_verifier_env *env, - int insn_idx, int prev_insn_idx) + int insn_idx, int prev_insn_idx, + bool speculative) { struct bpf_verifier_state *cur = env->cur_state; struct bpf_verifier_stack_elem *elem; @@ -772,6 +774,7 @@ static struct bpf_verifier_state *push_stack(struct bpf_verifier_env *env, err = copy_verifier_state(&elem->st, cur); if (err) goto err; + elem->st.speculative |= speculative; if (env->stack_size > BPF_COMPLEXITY_LIMIT_STACK) { verbose(env, "BPF program is too complex\n"); goto err; @@ -3067,6 +3070,102 @@ static bool check_reg_sane_offset(struct bpf_verifier_env *env, return true; } +static struct bpf_insn_aux_data *cur_aux(struct bpf_verifier_env *env) +{ + return &env->insn_aux_data[env->insn_idx]; +} + +static int retrieve_ptr_limit(const struct bpf_reg_state *ptr_reg, + u32 *ptr_limit, u8 opcode, bool off_is_neg) +{ + bool mask_to_left = (opcode == BPF_ADD && off_is_neg) || + (opcode == BPF_SUB && !off_is_neg); + u32 off; + + switch (ptr_reg->type) { + case PTR_TO_STACK: + off = ptr_reg->off + ptr_reg->var_off.value; + if (mask_to_left) + *ptr_limit = MAX_BPF_STACK + off; + else + *ptr_limit = -off; + return 0; + case PTR_TO_MAP_VALUE: + if (mask_to_left) { + *ptr_limit = ptr_reg->umax_value + ptr_reg->off; + } else { + off = ptr_reg->smin_value + ptr_reg->off; + *ptr_limit = ptr_reg->map_ptr->value_size - off; + } + return 0; + default: + return -EINVAL; + } +} + +static int sanitize_ptr_alu(struct bpf_verifier_env *env, + struct bpf_insn *insn, + const struct bpf_reg_state *ptr_reg, + struct bpf_reg_state *dst_reg, + bool off_is_neg) +{ + struct bpf_verifier_state *vstate = env->cur_state; + struct bpf_insn_aux_data *aux = cur_aux(env); + bool ptr_is_dst_reg = ptr_reg == dst_reg; + u8 opcode = BPF_OP(insn->code); + u32 alu_state, alu_limit; + struct bpf_reg_state tmp; + bool ret; + + if (env->allow_ptr_leaks || BPF_SRC(insn->code) == BPF_K) + return 0; + + /* We already marked aux for masking from non-speculative + * paths, thus we got here in the first place. We only care + * to explore bad access from here. + */ + if (vstate->speculative) + goto do_sim; + + alu_state = off_is_neg ? BPF_ALU_NEG_VALUE : 0; + alu_state |= ptr_is_dst_reg ? + BPF_ALU_SANITIZE_SRC : BPF_ALU_SANITIZE_DST; + + if (retrieve_ptr_limit(ptr_reg, &alu_limit, opcode, off_is_neg)) + return 0; + + /* If we arrived here from different branches with different + * limits to sanitize, then this won't work. + */ + if (aux->alu_state && + (aux->alu_state != alu_state || + aux->alu_limit != alu_limit)) + return -EACCES; + + /* Corresponding fixup done in fixup_bpf_calls(). */ + aux->alu_state = alu_state; + aux->alu_limit = alu_limit; + +do_sim: + /* Simulate and find potential out-of-bounds access under + * speculative execution from truncation as a result of + * masking when off was not within expected range. If off + * sits in dst, then we temporarily need to move ptr there + * to simulate dst (== 0) +/-= ptr. Needed, for example, + * for cases where we use K-based arithmetic in one direction + * and truncated reg-based in the other in order to explore + * bad access. + */ + if (!ptr_is_dst_reg) { + tmp = *dst_reg; + *dst_reg = *ptr_reg; + } + ret = push_stack(env, env->insn_idx + 1, env->insn_idx, true); + if (!ptr_is_dst_reg) + *dst_reg = tmp; + return !ret ? -EFAULT : 0; +} + /* Handles arithmetic on a pointer and a scalar: computes new min/max and var_off. * Caller should also handle BPF_MOV case separately. * If we return -EACCES, caller may want to try again treating pointer as a @@ -3087,6 +3186,7 @@ static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env, umin_ptr = ptr_reg->umin_value, umax_ptr = ptr_reg->umax_value; u32 dst = insn->dst_reg, src = insn->src_reg; u8 opcode = BPF_OP(insn->code); + int ret; dst_reg = ®s[dst]; @@ -3142,6 +3242,11 @@ static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env, switch (opcode) { case BPF_ADD: + ret = sanitize_ptr_alu(env, insn, ptr_reg, dst_reg, smin_val < 0); + if (ret < 0) { + verbose(env, "R%d tried to add from different maps or paths\n", dst); + return ret; + } /* We can take a fixed offset as long as it doesn't overflow * the s32 'off' field */ @@ -3192,6 +3297,11 @@ static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env, } break; case BPF_SUB: + ret = sanitize_ptr_alu(env, insn, ptr_reg, dst_reg, smin_val < 0); + if (ret < 0) { + verbose(env, "R%d tried to sub from different maps or paths\n", dst); + return ret; + } if (dst_reg == off_reg) { /* scalar -= pointer. Creates an unknown scalar */ verbose(env, "R%d tried to subtract pointer from scalar\n", @@ -4389,7 +4499,8 @@ static int check_cond_jmp_op(struct bpf_verifier_env *env, } } - other_branch = push_stack(env, *insn_idx + insn->off + 1, *insn_idx); + other_branch = push_stack(env, *insn_idx + insn->off + 1, *insn_idx, + false); if (!other_branch) return -EFAULT; other_branch_regs = other_branch->frame[other_branch->curframe]->regs; @@ -5499,6 +5610,12 @@ static bool states_equal(struct bpf_verifier_env *env, if (old->curframe != cur->curframe) return false; + /* Verification state from speculative execution simulation + * must never prune a non-speculative execution one. + */ + if (old->speculative && !cur->speculative) + return false; + /* for states to be equal callsites have to be the same * and all frame states need to be equivalent */ @@ -5700,6 +5817,7 @@ static int do_check(struct bpf_verifier_env *env) if (!state) return -ENOMEM; state->curframe = 0; + state->speculative = false; state->frame[0] = kzalloc(sizeof(struct bpf_func_state), GFP_KERNEL); if (!state->frame[0]) { kfree(state); @@ -5739,8 +5857,10 @@ static int do_check(struct bpf_verifier_env *env) /* found equivalent state, can prune the search */ if (env->log.level) { if (do_print_state) - verbose(env, "\nfrom %d to %d: safe\n", - env->prev_insn_idx, env->insn_idx); + verbose(env, "\nfrom %d to %d%s: safe\n", + env->prev_insn_idx, env->insn_idx, + env->cur_state->speculative ? + " (speculative execution)" : ""); else verbose(env, "%d: safe\n", env->insn_idx); } @@ -5757,8 +5877,10 @@ static int do_check(struct bpf_verifier_env *env) if (env->log.level > 1) verbose(env, "%d:", env->insn_idx); else - verbose(env, "\nfrom %d to %d:", - env->prev_insn_idx, env->insn_idx); + verbose(env, "\nfrom %d to %d%s:", + env->prev_insn_idx, env->insn_idx, + env->cur_state->speculative ? + " (speculative execution)" : ""); print_verifier_state(env, state->frame[state->curframe]); do_print_state = false; } @@ -6750,6 +6872,57 @@ static int fixup_bpf_calls(struct bpf_verifier_env *env) continue; } + if (insn->code == (BPF_ALU64 | BPF_ADD | BPF_X) || + insn->code == (BPF_ALU64 | BPF_SUB | BPF_X)) { + const u8 code_add = BPF_ALU64 | BPF_ADD | BPF_X; + const u8 code_sub = BPF_ALU64 | BPF_SUB | BPF_X; + struct bpf_insn insn_buf[16]; + struct bpf_insn *patch = &insn_buf[0]; + bool issrc, isneg; + u32 off_reg; + + aux = &env->insn_aux_data[i + delta]; + if (!aux->alu_state) + continue; + + isneg = aux->alu_state & BPF_ALU_NEG_VALUE; + issrc = (aux->alu_state & BPF_ALU_SANITIZE) == + BPF_ALU_SANITIZE_SRC; + + off_reg = issrc ? insn->src_reg : insn->dst_reg; + if (isneg) + *patch++ = BPF_ALU64_IMM(BPF_MUL, off_reg, -1); + *patch++ = BPF_MOV32_IMM(BPF_REG_AX, aux->alu_limit - 1); + *patch++ = BPF_ALU64_REG(BPF_SUB, BPF_REG_AX, off_reg); + *patch++ = BPF_ALU64_REG(BPF_OR, BPF_REG_AX, off_reg); + *patch++ = BPF_ALU64_IMM(BPF_NEG, BPF_REG_AX, 0); + *patch++ = BPF_ALU64_IMM(BPF_ARSH, BPF_REG_AX, 63); + if (issrc) { + *patch++ = BPF_ALU64_REG(BPF_AND, BPF_REG_AX, + off_reg); + insn->src_reg = BPF_REG_AX; + } else { + *patch++ = BPF_ALU64_REG(BPF_AND, off_reg, + BPF_REG_AX); + } + if (isneg) + insn->code = insn->code == code_add ? + code_sub : code_add; + *patch++ = *insn; + if (issrc && isneg) + *patch++ = BPF_ALU64_IMM(BPF_MUL, off_reg, -1); + cnt = patch - insn_buf; + + new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt); + if (!new_prog) + return -ENOMEM; + + delta += cnt - 1; + env->prog = prog = new_prog; + insn = new_prog->insnsi + i + delta; + continue; + } + if (insn->code != (BPF_JMP | BPF_CALL)) continue; if (insn->src_reg == BPF_PSEUDO_CALL) -- cgit v1.3-7-g2ca7 From 96d4f267e40f9509e8a66e2b39e8b95655617693 Mon Sep 17 00:00:00 2001 From: Linus Torvalds Date: Thu, 3 Jan 2019 18:57:57 -0800 Subject: Remove 'type' argument from access_ok() function Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument of the user address range verification function since we got rid of the old racy i386-only code to walk page tables by hand. It existed because the original 80386 would not honor the write protect bit when in kernel mode, so you had to do COW by hand before doing any user access. But we haven't supported that in a long time, and these days the 'type' argument is a purely historical artifact. A discussion about extending 'user_access_begin()' to do the range checking resulted this patch, because there is no way we're going to move the old VERIFY_xyz interface to that model. And it's best done at the end of the merge window when I've done most of my merges, so let's just get this done once and for all. This patch was mostly done with a sed-script, with manual fix-ups for the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form. There were a couple of notable cases: - csky still had the old "verify_area()" name as an alias. - the iter_iov code had magical hardcoded knowledge of the actual values of VERIFY_{READ,WRITE} (not that they mattered, since nothing really used it) - microblaze used the type argument for a debug printout but other than those oddities this should be a total no-op patch. I tried to fix up all architectures, did fairly extensive grepping for access_ok() uses, and the changes are trivial, but I may have missed something. Any missed conversion should be trivially fixable, though. Signed-off-by: Linus Torvalds --- arch/alpha/include/asm/futex.h | 2 +- arch/alpha/include/asm/uaccess.h | 2 +- arch/alpha/kernel/signal.c | 12 +-- arch/alpha/lib/csum_partial_copy.c | 2 +- arch/arc/include/asm/futex.h | 2 +- arch/arc/kernel/process.c | 2 +- arch/arc/kernel/signal.c | 4 +- arch/arm/include/asm/futex.h | 4 +- arch/arm/include/asm/uaccess.h | 4 +- arch/arm/kernel/perf_callchain.c | 2 +- arch/arm/kernel/signal.c | 6 +- arch/arm/kernel/swp_emulate.c | 2 +- arch/arm/kernel/sys_oabi-compat.c | 4 +- arch/arm/kernel/traps.c | 2 +- arch/arm/oprofile/common.c | 2 +- arch/arm64/include/asm/futex.h | 2 +- arch/arm64/include/asm/uaccess.h | 8 +- arch/arm64/kernel/armv8_deprecated.c | 2 +- arch/arm64/kernel/perf_callchain.c | 4 +- arch/arm64/kernel/signal.c | 6 +- arch/arm64/kernel/signal32.c | 6 +- arch/arm64/kernel/sys_compat.c | 2 +- arch/c6x/kernel/signal.c | 4 +- arch/csky/abiv1/alignment.c | 4 +- arch/csky/include/asm/uaccess.h | 16 +--- arch/csky/kernel/signal.c | 2 +- arch/csky/lib/usercopy.c | 8 +- arch/h8300/kernel/signal.c | 4 +- arch/hexagon/include/asm/futex.h | 2 +- arch/hexagon/include/asm/uaccess.h | 3 - arch/hexagon/kernel/signal.c | 4 +- arch/hexagon/mm/uaccess.c | 2 +- arch/ia64/include/asm/futex.h | 2 +- arch/ia64/include/asm/uaccess.h | 2 +- arch/ia64/kernel/ptrace.c | 4 +- arch/ia64/kernel/signal.c | 4 +- arch/m68k/include/asm/uaccess_mm.h | 2 +- arch/m68k/include/asm/uaccess_no.h | 2 +- arch/m68k/kernel/signal.c | 4 +- arch/microblaze/include/asm/futex.h | 2 +- arch/microblaze/include/asm/uaccess.h | 23 +++--- arch/microblaze/kernel/signal.c | 4 +- arch/mips/include/asm/checksum.h | 4 +- arch/mips/include/asm/futex.h | 2 +- arch/mips/include/asm/termios.h | 4 +- arch/mips/include/asm/uaccess.h | 12 +-- arch/mips/kernel/mips-r2-to-r6-emul.c | 24 +++--- arch/mips/kernel/ptrace.c | 12 +-- arch/mips/kernel/signal.c | 12 +-- arch/mips/kernel/signal32.c | 4 +- arch/mips/kernel/signal_n32.c | 4 +- arch/mips/kernel/signal_o32.c | 8 +- arch/mips/kernel/syscall.c | 2 +- arch/mips/kernel/unaligned.c | 98 ++++++++++++------------- arch/mips/math-emu/cp1emu.c | 16 ++-- arch/mips/mm/cache.c | 2 +- arch/mips/mm/gup.c | 3 +- arch/mips/oprofile/backtrace.c | 2 +- arch/mips/sibyte/common/sb_tbprof.c | 2 +- arch/nds32/include/asm/futex.h | 2 +- arch/nds32/include/asm/uaccess.h | 11 +-- arch/nds32/kernel/perf_event_cpu.c | 11 ++- arch/nds32/kernel/signal.c | 4 +- arch/nds32/mm/alignment.c | 8 +- arch/nios2/include/asm/uaccess.h | 8 +- arch/nios2/kernel/signal.c | 2 +- arch/openrisc/include/asm/futex.h | 2 +- arch/openrisc/include/asm/uaccess.h | 8 +- arch/openrisc/kernel/signal.c | 6 +- arch/parisc/include/asm/futex.h | 2 +- arch/parisc/include/asm/uaccess.h | 2 +- arch/powerpc/include/asm/futex.h | 2 +- arch/powerpc/include/asm/uaccess.h | 8 +- arch/powerpc/kernel/align.c | 3 +- arch/powerpc/kernel/rtas_flash.c | 2 +- arch/powerpc/kernel/rtasd.c | 2 +- arch/powerpc/kernel/signal.c | 2 +- arch/powerpc/kernel/signal_32.c | 12 +-- arch/powerpc/kernel/signal_64.c | 13 ++-- arch/powerpc/kernel/syscalls.c | 2 +- arch/powerpc/kernel/traps.c | 2 +- arch/powerpc/kvm/book3s_64_mmu_hv.c | 4 +- arch/powerpc/lib/checksum_wrappers.c | 4 +- arch/powerpc/mm/fault.c | 2 +- arch/powerpc/mm/subpage-prot.c | 2 +- arch/powerpc/oprofile/backtrace.c | 4 +- arch/powerpc/platforms/cell/spufs/file.c | 16 ++-- arch/powerpc/platforms/powernv/opal-lpc.c | 4 +- arch/powerpc/platforms/pseries/scanlog.c | 2 +- arch/riscv/include/asm/futex.h | 2 +- arch/riscv/include/asm/uaccess.h | 14 +--- arch/riscv/kernel/signal.c | 4 +- arch/s390/include/asm/uaccess.h | 2 +- arch/sh/include/asm/checksum_32.h | 2 +- arch/sh/include/asm/futex.h | 2 +- arch/sh/include/asm/uaccess.h | 9 +-- arch/sh/kernel/signal_32.c | 8 +- arch/sh/kernel/signal_64.c | 8 +- arch/sh/kernel/traps_64.c | 12 +-- arch/sh/mm/gup.c | 3 +- arch/sh/oprofile/backtrace.c | 2 +- arch/sparc/include/asm/checksum_32.h | 2 +- arch/sparc/include/asm/uaccess_32.h | 2 +- arch/sparc/include/asm/uaccess_64.h | 2 +- arch/sparc/kernel/sigutil_32.c | 2 +- arch/sparc/kernel/unaligned_32.c | 7 +- arch/um/kernel/ptrace.c | 4 +- arch/unicore32/kernel/signal.c | 4 +- arch/x86/entry/vsyscall/vsyscall_64.c | 2 +- arch/x86/ia32/ia32_aout.c | 4 +- arch/x86/ia32/ia32_signal.c | 8 +- arch/x86/ia32/sys_ia32.c | 2 +- arch/x86/include/asm/checksum_32.h | 2 +- arch/x86/include/asm/pgtable_32.h | 2 +- arch/x86/include/asm/uaccess.h | 7 +- arch/x86/kernel/fpu/signal.c | 4 +- arch/x86/kernel/signal.c | 14 ++-- arch/x86/kernel/stacktrace.c | 2 +- arch/x86/kernel/vm86_32.c | 4 +- arch/x86/lib/csum-wrappers_64.c | 4 +- arch/x86/lib/usercopy_32.c | 2 +- arch/x86/lib/usercopy_64.c | 2 +- arch/x86/math-emu/fpu_system.h | 4 +- arch/x86/math-emu/load_store.c | 6 +- arch/x86/math-emu/reg_ld_str.c | 48 ++++++------ arch/x86/mm/mpx.c | 2 +- arch/x86/um/asm/checksum_32.h | 2 +- arch/x86/um/signal.c | 6 +- arch/xtensa/include/asm/checksum.h | 2 +- arch/xtensa/include/asm/futex.h | 2 +- arch/xtensa/include/asm/uaccess.h | 10 +-- arch/xtensa/kernel/signal.c | 4 +- arch/xtensa/kernel/stacktrace.c | 2 +- drivers/acpi/acpi_dbg.c | 4 +- drivers/char/generic_nvram.c | 4 +- drivers/char/mem.c | 4 +- drivers/char/nwflash.c | 2 +- drivers/char/pcmcia/cm4000_cs.c | 4 +- drivers/crypto/ccp/psp-dev.c | 6 +- drivers/firewire/core-cdev.c | 2 +- drivers/firmware/efi/test/efi_test.c | 8 +- drivers/fpga/dfl-afu-dma-region.c | 2 +- drivers/fpga/dfl-fme-pr.c | 3 +- drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 18 ++--- drivers/gpu/drm/armada/armada_gem.c | 2 +- drivers/gpu/drm/drm_file.c | 2 +- drivers/gpu/drm/etnaviv/etnaviv_drv.c | 8 +- drivers/gpu/drm/i915/i915_gem.c | 7 +- drivers/gpu/drm/i915/i915_gem_execbuffer.c | 6 +- drivers/gpu/drm/i915/i915_gem_userptr.c | 3 +- drivers/gpu/drm/i915/i915_ioc32.c | 2 +- drivers/gpu/drm/i915/i915_perf.c | 2 +- drivers/gpu/drm/i915/i915_query.c | 2 +- drivers/gpu/drm/msm/msm_gem_submit.c | 2 +- drivers/gpu/drm/qxl/qxl_ioctl.c | 3 +- drivers/infiniband/core/uverbs_main.c | 3 +- drivers/infiniband/hw/hfi1/user_exp_rcv.c | 2 +- drivers/infiniband/hw/qib/qib_file_ops.c | 2 +- drivers/macintosh/ans-lcd.c | 2 +- drivers/macintosh/via-pmu.c | 2 +- drivers/media/pci/ivtv/ivtvfb.c | 2 +- drivers/media/v4l2-core/v4l2-compat-ioctl32.c | 46 ++++++------ drivers/misc/vmw_vmci/vmci_host.c | 2 +- drivers/pci/proc.c | 4 +- drivers/platform/goldfish/goldfish_pipe.c | 3 +- drivers/pnp/isapnp/proc.c | 2 +- drivers/scsi/pmcraid.c | 4 +- drivers/scsi/scsi_ioctl.c | 2 +- drivers/scsi/sg.c | 16 ++-- drivers/staging/comedi/comedi_compat32.c | 24 +++--- drivers/tty/n_hdlc.c | 2 +- drivers/usb/core/devices.c | 2 +- drivers/usb/core/devio.c | 7 +- drivers/usb/gadget/function/f_hid.c | 4 +- drivers/usb/gadget/udc/atmel_usba_udc.c | 2 +- drivers/vhost/vhost.c | 16 ++-- drivers/video/fbdev/amifb.c | 4 +- drivers/video/fbdev/omap2/omapfb/omapfb-ioctl.c | 2 +- drivers/xen/privcmd.c | 6 +- fs/binfmt_aout.c | 4 +- fs/btrfs/send.c | 2 +- fs/eventpoll.c | 2 +- fs/fat/dir.c | 4 +- fs/ioctl.c | 2 +- fs/namespace.c | 2 +- fs/ocfs2/dlmfs/dlmfs.c | 4 +- fs/pstore/pmsg.c | 2 +- fs/pstore/ram_core.c | 2 +- fs/read_write.c | 13 ++-- fs/readdir.c | 10 +-- fs/select.c | 11 +-- include/asm-generic/uaccess.h | 12 +-- include/linux/regset.h | 4 +- include/linux/uaccess.h | 9 +-- include/net/checksum.h | 4 +- kernel/bpf/syscall.c | 2 +- kernel/compat.c | 16 ++-- kernel/events/core.c | 2 +- kernel/exit.c | 4 +- kernel/futex.c | 35 +++++---- kernel/printk/printk.c | 4 +- kernel/ptrace.c | 4 +- kernel/rseq.c | 6 +- kernel/sched/core.c | 4 +- kernel/signal.c | 8 +- kernel/sys.c | 2 +- kernel/trace/bpf_trace.c | 2 +- lib/bitmap.c | 4 +- lib/iov_iter.c | 8 +- lib/usercopy.c | 4 +- mm/gup.c | 6 +- mm/mincore.c | 4 +- net/batman-adv/icmp_socket.c | 2 +- net/batman-adv/log.c | 2 +- net/compat.c | 30 ++++---- net/sunrpc/sysctl.c | 2 +- security/tomoyo/common.c | 2 +- sound/core/seq/seq_clientmgr.c | 2 +- sound/isa/sb/emu8000_patch.c | 4 +- tools/perf/util/include/asm/uaccess.h | 2 +- virt/kvm/kvm_main.c | 3 +- 221 files changed, 610 insertions(+), 679 deletions(-) (limited to 'kernel') diff --git a/arch/alpha/include/asm/futex.h b/arch/alpha/include/asm/futex.h index ca3322536f72..bfd3c01038f8 100644 --- a/arch/alpha/include/asm/futex.h +++ b/arch/alpha/include/asm/futex.h @@ -68,7 +68,7 @@ futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr, int ret = 0, cmp; u32 prev; - if (!access_ok(VERIFY_WRITE, uaddr, sizeof(u32))) + if (!access_ok(uaddr, sizeof(u32))) return -EFAULT; __asm__ __volatile__ ( diff --git a/arch/alpha/include/asm/uaccess.h b/arch/alpha/include/asm/uaccess.h index 87d8c4f0307d..e69c4e13c328 100644 --- a/arch/alpha/include/asm/uaccess.h +++ b/arch/alpha/include/asm/uaccess.h @@ -36,7 +36,7 @@ #define __access_ok(addr, size) \ ((get_fs().seg & (addr | size | (addr+size))) == 0) -#define access_ok(type, addr, size) \ +#define access_ok(addr, size) \ ({ \ __chk_user_ptr(addr); \ __access_ok(((unsigned long)(addr)), (size)); \ diff --git a/arch/alpha/kernel/signal.c b/arch/alpha/kernel/signal.c index 8c0c4ee0be6e..33e904a05881 100644 --- a/arch/alpha/kernel/signal.c +++ b/arch/alpha/kernel/signal.c @@ -65,7 +65,7 @@ SYSCALL_DEFINE3(osf_sigaction, int, sig, if (act) { old_sigset_t mask; - if (!access_ok(VERIFY_READ, act, sizeof(*act)) || + if (!access_ok(act, sizeof(*act)) || __get_user(new_ka.sa.sa_handler, &act->sa_handler) || __get_user(new_ka.sa.sa_flags, &act->sa_flags) || __get_user(mask, &act->sa_mask)) @@ -77,7 +77,7 @@ SYSCALL_DEFINE3(osf_sigaction, int, sig, ret = do_sigaction(sig, act ? &new_ka : NULL, oact ? &old_ka : NULL); if (!ret && oact) { - if (!access_ok(VERIFY_WRITE, oact, sizeof(*oact)) || + if (!access_ok(oact, sizeof(*oact)) || __put_user(old_ka.sa.sa_handler, &oact->sa_handler) || __put_user(old_ka.sa.sa_flags, &oact->sa_flags) || __put_user(old_ka.sa.sa_mask.sig[0], &oact->sa_mask)) @@ -207,7 +207,7 @@ do_sigreturn(struct sigcontext __user *sc) sigset_t set; /* Verify that it's a good sigcontext before using it */ - if (!access_ok(VERIFY_READ, sc, sizeof(*sc))) + if (!access_ok(sc, sizeof(*sc))) goto give_sigsegv; if (__get_user(set.sig[0], &sc->sc_mask)) goto give_sigsegv; @@ -235,7 +235,7 @@ do_rt_sigreturn(struct rt_sigframe __user *frame) sigset_t set; /* Verify that it's a good ucontext_t before using it */ - if (!access_ok(VERIFY_READ, &frame->uc, sizeof(frame->uc))) + if (!access_ok(&frame->uc, sizeof(frame->uc))) goto give_sigsegv; if (__copy_from_user(&set, &frame->uc.uc_sigmask, sizeof(set))) goto give_sigsegv; @@ -332,7 +332,7 @@ setup_frame(struct ksignal *ksig, sigset_t *set, struct pt_regs *regs) oldsp = rdusp(); frame = get_sigframe(ksig, oldsp, sizeof(*frame)); - if (!access_ok(VERIFY_WRITE, frame, sizeof(*frame))) + if (!access_ok(frame, sizeof(*frame))) return -EFAULT; err |= setup_sigcontext(&frame->sc, regs, set->sig[0], oldsp); @@ -377,7 +377,7 @@ setup_rt_frame(struct ksignal *ksig, sigset_t *set, struct pt_regs *regs) oldsp = rdusp(); frame = get_sigframe(ksig, oldsp, sizeof(*frame)); - if (!access_ok(VERIFY_WRITE, frame, sizeof(*frame))) + if (!access_ok(frame, sizeof(*frame))) return -EFAULT; err |= copy_siginfo_to_user(&frame->info, &ksig->info); diff --git a/arch/alpha/lib/csum_partial_copy.c b/arch/alpha/lib/csum_partial_copy.c index ddb9c2f376fa..e53f96e8aa6d 100644 --- a/arch/alpha/lib/csum_partial_copy.c +++ b/arch/alpha/lib/csum_partial_copy.c @@ -333,7 +333,7 @@ csum_partial_copy_from_user(const void __user *src, void *dst, int len, unsigned long doff = 7 & (unsigned long) dst; if (len) { - if (!access_ok(VERIFY_READ, src, len)) { + if (!access_ok(src, len)) { if (errp) *errp = -EFAULT; memset(dst, 0, len); return sum; diff --git a/arch/arc/include/asm/futex.h b/arch/arc/include/asm/futex.h index eb887dd13e74..c29c3fae6854 100644 --- a/arch/arc/include/asm/futex.h +++ b/arch/arc/include/asm/futex.h @@ -126,7 +126,7 @@ futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr, u32 expval, int ret = 0; u32 existval; - if (!access_ok(VERIFY_WRITE, uaddr, sizeof(u32))) + if (!access_ok(uaddr, sizeof(u32))) return -EFAULT; #ifndef CONFIG_ARC_HAS_LLSC diff --git a/arch/arc/kernel/process.c b/arch/arc/kernel/process.c index 8ce6e7235915..641c364fc232 100644 --- a/arch/arc/kernel/process.c +++ b/arch/arc/kernel/process.c @@ -61,7 +61,7 @@ SYSCALL_DEFINE3(arc_usr_cmpxchg, int *, uaddr, int, expected, int, new) /* Z indicates to userspace if operation succeded */ regs->status32 &= ~STATUS_Z_MASK; - ret = access_ok(VERIFY_WRITE, uaddr, sizeof(*uaddr)); + ret = access_ok(uaddr, sizeof(*uaddr)); if (!ret) goto fail; diff --git a/arch/arc/kernel/signal.c b/arch/arc/kernel/signal.c index 48685445002e..1bfb7de696bd 100644 --- a/arch/arc/kernel/signal.c +++ b/arch/arc/kernel/signal.c @@ -169,7 +169,7 @@ SYSCALL_DEFINE0(rt_sigreturn) sf = (struct rt_sigframe __force __user *)(regs->sp); - if (!access_ok(VERIFY_READ, sf, sizeof(*sf))) + if (!access_ok(sf, sizeof(*sf))) goto badframe; if (__get_user(magic, &sf->sigret_magic)) @@ -219,7 +219,7 @@ static inline void __user *get_sigframe(struct ksignal *ksig, frame = (void __user *)((sp - framesize) & ~7); /* Check that we can actually write to the signal frame */ - if (!access_ok(VERIFY_WRITE, frame, framesize)) + if (!access_ok(frame, framesize)) frame = NULL; return frame; diff --git a/arch/arm/include/asm/futex.h b/arch/arm/include/asm/futex.h index ffebe7b7a5b7..0a46676b4245 100644 --- a/arch/arm/include/asm/futex.h +++ b/arch/arm/include/asm/futex.h @@ -50,7 +50,7 @@ futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr, int ret; u32 val; - if (!access_ok(VERIFY_WRITE, uaddr, sizeof(u32))) + if (!access_ok(uaddr, sizeof(u32))) return -EFAULT; smp_mb(); @@ -104,7 +104,7 @@ futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr, int ret = 0; u32 val; - if (!access_ok(VERIFY_WRITE, uaddr, sizeof(u32))) + if (!access_ok(uaddr, sizeof(u32))) return -EFAULT; preempt_disable(); diff --git a/arch/arm/include/asm/uaccess.h b/arch/arm/include/asm/uaccess.h index c136eef8f690..27ed17ec45fe 100644 --- a/arch/arm/include/asm/uaccess.h +++ b/arch/arm/include/asm/uaccess.h @@ -279,7 +279,7 @@ static inline void set_fs(mm_segment_t fs) #endif /* CONFIG_MMU */ -#define access_ok(type, addr, size) (__range_ok(addr, size) == 0) +#define access_ok(addr, size) (__range_ok(addr, size) == 0) #define user_addr_max() \ (uaccess_kernel() ? ~0UL : get_fs()) @@ -560,7 +560,7 @@ raw_copy_to_user(void __user *to, const void *from, unsigned long n) static inline unsigned long __must_check clear_user(void __user *to, unsigned long n) { - if (access_ok(VERIFY_WRITE, to, n)) + if (access_ok(to, n)) n = __clear_user(to, n); return n; } diff --git a/arch/arm/kernel/perf_callchain.c b/arch/arm/kernel/perf_callchain.c index 08e43a32a693..3b69a76d341e 100644 --- a/arch/arm/kernel/perf_callchain.c +++ b/arch/arm/kernel/perf_callchain.c @@ -37,7 +37,7 @@ user_backtrace(struct frame_tail __user *tail, struct frame_tail buftail; unsigned long err; - if (!access_ok(VERIFY_READ, tail, sizeof(buftail))) + if (!access_ok(tail, sizeof(buftail))) return NULL; pagefault_disable(); diff --git a/arch/arm/kernel/signal.c b/arch/arm/kernel/signal.c index b908382b69ff..76bb8de6bf6b 100644 --- a/arch/arm/kernel/signal.c +++ b/arch/arm/kernel/signal.c @@ -241,7 +241,7 @@ asmlinkage int sys_sigreturn(struct pt_regs *regs) frame = (struct sigframe __user *)regs->ARM_sp; - if (!access_ok(VERIFY_READ, frame, sizeof (*frame))) + if (!access_ok(frame, sizeof (*frame))) goto badframe; if (restore_sigframe(regs, frame)) @@ -271,7 +271,7 @@ asmlinkage int sys_rt_sigreturn(struct pt_regs *regs) frame = (struct rt_sigframe __user *)regs->ARM_sp; - if (!access_ok(VERIFY_READ, frame, sizeof (*frame))) + if (!access_ok(frame, sizeof (*frame))) goto badframe; if (restore_sigframe(regs, &frame->sig)) @@ -355,7 +355,7 @@ get_sigframe(struct ksignal *ksig, struct pt_regs *regs, int framesize) /* * Check that we can actually write to the signal frame. */ - if (!access_ok(VERIFY_WRITE, frame, framesize)) + if (!access_ok(frame, framesize)) frame = NULL; return frame; diff --git a/arch/arm/kernel/swp_emulate.c b/arch/arm/kernel/swp_emulate.c index a188d5e8ab7f..76f6e6a9736c 100644 --- a/arch/arm/kernel/swp_emulate.c +++ b/arch/arm/kernel/swp_emulate.c @@ -198,7 +198,7 @@ static int swp_handler(struct pt_regs *regs, unsigned int instr) destreg, EXTRACT_REG_NUM(instr, RT2_OFFSET), data); /* Check access in reasonable access range for both SWP and SWPB */ - if (!access_ok(VERIFY_WRITE, (address & ~3), 4)) { + if (!access_ok((address & ~3), 4)) { pr_debug("SWP{B} emulation: access to %p not allowed!\n", (void *)address); res = -EFAULT; diff --git a/arch/arm/kernel/sys_oabi-compat.c b/arch/arm/kernel/sys_oabi-compat.c index 40da0872170f..92ab36f38795 100644 --- a/arch/arm/kernel/sys_oabi-compat.c +++ b/arch/arm/kernel/sys_oabi-compat.c @@ -285,7 +285,7 @@ asmlinkage long sys_oabi_epoll_wait(int epfd, maxevents > (INT_MAX/sizeof(*kbuf)) || maxevents > (INT_MAX/sizeof(*events))) return -EINVAL; - if (!access_ok(VERIFY_WRITE, events, sizeof(*events) * maxevents)) + if (!access_ok(events, sizeof(*events) * maxevents)) return -EFAULT; kbuf = kmalloc_array(maxevents, sizeof(*kbuf), GFP_KERNEL); if (!kbuf) @@ -326,7 +326,7 @@ asmlinkage long sys_oabi_semtimedop(int semid, if (nsops < 1 || nsops > SEMOPM) return -EINVAL; - if (!access_ok(VERIFY_READ, tsops, sizeof(*tsops) * nsops)) + if (!access_ok(tsops, sizeof(*tsops) * nsops)) return -EFAULT; sops = kmalloc_array(nsops, sizeof(*sops), GFP_KERNEL); if (!sops) diff --git a/arch/arm/kernel/traps.c b/arch/arm/kernel/traps.c index 2d668cff8ef4..33af097c454b 100644 --- a/arch/arm/kernel/traps.c +++ b/arch/arm/kernel/traps.c @@ -582,7 +582,7 @@ do_cache_op(unsigned long start, unsigned long end, int flags) if (end < start || flags) return -EINVAL; - if (!access_ok(VERIFY_READ, start, end - start)) + if (!access_ok(start, end - start)) return -EFAULT; return __do_cache_op(start, end); diff --git a/arch/arm/oprofile/common.c b/arch/arm/oprofile/common.c index cc649a1e46da..7cb3e0453fcd 100644 --- a/arch/arm/oprofile/common.c +++ b/arch/arm/oprofile/common.c @@ -88,7 +88,7 @@ static struct frame_tail* user_backtrace(struct frame_tail *tail) struct frame_tail buftail[2]; /* Also check accessibility of one struct frame_tail beyond */ - if (!access_ok(VERIFY_READ, tail, sizeof(buftail))) + if (!access_ok(tail, sizeof(buftail))) return NULL; if (__copy_from_user_inatomic(buftail, tail, sizeof(buftail))) return NULL; diff --git a/arch/arm64/include/asm/futex.h b/arch/arm64/include/asm/futex.h index 07fe2479d310..cccb83ad7fa8 100644 --- a/arch/arm64/include/asm/futex.h +++ b/arch/arm64/include/asm/futex.h @@ -96,7 +96,7 @@ futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *_uaddr, u32 val, tmp; u32 __user *uaddr; - if (!access_ok(VERIFY_WRITE, _uaddr, sizeof(u32))) + if (!access_ok(_uaddr, sizeof(u32))) return -EFAULT; uaddr = __uaccess_mask_ptr(_uaddr); diff --git a/arch/arm64/include/asm/uaccess.h b/arch/arm64/include/asm/uaccess.h index ed252435fd92..547d7a0c9d05 100644 --- a/arch/arm64/include/asm/uaccess.h +++ b/arch/arm64/include/asm/uaccess.h @@ -95,7 +95,7 @@ static inline unsigned long __range_ok(const void __user *addr, unsigned long si return ret; } -#define access_ok(type, addr, size) __range_ok(addr, size) +#define access_ok(addr, size) __range_ok(addr, size) #define user_addr_max get_fs #define _ASM_EXTABLE(from, to) \ @@ -301,7 +301,7 @@ do { \ ({ \ __typeof__(*(ptr)) __user *__p = (ptr); \ might_fault(); \ - if (access_ok(VERIFY_READ, __p, sizeof(*__p))) { \ + if (access_ok(__p, sizeof(*__p))) { \ __p = uaccess_mask_ptr(__p); \ __get_user_err((x), __p, (err)); \ } else { \ @@ -370,7 +370,7 @@ do { \ ({ \ __typeof__(*(ptr)) __user *__p = (ptr); \ might_fault(); \ - if (access_ok(VERIFY_WRITE, __p, sizeof(*__p))) { \ + if (access_ok(__p, sizeof(*__p))) { \ __p = uaccess_mask_ptr(__p); \ __put_user_err((x), __p, (err)); \ } else { \ @@ -418,7 +418,7 @@ extern unsigned long __must_check __arch_copy_in_user(void __user *to, const voi extern unsigned long __must_check __arch_clear_user(void __user *to, unsigned long n); static inline unsigned long __must_check __clear_user(void __user *to, unsigned long n) { - if (access_ok(VERIFY_WRITE, to, n)) + if (access_ok(to, n)) n = __arch_clear_user(__uaccess_mask_ptr(to), n); return n; } diff --git a/arch/arm64/kernel/armv8_deprecated.c b/arch/arm64/kernel/armv8_deprecated.c index 92be1d12d590..e52e7280884a 100644 --- a/arch/arm64/kernel/armv8_deprecated.c +++ b/arch/arm64/kernel/armv8_deprecated.c @@ -402,7 +402,7 @@ static int swp_handler(struct pt_regs *regs, u32 instr) /* Check access in reasonable access range for both SWP and SWPB */ user_ptr = (const void __user *)(unsigned long)(address & ~3); - if (!access_ok(VERIFY_WRITE, user_ptr, 4)) { + if (!access_ok(user_ptr, 4)) { pr_debug("SWP{B} emulation: access to 0x%08x not allowed!\n", address); goto fault; diff --git a/arch/arm64/kernel/perf_callchain.c b/arch/arm64/kernel/perf_callchain.c index a34c26afacb0..61d983f5756f 100644 --- a/arch/arm64/kernel/perf_callchain.c +++ b/arch/arm64/kernel/perf_callchain.c @@ -39,7 +39,7 @@ user_backtrace(struct frame_tail __user *tail, unsigned long lr; /* Also check accessibility of one struct frame_tail beyond */ - if (!access_ok(VERIFY_READ, tail, sizeof(buftail))) + if (!access_ok(tail, sizeof(buftail))) return NULL; pagefault_disable(); @@ -86,7 +86,7 @@ compat_user_backtrace(struct compat_frame_tail __user *tail, unsigned long err; /* Also check accessibility of one struct frame_tail beyond */ - if (!access_ok(VERIFY_READ, tail, sizeof(buftail))) + if (!access_ok(tail, sizeof(buftail))) return NULL; pagefault_disable(); diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c index 5dcc942906db..867a7cea70e5 100644 --- a/arch/arm64/kernel/signal.c +++ b/arch/arm64/kernel/signal.c @@ -470,7 +470,7 @@ static int parse_user_sigframe(struct user_ctxs *user, offset = 0; limit = extra_size; - if (!access_ok(VERIFY_READ, base, limit)) + if (!access_ok(base, limit)) goto invalid; continue; @@ -556,7 +556,7 @@ SYSCALL_DEFINE0(rt_sigreturn) frame = (struct rt_sigframe __user *)regs->sp; - if (!access_ok(VERIFY_READ, frame, sizeof (*frame))) + if (!access_ok(frame, sizeof (*frame))) goto badframe; if (restore_sigframe(regs, frame)) @@ -730,7 +730,7 @@ static int get_sigframe(struct rt_sigframe_user_layout *user, /* * Check that we can actually write to the signal frame. */ - if (!access_ok(VERIFY_WRITE, user->sigframe, sp_top - sp)) + if (!access_ok(user->sigframe, sp_top - sp)) return -EFAULT; return 0; diff --git a/arch/arm64/kernel/signal32.c b/arch/arm64/kernel/signal32.c index 24b09003f821..cb7800acd19f 100644 --- a/arch/arm64/kernel/signal32.c +++ b/arch/arm64/kernel/signal32.c @@ -303,7 +303,7 @@ COMPAT_SYSCALL_DEFINE0(sigreturn) frame = (struct compat_sigframe __user *)regs->compat_sp; - if (!access_ok(VERIFY_READ, frame, sizeof (*frame))) + if (!access_ok(frame, sizeof (*frame))) goto badframe; if (compat_restore_sigframe(regs, frame)) @@ -334,7 +334,7 @@ COMPAT_SYSCALL_DEFINE0(rt_sigreturn) frame = (struct compat_rt_sigframe __user *)regs->compat_sp; - if (!access_ok(VERIFY_READ, frame, sizeof (*frame))) + if (!access_ok(frame, sizeof (*frame))) goto badframe; if (compat_restore_sigframe(regs, &frame->sig)) @@ -365,7 +365,7 @@ static void __user *compat_get_sigframe(struct ksignal *ksig, /* * Check that we can actually write to the signal frame. */ - if (!access_ok(VERIFY_WRITE, frame, framesize)) + if (!access_ok(frame, framesize)) frame = NULL; return frame; diff --git a/arch/arm64/kernel/sys_compat.c b/arch/arm64/kernel/sys_compat.c index 32653d156747..21005dfe8406 100644 --- a/arch/arm64/kernel/sys_compat.c +++ b/arch/arm64/kernel/sys_compat.c @@ -58,7 +58,7 @@ do_compat_cache_op(unsigned long start, unsigned long end, int flags) if (end < start || flags) return -EINVAL; - if (!access_ok(VERIFY_READ, (const void __user *)start, end - start)) + if (!access_ok((const void __user *)start, end - start)) return -EFAULT; return __do_compat_cache_op(start, end); diff --git a/arch/c6x/kernel/signal.c b/arch/c6x/kernel/signal.c index 3c4bb5a5c382..33b9f69c38f7 100644 --- a/arch/c6x/kernel/signal.c +++ b/arch/c6x/kernel/signal.c @@ -80,7 +80,7 @@ asmlinkage int do_rt_sigreturn(struct pt_regs *regs) frame = (struct rt_sigframe __user *) ((unsigned long) regs->sp + 8); - if (!access_ok(VERIFY_READ, frame, sizeof(*frame))) + if (!access_ok(frame, sizeof(*frame))) goto badframe; if (__copy_from_user(&set, &frame->uc.uc_sigmask, sizeof(set))) goto badframe; @@ -149,7 +149,7 @@ static int setup_rt_frame(struct ksignal *ksig, sigset_t *set, frame = get_sigframe(ksig, regs, sizeof(*frame)); - if (!access_ok(VERIFY_WRITE, frame, sizeof(*frame))) + if (!access_ok(frame, sizeof(*frame))) return -EFAULT; err |= __put_user(&frame->info, &frame->pinfo); diff --git a/arch/csky/abiv1/alignment.c b/arch/csky/abiv1/alignment.c index 60205e98fb87..d789be36eb4f 100644 --- a/arch/csky/abiv1/alignment.c +++ b/arch/csky/abiv1/alignment.c @@ -32,7 +32,7 @@ static int ldb_asm(uint32_t addr, uint32_t *valp) uint32_t val; int err; - if (!access_ok(VERIFY_READ, (void *)addr, 1)) + if (!access_ok((void *)addr, 1)) return 1; asm volatile ( @@ -67,7 +67,7 @@ static int stb_asm(uint32_t addr, uint32_t val) { int err; - if (!access_ok(VERIFY_WRITE, (void *)addr, 1)) + if (!access_ok((void *)addr, 1)) return 1; asm volatile ( diff --git a/arch/csky/include/asm/uaccess.h b/arch/csky/include/asm/uaccess.h index acaf0e210d81..eaa1c3403a42 100644 --- a/arch/csky/include/asm/uaccess.h +++ b/arch/csky/include/asm/uaccess.h @@ -16,10 +16,7 @@ #include #include -#define VERIFY_READ 0 -#define VERIFY_WRITE 1 - -static inline int access_ok(int type, const void *addr, unsigned long size) +static inline int access_ok(const void *addr, unsigned long size) { unsigned long limit = current_thread_info()->addr_limit.seg; @@ -27,12 +24,7 @@ static inline int access_ok(int type, const void *addr, unsigned long size) ((unsigned long)(addr + size) < limit)); } -static inline int verify_area(int type, const void *addr, unsigned long size) -{ - return access_ok(type, addr, size) ? 0 : -EFAULT; -} - -#define __addr_ok(addr) (access_ok(VERIFY_READ, addr, 0)) +#define __addr_ok(addr) (access_ok(addr, 0)) extern int __put_user_bad(void); @@ -91,7 +83,7 @@ extern int __put_user_bad(void); long __pu_err = -EFAULT; \ typeof(*(ptr)) *__pu_addr = (ptr); \ typeof(*(ptr)) __pu_val = (typeof(*(ptr)))(x); \ - if (access_ok(VERIFY_WRITE, __pu_addr, size) && __pu_addr) \ + if (access_ok(__pu_addr, size) && __pu_addr) \ __put_user_size(__pu_val, __pu_addr, (size), __pu_err); \ __pu_err; \ }) @@ -217,7 +209,7 @@ do { \ ({ \ int __gu_err = -EFAULT; \ const __typeof__(*(ptr)) __user *__gu_ptr = (ptr); \ - if (access_ok(VERIFY_READ, __gu_ptr, size) && __gu_ptr) \ + if (access_ok(__gu_ptr, size) && __gu_ptr) \ __get_user_size(x, __gu_ptr, size, __gu_err); \ __gu_err; \ }) diff --git a/arch/csky/kernel/signal.c b/arch/csky/kernel/signal.c index 66e1b729b10b..9967c10eee2b 100644 --- a/arch/csky/kernel/signal.c +++ b/arch/csky/kernel/signal.c @@ -88,7 +88,7 @@ do_rt_sigreturn(void) struct pt_regs *regs = current_pt_regs(); struct rt_sigframe *frame = (struct rt_sigframe *)(regs->usp); - if (verify_area(VERIFY_READ, frame, sizeof(*frame))) + if (!access_ok(frame, sizeof(*frame))) goto badframe; if (__copy_from_user(&set, &frame->uc.uc_sigmask, sizeof(set))) goto badframe; diff --git a/arch/csky/lib/usercopy.c b/arch/csky/lib/usercopy.c index ac9170e2cbb8..647a23986fb5 100644 --- a/arch/csky/lib/usercopy.c +++ b/arch/csky/lib/usercopy.c @@ -7,7 +7,7 @@ unsigned long raw_copy_from_user(void *to, const void *from, unsigned long n) { - if (access_ok(VERIFY_READ, from, n)) + if (access_ok(from, n)) __copy_user_zeroing(to, from, n); else memset(to, 0, n); @@ -18,7 +18,7 @@ EXPORT_SYMBOL(raw_copy_from_user); unsigned long raw_copy_to_user(void *to, const void *from, unsigned long n) { - if (access_ok(VERIFY_WRITE, to, n)) + if (access_ok(to, n)) __copy_user(to, from, n); return n; } @@ -113,7 +113,7 @@ long strncpy_from_user(char *dst, const char *src, long count) { long res = -EFAULT; - if (access_ok(VERIFY_READ, src, 1)) + if (access_ok(src, 1)) __do_strncpy_from_user(dst, src, count, res); return res; } @@ -236,7 +236,7 @@ do { \ unsigned long clear_user(void __user *to, unsigned long n) { - if (access_ok(VERIFY_WRITE, to, n)) + if (access_ok(to, n)) __do_clear_user(to, n); return n; } diff --git a/arch/h8300/kernel/signal.c b/arch/h8300/kernel/signal.c index 1e8070d08770..e0f2b708e5d9 100644 --- a/arch/h8300/kernel/signal.c +++ b/arch/h8300/kernel/signal.c @@ -110,7 +110,7 @@ asmlinkage int sys_rt_sigreturn(void) sigset_t set; int er0; - if (!access_ok(VERIFY_READ, frame, sizeof(*frame))) + if (!access_ok(frame, sizeof(*frame))) goto badframe; if (__copy_from_user(&set, &frame->uc.uc_sigmask, sizeof(set))) goto badframe; @@ -165,7 +165,7 @@ static int setup_rt_frame(struct ksignal *ksig, sigset_t *set, frame = get_sigframe(ksig, regs, sizeof(*frame)); - if (!access_ok(VERIFY_WRITE, frame, sizeof(*frame))) + if (!access_ok(frame, sizeof(*frame))) return -EFAULT; if (ksig->ka.sa.sa_flags & SA_SIGINFO) diff --git a/arch/hexagon/include/asm/futex.h b/arch/hexagon/include/asm/futex.h index c889f5993ecd..cb635216a732 100644 --- a/arch/hexagon/include/asm/futex.h +++ b/arch/hexagon/include/asm/futex.h @@ -77,7 +77,7 @@ futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr, u32 oldval, int prev; int ret; - if (!access_ok(VERIFY_WRITE, uaddr, sizeof(u32))) + if (!access_ok(uaddr, sizeof(u32))) return -EFAULT; __asm__ __volatile__ ( diff --git a/arch/hexagon/include/asm/uaccess.h b/arch/hexagon/include/asm/uaccess.h index 458b69886b34..a30e58d5f351 100644 --- a/arch/hexagon/include/asm/uaccess.h +++ b/arch/hexagon/include/asm/uaccess.h @@ -29,9 +29,6 @@ /* * access_ok: - Checks if a user space pointer is valid - * @type: Type of access: %VERIFY_READ or %VERIFY_WRITE. Note that - * %VERIFY_WRITE is a superset of %VERIFY_READ - if it is safe - * to write to a block, it is always safe to read from it. * @addr: User space pointer to start of block to check * @size: Size of block to check * diff --git a/arch/hexagon/kernel/signal.c b/arch/hexagon/kernel/signal.c index 78aa7304a5c9..31e2cf95f189 100644 --- a/arch/hexagon/kernel/signal.c +++ b/arch/hexagon/kernel/signal.c @@ -115,7 +115,7 @@ static int setup_rt_frame(struct ksignal *ksig, sigset_t *set, frame = get_sigframe(ksig, regs, sizeof(struct rt_sigframe)); - if (!access_ok(VERIFY_WRITE, frame, sizeof(struct rt_sigframe))) + if (!access_ok(frame, sizeof(struct rt_sigframe))) return -EFAULT; if (copy_siginfo_to_user(&frame->info, &ksig->info)) @@ -244,7 +244,7 @@ asmlinkage int sys_rt_sigreturn(void) current->restart_block.fn = do_no_restart_syscall; frame = (struct rt_sigframe __user *)pt_psp(regs); - if (!access_ok(VERIFY_READ, frame, sizeof(*frame))) + if (!access_ok(frame, sizeof(*frame))) goto badframe; if (__copy_from_user(&blocked, &frame->uc.uc_sigmask, sizeof(blocked))) goto badframe; diff --git a/arch/hexagon/mm/uaccess.c b/arch/hexagon/mm/uaccess.c index c599eb126c9e..6f9c4697552c 100644 --- a/arch/hexagon/mm/uaccess.c +++ b/arch/hexagon/mm/uaccess.c @@ -51,7 +51,7 @@ __kernel_size_t __clear_user_hexagon(void __user *dest, unsigned long count) unsigned long clear_user_hexagon(void __user *dest, unsigned long count) { - if (!access_ok(VERIFY_WRITE, dest, count)) + if (!access_ok(dest, count)) return count; else return __clear_user_hexagon(dest, count); diff --git a/arch/ia64/include/asm/futex.h b/arch/ia64/include/asm/futex.h index db2dd85918c2..2e106d462196 100644 --- a/arch/ia64/include/asm/futex.h +++ b/arch/ia64/include/asm/futex.h @@ -86,7 +86,7 @@ static inline int futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr, u32 oldval, u32 newval) { - if (!access_ok(VERIFY_WRITE, uaddr, sizeof(u32))) + if (!access_ok(uaddr, sizeof(u32))) return -EFAULT; { diff --git a/arch/ia64/include/asm/uaccess.h b/arch/ia64/include/asm/uaccess.h index a74524f2d625..306d469e43da 100644 --- a/arch/ia64/include/asm/uaccess.h +++ b/arch/ia64/include/asm/uaccess.h @@ -67,7 +67,7 @@ static inline int __access_ok(const void __user *p, unsigned long size) return likely(addr <= seg) && (seg == KERNEL_DS.seg || likely(REGION_OFFSET(addr) < RGN_MAP_LIMIT)); } -#define access_ok(type, addr, size) __access_ok((addr), (size)) +#define access_ok(addr, size) __access_ok((addr), (size)) /* * These are the main single-value transfer routines. They automatically diff --git a/arch/ia64/kernel/ptrace.c b/arch/ia64/kernel/ptrace.c index 427cd565fd61..6d50ede0ed69 100644 --- a/arch/ia64/kernel/ptrace.c +++ b/arch/ia64/kernel/ptrace.c @@ -836,7 +836,7 @@ ptrace_getregs (struct task_struct *child, struct pt_all_user_regs __user *ppr) char nat = 0; int i; - if (!access_ok(VERIFY_WRITE, ppr, sizeof(struct pt_all_user_regs))) + if (!access_ok(ppr, sizeof(struct pt_all_user_regs))) return -EIO; pt = task_pt_regs(child); @@ -981,7 +981,7 @@ ptrace_setregs (struct task_struct *child, struct pt_all_user_regs __user *ppr) memset(&fpval, 0, sizeof(fpval)); - if (!access_ok(VERIFY_READ, ppr, sizeof(struct pt_all_user_regs))) + if (!access_ok(ppr, sizeof(struct pt_all_user_regs))) return -EIO; pt = task_pt_regs(child); diff --git a/arch/ia64/kernel/signal.c b/arch/ia64/kernel/signal.c index 99099f73b207..6062fd14e34e 100644 --- a/arch/ia64/kernel/signal.c +++ b/arch/ia64/kernel/signal.c @@ -132,7 +132,7 @@ ia64_rt_sigreturn (struct sigscratch *scr) */ retval = (long) &ia64_strace_leave_kernel; - if (!access_ok(VERIFY_READ, sc, sizeof(*sc))) + if (!access_ok(sc, sizeof(*sc))) goto give_sigsegv; if (GET_SIGSET(&set, &sc->sc_mask)) @@ -264,7 +264,7 @@ setup_frame(struct ksignal *ksig, sigset_t *set, struct sigscratch *scr) } frame = (void __user *) ((new_sp - sizeof(*frame)) & -STACK_ALIGN); - if (!access_ok(VERIFY_WRITE, frame, sizeof(*frame))) { + if (!access_ok(frame, sizeof(*frame))) { force_sigsegv(ksig->sig, current); return 1; } diff --git a/arch/m68k/include/asm/uaccess_mm.h b/arch/m68k/include/asm/uaccess_mm.h index c4cb889660aa..7e85de984df1 100644 --- a/arch/m68k/include/asm/uaccess_mm.h +++ b/arch/m68k/include/asm/uaccess_mm.h @@ -10,7 +10,7 @@ #include /* We let the MMU do all checking */ -static inline int access_ok(int type, const void __user *addr, +static inline int access_ok(const void __user *addr, unsigned long size) { return 1; diff --git a/arch/m68k/include/asm/uaccess_no.h b/arch/m68k/include/asm/uaccess_no.h index 892efb56beef..0134008bf539 100644 --- a/arch/m68k/include/asm/uaccess_no.h +++ b/arch/m68k/include/asm/uaccess_no.h @@ -10,7 +10,7 @@ #include -#define access_ok(type,addr,size) _access_ok((unsigned long)(addr),(size)) +#define access_ok(addr,size) _access_ok((unsigned long)(addr),(size)) /* * It is not enough to just have access_ok check for a real RAM address. diff --git a/arch/m68k/kernel/signal.c b/arch/m68k/kernel/signal.c index 72850b85ecf8..e2a9421c5797 100644 --- a/arch/m68k/kernel/signal.c +++ b/arch/m68k/kernel/signal.c @@ -787,7 +787,7 @@ asmlinkage int do_sigreturn(struct pt_regs *regs, struct switch_stack *sw) struct sigframe __user *frame = (struct sigframe __user *)(usp - 4); sigset_t set; - if (!access_ok(VERIFY_READ, frame, sizeof(*frame))) + if (!access_ok(frame, sizeof(*frame))) goto badframe; if (__get_user(set.sig[0], &frame->sc.sc_mask) || (_NSIG_WORDS > 1 && @@ -812,7 +812,7 @@ asmlinkage int do_rt_sigreturn(struct pt_regs *regs, struct switch_stack *sw) struct rt_sigframe __user *frame = (struct rt_sigframe __user *)(usp - 4); sigset_t set; - if (!access_ok(VERIFY_READ, frame, sizeof(*frame))) + if (!access_ok(frame, sizeof(*frame))) goto badframe; if (__copy_from_user(&set, &frame->uc.uc_sigmask, sizeof(set))) goto badframe; diff --git a/arch/microblaze/include/asm/futex.h b/arch/microblaze/include/asm/futex.h index 2572077b04ea..8c90357e5983 100644 --- a/arch/microblaze/include/asm/futex.h +++ b/arch/microblaze/include/asm/futex.h @@ -71,7 +71,7 @@ futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr, int ret = 0, cmp; u32 prev; - if (!access_ok(VERIFY_WRITE, uaddr, sizeof(u32))) + if (!access_ok(uaddr, sizeof(u32))) return -EFAULT; __asm__ __volatile__ ("1: lwx %1, %3, r0; \ diff --git a/arch/microblaze/include/asm/uaccess.h b/arch/microblaze/include/asm/uaccess.h index 81f16aadbf9e..dbfea093a7c7 100644 --- a/arch/microblaze/include/asm/uaccess.h +++ b/arch/microblaze/include/asm/uaccess.h @@ -60,26 +60,25 @@ static inline int ___range_ok(unsigned long addr, unsigned long size) #define __range_ok(addr, size) \ ___range_ok((unsigned long)(addr), (unsigned long)(size)) -#define access_ok(type, addr, size) (__range_ok((addr), (size)) == 0) +#define access_ok(addr, size) (__range_ok((addr), (size)) == 0) #else -static inline int access_ok(int type, const void __user *addr, - unsigned long size) +static inline int access_ok(const void __user *addr, unsigned long size) { if (!size) goto ok; if ((get_fs().seg < ((unsigned long)addr)) || (get_fs().seg < ((unsigned long)addr + size - 1))) { - pr_devel("ACCESS fail: %s at 0x%08x (size 0x%x), seg 0x%08x\n", - type ? "WRITE" : "READ ", (__force u32)addr, (u32)size, + pr_devel("ACCESS fail at 0x%08x (size 0x%x), seg 0x%08x\n", + (__force u32)addr, (u32)size, (u32)get_fs().seg); return 0; } ok: - pr_devel("ACCESS OK: %s at 0x%08x (size 0x%x), seg 0x%08x\n", - type ? "WRITE" : "READ ", (__force u32)addr, (u32)size, + pr_devel("ACCESS OK at 0x%08x (size 0x%x), seg 0x%08x\n", + (__force u32)addr, (u32)size, (u32)get_fs().seg); return 1; } @@ -120,7 +119,7 @@ static inline unsigned long __must_check clear_user(void __user *to, unsigned long n) { might_fault(); - if (unlikely(!access_ok(VERIFY_WRITE, to, n))) + if (unlikely(!access_ok(to, n))) return n; return __clear_user(to, n); @@ -174,7 +173,7 @@ extern long __user_bad(void); const typeof(*(ptr)) __user *__gu_addr = (ptr); \ int __gu_err = 0; \ \ - if (access_ok(VERIFY_READ, __gu_addr, size)) { \ + if (access_ok(__gu_addr, size)) { \ switch (size) { \ case 1: \ __get_user_asm("lbu", __gu_addr, __gu_val, \ @@ -286,7 +285,7 @@ extern long __user_bad(void); typeof(*(ptr)) __user *__pu_addr = (ptr); \ int __pu_err = 0; \ \ - if (access_ok(VERIFY_WRITE, __pu_addr, size)) { \ + if (access_ok(__pu_addr, size)) { \ switch (size) { \ case 1: \ __put_user_asm("sb", __pu_addr, __pu_val, \ @@ -358,7 +357,7 @@ extern int __strncpy_user(char *to, const char __user *from, int len); static inline long strncpy_from_user(char *dst, const char __user *src, long count) { - if (!access_ok(VERIFY_READ, src, 1)) + if (!access_ok(src, 1)) return -EFAULT; return __strncpy_user(dst, src, count); } @@ -372,7 +371,7 @@ extern int __strnlen_user(const char __user *sstr, int len); static inline long strnlen_user(const char __user *src, long n) { - if (!access_ok(VERIFY_READ, src, 1)) + if (!access_ok(src, 1)) return 0; return __strnlen_user(src, n); } diff --git a/arch/microblaze/kernel/signal.c b/arch/microblaze/kernel/signal.c index 97001524ca2d..0685696349bb 100644 --- a/arch/microblaze/kernel/signal.c +++ b/arch/microblaze/kernel/signal.c @@ -91,7 +91,7 @@ asmlinkage long sys_rt_sigreturn(struct pt_regs *regs) /* Always make any pending restarted system calls return -EINTR */ current->restart_block.fn = do_no_restart_syscall; - if (!access_ok(VERIFY_READ, frame, sizeof(*frame))) + if (!access_ok(frame, sizeof(*frame))) goto badframe; if (__copy_from_user(&set, &frame->uc.uc_sigmask, sizeof(set))) @@ -166,7 +166,7 @@ static int setup_rt_frame(struct ksignal *ksig, sigset_t *set, frame = get_sigframe(ksig, regs, sizeof(*frame)); - if (!access_ok(VERIFY_WRITE, frame, sizeof(*frame))) + if (!access_ok(frame, sizeof(*frame))) return -EFAULT; if (ksig->ka.sa.sa_flags & SA_SIGINFO) diff --git a/arch/mips/include/asm/checksum.h b/arch/mips/include/asm/checksum.h index e8161e4dfde7..dcebaaf8c862 100644 --- a/arch/mips/include/asm/checksum.h +++ b/arch/mips/include/asm/checksum.h @@ -63,7 +63,7 @@ static inline __wsum csum_and_copy_from_user(const void __user *src, void *dst, int len, __wsum sum, int *err_ptr) { - if (access_ok(VERIFY_READ, src, len)) + if (access_ok(src, len)) return csum_partial_copy_from_user(src, dst, len, sum, err_ptr); if (len) @@ -81,7 +81,7 @@ __wsum csum_and_copy_to_user(const void *src, void __user *dst, int len, __wsum sum, int *err_ptr) { might_fault(); - if (access_ok(VERIFY_WRITE, dst, len)) { + if (access_ok(dst, len)) { if (uaccess_kernel()) return __csum_partial_copy_kernel(src, (__force void *)dst, diff --git a/arch/mips/include/asm/futex.h b/arch/mips/include/asm/futex.h index 8eff134b3a43..c14d798f3888 100644 --- a/arch/mips/include/asm/futex.h +++ b/arch/mips/include/asm/futex.h @@ -129,7 +129,7 @@ futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr, int ret = 0; u32 val; - if (!access_ok(VERIFY_WRITE, uaddr, sizeof(u32))) + if (!access_ok(uaddr, sizeof(u32))) return -EFAULT; if (cpu_has_llsc && R10000_LLSC_WAR) { diff --git a/arch/mips/include/asm/termios.h b/arch/mips/include/asm/termios.h index ce2d72e34274..bc29eeacc55a 100644 --- a/arch/mips/include/asm/termios.h +++ b/arch/mips/include/asm/termios.h @@ -32,7 +32,7 @@ static inline int user_termio_to_kernel_termios(struct ktermios *termios, unsigned short iflag, oflag, cflag, lflag; unsigned int err; - if (!access_ok(VERIFY_READ, termio, sizeof(struct termio))) + if (!access_ok(termio, sizeof(struct termio))) return -EFAULT; err = __get_user(iflag, &termio->c_iflag); @@ -61,7 +61,7 @@ static inline int kernel_termios_to_user_termio(struct termio __user *termio, { int err; - if (!access_ok(VERIFY_WRITE, termio, sizeof(struct termio))) + if (!access_ok(termio, sizeof(struct termio))) return -EFAULT; err = __put_user(termios->c_iflag, &termio->c_iflag); diff --git a/arch/mips/include/asm/uaccess.h b/arch/mips/include/asm/uaccess.h index 06629011a434..d43c1dc6ef15 100644 --- a/arch/mips/include/asm/uaccess.h +++ b/arch/mips/include/asm/uaccess.h @@ -109,9 +109,6 @@ static inline bool eva_kernel_access(void) /* * access_ok: - Checks if a user space pointer is valid - * @type: Type of access: %VERIFY_READ or %VERIFY_WRITE. Note that - * %VERIFY_WRITE is a superset of %VERIFY_READ - if it is safe - * to write to a block, it is always safe to read from it. * @addr: User space pointer to start of block to check * @size: Size of block to check * @@ -134,7 +131,7 @@ static inline int __access_ok(const void __user *p, unsigned long size) return (get_fs().seg & (addr | (addr + size) | __ua_size(size))) == 0; } -#define access_ok(type, addr, size) \ +#define access_ok(addr, size) \ likely(__access_ok((addr), (size))) /* @@ -304,7 +301,7 @@ do { \ const __typeof__(*(ptr)) __user * __gu_ptr = (ptr); \ \ might_fault(); \ - if (likely(access_ok(VERIFY_READ, __gu_ptr, size))) { \ + if (likely(access_ok( __gu_ptr, size))) { \ if (eva_kernel_access()) \ __get_kernel_common((x), size, __gu_ptr); \ else \ @@ -446,7 +443,7 @@ do { \ int __pu_err = -EFAULT; \ \ might_fault(); \ - if (likely(access_ok(VERIFY_WRITE, __pu_addr, size))) { \ + if (likely(access_ok( __pu_addr, size))) { \ if (eva_kernel_access()) \ __put_kernel_common(__pu_addr, size); \ else \ @@ -691,8 +688,7 @@ __clear_user(void __user *addr, __kernel_size_t size) ({ \ void __user * __cl_addr = (addr); \ unsigned long __cl_size = (n); \ - if (__cl_size && access_ok(VERIFY_WRITE, \ - __cl_addr, __cl_size)) \ + if (__cl_size && access_ok(__cl_addr, __cl_size)) \ __cl_size = __clear_user(__cl_addr, __cl_size); \ __cl_size; \ }) diff --git a/arch/mips/kernel/mips-r2-to-r6-emul.c b/arch/mips/kernel/mips-r2-to-r6-emul.c index cb22a558431e..c50c89a978f1 100644 --- a/arch/mips/kernel/mips-r2-to-r6-emul.c +++ b/arch/mips/kernel/mips-r2-to-r6-emul.c @@ -1205,7 +1205,7 @@ fpu_emul: case lwl_op: rt = regs->regs[MIPSInst_RT(inst)]; vaddr = regs->regs[MIPSInst_RS(inst)] + MIPSInst_SIMM(inst); - if (!access_ok(VERIFY_READ, (void __user *)vaddr, 4)) { + if (!access_ok((void __user *)vaddr, 4)) { current->thread.cp0_baduaddr = vaddr; err = SIGSEGV; break; @@ -1278,7 +1278,7 @@ fpu_emul: case lwr_op: rt = regs->regs[MIPSInst_RT(inst)]; vaddr = regs->regs[MIPSInst_RS(inst)] + MIPSInst_SIMM(inst); - if (!access_ok(VERIFY_READ, (void __user *)vaddr, 4)) { + if (!access_ok((void __user *)vaddr, 4)) { current->thread.cp0_baduaddr = vaddr; err = SIGSEGV; break; @@ -1352,7 +1352,7 @@ fpu_emul: case swl_op: rt = regs->regs[MIPSInst_RT(inst)]; vaddr = regs->regs[MIPSInst_RS(inst)] + MIPSInst_SIMM(inst); - if (!access_ok(VERIFY_WRITE, (void __user *)vaddr, 4)) { + if (!access_ok((void __user *)vaddr, 4)) { current->thread.cp0_baduaddr = vaddr; err = SIGSEGV; break; @@ -1422,7 +1422,7 @@ fpu_emul: case swr_op: rt = regs->regs[MIPSInst_RT(inst)]; vaddr = regs->regs[MIPSInst_RS(inst)] + MIPSInst_SIMM(inst); - if (!access_ok(VERIFY_WRITE, (void __user *)vaddr, 4)) { + if (!access_ok((void __user *)vaddr, 4)) { current->thread.cp0_baduaddr = vaddr; err = SIGSEGV; break; @@ -1497,7 +1497,7 @@ fpu_emul: rt = regs->regs[MIPSInst_RT(inst)]; vaddr = regs->regs[MIPSInst_RS(inst)] + MIPSInst_SIMM(inst); - if (!access_ok(VERIFY_READ, (void __user *)vaddr, 8)) { + if (!access_ok((void __user *)vaddr, 8)) { current->thread.cp0_baduaddr = vaddr; err = SIGSEGV; break; @@ -1616,7 +1616,7 @@ fpu_emul: rt = regs->regs[MIPSInst_RT(inst)]; vaddr = regs->regs[MIPSInst_RS(inst)] + MIPSInst_SIMM(inst); - if (!access_ok(VERIFY_READ, (void __user *)vaddr, 8)) { + if (!access_ok((void __user *)vaddr, 8)) { current->thread.cp0_baduaddr = vaddr; err = SIGSEGV; break; @@ -1735,7 +1735,7 @@ fpu_emul: rt = regs->regs[MIPSInst_RT(inst)]; vaddr = regs->regs[MIPSInst_RS(inst)] + MIPSInst_SIMM(inst); - if (!access_ok(VERIFY_WRITE, (void __user *)vaddr, 8)) { + if (!access_ok((void __user *)vaddr, 8)) { current->thread.cp0_baduaddr = vaddr; err = SIGSEGV; break; @@ -1853,7 +1853,7 @@ fpu_emul: rt = regs->regs[MIPSInst_RT(inst)]; vaddr = regs->regs[MIPSInst_RS(inst)] + MIPSInst_SIMM(inst); - if (!access_ok(VERIFY_WRITE, (void __user *)vaddr, 8)) { + if (!access_ok((void __user *)vaddr, 8)) { current->thread.cp0_baduaddr = vaddr; err = SIGSEGV; break; @@ -1970,7 +1970,7 @@ fpu_emul: err = SIGBUS; break; } - if (!access_ok(VERIFY_READ, (void __user *)vaddr, 4)) { + if (!access_ok((void __user *)vaddr, 4)) { current->thread.cp0_baduaddr = vaddr; err = SIGBUS; break; @@ -2026,7 +2026,7 @@ fpu_emul: err = SIGBUS; break; } - if (!access_ok(VERIFY_WRITE, (void __user *)vaddr, 4)) { + if (!access_ok((void __user *)vaddr, 4)) { current->thread.cp0_baduaddr = vaddr; err = SIGBUS; break; @@ -2089,7 +2089,7 @@ fpu_emul: err = SIGBUS; break; } - if (!access_ok(VERIFY_READ, (void __user *)vaddr, 8)) { + if (!access_ok((void __user *)vaddr, 8)) { current->thread.cp0_baduaddr = vaddr; err = SIGBUS; break; @@ -2150,7 +2150,7 @@ fpu_emul: err = SIGBUS; break; } - if (!access_ok(VERIFY_WRITE, (void __user *)vaddr, 8)) { + if (!access_ok((void __user *)vaddr, 8)) { current->thread.cp0_baduaddr = vaddr; err = SIGBUS; break; diff --git a/arch/mips/kernel/ptrace.c b/arch/mips/kernel/ptrace.c index ea54575255ea..0057c910bc2f 100644 --- a/arch/mips/kernel/ptrace.c +++ b/arch/mips/kernel/ptrace.c @@ -71,7 +71,7 @@ int ptrace_getregs(struct task_struct *child, struct user_pt_regs __user *data) struct pt_regs *regs; int i; - if (!access_ok(VERIFY_WRITE, data, 38 * 8)) + if (!access_ok(data, 38 * 8)) return -EIO; regs = task_pt_regs(child); @@ -98,7 +98,7 @@ int ptrace_setregs(struct task_struct *child, struct user_pt_regs __user *data) struct pt_regs *regs; int i; - if (!access_ok(VERIFY_READ, data, 38 * 8)) + if (!access_ok(data, 38 * 8)) return -EIO; regs = task_pt_regs(child); @@ -125,7 +125,7 @@ int ptrace_get_watch_regs(struct task_struct *child, if (!cpu_has_watch || boot_cpu_data.watch_reg_use_cnt == 0) return -EIO; - if (!access_ok(VERIFY_WRITE, addr, sizeof(struct pt_watch_regs))) + if (!access_ok(addr, sizeof(struct pt_watch_regs))) return -EIO; #ifdef CONFIG_32BIT @@ -167,7 +167,7 @@ int ptrace_set_watch_regs(struct task_struct *child, if (!cpu_has_watch || boot_cpu_data.watch_reg_use_cnt == 0) return -EIO; - if (!access_ok(VERIFY_READ, addr, sizeof(struct pt_watch_regs))) + if (!access_ok(addr, sizeof(struct pt_watch_regs))) return -EIO; /* Check the values. */ for (i = 0; i < boot_cpu_data.watch_reg_use_cnt; i++) { @@ -359,7 +359,7 @@ int ptrace_getfpregs(struct task_struct *child, __u32 __user *data) { int i; - if (!access_ok(VERIFY_WRITE, data, 33 * 8)) + if (!access_ok(data, 33 * 8)) return -EIO; if (tsk_used_math(child)) { @@ -385,7 +385,7 @@ int ptrace_setfpregs(struct task_struct *child, __u32 __user *data) u32 value; int i; - if (!access_ok(VERIFY_READ, data, 33 * 8)) + if (!access_ok(data, 33 * 8)) return -EIO; init_fp_ctx(child); diff --git a/arch/mips/kernel/signal.c b/arch/mips/kernel/signal.c index d3a23758592c..d75337974ee9 100644 --- a/arch/mips/kernel/signal.c +++ b/arch/mips/kernel/signal.c @@ -590,7 +590,7 @@ SYSCALL_DEFINE3(sigaction, int, sig, const struct sigaction __user *, act, if (act) { old_sigset_t mask; - if (!access_ok(VERIFY_READ, act, sizeof(*act))) + if (!access_ok(act, sizeof(*act))) return -EFAULT; err |= __get_user(new_ka.sa.sa_handler, &act->sa_handler); err |= __get_user(new_ka.sa.sa_flags, &act->sa_flags); @@ -604,7 +604,7 @@ SYSCALL_DEFINE3(sigaction, int, sig, const struct sigaction __user *, act, ret = do_sigaction(sig, act ? &new_ka : NULL, oact ? &old_ka : NULL); if (!ret && oact) { - if (!access_ok(VERIFY_WRITE, oact, sizeof(*oact))) + if (!access_ok(oact, sizeof(*oact))) return -EFAULT; err |= __put_user(old_ka.sa.sa_flags, &oact->sa_flags); err |= __put_user(old_ka.sa.sa_handler, &oact->sa_handler); @@ -630,7 +630,7 @@ asmlinkage void sys_sigreturn(void) regs = current_pt_regs(); frame = (struct sigframe __user *)regs->regs[29]; - if (!access_ok(VERIFY_READ, frame, sizeof(*frame))) + if (!access_ok(frame, sizeof(*frame))) goto badframe; if (__copy_from_user(&blocked, &frame->sf_mask, sizeof(blocked))) goto badframe; @@ -667,7 +667,7 @@ asmlinkage void sys_rt_sigreturn(void) regs = current_pt_regs(); frame = (struct rt_sigframe __user *)regs->regs[29]; - if (!access_ok(VERIFY_READ, frame, sizeof(*frame))) + if (!access_ok(frame, sizeof(*frame))) goto badframe; if (__copy_from_user(&set, &frame->rs_uc.uc_sigmask, sizeof(set))) goto badframe; @@ -705,7 +705,7 @@ static int setup_frame(void *sig_return, struct ksignal *ksig, int err = 0; frame = get_sigframe(ksig, regs, sizeof(*frame)); - if (!access_ok(VERIFY_WRITE, frame, sizeof (*frame))) + if (!access_ok(frame, sizeof (*frame))) return -EFAULT; err |= setup_sigcontext(regs, &frame->sf_sc); @@ -744,7 +744,7 @@ static int setup_rt_frame(void *sig_return, struct ksignal *ksig, int err = 0; frame = get_sigframe(ksig, regs, sizeof(*frame)); - if (!access_ok(VERIFY_WRITE, frame, sizeof (*frame))) + if (!access_ok(frame, sizeof (*frame))) return -EFAULT; /* Create siginfo. */ diff --git a/arch/mips/kernel/signal32.c b/arch/mips/kernel/signal32.c index b5d9e1784aff..59b8965433c2 100644 --- a/arch/mips/kernel/signal32.c +++ b/arch/mips/kernel/signal32.c @@ -46,7 +46,7 @@ SYSCALL_DEFINE3(32_sigaction, long, sig, const struct compat_sigaction __user *, old_sigset_t mask; s32 handler; - if (!access_ok(VERIFY_READ, act, sizeof(*act))) + if (!access_ok(act, sizeof(*act))) return -EFAULT; err |= __get_user(handler, &act->sa_handler); new_ka.sa.sa_handler = (void __user *)(s64)handler; @@ -61,7 +61,7 @@ SYSCALL_DEFINE3(32_sigaction, long, sig, const struct compat_sigaction __user *, ret = do_sigaction(sig, act ? &new_ka : NULL, oact ? &old_ka : NULL); if (!ret && oact) { - if (!access_ok(VERIFY_WRITE, oact, sizeof(*oact))) + if (!access_ok(oact, sizeof(*oact))) return -EFAULT; err |= __put_user(old_ka.sa.sa_flags, &oact->sa_flags); err |= __put_user((u32)(u64)old_ka.sa.sa_handler, diff --git a/arch/mips/kernel/signal_n32.c b/arch/mips/kernel/signal_n32.c index 8f65aaf9206d..c498b027823e 100644 --- a/arch/mips/kernel/signal_n32.c +++ b/arch/mips/kernel/signal_n32.c @@ -73,7 +73,7 @@ asmlinkage void sysn32_rt_sigreturn(void) regs = current_pt_regs(); frame = (struct rt_sigframe_n32 __user *)regs->regs[29]; - if (!access_ok(VERIFY_READ, frame, sizeof(*frame))) + if (!access_ok(frame, sizeof(*frame))) goto badframe; if (__copy_conv_sigset_from_user(&set, &frame->rs_uc.uc_sigmask)) goto badframe; @@ -110,7 +110,7 @@ static int setup_rt_frame_n32(void *sig_return, struct ksignal *ksig, int err = 0; frame = get_sigframe(ksig, regs, sizeof(*frame)); - if (!access_ok(VERIFY_WRITE, frame, sizeof (*frame))) + if (!access_ok(frame, sizeof (*frame))) return -EFAULT; /* Create siginfo. */ diff --git a/arch/mips/kernel/signal_o32.c b/arch/mips/kernel/signal_o32.c index b6e3ddef48a0..df259618e834 100644 --- a/arch/mips/kernel/signal_o32.c +++ b/arch/mips/kernel/signal_o32.c @@ -118,7 +118,7 @@ static int setup_frame_32(void *sig_return, struct ksignal *ksig, int err = 0; frame = get_sigframe(ksig, regs, sizeof(*frame)); - if (!access_ok(VERIFY_WRITE, frame, sizeof (*frame))) + if (!access_ok(frame, sizeof (*frame))) return -EFAULT; err |= setup_sigcontext32(regs, &frame->sf_sc); @@ -160,7 +160,7 @@ asmlinkage void sys32_rt_sigreturn(void) regs = current_pt_regs(); frame = (struct rt_sigframe32 __user *)regs->regs[29]; - if (!access_ok(VERIFY_READ, frame, sizeof(*frame))) + if (!access_ok(frame, sizeof(*frame))) goto badframe; if (__copy_conv_sigset_from_user(&set, &frame->rs_uc.uc_sigmask)) goto badframe; @@ -197,7 +197,7 @@ static int setup_rt_frame_32(void *sig_return, struct ksignal *ksig, int err = 0; frame = get_sigframe(ksig, regs, sizeof(*frame)); - if (!access_ok(VERIFY_WRITE, frame, sizeof (*frame))) + if (!access_ok(frame, sizeof (*frame))) return -EFAULT; /* Convert (siginfo_t -> compat_siginfo_t) and copy to user. */ @@ -262,7 +262,7 @@ asmlinkage void sys32_sigreturn(void) regs = current_pt_regs(); frame = (struct sigframe32 __user *)regs->regs[29]; - if (!access_ok(VERIFY_READ, frame, sizeof(*frame))) + if (!access_ok(frame, sizeof(*frame))) goto badframe; if (__copy_conv_sigset_from_user(&blocked, &frame->sf_mask)) goto badframe; diff --git a/arch/mips/kernel/syscall.c b/arch/mips/kernel/syscall.c index 41a0db08cd37..b6dc78ad5d8c 100644 --- a/arch/mips/kernel/syscall.c +++ b/arch/mips/kernel/syscall.c @@ -101,7 +101,7 @@ static inline int mips_atomic_set(unsigned long addr, unsigned long new) if (unlikely(addr & 3)) return -EINVAL; - if (unlikely(!access_ok(VERIFY_WRITE, (const void __user *)addr, 4))) + if (unlikely(!access_ok((const void __user *)addr, 4))) return -EINVAL; if (cpu_has_llsc && R10000_LLSC_WAR) { diff --git a/arch/mips/kernel/unaligned.c b/arch/mips/kernel/unaligned.c index c60e7719ef77..595ca9c85111 100644 --- a/arch/mips/kernel/unaligned.c +++ b/arch/mips/kernel/unaligned.c @@ -936,7 +936,7 @@ static void emulate_load_store_insn(struct pt_regs *regs, if (insn.dsp_format.func == lx_op) { switch (insn.dsp_format.op) { case lwx_op: - if (!access_ok(VERIFY_READ, addr, 4)) + if (!access_ok(addr, 4)) goto sigbus; LoadW(addr, value, res); if (res) @@ -945,7 +945,7 @@ static void emulate_load_store_insn(struct pt_regs *regs, regs->regs[insn.dsp_format.rd] = value; break; case lhx_op: - if (!access_ok(VERIFY_READ, addr, 2)) + if (!access_ok(addr, 2)) goto sigbus; LoadHW(addr, value, res); if (res) @@ -968,7 +968,7 @@ static void emulate_load_store_insn(struct pt_regs *regs, set_fs(USER_DS); switch (insn.spec3_format.func) { case lhe_op: - if (!access_ok(VERIFY_READ, addr, 2)) { + if (!access_ok(addr, 2)) { set_fs(seg); goto sigbus; } @@ -981,7 +981,7 @@ static void emulate_load_store_insn(struct pt_regs *regs, regs->regs[insn.spec3_format.rt] = value; break; case lwe_op: - if (!access_ok(VERIFY_READ, addr, 4)) { + if (!access_ok(addr, 4)) { set_fs(seg); goto sigbus; } @@ -994,7 +994,7 @@ static void emulate_load_store_insn(struct pt_regs *regs, regs->regs[insn.spec3_format.rt] = value; break; case lhue_op: - if (!access_ok(VERIFY_READ, addr, 2)) { + if (!access_ok(addr, 2)) { set_fs(seg); goto sigbus; } @@ -1007,7 +1007,7 @@ static void emulate_load_store_insn(struct pt_regs *regs, regs->regs[insn.spec3_format.rt] = value; break; case she_op: - if (!access_ok(VERIFY_WRITE, addr, 2)) { + if (!access_ok(addr, 2)) { set_fs(seg); goto sigbus; } @@ -1020,7 +1020,7 @@ static void emulate_load_store_insn(struct pt_regs *regs, } break; case swe_op: - if (!access_ok(VERIFY_WRITE, addr, 4)) { + if (!access_ok(addr, 4)) { set_fs(seg); goto sigbus; } @@ -1041,7 +1041,7 @@ static void emulate_load_store_insn(struct pt_regs *regs, #endif break; case lh_op: - if (!access_ok(VERIFY_READ, addr, 2)) + if (!access_ok(addr, 2)) goto sigbus; if (IS_ENABLED(CONFIG_EVA)) { @@ -1060,7 +1060,7 @@ static void emulate_load_store_insn(struct pt_regs *regs, break; case lw_op: - if (!access_ok(VERIFY_READ, addr, 4)) + if (!access_ok(addr, 4)) goto sigbus; if (IS_ENABLED(CONFIG_EVA)) { @@ -1079,7 +1079,7 @@ static void emulate_load_store_insn(struct pt_regs *regs, break; case lhu_op: - if (!access_ok(VERIFY_READ, addr, 2)) + if (!access_ok(addr, 2)) goto sigbus; if (IS_ENABLED(CONFIG_EVA)) { @@ -1106,7 +1106,7 @@ static void emulate_load_store_insn(struct pt_regs *regs, * would blow up, so for now we don't handle unaligned 64-bit * instructions on 32-bit kernels. */ - if (!access_ok(VERIFY_READ, addr, 4)) + if (!access_ok(addr, 4)) goto sigbus; LoadWU(addr, value, res); @@ -1129,7 +1129,7 @@ static void emulate_load_store_insn(struct pt_regs *regs, * would blow up, so for now we don't handle unaligned 64-bit * instructions on 32-bit kernels. */ - if (!access_ok(VERIFY_READ, addr, 8)) + if (!access_ok(addr, 8)) goto sigbus; LoadDW(addr, value, res); @@ -1144,7 +1144,7 @@ static void emulate_load_store_insn(struct pt_regs *regs, goto sigill; case sh_op: - if (!access_ok(VERIFY_WRITE, addr, 2)) + if (!access_ok(addr, 2)) goto sigbus; compute_return_epc(regs); @@ -1164,7 +1164,7 @@ static void emulate_load_store_insn(struct pt_regs *regs, break; case sw_op: - if (!access_ok(VERIFY_WRITE, addr, 4)) + if (!access_ok(addr, 4)) goto sigbus; compute_return_epc(regs); @@ -1192,7 +1192,7 @@ static void emulate_load_store_insn(struct pt_regs *regs, * would blow up, so for now we don't handle unaligned 64-bit * instructions on 32-bit kernels. */ - if (!access_ok(VERIFY_WRITE, addr, 8)) + if (!access_ok(addr, 8)) goto sigbus; compute_return_epc(regs); @@ -1254,7 +1254,7 @@ static void emulate_load_store_insn(struct pt_regs *regs, switch (insn.msa_mi10_format.func) { case msa_ld_op: - if (!access_ok(VERIFY_READ, addr, sizeof(*fpr))) + if (!access_ok(addr, sizeof(*fpr))) goto sigbus; do { @@ -1290,7 +1290,7 @@ static void emulate_load_store_insn(struct pt_regs *regs, break; case msa_st_op: - if (!access_ok(VERIFY_WRITE, addr, sizeof(*fpr))) + if (!access_ok(addr, sizeof(*fpr))) goto sigbus; /* @@ -1463,7 +1463,7 @@ static void emulate_load_store_microMIPS(struct pt_regs *regs, if (reg == 31) goto sigbus; - if (!access_ok(VERIFY_READ, addr, 8)) + if (!access_ok(addr, 8)) goto sigbus; LoadW(addr, value, res); @@ -1482,7 +1482,7 @@ static void emulate_load_store_microMIPS(struct pt_regs *regs, if (reg == 31) goto sigbus; - if (!access_ok(VERIFY_WRITE, addr, 8)) + if (!access_ok(addr, 8)) goto sigbus; value = regs->regs[reg]; @@ -1502,7 +1502,7 @@ static void emulate_load_store_microMIPS(struct pt_regs *regs, if (reg == 31) goto sigbus; - if (!access_ok(VERIFY_READ, addr, 16)) + if (!access_ok(addr, 16)) goto sigbus; LoadDW(addr, value, res); @@ -1525,7 +1525,7 @@ static void emulate_load_store_microMIPS(struct pt_regs *regs, if (reg == 31) goto sigbus; - if (!access_ok(VERIFY_WRITE, addr, 16)) + if (!access_ok(addr, 16)) goto sigbus; value = regs->regs[reg]; @@ -1548,11 +1548,10 @@ static void emulate_load_store_microMIPS(struct pt_regs *regs, if ((rvar > 9) || !reg) goto sigill; if (reg & 0x10) { - if (!access_ok - (VERIFY_READ, addr, 4 * (rvar + 1))) + if (!access_ok(addr, 4 * (rvar + 1))) goto sigbus; } else { - if (!access_ok(VERIFY_READ, addr, 4 * rvar)) + if (!access_ok(addr, 4 * rvar)) goto sigbus; } if (rvar == 9) @@ -1585,11 +1584,10 @@ static void emulate_load_store_microMIPS(struct pt_regs *regs, if ((rvar > 9) || !reg) goto sigill; if (reg & 0x10) { - if (!access_ok - (VERIFY_WRITE, addr, 4 * (rvar + 1))) + if (!access_ok(addr, 4 * (rvar + 1))) goto sigbus; } else { - if (!access_ok(VERIFY_WRITE, addr, 4 * rvar)) + if (!access_ok(addr, 4 * rvar)) goto sigbus; } if (rvar == 9) @@ -1623,11 +1621,10 @@ static void emulate_load_store_microMIPS(struct pt_regs *regs, if ((rvar > 9) || !reg) goto sigill; if (reg & 0x10) { - if (!access_ok - (VERIFY_READ, addr, 8 * (rvar + 1))) + if (!access_ok(addr, 8 * (rvar + 1))) goto sigbus; } else { - if (!access_ok(VERIFY_READ, addr, 8 * rvar)) + if (!access_ok(addr, 8 * rvar)) goto sigbus; } if (rvar == 9) @@ -1665,11 +1662,10 @@ static void emulate_load_store_microMIPS(struct pt_regs *regs, if ((rvar > 9) || !reg) goto sigill; if (reg & 0x10) { - if (!access_ok - (VERIFY_WRITE, addr, 8 * (rvar + 1))) + if (!access_ok(addr, 8 * (rvar + 1))) goto sigbus; } else { - if (!access_ok(VERIFY_WRITE, addr, 8 * rvar)) + if (!access_ok(addr, 8 * rvar)) goto sigbus; } if (rvar == 9) @@ -1788,7 +1784,7 @@ fpu_emul: case mm_lwm16_op: reg = insn.mm16_m_format.rlist; rvar = reg + 1; - if (!access_ok(VERIFY_READ, addr, 4 * rvar)) + if (!access_ok(addr, 4 * rvar)) goto sigbus; for (i = 16; rvar; rvar--, i++) { @@ -1808,7 +1804,7 @@ fpu_emul: case mm_swm16_op: reg = insn.mm16_m_format.rlist; rvar = reg + 1; - if (!access_ok(VERIFY_WRITE, addr, 4 * rvar)) + if (!access_ok(addr, 4 * rvar)) goto sigbus; for (i = 16; rvar; rvar--, i++) { @@ -1862,7 +1858,7 @@ fpu_emul: } loadHW: - if (!access_ok(VERIFY_READ, addr, 2)) + if (!access_ok(addr, 2)) goto sigbus; LoadHW(addr, value, res); @@ -1872,7 +1868,7 @@ loadHW: goto success; loadHWU: - if (!access_ok(VERIFY_READ, addr, 2)) + if (!access_ok(addr, 2)) goto sigbus; LoadHWU(addr, value, res); @@ -1882,7 +1878,7 @@ loadHWU: goto success; loadW: - if (!access_ok(VERIFY_READ, addr, 4)) + if (!access_ok(addr, 4)) goto sigbus; LoadW(addr, value, res); @@ -1900,7 +1896,7 @@ loadWU: * would blow up, so for now we don't handle unaligned 64-bit * instructions on 32-bit kernels. */ - if (!access_ok(VERIFY_READ, addr, 4)) + if (!access_ok(addr, 4)) goto sigbus; LoadWU(addr, value, res); @@ -1922,7 +1918,7 @@ loadDW: * would blow up, so for now we don't handle unaligned 64-bit * instructions on 32-bit kernels. */ - if (!access_ok(VERIFY_READ, addr, 8)) + if (!access_ok(addr, 8)) goto sigbus; LoadDW(addr, value, res); @@ -1936,7 +1932,7 @@ loadDW: goto sigill; storeHW: - if (!access_ok(VERIFY_WRITE, addr, 2)) + if (!access_ok(addr, 2)) goto sigbus; value = regs->regs[reg]; @@ -1946,7 +1942,7 @@ storeHW: goto success; storeW: - if (!access_ok(VERIFY_WRITE, addr, 4)) + if (!access_ok(addr, 4)) goto sigbus; value = regs->regs[reg]; @@ -1964,7 +1960,7 @@ storeDW: * would blow up, so for now we don't handle unaligned 64-bit * instructions on 32-bit kernels. */ - if (!access_ok(VERIFY_WRITE, addr, 8)) + if (!access_ok(addr, 8)) goto sigbus; value = regs->regs[reg]; @@ -2122,7 +2118,7 @@ static void emulate_load_store_MIPS16e(struct pt_regs *regs, void __user * addr) goto sigbus; case MIPS16e_lh_op: - if (!access_ok(VERIFY_READ, addr, 2)) + if (!access_ok(addr, 2)) goto sigbus; LoadHW(addr, value, res); @@ -2133,7 +2129,7 @@ static void emulate_load_store_MIPS16e(struct pt_regs *regs, void __user * addr) break; case MIPS16e_lhu_op: - if (!access_ok(VERIFY_READ, addr, 2)) + if (!access_ok(addr, 2)) goto sigbus; LoadHWU(addr, value, res); @@ -2146,7 +2142,7 @@ static void emulate_load_store_MIPS16e(struct pt_regs *regs, void __user * addr) case MIPS16e_lw_op: case MIPS16e_lwpc_op: case MIPS16e_lwsp_op: - if (!access_ok(VERIFY_READ, addr, 4)) + if (!access_ok(addr, 4)) goto sigbus; LoadW(addr, value, res); @@ -2165,7 +2161,7 @@ static void emulate_load_store_MIPS16e(struct pt_regs *regs, void __user * addr) * would blow up, so for now we don't handle unaligned 64-bit * instructions on 32-bit kernels. */ - if (!access_ok(VERIFY_READ, addr, 4)) + if (!access_ok(addr, 4)) goto sigbus; LoadWU(addr, value, res); @@ -2189,7 +2185,7 @@ loadDW: * would blow up, so for now we don't handle unaligned 64-bit * instructions on 32-bit kernels. */ - if (!access_ok(VERIFY_READ, addr, 8)) + if (!access_ok(addr, 8)) goto sigbus; LoadDW(addr, value, res); @@ -2204,7 +2200,7 @@ loadDW: goto sigill; case MIPS16e_sh_op: - if (!access_ok(VERIFY_WRITE, addr, 2)) + if (!access_ok(addr, 2)) goto sigbus; MIPS16e_compute_return_epc(regs, &oldinst); @@ -2217,7 +2213,7 @@ loadDW: case MIPS16e_sw_op: case MIPS16e_swsp_op: case MIPS16e_i8_op: /* actually - MIPS16e_swrasp_func */ - if (!access_ok(VERIFY_WRITE, addr, 4)) + if (!access_ok(addr, 4)) goto sigbus; MIPS16e_compute_return_epc(regs, &oldinst); @@ -2237,7 +2233,7 @@ writeDW: * would blow up, so for now we don't handle unaligned 64-bit * instructions on 32-bit kernels. */ - if (!access_ok(VERIFY_WRITE, addr, 8)) + if (!access_ok(addr, 8)) goto sigbus; MIPS16e_compute_return_epc(regs, &oldinst); diff --git a/arch/mips/math-emu/cp1emu.c b/arch/mips/math-emu/cp1emu.c index 82e2993c1a2c..e60e29078ef5 100644 --- a/arch/mips/math-emu/cp1emu.c +++ b/arch/mips/math-emu/cp1emu.c @@ -1063,7 +1063,7 @@ emul: MIPSInst_SIMM(ir)); MIPS_FPU_EMU_INC_STATS(loads); - if (!access_ok(VERIFY_READ, dva, sizeof(u64))) { + if (!access_ok(dva, sizeof(u64))) { MIPS_FPU_EMU_INC_STATS(errors); *fault_addr = dva; return SIGBUS; @@ -1081,7 +1081,7 @@ emul: MIPSInst_SIMM(ir)); MIPS_FPU_EMU_INC_STATS(stores); DIFROMREG(dval, MIPSInst_RT(ir)); - if (!access_ok(VERIFY_WRITE, dva, sizeof(u64))) { + if (!access_ok(dva, sizeof(u64))) { MIPS_FPU_EMU_INC_STATS(errors); *fault_addr = dva; return SIGBUS; @@ -1097,7 +1097,7 @@ emul: wva = (u32 __user *) (xcp->regs[MIPSInst_RS(ir)] + MIPSInst_SIMM(ir)); MIPS_FPU_EMU_INC_STATS(loads); - if (!access_ok(VERIFY_READ, wva, sizeof(u32))) { + if (!access_ok(wva, sizeof(u32))) { MIPS_FPU_EMU_INC_STATS(errors); *fault_addr = wva; return SIGBUS; @@ -1115,7 +1115,7 @@ emul: MIPSInst_SIMM(ir)); MIPS_FPU_EMU_INC_STATS(stores); SIFROMREG(wval, MIPSInst_RT(ir)); - if (!access_ok(VERIFY_WRITE, wva, sizeof(u32))) { + if (!access_ok(wva, sizeof(u32))) { MIPS_FPU_EMU_INC_STATS(errors); *fault_addr = wva; return SIGBUS; @@ -1493,7 +1493,7 @@ static int fpux_emu(struct pt_regs *xcp, struct mips_fpu_struct *ctx, xcp->regs[MIPSInst_FT(ir)]); MIPS_FPU_EMU_INC_STATS(loads); - if (!access_ok(VERIFY_READ, va, sizeof(u32))) { + if (!access_ok(va, sizeof(u32))) { MIPS_FPU_EMU_INC_STATS(errors); *fault_addr = va; return SIGBUS; @@ -1513,7 +1513,7 @@ static int fpux_emu(struct pt_regs *xcp, struct mips_fpu_struct *ctx, MIPS_FPU_EMU_INC_STATS(stores); SIFROMREG(val, MIPSInst_FS(ir)); - if (!access_ok(VERIFY_WRITE, va, sizeof(u32))) { + if (!access_ok(va, sizeof(u32))) { MIPS_FPU_EMU_INC_STATS(errors); *fault_addr = va; return SIGBUS; @@ -1590,7 +1590,7 @@ static int fpux_emu(struct pt_regs *xcp, struct mips_fpu_struct *ctx, xcp->regs[MIPSInst_FT(ir)]); MIPS_FPU_EMU_INC_STATS(loads); - if (!access_ok(VERIFY_READ, va, sizeof(u64))) { + if (!access_ok(va, sizeof(u64))) { MIPS_FPU_EMU_INC_STATS(errors); *fault_addr = va; return SIGBUS; @@ -1609,7 +1609,7 @@ static int fpux_emu(struct pt_regs *xcp, struct mips_fpu_struct *ctx, MIPS_FPU_EMU_INC_STATS(stores); DIFROMREG(val, MIPSInst_FS(ir)); - if (!access_ok(VERIFY_WRITE, va, sizeof(u64))) { + if (!access_ok(va, sizeof(u64))) { MIPS_FPU_EMU_INC_STATS(errors); *fault_addr = va; return SIGBUS; diff --git a/arch/mips/mm/cache.c b/arch/mips/mm/cache.c index 70a523151ff3..55099fbff4e6 100644 --- a/arch/mips/mm/cache.c +++ b/arch/mips/mm/cache.c @@ -76,7 +76,7 @@ SYSCALL_DEFINE3(cacheflush, unsigned long, addr, unsigned long, bytes, { if (bytes == 0) return 0; - if (!access_ok(VERIFY_WRITE, (void __user *) addr, bytes)) + if (!access_ok((void __user *) addr, bytes)) return -EFAULT; __flush_icache_user_range(addr, addr + bytes); diff --git a/arch/mips/mm/gup.c b/arch/mips/mm/gup.c index 5a4875cac1ec..0d14e0d8eacf 100644 --- a/arch/mips/mm/gup.c +++ b/arch/mips/mm/gup.c @@ -195,8 +195,7 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write, addr = start; len = (unsigned long) nr_pages << PAGE_SHIFT; end = start + len; - if (unlikely(!access_ok(write ? VERIFY_WRITE : VERIFY_READ, - (void __user *)start, len))) + if (unlikely(!access_ok((void __user *)start, len))) return 0; /* diff --git a/arch/mips/oprofile/backtrace.c b/arch/mips/oprofile/backtrace.c index 806fb798091f..07d98ba7f49e 100644 --- a/arch/mips/oprofile/backtrace.c +++ b/arch/mips/oprofile/backtrace.c @@ -19,7 +19,7 @@ struct stackframe { static inline int get_mem(unsigned long addr, unsigned long *result) { unsigned long *address = (unsigned long *) addr; - if (!access_ok(VERIFY_READ, address, sizeof(unsigned long))) + if (!access_ok(address, sizeof(unsigned long))) return -1; if (__copy_from_user_inatomic(result, address, sizeof(unsigned long))) return -3; diff --git a/arch/mips/sibyte/common/sb_tbprof.c b/arch/mips/sibyte/common/sb_tbprof.c index 99c720be72d2..9ff26b0cd3b6 100644 --- a/arch/mips/sibyte/common/sb_tbprof.c +++ b/arch/mips/sibyte/common/sb_tbprof.c @@ -458,7 +458,7 @@ static ssize_t sbprof_tb_read(struct file *filp, char *buf, char *dest = buf; long cur_off = *offp; - if (!access_ok(VERIFY_WRITE, buf, size)) + if (!access_ok(buf, size)) return -EFAULT; mutex_lock(&sbp.lock); diff --git a/arch/nds32/include/asm/futex.h b/arch/nds32/include/asm/futex.h index cb6cb91cfdf8..baf178bf1d0b 100644 --- a/arch/nds32/include/asm/futex.h +++ b/arch/nds32/include/asm/futex.h @@ -40,7 +40,7 @@ futex_atomic_cmpxchg_inatomic(u32 * uval, u32 __user * uaddr, int ret = 0; u32 val, tmp, flags; - if (!access_ok(VERIFY_WRITE, uaddr, sizeof(u32))) + if (!access_ok(uaddr, sizeof(u32))) return -EFAULT; smp_mb(); diff --git a/arch/nds32/include/asm/uaccess.h b/arch/nds32/include/asm/uaccess.h index 362a32d9bd16..53dcb49b0b12 100644 --- a/arch/nds32/include/asm/uaccess.h +++ b/arch/nds32/include/asm/uaccess.h @@ -13,9 +13,6 @@ #include #include -#define VERIFY_READ 0 -#define VERIFY_WRITE 1 - #define __asmeq(x, y) ".ifnc " x "," y " ; .err ; .endif\n\t" /* @@ -53,7 +50,7 @@ static inline void set_fs(mm_segment_t fs) #define __range_ok(addr, size) (size <= get_fs() && addr <= (get_fs() -size)) -#define access_ok(type, addr, size) \ +#define access_ok(addr, size) \ __range_ok((unsigned long)addr, (unsigned long)size) /* * Single-value transfer routines. They automatically use the right @@ -94,7 +91,7 @@ static inline void set_fs(mm_segment_t fs) ({ \ const __typeof__(*(ptr)) __user *__p = (ptr); \ might_fault(); \ - if (access_ok(VERIFY_READ, __p, sizeof(*__p))) { \ + if (access_ok(__p, sizeof(*__p))) { \ __get_user_err((x), __p, (err)); \ } else { \ (x) = 0; (err) = -EFAULT; \ @@ -189,7 +186,7 @@ do { \ ({ \ __typeof__(*(ptr)) __user *__p = (ptr); \ might_fault(); \ - if (access_ok(VERIFY_WRITE, __p, sizeof(*__p))) { \ + if (access_ok(__p, sizeof(*__p))) { \ __put_user_err((x), __p, (err)); \ } else { \ (err) = -EFAULT; \ @@ -279,7 +276,7 @@ extern unsigned long __arch_copy_to_user(void __user * to, const void *from, #define INLINE_COPY_TO_USER static inline unsigned long clear_user(void __user * to, unsigned long n) { - if (access_ok(VERIFY_WRITE, to, n)) + if (access_ok(to, n)) n = __arch_clear_user(to, n); return n; } diff --git a/arch/nds32/kernel/perf_event_cpu.c b/arch/nds32/kernel/perf_event_cpu.c index 5e00ce54d0ff..334c2a6cec23 100644 --- a/arch/nds32/kernel/perf_event_cpu.c +++ b/arch/nds32/kernel/perf_event_cpu.c @@ -1306,7 +1306,7 @@ user_backtrace(struct perf_callchain_entry_ctx *entry, unsigned long fp) (unsigned long *)(fp - (unsigned long)sizeof(buftail)); /* Check accessibility of one struct frame_tail beyond */ - if (!access_ok(VERIFY_READ, user_frame_tail, sizeof(buftail))) + if (!access_ok(user_frame_tail, sizeof(buftail))) return 0; if (__copy_from_user_inatomic (&buftail, user_frame_tail, sizeof(buftail))) @@ -1332,7 +1332,7 @@ user_backtrace_opt_size(struct perf_callchain_entry_ctx *entry, (unsigned long *)(fp - (unsigned long)sizeof(buftail)); /* Check accessibility of one struct frame_tail beyond */ - if (!access_ok(VERIFY_READ, user_frame_tail, sizeof(buftail))) + if (!access_ok(user_frame_tail, sizeof(buftail))) return 0; if (__copy_from_user_inatomic (&buftail, user_frame_tail, sizeof(buftail))) @@ -1386,7 +1386,7 @@ perf_callchain_user(struct perf_callchain_entry_ctx *entry, user_frame_tail = (unsigned long *)(fp - (unsigned long)sizeof(fp)); - if (!access_ok(VERIFY_READ, user_frame_tail, sizeof(fp))) + if (!access_ok(user_frame_tail, sizeof(fp))) return; if (__copy_from_user_inatomic @@ -1406,8 +1406,7 @@ perf_callchain_user(struct perf_callchain_entry_ctx *entry, (unsigned long *)(fp - (unsigned long)sizeof(buftail)); - if (!access_ok - (VERIFY_READ, user_frame_tail, sizeof(buftail))) + if (!access_ok(user_frame_tail, sizeof(buftail))) return; if (__copy_from_user_inatomic @@ -1424,7 +1423,7 @@ perf_callchain_user(struct perf_callchain_entry_ctx *entry, (unsigned long *)(fp - (unsigned long) sizeof(buftail_opt_size)); - if (!access_ok(VERIFY_READ, user_frame_tail, + if (!access_ok(user_frame_tail, sizeof(buftail_opt_size))) return; diff --git a/arch/nds32/kernel/signal.c b/arch/nds32/kernel/signal.c index 5b5be082cfa4..5f7660aa2d68 100644 --- a/arch/nds32/kernel/signal.c +++ b/arch/nds32/kernel/signal.c @@ -151,7 +151,7 @@ asmlinkage long sys_rt_sigreturn(struct pt_regs *regs) frame = (struct rt_sigframe __user *)regs->sp; - if (!access_ok(VERIFY_READ, frame, sizeof(*frame))) + if (!access_ok(frame, sizeof(*frame))) goto badframe; if (restore_sigframe(regs, frame)) @@ -275,7 +275,7 @@ setup_rt_frame(struct ksignal *ksig, sigset_t * set, struct pt_regs *regs) get_sigframe(ksig, regs, sizeof(*frame)); int err = 0; - if (!access_ok(VERIFY_WRITE, frame, sizeof(*frame))) + if (!access_ok(frame, sizeof(*frame))) return -EFAULT; __put_user_error(0, &frame->uc.uc_flags, err); diff --git a/arch/nds32/mm/alignment.c b/arch/nds32/mm/alignment.c index e1aed9dc692d..c8b9061a2ee3 100644 --- a/arch/nds32/mm/alignment.c +++ b/arch/nds32/mm/alignment.c @@ -289,13 +289,13 @@ static inline int do_16(unsigned long inst, struct pt_regs *regs) unaligned_addr += shift; if (load) { - if (!access_ok(VERIFY_READ, (void *)unaligned_addr, len)) + if (!access_ok((void *)unaligned_addr, len)) return -EACCES; get_data(unaligned_addr, &target_val, len); *idx_to_addr(regs, target_idx) = target_val; } else { - if (!access_ok(VERIFY_WRITE, (void *)unaligned_addr, len)) + if (!access_ok((void *)unaligned_addr, len)) return -EACCES; target_val = *idx_to_addr(regs, target_idx); set_data((void *)unaligned_addr, target_val, len); @@ -479,7 +479,7 @@ static inline int do_32(unsigned long inst, struct pt_regs *regs) if (load) { - if (!access_ok(VERIFY_READ, (void *)unaligned_addr, len)) + if (!access_ok((void *)unaligned_addr, len)) return -EACCES; get_data(unaligned_addr, &target_val, len); @@ -491,7 +491,7 @@ static inline int do_32(unsigned long inst, struct pt_regs *regs) *idx_to_addr(regs, RT(inst)) = target_val; } else { - if (!access_ok(VERIFY_WRITE, (void *)unaligned_addr, len)) + if (!access_ok((void *)unaligned_addr, len)) return -EACCES; target_val = *idx_to_addr(regs, RT(inst)); diff --git a/arch/nios2/include/asm/uaccess.h b/arch/nios2/include/asm/uaccess.h index dfa3c7cb30b4..e0ea10806491 100644 --- a/arch/nios2/include/asm/uaccess.h +++ b/arch/nios2/include/asm/uaccess.h @@ -37,7 +37,7 @@ (((signed long)(((long)get_fs().seg) & \ ((long)(addr) | (((long)(addr)) + (len)) | (len)))) == 0) -#define access_ok(type, addr, len) \ +#define access_ok(addr, len) \ likely(__access_ok((unsigned long)(addr), (unsigned long)(len))) # define __EX_TABLE_SECTION ".section __ex_table,\"a\"\n" @@ -70,7 +70,7 @@ static inline unsigned long __must_check __clear_user(void __user *to, static inline unsigned long __must_check clear_user(void __user *to, unsigned long n) { - if (!access_ok(VERIFY_WRITE, to, n)) + if (!access_ok(to, n)) return n; return __clear_user(to, n); } @@ -142,7 +142,7 @@ do { \ long __gu_err = -EFAULT; \ const __typeof__(*(ptr)) __user *__gu_ptr = (ptr); \ unsigned long __gu_val = 0; \ - if (access_ok(VERIFY_READ, __gu_ptr, sizeof(*__gu_ptr))) \ + if (access_ok( __gu_ptr, sizeof(*__gu_ptr))) \ __get_user_common(__gu_val, sizeof(*__gu_ptr), \ __gu_ptr, __gu_err); \ (x) = (__force __typeof__(x))__gu_val; \ @@ -168,7 +168,7 @@ do { \ long __pu_err = -EFAULT; \ __typeof__(*(ptr)) __user *__pu_ptr = (ptr); \ __typeof__(*(ptr)) __pu_val = (__typeof(*ptr))(x); \ - if (access_ok(VERIFY_WRITE, __pu_ptr, sizeof(*__pu_ptr))) { \ + if (access_ok(__pu_ptr, sizeof(*__pu_ptr))) { \ switch (sizeof(*__pu_ptr)) { \ case 1: \ __put_user_asm(__pu_val, "stb", __pu_ptr, __pu_err); \ diff --git a/arch/nios2/kernel/signal.c b/arch/nios2/kernel/signal.c index 20662b0f6c9e..4a81876b6086 100644 --- a/arch/nios2/kernel/signal.c +++ b/arch/nios2/kernel/signal.c @@ -106,7 +106,7 @@ asmlinkage int do_rt_sigreturn(struct switch_stack *sw) sigset_t set; int rval; - if (!access_ok(VERIFY_READ, frame, sizeof(*frame))) + if (!access_ok(frame, sizeof(*frame))) goto badframe; if (__copy_from_user(&set, &frame->uc.uc_sigmask, sizeof(set))) diff --git a/arch/openrisc/include/asm/futex.h b/arch/openrisc/include/asm/futex.h index 618da4a1bffb..fe894e6331ae 100644 --- a/arch/openrisc/include/asm/futex.h +++ b/arch/openrisc/include/asm/futex.h @@ -72,7 +72,7 @@ futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr, int ret = 0; u32 prev; - if (!access_ok(VERIFY_WRITE, uaddr, sizeof(u32))) + if (!access_ok(uaddr, sizeof(u32))) return -EFAULT; __asm__ __volatile__ ( \ diff --git a/arch/openrisc/include/asm/uaccess.h b/arch/openrisc/include/asm/uaccess.h index bbf5c79cce7a..bc8191a34db7 100644 --- a/arch/openrisc/include/asm/uaccess.h +++ b/arch/openrisc/include/asm/uaccess.h @@ -58,7 +58,7 @@ /* Ensure that addr is below task's addr_limit */ #define __addr_ok(addr) ((unsigned long) addr < get_fs()) -#define access_ok(type, addr, size) \ +#define access_ok(addr, size) \ __range_ok((unsigned long)addr, (unsigned long)size) /* @@ -102,7 +102,7 @@ extern long __put_user_bad(void); ({ \ long __pu_err = -EFAULT; \ __typeof__(*(ptr)) *__pu_addr = (ptr); \ - if (access_ok(VERIFY_WRITE, __pu_addr, size)) \ + if (access_ok(__pu_addr, size)) \ __put_user_size((x), __pu_addr, (size), __pu_err); \ __pu_err; \ }) @@ -175,7 +175,7 @@ struct __large_struct { ({ \ long __gu_err = -EFAULT, __gu_val = 0; \ const __typeof__(*(ptr)) * __gu_addr = (ptr); \ - if (access_ok(VERIFY_READ, __gu_addr, size)) \ + if (access_ok(__gu_addr, size)) \ __get_user_size(__gu_val, __gu_addr, (size), __gu_err); \ (x) = (__force __typeof__(*(ptr)))__gu_val; \ __gu_err; \ @@ -254,7 +254,7 @@ extern unsigned long __clear_user(void *addr, unsigned long size); static inline __must_check unsigned long clear_user(void *addr, unsigned long size) { - if (likely(access_ok(VERIFY_WRITE, addr, size))) + if (likely(access_ok(addr, size))) size = __clear_user(addr, size); return size; } diff --git a/arch/openrisc/kernel/signal.c b/arch/openrisc/kernel/signal.c index 265f10fb3930..5ac9d3b1d615 100644 --- a/arch/openrisc/kernel/signal.c +++ b/arch/openrisc/kernel/signal.c @@ -50,7 +50,7 @@ static int restore_sigcontext(struct pt_regs *regs, /* * Restore the regs from &sc->regs. - * (sc is already checked for VERIFY_READ since the sigframe was + * (sc is already checked since the sigframe was * checked in sys_sigreturn previously) */ err |= __copy_from_user(regs, sc->regs.gpr, 32 * sizeof(unsigned long)); @@ -83,7 +83,7 @@ asmlinkage long _sys_rt_sigreturn(struct pt_regs *regs) if (((long)frame) & 3) goto badframe; - if (!access_ok(VERIFY_READ, frame, sizeof(*frame))) + if (!access_ok(frame, sizeof(*frame))) goto badframe; if (__copy_from_user(&set, &frame->uc.uc_sigmask, sizeof(set))) goto badframe; @@ -161,7 +161,7 @@ static int setup_rt_frame(struct ksignal *ksig, sigset_t *set, frame = get_sigframe(ksig, regs, sizeof(*frame)); - if (!access_ok(VERIFY_WRITE, frame, sizeof(*frame))) + if (!access_ok(frame, sizeof(*frame))) return -EFAULT; /* Create siginfo. */ diff --git a/arch/parisc/include/asm/futex.h b/arch/parisc/include/asm/futex.h index cf7ba058f619..d2c3e4106851 100644 --- a/arch/parisc/include/asm/futex.h +++ b/arch/parisc/include/asm/futex.h @@ -95,7 +95,7 @@ futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr, if (uaccess_kernel() && !uaddr) return -EFAULT; - if (!access_ok(VERIFY_WRITE, uaddr, sizeof(u32))) + if (!access_ok(uaddr, sizeof(u32))) return -EFAULT; /* HPPA has no cmpxchg in hardware and therefore the diff --git a/arch/parisc/include/asm/uaccess.h b/arch/parisc/include/asm/uaccess.h index ea70e36ce6af..30ac2865ea73 100644 --- a/arch/parisc/include/asm/uaccess.h +++ b/arch/parisc/include/asm/uaccess.h @@ -27,7 +27,7 @@ * that put_user is the same as __put_user, etc. */ -#define access_ok(type, uaddr, size) \ +#define access_ok(uaddr, size) \ ( (uaddr) == (uaddr) ) #define put_user __put_user diff --git a/arch/powerpc/include/asm/futex.h b/arch/powerpc/include/asm/futex.h index 94542776a62d..88b38b37c21b 100644 --- a/arch/powerpc/include/asm/futex.h +++ b/arch/powerpc/include/asm/futex.h @@ -72,7 +72,7 @@ futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr, int ret = 0; u32 prev; - if (!access_ok(VERIFY_WRITE, uaddr, sizeof(u32))) + if (!access_ok(uaddr, sizeof(u32))) return -EFAULT; __asm__ __volatile__ ( diff --git a/arch/powerpc/include/asm/uaccess.h b/arch/powerpc/include/asm/uaccess.h index ebc0b916dcf9..b31bf45eebd4 100644 --- a/arch/powerpc/include/asm/uaccess.h +++ b/arch/powerpc/include/asm/uaccess.h @@ -62,7 +62,7 @@ static inline int __access_ok(unsigned long addr, unsigned long size, #endif -#define access_ok(type, addr, size) \ +#define access_ok(addr, size) \ (__chk_user_ptr(addr), (void)(type), \ __access_ok((__force unsigned long)(addr), (size), get_fs())) @@ -166,7 +166,7 @@ do { \ long __pu_err = -EFAULT; \ __typeof__(*(ptr)) __user *__pu_addr = (ptr); \ might_fault(); \ - if (access_ok(VERIFY_WRITE, __pu_addr, size)) \ + if (access_ok(__pu_addr, size)) \ __put_user_size((x), __pu_addr, (size), __pu_err); \ __pu_err; \ }) @@ -276,7 +276,7 @@ do { \ __long_type(*(ptr)) __gu_val = 0; \ __typeof__(*(ptr)) __user *__gu_addr = (ptr); \ might_fault(); \ - if (access_ok(VERIFY_READ, __gu_addr, (size))) { \ + if (access_ok(__gu_addr, (size))) { \ barrier_nospec(); \ __get_user_size(__gu_val, __gu_addr, (size), __gu_err); \ } \ @@ -374,7 +374,7 @@ extern unsigned long __clear_user(void __user *addr, unsigned long size); static inline unsigned long clear_user(void __user *addr, unsigned long size) { might_fault(); - if (likely(access_ok(VERIFY_WRITE, addr, size))) + if (likely(access_ok(addr, size))) return __clear_user(addr, size); return size; } diff --git a/arch/powerpc/kernel/align.c b/arch/powerpc/kernel/align.c index 11550a3d1ac2..0d1b6370bae0 100644 --- a/arch/powerpc/kernel/align.c +++ b/arch/powerpc/kernel/align.c @@ -131,8 +131,7 @@ static int emulate_spe(struct pt_regs *regs, unsigned int reg, /* Verify the address of the operand */ if (unlikely(user_mode(regs) && - !access_ok((flags & ST ? VERIFY_WRITE : VERIFY_READ), - addr, nb))) + !access_ok(addr, nb))) return -EFAULT; /* userland only */ diff --git a/arch/powerpc/kernel/rtas_flash.c b/arch/powerpc/kernel/rtas_flash.c index 10fabae2574d..8246f437bbc6 100644 --- a/arch/powerpc/kernel/rtas_flash.c +++ b/arch/powerpc/kernel/rtas_flash.c @@ -523,7 +523,7 @@ static ssize_t validate_flash_write(struct file *file, const char __user *buf, args_buf->status = VALIDATE_INCOMPLETE; } - if (!access_ok(VERIFY_READ, buf, count)) { + if (!access_ok(buf, count)) { rc = -EFAULT; goto done; } diff --git a/arch/powerpc/kernel/rtasd.c b/arch/powerpc/kernel/rtasd.c index 38cadae4ca4f..8a1746d755c9 100644 --- a/arch/powerpc/kernel/rtasd.c +++ b/arch/powerpc/kernel/rtasd.c @@ -335,7 +335,7 @@ static ssize_t rtas_log_read(struct file * file, char __user * buf, count = rtas_error_log_buffer_max; - if (!access_ok(VERIFY_WRITE, buf, count)) + if (!access_ok(buf, count)) return -EFAULT; tmp = kmalloc(count, GFP_KERNEL); diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c index b3e8db376ecd..e6c30cee6abf 100644 --- a/arch/powerpc/kernel/signal.c +++ b/arch/powerpc/kernel/signal.c @@ -44,7 +44,7 @@ void __user *get_sigframe(struct ksignal *ksig, unsigned long sp, newsp = (oldsp - frame_size) & ~0xFUL; /* Check access */ - if (!access_ok(VERIFY_WRITE, (void __user *)newsp, oldsp - newsp)) + if (!access_ok((void __user *)newsp, oldsp - newsp)) return NULL; return (void __user *)newsp; diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c index 2d47cc79e5b3..ede4f04281ae 100644 --- a/arch/powerpc/kernel/signal_32.c +++ b/arch/powerpc/kernel/signal_32.c @@ -1017,7 +1017,7 @@ static int do_setcontext(struct ucontext __user *ucp, struct pt_regs *regs, int #else if (__get_user(mcp, &ucp->uc_regs)) return -EFAULT; - if (!access_ok(VERIFY_READ, mcp, sizeof(*mcp))) + if (!access_ok(mcp, sizeof(*mcp))) return -EFAULT; #endif set_current_blocked(&set); @@ -1120,7 +1120,7 @@ SYSCALL_DEFINE3(swapcontext, struct ucontext __user *, old_ctx, */ mctx = (struct mcontext __user *) ((unsigned long) &old_ctx->uc_mcontext & ~0xfUL); - if (!access_ok(VERIFY_WRITE, old_ctx, ctx_size) + if (!access_ok(old_ctx, ctx_size) || save_user_regs(regs, mctx, NULL, 0, ctx_has_vsx_region) || put_sigset_t(&old_ctx->uc_sigmask, ¤t->blocked) || __put_user(to_user_ptr(mctx), &old_ctx->uc_regs)) @@ -1128,7 +1128,7 @@ SYSCALL_DEFINE3(swapcontext, struct ucontext __user *, old_ctx, } if (new_ctx == NULL) return 0; - if (!access_ok(VERIFY_READ, new_ctx, ctx_size) || + if (!access_ok(new_ctx, ctx_size) || fault_in_pages_readable((u8 __user *)new_ctx, ctx_size)) return -EFAULT; @@ -1169,7 +1169,7 @@ SYSCALL_DEFINE0(rt_sigreturn) rt_sf = (struct rt_sigframe __user *) (regs->gpr[1] + __SIGNAL_FRAMESIZE + 16); - if (!access_ok(VERIFY_READ, rt_sf, sizeof(*rt_sf))) + if (!access_ok(rt_sf, sizeof(*rt_sf))) goto bad; #ifdef CONFIG_PPC_TRANSACTIONAL_MEM @@ -1315,7 +1315,7 @@ SYSCALL_DEFINE3(debug_setcontext, struct ucontext __user *, ctx, current->thread.debug.dbcr0 = new_dbcr0; #endif - if (!access_ok(VERIFY_READ, ctx, sizeof(*ctx)) || + if (!access_ok(ctx, sizeof(*ctx)) || fault_in_pages_readable((u8 __user *)ctx, sizeof(*ctx))) return -EFAULT; @@ -1500,7 +1500,7 @@ SYSCALL_DEFINE0(sigreturn) { sr = (struct mcontext __user *)from_user_ptr(sigctx.regs); addr = sr; - if (!access_ok(VERIFY_READ, sr, sizeof(*sr)) + if (!access_ok(sr, sizeof(*sr)) || restore_user_regs(regs, sr, 1)) goto badframe; } diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c index 0935fe6c282a..bd5e6834ca69 100644 --- a/arch/powerpc/kernel/signal_64.c +++ b/arch/powerpc/kernel/signal_64.c @@ -383,7 +383,7 @@ static long restore_sigcontext(struct task_struct *tsk, sigset_t *set, int sig, err |= __get_user(v_regs, &sc->v_regs); if (err) return err; - if (v_regs && !access_ok(VERIFY_READ, v_regs, 34 * sizeof(vector128))) + if (v_regs && !access_ok(v_regs, 34 * sizeof(vector128))) return -EFAULT; /* Copy 33 vec registers (vr0..31 and vscr) from the stack */ if (v_regs != NULL && (msr & MSR_VEC) != 0) { @@ -502,10 +502,9 @@ static long restore_tm_sigcontexts(struct task_struct *tsk, err |= __get_user(tm_v_regs, &tm_sc->v_regs); if (err) return err; - if (v_regs && !access_ok(VERIFY_READ, v_regs, 34 * sizeof(vector128))) + if (v_regs && !access_ok(v_regs, 34 * sizeof(vector128))) return -EFAULT; - if (tm_v_regs && !access_ok(VERIFY_READ, - tm_v_regs, 34 * sizeof(vector128))) + if (tm_v_regs && !access_ok(tm_v_regs, 34 * sizeof(vector128))) return -EFAULT; /* Copy 33 vec registers (vr0..31 and vscr) from the stack */ if (v_regs != NULL && tm_v_regs != NULL && (msr & MSR_VEC) != 0) { @@ -671,7 +670,7 @@ SYSCALL_DEFINE3(swapcontext, struct ucontext __user *, old_ctx, ctx_has_vsx_region = 1; if (old_ctx != NULL) { - if (!access_ok(VERIFY_WRITE, old_ctx, ctx_size) + if (!access_ok(old_ctx, ctx_size) || setup_sigcontext(&old_ctx->uc_mcontext, current, 0, NULL, 0, ctx_has_vsx_region) || __copy_to_user(&old_ctx->uc_sigmask, @@ -680,7 +679,7 @@ SYSCALL_DEFINE3(swapcontext, struct ucontext __user *, old_ctx, } if (new_ctx == NULL) return 0; - if (!access_ok(VERIFY_READ, new_ctx, ctx_size) + if (!access_ok(new_ctx, ctx_size) || __get_user(tmp, (u8 __user *) new_ctx) || __get_user(tmp, (u8 __user *) new_ctx + ctx_size - 1)) return -EFAULT; @@ -725,7 +724,7 @@ SYSCALL_DEFINE0(rt_sigreturn) /* Always make any pending restarted system calls return -EINTR */ current->restart_block.fn = do_no_restart_syscall; - if (!access_ok(VERIFY_READ, uc, sizeof(*uc))) + if (!access_ok(uc, sizeof(*uc))) goto badframe; if (__copy_from_user(&set, &uc->uc_sigmask, sizeof(set))) diff --git a/arch/powerpc/kernel/syscalls.c b/arch/powerpc/kernel/syscalls.c index 466216506eb2..e6982ab21816 100644 --- a/arch/powerpc/kernel/syscalls.c +++ b/arch/powerpc/kernel/syscalls.c @@ -89,7 +89,7 @@ ppc_select(int n, fd_set __user *inp, fd_set __user *outp, fd_set __user *exp, s if ( (unsigned long)n >= 4096 ) { unsigned long __user *buffer = (unsigned long __user *)n; - if (!access_ok(VERIFY_READ, buffer, 5*sizeof(unsigned long)) + if (!access_ok(buffer, 5*sizeof(unsigned long)) || __get_user(n, buffer) || __get_user(inp, ((fd_set __user * __user *)(buffer+1))) || __get_user(outp, ((fd_set __user * __user *)(buffer+2))) diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c index 00af2c4febf4..64936b60d521 100644 --- a/arch/powerpc/kernel/traps.c +++ b/arch/powerpc/kernel/traps.c @@ -837,7 +837,7 @@ static void p9_hmi_special_emu(struct pt_regs *regs) addr = (__force const void __user *)ea; /* Check it */ - if (!access_ok(VERIFY_READ, addr, 16)) { + if (!access_ok(addr, 16)) { pr_devel("HMI vec emu: bad access %i:%s[%d] nip=%016lx" " instr=%08x addr=%016lx\n", smp_processor_id(), current->comm, current->pid, diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c index 6f2d2fb4e098..bd2dcfbf00cd 100644 --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c @@ -1744,7 +1744,7 @@ static ssize_t kvm_htab_read(struct file *file, char __user *buf, int first_pass; unsigned long hpte[2]; - if (!access_ok(VERIFY_WRITE, buf, count)) + if (!access_ok(buf, count)) return -EFAULT; if (kvm_is_radix(kvm)) return 0; @@ -1844,7 +1844,7 @@ static ssize_t kvm_htab_write(struct file *file, const char __user *buf, int mmu_ready; int pshift; - if (!access_ok(VERIFY_READ, buf, count)) + if (!access_ok(buf, count)) return -EFAULT; if (kvm_is_radix(kvm)) return -EINVAL; diff --git a/arch/powerpc/lib/checksum_wrappers.c b/arch/powerpc/lib/checksum_wrappers.c index a0cb63fb76a1..890d4ddd91d6 100644 --- a/arch/powerpc/lib/checksum_wrappers.c +++ b/arch/powerpc/lib/checksum_wrappers.c @@ -37,7 +37,7 @@ __wsum csum_and_copy_from_user(const void __user *src, void *dst, goto out; } - if (unlikely((len < 0) || !access_ok(VERIFY_READ, src, len))) { + if (unlikely((len < 0) || !access_ok(src, len))) { *err_ptr = -EFAULT; csum = (__force unsigned int)sum; goto out; @@ -78,7 +78,7 @@ __wsum csum_and_copy_to_user(const void *src, void __user *dst, int len, goto out; } - if (unlikely((len < 0) || !access_ok(VERIFY_WRITE, dst, len))) { + if (unlikely((len < 0) || !access_ok(dst, len))) { *err_ptr = -EFAULT; csum = -1; /* invalid checksum */ goto out; diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c index a6dcfda3e11e..887f11bcf330 100644 --- a/arch/powerpc/mm/fault.c +++ b/arch/powerpc/mm/fault.c @@ -274,7 +274,7 @@ static bool bad_stack_expansion(struct pt_regs *regs, unsigned long address, return false; if ((flags & FAULT_FLAG_WRITE) && (flags & FAULT_FLAG_USER) && - access_ok(VERIFY_READ, nip, sizeof(*nip))) { + access_ok(nip, sizeof(*nip))) { unsigned int inst; int res; diff --git a/arch/powerpc/mm/subpage-prot.c b/arch/powerpc/mm/subpage-prot.c index 3327551c8b47..5e4178790dee 100644 --- a/arch/powerpc/mm/subpage-prot.c +++ b/arch/powerpc/mm/subpage-prot.c @@ -214,7 +214,7 @@ SYSCALL_DEFINE3(subpage_prot, unsigned long, addr, return 0; } - if (!access_ok(VERIFY_READ, map, (len >> PAGE_SHIFT) * sizeof(u32))) + if (!access_ok(map, (len >> PAGE_SHIFT) * sizeof(u32))) return -EFAULT; down_write(&mm->mmap_sem); diff --git a/arch/powerpc/oprofile/backtrace.c b/arch/powerpc/oprofile/backtrace.c index 5df6290d1ccc..260c53700978 100644 --- a/arch/powerpc/oprofile/backtrace.c +++ b/arch/powerpc/oprofile/backtrace.c @@ -31,7 +31,7 @@ static unsigned int user_getsp32(unsigned int sp, int is_first) unsigned int stack_frame[2]; void __user *p = compat_ptr(sp); - if (!access_ok(VERIFY_READ, p, sizeof(stack_frame))) + if (!access_ok(p, sizeof(stack_frame))) return 0; /* @@ -57,7 +57,7 @@ static unsigned long user_getsp64(unsigned long sp, int is_first) { unsigned long stack_frame[3]; - if (!access_ok(VERIFY_READ, (void __user *)sp, sizeof(stack_frame))) + if (!access_ok((void __user *)sp, sizeof(stack_frame))) return 0; if (__copy_from_user_inatomic(stack_frame, (void __user *)sp, diff --git a/arch/powerpc/platforms/cell/spufs/file.c b/arch/powerpc/platforms/cell/spufs/file.c index 43e7b93f27c7..ae8123edddc6 100644 --- a/arch/powerpc/platforms/cell/spufs/file.c +++ b/arch/powerpc/platforms/cell/spufs/file.c @@ -609,7 +609,7 @@ static ssize_t spufs_mbox_read(struct file *file, char __user *buf, if (len < 4) return -EINVAL; - if (!access_ok(VERIFY_WRITE, buf, len)) + if (!access_ok(buf, len)) return -EFAULT; udata = (void __user *)buf; @@ -717,7 +717,7 @@ static ssize_t spufs_ibox_read(struct file *file, char __user *buf, if (len < 4) return -EINVAL; - if (!access_ok(VERIFY_WRITE, buf, len)) + if (!access_ok(buf, len)) return -EFAULT; udata = (void __user *)buf; @@ -856,7 +856,7 @@ static ssize_t spufs_wbox_write(struct file *file, const char __user *buf, return -EINVAL; udata = (void __user *)buf; - if (!access_ok(VERIFY_READ, buf, len)) + if (!access_ok(buf, len)) return -EFAULT; if (__get_user(wbox_data, udata)) @@ -1994,7 +1994,7 @@ static ssize_t spufs_mbox_info_read(struct file *file, char __user *buf, int ret; struct spu_context *ctx = file->private_data; - if (!access_ok(VERIFY_WRITE, buf, len)) + if (!access_ok(buf, len)) return -EFAULT; ret = spu_acquire_saved(ctx); @@ -2034,7 +2034,7 @@ static ssize_t spufs_ibox_info_read(struct file *file, char __user *buf, struct spu_context *ctx = file->private_data; int ret; - if (!access_ok(VERIFY_WRITE, buf, len)) + if (!access_ok(buf, len)) return -EFAULT; ret = spu_acquire_saved(ctx); @@ -2077,7 +2077,7 @@ static ssize_t spufs_wbox_info_read(struct file *file, char __user *buf, struct spu_context *ctx = file->private_data; int ret; - if (!access_ok(VERIFY_WRITE, buf, len)) + if (!access_ok(buf, len)) return -EFAULT; ret = spu_acquire_saved(ctx); @@ -2129,7 +2129,7 @@ static ssize_t spufs_dma_info_read(struct file *file, char __user *buf, struct spu_context *ctx = file->private_data; int ret; - if (!access_ok(VERIFY_WRITE, buf, len)) + if (!access_ok(buf, len)) return -EFAULT; ret = spu_acquire_saved(ctx); @@ -2160,7 +2160,7 @@ static ssize_t __spufs_proxydma_info_read(struct spu_context *ctx, if (len < ret) return -EINVAL; - if (!access_ok(VERIFY_WRITE, buf, len)) + if (!access_ok(buf, len)) return -EFAULT; info.proxydma_info_type = ctx->csa.prob.dma_querytype_RW; diff --git a/arch/powerpc/platforms/powernv/opal-lpc.c b/arch/powerpc/platforms/powernv/opal-lpc.c index 6c7ad1d8b32e..2623996a193a 100644 --- a/arch/powerpc/platforms/powernv/opal-lpc.c +++ b/arch/powerpc/platforms/powernv/opal-lpc.c @@ -192,7 +192,7 @@ static ssize_t lpc_debug_read(struct file *filp, char __user *ubuf, u32 data, pos, len, todo; int rc; - if (!access_ok(VERIFY_WRITE, ubuf, count)) + if (!access_ok(ubuf, count)) return -EFAULT; todo = count; @@ -283,7 +283,7 @@ static ssize_t lpc_debug_write(struct file *filp, const char __user *ubuf, u32 data, pos, len, todo; int rc; - if (!access_ok(VERIFY_READ, ubuf, count)) + if (!access_ok(ubuf, count)) return -EFAULT; todo = count; diff --git a/arch/powerpc/platforms/pseries/scanlog.c b/arch/powerpc/platforms/pseries/scanlog.c index 054ce7a16fc3..24b157e1e890 100644 --- a/arch/powerpc/platforms/pseries/scanlog.c +++ b/arch/powerpc/platforms/pseries/scanlog.c @@ -63,7 +63,7 @@ static ssize_t scanlog_read(struct file *file, char __user *buf, return -EINVAL; } - if (!access_ok(VERIFY_WRITE, buf, count)) + if (!access_ok(buf, count)) return -EFAULT; for (;;) { diff --git a/arch/riscv/include/asm/futex.h b/arch/riscv/include/asm/futex.h index 3b19eba1bc8e..66641624d8a5 100644 --- a/arch/riscv/include/asm/futex.h +++ b/arch/riscv/include/asm/futex.h @@ -95,7 +95,7 @@ futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr, u32 val; uintptr_t tmp; - if (!access_ok(VERIFY_WRITE, uaddr, sizeof(u32))) + if (!access_ok(uaddr, sizeof(u32))) return -EFAULT; __enable_user_access(); diff --git a/arch/riscv/include/asm/uaccess.h b/arch/riscv/include/asm/uaccess.h index 8c3e3e3c8be1..637b896894fc 100644 --- a/arch/riscv/include/asm/uaccess.h +++ b/arch/riscv/include/asm/uaccess.h @@ -54,14 +54,8 @@ static inline void set_fs(mm_segment_t fs) #define user_addr_max() (get_fs()) -#define VERIFY_READ 0 -#define VERIFY_WRITE 1 - /** * access_ok: - Checks if a user space pointer is valid - * @type: Type of access: %VERIFY_READ or %VERIFY_WRITE. Note that - * %VERIFY_WRITE is a superset of %VERIFY_READ - if it is safe - * to write to a block, it is always safe to read from it. * @addr: User space pointer to start of block to check * @size: Size of block to check * @@ -76,7 +70,7 @@ static inline void set_fs(mm_segment_t fs) * checks that the pointer is in the user space range - after calling * this function, memory access functions may still return -EFAULT. */ -#define access_ok(type, addr, size) ({ \ +#define access_ok(addr, size) ({ \ __chk_user_ptr(addr); \ likely(__access_ok((unsigned long __force)(addr), (size))); \ }) @@ -258,7 +252,7 @@ do { \ ({ \ const __typeof__(*(ptr)) __user *__p = (ptr); \ might_fault(); \ - access_ok(VERIFY_READ, __p, sizeof(*__p)) ? \ + access_ok(__p, sizeof(*__p)) ? \ __get_user((x), __p) : \ ((x) = 0, -EFAULT); \ }) @@ -386,7 +380,7 @@ do { \ ({ \ __typeof__(*(ptr)) __user *__p = (ptr); \ might_fault(); \ - access_ok(VERIFY_WRITE, __p, sizeof(*__p)) ? \ + access_ok(__p, sizeof(*__p)) ? \ __put_user((x), __p) : \ -EFAULT; \ }) @@ -421,7 +415,7 @@ static inline unsigned long __must_check clear_user(void __user *to, unsigned long n) { might_fault(); - return access_ok(VERIFY_WRITE, to, n) ? + return access_ok(to, n) ? __clear_user(to, n) : n; } diff --git a/arch/riscv/kernel/signal.c b/arch/riscv/kernel/signal.c index f9b5e7e352ef..837e1646091a 100644 --- a/arch/riscv/kernel/signal.c +++ b/arch/riscv/kernel/signal.c @@ -115,7 +115,7 @@ SYSCALL_DEFINE0(rt_sigreturn) frame = (struct rt_sigframe __user *)regs->sp; - if (!access_ok(VERIFY_READ, frame, sizeof(*frame))) + if (!access_ok(frame, sizeof(*frame))) goto badframe; if (__copy_from_user(&set, &frame->uc.uc_sigmask, sizeof(set))) @@ -187,7 +187,7 @@ static int setup_rt_frame(struct ksignal *ksig, sigset_t *set, long err = 0; frame = get_sigframe(ksig, regs, sizeof(*frame)); - if (!access_ok(VERIFY_WRITE, frame, sizeof(*frame))) + if (!access_ok(frame, sizeof(*frame))) return -EFAULT; err |= copy_siginfo_to_user(&frame->info, &ksig->info); diff --git a/arch/s390/include/asm/uaccess.h b/arch/s390/include/asm/uaccess.h index ad6b91013a05..bd2545977ad3 100644 --- a/arch/s390/include/asm/uaccess.h +++ b/arch/s390/include/asm/uaccess.h @@ -48,7 +48,7 @@ static inline int __range_ok(unsigned long addr, unsigned long size) __range_ok((unsigned long)(addr), (size)); \ }) -#define access_ok(type, addr, size) __access_ok(addr, size) +#define access_ok(addr, size) __access_ok(addr, size) unsigned long __must_check raw_copy_from_user(void *to, const void __user *from, unsigned long n); diff --git a/arch/sh/include/asm/checksum_32.h b/arch/sh/include/asm/checksum_32.h index b58f3d95dc19..36b84cfd3f67 100644 --- a/arch/sh/include/asm/checksum_32.h +++ b/arch/sh/include/asm/checksum_32.h @@ -197,7 +197,7 @@ static inline __wsum csum_and_copy_to_user(const void *src, int len, __wsum sum, int *err_ptr) { - if (access_ok(VERIFY_WRITE, dst, len)) + if (access_ok(dst, len)) return csum_partial_copy_generic((__force const void *)src, dst, len, sum, NULL, err_ptr); diff --git a/arch/sh/include/asm/futex.h b/arch/sh/include/asm/futex.h index 6d192f4908a7..3190ec89df81 100644 --- a/arch/sh/include/asm/futex.h +++ b/arch/sh/include/asm/futex.h @@ -22,7 +22,7 @@ static inline int futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr, u32 oldval, u32 newval) { - if (!access_ok(VERIFY_WRITE, uaddr, sizeof(u32))) + if (!access_ok(uaddr, sizeof(u32))) return -EFAULT; return atomic_futex_op_cmpxchg_inatomic(uval, uaddr, oldval, newval); diff --git a/arch/sh/include/asm/uaccess.h b/arch/sh/include/asm/uaccess.h index 32eb56e00c11..deebbfab5342 100644 --- a/arch/sh/include/asm/uaccess.h +++ b/arch/sh/include/asm/uaccess.h @@ -18,7 +18,7 @@ */ #define __access_ok(addr, size) \ (__addr_ok((addr) + (size))) -#define access_ok(type, addr, size) \ +#define access_ok(addr, size) \ (__chk_user_ptr(addr), \ __access_ok((unsigned long __force)(addr), (size))) @@ -66,7 +66,7 @@ struct __large_struct { unsigned long buf[100]; }; long __gu_err = -EFAULT; \ unsigned long __gu_val = 0; \ const __typeof__(*(ptr)) *__gu_addr = (ptr); \ - if (likely(access_ok(VERIFY_READ, __gu_addr, (size)))) \ + if (likely(access_ok(__gu_addr, (size)))) \ __get_user_size(__gu_val, __gu_addr, (size), __gu_err); \ (x) = (__force __typeof__(*(ptr)))__gu_val; \ __gu_err; \ @@ -87,7 +87,7 @@ struct __large_struct { unsigned long buf[100]; }; long __pu_err = -EFAULT; \ __typeof__(*(ptr)) __user *__pu_addr = (ptr); \ __typeof__(*(ptr)) __pu_val = x; \ - if (likely(access_ok(VERIFY_WRITE, __pu_addr, size))) \ + if (likely(access_ok(__pu_addr, size))) \ __put_user_size(__pu_val, __pu_addr, (size), \ __pu_err); \ __pu_err; \ @@ -132,8 +132,7 @@ __kernel_size_t __clear_user(void *addr, __kernel_size_t size); void __user * __cl_addr = (addr); \ unsigned long __cl_size = (n); \ \ - if (__cl_size && access_ok(VERIFY_WRITE, \ - ((unsigned long)(__cl_addr)), __cl_size)) \ + if (__cl_size && access_ok(__cl_addr, __cl_size)) \ __cl_size = __clear_user(__cl_addr, __cl_size); \ \ __cl_size; \ diff --git a/arch/sh/kernel/signal_32.c b/arch/sh/kernel/signal_32.c index c46c0020ff55..2a2121ba8ebe 100644 --- a/arch/sh/kernel/signal_32.c +++ b/arch/sh/kernel/signal_32.c @@ -160,7 +160,7 @@ asmlinkage int sys_sigreturn(void) /* Always make any pending restarted system calls return -EINTR */ current->restart_block.fn = do_no_restart_syscall; - if (!access_ok(VERIFY_READ, frame, sizeof(*frame))) + if (!access_ok(frame, sizeof(*frame))) goto badframe; if (__get_user(set.sig[0], &frame->sc.oldmask) @@ -190,7 +190,7 @@ asmlinkage int sys_rt_sigreturn(void) /* Always make any pending restarted system calls return -EINTR */ current->restart_block.fn = do_no_restart_syscall; - if (!access_ok(VERIFY_READ, frame, sizeof(*frame))) + if (!access_ok(frame, sizeof(*frame))) goto badframe; if (__copy_from_user(&set, &frame->uc.uc_sigmask, sizeof(set))) @@ -272,7 +272,7 @@ static int setup_frame(struct ksignal *ksig, sigset_t *set, frame = get_sigframe(&ksig->ka, regs->regs[15], sizeof(*frame)); - if (!access_ok(VERIFY_WRITE, frame, sizeof(*frame))) + if (!access_ok(frame, sizeof(*frame))) return -EFAULT; err |= setup_sigcontext(&frame->sc, regs, set->sig[0]); @@ -338,7 +338,7 @@ static int setup_rt_frame(struct ksignal *ksig, sigset_t *set, frame = get_sigframe(&ksig->ka, regs->regs[15], sizeof(*frame)); - if (!access_ok(VERIFY_WRITE, frame, sizeof(*frame))) + if (!access_ok(frame, sizeof(*frame))) return -EFAULT; err |= copy_siginfo_to_user(&frame->info, &ksig->info); diff --git a/arch/sh/kernel/signal_64.c b/arch/sh/kernel/signal_64.c index 76661dee3c65..f1f1598879c2 100644 --- a/arch/sh/kernel/signal_64.c +++ b/arch/sh/kernel/signal_64.c @@ -259,7 +259,7 @@ asmlinkage int sys_sigreturn(unsigned long r2, unsigned long r3, /* Always make any pending restarted system calls return -EINTR */ current->restart_block.fn = do_no_restart_syscall; - if (!access_ok(VERIFY_READ, frame, sizeof(*frame))) + if (!access_ok(frame, sizeof(*frame))) goto badframe; if (__get_user(set.sig[0], &frame->sc.oldmask) @@ -293,7 +293,7 @@ asmlinkage int sys_rt_sigreturn(unsigned long r2, unsigned long r3, /* Always make any pending restarted system calls return -EINTR */ current->restart_block.fn = do_no_restart_syscall; - if (!access_ok(VERIFY_READ, frame, sizeof(*frame))) + if (!access_ok(frame, sizeof(*frame))) goto badframe; if (__copy_from_user(&set, &frame->uc.uc_sigmask, sizeof(set))) @@ -379,7 +379,7 @@ static int setup_frame(struct ksignal *ksig, sigset_t *set, struct pt_regs *regs frame = get_sigframe(&ksig->ka, regs->regs[REG_SP], sizeof(*frame)); - if (!access_ok(VERIFY_WRITE, frame, sizeof(*frame))) + if (!access_ok(frame, sizeof(*frame))) return -EFAULT; err |= setup_sigcontext(&frame->sc, regs, set->sig[0]); @@ -465,7 +465,7 @@ static int setup_rt_frame(struct ksignal *kig, sigset_t *set, frame = get_sigframe(&ksig->ka, regs->regs[REG_SP], sizeof(*frame)); - if (!access_ok(VERIFY_WRITE, frame, sizeof(*frame))) + if (!access_ok(frame, sizeof(*frame))) return -EFAULT; err |= __put_user(&frame->info, &frame->pinfo); diff --git a/arch/sh/kernel/traps_64.c b/arch/sh/kernel/traps_64.c index c52bda4d2574..8ce90a7da67d 100644 --- a/arch/sh/kernel/traps_64.c +++ b/arch/sh/kernel/traps_64.c @@ -40,7 +40,7 @@ static int read_opcode(reg_size_t pc, insn_size_t *result_opcode, int from_user_ /* SHmedia */ aligned_pc = pc & ~3; if (from_user_mode) { - if (!access_ok(VERIFY_READ, aligned_pc, sizeof(insn_size_t))) { + if (!access_ok(aligned_pc, sizeof(insn_size_t))) { get_user_error = -EFAULT; } else { get_user_error = __get_user(opcode, (insn_size_t *)aligned_pc); @@ -180,7 +180,7 @@ static int misaligned_load(struct pt_regs *regs, if (user_mode(regs)) { __u64 buffer; - if (!access_ok(VERIFY_READ, (unsigned long) address, 1UL<thread.float_regs[0], &fpu->si_float_regs[0], diff --git a/arch/sparc/kernel/unaligned_32.c b/arch/sparc/kernel/unaligned_32.c index 64ac8c0c1429..83db94c0b431 100644 --- a/arch/sparc/kernel/unaligned_32.c +++ b/arch/sparc/kernel/unaligned_32.c @@ -278,7 +278,6 @@ static inline int ok_for_user(struct pt_regs *regs, unsigned int insn, enum direction dir) { unsigned int reg; - int check = (dir == load) ? VERIFY_READ : VERIFY_WRITE; int size = ((insn >> 19) & 3) == 3 ? 8 : 4; if ((regs->pc | regs->npc) & 3) @@ -290,18 +289,18 @@ static inline int ok_for_user(struct pt_regs *regs, unsigned int insn, reg = (insn >> 25) & 0x1f; if (reg >= 16) { - if (!access_ok(check, WINREG_ADDR(reg - 16), size)) + if (!access_ok(WINREG_ADDR(reg - 16), size)) return -EFAULT; } reg = (insn >> 14) & 0x1f; if (reg >= 16) { - if (!access_ok(check, WINREG_ADDR(reg - 16), size)) + if (!access_ok(WINREG_ADDR(reg - 16), size)) return -EFAULT; } if (!(insn & 0x2000)) { reg = (insn & 0x1f); if (reg >= 16) { - if (!access_ok(check, WINREG_ADDR(reg - 16), size)) + if (!access_ok(WINREG_ADDR(reg - 16), size)) return -EFAULT; } } diff --git a/arch/um/kernel/ptrace.c b/arch/um/kernel/ptrace.c index 1a1d88a4d940..5f47422401e1 100644 --- a/arch/um/kernel/ptrace.c +++ b/arch/um/kernel/ptrace.c @@ -66,7 +66,7 @@ long arch_ptrace(struct task_struct *child, long request, #ifdef PTRACE_GETREGS case PTRACE_GETREGS: { /* Get all gp regs from the child. */ - if (!access_ok(VERIFY_WRITE, p, MAX_REG_OFFSET)) { + if (!access_ok(p, MAX_REG_OFFSET)) { ret = -EIO; break; } @@ -81,7 +81,7 @@ long arch_ptrace(struct task_struct *child, long request, #ifdef PTRACE_SETREGS case PTRACE_SETREGS: { /* Set all gp regs in the child. */ unsigned long tmp = 0; - if (!access_ok(VERIFY_READ, p, MAX_REG_OFFSET)) { + if (!access_ok(p, MAX_REG_OFFSET)) { ret = -EIO; break; } diff --git a/arch/unicore32/kernel/signal.c b/arch/unicore32/kernel/signal.c index 4ae51cf15ade..63be04809d40 100644 --- a/arch/unicore32/kernel/signal.c +++ b/arch/unicore32/kernel/signal.c @@ -117,7 +117,7 @@ asmlinkage int __sys_rt_sigreturn(struct pt_regs *regs) frame = (struct rt_sigframe __user *)regs->UCreg_sp; - if (!access_ok(VERIFY_READ, frame, sizeof(*frame))) + if (!access_ok(frame, sizeof(*frame))) goto badframe; if (restore_sigframe(regs, &frame->sig)) @@ -205,7 +205,7 @@ static inline void __user *get_sigframe(struct k_sigaction *ka, /* * Check that we can actually write to the signal frame. */ - if (!access_ok(VERIFY_WRITE, frame, framesize)) + if (!access_ok(frame, framesize)) frame = NULL; return frame; diff --git a/arch/x86/entry/vsyscall/vsyscall_64.c b/arch/x86/entry/vsyscall/vsyscall_64.c index d78bcc03e60e..d9d81ad7a400 100644 --- a/arch/x86/entry/vsyscall/vsyscall_64.c +++ b/arch/x86/entry/vsyscall/vsyscall_64.c @@ -99,7 +99,7 @@ static bool write_ok_or_segv(unsigned long ptr, size_t size) * sig_on_uaccess_err, this could go away. */ - if (!access_ok(VERIFY_WRITE, (void __user *)ptr, size)) { + if (!access_ok((void __user *)ptr, size)) { struct thread_struct *thread = ¤t->thread; thread->error_code = X86_PF_USER | X86_PF_WRITE; diff --git a/arch/x86/ia32/ia32_aout.c b/arch/x86/ia32/ia32_aout.c index 8e02b30cf08e..f65b78d32f5e 100644 --- a/arch/x86/ia32/ia32_aout.c +++ b/arch/x86/ia32/ia32_aout.c @@ -176,10 +176,10 @@ static int aout_core_dump(struct coredump_params *cprm) /* make sure we actually have a data and stack area to dump */ set_fs(USER_DS); - if (!access_ok(VERIFY_READ, (void *) (unsigned long)START_DATA(dump), + if (!access_ok((void *) (unsigned long)START_DATA(dump), dump.u_dsize << PAGE_SHIFT)) dump.u_dsize = 0; - if (!access_ok(VERIFY_READ, (void *) (unsigned long)START_STACK(dump), + if (!access_ok((void *) (unsigned long)START_STACK(dump), dump.u_ssize << PAGE_SHIFT)) dump.u_ssize = 0; diff --git a/arch/x86/ia32/ia32_signal.c b/arch/x86/ia32/ia32_signal.c index 86b1341cba9a..321fe5f5d0e9 100644 --- a/arch/x86/ia32/ia32_signal.c +++ b/arch/x86/ia32/ia32_signal.c @@ -119,7 +119,7 @@ asmlinkage long sys32_sigreturn(void) struct sigframe_ia32 __user *frame = (struct sigframe_ia32 __user *)(regs->sp-8); sigset_t set; - if (!access_ok(VERIFY_READ, frame, sizeof(*frame))) + if (!access_ok(frame, sizeof(*frame))) goto badframe; if (__get_user(set.sig[0], &frame->sc.oldmask) || (_COMPAT_NSIG_WORDS > 1 @@ -147,7 +147,7 @@ asmlinkage long sys32_rt_sigreturn(void) frame = (struct rt_sigframe_ia32 __user *)(regs->sp - 4); - if (!access_ok(VERIFY_READ, frame, sizeof(*frame))) + if (!access_ok(frame, sizeof(*frame))) goto badframe; if (__copy_from_user(&set, &frame->uc.uc_sigmask, sizeof(set))) goto badframe; @@ -269,7 +269,7 @@ int ia32_setup_frame(int sig, struct ksignal *ksig, frame = get_sigframe(ksig, regs, sizeof(*frame), &fpstate); - if (!access_ok(VERIFY_WRITE, frame, sizeof(*frame))) + if (!access_ok(frame, sizeof(*frame))) return -EFAULT; if (__put_user(sig, &frame->sig)) @@ -349,7 +349,7 @@ int ia32_setup_rt_frame(int sig, struct ksignal *ksig, frame = get_sigframe(ksig, regs, sizeof(*frame), &fpstate); - if (!access_ok(VERIFY_WRITE, frame, sizeof(*frame))) + if (!access_ok(frame, sizeof(*frame))) return -EFAULT; put_user_try { diff --git a/arch/x86/ia32/sys_ia32.c b/arch/x86/ia32/sys_ia32.c index 11ef7b7c9cc8..a43212036257 100644 --- a/arch/x86/ia32/sys_ia32.c +++ b/arch/x86/ia32/sys_ia32.c @@ -75,7 +75,7 @@ static int cp_stat64(struct stat64 __user *ubuf, struct kstat *stat) typeof(ubuf->st_gid) gid = 0; SET_UID(uid, from_kuid_munged(current_user_ns(), stat->uid)); SET_GID(gid, from_kgid_munged(current_user_ns(), stat->gid)); - if (!access_ok(VERIFY_WRITE, ubuf, sizeof(struct stat64)) || + if (!access_ok(ubuf, sizeof(struct stat64)) || __put_user(huge_encode_dev(stat->dev), &ubuf->st_dev) || __put_user(stat->ino, &ubuf->__st_ino) || __put_user(stat->ino, &ubuf->st_ino) || diff --git a/arch/x86/include/asm/checksum_32.h b/arch/x86/include/asm/checksum_32.h index 7a659c74cd03..f57b94e02c57 100644 --- a/arch/x86/include/asm/checksum_32.h +++ b/arch/x86/include/asm/checksum_32.h @@ -182,7 +182,7 @@ static inline __wsum csum_and_copy_to_user(const void *src, __wsum ret; might_sleep(); - if (access_ok(VERIFY_WRITE, dst, len)) { + if (access_ok(dst, len)) { stac(); ret = csum_partial_copy_generic(src, (__force void *)dst, len, sum, NULL, err_ptr); diff --git a/arch/x86/include/asm/pgtable_32.h b/arch/x86/include/asm/pgtable_32.h index b3ec519e3982..4fe9e7fc74d3 100644 --- a/arch/x86/include/asm/pgtable_32.h +++ b/arch/x86/include/asm/pgtable_32.h @@ -37,7 +37,7 @@ void sync_initial_page_table(void); /* * Define this if things work differently on an i386 and an i486: * it will (on an i486) warn about kernel memory accesses that are - * done without a 'access_ok(VERIFY_WRITE,..)' + * done without a 'access_ok( ..)' */ #undef TEST_ACCESS_OK diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h index b5e58cc0c5e7..3920f456db79 100644 --- a/arch/x86/include/asm/uaccess.h +++ b/arch/x86/include/asm/uaccess.h @@ -77,9 +77,6 @@ static inline bool __chk_range_not_ok(unsigned long addr, unsigned long size, un /** * access_ok: - Checks if a user space pointer is valid - * @type: Type of access: %VERIFY_READ or %VERIFY_WRITE. Note that - * %VERIFY_WRITE is a superset of %VERIFY_READ - if it is safe - * to write to a block, it is always safe to read from it. * @addr: User space pointer to start of block to check * @size: Size of block to check * @@ -95,7 +92,7 @@ static inline bool __chk_range_not_ok(unsigned long addr, unsigned long size, un * checks that the pointer is in the user space range - after calling * this function, memory access functions may still return -EFAULT. */ -#define access_ok(type, addr, size) \ +#define access_ok(addr, size) \ ({ \ WARN_ON_IN_IRQ(); \ likely(!__range_not_ok(addr, size, user_addr_max())); \ @@ -670,7 +667,7 @@ extern void __cmpxchg_wrong_size(void) #define user_atomic_cmpxchg_inatomic(uval, ptr, old, new) \ ({ \ - access_ok(VERIFY_WRITE, (ptr), sizeof(*(ptr))) ? \ + access_ok((ptr), sizeof(*(ptr))) ? \ __user_atomic_cmpxchg_inatomic((uval), (ptr), \ (old), (new), sizeof(*(ptr))) : \ -EFAULT; \ diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c index d99a8ee9e185..f6a1d299627c 100644 --- a/arch/x86/kernel/fpu/signal.c +++ b/arch/x86/kernel/fpu/signal.c @@ -164,7 +164,7 @@ int copy_fpstate_to_sigframe(void __user *buf, void __user *buf_fx, int size) ia32_fxstate &= (IS_ENABLED(CONFIG_X86_32) || IS_ENABLED(CONFIG_IA32_EMULATION)); - if (!access_ok(VERIFY_WRITE, buf, size)) + if (!access_ok(buf, size)) return -EACCES; if (!static_cpu_has(X86_FEATURE_FPU)) @@ -281,7 +281,7 @@ static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size) return 0; } - if (!access_ok(VERIFY_READ, buf, size)) + if (!access_ok(buf, size)) return -EACCES; fpu__initialize(fpu); diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c index 92a3b312a53c..08dfd4c1a4f9 100644 --- a/arch/x86/kernel/signal.c +++ b/arch/x86/kernel/signal.c @@ -322,7 +322,7 @@ __setup_frame(int sig, struct ksignal *ksig, sigset_t *set, frame = get_sigframe(&ksig->ka, regs, sizeof(*frame), &fpstate); - if (!access_ok(VERIFY_WRITE, frame, sizeof(*frame))) + if (!access_ok(frame, sizeof(*frame))) return -EFAULT; if (__put_user(sig, &frame->sig)) @@ -385,7 +385,7 @@ static int __setup_rt_frame(int sig, struct ksignal *ksig, frame = get_sigframe(&ksig->ka, regs, sizeof(*frame), &fpstate); - if (!access_ok(VERIFY_WRITE, frame, sizeof(*frame))) + if (!access_ok(frame, sizeof(*frame))) return -EFAULT; put_user_try { @@ -465,7 +465,7 @@ static int __setup_rt_frame(int sig, struct ksignal *ksig, frame = get_sigframe(&ksig->ka, regs, sizeof(struct rt_sigframe), &fp); - if (!access_ok(VERIFY_WRITE, frame, sizeof(*frame))) + if (!access_ok(frame, sizeof(*frame))) return -EFAULT; if (ksig->ka.sa.sa_flags & SA_SIGINFO) { @@ -547,7 +547,7 @@ static int x32_setup_rt_frame(struct ksignal *ksig, frame = get_sigframe(&ksig->ka, regs, sizeof(*frame), &fpstate); - if (!access_ok(VERIFY_WRITE, frame, sizeof(*frame))) + if (!access_ok(frame, sizeof(*frame))) return -EFAULT; if (ksig->ka.sa.sa_flags & SA_SIGINFO) { @@ -610,7 +610,7 @@ SYSCALL_DEFINE0(sigreturn) frame = (struct sigframe __user *)(regs->sp - 8); - if (!access_ok(VERIFY_READ, frame, sizeof(*frame))) + if (!access_ok(frame, sizeof(*frame))) goto badframe; if (__get_user(set.sig[0], &frame->sc.oldmask) || (_NSIG_WORDS > 1 && __copy_from_user(&set.sig[1], &frame->extramask, @@ -642,7 +642,7 @@ SYSCALL_DEFINE0(rt_sigreturn) unsigned long uc_flags; frame = (struct rt_sigframe __user *)(regs->sp - sizeof(long)); - if (!access_ok(VERIFY_READ, frame, sizeof(*frame))) + if (!access_ok(frame, sizeof(*frame))) goto badframe; if (__copy_from_user(&set, &frame->uc.uc_sigmask, sizeof(set))) goto badframe; @@ -871,7 +871,7 @@ asmlinkage long sys32_x32_rt_sigreturn(void) frame = (struct rt_sigframe_x32 __user *)(regs->sp - 8); - if (!access_ok(VERIFY_READ, frame, sizeof(*frame))) + if (!access_ok(frame, sizeof(*frame))) goto badframe; if (__copy_from_user(&set, &frame->uc.uc_sigmask, sizeof(set))) goto badframe; diff --git a/arch/x86/kernel/stacktrace.c b/arch/x86/kernel/stacktrace.c index 7627455047c2..5c2d71a1dc06 100644 --- a/arch/x86/kernel/stacktrace.c +++ b/arch/x86/kernel/stacktrace.c @@ -177,7 +177,7 @@ copy_stack_frame(const void __user *fp, struct stack_frame_user *frame) { int ret; - if (!access_ok(VERIFY_READ, fp, sizeof(*frame))) + if (!access_ok(fp, sizeof(*frame))) return 0; ret = 1; diff --git a/arch/x86/kernel/vm86_32.c b/arch/x86/kernel/vm86_32.c index c2fd39752da8..a092b6b40c6b 100644 --- a/arch/x86/kernel/vm86_32.c +++ b/arch/x86/kernel/vm86_32.c @@ -114,7 +114,7 @@ void save_v86_state(struct kernel_vm86_regs *regs, int retval) set_flags(regs->pt.flags, VEFLAGS, X86_EFLAGS_VIF | vm86->veflags_mask); user = vm86->user_vm86; - if (!access_ok(VERIFY_WRITE, user, vm86->vm86plus.is_vm86pus ? + if (!access_ok(user, vm86->vm86plus.is_vm86pus ? sizeof(struct vm86plus_struct) : sizeof(struct vm86_struct))) { pr_alert("could not access userspace vm86 info\n"); @@ -278,7 +278,7 @@ static long do_sys_vm86(struct vm86plus_struct __user *user_vm86, bool plus) if (vm86->saved_sp0) return -EPERM; - if (!access_ok(VERIFY_READ, user_vm86, plus ? + if (!access_ok(user_vm86, plus ? sizeof(struct vm86_struct) : sizeof(struct vm86plus_struct))) return -EFAULT; diff --git a/arch/x86/lib/csum-wrappers_64.c b/arch/x86/lib/csum-wrappers_64.c index 8bd53589ecfb..a6a2b7dccbff 100644 --- a/arch/x86/lib/csum-wrappers_64.c +++ b/arch/x86/lib/csum-wrappers_64.c @@ -27,7 +27,7 @@ csum_partial_copy_from_user(const void __user *src, void *dst, might_sleep(); *errp = 0; - if (!likely(access_ok(VERIFY_READ, src, len))) + if (!likely(access_ok(src, len))) goto out_err; /* @@ -89,7 +89,7 @@ csum_partial_copy_to_user(const void *src, void __user *dst, might_sleep(); - if (unlikely(!access_ok(VERIFY_WRITE, dst, len))) { + if (unlikely(!access_ok(dst, len))) { *errp = -EFAULT; return 0; } diff --git a/arch/x86/lib/usercopy_32.c b/arch/x86/lib/usercopy_32.c index 71fb58d44d58..bfd94e7812fc 100644 --- a/arch/x86/lib/usercopy_32.c +++ b/arch/x86/lib/usercopy_32.c @@ -67,7 +67,7 @@ unsigned long clear_user(void __user *to, unsigned long n) { might_fault(); - if (access_ok(VERIFY_WRITE, to, n)) + if (access_ok(to, n)) __do_clear_user(to, n); return n; } diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c index 1bd837cdc4b1..ee42bb0cbeb3 100644 --- a/arch/x86/lib/usercopy_64.c +++ b/arch/x86/lib/usercopy_64.c @@ -48,7 +48,7 @@ EXPORT_SYMBOL(__clear_user); unsigned long clear_user(void __user *to, unsigned long n) { - if (access_ok(VERIFY_WRITE, to, n)) + if (access_ok(to, n)) return __clear_user(to, n); return n; } diff --git a/arch/x86/math-emu/fpu_system.h b/arch/x86/math-emu/fpu_system.h index c8b1b31ed7c4..f98a0c956764 100644 --- a/arch/x86/math-emu/fpu_system.h +++ b/arch/x86/math-emu/fpu_system.h @@ -104,7 +104,7 @@ static inline bool seg_writable(struct desc_struct *d) #define instruction_address (*(struct address *)&I387->soft.fip) #define operand_address (*(struct address *)&I387->soft.foo) -#define FPU_access_ok(x,y,z) if ( !access_ok(x,y,z) ) \ +#define FPU_access_ok(y,z) if ( !access_ok(y,z) ) \ math_abort(FPU_info,SIGSEGV) #define FPU_abort math_abort(FPU_info, SIGSEGV) @@ -119,7 +119,7 @@ static inline bool seg_writable(struct desc_struct *d) /* A simpler test than access_ok() can probably be done for FPU_code_access_ok() because the only possible error is to step past the upper boundary of a legal code area. */ -#define FPU_code_access_ok(z) FPU_access_ok(VERIFY_READ,(void __user *)FPU_EIP,z) +#define FPU_code_access_ok(z) FPU_access_ok((void __user *)FPU_EIP,z) #endif #define FPU_get_user(x,y) get_user((x),(y)) diff --git a/arch/x86/math-emu/load_store.c b/arch/x86/math-emu/load_store.c index f821a9cd7753..f15263e158e8 100644 --- a/arch/x86/math-emu/load_store.c +++ b/arch/x86/math-emu/load_store.c @@ -251,7 +251,7 @@ int FPU_load_store(u_char type, fpu_addr_modes addr_modes, break; case 024: /* fldcw */ RE_ENTRANT_CHECK_OFF; - FPU_access_ok(VERIFY_READ, data_address, 2); + FPU_access_ok(data_address, 2); FPU_get_user(control_word, (unsigned short __user *)data_address); RE_ENTRANT_CHECK_ON; @@ -291,7 +291,7 @@ int FPU_load_store(u_char type, fpu_addr_modes addr_modes, break; case 034: /* fstcw m16int */ RE_ENTRANT_CHECK_OFF; - FPU_access_ok(VERIFY_WRITE, data_address, 2); + FPU_access_ok(data_address, 2); FPU_put_user(control_word, (unsigned short __user *)data_address); RE_ENTRANT_CHECK_ON; @@ -305,7 +305,7 @@ int FPU_load_store(u_char type, fpu_addr_modes addr_modes, break; case 036: /* fstsw m2byte */ RE_ENTRANT_CHECK_OFF; - FPU_access_ok(VERIFY_WRITE, data_address, 2); + FPU_access_ok(data_address, 2); FPU_put_user(status_word(), (unsigned short __user *)data_address); RE_ENTRANT_CHECK_ON; diff --git a/arch/x86/math-emu/reg_ld_str.c b/arch/x86/math-emu/reg_ld_str.c index d40ff45497b9..f3779743d15e 100644 --- a/arch/x86/math-emu/reg_ld_str.c +++ b/arch/x86/math-emu/reg_ld_str.c @@ -84,7 +84,7 @@ int FPU_load_extended(long double __user *s, int stnr) FPU_REG *sti_ptr = &st(stnr); RE_ENTRANT_CHECK_OFF; - FPU_access_ok(VERIFY_READ, s, 10); + FPU_access_ok(s, 10); __copy_from_user(sti_ptr, s, 10); RE_ENTRANT_CHECK_ON; @@ -98,7 +98,7 @@ int FPU_load_double(double __user *dfloat, FPU_REG *loaded_data) unsigned m64, l64; RE_ENTRANT_CHECK_OFF; - FPU_access_ok(VERIFY_READ, dfloat, 8); + FPU_access_ok(dfloat, 8); FPU_get_user(m64, 1 + (unsigned long __user *)dfloat); FPU_get_user(l64, (unsigned long __user *)dfloat); RE_ENTRANT_CHECK_ON; @@ -159,7 +159,7 @@ int FPU_load_single(float __user *single, FPU_REG *loaded_data) int exp, tag, negative; RE_ENTRANT_CHECK_OFF; - FPU_access_ok(VERIFY_READ, single, 4); + FPU_access_ok(single, 4); FPU_get_user(m32, (unsigned long __user *)single); RE_ENTRANT_CHECK_ON; @@ -214,7 +214,7 @@ int FPU_load_int64(long long __user *_s) FPU_REG *st0_ptr = &st(0); RE_ENTRANT_CHECK_OFF; - FPU_access_ok(VERIFY_READ, _s, 8); + FPU_access_ok(_s, 8); if (copy_from_user(&s, _s, 8)) FPU_abort; RE_ENTRANT_CHECK_ON; @@ -243,7 +243,7 @@ int FPU_load_int32(long __user *_s, FPU_REG *loaded_data) int negative; RE_ENTRANT_CHECK_OFF; - FPU_access_ok(VERIFY_READ, _s, 4); + FPU_access_ok(_s, 4); FPU_get_user(s, _s); RE_ENTRANT_CHECK_ON; @@ -271,7 +271,7 @@ int FPU_load_int16(short __user *_s, FPU_REG *loaded_data) int s, negative; RE_ENTRANT_CHECK_OFF; - FPU_access_ok(VERIFY_READ, _s, 2); + FPU_access_ok(_s, 2); /* Cast as short to get the sign extended. */ FPU_get_user(s, _s); RE_ENTRANT_CHECK_ON; @@ -304,7 +304,7 @@ int FPU_load_bcd(u_char __user *s) int sign; RE_ENTRANT_CHECK_OFF; - FPU_access_ok(VERIFY_READ, s, 10); + FPU_access_ok(s, 10); RE_ENTRANT_CHECK_ON; for (pos = 8; pos >= 0; pos--) { l *= 10; @@ -345,7 +345,7 @@ int FPU_store_extended(FPU_REG *st0_ptr, u_char st0_tag, if (st0_tag != TAG_Empty) { RE_ENTRANT_CHECK_OFF; - FPU_access_ok(VERIFY_WRITE, d, 10); + FPU_access_ok(d, 10); FPU_put_user(st0_ptr->sigl, (unsigned long __user *)d); FPU_put_user(st0_ptr->sigh, @@ -364,7 +364,7 @@ int FPU_store_extended(FPU_REG *st0_ptr, u_char st0_tag, /* The masked response */ /* Put out the QNaN indefinite */ RE_ENTRANT_CHECK_OFF; - FPU_access_ok(VERIFY_WRITE, d, 10); + FPU_access_ok(d, 10); FPU_put_user(0, (unsigned long __user *)d); FPU_put_user(0xc0000000, 1 + (unsigned long __user *)d); FPU_put_user(0xffff, 4 + (short __user *)d); @@ -539,7 +539,7 @@ denormal_arg: /* The masked response */ /* Put out the QNaN indefinite */ RE_ENTRANT_CHECK_OFF; - FPU_access_ok(VERIFY_WRITE, dfloat, 8); + FPU_access_ok(dfloat, 8); FPU_put_user(0, (unsigned long __user *)dfloat); FPU_put_user(0xfff80000, 1 + (unsigned long __user *)dfloat); @@ -552,7 +552,7 @@ denormal_arg: l[1] |= 0x80000000; RE_ENTRANT_CHECK_OFF; - FPU_access_ok(VERIFY_WRITE, dfloat, 8); + FPU_access_ok(dfloat, 8); FPU_put_user(l[0], (unsigned long __user *)dfloat); FPU_put_user(l[1], 1 + (unsigned long __user *)dfloat); RE_ENTRANT_CHECK_ON; @@ -724,7 +724,7 @@ int FPU_store_single(FPU_REG *st0_ptr, u_char st0_tag, float __user *single) /* The masked response */ /* Put out the QNaN indefinite */ RE_ENTRANT_CHECK_OFF; - FPU_access_ok(VERIFY_WRITE, single, 4); + FPU_access_ok(single, 4); FPU_put_user(0xffc00000, (unsigned long __user *)single); RE_ENTRANT_CHECK_ON; @@ -742,7 +742,7 @@ int FPU_store_single(FPU_REG *st0_ptr, u_char st0_tag, float __user *single) templ |= 0x80000000; RE_ENTRANT_CHECK_OFF; - FPU_access_ok(VERIFY_WRITE, single, 4); + FPU_access_ok(single, 4); FPU_put_user(templ, (unsigned long __user *)single); RE_ENTRANT_CHECK_ON; @@ -791,7 +791,7 @@ int FPU_store_int64(FPU_REG *st0_ptr, u_char st0_tag, long long __user *d) } RE_ENTRANT_CHECK_OFF; - FPU_access_ok(VERIFY_WRITE, d, 8); + FPU_access_ok(d, 8); if (copy_to_user(d, &tll, 8)) FPU_abort; RE_ENTRANT_CHECK_ON; @@ -838,7 +838,7 @@ int FPU_store_int32(FPU_REG *st0_ptr, u_char st0_tag, long __user *d) } RE_ENTRANT_CHECK_OFF; - FPU_access_ok(VERIFY_WRITE, d, 4); + FPU_access_ok(d, 4); FPU_put_user(t.sigl, (unsigned long __user *)d); RE_ENTRANT_CHECK_ON; @@ -884,7 +884,7 @@ int FPU_store_int16(FPU_REG *st0_ptr, u_char st0_tag, short __user *d) } RE_ENTRANT_CHECK_OFF; - FPU_access_ok(VERIFY_WRITE, d, 2); + FPU_access_ok(d, 2); FPU_put_user((short)t.sigl, d); RE_ENTRANT_CHECK_ON; @@ -925,7 +925,7 @@ int FPU_store_bcd(FPU_REG *st0_ptr, u_char st0_tag, u_char __user *d) if (control_word & CW_Invalid) { /* Produce the QNaN "indefinite" */ RE_ENTRANT_CHECK_OFF; - FPU_access_ok(VERIFY_WRITE, d, 10); + FPU_access_ok(d, 10); for (i = 0; i < 7; i++) FPU_put_user(0, d + i); /* These bytes "undefined" */ FPU_put_user(0xc0, d + 7); /* This byte "undefined" */ @@ -941,7 +941,7 @@ int FPU_store_bcd(FPU_REG *st0_ptr, u_char st0_tag, u_char __user *d) } RE_ENTRANT_CHECK_OFF; - FPU_access_ok(VERIFY_WRITE, d, 10); + FPU_access_ok(d, 10); RE_ENTRANT_CHECK_ON; for (i = 0; i < 9; i++) { b = FPU_div_small(&ll, 10); @@ -1034,7 +1034,7 @@ u_char __user *fldenv(fpu_addr_modes addr_modes, u_char __user *s) ((addr_modes.default_mode == PM16) ^ (addr_modes.override.operand_size == OP_SIZE_PREFIX))) { RE_ENTRANT_CHECK_OFF; - FPU_access_ok(VERIFY_READ, s, 0x0e); + FPU_access_ok(s, 0x0e); FPU_get_user(control_word, (unsigned short __user *)s); FPU_get_user(partial_status, (unsigned short __user *)(s + 2)); FPU_get_user(tag_word, (unsigned short __user *)(s + 4)); @@ -1056,7 +1056,7 @@ u_char __user *fldenv(fpu_addr_modes addr_modes, u_char __user *s) } } else { RE_ENTRANT_CHECK_OFF; - FPU_access_ok(VERIFY_READ, s, 0x1c); + FPU_access_ok(s, 0x1c); FPU_get_user(control_word, (unsigned short __user *)s); FPU_get_user(partial_status, (unsigned short __user *)(s + 4)); FPU_get_user(tag_word, (unsigned short __user *)(s + 8)); @@ -1125,7 +1125,7 @@ void frstor(fpu_addr_modes addr_modes, u_char __user *data_address) /* Copy all registers in stack order. */ RE_ENTRANT_CHECK_OFF; - FPU_access_ok(VERIFY_READ, s, 80); + FPU_access_ok(s, 80); __copy_from_user(register_base + offset, s, other); if (offset) __copy_from_user(register_base, s + other, offset); @@ -1146,7 +1146,7 @@ u_char __user *fstenv(fpu_addr_modes addr_modes, u_char __user *d) ((addr_modes.default_mode == PM16) ^ (addr_modes.override.operand_size == OP_SIZE_PREFIX))) { RE_ENTRANT_CHECK_OFF; - FPU_access_ok(VERIFY_WRITE, d, 14); + FPU_access_ok(d, 14); #ifdef PECULIAR_486 FPU_put_user(control_word & ~0xe080, (unsigned long __user *)d); #else @@ -1174,7 +1174,7 @@ u_char __user *fstenv(fpu_addr_modes addr_modes, u_char __user *d) d += 0x0e; } else { RE_ENTRANT_CHECK_OFF; - FPU_access_ok(VERIFY_WRITE, d, 7 * 4); + FPU_access_ok(d, 7 * 4); #ifdef PECULIAR_486 control_word &= ~0xe080; /* An 80486 sets nearly all of the reserved bits to 1. */ @@ -1204,7 +1204,7 @@ void fsave(fpu_addr_modes addr_modes, u_char __user *data_address) d = fstenv(addr_modes, data_address); RE_ENTRANT_CHECK_OFF; - FPU_access_ok(VERIFY_WRITE, d, 80); + FPU_access_ok(d, 80); /* Copy all registers in stack order. */ if (__copy_to_user(d, register_base + offset, other)) diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c index 2385538e8065..de1851d15699 100644 --- a/arch/x86/mm/mpx.c +++ b/arch/x86/mm/mpx.c @@ -495,7 +495,7 @@ static int get_bt_addr(struct mm_struct *mm, unsigned long bd_entry; unsigned long bt_addr; - if (!access_ok(VERIFY_READ, (bd_entry_ptr), sizeof(*bd_entry_ptr))) + if (!access_ok((bd_entry_ptr), sizeof(*bd_entry_ptr))) return -EFAULT; while (1) { diff --git a/arch/x86/um/asm/checksum_32.h b/arch/x86/um/asm/checksum_32.h index 83a75f8a1233..b9ac7c9eb72c 100644 --- a/arch/x86/um/asm/checksum_32.h +++ b/arch/x86/um/asm/checksum_32.h @@ -43,7 +43,7 @@ static __inline__ __wsum csum_and_copy_to_user(const void *src, void __user *dst, int len, __wsum sum, int *err_ptr) { - if (access_ok(VERIFY_WRITE, dst, len)) { + if (access_ok(dst, len)) { if (copy_to_user(dst, src, len)) { *err_ptr = -EFAULT; return (__force __wsum)-1; diff --git a/arch/x86/um/signal.c b/arch/x86/um/signal.c index 727ed442e0a5..8b4a71efe7ee 100644 --- a/arch/x86/um/signal.c +++ b/arch/x86/um/signal.c @@ -367,7 +367,7 @@ int setup_signal_stack_sc(unsigned long stack_top, struct ksignal *ksig, /* This is the same calculation as i386 - ((sp + 4) & 15) == 0 */ stack_top = ((stack_top + 4) & -16UL) - 4; frame = (struct sigframe __user *) stack_top - 1; - if (!access_ok(VERIFY_WRITE, frame, sizeof(*frame))) + if (!access_ok(frame, sizeof(*frame))) return 1; restorer = frame->retcode; @@ -412,7 +412,7 @@ int setup_signal_stack_si(unsigned long stack_top, struct ksignal *ksig, stack_top &= -8UL; frame = (struct rt_sigframe __user *) stack_top - 1; - if (!access_ok(VERIFY_WRITE, frame, sizeof(*frame))) + if (!access_ok(frame, sizeof(*frame))) return 1; restorer = frame->retcode; @@ -497,7 +497,7 @@ int setup_signal_stack_si(unsigned long stack_top, struct ksignal *ksig, /* Subtract 128 for a red zone and 8 for proper alignment */ frame = (struct rt_sigframe __user *) ((unsigned long) frame - 128 - 8); - if (!access_ok(VERIFY_WRITE, frame, sizeof(*frame))) + if (!access_ok(frame, sizeof(*frame))) goto out; if (ksig->ka.sa.sa_flags & SA_SIGINFO) { diff --git a/arch/xtensa/include/asm/checksum.h b/arch/xtensa/include/asm/checksum.h index 3ae74d7e074b..f302ef57973a 100644 --- a/arch/xtensa/include/asm/checksum.h +++ b/arch/xtensa/include/asm/checksum.h @@ -243,7 +243,7 @@ static __inline__ __wsum csum_and_copy_to_user(const void *src, void __user *dst, int len, __wsum sum, int *err_ptr) { - if (access_ok(VERIFY_WRITE, dst, len)) + if (access_ok(dst, len)) return csum_partial_copy_generic(src,dst,len,sum,NULL,err_ptr); if (len) diff --git a/arch/xtensa/include/asm/futex.h b/arch/xtensa/include/asm/futex.h index fd0eef6b8e7c..505d09eff184 100644 --- a/arch/xtensa/include/asm/futex.h +++ b/arch/xtensa/include/asm/futex.h @@ -93,7 +93,7 @@ futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr, { int ret = 0; - if (!access_ok(VERIFY_WRITE, uaddr, sizeof(u32))) + if (!access_ok(uaddr, sizeof(u32))) return -EFAULT; #if !XCHAL_HAVE_S32C1I diff --git a/arch/xtensa/include/asm/uaccess.h b/arch/xtensa/include/asm/uaccess.h index d11ef2939652..4b2480304bc3 100644 --- a/arch/xtensa/include/asm/uaccess.h +++ b/arch/xtensa/include/asm/uaccess.h @@ -42,7 +42,7 @@ #define __user_ok(addr, size) \ (((size) <= TASK_SIZE)&&((addr) <= TASK_SIZE-(size))) #define __access_ok(addr, size) (__kernel_ok || __user_ok((addr), (size))) -#define access_ok(type, addr, size) __access_ok((unsigned long)(addr), (size)) +#define access_ok(addr, size) __access_ok((unsigned long)(addr), (size)) #define user_addr_max() (uaccess_kernel() ? ~0UL : TASK_SIZE) @@ -86,7 +86,7 @@ extern long __put_user_bad(void); ({ \ long __pu_err = -EFAULT; \ __typeof__(*(ptr)) *__pu_addr = (ptr); \ - if (access_ok(VERIFY_WRITE, __pu_addr, size)) \ + if (access_ok(__pu_addr, size)) \ __put_user_size((x), __pu_addr, (size), __pu_err); \ __pu_err; \ }) @@ -183,7 +183,7 @@ __asm__ __volatile__( \ ({ \ long __gu_err = -EFAULT, __gu_val = 0; \ const __typeof__(*(ptr)) *__gu_addr = (ptr); \ - if (access_ok(VERIFY_READ, __gu_addr, size)) \ + if (access_ok(__gu_addr, size)) \ __get_user_size(__gu_val, __gu_addr, (size), __gu_err); \ (x) = (__force __typeof__(*(ptr)))__gu_val; \ __gu_err; \ @@ -269,7 +269,7 @@ __xtensa_clear_user(void *addr, unsigned long size) static inline unsigned long clear_user(void *addr, unsigned long size) { - if (access_ok(VERIFY_WRITE, addr, size)) + if (access_ok(addr, size)) return __xtensa_clear_user(addr, size); return size ? -EFAULT : 0; } @@ -284,7 +284,7 @@ extern long __strncpy_user(char *, const char *, long); static inline long strncpy_from_user(char *dst, const char *src, long count) { - if (access_ok(VERIFY_READ, src, 1)) + if (access_ok(src, 1)) return __strncpy_user(dst, src, count); return -EFAULT; } diff --git a/arch/xtensa/kernel/signal.c b/arch/xtensa/kernel/signal.c index 74e1682876ac..dc22a238ed9c 100644 --- a/arch/xtensa/kernel/signal.c +++ b/arch/xtensa/kernel/signal.c @@ -251,7 +251,7 @@ asmlinkage long xtensa_rt_sigreturn(long a0, long a1, long a2, long a3, frame = (struct rt_sigframe __user *) regs->areg[1]; - if (!access_ok(VERIFY_READ, frame, sizeof(*frame))) + if (!access_ok(frame, sizeof(*frame))) goto badframe; if (__copy_from_user(&set, &frame->uc.uc_sigmask, sizeof(set))) @@ -348,7 +348,7 @@ static int setup_frame(struct ksignal *ksig, sigset_t *set, if (regs->depc > 64) panic ("Double exception sys_sigreturn\n"); - if (!access_ok(VERIFY_WRITE, frame, sizeof(*frame))) { + if (!access_ok(frame, sizeof(*frame))) { return -EFAULT; } diff --git a/arch/xtensa/kernel/stacktrace.c b/arch/xtensa/kernel/stacktrace.c index 0df4080fa20f..174c11f13bba 100644 --- a/arch/xtensa/kernel/stacktrace.c +++ b/arch/xtensa/kernel/stacktrace.c @@ -91,7 +91,7 @@ void xtensa_backtrace_user(struct pt_regs *regs, unsigned int depth, pc = MAKE_PC_FROM_RA(a0, pc); /* Check if the region is OK to access. */ - if (!access_ok(VERIFY_READ, &SPILL_SLOT(a1, 0), 8)) + if (!access_ok(&SPILL_SLOT(a1, 0), 8)) return; /* Copy a1, a0 from user space stack frame. */ if (__get_user(a0, &SPILL_SLOT(a1, 0)) || diff --git a/drivers/acpi/acpi_dbg.c b/drivers/acpi/acpi_dbg.c index f21c99ec46ee..a2dcd62ea32f 100644 --- a/drivers/acpi/acpi_dbg.c +++ b/drivers/acpi/acpi_dbg.c @@ -614,7 +614,7 @@ static ssize_t acpi_aml_read(struct file *file, char __user *buf, if (!count) return 0; - if (!access_ok(VERIFY_WRITE, buf, count)) + if (!access_ok(buf, count)) return -EFAULT; while (count > 0) { @@ -684,7 +684,7 @@ static ssize_t acpi_aml_write(struct file *file, const char __user *buf, if (!count) return 0; - if (!access_ok(VERIFY_READ, buf, count)) + if (!access_ok(buf, count)) return -EFAULT; while (count > 0) { diff --git a/drivers/char/generic_nvram.c b/drivers/char/generic_nvram.c index 14e728fbb8a0..ff5394f47587 100644 --- a/drivers/char/generic_nvram.c +++ b/drivers/char/generic_nvram.c @@ -44,7 +44,7 @@ static ssize_t read_nvram(struct file *file, char __user *buf, unsigned int i; char __user *p = buf; - if (!access_ok(VERIFY_WRITE, buf, count)) + if (!access_ok(buf, count)) return -EFAULT; if (*ppos >= nvram_len) return 0; @@ -62,7 +62,7 @@ static ssize_t write_nvram(struct file *file, const char __user *buf, const char __user *p = buf; char c; - if (!access_ok(VERIFY_READ, buf, count)) + if (!access_ok(buf, count)) return -EFAULT; if (*ppos >= nvram_len) return 0; diff --git a/drivers/char/mem.c b/drivers/char/mem.c index 7b4e4de778e4..b08dc50f9f26 100644 --- a/drivers/char/mem.c +++ b/drivers/char/mem.c @@ -609,7 +609,7 @@ static ssize_t read_port(struct file *file, char __user *buf, unsigned long i = *ppos; char __user *tmp = buf; - if (!access_ok(VERIFY_WRITE, buf, count)) + if (!access_ok(buf, count)) return -EFAULT; while (count-- > 0 && i < 65536) { if (__put_user(inb(i), tmp) < 0) @@ -627,7 +627,7 @@ static ssize_t write_port(struct file *file, const char __user *buf, unsigned long i = *ppos; const char __user *tmp = buf; - if (!access_ok(VERIFY_READ, buf, count)) + if (!access_ok(buf, count)) return -EFAULT; while (count-- > 0 && i < 65536) { char c; diff --git a/drivers/char/nwflash.c b/drivers/char/nwflash.c index a284ae25e69a..76fb434068d4 100644 --- a/drivers/char/nwflash.c +++ b/drivers/char/nwflash.c @@ -167,7 +167,7 @@ static ssize_t flash_write(struct file *file, const char __user *buf, if (count > gbFlashSize - p) count = gbFlashSize - p; - if (!access_ok(VERIFY_READ, buf, count)) + if (!access_ok(buf, count)) return -EFAULT; /* diff --git a/drivers/char/pcmcia/cm4000_cs.c b/drivers/char/pcmcia/cm4000_cs.c index 809507bf8f1c..7a4eb86aedac 100644 --- a/drivers/char/pcmcia/cm4000_cs.c +++ b/drivers/char/pcmcia/cm4000_cs.c @@ -1445,11 +1445,11 @@ static long cmm_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) _IOC_DIR(cmd), _IOC_READ, _IOC_WRITE, size, cmd); if (_IOC_DIR(cmd) & _IOC_READ) { - if (!access_ok(VERIFY_WRITE, argp, size)) + if (!access_ok(argp, size)) goto out; } if (_IOC_DIR(cmd) & _IOC_WRITE) { - if (!access_ok(VERIFY_READ, argp, size)) + if (!access_ok(argp, size)) goto out; } rc = 0; diff --git a/drivers/crypto/ccp/psp-dev.c b/drivers/crypto/ccp/psp-dev.c index d64a78ccc03e..b16be8a11d92 100644 --- a/drivers/crypto/ccp/psp-dev.c +++ b/drivers/crypto/ccp/psp-dev.c @@ -364,7 +364,7 @@ static int sev_ioctl_do_pek_csr(struct sev_issue_cmd *argp) goto cmd; /* allocate a physically contiguous buffer to store the CSR blob */ - if (!access_ok(VERIFY_WRITE, input.address, input.length) || + if (!access_ok(input.address, input.length) || input.length > SEV_FW_BLOB_MAX_SIZE) { ret = -EFAULT; goto e_free; @@ -644,14 +644,14 @@ static int sev_ioctl_do_pdh_export(struct sev_issue_cmd *argp) /* Allocate a physically contiguous buffer to store the PDH blob. */ if ((input.pdh_cert_len > SEV_FW_BLOB_MAX_SIZE) || - !access_ok(VERIFY_WRITE, input.pdh_cert_address, input.pdh_cert_len)) { + !access_ok(input.pdh_cert_address, input.pdh_cert_len)) { ret = -EFAULT; goto e_free; } /* Allocate a physically contiguous buffer to store the cert chain blob. */ if ((input.cert_chain_len > SEV_FW_BLOB_MAX_SIZE) || - !access_ok(VERIFY_WRITE, input.cert_chain_address, input.cert_chain_len)) { + !access_ok(input.cert_chain_address, input.cert_chain_len)) { ret = -EFAULT; goto e_free; } diff --git a/drivers/firewire/core-cdev.c b/drivers/firewire/core-cdev.c index d8e185582642..16a7045736a9 100644 --- a/drivers/firewire/core-cdev.c +++ b/drivers/firewire/core-cdev.c @@ -1094,7 +1094,7 @@ static int ioctl_queue_iso(struct client *client, union ioctl_arg *arg) return -EINVAL; p = (struct fw_cdev_iso_packet __user *)u64_to_uptr(a->packets); - if (!access_ok(VERIFY_READ, p, a->size)) + if (!access_ok(p, a->size)) return -EFAULT; end = (void __user *)p + a->size; diff --git a/drivers/firmware/efi/test/efi_test.c b/drivers/firmware/efi/test/efi_test.c index 769640940c9f..51ecf7d6da48 100644 --- a/drivers/firmware/efi/test/efi_test.c +++ b/drivers/firmware/efi/test/efi_test.c @@ -68,7 +68,7 @@ copy_ucs2_from_user_len(efi_char16_t **dst, efi_char16_t __user *src, return 0; } - if (!access_ok(VERIFY_READ, src, 1)) + if (!access_ok(src, 1)) return -EFAULT; buf = memdup_user(src, len); @@ -89,7 +89,7 @@ copy_ucs2_from_user_len(efi_char16_t **dst, efi_char16_t __user *src, static inline int get_ucs2_strsize_from_user(efi_char16_t __user *src, size_t *len) { - if (!access_ok(VERIFY_READ, src, 1)) + if (!access_ok(src, 1)) return -EFAULT; *len = user_ucs2_strsize(src); @@ -116,7 +116,7 @@ copy_ucs2_from_user(efi_char16_t **dst, efi_char16_t __user *src) { size_t len; - if (!access_ok(VERIFY_READ, src, 1)) + if (!access_ok(src, 1)) return -EFAULT; len = user_ucs2_strsize(src); @@ -140,7 +140,7 @@ copy_ucs2_to_user_len(efi_char16_t __user *dst, efi_char16_t *src, size_t len) if (!src) return 0; - if (!access_ok(VERIFY_WRITE, dst, 1)) + if (!access_ok(dst, 1)) return -EFAULT; return copy_to_user(dst, src, len); diff --git a/drivers/fpga/dfl-afu-dma-region.c b/drivers/fpga/dfl-afu-dma-region.c index 025aba3ea76c..e18a786fc943 100644 --- a/drivers/fpga/dfl-afu-dma-region.c +++ b/drivers/fpga/dfl-afu-dma-region.c @@ -369,7 +369,7 @@ int afu_dma_map_region(struct dfl_feature_platform_data *pdata, if (user_addr + length < user_addr) return -EINVAL; - if (!access_ok(VERIFY_WRITE, (void __user *)(unsigned long)user_addr, + if (!access_ok((void __user *)(unsigned long)user_addr, length)) return -EINVAL; diff --git a/drivers/fpga/dfl-fme-pr.c b/drivers/fpga/dfl-fme-pr.c index fe5a5578fbf7..d9ca9554844a 100644 --- a/drivers/fpga/dfl-fme-pr.c +++ b/drivers/fpga/dfl-fme-pr.c @@ -99,8 +99,7 @@ static int fme_pr(struct platform_device *pdev, unsigned long arg) return -EINVAL; } - if (!access_ok(VERIFY_READ, - (void __user *)(unsigned long)port_pr.buffer_address, + if (!access_ok((void __user *)(unsigned long)port_pr.buffer_address, port_pr.buffer_size)) return -EFAULT; diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c index 3623538baf6f..be68752c3469 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c @@ -158,8 +158,7 @@ static int set_queue_properties_from_user(struct queue_properties *q_properties, } if ((args->ring_base_address) && - (!access_ok(VERIFY_WRITE, - (const void __user *) args->ring_base_address, + (!access_ok((const void __user *) args->ring_base_address, sizeof(uint64_t)))) { pr_err("Can't access ring base address\n"); return -EFAULT; @@ -170,31 +169,27 @@ static int set_queue_properties_from_user(struct queue_properties *q_properties, return -EINVAL; } - if (!access_ok(VERIFY_WRITE, - (const void __user *) args->read_pointer_address, + if (!access_ok((const void __user *) args->read_pointer_address, sizeof(uint32_t))) { pr_err("Can't access read pointer\n"); return -EFAULT; } - if (!access_ok(VERIFY_WRITE, - (const void __user *) args->write_pointer_address, + if (!access_ok((const void __user *) args->write_pointer_address, sizeof(uint32_t))) { pr_err("Can't access write pointer\n"); return -EFAULT; } if (args->eop_buffer_address && - !access_ok(VERIFY_WRITE, - (const void __user *) args->eop_buffer_address, + !access_ok((const void __user *) args->eop_buffer_address, sizeof(uint32_t))) { pr_debug("Can't access eop buffer"); return -EFAULT; } if (args->ctx_save_restore_address && - !access_ok(VERIFY_WRITE, - (const void __user *) args->ctx_save_restore_address, + !access_ok((const void __user *) args->ctx_save_restore_address, sizeof(uint32_t))) { pr_debug("Can't access ctx save restore buffer"); return -EFAULT; @@ -365,8 +360,7 @@ static int kfd_ioctl_update_queue(struct file *filp, struct kfd_process *p, } if ((args->ring_base_address) && - (!access_ok(VERIFY_WRITE, - (const void __user *) args->ring_base_address, + (!access_ok((const void __user *) args->ring_base_address, sizeof(uint64_t)))) { pr_err("Can't access ring base address\n"); return -EFAULT; diff --git a/drivers/gpu/drm/armada/armada_gem.c b/drivers/gpu/drm/armada/armada_gem.c index 892c1d9304bb..642d0e70d0f8 100644 --- a/drivers/gpu/drm/armada/armada_gem.c +++ b/drivers/gpu/drm/armada/armada_gem.c @@ -334,7 +334,7 @@ int armada_gem_pwrite_ioctl(struct drm_device *dev, void *data, ptr = (char __user *)(uintptr_t)args->ptr; - if (!access_ok(VERIFY_READ, ptr, args->size)) + if (!access_ok(ptr, args->size)) return -EFAULT; ret = fault_in_pages_readable(ptr, args->size); diff --git a/drivers/gpu/drm/drm_file.c b/drivers/gpu/drm/drm_file.c index ffa8dc35515f..46f48f245eb5 100644 --- a/drivers/gpu/drm/drm_file.c +++ b/drivers/gpu/drm/drm_file.c @@ -525,7 +525,7 @@ ssize_t drm_read(struct file *filp, char __user *buffer, struct drm_device *dev = file_priv->minor->dev; ssize_t ret; - if (!access_ok(VERIFY_WRITE, buffer, count)) + if (!access_ok(buffer, count)) return -EFAULT; ret = mutex_lock_interruptible(&file_priv->event_read_lock); diff --git a/drivers/gpu/drm/etnaviv/etnaviv_drv.c b/drivers/gpu/drm/etnaviv/etnaviv_drv.c index 96efc84396bf..18c27f795cf6 100644 --- a/drivers/gpu/drm/etnaviv/etnaviv_drv.c +++ b/drivers/gpu/drm/etnaviv/etnaviv_drv.c @@ -339,7 +339,6 @@ static int etnaviv_ioctl_gem_userptr(struct drm_device *dev, void *data, struct drm_file *file) { struct drm_etnaviv_gem_userptr *args = data; - int access; if (args->flags & ~(ETNA_USERPTR_READ|ETNA_USERPTR_WRITE) || args->flags == 0) @@ -351,12 +350,7 @@ static int etnaviv_ioctl_gem_userptr(struct drm_device *dev, void *data, args->user_ptr & ~PAGE_MASK) return -EINVAL; - if (args->flags & ETNA_USERPTR_WRITE) - access = VERIFY_WRITE; - else - access = VERIFY_READ; - - if (!access_ok(access, (void __user *)(unsigned long)args->user_ptr, + if (!access_ok((void __user *)(unsigned long)args->user_ptr, args->user_size)) return -EFAULT; diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c index a9de07bb72c8..216f52b744a6 100644 --- a/drivers/gpu/drm/i915/i915_gem.c +++ b/drivers/gpu/drm/i915/i915_gem.c @@ -1282,8 +1282,7 @@ i915_gem_pread_ioctl(struct drm_device *dev, void *data, if (args->size == 0) return 0; - if (!access_ok(VERIFY_WRITE, - u64_to_user_ptr(args->data_ptr), + if (!access_ok(u64_to_user_ptr(args->data_ptr), args->size)) return -EFAULT; @@ -1609,9 +1608,7 @@ i915_gem_pwrite_ioctl(struct drm_device *dev, void *data, if (args->size == 0) return 0; - if (!access_ok(VERIFY_READ, - u64_to_user_ptr(args->data_ptr), - args->size)) + if (!access_ok(u64_to_user_ptr(args->data_ptr), args->size)) return -EFAULT; obj = i915_gem_object_lookup(file, args->handle); diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c index 8ff6b581cf1c..fee66ccebed6 100644 --- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c +++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c @@ -1447,7 +1447,7 @@ static int eb_relocate_vma(struct i915_execbuffer *eb, struct i915_vma *vma) * to read. However, if the array is not writable the user loses * the updated relocation values. */ - if (unlikely(!access_ok(VERIFY_READ, urelocs, remain*sizeof(*urelocs)))) + if (unlikely(!access_ok(urelocs, remain*sizeof(*urelocs)))) return -EFAULT; do { @@ -1554,7 +1554,7 @@ static int check_relocations(const struct drm_i915_gem_exec_object2 *entry) addr = u64_to_user_ptr(entry->relocs_ptr); size *= sizeof(struct drm_i915_gem_relocation_entry); - if (!access_ok(VERIFY_READ, addr, size)) + if (!access_ok(addr, size)) return -EFAULT; end = addr + size; @@ -2090,7 +2090,7 @@ get_fence_array(struct drm_i915_gem_execbuffer2 *args, return ERR_PTR(-EINVAL); user = u64_to_user_ptr(args->cliprects_ptr); - if (!access_ok(VERIFY_READ, user, nfences * sizeof(*user))) + if (!access_ok(user, nfences * sizeof(*user))) return ERR_PTR(-EFAULT); fences = kvmalloc_array(nfences, sizeof(*fences), diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c index 3df77020aada..9558582c105e 100644 --- a/drivers/gpu/drm/i915/i915_gem_userptr.c +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c @@ -789,8 +789,7 @@ i915_gem_userptr_ioctl(struct drm_device *dev, if (offset_in_page(args->user_ptr | args->user_size)) return -EINVAL; - if (!access_ok(args->flags & I915_USERPTR_READ_ONLY ? VERIFY_READ : VERIFY_WRITE, - (char __user *)(unsigned long)args->user_ptr, args->user_size)) + if (!access_ok((char __user *)(unsigned long)args->user_ptr, args->user_size)) return -EFAULT; if (args->flags & I915_USERPTR_READ_ONLY) { diff --git a/drivers/gpu/drm/i915/i915_ioc32.c b/drivers/gpu/drm/i915/i915_ioc32.c index 0e5c580d117c..e869daf9c8a9 100644 --- a/drivers/gpu/drm/i915/i915_ioc32.c +++ b/drivers/gpu/drm/i915/i915_ioc32.c @@ -52,7 +52,7 @@ static int compat_i915_getparam(struct file *file, unsigned int cmd, return -EFAULT; request = compat_alloc_user_space(sizeof(*request)); - if (!access_ok(VERIFY_WRITE, request, sizeof(*request)) || + if (!access_ok(request, sizeof(*request)) || __put_user(req32.param, &request->param) || __put_user((void __user *)(unsigned long)req32.value, &request->value)) diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c index 4529edfdcfc8..2b2eb57ca71f 100644 --- a/drivers/gpu/drm/i915/i915_perf.c +++ b/drivers/gpu/drm/i915/i915_perf.c @@ -3052,7 +3052,7 @@ static struct i915_oa_reg *alloc_oa_regs(struct drm_i915_private *dev_priv, if (!n_regs) return NULL; - if (!access_ok(VERIFY_READ, regs, n_regs * sizeof(u32) * 2)) + if (!access_ok(regs, n_regs * sizeof(u32) * 2)) return ERR_PTR(-EFAULT); /* No is_valid function means we're not allowing any register to be programmed. */ diff --git a/drivers/gpu/drm/i915/i915_query.c b/drivers/gpu/drm/i915/i915_query.c index 6fc4b8eeab42..fe56465cdfd6 100644 --- a/drivers/gpu/drm/i915/i915_query.c +++ b/drivers/gpu/drm/i915/i915_query.c @@ -46,7 +46,7 @@ static int query_topology_info(struct drm_i915_private *dev_priv, if (topo.flags != 0) return -EINVAL; - if (!access_ok(VERIFY_WRITE, u64_to_user_ptr(query_item->data_ptr), + if (!access_ok(u64_to_user_ptr(query_item->data_ptr), total_length)) return -EFAULT; diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c index a28465d90529..12b983fc0b56 100644 --- a/drivers/gpu/drm/msm/msm_gem_submit.c +++ b/drivers/gpu/drm/msm/msm_gem_submit.c @@ -77,7 +77,7 @@ void msm_gem_submit_free(struct msm_gem_submit *submit) static inline unsigned long __must_check copy_from_user_inatomic(void *to, const void __user *from, unsigned long n) { - if (access_ok(VERIFY_READ, from, n)) + if (access_ok(from, n)) return __copy_from_user_inatomic(to, from, n); return -EFAULT; } diff --git a/drivers/gpu/drm/qxl/qxl_ioctl.c b/drivers/gpu/drm/qxl/qxl_ioctl.c index 6e828158bcb0..d410e2925162 100644 --- a/drivers/gpu/drm/qxl/qxl_ioctl.c +++ b/drivers/gpu/drm/qxl/qxl_ioctl.c @@ -163,8 +163,7 @@ static int qxl_process_single_command(struct qxl_device *qdev, if (cmd->command_size > PAGE_SIZE - sizeof(union qxl_release_info)) return -EINVAL; - if (!access_ok(VERIFY_READ, - u64_to_user_ptr(cmd->command), + if (!access_ok(u64_to_user_ptr(cmd->command), cmd->command_size)) return -EFAULT; diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c index 9f9172eb1512..fb0007aa0c27 100644 --- a/drivers/infiniband/core/uverbs_main.c +++ b/drivers/infiniband/core/uverbs_main.c @@ -611,8 +611,7 @@ static ssize_t verify_hdr(struct ib_uverbs_cmd_hdr *hdr, if (hdr->out_words * 8 < method_elm->resp_size) return -ENOSPC; - if (!access_ok(VERIFY_WRITE, - u64_to_user_ptr(ex_hdr->response), + if (!access_ok(u64_to_user_ptr(ex_hdr->response), (hdr->out_words + ex_hdr->provider_out_words) * 8)) return -EFAULT; } else { diff --git a/drivers/infiniband/hw/hfi1/user_exp_rcv.c b/drivers/infiniband/hw/hfi1/user_exp_rcv.c index dbe7d14a5c76..0cd71ce7cc71 100644 --- a/drivers/infiniband/hw/hfi1/user_exp_rcv.c +++ b/drivers/infiniband/hw/hfi1/user_exp_rcv.c @@ -232,7 +232,7 @@ static int pin_rcv_pages(struct hfi1_filedata *fd, struct tid_user_buf *tidbuf) } /* Verify that access is OK for the user buffer */ - if (!access_ok(VERIFY_WRITE, (void __user *)vaddr, + if (!access_ok((void __user *)vaddr, npages * PAGE_SIZE)) { dd_dev_err(dd, "Fail vaddr %p, %u pages, !access_ok\n", (void *)vaddr, npages); diff --git a/drivers/infiniband/hw/qib/qib_file_ops.c b/drivers/infiniband/hw/qib/qib_file_ops.c index 98e1ce14fa2a..78fa634de98a 100644 --- a/drivers/infiniband/hw/qib/qib_file_ops.c +++ b/drivers/infiniband/hw/qib/qib_file_ops.c @@ -343,7 +343,7 @@ static int qib_tid_update(struct qib_ctxtdata *rcd, struct file *fp, /* virtual address of first page in transfer */ vaddr = ti->tidvaddr; - if (!access_ok(VERIFY_WRITE, (void __user *) vaddr, + if (!access_ok((void __user *) vaddr, cnt * PAGE_SIZE)) { ret = -EFAULT; goto done; diff --git a/drivers/macintosh/ans-lcd.c b/drivers/macintosh/ans-lcd.c index ef0c2366cf59..400960cf04d5 100644 --- a/drivers/macintosh/ans-lcd.c +++ b/drivers/macintosh/ans-lcd.c @@ -64,7 +64,7 @@ anslcd_write( struct file * file, const char __user * buf, printk(KERN_DEBUG "LCD: write\n"); #endif - if (!access_ok(VERIFY_READ, buf, count)) + if (!access_ok(buf, count)) return -EFAULT; mutex_lock(&anslcd_mutex); diff --git a/drivers/macintosh/via-pmu.c b/drivers/macintosh/via-pmu.c index ac0cf37d6239..21d532a78fa4 100644 --- a/drivers/macintosh/via-pmu.c +++ b/drivers/macintosh/via-pmu.c @@ -2188,7 +2188,7 @@ pmu_read(struct file *file, char __user *buf, if (count < 1 || !pp) return -EINVAL; - if (!access_ok(VERIFY_WRITE, buf, count)) + if (!access_ok(buf, count)) return -EFAULT; spin_lock_irqsave(&pp->lock, flags); diff --git a/drivers/media/pci/ivtv/ivtvfb.c b/drivers/media/pci/ivtv/ivtvfb.c index 3e02de02ffdd..8ec2525d8ef5 100644 --- a/drivers/media/pci/ivtv/ivtvfb.c +++ b/drivers/media/pci/ivtv/ivtvfb.c @@ -356,7 +356,7 @@ static int ivtvfb_prep_frame(struct ivtv *itv, int cmd, void __user *source, IVTVFB_WARN("ivtvfb_prep_frame: Count not a multiple of 4 (%d)\n", count); /* Check Source */ - if (!access_ok(VERIFY_READ, source + dest_offset, count)) { + if (!access_ok(source + dest_offset, count)) { IVTVFB_WARN("Invalid userspace pointer %p\n", source); IVTVFB_DEBUG_WARN("access_ok() failed for offset 0x%08lx source %p count %d\n", diff --git a/drivers/media/v4l2-core/v4l2-compat-ioctl32.c b/drivers/media/v4l2-core/v4l2-compat-ioctl32.c index fe4577a46869..73dac1d8d4f6 100644 --- a/drivers/media/v4l2-core/v4l2-compat-ioctl32.c +++ b/drivers/media/v4l2-core/v4l2-compat-ioctl32.c @@ -158,7 +158,7 @@ static int get_v4l2_window32(struct v4l2_window __user *p64, compat_caddr_t p; u32 clipcount; - if (!access_ok(VERIFY_READ, p32, sizeof(*p32)) || + if (!access_ok(p32, sizeof(*p32)) || copy_in_user(&p64->w, &p32->w, sizeof(p32->w)) || assign_in_user(&p64->field, &p32->field) || assign_in_user(&p64->chromakey, &p32->chromakey) || @@ -283,7 +283,7 @@ static int __bufsize_v4l2_format(struct v4l2_format32 __user *p32, u32 *size) static int bufsize_v4l2_format(struct v4l2_format32 __user *p32, u32 *size) { - if (!access_ok(VERIFY_READ, p32, sizeof(*p32))) + if (!access_ok(p32, sizeof(*p32))) return -EFAULT; return __bufsize_v4l2_format(p32, size); } @@ -335,7 +335,7 @@ static int get_v4l2_format32(struct v4l2_format __user *p64, struct v4l2_format32 __user *p32, void __user *aux_buf, u32 aux_space) { - if (!access_ok(VERIFY_READ, p32, sizeof(*p32))) + if (!access_ok(p32, sizeof(*p32))) return -EFAULT; return __get_v4l2_format32(p64, p32, aux_buf, aux_space); } @@ -343,7 +343,7 @@ static int get_v4l2_format32(struct v4l2_format __user *p64, static int bufsize_v4l2_create(struct v4l2_create_buffers32 __user *p32, u32 *size) { - if (!access_ok(VERIFY_READ, p32, sizeof(*p32))) + if (!access_ok(p32, sizeof(*p32))) return -EFAULT; return __bufsize_v4l2_format(&p32->format, size); } @@ -352,7 +352,7 @@ static int get_v4l2_create32(struct v4l2_create_buffers __user *p64, struct v4l2_create_buffers32 __user *p32, void __user *aux_buf, u32 aux_space) { - if (!access_ok(VERIFY_READ, p32, sizeof(*p32)) || + if (!access_ok(p32, sizeof(*p32)) || copy_in_user(p64, p32, offsetof(struct v4l2_create_buffers32, format))) return -EFAULT; @@ -404,7 +404,7 @@ static int __put_v4l2_format32(struct v4l2_format __user *p64, static int put_v4l2_format32(struct v4l2_format __user *p64, struct v4l2_format32 __user *p32) { - if (!access_ok(VERIFY_WRITE, p32, sizeof(*p32))) + if (!access_ok(p32, sizeof(*p32))) return -EFAULT; return __put_v4l2_format32(p64, p32); } @@ -412,7 +412,7 @@ static int put_v4l2_format32(struct v4l2_format __user *p64, static int put_v4l2_create32(struct v4l2_create_buffers __user *p64, struct v4l2_create_buffers32 __user *p32) { - if (!access_ok(VERIFY_WRITE, p32, sizeof(*p32)) || + if (!access_ok(p32, sizeof(*p32)) || copy_in_user(p32, p64, offsetof(struct v4l2_create_buffers32, format)) || assign_in_user(&p32->capabilities, &p64->capabilities) || @@ -434,7 +434,7 @@ static int get_v4l2_standard32(struct v4l2_standard __user *p64, struct v4l2_standard32 __user *p32) { /* other fields are not set by the user, nor used by the driver */ - if (!access_ok(VERIFY_READ, p32, sizeof(*p32)) || + if (!access_ok(p32, sizeof(*p32)) || assign_in_user(&p64->index, &p32->index)) return -EFAULT; return 0; @@ -443,7 +443,7 @@ static int get_v4l2_standard32(struct v4l2_standard __user *p64, static int put_v4l2_standard32(struct v4l2_standard __user *p64, struct v4l2_standard32 __user *p32) { - if (!access_ok(VERIFY_WRITE, p32, sizeof(*p32)) || + if (!access_ok(p32, sizeof(*p32)) || assign_in_user(&p32->index, &p64->index) || assign_in_user(&p32->id, &p64->id) || copy_in_user(p32->name, p64->name, sizeof(p32->name)) || @@ -560,7 +560,7 @@ static int bufsize_v4l2_buffer(struct v4l2_buffer32 __user *p32, u32 *size) u32 type; u32 length; - if (!access_ok(VERIFY_READ, p32, sizeof(*p32)) || + if (!access_ok(p32, sizeof(*p32)) || get_user(type, &p32->type) || get_user(length, &p32->length)) return -EFAULT; @@ -593,7 +593,7 @@ static int get_v4l2_buffer32(struct v4l2_buffer __user *p64, compat_caddr_t p; int ret; - if (!access_ok(VERIFY_READ, p32, sizeof(*p32)) || + if (!access_ok(p32, sizeof(*p32)) || assign_in_user(&p64->index, &p32->index) || get_user(type, &p32->type) || put_user(type, &p64->type) || @@ -632,7 +632,7 @@ static int get_v4l2_buffer32(struct v4l2_buffer __user *p64, return -EFAULT; uplane32 = compat_ptr(p); - if (!access_ok(VERIFY_READ, uplane32, + if (!access_ok(uplane32, num_planes * sizeof(*uplane32))) return -EFAULT; @@ -691,7 +691,7 @@ static int put_v4l2_buffer32(struct v4l2_buffer __user *p64, compat_caddr_t p; int ret; - if (!access_ok(VERIFY_WRITE, p32, sizeof(*p32)) || + if (!access_ok(p32, sizeof(*p32)) || assign_in_user(&p32->index, &p64->index) || get_user(type, &p64->type) || put_user(type, &p32->type) || @@ -781,7 +781,7 @@ static int get_v4l2_framebuffer32(struct v4l2_framebuffer __user *p64, { compat_caddr_t tmp; - if (!access_ok(VERIFY_READ, p32, sizeof(*p32)) || + if (!access_ok(p32, sizeof(*p32)) || get_user(tmp, &p32->base) || put_user_force(compat_ptr(tmp), &p64->base) || assign_in_user(&p64->capability, &p32->capability) || @@ -796,7 +796,7 @@ static int put_v4l2_framebuffer32(struct v4l2_framebuffer __user *p64, { void *base; - if (!access_ok(VERIFY_WRITE, p32, sizeof(*p32)) || + if (!access_ok(p32, sizeof(*p32)) || get_user(base, &p64->base) || put_user(ptr_to_compat((void __user *)base), &p32->base) || assign_in_user(&p32->capability, &p64->capability) || @@ -893,7 +893,7 @@ static int bufsize_v4l2_ext_controls(struct v4l2_ext_controls32 __user *p32, { u32 count; - if (!access_ok(VERIFY_READ, p32, sizeof(*p32)) || + if (!access_ok(p32, sizeof(*p32)) || get_user(count, &p32->count)) return -EFAULT; if (count > V4L2_CID_MAX_CTRLS) @@ -913,7 +913,7 @@ static int get_v4l2_ext_controls32(struct file *file, u32 n; compat_caddr_t p; - if (!access_ok(VERIFY_READ, p32, sizeof(*p32)) || + if (!access_ok(p32, sizeof(*p32)) || assign_in_user(&p64->which, &p32->which) || get_user(count, &p32->count) || put_user(count, &p64->count) || @@ -929,7 +929,7 @@ static int get_v4l2_ext_controls32(struct file *file, if (get_user(p, &p32->controls)) return -EFAULT; ucontrols = compat_ptr(p); - if (!access_ok(VERIFY_READ, ucontrols, count * sizeof(*ucontrols))) + if (!access_ok(ucontrols, count * sizeof(*ucontrols))) return -EFAULT; if (aux_space < count * sizeof(*kcontrols)) return -EFAULT; @@ -979,7 +979,7 @@ static int put_v4l2_ext_controls32(struct file *file, * with __user causes smatch warnings, so instead declare it * without __user and cast it as a userspace pointer where needed. */ - if (!access_ok(VERIFY_WRITE, p32, sizeof(*p32)) || + if (!access_ok(p32, sizeof(*p32)) || assign_in_user(&p32->which, &p64->which) || get_user(count, &p64->count) || put_user(count, &p32->count) || @@ -994,7 +994,7 @@ static int put_v4l2_ext_controls32(struct file *file, if (get_user(p, &p32->controls)) return -EFAULT; ucontrols = compat_ptr(p); - if (!access_ok(VERIFY_WRITE, ucontrols, count * sizeof(*ucontrols))) + if (!access_ok(ucontrols, count * sizeof(*ucontrols))) return -EFAULT; for (n = 0; n < count; n++) { @@ -1043,7 +1043,7 @@ struct v4l2_event32 { static int put_v4l2_event32(struct v4l2_event __user *p64, struct v4l2_event32 __user *p32) { - if (!access_ok(VERIFY_WRITE, p32, sizeof(*p32)) || + if (!access_ok(p32, sizeof(*p32)) || assign_in_user(&p32->type, &p64->type) || copy_in_user(&p32->u, &p64->u, sizeof(p64->u)) || assign_in_user(&p32->pending, &p64->pending) || @@ -1069,7 +1069,7 @@ static int get_v4l2_edid32(struct v4l2_edid __user *p64, { compat_uptr_t tmp; - if (!access_ok(VERIFY_READ, p32, sizeof(*p32)) || + if (!access_ok(p32, sizeof(*p32)) || assign_in_user(&p64->pad, &p32->pad) || assign_in_user(&p64->start_block, &p32->start_block) || assign_in_user_cast(&p64->blocks, &p32->blocks) || @@ -1085,7 +1085,7 @@ static int put_v4l2_edid32(struct v4l2_edid __user *p64, { void *edid; - if (!access_ok(VERIFY_WRITE, p32, sizeof(*p32)) || + if (!access_ok(p32, sizeof(*p32)) || assign_in_user(&p32->pad, &p64->pad) || assign_in_user(&p32->start_block, &p64->start_block) || assign_in_user(&p32->blocks, &p64->blocks) || diff --git a/drivers/misc/vmw_vmci/vmci_host.c b/drivers/misc/vmw_vmci/vmci_host.c index 5da1f3e3f997..997f92543dd4 100644 --- a/drivers/misc/vmw_vmci/vmci_host.c +++ b/drivers/misc/vmw_vmci/vmci_host.c @@ -236,7 +236,7 @@ static int vmci_host_setup_notify(struct vmci_ctx *context, * about the size. */ BUILD_BUG_ON(sizeof(bool) != sizeof(u8)); - if (!access_ok(VERIFY_WRITE, (void __user *)uva, sizeof(u8))) + if (!access_ok((void __user *)uva, sizeof(u8))) return VMCI_ERROR_GENERIC; /* diff --git a/drivers/pci/proc.c b/drivers/pci/proc.c index 7ac035af39f0..6fa1627ce08d 100644 --- a/drivers/pci/proc.c +++ b/drivers/pci/proc.c @@ -52,7 +52,7 @@ static ssize_t proc_bus_pci_read(struct file *file, char __user *buf, nbytes = size - pos; cnt = nbytes; - if (!access_ok(VERIFY_WRITE, buf, cnt)) + if (!access_ok(buf, cnt)) return -EINVAL; pci_config_pm_runtime_get(dev); @@ -125,7 +125,7 @@ static ssize_t proc_bus_pci_write(struct file *file, const char __user *buf, nbytes = size - pos; cnt = nbytes; - if (!access_ok(VERIFY_READ, buf, cnt)) + if (!access_ok(buf, cnt)) return -EINVAL; pci_config_pm_runtime_get(dev); diff --git a/drivers/platform/goldfish/goldfish_pipe.c b/drivers/platform/goldfish/goldfish_pipe.c index 7c639006252e..321bc673c417 100644 --- a/drivers/platform/goldfish/goldfish_pipe.c +++ b/drivers/platform/goldfish/goldfish_pipe.c @@ -416,8 +416,7 @@ static ssize_t goldfish_pipe_read_write(struct file *filp, if (unlikely(bufflen == 0)) return 0; /* Check the buffer range for access */ - if (unlikely(!access_ok(is_write ? VERIFY_WRITE : VERIFY_READ, - buffer, bufflen))) + if (unlikely(!access_ok(buffer, bufflen))) return -EFAULT; address = (unsigned long)buffer; diff --git a/drivers/pnp/isapnp/proc.c b/drivers/pnp/isapnp/proc.c index 262285e48a09..051613140812 100644 --- a/drivers/pnp/isapnp/proc.c +++ b/drivers/pnp/isapnp/proc.c @@ -47,7 +47,7 @@ static ssize_t isapnp_proc_bus_read(struct file *file, char __user * buf, nbytes = size - pos; cnt = nbytes; - if (!access_ok(VERIFY_WRITE, buf, cnt)) + if (!access_ok(buf, cnt)) return -EINVAL; isapnp_cfg_begin(dev->card->number, dev->number); diff --git a/drivers/scsi/pmcraid.c b/drivers/scsi/pmcraid.c index 7c4673308f5b..e338d7a4f571 100644 --- a/drivers/scsi/pmcraid.c +++ b/drivers/scsi/pmcraid.c @@ -3600,7 +3600,7 @@ static long pmcraid_ioctl_passthrough( u32 ioasc; int request_size; int buffer_size; - u8 access, direction; + u8 direction; int rc = 0; /* If IOA reset is in progress, wait 10 secs for reset to complete */ @@ -3649,10 +3649,8 @@ static long pmcraid_ioctl_passthrough( request_size = le32_to_cpu(buffer->ioarcb.data_transfer_length); if (buffer->ioarcb.request_flags0 & TRANSFER_DIR_WRITE) { - access = VERIFY_READ; direction = DMA_TO_DEVICE; } else { - access = VERIFY_WRITE; direction = DMA_FROM_DEVICE; } diff --git a/drivers/scsi/scsi_ioctl.c b/drivers/scsi/scsi_ioctl.c index cc30fccc1a2e..840d96fe81bc 100644 --- a/drivers/scsi/scsi_ioctl.c +++ b/drivers/scsi/scsi_ioctl.c @@ -221,7 +221,7 @@ int scsi_ioctl(struct scsi_device *sdev, int cmd, void __user *arg) switch (cmd) { case SCSI_IOCTL_GET_IDLUN: - if (!access_ok(VERIFY_WRITE, arg, sizeof(struct scsi_idlun))) + if (!access_ok(arg, sizeof(struct scsi_idlun))) return -EFAULT; __put_user((sdev->id & 0xff) diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c index 4e27460ec926..d3f15319b9b3 100644 --- a/drivers/scsi/sg.c +++ b/drivers/scsi/sg.c @@ -434,7 +434,7 @@ sg_read(struct file *filp, char __user *buf, size_t count, loff_t * ppos) SCSI_LOG_TIMEOUT(3, sg_printk(KERN_INFO, sdp, "sg_read: count=%d\n", (int) count)); - if (!access_ok(VERIFY_WRITE, buf, count)) + if (!access_ok(buf, count)) return -EFAULT; if (sfp->force_packid && (count >= SZ_SG_HEADER)) { old_hdr = kmalloc(SZ_SG_HEADER, GFP_KERNEL); @@ -632,7 +632,7 @@ sg_write(struct file *filp, const char __user *buf, size_t count, loff_t * ppos) scsi_block_when_processing_errors(sdp->device))) return -ENXIO; - if (!access_ok(VERIFY_READ, buf, count)) + if (!access_ok(buf, count)) return -EFAULT; /* protects following copy_from_user()s + get_user()s */ if (count < SZ_SG_HEADER) return -EIO; @@ -729,7 +729,7 @@ sg_new_write(Sg_fd *sfp, struct file *file, const char __user *buf, if (count < SZ_SG_IO_HDR) return -EINVAL; - if (!access_ok(VERIFY_READ, buf, count)) + if (!access_ok(buf, count)) return -EFAULT; /* protects following copy_from_user()s + get_user()s */ sfp->cmd_q = 1; /* when sg_io_hdr seen, set command queuing on */ @@ -768,7 +768,7 @@ sg_new_write(Sg_fd *sfp, struct file *file, const char __user *buf, sg_remove_request(sfp, srp); return -EMSGSIZE; } - if (!access_ok(VERIFY_READ, hp->cmdp, hp->cmd_len)) { + if (!access_ok(hp->cmdp, hp->cmd_len)) { sg_remove_request(sfp, srp); return -EFAULT; /* protects following copy_from_user()s + get_user()s */ } @@ -922,7 +922,7 @@ sg_ioctl(struct file *filp, unsigned int cmd_in, unsigned long arg) return -ENODEV; if (!scsi_block_when_processing_errors(sdp->device)) return -ENXIO; - if (!access_ok(VERIFY_WRITE, p, SZ_SG_IO_HDR)) + if (!access_ok(p, SZ_SG_IO_HDR)) return -EFAULT; result = sg_new_write(sfp, filp, p, SZ_SG_IO_HDR, 1, read_only, 1, &srp); @@ -968,7 +968,7 @@ sg_ioctl(struct file *filp, unsigned int cmd_in, unsigned long arg) case SG_GET_LOW_DMA: return put_user((int) sdp->device->host->unchecked_isa_dma, ip); case SG_GET_SCSI_ID: - if (!access_ok(VERIFY_WRITE, p, sizeof (sg_scsi_id_t))) + if (!access_ok(p, sizeof (sg_scsi_id_t))) return -EFAULT; else { sg_scsi_id_t __user *sg_idp = p; @@ -997,7 +997,7 @@ sg_ioctl(struct file *filp, unsigned int cmd_in, unsigned long arg) sfp->force_packid = val ? 1 : 0; return 0; case SG_GET_PACK_ID: - if (!access_ok(VERIFY_WRITE, ip, sizeof (int))) + if (!access_ok(ip, sizeof (int))) return -EFAULT; read_lock_irqsave(&sfp->rq_list_lock, iflags); list_for_each_entry(srp, &sfp->rq_list, entry) { @@ -1078,7 +1078,7 @@ sg_ioctl(struct file *filp, unsigned int cmd_in, unsigned long arg) val = (sdp->device ? 1 : 0); return put_user(val, ip); case SG_GET_REQUEST_TABLE: - if (!access_ok(VERIFY_WRITE, p, SZ_SG_REQ_INFO * SG_MAX_QUEUE)) + if (!access_ok(p, SZ_SG_REQ_INFO * SG_MAX_QUEUE)) return -EFAULT; else { sg_req_info_t *rinfo; diff --git a/drivers/staging/comedi/comedi_compat32.c b/drivers/staging/comedi/comedi_compat32.c index fa9d239474ee..36a3564ba1fb 100644 --- a/drivers/staging/comedi/comedi_compat32.c +++ b/drivers/staging/comedi/comedi_compat32.c @@ -102,8 +102,8 @@ static int compat_chaninfo(struct file *file, unsigned long arg) chaninfo = compat_alloc_user_space(sizeof(*chaninfo)); /* Copy chaninfo structure. Ignore unused members. */ - if (!access_ok(VERIFY_READ, chaninfo32, sizeof(*chaninfo32)) || - !access_ok(VERIFY_WRITE, chaninfo, sizeof(*chaninfo))) + if (!access_ok(chaninfo32, sizeof(*chaninfo32)) || + !access_ok(chaninfo, sizeof(*chaninfo))) return -EFAULT; err = 0; @@ -136,8 +136,8 @@ static int compat_rangeinfo(struct file *file, unsigned long arg) rangeinfo = compat_alloc_user_space(sizeof(*rangeinfo)); /* Copy rangeinfo structure. */ - if (!access_ok(VERIFY_READ, rangeinfo32, sizeof(*rangeinfo32)) || - !access_ok(VERIFY_WRITE, rangeinfo, sizeof(*rangeinfo))) + if (!access_ok(rangeinfo32, sizeof(*rangeinfo32)) || + !access_ok(rangeinfo, sizeof(*rangeinfo))) return -EFAULT; err = 0; @@ -163,8 +163,8 @@ static int get_compat_cmd(struct comedi_cmd __user *cmd, } temp; /* Copy cmd structure. */ - if (!access_ok(VERIFY_READ, cmd32, sizeof(*cmd32)) || - !access_ok(VERIFY_WRITE, cmd, sizeof(*cmd))) + if (!access_ok(cmd32, sizeof(*cmd32)) || + !access_ok(cmd, sizeof(*cmd))) return -EFAULT; err = 0; @@ -217,8 +217,8 @@ static int put_compat_cmd(struct comedi32_cmd_struct __user *cmd32, * Assume the pointer values are already valid. * (Could use ptr_to_compat() to set them.) */ - if (!access_ok(VERIFY_READ, cmd, sizeof(*cmd)) || - !access_ok(VERIFY_WRITE, cmd32, sizeof(*cmd32))) + if (!access_ok(cmd, sizeof(*cmd)) || + !access_ok(cmd32, sizeof(*cmd32))) return -EFAULT; err = 0; @@ -317,8 +317,8 @@ static int get_compat_insn(struct comedi_insn __user *insn, /* Copy insn structure. Ignore the unused members. */ err = 0; - if (!access_ok(VERIFY_READ, insn32, sizeof(*insn32)) || - !access_ok(VERIFY_WRITE, insn, sizeof(*insn))) + if (!access_ok(insn32, sizeof(*insn32)) || + !access_ok(insn, sizeof(*insn))) return -EFAULT; err |= __get_user(temp.uint, &insn32->insn); @@ -350,7 +350,7 @@ static int compat_insnlist(struct file *file, unsigned long arg) insnlist32 = compat_ptr(arg); /* Get 32-bit insnlist structure. */ - if (!access_ok(VERIFY_READ, insnlist32, sizeof(*insnlist32))) + if (!access_ok(insnlist32, sizeof(*insnlist32))) return -EFAULT; err = 0; @@ -365,7 +365,7 @@ static int compat_insnlist(struct file *file, unsigned long arg) insn[n_insns])); /* Set native insnlist structure. */ - if (!access_ok(VERIFY_WRITE, &s->insnlist, sizeof(s->insnlist))) + if (!access_ok(&s->insnlist, sizeof(s->insnlist))) return -EFAULT; err |= __put_user(n_insns, &s->insnlist.n_insns); diff --git a/drivers/tty/n_hdlc.c b/drivers/tty/n_hdlc.c index 99460af61b77..4164414d4c64 100644 --- a/drivers/tty/n_hdlc.c +++ b/drivers/tty/n_hdlc.c @@ -573,7 +573,7 @@ static ssize_t n_hdlc_tty_read(struct tty_struct *tty, struct file *file, return -EIO; /* verify user access to buffer */ - if (!access_ok(VERIFY_WRITE, buf, nr)) { + if (!access_ok(buf, nr)) { printk(KERN_WARNING "%s(%d) n_hdlc_tty_read() can't verify user " "buffer\n", __FILE__, __LINE__); return -EFAULT; diff --git a/drivers/usb/core/devices.c b/drivers/usb/core/devices.c index 3de3c750b5f6..44f28a114c2b 100644 --- a/drivers/usb/core/devices.c +++ b/drivers/usb/core/devices.c @@ -598,7 +598,7 @@ static ssize_t usb_device_read(struct file *file, char __user *buf, return -EINVAL; if (nbytes <= 0) return 0; - if (!access_ok(VERIFY_WRITE, buf, nbytes)) + if (!access_ok(buf, nbytes)) return -EFAULT; mutex_lock(&usb_bus_idr_lock); diff --git a/drivers/usb/core/devio.c b/drivers/usb/core/devio.c index a75bc0b8a50f..d65566341dd1 100644 --- a/drivers/usb/core/devio.c +++ b/drivers/usb/core/devio.c @@ -1094,7 +1094,7 @@ static int proc_control(struct usb_dev_state *ps, void __user *arg) ctrl.bRequestType, ctrl.bRequest, ctrl.wValue, ctrl.wIndex, ctrl.wLength); if (ctrl.bRequestType & 0x80) { - if (ctrl.wLength && !access_ok(VERIFY_WRITE, ctrl.data, + if (ctrl.wLength && !access_ok(ctrl.data, ctrl.wLength)) { ret = -EINVAL; goto done; @@ -1183,7 +1183,7 @@ static int proc_bulk(struct usb_dev_state *ps, void __user *arg) } tmo = bulk.timeout; if (bulk.ep & 0x80) { - if (len1 && !access_ok(VERIFY_WRITE, bulk.data, len1)) { + if (len1 && !access_ok(bulk.data, len1)) { ret = -EINVAL; goto done; } @@ -1584,8 +1584,7 @@ static int proc_do_submiturb(struct usb_dev_state *ps, struct usbdevfs_urb *uurb } if (uurb->buffer_length > 0 && - !access_ok(is_in ? VERIFY_WRITE : VERIFY_READ, - uurb->buffer, uurb->buffer_length)) { + !access_ok(uurb->buffer, uurb->buffer_length)) { ret = -EFAULT; goto error; } diff --git a/drivers/usb/gadget/function/f_hid.c b/drivers/usb/gadget/function/f_hid.c index 54e859dcb25c..75b113a5b25c 100644 --- a/drivers/usb/gadget/function/f_hid.c +++ b/drivers/usb/gadget/function/f_hid.c @@ -252,7 +252,7 @@ static ssize_t f_hidg_read(struct file *file, char __user *buffer, if (!count) return 0; - if (!access_ok(VERIFY_WRITE, buffer, count)) + if (!access_ok(buffer, count)) return -EFAULT; spin_lock_irqsave(&hidg->read_spinlock, flags); @@ -339,7 +339,7 @@ static ssize_t f_hidg_write(struct file *file, const char __user *buffer, unsigned long flags; ssize_t status = -ENOMEM; - if (!access_ok(VERIFY_READ, buffer, count)) + if (!access_ok(buffer, count)) return -EFAULT; spin_lock_irqsave(&hidg->write_spinlock, flags); diff --git a/drivers/usb/gadget/udc/atmel_usba_udc.c b/drivers/usb/gadget/udc/atmel_usba_udc.c index 11247322d587..660712e0bf98 100644 --- a/drivers/usb/gadget/udc/atmel_usba_udc.c +++ b/drivers/usb/gadget/udc/atmel_usba_udc.c @@ -88,7 +88,7 @@ static ssize_t queue_dbg_read(struct file *file, char __user *buf, size_t len, remaining, actual = 0; char tmpbuf[38]; - if (!access_ok(VERIFY_WRITE, buf, nbytes)) + if (!access_ok(buf, nbytes)) return -EFAULT; inode_lock(file_inode(file)); diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 55e5aa662ad5..9f7942cbcbb2 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -655,7 +655,7 @@ static bool log_access_ok(void __user *log_base, u64 addr, unsigned long sz) a + (unsigned long)log_base > ULONG_MAX) return false; - return access_ok(VERIFY_WRITE, log_base + a, + return access_ok(log_base + a, (sz + VHOST_PAGE_SIZE * 8 - 1) / VHOST_PAGE_SIZE / 8); } @@ -681,7 +681,7 @@ static bool vq_memory_access_ok(void __user *log_base, struct vhost_umem *umem, return false; - if (!access_ok(VERIFY_WRITE, (void __user *)a, + if (!access_ok((void __user *)a, node->size)) return false; else if (log_all && !log_access_ok(log_base, @@ -973,10 +973,10 @@ static bool umem_access_ok(u64 uaddr, u64 size, int access) return false; if ((access & VHOST_ACCESS_RO) && - !access_ok(VERIFY_READ, (void __user *)a, size)) + !access_ok((void __user *)a, size)) return false; if ((access & VHOST_ACCESS_WO) && - !access_ok(VERIFY_WRITE, (void __user *)a, size)) + !access_ok((void __user *)a, size)) return false; return true; } @@ -1185,10 +1185,10 @@ static bool vq_access_ok(struct vhost_virtqueue *vq, unsigned int num, { size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0; - return access_ok(VERIFY_READ, desc, num * sizeof *desc) && - access_ok(VERIFY_READ, avail, + return access_ok(desc, num * sizeof *desc) && + access_ok(avail, sizeof *avail + num * sizeof *avail->ring + s) && - access_ok(VERIFY_WRITE, used, + access_ok(used, sizeof *used + num * sizeof *used->ring + s); } @@ -1814,7 +1814,7 @@ int vhost_vq_init_access(struct vhost_virtqueue *vq) goto err; vq->signalled_used_valid = false; if (!vq->iotlb && - !access_ok(VERIFY_READ, &vq->used->idx, sizeof vq->used->idx)) { + !access_ok(&vq->used->idx, sizeof vq->used->idx)) { r = -EFAULT; goto err; } diff --git a/drivers/video/fbdev/amifb.c b/drivers/video/fbdev/amifb.c index 0777aff211e5..758457026694 100644 --- a/drivers/video/fbdev/amifb.c +++ b/drivers/video/fbdev/amifb.c @@ -1855,7 +1855,7 @@ static int ami_get_var_cursorinfo(struct fb_var_cursorinfo *var, var->yspot = par->crsr.spot_y; if (size > var->height * var->width) return -ENAMETOOLONG; - if (!access_ok(VERIFY_WRITE, data, size)) + if (!access_ok(data, size)) return -EFAULT; delta = 1 << par->crsr.fmode; lspr = lofsprite + (delta << 1); @@ -1935,7 +1935,7 @@ static int ami_set_var_cursorinfo(struct fb_var_cursorinfo *var, return -EINVAL; if (!var->height) return -EINVAL; - if (!access_ok(VERIFY_READ, data, var->width * var->height)) + if (!access_ok(data, var->width * var->height)) return -EFAULT; delta = 1 << fmode; lofsprite = shfsprite = (u_short *)spritememory; diff --git a/drivers/video/fbdev/omap2/omapfb/omapfb-ioctl.c b/drivers/video/fbdev/omap2/omapfb/omapfb-ioctl.c index a3edb20ea4c3..53f93616c671 100644 --- a/drivers/video/fbdev/omap2/omapfb/omapfb-ioctl.c +++ b/drivers/video/fbdev/omap2/omapfb/omapfb-ioctl.c @@ -493,7 +493,7 @@ static int omapfb_memory_read(struct fb_info *fbi, if (!display || !display->driver->memory_read) return -ENOENT; - if (!access_ok(VERIFY_WRITE, mr->buffer, mr->buffer_size)) + if (!access_ok(mr->buffer, mr->buffer_size)) return -EFAULT; if (mr->w > 4096 || mr->h > 4096) diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c index 7e6e682104dc..b24ddac1604b 100644 --- a/drivers/xen/privcmd.c +++ b/drivers/xen/privcmd.c @@ -459,14 +459,14 @@ static long privcmd_ioctl_mmap_batch( return -EFAULT; /* Returns per-frame error in m.arr. */ m.err = NULL; - if (!access_ok(VERIFY_WRITE, m.arr, m.num * sizeof(*m.arr))) + if (!access_ok(m.arr, m.num * sizeof(*m.arr))) return -EFAULT; break; case 2: if (copy_from_user(&m, udata, sizeof(struct privcmd_mmapbatch_v2))) return -EFAULT; /* Returns per-frame error code in m.err. */ - if (!access_ok(VERIFY_WRITE, m.err, m.num * (sizeof(*m.err)))) + if (!access_ok(m.err, m.num * (sizeof(*m.err)))) return -EFAULT; break; default: @@ -661,7 +661,7 @@ static long privcmd_ioctl_dm_op(struct file *file, void __user *udata) goto out; } - if (!access_ok(VERIFY_WRITE, kbufs[i].uptr, + if (!access_ok(kbufs[i].uptr, kbufs[i].size)) { rc = -EFAULT; goto out; diff --git a/fs/binfmt_aout.c b/fs/binfmt_aout.c index c3deb2e35f20..ca9725f18e00 100644 --- a/fs/binfmt_aout.c +++ b/fs/binfmt_aout.c @@ -78,9 +78,9 @@ static int aout_core_dump(struct coredump_params *cprm) /* make sure we actually have a data and stack area to dump */ set_fs(USER_DS); - if (!access_ok(VERIFY_READ, START_DATA(dump), dump.u_dsize << PAGE_SHIFT)) + if (!access_ok(START_DATA(dump), dump.u_dsize << PAGE_SHIFT)) dump.u_dsize = 0; - if (!access_ok(VERIFY_READ, START_STACK(dump), dump.u_ssize << PAGE_SHIFT)) + if (!access_ok(START_STACK(dump), dump.u_ssize << PAGE_SHIFT)) dump.u_ssize = 0; set_fs(KERNEL_DS); diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c index 1b15b43905f8..7ea2d6b1f170 100644 --- a/fs/btrfs/send.c +++ b/fs/btrfs/send.c @@ -6646,7 +6646,7 @@ long btrfs_ioctl_send(struct file *mnt_file, struct btrfs_ioctl_send_args *arg) goto out; } - if (!access_ok(VERIFY_READ, arg->clone_sources, + if (!access_ok(arg->clone_sources, sizeof(*arg->clone_sources) * arg->clone_sources_count)) { ret = -EFAULT; diff --git a/fs/eventpoll.c b/fs/eventpoll.c index 8a5a1010886b..7ebae39fbcb3 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -2172,7 +2172,7 @@ static int do_epoll_wait(int epfd, struct epoll_event __user *events, return -EINVAL; /* Verify that the area passed by the user is writeable */ - if (!access_ok(VERIFY_WRITE, events, maxevents * sizeof(struct epoll_event))) + if (!access_ok(events, maxevents * sizeof(struct epoll_event))) return -EFAULT; /* Get the "struct file *" for the eventpoll file */ diff --git a/fs/fat/dir.c b/fs/fat/dir.c index c8366cb8eccd..0295a095b920 100644 --- a/fs/fat/dir.c +++ b/fs/fat/dir.c @@ -805,7 +805,7 @@ static long fat_dir_ioctl(struct file *filp, unsigned int cmd, return fat_generic_ioctl(filp, cmd, arg); } - if (!access_ok(VERIFY_WRITE, d1, sizeof(struct __fat_dirent[2]))) + if (!access_ok(d1, sizeof(struct __fat_dirent[2]))) return -EFAULT; /* * Yes, we don't need this put_user() absolutely. However old @@ -845,7 +845,7 @@ static long fat_compat_dir_ioctl(struct file *filp, unsigned cmd, return fat_generic_ioctl(filp, cmd, (unsigned long)arg); } - if (!access_ok(VERIFY_WRITE, d1, sizeof(struct compat_dirent[2]))) + if (!access_ok(d1, sizeof(struct compat_dirent[2]))) return -EFAULT; /* * Yes, we don't need this put_user() absolutely. However old diff --git a/fs/ioctl.c b/fs/ioctl.c index d64f622cac8b..fef3a6bf7c78 100644 --- a/fs/ioctl.c +++ b/fs/ioctl.c @@ -203,7 +203,7 @@ static int ioctl_fiemap(struct file *filp, unsigned long arg) fieinfo.fi_extents_start = ufiemap->fm_extents; if (fiemap.fm_extent_count != 0 && - !access_ok(VERIFY_WRITE, fieinfo.fi_extents_start, + !access_ok(fieinfo.fi_extents_start, fieinfo.fi_extents_max * sizeof(struct fiemap_extent))) return -EFAULT; diff --git a/fs/namespace.c b/fs/namespace.c index a7f91265ea67..97b7c7098c3d 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2651,7 +2651,7 @@ static long exact_copy_from_user(void *to, const void __user * from, const char __user *f = from; char c; - if (!access_ok(VERIFY_READ, from, n)) + if (!access_ok(from, n)) return n; current->kernel_uaccess_faults_ok++; diff --git a/fs/ocfs2/dlmfs/dlmfs.c b/fs/ocfs2/dlmfs/dlmfs.c index b8fa1487cd85..8decbe95dcec 100644 --- a/fs/ocfs2/dlmfs/dlmfs.c +++ b/fs/ocfs2/dlmfs/dlmfs.c @@ -254,7 +254,7 @@ static ssize_t dlmfs_file_read(struct file *filp, if (!count) return 0; - if (!access_ok(VERIFY_WRITE, buf, count)) + if (!access_ok(buf, count)) return -EFAULT; /* don't read past the lvb */ @@ -302,7 +302,7 @@ static ssize_t dlmfs_file_write(struct file *filp, if (!count) return 0; - if (!access_ok(VERIFY_READ, buf, count)) + if (!access_ok(buf, count)) return -EFAULT; /* don't write past the lvb */ diff --git a/fs/pstore/pmsg.c b/fs/pstore/pmsg.c index 24db02de1787..97fcef74e5af 100644 --- a/fs/pstore/pmsg.c +++ b/fs/pstore/pmsg.c @@ -33,7 +33,7 @@ static ssize_t write_pmsg(struct file *file, const char __user *buf, record.size = count; /* check outside lock, page in any data. write_user also checks */ - if (!access_ok(VERIFY_READ, buf, count)) + if (!access_ok(buf, count)) return -EFAULT; mutex_lock(&pmsg_lock); diff --git a/fs/pstore/ram_core.c b/fs/pstore/ram_core.c index c11711c2cc83..f375c0735351 100644 --- a/fs/pstore/ram_core.c +++ b/fs/pstore/ram_core.c @@ -357,7 +357,7 @@ int notrace persistent_ram_write_user(struct persistent_ram_zone *prz, int rem, ret = 0, c = count; size_t start; - if (unlikely(!access_ok(VERIFY_READ, s, count))) + if (unlikely(!access_ok(s, count))) return -EFAULT; if (unlikely(c > prz->buffer_size)) { s += c - prz->buffer_size; diff --git a/fs/read_write.c b/fs/read_write.c index 58f30537c47a..ff3c5e6f87cf 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -442,7 +442,7 @@ ssize_t vfs_read(struct file *file, char __user *buf, size_t count, loff_t *pos) return -EBADF; if (!(file->f_mode & FMODE_CAN_READ)) return -EINVAL; - if (unlikely(!access_ok(VERIFY_WRITE, buf, count))) + if (unlikely(!access_ok(buf, count))) return -EFAULT; ret = rw_verify_area(READ, file, pos, count); @@ -538,7 +538,7 @@ ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_ return -EBADF; if (!(file->f_mode & FMODE_CAN_WRITE)) return -EINVAL; - if (unlikely(!access_ok(VERIFY_READ, buf, count))) + if (unlikely(!access_ok(buf, count))) return -EFAULT; ret = rw_verify_area(WRITE, file, pos, count); @@ -718,9 +718,6 @@ static ssize_t do_loop_readv_writev(struct file *filp, struct iov_iter *iter, return ret; } -/* A write operation does a read from user space and vice versa */ -#define vrfy_dir(type) ((type) == READ ? VERIFY_WRITE : VERIFY_READ) - /** * rw_copy_check_uvector() - Copy an array of &struct iovec from userspace * into the kernel and check that it is valid. @@ -810,7 +807,7 @@ ssize_t rw_copy_check_uvector(int type, const struct iovec __user * uvector, goto out; } if (type >= 0 - && unlikely(!access_ok(vrfy_dir(type), buf, len))) { + && unlikely(!access_ok(buf, len))) { ret = -EFAULT; goto out; } @@ -856,7 +853,7 @@ ssize_t compat_rw_copy_check_uvector(int type, *ret_pointer = iov; ret = -EFAULT; - if (!access_ok(VERIFY_READ, uvector, nr_segs*sizeof(*uvector))) + if (!access_ok(uvector, nr_segs*sizeof(*uvector))) goto out; /* @@ -881,7 +878,7 @@ ssize_t compat_rw_copy_check_uvector(int type, if (len < 0) /* size_t not fitting in compat_ssize_t .. */ goto out; if (type >= 0 && - !access_ok(vrfy_dir(type), compat_ptr(buf), len)) { + !access_ok(compat_ptr(buf), len)) { ret = -EFAULT; goto out; } diff --git a/fs/readdir.c b/fs/readdir.c index d97f548e6323..2f6a4534e0df 100644 --- a/fs/readdir.c +++ b/fs/readdir.c @@ -105,7 +105,7 @@ static int fillonedir(struct dir_context *ctx, const char *name, int namlen, } buf->result++; dirent = buf->dirent; - if (!access_ok(VERIFY_WRITE, dirent, + if (!access_ok(dirent, (unsigned long)(dirent->d_name + namlen + 1) - (unsigned long)dirent)) goto efault; @@ -221,7 +221,7 @@ SYSCALL_DEFINE3(getdents, unsigned int, fd, }; int error; - if (!access_ok(VERIFY_WRITE, dirent, count)) + if (!access_ok(dirent, count)) return -EFAULT; f = fdget_pos(fd); @@ -304,7 +304,7 @@ int ksys_getdents64(unsigned int fd, struct linux_dirent64 __user *dirent, }; int error; - if (!access_ok(VERIFY_WRITE, dirent, count)) + if (!access_ok(dirent, count)) return -EFAULT; f = fdget_pos(fd); @@ -365,7 +365,7 @@ static int compat_fillonedir(struct dir_context *ctx, const char *name, } buf->result++; dirent = buf->dirent; - if (!access_ok(VERIFY_WRITE, dirent, + if (!access_ok(dirent, (unsigned long)(dirent->d_name + namlen + 1) - (unsigned long)dirent)) goto efault; @@ -475,7 +475,7 @@ COMPAT_SYSCALL_DEFINE3(getdents, unsigned int, fd, }; int error; - if (!access_ok(VERIFY_WRITE, dirent, count)) + if (!access_ok(dirent, count)) return -EFAULT; f = fdget_pos(fd); diff --git a/fs/select.c b/fs/select.c index 4c8652390c94..d0f35dbc0e8f 100644 --- a/fs/select.c +++ b/fs/select.c @@ -381,9 +381,6 @@ typedef struct { #define FDS_BYTES(nr) (FDS_LONGS(nr)*sizeof(long)) /* - * We do a VERIFY_WRITE here even though we are only reading this time: - * we'll write to it eventually.. - * * Use "unsigned long" accesses to let user-mode fd_set's be long-aligned. */ static inline @@ -782,7 +779,7 @@ SYSCALL_DEFINE6(pselect6, int, n, fd_set __user *, inp, fd_set __user *, outp, sigset_t __user *up = NULL; if (sig) { - if (!access_ok(VERIFY_READ, sig, sizeof(void *)+sizeof(size_t)) + if (!access_ok(sig, sizeof(void *)+sizeof(size_t)) || __get_user(up, (sigset_t __user * __user *)sig) || __get_user(sigsetsize, (size_t __user *)(sig+sizeof(void *)))) @@ -802,7 +799,7 @@ SYSCALL_DEFINE6(pselect6_time32, int, n, fd_set __user *, inp, fd_set __user *, sigset_t __user *up = NULL; if (sig) { - if (!access_ok(VERIFY_READ, sig, sizeof(void *)+sizeof(size_t)) + if (!access_ok(sig, sizeof(void *)+sizeof(size_t)) || __get_user(up, (sigset_t __user * __user *)sig) || __get_user(sigsetsize, (size_t __user *)(sig+sizeof(void *)))) @@ -1368,7 +1365,7 @@ COMPAT_SYSCALL_DEFINE6(pselect6_time64, int, n, compat_ulong_t __user *, inp, compat_uptr_t up = 0; if (sig) { - if (!access_ok(VERIFY_READ, sig, + if (!access_ok(sig, sizeof(compat_uptr_t)+sizeof(compat_size_t)) || __get_user(up, (compat_uptr_t __user *)sig) || __get_user(sigsetsize, @@ -1390,7 +1387,7 @@ COMPAT_SYSCALL_DEFINE6(pselect6, int, n, compat_ulong_t __user *, inp, compat_uptr_t up = 0; if (sig) { - if (!access_ok(VERIFY_READ, sig, + if (!access_ok(sig, sizeof(compat_uptr_t)+sizeof(compat_size_t)) || __get_user(up, (compat_uptr_t __user *)sig) || __get_user(sigsetsize, diff --git a/include/asm-generic/uaccess.h b/include/asm-generic/uaccess.h index 6b2e63df2739..d82c78a79da5 100644 --- a/include/asm-generic/uaccess.h +++ b/include/asm-generic/uaccess.h @@ -35,7 +35,7 @@ static inline void set_fs(mm_segment_t fs) #define segment_eq(a, b) ((a).seg == (b).seg) #endif -#define access_ok(type, addr, size) __access_ok((unsigned long)(addr),(size)) +#define access_ok(addr, size) __access_ok((unsigned long)(addr),(size)) /* * The architecture should really override this if possible, at least @@ -78,7 +78,7 @@ static inline int __access_ok(unsigned long addr, unsigned long size) ({ \ void __user *__p = (ptr); \ might_fault(); \ - access_ok(VERIFY_WRITE, __p, sizeof(*ptr)) ? \ + access_ok(__p, sizeof(*ptr)) ? \ __put_user((x), ((__typeof__(*(ptr)) __user *)__p)) : \ -EFAULT; \ }) @@ -140,7 +140,7 @@ extern int __put_user_bad(void) __attribute__((noreturn)); ({ \ const void __user *__p = (ptr); \ might_fault(); \ - access_ok(VERIFY_READ, __p, sizeof(*ptr)) ? \ + access_ok(__p, sizeof(*ptr)) ? \ __get_user((x), (__typeof__(*(ptr)) __user *)__p) :\ ((x) = (__typeof__(*(ptr)))0,-EFAULT); \ }) @@ -175,7 +175,7 @@ __strncpy_from_user(char *dst, const char __user *src, long count) static inline long strncpy_from_user(char *dst, const char __user *src, long count) { - if (!access_ok(VERIFY_READ, src, 1)) + if (!access_ok(src, 1)) return -EFAULT; return __strncpy_from_user(dst, src, count); } @@ -196,7 +196,7 @@ strncpy_from_user(char *dst, const char __user *src, long count) */ static inline long strnlen_user(const char __user *src, long n) { - if (!access_ok(VERIFY_READ, src, 1)) + if (!access_ok(src, 1)) return 0; return __strnlen_user(src, n); } @@ -217,7 +217,7 @@ static inline __must_check unsigned long clear_user(void __user *to, unsigned long n) { might_fault(); - if (!access_ok(VERIFY_WRITE, to, n)) + if (!access_ok(to, n)) return n; return __clear_user(to, n); diff --git a/include/linux/regset.h b/include/linux/regset.h index 494cedaafdf2..a85c1707285c 100644 --- a/include/linux/regset.h +++ b/include/linux/regset.h @@ -376,7 +376,7 @@ static inline int copy_regset_to_user(struct task_struct *target, if (!regset->get) return -EOPNOTSUPP; - if (!access_ok(VERIFY_WRITE, data, size)) + if (!access_ok(data, size)) return -EFAULT; return regset->get(target, regset, offset, size, NULL, data); @@ -402,7 +402,7 @@ static inline int copy_regset_from_user(struct task_struct *target, if (!regset->set) return -EOPNOTSUPP; - if (!access_ok(VERIFY_READ, data, size)) + if (!access_ok(data, size)) return -EFAULT; return regset->set(target, regset, offset, size, NULL, data); diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h index efe79c1cdd47..bf2523867a02 100644 --- a/include/linux/uaccess.h +++ b/include/linux/uaccess.h @@ -6,9 +6,6 @@ #include #include -#define VERIFY_READ 0 -#define VERIFY_WRITE 1 - #define uaccess_kernel() segment_eq(get_fs(), KERNEL_DS) #include @@ -111,7 +108,7 @@ _copy_from_user(void *to, const void __user *from, unsigned long n) { unsigned long res = n; might_fault(); - if (likely(access_ok(VERIFY_READ, from, n))) { + if (likely(access_ok(from, n))) { kasan_check_write(to, n); res = raw_copy_from_user(to, from, n); } @@ -129,7 +126,7 @@ static inline unsigned long _copy_to_user(void __user *to, const void *from, unsigned long n) { might_fault(); - if (access_ok(VERIFY_WRITE, to, n)) { + if (access_ok(to, n)) { kasan_check_read(from, n); n = raw_copy_to_user(to, from, n); } @@ -160,7 +157,7 @@ static __always_inline unsigned long __must_check copy_in_user(void __user *to, const void __user *from, unsigned long n) { might_fault(); - if (access_ok(VERIFY_WRITE, to, n) && access_ok(VERIFY_READ, from, n)) + if (access_ok(to, n) && access_ok(from, n)) n = raw_copy_in_user(to, from, n); return n; } diff --git a/include/net/checksum.h b/include/net/checksum.h index aef2b2bb6603..0f319e13be2c 100644 --- a/include/net/checksum.h +++ b/include/net/checksum.h @@ -30,7 +30,7 @@ static inline __wsum csum_and_copy_from_user (const void __user *src, void *dst, int len, __wsum sum, int *err_ptr) { - if (access_ok(VERIFY_READ, src, len)) + if (access_ok(src, len)) return csum_partial_copy_from_user(src, dst, len, sum, err_ptr); if (len) @@ -46,7 +46,7 @@ static __inline__ __wsum csum_and_copy_to_user { sum = csum_partial(src, len, sum); - if (access_ok(VERIFY_WRITE, dst, len)) { + if (access_ok(dst, len)) { if (copy_to_user(dst, src, len) == 0) return sum; } diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 0607db304def..b155cd17c1bd 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -79,7 +79,7 @@ int bpf_check_uarg_tail_zero(void __user *uaddr, if (unlikely(actual_size > PAGE_SIZE)) /* silly large */ return -E2BIG; - if (unlikely(!access_ok(VERIFY_READ, uaddr, actual_size))) + if (unlikely(!access_ok(uaddr, actual_size))) return -EFAULT; if (actual_size <= expected_size) diff --git a/kernel/compat.c b/kernel/compat.c index 089d00d0da9c..705d4ae6c018 100644 --- a/kernel/compat.c +++ b/kernel/compat.c @@ -95,28 +95,28 @@ int compat_put_timex(struct compat_timex __user *utp, const struct timex *txc) static int __compat_get_timeval(struct timeval *tv, const struct old_timeval32 __user *ctv) { - return (!access_ok(VERIFY_READ, ctv, sizeof(*ctv)) || + return (!access_ok(ctv, sizeof(*ctv)) || __get_user(tv->tv_sec, &ctv->tv_sec) || __get_user(tv->tv_usec, &ctv->tv_usec)) ? -EFAULT : 0; } static int __compat_put_timeval(const struct timeval *tv, struct old_timeval32 __user *ctv) { - return (!access_ok(VERIFY_WRITE, ctv, sizeof(*ctv)) || + return (!access_ok(ctv, sizeof(*ctv)) || __put_user(tv->tv_sec, &ctv->tv_sec) || __put_user(tv->tv_usec, &ctv->tv_usec)) ? -EFAULT : 0; } static int __compat_get_timespec(struct timespec *ts, const struct old_timespec32 __user *cts) { - return (!access_ok(VERIFY_READ, cts, sizeof(*cts)) || + return (!access_ok(cts, sizeof(*cts)) || __get_user(ts->tv_sec, &cts->tv_sec) || __get_user(ts->tv_nsec, &cts->tv_nsec)) ? -EFAULT : 0; } static int __compat_put_timespec(const struct timespec *ts, struct old_timespec32 __user *cts) { - return (!access_ok(VERIFY_WRITE, cts, sizeof(*cts)) || + return (!access_ok(cts, sizeof(*cts)) || __put_user(ts->tv_sec, &cts->tv_sec) || __put_user(ts->tv_nsec, &cts->tv_nsec)) ? -EFAULT : 0; } @@ -335,7 +335,7 @@ int get_compat_sigevent(struct sigevent *event, const struct compat_sigevent __user *u_event) { memset(event, 0, sizeof(*event)); - return (!access_ok(VERIFY_READ, u_event, sizeof(*u_event)) || + return (!access_ok(u_event, sizeof(*u_event)) || __get_user(event->sigev_value.sival_int, &u_event->sigev_value.sival_int) || __get_user(event->sigev_signo, &u_event->sigev_signo) || @@ -354,7 +354,7 @@ long compat_get_bitmap(unsigned long *mask, const compat_ulong_t __user *umask, bitmap_size = ALIGN(bitmap_size, BITS_PER_COMPAT_LONG); nr_compat_longs = BITS_TO_COMPAT_LONGS(bitmap_size); - if (!access_ok(VERIFY_READ, umask, bitmap_size / 8)) + if (!access_ok(umask, bitmap_size / 8)) return -EFAULT; user_access_begin(); @@ -384,7 +384,7 @@ long compat_put_bitmap(compat_ulong_t __user *umask, unsigned long *mask, bitmap_size = ALIGN(bitmap_size, BITS_PER_COMPAT_LONG); nr_compat_longs = BITS_TO_COMPAT_LONGS(bitmap_size); - if (!access_ok(VERIFY_WRITE, umask, bitmap_size / 8)) + if (!access_ok(umask, bitmap_size / 8)) return -EFAULT; user_access_begin(); @@ -438,7 +438,7 @@ void __user *compat_alloc_user_space(unsigned long len) ptr = arch_compat_alloc_user_space(len); - if (unlikely(!access_ok(VERIFY_WRITE, ptr, len))) + if (unlikely(!access_ok(ptr, len))) return NULL; return ptr; diff --git a/kernel/events/core.c b/kernel/events/core.c index 67ecac337374..3cd13a30f732 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -10135,7 +10135,7 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr, u32 size; int ret; - if (!access_ok(VERIFY_WRITE, uattr, PERF_ATTR_SIZE_VER0)) + if (!access_ok(uattr, PERF_ATTR_SIZE_VER0)) return -EFAULT; /* diff --git a/kernel/exit.c b/kernel/exit.c index 0e21e6d21f35..8a01b671dc1f 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -1604,7 +1604,7 @@ SYSCALL_DEFINE5(waitid, int, which, pid_t, upid, struct siginfo __user *, if (!infop) return err; - if (!access_ok(VERIFY_WRITE, infop, sizeof(*infop))) + if (!access_ok(infop, sizeof(*infop))) return -EFAULT; user_access_begin(); @@ -1732,7 +1732,7 @@ COMPAT_SYSCALL_DEFINE5(waitid, if (!infop) return err; - if (!access_ok(VERIFY_WRITE, infop, sizeof(*infop))) + if (!access_ok(infop, sizeof(*infop))) return -EFAULT; user_access_begin(); diff --git a/kernel/futex.c b/kernel/futex.c index 054105854e0e..be3bff2315ff 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -481,13 +481,18 @@ static void drop_futex_key_refs(union futex_key *key) } } +enum futex_access { + FUTEX_READ, + FUTEX_WRITE +}; + /** * get_futex_key() - Get parameters which are the keys for a futex * @uaddr: virtual address of the futex * @fshared: 0 for a PROCESS_PRIVATE futex, 1 for PROCESS_SHARED * @key: address where result is stored. - * @rw: mapping needs to be read/write (values: VERIFY_READ, - * VERIFY_WRITE) + * @rw: mapping needs to be read/write (values: FUTEX_READ, + * FUTEX_WRITE) * * Return: a negative error code or 0 * @@ -500,7 +505,7 @@ static void drop_futex_key_refs(union futex_key *key) * lock_page() might sleep, the caller should not hold a spinlock. */ static int -get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, int rw) +get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, enum futex_access rw) { unsigned long address = (unsigned long)uaddr; struct mm_struct *mm = current->mm; @@ -516,7 +521,7 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, int rw) return -EINVAL; address -= key->both.offset; - if (unlikely(!access_ok(rw, uaddr, sizeof(u32)))) + if (unlikely(!access_ok(uaddr, sizeof(u32)))) return -EFAULT; if (unlikely(should_fail_futex(fshared))) @@ -546,7 +551,7 @@ again: * If write access is not required (eg. FUTEX_WAIT), try * and get read-only access. */ - if (err == -EFAULT && rw == VERIFY_READ) { + if (err == -EFAULT && rw == FUTEX_READ) { err = get_user_pages_fast(address, 1, 0, &page); ro = 1; } @@ -1583,7 +1588,7 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset) if (!bitset) return -EINVAL; - ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key, VERIFY_READ); + ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key, FUTEX_READ); if (unlikely(ret != 0)) goto out; @@ -1642,7 +1647,7 @@ static int futex_atomic_op_inuser(unsigned int encoded_op, u32 __user *uaddr) oparg = 1 << oparg; } - if (!access_ok(VERIFY_WRITE, uaddr, sizeof(u32))) + if (!access_ok(uaddr, sizeof(u32))) return -EFAULT; ret = arch_futex_atomic_op_inuser(op, oparg, &oldval, uaddr); @@ -1682,10 +1687,10 @@ futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2, DEFINE_WAKE_Q(wake_q); retry: - ret = get_futex_key(uaddr1, flags & FLAGS_SHARED, &key1, VERIFY_READ); + ret = get_futex_key(uaddr1, flags & FLAGS_SHARED, &key1, FUTEX_READ); if (unlikely(ret != 0)) goto out; - ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2, VERIFY_WRITE); + ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2, FUTEX_WRITE); if (unlikely(ret != 0)) goto out_put_key1; @@ -1961,11 +1966,11 @@ static int futex_requeue(u32 __user *uaddr1, unsigned int flags, } retry: - ret = get_futex_key(uaddr1, flags & FLAGS_SHARED, &key1, VERIFY_READ); + ret = get_futex_key(uaddr1, flags & FLAGS_SHARED, &key1, FUTEX_READ); if (unlikely(ret != 0)) goto out; ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2, - requeue_pi ? VERIFY_WRITE : VERIFY_READ); + requeue_pi ? FUTEX_WRITE : FUTEX_READ); if (unlikely(ret != 0)) goto out_put_key1; @@ -2634,7 +2639,7 @@ static int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags, * while the syscall executes. */ retry: - ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &q->key, VERIFY_READ); + ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &q->key, FUTEX_READ); if (unlikely(ret != 0)) return ret; @@ -2793,7 +2798,7 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags, } retry: - ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &q.key, VERIFY_WRITE); + ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &q.key, FUTEX_WRITE); if (unlikely(ret != 0)) goto out; @@ -2972,7 +2977,7 @@ retry: if ((uval & FUTEX_TID_MASK) != vpid) return -EPERM; - ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key, VERIFY_WRITE); + ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key, FUTEX_WRITE); if (ret) return ret; @@ -3199,7 +3204,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags, */ rt_mutex_init_waiter(&rt_waiter); - ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2, VERIFY_WRITE); + ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2, FUTEX_WRITE); if (unlikely(ret != 0)) goto out; diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index 1306fe0c1dc6..d3d170374ceb 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -1466,7 +1466,7 @@ int do_syslog(int type, char __user *buf, int len, int source) return -EINVAL; if (!len) return 0; - if (!access_ok(VERIFY_WRITE, buf, len)) + if (!access_ok(buf, len)) return -EFAULT; error = wait_event_interruptible(log_wait, syslog_seq != log_next_seq); @@ -1484,7 +1484,7 @@ int do_syslog(int type, char __user *buf, int len, int source) return -EINVAL; if (!len) return 0; - if (!access_ok(VERIFY_WRITE, buf, len)) + if (!access_ok(buf, len)) return -EFAULT; error = syslog_print_all(buf, len, clear); break; diff --git a/kernel/ptrace.c b/kernel/ptrace.c index c2cee9db5204..771e93f9c43f 100644 --- a/kernel/ptrace.c +++ b/kernel/ptrace.c @@ -1073,7 +1073,7 @@ int ptrace_request(struct task_struct *child, long request, struct iovec kiov; struct iovec __user *uiov = datavp; - if (!access_ok(VERIFY_WRITE, uiov, sizeof(*uiov))) + if (!access_ok(uiov, sizeof(*uiov))) return -EFAULT; if (__get_user(kiov.iov_base, &uiov->iov_base) || @@ -1229,7 +1229,7 @@ int compat_ptrace_request(struct task_struct *child, compat_long_t request, compat_uptr_t ptr; compat_size_t len; - if (!access_ok(VERIFY_WRITE, uiov, sizeof(*uiov))) + if (!access_ok(uiov, sizeof(*uiov))) return -EFAULT; if (__get_user(ptr, &uiov->iov_base) || diff --git a/kernel/rseq.c b/kernel/rseq.c index c6242d8594dc..25e9a7b60eba 100644 --- a/kernel/rseq.c +++ b/kernel/rseq.c @@ -267,7 +267,7 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs) if (unlikely(t->flags & PF_EXITING)) return; - if (unlikely(!access_ok(VERIFY_WRITE, t->rseq, sizeof(*t->rseq)))) + if (unlikely(!access_ok(t->rseq, sizeof(*t->rseq)))) goto error; ret = rseq_ip_fixup(regs); if (unlikely(ret < 0)) @@ -295,7 +295,7 @@ void rseq_syscall(struct pt_regs *regs) if (!t->rseq) return; - if (!access_ok(VERIFY_READ, t->rseq, sizeof(*t->rseq)) || + if (!access_ok(t->rseq, sizeof(*t->rseq)) || rseq_get_rseq_cs(t, &rseq_cs) || in_rseq_cs(ip, &rseq_cs)) force_sig(SIGSEGV, t); } @@ -351,7 +351,7 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, if (!IS_ALIGNED((unsigned long)rseq, __alignof__(*rseq)) || rseq_len != sizeof(*rseq)) return -EINVAL; - if (!access_ok(VERIFY_WRITE, rseq, rseq_len)) + if (!access_ok(rseq, rseq_len)) return -EFAULT; current->rseq = rseq; current->rseq_len = rseq_len; diff --git a/kernel/sched/core.c b/kernel/sched/core.c index f66920173370..1f3e19fd6dc6 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4450,7 +4450,7 @@ static int sched_copy_attr(struct sched_attr __user *uattr, struct sched_attr *a u32 size; int ret; - if (!access_ok(VERIFY_WRITE, uattr, SCHED_ATTR_SIZE_VER0)) + if (!access_ok(uattr, SCHED_ATTR_SIZE_VER0)) return -EFAULT; /* Zero the full structure, so that a short copy will be nice: */ @@ -4650,7 +4650,7 @@ static int sched_read_attr(struct sched_attr __user *uattr, { int ret; - if (!access_ok(VERIFY_WRITE, uattr, usize)) + if (!access_ok(uattr, usize)) return -EFAULT; /* diff --git a/kernel/signal.c b/kernel/signal.c index 53e07d97ffe0..e1d7ad8e6ab1 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -3997,7 +3997,7 @@ SYSCALL_DEFINE3(sigaction, int, sig, if (act) { old_sigset_t mask; - if (!access_ok(VERIFY_READ, act, sizeof(*act)) || + if (!access_ok(act, sizeof(*act)) || __get_user(new_ka.sa.sa_handler, &act->sa_handler) || __get_user(new_ka.sa.sa_restorer, &act->sa_restorer) || __get_user(new_ka.sa.sa_flags, &act->sa_flags) || @@ -4012,7 +4012,7 @@ SYSCALL_DEFINE3(sigaction, int, sig, ret = do_sigaction(sig, act ? &new_ka : NULL, oact ? &old_ka : NULL); if (!ret && oact) { - if (!access_ok(VERIFY_WRITE, oact, sizeof(*oact)) || + if (!access_ok(oact, sizeof(*oact)) || __put_user(old_ka.sa.sa_handler, &oact->sa_handler) || __put_user(old_ka.sa.sa_restorer, &oact->sa_restorer) || __put_user(old_ka.sa.sa_flags, &oact->sa_flags) || @@ -4034,7 +4034,7 @@ COMPAT_SYSCALL_DEFINE3(sigaction, int, sig, compat_uptr_t handler, restorer; if (act) { - if (!access_ok(VERIFY_READ, act, sizeof(*act)) || + if (!access_ok(act, sizeof(*act)) || __get_user(handler, &act->sa_handler) || __get_user(restorer, &act->sa_restorer) || __get_user(new_ka.sa.sa_flags, &act->sa_flags) || @@ -4052,7 +4052,7 @@ COMPAT_SYSCALL_DEFINE3(sigaction, int, sig, ret = do_sigaction(sig, act ? &new_ka : NULL, oact ? &old_ka : NULL); if (!ret && oact) { - if (!access_ok(VERIFY_WRITE, oact, sizeof(*oact)) || + if (!access_ok(oact, sizeof(*oact)) || __put_user(ptr_to_compat(old_ka.sa.sa_handler), &oact->sa_handler) || __put_user(ptr_to_compat(old_ka.sa.sa_restorer), diff --git a/kernel/sys.c b/kernel/sys.c index 64b5a230f38d..a48cbf1414b8 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -2627,7 +2627,7 @@ COMPAT_SYSCALL_DEFINE1(sysinfo, struct compat_sysinfo __user *, info) s.freehigh >>= bitcount; } - if (!access_ok(VERIFY_WRITE, info, sizeof(struct compat_sysinfo)) || + if (!access_ok(info, sizeof(struct compat_sysinfo)) || __put_user(s.uptime, &info->uptime) || __put_user(s.loads[0], &info->loads[0]) || __put_user(s.loads[1], &info->loads[1]) || diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index 9ddb6fddb4e0..8b068adb9da1 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -170,7 +170,7 @@ BPF_CALL_3(bpf_probe_write_user, void *, unsafe_ptr, const void *, src, return -EPERM; if (unlikely(uaccess_kernel())) return -EPERM; - if (!access_ok(VERIFY_WRITE, unsafe_ptr, size)) + if (!access_ok(unsafe_ptr, size)) return -EPERM; return probe_kernel_write(unsafe_ptr, src, size); diff --git a/lib/bitmap.c b/lib/bitmap.c index eead55aa7170..98872e9025da 100644 --- a/lib/bitmap.c +++ b/lib/bitmap.c @@ -443,7 +443,7 @@ int bitmap_parse_user(const char __user *ubuf, unsigned int ulen, unsigned long *maskp, int nmaskbits) { - if (!access_ok(VERIFY_READ, ubuf, ulen)) + if (!access_ok(ubuf, ulen)) return -EFAULT; return __bitmap_parse((const char __force *)ubuf, ulen, 1, maskp, nmaskbits); @@ -641,7 +641,7 @@ int bitmap_parselist_user(const char __user *ubuf, unsigned int ulen, unsigned long *maskp, int nmaskbits) { - if (!access_ok(VERIFY_READ, ubuf, ulen)) + if (!access_ok(ubuf, ulen)) return -EFAULT; return __bitmap_parselist((const char __force *)ubuf, ulen, 1, maskp, nmaskbits); diff --git a/lib/iov_iter.c b/lib/iov_iter.c index 1928009f506e..c93870987b58 100644 --- a/lib/iov_iter.c +++ b/lib/iov_iter.c @@ -136,7 +136,7 @@ static int copyout(void __user *to, const void *from, size_t n) { - if (access_ok(VERIFY_WRITE, to, n)) { + if (access_ok(to, n)) { kasan_check_read(from, n); n = raw_copy_to_user(to, from, n); } @@ -145,7 +145,7 @@ static int copyout(void __user *to, const void *from, size_t n) static int copyin(void *to, const void __user *from, size_t n) { - if (access_ok(VERIFY_READ, from, n)) { + if (access_ok(from, n)) { kasan_check_write(to, n); n = raw_copy_from_user(to, from, n); } @@ -614,7 +614,7 @@ EXPORT_SYMBOL(_copy_to_iter); #ifdef CONFIG_ARCH_HAS_UACCESS_MCSAFE static int copyout_mcsafe(void __user *to, const void *from, size_t n) { - if (access_ok(VERIFY_WRITE, to, n)) { + if (access_ok(to, n)) { kasan_check_read(from, n); n = copy_to_user_mcsafe((__force void *) to, from, n); } @@ -1663,7 +1663,7 @@ int import_single_range(int rw, void __user *buf, size_t len, { if (len > MAX_RW_COUNT) len = MAX_RW_COUNT; - if (unlikely(!access_ok(!rw, buf, len))) + if (unlikely(!access_ok(buf, len))) return -EFAULT; iov->iov_base = buf; diff --git a/lib/usercopy.c b/lib/usercopy.c index 3744b2a8e591..c2bfbcaeb3dc 100644 --- a/lib/usercopy.c +++ b/lib/usercopy.c @@ -8,7 +8,7 @@ unsigned long _copy_from_user(void *to, const void __user *from, unsigned long n { unsigned long res = n; might_fault(); - if (likely(access_ok(VERIFY_READ, from, n))) { + if (likely(access_ok(from, n))) { kasan_check_write(to, n); res = raw_copy_from_user(to, from, n); } @@ -23,7 +23,7 @@ EXPORT_SYMBOL(_copy_from_user); unsigned long _copy_to_user(void __user *to, const void *from, unsigned long n) { might_fault(); - if (likely(access_ok(VERIFY_WRITE, to, n))) { + if (likely(access_ok(to, n))) { kasan_check_read(from, n); n = raw_copy_to_user(to, from, n); } diff --git a/mm/gup.c b/mm/gup.c index 8cb68a50dbdf..6f591ccb8eca 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1813,8 +1813,7 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write, len = (unsigned long) nr_pages << PAGE_SHIFT; end = start + len; - if (unlikely(!access_ok(write ? VERIFY_WRITE : VERIFY_READ, - (void __user *)start, len))) + if (unlikely(!access_ok((void __user *)start, len))) return 0; /* @@ -1868,8 +1867,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write, if (nr_pages <= 0) return 0; - if (unlikely(!access_ok(write ? VERIFY_WRITE : VERIFY_READ, - (void __user *)start, len))) + if (unlikely(!access_ok((void __user *)start, len))) return -EFAULT; if (gup_fast_permitted(start, nr_pages, write)) { diff --git a/mm/mincore.c b/mm/mincore.c index 4985965aa20a..218099b5ed31 100644 --- a/mm/mincore.c +++ b/mm/mincore.c @@ -233,14 +233,14 @@ SYSCALL_DEFINE3(mincore, unsigned long, start, size_t, len, return -EINVAL; /* ..and we need to be passed a valid user-space range */ - if (!access_ok(VERIFY_READ, (void __user *) start, len)) + if (!access_ok((void __user *) start, len)) return -ENOMEM; /* This also avoids any overflows on PAGE_ALIGN */ pages = len >> PAGE_SHIFT; pages += (offset_in_page(len)) != 0; - if (!access_ok(VERIFY_WRITE, vec, pages)) + if (!access_ok(vec, pages)) return -EFAULT; tmp = (void *) __get_free_page(GFP_USER); diff --git a/net/batman-adv/icmp_socket.c b/net/batman-adv/icmp_socket.c index d70f363c52ae..6d5859714f52 100644 --- a/net/batman-adv/icmp_socket.c +++ b/net/batman-adv/icmp_socket.c @@ -147,7 +147,7 @@ static ssize_t batadv_socket_read(struct file *file, char __user *buf, if (!buf || count < sizeof(struct batadv_icmp_packet)) return -EINVAL; - if (!access_ok(VERIFY_WRITE, buf, count)) + if (!access_ok(buf, count)) return -EFAULT; error = wait_event_interruptible(socket_client->queue_wait, diff --git a/net/batman-adv/log.c b/net/batman-adv/log.c index 02e55b78132f..75f602e1ce94 100644 --- a/net/batman-adv/log.c +++ b/net/batman-adv/log.c @@ -136,7 +136,7 @@ static ssize_t batadv_log_read(struct file *file, char __user *buf, if (count == 0) return 0; - if (!access_ok(VERIFY_WRITE, buf, count)) + if (!access_ok(buf, count)) return -EFAULT; error = wait_event_interruptible(debug_log->queue_wait, diff --git a/net/compat.c b/net/compat.c index c3a2f868e8af..959d1c51826d 100644 --- a/net/compat.c +++ b/net/compat.c @@ -358,7 +358,7 @@ static int do_set_sock_timeout(struct socket *sock, int level, if (optlen < sizeof(*up)) return -EINVAL; - if (!access_ok(VERIFY_READ, up, sizeof(*up)) || + if (!access_ok(up, sizeof(*up)) || __get_user(ktime.tv_sec, &up->tv_sec) || __get_user(ktime.tv_usec, &up->tv_usec)) return -EFAULT; @@ -438,7 +438,7 @@ static int do_get_sock_timeout(struct socket *sock, int level, int optname, if (!err) { if (put_user(sizeof(*up), optlen) || - !access_ok(VERIFY_WRITE, up, sizeof(*up)) || + !access_ok(up, sizeof(*up)) || __put_user(ktime.tv_sec, &up->tv_sec) || __put_user(ktime.tv_usec, &up->tv_usec)) err = -EFAULT; @@ -590,8 +590,8 @@ int compat_mc_setsockopt(struct sock *sock, int level, int optname, compat_alloc_user_space(sizeof(struct group_req)); u32 interface; - if (!access_ok(VERIFY_READ, gr32, sizeof(*gr32)) || - !access_ok(VERIFY_WRITE, kgr, sizeof(struct group_req)) || + if (!access_ok(gr32, sizeof(*gr32)) || + !access_ok(kgr, sizeof(struct group_req)) || __get_user(interface, &gr32->gr_interface) || __put_user(interface, &kgr->gr_interface) || copy_in_user(&kgr->gr_group, &gr32->gr_group, @@ -611,8 +611,8 @@ int compat_mc_setsockopt(struct sock *sock, int level, int optname, sizeof(struct group_source_req)); u32 interface; - if (!access_ok(VERIFY_READ, gsr32, sizeof(*gsr32)) || - !access_ok(VERIFY_WRITE, kgsr, + if (!access_ok(gsr32, sizeof(*gsr32)) || + !access_ok(kgsr, sizeof(struct group_source_req)) || __get_user(interface, &gsr32->gsr_interface) || __put_user(interface, &kgsr->gsr_interface) || @@ -631,7 +631,7 @@ int compat_mc_setsockopt(struct sock *sock, int level, int optname, struct group_filter __user *kgf; u32 interface, fmode, numsrc; - if (!access_ok(VERIFY_READ, gf32, __COMPAT_GF0_SIZE) || + if (!access_ok(gf32, __COMPAT_GF0_SIZE) || __get_user(interface, &gf32->gf_interface) || __get_user(fmode, &gf32->gf_fmode) || __get_user(numsrc, &gf32->gf_numsrc)) @@ -641,7 +641,7 @@ int compat_mc_setsockopt(struct sock *sock, int level, int optname, if (koptlen < GROUP_FILTER_SIZE(numsrc)) return -EINVAL; kgf = compat_alloc_user_space(koptlen); - if (!access_ok(VERIFY_WRITE, kgf, koptlen) || + if (!access_ok(kgf, koptlen) || __put_user(interface, &kgf->gf_interface) || __put_user(fmode, &kgf->gf_fmode) || __put_user(numsrc, &kgf->gf_numsrc) || @@ -675,7 +675,7 @@ int compat_mc_getsockopt(struct sock *sock, int level, int optname, return getsockopt(sock, level, optname, optval, optlen); koptlen = compat_alloc_user_space(sizeof(*koptlen)); - if (!access_ok(VERIFY_READ, optlen, sizeof(*optlen)) || + if (!access_ok(optlen, sizeof(*optlen)) || __get_user(ulen, optlen)) return -EFAULT; @@ -685,14 +685,14 @@ int compat_mc_getsockopt(struct sock *sock, int level, int optname, if (klen < GROUP_FILTER_SIZE(0)) return -EINVAL; - if (!access_ok(VERIFY_WRITE, koptlen, sizeof(*koptlen)) || + if (!access_ok(koptlen, sizeof(*koptlen)) || __put_user(klen, koptlen)) return -EFAULT; /* have to allow space for previous compat_alloc_user_space, too */ kgf = compat_alloc_user_space(klen+sizeof(*optlen)); - if (!access_ok(VERIFY_READ, gf32, __COMPAT_GF0_SIZE) || + if (!access_ok(gf32, __COMPAT_GF0_SIZE) || __get_user(interface, &gf32->gf_interface) || __get_user(fmode, &gf32->gf_fmode) || __get_user(numsrc, &gf32->gf_numsrc) || @@ -706,18 +706,18 @@ int compat_mc_getsockopt(struct sock *sock, int level, int optname, if (err) return err; - if (!access_ok(VERIFY_READ, koptlen, sizeof(*koptlen)) || + if (!access_ok(koptlen, sizeof(*koptlen)) || __get_user(klen, koptlen)) return -EFAULT; ulen = klen - (sizeof(*kgf)-sizeof(*gf32)); - if (!access_ok(VERIFY_WRITE, optlen, sizeof(*optlen)) || + if (!access_ok(optlen, sizeof(*optlen)) || __put_user(ulen, optlen)) return -EFAULT; - if (!access_ok(VERIFY_READ, kgf, klen) || - !access_ok(VERIFY_WRITE, gf32, ulen) || + if (!access_ok(kgf, klen) || + !access_ok(gf32, ulen) || __get_user(interface, &kgf->gf_interface) || __get_user(fmode, &kgf->gf_fmode) || __get_user(numsrc, &kgf->gf_numsrc) || diff --git a/net/sunrpc/sysctl.c b/net/sunrpc/sysctl.c index 8c3936403fea..0bea8ff8b0d3 100644 --- a/net/sunrpc/sysctl.c +++ b/net/sunrpc/sysctl.c @@ -89,7 +89,7 @@ proc_dodebug(struct ctl_table *table, int write, left = *lenp; if (write) { - if (!access_ok(VERIFY_READ, buffer, left)) + if (!access_ok(buffer, left)) return -EFAULT; p = buffer; while (left && __get_user(c, p) >= 0 && isspace(c)) diff --git a/security/tomoyo/common.c b/security/tomoyo/common.c index 9b38f94b5dd0..c598aa00d5e3 100644 --- a/security/tomoyo/common.c +++ b/security/tomoyo/common.c @@ -2591,7 +2591,7 @@ ssize_t tomoyo_write_control(struct tomoyo_io_buffer *head, int idx; if (!head->write) return -ENOSYS; - if (!access_ok(VERIFY_READ, buffer, buffer_len)) + if (!access_ok(buffer, buffer_len)) return -EFAULT; if (mutex_lock_interruptible(&head->io_sem)) return -EINTR; diff --git a/sound/core/seq/seq_clientmgr.c b/sound/core/seq/seq_clientmgr.c index 92e6524a3a9d..7d4640d1fe9f 100644 --- a/sound/core/seq/seq_clientmgr.c +++ b/sound/core/seq/seq_clientmgr.c @@ -393,7 +393,7 @@ static ssize_t snd_seq_read(struct file *file, char __user *buf, size_t count, if (!(snd_seq_file_flags(file) & SNDRV_SEQ_LFLG_INPUT)) return -ENXIO; - if (!access_ok(VERIFY_WRITE, buf, count)) + if (!access_ok(buf, count)) return -EFAULT; /* check client structures are in place */ diff --git a/sound/isa/sb/emu8000_patch.c b/sound/isa/sb/emu8000_patch.c index d45a6b9d6437..3d44c358c4b3 100644 --- a/sound/isa/sb/emu8000_patch.c +++ b/sound/isa/sb/emu8000_patch.c @@ -183,10 +183,10 @@ snd_emu8000_sample_new(struct snd_emux *rec, struct snd_sf_sample *sp, } if (sp->v.mode_flags & SNDRV_SFNT_SAMPLE_8BITS) { - if (!access_ok(VERIFY_READ, data, sp->v.size)) + if (!access_ok(data, sp->v.size)) return -EFAULT; } else { - if (!access_ok(VERIFY_READ, data, sp->v.size * 2)) + if (!access_ok(data, sp->v.size * 2)) return -EFAULT; } diff --git a/tools/perf/util/include/asm/uaccess.h b/tools/perf/util/include/asm/uaccess.h index 6a6f4b990547..548100315710 100644 --- a/tools/perf/util/include/asm/uaccess.h +++ b/tools/perf/util/include/asm/uaccess.h @@ -10,6 +10,6 @@ #define get_user __get_user -#define access_ok(type, addr, size) 1 +#define access_ok(addr, size) 1 #endif diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 666d0155662d..1f888a103f78 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -939,8 +939,7 @@ int __kvm_set_memory_region(struct kvm *kvm, /* We can read the guest memory with __xxx_user() later on. */ if ((id < KVM_USER_MEM_SLOTS) && ((mem->userspace_addr & (PAGE_SIZE - 1)) || - !access_ok(VERIFY_WRITE, - (void __user *)(unsigned long)mem->userspace_addr, + !access_ok((void __user *)(unsigned long)mem->userspace_addr, mem->memory_size))) goto out; if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM) -- cgit v1.3-7-g2ca7 From 2e05ea5cdc1ac55d9ef678ed5ea6c38acf7fd2a3 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Tue, 25 Dec 2018 08:50:35 +0100 Subject: dma-mapping: implement dma_map_single_attrs using dma_map_page_attrs And also switch the way we implement the unmap side around to stay consistent. This ensures dma-debug works again because it records which function we used for mapping to ensure it is also used for unmapping, and also reduces further code duplication. Last but not least this also officially allows calling dma_sync_single_* for mappings created using dma_map_page, which is perfectly fine given that the sync calls only take a dma_addr_t, but not a virtual address or struct page. Fixes: 7f0fee242e ("dma-mapping: merge dma_unmap_page_attrs and dma_unmap_single_attrs") Signed-off-by: Christoph Hellwig Tested-by: LABBE Corentin --- include/linux/dma-debug.h | 11 +++----- include/linux/dma-mapping.h | 66 +++++++++++++++++---------------------------- kernel/dma/debug.c | 17 +++--------- 3 files changed, 32 insertions(+), 62 deletions(-) (limited to 'kernel') diff --git a/include/linux/dma-debug.h b/include/linux/dma-debug.h index 2ad5c363d7d5..cb422cbe587d 100644 --- a/include/linux/dma-debug.h +++ b/include/linux/dma-debug.h @@ -35,13 +35,12 @@ extern void debug_dma_map_single(struct device *dev, const void *addr, extern void debug_dma_map_page(struct device *dev, struct page *page, size_t offset, size_t size, - int direction, dma_addr_t dma_addr, - bool map_single); + int direction, dma_addr_t dma_addr); extern void debug_dma_mapping_error(struct device *dev, dma_addr_t dma_addr); extern void debug_dma_unmap_page(struct device *dev, dma_addr_t addr, - size_t size, int direction, bool map_single); + size_t size, int direction); extern void debug_dma_map_sg(struct device *dev, struct scatterlist *sg, int nents, int mapped_ents, int direction); @@ -95,8 +94,7 @@ static inline void debug_dma_map_single(struct device *dev, const void *addr, static inline void debug_dma_map_page(struct device *dev, struct page *page, size_t offset, size_t size, - int direction, dma_addr_t dma_addr, - bool map_single) + int direction, dma_addr_t dma_addr) { } @@ -106,8 +104,7 @@ static inline void debug_dma_mapping_error(struct device *dev, } static inline void debug_dma_unmap_page(struct device *dev, dma_addr_t addr, - size_t size, int direction, - bool map_single) + size_t size, int direction) { } diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h index ba521d5506c9..0452a8be2789 100644 --- a/include/linux/dma-mapping.h +++ b/include/linux/dma-mapping.h @@ -284,32 +284,25 @@ static inline void dma_direct_sync_sg_for_cpu(struct device *dev, } #endif -static inline dma_addr_t dma_map_single_attrs(struct device *dev, void *ptr, - size_t size, - enum dma_data_direction dir, - unsigned long attrs) +static inline dma_addr_t dma_map_page_attrs(struct device *dev, + struct page *page, size_t offset, size_t size, + enum dma_data_direction dir, unsigned long attrs) { const struct dma_map_ops *ops = get_dma_ops(dev); dma_addr_t addr; BUG_ON(!valid_dma_direction(dir)); - debug_dma_map_single(dev, ptr, size); if (dma_is_direct(ops)) - addr = dma_direct_map_page(dev, virt_to_page(ptr), - offset_in_page(ptr), size, dir, attrs); + addr = dma_direct_map_page(dev, page, offset, size, dir, attrs); else - addr = ops->map_page(dev, virt_to_page(ptr), - offset_in_page(ptr), size, dir, attrs); - debug_dma_map_page(dev, virt_to_page(ptr), - offset_in_page(ptr), size, - dir, addr, true); + addr = ops->map_page(dev, page, offset, size, dir, attrs); + debug_dma_map_page(dev, page, offset, size, dir, addr); + return addr; } -static inline void dma_unmap_single_attrs(struct device *dev, dma_addr_t addr, - size_t size, - enum dma_data_direction dir, - unsigned long attrs) +static inline void dma_unmap_page_attrs(struct device *dev, dma_addr_t addr, + size_t size, enum dma_data_direction dir, unsigned long attrs) { const struct dma_map_ops *ops = get_dma_ops(dev); @@ -318,13 +311,7 @@ static inline void dma_unmap_single_attrs(struct device *dev, dma_addr_t addr, dma_direct_unmap_page(dev, addr, size, dir, attrs); else if (ops->unmap_page) ops->unmap_page(dev, addr, size, dir, attrs); - debug_dma_unmap_page(dev, addr, size, dir, true); -} - -static inline void dma_unmap_page_attrs(struct device *dev, dma_addr_t addr, - size_t size, enum dma_data_direction dir, unsigned long attrs) -{ - return dma_unmap_single_attrs(dev, addr, size, dir, attrs); + debug_dma_unmap_page(dev, addr, size, dir); } /* @@ -363,25 +350,6 @@ static inline void dma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg ops->unmap_sg(dev, sg, nents, dir, attrs); } -static inline dma_addr_t dma_map_page_attrs(struct device *dev, - struct page *page, - size_t offset, size_t size, - enum dma_data_direction dir, - unsigned long attrs) -{ - const struct dma_map_ops *ops = get_dma_ops(dev); - dma_addr_t addr; - - BUG_ON(!valid_dma_direction(dir)); - if (dma_is_direct(ops)) - addr = dma_direct_map_page(dev, page, offset, size, dir, attrs); - else - addr = ops->map_page(dev, page, offset, size, dir, attrs); - debug_dma_map_page(dev, page, offset, size, dir, addr, false); - - return addr; -} - static inline dma_addr_t dma_map_resource(struct device *dev, phys_addr_t phys_addr, size_t size, @@ -488,6 +456,20 @@ dma_sync_sg_for_device(struct device *dev, struct scatterlist *sg, } +static inline dma_addr_t dma_map_single_attrs(struct device *dev, void *ptr, + size_t size, enum dma_data_direction dir, unsigned long attrs) +{ + debug_dma_map_single(dev, ptr, size); + return dma_map_page_attrs(dev, virt_to_page(ptr), offset_in_page(ptr), + size, dir, attrs); +} + +static inline void dma_unmap_single_attrs(struct device *dev, dma_addr_t addr, + size_t size, enum dma_data_direction dir, unsigned long attrs) +{ + return dma_unmap_page_attrs(dev, addr, size, dir, attrs); +} + #define dma_map_single(d, a, s, r) dma_map_single_attrs(d, a, s, r, 0) #define dma_unmap_single(d, a, s, r) dma_unmap_single_attrs(d, a, s, r, 0) #define dma_map_sg(d, s, n, r) dma_map_sg_attrs(d, s, n, r, 0) diff --git a/kernel/dma/debug.c b/kernel/dma/debug.c index 164706da2a73..1e0157113d15 100644 --- a/kernel/dma/debug.c +++ b/kernel/dma/debug.c @@ -49,7 +49,6 @@ enum { dma_debug_single, - dma_debug_page, dma_debug_sg, dma_debug_coherent, dma_debug_resource, @@ -1300,8 +1299,7 @@ void debug_dma_map_single(struct device *dev, const void *addr, EXPORT_SYMBOL(debug_dma_map_single); void debug_dma_map_page(struct device *dev, struct page *page, size_t offset, - size_t size, int direction, dma_addr_t dma_addr, - bool map_single) + size_t size, int direction, dma_addr_t dma_addr) { struct dma_debug_entry *entry; @@ -1316,7 +1314,7 @@ void debug_dma_map_page(struct device *dev, struct page *page, size_t offset, return; entry->dev = dev; - entry->type = dma_debug_page; + entry->type = dma_debug_single; entry->pfn = page_to_pfn(page); entry->offset = offset, entry->dev_addr = dma_addr; @@ -1324,9 +1322,6 @@ void debug_dma_map_page(struct device *dev, struct page *page, size_t offset, entry->direction = direction; entry->map_err_type = MAP_ERR_NOT_CHECKED; - if (map_single) - entry->type = dma_debug_single; - check_for_stack(dev, page, offset); if (!PageHighMem(page)) { @@ -1378,10 +1373,10 @@ void debug_dma_mapping_error(struct device *dev, dma_addr_t dma_addr) EXPORT_SYMBOL(debug_dma_mapping_error); void debug_dma_unmap_page(struct device *dev, dma_addr_t addr, - size_t size, int direction, bool map_single) + size_t size, int direction) { struct dma_debug_entry ref = { - .type = dma_debug_page, + .type = dma_debug_single, .dev = dev, .dev_addr = addr, .size = size, @@ -1390,10 +1385,6 @@ void debug_dma_unmap_page(struct device *dev, dma_addr_t addr, if (unlikely(dma_debug_disabled())) return; - - if (map_single) - ref.type = dma_debug_single; - check_unmap(&ref); } EXPORT_SYMBOL(debug_dma_unmap_page); -- cgit v1.3-7-g2ca7 From d7076f07840851bbe57cb21ba052d6a4a9b1efa9 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Tue, 25 Dec 2018 17:44:19 +0100 Subject: dma-mapping: implement dmam_alloc_coherent using dmam_alloc_attrs dmam_alloc_coherent is just the default no-flags case of dmam_alloc_attrs, so take advantage of this similar to the non-managed version. Signed-off-by: Christoph Hellwig --- include/linux/dma-mapping.h | 20 +++++++++++++------- kernel/dma/mapping.c | 39 --------------------------------------- 2 files changed, 13 insertions(+), 46 deletions(-) (limited to 'kernel') diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h index 0452a8be2789..fa2ebe8ad4d0 100644 --- a/include/linux/dma-mapping.h +++ b/include/linux/dma-mapping.h @@ -677,21 +677,20 @@ dma_mark_declared_memory_occupied(struct device *dev, * Managed DMA API */ #ifdef CONFIG_HAS_DMA -extern void *dmam_alloc_coherent(struct device *dev, size_t size, - dma_addr_t *dma_handle, gfp_t gfp); +extern void *dmam_alloc_attrs(struct device *dev, size_t size, + dma_addr_t *dma_handle, gfp_t gfp, + unsigned long attrs); extern void dmam_free_coherent(struct device *dev, size_t size, void *vaddr, dma_addr_t dma_handle); #else /* !CONFIG_HAS_DMA */ -static inline void *dmam_alloc_coherent(struct device *dev, size_t size, - dma_addr_t *dma_handle, gfp_t gfp) +static inline void *dmam_alloc_attrs(struct device *dev, size_t size, + dma_addr_t *dma_handle, gfp_t gfp, + unsigned long attrs) { return NULL; } static inline void dmam_free_coherent(struct device *dev, size_t size, void *vaddr, dma_addr_t dma_handle) { } #endif /* !CONFIG_HAS_DMA */ -extern void *dmam_alloc_attrs(struct device *dev, size_t size, - dma_addr_t *dma_handle, gfp_t gfp, - unsigned long attrs); #ifdef CONFIG_HAVE_GENERIC_DMA_COHERENT extern int dmam_declare_coherent_memory(struct device *dev, phys_addr_t phys_addr, @@ -711,6 +710,13 @@ static inline void dmam_release_declared_memory(struct device *dev) } #endif /* CONFIG_HAVE_GENERIC_DMA_COHERENT */ +static inline void *dmam_alloc_coherent(struct device *dev, size_t size, + dma_addr_t *dma_handle, gfp_t gfp) +{ + return dmam_alloc_attrs(dev, size, dma_handle, gfp, + (gfp & __GFP_NOWARN) ? DMA_ATTR_NO_WARN : 0); +} + static inline void *dma_alloc_wc(struct device *dev, size_t size, dma_addr_t *dma_addr, gfp_t gfp) { diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c index d7c34d2d1ba5..f00544cda4e9 100644 --- a/kernel/dma/mapping.c +++ b/kernel/dma/mapping.c @@ -45,45 +45,6 @@ static int dmam_match(struct device *dev, void *res, void *match_data) return 0; } -/** - * dmam_alloc_coherent - Managed dma_alloc_coherent() - * @dev: Device to allocate coherent memory for - * @size: Size of allocation - * @dma_handle: Out argument for allocated DMA handle - * @gfp: Allocation flags - * - * Managed dma_alloc_coherent(). Memory allocated using this function - * will be automatically released on driver detach. - * - * RETURNS: - * Pointer to allocated memory on success, NULL on failure. - */ -void *dmam_alloc_coherent(struct device *dev, size_t size, - dma_addr_t *dma_handle, gfp_t gfp) -{ - struct dma_devres *dr; - void *vaddr; - - dr = devres_alloc(dmam_release, sizeof(*dr), gfp); - if (!dr) - return NULL; - - vaddr = dma_alloc_coherent(dev, size, dma_handle, gfp); - if (!vaddr) { - devres_free(dr); - return NULL; - } - - dr->vaddr = vaddr; - dr->dma_handle = *dma_handle; - dr->size = size; - - devres_add(dev, dr); - - return vaddr; -} -EXPORT_SYMBOL(dmam_alloc_coherent); - /** * dmam_free_coherent - Managed dma_free_coherent() * @dev: Device to free coherent memory for -- cgit v1.3-7-g2ca7 From 4788ba5792cc1368ba4867e1488dc168b4fe97b7 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Wed, 26 Dec 2018 07:51:44 +0100 Subject: dma-mapping: remove dmam_{declare,release}_coherent_memory These functions have never been used. Signed-off-by: Christoph Hellwig --- Documentation/driver-model/devres.txt | 1 - include/linux/dma-mapping.h | 19 ------------ kernel/dma/mapping.c | 55 ----------------------------------- 3 files changed, 75 deletions(-) (limited to 'kernel') diff --git a/Documentation/driver-model/devres.txt b/Documentation/driver-model/devres.txt index 841c99529d27..b277cafce71e 100644 --- a/Documentation/driver-model/devres.txt +++ b/Documentation/driver-model/devres.txt @@ -250,7 +250,6 @@ DMA dmaenginem_async_device_register() dmam_alloc_coherent() dmam_alloc_attrs() - dmam_declare_coherent_memory() dmam_free_coherent() dmam_pool_create() dmam_pool_destroy() diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h index fa2ebe8ad4d0..937c2a949fca 100644 --- a/include/linux/dma-mapping.h +++ b/include/linux/dma-mapping.h @@ -691,25 +691,6 @@ static inline void dmam_free_coherent(struct device *dev, size_t size, void *vaddr, dma_addr_t dma_handle) { } #endif /* !CONFIG_HAS_DMA */ -#ifdef CONFIG_HAVE_GENERIC_DMA_COHERENT -extern int dmam_declare_coherent_memory(struct device *dev, - phys_addr_t phys_addr, - dma_addr_t device_addr, size_t size, - int flags); -extern void dmam_release_declared_memory(struct device *dev); -#else /* CONFIG_HAVE_GENERIC_DMA_COHERENT */ -static inline int dmam_declare_coherent_memory(struct device *dev, - phys_addr_t phys_addr, dma_addr_t device_addr, - size_t size, gfp_t gfp) -{ - return 0; -} - -static inline void dmam_release_declared_memory(struct device *dev) -{ -} -#endif /* CONFIG_HAVE_GENERIC_DMA_COHERENT */ - static inline void *dmam_alloc_coherent(struct device *dev, size_t size, dma_addr_t *dma_handle, gfp_t gfp) { diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c index f00544cda4e9..a11006b6d8e8 100644 --- a/kernel/dma/mapping.c +++ b/kernel/dma/mapping.c @@ -105,61 +105,6 @@ void *dmam_alloc_attrs(struct device *dev, size_t size, dma_addr_t *dma_handle, } EXPORT_SYMBOL(dmam_alloc_attrs); -#ifdef CONFIG_HAVE_GENERIC_DMA_COHERENT - -static void dmam_coherent_decl_release(struct device *dev, void *res) -{ - dma_release_declared_memory(dev); -} - -/** - * dmam_declare_coherent_memory - Managed dma_declare_coherent_memory() - * @dev: Device to declare coherent memory for - * @phys_addr: Physical address of coherent memory to be declared - * @device_addr: Device address of coherent memory to be declared - * @size: Size of coherent memory to be declared - * @flags: Flags - * - * Managed dma_declare_coherent_memory(). - * - * RETURNS: - * 0 on success, -errno on failure. - */ -int dmam_declare_coherent_memory(struct device *dev, phys_addr_t phys_addr, - dma_addr_t device_addr, size_t size, int flags) -{ - void *res; - int rc; - - res = devres_alloc(dmam_coherent_decl_release, 0, GFP_KERNEL); - if (!res) - return -ENOMEM; - - rc = dma_declare_coherent_memory(dev, phys_addr, device_addr, size, - flags); - if (!rc) - devres_add(dev, res); - else - devres_free(res); - - return rc; -} -EXPORT_SYMBOL(dmam_declare_coherent_memory); - -/** - * dmam_release_declared_memory - Managed dma_release_declared_memory(). - * @dev: Device to release declared coherent memory for - * - * Managed dmam_release_declared_memory(). - */ -void dmam_release_declared_memory(struct device *dev) -{ - WARN_ON(devres_destroy(dev, dmam_coherent_decl_release, NULL, NULL)); -} -EXPORT_SYMBOL(dmam_release_declared_memory); - -#endif - /* * Create scatter-list for the already allocated DMA buffer. */ -- cgit v1.3-7-g2ca7 From 48e638fb68be8fecdca0611beff53a9c947704e3 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Tue, 1 Jan 2019 17:14:39 +0100 Subject: dma-mapping: remove a few unused exports Now that the slow path DMA API calls are implemented out of line a few helpers only used by them don't need to be exported anymore. Signed-off-by: Christoph Hellwig --- kernel/dma/coherent.c | 2 -- kernel/dma/debug.c | 2 -- 2 files changed, 4 deletions(-) (limited to 'kernel') diff --git a/kernel/dma/coherent.c b/kernel/dma/coherent.c index 597d40893862..66f0fb7e9a3a 100644 --- a/kernel/dma/coherent.c +++ b/kernel/dma/coherent.c @@ -223,7 +223,6 @@ int dma_alloc_from_dev_coherent(struct device *dev, ssize_t size, */ return mem->flags & DMA_MEMORY_EXCLUSIVE; } -EXPORT_SYMBOL(dma_alloc_from_dev_coherent); void *dma_alloc_from_global_coherent(ssize_t size, dma_addr_t *dma_handle) { @@ -268,7 +267,6 @@ int dma_release_from_dev_coherent(struct device *dev, int order, void *vaddr) return __dma_release_from_coherent(mem, order, vaddr); } -EXPORT_SYMBOL(dma_release_from_dev_coherent); int dma_release_from_global_coherent(int order, void *vaddr) { diff --git a/kernel/dma/debug.c b/kernel/dma/debug.c index 1e0157113d15..23cf5361bcf1 100644 --- a/kernel/dma/debug.c +++ b/kernel/dma/debug.c @@ -1512,7 +1512,6 @@ void debug_dma_alloc_coherent(struct device *dev, size_t size, add_dma_entry(entry); } -EXPORT_SYMBOL(debug_dma_alloc_coherent); void debug_dma_free_coherent(struct device *dev, size_t size, void *virt, dma_addr_t addr) @@ -1540,7 +1539,6 @@ void debug_dma_free_coherent(struct device *dev, size_t size, check_unmap(&ref); } -EXPORT_SYMBOL(debug_dma_free_coherent); void debug_dma_map_resource(struct device *dev, phys_addr_t addr, size_t size, int direction, dma_addr_t dma_addr) -- cgit v1.3-7-g2ca7 From 594cc251fdd0d231d342d88b2fdff4bc42fb0690 Mon Sep 17 00:00:00 2001 From: Linus Torvalds Date: Fri, 4 Jan 2019 12:56:09 -0800 Subject: make 'user_access_begin()' do 'access_ok()' Originally, the rule used to be that you'd have to do access_ok() separately, and then user_access_begin() before actually doing the direct (optimized) user access. But experience has shown that people then decide not to do access_ok() at all, and instead rely on it being implied by other operations or similar. Which makes it very hard to verify that the access has actually been range-checked. If you use the unsafe direct user accesses, hardware features (either SMAP - Supervisor Mode Access Protection - on x86, or PAN - Privileged Access Never - on ARM) do force you to use user_access_begin(). But nothing really forces the range check. By putting the range check into user_access_begin(), we actually force people to do the right thing (tm), and the range check vill be visible near the actual accesses. We have way too long a history of people trying to avoid them. Signed-off-by: Linus Torvalds --- arch/x86/include/asm/uaccess.h | 9 ++++++++- drivers/gpu/drm/i915/i915_gem_execbuffer.c | 15 +++++++++++++-- include/linux/uaccess.h | 2 +- kernel/compat.c | 6 ++---- kernel/exit.c | 6 ++---- lib/strncpy_from_user.c | 9 +++++---- lib/strnlen_user.c | 9 +++++---- 7 files changed, 36 insertions(+), 20 deletions(-) (limited to 'kernel') diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h index 3920f456db79..a87ab5290ab4 100644 --- a/arch/x86/include/asm/uaccess.h +++ b/arch/x86/include/asm/uaccess.h @@ -705,7 +705,14 @@ extern struct movsl_mask { * checking before using them, but you have to surround them with the * user_access_begin/end() pair. */ -#define user_access_begin() __uaccess_begin() +static __must_check inline bool user_access_begin(const void __user *ptr, size_t len) +{ + if (unlikely(!access_ok(ptr,len))) + return 0; + __uaccess_begin(); + return 1; +} +#define user_access_begin(a,b) user_access_begin(a,b) #define user_access_end() __uaccess_end() #define unsafe_put_user(x, ptr, err_label) \ diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c index 55d8f9b8777f..485b259127c3 100644 --- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c +++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c @@ -1624,7 +1624,9 @@ end_user: * happened we would make the mistake of assuming that the * relocations were valid. */ - user_access_begin(); + if (!user_access_begin(urelocs, size)) + goto end_user; + for (copied = 0; copied < nreloc; copied++) unsafe_put_user(-1, &urelocs[copied].presumed_offset, @@ -2606,7 +2608,16 @@ i915_gem_execbuffer2_ioctl(struct drm_device *dev, void *data, unsigned int i; /* Copy the new buffer offsets back to the user's exec list. */ - user_access_begin(); + /* + * Note: count * sizeof(*user_exec_list) does not overflow, + * because we checked 'count' in check_buffer_count(). + * + * And this range already got effectively checked earlier + * when we did the "copy_from_user()" above. + */ + if (!user_access_begin(user_exec_list, count * sizeof(*user_exec_list))) + goto end_user; + for (i = 0; i < args->buffer_count; i++) { if (!(exec2_list[i].offset & UPDATE)) continue; diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h index bf2523867a02..37b226e8df13 100644 --- a/include/linux/uaccess.h +++ b/include/linux/uaccess.h @@ -264,7 +264,7 @@ extern long strncpy_from_unsafe(char *dst, const void *unsafe_addr, long count); probe_kernel_read(&retval, addr, sizeof(retval)) #ifndef user_access_begin -#define user_access_begin() do { } while (0) +#define user_access_begin(ptr,len) access_ok(ptr, len) #define user_access_end() do { } while (0) #define unsafe_get_user(x, ptr, err) do { if (unlikely(__get_user(x, ptr))) goto err; } while (0) #define unsafe_put_user(x, ptr, err) do { if (unlikely(__put_user(x, ptr))) goto err; } while (0) diff --git a/kernel/compat.c b/kernel/compat.c index 705d4ae6c018..f01affa17e22 100644 --- a/kernel/compat.c +++ b/kernel/compat.c @@ -354,10 +354,9 @@ long compat_get_bitmap(unsigned long *mask, const compat_ulong_t __user *umask, bitmap_size = ALIGN(bitmap_size, BITS_PER_COMPAT_LONG); nr_compat_longs = BITS_TO_COMPAT_LONGS(bitmap_size); - if (!access_ok(umask, bitmap_size / 8)) + if (!user_access_begin(umask, bitmap_size / 8)) return -EFAULT; - user_access_begin(); while (nr_compat_longs > 1) { compat_ulong_t l1, l2; unsafe_get_user(l1, umask++, Efault); @@ -384,10 +383,9 @@ long compat_put_bitmap(compat_ulong_t __user *umask, unsigned long *mask, bitmap_size = ALIGN(bitmap_size, BITS_PER_COMPAT_LONG); nr_compat_longs = BITS_TO_COMPAT_LONGS(bitmap_size); - if (!access_ok(umask, bitmap_size / 8)) + if (!user_access_begin(umask, bitmap_size / 8)) return -EFAULT; - user_access_begin(); while (nr_compat_longs > 1) { unsigned long m = *mask++; unsafe_put_user((compat_ulong_t)m, umask++, Efault); diff --git a/kernel/exit.c b/kernel/exit.c index 8a01b671dc1f..2d14979577ee 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -1604,10 +1604,9 @@ SYSCALL_DEFINE5(waitid, int, which, pid_t, upid, struct siginfo __user *, if (!infop) return err; - if (!access_ok(infop, sizeof(*infop))) + if (!user_access_begin(infop, sizeof(*infop))) return -EFAULT; - user_access_begin(); unsafe_put_user(signo, &infop->si_signo, Efault); unsafe_put_user(0, &infop->si_errno, Efault); unsafe_put_user(info.cause, &infop->si_code, Efault); @@ -1732,10 +1731,9 @@ COMPAT_SYSCALL_DEFINE5(waitid, if (!infop) return err; - if (!access_ok(infop, sizeof(*infop))) + if (!user_access_begin(infop, sizeof(*infop))) return -EFAULT; - user_access_begin(); unsafe_put_user(signo, &infop->si_signo, Efault); unsafe_put_user(0, &infop->si_errno, Efault); unsafe_put_user(info.cause, &infop->si_code, Efault); diff --git a/lib/strncpy_from_user.c b/lib/strncpy_from_user.c index b53e1b5d80f4..58eacd41526c 100644 --- a/lib/strncpy_from_user.c +++ b/lib/strncpy_from_user.c @@ -114,10 +114,11 @@ long strncpy_from_user(char *dst, const char __user *src, long count) kasan_check_write(dst, count); check_object_size(dst, count, false); - user_access_begin(); - retval = do_strncpy_from_user(dst, src, count, max); - user_access_end(); - return retval; + if (user_access_begin(src, max)) { + retval = do_strncpy_from_user(dst, src, count, max); + user_access_end(); + return retval; + } } return -EFAULT; } diff --git a/lib/strnlen_user.c b/lib/strnlen_user.c index 60d0bbda8f5e..1c1a1b0e38a5 100644 --- a/lib/strnlen_user.c +++ b/lib/strnlen_user.c @@ -114,10 +114,11 @@ long strnlen_user(const char __user *str, long count) unsigned long max = max_addr - src_addr; long retval; - user_access_begin(); - retval = do_strnlen_user(str, count, max); - user_access_end(); - return retval; + if (user_access_begin(str, max)) { + retval = do_strnlen_user(str, count, max); + user_access_end(); + return retval; + } } return 0; } -- cgit v1.3-7-g2ca7 From 09be178400829dddc1189b50a7888495dd26aa84 Mon Sep 17 00:00:00 2001 From: Cheng Lin Date: Thu, 3 Jan 2019 15:26:13 -0800 Subject: proc/sysctl: fix return error for proc_doulongvec_minmax() If the number of input parameters is less than the total parameters, an EINVAL error will be returned. For example, we use proc_doulongvec_minmax to pass up to two parameters with kern_table: { .procname = "monitor_signals", .data = &monitor_sigs, .maxlen = 2*sizeof(unsigned long), .mode = 0644, .proc_handler = proc_doulongvec_minmax, }, Reproduce: When passing two parameters, it's work normal. But passing only one parameter, an error "Invalid argument"(EINVAL) is returned. [root@cl150 ~]# echo 1 2 > /proc/sys/kernel/monitor_signals [root@cl150 ~]# cat /proc/sys/kernel/monitor_signals 1 2 [root@cl150 ~]# echo 3 > /proc/sys/kernel/monitor_signals -bash: echo: write error: Invalid argument [root@cl150 ~]# echo $? 1 [root@cl150 ~]# cat /proc/sys/kernel/monitor_signals 3 2 [root@cl150 ~]# The following is the result after apply this patch. No error is returned when the number of input parameters is less than the total parameters. [root@cl150 ~]# echo 1 2 > /proc/sys/kernel/monitor_signals [root@cl150 ~]# cat /proc/sys/kernel/monitor_signals 1 2 [root@cl150 ~]# echo 3 > /proc/sys/kernel/monitor_signals [root@cl150 ~]# echo $? 0 [root@cl150 ~]# cat /proc/sys/kernel/monitor_signals 3 2 [root@cl150 ~]# There are three processing functions dealing with digital parameters, __do_proc_dointvec/__do_proc_douintvec/__do_proc_doulongvec_minmax. This patch deals with __do_proc_doulongvec_minmax, just as __do_proc_dointvec does, adding a check for parameters 'left'. In __do_proc_douintvec, its code implementation explicitly does not support multiple inputs. static int __do_proc_douintvec(...){ ... /* * Arrays are not supported, keep this simple. *Do not* add * support for them. */ if (vleft != 1) { *lenp = 0; return -EINVAL; } ... } So, just __do_proc_doulongvec_minmax has the problem. And most use of proc_doulongvec_minmax/proc_doulongvec_ms_jiffies_minmax just have one parameter. Link: http://lkml.kernel.org/r/1544081775-15720-1-git-send-email-cheng.lin130@zte.com.cn Signed-off-by: Cheng Lin Acked-by: Luis Chamberlain Reviewed-by: Kees Cook Cc: Alexey Dobriyan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/sysctl.c | 2 ++ 1 file changed, 2 insertions(+) (limited to 'kernel') diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 1825f712e73b..7f6c1a3b3485 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -2787,6 +2787,8 @@ static int __do_proc_doulongvec_minmax(void *data, struct ctl_table *table, int bool neg; left -= proc_skip_spaces(&p); + if (!left) + break; err = proc_get_long(&p, &left, &val, &neg, proc_wspace_sep, -- cgit v1.3-7-g2ca7 From 168e06f7937d96c7222037d8a05565e8a6eb00fe Mon Sep 17 00:00:00 2001 From: "Liu, Chuansheng" Date: Thu, 3 Jan 2019 15:26:27 -0800 Subject: kernel/hung_task.c: force console verbose before panic Based on commit 401c636a0eeb ("kernel/hung_task.c: show all hung tasks before panic"), we could get the call stack of hung task. However, if the console loglevel is not high, we still can not see the useful panic information in practice, and in most cases users don't set console loglevel to high level. This patch is to force console verbose before system panic, so that the real useful information can be seen in the console, instead of being like the following, which doesn't have hung task information. INFO: task init:1 blocked for more than 120 seconds. Tainted: G U W 4.19.0-quilt-2e5dc0ac-g51b6c21d76cc #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Kernel panic - not syncing: hung_task: blocked tasks CPU: 2 PID: 479 Comm: khungtaskd Tainted: G U W 4.19.0-quilt-2e5dc0ac-g51b6c21d76cc #1 Call Trace: dump_stack+0x4f/0x65 panic+0xde/0x231 watchdog+0x290/0x410 kthread+0x12c/0x150 ret_from_fork+0x35/0x40 reboot: panic mode set: p,w Kernel Offset: 0x34000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) Link: http://lkml.kernel.org/r/27240C0AC20F114CBF8149A2696CBE4A6015B675@SHSMSX101.ccr.corp.intel.com Signed-off-by: Chuansheng Liu Reviewed-by: Petr Mladek Reviewed-by: Sergey Senozhatsky Cc: Tetsuo Handa Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/hung_task.c | 12 +++++------- 1 file changed, 5 insertions(+), 7 deletions(-) (limited to 'kernel') diff --git a/kernel/hung_task.c b/kernel/hung_task.c index cb8e3e8ac7b9..85af0cde7f46 100644 --- a/kernel/hung_task.c +++ b/kernel/hung_task.c @@ -112,8 +112,11 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout) trace_sched_process_hang(t); - if (!sysctl_hung_task_warnings && !sysctl_hung_task_panic) - return; + if (sysctl_hung_task_panic) { + console_verbose(); + hung_task_show_lock = true; + hung_task_call_panic = true; + } /* * Ok, the task did not get scheduled for more than 2 minutes, @@ -135,11 +138,6 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout) } touch_nmi_watchdog(); - - if (sysctl_hung_task_panic) { - hung_task_show_lock = true; - hung_task_call_panic = true; - } } /* -- cgit v1.3-7-g2ca7 From 304ae42739b108305f8d7b3eb3c1aec7c2b643a9 Mon Sep 17 00:00:00 2001 From: Tetsuo Handa Date: Thu, 3 Jan 2019 15:26:31 -0800 Subject: kernel/hung_task.c: break RCU locks based on jiffies check_hung_uninterruptible_tasks() is currently calling rcu_lock_break() for every 1024 threads. But check_hung_task() is very slow if printk() was called, and is very fast otherwise. If many threads within some 1024 threads called printk(), the RCU grace period might be extended enough to trigger RCU stall warnings. Therefore, calling rcu_lock_break() for every some fixed jiffies will be safer. Link: http://lkml.kernel.org/r/1544800658-11423-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa Acked-by: Paul E. McKenney Cc: Petr Mladek Cc: Sergey Senozhatsky Cc: Dmitry Vyukov Cc: "Rafael J. Wysocki" Cc: Vitaly Kuznetsov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/hung_task.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/hung_task.c b/kernel/hung_task.c index 85af0cde7f46..4a9191617076 100644 --- a/kernel/hung_task.c +++ b/kernel/hung_task.c @@ -34,7 +34,7 @@ int __read_mostly sysctl_hung_task_check_count = PID_MAX_LIMIT; * is disabled during the critical section. It also controls the size of * the RCU grace period. So it needs to be upper-bound. */ -#define HUNG_TASK_BATCHING 1024 +#define HUNG_TASK_LOCK_BREAK (HZ / 10) /* * Zero means infinite timeout - no checking done: @@ -171,7 +171,7 @@ static bool rcu_lock_break(struct task_struct *g, struct task_struct *t) static void check_hung_uninterruptible_tasks(unsigned long timeout) { int max_count = sysctl_hung_task_check_count; - int batch_count = HUNG_TASK_BATCHING; + unsigned long last_break = jiffies; struct task_struct *g, *t; /* @@ -186,10 +186,10 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout) for_each_process_thread(g, t) { if (!max_count--) goto unlock; - if (!--batch_count) { - batch_count = HUNG_TASK_BATCHING; + if (time_after(jiffies, last_break + HUNG_TASK_LOCK_BREAK)) { if (!rcu_lock_break(g, t)) goto unlock; + last_break = jiffies; } /* use "==" to skip the TASK_KILLABLE tasks waiting on NFS */ if (t->state == TASK_UNINTERRUPTIBLE) -- cgit v1.3-7-g2ca7 From fb5bf31722d0805a3f394f7d59f2e8cd07acccb7 Mon Sep 17 00:00:00 2001 From: Yi Wang Date: Thu, 3 Jan 2019 15:28:03 -0800 Subject: fork: fix some -Wmissing-prototypes warnings We get a warning when building kernel with W=1: kernel/fork.c:167:13: warning: no previous prototype for `arch_release_thread_stack' [-Wmissing-prototypes] kernel/fork.c:779:13: warning: no previous prototype for `fork_init' [-Wmissing-prototypes] Add the missing declaration in head file to fix this. Also, remove arch_release_thread_stack() completely because no arch seems to implement it since bb9d81264 (arch: remove tile port). Link: http://lkml.kernel.org/r/1542170087-23645-1-git-send-email-wang.yi59@zte.com.cn Signed-off-by: Yi Wang Acked-by: Michal Hocko Acked-by: Mike Rapoport Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/sched/task.h | 2 ++ init/main.c | 1 - kernel/fork.c | 5 ----- 3 files changed, 2 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h index 108ede99e533..44c6f15800ff 100644 --- a/include/linux/sched/task.h +++ b/include/linux/sched/task.h @@ -39,6 +39,8 @@ void __noreturn do_task_dead(void); extern void proc_caches_init(void); +extern void fork_init(void); + extern void release_task(struct task_struct * p); #ifdef CONFIG_HAVE_COPY_THREAD_TLS diff --git a/init/main.c b/init/main.c index 6a74ba0892d2..e2e80ca3165a 100644 --- a/init/main.c +++ b/init/main.c @@ -105,7 +105,6 @@ static int kernel_init(void *); extern void init_IRQ(void); -extern void fork_init(void); extern void radix_tree_init(void); /* diff --git a/kernel/fork.c b/kernel/fork.c index d439c48ecf18..a60459947f18 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -164,10 +164,6 @@ static inline void free_task_struct(struct task_struct *tsk) } #endif -void __weak arch_release_thread_stack(unsigned long *stack) -{ -} - #ifndef CONFIG_ARCH_THREAD_STACK_ALLOCATOR /* @@ -422,7 +418,6 @@ static void release_task_stack(struct task_struct *tsk) return; /* Better to leak the stack than to free prematurely */ account_kernel_stack(tsk, -1); - arch_release_thread_stack(tsk->stack); free_thread_stack(tsk); tsk->stack = NULL; #ifdef CONFIG_VMAP_STACK -- cgit v1.3-7-g2ca7 From d999bd9392dea7c1a9ac43b8680b22c4425ae4c7 Mon Sep 17 00:00:00 2001 From: Feng Tang Date: Thu, 3 Jan 2019 15:28:17 -0800 Subject: panic: add options to print system info when panic happens Kernel panic issues are always painful to debug, partially because it's not easy to get enough information of the context when panic happens. And we have ramoops and kdump for that, while this commit tries to provide a easier way to show the system info by adding a cmdline parameter, referring some idea from sysrq handler. Link: http://lkml.kernel.org/r/1543398842-19295-2-git-send-email-feng.tang@intel.com Signed-off-by: Feng Tang Reviewed-by: Kees Cook Acked-by: Steven Rostedt (VMware) Cc: Thomas Gleixner Cc: John Stultz Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Steven Rostedt Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/admin-guide/kernel-parameters.txt | 8 +++++++ kernel/panic.c | 28 +++++++++++++++++++++++++ 2 files changed, 36 insertions(+) (limited to 'kernel') diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 408781ee142c..e7b5c49702bb 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -3092,6 +3092,14 @@ timeout < 0: reboot immediately Format: + panic_print= Bitmask for printing system info when panic happens. + User can chose combination of the following bits: + bit 0: print all tasks info + bit 1: print system memory info + bit 2: print timer info + bit 3: print locks info if CONFIG_LOCKDEP is on + bit 4: print ftrace buffer + panic_on_warn panic() instead of WARN(). Useful to cause kdump on a WARN(). diff --git a/kernel/panic.c b/kernel/panic.c index d10c340c43b0..855f66738bc7 100644 --- a/kernel/panic.c +++ b/kernel/panic.c @@ -46,6 +46,13 @@ int panic_on_warn __read_mostly; int panic_timeout = CONFIG_PANIC_TIMEOUT; EXPORT_SYMBOL_GPL(panic_timeout); +#define PANIC_PRINT_TASK_INFO 0x00000001 +#define PANIC_PRINT_MEM_INFO 0x00000002 +#define PANIC_PRINT_TIMER_INFO 0x00000004 +#define PANIC_PRINT_LOCK_INFO 0x00000008 +#define PANIC_PRINT_FTRACE_INFO 0x00000010 +static unsigned long panic_print; + ATOMIC_NOTIFIER_HEAD(panic_notifier_list); EXPORT_SYMBOL(panic_notifier_list); @@ -125,6 +132,24 @@ void nmi_panic(struct pt_regs *regs, const char *msg) } EXPORT_SYMBOL(nmi_panic); +static void panic_print_sys_info(void) +{ + if (panic_print & PANIC_PRINT_TASK_INFO) + show_state(); + + if (panic_print & PANIC_PRINT_MEM_INFO) + show_mem(0, NULL); + + if (panic_print & PANIC_PRINT_TIMER_INFO) + sysrq_timer_list_show(); + + if (panic_print & PANIC_PRINT_LOCK_INFO) + debug_show_all_locks(); + + if (panic_print & PANIC_PRINT_FTRACE_INFO) + ftrace_dump(DUMP_ALL); +} + /** * panic - halt the system * @fmt: The text string to print @@ -254,6 +279,8 @@ void panic(const char *fmt, ...) debug_locks_off(); console_flush_on_panic(); + panic_print_sys_info(); + if (!panic_blink) panic_blink = no_blink; @@ -658,6 +685,7 @@ void refcount_error_report(struct pt_regs *regs, const char *err) #endif core_param(panic, panic_timeout, int, 0644); +core_param(panic_print, panic_print, ulong, 0644); core_param(pause_on_oops, pause_on_oops, int, 0644); core_param(panic_on_warn, panic_on_warn, int, 0644); core_param(crash_kexec_post_notifiers, crash_kexec_post_notifiers, bool, 0644); -- cgit v1.3-7-g2ca7 From 81c9d43f94870be66146739c6e61df40dc17bb64 Mon Sep 17 00:00:00 2001 From: Feng Tang Date: Thu, 3 Jan 2019 15:28:20 -0800 Subject: kernel/sysctl: add panic_print into sysctl So that we can also runtime chose to print out the needed system info for panic, other than setting the kernel cmdline. Link: http://lkml.kernel.org/r/1543398842-19295-3-git-send-email-feng.tang@intel.com Signed-off-by: Feng Tang Suggested-by: Steven Rostedt Acked-by: Steven Rostedt (VMware) Cc: Thomas Gleixner Cc: John Stultz Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Kees Cook Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/sysctl/kernel.txt | 17 +++++++++++++++++ include/linux/kernel.h | 1 + include/uapi/linux/sysctl.h | 1 + kernel/panic.c | 2 +- kernel/sysctl.c | 7 +++++++ kernel/sysctl_binary.c | 1 + 6 files changed, 28 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt index 1b8775298cf7..c0527d8a468a 100644 --- a/Documentation/sysctl/kernel.txt +++ b/Documentation/sysctl/kernel.txt @@ -60,6 +60,7 @@ show up in /proc/sys/kernel: - panic_on_stackoverflow - panic_on_unrecovered_nmi - panic_on_warn +- panic_print - panic_on_rcu_stall - perf_cpu_time_max_percent - perf_event_paranoid @@ -654,6 +655,22 @@ a kernel rebuild when attempting to kdump at the location of a WARN(). ============================================================== +panic_print: + +Bitmask for printing system info when panic happens. User can chose +combination of the following bits: + +bit 0: print all tasks info +bit 1: print system memory info +bit 2: print timer info +bit 3: print locks info if CONFIG_LOCKDEP is on +bit 4: print ftrace buffer + +So for example to print tasks and memory info on panic, user can: + echo 3 > /proc/sys/kernel/panic_print + +============================================================== + panic_on_rcu_stall: When set to 1, calls panic() after RCU stall detection messages. This diff --git a/include/linux/kernel.h b/include/linux/kernel.h index d6aac75b51ba..8f0e68e250a7 100644 --- a/include/linux/kernel.h +++ b/include/linux/kernel.h @@ -527,6 +527,7 @@ static inline u32 int_sqrt64(u64 x) extern void bust_spinlocks(int yes); extern int oops_in_progress; /* If set, an oops, panic(), BUG() or die() is in progress */ extern int panic_timeout; +extern unsigned long panic_print; extern int panic_on_oops; extern int panic_on_unrecovered_nmi; extern int panic_on_io_nmi; diff --git a/include/uapi/linux/sysctl.h b/include/uapi/linux/sysctl.h index d71013fffaf6..87aa2a6d9125 100644 --- a/include/uapi/linux/sysctl.h +++ b/include/uapi/linux/sysctl.h @@ -153,6 +153,7 @@ enum KERN_NMI_WATCHDOG=75, /* int: enable/disable nmi watchdog */ KERN_PANIC_ON_NMI=76, /* int: whether we will panic on an unrecovered */ KERN_PANIC_ON_WARN=77, /* int: call panic() in WARN() functions */ + KERN_PANIC_PRINT=78, /* ulong: bitmask to print system info on panic */ }; diff --git a/kernel/panic.c b/kernel/panic.c index 855f66738bc7..f121e6ba7e11 100644 --- a/kernel/panic.c +++ b/kernel/panic.c @@ -51,7 +51,7 @@ EXPORT_SYMBOL_GPL(panic_timeout); #define PANIC_PRINT_TIMER_INFO 0x00000004 #define PANIC_PRINT_LOCK_INFO 0x00000008 #define PANIC_PRINT_FTRACE_INFO 0x00000010 -static unsigned long panic_print; +unsigned long panic_print; ATOMIC_NOTIFIER_HEAD(panic_notifier_list); diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 7f6c1a3b3485..ba4d9e85feb8 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -807,6 +807,13 @@ static struct ctl_table kern_table[] = { .mode = 0644, .proc_handler = proc_dointvec, }, + { + .procname = "panic_print", + .data = &panic_print, + .maxlen = sizeof(unsigned long), + .mode = 0644, + .proc_handler = proc_doulongvec_minmax, + }, #if defined CONFIG_PRINTK { .procname = "printk", diff --git a/kernel/sysctl_binary.c b/kernel/sysctl_binary.c index 07148b497451..73c132095a7b 100644 --- a/kernel/sysctl_binary.c +++ b/kernel/sysctl_binary.c @@ -140,6 +140,7 @@ static const struct bin_table bin_kern_table[] = { { CTL_INT, KERN_MAX_LOCK_DEPTH, "max_lock_depth" }, { CTL_INT, KERN_PANIC_ON_NMI, "panic_on_unrecovered_nmi" }, { CTL_INT, KERN_PANIC_ON_WARN, "panic_on_warn" }, + { CTL_ULONG, KERN_PANIC_PRINT, "panic_print" }, {} }; -- cgit v1.3-7-g2ca7 From 634724431607f6f46c495dfef801a1c8b44a96d9 Mon Sep 17 00:00:00 2001 From: Anders Roxell Date: Thu, 3 Jan 2019 15:28:24 -0800 Subject: kernel/kcov.c: mark write_comp_data() as notrace Since __sanitizer_cov_trace_const_cmp4 is marked as notrace, the function called from __sanitizer_cov_trace_const_cmp4 shouldn't be traceable either. ftrace_graph_caller() gets called every time func write_comp_data() gets called if it isn't marked 'notrace'. This is the backtrace from gdb: #0 ftrace_graph_caller () at ../arch/arm64/kernel/entry-ftrace.S:179 #1 0xffffff8010201920 in ftrace_caller () at ../arch/arm64/kernel/entry-ftrace.S:151 #2 0xffffff8010439714 in write_comp_data (type=5, arg1=0, arg2=0, ip=18446743524224276596) at ../kernel/kcov.c:116 #3 0xffffff8010439894 in __sanitizer_cov_trace_const_cmp4 (arg1=, arg2=) at ../kernel/kcov.c:188 #4 0xffffff8010201874 in prepare_ftrace_return (self_addr=18446743524226602768, parent=0xffffff801014b918, frame_pointer=18446743524223531344) at ./include/generated/atomic-instrumented.h:27 #5 0xffffff801020194c in ftrace_graph_caller () at ../arch/arm64/kernel/entry-ftrace.S:182 Rework so that write_comp_data() that are called from __sanitizer_cov_trace_*_cmp*() are marked as 'notrace'. Commit 903e8ff86753 ("kernel/kcov.c: mark funcs in __sanitizer_cov_trace_pc() as notrace") missed to mark write_comp_data() as 'notrace'. When that patch was created gcc-7 was used. In lib/Kconfig.debug config KCOV_ENABLE_COMPARISONS depends on $(cc-option,-fsanitize-coverage=trace-cmp) That code path isn't hit with gcc-7. However, it were that with gcc-8. Link: http://lkml.kernel.org/r/20181206143011.23719-1-anders.roxell@linaro.org Signed-off-by: Anders Roxell Signed-off-by: Arnd Bergmann Co-developed-by: Arnd Bergmann Acked-by: Steven Rostedt (VMware) Cc: Will Deacon Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/kcov.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/kcov.c b/kernel/kcov.c index 97959d7b77e2..c2277dbdbfb1 100644 --- a/kernel/kcov.c +++ b/kernel/kcov.c @@ -112,7 +112,7 @@ void notrace __sanitizer_cov_trace_pc(void) EXPORT_SYMBOL(__sanitizer_cov_trace_pc); #ifdef CONFIG_KCOV_ENABLE_COMPARISONS -static void write_comp_data(u64 type, u64 arg1, u64 arg2, u64 ip) +static void notrace write_comp_data(u64 type, u64 arg1, u64 arg2, u64 ip) { struct task_struct *t; u64 *area; -- cgit v1.3-7-g2ca7 From 3bb5f4ac55dd91d516e7e36b45c94bd57efbb068 Mon Sep 17 00:00:00 2001 From: Davidlohr Bueso Date: Thu, 3 Jan 2019 15:28:44 -0800 Subject: kernel/locking/mutex.c: remove caller signal_pending branch predictions This is already done for us internally by the signal machinery. Link: http://lkml.kernel.org/r/20181116002713.8474-2-dave@stgolabs.net Signed-off-by: Davidlohr Bueso Reviewed-by: Andrew Morton Cc: Peter Zijlstra Cc: Ingo Molnar Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/locking/mutex.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c index 3f8a35104285..db578783dd36 100644 --- a/kernel/locking/mutex.c +++ b/kernel/locking/mutex.c @@ -987,7 +987,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass, * wait_lock. This ensures the lock cancellation is ordered * against mutex_unlock() and wake-ups do not go missing. */ - if (unlikely(signal_pending_state(state, current))) { + if (signal_pending_state(state, current)) { ret = -EINTR; goto err; } -- cgit v1.3-7-g2ca7 From 34ec35ad8f5f4624e8391dbb83afb4c791f027e3 Mon Sep 17 00:00:00 2001 From: Davidlohr Bueso Date: Thu, 3 Jan 2019 15:28:48 -0800 Subject: kernel/sched/: remove caller signal_pending branch predictions This is already done for us internally by the signal machinery. Link: http://lkml.kernel.org/r/20181116002713.8474-3-dave@stgolabs.net Signed-off-by: Davidlohr Bueso Reviewed-by: Andrew Morton Cc: Peter Zijlstra Cc: Ingo Molnar Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/sched/core.c | 2 +- kernel/sched/swait.c | 2 +- kernel/sched/wait.c | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index f66920173370..17a954c9e153 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3416,7 +3416,7 @@ static void __sched notrace __schedule(bool preempt) switch_count = &prev->nivcsw; if (!preempt && prev->state) { - if (unlikely(signal_pending_state(prev->state, prev))) { + if (signal_pending_state(prev->state, prev)) { prev->state = TASK_RUNNING; } else { deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK); diff --git a/kernel/sched/swait.c b/kernel/sched/swait.c index 66b59ac77c22..e83a3f8449f6 100644 --- a/kernel/sched/swait.c +++ b/kernel/sched/swait.c @@ -93,7 +93,7 @@ long prepare_to_swait_event(struct swait_queue_head *q, struct swait_queue *wait long ret = 0; raw_spin_lock_irqsave(&q->lock, flags); - if (unlikely(signal_pending_state(state, current))) { + if (signal_pending_state(state, current)) { /* * See prepare_to_wait_event(). TL;DR, subsequent swake_up_one() * must not see us. diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c index 5dd47f1103d1..6eb1f8efd221 100644 --- a/kernel/sched/wait.c +++ b/kernel/sched/wait.c @@ -264,7 +264,7 @@ long prepare_to_wait_event(struct wait_queue_head *wq_head, struct wait_queue_en long ret = 0; spin_lock_irqsave(&wq_head->lock, flags); - if (unlikely(signal_pending_state(state, current))) { + if (signal_pending_state(state, current)) { /* * Exclusive waiter must not fail if it was selected by wakeup, * it should "consume" the condition we were waiting for. -- cgit v1.3-7-g2ca7 From 8270f3a11ceef64bdb0ab141180e8d2f17c619ec Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Fri, 4 Jan 2019 18:31:48 +0100 Subject: dma-direct: fix DMA_ATTR_NO_KERNEL_MAPPING for remapped allocations We need to return a dma_addr_t even if we don't have a kernel mapping. Do so by consolidating the phys_to_dma call in a single place and jump to it from all the branches that return successfully. Fixes: bfd56cd60521 ("dma-mapping: support highmem in the generic remap allocator") Reported-by: Liviu Dudau Tested-by: Liviu Dudau --- kernel/dma/remap.c | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/dma/remap.c b/kernel/dma/remap.c index 18cc09fc27b9..7a723194ecbe 100644 --- a/kernel/dma/remap.c +++ b/kernel/dma/remap.c @@ -204,8 +204,7 @@ void *arch_dma_alloc(struct device *dev, size_t size, dma_addr_t *dma_handle, ret = dma_alloc_from_pool(size, &page, flags); if (!ret) return NULL; - *dma_handle = phys_to_dma(dev, page_to_phys(page)); - return ret; + goto done; } page = __dma_direct_alloc_pages(dev, size, dma_handle, flags, attrs); @@ -215,8 +214,10 @@ void *arch_dma_alloc(struct device *dev, size_t size, dma_addr_t *dma_handle, /* remove any dirty cache lines on the kernel alias */ arch_dma_prep_coherent(page, size); - if (attrs & DMA_ATTR_NO_KERNEL_MAPPING) - return page; /* opaque cookie */ + if (attrs & DMA_ATTR_NO_KERNEL_MAPPING) { + ret = page; /* opaque cookie */ + goto done; + } /* create a coherent mapping */ ret = dma_common_contiguous_remap(page, size, VM_USERMAP, @@ -227,9 +228,9 @@ void *arch_dma_alloc(struct device *dev, size_t size, dma_addr_t *dma_handle, return ret; } - *dma_handle = phys_to_dma(dev, page_to_phys(page)); memset(ret, 0, size); - +done: + *dma_handle = phys_to_dma(dev, page_to_phys(page)); return ret; } -- cgit v1.3-7-g2ca7 From e9666d10a5677a494260d60d1fa0b73cc7646eb3 Mon Sep 17 00:00:00 2001 From: Masahiro Yamada Date: Mon, 31 Dec 2018 00:14:15 +0900 Subject: jump_label: move 'asm goto' support test to Kconfig Currently, CONFIG_JUMP_LABEL just means "I _want_ to use jump label". The jump label is controlled by HAVE_JUMP_LABEL, which is defined like this: #if defined(CC_HAVE_ASM_GOTO) && defined(CONFIG_JUMP_LABEL) # define HAVE_JUMP_LABEL #endif We can improve this by testing 'asm goto' support in Kconfig, then make JUMP_LABEL depend on CC_HAS_ASM_GOTO. Ugly #ifdef HAVE_JUMP_LABEL will go away, and CONFIG_JUMP_LABEL will match to the real kernel capability. Signed-off-by: Masahiro Yamada Acked-by: Michael Ellerman (powerpc) Tested-by: Sedat Dilek --- Makefile | 7 ------- arch/Kconfig | 1 + arch/arm/kernel/jump_label.c | 4 ---- arch/arm64/kernel/jump_label.c | 4 ---- arch/mips/kernel/jump_label.c | 4 ---- arch/powerpc/include/asm/asm-prototypes.h | 2 +- arch/powerpc/kernel/jump_label.c | 2 -- arch/powerpc/platforms/powernv/opal-tracepoints.c | 2 +- arch/powerpc/platforms/powernv/opal-wrappers.S | 2 +- arch/powerpc/platforms/pseries/hvCall.S | 4 ++-- arch/powerpc/platforms/pseries/lpar.c | 2 +- arch/s390/kernel/Makefile | 3 ++- arch/s390/kernel/jump_label.c | 4 ---- arch/sparc/kernel/Makefile | 2 +- arch/sparc/kernel/jump_label.c | 4 ---- arch/x86/Makefile | 2 +- arch/x86/entry/calling.h | 2 +- arch/x86/include/asm/cpufeature.h | 2 +- arch/x86/include/asm/jump_label.h | 13 ------------- arch/x86/include/asm/rmwcc.h | 6 +++--- arch/x86/kernel/Makefile | 3 ++- arch/x86/kernel/jump_label.c | 4 ---- arch/x86/kvm/emulate.c | 2 +- arch/xtensa/kernel/jump_label.c | 4 ---- include/linux/dynamic_debug.h | 6 +++--- include/linux/jump_label.h | 22 +++++++++------------- include/linux/jump_label_ratelimit.h | 8 +++----- include/linux/module.h | 2 +- include/linux/netfilter.h | 4 ++-- include/linux/netfilter_ingress.h | 2 +- init/Kconfig | 3 +++ kernel/jump_label.c | 10 +++------- kernel/module.c | 2 +- kernel/sched/core.c | 2 +- kernel/sched/debug.c | 4 ++-- kernel/sched/fair.c | 6 +++--- kernel/sched/sched.h | 6 +++--- lib/dynamic_debug.c | 2 +- net/core/dev.c | 6 +++--- net/netfilter/core.c | 6 +++--- scripts/gcc-goto.sh | 2 +- tools/arch/x86/include/asm/rmwcc.h | 6 +++--- 42 files changed, 65 insertions(+), 119 deletions(-) (limited to 'kernel') diff --git a/Makefile b/Makefile index 60a473247657..04a857817f77 100644 --- a/Makefile +++ b/Makefile @@ -514,13 +514,6 @@ RETPOLINE_VDSO_CFLAGS := $(call cc-option,$(RETPOLINE_VDSO_CFLAGS_GCC),$(call cc export RETPOLINE_CFLAGS export RETPOLINE_VDSO_CFLAGS -# check for 'asm goto' -ifeq ($(shell $(CONFIG_SHELL) $(srctree)/scripts/gcc-goto.sh $(CC) $(KBUILD_CFLAGS)), y) - CC_HAVE_ASM_GOTO := 1 - KBUILD_CFLAGS += -DCC_HAVE_ASM_GOTO - KBUILD_AFLAGS += -DCC_HAVE_ASM_GOTO -endif - # The expansion should be delayed until arch/$(SRCARCH)/Makefile is included. # Some architectures define CROSS_COMPILE in arch/$(SRCARCH)/Makefile. # CC_VERSION_TEXT is referenced from Kconfig (so it needs export), diff --git a/arch/Kconfig b/arch/Kconfig index b70c952ac838..4cfb6de48f79 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -71,6 +71,7 @@ config KPROBES config JUMP_LABEL bool "Optimize very unlikely/likely branches" depends on HAVE_ARCH_JUMP_LABEL + depends on CC_HAS_ASM_GOTO help This option enables a transparent branch optimization that makes certain almost-always-true or almost-always-false branch diff --git a/arch/arm/kernel/jump_label.c b/arch/arm/kernel/jump_label.c index 90bce3d9928e..303b3ab87f7e 100644 --- a/arch/arm/kernel/jump_label.c +++ b/arch/arm/kernel/jump_label.c @@ -4,8 +4,6 @@ #include #include -#ifdef HAVE_JUMP_LABEL - static void __arch_jump_label_transform(struct jump_entry *entry, enum jump_label_type type, bool is_static) @@ -35,5 +33,3 @@ void arch_jump_label_transform_static(struct jump_entry *entry, { __arch_jump_label_transform(entry, type, true); } - -#endif diff --git a/arch/arm64/kernel/jump_label.c b/arch/arm64/kernel/jump_label.c index 646b9562ee64..1eff270e8861 100644 --- a/arch/arm64/kernel/jump_label.c +++ b/arch/arm64/kernel/jump_label.c @@ -20,8 +20,6 @@ #include #include -#ifdef HAVE_JUMP_LABEL - void arch_jump_label_transform(struct jump_entry *entry, enum jump_label_type type) { @@ -49,5 +47,3 @@ void arch_jump_label_transform_static(struct jump_entry *entry, * NOP needs to be replaced by a branch. */ } - -#endif /* HAVE_JUMP_LABEL */ diff --git a/arch/mips/kernel/jump_label.c b/arch/mips/kernel/jump_label.c index 32e3168316cd..ab943927f97a 100644 --- a/arch/mips/kernel/jump_label.c +++ b/arch/mips/kernel/jump_label.c @@ -16,8 +16,6 @@ #include #include -#ifdef HAVE_JUMP_LABEL - /* * Define parameters for the standard MIPS and the microMIPS jump * instruction encoding respectively: @@ -70,5 +68,3 @@ void arch_jump_label_transform(struct jump_entry *e, mutex_unlock(&text_mutex); } - -#endif /* HAVE_JUMP_LABEL */ diff --git a/arch/powerpc/include/asm/asm-prototypes.h b/arch/powerpc/include/asm/asm-prototypes.h index 6f201b199c02..1d911f68a23b 100644 --- a/arch/powerpc/include/asm/asm-prototypes.h +++ b/arch/powerpc/include/asm/asm-prototypes.h @@ -38,7 +38,7 @@ extern struct static_key hcall_tracepoint_key; void __trace_hcall_entry(unsigned long opcode, unsigned long *args); void __trace_hcall_exit(long opcode, long retval, unsigned long *retbuf); /* OPAL tracing */ -#ifdef HAVE_JUMP_LABEL +#ifdef CONFIG_JUMP_LABEL extern struct static_key opal_tracepoint_key; #endif diff --git a/arch/powerpc/kernel/jump_label.c b/arch/powerpc/kernel/jump_label.c index 6472472093d0..0080c5fbd225 100644 --- a/arch/powerpc/kernel/jump_label.c +++ b/arch/powerpc/kernel/jump_label.c @@ -11,7 +11,6 @@ #include #include -#ifdef HAVE_JUMP_LABEL void arch_jump_label_transform(struct jump_entry *entry, enum jump_label_type type) { @@ -22,4 +21,3 @@ void arch_jump_label_transform(struct jump_entry *entry, else patch_instruction(addr, PPC_INST_NOP); } -#endif diff --git a/arch/powerpc/platforms/powernv/opal-tracepoints.c b/arch/powerpc/platforms/powernv/opal-tracepoints.c index 1ab7d26c0a2c..f16a43540e30 100644 --- a/arch/powerpc/platforms/powernv/opal-tracepoints.c +++ b/arch/powerpc/platforms/powernv/opal-tracepoints.c @@ -4,7 +4,7 @@ #include #include -#ifdef HAVE_JUMP_LABEL +#ifdef CONFIG_JUMP_LABEL struct static_key opal_tracepoint_key = STATIC_KEY_INIT; int opal_tracepoint_regfunc(void) diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S b/arch/powerpc/platforms/powernv/opal-wrappers.S index 251528231a9e..f4875fe3f8ff 100644 --- a/arch/powerpc/platforms/powernv/opal-wrappers.S +++ b/arch/powerpc/platforms/powernv/opal-wrappers.S @@ -20,7 +20,7 @@ .section ".text" #ifdef CONFIG_TRACEPOINTS -#ifdef HAVE_JUMP_LABEL +#ifdef CONFIG_JUMP_LABEL #define OPAL_BRANCH(LABEL) \ ARCH_STATIC_BRANCH(LABEL, opal_tracepoint_key) #else diff --git a/arch/powerpc/platforms/pseries/hvCall.S b/arch/powerpc/platforms/pseries/hvCall.S index d91412c591ef..50dc9426d0be 100644 --- a/arch/powerpc/platforms/pseries/hvCall.S +++ b/arch/powerpc/platforms/pseries/hvCall.S @@ -19,7 +19,7 @@ #ifdef CONFIG_TRACEPOINTS -#ifndef HAVE_JUMP_LABEL +#ifndef CONFIG_JUMP_LABEL .section ".toc","aw" .globl hcall_tracepoint_refcount @@ -79,7 +79,7 @@ hcall_tracepoint_refcount: mr r5,BUFREG; \ __HCALL_INST_POSTCALL -#ifdef HAVE_JUMP_LABEL +#ifdef CONFIG_JUMP_LABEL #define HCALL_BRANCH(LABEL) \ ARCH_STATIC_BRANCH(LABEL, hcall_tracepoint_key) #else diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c index 32d4452973e7..f2a9f0adc2d3 100644 --- a/arch/powerpc/platforms/pseries/lpar.c +++ b/arch/powerpc/platforms/pseries/lpar.c @@ -1040,7 +1040,7 @@ EXPORT_SYMBOL(arch_free_page); #endif /* CONFIG_PPC_BOOK3S_64 */ #ifdef CONFIG_TRACEPOINTS -#ifdef HAVE_JUMP_LABEL +#ifdef CONFIG_JUMP_LABEL struct static_key hcall_tracepoint_key = STATIC_KEY_INIT; int hcall_tracepoint_regfunc(void) diff --git a/arch/s390/kernel/Makefile b/arch/s390/kernel/Makefile index 386b1abb217b..e216e116a9a9 100644 --- a/arch/s390/kernel/Makefile +++ b/arch/s390/kernel/Makefile @@ -48,7 +48,7 @@ CFLAGS_ptrace.o += -DUTS_MACHINE='"$(UTS_MACHINE)"' obj-y := traps.o time.o process.o base.o early.o setup.o idle.o vtime.o obj-y += processor.o sys_s390.o ptrace.o signal.o cpcmd.o ebcdic.o nmi.o obj-y += debug.o irq.o ipl.o dis.o diag.o vdso.o early_nobss.o -obj-y += sysinfo.o jump_label.o lgr.o os_info.o machine_kexec.o pgm_check.o +obj-y += sysinfo.o lgr.o os_info.o machine_kexec.o pgm_check.o obj-y += runtime_instr.o cache.o fpu.o dumpstack.o guarded_storage.o sthyi.o obj-y += entry.o reipl.o relocate_kernel.o kdebugfs.o alternative.o obj-y += nospec-branch.o ipl_vmparm.o @@ -72,6 +72,7 @@ obj-$(CONFIG_KPROBES) += kprobes.o obj-$(CONFIG_FUNCTION_TRACER) += mcount.o ftrace.o obj-$(CONFIG_CRASH_DUMP) += crash_dump.o obj-$(CONFIG_UPROBES) += uprobes.o +obj-$(CONFIG_JUMP_LABEL) += jump_label.o obj-$(CONFIG_KEXEC_FILE) += machine_kexec_file.o kexec_image.o obj-$(CONFIG_KEXEC_FILE) += kexec_elf.o diff --git a/arch/s390/kernel/jump_label.c b/arch/s390/kernel/jump_label.c index 50a1798604a8..3f10b56bd5a3 100644 --- a/arch/s390/kernel/jump_label.c +++ b/arch/s390/kernel/jump_label.c @@ -10,8 +10,6 @@ #include #include -#ifdef HAVE_JUMP_LABEL - struct insn { u16 opcode; s32 offset; @@ -103,5 +101,3 @@ void arch_jump_label_transform_static(struct jump_entry *entry, { __jump_label_transform(entry, type, 1); } - -#endif diff --git a/arch/sparc/kernel/Makefile b/arch/sparc/kernel/Makefile index cf8640841b7a..97c0e19263d1 100644 --- a/arch/sparc/kernel/Makefile +++ b/arch/sparc/kernel/Makefile @@ -118,4 +118,4 @@ pc--$(CONFIG_PERF_EVENTS) := perf_event.o obj-$(CONFIG_SPARC64) += $(pc--y) obj-$(CONFIG_UPROBES) += uprobes.o -obj-$(CONFIG_SPARC64) += jump_label.o +obj-$(CONFIG_JUMP_LABEL) += jump_label.o diff --git a/arch/sparc/kernel/jump_label.c b/arch/sparc/kernel/jump_label.c index 7f8eac51df33..a4cfaeecaf5e 100644 --- a/arch/sparc/kernel/jump_label.c +++ b/arch/sparc/kernel/jump_label.c @@ -9,8 +9,6 @@ #include -#ifdef HAVE_JUMP_LABEL - void arch_jump_label_transform(struct jump_entry *entry, enum jump_label_type type) { @@ -47,5 +45,3 @@ void arch_jump_label_transform(struct jump_entry *entry, flushi(insn); mutex_unlock(&text_mutex); } - -#endif diff --git a/arch/x86/Makefile b/arch/x86/Makefile index 16c3145c0a5f..9c5a67d1b9c1 100644 --- a/arch/x86/Makefile +++ b/arch/x86/Makefile @@ -289,7 +289,7 @@ vdso_install: archprepare: checkbin checkbin: -ifndef CC_HAVE_ASM_GOTO +ifndef CONFIG_CC_HAS_ASM_GOTO @echo Compiler lacks asm-goto support. @exit 1 endif diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h index 20d0885b00fb..efb0d1b1f15f 100644 --- a/arch/x86/entry/calling.h +++ b/arch/x86/entry/calling.h @@ -351,7 +351,7 @@ For 32-bit we have the following conventions - kernel is built with */ .macro CALL_enter_from_user_mode #ifdef CONFIG_CONTEXT_TRACKING -#ifdef HAVE_JUMP_LABEL +#ifdef CONFIG_JUMP_LABEL STATIC_JUMP_IF_FALSE .Lafter_call_\@, context_tracking_enabled, def=0 #endif call enter_from_user_mode diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h index aced6c9290d6..ce95b8cbd229 100644 --- a/arch/x86/include/asm/cpufeature.h +++ b/arch/x86/include/asm/cpufeature.h @@ -140,7 +140,7 @@ extern void clear_cpu_cap(struct cpuinfo_x86 *c, unsigned int bit); #define setup_force_cpu_bug(bit) setup_force_cpu_cap(bit) -#if defined(__clang__) && !defined(CC_HAVE_ASM_GOTO) +#if defined(__clang__) && !defined(CONFIG_CC_HAS_ASM_GOTO) /* * Workaround for the sake of BPF compilation which utilizes kernel diff --git a/arch/x86/include/asm/jump_label.h b/arch/x86/include/asm/jump_label.h index 21efc9d07ed9..65191ce8e1cf 100644 --- a/arch/x86/include/asm/jump_label.h +++ b/arch/x86/include/asm/jump_label.h @@ -2,19 +2,6 @@ #ifndef _ASM_X86_JUMP_LABEL_H #define _ASM_X86_JUMP_LABEL_H -#ifndef HAVE_JUMP_LABEL -/* - * For better or for worse, if jump labels (the gcc extension) are missing, - * then the entire static branch patching infrastructure is compiled out. - * If that happens, the code in here will malfunction. Raise a compiler - * error instead. - * - * In theory, jump labels and the static branch patching infrastructure - * could be decoupled to fix this. - */ -#error asm/jump_label.h included on a non-jump-label kernel -#endif - #define JUMP_LABEL_NOP_SIZE 5 #ifdef CONFIG_X86_64 diff --git a/arch/x86/include/asm/rmwcc.h b/arch/x86/include/asm/rmwcc.h index 46ac84b506f5..8a9eba191516 100644 --- a/arch/x86/include/asm/rmwcc.h +++ b/arch/x86/include/asm/rmwcc.h @@ -11,7 +11,7 @@ #define __CLOBBERS_MEM(clb...) "memory", ## clb -#if !defined(__GCC_ASM_FLAG_OUTPUTS__) && defined(CC_HAVE_ASM_GOTO) +#if !defined(__GCC_ASM_FLAG_OUTPUTS__) && defined(CONFIG_CC_HAS_ASM_GOTO) /* Use asm goto */ @@ -27,7 +27,7 @@ cc_label: c = true; \ c; \ }) -#else /* defined(__GCC_ASM_FLAG_OUTPUTS__) || !defined(CC_HAVE_ASM_GOTO) */ +#else /* defined(__GCC_ASM_FLAG_OUTPUTS__) || !defined(CONFIG_CC_HAS_ASM_GOTO) */ /* Use flags output or a set instruction */ @@ -40,7 +40,7 @@ cc_label: c = true; \ c; \ }) -#endif /* defined(__GCC_ASM_FLAG_OUTPUTS__) || !defined(CC_HAVE_ASM_GOTO) */ +#endif /* defined(__GCC_ASM_FLAG_OUTPUTS__) || !defined(CONFIG_CC_HAS_ASM_GOTO) */ #define GEN_UNARY_RMWcc_4(op, var, cc, arg0) \ __GEN_RMWcc(op " " arg0, var, cc, __CLOBBERS_MEM()) diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile index eb51b0e1189c..00b7e27bc2b7 100644 --- a/arch/x86/kernel/Makefile +++ b/arch/x86/kernel/Makefile @@ -49,7 +49,8 @@ obj-$(CONFIG_COMPAT) += signal_compat.o obj-y += traps.o idt.o irq.o irq_$(BITS).o dumpstack_$(BITS).o obj-y += time.o ioport.o dumpstack.o nmi.o obj-$(CONFIG_MODIFY_LDT_SYSCALL) += ldt.o -obj-y += setup.o x86_init.o i8259.o irqinit.o jump_label.o +obj-y += setup.o x86_init.o i8259.o irqinit.o +obj-$(CONFIG_JUMP_LABEL) += jump_label.o obj-$(CONFIG_IRQ_WORK) += irq_work.o obj-y += probe_roms.o obj-$(CONFIG_X86_64) += sys_x86_64.o diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c index aac0c1f7e354..f99bd26bd3f1 100644 --- a/arch/x86/kernel/jump_label.c +++ b/arch/x86/kernel/jump_label.c @@ -16,8 +16,6 @@ #include #include -#ifdef HAVE_JUMP_LABEL - union jump_code_union { char code[JUMP_LABEL_NOP_SIZE]; struct { @@ -130,5 +128,3 @@ __init_or_module void arch_jump_label_transform_static(struct jump_entry *entry, if (jlstate == JL_STATE_UPDATE) __jump_label_transform(entry, type, text_poke_early, 1); } - -#endif diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 78e430f4e15c..c338984c850d 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -456,7 +456,7 @@ FOP_END; /* * XXX: inoutclob user must know where the argument is being expanded. - * Relying on CC_HAVE_ASM_GOTO would allow us to remove _fault. + * Relying on CONFIG_CC_HAS_ASM_GOTO would allow us to remove _fault. */ #define asm_safe(insn, inoutclob...) \ ({ \ diff --git a/arch/xtensa/kernel/jump_label.c b/arch/xtensa/kernel/jump_label.c index d108f721c116..61cf6497a646 100644 --- a/arch/xtensa/kernel/jump_label.c +++ b/arch/xtensa/kernel/jump_label.c @@ -10,8 +10,6 @@ #include -#ifdef HAVE_JUMP_LABEL - #define J_OFFSET_MASK 0x0003ffff #define J_SIGN_MASK (~(J_OFFSET_MASK >> 1)) @@ -95,5 +93,3 @@ void arch_jump_label_transform(struct jump_entry *e, patch_text(jump_entry_code(e), &insn, JUMP_LABEL_NOP_SIZE); } - -#endif /* HAVE_JUMP_LABEL */ diff --git a/include/linux/dynamic_debug.h b/include/linux/dynamic_debug.h index 2fd8006153c3..b3419da1a776 100644 --- a/include/linux/dynamic_debug.h +++ b/include/linux/dynamic_debug.h @@ -2,7 +2,7 @@ #ifndef _DYNAMIC_DEBUG_H #define _DYNAMIC_DEBUG_H -#if defined(CC_HAVE_ASM_GOTO) && defined(CONFIG_JUMP_LABEL) +#if defined(CONFIG_JUMP_LABEL) #include #endif @@ -38,7 +38,7 @@ struct _ddebug { #define _DPRINTK_FLAGS_DEFAULT 0 #endif unsigned int flags:8; -#ifdef HAVE_JUMP_LABEL +#ifdef CONFIG_JUMP_LABEL union { struct static_key_true dd_key_true; struct static_key_false dd_key_false; @@ -83,7 +83,7 @@ void __dynamic_netdev_dbg(struct _ddebug *descriptor, dd_key_init(key, init) \ } -#ifdef HAVE_JUMP_LABEL +#ifdef CONFIG_JUMP_LABEL #define dd_key_init(key, init) key = (init) diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h index 5df6a621e464..3e113a1fa0f1 100644 --- a/include/linux/jump_label.h +++ b/include/linux/jump_label.h @@ -71,10 +71,6 @@ * Additional babbling in: Documentation/static-keys.txt */ -#if defined(CC_HAVE_ASM_GOTO) && defined(CONFIG_JUMP_LABEL) -# define HAVE_JUMP_LABEL -#endif - #ifndef __ASSEMBLY__ #include @@ -86,7 +82,7 @@ extern bool static_key_initialized; "%s(): static key '%pS' used before call to jump_label_init()", \ __func__, (key)) -#ifdef HAVE_JUMP_LABEL +#ifdef CONFIG_JUMP_LABEL struct static_key { atomic_t enabled; @@ -114,10 +110,10 @@ struct static_key { struct static_key { atomic_t enabled; }; -#endif /* HAVE_JUMP_LABEL */ +#endif /* CONFIG_JUMP_LABEL */ #endif /* __ASSEMBLY__ */ -#ifdef HAVE_JUMP_LABEL +#ifdef CONFIG_JUMP_LABEL #include #ifndef __ASSEMBLY__ @@ -192,7 +188,7 @@ enum jump_label_type { struct module; -#ifdef HAVE_JUMP_LABEL +#ifdef CONFIG_JUMP_LABEL #define JUMP_TYPE_FALSE 0UL #define JUMP_TYPE_TRUE 1UL @@ -245,7 +241,7 @@ extern void static_key_disable_cpuslocked(struct static_key *key); { .enabled = { 0 }, \ { .entries = (void *)JUMP_TYPE_FALSE } } -#else /* !HAVE_JUMP_LABEL */ +#else /* !CONFIG_JUMP_LABEL */ #include #include @@ -330,7 +326,7 @@ static inline void static_key_disable(struct static_key *key) #define STATIC_KEY_INIT_TRUE { .enabled = ATOMIC_INIT(1) } #define STATIC_KEY_INIT_FALSE { .enabled = ATOMIC_INIT(0) } -#endif /* HAVE_JUMP_LABEL */ +#endif /* CONFIG_JUMP_LABEL */ #define STATIC_KEY_INIT STATIC_KEY_INIT_FALSE #define jump_label_enabled static_key_enabled @@ -394,7 +390,7 @@ extern bool ____wrong_branch_error(void); static_key_count((struct static_key *)x) > 0; \ }) -#ifdef HAVE_JUMP_LABEL +#ifdef CONFIG_JUMP_LABEL /* * Combine the right initial value (type) with the right branch order @@ -476,12 +472,12 @@ extern bool ____wrong_branch_error(void); unlikely(branch); \ }) -#else /* !HAVE_JUMP_LABEL */ +#else /* !CONFIG_JUMP_LABEL */ #define static_branch_likely(x) likely(static_key_enabled(&(x)->key)) #define static_branch_unlikely(x) unlikely(static_key_enabled(&(x)->key)) -#endif /* HAVE_JUMP_LABEL */ +#endif /* CONFIG_JUMP_LABEL */ /* * Advanced usage; refcount, branch is enabled when: count != 0 diff --git a/include/linux/jump_label_ratelimit.h b/include/linux/jump_label_ratelimit.h index baa8eabbaa56..a49f2b45b3f0 100644 --- a/include/linux/jump_label_ratelimit.h +++ b/include/linux/jump_label_ratelimit.h @@ -5,21 +5,19 @@ #include #include -#if defined(CC_HAVE_ASM_GOTO) && defined(CONFIG_JUMP_LABEL) +#if defined(CONFIG_JUMP_LABEL) struct static_key_deferred { struct static_key key; unsigned long timeout; struct delayed_work work; }; -#endif -#ifdef HAVE_JUMP_LABEL extern void static_key_slow_dec_deferred(struct static_key_deferred *key); extern void static_key_deferred_flush(struct static_key_deferred *key); extern void jump_label_rate_limit(struct static_key_deferred *key, unsigned long rl); -#else /* !HAVE_JUMP_LABEL */ +#else /* !CONFIG_JUMP_LABEL */ struct static_key_deferred { struct static_key key; }; @@ -38,5 +36,5 @@ jump_label_rate_limit(struct static_key_deferred *key, { STATIC_KEY_CHECK_USE(key); } -#endif /* HAVE_JUMP_LABEL */ +#endif /* CONFIG_JUMP_LABEL */ #endif /* _LINUX_JUMP_LABEL_RATELIMIT_H */ diff --git a/include/linux/module.h b/include/linux/module.h index d5453eb5a68b..9a21fe3509af 100644 --- a/include/linux/module.h +++ b/include/linux/module.h @@ -436,7 +436,7 @@ struct module { unsigned int num_bpf_raw_events; struct bpf_raw_event_map *bpf_raw_events; #endif -#ifdef HAVE_JUMP_LABEL +#ifdef CONFIG_JUMP_LABEL struct jump_entry *jump_entries; unsigned int num_jump_entries; #endif diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h index bbe99d2b28b4..72cb19c3db6a 100644 --- a/include/linux/netfilter.h +++ b/include/linux/netfilter.h @@ -176,7 +176,7 @@ void nf_unregister_net_hooks(struct net *net, const struct nf_hook_ops *reg, int nf_register_sockopt(struct nf_sockopt_ops *reg); void nf_unregister_sockopt(struct nf_sockopt_ops *reg); -#ifdef HAVE_JUMP_LABEL +#ifdef CONFIG_JUMP_LABEL extern struct static_key nf_hooks_needed[NFPROTO_NUMPROTO][NF_MAX_HOOKS]; #endif @@ -198,7 +198,7 @@ static inline int nf_hook(u_int8_t pf, unsigned int hook, struct net *net, struct nf_hook_entries *hook_head = NULL; int ret = 1; -#ifdef HAVE_JUMP_LABEL +#ifdef CONFIG_JUMP_LABEL if (__builtin_constant_p(pf) && __builtin_constant_p(hook) && !static_key_false(&nf_hooks_needed[pf][hook])) diff --git a/include/linux/netfilter_ingress.h b/include/linux/netfilter_ingress.h index 554c920691dd..a13774be2eb5 100644 --- a/include/linux/netfilter_ingress.h +++ b/include/linux/netfilter_ingress.h @@ -8,7 +8,7 @@ #ifdef CONFIG_NETFILTER_INGRESS static inline bool nf_hook_ingress_active(const struct sk_buff *skb) { -#ifdef HAVE_JUMP_LABEL +#ifdef CONFIG_JUMP_LABEL if (!static_key_false(&nf_hooks_needed[NFPROTO_NETDEV][NF_NETDEV_INGRESS])) return false; #endif diff --git a/init/Kconfig b/init/Kconfig index 3e6be1694766..d47cb77a220e 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -23,6 +23,9 @@ config CLANG_VERSION int default $(shell,$(srctree)/scripts/clang-version.sh $(CC)) +config CC_HAS_ASM_GOTO + def_bool $(success,$(srctree)/scripts/gcc-goto.sh $(CC)) + config CONSTRUCTORS bool depends on !UML diff --git a/kernel/jump_label.c b/kernel/jump_label.c index b28028b08d44..bad96b476eb6 100644 --- a/kernel/jump_label.c +++ b/kernel/jump_label.c @@ -18,8 +18,6 @@ #include #include -#ifdef HAVE_JUMP_LABEL - /* mutex to protect coming/going of the the jump_label table */ static DEFINE_MUTEX(jump_label_mutex); @@ -80,13 +78,13 @@ jump_label_sort_entries(struct jump_entry *start, struct jump_entry *stop) static void jump_label_update(struct static_key *key); /* - * There are similar definitions for the !HAVE_JUMP_LABEL case in jump_label.h. + * There are similar definitions for the !CONFIG_JUMP_LABEL case in jump_label.h. * The use of 'atomic_read()' requires atomic.h and its problematic for some * kernel headers such as kernel.h and others. Since static_key_count() is not - * used in the branch statements as it is for the !HAVE_JUMP_LABEL case its ok + * used in the branch statements as it is for the !CONFIG_JUMP_LABEL case its ok * to have it be a function here. Similarly, for 'static_key_enable()' and * 'static_key_disable()', which require bug.h. This should allow jump_label.h - * to be included from most/all places for HAVE_JUMP_LABEL. + * to be included from most/all places for CONFIG_JUMP_LABEL. */ int static_key_count(struct static_key *key) { @@ -791,5 +789,3 @@ static __init int jump_label_test(void) } early_initcall(jump_label_test); #endif /* STATIC_KEYS_SELFTEST */ - -#endif /* HAVE_JUMP_LABEL */ diff --git a/kernel/module.c b/kernel/module.c index fcbc0128810b..2ad1b5239910 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -3102,7 +3102,7 @@ static int find_module_sections(struct module *mod, struct load_info *info) sizeof(*mod->bpf_raw_events), &mod->num_bpf_raw_events); #endif -#ifdef HAVE_JUMP_LABEL +#ifdef CONFIG_JUMP_LABEL mod->jump_entries = section_objs(info, "__jump_table", sizeof(*mod->jump_entries), &mod->num_jump_entries); diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 223f78d5c111..a674c7db2f29 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -24,7 +24,7 @@ DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues); -#if defined(CONFIG_SCHED_DEBUG) && defined(HAVE_JUMP_LABEL) +#if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_JUMP_LABEL) /* * Debugging: various feature bits * diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index 02bd5f969b21..de3de997e245 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -73,7 +73,7 @@ static int sched_feat_show(struct seq_file *m, void *v) return 0; } -#ifdef HAVE_JUMP_LABEL +#ifdef CONFIG_JUMP_LABEL #define jump_label_key__true STATIC_KEY_INIT_TRUE #define jump_label_key__false STATIC_KEY_INIT_FALSE @@ -99,7 +99,7 @@ static void sched_feat_enable(int i) #else static void sched_feat_disable(int i) { }; static void sched_feat_enable(int i) { }; -#endif /* HAVE_JUMP_LABEL */ +#endif /* CONFIG_JUMP_LABEL */ static int sched_feat_set(char *cmp) { diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 6483834f1278..50aa2aba69bd 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4217,7 +4217,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued) #ifdef CONFIG_CFS_BANDWIDTH -#ifdef HAVE_JUMP_LABEL +#ifdef CONFIG_JUMP_LABEL static struct static_key __cfs_bandwidth_used; static inline bool cfs_bandwidth_used(void) @@ -4234,7 +4234,7 @@ void cfs_bandwidth_usage_dec(void) { static_key_slow_dec_cpuslocked(&__cfs_bandwidth_used); } -#else /* HAVE_JUMP_LABEL */ +#else /* CONFIG_JUMP_LABEL */ static bool cfs_bandwidth_used(void) { return true; @@ -4242,7 +4242,7 @@ static bool cfs_bandwidth_used(void) void cfs_bandwidth_usage_inc(void) {} void cfs_bandwidth_usage_dec(void) {} -#endif /* HAVE_JUMP_LABEL */ +#endif /* CONFIG_JUMP_LABEL */ /* * default period for cfs group bandwidth. diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 0ba08924e017..d04530bf251f 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1488,7 +1488,7 @@ enum { #undef SCHED_FEAT -#if defined(CONFIG_SCHED_DEBUG) && defined(HAVE_JUMP_LABEL) +#if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_JUMP_LABEL) /* * To support run-time toggling of sched features, all the translation units @@ -1508,7 +1508,7 @@ static __always_inline bool static_branch_##name(struct static_key *key) \ extern struct static_key sched_feat_keys[__SCHED_FEAT_NR]; #define sched_feat(x) (static_branch_##x(&sched_feat_keys[__SCHED_FEAT_##x])) -#else /* !(SCHED_DEBUG && HAVE_JUMP_LABEL) */ +#else /* !(SCHED_DEBUG && CONFIG_JUMP_LABEL) */ /* * Each translation unit has its own copy of sysctl_sched_features to allow @@ -1524,7 +1524,7 @@ static const_debug __maybe_unused unsigned int sysctl_sched_features = #define sched_feat(x) !!(sysctl_sched_features & (1UL << __SCHED_FEAT_##x)) -#endif /* SCHED_DEBUG && HAVE_JUMP_LABEL */ +#endif /* SCHED_DEBUG && CONFIG_JUMP_LABEL */ extern struct static_key_false sched_numa_balancing; extern struct static_key_false sched_schedstats; diff --git a/lib/dynamic_debug.c b/lib/dynamic_debug.c index c7c96bc7654a..dbf2b457e47e 100644 --- a/lib/dynamic_debug.c +++ b/lib/dynamic_debug.c @@ -188,7 +188,7 @@ static int ddebug_change(const struct ddebug_query *query, newflags = (dp->flags & mask) | flags; if (newflags == dp->flags) continue; -#ifdef HAVE_JUMP_LABEL +#ifdef CONFIG_JUMP_LABEL if (dp->flags & _DPRINTK_FLAGS_PRINT) { if (!(flags & _DPRINTK_FLAGS_PRINT)) static_branch_disable(&dp->key.dd_key_true); diff --git a/net/core/dev.c b/net/core/dev.c index 1b5a4410be0e..82f20022259d 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1821,7 +1821,7 @@ EXPORT_SYMBOL_GPL(net_dec_egress_queue); #endif static DEFINE_STATIC_KEY_FALSE(netstamp_needed_key); -#ifdef HAVE_JUMP_LABEL +#ifdef CONFIG_JUMP_LABEL static atomic_t netstamp_needed_deferred; static atomic_t netstamp_wanted; static void netstamp_clear(struct work_struct *work) @@ -1840,7 +1840,7 @@ static DECLARE_WORK(netstamp_work, netstamp_clear); void net_enable_timestamp(void) { -#ifdef HAVE_JUMP_LABEL +#ifdef CONFIG_JUMP_LABEL int wanted; while (1) { @@ -1860,7 +1860,7 @@ EXPORT_SYMBOL(net_enable_timestamp); void net_disable_timestamp(void) { -#ifdef HAVE_JUMP_LABEL +#ifdef CONFIG_JUMP_LABEL int wanted; while (1) { diff --git a/net/netfilter/core.c b/net/netfilter/core.c index dc240cb47ddf..93aaec3a54ec 100644 --- a/net/netfilter/core.c +++ b/net/netfilter/core.c @@ -33,7 +33,7 @@ EXPORT_SYMBOL_GPL(nf_ipv6_ops); DEFINE_PER_CPU(bool, nf_skb_duplicated); EXPORT_SYMBOL_GPL(nf_skb_duplicated); -#ifdef HAVE_JUMP_LABEL +#ifdef CONFIG_JUMP_LABEL struct static_key nf_hooks_needed[NFPROTO_NUMPROTO][NF_MAX_HOOKS]; EXPORT_SYMBOL(nf_hooks_needed); #endif @@ -347,7 +347,7 @@ static int __nf_register_net_hook(struct net *net, int pf, if (pf == NFPROTO_NETDEV && reg->hooknum == NF_NETDEV_INGRESS) net_inc_ingress_queue(); #endif -#ifdef HAVE_JUMP_LABEL +#ifdef CONFIG_JUMP_LABEL static_key_slow_inc(&nf_hooks_needed[pf][reg->hooknum]); #endif BUG_ON(p == new_hooks); @@ -405,7 +405,7 @@ static void __nf_unregister_net_hook(struct net *net, int pf, if (pf == NFPROTO_NETDEV && reg->hooknum == NF_NETDEV_INGRESS) net_dec_ingress_queue(); #endif -#ifdef HAVE_JUMP_LABEL +#ifdef CONFIG_JUMP_LABEL static_key_slow_dec(&nf_hooks_needed[pf][reg->hooknum]); #endif } else { diff --git a/scripts/gcc-goto.sh b/scripts/gcc-goto.sh index 083c526073ef..8b980fb2270a 100755 --- a/scripts/gcc-goto.sh +++ b/scripts/gcc-goto.sh @@ -3,7 +3,7 @@ # Test for gcc 'asm goto' support # Copyright (C) 2010, Jason Baron -cat << "END" | $@ -x c - -c -o /dev/null >/dev/null 2>&1 && echo "y" +cat << "END" | $@ -x c - -fno-PIE -c -o /dev/null int main(void) { #if defined(__arm__) || defined(__aarch64__) diff --git a/tools/arch/x86/include/asm/rmwcc.h b/tools/arch/x86/include/asm/rmwcc.h index dc90c0c2fae3..fee7983a90b4 100644 --- a/tools/arch/x86/include/asm/rmwcc.h +++ b/tools/arch/x86/include/asm/rmwcc.h @@ -2,7 +2,7 @@ #ifndef _TOOLS_LINUX_ASM_X86_RMWcc #define _TOOLS_LINUX_ASM_X86_RMWcc -#ifdef CC_HAVE_ASM_GOTO +#ifdef CONFIG_CC_HAS_ASM_GOTO #define __GEN_RMWcc(fullop, var, cc, ...) \ do { \ @@ -20,7 +20,7 @@ cc_label: \ #define GEN_BINARY_RMWcc(op, var, vcon, val, arg0, cc) \ __GEN_RMWcc(op " %1, " arg0, var, cc, vcon (val)) -#else /* !CC_HAVE_ASM_GOTO */ +#else /* !CONFIG_CC_HAS_ASM_GOTO */ #define __GEN_RMWcc(fullop, var, cc, ...) \ do { \ @@ -37,6 +37,6 @@ do { \ #define GEN_BINARY_RMWcc(op, var, vcon, val, arg0, cc) \ __GEN_RMWcc(op " %2, " arg0, var, cc, vcon (val)) -#endif /* CC_HAVE_ASM_GOTO */ +#endif /* CONFIG_CC_HAS_ASM_GOTO */ #endif /* _TOOLS_LINUX_ASM_X86_RMWcc */ -- cgit v1.3-7-g2ca7 From ad774086356da92a477a87916613d96f4b36005c Mon Sep 17 00:00:00 2001 From: Masahiro Yamada Date: Mon, 31 Dec 2018 17:24:09 +0900 Subject: kbuild: change filechk to surround the given command with { } filechk_* rules often consist of multiple 'echo' lines. They must be surrounded with { } or ( ) to work correctly. Otherwise, only the string from the last 'echo' would be written into the target. Let's take care of that in the 'filechk' in scripts/Kbuild.include to clean up filechk_* rules. Signed-off-by: Masahiro Yamada --- Kbuild | 2 +- Makefile | 6 +++--- arch/s390/tools/Makefile | 2 +- firmware/Makefile | 5 ++--- kernel/Makefile | 6 +++++- scripts/Kbuild.include | 2 +- scripts/Makefile.lib | 3 +-- 7 files changed, 14 insertions(+), 12 deletions(-) (limited to 'kernel') diff --git a/Kbuild b/Kbuild index 414ae6da1f50..06b801e12fb4 100644 --- a/Kbuild +++ b/Kbuild @@ -27,7 +27,7 @@ timeconst-file := include/generated/timeconst.h targets += $(timeconst-file) define filechk_gentimeconst - (echo $(CONFIG_HZ) | bc -q $< ) + echo $(CONFIG_HZ) | bc -q $< endef $(timeconst-file): kernel/time/timeconst.bc FORCE diff --git a/Makefile b/Makefile index 04a857817f77..437d6033598c 100644 --- a/Makefile +++ b/Makefile @@ -1127,13 +1127,13 @@ define filechk_utsrelease.h echo '"$(KERNELRELEASE)" exceeds $(uts_len) characters' >&2; \ exit 1; \ fi; \ - (echo \#define UTS_RELEASE \"$(KERNELRELEASE)\";) + echo \#define UTS_RELEASE \"$(KERNELRELEASE)\" endef define filechk_version.h - (echo \#define LINUX_VERSION_CODE $(shell \ + echo \#define LINUX_VERSION_CODE $(shell \ expr $(VERSION) \* 65536 + 0$(PATCHLEVEL) \* 256 + 0$(SUBLEVEL)); \ - echo '#define KERNEL_VERSION(a,b,c) (((a) << 16) + ((b) << 8) + (c))';) + echo '#define KERNEL_VERSION(a,b,c) (((a) << 16) + ((b) << 8) + (c))' endef $(version_h): FORCE diff --git a/arch/s390/tools/Makefile b/arch/s390/tools/Makefile index 48cdac1143a9..cf4846a7ee8d 100644 --- a/arch/s390/tools/Makefile +++ b/arch/s390/tools/Makefile @@ -25,7 +25,7 @@ define filechk_facility-defs.h endef define filechk_dis-defs.h - ( $(obj)/gen_opcode_table < $(srctree)/arch/$(ARCH)/tools/opcodes.txt ) + $(obj)/gen_opcode_table < $(srctree)/arch/$(ARCH)/tools/opcodes.txt endef $(kapi)/facility-defs.h: $(obj)/gen_facilities FORCE diff --git a/firmware/Makefile b/firmware/Makefile index e2f7dd2f30e0..37e5ae387400 100644 --- a/firmware/Makefile +++ b/firmware/Makefile @@ -13,7 +13,7 @@ ASM_WORD = $(if $(CONFIG_64BIT),.quad,.long) ASM_ALIGN = $(if $(CONFIG_64BIT),3,2) PROGBITS = $(if $(CONFIG_ARM),%,@)progbits -filechk_fwbin = { \ +filechk_fwbin = \ echo "/* Generated by $(src)/Makefile */" ;\ echo " .section .rodata" ;\ echo " .p2align $(ASM_ALIGN)" ;\ @@ -28,8 +28,7 @@ filechk_fwbin = { \ echo " .p2align $(ASM_ALIGN)" ;\ echo " $(ASM_WORD) _fw_$(FWSTR)_name" ;\ echo " $(ASM_WORD) _fw_$(FWSTR)_bin" ;\ - echo " $(ASM_WORD) _fw_end - _fw_$(FWSTR)_bin" ;\ -} + echo " $(ASM_WORD) _fw_end - _fw_$(FWSTR)_bin" $(obj)/%.gen.S: FORCE $(call filechk,fwbin) diff --git a/kernel/Makefile b/kernel/Makefile index cde93d54c571..6aa7543bcdb2 100644 --- a/kernel/Makefile +++ b/kernel/Makefile @@ -122,7 +122,11 @@ targets += config_data.gz $(obj)/config_data.gz: $(KCONFIG_CONFIG) FORCE $(call if_changed,gzip) - filechk_ikconfiggz = (echo "static const char kernel_config_data[] __used = MAGIC_START"; cat $< | scripts/bin2c; echo "MAGIC_END;") +filechk_ikconfiggz = \ + echo "static const char kernel_config_data[] __used = MAGIC_START"; \ + cat $< | scripts/bin2c; \ + echo "MAGIC_END;" + targets += config_data.h $(obj)/config_data.h: $(obj)/config_data.gz FORCE $(call filechk,ikconfiggz) diff --git a/scripts/Kbuild.include b/scripts/Kbuild.include index 46bf1a073f5d..74a3fe7ddc01 100644 --- a/scripts/Kbuild.include +++ b/scripts/Kbuild.include @@ -56,7 +56,7 @@ kecho := $($(quiet)kecho) define filechk $(Q)set -e; \ mkdir -p $(dir $@); \ - $(filechk_$(1)) > $@.tmp; \ + { $(filechk_$(1)); } > $@.tmp; \ if [ -r $@ ] && cmp -s $@ $@.tmp; then \ rm -f $@.tmp; \ else \ diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib index 390957f9306f..12b88d09c3a4 100644 --- a/scripts/Makefile.lib +++ b/scripts/Makefile.lib @@ -417,7 +417,6 @@ endef # Use filechk to avoid rebuilds when a header changes, but the resulting file # does not define filechk_offsets - ( \ echo "#ifndef $2"; \ echo "#define $2"; \ echo "/*"; \ @@ -428,5 +427,5 @@ define filechk_offsets echo ""; \ sed -ne $(sed-offsets) < $<; \ echo ""; \ - echo "#endif" ) + echo "#endif" endef -- cgit v1.3-7-g2ca7 From d3bd7413e0ca40b60cf60d4003246d067cafdeda Mon Sep 17 00:00:00 2001 From: Daniel Borkmann Date: Sun, 6 Jan 2019 00:54:37 +0100 Subject: bpf: fix sanitation of alu op with pointer / scalar type from different paths While 979d63d50c0c ("bpf: prevent out of bounds speculation on pointer arithmetic") took care of rejecting alu op on pointer when e.g. pointer came from two different map values with different map properties such as value size, Jann reported that a case was not covered yet when a given alu op is used in both "ptr_reg += reg" and "numeric_reg += reg" from different branches where we would incorrectly try to sanitize based on the pointer's limit. Catch this corner case and reject the program instead. Fixes: 979d63d50c0c ("bpf: prevent out of bounds speculation on pointer arithmetic") Reported-by: Jann Horn Signed-off-by: Daniel Borkmann Acked-by: Alexei Starovoitov Signed-off-by: Alexei Starovoitov --- include/linux/bpf_verifier.h | 1 + kernel/bpf/verifier.c | 61 ++++++++++++++++++++++++++++++++++---------- 2 files changed, 49 insertions(+), 13 deletions(-) (limited to 'kernel') diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h index 27b74947cd2b..573cca00a0e6 100644 --- a/include/linux/bpf_verifier.h +++ b/include/linux/bpf_verifier.h @@ -172,6 +172,7 @@ struct bpf_verifier_state_list { #define BPF_ALU_SANITIZE_SRC 1U #define BPF_ALU_SANITIZE_DST 2U #define BPF_ALU_NEG_VALUE (1U << 2) +#define BPF_ALU_NON_POINTER (1U << 3) #define BPF_ALU_SANITIZE (BPF_ALU_SANITIZE_SRC | \ BPF_ALU_SANITIZE_DST) diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index f6bc62a9ee8e..56674a7c3778 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -3103,6 +3103,40 @@ static int retrieve_ptr_limit(const struct bpf_reg_state *ptr_reg, } } +static bool can_skip_alu_sanitation(const struct bpf_verifier_env *env, + const struct bpf_insn *insn) +{ + return env->allow_ptr_leaks || BPF_SRC(insn->code) == BPF_K; +} + +static int update_alu_sanitation_state(struct bpf_insn_aux_data *aux, + u32 alu_state, u32 alu_limit) +{ + /* If we arrived here from different branches with different + * state or limits to sanitize, then this won't work. + */ + if (aux->alu_state && + (aux->alu_state != alu_state || + aux->alu_limit != alu_limit)) + return -EACCES; + + /* Corresponding fixup done in fixup_bpf_calls(). */ + aux->alu_state = alu_state; + aux->alu_limit = alu_limit; + return 0; +} + +static int sanitize_val_alu(struct bpf_verifier_env *env, + struct bpf_insn *insn) +{ + struct bpf_insn_aux_data *aux = cur_aux(env); + + if (can_skip_alu_sanitation(env, insn)) + return 0; + + return update_alu_sanitation_state(aux, BPF_ALU_NON_POINTER, 0); +} + static int sanitize_ptr_alu(struct bpf_verifier_env *env, struct bpf_insn *insn, const struct bpf_reg_state *ptr_reg, @@ -3117,7 +3151,7 @@ static int sanitize_ptr_alu(struct bpf_verifier_env *env, struct bpf_reg_state tmp; bool ret; - if (env->allow_ptr_leaks || BPF_SRC(insn->code) == BPF_K) + if (can_skip_alu_sanitation(env, insn)) return 0; /* We already marked aux for masking from non-speculative @@ -3133,19 +3167,8 @@ static int sanitize_ptr_alu(struct bpf_verifier_env *env, if (retrieve_ptr_limit(ptr_reg, &alu_limit, opcode, off_is_neg)) return 0; - - /* If we arrived here from different branches with different - * limits to sanitize, then this won't work. - */ - if (aux->alu_state && - (aux->alu_state != alu_state || - aux->alu_limit != alu_limit)) + if (update_alu_sanitation_state(aux, alu_state, alu_limit)) return -EACCES; - - /* Corresponding fixup done in fixup_bpf_calls(). */ - aux->alu_state = alu_state; - aux->alu_limit = alu_limit; - do_sim: /* Simulate and find potential out-of-bounds access under * speculative execution from truncation as a result of @@ -3418,6 +3441,8 @@ static int adjust_scalar_min_max_vals(struct bpf_verifier_env *env, s64 smin_val, smax_val; u64 umin_val, umax_val; u64 insn_bitness = (BPF_CLASS(insn->code) == BPF_ALU64) ? 64 : 32; + u32 dst = insn->dst_reg; + int ret; if (insn_bitness == 32) { /* Relevant for 32-bit RSH: Information can propagate towards @@ -3452,6 +3477,11 @@ static int adjust_scalar_min_max_vals(struct bpf_verifier_env *env, switch (opcode) { case BPF_ADD: + ret = sanitize_val_alu(env, insn); + if (ret < 0) { + verbose(env, "R%d tried to add from different pointers or scalars\n", dst); + return ret; + } if (signed_add_overflows(dst_reg->smin_value, smin_val) || signed_add_overflows(dst_reg->smax_value, smax_val)) { dst_reg->smin_value = S64_MIN; @@ -3471,6 +3501,11 @@ static int adjust_scalar_min_max_vals(struct bpf_verifier_env *env, dst_reg->var_off = tnum_add(dst_reg->var_off, src_reg.var_off); break; case BPF_SUB: + ret = sanitize_val_alu(env, insn); + if (ret < 0) { + verbose(env, "R%d tried to sub from different pointers or scalars\n", dst); + return ret; + } if (signed_sub_overflows(dst_reg->smin_value, smax_val) || signed_sub_overflows(dst_reg->smax_value, smin_val)) { /* Overflow possible, we know nothing */ -- cgit v1.3-7-g2ca7 From 7b55851367136b1efd84d98fea81ba57a98304cf Mon Sep 17 00:00:00 2001 From: David Herrmann Date: Tue, 8 Jan 2019 13:58:52 +0100 Subject: fork: record start_time late This changes the fork(2) syscall to record the process start_time after initializing the basic task structure but still before making the new process visible to user-space. Technically, we could record the start_time anytime during fork(2). But this might lead to scenarios where a start_time is recorded long before a process becomes visible to user-space. For instance, with userfaultfd(2) and TLS, user-space can delay the execution of fork(2) for an indefinite amount of time (and will, if this causes network access, or similar). By recording the start_time late, it much closer reflects the point in time where the process becomes live and can be observed by other processes. Lastly, this makes it much harder for user-space to predict and control the start_time they get assigned. Previously, user-space could fork a process and stall it in copy_thread_tls() before its pid is allocated, but after its start_time is recorded. This can be misused to later-on cycle through PIDs and resume the stalled fork(2) yielding a process that has the same pid and start_time as a process that existed before. This can be used to circumvent security systems that identify processes by their pid+start_time combination. Even though user-space was always aware that start_time recording is flaky (but several projects are known to still rely on start_time-based identification), changing the start_time to be recorded late will help mitigate existing attacks and make it much harder for user-space to control the start_time a process gets assigned. Reported-by: Jann Horn Signed-off-by: Tom Gundersen Signed-off-by: David Herrmann Signed-off-by: Linus Torvalds --- kernel/fork.c | 13 +++++++++++-- 1 file changed, 11 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/fork.c b/kernel/fork.c index a60459947f18..7f49be94eba9 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1833,8 +1833,6 @@ static __latent_entropy struct task_struct *copy_process( posix_cpu_timers_init(p); - p->start_time = ktime_get_ns(); - p->real_start_time = ktime_get_boot_ns(); p->io_context = NULL; audit_set_context(p, NULL); cgroup_fork(p); @@ -2000,6 +1998,17 @@ static __latent_entropy struct task_struct *copy_process( if (retval) goto bad_fork_free_pid; + /* + * From this point on we must avoid any synchronous user-space + * communication until we take the tasklist-lock. In particular, we do + * not want user-space to be able to predict the process start-time by + * stalling fork(2) after we recorded the start_time but before it is + * visible to the system. + */ + + p->start_time = ktime_get_ns(); + p->real_start_time = ktime_get_boot_ns(); + /* * Make it visible to the rest of the system, but dont wake it up yet. * Need tasklist lock for parent etc handling! -- cgit v1.3-7-g2ca7 From 98c88651365767c72ec6dc672072423bc19a39aa Mon Sep 17 00:00:00 2001 From: Casey Schaufler Date: Fri, 21 Sep 2018 17:17:25 -0700 Subject: SELinux: Remove cred security blob poisoning The SELinux specific credential poisioning only makes sense if SELinux is managing the credentials. As the intent of this patch set is to move the blob management out of the modules and into the infrastructure, the SELinux specific code has to go. The poisioning could be introduced into the infrastructure at some later date. Signed-off-by: Casey Schaufler Reviewed-by: Kees Cook Signed-off-by: Kees Cook --- kernel/cred.c | 13 ------------- security/selinux/hooks.c | 6 ------ 2 files changed, 19 deletions(-) (limited to 'kernel') diff --git a/kernel/cred.c b/kernel/cred.c index 21f4a97085b4..45d77284aed0 100644 --- a/kernel/cred.c +++ b/kernel/cred.c @@ -760,19 +760,6 @@ bool creds_are_invalid(const struct cred *cred) { if (cred->magic != CRED_MAGIC) return true; -#ifdef CONFIG_SECURITY_SELINUX - /* - * cred->security == NULL if security_cred_alloc_blank() or - * security_prepare_creds() returned an error. - */ - if (selinux_is_enabled() && cred->security) { - if ((unsigned long) cred->security < PAGE_SIZE) - return true; - if ((*(u32 *)cred->security & 0xffffff00) == - (POISON_FREE << 24 | POISON_FREE << 16 | POISON_FREE << 8)) - return true; - } -#endif return false; } EXPORT_SYMBOL(creds_are_invalid); diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c index beec1de5c2da..ad227177550b 100644 --- a/security/selinux/hooks.c +++ b/security/selinux/hooks.c @@ -3708,12 +3708,6 @@ static void selinux_cred_free(struct cred *cred) { struct task_security_struct *tsec = selinux_cred(cred); - /* - * cred->security == NULL if security_cred_alloc_blank() or - * security_prepare_creds() returned an error. - */ - BUG_ON(cred->security && (unsigned long) cred->security < PAGE_SIZE); - cred->security = (void *) 0x7UL; kfree(tsec); } -- cgit v1.3-7-g2ca7 From ba4a45746c362b665e245c50b870615f02f34781 Mon Sep 17 00:00:00 2001 From: Shakeel Butt Date: Tue, 8 Jan 2019 15:22:57 -0800 Subject: fork, memcg: fix cached_stacks case Commit 5eed6f1dff87 ("fork,memcg: fix crash in free_thread_stack on memcg charge fail") fixes a crash caused due to failed memcg charge of the kernel stack. However the fix misses the cached_stacks case which this patch fixes. So, the same crash can happen if the memcg charge of a cached stack is failed. Link: http://lkml.kernel.org/r/20190102180145.57406-1-shakeelb@google.com Fixes: 5eed6f1dff87 ("fork,memcg: fix crash in free_thread_stack on memcg charge fail") Signed-off-by: Shakeel Butt Acked-by: Michal Hocko Acked-by: Rik van Riel Cc: Rik van Riel Cc: Roman Gushchin Cc: Johannes Weiner Cc: Tejun Heo Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/fork.c | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/fork.c b/kernel/fork.c index a60459947f18..5ad60d47f7e7 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -217,6 +217,7 @@ static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node) memset(s->addr, 0, THREAD_SIZE); tsk->stack_vm_area = s; + tsk->stack = s->addr; return s->addr; } -- cgit v1.3-7-g2ca7 From beaf3d1901f4ea46fbd5c9d857227d99751de469 Mon Sep 17 00:00:00 2001 From: Song Liu Date: Tue, 8 Jan 2019 14:20:44 -0800 Subject: bpf: fix panic in stack_map_get_build_id() on i386 and arm32 As Naresh reported, test_stacktrace_build_id() causes panic on i386 and arm32 systems. This is caused by page_address() returns NULL in certain cases. This patch fixes this error by using kmap_atomic/kunmap_atomic instead of page_address. Fixes: 615755a77b24 (" bpf: extend stackmap to save binary_build_id+offset instead of address") Reported-by: Naresh Kamboju Signed-off-by: Song Liu Signed-off-by: Daniel Borkmann --- kernel/bpf/stackmap.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c index 90daf285de03..d9e2483669d0 100644 --- a/kernel/bpf/stackmap.c +++ b/kernel/bpf/stackmap.c @@ -260,7 +260,7 @@ static int stack_map_get_build_id(struct vm_area_struct *vma, return -EFAULT; /* page not mapped */ ret = -EINVAL; - page_addr = page_address(page); + page_addr = kmap_atomic(page); ehdr = (Elf32_Ehdr *)page_addr; /* compare magic x7f "ELF" */ @@ -276,6 +276,7 @@ static int stack_map_get_build_id(struct vm_area_struct *vma, else if (ehdr->e_ident[EI_CLASS] == ELFCLASS64) ret = stack_map_get_build_id_64(page_addr, build_id); out: + kunmap_atomic(page_addr); put_page(page); return ret; } -- cgit v1.3-7-g2ca7 From c1a85a00ea66cb6f0bd0f14e47c28c2b0999799f Mon Sep 17 00:00:00 2001 From: Micah Morton Date: Mon, 7 Jan 2019 16:10:53 -0800 Subject: LSM: generalize flag passing to security_capable This patch provides a general mechanism for passing flags to the security_capable LSM hook. It replaces the specific 'audit' flag that is used to tell security_capable whether it should log an audit message for the given capability check. The reason for generalizing this flag passing is so we can add an additional flag that signifies whether security_capable is being called by a setid syscall (which is needed by the proposed SafeSetID LSM). Signed-off-by: Micah Morton Reviewed-by: Kees Cook Signed-off-by: James Morris --- include/linux/lsm_hooks.h | 8 +++++--- include/linux/security.h | 28 ++++++++++++++-------------- kernel/capability.c | 22 +++++++++++++--------- kernel/seccomp.c | 4 ++-- security/apparmor/capability.c | 14 +++++++------- security/apparmor/include/capability.h | 2 +- security/apparmor/ipc.c | 3 ++- security/apparmor/lsm.c | 4 ++-- security/apparmor/resource.c | 2 +- security/commoncap.c | 17 +++++++++-------- security/security.c | 14 +++++--------- security/selinux/hooks.c | 18 +++++++++--------- security/smack/smack_access.c | 2 +- 13 files changed, 71 insertions(+), 67 deletions(-) (limited to 'kernel') diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h index 40511a8a5ae6..195707210975 100644 --- a/include/linux/lsm_hooks.h +++ b/include/linux/lsm_hooks.h @@ -1270,7 +1270,7 @@ * @cred contains the credentials to use. * @ns contains the user namespace we want the capability in * @cap contains the capability . - * @audit contains whether to write an audit message or not + * @opts contains options for the capable check * Return 0 if the capability is granted for @tsk. * @syslog: * Check permission before accessing the kernel message ring or changing @@ -1446,8 +1446,10 @@ union security_list_options { const kernel_cap_t *effective, const kernel_cap_t *inheritable, const kernel_cap_t *permitted); - int (*capable)(const struct cred *cred, struct user_namespace *ns, - int cap, int audit); + int (*capable)(const struct cred *cred, + struct user_namespace *ns, + int cap, + unsigned int opts); int (*quotactl)(int cmds, int type, int id, struct super_block *sb); int (*quota_on)(struct dentry *dentry); int (*syslog)(int type); diff --git a/include/linux/security.h b/include/linux/security.h index b2c5333ed4b5..13537a49ae97 100644 --- a/include/linux/security.h +++ b/include/linux/security.h @@ -54,9 +54,12 @@ struct xattr; struct xfrm_sec_ctx; struct mm_struct; +/* Default (no) options for the capable function */ +#define CAP_OPT_NONE 0x0 /* If capable should audit the security request */ -#define SECURITY_CAP_NOAUDIT 0 -#define SECURITY_CAP_AUDIT 1 +#define CAP_OPT_NOAUDIT BIT(1) +/* If capable is being called by a setid function */ +#define CAP_OPT_INSETID BIT(2) /* LSM Agnostic defines for sb_set_mnt_opts */ #define SECURITY_LSM_NATIVE_LABELS 1 @@ -72,7 +75,7 @@ enum lsm_event { /* These functions are in security/commoncap.c */ extern int cap_capable(const struct cred *cred, struct user_namespace *ns, - int cap, int audit); + int cap, unsigned int opts); extern int cap_settime(const struct timespec64 *ts, const struct timezone *tz); extern int cap_ptrace_access_check(struct task_struct *child, unsigned int mode); extern int cap_ptrace_traceme(struct task_struct *parent); @@ -207,10 +210,10 @@ int security_capset(struct cred *new, const struct cred *old, const kernel_cap_t *effective, const kernel_cap_t *inheritable, const kernel_cap_t *permitted); -int security_capable(const struct cred *cred, struct user_namespace *ns, - int cap); -int security_capable_noaudit(const struct cred *cred, struct user_namespace *ns, - int cap); +int security_capable(const struct cred *cred, + struct user_namespace *ns, + int cap, + unsigned int opts); int security_quotactl(int cmds, int type, int id, struct super_block *sb); int security_quota_on(struct dentry *dentry); int security_syslog(int type); @@ -464,14 +467,11 @@ static inline int security_capset(struct cred *new, } static inline int security_capable(const struct cred *cred, - struct user_namespace *ns, int cap) + struct user_namespace *ns, + int cap, + unsigned int opts) { - return cap_capable(cred, ns, cap, SECURITY_CAP_AUDIT); -} - -static inline int security_capable_noaudit(const struct cred *cred, - struct user_namespace *ns, int cap) { - return cap_capable(cred, ns, cap, SECURITY_CAP_NOAUDIT); + return cap_capable(cred, ns, cap, opts); } static inline int security_quotactl(int cmds, int type, int id, diff --git a/kernel/capability.c b/kernel/capability.c index 1e1c0236f55b..7718d7dcadc7 100644 --- a/kernel/capability.c +++ b/kernel/capability.c @@ -299,7 +299,7 @@ bool has_ns_capability(struct task_struct *t, int ret; rcu_read_lock(); - ret = security_capable(__task_cred(t), ns, cap); + ret = security_capable(__task_cred(t), ns, cap, CAP_OPT_NONE); rcu_read_unlock(); return (ret == 0); @@ -340,7 +340,7 @@ bool has_ns_capability_noaudit(struct task_struct *t, int ret; rcu_read_lock(); - ret = security_capable_noaudit(__task_cred(t), ns, cap); + ret = security_capable(__task_cred(t), ns, cap, CAP_OPT_NOAUDIT); rcu_read_unlock(); return (ret == 0); @@ -363,7 +363,9 @@ bool has_capability_noaudit(struct task_struct *t, int cap) return has_ns_capability_noaudit(t, &init_user_ns, cap); } -static bool ns_capable_common(struct user_namespace *ns, int cap, bool audit) +static bool ns_capable_common(struct user_namespace *ns, + int cap, + unsigned int opts) { int capable; @@ -372,8 +374,7 @@ static bool ns_capable_common(struct user_namespace *ns, int cap, bool audit) BUG(); } - capable = audit ? security_capable(current_cred(), ns, cap) : - security_capable_noaudit(current_cred(), ns, cap); + capable = security_capable(current_cred(), ns, cap, opts); if (capable == 0) { current->flags |= PF_SUPERPRIV; return true; @@ -394,7 +395,7 @@ static bool ns_capable_common(struct user_namespace *ns, int cap, bool audit) */ bool ns_capable(struct user_namespace *ns, int cap) { - return ns_capable_common(ns, cap, true); + return ns_capable_common(ns, cap, CAP_OPT_NONE); } EXPORT_SYMBOL(ns_capable); @@ -412,7 +413,7 @@ EXPORT_SYMBOL(ns_capable); */ bool ns_capable_noaudit(struct user_namespace *ns, int cap) { - return ns_capable_common(ns, cap, false); + return ns_capable_common(ns, cap, CAP_OPT_NOAUDIT); } EXPORT_SYMBOL(ns_capable_noaudit); @@ -448,10 +449,11 @@ EXPORT_SYMBOL(capable); bool file_ns_capable(const struct file *file, struct user_namespace *ns, int cap) { + if (WARN_ON_ONCE(!cap_valid(cap))) return false; - if (security_capable(file->f_cred, ns, cap) == 0) + if (security_capable(file->f_cred, ns, cap, CAP_OPT_NONE) == 0) return true; return false; @@ -500,10 +502,12 @@ bool ptracer_capable(struct task_struct *tsk, struct user_namespace *ns) { int ret = 0; /* An absent tracer adds no restrictions */ const struct cred *cred; + rcu_read_lock(); cred = rcu_dereference(tsk->ptracer_cred); if (cred) - ret = security_capable_noaudit(cred, ns, CAP_SYS_PTRACE); + ret = security_capable(cred, ns, CAP_SYS_PTRACE, + CAP_OPT_NOAUDIT); rcu_read_unlock(); return (ret == 0); } diff --git a/kernel/seccomp.c b/kernel/seccomp.c index d7f538847b84..38a77800def6 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -443,8 +443,8 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog) * behavior of privileged children. */ if (!task_no_new_privs(current) && - security_capable_noaudit(current_cred(), current_user_ns(), - CAP_SYS_ADMIN) != 0) + security_capable(current_cred(), current_user_ns(), + CAP_SYS_ADMIN, CAP_OPT_NOAUDIT) != 0) return ERR_PTR(-EACCES); /* Allocate a new seccomp_filter */ diff --git a/security/apparmor/capability.c b/security/apparmor/capability.c index 253ef6e9d445..752f73980e30 100644 --- a/security/apparmor/capability.c +++ b/security/apparmor/capability.c @@ -110,13 +110,13 @@ static int audit_caps(struct common_audit_data *sa, struct aa_profile *profile, * profile_capable - test if profile allows use of capability @cap * @profile: profile being enforced (NOT NULL, NOT unconfined) * @cap: capability to test if allowed - * @audit: whether an audit record should be generated + * @opts: CAP_OPT_NOAUDIT bit determines whether audit record is generated * @sa: audit data (MAY BE NULL indicating no auditing) * * Returns: 0 if allowed else -EPERM */ -static int profile_capable(struct aa_profile *profile, int cap, int audit, - struct common_audit_data *sa) +static int profile_capable(struct aa_profile *profile, int cap, + unsigned int opts, struct common_audit_data *sa) { int error; @@ -126,7 +126,7 @@ static int profile_capable(struct aa_profile *profile, int cap, int audit, else error = -EPERM; - if (audit == SECURITY_CAP_NOAUDIT) { + if (opts & CAP_OPT_NOAUDIT) { if (!COMPLAIN_MODE(profile)) return error; /* audit the cap request in complain mode but note that it @@ -142,13 +142,13 @@ static int profile_capable(struct aa_profile *profile, int cap, int audit, * aa_capable - test permission to use capability * @label: label being tested for capability (NOT NULL) * @cap: capability to be tested - * @audit: whether an audit record should be generated + * @opts: CAP_OPT_NOAUDIT bit determines whether audit record is generated * * Look up capability in profile capability set. * * Returns: 0 on success, or else an error code. */ -int aa_capable(struct aa_label *label, int cap, int audit) +int aa_capable(struct aa_label *label, int cap, unsigned int opts) { struct aa_profile *profile; int error = 0; @@ -156,7 +156,7 @@ int aa_capable(struct aa_label *label, int cap, int audit) sa.u.cap = cap; error = fn_for_each_confined(label, profile, - profile_capable(profile, cap, audit, &sa)); + profile_capable(profile, cap, opts, &sa)); return error; } diff --git a/security/apparmor/include/capability.h b/security/apparmor/include/capability.h index e0304e2aeb7f..1b3663b6ab12 100644 --- a/security/apparmor/include/capability.h +++ b/security/apparmor/include/capability.h @@ -40,7 +40,7 @@ struct aa_caps { extern struct aa_sfs_entry aa_sfs_entry_caps[]; -int aa_capable(struct aa_label *label, int cap, int audit); +int aa_capable(struct aa_label *label, int cap, unsigned int opts); static inline void aa_free_cap_rules(struct aa_caps *caps) { diff --git a/security/apparmor/ipc.c b/security/apparmor/ipc.c index 527ea1557120..aacd1e95cb59 100644 --- a/security/apparmor/ipc.c +++ b/security/apparmor/ipc.c @@ -107,7 +107,8 @@ static int profile_tracer_perm(struct aa_profile *tracer, aad(sa)->label = &tracer->label; aad(sa)->peer = tracee; aad(sa)->request = 0; - aad(sa)->error = aa_capable(&tracer->label, CAP_SYS_PTRACE, 1); + aad(sa)->error = aa_capable(&tracer->label, CAP_SYS_PTRACE, + CAP_OPT_NONE); return aa_audit(AUDIT_APPARMOR_AUTO, tracer, sa, audit_ptrace_cb); } diff --git a/security/apparmor/lsm.c b/security/apparmor/lsm.c index 60ef71268ccf..b6c395e2acd0 100644 --- a/security/apparmor/lsm.c +++ b/security/apparmor/lsm.c @@ -172,14 +172,14 @@ static int apparmor_capget(struct task_struct *target, kernel_cap_t *effective, } static int apparmor_capable(const struct cred *cred, struct user_namespace *ns, - int cap, int audit) + int cap, unsigned int opts) { struct aa_label *label; int error = 0; label = aa_get_newest_cred_label(cred); if (!unconfined(label)) - error = aa_capable(label, cap, audit); + error = aa_capable(label, cap, opts); aa_put_label(label); return error; diff --git a/security/apparmor/resource.c b/security/apparmor/resource.c index 95fd26d09757..552ed09cb47e 100644 --- a/security/apparmor/resource.c +++ b/security/apparmor/resource.c @@ -124,7 +124,7 @@ int aa_task_setrlimit(struct aa_label *label, struct task_struct *task, */ if (label != peer && - aa_capable(label, CAP_SYS_RESOURCE, SECURITY_CAP_NOAUDIT) != 0) + aa_capable(label, CAP_SYS_RESOURCE, CAP_OPT_NOAUDIT) != 0) error = fn_for_each(label, profile, audit_resource(profile, resource, new_rlim->rlim_max, peer, diff --git a/security/commoncap.c b/security/commoncap.c index 52e04136bfa8..188eaf59f82f 100644 --- a/security/commoncap.c +++ b/security/commoncap.c @@ -68,7 +68,7 @@ static void warn_setuid_and_fcaps_mixed(const char *fname) * kernel's capable() and has_capability() returns 1 for this case. */ int cap_capable(const struct cred *cred, struct user_namespace *targ_ns, - int cap, int audit) + int cap, unsigned int opts) { struct user_namespace *ns = targ_ns; @@ -222,12 +222,11 @@ int cap_capget(struct task_struct *target, kernel_cap_t *effective, */ static inline int cap_inh_is_capped(void) { - /* they are so limited unless the current task has the CAP_SETPCAP * capability */ if (cap_capable(current_cred(), current_cred()->user_ns, - CAP_SETPCAP, SECURITY_CAP_AUDIT) == 0) + CAP_SETPCAP, CAP_OPT_NONE) == 0) return 0; return 1; } @@ -1208,8 +1207,9 @@ int cap_task_prctl(int option, unsigned long arg2, unsigned long arg3, || ((old->securebits & SECURE_ALL_LOCKS & ~arg2)) /*[2]*/ || (arg2 & ~(SECURE_ALL_LOCKS | SECURE_ALL_BITS)) /*[3]*/ || (cap_capable(current_cred(), - current_cred()->user_ns, CAP_SETPCAP, - SECURITY_CAP_AUDIT) != 0) /*[4]*/ + current_cred()->user_ns, + CAP_SETPCAP, + CAP_OPT_NONE) != 0) /*[4]*/ /* * [1] no changing of bits that are locked * [2] no unlocking of locks @@ -1304,9 +1304,10 @@ int cap_vm_enough_memory(struct mm_struct *mm, long pages) { int cap_sys_admin = 0; - if (cap_capable(current_cred(), &init_user_ns, CAP_SYS_ADMIN, - SECURITY_CAP_NOAUDIT) == 0) + if (cap_capable(current_cred(), &init_user_ns, + CAP_SYS_ADMIN, CAP_OPT_NOAUDIT) == 0) cap_sys_admin = 1; + return cap_sys_admin; } @@ -1325,7 +1326,7 @@ int cap_mmap_addr(unsigned long addr) if (addr < dac_mmap_min_addr) { ret = cap_capable(current_cred(), &init_user_ns, CAP_SYS_RAWIO, - SECURITY_CAP_AUDIT); + CAP_OPT_NONE); /* set PF_SUPERPRIV if it turns out we allow the low mmap */ if (ret == 0) current->flags |= PF_SUPERPRIV; diff --git a/security/security.c b/security/security.c index 953fc3ea18a9..a618e22df5c6 100644 --- a/security/security.c +++ b/security/security.c @@ -689,16 +689,12 @@ int security_capset(struct cred *new, const struct cred *old, effective, inheritable, permitted); } -int security_capable(const struct cred *cred, struct user_namespace *ns, - int cap) +int security_capable(const struct cred *cred, + struct user_namespace *ns, + int cap, + unsigned int opts) { - return call_int_hook(capable, 0, cred, ns, cap, SECURITY_CAP_AUDIT); -} - -int security_capable_noaudit(const struct cred *cred, struct user_namespace *ns, - int cap) -{ - return call_int_hook(capable, 0, cred, ns, cap, SECURITY_CAP_NOAUDIT); + return call_int_hook(capable, 0, cred, ns, cap, opts); } int security_quotactl(int cmds, int type, int id, struct super_block *sb) diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c index d98e1d8d18f6..b2ee49f938f1 100644 --- a/security/selinux/hooks.c +++ b/security/selinux/hooks.c @@ -1578,7 +1578,7 @@ static inline u32 signal_to_av(int sig) /* Check whether a task is allowed to use a capability. */ static int cred_has_capability(const struct cred *cred, - int cap, int audit, bool initns) + int cap, unsigned int opts, bool initns) { struct common_audit_data ad; struct av_decision avd; @@ -1605,7 +1605,7 @@ static int cred_has_capability(const struct cred *cred, rc = avc_has_perm_noaudit(&selinux_state, sid, sid, sclass, av, 0, &avd); - if (audit == SECURITY_CAP_AUDIT) { + if (!(opts & CAP_OPT_NOAUDIT)) { int rc2 = avc_audit(&selinux_state, sid, sid, sclass, av, &avd, rc, &ad, 0); if (rc2) @@ -2125,9 +2125,9 @@ static int selinux_capset(struct cred *new, const struct cred *old, */ static int selinux_capable(const struct cred *cred, struct user_namespace *ns, - int cap, int audit) + int cap, unsigned int opts) { - return cred_has_capability(cred, cap, audit, ns == &init_user_ns); + return cred_has_capability(cred, cap, opts, ns == &init_user_ns); } static int selinux_quotactl(int cmds, int type, int id, struct super_block *sb) @@ -2201,7 +2201,7 @@ static int selinux_vm_enough_memory(struct mm_struct *mm, long pages) int rc, cap_sys_admin = 0; rc = cred_has_capability(current_cred(), CAP_SYS_ADMIN, - SECURITY_CAP_NOAUDIT, true); + CAP_OPT_NOAUDIT, true); if (rc == 0) cap_sys_admin = 1; @@ -2988,11 +2988,11 @@ static int selinux_inode_getattr(const struct path *path) static bool has_cap_mac_admin(bool audit) { const struct cred *cred = current_cred(); - int cap_audit = audit ? SECURITY_CAP_AUDIT : SECURITY_CAP_NOAUDIT; + unsigned int opts = audit ? CAP_OPT_NONE : CAP_OPT_NOAUDIT; - if (cap_capable(cred, &init_user_ns, CAP_MAC_ADMIN, cap_audit)) + if (cap_capable(cred, &init_user_ns, CAP_MAC_ADMIN, opts)) return false; - if (cred_has_capability(cred, CAP_MAC_ADMIN, cap_audit, true)) + if (cred_has_capability(cred, CAP_MAC_ADMIN, opts, true)) return false; return true; } @@ -3387,7 +3387,7 @@ static int selinux_file_ioctl(struct file *file, unsigned int cmd, case KDSKBENT: case KDSKBSENT: error = cred_has_capability(cred, CAP_SYS_TTY_CONFIG, - SECURITY_CAP_AUDIT, true); + CAP_OPT_NONE, true); break; /* default case assumes that the command will go diff --git a/security/smack/smack_access.c b/security/smack/smack_access.c index 489d49a20b47..fe2ce3a65822 100644 --- a/security/smack/smack_access.c +++ b/security/smack/smack_access.c @@ -640,7 +640,7 @@ bool smack_privileged_cred(int cap, const struct cred *cred) struct smack_known_list_elem *sklep; int rc; - rc = cap_capable(cred, &init_user_ns, cap, SECURITY_CAP_AUDIT); + rc = cap_capable(cred, &init_user_ns, cap, CAP_OPT_NONE); if (rc) return false; -- cgit v1.3-7-g2ca7 From 17e3ac812541f73224299d8958ddb420c2d5bbd8 Mon Sep 17 00:00:00 2001 From: Yonghong Song Date: Thu, 10 Jan 2019 11:14:00 -0800 Subject: bpf: fix bpffs bitfield pretty print Commit 9d5f9f701b18 ("bpf: btf: fix struct/union/fwd types with kind_flag") introduced kind_flag and used bitfield_size in the btf_member to directly pretty print member values. The commit contained a bug where the incorrect parameters could be passed to function btf_bitfield_seq_show(). The bits_offset parameter in the function expects a value less than 8. Instead, the member offset in the structure is passed. The below is btf_bitfield_seq_show() func signature: void btf_bitfield_seq_show(void *data, u8 bits_offset, u8 nr_bits, struct seq_file *m) both bits_offset and nr_bits are u8 type. If the bitfield member offset is greater than 256, incorrect value will be printed. This patch fixed the issue by calculating correct proper data offset and bits_offset similar to non kind_flag case. Fixes: 9d5f9f701b18 ("bpf: btf: fix struct/union/fwd types with kind_flag") Acked-by: Martin KaFai Lau Signed-off-by: Yonghong Song Signed-off-by: Daniel Borkmann --- kernel/bpf/btf.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index 715f9fcf4712..a2f53642592b 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -1219,8 +1219,6 @@ static void btf_bitfield_seq_show(void *data, u8 bits_offset, u8 nr_copy_bits; u64 print_num; - data += BITS_ROUNDDOWN_BYTES(bits_offset); - bits_offset = BITS_PER_BYTE_MASKED(bits_offset); nr_copy_bits = nr_bits + bits_offset; nr_copy_bytes = BITS_ROUNDUP_BYTES(nr_copy_bits); @@ -1255,7 +1253,9 @@ static void btf_int_bits_seq_show(const struct btf *btf, * BTF_INT_OFFSET() cannot exceed 64 bits. */ total_bits_offset = bits_offset + BTF_INT_OFFSET(int_data); - btf_bitfield_seq_show(data, total_bits_offset, nr_bits, m); + data += BITS_ROUNDDOWN_BYTES(total_bits_offset); + bits_offset = BITS_PER_BYTE_MASKED(total_bits_offset); + btf_bitfield_seq_show(data, bits_offset, nr_bits, m); } static void btf_int_seq_show(const struct btf *btf, const struct btf_type *t, @@ -2001,12 +2001,12 @@ static void btf_struct_seq_show(const struct btf *btf, const struct btf_type *t, member_offset = btf_member_bit_offset(t, member); bitfield_size = btf_member_bitfield_size(t, member); + bytes_offset = BITS_ROUNDDOWN_BYTES(member_offset); + bits8_offset = BITS_PER_BYTE_MASKED(member_offset); if (bitfield_size) { - btf_bitfield_seq_show(data, member_offset, + btf_bitfield_seq_show(data + bytes_offset, bits8_offset, bitfield_size, m); } else { - bytes_offset = BITS_ROUNDDOWN_BYTES(member_offset); - bits8_offset = BITS_PER_BYTE_MASKED(member_offset); ops = btf_type_ops(member_type); ops->seq_show(btf, member_type, member->type, data + bytes_offset, bits8_offset, m); -- cgit v1.3-7-g2ca7 From 19514910d021c93c7823ec32067e6b7dea224f0f Mon Sep 17 00:00:00 2001 From: Petr Mladek Date: Wed, 9 Jan 2019 13:43:19 +0100 Subject: livepatch: Change unsigned long old_addr -> void *old_func in struct klp_func The address of the to be patched function and new function is stored in struct klp_func as: void *new_func; unsigned long old_addr; The different naming scheme and type are derived from the way the addresses are set. @old_addr is assigned at runtime using kallsyms-based search. @new_func is statically initialized, for example: static struct klp_func funcs[] = { { .old_name = "cmdline_proc_show", .new_func = livepatch_cmdline_proc_show, }, { } }; This patch changes unsigned long old_addr -> void *old_func. It removes some confusion when these address are later used in the code. It is motivated by a followup patch that adds special NOP struct klp_func where we want to assign func->new_func = func->old_addr respectively func->new_func = func->old_func. This patch does not modify the existing behavior. Suggested-by: Josh Poimboeuf Signed-off-by: Petr Mladek Acked-by: Miroslav Benes Acked-by: Joe Lawrence Acked-by: Alice Ferrazzi Acked-by: Josh Poimboeuf Signed-off-by: Jiri Kosina --- include/linux/livepatch.h | 4 ++-- kernel/livepatch/core.c | 6 +++--- kernel/livepatch/patch.c | 18 ++++++++++-------- kernel/livepatch/patch.h | 4 ++-- kernel/livepatch/transition.c | 4 ++-- 5 files changed, 19 insertions(+), 17 deletions(-) (limited to 'kernel') diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h index aec44b1d9582..634e13876380 100644 --- a/include/linux/livepatch.h +++ b/include/linux/livepatch.h @@ -40,7 +40,7 @@ * @new_func: pointer to the patched function code * @old_sympos: a hint indicating which symbol position the old function * can be found (optional) - * @old_addr: the address of the function being patched + * @old_func: pointer to the function being patched * @kobj: kobject for sysfs resources * @stack_node: list node for klp_ops func_stack list * @old_size: size of the old function @@ -77,7 +77,7 @@ struct klp_func { unsigned long old_sympos; /* internal */ - unsigned long old_addr; + void *old_func; struct kobject kobj; struct list_head stack_node; unsigned long old_size, new_size; diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c index 5b77a7314e01..cb59c7fb94cb 100644 --- a/kernel/livepatch/core.c +++ b/kernel/livepatch/core.c @@ -648,7 +648,7 @@ static void klp_free_object_loaded(struct klp_object *obj) obj->mod = NULL; klp_for_each_func(obj, func) - func->old_addr = 0; + func->old_func = NULL; } /* @@ -721,11 +721,11 @@ static int klp_init_object_loaded(struct klp_patch *patch, klp_for_each_func(obj, func) { ret = klp_find_object_symbol(obj->name, func->old_name, func->old_sympos, - &func->old_addr); + (unsigned long *)&func->old_func); if (ret) return ret; - ret = kallsyms_lookup_size_offset(func->old_addr, + ret = kallsyms_lookup_size_offset((unsigned long)func->old_func, &func->old_size, NULL); if (!ret) { pr_err("kallsyms size lookup failed for '%s'\n", diff --git a/kernel/livepatch/patch.c b/kernel/livepatch/patch.c index 7702cb4064fc..825022d70912 100644 --- a/kernel/livepatch/patch.c +++ b/kernel/livepatch/patch.c @@ -34,7 +34,7 @@ static LIST_HEAD(klp_ops); -struct klp_ops *klp_find_ops(unsigned long old_addr) +struct klp_ops *klp_find_ops(void *old_func) { struct klp_ops *ops; struct klp_func *func; @@ -42,7 +42,7 @@ struct klp_ops *klp_find_ops(unsigned long old_addr) list_for_each_entry(ops, &klp_ops, node) { func = list_first_entry(&ops->func_stack, struct klp_func, stack_node); - if (func->old_addr == old_addr) + if (func->old_func == old_func) return ops; } @@ -142,17 +142,18 @@ static void klp_unpatch_func(struct klp_func *func) if (WARN_ON(!func->patched)) return; - if (WARN_ON(!func->old_addr)) + if (WARN_ON(!func->old_func)) return; - ops = klp_find_ops(func->old_addr); + ops = klp_find_ops(func->old_func); if (WARN_ON(!ops)) return; if (list_is_singular(&ops->func_stack)) { unsigned long ftrace_loc; - ftrace_loc = klp_get_ftrace_location(func->old_addr); + ftrace_loc = + klp_get_ftrace_location((unsigned long)func->old_func); if (WARN_ON(!ftrace_loc)) return; @@ -174,17 +175,18 @@ static int klp_patch_func(struct klp_func *func) struct klp_ops *ops; int ret; - if (WARN_ON(!func->old_addr)) + if (WARN_ON(!func->old_func)) return -EINVAL; if (WARN_ON(func->patched)) return -EINVAL; - ops = klp_find_ops(func->old_addr); + ops = klp_find_ops(func->old_func); if (!ops) { unsigned long ftrace_loc; - ftrace_loc = klp_get_ftrace_location(func->old_addr); + ftrace_loc = + klp_get_ftrace_location((unsigned long)func->old_func); if (!ftrace_loc) { pr_err("failed to find location for function '%s'\n", func->old_name); diff --git a/kernel/livepatch/patch.h b/kernel/livepatch/patch.h index e72d8250d04b..a9b16e513656 100644 --- a/kernel/livepatch/patch.h +++ b/kernel/livepatch/patch.h @@ -10,7 +10,7 @@ * struct klp_ops - structure for tracking registered ftrace ops structs * * A single ftrace_ops is shared between all enabled replacement functions - * (klp_func structs) which have the same old_addr. This allows the switch + * (klp_func structs) which have the same old_func. This allows the switch * between function versions to happen instantaneously by updating the klp_ops * struct's func_stack list. The winner is the klp_func at the top of the * func_stack (front of the list). @@ -25,7 +25,7 @@ struct klp_ops { struct ftrace_ops fops; }; -struct klp_ops *klp_find_ops(unsigned long old_addr); +struct klp_ops *klp_find_ops(void *old_func); int klp_patch_object(struct klp_object *obj); void klp_unpatch_object(struct klp_object *obj); diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c index 304d5eb8a98c..f27a378ad5e1 100644 --- a/kernel/livepatch/transition.c +++ b/kernel/livepatch/transition.c @@ -224,11 +224,11 @@ static int klp_check_stack_func(struct klp_func *func, * Check for the to-be-patched function * (the previous func). */ - ops = klp_find_ops(func->old_addr); + ops = klp_find_ops(func->old_func); if (list_is_singular(&ops->func_stack)) { /* original function */ - func_addr = func->old_addr; + func_addr = (unsigned long)func->old_func; func_size = func->old_size; } else { /* previously patched function */ -- cgit v1.3-7-g2ca7 From 26c3e98e2f8e44e856cd36c12b3eaefcc6eafb16 Mon Sep 17 00:00:00 2001 From: Petr Mladek Date: Wed, 9 Jan 2019 13:43:20 +0100 Subject: livepatch: Shuffle klp_enable_patch()/klp_disable_patch() code We are going to simplify the API and code by removing the registration step. This would require calling init/free functions from enable/disable ones. This patch just moves the code to prevent more forward declarations. This patch does not change the code except for two forward declarations. Signed-off-by: Petr Mladek Acked-by: Miroslav Benes Acked-by: Joe Lawrence Acked-by: Josh Poimboeuf Signed-off-by: Jiri Kosina --- kernel/livepatch/core.c | 330 ++++++++++++++++++++++++------------------------ 1 file changed, 166 insertions(+), 164 deletions(-) (limited to 'kernel') diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c index cb59c7fb94cb..20589da35194 100644 --- a/kernel/livepatch/core.c +++ b/kernel/livepatch/core.c @@ -278,170 +278,6 @@ static int klp_write_object_relocations(struct module *pmod, return ret; } -static int __klp_disable_patch(struct klp_patch *patch) -{ - struct klp_object *obj; - - if (WARN_ON(!patch->enabled)) - return -EINVAL; - - if (klp_transition_patch) - return -EBUSY; - - /* enforce stacking: only the last enabled patch can be disabled */ - if (!list_is_last(&patch->list, &klp_patches) && - list_next_entry(patch, list)->enabled) - return -EBUSY; - - klp_init_transition(patch, KLP_UNPATCHED); - - klp_for_each_object(patch, obj) - if (obj->patched) - klp_pre_unpatch_callback(obj); - - /* - * Enforce the order of the func->transition writes in - * klp_init_transition() and the TIF_PATCH_PENDING writes in - * klp_start_transition(). In the rare case where klp_ftrace_handler() - * is called shortly after klp_update_patch_state() switches the task, - * this ensures the handler sees that func->transition is set. - */ - smp_wmb(); - - klp_start_transition(); - klp_try_complete_transition(); - patch->enabled = false; - - return 0; -} - -/** - * klp_disable_patch() - disables a registered patch - * @patch: The registered, enabled patch to be disabled - * - * Unregisters the patched functions from ftrace. - * - * Return: 0 on success, otherwise error - */ -int klp_disable_patch(struct klp_patch *patch) -{ - int ret; - - mutex_lock(&klp_mutex); - - if (!klp_is_patch_registered(patch)) { - ret = -EINVAL; - goto err; - } - - if (!patch->enabled) { - ret = -EINVAL; - goto err; - } - - ret = __klp_disable_patch(patch); - -err: - mutex_unlock(&klp_mutex); - return ret; -} -EXPORT_SYMBOL_GPL(klp_disable_patch); - -static int __klp_enable_patch(struct klp_patch *patch) -{ - struct klp_object *obj; - int ret; - - if (klp_transition_patch) - return -EBUSY; - - if (WARN_ON(patch->enabled)) - return -EINVAL; - - /* enforce stacking: only the first disabled patch can be enabled */ - if (patch->list.prev != &klp_patches && - !list_prev_entry(patch, list)->enabled) - return -EBUSY; - - /* - * A reference is taken on the patch module to prevent it from being - * unloaded. - */ - if (!try_module_get(patch->mod)) - return -ENODEV; - - pr_notice("enabling patch '%s'\n", patch->mod->name); - - klp_init_transition(patch, KLP_PATCHED); - - /* - * Enforce the order of the func->transition writes in - * klp_init_transition() and the ops->func_stack writes in - * klp_patch_object(), so that klp_ftrace_handler() will see the - * func->transition updates before the handler is registered and the - * new funcs become visible to the handler. - */ - smp_wmb(); - - klp_for_each_object(patch, obj) { - if (!klp_is_object_loaded(obj)) - continue; - - ret = klp_pre_patch_callback(obj); - if (ret) { - pr_warn("pre-patch callback failed for object '%s'\n", - klp_is_module(obj) ? obj->name : "vmlinux"); - goto err; - } - - ret = klp_patch_object(obj); - if (ret) { - pr_warn("failed to patch object '%s'\n", - klp_is_module(obj) ? obj->name : "vmlinux"); - goto err; - } - } - - klp_start_transition(); - klp_try_complete_transition(); - patch->enabled = true; - - return 0; -err: - pr_warn("failed to enable patch '%s'\n", patch->mod->name); - - klp_cancel_transition(); - return ret; -} - -/** - * klp_enable_patch() - enables a registered patch - * @patch: The registered, disabled patch to be enabled - * - * Performs the needed symbol lookups and code relocations, - * then registers the patched functions with ftrace. - * - * Return: 0 on success, otherwise error - */ -int klp_enable_patch(struct klp_patch *patch) -{ - int ret; - - mutex_lock(&klp_mutex); - - if (!klp_is_patch_registered(patch)) { - ret = -EINVAL; - goto err; - } - - ret = __klp_enable_patch(patch); - -err: - mutex_unlock(&klp_mutex); - return ret; -} -EXPORT_SYMBOL_GPL(klp_enable_patch); - /* * Sysfs Interface * @@ -454,6 +290,8 @@ EXPORT_SYMBOL_GPL(klp_enable_patch); * /sys/kernel/livepatch// * /sys/kernel/livepatch/// */ +static int __klp_disable_patch(struct klp_patch *patch); +static int __klp_enable_patch(struct klp_patch *patch); static ssize_t enabled_store(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t count) @@ -904,6 +742,170 @@ int klp_register_patch(struct klp_patch *patch) } EXPORT_SYMBOL_GPL(klp_register_patch); +static int __klp_disable_patch(struct klp_patch *patch) +{ + struct klp_object *obj; + + if (WARN_ON(!patch->enabled)) + return -EINVAL; + + if (klp_transition_patch) + return -EBUSY; + + /* enforce stacking: only the last enabled patch can be disabled */ + if (!list_is_last(&patch->list, &klp_patches) && + list_next_entry(patch, list)->enabled) + return -EBUSY; + + klp_init_transition(patch, KLP_UNPATCHED); + + klp_for_each_object(patch, obj) + if (obj->patched) + klp_pre_unpatch_callback(obj); + + /* + * Enforce the order of the func->transition writes in + * klp_init_transition() and the TIF_PATCH_PENDING writes in + * klp_start_transition(). In the rare case where klp_ftrace_handler() + * is called shortly after klp_update_patch_state() switches the task, + * this ensures the handler sees that func->transition is set. + */ + smp_wmb(); + + klp_start_transition(); + klp_try_complete_transition(); + patch->enabled = false; + + return 0; +} + +/** + * klp_disable_patch() - disables a registered patch + * @patch: The registered, enabled patch to be disabled + * + * Unregisters the patched functions from ftrace. + * + * Return: 0 on success, otherwise error + */ +int klp_disable_patch(struct klp_patch *patch) +{ + int ret; + + mutex_lock(&klp_mutex); + + if (!klp_is_patch_registered(patch)) { + ret = -EINVAL; + goto err; + } + + if (!patch->enabled) { + ret = -EINVAL; + goto err; + } + + ret = __klp_disable_patch(patch); + +err: + mutex_unlock(&klp_mutex); + return ret; +} +EXPORT_SYMBOL_GPL(klp_disable_patch); + +static int __klp_enable_patch(struct klp_patch *patch) +{ + struct klp_object *obj; + int ret; + + if (klp_transition_patch) + return -EBUSY; + + if (WARN_ON(patch->enabled)) + return -EINVAL; + + /* enforce stacking: only the first disabled patch can be enabled */ + if (patch->list.prev != &klp_patches && + !list_prev_entry(patch, list)->enabled) + return -EBUSY; + + /* + * A reference is taken on the patch module to prevent it from being + * unloaded. + */ + if (!try_module_get(patch->mod)) + return -ENODEV; + + pr_notice("enabling patch '%s'\n", patch->mod->name); + + klp_init_transition(patch, KLP_PATCHED); + + /* + * Enforce the order of the func->transition writes in + * klp_init_transition() and the ops->func_stack writes in + * klp_patch_object(), so that klp_ftrace_handler() will see the + * func->transition updates before the handler is registered and the + * new funcs become visible to the handler. + */ + smp_wmb(); + + klp_for_each_object(patch, obj) { + if (!klp_is_object_loaded(obj)) + continue; + + ret = klp_pre_patch_callback(obj); + if (ret) { + pr_warn("pre-patch callback failed for object '%s'\n", + klp_is_module(obj) ? obj->name : "vmlinux"); + goto err; + } + + ret = klp_patch_object(obj); + if (ret) { + pr_warn("failed to patch object '%s'\n", + klp_is_module(obj) ? obj->name : "vmlinux"); + goto err; + } + } + + klp_start_transition(); + klp_try_complete_transition(); + patch->enabled = true; + + return 0; +err: + pr_warn("failed to enable patch '%s'\n", patch->mod->name); + + klp_cancel_transition(); + return ret; +} + +/** + * klp_enable_patch() - enables a registered patch + * @patch: The registered, disabled patch to be enabled + * + * Performs the needed symbol lookups and code relocations, + * then registers the patched functions with ftrace. + * + * Return: 0 on success, otherwise error + */ +int klp_enable_patch(struct klp_patch *patch) +{ + int ret; + + mutex_lock(&klp_mutex); + + if (!klp_is_patch_registered(patch)) { + ret = -EINVAL; + goto err; + } + + ret = __klp_enable_patch(patch); + +err: + mutex_unlock(&klp_mutex); + return ret; +} +EXPORT_SYMBOL_GPL(klp_enable_patch); + /* * Remove parts of patches that touch a given kernel module. The list of * patches processed might be limited. When limit is NULL, all patches -- cgit v1.3-7-g2ca7 From 0430f78bf38f9972f0cf0522709cc63d49fa164c Mon Sep 17 00:00:00 2001 From: Petr Mladek Date: Wed, 9 Jan 2019 13:43:21 +0100 Subject: livepatch: Consolidate klp_free functions The code for freeing livepatch structures is a bit scattered and tricky: + direct calls to klp_free_*_limited() and kobject_put() are used to release partially initialized objects + klp_free_patch() removes the patch from the public list and releases all objects except for patch->kobj + object_put(&patch->kobj) and the related wait_for_completion() are called directly outside klp_mutex; this code is duplicated; Now, we are going to remove the registration stage to simplify the API and the code. This would require handling more situations in klp_enable_patch() error paths. More importantly, we are going to add a feature called atomic replace. It will need to dynamically create func and object structures. We will want to reuse the existing init() and free() functions. This would create even more error path scenarios. This patch implements more straightforward free functions: + checks kobj_added flag instead of @limit[*] + initializes patch->list early so that the check for empty list always works + The action(s) that has to be done outside klp_mutex are done in separate klp_free_patch_finish() function. It waits only when patch->kobj was really released via the _start() part. The patch does not change the existing behavior. [*] We need our own flag to track that the kobject was successfully added to the hierarchy. Note that kobj.state_initialized only indicates that kobject has been initialized, not whether is has been added (and needs to be removed on cleanup). Signed-off-by: Petr Mladek Cc: Josh Poimboeuf Cc: Miroslav Benes Cc: Jessica Yu Cc: Jiri Kosina Cc: Jason Baron Acked-by: Miroslav Benes Acked-by: Josh Poimboeuf Signed-off-by: Jiri Kosina --- include/linux/livepatch.h | 6 ++ kernel/livepatch/core.c | 137 +++++++++++++++++++++++++++++++--------------- 2 files changed, 98 insertions(+), 45 deletions(-) (limited to 'kernel') diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h index 634e13876380..6978785bc059 100644 --- a/include/linux/livepatch.h +++ b/include/linux/livepatch.h @@ -45,6 +45,7 @@ * @stack_node: list node for klp_ops func_stack list * @old_size: size of the old function * @new_size: size of the new function + * @kobj_added: @kobj has been added and needs freeing * @patched: the func has been added to the klp_ops list * @transition: the func is currently being applied or reverted * @@ -81,6 +82,7 @@ struct klp_func { struct kobject kobj; struct list_head stack_node; unsigned long old_size, new_size; + bool kobj_added; bool patched; bool transition; }; @@ -117,6 +119,7 @@ struct klp_callbacks { * @kobj: kobject for sysfs resources * @mod: kernel module associated with the patched object * (NULL for vmlinux) + * @kobj_added: @kobj has been added and needs freeing * @patched: the object's funcs have been added to the klp_ops list */ struct klp_object { @@ -128,6 +131,7 @@ struct klp_object { /* internal */ struct kobject kobj; struct module *mod; + bool kobj_added; bool patched; }; @@ -137,6 +141,7 @@ struct klp_object { * @objs: object entries for kernel objects to be patched * @list: list node for global list of registered patches * @kobj: kobject for sysfs resources + * @kobj_added: @kobj has been added and needs freeing * @enabled: the patch is enabled (but operation may be incomplete) * @finish: for waiting till it is safe to remove the patch module */ @@ -148,6 +153,7 @@ struct klp_patch { /* internal */ struct list_head list; struct kobject kobj; + bool kobj_added; bool enabled; struct completion finish; }; diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c index 20589da35194..6f0d9095f662 100644 --- a/kernel/livepatch/core.c +++ b/kernel/livepatch/core.c @@ -465,17 +465,15 @@ static struct kobj_type klp_ktype_func = { .sysfs_ops = &kobj_sysfs_ops, }; -/* - * Free all functions' kobjects in the array up to some limit. When limit is - * NULL, all kobjects are freed. - */ -static void klp_free_funcs_limited(struct klp_object *obj, - struct klp_func *limit) +static void klp_free_funcs(struct klp_object *obj) { struct klp_func *func; - for (func = obj->funcs; func->old_name && func != limit; func++) - kobject_put(&func->kobj); + klp_for_each_func(obj, func) { + /* Might be called from klp_init_patch() error path. */ + if (func->kobj_added) + kobject_put(&func->kobj); + } } /* Clean up when a patched object is unloaded */ @@ -489,30 +487,60 @@ static void klp_free_object_loaded(struct klp_object *obj) func->old_func = NULL; } -/* - * Free all objects' kobjects in the array up to some limit. When limit is - * NULL, all kobjects are freed. - */ -static void klp_free_objects_limited(struct klp_patch *patch, - struct klp_object *limit) +static void klp_free_objects(struct klp_patch *patch) { struct klp_object *obj; - for (obj = patch->objs; obj->funcs && obj != limit; obj++) { - klp_free_funcs_limited(obj, NULL); - kobject_put(&obj->kobj); + klp_for_each_object(patch, obj) { + klp_free_funcs(obj); + + /* Might be called from klp_init_patch() error path. */ + if (obj->kobj_added) + kobject_put(&obj->kobj); } } -static void klp_free_patch(struct klp_patch *patch) +/* + * This function implements the free operations that can be called safely + * under klp_mutex. + * + * The operation must be completed by calling klp_free_patch_finish() + * outside klp_mutex. + */ +static void klp_free_patch_start(struct klp_patch *patch) { - klp_free_objects_limited(patch, NULL); if (!list_empty(&patch->list)) list_del(&patch->list); + + klp_free_objects(patch); +} + +/* + * This function implements the free part that must be called outside + * klp_mutex. + * + * It must be called after klp_free_patch_start(). And it has to be + * the last function accessing the livepatch structures when the patch + * gets disabled. + */ +static void klp_free_patch_finish(struct klp_patch *patch) +{ + /* + * Avoid deadlock with enabled_store() sysfs callback by + * calling this outside klp_mutex. It is safe because + * this is called when the patch gets disabled and it + * cannot get enabled again. + */ + if (patch->kobj_added) { + kobject_put(&patch->kobj); + wait_for_completion(&patch->finish); + } } static int klp_init_func(struct klp_object *obj, struct klp_func *func) { + int ret; + if (!func->old_name || !func->new_func) return -EINVAL; @@ -528,9 +556,13 @@ static int klp_init_func(struct klp_object *obj, struct klp_func *func) * object. If the user selects 0 for old_sympos, then 1 will be used * since a unique symbol will be the first occurrence. */ - return kobject_init_and_add(&func->kobj, &klp_ktype_func, - &obj->kobj, "%s,%lu", func->old_name, - func->old_sympos ? func->old_sympos : 1); + ret = kobject_init_and_add(&func->kobj, &klp_ktype_func, + &obj->kobj, "%s,%lu", func->old_name, + func->old_sympos ? func->old_sympos : 1); + if (!ret) + func->kobj_added = true; + + return ret; } /* Arches may override this to finish any remaining arch-specific tasks */ @@ -589,9 +621,6 @@ static int klp_init_object(struct klp_patch *patch, struct klp_object *obj) int ret; const char *name; - if (!obj->funcs) - return -EINVAL; - if (klp_is_module(obj) && strlen(obj->name) >= MODULE_NAME_LEN) return -EINVAL; @@ -605,46 +634,66 @@ static int klp_init_object(struct klp_patch *patch, struct klp_object *obj) &patch->kobj, "%s", name); if (ret) return ret; + obj->kobj_added = true; klp_for_each_func(obj, func) { ret = klp_init_func(obj, func); if (ret) - goto free; + return ret; } - if (klp_is_object_loaded(obj)) { + if (klp_is_object_loaded(obj)) ret = klp_init_object_loaded(patch, obj); - if (ret) - goto free; - } - - return 0; -free: - klp_free_funcs_limited(obj, func); - kobject_put(&obj->kobj); return ret; } -static int klp_init_patch(struct klp_patch *patch) +static int klp_init_patch_early(struct klp_patch *patch) { struct klp_object *obj; - int ret; + struct klp_func *func; if (!patch->objs) return -EINVAL; - mutex_lock(&klp_mutex); - + INIT_LIST_HEAD(&patch->list); + patch->kobj_added = false; patch->enabled = false; init_completion(&patch->finish); + klp_for_each_object(patch, obj) { + if (!obj->funcs) + return -EINVAL; + + obj->kobj_added = false; + + klp_for_each_func(obj, func) + func->kobj_added = false; + } + + return 0; +} + +static int klp_init_patch(struct klp_patch *patch) +{ + struct klp_object *obj; + int ret; + + mutex_lock(&klp_mutex); + + ret = klp_init_patch_early(patch); + if (ret) { + mutex_unlock(&klp_mutex); + return ret; + } + ret = kobject_init_and_add(&patch->kobj, &klp_ktype_patch, klp_root_kobj, "%s", patch->mod->name); if (ret) { mutex_unlock(&klp_mutex); return ret; } + patch->kobj_added = true; klp_for_each_object(patch, obj) { ret = klp_init_object(patch, obj); @@ -659,12 +708,11 @@ static int klp_init_patch(struct klp_patch *patch) return 0; free: - klp_free_objects_limited(patch, obj); + klp_free_patch_start(patch); mutex_unlock(&klp_mutex); - kobject_put(&patch->kobj); - wait_for_completion(&patch->finish); + klp_free_patch_finish(patch); return ret; } @@ -693,12 +741,11 @@ int klp_unregister_patch(struct klp_patch *patch) goto err; } - klp_free_patch(patch); + klp_free_patch_start(patch); mutex_unlock(&klp_mutex); - kobject_put(&patch->kobj); - wait_for_completion(&patch->finish); + klp_free_patch_finish(patch); return 0; err: -- cgit v1.3-7-g2ca7 From 68007289bf3cd937a5b8fc4987d2787167bd06ca Mon Sep 17 00:00:00 2001 From: Petr Mladek Date: Wed, 9 Jan 2019 13:43:22 +0100 Subject: livepatch: Don't block the removal of patches loaded after a forced transition module_put() is currently never called in klp_complete_transition() when klp_force is set. As a result, we might keep the reference count even when klp_enable_patch() fails and klp_cancel_transition() is called. This might give the impression that a module might get blocked in some strange init state. Fortunately, it is not the case. The reference count is ignored when mod->init fails and erroneous modules are always removed. Anyway, this might be confusing. Instead, this patch moves the global klp_forced flag into struct klp_patch. As a result, we block only modules that might still be in use after a forced transition. Newly loaded livepatches might be eventually completely removed later. It is not a big deal. But the code is at least consistent with the reality. Signed-off-by: Petr Mladek Acked-by: Joe Lawrence Acked-by: Miroslav Benes Acked-by: Josh Poimboeuf Signed-off-by: Jiri Kosina --- include/linux/livepatch.h | 2 ++ kernel/livepatch/core.c | 4 +++- kernel/livepatch/core.h | 1 + kernel/livepatch/transition.c | 10 +++++----- 4 files changed, 11 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h index 6978785bc059..6a9165d9b090 100644 --- a/include/linux/livepatch.h +++ b/include/linux/livepatch.h @@ -143,6 +143,7 @@ struct klp_object { * @kobj: kobject for sysfs resources * @kobj_added: @kobj has been added and needs freeing * @enabled: the patch is enabled (but operation may be incomplete) + * @forced: was involved in a forced transition * @finish: for waiting till it is safe to remove the patch module */ struct klp_patch { @@ -155,6 +156,7 @@ struct klp_patch { struct kobject kobj; bool kobj_added; bool enabled; + bool forced; struct completion finish; }; diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c index 6f0d9095f662..e77c5017ae0c 100644 --- a/kernel/livepatch/core.c +++ b/kernel/livepatch/core.c @@ -45,7 +45,8 @@ */ DEFINE_MUTEX(klp_mutex); -static LIST_HEAD(klp_patches); +/* Registered patches */ +LIST_HEAD(klp_patches); static struct kobject *klp_root_kobj; @@ -659,6 +660,7 @@ static int klp_init_patch_early(struct klp_patch *patch) INIT_LIST_HEAD(&patch->list); patch->kobj_added = false; patch->enabled = false; + patch->forced = false; init_completion(&patch->finish); klp_for_each_object(patch, obj) { diff --git a/kernel/livepatch/core.h b/kernel/livepatch/core.h index 48a83d4364cf..d0cb5390e247 100644 --- a/kernel/livepatch/core.h +++ b/kernel/livepatch/core.h @@ -5,6 +5,7 @@ #include extern struct mutex klp_mutex; +extern struct list_head klp_patches; static inline bool klp_is_object_loaded(struct klp_object *obj) { diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c index f27a378ad5e1..a4c921364003 100644 --- a/kernel/livepatch/transition.c +++ b/kernel/livepatch/transition.c @@ -33,8 +33,6 @@ struct klp_patch *klp_transition_patch; static int klp_target_state = KLP_UNDEFINED; -static bool klp_forced = false; - /* * This work can be performed periodically to finish patching or unpatching any * "straggler" tasks which failed to transition in the first attempt. @@ -137,10 +135,10 @@ static void klp_complete_transition(void) klp_target_state == KLP_PATCHED ? "patching" : "unpatching"); /* - * klp_forced set implies unbounded increase of module's ref count if + * patch->forced set implies unbounded increase of module's ref count if * the module is disabled/enabled in a loop. */ - if (!klp_forced && klp_target_state == KLP_UNPATCHED) + if (!klp_transition_patch->forced && klp_target_state == KLP_UNPATCHED) module_put(klp_transition_patch->mod); klp_target_state = KLP_UNDEFINED; @@ -620,6 +618,7 @@ void klp_send_signals(void) */ void klp_force_transition(void) { + struct klp_patch *patch; struct task_struct *g, *task; unsigned int cpu; @@ -633,5 +632,6 @@ void klp_force_transition(void) for_each_possible_cpu(cpu) klp_update_patch_state(idle_task(cpu)); - klp_forced = true; + list_for_each_entry(patch, &klp_patches, list) + patch->forced = true; } -- cgit v1.3-7-g2ca7 From 958ef1e39d24d6cb8bf2a7406130a98c9564230f Mon Sep 17 00:00:00 2001 From: Petr Mladek Date: Wed, 9 Jan 2019 13:43:23 +0100 Subject: livepatch: Simplify API by removing registration step The possibility to re-enable a registered patch was useful for immediate patches where the livepatch module had to stay until the system reboot. The improved consistency model allows to achieve the same result by unloading and loading the livepatch module again. Also we are going to add a feature called atomic replace. It will allow to create a patch that would replace all already registered patches. The aim is to handle dependent patches more securely. It will obsolete the stack of patches that helped to handle the dependencies so far. Then it might be unclear when a cumulative patch re-enabling is safe. It would be complicated to support the many modes. Instead we could actually make the API and code easier to understand. Therefore, remove the two step public API. All the checks and init calls are moved from klp_register_patch() to klp_enabled_patch(). Also the patch is automatically freed, including the sysfs interface when the transition to the disabled state is completed. As a result, there is never a disabled patch on the top of the stack. Therefore we do not need to check the stack in __klp_enable_patch(). And we could simplify the check in __klp_disable_patch(). Also the API and logic is much easier. It is enough to call klp_enable_patch() in module_init() call. The patch can be disabled by writing '0' into /sys/kernel/livepatch//enabled. Then the module can be removed once the transition finishes and sysfs interface is freed. The only problem is how to free the structures and kobjects safely. The operation is triggered from the sysfs interface. We could not put the related kobject from there because it would cause lock inversion between klp_mutex and kernfs locks, see kn->count lockdep map. Therefore, offload the free task to a workqueue. It is perfectly fine: + The patch can no longer be used in the livepatch operations. + The module could not be removed until the free operation finishes and module_put() is called. + The operation is asynchronous already when the first klp_try_complete_transition() fails and another call is queued with a delay. Suggested-by: Josh Poimboeuf Signed-off-by: Petr Mladek Acked-by: Miroslav Benes Acked-by: Josh Poimboeuf Signed-off-by: Jiri Kosina --- Documentation/livepatch/livepatch.txt | 135 +++++-------- include/linux/livepatch.h | 7 +- kernel/livepatch/core.c | 280 +++++++++------------------ kernel/livepatch/core.h | 2 + kernel/livepatch/transition.c | 19 +- samples/livepatch/livepatch-callbacks-demo.c | 13 +- samples/livepatch/livepatch-sample.c | 13 +- samples/livepatch/livepatch-shadow-fix1.c | 14 +- samples/livepatch/livepatch-shadow-fix2.c | 14 +- 9 files changed, 168 insertions(+), 329 deletions(-) (limited to 'kernel') diff --git a/Documentation/livepatch/livepatch.txt b/Documentation/livepatch/livepatch.txt index 2d7ed09dbd59..8f56490a4bb6 100644 --- a/Documentation/livepatch/livepatch.txt +++ b/Documentation/livepatch/livepatch.txt @@ -12,12 +12,11 @@ Table of Contents: 4. Livepatch module 4.1. New functions 4.2. Metadata - 4.3. Livepatch module handling 5. Livepatch life-cycle - 5.1. Registration + 5.1. Loading 5.2. Enabling 5.3. Disabling - 5.4. Unregistration + 5.4. Removing 6. Sysfs 7. Limitations @@ -298,117 +297,89 @@ into three levels: see the "Consistency model" section. -4.3. Livepatch module handling ------------------------------- - -The usual behavior is that the new functions will get used when -the livepatch module is loaded. For this, the module init() function -has to register the patch (struct klp_patch) and enable it. See the -section "Livepatch life-cycle" below for more details about these -two operations. - -Module removal is only safe when there are no users of the underlying -functions. This is the reason why the force feature permanently disables -the removal. The forced tasks entered the functions but we cannot say -that they returned back. Therefore it cannot be decided when the -livepatch module can be safely removed. When the system is successfully -transitioned to a new patch state (patched/unpatched) without being -forced it is guaranteed that no task sleeps or runs in the old code. - - 5. Livepatch life-cycle ======================= -Livepatching defines four basic operations that define the life cycle of each -live patch: registration, enabling, disabling and unregistration. There are -several reasons why it is done this way. - -First, the patch is applied only when all patched symbols for already -loaded objects are found. The error handling is much easier if this -check is done before particular functions get redirected. +Livepatching can be described by four basic operations: +loading, enabling, disabling, removing. -Second, it might take some time until the entire system is migrated with -the hybrid consistency model being used. The patch revert might block -the livepatch module removal for too long. Therefore it is useful to -revert the patch using a separate operation that might be called -explicitly. But it does not make sense to remove all information until -the livepatch module is really removed. +5.1. Loading +------------ -5.1. Registration ------------------ +The only reasonable way is to enable the patch when the livepatch kernel +module is being loaded. For this, klp_enable_patch() has to be called +in the module_init() callback. There are two main reasons: -Each patch first has to be registered using klp_register_patch(). This makes -the patch known to the livepatch framework. Also it does some preliminary -computing and checks. +First, only the module has an easy access to the related struct klp_patch. -In particular, the patch is added into the list of known patches. The -addresses of the patched functions are found according to their names. -The special relocations, mentioned in the section "New functions", are -applied. The relevant entries are created under -/sys/kernel/livepatch/. The patch is rejected when any operation -fails. +Second, the error code might be used to refuse loading the module when +the patch cannot get enabled. 5.2. Enabling ------------- -Registered patches might be enabled either by calling klp_enable_patch() or -by writing '1' to /sys/kernel/livepatch//enabled. The system will -start using the new implementation of the patched functions at this stage. +The livepatch gets enabled by calling klp_enable_patch() from +the module_init() callback. The system will start using the new +implementation of the patched functions at this stage. -When a patch is enabled, livepatch enters into a transition state where -tasks are converging to the patched state. This is indicated by a value -of '1' in /sys/kernel/livepatch//transition. Once all tasks have -been patched, the 'transition' value changes to '0'. For more -information about this process, see the "Consistency model" section. +First, the addresses of the patched functions are found according to their +names. The special relocations, mentioned in the section "New functions", +are applied. The relevant entries are created under +/sys/kernel/livepatch/. The patch is rejected when any above +operation fails. -If an original function is patched for the first time, a function -specific struct klp_ops is created and an universal ftrace handler is -registered. +Second, livepatch enters into a transition state where tasks are converging +to the patched state. If an original function is patched for the first +time, a function specific struct klp_ops is created and an universal +ftrace handler is registered[*]. This stage is indicated by a value of '1' +in /sys/kernel/livepatch//transition. For more information about +this process, see the "Consistency model" section. -Functions might be patched multiple times. The ftrace handler is registered -only once for the given function. Further patches just add an entry to the -list (see field `func_stack`) of the struct klp_ops. The last added -entry is chosen by the ftrace handler and becomes the active function -replacement. +Finally, once all tasks have been patched, the 'transition' value changes +to '0'. -Note that the patches might be enabled in a different order than they were -registered. +[*] Note that functions might be patched multiple times. The ftrace handler + is registered only once for a given function. Further patches just add + an entry to the list (see field `func_stack`) of the struct klp_ops. + The right implementation is selected by the ftrace handler, see + the "Consistency model" section. 5.3. Disabling -------------- -Enabled patches might get disabled either by calling klp_disable_patch() or -by writing '0' to /sys/kernel/livepatch//enabled. At this stage -either the code from the previously enabled patch or even the original -code gets used. +Enabled patches might get disabled by writing '0' to +/sys/kernel/livepatch//enabled. -When a patch is disabled, livepatch enters into a transition state where -tasks are converging to the unpatched state. This is indicated by a -value of '1' in /sys/kernel/livepatch//transition. Once all tasks -have been unpatched, the 'transition' value changes to '0'. For more -information about this process, see the "Consistency model" section. +First, livepatch enters into a transition state where tasks are converging +to the unpatched state. The system starts using either the code from +the previously enabled patch or even the original one. This stage is +indicated by a value of '1' in /sys/kernel/livepatch//transition. +For more information about this process, see the "Consistency model" +section. -Here all the functions (struct klp_func) associated with the to-be-disabled +Second, once all tasks have been unpatched, the 'transition' value changes +to '0'. All the functions (struct klp_func) associated with the to-be-disabled patch are removed from the corresponding struct klp_ops. The ftrace handler is unregistered and the struct klp_ops is freed when the func_stack list becomes empty. -Patches must be disabled in exactly the reverse order in which they were -enabled. It makes the problem and the implementation much easier. +Third, the sysfs interface is destroyed. +Note that patches must be disabled in exactly the reverse order in which +they were enabled. It makes the problem and the implementation much easier. -5.4. Unregistration -------------------- -Disabled patches might be unregistered by calling klp_unregister_patch(). -This can be done only when the patch is disabled and the code is no longer -used. It must be called before the livepatch module gets unloaded. +5.4. Removing +------------- -At this stage, all the relevant sys-fs entries are removed and the patch -is removed from the list of known patches. +Module removal is only safe when there are no users of functions provided +by the module. This is the reason why the force feature permanently +disables the removal. Only when the system is successfully transitioned +to a new patch state (patched/unpatched) without being forced it is +guaranteed that no task sleeps or runs in the old code. 6. Sysfs diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h index 6a9165d9b090..8f9c19c69744 100644 --- a/include/linux/livepatch.h +++ b/include/linux/livepatch.h @@ -139,11 +139,12 @@ struct klp_object { * struct klp_patch - patch structure for live patching * @mod: reference to the live patch module * @objs: object entries for kernel objects to be patched - * @list: list node for global list of registered patches + * @list: list node for global list of actively used patches * @kobj: kobject for sysfs resources * @kobj_added: @kobj has been added and needs freeing * @enabled: the patch is enabled (but operation may be incomplete) * @forced: was involved in a forced transition + * @free_work: patch cleanup from workqueue-context * @finish: for waiting till it is safe to remove the patch module */ struct klp_patch { @@ -157,6 +158,7 @@ struct klp_patch { bool kobj_added; bool enabled; bool forced; + struct work_struct free_work; struct completion finish; }; @@ -168,10 +170,7 @@ struct klp_patch { func->old_name || func->new_func || func->old_sympos; \ func++) -int klp_register_patch(struct klp_patch *); -int klp_unregister_patch(struct klp_patch *); int klp_enable_patch(struct klp_patch *); -int klp_disable_patch(struct klp_patch *); void arch_klp_init_object_loaded(struct klp_patch *patch, struct klp_object *obj); diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c index e77c5017ae0c..bd41b03a72d5 100644 --- a/kernel/livepatch/core.c +++ b/kernel/livepatch/core.c @@ -45,7 +45,11 @@ */ DEFINE_MUTEX(klp_mutex); -/* Registered patches */ +/* + * Actively used patches: enabled or in transition. Note that replaced + * or disabled patches are not listed even though the related kernel + * module still can be loaded. + */ LIST_HEAD(klp_patches); static struct kobject *klp_root_kobj; @@ -83,17 +87,6 @@ static void klp_find_object_module(struct klp_object *obj) mutex_unlock(&module_mutex); } -static bool klp_is_patch_registered(struct klp_patch *patch) -{ - struct klp_patch *mypatch; - - list_for_each_entry(mypatch, &klp_patches, list) - if (mypatch == patch) - return true; - - return false; -} - static bool klp_initialized(void) { return !!klp_root_kobj; @@ -292,7 +285,6 @@ static int klp_write_object_relocations(struct module *pmod, * /sys/kernel/livepatch/// */ static int __klp_disable_patch(struct klp_patch *patch); -static int __klp_enable_patch(struct klp_patch *patch); static ssize_t enabled_store(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t count) @@ -309,40 +301,32 @@ static ssize_t enabled_store(struct kobject *kobj, struct kobj_attribute *attr, mutex_lock(&klp_mutex); - if (!klp_is_patch_registered(patch)) { - /* - * Module with the patch could either disappear meanwhile or is - * not properly initialized yet. - */ - ret = -EINVAL; - goto err; - } - if (patch->enabled == enabled) { /* already in requested state */ ret = -EINVAL; - goto err; + goto out; } - if (patch == klp_transition_patch) { + /* + * Allow to reverse a pending transition in both ways. It might be + * necessary to complete the transition without forcing and breaking + * the system integrity. + * + * Do not allow to re-enable a disabled patch. + */ + if (patch == klp_transition_patch) klp_reverse_transition(); - } else if (enabled) { - ret = __klp_enable_patch(patch); - if (ret) - goto err; - } else { + else if (!enabled) ret = __klp_disable_patch(patch); - if (ret) - goto err; - } + else + ret = -EINVAL; +out: mutex_unlock(&klp_mutex); + if (ret) + return ret; return count; - -err: - mutex_unlock(&klp_mutex); - return ret; } static ssize_t enabled_show(struct kobject *kobj, @@ -508,7 +492,7 @@ static void klp_free_objects(struct klp_patch *patch) * The operation must be completed by calling klp_free_patch_finish() * outside klp_mutex. */ -static void klp_free_patch_start(struct klp_patch *patch) +void klp_free_patch_start(struct klp_patch *patch) { if (!list_empty(&patch->list)) list_del(&patch->list); @@ -536,6 +520,23 @@ static void klp_free_patch_finish(struct klp_patch *patch) kobject_put(&patch->kobj); wait_for_completion(&patch->finish); } + + /* Put the module after the last access to struct klp_patch. */ + if (!patch->forced) + module_put(patch->mod); +} + +/* + * The livepatch might be freed from sysfs interface created by the patch. + * This work allows to wait until the interface is destroyed in a separate + * context. + */ +static void klp_free_patch_work_fn(struct work_struct *work) +{ + struct klp_patch *patch = + container_of(work, struct klp_patch, free_work); + + klp_free_patch_finish(patch); } static int klp_init_func(struct klp_object *obj, struct klp_func *func) @@ -661,6 +662,7 @@ static int klp_init_patch_early(struct klp_patch *patch) patch->kobj_added = false; patch->enabled = false; patch->forced = false; + INIT_WORK(&patch->free_work, klp_free_patch_work_fn); init_completion(&patch->finish); klp_for_each_object(patch, obj) { @@ -673,6 +675,9 @@ static int klp_init_patch_early(struct klp_patch *patch) func->kobj_added = false; } + if (!try_module_get(patch->mod)) + return -ENODEV; + return 0; } @@ -681,115 +686,22 @@ static int klp_init_patch(struct klp_patch *patch) struct klp_object *obj; int ret; - mutex_lock(&klp_mutex); - - ret = klp_init_patch_early(patch); - if (ret) { - mutex_unlock(&klp_mutex); - return ret; - } - ret = kobject_init_and_add(&patch->kobj, &klp_ktype_patch, klp_root_kobj, "%s", patch->mod->name); - if (ret) { - mutex_unlock(&klp_mutex); + if (ret) return ret; - } patch->kobj_added = true; klp_for_each_object(patch, obj) { ret = klp_init_object(patch, obj); if (ret) - goto free; + return ret; } list_add_tail(&patch->list, &klp_patches); - mutex_unlock(&klp_mutex); - - return 0; - -free: - klp_free_patch_start(patch); - - mutex_unlock(&klp_mutex); - - klp_free_patch_finish(patch); - - return ret; -} - -/** - * klp_unregister_patch() - unregisters a patch - * @patch: Disabled patch to be unregistered - * - * Frees the data structures and removes the sysfs interface. - * - * Return: 0 on success, otherwise error - */ -int klp_unregister_patch(struct klp_patch *patch) -{ - int ret; - - mutex_lock(&klp_mutex); - - if (!klp_is_patch_registered(patch)) { - ret = -EINVAL; - goto err; - } - - if (patch->enabled) { - ret = -EBUSY; - goto err; - } - - klp_free_patch_start(patch); - - mutex_unlock(&klp_mutex); - - klp_free_patch_finish(patch); - return 0; -err: - mutex_unlock(&klp_mutex); - return ret; -} -EXPORT_SYMBOL_GPL(klp_unregister_patch); - -/** - * klp_register_patch() - registers a patch - * @patch: Patch to be registered - * - * Initializes the data structure associated with the patch and - * creates the sysfs interface. - * - * There is no need to take the reference on the patch module here. It is done - * later when the patch is enabled. - * - * Return: 0 on success, otherwise error - */ -int klp_register_patch(struct klp_patch *patch) -{ - if (!patch || !patch->mod) - return -EINVAL; - - if (!is_livepatch_module(patch->mod)) { - pr_err("module %s is not marked as a livepatch module\n", - patch->mod->name); - return -EINVAL; - } - - if (!klp_initialized()) - return -ENODEV; - - if (!klp_have_reliable_stack()) { - pr_err("This architecture doesn't have support for the livepatch consistency model.\n"); - return -ENOSYS; - } - - return klp_init_patch(patch); } -EXPORT_SYMBOL_GPL(klp_register_patch); static int __klp_disable_patch(struct klp_patch *patch) { @@ -802,8 +714,7 @@ static int __klp_disable_patch(struct klp_patch *patch) return -EBUSY; /* enforce stacking: only the last enabled patch can be disabled */ - if (!list_is_last(&patch->list, &klp_patches) && - list_next_entry(patch, list)->enabled) + if (!list_is_last(&patch->list, &klp_patches)) return -EBUSY; klp_init_transition(patch, KLP_UNPATCHED); @@ -822,44 +733,12 @@ static int __klp_disable_patch(struct klp_patch *patch) smp_wmb(); klp_start_transition(); - klp_try_complete_transition(); patch->enabled = false; + klp_try_complete_transition(); return 0; } -/** - * klp_disable_patch() - disables a registered patch - * @patch: The registered, enabled patch to be disabled - * - * Unregisters the patched functions from ftrace. - * - * Return: 0 on success, otherwise error - */ -int klp_disable_patch(struct klp_patch *patch) -{ - int ret; - - mutex_lock(&klp_mutex); - - if (!klp_is_patch_registered(patch)) { - ret = -EINVAL; - goto err; - } - - if (!patch->enabled) { - ret = -EINVAL; - goto err; - } - - ret = __klp_disable_patch(patch); - -err: - mutex_unlock(&klp_mutex); - return ret; -} -EXPORT_SYMBOL_GPL(klp_disable_patch); - static int __klp_enable_patch(struct klp_patch *patch) { struct klp_object *obj; @@ -871,17 +750,8 @@ static int __klp_enable_patch(struct klp_patch *patch) if (WARN_ON(patch->enabled)) return -EINVAL; - /* enforce stacking: only the first disabled patch can be enabled */ - if (patch->list.prev != &klp_patches && - !list_prev_entry(patch, list)->enabled) - return -EBUSY; - - /* - * A reference is taken on the patch module to prevent it from being - * unloaded. - */ - if (!try_module_get(patch->mod)) - return -ENODEV; + if (!patch->kobj_added) + return -EINVAL; pr_notice("enabling patch '%s'\n", patch->mod->name); @@ -916,8 +786,8 @@ static int __klp_enable_patch(struct klp_patch *patch) } klp_start_transition(); - klp_try_complete_transition(); patch->enabled = true; + klp_try_complete_transition(); return 0; err: @@ -928,11 +798,15 @@ err: } /** - * klp_enable_patch() - enables a registered patch - * @patch: The registered, disabled patch to be enabled + * klp_enable_patch() - enable the livepatch + * @patch: patch to be enabled * - * Performs the needed symbol lookups and code relocations, - * then registers the patched functions with ftrace. + * Initializes the data structure associated with the patch, creates the sysfs + * interface, performs the needed symbol lookups and code relocations, + * registers the patched functions with ftrace. + * + * This function is supposed to be called from the livepatch module_init() + * callback. * * Return: 0 on success, otherwise error */ @@ -940,17 +814,51 @@ int klp_enable_patch(struct klp_patch *patch) { int ret; + if (!patch || !patch->mod) + return -EINVAL; + + if (!is_livepatch_module(patch->mod)) { + pr_err("module %s is not marked as a livepatch module\n", + patch->mod->name); + return -EINVAL; + } + + if (!klp_initialized()) + return -ENODEV; + + if (!klp_have_reliable_stack()) { + pr_err("This architecture doesn't have support for the livepatch consistency model.\n"); + return -ENOSYS; + } + + mutex_lock(&klp_mutex); - if (!klp_is_patch_registered(patch)) { - ret = -EINVAL; - goto err; + ret = klp_init_patch_early(patch); + if (ret) { + mutex_unlock(&klp_mutex); + return ret; } + ret = klp_init_patch(patch); + if (ret) + goto err; + ret = __klp_enable_patch(patch); + if (ret) + goto err; + + mutex_unlock(&klp_mutex); + + return 0; err: + klp_free_patch_start(patch); + mutex_unlock(&klp_mutex); + + klp_free_patch_finish(patch); + return ret; } EXPORT_SYMBOL_GPL(klp_enable_patch); diff --git a/kernel/livepatch/core.h b/kernel/livepatch/core.h index d0cb5390e247..d4eefc520c08 100644 --- a/kernel/livepatch/core.h +++ b/kernel/livepatch/core.h @@ -7,6 +7,8 @@ extern struct mutex klp_mutex; extern struct list_head klp_patches; +void klp_free_patch_start(struct klp_patch *patch); + static inline bool klp_is_object_loaded(struct klp_object *obj) { return !obj->name || obj->mod; diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c index a4c921364003..c9917a24b3a4 100644 --- a/kernel/livepatch/transition.c +++ b/kernel/livepatch/transition.c @@ -134,13 +134,6 @@ static void klp_complete_transition(void) pr_notice("'%s': %s complete\n", klp_transition_patch->mod->name, klp_target_state == KLP_PATCHED ? "patching" : "unpatching"); - /* - * patch->forced set implies unbounded increase of module's ref count if - * the module is disabled/enabled in a loop. - */ - if (!klp_transition_patch->forced && klp_target_state == KLP_UNPATCHED) - module_put(klp_transition_patch->mod); - klp_target_state = KLP_UNDEFINED; klp_transition_patch = NULL; } @@ -357,6 +350,7 @@ void klp_try_complete_transition(void) { unsigned int cpu; struct task_struct *g, *task; + struct klp_patch *patch; bool complete = true; WARN_ON_ONCE(klp_target_state == KLP_UNDEFINED); @@ -405,7 +399,18 @@ void klp_try_complete_transition(void) } /* we're done, now cleanup the data structures */ + patch = klp_transition_patch; klp_complete_transition(); + + /* + * It would make more sense to free the patch in + * klp_complete_transition() but it is called also + * from klp_cancel_transition(). + */ + if (!patch->enabled) { + klp_free_patch_start(patch); + schedule_work(&patch->free_work); + } } /* diff --git a/samples/livepatch/livepatch-callbacks-demo.c b/samples/livepatch/livepatch-callbacks-demo.c index 72f9e6d1387b..62d97953ad02 100644 --- a/samples/livepatch/livepatch-callbacks-demo.c +++ b/samples/livepatch/livepatch-callbacks-demo.c @@ -195,22 +195,11 @@ static struct klp_patch patch = { static int livepatch_callbacks_demo_init(void) { - int ret; - - ret = klp_register_patch(&patch); - if (ret) - return ret; - ret = klp_enable_patch(&patch); - if (ret) { - WARN_ON(klp_unregister_patch(&patch)); - return ret; - } - return 0; + return klp_enable_patch(&patch); } static void livepatch_callbacks_demo_exit(void) { - WARN_ON(klp_unregister_patch(&patch)); } module_init(livepatch_callbacks_demo_init); diff --git a/samples/livepatch/livepatch-sample.c b/samples/livepatch/livepatch-sample.c index 2d554dd930e2..01c9cf003ca2 100644 --- a/samples/livepatch/livepatch-sample.c +++ b/samples/livepatch/livepatch-sample.c @@ -69,22 +69,11 @@ static struct klp_patch patch = { static int livepatch_init(void) { - int ret; - - ret = klp_register_patch(&patch); - if (ret) - return ret; - ret = klp_enable_patch(&patch); - if (ret) { - WARN_ON(klp_unregister_patch(&patch)); - return ret; - } - return 0; + return klp_enable_patch(&patch); } static void livepatch_exit(void) { - WARN_ON(klp_unregister_patch(&patch)); } module_init(livepatch_init); diff --git a/samples/livepatch/livepatch-shadow-fix1.c b/samples/livepatch/livepatch-shadow-fix1.c index e8f1bd6b29b1..a5a5cac21d4d 100644 --- a/samples/livepatch/livepatch-shadow-fix1.c +++ b/samples/livepatch/livepatch-shadow-fix1.c @@ -157,25 +157,13 @@ static struct klp_patch patch = { static int livepatch_shadow_fix1_init(void) { - int ret; - - ret = klp_register_patch(&patch); - if (ret) - return ret; - ret = klp_enable_patch(&patch); - if (ret) { - WARN_ON(klp_unregister_patch(&patch)); - return ret; - } - return 0; + return klp_enable_patch(&patch); } static void livepatch_shadow_fix1_exit(void) { /* Cleanup any existing SV_LEAK shadow variables */ klp_shadow_free_all(SV_LEAK, livepatch_fix1_dummy_leak_dtor); - - WARN_ON(klp_unregister_patch(&patch)); } module_init(livepatch_shadow_fix1_init); diff --git a/samples/livepatch/livepatch-shadow-fix2.c b/samples/livepatch/livepatch-shadow-fix2.c index b34c7bf83356..52de947b5526 100644 --- a/samples/livepatch/livepatch-shadow-fix2.c +++ b/samples/livepatch/livepatch-shadow-fix2.c @@ -129,25 +129,13 @@ static struct klp_patch patch = { static int livepatch_shadow_fix2_init(void) { - int ret; - - ret = klp_register_patch(&patch); - if (ret) - return ret; - ret = klp_enable_patch(&patch); - if (ret) { - WARN_ON(klp_unregister_patch(&patch)); - return ret; - } - return 0; + return klp_enable_patch(&patch); } static void livepatch_shadow_fix2_exit(void) { /* Cleanup any existing SV_COUNTER shadow variables */ klp_shadow_free_all(SV_COUNTER, NULL); - - WARN_ON(klp_unregister_patch(&patch)); } module_init(livepatch_shadow_fix2_init); -- cgit v1.3-7-g2ca7 From 20e55025958e18e671d92c7adea00c301ac93c43 Mon Sep 17 00:00:00 2001 From: Jason Baron Date: Wed, 9 Jan 2019 13:43:24 +0100 Subject: livepatch: Use lists to manage patches, objects and functions Currently klp_patch contains a pointer to a statically allocated array of struct klp_object and struct klp_objects contains a pointer to a statically allocated array of klp_func. In order to allow for the dynamic allocation of objects and functions, link klp_patch, klp_object, and klp_func together via linked lists. This allows us to more easily allocate new objects and functions, while having the iterator be a simple linked list walk. The static structures are added to the lists early. It allows to add the dynamically allocated objects before klp_init_object() and klp_init_func() calls. Therefore it reduces the further changes to the code. This patch does not change the existing behavior. Signed-off-by: Jason Baron [pmladek@suse.com: Initialize lists before init calls] Signed-off-by: Petr Mladek Acked-by: Miroslav Benes Acked-by: Joe Lawrence Cc: Josh Poimboeuf Cc: Jiri Kosina Acked-by: Josh Poimboeuf Signed-off-by: Jiri Kosina --- include/linux/livepatch.h | 19 +++++++++++++++++-- kernel/livepatch/core.c | 9 +++++++-- 2 files changed, 24 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h index 8f9c19c69744..e117e20ff771 100644 --- a/include/linux/livepatch.h +++ b/include/linux/livepatch.h @@ -24,6 +24,7 @@ #include #include #include +#include #if IS_ENABLED(CONFIG_LIVEPATCH) @@ -42,6 +43,7 @@ * can be found (optional) * @old_func: pointer to the function being patched * @kobj: kobject for sysfs resources + * @node: list node for klp_object func_list * @stack_node: list node for klp_ops func_stack list * @old_size: size of the old function * @new_size: size of the new function @@ -80,6 +82,7 @@ struct klp_func { /* internal */ void *old_func; struct kobject kobj; + struct list_head node; struct list_head stack_node; unsigned long old_size, new_size; bool kobj_added; @@ -117,6 +120,8 @@ struct klp_callbacks { * @funcs: function entries for functions to be patched in the object * @callbacks: functions to be executed pre/post (un)patching * @kobj: kobject for sysfs resources + * @func_list: dynamic list of the function entries + * @node: list node for klp_patch obj_list * @mod: kernel module associated with the patched object * (NULL for vmlinux) * @kobj_added: @kobj has been added and needs freeing @@ -130,6 +135,8 @@ struct klp_object { /* internal */ struct kobject kobj; + struct list_head func_list; + struct list_head node; struct module *mod; bool kobj_added; bool patched; @@ -141,6 +148,7 @@ struct klp_object { * @objs: object entries for kernel objects to be patched * @list: list node for global list of actively used patches * @kobj: kobject for sysfs resources + * @obj_list: dynamic list of the object entries * @kobj_added: @kobj has been added and needs freeing * @enabled: the patch is enabled (but operation may be incomplete) * @forced: was involved in a forced transition @@ -155,6 +163,7 @@ struct klp_patch { /* internal */ struct list_head list; struct kobject kobj; + struct list_head obj_list; bool kobj_added; bool enabled; bool forced; @@ -162,14 +171,20 @@ struct klp_patch { struct completion finish; }; -#define klp_for_each_object(patch, obj) \ +#define klp_for_each_object_static(patch, obj) \ for (obj = patch->objs; obj->funcs || obj->name; obj++) -#define klp_for_each_func(obj, func) \ +#define klp_for_each_object(patch, obj) \ + list_for_each_entry(obj, &patch->obj_list, node) + +#define klp_for_each_func_static(obj, func) \ for (func = obj->funcs; \ func->old_name || func->new_func || func->old_sympos; \ func++) +#define klp_for_each_func(obj, func) \ + list_for_each_entry(func, &obj->func_list, node) + int klp_enable_patch(struct klp_patch *); void arch_klp_init_object_loaded(struct klp_patch *patch, diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c index bd41b03a72d5..37d0d3645fa6 100644 --- a/kernel/livepatch/core.c +++ b/kernel/livepatch/core.c @@ -659,20 +659,25 @@ static int klp_init_patch_early(struct klp_patch *patch) return -EINVAL; INIT_LIST_HEAD(&patch->list); + INIT_LIST_HEAD(&patch->obj_list); patch->kobj_added = false; patch->enabled = false; patch->forced = false; INIT_WORK(&patch->free_work, klp_free_patch_work_fn); init_completion(&patch->finish); - klp_for_each_object(patch, obj) { + klp_for_each_object_static(patch, obj) { if (!obj->funcs) return -EINVAL; + INIT_LIST_HEAD(&obj->func_list); obj->kobj_added = false; + list_add_tail(&obj->node, &patch->obj_list); - klp_for_each_func(obj, func) + klp_for_each_func_static(obj, func) { func->kobj_added = false; + list_add_tail(&func->node, &obj->func_list); + } } if (!try_module_get(patch->mod)) -- cgit v1.3-7-g2ca7 From e1452b607c48c642caf57299f4da83aa002f8533 Mon Sep 17 00:00:00 2001 From: Jason Baron Date: Wed, 9 Jan 2019 13:43:25 +0100 Subject: livepatch: Add atomic replace Sometimes we would like to revert a particular fix. Currently, this is not easy because we want to keep all other fixes active and we could revert only the last applied patch. One solution would be to apply new patch that implemented all the reverted functions like in the original code. It would work as expected but there will be unnecessary redirections. In addition, it would also require knowing which functions need to be reverted at build time. Another problem is when there are many patches that touch the same functions. There might be dependencies between patches that are not enforced on the kernel side. Also it might be pretty hard to actually prepare the patch and ensure compatibility with the other patches. Atomic replace && cumulative patches: A better solution would be to create cumulative patch and say that it replaces all older ones. This patch adds a new "replace" flag to struct klp_patch. When it is enabled, a set of 'nop' klp_func will be dynamically created for all functions that are already being patched but that will no longer be modified by the new patch. They are used as a new target during the patch transition. The idea is to handle Nops' structures like the static ones. When the dynamic structures are allocated, we initialize all values that are normally statically defined. The only exception is "new_func" in struct klp_func. It has to point to the original function and the address is known only when the object (module) is loaded. Note that we really need to set it. The address is used, for example, in klp_check_stack_func(). Nevertheless we still need to distinguish the dynamically allocated structures in some operations. For this, we add "nop" flag into struct klp_func and "dynamic" flag into struct klp_object. They need special handling in the following situations: + The structures are added into the lists of objects and functions immediately. In fact, the lists were created for this purpose. + The address of the original function is known only when the patched object (module) is loaded. Therefore it is copied later in klp_init_object_loaded(). + The ftrace handler must not set PC to func->new_func. It would cause infinite loop because the address points back to the beginning of the original function. + The various free() functions must free the structure itself. Note that other ways to detect the dynamic structures are not considered safe. For example, even the statically defined struct klp_object might include empty funcs array. It might be there just to run some callbacks. Also note that the safe iterator must be used in the free() functions. Otherwise already freed structures might get accessed. Special callbacks handling: The callbacks from the replaced patches are _not_ called by intention. It would be pretty hard to define a reasonable semantic and implement it. It might even be counter-productive. The new patch is cumulative. It is supposed to include most of the changes from older patches. In most cases, it will not want to call pre_unpatch() post_unpatch() callbacks from the replaced patches. It would disable/break things for no good reasons. Also it should be easier to handle various scenarios in a single script in the new patch than think about interactions caused by running many scripts from older patches. Not to say that the old scripts even would not expect to be called in this situation. Removing replaced patches: One nice effect of the cumulative patches is that the code from the older patches is no longer used. Therefore the replaced patches can be removed. It has several advantages: + Nops' structs will no longer be necessary and might be removed. This would save memory, restore performance (no ftrace handler), allow clear view on what is really patched. + Disabling the patch will cause using the original code everywhere. Therefore the livepatch callbacks could handle only one scenario. Note that the complication is already complex enough when the patch gets enabled. It is currently solved by calling callbacks only from the new cumulative patch. + The state is clean in both the sysfs interface and lsmod. The modules with the replaced livepatches might even get removed from the system. Some people actually expected this behavior from the beginning. After all a cumulative patch is supposed to "completely" replace an existing one. It is like when a new version of an application replaces an older one. This patch does the first step. It removes the replaced patches from the list of patches. It is safe. The consistency model ensures that they are no longer used. By other words, each process works only with the structures from klp_transition_patch. The removal is done by a special function. It combines actions done by __disable_patch() and klp_complete_transition(). But it is a fast track without all the transaction-related stuff. Signed-off-by: Jason Baron [pmladek@suse.com: Split, reuse existing code, simplified] Signed-off-by: Petr Mladek Cc: Josh Poimboeuf Cc: Jessica Yu Cc: Jiri Kosina Cc: Miroslav Benes Acked-by: Miroslav Benes Acked-by: Josh Poimboeuf Signed-off-by: Jiri Kosina --- Documentation/livepatch/livepatch.txt | 31 ++++- include/linux/livepatch.h | 12 ++ kernel/livepatch/core.c | 232 ++++++++++++++++++++++++++++++++-- kernel/livepatch/core.h | 1 + kernel/livepatch/patch.c | 8 ++ kernel/livepatch/transition.c | 3 + 6 files changed, 273 insertions(+), 14 deletions(-) (limited to 'kernel') diff --git a/Documentation/livepatch/livepatch.txt b/Documentation/livepatch/livepatch.txt index 8f56490a4bb6..2a70f43166f6 100644 --- a/Documentation/livepatch/livepatch.txt +++ b/Documentation/livepatch/livepatch.txt @@ -15,8 +15,9 @@ Table of Contents: 5. Livepatch life-cycle 5.1. Loading 5.2. Enabling - 5.3. Disabling - 5.4. Removing + 5.3. Replacing + 5.4. Disabling + 5.5. Removing 6. Sysfs 7. Limitations @@ -300,8 +301,12 @@ into three levels: 5. Livepatch life-cycle ======================= -Livepatching can be described by four basic operations: -loading, enabling, disabling, removing. +Livepatching can be described by five basic operations: +loading, enabling, replacing, disabling, removing. + +Where the replacing and the disabling operations are mutually +exclusive. They have the same result for the given patch but +not for the system. 5.1. Loading @@ -347,7 +352,21 @@ to '0'. the "Consistency model" section. -5.3. Disabling +5.3. Replacing +-------------- + +All enabled patches might get replaced by a cumulative patch that +has the .replace flag set. + +Once the new patch is enabled and the 'transition' finishes then +all the functions (struct klp_func) associated with the replaced +patches are removed from the corresponding struct klp_ops. Also +the ftrace handler is unregistered and the struct klp_ops is +freed when the related function is not modified by the new patch +and func_stack list becomes empty. + + +5.4. Disabling -------------- Enabled patches might get disabled by writing '0' to @@ -372,7 +391,7 @@ Note that patches must be disabled in exactly the reverse order in which they were enabled. It makes the problem and the implementation much easier. -5.4. Removing +5.5. Removing ------------- Module removal is only safe when there are no users of functions provided diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h index e117e20ff771..53551f470722 100644 --- a/include/linux/livepatch.h +++ b/include/linux/livepatch.h @@ -48,6 +48,7 @@ * @old_size: size of the old function * @new_size: size of the new function * @kobj_added: @kobj has been added and needs freeing + * @nop: temporary patch to use the original code again; dyn. allocated * @patched: the func has been added to the klp_ops list * @transition: the func is currently being applied or reverted * @@ -86,6 +87,7 @@ struct klp_func { struct list_head stack_node; unsigned long old_size, new_size; bool kobj_added; + bool nop; bool patched; bool transition; }; @@ -125,6 +127,7 @@ struct klp_callbacks { * @mod: kernel module associated with the patched object * (NULL for vmlinux) * @kobj_added: @kobj has been added and needs freeing + * @dynamic: temporary object for nop functions; dynamically allocated * @patched: the object's funcs have been added to the klp_ops list */ struct klp_object { @@ -139,6 +142,7 @@ struct klp_object { struct list_head node; struct module *mod; bool kobj_added; + bool dynamic; bool patched; }; @@ -146,6 +150,7 @@ struct klp_object { * struct klp_patch - patch structure for live patching * @mod: reference to the live patch module * @objs: object entries for kernel objects to be patched + * @replace: replace all actively used patches * @list: list node for global list of actively used patches * @kobj: kobject for sysfs resources * @obj_list: dynamic list of the object entries @@ -159,6 +164,7 @@ struct klp_patch { /* external */ struct module *mod; struct klp_object *objs; + bool replace; /* internal */ struct list_head list; @@ -174,6 +180,9 @@ struct klp_patch { #define klp_for_each_object_static(patch, obj) \ for (obj = patch->objs; obj->funcs || obj->name; obj++) +#define klp_for_each_object_safe(patch, obj, tmp_obj) \ + list_for_each_entry_safe(obj, tmp_obj, &patch->obj_list, node) + #define klp_for_each_object(patch, obj) \ list_for_each_entry(obj, &patch->obj_list, node) @@ -182,6 +191,9 @@ struct klp_patch { func->old_name || func->new_func || func->old_sympos; \ func++) +#define klp_for_each_func_safe(obj, func, tmp_func) \ + list_for_each_entry_safe(func, tmp_func, &obj->func_list, node) + #define klp_for_each_func(obj, func) \ list_for_each_entry(func, &obj->func_list, node) diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c index 37d0d3645fa6..ecb7660f1d8b 100644 --- a/kernel/livepatch/core.c +++ b/kernel/livepatch/core.c @@ -92,6 +92,40 @@ static bool klp_initialized(void) return !!klp_root_kobj; } +static struct klp_func *klp_find_func(struct klp_object *obj, + struct klp_func *old_func) +{ + struct klp_func *func; + + klp_for_each_func(obj, func) { + if ((strcmp(old_func->old_name, func->old_name) == 0) && + (old_func->old_sympos == func->old_sympos)) { + return func; + } + } + + return NULL; +} + +static struct klp_object *klp_find_object(struct klp_patch *patch, + struct klp_object *old_obj) +{ + struct klp_object *obj; + + klp_for_each_object(patch, obj) { + if (klp_is_module(old_obj)) { + if (klp_is_module(obj) && + strcmp(old_obj->name, obj->name) == 0) { + return obj; + } + } else if (!klp_is_module(obj)) { + return obj; + } + } + + return NULL; +} + struct klp_find_arg { const char *objname; const char *name; @@ -418,6 +452,121 @@ static struct attribute *klp_patch_attrs[] = { NULL }; +static void klp_free_object_dynamic(struct klp_object *obj) +{ + kfree(obj->name); + kfree(obj); +} + +static struct klp_object *klp_alloc_object_dynamic(const char *name) +{ + struct klp_object *obj; + + obj = kzalloc(sizeof(*obj), GFP_KERNEL); + if (!obj) + return NULL; + + if (name) { + obj->name = kstrdup(name, GFP_KERNEL); + if (!obj->name) { + kfree(obj); + return NULL; + } + } + + INIT_LIST_HEAD(&obj->func_list); + obj->dynamic = true; + + return obj; +} + +static void klp_free_func_nop(struct klp_func *func) +{ + kfree(func->old_name); + kfree(func); +} + +static struct klp_func *klp_alloc_func_nop(struct klp_func *old_func, + struct klp_object *obj) +{ + struct klp_func *func; + + func = kzalloc(sizeof(*func), GFP_KERNEL); + if (!func) + return NULL; + + if (old_func->old_name) { + func->old_name = kstrdup(old_func->old_name, GFP_KERNEL); + if (!func->old_name) { + kfree(func); + return NULL; + } + } + + /* + * func->new_func is same as func->old_func. These addresses are + * set when the object is loaded, see klp_init_object_loaded(). + */ + func->old_sympos = old_func->old_sympos; + func->nop = true; + + return func; +} + +static int klp_add_object_nops(struct klp_patch *patch, + struct klp_object *old_obj) +{ + struct klp_object *obj; + struct klp_func *func, *old_func; + + obj = klp_find_object(patch, old_obj); + + if (!obj) { + obj = klp_alloc_object_dynamic(old_obj->name); + if (!obj) + return -ENOMEM; + + list_add_tail(&obj->node, &patch->obj_list); + } + + klp_for_each_func(old_obj, old_func) { + func = klp_find_func(obj, old_func); + if (func) + continue; + + func = klp_alloc_func_nop(old_func, obj); + if (!func) + return -ENOMEM; + + list_add_tail(&func->node, &obj->func_list); + } + + return 0; +} + +/* + * Add 'nop' functions which simply return to the caller to run + * the original function. The 'nop' functions are added to a + * patch to facilitate a 'replace' mode. + */ +static int klp_add_nops(struct klp_patch *patch) +{ + struct klp_patch *old_patch; + struct klp_object *old_obj; + + list_for_each_entry(old_patch, &klp_patches, list) { + klp_for_each_object(old_patch, old_obj) { + int err; + + err = klp_add_object_nops(patch, old_obj); + if (err) + return err; + } + } + + return 0; +} + static void klp_kobj_release_patch(struct kobject *kobj) { struct klp_patch *patch; @@ -434,6 +583,12 @@ static struct kobj_type klp_ktype_patch = { static void klp_kobj_release_object(struct kobject *kobj) { + struct klp_object *obj; + + obj = container_of(kobj, struct klp_object, kobj); + + if (obj->dynamic) + klp_free_object_dynamic(obj); } static struct kobj_type klp_ktype_object = { @@ -443,6 +598,12 @@ static struct kobj_type klp_ktype_object = { static void klp_kobj_release_func(struct kobject *kobj) { + struct klp_func *func; + + func = container_of(kobj, struct klp_func, kobj); + + if (func->nop) + klp_free_func_nop(func); } static struct kobj_type klp_ktype_func = { @@ -452,12 +613,15 @@ static struct kobj_type klp_ktype_func = { static void klp_free_funcs(struct klp_object *obj) { - struct klp_func *func; + struct klp_func *func, *tmp_func; - klp_for_each_func(obj, func) { + klp_for_each_func_safe(obj, func, tmp_func) { /* Might be called from klp_init_patch() error path. */ - if (func->kobj_added) + if (func->kobj_added) { kobject_put(&func->kobj); + } else if (func->nop) { + klp_free_func_nop(func); + } } } @@ -468,20 +632,27 @@ static void klp_free_object_loaded(struct klp_object *obj) obj->mod = NULL; - klp_for_each_func(obj, func) + klp_for_each_func(obj, func) { func->old_func = NULL; + + if (func->nop) + func->new_func = NULL; + } } static void klp_free_objects(struct klp_patch *patch) { - struct klp_object *obj; + struct klp_object *obj, *tmp_obj; - klp_for_each_object(patch, obj) { + klp_for_each_object_safe(patch, obj, tmp_obj) { klp_free_funcs(obj); /* Might be called from klp_init_patch() error path. */ - if (obj->kobj_added) + if (obj->kobj_added) { kobject_put(&obj->kobj); + } else if (obj->dynamic) { + klp_free_object_dynamic(obj); + } } } @@ -543,7 +714,14 @@ static int klp_init_func(struct klp_object *obj, struct klp_func *func) { int ret; - if (!func->old_name || !func->new_func) + if (!func->old_name) + return -EINVAL; + + /* + * NOPs get the address later. The patched module must be loaded, + * see klp_init_object_loaded(). + */ + if (!func->new_func && !func->nop) return -EINVAL; if (strlen(func->old_name) >= KSYM_NAME_LEN) @@ -605,6 +783,9 @@ static int klp_init_object_loaded(struct klp_patch *patch, return -ENOENT; } + if (func->nop) + func->new_func = func->old_func; + ret = kallsyms_lookup_size_offset((unsigned long)func->new_func, &func->new_size, NULL); if (!ret) { @@ -697,6 +878,12 @@ static int klp_init_patch(struct klp_patch *patch) return ret; patch->kobj_added = true; + if (patch->replace) { + ret = klp_add_nops(patch); + if (ret) + return ret; + } + klp_for_each_object(patch, obj) { ret = klp_init_object(patch, obj); if (ret) @@ -868,6 +1055,35 @@ err: } EXPORT_SYMBOL_GPL(klp_enable_patch); +/* + * This function removes replaced patches. + * + * We could be pretty aggressive here. It is called in the situation where + * these structures are no longer accessible. All functions are redirected + * by the klp_transition_patch. They use either a new code or they are in + * the original code because of the special nop function patches. + * + * The only exception is when the transition was forced. In this case, + * klp_ftrace_handler() might still see the replaced patch on the stack. + * Fortunately, it is carefully designed to work with removed functions + * thanks to RCU. We only have to keep the patches on the system. Also + * this is handled transparently by patch->module_put. + */ +void klp_discard_replaced_patches(struct klp_patch *new_patch) +{ + struct klp_patch *old_patch, *tmp_patch; + + list_for_each_entry_safe(old_patch, tmp_patch, &klp_patches, list) { + if (old_patch == new_patch) + return; + + old_patch->enabled = false; + klp_unpatch_objects(old_patch); + klp_free_patch_start(old_patch); + schedule_work(&old_patch->free_work); + } +} + /* * Remove parts of patches that touch a given kernel module. The list of * patches processed might be limited. When limit is NULL, all patches diff --git a/kernel/livepatch/core.h b/kernel/livepatch/core.h index d4eefc520c08..f6a853adcc00 100644 --- a/kernel/livepatch/core.h +++ b/kernel/livepatch/core.h @@ -8,6 +8,7 @@ extern struct mutex klp_mutex; extern struct list_head klp_patches; void klp_free_patch_start(struct klp_patch *patch); +void klp_discard_replaced_patches(struct klp_patch *new_patch); static inline bool klp_is_object_loaded(struct klp_object *obj) { diff --git a/kernel/livepatch/patch.c b/kernel/livepatch/patch.c index 825022d70912..0ff466ab4b5a 100644 --- a/kernel/livepatch/patch.c +++ b/kernel/livepatch/patch.c @@ -118,7 +118,15 @@ static void notrace klp_ftrace_handler(unsigned long ip, } } + /* + * NOPs are used to replace existing patches with original code. + * Do nothing! Setting pc would cause an infinite loop. + */ + if (func->nop) + goto unlock; + klp_arch_set_pc(regs, (unsigned long)func->new_func); + unlock: preempt_enable_notrace(); } diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c index c9917a24b3a4..f4c5908a9731 100644 --- a/kernel/livepatch/transition.c +++ b/kernel/livepatch/transition.c @@ -85,6 +85,9 @@ static void klp_complete_transition(void) klp_transition_patch->mod->name, klp_target_state == KLP_PATCHED ? "patching" : "unpatching"); + if (klp_transition_patch->replace && klp_target_state == KLP_PATCHED) + klp_discard_replaced_patches(klp_transition_patch); + if (klp_target_state == KLP_UNPATCHED) { /* * All tasks have transitioned to KLP_UNPATCHED so we can now -- cgit v1.3-7-g2ca7 From d697bad588eb4e76311193e6eaacc7c7aaa5a4ba Mon Sep 17 00:00:00 2001 From: Petr Mladek Date: Wed, 9 Jan 2019 13:43:26 +0100 Subject: livepatch: Remove Nop structures when unused Replaced patches are removed from the stack when the transition is finished. It means that Nop structures will never be needed again and can be removed. Why should we care? + Nop structures give the impression that the function is patched even though the ftrace handler has no effect. + Ftrace handlers do not come for free. They cause slowdown that might be visible in some workloads. The ftrace-related slowdown might actually be the reason why the function is no longer patched in the new cumulative patch. One would expect that cumulative patch would help solve these problems as well. + Cumulative patches are supposed to replace any earlier version of the patch. The amount of NOPs depends on which version was replaced. This multiplies the amount of scenarios that might happen. One might say that NOPs are innocent. But there are even optimized NOP instructions for different processors, for example, see arch/x86/kernel/alternative.c. And klp_ftrace_handler() is much more complicated. + It sounds natural to clean up a mess that is no longer needed. It could only be worse if we do not do it. This patch allows to unpatch and free the dynamic structures independently when the transition finishes. The free part is a bit tricky because kobject free callbacks are called asynchronously. We could not wait for them easily. Fortunately, we do not have to. Any further access can be avoided by removing them from the dynamic lists. Signed-off-by: Petr Mladek Acked-by: Miroslav Benes Acked-by: Josh Poimboeuf Signed-off-by: Jiri Kosina --- kernel/livepatch/core.c | 48 ++++++++++++++++++++++++++++++++++++++++--- kernel/livepatch/core.h | 1 + kernel/livepatch/patch.c | 31 +++++++++++++++++++++++----- kernel/livepatch/patch.h | 1 + kernel/livepatch/transition.c | 4 +++- 5 files changed, 76 insertions(+), 9 deletions(-) (limited to 'kernel') diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c index ecb7660f1d8b..113645ee86b6 100644 --- a/kernel/livepatch/core.c +++ b/kernel/livepatch/core.c @@ -611,11 +611,16 @@ static struct kobj_type klp_ktype_func = { .sysfs_ops = &kobj_sysfs_ops, }; -static void klp_free_funcs(struct klp_object *obj) +static void __klp_free_funcs(struct klp_object *obj, bool nops_only) { struct klp_func *func, *tmp_func; klp_for_each_func_safe(obj, func, tmp_func) { + if (nops_only && !func->nop) + continue; + + list_del(&func->node); + /* Might be called from klp_init_patch() error path. */ if (func->kobj_added) { kobject_put(&func->kobj); @@ -640,12 +645,17 @@ static void klp_free_object_loaded(struct klp_object *obj) } } -static void klp_free_objects(struct klp_patch *patch) +static void __klp_free_objects(struct klp_patch *patch, bool nops_only) { struct klp_object *obj, *tmp_obj; klp_for_each_object_safe(patch, obj, tmp_obj) { - klp_free_funcs(obj); + __klp_free_funcs(obj, nops_only); + + if (nops_only && !obj->dynamic) + continue; + + list_del(&obj->node); /* Might be called from klp_init_patch() error path. */ if (obj->kobj_added) { @@ -656,6 +666,16 @@ static void klp_free_objects(struct klp_patch *patch) } } +static void klp_free_objects(struct klp_patch *patch) +{ + __klp_free_objects(patch, false); +} + +static void klp_free_objects_dynamic(struct klp_patch *patch) +{ + __klp_free_objects(patch, true); +} + /* * This function implements the free operations that can be called safely * under klp_mutex. @@ -1084,6 +1104,28 @@ void klp_discard_replaced_patches(struct klp_patch *new_patch) } } +/* + * This function removes the dynamically allocated 'nop' functions. + * + * We could be pretty aggressive. NOPs do not change the existing + * behavior except for adding unnecessary delay by the ftrace handler. + * + * It is safe even when the transition was forced. The ftrace handler + * will see a valid ops->func_stack entry thanks to RCU. + * + * We could even free the NOPs structures. They must be the last entry + * in ops->func_stack. Therefore unregister_ftrace_function() is called. + * It does the same as klp_synchronize_transition() to make sure that + * nobody is inside the ftrace handler once the operation finishes. + * + * IMPORTANT: It must be called right after removing the replaced patches! + */ +void klp_discard_nops(struct klp_patch *new_patch) +{ + klp_unpatch_objects_dynamic(klp_transition_patch); + klp_free_objects_dynamic(klp_transition_patch); +} + /* * Remove parts of patches that touch a given kernel module. The list of * patches processed might be limited. When limit is NULL, all patches diff --git a/kernel/livepatch/core.h b/kernel/livepatch/core.h index f6a853adcc00..e6200f38701f 100644 --- a/kernel/livepatch/core.h +++ b/kernel/livepatch/core.h @@ -9,6 +9,7 @@ extern struct list_head klp_patches; void klp_free_patch_start(struct klp_patch *patch); void klp_discard_replaced_patches(struct klp_patch *new_patch); +void klp_discard_nops(struct klp_patch *new_patch); static inline bool klp_is_object_loaded(struct klp_object *obj) { diff --git a/kernel/livepatch/patch.c b/kernel/livepatch/patch.c index 0ff466ab4b5a..99cb3ad05eb4 100644 --- a/kernel/livepatch/patch.c +++ b/kernel/livepatch/patch.c @@ -246,15 +246,26 @@ err: return ret; } -void klp_unpatch_object(struct klp_object *obj) +static void __klp_unpatch_object(struct klp_object *obj, bool nops_only) { struct klp_func *func; - klp_for_each_func(obj, func) + klp_for_each_func(obj, func) { + if (nops_only && !func->nop) + continue; + if (func->patched) klp_unpatch_func(func); + } - obj->patched = false; + if (obj->dynamic || !nops_only) + obj->patched = false; +} + + +void klp_unpatch_object(struct klp_object *obj) +{ + __klp_unpatch_object(obj, false); } int klp_patch_object(struct klp_object *obj) @@ -277,11 +288,21 @@ int klp_patch_object(struct klp_object *obj) return 0; } -void klp_unpatch_objects(struct klp_patch *patch) +static void __klp_unpatch_objects(struct klp_patch *patch, bool nops_only) { struct klp_object *obj; klp_for_each_object(patch, obj) if (obj->patched) - klp_unpatch_object(obj); + __klp_unpatch_object(obj, nops_only); +} + +void klp_unpatch_objects(struct klp_patch *patch) +{ + __klp_unpatch_objects(patch, false); +} + +void klp_unpatch_objects_dynamic(struct klp_patch *patch) +{ + __klp_unpatch_objects(patch, true); } diff --git a/kernel/livepatch/patch.h b/kernel/livepatch/patch.h index a9b16e513656..d5f2fbe373e0 100644 --- a/kernel/livepatch/patch.h +++ b/kernel/livepatch/patch.h @@ -30,5 +30,6 @@ struct klp_ops *klp_find_ops(void *old_func); int klp_patch_object(struct klp_object *obj); void klp_unpatch_object(struct klp_object *obj); void klp_unpatch_objects(struct klp_patch *patch); +void klp_unpatch_objects_dynamic(struct klp_patch *patch); #endif /* _LIVEPATCH_PATCH_H */ diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c index f4c5908a9731..300273819674 100644 --- a/kernel/livepatch/transition.c +++ b/kernel/livepatch/transition.c @@ -85,8 +85,10 @@ static void klp_complete_transition(void) klp_transition_patch->mod->name, klp_target_state == KLP_PATCHED ? "patching" : "unpatching"); - if (klp_transition_patch->replace && klp_target_state == KLP_PATCHED) + if (klp_transition_patch->replace && klp_target_state == KLP_PATCHED) { klp_discard_replaced_patches(klp_transition_patch); + klp_discard_nops(klp_transition_patch); + } if (klp_target_state == KLP_UNPATCHED) { /* -- cgit v1.3-7-g2ca7 From d67a53720966f2ef5be5c8f238d13512b8260868 Mon Sep 17 00:00:00 2001 From: Petr Mladek Date: Wed, 9 Jan 2019 13:43:28 +0100 Subject: livepatch: Remove ordering (stacking) of the livepatches The atomic replace and cumulative patches were introduced as a more secure way to handle dependent patches. They simplify the logic: + Any new cumulative patch is supposed to take over shadow variables and changes made by callbacks from previous livepatches. + All replaced patches are discarded and the modules can be unloaded. As a result, there is only one scenario when a cumulative livepatch gets disabled. The different handling of "normal" and cumulative patches might cause confusion. It would make sense to keep only one mode. On the other hand, it would be rude to enforce using the cumulative livepatches even for trivial and independent (hot) fixes. However, the stack of patches is not really necessary any longer. The patch ordering was never clearly visible via the sysfs interface. Also the "normal" patches need a lot of caution anyway. Note that the list of enabled patches is still necessary but the ordering is not longer enforced. Otherwise, the code is ready to disable livepatches in an random order. Namely, klp_check_stack_func() always looks for the function from the livepatch that is being disabled. klp_func structures are just removed from the related func_stack. Finally, the ftrace handlers is removed only when the func_stack becomes empty. Signed-off-by: Petr Mladek Acked-by: Miroslav Benes Acked-by: Josh Poimboeuf Signed-off-by: Jiri Kosina --- Documentation/livepatch/cumulative-patches.txt | 11 ++++------- Documentation/livepatch/livepatch.txt | 13 +++++++------ kernel/livepatch/core.c | 4 ---- 3 files changed, 11 insertions(+), 17 deletions(-) (limited to 'kernel') diff --git a/Documentation/livepatch/cumulative-patches.txt b/Documentation/livepatch/cumulative-patches.txt index e7cf5be69f23..0012808e8d44 100644 --- a/Documentation/livepatch/cumulative-patches.txt +++ b/Documentation/livepatch/cumulative-patches.txt @@ -7,8 +7,8 @@ to do different changes to the same function(s) then we need to define an order in which the patches will be installed. And function implementations from any newer livepatch must be done on top of the older ones. -This might become a maintenance nightmare. Especially if anyone would want -to remove a patch that is in the middle of the stack. +This might become a maintenance nightmare. Especially when more patches +modified the same function in different ways. An elegant solution comes with the feature called "Atomic Replace". It allows creation of so called "Cumulative Patches". They include all wanted changes @@ -26,11 +26,9 @@ for example: .replace = true, }; -Such a patch is added on top of the livepatch stack when enabled. - All processes are then migrated to use the code only from the new patch. Once the transition is finished, all older patches are automatically -disabled and removed from the stack of patches. +disabled. Ftrace handlers are transparently removed from functions that are no longer modified by the new cumulative patch. @@ -57,8 +55,7 @@ The atomic replace allows: + Remove eventual performance impact caused by core redirection for functions that are no longer patched. - + Decrease user confusion about stacking order and what code - is actually in use. + + Decrease user confusion about dependencies between livepatches. Limitations: diff --git a/Documentation/livepatch/livepatch.txt b/Documentation/livepatch/livepatch.txt index 6f32d6ea2fcb..71d7f286ec4d 100644 --- a/Documentation/livepatch/livepatch.txt +++ b/Documentation/livepatch/livepatch.txt @@ -143,9 +143,9 @@ without HAVE_RELIABLE_STACKTRACE are not considered fully supported by the kernel livepatching. The /sys/kernel/livepatch//transition file shows whether a patch -is in transition. Only a single patch (the topmost patch on the stack) -can be in transition at a given time. A patch can remain in transition -indefinitely, if any of the tasks are stuck in the initial patch state. +is in transition. Only a single patch can be in transition at a given +time. A patch can remain in transition indefinitely, if any of the tasks +are stuck in the initial patch state. A transition can be reversed and effectively canceled by writing the opposite value to the /sys/kernel/livepatch//enabled file while @@ -351,6 +351,10 @@ to '0'. The right implementation is selected by the ftrace handler, see the "Consistency model" section. + That said, it is highly recommended to use cumulative livepatches + because they help keeping the consistency of all changes. In this case, + functions might be patched two times only during the transition period. + 5.3. Replacing -------------- @@ -389,9 +393,6 @@ becomes empty. Third, the sysfs interface is destroyed. -Note that patches must be disabled in exactly the reverse order in which -they were enabled. It makes the problem and the implementation much easier. - 5.5. Removing ------------- diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c index 113645ee86b6..adca5cf07f7e 100644 --- a/kernel/livepatch/core.c +++ b/kernel/livepatch/core.c @@ -925,10 +925,6 @@ static int __klp_disable_patch(struct klp_patch *patch) if (klp_transition_patch) return -EBUSY; - /* enforce stacking: only the last enabled patch can be disabled */ - if (!list_is_last(&patch->list, &klp_patches)) - return -EBUSY; - klp_init_transition(patch, KLP_UNPATCHED); klp_for_each_object(patch, obj) -- cgit v1.3-7-g2ca7 From 73ab1cb2de9e3efe7f818d5453de271e5371df1d Mon Sep 17 00:00:00 2001 From: Taehee Yoo Date: Wed, 9 Jan 2019 02:23:56 +0900 Subject: umh: add exit routine for UMH process A UMH process which is created by the fork_usermode_blob() such as bpfilter needs to release members of the umh_info when process is terminated. But the do_exit() does not release members of the umh_info. hence module which uses UMH needs own code to detect whether UMH process is terminated or not. But this implementation needs extra code for checking the status of UMH process. it eventually makes the code more complex. The new PF_UMH flag is added and it is used to identify UMH processes. The exit_umh() does not release members of the umh_info. Hence umh_info->cleanup callback should release both members of the umh_info and the private data. Suggested-by: David S. Miller Signed-off-by: Taehee Yoo Signed-off-by: David S. Miller --- include/linux/sched.h | 9 +++++++++ include/linux/umh.h | 2 ++ kernel/exit.c | 1 + kernel/umh.c | 33 +++++++++++++++++++++++++++++++-- 4 files changed, 43 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/include/linux/sched.h b/include/linux/sched.h index 89541d248893..e35e35b9fc48 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1406,6 +1406,7 @@ extern struct pid *cad_pid; #define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */ #define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */ #define PF_MEMSTALL 0x01000000 /* Stalled due to lack of memory */ +#define PF_UMH 0x02000000 /* I'm an Usermodehelper process */ #define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_allowed */ #define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */ #define PF_MUTEX_TESTER 0x20000000 /* Thread belongs to the rt mutex tester */ @@ -1904,6 +1905,14 @@ static inline void rseq_execve(struct task_struct *t) #endif +void __exit_umh(struct task_struct *tsk); + +static inline void exit_umh(struct task_struct *tsk) +{ + if (unlikely(tsk->flags & PF_UMH)) + __exit_umh(tsk); +} + #ifdef CONFIG_DEBUG_RSEQ void rseq_syscall(struct pt_regs *regs); diff --git a/include/linux/umh.h b/include/linux/umh.h index 235f51b62c71..0c08de356d0d 100644 --- a/include/linux/umh.h +++ b/include/linux/umh.h @@ -47,6 +47,8 @@ struct umh_info { const char *cmdline; struct file *pipe_to_umh; struct file *pipe_from_umh; + struct list_head list; + void (*cleanup)(struct umh_info *info); pid_t pid; }; int fork_usermode_blob(void *data, size_t len, struct umh_info *info); diff --git a/kernel/exit.c b/kernel/exit.c index 8a01b671dc1f..dad70419195c 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -866,6 +866,7 @@ void __noreturn do_exit(long code) exit_task_namespaces(tsk); exit_task_work(tsk); exit_thread(tsk); + exit_umh(tsk); /* * Flush inherited counters to the parent - before the parent diff --git a/kernel/umh.c b/kernel/umh.c index 0baa672e023c..d937cbad903a 100644 --- a/kernel/umh.c +++ b/kernel/umh.c @@ -37,6 +37,8 @@ static kernel_cap_t usermodehelper_bset = CAP_FULL_SET; static kernel_cap_t usermodehelper_inheritable = CAP_FULL_SET; static DEFINE_SPINLOCK(umh_sysctl_lock); static DECLARE_RWSEM(umhelper_sem); +static LIST_HEAD(umh_list); +static DEFINE_MUTEX(umh_list_lock); static void call_usermodehelper_freeinfo(struct subprocess_info *info) { @@ -100,10 +102,12 @@ static int call_usermodehelper_exec_async(void *data) commit_creds(new); sub_info->pid = task_pid_nr(current); - if (sub_info->file) + if (sub_info->file) { retval = do_execve_file(sub_info->file, sub_info->argv, sub_info->envp); - else + if (!retval) + current->flags |= PF_UMH; + } else retval = do_execve(getname_kernel(sub_info->path), (const char __user *const __user *)sub_info->argv, (const char __user *const __user *)sub_info->envp); @@ -517,6 +521,11 @@ int fork_usermode_blob(void *data, size_t len, struct umh_info *info) goto out; err = call_usermodehelper_exec(sub_info, UMH_WAIT_EXEC); + if (!err) { + mutex_lock(&umh_list_lock); + list_add(&info->list, &umh_list); + mutex_unlock(&umh_list_lock); + } out: fput(file); return err; @@ -679,6 +688,26 @@ static int proc_cap_handler(struct ctl_table *table, int write, return 0; } +void __exit_umh(struct task_struct *tsk) +{ + struct umh_info *info; + pid_t pid = tsk->pid; + + mutex_lock(&umh_list_lock); + list_for_each_entry(info, &umh_list, list) { + if (info->pid == pid) { + list_del(&info->list); + mutex_unlock(&umh_list_lock); + goto out; + } + } + mutex_unlock(&umh_list_lock); + return; +out: + if (info->cleanup) + info->cleanup(info); +} + struct ctl_table usermodehelper_table[] = { { .procname = "bset", -- cgit v1.3-7-g2ca7 From b7285b425318331c2de4af2a784a18e6dccef484 Mon Sep 17 00:00:00 2001 From: Jonathan Neuschäfer Date: Sat, 12 Jan 2019 18:14:30 +0100 Subject: kernel/sys.c: Clarify that UNAME26 does not generate unique versions anymore MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit UNAME26 is a mechanism to report Linux's version as 2.6.x, for compatibility with old/broken software. Due to the way it is implemented, it would have to be updated after 5.0, to keep the resulting versions unique. Linus Torvalds argued: "Do we actually need this? I'd rather let it bitrot, and just let it return random versions. It will just start again at 2.4.60, won't it? Anybody who uses UNAME26 for a 5.x kernel might as well think it's still 4.x. The user space is so old that it can't possibly care about differences between 4.x and 5.x, can it? The only thing that matters is that it shows "2.4.", which it will do regardless" Signed-off-by: Jonathan Neuschäfer Signed-off-by: Linus Torvalds --- kernel/sys.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/sys.c b/kernel/sys.c index a48cbf1414b8..f7eb62eceb24 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -1207,7 +1207,8 @@ DECLARE_RWSEM(uts_sem); /* * Work around broken programs that cannot handle "Linux 3.0". * Instead we map 3.x to 2.6.40+x, so e.g. 3.0 would be 2.6.40 - * And we map 4.x to 2.6.60+x, so 4.0 would be 2.6.60. + * And we map 4.x and later versions to 2.6.60+x, so 4.0/5.0/6.0/... would be + * 2.6.60. */ static int override_release(char __user *release, size_t len) { -- cgit v1.3-7-g2ca7 From 53fc7a01df51f58b317ea5ab1607a1af65d6d4cf Mon Sep 17 00:00:00 2001 From: Richard Guy Briggs Date: Mon, 10 Dec 2018 17:17:48 -0500 Subject: audit: give a clue what CONFIG_CHANGE op was involved The failure to add an audit rule due to audit locked gives no clue what CONFIG_CHANGE operation failed. Similarly the set operation is the only other operation that doesn't give the "op=" field to indicate the action. All other CONFIG_CHANGE records include an op= field to give a clue as to what sort of configuration change is being executed. Since these are the only CONFIG_CHANGE records that that do not have an op= field, add them to bring them in line with the rest. Old records: type=CONFIG_CHANGE msg=audit(1519812997.781:374): pid=610 uid=0 auid=0 ses=1 subj=... audit_enabled=2 res=0 type=CONFIG_CHANGE msg=audit(2018-06-14 14:55:04.507:47) : audit_enabled=1 old=1 auid=unset ses=unset subj=... res=yes New records: type=CONFIG_CHANGE msg=audit(1520958477.855:100): pid=610 uid=0 auid=0 ses=1 subj=... op=add_rule audit_enabled=2 res=0 type=CONFIG_CHANGE msg=audit(2018-06-14 14:55:04.507:47) : op=set audit_enabled=1 old=1 auid=unset ses=unset subj=... res=yes See: https://github.com/linux-audit/audit-kernel/issues/59 Signed-off-by: Richard Guy Briggs [PM: fixed checkpatch.pl line length problems] Signed-off-by: Paul Moore --- kernel/audit.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/audit.c b/kernel/audit.c index 632d36059556..d412fb4ae6d5 100644 --- a/kernel/audit.c +++ b/kernel/audit.c @@ -399,7 +399,7 @@ static int audit_log_config_change(char *function_name, u32 new, u32 old, ab = audit_log_start(NULL, GFP_KERNEL, AUDIT_CONFIG_CHANGE); if (unlikely(!ab)) return rc; - audit_log_format(ab, "%s=%u old=%u ", function_name, new, old); + audit_log_format(ab, "op=set %s=%u old=%u ", function_name, new, old); audit_log_session_info(ab); rc = audit_log_task_context(ab); if (rc) @@ -1362,7 +1362,10 @@ static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh) return -EINVAL; if (audit_enabled == AUDIT_LOCKED) { audit_log_common_recv_msg(&ab, AUDIT_CONFIG_CHANGE); - audit_log_format(ab, " audit_enabled=%d res=0", audit_enabled); + audit_log_format(ab, " op=%s audit_enabled=%d res=0", + msg_type == AUDIT_ADD_RULE ? + "add_rule" : "remove_rule", + audit_enabled); audit_log_end(ab); return -EPERM; } -- cgit v1.3-7-g2ca7 From 9e36a5d49c3a6fc4a2e0ba2dc11b27c4a8ae6303 Mon Sep 17 00:00:00 2001 From: Richard Guy Briggs Date: Mon, 10 Dec 2018 17:17:50 -0500 Subject: audit: hand taken context to audit_kill_trees for syscall logging Since the context is derived from the task parameter handed to __audit_free(), hand the context to audit_kill_trees() so it can be used to associate with a syscall record. This requires adding the context parameter to kill_rules() rather than using the current audit_context. The callers of trim_marked() and evict_chunk() still have their context. The EOE record was being issued prior to the pruning of the killed_tree list. Move the kill_trees call before the audit_log_exit call in __audit_free() and __audit_syscall_exit() so that any pruned trees CONFIG_CHANGE records are included with the associated syscall event by the user library due to the EOE record flagging the end of the event. See: https://github.com/linux-audit/audit-kernel/issues/50 See: https://github.com/linux-audit/audit-kernel/issues/59 Signed-off-by: Richard Guy Briggs [PM: fixed merge fuzz in kernel/audit_tree.c] Signed-off-by: Paul Moore --- kernel/audit.h | 4 ++-- kernel/audit_tree.c | 19 +++++++++++-------- kernel/auditsc.c | 12 ++++++------ 3 files changed, 19 insertions(+), 16 deletions(-) (limited to 'kernel') diff --git a/kernel/audit.h b/kernel/audit.h index 91421679a168..6ffb70575082 100644 --- a/kernel/audit.h +++ b/kernel/audit.h @@ -314,7 +314,7 @@ extern void audit_trim_trees(void); extern int audit_tag_tree(char *old, char *new); extern const char *audit_tree_path(struct audit_tree *tree); extern void audit_put_tree(struct audit_tree *tree); -extern void audit_kill_trees(struct list_head *list); +extern void audit_kill_trees(struct audit_context *context); #else #define audit_remove_tree_rule(rule) BUG() #define audit_add_tree_rule(rule) -EINVAL @@ -323,7 +323,7 @@ extern void audit_kill_trees(struct list_head *list); #define audit_put_tree(tree) (void)0 #define audit_tag_tree(old, new) -EINVAL #define audit_tree_path(rule) "" /* never called */ -#define audit_kill_trees(list) BUG() +#define audit_kill_trees(context) BUG() #endif extern char *audit_unpack_string(void **bufp, size_t *remain, size_t len); diff --git a/kernel/audit_tree.c b/kernel/audit_tree.c index d4af4d97f847..abfb112f26aa 100644 --- a/kernel/audit_tree.c +++ b/kernel/audit_tree.c @@ -524,13 +524,14 @@ static int tag_chunk(struct inode *inode, struct audit_tree *tree) return 0; } -static void audit_tree_log_remove_rule(struct audit_krule *rule) +static void audit_tree_log_remove_rule(struct audit_context *context, + struct audit_krule *rule) { struct audit_buffer *ab; if (!audit_enabled) return; - ab = audit_log_start(NULL, GFP_KERNEL, AUDIT_CONFIG_CHANGE); + ab = audit_log_start(context, GFP_KERNEL, AUDIT_CONFIG_CHANGE); if (unlikely(!ab)) return; audit_log_format(ab, "op=remove_rule dir="); @@ -540,7 +541,7 @@ static void audit_tree_log_remove_rule(struct audit_krule *rule) audit_log_end(ab); } -static void kill_rules(struct audit_tree *tree) +static void kill_rules(struct audit_context *context, struct audit_tree *tree) { struct audit_krule *rule, *next; struct audit_entry *entry; @@ -551,7 +552,7 @@ static void kill_rules(struct audit_tree *tree) list_del_init(&rule->rlist); if (rule->tree) { /* not a half-baked one */ - audit_tree_log_remove_rule(rule); + audit_tree_log_remove_rule(context, rule); if (entry->rule.exe) audit_remove_mark(entry->rule.exe); rule->tree = NULL; @@ -633,7 +634,7 @@ static void trim_marked(struct audit_tree *tree) tree->goner = 1; spin_unlock(&hash_lock); mutex_lock(&audit_filter_mutex); - kill_rules(tree); + kill_rules(audit_context(), tree); list_del_init(&tree->list); mutex_unlock(&audit_filter_mutex); prune_one(tree); @@ -973,8 +974,10 @@ static void audit_schedule_prune(void) * ... and that one is done if evict_chunk() decides to delay until the end * of syscall. Runs synchronously. */ -void audit_kill_trees(struct list_head *list) +void audit_kill_trees(struct audit_context *context) { + struct list_head *list = &context->killed_trees; + audit_ctl_lock(); mutex_lock(&audit_filter_mutex); @@ -982,7 +985,7 @@ void audit_kill_trees(struct list_head *list) struct audit_tree *victim; victim = list_entry(list->next, struct audit_tree, list); - kill_rules(victim); + kill_rules(context, victim); list_del_init(&victim->list); mutex_unlock(&audit_filter_mutex); @@ -1017,7 +1020,7 @@ static void evict_chunk(struct audit_chunk *chunk) list_del_init(&owner->same_root); spin_unlock(&hash_lock); if (!postponed) { - kill_rules(owner); + kill_rules(audit_context(), owner); list_move(&owner->list, &prune_list); need_prune = 1; } else { diff --git a/kernel/auditsc.c b/kernel/auditsc.c index 6593a5207fb0..b585ceb2f7a2 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -1444,6 +1444,9 @@ void __audit_free(struct task_struct *tsk) if (!context) return; + if (!list_empty(&context->killed_trees)) + audit_kill_trees(context); + /* We are called either by do_exit() or the fork() error handling code; * in the former case tsk == current and in the latter tsk is a * random task_struct that doesn't doesn't have any meaningful data we @@ -1460,9 +1463,6 @@ void __audit_free(struct task_struct *tsk) audit_log_exit(); } - if (!list_empty(&context->killed_trees)) - audit_kill_trees(&context->killed_trees); - audit_set_context(tsk, NULL); audit_free_context(context); } @@ -1537,6 +1537,9 @@ void __audit_syscall_exit(int success, long return_code) if (!context) return; + if (!list_empty(&context->killed_trees)) + audit_kill_trees(context); + if (!context->dummy && context->in_syscall) { if (success) context->return_valid = AUDITSC_SUCCESS; @@ -1571,9 +1574,6 @@ void __audit_syscall_exit(int success, long return_code) context->in_syscall = 0; context->prio = context->state == AUDIT_RECORD_CONTEXT ? ~0ULL : 0; - if (!list_empty(&context->killed_trees)) - audit_kill_trees(&context->killed_trees); - audit_free_names(context); unroll_tree_refs(context, NULL, 0); audit_free_aux(context); -- cgit v1.3-7-g2ca7 From 44133f7eaebec227d1381868d642e3c64023520f Mon Sep 17 00:00:00 2001 From: Mathieu Malaterre Date: Mon, 14 Jan 2019 21:31:54 +0100 Subject: genirq: Annotate implicit fall through There is a plan to build the kernel with -Wimplicit-fallthrough. The fallthrough in __irq_set_trigger() lacks an annotation. Add it. Signed-off-by: Mathieu Malaterre Signed-off-by: Thomas Gleixner Link: https://lkml.kernel.org/r/20190114203154.17125-1-malat@debian.org --- kernel/irq/manage.c | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c index a4888ce4667a..1ca39dc4538d 100644 --- a/kernel/irq/manage.c +++ b/kernel/irq/manage.c @@ -723,6 +723,7 @@ int __irq_set_trigger(struct irq_desc *desc, unsigned long flags) case IRQ_SET_MASK_OK_DONE: irqd_clear(&desc->irq_data, IRQD_TRIGGER_MASK); irqd_set(&desc->irq_data, flags); + /* fall through */ case IRQ_SET_MASK_OK_NOCOPY: flags = irqd_get_trigger_type(&desc->irq_data); -- cgit v1.3-7-g2ca7 From 01cdfa912f1004c463586f52f1dfcbec1274b1f2 Mon Sep 17 00:00:00 2001 From: Mathieu Malaterre Date: Mon, 14 Jan 2019 21:36:33 +0100 Subject: genirq: Correctly annotate implicit fall through There is a plan to build the kernel with -Wimplicit-fallthrough. The fallthrough in __handle_irq_event_percpu() has a fallthrough annotation which is followed by an additional comment and is not recognized by GCC. Separate the 'fall through' and the rest of the comment with a dash so the regular expression used by GCC matches. Signed-off-by: Mathieu Malaterre Signed-off-by: Thomas Gleixner Link: https://lkml.kernel.org/r/20190114203633.18557-1-malat@debian.org --- kernel/irq/handle.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/irq/handle.c b/kernel/irq/handle.c index 38554bc35375..6df5ddfdb0f8 100644 --- a/kernel/irq/handle.c +++ b/kernel/irq/handle.c @@ -166,7 +166,7 @@ irqreturn_t __handle_irq_event_percpu(struct irq_desc *desc, unsigned int *flags __irq_wake_thread(desc, action); - /* Fall through to add to randomness */ + /* Fall through - to add to randomness */ case IRQ_HANDLED: *flags |= action->flags; break; -- cgit v1.3-7-g2ca7 From a4cffdad731447217701d3bc6da4587bbb4d2cbd Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 20 Dec 2018 09:05:25 -0800 Subject: time: Move CONTEXT_TRACKING to kernel/time/Kconfig Both CONTEXT_TRACKING and CONTEXT_TRACKING_FORCE are currently defined in kernel/rcu/kconfig, which might have made sense at some point, but no longer does given that RCU refers to neither of these Kconfig options. Therefore move them to kernel/time/Kconfig, where the rest of the NO_HZ_FULL Kconfig options live. Signed-off-by: Paul E. McKenney Signed-off-by: Thomas Gleixner Cc: Frederic Weisbecker Link: https://lkml.kernel.org/r/20181220170525.GA12579@linux.ibm.com --- kernel/rcu/Kconfig | 30 ------------------------------ kernel/time/Kconfig | 29 +++++++++++++++++++++++++++++ 2 files changed, 29 insertions(+), 30 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig index 939a2056c87a..37301430970e 100644 --- a/kernel/rcu/Kconfig +++ b/kernel/rcu/Kconfig @@ -87,36 +87,6 @@ config RCU_STALL_COMMON config RCU_NEED_SEGCBLIST def_bool ( TREE_RCU || PREEMPT_RCU || TREE_SRCU ) -config CONTEXT_TRACKING - bool - -config CONTEXT_TRACKING_FORCE - bool "Force context tracking" - depends on CONTEXT_TRACKING - default y if !NO_HZ_FULL - help - The major pre-requirement for full dynticks to work is to - support the context tracking subsystem. But there are also - other dependencies to provide in order to make the full - dynticks working. - - This option stands for testing when an arch implements the - context tracking backend but doesn't yet fullfill all the - requirements to make the full dynticks feature working. - Without the full dynticks, there is no way to test the support - for context tracking and the subsystems that rely on it: RCU - userspace extended quiescent state and tickless cputime - accounting. This option copes with the absence of the full - dynticks subsystem by forcing the context tracking on all - CPUs in the system. - - Say Y only if you're working on the development of an - architecture backend for the context tracking. - - Say N otherwise, this option brings an overhead that you - don't want in production. - - config RCU_FANOUT int "Tree-based hierarchical RCU fanout value" range 2 64 if 64BIT diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig index 58b981f4bb5d..e2c038d6c13c 100644 --- a/kernel/time/Kconfig +++ b/kernel/time/Kconfig @@ -117,6 +117,35 @@ config NO_HZ_FULL endchoice +config CONTEXT_TRACKING + bool + +config CONTEXT_TRACKING_FORCE + bool "Force context tracking" + depends on CONTEXT_TRACKING + default y if !NO_HZ_FULL + help + The major pre-requirement for full dynticks to work is to + support the context tracking subsystem. But there are also + other dependencies to provide in order to make the full + dynticks working. + + This option stands for testing when an arch implements the + context tracking backend but doesn't yet fullfill all the + requirements to make the full dynticks feature working. + Without the full dynticks, there is no way to test the support + for context tracking and the subsystems that rely on it: RCU + userspace extended quiescent state and tickless cputime + accounting. This option copes with the absence of the full + dynticks subsystem by forcing the context tracking on all + CPUs in the system. + + Say Y only if you're working on the development of an + architecture backend for the context tracking. + + Say N otherwise, this option brings an overhead that you + don't want in production. + config NO_HZ bool "Old Idle dynticks config" depends on !ARCH_USES_GETTIMEOFFSET && GENERIC_CLOCKEVENTS -- cgit v1.3-7-g2ca7 From bddda606ec76550dd63592e32a6e87e7d32583f7 Mon Sep 17 00:00:00 2001 From: Srinivas Ramana Date: Thu, 20 Dec 2018 19:05:57 +0530 Subject: genirq: Make sure the initial affinity is not empty If all CPUs in the irq_default_affinity mask are offline when an interrupt is initialized then irq_setup_affinity() can set an empty affinity mask for a newly allocated interrupt. Fix this by falling back to cpu_online_mask in case the resulting affinity mask is zero. Signed-off-by: Srinivas Ramana Signed-off-by: Thomas Gleixner Cc: linux-arm-msm@vger.kernel.org Link: https://lkml.kernel.org/r/1545312957-8504-1-git-send-email-sramana@codeaurora.org --- kernel/irq/manage.c | 3 +++ 1 file changed, 3 insertions(+) (limited to 'kernel') diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c index a4888ce4667a..84b54a17b95d 100644 --- a/kernel/irq/manage.c +++ b/kernel/irq/manage.c @@ -393,6 +393,9 @@ int irq_setup_affinity(struct irq_desc *desc) } cpumask_and(&mask, cpu_online_mask, set); + if (cpumask_empty(&mask)) + cpumask_copy(&mask, cpu_online_mask); + if (node != NUMA_NO_NODE) { const struct cpumask *nodemask = cpumask_of_node(node); -- cgit v1.3-7-g2ca7 From 93ad0fc088c5b4631f796c995bdd27a082ef33a6 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Fri, 11 Jan 2019 14:33:16 +0100 Subject: posix-cpu-timers: Unbreak timer rearming The recent commit which prevented a division by 0 issue in the alarm timer code broke posix CPU timers as an unwanted side effect. The reason is that the common rearm code checks for timer->it_interval being 0 now. What went unnoticed is that the posix cpu timer setup does not initialize timer->it_interval as it stores the interval in CPU timer specific storage. The reason for the separate storage is historical as the posix CPU timers always had a 64bit nanoseconds representation internally while timer->it_interval is type ktime_t which used to be a modified timespec representation on 32bit machines. Instead of reverting the offending commit and fixing the alarmtimer issue in the alarmtimer code, store the interval in timer->it_interval at CPU timer setup time so the common code check works. This also repairs the existing inconistency of the posix CPU timer code which kept a single shot timer armed despite of the interval being 0. The separate storage can be removed in mainline, but that needs to be a separate commit as the current one has to be backported to stable kernels. Fixes: 0e334db6bb4b ("posix-timers: Fix division by zero bug") Reported-by: H.J. Lu Signed-off-by: Thomas Gleixner Cc: John Stultz Cc: Peter Zijlstra Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20190111133500.840117406@linutronix.de --- kernel/time/posix-cpu-timers.c | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c index 8f0644af40be..80f955210861 100644 --- a/kernel/time/posix-cpu-timers.c +++ b/kernel/time/posix-cpu-timers.c @@ -685,6 +685,7 @@ static int posix_cpu_timer_set(struct k_itimer *timer, int timer_flags, * set up the signal and overrun bookkeeping. */ timer->it.cpu.incr = timespec64_to_ns(&new->it_interval); + timer->it_interval = ns_to_ktime(timer->it.cpu.incr); /* * This acts as a modification timestamp for the timer, -- cgit v1.3-7-g2ca7 From 16118794ede91aac1a73abe15de22d3de9d2b775 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Fri, 11 Jan 2019 14:33:17 +0100 Subject: posix-cpu-timers: Remove private interval storage Posix CPU timers store the interval in private storage for historical reasons (it_interval used to be a non scalar representation on 32bit systems). This is gone and there is no reason for duplicated storage anymore. Use it_interval everywhere. Signed-off-by: Thomas Gleixner Cc: John Stultz Cc: Peter Zijlstra Cc: "H.J. Lu" Link: https://lkml.kernel.org/r/20190111133500.945255655@linutronix.de --- include/linux/posix-timers.h | 2 +- kernel/time/posix-cpu-timers.c | 13 ++++++------- 2 files changed, 7 insertions(+), 8 deletions(-) (limited to 'kernel') diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h index e96581ca7c9d..b20798fc5191 100644 --- a/include/linux/posix-timers.h +++ b/include/linux/posix-timers.h @@ -12,7 +12,7 @@ struct siginfo; struct cpu_timer_list { struct list_head entry; - u64 expires, incr; + u64 expires; struct task_struct *task; int firing; }; diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c index 80f955210861..0a426f4e3125 100644 --- a/kernel/time/posix-cpu-timers.c +++ b/kernel/time/posix-cpu-timers.c @@ -67,13 +67,13 @@ static void bump_cpu_timer(struct k_itimer *timer, u64 now) int i; u64 delta, incr; - if (timer->it.cpu.incr == 0) + if (!timer->it_interval) return; if (now < timer->it.cpu.expires) return; - incr = timer->it.cpu.incr; + incr = timer->it_interval; delta = now + incr - timer->it.cpu.expires; /* Don't use (incr*2 < delta), incr*2 might overflow. */ @@ -520,7 +520,7 @@ static void cpu_timer_fire(struct k_itimer *timer) */ wake_up_process(timer->it_process); timer->it.cpu.expires = 0; - } else if (timer->it.cpu.incr == 0) { + } else if (!timer->it_interval) { /* * One-shot timer. Clear it as soon as it's fired. */ @@ -606,7 +606,7 @@ static int posix_cpu_timer_set(struct k_itimer *timer, int timer_flags, */ ret = 0; - old_incr = timer->it.cpu.incr; + old_incr = timer->it_interval; old_expires = timer->it.cpu.expires; if (unlikely(timer->it.cpu.firing)) { timer->it.cpu.firing = -1; @@ -684,8 +684,7 @@ static int posix_cpu_timer_set(struct k_itimer *timer, int timer_flags, * Install the new reload setting, and * set up the signal and overrun bookkeeping. */ - timer->it.cpu.incr = timespec64_to_ns(&new->it_interval); - timer->it_interval = ns_to_ktime(timer->it.cpu.incr); + timer->it_interval = timespec64_to_ktime(new->it_interval); /* * This acts as a modification timestamp for the timer, @@ -724,7 +723,7 @@ static void posix_cpu_timer_get(struct k_itimer *timer, struct itimerspec64 *itp /* * Easy part: convert the reload time. */ - itp->it_interval = ns_to_timespec64(timer->it.cpu.incr); + itp->it_interval = ktime_to_timespec64(timer->it_interval); if (!timer->it.cpu.expires) return; -- cgit v1.3-7-g2ca7 From 8b05a3a7503c2a982c9c462eae96cfbd59506783 Mon Sep 17 00:00:00 2001 From: Andrea Righi Date: Fri, 11 Jan 2019 07:01:13 +0100 Subject: tracing/kprobes: Fix NULL pointer dereference in trace_kprobe_create() It is possible to trigger a NULL pointer dereference by writing an incorrectly formatted string to krpobe_events (trying to create a kretprobe omitting the symbol). Example: echo "r:event_1 " >> /sys/kernel/debug/tracing/kprobe_events That triggers this: BUG: unable to handle kernel NULL pointer dereference at 0000000000000000 #PF error: [normal kernel read fault] PGD 0 P4D 0 Oops: 0000 [#1] SMP PTI CPU: 6 PID: 1757 Comm: bash Not tainted 5.0.0-rc1+ #125 Hardware name: Dell Inc. XPS 13 9370/0F6P3V, BIOS 1.5.1 08/09/2018 RIP: 0010:kstrtoull+0x2/0x20 Code: 28 00 00 00 75 17 48 83 c4 18 5b 41 5c 5d c3 b8 ea ff ff ff eb e1 b8 de ff ff ff eb da e8 d6 36 bb ff 66 0f 1f 44 00 00 31 c0 <80> 3f 2b 55 48 89 e5 0f 94 c0 48 01 c7 e8 5c ff ff ff 5d c3 66 2e RSP: 0018:ffffb5d482e57cb8 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffffffff82b12720 RDX: ffffb5d482e57cf8 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffffb5d482e57d70 R08: ffffa0c05e5a7080 R09: ffffa0c05e003980 R10: 0000000000000000 R11: 0000000040000000 R12: ffffa0c04fe87b08 R13: 0000000000000001 R14: 000000000000000b R15: ffffa0c058d749e1 FS: 00007f137c7f7740(0000) GS:ffffa0c05e580000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 0000000497d46004 CR4: 00000000003606e0 Call Trace: ? trace_kprobe_create+0xb6/0x840 ? _cond_resched+0x19/0x40 ? _cond_resched+0x19/0x40 ? __kmalloc+0x62/0x210 ? argv_split+0x8f/0x140 ? trace_kprobe_create+0x840/0x840 ? trace_kprobe_create+0x840/0x840 create_or_delete_trace_kprobe+0x11/0x30 trace_run_command+0x50/0x90 trace_parse_run_command+0xc1/0x160 probes_write+0x10/0x20 __vfs_write+0x3a/0x1b0 ? apparmor_file_permission+0x1a/0x20 ? security_file_permission+0x31/0xf0 ? _cond_resched+0x19/0x40 vfs_write+0xb1/0x1a0 ksys_write+0x55/0xc0 __x64_sys_write+0x1a/0x20 do_syscall_64+0x5a/0x120 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fix by doing the proper argument checks in trace_kprobe_create(). Cc: Ingo Molnar Link: https://lore.kernel.org/lkml/20190111095108.b79a2ee026185cbd62365977@kernel.org Link: http://lkml.kernel.org/r/20190111060113.GA22841@xps-13 Fixes: 6212dd29683e ("tracing/kprobes: Use dyn_event framework for kprobe events") Acked-by: Masami Hiramatsu Signed-off-by: Andrea Righi Signed-off-by: Masami Hiramatsu Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_kprobe.c | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c index 5c19b8c41c7e..d5fb09ebba8b 100644 --- a/kernel/trace/trace_kprobe.c +++ b/kernel/trace/trace_kprobe.c @@ -607,11 +607,17 @@ static int trace_kprobe_create(int argc, const char *argv[]) char buf[MAX_EVENT_NAME_LEN]; unsigned int flags = TPARG_FL_KERNEL; - /* argc must be >= 1 */ - if (argv[0][0] == 'r') { + switch (argv[0][0]) { + case 'r': is_return = true; flags |= TPARG_FL_RETURN; - } else if (argv[0][0] != 'p' || argc < 2) + break; + case 'p': + break; + default: + return -ECANCELED; + } + if (argc < 2) return -ECANCELED; event = strchr(&argv[0][1], ':'); -- cgit v1.3-7-g2ca7 From a811dc61559e0c8003f1086c2a4dc8e4d5ae4cb8 Mon Sep 17 00:00:00 2001 From: Tycho Andersen Date: Sat, 12 Jan 2019 11:24:20 -0700 Subject: seccomp: fix UAF in user-trap code On the failure path, we do an fput() of the listener fd if the filter fails to install (e.g. because of a TSYNC race that's lost, or if the thread is killed, etc.). fput() doesn't actually release the fd, it just ads it to a work queue. Then the thread proceeds to free the filter, even though the listener struct file has a reference to it. To fix this, on the failure path let's set the private data to null, so we know in ->release() to ignore the filter. Reported-by: syzbot+981c26489b2d1c6316ba@syzkaller.appspotmail.com Fixes: 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace") Signed-off-by: Tycho Andersen Acked-by: Kees Cook Signed-off-by: James Morris --- kernel/seccomp.c | 4 ++++ 1 file changed, 4 insertions(+) (limited to 'kernel') diff --git a/kernel/seccomp.c b/kernel/seccomp.c index d7f538847b84..e815781ed751 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -976,6 +976,9 @@ static int seccomp_notify_release(struct inode *inode, struct file *file) struct seccomp_filter *filter = file->private_data; struct seccomp_knotif *knotif; + if (!filter) + return 0; + mutex_lock(&filter->notify_lock); /* @@ -1300,6 +1303,7 @@ out: out_put_fd: if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) { if (ret < 0) { + listener_f->private_data = NULL; fput(listener_f); put_unused_fd(listener); } else { -- cgit v1.3-7-g2ca7 From 227a76b64718888c1687cc237463aa000ae6fb2b Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Mon, 14 Jan 2019 21:14:08 +0100 Subject: swiotlb: clear io_tlb_start and io_tlb_end in swiotlb_exit Otherwise is_swiotlb_buffer will return false positives when we first initialize a swiotlb buffer, but then free it because we have an IOMMU available. Fixes: 55897af63091 ("dma-direct: merge swiotlb_dma_ops into the dma_direct code") Reported-by: Sibren Vasse Signed-off-by: Christoph Hellwig Tested-by: Sibren Vasse Signed-off-by: Konrad Rzeszutek Wilk --- kernel/dma/swiotlb.c | 2 ++ 1 file changed, 2 insertions(+) (limited to 'kernel') diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c index d6361776dc5c..1fb6fd68b9c7 100644 --- a/kernel/dma/swiotlb.c +++ b/kernel/dma/swiotlb.c @@ -378,6 +378,8 @@ void __init swiotlb_exit(void) memblock_free_late(io_tlb_start, PAGE_ALIGN(io_tlb_nslabs << IO_TLB_SHIFT)); } + io_tlb_start = 0; + io_tlb_end = 0; io_tlb_nslabs = 0; max_segment = 0; } -- cgit v1.3-7-g2ca7 From cba82dea30613346cf9a0532a41fc118bc3263af Mon Sep 17 00:00:00 2001 From: Miroslav Benes Date: Tue, 15 Jan 2019 17:45:06 +0100 Subject: livepatch: Send a fake signal periodically An administrator may send a fake signal to all remaining blocking tasks of a running transition by writing to /sys/kernel/livepatch//signal attribute. Let's do it automatically after 15 seconds. The timeout is chosen deliberately. It gives the tasks enough time to transition themselves. Theoretically, sending it once should be more than enough. However, every task must get outside of a patched function to be successfully transitioned. It could prove not to be simple and resending could be helpful in that case. A new workqueue job could be a cleaner solution to achieve it, but it could also introduce deadlocks and cause more headaches with synchronization and cancelling. [jkosina@suse.cz: removed added newline] Signed-off-by: Miroslav Benes Signed-off-by: Jiri Kosina --- Documentation/livepatch/livepatch.txt | 3 ++- kernel/livepatch/transition.c | 16 +++++++++++++--- 2 files changed, 15 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/Documentation/livepatch/livepatch.txt b/Documentation/livepatch/livepatch.txt index 71d7f286ec4d..407e0f03dc99 100644 --- a/Documentation/livepatch/livepatch.txt +++ b/Documentation/livepatch/livepatch.txt @@ -163,7 +163,8 @@ patched state. This may be harmful to the system though. Writing 1 to the attribute sends a fake signal to all remaining blocking tasks. No proper signal is actually delivered (there is no data in signal pending structures). Tasks are interrupted or woken up, and forced to change -their patched state. +their patched state. Despite the sysfs attribute the fake signal is also sent +every 15 seconds automatically. Administrator can also affect a transition through /sys/kernel/livepatch//force attribute. Writing 1 there clears diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c index 300273819674..ea7697bb753e 100644 --- a/kernel/livepatch/transition.c +++ b/kernel/livepatch/transition.c @@ -29,10 +29,14 @@ #define MAX_STACK_ENTRIES 100 #define STACK_ERR_BUF_SIZE 128 +#define SIGNALS_TIMEOUT 15 + struct klp_patch *klp_transition_patch; static int klp_target_state = KLP_UNDEFINED; +static unsigned int klp_signals_cnt; + /* * This work can be performed periodically to finish patching or unpatching any * "straggler" tasks which failed to transition in the first attempt. @@ -393,6 +397,10 @@ void klp_try_complete_transition(void) put_online_cpus(); if (!complete) { + if (klp_signals_cnt && !(klp_signals_cnt % SIGNALS_TIMEOUT)) + klp_send_signals(); + klp_signals_cnt++; + /* * Some tasks weren't able to be switched over. Try again * later and/or wait for other methods like kernel exit @@ -454,6 +462,8 @@ void klp_start_transition(void) if (task->patch_state != klp_target_state) set_tsk_thread_flag(task, TIF_PATCH_PENDING); } + + klp_signals_cnt = 0; } /* @@ -578,14 +588,14 @@ void klp_copy_process(struct task_struct *child) /* * Sends a fake signal to all non-kthread tasks with TIF_PATCH_PENDING set. - * Kthreads with TIF_PATCH_PENDING set are woken up. Only admin can request this - * action currently. + * Kthreads with TIF_PATCH_PENDING set are woken up. */ void klp_send_signals(void) { struct task_struct *g, *task; - pr_notice("signaling remaining tasks\n"); + if (klp_signals_cnt == SIGNALS_TIMEOUT) + pr_notice("signaling remaining tasks\n"); read_lock(&tasklist_lock); for_each_process_thread(g, task) { -- cgit v1.3-7-g2ca7 From 0b3d52790e1cfd6b80b826a245d24859e89632f7 Mon Sep 17 00:00:00 2001 From: Miroslav Benes Date: Tue, 15 Jan 2019 17:45:07 +0100 Subject: livepatch: Remove signal sysfs attribute The fake signal is send automatically now. We can rely on it completely and remove the sysfs attribute. Signed-off-by: Miroslav Benes Signed-off-by: Jiri Kosina --- Documentation/ABI/testing/sysfs-kernel-livepatch | 12 ---- Documentation/livepatch/livepatch.txt | 16 ++--- kernel/livepatch/core.c | 32 --------- kernel/livepatch/transition.c | 82 ++++++++++++------------ kernel/livepatch/transition.h | 1 - 5 files changed, 48 insertions(+), 95 deletions(-) (limited to 'kernel') diff --git a/Documentation/ABI/testing/sysfs-kernel-livepatch b/Documentation/ABI/testing/sysfs-kernel-livepatch index dac7e1e62a8b..85db352f68f9 100644 --- a/Documentation/ABI/testing/sysfs-kernel-livepatch +++ b/Documentation/ABI/testing/sysfs-kernel-livepatch @@ -33,18 +33,6 @@ Description: An attribute which indicates whether the patch is currently in transition. -What: /sys/kernel/livepatch//signal -Date: Nov 2017 -KernelVersion: 4.15.0 -Contact: live-patching@vger.kernel.org -Description: - A writable attribute that allows administrator to affect the - course of an existing transition. Writing 1 sends a fake - signal to all remaining blocking tasks. The fake signal - means that no proper signal is delivered (there is no data in - signal pending structures). Tasks are interrupted or woken up, - and forced to change their patched state. - What: /sys/kernel/livepatch//force Date: Nov 2017 KernelVersion: 4.15.0 diff --git a/Documentation/livepatch/livepatch.txt b/Documentation/livepatch/livepatch.txt index 407e0f03dc99..4627b41ff02e 100644 --- a/Documentation/livepatch/livepatch.txt +++ b/Documentation/livepatch/livepatch.txt @@ -158,13 +158,11 @@ If a patch is in transition, this file shows 0 to indicate the task is unpatched and 1 to indicate it's patched. Otherwise, if no patch is in transition, it shows -1. Any tasks which are blocking the transition can be signaled with SIGSTOP and SIGCONT to force them to change their -patched state. This may be harmful to the system though. -/sys/kernel/livepatch//signal attribute provides a better alternative. -Writing 1 to the attribute sends a fake signal to all remaining blocking -tasks. No proper signal is actually delivered (there is no data in signal -pending structures). Tasks are interrupted or woken up, and forced to change -their patched state. Despite the sysfs attribute the fake signal is also sent -every 15 seconds automatically. +patched state. This may be harmful to the system though. Sending a fake signal +to all remaining blocking tasks is a better alternative. No proper signal is +actually delivered (there is no data in signal pending structures). Tasks are +interrupted or woken up, and forced to change their patched state. The fake +signal is automatically sent every 15 seconds. Administrator can also affect a transition through /sys/kernel/livepatch//force attribute. Writing 1 there clears @@ -412,8 +410,8 @@ Information about the registered patches can be found under /sys/kernel/livepatch. The patches could be enabled and disabled by writing there. -/sys/kernel/livepatch//signal and /sys/kernel/livepatch//force -attributes allow administrator to affect a patching operation. +/sys/kernel/livepatch//force attributes allow administrator to affect a +patching operation. See Documentation/ABI/testing/sysfs-kernel-livepatch for more details. diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c index adca5cf07f7e..fe1993399823 100644 --- a/kernel/livepatch/core.c +++ b/kernel/livepatch/core.c @@ -313,7 +313,6 @@ static int klp_write_object_relocations(struct module *pmod, * /sys/kernel/livepatch/ * /sys/kernel/livepatch//enabled * /sys/kernel/livepatch//transition - * /sys/kernel/livepatch//signal * /sys/kernel/livepatch//force * /sys/kernel/livepatch// * /sys/kernel/livepatch/// @@ -382,35 +381,6 @@ static ssize_t transition_show(struct kobject *kobj, patch == klp_transition_patch); } -static ssize_t signal_store(struct kobject *kobj, struct kobj_attribute *attr, - const char *buf, size_t count) -{ - struct klp_patch *patch; - int ret; - bool val; - - ret = kstrtobool(buf, &val); - if (ret) - return ret; - - if (!val) - return count; - - mutex_lock(&klp_mutex); - - patch = container_of(kobj, struct klp_patch, kobj); - if (patch != klp_transition_patch) { - mutex_unlock(&klp_mutex); - return -EINVAL; - } - - klp_send_signals(); - - mutex_unlock(&klp_mutex); - - return count; -} - static ssize_t force_store(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t count) { @@ -442,12 +412,10 @@ static ssize_t force_store(struct kobject *kobj, struct kobj_attribute *attr, static struct kobj_attribute enabled_kobj_attr = __ATTR_RW(enabled); static struct kobj_attribute transition_kobj_attr = __ATTR_RO(transition); -static struct kobj_attribute signal_kobj_attr = __ATTR_WO(signal); static struct kobj_attribute force_kobj_attr = __ATTR_WO(force); static struct attribute *klp_patch_attrs[] = { &enabled_kobj_attr.attr, &transition_kobj_attr.attr, - &signal_kobj_attr.attr, &force_kobj_attr.attr, NULL }; diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c index ea7697bb753e..183b2086ba03 100644 --- a/kernel/livepatch/transition.c +++ b/kernel/livepatch/transition.c @@ -347,6 +347,47 @@ done: } +/* + * Sends a fake signal to all non-kthread tasks with TIF_PATCH_PENDING set. + * Kthreads with TIF_PATCH_PENDING set are woken up. + */ +static void klp_send_signals(void) +{ + struct task_struct *g, *task; + + if (klp_signals_cnt == SIGNALS_TIMEOUT) + pr_notice("signaling remaining tasks\n"); + + read_lock(&tasklist_lock); + for_each_process_thread(g, task) { + if (!klp_patch_pending(task)) + continue; + + /* + * There is a small race here. We could see TIF_PATCH_PENDING + * set and decide to wake up a kthread or send a fake signal. + * Meanwhile the task could migrate itself and the action + * would be meaningless. It is not serious though. + */ + if (task->flags & PF_KTHREAD) { + /* + * Wake up a kthread which sleeps interruptedly and + * still has not been migrated. + */ + wake_up_state(task, TASK_INTERRUPTIBLE); + } else { + /* + * Send fake signal to all non-kthread tasks which are + * still not migrated. + */ + spin_lock_irq(&task->sighand->siglock); + signal_wake_up(task, 0); + spin_unlock_irq(&task->sighand->siglock); + } + } + read_unlock(&tasklist_lock); +} + /* * Try to switch all remaining tasks to the target patch state by walking the * stacks of sleeping tasks and looking for any to-be-patched or @@ -586,47 +627,6 @@ void klp_copy_process(struct task_struct *child) /* TIF_PATCH_PENDING gets copied in setup_thread_stack() */ } -/* - * Sends a fake signal to all non-kthread tasks with TIF_PATCH_PENDING set. - * Kthreads with TIF_PATCH_PENDING set are woken up. - */ -void klp_send_signals(void) -{ - struct task_struct *g, *task; - - if (klp_signals_cnt == SIGNALS_TIMEOUT) - pr_notice("signaling remaining tasks\n"); - - read_lock(&tasklist_lock); - for_each_process_thread(g, task) { - if (!klp_patch_pending(task)) - continue; - - /* - * There is a small race here. We could see TIF_PATCH_PENDING - * set and decide to wake up a kthread or send a fake signal. - * Meanwhile the task could migrate itself and the action - * would be meaningless. It is not serious though. - */ - if (task->flags & PF_KTHREAD) { - /* - * Wake up a kthread which sleeps interruptedly and - * still has not been migrated. - */ - wake_up_state(task, TASK_INTERRUPTIBLE); - } else { - /* - * Send fake signal to all non-kthread tasks which are - * still not migrated. - */ - spin_lock_irq(&task->sighand->siglock); - signal_wake_up(task, 0); - spin_unlock_irq(&task->sighand->siglock); - } - } - read_unlock(&tasklist_lock); -} - /* * Drop TIF_PATCH_PENDING of all tasks on admin's request. This forces an * existing transition to finish. diff --git a/kernel/livepatch/transition.h b/kernel/livepatch/transition.h index f9d0bc016067..322db16233de 100644 --- a/kernel/livepatch/transition.h +++ b/kernel/livepatch/transition.h @@ -11,7 +11,6 @@ void klp_cancel_transition(void); void klp_start_transition(void); void klp_try_complete_transition(void); void klp_reverse_transition(void); -void klp_send_signals(void); void klp_force_transition(void); #endif /* _LIVEPATCH_TRANSITION_H */ -- cgit v1.3-7-g2ca7 From b1e8818cabf407a5a2cec696411b0bdfd7fd12f0 Mon Sep 17 00:00:00 2001 From: Yonghong Song Date: Tue, 15 Jan 2019 17:07:47 -0800 Subject: bpf: btf: support 128 bit integer type Currently, btf only supports up to 64-bit integer. On the other hand, 128bit support for gcc and clang has existed for a long time. For example, both gcc 4.8 and llvm 3.7 supports types "__int128" and "unsigned __int128" for virtually all 64bit architectures including bpf. The requirement for __int128 support comes from two areas: . bpf program may use __int128. For example, some bcc tools (https://github.com/iovisor/bcc/tree/master/tools), mostly tcp v6 related, tcpstates.py, tcpaccept.py, etc., are using __int128 to represent the ipv6 addresses. . linux itself is using __int128 types. Hence supporting __int128 type in BTF is required for vmlinux BTF, which will be used by "compile once and run everywhere" and other projects. For 128bit integer, instead of base-10, hex numbers are pretty printed out as large decimal number is hard to decipher, e.g., for ipv6 addresses. Acked-by: Martin KaFai Lau Signed-off-by: Yonghong Song Signed-off-by: Daniel Borkmann --- kernel/bpf/btf.c | 104 +++++++++++++++++++++++++++++++++++++++++++++---------- 1 file changed, 85 insertions(+), 19 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index a2f53642592b..022ef9ca1296 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -157,7 +157,7 @@ * */ -#define BITS_PER_U64 (sizeof(u64) * BITS_PER_BYTE) +#define BITS_PER_U128 (sizeof(u64) * BITS_PER_BYTE * 2) #define BITS_PER_BYTE_MASK (BITS_PER_BYTE - 1) #define BITS_PER_BYTE_MASKED(bits) ((bits) & BITS_PER_BYTE_MASK) #define BITS_ROUNDDOWN_BYTES(bits) ((bits) >> 3) @@ -525,7 +525,7 @@ const struct btf_type *btf_type_by_id(const struct btf *btf, u32 type_id) /* * Regular int is not a bit field and it must be either - * u8/u16/u32/u64. + * u8/u16/u32/u64 or __int128. */ static bool btf_type_int_is_regular(const struct btf_type *t) { @@ -538,7 +538,8 @@ static bool btf_type_int_is_regular(const struct btf_type *t) if (BITS_PER_BYTE_MASKED(nr_bits) || BTF_INT_OFFSET(int_data) || (nr_bytes != sizeof(u8) && nr_bytes != sizeof(u16) && - nr_bytes != sizeof(u32) && nr_bytes != sizeof(u64))) { + nr_bytes != sizeof(u32) && nr_bytes != sizeof(u64) && + nr_bytes != (2 * sizeof(u64)))) { return false; } @@ -1063,9 +1064,9 @@ static int btf_int_check_member(struct btf_verifier_env *env, nr_copy_bits = BTF_INT_BITS(int_data) + BITS_PER_BYTE_MASKED(struct_bits_off); - if (nr_copy_bits > BITS_PER_U64) { + if (nr_copy_bits > BITS_PER_U128) { btf_verifier_log_member(env, struct_type, member, - "nr_copy_bits exceeds 64"); + "nr_copy_bits exceeds 128"); return -EINVAL; } @@ -1119,9 +1120,9 @@ static int btf_int_check_kflag_member(struct btf_verifier_env *env, bytes_offset = BITS_ROUNDDOWN_BYTES(struct_bits_off); nr_copy_bits = nr_bits + BITS_PER_BYTE_MASKED(struct_bits_off); - if (nr_copy_bits > BITS_PER_U64) { + if (nr_copy_bits > BITS_PER_U128) { btf_verifier_log_member(env, struct_type, member, - "nr_copy_bits exceeds 64"); + "nr_copy_bits exceeds 128"); return -EINVAL; } @@ -1168,9 +1169,9 @@ static s32 btf_int_check_meta(struct btf_verifier_env *env, nr_bits = BTF_INT_BITS(int_data) + BTF_INT_OFFSET(int_data); - if (nr_bits > BITS_PER_U64) { + if (nr_bits > BITS_PER_U128) { btf_verifier_log_type(env, t, "nr_bits exceeds %zu", - BITS_PER_U64); + BITS_PER_U128); return -EINVAL; } @@ -1211,31 +1212,93 @@ static void btf_int_log(struct btf_verifier_env *env, btf_int_encoding_str(BTF_INT_ENCODING(int_data))); } +static void btf_int128_print(struct seq_file *m, void *data) +{ + /* data points to a __int128 number. + * Suppose + * int128_num = *(__int128 *)data; + * The below formulas shows what upper_num and lower_num represents: + * upper_num = int128_num >> 64; + * lower_num = int128_num & 0xffffffffFFFFFFFFULL; + */ + u64 upper_num, lower_num; + +#ifdef __BIG_ENDIAN_BITFIELD + upper_num = *(u64 *)data; + lower_num = *(u64 *)(data + 8); +#else + upper_num = *(u64 *)(data + 8); + lower_num = *(u64 *)data; +#endif + if (upper_num == 0) + seq_printf(m, "0x%llx", lower_num); + else + seq_printf(m, "0x%llx%016llx", upper_num, lower_num); +} + +static void btf_int128_shift(u64 *print_num, u16 left_shift_bits, + u16 right_shift_bits) +{ + u64 upper_num, lower_num; + +#ifdef __BIG_ENDIAN_BITFIELD + upper_num = print_num[0]; + lower_num = print_num[1]; +#else + upper_num = print_num[1]; + lower_num = print_num[0]; +#endif + + /* shake out un-needed bits by shift/or operations */ + if (left_shift_bits >= 64) { + upper_num = lower_num << (left_shift_bits - 64); + lower_num = 0; + } else { + upper_num = (upper_num << left_shift_bits) | + (lower_num >> (64 - left_shift_bits)); + lower_num = lower_num << left_shift_bits; + } + + if (right_shift_bits >= 64) { + lower_num = upper_num >> (right_shift_bits - 64); + upper_num = 0; + } else { + lower_num = (lower_num >> right_shift_bits) | + (upper_num << (64 - right_shift_bits)); + upper_num = upper_num >> right_shift_bits; + } + +#ifdef __BIG_ENDIAN_BITFIELD + print_num[0] = upper_num; + print_num[1] = lower_num; +#else + print_num[0] = lower_num; + print_num[1] = upper_num; +#endif +} + static void btf_bitfield_seq_show(void *data, u8 bits_offset, u8 nr_bits, struct seq_file *m) { u16 left_shift_bits, right_shift_bits; u8 nr_copy_bytes; u8 nr_copy_bits; - u64 print_num; + u64 print_num[2] = {}; nr_copy_bits = nr_bits + bits_offset; nr_copy_bytes = BITS_ROUNDUP_BYTES(nr_copy_bits); - print_num = 0; - memcpy(&print_num, data, nr_copy_bytes); + memcpy(print_num, data, nr_copy_bytes); #ifdef __BIG_ENDIAN_BITFIELD left_shift_bits = bits_offset; #else - left_shift_bits = BITS_PER_U64 - nr_copy_bits; + left_shift_bits = BITS_PER_U128 - nr_copy_bits; #endif - right_shift_bits = BITS_PER_U64 - nr_bits; - - print_num <<= left_shift_bits; - print_num >>= right_shift_bits; + right_shift_bits = BITS_PER_U128 - nr_bits; - seq_printf(m, "0x%llx", print_num); + btf_int128_shift(print_num, left_shift_bits, right_shift_bits); + btf_int128_print(m, print_num); } @@ -1250,7 +1313,7 @@ static void btf_int_bits_seq_show(const struct btf *btf, /* * bits_offset is at most 7. - * BTF_INT_OFFSET() cannot exceed 64 bits. + * BTF_INT_OFFSET() cannot exceed 128 bits. */ total_bits_offset = bits_offset + BTF_INT_OFFSET(int_data); data += BITS_ROUNDDOWN_BYTES(total_bits_offset); @@ -1274,6 +1337,9 @@ static void btf_int_seq_show(const struct btf *btf, const struct btf_type *t, } switch (nr_bits) { + case 128: + btf_int128_print(m, data); + break; case 64: if (sign) seq_printf(m, "%lld", *(s64 *)data); -- cgit v1.3-7-g2ca7 From d0b2818efbe27e6c2e0c52621c8db18eb5abb5e1 Mon Sep 17 00:00:00 2001 From: Peter Oskolkov Date: Wed, 16 Jan 2019 10:43:01 -0800 Subject: bpf: fix a (false) compiler warning An older GCC compiler complains: kernel/bpf/verifier.c: In function 'bpf_check': kernel/bpf/verifier.c:4***:13: error: 'prev_offset' may be used uninitialized in this function [-Werror=maybe-uninitialized] } else if (krecord[i].insn_offset <= prev_offset) { ^ kernel/bpf/verifier.c:4***:38: note: 'prev_offset' was declared here u32 i, nfuncs, urec_size, min_size, prev_offset; Although the compiler is wrong here, the patch makes sure that prev_offset is always initialized, just to silence the warning. v2: fix a spelling error in the commit message. Signed-off-by: Peter Oskolkov Acked-by: Martin KaFai Lau Signed-off-by: Daniel Borkmann --- kernel/bpf/verifier.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 56674a7c3778..ce87198ecd01 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -4997,13 +4997,14 @@ static int check_btf_func(struct bpf_verifier_env *env, const union bpf_attr *attr, union bpf_attr __user *uattr) { - u32 i, nfuncs, urec_size, min_size, prev_offset; + u32 i, nfuncs, urec_size, min_size; u32 krec_size = sizeof(struct bpf_func_info); struct bpf_func_info *krecord; const struct btf_type *type; struct bpf_prog *prog; const struct btf *btf; void __user *urecord; + u32 prev_offset = 0; int ret = 0; nfuncs = attr->func_info_cnt; -- cgit v1.3-7-g2ca7 From ea6eb5e7d15e1838de335609994b4546e2abcaaf Mon Sep 17 00:00:00 2001 From: Andreas Ziegler Date: Thu, 17 Jan 2019 14:30:23 +0100 Subject: tracing: uprobes: Fix typo in pr_fmt string The subsystem-specific message prefix for uprobes was also "trace_kprobe: " instead of "trace_uprobe: " as described in the original commit message. Link: http://lkml.kernel.org/r/20190117133023.19292-1-andreas.ziegler@fau.de Cc: Ingo Molnar Cc: stable@vger.kernel.org Acked-by: Masami Hiramatsu Fixes: 7257634135c24 ("tracing/probe: Show subsystem name in messages") Signed-off-by: Andreas Ziegler Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_uprobe.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c index e335576b9411..19a1a8e19062 100644 --- a/kernel/trace/trace_uprobe.c +++ b/kernel/trace/trace_uprobe.c @@ -5,7 +5,7 @@ * Copyright (C) IBM Corporation, 2010-2012 * Author: Srikar Dronamraju */ -#define pr_fmt(fmt) "trace_kprobe: " fmt +#define pr_fmt(fmt) "trace_uprobe: " fmt #include #include -- cgit v1.3-7-g2ca7 From 0b698005a9d11c0e91141ec11a2c4918a129f703 Mon Sep 17 00:00:00 2001 From: Stanislav Fomichev Date: Wed, 16 Jan 2019 14:03:15 -0800 Subject: bpf: don't assume build-id length is always 20 bytes Build-id length is not fixed to 20, it can be (`man ld` /--build-id): * 128-bit (uuid) * 160-bit (sha1) * any length specified in ld --build-id=0xhexstring To fix the issue of missing BPF_STACK_BUILD_ID_VALID for shorter build-ids, assume that build-id is somewhere in the range of 1 .. 20. Set the remaining bytes to zero. v2: * don't introduce new "len = min(BPF_BUILD_ID_SIZE, nhdr->n_descsz)", we already know that nhdr->n_descsz <= BPF_BUILD_ID_SIZE if we enter this 'if' condition Fixes: 615755a77b24 ("bpf: extend stackmap to save binary_build_id+offset instead of address") Acked-by: Song Liu Signed-off-by: Stanislav Fomichev Signed-off-by: Daniel Borkmann --- kernel/bpf/stackmap.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c index d9e2483669d0..f9df545e92f6 100644 --- a/kernel/bpf/stackmap.c +++ b/kernel/bpf/stackmap.c @@ -180,11 +180,14 @@ static inline int stack_map_parse_build_id(void *page_addr, if (nhdr->n_type == BPF_BUILD_ID && nhdr->n_namesz == sizeof("GNU") && - nhdr->n_descsz == BPF_BUILD_ID_SIZE) { + nhdr->n_descsz > 0 && + nhdr->n_descsz <= BPF_BUILD_ID_SIZE) { memcpy(build_id, note_start + note_offs + ALIGN(sizeof("GNU"), 4) + sizeof(Elf32_Nhdr), - BPF_BUILD_ID_SIZE); + nhdr->n_descsz); + memset(build_id + nhdr->n_descsz, 0, + BPF_BUILD_ID_SIZE - nhdr->n_descsz); return 0; } new_offs = note_offs + sizeof(Elf32_Nhdr) + -- cgit v1.3-7-g2ca7 From 4af396ae4836c4ecab61e975b8e61270c551894d Mon Sep 17 00:00:00 2001 From: Stanislav Fomichev Date: Wed, 16 Jan 2019 14:03:16 -0800 Subject: bpf: zero out build_id for BPF_STACK_BUILD_ID_IP When returning BPF_STACK_BUILD_ID_IP from stack_map_get_build_id_offset, make sure that build_id field is empty. Since we are using percpu free list, there is a possibility that we might reuse some previous bpf_stack_build_id with non-zero build_id. Fixes: 615755a77b24 ("bpf: extend stackmap to save binary_build_id+offset instead of address") Acked-by: Song Liu Signed-off-by: Stanislav Fomichev Signed-off-by: Daniel Borkmann --- kernel/bpf/stackmap.c | 2 ++ 1 file changed, 2 insertions(+) (limited to 'kernel') diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c index f9df545e92f6..d43b14535827 100644 --- a/kernel/bpf/stackmap.c +++ b/kernel/bpf/stackmap.c @@ -314,6 +314,7 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs, for (i = 0; i < trace_nr; i++) { id_offs[i].status = BPF_STACK_BUILD_ID_IP; id_offs[i].ip = ips[i]; + memset(id_offs[i].build_id, 0, BPF_BUILD_ID_SIZE); } return; } @@ -324,6 +325,7 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs, /* per entry fall back to ips */ id_offs[i].status = BPF_STACK_BUILD_ID_IP; id_offs[i].ip = ips[i]; + memset(id_offs[i].build_id, 0, BPF_BUILD_ID_SIZE); continue; } id_offs[i].offset = (vma->vm_pgoff << PAGE_SHIFT) + ips[i] -- cgit v1.3-7-g2ca7 From 583c53185399cea5c51195064564d1c9ddc70cf3 Mon Sep 17 00:00:00 2001 From: Mathieu Malaterre Date: Wed, 16 Jan 2019 20:29:40 +0100 Subject: bpf: Make function btf_name_offset_valid static Initially in commit 69b693f0aefa ("bpf: btf: Introduce BPF Type Format (BTF)") the function 'btf_name_offset_valid' was introduced as static function it was later on changed to a non-static one, and then finally in commit 23127b33ec80 ("bpf: Create a new btf_name_by_offset() for non type name use case") the function prototype was removed. Revert back to original implementation and make the function static. Remove warning triggered with W=1: kernel/bpf/btf.c:470:6: warning: no previous prototype for 'btf_name_offset_valid' [-Wmissing-prototypes] Fixes: 23127b33ec80 ("bpf: Create a new btf_name_by_offset() for non type name use case") Signed-off-by: Mathieu Malaterre Acked-by: Martin KaFai Lau Signed-off-by: Daniel Borkmann --- kernel/bpf/btf.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index a2f53642592b..befe570be5ba 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -467,7 +467,7 @@ static const struct btf_kind_operations *btf_type_ops(const struct btf_type *t) return kind_ops[BTF_INFO_KIND(t->info)]; } -bool btf_name_offset_valid(const struct btf *btf, u32 offset) +static bool btf_name_offset_valid(const struct btf *btf, u32 offset) { return BTF_STR_OFFSET_VALID(offset) && offset < btf->hdr.str_len; -- cgit v1.3-7-g2ca7 From c8dc79806e7f6cb6b0952aae1ce626c39905ad7e Mon Sep 17 00:00:00 2001 From: Mathieu Malaterre Date: Wed, 16 Jan 2019 20:35:29 +0100 Subject: bpf: Annotate implicit fall through in cgroup_dev_func_proto There is a plan to build the kernel with -Wimplicit-fallthrough and this place in the code produced a warnings (W=1). This commit removes the following warning: kernel/bpf/cgroup.c:719:6: warning: this statement may fall through [-Wimplicit-fallthrough=] Signed-off-by: Mathieu Malaterre Signed-off-by: Daniel Borkmann --- kernel/bpf/cgroup.c | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c index 9425c2fb872f..ab612fe9862f 100644 --- a/kernel/bpf/cgroup.c +++ b/kernel/bpf/cgroup.c @@ -718,6 +718,7 @@ cgroup_dev_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) case BPF_FUNC_trace_printk: if (capable(CAP_SYS_ADMIN)) return bpf_get_trace_printk_proto(); + /* fall through */ default: return NULL; } -- cgit v1.3-7-g2ca7 From 0722069a5374b904ec1a67f91249f90e1cfae259 Mon Sep 17 00:00:00 2001 From: Andreas Ziegler Date: Wed, 16 Jan 2019 15:16:29 +0100 Subject: tracing/uprobes: Fix output for multiple string arguments When printing multiple uprobe arguments as strings the output for the earlier arguments would also include all later string arguments. This is best explained in an example: Consider adding a uprobe to a function receiving two strings as parameters which is at offset 0xa0 in strlib.so and we want to print both parameters when the uprobe is hit (on x86_64): $ echo 'p:func /lib/strlib.so:0xa0 +0(%di):string +0(%si):string' > \ /sys/kernel/debug/tracing/uprobe_events When the function is called as func("foo", "bar") and we hit the probe, the trace file shows a line like the following: [...] func: (0x7f7e683706a0) arg1="foobar" arg2="bar" Note the extra "bar" printed as part of arg1. This behaviour stacks up for additional string arguments. The strings are stored in a dynamically growing part of the uprobe buffer by fetch_store_string() after copying them from userspace via strncpy_from_user(). The return value of strncpy_from_user() is then directly used as the required size for the string. However, this does not take the terminating null byte into account as the documentation for strncpy_from_user() cleary states that it "[...] returns the length of the string (not including the trailing NUL)" even though the null byte will be copied to the destination. Therefore, subsequent calls to fetch_store_string() will overwrite the terminating null byte of the most recently fetched string with the first character of the current string, leading to the "accumulation" of strings in earlier arguments in the output. Fix this by incrementing the return value of strncpy_from_user() by one if we did not hit the maximum buffer size. Link: http://lkml.kernel.org/r/20190116141629.5752-1-andreas.ziegler@fau.de Cc: Ingo Molnar Cc: stable@vger.kernel.org Fixes: 5baaa59ef09e ("tracing/probes: Implement 'memory' fetch method for uprobes") Acked-by: Masami Hiramatsu Signed-off-by: Andreas Ziegler Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_uprobe.c | 7 +++++++ 1 file changed, 7 insertions(+) (limited to 'kernel') diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c index 19a1a8e19062..9bde07c06362 100644 --- a/kernel/trace/trace_uprobe.c +++ b/kernel/trace/trace_uprobe.c @@ -160,6 +160,13 @@ fetch_store_string(unsigned long addr, void *dest, void *base) if (ret >= 0) { if (ret == maxlen) dst[ret - 1] = '\0'; + else + /* + * Include the terminating null byte. In this case it + * was copied by strncpy_from_user but not accounted + * for in ret. + */ + ret++; *(u32 *)dest = make_data_loc(ret, (void *)dst - base); } -- cgit v1.3-7-g2ca7 From 399504e21a10be16dd1408ba0147367d9d82a10c Mon Sep 17 00:00:00 2001 From: Al Viro Date: Sun, 6 Jan 2019 11:41:29 -0500 Subject: fix cgroup_do_mount() handling of failure exits same story as with last May fixes in sysfs (7b745a4e4051 "unfuck sysfs_mount()"); new_sb is left uninitialized in case of early errors in kernfs_mount_ns() and papering over it by treating any error from kernfs_mount_ns() as equivalent to !new_ns ends up conflating the cases when objects had never been transferred to a superblock with ones when that has happened and resulting new superblock had been dropped. Easily fixed (same way as in sysfs case). Additionally, there's a superblock leak on kernfs_node_dentry() failure *and* a dentry leak inside kernfs_node_dentry() itself - the latter on probably impossible errors, but the former not impossible to trigger (as the matter of fact, injecting allocation failures at that point *does* trigger it). Cc: stable@kernel.org Signed-off-by: Al Viro --- fs/kernfs/mount.c | 8 ++++++-- kernel/cgroup/cgroup.c | 9 ++++++--- 2 files changed, 12 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c index fdf527b6d79c..d71c9405874a 100644 --- a/fs/kernfs/mount.c +++ b/fs/kernfs/mount.c @@ -196,8 +196,10 @@ struct dentry *kernfs_node_dentry(struct kernfs_node *kn, return dentry; knparent = find_next_ancestor(kn, NULL); - if (WARN_ON(!knparent)) + if (WARN_ON(!knparent)) { + dput(dentry); return ERR_PTR(-EINVAL); + } do { struct dentry *dtmp; @@ -206,8 +208,10 @@ struct dentry *kernfs_node_dentry(struct kernfs_node *kn, if (kn == knparent) return dentry; kntmp = find_next_ancestor(kn, knparent); - if (WARN_ON(!kntmp)) + if (WARN_ON(!kntmp)) { + dput(dentry); return ERR_PTR(-EINVAL); + } dtmp = lookup_one_len_unlocked(kntmp->name, dentry, strlen(kntmp->name)); dput(dentry); diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index f31bd61c9466..503bba3c4bae 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -2033,7 +2033,7 @@ struct dentry *cgroup_do_mount(struct file_system_type *fs_type, int flags, struct cgroup_namespace *ns) { struct dentry *dentry; - bool new_sb; + bool new_sb = false; dentry = kernfs_mount(fs_type, flags, root->kf_root, magic, &new_sb); @@ -2043,6 +2043,7 @@ struct dentry *cgroup_do_mount(struct file_system_type *fs_type, int flags, */ if (!IS_ERR(dentry) && ns != &init_cgroup_ns) { struct dentry *nsdentry; + struct super_block *sb = dentry->d_sb; struct cgroup *cgrp; mutex_lock(&cgroup_mutex); @@ -2053,12 +2054,14 @@ struct dentry *cgroup_do_mount(struct file_system_type *fs_type, int flags, spin_unlock_irq(&css_set_lock); mutex_unlock(&cgroup_mutex); - nsdentry = kernfs_node_dentry(cgrp->kn, dentry->d_sb); + nsdentry = kernfs_node_dentry(cgrp->kn, sb); dput(dentry); + if (IS_ERR(nsdentry)) + deactivate_locked_super(sb); dentry = nsdentry; } - if (IS_ERR(dentry) || !new_sb) + if (!new_sb) cgroup_put(&root->cgrp); return dentry; -- cgit v1.3-7-g2ca7 From 35ac1184244f1329783e1d897f74926d8bb1103a Mon Sep 17 00:00:00 2001 From: Al Viro Date: Sat, 12 Jan 2019 00:20:54 -0500 Subject: cgroup: saner refcounting for cgroup_root * make the reference from superblock to cgroup_root counting - do cgroup_put() in cgroup_kill_sb() whether we'd done percpu_ref_kill() or not; matching grab is done when we allocate a new root. That gives the same refcounting rules for all callers of cgroup_do_mount() - a reference to cgroup_root has been grabbed by caller and it either is transferred to new superblock or dropped. * have cgroup_kill_sb() treat an already killed refcount as "just don't bother killing it, then". * after successful cgroup_do_mount() have cgroup1_mount() recheck if we'd raced with mount/umount from somebody else and cgroup_root got killed. In that case we drop the superblock and bugger off with -ERESTARTSYS, same as if we'd found it in the list already dying. * don't bother with delayed initialization of refcount - it's unreliable and not needed. No need to prevent attempts to bump the refcount if we find cgroup_root of another mount in progress - sget will reuse an existing superblock just fine and if the other sb manages to die before we get there, we'll catch that immediately after cgroup_do_mount(). * don't bother with kernfs_pin_sb() - no need for doing that either. Signed-off-by: Al Viro --- kernel/cgroup/cgroup-internal.h | 2 +- kernel/cgroup/cgroup-v1.c | 58 +++++++++-------------------------------- kernel/cgroup/cgroup.c | 16 +++++------- 3 files changed, 21 insertions(+), 55 deletions(-) (limited to 'kernel') diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h index c950864016e2..c9a35f09e4b9 100644 --- a/kernel/cgroup/cgroup-internal.h +++ b/kernel/cgroup/cgroup-internal.h @@ -198,7 +198,7 @@ int cgroup_path_ns_locked(struct cgroup *cgrp, char *buf, size_t buflen, void cgroup_free_root(struct cgroup_root *root); void init_cgroup_root(struct cgroup_root *root, struct cgroup_sb_opts *opts); -int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask, int ref_flags); +int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask); int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask); struct dentry *cgroup_do_mount(struct file_system_type *fs_type, int flags, struct cgroup_root *root, unsigned long magic, diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c index 583b969b0c0e..f94a7229974e 100644 --- a/kernel/cgroup/cgroup-v1.c +++ b/kernel/cgroup/cgroup-v1.c @@ -1116,13 +1116,11 @@ struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags, void *data, unsigned long magic, struct cgroup_namespace *ns) { - struct super_block *pinned_sb = NULL; struct cgroup_sb_opts opts; struct cgroup_root *root; struct cgroup_subsys *ss; struct dentry *dentry; int i, ret; - bool new_root = false; cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp); @@ -1184,29 +1182,6 @@ struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags, if (root->flags ^ opts.flags) pr_warn("new mount options do not match the existing superblock, will be ignored\n"); - /* - * We want to reuse @root whose lifetime is governed by its - * ->cgrp. Let's check whether @root is alive and keep it - * that way. As cgroup_kill_sb() can happen anytime, we - * want to block it by pinning the sb so that @root doesn't - * get killed before mount is complete. - * - * With the sb pinned, tryget_live can reliably indicate - * whether @root can be reused. If it's being killed, - * drain it. We can use wait_queue for the wait but this - * path is super cold. Let's just sleep a bit and retry. - */ - pinned_sb = kernfs_pin_sb(root->kf_root, NULL); - if (IS_ERR(pinned_sb) || - !percpu_ref_tryget_live(&root->cgrp.self.refcnt)) { - mutex_unlock(&cgroup_mutex); - if (!IS_ERR_OR_NULL(pinned_sb)) - deactivate_super(pinned_sb); - msleep(10); - ret = restart_syscall(); - goto out_free; - } - ret = 0; goto out_unlock; } @@ -1232,15 +1207,20 @@ struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags, ret = -ENOMEM; goto out_unlock; } - new_root = true; init_cgroup_root(root, &opts); - ret = cgroup_setup_root(root, opts.subsys_mask, PERCPU_REF_INIT_DEAD); + ret = cgroup_setup_root(root, opts.subsys_mask); if (ret) cgroup_free_root(root); out_unlock: + if (!ret && !percpu_ref_tryget_live(&root->cgrp.self.refcnt)) { + mutex_unlock(&cgroup_mutex); + msleep(10); + ret = restart_syscall(); + goto out_free; + } mutex_unlock(&cgroup_mutex); out_free: kfree(opts.release_agent); @@ -1252,25 +1232,13 @@ out_free: dentry = cgroup_do_mount(&cgroup_fs_type, flags, root, CGROUP_SUPER_MAGIC, ns); - /* - * There's a race window after we release cgroup_mutex and before - * allocating a superblock. Make sure a concurrent process won't - * be able to re-use the root during this window by delaying the - * initialization of root refcnt. - */ - if (new_root) { - mutex_lock(&cgroup_mutex); - percpu_ref_reinit(&root->cgrp.self.refcnt); - mutex_unlock(&cgroup_mutex); + if (!IS_ERR(dentry) && percpu_ref_is_dying(&root->cgrp.self.refcnt)) { + struct super_block *sb = dentry->d_sb; + dput(dentry); + deactivate_locked_super(sb); + msleep(10); + dentry = ERR_PTR(restart_syscall()); } - - /* - * If @pinned_sb, we're reusing an existing root and holding an - * extra ref on its sb. Mount is complete. Put the extra ref. - */ - if (pinned_sb) - deactivate_super(pinned_sb); - return dentry; } diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 503bba3c4bae..7fd9f22e406d 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -1927,7 +1927,7 @@ void init_cgroup_root(struct cgroup_root *root, struct cgroup_sb_opts *opts) set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags); } -int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask, int ref_flags) +int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask) { LIST_HEAD(tmp_links); struct cgroup *root_cgrp = &root->cgrp; @@ -1944,7 +1944,7 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask, int ref_flags) root_cgrp->ancestor_ids[0] = ret; ret = percpu_ref_init(&root_cgrp->self.refcnt, css_release, - ref_flags, GFP_KERNEL); + 0, GFP_KERNEL); if (ret) goto out; @@ -2121,18 +2121,16 @@ static void cgroup_kill_sb(struct super_block *sb) struct cgroup_root *root = cgroup_root_from_kf(kf_root); /* - * If @root doesn't have any mounts or children, start killing it. + * If @root doesn't have any children, start killing it. * This prevents new mounts by disabling percpu_ref_tryget_live(). * cgroup_mount() may wait for @root's release. * * And don't kill the default root. */ - if (!list_empty(&root->cgrp.self.children) || - root == &cgrp_dfl_root) - cgroup_put(&root->cgrp); - else + if (list_empty(&root->cgrp.self.children) && root != &cgrp_dfl_root && + !percpu_ref_is_dying(&root->cgrp.self.refcnt)) percpu_ref_kill(&root->cgrp.self.refcnt); - + cgroup_put(&root->cgrp); kernfs_kill_sb(sb); } @@ -5402,7 +5400,7 @@ int __init cgroup_init(void) hash_add(css_set_table, &init_css_set.hlist, css_set_hash(init_css_set.subsys)); - BUG_ON(cgroup_setup_root(&cgrp_dfl_root, 0, 0)); + BUG_ON(cgroup_setup_root(&cgrp_dfl_root, 0)); mutex_unlock(&cgroup_mutex); -- cgit v1.3-7-g2ca7 From 12fee4cd5be2c4a73cc13d7ad76eb2d2feda8a71 Mon Sep 17 00:00:00 2001 From: Huacai Chen Date: Thu, 17 Jan 2019 11:00:09 +0800 Subject: genirq/irqdesc: Fix double increment in alloc_descs() The recent rework of alloc_descs() introduced a double increment of the loop counter. As a consequence only every second affinity mask is validated. Remove it. [ tglx: Massaged changelog ] Fixes: c410abbbacb9 ("genirq/affinity: Add is_managed to struct irq_affinity_desc") Signed-off-by: Huacai Chen Signed-off-by: Thomas Gleixner Cc: Fuxin Zhang Cc: Zhangjin Wu Cc: Huacai Chen Cc: Dou Liyang Link: https://lkml.kernel.org/r/1547694009-16261-1-git-send-email-chenhc@lemote.com --- kernel/irq/irqdesc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c index ee062b7939d3..ef8ad36cadcf 100644 --- a/kernel/irq/irqdesc.c +++ b/kernel/irq/irqdesc.c @@ -457,7 +457,7 @@ static int alloc_descs(unsigned int start, unsigned int cnt, int node, /* Validate affinity mask(s) */ if (affinity) { - for (i = 0; i < cnt; i++, i++) { + for (i = 0; i < cnt; i++) { if (cpumask_empty(&affinity[i].mask)) return -EINVAL; } -- cgit v1.3-7-g2ca7 From 58fa4a410fc31afe08d0d0c6b6d8860c22ec17c2 Mon Sep 17 00:00:00 2001 From: Arnd Bergmann Date: Wed, 16 Jan 2019 14:15:20 +0100 Subject: ipc: introduce ksys_ipc()/compat_ksys_ipc() for s390 The sys_ipc() and compat_ksys_ipc() functions are meant to only be used from the system call table, not called by another function. Introduce ksys_*() interfaces for this purpose, as we have done for many other system calls. Link: https://lore.kernel.org/lkml/20190116131527.2071570-3-arnd@arndb.de Signed-off-by: Arnd Bergmann Reviewed-by: Heiko Carstens [heiko.carstens@de.ibm.com: compile fix for !CONFIG_COMPAT] Signed-off-by: Heiko Carstens Signed-off-by: Martin Schwidefsky --- arch/s390/kernel/compat_linux.c | 2 +- arch/s390/kernel/sys_s390.c | 4 +++- include/linux/syscalls.h | 4 ++++ ipc/syscall.c | 20 ++++++++++++++++---- kernel/sys_ni.c | 1 + 5 files changed, 25 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/arch/s390/kernel/compat_linux.c b/arch/s390/kernel/compat_linux.c index 8ac38d51ed7d..a47f6d3c6d5b 100644 --- a/arch/s390/kernel/compat_linux.c +++ b/arch/s390/kernel/compat_linux.c @@ -296,7 +296,7 @@ COMPAT_SYSCALL_DEFINE5(s390_ipc, uint, call, int, first, compat_ulong_t, second, { if (call >> 16) /* hack for backward compatibility */ return -EINVAL; - return compat_sys_ipc(call, first, second, third, ptr, third); + return compat_ksys_ipc(call, first, second, third, ptr, third); } #endif diff --git a/arch/s390/kernel/sys_s390.c b/arch/s390/kernel/sys_s390.c index 560bdaf8a74f..fd0cbbed4d9f 100644 --- a/arch/s390/kernel/sys_s390.c +++ b/arch/s390/kernel/sys_s390.c @@ -58,6 +58,7 @@ out: return error; } +#ifdef CONFIG_SYSVIPC /* * sys_ipc() is the de-multiplexer for the SysV IPC calls. */ @@ -74,8 +75,9 @@ SYSCALL_DEFINE5(s390_ipc, uint, call, int, first, unsigned long, second, * Therefore we can call the generic variant by simply passing the * third parameter also as fifth parameter. */ - return sys_ipc(call, first, second, third, ptr, third); + return ksys_ipc(call, first, second, third, ptr, third); } +#endif /* CONFIG_SYSVIPC */ SYSCALL_DEFINE1(s390_personality, unsigned int, personality) { diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 257cccba3062..fb63045a0fb6 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -1185,6 +1185,10 @@ unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len, unsigned long prot, unsigned long flags, unsigned long fd, unsigned long pgoff); ssize_t ksys_readahead(int fd, loff_t offset, size_t count); +int ksys_ipc(unsigned int call, int first, unsigned long second, + unsigned long third, void __user * ptr, long fifth); +int compat_ksys_ipc(u32 call, int first, int second, + u32 third, u32 ptr, u32 fifth); /* * The following kernel syscall equivalents are just wrappers to fs-internal diff --git a/ipc/syscall.c b/ipc/syscall.c index 1ac06e3983c0..3cf8ad703a4d 100644 --- a/ipc/syscall.c +++ b/ipc/syscall.c @@ -17,8 +17,8 @@ #include #include -SYSCALL_DEFINE6(ipc, unsigned int, call, int, first, unsigned long, second, - unsigned long, third, void __user *, ptr, long, fifth) +int ksys_ipc(unsigned int call, int first, unsigned long second, + unsigned long third, void __user * ptr, long fifth) { int version, ret; @@ -106,6 +106,12 @@ SYSCALL_DEFINE6(ipc, unsigned int, call, int, first, unsigned long, second, return -ENOSYS; } } + +SYSCALL_DEFINE6(ipc, unsigned int, call, int, first, unsigned long, second, + unsigned long, third, void __user *, ptr, long, fifth) +{ + return ksys_ipc(call, first, second, third, ptr, fifth); +} #endif #ifdef CONFIG_COMPAT @@ -121,8 +127,8 @@ struct compat_ipc_kludge { }; #ifdef CONFIG_ARCH_WANT_OLD_COMPAT_IPC -COMPAT_SYSCALL_DEFINE6(ipc, u32, call, int, first, int, second, - u32, third, compat_uptr_t, ptr, u32, fifth) +int compat_ksys_ipc(u32 call, int first, int second, + u32 third, compat_uptr_t ptr, u32 fifth) { int version; u32 pad; @@ -195,5 +201,11 @@ COMPAT_SYSCALL_DEFINE6(ipc, u32, call, int, first, int, second, return -ENOSYS; } + +COMPAT_SYSCALL_DEFINE6(ipc, u32, call, int, first, int, second, + u32, third, compat_uptr_t, ptr, u32, fifth) +{ + return compat_ksys_ipc(call, first, second, third, ptr, fifth); +} #endif #endif diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index ab9d0e3c6d50..bc934f31ab10 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -366,6 +366,7 @@ COND_SYSCALL(kexec_file_load); /* s390 */ COND_SYSCALL(s390_pci_mmio_read); COND_SYSCALL(s390_pci_mmio_write); +COND_SYSCALL(s390_ipc); COND_SYSCALL_COMPAT(s390_ipc); /* powerpc */ -- cgit v1.3-7-g2ca7 From 1a51c5da5acc6c188c917ba572eebac5f8793432 Mon Sep 17 00:00:00 2001 From: Stephane Eranian Date: Thu, 10 Jan 2019 17:17:16 -0800 Subject: perf core: Fix perf_proc_update_handler() bug The perf_proc_update_handler() handles /proc/sys/kernel/perf_event_max_sample_rate syctl variable. When the PMU IRQ handler timing monitoring is disabled, i.e, when /proc/sys/kernel/perf_cpu_time_max_percent is equal to 0 or 100, then no modification to sysctl_perf_event_sample_rate is allowed to prevent possible hang from wrong values. The problem is that the test to prevent modification is made after the sysctl variable is modified in perf_proc_update_handler(). You get an error: $ echo 10001 >/proc/sys/kernel/perf_event_max_sample_rate echo: write error: invalid argument But the value is still modified causing all sorts of inconsistencies: $ cat /proc/sys/kernel/perf_event_max_sample_rate 10001 This patch fixes the problem by moving the parsing of the value after the test. Committer testing: # echo 100 > /proc/sys/kernel/perf_cpu_time_max_percent # echo 10001 > /proc/sys/kernel/perf_event_max_sample_rate -bash: echo: write error: Invalid argument # cat /proc/sys/kernel/perf_event_max_sample_rate 10001 # Signed-off-by: Stephane Eranian Reviewed-by: Andi Kleen Reviewed-by: Jiri Olsa Tested-by: Arnaldo Carvalho de Melo Cc: Kan Liang Cc: Peter Zijlstra Link: http://lkml.kernel.org/r/1547169436-6266-1-git-send-email-eranian@google.com Signed-off-by: Arnaldo Carvalho de Melo --- kernel/events/core.c | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) (limited to 'kernel') diff --git a/kernel/events/core.c b/kernel/events/core.c index 3cd13a30f732..e5ede6918050 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -436,18 +436,18 @@ int perf_proc_update_handler(struct ctl_table *table, int write, void __user *buffer, size_t *lenp, loff_t *ppos) { - int ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos); - - if (ret || !write) - return ret; - + int ret; + int perf_cpu = sysctl_perf_cpu_time_max_percent; /* * If throttling is disabled don't allow the write: */ - if (sysctl_perf_cpu_time_max_percent == 100 || - sysctl_perf_cpu_time_max_percent == 0) + if (write && (perf_cpu == 100 || perf_cpu == 0)) return -EINVAL; + ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos); + if (ret || !write) + return ret; + max_samples_per_tick = DIV_ROUND_UP(sysctl_perf_event_sample_rate, HZ); perf_sample_period_ns = NSEC_PER_SEC / sysctl_perf_event_sample_rate; update_perf_cpu_limits(); -- cgit v1.3-7-g2ca7 From 7527a7b157d1191b23562ed70154ae93bd65f845 Mon Sep 17 00:00:00 2001 From: Parav Pandit Date: Thu, 17 Jan 2019 20:14:15 +0200 Subject: IB/core: Simplify rdma cgroup registration RDMA cgroup registration routine always returns success, so simplify function to be void and run clang formatter over whole CONFIG_CGROUP_RDMA art of core_priv.h. This reduces unwinding error path for regular registration and future net namespace change functionality for rdma device. Signed-off-by: Parav Pandit Signed-off-by: Leon Romanovsky Acked-by: Tejun Heo Signed-off-by: Jason Gunthorpe --- drivers/infiniband/core/cgroup.c | 5 ++--- drivers/infiniband/core/core_priv.h | 17 +++++++++++------ drivers/infiniband/core/device.c | 8 +------- include/linux/cgroup_rdma.h | 2 +- kernel/cgroup/rdma.c | 5 +---- 5 files changed, 16 insertions(+), 21 deletions(-) (limited to 'kernel') diff --git a/drivers/infiniband/core/cgroup.c b/drivers/infiniband/core/cgroup.c index 126ac5f99db7..388fd04e5f63 100644 --- a/drivers/infiniband/core/cgroup.c +++ b/drivers/infiniband/core/cgroup.c @@ -21,12 +21,11 @@ * Register with the rdma cgroup. Should be called before * exposing rdma device to user space applications to avoid * resource accounting leak. - * Returns 0 on success or otherwise failure code. */ -int ib_device_register_rdmacg(struct ib_device *device) +void ib_device_register_rdmacg(struct ib_device *device) { device->cg_device.name = device->name; - return rdmacg_register_device(&device->cg_device); + rdmacg_register_device(&device->cg_device); } /** diff --git a/drivers/infiniband/core/core_priv.h b/drivers/infiniband/core/core_priv.h index aca75c74e451..42a49982f66e 100644 --- a/drivers/infiniband/core/core_priv.h +++ b/drivers/infiniband/core/core_priv.h @@ -115,7 +115,7 @@ void ib_cache_cleanup_one(struct ib_device *device); void ib_cache_release_one(struct ib_device *device); #ifdef CONFIG_CGROUP_RDMA -int ib_device_register_rdmacg(struct ib_device *device); +void ib_device_register_rdmacg(struct ib_device *device); void ib_device_unregister_rdmacg(struct ib_device *device); int ib_rdmacg_try_charge(struct ib_rdmacg_object *cg_obj, @@ -126,21 +126,26 @@ void ib_rdmacg_uncharge(struct ib_rdmacg_object *cg_obj, struct ib_device *device, enum rdmacg_resource_type resource_index); #else -static inline int ib_device_register_rdmacg(struct ib_device *device) -{ return 0; } +static inline void ib_device_register_rdmacg(struct ib_device *device) +{ +} static inline void ib_device_unregister_rdmacg(struct ib_device *device) -{ } +{ +} static inline int ib_rdmacg_try_charge(struct ib_rdmacg_object *cg_obj, struct ib_device *device, enum rdmacg_resource_type resource_index) -{ return 0; } +{ + return 0; +} static inline void ib_rdmacg_uncharge(struct ib_rdmacg_object *cg_obj, struct ib_device *device, enum rdmacg_resource_type resource_index) -{ } +{ +} #endif static inline bool rdma_is_upper_dev_rcu(struct net_device *dev, diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c index 4a9aa6d10c5e..200431c540f2 100644 --- a/drivers/infiniband/core/device.c +++ b/drivers/infiniband/core/device.c @@ -599,12 +599,7 @@ int ib_register_device(struct ib_device *device, const char *name) device->index = __dev_new_index(); - ret = ib_device_register_rdmacg(device); - if (ret) { - dev_warn(&device->dev, - "Couldn't register device with rdma cgroup\n"); - goto dev_cleanup; - } + ib_device_register_rdmacg(device); ret = ib_device_register_sysfs(device); if (ret) { @@ -627,7 +622,6 @@ int ib_register_device(struct ib_device *device, const char *name) cg_cleanup: ib_device_unregister_rdmacg(device); -dev_cleanup: cleanup_device(device); out: mutex_unlock(&device_mutex); diff --git a/include/linux/cgroup_rdma.h b/include/linux/cgroup_rdma.h index e94290b29e99..ef1bae2983f3 100644 --- a/include/linux/cgroup_rdma.h +++ b/include/linux/cgroup_rdma.h @@ -39,7 +39,7 @@ struct rdmacg_device { * APIs for RDMA/IB stack to publish when a device wants to * participate in resource accounting */ -int rdmacg_register_device(struct rdmacg_device *device); +void rdmacg_register_device(struct rdmacg_device *device); void rdmacg_unregister_device(struct rdmacg_device *device); /* APIs for RDMA/IB stack to charge/uncharge pool specific resources */ diff --git a/kernel/cgroup/rdma.c b/kernel/cgroup/rdma.c index d3bbb757ee49..1d75ae7f1cb7 100644 --- a/kernel/cgroup/rdma.c +++ b/kernel/cgroup/rdma.c @@ -313,10 +313,8 @@ EXPORT_SYMBOL(rdmacg_try_charge); * If IB stack wish a device to participate in rdma cgroup resource * tracking, it must invoke this API to register with rdma cgroup before * any user space application can start using the RDMA resources. - * Returns 0 on success or EINVAL when table length given is beyond - * supported size. */ -int rdmacg_register_device(struct rdmacg_device *device) +void rdmacg_register_device(struct rdmacg_device *device) { INIT_LIST_HEAD(&device->dev_node); INIT_LIST_HEAD(&device->rpools); @@ -324,7 +322,6 @@ int rdmacg_register_device(struct rdmacg_device *device) mutex_lock(&rdmacg_mutex); list_add_tail(&device->dev_node, &rdmacg_devices); mutex_unlock(&rdmacg_mutex); - return 0; } EXPORT_SYMBOL(rdmacg_register_device); -- cgit v1.3-7-g2ca7 From 626abcd13d4ea2b67be3249a250046cf713f532a Mon Sep 17 00:00:00 2001 From: Richard Guy Briggs Date: Fri, 18 Jan 2019 17:42:48 -0500 Subject: audit: add syscall information to CONFIG_CHANGE records Tie syscall information to all CONFIG_CHANGE calls since they are all a result of user actions. Exclude user records from syscall context: Since the function audit_log_common_recv_msg() is shared by a number of AUDIT_CONFIG_CHANGE and the entire range of AUDIT_USER_* record types, and since the AUDIT_CONFIG_CHANGE message type has been converted to a syscall accompanied record type, special-case the AUDIT_USER_* range of messages so they remain standalone records. See: https://github.com/linux-audit/audit-kernel/issues/59 See: https://github.com/linux-audit/audit-kernel/issues/50 Signed-off-by: Richard Guy Briggs [PM: fix line lengths in kernel/audit.c] Signed-off-by: Paul Moore --- kernel/audit.c | 28 +++++++++++++++++++--------- kernel/audit_fsnotify.c | 2 +- kernel/audit_watch.c | 2 +- kernel/auditfilter.c | 2 +- 4 files changed, 22 insertions(+), 12 deletions(-) (limited to 'kernel') diff --git a/kernel/audit.c b/kernel/audit.c index d412fb4ae6d5..c2a7662cc254 100644 --- a/kernel/audit.c +++ b/kernel/audit.c @@ -396,7 +396,7 @@ static int audit_log_config_change(char *function_name, u32 new, u32 old, struct audit_buffer *ab; int rc = 0; - ab = audit_log_start(NULL, GFP_KERNEL, AUDIT_CONFIG_CHANGE); + ab = audit_log_start(audit_context(), GFP_KERNEL, AUDIT_CONFIG_CHANGE); if (unlikely(!ab)) return rc; audit_log_format(ab, "op=set %s=%u old=%u ", function_name, new, old); @@ -1053,7 +1053,8 @@ static int audit_netlink_ok(struct sk_buff *skb, u16 msg_type) return err; } -static void audit_log_common_recv_msg(struct audit_buffer **ab, u16 msg_type) +static void audit_log_common_recv_msg(struct audit_context *context, + struct audit_buffer **ab, u16 msg_type) { uid_t uid = from_kuid(&init_user_ns, current_uid()); pid_t pid = task_tgid_nr(current); @@ -1063,7 +1064,7 @@ static void audit_log_common_recv_msg(struct audit_buffer **ab, u16 msg_type) return; } - *ab = audit_log_start(NULL, GFP_KERNEL, msg_type); + *ab = audit_log_start(context, GFP_KERNEL, msg_type); if (unlikely(!*ab)) return; audit_log_format(*ab, "pid=%d uid=%u ", pid, uid); @@ -1071,6 +1072,12 @@ static void audit_log_common_recv_msg(struct audit_buffer **ab, u16 msg_type) audit_log_task_context(*ab); } +static inline void audit_log_user_recv_msg(struct audit_buffer **ab, + u16 msg_type) +{ + audit_log_common_recv_msg(NULL, ab, msg_type); +} + int is_audit_feature_set(int i) { return af.features & AUDIT_FEATURE_TO_MASK(i); @@ -1338,7 +1345,7 @@ static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh) if (err) break; } - audit_log_common_recv_msg(&ab, msg_type); + audit_log_user_recv_msg(&ab, msg_type); if (msg_type != AUDIT_USER_TTY) audit_log_format(ab, " msg='%.*s'", AUDIT_MESSAGE_TEXT_MAX, @@ -1361,7 +1368,8 @@ static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh) if (nlmsg_len(nlh) < sizeof(struct audit_rule_data)) return -EINVAL; if (audit_enabled == AUDIT_LOCKED) { - audit_log_common_recv_msg(&ab, AUDIT_CONFIG_CHANGE); + audit_log_common_recv_msg(audit_context(), &ab, + AUDIT_CONFIG_CHANGE); audit_log_format(ab, " op=%s audit_enabled=%d res=0", msg_type == AUDIT_ADD_RULE ? "add_rule" : "remove_rule", @@ -1376,7 +1384,8 @@ static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh) break; case AUDIT_TRIM: audit_trim_trees(); - audit_log_common_recv_msg(&ab, AUDIT_CONFIG_CHANGE); + audit_log_common_recv_msg(audit_context(), &ab, + AUDIT_CONFIG_CHANGE); audit_log_format(ab, " op=trim res=1"); audit_log_end(ab); break; @@ -1406,8 +1415,8 @@ static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh) /* OK, here comes... */ err = audit_tag_tree(old, new); - audit_log_common_recv_msg(&ab, AUDIT_CONFIG_CHANGE); - + audit_log_common_recv_msg(audit_context(), &ab, + AUDIT_CONFIG_CHANGE); audit_log_format(ab, " op=make_equiv old="); audit_log_untrustedstring(ab, old); audit_log_format(ab, " new="); @@ -1474,7 +1483,8 @@ static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh) old.enabled = t & AUDIT_TTY_ENABLE; old.log_passwd = !!(t & AUDIT_TTY_LOG_PASSWD); - audit_log_common_recv_msg(&ab, AUDIT_CONFIG_CHANGE); + audit_log_common_recv_msg(audit_context(), &ab, + AUDIT_CONFIG_CHANGE); audit_log_format(ab, " op=tty_set old-enabled=%d new-enabled=%d" " old-log_passwd=%d new-log_passwd=%d res=%d", old.enabled, s.enabled, old.log_passwd, diff --git a/kernel/audit_fsnotify.c b/kernel/audit_fsnotify.c index cf4512a33675..37ae95cfb7f4 100644 --- a/kernel/audit_fsnotify.c +++ b/kernel/audit_fsnotify.c @@ -127,7 +127,7 @@ static void audit_mark_log_rule_change(struct audit_fsnotify_mark *audit_mark, c if (!audit_enabled) return; - ab = audit_log_start(NULL, GFP_NOFS, AUDIT_CONFIG_CHANGE); + ab = audit_log_start(audit_context(), GFP_NOFS, AUDIT_CONFIG_CHANGE); if (unlikely(!ab)) return; audit_log_session_info(ab); diff --git a/kernel/audit_watch.c b/kernel/audit_watch.c index 20ef9ba134b0..e8d1adeb2223 100644 --- a/kernel/audit_watch.c +++ b/kernel/audit_watch.c @@ -242,7 +242,7 @@ static void audit_watch_log_rule_change(struct audit_krule *r, struct audit_watc if (!audit_enabled) return; - ab = audit_log_start(NULL, GFP_NOFS, AUDIT_CONFIG_CHANGE); + ab = audit_log_start(audit_context(), GFP_NOFS, AUDIT_CONFIG_CHANGE); if (!ab) return; audit_log_session_info(ab); diff --git a/kernel/auditfilter.c b/kernel/auditfilter.c index bf309f2592c4..26a80a9d43a9 100644 --- a/kernel/auditfilter.c +++ b/kernel/auditfilter.c @@ -1091,7 +1091,7 @@ static void audit_log_rule_change(char *action, struct audit_krule *rule, int re if (!audit_enabled) return; - ab = audit_log_start(NULL, GFP_KERNEL, AUDIT_CONFIG_CHANGE); + ab = audit_log_start(audit_context(), GFP_KERNEL, AUDIT_CONFIG_CHANGE); if (!ab) return; audit_log_session_info(ab); -- cgit v1.3-7-g2ca7 From 9d5564ddcf2a0f5ba3fa1c3a1f8a1b59ad309553 Mon Sep 17 00:00:00 2001 From: Daniel Borkmann Date: Thu, 17 Jan 2019 16:34:45 +0100 Subject: bpf: fix inner map masking to prevent oob under speculation During review I noticed that inner meta map setup for map in map is buggy in that it does not propagate all needed data from the reference map which the verifier is later accessing. In particular one such case is index masking to prevent out of bounds access under speculative execution due to missing the map's unpriv_array/index_mask field propagation. Fix this such that the verifier is generating the correct code for inlined lookups in case of unpriviledged use. Before patch (test_verifier's 'map in map access' dump): # bpftool prog dump xla id 3 0: (62) *(u32 *)(r10 -4) = 0 1: (bf) r2 = r10 2: (07) r2 += -4 3: (18) r1 = map[id:4] 5: (07) r1 += 272 | 6: (61) r0 = *(u32 *)(r2 +0) | 7: (35) if r0 >= 0x1 goto pc+6 | Inlined map in map lookup 8: (54) (u32) r0 &= (u32) 0 | with index masking for 9: (67) r0 <<= 3 | map->unpriv_array. 10: (0f) r0 += r1 | 11: (79) r0 = *(u64 *)(r0 +0) | 12: (15) if r0 == 0x0 goto pc+1 | 13: (05) goto pc+1 | 14: (b7) r0 = 0 | 15: (15) if r0 == 0x0 goto pc+11 16: (62) *(u32 *)(r10 -4) = 0 17: (bf) r2 = r10 18: (07) r2 += -4 19: (bf) r1 = r0 20: (07) r1 += 272 | 21: (61) r0 = *(u32 *)(r2 +0) | Index masking missing (!) 22: (35) if r0 >= 0x1 goto pc+3 | for inner map despite 23: (67) r0 <<= 3 | map->unpriv_array set. 24: (0f) r0 += r1 | 25: (05) goto pc+1 | 26: (b7) r0 = 0 | 27: (b7) r0 = 0 28: (95) exit After patch: # bpftool prog dump xla id 1 0: (62) *(u32 *)(r10 -4) = 0 1: (bf) r2 = r10 2: (07) r2 += -4 3: (18) r1 = map[id:2] 5: (07) r1 += 272 | 6: (61) r0 = *(u32 *)(r2 +0) | 7: (35) if r0 >= 0x1 goto pc+6 | Same inlined map in map lookup 8: (54) (u32) r0 &= (u32) 0 | with index masking due to 9: (67) r0 <<= 3 | map->unpriv_array. 10: (0f) r0 += r1 | 11: (79) r0 = *(u64 *)(r0 +0) | 12: (15) if r0 == 0x0 goto pc+1 | 13: (05) goto pc+1 | 14: (b7) r0 = 0 | 15: (15) if r0 == 0x0 goto pc+12 16: (62) *(u32 *)(r10 -4) = 0 17: (bf) r2 = r10 18: (07) r2 += -4 19: (bf) r1 = r0 20: (07) r1 += 272 | 21: (61) r0 = *(u32 *)(r2 +0) | 22: (35) if r0 >= 0x1 goto pc+4 | Now fixed inlined inner map 23: (54) (u32) r0 &= (u32) 0 | lookup with proper index masking 24: (67) r0 <<= 3 | for map->unpriv_array. 25: (0f) r0 += r1 | 26: (05) goto pc+1 | 27: (b7) r0 = 0 | 28: (b7) r0 = 0 29: (95) exit Fixes: b2157399cc98 ("bpf: prevent out-of-bounds speculation") Signed-off-by: Daniel Borkmann Acked-by: Martin KaFai Lau Signed-off-by: Alexei Starovoitov --- kernel/bpf/map_in_map.c | 17 +++++++++++++++-- 1 file changed, 15 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c index 99d243e1ad6e..52378d3e34b3 100644 --- a/kernel/bpf/map_in_map.c +++ b/kernel/bpf/map_in_map.c @@ -12,6 +12,7 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd) { struct bpf_map *inner_map, *inner_map_meta; + u32 inner_map_meta_size; struct fd f; f = fdget(inner_map_ufd); @@ -36,7 +37,12 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd) return ERR_PTR(-EINVAL); } - inner_map_meta = kzalloc(sizeof(*inner_map_meta), GFP_USER); + inner_map_meta_size = sizeof(*inner_map_meta); + /* In some cases verifier needs to access beyond just base map. */ + if (inner_map->ops == &array_map_ops) + inner_map_meta_size = sizeof(struct bpf_array); + + inner_map_meta = kzalloc(inner_map_meta_size, GFP_USER); if (!inner_map_meta) { fdput(f); return ERR_PTR(-ENOMEM); @@ -46,9 +52,16 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd) inner_map_meta->key_size = inner_map->key_size; inner_map_meta->value_size = inner_map->value_size; inner_map_meta->map_flags = inner_map->map_flags; - inner_map_meta->ops = inner_map->ops; inner_map_meta->max_entries = inner_map->max_entries; + /* Misc members not needed in bpf_map_meta_equal() check. */ + inner_map_meta->ops = inner_map->ops; + if (inner_map->ops == &array_map_ops) { + inner_map_meta->unpriv_array = inner_map->unpriv_array; + container_of(inner_map_meta, struct bpf_array, map)->index_mask = + container_of(inner_map, struct bpf_array, map)->index_mask; + } + fdput(f); return inner_map_meta; } -- cgit v1.3-7-g2ca7 From cc6795aeffea0a80d0baf9ad31ba926a6c42cef5 Mon Sep 17 00:00:00 2001 From: Andrew Murray Date: Thu, 10 Jan 2019 13:53:25 +0000 Subject: perf/core: Add PERF_PMU_CAP_NO_EXCLUDE for exclusion incapable PMUs Many PMU drivers do not have the capability to exclude counting events that occur in specific contexts such as idle, kernel, guest, etc. These drivers indicate this by returning an error in their event_init upon testing the events attribute flags. This approach is error prone and often inconsistent. Let's instead allow PMU drivers to advertise their inability to exclude based on context via a new capability: PERF_PMU_CAP_NO_EXCLUDE. This allows the perf core to reject requests for exclusion events where there is no support in the PMU. Signed-off-by: Andrew Murray Signed-off-by: Peter Zijlstra (Intel) Cc: Arnaldo Carvalho de Melo Cc: Benjamin Herrenschmidt Cc: Borislav Petkov Cc: Ivan Kokshaysky Cc: Linus Torvalds Cc: Mark Rutland Cc: Matt Turner Cc: Michael Ellerman Cc: Paul Mackerras Cc: Peter Zijlstra Cc: Richard Henderson Cc: Russell King Cc: Sascha Hauer Cc: Shawn Guo Cc: Thomas Gleixner Cc: Will Deacon Cc: linux-arm-kernel@lists.infradead.org Cc: linuxppc-dev@lists.ozlabs.org Cc: robin.murphy@arm.com Cc: suzuki.poulose@arm.com Link: https://lkml.kernel.org/r/1547128414-50693-4-git-send-email-andrew.murray@arm.com Signed-off-by: Ingo Molnar --- include/linux/perf_event.h | 1 + kernel/events/core.c | 9 +++++++++ 2 files changed, 10 insertions(+) (limited to 'kernel') diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index 54a78d22f0a6..cec02dc63b51 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -244,6 +244,7 @@ struct perf_event; #define PERF_PMU_CAP_EXCLUSIVE 0x10 #define PERF_PMU_CAP_ITRACE 0x20 #define PERF_PMU_CAP_HETEROGENEOUS_CPUS 0x40 +#define PERF_PMU_CAP_NO_EXCLUDE 0x80 /** * struct pmu - generic performance monitoring unit diff --git a/kernel/events/core.c b/kernel/events/core.c index 3cd13a30f732..fbe59b793b36 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -9772,6 +9772,15 @@ static int perf_try_init_event(struct pmu *pmu, struct perf_event *event) if (ctx) perf_event_ctx_unlock(event->group_leader, ctx); + if (!ret) { + if (pmu->capabilities & PERF_PMU_CAP_NO_EXCLUDE && + event_has_any_exclude_flag(event)) { + if (event->destroy) + event->destroy(event); + ret = -EINVAL; + } + } + if (ret) module_put(pmu->module); -- cgit v1.3-7-g2ca7 From 6dc080eeb2ba01973bfff0d79844d7a59e12542e Mon Sep 17 00:00:00 2001 From: Prateek Sood Date: Fri, 30 Nov 2018 20:40:56 +0530 Subject: sched/wait: Fix rcuwait_wake_up() ordering For some peculiar reason rcuwait_wake_up() has the right barrier in the comment, but not in the code. This mistake has been observed to cause a deadlock in the following situation: P1 P2 percpu_up_read() percpu_down_write() rcu_sync_is_idle() // false rcu_sync_enter() ... __percpu_up_read() [S] ,- __this_cpu_dec(*sem->read_count) | smp_rmb(); [L] | task = rcu_dereference(w->task) // NULL | | [S] w->task = current | smp_mb(); | [L] readers_active_check() // fail `-> Where the smp_rmb() (obviously) fails to constrain the store. [ peterz: Added changelog. ] Signed-off-by: Prateek Sood Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: Andrea Parri Acked-by: Davidlohr Bueso Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Fixes: 8f95c90ceb54 ("sched/wait, RCU: Introduce rcuwait machinery") Link: https://lkml.kernel.org/r/1543590656-7157-1-git-send-email-prsood@codeaurora.org Signed-off-by: Ingo Molnar --- kernel/exit.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/exit.c b/kernel/exit.c index 284f2fe9a293..3fb7be001964 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -307,7 +307,7 @@ void rcuwait_wake_up(struct rcuwait *w) * MB (A) MB (B) * [L] cond [L] tsk */ - smp_rmb(); /* (B) */ + smp_mb(); /* (B) */ /* * Avoid using task_rcu_dereference() magic as long as we are careful, -- cgit v1.3-7-g2ca7 From e6018c0f5c996e61639adce6a0697391a2861916 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Mon, 17 Dec 2018 10:14:53 +0100 Subject: sched/wake_q: Document wake_q_add() The only guarantee provided by wake_q_add() is that a wakeup will happen after it, it does _NOT_ guarantee the wakeup will be delayed until the matching wake_up_q(). If wake_q_add() fails the cmpxchg() a concurrent wakeup is pending and that can happen at any time after the cmpxchg(). This means we should not rely on the wakeup happening at wake_q_up(), but should be ready for wake_q_add() to issue the wakeup. The delay; if provided (most likely); should only result in more efficient behaviour. Reported-by: Yongji Xie Signed-off-by: Peter Zijlstra (Intel) Cc: Davidlohr Bueso Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Waiman Long Cc: Will Deacon Signed-off-by: Ingo Molnar --- include/linux/sched/wake_q.h | 6 +++++- kernel/sched/core.c | 12 ++++++++++++ 2 files changed, 17 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/include/linux/sched/wake_q.h b/include/linux/sched/wake_q.h index 10b19a192b2d..545f37138057 100644 --- a/include/linux/sched/wake_q.h +++ b/include/linux/sched/wake_q.h @@ -24,9 +24,13 @@ * called near the end of a function. Otherwise, the list can be * re-initialized for later re-use by wake_q_init(). * - * Note that this can cause spurious wakeups. schedule() callers + * NOTE that this can cause spurious wakeups. schedule() callers * must ensure the call is done inside a loop, confirming that the * wakeup condition has in fact occurred. + * + * NOTE that there is no guarantee the wakeup will happen any later than the + * wake_q_add() location. Therefore task must be ready to be woken at the + * location of the wake_q_add(). */ #include diff --git a/kernel/sched/core.c b/kernel/sched/core.c index a674c7db2f29..cc814933f7d6 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -396,6 +396,18 @@ static bool set_nr_if_polling(struct task_struct *p) #endif #endif +/** + * wake_q_add() - queue a wakeup for 'later' waking. + * @head: the wake_q_head to add @task to + * @task: the task to queue for 'later' wakeup + * + * Queue a task for later wakeup, most likely by the wake_up_q() call in the + * same context, _HOWEVER_ this is not guaranteed, the wakeup can come + * instantly. + * + * This function must be used as-if it were wake_up_process(); IOW the task + * must be ready to be woken at this location. + */ void wake_q_add(struct wake_q_head *head, struct task_struct *task) { struct wake_q_node *node = &task->wake_q; -- cgit v1.3-7-g2ca7 From 4c4e3731564c8945ac5ac90fc2a1e1f21cb79c92 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Mon, 17 Dec 2018 10:14:53 +0100 Subject: sched/wake_q: Fix wakeup ordering for wake_q Notable cmpxchg() does not provide ordering when it fails, however wake_q_add() requires ordering in this specific case too. Without this it would be possible for the concurrent wakeup to not observe our prior state. Andrea Parri provided: C wake_up_q-wake_q_add { int next = 0; int y = 0; } P0(int *next, int *y) { int r0; /* in wake_up_q() */ WRITE_ONCE(*next, 1); /* node->next = NULL */ smp_mb(); /* implied by wake_up_process() */ r0 = READ_ONCE(*y); } P1(int *next, int *y) { int r1; /* in wake_q_add() */ WRITE_ONCE(*y, 1); /* wake_cond = true */ smp_mb__before_atomic(); r1 = cmpxchg_relaxed(next, 1, 2); } exists (0:r0=0 /\ 1:r1=0) This "exists" clause cannot be satisfied according to the LKMM: Test wake_up_q-wake_q_add Allowed States 3 0:r0=0; 1:r1=1; 0:r0=1; 1:r1=0; 0:r0=1; 1:r1=1; No Witnesses Positive: 0 Negative: 3 Condition exists (0:r0=0 /\ 1:r1=0) Observation wake_up_q-wake_q_add Never 0 3 Reported-by: Yongji Xie Signed-off-by: Peter Zijlstra (Intel) Cc: Davidlohr Bueso Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Waiman Long Cc: Will Deacon Signed-off-by: Ingo Molnar --- kernel/sched/core.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index cc814933f7d6..d8d76a65cfdd 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -417,10 +417,11 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task) * its already queued (either by us or someone else) and will get the * wakeup due to that. * - * This cmpxchg() executes a full barrier, which pairs with the full - * barrier executed by the wakeup in wake_up_q(). + * In order to ensure that a pending wakeup will observe our pending + * state, even in the failed case, an explicit smp_mb() must be used. */ - if (cmpxchg(&node->next, NULL, WAKE_Q_TAIL)) + smp_mb__before_atomic(); + if (cmpxchg_relaxed(&node->next, NULL, WAKE_Q_TAIL)) return; get_task_struct(task); -- cgit v1.3-7-g2ca7 From b061c38bef43406df8e73c5be06cbfacad5ee6ad Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Thu, 29 Nov 2018 14:44:49 +0100 Subject: futex: Fix (possible) missed wakeup We must not rely on wake_q_add() to delay the wakeup; in particular commit: 1d0dcb3ad9d3 ("futex: Implement lockless wakeups") moved wake_q_add() before smp_store_release(&q->lock_ptr, NULL), which could result in futex_wait() waking before observing ->lock_ptr == NULL and going back to sleep again. Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Fixes: 1d0dcb3ad9d3 ("futex: Implement lockless wakeups") Signed-off-by: Ingo Molnar --- kernel/futex.c | 13 ++++++++----- 1 file changed, 8 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/kernel/futex.c b/kernel/futex.c index be3bff2315ff..fdd312da0992 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -1452,11 +1452,7 @@ static void mark_wake_futex(struct wake_q_head *wake_q, struct futex_q *q) if (WARN(q->pi_state || q->rt_waiter, "refusing to wake PI futex\n")) return; - /* - * Queue the task for later wakeup for after we've released - * the hb->lock. wake_q_add() grabs reference to p. - */ - wake_q_add(wake_q, p); + get_task_struct(p); __unqueue_futex(q); /* * The waiting task can free the futex_q as soon as q->lock_ptr = NULL @@ -1466,6 +1462,13 @@ static void mark_wake_futex(struct wake_q_head *wake_q, struct futex_q *q) * plist_del in __unqueue_futex(). */ smp_store_release(&q->lock_ptr, NULL); + + /* + * Queue the task for later wakeup for after we've released + * the hb->lock. wake_q_add() grabs reference to p. + */ + wake_q_add(wake_q, p); + put_task_struct(p); } /* -- cgit v1.3-7-g2ca7 From e158488be27b157802753a59b336142dc0eb0380 Mon Sep 17 00:00:00 2001 From: Xie Yongji Date: Thu, 29 Nov 2018 20:50:30 +0800 Subject: locking/rwsem: Fix (possible) missed wakeup Because wake_q_add() can imply an immediate wakeup (cmpxchg failure case), we must not rely on the wakeup being delayed. However, commit: e38513905eea ("locking/rwsem: Rework zeroing reader waiter->task") relies on exactly that behaviour in that the wakeup must not happen until after we clear waiter->task. [ peterz: Added changelog. ] Signed-off-by: Xie Yongji Signed-off-by: Zhang Yu Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Fixes: e38513905eea ("locking/rwsem: Rework zeroing reader waiter->task") Link: https://lkml.kernel.org/r/1543495830-2644-1-git-send-email-xieyongji@baidu.com Signed-off-by: Ingo Molnar --- kernel/locking/rwsem-xadd.c | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c index 09b180063ee1..50d9af615dc4 100644 --- a/kernel/locking/rwsem-xadd.c +++ b/kernel/locking/rwsem-xadd.c @@ -198,15 +198,22 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem, woken++; tsk = waiter->task; - wake_q_add(wake_q, tsk); + get_task_struct(tsk); list_del(&waiter->list); /* - * Ensure that the last operation is setting the reader + * Ensure calling get_task_struct() before setting the reader * waiter to nil such that rwsem_down_read_failed() cannot * race with do_exit() by always holding a reference count * to the task to wakeup. */ smp_store_release(&waiter->task, NULL); + /* + * Ensure issuing the wakeup (either by us or someone else) + * after setting the reader waiter to nil. + */ + wake_q_add(wake_q, tsk); + /* wake_q_add() already take the task ref */ + put_task_struct(tsk); } adjustment = woken * RWSEM_ACTIVE_READ_BIAS - adjustment; -- cgit v1.3-7-g2ca7 From 87ff19cb2f1aa55a5d8b691e6690cc059a59d2ec Mon Sep 17 00:00:00 2001 From: Davidlohr Bueso Date: Sun, 2 Dec 2018 21:31:30 -0800 Subject: sched/wake_q: Add branch prediction hint to wake_q_add() cmpxchg The cmpxchg() will fail when the task is already in the process of waking up, and as such is an extremely rare occurrence. Micro-optimize the call and put an unlikely() around it. To no surprise, when using CONFIG_PROFILE_ANNOTATED_BRANCHES under a number of workloads the incorrect rate was a mere 1-2%. Signed-off-by: Davidlohr Bueso Signed-off-by: Peter Zijlstra (Intel) Acked-by: Waiman Long Cc: Andrew Morton Cc: Linus Torvalds Cc: Paul E. McKenney Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Will Deacon Cc: Yongji Xie Cc: andrea.parri@amarulasolutions.com Cc: lilin24@baidu.com Cc: liuqi16@baidu.com Cc: nixun@baidu.com Cc: xieyongji@baidu.com Cc: yuanlinsi01@baidu.com Cc: zhangyu31@baidu.com Link: https://lkml.kernel.org/r/20181203053130.gwkw6kg72azt2npb@linux-r8p5 Signed-off-by: Ingo Molnar --- kernel/sched/core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index d8d76a65cfdd..b05eef7d7a1f 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -421,7 +421,7 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task) * state, even in the failed case, an explicit smp_mb() must be used. */ smp_mb__before_atomic(); - if (cmpxchg_relaxed(&node->next, NULL, WAKE_Q_TAIL)) + if (unlikely(cmpxchg_relaxed(&node->next, NULL, WAKE_Q_TAIL))) return; get_task_struct(task); -- cgit v1.3-7-g2ca7 From 71492580571467fb7177aade19c18ce7486267f5 Mon Sep 17 00:00:00 2001 From: Waiman Long Date: Wed, 9 Jan 2019 23:03:25 -0500 Subject: locking/lockdep: Add debug_locks check in __lock_downgrade() Tetsuo Handa had reported he saw an incorrect "downgrading a read lock" warning right after a previous lockdep warning. It is likely that the previous warning turned off lock debugging causing the lockdep to have inconsistency states leading to the lock downgrade warning. Fix that by add a check for debug_locks at the beginning of __lock_downgrade(). Debugged-by: Tetsuo Handa Reported-by: Tetsuo Handa Reported-by: syzbot+53383ae265fb161ef488@syzkaller.appspotmail.com Signed-off-by: Waiman Long Signed-off-by: Peter Zijlstra (Intel) Cc: Andrew Morton Cc: Linus Torvalds Cc: Paul E. McKenney Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Will Deacon Link: https://lkml.kernel.org/r/1547093005-26085-1-git-send-email-longman@redhat.com Signed-off-by: Ingo Molnar --- kernel/locking/lockdep.c | 3 +++ 1 file changed, 3 insertions(+) (limited to 'kernel') diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 95932333a48b..e805fe3bf87f 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -3535,6 +3535,9 @@ static int __lock_downgrade(struct lockdep_map *lock, unsigned long ip) unsigned int depth; int i; + if (unlikely(!debug_locks)) + return 0; + depth = curr->lockdep_depth; /* * This function is about (re)setting the class of a held lock, -- cgit v1.3-7-g2ca7 From ce48c457b95316b9a01b5aa9d4456ce820df94b4 Mon Sep 17 00:00:00 2001 From: Valentin Schneider Date: Wed, 19 Dec 2018 18:23:15 +0000 Subject: cpu/hotplug: Mute hotplug lockdep during init Since we've had: commit cb538267ea1e ("jump_label/lockdep: Assert we hold the hotplug lock for _cpuslocked() operations") we've been getting some lockdep warnings during init, such as on HiKey960: [ 0.820495] WARNING: CPU: 4 PID: 0 at kernel/cpu.c:316 lockdep_assert_cpus_held+0x3c/0x48 [ 0.820498] Modules linked in: [ 0.820509] CPU: 4 PID: 0 Comm: swapper/4 Tainted: G S 4.20.0-rc5-00051-g4cae42a #34 [ 0.820511] Hardware name: HiKey960 (DT) [ 0.820516] pstate: 600001c5 (nZCv dAIF -PAN -UAO) [ 0.820520] pc : lockdep_assert_cpus_held+0x3c/0x48 [ 0.820523] lr : lockdep_assert_cpus_held+0x38/0x48 [ 0.820526] sp : ffff00000a9cbe50 [ 0.820528] x29: ffff00000a9cbe50 x28: 0000000000000000 [ 0.820533] x27: 00008000b69e5000 x26: ffff8000bff4cfe0 [ 0.820537] x25: ffff000008ba69e0 x24: 0000000000000001 [ 0.820541] x23: ffff000008fce000 x22: ffff000008ba70c8 [ 0.820545] x21: 0000000000000001 x20: 0000000000000003 [ 0.820548] x19: ffff00000a35d628 x18: ffffffffffffffff [ 0.820552] x17: 0000000000000000 x16: 0000000000000000 [ 0.820556] x15: ffff00000958f848 x14: 455f3052464d4d34 [ 0.820559] x13: 00000000769dde98 x12: ffff8000bf3f65a8 [ 0.820564] x11: 0000000000000000 x10: ffff00000958f848 [ 0.820567] x9 : ffff000009592000 x8 : ffff00000958f848 [ 0.820571] x7 : ffff00000818ffa0 x6 : 0000000000000000 [ 0.820574] x5 : 0000000000000000 x4 : 0000000000000001 [ 0.820578] x3 : 0000000000000000 x2 : 0000000000000001 [ 0.820582] x1 : 00000000ffffffff x0 : 0000000000000000 [ 0.820587] Call trace: [ 0.820591] lockdep_assert_cpus_held+0x3c/0x48 [ 0.820598] static_key_enable_cpuslocked+0x28/0xd0 [ 0.820606] arch_timer_check_ool_workaround+0xe8/0x228 [ 0.820610] arch_timer_starting_cpu+0xe4/0x2d8 [ 0.820615] cpuhp_invoke_callback+0xe8/0xd08 [ 0.820619] notify_cpu_starting+0x80/0xb8 [ 0.820625] secondary_start_kernel+0x118/0x1d0 We've also had a similar warning in sched_init_smp() for every asymmetric system that would enable the sched_asym_cpucapacity static key, although that was singled out in: commit 40fa3780bac2 ("sched/core: Take the hotplug lock in sched_init_smp()") Those warnings are actually harmless, since we cannot have hotplug operations at the time they appear. Instead of starting to sprinkle useless hotplug lock operations in the init codepaths, mute the warnings until they start warning about real problems. Suggested-by: Peter Zijlstra Signed-off-by: Valentin Schneider Signed-off-by: Peter Zijlstra (Intel) Cc: Andrew Morton Cc: Linus Torvalds Cc: Paul E. McKenney Cc: Thomas Gleixner Cc: Will Deacon Cc: cai@gmx.us Cc: daniel.lezcano@linaro.org Cc: dietmar.eggemann@arm.com Cc: linux-arm-kernel@lists.infradead.org Cc: longman@redhat.com Cc: marc.zyngier@arm.com Cc: mark.rutland@arm.com Link: https://lkml.kernel.org/r/1545243796-23224-2-git-send-email-valentin.schneider@arm.com Signed-off-by: Ingo Molnar --- kernel/cpu.c | 9 +++++++++ 1 file changed, 9 insertions(+) (limited to 'kernel') diff --git a/kernel/cpu.c b/kernel/cpu.c index 91d5c38eb7e5..34e40ef2f6e0 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -313,6 +313,15 @@ void cpus_write_unlock(void) void lockdep_assert_cpus_held(void) { + /* + * We can't have hotplug operations before userspace starts running, + * and some init codepaths will knowingly not take the hotplug lock. + * This is all valid, so mute lockdep until it makes sense to report + * unheld locks. + */ + if (system_state < SYSTEM_RUNNING) + return; + percpu_rwsem_assert_held(&cpu_hotplug_lock); } -- cgit v1.3-7-g2ca7 From b5a4e2bb0f4c86bfeb38df3e1d5b2f1272f0e673 Mon Sep 17 00:00:00 2001 From: Valentin Schneider Date: Wed, 19 Dec 2018 18:23:16 +0000 Subject: Revert "sched/core: Take the hotplug lock in sched_init_smp()" This reverts commit 40fa3780bac2b654edf23f6b13f4e2dd550aea10. Now that we have a system-wide muting of hotplug lockdep during init, this is no longer needed. Signed-off-by: Valentin Schneider Signed-off-by: Peter Zijlstra (Intel) Cc: Andrew Morton Cc: Linus Torvalds Cc: Paul E. McKenney Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Will Deacon Cc: cai@gmx.us Cc: daniel.lezcano@linaro.org Cc: dietmar.eggemann@arm.com Cc: linux-arm-kernel@lists.infradead.org Cc: longman@redhat.com Cc: marc.zyngier@arm.com Cc: mark.rutland@arm.com Link: https://lkml.kernel.org/r/1545243796-23224-3-git-send-email-valentin.schneider@arm.com Signed-off-by: Ingo Molnar --- kernel/sched/core.c | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index b05eef7d7a1f..3c8b4dba3d2d 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5867,14 +5867,11 @@ void __init sched_init_smp(void) /* * There's no userspace yet to cause hotplug operations; hence all the * CPU masks are stable and all blatant races in the below code cannot - * happen. The hotplug lock is nevertheless taken to satisfy lockdep, - * but there won't be any contention on it. + * happen. */ - cpus_read_lock(); mutex_lock(&sched_domains_mutex); sched_init_domains(cpu_active_mask); mutex_unlock(&sched_domains_mutex); - cpus_read_unlock(); /* Move init over to a non-isolated CPU */ if (set_cpus_allowed_ptr(current, housekeeping_cpumask(HK_FLAG_DOMAIN)) < 0) -- cgit v1.3-7-g2ca7 From 436a49ae7b693161c4fdf98b575ef16243dc2dfa Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Fri, 28 Dec 2018 06:02:00 +0100 Subject: locking/lockdep: Simplify mark_held_locks() The enum mark_type appears a bit artificial here. We can directly pass the base enum lock_usage_bit value to mark_held_locks(). All we need then is to add the read index for each lock if necessary. It makes the code clearer. Signed-off-by: Frederic Weisbecker Signed-off-by: Peter Zijlstra (Intel) Cc: Andrew Morton Cc: Linus Torvalds Cc: Paul E. McKenney Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Will Deacon Link: https://lkml.kernel.org/r/1545973321-24422-2-git-send-email-frederic@kernel.org Signed-off-by: Ingo Molnar --- kernel/locking/lockdep.c | 23 ++++++++--------------- 1 file changed, 8 insertions(+), 15 deletions(-) (limited to 'kernel') diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index e805fe3bf87f..1dcd8341e35b 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -2709,35 +2709,28 @@ mark_lock_irq(struct task_struct *curr, struct held_lock *this, return 1; } -enum mark_type { -#define LOCKDEP_STATE(__STATE) __STATE, -#include "lockdep_states.h" -#undef LOCKDEP_STATE -}; - /* * Mark all held locks with a usage bit: */ static int -mark_held_locks(struct task_struct *curr, enum mark_type mark) +mark_held_locks(struct task_struct *curr, enum lock_usage_bit base_bit) { - enum lock_usage_bit usage_bit; struct held_lock *hlock; int i; for (i = 0; i < curr->lockdep_depth; i++) { + enum lock_usage_bit hlock_bit = base_bit; hlock = curr->held_locks + i; - usage_bit = 2 + (mark << 2); /* ENABLED */ if (hlock->read) - usage_bit += 1; /* READ */ + hlock_bit += 1; /* READ */ - BUG_ON(usage_bit >= LOCK_USAGE_STATES); + BUG_ON(hlock_bit >= LOCK_USAGE_STATES); if (!hlock->check) continue; - if (!mark_lock(curr, hlock, usage_bit)) + if (!mark_lock(curr, hlock, hlock_bit)) return 0; } @@ -2758,7 +2751,7 @@ static void __trace_hardirqs_on_caller(unsigned long ip) * We are going to turn hardirqs on, so set the * usage bit for all held locks: */ - if (!mark_held_locks(curr, HARDIRQ)) + if (!mark_held_locks(curr, LOCK_ENABLED_HARDIRQ)) return; /* * If we have softirqs enabled, then set the usage @@ -2766,7 +2759,7 @@ static void __trace_hardirqs_on_caller(unsigned long ip) * this bit from being set before) */ if (curr->softirqs_enabled) - if (!mark_held_locks(curr, SOFTIRQ)) + if (!mark_held_locks(curr, LOCK_ENABLED_SOFTIRQ)) return; curr->hardirq_enable_ip = ip; @@ -2880,7 +2873,7 @@ void trace_softirqs_on(unsigned long ip) * enabled too: */ if (curr->hardirqs_enabled) - mark_held_locks(curr, SOFTIRQ); + mark_held_locks(curr, LOCK_ENABLED_SOFTIRQ); current->lockdep_recursion = 0; } -- cgit v1.3-7-g2ca7 From bba2a8f1f974a45ca6ceaf688b2be7bc1c418a2f Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Fri, 28 Dec 2018 06:02:01 +0100 Subject: locking/lockdep: Provide enum lock_usage_bit mask names It makes the code more self-explanatory and tells throughout the code what magic number refers to: - state (Hardirq/Softirq) - direction (used in or enabled above state) - read or write We can even remove some comments that were compensating for the lack of those constant names. Signed-off-by: Frederic Weisbecker Signed-off-by: Peter Zijlstra (Intel) Cc: Andrew Morton Cc: Linus Torvalds Cc: Paul E. McKenney Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Will Deacon Link: https://lkml.kernel.org/r/1545973321-24422-3-git-send-email-frederic@kernel.org Signed-off-by: Ingo Molnar --- kernel/locking/lockdep.c | 33 +++++++++++---------------------- kernel/locking/lockdep_internals.h | 4 ++++ 2 files changed, 15 insertions(+), 22 deletions(-) (limited to 'kernel') diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 1dcd8341e35b..608f74ed8bb9 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -1624,29 +1624,18 @@ static const char *state_rnames[] = { static inline const char *state_name(enum lock_usage_bit bit) { - return (bit & 1) ? state_rnames[bit >> 2] : state_names[bit >> 2]; + return (bit & LOCK_USAGE_READ_MASK) ? state_rnames[bit >> 2] : state_names[bit >> 2]; } static int exclusive_bit(int new_bit) { - /* - * USED_IN - * USED_IN_READ - * ENABLED - * ENABLED_READ - * - * bit 0 - write/read - * bit 1 - used_in/enabled - * bit 2+ state - */ - - int state = new_bit & ~3; - int dir = new_bit & 2; + int state = new_bit & LOCK_USAGE_STATE_MASK; + int dir = new_bit & LOCK_USAGE_DIR_MASK; /* * keep state, bit flip the direction and strip read. */ - return state | (dir ^ 2); + return state | (dir ^ LOCK_USAGE_DIR_MASK); } static int check_irq_usage(struct task_struct *curr, struct held_lock *prev, @@ -2662,8 +2651,8 @@ mark_lock_irq(struct task_struct *curr, struct held_lock *this, enum lock_usage_bit new_bit) { int excl_bit = exclusive_bit(new_bit); - int read = new_bit & 1; - int dir = new_bit & 2; + int read = new_bit & LOCK_USAGE_READ_MASK; + int dir = new_bit & LOCK_USAGE_DIR_MASK; /* * mark USED_IN has to look forwards -- to ensure no dependency @@ -2687,19 +2676,19 @@ mark_lock_irq(struct task_struct *curr, struct held_lock *this, * states. */ if ((!read || !dir || STRICT_READ_CHECKS) && - !usage(curr, this, excl_bit, state_name(new_bit & ~1))) + !usage(curr, this, excl_bit, state_name(new_bit & ~LOCK_USAGE_READ_MASK))) return 0; /* * Check for read in write conflicts */ if (!read) { - if (!valid_state(curr, this, new_bit, excl_bit + 1)) + if (!valid_state(curr, this, new_bit, excl_bit + LOCK_USAGE_READ_MASK)) return 0; if (STRICT_READ_CHECKS && - !usage(curr, this, excl_bit + 1, - state_name(new_bit + 1))) + !usage(curr, this, excl_bit + LOCK_USAGE_READ_MASK, + state_name(new_bit + LOCK_USAGE_READ_MASK))) return 0; } @@ -2723,7 +2712,7 @@ mark_held_locks(struct task_struct *curr, enum lock_usage_bit base_bit) hlock = curr->held_locks + i; if (hlock->read) - hlock_bit += 1; /* READ */ + hlock_bit += LOCK_USAGE_READ_MASK; BUG_ON(hlock_bit >= LOCK_USAGE_STATES); diff --git a/kernel/locking/lockdep_internals.h b/kernel/locking/lockdep_internals.h index 88c847a41c8a..2ebb9d0ea91c 100644 --- a/kernel/locking/lockdep_internals.h +++ b/kernel/locking/lockdep_internals.h @@ -22,6 +22,10 @@ enum lock_usage_bit { LOCK_USAGE_STATES }; +#define LOCK_USAGE_READ_MASK 1 +#define LOCK_USAGE_DIR_MASK 2 +#define LOCK_USAGE_STATE_MASK (~(LOCK_USAGE_READ_MASK | LOCK_USAGE_DIR_MASK)) + /* * Usage-state bitmasks: */ -- cgit v1.3-7-g2ca7 From 5620196951192f7cd2da0a04e7c0113f40bfc14e Mon Sep 17 00:00:00 2001 From: Arnaldo Carvalho de Melo Date: Fri, 11 Jan 2019 13:20:20 -0300 Subject: perf: Make perf_event_output() propagate the output() return For the original mode of operation it isn't needed, since we report back errors via PERF_RECORD_LOST records in the ring buffer, but for use in bpf_perf_event_output() it is convenient to return the errors, basically -ENOSPC. Currently bpf_perf_event_output() returns an error indication, the last thing it does, which is to push it to the ring buffer is that can fail and if so, this failure won't be reported back to its users, fix it. Reported-by: Jamal Hadi Salim Tested-by: Jamal Hadi Salim Acked-by: Peter Zijlstra (Intel) Cc: Adrian Hunter Cc: Alexei Starovoitov Cc: Daniel Borkmann Cc: Jiri Olsa Cc: Namhyung Kim Link: https://lkml.kernel.org/r/20190118150938.GN5823@kernel.org Signed-off-by: Arnaldo Carvalho de Melo --- include/linux/perf_event.h | 6 +++--- kernel/events/core.c | 11 +++++++---- kernel/trace/bpf_trace.c | 3 +-- tools/perf/examples/bpf/augmented_raw_syscalls.c | 4 ++-- tools/perf/examples/bpf/augmented_syscalls.c | 14 +++++++------- tools/perf/examples/bpf/etcsnoop.c | 10 +++++----- 6 files changed, 25 insertions(+), 23 deletions(-) (limited to 'kernel') diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index f8ec36197718..4eb88065a9b5 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -978,9 +978,9 @@ extern void perf_event_output_forward(struct perf_event *event, extern void perf_event_output_backward(struct perf_event *event, struct perf_sample_data *data, struct pt_regs *regs); -extern void perf_event_output(struct perf_event *event, - struct perf_sample_data *data, - struct pt_regs *regs); +extern int perf_event_output(struct perf_event *event, + struct perf_sample_data *data, + struct pt_regs *regs); static inline bool is_default_overflow_handler(struct perf_event *event) diff --git a/kernel/events/core.c b/kernel/events/core.c index fbe59b793b36..bc525cd1615c 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -6489,7 +6489,7 @@ void perf_prepare_sample(struct perf_event_header *header, data->phys_addr = perf_virt_to_phys(data->addr); } -static __always_inline void +static __always_inline int __perf_event_output(struct perf_event *event, struct perf_sample_data *data, struct pt_regs *regs, @@ -6499,13 +6499,15 @@ __perf_event_output(struct perf_event *event, { struct perf_output_handle handle; struct perf_event_header header; + int err; /* protect the callchain buffers */ rcu_read_lock(); perf_prepare_sample(&header, data, event, regs); - if (output_begin(&handle, event, header.size)) + err = output_begin(&handle, event, header.size); + if (err) goto exit; perf_output_sample(&handle, &header, data, event); @@ -6514,6 +6516,7 @@ __perf_event_output(struct perf_event *event, exit: rcu_read_unlock(); + return err; } void @@ -6532,12 +6535,12 @@ perf_event_output_backward(struct perf_event *event, __perf_event_output(event, data, regs, perf_output_begin_backward); } -void +int perf_event_output(struct perf_event *event, struct perf_sample_data *data, struct pt_regs *regs) { - __perf_event_output(event, data, regs, perf_output_begin); + return __perf_event_output(event, data, regs, perf_output_begin); } /* diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index 8b068adb9da1..088c2032ceaf 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -431,8 +431,7 @@ __bpf_perf_event_output(struct pt_regs *regs, struct bpf_map *map, if (unlikely(event->oncpu != cpu)) return -EOPNOTSUPP; - perf_event_output(event, sd, regs); - return 0; + return perf_event_output(event, sd, regs); } BPF_CALL_5(bpf_perf_event_output, struct pt_regs *, regs, struct bpf_map *, map, diff --git a/tools/perf/examples/bpf/augmented_raw_syscalls.c b/tools/perf/examples/bpf/augmented_raw_syscalls.c index 53c233370fae..9e9d4c66e53c 100644 --- a/tools/perf/examples/bpf/augmented_raw_syscalls.c +++ b/tools/perf/examples/bpf/augmented_raw_syscalls.c @@ -141,8 +141,8 @@ int sys_enter(struct syscall_enter_args *args) len = sizeof(augmented_args.args); } - perf_event_output(args, &__augmented_syscalls__, BPF_F_CURRENT_CPU, &augmented_args, len); - return 0; + /* If perf_event_output fails, return non-zero so that it gets recorded unaugmented */ + return perf_event_output(args, &__augmented_syscalls__, BPF_F_CURRENT_CPU, &augmented_args, len); } SEC("raw_syscalls:sys_exit") diff --git a/tools/perf/examples/bpf/augmented_syscalls.c b/tools/perf/examples/bpf/augmented_syscalls.c index 2ae44813ef2d..b7dba114e36c 100644 --- a/tools/perf/examples/bpf/augmented_syscalls.c +++ b/tools/perf/examples/bpf/augmented_syscalls.c @@ -55,9 +55,9 @@ int syscall_enter(syscall)(struct syscall_enter_##syscall##_args *args) \ len -= sizeof(augmented_args.filename.value) - augmented_args.filename.size; \ len &= sizeof(augmented_args.filename.value) - 1; \ } \ - perf_event_output(args, &__augmented_syscalls__, BPF_F_CURRENT_CPU, \ - &augmented_args, len); \ - return 0; \ + /* If perf_event_output fails, return non-zero so that it gets recorded unaugmented */ \ + return perf_event_output(args, &__augmented_syscalls__, BPF_F_CURRENT_CPU, \ + &augmented_args, len); \ } \ int syscall_exit(syscall)(struct syscall_exit_args *args) \ { \ @@ -125,10 +125,10 @@ int syscall_enter(syscall)(struct syscall_enter_##syscall##_args *args) \ /* addrlen = augmented_args.args.addrlen; */ \ /* */ \ probe_read(&augmented_args.addr, addrlen, args->addr_ptr); \ - perf_event_output(args, &__augmented_syscalls__, BPF_F_CURRENT_CPU, \ - &augmented_args, \ - sizeof(augmented_args) - sizeof(augmented_args.addr) + addrlen); \ - return 0; \ + /* If perf_event_output fails, return non-zero so that it gets recorded unaugmented */ \ + return perf_event_output(args, &__augmented_syscalls__, BPF_F_CURRENT_CPU, \ + &augmented_args, \ + sizeof(augmented_args) - sizeof(augmented_args.addr) + addrlen);\ } \ int syscall_exit(syscall)(struct syscall_exit_args *args) \ { \ diff --git a/tools/perf/examples/bpf/etcsnoop.c b/tools/perf/examples/bpf/etcsnoop.c index b59e8812ee8c..550e69c2e8d1 100644 --- a/tools/perf/examples/bpf/etcsnoop.c +++ b/tools/perf/examples/bpf/etcsnoop.c @@ -49,11 +49,11 @@ int syscall_enter(syscall)(struct syscall_enter_##syscall##_args *args) \ args->filename_ptr); \ if (__builtin_memcmp(augmented_args.filename.value, etc, 4) != 0) \ return 0; \ - perf_event_output(args, &__augmented_syscalls__, BPF_F_CURRENT_CPU, \ - &augmented_args, \ - (sizeof(augmented_args) - sizeof(augmented_args.filename.value) + \ - augmented_args.filename.size)); \ - return 0; \ + /* If perf_event_output fails, return non-zero so that it gets recorded unaugmented */ \ + return perf_event_output(args, &__augmented_syscalls__, BPF_F_CURRENT_CPU, \ + &augmented_args, \ + (sizeof(augmented_args) - sizeof(augmented_args.filename.value) + \ + augmented_args.filename.size)); \ } struct syscall_enter_openat_args { -- cgit v1.3-7-g2ca7 From 76193a94522f1d4edf2447a536f3f796ce56343b Mon Sep 17 00:00:00 2001 From: Song Liu Date: Thu, 17 Jan 2019 08:15:13 -0800 Subject: perf, bpf: Introduce PERF_RECORD_KSYMBOL For better performance analysis of dynamically JITed and loaded kernel functions, such as BPF programs, this patch introduces PERF_RECORD_KSYMBOL, a new perf_event_type that exposes kernel symbol register/unregister information to user space. The following data structure is used for PERF_RECORD_KSYMBOL. /* * struct { * struct perf_event_header header; * u64 addr; * u32 len; * u16 ksym_type; * u16 flags; * char name[]; * struct sample_id sample_id; * }; */ Signed-off-by: Song Liu Reviewed-by: Arnaldo Carvalho de Melo Tested-by: Arnaldo Carvalho de Melo Acked-by: Peter Zijlstra Cc: Alexei Starovoitov Cc: Daniel Borkmann Cc: Peter Zijlstra Cc: kernel-team@fb.com Cc: netdev@vger.kernel.org Link: http://lkml.kernel.org/r/20190117161521.1341602-2-songliubraving@fb.com Signed-off-by: Arnaldo Carvalho de Melo --- include/linux/perf_event.h | 8 ++++ include/uapi/linux/perf_event.h | 26 ++++++++++- kernel/events/core.c | 98 ++++++++++++++++++++++++++++++++++++++++- 3 files changed, 130 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index 4eb88065a9b5..136fe0495374 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -1122,6 +1122,10 @@ static inline void perf_event_task_sched_out(struct task_struct *prev, } extern void perf_event_mmap(struct vm_area_struct *vma); + +extern void perf_event_ksymbol(u16 ksym_type, u64 addr, u32 len, + bool unregister, const char *sym); + extern struct perf_guest_info_callbacks *perf_guest_cbs; extern int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *callbacks); extern int perf_unregister_guest_info_callbacks(struct perf_guest_info_callbacks *callbacks); @@ -1342,6 +1346,10 @@ static inline int perf_unregister_guest_info_callbacks (struct perf_guest_info_callbacks *callbacks) { return 0; } static inline void perf_event_mmap(struct vm_area_struct *vma) { } + +typedef int (perf_ksymbol_get_name_f)(char *name, int name_len, void *data); +static inline void perf_event_ksymbol(u16 ksym_type, u64 addr, u32 len, + bool unregister, const char *sym) { } static inline void perf_event_exec(void) { } static inline void perf_event_comm(struct task_struct *tsk, bool exec) { } static inline void perf_event_namespaces(struct task_struct *tsk) { } diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h index ea19b5d491bf..1dee5c8f166b 100644 --- a/include/uapi/linux/perf_event.h +++ b/include/uapi/linux/perf_event.h @@ -372,7 +372,8 @@ struct perf_event_attr { context_switch : 1, /* context switch data */ write_backward : 1, /* Write ring buffer from end to beginning */ namespaces : 1, /* include namespaces data */ - __reserved_1 : 35; + ksymbol : 1, /* include ksymbol events */ + __reserved_1 : 34; union { __u32 wakeup_events; /* wakeup every n events */ @@ -963,9 +964,32 @@ enum perf_event_type { */ PERF_RECORD_NAMESPACES = 16, + /* + * Record ksymbol register/unregister events: + * + * struct { + * struct perf_event_header header; + * u64 addr; + * u32 len; + * u16 ksym_type; + * u16 flags; + * char name[]; + * struct sample_id sample_id; + * }; + */ + PERF_RECORD_KSYMBOL = 17, + PERF_RECORD_MAX, /* non-ABI */ }; +enum perf_record_ksymbol_type { + PERF_RECORD_KSYMBOL_TYPE_UNKNOWN = 0, + PERF_RECORD_KSYMBOL_TYPE_BPF = 1, + PERF_RECORD_KSYMBOL_TYPE_MAX /* non-ABI */ +}; + +#define PERF_RECORD_KSYMBOL_FLAGS_UNREGISTER (1 << 0) + #define PERF_MAX_STACK_DEPTH 127 #define PERF_MAX_CONTEXTS_PER_STACK 8 diff --git a/kernel/events/core.c b/kernel/events/core.c index bc525cd1615c..e04ab5f325cf 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -385,6 +385,7 @@ static atomic_t nr_namespaces_events __read_mostly; static atomic_t nr_task_events __read_mostly; static atomic_t nr_freq_events __read_mostly; static atomic_t nr_switch_events __read_mostly; +static atomic_t nr_ksymbol_events __read_mostly; static LIST_HEAD(pmus); static DEFINE_MUTEX(pmus_lock); @@ -4235,7 +4236,7 @@ static bool is_sb_event(struct perf_event *event) if (attr->mmap || attr->mmap_data || attr->mmap2 || attr->comm || attr->comm_exec || - attr->task || + attr->task || attr->ksymbol || attr->context_switch) return true; return false; @@ -4305,6 +4306,8 @@ static void unaccount_event(struct perf_event *event) dec = true; if (has_branch_stack(event)) dec = true; + if (event->attr.ksymbol) + atomic_dec(&nr_ksymbol_events); if (dec) { if (!atomic_add_unless(&perf_sched_count, -1, 1)) @@ -7653,6 +7656,97 @@ static void perf_log_throttle(struct perf_event *event, int enable) perf_output_end(&handle); } +/* + * ksymbol register/unregister tracking + */ + +struct perf_ksymbol_event { + const char *name; + int name_len; + struct { + struct perf_event_header header; + u64 addr; + u32 len; + u16 ksym_type; + u16 flags; + } event_id; +}; + +static int perf_event_ksymbol_match(struct perf_event *event) +{ + return event->attr.ksymbol; +} + +static void perf_event_ksymbol_output(struct perf_event *event, void *data) +{ + struct perf_ksymbol_event *ksymbol_event = data; + struct perf_output_handle handle; + struct perf_sample_data sample; + int ret; + + if (!perf_event_ksymbol_match(event)) + return; + + perf_event_header__init_id(&ksymbol_event->event_id.header, + &sample, event); + ret = perf_output_begin(&handle, event, + ksymbol_event->event_id.header.size); + if (ret) + return; + + perf_output_put(&handle, ksymbol_event->event_id); + __output_copy(&handle, ksymbol_event->name, ksymbol_event->name_len); + perf_event__output_id_sample(event, &handle, &sample); + + perf_output_end(&handle); +} + +void perf_event_ksymbol(u16 ksym_type, u64 addr, u32 len, bool unregister, + const char *sym) +{ + struct perf_ksymbol_event ksymbol_event; + char name[KSYM_NAME_LEN]; + u16 flags = 0; + int name_len; + + if (!atomic_read(&nr_ksymbol_events)) + return; + + if (ksym_type >= PERF_RECORD_KSYMBOL_TYPE_MAX || + ksym_type == PERF_RECORD_KSYMBOL_TYPE_UNKNOWN) + goto err; + + strlcpy(name, sym, KSYM_NAME_LEN); + name_len = strlen(name) + 1; + while (!IS_ALIGNED(name_len, sizeof(u64))) + name[name_len++] = '\0'; + BUILD_BUG_ON(KSYM_NAME_LEN % sizeof(u64)); + + if (unregister) + flags |= PERF_RECORD_KSYMBOL_FLAGS_UNREGISTER; + + ksymbol_event = (struct perf_ksymbol_event){ + .name = name, + .name_len = name_len, + .event_id = { + .header = { + .type = PERF_RECORD_KSYMBOL, + .size = sizeof(ksymbol_event.event_id) + + name_len, + }, + .addr = addr, + .len = len, + .ksym_type = ksym_type, + .flags = flags, + }, + }; + + perf_iterate_sb(perf_event_ksymbol_output, &ksymbol_event, NULL); + return; +err: + WARN_ONCE(1, "%s: Invalid KSYMBOL type 0x%x\n", __func__, ksym_type); +} + void perf_event_itrace_started(struct perf_event *event) { event->attach_state |= PERF_ATTACH_ITRACE; @@ -9912,6 +10006,8 @@ static void account_event(struct perf_event *event) inc = true; if (is_cgroup_event(event)) inc = true; + if (event->attr.ksymbol) + atomic_inc(&nr_ksymbol_events); if (inc) { /* -- cgit v1.3-7-g2ca7 From 6ee52e2a3fe4ea35520720736e6791df1fb67106 Mon Sep 17 00:00:00 2001 From: Song Liu Date: Thu, 17 Jan 2019 08:15:15 -0800 Subject: perf, bpf: Introduce PERF_RECORD_BPF_EVENT For better performance analysis of BPF programs, this patch introduces PERF_RECORD_BPF_EVENT, a new perf_event_type that exposes BPF program load/unload information to user space. Each BPF program may contain up to BPF_MAX_SUBPROGS (256) sub programs. The following example shows kernel symbols for a BPF program with 7 sub programs: ffffffffa0257cf9 t bpf_prog_b07ccb89267cf242_F ffffffffa02592e1 t bpf_prog_2dcecc18072623fc_F ffffffffa025b0e9 t bpf_prog_bb7a405ebaec5d5c_F ffffffffa025dd2c t bpf_prog_a7540d4a39ec1fc7_F ffffffffa025fcca t bpf_prog_05762d4ade0e3737_F ffffffffa026108f t bpf_prog_db4bd11e35df90d4_F ffffffffa0263f00 t bpf_prog_89d64e4abf0f0126_F ffffffffa0257cf9 t bpf_prog_ae31629322c4b018__dummy_tracepoi When a bpf program is loaded, PERF_RECORD_KSYMBOL is generated for each of these sub programs. Therefore, PERF_RECORD_BPF_EVENT is not needed for simple profiling. For annotation, user space need to listen to PERF_RECORD_BPF_EVENT and gather more information about these (sub) programs via sys_bpf. Signed-off-by: Song Liu Reviewed-by: Arnaldo Carvalho de Melo Acked-by: Alexei Starovoitov Acked-by: Peter Zijlstra (Intel) Tested-by: Arnaldo Carvalho de Melo Cc: Daniel Borkmann Cc: Peter Zijlstra Cc: kernel-team@fb.com Cc: netdev@vger.kernel.org Link: http://lkml.kernel.org/r/20190117161521.1341602-4-songliubraving@fb.com Signed-off-by: Arnaldo Carvalho de Melo --- include/linux/filter.h | 7 +++ include/linux/perf_event.h | 6 +++ include/uapi/linux/perf_event.h | 29 +++++++++- kernel/bpf/core.c | 2 +- kernel/bpf/syscall.c | 2 + kernel/events/core.c | 115 ++++++++++++++++++++++++++++++++++++++++ 6 files changed, 159 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/include/linux/filter.h b/include/linux/filter.h index ad106d845b22..d531d4250bff 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -951,6 +951,7 @@ bpf_address_lookup(unsigned long addr, unsigned long *size, void bpf_prog_kallsyms_add(struct bpf_prog *fp); void bpf_prog_kallsyms_del(struct bpf_prog *fp); +void bpf_get_prog_name(const struct bpf_prog *prog, char *sym); #else /* CONFIG_BPF_JIT */ @@ -1006,6 +1007,12 @@ static inline void bpf_prog_kallsyms_add(struct bpf_prog *fp) static inline void bpf_prog_kallsyms_del(struct bpf_prog *fp) { } + +static inline void bpf_get_prog_name(const struct bpf_prog *prog, char *sym) +{ + sym[0] = '\0'; +} + #endif /* CONFIG_BPF_JIT */ void bpf_prog_kallsyms_del_subprogs(struct bpf_prog *fp); diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index 136fe0495374..a79e59fc3b7d 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -1125,6 +1125,9 @@ extern void perf_event_mmap(struct vm_area_struct *vma); extern void perf_event_ksymbol(u16 ksym_type, u64 addr, u32 len, bool unregister, const char *sym); +extern void perf_event_bpf_event(struct bpf_prog *prog, + enum perf_bpf_event_type type, + u16 flags); extern struct perf_guest_info_callbacks *perf_guest_cbs; extern int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *callbacks); @@ -1350,6 +1353,9 @@ static inline void perf_event_mmap(struct vm_area_struct *vma) { } typedef int (perf_ksymbol_get_name_f)(char *name, int name_len, void *data); static inline void perf_event_ksymbol(u16 ksym_type, u64 addr, u32 len, bool unregister, const char *sym) { } +static inline void perf_event_bpf_event(struct bpf_prog *prog, + enum perf_bpf_event_type type, + u16 flags) { } static inline void perf_event_exec(void) { } static inline void perf_event_comm(struct task_struct *tsk, bool exec) { } static inline void perf_event_namespaces(struct task_struct *tsk) { } diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h index 1dee5c8f166b..7198ddd0c6b1 100644 --- a/include/uapi/linux/perf_event.h +++ b/include/uapi/linux/perf_event.h @@ -373,7 +373,8 @@ struct perf_event_attr { write_backward : 1, /* Write ring buffer from end to beginning */ namespaces : 1, /* include namespaces data */ ksymbol : 1, /* include ksymbol events */ - __reserved_1 : 34; + bpf_event : 1, /* include bpf events */ + __reserved_1 : 33; union { __u32 wakeup_events; /* wakeup every n events */ @@ -979,6 +980,25 @@ enum perf_event_type { */ PERF_RECORD_KSYMBOL = 17, + /* + * Record bpf events: + * enum perf_bpf_event_type { + * PERF_BPF_EVENT_UNKNOWN = 0, + * PERF_BPF_EVENT_PROG_LOAD = 1, + * PERF_BPF_EVENT_PROG_UNLOAD = 2, + * }; + * + * struct { + * struct perf_event_header header; + * u16 type; + * u16 flags; + * u32 id; + * u8 tag[BPF_TAG_SIZE]; + * struct sample_id sample_id; + * }; + */ + PERF_RECORD_BPF_EVENT = 18, + PERF_RECORD_MAX, /* non-ABI */ }; @@ -990,6 +1010,13 @@ enum perf_record_ksymbol_type { #define PERF_RECORD_KSYMBOL_FLAGS_UNREGISTER (1 << 0) +enum perf_bpf_event_type { + PERF_BPF_EVENT_UNKNOWN = 0, + PERF_BPF_EVENT_PROG_LOAD = 1, + PERF_BPF_EVENT_PROG_UNLOAD = 2, + PERF_BPF_EVENT_MAX, /* non-ABI */ +}; + #define PERF_MAX_STACK_DEPTH 127 #define PERF_MAX_CONTEXTS_PER_STACK 8 diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index f908b9356025..19c49313c709 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -495,7 +495,7 @@ bpf_get_prog_addr_region(const struct bpf_prog *prog, *symbol_end = addr + hdr->pages * PAGE_SIZE; } -static void bpf_get_prog_name(const struct bpf_prog *prog, char *sym) +void bpf_get_prog_name(const struct bpf_prog *prog, char *sym) { const char *end = sym + KSYM_NAME_LEN; const struct btf_type *type; diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index b155cd17c1bd..30ebd085790b 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -1211,6 +1211,7 @@ static void __bpf_prog_put_rcu(struct rcu_head *rcu) static void __bpf_prog_put(struct bpf_prog *prog, bool do_idr_lock) { if (atomic_dec_and_test(&prog->aux->refcnt)) { + perf_event_bpf_event(prog, PERF_BPF_EVENT_PROG_UNLOAD, 0); /* bpf_prog_free_id() must be called first */ bpf_prog_free_id(prog, do_idr_lock); bpf_prog_kallsyms_del_all(prog); @@ -1554,6 +1555,7 @@ static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr) } bpf_prog_kallsyms_add(prog); + perf_event_bpf_event(prog, PERF_BPF_EVENT_PROG_LOAD, 0); return err; free_used_maps: diff --git a/kernel/events/core.c b/kernel/events/core.c index e04ab5f325cf..236bb8ddb7bc 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -386,6 +386,7 @@ static atomic_t nr_task_events __read_mostly; static atomic_t nr_freq_events __read_mostly; static atomic_t nr_switch_events __read_mostly; static atomic_t nr_ksymbol_events __read_mostly; +static atomic_t nr_bpf_events __read_mostly; static LIST_HEAD(pmus); static DEFINE_MUTEX(pmus_lock); @@ -4308,6 +4309,8 @@ static void unaccount_event(struct perf_event *event) dec = true; if (event->attr.ksymbol) atomic_dec(&nr_ksymbol_events); + if (event->attr.bpf_event) + atomic_dec(&nr_bpf_events); if (dec) { if (!atomic_add_unless(&perf_sched_count, -1, 1)) @@ -7747,6 +7750,116 @@ err: WARN_ONCE(1, "%s: Invalid KSYMBOL type 0x%x\n", __func__, ksym_type); } +/* + * bpf program load/unload tracking + */ + +struct perf_bpf_event { + struct bpf_prog *prog; + struct { + struct perf_event_header header; + u16 type; + u16 flags; + u32 id; + u8 tag[BPF_TAG_SIZE]; + } event_id; +}; + +static int perf_event_bpf_match(struct perf_event *event) +{ + return event->attr.bpf_event; +} + +static void perf_event_bpf_output(struct perf_event *event, void *data) +{ + struct perf_bpf_event *bpf_event = data; + struct perf_output_handle handle; + struct perf_sample_data sample; + int ret; + + if (!perf_event_bpf_match(event)) + return; + + perf_event_header__init_id(&bpf_event->event_id.header, + &sample, event); + ret = perf_output_begin(&handle, event, + bpf_event->event_id.header.size); + if (ret) + return; + + perf_output_put(&handle, bpf_event->event_id); + perf_event__output_id_sample(event, &handle, &sample); + + perf_output_end(&handle); +} + +static void perf_event_bpf_emit_ksymbols(struct bpf_prog *prog, + enum perf_bpf_event_type type) +{ + bool unregister = type == PERF_BPF_EVENT_PROG_UNLOAD; + char sym[KSYM_NAME_LEN]; + int i; + + if (prog->aux->func_cnt == 0) { + bpf_get_prog_name(prog, sym); + perf_event_ksymbol(PERF_RECORD_KSYMBOL_TYPE_BPF, + (u64)(unsigned long)prog->bpf_func, + prog->jited_len, unregister, sym); + } else { + for (i = 0; i < prog->aux->func_cnt; i++) { + struct bpf_prog *subprog = prog->aux->func[i]; + + bpf_get_prog_name(subprog, sym); + perf_event_ksymbol( + PERF_RECORD_KSYMBOL_TYPE_BPF, + (u64)(unsigned long)subprog->bpf_func, + subprog->jited_len, unregister, sym); + } + } +} + +void perf_event_bpf_event(struct bpf_prog *prog, + enum perf_bpf_event_type type, + u16 flags) +{ + struct perf_bpf_event bpf_event; + + if (type <= PERF_BPF_EVENT_UNKNOWN || + type >= PERF_BPF_EVENT_MAX) + return; + + switch (type) { + case PERF_BPF_EVENT_PROG_LOAD: + case PERF_BPF_EVENT_PROG_UNLOAD: + if (atomic_read(&nr_ksymbol_events)) + perf_event_bpf_emit_ksymbols(prog, type); + break; + default: + break; + } + + if (!atomic_read(&nr_bpf_events)) + return; + + bpf_event = (struct perf_bpf_event){ + .prog = prog, + .event_id = { + .header = { + .type = PERF_RECORD_BPF_EVENT, + .size = sizeof(bpf_event.event_id), + }, + .type = type, + .flags = flags, + .id = prog->aux->id, + }, + }; + + BUILD_BUG_ON(BPF_TAG_SIZE % sizeof(u64)); + + memcpy(bpf_event.event_id.tag, prog->tag, BPF_TAG_SIZE); + perf_iterate_sb(perf_event_bpf_output, &bpf_event, NULL); +} + void perf_event_itrace_started(struct perf_event *event) { event->attach_state |= PERF_ATTACH_ITRACE; @@ -10008,6 +10121,8 @@ static void account_event(struct perf_event *event) inc = true; if (event->attr.ksymbol) atomic_inc(&nr_ksymbol_events); + if (event->attr.bpf_event) + atomic_inc(&nr_bpf_events); if (inc) { /* -- cgit v1.3-7-g2ca7 From 6934058d9fb6c058fb5e5b11cdcb19834e205c91 Mon Sep 17 00:00:00 2001 From: Song Liu Date: Thu, 17 Jan 2019 08:15:21 -0800 Subject: bpf: Add module name [bpf] to ksymbols for bpf programs With this patch, /proc/kallsyms will show BPF programs as t bpf_prog__ [bpf] Signed-off-by: Song Liu Reviewed-by: Arnaldo Carvalho de Melo Tested-by: Arnaldo Carvalho de Melo Acked-by: Peter Zijlstra Cc: Alexei Starovoitov Cc: Daniel Borkmann Cc: Peter Zijlstra Cc: kernel-team@fb.com Cc: netdev@vger.kernel.org Link: http://lkml.kernel.org/r/20190117161521.1341602-10-songliubraving@fb.com Signed-off-by: Arnaldo Carvalho de Melo --- kernel/kallsyms.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/kallsyms.c b/kernel/kallsyms.c index f3a04994e063..14934afa9e68 100644 --- a/kernel/kallsyms.c +++ b/kernel/kallsyms.c @@ -494,7 +494,7 @@ static int get_ksymbol_ftrace_mod(struct kallsym_iter *iter) static int get_ksymbol_bpf(struct kallsym_iter *iter) { - iter->module_name[0] = '\0'; + strlcpy(iter->module_name, "bpf", MODULE_NAME_LEN); iter->exported = 0; return bpf_get_kallsym(iter->pos - iter->pos_ftrace_mod_end, &iter->value, &iter->type, -- cgit v1.3-7-g2ca7 From 659dc4562c1b15e3401ad15338d2109af7c6d718 Mon Sep 17 00:00:00 2001 From: Greg Kroah-Hartman Date: Tue, 22 Jan 2019 16:21:47 +0100 Subject: PM: QoS: no need to check return value of debugfs_create functions When calling debugfs functions, there is no need to ever check the return value. The function can work or not, but the code logic should never do something different based on this. Signed-off-by: Greg Kroah-Hartman Signed-off-by: Rafael J. Wysocki --- kernel/power/qos.c | 8 ++------ 1 file changed, 2 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/power/qos.c b/kernel/power/qos.c index b7a82502857a..9d22131afc1e 100644 --- a/kernel/power/qos.c +++ b/kernel/power/qos.c @@ -582,10 +582,8 @@ static int register_pm_qos_misc(struct pm_qos_object *qos, struct dentry *d) qos->pm_qos_power_miscdev.name = qos->name; qos->pm_qos_power_miscdev.fops = &pm_qos_power_fops; - if (d) { - (void)debugfs_create_file(qos->name, S_IRUGO, d, - (void *)qos, &pm_qos_debug_fops); - } + debugfs_create_file(qos->name, S_IRUGO, d, (void *)qos, + &pm_qos_debug_fops); return misc_register(&qos->pm_qos_power_miscdev); } @@ -685,8 +683,6 @@ static int __init pm_qos_power_init(void) BUILD_BUG_ON(ARRAY_SIZE(pm_qos_array) != PM_QOS_NUM_CLASSES); d = debugfs_create_dir("pm_qos", NULL); - if (IS_ERR_OR_NULL(d)) - d = NULL; for (i = PM_QOS_CPU_DMA_LATENCY; i < PM_QOS_NUM_CLASSES; i++) { ret = register_pm_qos_misc(pm_qos_array[i], d); -- cgit v1.3-7-g2ca7 From 39e83beb9109c662c0e1f79c674f96f94b452786 Mon Sep 17 00:00:00 2001 From: Mathieu Malaterre Date: Mon, 14 Jan 2019 21:28:35 +0100 Subject: capabilities:: annotate implicit fall through There is a plan to build the kernel with -Wimplicit-fallthrough and this place in the code produced a warning (W=1). In this particular case change put the fall through comment on a single line so as to match the regular expression expected by GCC. This commit remove the following warning: kernel/capability.c:95:3: warning: this statement may fall through [-Wimplicit-fallthrough=] Signed-off-by: Mathieu Malaterre Signed-off-by: James Morris --- kernel/capability.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/capability.c b/kernel/capability.c index 7718d7dcadc7..cfbbcb68e11e 100644 --- a/kernel/capability.c +++ b/kernel/capability.c @@ -93,9 +93,7 @@ static int cap_validate_magic(cap_user_header_t header, unsigned *tocopy) break; case _LINUX_CAPABILITY_VERSION_2: warn_deprecated_v2(); - /* - * fall through - v3 is otherwise equivalent to v2. - */ + /* fall through - v3 is otherwise equivalent to v2. */ case _LINUX_CAPABILITY_VERSION_3: *tocopy = _LINUX_CAPABILITY_U32S_3; break; -- cgit v1.3-7-g2ca7 From 9cac42d0645ceb2ca8b815cee04810ec9b0d13b3 Mon Sep 17 00:00:00 2001 From: Quentin Perret Date: Tue, 22 Jan 2019 16:42:47 +0000 Subject: PM / EM: Expose the Energy Model in debugfs MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The recently introduced Energy Model (EM) framework manages power cost tables of CPUs. These tables are currently only visible from kernel space. However, in order to debug the behaviour of subsystems that use the EM (EAS for example), it is often required to know what the power costs are from userspace. For this reason, introduce under /sys/kernel/debug/energy_model a set of directories representing the performance domains of the system. Each performance domain contains a set of sub-directories representing the different capacity states (cs) and their attributes, as well as a file exposing the related CPUs. The resulting hierarchy is as follows on Arm juno r0 for example: /sys/kernel/debug/energy_model ├── pd0 │   ├── cpus │   ├── cs:450000 │   │   ├── cost │   │   ├── frequency │   │   └── power │   ├── cs:575000 │   │   ├── cost │   │   ├── frequency │   │   └── power │   ├── cs:700000 │   │   ├── cost │   │   ├── frequency │   │   └── power │   ├── cs:775000 │   │   ├── cost │   │   ├── frequency │   │   └── power │   └── cs:850000 │   ├── cost │   ├── frequency │   └── power └── pd1 ├── cpus ├── cs:1100000 │   ├── cost │   ├── frequency │   └── power ├── cs:450000 │   ├── cost │   ├── frequency │   └── power ├── cs:625000 │   ├── cost │   ├── frequency │   └── power ├── cs:800000 │   ├── cost │   ├── frequency │   └── power └── cs:950000 ├── cost ├── frequency └── power Signed-off-by: Quentin Perret Reviewed-by: Greg Kroah-Hartman Signed-off-by: Rafael J. Wysocki --- kernel/power/energy_model.c | 57 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 57 insertions(+) (limited to 'kernel') diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c index d9dc2c38764a..7d66ee68aaaf 100644 --- a/kernel/power/energy_model.c +++ b/kernel/power/energy_model.c @@ -10,6 +10,7 @@ #include #include +#include #include #include #include @@ -23,6 +24,60 @@ static DEFINE_PER_CPU(struct em_perf_domain *, em_data); */ static DEFINE_MUTEX(em_pd_mutex); +#ifdef CONFIG_DEBUG_FS +static struct dentry *rootdir; + +static void em_debug_create_cs(struct em_cap_state *cs, struct dentry *pd) +{ + struct dentry *d; + char name[24]; + + snprintf(name, sizeof(name), "cs:%lu", cs->frequency); + + /* Create per-cs directory */ + d = debugfs_create_dir(name, pd); + debugfs_create_ulong("frequency", 0444, d, &cs->frequency); + debugfs_create_ulong("power", 0444, d, &cs->power); + debugfs_create_ulong("cost", 0444, d, &cs->cost); +} + +static int em_debug_cpus_show(struct seq_file *s, void *unused) +{ + seq_printf(s, "%*pbl\n", cpumask_pr_args(to_cpumask(s->private))); + + return 0; +} +DEFINE_SHOW_ATTRIBUTE(em_debug_cpus); + +static void em_debug_create_pd(struct em_perf_domain *pd, int cpu) +{ + struct dentry *d; + char name[8]; + int i; + + snprintf(name, sizeof(name), "pd%d", cpu); + + /* Create the directory of the performance domain */ + d = debugfs_create_dir(name, rootdir); + + debugfs_create_file("cpus", 0444, d, pd->cpus, &em_debug_cpus_fops); + + /* Create a sub-directory for each capacity state */ + for (i = 0; i < pd->nr_cap_states; i++) + em_debug_create_cs(&pd->table[i], d); +} + +static int __init em_debug_init(void) +{ + /* Create /sys/kernel/debug/energy_model directory */ + rootdir = debugfs_create_dir("energy_model", NULL); + + return 0; +} +core_initcall(em_debug_init); +#else /* CONFIG_DEBUG_FS */ +static void em_debug_create_pd(struct em_perf_domain *pd, int cpu) {} +#endif static struct em_perf_domain *em_create_pd(cpumask_t *span, int nr_states, struct em_data_callback *cb) { @@ -102,6 +157,8 @@ static struct em_perf_domain *em_create_pd(cpumask_t *span, int nr_states, pd->nr_cap_states = nr_states; cpumask_copy(to_cpumask(pd->cpus), span); + em_debug_create_pd(pd, cpu); + return pd; free_cs_table: -- cgit v1.3-7-g2ca7 From 2cbd95a5c4fb855a4177c0343a880cc2091f500d Mon Sep 17 00:00:00 2001 From: Jakub Kicinski Date: Tue, 22 Jan 2019 22:45:18 -0800 Subject: bpf: change parameters of call/branch offset adjustment In preparation for code removal change parameters to branch and call adjustment functions to be more universal. The current parameters assume we are patching a single instruction with a longer set. A diagram may help reading the change, this is for the patch single case, patching instruction 1 with a replacement of 4: ____ 0 |____| 1 |____| <-- pos ^ 2 | | <-- end old ^ | 3 | | | delta | len 4 |____| | | (patch region) 5 | | <-- end new v v 6 |____| end_old = pos + 1 end_new = pos + delta + 1 If we are before the patch region - curr variable and the target are fully in old coordinates (hence comparing against end_old). If we are after the region curr is in new coordinates (hence the comparison to end_new) but target is in mixed coordinates, so we just check if it falls before end_new, and if so it needs the adjustment. Note that we will not fix up branches which land in removed region in case of removal, which should be okay, as we are only going to remove dead code. Signed-off-by: Jakub Kicinski Acked-by: Yonghong Song Signed-off-by: Alexei Starovoitov --- kernel/bpf/core.c | 40 +++++++++++++++++++++------------------- 1 file changed, 21 insertions(+), 19 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index f908b9356025..ad08ba341197 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -307,15 +307,16 @@ int bpf_prog_calc_tag(struct bpf_prog *fp) return 0; } -static int bpf_adj_delta_to_imm(struct bpf_insn *insn, u32 pos, u32 delta, - u32 curr, const bool probe_pass) +static int bpf_adj_delta_to_imm(struct bpf_insn *insn, u32 pos, s32 end_old, + s32 end_new, u32 curr, const bool probe_pass) { const s64 imm_min = S32_MIN, imm_max = S32_MAX; + s32 delta = end_new - end_old; s64 imm = insn->imm; - if (curr < pos && curr + imm + 1 > pos) + if (curr < pos && curr + imm + 1 >= end_old) imm += delta; - else if (curr > pos + delta && curr + imm + 1 <= pos + delta) + else if (curr >= end_new && curr + imm + 1 < end_new) imm -= delta; if (imm < imm_min || imm > imm_max) return -ERANGE; @@ -324,15 +325,16 @@ static int bpf_adj_delta_to_imm(struct bpf_insn *insn, u32 pos, u32 delta, return 0; } -static int bpf_adj_delta_to_off(struct bpf_insn *insn, u32 pos, u32 delta, - u32 curr, const bool probe_pass) +static int bpf_adj_delta_to_off(struct bpf_insn *insn, u32 pos, s32 end_old, + s32 end_new, u32 curr, const bool probe_pass) { const s32 off_min = S16_MIN, off_max = S16_MAX; + s32 delta = end_new - end_old; s32 off = insn->off; - if (curr < pos && curr + off + 1 > pos) + if (curr < pos && curr + off + 1 >= end_old) off += delta; - else if (curr > pos + delta && curr + off + 1 <= pos + delta) + else if (curr >= end_new && curr + off + 1 < end_new) off -= delta; if (off < off_min || off > off_max) return -ERANGE; @@ -341,10 +343,10 @@ static int bpf_adj_delta_to_off(struct bpf_insn *insn, u32 pos, u32 delta, return 0; } -static int bpf_adj_branches(struct bpf_prog *prog, u32 pos, u32 delta, - const bool probe_pass) +static int bpf_adj_branches(struct bpf_prog *prog, u32 pos, s32 end_old, + s32 end_new, const bool probe_pass) { - u32 i, insn_cnt = prog->len + (probe_pass ? delta : 0); + u32 i, insn_cnt = prog->len + (probe_pass ? end_new - end_old : 0); struct bpf_insn *insn = prog->insnsi; int ret = 0; @@ -356,8 +358,8 @@ static int bpf_adj_branches(struct bpf_prog *prog, u32 pos, u32 delta, * do any other adjustments. Therefore skip the patchlet. */ if (probe_pass && i == pos) { - i += delta + 1; - insn++; + i = end_new; + insn = prog->insnsi + end_old; } code = insn->code; if (BPF_CLASS(code) != BPF_JMP || @@ -367,11 +369,11 @@ static int bpf_adj_branches(struct bpf_prog *prog, u32 pos, u32 delta, if (BPF_OP(code) == BPF_CALL) { if (insn->src_reg != BPF_PSEUDO_CALL) continue; - ret = bpf_adj_delta_to_imm(insn, pos, delta, i, - probe_pass); + ret = bpf_adj_delta_to_imm(insn, pos, end_old, + end_new, i, probe_pass); } else { - ret = bpf_adj_delta_to_off(insn, pos, delta, i, - probe_pass); + ret = bpf_adj_delta_to_off(insn, pos, end_old, + end_new, i, probe_pass); } if (ret) break; @@ -421,7 +423,7 @@ struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off, * we afterwards may not fail anymore. */ if (insn_adj_cnt > cnt_max && - bpf_adj_branches(prog, off, insn_delta, true)) + bpf_adj_branches(prog, off, off + 1, off + len, true)) return NULL; /* Several new instructions need to be inserted. Make room @@ -453,7 +455,7 @@ struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off, * the ship has sailed to reverse to the original state. An * overflow cannot happen at this point. */ - BUG_ON(bpf_adj_branches(prog_adj, off, insn_delta, false)); + BUG_ON(bpf_adj_branches(prog_adj, off, off + 1, off + len, false)); bpf_adj_linfo(prog_adj, off, insn_delta); -- cgit v1.3-7-g2ca7 From e2ae4ca266a1c9a0163738129506dbc63d5cca80 Mon Sep 17 00:00:00 2001 From: Jakub Kicinski Date: Tue, 22 Jan 2019 22:45:19 -0800 Subject: bpf: verifier: hard wire branches to dead code Loading programs with dead code becomes more and more common, as people begin to patch constants at load time. Turn conditional jumps to unconditional ones, to avoid potential branch misprediction penalty. This optimization is enabled for privileged users only. For branches which just fall through we could just mark them as not seen and have dead code removal take care of them, but that seems less clean. v0.2: - don't call capable(CAP_SYS_ADMIN) twice (Jiong). v3: - fix GCC warning; Signed-off-by: Jakub Kicinski Acked-by: Yonghong Song Signed-off-by: Alexei Starovoitov --- kernel/bpf/verifier.c | 45 +++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 43 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index ce87198ecd01..bf1f98e8beb6 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -6458,6 +6458,40 @@ static void sanitize_dead_code(struct bpf_verifier_env *env) } } +static bool insn_is_cond_jump(u8 code) +{ + u8 op; + + if (BPF_CLASS(code) != BPF_JMP) + return false; + + op = BPF_OP(code); + return op != BPF_JA && op != BPF_EXIT && op != BPF_CALL; +} + +static void opt_hard_wire_dead_code_branches(struct bpf_verifier_env *env) +{ + struct bpf_insn_aux_data *aux_data = env->insn_aux_data; + struct bpf_insn ja = BPF_JMP_IMM(BPF_JA, 0, 0, 0); + struct bpf_insn *insn = env->prog->insnsi; + const int insn_cnt = env->prog->len; + int i; + + for (i = 0; i < insn_cnt; i++, insn++) { + if (!insn_is_cond_jump(insn->code)) + continue; + + if (!aux_data[i + 1].seen) + ja.off = insn->off; + else if (!aux_data[i + 1 + insn->off].seen) + ja.off = 0; + else + continue; + + memcpy(insn, &ja, sizeof(ja)); + } +} + /* convert load instructions that access fields of a context type into a * sequence of instructions that access fields of the underlying structure: * struct __sk_buff -> struct sk_buff @@ -7149,6 +7183,7 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr, struct bpf_verifier_env *env; struct bpf_verifier_log *log; int ret = -EINVAL; + bool is_priv; /* no program is valid */ if (ARRAY_SIZE(bpf_verifier_ops) == 0) @@ -7195,6 +7230,9 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr, if (attr->prog_flags & BPF_F_ANY_ALIGNMENT) env->strict_alignment = false; + is_priv = capable(CAP_SYS_ADMIN); + env->allow_ptr_leaks = is_priv; + ret = replace_map_fd_with_map_ptr(env); if (ret < 0) goto skip_full_check; @@ -7212,8 +7250,6 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr, if (!env->explored_states) goto skip_full_check; - env->allow_ptr_leaks = capable(CAP_SYS_ADMIN); - ret = check_subprogs(env); if (ret < 0) goto skip_full_check; @@ -7243,6 +7279,11 @@ skip_full_check: ret = check_max_stack_depth(env); /* instruction rewrites happen after this point */ + if (is_priv) { + if (ret == 0) + opt_hard_wire_dead_code_branches(env); + } + if (ret == 0) sanitize_dead_code(env); -- cgit v1.3-7-g2ca7 From 52875a04f4b26e7ef30a288ea096f7cfec0e93cd Mon Sep 17 00:00:00 2001 From: Jakub Kicinski Date: Tue, 22 Jan 2019 22:45:20 -0800 Subject: bpf: verifier: remove dead code Instead of overwriting dead code with jmp -1 instructions remove it completely for root. Adjust verifier state and line info appropriately. v2: - adjust func_info (Alexei); - make sure first instruction retains line info (Alexei). v4: (Yonghong) - remove unnecessary if (!insn to remove) checks; - always keep last line info if first live instruction lacks one. v5: (Martin Lau) - improve and clarify comments. Signed-off-by: Jakub Kicinski Acked-by: Yonghong Song Signed-off-by: Alexei Starovoitov --- include/linux/filter.h | 1 + kernel/bpf/core.c | 12 ++++ kernel/bpf/verifier.c | 176 ++++++++++++++++++++++++++++++++++++++++++++++++- 3 files changed, 186 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/include/linux/filter.h b/include/linux/filter.h index ad106d845b22..be9af6b4a9e4 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -778,6 +778,7 @@ static inline bool bpf_dump_raw_ok(void) struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off, const struct bpf_insn *patch, u32 len); +int bpf_remove_insns(struct bpf_prog *prog, u32 off, u32 cnt); void bpf_clear_redirect_map(struct bpf_map *map); diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index ad08ba341197..2a81b8af3748 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -462,6 +462,18 @@ struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off, return prog_adj; } +int bpf_remove_insns(struct bpf_prog *prog, u32 off, u32 cnt) +{ + /* Branch offsets can't overflow when program is shrinking, no need + * to call bpf_adj_branches(..., true) here + */ + memmove(prog->insnsi + off, prog->insnsi + off + cnt, + sizeof(struct bpf_insn) * (prog->len - off - cnt)); + prog->len -= cnt; + + return WARN_ON_ONCE(bpf_adj_branches(prog, off, off + cnt, off, false)); +} + void bpf_prog_kallsyms_del_subprogs(struct bpf_prog *fp) { int i; diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index bf1f98e8beb6..099b2541f87f 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -6432,6 +6432,150 @@ static struct bpf_prog *bpf_patch_insn_data(struct bpf_verifier_env *env, u32 of return new_prog; } +static int adjust_subprog_starts_after_remove(struct bpf_verifier_env *env, + u32 off, u32 cnt) +{ + int i, j; + + /* find first prog starting at or after off (first to remove) */ + for (i = 0; i < env->subprog_cnt; i++) + if (env->subprog_info[i].start >= off) + break; + /* find first prog starting at or after off + cnt (first to stay) */ + for (j = i; j < env->subprog_cnt; j++) + if (env->subprog_info[j].start >= off + cnt) + break; + /* if j doesn't start exactly at off + cnt, we are just removing + * the front of previous prog + */ + if (env->subprog_info[j].start != off + cnt) + j--; + + if (j > i) { + struct bpf_prog_aux *aux = env->prog->aux; + int move; + + /* move fake 'exit' subprog as well */ + move = env->subprog_cnt + 1 - j; + + memmove(env->subprog_info + i, + env->subprog_info + j, + sizeof(*env->subprog_info) * move); + env->subprog_cnt -= j - i; + + /* remove func_info */ + if (aux->func_info) { + move = aux->func_info_cnt - j; + + memmove(aux->func_info + i, + aux->func_info + j, + sizeof(*aux->func_info) * move); + aux->func_info_cnt -= j - i; + /* func_info->insn_off is set after all code rewrites, + * in adjust_btf_func() - no need to adjust + */ + } + } else { + /* convert i from "first prog to remove" to "first to adjust" */ + if (env->subprog_info[i].start == off) + i++; + } + + /* update fake 'exit' subprog as well */ + for (; i <= env->subprog_cnt; i++) + env->subprog_info[i].start -= cnt; + + return 0; +} + +static int bpf_adj_linfo_after_remove(struct bpf_verifier_env *env, u32 off, + u32 cnt) +{ + struct bpf_prog *prog = env->prog; + u32 i, l_off, l_cnt, nr_linfo; + struct bpf_line_info *linfo; + + nr_linfo = prog->aux->nr_linfo; + if (!nr_linfo) + return 0; + + linfo = prog->aux->linfo; + + /* find first line info to remove, count lines to be removed */ + for (i = 0; i < nr_linfo; i++) + if (linfo[i].insn_off >= off) + break; + + l_off = i; + l_cnt = 0; + for (; i < nr_linfo; i++) + if (linfo[i].insn_off < off + cnt) + l_cnt++; + else + break; + + /* First live insn doesn't match first live linfo, it needs to "inherit" + * last removed linfo. prog is already modified, so prog->len == off + * means no live instructions after (tail of the program was removed). + */ + if (prog->len != off && l_cnt && + (i == nr_linfo || linfo[i].insn_off != off + cnt)) { + l_cnt--; + linfo[--i].insn_off = off + cnt; + } + + /* remove the line info which refer to the removed instructions */ + if (l_cnt) { + memmove(linfo + l_off, linfo + i, + sizeof(*linfo) * (nr_linfo - i)); + + prog->aux->nr_linfo -= l_cnt; + nr_linfo = prog->aux->nr_linfo; + } + + /* pull all linfo[i].insn_off >= off + cnt in by cnt */ + for (i = l_off; i < nr_linfo; i++) + linfo[i].insn_off -= cnt; + + /* fix up all subprogs (incl. 'exit') which start >= off */ + for (i = 0; i <= env->subprog_cnt; i++) + if (env->subprog_info[i].linfo_idx > l_off) { + /* program may have started in the removed region but + * may not be fully removed + */ + if (env->subprog_info[i].linfo_idx >= l_off + l_cnt) + env->subprog_info[i].linfo_idx -= l_cnt; + else + env->subprog_info[i].linfo_idx = l_off; + } + + return 0; +} + +static int verifier_remove_insns(struct bpf_verifier_env *env, u32 off, u32 cnt) +{ + struct bpf_insn_aux_data *aux_data = env->insn_aux_data; + unsigned int orig_prog_len = env->prog->len; + int err; + + err = bpf_remove_insns(env->prog, off, cnt); + if (err) + return err; + + err = adjust_subprog_starts_after_remove(env, off, cnt); + if (err) + return err; + + err = bpf_adj_linfo_after_remove(env, off, cnt); + if (err) + return err; + + memmove(aux_data + off, aux_data + off + cnt, + sizeof(*aux_data) * (orig_prog_len - off - cnt)); + + return 0; +} + /* The verifier does more data flow analysis than llvm and will not * explore branches that are dead at run time. Malicious programs can * have dead code too. Therefore replace all dead at-run-time code @@ -6492,6 +6636,30 @@ static void opt_hard_wire_dead_code_branches(struct bpf_verifier_env *env) } } +static int opt_remove_dead_code(struct bpf_verifier_env *env) +{ + struct bpf_insn_aux_data *aux_data = env->insn_aux_data; + int insn_cnt = env->prog->len; + int i, err; + + for (i = 0; i < insn_cnt; i++) { + int j; + + j = 0; + while (i + j < insn_cnt && !aux_data[i + j].seen) + j++; + if (!j) + continue; + + err = verifier_remove_insns(env, i, j); + if (err) + return err; + insn_cnt = env->prog->len; + } + + return 0; +} + /* convert load instructions that access fields of a context type into a * sequence of instructions that access fields of the underlying structure: * struct __sk_buff -> struct sk_buff @@ -7282,11 +7450,13 @@ skip_full_check: if (is_priv) { if (ret == 0) opt_hard_wire_dead_code_branches(env); + if (ret == 0) + ret = opt_remove_dead_code(env); + } else { + if (ret == 0) + sanitize_dead_code(env); } - if (ret == 0) - sanitize_dead_code(env); - if (ret == 0) /* program is valid, convert *(u32*)(ctx + off) accesses */ ret = convert_ctx_accesses(env); -- cgit v1.3-7-g2ca7 From a1b14abc009d9c13be355dbd4a4c4d47816ad3db Mon Sep 17 00:00:00 2001 From: Jakub Kicinski Date: Tue, 22 Jan 2019 22:45:21 -0800 Subject: bpf: verifier: remove unconditional branches by 0 Unconditional branches by 0 instructions are basically noops but they can result from earlier optimizations, e.g. a conditional jumps which would never be taken or a conditional jump around dead code. Remove those branches. v0.2: - s/opt_remove_dead_branches/opt_remove_nops/ (Jiong). Signed-off-by: Jakub Kicinski Reviewed-by: Jiong Wang Acked-by: Yonghong Song Signed-off-by: Alexei Starovoitov --- kernel/bpf/verifier.c | 23 +++++++++++++++++++++++ 1 file changed, 23 insertions(+) (limited to 'kernel') diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 099b2541f87f..f39bca188a5c 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -6660,6 +6660,27 @@ static int opt_remove_dead_code(struct bpf_verifier_env *env) return 0; } +static int opt_remove_nops(struct bpf_verifier_env *env) +{ + const struct bpf_insn ja = BPF_JMP_IMM(BPF_JA, 0, 0, 0); + struct bpf_insn *insn = env->prog->insnsi; + int insn_cnt = env->prog->len; + int i, err; + + for (i = 0; i < insn_cnt; i++) { + if (memcmp(&insn[i], &ja, sizeof(ja))) + continue; + + err = verifier_remove_insns(env, i, 1); + if (err) + return err; + insn_cnt--; + i--; + } + + return 0; +} + /* convert load instructions that access fields of a context type into a * sequence of instructions that access fields of the underlying structure: * struct __sk_buff -> struct sk_buff @@ -7452,6 +7473,8 @@ skip_full_check: opt_hard_wire_dead_code_branches(env); if (ret == 0) ret = opt_remove_dead_code(env); + if (ret == 0) + ret = opt_remove_nops(env); } else { if (ret == 0) sanitize_dead_code(env); -- cgit v1.3-7-g2ca7 From 9e4c24e7ee7dfd3898269519103e823892b730d8 Mon Sep 17 00:00:00 2001 From: Jakub Kicinski Date: Tue, 22 Jan 2019 22:45:23 -0800 Subject: bpf: verifier: record original instruction index The communication between the verifier and advanced JITs is based on instruction indexes. We have to keep them stable throughout the optimizations otherwise referring to a particular instruction gets messy quickly. Signed-off-by: Jakub Kicinski Reviewed-by: Quentin Monnet Signed-off-by: Alexei Starovoitov --- include/linux/bpf_verifier.h | 1 + kernel/bpf/verifier.c | 8 +++++--- 2 files changed, 6 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h index 573cca00a0e6..f3ae00ee5516 100644 --- a/include/linux/bpf_verifier.h +++ b/include/linux/bpf_verifier.h @@ -187,6 +187,7 @@ struct bpf_insn_aux_data { int sanitize_stack_off; /* stack slot to be cleared */ bool seen; /* this insn was processed by the verifier */ u8 alu_state; /* used in combination with alu_limit */ + unsigned int orig_idx; /* original instruction index */ }; #define MAX_USED_MAPS 64 /* max number of maps accessed by one eBPF program */ diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index f39bca188a5c..f2c49b4235df 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -7371,7 +7371,7 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr, { struct bpf_verifier_env *env; struct bpf_verifier_log *log; - int ret = -EINVAL; + int i, len, ret = -EINVAL; bool is_priv; /* no program is valid */ @@ -7386,12 +7386,14 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr, return -ENOMEM; log = &env->log; + len = (*prog)->len; env->insn_aux_data = - vzalloc(array_size(sizeof(struct bpf_insn_aux_data), - (*prog)->len)); + vzalloc(array_size(sizeof(struct bpf_insn_aux_data), len)); ret = -ENOMEM; if (!env->insn_aux_data) goto err_free_env; + for (i = 0; i < len; i++) + env->insn_aux_data[i].orig_idx = i; env->prog = *prog; env->ops = bpf_verifier_ops[env->prog->type]; -- cgit v1.3-7-g2ca7 From 08ca90afba255d05dc3253caa44056e7aecbe8c5 Mon Sep 17 00:00:00 2001 From: Jakub Kicinski Date: Tue, 22 Jan 2019 22:45:24 -0800 Subject: bpf: notify offload JITs about optimizations Let offload JITs know when instructions are replaced and optimized out, so they can update their state appropriately. The optimizations are best effort, if JIT returns an error from any callback verifier will stop notifying it as state may now be out of sync, but the verifier continues making progress. Signed-off-by: Jakub Kicinski Reviewed-by: Quentin Monnet Signed-off-by: Alexei Starovoitov --- include/linux/bpf.h | 7 +++++++ include/linux/bpf_verifier.h | 5 +++++ kernel/bpf/offload.c | 35 +++++++++++++++++++++++++++++++++++ kernel/bpf/verifier.c | 6 ++++++ 4 files changed, 53 insertions(+) (limited to 'kernel') diff --git a/include/linux/bpf.h b/include/linux/bpf.h index e734f163bd0b..3851529062ec 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -268,9 +268,15 @@ struct bpf_verifier_ops { }; struct bpf_prog_offload_ops { + /* verifier basic callbacks */ int (*insn_hook)(struct bpf_verifier_env *env, int insn_idx, int prev_insn_idx); int (*finalize)(struct bpf_verifier_env *env); + /* verifier optimization callbacks (called after .finalize) */ + int (*replace_insn)(struct bpf_verifier_env *env, u32 off, + struct bpf_insn *insn); + int (*remove_insns)(struct bpf_verifier_env *env, u32 off, u32 cnt); + /* program management callbacks */ int (*prepare)(struct bpf_prog *prog); int (*translate)(struct bpf_prog *prog); void (*destroy)(struct bpf_prog *prog); @@ -283,6 +289,7 @@ struct bpf_prog_offload { void *dev_priv; struct list_head offloads; bool dev_state; + bool opt_failed; void *jited_image; u32 jited_len; }; diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h index f3ae00ee5516..0620e418dde5 100644 --- a/include/linux/bpf_verifier.h +++ b/include/linux/bpf_verifier.h @@ -266,5 +266,10 @@ int bpf_prog_offload_verifier_prep(struct bpf_prog *prog); int bpf_prog_offload_verify_insn(struct bpf_verifier_env *env, int insn_idx, int prev_insn_idx); int bpf_prog_offload_finalize(struct bpf_verifier_env *env); +void +bpf_prog_offload_replace_insn(struct bpf_verifier_env *env, u32 off, + struct bpf_insn *insn); +void +bpf_prog_offload_remove_insns(struct bpf_verifier_env *env, u32 off, u32 cnt); #endif /* _LINUX_BPF_VERIFIER_H */ diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c index 54cf2b9c44a4..39dba8c90331 100644 --- a/kernel/bpf/offload.c +++ b/kernel/bpf/offload.c @@ -173,6 +173,41 @@ int bpf_prog_offload_finalize(struct bpf_verifier_env *env) return ret; } +void +bpf_prog_offload_replace_insn(struct bpf_verifier_env *env, u32 off, + struct bpf_insn *insn) +{ + const struct bpf_prog_offload_ops *ops; + struct bpf_prog_offload *offload; + int ret = -EOPNOTSUPP; + + down_read(&bpf_devs_lock); + offload = env->prog->aux->offload; + if (offload) { + ops = offload->offdev->ops; + if (!offload->opt_failed && ops->replace_insn) + ret = ops->replace_insn(env, off, insn); + offload->opt_failed |= ret; + } + up_read(&bpf_devs_lock); +} + +void +bpf_prog_offload_remove_insns(struct bpf_verifier_env *env, u32 off, u32 cnt) +{ + struct bpf_prog_offload *offload; + int ret = -EOPNOTSUPP; + + down_read(&bpf_devs_lock); + offload = env->prog->aux->offload; + if (offload) { + if (!offload->opt_failed && offload->offdev->ops->remove_insns) + ret = offload->offdev->ops->remove_insns(env, off, cnt); + offload->opt_failed |= ret; + } + up_read(&bpf_devs_lock); +} + static void __bpf_prog_offload_destroy(struct bpf_prog *prog) { struct bpf_prog_offload *offload = prog->aux->offload; diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index f2c49b4235df..8cfe39ef770f 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -6558,6 +6558,9 @@ static int verifier_remove_insns(struct bpf_verifier_env *env, u32 off, u32 cnt) unsigned int orig_prog_len = env->prog->len; int err; + if (bpf_prog_is_dev_bound(env->prog->aux)) + bpf_prog_offload_remove_insns(env, off, cnt); + err = bpf_remove_insns(env->prog, off, cnt); if (err) return err; @@ -6632,6 +6635,9 @@ static void opt_hard_wire_dead_code_branches(struct bpf_verifier_env *env) else continue; + if (bpf_prog_is_dev_bound(env->prog->aux)) + bpf_prog_offload_replace_insn(env, i, &ja); + memcpy(insn, &ja, sizeof(ja)); } } -- cgit v1.3-7-g2ca7 From 4d43d395fed124631ca02356c711facb90185175 Mon Sep 17 00:00:00 2001 From: Tetsuo Handa Date: Wed, 23 Jan 2019 09:44:12 +0900 Subject: workqueue: Try to catch flush_work() without INIT_WORK(). syzbot found a flush_work() caller who forgot to call INIT_WORK() because that work_struct was allocated by kzalloc() [1]. But the message INFO: trying to register non-static key. the code is fine but needs lockdep annotation. turning off the locking correctness validator. by lock_map_acquire() is failing to tell that INIT_WORK() is missing. Since flush_work() without INIT_WORK() is a bug, and INIT_WORK() should set ->func field to non-zero, let's warn if ->func field is zero. [1] https://syzkaller.appspot.com/bug?id=a5954455fcfa51c29ca2ab55b203076337e1c770 Signed-off-by: Tetsuo Handa Signed-off-by: Tejun Heo --- kernel/workqueue.c | 3 +++ 1 file changed, 3 insertions(+) (limited to 'kernel') diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 392be4b252f6..a503ad9d0aec 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -2908,6 +2908,9 @@ static bool __flush_work(struct work_struct *work, bool from_cancel) if (WARN_ON(!wq_online)) return false; + if (WARN_ON(!work->func)) + return false; + if (!from_cancel) { lock_map_acquire(&work->lockdep_map); lock_map_release(&work->lockdep_map); -- cgit v1.3-7-g2ca7 From 275f22148e8720e84b180d9e0cdf8abfd69bac5b Mon Sep 17 00:00:00 2001 From: Arnd Bergmann Date: Mon, 31 Dec 2018 22:22:40 +0100 Subject: ipc: rename old-style shmctl/semctl/msgctl syscalls The behavior of these system calls is slightly different between architectures, as determined by the CONFIG_ARCH_WANT_IPC_PARSE_VERSION symbol. Most architectures that implement the split IPC syscalls don't set that symbol and only get the modern version, but alpha, arm, microblaze, mips-n32, mips-n64 and xtensa expect the caller to pass the IPC_64 flag. For the architectures that so far only implement sys_ipc(), i.e. m68k, mips-o32, powerpc, s390, sh, sparc, and x86-32, we want the new behavior when adding the split syscalls, so we need to distinguish between the two groups of architectures. The method I picked for this distinction is to have a separate system call entry point: sys_old_*ctl() now uses ipc_parse_version, while sys_*ctl() does not. The system call tables of the five architectures are changed accordingly. As an additional benefit, we no longer need the configuration specific definition for ipc_parse_version(), it always does the same thing now, but simply won't get called on architectures with the modern interface. A small downside is that on architectures that do set ARCH_WANT_IPC_PARSE_VERSION, we now have an extra set of entry points that are never called. They only add a few bytes of bloat, so it seems better to keep them compared to adding yet another Kconfig symbol. I considered adding new syscall numbers for the IPC_64 variants for consistency, but decided against that for now. Signed-off-by: Arnd Bergmann --- arch/alpha/kernel/syscalls/syscall.tbl | 6 ++--- arch/arm/tools/syscall.tbl | 6 ++--- arch/arm64/include/asm/unistd32.h | 6 ++--- arch/microblaze/kernel/syscalls/syscall.tbl | 6 ++--- arch/mips/kernel/syscalls/syscall_n32.tbl | 6 ++--- arch/mips/kernel/syscalls/syscall_n64.tbl | 6 ++--- arch/xtensa/kernel/syscalls/syscall.tbl | 6 ++--- include/linux/syscalls.h | 3 +++ ipc/msg.c | 39 +++++++++++++++++++++++----- ipc/sem.c | 39 +++++++++++++++++++++++----- ipc/shm.c | 40 ++++++++++++++++++++++++----- ipc/syscall.c | 12 ++++----- ipc/util.h | 21 +++++---------- kernel/sys_ni.c | 3 +++ 14 files changed, 137 insertions(+), 62 deletions(-) (limited to 'kernel') diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl index f920b65e8c49..b0e247287908 100644 --- a/arch/alpha/kernel/syscalls/syscall.tbl +++ b/arch/alpha/kernel/syscalls/syscall.tbl @@ -174,17 +174,17 @@ 187 common osf_alt_sigpending sys_ni_syscall 188 common osf_alt_setsid sys_ni_syscall 199 common osf_swapon sys_swapon -200 common msgctl sys_msgctl +200 common msgctl sys_old_msgctl 201 common msgget sys_msgget 202 common msgrcv sys_msgrcv 203 common msgsnd sys_msgsnd -204 common semctl sys_semctl +204 common semctl sys_old_semctl 205 common semget sys_semget 206 common semop sys_semop 207 common osf_utsname sys_osf_utsname 208 common lchown sys_lchown 209 common shmat sys_shmat -210 common shmctl sys_shmctl +210 common shmctl sys_old_shmctl 211 common shmdt sys_shmdt 212 common shmget sys_shmget 213 common osf_mvalid sys_ni_syscall diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl index 20ed7e026723..b54b7f2bc24a 100644 --- a/arch/arm/tools/syscall.tbl +++ b/arch/arm/tools/syscall.tbl @@ -314,15 +314,15 @@ 297 common recvmsg sys_recvmsg 298 common semop sys_semop sys_oabi_semop 299 common semget sys_semget -300 common semctl sys_semctl +300 common semctl sys_old_semctl 301 common msgsnd sys_msgsnd 302 common msgrcv sys_msgrcv 303 common msgget sys_msgget -304 common msgctl sys_msgctl +304 common msgctl sys_old_msgctl 305 common shmat sys_shmat 306 common shmdt sys_shmdt 307 common shmget sys_shmget -308 common shmctl sys_shmctl +308 common shmctl sys_old_shmctl 309 common add_key sys_add_key 310 common request_key sys_request_key 311 common keyctl sys_keyctl diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h index 8ca1d4c304f4..d10cce69a4b0 100644 --- a/arch/arm64/include/asm/unistd32.h +++ b/arch/arm64/include/asm/unistd32.h @@ -622,7 +622,7 @@ __SYSCALL(__NR_semop, sys_semop) #define __NR_semget 299 __SYSCALL(__NR_semget, sys_semget) #define __NR_semctl 300 -__SYSCALL(__NR_semctl, compat_sys_semctl) +__SYSCALL(__NR_semctl, compat_sys_old_semctl) #define __NR_msgsnd 301 __SYSCALL(__NR_msgsnd, compat_sys_msgsnd) #define __NR_msgrcv 302 @@ -630,7 +630,7 @@ __SYSCALL(__NR_msgrcv, compat_sys_msgrcv) #define __NR_msgget 303 __SYSCALL(__NR_msgget, sys_msgget) #define __NR_msgctl 304 -__SYSCALL(__NR_msgctl, compat_sys_msgctl) +__SYSCALL(__NR_msgctl, compat_sys_old_msgctl) #define __NR_shmat 305 __SYSCALL(__NR_shmat, compat_sys_shmat) #define __NR_shmdt 306 @@ -638,7 +638,7 @@ __SYSCALL(__NR_shmdt, sys_shmdt) #define __NR_shmget 307 __SYSCALL(__NR_shmget, sys_shmget) #define __NR_shmctl 308 -__SYSCALL(__NR_shmctl, compat_sys_shmctl) +__SYSCALL(__NR_shmctl, compat_sys_old_shmctl) #define __NR_add_key 309 __SYSCALL(__NR_add_key, sys_add_key) #define __NR_request_key 310 diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl index a24d09e937dd..7cc0f9554da3 100644 --- a/arch/microblaze/kernel/syscalls/syscall.tbl +++ b/arch/microblaze/kernel/syscalls/syscall.tbl @@ -335,15 +335,15 @@ 325 common semtimedop sys_semtimedop 326 common timerfd_settime sys_timerfd_settime 327 common timerfd_gettime sys_timerfd_gettime -328 common semctl sys_semctl +328 common semctl sys_old_semctl 329 common semget sys_semget 330 common semop sys_semop -331 common msgctl sys_msgctl +331 common msgctl sys_old_msgctl 332 common msgget sys_msgget 333 common msgrcv sys_msgrcv 334 common msgsnd sys_msgsnd 335 common shmat sys_shmat -336 common shmctl sys_shmctl +336 common shmctl sys_old_shmctl 337 common shmdt sys_shmdt 338 common shmget sys_shmget 339 common signalfd4 sys_signalfd4 diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl index 53d5862649ae..cc134b1211aa 100644 --- a/arch/mips/kernel/syscalls/syscall_n32.tbl +++ b/arch/mips/kernel/syscalls/syscall_n32.tbl @@ -37,7 +37,7 @@ 27 n32 madvise sys_madvise 28 n32 shmget sys_shmget 29 n32 shmat sys_shmat -30 n32 shmctl compat_sys_shmctl +30 n32 shmctl compat_sys_old_shmctl 31 n32 dup sys_dup 32 n32 dup2 sys_dup2 33 n32 pause sys_pause @@ -71,12 +71,12 @@ 61 n32 uname sys_newuname 62 n32 semget sys_semget 63 n32 semop sys_semop -64 n32 semctl compat_sys_semctl +64 n32 semctl compat_sys_old_semctl 65 n32 shmdt sys_shmdt 66 n32 msgget sys_msgget 67 n32 msgsnd compat_sys_msgsnd 68 n32 msgrcv compat_sys_msgrcv -69 n32 msgctl compat_sys_msgctl +69 n32 msgctl compat_sys_old_msgctl 70 n32 fcntl compat_sys_fcntl 71 n32 flock sys_flock 72 n32 fsync sys_fsync diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl index a8286ccbb66c..af0da757a7b2 100644 --- a/arch/mips/kernel/syscalls/syscall_n64.tbl +++ b/arch/mips/kernel/syscalls/syscall_n64.tbl @@ -37,7 +37,7 @@ 27 n64 madvise sys_madvise 28 n64 shmget sys_shmget 29 n64 shmat sys_shmat -30 n64 shmctl sys_shmctl +30 n64 shmctl sys_old_shmctl 31 n64 dup sys_dup 32 n64 dup2 sys_dup2 33 n64 pause sys_pause @@ -71,12 +71,12 @@ 61 n64 uname sys_newuname 62 n64 semget sys_semget 63 n64 semop sys_semop -64 n64 semctl sys_semctl +64 n64 semctl sys_old_semctl 65 n64 shmdt sys_shmdt 66 n64 msgget sys_msgget 67 n64 msgsnd sys_msgsnd 68 n64 msgrcv sys_msgrcv -69 n64 msgctl sys_msgctl +69 n64 msgctl sys_old_msgctl 70 n64 fcntl sys_fcntl 71 n64 flock sys_flock 72 n64 fsync sys_fsync diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl index 69cf91b03b26..f8befa11b0c4 100644 --- a/arch/xtensa/kernel/syscalls/syscall.tbl +++ b/arch/xtensa/kernel/syscalls/syscall.tbl @@ -103,7 +103,7 @@ 91 common madvise sys_madvise 92 common shmget sys_shmget 93 common shmat xtensa_shmat -94 common shmctl sys_shmctl +94 common shmctl sys_old_shmctl 95 common shmdt sys_shmdt # Socket Operations 96 common socket sys_socket @@ -177,12 +177,12 @@ 161 common semtimedop sys_semtimedop 162 common semget sys_semget 163 common semop sys_semop -164 common semctl sys_semctl +164 common semctl sys_old_semctl 165 common available165 sys_ni_syscall 166 common msgget sys_msgget 167 common msgsnd sys_msgsnd 168 common msgrcv sys_msgrcv -169 common msgctl sys_msgctl +169 common msgctl sys_old_msgctl 170 common available170 sys_ni_syscall # File System 171 common umount2 sys_umount diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index fb63045a0fb6..938d8908b9e0 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -717,6 +717,7 @@ asmlinkage long sys_mq_getsetattr(mqd_t mqdes, const struct mq_attr __user *mqst /* ipc/msg.c */ asmlinkage long sys_msgget(key_t key, int msgflg); +asmlinkage long sys_old_msgctl(int msqid, int cmd, struct msqid_ds __user *buf); asmlinkage long sys_msgctl(int msqid, int cmd, struct msqid_ds __user *buf); asmlinkage long sys_msgrcv(int msqid, struct msgbuf __user *msgp, size_t msgsz, long msgtyp, int msgflg); @@ -726,6 +727,7 @@ asmlinkage long sys_msgsnd(int msqid, struct msgbuf __user *msgp, /* ipc/sem.c */ asmlinkage long sys_semget(key_t key, int nsems, int semflg); asmlinkage long sys_semctl(int semid, int semnum, int cmd, unsigned long arg); +asmlinkage long sys_old_semctl(int semid, int semnum, int cmd, unsigned long arg); asmlinkage long sys_semtimedop(int semid, struct sembuf __user *sops, unsigned nsops, const struct __kernel_timespec __user *timeout); @@ -734,6 +736,7 @@ asmlinkage long sys_semop(int semid, struct sembuf __user *sops, /* ipc/shm.c */ asmlinkage long sys_shmget(key_t key, size_t size, int flag); +asmlinkage long sys_old_shmctl(int shmid, int cmd, struct shmid_ds __user *buf); asmlinkage long sys_shmctl(int shmid, int cmd, struct shmid_ds __user *buf); asmlinkage long sys_shmat(int shmid, char __user *shmaddr, int shmflg); asmlinkage long sys_shmdt(char __user *shmaddr); diff --git a/ipc/msg.c b/ipc/msg.c index 0833c6405915..8dec945fa030 100644 --- a/ipc/msg.c +++ b/ipc/msg.c @@ -567,9 +567,8 @@ out_unlock: return err; } -long ksys_msgctl(int msqid, int cmd, struct msqid_ds __user *buf) +static long ksys_msgctl(int msqid, int cmd, struct msqid_ds __user *buf, int version) { - int version; struct ipc_namespace *ns; struct msqid64_ds msqid64; int err; @@ -577,7 +576,6 @@ long ksys_msgctl(int msqid, int cmd, struct msqid_ds __user *buf) if (msqid < 0 || cmd < 0) return -EINVAL; - version = ipc_parse_version(&cmd); ns = current->nsproxy->ipc_ns; switch (cmd) { @@ -613,9 +611,23 @@ long ksys_msgctl(int msqid, int cmd, struct msqid_ds __user *buf) SYSCALL_DEFINE3(msgctl, int, msqid, int, cmd, struct msqid_ds __user *, buf) { - return ksys_msgctl(msqid, cmd, buf); + return ksys_msgctl(msqid, cmd, buf, IPC_64); } +#ifdef CONFIG_ARCH_WANT_IPC_PARSE_VERSION +long ksys_old_msgctl(int msqid, int cmd, struct msqid_ds __user *buf) +{ + int version = ipc_parse_version(&cmd); + + return ksys_msgctl(msqid, cmd, buf, version); +} + +SYSCALL_DEFINE3(old_msgctl, int, msqid, int, cmd, struct msqid_ds __user *, buf) +{ + return ksys_old_msgctl(msqid, cmd, buf); +} +#endif + #ifdef CONFIG_COMPAT struct compat_msqid_ds { @@ -689,12 +701,11 @@ static int copy_compat_msqid_to_user(void __user *buf, struct msqid64_ds *in, } } -long compat_ksys_msgctl(int msqid, int cmd, void __user *uptr) +static long compat_ksys_msgctl(int msqid, int cmd, void __user *uptr, int version) { struct ipc_namespace *ns; int err; struct msqid64_ds msqid64; - int version = compat_ipc_parse_version(&cmd); ns = current->nsproxy->ipc_ns; @@ -734,8 +745,22 @@ long compat_ksys_msgctl(int msqid, int cmd, void __user *uptr) COMPAT_SYSCALL_DEFINE3(msgctl, int, msqid, int, cmd, void __user *, uptr) { - return compat_ksys_msgctl(msqid, cmd, uptr); + return compat_ksys_msgctl(msqid, cmd, uptr, IPC_64); } + +#ifdef CONFIG_ARCH_WANT_COMPAT_IPC_PARSE_VERSION +long compat_ksys_old_msgctl(int msqid, int cmd, void __user *uptr) +{ + int version = compat_ipc_parse_version(&cmd); + + return compat_ksys_msgctl(msqid, cmd, uptr, version); +} + +COMPAT_SYSCALL_DEFINE3(old_msgctl, int, msqid, int, cmd, void __user *, uptr) +{ + return compat_ksys_old_msgctl(msqid, cmd, uptr); +} +#endif #endif static int testmsg(struct msg_msg *msg, long type, int mode) diff --git a/ipc/sem.c b/ipc/sem.c index 745dc6187e84..d1efff3a81bb 100644 --- a/ipc/sem.c +++ b/ipc/sem.c @@ -1634,9 +1634,8 @@ out_up: return err; } -long ksys_semctl(int semid, int semnum, int cmd, unsigned long arg) +static long ksys_semctl(int semid, int semnum, int cmd, unsigned long arg, int version) { - int version; struct ipc_namespace *ns; void __user *p = (void __user *)arg; struct semid64_ds semid64; @@ -1645,7 +1644,6 @@ long ksys_semctl(int semid, int semnum, int cmd, unsigned long arg) if (semid < 0) return -EINVAL; - version = ipc_parse_version(&cmd); ns = current->nsproxy->ipc_ns; switch (cmd) { @@ -1691,9 +1689,23 @@ long ksys_semctl(int semid, int semnum, int cmd, unsigned long arg) SYSCALL_DEFINE4(semctl, int, semid, int, semnum, int, cmd, unsigned long, arg) { - return ksys_semctl(semid, semnum, cmd, arg); + return ksys_semctl(semid, semnum, cmd, arg, IPC_64); } +#ifdef CONFIG_ARCH_WANT_IPC_PARSE_VERSION +long ksys_old_semctl(int semid, int semnum, int cmd, unsigned long arg) +{ + int version = ipc_parse_version(&cmd); + + return ksys_semctl(semid, semnum, cmd, arg, version); +} + +SYSCALL_DEFINE4(old_semctl, int, semid, int, semnum, int, cmd, unsigned long, arg) +{ + return ksys_old_semctl(semid, semnum, cmd, arg); +} +#endif + #ifdef CONFIG_COMPAT struct compat_semid_ds { @@ -1744,12 +1756,11 @@ static int copy_compat_semid_to_user(void __user *buf, struct semid64_ds *in, } } -long compat_ksys_semctl(int semid, int semnum, int cmd, int arg) +static long compat_ksys_semctl(int semid, int semnum, int cmd, int arg, int version) { void __user *p = compat_ptr(arg); struct ipc_namespace *ns; struct semid64_ds semid64; - int version = compat_ipc_parse_version(&cmd); int err; ns = current->nsproxy->ipc_ns; @@ -1792,8 +1803,22 @@ long compat_ksys_semctl(int semid, int semnum, int cmd, int arg) COMPAT_SYSCALL_DEFINE4(semctl, int, semid, int, semnum, int, cmd, int, arg) { - return compat_ksys_semctl(semid, semnum, cmd, arg); + return compat_ksys_semctl(semid, semnum, cmd, arg, IPC_64); } + +#ifdef CONFIG_ARCH_WANT_COMPAT_IPC_PARSE_VERSION +long compat_ksys_old_semctl(int semid, int semnum, int cmd, int arg) +{ + int version = compat_ipc_parse_version(&cmd); + + return compat_ksys_semctl(semid, semnum, cmd, arg, version); +} + +COMPAT_SYSCALL_DEFINE4(old_semctl, int, semid, int, semnum, int, cmd, int, arg) +{ + return compat_ksys_old_semctl(semid, semnum, cmd, arg); +} +#endif #endif /* If the task doesn't already have a undo_list, then allocate one diff --git a/ipc/shm.c b/ipc/shm.c index 0842411cb0e9..ce1ca9f7c6e9 100644 --- a/ipc/shm.c +++ b/ipc/shm.c @@ -1137,16 +1137,15 @@ out_unlock1: return err; } -long ksys_shmctl(int shmid, int cmd, struct shmid_ds __user *buf) +static long ksys_shmctl(int shmid, int cmd, struct shmid_ds __user *buf, int version) { - int err, version; + int err; struct ipc_namespace *ns; struct shmid64_ds sem64; if (cmd < 0 || shmid < 0) return -EINVAL; - version = ipc_parse_version(&cmd); ns = current->nsproxy->ipc_ns; switch (cmd) { @@ -1194,8 +1193,22 @@ long ksys_shmctl(int shmid, int cmd, struct shmid_ds __user *buf) SYSCALL_DEFINE3(shmctl, int, shmid, int, cmd, struct shmid_ds __user *, buf) { - return ksys_shmctl(shmid, cmd, buf); + return ksys_shmctl(shmid, cmd, buf, IPC_64); +} + +#ifdef CONFIG_ARCH_WANT_IPC_PARSE_VERSION +long ksys_old_shmctl(int shmid, int cmd, struct shmid_ds __user *buf) +{ + int version = ipc_parse_version(&cmd); + + return ksys_shmctl(shmid, cmd, buf, version); +} + +SYSCALL_DEFINE3(old_shmctl, int, shmid, int, cmd, struct shmid_ds __user *, buf) +{ + return ksys_old_shmctl(shmid, cmd, buf); } +#endif #ifdef CONFIG_COMPAT @@ -1319,11 +1332,10 @@ static int copy_compat_shmid_from_user(struct shmid64_ds *out, void __user *buf, } } -long compat_ksys_shmctl(int shmid, int cmd, void __user *uptr) +long compat_ksys_shmctl(int shmid, int cmd, void __user *uptr, int version) { struct ipc_namespace *ns; struct shmid64_ds sem64; - int version = compat_ipc_parse_version(&cmd); int err; ns = current->nsproxy->ipc_ns; @@ -1378,8 +1390,22 @@ long compat_ksys_shmctl(int shmid, int cmd, void __user *uptr) COMPAT_SYSCALL_DEFINE3(shmctl, int, shmid, int, cmd, void __user *, uptr) { - return compat_ksys_shmctl(shmid, cmd, uptr); + return compat_ksys_shmctl(shmid, cmd, uptr, IPC_64); } + +#ifdef CONFIG_ARCH_WANT_COMPAT_IPC_PARSE_VERSION +long compat_ksys_old_shmctl(int shmid, int cmd, void __user *uptr) +{ + int version = compat_ipc_parse_version(&cmd); + + return compat_ksys_shmctl(shmid, cmd, uptr, version); +} + +COMPAT_SYSCALL_DEFINE3(old_shmctl, int, shmid, int, cmd, void __user *, uptr) +{ + return compat_ksys_old_shmctl(shmid, cmd, uptr); +} +#endif #endif /* diff --git a/ipc/syscall.c b/ipc/syscall.c index 3cf8ad703a4d..581bdff4e7c5 100644 --- a/ipc/syscall.c +++ b/ipc/syscall.c @@ -47,7 +47,7 @@ int ksys_ipc(unsigned int call, int first, unsigned long second, return -EINVAL; if (get_user(arg, (unsigned long __user *) ptr)) return -EFAULT; - return ksys_semctl(first, second, third, arg); + return ksys_old_semctl(first, second, third, arg); } case MSGSND: @@ -75,7 +75,7 @@ int ksys_ipc(unsigned int call, int first, unsigned long second, case MSGGET: return ksys_msgget((key_t) first, second); case MSGCTL: - return ksys_msgctl(first, second, + return ksys_old_msgctl(first, second, (struct msqid_ds __user *)ptr); case SHMAT: @@ -100,7 +100,7 @@ int ksys_ipc(unsigned int call, int first, unsigned long second, case SHMGET: return ksys_shmget(first, second, third); case SHMCTL: - return ksys_shmctl(first, second, + return ksys_old_shmctl(first, second, (struct shmid_ds __user *) ptr); default: return -ENOSYS; @@ -152,7 +152,7 @@ int compat_ksys_ipc(u32 call, int first, int second, return -EINVAL; if (get_user(pad, (u32 __user *) compat_ptr(ptr))) return -EFAULT; - return compat_ksys_semctl(first, second, third, pad); + return compat_ksys_old_semctl(first, second, third, pad); case MSGSND: return compat_ksys_msgsnd(first, ptr, second, third); @@ -177,7 +177,7 @@ int compat_ksys_ipc(u32 call, int first, int second, case MSGGET: return ksys_msgget(first, second); case MSGCTL: - return compat_ksys_msgctl(first, second, compat_ptr(ptr)); + return compat_ksys_old_msgctl(first, second, compat_ptr(ptr)); case SHMAT: { int err; @@ -196,7 +196,7 @@ int compat_ksys_ipc(u32 call, int first, int second, case SHMGET: return ksys_shmget(first, (unsigned int)second, third); case SHMCTL: - return compat_ksys_shmctl(first, second, compat_ptr(ptr)); + return compat_ksys_old_shmctl(first, second, compat_ptr(ptr)); } return -ENOSYS; diff --git a/ipc/util.h b/ipc/util.h index d768fdbed515..e272be622ae7 100644 --- a/ipc/util.h +++ b/ipc/util.h @@ -160,10 +160,7 @@ static inline void ipc_update_pid(struct pid **pos, struct pid *pid) } } -#ifndef CONFIG_ARCH_WANT_IPC_PARSE_VERSION -/* On IA-64, we always use the "64-bit version" of the IPC structures. */ -# define ipc_parse_version(cmd) IPC_64 -#else +#ifdef CONFIG_ARCH_WANT_IPC_PARSE_VERSION int ipc_parse_version(int *cmd); #endif @@ -246,13 +243,9 @@ int get_compat_ipc64_perm(struct ipc64_perm *, static inline int compat_ipc_parse_version(int *cmd) { -#ifdef CONFIG_ARCH_WANT_COMPAT_IPC_PARSE_VERSION int version = *cmd & IPC_64; *cmd &= ~IPC_64; return version; -#else - return IPC_64; -#endif } #endif @@ -261,29 +254,29 @@ long ksys_semtimedop(int semid, struct sembuf __user *tsops, unsigned int nsops, const struct __kernel_timespec __user *timeout); long ksys_semget(key_t key, int nsems, int semflg); -long ksys_semctl(int semid, int semnum, int cmd, unsigned long arg); +long ksys_old_semctl(int semid, int semnum, int cmd, unsigned long arg); long ksys_msgget(key_t key, int msgflg); -long ksys_msgctl(int msqid, int cmd, struct msqid_ds __user *buf); +long ksys_old_msgctl(int msqid, int cmd, struct msqid_ds __user *buf); long ksys_msgrcv(int msqid, struct msgbuf __user *msgp, size_t msgsz, long msgtyp, int msgflg); long ksys_msgsnd(int msqid, struct msgbuf __user *msgp, size_t msgsz, int msgflg); long ksys_shmget(key_t key, size_t size, int shmflg); long ksys_shmdt(char __user *shmaddr); -long ksys_shmctl(int shmid, int cmd, struct shmid_ds __user *buf); +long ksys_old_shmctl(int shmid, int cmd, struct shmid_ds __user *buf); /* for CONFIG_ARCH_WANT_OLD_COMPAT_IPC */ long compat_ksys_semtimedop(int semid, struct sembuf __user *tsems, unsigned int nsops, const struct old_timespec32 __user *timeout); #ifdef CONFIG_COMPAT -long compat_ksys_semctl(int semid, int semnum, int cmd, int arg); -long compat_ksys_msgctl(int msqid, int cmd, void __user *uptr); +long compat_ksys_old_semctl(int semid, int semnum, int cmd, int arg); +long compat_ksys_old_msgctl(int msqid, int cmd, void __user *uptr); long compat_ksys_msgrcv(int msqid, compat_uptr_t msgp, compat_ssize_t msgsz, compat_long_t msgtyp, int msgflg); long compat_ksys_msgsnd(int msqid, compat_uptr_t msgp, compat_ssize_t msgsz, int msgflg); -long compat_ksys_shmctl(int shmid, int cmd, void __user *uptr); +long compat_ksys_old_shmctl(int shmid, int cmd, void __user *uptr); #endif /* CONFIG_COMPAT */ #endif diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index bc934f31ab10..ce04431a40d1 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -197,6 +197,7 @@ COND_SYSCALL_COMPAT(mq_getsetattr); /* ipc/msg.c */ COND_SYSCALL(msgget); +COND_SYSCALL(old_msgctl); COND_SYSCALL(msgctl); COND_SYSCALL_COMPAT(msgctl); COND_SYSCALL(msgrcv); @@ -206,6 +207,7 @@ COND_SYSCALL_COMPAT(msgsnd); /* ipc/sem.c */ COND_SYSCALL(semget); +COND_SYSCALL(old_semctl); COND_SYSCALL(semctl); COND_SYSCALL_COMPAT(semctl); COND_SYSCALL(semtimedop); @@ -214,6 +216,7 @@ COND_SYSCALL(semop); /* ipc/shm.c */ COND_SYSCALL(shmget); +COND_SYSCALL(old_shmctl); COND_SYSCALL(shmctl); COND_SYSCALL_COMPAT(shmctl); COND_SYSCALL(shmat); -- cgit v1.3-7-g2ca7 From 4b7d248b3a1de483ffe9d05c1debbf32a544164d Mon Sep 17 00:00:00 2001 From: Richard Guy Briggs Date: Tue, 22 Jan 2019 17:06:39 -0500 Subject: audit: move loginuid and sessionid from CONFIG_AUDITSYSCALL to CONFIG_AUDIT loginuid and sessionid (and audit_log_session_info) should be part of CONFIG_AUDIT scope and not CONFIG_AUDITSYSCALL since it is used in CONFIG_CHANGE, ANOM_LINK, FEATURE_CHANGE (and INTEGRITY_RULE), none of which are otherwise dependent on AUDITSYSCALL. Please see github issue https://github.com/linux-audit/audit-kernel/issues/104 Signed-off-by: Richard Guy Briggs [PM: tweaked subject line for better grep'ing] Signed-off-by: Paul Moore --- fs/proc/base.c | 6 ++-- include/linux/audit.h | 42 +++++++++++++------------ include/linux/sched.h | 2 +- init/init_task.c | 2 +- kernel/audit.c | 85 +++++++++++++++++++++++++++++++++++++++++++++++++++ kernel/auditsc.c | 84 -------------------------------------------------- 6 files changed, 113 insertions(+), 108 deletions(-) (limited to 'kernel') diff --git a/fs/proc/base.c b/fs/proc/base.c index 633a63462573..a23651ce6960 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -1210,7 +1210,7 @@ static const struct file_operations proc_oom_score_adj_operations = { .llseek = default_llseek, }; -#ifdef CONFIG_AUDITSYSCALL +#ifdef CONFIG_AUDIT #define TMPBUFLEN 11 static ssize_t proc_loginuid_read(struct file * file, char __user * buf, size_t count, loff_t *ppos) @@ -3002,7 +3002,7 @@ static const struct pid_entry tgid_base_stuff[] = { ONE("oom_score", S_IRUGO, proc_oom_score), REG("oom_adj", S_IRUGO|S_IWUSR, proc_oom_adj_operations), REG("oom_score_adj", S_IRUGO|S_IWUSR, proc_oom_score_adj_operations), -#ifdef CONFIG_AUDITSYSCALL +#ifdef CONFIG_AUDIT REG("loginuid", S_IWUSR|S_IRUGO, proc_loginuid_operations), REG("sessionid", S_IRUGO, proc_sessionid_operations), #endif @@ -3390,7 +3390,7 @@ static const struct pid_entry tid_base_stuff[] = { ONE("oom_score", S_IRUGO, proc_oom_score), REG("oom_adj", S_IRUGO|S_IWUSR, proc_oom_adj_operations), REG("oom_score_adj", S_IRUGO|S_IWUSR, proc_oom_score_adj_operations), -#ifdef CONFIG_AUDITSYSCALL +#ifdef CONFIG_AUDIT REG("loginuid", S_IWUSR|S_IRUGO, proc_loginuid_operations), REG("sessionid", S_IRUGO, proc_sessionid_operations), #endif diff --git a/include/linux/audit.h b/include/linux/audit.h index a625c29a2ea2..ecb5d317d6a2 100644 --- a/include/linux/audit.h +++ b/include/linux/audit.h @@ -159,6 +159,18 @@ extern int audit_update_lsm_rules(void); extern int audit_rule_change(int type, int seq, void *data, size_t datasz); extern int audit_list_rules_send(struct sk_buff *request_skb, int seq); +extern int audit_set_loginuid(kuid_t loginuid); + +static inline kuid_t audit_get_loginuid(struct task_struct *tsk) +{ + return tsk->loginuid; +} + +static inline unsigned int audit_get_sessionid(struct task_struct *tsk) +{ + return tsk->sessionid; +} + extern u32 audit_enabled; #else /* CONFIG_AUDIT */ static inline __printf(4, 5) @@ -201,6 +213,17 @@ static inline int audit_log_task_context(struct audit_buffer *ab) } static inline void audit_log_task_info(struct audit_buffer *ab) { } + +static inline kuid_t audit_get_loginuid(struct task_struct *tsk) +{ + return INVALID_UID; +} + +static inline unsigned int audit_get_sessionid(struct task_struct *tsk) +{ + return AUDIT_SID_UNSET; +} + #define audit_enabled AUDIT_OFF #endif /* CONFIG_AUDIT */ @@ -323,17 +346,6 @@ static inline void audit_ptrace(struct task_struct *t) extern unsigned int audit_serial(void); extern int auditsc_get_stamp(struct audit_context *ctx, struct timespec64 *t, unsigned int *serial); -extern int audit_set_loginuid(kuid_t loginuid); - -static inline kuid_t audit_get_loginuid(struct task_struct *tsk) -{ - return tsk->loginuid; -} - -static inline unsigned int audit_get_sessionid(struct task_struct *tsk) -{ - return tsk->sessionid; -} extern void __audit_ipc_obj(struct kern_ipc_perm *ipcp); extern void __audit_ipc_set_perm(unsigned long qbytes, uid_t uid, gid_t gid, umode_t mode); @@ -519,14 +531,6 @@ static inline int auditsc_get_stamp(struct audit_context *ctx, { return 0; } -static inline kuid_t audit_get_loginuid(struct task_struct *tsk) -{ - return INVALID_UID; -} -static inline unsigned int audit_get_sessionid(struct task_struct *tsk) -{ - return AUDIT_SID_UNSET; -} static inline void audit_ipc_obj(struct kern_ipc_perm *ipcp) { } static inline void audit_ipc_set_perm(unsigned long qbytes, uid_t uid, diff --git a/include/linux/sched.h b/include/linux/sched.h index 89541d248893..f9788bb122c5 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -886,7 +886,7 @@ struct task_struct { struct callback_head *task_works; struct audit_context *audit_context; -#ifdef CONFIG_AUDITSYSCALL +#ifdef CONFIG_AUDIT kuid_t loginuid; unsigned int sessionid; #endif diff --git a/init/init_task.c b/init/init_task.c index 5aebe3be4d7c..39c3109acc1a 100644 --- a/init/init_task.c +++ b/init/init_task.c @@ -121,7 +121,7 @@ struct task_struct init_task .thread_pid = &init_struct_pid, .thread_group = LIST_HEAD_INIT(init_task.thread_group), .thread_node = LIST_HEAD_INIT(init_signals.thread_head), -#ifdef CONFIG_AUDITSYSCALL +#ifdef CONFIG_AUDIT .loginuid = INVALID_UID, .sessionid = AUDIT_SID_UNSET, #endif diff --git a/kernel/audit.c b/kernel/audit.c index c2a7662cc254..2a32f304223d 100644 --- a/kernel/audit.c +++ b/kernel/audit.c @@ -2335,6 +2335,91 @@ void audit_log_link_denied(const char *operation) audit_log_end(ab); } +/* global counter which is incremented every time something logs in */ +static atomic_t session_id = ATOMIC_INIT(0); + +static int audit_set_loginuid_perm(kuid_t loginuid) +{ + /* if we are unset, we don't need privs */ + if (!audit_loginuid_set(current)) + return 0; + /* if AUDIT_FEATURE_LOGINUID_IMMUTABLE means never ever allow a change*/ + if (is_audit_feature_set(AUDIT_FEATURE_LOGINUID_IMMUTABLE)) + return -EPERM; + /* it is set, you need permission */ + if (!capable(CAP_AUDIT_CONTROL)) + return -EPERM; + /* reject if this is not an unset and we don't allow that */ + if (is_audit_feature_set(AUDIT_FEATURE_ONLY_UNSET_LOGINUID) + && uid_valid(loginuid)) + return -EPERM; + return 0; +} + +static void audit_log_set_loginuid(kuid_t koldloginuid, kuid_t kloginuid, + unsigned int oldsessionid, + unsigned int sessionid, int rc) +{ + struct audit_buffer *ab; + uid_t uid, oldloginuid, loginuid; + struct tty_struct *tty; + + if (!audit_enabled) + return; + + ab = audit_log_start(NULL, GFP_KERNEL, AUDIT_LOGIN); + if (!ab) + return; + + uid = from_kuid(&init_user_ns, task_uid(current)); + oldloginuid = from_kuid(&init_user_ns, koldloginuid); + loginuid = from_kuid(&init_user_ns, kloginuid), + tty = audit_get_tty(); + + audit_log_format(ab, "pid=%d uid=%u", task_tgid_nr(current), uid); + audit_log_task_context(ab); + audit_log_format(ab, " old-auid=%u auid=%u tty=%s old-ses=%u ses=%u res=%d", + oldloginuid, loginuid, tty ? tty_name(tty) : "(none)", + oldsessionid, sessionid, !rc); + audit_put_tty(tty); + audit_log_end(ab); +} + +/** + * audit_set_loginuid - set current task's loginuid + * @loginuid: loginuid value + * + * Returns 0. + * + * Called (set) from fs/proc/base.c::proc_loginuid_write(). + */ +int audit_set_loginuid(kuid_t loginuid) +{ + unsigned int oldsessionid, sessionid = AUDIT_SID_UNSET; + kuid_t oldloginuid; + int rc; + + oldloginuid = audit_get_loginuid(current); + oldsessionid = audit_get_sessionid(current); + + rc = audit_set_loginuid_perm(loginuid); + if (rc) + goto out; + + /* are we setting or clearing? */ + if (uid_valid(loginuid)) { + sessionid = (unsigned int)atomic_inc_return(&session_id); + if (unlikely(sessionid == AUDIT_SID_UNSET)) + sessionid = (unsigned int)atomic_inc_return(&session_id); + } + + current->sessionid = sessionid; + current->loginuid = loginuid; +out: + audit_log_set_loginuid(oldloginuid, loginuid, oldsessionid, sessionid, rc); + return rc; +} + /** * audit_log_end - end one audit record * @ab: the audit_buffer diff --git a/kernel/auditsc.c b/kernel/auditsc.c index b585ceb2f7a2..572d247957fb 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -1983,90 +1983,6 @@ int auditsc_get_stamp(struct audit_context *ctx, return 1; } -/* global counter which is incremented every time something logs in */ -static atomic_t session_id = ATOMIC_INIT(0); - -static int audit_set_loginuid_perm(kuid_t loginuid) -{ - /* if we are unset, we don't need privs */ - if (!audit_loginuid_set(current)) - return 0; - /* if AUDIT_FEATURE_LOGINUID_IMMUTABLE means never ever allow a change*/ - if (is_audit_feature_set(AUDIT_FEATURE_LOGINUID_IMMUTABLE)) - return -EPERM; - /* it is set, you need permission */ - if (!capable(CAP_AUDIT_CONTROL)) - return -EPERM; - /* reject if this is not an unset and we don't allow that */ - if (is_audit_feature_set(AUDIT_FEATURE_ONLY_UNSET_LOGINUID) && uid_valid(loginuid)) - return -EPERM; - return 0; -} - -static void audit_log_set_loginuid(kuid_t koldloginuid, kuid_t kloginuid, - unsigned int oldsessionid, unsigned int sessionid, - int rc) -{ - struct audit_buffer *ab; - uid_t uid, oldloginuid, loginuid; - struct tty_struct *tty; - - if (!audit_enabled) - return; - - ab = audit_log_start(NULL, GFP_KERNEL, AUDIT_LOGIN); - if (!ab) - return; - - uid = from_kuid(&init_user_ns, task_uid(current)); - oldloginuid = from_kuid(&init_user_ns, koldloginuid); - loginuid = from_kuid(&init_user_ns, kloginuid), - tty = audit_get_tty(); - - audit_log_format(ab, "pid=%d uid=%u", task_tgid_nr(current), uid); - audit_log_task_context(ab); - audit_log_format(ab, " old-auid=%u auid=%u tty=%s old-ses=%u ses=%u res=%d", - oldloginuid, loginuid, tty ? tty_name(tty) : "(none)", - oldsessionid, sessionid, !rc); - audit_put_tty(tty); - audit_log_end(ab); -} - -/** - * audit_set_loginuid - set current task's audit_context loginuid - * @loginuid: loginuid value - * - * Returns 0. - * - * Called (set) from fs/proc/base.c::proc_loginuid_write(). - */ -int audit_set_loginuid(kuid_t loginuid) -{ - unsigned int oldsessionid, sessionid = AUDIT_SID_UNSET; - kuid_t oldloginuid; - int rc; - - oldloginuid = audit_get_loginuid(current); - oldsessionid = audit_get_sessionid(current); - - rc = audit_set_loginuid_perm(loginuid); - if (rc) - goto out; - - /* are we setting or clearing? */ - if (uid_valid(loginuid)) { - sessionid = (unsigned int)atomic_inc_return(&session_id); - if (unlikely(sessionid == AUDIT_SID_UNSET)) - sessionid = (unsigned int)atomic_inc_return(&session_id); - } - - current->sessionid = sessionid; - current->loginuid = loginuid; -out: - audit_log_set_loginuid(oldloginuid, loginuid, oldsessionid, sessionid, rc); - return rc; -} - /** * __audit_mq_open - record audit data for a POSIX MQ open * @oflag: open flag -- cgit v1.3-7-g2ca7 From 2fec30e245a3b46fef89c4cb1f74eefc5fbb29a6 Mon Sep 17 00:00:00 2001 From: Richard Guy Briggs Date: Wed, 23 Jan 2019 21:36:25 -0500 Subject: audit: add support for fcaps v3 V3 namespaced file capabilities were introduced in commit 8db6c34f1dbc ("Introduce v3 namespaced file capabilities") Add support for these by adding the "frootid" field to the existing fcaps fields in the NAME and BPRM_FCAPS records. Please see github issue https://github.com/linux-audit/audit-kernel/issues/103 Signed-off-by: Richard Guy Briggs Acked-by: Serge Hallyn [PM: comment tweak to fit an 80 char line width] Signed-off-by: Paul Moore --- include/linux/capability.h | 5 +++-- kernel/audit.c | 6 ++++-- kernel/audit.h | 1 + kernel/auditsc.c | 4 ++++ security/commoncap.c | 2 ++ 5 files changed, 14 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/include/linux/capability.h b/include/linux/capability.h index f640dcbc880c..b769330e9380 100644 --- a/include/linux/capability.h +++ b/include/linux/capability.h @@ -14,7 +14,7 @@ #define _LINUX_CAPABILITY_H #include - +#include #define _KERNEL_CAPABILITY_VERSION _LINUX_CAPABILITY_VERSION_3 #define _KERNEL_CAPABILITY_U32S _LINUX_CAPABILITY_U32S_3 @@ -25,11 +25,12 @@ typedef struct kernel_cap_struct { __u32 cap[_KERNEL_CAPABILITY_U32S]; } kernel_cap_t; -/* exact same as vfs_cap_data but in cpu endian and always filled completely */ +/* same as vfs_ns_cap_data but in cpu endian and always filled completely */ struct cpu_vfs_cap_data { __u32 magic_etc; kernel_cap_t permitted; kernel_cap_t inheritable; + kuid_t rootid; }; #define _USER_CAP_HEADER_SIZE (sizeof(struct __user_cap_header_struct)) diff --git a/kernel/audit.c b/kernel/audit.c index 2a32f304223d..3f3f1888cac7 100644 --- a/kernel/audit.c +++ b/kernel/audit.c @@ -2084,8 +2084,9 @@ static void audit_log_fcaps(struct audit_buffer *ab, struct audit_names *name) { audit_log_cap(ab, "cap_fp", &name->fcap.permitted); audit_log_cap(ab, "cap_fi", &name->fcap.inheritable); - audit_log_format(ab, " cap_fe=%d cap_fver=%x", - name->fcap.fE, name->fcap_ver); + audit_log_format(ab, " cap_fe=%d cap_fver=%x cap_frootid=%d", + name->fcap.fE, name->fcap_ver, + from_kuid(&init_user_ns, name->fcap.rootid)); } static inline int audit_copy_fcaps(struct audit_names *name, @@ -2104,6 +2105,7 @@ static inline int audit_copy_fcaps(struct audit_names *name, name->fcap.permitted = caps.permitted; name->fcap.inheritable = caps.inheritable; name->fcap.fE = !!(caps.magic_etc & VFS_CAP_FLAGS_EFFECTIVE); + name->fcap.rootid = caps.rootid; name->fcap_ver = (caps.magic_etc & VFS_CAP_REVISION_MASK) >> VFS_CAP_REVISION_SHIFT; diff --git a/kernel/audit.h b/kernel/audit.h index 6ffb70575082..deefdbe61a47 100644 --- a/kernel/audit.h +++ b/kernel/audit.h @@ -69,6 +69,7 @@ struct audit_cap_data { kernel_cap_t effective; /* effective set of process */ }; kernel_cap_t ambient; + kuid_t rootid; }; /* When fs/namei.c:getname() is called, we store the pointer in name and bump diff --git a/kernel/auditsc.c b/kernel/auditsc.c index 572d247957fb..c16beb25fd0a 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -1358,6 +1358,9 @@ static void audit_log_exit(void) audit_log_cap(ab, "pi", &axs->new_pcap.inheritable); audit_log_cap(ab, "pe", &axs->new_pcap.effective); audit_log_cap(ab, "pa", &axs->new_pcap.ambient); + audit_log_format(ab, " frootid=%d", + from_kuid(&init_user_ns, + axs->fcap.rootid)); break; } } @@ -2271,6 +2274,7 @@ int __audit_log_bprm_fcaps(struct linux_binprm *bprm, ax->fcap.permitted = vcaps.permitted; ax->fcap.inheritable = vcaps.inheritable; ax->fcap.fE = !!(vcaps.magic_etc & VFS_CAP_FLAGS_EFFECTIVE); + ax->fcap.rootid = vcaps.rootid; ax->fcap_ver = (vcaps.magic_etc & VFS_CAP_REVISION_MASK) >> VFS_CAP_REVISION_SHIFT; ax->old_pcap.permitted = old->cap_permitted; diff --git a/security/commoncap.c b/security/commoncap.c index 232db019f051..c097f3568001 100644 --- a/security/commoncap.c +++ b/security/commoncap.c @@ -643,6 +643,8 @@ int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data cpu_caps->permitted.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK; cpu_caps->inheritable.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK; + cpu_caps->rootid = rootkuid; + return 0; } -- cgit v1.3-7-g2ca7 From 40852275a94afb3e836be9248399e036982d1a79 Mon Sep 17 00:00:00 2001 From: Micah Morton Date: Tue, 22 Jan 2019 14:42:09 -0800 Subject: LSM: add SafeSetID module that gates setid calls This change ensures that the set*uid family of syscalls in kernel/sys.c (setreuid, setuid, setresuid, setfsuid) all call ns_capable_common with the CAP_OPT_INSETID flag, so capability checks in the security_capable hook can know whether they are being called from within a set*uid syscall. This change is a no-op by itself, but is needed for the proposed SafeSetID LSM. Signed-off-by: Micah Morton Acked-by: Kees Cook Signed-off-by: James Morris --- include/linux/capability.h | 5 +++++ kernel/capability.c | 19 +++++++++++++++++++ kernel/sys.c | 10 +++++----- 3 files changed, 29 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/include/linux/capability.h b/include/linux/capability.h index f640dcbc880c..c3f9a4d558a0 100644 --- a/include/linux/capability.h +++ b/include/linux/capability.h @@ -209,6 +209,7 @@ extern bool has_ns_capability_noaudit(struct task_struct *t, extern bool capable(int cap); extern bool ns_capable(struct user_namespace *ns, int cap); extern bool ns_capable_noaudit(struct user_namespace *ns, int cap); +extern bool ns_capable_setid(struct user_namespace *ns, int cap); #else static inline bool has_capability(struct task_struct *t, int cap) { @@ -240,6 +241,10 @@ static inline bool ns_capable_noaudit(struct user_namespace *ns, int cap) { return true; } +static inline bool ns_capable_setid(struct user_namespace *ns, int cap) +{ + return true; +} #endif /* CONFIG_MULTIUSER */ extern bool privileged_wrt_inode_uidgid(struct user_namespace *ns, const struct inode *inode); extern bool capable_wrt_inode_uidgid(const struct inode *inode, int cap); diff --git a/kernel/capability.c b/kernel/capability.c index cfbbcb68e11e..1444f3954d75 100644 --- a/kernel/capability.c +++ b/kernel/capability.c @@ -415,6 +415,25 @@ bool ns_capable_noaudit(struct user_namespace *ns, int cap) } EXPORT_SYMBOL(ns_capable_noaudit); +/** + * ns_capable_setid - Determine if the current task has a superior capability + * in effect, while signalling that this check is being done from within a + * setid syscall. + * @ns: The usernamespace we want the capability in + * @cap: The capability to be tested for + * + * Return true if the current task has the given superior capability currently + * available for use, false if not. + * + * This sets PF_SUPERPRIV on the task if the capability is available on the + * assumption that it's about to be used. + */ +bool ns_capable_setid(struct user_namespace *ns, int cap) +{ + return ns_capable_common(ns, cap, CAP_OPT_INSETID); +} +EXPORT_SYMBOL(ns_capable_setid); + /** * capable - Determine if the current task has a superior capability in effect * @cap: The capability to be tested for diff --git a/kernel/sys.c b/kernel/sys.c index f7eb62eceb24..c5f875048aef 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -516,7 +516,7 @@ long __sys_setreuid(uid_t ruid, uid_t euid) new->uid = kruid; if (!uid_eq(old->uid, kruid) && !uid_eq(old->euid, kruid) && - !ns_capable(old->user_ns, CAP_SETUID)) + !ns_capable_setid(old->user_ns, CAP_SETUID)) goto error; } @@ -525,7 +525,7 @@ long __sys_setreuid(uid_t ruid, uid_t euid) if (!uid_eq(old->uid, keuid) && !uid_eq(old->euid, keuid) && !uid_eq(old->suid, keuid) && - !ns_capable(old->user_ns, CAP_SETUID)) + !ns_capable_setid(old->user_ns, CAP_SETUID)) goto error; } @@ -584,7 +584,7 @@ long __sys_setuid(uid_t uid) old = current_cred(); retval = -EPERM; - if (ns_capable(old->user_ns, CAP_SETUID)) { + if (ns_capable_setid(old->user_ns, CAP_SETUID)) { new->suid = new->uid = kuid; if (!uid_eq(kuid, old->uid)) { retval = set_user(new); @@ -646,7 +646,7 @@ long __sys_setresuid(uid_t ruid, uid_t euid, uid_t suid) old = current_cred(); retval = -EPERM; - if (!ns_capable(old->user_ns, CAP_SETUID)) { + if (!ns_capable_setid(old->user_ns, CAP_SETUID)) { if (ruid != (uid_t) -1 && !uid_eq(kruid, old->uid) && !uid_eq(kruid, old->euid) && !uid_eq(kruid, old->suid)) goto error; @@ -814,7 +814,7 @@ long __sys_setfsuid(uid_t uid) if (uid_eq(kuid, old->uid) || uid_eq(kuid, old->euid) || uid_eq(kuid, old->suid) || uid_eq(kuid, old->fsuid) || - ns_capable(old->user_ns, CAP_SETUID)) { + ns_capable_setid(old->user_ns, CAP_SETUID)) { if (!uid_eq(kuid, old->fsuid)) { new->fsuid = kuid; if (security_task_fix_setuid(new, old, LSM_SETID_FS) == 0) -- cgit v1.3-7-g2ca7 From a252f56a3c922197ef40dce8f8cc258ae75e0193 Mon Sep 17 00:00:00 2001 From: Richard Guy Briggs Date: Wed, 23 Jan 2019 13:34:59 -0500 Subject: audit: more filter PATH records keyed on filesystem magic Like commit 42d5e37654e4 ("audit: filter PATH records keyed on filesystem magic") that addresses https://github.com/linux-audit/audit-kernel/issues/8 Any user or remote filesystem could become unavailable and effectively block on a forced unmount. -a always,exit -S umount2 -F key=umount2 Provide a method to ignore these user and remote filesystems to prevent them from being impossible to unmount. Extend the "AUDIT_FILTER_FS" filter that uses the field type AUDIT_FSTYPE keying off the filesystem 4-octet hexadecimal magic identifier to filter specific filesystems to cover audit_inode() to address this blockage. An example rule would look like: -a never,filesystem -F fstype=0x517B -F key=ignore_smb -a never,filesystem -F fstype=0x6969 -F key=ignore_nfs Arguably the better way to address this issue is to disable auditing processes that touch removable filesystems. Note: refactor __audit_inode_child() to remove two levels of if indentation. Please see the github issue tracker https://github.com/linux-audit/audit-kernel/issues/100 Signed-off-by: Richard Guy Briggs Signed-off-by: Paul Moore --- kernel/auditsc.c | 35 +++++++++++++++++++++++++++-------- 1 file changed, 27 insertions(+), 8 deletions(-) (limited to 'kernel') diff --git a/kernel/auditsc.c b/kernel/auditsc.c index c16beb25fd0a..a2696ce790f9 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -1766,10 +1766,31 @@ void __audit_inode(struct filename *name, const struct dentry *dentry, struct inode *inode = d_backing_inode(dentry); struct audit_names *n; bool parent = flags & AUDIT_INODE_PARENT; + struct audit_entry *e; + struct list_head *list = &audit_filter_list[AUDIT_FILTER_FS]; + int i; if (!context->in_syscall) return; + rcu_read_lock(); + if (!list_empty(list)) { + list_for_each_entry_rcu(e, list, list) { + for (i = 0; i < e->rule.field_count; i++) { + struct audit_field *f = &e->rule.fields[i]; + + if (f->type == AUDIT_FSTYPE + && audit_comparator(inode->i_sb->s_magic, + f->op, f->val) + && e->rule.action == AUDIT_NEVER) { + rcu_read_unlock(); + return; + } + } + } + } + rcu_read_unlock(); + if (!name) goto out_alloc; @@ -1878,14 +1899,12 @@ void __audit_inode_child(struct inode *parent, for (i = 0; i < e->rule.field_count; i++) { struct audit_field *f = &e->rule.fields[i]; - if (f->type == AUDIT_FSTYPE) { - if (audit_comparator(parent->i_sb->s_magic, - f->op, f->val)) { - if (e->rule.action == AUDIT_NEVER) { - rcu_read_unlock(); - return; - } - } + if (f->type == AUDIT_FSTYPE + && audit_comparator(parent->i_sb->s_magic, + f->op, f->val) + && e->rule.action == AUDIT_NEVER) { + rcu_read_unlock(); + return; } } } -- cgit v1.3-7-g2ca7 From 05c7a9cb2727cd3c3d8e767f48e5cd18486a8d16 Mon Sep 17 00:00:00 2001 From: Richard Guy Briggs Date: Tue, 22 Jan 2019 17:07:41 -0500 Subject: audit: clean up AUDITSYSCALL prototypes and stubs Pull together all the audit syscall watch, mark and tree prototypes and stubs into the same ifdef. Signed-off-by: Richard Guy Briggs Signed-off-by: Paul Moore --- kernel/audit.h | 64 ++++++++++++++++++++++++++++++---------------------------- 1 file changed, 33 insertions(+), 31 deletions(-) (limited to 'kernel') diff --git a/kernel/audit.h b/kernel/audit.h index deefdbe61a47..9acb8691ed87 100644 --- a/kernel/audit.h +++ b/kernel/audit.h @@ -268,25 +268,47 @@ extern void audit_log_d_path_exe(struct audit_buffer *ab, extern struct tty_struct *audit_get_tty(void); extern void audit_put_tty(struct tty_struct *tty); -/* audit watch functions */ +/* audit watch/mark/tree functions */ #ifdef CONFIG_AUDITSYSCALL extern void audit_put_watch(struct audit_watch *watch); extern void audit_get_watch(struct audit_watch *watch); -extern int audit_to_watch(struct audit_krule *krule, char *path, int len, u32 op); +extern int audit_to_watch(struct audit_krule *krule, char *path, int len, + u32 op); extern int audit_add_watch(struct audit_krule *krule, struct list_head **list); extern void audit_remove_watch_rule(struct audit_krule *krule); extern char *audit_watch_path(struct audit_watch *watch); -extern int audit_watch_compare(struct audit_watch *watch, unsigned long ino, dev_t dev); +extern int audit_watch_compare(struct audit_watch *watch, unsigned long ino, + dev_t dev); -extern struct audit_fsnotify_mark *audit_alloc_mark(struct audit_krule *krule, char *pathname, int len); +extern struct audit_fsnotify_mark *audit_alloc_mark(struct audit_krule *krule, + char *pathname, int len); extern char *audit_mark_path(struct audit_fsnotify_mark *mark); extern void audit_remove_mark(struct audit_fsnotify_mark *audit_mark); extern void audit_remove_mark_rule(struct audit_krule *krule); -extern int audit_mark_compare(struct audit_fsnotify_mark *mark, unsigned long ino, dev_t dev); +extern int audit_mark_compare(struct audit_fsnotify_mark *mark, + unsigned long ino, dev_t dev); extern int audit_dupe_exe(struct audit_krule *new, struct audit_krule *old); -extern int audit_exe_compare(struct task_struct *tsk, struct audit_fsnotify_mark *mark); +extern int audit_exe_compare(struct task_struct *tsk, + struct audit_fsnotify_mark *mark); -#else +extern struct audit_chunk *audit_tree_lookup(const struct inode *inode); +extern void audit_put_chunk(struct audit_chunk *chunk); +extern bool audit_tree_match(struct audit_chunk *chunk, + struct audit_tree *tree); +extern int audit_make_tree(struct audit_krule *rule, char *pathname, u32 op); +extern int audit_add_tree_rule(struct audit_krule *rule); +extern int audit_remove_tree_rule(struct audit_krule *rule); +extern void audit_trim_trees(void); +extern int audit_tag_tree(char *old, char *new); +extern const char *audit_tree_path(struct audit_tree *tree); +extern void audit_put_tree(struct audit_tree *tree); +extern void audit_kill_trees(struct audit_context *context); + +extern int audit_signal_info(int sig, struct task_struct *t); +extern void audit_filter_inodes(struct task_struct *tsk, + struct audit_context *ctx); +extern struct list_head *audit_killed_trees(void); +#else /* CONFIG_AUDITSYSCALL */ #define audit_put_watch(w) {} #define audit_get_watch(w) {} #define audit_to_watch(k, p, l, o) (-EINVAL) @@ -302,21 +324,7 @@ extern int audit_exe_compare(struct task_struct *tsk, struct audit_fsnotify_mark #define audit_mark_compare(m, i, d) 0 #define audit_exe_compare(t, m) (-EINVAL) #define audit_dupe_exe(n, o) (-EINVAL) -#endif /* CONFIG_AUDITSYSCALL */ -#ifdef CONFIG_AUDITSYSCALL -extern struct audit_chunk *audit_tree_lookup(const struct inode *inode); -extern void audit_put_chunk(struct audit_chunk *chunk); -extern bool audit_tree_match(struct audit_chunk *chunk, struct audit_tree *tree); -extern int audit_make_tree(struct audit_krule *rule, char *pathname, u32 op); -extern int audit_add_tree_rule(struct audit_krule *rule); -extern int audit_remove_tree_rule(struct audit_krule *rule); -extern void audit_trim_trees(void); -extern int audit_tag_tree(char *old, char *new); -extern const char *audit_tree_path(struct audit_tree *tree); -extern void audit_put_tree(struct audit_tree *tree); -extern void audit_kill_trees(struct audit_context *context); -#else #define audit_remove_tree_rule(rule) BUG() #define audit_add_tree_rule(rule) -EINVAL #define audit_make_tree(rule, str, op) -EINVAL @@ -325,7 +333,10 @@ extern void audit_kill_trees(struct audit_context *context); #define audit_tag_tree(old, new) -EINVAL #define audit_tree_path(rule) "" /* never called */ #define audit_kill_trees(context) BUG() -#endif + +#define audit_signal_info(s, t) AUDIT_DISABLED +#define audit_filter_inodes(t, c) AUDIT_DISABLED +#endif /* CONFIG_AUDITSYSCALL */ extern char *audit_unpack_string(void **bufp, size_t *remain, size_t len); @@ -335,14 +346,5 @@ extern u32 audit_sig_sid; extern int audit_filter(int msgtype, unsigned int listtype); -#ifdef CONFIG_AUDITSYSCALL -extern int audit_signal_info(int sig, struct task_struct *t); -extern void audit_filter_inodes(struct task_struct *tsk, struct audit_context *ctx); -extern struct list_head *audit_killed_trees(void); -#else -#define audit_signal_info(s,t) AUDIT_DISABLED -#define audit_filter_inodes(t,c) AUDIT_DISABLED -#endif - extern void audit_ctl_lock(void); extern void audit_ctl_unlock(void); -- cgit v1.3-7-g2ca7 From 337e9b07db3b8c7f7d68b849df32f434a1a3b831 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 6 Nov 2018 19:10:53 -0800 Subject: sched: Replace call_rcu_sched() with call_rcu() Now that call_rcu()'s callback is not invoked until after all preempt-disable regions of code have completed (in addition to explicitly marked RCU read-side critical sections), call_rcu() can be used in place of call_rcu_sched(). This commit therefore makes that change. While in the area, this commit also updates an outdated header comment for for_each_domain(). Signed-off-by: Paul E. McKenney Cc: Ingo Molnar Cc: Peter Zijlstra --- kernel/sched/sched.h | 2 +- kernel/sched/topology.c | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index d04530bf251f..6665b9c02e2f 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1260,7 +1260,7 @@ extern void sched_ttwu_pending(void); /* * The domain tree (rq->sd) is protected by RCU's quiescent state transition. - * See detach_destroy_domains: synchronize_sched for details. + * See destroy_sched_domains: call_rcu for details. * * The domain tree of any CPU may only be accessed from within * preempt-disabled sections. diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index 3f35ba1d8fde..7d905f55e7fa 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -442,7 +442,7 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd) raw_spin_unlock_irqrestore(&rq->lock, flags); if (old_rd) - call_rcu_sched(&old_rd->rcu, free_rootdomain); + call_rcu(&old_rd->rcu, free_rootdomain); } void sched_get_rd(struct root_domain *rd) @@ -455,7 +455,7 @@ void sched_put_rd(struct root_domain *rd) if (!atomic_dec_and_test(&rd->refcount)) return; - call_rcu_sched(&rd->rcu, free_rootdomain); + call_rcu(&rd->rcu, free_rootdomain); } static int init_rootdomain(struct root_domain *rd) -- cgit v1.3-7-g2ca7 From b290ebcf7bc4638b38c413f192963f4b74e45b7b Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 6 Nov 2018 19:13:54 -0800 Subject: sched: Replace synchronize_sched() with synchronize_rcu() Now that synchronize_rcu() waits for preempt-disable regions of code as well as RCU read-side critical sections, synchronize_sched() can be replaced by synchronize_rcu(), in fact, synchronize_sched() is now completely equivalent to synchronize_rcu(). This commit therefore replaces synchronize_sched() with synchronize_rcu() so that synchronize_sched() can eventually be removed entirely. Signed-off-by: Paul E. McKenney Cc: Ingo Molnar Cc: Peter Zijlstra --- kernel/sched/cpufreq.c | 4 ++-- kernel/sched/cpufreq_schedutil.c | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/cpufreq.c b/kernel/sched/cpufreq.c index 22bd8980f32f..835671f0f917 100644 --- a/kernel/sched/cpufreq.c +++ b/kernel/sched/cpufreq.c @@ -48,8 +48,8 @@ EXPORT_SYMBOL_GPL(cpufreq_add_update_util_hook); * * Clear the update_util_data pointer for the given CPU. * - * Callers must use RCU-sched callbacks to free any memory that might be - * accessed via the old update_util_data pointer or invoke synchronize_sched() + * Callers must use RCU callbacks to free any memory that might be + * accessed via the old update_util_data pointer or invoke synchronize_rcu() * right after this function to avoid use-after-free. */ void cpufreq_remove_update_util_hook(int cpu) diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c index 033ec7c45f13..2efe629425be 100644 --- a/kernel/sched/cpufreq_schedutil.c +++ b/kernel/sched/cpufreq_schedutil.c @@ -859,7 +859,7 @@ static void sugov_stop(struct cpufreq_policy *policy) for_each_cpu(cpu, policy->cpus) cpufreq_remove_update_util_hook(cpu); - synchronize_sched(); + synchronize_rcu(); if (!policy->fast_switch_enabled) { irq_work_sync(&sg_policy->irq_work); -- cgit v1.3-7-g2ca7 From ad368d15b08ad22509a56bdfd6ee3a04da91ce10 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 27 Nov 2018 13:55:53 -0800 Subject: rcu: Rename and comment changes due to only one rcuo kthread per CPU Given RCU flavor consolidation, the name rcu_spawn_all_nocb_kthreads() is quite misleading. It no longer ever creates more than one kthread, and it does so only for the specified CPU. This commit therefore changes this name to the more descriptive rcu_spawn_cpu_nocb_kthread(), and also fixes up a similar issue in its header comment while in the area. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 2 +- kernel/rcu/tree.h | 2 +- kernel/rcu/tree_plugin.h | 8 ++++---- 3 files changed, 6 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 9180158756d2..f4edc664fb65 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -3299,7 +3299,7 @@ int rcutree_prepare_cpu(unsigned int cpu) trace_rcu_grace_period(rcu_state.name, rdp->gp_seq, TPS("cpuonl")); raw_spin_unlock_irqrestore_rcu_node(rnp, flags); rcu_prepare_kthreads(cpu); - rcu_spawn_all_nocb_kthreads(cpu); + rcu_spawn_cpu_nocb_kthread(cpu); return 0; } diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index d90b02b53c0e..bcfd684a5c57 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -451,7 +451,7 @@ static bool rcu_nocb_adopt_orphan_cbs(struct rcu_data *my_rdp, static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp); static void do_nocb_deferred_wakeup(struct rcu_data *rdp); static void rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp); -static void rcu_spawn_all_nocb_kthreads(int cpu); +static void rcu_spawn_cpu_nocb_kthread(int cpu); static void __init rcu_spawn_nocb_kthreads(void); #ifdef CONFIG_RCU_NOCB_CPU static void __init rcu_organize_nocb_kthreads(void); diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 1b3dd2fc0cd6..4d4091565a2c 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -2517,9 +2517,9 @@ static void rcu_spawn_one_nocb_kthread(int cpu) /* * If the specified CPU is a no-CBs CPU that does not already have its - * rcuo kthreads, spawn them. + * rcuo kthread, spawn it. */ -static void rcu_spawn_all_nocb_kthreads(int cpu) +static void rcu_spawn_cpu_nocb_kthread(int cpu) { if (rcu_scheduler_fully_active) rcu_spawn_one_nocb_kthread(cpu); @@ -2536,7 +2536,7 @@ static void __init rcu_spawn_nocb_kthreads(void) int cpu; for_each_online_cpu(cpu) - rcu_spawn_all_nocb_kthreads(cpu); + rcu_spawn_cpu_nocb_kthread(cpu); } /* How many follower CPU IDs per leader? Default of -1 for sqrt(nr_cpu_ids). */ @@ -2670,7 +2670,7 @@ static void do_nocb_deferred_wakeup(struct rcu_data *rdp) { } -static void rcu_spawn_all_nocb_kthreads(int cpu) +static void rcu_spawn_cpu_nocb_kthread(int cpu) { } -- cgit v1.3-7-g2ca7 From 1de462ed85062df2ab6939eeee1625e767052907 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 28 Nov 2018 10:37:42 -0800 Subject: rcu: Make expedited IPI handler return after handling critical section During expedited RCU grace-period initialization, IPIs are sent to all non-idle online CPUs. The IPI handler checks to see if the CPU is in quiescent state, reporting one if so. This handler looks at three different cases: (1) The CPU is not in an rcu_read_lock()-based critical section, (2) The CPU is in the process of exiting an rcu_read_lock()-based critical section, and (3) The CPU is in an rcu_read_lock()-based critical section. In case (2), execution falls through into case (3). This is harmless from a functionality viewpoint, but can result in needless overhead during an improbable corner case. This commit therefore adds the "return" statement needed to prevent fall-through. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_exp.h | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h index 928fe5893a57..6d4eb4694b6f 100644 --- a/kernel/rcu/tree_exp.h +++ b/kernel/rcu/tree_exp.h @@ -697,6 +697,7 @@ static void sync_rcu_exp_handler(void *unused) WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, true); } raw_spin_unlock_irqrestore_rcu_node(rnp, flags); + return; } /* -- cgit v1.3-7-g2ca7 From cd920e5a34abea837418691d366472311e7b9147 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 28 Nov 2018 16:57:54 -0800 Subject: rcu: Inline force_quiescent_state() into rcu_force_quiescent_state() Given that rcu_force_quiescent_state() is a simple wrapper around force_quiescent_state(), this commit saves a few lines of code by inlining force_quiescent_state() into rcu_force_quiescent_state(), and changing all references to force_quiescent_state() to instead invoke rcu_force_quiescent_state(). Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 21 ++++++--------------- 1 file changed, 6 insertions(+), 15 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index f4edc664fb65..e56a46444775 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -479,7 +479,6 @@ module_param_cb(jiffies_till_next_fqs, &next_fqs_jiffies_ops, &jiffies_till_next module_param(rcu_kick_kthreads, bool, 0644); static void force_qs_rnp(int (*f)(struct rcu_data *rdp)); -static void force_quiescent_state(void); static int rcu_pending(void); /* @@ -503,15 +502,6 @@ unsigned long rcu_exp_batches_completed(void) } EXPORT_SYMBOL_GPL(rcu_exp_batches_completed); -/* - * Force a quiescent state. - */ -void rcu_force_quiescent_state(void) -{ - force_quiescent_state(); -} -EXPORT_SYMBOL_GPL(rcu_force_quiescent_state); - /* * Convert a ->gp_state value to a character string. */ @@ -1310,7 +1300,7 @@ static void print_other_cpu_stall(unsigned long gp_seq) panic_on_rcu_stall(); - force_quiescent_state(); /* Kick them all. */ + rcu_force_quiescent_state(); /* Kick them all. */ } static void print_cpu_stall(void) @@ -2578,7 +2568,7 @@ static void force_qs_rnp(int (*f)(struct rcu_data *rdp)) * Force quiescent states on reluctant CPUs, and also detect which * CPUs are in dyntick-idle mode. */ -static void force_quiescent_state(void) +void rcu_force_quiescent_state(void) { unsigned long flags; bool ret; @@ -2610,6 +2600,7 @@ static void force_quiescent_state(void) raw_spin_unlock_irqrestore_rcu_node(rnp_old, flags); rcu_gp_kthread_wake(); } +EXPORT_SYMBOL_GPL(rcu_force_quiescent_state); /* * This function checks for grace-period requests that fail to motivate @@ -2801,9 +2792,9 @@ static void __call_rcu_core(struct rcu_data *rdp, struct rcu_head *head, /* * Force the grace period if too many callbacks or too long waiting. - * Enforce hysteresis, and don't invoke force_quiescent_state() + * Enforce hysteresis, and don't invoke rcu_force_quiescent_state() * if some other CPU has recently done so. Also, don't bother - * invoking force_quiescent_state() if the newly enqueued callback + * invoking rcu_force_quiescent_state() if the newly enqueued callback * is the only one waiting for a grace period to complete. */ if (unlikely(rcu_segcblist_n_cbs(&rdp->cblist) > @@ -2820,7 +2811,7 @@ static void __call_rcu_core(struct rcu_data *rdp, struct rcu_head *head, rdp->blimit = LONG_MAX; if (rcu_state.n_force_qs == rdp->n_force_qs_snap && rcu_segcblist_first_pend_cb(&rdp->cblist) != head) - force_quiescent_state(); + rcu_force_quiescent_state(); rdp->n_force_qs_snap = rcu_state.n_force_qs; rdp->qlen_last_fqs_check = rcu_segcblist_n_cbs(&rdp->cblist); } -- cgit v1.3-7-g2ca7 From c97058d03329284068e45796df13510e5f940d8b Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 28 Nov 2018 16:59:50 -0800 Subject: rcu: Eliminate RCU_BH_FLAVOR and RCU_SCHED_FLAVOR Now that the RCU flavors have been consolidated, RCU_BH_FLAVOR and RCU_SCHED_FLAVOR are no longer used. This commit therefore saves a few lines by removing them. Signed-off-by: Paul E. McKenney --- kernel/rcu/rcu.h | 2 -- kernel/rcu/tree.c | 2 -- 2 files changed, 4 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h index a393e24a9195..75787186bd4f 100644 --- a/kernel/rcu/rcu.h +++ b/kernel/rcu/rcu.h @@ -462,8 +462,6 @@ void rcu_request_urgent_qs_task(struct task_struct *t); enum rcutorture_type { RCU_FLAVOR, - RCU_BH_FLAVOR, - RCU_SCHED_FLAVOR, RCU_TASKS_FLAVOR, SRCU_FLAVOR, INVALID_RCU_FLAVOR diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index e56a46444775..fc37bec32731 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -556,8 +556,6 @@ void rcutorture_get_gp_data(enum rcutorture_type test_type, int *flags, { switch (test_type) { case RCU_FLAVOR: - case RCU_BH_FLAVOR: - case RCU_SCHED_FLAVOR: *flags = READ_ONCE(rcu_state.gp_flags); *gp_seq = rcu_seq_current(&rcu_state.gp_seq); break; -- cgit v1.3-7-g2ca7 From c46f497a6151d48cb341e18fdd4dff345f7d253d Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 28 Nov 2018 17:02:44 -0800 Subject: rcu: Inline rcu_kthread_do_work() into its sole remaining caller The rcu_kthread_do_work() function has a single-line body and only one remaining caller. This commit therefore saves a few lines of code by inlining rcu_kthread_do_work() into its sole remaining caller. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 7 +------ 1 file changed, 1 insertion(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 4d4091565a2c..bcf3e7366a28 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1369,11 +1369,6 @@ static int rcu_spawn_one_boost_kthread(struct rcu_node *rnp) return 0; } -static void rcu_kthread_do_work(void) -{ - rcu_do_batch(this_cpu_ptr(&rcu_data)); -} - static void rcu_cpu_kthread_setup(unsigned int cpu) { struct sched_param sp; @@ -1413,7 +1408,7 @@ static void rcu_cpu_kthread(unsigned int cpu) *workp = 0; local_irq_enable(); if (work) - rcu_kthread_do_work(); + rcu_do_batch(this_cpu_ptr(&rcu_data)); local_bh_enable(); if (*workp == 0) { trace_rcu_utilization(TPS("End CPU kthread@rcu_wait")); -- cgit v1.3-7-g2ca7 From 142d106d5e62ff2cf0dfd2dfe1adfcaff1c2ed85 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 29 Nov 2018 09:15:54 -0800 Subject: rcu: Determine expedited-GP IPI handler at build time Back when there could be multiple RCU flavors running in the same kernel at the same time, it was necessary to specify the expedited grace-period IPI handler at runtime. Now that there is only one RCU flavor, the IPI handler can be determined at build time. There is therefore no longer any reason for the RCU-preempt and RCU-sched IPI handlers to have different names, nor is there any reason to pass these handlers in function arguments and in the data structures enclosing workqueues. This commit therefore makes all these changes, pushing the specification of the expedited grace-period IPI handler down to the point of use. Signed-off-by: Paul E. McKenney --- .../Expedited-Grace-Periods/ExpSchedFlow.svg | 18 ++++++++----- .../Expedited-Grace-Periods.html | 26 +++++++++---------- kernel/rcu/tree.h | 1 - kernel/rcu/tree_exp.h | 30 ++++++++++------------ 4 files changed, 38 insertions(+), 37 deletions(-) (limited to 'kernel') diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/ExpSchedFlow.svg b/Documentation/RCU/Design/Expedited-Grace-Periods/ExpSchedFlow.svg index e4233ac93c2b..6189ffcc6aff 100644 --- a/Documentation/RCU/Design/Expedited-Grace-Periods/ExpSchedFlow.svg +++ b/Documentation/RCU/Design/Expedited-Grace-Periods/ExpSchedFlow.svg @@ -328,13 +328,13 @@ inkscape:window-height="1148" id="namedview90" showgrid="true" - inkscape:zoom="0.80021373" - inkscape:cx="462.49289" - inkscape:cy="473.6718" + inkscape:zoom="0.69092787" + inkscape:cx="476.34085" + inkscape:cy="712.80957" inkscape:window-x="770" inkscape:window-y="24" inkscape:window-maximized="0" - inkscape:current-layer="g4114-9-3-9" + inkscape:current-layer="g4" inkscape:snap-grids="false" fit-margin-top="5" fit-margin-right="5" @@ -813,14 +813,18 @@ Requestreched_cpu() + id="tspan3100">context switch diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.html b/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.html index 8e4f873b979f..19e7a5fb6b73 100644 --- a/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.html +++ b/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.html @@ -72,10 +72,10 @@ will ignore it because idle and offline CPUs are already residing in quiescent states. Otherwise, the expedited grace period will use smp_call_function_single() to send the CPU an IPI, which -is handled by sync_rcu_exp_handler(). +is handled by rcu_exp_handler().

-However, because this is preemptible RCU, sync_rcu_exp_handler() +However, because this is preemptible RCU, rcu_exp_handler() can check to see if the CPU is currently running in an RCU read-side critical section. If not, the handler can immediately report a quiescent state. @@ -145,19 +145,18 @@ expedited grace period is shown in the following diagram:

ExpSchedFlow.svg

-As with RCU-preempt's synchronize_rcu_expedited(), +As with RCU-preempt, RCU-sched's synchronize_sched_expedited() ignores offline and idle CPUs, again because they are in remotely detectable quiescent states. -However, the synchronize_rcu_expedited() handler -is sync_sched_exp_handler(), and because the +However, because the rcu_read_lock_sched() and rcu_read_unlock_sched() leave no trace of their invocation, in general it is not possible to tell whether or not the current CPU is in an RCU read-side critical section. -The best that sync_sched_exp_handler() can do is to check +The best that RCU-sched's rcu_exp_handler() can do is to check for idle, on the off-chance that the CPU went idle while the IPI was in flight. -If the CPU is idle, then sync_sched_exp_handler() reports +If the CPU is idle, then rcu_exp_handler() reports the quiescent state.

Otherwise, the handler forces a future context switch by setting the @@ -298,19 +297,18 @@ Instead, the task pushing the grace period forward will include the idle CPUs in the mask passed to rcu_report_exp_cpu_mult().

-For RCU-sched, there is an additional check for idle in the IPI -handler, sync_sched_exp_handler(). +For RCU-sched, there is an additional check: If the IPI has interrupted the idle loop, then -sync_sched_exp_handler() invokes rcu_report_exp_rdp() +rcu_exp_handler() invokes rcu_report_exp_rdp() to report the corresponding quiescent state.

For RCU-preempt, there is no specific check for idle in the -IPI handler (sync_rcu_exp_handler()), but because +IPI handler (rcu_exp_handler()), but because RCU read-side critical sections are not permitted within the -idle loop, if sync_rcu_exp_handler() sees that the CPU is within +idle loop, if rcu_exp_handler() sees that the CPU is within RCU read-side critical section, the CPU cannot possibly be idle. -Otherwise, sync_rcu_exp_handler() invokes +Otherwise, rcu_exp_handler() invokes rcu_report_exp_rdp() to report the corresponding quiescent state, regardless of whether or not that quiescent state was due to the CPU being idle. @@ -625,6 +623,8 @@ checks, but only during the mid-boot dead zone.

With this refinement, synchronous grace periods can now be used from task context pretty much any time during the life of the kernel. +That is, aside from some points in the suspend, hibernate, or shutdown +code path.

Summary

diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index bcfd684a5c57..50bb41cdc5fb 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -36,7 +36,6 @@ /* Communicate arguments to a workqueue handler. */ struct rcu_exp_work { - smp_call_func_t rew_func; unsigned long rew_s; struct work_struct rew_work; }; diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h index 6d4eb4694b6f..7f5cb4228b59 100644 --- a/kernel/rcu/tree_exp.h +++ b/kernel/rcu/tree_exp.h @@ -22,6 +22,8 @@ #include +static void rcu_exp_handler(void *unused); + /* * Record the start of an expedited grace period. */ @@ -344,7 +346,6 @@ static void sync_rcu_exp_select_node_cpus(struct work_struct *wp) { int cpu; unsigned long flags; - smp_call_func_t func; unsigned long mask_ofl_test; unsigned long mask_ofl_ipi; int ret; @@ -352,7 +353,6 @@ static void sync_rcu_exp_select_node_cpus(struct work_struct *wp) container_of(wp, struct rcu_exp_work, rew_work); struct rcu_node *rnp = container_of(rewp, struct rcu_node, rew); - func = rewp->rew_func; raw_spin_lock_irqsave_rcu_node(rnp, flags); /* Each pass checks a CPU for identity, offline, and idle. */ @@ -396,7 +396,7 @@ retry_ipi: mask_ofl_test |= mask; continue; } - ret = smp_call_function_single(cpu, func, NULL, 0); + ret = smp_call_function_single(cpu, rcu_exp_handler, NULL, 0); if (!ret) { mask_ofl_ipi &= ~mask; continue; @@ -426,7 +426,7 @@ retry_ipi: * Select the nodes that the upcoming expedited grace period needs * to wait for. */ -static void sync_rcu_exp_select_cpus(smp_call_func_t func) +static void sync_rcu_exp_select_cpus(void) { int cpu; struct rcu_node *rnp; @@ -440,7 +440,6 @@ static void sync_rcu_exp_select_cpus(smp_call_func_t func) rnp->exp_need_flush = false; if (!READ_ONCE(rnp->expmask)) continue; /* Avoid early boot non-existent wq. */ - rnp->rew.rew_func = func; if (!READ_ONCE(rcu_par_gp_wq) || rcu_scheduler_active != RCU_SCHEDULER_RUNNING || rcu_is_last_leaf_node(rnp)) { @@ -580,10 +579,10 @@ static void rcu_exp_wait_wake(unsigned long s) * Common code to drive an expedited grace period forward, used by * workqueues and mid-boot-time tasks. */ -static void rcu_exp_sel_wait_wake(smp_call_func_t func, unsigned long s) +static void rcu_exp_sel_wait_wake(unsigned long s) { /* Initialize the rcu_node tree in preparation for the wait. */ - sync_rcu_exp_select_cpus(func); + sync_rcu_exp_select_cpus(); /* Wait and clean up, including waking everyone. */ rcu_exp_wait_wake(s); @@ -597,14 +596,14 @@ static void wait_rcu_exp_gp(struct work_struct *wp) struct rcu_exp_work *rewp; rewp = container_of(wp, struct rcu_exp_work, rew_work); - rcu_exp_sel_wait_wake(rewp->rew_func, rewp->rew_s); + rcu_exp_sel_wait_wake(rewp->rew_s); } /* * Given a smp_call_function() handler, kick off the specified * implementation of expedited grace period. */ -static void _synchronize_rcu_expedited(smp_call_func_t func) +static void _synchronize_rcu_expedited(void) { struct rcu_data *rdp; struct rcu_exp_work rew; @@ -625,10 +624,9 @@ static void _synchronize_rcu_expedited(smp_call_func_t func) /* Ensure that load happens before action based on it. */ if (unlikely(rcu_scheduler_active == RCU_SCHEDULER_INIT)) { /* Direct call during scheduler init and early_initcalls(). */ - rcu_exp_sel_wait_wake(func, s); + rcu_exp_sel_wait_wake(s); } else { /* Marshall arguments & schedule the expedited grace period. */ - rew.rew_func = func; rew.rew_s = s; INIT_WORK_ONSTACK(&rew.rew_work, wait_rcu_exp_gp); queue_work(rcu_gp_wq, &rew.rew_work); @@ -654,7 +652,7 @@ static void _synchronize_rcu_expedited(smp_call_func_t func) * ->expmask fields in the rcu_node tree. Otherwise, immediately * report the quiescent state. */ -static void sync_rcu_exp_handler(void *unused) +static void rcu_exp_handler(void *unused) { unsigned long flags; struct rcu_data *rdp = this_cpu_ptr(&rcu_data); @@ -760,14 +758,14 @@ void synchronize_rcu_expedited(void) if (rcu_scheduler_active == RCU_SCHEDULER_INACTIVE) return; - _synchronize_rcu_expedited(sync_rcu_exp_handler); + _synchronize_rcu_expedited(); } EXPORT_SYMBOL_GPL(synchronize_rcu_expedited); #else /* #ifdef CONFIG_PREEMPT_RCU */ /* Invoked on each online non-idle CPU for expedited quiescent state. */ -static void sync_sched_exp_handler(void *unused) +static void rcu_exp_handler(void *unused) { struct rcu_data *rdp; struct rcu_node *rnp; @@ -799,7 +797,7 @@ static void sync_sched_exp_online_cleanup(int cpu) rnp = rdp->mynode; if (!(READ_ONCE(rnp->expmask) & rdp->grpmask)) return; - ret = smp_call_function_single(cpu, sync_sched_exp_handler, NULL, 0); + ret = smp_call_function_single(cpu, rcu_exp_handler, NULL, 0); WARN_ON_ONCE(ret); } @@ -835,7 +833,7 @@ void synchronize_rcu_expedited(void) if (rcu_blocking_is_gp()) return; - _synchronize_rcu_expedited(sync_sched_exp_handler); + _synchronize_rcu_expedited(); } EXPORT_SYMBOL_GPL(synchronize_rcu_expedited); -- cgit v1.3-7-g2ca7 From 3cd4ca47aa577689c2e6b295d8f52af0e6f26333 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 29 Nov 2018 10:01:52 -0800 Subject: rcu: Consolidate PREEMPT and !PREEMPT synchronize_rcu_expedited() The CONFIG_PREEMPT=n and CONFIG_PREEMPT=y implementations of synchronize_rcu_expedited() are quite similar, and with small modifications to rcu_blocking_is_gp() can be made identical. This commit therefore makes this change in order to save a few lines of code and to reduce the amount of duplicate code. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_exp.h | 105 +++++++++++++++++++++++--------------------------- 1 file changed, 49 insertions(+), 56 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h index 7f5cb4228b59..b800bdfe74b3 100644 --- a/kernel/rcu/tree_exp.h +++ b/kernel/rcu/tree_exp.h @@ -643,6 +643,33 @@ static void _synchronize_rcu_expedited(void) mutex_unlock(&rcu_state.exp_mutex); } +/* + * During early boot, any blocking grace-period wait automatically + * implies a grace period. Later on, this is never the case for PREEMPT. + * + * Howevr, because a context switch is a grace period for !PREEMPT, any + * blocking grace-period wait automatically implies a grace period if + * there is only one CPU online at any point time during execution of + * either synchronize_rcu() or synchronize_rcu_expedited(). It is OK to + * occasionally incorrectly indicate that there are multiple CPUs online + * when there was in fact only one the whole time, as this just adds some + * overhead: RCU still operates correctly. + */ +static int rcu_blocking_is_gp(void) +{ + int ret; + + if (rcu_scheduler_active == RCU_SCHEDULER_INACTIVE) + return true; + if (IS_ENABLED(CONFIG_PREEMPT)) + return false; + might_sleep(); /* Check for RCU read-side critical section. */ + preempt_disable(); + ret = num_online_cpus() <= 1; + preempt_enable(); + return ret; +} + #ifdef CONFIG_PREEMPT_RCU /* @@ -729,39 +756,6 @@ static void sync_sched_exp_online_cleanup(int cpu) { } -/** - * synchronize_rcu_expedited - Brute-force RCU grace period - * - * Wait for an RCU-preempt grace period, but expedite it. The basic - * idea is to IPI all non-idle non-nohz online CPUs. The IPI handler - * checks whether the CPU is in an RCU-preempt critical section, and - * if so, it sets a flag that causes the outermost rcu_read_unlock() - * to report the quiescent state. On the other hand, if the CPU is - * not in an RCU read-side critical section, the IPI handler reports - * the quiescent state immediately. - * - * Although this is a greate improvement over previous expedited - * implementations, it is still unfriendly to real-time workloads, so is - * thus not recommended for any sort of common-case code. In fact, if - * you are using synchronize_rcu_expedited() in a loop, please restructure - * your code to batch your updates, and then Use a single synchronize_rcu() - * instead. - * - * This has the same semantics as (but is more brutal than) synchronize_rcu(). - */ -void synchronize_rcu_expedited(void) -{ - RCU_LOCKDEP_WARN(lock_is_held(&rcu_bh_lock_map) || - lock_is_held(&rcu_lock_map) || - lock_is_held(&rcu_sched_lock_map), - "Illegal synchronize_rcu_expedited() in RCU read-side critical section"); - - if (rcu_scheduler_active == RCU_SCHEDULER_INACTIVE) - return; - _synchronize_rcu_expedited(); -} -EXPORT_SYMBOL_GPL(synchronize_rcu_expedited); - #else /* #ifdef CONFIG_PREEMPT_RCU */ /* Invoked on each online non-idle CPU for expedited quiescent state. */ @@ -801,27 +795,28 @@ static void sync_sched_exp_online_cleanup(int cpu) WARN_ON_ONCE(ret); } -/* - * Because a context switch is a grace period for !PREEMPT, any - * blocking grace-period wait automatically implies a grace period if - * there is only one CPU online at any point time during execution of - * either synchronize_rcu() or synchronize_rcu_expedited(). It is OK to - * occasionally incorrectly indicate that there are multiple CPUs online - * when there was in fact only one the whole time, as this just adds some - * overhead: RCU still operates correctly. - */ -static int rcu_blocking_is_gp(void) -{ - int ret; - - might_sleep(); /* Check for RCU read-side critical section. */ - preempt_disable(); - ret = num_online_cpus() <= 1; - preempt_enable(); - return ret; -} +#endif /* #else #ifdef CONFIG_PREEMPT_RCU */ -/* PREEMPT=n implementation of synchronize_rcu_expedited(). */ +/** + * synchronize_rcu_expedited - Brute-force RCU grace period + * + * Wait for an RCU grace period, but expedite it. The basic idea is to + * IPI all non-idle non-nohz online CPUs. The IPI handler checks whether + * the CPU is in an RCU critical section, and if so, it sets a flag that + * causes the outermost rcu_read_unlock() to report the quiescent state + * for RCU-preempt or asks the scheduler for help for RCU-sched. On the + * other hand, if the CPU is not in an RCU read-side critical section, + * the IPI handler reports the quiescent state immediately. + * + * Although this is a greate improvement over previous expedited + * implementations, it is still unfriendly to real-time workloads, so is + * thus not recommended for any sort of common-case code. In fact, if + * you are using synchronize_rcu_expedited() in a loop, please restructure + * your code to batch your updates, and then Use a single synchronize_rcu() + * instead. + * + * This has the same semantics as (but is more brutal than) synchronize_rcu(). + */ void synchronize_rcu_expedited(void) { RCU_LOCKDEP_WARN(lock_is_held(&rcu_bh_lock_map) || @@ -829,12 +824,10 @@ void synchronize_rcu_expedited(void) lock_is_held(&rcu_sched_lock_map), "Illegal synchronize_rcu_expedited() in RCU read-side critical section"); - /* If only one CPU, this is automatically a grace period. */ + /* Is the state is such that the call is a grace period? */ if (rcu_blocking_is_gp()) return; _synchronize_rcu_expedited(); } EXPORT_SYMBOL_GPL(synchronize_rcu_expedited); - -#endif /* #else #ifdef CONFIG_PREEMPT_RCU */ -- cgit v1.3-7-g2ca7 From e5bc3af7734f90278a47906d917852a85544510b Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 29 Nov 2018 10:42:06 -0800 Subject: rcu: Consolidate PREEMPT and !PREEMPT synchronize_rcu() Now that rcu_blocking_is_gp() makes the correct immediate-return decision for both PREEMPT and !PREEMPT, a single implementation of synchronize_rcu() will work correctly under both configurations. This commit therefore eliminates a few lines of code by consolidating the two implementations of synchronize_rcu(). Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 73 ++++++++++++++++++++++++++++++++++++++++++++++++ kernel/rcu/tree_exp.h | 27 ------------------ kernel/rcu/tree_plugin.h | 64 ------------------------------------------ 3 files changed, 73 insertions(+), 91 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index fc37bec32731..e2bd42b2b563 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -2950,6 +2950,79 @@ void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func) } EXPORT_SYMBOL_GPL(kfree_call_rcu); +/* + * During early boot, any blocking grace-period wait automatically + * implies a grace period. Later on, this is never the case for PREEMPT. + * + * Howevr, because a context switch is a grace period for !PREEMPT, any + * blocking grace-period wait automatically implies a grace period if + * there is only one CPU online at any point time during execution of + * either synchronize_rcu() or synchronize_rcu_expedited(). It is OK to + * occasionally incorrectly indicate that there are multiple CPUs online + * when there was in fact only one the whole time, as this just adds some + * overhead: RCU still operates correctly. + */ +static int rcu_blocking_is_gp(void) +{ + int ret; + + if (IS_ENABLED(CONFIG_PREEMPT)) + return rcu_scheduler_active == RCU_SCHEDULER_INACTIVE; + might_sleep(); /* Check for RCU read-side critical section. */ + preempt_disable(); + ret = num_online_cpus() <= 1; + preempt_enable(); + return ret; +} + +/** + * synchronize_rcu - wait until a grace period has elapsed. + * + * Control will return to the caller some time after a full grace + * period has elapsed, in other words after all currently executing RCU + * read-side critical sections have completed. Note, however, that + * upon return from synchronize_rcu(), the caller might well be executing + * concurrently with new RCU read-side critical sections that began while + * synchronize_rcu() was waiting. RCU read-side critical sections are + * delimited by rcu_read_lock() and rcu_read_unlock(), and may be nested. + * In addition, regions of code across which interrupts, preemption, or + * softirqs have been disabled also serve as RCU read-side critical + * sections. This includes hardware interrupt handlers, softirq handlers, + * and NMI handlers. + * + * Note that this guarantee implies further memory-ordering guarantees. + * On systems with more than one CPU, when synchronize_rcu() returns, + * each CPU is guaranteed to have executed a full memory barrier since + * the end of its last RCU read-side critical section whose beginning + * preceded the call to synchronize_rcu(). In addition, each CPU having + * an RCU read-side critical section that extends beyond the return from + * synchronize_rcu() is guaranteed to have executed a full memory barrier + * after the beginning of synchronize_rcu() and before the beginning of + * that RCU read-side critical section. Note that these guarantees include + * CPUs that are offline, idle, or executing in user mode, as well as CPUs + * that are executing in the kernel. + * + * Furthermore, if CPU A invoked synchronize_rcu(), which returned + * to its caller on CPU B, then both CPU A and CPU B are guaranteed + * to have executed a full memory barrier during the execution of + * synchronize_rcu() -- even if CPU A and CPU B are the same CPU (but + * again only if the system has more than one CPU). + */ +void synchronize_rcu(void) +{ + RCU_LOCKDEP_WARN(lock_is_held(&rcu_bh_lock_map) || + lock_is_held(&rcu_lock_map) || + lock_is_held(&rcu_sched_lock_map), + "Illegal synchronize_rcu() in RCU read-side critical section"); + if (rcu_blocking_is_gp()) + return; + if (rcu_gp_is_expedited()) + synchronize_rcu_expedited(); + else + wait_rcu_gp(call_rcu); +} +EXPORT_SYMBOL_GPL(synchronize_rcu); + /** * get_state_synchronize_rcu - Snapshot current RCU state * diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h index b800bdfe74b3..353d113c0cd4 100644 --- a/kernel/rcu/tree_exp.h +++ b/kernel/rcu/tree_exp.h @@ -643,33 +643,6 @@ static void _synchronize_rcu_expedited(void) mutex_unlock(&rcu_state.exp_mutex); } -/* - * During early boot, any blocking grace-period wait automatically - * implies a grace period. Later on, this is never the case for PREEMPT. - * - * Howevr, because a context switch is a grace period for !PREEMPT, any - * blocking grace-period wait automatically implies a grace period if - * there is only one CPU online at any point time during execution of - * either synchronize_rcu() or synchronize_rcu_expedited(). It is OK to - * occasionally incorrectly indicate that there are multiple CPUs online - * when there was in fact only one the whole time, as this just adds some - * overhead: RCU still operates correctly. - */ -static int rcu_blocking_is_gp(void) -{ - int ret; - - if (rcu_scheduler_active == RCU_SCHEDULER_INACTIVE) - return true; - if (IS_ENABLED(CONFIG_PREEMPT)) - return false; - might_sleep(); /* Check for RCU read-side critical section. */ - preempt_disable(); - ret = num_online_cpus() <= 1; - preempt_enable(); - return ret; -} - #ifdef CONFIG_PREEMPT_RCU /* diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index bcf3e7366a28..43f3f2ee9d63 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -825,54 +825,6 @@ static void rcu_flavor_check_callbacks(int user) t->rcu_read_unlock_special.b.need_qs = true; } -/** - * synchronize_rcu - wait until a grace period has elapsed. - * - * Control will return to the caller some time after a full grace - * period has elapsed, in other words after all currently executing RCU - * read-side critical sections have completed. Note, however, that - * upon return from synchronize_rcu(), the caller might well be executing - * concurrently with new RCU read-side critical sections that began while - * synchronize_rcu() was waiting. RCU read-side critical sections are - * delimited by rcu_read_lock() and rcu_read_unlock(), and may be nested. - * In addition, regions of code across which interrupts, preemption, or - * softirqs have been disabled also serve as RCU read-side critical - * sections. This includes hardware interrupt handlers, softirq handlers, - * and NMI handlers. - * - * Note that this guarantee implies further memory-ordering guarantees. - * On systems with more than one CPU, when synchronize_rcu() returns, - * each CPU is guaranteed to have executed a full memory barrier since - * the end of its last RCU read-side critical section whose beginning - * preceded the call to synchronize_rcu(). In addition, each CPU having - * an RCU read-side critical section that extends beyond the return from - * synchronize_rcu() is guaranteed to have executed a full memory barrier - * after the beginning of synchronize_rcu() and before the beginning of - * that RCU read-side critical section. Note that these guarantees include - * CPUs that are offline, idle, or executing in user mode, as well as CPUs - * that are executing in the kernel. - * - * Furthermore, if CPU A invoked synchronize_rcu(), which returned - * to its caller on CPU B, then both CPU A and CPU B are guaranteed - * to have executed a full memory barrier during the execution of - * synchronize_rcu() -- even if CPU A and CPU B are the same CPU (but - * again only if the system has more than one CPU). - */ -void synchronize_rcu(void) -{ - RCU_LOCKDEP_WARN(lock_is_held(&rcu_bh_lock_map) || - lock_is_held(&rcu_lock_map) || - lock_is_held(&rcu_sched_lock_map), - "Illegal synchronize_rcu() in RCU read-side critical section"); - if (rcu_scheduler_active == RCU_SCHEDULER_INACTIVE) - return; - if (rcu_gp_is_expedited()) - synchronize_rcu_expedited(); - else - wait_rcu_gp(call_rcu); -} -EXPORT_SYMBOL_GPL(synchronize_rcu); - /* * Check for a task exiting while in a preemptible-RCU read-side * critical section, clean up if so. No need to issue warnings, @@ -1115,22 +1067,6 @@ static void rcu_flavor_check_callbacks(int user) } } -/* PREEMPT=n implementation of synchronize_rcu(). */ -void synchronize_rcu(void) -{ - RCU_LOCKDEP_WARN(lock_is_held(&rcu_bh_lock_map) || - lock_is_held(&rcu_lock_map) || - lock_is_held(&rcu_sched_lock_map), - "Illegal synchronize_rcu() in RCU read-side critical section"); - if (rcu_blocking_is_gp()) - return; - if (rcu_gp_is_expedited()) - synchronize_rcu_expedited(); - else - wait_rcu_gp(call_rcu); -} -EXPORT_SYMBOL_GPL(synchronize_rcu); - /* * Because preemptible RCU does not exist, tasks cannot possibly exit * while in preemptible RCU read-side critical sections. -- cgit v1.3-7-g2ca7 From 892307266429f6439803f2735a59a4bc58a6ded4 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 29 Nov 2018 11:50:04 -0800 Subject: rcu: Inline _synchronize_rcu_expedited() into synchronize_rcu_expedited() Now that _synchronize_rcu_expedited() has only one caller, and given that this is a tail call, this commit inlines _synchronize_rcu_expedited() into synchronize_rcu_expedited(). Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_exp.h | 81 +++++++++++++++++++++++---------------------------- 1 file changed, 36 insertions(+), 45 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h index 353d113c0cd4..d882ca0cd01b 100644 --- a/kernel/rcu/tree_exp.h +++ b/kernel/rcu/tree_exp.h @@ -599,50 +599,6 @@ static void wait_rcu_exp_gp(struct work_struct *wp) rcu_exp_sel_wait_wake(rewp->rew_s); } -/* - * Given a smp_call_function() handler, kick off the specified - * implementation of expedited grace period. - */ -static void _synchronize_rcu_expedited(void) -{ - struct rcu_data *rdp; - struct rcu_exp_work rew; - struct rcu_node *rnp; - unsigned long s; - - /* If expedited grace periods are prohibited, fall back to normal. */ - if (rcu_gp_is_normal()) { - wait_rcu_gp(call_rcu); - return; - } - - /* Take a snapshot of the sequence number. */ - s = rcu_exp_gp_seq_snap(); - if (exp_funnel_lock(s)) - return; /* Someone else did our work for us. */ - - /* Ensure that load happens before action based on it. */ - if (unlikely(rcu_scheduler_active == RCU_SCHEDULER_INIT)) { - /* Direct call during scheduler init and early_initcalls(). */ - rcu_exp_sel_wait_wake(s); - } else { - /* Marshall arguments & schedule the expedited grace period. */ - rew.rew_s = s; - INIT_WORK_ONSTACK(&rew.rew_work, wait_rcu_exp_gp); - queue_work(rcu_gp_wq, &rew.rew_work); - } - - /* Wait for expedited grace period to complete. */ - rdp = per_cpu_ptr(&rcu_data, raw_smp_processor_id()); - rnp = rcu_get_root(); - wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3], - sync_exp_work_done(s)); - smp_mb(); /* Workqueue actions happen before return. */ - - /* Let the next expedited grace period start. */ - mutex_unlock(&rcu_state.exp_mutex); -} - #ifdef CONFIG_PREEMPT_RCU /* @@ -792,6 +748,11 @@ static void sync_sched_exp_online_cleanup(int cpu) */ void synchronize_rcu_expedited(void) { + struct rcu_data *rdp; + struct rcu_exp_work rew; + struct rcu_node *rnp; + unsigned long s; + RCU_LOCKDEP_WARN(lock_is_held(&rcu_bh_lock_map) || lock_is_held(&rcu_lock_map) || lock_is_held(&rcu_sched_lock_map), @@ -801,6 +762,36 @@ void synchronize_rcu_expedited(void) if (rcu_blocking_is_gp()) return; - _synchronize_rcu_expedited(); + /* If expedited grace periods are prohibited, fall back to normal. */ + if (rcu_gp_is_normal()) { + wait_rcu_gp(call_rcu); + return; + } + + /* Take a snapshot of the sequence number. */ + s = rcu_exp_gp_seq_snap(); + if (exp_funnel_lock(s)) + return; /* Someone else did our work for us. */ + + /* Ensure that load happens before action based on it. */ + if (unlikely(rcu_scheduler_active == RCU_SCHEDULER_INIT)) { + /* Direct call during scheduler init and early_initcalls(). */ + rcu_exp_sel_wait_wake(s); + } else { + /* Marshall arguments & schedule the expedited grace period. */ + rew.rew_s = s; + INIT_WORK_ONSTACK(&rew.rew_work, wait_rcu_exp_gp); + queue_work(rcu_gp_wq, &rew.rew_work); + } + + /* Wait for expedited grace period to complete. */ + rdp = per_cpu_ptr(&rcu_data, raw_smp_processor_id()); + rnp = rcu_get_root(); + wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3], + sync_exp_work_done(s)); + smp_mb(); /* Workqueue actions happen before return. */ + + /* Let the next expedited grace period start. */ + mutex_unlock(&rcu_state.exp_mutex); } EXPORT_SYMBOL_GPL(synchronize_rcu_expedited); -- cgit v1.3-7-g2ca7 From 260e1e4fd826a8e47c8976efba6a54d62b4a57de Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 29 Nov 2018 13:28:49 -0800 Subject: rcu: Discard separate per-CPU callback counts Back when there were multiple flavors of RCU, it was necessary to separately count lazy and non-lazy callbacks for each CPU. These counts were used in CONFIG_RCU_FAST_NO_HZ kernels to determine how long a newly idle CPU should be allowed to sleep before handling its RCU callbacks. But now that there is only one flavor, the callback counts for a given CPU's sole rcu_data structure are the counts for that CPU. This commit therefore removes the rcu_data structure's ->nonlazy_posted and ->nonlazy_posted_snap fields, the rcu_idle_count_callbacks_posted() and rcu_cpu_has_callbacks() functions, repurposes the rcu_data structure's ->all_lazy field to record the laziness state at the beginning of the latest idle sojourn, and modifies CONFIG_RCU_FAST_NO_HZ RCU CPU stall warnings accordingly. Signed-off-by: Paul E. McKenney --- Documentation/RCU/stallwarn.txt | 15 +++++++------ kernel/rcu/tree.c | 25 --------------------- kernel/rcu/tree.h | 6 +---- kernel/rcu/tree_plugin.h | 50 ++++++++++------------------------------- 4 files changed, 21 insertions(+), 75 deletions(-) (limited to 'kernel') diff --git a/Documentation/RCU/stallwarn.txt b/Documentation/RCU/stallwarn.txt index 073dbc12d1ea..1ab70c37921f 100644 --- a/Documentation/RCU/stallwarn.txt +++ b/Documentation/RCU/stallwarn.txt @@ -219,17 +219,18 @@ an estimate of the total number of RCU callbacks queued across all CPUs In kernels with CONFIG_RCU_FAST_NO_HZ, more information is printed for each CPU: - 0: (64628 ticks this GP) idle=dd5/3fffffffffffffff/0 softirq=82/543 last_accelerate: a345/d342 nonlazy_posted: 25 .D + 0: (64628 ticks this GP) idle=dd5/3fffffffffffffff/0 softirq=82/543 last_accelerate: a345/d342 Nonlazy posted: ..D The "last_accelerate:" prints the low-order 16 bits (in hex) of the jiffies counter when this CPU last invoked rcu_try_advance_all_cbs() from rcu_needs_cpu() or last invoked rcu_accelerate_cbs() from -rcu_prepare_for_idle(). The "nonlazy_posted:" prints the number -of non-lazy callbacks posted since the last call to rcu_needs_cpu(). -Finally, an "L" indicates that there are currently no non-lazy callbacks -("." is printed otherwise, as shown above) and "D" indicates that -dyntick-idle processing is enabled ("." is printed otherwise, for example, -if disabled via the "nohz=" kernel boot parameter). +rcu_prepare_for_idle(). The "Nonlazy posted:" indicates lazy-callback +status, so that an "l" indicates that all callbacks were lazy at the start +of the last idle period and an "L" indicates that there are currently +no non-lazy callbacks (in both cases, "." is printed otherwise, as +shown above) and "D" indicates that dyntick-idle processing is enabled +("." is printed otherwise, for example, if disabled via the "nohz=" +kernel boot parameter). If the grace period ends just as the stall warning starts printing, there will be a spurious stall-warning message, which will include diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index e2bd42b2b563..e53a586b397b 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -2878,9 +2878,6 @@ __call_rcu(struct rcu_head *head, rcu_callback_t func, int cpu, bool lazy) rcu_segcblist_init(&rdp->cblist); } rcu_segcblist_enqueue(&rdp->cblist, head, lazy); - if (!lazy) - rcu_idle_count_callbacks_posted(); - if (__is_kfree_rcu_offset((unsigned long)func)) trace_rcu_kfree_callback(rcu_state.name, head, (unsigned long)func, @@ -3110,28 +3107,6 @@ static int rcu_pending(void) return 0; } -/* - * Return true if the specified CPU has any callback. If all_lazy is - * non-NULL, store an indication of whether all callbacks are lazy. - * (If there are no callbacks, all of them are deemed to be lazy.) - */ -static bool rcu_cpu_has_callbacks(bool *all_lazy) -{ - bool al = true; - bool hc = false; - struct rcu_data *rdp; - - rdp = this_cpu_ptr(&rcu_data); - if (!rcu_segcblist_empty(&rdp->cblist)) { - hc = true; - if (rcu_segcblist_n_nonlazy_cbs(&rdp->cblist)) - al = false; - } - if (all_lazy) - *all_lazy = al; - return hc; -} - /* * Helper function for rcu_barrier() tracing. If tracing is disabled, * the compiler is expected to optimize this away. diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index 50bb41cdc5fb..20feecbb0ab6 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -193,10 +193,7 @@ struct rcu_data { bool rcu_need_heavy_qs; /* GP old, so heavy quiescent state! */ bool rcu_urgent_qs; /* GP old need light quiescent state. */ #ifdef CONFIG_RCU_FAST_NO_HZ - bool all_lazy; /* Are all CPU's CBs lazy? */ - unsigned long nonlazy_posted; /* # times non-lazy CB posted to CPU. */ - unsigned long nonlazy_posted_snap; - /* Nonlazy_posted snapshot. */ + bool all_lazy; /* All CPU's CBs lazy at idle start? */ unsigned long last_accelerate; /* Last jiffy CBs were accelerated. */ unsigned long last_advance_all; /* Last jiffy CBs were all advanced. */ int tick_nohz_enabled_snap; /* Previously seen value from sysfs. */ @@ -430,7 +427,6 @@ static void __init rcu_spawn_boost_kthreads(void); static void rcu_prepare_kthreads(int cpu); static void rcu_cleanup_after_idle(void); static void rcu_prepare_for_idle(void); -static void rcu_idle_count_callbacks_posted(void); static bool rcu_preempt_has_tasks(struct rcu_node *rnp); static bool rcu_preempt_need_deferred_qs(struct task_struct *t); static void rcu_preempt_deferred_qs(struct task_struct *t); diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 43f3f2ee9d63..abd238c70900 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1474,7 +1474,7 @@ static void rcu_prepare_kthreads(int cpu) int rcu_needs_cpu(u64 basemono, u64 *nextevt) { *nextevt = KTIME_MAX; - return rcu_cpu_has_callbacks(NULL); + return !rcu_segcblist_empty(&this_cpu_ptr(&rcu_data)->cblist); } /* @@ -1493,14 +1493,6 @@ static void rcu_prepare_for_idle(void) { } -/* - * Don't bother keeping a running count of the number of RCU callbacks - * posted because CONFIG_RCU_FAST_NO_HZ=n. - */ -static void rcu_idle_count_callbacks_posted(void) -{ -} - #else /* #if !defined(CONFIG_RCU_FAST_NO_HZ) */ /* @@ -1583,11 +1575,8 @@ int rcu_needs_cpu(u64 basemono, u64 *nextevt) lockdep_assert_irqs_disabled(); - /* Snapshot to detect later posting of non-lazy callback. */ - rdp->nonlazy_posted_snap = rdp->nonlazy_posted; - /* If no callbacks, RCU doesn't need the CPU. */ - if (!rcu_cpu_has_callbacks(&rdp->all_lazy)) { + if (rcu_segcblist_empty(&rdp->cblist)) { *nextevt = KTIME_MAX; return 0; } @@ -1601,11 +1590,12 @@ int rcu_needs_cpu(u64 basemono, u64 *nextevt) rdp->last_accelerate = jiffies; /* Request timer delay depending on laziness, and round. */ - if (!rdp->all_lazy) { + rdp->all_lazy = !rcu_segcblist_n_nonlazy_cbs(&rdp->cblist); + if (rdp->all_lazy) { + dj = round_jiffies(rcu_idle_lazy_gp_delay + jiffies) - jiffies; + } else { dj = round_up(rcu_idle_gp_delay + jiffies, rcu_idle_gp_delay) - jiffies; - } else { - dj = round_jiffies(rcu_idle_lazy_gp_delay + jiffies) - jiffies; } *nextevt = basemono + dj * TICK_NSEC; return 0; @@ -1635,7 +1625,7 @@ static void rcu_prepare_for_idle(void) /* Handle nohz enablement switches conservatively. */ tne = READ_ONCE(tick_nohz_active); if (tne != rdp->tick_nohz_enabled_snap) { - if (rcu_cpu_has_callbacks(NULL)) + if (!rcu_segcblist_empty(&rdp->cblist)) invoke_rcu_core(); /* force nohz to see update. */ rdp->tick_nohz_enabled_snap = tne; return; @@ -1648,10 +1638,8 @@ static void rcu_prepare_for_idle(void) * callbacks, invoke RCU core for the side-effect of recalculating * idle duration on re-entry to idle. */ - if (rdp->all_lazy && - rdp->nonlazy_posted != rdp->nonlazy_posted_snap) { + if (rdp->all_lazy && rcu_segcblist_n_nonlazy_cbs(&rdp->cblist)) { rdp->all_lazy = false; - rdp->nonlazy_posted_snap = rdp->nonlazy_posted; invoke_rcu_core(); return; } @@ -1687,19 +1675,6 @@ static void rcu_cleanup_after_idle(void) invoke_rcu_core(); } -/* - * Keep a running count of the number of non-lazy callbacks posted - * on this CPU. This running counter (which is never decremented) allows - * rcu_prepare_for_idle() to detect when something out of the idle loop - * posts a callback, even if an equal number of callbacks are invoked. - * Of course, callbacks should only be posted from within a trace event - * designed to be called from idle or from within RCU_NONIDLE(). - */ -static void rcu_idle_count_callbacks_posted(void) -{ - __this_cpu_add(rcu_data.nonlazy_posted, 1); -} - #endif /* #else #if !defined(CONFIG_RCU_FAST_NO_HZ) */ #ifdef CONFIG_RCU_FAST_NO_HZ @@ -1707,13 +1682,12 @@ static void rcu_idle_count_callbacks_posted(void) static void print_cpu_stall_fast_no_hz(char *cp, int cpu) { struct rcu_data *rdp = &per_cpu(rcu_data, cpu); - unsigned long nlpd = rdp->nonlazy_posted - rdp->nonlazy_posted_snap; - sprintf(cp, "last_accelerate: %04lx/%04lx, nonlazy_posted: %ld, %c%c", + sprintf(cp, "last_accelerate: %04lx/%04lx, Nonlazy posted: %c%c%c", rdp->last_accelerate & 0xffff, jiffies & 0xffff, - ulong2long(nlpd), - rdp->all_lazy ? 'L' : '.', - rdp->tick_nohz_enabled_snap ? '.' : 'D'); + ".l"[rdp->all_lazy], + ".L"[!rcu_segcblist_n_nonlazy_cbs(&rdp->cblist)], + ".D"[!rdp->tick_nohz_enabled_snap]); } #else /* #ifdef CONFIG_RCU_FAST_NO_HZ */ -- cgit v1.3-7-g2ca7 From 9cf422a8e71455032b61c6c0ea56a1e96206aab0 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 20 Nov 2018 10:43:34 -0800 Subject: rcu: Accommodate zero jiffies_till_first_fqs and kthread kicking It is perfectly fine to set the rcutree.jiffies_till_first_fqs boot parameter to zero, in fact, this can be useful on specialty systems that usually have at least one idle CPU and that need fast grace periods. This is because this setting causes the RCU grace-period kthread to scan for idle threads immediately after grace-period initialization, as opposed to waiting several jiffies to do so. It is also perfectly fine to set the rcutree.rcu_kick_kthreads kernel parameter, which gives the RCU grace-period kthread an extra wakeup if it doesn't make progress for a period of three times the setting of the rcutree.jiffies_till_first_fqs boot parameter. This is of course problematic when the value of this parameter is zero, as it can result in unnecessary wakeup IPIs along with unnecessary WARN_ONCE() invocations. This commit therefore defers kthread kicking for at least two jiffies, regardless of the setting of rcutree.jiffies_till_first_fqs. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 9180158756d2..b003a3cfe192 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1939,7 +1939,7 @@ static void rcu_gp_fqs_loop(void) if (!ret) { rcu_state.jiffies_force_qs = jiffies + j; WRITE_ONCE(rcu_state.jiffies_kick_kthreads, - jiffies + 3 * j); + jiffies + (j ? 3 * j : 2)); } trace_rcu_grace_period(rcu_state.name, READ_ONCE(rcu_state.gp_seq), -- cgit v1.3-7-g2ca7 From 37f62d7cf00c085e1d7a91a6af286c4e8d32e1e1 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Fri, 30 Nov 2018 16:11:14 -0800 Subject: rcu: Move rcu_cpu_kthread_task to rcu_data structure Given that RCU has a perfectly good per-CPU rcu_data structure, most per-CPU quantities should be stored there. This commit therefore moves the rcu_cpu_kthread_task per-CPU variable to the rcu_data structure. This also makes this variable unconditionally present, which should be acceptable given the memory reduction due to the RCU flavor consolidation and also due to simplifications this will enable. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.h | 6 +++++- kernel/rcu/tree_plugin.h | 11 +++++------ 2 files changed, 10 insertions(+), 7 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index d90b02b53c0e..ef517ba25192 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -234,7 +234,11 @@ struct rcu_data { /* Leader CPU takes GP-end wakeups. */ #endif /* #ifdef CONFIG_RCU_NOCB_CPU */ - /* 6) Diagnostic data, including RCU CPU stall warnings. */ + /* 6) RCU priority boosting. */ + struct task_struct *rcu_cpu_kthread_task; + /* rcuc per-CPU kthread or NULL. */ + + /* 7) Diagnostic data, including RCU CPU stall warnings. */ unsigned int softirq_snap; /* Snapshot of softirq activity. */ /* ->rcu_iw* fields protected by leaf rcu_node ->lock. */ struct irq_work rcu_iw; /* Check for non-irq activity. */ diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 1b3dd2fc0cd6..359bf1f6f8e0 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -40,7 +40,6 @@ /* * Control variables for per-CPU and per-rcu_node kthreads. */ -static DEFINE_PER_CPU(struct task_struct *, rcu_cpu_kthread_task); DEFINE_PER_CPU(unsigned int, rcu_cpu_kthread_status); DEFINE_PER_CPU(unsigned int, rcu_cpu_kthread_loops); DEFINE_PER_CPU(char, rcu_cpu_has_work); @@ -1308,9 +1307,9 @@ static void invoke_rcu_callbacks_kthread(void) local_irq_save(flags); __this_cpu_write(rcu_cpu_has_work, 1); - if (__this_cpu_read(rcu_cpu_kthread_task) != NULL && - current != __this_cpu_read(rcu_cpu_kthread_task)) { - rcu_wake_cond(__this_cpu_read(rcu_cpu_kthread_task), + if (__this_cpu_read(rcu_data.rcu_cpu_kthread_task) != NULL && + current != __this_cpu_read(rcu_data.rcu_cpu_kthread_task)) { + rcu_wake_cond(__this_cpu_read(rcu_data.rcu_cpu_kthread_task), __this_cpu_read(rcu_cpu_kthread_status)); } local_irq_restore(flags); @@ -1322,7 +1321,7 @@ static void invoke_rcu_callbacks_kthread(void) */ static bool rcu_is_callbacks_kthread(void) { - return __this_cpu_read(rcu_cpu_kthread_task) == current; + return __this_cpu_read(rcu_data.rcu_cpu_kthread_task) == current; } #define RCU_BOOST_DELAY_JIFFIES DIV_ROUND_UP(CONFIG_RCU_BOOST_DELAY * HZ, 1000) @@ -1459,7 +1458,7 @@ static void rcu_boost_kthread_setaffinity(struct rcu_node *rnp, int outgoingcpu) } static struct smp_hotplug_thread rcu_cpu_thread_spec = { - .store = &rcu_cpu_kthread_task, + .store = &rcu_data.rcu_cpu_kthread_task, .thread_should_run = rcu_cpu_kthread_should_run, .thread_fn = rcu_cpu_kthread, .thread_comm = "rcuc/%u", -- cgit v1.3-7-g2ca7 From 6ffdde28b7558ec48f4e0eee01821a66a67a8e25 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Fri, 30 Nov 2018 16:43:05 -0800 Subject: rcu: Move rcu_cpu_kthread_status to rcu_data structure Given that RCU has a perfectly good per-CPU rcu_data structure, most per-CPU quantities should be stored there. This commit therefore moves the rcu_cpu_kthread_status per-CPU variable to the rcu_data structure. This also makes this variable unconditionally present, which should be acceptable given the memory reduction due to the RCU flavor consolidation and also due to simplifications this will enable. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.h | 3 ++- kernel/rcu/tree_plugin.h | 7 +++---- 2 files changed, 5 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index ef517ba25192..047f5e9350d1 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -237,6 +237,8 @@ struct rcu_data { /* 6) RCU priority boosting. */ struct task_struct *rcu_cpu_kthread_task; /* rcuc per-CPU kthread or NULL. */ + unsigned int rcu_cpu_kthread_status; + /* Running status for rcuc. */ /* 7) Diagnostic data, including RCU CPU stall warnings. */ unsigned int softirq_snap; /* Snapshot of softirq activity. */ @@ -407,7 +409,6 @@ static const char *tp_rcu_varname __used __tracepoint_string = rcu_name; int rcu_dynticks_snap(struct rcu_data *rdp); #ifdef CONFIG_RCU_BOOST -DECLARE_PER_CPU(unsigned int, rcu_cpu_kthread_status); DECLARE_PER_CPU(int, rcu_cpu_kthread_cpu); DECLARE_PER_CPU(unsigned int, rcu_cpu_kthread_loops); DECLARE_PER_CPU(char, rcu_cpu_has_work); diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 359bf1f6f8e0..935dc594cf30 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -40,7 +40,6 @@ /* * Control variables for per-CPU and per-rcu_node kthreads. */ -DEFINE_PER_CPU(unsigned int, rcu_cpu_kthread_status); DEFINE_PER_CPU(unsigned int, rcu_cpu_kthread_loops); DEFINE_PER_CPU(char, rcu_cpu_has_work); @@ -1310,7 +1309,7 @@ static void invoke_rcu_callbacks_kthread(void) if (__this_cpu_read(rcu_data.rcu_cpu_kthread_task) != NULL && current != __this_cpu_read(rcu_data.rcu_cpu_kthread_task)) { rcu_wake_cond(__this_cpu_read(rcu_data.rcu_cpu_kthread_task), - __this_cpu_read(rcu_cpu_kthread_status)); + __this_cpu_read(rcu_data.rcu_cpu_kthread_status)); } local_irq_restore(flags); } @@ -1383,7 +1382,7 @@ static void rcu_cpu_kthread_setup(unsigned int cpu) static void rcu_cpu_kthread_park(unsigned int cpu) { - per_cpu(rcu_cpu_kthread_status, cpu) = RCU_KTHREAD_OFFCPU; + per_cpu(rcu_data.rcu_cpu_kthread_status, cpu) = RCU_KTHREAD_OFFCPU; } static int rcu_cpu_kthread_should_run(unsigned int cpu) @@ -1398,7 +1397,7 @@ static int rcu_cpu_kthread_should_run(unsigned int cpu) */ static void rcu_cpu_kthread(unsigned int cpu) { - unsigned int *statusp = this_cpu_ptr(&rcu_cpu_kthread_status); + unsigned int *statusp = this_cpu_ptr(&rcu_data.rcu_cpu_kthread_status); char work, *workp = this_cpu_ptr(&rcu_cpu_has_work); int spincnt; -- cgit v1.3-7-g2ca7 From 8b4d0f4858864af6a0753740f00353035bd058de Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Fri, 30 Nov 2018 17:17:21 -0800 Subject: rcu: Remove unused rcu_cpu_kthread_loops per-CPU variable The rcu_cpu_kthread_loops variable used to provide debugfs information, but is no longer used. This commit therefore removes it. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.h | 1 - kernel/rcu/tree_plugin.h | 2 -- 2 files changed, 3 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index 047f5e9350d1..e50b0a5a94bc 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -410,7 +410,6 @@ int rcu_dynticks_snap(struct rcu_data *rdp); #ifdef CONFIG_RCU_BOOST DECLARE_PER_CPU(int, rcu_cpu_kthread_cpu); -DECLARE_PER_CPU(unsigned int, rcu_cpu_kthread_loops); DECLARE_PER_CPU(char, rcu_cpu_has_work); #endif /* #ifdef CONFIG_RCU_BOOST */ diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 935dc594cf30..d1f32c63c789 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -40,7 +40,6 @@ /* * Control variables for per-CPU and per-rcu_node kthreads. */ -DEFINE_PER_CPU(unsigned int, rcu_cpu_kthread_loops); DEFINE_PER_CPU(char, rcu_cpu_has_work); #else /* #ifdef CONFIG_RCU_BOOST */ @@ -1405,7 +1404,6 @@ static void rcu_cpu_kthread(unsigned int cpu) trace_rcu_utilization(TPS("Start CPU kthread@rcu_wait")); local_bh_disable(); *statusp = RCU_KTHREAD_RUNNING; - this_cpu_inc(rcu_cpu_kthread_loops); local_irq_disable(); work = *workp; *workp = 0; -- cgit v1.3-7-g2ca7 From f7e972ee128e0a65784b13ec1459fe35b817eed7 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Fri, 30 Nov 2018 18:21:32 -0800 Subject: rcu: Move rcu_cpu_has_work to rcu_data structure Given that RCU has a perfectly good per-CPU rcu_data structure, most per-CPU quantities should be stored there. This commit therefore moves the rcu_cpu_has_work per-CPU variable to the rcu_data structure. This also makes this variable unconditionally present, which should be acceptable given the memory reduction due to the RCU flavor consolidation and also due to simplifications this will enable. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.h | 3 +-- kernel/rcu/tree_plugin.h | 15 ++++----------- 2 files changed, 5 insertions(+), 13 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index e50b0a5a94bc..7ae6774a920e 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -238,7 +238,7 @@ struct rcu_data { struct task_struct *rcu_cpu_kthread_task; /* rcuc per-CPU kthread or NULL. */ unsigned int rcu_cpu_kthread_status; - /* Running status for rcuc. */ + char rcu_cpu_has_work; /* 7) Diagnostic data, including RCU CPU stall warnings. */ unsigned int softirq_snap; /* Snapshot of softirq activity. */ @@ -410,7 +410,6 @@ int rcu_dynticks_snap(struct rcu_data *rdp); #ifdef CONFIG_RCU_BOOST DECLARE_PER_CPU(int, rcu_cpu_kthread_cpu); -DECLARE_PER_CPU(char, rcu_cpu_has_work); #endif /* #ifdef CONFIG_RCU_BOOST */ /* Forward declarations for rcutree_plugin.h */ diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index d1f32c63c789..b241c4b20549 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -34,14 +34,7 @@ #include "../time/tick-internal.h" #ifdef CONFIG_RCU_BOOST - #include "../locking/rtmutex_common.h" - -/* - * Control variables for per-CPU and per-rcu_node kthreads. - */ -DEFINE_PER_CPU(char, rcu_cpu_has_work); - #else /* #ifdef CONFIG_RCU_BOOST */ /* @@ -1304,7 +1297,7 @@ static void invoke_rcu_callbacks_kthread(void) unsigned long flags; local_irq_save(flags); - __this_cpu_write(rcu_cpu_has_work, 1); + __this_cpu_write(rcu_data.rcu_cpu_has_work, 1); if (__this_cpu_read(rcu_data.rcu_cpu_kthread_task) != NULL && current != __this_cpu_read(rcu_data.rcu_cpu_kthread_task)) { rcu_wake_cond(__this_cpu_read(rcu_data.rcu_cpu_kthread_task), @@ -1386,7 +1379,7 @@ static void rcu_cpu_kthread_park(unsigned int cpu) static int rcu_cpu_kthread_should_run(unsigned int cpu) { - return __this_cpu_read(rcu_cpu_has_work); + return __this_cpu_read(rcu_data.rcu_cpu_has_work); } /* @@ -1397,7 +1390,7 @@ static int rcu_cpu_kthread_should_run(unsigned int cpu) static void rcu_cpu_kthread(unsigned int cpu) { unsigned int *statusp = this_cpu_ptr(&rcu_data.rcu_cpu_kthread_status); - char work, *workp = this_cpu_ptr(&rcu_cpu_has_work); + char work, *workp = this_cpu_ptr(&rcu_data.rcu_cpu_has_work); int spincnt; for (spincnt = 0; spincnt < 10; spincnt++) { @@ -1472,7 +1465,7 @@ static void __init rcu_spawn_boost_kthreads(void) int cpu; for_each_possible_cpu(cpu) - per_cpu(rcu_cpu_has_work, cpu) = 0; + per_cpu(rcu_data.rcu_cpu_has_work, cpu) = 0; if (WARN_ONCE(smpboot_register_percpu_thread(&rcu_cpu_thread_spec), "%s: Could not start rcub kthread, OOM is now expected behavior\n", __func__)) return; rcu_for_each_leaf_node(rnp) -- cgit v1.3-7-g2ca7 From b2c1955b88495c1531b2080ba4fad119c0a03cc1 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Fri, 30 Nov 2018 19:12:04 -0800 Subject: rcu: Remove unused rcu_cpu_kthread_cpu per-CPU variable The rcu_cpu_kthread_cpu used to provide debugfs information, but is no longer used. This commit therefore removes it. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.h | 4 ---- 1 file changed, 4 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index 7ae6774a920e..008c356c7033 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -408,10 +408,6 @@ static const char *tp_rcu_varname __used __tracepoint_string = rcu_name; int rcu_dynticks_snap(struct rcu_data *rdp); -#ifdef CONFIG_RCU_BOOST -DECLARE_PER_CPU(int, rcu_cpu_kthread_cpu); -#endif /* #ifdef CONFIG_RCU_BOOST */ - /* Forward declarations for rcutree_plugin.h */ static void rcu_bootup_announce(void); static void rcu_qs(void); -- cgit v1.3-7-g2ca7 From a9fefdb257259ac2f0f5fcd916edecc5c9427635 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 3 Dec 2018 14:07:17 -0800 Subject: rcu: Update NOCB comments This commit updates a few obsolete comments in the RCU callback-offload code. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 33 ++++++++++++++++----------------- 1 file changed, 16 insertions(+), 17 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index b241c4b20549..f0019c2a2cbc 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1857,22 +1857,24 @@ static void zero_cpu_stall_ticks(struct rcu_data *rdp) /* * Offload callback processing from the boot-time-specified set of CPUs - * specified by rcu_nocb_mask. For each CPU in the set, there is a - * kthread created that pulls the callbacks from the corresponding CPU, - * waits for a grace period to elapse, and invokes the callbacks. - * The no-CBs CPUs do a wake_up() on their kthread when they insert - * a callback into any empty list, unless the rcu_nocb_poll boot parameter - * has been specified, in which case each kthread actively polls its - * CPU. (Which isn't so great for energy efficiency, but which does - * reduce RCU's overhead on that CPU.) + * specified by rcu_nocb_mask. For the CPUs in the set, there are kthreads + * created that pull the callbacks from the corresponding CPU, wait for + * a grace period to elapse, and invoke the callbacks. These kthreads + * are organized into leaders, which manage incoming callbacks, wait for + * grace periods, and awaken followers, and the followers, which only + * invoke callbacks. Each leader is its own follower. The no-CBs CPUs + * do a wake_up() on their kthread when they insert a callback into any + * empty list, unless the rcu_nocb_poll boot parameter has been specified, + * in which case each kthread actively polls its CPU. (Which isn't so great + * for energy efficiency, but which does reduce RCU's overhead on that CPU.) * * This is intended to be used in conjunction with Frederic Weisbecker's * adaptive-idle work, which would seriously reduce OS jitter on CPUs * running CPU-bound user-mode computations. * - * Offloading of callback processing could also in theory be used as - * an energy-efficiency measure because CPUs with no RCU callbacks - * queued are more aggressive about entering dyntick-idle mode. + * Offloading of callbacks can also be used as an energy-efficiency + * measure because CPUs with no RCU callbacks queued are more aggressive + * about entering dyntick-idle mode. */ @@ -1976,10 +1978,7 @@ static void wake_nocb_leader_defer(struct rcu_data *rdp, int waketype, raw_spin_unlock_irqrestore(&rdp->nocb_lock, flags); } -/* - * Does the specified CPU need an RCU callback for this invocation - * of rcu_barrier()? - */ +/* Does rcu_barrier need to queue an RCU callback on the specified CPU? */ static bool rcu_nocb_cpu_needs_barrier(int cpu) { struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu); @@ -1995,8 +1994,8 @@ static bool rcu_nocb_cpu_needs_barrier(int cpu) * callbacks would be posted. In the worst case, the first * barrier in rcu_barrier() suffices (but the caller cannot * necessarily rely on this, not a substitute for the caller - * getting the concurrency design right!). There must also be - * a barrier between the following load an posting of a callback + * getting the concurrency design right!). There must also be a + * barrier between the following load and posting of a callback * (if a callback is in fact needed). This is associated with an * atomic_inc() in the caller. */ -- cgit v1.3-7-g2ca7 From fd897573fa4cfe66ebddf5f4444f36710cf0cad0 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 10 Dec 2018 16:09:49 -0800 Subject: rcu: Improve diagnostics for failed RCU grace-period start If a grace period fails to start (for example, because you commented out the last two lines of rcu_accelerate_cbs_unlocked()), rcu_core() will invoke rcu_check_gp_start_stall(), which will notice and complain. However, this complaint is lacking crucial debugging information such as when the last wakeup executed and what the value of ->gp_seq was at that time. This commit therefore removes the current pr_alert() from rcu_check_gp_start_stall(), instead invoking show_rcu_gp_kthreads(), which has been updated to print the needed information, which is collected by rcu_gp_kthread_wake(). Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 55 ++++++++++++++++++++++++++++++++----------------------- kernel/rcu/tree.h | 2 ++ 2 files changed, 34 insertions(+), 23 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index b003a3cfe192..8543a90d53f2 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -512,6 +512,14 @@ void rcu_force_quiescent_state(void) } EXPORT_SYMBOL_GPL(rcu_force_quiescent_state); +/* + * Return the root node of the rcu_state structure. + */ +static struct rcu_node *rcu_get_root(void) +{ + return &rcu_state.node[0]; +} + /* * Convert a ->gp_state value to a character string. */ @@ -529,19 +537,30 @@ void show_rcu_gp_kthreads(void) { int cpu; unsigned long j; + unsigned long ja; + unsigned long jr; + unsigned long jw; struct rcu_data *rdp; struct rcu_node *rnp; - j = jiffies - READ_ONCE(rcu_state.gp_activity); - pr_info("%s: wait state: %s(%d) ->state: %#lx delta ->gp_activity %ld\n", + j = jiffies; + ja = j - READ_ONCE(rcu_state.gp_activity); + jr = j - READ_ONCE(rcu_state.gp_req_activity); + jw = j - READ_ONCE(rcu_state.gp_wake_time); + pr_info("%s: wait state: %s(%d) ->state: %#lx delta ->gp_activity %lu ->gp_req_activity %lu ->gp_wake_time %lu ->gp_wake_seq %ld ->gp_seq %ld ->gp_seq_needed %ld ->gp_flags %#x\n", rcu_state.name, gp_state_getname(rcu_state.gp_state), - rcu_state.gp_state, rcu_state.gp_kthread->state, j); + rcu_state.gp_state, + rcu_state.gp_kthread ? rcu_state.gp_kthread->state : 0x1ffffL, + ja, jr, jw, (long)READ_ONCE(rcu_state.gp_wake_seq), + (long)READ_ONCE(rcu_state.gp_seq), + (long)READ_ONCE(rcu_get_root()->gp_seq_needed), + READ_ONCE(rcu_state.gp_flags)); rcu_for_each_node_breadth_first(rnp) { if (ULONG_CMP_GE(rcu_state.gp_seq, rnp->gp_seq_needed)) continue; - pr_info("\trcu_node %d:%d ->gp_seq %lu ->gp_seq_needed %lu\n", - rnp->grplo, rnp->grphi, rnp->gp_seq, - rnp->gp_seq_needed); + pr_info("\trcu_node %d:%d ->gp_seq %ld ->gp_seq_needed %ld\n", + rnp->grplo, rnp->grphi, (long)rnp->gp_seq, + (long)rnp->gp_seq_needed); if (!rcu_is_leaf_node(rnp)) continue; for_each_leaf_node_possible_cpu(rnp, cpu) { @@ -550,8 +569,8 @@ void show_rcu_gp_kthreads(void) ULONG_CMP_GE(rcu_state.gp_seq, rdp->gp_seq_needed)) continue; - pr_info("\tcpu %d ->gp_seq_needed %lu\n", - cpu, rdp->gp_seq_needed); + pr_info("\tcpu %d ->gp_seq_needed %ld\n", + cpu, (long)rdp->gp_seq_needed); } } /* sched_show_task(rcu_state.gp_kthread); */ @@ -577,14 +596,6 @@ void rcutorture_get_gp_data(enum rcutorture_type test_type, int *flags, } EXPORT_SYMBOL_GPL(rcutorture_get_gp_data); -/* - * Return the root node of the rcu_state structure. - */ -static struct rcu_node *rcu_get_root(void) -{ - return &rcu_state.node[0]; -} - /* * Enter an RCU extended quiescent state, which can be either the * idle loop or adaptive-tickless usermode execution. @@ -1560,7 +1571,8 @@ static bool rcu_future_gp_cleanup(struct rcu_node *rnp) * Awaken the grace-period kthread. Don't do a self-awaken, and don't * bother awakening when there is nothing for the grace-period kthread * to do (as in several CPUs raced to awaken, and we lost), and finally - * don't try to awaken a kthread that has not yet been created. + * don't try to awaken a kthread that has not yet been created. If + * all those checks are passed, track some debug information and awaken. */ static void rcu_gp_kthread_wake(void) { @@ -1568,6 +1580,8 @@ static void rcu_gp_kthread_wake(void) !READ_ONCE(rcu_state.gp_flags) || !rcu_state.gp_kthread) return; + WRITE_ONCE(rcu_state.gp_wake_time, jiffies); + WRITE_ONCE(rcu_state.gp_wake_seq, READ_ONCE(rcu_state.gp_seq)); swake_up_one(&rcu_state.gp_wq); } @@ -2657,16 +2671,11 @@ rcu_check_gp_start_stall(struct rcu_node *rnp, struct rcu_data *rdp, raw_spin_unlock_irqrestore_rcu_node(rnp, flags); return; } - pr_alert("%s: g%ld->%ld gar:%lu ga:%lu f%#x gs:%d %s->state:%#lx\n", - __func__, (long)READ_ONCE(rcu_state.gp_seq), - (long)READ_ONCE(rnp_root->gp_seq_needed), - j - rcu_state.gp_req_activity, j - rcu_state.gp_activity, - rcu_state.gp_flags, rcu_state.gp_state, rcu_state.name, - rcu_state.gp_kthread ? rcu_state.gp_kthread->state : 0x1ffffL); WARN_ON(1); if (rnp_root != rnp) raw_spin_unlock_rcu_node(rnp_root); raw_spin_unlock_irqrestore_rcu_node(rnp, flags); + show_rcu_gp_kthreads(); } /* diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index 008c356c7033..1f2ada7ef7d7 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -309,6 +309,8 @@ struct rcu_state { struct swait_queue_head gp_wq; /* Where GP task waits. */ short gp_flags; /* Commands for GP task. */ short gp_state; /* GP kthread sleep state. */ + unsigned long gp_wake_time; /* Last GP kthread wake. */ + unsigned long gp_wake_seq; /* ->gp_seq at ^^^. */ /* End of fields guarded by root rcu_node's lock. */ -- cgit v1.3-7-g2ca7 From 3b6505fd8eb86e3ef5ce12b34fe81e9edeb84475 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 12 Dec 2018 07:20:07 -0800 Subject: rcu: Protect rcu_check_gp_kthread_starvation() access to ->gp_flags The rcu_check_gp_kthread_starvation() function can be invoked without holding locks, so the access to the rcu_state structure's ->gp_flags field must be protected with READ_ONCE(). This commit therefore adds this protection. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 8543a90d53f2..238150684fed 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1192,7 +1192,7 @@ static void rcu_check_gp_kthread_starvation(void) pr_err("%s kthread starved for %ld jiffies! g%ld f%#x %s(%d) ->state=%#lx ->cpu=%d\n", rcu_state.name, j, (long)rcu_seq_current(&rcu_state.gp_seq), - rcu_state.gp_flags, + READ_ONCE(rcu_state.gp_flags), gp_state_getname(rcu_state.gp_state), rcu_state.gp_state, gpk ? gpk->state : ~0, gpk ? task_cpu(gpk) : -1); if (gpk) { -- cgit v1.3-7-g2ca7 From 2ccaff10f71307ad4f75ce673b815d3fe6254e3d Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 12 Dec 2018 12:32:06 -0800 Subject: rcu: Add sysrq rcu_node-dump capability Life is hard if RCU manages to get stuck without triggering RCU CPU stall warnings or triggering the rcu_check_gp_start_stall() checks for failing to start a grace period. This commit therefore adds a boot-time-selectable sysrq key (commandeering "y") that allows manually dumping Tree RCU state. The new rcutree.sysrq_rcu kernel boot parameter must be set for this sysrq to be available. Signed-off-by: Paul E. McKenney --- Documentation/admin-guide/kernel-parameters.txt | 5 +++++ kernel/rcu/tree.c | 25 +++++++++++++++++++++++++ 2 files changed, 30 insertions(+) (limited to 'kernel') diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index b799bcf67d7b..b0dbb2fa401f 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -3721,6 +3721,11 @@ This wake_up() will be accompanied by a WARN_ONCE() splat and an ftrace_dump(). + rcutree.sysrq_rcu= [KNL] + Commandeer a sysrq key to dump out Tree RCU's + rcu_node tree with an eye towards determining + why a new grace period has not yet started. + rcuperf.gp_async= [KNL] Measure performance of asynchronous grace-period primitives such as call_rcu(). diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 238150684fed..9ceb93f848cd 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -62,6 +62,7 @@ #include #include #include +#include #include "tree.h" #include "rcu.h" @@ -115,6 +116,9 @@ int num_rcu_lvl[] = NUM_RCU_LVL_INIT; int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. */ /* panic() on RCU Stall sysctl. */ int sysctl_panic_on_rcu_stall __read_mostly; +/* Commandeer a sysrq key to dump RCU's tree. */ +static bool sysrq_rcu; +module_param(sysrq_rcu, bool, 0444); /* * The rcu_scheduler_active variable is initialized to the value @@ -577,6 +581,27 @@ void show_rcu_gp_kthreads(void) } EXPORT_SYMBOL_GPL(show_rcu_gp_kthreads); +/* Dump grace-period-request information due to commandeered sysrq. */ +static void sysrq_show_rcu(int key) +{ + show_rcu_gp_kthreads(); +} + +static struct sysrq_key_op sysrq_rcudump_op = { + .handler = sysrq_show_rcu, + .help_msg = "show-rcu(y)", + .action_msg = "Show RCU tree", + .enable_mask = SYSRQ_ENABLE_DUMP, +}; + +static int __init rcu_sysrq_init(void) +{ + if (sysrq_rcu) + return register_sysrq_key('y', &sysrq_rcudump_op); + return 0; +} +early_initcall(rcu_sysrq_init); + /* * Send along grace-period-related data for rcutorture diagnostics. */ -- cgit v1.3-7-g2ca7 From 1d1f898df6586c5ea9aeaf349f13089c6fa37903 Mon Sep 17 00:00:00 2001 From: "Zhang, Jun" Date: Tue, 18 Dec 2018 06:55:01 -0800 Subject: rcu: Do RCU GP kthread self-wakeup from softirq and interrupt The rcu_gp_kthread_wake() function is invoked when it might be necessary to wake the RCU grace-period kthread. Because self-wakeups are normally a useless waste of CPU cycles, if rcu_gp_kthread_wake() is invoked from this kthread, it naturally refuses to do the wakeup. Unfortunately, natural though it might be, this heuristic fails when rcu_gp_kthread_wake() is invoked from an interrupt or softirq handler that interrupted the grace-period kthread just after the final check of the wait-event condition but just before the schedule() call. In this case, a wakeup is required, even though the call to rcu_gp_kthread_wake() is within the RCU grace-period kthread's context. Failing to provide this wakeup can result in grace periods failing to start, which in turn results in out-of-memory conditions. This race window is quite narrow, but it actually did happen during real testing. It would of course need to be fixed even if it was strictly theoretical in nature. This patch does not Cc stable because it does not apply cleanly to earlier kernel versions. Fixes: 48a7639ce80c ("rcu: Make callers awaken grace-period kthread") Reported-by: "He, Bo" Co-developed-by: "Zhang, Jun" Co-developed-by: "He, Bo" Co-developed-by: "xiao, jin" Co-developed-by: Bai, Jie A Signed-off: "Zhang, Jun" Signed-off: "He, Bo" Signed-off: "xiao, jin" Signed-off: Bai, Jie A Signed-off-by: "Zhang, Jun" [ paulmck: Switch from !in_softirq() to "!in_interrupt() && !in_serving_softirq() to avoid redundant wakeups and to also handle the interrupt-handler scenario as well as the softirq-handler scenario that actually occurred in testing. ] Signed-off-by: Paul E. McKenney Link: https://lkml.kernel.org/r/CD6925E8781EFD4D8E11882D20FC406D52A11F61@SHSMSX104.ccr.corp.intel.com --- kernel/rcu/tree.c | 20 ++++++++++++++------ 1 file changed, 14 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 9ceb93f848cd..21775eebb8f0 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1593,15 +1593,23 @@ static bool rcu_future_gp_cleanup(struct rcu_node *rnp) } /* - * Awaken the grace-period kthread. Don't do a self-awaken, and don't - * bother awakening when there is nothing for the grace-period kthread - * to do (as in several CPUs raced to awaken, and we lost), and finally - * don't try to awaken a kthread that has not yet been created. If - * all those checks are passed, track some debug information and awaken. + * Awaken the grace-period kthread. Don't do a self-awaken (unless in + * an interrupt or softirq handler), and don't bother awakening when there + * is nothing for the grace-period kthread to do (as in several CPUs raced + * to awaken, and we lost), and finally don't try to awaken a kthread that + * has not yet been created. If all those checks are passed, track some + * debug information and awaken. + * + * So why do the self-wakeup when in an interrupt or softirq handler + * in the grace-period kthread's context? Because the kthread might have + * been interrupted just as it was going to sleep, and just after the final + * pre-sleep check of the awaken condition. In this case, a wakeup really + * is required, and is therefore supplied. */ static void rcu_gp_kthread_wake(void) { - if (current == rcu_state.gp_kthread || + if ((current == rcu_state.gp_kthread && + !in_interrupt() && !in_serving_softirq()) || !READ_ONCE(rcu_state.gp_flags) || !rcu_state.gp_kthread) return; -- cgit v1.3-7-g2ca7 From 13dc7d0c7a2ed438f0ec8e9fb365a1256d87cf87 Mon Sep 17 00:00:00 2001 From: "Zhang, Jun" Date: Wed, 19 Dec 2018 10:37:34 -0800 Subject: rcu: Prevent needless ->gp_seq_needed update in __note_gp_changes() Currently, __note_gp_changes() checks to see if the rcu_node structure's ->gp_seq_needed is greater than or equal to that of the rcu_data structure, and if so, updates the rcu_data structure's ->gp_seq_needed field. This results in a useless store in the case where the two fields are equal. This commit therefore carries out this store only in the case where the rcu_node structure's ->gp_seq_needed is strictly greater than that of the rcu_data structure. Signed-off-by: "Zhang, Jun" Signed-off-by: Paul E. McKenney Link: https://lkml.kernel.org/r/88DC34334CA3444C85D647DBFA962C2735AD5F77@SHSMSX104.ccr.corp.intel.com --- kernel/rcu/tree.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 21775eebb8f0..9d0e2ac9356e 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1758,7 +1758,7 @@ static bool __note_gp_changes(struct rcu_node *rnp, struct rcu_data *rdp) zero_cpu_stall_ticks(rdp); } rdp->gp_seq = rnp->gp_seq; /* Remember new grace-period state. */ - if (ULONG_CMP_GE(rnp->gp_seq_needed, rdp->gp_seq_needed) || rdp->gpwrap) + if (ULONG_CMP_LT(rdp->gp_seq_needed, rnp->gp_seq_needed) || rdp->gpwrap) rdp->gp_seq_needed = rnp->gp_seq_needed; WRITE_ONCE(rdp->gpwrap, false); rcu_gpnum_ovf(rnp, rdp); -- cgit v1.3-7-g2ca7 From c98cac603f1ce7d00e2a802b5640bced3bc3c1f2 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 21 Nov 2018 11:35:03 -0800 Subject: rcu: Rename rcu_check_callbacks() to rcu_sched_clock_irq() The name rcu_check_callbacks() arguably made sense back in the early 2000s when RCU was quite a bit simpler than it is today, but it has become quite misleading, especially with the advent of dyntick-idle and NO_HZ_FULL. The rcu_check_callbacks() function is RCU's hook into the scheduling-clock interrupt, and is now but one of many ways that callbacks get promoted to invocable state. This commit therefore changes the name to rcu_sched_clock_irq(), which is the same number of characters and clearly indicates this function's relation to the rest of the Linux kernel. In addition, for the sake of consistency, rcu_flavor_check_callbacks() is also renamed to rcu_flavor_sched_clock_irq(). While in the area, the header comments for both functions are reworked. Signed-off-by: Paul E. McKenney --- .../Memory-Ordering/Tree-RCU-Memory-Ordering.html | 4 ++-- .../TreeRCU-callback-invocation.svg | 2 +- .../RCU/Design/Memory-Ordering/TreeRCU-gp.svg | 4 ++-- .../RCU/Design/Memory-Ordering/TreeRCU-qs.svg | 2 +- include/linux/rcupdate.h | 2 +- kernel/rcu/tiny.c | 2 +- kernel/rcu/tree.c | 18 ++++++++-------- kernel/rcu/tree.h | 2 +- kernel/rcu/tree_plugin.h | 24 +++++++++------------- kernel/time/timer.c | 2 +- 10 files changed, 29 insertions(+), 33 deletions(-) (limited to 'kernel') diff --git a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html index e4d94fba6c89..a3acfd49255f 100644 --- a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html +++ b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html @@ -485,7 +485,7 @@ section that the grace period must wait on. noted by rcu_node_context_switch() on the left. On the other hand, if the CPU takes a scheduler-clock interrupt while executing in usermode, a quiescent state will be noted by -rcu_check_callbacks() on the right. +rcu_sched_clock_irq() on the right. Either way, the passage through a quiescent state will be noted in a per-CPU variable. @@ -651,7 +651,7 @@ to end. These callbacks are identified by rcu_advance_cbs(), which is usually invoked by __note_gp_changes(). As shown in the diagram below, this invocation can be triggered by -the scheduling-clock interrupt (rcu_check_callbacks() on +the scheduling-clock interrupt (rcu_sched_clock_irq() on the left) or by idle entry (rcu_cleanup_after_idle() on the right, but only for kernels build with CONFIG_RCU_FAST_NO_HZ=y). diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-callback-invocation.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-callback-invocation.svg index 832408313d93..3fcf0c17cef2 100644 --- a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-callback-invocation.svg +++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-callback-invocation.svg @@ -349,7 +349,7 @@ font-weight="bold" font-size="192" id="text202-7-5" - style="font-size:192px;font-style:normal;font-weight:bold;text-anchor:start;fill:#000000;stroke-width:0.025in;font-family:Courier">rcu_check_callbacks() + style="font-size:192px;font-style:normal;font-weight:bold;text-anchor:start;fill:#000000;stroke-width:0.025in;font-family:Courier">rcu_sched_clock_irq() rcu_check_callbacks() + xml:space="preserve">rcu_sched_clock_irq() rcu_check_callbacks() + style="font-size:192px;font-style:normal;font-weight:bold;text-anchor:start;fill:#000000;stroke-width:0.025in;font-family:Courier">rcu_sched_clock_irq() rcu_check_callbacks() + xml:space="preserve">rcu_sched_clock_irq() rcu_read_unlock_special.b.need_qs = false; } } @@ -778,13 +778,13 @@ static void rcu_preempt_check_blocked_tasks(struct rcu_node *rnp) } /* - * Check for a quiescent state from the current CPU. When a task blocks, - * the task is recorded in the corresponding CPU's rcu_node structure, - * which is checked elsewhere. - * - * Caller must disable hard irqs. + * Check for a quiescent state from the current CPU, including voluntary + * context switches for Tasks RCU. When a task blocks, the task is + * recorded in the corresponding CPU's rcu_node structure, which is checked + * elsewhere, hence this function need only check for quiescent states + * related to the current CPU, not to those related to tasks. */ -static void rcu_flavor_check_callbacks(int user) +static void rcu_flavor_sched_clock_irq(int user) { struct task_struct *t = current; @@ -1030,14 +1030,10 @@ static void rcu_preempt_check_blocked_tasks(struct rcu_node *rnp) } /* - * Check to see if this CPU is in a non-context-switch quiescent state - * (user mode or idle loop for rcu, non-softirq execution for rcu_bh). - * Also schedule RCU core processing. - * - * This function must be called from hardirq context. It is normally - * invoked from the scheduling-clock interrupt. + * Check to see if this CPU is in a non-context-switch quiescent state, + * namely user mode and idle loop. */ -static void rcu_flavor_check_callbacks(int user) +static void rcu_flavor_sched_clock_irq(int user) { if (user || rcu_is_cpu_rrupt_from_idle()) { diff --git a/kernel/time/timer.c b/kernel/time/timer.c index 444156debfa0..6eb7cc4b6d52 100644 --- a/kernel/time/timer.c +++ b/kernel/time/timer.c @@ -1632,7 +1632,7 @@ void update_process_times(int user_tick) /* Note: this timer irq context must be accounted for as well. */ account_process_tick(p, user_tick); run_local_timers(); - rcu_check_callbacks(user_tick); + rcu_sched_clock_irq(user_tick); #ifdef CONFIG_IRQ_WORK if (in_irq()) irq_work_tick(); -- cgit v1.3-7-g2ca7 From fb60e533beab3bf27adc0e39e03337e7584c6d5a Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 21 Nov 2018 12:42:12 -0800 Subject: rcu: Rename rcu_process_callbacks() to rcu_core() for Tree RCU Although the name rcu_process_callbacks() still makes sense for Tiny RCU, where most of what it does is invoke callbacks, it no longer makes much sense for Tree RCU, especially given that the actually callback invocation is relegated to rcu_do_batch(), or, for no-CBs CPUs, to the rcuo kthreads. Especially in the latter case, rcu_process_callbacks() has very little to do with actual callbacks. A better description of this function is that it performs RCU's core processing. This commit therefore changes the name of Tree RCU's rcu_process_callbacks() function to rcu_core(), which also has the virtue of being consistent with the existing invoke_rcu_core() function. While in the area, the header comment is reworked. Signed-off-by: Paul E. McKenney --- .../RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html | 2 +- Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp.svg | 4 ++-- Documentation/RCU/Design/Memory-Ordering/TreeRCU-qs.svg | 4 ++-- kernel/rcu/tree.c | 10 +++------- 4 files changed, 8 insertions(+), 12 deletions(-) (limited to 'kernel') diff --git a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html index a3acfd49255f..8d21af02b1f0 100644 --- a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html +++ b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html @@ -491,7 +491,7 @@ in a per-CPU variable.

The next time an RCU_SOFTIRQ handler executes on this CPU (for example, after the next scheduler-clock -interrupt), __rcu_process_callbacks() will invoke +interrupt), rcu_core() will invoke rcu_check_quiescent_state(), which will notice the recorded quiescent state, and invoke rcu_report_qs_rdp(). diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp.svg index f0bbe6f8d729..2bcd742d6e49 100644 --- a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp.svg +++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp.svg @@ -3924,7 +3924,7 @@ font-style="normal" y="-4418.6582" x="3745.7725" - xml:space="preserve">rcu_process_callbacks() + xml:space="preserve">rcu_core() rcu_check_quiescent_state()) + xml:space="preserve">rcu_check_quiescent_state() rcu_process_callbacks() + xml:space="preserve">rcu_core() rcu_check_quiescent_state()) + xml:space="preserve">rcu_check_quiescent_state() Date: Mon, 26 Nov 2018 11:12:39 -0800 Subject: rcu: Remove preemption disabling from expedited CPU selection It turns out that it is queue_delayed_work_on() rather than queue_work_on() that has difficulties when used concurrently with CPU-hotplug removal operations. It is therefore unnecessary to protect CPU identification and queue_work_on() with preempt_disable(). This commit therefore removes the preempt_disable() and preempt_enable() from sync_rcu_exp_select_cpus(), which has the further benefit of reducing the number of changes that must be maintained in the -rt patchset. Reported-by: Thomas Gleixner Reported-by: Sebastian Siewior Suggested-by: Boqun Feng Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_exp.h | 2 -- 1 file changed, 2 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h index d882ca0cd01b..763649d345d7 100644 --- a/kernel/rcu/tree_exp.h +++ b/kernel/rcu/tree_exp.h @@ -448,7 +448,6 @@ static void sync_rcu_exp_select_cpus(void) continue; } INIT_WORK(&rnp->rew.rew_work, sync_rcu_exp_select_node_cpus); - preempt_disable(); cpu = find_next_bit(&rnp->ffmask, BITS_PER_LONG, -1); /* If all offline, queue the work on an unbound CPU. */ if (unlikely(cpu > rnp->grphi - rnp->grplo)) @@ -456,7 +455,6 @@ static void sync_rcu_exp_select_cpus(void) else cpu += rnp->grplo; queue_work_on(cpu, rcu_par_gp_wq, &rnp->rew.rew_work); - preempt_enable(); rnp->exp_need_flush = true; } -- cgit v1.3-7-g2ca7 From 39abefe7433236735c92749492fd045fd40c071e Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 26 Nov 2018 18:33:25 -0800 Subject: rcu: Repair rcu_nmi_exit() docbook header This commit removes the "@irq" argument from the rcu_nmi_exit() docbook header, given that this function now has no arguments. Reported-by: kbuild test robot Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 1 - 1 file changed, 1 deletion(-) (limited to 'kernel') diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index e67f8dc1894b..9cbadddf1f31 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -725,7 +725,6 @@ static __always_inline void rcu_nmi_exit_common(bool irq) /** * rcu_nmi_exit - inform RCU of exit from NMI context - * @irq: Is this call from rcu_irq_exit? * * If you add or remove a call to rcu_nmi_exit(), be sure to test * with CONFIG_RCU_EQS_DEBUG=y. -- cgit v1.3-7-g2ca7 From c2d8089de7f0b849af11c271278fe6b904db5df2 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 28 Nov 2018 12:47:23 -0800 Subject: rcu: Fix obsolete DYNTICK_IRQ_NONIDLE comment This commit updates the DYNTICK_IRQ_NONIDLE header comment to remove the obsolete commentary about unmatched rcu_irq_{enter,exit}(). Signed-off-by: Paul E. McKenney --- kernel/rcu/rcu.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h index 75787186bd4f..51f1fb21f171 100644 --- a/kernel/rcu/rcu.h +++ b/kernel/rcu/rcu.h @@ -30,7 +30,7 @@ #define RCU_TRACE(stmt) #endif /* #else #ifdef CONFIG_RCU_TRACE */ -/* Offset to allow for unmatched rcu_irq_{enter,exit}(). */ +/* Offset to allow distinguishing irq vs. task-based idle entry/exit. */ #define DYNTICK_IRQ_NONIDLE ((LONG_MAX / 2) + 1) -- cgit v1.3-7-g2ca7 From e81baf4cb19a9b428ba477fd0423f81672a58817 Mon Sep 17 00:00:00 2001 From: Sebastian Andrzej Siewior Date: Tue, 11 Dec 2018 12:12:38 +0100 Subject: srcu: Remove srcu_queue_delayed_work_on() srcu_queue_delayed_work_on() disables preemption (and therefore CPU hotplug in RCU's case) and then checks based on its own accounting if a CPU is online. If the CPU is online it uses queue_delayed_work_on() otherwise it fallbacks to queue_delayed_work(). The problem here is that queue_work() on -RT does not work with disabled preemption. queue_work_on() works also on an offlined CPU. queue_delayed_work_on() has the problem that it is possible to program a timer on an offlined CPU. This timer will fire once the CPU is online again. But until then, the timer remains programmed and nothing will happen. Add a local timer which will fire (as requested per delay) on the local CPU and then enqueue the work on the specific CPU. RCUtorture testing with SRCU-P for 24h showed no problems. Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Paul E. McKenney --- include/linux/srcutree.h | 3 ++- kernel/rcu/srcutree.c | 55 +++++++++++++++++++++--------------------------- kernel/rcu/tree.c | 4 ---- kernel/rcu/tree.h | 8 ------- 4 files changed, 26 insertions(+), 44 deletions(-) (limited to 'kernel') diff --git a/include/linux/srcutree.h b/include/linux/srcutree.h index 6f292bd3e7db..0faa978c9880 100644 --- a/include/linux/srcutree.h +++ b/include/linux/srcutree.h @@ -45,7 +45,8 @@ struct srcu_data { unsigned long srcu_gp_seq_needed; /* Furthest future GP needed. */ unsigned long srcu_gp_seq_needed_exp; /* Furthest future exp GP. */ bool srcu_cblist_invoking; /* Invoking these CBs? */ - struct delayed_work work; /* Context for CB invoking. */ + struct timer_list delay_work; /* Delay for CB invoking */ + struct work_struct work; /* Context for CB invoking. */ struct rcu_head srcu_barrier_head; /* For srcu_barrier() use. */ struct srcu_node *mynode; /* Leaf srcu_node. */ unsigned long grpmask; /* Mask for leaf srcu_node */ diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c index 3600d88d8956..7f041f2435df 100644 --- a/kernel/rcu/srcutree.c +++ b/kernel/rcu/srcutree.c @@ -58,6 +58,7 @@ static bool __read_mostly srcu_init_done; static void srcu_invoke_callbacks(struct work_struct *work); static void srcu_reschedule(struct srcu_struct *ssp, unsigned long delay); static void process_srcu(struct work_struct *work); +static void srcu_delay_timer(struct timer_list *t); /* Wrappers for lock acquisition and release, see raw_spin_lock_rcu_node(). */ #define spin_lock_rcu_node(p) \ @@ -156,7 +157,8 @@ static void init_srcu_struct_nodes(struct srcu_struct *ssp, bool is_static) snp->grphi = cpu; } sdp->cpu = cpu; - INIT_DELAYED_WORK(&sdp->work, srcu_invoke_callbacks); + INIT_WORK(&sdp->work, srcu_invoke_callbacks); + timer_setup(&sdp->delay_work, srcu_delay_timer, 0); sdp->ssp = ssp; sdp->grpmask = 1 << (cpu - sdp->mynode->grplo); if (is_static) @@ -386,13 +388,19 @@ void _cleanup_srcu_struct(struct srcu_struct *ssp, bool quiesced) } else { flush_delayed_work(&ssp->work); } - for_each_possible_cpu(cpu) + for_each_possible_cpu(cpu) { + struct srcu_data *sdp = per_cpu_ptr(ssp->sda, cpu); + if (quiesced) { - if (WARN_ON(delayed_work_pending(&per_cpu_ptr(ssp->sda, cpu)->work))) + if (WARN_ON(timer_pending(&sdp->delay_work))) + return; /* Just leak it! */ + if (WARN_ON(work_pending(&sdp->work))) return; /* Just leak it! */ } else { - flush_delayed_work(&per_cpu_ptr(ssp->sda, cpu)->work); + del_timer_sync(&sdp->delay_work); + flush_work(&sdp->work); } + } if (WARN_ON(rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)) != SRCU_STATE_IDLE) || WARN_ON(srcu_readers_active(ssp))) { pr_info("%s: Active srcu_struct %p state: %d\n", @@ -463,39 +471,23 @@ static void srcu_gp_start(struct srcu_struct *ssp) WARN_ON_ONCE(state != SRCU_STATE_SCAN1); } -/* - * Track online CPUs to guide callback workqueue placement. - */ -DEFINE_PER_CPU(bool, srcu_online); -void srcu_online_cpu(unsigned int cpu) +static void srcu_delay_timer(struct timer_list *t) { - WRITE_ONCE(per_cpu(srcu_online, cpu), true); -} + struct srcu_data *sdp = container_of(t, struct srcu_data, delay_work); -void srcu_offline_cpu(unsigned int cpu) -{ - WRITE_ONCE(per_cpu(srcu_online, cpu), false); + queue_work_on(sdp->cpu, rcu_gp_wq, &sdp->work); } -/* - * Place the workqueue handler on the specified CPU if online, otherwise - * just run it whereever. This is useful for placing workqueue handlers - * that are to invoke the specified CPU's callbacks. - */ -static bool srcu_queue_delayed_work_on(int cpu, struct workqueue_struct *wq, - struct delayed_work *dwork, +static void srcu_queue_delayed_work_on(struct srcu_data *sdp, unsigned long delay) { - bool ret; + if (!delay) { + queue_work_on(sdp->cpu, rcu_gp_wq, &sdp->work); + return; + } - preempt_disable(); - if (READ_ONCE(per_cpu(srcu_online, cpu))) - ret = queue_delayed_work_on(cpu, wq, dwork, delay); - else - ret = queue_delayed_work(wq, dwork, delay); - preempt_enable(); - return ret; + timer_reduce(&sdp->delay_work, jiffies + delay); } /* @@ -504,7 +496,7 @@ static bool srcu_queue_delayed_work_on(int cpu, struct workqueue_struct *wq, */ static void srcu_schedule_cbs_sdp(struct srcu_data *sdp, unsigned long delay) { - srcu_queue_delayed_work_on(sdp->cpu, rcu_gp_wq, &sdp->work, delay); + srcu_queue_delayed_work_on(sdp, delay); } /* @@ -1186,7 +1178,8 @@ static void srcu_invoke_callbacks(struct work_struct *work) struct srcu_data *sdp; struct srcu_struct *ssp; - sdp = container_of(work, struct srcu_data, work.work); + sdp = container_of(work, struct srcu_data, work); + ssp = sdp->ssp; rcu_cblist_init(&ready_cbs); spin_lock_irq_rcu_node(sdp); diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 1c4add096078..127255795859 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -3408,8 +3408,6 @@ int rcutree_online_cpu(unsigned int cpu) raw_spin_lock_irqsave_rcu_node(rnp, flags); rnp->ffmask |= rdp->grpmask; raw_spin_unlock_irqrestore_rcu_node(rnp, flags); - if (IS_ENABLED(CONFIG_TREE_SRCU)) - srcu_online_cpu(cpu); if (rcu_scheduler_active == RCU_SCHEDULER_INACTIVE) return 0; /* Too early in boot for scheduler work. */ sync_sched_exp_online_cleanup(cpu); @@ -3434,8 +3432,6 @@ int rcutree_offline_cpu(unsigned int cpu) raw_spin_unlock_irqrestore_rcu_node(rnp, flags); rcutree_affinity_setting(cpu, cpu); - if (IS_ENABLED(CONFIG_TREE_SRCU)) - srcu_offline_cpu(cpu); return 0; } diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index 149557b7c39c..4bba017c703c 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -458,11 +458,3 @@ static void rcu_bind_gp_kthread(void); static bool rcu_nohz_full_cpu(void); static void rcu_dynticks_task_enter(void); static void rcu_dynticks_task_exit(void); - -#ifdef CONFIG_SRCU -void srcu_online_cpu(unsigned int cpu); -void srcu_offline_cpu(unsigned int cpu); -#else /* #ifdef CONFIG_SRCU */ -void srcu_online_cpu(unsigned int cpu) { } -void srcu_offline_cpu(unsigned int cpu) { } -#endif /* #else #ifdef CONFIG_SRCU */ -- cgit v1.3-7-g2ca7 From cd618d102b753a3f6823b297ac5c5d03d4e7f0c2 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 8 Jan 2019 13:41:26 -0800 Subject: rcutorture: Record grace periods in forward-progress histogram This commit records grace periods in rcutorture's n_launders_hist[] histogram, thus allowing rcu_torture_fwd_cb_hist() to print out the elapsed number of grace periods between buckets. This information helps to determine whether a lack of forward progress is due to stalled grace periods on the one hand or due to sluggish callback invocation on the other. Signed-off-by: Paul E. McKenney --- kernel/rcu/rcutorture.c | 29 ++++++++++++++++++++++------- 1 file changed, 22 insertions(+), 7 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index f6e85faa4ff4..0955f3a20952 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -1630,21 +1630,34 @@ static bool rcu_fwd_emergency_stop; #define MIN_FWD_CB_LAUNDERS 3 /* This many CB invocations to count. */ #define MIN_FWD_CBS_LAUNDERED 100 /* Number of counted CBs. */ #define FWD_CBS_HIST_DIV 10 /* Histogram buckets/second. */ -static long n_launders_hist[2 * MAX_FWD_CB_JIFFIES / (HZ / FWD_CBS_HIST_DIV)]; +struct rcu_launder_hist { + long n_launders; + unsigned long launder_gp_seq; +}; +#define N_LAUNDERS_HIST (2 * MAX_FWD_CB_JIFFIES / (HZ / FWD_CBS_HIST_DIV)) +static struct rcu_launder_hist n_launders_hist[N_LAUNDERS_HIST]; +static unsigned long rcu_launder_gp_seq_start; static void rcu_torture_fwd_cb_hist(void) { + unsigned long gps; + unsigned long gps_old; int i; int j; for (i = ARRAY_SIZE(n_launders_hist) - 1; i > 0; i--) - if (n_launders_hist[i] > 0) + if (n_launders_hist[i].n_launders > 0) break; pr_alert("%s: Callback-invocation histogram (duration %lu jiffies):", __func__, jiffies - rcu_fwd_startat); - for (j = 0; j <= i; j++) - pr_cont(" %ds/%d: %ld", - j + 1, FWD_CBS_HIST_DIV, n_launders_hist[j]); + gps_old = rcu_launder_gp_seq_start; + for (j = 0; j <= i; j++) { + gps = n_launders_hist[j].launder_gp_seq; + pr_cont(" %ds/%d: %ld:%ld", + j + 1, FWD_CBS_HIST_DIV, n_launders_hist[j].n_launders, + rcutorture_seq_diff(gps, gps_old)); + gps_old = gps; + } pr_cont("\n"); } @@ -1666,7 +1679,8 @@ static void rcu_torture_fwd_cb_cr(struct rcu_head *rhp) i = ((jiffies - rcu_fwd_startat) / (HZ / FWD_CBS_HIST_DIV)); if (i >= ARRAY_SIZE(n_launders_hist)) i = ARRAY_SIZE(n_launders_hist) - 1; - n_launders_hist[i]++; + n_launders_hist[i].n_launders++; + n_launders_hist[i].launder_gp_seq = cur_ops->get_gp_seq(); spin_unlock_irqrestore(&rcu_fwd_lock, flags); } @@ -1786,9 +1800,10 @@ static void rcu_torture_fwd_prog_cr(void) n_max_cbs = 0; n_max_gps = 0; for (i = 0; i < ARRAY_SIZE(n_launders_hist); i++) - n_launders_hist[i] = 0; + n_launders_hist[i].n_launders = 0; cver = READ_ONCE(rcu_torture_current_version); gps = cur_ops->get_gp_seq(); + rcu_launder_gp_seq_start = gps; while (time_before(jiffies, stopat) && !READ_ONCE(rcu_fwd_emergency_stop) && !torture_must_stop()) { rfcp = READ_ONCE(rcu_fwd_cb_head); -- cgit v1.3-7-g2ca7 From 3a6cb58f159e64241b2af9374acad41a70939349 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 10 Dec 2018 09:44:52 -0800 Subject: rcutorture: Add grace period after CPU offline Beyond a certain point in the CPU-hotplug offline process, timers get stranded on the outgoing CPU, and won't fire until that CPU comes back online, which might well be never. This commit therefore adds a hook in torture_onoff_init() that is invoked from torture_offline(), which rcutorture uses to occasionally wait for a grace period. This should result in failures for RCU implementations that rely on stranded timers eventually firing in the absence of the CPU coming back online. Reported-by: Sebastian Andrzej Siewior Signed-off-by: Paul E. McKenney --- include/linux/torture.h | 3 ++- kernel/locking/locktorture.c | 2 +- kernel/rcu/rcutorture.c | 11 ++++++++++- kernel/torture.c | 6 +++++- 4 files changed, 18 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/include/linux/torture.h b/include/linux/torture.h index 48fad21109fc..f2d3bcbf4337 100644 --- a/include/linux/torture.h +++ b/include/linux/torture.h @@ -50,11 +50,12 @@ do { if (verbose) pr_alert("%s" TORTURE_FLAG "!!! %s\n", torture_type, s); } while (0) /* Definitions for online/offline exerciser. */ +typedef void torture_ofl_func(void); bool torture_offline(int cpu, long *n_onl_attempts, long *n_onl_successes, unsigned long *sum_offl, int *min_onl, int *max_onl); bool torture_online(int cpu, long *n_onl_attempts, long *n_onl_successes, unsigned long *sum_onl, int *min_onl, int *max_onl); -int torture_onoff_init(long ooholdoff, long oointerval); +int torture_onoff_init(long ooholdoff, long oointerval, torture_ofl_func *f); void torture_onoff_stats(void); bool torture_onoff_failures(void); diff --git a/kernel/locking/locktorture.c b/kernel/locking/locktorture.c index 7d0b0ed74404..c8b348097bb5 100644 --- a/kernel/locking/locktorture.c +++ b/kernel/locking/locktorture.c @@ -970,7 +970,7 @@ static int __init lock_torture_init(void) /* Prepare torture context. */ if (onoff_interval > 0) { firsterr = torture_onoff_init(onoff_holdoff * HZ, - onoff_interval * HZ); + onoff_interval * HZ, NULL); if (firsterr) goto unwind; } diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index 0955f3a20952..9eb9235c1ec9 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -2243,6 +2243,14 @@ static void rcu_test_debug_objects(void) #endif /* #else #ifdef CONFIG_DEBUG_OBJECTS_RCU_HEAD */ } +static void rcutorture_sync(void) +{ + static unsigned long n; + + if (cur_ops->sync && !(++n & 0xfff)) + cur_ops->sync(); +} + static int __init rcu_torture_init(void) { @@ -2404,7 +2412,8 @@ rcu_torture_init(void) firsterr = torture_shutdown_init(shutdown_secs, rcu_torture_cleanup); if (firsterr) goto unwind; - firsterr = torture_onoff_init(onoff_holdoff * HZ, onoff_interval); + firsterr = torture_onoff_init(onoff_holdoff * HZ, onoff_interval, + rcutorture_sync); if (firsterr) goto unwind; firsterr = rcu_torture_stall_init(); diff --git a/kernel/torture.c b/kernel/torture.c index bbf6d473e50c..a03ff722352b 100644 --- a/kernel/torture.c +++ b/kernel/torture.c @@ -75,6 +75,7 @@ static DEFINE_MUTEX(fullstop_mutex); static struct task_struct *onoff_task; static long onoff_holdoff; static long onoff_interval; +static torture_ofl_func *onoff_f; static long n_offline_attempts; static long n_offline_successes; static unsigned long sum_offline; @@ -118,6 +119,8 @@ bool torture_offline(int cpu, long *n_offl_attempts, long *n_offl_successes, pr_alert("%s" TORTURE_FLAG "torture_onoff task: offlined %d\n", torture_type, cpu); + if (onoff_f) + onoff_f(); (*n_offl_successes)++; delta = jiffies - starttime; *sum_offl += delta; @@ -243,11 +246,12 @@ stop: /* * Initiate online-offline handling. */ -int torture_onoff_init(long ooholdoff, long oointerval) +int torture_onoff_init(long ooholdoff, long oointerval, torture_ofl_func *f) { #ifdef CONFIG_HOTPLUG_CPU onoff_holdoff = ooholdoff; onoff_interval = oointerval; + onoff_f = f; if (onoff_interval <= 0) return 0; return torture_create_kthread(torture_onoff, NULL, onoff_task); -- cgit v1.3-7-g2ca7 From e838a7d66ee2bb7abb46214cb9a3505749e29505 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Fri, 28 Dec 2018 07:48:43 -0800 Subject: rcuperf: Stop abusing IS_ENABLED() The ever-evolving IS_ENABLED() macro is intended for CONFIG_* Kconfig options, but rcuperf currently uses it for the decidedly non-CONFIG_* MODULE macro. In the spirit of not inviting trouble, this commit substitutes tried-and-true #ifdef. Reported-by: Ingo Molnar Signed-off-by: Paul E. McKenney Acked-by: Ingo Molnar --- kernel/rcu/rcuperf.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/rcu/rcuperf.c b/kernel/rcu/rcuperf.c index b459da70b4fc..8d342d57cdaf 100644 --- a/kernel/rcu/rcuperf.c +++ b/kernel/rcu/rcuperf.c @@ -83,13 +83,19 @@ MODULE_AUTHOR("Paul E. McKenney "); * Various other use cases may of course be specified. */ +#ifdef MODULE +# define RCUPERF_SHUTDOWN 0 +#else +# define RCUPERF_SHUTDOWN 1 +#endif + torture_param(bool, gp_async, false, "Use asynchronous GP wait primitives"); torture_param(int, gp_async_max, 1000, "Max # outstanding waits per reader"); torture_param(bool, gp_exp, false, "Use expedited GP wait primitives"); torture_param(int, holdoff, 10, "Holdoff time before test start (s)"); torture_param(int, nreaders, -1, "Number of RCU reader threads"); torture_param(int, nwriters, -1, "Number of RCU updater threads"); -torture_param(bool, shutdown, !IS_ENABLED(MODULE), +torture_param(bool, shutdown, RCUPERF_SHUTDOWN, "Shutdown at end of performance tests."); torture_param(int, verbose, 1, "Enable verbose debugging printk()s"); torture_param(int, writer_holdoff, 0, "Holdoff (us) between GPs, zero to disable"); -- cgit v1.3-7-g2ca7 From a72dafafbd5f11c6ea3a9682d64da1074f28eb67 Mon Sep 17 00:00:00 2001 From: Jiong Wang Date: Sat, 26 Jan 2019 12:26:00 -0500 Subject: bpf: refactor verifier min/max code for condition jump The current min/max code does both signed and unsigned comparisons against the input argument "val" which is "u64" and there is explicit type casting when the comparison is signed. As we will need slightly more complexer type casting when JMP32 introduced, it is better to host the signed type casting. This makes the code more clean with ignorable runtime overhead. Also, code for J*GE/GT/LT/LE and JEQ/JNE are very similar, this patch combine them. The main purpose for this refactor is to make sure the min/max code will still be readable and with minimum code duplication after JMP32 introduced. Reviewed-by: Jakub Kicinski Signed-off-by: Jiong Wang Signed-off-by: Alexei Starovoitov --- kernel/bpf/verifier.c | 172 +++++++++++++++++++++++++++++--------------------- 1 file changed, 99 insertions(+), 73 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 8cfe39ef770f..eae6cb1fe653 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -4033,9 +4033,13 @@ static void find_good_pkt_pointers(struct bpf_verifier_state *vstate, */ static int is_branch_taken(struct bpf_reg_state *reg, u64 val, u8 opcode) { + s64 sval; + if (__is_pointer_value(false, reg)) return -1; + sval = (s64)val; + switch (opcode) { case BPF_JEQ: if (tnum_is_const(reg->var_off)) @@ -4058,9 +4062,9 @@ static int is_branch_taken(struct bpf_reg_state *reg, u64 val, u8 opcode) return 0; break; case BPF_JSGT: - if (reg->smin_value > (s64)val) + if (reg->smin_value > sval) return 1; - else if (reg->smax_value < (s64)val) + else if (reg->smax_value < sval) return 0; break; case BPF_JLT: @@ -4070,9 +4074,9 @@ static int is_branch_taken(struct bpf_reg_state *reg, u64 val, u8 opcode) return 0; break; case BPF_JSLT: - if (reg->smax_value < (s64)val) + if (reg->smax_value < sval) return 1; - else if (reg->smin_value >= (s64)val) + else if (reg->smin_value >= sval) return 0; break; case BPF_JGE: @@ -4082,9 +4086,9 @@ static int is_branch_taken(struct bpf_reg_state *reg, u64 val, u8 opcode) return 0; break; case BPF_JSGE: - if (reg->smin_value >= (s64)val) + if (reg->smin_value >= sval) return 1; - else if (reg->smax_value < (s64)val) + else if (reg->smax_value < sval) return 0; break; case BPF_JLE: @@ -4094,9 +4098,9 @@ static int is_branch_taken(struct bpf_reg_state *reg, u64 val, u8 opcode) return 0; break; case BPF_JSLE: - if (reg->smax_value <= (s64)val) + if (reg->smax_value <= sval) return 1; - else if (reg->smin_value > (s64)val) + else if (reg->smin_value > sval) return 0; break; } @@ -4113,6 +4117,8 @@ static void reg_set_min_max(struct bpf_reg_state *true_reg, struct bpf_reg_state *false_reg, u64 val, u8 opcode) { + s64 sval; + /* If the dst_reg is a pointer, we can't learn anything about its * variable offset from the compare (unless src_reg were a pointer into * the same object, but we don't bother with that. @@ -4122,19 +4128,22 @@ static void reg_set_min_max(struct bpf_reg_state *true_reg, if (__is_pointer_value(false, false_reg)) return; + sval = (s64)val; + switch (opcode) { case BPF_JEQ: - /* If this is false then we know nothing Jon Snow, but if it is - * true then we know for sure. - */ - __mark_reg_known(true_reg, val); - break; case BPF_JNE: - /* If this is true we know nothing Jon Snow, but if it is false - * we know the value for sure; + { + struct bpf_reg_state *reg = + opcode == BPF_JEQ ? true_reg : false_reg; + + /* For BPF_JEQ, if this is false we know nothing Jon Snow, but + * if it is true we know the value for sure. Likewise for + * BPF_JNE. */ - __mark_reg_known(false_reg, val); + __mark_reg_known(reg, val); break; + } case BPF_JSET: false_reg->var_off = tnum_and(false_reg->var_off, tnum_const(~val)); @@ -4142,38 +4151,46 @@ static void reg_set_min_max(struct bpf_reg_state *true_reg, true_reg->var_off = tnum_or(true_reg->var_off, tnum_const(val)); break; - case BPF_JGT: - false_reg->umax_value = min(false_reg->umax_value, val); - true_reg->umin_value = max(true_reg->umin_value, val + 1); - break; - case BPF_JSGT: - false_reg->smax_value = min_t(s64, false_reg->smax_value, val); - true_reg->smin_value = max_t(s64, true_reg->smin_value, val + 1); - break; - case BPF_JLT: - false_reg->umin_value = max(false_reg->umin_value, val); - true_reg->umax_value = min(true_reg->umax_value, val - 1); - break; - case BPF_JSLT: - false_reg->smin_value = max_t(s64, false_reg->smin_value, val); - true_reg->smax_value = min_t(s64, true_reg->smax_value, val - 1); - break; case BPF_JGE: - false_reg->umax_value = min(false_reg->umax_value, val - 1); - true_reg->umin_value = max(true_reg->umin_value, val); + case BPF_JGT: + { + u64 false_umax = opcode == BPF_JGT ? val : val - 1; + u64 true_umin = opcode == BPF_JGT ? val + 1 : val; + + false_reg->umax_value = min(false_reg->umax_value, false_umax); + true_reg->umin_value = max(true_reg->umin_value, true_umin); break; + } case BPF_JSGE: - false_reg->smax_value = min_t(s64, false_reg->smax_value, val - 1); - true_reg->smin_value = max_t(s64, true_reg->smin_value, val); + case BPF_JSGT: + { + s64 false_smax = opcode == BPF_JSGT ? sval : sval - 1; + s64 true_smin = opcode == BPF_JSGT ? sval + 1 : sval; + + false_reg->smax_value = min(false_reg->smax_value, false_smax); + true_reg->smin_value = max(true_reg->smin_value, true_smin); break; + } case BPF_JLE: - false_reg->umin_value = max(false_reg->umin_value, val + 1); - true_reg->umax_value = min(true_reg->umax_value, val); + case BPF_JLT: + { + u64 false_umin = opcode == BPF_JLT ? val : val + 1; + u64 true_umax = opcode == BPF_JLT ? val - 1 : val; + + false_reg->umin_value = max(false_reg->umin_value, false_umin); + true_reg->umax_value = min(true_reg->umax_value, true_umax); break; + } case BPF_JSLE: - false_reg->smin_value = max_t(s64, false_reg->smin_value, val + 1); - true_reg->smax_value = min_t(s64, true_reg->smax_value, val); + case BPF_JSLT: + { + s64 false_smin = opcode == BPF_JSLT ? sval : sval + 1; + s64 true_smax = opcode == BPF_JSLT ? sval - 1 : sval; + + false_reg->smin_value = max(false_reg->smin_value, false_smin); + true_reg->smax_value = min(true_reg->smax_value, true_smax); break; + } default: break; } @@ -4198,22 +4215,23 @@ static void reg_set_min_max_inv(struct bpf_reg_state *true_reg, struct bpf_reg_state *false_reg, u64 val, u8 opcode) { + s64 sval; + if (__is_pointer_value(false, false_reg)) return; + sval = (s64)val; + switch (opcode) { case BPF_JEQ: - /* If this is false then we know nothing Jon Snow, but if it is - * true then we know for sure. - */ - __mark_reg_known(true_reg, val); - break; case BPF_JNE: - /* If this is true we know nothing Jon Snow, but if it is false - * we know the value for sure; - */ - __mark_reg_known(false_reg, val); + { + struct bpf_reg_state *reg = + opcode == BPF_JEQ ? true_reg : false_reg; + + __mark_reg_known(reg, val); break; + } case BPF_JSET: false_reg->var_off = tnum_and(false_reg->var_off, tnum_const(~val)); @@ -4221,38 +4239,46 @@ static void reg_set_min_max_inv(struct bpf_reg_state *true_reg, true_reg->var_off = tnum_or(true_reg->var_off, tnum_const(val)); break; - case BPF_JGT: - true_reg->umax_value = min(true_reg->umax_value, val - 1); - false_reg->umin_value = max(false_reg->umin_value, val); - break; - case BPF_JSGT: - true_reg->smax_value = min_t(s64, true_reg->smax_value, val - 1); - false_reg->smin_value = max_t(s64, false_reg->smin_value, val); - break; - case BPF_JLT: - true_reg->umin_value = max(true_reg->umin_value, val + 1); - false_reg->umax_value = min(false_reg->umax_value, val); - break; - case BPF_JSLT: - true_reg->smin_value = max_t(s64, true_reg->smin_value, val + 1); - false_reg->smax_value = min_t(s64, false_reg->smax_value, val); - break; case BPF_JGE: - true_reg->umax_value = min(true_reg->umax_value, val); - false_reg->umin_value = max(false_reg->umin_value, val + 1); + case BPF_JGT: + { + u64 false_umin = opcode == BPF_JGT ? val : val + 1; + u64 true_umax = opcode == BPF_JGT ? val - 1 : val; + + false_reg->umin_value = max(false_reg->umin_value, false_umin); + true_reg->umax_value = min(true_reg->umax_value, true_umax); break; + } case BPF_JSGE: - true_reg->smax_value = min_t(s64, true_reg->smax_value, val); - false_reg->smin_value = max_t(s64, false_reg->smin_value, val + 1); + case BPF_JSGT: + { + s64 false_smin = opcode == BPF_JSGT ? sval : sval + 1; + s64 true_smax = opcode == BPF_JSGT ? sval - 1 : sval; + + false_reg->smin_value = max(false_reg->smin_value, false_smin); + true_reg->smax_value = min(true_reg->smax_value, true_smax); break; + } case BPF_JLE: - true_reg->umin_value = max(true_reg->umin_value, val); - false_reg->umax_value = min(false_reg->umax_value, val - 1); + case BPF_JLT: + { + u64 false_umax = opcode == BPF_JLT ? val : val - 1; + u64 true_umin = opcode == BPF_JLT ? val + 1 : val; + + false_reg->umax_value = min(false_reg->umax_value, false_umax); + true_reg->umin_value = max(true_reg->umin_value, true_umin); break; + } case BPF_JSLE: - true_reg->smin_value = max_t(s64, true_reg->smin_value, val); - false_reg->smax_value = min_t(s64, false_reg->smax_value, val - 1); + case BPF_JSLT: + { + s64 false_smax = opcode == BPF_JSLT ? sval : sval - 1; + s64 true_smin = opcode == BPF_JSLT ? sval + 1 : sval; + + false_reg->smax_value = min(false_reg->smax_value, false_smax); + true_reg->smin_value = max(true_reg->smin_value, true_smin); break; + } default: break; } -- cgit v1.3-7-g2ca7 From 092ed0968bb648cd18e8a0430cd0a8a71727315c Mon Sep 17 00:00:00 2001 From: Jiong Wang Date: Sat, 26 Jan 2019 12:26:01 -0500 Subject: bpf: verifier support JMP32 This patch teach verifier about the new BPF_JMP32 instruction class. Verifier need to treat it similar as the existing BPF_JMP class. A BPF_JMP32 insn needs to go through all checks that have been done on BPF_JMP. Also, verifier is doing runtime optimizations based on the extra info conditional jump instruction could offer, especially when the comparison is between constant and register that the value range of the register could be improved based on the comparison results. These code are updated accordingly. Acked-by: Jakub Kicinski Signed-off-by: Jiong Wang Signed-off-by: Alexei Starovoitov --- kernel/bpf/core.c | 3 +- kernel/bpf/verifier.c | 203 ++++++++++++++++++++++++++++++++++++++++++-------- 2 files changed, 173 insertions(+), 33 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index 2a81b8af3748..1e443ba97310 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -362,7 +362,8 @@ static int bpf_adj_branches(struct bpf_prog *prog, u32 pos, s32 end_old, insn = prog->insnsi + end_old; } code = insn->code; - if (BPF_CLASS(code) != BPF_JMP || + if ((BPF_CLASS(code) != BPF_JMP && + BPF_CLASS(code) != BPF_JMP32) || BPF_OP(code) == BPF_EXIT) continue; /* Adjust offset of jmps if we cross patch boundaries. */ diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index eae6cb1fe653..8c1c21cd50b4 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -1095,7 +1095,7 @@ static int check_subprogs(struct bpf_verifier_env *env) for (i = 0; i < insn_cnt; i++) { u8 code = insn[i].code; - if (BPF_CLASS(code) != BPF_JMP) + if (BPF_CLASS(code) != BPF_JMP && BPF_CLASS(code) != BPF_JMP32) goto next; if (BPF_OP(code) == BPF_EXIT || BPF_OP(code) == BPF_CALL) goto next; @@ -4031,14 +4031,49 @@ static void find_good_pkt_pointers(struct bpf_verifier_state *vstate, * 0 - branch will not be taken and fall-through to next insn * -1 - unknown. Example: "if (reg < 5)" is unknown when register value range [0,10] */ -static int is_branch_taken(struct bpf_reg_state *reg, u64 val, u8 opcode) +static int is_branch_taken(struct bpf_reg_state *reg, u64 val, u8 opcode, + bool is_jmp32) { + struct bpf_reg_state reg_lo; s64 sval; if (__is_pointer_value(false, reg)) return -1; - sval = (s64)val; + if (is_jmp32) { + reg_lo = *reg; + reg = ®_lo; + /* For JMP32, only low 32 bits are compared, coerce_reg_to_size + * could truncate high bits and update umin/umax according to + * information of low bits. + */ + coerce_reg_to_size(reg, 4); + /* smin/smax need special handling. For example, after coerce, + * if smin_value is 0x00000000ffffffffLL, the value is -1 when + * used as operand to JMP32. It is a negative number from s32's + * point of view, while it is a positive number when seen as + * s64. The smin/smax are kept as s64, therefore, when used with + * JMP32, they need to be transformed into s32, then sign + * extended back to s64. + * + * Also, smin/smax were copied from umin/umax. If umin/umax has + * different sign bit, then min/max relationship doesn't + * maintain after casting into s32, for this case, set smin/smax + * to safest range. + */ + if ((reg->umax_value ^ reg->umin_value) & + (1ULL << 31)) { + reg->smin_value = S32_MIN; + reg->smax_value = S32_MAX; + } + reg->smin_value = (s64)(s32)reg->smin_value; + reg->smax_value = (s64)(s32)reg->smax_value; + + val = (u32)val; + sval = (s64)(s32)val; + } else { + sval = (s64)val; + } switch (opcode) { case BPF_JEQ: @@ -4108,6 +4143,29 @@ static int is_branch_taken(struct bpf_reg_state *reg, u64 val, u8 opcode) return -1; } +/* Generate min value of the high 32-bit from TNUM info. */ +static u64 gen_hi_min(struct tnum var) +{ + return var.value & ~0xffffffffULL; +} + +/* Generate max value of the high 32-bit from TNUM info. */ +static u64 gen_hi_max(struct tnum var) +{ + return (var.value | var.mask) & ~0xffffffffULL; +} + +/* Return true if VAL is compared with a s64 sign extended from s32, and they + * are with the same signedness. + */ +static bool cmp_val_with_extended_s64(s64 sval, struct bpf_reg_state *reg) +{ + return ((s32)sval >= 0 && + reg->smin_value >= 0 && reg->smax_value <= S32_MAX) || + ((s32)sval < 0 && + reg->smax_value <= 0 && reg->smin_value >= S32_MIN); +} + /* Adjusts the register min/max values in the case that the dst_reg is the * variable register that we are working on, and src_reg is a constant or we're * simply doing a BPF_K check. @@ -4115,7 +4173,7 @@ static int is_branch_taken(struct bpf_reg_state *reg, u64 val, u8 opcode) */ static void reg_set_min_max(struct bpf_reg_state *true_reg, struct bpf_reg_state *false_reg, u64 val, - u8 opcode) + u8 opcode, bool is_jmp32) { s64 sval; @@ -4128,7 +4186,8 @@ static void reg_set_min_max(struct bpf_reg_state *true_reg, if (__is_pointer_value(false, false_reg)) return; - sval = (s64)val; + val = is_jmp32 ? (u32)val : val; + sval = is_jmp32 ? (s64)(s32)val : (s64)val; switch (opcode) { case BPF_JEQ: @@ -4141,7 +4200,15 @@ static void reg_set_min_max(struct bpf_reg_state *true_reg, * if it is true we know the value for sure. Likewise for * BPF_JNE. */ - __mark_reg_known(reg, val); + if (is_jmp32) { + u64 old_v = reg->var_off.value; + u64 hi_mask = ~0xffffffffULL; + + reg->var_off.value = (old_v & hi_mask) | val; + reg->var_off.mask &= hi_mask; + } else { + __mark_reg_known(reg, val); + } break; } case BPF_JSET: @@ -4157,6 +4224,10 @@ static void reg_set_min_max(struct bpf_reg_state *true_reg, u64 false_umax = opcode == BPF_JGT ? val : val - 1; u64 true_umin = opcode == BPF_JGT ? val + 1 : val; + if (is_jmp32) { + false_umax += gen_hi_max(false_reg->var_off); + true_umin += gen_hi_min(true_reg->var_off); + } false_reg->umax_value = min(false_reg->umax_value, false_umax); true_reg->umin_value = max(true_reg->umin_value, true_umin); break; @@ -4167,6 +4238,11 @@ static void reg_set_min_max(struct bpf_reg_state *true_reg, s64 false_smax = opcode == BPF_JSGT ? sval : sval - 1; s64 true_smin = opcode == BPF_JSGT ? sval + 1 : sval; + /* If the full s64 was not sign-extended from s32 then don't + * deduct further info. + */ + if (is_jmp32 && !cmp_val_with_extended_s64(sval, false_reg)) + break; false_reg->smax_value = min(false_reg->smax_value, false_smax); true_reg->smin_value = max(true_reg->smin_value, true_smin); break; @@ -4177,6 +4253,10 @@ static void reg_set_min_max(struct bpf_reg_state *true_reg, u64 false_umin = opcode == BPF_JLT ? val : val + 1; u64 true_umax = opcode == BPF_JLT ? val - 1 : val; + if (is_jmp32) { + false_umin += gen_hi_min(false_reg->var_off); + true_umax += gen_hi_max(true_reg->var_off); + } false_reg->umin_value = max(false_reg->umin_value, false_umin); true_reg->umax_value = min(true_reg->umax_value, true_umax); break; @@ -4187,6 +4267,8 @@ static void reg_set_min_max(struct bpf_reg_state *true_reg, s64 false_smin = opcode == BPF_JSLT ? sval : sval + 1; s64 true_smax = opcode == BPF_JSLT ? sval - 1 : sval; + if (is_jmp32 && !cmp_val_with_extended_s64(sval, false_reg)) + break; false_reg->smin_value = max(false_reg->smin_value, false_smin); true_reg->smax_value = min(true_reg->smax_value, true_smax); break; @@ -4213,14 +4295,15 @@ static void reg_set_min_max(struct bpf_reg_state *true_reg, */ static void reg_set_min_max_inv(struct bpf_reg_state *true_reg, struct bpf_reg_state *false_reg, u64 val, - u8 opcode) + u8 opcode, bool is_jmp32) { s64 sval; if (__is_pointer_value(false, false_reg)) return; - sval = (s64)val; + val = is_jmp32 ? (u32)val : val; + sval = is_jmp32 ? (s64)(s32)val : (s64)val; switch (opcode) { case BPF_JEQ: @@ -4229,7 +4312,15 @@ static void reg_set_min_max_inv(struct bpf_reg_state *true_reg, struct bpf_reg_state *reg = opcode == BPF_JEQ ? true_reg : false_reg; - __mark_reg_known(reg, val); + if (is_jmp32) { + u64 old_v = reg->var_off.value; + u64 hi_mask = ~0xffffffffULL; + + reg->var_off.value = (old_v & hi_mask) | val; + reg->var_off.mask &= hi_mask; + } else { + __mark_reg_known(reg, val); + } break; } case BPF_JSET: @@ -4245,6 +4336,10 @@ static void reg_set_min_max_inv(struct bpf_reg_state *true_reg, u64 false_umin = opcode == BPF_JGT ? val : val + 1; u64 true_umax = opcode == BPF_JGT ? val - 1 : val; + if (is_jmp32) { + false_umin += gen_hi_min(false_reg->var_off); + true_umax += gen_hi_max(true_reg->var_off); + } false_reg->umin_value = max(false_reg->umin_value, false_umin); true_reg->umax_value = min(true_reg->umax_value, true_umax); break; @@ -4255,6 +4350,8 @@ static void reg_set_min_max_inv(struct bpf_reg_state *true_reg, s64 false_smin = opcode == BPF_JSGT ? sval : sval + 1; s64 true_smax = opcode == BPF_JSGT ? sval - 1 : sval; + if (is_jmp32 && !cmp_val_with_extended_s64(sval, false_reg)) + break; false_reg->smin_value = max(false_reg->smin_value, false_smin); true_reg->smax_value = min(true_reg->smax_value, true_smax); break; @@ -4265,6 +4362,10 @@ static void reg_set_min_max_inv(struct bpf_reg_state *true_reg, u64 false_umax = opcode == BPF_JLT ? val : val - 1; u64 true_umin = opcode == BPF_JLT ? val + 1 : val; + if (is_jmp32) { + false_umax += gen_hi_max(false_reg->var_off); + true_umin += gen_hi_min(true_reg->var_off); + } false_reg->umax_value = min(false_reg->umax_value, false_umax); true_reg->umin_value = max(true_reg->umin_value, true_umin); break; @@ -4275,6 +4376,8 @@ static void reg_set_min_max_inv(struct bpf_reg_state *true_reg, s64 false_smax = opcode == BPF_JSLT ? sval : sval - 1; s64 true_smin = opcode == BPF_JSLT ? sval + 1 : sval; + if (is_jmp32 && !cmp_val_with_extended_s64(sval, false_reg)) + break; false_reg->smax_value = min(false_reg->smax_value, false_smax); true_reg->smin_value = max(true_reg->smin_value, true_smin); break; @@ -4416,6 +4519,10 @@ static bool try_match_pkt_pointers(const struct bpf_insn *insn, if (BPF_SRC(insn->code) != BPF_X) return false; + /* Pointers are always 64-bit. */ + if (BPF_CLASS(insn->code) == BPF_JMP32) + return false; + switch (BPF_OP(insn->code)) { case BPF_JGT: if ((dst_reg->type == PTR_TO_PACKET && @@ -4508,16 +4615,18 @@ static int check_cond_jmp_op(struct bpf_verifier_env *env, struct bpf_reg_state *regs = this_branch->frame[this_branch->curframe]->regs; struct bpf_reg_state *dst_reg, *other_branch_regs; u8 opcode = BPF_OP(insn->code); + bool is_jmp32; int err; - if (opcode > BPF_JSLE) { - verbose(env, "invalid BPF_JMP opcode %x\n", opcode); + /* Only conditional jumps are expected to reach here. */ + if (opcode == BPF_JA || opcode > BPF_JSLE) { + verbose(env, "invalid BPF_JMP/JMP32 opcode %x\n", opcode); return -EINVAL; } if (BPF_SRC(insn->code) == BPF_X) { if (insn->imm != 0) { - verbose(env, "BPF_JMP uses reserved fields\n"); + verbose(env, "BPF_JMP/JMP32 uses reserved fields\n"); return -EINVAL; } @@ -4533,7 +4642,7 @@ static int check_cond_jmp_op(struct bpf_verifier_env *env, } } else { if (insn->src_reg != BPF_REG_0) { - verbose(env, "BPF_JMP uses reserved fields\n"); + verbose(env, "BPF_JMP/JMP32 uses reserved fields\n"); return -EINVAL; } } @@ -4544,9 +4653,11 @@ static int check_cond_jmp_op(struct bpf_verifier_env *env, return err; dst_reg = ®s[insn->dst_reg]; + is_jmp32 = BPF_CLASS(insn->code) == BPF_JMP32; if (BPF_SRC(insn->code) == BPF_K) { - int pred = is_branch_taken(dst_reg, insn->imm, opcode); + int pred = is_branch_taken(dst_reg, insn->imm, opcode, + is_jmp32); if (pred == 1) { /* only follow the goto, ignore fall-through */ @@ -4574,30 +4685,51 @@ static int check_cond_jmp_op(struct bpf_verifier_env *env, * comparable. */ if (BPF_SRC(insn->code) == BPF_X) { + struct bpf_reg_state *src_reg = ®s[insn->src_reg]; + struct bpf_reg_state lo_reg0 = *dst_reg; + struct bpf_reg_state lo_reg1 = *src_reg; + struct bpf_reg_state *src_lo, *dst_lo; + + dst_lo = &lo_reg0; + src_lo = &lo_reg1; + coerce_reg_to_size(dst_lo, 4); + coerce_reg_to_size(src_lo, 4); + if (dst_reg->type == SCALAR_VALUE && - regs[insn->src_reg].type == SCALAR_VALUE) { - if (tnum_is_const(regs[insn->src_reg].var_off)) + src_reg->type == SCALAR_VALUE) { + if (tnum_is_const(src_reg->var_off) || + (is_jmp32 && tnum_is_const(src_lo->var_off))) reg_set_min_max(&other_branch_regs[insn->dst_reg], - dst_reg, regs[insn->src_reg].var_off.value, - opcode); - else if (tnum_is_const(dst_reg->var_off)) + dst_reg, + is_jmp32 + ? src_lo->var_off.value + : src_reg->var_off.value, + opcode, is_jmp32); + else if (tnum_is_const(dst_reg->var_off) || + (is_jmp32 && tnum_is_const(dst_lo->var_off))) reg_set_min_max_inv(&other_branch_regs[insn->src_reg], - ®s[insn->src_reg], - dst_reg->var_off.value, opcode); - else if (opcode == BPF_JEQ || opcode == BPF_JNE) + src_reg, + is_jmp32 + ? dst_lo->var_off.value + : dst_reg->var_off.value, + opcode, is_jmp32); + else if (!is_jmp32 && + (opcode == BPF_JEQ || opcode == BPF_JNE)) /* Comparing for equality, we can combine knowledge */ reg_combine_min_max(&other_branch_regs[insn->src_reg], &other_branch_regs[insn->dst_reg], - ®s[insn->src_reg], - ®s[insn->dst_reg], opcode); + src_reg, dst_reg, opcode); } } else if (dst_reg->type == SCALAR_VALUE) { reg_set_min_max(&other_branch_regs[insn->dst_reg], - dst_reg, insn->imm, opcode); + dst_reg, insn->imm, opcode, is_jmp32); } - /* detect if R == 0 where R is returned from bpf_map_lookup_elem() */ - if (BPF_SRC(insn->code) == BPF_K && + /* detect if R == 0 where R is returned from bpf_map_lookup_elem(). + * NOTE: these optimizations below are related with pointer comparison + * which will never be JMP32. + */ + if (!is_jmp32 && BPF_SRC(insn->code) == BPF_K && insn->imm == 0 && (opcode == BPF_JEQ || opcode == BPF_JNE) && reg_type_may_be_null(dst_reg->type)) { /* Mark all identical registers in each branch as either @@ -4926,7 +5058,8 @@ peek_stack: goto check_state; t = insn_stack[cur_stack - 1]; - if (BPF_CLASS(insns[t].code) == BPF_JMP) { + if (BPF_CLASS(insns[t].code) == BPF_JMP || + BPF_CLASS(insns[t].code) == BPF_JMP32) { u8 opcode = BPF_OP(insns[t].code); if (opcode == BPF_EXIT) { @@ -6082,7 +6215,7 @@ static int do_check(struct bpf_verifier_env *env) if (err) return err; - } else if (class == BPF_JMP) { + } else if (class == BPF_JMP || class == BPF_JMP32) { u8 opcode = BPF_OP(insn->code); if (opcode == BPF_CALL) { @@ -6090,7 +6223,8 @@ static int do_check(struct bpf_verifier_env *env) insn->off != 0 || (insn->src_reg != BPF_REG_0 && insn->src_reg != BPF_PSEUDO_CALL) || - insn->dst_reg != BPF_REG_0) { + insn->dst_reg != BPF_REG_0 || + class == BPF_JMP32) { verbose(env, "BPF_CALL uses reserved fields\n"); return -EINVAL; } @@ -6106,7 +6240,8 @@ static int do_check(struct bpf_verifier_env *env) if (BPF_SRC(insn->code) != BPF_K || insn->imm != 0 || insn->src_reg != BPF_REG_0 || - insn->dst_reg != BPF_REG_0) { + insn->dst_reg != BPF_REG_0 || + class == BPF_JMP32) { verbose(env, "BPF_JA uses reserved fields\n"); return -EINVAL; } @@ -6118,7 +6253,8 @@ static int do_check(struct bpf_verifier_env *env) if (BPF_SRC(insn->code) != BPF_K || insn->imm != 0 || insn->src_reg != BPF_REG_0 || - insn->dst_reg != BPF_REG_0) { + insn->dst_reg != BPF_REG_0 || + class == BPF_JMP32) { verbose(env, "BPF_EXIT uses reserved fields\n"); return -EINVAL; } @@ -6635,6 +6771,9 @@ static bool insn_is_cond_jump(u8 code) { u8 op; + if (BPF_CLASS(code) == BPF_JMP32) + return true; + if (BPF_CLASS(code) != BPF_JMP) return false; -- cgit v1.3-7-g2ca7 From 56cbd82ef0b3dc47a16beeebc8d9a9a9269093dc Mon Sep 17 00:00:00 2001 From: Jiong Wang Date: Sat, 26 Jan 2019 12:26:02 -0500 Subject: bpf: disassembler support JMP32 This patch teaches disassembler about JMP32. There are two places to update: - Class 0x6 now used by BPF_JMP32, not "unused". - BPF_JMP32 need to show comparison operands properly. The disassemble format is to add an extra "(32)" before the operands if it is a sub-register. A better disassemble format for both JMP32 and ALU32 just show the register prefix as "w" instead of "r", this is the format using by LLVM assembler. Reviewed-by: Jakub Kicinski Signed-off-by: Jiong Wang Signed-off-by: Alexei Starovoitov --- kernel/bpf/disasm.c | 34 +++++++++++++++++++--------------- 1 file changed, 19 insertions(+), 15 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/disasm.c b/kernel/bpf/disasm.c index d6b76377cb6e..de73f55e42fd 100644 --- a/kernel/bpf/disasm.c +++ b/kernel/bpf/disasm.c @@ -67,7 +67,7 @@ const char *const bpf_class_string[8] = { [BPF_STX] = "stx", [BPF_ALU] = "alu", [BPF_JMP] = "jmp", - [BPF_RET] = "BUG", + [BPF_JMP32] = "jmp32", [BPF_ALU64] = "alu64", }; @@ -136,23 +136,22 @@ void print_bpf_insn(const struct bpf_insn_cbs *cbs, else print_bpf_end_insn(verbose, cbs->private_data, insn); } else if (BPF_OP(insn->code) == BPF_NEG) { - verbose(cbs->private_data, "(%02x) r%d = %s-r%d\n", - insn->code, insn->dst_reg, - class == BPF_ALU ? "(u32) " : "", + verbose(cbs->private_data, "(%02x) %c%d = -%c%d\n", + insn->code, class == BPF_ALU ? 'w' : 'r', + insn->dst_reg, class == BPF_ALU ? 'w' : 'r', insn->dst_reg); } else if (BPF_SRC(insn->code) == BPF_X) { - verbose(cbs->private_data, "(%02x) %sr%d %s %sr%d\n", - insn->code, class == BPF_ALU ? "(u32) " : "", + verbose(cbs->private_data, "(%02x) %c%d %s %c%d\n", + insn->code, class == BPF_ALU ? 'w' : 'r', insn->dst_reg, bpf_alu_string[BPF_OP(insn->code) >> 4], - class == BPF_ALU ? "(u32) " : "", + class == BPF_ALU ? 'w' : 'r', insn->src_reg); } else { - verbose(cbs->private_data, "(%02x) %sr%d %s %s%d\n", - insn->code, class == BPF_ALU ? "(u32) " : "", + verbose(cbs->private_data, "(%02x) %c%d %s %d\n", + insn->code, class == BPF_ALU ? 'w' : 'r', insn->dst_reg, bpf_alu_string[BPF_OP(insn->code) >> 4], - class == BPF_ALU ? "(u32) " : "", insn->imm); } } else if (class == BPF_STX) { @@ -220,7 +219,7 @@ void print_bpf_insn(const struct bpf_insn_cbs *cbs, verbose(cbs->private_data, "BUG_ld_%02x\n", insn->code); return; } - } else if (class == BPF_JMP) { + } else if (class == BPF_JMP32 || class == BPF_JMP) { u8 opcode = BPF_OP(insn->code); if (opcode == BPF_CALL) { @@ -244,13 +243,18 @@ void print_bpf_insn(const struct bpf_insn_cbs *cbs, } else if (insn->code == (BPF_JMP | BPF_EXIT)) { verbose(cbs->private_data, "(%02x) exit\n", insn->code); } else if (BPF_SRC(insn->code) == BPF_X) { - verbose(cbs->private_data, "(%02x) if r%d %s r%d goto pc%+d\n", - insn->code, insn->dst_reg, + verbose(cbs->private_data, + "(%02x) if %c%d %s %c%d goto pc%+d\n", + insn->code, class == BPF_JMP32 ? 'w' : 'r', + insn->dst_reg, bpf_jmp_string[BPF_OP(insn->code) >> 4], + class == BPF_JMP32 ? 'w' : 'r', insn->src_reg, insn->off); } else { - verbose(cbs->private_data, "(%02x) if r%d %s 0x%x goto pc%+d\n", - insn->code, insn->dst_reg, + verbose(cbs->private_data, + "(%02x) if %c%d %s 0x%x goto pc%+d\n", + insn->code, class == BPF_JMP32 ? 'w' : 'r', + insn->dst_reg, bpf_jmp_string[BPF_OP(insn->code) >> 4], insn->imm, insn->off); } -- cgit v1.3-7-g2ca7 From 503a8865a47752d0ac2ff642f07e96e8b2103178 Mon Sep 17 00:00:00 2001 From: Jiong Wang Date: Sat, 26 Jan 2019 12:26:04 -0500 Subject: bpf: interpreter support for JMP32 This patch implements interpreting new JMP32 instructions. Reviewed-by: Jakub Kicinski Signed-off-by: Jiong Wang Signed-off-by: Alexei Starovoitov --- kernel/bpf/core.c | 197 +++++++++++++++++------------------------------------- 1 file changed, 63 insertions(+), 134 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index 1e443ba97310..bba11c2565ee 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -1145,6 +1145,31 @@ EXPORT_SYMBOL_GPL(__bpf_call_base); INSN_2(JMP, CALL), \ /* Exit instruction. */ \ INSN_2(JMP, EXIT), \ + /* 32-bit Jump instructions. */ \ + /* Register based. */ \ + INSN_3(JMP32, JEQ, X), \ + INSN_3(JMP32, JNE, X), \ + INSN_3(JMP32, JGT, X), \ + INSN_3(JMP32, JLT, X), \ + INSN_3(JMP32, JGE, X), \ + INSN_3(JMP32, JLE, X), \ + INSN_3(JMP32, JSGT, X), \ + INSN_3(JMP32, JSLT, X), \ + INSN_3(JMP32, JSGE, X), \ + INSN_3(JMP32, JSLE, X), \ + INSN_3(JMP32, JSET, X), \ + /* Immediate based. */ \ + INSN_3(JMP32, JEQ, K), \ + INSN_3(JMP32, JNE, K), \ + INSN_3(JMP32, JGT, K), \ + INSN_3(JMP32, JLT, K), \ + INSN_3(JMP32, JGE, K), \ + INSN_3(JMP32, JLE, K), \ + INSN_3(JMP32, JSGT, K), \ + INSN_3(JMP32, JSLT, K), \ + INSN_3(JMP32, JSGE, K), \ + INSN_3(JMP32, JSLE, K), \ + INSN_3(JMP32, JSET, K), \ /* Jump instructions. */ \ /* Register based. */ \ INSN_3(JMP, JEQ, X), \ @@ -1405,145 +1430,49 @@ select_insn: out: CONT; } - /* JMP */ JMP_JA: insn += insn->off; CONT; - JMP_JEQ_X: - if (DST == SRC) { - insn += insn->off; - CONT_JMP; - } - CONT; - JMP_JEQ_K: - if (DST == IMM) { - insn += insn->off; - CONT_JMP; - } - CONT; - JMP_JNE_X: - if (DST != SRC) { - insn += insn->off; - CONT_JMP; - } - CONT; - JMP_JNE_K: - if (DST != IMM) { - insn += insn->off; - CONT_JMP; - } - CONT; - JMP_JGT_X: - if (DST > SRC) { - insn += insn->off; - CONT_JMP; - } - CONT; - JMP_JGT_K: - if (DST > IMM) { - insn += insn->off; - CONT_JMP; - } - CONT; - JMP_JLT_X: - if (DST < SRC) { - insn += insn->off; - CONT_JMP; - } - CONT; - JMP_JLT_K: - if (DST < IMM) { - insn += insn->off; - CONT_JMP; - } - CONT; - JMP_JGE_X: - if (DST >= SRC) { - insn += insn->off; - CONT_JMP; - } - CONT; - JMP_JGE_K: - if (DST >= IMM) { - insn += insn->off; - CONT_JMP; - } - CONT; - JMP_JLE_X: - if (DST <= SRC) { - insn += insn->off; - CONT_JMP; - } - CONT; - JMP_JLE_K: - if (DST <= IMM) { - insn += insn->off; - CONT_JMP; - } - CONT; - JMP_JSGT_X: - if (((s64) DST) > ((s64) SRC)) { - insn += insn->off; - CONT_JMP; - } - CONT; - JMP_JSGT_K: - if (((s64) DST) > ((s64) IMM)) { - insn += insn->off; - CONT_JMP; - } - CONT; - JMP_JSLT_X: - if (((s64) DST) < ((s64) SRC)) { - insn += insn->off; - CONT_JMP; - } - CONT; - JMP_JSLT_K: - if (((s64) DST) < ((s64) IMM)) { - insn += insn->off; - CONT_JMP; - } - CONT; - JMP_JSGE_X: - if (((s64) DST) >= ((s64) SRC)) { - insn += insn->off; - CONT_JMP; - } - CONT; - JMP_JSGE_K: - if (((s64) DST) >= ((s64) IMM)) { - insn += insn->off; - CONT_JMP; - } - CONT; - JMP_JSLE_X: - if (((s64) DST) <= ((s64) SRC)) { - insn += insn->off; - CONT_JMP; - } - CONT; - JMP_JSLE_K: - if (((s64) DST) <= ((s64) IMM)) { - insn += insn->off; - CONT_JMP; - } - CONT; - JMP_JSET_X: - if (DST & SRC) { - insn += insn->off; - CONT_JMP; - } - CONT; - JMP_JSET_K: - if (DST & IMM) { - insn += insn->off; - CONT_JMP; - } - CONT; JMP_EXIT: return BPF_R0; - + /* JMP */ +#define COND_JMP(SIGN, OPCODE, CMP_OP) \ + JMP_##OPCODE##_X: \ + if ((SIGN##64) DST CMP_OP (SIGN##64) SRC) { \ + insn += insn->off; \ + CONT_JMP; \ + } \ + CONT; \ + JMP32_##OPCODE##_X: \ + if ((SIGN##32) DST CMP_OP (SIGN##32) SRC) { \ + insn += insn->off; \ + CONT_JMP; \ + } \ + CONT; \ + JMP_##OPCODE##_K: \ + if ((SIGN##64) DST CMP_OP (SIGN##64) IMM) { \ + insn += insn->off; \ + CONT_JMP; \ + } \ + CONT; \ + JMP32_##OPCODE##_K: \ + if ((SIGN##32) DST CMP_OP (SIGN##32) IMM) { \ + insn += insn->off; \ + CONT_JMP; \ + } \ + CONT; + COND_JMP(u, JEQ, ==) + COND_JMP(u, JNE, !=) + COND_JMP(u, JGT, >) + COND_JMP(u, JLT, <) + COND_JMP(u, JGE, >=) + COND_JMP(u, JLE, <=) + COND_JMP(u, JSET, &) + COND_JMP(s, JSGT, >) + COND_JMP(s, JSLT, <) + COND_JMP(s, JSGE, >=) + COND_JMP(s, JSLE, <=) +#undef COND_JMP /* STX and ST and LDX*/ #define LDST(SIZEOP, SIZE) \ STX_MEM_##SIZEOP: \ -- cgit v1.3-7-g2ca7 From a7b76c8857692b0fce063b94ed83da11c396d341 Mon Sep 17 00:00:00 2001 From: Jiong Wang Date: Sat, 26 Jan 2019 12:26:05 -0500 Subject: bpf: JIT blinds support JMP32 This patch adds JIT blinds support for JMP32. Like BPF_JMP_REG/IMM, JMP32 version are needed for building raw bpf insn. They are added to both include/linux/filter.h and tools/include/linux/filter.h. Reviewed-by: Jakub Kicinski Signed-off-by: Jiong Wang Signed-off-by: Alexei Starovoitov --- include/linux/filter.h | 20 ++++++++++++++++++++ kernel/bpf/core.c | 21 +++++++++++++++++++++ tools/include/linux/filter.h | 20 ++++++++++++++++++++ 3 files changed, 61 insertions(+) (limited to 'kernel') diff --git a/include/linux/filter.h b/include/linux/filter.h index be9af6b4a9e4..e4b473f85b46 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -277,6 +277,26 @@ struct sock_reuseport; .off = OFF, \ .imm = IMM }) +/* Like BPF_JMP_REG, but with 32-bit wide operands for comparison. */ + +#define BPF_JMP32_REG(OP, DST, SRC, OFF) \ + ((struct bpf_insn) { \ + .code = BPF_JMP32 | BPF_OP(OP) | BPF_X, \ + .dst_reg = DST, \ + .src_reg = SRC, \ + .off = OFF, \ + .imm = 0 }) + +/* Like BPF_JMP_IMM, but with 32-bit wide operands for comparison. */ + +#define BPF_JMP32_IMM(OP, DST, IMM, OFF) \ + ((struct bpf_insn) { \ + .code = BPF_JMP32 | BPF_OP(OP) | BPF_K, \ + .dst_reg = DST, \ + .src_reg = 0, \ + .off = OFF, \ + .imm = IMM }) + /* Unconditional jumps, goto pc + off16 */ #define BPF_JMP_A(OFF) \ diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index bba11c2565ee..a7bcb23bee84 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -949,6 +949,27 @@ static int bpf_jit_blind_insn(const struct bpf_insn *from, *to++ = BPF_JMP_REG(from->code, from->dst_reg, BPF_REG_AX, off); break; + case BPF_JMP32 | BPF_JEQ | BPF_K: + case BPF_JMP32 | BPF_JNE | BPF_K: + case BPF_JMP32 | BPF_JGT | BPF_K: + case BPF_JMP32 | BPF_JLT | BPF_K: + case BPF_JMP32 | BPF_JGE | BPF_K: + case BPF_JMP32 | BPF_JLE | BPF_K: + case BPF_JMP32 | BPF_JSGT | BPF_K: + case BPF_JMP32 | BPF_JSLT | BPF_K: + case BPF_JMP32 | BPF_JSGE | BPF_K: + case BPF_JMP32 | BPF_JSLE | BPF_K: + case BPF_JMP32 | BPF_JSET | BPF_K: + /* Accommodate for extra offset in case of a backjump. */ + off = from->off; + if (off < 0) + off -= 2; + *to++ = BPF_ALU32_IMM(BPF_MOV, BPF_REG_AX, imm_rnd ^ from->imm); + *to++ = BPF_ALU32_IMM(BPF_XOR, BPF_REG_AX, imm_rnd); + *to++ = BPF_JMP32_REG(from->code, from->dst_reg, BPF_REG_AX, + off); + break; + case BPF_LD | BPF_IMM | BPF_DW: *to++ = BPF_ALU64_IMM(BPF_MOV, BPF_REG_AX, imm_rnd ^ aux[1].imm); *to++ = BPF_ALU64_IMM(BPF_XOR, BPF_REG_AX, imm_rnd); diff --git a/tools/include/linux/filter.h b/tools/include/linux/filter.h index af55acf73e75..cce0b02c0e28 100644 --- a/tools/include/linux/filter.h +++ b/tools/include/linux/filter.h @@ -199,6 +199,16 @@ .off = OFF, \ .imm = 0 }) +/* Like BPF_JMP_REG, but with 32-bit wide operands for comparison. */ + +#define BPF_JMP32_REG(OP, DST, SRC, OFF) \ + ((struct bpf_insn) { \ + .code = BPF_JMP32 | BPF_OP(OP) | BPF_X, \ + .dst_reg = DST, \ + .src_reg = SRC, \ + .off = OFF, \ + .imm = 0 }) + /* Conditional jumps against immediates, if (dst_reg 'op' imm32) goto pc + off16 */ #define BPF_JMP_IMM(OP, DST, IMM, OFF) \ @@ -209,6 +219,16 @@ .off = OFF, \ .imm = IMM }) +/* Like BPF_JMP_IMM, but with 32-bit wide operands for comparison. */ + +#define BPF_JMP32_IMM(OP, DST, IMM, OFF) \ + ((struct bpf_insn) { \ + .code = BPF_JMP32 | BPF_OP(OP) | BPF_K, \ + .dst_reg = DST, \ + .src_reg = 0, \ + .off = OFF, \ + .imm = IMM }) + /* Unconditional jumps, goto pc + off16 */ #define BPF_JMP_A(OFF) \ -- cgit v1.3-7-g2ca7 From 8d5d0cfb63cbcb4005e19a332b31d687b1d01e58 Mon Sep 17 00:00:00 2001 From: Quentin Perret Date: Mon, 3 Dec 2018 09:56:23 +0000 Subject: sched/topology: Introduce a sysctl for Energy Aware Scheduling In its current state, Energy Aware Scheduling (EAS) starts automatically on asymmetric platforms having an Energy Model (EM). However, there are users who want to have an EM (for thermal management for example), but don't want EAS with it. In order to let users disable EAS explicitly, introduce a new sysctl called 'sched_energy_aware'. It is enabled by default so that EAS can start automatically on platforms where it makes sense. Flipping it to 0 rebuilds the scheduling domains and disables EAS. Signed-off-by: Quentin Perret Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: adharmap@codeaurora.org Cc: chris.redpath@arm.com Cc: currojerez@riseup.net Cc: dietmar.eggemann@arm.com Cc: edubezval@gmail.com Cc: gregkh@linuxfoundation.org Cc: javi.merino@kernel.org Cc: joel@joelfernandes.org Cc: juri.lelli@redhat.com Cc: morten.rasmussen@arm.com Cc: patrick.bellasi@arm.com Cc: pkondeti@codeaurora.org Cc: rjw@rjwysocki.net Cc: skannan@codeaurora.org Cc: smuckle@google.com Cc: srinivas.pandruvada@linux.intel.com Cc: thara.gopinath@linaro.org Cc: tkjos@google.com Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Cc: viresh.kumar@linaro.org Link: https://lkml.kernel.org/r/20181203095628.11858-11-quentin.perret@arm.com Signed-off-by: Ingo Molnar --- Documentation/sysctl/kernel.txt | 12 ++++++++++++ include/linux/sched/sysctl.h | 7 +++++++ kernel/sched/topology.c | 29 +++++++++++++++++++++++++++++ kernel/sysctl.c | 11 +++++++++++ 4 files changed, 59 insertions(+) (limited to 'kernel') diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt index c0527d8a468a..379063e58326 100644 --- a/Documentation/sysctl/kernel.txt +++ b/Documentation/sysctl/kernel.txt @@ -79,6 +79,7 @@ show up in /proc/sys/kernel: - reboot-cmd [ SPARC only ] - rtsig-max - rtsig-nr +- sched_energy_aware - seccomp/ ==> Documentation/userspace-api/seccomp_filter.rst - sem - sem_next_id [ sysv ipc ] @@ -890,6 +891,17 @@ rtsig-nr shows the number of RT signals currently queued. ============================================================== +sched_energy_aware: + +Enables/disables Energy Aware Scheduling (EAS). EAS starts +automatically on platforms where it can run (that is, +platforms with asymmetric CPU topologies and having an Energy +Model available). If your platform happens to meet the +requirements for EAS but you do not want to use it, change +this value to 0. + +============================================================== + sched_schedstats: Enables/disables scheduler statistics. Enabling this feature diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h index a9c32daeb9d8..99ce6d728df7 100644 --- a/include/linux/sched/sysctl.h +++ b/include/linux/sched/sysctl.h @@ -83,4 +83,11 @@ extern int sysctl_schedstats(struct ctl_table *table, int write, void __user *buffer, size_t *lenp, loff_t *ppos); +#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) +extern unsigned int sysctl_sched_energy_aware; +extern int sched_energy_aware_handler(struct ctl_table *table, int write, + void __user *buffer, size_t *lenp, + loff_t *ppos); +#endif + #endif /* _LINUX_SCHED_SYSCTL_H */ diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index 3f35ba1d8fde..50c3fc316c54 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -203,9 +203,35 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent) DEFINE_STATIC_KEY_FALSE(sched_energy_present); #if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) +unsigned int sysctl_sched_energy_aware = 1; DEFINE_MUTEX(sched_energy_mutex); bool sched_energy_update; +#ifdef CONFIG_PROC_SYSCTL +int sched_energy_aware_handler(struct ctl_table *table, int write, + void __user *buffer, size_t *lenp, loff_t *ppos) +{ + int ret, state; + + if (write && !capable(CAP_SYS_ADMIN)) + return -EPERM; + + ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos); + if (!ret && write) { + state = static_branch_unlikely(&sched_energy_present); + if (state != sysctl_sched_energy_aware) { + mutex_lock(&sched_energy_mutex); + sched_energy_update = 1; + rebuild_sched_domains(); + sched_energy_update = 0; + mutex_unlock(&sched_energy_mutex); + } + } + + return ret; +} +#endif + static void free_pd(struct perf_domain *pd) { struct perf_domain *tmp; @@ -322,6 +348,9 @@ static bool build_perf_domains(const struct cpumask *cpu_map) struct cpufreq_policy *policy; struct cpufreq_governor *gov; + if (!sysctl_sched_energy_aware) + goto free; + /* EAS is enabled for asymmetric CPU capacity topologies. */ if (!per_cpu(sd_asym_cpucapacity, cpu)) { if (sched_debug()) { diff --git a/kernel/sysctl.c b/kernel/sysctl.c index ba4d9e85feb8..987ae08147bf 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -467,6 +467,17 @@ static struct ctl_table kern_table[] = { .extra1 = &one, }, #endif +#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) + { + .procname = "sched_energy_aware", + .data = &sysctl_sched_energy_aware, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = sched_energy_aware_handler, + .extra1 = &zero, + .extra2 = &one, + }, +#endif #ifdef CONFIG_PROVE_LOCKING { .procname = "prove_locking", -- cgit v1.3-7-g2ca7 From f8a696f25ba09a1821dc6ca3db56f41c264fb896 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Wed, 5 Dec 2018 11:23:56 +0100 Subject: sched/core: Give DCE a fighting chance All that fancy new Energy-Aware scheduling foo is hidden behind a static_key, which is awesome if you have the stuff enabled in your config. However, when you lack all the prerequisites it doesn't make any sense to pretend we'll ever actually run this, so provide a little more clue to the compiler so it can more agressively delete the code. text data bss dec hex filename 50297 976 96 51369 c8a9 defconfig-build/kernel/sched/fair.o 49227 944 96 50267 c45b defconfig-build/kernel/sched/fair.o Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 4 ++-- kernel/sched/sched.h | 18 +++++++++++++----- kernel/sched/topology.c | 2 +- 3 files changed, 16 insertions(+), 8 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 50aa2aba69bd..b1374fbddd0d 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6607,7 +6607,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f if (sd_flag & SD_BALANCE_WAKE) { record_wakee(p); - if (static_branch_unlikely(&sched_energy_present)) { + if (sched_energy_enabled()) { new_cpu = find_energy_efficient_cpu(p, prev_cpu); if (new_cpu >= 0) return new_cpu; @@ -8635,7 +8635,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env) */ update_sd_lb_stats(env, &sds); - if (static_branch_unlikely(&sched_energy_present)) { + if (sched_energy_enabled()) { struct root_domain *rd = env->dst_rq->rd; if (rcu_dereference(rd->pd) && !READ_ONCE(rd->overutilized)) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index d04530bf251f..d27c1a5d4e25 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2299,11 +2299,19 @@ unsigned long scale_irq_capacity(unsigned long util, unsigned long irq, unsigned #endif #if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) + #define perf_domain_span(pd) (to_cpumask(((pd)->em_pd->cpus))) -#else + +DECLARE_STATIC_KEY_FALSE(sched_energy_present); + +static inline bool sched_energy_enabled(void) +{ + return static_branch_unlikely(&sched_energy_present); +} + +#else /* ! (CONFIG_ENERGY_MODEL && CONFIG_CPU_FREQ_GOV_SCHEDUTIL) */ + #define perf_domain_span(pd) NULL -#endif +static inline bool sched_energy_enabled(void) { return false; } -#ifdef CONFIG_SMP -extern struct static_key_false sched_energy_present; -#endif +#endif /* CONFIG_ENERGY_MODEL && CONFIG_CPU_FREQ_GOV_SCHEDUTIL */ diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index 50c3fc316c54..4ae9403420ed 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -201,8 +201,8 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent) return 1; } -DEFINE_STATIC_KEY_FALSE(sched_energy_present); #if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) +DEFINE_STATIC_KEY_FALSE(sched_energy_present); unsigned int sysctl_sched_energy_aware = 1; DEFINE_MUTEX(sched_energy_mutex); bool sched_energy_update; -- cgit v1.3-7-g2ca7 From c0ad4aa4d8416a39ad262a2bd68b30acd951bf0e Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Mon, 7 Jan 2019 13:52:31 +0100 Subject: sched/fair: Robustify CFS-bandwidth timer locking Traditionally hrtimer callbacks were run with IRQs disabled, but with the introduction of HRTIMER_MODE_SOFT it is possible they run from SoftIRQ context, which does _NOT_ have IRQs disabled. Allow for the CFS bandwidth timers (period_timer and slack_timer) to be ran from SoftIRQ context; this entails removing the assumption that IRQs are already disabled from the locking. While mainline doesn't strictly need this, -RT forces all timers not explicitly marked with MODE_HARD into MODE_SOFT and trips over this. And marking these timers as MODE_HARD doesn't make sense as they're not required for RT operation and can potentially be quite expensive. Reported-by: Tom Putzeys Tested-by: Mike Galbraith Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Sebastian Andrzej Siewior Cc: Thomas Gleixner Link: https://lkml.kernel.org/r/20190107125231.GE14122@hirez.programming.kicks-ass.net Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 30 ++++++++++++++++-------------- 1 file changed, 16 insertions(+), 14 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index b1374fbddd0d..3b61e19b504a 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4565,7 +4565,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, struct rq *rq = rq_of(cfs_rq); struct rq_flags rf; - rq_lock(rq, &rf); + rq_lock_irqsave(rq, &rf); if (!cfs_rq_throttled(cfs_rq)) goto next; @@ -4582,7 +4582,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, unthrottle_cfs_rq(cfs_rq); next: - rq_unlock(rq, &rf); + rq_unlock_irqrestore(rq, &rf); if (!remaining) break; @@ -4598,7 +4598,7 @@ next: * period the timer is deactivated until scheduling resumes; cfs_b->idle is * used to track this state. */ -static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun) +static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags) { u64 runtime, runtime_expires; int throttled; @@ -4640,11 +4640,11 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun) while (throttled && cfs_b->runtime > 0 && !cfs_b->distribute_running) { runtime = cfs_b->runtime; cfs_b->distribute_running = 1; - raw_spin_unlock(&cfs_b->lock); + raw_spin_unlock_irqrestore(&cfs_b->lock, flags); /* we can't nest cfs_b->lock while distributing bandwidth */ runtime = distribute_cfs_runtime(cfs_b, runtime, runtime_expires); - raw_spin_lock(&cfs_b->lock); + raw_spin_lock_irqsave(&cfs_b->lock, flags); cfs_b->distribute_running = 0; throttled = !list_empty(&cfs_b->throttled_cfs_rq); @@ -4753,17 +4753,18 @@ static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b) { u64 runtime = 0, slice = sched_cfs_bandwidth_slice(); + unsigned long flags; u64 expires; /* confirm we're still not at a refresh boundary */ - raw_spin_lock(&cfs_b->lock); + raw_spin_lock_irqsave(&cfs_b->lock, flags); if (cfs_b->distribute_running) { - raw_spin_unlock(&cfs_b->lock); + raw_spin_unlock_irqrestore(&cfs_b->lock, flags); return; } if (runtime_refresh_within(cfs_b, min_bandwidth_expiration)) { - raw_spin_unlock(&cfs_b->lock); + raw_spin_unlock_irqrestore(&cfs_b->lock, flags); return; } @@ -4774,18 +4775,18 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b) if (runtime) cfs_b->distribute_running = 1; - raw_spin_unlock(&cfs_b->lock); + raw_spin_unlock_irqrestore(&cfs_b->lock, flags); if (!runtime) return; runtime = distribute_cfs_runtime(cfs_b, runtime, expires); - raw_spin_lock(&cfs_b->lock); + raw_spin_lock_irqsave(&cfs_b->lock, flags); if (expires == cfs_b->runtime_expires) lsub_positive(&cfs_b->runtime, runtime); cfs_b->distribute_running = 0; - raw_spin_unlock(&cfs_b->lock); + raw_spin_unlock_irqrestore(&cfs_b->lock, flags); } /* @@ -4863,20 +4864,21 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer) { struct cfs_bandwidth *cfs_b = container_of(timer, struct cfs_bandwidth, period_timer); + unsigned long flags; int overrun; int idle = 0; - raw_spin_lock(&cfs_b->lock); + raw_spin_lock_irqsave(&cfs_b->lock, flags); for (;;) { overrun = hrtimer_forward_now(timer, cfs_b->period); if (!overrun) break; - idle = do_sched_cfs_period_timer(cfs_b, overrun); + idle = do_sched_cfs_period_timer(cfs_b, overrun, flags); } if (idle) cfs_b->period_active = 0; - raw_spin_unlock(&cfs_b->lock); + raw_spin_unlock_irqrestore(&cfs_b->lock, flags); return idle ? HRTIMER_NORESTART : HRTIMER_RESTART; } -- cgit v1.3-7-g2ca7 From a062d16449c0d2e00d5a54123c9cfadea3f6c763 Mon Sep 17 00:00:00 2001 From: Vincent Guittot Date: Fri, 14 Dec 2018 17:01:56 +0100 Subject: sched/fair: Trigger asym_packing during idle load balance Newly idle load balancing is not always triggered when a CPU becomes idle. This prevents the scheduler from getting a chance to migrate the task for asym packing. Enable active migration during idle load balance too. Signed-off-by: Vincent Guittot Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: valentin.schneider@arm.com Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 3b61e19b504a..9693cf2ea954 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -8832,7 +8832,7 @@ static int need_active_balance(struct lb_env *env) { struct sched_domain *sd = env->sd; - if (env->idle == CPU_NEWLY_IDLE) { + if (env->idle != CPU_NOT_IDLE) { /* * ASYM_PACKING needs to force migrate tasks from busy but -- cgit v1.3-7-g2ca7 From 4ad4e481bd02c80c4b9c64e563d16d1c4442ea13 Mon Sep 17 00:00:00 2001 From: Vincent Guittot Date: Fri, 14 Dec 2018 17:01:55 +0100 Subject: sched/fair: Fix rounding bug for asym packing When check_asym_packing() is triggered, the imbalance is set to: busiest_stat.avg_load * busiest_stat.group_capacity / SCHED_CAPACITY_SCALE But busiest_stat.avg_load equals: sgs->group_load * SCHED_CAPACITY_SCALE / sgs->group_capacity These divisions can generate a rounding that will make imbalance slightly lower than the weighted load of the cfs_rq. But this is enough to skip the rq in find_busiest_queue() and prevents asym migration from happening. Directly set imbalance to busiest's sgs->group_load to remove the rounding. Signed-off-by: Vincent Guittot Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: valentin.schneider@arm.com Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 9693cf2ea954..30450901200e 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -8453,9 +8453,7 @@ static int check_asym_packing(struct lb_env *env, struct sd_lb_stats *sds) if (sched_asym_prefer(busiest_cpu, env->dst_cpu)) return 0; - env->imbalance = DIV_ROUND_CLOSEST( - sds->busiest_stat.avg_load * sds->busiest_stat.group_capacity, - SCHED_CAPACITY_SCALE); + env->imbalance = sds->busiest_stat.group_load; return 1; } -- cgit v1.3-7-g2ca7 From 46a745d905858c7f453b29cffe368ba214b106fb Mon Sep 17 00:00:00 2001 From: Vincent Guittot Date: Fri, 14 Dec 2018 17:01:57 +0100 Subject: sched/fair: Fix unnecessary increase of balance interval In case of active balancing, we increase the balance interval to cover pinned tasks cases not covered by all_pinned logic. Neverthless, the active migration triggered by asym packing should be treated as the normal unbalanced case and reset the interval to default value, otherwise active migration for asym_packing can be easily delayed for hundreds of ms because of this pinned task detection mechanism. The same happens to other conditions tested in need_active_balance() like misfit task and when the capacity of src_cpu is reduced compared to dst_cpu (see comments in need_active_balance() for details). Signed-off-by: Vincent Guittot Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: valentin.schneider@arm.com Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 40 +++++++++++++++++++++++++++------------- 1 file changed, 27 insertions(+), 13 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 30450901200e..e2ff4b69dcf6 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -8826,21 +8826,25 @@ static struct rq *find_busiest_queue(struct lb_env *env, */ #define MAX_PINNED_INTERVAL 512 -static int need_active_balance(struct lb_env *env) +static inline bool +asym_active_balance(struct lb_env *env) { - struct sched_domain *sd = env->sd; + /* + * ASYM_PACKING needs to force migrate tasks from busy but + * lower priority CPUs in order to pack all tasks in the + * highest priority CPUs. + */ + return env->idle != CPU_NOT_IDLE && (env->sd->flags & SD_ASYM_PACKING) && + sched_asym_prefer(env->dst_cpu, env->src_cpu); +} - if (env->idle != CPU_NOT_IDLE) { +static inline bool +voluntary_active_balance(struct lb_env *env) +{ + struct sched_domain *sd = env->sd; - /* - * ASYM_PACKING needs to force migrate tasks from busy but - * lower priority CPUs in order to pack all tasks in the - * highest priority CPUs. - */ - if ((sd->flags & SD_ASYM_PACKING) && - sched_asym_prefer(env->dst_cpu, env->src_cpu)) - return 1; - } + if (asym_active_balance(env)) + return 1; /* * The dst_cpu is idle and the src_cpu CPU has only 1 CFS task. @@ -8858,6 +8862,16 @@ static int need_active_balance(struct lb_env *env) if (env->src_grp_type == group_misfit_task) return 1; + return 0; +} + +static int need_active_balance(struct lb_env *env) +{ + struct sched_domain *sd = env->sd; + + if (voluntary_active_balance(env)) + return 1; + return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2); } @@ -9119,7 +9133,7 @@ more_balance: } else sd->nr_balance_failed = 0; - if (likely(!active_balance)) { + if (likely(!active_balance) || voluntary_active_balance(&env)) { /* We were unbalanced, so reset the balancing interval */ sd->balance_interval = sd->min_interval; } else { -- cgit v1.3-7-g2ca7 From 434537bbd50fefc89c1e29170bf4030ae3ec445a Mon Sep 17 00:00:00 2001 From: Greg Kroah-Hartman Date: Tue, 22 Jan 2019 16:21:49 +0100 Subject: genirq/debugfs: No need to check return value of debugfs_create functions When calling debugfs functions, there is no need to ever check the return value. The function can work or not, but the code logic should never do something different based on this. Signed-off-by: Greg Kroah-Hartman Signed-off-by: Thomas Gleixner Cc: Marc Zyngier Link: https://lkml.kernel.org/r/20190122152151.16139-50-gregkh@linuxfoundation.org --- kernel/irq/debugfs.c | 2 -- kernel/irq/irqdomain.c | 2 -- 2 files changed, 4 deletions(-) (limited to 'kernel') diff --git a/kernel/irq/debugfs.c b/kernel/irq/debugfs.c index 6f636136cccc..bbd783a83409 100644 --- a/kernel/irq/debugfs.c +++ b/kernel/irq/debugfs.c @@ -256,8 +256,6 @@ static int __init irq_debugfs_init(void) int irq; root_dir = debugfs_create_dir("irq", NULL); - if (!root_dir) - return -ENOMEM; irq_domain_debugfs_init(root_dir); diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c index 8b0be4bd6565..45c74373c7a4 100644 --- a/kernel/irq/irqdomain.c +++ b/kernel/irq/irqdomain.c @@ -1749,8 +1749,6 @@ void __init irq_domain_debugfs_init(struct dentry *root) struct irq_domain *d; domain_dir = debugfs_create_dir("domains", root); - if (!domain_dir) - return; debugfs_create_file("default", 0444, domain_dir, NULL, &irq_domain_debug_fops); -- cgit v1.3-7-g2ca7 From ae503ab04913794b12e1939caa75aa091062a1b3 Mon Sep 17 00:00:00 2001 From: Greg Kroah-Hartman Date: Tue, 22 Jan 2019 16:21:42 +0100 Subject: timekeeping/debug: No need to check return value of debugfs_create functions When calling debugfs functions, there is no need to ever check the return value. The function can work or not, but the code logic should never do something different based on this. Signed-off-by: Greg Kroah-Hartman Signed-off-by: Thomas Gleixner Cc: John Stultz Cc: Stephen Boyd Link: https://lkml.kernel.org/r/20190122152151.16139-43-gregkh@linuxfoundation.org --- kernel/time/timekeeping_debug.c | 11 ++--------- 1 file changed, 2 insertions(+), 9 deletions(-) (limited to 'kernel') diff --git a/kernel/time/timekeeping_debug.c b/kernel/time/timekeeping_debug.c index 86489950d690..b73e8850e58d 100644 --- a/kernel/time/timekeeping_debug.c +++ b/kernel/time/timekeeping_debug.c @@ -37,15 +37,8 @@ DEFINE_SHOW_ATTRIBUTE(tk_debug_sleep_time); static int __init tk_debug_sleep_time_init(void) { - struct dentry *d; - - d = debugfs_create_file("sleep_time", 0444, NULL, NULL, - &tk_debug_sleep_time_fops); - if (!d) { - pr_err("Failed to create sleep_time debug file\n"); - return -ENOMEM; - } - + debugfs_create_file("sleep_time", 0444, NULL, NULL, + &tk_debug_sleep_time_fops); return 0; } late_initcall(tk_debug_sleep_time_init); -- cgit v1.3-7-g2ca7 From 75b710af7139768fd4ba2d4e05335d2344796279 Mon Sep 17 00:00:00 2001 From: "Gustavo A. R. Silva" Date: Wed, 23 Jan 2019 02:14:13 -0600 Subject: timers: Mark expected switch fall-throughs In preparation to enabling -Wimplicit-fallthrough, mark switch cases where fall through is indeed expected. Signed-off-by: Gustavo A. R. Silva Signed-off-by: Thomas Gleixner Cc: Frederic Weisbecker Cc: John Stultz Cc: Stephen Boyd Link: https://lkml.kernel.org/r/20190123081413.GA3949@embeddedor --- kernel/time/hrtimer.c | 2 +- kernel/time/tick-broadcast.c | 1 + kernel/time/timer.c | 2 +- 3 files changed, 3 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c index f5cfa1b73d6f..6418e1bdc549 100644 --- a/kernel/time/hrtimer.c +++ b/kernel/time/hrtimer.c @@ -364,7 +364,7 @@ static bool hrtimer_fixup_activate(void *addr, enum debug_obj_state state) switch (state) { case ODEBUG_STATE_ACTIVE: WARN_ON(1); - + /* fall through */ default: return false; } diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c index 803fa67aace9..ee834d4fb814 100644 --- a/kernel/time/tick-broadcast.c +++ b/kernel/time/tick-broadcast.c @@ -375,6 +375,7 @@ void tick_broadcast_control(enum tick_broadcast_mode mode) switch (mode) { case TICK_BROADCAST_FORCE: tick_broadcast_forced = 1; + /* fall through */ case TICK_BROADCAST_ON: cpumask_set_cpu(cpu, tick_broadcast_on); if (!cpumask_test_and_set_cpu(cpu, tick_broadcast_mask)) { diff --git a/kernel/time/timer.c b/kernel/time/timer.c index 444156debfa0..167e71f9ed3c 100644 --- a/kernel/time/timer.c +++ b/kernel/time/timer.c @@ -647,7 +647,7 @@ static bool timer_fixup_activate(void *addr, enum debug_obj_state state) case ODEBUG_STATE_ACTIVE: WARN_ON(1); - + /* fall through */ default: return false; } -- cgit v1.3-7-g2ca7 From 0365aeba50841e087b3d6a0eca1bddccc5e650c8 Mon Sep 17 00:00:00 2001 From: Greg Kroah-Hartman Date: Tue, 22 Jan 2019 16:21:39 +0100 Subject: futex: No need to check return value of debugfs_create functions When calling debugfs functions, there is no need to ever check the return value. The function can work or not, but the code logic should never do something different based on this. Signed-off-by: Greg Kroah-Hartman Signed-off-by: Thomas Gleixner Reviewed-by: Darren Hart (VMware) Cc: Peter Zijlstra Link: https://lkml.kernel.org/r/20190122152151.16139-40-gregkh@linuxfoundation.org --- kernel/futex.c | 8 ++------ 1 file changed, 2 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/futex.c b/kernel/futex.c index fdd312da0992..69e619baf709 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -321,12 +321,8 @@ static int __init fail_futex_debugfs(void) if (IS_ERR(dir)) return PTR_ERR(dir); - if (!debugfs_create_bool("ignore-private", mode, dir, - &fail_futex.ignore_private)) { - debugfs_remove_recursive(dir); - return -ENOMEM; - } - + debugfs_create_bool("ignore-private", mode, dir, + &fail_futex.ignore_private); return 0; } -- cgit v1.3-7-g2ca7 From 34d66caf251df91ff27b24a3a786810d29989eca Mon Sep 17 00:00:00 2001 From: Zhenzhong Duan Date: Thu, 17 Jan 2019 02:10:59 -0800 Subject: x86/speculation: Remove redundant arch_smt_update() invocation With commit a74cfffb03b7 ("x86/speculation: Rework SMT state change"), arch_smt_update() is invoked from each individual CPU hotplug function. Therefore the extra arch_smt_update() call in the sysfs SMT control is redundant. Fixes: a74cfffb03b7 ("x86/speculation: Rework SMT state change") Signed-off-by: Zhenzhong Duan Signed-off-by: Thomas Gleixner Cc: Cc: Cc: Cc: Cc: Cc: Link: https://lkml.kernel.org/r/e2e064f2-e8ef-42ca-bf4f-76b612964752@default --- kernel/cpu.c | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/cpu.c b/kernel/cpu.c index 91d5c38eb7e5..c0c7f64573ed 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -2090,10 +2090,8 @@ static int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval) */ cpuhp_offline_cpu_device(cpu); } - if (!ret) { + if (!ret) cpu_smt_control = ctrlval; - arch_smt_update(); - } cpu_maps_update_done(); return ret; } @@ -2104,7 +2102,6 @@ static int cpuhp_smt_enable(void) cpu_maps_update_begin(); cpu_smt_control = CPU_SMT_ENABLED; - arch_smt_update(); for_each_present_cpu(cpu) { /* Skip online CPUs and CPUs on offline nodes */ if (cpu_online(cpu) || !node_online(cpu_to_node(cpu))) -- cgit v1.3-7-g2ca7 From 81f5c6f5db37bf2360b64c304b27b8f499b48367 Mon Sep 17 00:00:00 2001 From: Yonghong Song Date: Tue, 29 Jan 2019 16:38:16 -0800 Subject: bpf: btf: allow typedef func_proto Current implementation does not allow typedef func_proto. But it is actually allowed. -bash-4.4$ cat t.c typedef int (f) (int); f *g; -bash-4.4$ clang -O2 -g -c -target bpf t.c -Xclang -target-feature -Xclang +dwarfris -bash-4.4$ pahole -JV t.o File t.o: [1] PTR (anon) type_id=2 [2] TYPEDEF f type_id=3 [3] FUNC_PROTO (anon) return=4 args=(4 (anon)) [4] INT int size=4 bit_offset=0 nr_bits=32 encoding=SIGNED -bash-4.4$ This patch related btf verifier to allow such (typedef func_proto) patterns. Fixes: 2667a2626f4d ("bpf: btf: Add BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO") Acked-by: Martin KaFai Lau Signed-off-by: Yonghong Song Signed-off-by: Alexei Starovoitov --- kernel/bpf/btf.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index befe570be5ba..c57bd10340ed 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -1459,7 +1459,8 @@ static int btf_modifier_resolve(struct btf_verifier_env *env, /* "typedef void new_void", "const void"...etc */ if (!btf_type_is_void(next_type) && - !btf_type_is_fwd(next_type)) { + !btf_type_is_fwd(next_type) && + !btf_type_is_func_proto(next_type)) { btf_verifier_log_type(env, v->t, "Invalid type_id"); return -EINVAL; } -- cgit v1.3-7-g2ca7 From b284909abad48b07d3071a9fc9b5692b3e64914b Mon Sep 17 00:00:00 2001 From: Josh Poimboeuf Date: Wed, 30 Jan 2019 07:13:58 -0600 Subject: cpu/hotplug: Fix "SMT disabled by BIOS" detection for KVM With the following commit: 73d5e2b47264 ("cpu/hotplug: detect SMT disabled by BIOS") ... the hotplug code attempted to detect when SMT was disabled by BIOS, in which case it reported SMT as permanently disabled. However, that code broke a virt hotplug scenario, where the guest is booted with only primary CPU threads, and a sibling is brought online later. The problem is that there doesn't seem to be a way to reliably distinguish between the HW "SMT disabled by BIOS" case and the virt "sibling not yet brought online" case. So the above-mentioned commit was a bit misguided, as it permanently disabled SMT for both cases, preventing future virt sibling hotplugs. Going back and reviewing the original problems which were attempted to be solved by that commit, when SMT was disabled in BIOS: 1) /sys/devices/system/cpu/smt/control showed "on" instead of "notsupported"; and 2) vmx_vm_init() was incorrectly showing the L1TF_MSG_SMT warning. I'd propose that we instead consider #1 above to not actually be a problem. Because, at least in the virt case, it's possible that SMT wasn't disabled by BIOS and a sibling thread could be brought online later. So it makes sense to just always default the smt control to "on" to allow for that possibility (assuming cpuid indicates that the CPU supports SMT). The real problem is #2, which has a simple fix: change vmx_vm_init() to query the actual current SMT state -- i.e., whether any siblings are currently online -- instead of looking at the SMT "control" sysfs value. So fix it by: a) reverting the original "fix" and its followup fix: 73d5e2b47264 ("cpu/hotplug: detect SMT disabled by BIOS") bc2d8d262cba ("cpu/hotplug: Fix SMT supported evaluation") and b) changing vmx_vm_init() to query the actual current SMT state -- instead of the sysfs control value -- to determine whether the L1TF warning is needed. This also requires the 'sched_smt_present' variable to exported, instead of 'cpu_smt_control'. Fixes: 73d5e2b47264 ("cpu/hotplug: detect SMT disabled by BIOS") Reported-by: Igor Mammedov Signed-off-by: Josh Poimboeuf Signed-off-by: Thomas Gleixner Cc: Joe Mario Cc: Jiri Kosina Cc: Peter Zijlstra Cc: kvm@vger.kernel.org Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/e3a85d585da28cc333ecbc1e78ee9216e6da9396.1548794349.git.jpoimboe@redhat.com --- arch/x86/kernel/cpu/bugs.c | 2 +- arch/x86/kvm/vmx/vmx.c | 3 ++- include/linux/cpu.h | 2 -- kernel/cpu.c | 33 ++++----------------------------- kernel/sched/fair.c | 1 + kernel/smp.c | 2 -- 6 files changed, 8 insertions(+), 35 deletions(-) (limited to 'kernel') diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c index 1de0f4170178..01874d54f4fd 100644 --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -71,7 +71,7 @@ void __init check_bugs(void) * identify_boot_cpu() initialized SMT support information, let the * core code know. */ - cpu_smt_check_topology_early(); + cpu_smt_check_topology(); if (!IS_ENABLED(CONFIG_SMP)) { pr_info("CPU: "); diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 4341175339f3..95d618045001 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -26,6 +26,7 @@ #include #include #include +#include #include #include #include @@ -6823,7 +6824,7 @@ static int vmx_vm_init(struct kvm *kvm) * Warn upon starting the first VM in a potentially * insecure environment. */ - if (cpu_smt_control == CPU_SMT_ENABLED) + if (sched_smt_active()) pr_warn_once(L1TF_MSG_SMT); if (l1tf_vmx_mitigation == VMENTER_L1D_FLUSH_NEVER) pr_warn_once(L1TF_MSG_L1D); diff --git a/include/linux/cpu.h b/include/linux/cpu.h index 218df7f4d3e1..5041357d0297 100644 --- a/include/linux/cpu.h +++ b/include/linux/cpu.h @@ -180,12 +180,10 @@ enum cpuhp_smt_control { #if defined(CONFIG_SMP) && defined(CONFIG_HOTPLUG_SMT) extern enum cpuhp_smt_control cpu_smt_control; extern void cpu_smt_disable(bool force); -extern void cpu_smt_check_topology_early(void); extern void cpu_smt_check_topology(void); #else # define cpu_smt_control (CPU_SMT_ENABLED) static inline void cpu_smt_disable(bool force) { } -static inline void cpu_smt_check_topology_early(void) { } static inline void cpu_smt_check_topology(void) { } #endif diff --git a/kernel/cpu.c b/kernel/cpu.c index c0c7f64573ed..d1c6d152da89 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -376,9 +376,6 @@ void __weak arch_smt_update(void) { } #ifdef CONFIG_HOTPLUG_SMT enum cpuhp_smt_control cpu_smt_control __read_mostly = CPU_SMT_ENABLED; -EXPORT_SYMBOL_GPL(cpu_smt_control); - -static bool cpu_smt_available __read_mostly; void __init cpu_smt_disable(bool force) { @@ -397,25 +394,11 @@ void __init cpu_smt_disable(bool force) /* * The decision whether SMT is supported can only be done after the full - * CPU identification. Called from architecture code before non boot CPUs - * are brought up. - */ -void __init cpu_smt_check_topology_early(void) -{ - if (!topology_smt_supported()) - cpu_smt_control = CPU_SMT_NOT_SUPPORTED; -} - -/* - * If SMT was disabled by BIOS, detect it here, after the CPUs have been - * brought online. This ensures the smt/l1tf sysfs entries are consistent - * with reality. cpu_smt_available is set to true during the bringup of non - * boot CPUs when a SMT sibling is detected. Note, this may overwrite - * cpu_smt_control's previous setting. + * CPU identification. Called from architecture code. */ void __init cpu_smt_check_topology(void) { - if (!cpu_smt_available) + if (!topology_smt_supported()) cpu_smt_control = CPU_SMT_NOT_SUPPORTED; } @@ -428,18 +411,10 @@ early_param("nosmt", smt_cmdline_disable); static inline bool cpu_smt_allowed(unsigned int cpu) { - if (topology_is_primary_thread(cpu)) + if (cpu_smt_control == CPU_SMT_ENABLED) return true; - /* - * If the CPU is not a 'primary' thread and the booted_once bit is - * set then the processor has SMT support. Store this information - * for the late check of SMT support in cpu_smt_check_topology(). - */ - if (per_cpu(cpuhp_state, cpu).booted_once) - cpu_smt_available = true; - - if (cpu_smt_control == CPU_SMT_ENABLED) + if (topology_is_primary_thread(cpu)) return true; /* diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 50aa2aba69bd..310d0637fe4b 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5980,6 +5980,7 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p #ifdef CONFIG_SCHED_SMT DEFINE_STATIC_KEY_FALSE(sched_smt_present); +EXPORT_SYMBOL_GPL(sched_smt_present); static inline void set_idle_cores(int cpu, int val) { diff --git a/kernel/smp.c b/kernel/smp.c index 163c451af42e..f4cf1b0bb3b8 100644 --- a/kernel/smp.c +++ b/kernel/smp.c @@ -584,8 +584,6 @@ void __init smp_init(void) num_nodes, (num_nodes > 1 ? "s" : ""), num_cpus, (num_cpus > 1 ? "s" : "")); - /* Final decision about SMT support */ - cpu_smt_check_topology(); /* Any cleanup work */ smp_cpus_done(setup_max_cpus); } -- cgit v1.3-7-g2ca7 From 57d4657716aca81ef4d7ec23e8123d26e3d28954 Mon Sep 17 00:00:00 2001 From: Richard Guy Briggs Date: Wed, 23 Jan 2019 13:35:00 -0500 Subject: audit: ignore fcaps on umount Don't fetch fcaps when umount2 is called to avoid a process hang while it waits for the missing resource to (possibly never) re-appear. Note the comment above user_path_mountpoint_at(): * A umount is a special case for path walking. We're not actually interested * in the inode in this situation, and ESTALE errors can be a problem. We * simply want track down the dentry and vfsmount attached at the mountpoint * and avoid revalidating the last component. This can happen on ceph, cifs, 9p, lustre, fuse (gluster) or NFS. Please see the github issue tracker https://github.com/linux-audit/audit-kernel/issues/100 Signed-off-by: Richard Guy Briggs [PM: merge fuzz in audit_log_fcaps()] Signed-off-by: Paul Moore --- fs/namei.c | 2 +- fs/namespace.c | 2 ++ include/linux/audit.h | 15 ++++++++++----- include/linux/namei.h | 3 +++ kernel/audit.c | 10 +++++++++- kernel/audit.h | 2 +- kernel/auditsc.c | 6 +++--- 7 files changed, 29 insertions(+), 11 deletions(-) (limited to 'kernel') diff --git a/fs/namei.c b/fs/namei.c index 914178cdbe94..87d7710a2e1d 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -2720,7 +2720,7 @@ filename_mountpoint(int dfd, struct filename *name, struct path *path, if (unlikely(error == -ESTALE)) error = path_mountpoint(&nd, flags | LOOKUP_REVAL, path); if (likely(!error)) - audit_inode(name, path->dentry, 0); + audit_inode(name, path->dentry, flags & LOOKUP_NO_EVAL); restore_nameidata(); putname(name); return error; diff --git a/fs/namespace.c b/fs/namespace.c index a677b59efd74..e5de0e372df2 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -1640,6 +1640,8 @@ int ksys_umount(char __user *name, int flags) if (!(flags & UMOUNT_NOFOLLOW)) lookup_flags |= LOOKUP_FOLLOW; + lookup_flags |= LOOKUP_NO_EVAL; + retval = user_path_mountpoint_at(AT_FDCWD, name, lookup_flags, &path); if (retval) goto out; diff --git a/include/linux/audit.h b/include/linux/audit.h index ecb5d317d6a2..29251b18331a 100644 --- a/include/linux/audit.h +++ b/include/linux/audit.h @@ -25,6 +25,7 @@ #include #include +#include /* LOOKUP_* */ #include #define AUDIT_INO_UNSET ((unsigned long)-1) @@ -248,6 +249,7 @@ extern void __audit_getname(struct filename *name); #define AUDIT_INODE_PARENT 1 /* dentry represents the parent */ #define AUDIT_INODE_HIDDEN 2 /* audit record should be hidden */ +#define AUDIT_INODE_NOEVAL 4 /* audit record incomplete */ extern void __audit_inode(struct filename *name, const struct dentry *dentry, unsigned int flags); extern void __audit_file(const struct file *); @@ -308,12 +310,15 @@ static inline void audit_getname(struct filename *name) } static inline void audit_inode(struct filename *name, const struct dentry *dentry, - unsigned int parent) { + unsigned int flags) { if (unlikely(!audit_dummy_context())) { - unsigned int flags = 0; - if (parent) - flags |= AUDIT_INODE_PARENT; - __audit_inode(name, dentry, flags); + unsigned int aflags = 0; + + if (flags & LOOKUP_PARENT) + aflags |= AUDIT_INODE_PARENT; + if (flags & LOOKUP_NO_EVAL) + aflags |= AUDIT_INODE_NOEVAL; + __audit_inode(name, dentry, aflags); } } static inline void audit_file(struct file *file) diff --git a/include/linux/namei.h b/include/linux/namei.h index a78606e8e3df..9138b4471dbf 100644 --- a/include/linux/namei.h +++ b/include/linux/namei.h @@ -24,6 +24,8 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND}; * - internal "there are more path components" flag * - dentry cache is untrusted; force a real lookup * - suppress terminal automount + * - skip revalidation + * - don't fetch xattrs on audit_inode */ #define LOOKUP_FOLLOW 0x0001 #define LOOKUP_DIRECTORY 0x0002 @@ -33,6 +35,7 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND}; #define LOOKUP_REVAL 0x0020 #define LOOKUP_RCU 0x0040 #define LOOKUP_NO_REVAL 0x0080 +#define LOOKUP_NO_EVAL 0x0100 /* * Intent data diff --git a/kernel/audit.c b/kernel/audit.c index 3f3f1888cac7..b7177a8def2e 100644 --- a/kernel/audit.c +++ b/kernel/audit.c @@ -2082,6 +2082,10 @@ void audit_log_cap(struct audit_buffer *ab, char *prefix, kernel_cap_t *cap) static void audit_log_fcaps(struct audit_buffer *ab, struct audit_names *name) { + if (name->fcap_ver == -1) { + audit_log_format(ab, " cap_fe=? cap_fver=? cap_fp=? cap_fi=?"); + return; + } audit_log_cap(ab, "cap_fp", &name->fcap.permitted); audit_log_cap(ab, "cap_fi", &name->fcap.inheritable); audit_log_format(ab, " cap_fe=%d cap_fver=%x cap_frootid=%d", @@ -2114,7 +2118,7 @@ static inline int audit_copy_fcaps(struct audit_names *name, /* Copy inode data into an audit_names. */ void audit_copy_inode(struct audit_names *name, const struct dentry *dentry, - struct inode *inode) + struct inode *inode, unsigned int flags) { name->ino = inode->i_ino; name->dev = inode->i_sb->s_dev; @@ -2123,6 +2127,10 @@ void audit_copy_inode(struct audit_names *name, const struct dentry *dentry, name->gid = inode->i_gid; name->rdev = inode->i_rdev; security_inode_getsecid(inode, &name->osid); + if (flags & AUDIT_INODE_NOEVAL) { + name->fcap_ver = -1; + return; + } audit_copy_fcaps(name, dentry); } diff --git a/kernel/audit.h b/kernel/audit.h index 9acb8691ed87..002f0f7ba732 100644 --- a/kernel/audit.h +++ b/kernel/audit.h @@ -215,7 +215,7 @@ extern void audit_log_session_info(struct audit_buffer *ab); extern void audit_copy_inode(struct audit_names *name, const struct dentry *dentry, - struct inode *inode); + struct inode *inode, unsigned int flags); extern void audit_log_cap(struct audit_buffer *ab, char *prefix, kernel_cap_t *cap); extern void audit_log_name(struct audit_context *context, diff --git a/kernel/auditsc.c b/kernel/auditsc.c index a2696ce790f9..68da71001096 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -1856,7 +1856,7 @@ out: n->type = AUDIT_TYPE_NORMAL; } handle_path(dentry); - audit_copy_inode(n, dentry, inode); + audit_copy_inode(n, dentry, inode, flags & AUDIT_INODE_NOEVAL); } void __audit_file(const struct file *file) @@ -1955,7 +1955,7 @@ void __audit_inode_child(struct inode *parent, n = audit_alloc_name(context, AUDIT_TYPE_PARENT); if (!n) return; - audit_copy_inode(n, NULL, parent); + audit_copy_inode(n, NULL, parent, 0); } if (!found_child) { @@ -1974,7 +1974,7 @@ void __audit_inode_child(struct inode *parent, } if (inode) - audit_copy_inode(found_child, dentry, inode); + audit_copy_inode(found_child, dentry, inode, 0); else found_child->ino = AUDIT_INO_UNSET; } -- cgit v1.3-7-g2ca7 From de1da68d9c9d230e448b9fa0abff139a0468fb96 Mon Sep 17 00:00:00 2001 From: Valdis Kletnieks Date: Mon, 28 Jan 2019 23:04:46 -0500 Subject: bpf: fix bitrotted kerneldoc Over the years, the function signature has changed, but the kerneldoc block hasn't. Signed-off-by: Valdis Kletnieks Acked-by: Song Liu Signed-off-by: Daniel Borkmann --- kernel/bpf/core.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index a7bcb23bee84..f13c543b7b36 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -1263,8 +1263,9 @@ bool bpf_opcode_in_insntable(u8 code) #ifndef CONFIG_BPF_JIT_ALWAYS_ON /** * __bpf_prog_run - run eBPF program on a given context - * @ctx: is the data we are operating on + * @regs: is the array of MAX_BPF_EXT_REG eBPF pseudo-registers * @insn: is the array of eBPF instructions + * @stack: is the eBPF storage stack * * Decode and execute eBPF instructions. */ -- cgit v1.3-7-g2ca7 From 1832f4ef5867fd3898d8a6c6c1978b75d76fc246 Mon Sep 17 00:00:00 2001 From: Valdis Kletnieks Date: Tue, 29 Jan 2019 01:47:06 -0500 Subject: bpf, cgroups: clean up kerneldoc warnings Building with W=1 reveals some bitrot: CC kernel/bpf/cgroup.o kernel/bpf/cgroup.c:238: warning: Function parameter or member 'flags' not described in '__cgroup_bpf_attach' kernel/bpf/cgroup.c:367: warning: Function parameter or member 'unused_flags' not described in '__cgroup_bpf_detach' Add a kerneldoc line for 'flags'. Fixing the warning for 'unused_flags' is best approached by removing the unused parameter on the function call. Signed-off-by: Valdis Kletnieks Acked-by: Song Liu Signed-off-by: Daniel Borkmann --- include/linux/bpf-cgroup.h | 2 +- kernel/bpf/cgroup.c | 3 ++- kernel/cgroup/cgroup.c | 2 +- 3 files changed, 4 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h index 588dd5f0bd85..695b2a880d9a 100644 --- a/include/linux/bpf-cgroup.h +++ b/include/linux/bpf-cgroup.h @@ -78,7 +78,7 @@ int cgroup_bpf_inherit(struct cgroup *cgrp); int __cgroup_bpf_attach(struct cgroup *cgrp, struct bpf_prog *prog, enum bpf_attach_type type, u32 flags); int __cgroup_bpf_detach(struct cgroup *cgrp, struct bpf_prog *prog, - enum bpf_attach_type type, u32 flags); + enum bpf_attach_type type); int __cgroup_bpf_query(struct cgroup *cgrp, const union bpf_attr *attr, union bpf_attr __user *uattr); diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c index ab612fe9862f..d78cfec5807d 100644 --- a/kernel/bpf/cgroup.c +++ b/kernel/bpf/cgroup.c @@ -230,6 +230,7 @@ cleanup: * @cgrp: The cgroup which descendants to traverse * @prog: A program to attach * @type: Type of attach operation + * @flags: Option flags * * Must be called with cgroup_mutex held. */ @@ -363,7 +364,7 @@ cleanup: * Must be called with cgroup_mutex held. */ int __cgroup_bpf_detach(struct cgroup *cgrp, struct bpf_prog *prog, - enum bpf_attach_type type, u32 unused_flags) + enum bpf_attach_type type) { struct list_head *progs = &cgrp->bpf.progs[type]; enum bpf_cgroup_storage_type stype; diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index f31bd61c9466..9f617605dacb 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -5996,7 +5996,7 @@ int cgroup_bpf_detach(struct cgroup *cgrp, struct bpf_prog *prog, int ret; mutex_lock(&cgroup_mutex); - ret = __cgroup_bpf_detach(cgrp, prog, type, flags); + ret = __cgroup_bpf_detach(cgrp, prog, type); mutex_unlock(&cgroup_mutex); return ret; } -- cgit v1.3-7-g2ca7 From 2c1cf00eeacb784781cf1c9896b8af001246d339 Mon Sep 17 00:00:00 2001 From: Greg Kroah-Hartman Date: Thu, 31 Jan 2019 13:57:58 +0100 Subject: relay: check return of create_buf_file() properly If create_buf_file() returns an error, don't try to reference it later as a valid dentry pointer. This problem was exposed when debugfs started to return errors instead of just NULL for some calls when they do not succeed properly. Also, the check for WARN_ON(dentry) was just wrong :) Reported-by: Kees Cook Reported-and-tested-by: syzbot+16c3a70e1e9b29346c43@syzkaller.appspotmail.com Reported-by: Tetsuo Handa Cc: Andrew Morton Cc: David Rientjes Fixes: ff9fb72bc077 ("debugfs: return error values, not NULL") Signed-off-by: Greg Kroah-Hartman --- kernel/relay.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/relay.c b/kernel/relay.c index 04f248644e06..9e0f52375487 100644 --- a/kernel/relay.c +++ b/kernel/relay.c @@ -428,6 +428,8 @@ static struct dentry *relay_create_buf_file(struct rchan *chan, dentry = chan->cb->create_buf_file(tmpname, chan->parent, S_IRUSR, buf, &chan->is_global); + if (IS_ERR(dentry)) + dentry = NULL; kfree(tmpname); @@ -461,7 +463,7 @@ static struct rchan_buf *relay_open_buf(struct rchan *chan, unsigned int cpu) dentry = chan->cb->create_buf_file(NULL, NULL, S_IRUSR, buf, &chan->is_global); - if (WARN_ON(dentry)) + if (IS_ERR_OR_NULL(dentry)) goto free_buf; } -- cgit v1.3-7-g2ca7 From 8204e0c1113d6b7f599bcd7ebfbfde72e76c102f Mon Sep 17 00:00:00 2001 From: Alexander Duyck Date: Tue, 22 Jan 2019 10:39:26 -0800 Subject: workqueue: Provide queue_work_node to queue work near a given NUMA node Provide a new function, queue_work_node, which is meant to schedule work on a "random" CPU of the requested NUMA node. The main motivation for this is to help assist asynchronous init to better improve boot times for devices that are local to a specific node. For now we just default to the first CPU that is in the intersection of the cpumask of the node and the online cpumask. The only exception is if the CPU is local to the node we will just use the current CPU. This should work for our purposes as we are currently only using this for unbound work so the CPU will be translated to a node anyway instead of being directly used. As we are only using the first CPU to represent the NUMA node for now I am limiting the scope of the function so that it can only be used with unbound workqueues. Acked-by: Tejun Heo Reviewed-by: Bart Van Assche Acked-by: Dan Williams Signed-off-by: Alexander Duyck Signed-off-by: Greg Kroah-Hartman --- include/linux/workqueue.h | 2 ++ kernel/workqueue.c | 84 +++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 86 insertions(+) (limited to 'kernel') diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h index 60d673e15632..1f50c1e586e7 100644 --- a/include/linux/workqueue.h +++ b/include/linux/workqueue.h @@ -463,6 +463,8 @@ int workqueue_set_unbound_cpumask(cpumask_var_t cpumask); extern bool queue_work_on(int cpu, struct workqueue_struct *wq, struct work_struct *work); +extern bool queue_work_node(int node, struct workqueue_struct *wq, + struct work_struct *work); extern bool queue_delayed_work_on(int cpu, struct workqueue_struct *wq, struct delayed_work *work, unsigned long delay); extern bool mod_delayed_work_on(int cpu, struct workqueue_struct *wq, diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 392be4b252f6..d5a26e456f7a 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -1492,6 +1492,90 @@ bool queue_work_on(int cpu, struct workqueue_struct *wq, } EXPORT_SYMBOL(queue_work_on); +/** + * workqueue_select_cpu_near - Select a CPU based on NUMA node + * @node: NUMA node ID that we want to select a CPU from + * + * This function will attempt to find a "random" cpu available on a given + * node. If there are no CPUs available on the given node it will return + * WORK_CPU_UNBOUND indicating that we should just schedule to any + * available CPU if we need to schedule this work. + */ +static int workqueue_select_cpu_near(int node) +{ + int cpu; + + /* No point in doing this if NUMA isn't enabled for workqueues */ + if (!wq_numa_enabled) + return WORK_CPU_UNBOUND; + + /* Delay binding to CPU if node is not valid or online */ + if (node < 0 || node >= MAX_NUMNODES || !node_online(node)) + return WORK_CPU_UNBOUND; + + /* Use local node/cpu if we are already there */ + cpu = raw_smp_processor_id(); + if (node == cpu_to_node(cpu)) + return cpu; + + /* Use "random" otherwise know as "first" online CPU of node */ + cpu = cpumask_any_and(cpumask_of_node(node), cpu_online_mask); + + /* If CPU is valid return that, otherwise just defer */ + return cpu < nr_cpu_ids ? cpu : WORK_CPU_UNBOUND; +} + +/** + * queue_work_node - queue work on a "random" cpu for a given NUMA node + * @node: NUMA node that we are targeting the work for + * @wq: workqueue to use + * @work: work to queue + * + * We queue the work to a "random" CPU within a given NUMA node. The basic + * idea here is to provide a way to somehow associate work with a given + * NUMA node. + * + * This function will only make a best effort attempt at getting this onto + * the right NUMA node. If no node is requested or the requested node is + * offline then we just fall back to standard queue_work behavior. + * + * Currently the "random" CPU ends up being the first available CPU in the + * intersection of cpu_online_mask and the cpumask of the node, unless we + * are running on the node. In that case we just use the current CPU. + * + * Return: %false if @work was already on a queue, %true otherwise. + */ +bool queue_work_node(int node, struct workqueue_struct *wq, + struct work_struct *work) +{ + unsigned long flags; + bool ret = false; + + /* + * This current implementation is specific to unbound workqueues. + * Specifically we only return the first available CPU for a given + * node instead of cycling through individual CPUs within the node. + * + * If this is used with a per-cpu workqueue then the logic in + * workqueue_select_cpu_near would need to be updated to allow for + * some round robin type logic. + */ + WARN_ON_ONCE(!(wq->flags & WQ_UNBOUND)); + + local_irq_save(flags); + + if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) { + int cpu = workqueue_select_cpu_near(node); + + __queue_work(cpu, wq, work); + ret = true; + } + + local_irq_restore(flags); + return ret; +} +EXPORT_SYMBOL_GPL(queue_work_node); + void delayed_work_timer_fn(struct timer_list *t) { struct delayed_work *dwork = from_timer(dwork, t, timer); -- cgit v1.3-7-g2ca7 From 6be9238e5cb64741ff95c3ae440b112753ad93de Mon Sep 17 00:00:00 2001 From: Alexander Duyck Date: Tue, 22 Jan 2019 10:39:31 -0800 Subject: async: Add support for queueing on specific NUMA node Introduce four new variants of the async_schedule_ functions that allow scheduling on a specific NUMA node. The first two functions are async_schedule_near and async_schedule_near_domain end up mapping to async_schedule and async_schedule_domain, but provide NUMA node specific functionality. They replace the original functions which were moved to inline function definitions that call the new functions while passing NUMA_NO_NODE. The second two functions are async_schedule_dev and async_schedule_dev_domain which provide NUMA specific functionality when passing a device as the data member and that device has a NUMA node other than NUMA_NO_NODE. The main motivation behind this is to address the need to be able to schedule device specific init work on specific NUMA nodes in order to improve performance of memory initialization. I have seen a significant improvement in initialziation time for persistent memory as a result of this approach. In the case of 3TB of memory on a single node the initialization time in the worst case went from 36s down to about 26s for a 10s improvement. As such the data shows a general benefit for affinitizing the async work to the node local to the device. Reviewed-by: Bart Van Assche Reviewed-by: Dan Williams Signed-off-by: Alexander Duyck Signed-off-by: Greg Kroah-Hartman --- include/linux/async.h | 82 +++++++++++++++++++++++++++++++++++++++++++++++++-- kernel/async.c | 53 ++++++++++++++++++--------------- 2 files changed, 108 insertions(+), 27 deletions(-) (limited to 'kernel') diff --git a/include/linux/async.h b/include/linux/async.h index 6b0226bdaadc..f81d6dbffe68 100644 --- a/include/linux/async.h +++ b/include/linux/async.h @@ -14,6 +14,8 @@ #include #include +#include +#include typedef u64 async_cookie_t; typedef void (*async_func_t) (void *data, async_cookie_t cookie); @@ -37,9 +39,83 @@ struct async_domain { struct async_domain _name = { .pending = LIST_HEAD_INIT(_name.pending), \ .registered = 0 } -extern async_cookie_t async_schedule(async_func_t func, void *data); -extern async_cookie_t async_schedule_domain(async_func_t func, void *data, - struct async_domain *domain); +async_cookie_t async_schedule_node(async_func_t func, void *data, + int node); +async_cookie_t async_schedule_node_domain(async_func_t func, void *data, + int node, + struct async_domain *domain); + +/** + * async_schedule - schedule a function for asynchronous execution + * @func: function to execute asynchronously + * @data: data pointer to pass to the function + * + * Returns an async_cookie_t that may be used for checkpointing later. + * Note: This function may be called from atomic or non-atomic contexts. + */ +static inline async_cookie_t async_schedule(async_func_t func, void *data) +{ + return async_schedule_node(func, data, NUMA_NO_NODE); +} + +/** + * async_schedule_domain - schedule a function for asynchronous execution within a certain domain + * @func: function to execute asynchronously + * @data: data pointer to pass to the function + * @domain: the domain + * + * Returns an async_cookie_t that may be used for checkpointing later. + * @domain may be used in the async_synchronize_*_domain() functions to + * wait within a certain synchronization domain rather than globally. + * Note: This function may be called from atomic or non-atomic contexts. + */ +static inline async_cookie_t +async_schedule_domain(async_func_t func, void *data, + struct async_domain *domain) +{ + return async_schedule_node_domain(func, data, NUMA_NO_NODE, domain); +} + +/** + * async_schedule_dev - A device specific version of async_schedule + * @func: function to execute asynchronously + * @dev: device argument to be passed to function + * + * Returns an async_cookie_t that may be used for checkpointing later. + * @dev is used as both the argument for the function and to provide NUMA + * context for where to run the function. By doing this we can try to + * provide for the best possible outcome by operating on the device on the + * CPUs closest to the device. + * Note: This function may be called from atomic or non-atomic contexts. + */ +static inline async_cookie_t +async_schedule_dev(async_func_t func, struct device *dev) +{ + return async_schedule_node(func, dev, dev_to_node(dev)); +} + +/** + * async_schedule_dev_domain - A device specific version of async_schedule_domain + * @func: function to execute asynchronously + * @dev: device argument to be passed to function + * @domain: the domain + * + * Returns an async_cookie_t that may be used for checkpointing later. + * @dev is used as both the argument for the function and to provide NUMA + * context for where to run the function. By doing this we can try to + * provide for the best possible outcome by operating on the device on the + * CPUs closest to the device. + * @domain may be used in the async_synchronize_*_domain() functions to + * wait within a certain synchronization domain rather than globally. + * Note: This function may be called from atomic or non-atomic contexts. + */ +static inline async_cookie_t +async_schedule_dev_domain(async_func_t func, struct device *dev, + struct async_domain *domain) +{ + return async_schedule_node_domain(func, dev, dev_to_node(dev), domain); +} + void async_unregister_domain(struct async_domain *domain); extern void async_synchronize_full(void); extern void async_synchronize_full_domain(struct async_domain *domain); diff --git a/kernel/async.c b/kernel/async.c index a893d6170944..f6bd0d9885e1 100644 --- a/kernel/async.c +++ b/kernel/async.c @@ -149,7 +149,25 @@ static void async_run_entry_fn(struct work_struct *work) wake_up(&async_done); } -static async_cookie_t __async_schedule(async_func_t func, void *data, struct async_domain *domain) +/** + * async_schedule_node_domain - NUMA specific version of async_schedule_domain + * @func: function to execute asynchronously + * @data: data pointer to pass to the function + * @node: NUMA node that we want to schedule this on or close to + * @domain: the domain + * + * Returns an async_cookie_t that may be used for checkpointing later. + * @domain may be used in the async_synchronize_*_domain() functions to + * wait within a certain synchronization domain rather than globally. + * + * Note: This function may be called from atomic or non-atomic contexts. + * + * The node requested will be honored on a best effort basis. If the node + * has no CPUs associated with it then the work is distributed among all + * available CPUs. + */ +async_cookie_t async_schedule_node_domain(async_func_t func, void *data, + int node, struct async_domain *domain) { struct async_entry *entry; unsigned long flags; @@ -195,43 +213,30 @@ static async_cookie_t __async_schedule(async_func_t func, void *data, struct asy current->flags |= PF_USED_ASYNC; /* schedule for execution */ - queue_work(system_unbound_wq, &entry->work); + queue_work_node(node, system_unbound_wq, &entry->work); return newcookie; } +EXPORT_SYMBOL_GPL(async_schedule_node_domain); /** - * async_schedule - schedule a function for asynchronous execution + * async_schedule_node - NUMA specific version of async_schedule * @func: function to execute asynchronously * @data: data pointer to pass to the function + * @node: NUMA node that we want to schedule this on or close to * * Returns an async_cookie_t that may be used for checkpointing later. * Note: This function may be called from atomic or non-atomic contexts. - */ -async_cookie_t async_schedule(async_func_t func, void *data) -{ - return __async_schedule(func, data, &async_dfl_domain); -} -EXPORT_SYMBOL_GPL(async_schedule); - -/** - * async_schedule_domain - schedule a function for asynchronous execution within a certain domain - * @func: function to execute asynchronously - * @data: data pointer to pass to the function - * @domain: the domain * - * Returns an async_cookie_t that may be used for checkpointing later. - * @domain may be used in the async_synchronize_*_domain() functions to - * wait within a certain synchronization domain rather than globally. A - * synchronization domain is specified via @domain. Note: This function - * may be called from atomic or non-atomic contexts. + * The node requested will be honored on a best effort basis. If the node + * has no CPUs associated with it then the work is distributed among all + * available CPUs. */ -async_cookie_t async_schedule_domain(async_func_t func, void *data, - struct async_domain *domain) +async_cookie_t async_schedule_node(async_func_t func, void *data, int node) { - return __async_schedule(func, data, domain); + return async_schedule_node_domain(func, data, node, &async_dfl_domain); } -EXPORT_SYMBOL_GPL(async_schedule_domain); +EXPORT_SYMBOL_GPL(async_schedule_node); /** * async_synchronize_full - synchronize all asynchronous function calls -- cgit v1.3-7-g2ca7 From 51bee5abeab2058ea5813c5615d6197a23dbf041 Mon Sep 17 00:00:00 2001 From: Oleg Nesterov Date: Mon, 28 Jan 2019 17:00:13 +0100 Subject: cgroup/pids: turn cgroup_subsys->free() into cgroup_subsys->release() to fix the accounting The only user of cgroup_subsys->free() callback is pids_cgrp_subsys which needs pids_free() to uncharge the pid. However, ->free() is called from __put_task_struct()->cgroup_free() and this is too late. Even the trivial program which does for (;;) { int pid = fork(); assert(pid >= 0); if (pid) wait(NULL); else exit(0); } can run out of limits because release_task()->call_rcu(delayed_put_task_struct) implies an RCU gp after the task/pid goes away and before the final put(). Test-case: mkdir -p /tmp/CG mount -t cgroup2 none /tmp/CG echo '+pids' > /tmp/CG/cgroup.subtree_control mkdir /tmp/CG/PID echo 2 > /tmp/CG/PID/pids.max perl -e 'while ($p = fork) { wait; } $p // die "fork failed: $!\n"' & echo $! > /tmp/CG/PID/cgroup.procs Without this patch the forking process fails soon after migration. Rename cgroup_subsys->free() to cgroup_subsys->release() and move the callsite into the new helper, cgroup_release(), called by release_task() which actually frees the pid(s). Reported-by: Herton R. Krzesinski Reported-by: Jan Stancek Signed-off-by: Oleg Nesterov Signed-off-by: Tejun Heo --- include/linux/cgroup-defs.h | 2 +- include/linux/cgroup.h | 2 ++ kernel/cgroup/cgroup.c | 15 +++++++++------ kernel/cgroup/pids.c | 4 ++-- kernel/exit.c | 1 + 5 files changed, 15 insertions(+), 9 deletions(-) (limited to 'kernel') diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index 8fcbae1b8db0..120d1d40704b 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -602,7 +602,7 @@ struct cgroup_subsys { void (*cancel_fork)(struct task_struct *task); void (*fork)(struct task_struct *task); void (*exit)(struct task_struct *task); - void (*free)(struct task_struct *task); + void (*release)(struct task_struct *task); void (*bind)(struct cgroup_subsys_state *root_css); bool early_init:1; diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index 9968332cceed..81f58b4a5418 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -121,6 +121,7 @@ extern int cgroup_can_fork(struct task_struct *p); extern void cgroup_cancel_fork(struct task_struct *p); extern void cgroup_post_fork(struct task_struct *p); void cgroup_exit(struct task_struct *p); +void cgroup_release(struct task_struct *p); void cgroup_free(struct task_struct *p); int cgroup_init_early(void); @@ -697,6 +698,7 @@ static inline int cgroup_can_fork(struct task_struct *p) { return 0; } static inline void cgroup_cancel_fork(struct task_struct *p) {} static inline void cgroup_post_fork(struct task_struct *p) {} static inline void cgroup_exit(struct task_struct *p) {} +static inline void cgroup_release(struct task_struct *p) {} static inline void cgroup_free(struct task_struct *p) {} static inline int cgroup_init_early(void) { return 0; } diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index f31bd61c9466..f4418371c83b 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -197,7 +197,7 @@ static u64 css_serial_nr_next = 1; */ static u16 have_fork_callback __read_mostly; static u16 have_exit_callback __read_mostly; -static u16 have_free_callback __read_mostly; +static u16 have_release_callback __read_mostly; static u16 have_canfork_callback __read_mostly; /* cgroup namespace for init task */ @@ -5313,7 +5313,7 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early) have_fork_callback |= (bool)ss->fork << ss->id; have_exit_callback |= (bool)ss->exit << ss->id; - have_free_callback |= (bool)ss->free << ss->id; + have_release_callback |= (bool)ss->release << ss->id; have_canfork_callback |= (bool)ss->can_fork << ss->id; /* At system boot, before all subsystems have been @@ -5749,16 +5749,19 @@ void cgroup_exit(struct task_struct *tsk) } while_each_subsys_mask(); } -void cgroup_free(struct task_struct *task) +void cgroup_release(struct task_struct *task) { - struct css_set *cset = task_css_set(task); struct cgroup_subsys *ss; int ssid; - do_each_subsys_mask(ss, ssid, have_free_callback) { - ss->free(task); + do_each_subsys_mask(ss, ssid, have_release_callback) { + ss->release(task); } while_each_subsys_mask(); +} +void cgroup_free(struct task_struct *task) +{ + struct css_set *cset = task_css_set(task); put_css_set(cset); } diff --git a/kernel/cgroup/pids.c b/kernel/cgroup/pids.c index 9829c67ebc0a..c9960baaa14f 100644 --- a/kernel/cgroup/pids.c +++ b/kernel/cgroup/pids.c @@ -247,7 +247,7 @@ static void pids_cancel_fork(struct task_struct *task) pids_uncharge(pids, 1); } -static void pids_free(struct task_struct *task) +static void pids_release(struct task_struct *task) { struct pids_cgroup *pids = css_pids(task_css(task, pids_cgrp_id)); @@ -342,7 +342,7 @@ struct cgroup_subsys pids_cgrp_subsys = { .cancel_attach = pids_cancel_attach, .can_fork = pids_can_fork, .cancel_fork = pids_cancel_fork, - .free = pids_free, + .release = pids_release, .legacy_cftypes = pids_files, .dfl_cftypes = pids_files, .threaded = true, diff --git a/kernel/exit.c b/kernel/exit.c index 3fb7be001964..c2b8443f30b4 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -219,6 +219,7 @@ repeat: } write_unlock_irq(&tasklist_lock); + cgroup_release(p); release_thread(p); call_rcu(&p->rcu, delayed_put_task_struct); -- cgit v1.3-7-g2ca7 From 6cab5e90ab2bd323c9f3811b6c70a4687df51e27 Mon Sep 17 00:00:00 2001 From: Alexei Starovoitov Date: Mon, 28 Jan 2019 18:43:34 -0800 Subject: bpf: run bpf programs with preemption disabled Disabled preemption is necessary for proper access to per-cpu maps from BPF programs. But the sender side of socket filters didn't have preemption disabled: unix_dgram_sendmsg->sk_filter->sk_filter_trim_cap->bpf_prog_run_save_cb->BPF_PROG_RUN and a combination of af_packet with tun device didn't disable either: tpacket_snd->packet_direct_xmit->packet_pick_tx_queue->ndo_select_queue-> tun_select_queue->tun_ebpf_select_queue->bpf_prog_run_clear_cb->BPF_PROG_RUN Disable preemption before executing BPF programs (both classic and extended). Reported-by: Jann Horn Signed-off-by: Alexei Starovoitov Acked-by: Song Liu Signed-off-by: Daniel Borkmann --- include/linux/filter.h | 21 ++++++++++++++++++--- kernel/bpf/cgroup.c | 2 +- 2 files changed, 19 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/include/linux/filter.h b/include/linux/filter.h index ad106d845b22..e532fcc6e4b5 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -591,8 +591,8 @@ static inline u8 *bpf_skb_cb(struct sk_buff *skb) return qdisc_skb_cb(skb)->data; } -static inline u32 bpf_prog_run_save_cb(const struct bpf_prog *prog, - struct sk_buff *skb) +static inline u32 __bpf_prog_run_save_cb(const struct bpf_prog *prog, + struct sk_buff *skb) { u8 *cb_data = bpf_skb_cb(skb); u8 cb_saved[BPF_SKB_CB_LEN]; @@ -611,15 +611,30 @@ static inline u32 bpf_prog_run_save_cb(const struct bpf_prog *prog, return res; } +static inline u32 bpf_prog_run_save_cb(const struct bpf_prog *prog, + struct sk_buff *skb) +{ + u32 res; + + preempt_disable(); + res = __bpf_prog_run_save_cb(prog, skb); + preempt_enable(); + return res; +} + static inline u32 bpf_prog_run_clear_cb(const struct bpf_prog *prog, struct sk_buff *skb) { u8 *cb_data = bpf_skb_cb(skb); + u32 res; if (unlikely(prog->cb_access)) memset(cb_data, 0, BPF_SKB_CB_LEN); - return BPF_PROG_RUN(prog, skb); + preempt_disable(); + res = BPF_PROG_RUN(prog, skb); + preempt_enable(); + return res; } static __always_inline u32 bpf_prog_run_xdp(const struct bpf_prog *prog, diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c index ab612fe9862f..d17d05570a3f 100644 --- a/kernel/bpf/cgroup.c +++ b/kernel/bpf/cgroup.c @@ -572,7 +572,7 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk, bpf_compute_and_save_data_end(skb, &saved_data_end); ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[type], skb, - bpf_prog_run_save_cb); + __bpf_prog_run_save_cb); bpf_restore_data_end(skb, saved_data_end); __skb_pull(skb, offset); skb->sk = save_sk; -- cgit v1.3-7-g2ca7 From a89fac57b5d080771efd4d71feaae19877cf68f0 Mon Sep 17 00:00:00 2001 From: Alexei Starovoitov Date: Wed, 30 Jan 2019 18:12:43 -0800 Subject: bpf: fix lockdep false positive in percpu_freelist Lockdep warns about false positive: [ 12.492084] 00000000e6b28347 (&head->lock){+...}, at: pcpu_freelist_push+0x2a/0x40 [ 12.492696] but this lock was taken by another, HARDIRQ-safe lock in the past: [ 12.493275] (&rq->lock){-.-.} [ 12.493276] [ 12.493276] [ 12.493276] and interrupts could create inverse lock ordering between them. [ 12.493276] [ 12.494435] [ 12.494435] other info that might help us debug this: [ 12.494979] Possible interrupt unsafe locking scenario: [ 12.494979] [ 12.495518] CPU0 CPU1 [ 12.495879] ---- ---- [ 12.496243] lock(&head->lock); [ 12.496502] local_irq_disable(); [ 12.496969] lock(&rq->lock); [ 12.497431] lock(&head->lock); [ 12.497890] [ 12.498104] lock(&rq->lock); [ 12.498368] [ 12.498368] *** DEADLOCK *** [ 12.498368] [ 12.498837] 1 lock held by dd/276: [ 12.499110] #0: 00000000c58cb2ee (rcu_read_lock){....}, at: trace_call_bpf+0x5e/0x240 [ 12.499747] [ 12.499747] the shortest dependencies between 2nd lock and 1st lock: [ 12.500389] -> (&rq->lock){-.-.} { [ 12.500669] IN-HARDIRQ-W at: [ 12.500934] _raw_spin_lock+0x2f/0x40 [ 12.501373] scheduler_tick+0x4c/0xf0 [ 12.501812] update_process_times+0x40/0x50 [ 12.502294] tick_periodic+0x27/0xb0 [ 12.502723] tick_handle_periodic+0x1f/0x60 [ 12.503203] timer_interrupt+0x11/0x20 [ 12.503651] __handle_irq_event_percpu+0x43/0x2c0 [ 12.504167] handle_irq_event_percpu+0x20/0x50 [ 12.504674] handle_irq_event+0x37/0x60 [ 12.505139] handle_level_irq+0xa7/0x120 [ 12.505601] handle_irq+0xa1/0x150 [ 12.506018] do_IRQ+0x77/0x140 [ 12.506411] ret_from_intr+0x0/0x1d [ 12.506834] _raw_spin_unlock_irqrestore+0x53/0x60 [ 12.507362] __setup_irq+0x481/0x730 [ 12.507789] setup_irq+0x49/0x80 [ 12.508195] hpet_time_init+0x21/0x32 [ 12.508644] x86_late_time_init+0xb/0x16 [ 12.509106] start_kernel+0x390/0x42a [ 12.509554] secondary_startup_64+0xa4/0xb0 [ 12.510034] IN-SOFTIRQ-W at: [ 12.510305] _raw_spin_lock+0x2f/0x40 [ 12.510772] try_to_wake_up+0x1c7/0x4e0 [ 12.511220] swake_up_locked+0x20/0x40 [ 12.511657] swake_up_one+0x1a/0x30 [ 12.512070] rcu_process_callbacks+0xc5/0x650 [ 12.512553] __do_softirq+0xe6/0x47b [ 12.512978] irq_exit+0xc3/0xd0 [ 12.513372] smp_apic_timer_interrupt+0xa9/0x250 [ 12.513876] apic_timer_interrupt+0xf/0x20 [ 12.514343] default_idle+0x1c/0x170 [ 12.514765] do_idle+0x199/0x240 [ 12.515159] cpu_startup_entry+0x19/0x20 [ 12.515614] start_kernel+0x422/0x42a [ 12.516045] secondary_startup_64+0xa4/0xb0 [ 12.516521] INITIAL USE at: [ 12.516774] _raw_spin_lock_irqsave+0x38/0x50 [ 12.517258] rq_attach_root+0x16/0xd0 [ 12.517685] sched_init+0x2f2/0x3eb [ 12.518096] start_kernel+0x1fb/0x42a [ 12.518525] secondary_startup_64+0xa4/0xb0 [ 12.518986] } [ 12.519132] ... key at: [] __key.71384+0x0/0x8 [ 12.519649] ... acquired at: [ 12.519892] pcpu_freelist_pop+0x7b/0xd0 [ 12.520221] bpf_get_stackid+0x1d2/0x4d0 [ 12.520563] ___bpf_prog_run+0x8b4/0x11a0 [ 12.520887] [ 12.521008] -> (&head->lock){+...} { [ 12.521292] HARDIRQ-ON-W at: [ 12.521539] _raw_spin_lock+0x2f/0x40 [ 12.521950] pcpu_freelist_push+0x2a/0x40 [ 12.522396] bpf_get_stackid+0x494/0x4d0 [ 12.522828] ___bpf_prog_run+0x8b4/0x11a0 [ 12.523296] INITIAL USE at: [ 12.523537] _raw_spin_lock+0x2f/0x40 [ 12.523944] pcpu_freelist_populate+0xc0/0x120 [ 12.524417] htab_map_alloc+0x405/0x500 [ 12.524835] __do_sys_bpf+0x1a3/0x1a90 [ 12.525253] do_syscall_64+0x4a/0x180 [ 12.525659] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 12.526167] } [ 12.526311] ... key at: [] __key.13130+0x0/0x8 [ 12.526812] ... acquired at: [ 12.527047] __lock_acquire+0x521/0x1350 [ 12.527371] lock_acquire+0x98/0x190 [ 12.527680] _raw_spin_lock+0x2f/0x40 [ 12.527994] pcpu_freelist_push+0x2a/0x40 [ 12.528325] bpf_get_stackid+0x494/0x4d0 [ 12.528645] ___bpf_prog_run+0x8b4/0x11a0 [ 12.528970] [ 12.529092] [ 12.529092] stack backtrace: [ 12.529444] CPU: 0 PID: 276 Comm: dd Not tainted 5.0.0-rc3-00018-g2fa53f892422 #475 [ 12.530043] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014 [ 12.530750] Call Trace: [ 12.530948] dump_stack+0x5f/0x8b [ 12.531248] check_usage_backwards+0x10c/0x120 [ 12.531598] ? ___bpf_prog_run+0x8b4/0x11a0 [ 12.531935] ? mark_lock+0x382/0x560 [ 12.532229] mark_lock+0x382/0x560 [ 12.532496] ? print_shortest_lock_dependencies+0x180/0x180 [ 12.532928] __lock_acquire+0x521/0x1350 [ 12.533271] ? find_get_entry+0x17f/0x2e0 [ 12.533586] ? find_get_entry+0x19c/0x2e0 [ 12.533902] ? lock_acquire+0x98/0x190 [ 12.534196] lock_acquire+0x98/0x190 [ 12.534482] ? pcpu_freelist_push+0x2a/0x40 [ 12.534810] _raw_spin_lock+0x2f/0x40 [ 12.535099] ? pcpu_freelist_push+0x2a/0x40 [ 12.535432] pcpu_freelist_push+0x2a/0x40 [ 12.535750] bpf_get_stackid+0x494/0x4d0 [ 12.536062] ___bpf_prog_run+0x8b4/0x11a0 It has been explained that is a false positive here: https://lkml.org/lkml/2018/7/25/756 Recap: - stackmap uses pcpu_freelist - The lock in pcpu_freelist is a percpu lock - stackmap is only used by tracing bpf_prog - A tracing bpf_prog cannot be run if another bpf_prog has already been running (ensured by the percpu bpf_prog_active counter). Eric pointed out that this lockdep splats stops other legit lockdep splats in selftests/bpf/test_progs.c. Fix this by calling local_irq_save/restore for stackmap. Another false positive had also been worked around by calling local_irq_save in commit 89ad2fa3f043 ("bpf: fix lockdep splat"). That commit added unnecessary irq_save/restore to fast path of bpf hash map. irqs are already disabled at that point, since htab is holding per bucket spin_lock with irqsave. Let's reduce overhead for htab by introducing __pcpu_freelist_push/pop function w/o irqsave and convert pcpu_freelist_push/pop to irqsave to be used elsewhere (right now only in stackmap). It stops lockdep false positive in stackmap with a bit of acceptable overhead. Fixes: 557c0c6e7df8 ("bpf: convert stackmap to pre-allocation") Reported-by: Naresh Kamboju Reported-by: Eric Dumazet Acked-by: Martin KaFai Lau Signed-off-by: Alexei Starovoitov Signed-off-by: Daniel Borkmann --- kernel/bpf/hashtab.c | 4 ++-- kernel/bpf/percpu_freelist.c | 41 +++++++++++++++++++++++++++++------------ kernel/bpf/percpu_freelist.h | 4 ++++ 3 files changed, 35 insertions(+), 14 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c index 4b7c76765d9d..f9274114c88d 100644 --- a/kernel/bpf/hashtab.c +++ b/kernel/bpf/hashtab.c @@ -686,7 +686,7 @@ static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l) } if (htab_is_prealloc(htab)) { - pcpu_freelist_push(&htab->freelist, &l->fnode); + __pcpu_freelist_push(&htab->freelist, &l->fnode); } else { atomic_dec(&htab->count); l->htab = htab; @@ -748,7 +748,7 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key, } else { struct pcpu_freelist_node *l; - l = pcpu_freelist_pop(&htab->freelist); + l = __pcpu_freelist_pop(&htab->freelist); if (!l) return ERR_PTR(-E2BIG); l_new = container_of(l, struct htab_elem, fnode); diff --git a/kernel/bpf/percpu_freelist.c b/kernel/bpf/percpu_freelist.c index 673fa6fe2d73..0c1b4ba9e90e 100644 --- a/kernel/bpf/percpu_freelist.c +++ b/kernel/bpf/percpu_freelist.c @@ -28,8 +28,8 @@ void pcpu_freelist_destroy(struct pcpu_freelist *s) free_percpu(s->freelist); } -static inline void __pcpu_freelist_push(struct pcpu_freelist_head *head, - struct pcpu_freelist_node *node) +static inline void ___pcpu_freelist_push(struct pcpu_freelist_head *head, + struct pcpu_freelist_node *node) { raw_spin_lock(&head->lock); node->next = head->first; @@ -37,12 +37,22 @@ static inline void __pcpu_freelist_push(struct pcpu_freelist_head *head, raw_spin_unlock(&head->lock); } -void pcpu_freelist_push(struct pcpu_freelist *s, +void __pcpu_freelist_push(struct pcpu_freelist *s, struct pcpu_freelist_node *node) { struct pcpu_freelist_head *head = this_cpu_ptr(s->freelist); - __pcpu_freelist_push(head, node); + ___pcpu_freelist_push(head, node); +} + +void pcpu_freelist_push(struct pcpu_freelist *s, + struct pcpu_freelist_node *node) +{ + unsigned long flags; + + local_irq_save(flags); + __pcpu_freelist_push(s, node); + local_irq_restore(flags); } void pcpu_freelist_populate(struct pcpu_freelist *s, void *buf, u32 elem_size, @@ -63,7 +73,7 @@ void pcpu_freelist_populate(struct pcpu_freelist *s, void *buf, u32 elem_size, for_each_possible_cpu(cpu) { again: head = per_cpu_ptr(s->freelist, cpu); - __pcpu_freelist_push(head, buf); + ___pcpu_freelist_push(head, buf); i++; buf += elem_size; if (i == nr_elems) @@ -74,14 +84,12 @@ again: local_irq_restore(flags); } -struct pcpu_freelist_node *pcpu_freelist_pop(struct pcpu_freelist *s) +struct pcpu_freelist_node *__pcpu_freelist_pop(struct pcpu_freelist *s) { struct pcpu_freelist_head *head; struct pcpu_freelist_node *node; - unsigned long flags; int orig_cpu, cpu; - local_irq_save(flags); orig_cpu = cpu = raw_smp_processor_id(); while (1) { head = per_cpu_ptr(s->freelist, cpu); @@ -89,16 +97,25 @@ struct pcpu_freelist_node *pcpu_freelist_pop(struct pcpu_freelist *s) node = head->first; if (node) { head->first = node->next; - raw_spin_unlock_irqrestore(&head->lock, flags); + raw_spin_unlock(&head->lock); return node; } raw_spin_unlock(&head->lock); cpu = cpumask_next(cpu, cpu_possible_mask); if (cpu >= nr_cpu_ids) cpu = 0; - if (cpu == orig_cpu) { - local_irq_restore(flags); + if (cpu == orig_cpu) return NULL; - } } } + +struct pcpu_freelist_node *pcpu_freelist_pop(struct pcpu_freelist *s) +{ + struct pcpu_freelist_node *ret; + unsigned long flags; + + local_irq_save(flags); + ret = __pcpu_freelist_pop(s); + local_irq_restore(flags); + return ret; +} diff --git a/kernel/bpf/percpu_freelist.h b/kernel/bpf/percpu_freelist.h index 3049aae8ea1e..c3960118e617 100644 --- a/kernel/bpf/percpu_freelist.h +++ b/kernel/bpf/percpu_freelist.h @@ -22,8 +22,12 @@ struct pcpu_freelist_node { struct pcpu_freelist_node *next; }; +/* pcpu_freelist_* do spin_lock_irqsave. */ void pcpu_freelist_push(struct pcpu_freelist *, struct pcpu_freelist_node *); struct pcpu_freelist_node *pcpu_freelist_pop(struct pcpu_freelist *); +/* __pcpu_freelist_* do spin_lock only. caller must disable irqs. */ +void __pcpu_freelist_push(struct pcpu_freelist *, struct pcpu_freelist_node *); +struct pcpu_freelist_node *__pcpu_freelist_pop(struct pcpu_freelist *); void pcpu_freelist_populate(struct pcpu_freelist *s, void *buf, u32 elem_size, u32 nr_elems); int pcpu_freelist_init(struct pcpu_freelist *); -- cgit v1.3-7-g2ca7 From e16ec34039c701594d55d08a5aa49ee3e1abc821 Mon Sep 17 00:00:00 2001 From: Alexei Starovoitov Date: Wed, 30 Jan 2019 18:12:44 -0800 Subject: bpf: fix potential deadlock in bpf_prog_register Lockdep found a potential deadlock between cpu_hotplug_lock, bpf_event_mutex, and cpuctx_mutex: [ 13.007000] WARNING: possible circular locking dependency detected [ 13.007587] 5.0.0-rc3-00018-g2fa53f892422-dirty #477 Not tainted [ 13.008124] ------------------------------------------------------ [ 13.008624] test_progs/246 is trying to acquire lock: [ 13.009030] 0000000094160d1d (tracepoints_mutex){+.+.}, at: tracepoint_probe_register_prio+0x2d/0x300 [ 13.009770] [ 13.009770] but task is already holding lock: [ 13.010239] 00000000d663ef86 (bpf_event_mutex){+.+.}, at: bpf_probe_register+0x1d/0x60 [ 13.010877] [ 13.010877] which lock already depends on the new lock. [ 13.010877] [ 13.011532] [ 13.011532] the existing dependency chain (in reverse order) is: [ 13.012129] [ 13.012129] -> #4 (bpf_event_mutex){+.+.}: [ 13.012582] perf_event_query_prog_array+0x9b/0x130 [ 13.013016] _perf_ioctl+0x3aa/0x830 [ 13.013354] perf_ioctl+0x2e/0x50 [ 13.013668] do_vfs_ioctl+0x8f/0x6a0 [ 13.014003] ksys_ioctl+0x70/0x80 [ 13.014320] __x64_sys_ioctl+0x16/0x20 [ 13.014668] do_syscall_64+0x4a/0x180 [ 13.015007] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 13.015469] [ 13.015469] -> #3 (&cpuctx_mutex){+.+.}: [ 13.015910] perf_event_init_cpu+0x5a/0x90 [ 13.016291] perf_event_init+0x1b2/0x1de [ 13.016654] start_kernel+0x2b8/0x42a [ 13.016995] secondary_startup_64+0xa4/0xb0 [ 13.017382] [ 13.017382] -> #2 (pmus_lock){+.+.}: [ 13.017794] perf_event_init_cpu+0x21/0x90 [ 13.018172] cpuhp_invoke_callback+0xb3/0x960 [ 13.018573] _cpu_up+0xa7/0x140 [ 13.018871] do_cpu_up+0xa4/0xc0 [ 13.019178] smp_init+0xcd/0xd2 [ 13.019483] kernel_init_freeable+0x123/0x24f [ 13.019878] kernel_init+0xa/0x110 [ 13.020201] ret_from_fork+0x24/0x30 [ 13.020541] [ 13.020541] -> #1 (cpu_hotplug_lock.rw_sem){++++}: [ 13.021051] static_key_slow_inc+0xe/0x20 [ 13.021424] tracepoint_probe_register_prio+0x28c/0x300 [ 13.021891] perf_trace_event_init+0x11f/0x250 [ 13.022297] perf_trace_init+0x6b/0xa0 [ 13.022644] perf_tp_event_init+0x25/0x40 [ 13.023011] perf_try_init_event+0x6b/0x90 [ 13.023386] perf_event_alloc+0x9a8/0xc40 [ 13.023754] __do_sys_perf_event_open+0x1dd/0xd30 [ 13.024173] do_syscall_64+0x4a/0x180 [ 13.024519] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 13.024968] [ 13.024968] -> #0 (tracepoints_mutex){+.+.}: [ 13.025434] __mutex_lock+0x86/0x970 [ 13.025764] tracepoint_probe_register_prio+0x2d/0x300 [ 13.026215] bpf_probe_register+0x40/0x60 [ 13.026584] bpf_raw_tracepoint_open.isra.34+0xa4/0x130 [ 13.027042] __do_sys_bpf+0x94f/0x1a90 [ 13.027389] do_syscall_64+0x4a/0x180 [ 13.027727] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 13.028171] [ 13.028171] other info that might help us debug this: [ 13.028171] [ 13.028807] Chain exists of: [ 13.028807] tracepoints_mutex --> &cpuctx_mutex --> bpf_event_mutex [ 13.028807] [ 13.029666] Possible unsafe locking scenario: [ 13.029666] [ 13.030140] CPU0 CPU1 [ 13.030510] ---- ---- [ 13.030875] lock(bpf_event_mutex); [ 13.031166] lock(&cpuctx_mutex); [ 13.031645] lock(bpf_event_mutex); [ 13.032135] lock(tracepoints_mutex); [ 13.032441] [ 13.032441] *** DEADLOCK *** [ 13.032441] [ 13.032911] 1 lock held by test_progs/246: [ 13.033239] #0: 00000000d663ef86 (bpf_event_mutex){+.+.}, at: bpf_probe_register+0x1d/0x60 [ 13.033909] [ 13.033909] stack backtrace: [ 13.034258] CPU: 1 PID: 246 Comm: test_progs Not tainted 5.0.0-rc3-00018-g2fa53f892422-dirty #477 [ 13.034964] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014 [ 13.035657] Call Trace: [ 13.035859] dump_stack+0x5f/0x8b [ 13.036130] print_circular_bug.isra.37+0x1ce/0x1db [ 13.036526] __lock_acquire+0x1158/0x1350 [ 13.036852] ? lock_acquire+0x98/0x190 [ 13.037154] lock_acquire+0x98/0x190 [ 13.037447] ? tracepoint_probe_register_prio+0x2d/0x300 [ 13.037876] __mutex_lock+0x86/0x970 [ 13.038167] ? tracepoint_probe_register_prio+0x2d/0x300 [ 13.038600] ? tracepoint_probe_register_prio+0x2d/0x300 [ 13.039028] ? __mutex_lock+0x86/0x970 [ 13.039337] ? __mutex_lock+0x24a/0x970 [ 13.039649] ? bpf_probe_register+0x1d/0x60 [ 13.039992] ? __bpf_trace_sched_wake_idle_without_ipi+0x10/0x10 [ 13.040478] ? tracepoint_probe_register_prio+0x2d/0x300 [ 13.040906] tracepoint_probe_register_prio+0x2d/0x300 [ 13.041325] bpf_probe_register+0x40/0x60 [ 13.041649] bpf_raw_tracepoint_open.isra.34+0xa4/0x130 [ 13.042068] ? __might_fault+0x3e/0x90 [ 13.042374] __do_sys_bpf+0x94f/0x1a90 [ 13.042678] do_syscall_64+0x4a/0x180 [ 13.042975] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 13.043382] RIP: 0033:0x7f23b10a07f9 [ 13.045155] RSP: 002b:00007ffdef42fdd8 EFLAGS: 00000202 ORIG_RAX: 0000000000000141 [ 13.045759] RAX: ffffffffffffffda RBX: 00007ffdef42ff70 RCX: 00007f23b10a07f9 [ 13.046326] RDX: 0000000000000070 RSI: 00007ffdef42fe10 RDI: 0000000000000011 [ 13.046893] RBP: 00007ffdef42fdf0 R08: 0000000000000038 R09: 00007ffdef42fe10 [ 13.047462] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000 [ 13.048029] R13: 0000000000000016 R14: 00007f23b1db4690 R15: 0000000000000000 Since tracepoints_mutex will be taken in tracepoint_probe_register/unregister() there is no need to take bpf_event_mutex too. bpf_event_mutex is protecting modifications to prog array used in kprobe/perf bpf progs. bpf_raw_tracepoints don't need to take this mutex. Fixes: c4f6699dfcb8 ("bpf: introduce BPF_RAW_TRACEPOINT") Acked-by: Martin KaFai Lau Signed-off-by: Alexei Starovoitov Signed-off-by: Daniel Borkmann --- kernel/trace/bpf_trace.c | 14 ++------------ 1 file changed, 2 insertions(+), 12 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index 8b068adb9da1..f1a86a0d881d 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -1204,22 +1204,12 @@ static int __bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_prog * int bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_prog *prog) { - int err; - - mutex_lock(&bpf_event_mutex); - err = __bpf_probe_register(btp, prog); - mutex_unlock(&bpf_event_mutex); - return err; + return __bpf_probe_register(btp, prog); } int bpf_probe_unregister(struct bpf_raw_event_map *btp, struct bpf_prog *prog) { - int err; - - mutex_lock(&bpf_event_mutex); - err = tracepoint_probe_unregister(btp->tp, (void *)btp->bpf_func, prog); - mutex_unlock(&bpf_event_mutex); - return err; + return tracepoint_probe_unregister(btp->tp, (void *)btp->bpf_func, prog); } int bpf_get_perf_event_info(const struct perf_event *event, u32 *prog_id, -- cgit v1.3-7-g2ca7 From 7c4cd051add3d00bbff008a133c936c515eaa8fe Mon Sep 17 00:00:00 2001 From: Martin KaFai Lau Date: Wed, 30 Jan 2019 18:12:45 -0800 Subject: bpf: Fix syscall's stackmap lookup potential deadlock The map_lookup_elem used to not acquiring spinlock in order to optimize the reader. It was true until commit 557c0c6e7df8 ("bpf: convert stackmap to pre-allocation") The syscall's map_lookup_elem(stackmap) calls bpf_stackmap_copy(). bpf_stackmap_copy() may find the elem no longer needed after the copy is done. If that is the case, pcpu_freelist_push() saves this elem for reuse later. This push requires a spinlock. If a tracing bpf_prog got run in the middle of the syscall's map_lookup_elem(stackmap) and this tracing bpf_prog is calling bpf_get_stackid(stackmap) which also requires the same pcpu_freelist's spinlock, it may end up with a dead lock situation as reported by Eric Dumazet in https://patchwork.ozlabs.org/patch/1030266/ The situation is the same as the syscall's map_update_elem() which needs to acquire the pcpu_freelist's spinlock and could race with tracing bpf_prog. Hence, this patch fixes it by protecting bpf_stackmap_copy() with this_cpu_inc(bpf_prog_active) to prevent tracing bpf_prog from running. A later syscall's map_lookup_elem commit f1a2e44a3aec ("bpf: add queue and stack maps") also acquires a spinlock and races with tracing bpf_prog similarly. Hence, this patch is forward looking and protects the majority of the map lookups. bpf_map_offload_lookup_elem() is the exception since it is for network bpf_prog only (i.e. never called by tracing bpf_prog). Fixes: 557c0c6e7df8 ("bpf: convert stackmap to pre-allocation") Reported-by: Eric Dumazet Acked-by: Alexei Starovoitov Signed-off-by: Martin KaFai Lau Signed-off-by: Alexei Starovoitov Signed-off-by: Daniel Borkmann --- kernel/bpf/syscall.c | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index b155cd17c1bd..8577bb7f8be6 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -713,8 +713,13 @@ static int map_lookup_elem(union bpf_attr *attr) if (bpf_map_is_dev_bound(map)) { err = bpf_map_offload_lookup_elem(map, key, value); - } else if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH || - map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH) { + goto done; + } + + preempt_disable(); + this_cpu_inc(bpf_prog_active); + if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH || + map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH) { err = bpf_percpu_hash_copy(map, key, value); } else if (map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY) { err = bpf_percpu_array_copy(map, key, value); @@ -744,7 +749,10 @@ static int map_lookup_elem(union bpf_attr *attr) } rcu_read_unlock(); } + this_cpu_dec(bpf_prog_active); + preempt_enable(); +done: if (err) goto free_value; -- cgit v1.3-7-g2ca7 From 90462a5bd30c6ed91c6758e59537d047d7878ff9 Mon Sep 17 00:00:00 2001 From: Richard Guy Briggs Date: Thu, 31 Jan 2019 11:52:11 -0500 Subject: audit: remove unused actx param from audit_rule_match The audit_rule_match() struct audit_context *actx parameter is not used by any in-tree consumers (selinux, apparmour, integrity, smack). The audit context is an internal audit structure that should only be accessed by audit accessor functions. It was part of commit 03d37d25e0f9 ("LSM/Audit: Introduce generic Audit LSM hooks") but appears to have never been used. Remove it. Please see the github issue https://github.com/linux-audit/audit-kernel/issues/107 Signed-off-by: Richard Guy Briggs [PM: fixed the referenced commit title] Signed-off-by: Paul Moore --- include/linux/lsm_hooks.h | 4 +--- include/linux/security.h | 5 ++--- kernel/auditfilter.c | 2 +- kernel/auditsc.c | 21 ++++++++++++--------- security/apparmor/audit.c | 3 +-- security/apparmor/include/audit.h | 3 +-- security/integrity/ima/ima.h | 3 +-- security/integrity/ima/ima_policy.c | 6 ++---- security/security.c | 6 ++---- security/selinux/include/audit.h | 4 +--- security/selinux/ss/services.c | 3 +-- security/smack/smack_lsm.c | 4 +--- 12 files changed, 26 insertions(+), 38 deletions(-) (limited to 'kernel') diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h index 9a0bdf91e646..d0b5c7a05832 100644 --- a/include/linux/lsm_hooks.h +++ b/include/linux/lsm_hooks.h @@ -1344,7 +1344,6 @@ * @field contains the field which relates to current LSM. * @op contains the operator that will be used for matching. * @rule points to the audit rule that will be checked against. - * @actx points to the audit context associated with the check. * Return 1 if secid matches the rule, 0 if it does not, -ERRNO on failure. * * @audit_rule_free: @@ -1764,8 +1763,7 @@ union security_list_options { int (*audit_rule_init)(u32 field, u32 op, char *rulestr, void **lsmrule); int (*audit_rule_known)(struct audit_krule *krule); - int (*audit_rule_match)(u32 secid, u32 field, u32 op, void *lsmrule, - struct audit_context *actx); + int (*audit_rule_match)(u32 secid, u32 field, u32 op, void *lsmrule); void (*audit_rule_free)(void *lsmrule); #endif /* CONFIG_AUDIT */ diff --git a/include/linux/security.h b/include/linux/security.h index dbfb5a66babb..e8febec62ffb 100644 --- a/include/linux/security.h +++ b/include/linux/security.h @@ -1674,8 +1674,7 @@ static inline int security_key_getsecurity(struct key *key, char **_buffer) #ifdef CONFIG_SECURITY int security_audit_rule_init(u32 field, u32 op, char *rulestr, void **lsmrule); int security_audit_rule_known(struct audit_krule *krule); -int security_audit_rule_match(u32 secid, u32 field, u32 op, void *lsmrule, - struct audit_context *actx); +int security_audit_rule_match(u32 secid, u32 field, u32 op, void *lsmrule); void security_audit_rule_free(void *lsmrule); #else @@ -1692,7 +1691,7 @@ static inline int security_audit_rule_known(struct audit_krule *krule) } static inline int security_audit_rule_match(u32 secid, u32 field, u32 op, - void *lsmrule, struct audit_context *actx) + void *lsmrule) { return 0; } diff --git a/kernel/auditfilter.c b/kernel/auditfilter.c index 26a80a9d43a9..add360b46b38 100644 --- a/kernel/auditfilter.c +++ b/kernel/auditfilter.c @@ -1355,7 +1355,7 @@ int audit_filter(int msgtype, unsigned int listtype) if (f->lsm_rule) { security_task_getsecid(current, &sid); result = security_audit_rule_match(sid, - f->type, f->op, f->lsm_rule, NULL); + f->type, f->op, f->lsm_rule); } break; case AUDIT_EXE: diff --git a/kernel/auditsc.c b/kernel/auditsc.c index 68da71001096..7d37cb1e4aef 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -631,9 +631,8 @@ static int audit_filter_rules(struct task_struct *tsk, need_sid = 0; } result = security_audit_rule_match(sid, f->type, - f->op, - f->lsm_rule, - ctx); + f->op, + f->lsm_rule); } break; case AUDIT_OBJ_USER: @@ -647,13 +646,17 @@ static int audit_filter_rules(struct task_struct *tsk, /* Find files that match */ if (name) { result = security_audit_rule_match( - name->osid, f->type, f->op, - f->lsm_rule, ctx); + name->osid, + f->type, + f->op, + f->lsm_rule); } else if (ctx) { list_for_each_entry(n, &ctx->names_list, list) { - if (security_audit_rule_match(n->osid, f->type, - f->op, f->lsm_rule, - ctx)) { + if (security_audit_rule_match( + n->osid, + f->type, + f->op, + f->lsm_rule)) { ++result; break; } @@ -664,7 +667,7 @@ static int audit_filter_rules(struct task_struct *tsk, break; if (security_audit_rule_match(ctx->ipc.osid, f->type, f->op, - f->lsm_rule, ctx)) + f->lsm_rule)) ++result; } break; diff --git a/security/apparmor/audit.c b/security/apparmor/audit.c index eeaddfe0c0fb..5a8b9cded4f2 100644 --- a/security/apparmor/audit.c +++ b/security/apparmor/audit.c @@ -225,8 +225,7 @@ int aa_audit_rule_known(struct audit_krule *rule) return 0; } -int aa_audit_rule_match(u32 sid, u32 field, u32 op, void *vrule, - struct audit_context *actx) +int aa_audit_rule_match(u32 sid, u32 field, u32 op, void *vrule) { struct aa_audit_rule *rule = vrule; struct aa_label *label; diff --git a/security/apparmor/include/audit.h b/security/apparmor/include/audit.h index b8c8b1066b0a..ee559bc2acb8 100644 --- a/security/apparmor/include/audit.h +++ b/security/apparmor/include/audit.h @@ -192,7 +192,6 @@ static inline int complain_error(int error) void aa_audit_rule_free(void *vrule); int aa_audit_rule_init(u32 field, u32 op, char *rulestr, void **vrule); int aa_audit_rule_known(struct audit_krule *rule); -int aa_audit_rule_match(u32 sid, u32 field, u32 op, void *vrule, - struct audit_context *actx); +int aa_audit_rule_match(u32 sid, u32 field, u32 op, void *vrule); #endif /* __AA_AUDIT_H */ diff --git a/security/integrity/ima/ima.h b/security/integrity/ima/ima.h index cc12f3449a72..026163f37ba1 100644 --- a/security/integrity/ima/ima.h +++ b/security/integrity/ima/ima.h @@ -307,8 +307,7 @@ static inline int security_filter_rule_init(u32 field, u32 op, char *rulestr, } static inline int security_filter_rule_match(u32 secid, u32 field, u32 op, - void *lsmrule, - struct audit_context *actx) + void *lsmrule) { return -EINVAL; } diff --git a/security/integrity/ima/ima_policy.c b/security/integrity/ima/ima_policy.c index 8bc8a1c8cb3f..26fa9d9723f6 100644 --- a/security/integrity/ima/ima_policy.c +++ b/security/integrity/ima/ima_policy.c @@ -340,8 +340,7 @@ retry: rc = security_filter_rule_match(osid, rule->lsm[i].type, Audit_equal, - rule->lsm[i].rule, - NULL); + rule->lsm[i].rule); break; case LSM_SUBJ_USER: case LSM_SUBJ_ROLE: @@ -349,8 +348,7 @@ retry: rc = security_filter_rule_match(secid, rule->lsm[i].type, Audit_equal, - rule->lsm[i].rule, - NULL); + rule->lsm[i].rule); default: break; } diff --git a/security/security.c b/security/security.c index f1b8d2587639..5f954b179a8e 100644 --- a/security/security.c +++ b/security/security.c @@ -1783,11 +1783,9 @@ void security_audit_rule_free(void *lsmrule) call_void_hook(audit_rule_free, lsmrule); } -int security_audit_rule_match(u32 secid, u32 field, u32 op, void *lsmrule, - struct audit_context *actx) +int security_audit_rule_match(u32 secid, u32 field, u32 op, void *lsmrule) { - return call_int_hook(audit_rule_match, 0, secid, field, op, lsmrule, - actx); + return call_int_hook(audit_rule_match, 0, secid, field, op, lsmrule); } #endif /* CONFIG_AUDIT */ diff --git a/security/selinux/include/audit.h b/security/selinux/include/audit.h index 1bdf973433cc..e51a81ffb8c9 100644 --- a/security/selinux/include/audit.h +++ b/security/selinux/include/audit.h @@ -46,13 +46,11 @@ void selinux_audit_rule_free(void *rule); * @field: the field this rule refers to * @op: the operater the rule uses * @rule: pointer to the audit rule to check against - * @actx: the audit context (can be NULL) associated with the check * * Returns 1 if the context id matches the rule, 0 if it does not, and * -errno on failure. */ -int selinux_audit_rule_match(u32 sid, u32 field, u32 op, void *rule, - struct audit_context *actx); +int selinux_audit_rule_match(u32 sid, u32 field, u32 op, void *rule); /** * selinux_audit_rule_known - check to see if rule contains selinux fields. diff --git a/security/selinux/ss/services.c b/security/selinux/ss/services.c index dd44126c8d14..0b7e33f6aa59 100644 --- a/security/selinux/ss/services.c +++ b/security/selinux/ss/services.c @@ -3376,8 +3376,7 @@ int selinux_audit_rule_known(struct audit_krule *rule) return 0; } -int selinux_audit_rule_match(u32 sid, u32 field, u32 op, void *vrule, - struct audit_context *actx) +int selinux_audit_rule_match(u32 sid, u32 field, u32 op, void *vrule) { struct selinux_state *state = &selinux_state; struct context *ctxt; diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c index 430d4f35e55c..403513df42fc 100644 --- a/security/smack/smack_lsm.c +++ b/security/smack/smack_lsm.c @@ -4393,13 +4393,11 @@ static int smack_audit_rule_known(struct audit_krule *krule) * @field: audit rule flags given from user-space * @op: required testing operator * @vrule: smack internal rule presentation - * @actx: audit context associated with the check * * The core Audit hook. It's used to take the decision of * whether to audit or not to audit a given object. */ -static int smack_audit_rule_match(u32 secid, u32 field, u32 op, void *vrule, - struct audit_context *actx) +static int smack_audit_rule_match(u32 secid, u32 field, u32 op, void *vrule) { struct smack_known *skp; char *rule = vrule; -- cgit v1.3-7-g2ca7 From 01e7187b41191376cee8bea8de9f907b001e87b4 Mon Sep 17 00:00:00 2001 From: Jann Horn Date: Wed, 23 Jan 2019 15:19:18 +0100 Subject: pipe: stop using ->can_merge Al Viro pointed out that since there is only one pipe buffer type to which new data can be appended, it isn't necessary to have a ->can_merge field in struct pipe_buf_operations, we can just check for a magic type. Suggested-by: Al Viro Signed-off-by: Jann Horn Signed-off-by: Al Viro --- fs/pipe.c | 20 ++++++++++++++++---- fs/splice.c | 4 ---- include/linux/pipe_fs_i.h | 7 ------- kernel/relay.c | 1 - kernel/trace/trace.c | 2 -- net/smc/smc_rx.c | 1 - 6 files changed, 16 insertions(+), 19 deletions(-) (limited to 'kernel') diff --git a/fs/pipe.c b/fs/pipe.c index c51750ed4011..0ff09b490ddf 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -226,8 +226,8 @@ void generic_pipe_buf_release(struct pipe_inode_info *pipe, } EXPORT_SYMBOL(generic_pipe_buf_release); +/* New data written to a pipe may be appended to a buffer with this type. */ static const struct pipe_buf_operations anon_pipe_buf_ops = { - .can_merge = 1, .confirm = generic_pipe_buf_confirm, .release = anon_pipe_buf_release, .steal = anon_pipe_buf_steal, @@ -235,7 +235,6 @@ static const struct pipe_buf_operations anon_pipe_buf_ops = { }; static const struct pipe_buf_operations anon_pipe_buf_nomerge_ops = { - .can_merge = 0, .confirm = generic_pipe_buf_confirm, .release = anon_pipe_buf_release, .steal = anon_pipe_buf_steal, @@ -243,19 +242,32 @@ static const struct pipe_buf_operations anon_pipe_buf_nomerge_ops = { }; static const struct pipe_buf_operations packet_pipe_buf_ops = { - .can_merge = 0, .confirm = generic_pipe_buf_confirm, .release = anon_pipe_buf_release, .steal = anon_pipe_buf_steal, .get = generic_pipe_buf_get, }; +/** + * pipe_buf_mark_unmergeable - mark a &struct pipe_buffer as unmergeable + * @buf: the buffer to mark + * + * Description: + * This function ensures that no future writes will be merged into the + * given &struct pipe_buffer. This is necessary when multiple pipe buffers + * share the same backing page. + */ void pipe_buf_mark_unmergeable(struct pipe_buffer *buf) { if (buf->ops == &anon_pipe_buf_ops) buf->ops = &anon_pipe_buf_nomerge_ops; } +static bool pipe_buf_can_merge(struct pipe_buffer *buf) +{ + return buf->ops == &anon_pipe_buf_ops; +} + static ssize_t pipe_read(struct kiocb *iocb, struct iov_iter *to) { @@ -393,7 +405,7 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from) struct pipe_buffer *buf = pipe->bufs + lastbuf; int offset = buf->offset + buf->len; - if (buf->ops->can_merge && offset + chars <= PAGE_SIZE) { + if (pipe_buf_can_merge(buf) && offset + chars <= PAGE_SIZE) { ret = pipe_buf_confirm(pipe, buf); if (ret) goto out; diff --git a/fs/splice.c b/fs/splice.c index 90c29675d573..fc71e9733f7a 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -138,7 +138,6 @@ error: } const struct pipe_buf_operations page_cache_pipe_buf_ops = { - .can_merge = 0, .confirm = page_cache_pipe_buf_confirm, .release = page_cache_pipe_buf_release, .steal = page_cache_pipe_buf_steal, @@ -156,7 +155,6 @@ static int user_page_pipe_buf_steal(struct pipe_inode_info *pipe, } static const struct pipe_buf_operations user_page_pipe_buf_ops = { - .can_merge = 0, .confirm = generic_pipe_buf_confirm, .release = page_cache_pipe_buf_release, .steal = user_page_pipe_buf_steal, @@ -326,7 +324,6 @@ ssize_t generic_file_splice_read(struct file *in, loff_t *ppos, EXPORT_SYMBOL(generic_file_splice_read); const struct pipe_buf_operations default_pipe_buf_ops = { - .can_merge = 0, .confirm = generic_pipe_buf_confirm, .release = generic_pipe_buf_release, .steal = generic_pipe_buf_steal, @@ -341,7 +338,6 @@ static int generic_pipe_buf_nosteal(struct pipe_inode_info *pipe, /* Pipe buffer operations for a socket and similar. */ const struct pipe_buf_operations nosteal_pipe_buf_ops = { - .can_merge = 0, .confirm = generic_pipe_buf_confirm, .release = generic_pipe_buf_release, .steal = generic_pipe_buf_nosteal, diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h index 3ecd7ea212ae..787d224ff43e 100644 --- a/include/linux/pipe_fs_i.h +++ b/include/linux/pipe_fs_i.h @@ -73,13 +73,6 @@ struct pipe_inode_info { * in fs/pipe.c for the pipe and generic variants of these hooks. */ struct pipe_buf_operations { - /* - * This is set to 1, if the generic pipe read/write may coalesce - * data into an existing buffer. If this is set to 0, a new pipe - * page segment is always used for new data. - */ - int can_merge; - /* * ->confirm() verifies that the data in the pipe buffer is there * and that the contents are good. If the pages in the pipe belong diff --git a/kernel/relay.c b/kernel/relay.c index 04f248644e06..db3e419c25a6 100644 --- a/kernel/relay.c +++ b/kernel/relay.c @@ -1175,7 +1175,6 @@ static void relay_pipe_buf_release(struct pipe_inode_info *pipe, } static const struct pipe_buf_operations relay_pipe_buf_ops = { - .can_merge = 0, .confirm = generic_pipe_buf_confirm, .release = relay_pipe_buf_release, .steal = generic_pipe_buf_steal, diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index c521b7347482..f380139e972c 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -5823,7 +5823,6 @@ static void tracing_spd_release_pipe(struct splice_pipe_desc *spd, } static const struct pipe_buf_operations tracing_pipe_buf_ops = { - .can_merge = 0, .confirm = generic_pipe_buf_confirm, .release = generic_pipe_buf_release, .steal = generic_pipe_buf_steal, @@ -6843,7 +6842,6 @@ static void buffer_pipe_buf_get(struct pipe_inode_info *pipe, /* Pipe buffer operations for a buffer. */ static const struct pipe_buf_operations buffer_pipe_buf_ops = { - .can_merge = 0, .confirm = generic_pipe_buf_confirm, .release = buffer_pipe_buf_release, .steal = generic_pipe_buf_steal, diff --git a/net/smc/smc_rx.c b/net/smc/smc_rx.c index bbcf0fe4ae10..413a6abf227e 100644 --- a/net/smc/smc_rx.c +++ b/net/smc/smc_rx.c @@ -136,7 +136,6 @@ static int smc_rx_pipe_buf_nosteal(struct pipe_inode_info *pipe, } static const struct pipe_buf_operations smc_pipe_ops = { - .can_merge = 0, .confirm = generic_pipe_buf_confirm, .release = smc_rx_pipe_buf_release, .steal = smc_rx_pipe_buf_nosteal, -- cgit v1.3-7-g2ca7 From cfced786969c2a3e1bca45d7055a00311d93ae6c Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Fri, 4 Jan 2019 18:20:05 +0100 Subject: dma-mapping: remove the default map_resource implementation Instead provide a proper implementation in the direct mapping code, and also wire it up for arm and powerpc, leaving an error return for all the IOMMU or virtual mapping instances for which we'd have to wire up an actual implementation Signed-off-by: Christoph Hellwig Tested-by: Marek Szyprowski --- arch/arm/mm/dma-mapping.c | 2 ++ arch/powerpc/kernel/dma-swiotlb.c | 1 + arch/powerpc/kernel/dma.c | 1 + include/linux/dma-mapping.h | 12 +++++++----- kernel/dma/direct.c | 14 ++++++++++++++ 5 files changed, 25 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c index f1e2922e447c..3c8534904209 100644 --- a/arch/arm/mm/dma-mapping.c +++ b/arch/arm/mm/dma-mapping.c @@ -188,6 +188,7 @@ const struct dma_map_ops arm_dma_ops = { .unmap_page = arm_dma_unmap_page, .map_sg = arm_dma_map_sg, .unmap_sg = arm_dma_unmap_sg, + .map_resource = dma_direct_map_resource, .sync_single_for_cpu = arm_dma_sync_single_for_cpu, .sync_single_for_device = arm_dma_sync_single_for_device, .sync_sg_for_cpu = arm_dma_sync_sg_for_cpu, @@ -211,6 +212,7 @@ const struct dma_map_ops arm_coherent_dma_ops = { .get_sgtable = arm_dma_get_sgtable, .map_page = arm_coherent_dma_map_page, .map_sg = arm_dma_map_sg, + .map_resource = dma_direct_map_resource, .dma_supported = arm_dma_supported, }; EXPORT_SYMBOL(arm_coherent_dma_ops); diff --git a/arch/powerpc/kernel/dma-swiotlb.c b/arch/powerpc/kernel/dma-swiotlb.c index 7d5fc9751622..fbb2506a414e 100644 --- a/arch/powerpc/kernel/dma-swiotlb.c +++ b/arch/powerpc/kernel/dma-swiotlb.c @@ -55,6 +55,7 @@ const struct dma_map_ops powerpc_swiotlb_dma_ops = { .dma_supported = swiotlb_dma_supported, .map_page = dma_direct_map_page, .unmap_page = dma_direct_unmap_page, + .map_resource = dma_direct_map_resource, .sync_single_for_cpu = dma_direct_sync_single_for_cpu, .sync_single_for_device = dma_direct_sync_single_for_device, .sync_sg_for_cpu = dma_direct_sync_sg_for_cpu, diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c index b1903ebb2e9c..258b9e8ebb99 100644 --- a/arch/powerpc/kernel/dma.c +++ b/arch/powerpc/kernel/dma.c @@ -273,6 +273,7 @@ const struct dma_map_ops dma_nommu_ops = { .dma_supported = dma_nommu_dma_supported, .map_page = dma_nommu_map_page, .unmap_page = dma_nommu_unmap_page, + .map_resource = dma_direct_map_resource, .get_required_mask = dma_nommu_get_required_mask, #ifdef CONFIG_NOT_COHERENT_CACHE .sync_single_for_cpu = dma_nommu_sync_single, diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h index f6ded992c183..9842085e6774 100644 --- a/include/linux/dma-mapping.h +++ b/include/linux/dma-mapping.h @@ -208,6 +208,8 @@ dma_addr_t dma_direct_map_page(struct device *dev, struct page *page, unsigned long attrs); int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents, enum dma_data_direction dir, unsigned long attrs); +dma_addr_t dma_direct_map_resource(struct device *dev, phys_addr_t paddr, + size_t size, enum dma_data_direction dir, unsigned long attrs); #if defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_DEVICE) || \ defined(CONFIG_SWIOTLB) @@ -346,19 +348,19 @@ static inline dma_addr_t dma_map_resource(struct device *dev, unsigned long attrs) { const struct dma_map_ops *ops = get_dma_ops(dev); - dma_addr_t addr; + dma_addr_t addr = DMA_MAPPING_ERROR; BUG_ON(!valid_dma_direction(dir)); /* Don't allow RAM to be mapped */ BUG_ON(pfn_valid(PHYS_PFN(phys_addr))); - addr = phys_addr; - if (ops && ops->map_resource) + if (dma_is_direct(ops)) + addr = dma_direct_map_resource(dev, phys_addr, size, dir, attrs); + else if (ops->map_resource) addr = ops->map_resource(dev, phys_addr, size, dir, attrs); debug_dma_map_resource(dev, phys_addr, size, dir, addr); - return addr; } @@ -369,7 +371,7 @@ static inline void dma_unmap_resource(struct device *dev, dma_addr_t addr, const struct dma_map_ops *ops = get_dma_ops(dev); BUG_ON(!valid_dma_direction(dir)); - if (ops && ops->unmap_resource) + if (!dma_is_direct(ops) && ops->unmap_resource) ops->unmap_resource(dev, addr, size, dir, attrs); debug_dma_unmap_resource(dev, addr, size, dir); } diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c index 355d16acee6d..25bd19974223 100644 --- a/kernel/dma/direct.c +++ b/kernel/dma/direct.c @@ -356,6 +356,20 @@ out_unmap: } EXPORT_SYMBOL(dma_direct_map_sg); +dma_addr_t dma_direct_map_resource(struct device *dev, phys_addr_t paddr, + size_t size, enum dma_data_direction dir, unsigned long attrs) +{ + dma_addr_t dma_addr = paddr; + + if (unlikely(!dma_direct_possible(dev, dma_addr, size))) { + report_addr(dev, dma_addr, size); + return DMA_MAPPING_ERROR; + } + + return dma_addr; +} +EXPORT_SYMBOL(dma_direct_map_resource); + /* * Because 32-bit DMA masks are so common we expect every architecture to be * able to satisfy them - either by not supporting more physical memory, or by -- cgit v1.3-7-g2ca7 From 8e4d81b98b7859b120dd142c8634f625db118b30 Mon Sep 17 00:00:00 2001 From: Greg Kroah-Hartman Date: Tue, 22 Jan 2019 16:21:38 +0100 Subject: dma: debug: no need to check return value of debugfs_create functions When calling debugfs functions, there is no need to ever check the return value. The function can work or not, but the code logic should never do something different based on this. Also delete the variables for the file dentries for the debugfs entries as they are never used at all once they are created. Signed-off-by: Greg Kroah-Hartman Reviewed-by: Robin Murphy [hch: moved dma_debug_dent to function scope and renamed it] Signed-off-by: Christoph Hellwig --- kernel/dma/debug.c | 88 ++++++++---------------------------------------------- 1 file changed, 12 insertions(+), 76 deletions(-) (limited to 'kernel') diff --git a/kernel/dma/debug.c b/kernel/dma/debug.c index 23cf5361bcf1..56cd836a902c 100644 --- a/kernel/dma/debug.c +++ b/kernel/dma/debug.c @@ -134,17 +134,6 @@ static u32 nr_total_entries; /* number of preallocated entries requested by kernel cmdline */ static u32 nr_prealloc_entries = PREALLOC_DMA_DEBUG_ENTRIES; -/* debugfs dentry's for the stuff above */ -static struct dentry *dma_debug_dent __read_mostly; -static struct dentry *global_disable_dent __read_mostly; -static struct dentry *error_count_dent __read_mostly; -static struct dentry *show_all_errors_dent __read_mostly; -static struct dentry *show_num_errors_dent __read_mostly; -static struct dentry *num_free_entries_dent __read_mostly; -static struct dentry *min_free_entries_dent __read_mostly; -static struct dentry *nr_total_entries_dent __read_mostly; -static struct dentry *filter_dent __read_mostly; - /* per-driver filter related state */ #define NAME_MAX_LEN 64 @@ -840,66 +829,18 @@ static const struct file_operations filter_fops = { .llseek = default_llseek, }; -static int dma_debug_fs_init(void) +static void dma_debug_fs_init(void) { - dma_debug_dent = debugfs_create_dir("dma-api", NULL); - if (!dma_debug_dent) { - pr_err("can not create debugfs directory\n"); - return -ENOMEM; - } - - global_disable_dent = debugfs_create_bool("disabled", 0444, - dma_debug_dent, - &global_disable); - if (!global_disable_dent) - goto out_err; - - error_count_dent = debugfs_create_u32("error_count", 0444, - dma_debug_dent, &error_count); - if (!error_count_dent) - goto out_err; - - show_all_errors_dent = debugfs_create_u32("all_errors", 0644, - dma_debug_dent, - &show_all_errors); - if (!show_all_errors_dent) - goto out_err; - - show_num_errors_dent = debugfs_create_u32("num_errors", 0644, - dma_debug_dent, - &show_num_errors); - if (!show_num_errors_dent) - goto out_err; - - num_free_entries_dent = debugfs_create_u32("num_free_entries", 0444, - dma_debug_dent, - &num_free_entries); - if (!num_free_entries_dent) - goto out_err; - - min_free_entries_dent = debugfs_create_u32("min_free_entries", 0444, - dma_debug_dent, - &min_free_entries); - if (!min_free_entries_dent) - goto out_err; - - nr_total_entries_dent = debugfs_create_u32("nr_total_entries", 0444, - dma_debug_dent, - &nr_total_entries); - if (!nr_total_entries_dent) - goto out_err; - - filter_dent = debugfs_create_file("driver_filter", 0644, - dma_debug_dent, NULL, &filter_fops); - if (!filter_dent) - goto out_err; - - return 0; - -out_err: - debugfs_remove_recursive(dma_debug_dent); - - return -ENOMEM; + struct dentry *dentry = debugfs_create_dir("dma-api", NULL); + + debugfs_create_bool("disabled", 0444, dentry, &global_disable); + debugfs_create_u32("error_count", 0444, dentry, &error_count); + debugfs_create_u32("all_errors", 0644, dentry, &show_all_errors); + debugfs_create_u32("num_errors", 0644, dentry, &show_num_errors); + debugfs_create_u32("num_free_entries", 0444, dentry, &num_free_entries); + debugfs_create_u32("min_free_entries", 0444, dentry, &min_free_entries); + debugfs_create_u32("nr_total_entries", 0444, dentry, &nr_total_entries); + debugfs_create_file("driver_filter", 0644, dentry, NULL, &filter_fops); } static int device_dma_allocations(struct device *dev, struct dma_debug_entry **out_entry) @@ -985,12 +926,7 @@ static int dma_debug_init(void) spin_lock_init(&dma_entry_hash[i].lock); } - if (dma_debug_fs_init() != 0) { - pr_err("error creating debugfs entries - disabling\n"); - global_disable = true; - - return 0; - } + dma_debug_fs_init(); nr_pages = DIV_ROUND_UP(nr_prealloc_entries, DMA_DEBUG_DYNAMIC_ENTRIES); for (i = 0; i < nr_pages; ++i) -- cgit v1.3-7-g2ca7 From 0a3b192c26da198fce38e1ee242a34f558670246 Mon Sep 17 00:00:00 2001 From: Corentin Labbe Date: Fri, 18 Jan 2019 13:44:18 +0000 Subject: dma-debug: add dumping facility via debugfs While debugging a DMA mapping leak, I needed to access debug_dma_dump_mappings() but easily from user space. This patch adds a /sys/kernel/debug/dma-api/dump file which contain all current DMA mapping. Signed-off-by: Corentin Labbe Signed-off-by: Christoph Hellwig --- Documentation/DMA-API.txt | 3 +++ kernel/dma/debug.c | 28 ++++++++++++++++++++++++++++ 2 files changed, 31 insertions(+) (limited to 'kernel') diff --git a/Documentation/DMA-API.txt b/Documentation/DMA-API.txt index e133ccd60228..78114ee63057 100644 --- a/Documentation/DMA-API.txt +++ b/Documentation/DMA-API.txt @@ -696,6 +696,9 @@ dma-api/disabled This read-only file contains the character 'Y' happen when it runs out of memory or if it was disabled at boot time +dma-api/dump This read-only file contains current DMA + mappings. + dma-api/error_count This file is read-only and shows the total numbers of errors found. diff --git a/kernel/dma/debug.c b/kernel/dma/debug.c index 56cd836a902c..45d51e8e26f6 100644 --- a/kernel/dma/debug.c +++ b/kernel/dma/debug.c @@ -829,6 +829,33 @@ static const struct file_operations filter_fops = { .llseek = default_llseek, }; +static int dump_show(struct seq_file *seq, void *v) +{ + int idx; + + for (idx = 0; idx < HASH_SIZE; idx++) { + struct hash_bucket *bucket = &dma_entry_hash[idx]; + struct dma_debug_entry *entry; + unsigned long flags; + + spin_lock_irqsave(&bucket->lock, flags); + list_for_each_entry(entry, &bucket->list, list) { + seq_printf(seq, + "%s %s %s idx %d P=%llx N=%lx D=%llx L=%llx %s %s\n", + dev_name(entry->dev), + dev_driver_string(entry->dev), + type2name[entry->type], idx, + phys_addr(entry), entry->pfn, + entry->dev_addr, entry->size, + dir2name[entry->direction], + maperr2str[entry->map_err_type]); + } + spin_unlock_irqrestore(&bucket->lock, flags); + } + return 0; +} +DEFINE_SHOW_ATTRIBUTE(dump); + static void dma_debug_fs_init(void) { struct dentry *dentry = debugfs_create_dir("dma-api", NULL); @@ -841,6 +868,7 @@ static void dma_debug_fs_init(void) debugfs_create_u32("min_free_entries", 0444, dentry, &min_free_entries); debugfs_create_u32("nr_total_entries", 0444, dentry, &nr_total_entries); debugfs_create_file("driver_filter", 0644, dentry, NULL, &filter_fops); + debugfs_create_file("dump", 0444, dentry, NULL, &dump_fops); } static int device_dma_allocations(struct device *dev, struct dma_debug_entry **out_entry) -- cgit v1.3-7-g2ca7 From d83525ca62cf8ebe3271d14c36fb900c294274a2 Mon Sep 17 00:00:00 2001 From: Alexei Starovoitov Date: Thu, 31 Jan 2019 15:40:04 -0800 Subject: bpf: introduce bpf_spin_lock Introduce 'struct bpf_spin_lock' and bpf_spin_lock/unlock() helpers to let bpf program serialize access to other variables. Example: struct hash_elem { int cnt; struct bpf_spin_lock lock; }; struct hash_elem * val = bpf_map_lookup_elem(&hash_map, &key); if (val) { bpf_spin_lock(&val->lock); val->cnt++; bpf_spin_unlock(&val->lock); } Restrictions and safety checks: - bpf_spin_lock is only allowed inside HASH and ARRAY maps. - BTF description of the map is mandatory for safety analysis. - bpf program can take one bpf_spin_lock at a time, since two or more can cause dead locks. - only one 'struct bpf_spin_lock' is allowed per map element. It drastically simplifies implementation yet allows bpf program to use any number of bpf_spin_locks. - when bpf_spin_lock is taken the calls (either bpf2bpf or helpers) are not allowed. - bpf program must bpf_spin_unlock() before return. - bpf program can access 'struct bpf_spin_lock' only via bpf_spin_lock()/bpf_spin_unlock() helpers. - load/store into 'struct bpf_spin_lock lock;' field is not allowed. - to use bpf_spin_lock() helper the BTF description of map value must be a struct and have 'struct bpf_spin_lock anyname;' field at the top level. Nested lock inside another struct is not allowed. - syscall map_lookup doesn't copy bpf_spin_lock field to user space. - syscall map_update and program map_update do not update bpf_spin_lock field. - bpf_spin_lock cannot be on the stack or inside networking packet. bpf_spin_lock can only be inside HASH or ARRAY map value. - bpf_spin_lock is available to root only and to all program types. - bpf_spin_lock is not allowed in inner maps of map-in-map. - ld_abs is not allowed inside spin_lock-ed region. - tracing progs and socket filter progs cannot use bpf_spin_lock due to insufficient preemption checks Implementation details: - cgroup-bpf class of programs can nest with xdp/tc programs. Hence bpf_spin_lock is equivalent to spin_lock_irqsave. Other solutions to avoid nested bpf_spin_lock are possible. Like making sure that all networking progs run with softirq disabled. spin_lock_irqsave is the simplest and doesn't add overhead to the programs that don't use it. - arch_spinlock_t is used when its implemented as queued_spin_lock - archs can force their own arch_spinlock_t - on architectures where queued_spin_lock is not available and sizeof(arch_spinlock_t) != sizeof(__u32) trivial lock is used. - presence of bpf_spin_lock inside map value could have been indicated via extra flag during map_create, but specifying it via BTF is cleaner. It provides introspection for map key/value and reduces user mistakes. Next steps: - allow bpf_spin_lock in other map types (like cgroup local storage) - introduce BPF_F_LOCK flag for bpf_map_update() syscall and helper to request kernel to grab bpf_spin_lock before rewriting the value. That will serialize access to map elements. Acked-by: Peter Zijlstra (Intel) Signed-off-by: Alexei Starovoitov Signed-off-by: Daniel Borkmann --- include/linux/bpf.h | 37 +++++++++- include/linux/bpf_verifier.h | 1 + include/linux/btf.h | 1 + include/uapi/linux/bpf.h | 7 +- kernel/Kconfig.locks | 3 + kernel/bpf/arraymap.c | 7 +- kernel/bpf/btf.c | 42 +++++++++++ kernel/bpf/core.c | 2 + kernel/bpf/hashtab.c | 21 +++--- kernel/bpf/helpers.c | 80 ++++++++++++++++++++ kernel/bpf/map_in_map.c | 5 ++ kernel/bpf/syscall.c | 21 +++++- kernel/bpf/verifier.c | 169 ++++++++++++++++++++++++++++++++++++++++++- net/core/filter.c | 16 +++- 14 files changed, 386 insertions(+), 26 deletions(-) (limited to 'kernel') diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 0394f1f9213b..2ae615b48bb8 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -72,14 +72,15 @@ struct bpf_map { u32 value_size; u32 max_entries; u32 map_flags; - u32 pages; + int spin_lock_off; /* >=0 valid offset, <0 error */ u32 id; int numa_node; u32 btf_key_type_id; u32 btf_value_type_id; struct btf *btf; + u32 pages; bool unpriv_array; - /* 55 bytes hole */ + /* 51 bytes hole */ /* The 3rd and 4th cacheline with misc members to avoid false sharing * particularly with refcounting. @@ -91,6 +92,34 @@ struct bpf_map { char name[BPF_OBJ_NAME_LEN]; }; +static inline bool map_value_has_spin_lock(const struct bpf_map *map) +{ + return map->spin_lock_off >= 0; +} + +static inline void check_and_init_map_lock(struct bpf_map *map, void *dst) +{ + if (likely(!map_value_has_spin_lock(map))) + return; + *(struct bpf_spin_lock *)(dst + map->spin_lock_off) = + (struct bpf_spin_lock){}; +} + +/* copy everything but bpf_spin_lock */ +static inline void copy_map_value(struct bpf_map *map, void *dst, void *src) +{ + if (unlikely(map_value_has_spin_lock(map))) { + u32 off = map->spin_lock_off; + + memcpy(dst, src, off); + memcpy(dst + off + sizeof(struct bpf_spin_lock), + src + off + sizeof(struct bpf_spin_lock), + map->value_size - off - sizeof(struct bpf_spin_lock)); + } else { + memcpy(dst, src, map->value_size); + } +} + struct bpf_offload_dev; struct bpf_offloaded_map; @@ -162,6 +191,7 @@ enum bpf_arg_type { ARG_PTR_TO_CTX, /* pointer to context */ ARG_ANYTHING, /* any (initialized) argument is ok */ ARG_PTR_TO_SOCKET, /* pointer to bpf_sock */ + ARG_PTR_TO_SPIN_LOCK, /* pointer to bpf_spin_lock */ }; /* type of values returned from helper functions */ @@ -879,7 +909,8 @@ extern const struct bpf_func_proto bpf_msg_redirect_hash_proto; extern const struct bpf_func_proto bpf_msg_redirect_map_proto; extern const struct bpf_func_proto bpf_sk_redirect_hash_proto; extern const struct bpf_func_proto bpf_sk_redirect_map_proto; - +extern const struct bpf_func_proto bpf_spin_lock_proto; +extern const struct bpf_func_proto bpf_spin_unlock_proto; extern const struct bpf_func_proto bpf_get_local_storage_proto; /* Shared helpers among cBPF and eBPF. */ diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h index 0620e418dde5..69f7a3449eda 100644 --- a/include/linux/bpf_verifier.h +++ b/include/linux/bpf_verifier.h @@ -148,6 +148,7 @@ struct bpf_verifier_state { /* call stack tracking */ struct bpf_func_state *frame[MAX_CALL_FRAMES]; u32 curframe; + u32 active_spin_lock; bool speculative; }; diff --git a/include/linux/btf.h b/include/linux/btf.h index 12502e25e767..455d31b55828 100644 --- a/include/linux/btf.h +++ b/include/linux/btf.h @@ -50,6 +50,7 @@ u32 btf_id(const struct btf *btf); bool btf_member_is_reg_int(const struct btf *btf, const struct btf_type *s, const struct btf_member *m, u32 expected_offset, u32 expected_size); +int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t); #ifdef CONFIG_BPF_SYSCALL const struct btf_type *btf_type_by_id(const struct btf *btf, u32 type_id); diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 60b99b730a41..86f7c438d40f 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -2422,7 +2422,9 @@ union bpf_attr { FN(map_peek_elem), \ FN(msg_push_data), \ FN(msg_pop_data), \ - FN(rc_pointer_rel), + FN(rc_pointer_rel), \ + FN(spin_lock), \ + FN(spin_unlock), /* integer value in 'imm' field of BPF_CALL instruction selects which helper * function eBPF program intends to call @@ -3056,4 +3058,7 @@ struct bpf_line_info { __u32 line_col; }; +struct bpf_spin_lock { + __u32 val; +}; #endif /* _UAPI__LINUX_BPF_H__ */ diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks index 84d882f3e299..fbba478ae522 100644 --- a/kernel/Kconfig.locks +++ b/kernel/Kconfig.locks @@ -242,6 +242,9 @@ config QUEUED_SPINLOCKS def_bool y if ARCH_USE_QUEUED_SPINLOCKS depends on SMP +config BPF_ARCH_SPINLOCK + bool + config ARCH_USE_QUEUED_RWLOCKS bool diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c index 25632a75d630..d6d979910a2a 100644 --- a/kernel/bpf/arraymap.c +++ b/kernel/bpf/arraymap.c @@ -270,9 +270,10 @@ static int array_map_update_elem(struct bpf_map *map, void *key, void *value, memcpy(this_cpu_ptr(array->pptrs[index & array->index_mask]), value, map->value_size); else - memcpy(array->value + - array->elem_size * (index & array->index_mask), - value, map->value_size); + copy_map_value(map, + array->value + + array->elem_size * (index & array->index_mask), + value); return 0; } diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index 3d661f0606fe..7019c1f05cab 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -355,6 +355,11 @@ static bool btf_type_is_struct(const struct btf_type *t) return kind == BTF_KIND_STRUCT || kind == BTF_KIND_UNION; } +static bool __btf_type_is_struct(const struct btf_type *t) +{ + return BTF_INFO_KIND(t->info) == BTF_KIND_STRUCT; +} + static bool btf_type_is_array(const struct btf_type *t) { return BTF_INFO_KIND(t->info) == BTF_KIND_ARRAY; @@ -2045,6 +2050,43 @@ static void btf_struct_log(struct btf_verifier_env *env, btf_verifier_log(env, "size=%u vlen=%u", t->size, btf_type_vlen(t)); } +/* find 'struct bpf_spin_lock' in map value. + * return >= 0 offset if found + * and < 0 in case of error + */ +int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t) +{ + const struct btf_member *member; + u32 i, off = -ENOENT; + + if (!__btf_type_is_struct(t)) + return -EINVAL; + + for_each_member(i, t, member) { + const struct btf_type *member_type = btf_type_by_id(btf, + member->type); + if (!__btf_type_is_struct(member_type)) + continue; + if (member_type->size != sizeof(struct bpf_spin_lock)) + continue; + if (strcmp(__btf_name_by_offset(btf, member_type->name_off), + "bpf_spin_lock")) + continue; + if (off != -ENOENT) + /* only one 'struct bpf_spin_lock' is allowed */ + return -E2BIG; + off = btf_member_bit_offset(t, member); + if (off % 8) + /* valid C code cannot generate such BTF */ + return -EINVAL; + off /= 8; + if (off % __alignof__(struct bpf_spin_lock)) + /* valid struct bpf_spin_lock will be 4 byte aligned */ + return -EINVAL; + } + return off; +} + static void btf_struct_seq_show(const struct btf *btf, const struct btf_type *t, u32 type_id, void *data, u8 bits_offset, struct seq_file *m) diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index f13c543b7b36..ef88b167959d 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -2002,6 +2002,8 @@ const struct bpf_func_proto bpf_map_delete_elem_proto __weak; const struct bpf_func_proto bpf_map_push_elem_proto __weak; const struct bpf_func_proto bpf_map_pop_elem_proto __weak; const struct bpf_func_proto bpf_map_peek_elem_proto __weak; +const struct bpf_func_proto bpf_spin_lock_proto __weak; +const struct bpf_func_proto bpf_spin_unlock_proto __weak; const struct bpf_func_proto bpf_get_prandom_u32_proto __weak; const struct bpf_func_proto bpf_get_smp_processor_id_proto __weak; diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c index 4b7c76765d9d..6d3b22c5ad68 100644 --- a/kernel/bpf/hashtab.c +++ b/kernel/bpf/hashtab.c @@ -718,21 +718,12 @@ static bool fd_htab_map_needs_adjust(const struct bpf_htab *htab) BITS_PER_LONG == 64; } -static u32 htab_size_value(const struct bpf_htab *htab, bool percpu) -{ - u32 size = htab->map.value_size; - - if (percpu || fd_htab_map_needs_adjust(htab)) - size = round_up(size, 8); - return size; -} - static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key, void *value, u32 key_size, u32 hash, bool percpu, bool onallcpus, struct htab_elem *old_elem) { - u32 size = htab_size_value(htab, percpu); + u32 size = htab->map.value_size; bool prealloc = htab_is_prealloc(htab); struct htab_elem *l_new, **pl_new; void __percpu *pptr; @@ -770,10 +761,13 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key, l_new = ERR_PTR(-ENOMEM); goto dec_count; } + check_and_init_map_lock(&htab->map, + l_new->key + round_up(key_size, 8)); } memcpy(l_new->key, key, key_size); if (percpu) { + size = round_up(size, 8); if (prealloc) { pptr = htab_elem_get_ptr(l_new, key_size); } else { @@ -791,8 +785,13 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key, if (!prealloc) htab_elem_set_ptr(l_new, key_size, pptr); - } else { + } else if (fd_htab_map_needs_adjust(htab)) { + size = round_up(size, 8); memcpy(l_new->key + round_up(key_size, 8), value, size); + } else { + copy_map_value(&htab->map, + l_new->key + round_up(key_size, 8), + value); } l_new->hash = hash; diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index a74972b07e74..fbe544761628 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -221,6 +221,86 @@ const struct bpf_func_proto bpf_get_current_comm_proto = { .arg2_type = ARG_CONST_SIZE, }; +#if defined(CONFIG_QUEUED_SPINLOCKS) || defined(CONFIG_BPF_ARCH_SPINLOCK) + +static inline void __bpf_spin_lock(struct bpf_spin_lock *lock) +{ + arch_spinlock_t *l = (void *)lock; + union { + __u32 val; + arch_spinlock_t lock; + } u = { .lock = __ARCH_SPIN_LOCK_UNLOCKED }; + + compiletime_assert(u.val == 0, "__ARCH_SPIN_LOCK_UNLOCKED not 0"); + BUILD_BUG_ON(sizeof(*l) != sizeof(__u32)); + BUILD_BUG_ON(sizeof(*lock) != sizeof(__u32)); + arch_spin_lock(l); +} + +static inline void __bpf_spin_unlock(struct bpf_spin_lock *lock) +{ + arch_spinlock_t *l = (void *)lock; + + arch_spin_unlock(l); +} + +#else + +static inline void __bpf_spin_lock(struct bpf_spin_lock *lock) +{ + atomic_t *l = (void *)lock; + + BUILD_BUG_ON(sizeof(*l) != sizeof(*lock)); + do { + atomic_cond_read_relaxed(l, !VAL); + } while (atomic_xchg(l, 1)); +} + +static inline void __bpf_spin_unlock(struct bpf_spin_lock *lock) +{ + atomic_t *l = (void *)lock; + + atomic_set_release(l, 0); +} + +#endif + +static DEFINE_PER_CPU(unsigned long, irqsave_flags); + +notrace BPF_CALL_1(bpf_spin_lock, struct bpf_spin_lock *, lock) +{ + unsigned long flags; + + local_irq_save(flags); + __bpf_spin_lock(lock); + __this_cpu_write(irqsave_flags, flags); + return 0; +} + +const struct bpf_func_proto bpf_spin_lock_proto = { + .func = bpf_spin_lock, + .gpl_only = false, + .ret_type = RET_VOID, + .arg1_type = ARG_PTR_TO_SPIN_LOCK, +}; + +notrace BPF_CALL_1(bpf_spin_unlock, struct bpf_spin_lock *, lock) +{ + unsigned long flags; + + flags = __this_cpu_read(irqsave_flags); + __bpf_spin_unlock(lock); + local_irq_restore(flags); + return 0; +} + +const struct bpf_func_proto bpf_spin_unlock_proto = { + .func = bpf_spin_unlock, + .gpl_only = false, + .ret_type = RET_VOID, + .arg1_type = ARG_PTR_TO_SPIN_LOCK, +}; + #ifdef CONFIG_CGROUPS BPF_CALL_0(bpf_get_current_cgroup_id) { diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c index 52378d3e34b3..583346a0ab29 100644 --- a/kernel/bpf/map_in_map.c +++ b/kernel/bpf/map_in_map.c @@ -37,6 +37,11 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd) return ERR_PTR(-EINVAL); } + if (map_value_has_spin_lock(inner_map)) { + fdput(f); + return ERR_PTR(-ENOTSUPP); + } + inner_map_meta_size = sizeof(*inner_map_meta); /* In some cases verifier needs to access beyond just base map. */ if (inner_map->ops == &array_map_ops) diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index b155cd17c1bd..ebf0a673cb83 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -463,7 +463,7 @@ int map_check_no_btf(const struct bpf_map *map, return -ENOTSUPP; } -static int map_check_btf(const struct bpf_map *map, const struct btf *btf, +static int map_check_btf(struct bpf_map *map, const struct btf *btf, u32 btf_key_id, u32 btf_value_id) { const struct btf_type *key_type, *value_type; @@ -478,6 +478,21 @@ static int map_check_btf(const struct bpf_map *map, const struct btf *btf, if (!value_type || value_size != map->value_size) return -EINVAL; + map->spin_lock_off = btf_find_spin_lock(btf, value_type); + + if (map_value_has_spin_lock(map)) { + if (map->map_type != BPF_MAP_TYPE_HASH && + map->map_type != BPF_MAP_TYPE_ARRAY) + return -ENOTSUPP; + if (map->spin_lock_off + sizeof(struct bpf_spin_lock) > + map->value_size) { + WARN_ONCE(1, + "verifier bug spin_lock_off %d value_size %d\n", + map->spin_lock_off, map->value_size); + return -EFAULT; + } + } + if (map->ops->map_check_btf) ret = map->ops->map_check_btf(map, btf, key_type, value_type); @@ -542,6 +557,8 @@ static int map_create(union bpf_attr *attr) map->btf = btf; map->btf_key_type_id = attr->btf_key_type_id; map->btf_value_type_id = attr->btf_value_type_id; + } else { + map->spin_lock_off = -EINVAL; } err = security_bpf_map_alloc(map); @@ -740,7 +757,7 @@ static int map_lookup_elem(union bpf_attr *attr) err = -ENOENT; } else { err = 0; - memcpy(value, ptr, value_size); + copy_map_value(map, value, ptr); } rcu_read_unlock(); } diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 8c1c21cd50b4..38892bdee651 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -213,6 +213,7 @@ struct bpf_call_arg_meta { s64 msize_smax_value; u64 msize_umax_value; int ptr_id; + int func_id; }; static DEFINE_MUTEX(bpf_verifier_lock); @@ -351,6 +352,12 @@ static bool reg_is_refcounted(const struct bpf_reg_state *reg) return type_is_refcounted(reg->type); } +static bool reg_may_point_to_spin_lock(const struct bpf_reg_state *reg) +{ + return reg->type == PTR_TO_MAP_VALUE && + map_value_has_spin_lock(reg->map_ptr); +} + static bool reg_is_refcounted_or_null(const struct bpf_reg_state *reg) { return type_is_refcounted_or_null(reg->type); @@ -712,6 +719,7 @@ static int copy_verifier_state(struct bpf_verifier_state *dst_state, } dst_state->speculative = src->speculative; dst_state->curframe = src->curframe; + dst_state->active_spin_lock = src->active_spin_lock; for (i = 0; i <= src->curframe; i++) { dst = dst_state->frame[i]; if (!dst) { @@ -1483,6 +1491,21 @@ static int check_map_access(struct bpf_verifier_env *env, u32 regno, if (err) verbose(env, "R%d max value is outside of the array range\n", regno); + + if (map_value_has_spin_lock(reg->map_ptr)) { + u32 lock = reg->map_ptr->spin_lock_off; + + /* if any part of struct bpf_spin_lock can be touched by + * load/store reject this program. + * To check that [x1, x2) overlaps with [y1, y2) + * it is sufficient to check x1 < y2 && y1 < x2. + */ + if (reg->smin_value + off < lock + sizeof(struct bpf_spin_lock) && + lock < reg->umax_value + off + size) { + verbose(env, "bpf_spin_lock cannot be accessed directly by load/store\n"); + return -EACCES; + } + } return err; } @@ -2192,6 +2215,91 @@ static int check_helper_mem_access(struct bpf_verifier_env *env, int regno, } } +/* Implementation details: + * bpf_map_lookup returns PTR_TO_MAP_VALUE_OR_NULL + * Two bpf_map_lookups (even with the same key) will have different reg->id. + * For traditional PTR_TO_MAP_VALUE the verifier clears reg->id after + * value_or_null->value transition, since the verifier only cares about + * the range of access to valid map value pointer and doesn't care about actual + * address of the map element. + * For maps with 'struct bpf_spin_lock' inside map value the verifier keeps + * reg->id > 0 after value_or_null->value transition. By doing so + * two bpf_map_lookups will be considered two different pointers that + * point to different bpf_spin_locks. + * The verifier allows taking only one bpf_spin_lock at a time to avoid + * dead-locks. + * Since only one bpf_spin_lock is allowed the checks are simpler than + * reg_is_refcounted() logic. The verifier needs to remember only + * one spin_lock instead of array of acquired_refs. + * cur_state->active_spin_lock remembers which map value element got locked + * and clears it after bpf_spin_unlock. + */ +static int process_spin_lock(struct bpf_verifier_env *env, int regno, + bool is_lock) +{ + struct bpf_reg_state *regs = cur_regs(env), *reg = ®s[regno]; + struct bpf_verifier_state *cur = env->cur_state; + bool is_const = tnum_is_const(reg->var_off); + struct bpf_map *map = reg->map_ptr; + u64 val = reg->var_off.value; + + if (reg->type != PTR_TO_MAP_VALUE) { + verbose(env, "R%d is not a pointer to map_value\n", regno); + return -EINVAL; + } + if (!is_const) { + verbose(env, + "R%d doesn't have constant offset. bpf_spin_lock has to be at the constant offset\n", + regno); + return -EINVAL; + } + if (!map->btf) { + verbose(env, + "map '%s' has to have BTF in order to use bpf_spin_lock\n", + map->name); + return -EINVAL; + } + if (!map_value_has_spin_lock(map)) { + if (map->spin_lock_off == -E2BIG) + verbose(env, + "map '%s' has more than one 'struct bpf_spin_lock'\n", + map->name); + else if (map->spin_lock_off == -ENOENT) + verbose(env, + "map '%s' doesn't have 'struct bpf_spin_lock'\n", + map->name); + else + verbose(env, + "map '%s' is not a struct type or bpf_spin_lock is mangled\n", + map->name); + return -EINVAL; + } + if (map->spin_lock_off != val + reg->off) { + verbose(env, "off %lld doesn't point to 'struct bpf_spin_lock'\n", + val + reg->off); + return -EINVAL; + } + if (is_lock) { + if (cur->active_spin_lock) { + verbose(env, + "Locking two bpf_spin_locks are not allowed\n"); + return -EINVAL; + } + cur->active_spin_lock = reg->id; + } else { + if (!cur->active_spin_lock) { + verbose(env, "bpf_spin_unlock without taking a lock\n"); + return -EINVAL; + } + if (cur->active_spin_lock != reg->id) { + verbose(env, "bpf_spin_unlock of different lock\n"); + return -EINVAL; + } + cur->active_spin_lock = 0; + } + return 0; +} + static bool arg_type_is_mem_ptr(enum bpf_arg_type type) { return type == ARG_PTR_TO_MEM || @@ -2268,6 +2376,17 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno, return -EFAULT; } meta->ptr_id = reg->id; + } else if (arg_type == ARG_PTR_TO_SPIN_LOCK) { + if (meta->func_id == BPF_FUNC_spin_lock) { + if (process_spin_lock(env, regno, true)) + return -EACCES; + } else if (meta->func_id == BPF_FUNC_spin_unlock) { + if (process_spin_lock(env, regno, false)) + return -EACCES; + } else { + verbose(env, "verifier internal error\n"); + return -EFAULT; + } } else if (arg_type_is_mem_ptr(arg_type)) { expected_type = PTR_TO_STACK; /* One exception here. In case function allows for NULL to be @@ -2887,6 +3006,7 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn return err; } + meta.func_id = func_id; /* check args */ err = check_func_arg(env, BPF_REG_1, fn->arg1_type, &meta); if (err) @@ -4473,7 +4593,8 @@ static void mark_ptr_or_null_reg(struct bpf_func_state *state, } else if (reg->type == PTR_TO_SOCKET_OR_NULL) { reg->type = PTR_TO_SOCKET; } - if (is_null || !reg_is_refcounted(reg)) { + if (is_null || !(reg_is_refcounted(reg) || + reg_may_point_to_spin_lock(reg))) { /* We don't need id from this point onwards anymore, * thus we should better reset it, so that state * pruning has chances to take effect. @@ -4871,6 +4992,11 @@ static int check_ld_abs(struct bpf_verifier_env *env, struct bpf_insn *insn) return err; } + if (env->cur_state->active_spin_lock) { + verbose(env, "BPF_LD_[ABS|IND] cannot be used inside bpf_spin_lock-ed region\n"); + return -EINVAL; + } + if (regs[BPF_REG_6].type != PTR_TO_CTX) { verbose(env, "at the time of BPF_LD_ABS|IND R6 != pointer to skb\n"); @@ -5607,8 +5733,11 @@ static bool regsafe(struct bpf_reg_state *rold, struct bpf_reg_state *rcur, case PTR_TO_MAP_VALUE: /* If the new min/max/var_off satisfy the old ones and * everything else matches, we are OK. - * We don't care about the 'id' value, because nothing - * uses it for PTR_TO_MAP_VALUE (only for ..._OR_NULL) + * 'id' is not compared, since it's only used for maps with + * bpf_spin_lock inside map element and in such cases if + * the rest of the prog is valid for one map element then + * it's valid for all map elements regardless of the key + * used in bpf_map_lookup() */ return memcmp(rold, rcur, offsetof(struct bpf_reg_state, id)) == 0 && range_within(rold, rcur) && @@ -5811,6 +5940,9 @@ static bool states_equal(struct bpf_verifier_env *env, if (old->speculative && !cur->speculative) return false; + if (old->active_spin_lock != cur->active_spin_lock) + return false; + /* for states to be equal callsites have to be the same * and all frame states need to be equivalent */ @@ -6229,6 +6361,12 @@ static int do_check(struct bpf_verifier_env *env) return -EINVAL; } + if (env->cur_state->active_spin_lock && + (insn->src_reg == BPF_PSEUDO_CALL || + insn->imm != BPF_FUNC_spin_unlock)) { + verbose(env, "function calls are not allowed while holding a lock\n"); + return -EINVAL; + } if (insn->src_reg == BPF_PSEUDO_CALL) err = check_func_call(env, insn, &env->insn_idx); else @@ -6259,6 +6397,11 @@ static int do_check(struct bpf_verifier_env *env) return -EINVAL; } + if (env->cur_state->active_spin_lock) { + verbose(env, "bpf_spin_unlock is missing\n"); + return -EINVAL; + } + if (state->curframe) { /* exit from nested function */ env->prev_insn_idx = env->insn_idx; @@ -6356,6 +6499,19 @@ static int check_map_prealloc(struct bpf_map *map) !(map->map_flags & BPF_F_NO_PREALLOC); } +static bool is_tracing_prog_type(enum bpf_prog_type type) +{ + switch (type) { + case BPF_PROG_TYPE_KPROBE: + case BPF_PROG_TYPE_TRACEPOINT: + case BPF_PROG_TYPE_PERF_EVENT: + case BPF_PROG_TYPE_RAW_TRACEPOINT: + return true; + default: + return false; + } +} + static int check_map_prog_compatibility(struct bpf_verifier_env *env, struct bpf_map *map, struct bpf_prog *prog) @@ -6378,6 +6534,13 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env, } } + if ((is_tracing_prog_type(prog->type) || + prog->type == BPF_PROG_TYPE_SOCKET_FILTER) && + map_value_has_spin_lock(map)) { + verbose(env, "tracing progs cannot use bpf_spin_lock yet\n"); + return -EINVAL; + } + if ((bpf_prog_is_dev_bound(prog->aux) || bpf_map_is_dev_bound(map)) && !bpf_offload_prog_map_match(prog, map)) { verbose(env, "offload device mismatch between prog and map\n"); diff --git a/net/core/filter.c b/net/core/filter.c index 41984ad4b9b4..3a49f68eda10 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -5314,10 +5314,20 @@ bpf_base_func_proto(enum bpf_func_id func_id) return &bpf_tail_call_proto; case BPF_FUNC_ktime_get_ns: return &bpf_ktime_get_ns_proto; + default: + break; + } + + if (!capable(CAP_SYS_ADMIN)) + return NULL; + + switch (func_id) { + case BPF_FUNC_spin_lock: + return &bpf_spin_lock_proto; + case BPF_FUNC_spin_unlock: + return &bpf_spin_unlock_proto; case BPF_FUNC_trace_printk: - if (capable(CAP_SYS_ADMIN)) - return bpf_get_trace_printk_proto(); - /* else, fall through */ + return bpf_get_trace_printk_proto(); default: return NULL; } -- cgit v1.3-7-g2ca7 From e16d2f1ab96849b4b65e64b82550a7ecdbf405eb Mon Sep 17 00:00:00 2001 From: Alexei Starovoitov Date: Thu, 31 Jan 2019 15:40:05 -0800 Subject: bpf: add support for bpf_spin_lock to cgroup local storage Allow 'struct bpf_spin_lock' to reside inside cgroup local storage. Signed-off-by: Alexei Starovoitov Signed-off-by: Daniel Borkmann --- kernel/bpf/local_storage.c | 2 ++ kernel/bpf/syscall.c | 3 ++- kernel/bpf/verifier.c | 2 ++ 3 files changed, 6 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/bpf/local_storage.c b/kernel/bpf/local_storage.c index 07a34ef562a0..0295427f06e2 100644 --- a/kernel/bpf/local_storage.c +++ b/kernel/bpf/local_storage.c @@ -147,6 +147,7 @@ static int cgroup_storage_update_elem(struct bpf_map *map, void *_key, return -ENOMEM; memcpy(&new->data[0], value, map->value_size); + check_and_init_map_lock(map, new->data); new = xchg(&storage->buf, new); kfree_rcu(new, rcu); @@ -483,6 +484,7 @@ struct bpf_cgroup_storage *bpf_cgroup_storage_alloc(struct bpf_prog *prog, storage->buf = kmalloc_node(size, flags, map->numa_node); if (!storage->buf) goto enomem; + check_and_init_map_lock(map, storage->buf->data); } else { storage->percpu_buf = __alloc_percpu_gfp(size, 8, flags); if (!storage->percpu_buf) diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index ebf0a673cb83..b29e6dc44650 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -482,7 +482,8 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf, if (map_value_has_spin_lock(map)) { if (map->map_type != BPF_MAP_TYPE_HASH && - map->map_type != BPF_MAP_TYPE_ARRAY) + map->map_type != BPF_MAP_TYPE_ARRAY && + map->map_type != BPF_MAP_TYPE_CGROUP_STORAGE) return -ENOTSUPP; if (map->spin_lock_off + sizeof(struct bpf_spin_lock) > map->value_size) { diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 38892bdee651..b63bc77af2d1 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -3089,6 +3089,8 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn regs[BPF_REG_0].map_ptr = meta.map_ptr; if (fn->ret_type == RET_PTR_TO_MAP_VALUE) { regs[BPF_REG_0].type = PTR_TO_MAP_VALUE; + if (map_value_has_spin_lock(meta.map_ptr)) + regs[BPF_REG_0].id = ++env->id_gen; } else { regs[BPF_REG_0].type = PTR_TO_MAP_VALUE_OR_NULL; regs[BPF_REG_0].id = ++env->id_gen; -- cgit v1.3-7-g2ca7 From 96049f3afd50fe8db69fa0068cdca822e747b1e4 Mon Sep 17 00:00:00 2001 From: Alexei Starovoitov Date: Thu, 31 Jan 2019 15:40:09 -0800 Subject: bpf: introduce BPF_F_LOCK flag Introduce BPF_F_LOCK flag for map_lookup and map_update syscall commands and for map_update() helper function. In all these cases take a lock of existing element (which was provided in BTF description) before copying (in or out) the rest of map value. Implementation details that are part of uapi: Array: The array map takes the element lock for lookup/update. Hash: hash map also takes the lock for lookup/update and tries to avoid the bucket lock. If old element exists it takes the element lock and updates the element in place. If element doesn't exist it allocates new one and inserts into hash table while holding the bucket lock. In rare case the hashmap has to take both the bucket lock and the element lock to update old value in place. Cgroup local storage: It is similar to array. update in place and lookup are done with lock taken. Signed-off-by: Alexei Starovoitov Signed-off-by: Daniel Borkmann --- include/linux/bpf.h | 2 ++ include/uapi/linux/bpf.h | 1 + kernel/bpf/arraymap.c | 24 ++++++++++++++++-------- kernel/bpf/hashtab.c | 42 +++++++++++++++++++++++++++++++++++++++--- kernel/bpf/helpers.c | 16 ++++++++++++++++ kernel/bpf/local_storage.c | 14 +++++++++++++- kernel/bpf/syscall.c | 25 +++++++++++++++++++++++-- 7 files changed, 110 insertions(+), 14 deletions(-) (limited to 'kernel') diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 2ae615b48bb8..bd169a7bcc93 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -119,6 +119,8 @@ static inline void copy_map_value(struct bpf_map *map, void *dst, void *src) memcpy(dst, src, map->value_size); } } +void copy_map_value_locked(struct bpf_map *map, void *dst, void *src, + bool lock_src); struct bpf_offload_dev; struct bpf_offloaded_map; diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 86f7c438d40f..1777fa0c61e4 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -267,6 +267,7 @@ enum bpf_attach_type { #define BPF_ANY 0 /* create new element or update existing */ #define BPF_NOEXIST 1 /* create new element if it didn't exist */ #define BPF_EXIST 2 /* update existing element */ +#define BPF_F_LOCK 4 /* spin_lock-ed map_lookup/map_update */ /* flags for BPF_MAP_CREATE command */ #define BPF_F_NO_PREALLOC (1U << 0) diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c index d6d979910a2a..c72e0d8e1e65 100644 --- a/kernel/bpf/arraymap.c +++ b/kernel/bpf/arraymap.c @@ -253,8 +253,9 @@ static int array_map_update_elem(struct bpf_map *map, void *key, void *value, { struct bpf_array *array = container_of(map, struct bpf_array, map); u32 index = *(u32 *)key; + char *val; - if (unlikely(map_flags > BPF_EXIST)) + if (unlikely((map_flags & ~BPF_F_LOCK) > BPF_EXIST)) /* unknown flags */ return -EINVAL; @@ -262,18 +263,25 @@ static int array_map_update_elem(struct bpf_map *map, void *key, void *value, /* all elements were pre-allocated, cannot insert a new one */ return -E2BIG; - if (unlikely(map_flags == BPF_NOEXIST)) + if (unlikely(map_flags & BPF_NOEXIST)) /* all elements already exist */ return -EEXIST; - if (array->map.map_type == BPF_MAP_TYPE_PERCPU_ARRAY) + if (unlikely((map_flags & BPF_F_LOCK) && + !map_value_has_spin_lock(map))) + return -EINVAL; + + if (array->map.map_type == BPF_MAP_TYPE_PERCPU_ARRAY) { memcpy(this_cpu_ptr(array->pptrs[index & array->index_mask]), value, map->value_size); - else - copy_map_value(map, - array->value + - array->elem_size * (index & array->index_mask), - value); + } else { + val = array->value + + array->elem_size * (index & array->index_mask); + if (map_flags & BPF_F_LOCK) + copy_map_value_locked(map, val, value, false); + else + copy_map_value(map, val, value); + } return 0; } diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c index 6d3b22c5ad68..937776531998 100644 --- a/kernel/bpf/hashtab.c +++ b/kernel/bpf/hashtab.c @@ -804,11 +804,11 @@ dec_count: static int check_flags(struct bpf_htab *htab, struct htab_elem *l_old, u64 map_flags) { - if (l_old && map_flags == BPF_NOEXIST) + if (l_old && (map_flags & ~BPF_F_LOCK) == BPF_NOEXIST) /* elem already exists */ return -EEXIST; - if (!l_old && map_flags == BPF_EXIST) + if (!l_old && (map_flags & ~BPF_F_LOCK) == BPF_EXIST) /* elem doesn't exist, cannot update it */ return -ENOENT; @@ -827,7 +827,7 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value, u32 key_size, hash; int ret; - if (unlikely(map_flags > BPF_EXIST)) + if (unlikely((map_flags & ~BPF_F_LOCK) > BPF_EXIST)) /* unknown flags */ return -EINVAL; @@ -840,6 +840,28 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value, b = __select_bucket(htab, hash); head = &b->head; + if (unlikely(map_flags & BPF_F_LOCK)) { + if (unlikely(!map_value_has_spin_lock(map))) + return -EINVAL; + /* find an element without taking the bucket lock */ + l_old = lookup_nulls_elem_raw(head, hash, key, key_size, + htab->n_buckets); + ret = check_flags(htab, l_old, map_flags); + if (ret) + return ret; + if (l_old) { + /* grab the element lock and update value in place */ + copy_map_value_locked(map, + l_old->key + round_up(key_size, 8), + value, false); + return 0; + } + /* fall through, grab the bucket lock and lookup again. + * 99.9% chance that the element won't be found, + * but second lookup under lock has to be done. + */ + } + /* bpf_map_update_elem() can be called in_irq() */ raw_spin_lock_irqsave(&b->lock, flags); @@ -849,6 +871,20 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value, if (ret) goto err; + if (unlikely(l_old && (map_flags & BPF_F_LOCK))) { + /* first lookup without the bucket lock didn't find the element, + * but second lookup with the bucket lock found it. + * This case is highly unlikely, but has to be dealt with: + * grab the element lock in addition to the bucket lock + * and update element in place + */ + copy_map_value_locked(map, + l_old->key + round_up(key_size, 8), + value, false); + ret = 0; + goto err; + } + l_new = alloc_htab_elem(htab, key, value, key_size, hash, false, false, l_old); if (IS_ERR(l_new)) { diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index fbe544761628..a411fc17d265 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -301,6 +301,22 @@ const struct bpf_func_proto bpf_spin_unlock_proto = { .arg1_type = ARG_PTR_TO_SPIN_LOCK, }; +void copy_map_value_locked(struct bpf_map *map, void *dst, void *src, + bool lock_src) +{ + struct bpf_spin_lock *lock; + + if (lock_src) + lock = src + map->spin_lock_off; + else + lock = dst + map->spin_lock_off; + preempt_disable(); + ____bpf_spin_lock(lock); + copy_map_value(map, dst, src); + ____bpf_spin_unlock(lock); + preempt_enable(); +} + #ifdef CONFIG_CGROUPS BPF_CALL_0(bpf_get_current_cgroup_id) { diff --git a/kernel/bpf/local_storage.c b/kernel/bpf/local_storage.c index 0295427f06e2..6b572e2de7fb 100644 --- a/kernel/bpf/local_storage.c +++ b/kernel/bpf/local_storage.c @@ -131,7 +131,14 @@ static int cgroup_storage_update_elem(struct bpf_map *map, void *_key, struct bpf_cgroup_storage *storage; struct bpf_storage_buffer *new; - if (flags != BPF_ANY && flags != BPF_EXIST) + if (unlikely(flags & ~(BPF_F_LOCK | BPF_EXIST | BPF_NOEXIST))) + return -EINVAL; + + if (unlikely(flags & BPF_NOEXIST)) + return -EINVAL; + + if (unlikely((flags & BPF_F_LOCK) && + !map_value_has_spin_lock(map))) return -EINVAL; storage = cgroup_storage_lookup((struct bpf_cgroup_storage_map *)map, @@ -139,6 +146,11 @@ static int cgroup_storage_update_elem(struct bpf_map *map, void *_key, if (!storage) return -ENOENT; + if (flags & BPF_F_LOCK) { + copy_map_value_locked(map, storage->buf->data, value, false); + return 0; + } + new = kmalloc_node(sizeof(struct bpf_storage_buffer) + map->value_size, __GFP_ZERO | GFP_ATOMIC | __GFP_NOWARN, diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index b29e6dc44650..0834958f1dc4 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -682,7 +682,7 @@ static void *__bpf_copy_key(void __user *ukey, u64 key_size) } /* last field in 'union bpf_attr' used by this command */ -#define BPF_MAP_LOOKUP_ELEM_LAST_FIELD value +#define BPF_MAP_LOOKUP_ELEM_LAST_FIELD flags static int map_lookup_elem(union bpf_attr *attr) { @@ -698,6 +698,9 @@ static int map_lookup_elem(union bpf_attr *attr) if (CHECK_ATTR(BPF_MAP_LOOKUP_ELEM)) return -EINVAL; + if (attr->flags & ~BPF_F_LOCK) + return -EINVAL; + f = fdget(ufd); map = __bpf_map_get(f); if (IS_ERR(map)) @@ -708,6 +711,12 @@ static int map_lookup_elem(union bpf_attr *attr) goto err_put; } + if ((attr->flags & BPF_F_LOCK) && + !map_value_has_spin_lock(map)) { + err = -EINVAL; + goto err_put; + } + key = __bpf_copy_key(ukey, map->key_size); if (IS_ERR(key)) { err = PTR_ERR(key); @@ -758,7 +767,13 @@ static int map_lookup_elem(union bpf_attr *attr) err = -ENOENT; } else { err = 0; - copy_map_value(map, value, ptr); + if (attr->flags & BPF_F_LOCK) + /* lock 'ptr' and copy everything but lock */ + copy_map_value_locked(map, value, ptr, true); + else + copy_map_value(map, value, ptr); + /* mask lock, since value wasn't zero inited */ + check_and_init_map_lock(map, value); } rcu_read_unlock(); } @@ -818,6 +833,12 @@ static int map_update_elem(union bpf_attr *attr) goto err_put; } + if ((attr->flags & BPF_F_LOCK) && + !map_value_has_spin_lock(map)) { + err = -EINVAL; + goto err_put; + } + key = __bpf_copy_key(ukey, map->key_size); if (IS_ERR(key)) { err = PTR_ERR(key); -- cgit v1.3-7-g2ca7 From 8fb335e078378c8426fabeed1ebee1fbf915690c Mon Sep 17 00:00:00 2001 From: Andrei Vagin Date: Fri, 1 Feb 2019 14:20:24 -0800 Subject: kernel/exit.c: release ptraced tasks before zap_pid_ns_processes Currently, exit_ptrace() adds all ptraced tasks in a dead list, then zap_pid_ns_processes() waits on all tasks in a current pidns, and only then are tasks from the dead list released. zap_pid_ns_processes() can get stuck on waiting tasks from the dead list. In this case, we will have one unkillable process with one or more dead children. Thanks to Oleg for the advice to release tasks in find_child_reaper(). Link: http://lkml.kernel.org/r/20190110175200.12442-1-avagin@gmail.com Fixes: 7c8bd2322c7f ("exit: ptrace: shift "reap dead" code from exit_ptrace() to forget_original_parent()") Signed-off-by: Andrei Vagin Signed-off-by: Oleg Nesterov Cc: "Eric W. Biederman" Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/exit.c | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/exit.c b/kernel/exit.c index 3fb7be001964..2639a30a8aa5 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -558,12 +558,14 @@ static struct task_struct *find_alive_thread(struct task_struct *p) return NULL; } -static struct task_struct *find_child_reaper(struct task_struct *father) +static struct task_struct *find_child_reaper(struct task_struct *father, + struct list_head *dead) __releases(&tasklist_lock) __acquires(&tasklist_lock) { struct pid_namespace *pid_ns = task_active_pid_ns(father); struct task_struct *reaper = pid_ns->child_reaper; + struct task_struct *p, *n; if (likely(reaper != father)) return reaper; @@ -579,6 +581,12 @@ static struct task_struct *find_child_reaper(struct task_struct *father) panic("Attempted to kill init! exitcode=0x%08x\n", father->signal->group_exit_code ?: father->exit_code); } + + list_for_each_entry_safe(p, n, dead, ptrace_entry) { + list_del_init(&p->ptrace_entry); + release_task(p); + } + zap_pid_ns_processes(pid_ns); write_lock_irq(&tasklist_lock); @@ -668,7 +676,7 @@ static void forget_original_parent(struct task_struct *father, exit_ptrace(father, dead); /* Can drop and reacquire tasklist_lock */ - reaper = find_child_reaper(father); + reaper = find_child_reaper(father, dead); if (list_empty(&father->children)) return; -- cgit v1.3-7-g2ca7 From 1b69ac6b40ebd85eed73e4dbccde2a36961ab990 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Fri, 1 Feb 2019 14:20:42 -0800 Subject: psi: fix aggregation idle shut-off psi has provisions to shut off the periodic aggregation worker when there is a period of no task activity - and thus no data that needs aggregating. However, while developing psi monitoring, Suren noticed that the aggregation clock currently won't stay shut off for good. Debugging this revealed a flaw in the idle design: an aggregation run will see no task activity and decide to go to sleep; shortly thereafter, the kworker thread that executed the aggregation will go idle and cause a scheduling change, during which the psi callback will kick the !pending worker again. This will ping-pong forever, and is equivalent to having no shut-off logic at all (but with more code!) Fix this by exempting aggregation workers from psi's clock waking logic when the state change is them going to sleep. To do this, tag workers with the last work function they executed, and if in psi we see a worker going to sleep after aggregating psi data, we will not reschedule the aggregation work item. What if the worker is also executing other items before or after? Any psi state times that were incurred by work items preceding the aggregation work will have been collected from the per-cpu buckets during the aggregation itself. If there are work items following the aggregation work, the worker's last_func tag will be overwritten and the aggregator will be kept alive to process this genuine new activity. If the aggregation work is the last thing the worker does, and we decide to go idle, the brief period of non-idle time incurred between the aggregation run and the kworker's dequeue will be stranded in the per-cpu buckets until the clock is woken by later activity. But that should not be a problem. The buckets can hold 4s worth of time, and future activity will wake the clock with a 2s delay, giving us 2s worth of data we can leave behind when disabling aggregation. If it takes a worker more than two seconds to go idle after it finishes its last work item, we likely have bigger problems in the system, and won't notice one sample that was averaged with a bogus per-CPU weight. Link: http://lkml.kernel.org/r/20190116193501.1910-1-hannes@cmpxchg.org Fixes: eb414681d5a0 ("psi: pressure stall information for CPU, memory, and IO") Signed-off-by: Johannes Weiner Reported-by: Suren Baghdasaryan Acked-by: Tejun Heo Cc: Peter Zijlstra Cc: Lai Jiangshan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/sched/psi.c | 21 +++++++++++++++++---- kernel/workqueue.c | 23 +++++++++++++++++++++++ kernel/workqueue_internal.h | 6 +++++- 3 files changed, 45 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index fe24de3fbc93..c3484785b179 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -124,6 +124,7 @@ * sampling of the aggregate task states would be. */ +#include "../workqueue_internal.h" #include #include #include @@ -480,9 +481,6 @@ static void psi_group_change(struct psi_group *group, int cpu, groupc->tasks[t]++; write_seqcount_end(&groupc->seq); - - if (!delayed_work_pending(&group->clock_work)) - schedule_delayed_work(&group->clock_work, PSI_FREQ); } static struct psi_group *iterate_groups(struct task_struct *task, void **iter) @@ -513,6 +511,7 @@ void psi_task_change(struct task_struct *task, int clear, int set) { int cpu = task_cpu(task); struct psi_group *group; + bool wake_clock = true; void *iter = NULL; if (!task->pid) @@ -530,8 +529,22 @@ void psi_task_change(struct task_struct *task, int clear, int set) task->psi_flags &= ~clear; task->psi_flags |= set; - while ((group = iterate_groups(task, &iter))) + /* + * Periodic aggregation shuts off if there is a period of no + * task changes, so we wake it back up if necessary. However, + * don't do this if the task change is the aggregation worker + * itself going to sleep, or we'll ping-pong forever. + */ + if (unlikely((clear & TSK_RUNNING) && + (task->flags & PF_WQ_WORKER) && + wq_worker_last_func(task) == psi_update_work)) + wake_clock = false; + + while ((group = iterate_groups(task, &iter))) { psi_group_change(group, cpu, clear, set); + if (wake_clock && !delayed_work_pending(&group->clock_work)) + schedule_delayed_work(&group->clock_work, PSI_FREQ); + } } void psi_memstall_tick(struct task_struct *task, int cpu) diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 392be4b252f6..fc5d23d752a5 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -909,6 +909,26 @@ struct task_struct *wq_worker_sleeping(struct task_struct *task) return to_wakeup ? to_wakeup->task : NULL; } +/** + * wq_worker_last_func - retrieve worker's last work function + * + * Determine the last function a worker executed. This is called from + * the scheduler to get a worker's last known identity. + * + * CONTEXT: + * spin_lock_irq(rq->lock) + * + * Return: + * The last work function %current executed as a worker, NULL if it + * hasn't executed any work yet. + */ +work_func_t wq_worker_last_func(struct task_struct *task) +{ + struct worker *worker = kthread_data(task); + + return worker->last_func; +} + /** * worker_set_flags - set worker flags and adjust nr_running accordingly * @worker: self @@ -2184,6 +2204,9 @@ __acquires(&pool->lock) if (unlikely(cpu_intensive)) worker_clr_flags(worker, WORKER_CPU_INTENSIVE); + /* tag the worker for identification in schedule() */ + worker->last_func = worker->current_func; + /* we're done with it, release */ hash_del(&worker->hentry); worker->current_work = NULL; diff --git a/kernel/workqueue_internal.h b/kernel/workqueue_internal.h index 66fbb5a9e633..cb68b03ca89a 100644 --- a/kernel/workqueue_internal.h +++ b/kernel/workqueue_internal.h @@ -53,6 +53,9 @@ struct worker { /* used only by rescuers to point to the target workqueue */ struct workqueue_struct *rescue_wq; /* I: the workqueue to rescue */ + + /* used by the scheduler to determine a worker's last known identity */ + work_func_t last_func; }; /** @@ -67,9 +70,10 @@ static inline struct worker *current_wq_worker(void) /* * Scheduler hooks for concurrency managed workqueue. Only to be used from - * sched/core.c and workqueue.c. + * sched/ and workqueue.c. */ void wq_worker_waking_up(struct task_struct *task, int cpu); struct task_struct *wq_worker_sleeping(struct task_struct *task); +work_func_t wq_worker_last_func(struct task_struct *task); #endif /* _KERNEL_WORKQUEUE_INTERNAL_H */ -- cgit v1.3-7-g2ca7 From 5f3d544f1671d214cd26e45bda326f921455256e Mon Sep 17 00:00:00 2001 From: Richard Guy Briggs Date: Fri, 1 Feb 2019 22:45:17 -0500 Subject: audit: remove audit_context when CONFIG_ AUDIT and not AUDITSYSCALL Remove audit_context from struct task_struct and struct audit_buffer when CONFIG_AUDIT is enabled but CONFIG_AUDITSYSCALL is not. Also, audit_log_name() (and supporting inode and fcaps functions) should have been put back in auditsc.c when soft and hard link logging was normalized since it is only used by syscall auditing. See github issue https://github.com/linux-audit/audit-kernel/issues/105 Signed-off-by: Richard Guy Briggs Signed-off-by: Paul Moore --- include/linux/sched.h | 4 +- kernel/audit.c | 157 ------------------------------------------------- kernel/audit.h | 9 --- kernel/auditsc.c | 158 ++++++++++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 161 insertions(+), 167 deletions(-) (limited to 'kernel') diff --git a/include/linux/sched.h b/include/linux/sched.h index f9788bb122c5..765119df759a 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -885,8 +885,10 @@ struct task_struct { struct callback_head *task_works; - struct audit_context *audit_context; #ifdef CONFIG_AUDIT +#ifdef CONFIG_AUDITSYSCALL + struct audit_context *audit_context; +#endif kuid_t loginuid; unsigned int sessionid; #endif diff --git a/kernel/audit.c b/kernel/audit.c index b7177a8def2e..c89ea48c70a6 100644 --- a/kernel/audit.c +++ b/kernel/audit.c @@ -2067,163 +2067,6 @@ void audit_log_key(struct audit_buffer *ab, char *key) audit_log_format(ab, "(null)"); } -void audit_log_cap(struct audit_buffer *ab, char *prefix, kernel_cap_t *cap) -{ - int i; - - if (cap_isclear(*cap)) { - audit_log_format(ab, " %s=0", prefix); - return; - } - audit_log_format(ab, " %s=", prefix); - CAP_FOR_EACH_U32(i) - audit_log_format(ab, "%08x", cap->cap[CAP_LAST_U32 - i]); -} - -static void audit_log_fcaps(struct audit_buffer *ab, struct audit_names *name) -{ - if (name->fcap_ver == -1) { - audit_log_format(ab, " cap_fe=? cap_fver=? cap_fp=? cap_fi=?"); - return; - } - audit_log_cap(ab, "cap_fp", &name->fcap.permitted); - audit_log_cap(ab, "cap_fi", &name->fcap.inheritable); - audit_log_format(ab, " cap_fe=%d cap_fver=%x cap_frootid=%d", - name->fcap.fE, name->fcap_ver, - from_kuid(&init_user_ns, name->fcap.rootid)); -} - -static inline int audit_copy_fcaps(struct audit_names *name, - const struct dentry *dentry) -{ - struct cpu_vfs_cap_data caps; - int rc; - - if (!dentry) - return 0; - - rc = get_vfs_caps_from_disk(dentry, &caps); - if (rc) - return rc; - - name->fcap.permitted = caps.permitted; - name->fcap.inheritable = caps.inheritable; - name->fcap.fE = !!(caps.magic_etc & VFS_CAP_FLAGS_EFFECTIVE); - name->fcap.rootid = caps.rootid; - name->fcap_ver = (caps.magic_etc & VFS_CAP_REVISION_MASK) >> - VFS_CAP_REVISION_SHIFT; - - return 0; -} - -/* Copy inode data into an audit_names. */ -void audit_copy_inode(struct audit_names *name, const struct dentry *dentry, - struct inode *inode, unsigned int flags) -{ - name->ino = inode->i_ino; - name->dev = inode->i_sb->s_dev; - name->mode = inode->i_mode; - name->uid = inode->i_uid; - name->gid = inode->i_gid; - name->rdev = inode->i_rdev; - security_inode_getsecid(inode, &name->osid); - if (flags & AUDIT_INODE_NOEVAL) { - name->fcap_ver = -1; - return; - } - audit_copy_fcaps(name, dentry); -} - -/** - * audit_log_name - produce AUDIT_PATH record from struct audit_names - * @context: audit_context for the task - * @n: audit_names structure with reportable details - * @path: optional path to report instead of audit_names->name - * @record_num: record number to report when handling a list of names - * @call_panic: optional pointer to int that will be updated if secid fails - */ -void audit_log_name(struct audit_context *context, struct audit_names *n, - const struct path *path, int record_num, int *call_panic) -{ - struct audit_buffer *ab; - ab = audit_log_start(context, GFP_KERNEL, AUDIT_PATH); - if (!ab) - return; - - audit_log_format(ab, "item=%d", record_num); - - if (path) - audit_log_d_path(ab, " name=", path); - else if (n->name) { - switch (n->name_len) { - case AUDIT_NAME_FULL: - /* log the full path */ - audit_log_format(ab, " name="); - audit_log_untrustedstring(ab, n->name->name); - break; - case 0: - /* name was specified as a relative path and the - * directory component is the cwd */ - audit_log_d_path(ab, " name=", &context->pwd); - break; - default: - /* log the name's directory component */ - audit_log_format(ab, " name="); - audit_log_n_untrustedstring(ab, n->name->name, - n->name_len); - } - } else - audit_log_format(ab, " name=(null)"); - - if (n->ino != AUDIT_INO_UNSET) - audit_log_format(ab, " inode=%lu" - " dev=%02x:%02x mode=%#ho" - " ouid=%u ogid=%u rdev=%02x:%02x", - n->ino, - MAJOR(n->dev), - MINOR(n->dev), - n->mode, - from_kuid(&init_user_ns, n->uid), - from_kgid(&init_user_ns, n->gid), - MAJOR(n->rdev), - MINOR(n->rdev)); - if (n->osid != 0) { - char *ctx = NULL; - u32 len; - if (security_secid_to_secctx( - n->osid, &ctx, &len)) { - audit_log_format(ab, " osid=%u", n->osid); - if (call_panic) - *call_panic = 2; - } else { - audit_log_format(ab, " obj=%s", ctx); - security_release_secctx(ctx, len); - } - } - - /* log the audit_names record type */ - switch(n->type) { - case AUDIT_TYPE_NORMAL: - audit_log_format(ab, " nametype=NORMAL"); - break; - case AUDIT_TYPE_PARENT: - audit_log_format(ab, " nametype=PARENT"); - break; - case AUDIT_TYPE_CHILD_DELETE: - audit_log_format(ab, " nametype=DELETE"); - break; - case AUDIT_TYPE_CHILD_CREATE: - audit_log_format(ab, " nametype=CREATE"); - break; - default: - audit_log_format(ab, " nametype=UNKNOWN"); - break; - } - - audit_log_fcaps(ab, n); - audit_log_end(ab); -} - int audit_log_task_context(struct audit_buffer *ab) { char *ctx = NULL; diff --git a/kernel/audit.h b/kernel/audit.h index 002f0f7ba732..82734f438ddd 100644 --- a/kernel/audit.h +++ b/kernel/audit.h @@ -213,15 +213,6 @@ extern bool audit_ever_enabled; extern void audit_log_session_info(struct audit_buffer *ab); -extern void audit_copy_inode(struct audit_names *name, - const struct dentry *dentry, - struct inode *inode, unsigned int flags); -extern void audit_log_cap(struct audit_buffer *ab, char *prefix, - kernel_cap_t *cap); -extern void audit_log_name(struct audit_context *context, - struct audit_names *n, const struct path *path, - int record_num, int *call_panic); - extern int auditd_test_task(struct task_struct *task); #define AUDIT_INODE_BUCKETS 32 diff --git a/kernel/auditsc.c b/kernel/auditsc.c index 7d37cb1e4aef..d1eab1d4a930 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -1139,6 +1139,32 @@ out: kfree(buf_head); } +void audit_log_cap(struct audit_buffer *ab, char *prefix, kernel_cap_t *cap) +{ + int i; + + if (cap_isclear(*cap)) { + audit_log_format(ab, " %s=0", prefix); + return; + } + audit_log_format(ab, " %s=", prefix); + CAP_FOR_EACH_U32(i) + audit_log_format(ab, "%08x", cap->cap[CAP_LAST_U32 - i]); +} + +static void audit_log_fcaps(struct audit_buffer *ab, struct audit_names *name) +{ + if (name->fcap_ver == -1) { + audit_log_format(ab, " cap_fe=? cap_fver=? cap_fp=? cap_fi=?"); + return; + } + audit_log_cap(ab, "cap_fp", &name->fcap.permitted); + audit_log_cap(ab, "cap_fi", &name->fcap.inheritable); + audit_log_format(ab, " cap_fe=%d cap_fver=%x cap_frootid=%d", + name->fcap.fE, name->fcap_ver, + from_kuid(&init_user_ns, name->fcap.rootid)); +} + static void show_special(struct audit_context *context, int *call_panic) { struct audit_buffer *ab; @@ -1261,6 +1287,97 @@ static inline int audit_proctitle_rtrim(char *proctitle, int len) return len; } +/* + * audit_log_name - produce AUDIT_PATH record from struct audit_names + * @context: audit_context for the task + * @n: audit_names structure with reportable details + * @path: optional path to report instead of audit_names->name + * @record_num: record number to report when handling a list of names + * @call_panic: optional pointer to int that will be updated if secid fails + */ +static void audit_log_name(struct audit_context *context, struct audit_names *n, + const struct path *path, int record_num, int *call_panic) +{ + struct audit_buffer *ab; + + ab = audit_log_start(context, GFP_KERNEL, AUDIT_PATH); + if (!ab) + return; + + audit_log_format(ab, "item=%d", record_num); + + if (path) + audit_log_d_path(ab, " name=", path); + else if (n->name) { + switch (n->name_len) { + case AUDIT_NAME_FULL: + /* log the full path */ + audit_log_format(ab, " name="); + audit_log_untrustedstring(ab, n->name->name); + break; + case 0: + /* name was specified as a relative path and the + * directory component is the cwd + */ + audit_log_d_path(ab, " name=", &context->pwd); + break; + default: + /* log the name's directory component */ + audit_log_format(ab, " name="); + audit_log_n_untrustedstring(ab, n->name->name, + n->name_len); + } + } else + audit_log_format(ab, " name=(null)"); + + if (n->ino != AUDIT_INO_UNSET) + audit_log_format(ab, " inode=%lu dev=%02x:%02x mode=%#ho ouid=%u ogid=%u rdev=%02x:%02x", + n->ino, + MAJOR(n->dev), + MINOR(n->dev), + n->mode, + from_kuid(&init_user_ns, n->uid), + from_kgid(&init_user_ns, n->gid), + MAJOR(n->rdev), + MINOR(n->rdev)); + if (n->osid != 0) { + char *ctx = NULL; + u32 len; + + if (security_secid_to_secctx( + n->osid, &ctx, &len)) { + audit_log_format(ab, " osid=%u", n->osid); + if (call_panic) + *call_panic = 2; + } else { + audit_log_format(ab, " obj=%s", ctx); + security_release_secctx(ctx, len); + } + } + + /* log the audit_names record type */ + switch (n->type) { + case AUDIT_TYPE_NORMAL: + audit_log_format(ab, " nametype=NORMAL"); + break; + case AUDIT_TYPE_PARENT: + audit_log_format(ab, " nametype=PARENT"); + break; + case AUDIT_TYPE_CHILD_DELETE: + audit_log_format(ab, " nametype=DELETE"); + break; + case AUDIT_TYPE_CHILD_CREATE: + audit_log_format(ab, " nametype=CREATE"); + break; + default: + audit_log_format(ab, " nametype=UNKNOWN"); + break; + } + + audit_log_fcaps(ab, n); + audit_log_end(ab); +} + static void audit_log_proctitle(void) { int res; @@ -1756,6 +1873,47 @@ void __audit_getname(struct filename *name) get_fs_pwd(current->fs, &context->pwd); } +static inline int audit_copy_fcaps(struct audit_names *name, + const struct dentry *dentry) +{ + struct cpu_vfs_cap_data caps; + int rc; + + if (!dentry) + return 0; + + rc = get_vfs_caps_from_disk(dentry, &caps); + if (rc) + return rc; + + name->fcap.permitted = caps.permitted; + name->fcap.inheritable = caps.inheritable; + name->fcap.fE = !!(caps.magic_etc & VFS_CAP_FLAGS_EFFECTIVE); + name->fcap.rootid = caps.rootid; + name->fcap_ver = (caps.magic_etc & VFS_CAP_REVISION_MASK) >> + VFS_CAP_REVISION_SHIFT; + + return 0; +} + +/* Copy inode data into an audit_names. */ +void audit_copy_inode(struct audit_names *name, const struct dentry *dentry, + struct inode *inode, unsigned int flags) +{ + name->ino = inode->i_ino; + name->dev = inode->i_sb->s_dev; + name->mode = inode->i_mode; + name->uid = inode->i_uid; + name->gid = inode->i_gid; + name->rdev = inode->i_rdev; + security_inode_getsecid(inode, &name->osid); + if (flags & AUDIT_INODE_NOEVAL) { + name->fcap_ver = -1; + return; + } + audit_copy_fcaps(name, dentry); +} + /** * __audit_inode - store the inode and device from a lookup * @name: name being audited -- cgit v1.3-7-g2ca7 From 9dff0aa95a324e262ffb03f425d00e4751f3294e Mon Sep 17 00:00:00 2001 From: Mark Rutland Date: Thu, 10 Jan 2019 14:27:45 +0000 Subject: perf/core: Don't WARN() for impossible ring-buffer sizes The perf tool uses /proc/sys/kernel/perf_event_mlock_kb to determine how large its ringbuffer mmap should be. This can be configured to arbitrary values, which can be larger than the maximum possible allocation from kmalloc. When this is configured to a suitably large value (e.g. thanks to the perf fuzzer), attempting to use perf record triggers a WARN_ON_ONCE() in __alloc_pages_nodemask(): WARNING: CPU: 2 PID: 5666 at mm/page_alloc.c:4511 __alloc_pages_nodemask+0x3f8/0xbc8 Let's avoid this by checking that the requested allocation is possible before calling kzalloc. Reported-by: Julien Thierry Signed-off-by: Mark Rutland Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: Julien Thierry Cc: Alexander Shishkin Cc: Arnaldo Carvalho de Melo Cc: Jiri Olsa Cc: Linus Torvalds Cc: Namhyung Kim Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Link: https://lkml.kernel.org/r/20190110142745.25495-1-mark.rutland@arm.com Signed-off-by: Ingo Molnar --- kernel/events/ring_buffer.c | 3 +++ 1 file changed, 3 insertions(+) (limited to 'kernel') diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c index 4a9937076331..309ef5a64af5 100644 --- a/kernel/events/ring_buffer.c +++ b/kernel/events/ring_buffer.c @@ -734,6 +734,9 @@ struct ring_buffer *rb_alloc(int nr_pages, long watermark, int cpu, int flags) size = sizeof(struct ring_buffer); size += nr_pages * sizeof(void *); + if (order_base_2(size) >= MAX_ORDER) + goto fail; + rb = kzalloc(size, GFP_KERNEL); if (!rb) goto fail; -- cgit v1.3-7-g2ca7 From 8e86e01526764e8cdc77b80a8f24f33e6847b9e7 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Wed, 16 Jan 2019 12:10:59 +0100 Subject: perf/core: Convert to SPDX license identifiers Use proper SPDX license identifiers instead of the bogus reference to kernel-base/COPYING. Signed-off-by: Thomas Gleixner Signed-off-by: Peter Zijlstra (Intel) Cc: Alexander Shishkin Cc: Arnaldo Carvalho de Melo Cc: Greg Kroah-Hartman Cc: Jiri Olsa Cc: Kate Stewart Cc: Linus Torvalds Cc: Peter Zijlstra Link: https://lkml.kernel.org/r/20190116111308.012666937@linutronix.de Signed-off-by: Ingo Molnar --- kernel/events/callchain.c | 3 +-- kernel/events/core.c | 3 +-- kernel/events/ring_buffer.c | 3 +-- 3 files changed, 3 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c index 24a77c34e9ad..c2b41a263166 100644 --- a/kernel/events/callchain.c +++ b/kernel/events/callchain.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0 /* * Performance events callchain code, extracted from core.c: * @@ -5,8 +6,6 @@ * Copyright (C) 2008-2011 Red Hat, Inc., Ingo Molnar * Copyright (C) 2008-2011 Red Hat, Inc., Peter Zijlstra * Copyright © 2009 Paul Mackerras, IBM Corp. - * - * For licensing details see kernel-base/COPYING */ #include diff --git a/kernel/events/core.c b/kernel/events/core.c index 280a72b3a553..5b89de7918d0 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0 /* * Performance events core code: * @@ -5,8 +6,6 @@ * Copyright (C) 2008-2011 Red Hat, Inc., Ingo Molnar * Copyright (C) 2008-2011 Red Hat, Inc., Peter Zijlstra * Copyright © 2009 Paul Mackerras, IBM Corp. - * - * For licensing details see kernel-base/COPYING */ #include diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c index 309ef5a64af5..ed6409300ef5 100644 --- a/kernel/events/ring_buffer.c +++ b/kernel/events/ring_buffer.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0 /* * Performance events ring-buffer code: * @@ -5,8 +6,6 @@ * Copyright (C) 2008-2011 Red Hat, Inc., Ingo Molnar * Copyright (C) 2008-2011 Red Hat, Inc., Peter Zijlstra * Copyright © 2009 Paul Mackerras, IBM Corp. - * - * For licensing details see kernel-base/COPYING */ #include -- cgit v1.3-7-g2ca7 From 469eb32eaf361971dfc8ad165af14ae3f2217487 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Wed, 16 Jan 2019 12:11:00 +0100 Subject: perf/hw_breakpoints: Convert to SPDX license identifier Replace the license boiler plate with a SPDX license identifier. Signed-off-by: Thomas Gleixner Signed-off-by: Peter Zijlstra (Intel) Acked-by: Paul McKenney Cc: Alan Stern Cc: Alexander Shishkin Cc: Arnaldo Carvalho de Melo Cc: Frederic Weisbecker Cc: Greg Kroah-Hartman Cc: Jiri Olsa Cc: Kate Stewart Cc: Linus Torvalds Cc: Peter Zijlstra Link: https://lkml.kernel.org/r/20190116111308.105855650@linutronix.de Signed-off-by: Ingo Molnar --- kernel/events/hw_breakpoint.c | 15 +-------------- 1 file changed, 1 insertion(+), 14 deletions(-) (limited to 'kernel') diff --git a/kernel/events/hw_breakpoint.c b/kernel/events/hw_breakpoint.c index 5befb338a18d..c5cd852fe86b 100644 --- a/kernel/events/hw_breakpoint.c +++ b/kernel/events/hw_breakpoint.c @@ -1,18 +1,5 @@ +// SPDX-License-Identifier: GPL-2.0+ /* - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program; if not, write to the Free Software - * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. - * * Copyright (C) 2007 Alan Stern * Copyright (C) IBM Corporation, 2009 * Copyright (C) 2009, Frederic Weisbecker -- cgit v1.3-7-g2ca7 From 720e596a16cc170798a60dc7afa27146ec5fb14e Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Wed, 16 Jan 2019 12:11:01 +0100 Subject: perf/uprobes: Convert to SPDX license identifier Replace the license boiler plate with a SPDX license identifier. Signed-off-by: Thomas Gleixner Signed-off-by: Peter Zijlstra (Intel) Acked-by: Paul McKenney Cc: Alexander Shishkin Cc: Arnaldo Carvalho de Melo Cc: Greg Kroah-Hartman Cc: Jiri Olsa Cc: Kate Stewart Cc: Linus Torvalds Cc: Peter Zijlstra Link: https://lkml.kernel.org/r/20190116111308.211981422@linutronix.de Signed-off-by: Ingo Molnar --- kernel/events/uprobes.c | 15 +-------------- 1 file changed, 1 insertion(+), 14 deletions(-) (limited to 'kernel') diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 8aef47ee7bfa..affa830a198c 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -1,20 +1,7 @@ +// SPDX-License-Identifier: GPL-2.0+ /* * User-space Probes (UProbes) * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program; if not, write to the Free Software - * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. - * * Copyright (C) IBM Corporation, 2008-2012 * Authors: * Srikar Dronamraju -- cgit v1.3-7-g2ca7 From 8c94abbbe1ba24961278055434504b7dc3595415 Mon Sep 17 00:00:00 2001 From: Elena Reshetova Date: Mon, 28 Jan 2019 14:27:26 +0200 Subject: perf: Convert perf_event_context.refcount to refcount_t atomic_t variables are currently used to implement reference counters with the following properties: - counter is initialized to 1 using atomic_set() - a resource is freed upon counter reaching zero - once counter reaches zero, its further increments aren't allowed - counter schema uses basic atomic operations (set, inc, inc_not_zero, dec_and_test, etc.) Such atomic variables should be converted to a newly provided refcount_t type and API that prevents accidental counter overflows and underflows. This is important since overflows and underflows can lead to use-after-free situation and be exploitable. The variable perf_event_context.refcount is used as pure reference counter. Convert it to refcount_t and fix up the operations. ** Important note for maintainers: Some functions from refcount_t API defined in lib/refcount.c have different memory ordering guarantees than their atomic counterparts. Please check Documentation/core-api/refcount-vs-atomic.rst for more information. Normally the differences should not matter since refcount_t provides enough guarantees to satisfy the refcounting use cases, but in some rare cases it might matter. Please double check that you don't have some undocumented memory guarantees for this variable usage. For the perf_event_context.refcount it might make a difference in following places: - get_ctx(), perf_event_ctx_lock_nested(), perf_lock_task_context() and __perf_event_ctx_lock_double(): increment in refcount_inc_not_zero() only guarantees control dependency on success vs. fully ordered atomic counterpart - put_ctx(): decrement in refcount_dec_and_test() provides RELEASE ordering and ACQUIRE ordering + control dependency on success vs. fully ordered atomic counterpart Suggested-by: Kees Cook Signed-off-by: Elena Reshetova Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: David Windsor Reviewed-by: Hans Liljestrand Cc: Alexander Shishkin Cc: Arnaldo Carvalho de Melo Cc: Jiri Olsa Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: acme@kernel.org Cc: namhyung@kernel.org Link: https://lkml.kernel.org/r/1548678448-24458-2-git-send-email-elena.reshetova@intel.com Signed-off-by: Ingo Molnar --- include/linux/perf_event.h | 3 ++- kernel/events/core.c | 12 ++++++------ 2 files changed, 8 insertions(+), 7 deletions(-) (limited to 'kernel') diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index a79e59fc3b7d..6cb5d483ab34 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -54,6 +54,7 @@ struct perf_guest_info_callbacks { #include #include #include +#include #include struct perf_callchain_entry { @@ -737,7 +738,7 @@ struct perf_event_context { int nr_stat; int nr_freq; int rotate_disable; - atomic_t refcount; + refcount_t refcount; struct task_struct *task; /* diff --git a/kernel/events/core.c b/kernel/events/core.c index 5b89de7918d0..677164d54547 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -1172,7 +1172,7 @@ static void perf_event_ctx_deactivate(struct perf_event_context *ctx) static void get_ctx(struct perf_event_context *ctx) { - WARN_ON(!atomic_inc_not_zero(&ctx->refcount)); + refcount_inc(&ctx->refcount); } static void free_ctx(struct rcu_head *head) @@ -1186,7 +1186,7 @@ static void free_ctx(struct rcu_head *head) static void put_ctx(struct perf_event_context *ctx) { - if (atomic_dec_and_test(&ctx->refcount)) { + if (refcount_dec_and_test(&ctx->refcount)) { if (ctx->parent_ctx) put_ctx(ctx->parent_ctx); if (ctx->task && ctx->task != TASK_TOMBSTONE) @@ -1268,7 +1268,7 @@ perf_event_ctx_lock_nested(struct perf_event *event, int nesting) again: rcu_read_lock(); ctx = READ_ONCE(event->ctx); - if (!atomic_inc_not_zero(&ctx->refcount)) { + if (!refcount_inc_not_zero(&ctx->refcount)) { rcu_read_unlock(); goto again; } @@ -1401,7 +1401,7 @@ retry: } if (ctx->task == TASK_TOMBSTONE || - !atomic_inc_not_zero(&ctx->refcount)) { + !refcount_inc_not_zero(&ctx->refcount)) { raw_spin_unlock(&ctx->lock); ctx = NULL; } else { @@ -4057,7 +4057,7 @@ static void __perf_event_init_context(struct perf_event_context *ctx) INIT_LIST_HEAD(&ctx->event_list); INIT_LIST_HEAD(&ctx->pinned_active); INIT_LIST_HEAD(&ctx->flexible_active); - atomic_set(&ctx->refcount, 1); + refcount_set(&ctx->refcount, 1); } static struct perf_event_context * @@ -10613,7 +10613,7 @@ __perf_event_ctx_lock_double(struct perf_event *group_leader, again: rcu_read_lock(); gctx = READ_ONCE(group_leader->ctx); - if (!atomic_inc_not_zero(&gctx->refcount)) { + if (!refcount_inc_not_zero(&gctx->refcount)) { rcu_read_unlock(); goto again; } -- cgit v1.3-7-g2ca7 From fecb8ed2ce7010db373f8517ee815380d8e3c0c4 Mon Sep 17 00:00:00 2001 From: Elena Reshetova Date: Mon, 28 Jan 2019 14:27:27 +0200 Subject: perf/ring_buffer: Convert ring_buffer.refcount to refcount_t atomic_t variables are currently used to implement reference counters with the following properties: - counter is initialized to 1 using atomic_set() - a resource is freed upon counter reaching zero - once counter reaches zero, its further increments aren't allowed - counter schema uses basic atomic operations (set, inc, inc_not_zero, dec_and_test, etc.) Such atomic variables should be converted to a newly provided refcount_t type and API that prevents accidental counter overflows and underflows. This is important since overflows and underflows can lead to use-after-free situation and be exploitable. The variable ring_buffer.refcount is used as pure reference counter. Convert it to refcount_t and fix up the operations. ** Important note for maintainers: Some functions from refcount_t API defined in lib/refcount.c have different memory ordering guarantees than their atomic counterparts. Please check Documentation/core-api/refcount-vs-atomic.rst for more information. Normally the differences should not matter since refcount_t provides enough guarantees to satisfy the refcounting use cases, but in some rare cases it might matter. Please double check that you don't have some undocumented memory guarantees for this variable usage. For the ring_buffer.refcount it might make a difference in following places: - ring_buffer_get(): increment in refcount_inc_not_zero() only guarantees control dependency on success vs. fully ordered atomic counterpart - ring_buffer_put(): decrement in refcount_dec_and_test() only provides RELEASE ordering and ACQUIRE ordering + control dependency on success vs. fully ordered atomic counterpart Suggested-by: Kees Cook Signed-off-by: Elena Reshetova Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: David Windsor Reviewed-by: Hans Liljestrand Cc: Alexander Shishkin Cc: Arnaldo Carvalho de Melo Cc: Jiri Olsa Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: acme@kernel.org Cc: namhyung@kernel.org Link: https://lkml.kernel.org/r/1548678448-24458-3-git-send-email-elena.reshetova@intel.com Signed-off-by: Ingo Molnar --- kernel/events/core.c | 4 ++-- kernel/events/internal.h | 3 ++- kernel/events/ring_buffer.c | 2 +- 3 files changed, 5 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/events/core.c b/kernel/events/core.c index 677164d54547..284232edf9be 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -5393,7 +5393,7 @@ struct ring_buffer *ring_buffer_get(struct perf_event *event) rcu_read_lock(); rb = rcu_dereference(event->rb); if (rb) { - if (!atomic_inc_not_zero(&rb->refcount)) + if (!refcount_inc_not_zero(&rb->refcount)) rb = NULL; } rcu_read_unlock(); @@ -5403,7 +5403,7 @@ struct ring_buffer *ring_buffer_get(struct perf_event *event) void ring_buffer_put(struct ring_buffer *rb) { - if (!atomic_dec_and_test(&rb->refcount)) + if (!refcount_dec_and_test(&rb->refcount)) return; WARN_ON_ONCE(!list_empty(&rb->event_list)); diff --git a/kernel/events/internal.h b/kernel/events/internal.h index 6dc725a7e7bc..4718de2a04e6 100644 --- a/kernel/events/internal.h +++ b/kernel/events/internal.h @@ -4,13 +4,14 @@ #include #include +#include /* Buffer handling */ #define RING_BUFFER_WRITABLE 0x01 struct ring_buffer { - atomic_t refcount; + refcount_t refcount; struct rcu_head rcu_head; #ifdef CONFIG_PERF_USE_VMALLOC struct work_struct work; diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c index ed6409300ef5..0a71d16ca41b 100644 --- a/kernel/events/ring_buffer.c +++ b/kernel/events/ring_buffer.c @@ -284,7 +284,7 @@ ring_buffer_init(struct ring_buffer *rb, long watermark, int flags) else rb->overwrite = 1; - atomic_set(&rb->refcount, 1); + refcount_set(&rb->refcount, 1); INIT_LIST_HEAD(&rb->event_list); spin_lock_init(&rb->event_lock); -- cgit v1.3-7-g2ca7 From ca3bb3d027f69ac3ab1dafb32bde2f5a3a44439c Mon Sep 17 00:00:00 2001 From: Elena Reshetova Date: Mon, 28 Jan 2019 14:27:28 +0200 Subject: perf/ring_buffer: Convert ring_buffer.aux_refcount to refcount_t atomic_t variables are currently used to implement reference counters with the following properties: - counter is initialized to 1 using atomic_set() - a resource is freed upon counter reaching zero - once counter reaches zero, its further increments aren't allowed - counter schema uses basic atomic operations (set, inc, inc_not_zero, dec_and_test, etc.) Such atomic variables should be converted to a newly provided refcount_t type and API that prevents accidental counter overflows and underflows. This is important since overflows and underflows can lead to use-after-free situation and be exploitable. The variable ring_buffer.aux_refcount is used as pure reference counter. Convert it to refcount_t and fix up the operations. ** Important note for maintainers: Some functions from refcount_t API defined in lib/refcount.c have different memory ordering guarantees than their atomic counterparts. Please check Documentation/core-api/refcount-vs-atomic.rst for more information. Normally the differences should not matter since refcount_t provides enough guarantees to satisfy the refcounting use cases, but in some rare cases it might matter. Please double check that you don't have some undocumented memory guarantees for this variable usage. For the ring_buffer.aux_refcount it might make a difference in following places: - perf_aux_output_begin(): increment in refcount_inc_not_zero() only guarantees control dependency on success vs. fully ordered atomic counterpart - rb_free_aux(): decrement in refcount_dec_and_test() only provides RELEASE ordering and ACQUIRE ordering + control dependency on success vs. fully ordered atomic counterpart Suggested-by: Kees Cook Signed-off-by: Elena Reshetova Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: David Windsor Reviewed-by: Hans Liljestrand Cc: Alexander Shishkin Cc: Arnaldo Carvalho de Melo Cc: Jiri Olsa Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: acme@kernel.org Cc: namhyung@kernel.org Link: https://lkml.kernel.org/r/1548678448-24458-4-git-send-email-elena.reshetova@intel.com Signed-off-by: Ingo Molnar --- kernel/events/core.c | 2 +- kernel/events/internal.h | 2 +- kernel/events/ring_buffer.c | 6 +++--- 3 files changed, 5 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/kernel/events/core.c b/kernel/events/core.c index 284232edf9be..5aeb4c74fb99 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -5468,7 +5468,7 @@ static void perf_mmap_close(struct vm_area_struct *vma) /* this has to be the last one */ rb_free_aux(rb); - WARN_ON_ONCE(atomic_read(&rb->aux_refcount)); + WARN_ON_ONCE(refcount_read(&rb->aux_refcount)); mutex_unlock(&event->mmap_mutex); } diff --git a/kernel/events/internal.h b/kernel/events/internal.h index 4718de2a04e6..79c47076700a 100644 --- a/kernel/events/internal.h +++ b/kernel/events/internal.h @@ -49,7 +49,7 @@ struct ring_buffer { atomic_t aux_mmap_count; unsigned long aux_mmap_locked; void (*free_aux)(void *); - atomic_t aux_refcount; + refcount_t aux_refcount; void **aux_pages; void *aux_priv; diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c index 0a71d16ca41b..805f0423ee0b 100644 --- a/kernel/events/ring_buffer.c +++ b/kernel/events/ring_buffer.c @@ -357,7 +357,7 @@ void *perf_aux_output_begin(struct perf_output_handle *handle, if (!atomic_read(&rb->aux_mmap_count)) goto err; - if (!atomic_inc_not_zero(&rb->aux_refcount)) + if (!refcount_inc_not_zero(&rb->aux_refcount)) goto err; /* @@ -670,7 +670,7 @@ int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event, * we keep a refcount here to make sure either of the two can * reference them safely. */ - atomic_set(&rb->aux_refcount, 1); + refcount_set(&rb->aux_refcount, 1); rb->aux_overwrite = overwrite; rb->aux_watermark = watermark; @@ -689,7 +689,7 @@ out: void rb_free_aux(struct ring_buffer *rb) { - if (atomic_dec_and_test(&rb->aux_refcount)) + if (refcount_dec_and_test(&rb->aux_refcount)) __rb_free_aux(rb); } -- cgit v1.3-7-g2ca7 From d036bda7d0e7269c2982eb979acfef855f5d7977 Mon Sep 17 00:00:00 2001 From: Elena Reshetova Date: Fri, 18 Jan 2019 14:27:26 +0200 Subject: sched/core: Convert sighand_struct.count to refcount_t atomic_t variables are currently used to implement reference counters with the following properties: - counter is initialized to 1 using atomic_set() - a resource is freed upon counter reaching zero - once counter reaches zero, its further increments aren't allowed - counter schema uses basic atomic operations (set, inc, inc_not_zero, dec_and_test, etc.) Such atomic variables should be converted to a newly provided refcount_t type and API that prevents accidental counter overflows and underflows. This is important since overflows and underflows can lead to use-after-free situation and be exploitable. The variable sighand_struct.count is used as pure reference counter. Convert it to refcount_t and fix up the operations. ** Important note for maintainers: Some functions from refcount_t API defined in lib/refcount.c have different memory ordering guarantees than their atomic counterparts. The full comparison can be seen in https://lkml.org/lkml/2017/11/15/57 and it is hopefully soon in state to be merged to the documentation tree. Normally the differences should not matter since refcount_t provides enough guarantees to satisfy the refcounting use cases, but in some rare cases it might matter. Please double check that you don't have some undocumented memory guarantees for this variable usage. For the sighand_struct.count it might make a difference in following places: - __cleanup_sighand: decrement in refcount_dec_and_test() only provides RELEASE ordering and control dependency on success vs. fully ordered atomic counterpart Suggested-by: Kees Cook Signed-off-by: Elena Reshetova Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: David Windsor Reviewed-by: Hans Liljestrand Reviewed-by: Andrea Parri Reviewed-by: Oleg Nesterov Cc: Linus Torvalds Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: akpm@linux-foundation.org Cc: viro@zeniv.linux.org.uk Link: https://lkml.kernel.org/r/1547814450-18902-2-git-send-email-elena.reshetova@intel.com Signed-off-by: Ingo Molnar --- fs/exec.c | 4 ++-- fs/proc/task_nommu.c | 2 +- include/linux/sched/signal.h | 3 ++- kernel/fork.c | 8 ++++---- 4 files changed, 9 insertions(+), 8 deletions(-) (limited to 'kernel') diff --git a/fs/exec.c b/fs/exec.c index fb72d36f7823..966cd98a2ce2 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1189,7 +1189,7 @@ no_thread_group: flush_itimer_signals(); #endif - if (atomic_read(&oldsighand->count) != 1) { + if (refcount_read(&oldsighand->count) != 1) { struct sighand_struct *newsighand; /* * This ->sighand is shared with the CLONE_SIGHAND @@ -1199,7 +1199,7 @@ no_thread_group: if (!newsighand) return -ENOMEM; - atomic_set(&newsighand->count, 1); + refcount_set(&newsighand->count, 1); memcpy(newsighand->action, oldsighand->action, sizeof(newsighand->action)); diff --git a/fs/proc/task_nommu.c b/fs/proc/task_nommu.c index 0b63d68dedb2..f912872fbf91 100644 --- a/fs/proc/task_nommu.c +++ b/fs/proc/task_nommu.c @@ -64,7 +64,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm) else bytes += kobjsize(current->files); - if (current->sighand && atomic_read(¤t->sighand->count) > 1) + if (current->sighand && refcount_read(¤t->sighand->count) > 1) sbytes += kobjsize(current->sighand); else bytes += kobjsize(current->sighand); diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h index 13789d10a50e..37eeb1a28eba 100644 --- a/include/linux/sched/signal.h +++ b/include/linux/sched/signal.h @@ -8,13 +8,14 @@ #include #include #include +#include /* * Types defining task->signal and task->sighand and APIs using them: */ struct sighand_struct { - atomic_t count; + refcount_t count; struct k_sigaction action[_NSIG]; spinlock_t siglock; wait_queue_head_t signalfd_wqh; diff --git a/kernel/fork.c b/kernel/fork.c index b69248e6f0e0..370856d4c0b3 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1463,7 +1463,7 @@ static int copy_sighand(unsigned long clone_flags, struct task_struct *tsk) struct sighand_struct *sig; if (clone_flags & CLONE_SIGHAND) { - atomic_inc(¤t->sighand->count); + refcount_inc(¤t->sighand->count); return 0; } sig = kmem_cache_alloc(sighand_cachep, GFP_KERNEL); @@ -1471,7 +1471,7 @@ static int copy_sighand(unsigned long clone_flags, struct task_struct *tsk) if (!sig) return -ENOMEM; - atomic_set(&sig->count, 1); + refcount_set(&sig->count, 1); spin_lock_irq(¤t->sighand->siglock); memcpy(sig->action, current->sighand->action, sizeof(sig->action)); spin_unlock_irq(¤t->sighand->siglock); @@ -1480,7 +1480,7 @@ static int copy_sighand(unsigned long clone_flags, struct task_struct *tsk) void __cleanup_sighand(struct sighand_struct *sighand) { - if (atomic_dec_and_test(&sighand->count)) { + if (refcount_dec_and_test(&sighand->count)) { signalfd_cleanup(sighand); /* * sighand_cachep is SLAB_TYPESAFE_BY_RCU so we can free it @@ -2439,7 +2439,7 @@ static int check_unshare_flags(unsigned long unshare_flags) return -EINVAL; } if (unshare_flags & (CLONE_SIGHAND | CLONE_VM)) { - if (atomic_read(¤t->sighand->count) > 1) + if (refcount_read(¤t->sighand->count) > 1) return -EINVAL; } if (unshare_flags & CLONE_VM) { -- cgit v1.3-7-g2ca7 From 60d4de3ff7f775509deba94b3db3c1abe55bf7a5 Mon Sep 17 00:00:00 2001 From: Elena Reshetova Date: Fri, 18 Jan 2019 14:27:27 +0200 Subject: sched/core: Convert signal_struct.sigcnt to refcount_t atomic_t variables are currently used to implement reference counters with the following properties: - counter is initialized to 1 using atomic_set() - a resource is freed upon counter reaching zero - once counter reaches zero, its further increments aren't allowed - counter schema uses basic atomic operations (set, inc, inc_not_zero, dec_and_test, etc.) Such atomic variables should be converted to a newly provided refcount_t type and API that prevents accidental counter overflows and underflows. This is important since overflows and underflows can lead to use-after-free situation and be exploitable. The variable signal_struct.sigcnt is used as pure reference counter. Convert it to refcount_t and fix up the operations. ** Important note for maintainers: Some functions from refcount_t API defined in lib/refcount.c have different memory ordering guarantees than their atomic counterparts. The full comparison can be seen in https://lkml.org/lkml/2017/11/15/57 and it is hopefully soon in state to be merged to the documentation tree. Normally the differences should not matter since refcount_t provides enough guarantees to satisfy the refcounting use cases, but in some rare cases it might matter. Please double check that you don't have some undocumented memory guarantees for this variable usage. For the signal_struct.sigcnt it might make a difference in following places: - put_signal_struct(): decrement in refcount_dec_and_test() only provides RELEASE ordering and control dependency on success vs. fully ordered atomic counterpart Suggested-by: Kees Cook Signed-off-by: Elena Reshetova Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: David Windsor Reviewed-by: Hans Liljestrand Reviewed-by: Andrea Parri Reviewed-by: Oleg Nesterov Cc: Linus Torvalds Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: akpm@linux-foundation.org Cc: viro@zeniv.linux.org.uk Link: https://lkml.kernel.org/r/1547814450-18902-3-git-send-email-elena.reshetova@intel.com Signed-off-by: Ingo Molnar --- include/linux/sched/signal.h | 2 +- init/init_task.c | 2 +- kernel/fork.c | 6 +++--- 3 files changed, 5 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h index 37eeb1a28eba..ae5655197698 100644 --- a/include/linux/sched/signal.h +++ b/include/linux/sched/signal.h @@ -83,7 +83,7 @@ struct multiprocess_signals { * the locking of signal_struct. */ struct signal_struct { - atomic_t sigcnt; + refcount_t sigcnt; atomic_t live; int nr_threads; struct list_head thread_head; diff --git a/init/init_task.c b/init/init_task.c index 5aebe3be4d7c..9aa3ebc74970 100644 --- a/init/init_task.c +++ b/init/init_task.c @@ -44,7 +44,7 @@ static struct signal_struct init_signals = { }; static struct sighand_struct init_sighand = { - .count = ATOMIC_INIT(1), + .count = REFCOUNT_INIT(1), .action = { { { .sa_handler = SIG_DFL, } }, }, .siglock = __SPIN_LOCK_UNLOCKED(init_sighand.siglock), .signalfd_wqh = __WAIT_QUEUE_HEAD_INITIALIZER(init_sighand.signalfd_wqh), diff --git a/kernel/fork.c b/kernel/fork.c index 370856d4c0b3..935a42d5f8ff 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -710,7 +710,7 @@ static inline void free_signal_struct(struct signal_struct *sig) static inline void put_signal_struct(struct signal_struct *sig) { - if (atomic_dec_and_test(&sig->sigcnt)) + if (refcount_dec_and_test(&sig->sigcnt)) free_signal_struct(sig); } @@ -1527,7 +1527,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk) sig->nr_threads = 1; atomic_set(&sig->live, 1); - atomic_set(&sig->sigcnt, 1); + refcount_set(&sig->sigcnt, 1); /* list_add(thread_node, thread_head) without INIT_LIST_HEAD() */ sig->thread_head = (struct list_head)LIST_HEAD_INIT(tsk->thread_node); @@ -2082,7 +2082,7 @@ static __latent_entropy struct task_struct *copy_process( } else { current->signal->nr_threads++; atomic_inc(¤t->signal->live); - atomic_inc(¤t->signal->sigcnt); + refcount_inc(¤t->signal->sigcnt); task_join_group_stop(p); list_add_tail_rcu(&p->thread_group, &p->group_leader->thread_group); -- cgit v1.3-7-g2ca7 From c45a77952427b678aa9205e1b0ee3bcf33339a2e Mon Sep 17 00:00:00 2001 From: Elena Reshetova Date: Fri, 18 Jan 2019 14:27:28 +0200 Subject: sched/fair: Convert numa_group.refcount to refcount_t atomic_t variables are currently used to implement reference counters with the following properties: - counter is initialized to 1 using atomic_set() - a resource is freed upon counter reaching zero - once counter reaches zero, its further increments aren't allowed - counter schema uses basic atomic operations (set, inc, inc_not_zero, dec_and_test, etc.) Such atomic variables should be converted to a newly provided refcount_t type and API that prevents accidental counter overflows and underflows. This is important since overflows and underflows can lead to use-after-free situation and be exploitable. The variable numa_group.refcount is used as pure reference counter. Convert it to refcount_t and fix up the operations. ** Important note for maintainers: Some functions from refcount_t API defined in lib/refcount.c have different memory ordering guarantees than their atomic counterparts. The full comparison can be seen in https://lkml.org/lkml/2017/11/15/57 and it is hopefully soon in state to be merged to the documentation tree. Normally the differences should not matter since refcount_t provides enough guarantees to satisfy the refcounting use cases, but in some rare cases it might matter. Please double check that you don't have some undocumented memory guarantees for this variable usage. For the numa_group.refcount it might make a difference in following places: - get_numa_group(): increment in refcount_inc_not_zero() only guarantees control dependency on success vs. fully ordered atomic counterpart - put_numa_group(): decrement in refcount_dec_and_test() only provides RELEASE ordering and control dependency on success vs. fully ordered atomic counterpart Suggested-by: Kees Cook Signed-off-by: Elena Reshetova Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: David Windsor Reviewed-by: Hans Liljestrand Reviewed-by: Andrea Parri Cc: Linus Torvalds Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: akpm@linux-foundation.org Cc: viro@zeniv.linux.org.uk Link: https://lkml.kernel.org/r/1547814450-18902-4-git-send-email-elena.reshetova@intel.com Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index e2ff4b69dcf6..5b2b919c7929 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1035,7 +1035,7 @@ unsigned int sysctl_numa_balancing_scan_size = 256; unsigned int sysctl_numa_balancing_scan_delay = 1000; struct numa_group { - atomic_t refcount; + refcount_t refcount; spinlock_t lock; /* nr_tasks, tasks */ int nr_tasks; @@ -1104,7 +1104,7 @@ static unsigned int task_scan_start(struct task_struct *p) unsigned long shared = group_faults_shared(ng); unsigned long private = group_faults_priv(ng); - period *= atomic_read(&ng->refcount); + period *= refcount_read(&ng->refcount); period *= shared + 1; period /= private + shared + 1; } @@ -1127,7 +1127,7 @@ static unsigned int task_scan_max(struct task_struct *p) unsigned long private = group_faults_priv(ng); unsigned long period = smax; - period *= atomic_read(&ng->refcount); + period *= refcount_read(&ng->refcount); period *= shared + 1; period /= private + shared + 1; @@ -2203,12 +2203,12 @@ static void task_numa_placement(struct task_struct *p) static inline int get_numa_group(struct numa_group *grp) { - return atomic_inc_not_zero(&grp->refcount); + return refcount_inc_not_zero(&grp->refcount); } static inline void put_numa_group(struct numa_group *grp) { - if (atomic_dec_and_test(&grp->refcount)) + if (refcount_dec_and_test(&grp->refcount)) kfree_rcu(grp, rcu); } @@ -2229,7 +2229,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags, if (!grp) return; - atomic_set(&grp->refcount, 1); + refcount_set(&grp->refcount, 1); grp->active_nodes = 1; grp->max_faults_cpu = 0; spin_lock_init(&grp->lock); -- cgit v1.3-7-g2ca7 From ec1d281923cf81cc660343d0cb8ffc837ffb991d Mon Sep 17 00:00:00 2001 From: Elena Reshetova Date: Fri, 18 Jan 2019 14:27:29 +0200 Subject: sched/core: Convert task_struct.usage to refcount_t atomic_t variables are currently used to implement reference counters with the following properties: - counter is initialized to 1 using atomic_set() - a resource is freed upon counter reaching zero - once counter reaches zero, its further increments aren't allowed - counter schema uses basic atomic operations (set, inc, inc_not_zero, dec_and_test, etc.) Such atomic variables should be converted to a newly provided refcount_t type and API that prevents accidental counter overflows and underflows. This is important since overflows and underflows can lead to use-after-free situation and be exploitable. The variable task_struct.usage is used as pure reference counter. Convert it to refcount_t and fix up the operations. ** Important note for maintainers: Some functions from refcount_t API defined in lib/refcount.c have different memory ordering guarantees than their atomic counterparts. The full comparison can be seen in https://lkml.org/lkml/2017/11/15/57 and it is hopefully soon in state to be merged to the documentation tree. Normally the differences should not matter since refcount_t provides enough guarantees to satisfy the refcounting use cases, but in some rare cases it might matter. Please double check that you don't have some undocumented memory guarantees for this variable usage. For the task_struct.usage it might make a difference in following places: - put_task_struct(): decrement in refcount_dec_and_test() only provides RELEASE ordering and control dependency on success vs. fully ordered atomic counterpart Suggested-by: Kees Cook Signed-off-by: Elena Reshetova Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: David Windsor Reviewed-by: Hans Liljestrand Reviewed-by: Andrea Parri Cc: Linus Torvalds Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: akpm@linux-foundation.org Cc: viro@zeniv.linux.org.uk Link: https://lkml.kernel.org/r/1547814450-18902-5-git-send-email-elena.reshetova@intel.com Signed-off-by: Ingo Molnar --- include/linux/sched.h | 3 ++- include/linux/sched/task.h | 4 ++-- init/init_task.c | 2 +- kernel/fork.c | 4 ++-- 4 files changed, 7 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/include/linux/sched.h b/include/linux/sched.h index e2bba022827d..9d14d6864ca6 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -21,6 +21,7 @@ #include #include #include +#include #include #include #include @@ -607,7 +608,7 @@ struct task_struct { randomized_struct_fields_start void *stack; - atomic_t usage; + refcount_t usage; /* Per task flags (PF_*), defined further below: */ unsigned int flags; unsigned int ptrace; diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h index 44c6f15800ff..2e97a2227045 100644 --- a/include/linux/sched/task.h +++ b/include/linux/sched/task.h @@ -88,13 +88,13 @@ extern void sched_exec(void); #define sched_exec() {} #endif -#define get_task_struct(tsk) do { atomic_inc(&(tsk)->usage); } while(0) +#define get_task_struct(tsk) do { refcount_inc(&(tsk)->usage); } while(0) extern void __put_task_struct(struct task_struct *t); static inline void put_task_struct(struct task_struct *t) { - if (atomic_dec_and_test(&t->usage)) + if (refcount_dec_and_test(&t->usage)) __put_task_struct(t); } diff --git a/init/init_task.c b/init/init_task.c index 9aa3ebc74970..aca34c89529f 100644 --- a/init/init_task.c +++ b/init/init_task.c @@ -65,7 +65,7 @@ struct task_struct init_task #endif .state = 0, .stack = init_stack, - .usage = ATOMIC_INIT(2), + .usage = REFCOUNT_INIT(2), .flags = PF_KTHREAD, .prio = MAX_PRIO - 20, .static_prio = MAX_PRIO - 20, diff --git a/kernel/fork.c b/kernel/fork.c index 935a42d5f8ff..3f7e192e29f2 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -717,7 +717,7 @@ static inline void put_signal_struct(struct signal_struct *sig) void __put_task_struct(struct task_struct *tsk) { WARN_ON(!tsk->exit_state); - WARN_ON(atomic_read(&tsk->usage)); + WARN_ON(refcount_read(&tsk->usage)); WARN_ON(tsk == current); cgroup_free(tsk); @@ -896,7 +896,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node) * One for us, one for whoever does the "release_task()" (usually * parent) */ - atomic_set(&tsk->usage, 2); + refcount_set(&tsk->usage, 2); #ifdef CONFIG_BLK_DEV_IO_TRACE tsk->btrace_seq = 0; #endif -- cgit v1.3-7-g2ca7 From f0b89d3958d73cd0785ec381f0ddf8efb6f183d8 Mon Sep 17 00:00:00 2001 From: Elena Reshetova Date: Fri, 18 Jan 2019 14:27:30 +0200 Subject: sched/core: Convert task_struct.stack_refcount to refcount_t atomic_t variables are currently used to implement reference counters with the following properties: - counter is initialized to 1 using atomic_set() - a resource is freed upon counter reaching zero - once counter reaches zero, its further increments aren't allowed - counter schema uses basic atomic operations (set, inc, inc_not_zero, dec_and_test, etc.) Such atomic variables should be converted to a newly provided refcount_t type and API that prevents accidental counter overflows and underflows. This is important since overflows and underflows can lead to use-after-free situation and be exploitable. The variable task_struct.stack_refcount is used as pure reference counter. Convert it to refcount_t and fix up the operations. ** Important note for maintainers: Some functions from refcount_t API defined in lib/refcount.c have different memory ordering guarantees than their atomic counterparts. The full comparison can be seen in https://lkml.org/lkml/2017/11/15/57 and it is hopefully soon in state to be merged to the documentation tree. Normally the differences should not matter since refcount_t provides enough guarantees to satisfy the refcounting use cases, but in some rare cases it might matter. Please double check that you don't have some undocumented memory guarantees for this variable usage. For the task_struct.stack_refcount it might make a difference in following places: - try_get_task_stack(): increment in refcount_inc_not_zero() only guarantees control dependency on success vs. fully ordered atomic counterpart - put_task_stack(): decrement in refcount_dec_and_test() only provides RELEASE ordering and control dependency on success vs. fully ordered atomic counterpart Suggested-by: Kees Cook Signed-off-by: Elena Reshetova Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: David Windsor Reviewed-by: Hans Liljestrand Reviewed-by: Andrea Parri Cc: Linus Torvalds Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: akpm@linux-foundation.org Cc: viro@zeniv.linux.org.uk Link: https://lkml.kernel.org/r/1547814450-18902-6-git-send-email-elena.reshetova@intel.com Signed-off-by: Ingo Molnar --- include/linux/init_task.h | 1 + include/linux/sched.h | 2 +- include/linux/sched/task_stack.h | 2 +- init/init_task.c | 2 +- kernel/fork.c | 6 +++--- 5 files changed, 7 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/include/linux/init_task.h b/include/linux/init_task.h index a7083a45a26c..6049baa5b8bc 100644 --- a/include/linux/init_task.h +++ b/include/linux/init_task.h @@ -13,6 +13,7 @@ #include #include #include +#include #include #include #include diff --git a/include/linux/sched.h b/include/linux/sched.h index 9d14d6864ca6..628bf13cb5a5 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1194,7 +1194,7 @@ struct task_struct { #endif #ifdef CONFIG_THREAD_INFO_IN_TASK /* A live task holds one reference: */ - atomic_t stack_refcount; + refcount_t stack_refcount; #endif #ifdef CONFIG_LIVEPATCH int patch_state; diff --git a/include/linux/sched/task_stack.h b/include/linux/sched/task_stack.h index 6a841929073f..2413427e439c 100644 --- a/include/linux/sched/task_stack.h +++ b/include/linux/sched/task_stack.h @@ -61,7 +61,7 @@ static inline unsigned long *end_of_stack(struct task_struct *p) #ifdef CONFIG_THREAD_INFO_IN_TASK static inline void *try_get_task_stack(struct task_struct *tsk) { - return atomic_inc_not_zero(&tsk->stack_refcount) ? + return refcount_inc_not_zero(&tsk->stack_refcount) ? task_stack_page(tsk) : NULL; } diff --git a/init/init_task.c b/init/init_task.c index aca34c89529f..46dbf546264d 100644 --- a/init/init_task.c +++ b/init/init_task.c @@ -61,7 +61,7 @@ struct task_struct init_task = { #ifdef CONFIG_THREAD_INFO_IN_TASK .thread_info = INIT_THREAD_INFO(init_task), - .stack_refcount = ATOMIC_INIT(1), + .stack_refcount = REFCOUNT_INIT(1), #endif .state = 0, .stack = init_stack, diff --git a/kernel/fork.c b/kernel/fork.c index 3f7e192e29f2..77059b211608 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -429,7 +429,7 @@ static void release_task_stack(struct task_struct *tsk) #ifdef CONFIG_THREAD_INFO_IN_TASK void put_task_stack(struct task_struct *tsk) { - if (atomic_dec_and_test(&tsk->stack_refcount)) + if (refcount_dec_and_test(&tsk->stack_refcount)) release_task_stack(tsk); } #endif @@ -447,7 +447,7 @@ void free_task(struct task_struct *tsk) * If the task had a separate stack allocation, it should be gone * by now. */ - WARN_ON_ONCE(atomic_read(&tsk->stack_refcount) != 0); + WARN_ON_ONCE(refcount_read(&tsk->stack_refcount) != 0); #endif rt_mutex_debug_task_free(tsk); ftrace_graph_exit_task(tsk); @@ -867,7 +867,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node) tsk->stack_vm_area = stack_vm_area; #endif #ifdef CONFIG_THREAD_INFO_IN_TASK - atomic_set(&tsk->stack_refcount, 1); + refcount_set(&tsk->stack_refcount, 1); #endif if (err) -- cgit v1.3-7-g2ca7 From 513e1073d52e55b8024b4f238a48de7587c64ccf Mon Sep 17 00:00:00 2001 From: Waiman Long Date: Wed, 9 Jan 2019 23:03:25 -0500 Subject: locking/lockdep: Add debug_locks check in __lock_downgrade() Tetsuo Handa had reported he saw an incorrect "downgrading a read lock" warning right after a previous lockdep warning. It is likely that the previous warning turned off lock debugging causing the lockdep to have inconsistency states leading to the lock downgrade warning. Fix that by add a check for debug_locks at the beginning of __lock_downgrade(). Reported-by: Tetsuo Handa Reported-by: syzbot+53383ae265fb161ef488@syzkaller.appspotmail.com Signed-off-by: Waiman Long Signed-off-by: Peter Zijlstra (Intel) Cc: Andrew Morton Cc: Linus Torvalds Cc: Paul E. McKenney Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Will Deacon Link: https://lkml.kernel.org/r/1547093005-26085-1-git-send-email-longman@redhat.com Signed-off-by: Ingo Molnar --- kernel/locking/lockdep.c | 3 +++ 1 file changed, 3 insertions(+) (limited to 'kernel') diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 608f74ed8bb9..7f7db23fc002 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -3479,6 +3479,9 @@ __lock_set_class(struct lockdep_map *lock, const char *name, unsigned int depth; int i; + if (unlikely(!debug_locks)) + return 0; + depth = curr->lockdep_depth; /* * This function is about (re)setting the class of a held lock, -- cgit v1.3-7-g2ca7 From 07879c6a3740fbbf3c8891a0ab484c20a12794d8 Mon Sep 17 00:00:00 2001 From: Davidlohr Bueso Date: Tue, 18 Dec 2018 11:53:52 -0800 Subject: sched/wake_q: Reduce reference counting for special users Some users, specifically futexes and rwsems, required fixes that allowed the callers to be safe when wakeups occur before they are expected by wake_up_q(). Such scenarios also play games and rely on reference counting, and until now were pivoting on wake_q doing it. With the wake_q_add() call being moved down, this can no longer be the case. As such we end up with a a double task refcounting overhead; and these callers care enough about this (being rather core-ish). This patch introduces a wake_q_add_safe() call that serves for callers that have already done refcounting and therefore the task is 'safe' from wake_q point of view (int that it requires reference throughout the entire queue/>wakeup cycle). In the one case it has internal reference counting, in the other case it consumes the reference counting. Signed-off-by: Davidlohr Bueso Signed-off-by: Peter Zijlstra (Intel) Cc: Andrew Morton Cc: Linus Torvalds Cc: Paul E. McKenney Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Waiman Long Cc: Will Deacon Cc: Xie Yongji Cc: Yongji Xie Cc: andrea.parri@amarulasolutions.com Cc: lilin24@baidu.com Cc: liuqi16@baidu.com Cc: nixun@baidu.com Cc: yuanlinsi01@baidu.com Cc: zhangyu31@baidu.com Link: https://lkml.kernel.org/r/20181218195352.7orq3upiwfdbrdne@linux-r8p5 Signed-off-by: Ingo Molnar --- include/linux/sched/wake_q.h | 4 +-- kernel/futex.c | 3 +-- kernel/locking/rwsem-xadd.c | 4 +-- kernel/sched/core.c | 60 ++++++++++++++++++++++++++++++++------------ 4 files changed, 48 insertions(+), 23 deletions(-) (limited to 'kernel') diff --git a/include/linux/sched/wake_q.h b/include/linux/sched/wake_q.h index 545f37138057..ad826d2a4557 100644 --- a/include/linux/sched/wake_q.h +++ b/include/linux/sched/wake_q.h @@ -51,8 +51,8 @@ static inline void wake_q_init(struct wake_q_head *head) head->lastp = &head->first; } -extern void wake_q_add(struct wake_q_head *head, - struct task_struct *task); +extern void wake_q_add(struct wake_q_head *head, struct task_struct *task); +extern void wake_q_add_safe(struct wake_q_head *head, struct task_struct *task); extern void wake_up_q(struct wake_q_head *head); #endif /* _LINUX_SCHED_WAKE_Q_H */ diff --git a/kernel/futex.c b/kernel/futex.c index 69e619baf709..2abe1a0b3062 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -1463,8 +1463,7 @@ static void mark_wake_futex(struct wake_q_head *wake_q, struct futex_q *q) * Queue the task for later wakeup for after we've released * the hb->lock. wake_q_add() grabs reference to p. */ - wake_q_add(wake_q, p); - put_task_struct(p); + wake_q_add_safe(wake_q, p); } /* diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c index 50d9af615dc4..fbe96341beee 100644 --- a/kernel/locking/rwsem-xadd.c +++ b/kernel/locking/rwsem-xadd.c @@ -211,9 +211,7 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem, * Ensure issuing the wakeup (either by us or someone else) * after setting the reader waiter to nil. */ - wake_q_add(wake_q, tsk); - /* wake_q_add() already take the task ref */ - put_task_struct(tsk); + wake_q_add_safe(wake_q, tsk); } adjustment = woken * RWSEM_ACTIVE_READ_BIAS - adjustment; diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 3c8b4dba3d2d..64ceaa5158c5 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -396,19 +396,7 @@ static bool set_nr_if_polling(struct task_struct *p) #endif #endif -/** - * wake_q_add() - queue a wakeup for 'later' waking. - * @head: the wake_q_head to add @task to - * @task: the task to queue for 'later' wakeup - * - * Queue a task for later wakeup, most likely by the wake_up_q() call in the - * same context, _HOWEVER_ this is not guaranteed, the wakeup can come - * instantly. - * - * This function must be used as-if it were wake_up_process(); IOW the task - * must be ready to be woken at this location. - */ -void wake_q_add(struct wake_q_head *head, struct task_struct *task) +static bool __wake_q_add(struct wake_q_head *head, struct task_struct *task) { struct wake_q_node *node = &task->wake_q; @@ -422,15 +410,55 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task) */ smp_mb__before_atomic(); if (unlikely(cmpxchg_relaxed(&node->next, NULL, WAKE_Q_TAIL))) - return; - - get_task_struct(task); + return false; /* * The head is context local, there can be no concurrency. */ *head->lastp = node; head->lastp = &node->next; + return true; +} + +/** + * wake_q_add() - queue a wakeup for 'later' waking. + * @head: the wake_q_head to add @task to + * @task: the task to queue for 'later' wakeup + * + * Queue a task for later wakeup, most likely by the wake_up_q() call in the + * same context, _HOWEVER_ this is not guaranteed, the wakeup can come + * instantly. + * + * This function must be used as-if it were wake_up_process(); IOW the task + * must be ready to be woken at this location. + */ +void wake_q_add(struct wake_q_head *head, struct task_struct *task) +{ + if (__wake_q_add(head, task)) + get_task_struct(task); +} + +/** + * wake_q_add_safe() - safely queue a wakeup for 'later' waking. + * @head: the wake_q_head to add @task to + * @task: the task to queue for 'later' wakeup + * + * Queue a task for later wakeup, most likely by the wake_up_q() call in the + * same context, _HOWEVER_ this is not guaranteed, the wakeup can come + * instantly. + * + * This function must be used as-if it were wake_up_process(); IOW the task + * must be ready to be woken at this location. + * + * This function is essentially a task-safe equivalent to wake_q_add(). Callers + * that already hold reference to @task can call the 'safe' version and trust + * wake_q to do the right thing depending whether or not the @task is already + * queued for wakeup. + */ +void wake_q_add_safe(struct wake_q_head *head, struct task_struct *task) +{ + if (!__wake_q_add(head, task)) + put_task_struct(task); } void wake_up_q(struct wake_q_head *head) -- cgit v1.3-7-g2ca7 From d682b596d99345ef0000e7017db714ba7f29e017 Mon Sep 17 00:00:00 2001 From: Waiman Long Date: Tue, 29 Jan 2019 22:53:45 +0100 Subject: locking/qspinlock: Handle > 4 slowpath nesting levels Four queue nodes per CPU are allocated to enable up to 4 nesting levels using the per-CPU nodes. Nested NMIs are possible in some architectures. Still it is very unlikely that we will ever hit more than 4 nested levels with contention in the slowpath. When that rare condition happens, however, it is likely that the system will hang or crash shortly after that. It is not good and we need to handle this exception case. This is done by spinning directly on the lock using repeated trylock. This alternative code path should only be used when there is nested NMIs. Assuming that the locks used by those NMI handlers will not be heavily contended, a simple TAS locking should work out. Suggested-by: Peter Zijlstra Signed-off-by: Waiman Long Signed-off-by: Peter Zijlstra (Intel) Acked-by: Will Deacon Cc: Andrew Morton Cc: Borislav Petkov Cc: H. Peter Anvin Cc: James Morse Cc: Linus Torvalds Cc: Paul E. McKenney Cc: SRINIVAS Cc: Thomas Gleixner Cc: Zhenzhong Duan Link: https://lkml.kernel.org/r/1548798828-16156-2-git-send-email-longman@redhat.com Signed-off-by: Ingo Molnar --- kernel/locking/qspinlock.c | 15 +++++++++++++++ 1 file changed, 15 insertions(+) (limited to 'kernel') diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index 8a8c3c208c5e..0875053c4050 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -412,6 +412,21 @@ pv_queue: idx = node->count++; tail = encode_tail(smp_processor_id(), idx); + /* + * 4 nodes are allocated based on the assumption that there will + * not be nested NMIs taking spinlocks. That may not be true in + * some architectures even though the chance of needing more than + * 4 nodes will still be extremely unlikely. When that happens, + * we fall back to spinning on the lock directly without using + * any MCS node. This is not the most elegant solution, but is + * simple enough. + */ + if (unlikely(idx >= MAX_NODES)) { + while (!queued_spin_trylock(lock)) + cpu_relax(); + goto release; + } + node = grab_mcs_node(node, idx); /* -- cgit v1.3-7-g2ca7 From 412f34a82ccf7dd52f6b197f6450a33f03342523 Mon Sep 17 00:00:00 2001 From: Waiman Long Date: Tue, 29 Jan 2019 22:53:46 +0100 Subject: locking/qspinlock_stat: Track the no MCS node available case Track the number of slowpath locking operations that are being done without any MCS node available as well renaming lock_index[123] to make them more descriptive. Using these stat counters is one way to find out if a code path is being exercised. Signed-off-by: Waiman Long Signed-off-by: Peter Zijlstra (Intel) Cc: Andrew Morton Cc: Borislav Petkov Cc: H. Peter Anvin Cc: James Morse Cc: Linus Torvalds Cc: Paul E. McKenney Cc: Peter Zijlstra Cc: SRINIVAS Cc: Thomas Gleixner Cc: Will Deacon Cc: Zhenzhong Duan Link: https://lkml.kernel.org/r/1548798828-16156-3-git-send-email-longman@redhat.com Signed-off-by: Ingo Molnar --- kernel/locking/qspinlock.c | 3 ++- kernel/locking/qspinlock_stat.h | 21 +++++++++++++++------ 2 files changed, 17 insertions(+), 7 deletions(-) (limited to 'kernel') diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index 0875053c4050..21ee51b47961 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -422,6 +422,7 @@ pv_queue: * simple enough. */ if (unlikely(idx >= MAX_NODES)) { + qstat_inc(qstat_lock_no_node, true); while (!queued_spin_trylock(lock)) cpu_relax(); goto release; @@ -432,7 +433,7 @@ pv_queue: /* * Keep counts of non-zero index values: */ - qstat_inc(qstat_lock_idx1 + idx - 1, idx); + qstat_inc(qstat_lock_use_node2 + idx - 1, idx); /* * Ensure that we increment the head node->count before initialising diff --git a/kernel/locking/qspinlock_stat.h b/kernel/locking/qspinlock_stat.h index 42d3d8dc8f49..d73f85388d5c 100644 --- a/kernel/locking/qspinlock_stat.h +++ b/kernel/locking/qspinlock_stat.h @@ -30,6 +30,13 @@ * pv_wait_node - # of vCPU wait's at a non-head queue node * lock_pending - # of locking operations via pending code * lock_slowpath - # of locking operations via MCS lock queue + * lock_use_node2 - # of locking operations that use 2nd per-CPU node + * lock_use_node3 - # of locking operations that use 3rd per-CPU node + * lock_use_node4 - # of locking operations that use 4th per-CPU node + * lock_no_node - # of locking operations without using per-CPU node + * + * Subtracting lock_use_node[234] from lock_slowpath will give you + * lock_use_node1. * * Writing to the "reset_counters" file will reset all the above counter * values. @@ -55,9 +62,10 @@ enum qlock_stats { qstat_pv_wait_node, qstat_lock_pending, qstat_lock_slowpath, - qstat_lock_idx1, - qstat_lock_idx2, - qstat_lock_idx3, + qstat_lock_use_node2, + qstat_lock_use_node3, + qstat_lock_use_node4, + qstat_lock_no_node, qstat_num, /* Total number of statistical counters */ qstat_reset_cnts = qstat_num, }; @@ -85,9 +93,10 @@ static const char * const qstat_names[qstat_num + 1] = { [qstat_pv_wait_node] = "pv_wait_node", [qstat_lock_pending] = "lock_pending", [qstat_lock_slowpath] = "lock_slowpath", - [qstat_lock_idx1] = "lock_index1", - [qstat_lock_idx2] = "lock_index2", - [qstat_lock_idx3] = "lock_index3", + [qstat_lock_use_node2] = "lock_use_node2", + [qstat_lock_use_node3] = "lock_use_node3", + [qstat_lock_use_node4] = "lock_use_node4", + [qstat_lock_no_node] = "lock_no_node", [qstat_reset_cnts] = "reset_counters", }; -- cgit v1.3-7-g2ca7 From 62478d9911fab9694c195f0ca8e4701de09be98e Mon Sep 17 00:00:00 2001 From: Vincent Guittot Date: Wed, 23 Jan 2019 16:26:52 +0100 Subject: sched/fair: Move the rq_of() helper function Move rq_of() helper function so it can be used in pelt.c [ mingo: Improve readability while at it. ] Signed-off-by: Vincent Guittot Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Mike Galbraith Cc: Morten.Rasmussen@arm.com Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: bsegall@google.com Cc: dietmar.eggemann@arm.com Cc: patrick.bellasi@arm.com Cc: pjt@google.com Cc: pkondeti@codeaurora.org Cc: quentin.perret@arm.com Cc: rjw@rjwysocki.net Cc: srinivas.pandruvada@linux.intel.com Cc: thara.gopinath@linaro.org Link: https://lkml.kernel.org/r/1548257214-13745-2-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 13 ------------- kernel/sched/sched.h | 16 ++++++++++++++++ 2 files changed, 16 insertions(+), 13 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 5b2b919c7929..da13e834e990 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -248,13 +248,6 @@ const struct sched_class fair_sched_class; */ #ifdef CONFIG_FAIR_GROUP_SCHED - -/* cpu runqueue to which this cfs_rq is attached */ -static inline struct rq *rq_of(struct cfs_rq *cfs_rq) -{ - return cfs_rq->rq; -} - static inline struct task_struct *task_of(struct sched_entity *se) { SCHED_WARN_ON(!entity_is_task(se)); @@ -410,12 +403,6 @@ static inline struct task_struct *task_of(struct sched_entity *se) return container_of(se, struct task_struct, se); } -static inline struct rq *rq_of(struct cfs_rq *cfs_rq) -{ - return container_of(cfs_rq, struct rq, cfs); -} - - #define for_each_sched_entity(se) \ for (; se; se = NULL) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index d27c1a5d4e25..0ed130fae2a9 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -951,6 +951,22 @@ struct rq { #endif }; +#ifdef CONFIG_FAIR_GROUP_SCHED + +/* CPU runqueue to which this cfs_rq is attached */ +static inline struct rq *rq_of(struct cfs_rq *cfs_rq) +{ + return cfs_rq->rq; +} + +#else + +static inline struct rq *rq_of(struct cfs_rq *cfs_rq) +{ + return container_of(cfs_rq, struct rq, cfs); +} +#endif + static inline int cpu_of(struct rq *rq) { #ifdef CONFIG_SMP -- cgit v1.3-7-g2ca7 From 23127296889fe84b0762b191b5d041e8ba6f2599 Mon Sep 17 00:00:00 2001 From: Vincent Guittot Date: Wed, 23 Jan 2019 16:26:53 +0100 Subject: sched/fair: Update scale invariance of PELT The current implementation of load tracking invariance scales the contribution with current frequency and uarch performance (only for utilization) of the CPU. One main result of this formula is that the figures are capped by current capacity of CPU. Another one is that the load_avg is not invariant because not scaled with uarch. The util_avg of a periodic task that runs r time slots every p time slots varies in the range : U * (1-y^r)/(1-y^p) * y^i < Utilization < U * (1-y^r)/(1-y^p) with U is the max util_avg value = SCHED_CAPACITY_SCALE At a lower capacity, the range becomes: U * C * (1-y^r')/(1-y^p) * y^i' < Utilization < U * C * (1-y^r')/(1-y^p) with C reflecting the compute capacity ratio between current capacity and max capacity. so C tries to compensate changes in (1-y^r') but it can't be accurate. Instead of scaling the contribution value of PELT algo, we should scale the running time. The PELT signal aims to track the amount of computation of tasks and/or rq so it seems more correct to scale the running time to reflect the effective amount of computation done since the last update. In order to be fully invariant, we need to apply the same amount of running time and idle time whatever the current capacity. Because running at lower capacity implies that the task will run longer, we have to ensure that the same amount of idle time will be applied when system becomes idle and no idle time has been "stolen". But reaching the maximum utilization value (SCHED_CAPACITY_SCALE) means that the task is seen as an always-running task whatever the capacity of the CPU (even at max compute capacity). In this case, we can discard this "stolen" idle times which becomes meaningless. In order to achieve this time scaling, a new clock_pelt is created per rq. The increase of this clock scales with current capacity when something is running on rq and synchronizes with clock_task when rq is idle. With this mechanism, we ensure the same running and idle time whatever the current capacity. This also enables to simplify the pelt algorithm by removing all references of uarch and frequency and applying the same contribution to utilization and loads. Furthermore, the scaling is done only once per update of clock (update_rq_clock_task()) instead of during each update of sched_entities and cfs/rt/dl_rq of the rq like the current implementation. This is interesting when cgroup are involved as shown in the results below: On a hikey (octo Arm64 platform). Performance cpufreq governor and only shallowest c-state to remove variance generated by those power features so we only track the impact of pelt algo. each test runs 16 times: ./perf bench sched pipe (higher is better) kernel tip/sched/core + patch ops/seconds ops/seconds diff cgroup root 59652(+/- 0.18%) 59876(+/- 0.24%) +0.38% level1 55608(+/- 0.27%) 55923(+/- 0.24%) +0.57% level2 52115(+/- 0.29%) 52564(+/- 0.22%) +0.86% hackbench -l 1000 (lower is better) kernel tip/sched/core + patch duration(sec) duration(sec) diff cgroup root 4.453(+/- 2.37%) 4.383(+/- 2.88%) -1.57% level1 4.859(+/- 8.50%) 4.830(+/- 7.07%) -0.60% level2 5.063(+/- 9.83%) 4.928(+/- 9.66%) -2.66% Then, the responsiveness of PELT is improved when CPU is not running at max capacity with this new algorithm. I have put below some examples of duration to reach some typical load values according to the capacity of the CPU with current implementation and with this patch. These values has been computed based on the geometric series and the half period value: Util (%) max capacity half capacity(mainline) half capacity(w/ patch) 972 (95%) 138ms not reachable 276ms 486 (47.5%) 30ms 138ms 60ms 256 (25%) 13ms 32ms 26ms On my hikey (octo Arm64 platform) with schedutil governor, the time to reach max OPP when starting from a null utilization, decreases from 223ms with current scale invariance down to 121ms with the new algorithm. Signed-off-by: Vincent Guittot Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Mike Galbraith Cc: Morten.Rasmussen@arm.com Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: bsegall@google.com Cc: dietmar.eggemann@arm.com Cc: patrick.bellasi@arm.com Cc: pjt@google.com Cc: pkondeti@codeaurora.org Cc: quentin.perret@arm.com Cc: rjw@rjwysocki.net Cc: srinivas.pandruvada@linux.intel.com Cc: thara.gopinath@linaro.org Link: https://lkml.kernel.org/r/1548257214-13745-3-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar --- include/linux/sched.h | 23 +++------- kernel/sched/core.c | 1 + kernel/sched/deadline.c | 6 +-- kernel/sched/fair.c | 45 ++++++++++--------- kernel/sched/pelt.c | 45 ++++++++++--------- kernel/sched/pelt.h | 114 ++++++++++++++++++++++++++++++++++++++++++++++-- kernel/sched/rt.c | 6 +-- kernel/sched/sched.h | 5 ++- 8 files changed, 177 insertions(+), 68 deletions(-) (limited to 'kernel') diff --git a/include/linux/sched.h b/include/linux/sched.h index 628bf13cb5a5..351c0fe64c85 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -357,12 +357,6 @@ struct util_est { * For cfs_rq, it is the aggregated load_avg of all runnable and * blocked sched_entities. * - * load_avg may also take frequency scaling into account: - * - * load_avg = runnable% * scale_load_down(load) * freq% - * - * where freq% is the CPU frequency normalized to the highest frequency. - * * [util_avg definition] * * util_avg = running% * SCHED_CAPACITY_SCALE @@ -371,17 +365,14 @@ struct util_est { * a CPU. For cfs_rq, it is the aggregated util_avg of all runnable * and blocked sched_entities. * - * util_avg may also factor frequency scaling and CPU capacity scaling: - * - * util_avg = running% * SCHED_CAPACITY_SCALE * freq% * capacity% - * - * where freq% is the same as above, and capacity% is the CPU capacity - * normalized to the greatest capacity (due to uarch differences, etc). + * load_avg and util_avg don't direcly factor frequency scaling and CPU + * capacity scaling. The scaling is done through the rq_clock_pelt that + * is used for computing those signals (see update_rq_clock_pelt()) * - * N.B., the above ratios (runnable%, running%, freq%, and capacity%) - * themselves are in the range of [0, 1]. To do fixed point arithmetics, - * we therefore scale them to as large a range as necessary. This is for - * example reflected by util_avg's SCHED_CAPACITY_SCALE. + * N.B., the above ratios (runnable% and running%) themselves are in the + * range of [0, 1]. To do fixed point arithmetics, we therefore scale them + * to as large a range as necessary. This is for example reflected by + * util_avg's SCHED_CAPACITY_SCALE. * * [Overflow issue] * diff --git a/kernel/sched/core.c b/kernel/sched/core.c index a674c7db2f29..32e06704565e 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -180,6 +180,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta) if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY)) update_irq_load_avg(rq, irq_delta + steal); #endif + update_rq_clock_pelt(rq, delta); } void update_rq_clock(struct rq *rq) diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index fb8b7b5d745d..6a73e41a2016 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -1767,7 +1767,7 @@ pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) deadline_queue_push_tasks(rq); if (rq->curr->sched_class != &dl_sched_class) - update_dl_rq_load_avg(rq_clock_task(rq), rq, 0); + update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 0); return p; } @@ -1776,7 +1776,7 @@ static void put_prev_task_dl(struct rq *rq, struct task_struct *p) { update_curr_dl(rq); - update_dl_rq_load_avg(rq_clock_task(rq), rq, 1); + update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 1); if (on_dl_rq(&p->dl) && p->nr_cpus_allowed > 1) enqueue_pushable_dl_task(rq, p); } @@ -1793,7 +1793,7 @@ static void task_tick_dl(struct rq *rq, struct task_struct *p, int queued) { update_curr_dl(rq); - update_dl_rq_load_avg(rq_clock_task(rq), rq, 1); + update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 1); /* * Even when we have runtime, update_curr_dl() might have resulted in us * not being the leftmost task anymore. In that case NEED_RESCHED will diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index da13e834e990..f41f2eec6186 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -673,9 +673,8 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se) return calc_delta_fair(sched_slice(cfs_rq, se), se); } -#ifdef CONFIG_SMP #include "pelt.h" -#include "sched-pelt.h" +#ifdef CONFIG_SMP static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu); static unsigned long task_h_load(struct task_struct *p); @@ -763,7 +762,7 @@ void post_init_entity_util_avg(struct sched_entity *se) * such that the next switched_to_fair() has the * expected state. */ - se->avg.last_update_time = cfs_rq_clock_task(cfs_rq); + se->avg.last_update_time = cfs_rq_clock_pelt(cfs_rq); return; } } @@ -3109,7 +3108,7 @@ void set_task_rq_fair(struct sched_entity *se, p_last_update_time = prev->avg.last_update_time; n_last_update_time = next->avg.last_update_time; #endif - __update_load_avg_blocked_se(p_last_update_time, cpu_of(rq_of(prev)), se); + __update_load_avg_blocked_se(p_last_update_time, se); se->avg.last_update_time = n_last_update_time; } @@ -3244,11 +3243,11 @@ update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cf /* * runnable_sum can't be lower than running_sum - * As running sum is scale with CPU capacity wehreas the runnable sum - * is not we rescale running_sum 1st + * Rescale running sum to be in the same range as runnable sum + * running_sum is in [0 : LOAD_AVG_MAX << SCHED_CAPACITY_SHIFT] + * runnable_sum is in [0 : LOAD_AVG_MAX] */ - running_sum = se->avg.util_sum / - arch_scale_cpu_capacity(NULL, cpu_of(rq_of(cfs_rq))); + running_sum = se->avg.util_sum >> SCHED_CAPACITY_SHIFT; runnable_sum = max(runnable_sum, running_sum); load_sum = (s64)se_weight(se) * runnable_sum; @@ -3351,7 +3350,7 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum /** * update_cfs_rq_load_avg - update the cfs_rq's load/util averages - * @now: current time, as per cfs_rq_clock_task() + * @now: current time, as per cfs_rq_clock_pelt() * @cfs_rq: cfs_rq to update * * The cfs_rq avg is the direct sum of all its entities (blocked and runnable) @@ -3396,7 +3395,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq) decayed = 1; } - decayed |= __update_load_avg_cfs_rq(now, cpu_of(rq_of(cfs_rq)), cfs_rq); + decayed |= __update_load_avg_cfs_rq(now, cfs_rq); #ifndef CONFIG_64BIT smp_wmb(); @@ -3486,9 +3485,7 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s /* Update task and its cfs_rq load average */ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) { - u64 now = cfs_rq_clock_task(cfs_rq); - struct rq *rq = rq_of(cfs_rq); - int cpu = cpu_of(rq); + u64 now = cfs_rq_clock_pelt(cfs_rq); int decayed; /* @@ -3496,7 +3493,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s * track group sched_entity load average for task_h_load calc in migration */ if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD)) - __update_load_avg_se(now, cpu, cfs_rq, se); + __update_load_avg_se(now, cfs_rq, se); decayed = update_cfs_rq_load_avg(now, cfs_rq); decayed |= propagate_entity_load_avg(se); @@ -3548,7 +3545,7 @@ void sync_entity_load_avg(struct sched_entity *se) u64 last_update_time; last_update_time = cfs_rq_last_update_time(cfs_rq); - __update_load_avg_blocked_se(last_update_time, cpu_of(rq_of(cfs_rq)), se); + __update_load_avg_blocked_se(last_update_time, se); } /* @@ -7015,6 +7012,12 @@ idle: if (new_tasks > 0) goto again; + /* + * rq is about to be idle, check if we need to update the + * lost_idle_time of clock_pelt + */ + update_idle_rq_clock_pelt(rq); + return NULL; } @@ -7657,7 +7660,7 @@ static void update_blocked_averages(int cpu) if (throttled_hierarchy(cfs_rq)) continue; - if (update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq)) + if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq)) update_tg_load_avg(cfs_rq, 0); /* Propagate pending load changes to the parent, if any: */ @@ -7671,8 +7674,8 @@ static void update_blocked_averages(int cpu) } curr_class = rq->curr->sched_class; - update_rt_rq_load_avg(rq_clock_task(rq), rq, curr_class == &rt_sched_class); - update_dl_rq_load_avg(rq_clock_task(rq), rq, curr_class == &dl_sched_class); + update_rt_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &rt_sched_class); + update_dl_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &dl_sched_class); update_irq_load_avg(rq, 0); /* Don't need periodic decay once load/util_avg are null */ if (others_have_blocked(rq)) @@ -7742,11 +7745,11 @@ static inline void update_blocked_averages(int cpu) rq_lock_irqsave(rq, &rf); update_rq_clock(rq); - update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq); + update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq); curr_class = rq->curr->sched_class; - update_rt_rq_load_avg(rq_clock_task(rq), rq, curr_class == &rt_sched_class); - update_dl_rq_load_avg(rq_clock_task(rq), rq, curr_class == &dl_sched_class); + update_rt_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &rt_sched_class); + update_dl_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &dl_sched_class); update_irq_load_avg(rq, 0); #ifdef CONFIG_NO_HZ_COMMON rq->last_blocked_load_update_tick = jiffies; diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c index 90fb5bc12ad4..befce29bd882 100644 --- a/kernel/sched/pelt.c +++ b/kernel/sched/pelt.c @@ -26,7 +26,6 @@ #include #include "sched.h" -#include "sched-pelt.h" #include "pelt.h" /* @@ -106,16 +105,12 @@ static u32 __accumulate_pelt_segments(u64 periods, u32 d1, u32 d3) * n=1 */ static __always_inline u32 -accumulate_sum(u64 delta, int cpu, struct sched_avg *sa, +accumulate_sum(u64 delta, struct sched_avg *sa, unsigned long load, unsigned long runnable, int running) { - unsigned long scale_freq, scale_cpu; u32 contrib = (u32)delta; /* p == 0 -> delta < 1024 */ u64 periods; - scale_freq = arch_scale_freq_capacity(cpu); - scale_cpu = arch_scale_cpu_capacity(NULL, cpu); - delta += sa->period_contrib; periods = delta / 1024; /* A period is 1024us (~1ms) */ @@ -137,13 +132,12 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa, } sa->period_contrib = delta; - contrib = cap_scale(contrib, scale_freq); if (load) sa->load_sum += load * contrib; if (runnable) sa->runnable_load_sum += runnable * contrib; if (running) - sa->util_sum += contrib * scale_cpu; + sa->util_sum += contrib << SCHED_CAPACITY_SHIFT; return periods; } @@ -177,7 +171,7 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa, * = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}] */ static __always_inline int -___update_load_sum(u64 now, int cpu, struct sched_avg *sa, +___update_load_sum(u64 now, struct sched_avg *sa, unsigned long load, unsigned long runnable, int running) { u64 delta; @@ -221,7 +215,7 @@ ___update_load_sum(u64 now, int cpu, struct sched_avg *sa, * Step 1: accumulate *_sum since last_update_time. If we haven't * crossed period boundaries, finish. */ - if (!accumulate_sum(delta, cpu, sa, load, runnable, running)) + if (!accumulate_sum(delta, sa, load, runnable, running)) return 0; return 1; @@ -267,9 +261,9 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load, unsigned long runna * runnable_load_avg = \Sum se->avg.runable_load_avg */ -int __update_load_avg_blocked_se(u64 now, int cpu, struct sched_entity *se) +int __update_load_avg_blocked_se(u64 now, struct sched_entity *se) { - if (___update_load_sum(now, cpu, &se->avg, 0, 0, 0)) { + if (___update_load_sum(now, &se->avg, 0, 0, 0)) { ___update_load_avg(&se->avg, se_weight(se), se_runnable(se)); return 1; } @@ -277,9 +271,9 @@ int __update_load_avg_blocked_se(u64 now, int cpu, struct sched_entity *se) return 0; } -int __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_entity *se) +int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se) { - if (___update_load_sum(now, cpu, &se->avg, !!se->on_rq, !!se->on_rq, + if (___update_load_sum(now, &se->avg, !!se->on_rq, !!se->on_rq, cfs_rq->curr == se)) { ___update_load_avg(&se->avg, se_weight(se), se_runnable(se)); @@ -290,9 +284,9 @@ int __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_e return 0; } -int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq) +int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq) { - if (___update_load_sum(now, cpu, &cfs_rq->avg, + if (___update_load_sum(now, &cfs_rq->avg, scale_load_down(cfs_rq->load.weight), scale_load_down(cfs_rq->runnable_weight), cfs_rq->curr != NULL)) { @@ -317,7 +311,7 @@ int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq) int update_rt_rq_load_avg(u64 now, struct rq *rq, int running) { - if (___update_load_sum(now, rq->cpu, &rq->avg_rt, + if (___update_load_sum(now, &rq->avg_rt, running, running, running)) { @@ -340,7 +334,7 @@ int update_rt_rq_load_avg(u64 now, struct rq *rq, int running) int update_dl_rq_load_avg(u64 now, struct rq *rq, int running) { - if (___update_load_sum(now, rq->cpu, &rq->avg_dl, + if (___update_load_sum(now, &rq->avg_dl, running, running, running)) { @@ -365,22 +359,31 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running) int update_irq_load_avg(struct rq *rq, u64 running) { int ret = 0; + + /* + * We can't use clock_pelt because irq time is not accounted in + * clock_task. Instead we directly scale the running time to + * reflect the real amount of computation + */ + running = cap_scale(running, arch_scale_freq_capacity(cpu_of(rq))); + running = cap_scale(running, arch_scale_cpu_capacity(NULL, cpu_of(rq))); + /* * We know the time that has been used by interrupt since last update * but we don't when. Let be pessimistic and assume that interrupt has * happened just before the update. This is not so far from reality * because interrupt will most probably wake up task and trig an update - * of rq clock during which the metric si updated. + * of rq clock during which the metric is updated. * We start to decay with normal context time and then we add the * interrupt context time. * We can safely remove running from rq->clock because * rq->clock += delta with delta >= running */ - ret = ___update_load_sum(rq->clock - running, rq->cpu, &rq->avg_irq, + ret = ___update_load_sum(rq->clock - running, &rq->avg_irq, 0, 0, 0); - ret += ___update_load_sum(rq->clock, rq->cpu, &rq->avg_irq, + ret += ___update_load_sum(rq->clock, &rq->avg_irq, 1, 1, 1); diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h index 7e56b489ff32..7489d5f56960 100644 --- a/kernel/sched/pelt.h +++ b/kernel/sched/pelt.h @@ -1,8 +1,9 @@ #ifdef CONFIG_SMP +#include "sched-pelt.h" -int __update_load_avg_blocked_se(u64 now, int cpu, struct sched_entity *se); -int __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_entity *se); -int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq); +int __update_load_avg_blocked_se(u64 now, struct sched_entity *se); +int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se); +int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq); int update_rt_rq_load_avg(u64 now, struct rq *rq, int running); int update_dl_rq_load_avg(u64 now, struct rq *rq, int running); @@ -42,6 +43,101 @@ static inline void cfs_se_util_change(struct sched_avg *avg) WRITE_ONCE(avg->util_est.enqueued, enqueued); } +/* + * The clock_pelt scales the time to reflect the effective amount of + * computation done during the running delta time but then sync back to + * clock_task when rq is idle. + * + * + * absolute time | 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|16 + * @ max capacity ------******---------------******--------------- + * @ half capacity ------************---------************--------- + * clock pelt | 1| 2| 3| 4| 7| 8| 9| 10| 11|14|15|16 + * + */ +static inline void update_rq_clock_pelt(struct rq *rq, s64 delta) +{ + if (unlikely(is_idle_task(rq->curr))) { + /* The rq is idle, we can sync to clock_task */ + rq->clock_pelt = rq_clock_task(rq); + return; + } + + /* + * When a rq runs at a lower compute capacity, it will need + * more time to do the same amount of work than at max + * capacity. In order to be invariant, we scale the delta to + * reflect how much work has been really done. + * Running longer results in stealing idle time that will + * disturb the load signal compared to max capacity. This + * stolen idle time will be automatically reflected when the + * rq will be idle and the clock will be synced with + * rq_clock_task. + */ + + /* + * Scale the elapsed time to reflect the real amount of + * computation + */ + delta = cap_scale(delta, arch_scale_cpu_capacity(NULL, cpu_of(rq))); + delta = cap_scale(delta, arch_scale_freq_capacity(cpu_of(rq))); + + rq->clock_pelt += delta; +} + +/* + * When rq becomes idle, we have to check if it has lost idle time + * because it was fully busy. A rq is fully used when the /Sum util_sum + * is greater or equal to: + * (LOAD_AVG_MAX - 1024 + rq->cfs.avg.period_contrib) << SCHED_CAPACITY_SHIFT; + * For optimization and computing rounding purpose, we don't take into account + * the position in the current window (period_contrib) and we use the higher + * bound of util_sum to decide. + */ +static inline void update_idle_rq_clock_pelt(struct rq *rq) +{ + u32 divider = ((LOAD_AVG_MAX - 1024) << SCHED_CAPACITY_SHIFT) - LOAD_AVG_MAX; + u32 util_sum = rq->cfs.avg.util_sum; + util_sum += rq->avg_rt.util_sum; + util_sum += rq->avg_dl.util_sum; + + /* + * Reflecting stolen time makes sense only if the idle + * phase would be present at max capacity. As soon as the + * utilization of a rq has reached the maximum value, it is + * considered as an always runnig rq without idle time to + * steal. This potential idle time is considered as lost in + * this case. We keep track of this lost idle time compare to + * rq's clock_task. + */ + if (util_sum >= divider) + rq->lost_idle_time += rq_clock_task(rq) - rq->clock_pelt; +} + +static inline u64 rq_clock_pelt(struct rq *rq) +{ + lockdep_assert_held(&rq->lock); + assert_clock_updated(rq); + + return rq->clock_pelt - rq->lost_idle_time; +} + +#ifdef CONFIG_CFS_BANDWIDTH +/* rq->task_clock normalized against any time this cfs_rq has spent throttled */ +static inline u64 cfs_rq_clock_pelt(struct cfs_rq *cfs_rq) +{ + if (unlikely(cfs_rq->throttle_count)) + return cfs_rq->throttled_clock_task - cfs_rq->throttled_clock_task_time; + + return rq_clock_pelt(rq_of(cfs_rq)) - cfs_rq->throttled_clock_task_time; +} +#else +static inline u64 cfs_rq_clock_pelt(struct cfs_rq *cfs_rq) +{ + return rq_clock_pelt(rq_of(cfs_rq)); +} +#endif + #else static inline int @@ -67,6 +163,18 @@ update_irq_load_avg(struct rq *rq, u64 running) { return 0; } + +static inline u64 rq_clock_pelt(struct rq *rq) +{ + return rq_clock_task(rq); +} + +static inline void +update_rq_clock_pelt(struct rq *rq, s64 delta) { } + +static inline void +update_idle_rq_clock_pelt(struct rq *rq) { } + #endif diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index e4f398ad9e73..90fa23d36565 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -1587,7 +1587,7 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) * rt task */ if (rq->curr->sched_class != &rt_sched_class) - update_rt_rq_load_avg(rq_clock_task(rq), rq, 0); + update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 0); return p; } @@ -1596,7 +1596,7 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p) { update_curr_rt(rq); - update_rt_rq_load_avg(rq_clock_task(rq), rq, 1); + update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 1); /* * The previous task needs to be made eligible for pushing @@ -2325,7 +2325,7 @@ static void task_tick_rt(struct rq *rq, struct task_struct *p, int queued) struct sched_rt_entity *rt_se = &p->rt; update_curr_rt(rq); - update_rt_rq_load_avg(rq_clock_task(rq), rq, 1); + update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 1); watchdog(rq, p); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 0ed130fae2a9..fe31bc472f3e 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -861,7 +861,10 @@ struct rq { unsigned int clock_update_flags; u64 clock; - u64 clock_task; + /* Ensure that all clocks are in the same cache line */ + u64 clock_task ____cacheline_aligned; + u64 clock_pelt; + unsigned long lost_idle_time; atomic_t nr_iowait; -- cgit v1.3-7-g2ca7 From 10a35e6812aa0953f02a956c499d23fe4e68af4a Mon Sep 17 00:00:00 2001 From: Vincent Guittot Date: Wed, 23 Jan 2019 16:26:54 +0100 Subject: sched/pelt: Skip updating util_est when utilization is higher than CPU's capacity util_est is mainly meant to be a lower-bound for tasks utilization. That's why task_util_est() returns the actual util_avg when it's higher than the estimated utilization. With new invaraince signal and without any special check on samples collection, if a task is limited because of thermal capping for example, we could end up overestimating its utilization and thus perhaps generating an unwanted frequency spike when the capping is relaxed... and (even worst) it will take some more activations for the estimated utilization to converge back to the actual utilization. Since we cannot easily know if there is idle time in a CPU when a task completes an activation with a utilization higher then the CPU capacity, we skip the sampling when utilization is higher than CPU's capacity. Suggested-by: Patrick Bellasi Signed-off-by: Vincent Guittot Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Mike Galbraith Cc: Morten.Rasmussen@arm.com Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: bsegall@google.com Cc: dietmar.eggemann@arm.com Cc: pjt@google.com Cc: pkondeti@codeaurora.org Cc: quentin.perret@arm.com Cc: rjw@rjwysocki.net Cc: srinivas.pandruvada@linux.intel.com Cc: thara.gopinath@linaro.org Link: https://lkml.kernel.org/r/1548257214-13745-4-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 14 +++++++++----- kernel/sched/sched.h | 7 +++++++ 2 files changed, 16 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index f41f2eec6186..8c165f0d33b3 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3638,6 +3638,7 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep) { long last_ewma_diff; struct util_est ue; + int cpu; if (!sched_feat(UTIL_EST)) return; @@ -3671,6 +3672,14 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep) if (within_margin(last_ewma_diff, (SCHED_CAPACITY_SCALE / 100))) return; + /* + * To avoid overestimation of actual task utilization, skip updates if + * we cannot grant there is idle time in this CPU. + */ + cpu = cpu_of(rq_of(cfs_rq)); + if (task_util(p) > capacity_orig_of(cpu)) + return; + /* * Update Task's estimated utilization * @@ -5542,11 +5551,6 @@ static unsigned long capacity_of(int cpu) return cpu_rq(cpu)->cpu_capacity; } -static unsigned long capacity_orig_of(int cpu) -{ - return cpu_rq(cpu)->cpu_capacity_orig; -} - static unsigned long cpu_avg_load_per_task(int cpu) { struct rq *rq = cpu_rq(cpu); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index fe31bc472f3e..99e2a7772d16 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2230,6 +2230,13 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {} # define arch_scale_freq_invariant() false #endif +#ifdef CONFIG_SMP +static inline unsigned long capacity_orig_of(int cpu) +{ + return cpu_rq(cpu)->cpu_capacity_orig; +} +#endif + #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL /** * enum schedutil_type - CPU utilization type -- cgit v1.3-7-g2ca7 From 1ca4fa3ab604734e38e2a3000c9abf788512ffa7 Mon Sep 17 00:00:00 2001 From: Hidetoshi Seto Date: Tue, 29 Jan 2019 10:12:45 -0500 Subject: sched/debug: Initialize sd_sysctl_cpus if !CONFIG_CPUMASK_OFFSTACK register_sched_domain_sysctl() copies the cpu_possible_mask into sd_sysctl_cpus, but only if sd_sysctl_cpus hasn't already been allocated (ie, CONFIG_CPUMASK_OFFSTACK is set). However, when CONFIG_CPUMASK_OFFSTACK is not set, sd_sysctl_cpus is left uninitialized (all zeroes) and the kernel may fail to initialize sched_domain sysctl entries for all possible CPUs. This is visible to the user if the kernel is booted with maxcpus=n, or if ACPI tables have been modified to leave CPUs offline, and then checking for missing /proc/sys/kernel/sched_domain/cpu* entries. Fix this by separating the allocation and initialization, and adding a flag to initialize the possible CPU entries while system booting only. Tested-by: Syuuichirou Ishii Tested-by: Tarumizu, Kohei Signed-off-by: Hidetoshi Seto Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: Masayoshi Mizuma Acked-by: Joe Lawrence Cc: Linus Torvalds Cc: Masayoshi Mizuma Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Thomas Gleixner Link: https://lkml.kernel.org/r/20190129151245.5073-1-msys.mizuma@gmail.com Signed-off-by: Ingo Molnar --- kernel/sched/debug.c | 4 ++++ 1 file changed, 4 insertions(+) (limited to 'kernel') diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index de3de997e245..8039d62ae36e 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -315,6 +315,7 @@ void register_sched_domain_sysctl(void) { static struct ctl_table *cpu_entries; static struct ctl_table **cpu_idx; + static bool init_done = false; char buf[32]; int i; @@ -344,7 +345,10 @@ void register_sched_domain_sysctl(void) if (!cpumask_available(sd_sysctl_cpus)) { if (!alloc_cpumask_var(&sd_sysctl_cpus, GFP_KERNEL)) return; + } + if (!init_done) { + init_done = true; /* init to possible to not have holes in @cpu_entries */ cpumask_copy(sd_sysctl_cpus, cpu_possible_mask); } -- cgit v1.3-7-g2ca7 From c546951d9c9300065bad253ecdf1ac59ce9d06c8 Mon Sep 17 00:00:00 2001 From: Andrea Parri Date: Mon, 21 Jan 2019 16:52:40 +0100 Subject: sched/core: Use READ_ONCE()/WRITE_ONCE() in move_queued_task()/task_rq_lock() move_queued_task() synchronizes with task_rq_lock() as follows: move_queued_task() task_rq_lock() [S] ->on_rq = MIGRATING [L] rq = task_rq() WMB (__set_task_cpu()) ACQUIRE (rq->lock); [S] ->cpu = new_cpu [L] ->on_rq where "[L] rq = task_rq()" is ordered before "ACQUIRE (rq->lock)" by an address dependency and, in turn, "ACQUIRE (rq->lock)" is ordered before "[L] ->on_rq" by the ACQUIRE itself. Use READ_ONCE() to load ->cpu in task_rq() (c.f., task_cpu()) to honor this address dependency. Also, mark the accesses to ->cpu and ->on_rq with READ_ONCE()/WRITE_ONCE() to comply with the LKMM. Signed-off-by: Andrea Parri Signed-off-by: Peter Zijlstra (Intel) Cc: Alan Stern Cc: Linus Torvalds Cc: Mike Galbraith Cc: Paul E. McKenney Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Will Deacon Link: https://lkml.kernel.org/r/20190121155240.27173-1-andrea.parri@amarulasolutions.com Signed-off-by: Ingo Molnar --- include/linux/sched.h | 4 ++-- kernel/sched/core.c | 9 +++++---- kernel/sched/sched.h | 6 +++--- 3 files changed, 10 insertions(+), 9 deletions(-) (limited to 'kernel') diff --git a/include/linux/sched.h b/include/linux/sched.h index 351c0fe64c85..4112639c2a85 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1745,9 +1745,9 @@ static __always_inline bool need_resched(void) static inline unsigned int task_cpu(const struct task_struct *p) { #ifdef CONFIG_THREAD_INFO_IN_TASK - return p->cpu; + return READ_ONCE(p->cpu); #else - return task_thread_info(p)->cpu; + return READ_ONCE(task_thread_info(p)->cpu); #endif } diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 32e06704565e..ec1b67a195cc 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -107,11 +107,12 @@ struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf) * [L] ->on_rq * RELEASE (rq->lock) * - * If we observe the old CPU in task_rq_lock, the acquire of + * If we observe the old CPU in task_rq_lock(), the acquire of * the old rq->lock will fully serialize against the stores. * - * If we observe the new CPU in task_rq_lock, the acquire will - * pair with the WMB to ensure we must then also see migrating. + * If we observe the new CPU in task_rq_lock(), the address + * dependency headed by '[L] rq = task_rq()' and the acquire + * will pair with the WMB to ensure we then also see migrating. */ if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) { rq_pin_lock(rq, rf); @@ -916,7 +917,7 @@ static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf, { lockdep_assert_held(&rq->lock); - p->on_rq = TASK_ON_RQ_MIGRATING; + WRITE_ONCE(p->on_rq, TASK_ON_RQ_MIGRATING); dequeue_task(rq, p, DEQUEUE_NOCLOCK); set_task_cpu(p, new_cpu); rq_unlock(rq, rf); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 99e2a7772d16..c688ef5012e5 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1479,9 +1479,9 @@ static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu) */ smp_wmb(); #ifdef CONFIG_THREAD_INFO_IN_TASK - p->cpu = cpu; + WRITE_ONCE(p->cpu, cpu); #else - task_thread_info(p)->cpu = cpu; + WRITE_ONCE(task_thread_info(p)->cpu, cpu); #endif p->wake_cpu = cpu; #endif @@ -1582,7 +1582,7 @@ static inline int task_on_rq_queued(struct task_struct *p) static inline int task_on_rq_migrating(struct task_struct *p) { - return p->on_rq == TASK_ON_RQ_MIGRATING; + return READ_ONCE(p->on_rq) == TASK_ON_RQ_MIGRATING; } /* -- cgit v1.3-7-g2ca7 From 5d299eabea5a251fbf66e8277704b874bbba92dc Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Wed, 30 Jan 2019 14:41:04 +0100 Subject: sched/fair: Add tmp_alone_branch assertion The magic in list_add_leaf_cfs_rq() requires that at the end of enqueue_task_fair(): rq->tmp_alone_branch == &rq->lead_cfs_rq_list If this is violated, list integrity is compromised for list entries and the tmp_alone_branch pointer might dangle. Also, reflow list_add_leaf_cfs_rq() while there. This looses one indentation level and generates a form that's convenient for the next patch. Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Thomas Gleixner Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 126 +++++++++++++++++++++++++++++----------------------- 1 file changed, 71 insertions(+), 55 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 8c165f0d33b3..d6a536dec0ca 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -277,64 +277,69 @@ static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp) static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq) { - if (!cfs_rq->on_list) { - struct rq *rq = rq_of(cfs_rq); - int cpu = cpu_of(rq); + struct rq *rq = rq_of(cfs_rq); + int cpu = cpu_of(rq); + + if (cfs_rq->on_list) + return; + + cfs_rq->on_list = 1; + + /* + * Ensure we either appear before our parent (if already + * enqueued) or force our parent to appear after us when it is + * enqueued. The fact that we always enqueue bottom-up + * reduces this to two cases and a special case for the root + * cfs_rq. Furthermore, it also means that we will always reset + * tmp_alone_branch either when the branch is connected + * to a tree or when we reach the top of the tree + */ + if (cfs_rq->tg->parent && + cfs_rq->tg->parent->cfs_rq[cpu]->on_list) { /* - * Ensure we either appear before our parent (if already - * enqueued) or force our parent to appear after us when it is - * enqueued. The fact that we always enqueue bottom-up - * reduces this to two cases and a special case for the root - * cfs_rq. Furthermore, it also means that we will always reset - * tmp_alone_branch either when the branch is connected - * to a tree or when we reach the beg of the tree + * If parent is already on the list, we add the child + * just before. Thanks to circular linked property of + * the list, this means to put the child at the tail + * of the list that starts by parent. */ - if (cfs_rq->tg->parent && - cfs_rq->tg->parent->cfs_rq[cpu]->on_list) { - /* - * If parent is already on the list, we add the child - * just before. Thanks to circular linked property of - * the list, this means to put the child at the tail - * of the list that starts by parent. - */ - list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list, - &(cfs_rq->tg->parent->cfs_rq[cpu]->leaf_cfs_rq_list)); - /* - * The branch is now connected to its tree so we can - * reset tmp_alone_branch to the beginning of the - * list. - */ - rq->tmp_alone_branch = &rq->leaf_cfs_rq_list; - } else if (!cfs_rq->tg->parent) { - /* - * cfs rq without parent should be put - * at the tail of the list. - */ - list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list, - &rq->leaf_cfs_rq_list); - /* - * We have reach the beg of a tree so we can reset - * tmp_alone_branch to the beginning of the list. - */ - rq->tmp_alone_branch = &rq->leaf_cfs_rq_list; - } else { - /* - * The parent has not already been added so we want to - * make sure that it will be put after us. - * tmp_alone_branch points to the beg of the branch - * where we will add parent. - */ - list_add_rcu(&cfs_rq->leaf_cfs_rq_list, - rq->tmp_alone_branch); - /* - * update tmp_alone_branch to points to the new beg - * of the branch - */ - rq->tmp_alone_branch = &cfs_rq->leaf_cfs_rq_list; - } + list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list, + &(cfs_rq->tg->parent->cfs_rq[cpu]->leaf_cfs_rq_list)); + /* + * The branch is now connected to its tree so we can + * reset tmp_alone_branch to the beginning of the + * list. + */ + rq->tmp_alone_branch = &rq->leaf_cfs_rq_list; + return; + } - cfs_rq->on_list = 1; + if (!cfs_rq->tg->parent) { + /* + * cfs rq without parent should be put + * at the tail of the list. + */ + list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list, + &rq->leaf_cfs_rq_list); + /* + * We have reach the top of a tree so we can reset + * tmp_alone_branch to the beginning of the list. + */ + rq->tmp_alone_branch = &rq->leaf_cfs_rq_list; + return; } + + /* + * The parent has not already been added so we want to + * make sure that it will be put after us. + * tmp_alone_branch points to the begin of the branch + * where we will add parent. + */ + list_add_rcu(&cfs_rq->leaf_cfs_rq_list, rq->tmp_alone_branch); + /* + * update tmp_alone_branch to points to the new begin + * of the branch + */ + rq->tmp_alone_branch = &cfs_rq->leaf_cfs_rq_list; } static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq) @@ -345,7 +350,12 @@ static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq) } } -/* Iterate through all leaf cfs_rq's on a runqueue: */ +static inline void assert_list_leaf_cfs_rq(struct rq *rq) +{ + SCHED_WARN_ON(rq->tmp_alone_branch != &rq->leaf_cfs_rq_list); +} + +/* Iterate through all cfs_rq's on a runqueue in bottom-up order */ #define for_each_leaf_cfs_rq(rq, cfs_rq) \ list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list) @@ -433,6 +443,10 @@ static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq) { } +static inline void assert_list_leaf_cfs_rq(struct rq *rq) +{ +} + #define for_each_leaf_cfs_rq(rq, cfs_rq) \ for (cfs_rq = &rq->cfs; cfs_rq; cfs_rq = NULL) @@ -5172,6 +5186,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) } + assert_list_leaf_cfs_rq(rq); + hrtick_update(rq); } -- cgit v1.3-7-g2ca7 From f6783319737f28e4436a69611853a5a098cbe974 Mon Sep 17 00:00:00 2001 From: Vincent Guittot Date: Wed, 30 Jan 2019 06:22:47 +0100 Subject: sched/fair: Fix insertion in rq->leaf_cfs_rq_list Sargun reported a crash: "I picked up c40f7d74c741a907cfaeb73a7697081881c497d0 sched/fair: Fix infinite loop in update_blocked_averages() by reverting a9e7f6544b9c and put it on top of 4.19.13. In addition to this, I uninlined list_add_leaf_cfs_rq for debugging. This revealed a new bug that we didn't get to because we kept getting crashes from the previous issue. When we are running with cgroups that are rapidly changing, with CFS bandwidth control, and in addition using the cpusets cgroup, we see this crash. Specifically, it seems to occur with cgroups that are throttled and we change the allowed cpuset." The algorithm used to order cfs_rq in rq->leaf_cfs_rq_list assumes that it will walk down to root the 1st time a cfs_rq is used and we will finish to add either a cfs_rq without parent or a cfs_rq with a parent that is already on the list. But this is not always true in presence of throttling. Because a cfs_rq can be throttled even if it has never been used but other CPUs of the cgroup have already used all the bandwdith, we are not sure to go down to the root and add all cfs_rq in the list. Ensure that all cfs_rq will be added in the list even if they are throttled. [ mingo: Fix !CGROUPS build. ] Reported-by: Sargun Dhillon Signed-off-by: Vincent Guittot Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: tj@kernel.org Fixes: 9c2791f936ef ("Fix hierarchical order in rq->leaf_cfs_rq_list") Link: https://lkml.kernel.org/r/1548825767-10799-1-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 33 ++++++++++++++++++++++++++++----- 1 file changed, 28 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index d6a536dec0ca..ffd1ae7237e7 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -275,13 +275,13 @@ static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp) return grp->my_q; } -static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq) +static inline bool list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq) { struct rq *rq = rq_of(cfs_rq); int cpu = cpu_of(rq); if (cfs_rq->on_list) - return; + return rq->tmp_alone_branch == &rq->leaf_cfs_rq_list; cfs_rq->on_list = 1; @@ -310,7 +310,7 @@ static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq) * list. */ rq->tmp_alone_branch = &rq->leaf_cfs_rq_list; - return; + return true; } if (!cfs_rq->tg->parent) { @@ -325,7 +325,7 @@ static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq) * tmp_alone_branch to the beginning of the list. */ rq->tmp_alone_branch = &rq->leaf_cfs_rq_list; - return; + return true; } /* @@ -340,6 +340,7 @@ static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq) * of the branch */ rq->tmp_alone_branch = &cfs_rq->leaf_cfs_rq_list; + return false; } static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq) @@ -435,8 +436,9 @@ static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp) return NULL; } -static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq) +static inline bool list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq) { + return true; } static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq) @@ -4995,6 +4997,12 @@ static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq) } #else /* CONFIG_CFS_BANDWIDTH */ + +static inline bool cfs_bandwidth_used(void) +{ + return false; +} + static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq) { return rq_clock_task(rq_of(cfs_rq)); @@ -5186,6 +5194,21 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) } + if (cfs_bandwidth_used()) { + /* + * When bandwidth control is enabled; the cfs_rq_throttled() + * breaks in the above iteration can result in incomplete + * leaf list maintenance, resulting in triggering the assertion + * below. + */ + for_each_sched_entity(se) { + cfs_rq = cfs_rq_of(se); + + if (list_add_leaf_cfs_rq(cfs_rq)) + break; + } + } + assert_list_leaf_cfs_rq(rq); hrtick_update(rq); -- cgit v1.3-7-g2ca7 From 38f7ae9bdfb6570271c7429d8d72784465c6281e Mon Sep 17 00:00:00 2001 From: Brian Masney Date: Mon, 4 Feb 2019 04:58:52 -0500 Subject: genirq: export irq_chip_set_wake_parent symbol Export the irq_chip_set_wake_parent symbol so that drivers with hierarchical IRQ chips can be built as a module. Signed-off-by: Brian Masney Reported-by: Mark Brown Acked-by: Marc Zyngier Signed-off-by: Linus Walleij --- kernel/irq/chip.c | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c index 34e969069488..086d5a34b5a0 100644 --- a/kernel/irq/chip.c +++ b/kernel/irq/chip.c @@ -1381,6 +1381,7 @@ int irq_chip_set_wake_parent(struct irq_data *data, unsigned int on) return -ENOSYS; } +EXPORT_SYMBOL_GPL(irq_chip_set_wake_parent); #endif /** -- cgit v1.3-7-g2ca7 From 26b523356f49a0117c8f9e32ca98aa6d6e496e1a Mon Sep 17 00:00:00 2001 From: Christophe Leroy Date: Fri, 1 Feb 2019 10:46:52 +0000 Subject: powerpc: Drop page_is_ram() and walk_system_ram_range() Since commit c40dd2f76644 ("powerpc: Add System RAM to /proc/iomem") it is possible to use the generic walk_system_ram_range() and the generic page_is_ram(). To enable the use of walk_system_ram_range() by the IBM EHEA ethernet driver, we still need an export of the generic function. As powerpc was the only user of CONFIG_ARCH_HAS_WALK_MEMORY, the ifdef around the generic walk_system_ram_range() has become useless and can be dropped. Fixes: c40dd2f76644 ("powerpc: Add System RAM to /proc/iomem") Signed-off-by: Christophe Leroy [mpe: Keep the EXPORT_SYMBOL_GPL in powerpc code] Signed-off-by: Michael Ellerman --- arch/powerpc/Kconfig | 3 --- arch/powerpc/include/asm/page.h | 1 - arch/powerpc/mm/mem.c | 39 ++++++--------------------------------- kernel/resource.c | 4 ---- 4 files changed, 6 insertions(+), 41 deletions(-) (limited to 'kernel') diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 9c70c2864657..08908219fba9 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -478,9 +478,6 @@ config ARCH_CPU_PROBE_RELEASE config ARCH_ENABLE_MEMORY_HOTPLUG def_bool y -config ARCH_HAS_WALK_MEMORY - def_bool y - config ARCH_ENABLE_MEMORY_HOTREMOVE def_bool y diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h index 5c5ea2413413..aa4497175bd3 100644 --- a/arch/powerpc/include/asm/page.h +++ b/arch/powerpc/include/asm/page.h @@ -326,7 +326,6 @@ struct page; extern void clear_user_page(void *page, unsigned long vaddr, struct page *pg); extern void copy_user_page(void *to, void *from, unsigned long vaddr, struct page *p); -extern int page_is_ram(unsigned long pfn); extern int devmem_is_allowed(unsigned long pfn); #ifdef CONFIG_PPC_SMLPAR diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c index 33cc6f676fa6..81f251fc4169 100644 --- a/arch/powerpc/mm/mem.c +++ b/arch/powerpc/mm/mem.c @@ -80,11 +80,6 @@ static inline pte_t *virt_to_kpte(unsigned long vaddr) #define TOP_ZONE ZONE_NORMAL #endif -int page_is_ram(unsigned long pfn) -{ - return memblock_is_memory(__pfn_to_phys(pfn)); -} - pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn, unsigned long size, pgprot_t vma_prot) { @@ -176,34 +171,6 @@ int __meminit arch_remove_memory(int nid, u64 start, u64 size, #endif #endif /* CONFIG_MEMORY_HOTPLUG */ -/* - * walk_memory_resource() needs to make sure there is no holes in a given - * memory range. PPC64 does not maintain the memory layout in /proc/iomem. - * Instead it maintains it in memblock.memory structures. Walk through the - * memory regions, find holes and callback for contiguous regions. - */ -int -walk_system_ram_range(unsigned long start_pfn, unsigned long nr_pages, - void *arg, int (*func)(unsigned long, unsigned long, void *)) -{ - struct memblock_region *reg; - unsigned long end_pfn = start_pfn + nr_pages; - unsigned long tstart, tend; - int ret = -1; - - for_each_memblock(memory, reg) { - tstart = max(start_pfn, memblock_region_memory_base_pfn(reg)); - tend = min(end_pfn, memblock_region_memory_end_pfn(reg)); - if (tstart >= tend) - continue; - ret = (*func)(tstart, tend - tstart, arg); - if (ret) - break; - } - return ret; -} -EXPORT_SYMBOL_GPL(walk_system_ram_range); - #ifndef CONFIG_NEED_MULTIPLE_NODES void __init mem_topology_setup(void) { @@ -585,3 +552,9 @@ int devmem_is_allowed(unsigned long pfn) return 0; } #endif /* CONFIG_STRICT_DEVMEM */ + +/* + * This is defined in kernel/resource.c but only powerpc needs to export it, for + * the EHEA driver. Drop this when drivers/net/ethernet/ibm/ehea is removed. + */ +EXPORT_SYMBOL_GPL(walk_system_ram_range); diff --git a/kernel/resource.c b/kernel/resource.c index 915c02e8e5dd..e81b17b53fa5 100644 --- a/kernel/resource.c +++ b/kernel/resource.c @@ -448,8 +448,6 @@ int walk_mem_res(u64 start, u64 end, void *arg, arg, func); } -#if !defined(CONFIG_ARCH_HAS_WALK_MEMORY) - /* * This function calls the @func callback against all memory ranges of type * System RAM which are marked as IORESOURCE_SYSTEM_RAM and IORESOUCE_BUSY. @@ -481,8 +479,6 @@ int walk_system_ram_range(unsigned long start_pfn, unsigned long nr_pages, return ret; } -#endif - static int __is_ram(unsigned long pfn, unsigned long nr_pages, void *arg) { return 1; -- cgit v1.3-7-g2ca7 From a692933a87691681e880feb708081681ff32400a Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Tue, 5 Feb 2019 07:19:11 -0600 Subject: signal: Always attempt to allocate siginfo for SIGSTOP Since 2.5.34 the code has had the potential to not allocate siginfo for SIGSTOP signals. Except for ptrace this is perfectly fine as only ptrace can use PTRACE_PEEK_SIGINFO and see what the contents of the delivered siginfo are. Users of PTRACE_PEEK_SIGINFO that care about the contents siginfo for SIGSTOP are rare, but they do exist. A seccomp self test has cared and lldb cares. Jack Andersen writes: > The patch titled > `signal: Never allocate siginfo for SIGKILL or SIGSTOP` > created a regression for users of PTRACE_GETSIGINFO needing to > discern signals that were raised via the tgkill syscall. > > A notable user of this tgkill+ptrace combination is lldb while > debugging a multithreaded program. Without the ability to detect a > SIGSTOP originating from tgkill, lldb does not have a way to > synchronize on a per-thread basis and falls back to SIGSTOP-ing the > entire process. Everyone affected by this please note. The kernel can still fail to allocate a siginfo structure. The allocation is with GFP_KERNEL and is best effort only. If memory is tight when the signal allocation comes in this will fail to allocate a siginfo. So I strongly recommend looking at more robust solutions for synchronizing with a single thread such as PTRACE_INTERRUPT. Or if that does not work persuading your friendly local kernel developer to build the interface you need. Reported-by: Tycho Andersen Reported-by: Kees Cook Reported-by: Jack Andersen Acked-by: Linus Torvalds Reviewed-by: Christian Brauner Cc: stable@vger.kernel.org Fixes: f149b3155744 ("signal: Never allocate siginfo for SIGKILL or SIGSTOP") Fixes: 6dfc88977e42 ("[PATCH] shared thread signals") History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git Signed-off-by: "Eric W. Biederman" --- kernel/signal.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/signal.c b/kernel/signal.c index e1d7ad8e6ab1..9ca8e5278c8e 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -1057,10 +1057,9 @@ static int __send_signal(int sig, struct kernel_siginfo *info, struct task_struc result = TRACE_SIGNAL_DELIVERED; /* - * Skip useless siginfo allocation for SIGKILL SIGSTOP, - * and kernel threads. + * Skip useless siginfo allocation for SIGKILL and kernel threads. */ - if (sig_kernel_only(sig) || (t->flags & PF_KTHREAD)) + if ((sig == SIGKILL) || (t->flags & PF_KTHREAD)) goto out_set; /* -- cgit v1.3-7-g2ca7 From b525903c254dab2491410f0f23707691b7c2c317 Mon Sep 17 00:00:00 2001 From: Julien Thierry Date: Thu, 31 Jan 2019 14:53:58 +0000 Subject: genirq: Provide basic NMI management for interrupt lines Add functionality to allocate interrupt lines that will deliver IRQs as Non-Maskable Interrupts. These allocations are only successful if the irqchip provides the necessary support and allows NMI delivery for the interrupt line. Interrupt lines allocated for NMI delivery must be enabled/disabled through enable_nmi/disable_nmi_nosync to keep their state consistent. To treat a PERCPU IRQ as NMI, the interrupt must not be shared nor threaded, the irqchip directly managing the IRQ must be the root irqchip and the irqchip cannot be behind a slow bus. Signed-off-by: Julien Thierry Reviewed-by: Marc Zyngier Cc: Thomas Gleixner Cc: Peter Zijlstra Cc: Ingo Molnar Cc: Marc Zyngier Signed-off-by: Marc Zyngier --- include/linux/interrupt.h | 9 ++ include/linux/irq.h | 7 ++ kernel/irq/debugfs.c | 6 +- kernel/irq/internals.h | 2 + kernel/irq/manage.c | 228 +++++++++++++++++++++++++++++++++++++++++++++- 5 files changed, 249 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h index c672f34235e7..9941d1a8d83c 100644 --- a/include/linux/interrupt.h +++ b/include/linux/interrupt.h @@ -156,6 +156,10 @@ __request_percpu_irq(unsigned int irq, irq_handler_t handler, unsigned long flags, const char *devname, void __percpu *percpu_dev_id); +extern int __must_check +request_nmi(unsigned int irq, irq_handler_t handler, unsigned long flags, + const char *name, void *dev); + static inline int __must_check request_percpu_irq(unsigned int irq, irq_handler_t handler, const char *devname, void __percpu *percpu_dev_id) @@ -167,6 +171,8 @@ request_percpu_irq(unsigned int irq, irq_handler_t handler, extern const void *free_irq(unsigned int, void *); extern void free_percpu_irq(unsigned int, void __percpu *); +extern const void *free_nmi(unsigned int irq, void *dev_id); + struct device; extern int __must_check @@ -217,6 +223,9 @@ extern void enable_percpu_irq(unsigned int irq, unsigned int type); extern bool irq_percpu_is_enabled(unsigned int irq); extern void irq_wake_thread(unsigned int irq, void *dev_id); +extern void disable_nmi_nosync(unsigned int irq); +extern void enable_nmi(unsigned int irq); + /* The following three functions are for the core kernel use only. */ extern void suspend_device_irqs(void); extern void resume_device_irqs(void); diff --git a/include/linux/irq.h b/include/linux/irq.h index def2b2aac8b1..a7298e4998c8 100644 --- a/include/linux/irq.h +++ b/include/linux/irq.h @@ -442,6 +442,8 @@ static inline irq_hw_number_t irqd_to_hwirq(struct irq_data *d) * @irq_set_vcpu_affinity: optional to target a vCPU in a virtual machine * @ipi_send_single: send a single IPI to destination cpus * @ipi_send_mask: send an IPI to destination cpus in cpumask + * @irq_nmi_setup: function called from core code before enabling an NMI + * @irq_nmi_teardown: function called from core code after disabling an NMI * @flags: chip specific flags */ struct irq_chip { @@ -490,6 +492,9 @@ struct irq_chip { void (*ipi_send_single)(struct irq_data *data, unsigned int cpu); void (*ipi_send_mask)(struct irq_data *data, const struct cpumask *dest); + int (*irq_nmi_setup)(struct irq_data *data); + void (*irq_nmi_teardown)(struct irq_data *data); + unsigned long flags; }; @@ -505,6 +510,7 @@ struct irq_chip { * IRQCHIP_ONESHOT_SAFE: One shot does not require mask/unmask * IRQCHIP_EOI_THREADED: Chip requires eoi() on unmask in threaded mode * IRQCHIP_SUPPORTS_LEVEL_MSI Chip can provide two doorbells for Level MSIs + * IRQCHIP_SUPPORTS_NMI: Chip can deliver NMIs, only for root irqchips */ enum { IRQCHIP_SET_TYPE_MASKED = (1 << 0), @@ -515,6 +521,7 @@ enum { IRQCHIP_ONESHOT_SAFE = (1 << 5), IRQCHIP_EOI_THREADED = (1 << 6), IRQCHIP_SUPPORTS_LEVEL_MSI = (1 << 7), + IRQCHIP_SUPPORTS_NMI = (1 << 8), }; #include diff --git a/kernel/irq/debugfs.c b/kernel/irq/debugfs.c index 6f636136cccc..59a04d2a66df 100644 --- a/kernel/irq/debugfs.c +++ b/kernel/irq/debugfs.c @@ -56,6 +56,7 @@ static const struct irq_bit_descr irqchip_flags[] = { BIT_MASK_DESCR(IRQCHIP_ONESHOT_SAFE), BIT_MASK_DESCR(IRQCHIP_EOI_THREADED), BIT_MASK_DESCR(IRQCHIP_SUPPORTS_LEVEL_MSI), + BIT_MASK_DESCR(IRQCHIP_SUPPORTS_NMI), }; static void @@ -140,6 +141,7 @@ static const struct irq_bit_descr irqdesc_istates[] = { BIT_MASK_DESCR(IRQS_WAITING), BIT_MASK_DESCR(IRQS_PENDING), BIT_MASK_DESCR(IRQS_SUSPENDED), + BIT_MASK_DESCR(IRQS_NMI), }; @@ -203,8 +205,8 @@ static ssize_t irq_debug_write(struct file *file, const char __user *user_buf, chip_bus_lock(desc); raw_spin_lock_irqsave(&desc->lock, flags); - if (irq_settings_is_level(desc)) { - /* Can't do level, sorry */ + if (irq_settings_is_level(desc) || desc->istate & IRQS_NMI) { + /* Can't do level nor NMIs, sorry */ err = -EINVAL; } else { desc->istate |= IRQS_PENDING; diff --git a/kernel/irq/internals.h b/kernel/irq/internals.h index ca6afa267070..2a77cdd27ca9 100644 --- a/kernel/irq/internals.h +++ b/kernel/irq/internals.h @@ -49,6 +49,7 @@ enum { * IRQS_WAITING - irq is waiting * IRQS_PENDING - irq is pending and replayed later * IRQS_SUSPENDED - irq is suspended + * IRQS_NMI - irq line is used to deliver NMIs */ enum { IRQS_AUTODETECT = 0x00000001, @@ -60,6 +61,7 @@ enum { IRQS_PENDING = 0x00000200, IRQS_SUSPENDED = 0x00000800, IRQS_TIMINGS = 0x00001000, + IRQS_NMI = 0x00002000, }; #include "debug.h" diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c index a4888ce4667a..9472ae987946 100644 --- a/kernel/irq/manage.c +++ b/kernel/irq/manage.c @@ -341,7 +341,7 @@ irq_set_affinity_notifier(unsigned int irq, struct irq_affinity_notify *notify) /* The release function is promised process context */ might_sleep(); - if (!desc) + if (!desc || desc->istate & IRQS_NMI) return -EINVAL; /* Complete initialisation of *notify */ @@ -550,6 +550,21 @@ bool disable_hardirq(unsigned int irq) } EXPORT_SYMBOL_GPL(disable_hardirq); +/** + * disable_nmi_nosync - disable an nmi without waiting + * @irq: Interrupt to disable + * + * Disable the selected interrupt line. Disables and enables are + * nested. + * The interrupt to disable must have been requested through request_nmi. + * Unlike disable_nmi(), this function does not ensure existing + * instances of the IRQ handler have completed before returning. + */ +void disable_nmi_nosync(unsigned int irq) +{ + disable_irq_nosync(irq); +} + void __enable_irq(struct irq_desc *desc) { switch (desc->depth) { @@ -606,6 +621,20 @@ out: } EXPORT_SYMBOL(enable_irq); +/** + * enable_nmi - enable handling of an nmi + * @irq: Interrupt to enable + * + * The interrupt to enable must have been requested through request_nmi. + * Undoes the effect of one call to disable_nmi(). If this + * matches the last disable, processing of interrupts on this + * IRQ line is re-enabled. + */ +void enable_nmi(unsigned int irq) +{ + enable_irq(irq); +} + static int set_irq_wake_real(unsigned int irq, unsigned int on) { struct irq_desc *desc = irq_to_desc(irq); @@ -641,6 +670,12 @@ int irq_set_irq_wake(unsigned int irq, unsigned int on) if (!desc) return -EINVAL; + /* Don't use NMIs as wake up interrupts please */ + if (desc->istate & IRQS_NMI) { + ret = -EINVAL; + goto out_unlock; + } + /* wakeup-capable irqs can be shared between drivers that * don't need to have the same sleep mode behaviors. */ @@ -663,6 +698,8 @@ int irq_set_irq_wake(unsigned int irq, unsigned int on) irqd_clear(&desc->irq_data, IRQD_WAKEUP_STATE); } } + +out_unlock: irq_put_desc_busunlock(desc, flags); return ret; } @@ -1125,6 +1162,39 @@ static void irq_release_resources(struct irq_desc *desc) c->irq_release_resources(d); } +static bool irq_supports_nmi(struct irq_desc *desc) +{ + struct irq_data *d = irq_desc_get_irq_data(desc); + +#ifdef CONFIG_IRQ_DOMAIN_HIERARCHY + /* Only IRQs directly managed by the root irqchip can be set as NMI */ + if (d->parent_data) + return false; +#endif + /* Don't support NMIs for chips behind a slow bus */ + if (d->chip->irq_bus_lock || d->chip->irq_bus_sync_unlock) + return false; + + return d->chip->flags & IRQCHIP_SUPPORTS_NMI; +} + +static int irq_nmi_setup(struct irq_desc *desc) +{ + struct irq_data *d = irq_desc_get_irq_data(desc); + struct irq_chip *c = d->chip; + + return c->irq_nmi_setup ? c->irq_nmi_setup(d) : -EINVAL; +} + +static void irq_nmi_teardown(struct irq_desc *desc) +{ + struct irq_data *d = irq_desc_get_irq_data(desc); + struct irq_chip *c = d->chip; + + if (c->irq_nmi_teardown) + c->irq_nmi_teardown(d); +} + static int setup_irq_thread(struct irqaction *new, unsigned int irq, bool secondary) { @@ -1299,9 +1369,17 @@ __setup_irq(unsigned int irq, struct irq_desc *desc, struct irqaction *new) * fields must have IRQF_SHARED set and the bits which * set the trigger type must match. Also all must * agree on ONESHOT. + * Interrupt lines used for NMIs cannot be shared. */ unsigned int oldtype; + if (desc->istate & IRQS_NMI) { + pr_err("Invalid attempt to share NMI for %s (irq %d) on irqchip %s.\n", + new->name, irq, desc->irq_data.chip->name); + ret = -EINVAL; + goto out_unlock; + } + /* * If nobody did set the configuration before, inherit * the one provided by the requester. @@ -1753,6 +1831,59 @@ const void *free_irq(unsigned int irq, void *dev_id) } EXPORT_SYMBOL(free_irq); +/* This function must be called with desc->lock held */ +static const void *__cleanup_nmi(unsigned int irq, struct irq_desc *desc) +{ + const char *devname = NULL; + + desc->istate &= ~IRQS_NMI; + + if (!WARN_ON(desc->action == NULL)) { + irq_pm_remove_action(desc, desc->action); + devname = desc->action->name; + unregister_handler_proc(irq, desc->action); + + kfree(desc->action); + desc->action = NULL; + } + + irq_settings_clr_disable_unlazy(desc); + irq_shutdown(desc); + + irq_release_resources(desc); + + irq_chip_pm_put(&desc->irq_data); + module_put(desc->owner); + + return devname; +} + +const void *free_nmi(unsigned int irq, void *dev_id) +{ + struct irq_desc *desc = irq_to_desc(irq); + unsigned long flags; + const void *devname; + + if (!desc || WARN_ON(!(desc->istate & IRQS_NMI))) + return NULL; + + if (WARN_ON(irq_settings_is_per_cpu_devid(desc))) + return NULL; + + /* NMI still enabled */ + if (WARN_ON(desc->depth == 0)) + disable_nmi_nosync(irq); + + raw_spin_lock_irqsave(&desc->lock, flags); + + irq_nmi_teardown(desc); + devname = __cleanup_nmi(irq, desc); + + raw_spin_unlock_irqrestore(&desc->lock, flags); + + return devname; +} + /** * request_threaded_irq - allocate an interrupt line * @irq: Interrupt line to allocate @@ -1922,6 +2053,101 @@ int request_any_context_irq(unsigned int irq, irq_handler_t handler, } EXPORT_SYMBOL_GPL(request_any_context_irq); +/** + * request_nmi - allocate an interrupt line for NMI delivery + * @irq: Interrupt line to allocate + * @handler: Function to be called when the IRQ occurs. + * Threaded handler for threaded interrupts. + * @irqflags: Interrupt type flags + * @name: An ascii name for the claiming device + * @dev_id: A cookie passed back to the handler function + * + * This call allocates interrupt resources and enables the + * interrupt line and IRQ handling. It sets up the IRQ line + * to be handled as an NMI. + * + * An interrupt line delivering NMIs cannot be shared and IRQ handling + * cannot be threaded. + * + * Interrupt lines requested for NMI delivering must produce per cpu + * interrupts and have auto enabling setting disabled. + * + * Dev_id must be globally unique. Normally the address of the + * device data structure is used as the cookie. Since the handler + * receives this value it makes sense to use it. + * + * If the interrupt line cannot be used to deliver NMIs, function + * will fail and return a negative value. + */ +int request_nmi(unsigned int irq, irq_handler_t handler, + unsigned long irqflags, const char *name, void *dev_id) +{ + struct irqaction *action; + struct irq_desc *desc; + unsigned long flags; + int retval; + + if (irq == IRQ_NOTCONNECTED) + return -ENOTCONN; + + /* NMI cannot be shared, used for Polling */ + if (irqflags & (IRQF_SHARED | IRQF_COND_SUSPEND | IRQF_IRQPOLL)) + return -EINVAL; + + if (!(irqflags & IRQF_PERCPU)) + return -EINVAL; + + if (!handler) + return -EINVAL; + + desc = irq_to_desc(irq); + + if (!desc || irq_settings_can_autoenable(desc) || + !irq_settings_can_request(desc) || + WARN_ON(irq_settings_is_per_cpu_devid(desc)) || + !irq_supports_nmi(desc)) + return -EINVAL; + + action = kzalloc(sizeof(struct irqaction), GFP_KERNEL); + if (!action) + return -ENOMEM; + + action->handler = handler; + action->flags = irqflags | IRQF_NO_THREAD | IRQF_NOBALANCING; + action->name = name; + action->dev_id = dev_id; + + retval = irq_chip_pm_get(&desc->irq_data); + if (retval < 0) + goto err_out; + + retval = __setup_irq(irq, desc, action); + if (retval) + goto err_irq_setup; + + raw_spin_lock_irqsave(&desc->lock, flags); + + /* Setup NMI state */ + desc->istate |= IRQS_NMI; + retval = irq_nmi_setup(desc); + if (retval) { + __cleanup_nmi(irq, desc); + raw_spin_unlock_irqrestore(&desc->lock, flags); + return -EINVAL; + } + + raw_spin_unlock_irqrestore(&desc->lock, flags); + + return 0; + +err_irq_setup: + irq_chip_pm_put(&desc->irq_data); +err_out: + kfree(action); + + return retval; +} + void enable_percpu_irq(unsigned int irq, unsigned int type) { unsigned int cpu = smp_processor_id(); -- cgit v1.3-7-g2ca7 From 4b078c3f1a26487c39363089ba0d5c6b09f2a89f Mon Sep 17 00:00:00 2001 From: Julien Thierry Date: Thu, 31 Jan 2019 14:53:59 +0000 Subject: genirq: Provide NMI management for percpu_devid interrupts Add support for percpu_devid interrupts treated as NMIs. Percpu_devid NMIs need to be setup/torn down on each CPU they target. The same restrictions as for global NMIs still apply for percpu_devid NMIs. Signed-off-by: Julien Thierry Cc: Thomas Gleixner Cc: Peter Zijlstra Cc: Ingo Molnar Cc: Marc Zyngier Signed-off-by: Marc Zyngier --- include/linux/interrupt.h | 9 +++ kernel/irq/manage.c | 177 ++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 186 insertions(+) (limited to 'kernel') diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h index 9941d1a8d83c..831ddcdc5597 100644 --- a/include/linux/interrupt.h +++ b/include/linux/interrupt.h @@ -168,10 +168,15 @@ request_percpu_irq(unsigned int irq, irq_handler_t handler, devname, percpu_dev_id); } +extern int __must_check +request_percpu_nmi(unsigned int irq, irq_handler_t handler, + const char *devname, void __percpu *dev); + extern const void *free_irq(unsigned int, void *); extern void free_percpu_irq(unsigned int, void __percpu *); extern const void *free_nmi(unsigned int irq, void *dev_id); +extern void free_percpu_nmi(unsigned int irq, void __percpu *percpu_dev_id); struct device; @@ -224,7 +229,11 @@ extern bool irq_percpu_is_enabled(unsigned int irq); extern void irq_wake_thread(unsigned int irq, void *dev_id); extern void disable_nmi_nosync(unsigned int irq); +extern void disable_percpu_nmi(unsigned int irq); extern void enable_nmi(unsigned int irq); +extern void enable_percpu_nmi(unsigned int irq, unsigned int type); +extern int prepare_percpu_nmi(unsigned int irq); +extern void teardown_percpu_nmi(unsigned int irq); /* The following three functions are for the core kernel use only. */ extern void suspend_device_irqs(void); diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c index 9472ae987946..0a1ebc004a59 100644 --- a/kernel/irq/manage.c +++ b/kernel/irq/manage.c @@ -2182,6 +2182,11 @@ out: } EXPORT_SYMBOL_GPL(enable_percpu_irq); +void enable_percpu_nmi(unsigned int irq, unsigned int type) +{ + enable_percpu_irq(irq, type); +} + /** * irq_percpu_is_enabled - Check whether the per cpu irq is enabled * @irq: Linux irq number to check for @@ -2221,6 +2226,11 @@ void disable_percpu_irq(unsigned int irq) } EXPORT_SYMBOL_GPL(disable_percpu_irq); +void disable_percpu_nmi(unsigned int irq) +{ + disable_percpu_irq(irq); +} + /* * Internal function to unregister a percpu irqaction. */ @@ -2252,6 +2262,8 @@ static struct irqaction *__free_percpu_irq(unsigned int irq, void __percpu *dev_ /* Found it - now remove it from the list of entries: */ desc->action = NULL; + desc->istate &= ~IRQS_NMI; + raw_spin_unlock_irqrestore(&desc->lock, flags); unregister_handler_proc(irq, action); @@ -2305,6 +2317,19 @@ void free_percpu_irq(unsigned int irq, void __percpu *dev_id) } EXPORT_SYMBOL_GPL(free_percpu_irq); +void free_percpu_nmi(unsigned int irq, void __percpu *dev_id) +{ + struct irq_desc *desc = irq_to_desc(irq); + + if (!desc || !irq_settings_is_per_cpu_devid(desc)) + return; + + if (WARN_ON(!(desc->istate & IRQS_NMI))) + return; + + kfree(__free_percpu_irq(irq, dev_id)); +} + /** * setup_percpu_irq - setup a per-cpu interrupt * @irq: Interrupt line to setup @@ -2394,6 +2419,158 @@ int __request_percpu_irq(unsigned int irq, irq_handler_t handler, } EXPORT_SYMBOL_GPL(__request_percpu_irq); +/** + * request_percpu_nmi - allocate a percpu interrupt line for NMI delivery + * @irq: Interrupt line to allocate + * @handler: Function to be called when the IRQ occurs. + * @name: An ascii name for the claiming device + * @dev_id: A percpu cookie passed back to the handler function + * + * This call allocates interrupt resources for a per CPU NMI. Per CPU NMIs + * have to be setup on each CPU by calling ready_percpu_nmi() before being + * enabled on the same CPU by using enable_percpu_nmi(). + * + * Dev_id must be globally unique. It is a per-cpu variable, and + * the handler gets called with the interrupted CPU's instance of + * that variable. + * + * Interrupt lines requested for NMI delivering should have auto enabling + * setting disabled. + * + * If the interrupt line cannot be used to deliver NMIs, function + * will fail returning a negative value. + */ +int request_percpu_nmi(unsigned int irq, irq_handler_t handler, + const char *name, void __percpu *dev_id) +{ + struct irqaction *action; + struct irq_desc *desc; + unsigned long flags; + int retval; + + if (!handler) + return -EINVAL; + + desc = irq_to_desc(irq); + + if (!desc || !irq_settings_can_request(desc) || + !irq_settings_is_per_cpu_devid(desc) || + irq_settings_can_autoenable(desc) || + !irq_supports_nmi(desc)) + return -EINVAL; + + /* The line cannot already be NMI */ + if (desc->istate & IRQS_NMI) + return -EINVAL; + + action = kzalloc(sizeof(struct irqaction), GFP_KERNEL); + if (!action) + return -ENOMEM; + + action->handler = handler; + action->flags = IRQF_PERCPU | IRQF_NO_SUSPEND | IRQF_NO_THREAD + | IRQF_NOBALANCING; + action->name = name; + action->percpu_dev_id = dev_id; + + retval = irq_chip_pm_get(&desc->irq_data); + if (retval < 0) + goto err_out; + + retval = __setup_irq(irq, desc, action); + if (retval) + goto err_irq_setup; + + raw_spin_lock_irqsave(&desc->lock, flags); + desc->istate |= IRQS_NMI; + raw_spin_unlock_irqrestore(&desc->lock, flags); + + return 0; + +err_irq_setup: + irq_chip_pm_put(&desc->irq_data); +err_out: + kfree(action); + + return retval; +} + +/** + * prepare_percpu_nmi - performs CPU local setup for NMI delivery + * @irq: Interrupt line to prepare for NMI delivery + * + * This call prepares an interrupt line to deliver NMI on the current CPU, + * before that interrupt line gets enabled with enable_percpu_nmi(). + * + * As a CPU local operation, this should be called from non-preemptible + * context. + * + * If the interrupt line cannot be used to deliver NMIs, function + * will fail returning a negative value. + */ +int prepare_percpu_nmi(unsigned int irq) +{ + unsigned long flags; + struct irq_desc *desc; + int ret = 0; + + WARN_ON(preemptible()); + + desc = irq_get_desc_lock(irq, &flags, + IRQ_GET_DESC_CHECK_PERCPU); + if (!desc) + return -EINVAL; + + if (WARN(!(desc->istate & IRQS_NMI), + KERN_ERR "prepare_percpu_nmi called for a non-NMI interrupt: irq %u\n", + irq)) { + ret = -EINVAL; + goto out; + } + + ret = irq_nmi_setup(desc); + if (ret) { + pr_err("Failed to setup NMI delivery: irq %u\n", irq); + goto out; + } + +out: + irq_put_desc_unlock(desc, flags); + return ret; +} + +/** + * teardown_percpu_nmi - undoes NMI setup of IRQ line + * @irq: Interrupt line from which CPU local NMI configuration should be + * removed + * + * This call undoes the setup done by prepare_percpu_nmi(). + * + * IRQ line should not be enabled for the current CPU. + * + * As a CPU local operation, this should be called from non-preemptible + * context. + */ +void teardown_percpu_nmi(unsigned int irq) +{ + unsigned long flags; + struct irq_desc *desc; + + WARN_ON(preemptible()); + + desc = irq_get_desc_lock(irq, &flags, + IRQ_GET_DESC_CHECK_PERCPU); + if (!desc) + return; + + if (WARN_ON(!(desc->istate & IRQS_NMI))) + goto out; + + irq_nmi_teardown(desc); +out: + irq_put_desc_unlock(desc, flags); +} + /** * irq_get_irqchip_state - returns the irqchip state of a interrupt. * @irq: Interrupt line that is forwarded to a VM -- cgit v1.3-7-g2ca7 From 2dcf1fbcad352baaa5f47b17e57c5743c8eedbad Mon Sep 17 00:00:00 2001 From: Julien Thierry Date: Thu, 31 Jan 2019 14:54:00 +0000 Subject: genirq: Provide NMI handlers Provide flow handlers that are NMI safe for interrupts and percpu_devid interrupts. Signed-off-by: Julien Thierry Acked-by: Marc Zyngier Cc: Thomas Gleixner Cc: Marc Zyngier Cc: Peter Zijlstra Signed-off-by: Marc Zyngier --- include/linux/irq.h | 3 +++ kernel/irq/chip.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 57 insertions(+) (limited to 'kernel') diff --git a/include/linux/irq.h b/include/linux/irq.h index a7298e4998c8..5e91f6bcaacd 100644 --- a/include/linux/irq.h +++ b/include/linux/irq.h @@ -601,6 +601,9 @@ extern void handle_percpu_devid_irq(struct irq_desc *desc); extern void handle_bad_irq(struct irq_desc *desc); extern void handle_nested_irq(unsigned int irq); +extern void handle_fasteoi_nmi(struct irq_desc *desc); +extern void handle_percpu_devid_fasteoi_nmi(struct irq_desc *desc); + extern int irq_chip_compose_msi_msg(struct irq_data *data, struct msi_msg *msg); extern int irq_chip_pm_get(struct irq_data *data); extern int irq_chip_pm_put(struct irq_data *data); diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c index 34e969069488..c32d5f386f68 100644 --- a/kernel/irq/chip.c +++ b/kernel/irq/chip.c @@ -729,6 +729,37 @@ out: } EXPORT_SYMBOL_GPL(handle_fasteoi_irq); +/** + * handle_fasteoi_nmi - irq handler for NMI interrupt lines + * @desc: the interrupt description structure for this irq + * + * A simple NMI-safe handler, considering the restrictions + * from request_nmi. + * + * Only a single callback will be issued to the chip: an ->eoi() + * call when the interrupt has been serviced. This enables support + * for modern forms of interrupt handlers, which handle the flow + * details in hardware, transparently. + */ +void handle_fasteoi_nmi(struct irq_desc *desc) +{ + struct irq_chip *chip = irq_desc_get_chip(desc); + struct irqaction *action = desc->action; + unsigned int irq = irq_desc_get_irq(desc); + irqreturn_t res; + + trace_irq_handler_entry(irq, action); + /* + * NMIs cannot be shared, there is only one action. + */ + res = action->handler(irq, action->dev_id); + trace_irq_handler_exit(irq, action, res); + + if (chip->irq_eoi) + chip->irq_eoi(&desc->irq_data); +} +EXPORT_SYMBOL_GPL(handle_fasteoi_nmi); + /** * handle_edge_irq - edge type IRQ handler * @desc: the interrupt description structure for this irq @@ -908,6 +939,29 @@ void handle_percpu_devid_irq(struct irq_desc *desc) chip->irq_eoi(&desc->irq_data); } +/** + * handle_percpu_devid_fasteoi_nmi - Per CPU local NMI handler with per cpu + * dev ids + * @desc: the interrupt description structure for this irq + * + * Similar to handle_fasteoi_nmi, but handling the dev_id cookie + * as a percpu pointer. + */ +void handle_percpu_devid_fasteoi_nmi(struct irq_desc *desc) +{ + struct irq_chip *chip = irq_desc_get_chip(desc); + struct irqaction *action = desc->action; + unsigned int irq = irq_desc_get_irq(desc); + irqreturn_t res; + + trace_irq_handler_entry(irq, action); + res = action->handler(irq, raw_cpu_ptr(action->percpu_dev_id)); + trace_irq_handler_exit(irq, action, res); + + if (chip->irq_eoi) + chip->irq_eoi(&desc->irq_data); +} + static void __irq_do_set_handler(struct irq_desc *desc, irq_flow_handler_t handle, int is_chained, const char *name) -- cgit v1.3-7-g2ca7 From 6e4933a006616343f66c4702dc4fc56bb25e7b02 Mon Sep 17 00:00:00 2001 From: Julien Thierry Date: Thu, 31 Jan 2019 14:54:01 +0000 Subject: irqdesc: Add domain handler for NMIs NMI handling code should be executed between calls to nmi_enter and nmi_exit. Add a separate domain handler to properly setup NMI context when handling an interrupt requested as NMI. Signed-off-by: Julien Thierry Acked-by: Marc Zyngier Cc: Thomas Gleixner Cc: Marc Zyngier Cc: Will Deacon Cc: Peter Zijlstra Signed-off-by: Marc Zyngier --- include/linux/irqdesc.h | 5 +++++ kernel/irq/irqdesc.c | 35 +++++++++++++++++++++++++++++++++++ 2 files changed, 40 insertions(+) (limited to 'kernel') diff --git a/include/linux/irqdesc.h b/include/linux/irqdesc.h index dd1e40ddac7d..ba05b0d6401a 100644 --- a/include/linux/irqdesc.h +++ b/include/linux/irqdesc.h @@ -171,6 +171,11 @@ static inline int handle_domain_irq(struct irq_domain *domain, { return __handle_domain_irq(domain, hwirq, true, regs); } + +#ifdef CONFIG_IRQ_DOMAIN +int handle_domain_nmi(struct irq_domain *domain, unsigned int hwirq, + struct pt_regs *regs); +#endif #endif /* Test to see if a driver has successfully requested an irq */ diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c index ee062b7939d3..a1d7a7d484e0 100644 --- a/kernel/irq/irqdesc.c +++ b/kernel/irq/irqdesc.c @@ -669,6 +669,41 @@ int __handle_domain_irq(struct irq_domain *domain, unsigned int hwirq, set_irq_regs(old_regs); return ret; } + +#ifdef CONFIG_IRQ_DOMAIN +/** + * handle_domain_nmi - Invoke the handler for a HW irq belonging to a domain + * @domain: The domain where to perform the lookup + * @hwirq: The HW irq number to convert to a logical one + * @regs: Register file coming from the low-level handling code + * + * Returns: 0 on success, or -EINVAL if conversion has failed + */ +int handle_domain_nmi(struct irq_domain *domain, unsigned int hwirq, + struct pt_regs *regs) +{ + struct pt_regs *old_regs = set_irq_regs(regs); + unsigned int irq; + int ret = 0; + + nmi_enter(); + + irq = irq_find_mapping(domain, hwirq); + + /* + * ack_bad_irq is not NMI-safe, just report + * an invalid interrupt. + */ + if (likely(irq)) + generic_handle_irq(irq); + else + ret = -EINVAL; + + nmi_exit(); + set_irq_regs(old_regs); + return ret; +} +#endif #endif /* Dynamic interrupt handling */ -- cgit v1.3-7-g2ca7 From 375bfca3459db1c5596c28c56d2ebac26e51c7b3 Mon Sep 17 00:00:00 2001 From: Alice Ferrazzi Date: Tue, 5 Feb 2019 03:33:28 +0900 Subject: livepatch: core: Return EOPNOTSUPP instead of ENOSYS As a result of an unsupported operation is better to use EOPNOTSUPP as error code. ENOSYS is only used for 'invalid syscall nr' and nothing else. Signed-off-by: Alice Ferrazzi Acked-by: Miroslav Benes Acked-by: Josh Poimboeuf Signed-off-by: Petr Mladek --- kernel/livepatch/core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c index adca5cf07f7e..5fc98a1cc3c3 100644 --- a/kernel/livepatch/core.c +++ b/kernel/livepatch/core.c @@ -1036,7 +1036,7 @@ int klp_enable_patch(struct klp_patch *patch) if (!klp_have_reliable_stack()) { pr_err("This architecture doesn't have support for the livepatch consistency model.\n"); - return -ENOSYS; + return -EOPNOTSUPP; } -- cgit v1.3-7-g2ca7 From ecba29f434a8fa333356d54d2491d174c4aab8de Mon Sep 17 00:00:00 2001 From: Petr Mladek Date: Mon, 4 Feb 2019 14:56:50 +0100 Subject: livepatch: Introduce klp_for_each_patch macro There are already macros to iterate over struct klp_func and klp_object. Add also klp_for_each_patch(). But make it internal because also klp_patches list is internal. Suggested-by: Josh Poimboeuf Acked-by: Miroslav Benes Acked-by: Joe Lawrence Signed-off-by: Petr Mladek --- kernel/livepatch/core.c | 8 ++++---- kernel/livepatch/core.h | 6 ++++++ kernel/livepatch/transition.c | 2 +- 3 files changed, 11 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c index 5fc98a1cc3c3..4b7f55d9e89c 100644 --- a/kernel/livepatch/core.c +++ b/kernel/livepatch/core.c @@ -554,7 +554,7 @@ static int klp_add_nops(struct klp_patch *patch) struct klp_patch *old_patch; struct klp_object *old_obj; - list_for_each_entry(old_patch, &klp_patches, list) { + klp_for_each_patch(old_patch) { klp_for_each_object(old_patch, old_obj) { int err; @@ -1089,7 +1089,7 @@ void klp_discard_replaced_patches(struct klp_patch *new_patch) { struct klp_patch *old_patch, *tmp_patch; - list_for_each_entry_safe(old_patch, tmp_patch, &klp_patches, list) { + klp_for_each_patch_safe(old_patch, tmp_patch) { if (old_patch == new_patch) return; @@ -1133,7 +1133,7 @@ static void klp_cleanup_module_patches_limited(struct module *mod, struct klp_patch *patch; struct klp_object *obj; - list_for_each_entry(patch, &klp_patches, list) { + klp_for_each_patch(patch) { if (patch == limit) break; @@ -1180,7 +1180,7 @@ int klp_module_coming(struct module *mod) */ mod->klp_alive = true; - list_for_each_entry(patch, &klp_patches, list) { + klp_for_each_patch(patch) { klp_for_each_object(patch, obj) { if (!klp_is_module(obj) || strcmp(obj->name, mod->name)) continue; diff --git a/kernel/livepatch/core.h b/kernel/livepatch/core.h index e6200f38701f..ec43a40b853f 100644 --- a/kernel/livepatch/core.h +++ b/kernel/livepatch/core.h @@ -7,6 +7,12 @@ extern struct mutex klp_mutex; extern struct list_head klp_patches; +#define klp_for_each_patch_safe(patch, tmp_patch) \ + list_for_each_entry_safe(patch, tmp_patch, &klp_patches, list) + +#define klp_for_each_patch(patch) \ + list_for_each_entry(patch, &klp_patches, list) + void klp_free_patch_start(struct klp_patch *patch); void klp_discard_replaced_patches(struct klp_patch *new_patch); void klp_discard_nops(struct klp_patch *new_patch); diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c index 300273819674..a3a6f32c6fd0 100644 --- a/kernel/livepatch/transition.c +++ b/kernel/livepatch/transition.c @@ -642,6 +642,6 @@ void klp_force_transition(void) for_each_possible_cpu(cpu) klp_update_patch_state(idle_task(cpu)); - list_for_each_entry(patch, &klp_patches, list) + klp_for_each_patch(patch) patch->forced = true; } -- cgit v1.3-7-g2ca7 From a087cdd4073b78f4739ce6709daeb4f3267b4dbf Mon Sep 17 00:00:00 2001 From: Petr Mladek Date: Mon, 4 Feb 2019 14:56:53 +0100 Subject: livepatch: Module coming and going callbacks can proceed with all listed patches Livepatches can no longer get enabled and disabled repeatedly. The list klp_patches contains only enabled patches and eventually the patch in transition. The module coming and going callbacks do no longer need to check for these state. They have to proceed with all listed patches. Suggested-by: Josh Poimboeuf Acked-by: Miroslav Benes Acked-by: Joe Lawrence Signed-off-by: Petr Mladek --- kernel/livepatch/core.c | 26 ++++++-------------------- 1 file changed, 6 insertions(+), 20 deletions(-) (limited to 'kernel') diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c index 4b7f55d9e89c..d1af69e9f0e3 100644 --- a/kernel/livepatch/core.c +++ b/kernel/livepatch/core.c @@ -1141,21 +1141,14 @@ static void klp_cleanup_module_patches_limited(struct module *mod, if (!klp_is_module(obj) || strcmp(obj->name, mod->name)) continue; - /* - * Only unpatch the module if the patch is enabled or - * is in transition. - */ - if (patch->enabled || patch == klp_transition_patch) { - - if (patch != klp_transition_patch) - klp_pre_unpatch_callback(obj); + if (patch != klp_transition_patch) + klp_pre_unpatch_callback(obj); - pr_notice("reverting patch '%s' on unloading module '%s'\n", - patch->mod->name, obj->mod->name); - klp_unpatch_object(obj); + pr_notice("reverting patch '%s' on unloading module '%s'\n", + patch->mod->name, obj->mod->name); + klp_unpatch_object(obj); - klp_post_unpatch_callback(obj); - } + klp_post_unpatch_callback(obj); klp_free_object_loaded(obj); break; @@ -1194,13 +1187,6 @@ int klp_module_coming(struct module *mod) goto err; } - /* - * Only patch the module if the patch is enabled or is - * in transition. - */ - if (!patch->enabled && patch != klp_transition_patch) - break; - pr_notice("applying patch '%s' to loading module '%s'\n", patch->mod->name, obj->mod->name); -- cgit v1.3-7-g2ca7 From 840018668ce2d96783356204ff282d6c9b0e5f66 Mon Sep 17 00:00:00 2001 From: Mathieu Poirier Date: Thu, 31 Jan 2019 11:47:08 -0700 Subject: perf/aux: Make perf_event accessible to setup_aux() When pmu::setup_aux() is called the coresight PMU needs to know which sink to use for the session by looking up the information in the event's attr::config2 field. As such simply replace the cpu information by the complete perf_event structure and change all affected customers. Signed-off-by: Mathieu Poirier Reviewed-by: Suzuki Poulouse Acked-by: Peter Zijlstra Cc: Adrian Hunter Cc: Alexander Shishkin Cc: Alexei Starovoitov Cc: Greg Kroah-Hartman Cc: H. Peter Anvin Cc: Heiko Carstens Cc: Jiri Olsa Cc: Mark Rutland Cc: Martin Schwidefsky Cc: Namhyung Kim Cc: Thomas Gleixner Cc: Will Deacon Cc: linux-arm-kernel@lists.infradead.org Cc: linux-s390@vger.kernel.org Link: http://lkml.kernel.org/r/20190131184714.20388-2-mathieu.poirier@linaro.org Signed-off-by: Arnaldo Carvalho de Melo --- arch/s390/kernel/perf_cpum_sf.c | 6 +++--- arch/x86/events/intel/bts.c | 4 +++- arch/x86/events/intel/pt.c | 5 +++-- drivers/hwtracing/coresight/coresight-etm-perf.c | 6 +++--- drivers/perf/arm_spe_pmu.c | 6 +++--- include/linux/perf_event.h | 2 +- kernel/events/ring_buffer.c | 2 +- 7 files changed, 17 insertions(+), 14 deletions(-) (limited to 'kernel') diff --git a/arch/s390/kernel/perf_cpum_sf.c b/arch/s390/kernel/perf_cpum_sf.c index bfabeb1889cc..1266194afb02 100644 --- a/arch/s390/kernel/perf_cpum_sf.c +++ b/arch/s390/kernel/perf_cpum_sf.c @@ -1600,7 +1600,7 @@ static void aux_sdb_init(unsigned long sdb) /* * aux_buffer_setup() - Setup AUX buffer for diagnostic mode sampling - * @cpu: On which to allocate, -1 means current + * @event: Event the buffer is setup for, event->cpu == -1 means current * @pages: Array of pointers to buffer pages passed from perf core * @nr_pages: Total pages * @snapshot: Flag for snapshot mode @@ -1612,8 +1612,8 @@ static void aux_sdb_init(unsigned long sdb) * * Return the private AUX buffer structure if success or NULL if fails. */ -static void *aux_buffer_setup(int cpu, void **pages, int nr_pages, - bool snapshot) +static void *aux_buffer_setup(struct perf_event *event, void **pages, + int nr_pages, bool snapshot) { struct sf_buffer *sfb; struct aux_buffer *aux; diff --git a/arch/x86/events/intel/bts.c b/arch/x86/events/intel/bts.c index a01ef1b0f883..7cdd7b13bbda 100644 --- a/arch/x86/events/intel/bts.c +++ b/arch/x86/events/intel/bts.c @@ -77,10 +77,12 @@ static size_t buf_size(struct page *page) } static void * -bts_buffer_setup_aux(int cpu, void **pages, int nr_pages, bool overwrite) +bts_buffer_setup_aux(struct perf_event *event, void **pages, + int nr_pages, bool overwrite) { struct bts_buffer *buf; struct page *page; + int cpu = event->cpu; int node = (cpu == -1) ? cpu : cpu_to_node(cpu); unsigned long offset; size_t size = nr_pages << PAGE_SHIFT; diff --git a/arch/x86/events/intel/pt.c b/arch/x86/events/intel/pt.c index 9494ca68fd9d..c0e86ff21f81 100644 --- a/arch/x86/events/intel/pt.c +++ b/arch/x86/events/intel/pt.c @@ -1114,10 +1114,11 @@ static int pt_buffer_init_topa(struct pt_buffer *buf, unsigned long nr_pages, * Return: Our private PT buffer structure. */ static void * -pt_buffer_setup_aux(int cpu, void **pages, int nr_pages, bool snapshot) +pt_buffer_setup_aux(struct perf_event *event, void **pages, + int nr_pages, bool snapshot) { struct pt_buffer *buf; - int node, ret; + int node, ret, cpu = event->cpu; if (!nr_pages) return NULL; diff --git a/drivers/hwtracing/coresight/coresight-etm-perf.c b/drivers/hwtracing/coresight/coresight-etm-perf.c index abe8249b893b..f21eb28b6782 100644 --- a/drivers/hwtracing/coresight/coresight-etm-perf.c +++ b/drivers/hwtracing/coresight/coresight-etm-perf.c @@ -177,15 +177,15 @@ static void etm_free_aux(void *data) schedule_work(&event_data->work); } -static void *etm_setup_aux(int event_cpu, void **pages, +static void *etm_setup_aux(struct perf_event *event, void **pages, int nr_pages, bool overwrite) { - int cpu; + int cpu = event->cpu; cpumask_t *mask; struct coresight_device *sink; struct etm_event_data *event_data = NULL; - event_data = alloc_event_data(event_cpu); + event_data = alloc_event_data(cpu); if (!event_data) return NULL; INIT_WORK(&event_data->work, free_event_data); diff --git a/drivers/perf/arm_spe_pmu.c b/drivers/perf/arm_spe_pmu.c index 8e46a9dad2fa..7cb766dafe85 100644 --- a/drivers/perf/arm_spe_pmu.c +++ b/drivers/perf/arm_spe_pmu.c @@ -824,10 +824,10 @@ static void arm_spe_pmu_read(struct perf_event *event) { } -static void *arm_spe_pmu_setup_aux(int cpu, void **pages, int nr_pages, - bool snapshot) +static void *arm_spe_pmu_setup_aux(struct perf_event *event, void **pages, + int nr_pages, bool snapshot) { - int i; + int i, cpu = event->cpu; struct page **pglist; struct arm_spe_pmu_buf *buf; diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index 6cb5d483ab34..d9c3610e0e25 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -410,7 +410,7 @@ struct pmu { /* * Set up pmu-private data structures for an AUX area */ - void *(*setup_aux) (int cpu, void **pages, + void *(*setup_aux) (struct perf_event *event, void **pages, int nr_pages, bool overwrite); /* optional */ diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c index 805f0423ee0b..70ae2422cbaf 100644 --- a/kernel/events/ring_buffer.c +++ b/kernel/events/ring_buffer.c @@ -657,7 +657,7 @@ int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event, goto out; } - rb->aux_priv = event->pmu->setup_aux(event->cpu, rb->aux_pages, nr_pages, + rb->aux_priv = event->pmu->setup_aux(event, rb->aux_pages, nr_pages, overwrite); if (!rb->aux_priv) goto out; -- cgit v1.3-7-g2ca7 From 9acd8de69d107537a68d010c9149fa9d9aba91f4 Mon Sep 17 00:00:00 2001 From: Changbin Du Date: Tue, 1 Jan 2019 23:46:10 +0800 Subject: function_graph: Support displaying relative timestamp When function_graph is used for latency tracers, relative timestamp is more straightforward than absolute timestamp as function trace does. This change adds relative timestamp support to function_graph and applies to latency tracers (wakeup and irqsoff). Instead of: # tracer: irqsoff # # irqsoff latency trace v1.1.5 on 5.0.0-rc1-test # -------------------------------------------------------------------- # latency: 521 us, #1125/1125, CPU#2 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:8) # ----------------- # | task: swapper/2-0 (uid:0 nice:0 policy:0 rt_prio:0) # ----------------- # => started at: __schedule # => ended at: _raw_spin_unlock_irq # # # _-----=> irqs-off # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / # TIME CPU TASK/PID |||| DURATION FUNCTION CALLS # | | | | |||| | | | | | | 124.974306 | 2) systemd-693 | d..1 0.000 us | __schedule(); 124.974307 | 2) systemd-693 | d..1 | rcu_note_context_switch() { 124.974308 | 2) systemd-693 | d..1 0.487 us | rcu_preempt_deferred_qs(); 124.974309 | 2) systemd-693 | d..1 0.451 us | rcu_qs(); 124.974310 | 2) systemd-693 | d..1 2.301 us | } [..] 124.974826 | 2) -0 | d..2 | finish_task_switch() { 124.974826 | 2) -0 | d..2 | _raw_spin_unlock_irq() { 124.974827 | 2) -0 | d..2 0.000 us | _raw_spin_unlock_irq(); 124.974828 | 2) -0 | d..2 0.000 us | tracer_hardirqs_on(); -0 2d..2 552us : => __schedule => schedule_idle => do_idle => cpu_startup_entry => start_secondary => secondary_startup_64 Show: # tracer: irqsoff # # irqsoff latency trace v1.1.5 on 5.0.0-rc1-test+ # -------------------------------------------------------------------- # latency: 511 us, #1053/1053, CPU#7 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:8) # ----------------- # | task: swapper/7-0 (uid:0 nice:0 policy:0 rt_prio:0) # ----------------- # => started at: __schedule # => ended at: _raw_spin_unlock_irq # # # _-----=> irqs-off # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / # REL TIME CPU TASK/PID |||| DURATION FUNCTION CALLS # | | | | |||| | | | | | | 0 us | 7) sshd-1704 | d..1 0.000 us | __schedule(); 1 us | 7) sshd-1704 | d..1 | rcu_note_context_switch() { 1 us | 7) sshd-1704 | d..1 0.611 us | rcu_preempt_deferred_qs(); 2 us | 7) sshd-1704 | d..1 0.484 us | rcu_qs(); 3 us | 7) sshd-1704 | d..1 2.599 us | } [..] 509 us | 7) -0 | d..2 | finish_task_switch() { 510 us | 7) -0 | d..2 | _raw_spin_unlock_irq() { 510 us | 7) -0 | d..2 0.000 us | _raw_spin_unlock_irq(); 512 us | 7) -0 | d..2 0.000 us | tracer_hardirqs_on(); -0 7d..2 543us : => __schedule => schedule_idle => do_idle => cpu_startup_entry => start_secondary => secondary_startup_64 Link: http://lkml.kernel.org/r/20190101154614.8887-2-changbin.du@gmail.com Signed-off-by: Changbin Du Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace.h | 9 +++++---- kernel/trace/trace_functions_graph.c | 25 +++++++++++++++++++++++++ kernel/trace/trace_irqsoff.c | 2 +- kernel/trace/trace_sched_wakeup.c | 2 +- 4 files changed, 32 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h index 08900828d282..a34fa5e76abb 100644 --- a/kernel/trace/trace.h +++ b/kernel/trace/trace.h @@ -855,10 +855,11 @@ static __always_inline bool ftrace_hash_empty(struct ftrace_hash *hash) #define TRACE_GRAPH_PRINT_PROC 0x8 #define TRACE_GRAPH_PRINT_DURATION 0x10 #define TRACE_GRAPH_PRINT_ABS_TIME 0x20 -#define TRACE_GRAPH_PRINT_IRQS 0x40 -#define TRACE_GRAPH_PRINT_TAIL 0x80 -#define TRACE_GRAPH_SLEEP_TIME 0x100 -#define TRACE_GRAPH_GRAPH_TIME 0x200 +#define TRACE_GRAPH_PRINT_REL_TIME 0x40 +#define TRACE_GRAPH_PRINT_IRQS 0x80 +#define TRACE_GRAPH_PRINT_TAIL 0x100 +#define TRACE_GRAPH_SLEEP_TIME 0x200 +#define TRACE_GRAPH_GRAPH_TIME 0x400 #define TRACE_GRAPH_PRINT_FILL_SHIFT 28 #define TRACE_GRAPH_PRINT_FILL_MASK (0x3 << TRACE_GRAPH_PRINT_FILL_SHIFT) diff --git a/kernel/trace/trace_functions_graph.c b/kernel/trace/trace_functions_graph.c index c2af1560e856..16ebbdd7b22e 100644 --- a/kernel/trace/trace_functions_graph.c +++ b/kernel/trace/trace_functions_graph.c @@ -500,6 +500,17 @@ static void print_graph_abs_time(u64 t, struct trace_seq *s) (unsigned long)t, usecs_rem); } +static void +print_graph_rel_time(struct trace_iterator *iter, struct trace_seq *s) +{ + unsigned long long usecs; + + usecs = iter->ts - iter->trace_buffer->time_start; + do_div(usecs, NSEC_PER_USEC); + + trace_seq_printf(s, "%9llu us | ", usecs); +} + static void print_graph_irq(struct trace_iterator *iter, unsigned long addr, enum trace_type type, int cpu, pid_t pid, u32 flags) @@ -517,6 +528,10 @@ print_graph_irq(struct trace_iterator *iter, unsigned long addr, if (flags & TRACE_GRAPH_PRINT_ABS_TIME) print_graph_abs_time(iter->ts, s); + /* Relative time */ + if (flags & TRACE_GRAPH_PRINT_REL_TIME) + print_graph_rel_time(iter, s); + /* Cpu */ if (flags & TRACE_GRAPH_PRINT_CPU) print_graph_cpu(s, cpu); @@ -725,6 +740,10 @@ print_graph_prologue(struct trace_iterator *iter, struct trace_seq *s, if (flags & TRACE_GRAPH_PRINT_ABS_TIME) print_graph_abs_time(iter->ts, s); + /* Relative time */ + if (flags & TRACE_GRAPH_PRINT_REL_TIME) + print_graph_rel_time(iter, s); + /* Cpu */ if (flags & TRACE_GRAPH_PRINT_CPU) print_graph_cpu(s, cpu); @@ -1101,6 +1120,8 @@ static void print_lat_header(struct seq_file *s, u32 flags) if (flags & TRACE_GRAPH_PRINT_ABS_TIME) size += 16; + if (flags & TRACE_GRAPH_PRINT_REL_TIME) + size += 16; if (flags & TRACE_GRAPH_PRINT_CPU) size += 4; if (flags & TRACE_GRAPH_PRINT_PROC) @@ -1125,6 +1146,8 @@ static void __print_graph_headers_flags(struct trace_array *tr, seq_putc(s, '#'); if (flags & TRACE_GRAPH_PRINT_ABS_TIME) seq_puts(s, " TIME "); + if (flags & TRACE_GRAPH_PRINT_REL_TIME) + seq_puts(s, " REL TIME "); if (flags & TRACE_GRAPH_PRINT_CPU) seq_puts(s, " CPU"); if (flags & TRACE_GRAPH_PRINT_PROC) @@ -1139,6 +1162,8 @@ static void __print_graph_headers_flags(struct trace_array *tr, seq_putc(s, '#'); if (flags & TRACE_GRAPH_PRINT_ABS_TIME) seq_puts(s, " | "); + if (flags & TRACE_GRAPH_PRINT_REL_TIME) + seq_puts(s, " | "); if (flags & TRACE_GRAPH_PRINT_CPU) seq_puts(s, " | "); if (flags & TRACE_GRAPH_PRINT_PROC) diff --git a/kernel/trace/trace_irqsoff.c b/kernel/trace/trace_irqsoff.c index d3294721f119..e6a62e7c6b69 100644 --- a/kernel/trace/trace_irqsoff.c +++ b/kernel/trace/trace_irqsoff.c @@ -238,7 +238,7 @@ static void irqsoff_trace_close(struct trace_iterator *iter) #define GRAPH_TRACER_FLAGS (TRACE_GRAPH_PRINT_CPU | \ TRACE_GRAPH_PRINT_PROC | \ - TRACE_GRAPH_PRINT_ABS_TIME | \ + TRACE_GRAPH_PRINT_REL_TIME | \ TRACE_GRAPH_PRINT_DURATION) static enum print_line_t irqsoff_print_line(struct trace_iterator *iter) diff --git a/kernel/trace/trace_sched_wakeup.c b/kernel/trace/trace_sched_wakeup.c index 4ea7e6845efb..b6c5fa10347e 100644 --- a/kernel/trace/trace_sched_wakeup.c +++ b/kernel/trace/trace_sched_wakeup.c @@ -180,7 +180,7 @@ static void wakeup_trace_close(struct trace_iterator *iter) } #define GRAPH_TRACER_FLAGS (TRACE_GRAPH_PRINT_PROC | \ - TRACE_GRAPH_PRINT_ABS_TIME | \ + TRACE_GRAPH_PRINT_REL_TIME | \ TRACE_GRAPH_PRINT_DURATION) static enum print_line_t wakeup_print_line(struct trace_iterator *iter) -- cgit v1.3-7-g2ca7 From 91457c018f15591682d160ae7f0f31f81c4a9f33 Mon Sep 17 00:00:00 2001 From: Mathieu Malaterre Date: Mon, 14 Jan 2019 21:30:37 +0100 Subject: tracing: Annotate implicit fall through in parse_probe_arg() There is a plan to build the kernel with -Wimplicit-fallthrough and this place in the code produced a warning (W=1). This commit remove the following warning: kernel/trace/trace_probe.c:302:6: warning: this statement may fall through [-Wimplicit-fallthrough=] Link: http://lkml.kernel.org/r/20190114203039.16535-1-malat@debian.org Signed-off-by: Mathieu Malaterre Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_probe.c | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c index 9962cb5da8ac..89da34b326e3 100644 --- a/kernel/trace/trace_probe.c +++ b/kernel/trace/trace_probe.c @@ -300,6 +300,7 @@ parse_probe_arg(char *arg, const struct fetch_type *type, case '+': /* deref memory */ arg++; /* Skip '+', because kstrtol() rejects it. */ + /* fall through */ case '-': tmp = strchr(arg, '('); if (!tmp) -- cgit v1.3-7-g2ca7 From 9399ca21d203f01a39696a6554344b5d6613ba7b Mon Sep 17 00:00:00 2001 From: Mathieu Malaterre Date: Mon, 14 Jan 2019 21:30:38 +0100 Subject: tracing: Annotate implicit fall through in predicate_parse() There is a plan to build the kernel with -Wimplicit-fallthrough and this place in the code produced a warning (W=1). This commit remove the following warning: kernel/trace/trace_events_filter.c:494:8: warning: this statement may fall through [-Wimplicit-fallthrough=] Link: http://lkml.kernel.org/r/20190114203039.16535-2-malat@debian.org Signed-off-by: Mathieu Malaterre Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_events_filter.c | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c index 27821480105e..eb694756c4bb 100644 --- a/kernel/trace/trace_events_filter.c +++ b/kernel/trace/trace_events_filter.c @@ -495,6 +495,7 @@ predicate_parse(const char *str, int nr_parens, int nr_preds, ptr++; break; } + /* fall through */ default: parse_error(pe, FILT_ERR_TOO_MANY_PREDS, next - str); -- cgit v1.3-7-g2ca7 From 6c6dbce196c201810348b6cef6fc5ec77d4d0973 Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Mon, 14 Jan 2019 16:37:53 -0500 Subject: tracing: Add comment to predicate_parse() about "&&" or "||" As the predicat_parse() code is rather complex, commenting subtleties is important. The switch case statement should be commented to describe that it is only looking for two '&' or '|' together, which is why the fall through to an error is after the check. Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_events_filter.c | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c index eb694756c4bb..f052ecb085e9 100644 --- a/kernel/trace/trace_events_filter.c +++ b/kernel/trace/trace_events_filter.c @@ -491,6 +491,7 @@ predicate_parse(const char *str, int nr_parens, int nr_preds, break; case '&': case '|': + /* accepting only "&&" or "||" */ if (next[1] == next[0]) { ptr++; break; -- cgit v1.3-7-g2ca7 From 97f0a3bcdf34f66c2017bad51445eafef1224870 Mon Sep 17 00:00:00 2001 From: Changbin Du Date: Tue, 1 Jan 2019 23:46:11 +0800 Subject: tracing: Show more info for funcgraph wakeup tracers Add these info fields to funcgraph wakeup tracers: o Show CPU info since the waker could be on a different CPU. o Show function duration and overhead. o Show IRQ markers. Link: http://lkml.kernel.org/r/20190101154614.8887-3-changbin.du@gmail.com Signed-off-by: Changbin Du Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_sched_wakeup.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/trace/trace_sched_wakeup.c b/kernel/trace/trace_sched_wakeup.c index b6c5fa10347e..da5b6e012840 100644 --- a/kernel/trace/trace_sched_wakeup.c +++ b/kernel/trace/trace_sched_wakeup.c @@ -180,8 +180,11 @@ static void wakeup_trace_close(struct trace_iterator *iter) } #define GRAPH_TRACER_FLAGS (TRACE_GRAPH_PRINT_PROC | \ + TRACE_GRAPH_PRINT_CPU | \ TRACE_GRAPH_PRINT_REL_TIME | \ - TRACE_GRAPH_PRINT_DURATION) + TRACE_GRAPH_PRINT_DURATION | \ + TRACE_GRAPH_PRINT_OVERHEAD | \ + TRACE_GRAPH_PRINT_IRQS) static enum print_line_t wakeup_print_line(struct trace_iterator *iter) { -- cgit v1.3-7-g2ca7 From afbab501c66bece057bc30656f71f856cd1b3baa Mon Sep 17 00:00:00 2001 From: Changbin Du Date: Tue, 1 Jan 2019 23:46:12 +0800 Subject: tracing: Put a margin between flags and duration for wakeup tracers Don't mix context flags with function duration info. Instead of this: # tracer: wakeup_rt # # wakeup_rt latency trace v1.1.5 on 5.0.0-rc1-test+ # -------------------------------------------------------------------- # latency: 177 us, #545/545, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:8) # ----------------- # | task: migration/0-11 (uid:0 nice:0 policy:1 rt_prio:99) # ----------------- # # _-----=> irqs-off # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / # REL TIME CPU TASK/PID |||| DURATION FUNCTION CALLS # | | | | |||| | | | | | | 0 us | 0) -0 | dNh5 | /* 0:120:R + [000] 11: 0:R migration/0 */ 2 us | 0) -0 | dNh5 0.000 us | (null)(); 4 us | 0) -0 | dNh4 | _raw_spin_unlock() { 4 us | 0) -0 | dNh4 0.304 us | preempt_count_sub(); 5 us | 0) -0 | dNh3 1.063 us | } 5 us | 0) -0 | dNh3 0.266 us | ttwu_stat(); 6 us | 0) -0 | dNh3 | _raw_spin_unlock_irqrestore() { 6 us | 0) -0 | dNh3 0.273 us | preempt_count_sub(); 6 us | 0) -0 | dNh2 0.818 us | } Show this: # tracer: wakeup # # wakeup latency trace v1.1.5 on 4.20.0+ # -------------------------------------------------------------------- # latency: 593 us, #674/674, CPU#0 | (M:desktop VP:0, KP:0, SP:0 HP:0 #P:4) # ----------------- # | task: kworker/0:1H-339 (uid:0 nice:-20 policy:0 rt_prio:0) # ----------------- # # _-----=> irqs-off # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / # REL TIME CPU TASK/PID |||| DURATION FUNCTION CALLS # | | | | |||| | | | | | | 0 us | 0) -0 | dNs. | | /* 0:120:R + [000] 339:100:R kworker/0:1H */ 3 us | 0) -0 | dNs. | 0.000 us | (null)(); 67 us | 0) -0 | dNs. | 0.721 us | ttwu_stat(); 69 us | 0) -0 | dNs. | 0.607 us | _raw_spin_unlock_irqrestore(); 71 us | 0) -0 | .Ns. | 0.598 us | _raw_spin_lock_irq(); 72 us | 0) -0 | .Ns. | 0.584 us | _raw_spin_lock_irq(); 73 us | 0) -0 | dNs. | + 11.118 us | __next_timer_interrupt(); 75 us | 0) -0 | dNs. | | call_timer_fn() { 76 us | 0) -0 | dNs. | | delayed_work_timer_fn() { 76 us | 0) -0 | dNs. | | __queue_work() { ... Link: http://lkml.kernel.org/r/20190101154614.8887-4-changbin.du@gmail.com Signed-off-by: Changbin Du Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_functions_graph.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace_functions_graph.c b/kernel/trace/trace_functions_graph.c index 16ebbdd7b22e..69ebf3c2f1b5 100644 --- a/kernel/trace/trace_functions_graph.c +++ b/kernel/trace/trace_functions_graph.c @@ -380,6 +380,7 @@ static void print_graph_lat_fmt(struct trace_seq *s, struct trace_entry *entry) { trace_seq_putc(s, ' '); trace_print_lat_fmt(s, entry); + trace_seq_puts(s, " | "); } /* If the pid changed since the last trace, output this event */ @@ -1153,7 +1154,7 @@ static void __print_graph_headers_flags(struct trace_array *tr, if (flags & TRACE_GRAPH_PRINT_PROC) seq_puts(s, " TASK/PID "); if (lat) - seq_puts(s, "||||"); + seq_puts(s, "|||| "); if (flags & TRACE_GRAPH_PRINT_DURATION) seq_puts(s, " DURATION "); seq_puts(s, " FUNCTION CALLS\n"); @@ -1169,7 +1170,7 @@ static void __print_graph_headers_flags(struct trace_array *tr, if (flags & TRACE_GRAPH_PRINT_PROC) seq_puts(s, " | | "); if (lat) - seq_puts(s, "||||"); + seq_puts(s, "|||| "); if (flags & TRACE_GRAPH_PRINT_DURATION) seq_puts(s, " | | "); seq_puts(s, " | | | |\n"); -- cgit v1.3-7-g2ca7 From f52d569f3d929d9530d75976baa191d2d50fc51b Mon Sep 17 00:00:00 2001 From: Changbin Du Date: Thu, 17 Jan 2019 00:02:49 +0800 Subject: tracing: Show stacktrace for wakeup tracers This align the behavior of wakeup tracers with irqsoff latency tracer that we record stacktrace at the beginning and end of waking up. The stacktrace shows us what is happening in the kernel. Link: http://lkml.kernel.org/r/20190116160249.7554-1-changbin.du@gmail.com Signed-off-by: Changbin Du Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_sched_wakeup.c | 2 ++ 1 file changed, 2 insertions(+) (limited to 'kernel') diff --git a/kernel/trace/trace_sched_wakeup.c b/kernel/trace/trace_sched_wakeup.c index da5b6e012840..f4fe7d1781e9 100644 --- a/kernel/trace/trace_sched_wakeup.c +++ b/kernel/trace/trace_sched_wakeup.c @@ -475,6 +475,7 @@ probe_wakeup_sched_switch(void *ignore, bool preempt, __trace_function(wakeup_trace, CALLER_ADDR0, CALLER_ADDR1, flags, pc); tracing_sched_switch_trace(wakeup_trace, prev, next, flags, pc); + __trace_stack(wakeup_trace, flags, 0, pc); T0 = data->preempt_timestamp; T1 = ftrace_now(cpu); @@ -586,6 +587,7 @@ probe_wakeup(void *ignore, struct task_struct *p) data = per_cpu_ptr(wakeup_trace->trace_buffer.data, wakeup_cpu); data->preempt_timestamp = ftrace_now(cpu); tracing_sched_wakeup_trace(wakeup_trace, p, current, flags, pc); + __trace_stack(wakeup_trace, flags, 0, pc); /* * We must be careful in using CALLER_ADDR2. But since wake_up -- cgit v1.3-7-g2ca7 From d325c402964e7c63db94e9138c530832269a1297 Mon Sep 17 00:00:00 2001 From: Miroslav Benes Date: Fri, 28 Dec 2018 14:38:47 +0100 Subject: ring-buffer: Remove unused function ring_buffer_page_len() Commit 6b7e633fe9c2 ("tracing: Remove extra zeroing out of the ring buffer page") removed the only caller of ring_buffer_page_len(). The function is now unused and may be removed. Link: http://lkml.kernel.org/r/20181228133847.106177-1-mbenes@suse.cz Signed-off-by: Miroslav Benes Signed-off-by: Steven Rostedt (VMware) --- include/linux/ring_buffer.h | 2 -- kernel/trace/ring_buffer.c | 14 -------------- 2 files changed, 16 deletions(-) (limited to 'kernel') diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h index 5b9ae62272bb..f1429675f252 100644 --- a/include/linux/ring_buffer.h +++ b/include/linux/ring_buffer.h @@ -187,8 +187,6 @@ void ring_buffer_set_clock(struct ring_buffer *buffer, void ring_buffer_set_time_stamp_abs(struct ring_buffer *buffer, bool abs); bool ring_buffer_time_stamp_abs(struct ring_buffer *buffer); -size_t ring_buffer_page_len(void *page); - size_t ring_buffer_nr_pages(struct ring_buffer *buffer, int cpu); size_t ring_buffer_nr_dirty_pages(struct ring_buffer *buffer, int cpu); diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 06e864a334bb..9a91479bbbfe 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -353,20 +353,6 @@ static void rb_init_page(struct buffer_data_page *bpage) local_set(&bpage->commit, 0); } -/** - * ring_buffer_page_len - the size of data on the page. - * @page: The page to read - * - * Returns the amount of data on the page, including buffer page header. - */ -size_t ring_buffer_page_len(void *page) -{ - struct buffer_data_page *bpage = page; - - return (local_read(&bpage->commit) & ~RB_MISSED_FLAGS) - + BUF_PAGE_HDR_SIZE; -} - /* * Also stolen from mm/slob.c. Thanks to Mathieu Desnoyers for pointing * this issue out. -- cgit v1.3-7-g2ca7 From 4d5f007eedb74d71a7bde2bff69b6a31ad8ab427 Mon Sep 17 00:00:00 2001 From: Arnd Bergmann Date: Wed, 2 Jan 2019 13:28:47 +0100 Subject: time: make adjtime compat handling available for 32 bit We want to reuse the compat_timex handling on 32-bit architectures the same way we are using the compat handling for timespec when moving to 64-bit time_t. Move all definitions related to compat_timex out of the compat code into the normal timekeeping code, along with a rename to old_timex32, corresponding to the timespec/timeval structures, and make it controlled by CONFIG_COMPAT_32BIT_TIME, which 32-bit architectures will then select. Signed-off-by: Arnd Bergmann --- include/linux/compat.h | 35 ++--------------------- include/linux/time32.h | 32 ++++++++++++++++++++- kernel/compat.c | 64 ------------------------------------------ kernel/time/posix-timers.c | 14 ++-------- kernel/time/time.c | 70 +++++++++++++++++++++++++++++++++++++++++++--- 5 files changed, 102 insertions(+), 113 deletions(-) (limited to 'kernel') diff --git a/include/linux/compat.h b/include/linux/compat.h index 056be0d03722..657ca6abd855 100644 --- a/include/linux/compat.h +++ b/include/linux/compat.h @@ -132,37 +132,6 @@ struct compat_tms { compat_clock_t tms_cstime; }; -struct compat_timex { - compat_uint_t modes; - compat_long_t offset; - compat_long_t freq; - compat_long_t maxerror; - compat_long_t esterror; - compat_int_t status; - compat_long_t constant; - compat_long_t precision; - compat_long_t tolerance; - struct old_timeval32 time; - compat_long_t tick; - compat_long_t ppsfreq; - compat_long_t jitter; - compat_int_t shift; - compat_long_t stabil; - compat_long_t jitcnt; - compat_long_t calcnt; - compat_long_t errcnt; - compat_long_t stbcnt; - compat_int_t tai; - - compat_int_t:32; compat_int_t:32; compat_int_t:32; compat_int_t:32; - compat_int_t:32; compat_int_t:32; compat_int_t:32; compat_int_t:32; - compat_int_t:32; compat_int_t:32; compat_int_t:32; -}; - -struct timex; -int compat_get_timex(struct timex *, const struct compat_timex __user *); -int compat_put_timex(struct compat_timex __user *, const struct timex *); - #define _COMPAT_NSIG_WORDS (_COMPAT_NSIG / _COMPAT_NSIG_BPW) typedef struct { @@ -808,7 +777,7 @@ asmlinkage long compat_sys_gettimeofday(struct old_timeval32 __user *tv, struct timezone __user *tz); asmlinkage long compat_sys_settimeofday(struct old_timeval32 __user *tv, struct timezone __user *tz); -asmlinkage long compat_sys_adjtimex(struct compat_timex __user *utp); +asmlinkage long compat_sys_adjtimex(struct old_timex32 __user *utp); /* kernel/timer.c */ asmlinkage long compat_sys_sysinfo(struct compat_sysinfo __user *info); @@ -911,7 +880,7 @@ asmlinkage long compat_sys_open_by_handle_at(int mountdirfd, struct file_handle __user *handle, int flags); asmlinkage long compat_sys_clock_adjtime(clockid_t which_clock, - struct compat_timex __user *tp); + struct old_timex32 __user *tp); asmlinkage long compat_sys_sendmmsg(int fd, struct compat_mmsghdr __user *mmsg, unsigned vlen, unsigned int flags); asmlinkage ssize_t compat_sys_process_vm_readv(compat_pid_t pid, diff --git a/include/linux/time32.h b/include/linux/time32.h index 118b9977080c..820a22e2b98b 100644 --- a/include/linux/time32.h +++ b/include/linux/time32.h @@ -10,6 +10,7 @@ */ #include +#include #define TIME_T_MAX (time_t)((1UL << ((sizeof(time_t) << 3) - 1)) - 1) @@ -35,13 +36,42 @@ struct old_utimbuf32 { old_time32_t modtime; }; +struct old_timex32 { + u32 modes; + s32 offset; + s32 freq; + s32 maxerror; + s32 esterror; + s32 status; + s32 constant; + s32 precision; + s32 tolerance; + struct old_timeval32 time; + s32 tick; + s32 ppsfreq; + s32 jitter; + s32 shift; + s32 stabil; + s32 jitcnt; + s32 calcnt; + s32 errcnt; + s32 stbcnt; + s32 tai; + + s32:32; s32:32; s32:32; s32:32; + s32:32; s32:32; s32:32; s32:32; + s32:32; s32:32; s32:32; +}; + extern int get_old_timespec32(struct timespec64 *, const void __user *); extern int put_old_timespec32(const struct timespec64 *, void __user *); extern int get_old_itimerspec32(struct itimerspec64 *its, const struct old_itimerspec32 __user *uits); extern int put_old_itimerspec32(const struct itimerspec64 *its, struct old_itimerspec32 __user *uits); - +struct timex; +int get_old_timex32(struct timex *, const struct old_timex32 __user *); +int put_old_timex32(struct old_timex32 __user *, const struct timex *); #if __BITS_PER_LONG == 64 diff --git a/kernel/compat.c b/kernel/compat.c index f01affa17e22..d8a36c6ad7c9 100644 --- a/kernel/compat.c +++ b/kernel/compat.c @@ -20,7 +20,6 @@ #include #include #include -#include #include #include #include @@ -30,69 +29,6 @@ #include -int compat_get_timex(struct timex *txc, const struct compat_timex __user *utp) -{ - struct compat_timex tx32; - - memset(txc, 0, sizeof(struct timex)); - if (copy_from_user(&tx32, utp, sizeof(struct compat_timex))) - return -EFAULT; - - txc->modes = tx32.modes; - txc->offset = tx32.offset; - txc->freq = tx32.freq; - txc->maxerror = tx32.maxerror; - txc->esterror = tx32.esterror; - txc->status = tx32.status; - txc->constant = tx32.constant; - txc->precision = tx32.precision; - txc->tolerance = tx32.tolerance; - txc->time.tv_sec = tx32.time.tv_sec; - txc->time.tv_usec = tx32.time.tv_usec; - txc->tick = tx32.tick; - txc->ppsfreq = tx32.ppsfreq; - txc->jitter = tx32.jitter; - txc->shift = tx32.shift; - txc->stabil = tx32.stabil; - txc->jitcnt = tx32.jitcnt; - txc->calcnt = tx32.calcnt; - txc->errcnt = tx32.errcnt; - txc->stbcnt = tx32.stbcnt; - - return 0; -} - -int compat_put_timex(struct compat_timex __user *utp, const struct timex *txc) -{ - struct compat_timex tx32; - - memset(&tx32, 0, sizeof(struct compat_timex)); - tx32.modes = txc->modes; - tx32.offset = txc->offset; - tx32.freq = txc->freq; - tx32.maxerror = txc->maxerror; - tx32.esterror = txc->esterror; - tx32.status = txc->status; - tx32.constant = txc->constant; - tx32.precision = txc->precision; - tx32.tolerance = txc->tolerance; - tx32.time.tv_sec = txc->time.tv_sec; - tx32.time.tv_usec = txc->time.tv_usec; - tx32.tick = txc->tick; - tx32.ppsfreq = txc->ppsfreq; - tx32.jitter = txc->jitter; - tx32.shift = txc->shift; - tx32.stabil = txc->stabil; - tx32.jitcnt = txc->jitcnt; - tx32.calcnt = txc->calcnt; - tx32.errcnt = txc->errcnt; - tx32.stbcnt = txc->stbcnt; - tx32.tai = txc->tai; - if (copy_to_user(utp, &tx32, sizeof(struct compat_timex))) - return -EFAULT; - return 0; -} - static int __compat_get_timeval(struct timeval *tv, const struct old_timeval32 __user *ctv) { return (!access_ok(ctv, sizeof(*ctv)) || diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c index 0e84bb72a3da..8955f32f2a36 100644 --- a/kernel/time/posix-timers.c +++ b/kernel/time/posix-timers.c @@ -1123,12 +1123,8 @@ COMPAT_SYSCALL_DEFINE2(clock_gettime, clockid_t, which_clock, return err; } -#endif - -#ifdef CONFIG_COMPAT - COMPAT_SYSCALL_DEFINE2(clock_adjtime, clockid_t, which_clock, - struct compat_timex __user *, utp) + struct old_timex32 __user *, utp) { const struct k_clock *kc = clockid_to_kclock(which_clock); struct timex ktx; @@ -1139,22 +1135,18 @@ COMPAT_SYSCALL_DEFINE2(clock_adjtime, clockid_t, which_clock, if (!kc->clock_adj) return -EOPNOTSUPP; - err = compat_get_timex(&ktx, utp); + err = get_old_timex32(&ktx, utp); if (err) return err; err = kc->clock_adj(which_clock, &ktx); if (err >= 0) - err = compat_put_timex(utp, &ktx); + err = put_old_timex32(utp, &ktx); return err; } -#endif - -#ifdef CONFIG_COMPAT_32BIT_TIME - COMPAT_SYSCALL_DEFINE2(clock_getres, clockid_t, which_clock, struct old_timespec32 __user *, tp) { diff --git a/kernel/time/time.c b/kernel/time/time.c index 2edb5088a70b..2d013bc2b271 100644 --- a/kernel/time/time.c +++ b/kernel/time/time.c @@ -278,20 +278,82 @@ SYSCALL_DEFINE1(adjtimex, struct timex __user *, txc_p) return copy_to_user(txc_p, &txc, sizeof(struct timex)) ? -EFAULT : ret; } -#ifdef CONFIG_COMPAT +#ifdef CONFIG_COMPAT_32BIT_TIME +int get_old_timex32(struct timex *txc, const struct old_timex32 __user *utp) +{ + struct old_timex32 tx32; + + memset(txc, 0, sizeof(struct timex)); + if (copy_from_user(&tx32, utp, sizeof(struct old_timex32))) + return -EFAULT; + + txc->modes = tx32.modes; + txc->offset = tx32.offset; + txc->freq = tx32.freq; + txc->maxerror = tx32.maxerror; + txc->esterror = tx32.esterror; + txc->status = tx32.status; + txc->constant = tx32.constant; + txc->precision = tx32.precision; + txc->tolerance = tx32.tolerance; + txc->time.tv_sec = tx32.time.tv_sec; + txc->time.tv_usec = tx32.time.tv_usec; + txc->tick = tx32.tick; + txc->ppsfreq = tx32.ppsfreq; + txc->jitter = tx32.jitter; + txc->shift = tx32.shift; + txc->stabil = tx32.stabil; + txc->jitcnt = tx32.jitcnt; + txc->calcnt = tx32.calcnt; + txc->errcnt = tx32.errcnt; + txc->stbcnt = tx32.stbcnt; + + return 0; +} + +int put_old_timex32(struct old_timex32 __user *utp, const struct timex *txc) +{ + struct old_timex32 tx32; + + memset(&tx32, 0, sizeof(struct old_timex32)); + tx32.modes = txc->modes; + tx32.offset = txc->offset; + tx32.freq = txc->freq; + tx32.maxerror = txc->maxerror; + tx32.esterror = txc->esterror; + tx32.status = txc->status; + tx32.constant = txc->constant; + tx32.precision = txc->precision; + tx32.tolerance = txc->tolerance; + tx32.time.tv_sec = txc->time.tv_sec; + tx32.time.tv_usec = txc->time.tv_usec; + tx32.tick = txc->tick; + tx32.ppsfreq = txc->ppsfreq; + tx32.jitter = txc->jitter; + tx32.shift = txc->shift; + tx32.stabil = txc->stabil; + tx32.jitcnt = txc->jitcnt; + tx32.calcnt = txc->calcnt; + tx32.errcnt = txc->errcnt; + tx32.stbcnt = txc->stbcnt; + tx32.tai = txc->tai; + if (copy_to_user(utp, &tx32, sizeof(struct old_timex32))) + return -EFAULT; + return 0; +} -COMPAT_SYSCALL_DEFINE1(adjtimex, struct compat_timex __user *, utp) +COMPAT_SYSCALL_DEFINE1(adjtimex, struct old_timex32 __user *, utp) { struct timex txc; int err, ret; - err = compat_get_timex(&txc, utp); + err = get_old_timex32(&txc, utp); if (err) return err; ret = do_adjtimex(&txc); - err = compat_put_timex(utp, &txc); + err = put_old_timex32(utp, &txc); if (err) return err; -- cgit v1.3-7-g2ca7 From 1a596398a3d75f966b75f428e992cf1f242f9a5b Mon Sep 17 00:00:00 2001 From: Arnd Bergmann Date: Thu, 3 Jan 2019 21:12:39 +0100 Subject: sparc64: add custom adjtimex/clock_adjtime functions sparc64 is the only architecture on Linux that has a 'timeval' definition with a 32-bit tv_usec but a 64-bit tv_sec. This causes problems for sparc32 compat mode when we convert it to use the new __kernel_timex type that has the same layout as all other 64-bit architectures. To avoid adding sparc64 specific code into the generic adjtimex implementation, this adds a wrapper in the sparc64 system call handling that converts the sparc64 'timex' into the new '__kernel_timex'. At this point, the two structures are defined to be identical, but that will change in the next step once we convert sparc32. Signed-off-by: Arnd Bergmann --- arch/sparc/kernel/sys_sparc_64.c | 59 +++++++++++++++++++++++++++++++++- arch/sparc/kernel/syscalls/syscall.tbl | 6 ++-- include/linux/timex.h | 2 ++ kernel/time/posix-timers.c | 24 +++++++------- 4 files changed, 76 insertions(+), 15 deletions(-) (limited to 'kernel') diff --git a/arch/sparc/kernel/sys_sparc_64.c b/arch/sparc/kernel/sys_sparc_64.c index 1c079e7bab09..37de18a11207 100644 --- a/arch/sparc/kernel/sys_sparc_64.c +++ b/arch/sparc/kernel/sys_sparc_64.c @@ -28,8 +28,9 @@ #include #include #include - +#include #include + #include #include @@ -544,6 +545,62 @@ out_unlock: return err; } +SYSCALL_DEFINE1(sparc_adjtimex, struct timex __user *, txc_p) +{ + struct timex txc; /* Local copy of parameter */ + struct timex *kt = (void *)&txc; + int ret; + + /* Copy the user data space into the kernel copy + * structure. But bear in mind that the structures + * may change + */ + if (copy_from_user(&txc, txc_p, sizeof(struct timex))) + return -EFAULT; + + /* + * override for sparc64 specific timeval type: tv_usec + * is 32 bit wide instead of 64-bit in __kernel_timex + */ + kt->time.tv_usec = txc.time.tv_usec; + ret = do_adjtimex(kt); + txc.time.tv_usec = kt->time.tv_usec; + + return copy_to_user(txc_p, &txc, sizeof(struct timex)) ? -EFAULT : ret; +} + +SYSCALL_DEFINE2(sparc_clock_adjtime, const clockid_t, which_clock,struct timex __user *, txc_p) +{ + struct timex txc; /* Local copy of parameter */ + struct timex *kt = (void *)&txc; + int ret; + + if (!IS_ENABLED(CONFIG_POSIX_TIMERS)) { + pr_err_once("process %d (%s) attempted a POSIX timer syscall " + "while CONFIG_POSIX_TIMERS is not set\n", + current->pid, current->comm); + + return -ENOSYS; + } + + /* Copy the user data space into the kernel copy + * structure. But bear in mind that the structures + * may change + */ + if (copy_from_user(&txc, txc_p, sizeof(struct timex))) + return -EFAULT; + + /* + * override for sparc64 specific timeval type: tv_usec + * is 32 bit wide instead of 64-bit in __kernel_timex + */ + kt->time.tv_usec = txc.time.tv_usec; + ret = do_clock_adjtime(which_clock, kt); + txc.time.tv_usec = kt->time.tv_usec; + + return copy_to_user(txc_p, &txc, sizeof(struct timex)) ? -EFAULT : ret; +} + SYSCALL_DEFINE5(utrap_install, utrap_entry_t, type, utrap_handler_t, new_p, utrap_handler_t, new_d, utrap_handler_t __user *, old_p, diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl index 6992d17cce37..e63cd013cc77 100644 --- a/arch/sparc/kernel/syscalls/syscall.tbl +++ b/arch/sparc/kernel/syscalls/syscall.tbl @@ -258,7 +258,8 @@ 216 64 sigreturn sys_nis_syscall 217 common clone sys_clone 218 common ioprio_get sys_ioprio_get -219 common adjtimex sys_adjtimex compat_sys_adjtimex +219 32 adjtimex sys_adjtimex compat_sys_adjtimex +219 64 adjtimex sys_sparc_adjtimex 220 32 sigprocmask sys_sigprocmask compat_sys_sigprocmask 220 64 sigprocmask sys_nis_syscall 221 common create_module sys_ni_syscall @@ -377,7 +378,8 @@ 331 common prlimit64 sys_prlimit64 332 common name_to_handle_at sys_name_to_handle_at 333 common open_by_handle_at sys_open_by_handle_at compat_sys_open_by_handle_at -334 common clock_adjtime sys_clock_adjtime compat_sys_clock_adjtime +334 32 clock_adjtime sys_clock_adjtime compat_sys_clock_adjtime +334 64 clock_adjtime sys_sparc_clock_adjtime 335 common syncfs sys_syncfs 336 common sendmmsg sys_sendmmsg compat_sys_sendmmsg 337 common setns sys_setns diff --git a/include/linux/timex.h b/include/linux/timex.h index 7f40e9e42ecc..a15e6aeb8d49 100644 --- a/include/linux/timex.h +++ b/include/linux/timex.h @@ -159,6 +159,8 @@ extern unsigned long tick_nsec; /* SHIFTED_HZ period (nsec) */ #define NTP_INTERVAL_LENGTH (NSEC_PER_SEC/NTP_INTERVAL_FREQ) extern int do_adjtimex(struct timex *); +extern int do_clock_adjtime(const clockid_t which_clock, struct timex * ktx); + extern void hardpps(const struct timespec64 *, const struct timespec64 *); int read_current_timer(unsigned long *timer_val); diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c index 8955f32f2a36..8f7f1dd95940 100644 --- a/kernel/time/posix-timers.c +++ b/kernel/time/posix-timers.c @@ -1047,22 +1047,28 @@ SYSCALL_DEFINE2(clock_gettime, const clockid_t, which_clock, return error; } -SYSCALL_DEFINE2(clock_adjtime, const clockid_t, which_clock, - struct timex __user *, utx) +int do_clock_adjtime(const clockid_t which_clock, struct timex * ktx) { const struct k_clock *kc = clockid_to_kclock(which_clock); - struct timex ktx; - int err; if (!kc) return -EINVAL; if (!kc->clock_adj) return -EOPNOTSUPP; + return kc->clock_adj(which_clock, ktx); +} + +SYSCALL_DEFINE2(clock_adjtime, const clockid_t, which_clock, + struct timex __user *, utx) +{ + struct timex ktx; + int err; + if (copy_from_user(&ktx, utx, sizeof(ktx))) return -EFAULT; - err = kc->clock_adj(which_clock, &ktx); + err = do_clock_adjtime(which_clock, &ktx); if (err >= 0 && copy_to_user(utx, &ktx, sizeof(ktx))) return -EFAULT; @@ -1126,20 +1132,14 @@ COMPAT_SYSCALL_DEFINE2(clock_gettime, clockid_t, which_clock, COMPAT_SYSCALL_DEFINE2(clock_adjtime, clockid_t, which_clock, struct old_timex32 __user *, utp) { - const struct k_clock *kc = clockid_to_kclock(which_clock); struct timex ktx; int err; - if (!kc) - return -EINVAL; - if (!kc->clock_adj) - return -EOPNOTSUPP; - err = get_old_timex32(&ktx, utp); if (err) return err; - err = kc->clock_adj(which_clock, &ktx); + err = do_clock_adjtime(which_clock, &ktx); if (err >= 0) err = put_old_timex32(utp, &ktx); -- cgit v1.3-7-g2ca7 From ead25417f82ed7f8a21da4dcefc768169f7da884 Mon Sep 17 00:00:00 2001 From: Deepa Dinamani Date: Mon, 2 Jul 2018 22:44:21 -0700 Subject: timex: use __kernel_timex internally struct timex is not y2038 safe. Replace all uses of timex with y2038 safe __kernel_timex. Note that struct __kernel_timex is an ABI interface definition. We could define a new structure based on __kernel_timex that is only available internally instead. Right now, there isn't a strong motivation for this as the structure is isolated to a few defined struct timex interfaces and such a structure would be exactly the same as struct timex. The patch was generated by the following coccinelle script: virtual patch @depends on patch forall@ identifier ts; expression e; @@ ( - struct timex ts; + struct __kernel_timex ts; | - struct timex ts = {}; + struct __kernel_timex ts = {}; | - struct timex ts = e; + struct __kernel_timex ts = e; | - struct timex *ts; + struct __kernel_timex *ts; | (memset \| copy_from_user \| copy_to_user \)(..., - sizeof(struct timex)) + sizeof(struct __kernel_timex)) ) @depends on patch forall@ identifier ts; identifier fn; @@ fn(..., - struct timex *ts, + struct __kernel_timex *ts, ...) { ... } @depends on patch forall@ identifier ts; identifier fn; @@ fn(..., - struct timex *ts) { + struct __kernel_timex *ts) { ... } Signed-off-by: Deepa Dinamani Cc: linux-alpha@vger.kernel.org Cc: netdev@vger.kernel.org Signed-off-by: Arnd Bergmann --- arch/alpha/kernel/osf_sys.c | 5 +++-- arch/sparc/kernel/sys_sparc_64.c | 4 ++-- drivers/ptp/ptp_clock.c | 2 +- include/linux/posix-clock.h | 2 +- include/linux/time32.h | 6 +++--- include/linux/timex.h | 4 ++-- kernel/time/ntp.c | 18 ++++++++++-------- kernel/time/ntp_internal.h | 2 +- kernel/time/posix-clock.c | 2 +- kernel/time/posix-timers.c | 8 ++++---- kernel/time/posix-timers.h | 2 +- kernel/time/time.c | 14 +++++++------- kernel/time/timekeeping.c | 4 ++-- 13 files changed, 38 insertions(+), 35 deletions(-) (limited to 'kernel') diff --git a/arch/alpha/kernel/osf_sys.c b/arch/alpha/kernel/osf_sys.c index 792586038808..bf497b8b0ec6 100644 --- a/arch/alpha/kernel/osf_sys.c +++ b/arch/alpha/kernel/osf_sys.c @@ -1253,7 +1253,7 @@ struct timex32 { SYSCALL_DEFINE1(old_adjtimex, struct timex32 __user *, txc_p) { - struct timex txc; + struct __kernel_timex txc; int ret; /* copy relevant bits of struct timex. */ @@ -1270,7 +1270,8 @@ SYSCALL_DEFINE1(old_adjtimex, struct timex32 __user *, txc_p) if (copy_to_user(txc_p, &txc, offsetof(struct timex32, time)) || (copy_to_user(&txc_p->tick, &txc.tick, sizeof(struct timex32) - offsetof(struct timex32, tick))) || - (put_tv_to_tv32(&txc_p->time, &txc.time))) + (put_user(txc.time.tv_sec, &txc_p->time.tv_sec)) || + (put_user(txc.time.tv_usec, &txc_p->time.tv_usec))) return -EFAULT; return ret; diff --git a/arch/sparc/kernel/sys_sparc_64.c b/arch/sparc/kernel/sys_sparc_64.c index 37de18a11207..9825ca6a6020 100644 --- a/arch/sparc/kernel/sys_sparc_64.c +++ b/arch/sparc/kernel/sys_sparc_64.c @@ -548,7 +548,7 @@ out_unlock: SYSCALL_DEFINE1(sparc_adjtimex, struct timex __user *, txc_p) { struct timex txc; /* Local copy of parameter */ - struct timex *kt = (void *)&txc; + struct __kernel_timex *kt = (void *)&txc; int ret; /* Copy the user data space into the kernel copy @@ -572,7 +572,7 @@ SYSCALL_DEFINE1(sparc_adjtimex, struct timex __user *, txc_p) SYSCALL_DEFINE2(sparc_clock_adjtime, const clockid_t, which_clock,struct timex __user *, txc_p) { struct timex txc; /* Local copy of parameter */ - struct timex *kt = (void *)&txc; + struct __kernel_timex *kt = (void *)&txc; int ret; if (!IS_ENABLED(CONFIG_POSIX_TIMERS)) { diff --git a/drivers/ptp/ptp_clock.c b/drivers/ptp/ptp_clock.c index 48f3594a7458..79bd102c9bbc 100644 --- a/drivers/ptp/ptp_clock.c +++ b/drivers/ptp/ptp_clock.c @@ -124,7 +124,7 @@ static int ptp_clock_gettime(struct posix_clock *pc, struct timespec64 *tp) return err; } -static int ptp_clock_adjtime(struct posix_clock *pc, struct timex *tx) +static int ptp_clock_adjtime(struct posix_clock *pc, struct __kernel_timex *tx) { struct ptp_clock *ptp = container_of(pc, struct ptp_clock, clock); struct ptp_clock_info *ops; diff --git a/include/linux/posix-clock.h b/include/linux/posix-clock.h index 3a3bc71017d5..18674d7d5b1c 100644 --- a/include/linux/posix-clock.h +++ b/include/linux/posix-clock.h @@ -51,7 +51,7 @@ struct posix_clock; struct posix_clock_operations { struct module *owner; - int (*clock_adjtime)(struct posix_clock *pc, struct timex *tx); + int (*clock_adjtime)(struct posix_clock *pc, struct __kernel_timex *tx); int (*clock_gettime)(struct posix_clock *pc, struct timespec64 *ts); diff --git a/include/linux/time32.h b/include/linux/time32.h index 820a22e2b98b..0a1f302a1753 100644 --- a/include/linux/time32.h +++ b/include/linux/time32.h @@ -69,9 +69,9 @@ extern int get_old_itimerspec32(struct itimerspec64 *its, const struct old_itimerspec32 __user *uits); extern int put_old_itimerspec32(const struct itimerspec64 *its, struct old_itimerspec32 __user *uits); -struct timex; -int get_old_timex32(struct timex *, const struct old_timex32 __user *); -int put_old_timex32(struct old_timex32 __user *, const struct timex *); +struct __kernel_timex; +int get_old_timex32(struct __kernel_timex *, const struct old_timex32 __user *); +int put_old_timex32(struct old_timex32 __user *, const struct __kernel_timex *); #if __BITS_PER_LONG == 64 diff --git a/include/linux/timex.h b/include/linux/timex.h index a15e6aeb8d49..4aff9f0d1367 100644 --- a/include/linux/timex.h +++ b/include/linux/timex.h @@ -158,8 +158,8 @@ extern unsigned long tick_nsec; /* SHIFTED_HZ period (nsec) */ #define NTP_INTERVAL_FREQ (HZ) #define NTP_INTERVAL_LENGTH (NSEC_PER_SEC/NTP_INTERVAL_FREQ) -extern int do_adjtimex(struct timex *); -extern int do_clock_adjtime(const clockid_t which_clock, struct timex * ktx); +extern int do_adjtimex(struct __kernel_timex *); +extern int do_clock_adjtime(const clockid_t which_clock, struct __kernel_timex * ktx); extern void hardpps(const struct timespec64 *, const struct timespec64 *); diff --git a/kernel/time/ntp.c b/kernel/time/ntp.c index 36a2bef00125..92a90014a925 100644 --- a/kernel/time/ntp.c +++ b/kernel/time/ntp.c @@ -188,13 +188,13 @@ static inline int is_error_status(int status) && (status & (STA_PPSWANDER|STA_PPSERROR))); } -static inline void pps_fill_timex(struct timex *txc) +static inline void pps_fill_timex(struct __kernel_timex *txc) { txc->ppsfreq = shift_right((pps_freq >> PPM_SCALE_INV_SHIFT) * PPM_SCALE_INV, NTP_SCALE_SHIFT); txc->jitter = pps_jitter; if (!(time_status & STA_NANO)) - txc->jitter /= NSEC_PER_USEC; + txc->jitter = pps_jitter / NSEC_PER_USEC; txc->shift = pps_shift; txc->stabil = pps_stabil; txc->jitcnt = pps_jitcnt; @@ -220,7 +220,7 @@ static inline int is_error_status(int status) return status & (STA_UNSYNC|STA_CLOCKERR); } -static inline void pps_fill_timex(struct timex *txc) +static inline void pps_fill_timex(struct __kernel_timex *txc) { /* PPS is not implemented, so these are zero */ txc->ppsfreq = 0; @@ -633,7 +633,7 @@ void ntp_notify_cmos_timer(void) /* * Propagate a new txc->status value into the NTP state: */ -static inline void process_adj_status(const struct timex *txc) +static inline void process_adj_status(const struct __kernel_timex *txc) { if ((time_status & STA_PLL) && !(txc->status & STA_PLL)) { time_state = TIME_OK; @@ -656,7 +656,8 @@ static inline void process_adj_status(const struct timex *txc) } -static inline void process_adjtimex_modes(const struct timex *txc, s32 *time_tai) +static inline void process_adjtimex_modes(const struct __kernel_timex *txc, + s32 *time_tai) { if (txc->modes & ADJ_STATUS) process_adj_status(txc); @@ -707,7 +708,8 @@ static inline void process_adjtimex_modes(const struct timex *txc, s32 *time_tai * adjtimex mainly allows reading (and writing, if superuser) of * kernel time-keeping variables. used by xntpd. */ -int __do_adjtimex(struct timex *txc, const struct timespec64 *ts, s32 *time_tai) +int __do_adjtimex(struct __kernel_timex *txc, const struct timespec64 *ts, + s32 *time_tai) { int result; @@ -729,7 +731,7 @@ int __do_adjtimex(struct timex *txc, const struct timespec64 *ts, s32 *time_tai) txc->offset = shift_right(time_offset * NTP_INTERVAL_FREQ, NTP_SCALE_SHIFT); if (!(time_status & STA_NANO)) - txc->offset /= NSEC_PER_USEC; + txc->offset = (u32)txc->offset / NSEC_PER_USEC; } result = time_state; /* mostly `TIME_OK' */ @@ -754,7 +756,7 @@ int __do_adjtimex(struct timex *txc, const struct timespec64 *ts, s32 *time_tai) txc->time.tv_sec = (time_t)ts->tv_sec; txc->time.tv_usec = ts->tv_nsec; if (!(time_status & STA_NANO)) - txc->time.tv_usec /= NSEC_PER_USEC; + txc->time.tv_usec = ts->tv_nsec / NSEC_PER_USEC; /* Handle leapsec adjustments */ if (unlikely(ts->tv_sec >= ntp_next_leap_sec)) { diff --git a/kernel/time/ntp_internal.h b/kernel/time/ntp_internal.h index c24b0e13f011..40e6122e634e 100644 --- a/kernel/time/ntp_internal.h +++ b/kernel/time/ntp_internal.h @@ -8,6 +8,6 @@ extern void ntp_clear(void); extern u64 ntp_tick_length(void); extern ktime_t ntp_get_next_leap(void); extern int second_overflow(time64_t secs); -extern int __do_adjtimex(struct timex *txc, const struct timespec64 *ts, s32 *time_tai); +extern int __do_adjtimex(struct __kernel_timex *txc, const struct timespec64 *ts, s32 *time_tai); extern void __hardpps(const struct timespec64 *phase_ts, const struct timespec64 *raw_ts); #endif /* _LINUX_NTP_INTERNAL_H */ diff --git a/kernel/time/posix-clock.c b/kernel/time/posix-clock.c index 425bbfce6819..ec960bb939fd 100644 --- a/kernel/time/posix-clock.c +++ b/kernel/time/posix-clock.c @@ -228,7 +228,7 @@ static void put_clock_desc(struct posix_clock_desc *cd) fput(cd->fp); } -static int pc_clock_adjtime(clockid_t id, struct timex *tx) +static int pc_clock_adjtime(clockid_t id, struct __kernel_timex *tx) { struct posix_clock_desc cd; int err; diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c index 8f7f1dd95940..2d84b3db1ade 100644 --- a/kernel/time/posix-timers.c +++ b/kernel/time/posix-timers.c @@ -179,7 +179,7 @@ static int posix_clock_realtime_set(const clockid_t which_clock, } static int posix_clock_realtime_adj(const clockid_t which_clock, - struct timex *t) + struct __kernel_timex *t) { return do_adjtimex(t); } @@ -1047,7 +1047,7 @@ SYSCALL_DEFINE2(clock_gettime, const clockid_t, which_clock, return error; } -int do_clock_adjtime(const clockid_t which_clock, struct timex * ktx) +int do_clock_adjtime(const clockid_t which_clock, struct __kernel_timex * ktx) { const struct k_clock *kc = clockid_to_kclock(which_clock); @@ -1062,7 +1062,7 @@ int do_clock_adjtime(const clockid_t which_clock, struct timex * ktx) SYSCALL_DEFINE2(clock_adjtime, const clockid_t, which_clock, struct timex __user *, utx) { - struct timex ktx; + struct __kernel_timex ktx; int err; if (copy_from_user(&ktx, utx, sizeof(ktx))) @@ -1132,7 +1132,7 @@ COMPAT_SYSCALL_DEFINE2(clock_gettime, clockid_t, which_clock, COMPAT_SYSCALL_DEFINE2(clock_adjtime, clockid_t, which_clock, struct old_timex32 __user *, utp) { - struct timex ktx; + struct __kernel_timex ktx; int err; err = get_old_timex32(&ktx, utp); diff --git a/kernel/time/posix-timers.h b/kernel/time/posix-timers.h index ddb21145211a..de5daa6d975a 100644 --- a/kernel/time/posix-timers.h +++ b/kernel/time/posix-timers.h @@ -8,7 +8,7 @@ struct k_clock { const struct timespec64 *tp); int (*clock_get)(const clockid_t which_clock, struct timespec64 *tp); - int (*clock_adj)(const clockid_t which_clock, struct timex *tx); + int (*clock_adj)(const clockid_t which_clock, struct __kernel_timex *tx); int (*timer_create)(struct k_itimer *timer); int (*nsleep)(const clockid_t which_clock, int flags, const struct timespec64 *); diff --git a/kernel/time/time.c b/kernel/time/time.c index 2d013bc2b271..d179d33f639a 100644 --- a/kernel/time/time.c +++ b/kernel/time/time.c @@ -265,25 +265,25 @@ COMPAT_SYSCALL_DEFINE2(settimeofday, struct old_timeval32 __user *, tv, SYSCALL_DEFINE1(adjtimex, struct timex __user *, txc_p) { - struct timex txc; /* Local copy of parameter */ + struct __kernel_timex txc; /* Local copy of parameter */ int ret; /* Copy the user data space into the kernel copy * structure. But bear in mind that the structures * may change */ - if (copy_from_user(&txc, txc_p, sizeof(struct timex))) + if (copy_from_user(&txc, txc_p, sizeof(struct __kernel_timex))) return -EFAULT; ret = do_adjtimex(&txc); - return copy_to_user(txc_p, &txc, sizeof(struct timex)) ? -EFAULT : ret; + return copy_to_user(txc_p, &txc, sizeof(struct __kernel_timex)) ? -EFAULT : ret; } #ifdef CONFIG_COMPAT_32BIT_TIME -int get_old_timex32(struct timex *txc, const struct old_timex32 __user *utp) +int get_old_timex32(struct __kernel_timex *txc, const struct old_timex32 __user *utp) { struct old_timex32 tx32; - memset(txc, 0, sizeof(struct timex)); + memset(txc, 0, sizeof(struct __kernel_timex)); if (copy_from_user(&tx32, utp, sizeof(struct old_timex32))) return -EFAULT; @@ -311,7 +311,7 @@ int get_old_timex32(struct timex *txc, const struct old_timex32 __user *utp) return 0; } -int put_old_timex32(struct old_timex32 __user *utp, const struct timex *txc) +int put_old_timex32(struct old_timex32 __user *utp, const struct __kernel_timex *txc) { struct old_timex32 tx32; @@ -344,7 +344,7 @@ int put_old_timex32(struct old_timex32 __user *utp, const struct timex *txc) COMPAT_SYSCALL_DEFINE1(adjtimex, struct old_timex32 __user *, utp) { - struct timex txc; + struct __kernel_timex txc; int err, ret; err = get_old_timex32(&txc, utp); diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c index ac5dbf2cd4a2..f986e1918d12 100644 --- a/kernel/time/timekeeping.c +++ b/kernel/time/timekeeping.c @@ -2234,7 +2234,7 @@ ktime_t ktime_get_update_offsets_now(unsigned int *cwsseq, ktime_t *offs_real, /** * timekeeping_validate_timex - Ensures the timex is ok for use in do_adjtimex */ -static int timekeeping_validate_timex(const struct timex *txc) +static int timekeeping_validate_timex(const struct __kernel_timex *txc) { if (txc->modes & ADJ_ADJTIME) { /* singleshot must not be used with any other mode bits */ @@ -2300,7 +2300,7 @@ static int timekeeping_validate_timex(const struct timex *txc) /** * do_adjtimex() - Accessor function to NTP __do_adjtimex function */ -int do_adjtimex(struct timex *txc) +int do_adjtimex(struct __kernel_timex *txc) { struct timekeeper *tk = &tk_core.timekeeper; unsigned long flags; -- cgit v1.3-7-g2ca7 From 3876ced476c8ec17265d1739467e726ada88b660 Mon Sep 17 00:00:00 2001 From: Deepa Dinamani Date: Mon, 2 Jul 2018 22:44:22 -0700 Subject: timex: change syscalls to use struct __kernel_timex struct timex is not y2038 safe. Switch all the syscall apis to use y2038 safe __kernel_timex. Note that sys_adjtimex() does not have a y2038 safe solution. C libraries can implement it by calling clock_adjtime(CLOCK_REALTIME, ...). Signed-off-by: Deepa Dinamani Signed-off-by: Arnd Bergmann --- include/linux/syscalls.h | 6 +++--- kernel/time/posix-timers.c | 2 +- kernel/time/time.c | 4 +++- 3 files changed, 7 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index baa4b70b02d3..09330d5bda0c 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -54,7 +54,7 @@ struct __sysctl_args; struct sysinfo; struct timespec; struct timeval; -struct timex; +struct __kernel_timex; struct timezone; struct tms; struct utimbuf; @@ -695,7 +695,7 @@ asmlinkage long sys_gettimeofday(struct timeval __user *tv, struct timezone __user *tz); asmlinkage long sys_settimeofday(struct timeval __user *tv, struct timezone __user *tz); -asmlinkage long sys_adjtimex(struct timex __user *txc_p); +asmlinkage long sys_adjtimex(struct __kernel_timex __user *txc_p); /* kernel/timer.c */ asmlinkage long sys_getpid(void); @@ -870,7 +870,7 @@ asmlinkage long sys_open_by_handle_at(int mountdirfd, struct file_handle __user *handle, int flags); asmlinkage long sys_clock_adjtime(clockid_t which_clock, - struct timex __user *tx); + struct __kernel_timex __user *tx); asmlinkage long sys_syncfs(int fd); asmlinkage long sys_setns(int fd, int nstype); asmlinkage long sys_sendmmsg(int fd, struct mmsghdr __user *msg, diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c index 2d84b3db1ade..de79f85ae14f 100644 --- a/kernel/time/posix-timers.c +++ b/kernel/time/posix-timers.c @@ -1060,7 +1060,7 @@ int do_clock_adjtime(const clockid_t which_clock, struct __kernel_timex * ktx) } SYSCALL_DEFINE2(clock_adjtime, const clockid_t, which_clock, - struct timex __user *, utx) + struct __kernel_timex __user *, utx) { struct __kernel_timex ktx; int err; diff --git a/kernel/time/time.c b/kernel/time/time.c index d179d33f639a..78b5c8f1495a 100644 --- a/kernel/time/time.c +++ b/kernel/time/time.c @@ -263,7 +263,8 @@ COMPAT_SYSCALL_DEFINE2(settimeofday, struct old_timeval32 __user *, tv, } #endif -SYSCALL_DEFINE1(adjtimex, struct timex __user *, txc_p) +#if !defined(CONFIG_64BIT_TIME) || defined(CONFIG_64BIT) +SYSCALL_DEFINE1(adjtimex, struct __kernel_timex __user *, txc_p) { struct __kernel_timex txc; /* Local copy of parameter */ int ret; @@ -277,6 +278,7 @@ SYSCALL_DEFINE1(adjtimex, struct timex __user *, txc_p) ret = do_adjtimex(&txc); return copy_to_user(txc_p, &txc, sizeof(struct __kernel_timex)) ? -EFAULT : ret; } +#endif #ifdef CONFIG_COMPAT_32BIT_TIME int get_old_timex32(struct __kernel_timex *txc, const struct old_timex32 __user *utp) -- cgit v1.3-7-g2ca7 From 8dabe7245bbc134f2cfcc12cde75c019dab924cc Mon Sep 17 00:00:00 2001 From: Arnd Bergmann Date: Mon, 7 Jan 2019 00:33:08 +0100 Subject: y2038: syscalls: rename y2038 compat syscalls A lot of system calls that pass a time_t somewhere have an implementation using a COMPAT_SYSCALL_DEFINEx() on 64-bit architectures, and have been reworked so that this implementation can now be used on 32-bit architectures as well. The missing step is to redefine them using the regular SYSCALL_DEFINEx() to get them out of the compat namespace and make it possible to build them on 32-bit architectures. Any system call that ends in 'time' gets a '32' suffix on its name for that version, while the others get a '_time32' suffix, to distinguish them from the normal version, which takes a 64-bit time argument in the future. In this step, only 64-bit architectures are changed, doing this rename first lets us avoid touching the 32-bit architectures twice. Acked-by: Catalin Marinas Signed-off-by: Arnd Bergmann --- arch/arm64/include/asm/unistd32.h | 48 ++++++++++---------- arch/mips/kernel/syscalls/syscall_n32.tbl | 50 ++++++++++----------- arch/mips/kernel/syscalls/syscall_o32.tbl | 52 +++++++++++----------- arch/parisc/kernel/syscalls/syscall.tbl | 54 +++++++++++------------ arch/powerpc/kernel/syscalls/syscall.tbl | 52 +++++++++++----------- arch/s390/kernel/syscalls/syscall.tbl | 52 +++++++++++----------- arch/sparc/kernel/syscalls/syscall.tbl | 52 +++++++++++----------- arch/x86/entry/syscalls/syscall_32.tbl | 52 +++++++++++----------- fs/aio.c | 10 ++--- fs/select.c | 4 +- fs/timerfd.c | 4 +- fs/utimes.c | 10 ++--- include/linux/compat.h | 73 ++----------------------------- include/linux/syscalls.h | 57 ++++++++++++++++++++++++ include/uapi/asm-generic/unistd.h | 44 +++++++++---------- ipc/mqueue.c | 16 +++---- ipc/sem.c | 2 +- kernel/futex.c | 2 +- kernel/sched/core.c | 5 +-- kernel/signal.c | 2 +- kernel/sys_ni.c | 18 ++++---- kernel/time/hrtimer.c | 2 +- kernel/time/posix-stubs.c | 25 ++++++----- kernel/time/posix-timers.c | 32 +++++++------- kernel/time/time.c | 8 ++-- net/compat.c | 2 +- 26 files changed, 361 insertions(+), 367 deletions(-) (limited to 'kernel') diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h index d10cce69a4b0..1ded82857161 100644 --- a/arch/arm64/include/asm/unistd32.h +++ b/arch/arm64/include/asm/unistd32.h @@ -270,7 +270,7 @@ __SYSCALL(__NR_uname, sys_newuname) /* 123 was sys_modify_ldt */ __SYSCALL(123, sys_ni_syscall) #define __NR_adjtimex 124 -__SYSCALL(__NR_adjtimex, compat_sys_adjtimex) +__SYSCALL(__NR_adjtimex, sys_adjtimex_time32) #define __NR_mprotect 125 __SYSCALL(__NR_mprotect, sys_mprotect) #define __NR_sigprocmask 126 @@ -344,9 +344,9 @@ __SYSCALL(__NR_sched_get_priority_max, sys_sched_get_priority_max) #define __NR_sched_get_priority_min 160 __SYSCALL(__NR_sched_get_priority_min, sys_sched_get_priority_min) #define __NR_sched_rr_get_interval 161 -__SYSCALL(__NR_sched_rr_get_interval, compat_sys_sched_rr_get_interval) +__SYSCALL(__NR_sched_rr_get_interval, sys_sched_rr_get_interval_time32) #define __NR_nanosleep 162 -__SYSCALL(__NR_nanosleep, compat_sys_nanosleep) +__SYSCALL(__NR_nanosleep, sys_nanosleep_time32) #define __NR_mremap 163 __SYSCALL(__NR_mremap, sys_mremap) #define __NR_setresuid 164 @@ -376,7 +376,7 @@ __SYSCALL(__NR_rt_sigprocmask, compat_sys_rt_sigprocmask) #define __NR_rt_sigpending 176 __SYSCALL(__NR_rt_sigpending, compat_sys_rt_sigpending) #define __NR_rt_sigtimedwait 177 -__SYSCALL(__NR_rt_sigtimedwait, compat_sys_rt_sigtimedwait) +__SYSCALL(__NR_rt_sigtimedwait, compat_sys_rt_sigtimedwait_time32) #define __NR_rt_sigqueueinfo 178 __SYSCALL(__NR_rt_sigqueueinfo, compat_sys_rt_sigqueueinfo) #define __NR_rt_sigsuspend 179 @@ -502,7 +502,7 @@ __SYSCALL(__NR_tkill, sys_tkill) #define __NR_sendfile64 239 __SYSCALL(__NR_sendfile64, sys_sendfile64) #define __NR_futex 240 -__SYSCALL(__NR_futex, compat_sys_futex) +__SYSCALL(__NR_futex, sys_futex_time32) #define __NR_sched_setaffinity 241 __SYSCALL(__NR_sched_setaffinity, compat_sys_sched_setaffinity) #define __NR_sched_getaffinity 242 @@ -512,7 +512,7 @@ __SYSCALL(__NR_io_setup, compat_sys_io_setup) #define __NR_io_destroy 244 __SYSCALL(__NR_io_destroy, sys_io_destroy) #define __NR_io_getevents 245 -__SYSCALL(__NR_io_getevents, compat_sys_io_getevents) +__SYSCALL(__NR_io_getevents, sys_io_getevents_time32) #define __NR_io_submit 246 __SYSCALL(__NR_io_submit, compat_sys_io_submit) #define __NR_io_cancel 247 @@ -538,21 +538,21 @@ __SYSCALL(__NR_set_tid_address, sys_set_tid_address) #define __NR_timer_create 257 __SYSCALL(__NR_timer_create, compat_sys_timer_create) #define __NR_timer_settime 258 -__SYSCALL(__NR_timer_settime, compat_sys_timer_settime) +__SYSCALL(__NR_timer_settime, sys_timer_settime32) #define __NR_timer_gettime 259 -__SYSCALL(__NR_timer_gettime, compat_sys_timer_gettime) +__SYSCALL(__NR_timer_gettime, sys_timer_gettime32) #define __NR_timer_getoverrun 260 __SYSCALL(__NR_timer_getoverrun, sys_timer_getoverrun) #define __NR_timer_delete 261 __SYSCALL(__NR_timer_delete, sys_timer_delete) #define __NR_clock_settime 262 -__SYSCALL(__NR_clock_settime, compat_sys_clock_settime) +__SYSCALL(__NR_clock_settime, sys_clock_settime32) #define __NR_clock_gettime 263 -__SYSCALL(__NR_clock_gettime, compat_sys_clock_gettime) +__SYSCALL(__NR_clock_gettime, sys_clock_gettime32) #define __NR_clock_getres 264 -__SYSCALL(__NR_clock_getres, compat_sys_clock_getres) +__SYSCALL(__NR_clock_getres, sys_clock_getres_time32) #define __NR_clock_nanosleep 265 -__SYSCALL(__NR_clock_nanosleep, compat_sys_clock_nanosleep) +__SYSCALL(__NR_clock_nanosleep, sys_clock_nanosleep_time32) #define __NR_statfs64 266 __SYSCALL(__NR_statfs64, compat_sys_aarch32_statfs64) #define __NR_fstatfs64 267 @@ -560,7 +560,7 @@ __SYSCALL(__NR_fstatfs64, compat_sys_aarch32_fstatfs64) #define __NR_tgkill 268 __SYSCALL(__NR_tgkill, sys_tgkill) #define __NR_utimes 269 -__SYSCALL(__NR_utimes, compat_sys_utimes) +__SYSCALL(__NR_utimes, sys_utimes_time32) #define __NR_arm_fadvise64_64 270 __SYSCALL(__NR_arm_fadvise64_64, compat_sys_aarch32_fadvise64_64) #define __NR_pciconfig_iobase 271 @@ -574,9 +574,9 @@ __SYSCALL(__NR_mq_open, compat_sys_mq_open) #define __NR_mq_unlink 275 __SYSCALL(__NR_mq_unlink, sys_mq_unlink) #define __NR_mq_timedsend 276 -__SYSCALL(__NR_mq_timedsend, compat_sys_mq_timedsend) +__SYSCALL(__NR_mq_timedsend, sys_mq_timedsend_time32) #define __NR_mq_timedreceive 277 -__SYSCALL(__NR_mq_timedreceive, compat_sys_mq_timedreceive) +__SYSCALL(__NR_mq_timedreceive, sys_mq_timedreceive_time32) #define __NR_mq_notify 278 __SYSCALL(__NR_mq_notify, compat_sys_mq_notify) #define __NR_mq_getsetattr 279 @@ -646,7 +646,7 @@ __SYSCALL(__NR_request_key, sys_request_key) #define __NR_keyctl 311 __SYSCALL(__NR_keyctl, compat_sys_keyctl) #define __NR_semtimedop 312 -__SYSCALL(__NR_semtimedop, compat_sys_semtimedop) +__SYSCALL(__NR_semtimedop, sys_semtimedop_time32) #define __NR_vserver 313 __SYSCALL(__NR_vserver, sys_ni_syscall) #define __NR_ioprio_set 314 @@ -674,7 +674,7 @@ __SYSCALL(__NR_mknodat, sys_mknodat) #define __NR_fchownat 325 __SYSCALL(__NR_fchownat, sys_fchownat) #define __NR_futimesat 326 -__SYSCALL(__NR_futimesat, compat_sys_futimesat) +__SYSCALL(__NR_futimesat, sys_futimesat_time32) #define __NR_fstatat64 327 __SYSCALL(__NR_fstatat64, sys_fstatat64) #define __NR_unlinkat 328 @@ -692,9 +692,9 @@ __SYSCALL(__NR_fchmodat, sys_fchmodat) #define __NR_faccessat 334 __SYSCALL(__NR_faccessat, sys_faccessat) #define __NR_pselect6 335 -__SYSCALL(__NR_pselect6, compat_sys_pselect6) +__SYSCALL(__NR_pselect6, compat_sys_pselect6_time32) #define __NR_ppoll 336 -__SYSCALL(__NR_ppoll, compat_sys_ppoll) +__SYSCALL(__NR_ppoll, compat_sys_ppoll_time32) #define __NR_unshare 337 __SYSCALL(__NR_unshare, sys_unshare) #define __NR_set_robust_list 338 @@ -718,7 +718,7 @@ __SYSCALL(__NR_epoll_pwait, compat_sys_epoll_pwait) #define __NR_kexec_load 347 __SYSCALL(__NR_kexec_load, compat_sys_kexec_load) #define __NR_utimensat 348 -__SYSCALL(__NR_utimensat, compat_sys_utimensat) +__SYSCALL(__NR_utimensat, sys_utimensat_time32) #define __NR_signalfd 349 __SYSCALL(__NR_signalfd, compat_sys_signalfd) #define __NR_timerfd_create 350 @@ -728,9 +728,9 @@ __SYSCALL(__NR_eventfd, sys_eventfd) #define __NR_fallocate 352 __SYSCALL(__NR_fallocate, compat_sys_aarch32_fallocate) #define __NR_timerfd_settime 353 -__SYSCALL(__NR_timerfd_settime, compat_sys_timerfd_settime) +__SYSCALL(__NR_timerfd_settime, sys_timerfd_settime32) #define __NR_timerfd_gettime 354 -__SYSCALL(__NR_timerfd_gettime, compat_sys_timerfd_gettime) +__SYSCALL(__NR_timerfd_gettime, sys_timerfd_gettime32) #define __NR_signalfd4 355 __SYSCALL(__NR_signalfd4, compat_sys_signalfd4) #define __NR_eventfd2 356 @@ -752,7 +752,7 @@ __SYSCALL(__NR_rt_tgsigqueueinfo, compat_sys_rt_tgsigqueueinfo) #define __NR_perf_event_open 364 __SYSCALL(__NR_perf_event_open, sys_perf_event_open) #define __NR_recvmmsg 365 -__SYSCALL(__NR_recvmmsg, compat_sys_recvmmsg) +__SYSCALL(__NR_recvmmsg, compat_sys_recvmmsg_time32) #define __NR_accept4 366 __SYSCALL(__NR_accept4, sys_accept4) #define __NR_fanotify_init 367 @@ -766,7 +766,7 @@ __SYSCALL(__NR_name_to_handle_at, sys_name_to_handle_at) #define __NR_open_by_handle_at 371 __SYSCALL(__NR_open_by_handle_at, compat_sys_open_by_handle_at) #define __NR_clock_adjtime 372 -__SYSCALL(__NR_clock_adjtime, compat_sys_clock_adjtime) +__SYSCALL(__NR_clock_adjtime, sys_clock_adjtime32) #define __NR_syncfs 373 __SYSCALL(__NR_syncfs, sys_syncfs) #define __NR_sendmmsg 374 diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl index cc134b1211aa..6d1e019817c8 100644 --- a/arch/mips/kernel/syscalls/syscall_n32.tbl +++ b/arch/mips/kernel/syscalls/syscall_n32.tbl @@ -41,7 +41,7 @@ 31 n32 dup sys_dup 32 n32 dup2 sys_dup2 33 n32 pause sys_pause -34 n32 nanosleep compat_sys_nanosleep +34 n32 nanosleep sys_nanosleep_time32 35 n32 getitimer compat_sys_getitimer 36 n32 setitimer compat_sys_setitimer 37 n32 alarm sys_alarm @@ -133,11 +133,11 @@ 123 n32 capget sys_capget 124 n32 capset sys_capset 125 n32 rt_sigpending compat_sys_rt_sigpending -126 n32 rt_sigtimedwait compat_sys_rt_sigtimedwait +126 n32 rt_sigtimedwait compat_sys_rt_sigtimedwait_time32 127 n32 rt_sigqueueinfo compat_sys_rt_sigqueueinfo 128 n32 rt_sigsuspend compat_sys_rt_sigsuspend 129 n32 sigaltstack compat_sys_sigaltstack -130 n32 utime compat_sys_utime +130 n32 utime sys_utime32 131 n32 mknod sys_mknod 132 n32 personality sys_32_personality 133 n32 ustat compat_sys_ustat @@ -152,7 +152,7 @@ 142 n32 sched_getscheduler sys_sched_getscheduler 143 n32 sched_get_priority_max sys_sched_get_priority_max 144 n32 sched_get_priority_min sys_sched_get_priority_min -145 n32 sched_rr_get_interval compat_sys_sched_rr_get_interval +145 n32 sched_rr_get_interval sys_sched_rr_get_interval_time32 146 n32 mlock sys_mlock 147 n32 munlock sys_munlock 148 n32 mlockall sys_mlockall @@ -161,7 +161,7 @@ 151 n32 pivot_root sys_pivot_root 152 n32 _sysctl compat_sys_sysctl 153 n32 prctl sys_prctl -154 n32 adjtimex compat_sys_adjtimex +154 n32 adjtimex sys_adjtimex_time32 155 n32 setrlimit compat_sys_setrlimit 156 n32 chroot sys_chroot 157 n32 sync sys_sync @@ -202,7 +202,7 @@ 191 n32 fremovexattr sys_fremovexattr 192 n32 tkill sys_tkill 193 n32 reserved193 sys_ni_syscall -194 n32 futex compat_sys_futex +194 n32 futex sys_futex_time32 195 n32 sched_setaffinity compat_sys_sched_setaffinity 196 n32 sched_getaffinity compat_sys_sched_getaffinity 197 n32 cacheflush sys_cacheflush @@ -210,7 +210,7 @@ 199 n32 sysmips __sys_sysmips 200 n32 io_setup compat_sys_io_setup 201 n32 io_destroy sys_io_destroy -202 n32 io_getevents compat_sys_io_getevents +202 n32 io_getevents sys_io_getevents_time32 203 n32 io_submit compat_sys_io_submit 204 n32 io_cancel sys_io_cancel 205 n32 exit_group sys_exit_group @@ -223,29 +223,29 @@ 212 n32 fcntl64 compat_sys_fcntl64 213 n32 set_tid_address sys_set_tid_address 214 n32 restart_syscall sys_restart_syscall -215 n32 semtimedop compat_sys_semtimedop +215 n32 semtimedop sys_semtimedop_time32 216 n32 fadvise64 sys_fadvise64_64 217 n32 statfs64 compat_sys_statfs64 218 n32 fstatfs64 compat_sys_fstatfs64 219 n32 sendfile64 sys_sendfile64 220 n32 timer_create compat_sys_timer_create -221 n32 timer_settime compat_sys_timer_settime -222 n32 timer_gettime compat_sys_timer_gettime +221 n32 timer_settime sys_timer_settime32 +222 n32 timer_gettime sys_timer_gettime32 223 n32 timer_getoverrun sys_timer_getoverrun 224 n32 timer_delete sys_timer_delete -225 n32 clock_settime compat_sys_clock_settime -226 n32 clock_gettime compat_sys_clock_gettime -227 n32 clock_getres compat_sys_clock_getres -228 n32 clock_nanosleep compat_sys_clock_nanosleep +225 n32 clock_settime sys_clock_settime32 +226 n32 clock_gettime sys_clock_gettime32 +227 n32 clock_getres sys_clock_getres_time32 +228 n32 clock_nanosleep sys_clock_nanosleep_time32 229 n32 tgkill sys_tgkill -230 n32 utimes compat_sys_utimes +230 n32 utimes sys_utimes_time32 231 n32 mbind compat_sys_mbind 232 n32 get_mempolicy compat_sys_get_mempolicy 233 n32 set_mempolicy compat_sys_set_mempolicy 234 n32 mq_open compat_sys_mq_open 235 n32 mq_unlink sys_mq_unlink -236 n32 mq_timedsend compat_sys_mq_timedsend -237 n32 mq_timedreceive compat_sys_mq_timedreceive +236 n32 mq_timedsend sys_mq_timedsend_time32 +237 n32 mq_timedreceive sys_mq_timedreceive_time32 238 n32 mq_notify compat_sys_mq_notify 239 n32 mq_getsetattr compat_sys_mq_getsetattr 240 n32 vserver sys_ni_syscall @@ -263,7 +263,7 @@ 252 n32 mkdirat sys_mkdirat 253 n32 mknodat sys_mknodat 254 n32 fchownat sys_fchownat -255 n32 futimesat compat_sys_futimesat +255 n32 futimesat sys_futimesat_time32 256 n32 newfstatat sys_newfstatat 257 n32 unlinkat sys_unlinkat 258 n32 renameat sys_renameat @@ -272,8 +272,8 @@ 261 n32 readlinkat sys_readlinkat 262 n32 fchmodat sys_fchmodat 263 n32 faccessat sys_faccessat -264 n32 pselect6 compat_sys_pselect6 -265 n32 ppoll compat_sys_ppoll +264 n32 pselect6 compat_sys_pselect6_time32 +265 n32 ppoll compat_sys_ppoll_time32 266 n32 unshare sys_unshare 267 n32 splice sys_splice 268 n32 sync_file_range sys_sync_file_range @@ -287,14 +287,14 @@ 276 n32 epoll_pwait compat_sys_epoll_pwait 277 n32 ioprio_set sys_ioprio_set 278 n32 ioprio_get sys_ioprio_get -279 n32 utimensat compat_sys_utimensat +279 n32 utimensat sys_utimensat_time32 280 n32 signalfd compat_sys_signalfd 281 n32 timerfd sys_ni_syscall 282 n32 eventfd sys_eventfd 283 n32 fallocate sys_fallocate 284 n32 timerfd_create sys_timerfd_create -285 n32 timerfd_gettime compat_sys_timerfd_gettime -286 n32 timerfd_settime compat_sys_timerfd_settime +285 n32 timerfd_gettime sys_timerfd_gettime32 +286 n32 timerfd_settime sys_timerfd_settime32 287 n32 signalfd4 compat_sys_signalfd4 288 n32 eventfd2 sys_eventfd2 289 n32 epoll_create1 sys_epoll_create1 @@ -306,14 +306,14 @@ 295 n32 rt_tgsigqueueinfo compat_sys_rt_tgsigqueueinfo 296 n32 perf_event_open sys_perf_event_open 297 n32 accept4 sys_accept4 -298 n32 recvmmsg compat_sys_recvmmsg +298 n32 recvmmsg compat_sys_recvmmsg_time32 299 n32 getdents64 sys_getdents64 300 n32 fanotify_init sys_fanotify_init 301 n32 fanotify_mark sys_fanotify_mark 302 n32 prlimit64 sys_prlimit64 303 n32 name_to_handle_at sys_name_to_handle_at 304 n32 open_by_handle_at sys_open_by_handle_at -305 n32 clock_adjtime compat_sys_clock_adjtime +305 n32 clock_adjtime sys_clock_adjtime32 306 n32 syncfs sys_syncfs 307 n32 sendmmsg compat_sys_sendmmsg 308 n32 setns sys_setns diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl index fa47ea8cc6ef..e9fec7bac5a9 100644 --- a/arch/mips/kernel/syscalls/syscall_o32.tbl +++ b/arch/mips/kernel/syscalls/syscall_o32.tbl @@ -20,7 +20,7 @@ 10 o32 unlink sys_unlink 11 o32 execve sys_execve compat_sys_execve 12 o32 chdir sys_chdir -13 o32 time sys_time compat_sys_time +13 o32 time sys_time sys_time32 14 o32 mknod sys_mknod 15 o32 chmod sys_chmod 16 o32 lchown sys_lchown @@ -33,13 +33,13 @@ 22 o32 umount sys_oldumount 23 o32 setuid sys_setuid 24 o32 getuid sys_getuid -25 o32 stime sys_stime compat_sys_stime +25 o32 stime sys_stime sys_stime32 26 o32 ptrace sys_ptrace compat_sys_ptrace 27 o32 alarm sys_alarm # 28 was sys_fstat 28 o32 unused28 sys_ni_syscall 29 o32 pause sys_pause -30 o32 utime sys_utime compat_sys_utime +30 o32 utime sys_utime sys_utime32 31 o32 stty sys_ni_syscall 32 o32 gtty sys_ni_syscall 33 o32 access sys_access @@ -135,7 +135,7 @@ 121 o32 setdomainname sys_setdomainname 122 o32 uname sys_newuname 123 o32 modify_ldt sys_ni_syscall -124 o32 adjtimex sys_adjtimex compat_sys_adjtimex +124 o32 adjtimex sys_adjtimex sys_adjtimex_time32 125 o32 mprotect sys_mprotect 126 o32 sigprocmask sys_sigprocmask compat_sys_sigprocmask 127 o32 create_module sys_ni_syscall @@ -176,8 +176,8 @@ 162 o32 sched_yield sys_sched_yield 163 o32 sched_get_priority_max sys_sched_get_priority_max 164 o32 sched_get_priority_min sys_sched_get_priority_min -165 o32 sched_rr_get_interval sys_sched_rr_get_interval compat_sys_sched_rr_get_interval -166 o32 nanosleep sys_nanosleep compat_sys_nanosleep +165 o32 sched_rr_get_interval sys_sched_rr_get_interval sys_sched_rr_get_interval_time32 +166 o32 nanosleep sys_nanosleep sys_nanosleep_time32 167 o32 mremap sys_mremap 168 o32 accept sys_accept 169 o32 bind sys_bind @@ -208,7 +208,7 @@ 194 o32 rt_sigaction sys_rt_sigaction compat_sys_rt_sigaction 195 o32 rt_sigprocmask sys_rt_sigprocmask compat_sys_rt_sigprocmask 196 o32 rt_sigpending sys_rt_sigpending compat_sys_rt_sigpending -197 o32 rt_sigtimedwait sys_rt_sigtimedwait compat_sys_rt_sigtimedwait +197 o32 rt_sigtimedwait sys_rt_sigtimedwait compat_sys_rt_sigtimedwait_time32 198 o32 rt_sigqueueinfo sys_rt_sigqueueinfo compat_sys_rt_sigqueueinfo 199 o32 rt_sigsuspend sys_rt_sigsuspend compat_sys_rt_sigsuspend 200 o32 pread64 sys_pread64 sys_32_pread @@ -249,12 +249,12 @@ 235 o32 fremovexattr sys_fremovexattr 236 o32 tkill sys_tkill 237 o32 sendfile64 sys_sendfile64 -238 o32 futex sys_futex compat_sys_futex +238 o32 futex sys_futex sys_futex_time32 239 o32 sched_setaffinity sys_sched_setaffinity compat_sys_sched_setaffinity 240 o32 sched_getaffinity sys_sched_getaffinity compat_sys_sched_getaffinity 241 o32 io_setup sys_io_setup compat_sys_io_setup 242 o32 io_destroy sys_io_destroy -243 o32 io_getevents sys_io_getevents compat_sys_io_getevents +243 o32 io_getevents sys_io_getevents sys_io_getevents_time32 244 o32 io_submit sys_io_submit compat_sys_io_submit 245 o32 io_cancel sys_io_cancel 246 o32 exit_group sys_exit_group @@ -269,23 +269,23 @@ 255 o32 statfs64 sys_statfs64 compat_sys_statfs64 256 o32 fstatfs64 sys_fstatfs64 compat_sys_fstatfs64 257 o32 timer_create sys_timer_create compat_sys_timer_create -258 o32 timer_settime sys_timer_settime compat_sys_timer_settime -259 o32 timer_gettime sys_timer_gettime compat_sys_timer_gettime +258 o32 timer_settime sys_timer_settime sys_timer_settime32 +259 o32 timer_gettime sys_timer_gettime sys_timer_gettime32 260 o32 timer_getoverrun sys_timer_getoverrun 261 o32 timer_delete sys_timer_delete -262 o32 clock_settime sys_clock_settime compat_sys_clock_settime -263 o32 clock_gettime sys_clock_gettime compat_sys_clock_gettime -264 o32 clock_getres sys_clock_getres compat_sys_clock_getres -265 o32 clock_nanosleep sys_clock_nanosleep compat_sys_clock_nanosleep +262 o32 clock_settime sys_clock_settime sys_clock_settime32 +263 o32 clock_gettime sys_clock_gettime sys_clock_gettime32 +264 o32 clock_getres sys_clock_getres sys_clock_getres_time32 +265 o32 clock_nanosleep sys_clock_nanosleep sys_clock_nanosleep_time32 266 o32 tgkill sys_tgkill -267 o32 utimes sys_utimes compat_sys_utimes +267 o32 utimes sys_utimes sys_utimes_time32 268 o32 mbind sys_mbind compat_sys_mbind 269 o32 get_mempolicy sys_get_mempolicy compat_sys_get_mempolicy 270 o32 set_mempolicy sys_set_mempolicy compat_sys_set_mempolicy 271 o32 mq_open sys_mq_open compat_sys_mq_open 272 o32 mq_unlink sys_mq_unlink -273 o32 mq_timedsend sys_mq_timedsend compat_sys_mq_timedsend -274 o32 mq_timedreceive sys_mq_timedreceive compat_sys_mq_timedreceive +273 o32 mq_timedsend sys_mq_timedsend sys_mq_timedsend_time32 +274 o32 mq_timedreceive sys_mq_timedreceive sys_mq_timedreceive_time32 275 o32 mq_notify sys_mq_notify compat_sys_mq_notify 276 o32 mq_getsetattr sys_mq_getsetattr compat_sys_mq_getsetattr 277 o32 vserver sys_ni_syscall @@ -303,7 +303,7 @@ 289 o32 mkdirat sys_mkdirat 290 o32 mknodat sys_mknodat 291 o32 fchownat sys_fchownat -292 o32 futimesat sys_futimesat compat_sys_futimesat +292 o32 futimesat sys_futimesat sys_futimesat_time32 293 o32 fstatat64 sys_fstatat64 sys_newfstatat 294 o32 unlinkat sys_unlinkat 295 o32 renameat sys_renameat @@ -312,8 +312,8 @@ 298 o32 readlinkat sys_readlinkat 299 o32 fchmodat sys_fchmodat 300 o32 faccessat sys_faccessat -301 o32 pselect6 sys_pselect6 compat_sys_pselect6 -302 o32 ppoll sys_ppoll compat_sys_ppoll +301 o32 pselect6 sys_pselect6 compat_sys_pselect6_time32 +302 o32 ppoll sys_ppoll compat_sys_ppoll_time32 303 o32 unshare sys_unshare 304 o32 splice sys_splice 305 o32 sync_file_range sys_sync_file_range sys32_sync_file_range @@ -327,14 +327,14 @@ 313 o32 epoll_pwait sys_epoll_pwait compat_sys_epoll_pwait 314 o32 ioprio_set sys_ioprio_set 315 o32 ioprio_get sys_ioprio_get -316 o32 utimensat sys_utimensat compat_sys_utimensat +316 o32 utimensat sys_utimensat sys_utimensat_time32 317 o32 signalfd sys_signalfd compat_sys_signalfd 318 o32 timerfd sys_ni_syscall 319 o32 eventfd sys_eventfd 320 o32 fallocate sys_fallocate sys32_fallocate 321 o32 timerfd_create sys_timerfd_create -322 o32 timerfd_gettime sys_timerfd_gettime compat_sys_timerfd_gettime -323 o32 timerfd_settime sys_timerfd_settime compat_sys_timerfd_settime +322 o32 timerfd_gettime sys_timerfd_gettime sys_timerfd_gettime32 +323 o32 timerfd_settime sys_timerfd_settime sys_timerfd_settime32 324 o32 signalfd4 sys_signalfd4 compat_sys_signalfd4 325 o32 eventfd2 sys_eventfd2 326 o32 epoll_create1 sys_epoll_create1 @@ -346,13 +346,13 @@ 332 o32 rt_tgsigqueueinfo sys_rt_tgsigqueueinfo compat_sys_rt_tgsigqueueinfo 333 o32 perf_event_open sys_perf_event_open 334 o32 accept4 sys_accept4 -335 o32 recvmmsg sys_recvmmsg compat_sys_recvmmsg +335 o32 recvmmsg sys_recvmmsg compat_sys_recvmmsg_time32 336 o32 fanotify_init sys_fanotify_init 337 o32 fanotify_mark sys_fanotify_mark compat_sys_fanotify_mark 338 o32 prlimit64 sys_prlimit64 339 o32 name_to_handle_at sys_name_to_handle_at 340 o32 open_by_handle_at sys_open_by_handle_at compat_sys_open_by_handle_at -341 o32 clock_adjtime sys_clock_adjtime compat_sys_clock_adjtime +341 o32 clock_adjtime sys_clock_adjtime sys_clock_adjtime32 342 o32 syncfs sys_syncfs 343 o32 sendmmsg sys_sendmmsg compat_sys_sendmmsg 344 o32 setns sys_setns diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl index 71873bb72782..f7440427d459 100644 --- a/arch/parisc/kernel/syscalls/syscall.tbl +++ b/arch/parisc/kernel/syscalls/syscall.tbl @@ -20,7 +20,7 @@ 10 common unlink sys_unlink 11 common execve sys_execve compat_sys_execve 12 common chdir sys_chdir -13 common time sys_time compat_sys_time +13 common time sys_time sys_time32 14 common mknod sys_mknod 15 common chmod sys_chmod 16 common lchown sys_lchown @@ -32,12 +32,12 @@ 22 common bind sys_bind 23 common setuid sys_setuid 24 common getuid sys_getuid -25 common stime sys_stime compat_sys_stime +25 common stime sys_stime sys_stime32 26 common ptrace sys_ptrace compat_sys_ptrace 27 common alarm sys_alarm 28 common fstat sys_newfstat compat_sys_newfstat 29 common pause sys_pause -30 common utime sys_utime compat_sys_utime +30 common utime sys_utime sys_utime32 31 common connect sys_connect 32 common listen sys_listen 33 common access sys_access @@ -133,7 +133,7 @@ 121 common setdomainname sys_setdomainname 122 common sendfile sys_sendfile compat_sys_sendfile 123 common recvfrom sys_recvfrom -124 common adjtimex sys_adjtimex compat_sys_adjtimex +124 common adjtimex sys_adjtimex sys_adjtimex_time32 125 common mprotect sys_mprotect 126 common sigprocmask sys_sigprocmask compat_sys_sigprocmask # 127 was create_module @@ -171,8 +171,8 @@ 158 common sched_yield sys_sched_yield 159 common sched_get_priority_max sys_sched_get_priority_max 160 common sched_get_priority_min sys_sched_get_priority_min -161 common sched_rr_get_interval sys_sched_rr_get_interval compat_sys_sched_rr_get_interval -162 common nanosleep sys_nanosleep compat_sys_nanosleep +161 common sched_rr_get_interval sys_sched_rr_get_interval sys_sched_rr_get_interval_time32 +162 common nanosleep sys_nanosleep sys_nanosleep_time32 163 common mremap sys_mremap 164 common setresuid sys_setresuid 165 common getresuid sys_getresuid @@ -187,7 +187,7 @@ 174 common rt_sigaction sys_rt_sigaction compat_sys_rt_sigaction 175 common rt_sigprocmask sys_rt_sigprocmask compat_sys_rt_sigprocmask 176 common rt_sigpending sys_rt_sigpending compat_sys_rt_sigpending -177 common rt_sigtimedwait sys_rt_sigtimedwait compat_sys_rt_sigtimedwait +177 common rt_sigtimedwait sys_rt_sigtimedwait compat_sys_rt_sigtimedwait_time32 178 common rt_sigqueueinfo sys_rt_sigqueueinfo compat_sys_rt_sigqueueinfo 179 common rt_sigsuspend sys_rt_sigsuspend compat_sys_rt_sigsuspend 180 common chown sys_chown @@ -223,14 +223,14 @@ 207 64 readahead sys_readahead 208 common tkill sys_tkill 209 common sendfile64 sys_sendfile64 compat_sys_sendfile64 -210 common futex sys_futex compat_sys_futex +210 common futex sys_futex sys_futex_time32 211 common sched_setaffinity sys_sched_setaffinity compat_sys_sched_setaffinity 212 common sched_getaffinity sys_sched_getaffinity compat_sys_sched_getaffinity # 213 was set_thread_area # 214 was get_thread_area 215 common io_setup sys_io_setup compat_sys_io_setup 216 common io_destroy sys_io_destroy -217 common io_getevents sys_io_getevents compat_sys_io_getevents +217 common io_getevents sys_io_getevents sys_io_getevents_time32 218 common io_submit sys_io_submit compat_sys_io_submit 219 common io_cancel sys_io_cancel # 220 was alloc_hugepages @@ -241,11 +241,11 @@ 225 common epoll_ctl sys_epoll_ctl 226 common epoll_wait sys_epoll_wait 227 common remap_file_pages sys_remap_file_pages -228 common semtimedop sys_semtimedop compat_sys_semtimedop +228 common semtimedop sys_semtimedop sys_semtimedop_time32 229 common mq_open sys_mq_open compat_sys_mq_open 230 common mq_unlink sys_mq_unlink -231 common mq_timedsend sys_mq_timedsend compat_sys_mq_timedsend -232 common mq_timedreceive sys_mq_timedreceive compat_sys_mq_timedreceive +231 common mq_timedsend sys_mq_timedsend sys_mq_timedsend_time32 +232 common mq_timedreceive sys_mq_timedreceive sys_mq_timedreceive_time32 233 common mq_notify sys_mq_notify compat_sys_mq_notify 234 common mq_getsetattr sys_mq_getsetattr compat_sys_mq_getsetattr 235 common waitid sys_waitid compat_sys_waitid @@ -265,14 +265,14 @@ 248 common lremovexattr sys_lremovexattr 249 common fremovexattr sys_fremovexattr 250 common timer_create sys_timer_create compat_sys_timer_create -251 common timer_settime sys_timer_settime compat_sys_timer_settime -252 common timer_gettime sys_timer_gettime compat_sys_timer_gettime +251 common timer_settime sys_timer_settime sys_timer_settime32 +252 common timer_gettime sys_timer_gettime sys_timer_gettime32 253 common timer_getoverrun sys_timer_getoverrun 254 common timer_delete sys_timer_delete -255 common clock_settime sys_clock_settime compat_sys_clock_settime -256 common clock_gettime sys_clock_gettime compat_sys_clock_gettime -257 common clock_getres sys_clock_getres compat_sys_clock_getres -258 common clock_nanosleep sys_clock_nanosleep compat_sys_clock_nanosleep +255 common clock_settime sys_clock_settime sys_clock_settime32 +256 common clock_gettime sys_clock_gettime sys_clock_gettime32 +257 common clock_getres sys_clock_getres sys_clock_getres_time32 +258 common clock_nanosleep sys_clock_nanosleep sys_clock_nanosleep_time32 259 common tgkill sys_tgkill 260 common mbind sys_mbind compat_sys_mbind 261 common get_mempolicy sys_get_mempolicy compat_sys_get_mempolicy @@ -287,13 +287,13 @@ 270 common inotify_add_watch sys_inotify_add_watch 271 common inotify_rm_watch sys_inotify_rm_watch 272 common migrate_pages sys_migrate_pages -273 common pselect6 sys_pselect6 compat_sys_pselect6 -274 common ppoll sys_ppoll compat_sys_ppoll +273 common pselect6 sys_pselect6 compat_sys_pselect6_time32 +274 common ppoll sys_ppoll compat_sys_ppoll_time32 275 common openat sys_openat compat_sys_openat 276 common mkdirat sys_mkdirat 277 common mknodat sys_mknodat 278 common fchownat sys_fchownat -279 common futimesat sys_futimesat compat_sys_futimesat +279 common futimesat sys_futimesat sys_futimesat_time32 280 common fstatat64 sys_fstatat64 281 common unlinkat sys_unlinkat 282 common renameat sys_renameat @@ -316,15 +316,15 @@ 298 common statfs64 sys_statfs64 compat_sys_statfs64 299 common fstatfs64 sys_fstatfs64 compat_sys_fstatfs64 300 common kexec_load sys_kexec_load compat_sys_kexec_load -301 common utimensat sys_utimensat compat_sys_utimensat +301 common utimensat sys_utimensat sys_utimensat_time32 302 common signalfd sys_signalfd compat_sys_signalfd # 303 was timerfd 304 common eventfd sys_eventfd 305 32 fallocate parisc_fallocate 305 64 fallocate sys_fallocate 306 common timerfd_create sys_timerfd_create -307 common timerfd_settime sys_timerfd_settime compat_sys_timerfd_settime -308 common timerfd_gettime sys_timerfd_gettime compat_sys_timerfd_gettime +307 common timerfd_settime sys_timerfd_settime sys_timerfd_settime32 +308 common timerfd_gettime sys_timerfd_gettime sys_timerfd_gettime32 309 common signalfd4 sys_signalfd4 compat_sys_signalfd4 310 common eventfd2 sys_eventfd2 311 common epoll_create1 sys_epoll_create1 @@ -335,12 +335,12 @@ 316 common pwritev sys_pwritev compat_sys_pwritev 317 common rt_tgsigqueueinfo sys_rt_tgsigqueueinfo compat_sys_rt_tgsigqueueinfo 318 common perf_event_open sys_perf_event_open -319 common recvmmsg sys_recvmmsg compat_sys_recvmmsg +319 common recvmmsg sys_recvmmsg compat_sys_recvmmsg_time32 320 common accept4 sys_accept4 321 common prlimit64 sys_prlimit64 322 common fanotify_init sys_fanotify_init 323 common fanotify_mark sys_fanotify_mark sys32_fanotify_mark -324 common clock_adjtime sys_clock_adjtime compat_sys_clock_adjtime +324 common clock_adjtime sys_clock_adjtime sys_clock_adjtime32 325 common name_to_handle_at sys_name_to_handle_at 326 common open_by_handle_at sys_open_by_handle_at compat_sys_open_by_handle_at 327 common syncfs sys_syncfs @@ -352,7 +352,7 @@ 333 common finit_module sys_finit_module 334 common sched_setattr sys_sched_setattr 335 common sched_getattr sys_sched_getattr -336 common utimes sys_utimes compat_sys_utimes +336 common utimes sys_utimes sys_utimes_time32 337 common renameat2 sys_renameat2 338 common seccomp sys_seccomp 339 common getrandom sys_getrandom diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl index 7555874ce39c..86650dcd2185 100644 --- a/arch/powerpc/kernel/syscalls/syscall.tbl +++ b/arch/powerpc/kernel/syscalls/syscall.tbl @@ -20,7 +20,7 @@ 10 common unlink sys_unlink 11 nospu execve sys_execve compat_sys_execve 12 common chdir sys_chdir -13 common time sys_time compat_sys_time +13 common time sys_time sys_time32 14 common mknod sys_mknod 15 common chmod sys_chmod 16 common lchown sys_lchown @@ -36,14 +36,14 @@ 22 spu umount sys_ni_syscall 23 common setuid sys_setuid 24 common getuid sys_getuid -25 common stime sys_stime compat_sys_stime +25 common stime sys_stime sys_stime32 26 nospu ptrace sys_ptrace compat_sys_ptrace 27 common alarm sys_alarm 28 32 oldfstat sys_fstat sys_ni_syscall 28 64 oldfstat sys_ni_syscall 28 spu oldfstat sys_ni_syscall 29 nospu pause sys_pause -30 nospu utime sys_utime compat_sys_utime +30 nospu utime sys_utime sys_utime32 31 common stty sys_ni_syscall 32 common gtty sys_ni_syscall 33 common access sys_access @@ -157,7 +157,7 @@ 121 common setdomainname sys_setdomainname 122 common uname sys_newuname 123 common modify_ldt sys_ni_syscall -124 common adjtimex sys_adjtimex compat_sys_adjtimex +124 common adjtimex sys_adjtimex sys_adjtimex_time32 125 common mprotect sys_mprotect 126 32 sigprocmask sys_sigprocmask compat_sys_sigprocmask 126 64 sigprocmask sys_ni_syscall @@ -198,8 +198,8 @@ 158 common sched_yield sys_sched_yield 159 common sched_get_priority_max sys_sched_get_priority_max 160 common sched_get_priority_min sys_sched_get_priority_min -161 common sched_rr_get_interval sys_sched_rr_get_interval compat_sys_sched_rr_get_interval -162 common nanosleep sys_nanosleep compat_sys_nanosleep +161 common sched_rr_get_interval sys_sched_rr_get_interval sys_sched_rr_get_interval_time32 +162 common nanosleep sys_nanosleep sys_nanosleep_time32 163 common mremap sys_mremap 164 common setresuid sys_setresuid 165 common getresuid sys_getresuid @@ -213,7 +213,7 @@ 173 nospu rt_sigaction sys_rt_sigaction compat_sys_rt_sigaction 174 nospu rt_sigprocmask sys_rt_sigprocmask compat_sys_rt_sigprocmask 175 nospu rt_sigpending sys_rt_sigpending compat_sys_rt_sigpending -176 nospu rt_sigtimedwait sys_rt_sigtimedwait compat_sys_rt_sigtimedwait +176 nospu rt_sigtimedwait sys_rt_sigtimedwait compat_sys_rt_sigtimedwait_time32 177 nospu rt_sigqueueinfo sys_rt_sigqueueinfo compat_sys_rt_sigqueueinfo 178 nospu rt_sigsuspend sys_rt_sigsuspend compat_sys_rt_sigsuspend 179 common pread64 sys_pread64 compat_sys_pread64 @@ -260,7 +260,7 @@ 218 common removexattr sys_removexattr 219 common lremovexattr sys_lremovexattr 220 common fremovexattr sys_fremovexattr -221 common futex sys_futex compat_sys_futex +221 common futex sys_futex sys_futex_time32 222 common sched_setaffinity sys_sched_setaffinity compat_sys_sched_setaffinity 223 common sched_getaffinity sys_sched_getaffinity compat_sys_sched_getaffinity # 224 unused @@ -268,7 +268,7 @@ 226 32 sendfile64 sys_sendfile64 compat_sys_sendfile64 227 common io_setup sys_io_setup compat_sys_io_setup 228 common io_destroy sys_io_destroy -229 common io_getevents sys_io_getevents compat_sys_io_getevents +229 common io_getevents sys_io_getevents sys_io_getevents_time32 230 common io_submit sys_io_submit compat_sys_io_submit 231 common io_cancel sys_io_cancel 232 nospu set_tid_address sys_set_tid_address @@ -280,19 +280,19 @@ 238 common epoll_wait sys_epoll_wait 239 common remap_file_pages sys_remap_file_pages 240 common timer_create sys_timer_create compat_sys_timer_create -241 common timer_settime sys_timer_settime compat_sys_timer_settime -242 common timer_gettime sys_timer_gettime compat_sys_timer_gettime +241 common timer_settime sys_timer_settime sys_timer_settime32 +242 common timer_gettime sys_timer_gettime sys_timer_gettime32 243 common timer_getoverrun sys_timer_getoverrun 244 common timer_delete sys_timer_delete -245 common clock_settime sys_clock_settime compat_sys_clock_settime -246 common clock_gettime sys_clock_gettime compat_sys_clock_gettime -247 common clock_getres sys_clock_getres compat_sys_clock_getres -248 common clock_nanosleep sys_clock_nanosleep compat_sys_clock_nanosleep +245 common clock_settime sys_clock_settime sys_clock_settime32 +246 common clock_gettime sys_clock_gettime sys_clock_gettime32 +247 common clock_getres sys_clock_getres sys_clock_getres_time32 +248 common clock_nanosleep sys_clock_nanosleep sys_clock_nanosleep_time32 249 32 swapcontext ppc_swapcontext ppc32_swapcontext 249 64 swapcontext ppc64_swapcontext 249 spu swapcontext sys_ni_syscall 250 common tgkill sys_tgkill -251 common utimes sys_utimes compat_sys_utimes +251 common utimes sys_utimes sys_utimes_time32 252 common statfs64 sys_statfs64 compat_sys_statfs64 253 common fstatfs64 sys_fstatfs64 compat_sys_fstatfs64 254 32 fadvise64_64 ppc_fadvise64_64 @@ -308,8 +308,8 @@ 261 nospu set_mempolicy sys_set_mempolicy compat_sys_set_mempolicy 262 nospu mq_open sys_mq_open compat_sys_mq_open 263 nospu mq_unlink sys_mq_unlink -264 nospu mq_timedsend sys_mq_timedsend compat_sys_mq_timedsend -265 nospu mq_timedreceive sys_mq_timedreceive compat_sys_mq_timedreceive +264 nospu mq_timedsend sys_mq_timedsend sys_mq_timedsend_time32 +265 nospu mq_timedreceive sys_mq_timedreceive sys_mq_timedreceive_time32 266 nospu mq_notify sys_mq_notify compat_sys_mq_notify 267 nospu mq_getsetattr sys_mq_getsetattr compat_sys_mq_getsetattr 268 nospu kexec_load sys_kexec_load compat_sys_kexec_load @@ -324,8 +324,8 @@ 277 nospu inotify_rm_watch sys_inotify_rm_watch 278 nospu spu_run sys_spu_run 279 nospu spu_create sys_spu_create -280 nospu pselect6 sys_pselect6 compat_sys_pselect6 -281 nospu ppoll sys_ppoll compat_sys_ppoll +280 nospu pselect6 sys_pselect6 compat_sys_pselect6_time32 +281 nospu ppoll sys_ppoll compat_sys_ppoll_time32 282 common unshare sys_unshare 283 common splice sys_splice 284 common tee sys_tee @@ -334,7 +334,7 @@ 287 common mkdirat sys_mkdirat 288 common mknodat sys_mknodat 289 common fchownat sys_fchownat -290 common futimesat sys_futimesat compat_sys_futimesat +290 common futimesat sys_futimesat sys_futimesat_time32 291 32 fstatat64 sys_fstatat64 291 64 newfstatat sys_newfstatat 291 spu newfstatat sys_newfstatat @@ -350,15 +350,15 @@ 301 common move_pages sys_move_pages compat_sys_move_pages 302 common getcpu sys_getcpu 303 nospu epoll_pwait sys_epoll_pwait compat_sys_epoll_pwait -304 common utimensat sys_utimensat compat_sys_utimensat +304 common utimensat sys_utimensat sys_utimensat_time32 305 common signalfd sys_signalfd compat_sys_signalfd 306 common timerfd_create sys_timerfd_create 307 common eventfd sys_eventfd 308 common sync_file_range2 sys_sync_file_range2 compat_sys_sync_file_range2 309 nospu fallocate sys_fallocate compat_sys_fallocate 310 nospu subpage_prot sys_subpage_prot -311 common timerfd_settime sys_timerfd_settime compat_sys_timerfd_settime -312 common timerfd_gettime sys_timerfd_gettime compat_sys_timerfd_gettime +311 common timerfd_settime sys_timerfd_settime sys_timerfd_settime32 +312 common timerfd_gettime sys_timerfd_gettime sys_timerfd_gettime32 313 common signalfd4 sys_signalfd4 compat_sys_signalfd4 314 common eventfd2 sys_eventfd2 315 common epoll_create1 sys_epoll_create1 @@ -389,11 +389,11 @@ 340 common getsockopt sys_getsockopt compat_sys_getsockopt 341 common sendmsg sys_sendmsg compat_sys_sendmsg 342 common recvmsg sys_recvmsg compat_sys_recvmsg -343 common recvmmsg sys_recvmmsg compat_sys_recvmmsg +343 common recvmmsg sys_recvmmsg compat_sys_recvmmsg_time32 344 common accept4 sys_accept4 345 common name_to_handle_at sys_name_to_handle_at 346 common open_by_handle_at sys_open_by_handle_at compat_sys_open_by_handle_at -347 common clock_adjtime sys_clock_adjtime compat_sys_clock_adjtime +347 common clock_adjtime sys_clock_adjtime sys_clock_adjtime32 348 common syncfs sys_syncfs 349 common sendmmsg sys_sendmmsg compat_sys_sendmmsg 350 common setns sys_setns diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl index 620e222003ca..285201cf1f83 100644 --- a/arch/s390/kernel/syscalls/syscall.tbl +++ b/arch/s390/kernel/syscalls/syscall.tbl @@ -20,7 +20,7 @@ 10 common unlink sys_unlink sys_unlink 11 common execve sys_execve compat_sys_execve 12 common chdir sys_chdir sys_chdir -13 32 time - compat_sys_time +13 32 time - sys_time32 14 common mknod sys_mknod sys_mknod 15 common chmod sys_chmod sys_chmod 16 32 lchown - sys_lchown16 @@ -30,11 +30,11 @@ 22 common umount sys_oldumount sys_oldumount 23 32 setuid - sys_setuid16 24 32 getuid - sys_getuid16 -25 32 stime - compat_sys_stime +25 32 stime - sys_stime32 26 common ptrace sys_ptrace compat_sys_ptrace 27 common alarm sys_alarm sys_alarm 29 common pause sys_pause sys_pause -30 common utime sys_utime compat_sys_utime +30 common utime sys_utime sys_utime32 33 common access sys_access sys_access 34 common nice sys_nice sys_nice 36 common sync sys_sync sys_sync @@ -112,7 +112,7 @@ 120 common clone sys_clone sys_clone 121 common setdomainname sys_setdomainname sys_setdomainname 122 common uname sys_newuname sys_newuname -124 common adjtimex sys_adjtimex compat_sys_adjtimex +124 common adjtimex sys_adjtimex sys_adjtimex_time32 125 common mprotect sys_mprotect sys_mprotect 126 common sigprocmask sys_sigprocmask compat_sys_sigprocmask 127 common create_module - - @@ -150,8 +150,8 @@ 158 common sched_yield sys_sched_yield sys_sched_yield 159 common sched_get_priority_max sys_sched_get_priority_max sys_sched_get_priority_max 160 common sched_get_priority_min sys_sched_get_priority_min sys_sched_get_priority_min -161 common sched_rr_get_interval sys_sched_rr_get_interval compat_sys_sched_rr_get_interval -162 common nanosleep sys_nanosleep compat_sys_nanosleep +161 common sched_rr_get_interval sys_sched_rr_get_interval sys_sched_rr_get_interval_time32 +162 common nanosleep sys_nanosleep sys_nanosleep_time32 163 common mremap sys_mremap sys_mremap 164 32 setresuid - sys_setresuid16 165 32 getresuid - sys_getresuid16 @@ -165,7 +165,7 @@ 174 common rt_sigaction sys_rt_sigaction compat_sys_rt_sigaction 175 common rt_sigprocmask sys_rt_sigprocmask compat_sys_rt_sigprocmask 176 common rt_sigpending sys_rt_sigpending compat_sys_rt_sigpending -177 common rt_sigtimedwait sys_rt_sigtimedwait compat_sys_rt_sigtimedwait +177 common rt_sigtimedwait sys_rt_sigtimedwait compat_sys_rt_sigtimedwait_time32 178 common rt_sigqueueinfo sys_rt_sigqueueinfo compat_sys_rt_sigqueueinfo 179 common rt_sigsuspend sys_rt_sigsuspend compat_sys_rt_sigsuspend 180 common pread64 sys_pread64 compat_sys_s390_pread64 @@ -246,13 +246,13 @@ 235 common fremovexattr sys_fremovexattr sys_fremovexattr 236 common gettid sys_gettid sys_gettid 237 common tkill sys_tkill sys_tkill -238 common futex sys_futex compat_sys_futex +238 common futex sys_futex sys_futex_time32 239 common sched_setaffinity sys_sched_setaffinity compat_sys_sched_setaffinity 240 common sched_getaffinity sys_sched_getaffinity compat_sys_sched_getaffinity 241 common tgkill sys_tgkill sys_tgkill 243 common io_setup sys_io_setup compat_sys_io_setup 244 common io_destroy sys_io_destroy sys_io_destroy -245 common io_getevents sys_io_getevents compat_sys_io_getevents +245 common io_getevents sys_io_getevents sys_io_getevents_time32 246 common io_submit sys_io_submit compat_sys_io_submit 247 common io_cancel sys_io_cancel sys_io_cancel 248 common exit_group sys_exit_group sys_exit_group @@ -262,14 +262,14 @@ 252 common set_tid_address sys_set_tid_address sys_set_tid_address 253 common fadvise64 sys_fadvise64_64 compat_sys_s390_fadvise64 254 common timer_create sys_timer_create compat_sys_timer_create -255 common timer_settime sys_timer_settime compat_sys_timer_settime -256 common timer_gettime sys_timer_gettime compat_sys_timer_gettime +255 common timer_settime sys_timer_settime sys_timer_settime32 +256 common timer_gettime sys_timer_gettime sys_timer_gettime32 257 common timer_getoverrun sys_timer_getoverrun sys_timer_getoverrun 258 common timer_delete sys_timer_delete sys_timer_delete -259 common clock_settime sys_clock_settime compat_sys_clock_settime -260 common clock_gettime sys_clock_gettime compat_sys_clock_gettime -261 common clock_getres sys_clock_getres compat_sys_clock_getres -262 common clock_nanosleep sys_clock_nanosleep compat_sys_clock_nanosleep +259 common clock_settime sys_clock_settime sys_clock_settime32 +260 common clock_gettime sys_clock_gettime sys_clock_gettime32 +261 common clock_getres sys_clock_getres sys_clock_getres_time32 +262 common clock_nanosleep sys_clock_nanosleep sys_clock_nanosleep_time32 264 32 fadvise64_64 - compat_sys_s390_fadvise64_64 265 common statfs64 sys_statfs64 compat_sys_statfs64 266 common fstatfs64 sys_fstatfs64 compat_sys_fstatfs64 @@ -279,8 +279,8 @@ 270 common set_mempolicy sys_set_mempolicy compat_sys_set_mempolicy 271 common mq_open sys_mq_open compat_sys_mq_open 272 common mq_unlink sys_mq_unlink sys_mq_unlink -273 common mq_timedsend sys_mq_timedsend compat_sys_mq_timedsend -274 common mq_timedreceive sys_mq_timedreceive compat_sys_mq_timedreceive +273 common mq_timedsend sys_mq_timedsend sys_mq_timedsend_time32 +274 common mq_timedreceive sys_mq_timedreceive sys_mq_timedreceive_time32 275 common mq_notify sys_mq_notify compat_sys_mq_notify 276 common mq_getsetattr sys_mq_getsetattr compat_sys_mq_getsetattr 277 common kexec_load sys_kexec_load compat_sys_kexec_load @@ -298,7 +298,7 @@ 289 common mkdirat sys_mkdirat sys_mkdirat 290 common mknodat sys_mknodat sys_mknodat 291 common fchownat sys_fchownat sys_fchownat -292 common futimesat sys_futimesat compat_sys_futimesat +292 common futimesat sys_futimesat sys_futimesat_time32 293 32 fstatat64 - compat_sys_s390_fstatat64 293 64 newfstatat sys_newfstatat - 294 common unlinkat sys_unlinkat sys_unlinkat @@ -308,8 +308,8 @@ 298 common readlinkat sys_readlinkat sys_readlinkat 299 common fchmodat sys_fchmodat sys_fchmodat 300 common faccessat sys_faccessat sys_faccessat -301 common pselect6 sys_pselect6 compat_sys_pselect6 -302 common ppoll sys_ppoll compat_sys_ppoll +301 common pselect6 sys_pselect6 compat_sys_pselect6_time32 +302 common ppoll sys_ppoll compat_sys_ppoll_time32 303 common unshare sys_unshare sys_unshare 304 common set_robust_list sys_set_robust_list compat_sys_set_robust_list 305 common get_robust_list sys_get_robust_list compat_sys_get_robust_list @@ -320,15 +320,15 @@ 310 common move_pages sys_move_pages compat_sys_move_pages 311 common getcpu sys_getcpu sys_getcpu 312 common epoll_pwait sys_epoll_pwait compat_sys_epoll_pwait -313 common utimes sys_utimes compat_sys_utimes +313 common utimes sys_utimes sys_utimes_time32 314 common fallocate sys_fallocate compat_sys_s390_fallocate -315 common utimensat sys_utimensat compat_sys_utimensat +315 common utimensat sys_utimensat sys_utimensat_time32 316 common signalfd sys_signalfd compat_sys_signalfd 317 common timerfd - - 318 common eventfd sys_eventfd sys_eventfd 319 common timerfd_create sys_timerfd_create sys_timerfd_create -320 common timerfd_settime sys_timerfd_settime compat_sys_timerfd_settime -321 common timerfd_gettime sys_timerfd_gettime compat_sys_timerfd_gettime +320 common timerfd_settime sys_timerfd_settime sys_timerfd_settime32 +321 common timerfd_gettime sys_timerfd_gettime sys_timerfd_gettime32 322 common signalfd4 sys_signalfd4 compat_sys_signalfd4 323 common eventfd2 sys_eventfd2 sys_eventfd2 324 common inotify_init1 sys_inotify_init1 sys_inotify_init1 @@ -344,7 +344,7 @@ 334 common prlimit64 sys_prlimit64 sys_prlimit64 335 common name_to_handle_at sys_name_to_handle_at sys_name_to_handle_at 336 common open_by_handle_at sys_open_by_handle_at compat_sys_open_by_handle_at -337 common clock_adjtime sys_clock_adjtime compat_sys_clock_adjtime +337 common clock_adjtime sys_clock_adjtime sys_clock_adjtime32 338 common syncfs sys_syncfs sys_syncfs 339 common setns sys_setns sys_setns 340 common process_vm_readv sys_process_vm_readv compat_sys_process_vm_readv @@ -364,7 +364,7 @@ 354 common execveat sys_execveat compat_sys_execveat 355 common userfaultfd sys_userfaultfd sys_userfaultfd 356 common membarrier sys_membarrier sys_membarrier -357 common recvmmsg sys_recvmmsg compat_sys_recvmmsg +357 common recvmmsg sys_recvmmsg compat_sys_recvmmsg_time32 358 common sendmmsg sys_sendmmsg compat_sys_sendmmsg 359 common socket sys_socket sys_socket 360 common socketpair sys_socketpair sys_socketpair diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl index e63cd013cc77..7cb05b50aeaa 100644 --- a/arch/sparc/kernel/syscalls/syscall.tbl +++ b/arch/sparc/kernel/syscalls/syscall.tbl @@ -44,7 +44,7 @@ 28 common sigaltstack sys_sigaltstack compat_sys_sigaltstack 29 32 pause sys_pause 29 64 pause sys_nis_syscall -30 common utime sys_utime compat_sys_utime +30 common utime sys_utime sys_utime32 31 32 lchown32 sys_lchown 32 32 fchown32 sys_fchown 33 common access sys_access @@ -128,7 +128,7 @@ 102 common rt_sigaction sys_rt_sigaction compat_sys_rt_sigaction 103 common rt_sigprocmask sys_rt_sigprocmask compat_sys_rt_sigprocmask 104 common rt_sigpending sys_rt_sigpending compat_sys_rt_sigpending -105 common rt_sigtimedwait sys_rt_sigtimedwait compat_sys_rt_sigtimedwait +105 common rt_sigtimedwait sys_rt_sigtimedwait compat_sys_rt_sigtimedwait_time32 106 common rt_sigqueueinfo sys_rt_sigqueueinfo compat_sys_rt_sigqueueinfo 107 common rt_sigsuspend sys_rt_sigsuspend compat_sys_rt_sigsuspend 108 32 setresuid32 sys_setresuid @@ -168,11 +168,11 @@ 135 common socketpair sys_socketpair 136 common mkdir sys_mkdir 137 common rmdir sys_rmdir -138 common utimes sys_utimes compat_sys_utimes +138 common utimes sys_utimes sys_utimes_time32 139 common stat64 sys_stat64 compat_sys_stat64 140 common sendfile64 sys_sendfile64 141 common getpeername sys_getpeername -142 common futex sys_futex compat_sys_futex +142 common futex sys_futex sys_futex_time32 143 common gettid sys_gettid 144 common getrlimit sys_getrlimit compat_sys_getrlimit 145 common setrlimit sys_setrlimit compat_sys_setrlimit @@ -258,7 +258,7 @@ 216 64 sigreturn sys_nis_syscall 217 common clone sys_clone 218 common ioprio_get sys_ioprio_get -219 32 adjtimex sys_adjtimex compat_sys_adjtimex +219 32 adjtimex sys_adjtimex sys_adjtimex_time32 219 64 adjtimex sys_sparc_adjtimex 220 32 sigprocmask sys_sigprocmask compat_sys_sigprocmask 220 64 sigprocmask sys_nis_syscall @@ -272,9 +272,9 @@ 228 common setfsuid sys_setfsuid16 229 common setfsgid sys_setfsgid16 230 common _newselect sys_select compat_sys_select -231 32 time sys_time compat_sys_time +231 32 time sys_time sys_time32 232 common splice sys_splice -233 common stime sys_stime compat_sys_stime +233 common stime sys_stime sys_stime32 234 common statfs64 sys_statfs64 compat_sys_statfs64 235 common fstatfs64 sys_fstatfs64 compat_sys_fstatfs64 236 common _llseek sys_llseek @@ -289,8 +289,8 @@ 245 common sched_yield sys_sched_yield 246 common sched_get_priority_max sys_sched_get_priority_max 247 common sched_get_priority_min sys_sched_get_priority_min -248 common sched_rr_get_interval sys_sched_rr_get_interval compat_sys_sched_rr_get_interval -249 common nanosleep sys_nanosleep compat_sys_nanosleep +248 common sched_rr_get_interval sys_sched_rr_get_interval sys_sched_rr_get_interval_time32 +249 common nanosleep sys_nanosleep sys_nanosleep_time32 250 32 mremap sys_mremap 250 64 mremap sys_64_mremap 251 common _sysctl sys_sysctl compat_sys_sysctl @@ -299,14 +299,14 @@ 254 32 nfsservctl sys_ni_syscall sys_nis_syscall 254 64 nfsservctl sys_nis_syscall 255 common sync_file_range sys_sync_file_range compat_sys_sync_file_range -256 common clock_settime sys_clock_settime compat_sys_clock_settime -257 common clock_gettime sys_clock_gettime compat_sys_clock_gettime -258 common clock_getres sys_clock_getres compat_sys_clock_getres -259 common clock_nanosleep sys_clock_nanosleep compat_sys_clock_nanosleep +256 common clock_settime sys_clock_settime sys_clock_settime32 +257 common clock_gettime sys_clock_gettime sys_clock_gettime32 +258 common clock_getres sys_clock_getres sys_clock_getres_time32 +259 common clock_nanosleep sys_clock_nanosleep sys_clock_nanosleep_time32 260 common sched_getaffinity sys_sched_getaffinity compat_sys_sched_getaffinity 261 common sched_setaffinity sys_sched_setaffinity compat_sys_sched_setaffinity -262 common timer_settime sys_timer_settime compat_sys_timer_settime -263 common timer_gettime sys_timer_gettime compat_sys_timer_gettime +262 common timer_settime sys_timer_settime sys_timer_settime32 +263 common timer_gettime sys_timer_gettime sys_timer_gettime32 264 common timer_getoverrun sys_timer_getoverrun 265 common timer_delete sys_timer_delete 266 common timer_create sys_timer_create compat_sys_timer_create @@ -316,11 +316,11 @@ 269 common io_destroy sys_io_destroy 270 common io_submit sys_io_submit compat_sys_io_submit 271 common io_cancel sys_io_cancel -272 common io_getevents sys_io_getevents compat_sys_io_getevents +272 common io_getevents sys_io_getevents sys_io_getevents_time32 273 common mq_open sys_mq_open compat_sys_mq_open 274 common mq_unlink sys_mq_unlink -275 common mq_timedsend sys_mq_timedsend compat_sys_mq_timedsend -276 common mq_timedreceive sys_mq_timedreceive compat_sys_mq_timedreceive +275 common mq_timedsend sys_mq_timedsend sys_mq_timedsend_time32 +276 common mq_timedreceive sys_mq_timedreceive sys_mq_timedreceive_time32 277 common mq_notify sys_mq_notify compat_sys_mq_notify 278 common mq_getsetattr sys_mq_getsetattr compat_sys_mq_getsetattr 279 common waitid sys_waitid compat_sys_waitid @@ -332,7 +332,7 @@ 285 common mkdirat sys_mkdirat 286 common mknodat sys_mknodat 287 common fchownat sys_fchownat -288 common futimesat sys_futimesat compat_sys_futimesat +288 common futimesat sys_futimesat sys_futimesat_time32 289 common fstatat64 sys_fstatat64 compat_sys_fstatat64 290 common unlinkat sys_unlinkat 291 common renameat sys_renameat @@ -341,8 +341,8 @@ 294 common readlinkat sys_readlinkat 295 common fchmodat sys_fchmodat 296 common faccessat sys_faccessat -297 common pselect6 sys_pselect6 compat_sys_pselect6 -298 common ppoll sys_ppoll compat_sys_ppoll +297 common pselect6 sys_pselect6 compat_sys_pselect6_time32 +298 common ppoll sys_ppoll compat_sys_ppoll_time32 299 common unshare sys_unshare 300 common set_robust_list sys_set_robust_list compat_sys_set_robust_list 301 common get_robust_list sys_get_robust_list compat_sys_get_robust_list @@ -354,13 +354,13 @@ 307 common move_pages sys_move_pages compat_sys_move_pages 308 common getcpu sys_getcpu 309 common epoll_pwait sys_epoll_pwait compat_sys_epoll_pwait -310 common utimensat sys_utimensat compat_sys_utimensat +310 common utimensat sys_utimensat sys_utimensat_time32 311 common signalfd sys_signalfd compat_sys_signalfd 312 common timerfd_create sys_timerfd_create 313 common eventfd sys_eventfd 314 common fallocate sys_fallocate compat_sys_fallocate -315 common timerfd_settime sys_timerfd_settime compat_sys_timerfd_settime -316 common timerfd_gettime sys_timerfd_gettime compat_sys_timerfd_gettime +315 common timerfd_settime sys_timerfd_settime sys_timerfd_settime32 +316 common timerfd_gettime sys_timerfd_gettime sys_timerfd_gettime32 317 common signalfd4 sys_signalfd4 compat_sys_signalfd4 318 common eventfd2 sys_eventfd2 319 common epoll_create1 sys_epoll_create1 @@ -372,13 +372,13 @@ 325 common pwritev sys_pwritev compat_sys_pwritev 326 common rt_tgsigqueueinfo sys_rt_tgsigqueueinfo compat_sys_rt_tgsigqueueinfo 327 common perf_event_open sys_perf_event_open -328 common recvmmsg sys_recvmmsg compat_sys_recvmmsg +328 common recvmmsg sys_recvmmsg compat_sys_recvmmsg_time32 329 common fanotify_init sys_fanotify_init 330 common fanotify_mark sys_fanotify_mark compat_sys_fanotify_mark 331 common prlimit64 sys_prlimit64 332 common name_to_handle_at sys_name_to_handle_at 333 common open_by_handle_at sys_open_by_handle_at compat_sys_open_by_handle_at -334 32 clock_adjtime sys_clock_adjtime compat_sys_clock_adjtime +334 32 clock_adjtime sys_clock_adjtime sys_clock_adjtime32 334 64 clock_adjtime sys_sparc_clock_adjtime 335 common syncfs sys_syncfs 336 common sendmmsg sys_sendmmsg compat_sys_sendmmsg diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl index 2be1d0eb7754..7705d5ecad25 100644 --- a/arch/x86/entry/syscalls/syscall_32.tbl +++ b/arch/x86/entry/syscalls/syscall_32.tbl @@ -24,7 +24,7 @@ 10 i386 unlink sys_unlink __ia32_sys_unlink 11 i386 execve sys_execve __ia32_compat_sys_execve 12 i386 chdir sys_chdir __ia32_sys_chdir -13 i386 time sys_time __ia32_compat_sys_time +13 i386 time sys_time __ia32_sys_time32 14 i386 mknod sys_mknod __ia32_sys_mknod 15 i386 chmod sys_chmod __ia32_sys_chmod 16 i386 lchown sys_lchown16 __ia32_sys_lchown16 @@ -36,12 +36,12 @@ 22 i386 umount sys_oldumount __ia32_sys_oldumount 23 i386 setuid sys_setuid16 __ia32_sys_setuid16 24 i386 getuid sys_getuid16 __ia32_sys_getuid16 -25 i386 stime sys_stime __ia32_compat_sys_stime +25 i386 stime sys_stime __ia32_sys_stime32 26 i386 ptrace sys_ptrace __ia32_compat_sys_ptrace 27 i386 alarm sys_alarm __ia32_sys_alarm 28 i386 oldfstat sys_fstat __ia32_sys_fstat 29 i386 pause sys_pause __ia32_sys_pause -30 i386 utime sys_utime __ia32_compat_sys_utime +30 i386 utime sys_utime __ia32_sys_utime32 31 i386 stty 32 i386 gtty 33 i386 access sys_access __ia32_sys_access @@ -135,7 +135,7 @@ 121 i386 setdomainname sys_setdomainname __ia32_sys_setdomainname 122 i386 uname sys_newuname __ia32_sys_newuname 123 i386 modify_ldt sys_modify_ldt __ia32_sys_modify_ldt -124 i386 adjtimex sys_adjtimex __ia32_compat_sys_adjtimex +124 i386 adjtimex sys_adjtimex __ia32_sys_adjtimex_time32 125 i386 mprotect sys_mprotect __ia32_sys_mprotect 126 i386 sigprocmask sys_sigprocmask __ia32_compat_sys_sigprocmask 127 i386 create_module @@ -172,8 +172,8 @@ 158 i386 sched_yield sys_sched_yield __ia32_sys_sched_yield 159 i386 sched_get_priority_max sys_sched_get_priority_max __ia32_sys_sched_get_priority_max 160 i386 sched_get_priority_min sys_sched_get_priority_min __ia32_sys_sched_get_priority_min -161 i386 sched_rr_get_interval sys_sched_rr_get_interval __ia32_compat_sys_sched_rr_get_interval -162 i386 nanosleep sys_nanosleep __ia32_compat_sys_nanosleep +161 i386 sched_rr_get_interval sys_sched_rr_get_interval __ia32_sys_sched_rr_get_interval_time32 +162 i386 nanosleep sys_nanosleep __ia32_sys_nanosleep_time32 163 i386 mremap sys_mremap __ia32_sys_mremap 164 i386 setresuid sys_setresuid16 __ia32_sys_setresuid16 165 i386 getresuid sys_getresuid16 __ia32_sys_getresuid16 @@ -188,7 +188,7 @@ 174 i386 rt_sigaction sys_rt_sigaction __ia32_compat_sys_rt_sigaction 175 i386 rt_sigprocmask sys_rt_sigprocmask __ia32_sys_rt_sigprocmask 176 i386 rt_sigpending sys_rt_sigpending __ia32_compat_sys_rt_sigpending -177 i386 rt_sigtimedwait sys_rt_sigtimedwait __ia32_compat_sys_rt_sigtimedwait +177 i386 rt_sigtimedwait sys_rt_sigtimedwait __ia32_compat_sys_rt_sigtimedwait_time32 178 i386 rt_sigqueueinfo sys_rt_sigqueueinfo __ia32_compat_sys_rt_sigqueueinfo 179 i386 rt_sigsuspend sys_rt_sigsuspend __ia32_sys_rt_sigsuspend 180 i386 pread64 sys_pread64 __ia32_compat_sys_x86_pread @@ -251,14 +251,14 @@ 237 i386 fremovexattr sys_fremovexattr __ia32_sys_fremovexattr 238 i386 tkill sys_tkill __ia32_sys_tkill 239 i386 sendfile64 sys_sendfile64 __ia32_sys_sendfile64 -240 i386 futex sys_futex __ia32_compat_sys_futex +240 i386 futex sys_futex __ia32_sys_futex_time32 241 i386 sched_setaffinity sys_sched_setaffinity __ia32_compat_sys_sched_setaffinity 242 i386 sched_getaffinity sys_sched_getaffinity __ia32_compat_sys_sched_getaffinity 243 i386 set_thread_area sys_set_thread_area __ia32_sys_set_thread_area 244 i386 get_thread_area sys_get_thread_area __ia32_sys_get_thread_area 245 i386 io_setup sys_io_setup __ia32_compat_sys_io_setup 246 i386 io_destroy sys_io_destroy __ia32_sys_io_destroy -247 i386 io_getevents sys_io_getevents __ia32_compat_sys_io_getevents +247 i386 io_getevents sys_io_getevents __ia32_sys_io_getevents_time32 248 i386 io_submit sys_io_submit __ia32_compat_sys_io_submit 249 i386 io_cancel sys_io_cancel __ia32_sys_io_cancel 250 i386 fadvise64 sys_fadvise64 __ia32_compat_sys_x86_fadvise64 @@ -271,18 +271,18 @@ 257 i386 remap_file_pages sys_remap_file_pages __ia32_sys_remap_file_pages 258 i386 set_tid_address sys_set_tid_address __ia32_sys_set_tid_address 259 i386 timer_create sys_timer_create __ia32_compat_sys_timer_create -260 i386 timer_settime sys_timer_settime __ia32_compat_sys_timer_settime -261 i386 timer_gettime sys_timer_gettime __ia32_compat_sys_timer_gettime +260 i386 timer_settime sys_timer_settime __ia32_sys_timer_settime32 +261 i386 timer_gettime sys_timer_gettime __ia32_sys_timer_gettime32 262 i386 timer_getoverrun sys_timer_getoverrun __ia32_sys_timer_getoverrun 263 i386 timer_delete sys_timer_delete __ia32_sys_timer_delete -264 i386 clock_settime sys_clock_settime __ia32_compat_sys_clock_settime -265 i386 clock_gettime sys_clock_gettime __ia32_compat_sys_clock_gettime -266 i386 clock_getres sys_clock_getres __ia32_compat_sys_clock_getres -267 i386 clock_nanosleep sys_clock_nanosleep __ia32_compat_sys_clock_nanosleep +264 i386 clock_settime sys_clock_settime __ia32_sys_clock_settime32 +265 i386 clock_gettime sys_clock_gettime __ia32_sys_clock_gettime32 +266 i386 clock_getres sys_clock_getres __ia32_sys_clock_getres_time32 +267 i386 clock_nanosleep sys_clock_nanosleep __ia32_sys_clock_nanosleep_time32 268 i386 statfs64 sys_statfs64 __ia32_compat_sys_statfs64 269 i386 fstatfs64 sys_fstatfs64 __ia32_compat_sys_fstatfs64 270 i386 tgkill sys_tgkill __ia32_sys_tgkill -271 i386 utimes sys_utimes __ia32_compat_sys_utimes +271 i386 utimes sys_utimes __ia32_sys_utimes_time32 272 i386 fadvise64_64 sys_fadvise64_64 __ia32_compat_sys_x86_fadvise64_64 273 i386 vserver 274 i386 mbind sys_mbind __ia32_sys_mbind @@ -290,8 +290,8 @@ 276 i386 set_mempolicy sys_set_mempolicy __ia32_sys_set_mempolicy 277 i386 mq_open sys_mq_open __ia32_compat_sys_mq_open 278 i386 mq_unlink sys_mq_unlink __ia32_sys_mq_unlink -279 i386 mq_timedsend sys_mq_timedsend __ia32_compat_sys_mq_timedsend -280 i386 mq_timedreceive sys_mq_timedreceive __ia32_compat_sys_mq_timedreceive +279 i386 mq_timedsend sys_mq_timedsend __ia32_sys_mq_timedsend_time32 +280 i386 mq_timedreceive sys_mq_timedreceive __ia32_sys_mq_timedreceive_time32 281 i386 mq_notify sys_mq_notify __ia32_compat_sys_mq_notify 282 i386 mq_getsetattr sys_mq_getsetattr __ia32_compat_sys_mq_getsetattr 283 i386 kexec_load sys_kexec_load __ia32_compat_sys_kexec_load @@ -310,7 +310,7 @@ 296 i386 mkdirat sys_mkdirat __ia32_sys_mkdirat 297 i386 mknodat sys_mknodat __ia32_sys_mknodat 298 i386 fchownat sys_fchownat __ia32_sys_fchownat -299 i386 futimesat sys_futimesat __ia32_compat_sys_futimesat +299 i386 futimesat sys_futimesat __ia32_sys_futimesat_time32 300 i386 fstatat64 sys_fstatat64 __ia32_compat_sys_x86_fstatat 301 i386 unlinkat sys_unlinkat __ia32_sys_unlinkat 302 i386 renameat sys_renameat __ia32_sys_renameat @@ -319,8 +319,8 @@ 305 i386 readlinkat sys_readlinkat __ia32_sys_readlinkat 306 i386 fchmodat sys_fchmodat __ia32_sys_fchmodat 307 i386 faccessat sys_faccessat __ia32_sys_faccessat -308 i386 pselect6 sys_pselect6 __ia32_compat_sys_pselect6 -309 i386 ppoll sys_ppoll __ia32_compat_sys_ppoll +308 i386 pselect6 sys_pselect6 __ia32_compat_sys_pselect6_time32 +309 i386 ppoll sys_ppoll __ia32_compat_sys_ppoll_time32 310 i386 unshare sys_unshare __ia32_sys_unshare 311 i386 set_robust_list sys_set_robust_list __ia32_compat_sys_set_robust_list 312 i386 get_robust_list sys_get_robust_list __ia32_compat_sys_get_robust_list @@ -331,13 +331,13 @@ 317 i386 move_pages sys_move_pages __ia32_compat_sys_move_pages 318 i386 getcpu sys_getcpu __ia32_sys_getcpu 319 i386 epoll_pwait sys_epoll_pwait __ia32_sys_epoll_pwait -320 i386 utimensat sys_utimensat __ia32_compat_sys_utimensat +320 i386 utimensat sys_utimensat __ia32_sys_utimensat_time32 321 i386 signalfd sys_signalfd __ia32_compat_sys_signalfd 322 i386 timerfd_create sys_timerfd_create __ia32_sys_timerfd_create 323 i386 eventfd sys_eventfd __ia32_sys_eventfd 324 i386 fallocate sys_fallocate __ia32_compat_sys_x86_fallocate -325 i386 timerfd_settime sys_timerfd_settime __ia32_compat_sys_timerfd_settime -326 i386 timerfd_gettime sys_timerfd_gettime __ia32_compat_sys_timerfd_gettime +325 i386 timerfd_settime sys_timerfd_settime __ia32_sys_timerfd_settime32 +326 i386 timerfd_gettime sys_timerfd_gettime __ia32_sys_timerfd_gettime32 327 i386 signalfd4 sys_signalfd4 __ia32_compat_sys_signalfd4 328 i386 eventfd2 sys_eventfd2 __ia32_sys_eventfd2 329 i386 epoll_create1 sys_epoll_create1 __ia32_sys_epoll_create1 @@ -348,13 +348,13 @@ 334 i386 pwritev sys_pwritev __ia32_compat_sys_pwritev 335 i386 rt_tgsigqueueinfo sys_rt_tgsigqueueinfo __ia32_compat_sys_rt_tgsigqueueinfo 336 i386 perf_event_open sys_perf_event_open __ia32_sys_perf_event_open -337 i386 recvmmsg sys_recvmmsg __ia32_compat_sys_recvmmsg +337 i386 recvmmsg sys_recvmmsg __ia32_compat_sys_recvmmsg_time32 338 i386 fanotify_init sys_fanotify_init __ia32_sys_fanotify_init 339 i386 fanotify_mark sys_fanotify_mark __ia32_compat_sys_fanotify_mark 340 i386 prlimit64 sys_prlimit64 __ia32_sys_prlimit64 341 i386 name_to_handle_at sys_name_to_handle_at __ia32_sys_name_to_handle_at 342 i386 open_by_handle_at sys_open_by_handle_at __ia32_compat_sys_open_by_handle_at -343 i386 clock_adjtime sys_clock_adjtime __ia32_compat_sys_clock_adjtime +343 i386 clock_adjtime sys_clock_adjtime __ia32_sys_clock_adjtime32 344 i386 syncfs sys_syncfs __ia32_sys_syncfs 345 i386 sendmmsg sys_sendmmsg __ia32_compat_sys_sendmmsg 346 i386 setns sys_setns __ia32_sys_setns diff --git a/fs/aio.c b/fs/aio.c index b906ff70c90f..4394d3fe116a 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -2198,11 +2198,11 @@ SYSCALL_DEFINE6(io_pgetevents_time32, #if defined(CONFIG_COMPAT_32BIT_TIME) -COMPAT_SYSCALL_DEFINE5(io_getevents, compat_aio_context_t, ctx_id, - compat_long_t, min_nr, - compat_long_t, nr, - struct io_event __user *, events, - struct old_timespec32 __user *, timeout) +SYSCALL_DEFINE5(io_getevents_time32, __u32, ctx_id, + __s32, min_nr, + __s32, nr, + struct io_event __user *, events, + struct old_timespec32 __user *, timeout) { struct timespec64 t; int ret; diff --git a/fs/select.c b/fs/select.c index d0f35dbc0e8f..6cbc9ff56ba0 100644 --- a/fs/select.c +++ b/fs/select.c @@ -1379,7 +1379,7 @@ COMPAT_SYSCALL_DEFINE6(pselect6_time64, int, n, compat_ulong_t __user *, inp, #if defined(CONFIG_COMPAT_32BIT_TIME) -COMPAT_SYSCALL_DEFINE6(pselect6, int, n, compat_ulong_t __user *, inp, +COMPAT_SYSCALL_DEFINE6(pselect6_time32, int, n, compat_ulong_t __user *, inp, compat_ulong_t __user *, outp, compat_ulong_t __user *, exp, struct old_timespec32 __user *, tsp, void __user *, sig) { @@ -1402,7 +1402,7 @@ COMPAT_SYSCALL_DEFINE6(pselect6, int, n, compat_ulong_t __user *, inp, #endif #if defined(CONFIG_COMPAT_32BIT_TIME) -COMPAT_SYSCALL_DEFINE5(ppoll, struct pollfd __user *, ufds, +COMPAT_SYSCALL_DEFINE5(ppoll_time32, struct pollfd __user *, ufds, unsigned int, nfds, struct old_timespec32 __user *, tsp, const compat_sigset_t __user *, sigmask, compat_size_t, sigsetsize) { diff --git a/fs/timerfd.c b/fs/timerfd.c index 803ca070d42e..6a6fc8aa1de7 100644 --- a/fs/timerfd.c +++ b/fs/timerfd.c @@ -560,7 +560,7 @@ SYSCALL_DEFINE2(timerfd_gettime, int, ufd, struct __kernel_itimerspec __user *, } #ifdef CONFIG_COMPAT_32BIT_TIME -COMPAT_SYSCALL_DEFINE4(timerfd_settime, int, ufd, int, flags, +SYSCALL_DEFINE4(timerfd_settime32, int, ufd, int, flags, const struct old_itimerspec32 __user *, utmr, struct old_itimerspec32 __user *, otmr) { @@ -577,7 +577,7 @@ COMPAT_SYSCALL_DEFINE4(timerfd_settime, int, ufd, int, flags, return ret; } -COMPAT_SYSCALL_DEFINE2(timerfd_gettime, int, ufd, +SYSCALL_DEFINE2(timerfd_gettime32, int, ufd, struct old_itimerspec32 __user *, otmr) { struct itimerspec64 kotmr; diff --git a/fs/utimes.c b/fs/utimes.c index bdcf2daf39c1..350c9c16ace1 100644 --- a/fs/utimes.c +++ b/fs/utimes.c @@ -224,8 +224,8 @@ SYSCALL_DEFINE2(utime, char __user *, filename, struct utimbuf __user *, times) * of sys_utimes. */ #ifdef __ARCH_WANT_SYS_UTIME32 -COMPAT_SYSCALL_DEFINE2(utime, const char __user *, filename, - struct old_utimbuf32 __user *, t) +SYSCALL_DEFINE2(utime32, const char __user *, filename, + struct old_utimbuf32 __user *, t) { struct timespec64 tv[2]; @@ -240,7 +240,7 @@ COMPAT_SYSCALL_DEFINE2(utime, const char __user *, filename, } #endif -COMPAT_SYSCALL_DEFINE4(utimensat, unsigned int, dfd, const char __user *, filename, struct old_timespec32 __user *, t, int, flags) +SYSCALL_DEFINE4(utimensat_time32, unsigned int, dfd, const char __user *, filename, struct old_timespec32 __user *, t, int, flags) { struct timespec64 tv[2]; @@ -276,14 +276,14 @@ static long do_compat_futimesat(unsigned int dfd, const char __user *filename, return do_utimes(dfd, filename, t ? tv : NULL, 0); } -COMPAT_SYSCALL_DEFINE3(futimesat, unsigned int, dfd, +SYSCALL_DEFINE3(futimesat_time32, unsigned int, dfd, const char __user *, filename, struct old_timeval32 __user *, t) { return do_compat_futimesat(dfd, filename, t); } -COMPAT_SYSCALL_DEFINE2(utimes, const char __user *, filename, struct old_timeval32 __user *, t) +SYSCALL_DEFINE2(utimes_time32, const char __user *, filename, struct old_timeval32 __user *, t) { return do_compat_futimesat(AT_FDCWD, filename, t); } diff --git a/include/linux/compat.h b/include/linux/compat.h index 657ca6abd855..ebddcb6cfcf8 100644 --- a/include/linux/compat.h +++ b/include/linux/compat.h @@ -520,11 +520,6 @@ int __compat_save_altstack(compat_stack_t __user *, unsigned long); asmlinkage long compat_sys_io_setup(unsigned nr_reqs, u32 __user *ctx32p); asmlinkage long compat_sys_io_submit(compat_aio_context_t ctx_id, int nr, u32 __user *iocb); -asmlinkage long compat_sys_io_getevents(compat_aio_context_t ctx_id, - compat_long_t min_nr, - compat_long_t nr, - struct io_event __user *events, - struct old_timespec32 __user *timeout); asmlinkage long compat_sys_io_pgetevents(compat_aio_context_t ctx_id, compat_long_t min_nr, compat_long_t nr, @@ -617,7 +612,7 @@ asmlinkage long compat_sys_sendfile64(int out_fd, int in_fd, compat_loff_t __user *offset, compat_size_t count); /* fs/select.c */ -asmlinkage long compat_sys_pselect6(int n, compat_ulong_t __user *inp, +asmlinkage long compat_sys_pselect6_time32(int n, compat_ulong_t __user *inp, compat_ulong_t __user *outp, compat_ulong_t __user *exp, struct old_timespec32 __user *tsp, @@ -627,7 +622,7 @@ asmlinkage long compat_sys_pselect6_time64(int n, compat_ulong_t __user *inp, compat_ulong_t __user *exp, struct __kernel_timespec __user *tsp, void __user *sig); -asmlinkage long compat_sys_ppoll(struct pollfd __user *ufds, +asmlinkage long compat_sys_ppoll_time32(struct pollfd __user *ufds, unsigned int nfds, struct old_timespec32 __user *tsp, const compat_sigset_t __user *sigmask, @@ -657,19 +652,6 @@ asmlinkage long compat_sys_newfstat(unsigned int fd, /* fs/sync.c: No generic prototype for sync_file_range and sync_file_range2 */ -/* fs/timerfd.c */ -asmlinkage long compat_sys_timerfd_gettime(int ufd, - struct old_itimerspec32 __user *otmr); -asmlinkage long compat_sys_timerfd_settime(int ufd, int flags, - const struct old_itimerspec32 __user *utmr, - struct old_itimerspec32 __user *otmr); - -/* fs/utimes.c */ -asmlinkage long compat_sys_utimensat(unsigned int dfd, - const char __user *filename, - struct old_timespec32 __user *t, - int flags); - /* kernel/exit.c */ asmlinkage long compat_sys_waitid(int, compat_pid_t, struct compat_siginfo __user *, int, @@ -678,9 +660,6 @@ asmlinkage long compat_sys_waitid(int, compat_pid_t, /* kernel/futex.c */ -asmlinkage long compat_sys_futex(u32 __user *uaddr, int op, u32 val, - struct old_timespec32 __user *utime, u32 __user *uaddr2, - u32 val3); asmlinkage long compat_sys_set_robust_list(struct compat_robust_list_head __user *head, compat_size_t len); @@ -688,10 +667,6 @@ asmlinkage long compat_sys_get_robust_list(int pid, compat_uptr_t __user *head_ptr, compat_size_t __user *len_ptr); -/* kernel/hrtimer.c */ -asmlinkage long compat_sys_nanosleep(struct old_timespec32 __user *rqtp, - struct old_timespec32 __user *rmtp); - /* kernel/itimer.c */ asmlinkage long compat_sys_getitimer(int which, struct compat_itimerval __user *it); @@ -709,20 +684,6 @@ asmlinkage long compat_sys_kexec_load(compat_ulong_t entry, asmlinkage long compat_sys_timer_create(clockid_t which_clock, struct compat_sigevent __user *timer_event_spec, timer_t __user *created_timer_id); -asmlinkage long compat_sys_timer_gettime(timer_t timer_id, - struct old_itimerspec32 __user *setting); -asmlinkage long compat_sys_timer_settime(timer_t timer_id, int flags, - struct old_itimerspec32 __user *new, - struct old_itimerspec32 __user *old); -asmlinkage long compat_sys_clock_settime(clockid_t which_clock, - struct old_timespec32 __user *tp); -asmlinkage long compat_sys_clock_gettime(clockid_t which_clock, - struct old_timespec32 __user *tp); -asmlinkage long compat_sys_clock_getres(clockid_t which_clock, - struct old_timespec32 __user *tp); -asmlinkage long compat_sys_clock_nanosleep(clockid_t which_clock, int flags, - struct old_timespec32 __user *rqtp, - struct old_timespec32 __user *rmtp); /* kernel/ptrace.c */ asmlinkage long compat_sys_ptrace(compat_long_t request, compat_long_t pid, @@ -735,8 +696,6 @@ asmlinkage long compat_sys_sched_setaffinity(compat_pid_t pid, asmlinkage long compat_sys_sched_getaffinity(compat_pid_t pid, unsigned int len, compat_ulong_t __user *user_mask_ptr); -asmlinkage long compat_sys_sched_rr_get_interval(compat_pid_t pid, - struct old_timespec32 __user *interval); /* kernel/signal.c */ asmlinkage long compat_sys_sigaltstack(const compat_stack_t __user *uss_ptr, @@ -754,7 +713,7 @@ asmlinkage long compat_sys_rt_sigprocmask(int how, compat_sigset_t __user *set, compat_size_t sigsetsize); asmlinkage long compat_sys_rt_sigpending(compat_sigset_t __user *uset, compat_size_t sigsetsize); -asmlinkage long compat_sys_rt_sigtimedwait(compat_sigset_t __user *uthese, +asmlinkage long compat_sys_rt_sigtimedwait_time32(compat_sigset_t __user *uthese, struct compat_siginfo __user *uinfo, struct old_timespec32 __user *uts, compat_size_t sigsetsize); asmlinkage long compat_sys_rt_sigtimedwait_time64(compat_sigset_t __user *uthese, @@ -777,7 +736,6 @@ asmlinkage long compat_sys_gettimeofday(struct old_timeval32 __user *tv, struct timezone __user *tz); asmlinkage long compat_sys_settimeofday(struct old_timeval32 __user *tv, struct timezone __user *tz); -asmlinkage long compat_sys_adjtimex(struct old_timex32 __user *utp); /* kernel/timer.c */ asmlinkage long compat_sys_sysinfo(struct compat_sysinfo __user *info); @@ -786,14 +744,6 @@ asmlinkage long compat_sys_sysinfo(struct compat_sysinfo __user *info); asmlinkage long compat_sys_mq_open(const char __user *u_name, int oflag, compat_mode_t mode, struct compat_mq_attr __user *u_attr); -asmlinkage long compat_sys_mq_timedsend(mqd_t mqdes, - const char __user *u_msg_ptr, - compat_size_t msg_len, unsigned int msg_prio, - const struct old_timespec32 __user *u_abs_timeout); -asmlinkage ssize_t compat_sys_mq_timedreceive(mqd_t mqdes, - char __user *u_msg_ptr, - compat_size_t msg_len, unsigned int __user *u_msg_prio, - const struct old_timespec32 __user *u_abs_timeout); asmlinkage long compat_sys_mq_notify(mqd_t mqdes, const struct compat_sigevent __user *u_notification); asmlinkage long compat_sys_mq_getsetattr(mqd_t mqdes, @@ -809,8 +759,6 @@ asmlinkage long compat_sys_msgsnd(int msqid, compat_uptr_t msgp, /* ipc/sem.c */ asmlinkage long compat_sys_semctl(int semid, int semnum, int cmd, int arg); -asmlinkage long compat_sys_semtimedop(int semid, struct sembuf __user *tsems, - unsigned nsems, const struct old_timespec32 __user *timeout); /* ipc/shm.c */ asmlinkage long compat_sys_shmctl(int first, int second, void __user *uptr); @@ -868,7 +816,7 @@ asmlinkage long compat_sys_rt_tgsigqueueinfo(compat_pid_t tgid, asmlinkage long compat_sys_recvmmsg_time64(int fd, struct compat_mmsghdr __user *mmsg, unsigned vlen, unsigned int flags, struct __kernel_timespec __user *timeout); -asmlinkage long compat_sys_recvmmsg(int fd, struct compat_mmsghdr __user *mmsg, +asmlinkage long compat_sys_recvmmsg_time32(int fd, struct compat_mmsghdr __user *mmsg, unsigned vlen, unsigned int flags, struct old_timespec32 __user *timeout); asmlinkage long compat_sys_wait4(compat_pid_t pid, @@ -879,8 +827,6 @@ asmlinkage long compat_sys_fanotify_mark(int, unsigned int, __u32, __u32, asmlinkage long compat_sys_open_by_handle_at(int mountdirfd, struct file_handle __user *handle, int flags); -asmlinkage long compat_sys_clock_adjtime(clockid_t which_clock, - struct old_timex32 __user *tp); asmlinkage long compat_sys_sendmmsg(int fd, struct compat_mmsghdr __user *mmsg, unsigned vlen, unsigned int flags); asmlinkage ssize_t compat_sys_process_vm_readv(compat_pid_t pid, @@ -921,8 +867,6 @@ asmlinkage long compat_sys_pwritev64v2(unsigned long fd, /* __ARCH_WANT_SYSCALL_NO_AT */ asmlinkage long compat_sys_open(const char __user *filename, int flags, umode_t mode); -asmlinkage long compat_sys_utimes(const char __user *filename, - struct old_timeval32 __user *t); /* __ARCH_WANT_SYSCALL_NO_FLAGS */ asmlinkage long compat_sys_signalfd(int ufd, @@ -936,12 +880,6 @@ asmlinkage long compat_sys_newlstat(const char __user *filename, struct compat_stat __user *statbuf); /* __ARCH_WANT_SYSCALL_DEPRECATED */ -asmlinkage long compat_sys_time(old_time32_t __user *tloc); -asmlinkage long compat_sys_utime(const char __user *filename, - struct old_utimbuf32 __user *t); -asmlinkage long compat_sys_futimesat(unsigned int dfd, - const char __user *filename, - struct old_timeval32 __user *t); asmlinkage long compat_sys_select(int n, compat_ulong_t __user *inp, compat_ulong_t __user *outp, compat_ulong_t __user *exp, struct old_timeval32 __user *tvp); @@ -976,9 +914,6 @@ asmlinkage long compat_sys_sigaction(int sig, struct compat_old_sigaction __user *oact); #endif -/* obsolete: kernel/time/time.c */ -asmlinkage long compat_sys_stime(old_time32_t __user *tptr); - /* obsolete: net/socket.c */ asmlinkage long compat_sys_socketcall(int call, u32 __user *args); diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 09330d5bda0c..94369f5bd8e5 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -297,6 +297,11 @@ asmlinkage long sys_io_getevents(aio_context_t ctx_id, long nr, struct io_event __user *events, struct __kernel_timespec __user *timeout); +asmlinkage long sys_io_getevents_time32(__u32 ctx_id, + __s32 min_nr, + __s32 nr, + struct io_event __user *events, + struct old_timespec32 __user *timeout); asmlinkage long sys_io_pgetevents(aio_context_t ctx_id, long min_nr, long nr, @@ -522,11 +527,19 @@ asmlinkage long sys_timerfd_settime(int ufd, int flags, const struct __kernel_itimerspec __user *utmr, struct __kernel_itimerspec __user *otmr); asmlinkage long sys_timerfd_gettime(int ufd, struct __kernel_itimerspec __user *otmr); +asmlinkage long sys_timerfd_gettime32(int ufd, + struct old_itimerspec32 __user *otmr); +asmlinkage long sys_timerfd_settime32(int ufd, int flags, + const struct old_itimerspec32 __user *utmr, + struct old_itimerspec32 __user *otmr); /* fs/utimes.c */ asmlinkage long sys_utimensat(int dfd, const char __user *filename, struct __kernel_timespec __user *utimes, int flags); +asmlinkage long sys_utimensat_time32(unsigned int dfd, + const char __user *filename, + struct old_timespec32 __user *t, int flags); /* kernel/acct.c */ asmlinkage long sys_acct(const char __user *name); @@ -555,6 +568,9 @@ asmlinkage long sys_unshare(unsigned long unshare_flags); asmlinkage long sys_futex(u32 __user *uaddr, int op, u32 val, struct __kernel_timespec __user *utime, u32 __user *uaddr2, u32 val3); +asmlinkage long sys_futex_time32(u32 __user *uaddr, int op, u32 val, + struct old_timespec32 __user *utime, u32 __user *uaddr2, + u32 val3); asmlinkage long sys_get_robust_list(int pid, struct robust_list_head __user * __user *head_ptr, size_t __user *len_ptr); @@ -564,6 +580,8 @@ asmlinkage long sys_set_robust_list(struct robust_list_head __user *head, /* kernel/hrtimer.c */ asmlinkage long sys_nanosleep(struct __kernel_timespec __user *rqtp, struct __kernel_timespec __user *rmtp); +asmlinkage long sys_nanosleep_time32(struct old_timespec32 __user *rqtp, + struct old_timespec32 __user *rmtp); /* kernel/itimer.c */ asmlinkage long sys_getitimer(int which, struct itimerval __user *value); @@ -602,6 +620,20 @@ asmlinkage long sys_clock_getres(clockid_t which_clock, asmlinkage long sys_clock_nanosleep(clockid_t which_clock, int flags, const struct __kernel_timespec __user *rqtp, struct __kernel_timespec __user *rmtp); +asmlinkage long sys_timer_gettime32(timer_t timer_id, + struct old_itimerspec32 __user *setting); +asmlinkage long sys_timer_settime32(timer_t timer_id, int flags, + struct old_itimerspec32 __user *new, + struct old_itimerspec32 __user *old); +asmlinkage long sys_clock_settime32(clockid_t which_clock, + struct old_timespec32 __user *tp); +asmlinkage long sys_clock_gettime32(clockid_t which_clock, + struct old_timespec32 __user *tp); +asmlinkage long sys_clock_getres_time32(clockid_t which_clock, + struct old_timespec32 __user *tp); +asmlinkage long sys_clock_nanosleep_time32(clockid_t which_clock, int flags, + struct old_timespec32 __user *rqtp, + struct old_timespec32 __user *rmtp); /* kernel/printk.c */ asmlinkage long sys_syslog(int type, char __user *buf, int len); @@ -627,6 +659,8 @@ asmlinkage long sys_sched_get_priority_max(int policy); asmlinkage long sys_sched_get_priority_min(int policy); asmlinkage long sys_sched_rr_get_interval(pid_t pid, struct __kernel_timespec __user *interval); +asmlinkage long sys_sched_rr_get_interval_time32(pid_t pid, + struct old_timespec32 __user *interval); /* kernel/signal.c */ asmlinkage long sys_restart_syscall(void); @@ -696,6 +730,7 @@ asmlinkage long sys_gettimeofday(struct timeval __user *tv, asmlinkage long sys_settimeofday(struct timeval __user *tv, struct timezone __user *tz); asmlinkage long sys_adjtimex(struct __kernel_timex __user *txc_p); +asmlinkage long sys_adjtimex_time32(struct old_timex32 __user *txc_p); /* kernel/timer.c */ asmlinkage long sys_getpid(void); @@ -714,6 +749,14 @@ asmlinkage long sys_mq_timedsend(mqd_t mqdes, const char __user *msg_ptr, size_t asmlinkage long sys_mq_timedreceive(mqd_t mqdes, char __user *msg_ptr, size_t msg_len, unsigned int __user *msg_prio, const struct __kernel_timespec __user *abs_timeout); asmlinkage long sys_mq_notify(mqd_t mqdes, const struct sigevent __user *notification); asmlinkage long sys_mq_getsetattr(mqd_t mqdes, const struct mq_attr __user *mqstat, struct mq_attr __user *omqstat); +asmlinkage long sys_mq_timedreceive_time32(mqd_t mqdes, + char __user *u_msg_ptr, + unsigned int msg_len, unsigned int __user *u_msg_prio, + const struct old_timespec32 __user *u_abs_timeout); +asmlinkage long sys_mq_timedsend_time32(mqd_t mqdes, + const char __user *u_msg_ptr, + unsigned int msg_len, unsigned int msg_prio, + const struct old_timespec32 __user *u_abs_timeout); /* ipc/msg.c */ asmlinkage long sys_msgget(key_t key, int msgflg); @@ -731,6 +774,9 @@ asmlinkage long sys_old_semctl(int semid, int semnum, int cmd, unsigned long arg asmlinkage long sys_semtimedop(int semid, struct sembuf __user *sops, unsigned nsops, const struct __kernel_timespec __user *timeout); +asmlinkage long sys_semtimedop_time32(int semid, struct sembuf __user *sops, + unsigned nsops, + const struct old_timespec32 __user *timeout); asmlinkage long sys_semop(int semid, struct sembuf __user *sops, unsigned nsops); @@ -871,6 +917,8 @@ asmlinkage long sys_open_by_handle_at(int mountdirfd, int flags); asmlinkage long sys_clock_adjtime(clockid_t which_clock, struct __kernel_timex __user *tx); +asmlinkage long sys_clock_adjtime32(clockid_t which_clock, + struct old_timex32 __user *tx); asmlinkage long sys_syncfs(int fd); asmlinkage long sys_setns(int fd, int nstype); asmlinkage long sys_sendmmsg(int fd, struct mmsghdr __user *msg, @@ -1006,6 +1054,7 @@ asmlinkage long sys_alarm(unsigned int seconds); asmlinkage long sys_getpgrp(void); asmlinkage long sys_pause(void); asmlinkage long sys_time(time_t __user *tloc); +asmlinkage long sys_time32(old_time32_t __user *tloc); #ifdef __ARCH_WANT_SYS_UTIME asmlinkage long sys_utime(char __user *filename, struct utimbuf __user *times); @@ -1014,6 +1063,13 @@ asmlinkage long sys_utimes(char __user *filename, asmlinkage long sys_futimesat(int dfd, const char __user *filename, struct timeval __user *utimes); #endif +asmlinkage long sys_futimesat_time32(unsigned int dfd, + const char __user *filename, + struct old_timeval32 __user *t); +asmlinkage long sys_utime32(const char __user *filename, + struct old_utimbuf32 __user *t); +asmlinkage long sys_utimes_time32(const char __user *filename, + struct old_timeval32 __user *t); asmlinkage long sys_creat(const char __user *pathname, umode_t mode); asmlinkage long sys_getdents(unsigned int fd, struct linux_dirent __user *dirent, @@ -1038,6 +1094,7 @@ asmlinkage long sys_fork(void); /* obsolete: kernel/time/time.c */ asmlinkage long sys_stime(time_t __user *tptr); +asmlinkage long sys_stime32(old_time32_t __user *tptr); /* obsolete: kernel/signal.c */ asmlinkage long sys_sigpending(old_sigset_t __user *uset); diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h index 509484dbfd5d..153b55b94234 100644 --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@ -39,7 +39,7 @@ __SC_COMP(__NR_io_submit, sys_io_submit, compat_sys_io_submit) #define __NR_io_cancel 3 __SYSCALL(__NR_io_cancel, sys_io_cancel) #define __NR_io_getevents 4 -__SC_COMP(__NR_io_getevents, sys_io_getevents, compat_sys_io_getevents) +__SC_COMP(__NR_io_getevents, sys_io_getevents, sys_io_getevents_time32) /* fs/xattr.c */ #define __NR_setxattr 5 @@ -223,9 +223,9 @@ __SYSCALL(__NR3264_sendfile, sys_sendfile64) /* fs/select.c */ #define __NR_pselect6 72 -__SC_COMP(__NR_pselect6, sys_pselect6, compat_sys_pselect6) +__SC_COMP(__NR_pselect6, sys_pselect6, compat_sys_pselect6_time32) #define __NR_ppoll 73 -__SC_COMP(__NR_ppoll, sys_ppoll, compat_sys_ppoll) +__SC_COMP(__NR_ppoll, sys_ppoll, compat_sys_ppoll_time32) /* fs/signalfd.c */ #define __NR_signalfd4 74 @@ -271,14 +271,14 @@ __SC_COMP(__NR_sync_file_range, sys_sync_file_range, \ __SYSCALL(__NR_timerfd_create, sys_timerfd_create) #define __NR_timerfd_settime 86 __SC_COMP(__NR_timerfd_settime, sys_timerfd_settime, \ - compat_sys_timerfd_settime) + sys_timerfd_settime32) #define __NR_timerfd_gettime 87 __SC_COMP(__NR_timerfd_gettime, sys_timerfd_gettime, \ - compat_sys_timerfd_gettime) + sys_timerfd_gettime32) /* fs/utimes.c */ #define __NR_utimensat 88 -__SC_COMP(__NR_utimensat, sys_utimensat, compat_sys_utimensat) +__SC_COMP(__NR_utimensat, sys_utimensat, sys_utimensat_time32) /* kernel/acct.c */ #define __NR_acct 89 @@ -310,7 +310,7 @@ __SYSCALL(__NR_unshare, sys_unshare) /* kernel/futex.c */ #define __NR_futex 98 -__SC_COMP(__NR_futex, sys_futex, compat_sys_futex) +__SC_COMP(__NR_futex, sys_futex, sys_futex_time32) #define __NR_set_robust_list 99 __SC_COMP(__NR_set_robust_list, sys_set_robust_list, \ compat_sys_set_robust_list) @@ -320,7 +320,7 @@ __SC_COMP(__NR_get_robust_list, sys_get_robust_list, \ /* kernel/hrtimer.c */ #define __NR_nanosleep 101 -__SC_COMP(__NR_nanosleep, sys_nanosleep, compat_sys_nanosleep) +__SC_COMP(__NR_nanosleep, sys_nanosleep, sys_nanosleep_time32) /* kernel/itimer.c */ #define __NR_getitimer 102 @@ -342,22 +342,22 @@ __SYSCALL(__NR_delete_module, sys_delete_module) #define __NR_timer_create 107 __SC_COMP(__NR_timer_create, sys_timer_create, compat_sys_timer_create) #define __NR_timer_gettime 108 -__SC_COMP(__NR_timer_gettime, sys_timer_gettime, compat_sys_timer_gettime) +__SC_COMP(__NR_timer_gettime, sys_timer_gettime, sys_timer_gettime32) #define __NR_timer_getoverrun 109 __SYSCALL(__NR_timer_getoverrun, sys_timer_getoverrun) #define __NR_timer_settime 110 -__SC_COMP(__NR_timer_settime, sys_timer_settime, compat_sys_timer_settime) +__SC_COMP(__NR_timer_settime, sys_timer_settime, sys_timer_settime32) #define __NR_timer_delete 111 __SYSCALL(__NR_timer_delete, sys_timer_delete) #define __NR_clock_settime 112 -__SC_COMP(__NR_clock_settime, sys_clock_settime, compat_sys_clock_settime) +__SC_COMP(__NR_clock_settime, sys_clock_settime, sys_clock_settime32) #define __NR_clock_gettime 113 -__SC_COMP(__NR_clock_gettime, sys_clock_gettime, compat_sys_clock_gettime) +__SC_COMP(__NR_clock_gettime, sys_clock_gettime, sys_clock_gettime32) #define __NR_clock_getres 114 -__SC_COMP(__NR_clock_getres, sys_clock_getres, compat_sys_clock_getres) +__SC_COMP(__NR_clock_getres, sys_clock_getres, sys_clock_getres_time32) #define __NR_clock_nanosleep 115 __SC_COMP(__NR_clock_nanosleep, sys_clock_nanosleep, \ - compat_sys_clock_nanosleep) + sys_clock_nanosleep_time32) /* kernel/printk.c */ #define __NR_syslog 116 @@ -390,7 +390,7 @@ __SYSCALL(__NR_sched_get_priority_max, sys_sched_get_priority_max) __SYSCALL(__NR_sched_get_priority_min, sys_sched_get_priority_min) #define __NR_sched_rr_get_interval 127 __SC_COMP(__NR_sched_rr_get_interval, sys_sched_rr_get_interval, \ - compat_sys_sched_rr_get_interval) + sys_sched_rr_get_interval_time32) /* kernel/signal.c */ #define __NR_restart_syscall 128 @@ -413,7 +413,7 @@ __SC_COMP(__NR_rt_sigprocmask, sys_rt_sigprocmask, compat_sys_rt_sigprocmask) __SC_COMP(__NR_rt_sigpending, sys_rt_sigpending, compat_sys_rt_sigpending) #define __NR_rt_sigtimedwait 137 __SC_COMP(__NR_rt_sigtimedwait, sys_rt_sigtimedwait, \ - compat_sys_rt_sigtimedwait) + compat_sys_rt_sigtimedwait_time32) #define __NR_rt_sigqueueinfo 138 __SC_COMP(__NR_rt_sigqueueinfo, sys_rt_sigqueueinfo, \ compat_sys_rt_sigqueueinfo) @@ -486,7 +486,7 @@ __SC_COMP(__NR_gettimeofday, sys_gettimeofday, compat_sys_gettimeofday) #define __NR_settimeofday 170 __SC_COMP(__NR_settimeofday, sys_settimeofday, compat_sys_settimeofday) #define __NR_adjtimex 171 -__SC_COMP(__NR_adjtimex, sys_adjtimex, compat_sys_adjtimex) +__SC_COMP(__NR_adjtimex, sys_adjtimex, sys_adjtimex_time32) /* kernel/timer.c */ #define __NR_getpid 172 @@ -512,10 +512,10 @@ __SC_COMP(__NR_mq_open, sys_mq_open, compat_sys_mq_open) #define __NR_mq_unlink 181 __SYSCALL(__NR_mq_unlink, sys_mq_unlink) #define __NR_mq_timedsend 182 -__SC_COMP(__NR_mq_timedsend, sys_mq_timedsend, compat_sys_mq_timedsend) +__SC_COMP(__NR_mq_timedsend, sys_mq_timedsend, sys_mq_timedsend_time32) #define __NR_mq_timedreceive 183 __SC_COMP(__NR_mq_timedreceive, sys_mq_timedreceive, \ - compat_sys_mq_timedreceive) + sys_mq_timedreceive_time32) #define __NR_mq_notify 184 __SC_COMP(__NR_mq_notify, sys_mq_notify, compat_sys_mq_notify) #define __NR_mq_getsetattr 185 @@ -537,7 +537,7 @@ __SYSCALL(__NR_semget, sys_semget) #define __NR_semctl 191 __SC_COMP(__NR_semctl, sys_semctl, compat_sys_semctl) #define __NR_semtimedop 192 -__SC_COMP(__NR_semtimedop, sys_semtimedop, compat_sys_semtimedop) +__SC_COMP(__NR_semtimedop, sys_semtimedop, sys_semtimedop_time32) #define __NR_semop 193 __SYSCALL(__NR_semop, sys_semop) @@ -659,7 +659,7 @@ __SYSCALL(__NR_perf_event_open, sys_perf_event_open) #define __NR_accept4 242 __SYSCALL(__NR_accept4, sys_accept4) #define __NR_recvmmsg 243 -__SC_COMP(__NR_recvmmsg, sys_recvmmsg, compat_sys_recvmmsg) +__SC_COMP(__NR_recvmmsg, sys_recvmmsg, compat_sys_recvmmsg_time32) /* * Architectures may provide up to 16 syscalls of their own @@ -681,7 +681,7 @@ __SYSCALL(__NR_name_to_handle_at, sys_name_to_handle_at) __SC_COMP(__NR_open_by_handle_at, sys_open_by_handle_at, \ compat_sys_open_by_handle_at) #define __NR_clock_adjtime 266 -__SC_COMP(__NR_clock_adjtime, sys_clock_adjtime, compat_sys_clock_adjtime) +__SC_COMP(__NR_clock_adjtime, sys_clock_adjtime, sys_clock_adjtime32) #define __NR_syncfs 267 __SYSCALL(__NR_syncfs, sys_syncfs) #define __NR_setns 268 diff --git a/ipc/mqueue.c b/ipc/mqueue.c index c595bed7bfcb..c839bf83231d 100644 --- a/ipc/mqueue.c +++ b/ipc/mqueue.c @@ -1471,10 +1471,10 @@ static int compat_prepare_timeout(const struct old_timespec32 __user *p, return 0; } -COMPAT_SYSCALL_DEFINE5(mq_timedsend, mqd_t, mqdes, - const char __user *, u_msg_ptr, - compat_size_t, msg_len, unsigned int, msg_prio, - const struct old_timespec32 __user *, u_abs_timeout) +SYSCALL_DEFINE5(mq_timedsend_time32, mqd_t, mqdes, + const char __user *, u_msg_ptr, + unsigned int, msg_len, unsigned int, msg_prio, + const struct old_timespec32 __user *, u_abs_timeout) { struct timespec64 ts, *p = NULL; if (u_abs_timeout) { @@ -1486,10 +1486,10 @@ COMPAT_SYSCALL_DEFINE5(mq_timedsend, mqd_t, mqdes, return do_mq_timedsend(mqdes, u_msg_ptr, msg_len, msg_prio, p); } -COMPAT_SYSCALL_DEFINE5(mq_timedreceive, mqd_t, mqdes, - char __user *, u_msg_ptr, - compat_size_t, msg_len, unsigned int __user *, u_msg_prio, - const struct old_timespec32 __user *, u_abs_timeout) +SYSCALL_DEFINE5(mq_timedreceive_time32, mqd_t, mqdes, + char __user *, u_msg_ptr, + unsigned int, msg_len, unsigned int __user *, u_msg_prio, + const struct old_timespec32 __user *, u_abs_timeout) { struct timespec64 ts, *p = NULL; if (u_abs_timeout) { diff --git a/ipc/sem.c b/ipc/sem.c index d1efff3a81bb..80909464acff 100644 --- a/ipc/sem.c +++ b/ipc/sem.c @@ -2250,7 +2250,7 @@ long compat_ksys_semtimedop(int semid, struct sembuf __user *tsems, return do_semtimedop(semid, tsems, nsops, NULL); } -COMPAT_SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsems, +SYSCALL_DEFINE4(semtimedop_time32, int, semid, struct sembuf __user *, tsems, unsigned int, nsops, const struct old_timespec32 __user *, timeout) { diff --git a/kernel/futex.c b/kernel/futex.c index be3bff2315ff..caead6c113d4 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -3812,7 +3812,7 @@ err_unlock: #endif /* CONFIG_COMPAT */ #ifdef CONFIG_COMPAT_32BIT_TIME -COMPAT_SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val, +SYSCALL_DEFINE6(futex_time32, u32 __user *, uaddr, int, op, u32, val, struct old_timespec32 __user *, utime, u32 __user *, uaddr2, u32, val3) { diff --git a/kernel/sched/core.c b/kernel/sched/core.c index a674c7db2f29..62862419cd05 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5252,9 +5252,8 @@ SYSCALL_DEFINE2(sched_rr_get_interval, pid_t, pid, } #ifdef CONFIG_COMPAT_32BIT_TIME -COMPAT_SYSCALL_DEFINE2(sched_rr_get_interval, - compat_pid_t, pid, - struct old_timespec32 __user *, interval) +SYSCALL_DEFINE2(sched_rr_get_interval_time32, pid_t, pid, + struct old_timespec32 __user *, interval) { struct timespec64 t; int retval = sched_rr_get_interval(pid, &t); diff --git a/kernel/signal.c b/kernel/signal.c index e1d7ad8e6ab1..af27629918cf 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -3397,7 +3397,7 @@ COMPAT_SYSCALL_DEFINE4(rt_sigtimedwait_time64, compat_sigset_t __user *, uthese, } #ifdef CONFIG_COMPAT_32BIT_TIME -COMPAT_SYSCALL_DEFINE4(rt_sigtimedwait, compat_sigset_t __user *, uthese, +COMPAT_SYSCALL_DEFINE4(rt_sigtimedwait_time32, compat_sigset_t __user *, uthese, struct compat_siginfo __user *, uinfo, struct old_timespec32 __user *, uts, compat_size_t, sigsetsize) { diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index ce04431a40d1..85e5ccec0955 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -42,9 +42,11 @@ COND_SYSCALL(io_destroy); COND_SYSCALL(io_submit); COND_SYSCALL_COMPAT(io_submit); COND_SYSCALL(io_cancel); +COND_SYSCALL(io_getevents_time32); COND_SYSCALL(io_getevents); +COND_SYSCALL(io_pgetevents_time32); COND_SYSCALL(io_pgetevents); -COND_SYSCALL_COMPAT(io_getevents); +COND_SYSCALL_COMPAT(io_pgetevents_time32); COND_SYSCALL_COMPAT(io_pgetevents); /* fs/xattr.c */ @@ -114,9 +116,9 @@ COND_SYSCALL_COMPAT(signalfd4); /* fs/timerfd.c */ COND_SYSCALL(timerfd_create); COND_SYSCALL(timerfd_settime); -COND_SYSCALL_COMPAT(timerfd_settime); +COND_SYSCALL(timerfd_settime32); COND_SYSCALL(timerfd_gettime); -COND_SYSCALL_COMPAT(timerfd_gettime); +COND_SYSCALL(timerfd_gettime32); /* fs/utimes.c */ @@ -135,7 +137,7 @@ COND_SYSCALL(capset); /* kernel/futex.c */ COND_SYSCALL(futex); -COND_SYSCALL_COMPAT(futex); +COND_SYSCALL(futex_time32); COND_SYSCALL(set_robust_list); COND_SYSCALL_COMPAT(set_robust_list); COND_SYSCALL(get_robust_list); @@ -187,9 +189,9 @@ COND_SYSCALL(mq_open); COND_SYSCALL_COMPAT(mq_open); COND_SYSCALL(mq_unlink); COND_SYSCALL(mq_timedsend); -COND_SYSCALL_COMPAT(mq_timedsend); +COND_SYSCALL(mq_timedsend_time32); COND_SYSCALL(mq_timedreceive); -COND_SYSCALL_COMPAT(mq_timedreceive); +COND_SYSCALL(mq_timedreceive_time32); COND_SYSCALL(mq_notify); COND_SYSCALL_COMPAT(mq_notify); COND_SYSCALL(mq_getsetattr); @@ -211,7 +213,7 @@ COND_SYSCALL(old_semctl); COND_SYSCALL(semctl); COND_SYSCALL_COMPAT(semctl); COND_SYSCALL(semtimedop); -COND_SYSCALL_COMPAT(semtimedop); +COND_SYSCALL(semtimedop_time32); COND_SYSCALL(semop); /* ipc/shm.c */ @@ -288,7 +290,7 @@ COND_SYSCALL(perf_event_open); COND_SYSCALL(accept4); COND_SYSCALL(recvmmsg); COND_SYSCALL(recvmmsg_time32); -COND_SYSCALL_COMPAT(recvmmsg); +COND_SYSCALL_COMPAT(recvmmsg_time32); COND_SYSCALL_COMPAT(recvmmsg_time64); /* diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c index f5cfa1b73d6f..0f5f96075110 100644 --- a/kernel/time/hrtimer.c +++ b/kernel/time/hrtimer.c @@ -1771,7 +1771,7 @@ SYSCALL_DEFINE2(nanosleep, struct __kernel_timespec __user *, rqtp, #ifdef CONFIG_COMPAT_32BIT_TIME -COMPAT_SYSCALL_DEFINE2(nanosleep, struct old_timespec32 __user *, rqtp, +SYSCALL_DEFINE2(nanosleep_time32, struct old_timespec32 __user *, rqtp, struct old_timespec32 __user *, rmtp) { struct timespec64 tu; diff --git a/kernel/time/posix-stubs.c b/kernel/time/posix-stubs.c index a51895486e5e..67df65f887ac 100644 --- a/kernel/time/posix-stubs.c +++ b/kernel/time/posix-stubs.c @@ -45,6 +45,7 @@ SYS_NI(timer_delete); SYS_NI(clock_adjtime); SYS_NI(getitimer); SYS_NI(setitimer); +SYS_NI(clock_adjtime32); #ifdef __ARCH_WANT_SYS_ALARM SYS_NI(alarm); #endif @@ -150,16 +151,16 @@ SYSCALL_DEFINE4(clock_nanosleep, const clockid_t, which_clock, int, flags, #ifdef CONFIG_COMPAT COMPAT_SYS_NI(timer_create); -COMPAT_SYS_NI(clock_adjtime); -COMPAT_SYS_NI(timer_settime); -COMPAT_SYS_NI(timer_gettime); COMPAT_SYS_NI(getitimer); COMPAT_SYS_NI(setitimer); #endif #ifdef CONFIG_COMPAT_32BIT_TIME -COMPAT_SYSCALL_DEFINE2(clock_settime, const clockid_t, which_clock, - struct old_timespec32 __user *, tp) +SYS_NI(timer_settime32); +SYS_NI(timer_gettime32); + +SYSCALL_DEFINE2(clock_settime32, const clockid_t, which_clock, + struct old_timespec32 __user *, tp) { struct timespec64 new_tp; @@ -171,8 +172,8 @@ COMPAT_SYSCALL_DEFINE2(clock_settime, const clockid_t, which_clock, return do_sys_settimeofday64(&new_tp, NULL); } -COMPAT_SYSCALL_DEFINE2(clock_gettime, clockid_t, which_clock, - struct old_timespec32 __user *, tp) +SYSCALL_DEFINE2(clock_gettime32, clockid_t, which_clock, + struct old_timespec32 __user *, tp) { int ret; struct timespec64 kernel_tp; @@ -186,8 +187,8 @@ COMPAT_SYSCALL_DEFINE2(clock_gettime, clockid_t, which_clock, return 0; } -COMPAT_SYSCALL_DEFINE2(clock_getres, clockid_t, which_clock, - struct old_timespec32 __user *, tp) +SYSCALL_DEFINE2(clock_getres_time32, clockid_t, which_clock, + struct old_timespec32 __user *, tp) { struct timespec64 rtn_tp = { .tv_sec = 0, @@ -206,9 +207,9 @@ COMPAT_SYSCALL_DEFINE2(clock_getres, clockid_t, which_clock, } } -COMPAT_SYSCALL_DEFINE4(clock_nanosleep, clockid_t, which_clock, int, flags, - struct old_timespec32 __user *, rqtp, - struct old_timespec32 __user *, rmtp) +SYSCALL_DEFINE4(clock_nanosleep_time32, clockid_t, which_clock, int, flags, + struct old_timespec32 __user *, rqtp, + struct old_timespec32 __user *, rmtp) { struct timespec64 t; diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c index de79f85ae14f..29176635991f 100644 --- a/kernel/time/posix-timers.c +++ b/kernel/time/posix-timers.c @@ -730,8 +730,8 @@ SYSCALL_DEFINE2(timer_gettime, timer_t, timer_id, #ifdef CONFIG_COMPAT_32BIT_TIME -COMPAT_SYSCALL_DEFINE2(timer_gettime, timer_t, timer_id, - struct old_itimerspec32 __user *, setting) +SYSCALL_DEFINE2(timer_gettime32, timer_t, timer_id, + struct old_itimerspec32 __user *, setting) { struct itimerspec64 cur_setting; @@ -903,9 +903,9 @@ SYSCALL_DEFINE4(timer_settime, timer_t, timer_id, int, flags, } #ifdef CONFIG_COMPAT_32BIT_TIME -COMPAT_SYSCALL_DEFINE4(timer_settime, timer_t, timer_id, int, flags, - struct old_itimerspec32 __user *, new, - struct old_itimerspec32 __user *, old) +SYSCALL_DEFINE4(timer_settime32, timer_t, timer_id, int, flags, + struct old_itimerspec32 __user *, new, + struct old_itimerspec32 __user *, old) { struct itimerspec64 new_spec, old_spec; struct itimerspec64 *rtn = old ? &old_spec : NULL; @@ -1096,8 +1096,8 @@ SYSCALL_DEFINE2(clock_getres, const clockid_t, which_clock, #ifdef CONFIG_COMPAT_32BIT_TIME -COMPAT_SYSCALL_DEFINE2(clock_settime, clockid_t, which_clock, - struct old_timespec32 __user *, tp) +SYSCALL_DEFINE2(clock_settime32, clockid_t, which_clock, + struct old_timespec32 __user *, tp) { const struct k_clock *kc = clockid_to_kclock(which_clock); struct timespec64 ts; @@ -1111,8 +1111,8 @@ COMPAT_SYSCALL_DEFINE2(clock_settime, clockid_t, which_clock, return kc->clock_set(which_clock, &ts); } -COMPAT_SYSCALL_DEFINE2(clock_gettime, clockid_t, which_clock, - struct old_timespec32 __user *, tp) +SYSCALL_DEFINE2(clock_gettime32, clockid_t, which_clock, + struct old_timespec32 __user *, tp) { const struct k_clock *kc = clockid_to_kclock(which_clock); struct timespec64 ts; @@ -1129,8 +1129,8 @@ COMPAT_SYSCALL_DEFINE2(clock_gettime, clockid_t, which_clock, return err; } -COMPAT_SYSCALL_DEFINE2(clock_adjtime, clockid_t, which_clock, - struct old_timex32 __user *, utp) +SYSCALL_DEFINE2(clock_adjtime32, clockid_t, which_clock, + struct old_timex32 __user *, utp) { struct __kernel_timex ktx; int err; @@ -1147,8 +1147,8 @@ COMPAT_SYSCALL_DEFINE2(clock_adjtime, clockid_t, which_clock, return err; } -COMPAT_SYSCALL_DEFINE2(clock_getres, clockid_t, which_clock, - struct old_timespec32 __user *, tp) +SYSCALL_DEFINE2(clock_getres_time32, clockid_t, which_clock, + struct old_timespec32 __user *, tp) { const struct k_clock *kc = clockid_to_kclock(which_clock); struct timespec64 ts; @@ -1204,9 +1204,9 @@ SYSCALL_DEFINE4(clock_nanosleep, const clockid_t, which_clock, int, flags, #ifdef CONFIG_COMPAT_32BIT_TIME -COMPAT_SYSCALL_DEFINE4(clock_nanosleep, clockid_t, which_clock, int, flags, - struct old_timespec32 __user *, rqtp, - struct old_timespec32 __user *, rmtp) +SYSCALL_DEFINE4(clock_nanosleep_time32, clockid_t, which_clock, int, flags, + struct old_timespec32 __user *, rqtp, + struct old_timespec32 __user *, rmtp) { const struct k_clock *kc = clockid_to_kclock(which_clock); struct timespec64 t; diff --git a/kernel/time/time.c b/kernel/time/time.c index 78b5c8f1495a..6261f969dcb7 100644 --- a/kernel/time/time.c +++ b/kernel/time/time.c @@ -98,11 +98,11 @@ SYSCALL_DEFINE1(stime, time_t __user *, tptr) #endif /* __ARCH_WANT_SYS_TIME */ -#ifdef CONFIG_COMPAT +#ifdef CONFIG_COMPAT_32BIT_TIME #ifdef __ARCH_WANT_COMPAT_SYS_TIME /* old_time32_t is a 32 bit "long" and needs to get converted. */ -COMPAT_SYSCALL_DEFINE1(time, old_time32_t __user *, tloc) +SYSCALL_DEFINE1(time32, old_time32_t __user *, tloc) { old_time32_t i; @@ -116,7 +116,7 @@ COMPAT_SYSCALL_DEFINE1(time, old_time32_t __user *, tloc) return i; } -COMPAT_SYSCALL_DEFINE1(stime, old_time32_t __user *, tptr) +SYSCALL_DEFINE1(stime32, old_time32_t __user *, tptr) { struct timespec64 tv; int err; @@ -344,7 +344,7 @@ int put_old_timex32(struct old_timex32 __user *utp, const struct __kernel_timex return 0; } -COMPAT_SYSCALL_DEFINE1(adjtimex, struct old_timex32 __user *, utp) +SYSCALL_DEFINE1(adjtimex_time32, struct old_timex32 __user *, utp) { struct __kernel_timex txc; int err, ret; diff --git a/net/compat.c b/net/compat.c index 959d1c51826d..2fef7b9db434 100644 --- a/net/compat.c +++ b/net/compat.c @@ -822,7 +822,7 @@ COMPAT_SYSCALL_DEFINE5(recvmmsg_time64, int, fd, struct compat_mmsghdr __user *, } #ifdef CONFIG_COMPAT_32BIT_TIME -COMPAT_SYSCALL_DEFINE5(recvmmsg, int, fd, struct compat_mmsghdr __user *, mmsg, +COMPAT_SYSCALL_DEFINE5(recvmmsg_time32, int, fd, struct compat_mmsghdr __user *, mmsg, unsigned int, vlen, unsigned int, flags, struct old_timespec32 __user *, timeout) { -- cgit v1.3-7-g2ca7 From d33c577cccd0b3e5bb2425f85037f26714a59363 Mon Sep 17 00:00:00 2001 From: Arnd Bergmann Date: Sun, 6 Jan 2019 23:45:29 +0100 Subject: y2038: rename old time and utime syscalls The time, stime, utime, utimes, and futimesat system calls are only used on older architectures, and we do not provide y2038 safe variants of them, as they are replaced by clock_gettime64, clock_settime64, and utimensat_time64. However, for consistency it seems better to have the 32-bit architectures that still use them call the "time32" entry points (leaving the traditional handlers for the 64-bit architectures), like we do for system calls that now require two versions. Note: We used to always define __ARCH_WANT_SYS_TIME and __ARCH_WANT_SYS_UTIME and only set __ARCH_WANT_COMPAT_SYS_TIME and __ARCH_WANT_SYS_UTIME32 for compat mode on 64-bit kernels. Now this is reversed: only 64-bit architectures set __ARCH_WANT_SYS_TIME/UTIME, while we need __ARCH_WANT_SYS_TIME32/UTIME32 for 32-bit architectures and compat mode. The resulting asm/unistd.h changes look a bit counterintuitive. This is only a cleanup patch and it should not change any behavior. Signed-off-by: Arnd Bergmann Acked-by: Geert Uytterhoeven Acked-by: Heiko Carstens --- arch/arm/include/asm/unistd.h | 4 ++-- arch/arm/tools/syscall.tbl | 10 +++++----- arch/m68k/include/asm/unistd.h | 4 ++-- arch/m68k/kernel/syscalls/syscall.tbl | 10 +++++----- arch/microblaze/include/asm/unistd.h | 4 ++-- arch/microblaze/kernel/syscalls/syscall.tbl | 10 +++++----- arch/mips/include/asm/unistd.h | 4 ++-- arch/mips/kernel/syscalls/syscall_o32.tbl | 10 +++++----- arch/parisc/include/asm/unistd.h | 9 ++++++--- arch/parisc/kernel/syscalls/syscall.tbl | 15 ++++++++++----- arch/powerpc/include/asm/unistd.h | 8 ++++---- arch/powerpc/kernel/syscalls/syscall.tbl | 19 ++++++++++++++----- arch/s390/include/asm/unistd.h | 2 +- arch/sh/include/asm/unistd.h | 4 ++-- arch/sh/kernel/syscalls/syscall.tbl | 10 +++++----- arch/sparc/include/asm/unistd.h | 8 ++++---- arch/sparc/kernel/syscalls/syscall.tbl | 14 +++++++++----- arch/x86/entry/syscalls/syscall_32.tbl | 10 +++++----- arch/x86/include/asm/unistd.h | 8 ++++---- arch/xtensa/include/asm/unistd.h | 2 +- arch/xtensa/kernel/syscalls/syscall.tbl | 6 +++--- kernel/time/time.c | 4 ++-- 22 files changed, 98 insertions(+), 77 deletions(-) (limited to 'kernel') diff --git a/arch/arm/include/asm/unistd.h b/arch/arm/include/asm/unistd.h index d713587dfcf4..7a39e77984ef 100644 --- a/arch/arm/include/asm/unistd.h +++ b/arch/arm/include/asm/unistd.h @@ -26,10 +26,10 @@ #define __ARCH_WANT_SYS_SIGPROCMASK #define __ARCH_WANT_SYS_OLD_MMAP #define __ARCH_WANT_SYS_OLD_SELECT -#define __ARCH_WANT_SYS_UTIME +#define __ARCH_WANT_SYS_UTIME32 #if !defined(CONFIG_AEABI) || defined(CONFIG_OABI_COMPAT) -#define __ARCH_WANT_SYS_TIME +#define __ARCH_WANT_SYS_TIME32 #define __ARCH_WANT_SYS_IPC #define __ARCH_WANT_SYS_OLDUMOUNT #define __ARCH_WANT_SYS_ALARM diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl index 200f4b878a46..a96d9b5ee04e 100644 --- a/arch/arm/tools/syscall.tbl +++ b/arch/arm/tools/syscall.tbl @@ -24,7 +24,7 @@ 10 common unlink sys_unlink 11 common execve sys_execve 12 common chdir sys_chdir -13 oabi time sys_time +13 oabi time sys_time32 14 common mknod sys_mknod 15 common chmod sys_chmod 16 common lchown sys_lchown16 @@ -36,12 +36,12 @@ 22 oabi umount sys_oldumount 23 common setuid sys_setuid16 24 common getuid sys_getuid16 -25 oabi stime sys_stime +25 oabi stime sys_stime32 26 common ptrace sys_ptrace 27 oabi alarm sys_alarm # 28 was sys_fstat 29 common pause sys_pause -30 oabi utime sys_utime +30 oabi utime sys_utime32 # 31 was sys_stty # 32 was sys_gtty 33 common access sys_access @@ -283,7 +283,7 @@ 266 common statfs64 sys_statfs64_wrapper 267 common fstatfs64 sys_fstatfs64_wrapper 268 common tgkill sys_tgkill -269 common utimes sys_utimes +269 common utimes sys_utimes_time32 270 common arm_fadvise64_64 sys_arm_fadvise64_64 271 common pciconfig_iobase sys_pciconfig_iobase 272 common pciconfig_read sys_pciconfig_read @@ -340,7 +340,7 @@ 323 common mkdirat sys_mkdirat 324 common mknodat sys_mknodat 325 common fchownat sys_fchownat -326 common futimesat sys_futimesat +326 common futimesat sys_futimesat_time32 327 common fstatat64 sys_fstatat64 sys_oabi_fstatat64 328 common unlinkat sys_unlinkat 329 common renameat sys_renameat diff --git a/arch/m68k/include/asm/unistd.h b/arch/m68k/include/asm/unistd.h index 49d5de18646b..2e0047cf86f8 100644 --- a/arch/m68k/include/asm/unistd.h +++ b/arch/m68k/include/asm/unistd.h @@ -15,8 +15,8 @@ #define __ARCH_WANT_SYS_IPC #define __ARCH_WANT_SYS_PAUSE #define __ARCH_WANT_SYS_SIGNAL -#define __ARCH_WANT_SYS_TIME -#define __ARCH_WANT_SYS_UTIME +#define __ARCH_WANT_SYS_TIME32 +#define __ARCH_WANT_SYS_UTIME32 #define __ARCH_WANT_SYS_WAITPID #define __ARCH_WANT_SYS_SOCKETCALL #define __ARCH_WANT_SYS_FADVISE64 diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl index 01529dce1d2e..253bd2a069bd 100644 --- a/arch/m68k/kernel/syscalls/syscall.tbl +++ b/arch/m68k/kernel/syscalls/syscall.tbl @@ -20,7 +20,7 @@ 10 common unlink sys_unlink 11 common execve sys_execve 12 common chdir sys_chdir -13 common time sys_time +13 common time sys_time32 14 common mknod sys_mknod 15 common chmod sys_chmod 16 common chown sys_chown16 @@ -32,12 +32,12 @@ 22 common umount sys_oldumount 23 common setuid sys_setuid16 24 common getuid sys_getuid16 -25 common stime sys_stime +25 common stime sys_stime32 26 common ptrace sys_ptrace 27 common alarm sys_alarm 28 common oldfstat sys_fstat 29 common pause sys_pause -30 common utime sys_utime +30 common utime sys_utime32 # 31 was stty # 32 was gtty 33 common access sys_access @@ -273,7 +273,7 @@ 263 common statfs64 sys_statfs64 264 common fstatfs64 sys_fstatfs64 265 common tgkill sys_tgkill -266 common utimes sys_utimes +266 common utimes sys_utimes_time32 267 common fadvise64_64 sys_fadvise64_64 268 common mbind sys_mbind 269 common get_mempolicy sys_get_mempolicy @@ -299,7 +299,7 @@ 289 common mkdirat sys_mkdirat 290 common mknodat sys_mknodat 291 common fchownat sys_fchownat -292 common futimesat sys_futimesat +292 common futimesat sys_futimesat_time32 293 common fstatat64 sys_fstatat64 294 common unlinkat sys_unlinkat 295 common renameat sys_renameat diff --git a/arch/microblaze/include/asm/unistd.h b/arch/microblaze/include/asm/unistd.h index 9b7c2c4eaf12..d79d35ac6253 100644 --- a/arch/microblaze/include/asm/unistd.h +++ b/arch/microblaze/include/asm/unistd.h @@ -21,8 +21,8 @@ #define __ARCH_WANT_SYS_GETHOSTNAME #define __ARCH_WANT_SYS_PAUSE #define __ARCH_WANT_SYS_SIGNAL -#define __ARCH_WANT_SYS_TIME -#define __ARCH_WANT_SYS_UTIME +#define __ARCH_WANT_SYS_TIME32 +#define __ARCH_WANT_SYS_UTIME32 #define __ARCH_WANT_SYS_WAITPID #define __ARCH_WANT_SYS_SOCKETCALL #define __ARCH_WANT_SYS_FADVISE64 diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl index 492ff5c35b68..44a87649d681 100644 --- a/arch/microblaze/kernel/syscalls/syscall.tbl +++ b/arch/microblaze/kernel/syscalls/syscall.tbl @@ -20,7 +20,7 @@ 10 common unlink sys_unlink 11 common execve sys_execve 12 common chdir sys_chdir -13 common time sys_time +13 common time sys_time32 14 common mknod sys_mknod 15 common chmod sys_chmod 16 common lchown sys_lchown @@ -32,12 +32,12 @@ 22 common umount sys_oldumount 23 common setuid sys_setuid 24 common getuid sys_getuid -25 common stime sys_stime +25 common stime sys_stime32 26 common ptrace sys_ptrace 27 common alarm sys_alarm 28 common oldfstat sys_ni_syscall 29 common pause sys_pause -30 common utime sys_utime +30 common utime sys_utime32 31 common stty sys_ni_syscall 32 common gtty sys_ni_syscall 33 common access sys_access @@ -278,7 +278,7 @@ 268 common statfs64 sys_statfs64 269 common fstatfs64 sys_fstatfs64 270 common tgkill sys_tgkill -271 common utimes sys_utimes +271 common utimes sys_utimes_time32 272 common fadvise64_64 sys_fadvise64_64 273 common vserver sys_ni_syscall 274 common mbind sys_mbind @@ -306,7 +306,7 @@ 296 common mkdirat sys_mkdirat 297 common mknodat sys_mknodat 298 common fchownat sys_fchownat -299 common futimesat sys_futimesat +299 common futimesat sys_futimesat_time32 300 common fstatat64 sys_fstatat64 301 common unlinkat sys_unlinkat 302 common renameat sys_renameat diff --git a/arch/mips/include/asm/unistd.h b/arch/mips/include/asm/unistd.h index 75c590229a23..071053ece677 100644 --- a/arch/mips/include/asm/unistd.h +++ b/arch/mips/include/asm/unistd.h @@ -45,10 +45,10 @@ #define __ARCH_WANT_SYS_SIGPROCMASK # ifdef CONFIG_32BIT # define __ARCH_WANT_STAT64 -# define __ARCH_WANT_SYS_TIME +# define __ARCH_WANT_SYS_TIME32 # endif # ifdef CONFIG_MIPS32_O32 -# define __ARCH_WANT_COMPAT_SYS_TIME +# define __ARCH_WANT_SYS_TIME32 # endif #define __ARCH_WANT_SYS_FORK #define __ARCH_WANT_SYS_CLONE diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl index 5642d93b64c0..54312c5b5343 100644 --- a/arch/mips/kernel/syscalls/syscall_o32.tbl +++ b/arch/mips/kernel/syscalls/syscall_o32.tbl @@ -20,7 +20,7 @@ 10 o32 unlink sys_unlink 11 o32 execve sys_execve compat_sys_execve 12 o32 chdir sys_chdir -13 o32 time sys_time sys_time32 +13 o32 time sys_time32 14 o32 mknod sys_mknod 15 o32 chmod sys_chmod 16 o32 lchown sys_lchown @@ -33,13 +33,13 @@ 22 o32 umount sys_oldumount 23 o32 setuid sys_setuid 24 o32 getuid sys_getuid -25 o32 stime sys_stime sys_stime32 +25 o32 stime sys_stime32 26 o32 ptrace sys_ptrace compat_sys_ptrace 27 o32 alarm sys_alarm # 28 was sys_fstat 28 o32 unused28 sys_ni_syscall 29 o32 pause sys_pause -30 o32 utime sys_utime sys_utime32 +30 o32 utime sys_utime32 31 o32 stty sys_ni_syscall 32 o32 gtty sys_ni_syscall 33 o32 access sys_access @@ -278,7 +278,7 @@ 264 o32 clock_getres sys_clock_getres_time32 265 o32 clock_nanosleep sys_clock_nanosleep_time32 266 o32 tgkill sys_tgkill -267 o32 utimes sys_utimes sys_utimes_time32 +267 o32 utimes sys_utimes_time32 268 o32 mbind sys_mbind compat_sys_mbind 269 o32 get_mempolicy sys_get_mempolicy compat_sys_get_mempolicy 270 o32 set_mempolicy sys_set_mempolicy compat_sys_set_mempolicy @@ -303,7 +303,7 @@ 289 o32 mkdirat sys_mkdirat 290 o32 mknodat sys_mknodat 291 o32 fchownat sys_fchownat -292 o32 futimesat sys_futimesat sys_futimesat_time32 +292 o32 futimesat sys_futimesat_time32 293 o32 fstatat64 sys_fstatat64 sys_newfstatat 294 o32 unlinkat sys_unlinkat 295 o32 renameat sys_renameat diff --git a/arch/parisc/include/asm/unistd.h b/arch/parisc/include/asm/unistd.h index ac742b80e333..b0838dc4dfee 100644 --- a/arch/parisc/include/asm/unistd.h +++ b/arch/parisc/include/asm/unistd.h @@ -152,10 +152,8 @@ type name(type1 arg1, type2 arg2, type3 arg3, type4 arg4, type5 arg5) \ #define __ARCH_WANT_SYS_GETHOSTNAME #define __ARCH_WANT_SYS_PAUSE #define __ARCH_WANT_SYS_SIGNAL -#define __ARCH_WANT_SYS_TIME -#define __ARCH_WANT_COMPAT_SYS_TIME +#define __ARCH_WANT_SYS_TIME32 #define __ARCH_WANT_COMPAT_SYS_SCHED_RR_GET_INTERVAL -#define __ARCH_WANT_SYS_UTIME #define __ARCH_WANT_SYS_UTIME32 #define __ARCH_WANT_SYS_WAITPID #define __ARCH_WANT_SYS_SOCKETCALL @@ -170,6 +168,11 @@ type name(type1 arg1, type2 arg2, type3 arg3, type4 arg4, type5 arg5) \ #define __ARCH_WANT_SYS_CLONE #define __ARCH_WANT_COMPAT_SYS_SENDFILE +#ifdef CONFIG_64BIT +#define __ARCH_WANT_SYS_TIME +#define __ARCH_WANT_SYS_UTIME +#endif + #endif /* __ASSEMBLY__ */ #undef STR diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl index b110c7371b08..7eff3dc3d613 100644 --- a/arch/parisc/kernel/syscalls/syscall.tbl +++ b/arch/parisc/kernel/syscalls/syscall.tbl @@ -20,7 +20,8 @@ 10 common unlink sys_unlink 11 common execve sys_execve compat_sys_execve 12 common chdir sys_chdir -13 common time sys_time sys_time32 +13 32 time sys_time32 +13 64 time sys_time 14 common mknod sys_mknod 15 common chmod sys_chmod 16 common lchown sys_lchown @@ -32,12 +33,14 @@ 22 common bind sys_bind 23 common setuid sys_setuid 24 common getuid sys_getuid -25 common stime sys_stime sys_stime32 +25 32 stime sys_stime32 +25 64 stime sys_stime 26 common ptrace sys_ptrace compat_sys_ptrace 27 common alarm sys_alarm 28 common fstat sys_newfstat compat_sys_newfstat 29 common pause sys_pause -30 common utime sys_utime sys_utime32 +30 32 utime sys_utime32 +30 64 utime sys_utime 31 common connect sys_connect 32 common listen sys_listen 33 common access sys_access @@ -310,7 +313,8 @@ 276 common mkdirat sys_mkdirat 277 common mknodat sys_mknodat 278 common fchownat sys_fchownat -279 common futimesat sys_futimesat sys_futimesat_time32 +279 32 futimesat sys_futimesat_time32 +279 64 futimesat sys_futimesat 280 common fstatat64 sys_fstatat64 281 common unlinkat sys_unlinkat 282 common renameat sys_renameat @@ -374,7 +378,8 @@ 333 common finit_module sys_finit_module 334 common sched_setattr sys_sched_setattr 335 common sched_getattr sys_sched_getattr -336 common utimes sys_utimes sys_utimes_time32 +336 32 utimes sys_utimes_time32 +336 64 utimes sys_utimes 337 common renameat2 sys_renameat2 338 common seccomp sys_seccomp 339 common getrandom sys_getrandom diff --git a/arch/powerpc/include/asm/unistd.h b/arch/powerpc/include/asm/unistd.h index a3c35e6d6ffb..f44dbc65e38e 100644 --- a/arch/powerpc/include/asm/unistd.h +++ b/arch/powerpc/include/asm/unistd.h @@ -29,8 +29,8 @@ #define __ARCH_WANT_SYS_IPC #define __ARCH_WANT_SYS_PAUSE #define __ARCH_WANT_SYS_SIGNAL -#define __ARCH_WANT_SYS_TIME -#define __ARCH_WANT_SYS_UTIME +#define __ARCH_WANT_SYS_TIME32 +#define __ARCH_WANT_SYS_UTIME32 #define __ARCH_WANT_SYS_WAITPID #define __ARCH_WANT_SYS_SOCKETCALL #define __ARCH_WANT_SYS_FADVISE64 @@ -45,8 +45,8 @@ #define __ARCH_WANT_OLD_STAT #endif #ifdef CONFIG_PPC64 -#define __ARCH_WANT_COMPAT_SYS_TIME -#define __ARCH_WANT_SYS_UTIME32 +#define __ARCH_WANT_SYS_TIME +#define __ARCH_WANT_SYS_UTIME #define __ARCH_WANT_SYS_NEWFSTATAT #define __ARCH_WANT_COMPAT_SYS_SENDFILE #endif diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl index 2a8b060f73b3..500edbf9e8a6 100644 --- a/arch/powerpc/kernel/syscalls/syscall.tbl +++ b/arch/powerpc/kernel/syscalls/syscall.tbl @@ -20,7 +20,9 @@ 10 common unlink sys_unlink 11 nospu execve sys_execve compat_sys_execve 12 common chdir sys_chdir -13 common time sys_time sys_time32 +13 32 time sys_time32 +13 64 time sys_time +13 spu time sys_time 14 common mknod sys_mknod 15 common chmod sys_chmod 16 common lchown sys_lchown @@ -36,14 +38,17 @@ 22 spu umount sys_ni_syscall 23 common setuid sys_setuid 24 common getuid sys_getuid -25 common stime sys_stime sys_stime32 +25 32 stime sys_stime32 +25 64 stime sys_stime +25 spu stime sys_stime 26 nospu ptrace sys_ptrace compat_sys_ptrace 27 common alarm sys_alarm 28 32 oldfstat sys_fstat sys_ni_syscall 28 64 oldfstat sys_ni_syscall 28 spu oldfstat sys_ni_syscall 29 nospu pause sys_pause -30 nospu utime sys_utime sys_utime32 +30 32 utime sys_utime32 +30 64 utime sys_utime 31 common stty sys_ni_syscall 32 common gtty sys_ni_syscall 33 common access sys_access @@ -315,7 +320,9 @@ 249 64 swapcontext ppc64_swapcontext 249 spu swapcontext sys_ni_syscall 250 common tgkill sys_tgkill -251 common utimes sys_utimes sys_utimes_time32 +251 32 utimes sys_utimes_time32 +251 64 utimes sys_utimes +251 spu utimes sys_utimes 252 common statfs64 sys_statfs64 compat_sys_statfs64 253 common fstatfs64 sys_fstatfs64 compat_sys_fstatfs64 254 32 fadvise64_64 ppc_fadvise64_64 @@ -361,7 +368,9 @@ 287 common mkdirat sys_mkdirat 288 common mknodat sys_mknodat 289 common fchownat sys_fchownat -290 common futimesat sys_futimesat sys_futimesat_time32 +290 32 futimesat sys_futimesat_time32 +290 64 futimesat sys_futimesat +290 spu utimesat sys_futimesat 291 32 fstatat64 sys_fstatat64 291 64 newfstatat sys_newfstatat 291 spu newfstatat sys_newfstatat diff --git a/arch/s390/include/asm/unistd.h b/arch/s390/include/asm/unistd.h index 59202ceea1f6..b6755685c7b8 100644 --- a/arch/s390/include/asm/unistd.h +++ b/arch/s390/include/asm/unistd.h @@ -28,7 +28,7 @@ #define __ARCH_WANT_SYS_SIGPENDING #define __ARCH_WANT_SYS_SIGPROCMASK # ifdef CONFIG_COMPAT -# define __ARCH_WANT_COMPAT_SYS_TIME +# define __ARCH_WANT_SYS_TIME32 # define __ARCH_WANT_SYS_UTIME32 # endif #define __ARCH_WANT_SYS_FORK diff --git a/arch/sh/include/asm/unistd.h b/arch/sh/include/asm/unistd.h index a97f93ca3bd7..9c7d9d9999c6 100644 --- a/arch/sh/include/asm/unistd.h +++ b/arch/sh/include/asm/unistd.h @@ -16,8 +16,8 @@ # define __ARCH_WANT_SYS_IPC # define __ARCH_WANT_SYS_PAUSE # define __ARCH_WANT_SYS_SIGNAL -# define __ARCH_WANT_SYS_TIME -# define __ARCH_WANT_SYS_UTIME +# define __ARCH_WANT_SYS_TIME32 +# define __ARCH_WANT_SYS_UTIME32 # define __ARCH_WANT_SYS_WAITPID # define __ARCH_WANT_SYS_SOCKETCALL # define __ARCH_WANT_SYS_FADVISE64 diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl index a08a5312cd12..06d768c3cc4c 100644 --- a/arch/sh/kernel/syscalls/syscall.tbl +++ b/arch/sh/kernel/syscalls/syscall.tbl @@ -20,7 +20,7 @@ 10 common unlink sys_unlink 11 common execve sys_execve 12 common chdir sys_chdir -13 common time sys_time +13 common time sys_time32 14 common mknod sys_mknod 15 common chmod sys_chmod 16 common lchown sys_lchown16 @@ -32,12 +32,12 @@ 22 common umount sys_oldumount 23 common setuid sys_setuid16 24 common getuid sys_getuid16 -25 common stime sys_stime +25 common stime sys_stime32 26 common ptrace sys_ptrace 27 common alarm sys_alarm 28 common oldfstat sys_fstat 29 common pause sys_pause -30 common utime sys_utime +30 common utime sys_utime32 # 31 was stty # 32 was gtty 33 common access sys_access @@ -278,7 +278,7 @@ 268 common statfs64 sys_statfs64 269 common fstatfs64 sys_fstatfs64 270 common tgkill sys_tgkill -271 common utimes sys_utimes +271 common utimes sys_utimes_time32 272 common fadvise64_64 sys_fadvise64_64_wrapper # 273 is reserved for vserver 274 common mbind sys_mbind @@ -306,7 +306,7 @@ 296 common mkdirat sys_mkdirat 297 common mknodat sys_mknodat 298 common fchownat sys_fchownat -299 common futimesat sys_futimesat +299 common futimesat sys_futimesat_time32 300 common fstatat64 sys_fstatat64 301 common unlinkat sys_unlinkat 302 common renameat sys_renameat diff --git a/arch/sparc/include/asm/unistd.h b/arch/sparc/include/asm/unistd.h index 08696ea5dca8..1e66278ba4a5 100644 --- a/arch/sparc/include/asm/unistd.h +++ b/arch/sparc/include/asm/unistd.h @@ -30,8 +30,8 @@ #define __ARCH_WANT_SYS_GETHOSTNAME #define __ARCH_WANT_SYS_PAUSE #define __ARCH_WANT_SYS_SIGNAL -#define __ARCH_WANT_SYS_TIME -#define __ARCH_WANT_SYS_UTIME +#define __ARCH_WANT_SYS_TIME32 +#define __ARCH_WANT_SYS_UTIME32 #define __ARCH_WANT_SYS_WAITPID #define __ARCH_WANT_SYS_SOCKETCALL #define __ARCH_WANT_SYS_FADVISE64 @@ -43,8 +43,8 @@ #ifdef __32bit_syscall_numbers__ #define __ARCH_WANT_SYS_IPC #else -#define __ARCH_WANT_COMPAT_SYS_TIME -#define __ARCH_WANT_SYS_UTIME32 +#define __ARCH_WANT_SYS_TIME +#define __ARCH_WANT_SYS_UTIME #define __ARCH_WANT_COMPAT_SYS_SENDFILE #endif diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl index dc1e08040b39..99c40abd8878 100644 --- a/arch/sparc/kernel/syscalls/syscall.tbl +++ b/arch/sparc/kernel/syscalls/syscall.tbl @@ -44,7 +44,8 @@ 28 common sigaltstack sys_sigaltstack compat_sys_sigaltstack 29 32 pause sys_pause 29 64 pause sys_nis_syscall -30 common utime sys_utime sys_utime32 +30 32 utime sys_utime32 +30 64 utime sys_utime 31 32 lchown32 sys_lchown 32 32 fchown32 sys_fchown 33 common access sys_access @@ -169,7 +170,8 @@ 135 common socketpair sys_socketpair 136 common mkdir sys_mkdir 137 common rmdir sys_rmdir -138 common utimes sys_utimes sys_utimes_time32 +138 32 utimes sys_utimes_time32 +138 64 utimes sys_utimes 139 common stat64 sys_stat64 compat_sys_stat64 140 common sendfile64 sys_sendfile64 141 common getpeername sys_getpeername @@ -274,9 +276,10 @@ 228 common setfsuid sys_setfsuid16 229 common setfsgid sys_setfsgid16 230 common _newselect sys_select compat_sys_select -231 32 time sys_time sys_time32 +231 32 time sys_time32 232 common splice sys_splice -233 common stime sys_stime sys_stime32 +233 32 stime sys_stime32 +233 64 stime sys_stime 234 common statfs64 sys_statfs64 compat_sys_statfs64 235 common fstatfs64 sys_fstatfs64 compat_sys_fstatfs64 236 common _llseek sys_llseek @@ -345,7 +348,8 @@ 285 common mkdirat sys_mkdirat 286 common mknodat sys_mknodat 287 common fchownat sys_fchownat -288 common futimesat sys_futimesat sys_futimesat_time32 +288 32 futimesat sys_futimesat_time32 +288 64 futimesat sys_futimesat 289 common fstatat64 sys_fstatat64 compat_sys_fstatat64 290 common unlinkat sys_unlinkat 291 common renameat sys_renameat diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl index 28888d6b7f61..8c47c1191a53 100644 --- a/arch/x86/entry/syscalls/syscall_32.tbl +++ b/arch/x86/entry/syscalls/syscall_32.tbl @@ -24,7 +24,7 @@ 10 i386 unlink sys_unlink __ia32_sys_unlink 11 i386 execve sys_execve __ia32_compat_sys_execve 12 i386 chdir sys_chdir __ia32_sys_chdir -13 i386 time sys_time __ia32_sys_time32 +13 i386 time sys_time32 __ia32_sys_time32 14 i386 mknod sys_mknod __ia32_sys_mknod 15 i386 chmod sys_chmod __ia32_sys_chmod 16 i386 lchown sys_lchown16 __ia32_sys_lchown16 @@ -36,12 +36,12 @@ 22 i386 umount sys_oldumount __ia32_sys_oldumount 23 i386 setuid sys_setuid16 __ia32_sys_setuid16 24 i386 getuid sys_getuid16 __ia32_sys_getuid16 -25 i386 stime sys_stime __ia32_sys_stime32 +25 i386 stime sys_stime32 __ia32_sys_stime32 26 i386 ptrace sys_ptrace __ia32_compat_sys_ptrace 27 i386 alarm sys_alarm __ia32_sys_alarm 28 i386 oldfstat sys_fstat __ia32_sys_fstat 29 i386 pause sys_pause __ia32_sys_pause -30 i386 utime sys_utime __ia32_sys_utime32 +30 i386 utime sys_utime32 __ia32_sys_utime32 31 i386 stty 32 i386 gtty 33 i386 access sys_access __ia32_sys_access @@ -282,7 +282,7 @@ 268 i386 statfs64 sys_statfs64 __ia32_compat_sys_statfs64 269 i386 fstatfs64 sys_fstatfs64 __ia32_compat_sys_fstatfs64 270 i386 tgkill sys_tgkill __ia32_sys_tgkill -271 i386 utimes sys_utimes __ia32_sys_utimes_time32 +271 i386 utimes sys_utimes_time32 __ia32_sys_utimes_time32 272 i386 fadvise64_64 sys_fadvise64_64 __ia32_compat_sys_x86_fadvise64_64 273 i386 vserver 274 i386 mbind sys_mbind __ia32_sys_mbind @@ -310,7 +310,7 @@ 296 i386 mkdirat sys_mkdirat __ia32_sys_mkdirat 297 i386 mknodat sys_mknodat __ia32_sys_mknodat 298 i386 fchownat sys_fchownat __ia32_sys_fchownat -299 i386 futimesat sys_futimesat __ia32_sys_futimesat_time32 +299 i386 futimesat sys_futimesat_time32 __ia32_sys_futimesat_time32 300 i386 fstatat64 sys_fstatat64 __ia32_compat_sys_x86_fstatat 301 i386 unlinkat sys_unlinkat __ia32_sys_unlinkat 302 i386 renameat sys_renameat __ia32_sys_renameat diff --git a/arch/x86/include/asm/unistd.h b/arch/x86/include/asm/unistd.h index dc4ed8bc2382..146859efd83c 100644 --- a/arch/x86/include/asm/unistd.h +++ b/arch/x86/include/asm/unistd.h @@ -23,8 +23,8 @@ # include # include -# define __ARCH_WANT_COMPAT_SYS_TIME -# define __ARCH_WANT_SYS_UTIME32 +# define __ARCH_WANT_SYS_TIME +# define __ARCH_WANT_SYS_UTIME # define __ARCH_WANT_COMPAT_SYS_PREADV64 # define __ARCH_WANT_COMPAT_SYS_PWRITEV64 # define __ARCH_WANT_COMPAT_SYS_PREADV64V2 @@ -48,8 +48,8 @@ # define __ARCH_WANT_SYS_SIGPENDING # define __ARCH_WANT_SYS_SIGPROCMASK # define __ARCH_WANT_SYS_SOCKETCALL -# define __ARCH_WANT_SYS_TIME -# define __ARCH_WANT_SYS_UTIME +# define __ARCH_WANT_SYS_TIME32 +# define __ARCH_WANT_SYS_UTIME32 # define __ARCH_WANT_SYS_WAITPID # define __ARCH_WANT_SYS_FORK # define __ARCH_WANT_SYS_VFORK diff --git a/arch/xtensa/include/asm/unistd.h b/arch/xtensa/include/asm/unistd.h index 81cc52ea1bd5..30af4dc3ce7b 100644 --- a/arch/xtensa/include/asm/unistd.h +++ b/arch/xtensa/include/asm/unistd.h @@ -7,7 +7,7 @@ #define __ARCH_WANT_NEW_STAT #define __ARCH_WANT_STAT64 -#define __ARCH_WANT_SYS_UTIME +#define __ARCH_WANT_SYS_UTIME32 #define __ARCH_WANT_SYS_GETPGRP #define NR_syscalls __NR_syscalls diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl index 6f05bc8f015a..482673389e21 100644 --- a/arch/xtensa/kernel/syscalls/syscall.tbl +++ b/arch/xtensa/kernel/syscalls/syscall.tbl @@ -72,8 +72,8 @@ 61 common fcntl64 sys_fcntl64 62 common fallocate sys_fallocate 63 common fadvise64_64 xtensa_fadvise64_64 -64 common utime sys_utime -65 common utimes sys_utimes +64 common utime sys_utime32 +65 common utimes sys_utimes_time32 66 common ioctl sys_ioctl 67 common fcntl sys_fcntl 68 common setxattr sys_setxattr @@ -318,7 +318,7 @@ 295 common readlinkat sys_readlinkat 296 common utimensat sys_utimensat_time32 297 common fchownat sys_fchownat -298 common futimesat sys_futimesat +298 common futimesat sys_futimesat_time32 299 common fstatat64 sys_fstatat64 300 common fchmodat sys_fchmodat 301 common faccessat sys_faccessat diff --git a/kernel/time/time.c b/kernel/time/time.c index 6261f969dcb7..c3f756f8534b 100644 --- a/kernel/time/time.c +++ b/kernel/time/time.c @@ -99,7 +99,7 @@ SYSCALL_DEFINE1(stime, time_t __user *, tptr) #endif /* __ARCH_WANT_SYS_TIME */ #ifdef CONFIG_COMPAT_32BIT_TIME -#ifdef __ARCH_WANT_COMPAT_SYS_TIME +#ifdef __ARCH_WANT_SYS_TIME32 /* old_time32_t is a 32 bit "long" and needs to get converted. */ SYSCALL_DEFINE1(time32, old_time32_t __user *, tloc) @@ -134,7 +134,7 @@ SYSCALL_DEFINE1(stime32, old_time32_t __user *, tptr) return 0; } -#endif /* __ARCH_WANT_COMPAT_SYS_TIME */ +#endif /* __ARCH_WANT_SYS_TIME32 */ #endif SYSCALL_DEFINE2(gettimeofday, struct timeval __user *, tv, -- cgit v1.3-7-g2ca7 From 35634ffa1751b6efd8cf75010b509dcb0263e29b Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Wed, 6 Feb 2019 18:39:40 -0600 Subject: signal: Always notice exiting tasks Recently syzkaller was able to create unkillablle processes by creating a timer that is delivered as a thread local signal on SIGHUP, and receiving SIGHUP SA_NODEFERER. Ultimately causing a loop failing to deliver SIGHUP but always trying. Upon examination it turns out part of the problem is actually most of the solution. Since 2.5 signal delivery has found all fatal signals, marked the signal group for death, and queued SIGKILL in every threads thread queue relying on signal->group_exit_code to preserve the information of which was the actual fatal signal. The conversion of all fatal signals to SIGKILL results in the synchronous signal heuristic in next_signal kicking in and preferring SIGHUP to SIGKILL. Which is especially problematic as all fatal signals have already been transformed into SIGKILL. Instead of dequeueing signals and depending upon SIGKILL to be the first signal dequeued, first test if the signal group has already been marked for death. This guarantees that nothing in the signal queue can prevent a process that needs to exit from exiting. Cc: stable@vger.kernel.org Tested-by: Dmitry Vyukov Reported-by: Dmitry Vyukov Ref: ebf5ebe31d2c ("[PATCH] signal-fixes-2.5.59-A4") History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git Signed-off-by: "Eric W. Biederman" --- kernel/signal.c | 6 ++++++ 1 file changed, 6 insertions(+) (limited to 'kernel') diff --git a/kernel/signal.c b/kernel/signal.c index 9ca8e5278c8e..5424cb0006bc 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -2393,6 +2393,11 @@ relock: goto relock; } + /* Has this task already been marked for death? */ + ksig->info.si_signo = signr = SIGKILL; + if (signal_group_exit(signal)) + goto fatal; + for (;;) { struct k_sigaction *ka; @@ -2488,6 +2493,7 @@ relock: continue; } + fatal: spin_unlock_irq(&sighand->siglock); /* -- cgit v1.3-7-g2ca7 From 7146db3317c67b517258cb5e1b08af387da0618b Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Wed, 6 Feb 2019 17:51:47 -0600 Subject: signal: Better detection of synchronous signals Recently syzkaller was able to create unkillablle processes by creating a timer that is delivered as a thread local signal on SIGHUP, and receiving SIGHUP SA_NODEFERER. Ultimately causing a loop failing to deliver SIGHUP but always trying. When the stack overflows delivery of SIGHUP fails and force_sigsegv is called. Unfortunately because SIGSEGV is numerically higher than SIGHUP next_signal tries again to deliver a SIGHUP. From a quality of implementation standpoint attempting to deliver the timer SIGHUP signal is wrong. We should attempt to deliver the synchronous SIGSEGV signal we just forced. We can make that happening in a fairly straight forward manner by instead of just looking at the signal number we also look at the si_code. In particular for exceptions (aka synchronous signals) the si_code is always greater than 0. That still has the potential to pick up a number of asynchronous signals as in a few cases the same si_codes that are used for synchronous signals are also used for asynchronous signals, and SI_KERNEL is also included in the list of possible si_codes. Still the heuristic is much better and timer signals are definitely excluded. Which is enough to prevent all known ways for someone sending a process signals fast enough to cause unexpected and arguably incorrect behavior. Cc: stable@vger.kernel.org Fixes: a27341cd5fcb ("Prioritize synchronous signals over 'normal' signals") Tested-by: Dmitry Vyukov Reported-by: Dmitry Vyukov Signed-off-by: "Eric W. Biederman" --- kernel/signal.c | 52 +++++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 51 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/signal.c b/kernel/signal.c index 5424cb0006bc..99fa8ff06fd9 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -688,6 +688,48 @@ int dequeue_signal(struct task_struct *tsk, sigset_t *mask, kernel_siginfo_t *in } EXPORT_SYMBOL_GPL(dequeue_signal); +static int dequeue_synchronous_signal(kernel_siginfo_t *info) +{ + struct task_struct *tsk = current; + struct sigpending *pending = &tsk->pending; + struct sigqueue *q, *sync = NULL; + + /* + * Might a synchronous signal be in the queue? + */ + if (!((pending->signal.sig[0] & ~tsk->blocked.sig[0]) & SYNCHRONOUS_MASK)) + return 0; + + /* + * Return the first synchronous signal in the queue. + */ + list_for_each_entry(q, &pending->list, list) { + /* Synchronous signals have a postive si_code */ + if ((q->info.si_code > SI_USER) && + (sigmask(q->info.si_signo) & SYNCHRONOUS_MASK)) { + sync = q; + goto next; + } + } + return 0; +next: + /* + * Check if there is another siginfo for the same signal. + */ + list_for_each_entry_continue(q, &pending->list, list) { + if (q->info.si_signo == sync->info.si_signo) + goto still_pending; + } + + sigdelset(&pending->signal, sync->info.si_signo); + recalc_sigpending(); +still_pending: + list_del_init(&sync->list); + copy_siginfo(info, &sync->info); + __sigqueue_free(sync); + return info->si_signo; +} + /* * Tell a process that it has a new active signal.. * @@ -2411,7 +2453,15 @@ relock: goto relock; } - signr = dequeue_signal(current, ¤t->blocked, &ksig->info); + /* + * Signals generated by the execution of an instruction + * need to be delivered before any other pending signals + * so that the instruction pointer in the signal stack + * frame points to the faulting instruction. + */ + signr = dequeue_synchronous_signal(&ksig->info); + if (!signr) + signr = dequeue_signal(current, ¤t->blocked, &ksig->info); if (!signr) break; /* will return 0 */ -- cgit v1.3-7-g2ca7 From 70f8a3ca68d3e1f3344d959981ca55d5f6ec77f7 Mon Sep 17 00:00:00 2001 From: Davidlohr Bueso Date: Wed, 6 Feb 2019 09:59:15 -0800 Subject: mm: make mm->pinned_vm an atomic64 counter Taking a sleeping lock to _only_ increment a variable is quite the overkill, and pretty much all users do this. Furthermore, some drivers (ie: infiniband and scif) that need pinned semantics can go to quite some trouble to actually delay via workqueue (un)accounting for pinned pages when not possible to acquire it. By making the counter atomic we no longer need to hold the mmap_sem and can simply some code around it for pinned_vm users. The counter is 64-bit such that we need not worry about overflows such as rdma user input controlled from userspace. Reviewed-by: Ira Weiny Reviewed-by: Christoph Lameter Reviewed-by: Daniel Jordan Reviewed-by: Jan Kara Signed-off-by: Davidlohr Bueso Signed-off-by: Jason Gunthorpe --- drivers/infiniband/core/umem.c | 12 ++++++------ drivers/infiniband/hw/hfi1/user_pages.c | 6 +++--- drivers/infiniband/hw/qib/qib_user_pages.c | 4 ++-- drivers/infiniband/hw/usnic/usnic_uiom.c | 8 ++++---- drivers/misc/mic/scif/scif_rma.c | 6 +++--- fs/proc/task_mmu.c | 2 +- include/linux/mm_types.h | 2 +- kernel/events/core.c | 8 ++++---- kernel/fork.c | 2 +- mm/debug.c | 5 +++-- 10 files changed, 28 insertions(+), 27 deletions(-) (limited to 'kernel') diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c index 1efe0a74e06b..678abe1afcba 100644 --- a/drivers/infiniband/core/umem.c +++ b/drivers/infiniband/core/umem.c @@ -166,13 +166,13 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, unsigned long addr, lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; down_write(&mm->mmap_sem); - if (check_add_overflow(mm->pinned_vm, npages, &new_pinned) || - (new_pinned > lock_limit && !capable(CAP_IPC_LOCK))) { + new_pinned = atomic64_read(&mm->pinned_vm) + npages; + if (new_pinned > lock_limit && !capable(CAP_IPC_LOCK)) { up_write(&mm->mmap_sem); ret = -ENOMEM; goto out; } - mm->pinned_vm = new_pinned; + atomic64_set(&mm->pinned_vm, new_pinned); up_write(&mm->mmap_sem); cur_base = addr & PAGE_MASK; @@ -234,7 +234,7 @@ umem_release: __ib_umem_release(context->device, umem, 0); vma: down_write(&mm->mmap_sem); - mm->pinned_vm -= ib_umem_num_pages(umem); + atomic64_sub(ib_umem_num_pages(umem), &mm->pinned_vm); up_write(&mm->mmap_sem); out: if (vma_list) @@ -263,7 +263,7 @@ static void ib_umem_release_defer(struct work_struct *work) struct ib_umem *umem = container_of(work, struct ib_umem, work); down_write(&umem->owning_mm->mmap_sem); - umem->owning_mm->pinned_vm -= ib_umem_num_pages(umem); + atomic64_sub(ib_umem_num_pages(umem), &umem->owning_mm->pinned_vm); up_write(&umem->owning_mm->mmap_sem); __ib_umem_release_tail(umem); @@ -302,7 +302,7 @@ void ib_umem_release(struct ib_umem *umem) } else { down_write(&umem->owning_mm->mmap_sem); } - umem->owning_mm->pinned_vm -= ib_umem_num_pages(umem); + atomic64_sub(ib_umem_num_pages(umem), &umem->owning_mm->pinned_vm); up_write(&umem->owning_mm->mmap_sem); __ib_umem_release_tail(umem); diff --git a/drivers/infiniband/hw/hfi1/user_pages.c b/drivers/infiniband/hw/hfi1/user_pages.c index e341e6dcc388..40a6e434190f 100644 --- a/drivers/infiniband/hw/hfi1/user_pages.c +++ b/drivers/infiniband/hw/hfi1/user_pages.c @@ -92,7 +92,7 @@ bool hfi1_can_pin_pages(struct hfi1_devdata *dd, struct mm_struct *mm, size = DIV_ROUND_UP(size, PAGE_SIZE); down_read(&mm->mmap_sem); - pinned = mm->pinned_vm; + pinned = atomic64_read(&mm->pinned_vm); up_read(&mm->mmap_sem); /* First, check the absolute limit against all pinned pages. */ @@ -112,7 +112,7 @@ int hfi1_acquire_user_pages(struct mm_struct *mm, unsigned long vaddr, size_t np return ret; down_write(&mm->mmap_sem); - mm->pinned_vm += ret; + atomic64_add(ret, &mm->pinned_vm); up_write(&mm->mmap_sem); return ret; @@ -131,7 +131,7 @@ void hfi1_release_user_pages(struct mm_struct *mm, struct page **p, if (mm) { /* during close after signal, mm can be NULL */ down_write(&mm->mmap_sem); - mm->pinned_vm -= npages; + atomic64_sub(npages, &mm->pinned_vm); up_write(&mm->mmap_sem); } } diff --git a/drivers/infiniband/hw/qib/qib_user_pages.c b/drivers/infiniband/hw/qib/qib_user_pages.c index 075f09fb7ce3..c6c81022d313 100644 --- a/drivers/infiniband/hw/qib/qib_user_pages.c +++ b/drivers/infiniband/hw/qib/qib_user_pages.c @@ -75,7 +75,7 @@ static int __qib_get_user_pages(unsigned long start_page, size_t num_pages, goto bail_release; } - current->mm->pinned_vm += num_pages; + atomic64_add(num_pages, ¤t->mm->pinned_vm); ret = 0; goto bail; @@ -156,7 +156,7 @@ void qib_release_user_pages(struct page **p, size_t num_pages) __qib_release_user_pages(p, num_pages, 1); if (current->mm) { - current->mm->pinned_vm -= num_pages; + atomic64_sub(num_pages, ¤t->mm->pinned_vm); up_write(¤t->mm->mmap_sem); } } diff --git a/drivers/infiniband/hw/usnic/usnic_uiom.c b/drivers/infiniband/hw/usnic/usnic_uiom.c index ce01a59fccc4..854436a2b437 100644 --- a/drivers/infiniband/hw/usnic/usnic_uiom.c +++ b/drivers/infiniband/hw/usnic/usnic_uiom.c @@ -129,7 +129,7 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable, uiomr->owning_mm = mm = current->mm; down_write(&mm->mmap_sem); - locked = npages + current->mm->pinned_vm; + locked = npages + atomic64_read(¤t->mm->pinned_vm); lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) { @@ -187,7 +187,7 @@ out: if (ret < 0) usnic_uiom_put_pages(chunk_list, 0); else { - mm->pinned_vm = locked; + atomic64_set(&mm->pinned_vm, locked); mmgrab(uiomr->owning_mm); } @@ -441,7 +441,7 @@ static void usnic_uiom_release_defer(struct work_struct *work) container_of(work, struct usnic_uiom_reg, work); down_write(&uiomr->owning_mm->mmap_sem); - uiomr->owning_mm->pinned_vm -= usnic_uiom_num_pages(uiomr); + atomic64_sub(usnic_uiom_num_pages(uiomr), &uiomr->owning_mm->pinned_vm); up_write(&uiomr->owning_mm->mmap_sem); __usnic_uiom_release_tail(uiomr); @@ -469,7 +469,7 @@ void usnic_uiom_reg_release(struct usnic_uiom_reg *uiomr, } else { down_write(&uiomr->owning_mm->mmap_sem); } - uiomr->owning_mm->pinned_vm -= usnic_uiom_num_pages(uiomr); + atomic64_sub(usnic_uiom_num_pages(uiomr), &uiomr->owning_mm->pinned_vm); up_write(&uiomr->owning_mm->mmap_sem); __usnic_uiom_release_tail(uiomr); diff --git a/drivers/misc/mic/scif/scif_rma.c b/drivers/misc/mic/scif/scif_rma.c index 749321eb91ae..2448368f181e 100644 --- a/drivers/misc/mic/scif/scif_rma.c +++ b/drivers/misc/mic/scif/scif_rma.c @@ -285,7 +285,7 @@ __scif_dec_pinned_vm_lock(struct mm_struct *mm, } else { down_write(&mm->mmap_sem); } - mm->pinned_vm -= nr_pages; + atomic64_sub(nr_pages, &mm->pinned_vm); up_write(&mm->mmap_sem); return 0; } @@ -299,7 +299,7 @@ static inline int __scif_check_inc_pinned_vm(struct mm_struct *mm, return 0; locked = nr_pages; - locked += mm->pinned_vm; + locked += atomic64_read(&mm->pinned_vm); lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) { dev_err(scif_info.mdev.this_device, @@ -307,7 +307,7 @@ static inline int __scif_check_inc_pinned_vm(struct mm_struct *mm, locked, lock_limit); return -ENOMEM; } - mm->pinned_vm = locked; + atomic64_set(&mm->pinned_vm, locked); return 0; } diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index f0ec9edab2f3..d2902962244d 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -59,7 +59,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm) SEQ_PUT_DEC("VmPeak:\t", hiwater_vm); SEQ_PUT_DEC(" kB\nVmSize:\t", total_vm); SEQ_PUT_DEC(" kB\nVmLck:\t", mm->locked_vm); - SEQ_PUT_DEC(" kB\nVmPin:\t", mm->pinned_vm); + SEQ_PUT_DEC(" kB\nVmPin:\t", atomic64_read(&mm->pinned_vm)); SEQ_PUT_DEC(" kB\nVmHWM:\t", hiwater_rss); SEQ_PUT_DEC(" kB\nVmRSS:\t", total_rss); SEQ_PUT_DEC(" kB\nRssAnon:\t", anon); diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 2c471a2c43fa..acea2ea2d6c4 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -405,7 +405,7 @@ struct mm_struct { unsigned long total_vm; /* Total pages mapped */ unsigned long locked_vm; /* Pages that have PG_mlocked set */ - unsigned long pinned_vm; /* Refcount permanently increased */ + atomic64_t pinned_vm; /* Refcount permanently increased */ unsigned long data_vm; /* VM_WRITE & ~VM_SHARED & ~VM_STACK */ unsigned long exec_vm; /* VM_EXEC & ~VM_WRITE & ~VM_STACK */ unsigned long stack_vm; /* VM_STACK */ diff --git a/kernel/events/core.c b/kernel/events/core.c index e5ede6918050..29e9f2473656 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -5459,7 +5459,7 @@ static void perf_mmap_close(struct vm_area_struct *vma) /* now it's safe to free the pages */ atomic_long_sub(rb->aux_nr_pages, &mmap_user->locked_vm); - vma->vm_mm->pinned_vm -= rb->aux_mmap_locked; + atomic64_sub(rb->aux_mmap_locked, &vma->vm_mm->pinned_vm); /* this has to be the last one */ rb_free_aux(rb); @@ -5532,7 +5532,7 @@ again: */ atomic_long_sub((size >> PAGE_SHIFT) + 1, &mmap_user->locked_vm); - vma->vm_mm->pinned_vm -= mmap_locked; + atomic64_sub(mmap_locked, &vma->vm_mm->pinned_vm); free_uid(mmap_user); out_put: @@ -5680,7 +5680,7 @@ accounting: lock_limit = rlimit(RLIMIT_MEMLOCK); lock_limit >>= PAGE_SHIFT; - locked = vma->vm_mm->pinned_vm + extra; + locked = atomic64_read(&vma->vm_mm->pinned_vm) + extra; if ((locked > lock_limit) && perf_paranoid_tracepoint_raw() && !capable(CAP_IPC_LOCK)) { @@ -5721,7 +5721,7 @@ accounting: unlock: if (!ret) { atomic_long_add(user_extra, &user->locked_vm); - vma->vm_mm->pinned_vm += extra; + atomic64_add(extra, &vma->vm_mm->pinned_vm); atomic_inc(&event->mmap_count); } else if (rb) { diff --git a/kernel/fork.c b/kernel/fork.c index b69248e6f0e0..85e08c379a9e 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -981,7 +981,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p, mm_pgtables_bytes_init(mm); mm->map_count = 0; mm->locked_vm = 0; - mm->pinned_vm = 0; + atomic64_set(&mm->pinned_vm, 0); memset(&mm->rss_stat, 0, sizeof(mm->rss_stat)); spin_lock_init(&mm->page_table_lock); spin_lock_init(&mm->arg_lock); diff --git a/mm/debug.c b/mm/debug.c index 0abb987dad9b..7d13941a72f9 100644 --- a/mm/debug.c +++ b/mm/debug.c @@ -135,7 +135,7 @@ void dump_mm(const struct mm_struct *mm) "mmap_base %lu mmap_legacy_base %lu highest_vm_end %lu\n" "pgd %px mm_users %d mm_count %d pgtables_bytes %lu map_count %d\n" "hiwater_rss %lx hiwater_vm %lx total_vm %lx locked_vm %lx\n" - "pinned_vm %lx data_vm %lx exec_vm %lx stack_vm %lx\n" + "pinned_vm %llx data_vm %lx exec_vm %lx stack_vm %lx\n" "start_code %lx end_code %lx start_data %lx end_data %lx\n" "start_brk %lx brk %lx start_stack %lx\n" "arg_start %lx arg_end %lx env_start %lx env_end %lx\n" @@ -166,7 +166,8 @@ void dump_mm(const struct mm_struct *mm) mm_pgtables_bytes(mm), mm->map_count, mm->hiwater_rss, mm->hiwater_vm, mm->total_vm, mm->locked_vm, - mm->pinned_vm, mm->data_vm, mm->exec_vm, mm->stack_vm, + atomic64_read(&mm->pinned_vm), + mm->data_vm, mm->exec_vm, mm->stack_vm, mm->start_code, mm->end_code, mm->start_data, mm->end_data, mm->start_brk, mm->brk, mm->start_stack, mm->arg_start, mm->arg_end, mm->env_start, mm->env_end, -- cgit v1.3-7-g2ca7 From cd108b5c51db30aa01657322bb89e48c98216ff9 Mon Sep 17 00:00:00 2001 From: Richard Guy Briggs Date: Tue, 5 Feb 2019 16:06:30 -0500 Subject: audit: hide auditsc_get_stamp and audit_serial prototypes auditsc_get_stamp() and audit_serial() are internal audit functions so move their prototypes from include/linux/audit.h to kernel/audit.h so they are not visible to the rest of the kernel. Signed-off-by: Richard Guy Briggs Signed-off-by: Paul Moore --- include/linux/audit.h | 9 --------- kernel/audit.h | 5 +++++ 2 files changed, 5 insertions(+), 9 deletions(-) (limited to 'kernel') diff --git a/include/linux/audit.h b/include/linux/audit.h index 29251b18331a..1e69d9fe16da 100644 --- a/include/linux/audit.h +++ b/include/linux/audit.h @@ -348,10 +348,6 @@ static inline void audit_ptrace(struct task_struct *t) } /* Private API (for audit.c only) */ -extern unsigned int audit_serial(void); -extern int auditsc_get_stamp(struct audit_context *ctx, - struct timespec64 *t, unsigned int *serial); - extern void __audit_ipc_obj(struct kern_ipc_perm *ipcp); extern void __audit_ipc_set_perm(unsigned long qbytes, uid_t uid, gid_t gid, umode_t mode); extern void __audit_bprm(struct linux_binprm *bprm); @@ -531,11 +527,6 @@ static inline void audit_seccomp(unsigned long syscall, long signr, int code) static inline void audit_seccomp_actions_logged(const char *names, const char *old_names, int res) { } -static inline int auditsc_get_stamp(struct audit_context *ctx, - struct timespec64 *t, unsigned int *serial) -{ - return 0; -} static inline void audit_ipc_obj(struct kern_ipc_perm *ipcp) { } static inline void audit_ipc_set_perm(unsigned long qbytes, uid_t uid, diff --git a/kernel/audit.h b/kernel/audit.h index 82734f438ddd..958d5b8fc1b3 100644 --- a/kernel/audit.h +++ b/kernel/audit.h @@ -261,6 +261,10 @@ extern void audit_put_tty(struct tty_struct *tty); /* audit watch/mark/tree functions */ #ifdef CONFIG_AUDITSYSCALL +extern unsigned int audit_serial(void); +extern int auditsc_get_stamp(struct audit_context *ctx, + struct timespec64 *t, unsigned int *serial); + extern void audit_put_watch(struct audit_watch *watch); extern void audit_get_watch(struct audit_watch *watch); extern int audit_to_watch(struct audit_krule *krule, char *path, int len, @@ -300,6 +304,7 @@ extern void audit_filter_inodes(struct task_struct *tsk, struct audit_context *ctx); extern struct list_head *audit_killed_trees(void); #else /* CONFIG_AUDITSYSCALL */ +#define auditsc_get_stamp(c, t, s) 0 #define audit_put_watch(w) {} #define audit_get_watch(w) {} #define audit_to_watch(k, p, l, o) (-EINVAL) -- cgit v1.3-7-g2ca7 From 6f568ebe2afefdc33a6fb06ef20a94f8b96455f1 Mon Sep 17 00:00:00 2001 From: Davidlohr Bueso Date: Wed, 6 Feb 2019 10:56:02 -0800 Subject: futex: Fix barrier comment The current comment for the barrier that guarantees that waiter increment is always before taking the hb spinlock (barrier (A)) needs to be fixed as it is misplaced. This is obviously referring to hb_waiters_inc, which is a full barrier. Reported-by: Peter Zijlstra Signed-off-by: Davidlohr Bueso Signed-off-by: Thomas Gleixner Link: https://lkml.kernel.org/r/20190206185602.949-1-dave@stgolabs.net --- kernel/futex.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/futex.c b/kernel/futex.c index fdd312da0992..5ec2473a3497 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -2221,11 +2221,11 @@ static inline struct futex_hash_bucket *queue_lock(struct futex_q *q) * decrement the counter at queue_unlock() when some error has * occurred and we don't end up adding the task to the list. */ - hb_waiters_inc(hb); + hb_waiters_inc(hb); /* implies smp_mb(); (A) */ q->lock_ptr = &hb->lock; - spin_lock(&hb->lock); /* implies smp_mb(); (A) */ + spin_lock(&hb->lock); return hb; } -- cgit v1.3-7-g2ca7 From 1a1fb985f2e2b85ec0d3dc2e519ee48389ec2434 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Tue, 29 Jan 2019 23:15:12 +0100 Subject: futex: Handle early deadlock return correctly commit 56222b212e8e ("futex: Drop hb->lock before enqueueing on the rtmutex") changed the locking rules in the futex code so that the hash bucket lock is not longer held while the waiter is enqueued into the rtmutex wait list. This made the lock and the unlock path symmetric, but unfortunately the possible early exit from __rt_mutex_proxy_start() due to a detected deadlock was not updated accordingly. That allows a concurrent unlocker to observe inconsitent state which triggers the warning in the unlock path. futex_lock_pi() futex_unlock_pi() lock(hb->lock) queue(hb_waiter) lock(hb->lock) lock(rtmutex->wait_lock) unlock(hb->lock) // acquired hb->lock hb_waiter = futex_top_waiter() lock(rtmutex->wait_lock) __rt_mutex_proxy_start() ---> fail remove(rtmutex_waiter); ---> returns -EDEADLOCK unlock(rtmutex->wait_lock) // acquired wait_lock wake_futex_pi() rt_mutex_next_owner() --> returns NULL --> WARN lock(hb->lock) unqueue(hb_waiter) The problem is caused by the remove(rtmutex_waiter) in the failure case of __rt_mutex_proxy_start() as this lets the unlocker observe a waiter in the hash bucket but no waiter on the rtmutex, i.e. inconsistent state. The original commit handles this correctly for the other early return cases (timeout, signal) by delaying the removal of the rtmutex waiter until the returning task reacquired the hash bucket lock. Treat the failure case of __rt_mutex_proxy_start() in the same way and let the existing cleanup code handle the eventual handover of the rtmutex gracefully. The regular rt_mutex_proxy_start() gains the rtmutex waiter removal for the failure case, so that the other callsites are still operating correctly. Add proper comments to the code so all these details are fully documented. Thanks to Peter for helping with the analysis and writing the really valuable code comments. Fixes: 56222b212e8e ("futex: Drop hb->lock before enqueueing on the rtmutex") Reported-by: Heiko Carstens Co-developed-by: Peter Zijlstra Signed-off-by: Peter Zijlstra Signed-off-by: Thomas Gleixner Tested-by: Heiko Carstens Cc: Martin Schwidefsky Cc: linux-s390@vger.kernel.org Cc: Stefan Liebler Cc: Sebastian Sewior Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1901292311410.1950@nanos.tec.linutronix.de --- kernel/futex.c | 28 ++++++++++++++++++---------- kernel/locking/rtmutex.c | 37 ++++++++++++++++++++++++++++++++----- 2 files changed, 50 insertions(+), 15 deletions(-) (limited to 'kernel') diff --git a/kernel/futex.c b/kernel/futex.c index 5ec2473a3497..a0514e01c3eb 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -2861,35 +2861,39 @@ retry_private: * and BUG when futex_unlock_pi() interleaves with this. * * Therefore acquire wait_lock while holding hb->lock, but drop the - * latter before calling rt_mutex_start_proxy_lock(). This still fully - * serializes against futex_unlock_pi() as that does the exact same - * lock handoff sequence. + * latter before calling __rt_mutex_start_proxy_lock(). This + * interleaves with futex_unlock_pi() -- which does a similar lock + * handoff -- such that the latter can observe the futex_q::pi_state + * before __rt_mutex_start_proxy_lock() is done. */ raw_spin_lock_irq(&q.pi_state->pi_mutex.wait_lock); spin_unlock(q.lock_ptr); + /* + * __rt_mutex_start_proxy_lock() unconditionally enqueues the @rt_waiter + * such that futex_unlock_pi() is guaranteed to observe the waiter when + * it sees the futex_q::pi_state. + */ ret = __rt_mutex_start_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter, current); raw_spin_unlock_irq(&q.pi_state->pi_mutex.wait_lock); if (ret) { if (ret == 1) ret = 0; - - spin_lock(q.lock_ptr); - goto no_block; + goto cleanup; } - if (unlikely(to)) hrtimer_start_expires(&to->timer, HRTIMER_MODE_ABS); ret = rt_mutex_wait_proxy_lock(&q.pi_state->pi_mutex, to, &rt_waiter); +cleanup: spin_lock(q.lock_ptr); /* - * If we failed to acquire the lock (signal/timeout), we must + * If we failed to acquire the lock (deadlock/signal/timeout), we must * first acquire the hb->lock before removing the lock from the - * rt_mutex waitqueue, such that we can keep the hb and rt_mutex - * wait lists consistent. + * rt_mutex waitqueue, such that we can keep the hb and rt_mutex wait + * lists consistent. * * In particular; it is important that futex_unlock_pi() can not * observe this inconsistency. @@ -3013,6 +3017,10 @@ retry: * there is no point where we hold neither; and therefore * wake_futex_pi() must observe a state consistent with what we * observed. + * + * In particular; this forces __rt_mutex_start_proxy() to + * complete such that we're guaranteed to observe the + * rt_waiter. Also see the WARN in wake_futex_pi(). */ raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock); spin_unlock(&hb->lock); diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c index 581edcc63c26..978d63a8261c 100644 --- a/kernel/locking/rtmutex.c +++ b/kernel/locking/rtmutex.c @@ -1726,12 +1726,33 @@ void rt_mutex_proxy_unlock(struct rt_mutex *lock, rt_mutex_set_owner(lock, NULL); } +/** + * __rt_mutex_start_proxy_lock() - Start lock acquisition for another task + * @lock: the rt_mutex to take + * @waiter: the pre-initialized rt_mutex_waiter + * @task: the task to prepare + * + * Starts the rt_mutex acquire; it enqueues the @waiter and does deadlock + * detection. It does not wait, see rt_mutex_wait_proxy_lock() for that. + * + * NOTE: does _NOT_ remove the @waiter on failure; must either call + * rt_mutex_wait_proxy_lock() or rt_mutex_cleanup_proxy_lock() after this. + * + * Returns: + * 0 - task blocked on lock + * 1 - acquired the lock for task, caller should wake it up + * <0 - error + * + * Special API call for PI-futex support. + */ int __rt_mutex_start_proxy_lock(struct rt_mutex *lock, struct rt_mutex_waiter *waiter, struct task_struct *task) { int ret; + lockdep_assert_held(&lock->wait_lock); + if (try_to_take_rt_mutex(lock, task, NULL)) return 1; @@ -1749,9 +1770,6 @@ int __rt_mutex_start_proxy_lock(struct rt_mutex *lock, ret = 0; } - if (unlikely(ret)) - remove_waiter(lock, waiter); - debug_rt_mutex_print_deadlock(waiter); return ret; @@ -1763,12 +1781,18 @@ int __rt_mutex_start_proxy_lock(struct rt_mutex *lock, * @waiter: the pre-initialized rt_mutex_waiter * @task: the task to prepare * + * Starts the rt_mutex acquire; it enqueues the @waiter and does deadlock + * detection. It does not wait, see rt_mutex_wait_proxy_lock() for that. + * + * NOTE: unlike __rt_mutex_start_proxy_lock this _DOES_ remove the @waiter + * on failure. + * * Returns: * 0 - task blocked on lock * 1 - acquired the lock for task, caller should wake it up * <0 - error * - * Special API call for FUTEX_REQUEUE_PI support. + * Special API call for PI-futex support. */ int rt_mutex_start_proxy_lock(struct rt_mutex *lock, struct rt_mutex_waiter *waiter, @@ -1778,6 +1802,8 @@ int rt_mutex_start_proxy_lock(struct rt_mutex *lock, raw_spin_lock_irq(&lock->wait_lock); ret = __rt_mutex_start_proxy_lock(lock, waiter, task); + if (unlikely(ret)) + remove_waiter(lock, waiter); raw_spin_unlock_irq(&lock->wait_lock); return ret; @@ -1845,7 +1871,8 @@ int rt_mutex_wait_proxy_lock(struct rt_mutex *lock, * @lock: the rt_mutex we were woken on * @waiter: the pre-initialized rt_mutex_waiter * - * Attempt to clean up after a failed rt_mutex_wait_proxy_lock(). + * Attempt to clean up after a failed __rt_mutex_start_proxy_lock() or + * rt_mutex_wait_proxy_lock(). * * Unless we acquired the lock; we're still enqueued on the wait-list and can * in fact still be granted ownership until we're removed. Therefore we can -- cgit v1.3-7-g2ca7 From a1939185c7a907f91b40da259a610ce2e2da9e18 Mon Sep 17 00:00:00 2001 From: Prarit Bhargava Date: Fri, 8 Feb 2019 19:24:49 +0100 Subject: printk: Export console_printk The fbcon can be built as a module and requires console_printk. Export console_printk. Signed-off-by: Prarit Bhargava Reviewed-by: Sergey Senozhatsky Acked-by: Petr Mladek Signed-off-by: Bartlomiej Zolnierkiewicz --- kernel/printk/printk.c | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index d3d170374ceb..8201019d1fff 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -65,6 +65,7 @@ int console_printk[4] = { CONSOLE_LOGLEVEL_MIN, /* minimum_console_loglevel */ CONSOLE_LOGLEVEL_DEFAULT, /* default_console_loglevel */ }; +EXPORT_SYMBOL_GPL(console_printk); atomic_t ignore_console_lock_warning __read_mostly = ATOMIC_INIT(0); EXPORT_SYMBOL(ignore_console_lock_warning); -- cgit v1.3-7-g2ca7 From b5b11890de69ec216ab7a10a24fcd1b2d46a2d6e Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 17 Jan 2019 10:05:33 -0800 Subject: rcu/rcu.h: Convert to SPDX license identifier Replace the license boiler plate with a SPDX license identifier. While in the area, update an email address. Signed-off-by: Paul E. McKenney Reviewed-by: Thomas Gleixner --- kernel/rcu/rcu.h | 17 ++--------------- 1 file changed, 2 insertions(+), 15 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h index 75787186bd4f..e672b8f050ac 100644 --- a/kernel/rcu/rcu.h +++ b/kernel/rcu/rcu.h @@ -1,23 +1,10 @@ +/* SPDX-License-Identifier: GPL-2.0+ */ /* * Read-Copy Update definitions shared among RCU implementations. * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program; if not, you can access it online at - * http://www.gnu.org/licenses/gpl-2.0.html. - * * Copyright IBM Corporation, 2011 * - * Author: Paul E. McKenney + * Author: Paul E. McKenney */ #ifndef __LINUX_RCU_H -- cgit v1.3-7-g2ca7 From 8bf05ed3adf9c40f2f47a967dcfc713d26b07247 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 17 Jan 2019 10:09:19 -0800 Subject: rcu/rcuperf: Convert to SPDX license identifier Replace the license boiler plate with a SPDX license identifier. While in the area, update an email address. Signed-off-by: Paul E. McKenney Reviewed-by: Thomas Gleixner --- kernel/rcu/rcuperf.c | 19 +++---------------- 1 file changed, 3 insertions(+), 16 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/rcuperf.c b/kernel/rcu/rcuperf.c index b459da70b4fc..83c411cb09bd 100644 --- a/kernel/rcu/rcuperf.c +++ b/kernel/rcu/rcuperf.c @@ -1,23 +1,10 @@ +// SPDX-License-Identifier: GPL-2.0+ /* * Read-Copy Update module-based performance-test facility * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program; if not, you can access it online at - * http://www.gnu.org/licenses/gpl-2.0.html. - * * Copyright (C) IBM Corporation, 2015 * - * Authors: Paul E. McKenney + * Authors: Paul E. McKenney */ #define pr_fmt(fmt) fmt @@ -54,7 +41,7 @@ #include "rcu.h" MODULE_LICENSE("GPL"); -MODULE_AUTHOR("Paul E. McKenney "); +MODULE_AUTHOR("Paul E. McKenney "); #define PERF_FLAG "-perf:" #define PERFOUT_STRING(s) \ -- cgit v1.3-7-g2ca7 From eb7935e479a32cd77b9770baf7eaae6726e68f46 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 17 Jan 2019 10:13:19 -0800 Subject: rcu/rcu_segcblist: Convert to SPDX license identifier Replace the license boiler plate with a SPDX license identifier. While in the area, update an email address. Signed-off-by: Paul E. McKenney Reviewed-by: Thomas Gleixner --- kernel/rcu/rcu_segcblist.c | 17 ++--------------- kernel/rcu/rcu_segcblist.h | 17 ++--------------- 2 files changed, 4 insertions(+), 30 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c index 5aff271adf1e..9bd5f6023c21 100644 --- a/kernel/rcu/rcu_segcblist.c +++ b/kernel/rcu/rcu_segcblist.c @@ -1,23 +1,10 @@ +// SPDX-License-Identifier: GPL-2.0+ /* * RCU segmented callback lists, function definitions * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program; if not, you can access it online at - * http://www.gnu.org/licenses/gpl-2.0.html. - * * Copyright IBM Corporation, 2017 * - * Authors: Paul E. McKenney + * Authors: Paul E. McKenney */ #include diff --git a/kernel/rcu/rcu_segcblist.h b/kernel/rcu/rcu_segcblist.h index 948470cef385..71b64648464e 100644 --- a/kernel/rcu/rcu_segcblist.h +++ b/kernel/rcu/rcu_segcblist.h @@ -1,23 +1,10 @@ +/* SPDX-License-Identifier: GPL-2.0+ */ /* * RCU segmented callback lists, internal-to-rcu header file * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program; if not, you can access it online at - * http://www.gnu.org/licenses/gpl-2.0.html. - * * Copyright IBM Corporation, 2017 * - * Authors: Paul E. McKenney + * Authors: Paul E. McKenney */ #include -- cgit v1.3-7-g2ca7 From 2e24ce88524714ce82675dfc0e203abc509f84c3 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 17 Jan 2019 10:16:42 -0800 Subject: rcu/rcutorture: Convert to SPDX license identifier Replace the license boiler plate with a SPDX license identifier. While in the area, update an email address. Signed-off-by: Paul E. McKenney Reviewed-by: Thomas Gleixner --- kernel/rcu/rcutorture.c | 19 +++---------------- 1 file changed, 3 insertions(+), 16 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index f6e85faa4ff4..c47dba261cbe 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -1,23 +1,10 @@ +// SPDX-License-Identifier: GPL-2.0+ /* * Read-Copy Update module-based torture test facility * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program; if not, you can access it online at - * http://www.gnu.org/licenses/gpl-2.0.html. - * * Copyright (C) IBM Corporation, 2005, 2006 * - * Authors: Paul E. McKenney + * Authors: Paul E. McKenney * Josh Triplett * * See also: Documentation/RCU/torture.txt @@ -61,7 +48,7 @@ #include "rcu.h" MODULE_LICENSE("GPL"); -MODULE_AUTHOR("Paul E. McKenney and Josh Triplett "); +MODULE_AUTHOR("Paul E. McKenney and Josh Triplett "); /* Bits for ->extendables field, extendables param, and related definitions. */ -- cgit v1.3-7-g2ca7 From e7ee1501cd5af551c3bcd92162bf91b9877b2057 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 17 Jan 2019 10:18:16 -0800 Subject: rcu/srcu: Convert to SPDX license identifier Replace the license boiler plate with a SPDX license identifier. While in the area, update an email address. Signed-off-by: Paul E. McKenney Reviewed-by: Thomas Gleixner --- kernel/rcu/srcutiny.c | 17 ++--------------- kernel/rcu/srcutree.c | 17 ++--------------- 2 files changed, 4 insertions(+), 30 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/srcutiny.c b/kernel/rcu/srcutiny.c index 32dfd6522548..5d4a39a6505a 100644 --- a/kernel/rcu/srcutiny.c +++ b/kernel/rcu/srcutiny.c @@ -1,24 +1,11 @@ +// SPDX-License-Identifier: GPL-2.0+ /* * Sleepable Read-Copy Update mechanism for mutual exclusion, * tiny version for non-preemptible single-CPU use. * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program; if not, you can access it online at - * http://www.gnu.org/licenses/gpl-2.0.html. - * * Copyright (C) IBM Corporation, 2017 * - * Author: Paul McKenney + * Author: Paul McKenney */ #include diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c index 3600d88d8956..7a5944b42b02 100644 --- a/kernel/rcu/srcutree.c +++ b/kernel/rcu/srcutree.c @@ -1,24 +1,11 @@ +// SPDX-License-Identifier: GPL-2.0+ /* * Sleepable Read-Copy Update mechanism for mutual exclusion. * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program; if not, you can access it online at - * http://www.gnu.org/licenses/gpl-2.0.html. - * * Copyright (C) IBM Corporation, 2006 * Copyright (C) Fujitsu, 2012 * - * Author: Paul McKenney + * Author: Paul McKenney * Lai Jiangshan * * For detailed explanation of Read-Copy Update mechanism see - -- cgit v1.3-7-g2ca7 From 96b903f5da491199b51a7a9e70699f606a8a09b4 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 17 Jan 2019 10:19:01 -0800 Subject: rcu/sync: Convert to SPDX license identifier Replace the license boiler plate with a SPDX license identifier. Signed-off-by: Paul E. McKenney Reviewed-by: Thomas Gleixner --- kernel/rcu/sync.c | 15 +-------------- 1 file changed, 1 insertion(+), 14 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/sync.c b/kernel/rcu/sync.c index be10036fa621..a8304d90573f 100644 --- a/kernel/rcu/sync.c +++ b/kernel/rcu/sync.c @@ -1,20 +1,7 @@ +// SPDX-License-Identifier: GPL-2.0+ /* * RCU-based infrastructure for lightweight reader-writer locking * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program; if not, you can access it online at - * http://www.gnu.org/licenses/gpl-2.0.html. - * * Copyright (c) 2015, Red Hat, Inc. * * Author: Oleg Nesterov -- cgit v1.3-7-g2ca7 From 00de9d7415352ef546e0bf0b8ff5d092e265c080 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 17 Jan 2019 10:21:12 -0800 Subject: rcu/tiny: Convert to SPDX license identifier Replace the license boiler plate with a SPDX license identifier. While in the area, update an email address. Signed-off-by: Paul E. McKenney Reviewed-by: Thomas Gleixner --- kernel/rcu/tiny.c | 17 ++--------------- 1 file changed, 2 insertions(+), 15 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c index 5f5963ba313e..f142c579fda8 100644 --- a/kernel/rcu/tiny.c +++ b/kernel/rcu/tiny.c @@ -1,23 +1,10 @@ +// SPDX-License-Identifier: GPL-2.0+ /* * Read-Copy Update mechanism for mutual exclusion, the Bloatwatch edition. * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program; if not, you can access it online at - * http://www.gnu.org/licenses/gpl-2.0.html. - * * Copyright IBM Corporation, 2008 * - * Author: Paul E. McKenney + * Author: Paul E. McKenney * * For detailed explanation of Read-Copy Update mechanism see - * Documentation/RCU -- cgit v1.3-7-g2ca7 From 22e409253144d564ec7c8a425fa84b598e36f03c Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 17 Jan 2019 10:23:39 -0800 Subject: rcu/tree: Convert to SPDX license identifier Replace the license boiler plate with a SPDX license identifier. While in the area, update an email address. Signed-off-by: Paul E. McKenney [ paulmck: Update .h file SPDX comment format per Joe Perches. ] Reviewed-by: Thomas Gleixner --- kernel/rcu/tree.c | 19 +++---------------- kernel/rcu/tree.h | 17 ++--------------- kernel/rcu/tree_exp.h | 17 ++--------------- kernel/rcu/tree_plugin.h | 17 ++--------------- 4 files changed, 9 insertions(+), 61 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 1c4add096078..3b8e7d56c028 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1,27 +1,14 @@ +// SPDX-License-Identifier: GPL-2.0+ /* * Read-Copy Update mechanism for mutual exclusion * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program; if not, you can access it online at - * http://www.gnu.org/licenses/gpl-2.0.html. - * * Copyright IBM Corporation, 2008 * * Authors: Dipankar Sarma * Manfred Spraul - * Paul E. McKenney Hierarchical version + * Paul E. McKenney Hierarchical version * - * Based on the original work by Paul McKenney + * Based on the original work by Paul McKenney * and inputs from Rusty Russell, Andrea Arcangeli and Andi Kleen. * * For detailed explanation of Read-Copy Update mechanism see - diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index 149557b7c39c..f22c21ff3a89 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -1,25 +1,12 @@ +/* SPDX-License-Identifier: GPL-2.0+ */ /* * Read-Copy Update mechanism for mutual exclusion (tree-based version) * Internal non-public definitions. * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program; if not, you can access it online at - * http://www.gnu.org/licenses/gpl-2.0.html. - * * Copyright IBM Corporation, 2008 * * Author: Ingo Molnar - * Paul E. McKenney + * Paul E. McKenney */ #include diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h index d882ca0cd01b..66983626d37c 100644 --- a/kernel/rcu/tree_exp.h +++ b/kernel/rcu/tree_exp.h @@ -1,23 +1,10 @@ +/* SPDX-License-Identifier: GPL-2.0+ */ /* * RCU expedited grace periods * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program; if not, you can access it online at - * http://www.gnu.org/licenses/gpl-2.0.html. - * * Copyright IBM Corporation, 2016 * - * Authors: Paul E. McKenney + * Authors: Paul E. McKenney */ #include diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 8ceed9e25ad5..8f67814a5a85 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1,27 +1,14 @@ +/* SPDX-License-Identifier: GPL-2.0+ */ /* * Read-Copy Update mechanism for mutual exclusion (tree-based version) * Internal non-public definitions that provide either classic * or preemptible semantics. * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program; if not, you can access it online at - * http://www.gnu.org/licenses/gpl-2.0.html. - * * Copyright Red Hat, 2009 * Copyright IBM Corporation, 2009 * * Author: Ingo Molnar - * Paul E. McKenney + * Paul E. McKenney */ #include -- cgit v1.3-7-g2ca7 From 38b4df649e8c71c193e4ace237403f1574b900be Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 17 Jan 2019 10:25:18 -0800 Subject: rcu/update: Convert to SPDX license identifier Replace the license boiler plate with a SPDX license identifier. While in the area, update an email address. Signed-off-by: Paul E. McKenney Reviewed-by: Thomas Gleixner --- kernel/rcu/update.c | 17 ++--------------- 1 file changed, 2 insertions(+), 15 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c index 1971869c4072..e3c6395c9b4c 100644 --- a/kernel/rcu/update.c +++ b/kernel/rcu/update.c @@ -1,26 +1,13 @@ +// SPDX-License-Identifier: GPL-2.0+ /* * Read-Copy Update mechanism for mutual exclusion * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program; if not, you can access it online at - * http://www.gnu.org/licenses/gpl-2.0.html. - * * Copyright IBM Corporation, 2001 * * Authors: Dipankar Sarma * Manfred Spraul * - * Based on the original work by Paul McKenney + * Based on the original work by Paul McKenney * and inputs from Rusty Russell, Andrea Arcangeli and Andi Kleen. * Papers: * http://www.rdrop.com/users/paulmck/paper/rclockpdcsproof.pdf -- cgit v1.3-7-g2ca7 From 8f8e76c09ced491a0ab9b088a90b726cb23c4c0a Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 17 Jan 2019 10:41:31 -0800 Subject: torture: Convert to SPDX license identifier Replace the license boiler plate with a SPDX license identifier. While in the area, update an email address. Signed-off-by: Paul E. McKenney Reviewed-by: Thomas Gleixner --- kernel/torture.c | 19 +++---------------- 1 file changed, 3 insertions(+), 16 deletions(-) (limited to 'kernel') diff --git a/kernel/torture.c b/kernel/torture.c index bbf6d473e50c..67620e5141d2 100644 --- a/kernel/torture.c +++ b/kernel/torture.c @@ -1,23 +1,10 @@ +// SPDX-License-Identifier: GPL-2.0+ /* * Common functions for in-kernel torture tests. * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program; if not, you can access it online at - * http://www.gnu.org/licenses/gpl-2.0.html. - * * Copyright (C) IBM Corporation, 2014 * - * Author: Paul E. McKenney + * Author: Paul E. McKenney * Based on kernel/rcu/torture.c. */ @@ -53,7 +40,7 @@ #include "rcu/rcu.h" MODULE_LICENSE("GPL"); -MODULE_AUTHOR("Paul E. McKenney "); +MODULE_AUTHOR("Paul E. McKenney "); static char *torture_type; static int verbose; -- cgit v1.3-7-g2ca7 From 5a4eb3cb2012b38022041c7a87cbcf5af6a3302f Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 17 Jan 2019 11:11:00 -0800 Subject: locking/locktorture: Convert to SPDX license identifier Replace the license boiler plate with a SPDX license identifier. While in the area, update an email address. Signed-off-by: Paul E. McKenney Reviewed-by: Thomas Gleixner --- kernel/locking/locktorture.c | 19 +++---------------- 1 file changed, 3 insertions(+), 16 deletions(-) (limited to 'kernel') diff --git a/kernel/locking/locktorture.c b/kernel/locking/locktorture.c index 7d0b0ed74404..d163c5b75d72 100644 --- a/kernel/locking/locktorture.c +++ b/kernel/locking/locktorture.c @@ -1,23 +1,10 @@ +// SPDX-License-Identifier: GPL-2.0+ /* * Module-based torture test facility for locking * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program; if not, you can access it online at - * http://www.gnu.org/licenses/gpl-2.0.html. - * * Copyright (C) IBM Corporation, 2014 * - * Authors: Paul E. McKenney + * Authors: Paul E. McKenney * Davidlohr Bueso * Based on kernel/rcu/torture.c. */ @@ -45,7 +32,7 @@ #include MODULE_LICENSE("GPL"); -MODULE_AUTHOR("Paul E. McKenney "); +MODULE_AUTHOR("Paul E. McKenney "); torture_param(int, nwriters_stress, -1, "Number of write-locking stress-test threads"); -- cgit v1.3-7-g2ca7 From d623876646be119439999a229a2c3ce30fd197fb Mon Sep 17 00:00:00 2001 From: Martin KaFai Lau Date: Fri, 8 Feb 2019 22:25:54 -0800 Subject: bpf: Fix narrow load on a bpf_sock returned from sk_lookup() By adding this test to test_verifier: { "reference tracking: access sk->src_ip4 (narrow load)", .insns = { BPF_SK_LOOKUP, BPF_MOV64_REG(BPF_REG_6, BPF_REG_0), BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 3), BPF_LDX_MEM(BPF_H, BPF_REG_2, BPF_REG_0, offsetof(struct bpf_sock, src_ip4) + 2), BPF_MOV64_REG(BPF_REG_1, BPF_REG_6), BPF_EMIT_CALL(BPF_FUNC_sk_release), BPF_EXIT_INSN(), }, .prog_type = BPF_PROG_TYPE_SCHED_CLS, .result = ACCEPT, }, The above test loads 2 bytes from sk->src_ip4 where sk is obtained by bpf_sk_lookup_tcp(). It hits an internal verifier error from convert_ctx_accesses(): [root@arch-fb-vm1 bpf]# ./test_verifier 665 665 Failed to load prog 'Invalid argument'! 0: (b7) r2 = 0 1: (63) *(u32 *)(r10 -8) = r2 2: (7b) *(u64 *)(r10 -16) = r2 3: (7b) *(u64 *)(r10 -24) = r2 4: (7b) *(u64 *)(r10 -32) = r2 5: (7b) *(u64 *)(r10 -40) = r2 6: (7b) *(u64 *)(r10 -48) = r2 7: (bf) r2 = r10 8: (07) r2 += -48 9: (b7) r3 = 36 10: (b7) r4 = 0 11: (b7) r5 = 0 12: (85) call bpf_sk_lookup_tcp#84 13: (bf) r6 = r0 14: (15) if r0 == 0x0 goto pc+3 R0=sock(id=1,off=0,imm=0) R6=sock(id=1,off=0,imm=0) R10=fp0,call_-1 fp-8=????0000 fp-16=0000mmmm fp-24=mmmmmmmm fp-32=mmmmmmmm fp-40=mmmmmmmm fp-48=mmmmmmmm refs=1 15: (69) r2 = *(u16 *)(r0 +26) 16: (bf) r1 = r6 17: (85) call bpf_sk_release#86 18: (95) exit from 14 to 18: safe processed 20 insns (limit 131072), stack depth 48 bpf verifier is misconfigured Summary: 0 PASSED, 0 SKIPPED, 1 FAILED The bpf_sock_is_valid_access() is expecting src_ip4 can be narrowly loaded (meaning load any 1 or 2 bytes of the src_ip4) by marking info->ctx_field_size. However, this marked ctx_field_size is not used. This patch fixes it. Due to the recent refactoring in test_verifier, this new test will be added to the bpf-next branch (together with the bpf_tcp_sock patchset) to avoid merge conflict. Fixes: c64b7983288e ("bpf: Add PTR_TO_SOCKET verifier type") Cc: Joe Stringer Signed-off-by: Martin KaFai Lau Acked-by: Joe Stringer Signed-off-by: Alexei Starovoitov --- kernel/bpf/verifier.c | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 56674a7c3778..8f295b790297 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -1617,12 +1617,13 @@ static int check_flow_keys_access(struct bpf_verifier_env *env, int off, return 0; } -static int check_sock_access(struct bpf_verifier_env *env, u32 regno, int off, - int size, enum bpf_access_type t) +static int check_sock_access(struct bpf_verifier_env *env, int insn_idx, + u32 regno, int off, int size, + enum bpf_access_type t) { struct bpf_reg_state *regs = cur_regs(env); struct bpf_reg_state *reg = ®s[regno]; - struct bpf_insn_access_aux info; + struct bpf_insn_access_aux info = {}; if (reg->smin_value < 0) { verbose(env, "R%d min value is negative, either use unsigned index or do a if (index >=0) check.\n", @@ -1636,6 +1637,8 @@ static int check_sock_access(struct bpf_verifier_env *env, u32 regno, int off, return -EACCES; } + env->insn_aux_data[insn_idx].ctx_field_size = info.ctx_field_size; + return 0; } @@ -2032,7 +2035,7 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn verbose(env, "cannot write into socket\n"); return -EACCES; } - err = check_sock_access(env, regno, off, size, t); + err = check_sock_access(env, insn_idx, regno, off, size, t); if (!err && value_regno >= 0) mark_reg_unknown(env, regs, value_regno); } else { -- cgit v1.3-7-g2ca7 From 347253c42d7c673aa2a659d756bc7ff893459247 Mon Sep 17 00:00:00 2001 From: Ming Lei Date: Fri, 25 Jan 2019 17:53:43 +0800 Subject: genirq/affinity: Move allocation of 'node_to_cpumask' to irq_build_affinity_masks() 'node_to_cpumask' is just one temparay variable for irq_build_affinity_masks(), so move it into irq_build_affinity_masks(). No functioanl change. Signed-off-by: Ming Lei Signed-off-by: Thomas Gleixner Reviewed-by: Bjorn Helgaas Cc: Christoph Hellwig Cc: Jens Axboe Cc: linux-block@vger.kernel.org Cc: Sagi Grimberg Cc: linux-nvme@lists.infradead.org Cc: linux-pci@vger.kernel.org Link: https://lkml.kernel.org/r/20190125095347.17950-2-ming.lei@redhat.com --- kernel/irq/affinity.c | 27 +++++++++++++-------------- 1 file changed, 13 insertions(+), 14 deletions(-) (limited to 'kernel') diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c index 45b68b4ea48b..118b66d64a53 100644 --- a/kernel/irq/affinity.c +++ b/kernel/irq/affinity.c @@ -175,18 +175,22 @@ out: */ static int irq_build_affinity_masks(const struct irq_affinity *affd, int startvec, int numvecs, int firstvec, - cpumask_var_t *node_to_cpumask, struct irq_affinity_desc *masks) { int curvec = startvec, nr_present, nr_others; int ret = -ENOMEM; cpumask_var_t nmsk, npresmsk; + cpumask_var_t *node_to_cpumask; if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL)) return ret; if (!zalloc_cpumask_var(&npresmsk, GFP_KERNEL)) - goto fail; + goto fail_nmsk; + + node_to_cpumask = alloc_node_to_cpumask(); + if (!node_to_cpumask) + goto fail_npresmsk; ret = 0; /* Stabilize the cpumasks */ @@ -217,9 +221,12 @@ static int irq_build_affinity_masks(const struct irq_affinity *affd, if (nr_present < numvecs) WARN_ON(nr_present + nr_others < numvecs); + free_node_to_cpumask(node_to_cpumask); + + fail_npresmsk: free_cpumask_var(npresmsk); - fail: + fail_nmsk: free_cpumask_var(nmsk); return ret; } @@ -236,7 +243,6 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd) { int affvecs = nvecs - affd->pre_vectors - affd->post_vectors; int curvec, usedvecs; - cpumask_var_t *node_to_cpumask; struct irq_affinity_desc *masks = NULL; int i, nr_sets; @@ -247,13 +253,9 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd) if (nvecs == affd->pre_vectors + affd->post_vectors) return NULL; - node_to_cpumask = alloc_node_to_cpumask(); - if (!node_to_cpumask) - return NULL; - masks = kcalloc(nvecs, sizeof(*masks), GFP_KERNEL); if (!masks) - goto outnodemsk; + return NULL; /* Fill out vectors at the beginning that don't need affinity */ for (curvec = 0; curvec < affd->pre_vectors; curvec++) @@ -271,11 +273,10 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd) int ret; ret = irq_build_affinity_masks(affd, curvec, this_vecs, - curvec, node_to_cpumask, masks); + curvec, masks); if (ret) { kfree(masks); - masks = NULL; - goto outnodemsk; + return NULL; } curvec += this_vecs; usedvecs += this_vecs; @@ -293,8 +294,6 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd) for (i = affd->pre_vectors; i < nvecs - affd->post_vectors; i++) masks[i].is_managed = 1; -outnodemsk: - free_node_to_cpumask(node_to_cpumask); return masks; } -- cgit v1.3-7-g2ca7 From 1136b0728969901a091f0471968b2b76ed14d9ad Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Fri, 8 Feb 2019 14:48:03 +0100 Subject: genirq: Avoid summation loops for /proc/stat Waiman reported that on large systems with a large amount of interrupts the readout of /proc/stat takes a long time to sum up the interrupt statistics. In principle this is not a problem. but for unknown reasons some enterprise quality software reads /proc/stat with a high frequency. The reason for this is that interrupt statistics are accounted per cpu. So the /proc/stat logic has to sum up the interrupt stats for each interrupt. This can be largely avoided for interrupts which are not marked as 'PER_CPU' interrupts by simply adding a per interrupt summation counter which is incremented along with the per interrupt per cpu counter. The PER_CPU interrupts need to avoid that and use only per cpu accounting because they share the interrupt number and the interrupt descriptor and concurrent updates would conflict or require unwanted synchronization. Reported-by: Waiman Long Signed-off-by: Thomas Gleixner Reviewed-by: Waiman Long Reviewed-by: Marc Zyngier Reviewed-by: Davidlohr Bueso Cc: Matthew Wilcox Cc: Andrew Morton Cc: Alexey Dobriyan Cc: Kees Cook Cc: linux-fsdevel@vger.kernel.org Cc: Davidlohr Bueso Cc: Miklos Szeredi Cc: Daniel Colascione Cc: Dave Chinner Cc: Randy Dunlap Link: https://lkml.kernel.org/r/20190208135020.925487496@linutronix.de 8<------------- v2: Undo the unintentional layout change of struct irq_desc. include/linux/irqdesc.h | 1 + kernel/irq/chip.c | 12 ++++++++++-- kernel/irq/internals.h | 8 +++++++- kernel/irq/irqdesc.c | 7 ++++++- 4 files changed, 24 insertions(+), 4 deletions(-) --- include/linux/irqdesc.h | 1 + kernel/irq/chip.c | 12 ++++++++++-- kernel/irq/internals.h | 8 +++++++- kernel/irq/irqdesc.c | 7 ++++++- 4 files changed, 24 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/include/linux/irqdesc.h b/include/linux/irqdesc.h index dd1e40ddac7d..875c41b23f20 100644 --- a/include/linux/irqdesc.h +++ b/include/linux/irqdesc.h @@ -65,6 +65,7 @@ struct irq_desc { unsigned int core_internal_state__do_not_mess_with_it; unsigned int depth; /* nested irq disables */ unsigned int wake_depth; /* nested wake enables */ + unsigned int tot_count; unsigned int irq_count; /* For detecting broken IRQs */ unsigned long last_unhandled; /* Aging timer for unhandled count */ unsigned int irqs_unhandled; diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c index 34e969069488..e960c4f46ee0 100644 --- a/kernel/irq/chip.c +++ b/kernel/irq/chip.c @@ -855,7 +855,11 @@ void handle_percpu_irq(struct irq_desc *desc) { struct irq_chip *chip = irq_desc_get_chip(desc); - kstat_incr_irqs_this_cpu(desc); + /* + * PER CPU interrupts are not serialized. Do not touch + * desc->tot_count. + */ + __kstat_incr_irqs_this_cpu(desc); if (chip->irq_ack) chip->irq_ack(&desc->irq_data); @@ -884,7 +888,11 @@ void handle_percpu_devid_irq(struct irq_desc *desc) unsigned int irq = irq_desc_get_irq(desc); irqreturn_t res; - kstat_incr_irqs_this_cpu(desc); + /* + * PER CPU interrupts are not serialized. Do not touch + * desc->tot_count. + */ + __kstat_incr_irqs_this_cpu(desc); if (chip->irq_ack) chip->irq_ack(&desc->irq_data); diff --git a/kernel/irq/internals.h b/kernel/irq/internals.h index ca6afa267070..e74e7eea76cf 100644 --- a/kernel/irq/internals.h +++ b/kernel/irq/internals.h @@ -242,12 +242,18 @@ static inline void irq_state_set_masked(struct irq_desc *desc) #undef __irqd_to_state -static inline void kstat_incr_irqs_this_cpu(struct irq_desc *desc) +static inline void __kstat_incr_irqs_this_cpu(struct irq_desc *desc) { __this_cpu_inc(*desc->kstat_irqs); __this_cpu_inc(kstat.irqs_sum); } +static inline void kstat_incr_irqs_this_cpu(struct irq_desc *desc) +{ + __kstat_incr_irqs_this_cpu(desc); + desc->tot_count++; +} + static inline int irq_desc_get_node(struct irq_desc *desc) { return irq_common_data_get_node(&desc->irq_common_data); diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c index ee062b7939d3..f98293d0e173 100644 --- a/kernel/irq/irqdesc.c +++ b/kernel/irq/irqdesc.c @@ -119,6 +119,7 @@ static void desc_set_defaults(unsigned int irq, struct irq_desc *desc, int node, desc->depth = 1; desc->irq_count = 0; desc->irqs_unhandled = 0; + desc->tot_count = 0; desc->name = NULL; desc->owner = owner; for_each_possible_cpu(cpu) @@ -919,11 +920,15 @@ unsigned int kstat_irqs_cpu(unsigned int irq, int cpu) unsigned int kstat_irqs(unsigned int irq) { struct irq_desc *desc = irq_to_desc(irq); - int cpu; unsigned int sum = 0; + int cpu; if (!desc || !desc->kstat_irqs) return 0; + if (!irq_settings_is_per_cpu_devid(desc) && + !irq_settings_is_per_cpu(desc)) + return desc->tot_count; + for_each_possible_cpu(cpu) sum += *per_cpu_ptr(desc->kstat_irqs, cpu); return sum; -- cgit v1.3-7-g2ca7 From 0121805d9d2b1fff371e195c28e9b86ae38b5e47 Mon Sep 17 00:00:00 2001 From: Matthias Kaehlcke Date: Mon, 28 Jan 2019 15:46:24 -0800 Subject: kthread: Add __kthread_should_park() kthread_should_park() is used to check if the calling kthread ('current') should park, but there is no function to check whether an arbitrary kthread should be parked. The latter is required to plug a CPU hotplug race vs. a parking ksoftirqd thread. The new __kthread_should_park() receives a task_struct as parameter to check if the corresponding kernel thread should be parked. Call __kthread_should_park() from kthread_should_park() to avoid code duplication. Signed-off-by: Matthias Kaehlcke Signed-off-by: Thomas Gleixner Cc: Peter Zijlstra Cc: Steven Rostedt Cc: "Paul E . McKenney" Cc: Sebastian Andrzej Siewior Cc: Douglas Anderson Cc: Stephen Boyd Link: https://lkml.kernel.org/r/20190128234625.78241-2-mka@chromium.org --- include/linux/kthread.h | 1 + kernel/kthread.c | 8 +++++++- 2 files changed, 8 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/include/linux/kthread.h b/include/linux/kthread.h index c1961761311d..1577a2d56e9d 100644 --- a/include/linux/kthread.h +++ b/include/linux/kthread.h @@ -56,6 +56,7 @@ void kthread_bind_mask(struct task_struct *k, const struct cpumask *mask); int kthread_stop(struct task_struct *k); bool kthread_should_stop(void); bool kthread_should_park(void); +bool __kthread_should_park(struct task_struct *k); bool kthread_freezable_should_stop(bool *was_frozen); void *kthread_data(struct task_struct *k); void *kthread_probe_data(struct task_struct *k); diff --git a/kernel/kthread.c b/kernel/kthread.c index 087d18d771b5..65234c89d85b 100644 --- a/kernel/kthread.c +++ b/kernel/kthread.c @@ -101,6 +101,12 @@ bool kthread_should_stop(void) } EXPORT_SYMBOL(kthread_should_stop); +bool __kthread_should_park(struct task_struct *k) +{ + return test_bit(KTHREAD_SHOULD_PARK, &to_kthread(k)->flags); +} +EXPORT_SYMBOL_GPL(__kthread_should_park); + /** * kthread_should_park - should this kthread park now? * @@ -114,7 +120,7 @@ EXPORT_SYMBOL(kthread_should_stop); */ bool kthread_should_park(void) { - return test_bit(KTHREAD_SHOULD_PARK, &to_kthread(current)->flags); + return __kthread_should_park(current); } EXPORT_SYMBOL_GPL(kthread_should_park); -- cgit v1.3-7-g2ca7 From 1342d8080f6183b0419a9246c6e6545e21fa1e05 Mon Sep 17 00:00:00 2001 From: Matthias Kaehlcke Date: Mon, 28 Jan 2019 15:46:25 -0800 Subject: softirq: Don't skip softirq execution when softirq thread is parking When a CPU is unplugged the kernel threads of this CPU are parked (see smpboot_park_threads()). kthread_park() is used to mark each thread as parked and wake it up, so it can complete the process of parking itselfs (see smpboot_thread_fn()). If local softirqs are pending on interrupt exit invoke_softirq() is called to process the softirqs, however it skips processing when the softirq kernel thread of the local CPU is scheduled to run. The softirq kthread is one of the threads that is parked when a CPU is unplugged. Parking the kthread wakes it up, however only to complete the parking process, not to process the pending softirqs. Hence processing of softirqs at the end of an interrupt is skipped, but not done elsewhere, which can result in warnings about pending softirqs when a CPU is unplugged: /sys/devices/system/cpu # echo 0 > cpu4/online [ ... ] NOHZ: local_softirq_pending 02 [ ... ] NOHZ: local_softirq_pending 202 [ ... ] CPU4: shutdown [ ... ] psci: CPU4 killed. Don't skip processing of softirqs at the end of an interrupt when the softirq thread of the CPU is parking. Signed-off-by: Matthias Kaehlcke Signed-off-by: Thomas Gleixner Cc: Peter Zijlstra Cc: Steven Rostedt Cc: "Paul E . McKenney" Cc: Sebastian Andrzej Siewior Cc: Douglas Anderson Cc: Stephen Boyd Link: https://lkml.kernel.org/r/20190128234625.78241-3-mka@chromium.org --- kernel/softirq.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/softirq.c b/kernel/softirq.c index d28813306b2c..10277429ed84 100644 --- a/kernel/softirq.c +++ b/kernel/softirq.c @@ -89,7 +89,8 @@ static bool ksoftirqd_running(unsigned long pending) if (pending & SOFTIRQ_NOW_MASK) return false; - return tsk && (tsk->state == TASK_RUNNING); + return tsk && (tsk->state == TASK_RUNNING) && + !__kthread_should_park(tsk); } /* -- cgit v1.3-7-g2ca7 From 5f4566498dee5e38e36a015a968c22ed21568f0b Mon Sep 17 00:00:00 2001 From: Martin KaFai Lau Date: Fri, 8 Feb 2019 22:25:54 -0800 Subject: bpf: Fix narrow load on a bpf_sock returned from sk_lookup() By adding this test to test_verifier: { "reference tracking: access sk->src_ip4 (narrow load)", .insns = { BPF_SK_LOOKUP, BPF_MOV64_REG(BPF_REG_6, BPF_REG_0), BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 3), BPF_LDX_MEM(BPF_H, BPF_REG_2, BPF_REG_0, offsetof(struct bpf_sock, src_ip4) + 2), BPF_MOV64_REG(BPF_REG_1, BPF_REG_6), BPF_EMIT_CALL(BPF_FUNC_sk_release), BPF_EXIT_INSN(), }, .prog_type = BPF_PROG_TYPE_SCHED_CLS, .result = ACCEPT, }, The above test loads 2 bytes from sk->src_ip4 where sk is obtained by bpf_sk_lookup_tcp(). It hits an internal verifier error from convert_ctx_accesses(): [root@arch-fb-vm1 bpf]# ./test_verifier 665 665 Failed to load prog 'Invalid argument'! 0: (b7) r2 = 0 1: (63) *(u32 *)(r10 -8) = r2 2: (7b) *(u64 *)(r10 -16) = r2 3: (7b) *(u64 *)(r10 -24) = r2 4: (7b) *(u64 *)(r10 -32) = r2 5: (7b) *(u64 *)(r10 -40) = r2 6: (7b) *(u64 *)(r10 -48) = r2 7: (bf) r2 = r10 8: (07) r2 += -48 9: (b7) r3 = 36 10: (b7) r4 = 0 11: (b7) r5 = 0 12: (85) call bpf_sk_lookup_tcp#84 13: (bf) r6 = r0 14: (15) if r0 == 0x0 goto pc+3 R0=sock(id=1,off=0,imm=0) R6=sock(id=1,off=0,imm=0) R10=fp0,call_-1 fp-8=????0000 fp-16=0000mmmm fp-24=mmmmmmmm fp-32=mmmmmmmm fp-40=mmmmmmmm fp-48=mmmmmmmm refs=1 15: (69) r2 = *(u16 *)(r0 +26) 16: (bf) r1 = r6 17: (85) call bpf_sk_release#86 18: (95) exit from 14 to 18: safe processed 20 insns (limit 131072), stack depth 48 bpf verifier is misconfigured Summary: 0 PASSED, 0 SKIPPED, 1 FAILED The bpf_sock_is_valid_access() is expecting src_ip4 can be narrowly loaded (meaning load any 1 or 2 bytes of the src_ip4) by marking info->ctx_field_size. However, this marked ctx_field_size is not used. This patch fixes it. Due to the recent refactoring in test_verifier, this new test will be added to the bpf-next branch (together with the bpf_tcp_sock patchset) to avoid merge conflict. Fixes: c64b7983288e ("bpf: Add PTR_TO_SOCKET verifier type") Cc: Joe Stringer Signed-off-by: Martin KaFai Lau Acked-by: Joe Stringer Signed-off-by: Alexei Starovoitov --- kernel/bpf/verifier.c | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index b63bc77af2d1..516dfc6d78de 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -1640,12 +1640,13 @@ static int check_flow_keys_access(struct bpf_verifier_env *env, int off, return 0; } -static int check_sock_access(struct bpf_verifier_env *env, u32 regno, int off, - int size, enum bpf_access_type t) +static int check_sock_access(struct bpf_verifier_env *env, int insn_idx, + u32 regno, int off, int size, + enum bpf_access_type t) { struct bpf_reg_state *regs = cur_regs(env); struct bpf_reg_state *reg = ®s[regno]; - struct bpf_insn_access_aux info; + struct bpf_insn_access_aux info = {}; if (reg->smin_value < 0) { verbose(env, "R%d min value is negative, either use unsigned index or do a if (index >=0) check.\n", @@ -1659,6 +1660,8 @@ static int check_sock_access(struct bpf_verifier_env *env, u32 regno, int off, return -EACCES; } + env->insn_aux_data[insn_idx].ctx_field_size = info.ctx_field_size; + return 0; } @@ -2055,7 +2058,7 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn verbose(env, "cannot write into socket\n"); return -EACCES; } - err = check_sock_access(env, regno, off, size, t); + err = check_sock_access(env, insn_idx, regno, off, size, t); if (!err && value_regno >= 0) mark_reg_unknown(env, regs, value_regno); } else { -- cgit v1.3-7-g2ca7 From 46f8bc92758c6259bcf945e9216098661c1587cd Mon Sep 17 00:00:00 2001 From: Martin KaFai Lau Date: Sat, 9 Feb 2019 23:22:20 -0800 Subject: bpf: Add a bpf_sock pointer to __sk_buff and a bpf_sk_fullsock helper In kernel, it is common to check "skb->sk && sk_fullsock(skb->sk)" before accessing the fields in sock. For example, in __netdev_pick_tx: static u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb, struct net_device *sb_dev) { /* ... */ struct sock *sk = skb->sk; if (queue_index != new_index && sk && sk_fullsock(sk) && rcu_access_pointer(sk->sk_dst_cache)) sk_tx_queue_set(sk, new_index); /* ... */ return queue_index; } This patch adds a "struct bpf_sock *sk" pointer to the "struct __sk_buff" where a few of the convert_ctx_access() in filter.c has already been accessing the skb->sk sock_common's fields, e.g. sock_ops_convert_ctx_access(). "__sk_buff->sk" is a PTR_TO_SOCK_COMMON_OR_NULL in the verifier. Some of the fileds in "bpf_sock" will not be directly accessible through the "__sk_buff->sk" pointer. It is limited by the new "bpf_sock_common_is_valid_access()". e.g. The existing "type", "protocol", "mark" and "priority" in bpf_sock are not allowed. The newly added "struct bpf_sock *bpf_sk_fullsock(struct bpf_sock *sk)" can be used to get a sk with all accessible fields in "bpf_sock". This helper is added to both cg_skb and sched_(cls|act). int cg_skb_foo(struct __sk_buff *skb) { struct bpf_sock *sk; sk = skb->sk; if (!sk) return 1; sk = bpf_sk_fullsock(sk); if (!sk) return 1; if (sk->family != AF_INET6 || sk->protocol != IPPROTO_TCP) return 1; /* some_traffic_shaping(); */ return 1; } (1) The sk is read only (2) There is no new "struct bpf_sock_common" introduced. (3) Future kernel sock's members could be added to bpf_sock only instead of repeatedly adding at multiple places like currently in bpf_sock_ops_md, bpf_sock_addr_md, sk_reuseport_md...etc. (4) After "sk = skb->sk", the reg holding sk is in type PTR_TO_SOCK_COMMON_OR_NULL. (5) After bpf_sk_fullsock(), the return type will be in type PTR_TO_SOCKET_OR_NULL which is the same as the return type of bpf_sk_lookup_xxx(). However, bpf_sk_fullsock() does not take refcnt. The acquire_reference_state() is only depending on the return type now. To avoid it, a new is_acquire_function() is checked before calling acquire_reference_state(). (6) The WARN_ON in "release_reference_state()" is no longer an internal verifier bug. When reg->id is not found in state->refs[], it means the bpf_prog does something wrong like "bpf_sk_release(bpf_sk_fullsock(skb->sk))" where reference has never been acquired by calling "bpf_sk_fullsock(skb->sk)". A -EINVAL and a verbose are done instead of WARN_ON. A test is added to the test_verifier in a later patch. Since the WARN_ON in "release_reference_state()" is no longer needed, "__release_reference_state()" is folded into "release_reference_state()" also. Acked-by: Alexei Starovoitov Signed-off-by: Martin KaFai Lau Signed-off-by: Alexei Starovoitov --- include/linux/bpf.h | 12 +++++ include/uapi/linux/bpf.h | 12 ++++- kernel/bpf/verifier.c | 132 +++++++++++++++++++++++++++++++++-------------- net/core/filter.c | 42 +++++++++++++++ 4 files changed, 157 insertions(+), 41 deletions(-) (limited to 'kernel') diff --git a/include/linux/bpf.h b/include/linux/bpf.h index bd169a7bcc93..a60463b45b54 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -194,6 +194,7 @@ enum bpf_arg_type { ARG_ANYTHING, /* any (initialized) argument is ok */ ARG_PTR_TO_SOCKET, /* pointer to bpf_sock */ ARG_PTR_TO_SPIN_LOCK, /* pointer to bpf_spin_lock */ + ARG_PTR_TO_SOCK_COMMON, /* pointer to sock_common */ }; /* type of values returned from helper functions */ @@ -256,6 +257,8 @@ enum bpf_reg_type { PTR_TO_FLOW_KEYS, /* reg points to bpf_flow_keys */ PTR_TO_SOCKET, /* reg points to struct bpf_sock */ PTR_TO_SOCKET_OR_NULL, /* reg points to struct bpf_sock or NULL */ + PTR_TO_SOCK_COMMON, /* reg points to sock_common */ + PTR_TO_SOCK_COMMON_OR_NULL, /* reg points to sock_common or NULL */ }; /* The information passed from prog-specific *_is_valid_access @@ -920,6 +923,9 @@ void bpf_user_rnd_init_once(void); u64 bpf_user_rnd_u32(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5); #if defined(CONFIG_NET) +bool bpf_sock_common_is_valid_access(int off, int size, + enum bpf_access_type type, + struct bpf_insn_access_aux *info); bool bpf_sock_is_valid_access(int off, int size, enum bpf_access_type type, struct bpf_insn_access_aux *info); u32 bpf_sock_convert_ctx_access(enum bpf_access_type type, @@ -928,6 +934,12 @@ u32 bpf_sock_convert_ctx_access(enum bpf_access_type type, struct bpf_prog *prog, u32 *target_size); #else +static inline bool bpf_sock_common_is_valid_access(int off, int size, + enum bpf_access_type type, + struct bpf_insn_access_aux *info) +{ + return false; +} static inline bool bpf_sock_is_valid_access(int off, int size, enum bpf_access_type type, struct bpf_insn_access_aux *info) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 1777fa0c61e4..5d79cba74ddc 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -2329,6 +2329,14 @@ union bpf_attr { * "**y**". * Return * 0 + * + * struct bpf_sock *bpf_sk_fullsock(struct bpf_sock *sk) + * Description + * This helper gets a **struct bpf_sock** pointer such + * that all the fields in bpf_sock can be accessed. + * Return + * A **struct bpf_sock** pointer on success, or NULL in + * case of failure. */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -2425,7 +2433,8 @@ union bpf_attr { FN(msg_pop_data), \ FN(rc_pointer_rel), \ FN(spin_lock), \ - FN(spin_unlock), + FN(spin_unlock), \ + FN(sk_fullsock), /* integer value in 'imm' field of BPF_CALL instruction selects which helper * function eBPF program intends to call @@ -2545,6 +2554,7 @@ struct __sk_buff { __u64 tstamp; __u32 wire_len; __u32 gso_segs; + __bpf_md_ptr(struct bpf_sock *, sk); }; struct bpf_tunnel_key { diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 516dfc6d78de..b755d55a3791 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -331,10 +331,17 @@ static bool type_is_pkt_pointer(enum bpf_reg_type type) type == PTR_TO_PACKET_META; } +static bool type_is_sk_pointer(enum bpf_reg_type type) +{ + return type == PTR_TO_SOCKET || + type == PTR_TO_SOCK_COMMON; +} + static bool reg_type_may_be_null(enum bpf_reg_type type) { return type == PTR_TO_MAP_VALUE_OR_NULL || - type == PTR_TO_SOCKET_OR_NULL; + type == PTR_TO_SOCKET_OR_NULL || + type == PTR_TO_SOCK_COMMON_OR_NULL; } static bool type_is_refcounted(enum bpf_reg_type type) @@ -377,6 +384,12 @@ static bool is_release_function(enum bpf_func_id func_id) return func_id == BPF_FUNC_sk_release; } +static bool is_acquire_function(enum bpf_func_id func_id) +{ + return func_id == BPF_FUNC_sk_lookup_tcp || + func_id == BPF_FUNC_sk_lookup_udp; +} + /* string representation of 'enum bpf_reg_type' */ static const char * const reg_type_str[] = { [NOT_INIT] = "?", @@ -392,6 +405,8 @@ static const char * const reg_type_str[] = { [PTR_TO_FLOW_KEYS] = "flow_keys", [PTR_TO_SOCKET] = "sock", [PTR_TO_SOCKET_OR_NULL] = "sock_or_null", + [PTR_TO_SOCK_COMMON] = "sock_common", + [PTR_TO_SOCK_COMMON_OR_NULL] = "sock_common_or_null", }; static char slot_type_char[] = { @@ -618,13 +633,10 @@ static int acquire_reference_state(struct bpf_verifier_env *env, int insn_idx) } /* release function corresponding to acquire_reference_state(). Idempotent. */ -static int __release_reference_state(struct bpf_func_state *state, int ptr_id) +static int release_reference_state(struct bpf_func_state *state, int ptr_id) { int i, last_idx; - if (!ptr_id) - return -EFAULT; - last_idx = state->acquired_refs - 1; for (i = 0; i < state->acquired_refs; i++) { if (state->refs[i].id == ptr_id) { @@ -636,21 +648,7 @@ static int __release_reference_state(struct bpf_func_state *state, int ptr_id) return 0; } } - return -EFAULT; -} - -/* variation on the above for cases where we expect that there must be an - * outstanding reference for the specified ptr_id. - */ -static int release_reference_state(struct bpf_verifier_env *env, int ptr_id) -{ - struct bpf_func_state *state = cur_func(env); - int err; - - err = __release_reference_state(state, ptr_id); - if (WARN_ON_ONCE(err != 0)) - verbose(env, "verifier internal error: can't release reference\n"); - return err; + return -EINVAL; } static int transfer_reference_state(struct bpf_func_state *dst, @@ -1209,6 +1207,8 @@ static bool is_spillable_regtype(enum bpf_reg_type type) case CONST_PTR_TO_MAP: case PTR_TO_SOCKET: case PTR_TO_SOCKET_OR_NULL: + case PTR_TO_SOCK_COMMON: + case PTR_TO_SOCK_COMMON_OR_NULL: return true; default: return false; @@ -1647,6 +1647,7 @@ static int check_sock_access(struct bpf_verifier_env *env, int insn_idx, struct bpf_reg_state *regs = cur_regs(env); struct bpf_reg_state *reg = ®s[regno]; struct bpf_insn_access_aux info = {}; + bool valid; if (reg->smin_value < 0) { verbose(env, "R%d min value is negative, either use unsigned index or do a if (index >=0) check.\n", @@ -1654,15 +1655,28 @@ static int check_sock_access(struct bpf_verifier_env *env, int insn_idx, return -EACCES; } - if (!bpf_sock_is_valid_access(off, size, t, &info)) { - verbose(env, "invalid bpf_sock access off=%d size=%d\n", - off, size); - return -EACCES; + switch (reg->type) { + case PTR_TO_SOCK_COMMON: + valid = bpf_sock_common_is_valid_access(off, size, t, &info); + break; + case PTR_TO_SOCKET: + valid = bpf_sock_is_valid_access(off, size, t, &info); + break; + default: + valid = false; } - env->insn_aux_data[insn_idx].ctx_field_size = info.ctx_field_size; - return 0; + if (valid) { + env->insn_aux_data[insn_idx].ctx_field_size = + info.ctx_field_size; + return 0; + } + + verbose(env, "R%d invalid %s access off=%d size=%d\n", + regno, reg_type_str[reg->type], off, size); + + return -EACCES; } static bool __is_pointer_value(bool allow_ptr_leaks, @@ -1688,8 +1702,14 @@ static bool is_ctx_reg(struct bpf_verifier_env *env, int regno) { const struct bpf_reg_state *reg = reg_state(env, regno); - return reg->type == PTR_TO_CTX || - reg->type == PTR_TO_SOCKET; + return reg->type == PTR_TO_CTX; +} + +static bool is_sk_reg(struct bpf_verifier_env *env, int regno) +{ + const struct bpf_reg_state *reg = reg_state(env, regno); + + return type_is_sk_pointer(reg->type); } static bool is_pkt_reg(struct bpf_verifier_env *env, int regno) @@ -1800,6 +1820,9 @@ static int check_ptr_alignment(struct bpf_verifier_env *env, case PTR_TO_SOCKET: pointer_desc = "sock "; break; + case PTR_TO_SOCK_COMMON: + pointer_desc = "sock_common "; + break; default: break; } @@ -2003,11 +2026,14 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn * PTR_TO_PACKET[_META,_END]. In the latter * case, we know the offset is zero. */ - if (reg_type == SCALAR_VALUE) + if (reg_type == SCALAR_VALUE) { mark_reg_unknown(env, regs, value_regno); - else + } else { mark_reg_known_zero(env, regs, value_regno); + if (reg_type_may_be_null(reg_type)) + regs[value_regno].id = ++env->id_gen; + } regs[value_regno].type = reg_type; } @@ -2053,9 +2079,10 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn err = check_flow_keys_access(env, off, size); if (!err && t == BPF_READ && value_regno >= 0) mark_reg_unknown(env, regs, value_regno); - } else if (reg->type == PTR_TO_SOCKET) { + } else if (type_is_sk_pointer(reg->type)) { if (t == BPF_WRITE) { - verbose(env, "cannot write into socket\n"); + verbose(env, "R%d cannot write into %s\n", + regno, reg_type_str[reg->type]); return -EACCES; } err = check_sock_access(env, insn_idx, regno, off, size, t); @@ -2102,7 +2129,8 @@ static int check_xadd(struct bpf_verifier_env *env, int insn_idx, struct bpf_ins if (is_ctx_reg(env, insn->dst_reg) || is_pkt_reg(env, insn->dst_reg) || - is_flow_key_reg(env, insn->dst_reg)) { + is_flow_key_reg(env, insn->dst_reg) || + is_sk_reg(env, insn->dst_reg)) { verbose(env, "BPF_XADD stores into R%d %s is not allowed\n", insn->dst_reg, reg_type_str[reg_state(env, insn->dst_reg)->type]); @@ -2369,6 +2397,11 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno, err = check_ctx_reg(env, reg, regno); if (err < 0) return err; + } else if (arg_type == ARG_PTR_TO_SOCK_COMMON) { + expected_type = PTR_TO_SOCK_COMMON; + /* Any sk pointer can be ARG_PTR_TO_SOCK_COMMON */ + if (!type_is_sk_pointer(type)) + goto err_type; } else if (arg_type == ARG_PTR_TO_SOCKET) { expected_type = PTR_TO_SOCKET; if (type != expected_type) @@ -2783,7 +2816,7 @@ static int release_reference(struct bpf_verifier_env *env, for (i = 0; i <= vstate->curframe; i++) release_reg_references(env, vstate->frame[i], meta->ptr_id); - return release_reference_state(env, meta->ptr_id); + return release_reference_state(cur_func(env), meta->ptr_id); } static int check_func_call(struct bpf_verifier_env *env, struct bpf_insn *insn, @@ -3049,8 +3082,11 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn } } else if (is_release_function(func_id)) { err = release_reference(env, &meta); - if (err) + if (err) { + verbose(env, "func %s#%d reference has not been acquired before\n", + func_id_name(func_id), func_id); return err; + } } regs = cur_regs(env); @@ -3099,12 +3135,19 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn regs[BPF_REG_0].id = ++env->id_gen; } } else if (fn->ret_type == RET_PTR_TO_SOCKET_OR_NULL) { - int id = acquire_reference_state(env, insn_idx); - if (id < 0) - return id; mark_reg_known_zero(env, regs, BPF_REG_0); regs[BPF_REG_0].type = PTR_TO_SOCKET_OR_NULL; - regs[BPF_REG_0].id = id; + if (is_acquire_function(func_id)) { + int id = acquire_reference_state(env, insn_idx); + + if (id < 0) + return id; + /* For release_reference() */ + regs[BPF_REG_0].id = id; + } else { + /* For mark_ptr_or_null_reg() */ + regs[BPF_REG_0].id = ++env->id_gen; + } } else { verbose(env, "unknown return type %d of func %s#%d\n", fn->ret_type, func_id_name(func_id), func_id); @@ -3364,6 +3407,8 @@ static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env, case PTR_TO_PACKET_END: case PTR_TO_SOCKET: case PTR_TO_SOCKET_OR_NULL: + case PTR_TO_SOCK_COMMON: + case PTR_TO_SOCK_COMMON_OR_NULL: verbose(env, "R%d pointer arithmetic on %s prohibited\n", dst, reg_type_str[ptr_reg->type]); return -EACCES; @@ -4597,6 +4642,8 @@ static void mark_ptr_or_null_reg(struct bpf_func_state *state, } } else if (reg->type == PTR_TO_SOCKET_OR_NULL) { reg->type = PTR_TO_SOCKET; + } else if (reg->type == PTR_TO_SOCK_COMMON_OR_NULL) { + reg->type = PTR_TO_SOCK_COMMON; } if (is_null || !(reg_is_refcounted(reg) || reg_may_point_to_spin_lock(reg))) { @@ -4621,7 +4668,7 @@ static void mark_ptr_or_null_regs(struct bpf_verifier_state *vstate, u32 regno, int i, j; if (reg_is_refcounted_or_null(®s[regno]) && is_null) - __release_reference_state(state, id); + release_reference_state(state, id); for (i = 0; i < MAX_BPF_REG; i++) mark_ptr_or_null_reg(state, ®s[i], id, is_null); @@ -5790,6 +5837,8 @@ static bool regsafe(struct bpf_reg_state *rold, struct bpf_reg_state *rcur, case PTR_TO_FLOW_KEYS: case PTR_TO_SOCKET: case PTR_TO_SOCKET_OR_NULL: + case PTR_TO_SOCK_COMMON: + case PTR_TO_SOCK_COMMON_OR_NULL: /* Only valid matches are exact, which memcmp() above * would have accepted */ @@ -6110,6 +6159,8 @@ static bool reg_type_mismatch_ok(enum bpf_reg_type type) case PTR_TO_CTX: case PTR_TO_SOCKET: case PTR_TO_SOCKET_OR_NULL: + case PTR_TO_SOCK_COMMON: + case PTR_TO_SOCK_COMMON_OR_NULL: return false; default: return true; @@ -7112,6 +7163,7 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env) convert_ctx_access = ops->convert_ctx_access; break; case PTR_TO_SOCKET: + case PTR_TO_SOCK_COMMON: convert_ctx_access = bpf_sock_convert_ctx_access; break; default: diff --git a/net/core/filter.c b/net/core/filter.c index 3a49f68eda10..401d2e0aebf8 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -1793,6 +1793,20 @@ static const struct bpf_func_proto bpf_skb_pull_data_proto = { .arg2_type = ARG_ANYTHING, }; +BPF_CALL_1(bpf_sk_fullsock, struct sock *, sk) +{ + sk = sk_to_full_sk(sk); + + return sk_fullsock(sk) ? (unsigned long)sk : (unsigned long)NULL; +} + +static const struct bpf_func_proto bpf_sk_fullsock_proto = { + .func = bpf_sk_fullsock, + .gpl_only = false, + .ret_type = RET_PTR_TO_SOCKET_OR_NULL, + .arg1_type = ARG_PTR_TO_SOCK_COMMON, +}; + static inline int sk_skb_try_make_writable(struct sk_buff *skb, unsigned int write_len) { @@ -5406,6 +5420,8 @@ cg_skb_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) switch (func_id) { case BPF_FUNC_get_local_storage: return &bpf_get_local_storage_proto; + case BPF_FUNC_sk_fullsock: + return &bpf_sk_fullsock_proto; default: return sk_filter_func_proto(func_id, prog); } @@ -5477,6 +5493,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) return &bpf_get_socket_uid_proto; case BPF_FUNC_fib_lookup: return &bpf_skb_fib_lookup_proto; + case BPF_FUNC_sk_fullsock: + return &bpf_sk_fullsock_proto; #ifdef CONFIG_XFRM case BPF_FUNC_skb_get_xfrm_state: return &bpf_skb_get_xfrm_state_proto; @@ -5764,6 +5782,11 @@ static bool bpf_skb_is_valid_access(int off, int size, enum bpf_access_type type if (size != sizeof(__u64)) return false; break; + case offsetof(struct __sk_buff, sk): + if (type == BPF_WRITE || size != sizeof(__u64)) + return false; + info->reg_type = PTR_TO_SOCK_COMMON_OR_NULL; + break; default: /* Only narrow read access allowed for now. */ if (type == BPF_WRITE) { @@ -5950,6 +5973,18 @@ static bool __sock_filter_check_size(int off, int size, return size == size_default; } +bool bpf_sock_common_is_valid_access(int off, int size, + enum bpf_access_type type, + struct bpf_insn_access_aux *info) +{ + switch (off) { + case bpf_ctx_range_till(struct bpf_sock, type, priority): + return false; + default: + return bpf_sock_is_valid_access(off, size, type, info); + } +} + bool bpf_sock_is_valid_access(int off, int size, enum bpf_access_type type, struct bpf_insn_access_aux *info) { @@ -6748,6 +6783,13 @@ static u32 bpf_convert_ctx_access(enum bpf_access_type type, off += offsetof(struct qdisc_skb_cb, pkt_len); *target_size = 4; *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg, off); + break; + + case offsetof(struct __sk_buff, sk): + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_buff, sk), + si->dst_reg, si->src_reg, + offsetof(struct sk_buff, sk)); + break; } return insn - insn_buf; -- cgit v1.3-7-g2ca7 From 655a51e536c09d15ffa3603b1b6fce2b45b85a1f Mon Sep 17 00:00:00 2001 From: Martin KaFai Lau Date: Sat, 9 Feb 2019 23:22:24 -0800 Subject: bpf: Add struct bpf_tcp_sock and BPF_FUNC_tcp_sock This patch adds a helper function BPF_FUNC_tcp_sock and it is currently available for cg_skb and sched_(cls|act): struct bpf_tcp_sock *bpf_tcp_sock(struct bpf_sock *sk); int cg_skb_foo(struct __sk_buff *skb) { struct bpf_tcp_sock *tp; struct bpf_sock *sk; __u32 snd_cwnd; sk = skb->sk; if (!sk) return 1; tp = bpf_tcp_sock(sk); if (!tp) return 1; snd_cwnd = tp->snd_cwnd; /* ... */ return 1; } A 'struct bpf_tcp_sock' is also added to the uapi bpf.h to provide read-only access. bpf_tcp_sock has all the existing tcp_sock's fields that has already been exposed by the bpf_sock_ops. i.e. no new tcp_sock's fields are exposed in bpf.h. This helper returns a pointer to the tcp_sock. If it is not a tcp_sock or it cannot be traced back to a tcp_sock by sk_to_full_sk(), it returns NULL. Hence, the caller needs to check for NULL before accessing it. The current use case is to expose members from tcp_sock to allow a cg_skb_bpf_prog to provide per cgroup traffic policing/shaping. Acked-by: Alexei Starovoitov Signed-off-by: Martin KaFai Lau Signed-off-by: Alexei Starovoitov --- include/linux/bpf.h | 30 ++++++++++++++++++ include/uapi/linux/bpf.h | 51 ++++++++++++++++++++++++++++++- kernel/bpf/verifier.c | 31 +++++++++++++++++-- net/core/filter.c | 79 ++++++++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 188 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/include/linux/bpf.h b/include/linux/bpf.h index a60463b45b54..7f58828755fd 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -204,6 +204,7 @@ enum bpf_return_type { RET_PTR_TO_MAP_VALUE, /* returns a pointer to map elem value */ RET_PTR_TO_MAP_VALUE_OR_NULL, /* returns a pointer to map elem value or NULL */ RET_PTR_TO_SOCKET_OR_NULL, /* returns a pointer to a socket or NULL */ + RET_PTR_TO_TCP_SOCK_OR_NULL, /* returns a pointer to a tcp_sock or NULL */ }; /* eBPF function prototype used by verifier to allow BPF_CALLs from eBPF programs @@ -259,6 +260,8 @@ enum bpf_reg_type { PTR_TO_SOCKET_OR_NULL, /* reg points to struct bpf_sock or NULL */ PTR_TO_SOCK_COMMON, /* reg points to sock_common */ PTR_TO_SOCK_COMMON_OR_NULL, /* reg points to sock_common or NULL */ + PTR_TO_TCP_SOCK, /* reg points to struct tcp_sock */ + PTR_TO_TCP_SOCK_OR_NULL, /* reg points to struct tcp_sock or NULL */ }; /* The information passed from prog-specific *_is_valid_access @@ -956,4 +959,31 @@ static inline u32 bpf_sock_convert_ctx_access(enum bpf_access_type type, } #endif +#ifdef CONFIG_INET +bool bpf_tcp_sock_is_valid_access(int off, int size, enum bpf_access_type type, + struct bpf_insn_access_aux *info); + +u32 bpf_tcp_sock_convert_ctx_access(enum bpf_access_type type, + const struct bpf_insn *si, + struct bpf_insn *insn_buf, + struct bpf_prog *prog, + u32 *target_size); +#else +static inline bool bpf_tcp_sock_is_valid_access(int off, int size, + enum bpf_access_type type, + struct bpf_insn_access_aux *info) +{ + return false; +} + +static inline u32 bpf_tcp_sock_convert_ctx_access(enum bpf_access_type type, + const struct bpf_insn *si, + struct bpf_insn *insn_buf, + struct bpf_prog *prog, + u32 *target_size) +{ + return 0; +} +#endif /* CONFIG_INET */ + #endif /* _LINUX_BPF_H */ diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index d8f91777c5b6..25c8c0e62ecf 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -2337,6 +2337,15 @@ union bpf_attr { * Return * A **struct bpf_sock** pointer on success, or NULL in * case of failure. + * + * struct bpf_tcp_sock *bpf_tcp_sock(struct bpf_sock *sk) + * Description + * This helper gets a **struct bpf_tcp_sock** pointer from a + * **struct bpf_sock** pointer. + * + * Return + * A **struct bpf_tcp_sock** pointer on success, or NULL in + * case of failure. */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -2434,7 +2443,8 @@ union bpf_attr { FN(rc_pointer_rel), \ FN(spin_lock), \ FN(spin_unlock), \ - FN(sk_fullsock), + FN(sk_fullsock), \ + FN(tcp_sock), /* integer value in 'imm' field of BPF_CALL instruction selects which helper * function eBPF program intends to call @@ -2616,6 +2626,45 @@ struct bpf_sock { __u32 state; }; +struct bpf_tcp_sock { + __u32 snd_cwnd; /* Sending congestion window */ + __u32 srtt_us; /* smoothed round trip time << 3 in usecs */ + __u32 rtt_min; + __u32 snd_ssthresh; /* Slow start size threshold */ + __u32 rcv_nxt; /* What we want to receive next */ + __u32 snd_nxt; /* Next sequence we send */ + __u32 snd_una; /* First byte we want an ack for */ + __u32 mss_cache; /* Cached effective mss, not including SACKS */ + __u32 ecn_flags; /* ECN status bits. */ + __u32 rate_delivered; /* saved rate sample: packets delivered */ + __u32 rate_interval_us; /* saved rate sample: time elapsed */ + __u32 packets_out; /* Packets which are "in flight" */ + __u32 retrans_out; /* Retransmitted packets out */ + __u32 total_retrans; /* Total retransmits for entire connection */ + __u32 segs_in; /* RFC4898 tcpEStatsPerfSegsIn + * total number of segments in. + */ + __u32 data_segs_in; /* RFC4898 tcpEStatsPerfDataSegsIn + * total number of data segments in. + */ + __u32 segs_out; /* RFC4898 tcpEStatsPerfSegsOut + * The total number of segments sent. + */ + __u32 data_segs_out; /* RFC4898 tcpEStatsPerfDataSegsOut + * total number of data segments sent. + */ + __u32 lost_out; /* Lost packets */ + __u32 sacked_out; /* SACK'd packets */ + __u64 bytes_received; /* RFC4898 tcpEStatsAppHCThruOctetsReceived + * sum(delta(rcv_nxt)), or how many bytes + * were acked. + */ + __u64 bytes_acked; /* RFC4898 tcpEStatsAppHCThruOctetsAcked + * sum(delta(snd_una)), or how many bytes + * were acked. + */ +}; + struct bpf_sock_tuple { union { struct { diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index b755d55a3791..1b9496c41383 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -334,14 +334,16 @@ static bool type_is_pkt_pointer(enum bpf_reg_type type) static bool type_is_sk_pointer(enum bpf_reg_type type) { return type == PTR_TO_SOCKET || - type == PTR_TO_SOCK_COMMON; + type == PTR_TO_SOCK_COMMON || + type == PTR_TO_TCP_SOCK; } static bool reg_type_may_be_null(enum bpf_reg_type type) { return type == PTR_TO_MAP_VALUE_OR_NULL || type == PTR_TO_SOCKET_OR_NULL || - type == PTR_TO_SOCK_COMMON_OR_NULL; + type == PTR_TO_SOCK_COMMON_OR_NULL || + type == PTR_TO_TCP_SOCK_OR_NULL; } static bool type_is_refcounted(enum bpf_reg_type type) @@ -407,6 +409,8 @@ static const char * const reg_type_str[] = { [PTR_TO_SOCKET_OR_NULL] = "sock_or_null", [PTR_TO_SOCK_COMMON] = "sock_common", [PTR_TO_SOCK_COMMON_OR_NULL] = "sock_common_or_null", + [PTR_TO_TCP_SOCK] = "tcp_sock", + [PTR_TO_TCP_SOCK_OR_NULL] = "tcp_sock_or_null", }; static char slot_type_char[] = { @@ -1209,6 +1213,8 @@ static bool is_spillable_regtype(enum bpf_reg_type type) case PTR_TO_SOCKET_OR_NULL: case PTR_TO_SOCK_COMMON: case PTR_TO_SOCK_COMMON_OR_NULL: + case PTR_TO_TCP_SOCK: + case PTR_TO_TCP_SOCK_OR_NULL: return true; default: return false; @@ -1662,6 +1668,9 @@ static int check_sock_access(struct bpf_verifier_env *env, int insn_idx, case PTR_TO_SOCKET: valid = bpf_sock_is_valid_access(off, size, t, &info); break; + case PTR_TO_TCP_SOCK: + valid = bpf_tcp_sock_is_valid_access(off, size, t, &info); + break; default: valid = false; } @@ -1823,6 +1832,9 @@ static int check_ptr_alignment(struct bpf_verifier_env *env, case PTR_TO_SOCK_COMMON: pointer_desc = "sock_common "; break; + case PTR_TO_TCP_SOCK: + pointer_desc = "tcp_sock "; + break; default: break; } @@ -3148,6 +3160,10 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn /* For mark_ptr_or_null_reg() */ regs[BPF_REG_0].id = ++env->id_gen; } + } else if (fn->ret_type == RET_PTR_TO_TCP_SOCK_OR_NULL) { + mark_reg_known_zero(env, regs, BPF_REG_0); + regs[BPF_REG_0].type = PTR_TO_TCP_SOCK_OR_NULL; + regs[BPF_REG_0].id = ++env->id_gen; } else { verbose(env, "unknown return type %d of func %s#%d\n", fn->ret_type, func_id_name(func_id), func_id); @@ -3409,6 +3425,8 @@ static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env, case PTR_TO_SOCKET_OR_NULL: case PTR_TO_SOCK_COMMON: case PTR_TO_SOCK_COMMON_OR_NULL: + case PTR_TO_TCP_SOCK: + case PTR_TO_TCP_SOCK_OR_NULL: verbose(env, "R%d pointer arithmetic on %s prohibited\n", dst, reg_type_str[ptr_reg->type]); return -EACCES; @@ -4644,6 +4662,8 @@ static void mark_ptr_or_null_reg(struct bpf_func_state *state, reg->type = PTR_TO_SOCKET; } else if (reg->type == PTR_TO_SOCK_COMMON_OR_NULL) { reg->type = PTR_TO_SOCK_COMMON; + } else if (reg->type == PTR_TO_TCP_SOCK_OR_NULL) { + reg->type = PTR_TO_TCP_SOCK; } if (is_null || !(reg_is_refcounted(reg) || reg_may_point_to_spin_lock(reg))) { @@ -5839,6 +5859,8 @@ static bool regsafe(struct bpf_reg_state *rold, struct bpf_reg_state *rcur, case PTR_TO_SOCKET_OR_NULL: case PTR_TO_SOCK_COMMON: case PTR_TO_SOCK_COMMON_OR_NULL: + case PTR_TO_TCP_SOCK: + case PTR_TO_TCP_SOCK_OR_NULL: /* Only valid matches are exact, which memcmp() above * would have accepted */ @@ -6161,6 +6183,8 @@ static bool reg_type_mismatch_ok(enum bpf_reg_type type) case PTR_TO_SOCKET_OR_NULL: case PTR_TO_SOCK_COMMON: case PTR_TO_SOCK_COMMON_OR_NULL: + case PTR_TO_TCP_SOCK: + case PTR_TO_TCP_SOCK_OR_NULL: return false; default: return true; @@ -7166,6 +7190,9 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env) case PTR_TO_SOCK_COMMON: convert_ctx_access = bpf_sock_convert_ctx_access; break; + case PTR_TO_TCP_SOCK: + convert_ctx_access = bpf_tcp_sock_convert_ctx_access; + break; default: continue; } diff --git a/net/core/filter.c b/net/core/filter.c index c0d7b9ef279f..353735575204 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -5315,6 +5315,79 @@ static const struct bpf_func_proto bpf_sock_addr_sk_lookup_udp_proto = { .arg5_type = ARG_ANYTHING, }; +bool bpf_tcp_sock_is_valid_access(int off, int size, enum bpf_access_type type, + struct bpf_insn_access_aux *info) +{ + if (off < 0 || off >= offsetofend(struct bpf_tcp_sock, bytes_acked)) + return false; + + if (off % size != 0) + return false; + + switch (off) { + case offsetof(struct bpf_tcp_sock, bytes_received): + case offsetof(struct bpf_tcp_sock, bytes_acked): + return size == sizeof(__u64); + default: + return size == sizeof(__u32); + } +} + +u32 bpf_tcp_sock_convert_ctx_access(enum bpf_access_type type, + const struct bpf_insn *si, + struct bpf_insn *insn_buf, + struct bpf_prog *prog, u32 *target_size) +{ + struct bpf_insn *insn = insn_buf; + +#define BPF_TCP_SOCK_GET_COMMON(FIELD) \ + do { \ + BUILD_BUG_ON(FIELD_SIZEOF(struct tcp_sock, FIELD) > \ + FIELD_SIZEOF(struct bpf_tcp_sock, FIELD)); \ + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct tcp_sock, FIELD),\ + si->dst_reg, si->src_reg, \ + offsetof(struct tcp_sock, FIELD)); \ + } while (0) + + CONVERT_COMMON_TCP_SOCK_FIELDS(struct bpf_tcp_sock, + BPF_TCP_SOCK_GET_COMMON); + + if (insn > insn_buf) + return insn - insn_buf; + + switch (si->off) { + case offsetof(struct bpf_tcp_sock, rtt_min): + BUILD_BUG_ON(FIELD_SIZEOF(struct tcp_sock, rtt_min) != + sizeof(struct minmax)); + BUILD_BUG_ON(sizeof(struct minmax) < + sizeof(struct minmax_sample)); + + *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg, + offsetof(struct tcp_sock, rtt_min) + + offsetof(struct minmax_sample, v)); + break; + } + + return insn - insn_buf; +} + +BPF_CALL_1(bpf_tcp_sock, struct sock *, sk) +{ + sk = sk_to_full_sk(sk); + + if (sk_fullsock(sk) && sk->sk_protocol == IPPROTO_TCP) + return (unsigned long)sk; + + return (unsigned long)NULL; +} + +static const struct bpf_func_proto bpf_tcp_sock_proto = { + .func = bpf_tcp_sock, + .gpl_only = false, + .ret_type = RET_PTR_TO_TCP_SOCK_OR_NULL, + .arg1_type = ARG_PTR_TO_SOCK_COMMON, +}; + #endif /* CONFIG_INET */ bool bpf_helper_changes_pkt_data(void *func) @@ -5470,6 +5543,10 @@ cg_skb_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) return &bpf_get_local_storage_proto; case BPF_FUNC_sk_fullsock: return &bpf_sk_fullsock_proto; +#ifdef CONFIG_INET + case BPF_FUNC_tcp_sock: + return &bpf_tcp_sock_proto; +#endif default: return sk_filter_func_proto(func_id, prog); } @@ -5560,6 +5637,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) return &bpf_sk_lookup_udp_proto; case BPF_FUNC_sk_release: return &bpf_sk_release_proto; + case BPF_FUNC_tcp_sock: + return &bpf_tcp_sock_proto; #endif default: return bpf_base_func_proto(func_id); -- cgit v1.3-7-g2ca7 From 31bc6aeaab1d1de8959b67edbed5c7a4b3cdbe7c Mon Sep 17 00:00:00 2001 From: Vincent Guittot Date: Wed, 6 Feb 2019 17:14:21 +0100 Subject: sched/fair: Optimize update_blocked_averages() Removing a cfs_rq from rq->leaf_cfs_rq_list can break the parent/child ordering of the list when it will be added back. In order to remove an empty and fully decayed cfs_rq, we must remove its children too, so they will be added back in the right order next time. With a normal decay of PELT, a parent will be empty and fully decayed if all children are empty and fully decayed too. In such a case, we just have to ensure that the whole branch will be added when a new task is enqueued. This is default behavior since : commit f6783319737f ("sched/fair: Fix insertion in rq->leaf_cfs_rq_list") In case of throttling, the PELT of throttled cfs_rq will not be updated whereas the parent will. This breaks the assumption made above unless we remove the children of a cfs_rq that is throttled. Then, they will be added back when unthrottled and a sched_entity will be enqueued. As throttled cfs_rq are now removed from the list, we can remove the associated test in update_blocked_averages(). Signed-off-by: Vincent Guittot Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: sargun@sargun.me Cc: tj@kernel.org Cc: xiexiuqi@huawei.com Cc: xiezhipeng1@huawei.com Link: https://lkml.kernel.org/r/1549469662-13614-2-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 26 +++++++++++++++++++++----- 1 file changed, 21 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 38d4669aa2ef..027f8e1b5b66 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -346,6 +346,18 @@ static inline bool list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq) static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq) { if (cfs_rq->on_list) { + struct rq *rq = rq_of(cfs_rq); + + /* + * With cfs_rq being unthrottled/throttled during an enqueue, + * it can happen the tmp_alone_branch points the a leaf that + * we finally want to del. In this case, tmp_alone_branch moves + * to the prev element but it will point to rq->leaf_cfs_rq_list + * at the end of the enqueue. + */ + if (rq->tmp_alone_branch == &cfs_rq->leaf_cfs_rq_list) + rq->tmp_alone_branch = cfs_rq->leaf_cfs_rq_list.prev; + list_del_rcu(&cfs_rq->leaf_cfs_rq_list); cfs_rq->on_list = 0; } @@ -4438,6 +4450,10 @@ static int tg_unthrottle_up(struct task_group *tg, void *data) /* adjust cfs_rq_clock_task() */ cfs_rq->throttled_clock_task_time += rq_clock_task(rq) - cfs_rq->throttled_clock_task; + + /* Add cfs_rq with already running entity in the list */ + if (cfs_rq->nr_running >= 1) + list_add_leaf_cfs_rq(cfs_rq); } return 0; @@ -4449,8 +4465,10 @@ static int tg_throttle_down(struct task_group *tg, void *data) struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)]; /* group is entering throttled state, stop time */ - if (!cfs_rq->throttle_count) + if (!cfs_rq->throttle_count) { cfs_rq->throttled_clock_task = rq_clock_task(rq); + list_del_leaf_cfs_rq(cfs_rq); + } cfs_rq->throttle_count++; return 0; @@ -4553,6 +4571,8 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq) break; } + assert_list_leaf_cfs_rq(rq); + if (!se) add_nr_running(rq, task_delta); @@ -7700,10 +7720,6 @@ static void update_blocked_averages(int cpu) for_each_leaf_cfs_rq(rq, cfs_rq) { struct sched_entity *se; - /* throttled entities do not contribute to load */ - if (throttled_hierarchy(cfs_rq)) - continue; - if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq)) update_tg_load_avg(cfs_rq, 0); -- cgit v1.3-7-g2ca7 From 039ae8bcf7a5f4476f4487e6bf816885fb3fb617 Mon Sep 17 00:00:00 2001 From: Vincent Guittot Date: Wed, 6 Feb 2019 17:14:22 +0100 Subject: sched/fair: Fix O(nr_cgroups) in the load balancing path This re-applies the commit reverted here: commit c40f7d74c741 ("sched/fair: Fix infinite loop in update_blocked_averages() by reverting a9e7f6544b9c") I.e. now that cfs_rq can be safely removed/added in the list, we can re-apply: commit a9e7f6544b9c ("sched/fair: Fix O(nr_cgroups) in load balance path") Signed-off-by: Vincent Guittot Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: sargun@sargun.me Cc: tj@kernel.org Cc: xiexiuqi@huawei.com Cc: xiezhipeng1@huawei.com Link: https://lkml.kernel.org/r/1549469662-13614-3-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 43 ++++++++++++++++++++++++++++++++++--------- 1 file changed, 34 insertions(+), 9 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 027f8e1b5b66..17a961522d1e 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -368,9 +368,10 @@ static inline void assert_list_leaf_cfs_rq(struct rq *rq) SCHED_WARN_ON(rq->tmp_alone_branch != &rq->leaf_cfs_rq_list); } -/* Iterate through all cfs_rq's on a runqueue in bottom-up order */ -#define for_each_leaf_cfs_rq(rq, cfs_rq) \ - list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list) +/* Iterate thr' all leaf cfs_rq's on a runqueue */ +#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) \ + list_for_each_entry_safe(cfs_rq, pos, &rq->leaf_cfs_rq_list, \ + leaf_cfs_rq_list) /* Do the two (enqueued) entities belong to the same group ? */ static inline struct cfs_rq * @@ -461,8 +462,8 @@ static inline void assert_list_leaf_cfs_rq(struct rq *rq) { } -#define for_each_leaf_cfs_rq(rq, cfs_rq) \ - for (cfs_rq = &rq->cfs; cfs_rq; cfs_rq = NULL) +#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) \ + for (cfs_rq = &rq->cfs, pos = NULL; cfs_rq; cfs_rq = pos) static inline struct sched_entity *parent_entity(struct sched_entity *se) { @@ -7702,10 +7703,27 @@ static inline bool others_have_blocked(struct rq *rq) #ifdef CONFIG_FAIR_GROUP_SCHED +static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq) +{ + if (cfs_rq->load.weight) + return false; + + if (cfs_rq->avg.load_sum) + return false; + + if (cfs_rq->avg.util_sum) + return false; + + if (cfs_rq->avg.runnable_load_sum) + return false; + + return true; +} + static void update_blocked_averages(int cpu) { struct rq *rq = cpu_rq(cpu); - struct cfs_rq *cfs_rq; + struct cfs_rq *cfs_rq, *pos; const struct sched_class *curr_class; struct rq_flags rf; bool done = true; @@ -7717,7 +7735,7 @@ static void update_blocked_averages(int cpu) * Iterates the task_group tree in a bottom up fashion, see * list_add_leaf_cfs_rq() for details. */ - for_each_leaf_cfs_rq(rq, cfs_rq) { + for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) { struct sched_entity *se; if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq)) @@ -7728,6 +7746,13 @@ static void update_blocked_averages(int cpu) if (se && !skip_blocked_update(se)) update_load_avg(cfs_rq_of(se), se, 0); + /* + * There can be a lot of idle CPU cgroups. Don't let fully + * decayed cfs_rqs linger on the list. + */ + if (cfs_rq_is_decayed(cfs_rq)) + list_del_leaf_cfs_rq(cfs_rq); + /* Don't need periodic decay once load/util_avg are null */ if (cfs_rq_has_blocked(cfs_rq)) done = false; @@ -10609,10 +10634,10 @@ const struct sched_class fair_sched_class = { #ifdef CONFIG_SCHED_DEBUG void print_cfs_stats(struct seq_file *m, int cpu) { - struct cfs_rq *cfs_rq; + struct cfs_rq *cfs_rq, *pos; rcu_read_lock(); - for_each_leaf_cfs_rq(cpu_rq(cpu), cfs_rq) + for_each_leaf_cfs_rq_safe(cpu_rq(cpu), cfs_rq, pos) print_cfs_rq(m, cpu, cfs_rq); rcu_read_unlock(); } -- cgit v1.3-7-g2ca7 From d0fe0b9c45c144e4ac60cf7f07f7e8ae86d3536d Mon Sep 17 00:00:00 2001 From: Dietmar Eggemann Date: Tue, 22 Jan 2019 16:25:01 +0000 Subject: sched/fair: Simplify post_init_entity_util_avg() by calling it with a task_struct pointer argument Since commit: d03266910a53 ("sched/fair: Fix task group initialization") the utilization of a sched entity representing a task group is no longer initialized to any other value than 0. So post_init_entity_util_avg() is only used for tasks, not for sched_entities. Make this clear by calling it with a task_struct pointer argument which also eliminates the entity_is_task(se) if condition in the fork path and get rid of the stale comment in remove_entity_load_avg() accordingly. Signed-off-by: Dietmar Eggemann Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Morten Rasmussen Cc: Patrick Bellasi Cc: Peter Zijlstra Cc: Quentin Perret Cc: Thomas Gleixner Cc: Valentin Schneider Cc: Vincent Guittot Link: https://lkml.kernel.org/r/20190122162501.12000-1-dietmar.eggemann@arm.com Signed-off-by: Ingo Molnar --- kernel/sched/core.c | 2 +- kernel/sched/fair.c | 38 ++++++++++++++++---------------------- kernel/sched/sched.h | 2 +- 3 files changed, 18 insertions(+), 24 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index e86e2b8f6922..6b2c055564b5 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2433,7 +2433,7 @@ void wake_up_new_task(struct task_struct *p) #endif rq = __task_rq_lock(p, &rf); update_rq_clock(rq); - post_init_entity_util_avg(&p->se); + post_init_entity_util_avg(p); activate_task(rq, p, ENQUEUE_NOCLOCK); p->on_rq = TASK_ON_RQ_QUEUED; diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 17a961522d1e..58edbbdeb661 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -759,8 +759,9 @@ static void attach_entity_cfs_rq(struct sched_entity *se); * Finally, that extrapolated util_avg is clamped to the cap (util_avg_cap) * if util_avg > util_avg_cap. */ -void post_init_entity_util_avg(struct sched_entity *se) +void post_init_entity_util_avg(struct task_struct *p) { + struct sched_entity *se = &p->se; struct cfs_rq *cfs_rq = cfs_rq_of(se); struct sched_avg *sa = &se->avg; long cpu_scale = arch_scale_cpu_capacity(NULL, cpu_of(rq_of(cfs_rq))); @@ -778,22 +779,19 @@ void post_init_entity_util_avg(struct sched_entity *se) } } - if (entity_is_task(se)) { - struct task_struct *p = task_of(se); - if (p->sched_class != &fair_sched_class) { - /* - * For !fair tasks do: - * - update_cfs_rq_load_avg(now, cfs_rq); - attach_entity_load_avg(cfs_rq, se, 0); - switched_from_fair(rq, p); - * - * such that the next switched_to_fair() has the - * expected state. - */ - se->avg.last_update_time = cfs_rq_clock_pelt(cfs_rq); - return; - } + if (p->sched_class != &fair_sched_class) { + /* + * For !fair tasks do: + * + update_cfs_rq_load_avg(now, cfs_rq); + attach_entity_load_avg(cfs_rq, se, 0); + switched_from_fair(rq, p); + * + * such that the next switched_to_fair() has the + * expected state. + */ + se->avg.last_update_time = cfs_rq_clock_pelt(cfs_rq); + return; } attach_entity_cfs_rq(se); @@ -803,7 +801,7 @@ void post_init_entity_util_avg(struct sched_entity *se) void init_entity_runnable_average(struct sched_entity *se) { } -void post_init_entity_util_avg(struct sched_entity *se) +void post_init_entity_util_avg(struct task_struct *p) { } static void update_tg_load_avg(struct cfs_rq *cfs_rq, int force) @@ -3590,10 +3588,6 @@ void remove_entity_load_avg(struct sched_entity *se) * tasks cannot exit without having gone through wake_up_new_task() -> * post_init_entity_util_avg() which will have added things to the * cfs_rq, so we can remove unconditionally. - * - * Similarly for groups, they will have passed through - * post_init_entity_util_avg() before unregister_sched_fair_group() - * calls this. */ sync_entity_load_avg(se); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index c688ef5012e5..71208b67e58a 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1800,7 +1800,7 @@ extern void init_dl_rq_bw_ratio(struct dl_rq *dl_rq); unsigned long to_ratio(u64 period, u64 runtime); extern void init_entity_runnable_average(struct sched_entity *se); -extern void post_init_entity_util_avg(struct sched_entity *se); +extern void post_init_entity_util_avg(struct task_struct *p); #ifdef CONFIG_NO_HZ_FULL extern bool sched_can_stop_tick(struct rq *rq); -- cgit v1.3-7-g2ca7 From 99687cdbb3f6c8e32bcc7f37496e811f30460e48 Mon Sep 17 00:00:00 2001 From: Luc Van Oostenryck Date: Fri, 18 Jan 2019 15:49:36 +0100 Subject: sched/topology: Fix percpu data types in struct sd_data & struct s_data The percpu members of struct sd_data and s_data are declared as: struct ... ** __percpu member; So their type is: __percpu pointer to pointer to struct ... But looking at how they're used, their type should be: pointer to __percpu pointer to struct ... and they should thus be declared as: struct ... * __percpu *member; So fix the placement of '__percpu' in the definition of these structures. This addresses a bunch of Sparse's warnings like: warning: incorrect type in initializer (different address spaces) expected void const [noderef] *__vpp_verify got struct sched_domain ** Signed-off-by: Luc Van Oostenryck Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Link: https://lkml.kernel.org/r/20190118144936.79158-1-luc.vanoostenryck@gmail.com Signed-off-by: Ingo Molnar --- include/linux/sched/topology.h | 8 ++++---- kernel/sched/topology.c | 2 +- 2 files changed, 5 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h index c31d3a47a47c..57c7ed3fe465 100644 --- a/include/linux/sched/topology.h +++ b/include/linux/sched/topology.h @@ -176,10 +176,10 @@ typedef int (*sched_domain_flags_f)(void); #define SDTL_OVERLAP 0x01 struct sd_data { - struct sched_domain **__percpu sd; - struct sched_domain_shared **__percpu sds; - struct sched_group **__percpu sg; - struct sched_group_capacity **__percpu sgc; + struct sched_domain *__percpu *sd; + struct sched_domain_shared *__percpu *sds; + struct sched_group *__percpu *sg; + struct sched_group_capacity *__percpu *sgc; }; struct sched_domain_topology_level { diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index 4ae9403420ed..93ff526e77b0 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -705,7 +705,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu) } struct s_data { - struct sched_domain ** __percpu sd; + struct sched_domain * __percpu *sd; struct root_domain *rd; }; -- cgit v1.3-7-g2ca7 From 7edab78d7400ea0997f8e2e971004d824b5bb511 Mon Sep 17 00:00:00 2001 From: Valentin Schneider Date: Thu, 17 Jan 2019 15:34:07 +0000 Subject: sched/fair: Simplify nohz_balancer_kick() Calling 'nohz_balance_exit_idle(rq)' will always clear 'rq->cpu' from 'nohz.idle_cpus_mask' if it is set. Since it is called at the top of 'nohz_balancer_kick()', 'rq->cpu' will never be set in 'nohz.idle_cpus_mask' if it is accessed in the rest of the function. Combine the 'sched_domain_span()' with 'nohz.idle_cpus_mask' and drop the '(i == cpu)' check since 'rq->cpu' will never be iterated over. While at it, clean up a condition alignment. Signed-off-by: Valentin Schneider Signed-off-by: Peter Zijlstra (Intel) Cc: Dietmar.Eggemann@arm.com Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: morten.rasmussen@arm.com Cc: vincent.guittot@linaro.org Link: https://lkml.kernel.org/r/20190117153411.2390-2-valentin.schneider@arm.com Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 8 ++------ 1 file changed, 2 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 58edbbdeb661..0692c8ff6ff6 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -9615,7 +9615,7 @@ static void nohz_balancer_kick(struct rq *rq) sd = rcu_dereference(rq->sd); if (sd) { if ((rq->cfs.h_nr_running >= 1) && - check_cpu_capacity(rq, sd)) { + check_cpu_capacity(rq, sd)) { flags = NOHZ_KICK_MASK; goto unlock; } @@ -9623,11 +9623,7 @@ static void nohz_balancer_kick(struct rq *rq) sd = rcu_dereference(per_cpu(sd_asym_packing, cpu)); if (sd) { - for_each_cpu(i, sched_domain_span(sd)) { - if (i == cpu || - !cpumask_test_cpu(i, nohz.idle_cpus_mask)) - continue; - + for_each_cpu_and(i, sched_domain_span(sd), nohz.idle_cpus_mask) { if (sched_asym_prefer(i, cpu)) { flags = NOHZ_KICK_MASK; goto unlock; -- cgit v1.3-7-g2ca7 From 892d59c22208be820a5463b5f74eb7f0b7f2b03a Mon Sep 17 00:00:00 2001 From: Valentin Schneider Date: Thu, 17 Jan 2019 15:34:08 +0000 Subject: sched/fair: Explain LLC nohz kick condition Provide a comment explaining the LLC related nohz kick in nohz_balancer_kick(). Signed-off-by: Valentin Schneider Signed-off-by: Peter Zijlstra (Intel) Cc: Dietmar.Eggemann@arm.com Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: morten.rasmussen@arm.com Cc: vincent.guittot@linaro.org Link: https://lkml.kernel.org/r/20190117153411.2390-3-valentin.schneider@arm.com Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 0692c8ff6ff6..ac6b52d8c79e 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -9601,8 +9601,13 @@ static void nohz_balancer_kick(struct rq *rq) sds = rcu_dereference(per_cpu(sd_llc_shared, cpu)); if (sds) { /* - * XXX: write a coherent comment on why we do this. - * See also: http://lkml.kernel.org/r/20111202010832.602203411@sbsiddha-desk.sc.intel.com + * If there is an imbalance between LLC domains (IOW we could + * increase the overall cache use), we need some less-loaded LLC + * domain to pull some load. Likewise, we may need to spread + * load within the current LLC domain (e.g. packed SMT cores but + * other CPUs are idle). We can't really know from here how busy + * the others are - so just get a nohz balance going if it looks + * like this LLC domain has tasks we could move. */ nr_busy = atomic_read(&sds->nr_busy_cpus); if (nr_busy > 1) { -- cgit v1.3-7-g2ca7 From 9f132742d5c4146397fef0c5b09fe220576a5bb2 Mon Sep 17 00:00:00 2001 From: Valentin Schneider Date: Thu, 17 Jan 2019 15:34:09 +0000 Subject: sched/fair: Prune, fix and simplify the nohz_balancer_kick() comment block The comment block for that function lists the heuristics for triggering a nohz kick, but the most recent ones (blocked load updates, misfit) aren't included, and some of them (LLC nohz logic, asym packing) are no longer in sync with the code. The conditions are either simple enough or properly commented, so get rid of that list instead of letting it grow. Signed-off-by: Valentin Schneider Signed-off-by: Peter Zijlstra (Intel) Cc: Dietmar.Eggemann@arm.com Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: morten.rasmussen@arm.com Cc: vincent.guittot@linaro.org Link: https://lkml.kernel.org/r/20190117153411.2390-4-valentin.schneider@arm.com Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 11 ++--------- 1 file changed, 2 insertions(+), 9 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index ac6b52d8c79e..f44da9b491ff 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -9551,15 +9551,8 @@ static void kick_ilb(unsigned int flags) } /* - * Current heuristic for kicking the idle load balancer in the presence - * of an idle cpu in the system. - * - This rq has more than one task. - * - This rq has at least one CFS task and the capacity of the CPU is - * significantly reduced because of RT tasks or IRQs. - * - At parent of LLC scheduler domain level, this cpu's scheduler group has - * multiple busy cpu. - * - For SD_ASYM_PACKING, if the lower numbered cpu's in the scheduler - * domain span are idle. + * Current decision point for kicking the idle load balancer in the presence + * of idle CPUs in the system. */ static void nohz_balancer_kick(struct rq *rq) { -- cgit v1.3-7-g2ca7 From 1b5500d73466c62fe048153f0cea1610d2543c7f Mon Sep 17 00:00:00 2001 From: Viresh Kumar Date: Thu, 7 Feb 2019 16:16:05 +0530 Subject: sched/fair: Remove unused 'sd' parameter from select_idle_smt() The 'sd' parameter isn't getting used in select_idle_smt(), drop it. Signed-off-by: Viresh Kumar Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Vincent Guittot Link: http://lkml.kernel.org/r/f91c5e118183e79d4a982e9ac4ce5e47948f6c1b.1549536337.git.viresh.kumar@linaro.org Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index f44da9b491ff..8abd1c271499 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6117,7 +6117,7 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int /* * Scan the local SMT mask for idle CPUs. */ -static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target) +static int select_idle_smt(struct task_struct *p, int target) { int cpu; @@ -6141,7 +6141,7 @@ static inline int select_idle_core(struct task_struct *p, struct sched_domain *s return -1; } -static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target) +static inline int select_idle_smt(struct task_struct *p, int target) { return -1; } @@ -6246,7 +6246,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) if ((unsigned)i < nr_cpumask_bits) return i; - i = select_idle_smt(p, sd, target); + i = select_idle_smt(p, target); if ((unsigned)i < nr_cpumask_bits) return i; -- cgit v1.3-7-g2ca7 From 49262de2270e09882d7bd8866a691cdd69ab32f6 Mon Sep 17 00:00:00 2001 From: Elena Reshetova Date: Tue, 5 Feb 2019 14:24:27 +0200 Subject: futex: Convert futex_pi_state.refcount to refcount_t atomic_t variables are currently used to implement reference counters with the following properties: - counter is initialized to 1 using atomic_set() - a resource is freed upon counter reaching zero - once counter reaches zero, its further increments aren't allowed - counter schema uses basic atomic operations (set, inc, inc_not_zero, dec_and_test, etc.) Such atomic variables should be converted to a newly provided refcount_t type and API that prevents accidental counter overflows and underflows. This is important since overflows and underflows can lead to use-after-free situation and be exploitable. The variable futex_pi_state.refcount is used as pure reference counter. Convert it to refcount_t and fix up the operations. **Important note for maintainers: Some functions from refcount_t API defined in lib/refcount.c have different memory ordering guarantees than their atomic counterparts. Please check Documentation/core-api/refcount-vs-atomic.rst for more information. Normally the differences should not matter since refcount_t provides enough guarantees to satisfy the refcounting use cases, but in some rare cases it might matter. Please double check that you don't have some undocumented memory guarantees for this variable usage. For the futex_pi_state.refcount it might make a difference in following places: - get_pi_state() and exit_pi_state_list(): increment in refcount_inc_not_zero() only guarantees control dependency on success vs. fully ordered atomic counterpart - put_pi_state(): decrement in refcount_dec_and_test() provides RELEASE ordering and ACQUIRE ordering on success vs. fully ordered atomic counterpart Suggested-by: Kees Cook Signed-off-by: Elena Reshetova Reviewed-by: David Windsor Reviewed-by: Hans Liljestrand Cc: Andrew Morton Cc: Linus Torvalds Cc: Paul E. McKenney Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Will Deacon Cc: dvhart@infradead.org Link: http://lkml.kernel.org/r/1549369467-3505-1-git-send-email-elena.reshetova@intel.com Signed-off-by: Ingo Molnar --- kernel/futex.c | 15 ++++++++------- 1 file changed, 8 insertions(+), 7 deletions(-) (limited to 'kernel') diff --git a/kernel/futex.c b/kernel/futex.c index 2abe1a0b3062..113f1c042250 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -68,6 +68,7 @@ #include #include #include +#include #include @@ -212,7 +213,7 @@ struct futex_pi_state { struct rt_mutex pi_mutex; struct task_struct *owner; - atomic_t refcount; + refcount_t refcount; union futex_key key; } __randomize_layout; @@ -799,7 +800,7 @@ static int refill_pi_state_cache(void) INIT_LIST_HEAD(&pi_state->list); /* pi_mutex gets initialized later */ pi_state->owner = NULL; - atomic_set(&pi_state->refcount, 1); + refcount_set(&pi_state->refcount, 1); pi_state->key = FUTEX_KEY_INIT; current->pi_state_cache = pi_state; @@ -819,7 +820,7 @@ static struct futex_pi_state *alloc_pi_state(void) static void get_pi_state(struct futex_pi_state *pi_state) { - WARN_ON_ONCE(!atomic_inc_not_zero(&pi_state->refcount)); + WARN_ON_ONCE(!refcount_inc_not_zero(&pi_state->refcount)); } /* @@ -831,7 +832,7 @@ static void put_pi_state(struct futex_pi_state *pi_state) if (!pi_state) return; - if (!atomic_dec_and_test(&pi_state->refcount)) + if (!refcount_dec_and_test(&pi_state->refcount)) return; /* @@ -861,7 +862,7 @@ static void put_pi_state(struct futex_pi_state *pi_state) * refcount is at 0 - put it back to 1. */ pi_state->owner = NULL; - atomic_set(&pi_state->refcount, 1); + refcount_set(&pi_state->refcount, 1); current->pi_state_cache = pi_state; } } @@ -904,7 +905,7 @@ void exit_pi_state_list(struct task_struct *curr) * In that case; drop the locks to let put_pi_state() make * progress and retry the loop. */ - if (!atomic_inc_not_zero(&pi_state->refcount)) { + if (!refcount_inc_not_zero(&pi_state->refcount)) { raw_spin_unlock_irq(&curr->pi_lock); cpu_relax(); raw_spin_lock_irq(&curr->pi_lock); @@ -1060,7 +1061,7 @@ static int attach_to_pi_state(u32 __user *uaddr, u32 uval, * and futex_wait_requeue_pi() as it cannot go to 0 and consequently * free pi_state before we can take a reference ourselves. */ - WARN_ON(!atomic_read(&pi_state->refcount)); + WARN_ON(!refcount_read(&pi_state->refcount)); /* * Now that we have a pi_state, we can acquire wait_lock -- cgit v1.3-7-g2ca7 From 81ec3f3c4c4d78f2d3b6689c9816bfbdf7417dbb Mon Sep 17 00:00:00 2001 From: Jiri Olsa Date: Mon, 4 Feb 2019 13:35:32 +0100 Subject: perf/x86: Add check_period PMU callback Vince (and later on Ravi) reported crashes in the BTS code during fuzzing with the following backtrace: general protection fault: 0000 [#1] SMP PTI ... RIP: 0010:perf_prepare_sample+0x8f/0x510 ... Call Trace: ? intel_pmu_drain_bts_buffer+0x194/0x230 intel_pmu_drain_bts_buffer+0x160/0x230 ? tick_nohz_irq_exit+0x31/0x40 ? smp_call_function_single_interrupt+0x48/0xe0 ? call_function_single_interrupt+0xf/0x20 ? call_function_single_interrupt+0xa/0x20 ? x86_schedule_events+0x1a0/0x2f0 ? x86_pmu_commit_txn+0xb4/0x100 ? find_busiest_group+0x47/0x5d0 ? perf_event_set_state.part.42+0x12/0x50 ? perf_mux_hrtimer_restart+0x40/0xb0 intel_pmu_disable_event+0xae/0x100 ? intel_pmu_disable_event+0xae/0x100 x86_pmu_stop+0x7a/0xb0 x86_pmu_del+0x57/0x120 event_sched_out.isra.101+0x83/0x180 group_sched_out.part.103+0x57/0xe0 ctx_sched_out+0x188/0x240 ctx_resched+0xa8/0xd0 __perf_event_enable+0x193/0x1e0 event_function+0x8e/0xc0 remote_function+0x41/0x50 flush_smp_call_function_queue+0x68/0x100 generic_smp_call_function_single_interrupt+0x13/0x30 smp_call_function_single_interrupt+0x3e/0xe0 call_function_single_interrupt+0xf/0x20 The reason is that while event init code does several checks for BTS events and prevents several unwanted config bits for BTS event (like precise_ip), the PERF_EVENT_IOC_PERIOD allows to create BTS event without those checks being done. Following sequence will cause the crash: If we create an 'almost' BTS event with precise_ip and callchains, and it into a BTS event it will crash the perf_prepare_sample() function because precise_ip events are expected to come in with callchain data initialized, but that's not the case for intel_pmu_drain_bts_buffer() caller. Adding a check_period callback to be called before the period is changed via PERF_EVENT_IOC_PERIOD. It will deny the change if the event would become BTS. Plus adding also the limit_period check as well. Reported-by: Vince Weaver Signed-off-by: Jiri Olsa Acked-by: Peter Zijlstra Cc: Cc: Alexander Shishkin Cc: Arnaldo Carvalho de Melo Cc: Arnaldo Carvalho de Melo Cc: Jiri Olsa Cc: Linus Torvalds Cc: Naveen N. Rao Cc: Ravi Bangoria Cc: Stephane Eranian Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/20190204123532.GA4794@krava Signed-off-by: Ingo Molnar --- arch/x86/events/core.c | 14 ++++++++++++++ arch/x86/events/intel/core.c | 9 +++++++++ arch/x86/events/perf_event.h | 16 ++++++++++++++-- include/linux/perf_event.h | 5 +++++ kernel/events/core.c | 16 ++++++++++++++++ 5 files changed, 58 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index 374a19712e20..b684f0294f35 100644 --- a/arch/x86/events/core.c +++ b/arch/x86/events/core.c @@ -2278,6 +2278,19 @@ void perf_check_microcode(void) x86_pmu.check_microcode(); } +static int x86_pmu_check_period(struct perf_event *event, u64 value) +{ + if (x86_pmu.check_period && x86_pmu.check_period(event, value)) + return -EINVAL; + + if (value && x86_pmu.limit_period) { + if (x86_pmu.limit_period(event, value) > value) + return -EINVAL; + } + + return 0; +} + static struct pmu pmu = { .pmu_enable = x86_pmu_enable, .pmu_disable = x86_pmu_disable, @@ -2302,6 +2315,7 @@ static struct pmu pmu = { .event_idx = x86_pmu_event_idx, .sched_task = x86_pmu_sched_task, .task_ctx_size = sizeof(struct x86_perf_task_context), + .check_period = x86_pmu_check_period, }; void arch_perf_update_userpage(struct perf_event *event, diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index daafb893449b..730978dff63f 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -3587,6 +3587,11 @@ static void intel_pmu_sched_task(struct perf_event_context *ctx, intel_pmu_lbr_sched_task(ctx, sched_in); } +static int intel_pmu_check_period(struct perf_event *event, u64 value) +{ + return intel_pmu_has_bts_period(event, value) ? -EINVAL : 0; +} + PMU_FORMAT_ATTR(offcore_rsp, "config1:0-63"); PMU_FORMAT_ATTR(ldlat, "config1:0-15"); @@ -3667,6 +3672,8 @@ static __initconst const struct x86_pmu core_pmu = { .cpu_starting = intel_pmu_cpu_starting, .cpu_dying = intel_pmu_cpu_dying, .cpu_dead = intel_pmu_cpu_dead, + + .check_period = intel_pmu_check_period, }; static struct attribute *intel_pmu_attrs[]; @@ -3711,6 +3718,8 @@ static __initconst const struct x86_pmu intel_pmu = { .guest_get_msrs = intel_guest_get_msrs, .sched_task = intel_pmu_sched_task, + + .check_period = intel_pmu_check_period, }; static __init void intel_clovertown_quirk(void) diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h index 78d7b7031bfc..d46fd6754d92 100644 --- a/arch/x86/events/perf_event.h +++ b/arch/x86/events/perf_event.h @@ -646,6 +646,11 @@ struct x86_pmu { * Intel host/guest support (KVM) */ struct perf_guest_switch_msr *(*guest_get_msrs)(int *nr); + + /* + * Check period value for PERF_EVENT_IOC_PERIOD ioctl. + */ + int (*check_period) (struct perf_event *event, u64 period); }; struct x86_perf_task_context { @@ -857,7 +862,7 @@ static inline int amd_pmu_init(void) #ifdef CONFIG_CPU_SUP_INTEL -static inline bool intel_pmu_has_bts(struct perf_event *event) +static inline bool intel_pmu_has_bts_period(struct perf_event *event, u64 period) { struct hw_perf_event *hwc = &event->hw; unsigned int hw_event, bts_event; @@ -868,7 +873,14 @@ static inline bool intel_pmu_has_bts(struct perf_event *event) hw_event = hwc->config & INTEL_ARCH_EVENT_MASK; bts_event = x86_pmu.event_map(PERF_COUNT_HW_BRANCH_INSTRUCTIONS); - return hw_event == bts_event && hwc->sample_period == 1; + return hw_event == bts_event && period == 1; +} + +static inline bool intel_pmu_has_bts(struct perf_event *event) +{ + struct hw_perf_event *hwc = &event->hw; + + return intel_pmu_has_bts_period(event, hwc->sample_period); } int intel_pmu_save_and_restart(struct perf_event *event); diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index 1d5c551a5add..e1a051724f7e 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -447,6 +447,11 @@ struct pmu { * Filter events for PMU-specific reasons. */ int (*filter_match) (struct perf_event *event); /* optional */ + + /* + * Check period value for PERF_EVENT_IOC_PERIOD ioctl. + */ + int (*check_period) (struct perf_event *event, u64 value); /* optional */ }; enum perf_addr_filter_action_t { diff --git a/kernel/events/core.c b/kernel/events/core.c index e5ede6918050..26d6edab051a 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -4963,6 +4963,11 @@ static void __perf_event_period(struct perf_event *event, } } +static int perf_event_check_period(struct perf_event *event, u64 value) +{ + return event->pmu->check_period(event, value); +} + static int perf_event_period(struct perf_event *event, u64 __user *arg) { u64 value; @@ -4979,6 +4984,9 @@ static int perf_event_period(struct perf_event *event, u64 __user *arg) if (event->attr.freq && value > sysctl_perf_event_sample_rate) return -EINVAL; + if (perf_event_check_period(event, value)) + return -EINVAL; + event_function_call(event, __perf_event_period, &value); return 0; @@ -9391,6 +9399,11 @@ static int perf_pmu_nop_int(struct pmu *pmu) return 0; } +static int perf_event_nop_int(struct perf_event *event, u64 value) +{ + return 0; +} + static DEFINE_PER_CPU(unsigned int, nop_txn_flags); static void perf_pmu_start_txn(struct pmu *pmu, unsigned int flags) @@ -9691,6 +9704,9 @@ got_cpu_context: pmu->pmu_disable = perf_pmu_nop_void; } + if (!pmu->check_period) + pmu->check_period = perf_event_nop_int; + if (!pmu->event_idx) pmu->event_idx = perf_event_idx_default; -- cgit v1.3-7-g2ca7 From 3defaf2f15b2bfd86c6664181ac009e91985f8ac Mon Sep 17 00:00:00 2001 From: Alexei Starovoitov Date: Sun, 10 Feb 2019 12:52:35 -0800 Subject: bpf: fix lockdep false positive in stackmap Lockdep warns about false positive: [ 11.211460] ------------[ cut here ]------------ [ 11.211936] DEBUG_LOCKS_WARN_ON(depth <= 0) [ 11.211985] WARNING: CPU: 0 PID: 141 at ../kernel/locking/lockdep.c:3592 lock_release+0x1ad/0x280 [ 11.213134] Modules linked in: [ 11.214954] RIP: 0010:lock_release+0x1ad/0x280 [ 11.223508] Call Trace: [ 11.223705] [ 11.223874] ? __local_bh_enable+0x7a/0x80 [ 11.224199] up_read+0x1c/0xa0 [ 11.224446] do_up_read+0x12/0x20 [ 11.224713] irq_work_run_list+0x43/0x70 [ 11.225030] irq_work_run+0x26/0x50 [ 11.225310] smp_irq_work_interrupt+0x57/0x1f0 [ 11.225662] irq_work_interrupt+0xf/0x20 since rw_semaphore is released in a different task vs task that locked the sema. It is expected behavior. Fix the warning with up_read_non_owner() and rwsem_release() annotation. Fixes: bae77c5eb5b2 ("bpf: enable stackmap with build_id in nmi context") Signed-off-by: Alexei Starovoitov Signed-off-by: Daniel Borkmann --- kernel/bpf/stackmap.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c index d43b14535827..950ab2f28922 100644 --- a/kernel/bpf/stackmap.c +++ b/kernel/bpf/stackmap.c @@ -44,7 +44,7 @@ static void do_up_read(struct irq_work *entry) struct stack_map_irq_work *work; work = container_of(entry, struct stack_map_irq_work, irq_work); - up_read(work->sem); + up_read_non_owner(work->sem); work->sem = NULL; } @@ -338,6 +338,12 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs, } else { work->sem = ¤t->mm->mmap_sem; irq_work_queue(&work->irq_work); + /* + * The irq_work will release the mmap_sem with + * up_read_non_owner(). The rwsem_release() is called + * here to release the lock from lockdep's perspective. + */ + rwsem_release(¤t->mm->mmap_sem.dep_map, 1, _RET_IP_); } } -- cgit v1.3-7-g2ca7 From 85acbb21b9310043692cf18b1fd14067b5a25ccd Mon Sep 17 00:00:00 2001 From: Changbin Du Date: Sun, 10 Feb 2019 00:19:19 +0800 Subject: tracing: Change the function format to display function names by perf Here is an example for this change. $ sudo perf record -e 'ftrace:function' --filter='ip==schedule' $ sudo perf report The output of perf before this patch: \# Samples: 100 of event 'ftrace:function' \# Event count (approx.): 100 \# \# Overhead Trace output \# ........ ...................................... \# 51.00% ffffffff81f6aaa0 <-- ffffffff81158e8d 29.00% ffffffff81f6aaa0 <-- ffffffff8116ccb2 8.00% ffffffff81f6aaa0 <-- ffffffff81f6f2ed 4.00% ffffffff81f6aaa0 <-- ffffffff811628db 4.00% ffffffff81f6aaa0 <-- ffffffff81f6ec5b 2.00% ffffffff81f6aaa0 <-- ffffffff81f6f21a 1.00% ffffffff81f6aaa0 <-- ffffffff811b04af 1.00% ffffffff81f6aaa0 <-- ffffffff8143ce17 After this patch: \# Samples: 36 of event 'ftrace:function' \# Event count (approx.): 36 \# \# Overhead Trace output \# ........ ............................................ \# 38.89% schedule <-- schedule_hrtimeout_range_clock 27.78% schedule <-- worker_thread 13.89% schedule <-- schedule_timeout 11.11% schedule <-- smpboot_thread_fn 5.56% schedule <-- rcu_gp_kthread 2.78% schedule <-- exit_to_usermode_loop Link: http://lkml.kernel.org/r/20190209161919.32350-1-changbin.du@gmail.com Signed-off-by: Changbin Du Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_entries.h | 41 +++++++++++++++++++---------------------- 1 file changed, 19 insertions(+), 22 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace_entries.h b/kernel/trace/trace_entries.h index 06bb2fd9a56c..fc8e97328e54 100644 --- a/kernel/trace/trace_entries.h +++ b/kernel/trace/trace_entries.h @@ -65,7 +65,8 @@ FTRACE_ENTRY_REG(function, ftrace_entry, __field( unsigned long, parent_ip ) ), - F_printk(" %lx <-- %lx", __entry->ip, __entry->parent_ip), + F_printk(" %ps <-- %ps", + (void *)__entry->ip, (void *)__entry->parent_ip), FILTER_TRACE_FN, @@ -83,7 +84,7 @@ FTRACE_ENTRY_PACKED(funcgraph_entry, ftrace_graph_ent_entry, __field_desc( int, graph_ent, depth ) ), - F_printk("--> %lx (%d)", __entry->func, __entry->depth), + F_printk("--> %ps (%d)", (void *)__entry->func, __entry->depth), FILTER_OTHER ); @@ -102,8 +103,8 @@ FTRACE_ENTRY_PACKED(funcgraph_exit, ftrace_graph_ret_entry, __field_desc( int, ret, depth ) ), - F_printk("<-- %lx (%d) (start: %llx end: %llx) over: %d", - __entry->func, __entry->depth, + F_printk("<-- %ps (%d) (start: %llx end: %llx) over: %d", + (void *)__entry->func, __entry->depth, __entry->calltime, __entry->rettime, __entry->depth), @@ -167,12 +168,6 @@ FTRACE_ENTRY_DUP(wakeup, ctx_switch_entry, #define FTRACE_STACK_ENTRIES 8 -#ifndef CONFIG_64BIT -# define IP_FMT "%08lx" -#else -# define IP_FMT "%016lx" -#endif - FTRACE_ENTRY(kernel_stack, stack_entry, TRACE_STACK, @@ -182,12 +177,13 @@ FTRACE_ENTRY(kernel_stack, stack_entry, __dynamic_array(unsigned long, caller ) ), - F_printk("\t=> (" IP_FMT ")\n\t=> (" IP_FMT ")\n\t=> (" IP_FMT ")\n" - "\t=> (" IP_FMT ")\n\t=> (" IP_FMT ")\n\t=> (" IP_FMT ")\n" - "\t=> (" IP_FMT ")\n\t=> (" IP_FMT ")\n", - __entry->caller[0], __entry->caller[1], __entry->caller[2], - __entry->caller[3], __entry->caller[4], __entry->caller[5], - __entry->caller[6], __entry->caller[7]), + F_printk("\t=> %ps\n\t=> %ps\n\t=> %ps\n" + "\t=> %ps\n\t=> %ps\n\t=> %ps\n" + "\t=> %ps\n\t=> %ps\n", + (void *)__entry->caller[0], (void *)__entry->caller[1], + (void *)__entry->caller[2], (void *)__entry->caller[3], + (void *)__entry->caller[4], (void *)__entry->caller[5], + (void *)__entry->caller[6], (void *)__entry->caller[7]), FILTER_OTHER ); @@ -201,12 +197,13 @@ FTRACE_ENTRY(user_stack, userstack_entry, __array( unsigned long, caller, FTRACE_STACK_ENTRIES ) ), - F_printk("\t=> (" IP_FMT ")\n\t=> (" IP_FMT ")\n\t=> (" IP_FMT ")\n" - "\t=> (" IP_FMT ")\n\t=> (" IP_FMT ")\n\t=> (" IP_FMT ")\n" - "\t=> (" IP_FMT ")\n\t=> (" IP_FMT ")\n", - __entry->caller[0], __entry->caller[1], __entry->caller[2], - __entry->caller[3], __entry->caller[4], __entry->caller[5], - __entry->caller[6], __entry->caller[7]), + F_printk("\t=> %ps\n\t=> %ps\n\t=> %ps\n" + "\t=> %ps\n\t=> %ps\n\t=> %ps\n" + "\t=> %ps\n\t=> %ps\n", + (void *)__entry->caller[0], (void *)__entry->caller[1], + (void *)__entry->caller[2], (void *)__entry->caller[3], + (void *)__entry->caller[4], (void *)__entry->caller[5], + (void *)__entry->caller[6], (void *)__entry->caller[7]), FILTER_OTHER ); -- cgit v1.3-7-g2ca7 From f6675872db57305fa957021efc788f9983ed3b67 Mon Sep 17 00:00:00 2001 From: Andreas Ziegler Date: Wed, 6 Feb 2019 20:00:13 +0100 Subject: tracing: probeevent: Correctly update remaining space in dynamic area Commit 9178412ddf5a ("tracing: probeevent: Return consumed bytes of dynamic area") improved the string fetching mechanism by returning the number of required bytes after copying the argument to the dynamic area. However, this return value is now only used to increment the pointer inside the dynamic area but misses updating the 'maxlen' variable which indicates the remaining space in the dynamic area. This means that fetch_store_string() always reads the *total* size of the dynamic area from the data_loc pointer instead of the *remaining* size (and passes it along to strncpy_from_{user,unsafe}) even if we're already about to copy data into the middle of the dynamic area. Link: http://lkml.kernel.org/r/20190206190013.16405-1-andreas.ziegler@fau.de Cc: Ingo Molnar Cc: stable@vger.kernel.org Fixes: 9178412ddf5a ("tracing: probeevent: Return consumed bytes of dynamic area") Acked-by: Masami Hiramatsu Signed-off-by: Andreas Ziegler Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_probe_tmpl.h | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace_probe_tmpl.h b/kernel/trace/trace_probe_tmpl.h index 5c56afc17cf8..4737bb8c07a3 100644 --- a/kernel/trace/trace_probe_tmpl.h +++ b/kernel/trace/trace_probe_tmpl.h @@ -180,10 +180,12 @@ store_trace_args(void *data, struct trace_probe *tp, struct pt_regs *regs, if (unlikely(arg->dynamic)) *dl = make_data_loc(maxlen, dyndata - base); ret = process_fetch_insn(arg->code, regs, dl, base); - if (unlikely(ret < 0 && arg->dynamic)) + if (unlikely(ret < 0 && arg->dynamic)) { *dl = make_data_loc(0, dyndata - base); - else + } else { dyndata += ret; + maxlen -= ret; + } } } -- cgit v1.3-7-g2ca7 From dd27c2e3d0a05c01ff14bb672d1a3f0fdd8f98fc Mon Sep 17 00:00:00 2001 From: Jakub Kicinski Date: Tue, 12 Feb 2019 00:20:39 -0800 Subject: bpf: offload: add priv field for drivers Currently bpf_offload_dev does not have any priv pointer, forcing the drivers to work backwards from the netdev in program metadata. This is not great given programs are conceptually associated with the offload device, and it means one or two unnecessary deferences. Add a priv pointer to bpf_offload_dev. Signed-off-by: Jakub Kicinski Reviewed-by: Quentin Monnet Signed-off-by: Daniel Borkmann --- drivers/net/ethernet/netronome/nfp/bpf/main.c | 2 +- drivers/net/ethernet/netronome/nfp/bpf/offload.c | 4 +--- drivers/net/netdevsim/bpf.c | 5 +++-- include/linux/bpf.h | 3 ++- kernel/bpf/offload.c | 10 +++++++++- 5 files changed, 16 insertions(+), 8 deletions(-) (limited to 'kernel') diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.c b/drivers/net/ethernet/netronome/nfp/bpf/main.c index dccae0319204..275de9f4c61c 100644 --- a/drivers/net/ethernet/netronome/nfp/bpf/main.c +++ b/drivers/net/ethernet/netronome/nfp/bpf/main.c @@ -465,7 +465,7 @@ static int nfp_bpf_init(struct nfp_app *app) app->ctrl_mtu = nfp_bpf_ctrl_cmsg_mtu(bpf); } - bpf->bpf_dev = bpf_offload_dev_create(&nfp_bpf_dev_ops); + bpf->bpf_dev = bpf_offload_dev_create(&nfp_bpf_dev_ops, bpf); err = PTR_ERR_OR_ZERO(bpf->bpf_dev); if (err) goto err_free_neutral_maps; diff --git a/drivers/net/ethernet/netronome/nfp/bpf/offload.c b/drivers/net/ethernet/netronome/nfp/bpf/offload.c index 55c7dbf8b421..15dce97650a5 100644 --- a/drivers/net/ethernet/netronome/nfp/bpf/offload.c +++ b/drivers/net/ethernet/netronome/nfp/bpf/offload.c @@ -185,8 +185,6 @@ static void nfp_prog_free(struct nfp_prog *nfp_prog) static int nfp_bpf_verifier_prep(struct bpf_prog *prog) { - struct nfp_net *nn = netdev_priv(prog->aux->offload->netdev); - struct nfp_app *app = nn->app; struct nfp_prog *nfp_prog; int ret; @@ -197,7 +195,7 @@ static int nfp_bpf_verifier_prep(struct bpf_prog *prog) INIT_LIST_HEAD(&nfp_prog->insns); nfp_prog->type = prog->type; - nfp_prog->bpf = app->priv; + nfp_prog->bpf = bpf_offload_dev_priv(prog->aux->offload->offdev); ret = nfp_prog_prepare(nfp_prog, prog->insnsi, prog->len); if (ret) diff --git a/drivers/net/netdevsim/bpf.c b/drivers/net/netdevsim/bpf.c index 172b271c8bd2..f92c43453ec6 100644 --- a/drivers/net/netdevsim/bpf.c +++ b/drivers/net/netdevsim/bpf.c @@ -248,7 +248,7 @@ static int nsim_bpf_create_prog(struct netdevsim *ns, struct bpf_prog *prog) static int nsim_bpf_verifier_prep(struct bpf_prog *prog) { - struct netdevsim *ns = netdev_priv(prog->aux->offload->netdev); + struct netdevsim *ns = bpf_offload_dev_priv(prog->aux->offload->offdev); if (!ns->bpf_bind_accept) return -EOPNOTSUPP; @@ -589,7 +589,8 @@ int nsim_bpf_init(struct netdevsim *ns) if (IS_ERR_OR_NULL(ns->sdev->ddir_bpf_bound_progs)) return -ENOMEM; - ns->sdev->bpf_dev = bpf_offload_dev_create(&nsim_bpf_dev_ops); + ns->sdev->bpf_dev = bpf_offload_dev_create(&nsim_bpf_dev_ops, + ns); err = PTR_ERR_OR_ZERO(ns->sdev->bpf_dev); if (err) return err; diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 7f58828755fd..de18227b3d95 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -773,8 +773,9 @@ int bpf_map_offload_get_next_key(struct bpf_map *map, bool bpf_offload_prog_map_match(struct bpf_prog *prog, struct bpf_map *map); struct bpf_offload_dev * -bpf_offload_dev_create(const struct bpf_prog_offload_ops *ops); +bpf_offload_dev_create(const struct bpf_prog_offload_ops *ops, void *priv); void bpf_offload_dev_destroy(struct bpf_offload_dev *offdev); +void *bpf_offload_dev_priv(struct bpf_offload_dev *offdev); int bpf_offload_dev_netdev_register(struct bpf_offload_dev *offdev, struct net_device *netdev); void bpf_offload_dev_netdev_unregister(struct bpf_offload_dev *offdev, diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c index 39dba8c90331..ba635209ae9a 100644 --- a/kernel/bpf/offload.c +++ b/kernel/bpf/offload.c @@ -35,6 +35,7 @@ static DECLARE_RWSEM(bpf_devs_lock); struct bpf_offload_dev { const struct bpf_prog_offload_ops *ops; struct list_head netdevs; + void *priv; }; struct bpf_offload_netdev { @@ -669,7 +670,7 @@ unlock: EXPORT_SYMBOL_GPL(bpf_offload_dev_netdev_unregister); struct bpf_offload_dev * -bpf_offload_dev_create(const struct bpf_prog_offload_ops *ops) +bpf_offload_dev_create(const struct bpf_prog_offload_ops *ops, void *priv) { struct bpf_offload_dev *offdev; int err; @@ -688,6 +689,7 @@ bpf_offload_dev_create(const struct bpf_prog_offload_ops *ops) return ERR_PTR(-ENOMEM); offdev->ops = ops; + offdev->priv = priv; INIT_LIST_HEAD(&offdev->netdevs); return offdev; @@ -700,3 +702,9 @@ void bpf_offload_dev_destroy(struct bpf_offload_dev *offdev) kfree(offdev); } EXPORT_SYMBOL_GPL(bpf_offload_dev_destroy); + +void *bpf_offload_dev_priv(struct bpf_offload_dev *offdev) +{ + return offdev->priv; +} +EXPORT_SYMBOL_GPL(bpf_offload_dev_priv); -- cgit v1.3-7-g2ca7 From 6442ca2abf882d9d838fb844d852ba6acd1db7f4 Mon Sep 17 00:00:00 2001 From: Dongli Zhang Date: Fri, 18 Jan 2019 15:10:26 +0800 Subject: swiotlb: fix comment on swiotlb_bounce() Fix the comment as swiotlb_bounce() is used to copy from original dma location to swiotlb buffer during swiotlb_tbl_map_single(), while to copy from swiotlb buffer to original dma location during swiotlb_tbl_unmap_single(). Signed-off-by: Dongli Zhang Signed-off-by: Konrad Rzeszutek Wilk --- kernel/dma/swiotlb.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c index 1fb6fd68b9c7..1d8b37747bf0 100644 --- a/kernel/dma/swiotlb.c +++ b/kernel/dma/swiotlb.c @@ -385,7 +385,7 @@ void __init swiotlb_exit(void) } /* - * Bounce: copy the swiotlb buffer back to the original dma location + * Bounce: copy the swiotlb buffer from or back to the original dma location */ static void swiotlb_bounce(phys_addr_t orig_addr, phys_addr_t tlb_addr, size_t size, enum dma_data_direction dir) -- cgit v1.3-7-g2ca7 From 71602fe6d4e9291af105adfef8e893b57c735906 Mon Sep 17 00:00:00 2001 From: Dongli Zhang Date: Fri, 18 Jan 2019 15:10:27 +0800 Subject: swiotlb: add debugfs to track swiotlb buffer usage The device driver will not be able to do dma operations once swiotlb buffer is full, either because the driver is using so many IO TLB blocks inflight, or because there is memory leak issue in device driver. To export the swiotlb buffer usage via debugfs would help the user estimate the size of swiotlb buffer to pre-allocate or analyze device driver memory leak issue. Signed-off-by: Dongli Zhang Signed-off-by: Konrad Rzeszutek Wilk --- kernel/dma/swiotlb.c | 44 ++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 44 insertions(+) (limited to 'kernel') diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c index 1d8b37747bf0..bedc9f945836 100644 --- a/kernel/dma/swiotlb.c +++ b/kernel/dma/swiotlb.c @@ -34,6 +34,9 @@ #include #include #include +#ifdef CONFIG_DEBUG_FS +#include +#endif #include #include @@ -72,6 +75,11 @@ phys_addr_t io_tlb_start, io_tlb_end; */ static unsigned long io_tlb_nslabs; +/* + * The number of used IO TLB block + */ +static unsigned long io_tlb_used; + /* * This is a free list describing the number of free entries available from * each index @@ -524,6 +532,7 @@ not_found: dev_warn(hwdev, "swiotlb buffer is full (sz: %zd bytes)\n", size); return DMA_MAPPING_ERROR; found: + io_tlb_used += nslots; spin_unlock_irqrestore(&io_tlb_lock, flags); /* @@ -584,6 +593,8 @@ void swiotlb_tbl_unmap_single(struct device *hwdev, phys_addr_t tlb_addr, */ for (i = index - 1; (OFFSET(i, IO_TLB_SEGSIZE) != IO_TLB_SEGSIZE -1) && io_tlb_list[i]; i--) io_tlb_list[i] = ++count; + + io_tlb_used -= nslots; } spin_unlock_irqrestore(&io_tlb_lock, flags); } @@ -662,3 +673,36 @@ swiotlb_dma_supported(struct device *hwdev, u64 mask) { return __phys_to_dma(hwdev, io_tlb_end - 1) <= mask; } + +#ifdef CONFIG_DEBUG_FS + +static int __init swiotlb_create_debugfs(void) +{ + static struct dentry *d_swiotlb_usage; + struct dentry *ent; + + d_swiotlb_usage = debugfs_create_dir("swiotlb", NULL); + + if (!d_swiotlb_usage) + return -ENOMEM; + + ent = debugfs_create_ulong("io_tlb_nslabs", 0400, + d_swiotlb_usage, &io_tlb_nslabs); + if (!ent) + goto fail; + + ent = debugfs_create_ulong("io_tlb_used", 0400, + d_swiotlb_usage, &io_tlb_used); + if (!ent) + goto fail; + + return 0; + +fail: + debugfs_remove_recursive(d_swiotlb_usage); + return -ENOMEM; +} + +late_initcall(swiotlb_create_debugfs); + +#endif -- cgit v1.3-7-g2ca7 From 60513ed06a41049768a6875229b025b6e726e148 Mon Sep 17 00:00:00 2001 From: Dongli Zhang Date: Fri, 18 Jan 2019 15:10:28 +0800 Subject: swiotlb: checking whether swiotlb buffer is full with io_tlb_used This patch uses io_tlb_used to help check whether swiotlb buffer is full. io_tlb_used is no longer used for only debugfs. It is also used to help optimize swiotlb_tbl_map_single(). Suggested-by: Joe Jin Signed-off-by: Dongli Zhang Signed-off-by: Konrad Rzeszutek Wilk --- kernel/dma/swiotlb.c | 4 ++++ 1 file changed, 4 insertions(+) (limited to 'kernel') diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c index bedc9f945836..a01b83e95a2a 100644 --- a/kernel/dma/swiotlb.c +++ b/kernel/dma/swiotlb.c @@ -483,6 +483,10 @@ phys_addr_t swiotlb_tbl_map_single(struct device *hwdev, * request and allocate a buffer from that IO TLB pool. */ spin_lock_irqsave(&io_tlb_lock, flags); + + if (unlikely(nslots > io_tlb_nslabs - io_tlb_used)) + goto not_found; + index = ALIGN(io_tlb_index, stride); if (index >= io_tlb_nslabs) index = 0; -- cgit v1.3-7-g2ca7 From 131d34cb07957151c369366b158690057d2bce5e Mon Sep 17 00:00:00 2001 From: "Gustavo A. R. Silva" Date: Tue, 12 Feb 2019 14:46:00 -0600 Subject: audit: mark expected switch fall-through MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit In preparation to enabling -Wimplicit-fallthrough, mark switch cases where we are expecting to fall through. This patch fixes the following warning: kernel/auditfilter.c: In function ‘audit_krule_to_data’: kernel/auditfilter.c:668:7: warning: this statement may fall through [-Wimplicit-fallthrough=] if (krule->pflags & AUDIT_LOGINUID_LEGACY && !f->val) { ^ kernel/auditfilter.c:674:3: note: here default: ^~~~~~~ Warning level 3 was used: -Wimplicit-fallthrough=3 Notice that, in this particular case, the code comment is modified in accordance with what GCC is expecting to find. This patch is part of the ongoing efforts to enable -Wimplicit-fallthrough. Signed-off-by: Gustavo A. R. Silva Signed-off-by: Paul Moore --- kernel/auditfilter.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/auditfilter.c b/kernel/auditfilter.c index add360b46b38..63f8b3f26fab 100644 --- a/kernel/auditfilter.c +++ b/kernel/auditfilter.c @@ -670,7 +670,7 @@ static struct audit_rule_data *audit_krule_to_data(struct audit_krule *krule) data->values[i] = AUDIT_UID_UNSET; break; } - /* fallthrough if set */ + /* fall through - if set */ default: data->values[i] = f->val; } -- cgit v1.3-7-g2ca7 From 528871b456026e6127d95b1b2bd8e3a003dc1614 Mon Sep 17 00:00:00 2001 From: Ingo Molnar Date: Wed, 13 Feb 2019 07:57:02 +0100 Subject: perf/core: Fix impossible ring-buffer sizes warning The following commit: 9dff0aa95a32 ("perf/core: Don't WARN() for impossible ring-buffer sizes") results in perf recording failures with larger mmap areas: root@skl:/tmp# perf record -g -a failed to mmap with 12 (Cannot allocate memory) The root cause is that the following condition is buggy: if (order_base_2(size) >= MAX_ORDER) goto fail; The problem is that @size is in bytes and MAX_ORDER is in pages, so the right test is: if (order_base_2(size) >= PAGE_SHIFT+MAX_ORDER) goto fail; Fix it. Reported-by: "Jin, Yao" Bisected-by: Borislav Petkov Analyzed-by: Peter Zijlstra Cc: Julien Thierry Cc: Mark Rutland Cc: Alexander Shishkin Cc: Arnaldo Carvalho de Melo Cc: Jiri Olsa Cc: Linus Torvalds Cc: Namhyung Kim Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Greg Kroah-Hartman Cc: Fixes: 9dff0aa95a32 ("perf/core: Don't WARN() for impossible ring-buffer sizes") Signed-off-by: Ingo Molnar --- kernel/events/ring_buffer.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c index 309ef5a64af5..5ab4fe3b1dcc 100644 --- a/kernel/events/ring_buffer.c +++ b/kernel/events/ring_buffer.c @@ -734,7 +734,7 @@ struct ring_buffer *rb_alloc(int nr_pages, long watermark, int cpu, int flags) size = sizeof(struct ring_buffer); size += nr_pages * sizeof(void *); - if (order_base_2(size) >= MAX_ORDER) + if (order_base_2(size) >= PAGE_SHIFT+MAX_ORDER) goto fail; rb = kzalloc(size, GFP_KERNEL); -- cgit v1.3-7-g2ca7 From c13324a505c7790fe91a9df35be2e0462abccdb0 Mon Sep 17 00:00:00 2001 From: Masami Hiramatsu Date: Wed, 13 Feb 2019 01:12:15 +0900 Subject: x86/kprobes: Prohibit probing on functions before kprobe_int3_handler() Prohibit probing on the functions called before kprobe_int3_handler() in do_int3(). More specifically, ftrace_int3_handler(), poke_int3_handler(), and ist_enter(). And since rcu_nmi_enter() is called by ist_enter(), it also should be marked as NOKPROBE_SYMBOL. Since those are handled before kprobe_int3_handler(), probing those functions can cause a breakpoint recursion and crash the kernel. Signed-off-by: Masami Hiramatsu Cc: Alexander Shishkin Cc: Andrea Righi Cc: Arnaldo Carvalho de Melo Cc: Jiri Olsa Cc: Linus Torvalds Cc: Mathieu Desnoyers Cc: Peter Zijlstra Cc: Steven Rostedt Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/154998793571.31052.11301258949601150994.stgit@devbox Signed-off-by: Ingo Molnar --- arch/x86/kernel/alternative.c | 3 ++- arch/x86/kernel/ftrace.c | 3 ++- arch/x86/kernel/traps.c | 1 + kernel/rcu/tree.c | 2 ++ 4 files changed, 7 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c index ebeac487a20c..e8b628b1b279 100644 --- a/arch/x86/kernel/alternative.c +++ b/arch/x86/kernel/alternative.c @@ -11,6 +11,7 @@ #include #include #include +#include #include #include #include @@ -764,8 +765,8 @@ int poke_int3_handler(struct pt_regs *regs) regs->ip = (unsigned long) bp_int3_handler; return 1; - } +NOKPROBE_SYMBOL(poke_int3_handler); /** * text_poke_bp() -- update instructions on live kernel on SMP diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c index 8257a59704ae..3e3789c8f8e1 100644 --- a/arch/x86/kernel/ftrace.c +++ b/arch/x86/kernel/ftrace.c @@ -269,7 +269,7 @@ int ftrace_update_ftrace_func(ftrace_func_t func) return ret; } -static int is_ftrace_caller(unsigned long ip) +static nokprobe_inline int is_ftrace_caller(unsigned long ip) { if (ip == ftrace_update_func) return 1; @@ -299,6 +299,7 @@ int ftrace_int3_handler(struct pt_regs *regs) return 1; } +NOKPROBE_SYMBOL(ftrace_int3_handler); static int ftrace_write(unsigned long ip, const char *val, int size) { diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index 9b7c4ca8f0a7..e289ce1332ab 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -111,6 +111,7 @@ void ist_enter(struct pt_regs *regs) /* This code is a bit fragile. Test it. */ RCU_LOCKDEP_WARN(!rcu_is_watching(), "ist_enter didn't work"); } +NOKPROBE_SYMBOL(ist_enter); void ist_exit(struct pt_regs *regs) { diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 9180158756d2..74db52a0a466 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -62,6 +62,7 @@ #include #include #include +#include #include "tree.h" #include "rcu.h" @@ -872,6 +873,7 @@ void rcu_nmi_enter(void) { rcu_nmi_enter_common(false); } +NOKPROBE_SYMBOL(rcu_nmi_enter); /** * rcu_irq_enter - inform RCU that current CPU is entering irq away from idle -- cgit v1.3-7-g2ca7 From 6143c6fb1e8f9bde9c434038f7548a19d36b55e7 Mon Sep 17 00:00:00 2001 From: Masami Hiramatsu Date: Wed, 13 Feb 2019 01:13:12 +0900 Subject: kprobes: Search non-suffixed symbol in blacklist Newer GCC versions can generate some different instances of a function with suffixed symbols if the function is optimized and only has a part of that. (e.g. .constprop, .part etc.) In this case, it is not enough to check the entry of kprobe blacklist because it only records non-suffixed symbol address. To fix this issue, search non-suffixed symbol in blacklist if given address is within a symbol which has a suffix. Note that this can cause false positive cases if a kprobe-safe function is optimized to suffixed instance and has same name symbol which is blacklisted. But I would like to chose a fail-safe design for this issue. Signed-off-by: Masami Hiramatsu Reviewed-by: Steven Rostedt (VMware) Cc: Alexander Shishkin Cc: Andrea Righi Cc: Arnaldo Carvalho de Melo Cc: Jiri Olsa Cc: Linus Torvalds Cc: Mathieu Desnoyers Cc: Peter Zijlstra Cc: Steven Rostedt Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/154998799234.31052.6136378903570418008.stgit@devbox Signed-off-by: Ingo Molnar --- kernel/kprobes.c | 21 ++++++++++++++++++++- 1 file changed, 20 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/kprobes.c b/kernel/kprobes.c index f4ddfdd2d07e..c83e54727131 100644 --- a/kernel/kprobes.c +++ b/kernel/kprobes.c @@ -1396,7 +1396,7 @@ bool __weak arch_within_kprobe_blacklist(unsigned long addr) addr < (unsigned long)__kprobes_text_end; } -bool within_kprobe_blacklist(unsigned long addr) +static bool __within_kprobe_blacklist(unsigned long addr) { struct kprobe_blacklist_entry *ent; @@ -1410,7 +1410,26 @@ bool within_kprobe_blacklist(unsigned long addr) if (addr >= ent->start_addr && addr < ent->end_addr) return true; } + return false; +} +bool within_kprobe_blacklist(unsigned long addr) +{ + char symname[KSYM_NAME_LEN], *p; + + if (__within_kprobe_blacklist(addr)) + return true; + + /* Check if the address is on a suffixed-symbol */ + if (!lookup_symbol_name(addr, symname)) { + p = strchr(symname, '.'); + if (!p) + return false; + *p = '\0'; + addr = (unsigned long)kprobe_lookup_name(symname, 0); + if (addr) + return __within_kprobe_blacklist(addr); + } return false; } -- cgit v1.3-7-g2ca7 From eeeb080bae906a57b6513d37efe3c38f2cb87a1c Mon Sep 17 00:00:00 2001 From: Masami Hiramatsu Date: Wed, 13 Feb 2019 01:13:40 +0900 Subject: kprobes: Prohibit probing on hardirq tracers Since kprobes breakpoint handling involves hardirq tracer, probing these functions cause breakpoint recursion problem. Prohibit probing on those functions. Signed-off-by: Masami Hiramatsu Acked-by: Steven Rostedt (VMware) Cc: Alexander Shishkin Cc: Andrea Righi Cc: Arnaldo Carvalho de Melo Cc: Jiri Olsa Cc: Linus Torvalds Cc: Mathieu Desnoyers Cc: Peter Zijlstra Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/154998802073.31052.17255044712514564153.stgit@devbox Signed-off-by: Ingo Molnar --- kernel/trace/trace_irqsoff.c | 9 +++++++-- kernel/trace/trace_preemptirq.c | 5 +++++ 2 files changed, 12 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace_irqsoff.c b/kernel/trace/trace_irqsoff.c index d3294721f119..d42a473b8240 100644 --- a/kernel/trace/trace_irqsoff.c +++ b/kernel/trace/trace_irqsoff.c @@ -14,6 +14,7 @@ #include #include #include +#include #include "trace.h" @@ -365,7 +366,7 @@ out: __trace_function(tr, CALLER_ADDR0, parent_ip, flags, pc); } -static inline void +static nokprobe_inline void start_critical_timing(unsigned long ip, unsigned long parent_ip, int pc) { int cpu; @@ -401,7 +402,7 @@ start_critical_timing(unsigned long ip, unsigned long parent_ip, int pc) atomic_dec(&data->disabled); } -static inline void +static nokprobe_inline void stop_critical_timing(unsigned long ip, unsigned long parent_ip, int pc) { int cpu; @@ -443,6 +444,7 @@ void start_critical_timings(void) start_critical_timing(CALLER_ADDR0, CALLER_ADDR1, pc); } EXPORT_SYMBOL_GPL(start_critical_timings); +NOKPROBE_SYMBOL(start_critical_timings); void stop_critical_timings(void) { @@ -452,6 +454,7 @@ void stop_critical_timings(void) stop_critical_timing(CALLER_ADDR0, CALLER_ADDR1, pc); } EXPORT_SYMBOL_GPL(stop_critical_timings); +NOKPROBE_SYMBOL(stop_critical_timings); #ifdef CONFIG_FUNCTION_TRACER static bool function_enabled; @@ -611,6 +614,7 @@ void tracer_hardirqs_on(unsigned long a0, unsigned long a1) if (!preempt_trace(pc) && irq_trace()) stop_critical_timing(a0, a1, pc); } +NOKPROBE_SYMBOL(tracer_hardirqs_on); void tracer_hardirqs_off(unsigned long a0, unsigned long a1) { @@ -619,6 +623,7 @@ void tracer_hardirqs_off(unsigned long a0, unsigned long a1) if (!preempt_trace(pc) && irq_trace()) start_critical_timing(a0, a1, pc); } +NOKPROBE_SYMBOL(tracer_hardirqs_off); static int irqsoff_tracer_init(struct trace_array *tr) { diff --git a/kernel/trace/trace_preemptirq.c b/kernel/trace/trace_preemptirq.c index 71f553cceb3c..4d8e99fdbbbe 100644 --- a/kernel/trace/trace_preemptirq.c +++ b/kernel/trace/trace_preemptirq.c @@ -9,6 +9,7 @@ #include #include #include +#include #include "trace.h" #define CREATE_TRACE_POINTS @@ -30,6 +31,7 @@ void trace_hardirqs_on(void) lockdep_hardirqs_on(CALLER_ADDR0); } EXPORT_SYMBOL(trace_hardirqs_on); +NOKPROBE_SYMBOL(trace_hardirqs_on); void trace_hardirqs_off(void) { @@ -43,6 +45,7 @@ void trace_hardirqs_off(void) lockdep_hardirqs_off(CALLER_ADDR0); } EXPORT_SYMBOL(trace_hardirqs_off); +NOKPROBE_SYMBOL(trace_hardirqs_off); __visible void trace_hardirqs_on_caller(unsigned long caller_addr) { @@ -56,6 +59,7 @@ __visible void trace_hardirqs_on_caller(unsigned long caller_addr) lockdep_hardirqs_on(CALLER_ADDR0); } EXPORT_SYMBOL(trace_hardirqs_on_caller); +NOKPROBE_SYMBOL(trace_hardirqs_on_caller); __visible void trace_hardirqs_off_caller(unsigned long caller_addr) { @@ -69,6 +73,7 @@ __visible void trace_hardirqs_off_caller(unsigned long caller_addr) lockdep_hardirqs_off(CALLER_ADDR0); } EXPORT_SYMBOL(trace_hardirqs_off_caller); +NOKPROBE_SYMBOL(trace_hardirqs_off_caller); #endif /* CONFIG_TRACE_IRQFLAGS */ #ifdef CONFIG_TRACE_PREEMPT_TOGGLE -- cgit v1.3-7-g2ca7 From a39f15b9644fac3f950f522c39e667c3af25c588 Mon Sep 17 00:00:00 2001 From: Masami Hiramatsu Date: Wed, 13 Feb 2019 01:14:37 +0900 Subject: kprobes: Prohibit probing on RCU debug routine Since kprobe itself depends on RCU, probing on RCU debug routine can cause recursive breakpoint bugs. Prohibit probing on RCU debug routines. int3 ->do_int3() ->ist_enter() ->RCU_LOCKDEP_WARN() ->debug_lockdep_rcu_enabled() -> int3 Signed-off-by: Masami Hiramatsu Cc: Alexander Shishkin Cc: Andrea Righi Cc: Arnaldo Carvalho de Melo Cc: Jiri Olsa Cc: Linus Torvalds Cc: Mathieu Desnoyers Cc: Peter Zijlstra Cc: Steven Rostedt Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/154998807741.31052.11229157537816341591.stgit@devbox Signed-off-by: Ingo Molnar --- kernel/rcu/update.c | 2 ++ 1 file changed, 2 insertions(+) (limited to 'kernel') diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c index 1971869c4072..f4ca36d92138 100644 --- a/kernel/rcu/update.c +++ b/kernel/rcu/update.c @@ -52,6 +52,7 @@ #include #include #include +#include #define CREATE_TRACE_POINTS @@ -249,6 +250,7 @@ int notrace debug_lockdep_rcu_enabled(void) current->lockdep_recursion == 0; } EXPORT_SYMBOL_GPL(debug_lockdep_rcu_enabled); +NOKPROBE_SYMBOL(debug_lockdep_rcu_enabled); /** * rcu_read_lock_held() - might we be in RCU read-side critical section? -- cgit v1.3-7-g2ca7 From 2f43c6022d84b2f562623a7023f49f1431e50747 Mon Sep 17 00:00:00 2001 From: Masami Hiramatsu Date: Wed, 13 Feb 2019 01:15:05 +0900 Subject: kprobes: Prohibit probing on lockdep functions Some lockdep functions can be involved in breakpoint handling and probing on those functions can cause a breakpoint recursion. Prohibit probing on those functions by blacklist. Signed-off-by: Masami Hiramatsu Cc: Alexander Shishkin Cc: Andrea Righi Cc: Arnaldo Carvalho de Melo Cc: Jiri Olsa Cc: Linus Torvalds Cc: Mathieu Desnoyers Cc: Peter Zijlstra Cc: Steven Rostedt Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/154998810578.31052.1680977921449292812.stgit@devbox Signed-off-by: Ingo Molnar --- kernel/locking/lockdep.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 95932333a48b..bc35a54ae3d4 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -50,6 +50,7 @@ #include #include #include +#include #include @@ -2814,6 +2815,7 @@ void lockdep_hardirqs_on(unsigned long ip) __trace_hardirqs_on_caller(ip); current->lockdep_recursion = 0; } +NOKPROBE_SYMBOL(lockdep_hardirqs_on); /* * Hardirqs were disabled: @@ -2843,6 +2845,7 @@ void lockdep_hardirqs_off(unsigned long ip) } else debug_atomic_inc(redundant_hardirqs_off); } +NOKPROBE_SYMBOL(lockdep_hardirqs_off); /* * Softirqs will be enabled: @@ -3650,7 +3653,8 @@ __lock_release(struct lockdep_map *lock, int nested, unsigned long ip) return 0; } -static int __lock_is_held(const struct lockdep_map *lock, int read) +static nokprobe_inline +int __lock_is_held(const struct lockdep_map *lock, int read) { struct task_struct *curr = current; int i; @@ -3883,6 +3887,7 @@ int lock_is_held_type(const struct lockdep_map *lock, int read) return ret; } EXPORT_SYMBOL_GPL(lock_is_held_type); +NOKPROBE_SYMBOL(lock_is_held_type); struct pin_cookie lock_pin_lock(struct lockdep_map *lock) { -- cgit v1.3-7-g2ca7 From c89d92eddfad11e912fb506f85e1796064a9f9d2 Mon Sep 17 00:00:00 2001 From: Viresh Kumar Date: Tue, 12 Feb 2019 14:57:01 +0530 Subject: sched/fair: Use non-atomic cpumask_{set,clear}_cpu() The cpumasks updated here are not subject to concurrency and using atomic bitops for them is pointless and expensive. Use the non-atomic variants instead. Suggested-by: Peter Zijlstra Signed-off-by: Viresh Kumar Cc: Linus Torvalds Cc: Thomas Gleixner Cc: Vincent Guittot Link: http://lkml.kernel.org/r/2e2a10f84b9049a81eef94ed6d5989447c21e34a.1549963617.git.viresh.kumar@linaro.org Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 6 +++--- kernel/sched/isolation.c | 2 +- 2 files changed, 4 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 8abd1c271499..8213ff6e365d 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6097,7 +6097,7 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int bool idle = true; for_each_cpu(cpu, cpu_smt_mask(core)) { - cpumask_clear_cpu(cpu, cpus); + __cpumask_clear_cpu(cpu, cpus); if (!available_idle_cpu(cpu)) idle = false; } @@ -9105,7 +9105,7 @@ more_balance: if ((env.flags & LBF_DST_PINNED) && env.imbalance > 0) { /* Prevent to re-select dst_cpu via env's CPUs */ - cpumask_clear_cpu(env.dst_cpu, env.cpus); + __cpumask_clear_cpu(env.dst_cpu, env.cpus); env.dst_rq = cpu_rq(env.new_dst_cpu); env.dst_cpu = env.new_dst_cpu; @@ -9132,7 +9132,7 @@ more_balance: /* All tasks on this runqueue were pinned by CPU affinity */ if (unlikely(env.flags & LBF_ALL_PINNED)) { - cpumask_clear_cpu(cpu_of(busiest), cpus); + __cpumask_clear_cpu(cpu_of(busiest), cpus); /* * Attempting to continue load balancing at the current * sched_domain level only makes sense if there are diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c index 81faddba9e20..b02d148e7672 100644 --- a/kernel/sched/isolation.c +++ b/kernel/sched/isolation.c @@ -80,7 +80,7 @@ static int __init housekeeping_setup(char *str, enum hk_flags flags) cpumask_andnot(housekeeping_mask, cpu_possible_mask, non_housekeeping_mask); if (cpumask_empty(housekeeping_mask)) - cpumask_set_cpu(smp_processor_id(), housekeeping_mask); + __cpumask_set_cpu(smp_processor_id(), housekeeping_mask); } else { cpumask_var_t tmp; -- cgit v1.3-7-g2ca7 From b5c231d8c8037f63d34199ea1667bbe1cd9f940f Mon Sep 17 00:00:00 2001 From: Brian Masney Date: Thu, 7 Feb 2019 21:16:22 -0500 Subject: genirq: introduce irq_domain_translate_twocell Add a new function irq_domain_translate_twocell() that is to be used as the translate function in struct irq_domain_ops for the v2 IRQ API. This patch also changes irq_domain_xlate_twocell() from the v1 IRQ API to call irq_domain_translate_twocell() in the v2 IRQ API. This required changes to of_phandle_args_to_fwspec()'s arguments so that it can be called from multiple places. Cc: Thomas Gleixner Reviewed-by: Marc Zyngier Signed-off-by: Brian Masney Signed-off-by: Linus Walleij --- include/linux/irqdomain.h | 5 +++++ kernel/irq/irqdomain.c | 45 ++++++++++++++++++++++++++++++++++----------- 2 files changed, 39 insertions(+), 11 deletions(-) (limited to 'kernel') diff --git a/include/linux/irqdomain.h b/include/linux/irqdomain.h index 35965f41d7be..fcefe0c7263f 100644 --- a/include/linux/irqdomain.h +++ b/include/linux/irqdomain.h @@ -419,6 +419,11 @@ int irq_domain_xlate_onetwocell(struct irq_domain *d, struct device_node *ctrlr, const u32 *intspec, unsigned int intsize, irq_hw_number_t *out_hwirq, unsigned int *out_type); +int irq_domain_translate_twocell(struct irq_domain *d, + struct irq_fwspec *fwspec, + unsigned long *out_hwirq, + unsigned int *out_type); + /* IPI functions */ int irq_reserve_ipi(struct irq_domain *domain, const struct cpumask *dest); int irq_destroy_ipi(unsigned int irq, const struct cpumask *dest); diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c index 8b0be4bd6565..56a30d542b8e 100644 --- a/kernel/irq/irqdomain.c +++ b/kernel/irq/irqdomain.c @@ -729,16 +729,17 @@ static int irq_domain_translate(struct irq_domain *d, return 0; } -static void of_phandle_args_to_fwspec(struct of_phandle_args *irq_data, +static void of_phandle_args_to_fwspec(struct device_node *np, const u32 *args, + unsigned int count, struct irq_fwspec *fwspec) { int i; - fwspec->fwnode = irq_data->np ? &irq_data->np->fwnode : NULL; - fwspec->param_count = irq_data->args_count; + fwspec->fwnode = np ? &np->fwnode : NULL; + fwspec->param_count = count; - for (i = 0; i < irq_data->args_count; i++) - fwspec->param[i] = irq_data->args[i]; + for (i = 0; i < count; i++) + fwspec->param[i] = args[i]; } unsigned int irq_create_fwspec_mapping(struct irq_fwspec *fwspec) @@ -836,7 +837,9 @@ unsigned int irq_create_of_mapping(struct of_phandle_args *irq_data) { struct irq_fwspec fwspec; - of_phandle_args_to_fwspec(irq_data, &fwspec); + of_phandle_args_to_fwspec(irq_data->np, irq_data->args, + irq_data->args_count, &fwspec); + return irq_create_fwspec_mapping(&fwspec); } EXPORT_SYMBOL_GPL(irq_create_of_mapping); @@ -928,11 +931,10 @@ int irq_domain_xlate_twocell(struct irq_domain *d, struct device_node *ctrlr, const u32 *intspec, unsigned int intsize, irq_hw_number_t *out_hwirq, unsigned int *out_type) { - if (WARN_ON(intsize < 2)) - return -EINVAL; - *out_hwirq = intspec[0]; - *out_type = intspec[1] & IRQ_TYPE_SENSE_MASK; - return 0; + struct irq_fwspec fwspec; + + of_phandle_args_to_fwspec(ctrlr, intspec, intsize, &fwspec); + return irq_domain_translate_twocell(d, &fwspec, out_hwirq, out_type); } EXPORT_SYMBOL_GPL(irq_domain_xlate_twocell); @@ -968,6 +970,27 @@ const struct irq_domain_ops irq_domain_simple_ops = { }; EXPORT_SYMBOL_GPL(irq_domain_simple_ops); +/** + * irq_domain_translate_twocell() - Generic translate for direct two cell + * bindings + * + * Device Tree IRQ specifier translation function which works with two cell + * bindings where the cell values map directly to the hwirq number + * and linux irq flags. + */ +int irq_domain_translate_twocell(struct irq_domain *d, + struct irq_fwspec *fwspec, + unsigned long *out_hwirq, + unsigned int *out_type) +{ + if (WARN_ON(fwspec->param_count < 2)) + return -EINVAL; + *out_hwirq = fwspec->param[0]; + *out_type = fwspec->param[1] & IRQ_TYPE_SENSE_MASK; + return 0; +} +EXPORT_SYMBOL_GPL(irq_domain_translate_twocell); + int irq_domain_alloc_descs(int virq, unsigned int cnt, irq_hw_number_t hwirq, int node, const struct irq_affinity_desc *affinity) { -- cgit v1.3-7-g2ca7 From 5aa5bd563ce041d931c0dc1fc436dd18c27c60a7 Mon Sep 17 00:00:00 2001 From: Linus Walleij Date: Thu, 7 Feb 2019 21:16:23 -0500 Subject: genirq: introduce irq_chip_mask_ack_parent() The hierarchical irqchip never before ran into a situation where the parent is not "simple", i.e. does not implement .irq_ack() and .irq_mask() like most, but the qcom-pm8xxx.c happens to implement only .irq_mask_ack(). Since we want to make ssbi-gpio a hierarchical child of this irqchip, it must *also* only implement .irq_mask_ack() and call down to the parent, and for this we of course need irq_chip_mask_ack_parent(). Cc: Marc Zyngier Cc: Thomas Gleixner Acked-by: Marc Zyngier Signed-off-by: Brian Masney Signed-off-by: Linus Walleij --- include/linux/irq.h | 1 + kernel/irq/chip.c | 11 +++++++++++ 2 files changed, 12 insertions(+) (limited to 'kernel') diff --git a/include/linux/irq.h b/include/linux/irq.h index def2b2aac8b1..9a1a67d2e07d 100644 --- a/include/linux/irq.h +++ b/include/linux/irq.h @@ -605,6 +605,7 @@ extern void irq_chip_disable_parent(struct irq_data *data); extern void irq_chip_ack_parent(struct irq_data *data); extern int irq_chip_retrigger_hierarchy(struct irq_data *data); extern void irq_chip_mask_parent(struct irq_data *data); +extern void irq_chip_mask_ack_parent(struct irq_data *data); extern void irq_chip_unmask_parent(struct irq_data *data); extern void irq_chip_eoi_parent(struct irq_data *data); extern int irq_chip_set_affinity_parent(struct irq_data *data, diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c index 34e969069488..982b75e127c5 100644 --- a/kernel/irq/chip.c +++ b/kernel/irq/chip.c @@ -1277,6 +1277,17 @@ void irq_chip_mask_parent(struct irq_data *data) } EXPORT_SYMBOL_GPL(irq_chip_mask_parent); +/** + * irq_chip_mask_ack_parent - Mask and acknowledge the parent interrupt + * @data: Pointer to interrupt specific data + */ +void irq_chip_mask_ack_parent(struct irq_data *data) +{ + data = data->parent_data; + data->chip->irq_mask_ack(data); +} +EXPORT_SYMBOL_GPL(irq_chip_mask_ack_parent); + /** * irq_chip_unmask_parent - Unmask the parent interrupt * @data: Pointer to interrupt specific data -- cgit v1.3-7-g2ca7 From cf43a757fd49442bc38f76088b70c2299eed2c2f Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Mon, 11 Feb 2019 23:27:42 -0600 Subject: signal: Restore the stop PTRACE_EVENT_EXIT In the middle of do_exit() there is there is a call "ptrace_event(PTRACE_EVENT_EXIT, code);" That call places the process in TACKED_TRACED aka "(TASK_WAKEKILL | __TASK_TRACED)" and waits for for the debugger to release the task or SIGKILL to be delivered. Skipping past dequeue_signal when we know a fatal signal has already been delivered resulted in SIGKILL remaining pending and TIF_SIGPENDING remaining set. This in turn caused the scheduler to not sleep in PTACE_EVENT_EXIT as it figured a fatal signal was pending. This also caused ptrace_freeze_traced in ptrace_check_attach to fail because it left a per thread SIGKILL pending which is what fatal_signal_pending tests for. This difference in signal state caused strace to report strace: Exit of unknown pid NNNNN ignored Therefore update the signal handling state like dequeue_signal would when removing a per thread SIGKILL, by removing SIGKILL from the per thread signal mask and clearing TIF_SIGPENDING. Acked-by: Oleg Nesterov Reported-by: Oleg Nesterov Reported-by: Ivan Delalande Cc: stable@vger.kernel.org Fixes: 35634ffa1751 ("signal: Always notice exiting tasks") Signed-off-by: "Eric W. Biederman" --- kernel/signal.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/signal.c b/kernel/signal.c index 99fa8ff06fd9..57b7771e20d7 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -2436,9 +2436,12 @@ relock: } /* Has this task already been marked for death? */ - ksig->info.si_signo = signr = SIGKILL; - if (signal_group_exit(signal)) + if (signal_group_exit(signal)) { + ksig->info.si_signo = signr = SIGKILL; + sigdelset(¤t->pending.signal, SIGKILL); + recalc_sigpending(); goto fatal; + } for (;;) { struct k_sigaction *ka; -- cgit v1.3-7-g2ca7 From 70ca7ba2dbe4f1858b85e30269c8408a8bb8f272 Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Mon, 11 Feb 2019 18:12:30 +0200 Subject: dma-mapping: move debug configuration options to kernel/dma This is a follow up to the commit cf65a0f6f6ff ("dma-mapping: move all DMA mapping code to kernel/dma") which moved source code of DMA API to kernel/dma folder. Since there is no file left in the lib that require DMA API debugging options move the latter to kernel/dma as well. Cc: Christoph Hellwig Signed-off-by: Andy Shevchenko Signed-off-by: Christoph Hellwig --- kernel/dma/Kconfig | 36 ++++++++++++++++++++++++++++++++++++ lib/Kconfig.debug | 36 ------------------------------------ 2 files changed, 36 insertions(+), 36 deletions(-) (limited to 'kernel') diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig index ca88b867e7fe..61cebea36d89 100644 --- a/kernel/dma/Kconfig +++ b/kernel/dma/Kconfig @@ -53,3 +53,39 @@ config DMA_REMAP config DMA_DIRECT_REMAP bool select DMA_REMAP + +config DMA_API_DEBUG + bool "Enable debugging of DMA-API usage" + select NEED_DMA_MAP_STATE + help + Enable this option to debug the use of the DMA API by device drivers. + With this option you will be able to detect common bugs in device + drivers like double-freeing of DMA mappings or freeing mappings that + were never allocated. + + This also attempts to catch cases where a page owned by DMA is + accessed by the cpu in a way that could cause data corruption. For + example, this enables cow_user_page() to check that the source page is + not undergoing DMA. + + This option causes a performance degradation. Use only if you want to + debug device drivers and dma interactions. + + If unsure, say N. + +config DMA_API_DEBUG_SG + bool "Debug DMA scatter-gather usage" + default y + depends on DMA_API_DEBUG + help + Perform extra checking that callers of dma_map_sg() have respected the + appropriate segment length/boundary limits for the given device when + preparing DMA scatterlists. + + This is particularly likely to have been overlooked in cases where the + dma_map_sg() API is used for general bulk mapping of pages rather than + preparing literal scatter-gather descriptors, where there is a risk of + unexpected behaviour from DMA API implementations if the scatterlist + is technically out-of-spec. + + If unsure, say N. diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index d4df5b24d75e..ef5d7c08e5b9 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -1655,42 +1655,6 @@ config PROVIDE_OHCI1394_DMA_INIT See Documentation/debugging-via-ohci1394.txt for more information. -config DMA_API_DEBUG - bool "Enable debugging of DMA-API usage" - select NEED_DMA_MAP_STATE - help - Enable this option to debug the use of the DMA API by device drivers. - With this option you will be able to detect common bugs in device - drivers like double-freeing of DMA mappings or freeing mappings that - were never allocated. - - This also attempts to catch cases where a page owned by DMA is - accessed by the cpu in a way that could cause data corruption. For - example, this enables cow_user_page() to check that the source page is - not undergoing DMA. - - This option causes a performance degradation. Use only if you want to - debug device drivers and dma interactions. - - If unsure, say N. - -config DMA_API_DEBUG_SG - bool "Debug DMA scatter-gather usage" - default y - depends on DMA_API_DEBUG - help - Perform extra checking that callers of dma_map_sg() have respected the - appropriate segment length/boundary limits for the given device when - preparing DMA scatterlists. - - This is particularly likely to have been overlooked in cases where the - dma_map_sg() API is used for general bulk mapping of pages rather than - preparing literal scatter-gather descriptors, where there is a risk of - unexpected behaviour from DMA API implementations if the scatterlist - is technically out-of-spec. - - If unsure, say N. - menuconfig RUNTIME_TESTING_MENU bool "Runtime Testing" def_bool y -- cgit v1.3-7-g2ca7 From 347cb6af8710b72cf9685fdc09d07873cf42d51f Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Mon, 7 Jan 2019 13:36:20 -0500 Subject: dma-mapping: add a kconfig symbol for arch_setup_dma_ops availability Signed-off-by: Christoph Hellwig Acked-by: Paul Burton # MIPS Acked-by: Catalin Marinas # arm64 --- arch/arc/Kconfig | 1 + arch/arc/include/asm/Kbuild | 1 + arch/arc/include/asm/dma-mapping.h | 13 ------------- arch/arm/Kconfig | 1 + arch/arm/include/asm/dma-mapping.h | 4 ---- arch/arm64/Kconfig | 1 + arch/arm64/include/asm/dma-mapping.h | 4 ---- arch/mips/Kconfig | 1 + arch/mips/include/asm/dma-mapping.h | 10 ---------- arch/mips/mm/dma-noncoherent.c | 8 ++++++++ include/linux/dma-mapping.h | 12 ++++++++---- kernel/dma/Kconfig | 3 +++ 12 files changed, 24 insertions(+), 35 deletions(-) delete mode 100644 arch/arc/include/asm/dma-mapping.h (limited to 'kernel') diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig index 376366a7db81..2ab27d88eb1c 100644 --- a/arch/arc/Kconfig +++ b/arch/arc/Kconfig @@ -11,6 +11,7 @@ config ARC select ARC_TIMERS select ARCH_HAS_DMA_COHERENT_TO_PFN select ARCH_HAS_PTE_SPECIAL + select ARCH_HAS_SETUP_DMA_OPS select ARCH_HAS_SYNC_DMA_FOR_CPU select ARCH_HAS_SYNC_DMA_FOR_DEVICE select ARCH_SUPPORTS_ATOMIC_RMW if ARC_HAS_LLSC diff --git a/arch/arc/include/asm/Kbuild b/arch/arc/include/asm/Kbuild index caa270261521..b41f8881ecc8 100644 --- a/arch/arc/include/asm/Kbuild +++ b/arch/arc/include/asm/Kbuild @@ -3,6 +3,7 @@ generic-y += bugs.h generic-y += compat.h generic-y += device.h generic-y += div64.h +generic-y += dma-mapping.h generic-y += emergency-restart.h generic-y += extable.h generic-y += ftrace.h diff --git a/arch/arc/include/asm/dma-mapping.h b/arch/arc/include/asm/dma-mapping.h deleted file mode 100644 index c946c0a83e76..000000000000 --- a/arch/arc/include/asm/dma-mapping.h +++ /dev/null @@ -1,13 +0,0 @@ -// SPDX-License-Identifier: GPL-2.0 -// (C) 2018 Synopsys, Inc. (www.synopsys.com) - -#ifndef ASM_ARC_DMA_MAPPING_H -#define ASM_ARC_DMA_MAPPING_H - -#include - -void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size, - const struct iommu_ops *iommu, bool coherent); -#define arch_setup_dma_ops arch_setup_dma_ops - -#endif diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index 664e918e2624..c1cf44f00870 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -12,6 +12,7 @@ config ARM select ARCH_HAS_MEMBARRIER_SYNC_CORE select ARCH_HAS_PTE_SPECIAL if ARM_LPAE select ARCH_HAS_PHYS_TO_DMA + select ARCH_HAS_SETUP_DMA_OPS select ARCH_HAS_SET_MEMORY select ARCH_HAS_STRICT_KERNEL_RWX if MMU && !XIP_KERNEL select ARCH_HAS_STRICT_MODULE_RWX if MMU diff --git a/arch/arm/include/asm/dma-mapping.h b/arch/arm/include/asm/dma-mapping.h index 31d3b96f0f4b..a224b6e39e58 100644 --- a/arch/arm/include/asm/dma-mapping.h +++ b/arch/arm/include/asm/dma-mapping.h @@ -96,10 +96,6 @@ static inline unsigned long dma_max_pfn(struct device *dev) } #define dma_max_pfn(dev) dma_max_pfn(dev) -#define arch_setup_dma_ops arch_setup_dma_ops -extern void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size, - const struct iommu_ops *iommu, bool coherent); - #ifdef CONFIG_MMU #define arch_teardown_dma_ops arch_teardown_dma_ops extern void arch_teardown_dma_ops(struct device *dev); diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index a4168d366127..63909f318d56 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -22,6 +22,7 @@ config ARM64 select ARCH_HAS_KCOV select ARCH_HAS_MEMBARRIER_SYNC_CORE select ARCH_HAS_PTE_SPECIAL + select ARCH_HAS_SETUP_DMA_OPS select ARCH_HAS_SET_MEMORY select ARCH_HAS_STRICT_KERNEL_RWX select ARCH_HAS_STRICT_MODULE_RWX diff --git a/arch/arm64/include/asm/dma-mapping.h b/arch/arm64/include/asm/dma-mapping.h index 95dbf3ef735a..de96507ee2c1 100644 --- a/arch/arm64/include/asm/dma-mapping.h +++ b/arch/arm64/include/asm/dma-mapping.h @@ -29,10 +29,6 @@ static inline const struct dma_map_ops *get_arch_dma_ops(struct bus_type *bus) return NULL; } -void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size, - const struct iommu_ops *iommu, bool coherent); -#define arch_setup_dma_ops arch_setup_dma_ops - #ifdef CONFIG_IOMMU_DMA void arch_teardown_dma_ops(struct device *dev); #define arch_teardown_dma_ops arch_teardown_dma_ops diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig index 0d14f51d0002..dc5d70f674e0 100644 --- a/arch/mips/Kconfig +++ b/arch/mips/Kconfig @@ -1118,6 +1118,7 @@ config DMA_MAYBE_COHERENT config DMA_PERDEV_COHERENT bool + select ARCH_HAS_SETUP_DMA_OPS select DMA_NONCOHERENT config DMA_NONCOHERENT diff --git a/arch/mips/include/asm/dma-mapping.h b/arch/mips/include/asm/dma-mapping.h index 20dfaad3a55d..34de7b17b41b 100644 --- a/arch/mips/include/asm/dma-mapping.h +++ b/arch/mips/include/asm/dma-mapping.h @@ -15,14 +15,4 @@ static inline const struct dma_map_ops *get_arch_dma_ops(struct bus_type *bus) #endif } -#define arch_setup_dma_ops arch_setup_dma_ops -static inline void arch_setup_dma_ops(struct device *dev, u64 dma_base, - u64 size, const struct iommu_ops *iommu, - bool coherent) -{ -#ifdef CONFIG_DMA_PERDEV_COHERENT - dev->dma_coherent = coherent; -#endif -} - #endif /* _ASM_DMA_MAPPING_H */ diff --git a/arch/mips/mm/dma-noncoherent.c b/arch/mips/mm/dma-noncoherent.c index cb38461391cb..0606fc87b294 100644 --- a/arch/mips/mm/dma-noncoherent.c +++ b/arch/mips/mm/dma-noncoherent.c @@ -159,3 +159,11 @@ void arch_dma_cache_sync(struct device *dev, void *vaddr, size_t size, dma_sync_virt(vaddr, size, direction); } + +#ifdef CONFIG_DMA_PERDEV_COHERENT +void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size, + const struct iommu_ops *iommu, bool coherent) +{ + dev->dma_coherent = coherent; +} +#endif diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h index b904d55247ab..2b20d60e6158 100644 --- a/include/linux/dma-mapping.h +++ b/include/linux/dma-mapping.h @@ -671,11 +671,15 @@ static inline int dma_coerce_mask_and_coherent(struct device *dev, u64 mask) return dma_set_mask_and_coherent(dev, mask); } -#ifndef arch_setup_dma_ops +#ifdef CONFIG_ARCH_HAS_SETUP_DMA_OPS +void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size, + const struct iommu_ops *iommu, bool coherent); +#else static inline void arch_setup_dma_ops(struct device *dev, u64 dma_base, - u64 size, const struct iommu_ops *iommu, - bool coherent) { } -#endif + u64 size, const struct iommu_ops *iommu, bool coherent) +{ +} +#endif /* CONFIG_ARCH_HAS_SETUP_DMA_OPS */ #ifndef arch_teardown_dma_ops static inline void arch_teardown_dma_ops(struct device *dev) { } diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig index 61cebea36d89..6014cad35e58 100644 --- a/kernel/dma/Kconfig +++ b/kernel/dma/Kconfig @@ -19,6 +19,9 @@ config ARCH_HAS_DMA_COHERENCE_H config HAVE_GENERIC_DMA_COHERENT bool +config ARCH_HAS_SETUP_DMA_OPS + bool + config ARCH_HAS_SYNC_DMA_FOR_DEVICE bool -- cgit v1.3-7-g2ca7 From dc2acded38957dfa6b7b7e0203b4b8cb8d818ce6 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Fri, 21 Dec 2018 22:14:44 +0100 Subject: dma-mapping: add a kconfig symbol for arch_teardown_dma_ops availability Signed-off-by: Christoph Hellwig Acked-by: Catalin Marinas # arm64 --- arch/arm/Kconfig | 1 + arch/arm/include/asm/dma-mapping.h | 5 ----- arch/arm64/Kconfig | 1 + arch/arm64/include/asm/dma-mapping.h | 5 ----- include/linux/dma-mapping.h | 10 +++++++--- kernel/dma/Kconfig | 3 +++ 6 files changed, 12 insertions(+), 13 deletions(-) (limited to 'kernel') diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index c1cf44f00870..4bb36ae71b14 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -16,6 +16,7 @@ config ARM select ARCH_HAS_SET_MEMORY select ARCH_HAS_STRICT_KERNEL_RWX if MMU && !XIP_KERNEL select ARCH_HAS_STRICT_MODULE_RWX if MMU + select ARCH_HAS_TEARDOWN_DMA_OPS if MMU select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST select ARCH_HAVE_CUSTOM_GPIO_H select ARCH_HAS_GCOV_PROFILE_ALL diff --git a/arch/arm/include/asm/dma-mapping.h b/arch/arm/include/asm/dma-mapping.h index a224b6e39e58..03ba90ffc0f8 100644 --- a/arch/arm/include/asm/dma-mapping.h +++ b/arch/arm/include/asm/dma-mapping.h @@ -96,11 +96,6 @@ static inline unsigned long dma_max_pfn(struct device *dev) } #define dma_max_pfn(dev) dma_max_pfn(dev) -#ifdef CONFIG_MMU -#define arch_teardown_dma_ops arch_teardown_dma_ops -extern void arch_teardown_dma_ops(struct device *dev); -#endif - /* do not use this function in a driver */ static inline bool is_device_dma_coherent(struct device *dev) { diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 63909f318d56..87ec7be25e97 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -29,6 +29,7 @@ config ARM64 select ARCH_HAS_SYNC_DMA_FOR_DEVICE select ARCH_HAS_SYNC_DMA_FOR_CPU select ARCH_HAS_SYSCALL_WRAPPER + select ARCH_HAS_TEARDOWN_DMA_OPS if IOMMU_SUPPORT select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST select ARCH_HAVE_NMI_SAFE_CMPXCHG select ARCH_INLINE_READ_LOCK if !PREEMPT diff --git a/arch/arm64/include/asm/dma-mapping.h b/arch/arm64/include/asm/dma-mapping.h index de96507ee2c1..de98191e4c7d 100644 --- a/arch/arm64/include/asm/dma-mapping.h +++ b/arch/arm64/include/asm/dma-mapping.h @@ -29,11 +29,6 @@ static inline const struct dma_map_ops *get_arch_dma_ops(struct bus_type *bus) return NULL; } -#ifdef CONFIG_IOMMU_DMA -void arch_teardown_dma_ops(struct device *dev); -#define arch_teardown_dma_ops arch_teardown_dma_ops -#endif - /* * Do not use this function in a driver, it is only provided for * arch/arm/mm/xen.c, which is used by arm64 as well. diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h index 2b20d60e6158..4210c5c1dd21 100644 --- a/include/linux/dma-mapping.h +++ b/include/linux/dma-mapping.h @@ -681,9 +681,13 @@ static inline void arch_setup_dma_ops(struct device *dev, u64 dma_base, } #endif /* CONFIG_ARCH_HAS_SETUP_DMA_OPS */ -#ifndef arch_teardown_dma_ops -static inline void arch_teardown_dma_ops(struct device *dev) { } -#endif +#ifdef CONFIG_ARCH_HAS_TEARDOWN_DMA_OPS +void arch_teardown_dma_ops(struct device *dev); +#else +static inline void arch_teardown_dma_ops(struct device *dev) +{ +} +#endif /* CONFIG_ARCH_HAS_TEARDOWN_DMA_OPS */ static inline unsigned int dma_get_max_seg_size(struct device *dev) { diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig index 6014cad35e58..bde9179c6ed7 100644 --- a/kernel/dma/Kconfig +++ b/kernel/dma/Kconfig @@ -22,6 +22,9 @@ config HAVE_GENERIC_DMA_COHERENT config ARCH_HAS_SETUP_DMA_OPS bool +config ARCH_HAS_TEARDOWN_DMA_OPS + bool + config ARCH_HAS_SYNC_DMA_FOR_DEVICE bool -- cgit v1.3-7-g2ca7 From be4311a262bcf29da60c1ef6b5a457fe5d9cccef Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Fri, 1 Feb 2019 22:25:09 +0100 Subject: dma-mapping: remove an incorrect __iommem annotation memmap return a regular void pointer, not and __iomem one. Signed-off-by: Christoph Hellwig --- kernel/dma/coherent.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/dma/coherent.c b/kernel/dma/coherent.c index 66f0fb7e9a3a..4b76aba574c2 100644 --- a/kernel/dma/coherent.c +++ b/kernel/dma/coherent.c @@ -43,7 +43,7 @@ static int dma_init_coherent_memory( struct dma_coherent_mem **mem) { struct dma_coherent_mem *dma_mem = NULL; - void __iomem *mem_base = NULL; + void *mem_base = NULL; int pages = size >> PAGE_SHIFT; int bitmap_size = BITS_TO_LONGS(pages) * sizeof(long); int ret; -- cgit v1.3-7-g2ca7 From a51866946c0aa0b04a554595d38b496e80c76a1b Mon Sep 17 00:00:00 2001 From: Julien Thierry Date: Wed, 13 Feb 2019 10:09:19 +0000 Subject: genirq: Fix wrong name in request_percpu_nmi() description ready_percpu_nmi() was the previous name of prepare_percpu_nmi(). Update request_percpu_nmi() comment with the correct function name. Signed-off-by: Julien Thierry Reported-by: Li Wei Cc: Thomas Gleixner Cc: Peter Zijlstra Cc: Ingo Molnar Cc: Marc Zyngier Signed-off-by: Marc Zyngier --- kernel/irq/manage.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c index 0a1ebc004a59..5b570050b9b6 100644 --- a/kernel/irq/manage.c +++ b/kernel/irq/manage.c @@ -2427,8 +2427,8 @@ EXPORT_SYMBOL_GPL(__request_percpu_irq); * @dev_id: A percpu cookie passed back to the handler function * * This call allocates interrupt resources for a per CPU NMI. Per CPU NMIs - * have to be setup on each CPU by calling ready_percpu_nmi() before being - * enabled on the same CPU by using enable_percpu_nmi(). + * have to be setup on each CPU by calling prepare_percpu_nmi() before + * being enabled on the same CPU by using enable_percpu_nmi(). * * Dev_id must be globally unique. It is a per-cpu variable, and * the handler gets called with the interrupted CPU's instance of -- cgit v1.3-7-g2ca7 From 2c4f1fcbef0bc324830bc2fb1a264c08ec93dec5 Mon Sep 17 00:00:00 2001 From: Changbin Du Date: Fri, 25 Jan 2019 23:10:50 +0800 Subject: kprobe: Do not use uaccess functions to access kernel memory that can fault The userspace can ask kprobe to intercept strings at any memory address, including invalid kernel address. In this case, fetch_store_strlen() would crash since it uses general usercopy function, and user access functions are no longer allowed to access kernel memory. For example, we can crash the kernel by doing something as below: $ sudo kprobe 'p:do_sys_open +0(+0(%si)):string' [ 103.620391] BUG: GPF in non-whitelisted uaccess (non-canonical address?) [ 103.622104] general protection fault: 0000 [#1] SMP PTI [ 103.623424] CPU: 10 PID: 1046 Comm: cat Not tainted 5.0.0-rc3-00130-gd73aba1-dirty #96 [ 103.625321] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-2-g628b2e6-dirty-20190104_103505-linux 04/01/2014 [ 103.628284] RIP: 0010:process_fetch_insn+0x1ab/0x4b0 [ 103.629518] Code: 10 83 80 28 2e 00 00 01 31 d2 31 ff 48 8b 74 24 28 eb 0c 81 fa ff 0f 00 00 7f 1c 85 c0 75 18 66 66 90 0f ae e8 48 63 ca 89 f8 <8a> 0c 31 66 66 90 83 c2 01 84 c9 75 dc 89 54 24 34 89 44 24 28 48 [ 103.634032] RSP: 0018:ffff88845eb37ce0 EFLAGS: 00010246 [ 103.635312] RAX: 0000000000000000 RBX: ffff888456c4e5a8 RCX: 0000000000000000 [ 103.637057] RDX: 0000000000000000 RSI: 2e646c2f6374652f RDI: 0000000000000000 [ 103.638795] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 [ 103.640556] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000 [ 103.642297] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [ 103.644040] FS: 0000000000000000(0000) GS:ffff88846f000000(0000) knlGS:0000000000000000 [ 103.646019] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 103.647436] CR2: 00007ffc79758038 CR3: 0000000463360006 CR4: 0000000000020ee0 [ 103.649147] Call Trace: [ 103.649781] ? sched_clock_cpu+0xc/0xa0 [ 103.650747] ? do_sys_open+0x5/0x220 [ 103.651635] kprobe_trace_func+0x303/0x380 [ 103.652645] ? do_sys_open+0x5/0x220 [ 103.653528] kprobe_dispatcher+0x45/0x50 [ 103.654682] ? do_sys_open+0x1/0x220 [ 103.655875] kprobe_ftrace_handler+0x90/0xf0 [ 103.657282] ftrace_ops_assist_func+0x54/0xf0 [ 103.658564] ? __call_rcu+0x1dc/0x280 [ 103.659482] 0xffffffffc00000bf [ 103.660384] ? __ia32_sys_open+0x20/0x20 [ 103.661682] ? do_sys_open+0x1/0x220 [ 103.662863] do_sys_open+0x5/0x220 [ 103.663988] do_syscall_64+0x60/0x210 [ 103.665201] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 103.666862] RIP: 0033:0x7fc22fadccdd [ 103.668034] Code: 48 89 54 24 e0 41 83 e2 40 75 32 89 f0 25 00 00 41 00 3d 00 00 41 00 74 24 89 f2 b8 01 01 00 00 48 89 fe bf 9c ff ff ff 0f 05 <48> 3d 00 f0 ff ff 77 33 f3 c3 66 0f 1f 84 00 00 00 00 00 48 8d 44 [ 103.674029] RSP: 002b:00007ffc7972c3a8 EFLAGS: 00000287 ORIG_RAX: 0000000000000101 [ 103.676512] RAX: ffffffffffffffda RBX: 0000562f86147a21 RCX: 00007fc22fadccdd [ 103.678853] RDX: 0000000000080000 RSI: 00007fc22fae1428 RDI: 00000000ffffff9c [ 103.681151] RBP: ffffffffffffffff R08: 0000000000000000 R09: 0000000000000000 [ 103.683489] R10: 0000000000000000 R11: 0000000000000287 R12: 00007fc22fce90a8 [ 103.685774] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000 [ 103.688056] Modules linked in: [ 103.689131] ---[ end trace 43792035c28984a1 ]--- This can be fixed by using probe_mem_read() instead, as it can handle faulting kernel memory addresses, which kprobes can legitimately do. Link: http://lkml.kernel.org/r/20190125151051.7381-1-changbin.du@gmail.com Cc: stable@vger.kernel.org Fixes: 9da3f2b7405 ("x86/fault: BUG() when uaccess helpers fault on kernel addresses") Signed-off-by: Changbin Du Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_kprobe.c | 10 +--------- 1 file changed, 1 insertion(+), 9 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c index d5fb09ebba8b..9eaf07f99212 100644 --- a/kernel/trace/trace_kprobe.c +++ b/kernel/trace/trace_kprobe.c @@ -861,22 +861,14 @@ static const struct file_operations kprobe_profile_ops = { static nokprobe_inline int fetch_store_strlen(unsigned long addr) { - mm_segment_t old_fs; int ret, len = 0; u8 c; - old_fs = get_fs(); - set_fs(KERNEL_DS); - pagefault_disable(); - do { - ret = __copy_from_user_inatomic(&c, (u8 *)addr + len, 1); + ret = probe_mem_read(&c, (u8 *)addr + len, 1); len++; } while (c && ret == 0 && len < MAX_STRING_SIZE); - pagefault_enable(); - set_fs(old_fs); - return (ret < 0) ? ret : len; } -- cgit v1.3-7-g2ca7 From 9e7382153f80ba45a0bbcd540fb77d4b15f6e966 Mon Sep 17 00:00:00 2001 From: Quentin Perret Date: Thu, 14 Feb 2019 15:29:50 +0000 Subject: tracing: Fix number of entries in trace header The following commit 441dae8f2f29 ("tracing: Add support for display of tgid in trace output") removed the call to print_event_info() from print_func_help_header_irq() which results in the ftrace header not reporting the number of entries written in the buffer. As this wasn't the original intent of the patch, re-introduce the call to print_event_info() to restore the orginal behaviour. Link: http://lkml.kernel.org/r/20190214152950.4179-1-quentin.perret@arm.com Acked-by: Joel Fernandes Cc: stable@vger.kernel.org Fixes: 441dae8f2f29 ("tracing: Add support for display of tgid in trace output") Signed-off-by: Quentin Perret Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace.c | 2 ++ 1 file changed, 2 insertions(+) (limited to 'kernel') diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index c521b7347482..c4238b441624 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -3384,6 +3384,8 @@ static void print_func_help_header_irq(struct trace_buffer *buf, struct seq_file const char tgid_space[] = " "; const char space[] = " "; + print_event_info(buf, m); + seq_printf(m, "# %s _-----=> irqs-off\n", tgid ? tgid_space : space); seq_printf(m, "# %s / _----=> need-resched\n", -- cgit v1.3-7-g2ca7 From f79b3f338564e7674dbe6375bcf685c2ba483efe Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Mon, 11 Feb 2019 15:00:48 -0500 Subject: ftrace: Allow enabling of filters via index of available_filter_functions Enabling of large number of functions by echoing in a large subset of the functions in available_filter_functions can take a very long time. The process requires testing all functions registered by the function tracer (which is in the 10s of thousands), and doing a kallsyms lookup to convert the ip address into a name, then comparing that name with the string passed in. When a function causes the function tracer to crash the system, a binary bisect of the available_filter_functions can be done to find the culprit. But this requires passing in half of the functions in available_filter_functions over and over again, which makes it basically a O(n^2) operation. With 40,000 functions, that ends up bing 1,600,000,000 opertions! And enabling this can take over 20 minutes. As a quick speed up, if a number is passed into one of the filter files, instead of doing a search, it just enables the function at the corresponding line of the available_filter_functions file. That is: # echo 50 > set_ftrace_filter # cat set_ftrace_filter x86_pmu_commit_txn # head -50 available_filter_functions | tail -1 x86_pmu_commit_txn This allows setting of half the available_filter_functions to take place in less than a second! # time seq 20000 > set_ftrace_filter real 0m0.042s user 0m0.005s sys 0m0.015s # wc -l set_ftrace_filter 20000 set_ftrace_filter Signed-off-by: Steven Rostedt (VMware) --- Documentation/trace/ftrace.rst | 38 ++++++++++++++++++++++++++++++++++++++ kernel/trace/ftrace.c | 30 ++++++++++++++++++++++++++++++ kernel/trace/trace.h | 1 + kernel/trace/trace_events_filter.c | 5 +++++ 4 files changed, 74 insertions(+) (limited to 'kernel') diff --git a/Documentation/trace/ftrace.rst b/Documentation/trace/ftrace.rst index 6ce2763a2a3e..7c5e6d6ab5d1 100644 --- a/Documentation/trace/ftrace.rst +++ b/Documentation/trace/ftrace.rst @@ -233,6 +233,12 @@ of ftrace. Here is a list of some of the key files: This interface also allows for commands to be used. See the "Filter commands" section for more details. + As a speed up, since processing strings can't be quite expensive + and requires a check of all functions registered to tracing, instead + an index can be written into this file. A number (starting with "1") + written will instead select the same corresponding at the line position + of the "available_filter_functions" file. + set_ftrace_notrace: This has an effect opposite to that of @@ -2835,6 +2841,38 @@ Produces:: We can see that there's no more lock or preempt tracing. +Selecting function filters via index +------------------------------------ + +Because processing of strings is expensive (the address of the function +needs to be looked up before comparing to the string being passed in), +an index can be used as well to enable functions. This is useful in the +case of setting thousands of specific functions at a time. By passing +in a list of numbers, no string processing will occur. Instead, the function +at the specific location in the internal array (which corresponds to the +functions in the "available_filter_functions" file), is selected. + +:: + + # echo 1 > set_ftrace_filter + +Will select the first function listed in "available_filter_functions" + +:: + + # head -1 available_filter_functions + trace_initcall_finish_cb + + # cat set_ftrace_filter + trace_initcall_finish_cb + + # head -50 available_filter_functions | tail -1 + x86_pmu_commit_txn + + # echo 1 50 > set_ftrace_filter + # cat set_ftrace_filter + trace_initcall_finish_cb + x86_pmu_commit_txn Dynamic ftrace with the function graph tracer --------------------------------------------- diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index aac7847c0214..fa79323331b2 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -3701,6 +3701,31 @@ enter_record(struct ftrace_hash *hash, struct dyn_ftrace *rec, int clear_filter) return ret; } +static int +add_rec_by_index(struct ftrace_hash *hash, struct ftrace_glob *func_g, + int clear_filter) +{ + long index = simple_strtoul(func_g->search, NULL, 0); + struct ftrace_page *pg; + struct dyn_ftrace *rec; + + /* The index starts at 1 */ + if (--index < 0) + return 0; + + do_for_each_ftrace_rec(pg, rec) { + if (pg->index <= index) { + index -= pg->index; + /* this is a double loop, break goes to the next page */ + break; + } + rec = &pg->records[index]; + enter_record(hash, rec, clear_filter); + return 1; + } while_for_each_ftrace_rec(); + return 0; +} + static int ftrace_match_record(struct dyn_ftrace *rec, struct ftrace_glob *func_g, struct ftrace_glob *mod_g, int exclude_mod) @@ -3769,6 +3794,11 @@ match_records(struct ftrace_hash *hash, char *func, int len, char *mod) if (unlikely(ftrace_disabled)) goto out_unlock; + if (func_g.type == MATCH_INDEX) { + found = add_rec_by_index(hash, &func_g, clear_filter); + goto out_unlock; + } + do_for_each_ftrace_rec(pg, rec) { if (rec->flags & FTRACE_FL_DISABLED) diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h index a34fa5e76abb..ae7df090b93e 100644 --- a/kernel/trace/trace.h +++ b/kernel/trace/trace.h @@ -1459,6 +1459,7 @@ enum regex_type { MATCH_MIDDLE_ONLY, MATCH_END_ONLY, MATCH_GLOB, + MATCH_INDEX, }; struct regex { diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c index f052ecb085e9..ade606c33231 100644 --- a/kernel/trace/trace_events_filter.c +++ b/kernel/trace/trace_events_filter.c @@ -825,6 +825,9 @@ enum regex_type filter_parse_regex(char *buff, int len, char **search, int *not) *search = buff; + if (isdigit(buff[0])) + return MATCH_INDEX; + for (i = 0; i < len; i++) { if (buff[i] == '*') { if (!i) { @@ -862,6 +865,8 @@ static void filter_build_regex(struct filter_pred *pred) } switch (type) { + /* MATCH_INDEX should not happen, but if it does, match full */ + case MATCH_INDEX: case MATCH_FULL: r->match = regex_match_full; break; -- cgit v1.3-7-g2ca7 From ce59b8e99c2c3e5828bce0b434850c6fec9d2397 Mon Sep 17 00:00:00 2001 From: Elena Reshetova Date: Wed, 16 Jan 2019 13:20:27 +0200 Subject: uprobes: convert uprobe.ref to refcount_t atomic_t variables are currently used to implement reference counters with the following properties: - counter is initialized to 1 using atomic_set() - a resource is freed upon counter reaching zero - once counter reaches zero, its further increments aren't allowed - counter schema uses basic atomic operations (set, inc, inc_not_zero, dec_and_test, etc.) Such atomic variables should be converted to a newly provided refcount_t type and API that prevents accidental counter overflows and underflows. This is important since overflows and underflows can lead to use-after-free situation and be exploitable. The variable uprobe.ref is used as pure reference counter. Convert it to refcount_t and fix up the operations. **Important note for maintainers: Some functions from refcount_t API defined in lib/refcount.c have different memory ordering guarantees than their atomic counterparts. The full comparison can be seen in https://lkml.org/lkml/2017/11/15/57 and it is hopefully soon in state to be merged to the documentation tree. Normally the differences should not matter since refcount_t provides enough guarantees to satisfy the refcounting use cases, but in some rare cases it might matter. Please double check that you don't have some undocumented memory guarantees for this variable usage. For the uprobe.ref it might make a difference in following places: - put_uprobe(): decrement in refcount_dec_and_test() only provides RELEASE ordering and control dependency on success vs. fully ordered atomic counterpart Link: http://lkml.kernel.org/r/1547637627-29526-1-git-send-email-elena.reshetova@intel.com Suggested-by: Kees Cook Acked-by: Oleg Nesterov Reviewed-by: David Windsor Reviewed-by: Hans Liljestrand Reviewed-by: Srikar Dronamraju Signed-off-by: Elena Reshetova Signed-off-by: Steven Rostedt (VMware) --- kernel/events/uprobes.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 8aef47ee7bfa..0a8bf7a4fc5e 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -66,7 +66,7 @@ static struct percpu_rw_semaphore dup_mmap_sem; struct uprobe { struct rb_node rb_node; /* node in the rb tree */ - atomic_t ref; + refcount_t ref; struct rw_semaphore register_rwsem; struct rw_semaphore consumer_rwsem; struct list_head pending_list; @@ -560,13 +560,13 @@ set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long v static struct uprobe *get_uprobe(struct uprobe *uprobe) { - atomic_inc(&uprobe->ref); + refcount_inc(&uprobe->ref); return uprobe; } static void put_uprobe(struct uprobe *uprobe) { - if (atomic_dec_and_test(&uprobe->ref)) { + if (refcount_dec_and_test(&uprobe->ref)) { /* * If application munmap(exec_vma) before uprobe_unregister() * gets called, we don't get a chance to remove uprobe from @@ -657,7 +657,7 @@ static struct uprobe *__insert_uprobe(struct uprobe *uprobe) rb_link_node(&uprobe->rb_node, parent, p); rb_insert_color(&uprobe->rb_node, &uprobes_tree); /* get access + creation ref */ - atomic_set(&uprobe->ref, 2); + refcount_set(&uprobe->ref, 2); return u; } -- cgit v1.3-7-g2ca7 From b4ff1b44bcd384d22fcbac6ebaf9cc0d33debe50 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 15 Feb 2019 11:01:31 -0800 Subject: cgroup, rstat: Don't flush subtree root unless necessary cgroup_rstat_cpu_pop_updated() is used to traverse the updated cgroups on flush. While it was only visiting updated ones in the subtree, it was visiting @root unconditionally. We can easily check whether @root is updated or not by looking at its ->updated_next just as with the cgroups in the subtree. * Remove the unnecessary cgroup_parent() test. The system root cgroup is never updated and thus its ->updated_next is always NULL. No need to test whether cgroup_parent() exists in addition to ->updated_next. * Terminate traverse if ->updated_next is NULL. This can only happen for subtree @root and there's no reason to visit it if it's not marked updated. This reduces cpu consumption when reading a lot of rstat backed files. In a micro benchmark reading stat from ~1600 cgroups, the sys time was lowered by >40%. Signed-off-by: Tejun Heo --- kernel/cgroup/rstat.c | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c index d503d1a9007c..bb95a35e8c2d 100644 --- a/kernel/cgroup/rstat.c +++ b/kernel/cgroup/rstat.c @@ -87,7 +87,6 @@ static struct cgroup *cgroup_rstat_cpu_pop_updated(struct cgroup *pos, struct cgroup *root, int cpu) { struct cgroup_rstat_cpu *rstatc; - struct cgroup *parent; if (pos == root) return NULL; @@ -115,8 +114,8 @@ static struct cgroup *cgroup_rstat_cpu_pop_updated(struct cgroup *pos, * However, due to the way we traverse, @pos will be the first * child in most cases. The only exception is @root. */ - parent = cgroup_parent(pos); - if (parent && rstatc->updated_next) { + if (rstatc->updated_next) { + struct cgroup *parent = cgroup_parent(pos); struct cgroup_rstat_cpu *prstatc = cgroup_rstat_cpu(parent, cpu); struct cgroup_rstat_cpu *nrstatc; struct cgroup **nextp; @@ -140,9 +139,12 @@ static struct cgroup *cgroup_rstat_cpu_pop_updated(struct cgroup *pos, * updated stat. */ smp_mb(); + + return pos; } - return pos; + /* only happens for @root */ + return NULL; } /* see cgroup_rstat_flush() */ -- cgit v1.3-7-g2ca7 From 22cb45d7692ab502dd47dc5a607b3af240ee1e37 Mon Sep 17 00:00:00 2001 From: YueHaibing Date: Thu, 14 Feb 2019 09:04:08 +0000 Subject: swiotlb: drop pointless static qualifier in swiotlb_create_debugfs() There is no need to have the 'struct dentry *d_swiotlb_usage' variable static since new value always be assigned before use it. Signed-off-by: YueHaibing Signed-off-by: Konrad Rzeszutek Wilk --- kernel/dma/swiotlb.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c index a01b83e95a2a..2b0c8fd9658e 100644 --- a/kernel/dma/swiotlb.c +++ b/kernel/dma/swiotlb.c @@ -682,7 +682,7 @@ swiotlb_dma_supported(struct device *hwdev, u64 mask) static int __init swiotlb_create_debugfs(void) { - static struct dentry *d_swiotlb_usage; + struct dentry *d_swiotlb_usage; struct dentry *ent; d_swiotlb_usage = debugfs_create_dir("swiotlb", NULL); -- cgit v1.3-7-g2ca7 From 0145c30e896d26e638d27c957d9eed72893c1c92 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Sat, 16 Feb 2019 18:13:07 +0100 Subject: genirq/affinity: Code consolidation All information and calculations in the interrupt affinity spreading code is strictly unsigned int. Though the code uses int all over the place. Convert it over to unsigned int. Signed-off-by: Thomas Gleixner Reviewed-by: Ming Lei Acked-by: Marc Zyngier Cc: Christoph Hellwig Cc: Bjorn Helgaas Cc: Jens Axboe Cc: linux-block@vger.kernel.org Cc: Sagi Grimberg Cc: linux-nvme@lists.infradead.org Cc: linux-pci@vger.kernel.org Cc: Keith Busch Cc: Sumit Saxena Cc: Kashyap Desai Cc: Shivasharan Srikanteshwara Link: https://lkml.kernel.org/r/20190216172228.336424556@linutronix.de --- include/linux/interrupt.h | 20 +++++++++-------- kernel/irq/affinity.c | 56 +++++++++++++++++++++++------------------------ 2 files changed, 38 insertions(+), 38 deletions(-) (limited to 'kernel') diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h index 4a728dba02e2..35e7389c2011 100644 --- a/include/linux/interrupt.h +++ b/include/linux/interrupt.h @@ -251,10 +251,10 @@ struct irq_affinity_notify { * @sets: Number of affinitized sets */ struct irq_affinity { - int pre_vectors; - int post_vectors; - int nr_sets; - int *sets; + unsigned int pre_vectors; + unsigned int post_vectors; + unsigned int nr_sets; + unsigned int *sets; }; /** @@ -314,9 +314,10 @@ extern int irq_set_affinity_notifier(unsigned int irq, struct irq_affinity_notify *notify); struct irq_affinity_desc * -irq_create_affinity_masks(int nvec, const struct irq_affinity *affd); +irq_create_affinity_masks(unsigned int nvec, const struct irq_affinity *affd); -int irq_calc_affinity_vectors(int minvec, int maxvec, const struct irq_affinity *affd); +unsigned int irq_calc_affinity_vectors(unsigned int minvec, unsigned int maxvec, + const struct irq_affinity *affd); #else /* CONFIG_SMP */ @@ -350,13 +351,14 @@ irq_set_affinity_notifier(unsigned int irq, struct irq_affinity_notify *notify) } static inline struct irq_affinity_desc * -irq_create_affinity_masks(int nvec, const struct irq_affinity *affd) +irq_create_affinity_masks(unsigned int nvec, const struct irq_affinity *affd) { return NULL; } -static inline int -irq_calc_affinity_vectors(int minvec, int maxvec, const struct irq_affinity *affd) +static inline unsigned int +irq_calc_affinity_vectors(unsigned int minvec, unsigned int maxvec, + const struct irq_affinity *affd) { return maxvec; } diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c index 118b66d64a53..82e8799374e9 100644 --- a/kernel/irq/affinity.c +++ b/kernel/irq/affinity.c @@ -9,7 +9,7 @@ #include static void irq_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk, - int cpus_per_vec) + unsigned int cpus_per_vec) { const struct cpumask *siblmsk; int cpu, sibl; @@ -95,15 +95,17 @@ static int get_nodes_in_cpumask(cpumask_var_t *node_to_cpumask, } static int __irq_build_affinity_masks(const struct irq_affinity *affd, - int startvec, int numvecs, int firstvec, + unsigned int startvec, + unsigned int numvecs, + unsigned int firstvec, cpumask_var_t *node_to_cpumask, const struct cpumask *cpu_mask, struct cpumask *nmsk, struct irq_affinity_desc *masks) { - int n, nodes, cpus_per_vec, extra_vecs, done = 0; - int last_affv = firstvec + numvecs; - int curvec = startvec; + unsigned int n, nodes, cpus_per_vec, extra_vecs, done = 0; + unsigned int last_affv = firstvec + numvecs; + unsigned int curvec = startvec; nodemask_t nodemsk = NODE_MASK_NONE; if (!cpumask_weight(cpu_mask)) @@ -117,18 +119,16 @@ static int __irq_build_affinity_masks(const struct irq_affinity *affd, */ if (numvecs <= nodes) { for_each_node_mask(n, nodemsk) { - cpumask_or(&masks[curvec].mask, - &masks[curvec].mask, - node_to_cpumask[n]); + cpumask_or(&masks[curvec].mask, &masks[curvec].mask, + node_to_cpumask[n]); if (++curvec == last_affv) curvec = firstvec; } - done = numvecs; - goto out; + return numvecs; } for_each_node_mask(n, nodemsk) { - int ncpus, v, vecs_to_assign, vecs_per_node; + unsigned int ncpus, v, vecs_to_assign, vecs_per_node; /* Spread the vectors per node */ vecs_per_node = (numvecs - (curvec - firstvec)) / nodes; @@ -163,8 +163,6 @@ static int __irq_build_affinity_masks(const struct irq_affinity *affd, curvec = firstvec; --nodes; } - -out: return done; } @@ -174,13 +172,14 @@ out: * 2) spread other possible CPUs on these vectors */ static int irq_build_affinity_masks(const struct irq_affinity *affd, - int startvec, int numvecs, int firstvec, + unsigned int startvec, unsigned int numvecs, + unsigned int firstvec, struct irq_affinity_desc *masks) { - int curvec = startvec, nr_present, nr_others; - int ret = -ENOMEM; - cpumask_var_t nmsk, npresmsk; + unsigned int curvec = startvec, nr_present, nr_others; cpumask_var_t *node_to_cpumask; + cpumask_var_t nmsk, npresmsk; + int ret = -ENOMEM; if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL)) return ret; @@ -239,12 +238,10 @@ static int irq_build_affinity_masks(const struct irq_affinity *affd, * Returns the irq_affinity_desc pointer or NULL if allocation failed. */ struct irq_affinity_desc * -irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd) +irq_create_affinity_masks(unsigned int nvecs, const struct irq_affinity *affd) { - int affvecs = nvecs - affd->pre_vectors - affd->post_vectors; - int curvec, usedvecs; + unsigned int affvecs, curvec, usedvecs, nr_sets, i; struct irq_affinity_desc *masks = NULL; - int i, nr_sets; /* * If there aren't any vectors left after applying the pre/post @@ -264,16 +261,17 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd) * Spread on present CPUs starting from affd->pre_vectors. If we * have multiple sets, build each sets affinity mask separately. */ + affvecs = nvecs - affd->pre_vectors - affd->post_vectors; nr_sets = affd->nr_sets; if (!nr_sets) nr_sets = 1; for (i = 0, usedvecs = 0; i < nr_sets; i++) { - int this_vecs = affd->sets ? affd->sets[i] : affvecs; + unsigned int this_vecs = affd->sets ? affd->sets[i] : affvecs; int ret; ret = irq_build_affinity_masks(affd, curvec, this_vecs, - curvec, masks); + curvec, masks); if (ret) { kfree(masks); return NULL; @@ -303,17 +301,17 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd) * @maxvec: The maximum number of vectors available * @affd: Description of the affinity requirements */ -int irq_calc_affinity_vectors(int minvec, int maxvec, const struct irq_affinity *affd) +unsigned int irq_calc_affinity_vectors(unsigned int minvec, unsigned int maxvec, + const struct irq_affinity *affd) { - int resv = affd->pre_vectors + affd->post_vectors; - int vecs = maxvec - resv; - int set_vecs; + unsigned int resv = affd->pre_vectors + affd->post_vectors; + unsigned int set_vecs; if (resv > minvec) return 0; if (affd->nr_sets) { - int i; + unsigned int i; for (i = 0, set_vecs = 0; i < affd->nr_sets; i++) set_vecs += affd->sets[i]; @@ -323,5 +321,5 @@ int irq_calc_affinity_vectors(int minvec, int maxvec, const struct irq_affinity put_online_cpus(); } - return resv + min(set_vecs, vecs); + return resv + min(set_vecs, maxvec - resv); } -- cgit v1.3-7-g2ca7 From 9cfef55bb57e7620c63087be18a76351628f8d0f Mon Sep 17 00:00:00 2001 From: Ming Lei Date: Sat, 16 Feb 2019 18:13:08 +0100 Subject: genirq/affinity: Store interrupt sets size in struct irq_affinity The interrupt affinity spreading mechanism supports to spread out affinities for one or more interrupt sets. A interrupt set contains one or more interrupts. Each set is mapped to a specific functionality of a device, e.g. general I/O queues and read I/O queus of multiqueue block devices. The number of interrupts per set is defined by the driver. It depends on the total number of available interrupts for the device, which is determined by the PCI capabilites and the availability of underlying CPU resources, and the number of queues which the device provides and the driver wants to instantiate. The driver passes initial configuration for the interrupt allocation via a pointer to struct irq_affinity. Right now the allocation mechanism is complex as it requires to have a loop in the driver to determine the maximum number of interrupts which are provided by the PCI capabilities and the underlying CPU resources. This loop would have to be replicated in every driver which wants to utilize this mechanism. That's unwanted code duplication and error prone. In order to move this into generic facilities it is required to have a mechanism, which allows the recalculation of the interrupt sets and their size, in the core code. As the core code does not have any knowledge about the underlying device, a driver specific callback will be added to struct affinity_desc, which will be invoked by the core code. The callback will get the number of available interupts as an argument, so the driver can calculate the corresponding number and size of interrupt sets. To support this, two modifications for the handling of struct irq_affinity are required: 1) The (optional) interrupt sets size information is contained in a separate array of integers and struct irq_affinity contains a pointer to it. This is cumbersome and as the maximum number of interrupt sets is small, there is no reason to have separate storage. Moving the size array into struct affinity_desc avoids indirections and makes the code simpler. 2) At the moment the struct irq_affinity pointer which is handed in from the driver and passed through to several core functions is marked 'const'. With the upcoming callback to recalculate the number and size of interrupt sets, it's necessary to remove the 'const' qualifier. Otherwise the callback would not be able to update the data. Implement #1 and store the interrupt sets size in 'struct irq_affinity'. No functional change. [ tglx: Fixed the memcpy() size so it won't copy beyond the size of the source. Fixed the kernel doc comments for struct irq_affinity and de-'This patch'-ed the changelog ] Signed-off-by: Ming Lei Signed-off-by: Thomas Gleixner Acked-by: Marc Zyngier Cc: Christoph Hellwig Cc: Bjorn Helgaas Cc: Jens Axboe Cc: linux-block@vger.kernel.org Cc: Sagi Grimberg Cc: linux-nvme@lists.infradead.org Cc: linux-pci@vger.kernel.org Cc: Keith Busch Cc: Sumit Saxena Cc: Kashyap Desai Cc: Shivasharan Srikanteshwara Link: https://lkml.kernel.org/r/20190216172228.423723127@linutronix.de --- drivers/nvme/host/pci.c | 7 +++---- include/linux/interrupt.h | 9 ++++++--- kernel/irq/affinity.c | 16 ++++++++++++---- 3 files changed, 21 insertions(+), 11 deletions(-) (limited to 'kernel') diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index 9bc585415d9b..21ffd671b6ed 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -2081,12 +2081,11 @@ static void nvme_calc_io_queues(struct nvme_dev *dev, unsigned int irq_queues) static int nvme_setup_irqs(struct nvme_dev *dev, unsigned int nr_io_queues) { struct pci_dev *pdev = to_pci_dev(dev->dev); - int irq_sets[2]; struct irq_affinity affd = { - .pre_vectors = 1, - .nr_sets = ARRAY_SIZE(irq_sets), - .sets = irq_sets, + .pre_vectors = 1, + .nr_sets = 2, }; + unsigned int *irq_sets = affd.set_size; int result = 0; unsigned int irq_queues, this_p_queues; diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h index 35e7389c2011..5afdfd5dc39b 100644 --- a/include/linux/interrupt.h +++ b/include/linux/interrupt.h @@ -241,20 +241,23 @@ struct irq_affinity_notify { void (*release)(struct kref *ref); }; +#define IRQ_AFFINITY_MAX_SETS 4 + /** * struct irq_affinity - Description for automatic irq affinity assignements * @pre_vectors: Don't apply affinity to @pre_vectors at beginning of * the MSI(-X) vector space * @post_vectors: Don't apply affinity to @post_vectors at end of * the MSI(-X) vector space - * @nr_sets: Length of passed in *sets array - * @sets: Number of affinitized sets + * @nr_sets: The number of interrupt sets for which affinity + * spreading is required + * @set_size: Array holding the size of each interrupt set */ struct irq_affinity { unsigned int pre_vectors; unsigned int post_vectors; unsigned int nr_sets; - unsigned int *sets; + unsigned int set_size[IRQ_AFFINITY_MAX_SETS]; }; /** diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c index 82e8799374e9..278289c091bb 100644 --- a/kernel/irq/affinity.c +++ b/kernel/irq/affinity.c @@ -238,9 +238,10 @@ static int irq_build_affinity_masks(const struct irq_affinity *affd, * Returns the irq_affinity_desc pointer or NULL if allocation failed. */ struct irq_affinity_desc * -irq_create_affinity_masks(unsigned int nvecs, const struct irq_affinity *affd) +irq_create_affinity_masks(unsigned int nvecs, struct irq_affinity *affd) { unsigned int affvecs, curvec, usedvecs, nr_sets, i; + unsigned int set_size[IRQ_AFFINITY_MAX_SETS]; struct irq_affinity_desc *masks = NULL; /* @@ -250,6 +251,9 @@ irq_create_affinity_masks(unsigned int nvecs, const struct irq_affinity *affd) if (nvecs == affd->pre_vectors + affd->post_vectors) return NULL; + if (WARN_ON_ONCE(affd->nr_sets > IRQ_AFFINITY_MAX_SETS)) + return NULL; + masks = kcalloc(nvecs, sizeof(*masks), GFP_KERNEL); if (!masks) return NULL; @@ -263,11 +267,15 @@ irq_create_affinity_masks(unsigned int nvecs, const struct irq_affinity *affd) */ affvecs = nvecs - affd->pre_vectors - affd->post_vectors; nr_sets = affd->nr_sets; - if (!nr_sets) + if (!nr_sets) { nr_sets = 1; + set_size[0] = affvecs; + } else { + memcpy(set_size, affd->set_size, nr_sets * sizeof(unsigned int)); + } for (i = 0, usedvecs = 0; i < nr_sets; i++) { - unsigned int this_vecs = affd->sets ? affd->sets[i] : affvecs; + unsigned int this_vecs = set_size[i]; int ret; ret = irq_build_affinity_masks(affd, curvec, this_vecs, @@ -314,7 +322,7 @@ unsigned int irq_calc_affinity_vectors(unsigned int minvec, unsigned int maxvec, unsigned int i; for (i = 0, set_vecs = 0; i < affd->nr_sets; i++) - set_vecs += affd->sets[i]; + set_vecs += affd->set_size[i]; } else { get_online_cpus(); set_vecs = cpumask_weight(cpu_possible_mask); -- cgit v1.3-7-g2ca7 From c66d4bd110a1f8a68c1a88bfbf866eb50c6464b7 Mon Sep 17 00:00:00 2001 From: Ming Lei Date: Sat, 16 Feb 2019 18:13:09 +0100 Subject: genirq/affinity: Add new callback for (re)calculating interrupt sets The interrupt affinity spreading mechanism supports to spread out affinities for one or more interrupt sets. A interrupt set contains one or more interrupts. Each set is mapped to a specific functionality of a device, e.g. general I/O queues and read I/O queus of multiqueue block devices. The number of interrupts per set is defined by the driver. It depends on the total number of available interrupts for the device, which is determined by the PCI capabilites and the availability of underlying CPU resources, and the number of queues which the device provides and the driver wants to instantiate. The driver passes initial configuration for the interrupt allocation via a pointer to struct irq_affinity. Right now the allocation mechanism is complex as it requires to have a loop in the driver to determine the maximum number of interrupts which are provided by the PCI capabilities and the underlying CPU resources. This loop would have to be replicated in every driver which wants to utilize this mechanism. That's unwanted code duplication and error prone. In order to move this into generic facilities it is required to have a mechanism, which allows the recalculation of the interrupt sets and their size, in the core code. As the core code does not have any knowledge about the underlying device, a driver specific callback is required in struct irq_affinity, which can be invoked by the core code. The callback gets the number of available interupts as an argument, so the driver can calculate the corresponding number and size of interrupt sets. At the moment the struct irq_affinity pointer which is handed in from the driver and passed through to several core functions is marked 'const', but for the callback to be able to modify the data in the struct it's required to remove the 'const' qualifier. Add the optional callback to struct irq_affinity, which allows drivers to recalculate the number and size of interrupt sets and remove the 'const' qualifier. For simple invocations, which do not supply a callback, a default callback is installed, which just sets nr_sets to 1 and transfers the number of spreadable vectors to the set_size array at index 0. This is for now guarded by a check for nr_sets != 0 to keep the NVME driver working until it is converted to the callback mechanism. To make sure that the driver configuration is correct under all circumstances the callback is invoked even when there are no interrupts for queues left, i.e. the pre/post requirements already exhaust the numner of available interrupts. At the PCI layer irq_create_affinity_masks() has to be invoked even for the case where the legacy interrupt is used. That ensures that the callback is invoked and the device driver can adjust to that situation. [ tglx: Fixed the simple case (no sets required). Moved the sanity check for nr_sets after the invocation of the callback so it catches broken drivers. Fixed the kernel doc comments for struct irq_affinity and de-'This patch'-ed the changelog ] Signed-off-by: Ming Lei Signed-off-by: Thomas Gleixner Acked-by: Marc Zyngier Cc: Christoph Hellwig Cc: Bjorn Helgaas Cc: Jens Axboe Cc: linux-block@vger.kernel.org Cc: Sagi Grimberg Cc: linux-nvme@lists.infradead.org Cc: linux-pci@vger.kernel.org Cc: Keith Busch Cc: Sumit Saxena Cc: Kashyap Desai Cc: Shivasharan Srikanteshwara Link: https://lkml.kernel.org/r/20190216172228.512444498@linutronix.de --- drivers/pci/msi.c | 25 +++++++++++------ drivers/scsi/be2iscsi/be_main.c | 2 +- include/linux/interrupt.h | 10 +++++-- include/linux/pci.h | 4 +-- kernel/irq/affinity.c | 62 +++++++++++++++++++++++++++++------------ 5 files changed, 71 insertions(+), 32 deletions(-) (limited to 'kernel') diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c index 4c0b47867258..7149d6315726 100644 --- a/drivers/pci/msi.c +++ b/drivers/pci/msi.c @@ -532,7 +532,7 @@ error_attrs: } static struct msi_desc * -msi_setup_entry(struct pci_dev *dev, int nvec, const struct irq_affinity *affd) +msi_setup_entry(struct pci_dev *dev, int nvec, struct irq_affinity *affd) { struct irq_affinity_desc *masks = NULL; struct msi_desc *entry; @@ -597,7 +597,7 @@ static int msi_verify_entries(struct pci_dev *dev) * which could have been allocated. */ static int msi_capability_init(struct pci_dev *dev, int nvec, - const struct irq_affinity *affd) + struct irq_affinity *affd) { struct msi_desc *entry; int ret; @@ -669,7 +669,7 @@ static void __iomem *msix_map_region(struct pci_dev *dev, unsigned nr_entries) static int msix_setup_entries(struct pci_dev *dev, void __iomem *base, struct msix_entry *entries, int nvec, - const struct irq_affinity *affd) + struct irq_affinity *affd) { struct irq_affinity_desc *curmsk, *masks = NULL; struct msi_desc *entry; @@ -736,7 +736,7 @@ static void msix_program_entries(struct pci_dev *dev, * requested MSI-X entries with allocated irqs or non-zero for otherwise. **/ static int msix_capability_init(struct pci_dev *dev, struct msix_entry *entries, - int nvec, const struct irq_affinity *affd) + int nvec, struct irq_affinity *affd) { int ret; u16 control; @@ -932,7 +932,7 @@ int pci_msix_vec_count(struct pci_dev *dev) EXPORT_SYMBOL(pci_msix_vec_count); static int __pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries, - int nvec, const struct irq_affinity *affd) + int nvec, struct irq_affinity *affd) { int nr_entries; int i, j; @@ -1018,7 +1018,7 @@ int pci_msi_enabled(void) EXPORT_SYMBOL(pci_msi_enabled); static int __pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec, - const struct irq_affinity *affd) + struct irq_affinity *affd) { int nvec; int rc; @@ -1086,7 +1086,7 @@ EXPORT_SYMBOL(pci_enable_msi); static int __pci_enable_msix_range(struct pci_dev *dev, struct msix_entry *entries, int minvec, - int maxvec, const struct irq_affinity *affd) + int maxvec, struct irq_affinity *affd) { int rc, nvec = maxvec; @@ -1165,9 +1165,9 @@ EXPORT_SYMBOL(pci_enable_msix_range); */ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs, unsigned int max_vecs, unsigned int flags, - const struct irq_affinity *affd) + struct irq_affinity *affd) { - static const struct irq_affinity msi_default_affd; + struct irq_affinity msi_default_affd = {0}; int msix_vecs = -ENOSPC; int msi_vecs = -ENOSPC; @@ -1196,6 +1196,13 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs, /* use legacy irq if allowed */ if (flags & PCI_IRQ_LEGACY) { if (min_vecs == 1 && dev->irq) { + /* + * Invoke the affinity spreading logic to ensure that + * the device driver can adjust queue configuration + * for the single interrupt case. + */ + if (affd) + irq_create_affinity_masks(1, affd); pci_intx(dev, 1); return 1; } diff --git a/drivers/scsi/be2iscsi/be_main.c b/drivers/scsi/be2iscsi/be_main.c index 74e260027c7d..76e49d902609 100644 --- a/drivers/scsi/be2iscsi/be_main.c +++ b/drivers/scsi/be2iscsi/be_main.c @@ -3566,7 +3566,7 @@ static void be2iscsi_enable_msix(struct beiscsi_hba *phba) /* if eqid_count == 1 fall back to INTX */ if (enable_msix && nvec > 1) { - const struct irq_affinity desc = { .post_vectors = 1 }; + struct irq_affinity desc = { .post_vectors = 1 }; if (pci_alloc_irq_vectors_affinity(phba->pcidev, 2, nvec, PCI_IRQ_MSIX | PCI_IRQ_AFFINITY, &desc) < 0) { diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h index 5afdfd5dc39b..dcdddf4fa76b 100644 --- a/include/linux/interrupt.h +++ b/include/linux/interrupt.h @@ -252,12 +252,18 @@ struct irq_affinity_notify { * @nr_sets: The number of interrupt sets for which affinity * spreading is required * @set_size: Array holding the size of each interrupt set + * @calc_sets: Callback for calculating the number and size + * of interrupt sets + * @priv: Private data for usage by @calc_sets, usually a + * pointer to driver/device specific data. */ struct irq_affinity { unsigned int pre_vectors; unsigned int post_vectors; unsigned int nr_sets; unsigned int set_size[IRQ_AFFINITY_MAX_SETS]; + void (*calc_sets)(struct irq_affinity *, unsigned int nvecs); + void *priv; }; /** @@ -317,7 +323,7 @@ extern int irq_set_affinity_notifier(unsigned int irq, struct irq_affinity_notify *notify); struct irq_affinity_desc * -irq_create_affinity_masks(unsigned int nvec, const struct irq_affinity *affd); +irq_create_affinity_masks(unsigned int nvec, struct irq_affinity *affd); unsigned int irq_calc_affinity_vectors(unsigned int minvec, unsigned int maxvec, const struct irq_affinity *affd); @@ -354,7 +360,7 @@ irq_set_affinity_notifier(unsigned int irq, struct irq_affinity_notify *notify) } static inline struct irq_affinity_desc * -irq_create_affinity_masks(unsigned int nvec, const struct irq_affinity *affd) +irq_create_affinity_masks(unsigned int nvec, struct irq_affinity *affd) { return NULL; } diff --git a/include/linux/pci.h b/include/linux/pci.h index 65f1d8c2f082..e7c51b00cdfe 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -1393,7 +1393,7 @@ static inline int pci_enable_msix_exact(struct pci_dev *dev, } int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs, unsigned int max_vecs, unsigned int flags, - const struct irq_affinity *affd); + struct irq_affinity *affd); void pci_free_irq_vectors(struct pci_dev *dev); int pci_irq_vector(struct pci_dev *dev, unsigned int nr); @@ -1419,7 +1419,7 @@ static inline int pci_enable_msix_exact(struct pci_dev *dev, static inline int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs, unsigned int max_vecs, unsigned int flags, - const struct irq_affinity *aff_desc) + struct irq_affinity *aff_desc) { if ((flags & PCI_IRQ_LEGACY) && min_vecs == 1 && dev->irq) return 1; diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c index 278289c091bb..d737dc60ab52 100644 --- a/kernel/irq/affinity.c +++ b/kernel/irq/affinity.c @@ -230,6 +230,12 @@ static int irq_build_affinity_masks(const struct irq_affinity *affd, return ret; } +static void default_calc_sets(struct irq_affinity *affd, unsigned int affvecs) +{ + affd->nr_sets = 1; + affd->set_size[0] = affvecs; +} + /** * irq_create_affinity_masks - Create affinity masks for multiqueue spreading * @nvecs: The total number of vectors @@ -240,20 +246,46 @@ static int irq_build_affinity_masks(const struct irq_affinity *affd, struct irq_affinity_desc * irq_create_affinity_masks(unsigned int nvecs, struct irq_affinity *affd) { - unsigned int affvecs, curvec, usedvecs, nr_sets, i; - unsigned int set_size[IRQ_AFFINITY_MAX_SETS]; + unsigned int affvecs, curvec, usedvecs, i; struct irq_affinity_desc *masks = NULL; /* - * If there aren't any vectors left after applying the pre/post - * vectors don't bother with assigning affinity. + * Determine the number of vectors which need interrupt affinities + * assigned. If the pre/post request exhausts the available vectors + * then nothing to do here except for invoking the calc_sets() + * callback so the device driver can adjust to the situation. If there + * is only a single vector, then managing the queue is pointless as + * well. */ - if (nvecs == affd->pre_vectors + affd->post_vectors) - return NULL; + if (nvecs > 1 && nvecs > affd->pre_vectors + affd->post_vectors) + affvecs = nvecs - affd->pre_vectors - affd->post_vectors; + else + affvecs = 0; + + /* + * Simple invocations do not provide a calc_sets() callback. Install + * the generic one. The check for affd->nr_sets is a temporary + * workaround and will be removed after the NVME driver is converted + * over. + */ + if (!affd->nr_sets && !affd->calc_sets) + affd->calc_sets = default_calc_sets; + + /* + * If the device driver provided a calc_sets() callback let it + * recalculate the number of sets and their size. The check will go + * away once the NVME driver is converted over. + */ + if (affd->calc_sets) + affd->calc_sets(affd, affvecs); if (WARN_ON_ONCE(affd->nr_sets > IRQ_AFFINITY_MAX_SETS)) return NULL; + /* Nothing to assign? */ + if (!affvecs) + return NULL; + masks = kcalloc(nvecs, sizeof(*masks), GFP_KERNEL); if (!masks) return NULL; @@ -261,21 +293,13 @@ irq_create_affinity_masks(unsigned int nvecs, struct irq_affinity *affd) /* Fill out vectors at the beginning that don't need affinity */ for (curvec = 0; curvec < affd->pre_vectors; curvec++) cpumask_copy(&masks[curvec].mask, irq_default_affinity); + /* * Spread on present CPUs starting from affd->pre_vectors. If we * have multiple sets, build each sets affinity mask separately. */ - affvecs = nvecs - affd->pre_vectors - affd->post_vectors; - nr_sets = affd->nr_sets; - if (!nr_sets) { - nr_sets = 1; - set_size[0] = affvecs; - } else { - memcpy(set_size, affd->set_size, nr_sets * sizeof(unsigned int)); - } - - for (i = 0, usedvecs = 0; i < nr_sets; i++) { - unsigned int this_vecs = set_size[i]; + for (i = 0, usedvecs = 0; i < affd->nr_sets; i++) { + unsigned int this_vecs = affd->set_size[i]; int ret; ret = irq_build_affinity_masks(affd, curvec, this_vecs, @@ -318,7 +342,9 @@ unsigned int irq_calc_affinity_vectors(unsigned int minvec, unsigned int maxvec, if (resv > minvec) return 0; - if (affd->nr_sets) { + if (affd->calc_sets) { + set_vecs = maxvec - resv; + } else if (affd->nr_sets) { unsigned int i; for (i = 0, set_vecs = 0; i < affd->nr_sets; i++) -- cgit v1.3-7-g2ca7 From a6a309edba13866b31dc4d8aebad3864a6d56ade Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Sat, 16 Feb 2019 18:13:11 +0100 Subject: genirq/affinity: Remove the leftovers of the original set support Now that the NVME driver is converted over to the calc_set() callback, the workarounds of the original set support can be removed. Signed-off-by: Thomas Gleixner Reviewed-by: Ming Lei Acked-by: Marc Zyngier Cc: Christoph Hellwig Cc: Bjorn Helgaas Cc: Jens Axboe Cc: linux-block@vger.kernel.org Cc: Sagi Grimberg Cc: linux-nvme@lists.infradead.org Cc: linux-pci@vger.kernel.org Cc: Keith Busch Cc: Sumit Saxena Cc: Kashyap Desai Cc: Shivasharan Srikanteshwara Link: https://lkml.kernel.org/r/20190216172228.689834224@linutronix.de --- kernel/irq/affinity.c | 20 ++++---------------- 1 file changed, 4 insertions(+), 16 deletions(-) (limited to 'kernel') diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c index d737dc60ab52..f18cd5aa33e8 100644 --- a/kernel/irq/affinity.c +++ b/kernel/irq/affinity.c @@ -264,20 +264,13 @@ irq_create_affinity_masks(unsigned int nvecs, struct irq_affinity *affd) /* * Simple invocations do not provide a calc_sets() callback. Install - * the generic one. The check for affd->nr_sets is a temporary - * workaround and will be removed after the NVME driver is converted - * over. + * the generic one. */ - if (!affd->nr_sets && !affd->calc_sets) + if (!affd->calc_sets) affd->calc_sets = default_calc_sets; - /* - * If the device driver provided a calc_sets() callback let it - * recalculate the number of sets and their size. The check will go - * away once the NVME driver is converted over. - */ - if (affd->calc_sets) - affd->calc_sets(affd, affvecs); + /* Recalculate the sets */ + affd->calc_sets(affd, affvecs); if (WARN_ON_ONCE(affd->nr_sets > IRQ_AFFINITY_MAX_SETS)) return NULL; @@ -344,11 +337,6 @@ unsigned int irq_calc_affinity_vectors(unsigned int minvec, unsigned int maxvec, if (affd->calc_sets) { set_vecs = maxvec - resv; - } else if (affd->nr_sets) { - unsigned int i; - - for (i = 0, set_vecs = 0; i < affd->nr_sets; i++) - set_vecs += affd->set_size[i]; } else { get_online_cpus(); set_vecs = cpumask_weight(cpu_possible_mask); -- cgit v1.3-7-g2ca7 From fbce251baa6e357441961c78796e5e9fad682675 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Wed, 13 Feb 2019 08:01:03 +0100 Subject: dma-direct: we might need GFP_DMA for 32-bit dma masks If there is no ZONE_DMA32 we might need GFP_DMA to be able to allocate memory that satisfies a 32-bit DMA mask. Reported-by: Christian Zigotzky Signed-off-by: Christoph Hellwig Tested-by: Christian Zigotzky Signed-off-by: Michael Ellerman --- kernel/dma/direct.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c index 355d16acee6d..d5bb51cf27c6 100644 --- a/kernel/dma/direct.c +++ b/kernel/dma/direct.c @@ -132,8 +132,7 @@ again: goto again; } - if (IS_ENABLED(CONFIG_ZONE_DMA) && - phys_mask < DMA_BIT_MASK(32) && !(gfp & GFP_DMA)) { + if (IS_ENABLED(CONFIG_ZONE_DMA) && !(gfp & GFP_DMA)) { gfp = (gfp & ~GFP_DMA32) | GFP_DMA; goto again; } -- cgit v1.3-7-g2ca7 From ffe3dfd4e3598651a87651f3d59f144ee31f60fb Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Wed, 13 Feb 2019 08:01:15 +0100 Subject: powerpc/dma: stop overriding dma_get_required_mask The ppc_md and pci_controller_ops methods are unused now and can be removed. The dma_nommu implementation is generic to the generic one except for using max_pfn instead of calling into the memblock API, and all other dma_map_ops instances implement a method of their own. Signed-off-by: Christoph Hellwig Tested-by: Christian Zigotzky Signed-off-by: Michael Ellerman --- arch/powerpc/include/asm/device.h | 2 -- arch/powerpc/include/asm/dma-mapping.h | 2 -- arch/powerpc/include/asm/machdep.h | 2 -- arch/powerpc/include/asm/pci-bridge.h | 1 - arch/powerpc/kernel/dma.c | 29 ----------------------------- kernel/dma/mapping.c | 2 -- 6 files changed, 38 deletions(-) (limited to 'kernel') diff --git a/arch/powerpc/include/asm/device.h b/arch/powerpc/include/asm/device.h index 1aa53318b4bc..3814e1c2d4bc 100644 --- a/arch/powerpc/include/asm/device.h +++ b/arch/powerpc/include/asm/device.h @@ -59,6 +59,4 @@ struct pdev_archdata { u64 dma_mask; }; -#define ARCH_HAS_DMA_GET_REQUIRED_MASK - #endif /* _ASM_POWERPC_DEVICE_H */ diff --git a/arch/powerpc/include/asm/dma-mapping.h b/arch/powerpc/include/asm/dma-mapping.h index 1d80174db8a4..dc7f7bcdf65d 100644 --- a/arch/powerpc/include/asm/dma-mapping.h +++ b/arch/powerpc/include/asm/dma-mapping.h @@ -112,7 +112,5 @@ static inline void set_dma_offset(struct device *dev, dma_addr_t off) #define HAVE_ARCH_DMA_SET_MASK 1 -extern u64 __dma_get_required_mask(struct device *dev); - #endif /* __KERNEL__ */ #endif /* _ASM_DMA_MAPPING_H */ diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h index 8311869005fa..7b70dcbce1b9 100644 --- a/arch/powerpc/include/asm/machdep.h +++ b/arch/powerpc/include/asm/machdep.h @@ -47,9 +47,7 @@ struct machdep_calls { #endif #endif /* CONFIG_PPC64 */ - /* Platform set_dma_mask and dma_get_required_mask overrides */ int (*dma_set_mask)(struct device *dev, u64 dma_mask); - u64 (*dma_get_required_mask)(struct device *dev); int (*probe)(void); void (*setup_arch)(void); /* Optional, may be NULL */ diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h index d7492dca6599..236a7460b6ec 100644 --- a/arch/powerpc/include/asm/pci-bridge.h +++ b/arch/powerpc/include/asm/pci-bridge.h @@ -46,7 +46,6 @@ struct pci_controller_ops { #endif int (*dma_set_mask)(struct pci_dev *pdev, u64 dma_mask); - u64 (*dma_get_required_mask)(struct pci_dev *pdev); void (*shutdown)(struct pci_controller *hose); }; diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c index e5db4d3f8bea..0d52107b90f0 100644 --- a/arch/powerpc/kernel/dma.c +++ b/arch/powerpc/kernel/dma.c @@ -318,35 +318,6 @@ int dma_set_mask(struct device *dev, u64 dma_mask) } EXPORT_SYMBOL(dma_set_mask); -u64 __dma_get_required_mask(struct device *dev) -{ - const struct dma_map_ops *dma_ops = get_dma_ops(dev); - - if (unlikely(dma_ops == NULL)) - return 0; - - if (dma_ops->get_required_mask) - return dma_ops->get_required_mask(dev); - - return DMA_BIT_MASK(8 * sizeof(dma_addr_t)); -} - -u64 dma_get_required_mask(struct device *dev) -{ - if (ppc_md.dma_get_required_mask) - return ppc_md.dma_get_required_mask(dev); - - if (dev_is_pci(dev)) { - struct pci_dev *pdev = to_pci_dev(dev); - struct pci_controller *phb = pci_bus_to_host(pdev->bus); - if (phb->controller_ops.dma_get_required_mask) - return phb->controller_ops.dma_get_required_mask(pdev); - } - - return __dma_get_required_mask(dev); -} -EXPORT_SYMBOL_GPL(dma_get_required_mask); - static int __init dma_init(void) { #ifdef CONFIG_IBMVIO diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c index a11006b6d8e8..40c0af744692 100644 --- a/kernel/dma/mapping.c +++ b/kernel/dma/mapping.c @@ -207,7 +207,6 @@ int dma_mmap_attrs(struct device *dev, struct vm_area_struct *vma, } EXPORT_SYMBOL(dma_mmap_attrs); -#ifndef ARCH_HAS_DMA_GET_REQUIRED_MASK static u64 dma_default_get_required_mask(struct device *dev) { u32 low_totalram = ((max_pfn - 1) << PAGE_SHIFT); @@ -238,7 +237,6 @@ u64 dma_get_required_mask(struct device *dev) return dma_default_get_required_mask(dev); } EXPORT_SYMBOL_GPL(dma_get_required_mask); -#endif #ifndef arch_dma_alloc_attrs #define arch_dma_alloc_attrs(dev) (true) -- cgit v1.3-7-g2ca7 From 11ddce15451eb5e3cb2c951dc5c8d86a2802017a Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Wed, 13 Feb 2019 08:01:22 +0100 Subject: dma-mapping, powerpc: simplify the arch dma_set_mask override Instead of letting the architecture supply all of dma_set_mask just give it an additional hook selected by Kconfig. Signed-off-by: Christoph Hellwig Tested-by: Christian Zigotzky Signed-off-by: Michael Ellerman --- arch/powerpc/Kconfig | 1 + arch/powerpc/include/asm/dma-mapping.h | 2 -- arch/powerpc/include/asm/machdep.h | 2 +- arch/powerpc/kernel/Makefile | 1 + arch/powerpc/kernel/dma-mask.c | 12 ++++++++++++ arch/powerpc/kernel/dma.c | 12 ------------ arch/powerpc/sysdev/fsl_pci.c | 8 +------- kernel/dma/Kconfig | 3 +++ kernel/dma/mapping.c | 9 +++++++-- 9 files changed, 26 insertions(+), 24 deletions(-) create mode 100644 arch/powerpc/kernel/dma-mask.c (limited to 'kernel') diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index b238c63a75cc..39d07c02f7d8 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -886,6 +886,7 @@ config FSL_SOC config FSL_PCI bool + select ARCH_HAS_DMA_SET_MASK select PPC_INDIRECT_PCI select PCI_QUIRKS diff --git a/arch/powerpc/include/asm/dma-mapping.h b/arch/powerpc/include/asm/dma-mapping.h index dc7f7bcdf65d..16d45518d9bb 100644 --- a/arch/powerpc/include/asm/dma-mapping.h +++ b/arch/powerpc/include/asm/dma-mapping.h @@ -110,7 +110,5 @@ static inline void set_dma_offset(struct device *dev, dma_addr_t off) dev->archdata.dma_offset = off; } -#define HAVE_ARCH_DMA_SET_MASK 1 - #endif /* __KERNEL__ */ #endif /* _ASM_DMA_MAPPING_H */ diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h index 7b70dcbce1b9..2f0ca6560e47 100644 --- a/arch/powerpc/include/asm/machdep.h +++ b/arch/powerpc/include/asm/machdep.h @@ -47,7 +47,7 @@ struct machdep_calls { #endif #endif /* CONFIG_PPC64 */ - int (*dma_set_mask)(struct device *dev, u64 dma_mask); + void (*dma_set_mask)(struct device *dev, u64 dma_mask); int (*probe)(void); void (*setup_arch)(void); /* Optional, may be NULL */ diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile index cb7f0bb9ee71..9bb12cd642ef 100644 --- a/arch/powerpc/kernel/Makefile +++ b/arch/powerpc/kernel/Makefile @@ -105,6 +105,7 @@ obj-$(CONFIG_UPROBES) += uprobes.o obj-$(CONFIG_PPC_UDBG_16550) += legacy_serial.o udbg_16550.o obj-$(CONFIG_STACKTRACE) += stacktrace.o obj-$(CONFIG_SWIOTLB) += dma-swiotlb.o +obj-$(CONFIG_ARCH_HAS_DMA_SET_MASK) += dma-mask.o pci64-$(CONFIG_PPC64) += pci_dn.o pci-hotplug.o isa-bridge.o obj-$(CONFIG_PCI) += pci_$(BITS).o $(pci64-y) \ diff --git a/arch/powerpc/kernel/dma-mask.c b/arch/powerpc/kernel/dma-mask.c new file mode 100644 index 000000000000..ffbbbc432612 --- /dev/null +++ b/arch/powerpc/kernel/dma-mask.c @@ -0,0 +1,12 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include +#include +#include + +void arch_dma_set_mask(struct device *dev, u64 dma_mask) +{ + if (ppc_md.dma_set_mask) + ppc_md.dma_set_mask(dev, dma_mask); +} +EXPORT_SYMBOL(arch_dma_set_mask); diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c index 1e191eb3f0ec..e422ca65d1cf 100644 --- a/arch/powerpc/kernel/dma.c +++ b/arch/powerpc/kernel/dma.c @@ -234,18 +234,6 @@ const struct dma_map_ops dma_nommu_ops = { }; EXPORT_SYMBOL(dma_nommu_ops); -int dma_set_mask(struct device *dev, u64 dma_mask) -{ - if (ppc_md.dma_set_mask) - return ppc_md.dma_set_mask(dev, dma_mask); - - if (!dev->dma_mask || !dma_supported(dev, dma_mask)) - return -EIO; - *dev->dma_mask = dma_mask; - return 0; -} -EXPORT_SYMBOL(dma_set_mask); - static int __init dma_init(void) { #ifdef CONFIG_IBMVIO diff --git a/arch/powerpc/sysdev/fsl_pci.c b/arch/powerpc/sysdev/fsl_pci.c index b710cee023a2..0c6510f340cb 100644 --- a/arch/powerpc/sysdev/fsl_pci.c +++ b/arch/powerpc/sysdev/fsl_pci.c @@ -133,11 +133,8 @@ static void setup_swiotlb_ops(struct pci_controller *hose) static inline void setup_swiotlb_ops(struct pci_controller *hose) {} #endif -static int fsl_pci_dma_set_mask(struct device *dev, u64 dma_mask) +static void fsl_pci_dma_set_mask(struct device *dev, u64 dma_mask) { - if (!dev->dma_mask || !dma_supported(dev, dma_mask)) - return -EIO; - /* * Fix up PCI devices that are able to DMA to the large inbound * mapping that allows addressing any RAM address from across PCI. @@ -147,9 +144,6 @@ static int fsl_pci_dma_set_mask(struct device *dev, u64 dma_mask) set_dma_ops(dev, &dma_nommu_ops); set_dma_offset(dev, pci64_dma_offset); } - - *dev->dma_mask = dma_mask; - return 0; } static int setup_one_atmu(struct ccsr_pci __iomem *pci, diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig index ca88b867e7fe..0711d18645de 100644 --- a/kernel/dma/Kconfig +++ b/kernel/dma/Kconfig @@ -16,6 +16,9 @@ config ARCH_DMA_ADDR_T_64BIT config ARCH_HAS_DMA_COHERENCE_H bool +config ARCH_HAS_DMA_SET_MASK + bool + config HAVE_GENERIC_DMA_COHERENT bool diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c index 40c0af744692..ef2aba503467 100644 --- a/kernel/dma/mapping.c +++ b/kernel/dma/mapping.c @@ -316,18 +316,23 @@ int dma_supported(struct device *dev, u64 mask) } EXPORT_SYMBOL(dma_supported); -#ifndef HAVE_ARCH_DMA_SET_MASK +#ifdef CONFIG_ARCH_HAS_DMA_SET_MASK +void arch_dma_set_mask(struct device *dev, u64 mask); +#else +#define arch_dma_set_mask(dev, mask) do { } while (0) +#endif + int dma_set_mask(struct device *dev, u64 mask) { if (!dev->dma_mask || !dma_supported(dev, mask)) return -EIO; + arch_dma_set_mask(dev, mask); dma_check_mask(dev, mask); *dev->dma_mask = mask; return 0; } EXPORT_SYMBOL(dma_set_mask); -#endif #ifndef CONFIG_ARCH_HAS_DMA_SET_COHERENT_MASK int dma_set_coherent_mask(struct device *dev, u64 mask) -- cgit v1.3-7-g2ca7 From feee96440c9c5fdf47f8c8079c104fc8082924a0 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Wed, 13 Feb 2019 08:01:27 +0100 Subject: swiotlb: remove swiotlb_dma_supported The only user left is powerpc, but even there the generic dma-direct version works just as well, given that we guarantee that the swiotlb buffer must always be addressable. Signed-off-by: Christoph Hellwig Tested-by: Christian Zigotzky Signed-off-by: Michael Ellerman --- arch/powerpc/kernel/dma-swiotlb.c | 2 +- include/linux/swiotlb.h | 3 --- kernel/dma/swiotlb.c | 12 ------------ 3 files changed, 1 insertion(+), 16 deletions(-) (limited to 'kernel') diff --git a/arch/powerpc/kernel/dma-swiotlb.c b/arch/powerpc/kernel/dma-swiotlb.c index d5950a0cb758..6d2677b2daa6 100644 --- a/arch/powerpc/kernel/dma-swiotlb.c +++ b/arch/powerpc/kernel/dma-swiotlb.c @@ -36,7 +36,7 @@ const struct dma_map_ops powerpc_swiotlb_dma_ops = { .free = __dma_nommu_free_coherent, .map_sg = dma_direct_map_sg, .unmap_sg = dma_direct_unmap_sg, - .dma_supported = swiotlb_dma_supported, + .dma_supported = dma_direct_supported, .map_page = dma_direct_map_page, .unmap_page = dma_direct_unmap_page, .sync_single_for_cpu = dma_direct_sync_single_for_cpu, diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h index 7c007ed7505f..54254388899e 100644 --- a/include/linux/swiotlb.h +++ b/include/linux/swiotlb.h @@ -60,9 +60,6 @@ extern void swiotlb_tbl_sync_single(struct device *hwdev, size_t size, enum dma_data_direction dir, enum dma_sync_target target); -extern int -swiotlb_dma_supported(struct device *hwdev, u64 mask); - #ifdef CONFIG_SWIOTLB extern enum swiotlb_force swiotlb_force; extern phys_addr_t io_tlb_start, io_tlb_end; diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c index d6361776dc5c..cbf3498a46f9 100644 --- a/kernel/dma/swiotlb.c +++ b/kernel/dma/swiotlb.c @@ -648,15 +648,3 @@ bool swiotlb_map(struct device *dev, phys_addr_t *phys, dma_addr_t *dma_addr, return true; } - -/* - * Return whether the given device DMA address mask can be supported - * properly. For example, if your device can only drive the low 24-bits - * during bus mastering, then you would pass 0x00ffffff as the mask to - * this function. - */ -int -swiotlb_dma_supported(struct device *hwdev, u64 mask) -{ - return __phys_to_dma(hwdev, io_tlb_end - 1) <= mask; -} -- cgit v1.3-7-g2ca7 From 6a613d24effcb875271b8a1c510172e2d6eaaee8 Mon Sep 17 00:00:00 2001 From: Masahiro Yamada Date: Mon, 18 Feb 2019 15:28:11 +0900 Subject: cpuset: remove unused task_has_mempolicy() This is a remnant of commit 5f155f27cb7f ("mm, cpuset: always use seqlock when changing task's nodemask"). Signed-off-by: Masahiro Yamada Signed-off-by: Tejun Heo --- kernel/cgroup/cpuset.c | 13 ------------- 1 file changed, 13 deletions(-) (limited to 'kernel') diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index 479743db6c37..72afd55f70c6 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -203,19 +203,6 @@ static inline struct cpuset *parent_cs(struct cpuset *cs) return css_cs(cs->css.parent); } -#ifdef CONFIG_NUMA -static inline bool task_has_mempolicy(struct task_struct *task) -{ - return task->mempolicy; -} -#else -static inline bool task_has_mempolicy(struct task_struct *task) -{ - return false; -} -#endif - - /* bits in struct cpuset flags field */ typedef enum { CS_ONLINE, -- cgit v1.3-7-g2ca7 From 8d91ecc84d1b177d9bcce6ce8e82c64347ceac9d Mon Sep 17 00:00:00 2001 From: Bartosz Golaszewski Date: Mon, 18 Feb 2019 17:32:03 +0100 Subject: irq/irq_sim: add irq_set_type() callback Implement the irq_set_type() callback and call irqd_set_trigger_type() internally so that users interested in the configured trigger type can later retrieve it using irqd_get_trigger_type(). We only support edge trigger types. Signed-off-by: Bartosz Golaszewski Acked-by: Marc Zyngier --- kernel/irq/irq_sim.c | 12 ++++++++++++ 1 file changed, 12 insertions(+) (limited to 'kernel') diff --git a/kernel/irq/irq_sim.c b/kernel/irq/irq_sim.c index 98a20e1594ce..b992f88c5613 100644 --- a/kernel/irq/irq_sim.c +++ b/kernel/irq/irq_sim.c @@ -25,10 +25,22 @@ static void irq_sim_irqunmask(struct irq_data *data) irq_ctx->enabled = true; } +static int irq_sim_set_type(struct irq_data *data, unsigned int type) +{ + /* We only support rising and falling edge trigger types. */ + if (type & ~IRQ_TYPE_EDGE_BOTH) + return -EINVAL; + + irqd_set_trigger_type(data, type); + + return 0; +} + static struct irq_chip irq_sim_irqchip = { .name = "irq_sim", .irq_mask = irq_sim_irqmask, .irq_unmask = irq_sim_irqunmask, + .irq_set_type = irq_sim_set_type, }; static void irq_sim_handle_irq(struct irq_work *work) -- cgit v1.3-7-g2ca7 From 568f196756ad9fe2b49c46bbf6a9de1b190438b4 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Mon, 28 Jan 2019 17:21:52 -0800 Subject: bpf: check that BPF programs run with preemption disabled Introduce cant_sleep() macro for annotation of functions that cannot sleep. Use it in BPF_PROG_RUN to catch execution of BPF programs in preemptable context. Suggested-by: Jann Horn Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Alexei Starovoitov Acked-by: Song Liu Signed-off-by: Daniel Borkmann --- include/linux/filter.h | 2 +- include/linux/kernel.h | 14 ++++++++++++-- kernel/sched/core.c | 28 ++++++++++++++++++++++++++++ 3 files changed, 41 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/include/linux/filter.h b/include/linux/filter.h index 95e2d7ebdf21..f32b3eca5a04 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -533,7 +533,7 @@ struct sk_filter { struct bpf_prog *prog; }; -#define BPF_PROG_RUN(filter, ctx) (*(filter)->bpf_func)(ctx, (filter)->insnsi) +#define BPF_PROG_RUN(filter, ctx) ({ cant_sleep(); (*(filter)->bpf_func)(ctx, (filter)->insnsi); }) #define BPF_SKB_CB_LEN QDISC_CB_PRIV_LEN diff --git a/include/linux/kernel.h b/include/linux/kernel.h index 8f0e68e250a7..a8868a32098c 100644 --- a/include/linux/kernel.h +++ b/include/linux/kernel.h @@ -245,8 +245,10 @@ extern int _cond_resched(void); #endif #ifdef CONFIG_DEBUG_ATOMIC_SLEEP - void ___might_sleep(const char *file, int line, int preempt_offset); - void __might_sleep(const char *file, int line, int preempt_offset); +extern void ___might_sleep(const char *file, int line, int preempt_offset); +extern void __might_sleep(const char *file, int line, int preempt_offset); +extern void __cant_sleep(const char *file, int line, int preempt_offset); + /** * might_sleep - annotation for functions that can sleep * @@ -259,6 +261,13 @@ extern int _cond_resched(void); */ # define might_sleep() \ do { __might_sleep(__FILE__, __LINE__, 0); might_resched(); } while (0) +/** + * cant_sleep - annotation for functions that cannot sleep + * + * this macro will print a stack trace if it is executed with preemption enabled + */ +# define cant_sleep() \ + do { __cant_sleep(__FILE__, __LINE__, 0); } while (0) # define sched_annotate_sleep() (current->task_state_change = 0) #else static inline void ___might_sleep(const char *file, int line, @@ -266,6 +275,7 @@ extern int _cond_resched(void); static inline void __might_sleep(const char *file, int line, int preempt_offset) { } # define might_sleep() do { might_resched(); } while (0) +# define cant_sleep() do { } while (0) # define sched_annotate_sleep() do { } while (0) #endif diff --git a/kernel/sched/core.c b/kernel/sched/core.c index d8d76a65cfdd..7cbb5658be80 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6162,6 +6162,34 @@ void ___might_sleep(const char *file, int line, int preempt_offset) add_taint(TAINT_WARN, LOCKDEP_STILL_OK); } EXPORT_SYMBOL(___might_sleep); + +void __cant_sleep(const char *file, int line, int preempt_offset) +{ + static unsigned long prev_jiffy; + + if (irqs_disabled()) + return; + + if (!IS_ENABLED(CONFIG_PREEMPT_COUNT)) + return; + + if (preempt_count() > preempt_offset) + return; + + if (time_before(jiffies, prev_jiffy + HZ) && prev_jiffy) + return; + prev_jiffy = jiffies; + + printk(KERN_ERR "BUG: assuming atomic context at %s:%d\n", file, line); + printk(KERN_ERR "in_atomic(): %d, irqs_disabled(): %d, pid: %d, name: %s\n", + in_atomic(), irqs_disabled(), + current->pid, current->comm); + + debug_show_held_locks(current); + dump_stack(); + add_taint(TAINT_WARN, LOCKDEP_STILL_OK); +} +EXPORT_SYMBOL_GPL(__cant_sleep); #endif #ifdef CONFIG_MAGIC_SYSRQ -- cgit v1.3-7-g2ca7 From ff4c25f26a71b79c70ea03b3935a1297439a8a85 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Sun, 3 Feb 2019 20:12:02 +0100 Subject: dma-mapping: improve selection of dma_declare_coherent availability This API is primarily used through DT entries, but two architectures and two drivers call it directly. So instead of selecting the config symbol for random architectures pull it in implicitly for the actual users. Also rename the Kconfig option to describe the feature better. Signed-off-by: Christoph Hellwig Acked-by: Paul Burton # MIPS Acked-by: Lee Jones Reviewed-by: Greg Kroah-Hartman --- arch/arc/Kconfig | 1 - arch/arm/Kconfig | 2 +- arch/arm64/Kconfig | 1 - arch/csky/Kconfig | 1 - arch/mips/Kconfig | 1 - arch/riscv/Kconfig | 1 - arch/sh/Kconfig | 2 +- arch/unicore32/Kconfig | 1 - arch/x86/Kconfig | 1 - drivers/mfd/Kconfig | 2 ++ drivers/of/Kconfig | 3 ++- include/linux/device.h | 2 +- include/linux/dma-mapping.h | 8 ++++---- kernel/dma/Kconfig | 2 +- kernel/dma/Makefile | 2 +- 15 files changed, 13 insertions(+), 17 deletions(-) (limited to 'kernel') diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig index ab8d6131c954..728a0f6f838c 100644 --- a/arch/arc/Kconfig +++ b/arch/arc/Kconfig @@ -31,7 +31,6 @@ config ARC select HAVE_ARCH_TRACEHOOK select HAVE_DEBUG_STACKOVERFLOW select HAVE_FUTEX_CMPXCHG if FUTEX - select HAVE_GENERIC_DMA_COHERENT select HAVE_IOREMAP_PROT select HAVE_KERNEL_GZIP select HAVE_KERNEL_LZMA diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index e07e5c184d2f..33612e6da19a 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -32,6 +32,7 @@ config ARM select CLONE_BACKWARDS select CPU_PM if SUSPEND || CPU_IDLE select DCACHE_WORD_ACCESS if HAVE_EFFICIENT_UNALIGNED_ACCESS + select DMA_DECLARE_COHERENT select DMA_REMAP if MMU select EDAC_SUPPORT select EDAC_ATOMIC_SCRUB @@ -74,7 +75,6 @@ config ARM select HAVE_FUNCTION_GRAPH_TRACER if !THUMB2_KERNEL select HAVE_FUNCTION_TRACER if !XIP_KERNEL select HAVE_GCC_PLUGINS - select HAVE_GENERIC_DMA_COHERENT select HAVE_HW_BREAKPOINT if PERF_EVENTS && (CPU_V6 || CPU_V6K || CPU_V7) select HAVE_IDE if PCI || ISA || PCMCIA select HAVE_IRQ_TIME_ACCOUNTING diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index fbcf521e1c9f..e86fac1e6b03 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -139,7 +139,6 @@ config ARM64 select HAVE_FUNCTION_TRACER select HAVE_FUNCTION_GRAPH_TRACER select HAVE_GCC_PLUGINS - select HAVE_GENERIC_DMA_COHERENT select HAVE_HW_BREAKPOINT if PERF_EVENTS select HAVE_IRQ_TIME_ACCOUNTING select HAVE_MEMBLOCK_NODE_MAP if NUMA diff --git a/arch/csky/Kconfig b/arch/csky/Kconfig index 0a9595afe9be..c009a8c63946 100644 --- a/arch/csky/Kconfig +++ b/arch/csky/Kconfig @@ -30,7 +30,6 @@ config CSKY select HAVE_ARCH_TRACEHOOK select HAVE_FUNCTION_TRACER select HAVE_FUNCTION_GRAPH_TRACER - select HAVE_GENERIC_DMA_COHERENT select HAVE_KERNEL_GZIP select HAVE_KERNEL_LZO select HAVE_KERNEL_LZMA diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig index dc5d70f674e0..433b9dd35824 100644 --- a/arch/mips/Kconfig +++ b/arch/mips/Kconfig @@ -56,7 +56,6 @@ config MIPS select HAVE_FTRACE_MCOUNT_RECORD select HAVE_FUNCTION_GRAPH_TRACER select HAVE_FUNCTION_TRACER - select HAVE_GENERIC_DMA_COHERENT select HAVE_IDE select HAVE_IOREMAP_PROT select HAVE_IRQ_EXIT_ON_IRQ_STACK diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig index feeeaa60697c..51b9c97751bf 100644 --- a/arch/riscv/Kconfig +++ b/arch/riscv/Kconfig @@ -32,7 +32,6 @@ config RISCV select HAVE_MEMBLOCK_NODE_MAP select HAVE_DMA_CONTIGUOUS select HAVE_FUTEX_CMPXCHG if FUTEX - select HAVE_GENERIC_DMA_COHERENT select HAVE_PERF_EVENTS select HAVE_SYSCALL_TRACEPOINTS select IRQ_DOMAIN diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig index a9c36f95744a..a3d2a24e75c7 100644 --- a/arch/sh/Kconfig +++ b/arch/sh/Kconfig @@ -7,11 +7,11 @@ config SUPERH select ARCH_NO_COHERENT_DMA_MMAP if !MMU select HAVE_PATA_PLATFORM select CLKDEV_LOOKUP + select DMA_DECLARE_COHERENT select HAVE_IDE if HAS_IOPORT_MAP select HAVE_MEMBLOCK_NODE_MAP select ARCH_DISCARD_MEMBLOCK select HAVE_OPROFILE - select HAVE_GENERIC_DMA_COHERENT select HAVE_ARCH_TRACEHOOK select HAVE_PERF_EVENTS select HAVE_DEBUG_BUGVERBOSE diff --git a/arch/unicore32/Kconfig b/arch/unicore32/Kconfig index c3a41bfe161b..6d2891d37e32 100644 --- a/arch/unicore32/Kconfig +++ b/arch/unicore32/Kconfig @@ -4,7 +4,6 @@ config UNICORE32 select ARCH_HAS_DEVMEM_IS_ALLOWED select ARCH_MIGHT_HAVE_PC_PARPORT select ARCH_MIGHT_HAVE_PC_SERIO - select HAVE_GENERIC_DMA_COHERENT select HAVE_KERNEL_GZIP select HAVE_KERNEL_BZIP2 select GENERIC_ATOMIC64 diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 26387c7bf305..0e33dede053e 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -15,7 +15,6 @@ config X86_32 select CLKSRC_I8253 select CLONE_BACKWARDS select HAVE_AOUT - select HAVE_GENERIC_DMA_COHERENT select MODULES_USE_ELF_REL select OLD_SIGACTION diff --git a/drivers/mfd/Kconfig b/drivers/mfd/Kconfig index f15f6489803d..c3ccf2c7b3ef 100644 --- a/drivers/mfd/Kconfig +++ b/drivers/mfd/Kconfig @@ -1067,6 +1067,7 @@ config MFD_SI476X_CORE config MFD_SM501 tristate "Silicon Motion SM501" depends on HAS_DMA + select DMA_DECLARE_COHERENT ---help--- This is the core driver for the Silicon Motion SM501 multimedia companion chip. This device is a multifunction device which may @@ -1675,6 +1676,7 @@ config MFD_TC6393XB select GPIOLIB select MFD_CORE select MFD_TMIO + select DMA_DECLARE_COHERENT help Support for Toshiba Mobile IO Controller TC6393XB diff --git a/drivers/of/Kconfig b/drivers/of/Kconfig index 3607fd2810e4..37c2ccbefecd 100644 --- a/drivers/of/Kconfig +++ b/drivers/of/Kconfig @@ -43,6 +43,7 @@ config OF_FLATTREE config OF_EARLY_FLATTREE bool + select DMA_DECLARE_COHERENT if HAS_DMA select OF_FLATTREE config OF_PROMTREE @@ -83,7 +84,7 @@ config OF_MDIO config OF_RESERVED_MEM bool depends on OF_EARLY_FLATTREE - default y if HAVE_GENERIC_DMA_COHERENT || DMA_CMA + default y if DMA_DECLARE_COHERENT || DMA_CMA config OF_RESOLVE bool diff --git a/include/linux/device.h b/include/linux/device.h index be544400acdd..c52d90348cef 100644 --- a/include/linux/device.h +++ b/include/linux/device.h @@ -1017,7 +1017,7 @@ struct device { struct list_head dma_pools; /* dma pools (if dma'ble) */ -#ifdef CONFIG_HAVE_GENERIC_DMA_COHERENT +#ifdef CONFIG_DMA_DECLARE_COHERENT struct dma_coherent_mem *dma_mem; /* internal for coherent mem override */ #endif diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h index 4210c5c1dd21..e29441b8b3b7 100644 --- a/include/linux/dma-mapping.h +++ b/include/linux/dma-mapping.h @@ -153,7 +153,7 @@ static inline int is_device_dma_capable(struct device *dev) return dev->dma_mask != NULL && *dev->dma_mask != DMA_MASK_NONE; } -#ifdef CONFIG_HAVE_GENERIC_DMA_COHERENT +#ifdef CONFIG_DMA_DECLARE_COHERENT /* * These three functions are only for dma allocator. * Don't use them in device drivers. @@ -192,7 +192,7 @@ static inline int dma_mmap_from_global_coherent(struct vm_area_struct *vma, { return 0; } -#endif /* CONFIG_HAVE_GENERIC_DMA_COHERENT */ +#endif /* CONFIG_DMA_DECLARE_COHERENT */ static inline bool dma_is_direct(const struct dma_map_ops *ops) { @@ -739,7 +739,7 @@ static inline int dma_get_cache_alignment(void) /* flags for the coherent memory api */ #define DMA_MEMORY_EXCLUSIVE 0x01 -#ifdef CONFIG_HAVE_GENERIC_DMA_COHERENT +#ifdef CONFIG_DMA_DECLARE_COHERENT int dma_declare_coherent_memory(struct device *dev, phys_addr_t phys_addr, dma_addr_t device_addr, size_t size, int flags); void dma_release_declared_memory(struct device *dev); @@ -764,7 +764,7 @@ dma_mark_declared_memory_occupied(struct device *dev, { return ERR_PTR(-EBUSY); } -#endif /* CONFIG_HAVE_GENERIC_DMA_COHERENT */ +#endif /* CONFIG_DMA_DECLARE_COHERENT */ static inline void *dmam_alloc_coherent(struct device *dev, size_t size, dma_addr_t *dma_handle, gfp_t gfp) diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig index bde9179c6ed7..24d45c78c671 100644 --- a/kernel/dma/Kconfig +++ b/kernel/dma/Kconfig @@ -16,7 +16,7 @@ config ARCH_DMA_ADDR_T_64BIT config ARCH_HAS_DMA_COHERENCE_H bool -config HAVE_GENERIC_DMA_COHERENT +config DMA_DECLARE_COHERENT bool config ARCH_HAS_SETUP_DMA_OPS diff --git a/kernel/dma/Makefile b/kernel/dma/Makefile index 72ff6e46aa86..d237cf3dc181 100644 --- a/kernel/dma/Makefile +++ b/kernel/dma/Makefile @@ -2,7 +2,7 @@ obj-$(CONFIG_HAS_DMA) += mapping.o direct.o dummy.o obj-$(CONFIG_DMA_CMA) += contiguous.o -obj-$(CONFIG_HAVE_GENERIC_DMA_COHERENT) += coherent.o +obj-$(CONFIG_DMA_DECLARE_COHERENT) += coherent.o obj-$(CONFIG_DMA_VIRT_OPS) += virt.o obj-$(CONFIG_DMA_API_DEBUG) += debug.o obj-$(CONFIG_SWIOTLB) += swiotlb.o -- cgit v1.3-7-g2ca7 From ddb26d8e1e97af2385cdcabc51e73bf706a6437c Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Wed, 13 Feb 2019 19:19:08 +0100 Subject: dma-mapping: move CONFIG_DMA_CMA to kernel/dma/Kconfig This is where all the related code already lives. Signed-off-by: Christoph Hellwig Reviewed-by: Greg Kroah-Hartman --- drivers/base/Kconfig | 77 ---------------------------------------------------- kernel/dma/Kconfig | 77 ++++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 77 insertions(+), 77 deletions(-) (limited to 'kernel') diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig index 3e63a900b330..059700ea3521 100644 --- a/drivers/base/Kconfig +++ b/drivers/base/Kconfig @@ -191,83 +191,6 @@ config DMA_FENCE_TRACE lockup related problems for dma-buffers shared across multiple devices. -config DMA_CMA - bool "DMA Contiguous Memory Allocator" - depends on HAVE_DMA_CONTIGUOUS && CMA - help - This enables the Contiguous Memory Allocator which allows drivers - to allocate big physically-contiguous blocks of memory for use with - hardware components that do not support I/O map nor scatter-gather. - - You can disable CMA by specifying "cma=0" on the kernel's command - line. - - For more information see . - If unsure, say "n". - -if DMA_CMA -comment "Default contiguous memory area size:" - -config CMA_SIZE_MBYTES - int "Size in Mega Bytes" - depends on !CMA_SIZE_SEL_PERCENTAGE - default 0 if X86 - default 16 - help - Defines the size (in MiB) of the default memory area for Contiguous - Memory Allocator. If the size of 0 is selected, CMA is disabled by - default, but it can be enabled by passing cma=size[MG] to the kernel. - - -config CMA_SIZE_PERCENTAGE - int "Percentage of total memory" - depends on !CMA_SIZE_SEL_MBYTES - default 0 if X86 - default 10 - help - Defines the size of the default memory area for Contiguous Memory - Allocator as a percentage of the total memory in the system. - If 0 percent is selected, CMA is disabled by default, but it can be - enabled by passing cma=size[MG] to the kernel. - -choice - prompt "Selected region size" - default CMA_SIZE_SEL_MBYTES - -config CMA_SIZE_SEL_MBYTES - bool "Use mega bytes value only" - -config CMA_SIZE_SEL_PERCENTAGE - bool "Use percentage value only" - -config CMA_SIZE_SEL_MIN - bool "Use lower value (minimum)" - -config CMA_SIZE_SEL_MAX - bool "Use higher value (maximum)" - -endchoice - -config CMA_ALIGNMENT - int "Maximum PAGE_SIZE order of alignment for contiguous buffers" - range 4 12 - default 8 - help - DMA mapping framework by default aligns all buffers to the smallest - PAGE_SIZE order which is greater than or equal to the requested buffer - size. This works well for buffers up to a few hundreds kilobytes, but - for larger buffers it just a memory waste. With this parameter you can - specify the maximum PAGE_SIZE order for contiguous buffers. Larger - buffers will be aligned only to this specified order. The order is - expressed as a power of two multiplied by the PAGE_SIZE. - - For example, if your system defaults to 4KiB pages, the order value - of 8 means that the buffers will be aligned up to 1MiB only. - - If unsure, leave the default value "8". - -endif - config GENERIC_ARCH_TOPOLOGY bool help diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig index 24d45c78c671..212aac7c675f 100644 --- a/kernel/dma/Kconfig +++ b/kernel/dma/Kconfig @@ -60,6 +60,83 @@ config DMA_DIRECT_REMAP bool select DMA_REMAP +config DMA_CMA + bool "DMA Contiguous Memory Allocator" + depends on HAVE_DMA_CONTIGUOUS && CMA + help + This enables the Contiguous Memory Allocator which allows drivers + to allocate big physically-contiguous blocks of memory for use with + hardware components that do not support I/O map nor scatter-gather. + + You can disable CMA by specifying "cma=0" on the kernel's command + line. + + For more information see . + If unsure, say "n". + +if DMA_CMA +comment "Default contiguous memory area size:" + +config CMA_SIZE_MBYTES + int "Size in Mega Bytes" + depends on !CMA_SIZE_SEL_PERCENTAGE + default 0 if X86 + default 16 + help + Defines the size (in MiB) of the default memory area for Contiguous + Memory Allocator. If the size of 0 is selected, CMA is disabled by + default, but it can be enabled by passing cma=size[MG] to the kernel. + + +config CMA_SIZE_PERCENTAGE + int "Percentage of total memory" + depends on !CMA_SIZE_SEL_MBYTES + default 0 if X86 + default 10 + help + Defines the size of the default memory area for Contiguous Memory + Allocator as a percentage of the total memory in the system. + If 0 percent is selected, CMA is disabled by default, but it can be + enabled by passing cma=size[MG] to the kernel. + +choice + prompt "Selected region size" + default CMA_SIZE_SEL_MBYTES + +config CMA_SIZE_SEL_MBYTES + bool "Use mega bytes value only" + +config CMA_SIZE_SEL_PERCENTAGE + bool "Use percentage value only" + +config CMA_SIZE_SEL_MIN + bool "Use lower value (minimum)" + +config CMA_SIZE_SEL_MAX + bool "Use higher value (maximum)" + +endchoice + +config CMA_ALIGNMENT + int "Maximum PAGE_SIZE order of alignment for contiguous buffers" + range 4 12 + default 8 + help + DMA mapping framework by default aligns all buffers to the smallest + PAGE_SIZE order which is greater than or equal to the requested buffer + size. This works well for buffers up to a few hundreds kilobytes, but + for larger buffers it just a memory waste. With this parameter you can + specify the maximum PAGE_SIZE order for contiguous buffers. Larger + buffers will be aligned only to this specified order. The order is + expressed as a power of two multiplied by the PAGE_SIZE. + + For example, if your system defaults to 4KiB pages, the order value + of 8 means that the buffers will be aligned up to 1MiB only. + + If unsure, leave the default value "8". + +endif + config DMA_API_DEBUG bool "Enable debugging of DMA-API usage" select NEED_DMA_MAP_STATE -- cgit v1.3-7-g2ca7 From 91a6fda95cb67c94b887355690d1923a7eb6f630 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Tue, 25 Dec 2018 17:27:14 +0100 Subject: dma-mapping: remove dma_mark_declared_memory_occupied This API is not used anywhere, so remove it. Signed-off-by: Christoph Hellwig --- Documentation/DMA-API.txt | 17 ----------------- include/linux/dma-mapping.h | 9 --------- kernel/dma/coherent.c | 23 ----------------------- 3 files changed, 49 deletions(-) (limited to 'kernel') diff --git a/Documentation/DMA-API.txt b/Documentation/DMA-API.txt index 78114ee63057..b9d0cba83877 100644 --- a/Documentation/DMA-API.txt +++ b/Documentation/DMA-API.txt @@ -605,23 +605,6 @@ unconditionally having removed all the required structures. It is the driver's job to ensure that no parts of this memory region are currently in use. -:: - - void * - dma_mark_declared_memory_occupied(struct device *dev, - dma_addr_t device_addr, size_t size) - -This is used to occupy specific regions of the declared space -(dma_alloc_coherent() will hand out the first free region it finds). - -device_addr is the *device* address of the region requested. - -size is the size (and should be a page-sized multiple). - -The return value will be either a pointer to the processor virtual -address of the memory, or an error (via PTR_ERR()) if any part of the -region is occupied. - Part III - Debug drivers use of the DMA-API ------------------------------------------- diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h index e29441b8b3b7..d29faadf6ef2 100644 --- a/include/linux/dma-mapping.h +++ b/include/linux/dma-mapping.h @@ -743,8 +743,6 @@ static inline int dma_get_cache_alignment(void) int dma_declare_coherent_memory(struct device *dev, phys_addr_t phys_addr, dma_addr_t device_addr, size_t size, int flags); void dma_release_declared_memory(struct device *dev); -void *dma_mark_declared_memory_occupied(struct device *dev, - dma_addr_t device_addr, size_t size); #else static inline int dma_declare_coherent_memory(struct device *dev, phys_addr_t phys_addr, @@ -757,13 +755,6 @@ static inline void dma_release_declared_memory(struct device *dev) { } - -static inline void * -dma_mark_declared_memory_occupied(struct device *dev, - dma_addr_t device_addr, size_t size) -{ - return ERR_PTR(-EBUSY); -} #endif /* CONFIG_DMA_DECLARE_COHERENT */ static inline void *dmam_alloc_coherent(struct device *dev, size_t size, diff --git a/kernel/dma/coherent.c b/kernel/dma/coherent.c index 4b76aba574c2..1d12a31af6d7 100644 --- a/kernel/dma/coherent.c +++ b/kernel/dma/coherent.c @@ -137,29 +137,6 @@ void dma_release_declared_memory(struct device *dev) } EXPORT_SYMBOL(dma_release_declared_memory); -void *dma_mark_declared_memory_occupied(struct device *dev, - dma_addr_t device_addr, size_t size) -{ - struct dma_coherent_mem *mem = dev->dma_mem; - unsigned long flags; - int pos, err; - - size += device_addr & ~PAGE_MASK; - - if (!mem) - return ERR_PTR(-EINVAL); - - spin_lock_irqsave(&mem->spinlock, flags); - pos = PFN_DOWN(device_addr - dma_get_device_base(dev, mem)); - err = bitmap_allocate_region(mem->bitmap, pos, get_order(size)); - spin_unlock_irqrestore(&mem->spinlock, flags); - - if (err != 0) - return ERR_PTR(err); - return mem->virt_base + (pos << PAGE_SHIFT); -} -EXPORT_SYMBOL(dma_mark_declared_memory_occupied); - static void *__dma_alloc_from_coherent(struct dma_coherent_mem *mem, ssize_t size, dma_addr_t *dma_handle) { -- cgit v1.3-7-g2ca7 From 82c5de0ab8dbd6035223ad69e76bd8a88a0a9399 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Tue, 25 Dec 2018 13:29:54 +0100 Subject: dma-mapping: remove the DMA_MEMORY_EXCLUSIVE flag All users of dma_declare_coherent want their allocations to be exclusive, so default to exclusive allocations. Signed-off-by: Christoph Hellwig Reviewed-by: Greg Kroah-Hartman --- Documentation/DMA-API.txt | 9 +------- arch/arm/mach-imx/mach-imx27_visstrim_m10.c | 12 ++++------- arch/arm/mach-imx/mach-mx31moboard.c | 3 +-- arch/sh/boards/mach-ap325rxa/setup.c | 5 ++--- arch/sh/boards/mach-ecovec24/setup.c | 6 ++---- arch/sh/boards/mach-kfr2r09/setup.c | 5 ++--- arch/sh/boards/mach-migor/setup.c | 5 ++--- arch/sh/boards/mach-se/7724/setup.c | 6 ++---- arch/sh/drivers/pci/fixups-dreamcast.c | 3 +-- .../platform/soc_camera/sh_mobile_ceu_camera.c | 3 +-- drivers/usb/host/ohci-sm501.c | 3 +-- drivers/usb/host/ohci-tmio.c | 2 +- include/linux/dma-mapping.h | 7 ++---- kernel/dma/coherent.c | 25 ++++++---------------- 14 files changed, 29 insertions(+), 65 deletions(-) (limited to 'kernel') diff --git a/Documentation/DMA-API.txt b/Documentation/DMA-API.txt index b9d0cba83877..38e561b773b4 100644 --- a/Documentation/DMA-API.txt +++ b/Documentation/DMA-API.txt @@ -566,8 +566,7 @@ boundaries when doing this. int dma_declare_coherent_memory(struct device *dev, phys_addr_t phys_addr, - dma_addr_t device_addr, size_t size, int - flags) + dma_addr_t device_addr, size_t size); Declare region of memory to be handed out by dma_alloc_coherent() when it's asked for coherent memory for this device. @@ -581,12 +580,6 @@ dma_addr_t in dma_alloc_coherent()). size is the size of the area (must be multiples of PAGE_SIZE). -flags can be ORed together and are: - -- DMA_MEMORY_EXCLUSIVE - only allocate memory from the declared regions. - Do not allow dma_alloc_coherent() to fall back to system memory when - it's out of memory in the declared region. - As a simplification for the platforms, only *one* such region of memory may be declared per device. diff --git a/arch/arm/mach-imx/mach-imx27_visstrim_m10.c b/arch/arm/mach-imx/mach-imx27_visstrim_m10.c index 5169dfba9718..07d4fcfe5c2e 100644 --- a/arch/arm/mach-imx/mach-imx27_visstrim_m10.c +++ b/arch/arm/mach-imx/mach-imx27_visstrim_m10.c @@ -258,8 +258,7 @@ static void __init visstrim_analog_camera_init(void) return; dma_declare_coherent_memory(&pdev->dev, mx2_camera_base, - mx2_camera_base, MX2_CAMERA_BUF_SIZE, - DMA_MEMORY_EXCLUSIVE); + mx2_camera_base, MX2_CAMERA_BUF_SIZE); } static void __init visstrim_reserve(void) @@ -445,8 +444,7 @@ static void __init visstrim_coda_init(void) dma_declare_coherent_memory(&pdev->dev, mx2_camera_base + MX2_CAMERA_BUF_SIZE, mx2_camera_base + MX2_CAMERA_BUF_SIZE, - MX2_CAMERA_BUF_SIZE, - DMA_MEMORY_EXCLUSIVE); + MX2_CAMERA_BUF_SIZE); } /* DMA deinterlace */ @@ -465,8 +463,7 @@ static void __init visstrim_deinterlace_init(void) dma_declare_coherent_memory(&pdev->dev, mx2_camera_base + 2 * MX2_CAMERA_BUF_SIZE, mx2_camera_base + 2 * MX2_CAMERA_BUF_SIZE, - MX2_CAMERA_BUF_SIZE, - DMA_MEMORY_EXCLUSIVE); + MX2_CAMERA_BUF_SIZE); } /* Emma-PrP for format conversion */ @@ -485,8 +482,7 @@ static void __init visstrim_emmaprp_init(void) */ ret = dma_declare_coherent_memory(&pdev->dev, mx2_camera_base, mx2_camera_base, - MX2_CAMERA_BUF_SIZE, - DMA_MEMORY_EXCLUSIVE); + MX2_CAMERA_BUF_SIZE); if (ret) pr_err("Failed to declare memory for emmaprp\n"); } diff --git a/arch/arm/mach-imx/mach-mx31moboard.c b/arch/arm/mach-imx/mach-mx31moboard.c index 643a3d749703..fe50f4cf00a7 100644 --- a/arch/arm/mach-imx/mach-mx31moboard.c +++ b/arch/arm/mach-imx/mach-mx31moboard.c @@ -475,8 +475,7 @@ static int __init mx31moboard_init_cam(void) ret = dma_declare_coherent_memory(&pdev->dev, mx3_camera_base, mx3_camera_base, - MX3_CAMERA_BUF_SIZE, - DMA_MEMORY_EXCLUSIVE); + MX3_CAMERA_BUF_SIZE); if (ret) goto err; diff --git a/arch/sh/boards/mach-ap325rxa/setup.c b/arch/sh/boards/mach-ap325rxa/setup.c index 8f234d0435aa..7899b4f51fdd 100644 --- a/arch/sh/boards/mach-ap325rxa/setup.c +++ b/arch/sh/boards/mach-ap325rxa/setup.c @@ -529,9 +529,8 @@ static int __init ap325rxa_devices_setup(void) device_initialize(&ap325rxa_ceu_device.dev); arch_setup_pdev_archdata(&ap325rxa_ceu_device); dma_declare_coherent_memory(&ap325rxa_ceu_device.dev, - ceu_dma_membase, ceu_dma_membase, - ceu_dma_membase + CEU_BUFFER_MEMORY_SIZE - 1, - DMA_MEMORY_EXCLUSIVE); + ceu_dma_membase, ceu_dma_membase, + ceu_dma_membase + CEU_BUFFER_MEMORY_SIZE - 1); platform_device_add(&ap325rxa_ceu_device); diff --git a/arch/sh/boards/mach-ecovec24/setup.c b/arch/sh/boards/mach-ecovec24/setup.c index 22b4106b8084..eb66754cfb8c 100644 --- a/arch/sh/boards/mach-ecovec24/setup.c +++ b/arch/sh/boards/mach-ecovec24/setup.c @@ -1440,8 +1440,7 @@ static int __init arch_setup(void) dma_declare_coherent_memory(&ecovec_ceu_devices[0]->dev, ceu0_dma_membase, ceu0_dma_membase, ceu0_dma_membase + - CEU_BUFFER_MEMORY_SIZE - 1, - DMA_MEMORY_EXCLUSIVE); + CEU_BUFFER_MEMORY_SIZE - 1); platform_device_add(ecovec_ceu_devices[0]); device_initialize(&ecovec_ceu_devices[1]->dev); @@ -1449,8 +1448,7 @@ static int __init arch_setup(void) dma_declare_coherent_memory(&ecovec_ceu_devices[1]->dev, ceu1_dma_membase, ceu1_dma_membase, ceu1_dma_membase + - CEU_BUFFER_MEMORY_SIZE - 1, - DMA_MEMORY_EXCLUSIVE); + CEU_BUFFER_MEMORY_SIZE - 1); platform_device_add(ecovec_ceu_devices[1]); gpiod_add_lookup_table(&cn12_power_gpiod_table); diff --git a/arch/sh/boards/mach-kfr2r09/setup.c b/arch/sh/boards/mach-kfr2r09/setup.c index 203d249a0a2b..b8bf67c86eab 100644 --- a/arch/sh/boards/mach-kfr2r09/setup.c +++ b/arch/sh/boards/mach-kfr2r09/setup.c @@ -603,9 +603,8 @@ static int __init kfr2r09_devices_setup(void) device_initialize(&kfr2r09_ceu_device.dev); arch_setup_pdev_archdata(&kfr2r09_ceu_device); dma_declare_coherent_memory(&kfr2r09_ceu_device.dev, - ceu_dma_membase, ceu_dma_membase, - ceu_dma_membase + CEU_BUFFER_MEMORY_SIZE - 1, - DMA_MEMORY_EXCLUSIVE); + ceu_dma_membase, ceu_dma_membase, + ceu_dma_membase + CEU_BUFFER_MEMORY_SIZE - 1); platform_device_add(&kfr2r09_ceu_device); diff --git a/arch/sh/boards/mach-migor/setup.c b/arch/sh/boards/mach-migor/setup.c index f4ad33c6d2aa..bcd249e6cfcc 100644 --- a/arch/sh/boards/mach-migor/setup.c +++ b/arch/sh/boards/mach-migor/setup.c @@ -603,9 +603,8 @@ static int __init migor_devices_setup(void) device_initialize(&migor_ceu_device.dev); arch_setup_pdev_archdata(&migor_ceu_device); dma_declare_coherent_memory(&migor_ceu_device.dev, - ceu_dma_membase, ceu_dma_membase, - ceu_dma_membase + CEU_BUFFER_MEMORY_SIZE - 1, - DMA_MEMORY_EXCLUSIVE); + ceu_dma_membase, ceu_dma_membase, + ceu_dma_membase + CEU_BUFFER_MEMORY_SIZE - 1); platform_device_add(&migor_ceu_device); diff --git a/arch/sh/boards/mach-se/7724/setup.c b/arch/sh/boards/mach-se/7724/setup.c index fdbec22ae687..13c2d3ce78f4 100644 --- a/arch/sh/boards/mach-se/7724/setup.c +++ b/arch/sh/boards/mach-se/7724/setup.c @@ -941,8 +941,7 @@ static int __init devices_setup(void) dma_declare_coherent_memory(&ms7724se_ceu_devices[0]->dev, ceu0_dma_membase, ceu0_dma_membase, ceu0_dma_membase + - CEU_BUFFER_MEMORY_SIZE - 1, - DMA_MEMORY_EXCLUSIVE); + CEU_BUFFER_MEMORY_SIZE - 1); platform_device_add(ms7724se_ceu_devices[0]); device_initialize(&ms7724se_ceu_devices[1]->dev); @@ -950,8 +949,7 @@ static int __init devices_setup(void) dma_declare_coherent_memory(&ms7724se_ceu_devices[1]->dev, ceu1_dma_membase, ceu1_dma_membase, ceu1_dma_membase + - CEU_BUFFER_MEMORY_SIZE - 1, - DMA_MEMORY_EXCLUSIVE); + CEU_BUFFER_MEMORY_SIZE - 1); platform_device_add(ms7724se_ceu_devices[1]); return platform_add_devices(ms7724se_devices, diff --git a/arch/sh/drivers/pci/fixups-dreamcast.c b/arch/sh/drivers/pci/fixups-dreamcast.c index dfdbd05b6eb1..7be8694c0d13 100644 --- a/arch/sh/drivers/pci/fixups-dreamcast.c +++ b/arch/sh/drivers/pci/fixups-dreamcast.c @@ -63,8 +63,7 @@ static void gapspci_fixup_resources(struct pci_dev *dev) BUG_ON(dma_declare_coherent_memory(&dev->dev, res.start, region.start, - resource_size(&res), - DMA_MEMORY_EXCLUSIVE)); + resource_size(&res))); break; default: printk("PCI: Failed resource fixup\n"); diff --git a/drivers/media/platform/soc_camera/sh_mobile_ceu_camera.c b/drivers/media/platform/soc_camera/sh_mobile_ceu_camera.c index 6803f744e307..cc357b8db1dc 100644 --- a/drivers/media/platform/soc_camera/sh_mobile_ceu_camera.c +++ b/drivers/media/platform/soc_camera/sh_mobile_ceu_camera.c @@ -1708,8 +1708,7 @@ static int sh_mobile_ceu_probe(struct platform_device *pdev) if (res) { err = dma_declare_coherent_memory(&pdev->dev, res->start, res->start, - resource_size(res), - DMA_MEMORY_EXCLUSIVE); + resource_size(res)); if (err) { dev_err(&pdev->dev, "Unable to declare CEU memory.\n"); return err; diff --git a/drivers/usb/host/ohci-sm501.c b/drivers/usb/host/ohci-sm501.c index c9233cddf9a2..c26228c25f99 100644 --- a/drivers/usb/host/ohci-sm501.c +++ b/drivers/usb/host/ohci-sm501.c @@ -126,8 +126,7 @@ static int ohci_hcd_sm501_drv_probe(struct platform_device *pdev) retval = dma_declare_coherent_memory(dev, mem->start, mem->start - mem->parent->start, - resource_size(mem), - DMA_MEMORY_EXCLUSIVE); + resource_size(mem)); if (retval) { dev_err(dev, "cannot declare coherent memory\n"); goto err1; diff --git a/drivers/usb/host/ohci-tmio.c b/drivers/usb/host/ohci-tmio.c index a631dbb369d7..f88a0370659f 100644 --- a/drivers/usb/host/ohci-tmio.c +++ b/drivers/usb/host/ohci-tmio.c @@ -225,7 +225,7 @@ static int ohci_hcd_tmio_drv_probe(struct platform_device *dev) } ret = dma_declare_coherent_memory(&dev->dev, sram->start, sram->start, - resource_size(sram), DMA_MEMORY_EXCLUSIVE); + resource_size(sram)); if (ret) goto err_dma_declare; diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h index d29faadf6ef2..70ad15758a70 100644 --- a/include/linux/dma-mapping.h +++ b/include/linux/dma-mapping.h @@ -736,17 +736,14 @@ static inline int dma_get_cache_alignment(void) return 1; } -/* flags for the coherent memory api */ -#define DMA_MEMORY_EXCLUSIVE 0x01 - #ifdef CONFIG_DMA_DECLARE_COHERENT int dma_declare_coherent_memory(struct device *dev, phys_addr_t phys_addr, - dma_addr_t device_addr, size_t size, int flags); + dma_addr_t device_addr, size_t size); void dma_release_declared_memory(struct device *dev); #else static inline int dma_declare_coherent_memory(struct device *dev, phys_addr_t phys_addr, - dma_addr_t device_addr, size_t size, int flags) + dma_addr_t device_addr, size_t size) { return -ENOSYS; } diff --git a/kernel/dma/coherent.c b/kernel/dma/coherent.c index 1d12a31af6d7..29fd6590dc1e 100644 --- a/kernel/dma/coherent.c +++ b/kernel/dma/coherent.c @@ -14,7 +14,6 @@ struct dma_coherent_mem { dma_addr_t device_base; unsigned long pfn_base; int size; - int flags; unsigned long *bitmap; spinlock_t spinlock; bool use_dev_dma_pfn_offset; @@ -38,9 +37,9 @@ static inline dma_addr_t dma_get_device_base(struct device *dev, return mem->device_base; } -static int dma_init_coherent_memory( - phys_addr_t phys_addr, dma_addr_t device_addr, size_t size, int flags, - struct dma_coherent_mem **mem) +static int dma_init_coherent_memory(phys_addr_t phys_addr, + dma_addr_t device_addr, size_t size, + struct dma_coherent_mem **mem) { struct dma_coherent_mem *dma_mem = NULL; void *mem_base = NULL; @@ -73,7 +72,6 @@ static int dma_init_coherent_memory( dma_mem->device_base = device_addr; dma_mem->pfn_base = PFN_DOWN(phys_addr); dma_mem->size = pages; - dma_mem->flags = flags; spin_lock_init(&dma_mem->spinlock); *mem = dma_mem; @@ -110,12 +108,12 @@ static int dma_assign_coherent_memory(struct device *dev, } int dma_declare_coherent_memory(struct device *dev, phys_addr_t phys_addr, - dma_addr_t device_addr, size_t size, int flags) + dma_addr_t device_addr, size_t size) { struct dma_coherent_mem *mem; int ret; - ret = dma_init_coherent_memory(phys_addr, device_addr, size, flags, &mem); + ret = dma_init_coherent_memory(phys_addr, device_addr, size, &mem); if (ret) return ret; @@ -190,15 +188,7 @@ int dma_alloc_from_dev_coherent(struct device *dev, ssize_t size, return 0; *ret = __dma_alloc_from_coherent(mem, size, dma_handle); - if (*ret) - return 1; - - /* - * In the case where the allocation can not be satisfied from the - * per-device area, try to fall back to generic memory if the - * constraints allow it. - */ - return mem->flags & DMA_MEMORY_EXCLUSIVE; + return 1; } void *dma_alloc_from_global_coherent(ssize_t size, dma_addr_t *dma_handle) @@ -327,8 +317,7 @@ static int rmem_dma_device_init(struct reserved_mem *rmem, struct device *dev) if (!mem) { ret = dma_init_coherent_memory(rmem->base, rmem->base, - rmem->size, - DMA_MEMORY_EXCLUSIVE, &mem); + rmem->size, &mem); if (ret) { pr_err("Reserved memory: failed to init DMA memory pool at %pa, size %ld MiB\n", &rmem->base, (unsigned long)rmem->size / SZ_1M); -- cgit v1.3-7-g2ca7 From e7f0c424d0806b05d6f47be9f202b037eb701707 Mon Sep 17 00:00:00 2001 From: "zhangyi (F)" Date: Wed, 13 Feb 2019 20:29:06 +0800 Subject: tracing: Do not free iter->trace in fail path of tracing_open_pipe() Commit d716ff71dd12 ("tracing: Remove taking of trace_types_lock in pipe files") use the current tracer instead of the copy in tracing_open_pipe(), but it forget to remove the freeing sentence in the error path. There's an error path that can call kfree(iter->trace) after the iter->trace was assigned to tr->current_trace, which would be bad to free. Link: http://lkml.kernel.org/r/1550060946-45984-1-git-send-email-yi.zhang@huawei.com Cc: stable@vger.kernel.org Fixes: d716ff71dd12 ("tracing: Remove taking of trace_types_lock in pipe files") Signed-off-by: zhangyi (F) Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace.c | 1 - 1 file changed, 1 deletion(-) (limited to 'kernel') diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index c521b7347482..b583ff7656bb 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -5624,7 +5624,6 @@ out: return ret; fail: - kfree(iter->trace); kfree(iter); __trace_array_put(tr); mutex_unlock(&trace_types_lock); -- cgit v1.3-7-g2ca7 From 7d18a10c316783357fb1b2b649cfcf97c70a7bee Mon Sep 17 00:00:00 2001 From: Tom Zanussi Date: Wed, 13 Feb 2019 17:42:41 -0600 Subject: tracing: Refactor hist trigger action code The hist trigger action code currently implements two essentially hard-coded pairs of 'actions' - onmax(), which tracks a variable and saves some event fields when a max is hit, and onmatch(), which is hard-coded to generate a synthetic event. These hardcoded pairs (track max/save fields and detect match/generate synthetic event) should really be decoupled into separate components that can then be arbitrarily combined. The first component of each pair (track max/detect match) is called a 'handler' in the new code, while the second component (save fields/generate synthetic event) is called an 'action' in this scheme. This change refactors the action code to reflect this split by adding two handlers, HANDLER_ONMATCH and HANDLER_ONMAX, along with two actions, ACTION_SAVE and ACTION_TRACE. The new code combines them to produce the existing ONMATCH/TRACE and ONMAX/SAVE functionality, but doesn't implement the other combinations now possible. Future patches will expand these to further useful cases, such as ONMAX/TRACE, as well as add additional handlers and actions such as ONCHANGE and SNAPSHOT. Also, add abbreviated documentation for handlers and actions to README. Link: http://lkml.kernel.org/r/98bfdd48c1b4ff29fc5766442f99f5bc3c34b76b.1550100284.git.tom.zanussi@linux.intel.com Signed-off-by: Tom Zanussi Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_events_hist.c | 407 +++++++++++++++++++++++---------------- 1 file changed, 238 insertions(+), 169 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c index 449d90cfa151..dfaaad582797 100644 --- a/kernel/trace/trace_events_hist.c +++ b/kernel/trace/trace_events_hist.c @@ -313,9 +313,9 @@ struct hist_trigger_data { struct field_var_hist *field_var_hists[SYNTH_FIELDS_MAX]; unsigned int n_field_var_hists; - struct field_var *max_vars[SYNTH_FIELDS_MAX]; - unsigned int n_max_vars; - unsigned int n_max_var_str; + struct field_var *save_vars[SYNTH_FIELDS_MAX]; + unsigned int n_save_vars; + unsigned int n_save_var_str; }; static int synth_event_create(int argc, const char **argv); @@ -383,11 +383,25 @@ struct action_data; typedef void (*action_fn_t) (struct hist_trigger_data *hist_data, struct tracing_map_elt *elt, void *rec, - struct ring_buffer_event *rbe, + struct ring_buffer_event *rbe, void *key, struct action_data *data, u64 *var_ref_vals); +enum handler_id { + HANDLER_ONMATCH = 1, + HANDLER_ONMAX, +}; + +enum action_id { + ACTION_SAVE = 1, + ACTION_TRACE, +}; + struct action_data { + enum handler_id handler; + enum action_id action; + char *action_name; action_fn_t fn; + unsigned int n_params; char *params[SYNTH_FIELDS_MAX]; @@ -404,13 +418,11 @@ struct action_data { unsigned int var_ref_idx; char *match_event; char *match_event_system; - char *synth_event_name; struct synth_event *synth_event; } onmatch; struct { char *var_str; - char *fn_name; unsigned int max_var_ref_idx; struct hist_field *max_var; struct hist_field *var; @@ -1078,7 +1090,7 @@ static struct synth_event *alloc_synth_event(const char *name, int n_fields, static void action_trace(struct hist_trigger_data *hist_data, struct tracing_map_elt *elt, void *rec, - struct ring_buffer_event *rbe, + struct ring_buffer_event *rbe, void *key, struct action_data *data, u64 *var_ref_vals) { struct synth_event *event = data->onmatch.synth_event; @@ -1644,7 +1656,7 @@ find_match_var(struct hist_trigger_data *hist_data, char *var_name) for (i = 0; i < hist_data->n_actions; i++) { struct action_data *data = hist_data->actions[i]; - if (data->fn == action_trace) { + if (data->handler == HANDLER_ONMATCH) { char *system = data->onmatch.match_event_system; char *event_name = data->onmatch.match_event; @@ -2076,7 +2088,7 @@ static int hist_trigger_elt_data_alloc(struct tracing_map_elt *elt) } } - n_str = hist_data->n_field_var_str + hist_data->n_max_var_str; + n_str = hist_data->n_field_var_str + hist_data->n_save_var_str; size = STR_VAR_LEN_MAX; @@ -3050,7 +3062,7 @@ create_field_var_hist(struct hist_trigger_data *target_hist_data, int ret; if (target_hist_data->n_field_var_hists >= SYNTH_FIELDS_MAX) { - hist_err_event("onmatch: Too many field variables defined: ", + hist_err_event("trace action: Too many field variables defined: ", subsys_name, event_name, field_name); return ERR_PTR(-EINVAL); } @@ -3058,7 +3070,7 @@ create_field_var_hist(struct hist_trigger_data *target_hist_data, file = event_file(tr, subsys_name, event_name); if (IS_ERR(file)) { - hist_err_event("onmatch: Event file not found: ", + hist_err_event("trace action: Event file not found: ", subsys_name, event_name, field_name); ret = PTR_ERR(file); return ERR_PTR(ret); @@ -3072,7 +3084,7 @@ create_field_var_hist(struct hist_trigger_data *target_hist_data, */ hist_data = find_compatible_hist(target_hist_data, file); if (!hist_data) { - hist_err_event("onmatch: Matching event histogram not found: ", + hist_err_event("trace action: Matching event histogram not found: ", subsys_name, event_name, field_name); return ERR_PTR(-EINVAL); } @@ -3134,7 +3146,7 @@ create_field_var_hist(struct hist_trigger_data *target_hist_data, kfree(cmd); kfree(var_hist->cmd); kfree(var_hist); - hist_err_event("onmatch: Couldn't create histogram for field: ", + hist_err_event("trace action: Couldn't create histogram for field: ", subsys_name, event_name, field_name); return ERR_PTR(ret); } @@ -3147,7 +3159,7 @@ create_field_var_hist(struct hist_trigger_data *target_hist_data, if (IS_ERR_OR_NULL(event_var)) { kfree(var_hist->cmd); kfree(var_hist); - hist_err_event("onmatch: Couldn't find synthetic variable: ", + hist_err_event("trace action: Couldn't find synthetic variable: ", subsys_name, event_name, field_name); return ERR_PTR(-EINVAL); } @@ -3230,8 +3242,8 @@ static void update_max_vars(struct hist_trigger_data *hist_data, struct ring_buffer_event *rbe, void *rec) { - __update_field_vars(elt, rbe, rec, hist_data->max_vars, - hist_data->n_max_vars, hist_data->n_field_var_str); + __update_field_vars(elt, rbe, rec, hist_data->save_vars, + hist_data->n_save_vars, hist_data->n_field_var_str); } static struct hist_field *create_var(struct hist_trigger_data *hist_data, @@ -3375,9 +3387,9 @@ static void onmax_print(struct seq_file *m, seq_printf(m, "\n\tmax: %10llu", tracing_map_read_var(elt, max_idx)); - for (i = 0; i < hist_data->n_max_vars; i++) { - struct hist_field *save_val = hist_data->max_vars[i]->val; - struct hist_field *save_var = hist_data->max_vars[i]->var; + for (i = 0; i < hist_data->n_save_vars; i++) { + struct hist_field *save_val = hist_data->save_vars[i]->val; + struct hist_field *save_var = hist_data->save_vars[i]->var; u64 val; save_var_idx = save_var->var.idx; @@ -3394,7 +3406,7 @@ static void onmax_print(struct seq_file *m, static void onmax_save(struct hist_trigger_data *hist_data, struct tracing_map_elt *elt, void *rec, - struct ring_buffer_event *rbe, + struct ring_buffer_event *rbe, void *key, struct action_data *data, u64 *var_ref_vals) { unsigned int max_idx = data->onmax.max_var->var.idx; @@ -3421,7 +3433,7 @@ static void onmax_destroy(struct action_data *data) destroy_hist_field(data->onmax.var, 0); kfree(data->onmax.var_str); - kfree(data->onmax.fn_name); + kfree(data->action_name); for (i = 0; i < data->n_params; i++) kfree(data->params[i]); @@ -3429,15 +3441,16 @@ static void onmax_destroy(struct action_data *data) kfree(data); } +static int action_create(struct hist_trigger_data *hist_data, + struct action_data *data); + static int onmax_create(struct hist_trigger_data *hist_data, struct action_data *data) { + struct hist_field *var_field, *ref_field, *max_var = NULL; struct trace_event_file *file = hist_data->event_file; - struct hist_field *var_field, *ref_field, *max_var; unsigned int var_ref_idx = hist_data->n_var_refs; - struct field_var *field_var; - char *onmax_var_str, *param; - unsigned int i; + char *onmax_var_str; int ret = 0; onmax_var_str = data->onmax.var_str; @@ -3459,8 +3472,8 @@ static int onmax_create(struct hist_trigger_data *hist_data, data->onmax.var = ref_field; - data->fn = onmax_save; data->onmax.max_var_ref_idx = var_ref_idx; + max_var = create_var(hist_data, file, "max", sizeof(u64), "u64"); if (IS_ERR(max_var)) { hist_err("onmax: Couldn't create onmax variable: ", "max"); @@ -3469,27 +3482,7 @@ static int onmax_create(struct hist_trigger_data *hist_data, } data->onmax.max_var = max_var; - for (i = 0; i < data->n_params; i++) { - param = kstrdup(data->params[i], GFP_KERNEL); - if (!param) { - ret = -ENOMEM; - goto out; - } - - field_var = create_target_field_var(hist_data, NULL, NULL, param); - if (IS_ERR(field_var)) { - hist_err("onmax: Couldn't create field variable: ", param); - ret = PTR_ERR(field_var); - kfree(param); - goto out; - } - - hist_data->max_vars[hist_data->n_max_vars++] = field_var; - if (field_var->val->flags & HIST_FIELD_FL_STRING) - hist_data->n_max_var_str++; - - kfree(param); - } + ret = action_create(hist_data, data); out: return ret; } @@ -3500,11 +3493,14 @@ static int parse_action_params(char *params, struct action_data *data) int ret = 0; while (params) { - if (data->n_params >= SYNTH_FIELDS_MAX) + if (data->n_params >= SYNTH_FIELDS_MAX) { + hist_err("Too many action params", ""); goto out; + } param = strsep(¶ms, ","); if (!param) { + hist_err("No action param found", ""); ret = -EINVAL; goto out; } @@ -3528,10 +3524,71 @@ static int parse_action_params(char *params, struct action_data *data) return ret; } -static struct action_data *onmax_parse(char *str) +static int action_parse(char *str, struct action_data *data, + enum handler_id handler) +{ + char *action_name; + int ret = 0; + + strsep(&str, "."); + if (!str) { + hist_err("action parsing: No action found", ""); + ret = -EINVAL; + goto out; + } + + action_name = strsep(&str, "("); + if (!action_name || !str) { + hist_err("action parsing: No action found", ""); + ret = -EINVAL; + goto out; + } + + if (str_has_prefix(action_name, "save")) { + char *params = strsep(&str, ")"); + + if (!params) { + hist_err("action parsing: No params found for %s", "save"); + ret = -EINVAL; + goto out; + } + + ret = parse_action_params(params, data); + if (ret) + goto out; + + if (handler == HANDLER_ONMAX) + data->fn = onmax_save; + + data->action = ACTION_SAVE; + } else { + char *params = strsep(&str, ")"); + + if (params) { + ret = parse_action_params(params, data); + if (ret) + goto out; + } + + data->fn = action_trace; + data->action = ACTION_TRACE; + } + + data->action_name = kstrdup(action_name, GFP_KERNEL); + if (!data->action_name) { + ret = -ENOMEM; + goto out; + } + + data->handler = handler; + out: + return ret; +} + +static struct action_data *onmax_parse(char *str, enum handler_id handler) { - char *onmax_fn_name, *onmax_var_str; struct action_data *data; + char *onmax_var_str; int ret = -EINVAL; data = kzalloc(sizeof(*data), GFP_KERNEL); @@ -3550,33 +3607,9 @@ static struct action_data *onmax_parse(char *str) goto free; } - strsep(&str, "."); - if (!str) - goto free; - - onmax_fn_name = strsep(&str, "("); - if (!onmax_fn_name || !str) - goto free; - - if (str_has_prefix(onmax_fn_name, "save")) { - char *params = strsep(&str, ")"); - - if (!params) { - ret = -EINVAL; - goto free; - } - - ret = parse_action_params(params, data); - if (ret) - goto free; - } else - goto free; - - data->onmax.fn_name = kstrdup(onmax_fn_name, GFP_KERNEL); - if (!data->onmax.fn_name) { - ret = -ENOMEM; + ret = action_parse(str, data, handler); + if (ret) goto free; - } out: return data; free: @@ -3593,7 +3626,7 @@ static void onmatch_destroy(struct action_data *data) kfree(data->onmatch.match_event); kfree(data->onmatch.match_event_system); - kfree(data->onmatch.synth_event_name); + kfree(data->action_name); for (i = 0; i < data->n_params; i++) kfree(data->params[i]); @@ -3651,8 +3684,9 @@ static int check_synth_field(struct synth_event *event, } static struct hist_field * -onmatch_find_var(struct hist_trigger_data *hist_data, struct action_data *data, - char *system, char *event, char *var) +trace_action_find_var(struct hist_trigger_data *hist_data, + struct action_data *data, + char *system, char *event, char *var) { struct hist_field *hist_field; @@ -3660,7 +3694,7 @@ onmatch_find_var(struct hist_trigger_data *hist_data, struct action_data *data, hist_field = find_target_event_var(hist_data, system, event, var); if (!hist_field) { - if (!system) { + if (!system && data->handler == HANDLER_ONMATCH) { system = data->onmatch.match_event_system; event = data->onmatch.match_event; } @@ -3669,15 +3703,15 @@ onmatch_find_var(struct hist_trigger_data *hist_data, struct action_data *data, } if (!hist_field) - hist_err_event("onmatch: Couldn't find onmatch param: $", system, event, var); + hist_err_event("trace action: Couldn't find param: $", system, event, var); return hist_field; } static struct hist_field * -onmatch_create_field_var(struct hist_trigger_data *hist_data, - struct action_data *data, char *system, - char *event, char *var) +trace_action_create_field_var(struct hist_trigger_data *hist_data, + struct action_data *data, char *system, + char *event, char *var) { struct hist_field *hist_field = NULL; struct field_var *field_var; @@ -3700,7 +3734,7 @@ onmatch_create_field_var(struct hist_trigger_data *hist_data, * looking for fields on the onmatch(system.event.xxx) * event. */ - if (!system) { + if (!system && data->handler == HANDLER_ONMATCH) { system = data->onmatch.match_event_system; event = data->onmatch.match_event; } @@ -3724,9 +3758,8 @@ onmatch_create_field_var(struct hist_trigger_data *hist_data, goto out; } -static int onmatch_create(struct hist_trigger_data *hist_data, - struct trace_event_file *file, - struct action_data *data) +static int trace_action_create(struct hist_trigger_data *hist_data, + struct action_data *data) { char *event_name, *param, *system = NULL; struct hist_field *hist_field, *var_ref; @@ -3737,11 +3770,12 @@ static int onmatch_create(struct hist_trigger_data *hist_data, lockdep_assert_held(&event_mutex); - event = find_synth_event(data->onmatch.synth_event_name); + event = find_synth_event(data->action_name); if (!event) { - hist_err("onmatch: Couldn't find synthetic event: ", data->onmatch.synth_event_name); + hist_err("trace action: Couldn't find synthetic event: ", data->action_name); return -EINVAL; } + event->ref++; var_ref_idx = hist_data->n_var_refs; @@ -3769,13 +3803,15 @@ static int onmatch_create(struct hist_trigger_data *hist_data, } if (param[0] == '$') - hist_field = onmatch_find_var(hist_data, data, system, - event_name, param); + hist_field = trace_action_find_var(hist_data, data, + system, event_name, + param); else - hist_field = onmatch_create_field_var(hist_data, data, - system, - event_name, - param); + hist_field = trace_action_create_field_var(hist_data, + data, + system, + event_name, + param); if (!hist_field) { kfree(p); @@ -3797,7 +3833,7 @@ static int onmatch_create(struct hist_trigger_data *hist_data, continue; } - hist_err_event("onmatch: Param type doesn't match synthetic event field type: ", + hist_err_event("trace action: Param type doesn't match synthetic event field type: ", system, event_name, param); kfree(p); ret = -EINVAL; @@ -3805,12 +3841,11 @@ static int onmatch_create(struct hist_trigger_data *hist_data, } if (field_pos != event->n_fields) { - hist_err("onmatch: Param count doesn't match synthetic event field count: ", event->name); + hist_err("trace action: Param count doesn't match synthetic event field count: ", event->name); ret = -EINVAL; goto err; } - data->fn = action_trace; data->onmatch.synth_event = event; data->onmatch.var_ref_idx = var_ref_idx; out: @@ -3821,10 +3856,58 @@ static int onmatch_create(struct hist_trigger_data *hist_data, goto out; } +static int action_create(struct hist_trigger_data *hist_data, + struct action_data *data) +{ + struct field_var *field_var; + unsigned int i; + char *param; + int ret = 0; + + if (data->action == ACTION_TRACE) + return trace_action_create(hist_data, data); + + if (data->action == ACTION_SAVE) { + if (hist_data->n_save_vars) { + ret = -EEXIST; + hist_err("save action: Can't have more than one save() action per hist", ""); + goto out; + } + + for (i = 0; i < data->n_params; i++) { + param = kstrdup(data->params[i], GFP_KERNEL); + if (!param) { + ret = -ENOMEM; + goto out; + } + + field_var = create_target_field_var(hist_data, NULL, NULL, param); + if (IS_ERR(field_var)) { + hist_err("save action: Couldn't create field variable: ", param); + ret = PTR_ERR(field_var); + kfree(param); + goto out; + } + + hist_data->save_vars[hist_data->n_save_vars++] = field_var; + if (field_var->val->flags & HIST_FIELD_FL_STRING) + hist_data->n_save_var_str++; + kfree(param); + } + } + out: + return ret; +} + +static int onmatch_create(struct hist_trigger_data *hist_data, + struct action_data *data) +{ + return action_create(hist_data, data); +} + static struct action_data *onmatch_parse(struct trace_array *tr, char *str) { char *match_event, *match_event_system; - char *synth_event_name, *params; struct action_data *data; int ret = -EINVAL; @@ -3862,31 +3945,7 @@ static struct action_data *onmatch_parse(struct trace_array *tr, char *str) goto free; } - strsep(&str, "."); - if (!str) { - hist_err("onmatch: Missing . after onmatch(): ", str); - goto free; - } - - synth_event_name = strsep(&str, "("); - if (!synth_event_name || !str) { - hist_err("onmatch: Missing opening paramlist paren: ", synth_event_name); - goto free; - } - - data->onmatch.synth_event_name = kstrdup(synth_event_name, GFP_KERNEL); - if (!data->onmatch.synth_event_name) { - ret = -ENOMEM; - goto free; - } - - params = strsep(&str, ")"); - if (!params || !str || (str && strlen(str))) { - hist_err("onmatch: Missing closing paramlist paren: ", params); - goto free; - } - - ret = parse_action_params(params, data); + ret = action_parse(str, data, HANDLER_ONMATCH); if (ret) goto free; out: @@ -4326,9 +4385,9 @@ static void destroy_actions(struct hist_trigger_data *hist_data) for (i = 0; i < hist_data->n_actions; i++) { struct action_data *data = hist_data->actions[i]; - if (data->fn == action_trace) + if (data->handler == HANDLER_ONMATCH) onmatch_destroy(data); - else if (data->fn == onmax_save) + else if (data->handler == HANDLER_ONMAX) onmax_destroy(data); else kfree(data); @@ -4355,16 +4414,14 @@ static int parse_actions(struct hist_trigger_data *hist_data) ret = PTR_ERR(data); break; } - data->fn = action_trace; } else if ((len = str_has_prefix(str, "onmax("))) { char *action_str = str + len; - data = onmax_parse(action_str); + data = onmax_parse(action_str, HANDLER_ONMAX); if (IS_ERR(data)) { ret = PTR_ERR(data); break; } - data->fn = onmax_save; } else { ret = -EINVAL; break; @@ -4376,8 +4433,7 @@ static int parse_actions(struct hist_trigger_data *hist_data) return ret; } -static int create_actions(struct hist_trigger_data *hist_data, - struct trace_event_file *file) +static int create_actions(struct hist_trigger_data *hist_data) { struct action_data *data; unsigned int i; @@ -4386,14 +4442,17 @@ static int create_actions(struct hist_trigger_data *hist_data, for (i = 0; i < hist_data->attrs->n_actions; i++) { data = hist_data->actions[i]; - if (data->fn == action_trace) { - ret = onmatch_create(hist_data, file, data); + if (data->handler == HANDLER_ONMATCH) { + ret = onmatch_create(hist_data, data); if (ret) - return ret; - } else if (data->fn == onmax_save) { + break; + } else if (data->handler == HANDLER_ONMAX) { ret = onmax_create(hist_data, data); if (ret) - return ret; + break; + } else { + ret = -EINVAL; + break; } } @@ -4409,26 +4468,42 @@ static void print_actions(struct seq_file *m, for (i = 0; i < hist_data->n_actions; i++) { struct action_data *data = hist_data->actions[i]; - if (data->fn == onmax_save) + if (data->handler == HANDLER_ONMAX) onmax_print(m, hist_data, elt, data); } } +static void print_action_spec(struct seq_file *m, + struct hist_trigger_data *hist_data, + struct action_data *data) +{ + unsigned int i; + + if (data->action == ACTION_SAVE) { + for (i = 0; i < hist_data->n_save_vars; i++) { + seq_printf(m, "%s", hist_data->save_vars[i]->var->var.name); + if (i < hist_data->n_save_vars - 1) + seq_puts(m, ","); + } + } else if (data->action == ACTION_TRACE) { + for (i = 0; i < data->n_params; i++) { + if (i) + seq_puts(m, ","); + seq_printf(m, "%s", data->params[i]); + } + } +} + static void print_onmax_spec(struct seq_file *m, struct hist_trigger_data *hist_data, struct action_data *data) { - unsigned int i; - seq_puts(m, ":onmax("); seq_printf(m, "%s", data->onmax.var_str); - seq_printf(m, ").%s(", data->onmax.fn_name); + seq_printf(m, ").%s(", data->action_name); + + print_action_spec(m, hist_data, data); - for (i = 0; i < hist_data->n_max_vars; i++) { - seq_printf(m, "%s", hist_data->max_vars[i]->var->var.name); - if (i < hist_data->n_max_vars - 1) - seq_puts(m, ","); - } seq_puts(m, ")"); } @@ -4436,18 +4511,12 @@ static void print_onmatch_spec(struct seq_file *m, struct hist_trigger_data *hist_data, struct action_data *data) { - unsigned int i; - seq_printf(m, ":onmatch(%s.%s).", data->onmatch.match_event_system, data->onmatch.match_event); - seq_printf(m, "%s(", data->onmatch.synth_event->name); + seq_printf(m, "%s(", data->action_name); - for (i = 0; i < data->n_params; i++) { - if (i) - seq_puts(m, ","); - seq_printf(m, "%s", data->params[i]); - } + print_action_spec(m, hist_data, data); seq_puts(m, ")"); } @@ -4464,7 +4533,9 @@ static bool actions_match(struct hist_trigger_data *hist_data, struct action_data *data = hist_data->actions[i]; struct action_data *data_test = hist_data_test->actions[i]; - if (data->fn != data_test->fn) + if (data->handler != data_test->handler) + return false; + if (data->action != data_test->action) return false; if (data->n_params != data_test->n_params) @@ -4475,23 +4546,20 @@ static bool actions_match(struct hist_trigger_data *hist_data, return false; } - if (data->fn == action_trace) { - if (strcmp(data->onmatch.synth_event_name, - data_test->onmatch.synth_event_name) != 0) - return false; + if (strcmp(data->action_name, data_test->action_name) != 0) + return false; + + if (data->handler == HANDLER_ONMATCH) { if (strcmp(data->onmatch.match_event_system, data_test->onmatch.match_event_system) != 0) return false; if (strcmp(data->onmatch.match_event, data_test->onmatch.match_event) != 0) return false; - } else if (data->fn == onmax_save) { + } else if (data->handler == HANDLER_ONMAX) { if (strcmp(data->onmax.var_str, data_test->onmax.var_str) != 0) return false; - if (strcmp(data->onmax.fn_name, - data_test->onmax.fn_name) != 0) - return false; } } @@ -4507,9 +4575,9 @@ static void print_actions_spec(struct seq_file *m, for (i = 0; i < hist_data->n_actions; i++) { struct action_data *data = hist_data->actions[i]; - if (data->fn == action_trace) + if (data->handler == HANDLER_ONMATCH) print_onmatch_spec(m, hist_data, data); - else if (data->fn == onmax_save) + else if (data->handler == HANDLER_ONMAX) print_onmax_spec(m, hist_data, data); } } @@ -4703,14 +4771,15 @@ static inline void add_to_key(char *compound_key, void *key, static void hist_trigger_actions(struct hist_trigger_data *hist_data, struct tracing_map_elt *elt, void *rec, - struct ring_buffer_event *rbe, u64 *var_ref_vals) + struct ring_buffer_event *rbe, void *key, + u64 *var_ref_vals) { struct action_data *data; unsigned int i; for (i = 0; i < hist_data->n_actions; i++) { data = hist_data->actions[i]; - data->fn(hist_data, elt, rec, rbe, data, var_ref_vals); + data->fn(hist_data, elt, rec, rbe, key, data, var_ref_vals); } } @@ -4771,7 +4840,7 @@ static void event_hist_trigger(struct event_trigger_data *data, void *rec, hist_trigger_elt_update(hist_data, elt, rec, rbe, var_ref_vals); if (resolve_var_refs(hist_data, key, var_ref_vals, true)) - hist_trigger_actions(hist_data, elt, rec, rbe, var_ref_vals); + hist_trigger_actions(hist_data, elt, rec, rbe, key, var_ref_vals); } static void hist_trigger_stacktrace_print(struct seq_file *m, @@ -5683,7 +5752,7 @@ static int event_hist_trigger_func(struct event_command *cmd_ops, if (has_hist_vars(hist_data)) save_hist_vars(hist_data); - ret = create_actions(hist_data, file); + ret = create_actions(hist_data); if (ret) goto out_unreg; -- cgit v1.3-7-g2ca7 From c3e49506a0f426a850675e39419879214060ca8b Mon Sep 17 00:00:00 2001 From: Tom Zanussi Date: Wed, 13 Feb 2019 17:42:43 -0600 Subject: tracing: Split up onmatch action data Currently, the onmatch action data binds the onmatch action to data related to synthetic event generation. Since we want to allow the onmatch handler to potentially invoke a different action, and because we expect other handlers to generate synthetic events, we need to separate the data related to these two functions. Also rename the onmatch data to something more descriptive, and create and use common action data destroy function. Link: http://lkml.kernel.org/r/b9abbf9aae69fe3920cdc8ddbcaad544dd258d78.1550100284.git.tom.zanussi@linux.intel.com Signed-off-by: Tom Zanussi Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace.c | 12 ++++- kernel/trace/trace_events_hist.c | 103 ++++++++++++++++++++------------------- 2 files changed, 63 insertions(+), 52 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index b583ff7656bb..b477926ac3bc 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -4700,6 +4700,7 @@ static const char readme_msg[] = "\t [:size=#entries]\n" "\t [:pause][:continue][:clear]\n" "\t [:name=histname1]\n" + "\t [:.]\n" "\t [if ]\n\n" "\t When a matching event is hit, an entry is added to a hash\n" "\t table using the key(s) and value(s) named, and the value of a\n" @@ -4741,7 +4742,16 @@ static const char readme_msg[] = "\t The enable_hist and disable_hist triggers can be used to\n" "\t have one event conditionally start and stop another event's\n" "\t already-attached hist trigger. The syntax is analagous to\n" - "\t the enable_event and disable_event triggers.\n" + "\t the enable_event and disable_event triggers.\n\n" + "\t Hist trigger handlers and actions are executed whenever a\n" + "\t a histogram entry is added or updated. They take the form:\n\n" + "\t .\n\n" + "\t The available handlers are:\n\n" + "\t onmatch(matching.event) - invoke on addition or update\n" + "\t onmax(var) - invoke if var exceeds current max\n\n" + "\t The available actions are:\n\n" + "\t (param list) - generate synthetic event\n" + "\t save(field,...) - save current event fields\n" #endif ; diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c index dfaaad582797..0b843ecef547 100644 --- a/kernel/trace/trace_events_hist.c +++ b/kernel/trace/trace_events_hist.c @@ -405,21 +405,22 @@ struct action_data { unsigned int n_params; char *params[SYNTH_FIELDS_MAX]; + /* + * When a histogram trigger is hit, the values of any + * references to variables, including variables being passed + * as parameters to synthetic events, are collected into a + * var_ref_vals array. This var_ref_idx is the index of the + * first param in the array to be passed to the synthetic + * event invocation. + */ + unsigned int var_ref_idx; + struct synth_event *synth_event; + union { struct { - /* - * When a histogram trigger is hit, the values of any - * references to variables, including variables being passed - * as parameters to synthetic events, are collected into a - * var_ref_vals array. This var_ref_idx is the index of the - * first param in the array to be passed to the synthetic - * event invocation. - */ - unsigned int var_ref_idx; - char *match_event; - char *match_event_system; - struct synth_event *synth_event; - } onmatch; + char *event; + char *event_system; + } match_data; struct { char *var_str; @@ -1093,9 +1094,9 @@ static void action_trace(struct hist_trigger_data *hist_data, struct ring_buffer_event *rbe, void *key, struct action_data *data, u64 *var_ref_vals) { - struct synth_event *event = data->onmatch.synth_event; + struct synth_event *event = data->synth_event; - trace_synth(event, var_ref_vals, data->onmatch.var_ref_idx); + trace_synth(event, var_ref_vals, data->var_ref_idx); } struct hist_var_data { @@ -1657,8 +1658,8 @@ find_match_var(struct hist_trigger_data *hist_data, char *var_name) struct action_data *data = hist_data->actions[i]; if (data->handler == HANDLER_ONMATCH) { - char *system = data->onmatch.match_event_system; - char *event_name = data->onmatch.match_event; + char *system = data->match_data.event_system; + char *event_name = data->match_data.event; file = find_var_file(tr, system, event_name, var_name); if (!file) @@ -3425,22 +3426,33 @@ static void onmax_save(struct hist_trigger_data *hist_data, update_max_vars(hist_data, elt, rbe, rec); } -static void onmax_destroy(struct action_data *data) +static void action_data_destroy(struct action_data *data) { unsigned int i; - destroy_hist_field(data->onmax.max_var, 0); - destroy_hist_field(data->onmax.var, 0); + lockdep_assert_held(&event_mutex); - kfree(data->onmax.var_str); kfree(data->action_name); for (i = 0; i < data->n_params; i++) kfree(data->params[i]); + if (data->synth_event) + data->synth_event->ref--; + kfree(data); } +static void onmax_destroy(struct action_data *data) +{ + destroy_hist_field(data->onmax.max_var, 0); + destroy_hist_field(data->onmax.var, 0); + + kfree(data->onmax.var_str); + + action_data_destroy(data); +} + static int action_create(struct hist_trigger_data *hist_data, struct action_data *data); @@ -3620,21 +3632,10 @@ static struct action_data *onmax_parse(char *str, enum handler_id handler) static void onmatch_destroy(struct action_data *data) { - unsigned int i; - - lockdep_assert_held(&event_mutex); + kfree(data->match_data.event); + kfree(data->match_data.event_system); - kfree(data->onmatch.match_event); - kfree(data->onmatch.match_event_system); - kfree(data->action_name); - - for (i = 0; i < data->n_params; i++) - kfree(data->params[i]); - - if (data->onmatch.synth_event) - data->onmatch.synth_event->ref--; - - kfree(data); + action_data_destroy(data); } static void destroy_field_var(struct field_var *field_var) @@ -3695,8 +3696,8 @@ trace_action_find_var(struct hist_trigger_data *hist_data, hist_field = find_target_event_var(hist_data, system, event, var); if (!hist_field) { if (!system && data->handler == HANDLER_ONMATCH) { - system = data->onmatch.match_event_system; - event = data->onmatch.match_event; + system = data->match_data.event_system; + event = data->match_data.event; } hist_field = find_event_var(hist_data, system, event, var); @@ -3735,8 +3736,8 @@ trace_action_create_field_var(struct hist_trigger_data *hist_data, * event. */ if (!system && data->handler == HANDLER_ONMATCH) { - system = data->onmatch.match_event_system; - event = data->onmatch.match_event; + system = data->match_data.event_system; + event = data->match_data.event; } /* @@ -3846,8 +3847,8 @@ static int trace_action_create(struct hist_trigger_data *hist_data, goto err; } - data->onmatch.synth_event = event; - data->onmatch.var_ref_idx = var_ref_idx; + data->synth_event = event; + data->var_ref_idx = var_ref_idx; out: return ret; err: @@ -3933,14 +3934,14 @@ static struct action_data *onmatch_parse(struct trace_array *tr, char *str) goto free; } - data->onmatch.match_event = kstrdup(match_event, GFP_KERNEL); - if (!data->onmatch.match_event) { + data->match_data.event = kstrdup(match_event, GFP_KERNEL); + if (!data->match_data.event) { ret = -ENOMEM; goto free; } - data->onmatch.match_event_system = kstrdup(match_event_system, GFP_KERNEL); - if (!data->onmatch.match_event_system) { + data->match_data.event_system = kstrdup(match_event_system, GFP_KERNEL); + if (!data->match_data.event_system) { ret = -ENOMEM; goto free; } @@ -4511,8 +4512,8 @@ static void print_onmatch_spec(struct seq_file *m, struct hist_trigger_data *hist_data, struct action_data *data) { - seq_printf(m, ":onmatch(%s.%s).", data->onmatch.match_event_system, - data->onmatch.match_event); + seq_printf(m, ":onmatch(%s.%s).", data->match_data.event_system, + data->match_data.event); seq_printf(m, "%s(", data->action_name); @@ -4550,11 +4551,11 @@ static bool actions_match(struct hist_trigger_data *hist_data, return false; if (data->handler == HANDLER_ONMATCH) { - if (strcmp(data->onmatch.match_event_system, - data_test->onmatch.match_event_system) != 0) + if (strcmp(data->match_data.event_system, + data_test->match_data.event_system) != 0) return false; - if (strcmp(data->onmatch.match_event, - data_test->onmatch.match_event) != 0) + if (strcmp(data->match_data.event, + data_test->match_data.event) != 0) return false; } else if (data->handler == HANDLER_ONMAX) { if (strcmp(data->onmax.var_str, -- cgit v1.3-7-g2ca7 From 466f4528fbc692ea56deca278fa6aeb79e6e8b21 Mon Sep 17 00:00:00 2001 From: Tom Zanussi Date: Wed, 13 Feb 2019 17:42:44 -0600 Subject: tracing: Generalize hist trigger onmax and save action The action refactor code allowed actions and handlers to be separated, but the existing onmax handler and save action code is still not flexible enough to handle arbitrary coupling. This change generalizes them and in the process makes additional handlers and actions easier to implement. The onmax action can be broken up and thought of as two separate components - a variable to be tracked (the parameter given to the onmax($var_to_track) function) and an invisible variable created to save the ongoing result of doing something with that variable, such as saving the max value of that variable so far seen. Separating it out like this and renaming it appropriately allows us to use the same code for similar tracking functions such as onchange($var_to_track), which would just track the last value seen rather than the max seen so far, which is useful in some situations. Additionally, because different handlers and actions may want to save and access data differently e.g. save and retrieve tracking values as local variables vs something more global, save_val() and get_val() interface functions are introduced and max-specific implementations are used instead. The same goes for the code that checks whether a maximum has been hit - a generic check_val() interface and max-checking implementation is used instead, which allows future patches to make use of he same code using their own implemetations of similar functionality. Link: http://lkml.kernel.org/r/980ea73dd8e3f36db3d646f99652f8fed42b77d4.1550100284.git.tom.zanussi@linux.intel.com Signed-off-by: Tom Zanussi Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_events_hist.c | 236 ++++++++++++++++++++++++++------------- 1 file changed, 160 insertions(+), 76 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c index 0b843ecef547..0515229e5f95 100644 --- a/kernel/trace/trace_events_hist.c +++ b/kernel/trace/trace_events_hist.c @@ -386,6 +386,8 @@ typedef void (*action_fn_t) (struct hist_trigger_data *hist_data, struct ring_buffer_event *rbe, void *key, struct action_data *data, u64 *var_ref_vals); +typedef bool (*check_track_val_fn_t) (u64 track_val, u64 var_val); + enum handler_id { HANDLER_ONMATCH = 1, HANDLER_ONMAX, @@ -423,15 +425,35 @@ struct action_data { } match_data; struct { + /* + * var_str contains the $-unstripped variable + * name referenced by var_ref, and used when + * printing the action. Because var_ref + * creation is deferred to create_actions(), + * we need a per-action way to save it until + * then, thus var_str. + */ char *var_str; - unsigned int max_var_ref_idx; - struct hist_field *max_var; - struct hist_field *var; - } onmax; + + /* + * var_ref refers to the variable being + * tracked e.g onmax($var). + */ + struct hist_field *var_ref; + + /* + * track_var contains the 'invisible' tracking + * variable created to keep the current + * e.g. max value. + */ + struct hist_field *track_var; + + check_track_val_fn_t check_val; + action_fn_t save_data; + } track_data; }; }; - static char last_hist_cmd[MAX_FILTER_STR_VAL]; static char hist_err_str[MAX_FILTER_STR_VAL]; @@ -3238,10 +3260,10 @@ static void update_field_vars(struct hist_trigger_data *hist_data, hist_data->n_field_vars, 0); } -static void update_max_vars(struct hist_trigger_data *hist_data, - struct tracing_map_elt *elt, - struct ring_buffer_event *rbe, - void *rec) +static void save_track_data_vars(struct hist_trigger_data *hist_data, + struct tracing_map_elt *elt, void *rec, + struct ring_buffer_event *rbe, void *key, + struct action_data *data, u64 *var_ref_vals) { __update_field_vars(elt, rbe, rec, hist_data->save_vars, hist_data->n_save_vars, hist_data->n_field_var_str); @@ -3379,14 +3401,67 @@ create_target_field_var(struct hist_trigger_data *target_hist_data, return create_field_var(target_hist_data, file, var_name); } -static void onmax_print(struct seq_file *m, - struct hist_trigger_data *hist_data, - struct tracing_map_elt *elt, - struct action_data *data) +static bool check_track_val_max(u64 track_val, u64 var_val) +{ + if (var_val <= track_val) + return false; + + return true; +} + +static u64 get_track_val(struct hist_trigger_data *hist_data, + struct tracing_map_elt *elt, + struct action_data *data) { - unsigned int i, save_var_idx, max_idx = data->onmax.max_var->var.idx; + unsigned int track_var_idx = data->track_data.track_var->var.idx; + u64 track_val; + + track_val = tracing_map_read_var(elt, track_var_idx); - seq_printf(m, "\n\tmax: %10llu", tracing_map_read_var(elt, max_idx)); + return track_val; +} + +static void save_track_val(struct hist_trigger_data *hist_data, + struct tracing_map_elt *elt, + struct action_data *data, u64 var_val) +{ + unsigned int track_var_idx = data->track_data.track_var->var.idx; + + tracing_map_set_var(elt, track_var_idx, var_val); +} + +static void save_track_data(struct hist_trigger_data *hist_data, + struct tracing_map_elt *elt, void *rec, + struct ring_buffer_event *rbe, void *key, + struct action_data *data, u64 *var_ref_vals) +{ + if (data->track_data.save_data) + data->track_data.save_data(hist_data, elt, rec, rbe, key, data, var_ref_vals); +} + +static bool check_track_val(struct tracing_map_elt *elt, + struct action_data *data, + u64 var_val) +{ + struct hist_trigger_data *hist_data; + u64 track_val; + + hist_data = data->track_data.track_var->hist_data; + track_val = get_track_val(hist_data, elt, data); + + return data->track_data.check_val(track_val, var_val); +} + +static void track_data_print(struct seq_file *m, + struct hist_trigger_data *hist_data, + struct tracing_map_elt *elt, + struct action_data *data) +{ + u64 track_val = get_track_val(hist_data, elt, data); + unsigned int i, save_var_idx; + + if (data->handler == HANDLER_ONMAX) + seq_printf(m, "\n\tmax: %10llu", track_val); for (i = 0; i < hist_data->n_save_vars; i++) { struct hist_field *save_val = hist_data->save_vars[i]->val; @@ -3405,25 +3480,17 @@ static void onmax_print(struct seq_file *m, } } -static void onmax_save(struct hist_trigger_data *hist_data, - struct tracing_map_elt *elt, void *rec, - struct ring_buffer_event *rbe, void *key, - struct action_data *data, u64 *var_ref_vals) +static void ontrack_action(struct hist_trigger_data *hist_data, + struct tracing_map_elt *elt, void *rec, + struct ring_buffer_event *rbe, void *key, + struct action_data *data, u64 *var_ref_vals) { - unsigned int max_idx = data->onmax.max_var->var.idx; - unsigned int max_var_ref_idx = data->onmax.max_var_ref_idx; + u64 var_val = var_ref_vals[data->track_data.var_ref->var_ref_idx]; - u64 var_val, max_val; - - var_val = var_ref_vals[max_var_ref_idx]; - max_val = tracing_map_read_var(elt, max_idx); - - if (var_val <= max_val) - return; - - tracing_map_set_var(elt, max_idx, var_val); - - update_max_vars(hist_data, elt, rbe, rec); + if (check_track_val(elt, data, var_val)) { + save_track_val(hist_data, elt, data, var_val); + save_track_data(hist_data, elt, rec, rbe, key, data, var_ref_vals); + } } static void action_data_destroy(struct action_data *data) @@ -3443,12 +3510,13 @@ static void action_data_destroy(struct action_data *data) kfree(data); } -static void onmax_destroy(struct action_data *data) +static void track_data_destroy(struct hist_trigger_data *hist_data, + struct action_data *data) { - destroy_hist_field(data->onmax.max_var, 0); - destroy_hist_field(data->onmax.var, 0); + destroy_hist_field(data->track_data.track_var, 0); + destroy_hist_field(data->track_data.var_ref, 0); - kfree(data->onmax.var_str); + kfree(data->track_data.var_str); action_data_destroy(data); } @@ -3456,25 +3524,24 @@ static void onmax_destroy(struct action_data *data) static int action_create(struct hist_trigger_data *hist_data, struct action_data *data); -static int onmax_create(struct hist_trigger_data *hist_data, - struct action_data *data) +static int track_data_create(struct hist_trigger_data *hist_data, + struct action_data *data) { - struct hist_field *var_field, *ref_field, *max_var = NULL; + struct hist_field *var_field, *ref_field, *track_var = NULL; struct trace_event_file *file = hist_data->event_file; - unsigned int var_ref_idx = hist_data->n_var_refs; - char *onmax_var_str; + char *track_data_var_str; int ret = 0; - onmax_var_str = data->onmax.var_str; - if (onmax_var_str[0] != '$') { - hist_err("onmax: For onmax(x), x must be a variable: ", onmax_var_str); + track_data_var_str = data->track_data.var_str; + if (track_data_var_str[0] != '$') { + hist_err("For onmax(x), x must be a variable: ", track_data_var_str); return -EINVAL; } - onmax_var_str++; + track_data_var_str++; - var_field = find_target_event_var(hist_data, NULL, NULL, onmax_var_str); + var_field = find_target_event_var(hist_data, NULL, NULL, track_data_var_str); if (!var_field) { - hist_err("onmax: Couldn't find onmax variable: ", onmax_var_str); + hist_err("Couldn't find onmax variable: ", track_data_var_str); return -EINVAL; } @@ -3482,17 +3549,16 @@ static int onmax_create(struct hist_trigger_data *hist_data, if (!ref_field) return -ENOMEM; - data->onmax.var = ref_field; + data->track_data.var_ref = ref_field; - data->onmax.max_var_ref_idx = var_ref_idx; - - max_var = create_var(hist_data, file, "max", sizeof(u64), "u64"); - if (IS_ERR(max_var)) { - hist_err("onmax: Couldn't create onmax variable: ", "max"); - ret = PTR_ERR(max_var); + if (data->handler == HANDLER_ONMAX) + track_var = create_var(hist_data, file, "__max", sizeof(u64), "u64"); + if (IS_ERR(track_var)) { + hist_err("Couldn't create onmax variable: ", "__max"); + ret = PTR_ERR(track_var); goto out; } - data->onmax.max_var = max_var; + data->track_data.track_var = track_var; ret = action_create(hist_data, data); out: @@ -3570,8 +3636,15 @@ static int action_parse(char *str, struct action_data *data, goto out; if (handler == HANDLER_ONMAX) - data->fn = onmax_save; + data->track_data.check_val = check_track_val_max; + else { + hist_err("action parsing: Handler doesn't support action: ", action_name); + ret = -EINVAL; + goto out; + } + data->track_data.save_data = save_track_data_vars; + data->fn = ontrack_action; data->action = ACTION_SAVE; } else { char *params = strsep(&str, ")"); @@ -3582,7 +3655,15 @@ static int action_parse(char *str, struct action_data *data, goto out; } - data->fn = action_trace; + if (handler == HANDLER_ONMAX) + data->track_data.check_val = check_track_val_max; + + if (handler != HANDLER_ONMATCH) { + data->track_data.save_data = action_trace; + data->fn = ontrack_action; + } else + data->fn = action_trace; + data->action = ACTION_TRACE; } @@ -3597,24 +3678,25 @@ static int action_parse(char *str, struct action_data *data, return ret; } -static struct action_data *onmax_parse(char *str, enum handler_id handler) +static struct action_data *track_data_parse(struct hist_trigger_data *hist_data, + char *str, enum handler_id handler) { struct action_data *data; - char *onmax_var_str; int ret = -EINVAL; + char *var_str; data = kzalloc(sizeof(*data), GFP_KERNEL); if (!data) return ERR_PTR(-ENOMEM); - onmax_var_str = strsep(&str, ")"); - if (!onmax_var_str || !str) { + var_str = strsep(&str, ")"); + if (!var_str || !str) { ret = -EINVAL; goto free; } - data->onmax.var_str = kstrdup(onmax_var_str, GFP_KERNEL); - if (!data->onmax.var_str) { + data->track_data.var_str = kstrdup(var_str, GFP_KERNEL); + if (!data->track_data.var_str) { ret = -ENOMEM; goto free; } @@ -3625,7 +3707,7 @@ static struct action_data *onmax_parse(char *str, enum handler_id handler) out: return data; free: - onmax_destroy(data); + track_data_destroy(hist_data, data); data = ERR_PTR(ret); goto out; } @@ -4389,7 +4471,7 @@ static void destroy_actions(struct hist_trigger_data *hist_data) if (data->handler == HANDLER_ONMATCH) onmatch_destroy(data); else if (data->handler == HANDLER_ONMAX) - onmax_destroy(data); + track_data_destroy(hist_data, data); else kfree(data); } @@ -4418,7 +4500,8 @@ static int parse_actions(struct hist_trigger_data *hist_data) } else if ((len = str_has_prefix(str, "onmax("))) { char *action_str = str + len; - data = onmax_parse(action_str, HANDLER_ONMAX); + data = track_data_parse(hist_data, action_str, + HANDLER_ONMAX); if (IS_ERR(data)) { ret = PTR_ERR(data); break; @@ -4448,7 +4531,7 @@ static int create_actions(struct hist_trigger_data *hist_data) if (ret) break; } else if (data->handler == HANDLER_ONMAX) { - ret = onmax_create(hist_data, data); + ret = track_data_create(hist_data, data); if (ret) break; } else { @@ -4470,7 +4553,7 @@ static void print_actions(struct seq_file *m, struct action_data *data = hist_data->actions[i]; if (data->handler == HANDLER_ONMAX) - onmax_print(m, hist_data, elt, data); + track_data_print(m, hist_data, elt, data); } } @@ -4495,12 +4578,13 @@ static void print_action_spec(struct seq_file *m, } } -static void print_onmax_spec(struct seq_file *m, - struct hist_trigger_data *hist_data, - struct action_data *data) +static void print_track_data_spec(struct seq_file *m, + struct hist_trigger_data *hist_data, + struct action_data *data) { - seq_puts(m, ":onmax("); - seq_printf(m, "%s", data->onmax.var_str); + if (data->handler == HANDLER_ONMAX) + seq_puts(m, ":onmax("); + seq_printf(m, "%s", data->track_data.var_str); seq_printf(m, ").%s(", data->action_name); print_action_spec(m, hist_data, data); @@ -4558,8 +4642,8 @@ static bool actions_match(struct hist_trigger_data *hist_data, data_test->match_data.event) != 0) return false; } else if (data->handler == HANDLER_ONMAX) { - if (strcmp(data->onmax.var_str, - data_test->onmax.var_str) != 0) + if (strcmp(data->track_data.var_str, + data_test->track_data.var_str) != 0) return false; } } @@ -4579,7 +4663,7 @@ static void print_actions_spec(struct seq_file *m, if (data->handler == HANDLER_ONMATCH) print_onmatch_spec(m, hist_data, data); else if (data->handler == HANDLER_ONMAX) - print_onmax_spec(m, hist_data, data); + print_track_data_spec(m, hist_data, data); } } -- cgit v1.3-7-g2ca7 From a35873a0993b4d38b40871f10fa4356c088c7140 Mon Sep 17 00:00:00 2001 From: Tom Zanussi Date: Wed, 13 Feb 2019 17:42:45 -0600 Subject: tracing: Add conditional snapshot Currently, tracing snapshots are context-free - they capture the ring buffer contents at the time the tracing_snapshot() function was invoked, and nothing else. Additionally, they're always taken unconditionally - the calling code can decide whether or not to take a snapshot, but the data used to make that decision is kept separately from the snapshot itself. This change adds the ability to associate with each trace instance some user data, along with an 'update' function that can use that data to determine whether or not to actually take a snapshot. The update function can then update that data along with any other state (as part of the data presumably), if warranted. Because snapshots are 'global' per-instance, only one user can enable and use a conditional snapshot for any given trace instance. To enable a conditional snapshot (see details in the function and data structure comments), the user calls tracing_snapshot_cond_enable(). Similarly, to disable a conditional snapshot and free it up for other users, tracing_snapshot_cond_disable() should be called. To actually initiate a conditional snapshot, tracing_snapshot_cond() should be called. tracing_snapshot_cond() will invoke the update() callback, allowing the user to decide whether or not to actually take the snapshot and update the user-defined data associated with the snapshot. If the callback returns 'true', tracing_snapshot_cond() will then actually take the snapshot and return. This scheme allows for flexibility in snapshot implementations - for example, by implementing slightly different update() callbacks, snapshots can be taken in situations where the user is only interested in taking a snapshot when a new maximum in hit versus when a value changes in any way at all. Future patches will demonstrate both cases. Link: http://lkml.kernel.org/r/1bea07828d5fd6864a585f83b1eed47ce097eb45.1550100284.git.tom.zanussi@linux.intel.com Signed-off-by: Tom Zanussi Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace.c | 192 +++++++++++++++++++++++++++++++++++++- kernel/trace/trace.h | 56 ++++++++++- kernel/trace/trace_sched_wakeup.c | 2 +- 3 files changed, 244 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index b477926ac3bc..9f4d56f74b46 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -894,7 +894,7 @@ int __trace_bputs(unsigned long ip, const char *str) EXPORT_SYMBOL_GPL(__trace_bputs); #ifdef CONFIG_TRACER_SNAPSHOT -void tracing_snapshot_instance(struct trace_array *tr) +void tracing_snapshot_instance_cond(struct trace_array *tr, void *cond_data) { struct tracer *tracer = tr->current_trace; unsigned long flags; @@ -920,10 +920,15 @@ void tracing_snapshot_instance(struct trace_array *tr) } local_irq_save(flags); - update_max_tr(tr, current, smp_processor_id()); + update_max_tr(tr, current, smp_processor_id(), cond_data); local_irq_restore(flags); } +void tracing_snapshot_instance(struct trace_array *tr) +{ + tracing_snapshot_instance_cond(tr, NULL); +} + /** * tracing_snapshot - take a snapshot of the current buffer. * @@ -946,6 +951,54 @@ void tracing_snapshot(void) } EXPORT_SYMBOL_GPL(tracing_snapshot); +/** + * tracing_snapshot_cond - conditionally take a snapshot of the current buffer. + * @tr: The tracing instance to snapshot + * @cond_data: The data to be tested conditionally, and possibly saved + * + * This is the same as tracing_snapshot() except that the snapshot is + * conditional - the snapshot will only happen if the + * cond_snapshot.update() implementation receiving the cond_data + * returns true, which means that the trace array's cond_snapshot + * update() operation used the cond_data to determine whether the + * snapshot should be taken, and if it was, presumably saved it along + * with the snapshot. + */ +void tracing_snapshot_cond(struct trace_array *tr, void *cond_data) +{ + tracing_snapshot_instance_cond(tr, cond_data); +} +EXPORT_SYMBOL_GPL(tracing_snapshot_cond); + +/** + * tracing_snapshot_cond_data - get the user data associated with a snapshot + * @tr: The tracing instance + * + * When the user enables a conditional snapshot using + * tracing_snapshot_cond_enable(), the user-defined cond_data is saved + * with the snapshot. This accessor is used to retrieve it. + * + * Should not be called from cond_snapshot.update(), since it takes + * the tr->max_lock lock, which the code calling + * cond_snapshot.update() has already done. + * + * Returns the cond_data associated with the trace array's snapshot. + */ +void *tracing_cond_snapshot_data(struct trace_array *tr) +{ + void *cond_data = NULL; + + arch_spin_lock(&tr->max_lock); + + if (tr->cond_snapshot) + cond_data = tr->cond_snapshot->cond_data; + + arch_spin_unlock(&tr->max_lock); + + return cond_data; +} +EXPORT_SYMBOL_GPL(tracing_cond_snapshot_data); + static int resize_buffer_duplicate_size(struct trace_buffer *trace_buf, struct trace_buffer *size_buf, int cpu_id); static void set_buffer_entries(struct trace_buffer *buf, unsigned long val); @@ -1025,12 +1078,103 @@ void tracing_snapshot_alloc(void) tracing_snapshot(); } EXPORT_SYMBOL_GPL(tracing_snapshot_alloc); + +/** + * tracing_snapshot_cond_enable - enable conditional snapshot for an instance + * @tr: The tracing instance + * @cond_data: User data to associate with the snapshot + * @update: Implementation of the cond_snapshot update function + * + * Check whether the conditional snapshot for the given instance has + * already been enabled, or if the current tracer is already using a + * snapshot; if so, return -EBUSY, else create a cond_snapshot and + * save the cond_data and update function inside. + * + * Returns 0 if successful, error otherwise. + */ +int tracing_snapshot_cond_enable(struct trace_array *tr, void *cond_data, + cond_update_fn_t update) +{ + struct cond_snapshot *cond_snapshot; + int ret = 0; + + cond_snapshot = kzalloc(sizeof(*cond_snapshot), GFP_KERNEL); + if (!cond_snapshot) + return -ENOMEM; + + cond_snapshot->cond_data = cond_data; + cond_snapshot->update = update; + + mutex_lock(&trace_types_lock); + + ret = tracing_alloc_snapshot_instance(tr); + if (ret) + goto fail_unlock; + + if (tr->current_trace->use_max_tr) { + ret = -EBUSY; + goto fail_unlock; + } + + if (tr->cond_snapshot) { + ret = -EBUSY; + goto fail_unlock; + } + + arch_spin_lock(&tr->max_lock); + tr->cond_snapshot = cond_snapshot; + arch_spin_unlock(&tr->max_lock); + + mutex_unlock(&trace_types_lock); + + return ret; + + fail_unlock: + mutex_unlock(&trace_types_lock); + kfree(cond_snapshot); + return ret; +} +EXPORT_SYMBOL_GPL(tracing_snapshot_cond_enable); + +/** + * tracing_snapshot_cond_disable - disable conditional snapshot for an instance + * @tr: The tracing instance + * + * Check whether the conditional snapshot for the given instance is + * enabled; if so, free the cond_snapshot associated with it, + * otherwise return -EINVAL. + * + * Returns 0 if successful, error otherwise. + */ +int tracing_snapshot_cond_disable(struct trace_array *tr) +{ + int ret = 0; + + arch_spin_lock(&tr->max_lock); + + if (!tr->cond_snapshot) + ret = -EINVAL; + else { + kfree(tr->cond_snapshot); + tr->cond_snapshot = NULL; + } + + arch_spin_unlock(&tr->max_lock); + + return ret; +} +EXPORT_SYMBOL_GPL(tracing_snapshot_cond_disable); #else void tracing_snapshot(void) { WARN_ONCE(1, "Snapshot feature not enabled, but internal snapshot used"); } EXPORT_SYMBOL_GPL(tracing_snapshot); +void tracing_snapshot_cond(struct trace_array *tr, void *cond_data) +{ + WARN_ONCE(1, "Snapshot feature not enabled, but internal conditional snapshot used"); +} +EXPORT_SYMBOL_GPL(tracing_snapshot_cond); int tracing_alloc_snapshot(void) { WARN_ONCE(1, "Snapshot feature not enabled, but snapshot allocation used"); @@ -1043,6 +1187,21 @@ void tracing_snapshot_alloc(void) tracing_snapshot(); } EXPORT_SYMBOL_GPL(tracing_snapshot_alloc); +void *tracing_cond_snapshot_data(struct trace_array *tr) +{ + return NULL; +} +EXPORT_SYMBOL_GPL(tracing_cond_snapshot_data); +int tracing_snapshot_cond_enable(struct trace_array *tr, void *cond_data, cond_update_fn_t update) +{ + return -ENODEV; +} +EXPORT_SYMBOL_GPL(tracing_snapshot_cond_enable); +int tracing_snapshot_cond_disable(struct trace_array *tr) +{ + return false; +} +EXPORT_SYMBOL_GPL(tracing_snapshot_cond_disable); #endif /* CONFIG_TRACER_SNAPSHOT */ void tracer_tracing_off(struct trace_array *tr) @@ -1354,12 +1513,14 @@ __update_max_tr(struct trace_array *tr, struct task_struct *tsk, int cpu) * @tr: tracer * @tsk: the task with the latency * @cpu: The cpu that initiated the trace. + * @cond_data: User data associated with a conditional snapshot * * Flip the buffers between the @tr and the max_tr and record information * about which task was the cause of this latency. */ void -update_max_tr(struct trace_array *tr, struct task_struct *tsk, int cpu) +update_max_tr(struct trace_array *tr, struct task_struct *tsk, int cpu, + void *cond_data) { if (tr->stop_count) return; @@ -1380,9 +1541,15 @@ update_max_tr(struct trace_array *tr, struct task_struct *tsk, int cpu) else ring_buffer_record_off(tr->max_buffer.buffer); +#ifdef CONFIG_TRACER_SNAPSHOT + if (tr->cond_snapshot && !tr->cond_snapshot->update(tr, cond_data)) + goto out_unlock; +#endif swap(tr->trace_buffer.buffer, tr->max_buffer.buffer); __update_max_tr(tr, tsk, cpu); + + out_unlock: arch_spin_unlock(&tr->max_lock); } @@ -5396,6 +5563,16 @@ static int tracing_set_tracer(struct trace_array *tr, const char *buf) if (t == tr->current_trace) goto out; +#ifdef CONFIG_TRACER_SNAPSHOT + if (t->use_max_tr) { + arch_spin_lock(&tr->max_lock); + if (tr->cond_snapshot) + ret = -EBUSY; + arch_spin_unlock(&tr->max_lock); + if (ret) + goto out; + } +#endif /* Some tracers won't work on kernel command line */ if (system_state < SYSTEM_RUNNING && t->noboot) { pr_warn("Tracer '%s' is not allowed on command line, ignored\n", @@ -6477,6 +6654,13 @@ tracing_snapshot_write(struct file *filp, const char __user *ubuf, size_t cnt, goto out; } + arch_spin_lock(&tr->max_lock); + if (tr->cond_snapshot) + ret = -EBUSY; + arch_spin_unlock(&tr->max_lock); + if (ret) + goto out; + switch (val) { case 0: if (iter->cpu_file != RING_BUFFER_ALL_CPUS) { @@ -6502,7 +6686,7 @@ tracing_snapshot_write(struct file *filp, const char __user *ubuf, size_t cnt, local_irq_disable(); /* Now, we're going to swap */ if (iter->cpu_file == RING_BUFFER_ALL_CPUS) - update_max_tr(tr, current, smp_processor_id()); + update_max_tr(tr, current, smp_processor_id(), NULL); else update_max_tr_single(tr, current, iter->cpu_file); local_irq_enable(); diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h index ae7df090b93e..d80cee49e0eb 100644 --- a/kernel/trace/trace.h +++ b/kernel/trace/trace.h @@ -194,6 +194,51 @@ struct trace_pid_list { unsigned long *pids; }; +typedef bool (*cond_update_fn_t)(struct trace_array *tr, void *cond_data); + +/** + * struct cond_snapshot - conditional snapshot data and callback + * + * The cond_snapshot structure encapsulates a callback function and + * data associated with the snapshot for a given tracing instance. + * + * When a snapshot is taken conditionally, by invoking + * tracing_snapshot_cond(tr, cond_data), the cond_data passed in is + * passed in turn to the cond_snapshot.update() function. That data + * can be compared by the update() implementation with the cond_data + * contained wihin the struct cond_snapshot instance associated with + * the trace_array. Because the tr->max_lock is held throughout the + * update() call, the update() function can directly retrieve the + * cond_snapshot and cond_data associated with the per-instance + * snapshot associated with the trace_array. + * + * The cond_snapshot.update() implementation can save data to be + * associated with the snapshot if it decides to, and returns 'true' + * in that case, or it returns 'false' if the conditional snapshot + * shouldn't be taken. + * + * The cond_snapshot instance is created and associated with the + * user-defined cond_data by tracing_cond_snapshot_enable(). + * Likewise, the cond_snapshot instance is destroyed and is no longer + * associated with the trace instance by + * tracing_cond_snapshot_disable(). + * + * The method below is required. + * + * @update: When a conditional snapshot is invoked, the update() + * callback function is invoked with the tr->max_lock held. The + * update() implementation signals whether or not to actually + * take the snapshot, by returning 'true' if so, 'false' if no + * snapshot should be taken. Because the max_lock is held for + * the duration of update(), the implementation is safe to + * directly retrieven and save any implementation data it needs + * to in association with the snapshot. + */ +struct cond_snapshot { + void *cond_data; + cond_update_fn_t update; +}; + /* * The trace array - an array of per-CPU trace arrays. This is the * highest level data structure that individual tracers deal with. @@ -277,6 +322,9 @@ struct trace_array { #endif int time_stamp_abs_ref; struct list_head hist_vars; +#ifdef CONFIG_TRACER_SNAPSHOT + struct cond_snapshot *cond_snapshot; +#endif }; enum { @@ -727,7 +775,8 @@ int trace_pid_write(struct trace_pid_list *filtered_pids, const char __user *ubuf, size_t cnt); #ifdef CONFIG_TRACER_MAX_TRACE -void update_max_tr(struct trace_array *tr, struct task_struct *tsk, int cpu); +void update_max_tr(struct trace_array *tr, struct task_struct *tsk, int cpu, + void *cond_data); void update_max_tr_single(struct trace_array *tr, struct task_struct *tsk, int cpu); #endif /* CONFIG_TRACER_MAX_TRACE */ @@ -1810,6 +1859,11 @@ static inline bool event_command_needs_rec(struct event_command *cmd_ops) extern int trace_event_enable_disable(struct trace_event_file *file, int enable, int soft_disable); extern int tracing_alloc_snapshot(void); +extern void tracing_snapshot_cond(struct trace_array *tr, void *cond_data); +extern int tracing_snapshot_cond_enable(struct trace_array *tr, void *cond_data, cond_update_fn_t update); + +extern int tracing_snapshot_cond_disable(struct trace_array *tr); +extern void *tracing_cond_snapshot_data(struct trace_array *tr); extern const char *__start___trace_bprintk_fmt[]; extern const char *__stop___trace_bprintk_fmt[]; diff --git a/kernel/trace/trace_sched_wakeup.c b/kernel/trace/trace_sched_wakeup.c index f4fe7d1781e9..743b2b520d34 100644 --- a/kernel/trace/trace_sched_wakeup.c +++ b/kernel/trace/trace_sched_wakeup.c @@ -486,7 +486,7 @@ probe_wakeup_sched_switch(void *ignore, bool preempt, if (likely(!is_tracing_stopped())) { wakeup_trace->max_latency = delta; - update_max_tr(wakeup_trace, wakeup_task, wakeup_cpu); + update_max_tr(wakeup_trace, wakeup_task, wakeup_cpu, NULL); } out_unlock: -- cgit v1.3-7-g2ca7 From a3785b7eca8fd45c7c39f2ddfcd67368af30c1b4 Mon Sep 17 00:00:00 2001 From: Tom Zanussi Date: Wed, 13 Feb 2019 17:42:46 -0600 Subject: tracing: Add hist trigger snapshot() action Add support for hist:handlerXXX($var).snapshot(), which will take a snapshot of the current trace buffer whenever handlerXXX is hit. As a first user, this also adds snapshot() action support for the onmax() handler i.e. hist:onmax($var).snapshot(). Also, the hist trigger key printing is moved into a separate function so the snapshot() action can print a histogram key outside the histogram display - add and use hist_trigger_print_key() for that purpose. Link: http://lkml.kernel.org/r/2f1a952c0dcd8aca8702ce81269581a692396d45.1550100284.git.tom.zanussi@linux.intel.com Signed-off-by: Tom Zanussi Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace.c | 3 + kernel/trace/trace_events_hist.c | 266 +++++++++++++++++++++++++++++++++++++-- 2 files changed, 259 insertions(+), 10 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 9f4d56f74b46..dd60c14a0fb0 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -4919,6 +4919,9 @@ static const char readme_msg[] = "\t The available actions are:\n\n" "\t (param list) - generate synthetic event\n" "\t save(field,...) - save current event fields\n" +#ifdef CONFIG_TRACER_SNAPSHOT + "\t snapshot() - snapshot the trace buffer\n" +#endif #endif ; diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c index 0515229e5f95..571937a268a3 100644 --- a/kernel/trace/trace_events_hist.c +++ b/kernel/trace/trace_events_hist.c @@ -396,6 +396,7 @@ enum handler_id { enum action_id { ACTION_SAVE = 1, ACTION_TRACE, + ACTION_SNAPSHOT, }; struct action_data { @@ -454,6 +455,83 @@ struct action_data { }; }; +struct track_data { + u64 track_val; + bool updated; + + unsigned int key_len; + void *key; + struct tracing_map_elt elt; + + struct action_data *action_data; + struct hist_trigger_data *hist_data; +}; + +struct hist_elt_data { + char *comm; + u64 *var_ref_vals; + char *field_var_str[SYNTH_FIELDS_MAX]; +}; + +struct snapshot_context { + struct tracing_map_elt *elt; + void *key; +}; + +static void track_data_free(struct track_data *track_data) +{ + struct hist_elt_data *elt_data; + + if (!track_data) + return; + + kfree(track_data->key); + + elt_data = track_data->elt.private_data; + if (elt_data) { + kfree(elt_data->comm); + kfree(elt_data); + } + + kfree(track_data); +} + +static struct track_data *track_data_alloc(unsigned int key_len, + struct action_data *action_data, + struct hist_trigger_data *hist_data) +{ + struct track_data *data = kzalloc(sizeof(*data), GFP_KERNEL); + struct hist_elt_data *elt_data; + + if (!data) + return ERR_PTR(-ENOMEM); + + data->key = kzalloc(key_len, GFP_KERNEL); + if (!data->key) { + track_data_free(data); + return ERR_PTR(-ENOMEM); + } + + data->key_len = key_len; + data->action_data = action_data; + data->hist_data = hist_data; + + elt_data = kzalloc(sizeof(*elt_data), GFP_KERNEL); + if (!elt_data) { + track_data_free(data); + return ERR_PTR(-ENOMEM); + } + data->elt.private_data = elt_data; + + elt_data->comm = kzalloc(TASK_COMM_LEN, GFP_KERNEL); + if (!elt_data->comm) { + track_data_free(data); + return ERR_PTR(-ENOMEM); + } + + return data; +} + static char last_hist_cmd[MAX_FILTER_STR_VAL]; static char hist_err_str[MAX_FILTER_STR_VAL]; @@ -1726,12 +1804,6 @@ static struct hist_field *find_event_var(struct hist_trigger_data *hist_data, return hist_field; } -struct hist_elt_data { - char *comm; - u64 *var_ref_vals; - char *field_var_str[SYNTH_FIELDS_MAX]; -}; - static u64 hist_field_var_ref(struct hist_field *hist_field, struct tracing_map_elt *elt, struct ring_buffer_event *rbe, @@ -3452,6 +3524,112 @@ static bool check_track_val(struct tracing_map_elt *elt, return data->track_data.check_val(track_val, var_val); } +#ifdef CONFIG_TRACER_SNAPSHOT +static bool cond_snapshot_update(struct trace_array *tr, void *cond_data) +{ + /* called with tr->max_lock held */ + struct track_data *track_data = tr->cond_snapshot->cond_data; + struct hist_elt_data *elt_data, *track_elt_data; + struct snapshot_context *context = cond_data; + u64 track_val; + + if (!track_data) + return false; + + track_val = get_track_val(track_data->hist_data, context->elt, + track_data->action_data); + + track_data->track_val = track_val; + memcpy(track_data->key, context->key, track_data->key_len); + + elt_data = context->elt->private_data; + track_elt_data = track_data->elt.private_data; + if (elt_data->comm) + memcpy(track_elt_data->comm, elt_data->comm, TASK_COMM_LEN); + + track_data->updated = true; + + return true; +} + +static void save_track_data_snapshot(struct hist_trigger_data *hist_data, + struct tracing_map_elt *elt, void *rec, + struct ring_buffer_event *rbe, void *key, + struct action_data *data, + u64 *var_ref_vals) +{ + struct trace_event_file *file = hist_data->event_file; + struct snapshot_context context; + + context.elt = elt; + context.key = key; + + tracing_snapshot_cond(file->tr, &context); +} + +static void hist_trigger_print_key(struct seq_file *m, + struct hist_trigger_data *hist_data, + void *key, + struct tracing_map_elt *elt); + +static struct action_data *snapshot_action(struct hist_trigger_data *hist_data) +{ + unsigned int i; + + if (!hist_data->n_actions) + return NULL; + + for (i = 0; i < hist_data->n_actions; i++) { + struct action_data *data = hist_data->actions[i]; + + if (data->action == ACTION_SNAPSHOT) + return data; + } + + return NULL; +} + +static void track_data_snapshot_print(struct seq_file *m, + struct hist_trigger_data *hist_data) +{ + struct trace_event_file *file = hist_data->event_file; + struct track_data *track_data; + struct action_data *action; + + track_data = tracing_cond_snapshot_data(file->tr); + if (!track_data) + return; + + if (!track_data->updated) + return; + + action = snapshot_action(hist_data); + if (!action) + return; + + seq_puts(m, "\nSnapshot taken (see tracing/snapshot). Details:\n"); + seq_printf(m, "\ttriggering value { %s(%s) }: %10llu", + action->handler == HANDLER_ONMAX ? "onmax" : "onchange", + action->track_data.var_str, track_data->track_val); + + seq_puts(m, "\ttriggered by event with key: "); + hist_trigger_print_key(m, hist_data, track_data->key, &track_data->elt); + seq_putc(m, '\n'); +} +#else +static bool cond_snapshot_update(struct trace_array *tr, void *cond_data) +{ + return false; +} +static void save_track_data_snapshot(struct hist_trigger_data *hist_data, + struct tracing_map_elt *elt, void *rec, + struct ring_buffer_event *rbe, void *key, + struct action_data *data, + u64 *var_ref_vals) {} +static void track_data_snapshot_print(struct seq_file *m, + struct hist_trigger_data *hist_data) {} +#endif /* CONFIG_TRACER_SNAPSHOT */ + static void track_data_print(struct seq_file *m, struct hist_trigger_data *hist_data, struct tracing_map_elt *elt, @@ -3463,6 +3641,9 @@ static void track_data_print(struct seq_file *m, if (data->handler == HANDLER_ONMAX) seq_printf(m, "\n\tmax: %10llu", track_val); + if (data->action == ACTION_SNAPSHOT) + return; + for (i = 0; i < hist_data->n_save_vars; i++) { struct hist_field *save_val = hist_data->save_vars[i]->val; struct hist_field *save_var = hist_data->save_vars[i]->var; @@ -3513,9 +3694,21 @@ static void action_data_destroy(struct action_data *data) static void track_data_destroy(struct hist_trigger_data *hist_data, struct action_data *data) { + struct trace_event_file *file = hist_data->event_file; + destroy_hist_field(data->track_data.track_var, 0); destroy_hist_field(data->track_data.var_ref, 0); + if (data->action == ACTION_SNAPSHOT) { + struct track_data *track_data; + + track_data = tracing_cond_snapshot_data(file->tr); + if (track_data && track_data->hist_data == hist_data) { + tracing_snapshot_cond_disable(file->tr); + track_data_free(track_data); + } + } + kfree(data->track_data.var_str); action_data_destroy(data); @@ -3646,6 +3839,26 @@ static int action_parse(char *str, struct action_data *data, data->track_data.save_data = save_track_data_vars; data->fn = ontrack_action; data->action = ACTION_SAVE; + } else if (str_has_prefix(action_name, "snapshot")) { + char *params = strsep(&str, ")"); + + if (!str) { + hist_err("action parsing: No closing paren found: %s", params); + ret = -EINVAL; + goto out; + } + + if (handler == HANDLER_ONMAX) + data->track_data.check_val = check_track_val_max; + else { + hist_err("action parsing: Handler doesn't support action: ", action_name); + ret = -EINVAL; + goto out; + } + + data->track_data.save_data = save_track_data_snapshot; + data->fn = ontrack_action; + data->action = ACTION_SNAPSHOT; } else { char *params = strsep(&str, ")"); @@ -3942,6 +4155,8 @@ static int trace_action_create(struct hist_trigger_data *hist_data, static int action_create(struct hist_trigger_data *hist_data, struct action_data *data) { + struct trace_event_file *file = hist_data->event_file; + struct track_data *track_data; struct field_var *field_var; unsigned int i; char *param; @@ -3950,6 +4165,21 @@ static int action_create(struct hist_trigger_data *hist_data, if (data->action == ACTION_TRACE) return trace_action_create(hist_data, data); + if (data->action == ACTION_SNAPSHOT) { + track_data = track_data_alloc(hist_data->key_size, data, hist_data); + if (IS_ERR(track_data)) { + ret = PTR_ERR(track_data); + goto out; + } + + ret = tracing_snapshot_cond_enable(file->tr, track_data, + cond_snapshot_update); + if (ret) + track_data_free(track_data); + + goto out; + } + if (data->action == ACTION_SAVE) { if (hist_data->n_save_vars) { ret = -EEXIST; @@ -4552,6 +4782,9 @@ static void print_actions(struct seq_file *m, for (i = 0; i < hist_data->n_actions; i++) { struct action_data *data = hist_data->actions[i]; + if (data->action == ACTION_SNAPSHOT) + continue; + if (data->handler == HANDLER_ONMAX) track_data_print(m, hist_data, elt, data); } @@ -4946,10 +5179,10 @@ static void hist_trigger_stacktrace_print(struct seq_file *m, } } -static void -hist_trigger_entry_print(struct seq_file *m, - struct hist_trigger_data *hist_data, void *key, - struct tracing_map_elt *elt) +static void hist_trigger_print_key(struct seq_file *m, + struct hist_trigger_data *hist_data, + void *key, + struct tracing_map_elt *elt) { struct hist_field *key_field; char str[KSYM_SYMBOL_LEN]; @@ -5025,6 +5258,17 @@ hist_trigger_entry_print(struct seq_file *m, seq_puts(m, " "); seq_puts(m, "}"); +} + +static void hist_trigger_entry_print(struct seq_file *m, + struct hist_trigger_data *hist_data, + void *key, + struct tracing_map_elt *elt) +{ + const char *field_name; + unsigned int i; + + hist_trigger_print_key(m, hist_data, key, elt); seq_printf(m, " hitcount: %10llu", tracing_map_read_sum(elt, HITCOUNT_IDX)); @@ -5091,6 +5335,8 @@ static void hist_trigger_show(struct seq_file *m, if (n_entries < 0) n_entries = 0; + track_data_snapshot_print(m, hist_data); + seq_printf(m, "\nTotals:\n Hits: %llu\n Entries: %u\n Dropped: %llu\n", (u64)atomic64_read(&hist_data->map->hits), n_entries, (u64)atomic64_read(&hist_data->map->drops)); -- cgit v1.3-7-g2ca7 From dff81f559285b5c6df287eb231a9d6b02057ef8b Mon Sep 17 00:00:00 2001 From: Tom Zanussi Date: Wed, 13 Feb 2019 17:42:48 -0600 Subject: tracing: Add hist trigger onchange() handler Add support for a hist:onchange($var) handler, similar to the onmax() handler but triggering whenever there's any change in $var, not just a max. Link: http://lkml.kernel.org/r/dfbc7e4ada242603e9ec3f049b5ad076a07dfd03.1550100284.git.tom.zanussi@linux.intel.com Signed-off-by: Tom Zanussi Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace.c | 3 ++- kernel/trace/trace_events_hist.c | 58 ++++++++++++++++++++++++++++++++++------ 2 files changed, 52 insertions(+), 9 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index dd60c14a0fb0..be6779f963c6 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -4915,7 +4915,8 @@ static const char readme_msg[] = "\t .\n\n" "\t The available handlers are:\n\n" "\t onmatch(matching.event) - invoke on addition or update\n" - "\t onmax(var) - invoke if var exceeds current max\n\n" + "\t onmax(var) - invoke if var exceeds current max\n" + "\t onchange(var) - invoke action if var changes\n\n" "\t The available actions are:\n\n" "\t (param list) - generate synthetic event\n" "\t save(field,...) - save current event fields\n" diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c index 571937a268a3..2f3323ca9d24 100644 --- a/kernel/trace/trace_events_hist.c +++ b/kernel/trace/trace_events_hist.c @@ -391,6 +391,7 @@ typedef bool (*check_track_val_fn_t) (u64 track_val, u64 var_val); enum handler_id { HANDLER_ONMATCH = 1, HANDLER_ONMAX, + HANDLER_ONCHANGE, }; enum action_id { @@ -1989,7 +1990,8 @@ static int parse_action(char *str, struct hist_trigger_attrs *attrs) return ret; if ((str_has_prefix(str, "onmatch(")) || - (str_has_prefix(str, "onmax("))) { + (str_has_prefix(str, "onmax(")) || + (str_has_prefix(str, "onchange("))) { attrs->action_str[attrs->n_actions] = kstrdup(str, GFP_KERNEL); if (!attrs->action_str[attrs->n_actions]) { ret = -ENOMEM; @@ -3481,6 +3483,14 @@ static bool check_track_val_max(u64 track_val, u64 var_val) return true; } +static bool check_track_val_changed(u64 track_val, u64 var_val) +{ + if (var_val == track_val) + return false; + + return true; +} + static u64 get_track_val(struct hist_trigger_data *hist_data, struct tracing_map_elt *elt, struct action_data *data) @@ -3640,6 +3650,8 @@ static void track_data_print(struct seq_file *m, if (data->handler == HANDLER_ONMAX) seq_printf(m, "\n\tmax: %10llu", track_val); + else if (data->handler == HANDLER_ONCHANGE) + seq_printf(m, "\n\tchanged: %10llu", track_val); if (data->action == ACTION_SNAPSHOT) return; @@ -3727,14 +3739,14 @@ static int track_data_create(struct hist_trigger_data *hist_data, track_data_var_str = data->track_data.var_str; if (track_data_var_str[0] != '$') { - hist_err("For onmax(x), x must be a variable: ", track_data_var_str); + hist_err("For onmax(x) or onchange(x), x must be a variable: ", track_data_var_str); return -EINVAL; } track_data_var_str++; var_field = find_target_event_var(hist_data, NULL, NULL, track_data_var_str); if (!var_field) { - hist_err("Couldn't find onmax variable: ", track_data_var_str); + hist_err("Couldn't find onmax or onchange variable: ", track_data_var_str); return -EINVAL; } @@ -3751,6 +3763,14 @@ static int track_data_create(struct hist_trigger_data *hist_data, ret = PTR_ERR(track_var); goto out; } + + if (data->handler == HANDLER_ONCHANGE) + track_var = create_var(hist_data, file, "__change", sizeof(u64), "u64"); + if (IS_ERR(track_var)) { + hist_err("Couldn't create onchange variable: ", "__change"); + ret = PTR_ERR(track_var); + goto out; + } data->track_data.track_var = track_var; ret = action_create(hist_data, data); @@ -3830,6 +3850,8 @@ static int action_parse(char *str, struct action_data *data, if (handler == HANDLER_ONMAX) data->track_data.check_val = check_track_val_max; + else if (handler == HANDLER_ONCHANGE) + data->track_data.check_val = check_track_val_changed; else { hist_err("action parsing: Handler doesn't support action: ", action_name); ret = -EINVAL; @@ -3850,6 +3872,8 @@ static int action_parse(char *str, struct action_data *data, if (handler == HANDLER_ONMAX) data->track_data.check_val = check_track_val_max; + else if (handler == HANDLER_ONCHANGE) + data->track_data.check_val = check_track_val_changed; else { hist_err("action parsing: Handler doesn't support action: ", action_name); ret = -EINVAL; @@ -3870,6 +3894,8 @@ static int action_parse(char *str, struct action_data *data, if (handler == HANDLER_ONMAX) data->track_data.check_val = check_track_val_max; + else if (handler == HANDLER_ONCHANGE) + data->track_data.check_val = check_track_val_changed; if (handler != HANDLER_ONMATCH) { data->track_data.save_data = action_trace; @@ -4700,7 +4726,8 @@ static void destroy_actions(struct hist_trigger_data *hist_data) if (data->handler == HANDLER_ONMATCH) onmatch_destroy(data); - else if (data->handler == HANDLER_ONMAX) + else if (data->handler == HANDLER_ONMAX || + data->handler == HANDLER_ONCHANGE) track_data_destroy(hist_data, data); else kfree(data); @@ -4736,6 +4763,15 @@ static int parse_actions(struct hist_trigger_data *hist_data) ret = PTR_ERR(data); break; } + } else if ((len = str_has_prefix(str, "onchange("))) { + char *action_str = str + len; + + data = track_data_parse(hist_data, action_str, + HANDLER_ONCHANGE); + if (IS_ERR(data)) { + ret = PTR_ERR(data); + break; + } } else { ret = -EINVAL; break; @@ -4760,7 +4796,8 @@ static int create_actions(struct hist_trigger_data *hist_data) ret = onmatch_create(hist_data, data); if (ret) break; - } else if (data->handler == HANDLER_ONMAX) { + } else if (data->handler == HANDLER_ONMAX || + data->handler == HANDLER_ONCHANGE) { ret = track_data_create(hist_data, data); if (ret) break; @@ -4785,7 +4822,8 @@ static void print_actions(struct seq_file *m, if (data->action == ACTION_SNAPSHOT) continue; - if (data->handler == HANDLER_ONMAX) + if (data->handler == HANDLER_ONMAX || + data->handler == HANDLER_ONCHANGE) track_data_print(m, hist_data, elt, data); } } @@ -4817,6 +4855,8 @@ static void print_track_data_spec(struct seq_file *m, { if (data->handler == HANDLER_ONMAX) seq_puts(m, ":onmax("); + else if (data->handler == HANDLER_ONCHANGE) + seq_puts(m, ":onchange("); seq_printf(m, "%s", data->track_data.var_str); seq_printf(m, ").%s(", data->action_name); @@ -4874,7 +4914,8 @@ static bool actions_match(struct hist_trigger_data *hist_data, if (strcmp(data->match_data.event, data_test->match_data.event) != 0) return false; - } else if (data->handler == HANDLER_ONMAX) { + } else if (data->handler == HANDLER_ONMAX || + data->handler == HANDLER_ONCHANGE) { if (strcmp(data->track_data.var_str, data_test->track_data.var_str) != 0) return false; @@ -4895,7 +4936,8 @@ static void print_actions_spec(struct seq_file *m, if (data->handler == HANDLER_ONMATCH) print_onmatch_spec(m, hist_data, data); - else if (data->handler == HANDLER_ONMAX) + else if (data->handler == HANDLER_ONMAX || + data->handler == HANDLER_ONCHANGE) print_track_data_spec(m, hist_data, data); } } -- cgit v1.3-7-g2ca7 From e91eefd731d933194940805bb1f75a4972dc607c Mon Sep 17 00:00:00 2001 From: Tom Zanussi Date: Wed, 13 Feb 2019 17:42:50 -0600 Subject: tracing: Add alternative synthetic event trace action syntax Add a 'trace(synthetic_event_name, params)' alternative to synthetic_event_name(params). Currently, the syntax used for generating synthetic events is to invoke synthetic_event_name(params) i.e. use the synthetic event name as a function call. Users requested a new form that more explicitly shows that the synthetic event is in effect being traced. In this version, a new 'trace()' keyword is used, and the synthetic event name is passed in as the first argument. In addition, for the sake of consistency with other actions, change the documention to emphasize the trace() form over the function-call form, which remains documented as equivalent. Link: http://lkml.kernel.org/r/d082773e50232a001480cf837679a1e01c1a2eb7.1550100284.git.tom.zanussi@linux.intel.com Signed-off-by: Tom Zanussi Signed-off-by: Steven Rostedt (VMware) --- Documentation/trace/histogram.rst | 54 +++++++++++++++++++++++++++------------ kernel/trace/trace.c | 2 +- kernel/trace/trace_events_hist.c | 42 +++++++++++++++++++++++++++--- 3 files changed, 76 insertions(+), 22 deletions(-) (limited to 'kernel') diff --git a/Documentation/trace/histogram.rst b/Documentation/trace/histogram.rst index 79476c906b1a..0ea59d45aef1 100644 --- a/Documentation/trace/histogram.rst +++ b/Documentation/trace/histogram.rst @@ -1873,31 +1873,45 @@ The available handlers are: The available actions are: - - (param list) - generate synthetic event + - trace(,param list) - generate synthetic event - save(field,...) - save current event fields - snapshot() - snapshot the trace buffer The following commonly-used handler.action pairs are available: - - onmatch(matching.event).(param list) + - onmatch(matching.event).trace(,param list) - The 'onmatch(matching.event).(params)' hist - trigger action is invoked whenever an event matches and the - histogram entry would be added or updated. It causes the named - synthetic event to be generated with the values given in the + The 'onmatch(matching.event).trace(,param + list)' hist trigger action is invoked whenever an event matches + and the histogram entry would be added or updated. It causes the + named synthetic event to be generated with the values given in the 'param list'. The result is the generation of a synthetic event that consists of the values contained in those variables at the - time the invoking event was hit. - - The 'param list' consists of one or more parameters which may be - either variables or fields defined on either the 'matching.event' - or the target event. The variables or fields specified in the - param list may be either fully-qualified or unqualified. If a - variable is specified as unqualified, it must be unique between - the two events. A field name used as a param can be unqualified - if it refers to the target event, but must be fully qualified if - it refers to the matching event. A fully-qualified name is of the - form 'system.event_name.$var_name' or 'system.event_name.field'. + time the invoking event was hit. For example, if the synthetic + event name is 'wakeup_latency', a wakeup_latency event is + generated using onmatch(event).trace(wakeup_latency,arg1,arg2). + + There is also an equivalent alternative form available for + generating synthetic events. In this form, the synthetic event + name is used as if it were a function name. For example, using + the 'wakeup_latency' synthetic event name again, the + wakeup_latency event would be generated by invoking it as if it + were a function call, with the event field values passed in as + arguments: onmatch(event).wakeup_latency(arg1,arg2). The syntax + for this form is: + + onmatch(matching.event).(param list) + + In either case, the 'param list' consists of one or more + parameters which may be either variables or fields defined on + either the 'matching.event' or the target event. The variables or + fields specified in the param list may be either fully-qualified + or unqualified. If a variable is specified as unqualified, it + must be unique between the two events. A field name used as a + param can be unqualified if it refers to the target event, but + must be fully qualified if it refers to the matching event. A + fully-qualified name is of the form 'system.event_name.$var_name' + or 'system.event_name.field'. The 'matching.event' specification is simply the fully qualified event name of the event that matches the target event for the @@ -1928,6 +1942,12 @@ The following commonly-used handler.action pairs are available: wakeup_new_test($testpid) if comm=="cyclictest"' >> \ /sys/kernel/debug/tracing/events/sched/sched_wakeup_new/trigger + Or, equivalently, using the 'trace' keyword syntax: + + # echo 'hist:keys=$testpid:testpid=pid:onmatch(sched.sched_wakeup_new).\ + trace(wakeup_new_test,$testpid) if comm=="cyclictest"' >> \ + /sys/kernel/debug/tracing/events/sched/sched_wakeup_new/trigger + Creating and displaying a histogram based on those events is now just a matter of using the fields and new synthetic event in the tracing/events/synthetic directory, as usual:: diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index be6779f963c6..0460cc0f28fd 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -4918,7 +4918,7 @@ static const char readme_msg[] = "\t onmax(var) - invoke if var exceeds current max\n" "\t onchange(var) - invoke action if var changes\n\n" "\t The available actions are:\n\n" - "\t (param list) - generate synthetic event\n" + "\t trace(,param list) - generate synthetic event\n" "\t save(field,...) - save current event fields\n" #ifdef CONFIG_TRACER_SNAPSHOT "\t snapshot() - snapshot the trace buffer\n" diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c index 2f3323ca9d24..66386ba1425f 100644 --- a/kernel/trace/trace_events_hist.c +++ b/kernel/trace/trace_events_hist.c @@ -419,6 +419,8 @@ struct action_data { */ unsigned int var_ref_idx; struct synth_event *synth_event; + bool use_trace_keyword; + char *synth_event_name; union { struct { @@ -3700,6 +3702,8 @@ static void action_data_destroy(struct action_data *data) if (data->synth_event) data->synth_event->ref--; + kfree(data->synth_event_name); + kfree(data); } @@ -3781,6 +3785,7 @@ static int track_data_create(struct hist_trigger_data *hist_data, static int parse_action_params(char *params, struct action_data *data) { char *param, *saved_param; + bool first_param = true; int ret = 0; while (params) { @@ -3809,6 +3814,13 @@ static int parse_action_params(char *params, struct action_data *data) goto out; } + if (first_param && data->use_trace_keyword) { + data->synth_event_name = saved_param; + first_param = false; + continue; + } + first_param = false; + data->params[data->n_params++] = saved_param; } out: @@ -3886,6 +3898,9 @@ static int action_parse(char *str, struct action_data *data, } else { char *params = strsep(&str, ")"); + if (str_has_prefix(action_name, "trace")) + data->use_trace_keyword = true; + if (params) { ret = parse_action_params(params, data); if (ret) @@ -4088,13 +4103,19 @@ static int trace_action_create(struct hist_trigger_data *hist_data, unsigned int i, var_ref_idx; unsigned int field_pos = 0; struct synth_event *event; + char *synth_event_name; int ret = 0; lockdep_assert_held(&event_mutex); - event = find_synth_event(data->action_name); + if (data->use_trace_keyword) + synth_event_name = data->synth_event_name; + else + synth_event_name = data->action_name; + + event = find_synth_event(synth_event_name); if (!event) { - hist_err("trace action: Couldn't find synthetic event: ", data->action_name); + hist_err("trace action: Couldn't find synthetic event: ", synth_event_name); return -EINVAL; } @@ -4841,8 +4862,10 @@ static void print_action_spec(struct seq_file *m, seq_puts(m, ","); } } else if (data->action == ACTION_TRACE) { + if (data->use_trace_keyword) + seq_printf(m, "%s", data->synth_event_name); for (i = 0; i < data->n_params; i++) { - if (i) + if (i || data->use_trace_keyword) seq_puts(m, ","); seq_printf(m, "%s", data->params[i]); } @@ -4890,6 +4913,7 @@ static bool actions_match(struct hist_trigger_data *hist_data, for (i = 0; i < hist_data->n_actions; i++) { struct action_data *data = hist_data->actions[i]; struct action_data *data_test = hist_data_test->actions[i]; + char *action_name, *action_name_test; if (data->handler != data_test->handler) return false; @@ -4904,7 +4928,17 @@ static bool actions_match(struct hist_trigger_data *hist_data, return false; } - if (strcmp(data->action_name, data_test->action_name) != 0) + if (data->use_trace_keyword) + action_name = data->synth_event_name; + else + action_name = data->action_name; + + if (data_test->use_trace_keyword) + action_name_test = data_test->synth_event_name; + else + action_name_test = data_test->action_name; + + if (strcmp(action_name, action_name_test) != 0) return false; if (data->handler == HANDLER_ONMATCH) { -- cgit v1.3-7-g2ca7 From 1c347a94ca79ef89156c7ad5d3a44bb2320a7047 Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Thu, 14 Feb 2019 18:45:21 -0500 Subject: tracing: Comment why cond_snapshot is checked outside of max_lock protection Before setting tr->cond_snapshot, it must be NULL before it can be updated. It can go to NULL when a trace event hist trigger is created or removed, and can only be modified under the max_lock spin lock. But because it can only be set to something other than NULL under both the max_lock spin lock as well as the trace_types_lock, we can perform the check if it is not NULL only under the trace_types_lock and fail out without having to grab the max_lock spin lock. This is very subtle, and deserves a comment. Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace.c | 8 ++++++++ 1 file changed, 8 insertions(+) (limited to 'kernel') diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 0460cc0f28fd..2cf3c747a357 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -1116,6 +1116,14 @@ int tracing_snapshot_cond_enable(struct trace_array *tr, void *cond_data, goto fail_unlock; } + /* + * The cond_snapshot can only change to NULL without the + * trace_types_lock. We don't care if we race with it going + * to NULL, but we want to make sure that it's not set to + * something other than NULL when we get here, which we can + * do safely with only holding the trace_types_lock and not + * having to take the max_lock. + */ if (tr->cond_snapshot) { ret = -EBUSY; goto fail_unlock; -- cgit v1.3-7-g2ca7 From 9e5a36a3371f48fef0ebea6826d1d66f6201c522 Mon Sep 17 00:00:00 2001 From: Colin Ian King Date: Sun, 17 Feb 2019 22:32:22 +0000 Subject: tracing: Fix spelling mistake: "analagous" -> "analogous" There is a spelling mistake in the mini-howto help text. Fix it. Link: http://lkml.kernel.org/r/20190217223222.16479-1-colin.king@canonical.com Signed-off-by: Colin Ian King Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 2cf3c747a357..3835f7ed3293 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -4916,7 +4916,7 @@ static const char readme_msg[] = "\t unchanged.\n\n" "\t The enable_hist and disable_hist triggers can be used to\n" "\t have one event conditionally start and stop another event's\n" - "\t already-attached hist trigger. The syntax is analagous to\n" + "\t already-attached hist trigger. The syntax is analogous to\n" "\t the enable_event and disable_event triggers.\n\n" "\t Hist trigger handlers and actions are executed whenever a\n" "\t a histogram entry is added or updated. They take the form:\n\n" -- cgit v1.3-7-g2ca7 From cbae05d32ff68233f00cbad9fda0382cc982d821 Mon Sep 17 00:00:00 2001 From: Tetsuo Handa Date: Sat, 16 Feb 2019 19:59:33 +0900 Subject: printk: Pass caller information to log_store(). When thread1 called printk() which did not end with '\n', and then thread2 called printk() which ends with '\n' before thread1 calls pr_cont(), the partial content saved into "struct cont" is flushed by thread2 despite the partial content was generated by thread1. This leads to confusing output as if the partial content was generated by thread2. Fix this problem by passing correct caller information to log_store(). Before: [ T8533] abcdefghijklm [ T8533] ABCDEFGHIJKLMNOPQRSTUVWXYZ [ T8532] nopqrstuvwxyz [ T8532] abcdefghijklmnopqrstuvwxyz [ T8533] abcdefghijklm [ T8533] ABCDEFGHIJKLMNOPQRSTUVWXYZ [ T8532] nopqrstuvwxyz After: [ T8507] abcdefghijklm [ T8508] ABCDEFGHIJKLMNOPQRSTUVWXYZ [ T8507] nopqrstuvwxyz [ T8507] abcdefghijklmnopqrstuvwxyz [ T8507] abcdefghijklm [ T8508] ABCDEFGHIJKLMNOPQRSTUVWXYZ [ T8507] nopqrstuvwxyz Link: http://lkml.kernel.org/r/1550314773-8607-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp To: Dmitry Vyukov To: Sergey Senozhatsky To: Steven Rostedt Cc: linux-kernel@vger.kernel.org Signed-off-by: Tetsuo Handa Reviewed-by: Sergey Senozhatsky [pmladek: broke 80-column rule where it made more harm than good] Signed-off-by: Petr Mladek --- kernel/printk/printk.c | 37 ++++++++++++++++++++++--------------- 1 file changed, 22 insertions(+), 15 deletions(-) (limited to 'kernel') diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index e3d977bbc7ab..b4d26388bc62 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -585,7 +585,7 @@ static u32 truncate_msg(u16 *text_len, u16 *trunc_msg_len, } /* insert record into the buffer, discard old ones, update heads */ -static int log_store(int facility, int level, +static int log_store(u32 caller_id, int facility, int level, enum log_flags flags, u64 ts_nsec, const char *dict, u16 dict_len, const char *text, u16 text_len) @@ -634,10 +634,7 @@ static int log_store(int facility, int level, else msg->ts_nsec = local_clock(); #ifdef CONFIG_PRINTK_CALLER - if (in_task()) - msg->caller_id = task_pid_nr(current); - else - msg->caller_id = 0x80000000 + raw_smp_processor_id(); + msg->caller_id = caller_id; #endif memset(log_dict(msg) + dict_len, 0, pad_len); msg->len = size; @@ -1800,6 +1797,12 @@ static inline void printk_delay(void) } } +static inline u32 printk_caller_id(void) +{ + return in_task() ? task_pid_nr(current) : + 0x80000000 + raw_smp_processor_id(); +} + /* * Continuation lines are buffered, and not committed to the record buffer * until the line is complete, or a race forces it. The line fragments @@ -1809,7 +1812,7 @@ static inline void printk_delay(void) static struct cont { char buf[LOG_LINE_MAX]; size_t len; /* length == 0 means unused buffer */ - struct task_struct *owner; /* task of first print*/ + u32 caller_id; /* printk_caller_id() of first print */ u64 ts_nsec; /* time of first print */ u8 level; /* log level of first message */ u8 facility; /* log facility of first message */ @@ -1821,12 +1824,13 @@ static void cont_flush(void) if (cont.len == 0) return; - log_store(cont.facility, cont.level, cont.flags, cont.ts_nsec, - NULL, 0, cont.buf, cont.len); + log_store(cont.caller_id, cont.facility, cont.level, cont.flags, + cont.ts_nsec, NULL, 0, cont.buf, cont.len); cont.len = 0; } -static bool cont_add(int facility, int level, enum log_flags flags, const char *text, size_t len) +static bool cont_add(u32 caller_id, int facility, int level, + enum log_flags flags, const char *text, size_t len) { /* If the line gets too long, split it up in separate records. */ if (cont.len + len > sizeof(cont.buf)) { @@ -1837,7 +1841,7 @@ static bool cont_add(int facility, int level, enum log_flags flags, const char * if (!cont.len) { cont.facility = facility; cont.level = level; - cont.owner = current; + cont.caller_id = caller_id; cont.ts_nsec = local_clock(); cont.flags = flags; } @@ -1857,13 +1861,15 @@ static bool cont_add(int facility, int level, enum log_flags flags, const char * static size_t log_output(int facility, int level, enum log_flags lflags, const char *dict, size_t dictlen, char *text, size_t text_len) { + const u32 caller_id = printk_caller_id(); + /* * If an earlier line was buffered, and we're a continuation - * write from the same process, try to add it to the buffer. + * write from the same context, try to add it to the buffer. */ if (cont.len) { - if (cont.owner == current && (lflags & LOG_CONT)) { - if (cont_add(facility, level, lflags, text, text_len)) + if (cont.caller_id == caller_id && (lflags & LOG_CONT)) { + if (cont_add(caller_id, facility, level, lflags, text, text_len)) return text_len; } /* Otherwise, make sure it's flushed */ @@ -1876,12 +1882,13 @@ static size_t log_output(int facility, int level, enum log_flags lflags, const c /* If it doesn't end in a newline, try to buffer the current line */ if (!(lflags & LOG_NEWLINE)) { - if (cont_add(facility, level, lflags, text, text_len)) + if (cont_add(caller_id, facility, level, lflags, text, text_len)) return text_len; } /* Store it in the record log */ - return log_store(facility, level, lflags, 0, dict, dictlen, text, text_len); + return log_store(caller_id, facility, level, lflags, 0, + dict, dictlen, text, text_len); } /* Must be called under logbuf_lock. */ -- cgit v1.3-7-g2ca7 From 9f199dd34ce06f603df365ab18bd84eefc5f7c2b Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Wed, 20 Feb 2019 08:59:23 +0000 Subject: irqdomain: Allow the default irq domain to be retrieved The default irq domain allows legacy code to create irqdomain mappings without having to track the domain it is allocating from. Setting the default domain is a one shot, fire and forget operation, and no effort was made to be able to retrieve this information at a later point in time. Newer irqdomain APIs (the hierarchical stuff) relies on both the irqchip code to track the irqdomain it is allocating from, as well as some form of firmware abstraction to easily identify which piece of HW maps to which irq domain (DT, ACPI). For systems without such firmware (or legacy platform that are getting dragged into the 21st century), things are a bit harder. For these cases (and these cases only!), let's provide a way to retrieve the default domain, allowing the use of the v2 API without having to resort to platform-specific hacks. Signed-off-by: Marc Zyngier --- include/linux/irqdomain.h | 1 + kernel/irq/irqdomain.c | 14 ++++++++++++++ 2 files changed, 15 insertions(+) (limited to 'kernel') diff --git a/include/linux/irqdomain.h b/include/linux/irqdomain.h index 35965f41d7be..d2130dc7c0e6 100644 --- a/include/linux/irqdomain.h +++ b/include/linux/irqdomain.h @@ -265,6 +265,7 @@ extern struct irq_domain *irq_find_matching_fwspec(struct irq_fwspec *fwspec, enum irq_domain_bus_token bus_token); extern bool irq_domain_check_msi_remap(void); extern void irq_set_default_host(struct irq_domain *host); +extern struct irq_domain *irq_get_default_host(void); extern int irq_domain_alloc_descs(int virq, unsigned int nr_irqs, irq_hw_number_t hwirq, int node, const struct irq_affinity_desc *affinity); diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c index 8b0be4bd6565..80818764643d 100644 --- a/kernel/irq/irqdomain.c +++ b/kernel/irq/irqdomain.c @@ -458,6 +458,20 @@ void irq_set_default_host(struct irq_domain *domain) } EXPORT_SYMBOL_GPL(irq_set_default_host); +/** + * irq_get_default_host() - Retrieve the "default" irq domain + * + * Returns: the default domain, if any. + * + * Modern code should never use this. This should only be used on + * systems that cannot implement a firmware->fwnode mapping (which + * both DT and ACPI provide). + */ +struct irq_domain *irq_get_default_host(void) +{ + return irq_default_domain; +} + static void irq_domain_clear_mapping(struct irq_domain *domain, irq_hw_number_t hwirq) { -- cgit v1.3-7-g2ca7 From 83540fbc8812a580b6ad8f93f4c29e62e417687e Mon Sep 17 00:00:00 2001 From: Jann Horn Date: Wed, 20 Feb 2019 17:54:43 +0100 Subject: tracing/perf: Use strndup_user() instead of buggy open-coded version The first version of this method was missing the check for `ret == PATH_MAX`; then such a check was added, but it didn't call kfree() on error, so there was still a small memory leak in the error case. Fix it by using strndup_user() instead of open-coding it. Link: http://lkml.kernel.org/r/20190220165443.152385-1-jannh@google.com Cc: Ingo Molnar Cc: stable@vger.kernel.org Fixes: 0eadcc7a7bc0 ("perf/core: Fix perf_uprobe_init()") Reviewed-by: Masami Hiramatsu Acked-by: Song Liu Signed-off-by: Jann Horn Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_event_perf.c | 16 +++++++--------- 1 file changed, 7 insertions(+), 9 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c index 76217bbef815..4629a6104474 100644 --- a/kernel/trace/trace_event_perf.c +++ b/kernel/trace/trace_event_perf.c @@ -299,15 +299,13 @@ int perf_uprobe_init(struct perf_event *p_event, if (!p_event->attr.uprobe_path) return -EINVAL; - path = kzalloc(PATH_MAX, GFP_KERNEL); - if (!path) - return -ENOMEM; - ret = strncpy_from_user( - path, u64_to_user_ptr(p_event->attr.uprobe_path), PATH_MAX); - if (ret == PATH_MAX) - return -E2BIG; - if (ret < 0) - goto out; + + path = strndup_user(u64_to_user_ptr(p_event->attr.uprobe_path), + PATH_MAX); + if (IS_ERR(path)) { + ret = PTR_ERR(path); + return (ret == -EINVAL) ? -E2BIG : ret; + } if (path[0] == '\0') { ret = -EINVAL; goto out; -- cgit v1.3-7-g2ca7 From 8bdc6201785d0b5231bfd77bf2b62726395092d7 Mon Sep 17 00:00:00 2001 From: Liu Song Date: Tue, 19 Feb 2019 23:53:27 +0800 Subject: workqueue: fix typo in comment qeueue/queue Signed-off-by: Liu Song Signed-off-by: Tejun Heo --- kernel/workqueue.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/workqueue.c b/kernel/workqueue.c index a503ad9d0aec..29a4de4025be 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -646,7 +646,7 @@ static void set_work_pool_and_clear_pending(struct work_struct *work, * The following mb guarantees that previous clear of a PENDING bit * will not be reordered with any speculative LOADS or STORES from * work->current_func, which is executed afterwards. This possible - * reordering can lead to a missed execution on attempt to qeueue + * reordering can lead to a missed execution on attempt to queue * the same @work. E.g. consider this case: * * CPU#0 CPU#1 -- cgit v1.3-7-g2ca7 From 4e37504d1c49eec6434d0cc97278d2b51c9e8763 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Wed, 20 Feb 2019 22:19:59 -0800 Subject: psi: avoid divide-by-zero crash inside virtual machines MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit We've been seeing hard-to-trigger psi crashes when running inside VM instances: divide error: 0000 [#1] SMP PTI Modules linked in: [...] CPU: 0 PID: 212 Comm: kworker/0:2 Not tainted 4.16.18-119_fbk9_3817_gfe944c98d695 #119 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015 Workqueue: events psi_clock RIP: 0010:psi_update_stats+0x270/0x490 RSP: 0018:ffffc90001117e10 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff8800a35a13f8 RDX: 0000000000000000 RSI: ffff8800a35a1340 RDI: 0000000000000000 RBP: 0000000000000658 R08: ffff8800a35a1470 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 00000000000f8502 FS: 0000000000000000(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fbe370fa000 CR3: 00000000b1e3a000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: psi_clock+0x12/0x50 process_one_work+0x1e0/0x390 worker_thread+0x2b/0x3c0 ? rescuer_thread+0x330/0x330 kthread+0x113/0x130 ? kthread_create_worker_on_cpu+0x40/0x40 ? SyS_exit_group+0x10/0x10 ret_from_fork+0x35/0x40 Code: 48 0f 47 c7 48 01 c2 45 85 e4 48 89 16 0f 85 e6 00 00 00 4c 8b 49 10 4c 8b 51 08 49 69 d9 f2 07 00 00 48 6b c0 64 4c 8b 29 31 d2 <48> f7 f7 49 69 d5 8d 06 00 00 48 89 c5 4c 69 f0 00 98 0b 00 48 The Code-line points to `period` being 0 inside update_stats(), and we divide by that when calculating that period's pressure percentage. The elapsed period should never be 0. The reason this can happen is due to an off-by-one in the idle time / missing period calculation combined with a coarse sched_clock() in the virtual machine. The target time for aggregation is advanced into the future on a fixed grid to prevent clock drift. So when an aggregation runs after some idle period, we can not just set it to "now + psi_period", but have to calculate the downtime and advance the target time relative to itself. However, if the aggregator was disabled exactly one psi_period (ns), we drop one idle period in the calculation due to a > when we should do >=. In that case, next_update will be advanced from 'now - psi_period' to 'now' when it should be moved to 'now + psi_period'. The run finishes with last_update == next_update == sched_clock(). With hardware clocks, this exact nanosecond match isn't likely in the first place; but if it does happen, the clock will still have moved on and the period non-zero by the time the worker runs. A pointlessly short period, but besides the extra work, no harm no foul. However, a slow sched_clock() like we have on VMs might not have advanced either by the time the worker runs again. And when we calculate the elapsed period, the result, our pressure divisor, will be 0. Ouch. Fix this by correctly handling the situation when the elapsed time between aggregation runs is precisely two periods, and advance the expiration timestamp correctly to period into the future. Link: http://lkml.kernel.org/r/20190214193157.15788-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner Reported-by: Łukasz Siudut Cc: Peter Zijlstra Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/sched/psi.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index c3484785b179..0e97ca9306ef 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -322,7 +322,7 @@ static bool update_stats(struct psi_group *group) expires = group->next_update; if (now < expires) goto out; - if (now - expires > psi_period) + if (now - expires >= psi_period) missed_periods = div_u64(now - expires, psi_period); /* -- cgit v1.3-7-g2ca7 From e80d02dd763093f70c3000ef34253a6d426becf6 Mon Sep 17 00:00:00 2001 From: Alexei Starovoitov Date: Thu, 21 Feb 2019 10:40:14 -0800 Subject: seccomp, bpf: disable preemption before calling into bpf prog All BPF programs must be called with preemption disabled. Fixes: 568f196756ad ("bpf: check that BPF programs run with preemption disabled") Reported-by: syzbot+8bf19ee2aa580de7a2a7@syzkaller.appspotmail.com Signed-off-by: Alexei Starovoitov Signed-off-by: Daniel Borkmann --- kernel/seccomp.c | 2 ++ 1 file changed, 2 insertions(+) (limited to 'kernel') diff --git a/kernel/seccomp.c b/kernel/seccomp.c index e815781ed751..a43c601ac252 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -267,6 +267,7 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd, * All filters in the list are evaluated and the lowest BPF return * value always takes priority (ignoring the DATA). */ + preempt_disable(); for (; f; f = f->prev) { u32 cur_ret = BPF_PROG_RUN(f->prog, sd); @@ -275,6 +276,7 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd, *match = f; } } + preempt_enable(); return ret; } #endif /* CONFIG_SECCOMP_FILTER */ -- cgit v1.3-7-g2ca7 From 7c0cdf0b3940f63d9777c3fcf250a2f83859ca54 Mon Sep 17 00:00:00 2001 From: Alban Crequy Date: Fri, 22 Feb 2019 14:19:08 +0100 Subject: bpf, lpm: fix lookup bug in map_delete_elem trie_delete_elem() was deleting an entry even though it was not matching if the prefixlen was correct. This patch adds a check on matchlen. Reproducer: $ sudo bpftool map create /sys/fs/bpf/mylpm type lpm_trie key 8 value 1 entries 128 name mylpm flags 1 $ sudo bpftool map update pinned /sys/fs/bpf/mylpm key hex 10 00 00 00 aa bb cc dd value hex 01 $ sudo bpftool map dump pinned /sys/fs/bpf/mylpm key: 10 00 00 00 aa bb cc dd value: 01 Found 1 element $ sudo bpftool map delete pinned /sys/fs/bpf/mylpm key hex 10 00 00 00 ff ff ff ff $ echo $? 0 $ sudo bpftool map dump pinned /sys/fs/bpf/mylpm Found 0 elements A similar reproducer is added in the selftests. Without the patch: $ sudo ./tools/testing/selftests/bpf/test_lpm_map test_lpm_map: test_lpm_map.c:485: test_lpm_delete: Assertion `bpf_map_delete_elem(map_fd, key) == -1 && errno == ENOENT' failed. Aborted With the patch: test_lpm_map runs without errors. Fixes: e454cf595853 ("bpf: Implement map_delete_elem for BPF_MAP_TYPE_LPM_TRIE") Cc: Craig Gallek Signed-off-by: Alban Crequy Acked-by: Craig Gallek Signed-off-by: Daniel Borkmann --- kernel/bpf/lpm_trie.c | 1 + tools/testing/selftests/bpf/test_lpm_map.c | 10 ++++++++++ 2 files changed, 11 insertions(+) (limited to 'kernel') diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c index abf1002080df..93a5cbbde421 100644 --- a/kernel/bpf/lpm_trie.c +++ b/kernel/bpf/lpm_trie.c @@ -471,6 +471,7 @@ static int trie_delete_elem(struct bpf_map *map, void *_key) } if (!node || node->prefixlen != key->prefixlen || + node->prefixlen != matchlen || (node->flags & LPM_TREE_NODE_FLAG_IM)) { ret = -ENOENT; goto out; diff --git a/tools/testing/selftests/bpf/test_lpm_map.c b/tools/testing/selftests/bpf/test_lpm_map.c index 147e34cfceb7..02d7c871862a 100644 --- a/tools/testing/selftests/bpf/test_lpm_map.c +++ b/tools/testing/selftests/bpf/test_lpm_map.c @@ -474,6 +474,16 @@ static void test_lpm_delete(void) assert(bpf_map_lookup_elem(map_fd, key, &value) == -1 && errno == ENOENT); + key->prefixlen = 30; // unused prefix so far + inet_pton(AF_INET, "192.255.0.0", key->data); + assert(bpf_map_delete_elem(map_fd, key) == -1 && + errno == ENOENT); + + key->prefixlen = 16; // same prefix as the root node + inet_pton(AF_INET, "192.255.0.0", key->data); + assert(bpf_map_delete_elem(map_fd, key) == -1 && + errno == ENOENT); + /* assert initial lookup */ key->prefixlen = 32; inet_pton(AF_INET, "192.168.0.1", key->data); -- cgit v1.3-7-g2ca7 From 18736eef12137c59f60cc9f56dc5bea05c92e0eb Mon Sep 17 00:00:00 2001 From: Alexander Shishkin Date: Fri, 15 Feb 2019 13:56:54 +0200 Subject: perf: Copy parent's address filter offsets on clone When a child event is allocated in the inherit_event() path, the VMA based filter offsets are not copied from the parent, even though the address space mapping of the new task remains the same, which leads to no trace for the new task until exec. Reported-by: Mansour Alharthi Signed-off-by: Alexander Shishkin Tested-by: Mathieu Poirier Acked-by: Peter Zijlstra Cc: Jiri Olsa Fixes: 375637bc5249 ("perf/core: Introduce address range filtering") Link: http://lkml.kernel.org/r/20190215115655.63469-2-alexander.shishkin@linux.intel.com Signed-off-by: Arnaldo Carvalho de Melo --- kernel/events/core.c | 15 +++++++++++++++ 1 file changed, 15 insertions(+) (limited to 'kernel') diff --git a/kernel/events/core.c b/kernel/events/core.c index 5aeb4c74fb99..2d89efc0a3e0 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -1255,6 +1255,7 @@ static void put_ctx(struct perf_event_context *ctx) * perf_event_context::lock * perf_event::mmap_mutex * mmap_sem + * perf_addr_filters_head::lock * * cpu_hotplug_lock * pmus_lock @@ -10312,6 +10313,20 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu, goto err_per_task; } + /* + * Clone the parent's vma offsets: they are valid until exec() + * even if the mm is not shared with the parent. + */ + if (event->parent) { + struct perf_addr_filters_head *ifh = perf_event_addr_filters(event); + + raw_spin_lock_irq(&ifh->lock); + memcpy(event->addr_filters_offs, + event->parent->addr_filters_offs, + pmu->nr_addr_filters * sizeof(unsigned long)); + raw_spin_unlock_irq(&ifh->lock); + } + /* force hw sync on the address filters */ event->addr_filters_gen = 1; } -- cgit v1.3-7-g2ca7 From c60f83b813e5b25ccd5de7e8c8925c31b3aebcc1 Mon Sep 17 00:00:00 2001 From: Alexander Shishkin Date: Fri, 15 Feb 2019 13:56:55 +0200 Subject: perf, pt, coresight: Fix address filters for vmas with non-zero offset Currently, the address range calculation for file-based filters works as long as the vma that maps the matching part of the object file starts from offset zero into the file (vm_pgoff==0). Otherwise, the resulting filter range would be off by vm_pgoff pages. Another related problem is that in case of a partially matching vma, that is, a vma that matches part of a filter region, the filter range size wouldn't be adjusted. Fix the arithmetics around address filter range calculations, taking into account vma offset, so that the entire calculation is done before the filter configuration is passed to the PMU drivers instead of having those drivers do the final bit of arithmetics. Based on the patch by Adrian Hunter . Reported-by: Adrian Hunter Signed-off-by: Alexander Shishkin Tested-by: Mathieu Poirier Acked-by: Peter Zijlstra Cc: Jiri Olsa Fixes: 375637bc5249 ("perf/core: Introduce address range filtering") Link: http://lkml.kernel.org/r/20190215115655.63469-3-alexander.shishkin@linux.intel.com Signed-off-by: Arnaldo Carvalho de Melo --- arch/x86/events/intel/pt.c | 9 +-- drivers/hwtracing/coresight/coresight-etm-perf.c | 7 +- include/linux/perf_event.h | 7 +- kernel/events/core.c | 81 ++++++++++++++---------- 4 files changed, 62 insertions(+), 42 deletions(-) (limited to 'kernel') diff --git a/arch/x86/events/intel/pt.c b/arch/x86/events/intel/pt.c index c0e86ff21f81..fb3a2f13fc70 100644 --- a/arch/x86/events/intel/pt.c +++ b/arch/x86/events/intel/pt.c @@ -1223,7 +1223,8 @@ static int pt_event_addr_filters_validate(struct list_head *filters) static void pt_event_addr_filters_sync(struct perf_event *event) { struct perf_addr_filters_head *head = perf_event_addr_filters(event); - unsigned long msr_a, msr_b, *offs = event->addr_filters_offs; + unsigned long msr_a, msr_b; + struct perf_addr_filter_range *fr = event->addr_filter_ranges; struct pt_filters *filters = event->hw.addr_filters; struct perf_addr_filter *filter; int range = 0; @@ -1232,12 +1233,12 @@ static void pt_event_addr_filters_sync(struct perf_event *event) return; list_for_each_entry(filter, &head->list, entry) { - if (filter->path.dentry && !offs[range]) { + if (filter->path.dentry && !fr[range].start) { msr_a = msr_b = 0; } else { /* apply the offset */ - msr_a = filter->offset + offs[range]; - msr_b = filter->size + msr_a - 1; + msr_a = fr[range].start; + msr_b = msr_a + fr[range].size - 1; } filters->filter[range].msr_a = msr_a; diff --git a/drivers/hwtracing/coresight/coresight-etm-perf.c b/drivers/hwtracing/coresight/coresight-etm-perf.c index 8c88bf0a1e5f..4d5a2b9f9d6a 100644 --- a/drivers/hwtracing/coresight/coresight-etm-perf.c +++ b/drivers/hwtracing/coresight/coresight-etm-perf.c @@ -433,15 +433,16 @@ static int etm_addr_filters_validate(struct list_head *filters) static void etm_addr_filters_sync(struct perf_event *event) { struct perf_addr_filters_head *head = perf_event_addr_filters(event); - unsigned long start, stop, *offs = event->addr_filters_offs; + unsigned long start, stop; + struct perf_addr_filter_range *fr = event->addr_filter_ranges; struct etm_filters *filters = event->hw.addr_filters; struct etm_filter *etm_filter; struct perf_addr_filter *filter; int i = 0; list_for_each_entry(filter, &head->list, entry) { - start = filter->offset + offs[i]; - stop = start + filter->size; + start = fr[i].start; + stop = start + fr[i].size; etm_filter = &filters->etm_filter[i]; switch (filter->action) { diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index d9c3610e0e25..6ebc72f65017 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -490,6 +490,11 @@ struct perf_addr_filters_head { unsigned int nr_file_filters; }; +struct perf_addr_filter_range { + unsigned long start; + unsigned long size; +}; + /** * enum perf_event_state - the states of an event: */ @@ -666,7 +671,7 @@ struct perf_event { /* address range filters */ struct perf_addr_filters_head addr_filters; /* vma address array for file-based filders */ - unsigned long *addr_filters_offs; + struct perf_addr_filter_range *addr_filter_ranges; unsigned long addr_filters_gen; void (*destroy)(struct perf_event *); diff --git a/kernel/events/core.c b/kernel/events/core.c index 2d89efc0a3e0..16609f6737da 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -2799,7 +2799,7 @@ static int perf_event_stop(struct perf_event *event, int restart) * * (p1) when userspace mappings change as a result of (1) or (2) or (3) below, * we update the addresses of corresponding vmas in - * event::addr_filters_offs array and bump the event::addr_filters_gen; + * event::addr_filter_ranges array and bump the event::addr_filters_gen; * (p2) when an event is scheduled in (pmu::add), it calls * perf_event_addr_filters_sync() which calls pmu::addr_filters_sync() * if the generation has changed since the previous call. @@ -4446,7 +4446,7 @@ static void _free_event(struct perf_event *event) perf_event_free_bpf_prog(event); perf_addr_filters_splice(event, NULL); - kfree(event->addr_filters_offs); + kfree(event->addr_filter_ranges); if (event->destroy) event->destroy(event); @@ -6687,7 +6687,8 @@ static void perf_event_addr_filters_exec(struct perf_event *event, void *data) raw_spin_lock_irqsave(&ifh->lock, flags); list_for_each_entry(filter, &ifh->list, entry) { if (filter->path.dentry) { - event->addr_filters_offs[count] = 0; + event->addr_filter_ranges[count].start = 0; + event->addr_filter_ranges[count].size = 0; restart++; } @@ -7367,28 +7368,47 @@ static bool perf_addr_filter_match(struct perf_addr_filter *filter, return true; } +static bool perf_addr_filter_vma_adjust(struct perf_addr_filter *filter, + struct vm_area_struct *vma, + struct perf_addr_filter_range *fr) +{ + unsigned long vma_size = vma->vm_end - vma->vm_start; + unsigned long off = vma->vm_pgoff << PAGE_SHIFT; + struct file *file = vma->vm_file; + + if (!perf_addr_filter_match(filter, file, off, vma_size)) + return false; + + if (filter->offset < off) { + fr->start = vma->vm_start; + fr->size = min(vma_size, filter->size - (off - filter->offset)); + } else { + fr->start = vma->vm_start + filter->offset - off; + fr->size = min(vma->vm_end - fr->start, filter->size); + } + + return true; +} + static void __perf_addr_filters_adjust(struct perf_event *event, void *data) { struct perf_addr_filters_head *ifh = perf_event_addr_filters(event); struct vm_area_struct *vma = data; - unsigned long off = vma->vm_pgoff << PAGE_SHIFT, flags; - struct file *file = vma->vm_file; struct perf_addr_filter *filter; unsigned int restart = 0, count = 0; + unsigned long flags; if (!has_addr_filter(event)) return; - if (!file) + if (!vma->vm_file) return; raw_spin_lock_irqsave(&ifh->lock, flags); list_for_each_entry(filter, &ifh->list, entry) { - if (perf_addr_filter_match(filter, file, off, - vma->vm_end - vma->vm_start)) { - event->addr_filters_offs[count] = vma->vm_start; + if (perf_addr_filter_vma_adjust(filter, vma, + &event->addr_filter_ranges[count])) restart++; - } count++; } @@ -8978,26 +8998,19 @@ static void perf_addr_filters_splice(struct perf_event *event, * @filter; if so, adjust filter's address range. * Called with mm::mmap_sem down for reading. */ -static unsigned long perf_addr_filter_apply(struct perf_addr_filter *filter, - struct mm_struct *mm) +static void perf_addr_filter_apply(struct perf_addr_filter *filter, + struct mm_struct *mm, + struct perf_addr_filter_range *fr) { struct vm_area_struct *vma; for (vma = mm->mmap; vma; vma = vma->vm_next) { - struct file *file = vma->vm_file; - unsigned long off = vma->vm_pgoff << PAGE_SHIFT; - unsigned long vma_size = vma->vm_end - vma->vm_start; - - if (!file) + if (!vma->vm_file) continue; - if (!perf_addr_filter_match(filter, file, off, vma_size)) - continue; - - return vma->vm_start; + if (perf_addr_filter_vma_adjust(filter, vma, fr)) + return; } - - return 0; } /* @@ -9031,15 +9044,15 @@ static void perf_event_addr_filters_apply(struct perf_event *event) raw_spin_lock_irqsave(&ifh->lock, flags); list_for_each_entry(filter, &ifh->list, entry) { - event->addr_filters_offs[count] = 0; + event->addr_filter_ranges[count].start = 0; + event->addr_filter_ranges[count].size = 0; /* * Adjust base offset if the filter is associated to a binary * that needs to be mapped: */ if (filter->path.dentry) - event->addr_filters_offs[count] = - perf_addr_filter_apply(filter, mm); + perf_addr_filter_apply(filter, mm, &event->addr_filter_ranges[count]); count++; } @@ -10305,10 +10318,10 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu, goto err_pmu; if (has_addr_filter(event)) { - event->addr_filters_offs = kcalloc(pmu->nr_addr_filters, - sizeof(unsigned long), - GFP_KERNEL); - if (!event->addr_filters_offs) { + event->addr_filter_ranges = kcalloc(pmu->nr_addr_filters, + sizeof(struct perf_addr_filter_range), + GFP_KERNEL); + if (!event->addr_filter_ranges) { err = -ENOMEM; goto err_per_task; } @@ -10321,9 +10334,9 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu, struct perf_addr_filters_head *ifh = perf_event_addr_filters(event); raw_spin_lock_irq(&ifh->lock); - memcpy(event->addr_filters_offs, - event->parent->addr_filters_offs, - pmu->nr_addr_filters * sizeof(unsigned long)); + memcpy(event->addr_filter_ranges, + event->parent->addr_filter_ranges, + pmu->nr_addr_filters * sizeof(struct perf_addr_filter_range)); raw_spin_unlock_irq(&ifh->lock); } @@ -10345,7 +10358,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu, return event; err_addr_filters: - kfree(event->addr_filters_offs); + kfree(event->addr_filter_ranges); err_per_task: exclusive_event_destroy(event); -- cgit v1.3-7-g2ca7 From 781e62823cb81b972dc8652c1827205cda2ac9ac Mon Sep 17 00:00:00 2001 From: Peng Sun Date: Tue, 26 Feb 2019 22:15:37 +0800 Subject: bpf: decrease usercnt if bpf_map_new_fd() fails in bpf_map_get_fd_by_id() In bpf/syscall.c, bpf_map_get_fd_by_id() use bpf_map_inc_not_zero() to increase the refcount, both map->refcnt and map->usercnt. Then, if bpf_map_new_fd() fails, should handle map->usercnt too. Fixes: bd5f5f4ecb78 ("bpf: Add BPF_MAP_GET_FD_BY_ID") Signed-off-by: Peng Sun Acked-by: Martin KaFai Lau Signed-off-by: Daniel Borkmann --- kernel/bpf/syscall.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 8577bb7f8be6..f98328abe8fa 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -1986,7 +1986,7 @@ static int bpf_map_get_fd_by_id(const union bpf_attr *attr) fd = bpf_map_new_fd(map, f_flags); if (fd < 0) - bpf_map_put(map); + bpf_map_put_with_uref(map); return fd; } -- cgit v1.3-7-g2ca7 From b303c6df80c9f8f13785aa83a0471fca7e38b24d Mon Sep 17 00:00:00 2001 From: Masahiro Yamada Date: Thu, 21 Feb 2019 13:13:38 +0900 Subject: kbuild: compute false-positive -Wmaybe-uninitialized cases in Kconfig Since -Wmaybe-uninitialized was introduced by GCC 4.7, we have patched various false positives: - commit e74fc973b6e5 ("Turn off -Wmaybe-uninitialized when building with -Os") turned off this option for -Os. - commit 815eb71e7149 ("Kbuild: disable 'maybe-uninitialized' warning for CONFIG_PROFILE_ALL_BRANCHES") turned off this option for CONFIG_PROFILE_ALL_BRANCHES - commit a76bcf557ef4 ("Kbuild: enable -Wmaybe-uninitialized warning for "make W=1"") turned off this option for GCC < 4.9 Arnd provided more explanation in https://lkml.org/lkml/2017/3/14/903 I think this looks better by shifting the logic from Makefile to Kconfig. Link: https://github.com/ClangBuiltLinux/linux/issues/350 Signed-off-by: Masahiro Yamada Reviewed-by: Nathan Chancellor Tested-by: Nick Desaulniers --- Makefile | 10 +++------- init/Kconfig | 17 +++++++++++++++++ kernel/trace/Kconfig | 1 + 3 files changed, 21 insertions(+), 7 deletions(-) (limited to 'kernel') diff --git a/Makefile b/Makefile index 7d9032bf47c5..9f52fda4ad0e 100644 --- a/Makefile +++ b/Makefile @@ -660,17 +660,13 @@ KBUILD_CFLAGS += $(call cc-disable-warning, int-in-bool-context) ifdef CONFIG_CC_OPTIMIZE_FOR_SIZE KBUILD_CFLAGS += $(call cc-option,-Oz,-Os) -KBUILD_CFLAGS += $(call cc-disable-warning,maybe-uninitialized,) -else -ifdef CONFIG_PROFILE_ALL_BRANCHES -KBUILD_CFLAGS += -O2 $(call cc-disable-warning,maybe-uninitialized,) else KBUILD_CFLAGS += -O2 endif -endif -KBUILD_CFLAGS += $(call cc-ifversion, -lt, 0409, \ - $(call cc-disable-warning,maybe-uninitialized,)) +ifdef CONFIG_CC_DISABLE_WARN_MAYBE_UNINITIALIZED +KBUILD_CFLAGS += -Wno-maybe-uninitialized +endif # Tell gcc to never replace conditional load with a non-conditional one KBUILD_CFLAGS += $(call cc-option,--param=allow-store-data-races=0) diff --git a/init/Kconfig b/init/Kconfig index 513fa544a134..ce43083b681d 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -26,6 +26,22 @@ config CLANG_VERSION config CC_HAS_ASM_GOTO def_bool $(success,$(srctree)/scripts/gcc-goto.sh $(CC)) +config CC_HAS_WARN_MAYBE_UNINITIALIZED + def_bool $(cc-option,-Wmaybe-uninitialized) + help + GCC >= 4.7 supports this option. + +config CC_DISABLE_WARN_MAYBE_UNINITIALIZED + bool + depends on CC_HAS_WARN_MAYBE_UNINITIALIZED + default CC_IS_GCC && GCC_VERSION < 40900 # unreliable for GCC < 4.9 + help + GCC's -Wmaybe-uninitialized is not reliable by definition. + Lots of false positive warnings are produced in some cases. + + If this option is enabled, -Wno-maybe-uninitialzed is passed + to the compiler to suppress maybe-uninitialized warnings. + config CONSTRUCTORS bool depends on !UML @@ -1102,6 +1118,7 @@ config CC_OPTIMIZE_FOR_PERFORMANCE config CC_OPTIMIZE_FOR_SIZE bool "Optimize for size" + imply CC_DISABLE_WARN_MAYBE_UNINITIALIZED # avoid false positives help Enabling this option will pass "-Os" instead of "-O2" to your compiler resulting in a smaller kernel. diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig index fa8b1fe824f3..8bd1d6d001d7 100644 --- a/kernel/trace/Kconfig +++ b/kernel/trace/Kconfig @@ -370,6 +370,7 @@ config PROFILE_ANNOTATED_BRANCHES config PROFILE_ALL_BRANCHES bool "Profile all if conditionals" if !FORTIFY_SOURCE select TRACE_BRANCH_PROFILING + imply CC_DISABLE_WARN_MAYBE_UNINITIALIZED # avoid false positives help This tracer profiles all branch conditions. Every if () taken in the kernel is recorded whether it hit or miss. -- cgit v1.3-7-g2ca7 From 492ecee892c2a4ba6a14903d5d586ff750b7e805 Mon Sep 17 00:00:00 2001 From: Alexei Starovoitov Date: Mon, 25 Feb 2019 14:28:39 -0800 Subject: bpf: enable program stats JITed BPF programs are indistinguishable from kernel functions, but unlike kernel code BPF code can be changed often. Typical approach of "perf record" + "perf report" profiling and tuning of kernel code works just as well for BPF programs, but kernel code doesn't need to be monitored whereas BPF programs do. Users load and run large amount of BPF programs. These BPF stats allow tools monitor the usage of BPF on the server. The monitoring tools will turn sysctl kernel.bpf_stats_enabled on and off for few seconds to sample average cost of the programs. Aggregated data over hours and days will provide an insight into cost of BPF and alarms can trigger in case given program suddenly gets more expensive. The cost of two sched_clock() per program invocation adds ~20 nsec. Fast BPF progs (like selftests/bpf/progs/test_pkt_access.c) will slow down from ~10 nsec to ~30 nsec. static_key minimizes the cost of the stats collection. There is no measurable difference before/after this patch with kernel.bpf_stats_enabled=0 Signed-off-by: Alexei Starovoitov Signed-off-by: Daniel Borkmann --- include/linux/bpf.h | 9 +++++++++ include/linux/filter.h | 20 +++++++++++++++++++- kernel/bpf/core.c | 31 +++++++++++++++++++++++++++++-- kernel/bpf/syscall.c | 34 ++++++++++++++++++++++++++++++++-- kernel/bpf/verifier.c | 7 ++++++- kernel/sysctl.c | 34 ++++++++++++++++++++++++++++++++++ 6 files changed, 129 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/include/linux/bpf.h b/include/linux/bpf.h index de18227b3d95..a2132e09dc1c 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -16,6 +16,7 @@ #include #include #include +#include struct bpf_verifier_env; struct perf_event; @@ -340,6 +341,12 @@ enum bpf_cgroup_storage_type { #define MAX_BPF_CGROUP_STORAGE_TYPE __BPF_CGROUP_STORAGE_MAX +struct bpf_prog_stats { + u64 cnt; + u64 nsecs; + struct u64_stats_sync syncp; +}; + struct bpf_prog_aux { atomic_t refcnt; u32 used_map_cnt; @@ -389,6 +396,7 @@ struct bpf_prog_aux { * main prog always has linfo_idx == 0 */ u32 linfo_idx; + struct bpf_prog_stats __percpu *stats; union { struct work_struct work; struct rcu_head rcu; @@ -559,6 +567,7 @@ void bpf_map_area_free(void *base); void bpf_map_init_from_attr(struct bpf_map *map, union bpf_attr *attr); extern int sysctl_unprivileged_bpf_disabled; +extern int sysctl_bpf_stats_enabled; int bpf_map_new_fd(struct bpf_map *map, int flags); int bpf_prog_new_fd(struct bpf_prog *prog); diff --git a/include/linux/filter.h b/include/linux/filter.h index f32b3eca5a04..7e5e3db11106 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -533,7 +533,24 @@ struct sk_filter { struct bpf_prog *prog; }; -#define BPF_PROG_RUN(filter, ctx) ({ cant_sleep(); (*(filter)->bpf_func)(ctx, (filter)->insnsi); }) +DECLARE_STATIC_KEY_FALSE(bpf_stats_enabled_key); + +#define BPF_PROG_RUN(prog, ctx) ({ \ + u32 ret; \ + cant_sleep(); \ + if (static_branch_unlikely(&bpf_stats_enabled_key)) { \ + struct bpf_prog_stats *stats; \ + u64 start = sched_clock(); \ + ret = (*(prog)->bpf_func)(ctx, (prog)->insnsi); \ + stats = this_cpu_ptr(prog->aux->stats); \ + u64_stats_update_begin(&stats->syncp); \ + stats->cnt++; \ + stats->nsecs += sched_clock() - start; \ + u64_stats_update_end(&stats->syncp); \ + } else { \ + ret = (*(prog)->bpf_func)(ctx, (prog)->insnsi); \ + } \ + ret; }) #define BPF_SKB_CB_LEN QDISC_CB_PRIV_LEN @@ -764,6 +781,7 @@ void bpf_prog_free_jited_linfo(struct bpf_prog *prog); void bpf_prog_free_unused_jited_linfo(struct bpf_prog *prog); struct bpf_prog *bpf_prog_alloc(unsigned int size, gfp_t gfp_extra_flags); +struct bpf_prog *bpf_prog_alloc_no_stats(unsigned int size, gfp_t gfp_extra_flags); struct bpf_prog *bpf_prog_realloc(struct bpf_prog *fp_old, unsigned int size, gfp_t gfp_extra_flags); void __bpf_prog_free(struct bpf_prog *fp); diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index ef88b167959d..1c14c347f3cf 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -78,7 +78,7 @@ void *bpf_internal_load_pointer_neg_helper(const struct sk_buff *skb, int k, uns return NULL; } -struct bpf_prog *bpf_prog_alloc(unsigned int size, gfp_t gfp_extra_flags) +struct bpf_prog *bpf_prog_alloc_no_stats(unsigned int size, gfp_t gfp_extra_flags) { gfp_t gfp_flags = GFP_KERNEL | __GFP_ZERO | gfp_extra_flags; struct bpf_prog_aux *aux; @@ -104,6 +104,26 @@ struct bpf_prog *bpf_prog_alloc(unsigned int size, gfp_t gfp_extra_flags) return fp; } + +struct bpf_prog *bpf_prog_alloc(unsigned int size, gfp_t gfp_extra_flags) +{ + gfp_t gfp_flags = GFP_KERNEL | __GFP_ZERO | gfp_extra_flags; + struct bpf_prog *prog; + + prog = bpf_prog_alloc_no_stats(size, gfp_extra_flags); + if (!prog) + return NULL; + + prog->aux->stats = alloc_percpu_gfp(struct bpf_prog_stats, gfp_flags); + if (!prog->aux->stats) { + kfree(prog->aux); + vfree(prog); + return NULL; + } + + u64_stats_init(&prog->aux->stats->syncp); + return prog; +} EXPORT_SYMBOL_GPL(bpf_prog_alloc); int bpf_prog_alloc_jited_linfo(struct bpf_prog *prog) @@ -231,7 +251,10 @@ struct bpf_prog *bpf_prog_realloc(struct bpf_prog *fp_old, unsigned int size, void __bpf_prog_free(struct bpf_prog *fp) { - kfree(fp->aux); + if (fp->aux) { + free_percpu(fp->aux->stats); + kfree(fp->aux); + } vfree(fp); } @@ -2069,6 +2092,10 @@ int __weak skb_copy_bits(const struct sk_buff *skb, int offset, void *to, return -EFAULT; } +DEFINE_STATIC_KEY_FALSE(bpf_stats_enabled_key); +EXPORT_SYMBOL(bpf_stats_enabled_key); +int sysctl_bpf_stats_enabled __read_mostly; + /* All definitions of tracepoints related to BPF. */ #define CREATE_TRACE_POINTS #include diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index ec7c552af76b..31cf66fc3f5c 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -1283,24 +1283,54 @@ static int bpf_prog_release(struct inode *inode, struct file *filp) return 0; } +static void bpf_prog_get_stats(const struct bpf_prog *prog, + struct bpf_prog_stats *stats) +{ + u64 nsecs = 0, cnt = 0; + int cpu; + + for_each_possible_cpu(cpu) { + const struct bpf_prog_stats *st; + unsigned int start; + u64 tnsecs, tcnt; + + st = per_cpu_ptr(prog->aux->stats, cpu); + do { + start = u64_stats_fetch_begin_irq(&st->syncp); + tnsecs = st->nsecs; + tcnt = st->cnt; + } while (u64_stats_fetch_retry_irq(&st->syncp, start)); + nsecs += tnsecs; + cnt += tcnt; + } + stats->nsecs = nsecs; + stats->cnt = cnt; +} + #ifdef CONFIG_PROC_FS static void bpf_prog_show_fdinfo(struct seq_file *m, struct file *filp) { const struct bpf_prog *prog = filp->private_data; char prog_tag[sizeof(prog->tag) * 2 + 1] = { }; + struct bpf_prog_stats stats; + bpf_prog_get_stats(prog, &stats); bin2hex(prog_tag, prog->tag, sizeof(prog->tag)); seq_printf(m, "prog_type:\t%u\n" "prog_jited:\t%u\n" "prog_tag:\t%s\n" "memlock:\t%llu\n" - "prog_id:\t%u\n", + "prog_id:\t%u\n" + "run_time_ns:\t%llu\n" + "run_cnt:\t%llu\n", prog->type, prog->jited, prog_tag, prog->pages * 1ULL << PAGE_SHIFT, - prog->aux->id); + prog->aux->id, + stats.nsecs, + stats.cnt); } #endif diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 1b9496c41383..0e4edd7e3c5f 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -7320,7 +7320,12 @@ static int jit_subprogs(struct bpf_verifier_env *env) subprog_end = env->subprog_info[i + 1].start; len = subprog_end - subprog_start; - func[i] = bpf_prog_alloc(bpf_prog_size(len), GFP_USER); + /* BPF_PROG_RUN doesn't call subprogs directly, + * hence main prog stats include the runtime of subprogs. + * subprogs don't have IDs and not reachable via prog_get_next_id + * func[i]->aux->stats will never be accessed and stays NULL + */ + func[i] = bpf_prog_alloc_no_stats(bpf_prog_size(len), GFP_USER); if (!func[i]) goto out_free; memcpy(func[i]->insnsi, &prog->insnsi[subprog_start], diff --git a/kernel/sysctl.c b/kernel/sysctl.c index ba4d9e85feb8..86e0771352f2 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -224,6 +224,9 @@ static int proc_dostring_coredump(struct ctl_table *table, int write, #endif static int proc_dopipe_max_size(struct ctl_table *table, int write, void __user *buffer, size_t *lenp, loff_t *ppos); +static int proc_dointvec_minmax_bpf_stats(struct ctl_table *table, int write, + void __user *buffer, size_t *lenp, + loff_t *ppos); #ifdef CONFIG_MAGIC_SYSRQ /* Note: sysrq code uses its own private copy */ @@ -1230,6 +1233,15 @@ static struct ctl_table kern_table[] = { .extra2 = &one, }, #endif + { + .procname = "bpf_stats_enabled", + .data = &sysctl_bpf_stats_enabled, + .maxlen = sizeof(sysctl_bpf_stats_enabled), + .mode = 0644, + .proc_handler = proc_dointvec_minmax_bpf_stats, + .extra1 = &zero, + .extra2 = &one, + }, #if defined(CONFIG_TREE_RCU) || defined(CONFIG_PREEMPT_RCU) { .procname = "panic_on_rcu_stall", @@ -3260,6 +3272,28 @@ int proc_doulongvec_ms_jiffies_minmax(struct ctl_table *table, int write, #endif /* CONFIG_PROC_SYSCTL */ +static int proc_dointvec_minmax_bpf_stats(struct ctl_table *table, int write, + void __user *buffer, size_t *lenp, + loff_t *ppos) +{ + int ret, bpf_stats = *(int *)table->data; + struct ctl_table tmp = *table; + + if (write && !capable(CAP_SYS_ADMIN)) + return -EPERM; + + tmp.data = &bpf_stats; + ret = proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos); + if (write && !ret) { + *(int *)table->data = bpf_stats; + if (bpf_stats) + static_branch_enable(&bpf_stats_enabled_key); + else + static_branch_disable(&bpf_stats_enabled_key); + } + return ret; +} + /* * No sense putting this after each symbol definition, twice, * exception granted :-) -- cgit v1.3-7-g2ca7 From 5f8f8b93aeb8371c54af08bece2bd04bc2d48707 Mon Sep 17 00:00:00 2001 From: Alexei Starovoitov Date: Mon, 25 Feb 2019 14:28:40 -0800 Subject: bpf: expose program stats via bpf_prog_info Return bpf program run_time_ns and run_cnt via bpf_prog_info Signed-off-by: Alexei Starovoitov Acked-by: Andrii Nakryiko Signed-off-by: Daniel Borkmann --- include/uapi/linux/bpf.h | 2 ++ kernel/bpf/syscall.c | 5 +++++ 2 files changed, 7 insertions(+) (limited to 'kernel') diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index bcdd2474eee7..2e308e90ffea 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -2813,6 +2813,8 @@ struct bpf_prog_info { __u32 jited_line_info_rec_size; __u32 nr_prog_tags; __aligned_u64 prog_tags; + __u64 run_time_ns; + __u64 run_cnt; } __attribute__((aligned(8))); struct bpf_map_info { diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 31cf66fc3f5c..174581dfe225 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -2152,6 +2152,7 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog, struct bpf_prog_info __user *uinfo = u64_to_user_ptr(attr->info.info); struct bpf_prog_info info = {}; u32 info_len = attr->info.info_len; + struct bpf_prog_stats stats; char __user *uinsns; u32 ulen; int err; @@ -2191,6 +2192,10 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog, if (err) return err; + bpf_prog_get_stats(prog, &stats); + info.run_time_ns = stats.nsecs; + info.run_cnt = stats.cnt; + if (!capable(CAP_SYS_ADMIN)) { info.jited_prog_len = 0; info.xlated_prog_len = 0; -- cgit v1.3-7-g2ca7 From a115d0ed7201a5904c084ae6f07913fe2b9396a6 Mon Sep 17 00:00:00 2001 From: Yonghong Song Date: Wed, 27 Feb 2019 13:22:56 -0800 Subject: bpf: set inner_map_meta->spin_lock_off correctly Commit d83525ca62cf ("bpf: introduce bpf_spin_lock") introduced bpf_spin_lock and the field spin_lock_off in kernel internal structure bpf_map has the following meaning: >=0 valid offset, <0 error For every map created, the kernel will ensure spin_lock_off has correct value. Currently, bpf_map->spin_lock_off is not copied from the inner map to the map_in_map inner_map_meta during a map_in_map type map creation, so inner_map_meta->spin_lock_off = 0. This will give verifier wrong information that inner_map has bpf_spin_lock and the bpf_spin_lock is defined at offset 0. An access to offset 0 of a value pointer will trigger the following error: bpf_spin_lock cannot be accessed directly by load/store This patch fixed the issue by copy inner map's spin_lock_off value to inner_map_meta->spin_lock_off. Fixes: d83525ca62cf ("bpf: introduce bpf_spin_lock") Signed-off-by: Yonghong Song Acked-by: Andrii Nakryiko Signed-off-by: Alexei Starovoitov --- kernel/bpf/map_in_map.c | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c index 583346a0ab29..3dff41403583 100644 --- a/kernel/bpf/map_in_map.c +++ b/kernel/bpf/map_in_map.c @@ -58,6 +58,7 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd) inner_map_meta->value_size = inner_map->value_size; inner_map_meta->map_flags = inner_map->map_flags; inner_map_meta->max_entries = inner_map->max_entries; + inner_map_meta->spin_lock_off = inner_map->spin_lock_off; /* Misc members not needed in bpf_map_meta_equal() check. */ inner_map_meta->ops = inner_map->ops; -- cgit v1.3-7-g2ca7 From 733000c7ffd9d9c8c4fdfd82f0d41956c8cf0537 Mon Sep 17 00:00:00 2001 From: Waiman Long Date: Sun, 24 Feb 2019 20:14:13 -0500 Subject: locking/qspinlock: Remove unnecessary BUG_ON() call With the > 4 nesting levels case handled by the commit: d682b596d993 ("locking/qspinlock: Handle > 4 slowpath nesting levels") the BUG_ON() call in encode_tail() will never actually be triggered. Remove it. Signed-off-by: Waiman Long Signed-off-by: Peter Zijlstra (Intel) Acked-by: Will Deacon Cc: Andrew Morton Cc: Linus Torvalds Cc: Paul E. McKenney Cc: Peter Zijlstra Cc: Thomas Gleixner Link: https://lkml.kernel.org/r/1551057253-3231-1-git-send-email-longman@redhat.com Signed-off-by: Ingo Molnar --- kernel/locking/qspinlock.c | 3 --- 1 file changed, 3 deletions(-) (limited to 'kernel') diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index 21ee51b47961..5e9247dc2515 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -124,9 +124,6 @@ static inline __pure u32 encode_tail(int cpu, int idx) { u32 tail; -#ifdef CONFIG_DEBUG_SPINLOCK - BUG_ON(idx > 3); -#endif tail = (cpu + 1) << _Q_TAIL_CPU_OFFSET; tail |= idx << _Q_TAIL_IDX_OFFSET; /* assume < 4 */ -- cgit v1.3-7-g2ca7 From 09d75ecb122d8b600d76e3b8d53a10ffbe3bcec2 Mon Sep 17 00:00:00 2001 From: Bart Van Assche Date: Thu, 14 Feb 2019 15:00:36 -0800 Subject: locking/lockdep: Fix two 32-bit compiler warnings Use %zu to format size_t instead of %lu to avoid that the compiler complains about a mismatch between format specifier and argument on 32-bit systems. Signed-off-by: Bart Van Assche Signed-off-by: Peter Zijlstra (Intel) Cc: Andrew Morton Cc: Johannes Berg Cc: Linus Torvalds Cc: Paul E. McKenney Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Waiman Long Cc: Will Deacon Cc: johannes.berg@intel.com Cc: tj@kernel.org Link: https://lkml.kernel.org/r/20190214230058.196511-2-bvanassche@acm.org Signed-off-by: Ingo Molnar --- kernel/locking/lockdep.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 7f7db23fc002..5c5283bf499c 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -4266,7 +4266,7 @@ void __init lockdep_init(void) printk("... MAX_LOCKDEP_CHAINS: %lu\n", MAX_LOCKDEP_CHAINS); printk("... CHAINHASH_SIZE: %lu\n", CHAINHASH_SIZE); - printk(" memory used by lock dependency info: %lu kB\n", + printk(" memory used by lock dependency info: %zu kB\n", (sizeof(struct lock_class) * MAX_LOCKDEP_KEYS + sizeof(struct list_head) * CLASSHASH_SIZE + sizeof(struct lock_list) * MAX_LOCKDEP_ENTRIES + @@ -4278,7 +4278,7 @@ void __init lockdep_init(void) ) / 1024 ); - printk(" per task-struct memory footprint: %lu bytes\n", + printk(" per task-struct memory footprint: %zu bytes\n", sizeof(struct held_lock) * MAX_LOCK_DEPTH); } -- cgit v1.3-7-g2ca7 From 7ff8517e1034f26dde03d6df4026f085480408f0 Mon Sep 17 00:00:00 2001 From: Bart Van Assche Date: Thu, 14 Feb 2019 15:00:37 -0800 Subject: locking/lockdep: Fix reported required memory size (1/2) Change the sizeof(array element time) * (array size) expressions into sizeof(array). This fixes the size computations of the classhash_table[] and chainhash_table[] arrays. The reason is that commit: a63f38cc4ccf ("locking/lockdep: Convert hash tables to hlists") changed the type of the elements of that array from 'struct list_head' into 'struct hlist_head'. Signed-off-by: Bart Van Assche Signed-off-by: Peter Zijlstra (Intel) Cc: Andrew Morton Cc: Johannes Berg Cc: Linus Torvalds Cc: Paul E. McKenney Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Waiman Long Cc: Will Deacon Cc: johannes.berg@intel.com Cc: tj@kernel.org Link: https://lkml.kernel.org/r/20190214230058.196511-3-bvanassche@acm.org Signed-off-by: Ingo Molnar --- kernel/locking/lockdep.c | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) (limited to 'kernel') diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 5c5283bf499c..57a523f0273c 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -4267,19 +4267,19 @@ void __init lockdep_init(void) printk("... CHAINHASH_SIZE: %lu\n", CHAINHASH_SIZE); printk(" memory used by lock dependency info: %zu kB\n", - (sizeof(struct lock_class) * MAX_LOCKDEP_KEYS + - sizeof(struct list_head) * CLASSHASH_SIZE + - sizeof(struct lock_list) * MAX_LOCKDEP_ENTRIES + - sizeof(struct lock_chain) * MAX_LOCKDEP_CHAINS + - sizeof(struct list_head) * CHAINHASH_SIZE + (sizeof(lock_classes) + + sizeof(classhash_table) + + sizeof(list_entries) + + sizeof(lock_chains) + + sizeof(chainhash_table) #ifdef CONFIG_PROVE_LOCKING - + sizeof(struct circular_queue) + + sizeof(lock_cq) #endif ) / 1024 ); printk(" per task-struct memory footprint: %zu bytes\n", - sizeof(struct held_lock) * MAX_LOCK_DEPTH); + sizeof(((struct task_struct *)NULL)->held_locks)); } static void -- cgit v1.3-7-g2ca7 From 15ea86b58c71d05e0921bebcf707aa30e43e9e25 Mon Sep 17 00:00:00 2001 From: Bart Van Assche Date: Thu, 14 Feb 2019 15:00:38 -0800 Subject: locking/lockdep: Fix reported required memory size (2/2) Lock chains are only tracked with CONFIG_PROVE_LOCKING=y. Do not report the memory required for the lock chain array if CONFIG_PROVE_LOCKING=n. See also commit: ca58abcb4a6d ("lockdep: sanitise CONFIG_PROVE_LOCKING") Include the size of the chain_hlocks[] array. Signed-off-by: Bart Van Assche Signed-off-by: Peter Zijlstra (Intel) Cc: Andrew Morton Cc: Johannes Berg Cc: Linus Torvalds Cc: Paul E. McKenney Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Waiman Long Cc: Will Deacon Cc: johannes.berg@intel.com Cc: tj@kernel.org Link: https://lkml.kernel.org/r/20190214230058.196511-4-bvanassche@acm.org Signed-off-by: Ingo Molnar --- kernel/locking/lockdep.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 57a523f0273c..ec6f6aff4d8d 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -4270,10 +4270,11 @@ void __init lockdep_init(void) (sizeof(lock_classes) + sizeof(classhash_table) + sizeof(list_entries) + - sizeof(lock_chains) + sizeof(chainhash_table) #ifdef CONFIG_PROVE_LOCKING + sizeof(lock_cq) + + sizeof(lock_chains) + + sizeof(chain_hlocks) #endif ) / 1024 ); -- cgit v1.3-7-g2ca7 From 523b113bace5e64e860d8c61d7aa25057d274753 Mon Sep 17 00:00:00 2001 From: Bart Van Assche Date: Thu, 14 Feb 2019 15:00:39 -0800 Subject: locking/lockdep: Avoid that add_chain_cache() adds an invalid chain to the cache Make sure that add_chain_cache() returns 0 and does not modify the chain hash if nr_chain_hlocks == MAX_LOCKDEP_CHAIN_HLOCKS before this function is called. Signed-off-by: Bart Van Assche Signed-off-by: Peter Zijlstra (Intel) Cc: Andrew Morton Cc: Johannes Berg Cc: Linus Torvalds Cc: Paul E. McKenney Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Waiman Long Cc: Will Deacon Cc: johannes.berg@intel.com Cc: tj@kernel.org Link: https://lkml.kernel.org/r/20190214230058.196511-5-bvanassche@acm.org Signed-off-by: Ingo Molnar --- kernel/locking/lockdep.c | 11 +---------- 1 file changed, 1 insertion(+), 10 deletions(-) (limited to 'kernel') diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index ec6f6aff4d8d..21d84510e28f 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -2195,16 +2195,8 @@ static inline int add_chain_cache(struct task_struct *curr, chain_hlocks[chain->base + j] = lock_id; } chain_hlocks[chain->base + j] = class - lock_classes; - } - - if (nr_chain_hlocks < MAX_LOCKDEP_CHAIN_HLOCKS) nr_chain_hlocks += chain->depth; - -#ifdef CONFIG_DEBUG_LOCKDEP - /* - * Important for check_no_collision(). - */ - if (unlikely(nr_chain_hlocks > MAX_LOCKDEP_CHAIN_HLOCKS)) { + } else { if (!debug_locks_off_graph_unlock()) return 0; @@ -2212,7 +2204,6 @@ static inline int add_chain_cache(struct task_struct *curr, dump_stack(); return 0; } -#endif hlist_add_head_rcu(&chain->entry, hash_head); debug_atomic_inc(chain_lookup_misses); -- cgit v1.3-7-g2ca7 From 86cffb80a525f7b8f969c8c79669d383e02f17d1 Mon Sep 17 00:00:00 2001 From: Bart Van Assche Date: Thu, 14 Feb 2019 15:00:41 -0800 Subject: locking/lockdep: Make zap_class() remove all matching lock order entries Make sure that all lock order entries that refer to a class are removed from the list_entries[] array when a kernel module is unloaded. Signed-off-by: Bart Van Assche Signed-off-by: Peter Zijlstra (Intel) Cc: Andrew Morton Cc: Johannes Berg Cc: Linus Torvalds Cc: Paul E. McKenney Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Waiman Long Cc: Will Deacon Cc: johannes.berg@intel.com Cc: tj@kernel.org Link: https://lkml.kernel.org/r/20190214230058.196511-7-bvanassche@acm.org Signed-off-by: Ingo Molnar --- include/linux/lockdep.h | 1 + kernel/locking/lockdep.c | 19 +++++++++++++------ 2 files changed, 14 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h index 0c38bade84b7..b5e6bfe0ae4a 100644 --- a/include/linux/lockdep.h +++ b/include/linux/lockdep.h @@ -178,6 +178,7 @@ static inline void lockdep_copy_map(struct lockdep_map *to, struct lock_list { struct list_head entry; struct lock_class *class; + struct lock_class *links_to; struct stack_trace trace; int distance; diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 21d84510e28f..28fbeb2a10cc 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -859,7 +859,8 @@ static struct lock_list *alloc_list_entry(void) /* * Add a new dependency to the head of the list: */ -static int add_lock_to_list(struct lock_class *this, struct list_head *head, +static int add_lock_to_list(struct lock_class *this, + struct lock_class *links_to, struct list_head *head, unsigned long ip, int distance, struct stack_trace *trace) { @@ -873,6 +874,7 @@ static int add_lock_to_list(struct lock_class *this, struct list_head *head, return 0; entry->class = this; + entry->links_to = links_to; entry->distance = distance; entry->trace = *trace; /* @@ -1907,14 +1909,14 @@ check_prev_add(struct task_struct *curr, struct held_lock *prev, * Ok, all validations passed, add the new lock * to the previous lock's dependency list: */ - ret = add_lock_to_list(hlock_class(next), + ret = add_lock_to_list(hlock_class(next), hlock_class(prev), &hlock_class(prev)->locks_after, next->acquire_ip, distance, trace); if (!ret) return 0; - ret = add_lock_to_list(hlock_class(prev), + ret = add_lock_to_list(hlock_class(prev), hlock_class(next), &hlock_class(next)->locks_before, next->acquire_ip, distance, trace); if (!ret) @@ -4107,15 +4109,20 @@ void lockdep_reset(void) */ static void zap_class(struct lock_class *class) { + struct lock_list *entry; int i; /* * Remove all dependencies this lock is * involved in: */ - for (i = 0; i < nr_list_entries; i++) { - if (list_entries[i].class == class) - list_del_rcu(&list_entries[i].entry); + for (i = 0, entry = list_entries; i < nr_list_entries; i++, entry++) { + if (entry->class != class && entry->links_to != class) + continue; + list_del_rcu(&entry->entry); + /* Clear .class and .links_to to avoid double removal. */ + WRITE_ONCE(entry->class, NULL); + WRITE_ONCE(entry->links_to, NULL); } /* * Unhash the class and remove it from the all_lock_classes list: -- cgit v1.3-7-g2ca7 From feb0a3865ed2f7d66a1f2686f7ad784422c249ad Mon Sep 17 00:00:00 2001 From: Bart Van Assche Date: Thu, 14 Feb 2019 15:00:42 -0800 Subject: locking/lockdep: Initialize the locks_before and locks_after lists earlier This patch does not change any functionality. A later patch will reuse lock classes that have been freed. In combination with that patch this patch wil have the effect of initializing lock class order lists once instead of every time a lock class structure is reinitialized. Signed-off-by: Bart Van Assche Signed-off-by: Peter Zijlstra (Intel) Cc: Andrew Morton Cc: Johannes Berg Cc: Linus Torvalds Cc: Paul E. McKenney Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Waiman Long Cc: Will Deacon Cc: johannes.berg@intel.com Cc: tj@kernel.org Link: https://lkml.kernel.org/r/20190214230058.196511-8-bvanassche@acm.org Signed-off-by: Ingo Molnar --- kernel/locking/lockdep.c | 29 +++++++++++++++++++++++++++-- 1 file changed, 27 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 28fbeb2a10cc..d1a6daf1f51f 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -735,6 +735,25 @@ static bool assign_lock_key(struct lockdep_map *lock) return true; } +/* + * Initialize the lock_classes[] array elements. + */ +static void init_data_structures_once(void) +{ + static bool initialization_happened; + int i; + + if (likely(initialization_happened)) + return; + + initialization_happened = true; + + for (i = 0; i < ARRAY_SIZE(lock_classes); i++) { + INIT_LIST_HEAD(&lock_classes[i].locks_after); + INIT_LIST_HEAD(&lock_classes[i].locks_before); + } +} + /* * Register a lock's class in the hash-table, if the class is not present * yet. Otherwise we look it up. We cache the result in the lock object @@ -775,6 +794,8 @@ register_lock_class(struct lockdep_map *lock, unsigned int subclass, int force) goto out_unlock_set; } + init_data_structures_once(); + /* * Allocate a new key from the static array, and add it to * the hash: @@ -793,8 +814,8 @@ register_lock_class(struct lockdep_map *lock, unsigned int subclass, int force) class->key = key; class->name = lock->name; class->subclass = subclass; - INIT_LIST_HEAD(&class->locks_before); - INIT_LIST_HEAD(&class->locks_after); + WARN_ON_ONCE(!list_empty(&class->locks_before)); + WARN_ON_ONCE(!list_empty(&class->locks_after)); class->name_version = count_matching_names(class); /* * We use RCU's safe list-add method to make @@ -4155,6 +4176,8 @@ void lockdep_free_key_range(void *start, unsigned long size) int i; int locked; + init_data_structures_once(); + raw_local_irq_save(flags); locked = graph_lock(); @@ -4218,6 +4241,8 @@ void lockdep_reset_lock(struct lockdep_map *lock) unsigned long flags; int j, locked; + init_data_structures_once(); + raw_local_irq_save(flags); locked = graph_lock(); -- cgit v1.3-7-g2ca7 From 956f3563a8387beb7758f2e8ee483639ef91afc6 Mon Sep 17 00:00:00 2001 From: Bart Van Assche Date: Thu, 14 Feb 2019 15:00:43 -0800 Subject: locking/lockdep: Split lockdep_free_key_range() and lockdep_reset_lock() This patch does not change the behavior of these functions but makes the patch that frees unused lock classes easier to read. Signed-off-by: Bart Van Assche Signed-off-by: Peter Zijlstra (Intel) Cc: Andrew Morton Cc: Johannes Berg Cc: Linus Torvalds Cc: Paul E. McKenney Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Waiman Long Cc: Will Deacon Cc: johannes.berg@intel.com Cc: tj@kernel.org Link: https://lkml.kernel.org/r/20190214230058.196511-9-bvanassche@acm.org Signed-off-by: Ingo Molnar --- kernel/locking/lockdep.c | 72 ++++++++++++++++++++++++------------------------ 1 file changed, 36 insertions(+), 36 deletions(-) (limited to 'kernel') diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index d1a6daf1f51f..2d4c21a02546 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -4160,6 +4160,24 @@ static inline int within(const void *addr, void *start, unsigned long size) return addr >= start && addr < start + size; } +static void __lockdep_free_key_range(void *start, unsigned long size) +{ + struct lock_class *class; + struct hlist_head *head; + int i; + + /* Unhash all classes that were created by a module. */ + for (i = 0; i < CLASSHASH_SIZE; i++) { + head = classhash_table + i; + hlist_for_each_entry_rcu(class, head, hash_entry) { + if (!within(class->key, start, size) && + !within(class->name, start, size)) + continue; + zap_class(class); + } + } +} + /* * Used in module.c to remove lock classes from memory that is going to be * freed; and possibly re-used by other modules. @@ -4170,30 +4188,14 @@ static inline int within(const void *addr, void *start, unsigned long size) */ void lockdep_free_key_range(void *start, unsigned long size) { - struct lock_class *class; - struct hlist_head *head; unsigned long flags; - int i; int locked; init_data_structures_once(); raw_local_irq_save(flags); locked = graph_lock(); - - /* - * Unhash all classes that were created by this module: - */ - for (i = 0; i < CLASSHASH_SIZE; i++) { - head = classhash_table + i; - hlist_for_each_entry_rcu(class, head, hash_entry) { - if (within(class->key, start, size)) - zap_class(class); - else if (within(class->name, start, size)) - zap_class(class); - } - } - + __lockdep_free_key_range(start, size); if (locked) graph_unlock(); raw_local_irq_restore(flags); @@ -4235,16 +4237,11 @@ static bool lock_class_cache_is_registered(struct lockdep_map *lock) return false; } -void lockdep_reset_lock(struct lockdep_map *lock) +/* The caller must hold the graph lock. Does not sleep. */ +static void __lockdep_reset_lock(struct lockdep_map *lock) { struct lock_class *class; - unsigned long flags; - int j, locked; - - init_data_structures_once(); - - raw_local_irq_save(flags); - locked = graph_lock(); + int j; /* * Remove all classes this lock might have: @@ -4261,19 +4258,22 @@ void lockdep_reset_lock(struct lockdep_map *lock) * Debug check: in the end all mapped classes should * be gone. */ - if (unlikely(lock_class_cache_is_registered(lock))) { - if (debug_locks_off_graph_unlock()) { - /* - * We all just reset everything, how did it match? - */ - WARN_ON(1); - } - goto out_restore; - } + if (WARN_ON_ONCE(lock_class_cache_is_registered(lock))) + debug_locks_off(); +} + +void lockdep_reset_lock(struct lockdep_map *lock) +{ + unsigned long flags; + int locked; + + init_data_structures_once(); + + raw_local_irq_save(flags); + locked = graph_lock(); + __lockdep_reset_lock(lock); if (locked) graph_unlock(); - -out_restore: raw_local_irq_restore(flags); } -- cgit v1.3-7-g2ca7 From cdc84d794947b5431c0a6916c303aee7114819d2 Mon Sep 17 00:00:00 2001 From: Bart Van Assche Date: Thu, 14 Feb 2019 15:00:44 -0800 Subject: locking/lockdep: Make it easy to detect whether or not inside a selftest The patch that frees unused lock classes will modify the behavior of lockdep_free_key_range() and lockdep_reset_lock() depending on whether or not these functions are called from the context of the lockdep selftests. Hence make it easy to detect whether or not lockdep code is called from the context of a lockdep selftest. Signed-off-by: Bart Van Assche Signed-off-by: Peter Zijlstra (Intel) Cc: Andrew Morton Cc: Johannes Berg Cc: Linus Torvalds Cc: Paul E. McKenney Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Waiman Long Cc: Will Deacon Cc: johannes.berg@intel.com Cc: tj@kernel.org Link: https://lkml.kernel.org/r/20190214230058.196511-10-bvanassche@acm.org Signed-off-by: Ingo Molnar --- include/linux/lockdep.h | 5 +++++ kernel/locking/lockdep.c | 6 ++++++ lib/locking-selftest.c | 2 ++ 3 files changed, 13 insertions(+) (limited to 'kernel') diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h index b5e6bfe0ae4a..66eee1ba0f2a 100644 --- a/include/linux/lockdep.h +++ b/include/linux/lockdep.h @@ -265,6 +265,7 @@ extern void lockdep_reset(void); extern void lockdep_reset_lock(struct lockdep_map *lock); extern void lockdep_free_key_range(void *start, unsigned long size); extern asmlinkage void lockdep_sys_exit(void); +extern void lockdep_set_selftest_task(struct task_struct *task); extern void lockdep_off(void); extern void lockdep_on(void); @@ -395,6 +396,10 @@ static inline void lockdep_on(void) { } +static inline void lockdep_set_selftest_task(struct task_struct *task) +{ +} + # define lock_acquire(l, s, t, r, c, n, i) do { } while (0) # define lock_release(l, n, i) do { } while (0) # define lock_downgrade(l, i) do { } while (0) diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 2d4c21a02546..34cd87c65f5d 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -81,6 +81,7 @@ module_param(lock_stat, int, 0644); * code to recurse back into the lockdep code... */ static arch_spinlock_t lockdep_lock = (arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED; +static struct task_struct *lockdep_selftest_task_struct; static int graph_lock(void) { @@ -331,6 +332,11 @@ void lockdep_on(void) } EXPORT_SYMBOL(lockdep_on); +void lockdep_set_selftest_task(struct task_struct *task) +{ + lockdep_selftest_task_struct = task; +} + /* * Debugging switches: */ diff --git a/lib/locking-selftest.c b/lib/locking-selftest.c index 1e1bbf171eca..a1705545e6ac 100644 --- a/lib/locking-selftest.c +++ b/lib/locking-selftest.c @@ -1989,6 +1989,7 @@ void locking_selftest(void) init_shared_classes(); debug_locks_silent = !debug_locks_verbose; + lockdep_set_selftest_task(current); DO_TESTCASE_6R("A-A deadlock", AA); DO_TESTCASE_6R("A-B-B-A deadlock", ABBA); @@ -2097,5 +2098,6 @@ void locking_selftest(void) printk("---------------------------------\n"); debug_locks = 1; } + lockdep_set_selftest_task(NULL); debug_locks_silent = 0; } -- cgit v1.3-7-g2ca7 From 29fc33fb7283970701355dc89badba4ed21c7092 Mon Sep 17 00:00:00 2001 From: Bart Van Assche Date: Thu, 14 Feb 2019 15:00:45 -0800 Subject: locking/lockdep: Update two outdated comments synchronize_sched() has been removed recently. Update the comments that refer to synchronize_sched(). Signed-off-by: Bart Van Assche Signed-off-by: Peter Zijlstra (Intel) Cc: Andrew Morton Cc: Johannes Berg Cc: Linus Torvalds Cc: Paul E. McKenney Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Waiman Long Cc: Will Deacon Cc: johannes.berg@intel.com Cc: tj@kernel.org Fixes: 51959d85f32d ("lockdep: Replace synchronize_sched() with synchronize_rcu()") # v5.0-rc1 Link: https://lkml.kernel.org/r/20190214230058.196511-11-bvanassche@acm.org Signed-off-by: Ingo Molnar --- kernel/locking/lockdep.c | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 34cd87c65f5d..c7ca3a4def7e 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -4188,9 +4188,9 @@ static void __lockdep_free_key_range(void *start, unsigned long size) * Used in module.c to remove lock classes from memory that is going to be * freed; and possibly re-used by other modules. * - * We will have had one sync_sched() before getting here, so we're guaranteed - * nobody will look up these exact classes -- they're properly dead but still - * allocated. + * We will have had one synchronize_rcu() before getting here, so we're + * guaranteed nobody will look up these exact classes -- they're properly dead + * but still allocated. */ void lockdep_free_key_range(void *start, unsigned long size) { @@ -4209,8 +4209,6 @@ void lockdep_free_key_range(void *start, unsigned long size) /* * Wait for any possible iterators from look_up_lock_class() to pass * before continuing to free the memory they refer to. - * - * sync_sched() is sufficient because the read-side is IRQ disable. */ synchronize_rcu(); -- cgit v1.3-7-g2ca7 From a0b0fd53e1e67639b303b15939b9c653dbe7a8c4 Mon Sep 17 00:00:00 2001 From: Bart Van Assche Date: Thu, 14 Feb 2019 15:00:46 -0800 Subject: locking/lockdep: Free lock classes that are no longer in use Instead of leaving lock classes that are no longer in use in the lock_classes array, reuse entries from that array that are no longer in use. Maintain a linked list of free lock classes with list head 'free_lock_class'. Only add freed lock classes to the free_lock_classes list after a grace period to avoid that a lock_classes[] element would be reused while an RCU reader is accessing it. Since the lockdep selftests run in a context where sleeping is not allowed and since the selftests require that lock resetting/zapping works with debug_locks off, make the behavior of lockdep_free_key_range() and lockdep_reset_lock() depend on whether or not these are called from the context of the lockdep selftests. Thanks to Peter for having shown how to modify get_pending_free() such that that function does not have to sleep. Signed-off-by: Bart Van Assche Signed-off-by: Peter Zijlstra (Intel) Cc: Andrew Morton Cc: Johannes Berg Cc: Linus Torvalds Cc: Paul E. McKenney Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Waiman Long Cc: Will Deacon Cc: johannes.berg@intel.com Cc: tj@kernel.org Link: https://lkml.kernel.org/r/20190214230058.196511-12-bvanassche@acm.org Signed-off-by: Ingo Molnar --- include/linux/lockdep.h | 9 +- kernel/locking/lockdep.c | 396 +++++++++++++++++++++++++++++++++++++++++------ 2 files changed, 354 insertions(+), 51 deletions(-) (limited to 'kernel') diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h index 66eee1ba0f2a..619ec3f26cdc 100644 --- a/include/linux/lockdep.h +++ b/include/linux/lockdep.h @@ -63,7 +63,8 @@ extern struct lock_class_key __lockdep_no_validate__; #define LOCKSTAT_POINTS 4 /* - * The lock-class itself: + * The lock-class itself. The order of the structure members matters. + * reinit_class() zeroes the key member and all subsequent members. */ struct lock_class { /* @@ -72,7 +73,9 @@ struct lock_class { struct hlist_node hash_entry; /* - * global list of all lock-classes: + * Entry in all_lock_classes when in use. Entry in free_lock_classes + * when not in use. Instances that are being freed are on one of the + * zapped_classes lists. */ struct list_head lock_entry; @@ -104,7 +107,7 @@ struct lock_class { unsigned long contention_point[LOCKSTAT_POINTS]; unsigned long contending_point[LOCKSTAT_POINTS]; #endif -}; +} __no_randomize_layout; #ifdef CONFIG_LOCK_STAT struct lock_time { diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index c7ca3a4def7e..8ecf355dd163 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -50,6 +50,7 @@ #include #include #include +#include #include @@ -135,8 +136,8 @@ static struct lock_list list_entries[MAX_LOCKDEP_ENTRIES]; /* * All data structures here are protected by the global debug_lock. * - * Mutex key structs only get allocated, once during bootup, and never - * get freed - this significantly simplifies the debugging code. + * nr_lock_classes is the number of elements of lock_classes[] that is + * in use. */ unsigned long nr_lock_classes; #ifndef CONFIG_DEBUG_LOCKDEP @@ -278,11 +279,39 @@ static inline void lock_release_holdtime(struct held_lock *hlock) #endif /* - * We keep a global list of all lock classes. The list only grows, - * never shrinks. The list is only accessed with the lockdep - * spinlock lock held. + * We keep a global list of all lock classes. The list is only accessed with + * the lockdep spinlock lock held. free_lock_classes is a list with free + * elements. These elements are linked together by the lock_entry member in + * struct lock_class. */ LIST_HEAD(all_lock_classes); +static LIST_HEAD(free_lock_classes); + +/** + * struct pending_free - information about data structures about to be freed + * @zapped: Head of a list with struct lock_class elements. + */ +struct pending_free { + struct list_head zapped; +}; + +/** + * struct delayed_free - data structures used for delayed freeing + * + * A data structure for delayed freeing of data structures that may be + * accessed by RCU readers at the time these were freed. + * + * @rcu_head: Used to schedule an RCU callback for freeing data structures. + * @index: Index of @pf to which freed data structures are added. + * @scheduled: Whether or not an RCU callback has been scheduled. + * @pf: Array with information about data structures about to be freed. + */ +static struct delayed_free { + struct rcu_head rcu_head; + int index; + int scheduled; + struct pending_free pf[2]; +} delayed_free; /* * The lockdep classes are in a hash-table as well, for fast lookup: @@ -742,7 +771,8 @@ static bool assign_lock_key(struct lockdep_map *lock) } /* - * Initialize the lock_classes[] array elements. + * Initialize the lock_classes[] array elements, the free_lock_classes list + * and also the delayed_free structure. */ static void init_data_structures_once(void) { @@ -754,7 +784,12 @@ static void init_data_structures_once(void) initialization_happened = true; + init_rcu_head(&delayed_free.rcu_head); + INIT_LIST_HEAD(&delayed_free.pf[0].zapped); + INIT_LIST_HEAD(&delayed_free.pf[1].zapped); + for (i = 0; i < ARRAY_SIZE(lock_classes); i++) { + list_add_tail(&lock_classes[i].lock_entry, &free_lock_classes); INIT_LIST_HEAD(&lock_classes[i].locks_after); INIT_LIST_HEAD(&lock_classes[i].locks_before); } @@ -802,11 +837,10 @@ register_lock_class(struct lockdep_map *lock, unsigned int subclass, int force) init_data_structures_once(); - /* - * Allocate a new key from the static array, and add it to - * the hash: - */ - if (nr_lock_classes >= MAX_LOCKDEP_KEYS) { + /* Allocate a new lock class and add it to the hash. */ + class = list_first_entry_or_null(&free_lock_classes, typeof(*class), + lock_entry); + if (!class) { if (!debug_locks_off_graph_unlock()) { return NULL; } @@ -815,7 +849,7 @@ register_lock_class(struct lockdep_map *lock, unsigned int subclass, int force) dump_stack(); return NULL; } - class = lock_classes + nr_lock_classes++; + nr_lock_classes++; debug_atomic_inc(nr_unused_locks); class->key = key; class->name = lock->name; @@ -829,9 +863,10 @@ register_lock_class(struct lockdep_map *lock, unsigned int subclass, int force) */ hlist_add_head_rcu(&class->hash_entry, hash_head); /* - * Add it to the global list of classes: + * Remove the class from the free list and add it to the global list + * of classes. */ - list_add_tail(&class->lock_entry, &all_lock_classes); + list_move_tail(&class->lock_entry, &all_lock_classes); if (verbose(class)) { graph_unlock(); @@ -1860,6 +1895,24 @@ check_prev_add(struct task_struct *curr, struct held_lock *prev, struct lock_list this; int ret; + if (!hlock_class(prev)->key || !hlock_class(next)->key) { + /* + * The warning statements below may trigger a use-after-free + * of the class name. It is better to trigger a use-after free + * and to have the class name most of the time instead of not + * having the class name available. + */ + WARN_ONCE(!debug_locks_silent && !hlock_class(prev)->key, + "Detected use-after-free of lock class %px/%s\n", + hlock_class(prev), + hlock_class(prev)->name); + WARN_ONCE(!debug_locks_silent && !hlock_class(next)->key, + "Detected use-after-free of lock class %px/%s\n", + hlock_class(next), + hlock_class(next)->name); + return 2; + } + /* * Prove that the new -> dependency would not * create a circular dependency in the graph. (We do this by @@ -2242,19 +2295,16 @@ static inline int add_chain_cache(struct task_struct *curr, } /* - * Look up a dependency chain. + * Look up a dependency chain. Must be called with either the graph lock or + * the RCU read lock held. */ static inline struct lock_chain *lookup_chain_cache(u64 chain_key) { struct hlist_head *hash_head = chainhashentry(chain_key); struct lock_chain *chain; - /* - * We can walk it lock-free, because entries only get added - * to the hash: - */ hlist_for_each_entry_rcu(chain, hash_head, entry) { - if (chain->chain_key == chain_key) { + if (READ_ONCE(chain->chain_key) == chain_key) { debug_atomic_inc(chain_lookup_hits); return chain; } @@ -3337,6 +3387,11 @@ static int __lock_acquire(struct lockdep_map *lock, unsigned int subclass, if (nest_lock && !__lock_is_held(nest_lock, -1)) return print_lock_nested_lock_not_held(curr, hlock, ip); + if (!debug_locks_silent) { + WARN_ON_ONCE(depth && !hlock_class(hlock - 1)->key); + WARN_ON_ONCE(!hlock_class(hlock)->key); + } + if (!validate_chain(curr, lock, hlock, chain_head, chain_key)) return 0; @@ -4131,14 +4186,92 @@ void lockdep_reset(void) raw_local_irq_restore(flags); } +/* Remove a class from a lock chain. Must be called with the graph lock held. */ +static void remove_class_from_lock_chain(struct lock_chain *chain, + struct lock_class *class) +{ +#ifdef CONFIG_PROVE_LOCKING + struct lock_chain *new_chain; + u64 chain_key; + int i; + + for (i = chain->base; i < chain->base + chain->depth; i++) { + if (chain_hlocks[i] != class - lock_classes) + continue; + /* The code below leaks one chain_hlock[] entry. */ + if (--chain->depth > 0) + memmove(&chain_hlocks[i], &chain_hlocks[i + 1], + (chain->base + chain->depth - i) * + sizeof(chain_hlocks[0])); + /* + * Each lock class occurs at most once in a lock chain so once + * we found a match we can break out of this loop. + */ + goto recalc; + } + /* Since the chain has not been modified, return. */ + return; + +recalc: + chain_key = 0; + for (i = chain->base; i < chain->base + chain->depth; i++) + chain_key = iterate_chain_key(chain_key, chain_hlocks[i] + 1); + if (chain->depth && chain->chain_key == chain_key) + return; + /* Overwrite the chain key for concurrent RCU readers. */ + WRITE_ONCE(chain->chain_key, chain_key); + /* + * Note: calling hlist_del_rcu() from inside a + * hlist_for_each_entry_rcu() loop is safe. + */ + hlist_del_rcu(&chain->entry); + if (chain->depth == 0) + return; + /* + * If the modified lock chain matches an existing lock chain, drop + * the modified lock chain. + */ + if (lookup_chain_cache(chain_key)) + return; + if (WARN_ON_ONCE(nr_lock_chains >= MAX_LOCKDEP_CHAINS)) { + debug_locks_off(); + return; + } + /* + * Leak *chain because it is not safe to reinsert it before an RCU + * grace period has expired. + */ + new_chain = lock_chains + nr_lock_chains++; + *new_chain = *chain; + hlist_add_head_rcu(&new_chain->entry, chainhashentry(chain_key)); +#endif +} + +/* Must be called with the graph lock held. */ +static void remove_class_from_lock_chains(struct lock_class *class) +{ + struct lock_chain *chain; + struct hlist_head *head; + int i; + + for (i = 0; i < ARRAY_SIZE(chainhash_table); i++) { + head = chainhash_table + i; + hlist_for_each_entry_rcu(chain, head, entry) { + remove_class_from_lock_chain(chain, class); + } + } +} + /* * Remove all references to a lock class. The caller must hold the graph lock. */ -static void zap_class(struct lock_class *class) +static void zap_class(struct pending_free *pf, struct lock_class *class) { struct lock_list *entry; int i; + WARN_ON_ONCE(!class->key); + /* * Remove all dependencies this lock is * involved in: @@ -4151,14 +4284,33 @@ static void zap_class(struct lock_class *class) WRITE_ONCE(entry->class, NULL); WRITE_ONCE(entry->links_to, NULL); } - /* - * Unhash the class and remove it from the all_lock_classes list: - */ - hlist_del_rcu(&class->hash_entry); - list_del(&class->lock_entry); + if (list_empty(&class->locks_after) && + list_empty(&class->locks_before)) { + list_move_tail(&class->lock_entry, &pf->zapped); + hlist_del_rcu(&class->hash_entry); + WRITE_ONCE(class->key, NULL); + WRITE_ONCE(class->name, NULL); + nr_lock_classes--; + } else { + WARN_ONCE(true, "%s() failed for class %s\n", __func__, + class->name); + } - RCU_INIT_POINTER(class->key, NULL); - RCU_INIT_POINTER(class->name, NULL); + remove_class_from_lock_chains(class); +} + +static void reinit_class(struct lock_class *class) +{ + void *const p = class; + const unsigned int offset = offsetof(struct lock_class, key); + + WARN_ON_ONCE(!class->lock_entry.next); + WARN_ON_ONCE(!list_empty(&class->locks_after)); + WARN_ON_ONCE(!list_empty(&class->locks_before)); + memset(p + offset, 0, sizeof(*class) - offset); + WARN_ON_ONCE(!class->lock_entry.next); + WARN_ON_ONCE(!list_empty(&class->locks_after)); + WARN_ON_ONCE(!list_empty(&class->locks_before)); } static inline int within(const void *addr, void *start, unsigned long size) @@ -4166,7 +4318,87 @@ static inline int within(const void *addr, void *start, unsigned long size) return addr >= start && addr < start + size; } -static void __lockdep_free_key_range(void *start, unsigned long size) +static bool inside_selftest(void) +{ + return current == lockdep_selftest_task_struct; +} + +/* The caller must hold the graph lock. */ +static struct pending_free *get_pending_free(void) +{ + return delayed_free.pf + delayed_free.index; +} + +static void free_zapped_rcu(struct rcu_head *cb); + +/* + * Schedule an RCU callback if no RCU callback is pending. Must be called with + * the graph lock held. + */ +static void call_rcu_zapped(struct pending_free *pf) +{ + WARN_ON_ONCE(inside_selftest()); + + if (list_empty(&pf->zapped)) + return; + + if (delayed_free.scheduled) + return; + + delayed_free.scheduled = true; + + WARN_ON_ONCE(delayed_free.pf + delayed_free.index != pf); + delayed_free.index ^= 1; + + call_rcu(&delayed_free.rcu_head, free_zapped_rcu); +} + +/* The caller must hold the graph lock. May be called from RCU context. */ +static void __free_zapped_classes(struct pending_free *pf) +{ + struct lock_class *class; + + list_for_each_entry(class, &pf->zapped, lock_entry) + reinit_class(class); + + list_splice_init(&pf->zapped, &free_lock_classes); +} + +static void free_zapped_rcu(struct rcu_head *ch) +{ + struct pending_free *pf; + unsigned long flags; + + if (WARN_ON_ONCE(ch != &delayed_free.rcu_head)) + return; + + raw_local_irq_save(flags); + if (!graph_lock()) + goto out_irq; + + /* closed head */ + pf = delayed_free.pf + (delayed_free.index ^ 1); + __free_zapped_classes(pf); + delayed_free.scheduled = false; + + /* + * If there's anything on the open list, close and start a new callback. + */ + call_rcu_zapped(delayed_free.pf + delayed_free.index); + + graph_unlock(); +out_irq: + raw_local_irq_restore(flags); +} + +/* + * Remove all lock classes from the class hash table and from the + * all_lock_classes list whose key or name is in the address range [start, + * start + size). Move these lock classes to the zapped_classes list. Must + * be called with the graph lock held. + */ +static void __lockdep_free_key_range(struct pending_free *pf, void *start, + unsigned long size) { struct lock_class *class; struct hlist_head *head; @@ -4179,7 +4411,7 @@ static void __lockdep_free_key_range(void *start, unsigned long size) if (!within(class->key, start, size) && !within(class->name, start, size)) continue; - zap_class(class); + zap_class(pf, class); } } } @@ -4192,8 +4424,9 @@ static void __lockdep_free_key_range(void *start, unsigned long size) * guaranteed nobody will look up these exact classes -- they're properly dead * but still allocated. */ -void lockdep_free_key_range(void *start, unsigned long size) +static void lockdep_free_key_range_reg(void *start, unsigned long size) { + struct pending_free *pf; unsigned long flags; int locked; @@ -4201,9 +4434,15 @@ void lockdep_free_key_range(void *start, unsigned long size) raw_local_irq_save(flags); locked = graph_lock(); - __lockdep_free_key_range(start, size); - if (locked) - graph_unlock(); + if (!locked) + goto out_irq; + + pf = get_pending_free(); + __lockdep_free_key_range(pf, start, size); + call_rcu_zapped(pf); + + graph_unlock(); +out_irq: raw_local_irq_restore(flags); /* @@ -4211,12 +4450,35 @@ void lockdep_free_key_range(void *start, unsigned long size) * before continuing to free the memory they refer to. */ synchronize_rcu(); +} - /* - * XXX at this point we could return the resources to the pool; - * instead we leak them. We would need to change to bitmap allocators - * instead of the linear allocators we have now. - */ +/* + * Free all lockdep keys in the range [start, start+size). Does not sleep. + * Ignores debug_locks. Must only be used by the lockdep selftests. + */ +static void lockdep_free_key_range_imm(void *start, unsigned long size) +{ + struct pending_free *pf = delayed_free.pf; + unsigned long flags; + + init_data_structures_once(); + + raw_local_irq_save(flags); + arch_spin_lock(&lockdep_lock); + __lockdep_free_key_range(pf, start, size); + __free_zapped_classes(pf); + arch_spin_unlock(&lockdep_lock); + raw_local_irq_restore(flags); +} + +void lockdep_free_key_range(void *start, unsigned long size) +{ + init_data_structures_once(); + + if (inside_selftest()) + lockdep_free_key_range_imm(start, size); + else + lockdep_free_key_range_reg(start, size); } /* @@ -4242,7 +4504,8 @@ static bool lock_class_cache_is_registered(struct lockdep_map *lock) } /* The caller must hold the graph lock. Does not sleep. */ -static void __lockdep_reset_lock(struct lockdep_map *lock) +static void __lockdep_reset_lock(struct pending_free *pf, + struct lockdep_map *lock) { struct lock_class *class; int j; @@ -4256,7 +4519,7 @@ static void __lockdep_reset_lock(struct lockdep_map *lock) */ class = look_up_lock_class(lock, j); if (class) - zap_class(class); + zap_class(pf, class); } /* * Debug check: in the end all mapped classes should @@ -4266,21 +4529,57 @@ static void __lockdep_reset_lock(struct lockdep_map *lock) debug_locks_off(); } -void lockdep_reset_lock(struct lockdep_map *lock) +/* + * Remove all information lockdep has about a lock if debug_locks == 1. Free + * released data structures from RCU context. + */ +static void lockdep_reset_lock_reg(struct lockdep_map *lock) { + struct pending_free *pf; unsigned long flags; int locked; - init_data_structures_once(); - raw_local_irq_save(flags); locked = graph_lock(); - __lockdep_reset_lock(lock); - if (locked) - graph_unlock(); + if (!locked) + goto out_irq; + + pf = get_pending_free(); + __lockdep_reset_lock(pf, lock); + call_rcu_zapped(pf); + + graph_unlock(); +out_irq: + raw_local_irq_restore(flags); +} + +/* + * Reset a lock. Does not sleep. Ignores debug_locks. Must only be used by the + * lockdep selftests. + */ +static void lockdep_reset_lock_imm(struct lockdep_map *lock) +{ + struct pending_free *pf = delayed_free.pf; + unsigned long flags; + + raw_local_irq_save(flags); + arch_spin_lock(&lockdep_lock); + __lockdep_reset_lock(pf, lock); + __free_zapped_classes(pf); + arch_spin_unlock(&lockdep_lock); raw_local_irq_restore(flags); } +void lockdep_reset_lock(struct lockdep_map *lock) +{ + init_data_structures_once(); + + if (inside_selftest()) + lockdep_reset_lock_imm(lock); + else + lockdep_reset_lock_reg(lock); +} + void __init lockdep_init(void) { printk("Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar\n"); @@ -4297,7 +4596,8 @@ void __init lockdep_init(void) (sizeof(lock_classes) + sizeof(classhash_table) + sizeof(list_entries) + - sizeof(chainhash_table) + sizeof(chainhash_table) + + sizeof(delayed_free) #ifdef CONFIG_PROVE_LOCKING + sizeof(lock_cq) + sizeof(lock_chains) -- cgit v1.3-7-g2ca7 From ace35a7ac493d4284a57ad807579011bebba891c Mon Sep 17 00:00:00 2001 From: Bart Van Assche Date: Thu, 14 Feb 2019 15:00:47 -0800 Subject: locking/lockdep: Reuse list entries that are no longer in use Instead of abandoning elements of list_entries[] that are no longer in use, make alloc_list_entry() reuse array elements that have been freed. Signed-off-by: Bart Van Assche Signed-off-by: Peter Zijlstra (Intel) Cc: Andrew Morton Cc: Johannes Berg Cc: Linus Torvalds Cc: Paul E. McKenney Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Waiman Long Cc: Will Deacon Cc: johannes.berg@intel.com Cc: tj@kernel.org Link: https://lkml.kernel.org/r/20190214230058.196511-13-bvanassche@acm.org Signed-off-by: Ingo Molnar --- kernel/locking/lockdep.c | 24 ++++++++++++++++-------- 1 file changed, 16 insertions(+), 8 deletions(-) (limited to 'kernel') diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 8ecf355dd163..2c6d0b67e7b6 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -45,6 +45,7 @@ #include #include #include +#include #include #include #include @@ -132,6 +133,7 @@ static inline int debug_locks_off_graph_unlock(void) unsigned long nr_list_entries; static struct lock_list list_entries[MAX_LOCKDEP_ENTRIES]; +static DECLARE_BITMAP(list_entries_in_use, MAX_LOCKDEP_ENTRIES); /* * All data structures here are protected by the global debug_lock. @@ -907,7 +909,10 @@ out_set_class_cache: */ static struct lock_list *alloc_list_entry(void) { - if (nr_list_entries >= MAX_LOCKDEP_ENTRIES) { + int idx = find_first_zero_bit(list_entries_in_use, + ARRAY_SIZE(list_entries)); + + if (idx >= ARRAY_SIZE(list_entries)) { if (!debug_locks_off_graph_unlock()) return NULL; @@ -915,7 +920,9 @@ static struct lock_list *alloc_list_entry(void) dump_stack(); return NULL; } - return list_entries + nr_list_entries++; + nr_list_entries++; + __set_bit(idx, list_entries_in_use); + return list_entries + idx; } /* @@ -1019,7 +1026,7 @@ static inline void mark_lock_accessed(struct lock_list *lock, unsigned long nr; nr = lock - list_entries; - WARN_ON(nr >= nr_list_entries); /* Out-of-bounds, input fail */ + WARN_ON(nr >= ARRAY_SIZE(list_entries)); /* Out-of-bounds, input fail */ lock->parent = parent; lock->class->dep_gen_id = lockdep_dependency_gen_id; } @@ -1029,7 +1036,7 @@ static inline unsigned long lock_accessed(struct lock_list *lock) unsigned long nr; nr = lock - list_entries; - WARN_ON(nr >= nr_list_entries); /* Out-of-bounds, input fail */ + WARN_ON(nr >= ARRAY_SIZE(list_entries)); /* Out-of-bounds, input fail */ return lock->class->dep_gen_id == lockdep_dependency_gen_id; } @@ -4276,13 +4283,13 @@ static void zap_class(struct pending_free *pf, struct lock_class *class) * Remove all dependencies this lock is * involved in: */ - for (i = 0, entry = list_entries; i < nr_list_entries; i++, entry++) { + for_each_set_bit(i, list_entries_in_use, ARRAY_SIZE(list_entries)) { + entry = list_entries + i; if (entry->class != class && entry->links_to != class) continue; + __clear_bit(i, list_entries_in_use); + nr_list_entries--; list_del_rcu(&entry->entry); - /* Clear .class and .links_to to avoid double removal. */ - WRITE_ONCE(entry->class, NULL); - WRITE_ONCE(entry->links_to, NULL); } if (list_empty(&class->locks_after) && list_empty(&class->locks_before)) { @@ -4596,6 +4603,7 @@ void __init lockdep_init(void) (sizeof(lock_classes) + sizeof(classhash_table) + sizeof(list_entries) + + sizeof(list_entries_in_use) + sizeof(chainhash_table) + sizeof(delayed_free) #ifdef CONFIG_PROVE_LOCKING -- cgit v1.3-7-g2ca7 From 2212684adff79e2704a2792ff46682afb9246fc8 Mon Sep 17 00:00:00 2001 From: Bart Van Assche Date: Thu, 14 Feb 2019 15:00:48 -0800 Subject: locking/lockdep: Introduce lockdep_next_lockchain() and lock_chain_count() This patch does not change any functionality but makes the next patch in this series easier to read. Signed-off-by: Bart Van Assche Signed-off-by: Peter Zijlstra (Intel) Cc: Andrew Morton Cc: Johannes Berg Cc: Linus Torvalds Cc: Paul E. McKenney Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Waiman Long Cc: Will Deacon Cc: johannes.berg@intel.com Cc: tj@kernel.org Link: https://lkml.kernel.org/r/20190214230058.196511-14-bvanassche@acm.org Signed-off-by: Ingo Molnar --- kernel/locking/lockdep.c | 16 +++++++++++++++- kernel/locking/lockdep_internals.h | 3 ++- kernel/locking/lockdep_proc.c | 12 ++++++------ 3 files changed, 23 insertions(+), 8 deletions(-) (limited to 'kernel') diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 2c6d0b67e7b6..753a9b758266 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -2096,7 +2096,7 @@ out_bug: return 0; } -unsigned long nr_lock_chains; +static unsigned long nr_lock_chains; struct lock_chain lock_chains[MAX_LOCKDEP_CHAINS]; int nr_chain_hlocks; static u16 chain_hlocks[MAX_LOCKDEP_CHAIN_HLOCKS]; @@ -2230,6 +2230,20 @@ static int check_no_collision(struct task_struct *curr, return 1; } +/* + * Given an index that is >= -1, return the index of the next lock chain. + * Return -2 if there is no next lock chain. + */ +long lockdep_next_lockchain(long i) +{ + return i + 1 < nr_lock_chains ? i + 1 : -2; +} + +unsigned long lock_chain_count(void) +{ + return nr_lock_chains; +} + /* * Adds a dependency chain into chain hashtable. And must be called with * graph_lock held. diff --git a/kernel/locking/lockdep_internals.h b/kernel/locking/lockdep_internals.h index 2ebb9d0ea91c..d4c197425f68 100644 --- a/kernel/locking/lockdep_internals.h +++ b/kernel/locking/lockdep_internals.h @@ -100,7 +100,8 @@ struct lock_class *lock_chain_get_class(struct lock_chain *chain, int i); extern unsigned long nr_lock_classes; extern unsigned long nr_list_entries; -extern unsigned long nr_lock_chains; +long lockdep_next_lockchain(long i); +unsigned long lock_chain_count(void); extern int nr_chain_hlocks; extern unsigned long nr_stack_trace_entries; diff --git a/kernel/locking/lockdep_proc.c b/kernel/locking/lockdep_proc.c index 3d31f9b0059e..9c49ec645d8b 100644 --- a/kernel/locking/lockdep_proc.c +++ b/kernel/locking/lockdep_proc.c @@ -104,18 +104,18 @@ static const struct seq_operations lockdep_ops = { #ifdef CONFIG_PROVE_LOCKING static void *lc_start(struct seq_file *m, loff_t *pos) { + if (*pos < 0) + return NULL; + if (*pos == 0) return SEQ_START_TOKEN; - if (*pos - 1 < nr_lock_chains) - return lock_chains + (*pos - 1); - - return NULL; + return lock_chains + (*pos - 1); } static void *lc_next(struct seq_file *m, void *v, loff_t *pos) { - (*pos)++; + *pos = lockdep_next_lockchain(*pos - 1) + 1; return lc_start(m, pos); } @@ -268,7 +268,7 @@ static int lockdep_stats_show(struct seq_file *m, void *v) #ifdef CONFIG_PROVE_LOCKING seq_printf(m, " dependency chains: %11lu [max: %lu]\n", - nr_lock_chains, MAX_LOCKDEP_CHAINS); + lock_chain_count(), MAX_LOCKDEP_CHAINS); seq_printf(m, " dependency chain hlocks: %11d [max: %lu]\n", nr_chain_hlocks, MAX_LOCKDEP_CHAIN_HLOCKS); #endif -- cgit v1.3-7-g2ca7 From 527af3ea273b2cf0c017a2c90090b3c94af8aba4 Mon Sep 17 00:00:00 2001 From: Bart Van Assche Date: Thu, 14 Feb 2019 15:00:49 -0800 Subject: locking/lockdep: Fix a comment in add_chain_cache() Reflect that add_chain_cache() is always called with the graph lock held. Signed-off-by: Bart Van Assche Signed-off-by: Peter Zijlstra (Intel) Cc: Andrew Morton Cc: Johannes Berg Cc: Linus Torvalds Cc: Paul E. McKenney Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Waiman Long Cc: Will Deacon Cc: johannes.berg@intel.com Cc: tj@kernel.org Link: https://lkml.kernel.org/r/20190214230058.196511-15-bvanassche@acm.org Signed-off-by: Ingo Molnar --- kernel/locking/lockdep.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 753a9b758266..ec0cb794f70d 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -2266,7 +2266,7 @@ static inline int add_chain_cache(struct task_struct *curr, */ /* - * We might need to take the graph lock, ensure we've got IRQs + * The caller must hold the graph lock, ensure we've got IRQs * disabled to make this an IRQ-safe lock.. for recursion reasons * lockdep won't complain about its own locking errors. */ -- cgit v1.3-7-g2ca7 From de4643a77356a77bce73f64275b125b4b71a69cf Mon Sep 17 00:00:00 2001 From: Bart Van Assche Date: Thu, 14 Feb 2019 15:00:50 -0800 Subject: locking/lockdep: Reuse lock chains that have been freed A previous patch introduced a lock chain leak. Fix that leak by reusing lock chains that have been freed. Signed-off-by: Bart Van Assche Signed-off-by: Peter Zijlstra (Intel) Cc: Andrew Morton Cc: Johannes Berg Cc: Linus Torvalds Cc: Paul E. McKenney Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Waiman Long Cc: Will Deacon Cc: johannes.berg@intel.com Cc: tj@kernel.org Link: https://lkml.kernel.org/r/20190214230058.196511-16-bvanassche@acm.org Signed-off-by: Ingo Molnar --- kernel/locking/lockdep.c | 57 +++++++++++++++++++++++++++++++----------------- 1 file changed, 37 insertions(+), 20 deletions(-) (limited to 'kernel') diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index ec0cb794f70d..0bb204464afe 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -292,9 +292,12 @@ static LIST_HEAD(free_lock_classes); /** * struct pending_free - information about data structures about to be freed * @zapped: Head of a list with struct lock_class elements. + * @lock_chains_being_freed: Bitmap that indicates which lock_chains[] elements + * are about to be freed. */ struct pending_free { struct list_head zapped; + DECLARE_BITMAP(lock_chains_being_freed, MAX_LOCKDEP_CHAINS); }; /** @@ -2096,8 +2099,8 @@ out_bug: return 0; } -static unsigned long nr_lock_chains; struct lock_chain lock_chains[MAX_LOCKDEP_CHAINS]; +static DECLARE_BITMAP(lock_chains_in_use, MAX_LOCKDEP_CHAINS); int nr_chain_hlocks; static u16 chain_hlocks[MAX_LOCKDEP_CHAIN_HLOCKS]; @@ -2236,12 +2239,25 @@ static int check_no_collision(struct task_struct *curr, */ long lockdep_next_lockchain(long i) { - return i + 1 < nr_lock_chains ? i + 1 : -2; + i = find_next_bit(lock_chains_in_use, ARRAY_SIZE(lock_chains), i + 1); + return i < ARRAY_SIZE(lock_chains) ? i : -2; } unsigned long lock_chain_count(void) { - return nr_lock_chains; + return bitmap_weight(lock_chains_in_use, ARRAY_SIZE(lock_chains)); +} + +/* Must be called with the graph lock held. */ +static struct lock_chain *alloc_lock_chain(void) +{ + int idx = find_first_zero_bit(lock_chains_in_use, + ARRAY_SIZE(lock_chains)); + + if (unlikely(idx >= ARRAY_SIZE(lock_chains))) + return NULL; + __set_bit(idx, lock_chains_in_use); + return lock_chains + idx; } /* @@ -2260,11 +2276,6 @@ static inline int add_chain_cache(struct task_struct *curr, struct lock_chain *chain; int i, j; - /* - * Allocate a new chain entry from the static array, and add - * it to the hash: - */ - /* * The caller must hold the graph lock, ensure we've got IRQs * disabled to make this an IRQ-safe lock.. for recursion reasons @@ -2273,7 +2284,8 @@ static inline int add_chain_cache(struct task_struct *curr, if (DEBUG_LOCKS_WARN_ON(!irqs_disabled())) return 0; - if (unlikely(nr_lock_chains >= MAX_LOCKDEP_CHAINS)) { + chain = alloc_lock_chain(); + if (!chain) { if (!debug_locks_off_graph_unlock()) return 0; @@ -2281,7 +2293,6 @@ static inline int add_chain_cache(struct task_struct *curr, dump_stack(); return 0; } - chain = lock_chains + nr_lock_chains++; chain->chain_key = chain_key; chain->irq_context = hlock->irq_context; i = get_first_held_lock(curr, hlock); @@ -4208,7 +4219,8 @@ void lockdep_reset(void) } /* Remove a class from a lock chain. Must be called with the graph lock held. */ -static void remove_class_from_lock_chain(struct lock_chain *chain, +static void remove_class_from_lock_chain(struct pending_free *pf, + struct lock_chain *chain, struct lock_class *class) { #ifdef CONFIG_PROVE_LOCKING @@ -4246,6 +4258,7 @@ recalc: * hlist_for_each_entry_rcu() loop is safe. */ hlist_del_rcu(&chain->entry); + __set_bit(chain - lock_chains, pf->lock_chains_being_freed); if (chain->depth == 0) return; /* @@ -4254,22 +4267,19 @@ recalc: */ if (lookup_chain_cache(chain_key)) return; - if (WARN_ON_ONCE(nr_lock_chains >= MAX_LOCKDEP_CHAINS)) { + new_chain = alloc_lock_chain(); + if (WARN_ON_ONCE(!new_chain)) { debug_locks_off(); return; } - /* - * Leak *chain because it is not safe to reinsert it before an RCU - * grace period has expired. - */ - new_chain = lock_chains + nr_lock_chains++; *new_chain = *chain; hlist_add_head_rcu(&new_chain->entry, chainhashentry(chain_key)); #endif } /* Must be called with the graph lock held. */ -static void remove_class_from_lock_chains(struct lock_class *class) +static void remove_class_from_lock_chains(struct pending_free *pf, + struct lock_class *class) { struct lock_chain *chain; struct hlist_head *head; @@ -4278,7 +4288,7 @@ static void remove_class_from_lock_chains(struct lock_class *class) for (i = 0; i < ARRAY_SIZE(chainhash_table); i++) { head = chainhash_table + i; hlist_for_each_entry_rcu(chain, head, entry) { - remove_class_from_lock_chain(chain, class); + remove_class_from_lock_chain(pf, chain, class); } } } @@ -4317,7 +4327,7 @@ static void zap_class(struct pending_free *pf, struct lock_class *class) class->name); } - remove_class_from_lock_chains(class); + remove_class_from_lock_chains(pf, class); } static void reinit_class(struct lock_class *class) @@ -4383,6 +4393,12 @@ static void __free_zapped_classes(struct pending_free *pf) reinit_class(class); list_splice_init(&pf->zapped, &free_lock_classes); + +#ifdef CONFIG_PROVE_LOCKING + bitmap_andnot(lock_chains_in_use, lock_chains_in_use, + pf->lock_chains_being_freed, ARRAY_SIZE(lock_chains)); + bitmap_clear(pf->lock_chains_being_freed, 0, ARRAY_SIZE(lock_chains)); +#endif } static void free_zapped_rcu(struct rcu_head *ch) @@ -4623,6 +4639,7 @@ void __init lockdep_init(void) #ifdef CONFIG_PROVE_LOCKING + sizeof(lock_cq) + sizeof(lock_chains) + + sizeof(lock_chains_in_use) + sizeof(chain_hlocks) #endif ) / 1024 -- cgit v1.3-7-g2ca7 From b526b2e39a53b312f5a6867ce57824247aa0ce8b Mon Sep 17 00:00:00 2001 From: Bart Van Assche Date: Thu, 14 Feb 2019 15:00:51 -0800 Subject: locking/lockdep: Check data structure consistency Debugging lockdep data structure inconsistencies is challenging. Add code that verifies data structure consistency at runtime. That code is disabled by default because it is very CPU intensive. Signed-off-by: Bart Van Assche Signed-off-by: Peter Zijlstra (Intel) Cc: Andrew Morton Cc: Johannes Berg Cc: Linus Torvalds Cc: Paul E. McKenney Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Waiman Long Cc: Will Deacon Cc: johannes.berg@intel.com Cc: tj@kernel.org Link: https://lkml.kernel.org/r/20190214230058.196511-17-bvanassche@acm.org Signed-off-by: Ingo Molnar --- kernel/locking/lockdep.c | 167 +++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 167 insertions(+) (limited to 'kernel') diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 0bb204464afe..630be9ac6253 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -74,6 +74,8 @@ module_param(lock_stat, int, 0644); #define lock_stat 0 #endif +static bool check_data_structure_consistency; + /* * lockdep_lock: protects the lockdep graph, the hashes and the * class/list/hash allocators. @@ -775,6 +777,168 @@ static bool assign_lock_key(struct lockdep_map *lock) return true; } +/* Check whether element @e occurs in list @h */ +static bool in_list(struct list_head *e, struct list_head *h) +{ + struct list_head *f; + + list_for_each(f, h) { + if (e == f) + return true; + } + + return false; +} + +/* + * Check whether entry @e occurs in any of the locks_after or locks_before + * lists. + */ +static bool in_any_class_list(struct list_head *e) +{ + struct lock_class *class; + int i; + + for (i = 0; i < ARRAY_SIZE(lock_classes); i++) { + class = &lock_classes[i]; + if (in_list(e, &class->locks_after) || + in_list(e, &class->locks_before)) + return true; + } + return false; +} + +static bool class_lock_list_valid(struct lock_class *c, struct list_head *h) +{ + struct lock_list *e; + + list_for_each_entry(e, h, entry) { + if (e->links_to != c) { + printk(KERN_INFO "class %s: mismatch for lock entry %ld; class %s <> %s", + c->name ? : "(?)", + (unsigned long)(e - list_entries), + e->links_to && e->links_to->name ? + e->links_to->name : "(?)", + e->class && e->class->name ? e->class->name : + "(?)"); + return false; + } + } + return true; +} + +static u16 chain_hlocks[]; + +static bool check_lock_chain_key(struct lock_chain *chain) +{ +#ifdef CONFIG_PROVE_LOCKING + u64 chain_key = 0; + int i; + + for (i = chain->base; i < chain->base + chain->depth; i++) + chain_key = iterate_chain_key(chain_key, chain_hlocks[i] + 1); + /* + * The 'unsigned long long' casts avoid that a compiler warning + * is reported when building tools/lib/lockdep. + */ + if (chain->chain_key != chain_key) + printk(KERN_INFO "chain %lld: key %#llx <> %#llx\n", + (unsigned long long)(chain - lock_chains), + (unsigned long long)chain->chain_key, + (unsigned long long)chain_key); + return chain->chain_key == chain_key; +#else + return true; +#endif +} + +static bool in_any_zapped_class_list(struct lock_class *class) +{ + struct pending_free *pf; + int i; + + for (i = 0, pf = delayed_free.pf; i < ARRAY_SIZE(delayed_free.pf); + i++, pf++) + if (in_list(&class->lock_entry, &pf->zapped)) + return true; + + return false; +} + +static bool check_data_structures(void) +{ + struct lock_class *class; + struct lock_chain *chain; + struct hlist_head *head; + struct lock_list *e; + int i; + + /* Check whether all classes occur in a lock list. */ + for (i = 0; i < ARRAY_SIZE(lock_classes); i++) { + class = &lock_classes[i]; + if (!in_list(&class->lock_entry, &all_lock_classes) && + !in_list(&class->lock_entry, &free_lock_classes) && + !in_any_zapped_class_list(class)) { + printk(KERN_INFO "class %px/%s is not in any class list\n", + class, class->name ? : "(?)"); + return false; + return false; + } + } + + /* Check whether all classes have valid lock lists. */ + for (i = 0; i < ARRAY_SIZE(lock_classes); i++) { + class = &lock_classes[i]; + if (!class_lock_list_valid(class, &class->locks_before)) + return false; + if (!class_lock_list_valid(class, &class->locks_after)) + return false; + } + + /* Check the chain_key of all lock chains. */ + for (i = 0; i < ARRAY_SIZE(chainhash_table); i++) { + head = chainhash_table + i; + hlist_for_each_entry_rcu(chain, head, entry) { + if (!check_lock_chain_key(chain)) + return false; + } + } + + /* + * Check whether all list entries that are in use occur in a class + * lock list. + */ + for_each_set_bit(i, list_entries_in_use, ARRAY_SIZE(list_entries)) { + e = list_entries + i; + if (!in_any_class_list(&e->entry)) { + printk(KERN_INFO "list entry %d is not in any class list; class %s <> %s\n", + (unsigned int)(e - list_entries), + e->class->name ? : "(?)", + e->links_to->name ? : "(?)"); + return false; + } + } + + /* + * Check whether all list entries that are not in use do not occur in + * a class lock list. + */ + for_each_clear_bit(i, list_entries_in_use, ARRAY_SIZE(list_entries)) { + e = list_entries + i; + if (in_any_class_list(&e->entry)) { + printk(KERN_INFO "list entry %d occurs in a class list; class %s <> %s\n", + (unsigned int)(e - list_entries), + e->class && e->class->name ? e->class->name : + "(?)", + e->links_to && e->links_to->name ? + e->links_to->name : "(?)"); + return false; + } + } + + return true; +} + /* * Initialize the lock_classes[] array elements, the free_lock_classes list * and also the delayed_free structure. @@ -4389,6 +4553,9 @@ static void __free_zapped_classes(struct pending_free *pf) { struct lock_class *class; + if (check_data_structure_consistency) + WARN_ON_ONCE(!check_data_structures()); + list_for_each_entry(class, &pf->zapped, lock_entry) reinit_class(class); -- cgit v1.3-7-g2ca7 From 4bf508621855613ca2ac782f70c3171e0e8bb011 Mon Sep 17 00:00:00 2001 From: Bart Van Assche Date: Thu, 14 Feb 2019 15:00:52 -0800 Subject: locking/lockdep: Verify whether lock objects are small enough to be used as class keys Signed-off-by: Bart Van Assche Signed-off-by: Peter Zijlstra (Intel) Cc: Andrew Morton Cc: Johannes Berg Cc: Linus Torvalds Cc: Paul E. McKenney Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Waiman Long Cc: Will Deacon Cc: johannes.berg@intel.com Cc: tj@kernel.org Link: https://lkml.kernel.org/r/20190214230058.196511-18-bvanassche@acm.org Signed-off-by: Ingo Molnar --- kernel/locking/lockdep.c | 11 +++++++++++ 1 file changed, 11 insertions(+) (limited to 'kernel') diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 630be9ac6253..84427441824e 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -758,6 +758,17 @@ static bool assign_lock_key(struct lockdep_map *lock) { unsigned long can_addr, addr = (unsigned long)lock; +#ifdef __KERNEL__ + /* + * lockdep_free_key_range() assumes that struct lock_class_key + * objects do not overlap. Since we use the address of lock + * objects as class key for static objects, check whether the + * size of lock_class_key objects does not exceed the size of + * the smallest lock object. + */ + BUILD_BUG_ON(sizeof(struct lock_class_key) > sizeof(raw_spinlock_t)); +#endif + if (__is_kernel_percpu_address(addr, &can_addr)) lock->key = (void *)can_addr; else if (__is_module_percpu_address(addr, &can_addr)) -- cgit v1.3-7-g2ca7 From 108c14858b9ea224686e476c8f5ec345a0df9e27 Mon Sep 17 00:00:00 2001 From: Bart Van Assche Date: Thu, 14 Feb 2019 15:00:53 -0800 Subject: locking/lockdep: Add support for dynamic keys A shortcoming of the current lockdep implementation is that it requires lock keys to be allocated statically. That forces all instances of lock objects that occur in a given data structure to share a lock key. Since lock dependency analysis groups lock objects per key sharing lock keys can cause false positive lockdep reports. Make it possible to avoid such false positive reports by allowing lock keys to be allocated dynamically. Require that dynamically allocated lock keys are registered before use by calling lockdep_register_key(). Complain about attempts to register the same lock key pointer twice without calling lockdep_unregister_key() between successive registration calls. The purpose of the new lock_keys_hash[] data structure that keeps track of all dynamic keys is twofold: - Verify whether the lockdep_register_key() and lockdep_unregister_key() functions are used correctly. - Avoid that lockdep_init_map() complains when encountering a dynamically allocated key. Signed-off-by: Bart Van Assche Signed-off-by: Peter Zijlstra (Intel) Cc: Andrew Morton Cc: Johannes Berg Cc: Linus Torvalds Cc: Paul E. McKenney Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Waiman Long Cc: Will Deacon Cc: johannes.berg@intel.com Cc: tj@kernel.org Link: https://lkml.kernel.org/r/20190214230058.196511-19-bvanassche@acm.org Signed-off-by: Ingo Molnar --- include/linux/lockdep.h | 21 ++++++-- kernel/locking/lockdep.c | 121 +++++++++++++++++++++++++++++++++++++++++++---- 2 files changed, 131 insertions(+), 11 deletions(-) (limited to 'kernel') diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h index 619ec3f26cdc..43fb35bd7baf 100644 --- a/include/linux/lockdep.h +++ b/include/linux/lockdep.h @@ -46,15 +46,19 @@ extern int lock_stat; #define NR_LOCKDEP_CACHING_CLASSES 2 /* - * Lock-classes are keyed via unique addresses, by embedding the - * lockclass-key into the kernel (or module) .data section. (For - * static locks we use the lock address itself as the key.) + * A lockdep key is associated with each lock object. For static locks we use + * the lock address itself as the key. Dynamically allocated lock objects can + * have a statically or dynamically allocated key. Dynamically allocated lock + * keys must be registered before being used and must be unregistered before + * the key memory is freed. */ struct lockdep_subclass_key { char __one_byte; } __attribute__ ((__packed__)); +/* hash_entry is used to keep track of dynamically allocated keys. */ struct lock_class_key { + struct hlist_node hash_entry; struct lockdep_subclass_key subkeys[MAX_LOCKDEP_SUBCLASSES]; }; @@ -273,6 +277,9 @@ extern void lockdep_set_selftest_task(struct task_struct *task); extern void lockdep_off(void); extern void lockdep_on(void); +extern void lockdep_register_key(struct lock_class_key *key); +extern void lockdep_unregister_key(struct lock_class_key *key); + /* * These methods are used by specific locking variants (spinlocks, * rwlocks, mutexes and rwsems) to pass init/acquire/release events @@ -434,6 +441,14 @@ static inline void lockdep_set_selftest_task(struct task_struct *task) */ struct lock_class_key { }; +static inline void lockdep_register_key(struct lock_class_key *key) +{ +} + +static inline void lockdep_unregister_key(struct lock_class_key *key) +{ +} + /* * The lockdep_map takes no space if lockdep is disabled: */ diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 84427441824e..c73bc4334bee 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -143,6 +143,9 @@ static DECLARE_BITMAP(list_entries_in_use, MAX_LOCKDEP_ENTRIES); * nr_lock_classes is the number of elements of lock_classes[] that is * in use. */ +#define KEYHASH_BITS (MAX_LOCKDEP_KEYS_BITS - 1) +#define KEYHASH_SIZE (1UL << KEYHASH_BITS) +static struct hlist_head lock_keys_hash[KEYHASH_SIZE]; unsigned long nr_lock_classes; #ifndef CONFIG_DEBUG_LOCKDEP static @@ -641,7 +644,7 @@ static int very_verbose(struct lock_class *class) * Is this the address of a static object: */ #ifdef __KERNEL__ -static int static_obj(void *obj) +static int static_obj(const void *obj) { unsigned long start = (unsigned long) &_stext, end = (unsigned long) &_end, @@ -975,6 +978,71 @@ static void init_data_structures_once(void) } } +static inline struct hlist_head *keyhashentry(const struct lock_class_key *key) +{ + unsigned long hash = hash_long((uintptr_t)key, KEYHASH_BITS); + + return lock_keys_hash + hash; +} + +/* Register a dynamically allocated key. */ +void lockdep_register_key(struct lock_class_key *key) +{ + struct hlist_head *hash_head; + struct lock_class_key *k; + unsigned long flags; + + if (WARN_ON_ONCE(static_obj(key))) + return; + hash_head = keyhashentry(key); + + raw_local_irq_save(flags); + if (!graph_lock()) + goto restore_irqs; + hlist_for_each_entry_rcu(k, hash_head, hash_entry) { + if (WARN_ON_ONCE(k == key)) + goto out_unlock; + } + hlist_add_head_rcu(&key->hash_entry, hash_head); +out_unlock: + graph_unlock(); +restore_irqs: + raw_local_irq_restore(flags); +} +EXPORT_SYMBOL_GPL(lockdep_register_key); + +/* Check whether a key has been registered as a dynamic key. */ +static bool is_dynamic_key(const struct lock_class_key *key) +{ + struct hlist_head *hash_head; + struct lock_class_key *k; + bool found = false; + + if (WARN_ON_ONCE(static_obj(key))) + return false; + + /* + * If lock debugging is disabled lock_keys_hash[] may contain + * pointers to memory that has already been freed. Avoid triggering + * a use-after-free in that case by returning early. + */ + if (!debug_locks) + return true; + + hash_head = keyhashentry(key); + + rcu_read_lock(); + hlist_for_each_entry_rcu(k, hash_head, hash_entry) { + if (k == key) { + found = true; + break; + } + } + rcu_read_unlock(); + + return found; +} + /* * Register a lock's class in the hash-table, if the class is not present * yet. Otherwise we look it up. We cache the result in the lock object @@ -996,7 +1064,7 @@ register_lock_class(struct lockdep_map *lock, unsigned int subclass, int force) if (!lock->key) { if (!assign_lock_key(lock)) return NULL; - } else if (!static_obj(lock->key)) { + } else if (!static_obj(lock->key) && !is_dynamic_key(lock->key)) { return NULL; } @@ -3378,13 +3446,12 @@ void lockdep_init_map(struct lockdep_map *lock, const char *name, if (DEBUG_LOCKS_WARN_ON(!key)) return; /* - * Sanity check, the lock-class key must be persistent: + * Sanity check, the lock-class key must either have been allocated + * statically or must have been registered as a dynamic key. */ - if (!static_obj(key)) { - printk("BUG: key %px not in .data!\n", key); - /* - * What it says above ^^^^^, I suggest you read it. - */ + if (!static_obj(key) && !is_dynamic_key(key)) { + if (debug_locks) + printk(KERN_ERR "BUG: key %px has not been registered!\n", key); DEBUG_LOCKS_WARN_ON(1); return; } @@ -4795,6 +4862,44 @@ void lockdep_reset_lock(struct lockdep_map *lock) lockdep_reset_lock_reg(lock); } +/* Unregister a dynamically allocated key. */ +void lockdep_unregister_key(struct lock_class_key *key) +{ + struct hlist_head *hash_head = keyhashentry(key); + struct lock_class_key *k; + struct pending_free *pf; + unsigned long flags; + bool found = false; + + might_sleep(); + + if (WARN_ON_ONCE(static_obj(key))) + return; + + raw_local_irq_save(flags); + if (!graph_lock()) + goto out_irq; + + pf = get_pending_free(); + hlist_for_each_entry_rcu(k, hash_head, hash_entry) { + if (k == key) { + hlist_del_rcu(&k->hash_entry); + found = true; + break; + } + } + WARN_ON_ONCE(!found); + __lockdep_free_key_range(pf, key, 1); + call_rcu_zapped(pf); + graph_unlock(); +out_irq: + raw_local_irq_restore(flags); + + /* Wait until is_dynamic_key() has finished accessing k->hash_entry. */ + synchronize_rcu(); +} +EXPORT_SYMBOL_GPL(lockdep_unregister_key); + void __init lockdep_init(void) { printk("Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar\n"); -- cgit v1.3-7-g2ca7 From 669de8bda87b92ab9a2fc663b3f5743c2ad1ae9f Mon Sep 17 00:00:00 2001 From: Bart Van Assche Date: Thu, 14 Feb 2019 15:00:54 -0800 Subject: kernel/workqueue: Use dynamic lockdep keys for workqueues The following commit: 87915adc3f0a ("workqueue: re-add lockdep dependencies for flushing") improved deadlock checking in the workqueue implementation. Unfortunately that patch also introduced a few false positive lockdep complaints. This patch suppresses these false positives by allocating the workqueue mutex lockdep key dynamically. An example of a false positive lockdep complaint suppressed by this patch can be found below. The root cause of the lockdep complaint shown below is that the direct I/O code can call alloc_workqueue() from inside a work item created by another alloc_workqueue() call and that both workqueues share the same lockdep key. This patch avoids that that lockdep complaint is triggered by allocating the work queue lockdep keys dynamically. In other words, this patch guarantees that a unique lockdep key is associated with each work queue mutex. ====================================================== WARNING: possible circular locking dependency detected 4.19.0-dbg+ #1 Not tainted fio/4129 is trying to acquire lock: 00000000a01cfe1a ((wq_completion)"dio/%s"sb->s_id){+.+.}, at: flush_workqueue+0xd0/0x970 but task is already holding lock: 00000000a0acecf9 (&sb->s_type->i_mutex_key#14){+.+.}, at: ext4_file_write_iter+0x154/0x710 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (&sb->s_type->i_mutex_key#14){+.+.}: down_write+0x3d/0x80 __generic_file_fsync+0x77/0xf0 ext4_sync_file+0x3c9/0x780 vfs_fsync_range+0x66/0x100 dio_complete+0x2f5/0x360 dio_aio_complete_work+0x1c/0x20 process_one_work+0x481/0x9f0 worker_thread+0x63/0x5a0 kthread+0x1cf/0x1f0 ret_from_fork+0x24/0x30 -> #1 ((work_completion)(&dio->complete_work)){+.+.}: process_one_work+0x447/0x9f0 worker_thread+0x63/0x5a0 kthread+0x1cf/0x1f0 ret_from_fork+0x24/0x30 -> #0 ((wq_completion)"dio/%s"sb->s_id){+.+.}: lock_acquire+0xc5/0x200 flush_workqueue+0xf3/0x970 drain_workqueue+0xec/0x220 destroy_workqueue+0x23/0x350 sb_init_dio_done_wq+0x6a/0x80 do_blockdev_direct_IO+0x1f33/0x4be0 __blockdev_direct_IO+0x79/0x86 ext4_direct_IO+0x5df/0xbb0 generic_file_direct_write+0x119/0x220 __generic_file_write_iter+0x131/0x2d0 ext4_file_write_iter+0x3fa/0x710 aio_write+0x235/0x330 io_submit_one+0x510/0xeb0 __x64_sys_io_submit+0x122/0x340 do_syscall_64+0x71/0x220 entry_SYSCALL_64_after_hwframe+0x49/0xbe other info that might help us debug this: Chain exists of: (wq_completion)"dio/%s"sb->s_id --> (work_completion)(&dio->complete_work) --> &sb->s_type->i_mutex_key#14 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&sb->s_type->i_mutex_key#14); lock((work_completion)(&dio->complete_work)); lock(&sb->s_type->i_mutex_key#14); lock((wq_completion)"dio/%s"sb->s_id); *** DEADLOCK *** 1 lock held by fio/4129: #0: 00000000a0acecf9 (&sb->s_type->i_mutex_key#14){+.+.}, at: ext4_file_write_iter+0x154/0x710 stack backtrace: CPU: 3 PID: 4129 Comm: fio Not tainted 4.19.0-dbg+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 Call Trace: dump_stack+0x86/0xc5 print_circular_bug.isra.32+0x20a/0x218 __lock_acquire+0x1c68/0x1cf0 lock_acquire+0xc5/0x200 flush_workqueue+0xf3/0x970 drain_workqueue+0xec/0x220 destroy_workqueue+0x23/0x350 sb_init_dio_done_wq+0x6a/0x80 do_blockdev_direct_IO+0x1f33/0x4be0 __blockdev_direct_IO+0x79/0x86 ext4_direct_IO+0x5df/0xbb0 generic_file_direct_write+0x119/0x220 __generic_file_write_iter+0x131/0x2d0 ext4_file_write_iter+0x3fa/0x710 aio_write+0x235/0x330 io_submit_one+0x510/0xeb0 __x64_sys_io_submit+0x122/0x340 do_syscall_64+0x71/0x220 entry_SYSCALL_64_after_hwframe+0x49/0xbe Signed-off-by: Bart Van Assche Signed-off-by: Peter Zijlstra (Intel) Cc: Andrew Morton Cc: Johannes Berg Cc: Linus Torvalds Cc: Paul E. McKenney Cc: Peter Zijlstra Cc: Tejun Heo Cc: Thomas Gleixner Cc: Waiman Long Cc: Will Deacon Link: https://lkml.kernel.org/r/20190214230058.196511-20-bvanassche@acm.org [ Reworked the changelog a bit. ] Signed-off-by: Ingo Molnar --- include/linux/workqueue.h | 28 ++++------------------ kernel/workqueue.c | 59 +++++++++++++++++++++++++++++++++++++++-------- 2 files changed, 54 insertions(+), 33 deletions(-) (limited to 'kernel') diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h index 60d673e15632..d9a1a480e920 100644 --- a/include/linux/workqueue.h +++ b/include/linux/workqueue.h @@ -390,43 +390,23 @@ extern struct workqueue_struct *system_freezable_wq; extern struct workqueue_struct *system_power_efficient_wq; extern struct workqueue_struct *system_freezable_power_efficient_wq; -extern struct workqueue_struct * -__alloc_workqueue_key(const char *fmt, unsigned int flags, int max_active, - struct lock_class_key *key, const char *lock_name, ...) __printf(1, 6); - /** * alloc_workqueue - allocate a workqueue * @fmt: printf format for the name of the workqueue * @flags: WQ_* flags * @max_active: max in-flight work items, 0 for default - * @args...: args for @fmt + * remaining args: args for @fmt * * Allocate a workqueue with the specified parameters. For detailed * information on WQ_* flags, please refer to * Documentation/core-api/workqueue.rst. * - * The __lock_name macro dance is to guarantee that single lock_class_key - * doesn't end up with different namesm, which isn't allowed by lockdep. - * * RETURNS: * Pointer to the allocated workqueue on success, %NULL on failure. */ -#ifdef CONFIG_LOCKDEP -#define alloc_workqueue(fmt, flags, max_active, args...) \ -({ \ - static struct lock_class_key __key; \ - const char *__lock_name; \ - \ - __lock_name = "(wq_completion)"#fmt#args; \ - \ - __alloc_workqueue_key((fmt), (flags), (max_active), \ - &__key, __lock_name, ##args); \ -}) -#else -#define alloc_workqueue(fmt, flags, max_active, args...) \ - __alloc_workqueue_key((fmt), (flags), (max_active), \ - NULL, NULL, ##args) -#endif +struct workqueue_struct *alloc_workqueue(const char *fmt, + unsigned int flags, + int max_active, ...); /** * alloc_ordered_workqueue - allocate an ordered workqueue diff --git a/kernel/workqueue.c b/kernel/workqueue.c index fc5d23d752a5..e163e7a7f5e5 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -259,6 +259,8 @@ struct workqueue_struct { struct wq_device *wq_dev; /* I: for sysfs interface */ #endif #ifdef CONFIG_LOCKDEP + char *lock_name; + struct lock_class_key key; struct lockdep_map lockdep_map; #endif char name[WQ_NAME_LEN]; /* I: workqueue name */ @@ -3337,11 +3339,49 @@ static int init_worker_pool(struct worker_pool *pool) return 0; } +#ifdef CONFIG_LOCKDEP +static void wq_init_lockdep(struct workqueue_struct *wq) +{ + char *lock_name; + + lockdep_register_key(&wq->key); + lock_name = kasprintf(GFP_KERNEL, "%s%s", "(wq_completion)", wq->name); + if (!lock_name) + lock_name = wq->name; + lockdep_init_map(&wq->lockdep_map, lock_name, &wq->key, 0); +} + +static void wq_unregister_lockdep(struct workqueue_struct *wq) +{ + lockdep_unregister_key(&wq->key); +} + +static void wq_free_lockdep(struct workqueue_struct *wq) +{ + if (wq->lock_name != wq->name) + kfree(wq->lock_name); +} +#else +static void wq_init_lockdep(struct workqueue_struct *wq) +{ +} + +static void wq_unregister_lockdep(struct workqueue_struct *wq) +{ +} + +static void wq_free_lockdep(struct workqueue_struct *wq) +{ +} +#endif + static void rcu_free_wq(struct rcu_head *rcu) { struct workqueue_struct *wq = container_of(rcu, struct workqueue_struct, rcu); + wq_free_lockdep(wq); + if (!(wq->flags & WQ_UNBOUND)) free_percpu(wq->cpu_pwqs); else @@ -3532,8 +3572,10 @@ static void pwq_unbound_release_workfn(struct work_struct *work) * If we're the last pwq going away, @wq is already dead and no one * is gonna access it anymore. Schedule RCU free. */ - if (is_last) + if (is_last) { + wq_unregister_lockdep(wq); call_rcu(&wq->rcu, rcu_free_wq); + } } /** @@ -4067,11 +4109,9 @@ static int init_rescuer(struct workqueue_struct *wq) return 0; } -struct workqueue_struct *__alloc_workqueue_key(const char *fmt, - unsigned int flags, - int max_active, - struct lock_class_key *key, - const char *lock_name, ...) +struct workqueue_struct *alloc_workqueue(const char *fmt, + unsigned int flags, + int max_active, ...) { size_t tbl_size = 0; va_list args; @@ -4106,7 +4146,7 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt, goto err_free_wq; } - va_start(args, lock_name); + va_start(args, max_active); vsnprintf(wq->name, sizeof(wq->name), fmt, args); va_end(args); @@ -4123,7 +4163,7 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt, INIT_LIST_HEAD(&wq->flusher_overflow); INIT_LIST_HEAD(&wq->maydays); - lockdep_init_map(&wq->lockdep_map, lock_name, key, 0); + wq_init_lockdep(wq); INIT_LIST_HEAD(&wq->list); if (alloc_and_link_pwqs(wq) < 0) @@ -4161,7 +4201,7 @@ err_destroy: destroy_workqueue(wq); return NULL; } -EXPORT_SYMBOL_GPL(__alloc_workqueue_key); +EXPORT_SYMBOL_GPL(alloc_workqueue); /** * destroy_workqueue - safely terminate a workqueue @@ -4214,6 +4254,7 @@ void destroy_workqueue(struct workqueue_struct *wq) kthread_stop(wq->rescuer->task); if (!(wq->flags & WQ_UNBOUND)) { + wq_unregister_lockdep(wq); /* * The base ref is never dropped on per-cpu pwqs. Directly * schedule RCU free. -- cgit v1.3-7-g2ca7 From 72dcd505e8585857397207f28782c8ba55e553b5 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Mon, 25 Feb 2019 15:56:32 +0100 Subject: locking/lockdep: Add module_param to enable consistency checks And move the whole lot under CONFIG_DEBUG_LOCKDEP. Signed-off-by: Peter Zijlstra (Intel) Cc: Andrew Morton Cc: Linus Torvalds Cc: Paul E. McKenney Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Will Deacon Signed-off-by: Ingo Molnar --- kernel/locking/lockdep.c | 46 +++++++++++++++++++++++++++++++++------------- 1 file changed, 33 insertions(+), 13 deletions(-) (limited to 'kernel') diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index c73bc4334bee..fda370ca529c 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -74,8 +74,6 @@ module_param(lock_stat, int, 0644); #define lock_stat 0 #endif -static bool check_data_structure_consistency; - /* * lockdep_lock: protects the lockdep graph, the hashes and the * class/list/hash allocators. @@ -791,6 +789,8 @@ static bool assign_lock_key(struct lockdep_map *lock) return true; } +#ifdef CONFIG_DEBUG_LOCKDEP + /* Check whether element @e occurs in list @h */ static bool in_list(struct list_head *e, struct list_head *h) { @@ -855,15 +855,15 @@ static bool check_lock_chain_key(struct lock_chain *chain) * The 'unsigned long long' casts avoid that a compiler warning * is reported when building tools/lib/lockdep. */ - if (chain->chain_key != chain_key) + if (chain->chain_key != chain_key) { printk(KERN_INFO "chain %lld: key %#llx <> %#llx\n", (unsigned long long)(chain - lock_chains), (unsigned long long)chain->chain_key, (unsigned long long)chain_key); - return chain->chain_key == chain_key; -#else - return true; + return false; + } #endif + return true; } static bool in_any_zapped_class_list(struct lock_class *class) @@ -871,15 +871,15 @@ static bool in_any_zapped_class_list(struct lock_class *class) struct pending_free *pf; int i; - for (i = 0, pf = delayed_free.pf; i < ARRAY_SIZE(delayed_free.pf); - i++, pf++) + for (i = 0, pf = delayed_free.pf; i < ARRAY_SIZE(delayed_free.pf); i++, pf++) { if (in_list(&class->lock_entry, &pf->zapped)) return true; + } return false; } -static bool check_data_structures(void) +static bool __check_data_structures(void) { struct lock_class *class; struct lock_chain *chain; @@ -896,7 +896,6 @@ static bool check_data_structures(void) printk(KERN_INFO "class %px/%s is not in any class list\n", class, class->name ? : "(?)"); return false; - return false; } } @@ -953,6 +952,27 @@ static bool check_data_structures(void) return true; } +int check_consistency = 0; +module_param(check_consistency, int, 0644); + +static void check_data_structures(void) +{ + static bool once = false; + + if (check_consistency && !once) { + if (!__check_data_structures()) { + once = true; + WARN_ON(once); + } + } +} + +#else /* CONFIG_DEBUG_LOCKDEP */ + +static inline void check_data_structures(void) { } + +#endif /* CONFIG_DEBUG_LOCKDEP */ + /* * Initialize the lock_classes[] array elements, the free_lock_classes list * and also the delayed_free structure. @@ -4474,10 +4494,11 @@ static void remove_class_from_lock_chain(struct pending_free *pf, if (chain_hlocks[i] != class - lock_classes) continue; /* The code below leaks one chain_hlock[] entry. */ - if (--chain->depth > 0) + if (--chain->depth > 0) { memmove(&chain_hlocks[i], &chain_hlocks[i + 1], (chain->base + chain->depth - i) * sizeof(chain_hlocks[0])); + } /* * Each lock class occurs at most once in a lock chain so once * we found a match we can break out of this loop. @@ -4631,8 +4652,7 @@ static void __free_zapped_classes(struct pending_free *pf) { struct lock_class *class; - if (check_data_structure_consistency) - WARN_ON_ONCE(!check_data_structures()); + check_data_structures(); list_for_each_entry(class, &pf->zapped, lock_entry) reinit_class(class); -- cgit v1.3-7-g2ca7 From 90129625d9203a917fc1d3e4768976ba90d71b44 Mon Sep 17 00:00:00 2001 From: Al Viro Date: Sat, 5 Jan 2019 00:38:03 -0500 Subject: cgroup: start switching to fs_context Unfortunately, cgroup is tangled into kernfs infrastructure. To avoid converting all kernfs-based filesystems at once, we need to untangle the remount part of things, instead of having it go through kernfs_sop_remount_fs(). Fortunately, it's not hard to do. This commit just gets cgroup/cgroup1 to use fs_context to deliver options on mount and remount paths. Parsing those is going to be done in the next commits; for now we do pretty much what legacy case does. Signed-off-by: Al Viro --- kernel/cgroup/cgroup-internal.h | 14 +++++ kernel/cgroup/cgroup-v1.c | 9 +-- kernel/cgroup/cgroup.c | 134 +++++++++++++++++++++++++++++----------- 3 files changed, 116 insertions(+), 41 deletions(-) (limited to 'kernel') diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h index c9a35f09e4b9..a89cb0ba7a68 100644 --- a/kernel/cgroup/cgroup-internal.h +++ b/kernel/cgroup/cgroup-internal.h @@ -7,6 +7,7 @@ #include #include #include +#include #define TRACE_CGROUP_PATH_LEN 1024 extern spinlock_t trace_cgroup_path_lock; @@ -36,6 +37,18 @@ extern void __init enable_debug_cgroup(void); } \ } while (0) +/* + * The cgroup filesystem superblock creation/mount context. + */ +struct cgroup_fs_context { + char *data; +}; + +static inline struct cgroup_fs_context *cgroup_fc2context(struct fs_context *fc) +{ + return fc->fs_private; +} + /* * A cgroup can be associated with multiple css_sets as different tasks may * belong to different cgroups on different hierarchies. In the other @@ -255,5 +268,6 @@ void cgroup1_check_for_release(struct cgroup *cgrp); struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags, void *data, unsigned long magic, struct cgroup_namespace *ns); +int cgroup1_reconfigure(struct fs_context *ctx); #endif /* __CGROUP_INTERNAL_H */ diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c index f94a7229974e..e377e19dd3e6 100644 --- a/kernel/cgroup/cgroup-v1.c +++ b/kernel/cgroup/cgroup-v1.c @@ -1046,17 +1046,19 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts) return 0; } -static int cgroup1_remount(struct kernfs_root *kf_root, int *flags, char *data) +int cgroup1_reconfigure(struct fs_context *fc) { - int ret = 0; + struct cgroup_fs_context *ctx = cgroup_fc2context(fc); + struct kernfs_root *kf_root = kernfs_root_from_sb(fc->root->d_sb); struct cgroup_root *root = cgroup_root_from_kf(kf_root); + int ret = 0; struct cgroup_sb_opts opts; u16 added_mask, removed_mask; cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp); /* See what subsystems are wanted */ - ret = parse_cgroupfs_options(data, &opts); + ret = parse_cgroupfs_options(ctx->data, &opts); if (ret) goto out_unlock; @@ -1106,7 +1108,6 @@ static int cgroup1_remount(struct kernfs_root *kf_root, int *flags, char *data) struct kernfs_syscall_ops cgroup1_kf_syscall_ops = { .rename = cgroup1_rename, .show_options = cgroup1_show_options, - .remount_fs = cgroup1_remount, .mkdir = cgroup_mkdir, .rmdir = cgroup_rmdir, .show_path = cgroup_show_path, diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 7fd9f22e406d..7f7db5f967e3 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -1811,12 +1811,13 @@ static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root return 0; } -static int cgroup_remount(struct kernfs_root *kf_root, int *flags, char *data) +static int cgroup_reconfigure(struct fs_context *fc) { + struct cgroup_fs_context *ctx = cgroup_fc2context(fc); unsigned int root_flags; int ret; - ret = parse_cgroup_root_flags(data, &root_flags); + ret = parse_cgroup_root_flags(ctx->data, &root_flags); if (ret) return ret; @@ -2067,21 +2068,98 @@ struct dentry *cgroup_do_mount(struct file_system_type *fs_type, int flags, return dentry; } -static struct dentry *cgroup_mount(struct file_system_type *fs_type, - int flags, const char *unused_dev_name, - void *data) +/* + * Destroy a cgroup filesystem context. + */ +static void cgroup_fs_context_free(struct fs_context *fc) +{ + struct cgroup_fs_context *ctx = cgroup_fc2context(fc); + + kfree(ctx); +} + +static int cgroup_parse_monolithic(struct fs_context *fc, void *data) +{ + struct cgroup_fs_context *ctx = cgroup_fc2context(fc); + + ctx->data = data; + if (ctx->data) + security_sb_eat_lsm_opts(ctx->data, &fc->security); + return 0; +} + +static int cgroup_get_tree(struct fs_context *fc) { struct cgroup_namespace *ns = current->nsproxy->cgroup_ns; - struct dentry *dentry; + struct cgroup_fs_context *ctx = cgroup_fc2context(fc); + unsigned int root_flags; + struct dentry *root; int ret; - get_cgroup_ns(ns); + /* Check if the caller has permission to mount. */ + if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) + return -EPERM; + + ret = parse_cgroup_root_flags(ctx->data, &root_flags); + if (ret) + return ret; + + cgrp_dfl_visible = true; + cgroup_get_live(&cgrp_dfl_root.cgrp); + + root = cgroup_do_mount(&cgroup2_fs_type, fc->sb_flags, &cgrp_dfl_root, + CGROUP2_SUPER_MAGIC, ns); + if (IS_ERR(root)) + return PTR_ERR(root); + + apply_cgroup_root_flags(root_flags); + fc->root = root; + return 0; +} + +static int cgroup1_get_tree(struct fs_context *fc) +{ + struct cgroup_namespace *ns = current->nsproxy->cgroup_ns; + struct cgroup_fs_context *ctx = cgroup_fc2context(fc); + struct dentry *root; /* Check if the caller has permission to mount. */ - if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) { - put_cgroup_ns(ns); - return ERR_PTR(-EPERM); - } + if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) + return -EPERM; + + root = cgroup1_mount(&cgroup_fs_type, fc->sb_flags, ctx->data, + CGROUP_SUPER_MAGIC, ns); + if (IS_ERR(root)) + return PTR_ERR(root); + + fc->root = root; + return 0; +} + +static const struct fs_context_operations cgroup_fs_context_ops = { + .free = cgroup_fs_context_free, + .parse_monolithic = cgroup_parse_monolithic, + .get_tree = cgroup_get_tree, + .reconfigure = cgroup_reconfigure, +}; + +static const struct fs_context_operations cgroup1_fs_context_ops = { + .free = cgroup_fs_context_free, + .parse_monolithic = cgroup_parse_monolithic, + .get_tree = cgroup1_get_tree, + .reconfigure = cgroup1_reconfigure, +}; + +/* + * Initialise the cgroup filesystem creation/reconfiguration context. + */ +static int cgroup_init_fs_context(struct fs_context *fc) +{ + struct cgroup_fs_context *ctx; + + ctx = kzalloc(sizeof(struct cgroup_fs_context), GFP_KERNEL); + if (!ctx) + return -ENOMEM; /* * The first time anyone tries to mount a cgroup, enable the list @@ -2090,29 +2168,12 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type, if (!use_task_css_set_links) cgroup_enable_task_cg_lists(); - if (fs_type == &cgroup2_fs_type) { - unsigned int root_flags; - - ret = parse_cgroup_root_flags(data, &root_flags); - if (ret) { - put_cgroup_ns(ns); - return ERR_PTR(ret); - } - - cgrp_dfl_visible = true; - cgroup_get_live(&cgrp_dfl_root.cgrp); - - dentry = cgroup_do_mount(&cgroup2_fs_type, flags, &cgrp_dfl_root, - CGROUP2_SUPER_MAGIC, ns); - if (!IS_ERR(dentry)) - apply_cgroup_root_flags(root_flags); - } else { - dentry = cgroup1_mount(&cgroup_fs_type, flags, data, - CGROUP_SUPER_MAGIC, ns); - } - - put_cgroup_ns(ns); - return dentry; + fc->fs_private = ctx; + if (fc->fs_type == &cgroup2_fs_type) + fc->ops = &cgroup_fs_context_ops; + else + fc->ops = &cgroup1_fs_context_ops; + return 0; } static void cgroup_kill_sb(struct super_block *sb) @@ -2136,14 +2197,14 @@ static void cgroup_kill_sb(struct super_block *sb) struct file_system_type cgroup_fs_type = { .name = "cgroup", - .mount = cgroup_mount, + .init_fs_context = cgroup_init_fs_context, .kill_sb = cgroup_kill_sb, .fs_flags = FS_USERNS_MOUNT, }; static struct file_system_type cgroup2_fs_type = { .name = "cgroup2", - .mount = cgroup_mount, + .init_fs_context = cgroup_init_fs_context, .kill_sb = cgroup_kill_sb, .fs_flags = FS_USERNS_MOUNT, }; @@ -5268,7 +5329,6 @@ int cgroup_rmdir(struct kernfs_node *kn) static struct kernfs_syscall_ops cgroup_kf_syscall_ops = { .show_options = cgroup_show_options, - .remount_fs = cgroup_remount, .mkdir = cgroup_mkdir, .rmdir = cgroup_rmdir, .show_path = cgroup_show_path, -- cgit v1.3-7-g2ca7 From 7feeef58690a5ea8c5033d43e696ef41b28d82eb Mon Sep 17 00:00:00 2001 From: Al Viro Date: Wed, 16 Jan 2019 21:23:02 -0500 Subject: cgroup: fold cgroup1_mount() into cgroup1_get_tree() Signed-off-by: Al Viro --- kernel/cgroup/cgroup-internal.h | 4 +--- kernel/cgroup/cgroup-v1.c | 26 +++++++++++++++++--------- kernel/cgroup/cgroup.c | 19 ------------------- 3 files changed, 18 insertions(+), 31 deletions(-) (limited to 'kernel') diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h index a89cb0ba7a68..37836d598ff8 100644 --- a/kernel/cgroup/cgroup-internal.h +++ b/kernel/cgroup/cgroup-internal.h @@ -265,9 +265,7 @@ bool cgroup1_ssid_disabled(int ssid); void cgroup1_pidlist_destroy_all(struct cgroup *cgrp); void cgroup1_release_agent(struct work_struct *work); void cgroup1_check_for_release(struct cgroup *cgrp); -struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags, - void *data, unsigned long magic, - struct cgroup_namespace *ns); +int cgroup1_get_tree(struct fs_context *fc); int cgroup1_reconfigure(struct fs_context *ctx); #endif /* __CGROUP_INTERNAL_H */ diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c index e377e19dd3e6..7ae3810dcbdf 100644 --- a/kernel/cgroup/cgroup-v1.c +++ b/kernel/cgroup/cgroup-v1.c @@ -1113,20 +1113,24 @@ struct kernfs_syscall_ops cgroup1_kf_syscall_ops = { .show_path = cgroup_show_path, }; -struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags, - void *data, unsigned long magic, - struct cgroup_namespace *ns) +int cgroup1_get_tree(struct fs_context *fc) { + struct cgroup_namespace *ns = current->nsproxy->cgroup_ns; + struct cgroup_fs_context *ctx = cgroup_fc2context(fc); struct cgroup_sb_opts opts; struct cgroup_root *root; struct cgroup_subsys *ss; struct dentry *dentry; int i, ret; + /* Check if the caller has permission to mount. */ + if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) + return -EPERM; + cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp); /* First find the desired set of subsystems */ - ret = parse_cgroupfs_options(data, &opts); + ret = parse_cgroupfs_options(ctx->data, &opts); if (ret) goto out_unlock; @@ -1228,19 +1232,23 @@ out_free: kfree(opts.name); if (ret) - return ERR_PTR(ret); + return ret; - dentry = cgroup_do_mount(&cgroup_fs_type, flags, root, + dentry = cgroup_do_mount(&cgroup_fs_type, fc->sb_flags, root, CGROUP_SUPER_MAGIC, ns); + if (IS_ERR(dentry)) + return PTR_ERR(dentry); - if (!IS_ERR(dentry) && percpu_ref_is_dying(&root->cgrp.self.refcnt)) { + if (percpu_ref_is_dying(&root->cgrp.self.refcnt)) { struct super_block *sb = dentry->d_sb; dput(dentry); deactivate_locked_super(sb); msleep(10); - dentry = ERR_PTR(restart_syscall()); + return restart_syscall(); } - return dentry; + + fc->root = dentry; + return 0; } static int __init cgroup1_wq_init(void) diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 7f7db5f967e3..0652f74064a2 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -2117,25 +2117,6 @@ static int cgroup_get_tree(struct fs_context *fc) return 0; } -static int cgroup1_get_tree(struct fs_context *fc) -{ - struct cgroup_namespace *ns = current->nsproxy->cgroup_ns; - struct cgroup_fs_context *ctx = cgroup_fc2context(fc); - struct dentry *root; - - /* Check if the caller has permission to mount. */ - if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) - return -EPERM; - - root = cgroup1_mount(&cgroup_fs_type, fc->sb_flags, ctx->data, - CGROUP_SUPER_MAGIC, ns); - if (IS_ERR(root)) - return PTR_ERR(root); - - fc->root = root; - return 0; -} - static const struct fs_context_operations cgroup_fs_context_ops = { .free = cgroup_fs_context_free, .parse_monolithic = cgroup_parse_monolithic, -- cgit v1.3-7-g2ca7 From f5dfb5315d340abd71bec523be9b114d5ac410de Mon Sep 17 00:00:00 2001 From: Al Viro Date: Wed, 16 Jan 2019 23:42:38 -0500 Subject: cgroup: take options parsing into ->parse_monolithic() Store the results in cgroup_fs_context. There's a nasty twist caused by the enabling/disabling subsystems - we can't do the checks sensitive to that until cgroup_mutex gets grabbed. Frankly, these checks are complete bullshit (e.g. all,none combination is accepted if all subsystems are disabled; so's cpusets,none and all,cpusets when cpusets is disabled, etc.), but touching that would be a userland-visible behaviour change ;-/ So we do parsing in ->parse_monolithic() and have the consistency checks done in check_cgroupfs_options(), with the latter called (on already parsed options) from cgroup1_get_tree() and cgroup1_reconfigure(). Freeing the strdup'ed strings is done from fs_context destructor, which somewhat simplifies the life for cgroup1_{get_tree,reconfigure}(). Signed-off-by: Al Viro --- kernel/cgroup/cgroup-internal.h | 23 ++++--- kernel/cgroup/cgroup-v1.c | 143 +++++++++++++++++++--------------------- kernel/cgroup/cgroup.c | 54 ++++++++------- 3 files changed, 104 insertions(+), 116 deletions(-) (limited to 'kernel') diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h index 37836d598ff8..e627ff193dba 100644 --- a/kernel/cgroup/cgroup-internal.h +++ b/kernel/cgroup/cgroup-internal.h @@ -41,7 +41,15 @@ extern void __init enable_debug_cgroup(void); * The cgroup filesystem superblock creation/mount context. */ struct cgroup_fs_context { - char *data; + unsigned int flags; /* CGRP_ROOT_* flags */ + + /* cgroup1 bits */ + bool cpuset_clone_children; + bool none; /* User explicitly requested empty subsystem */ + bool all_ss; /* Seen 'all' option */ + u16 subsys_mask; /* Selected subsystems */ + char *name; /* Hierarchy name */ + char *release_agent; /* Path for release notifications */ }; static inline struct cgroup_fs_context *cgroup_fc2context(struct fs_context *fc) @@ -130,16 +138,6 @@ struct cgroup_mgctx { #define DEFINE_CGROUP_MGCTX(name) \ struct cgroup_mgctx name = CGROUP_MGCTX_INIT(name) -struct cgroup_sb_opts { - u16 subsys_mask; - unsigned int flags; - char *release_agent; - bool cpuset_clone_children; - char *name; - /* User explicitly requested empty subsystem */ - bool none; -}; - extern struct mutex cgroup_mutex; extern spinlock_t css_set_lock; extern struct cgroup_subsys *cgroup_subsys[]; @@ -210,7 +208,7 @@ int cgroup_path_ns_locked(struct cgroup *cgrp, char *buf, size_t buflen, struct cgroup_namespace *ns); void cgroup_free_root(struct cgroup_root *root); -void init_cgroup_root(struct cgroup_root *root, struct cgroup_sb_opts *opts); +void init_cgroup_root(struct cgroup_root *root, struct cgroup_fs_context *ctx); int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask); int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask); struct dentry *cgroup_do_mount(struct file_system_type *fs_type, int flags, @@ -266,6 +264,7 @@ void cgroup1_pidlist_destroy_all(struct cgroup *cgrp); void cgroup1_release_agent(struct work_struct *work); void cgroup1_check_for_release(struct cgroup *cgrp); int cgroup1_get_tree(struct fs_context *fc); +int parse_cgroup1_options(char *data, struct cgroup_fs_context *ctx); int cgroup1_reconfigure(struct fs_context *ctx); #endif /* __CGROUP_INTERNAL_H */ diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c index 7ae3810dcbdf..5c93643090e9 100644 --- a/kernel/cgroup/cgroup-v1.c +++ b/kernel/cgroup/cgroup-v1.c @@ -906,61 +906,47 @@ static int cgroup1_show_options(struct seq_file *seq, struct kernfs_root *kf_roo return 0; } -static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts) +int parse_cgroup1_options(char *data, struct cgroup_fs_context *ctx) { char *token, *o = data; - bool all_ss = false, one_ss = false; - u16 mask = U16_MAX; struct cgroup_subsys *ss; - int nr_opts = 0; int i; -#ifdef CONFIG_CPUSETS - mask = ~((u16)1 << cpuset_cgrp_id); -#endif - - memset(opts, 0, sizeof(*opts)); - while ((token = strsep(&o, ",")) != NULL) { - nr_opts++; - if (!*token) return -EINVAL; if (!strcmp(token, "none")) { /* Explicitly have no subsystems */ - opts->none = true; + ctx->none = true; continue; } if (!strcmp(token, "all")) { - /* Mutually exclusive option 'all' + subsystem name */ - if (one_ss) - return -EINVAL; - all_ss = true; + ctx->all_ss = true; continue; } if (!strcmp(token, "noprefix")) { - opts->flags |= CGRP_ROOT_NOPREFIX; + ctx->flags |= CGRP_ROOT_NOPREFIX; continue; } if (!strcmp(token, "clone_children")) { - opts->cpuset_clone_children = true; + ctx->cpuset_clone_children = true; continue; } if (!strcmp(token, "cpuset_v2_mode")) { - opts->flags |= CGRP_ROOT_CPUSET_V2_MODE; + ctx->flags |= CGRP_ROOT_CPUSET_V2_MODE; continue; } if (!strcmp(token, "xattr")) { - opts->flags |= CGRP_ROOT_XATTR; + ctx->flags |= CGRP_ROOT_XATTR; continue; } if (!strncmp(token, "release_agent=", 14)) { /* Specifying two release agents is forbidden */ - if (opts->release_agent) + if (ctx->release_agent) return -EINVAL; - opts->release_agent = + ctx->release_agent = kstrndup(token + 14, PATH_MAX - 1, GFP_KERNEL); - if (!opts->release_agent) + if (!ctx->release_agent) return -ENOMEM; continue; } @@ -983,12 +969,12 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts) return -EINVAL; } /* Specifying two names is forbidden */ - if (opts->name) + if (ctx->name) return -EINVAL; - opts->name = kstrndup(name, + ctx->name = kstrndup(name, MAX_CGROUP_ROOT_NAMELEN - 1, GFP_KERNEL); - if (!opts->name) + if (!ctx->name) return -ENOMEM; continue; @@ -997,38 +983,51 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts) for_each_subsys(ss, i) { if (strcmp(token, ss->legacy_name)) continue; - if (!cgroup_ssid_enabled(i)) - continue; - if (cgroup1_ssid_disabled(i)) - continue; - - /* Mutually exclusive option 'all' + subsystem name */ - if (all_ss) - return -EINVAL; - opts->subsys_mask |= (1 << i); - one_ss = true; - + ctx->subsys_mask |= (1 << i); break; } if (i == CGROUP_SUBSYS_COUNT) return -ENOENT; } + return 0; +} + +static int check_cgroupfs_options(struct cgroup_fs_context *ctx) +{ + u16 mask = U16_MAX; + u16 enabled = 0; + struct cgroup_subsys *ss; + int i; + +#ifdef CONFIG_CPUSETS + mask = ~((u16)1 << cpuset_cgrp_id); +#endif + for_each_subsys(ss, i) + if (cgroup_ssid_enabled(i) && !cgroup1_ssid_disabled(i)) + enabled |= 1 << i; + + ctx->subsys_mask &= enabled; /* - * If the 'all' option was specified select all the subsystems, - * otherwise if 'none', 'name=' and a subsystem name options were - * not specified, let's default to 'all' + * In absense of 'none', 'name=' or subsystem name options, + * let's default to 'all'. */ - if (all_ss || (!one_ss && !opts->none && !opts->name)) - for_each_subsys(ss, i) - if (cgroup_ssid_enabled(i) && !cgroup1_ssid_disabled(i)) - opts->subsys_mask |= (1 << i); + if (!ctx->subsys_mask && !ctx->none && !ctx->name) + ctx->all_ss = true; + + if (ctx->all_ss) { + /* Mutually exclusive option 'all' + subsystem name */ + if (ctx->subsys_mask) + return -EINVAL; + /* 'all' => select all the subsystems */ + ctx->subsys_mask = enabled; + } /* * We either have to specify by name or by subsystems. (So all * empty hierarchies must have a name). */ - if (!opts->subsys_mask && !opts->name) + if (!ctx->subsys_mask && !ctx->name) return -EINVAL; /* @@ -1036,11 +1035,11 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts) * with the old cpuset, so we allow noprefix only if mounting just * the cpuset subsystem. */ - if ((opts->flags & CGRP_ROOT_NOPREFIX) && (opts->subsys_mask & mask)) + if ((ctx->flags & CGRP_ROOT_NOPREFIX) && (ctx->subsys_mask & mask)) return -EINVAL; /* Can't specify "none" and some subsystems */ - if (opts->subsys_mask && opts->none) + if (ctx->subsys_mask && ctx->none) return -EINVAL; return 0; @@ -1052,28 +1051,27 @@ int cgroup1_reconfigure(struct fs_context *fc) struct kernfs_root *kf_root = kernfs_root_from_sb(fc->root->d_sb); struct cgroup_root *root = cgroup_root_from_kf(kf_root); int ret = 0; - struct cgroup_sb_opts opts; u16 added_mask, removed_mask; cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp); /* See what subsystems are wanted */ - ret = parse_cgroupfs_options(ctx->data, &opts); + ret = check_cgroupfs_options(ctx); if (ret) goto out_unlock; - if (opts.subsys_mask != root->subsys_mask || opts.release_agent) + if (ctx->subsys_mask != root->subsys_mask || ctx->release_agent) pr_warn("option changes via remount are deprecated (pid=%d comm=%s)\n", task_tgid_nr(current), current->comm); - added_mask = opts.subsys_mask & ~root->subsys_mask; - removed_mask = root->subsys_mask & ~opts.subsys_mask; + added_mask = ctx->subsys_mask & ~root->subsys_mask; + removed_mask = root->subsys_mask & ~ctx->subsys_mask; /* Don't allow flags or name to change at remount */ - if ((opts.flags ^ root->flags) || - (opts.name && strcmp(opts.name, root->name))) { + if ((ctx->flags ^ root->flags) || + (ctx->name && strcmp(ctx->name, root->name))) { pr_err("option or name mismatch, new: 0x%x \"%s\", old: 0x%x \"%s\"\n", - opts.flags, opts.name ?: "", root->flags, root->name); + ctx->flags, ctx->name ?: "", root->flags, root->name); ret = -EINVAL; goto out_unlock; } @@ -1090,17 +1088,15 @@ int cgroup1_reconfigure(struct fs_context *fc) WARN_ON(rebind_subsystems(&cgrp_dfl_root, removed_mask)); - if (opts.release_agent) { + if (ctx->release_agent) { spin_lock(&release_agent_path_lock); - strcpy(root->release_agent_path, opts.release_agent); + strcpy(root->release_agent_path, ctx->release_agent); spin_unlock(&release_agent_path_lock); } trace_cgroup_remount(root); out_unlock: - kfree(opts.release_agent); - kfree(opts.name); mutex_unlock(&cgroup_mutex); return ret; } @@ -1117,7 +1113,6 @@ int cgroup1_get_tree(struct fs_context *fc) { struct cgroup_namespace *ns = current->nsproxy->cgroup_ns; struct cgroup_fs_context *ctx = cgroup_fc2context(fc); - struct cgroup_sb_opts opts; struct cgroup_root *root; struct cgroup_subsys *ss; struct dentry *dentry; @@ -1130,7 +1125,7 @@ int cgroup1_get_tree(struct fs_context *fc) cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp); /* First find the desired set of subsystems */ - ret = parse_cgroupfs_options(ctx->data, &opts); + ret = check_cgroupfs_options(ctx); if (ret) goto out_unlock; @@ -1142,7 +1137,7 @@ int cgroup1_get_tree(struct fs_context *fc) * starting. Testing ref liveliness is good enough. */ for_each_subsys(ss, i) { - if (!(opts.subsys_mask & (1 << i)) || + if (!(ctx->subsys_mask & (1 << i)) || ss->root == &cgrp_dfl_root) continue; @@ -1166,8 +1161,8 @@ int cgroup1_get_tree(struct fs_context *fc) * name matches but sybsys_mask doesn't, we should fail. * Remember whether name matched. */ - if (opts.name) { - if (strcmp(opts.name, root->name)) + if (ctx->name) { + if (strcmp(ctx->name, root->name)) continue; name_match = true; } @@ -1176,15 +1171,15 @@ int cgroup1_get_tree(struct fs_context *fc) * If we asked for subsystems (or explicitly for no * subsystems) then they must match. */ - if ((opts.subsys_mask || opts.none) && - (opts.subsys_mask != root->subsys_mask)) { + if ((ctx->subsys_mask || ctx->none) && + (ctx->subsys_mask != root->subsys_mask)) { if (!name_match) continue; ret = -EBUSY; goto out_unlock; } - if (root->flags ^ opts.flags) + if (root->flags ^ ctx->flags) pr_warn("new mount options do not match the existing superblock, will be ignored\n"); ret = 0; @@ -1196,7 +1191,7 @@ int cgroup1_get_tree(struct fs_context *fc) * specification is allowed for already existing hierarchies but we * can't create new one without subsys specification. */ - if (!opts.subsys_mask && !opts.none) { + if (!ctx->subsys_mask && !ctx->none) { ret = -EINVAL; goto out_unlock; } @@ -1213,9 +1208,9 @@ int cgroup1_get_tree(struct fs_context *fc) goto out_unlock; } - init_cgroup_root(root, &opts); + init_cgroup_root(root, ctx); - ret = cgroup_setup_root(root, opts.subsys_mask); + ret = cgroup_setup_root(root, ctx->subsys_mask); if (ret) cgroup_free_root(root); @@ -1223,14 +1218,10 @@ out_unlock: if (!ret && !percpu_ref_tryget_live(&root->cgrp.self.refcnt)) { mutex_unlock(&cgroup_mutex); msleep(10); - ret = restart_syscall(); - goto out_free; + return restart_syscall(); } mutex_unlock(&cgroup_mutex); out_free: - kfree(opts.release_agent); - kfree(opts.name); - if (ret) return ret; diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 0652f74064a2..33da9eef3ef4 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -1814,14 +1814,8 @@ static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root static int cgroup_reconfigure(struct fs_context *fc) { struct cgroup_fs_context *ctx = cgroup_fc2context(fc); - unsigned int root_flags; - int ret; - - ret = parse_cgroup_root_flags(ctx->data, &root_flags); - if (ret) - return ret; - apply_cgroup_root_flags(root_flags); + apply_cgroup_root_flags(ctx->flags); return 0; } @@ -1909,7 +1903,7 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp) INIT_WORK(&cgrp->release_agent_work, cgroup1_release_agent); } -void init_cgroup_root(struct cgroup_root *root, struct cgroup_sb_opts *opts) +void init_cgroup_root(struct cgroup_root *root, struct cgroup_fs_context *ctx) { struct cgroup *cgrp = &root->cgrp; @@ -1919,12 +1913,12 @@ void init_cgroup_root(struct cgroup_root *root, struct cgroup_sb_opts *opts) init_cgroup_housekeeping(cgrp); idr_init(&root->cgroup_idr); - root->flags = opts->flags; - if (opts->release_agent) - strscpy(root->release_agent_path, opts->release_agent, PATH_MAX); - if (opts->name) - strscpy(root->name, opts->name, MAX_CGROUP_ROOT_NAMELEN); - if (opts->cpuset_clone_children) + root->flags = ctx->flags; + if (ctx->release_agent) + strscpy(root->release_agent_path, ctx->release_agent, PATH_MAX); + if (ctx->name) + strscpy(root->name, ctx->name, MAX_CGROUP_ROOT_NAMELEN); + if (ctx->cpuset_clone_children) set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags); } @@ -2075,6 +2069,8 @@ static void cgroup_fs_context_free(struct fs_context *fc) { struct cgroup_fs_context *ctx = cgroup_fc2context(fc); + kfree(ctx->name); + kfree(ctx->release_agent); kfree(ctx); } @@ -2082,28 +2078,30 @@ static int cgroup_parse_monolithic(struct fs_context *fc, void *data) { struct cgroup_fs_context *ctx = cgroup_fc2context(fc); - ctx->data = data; - if (ctx->data) - security_sb_eat_lsm_opts(ctx->data, &fc->security); - return 0; + if (data) + security_sb_eat_lsm_opts(data, &fc->security); + return parse_cgroup_root_flags(data, &ctx->flags); +} + +static int cgroup1_parse_monolithic(struct fs_context *fc, void *data) +{ + struct cgroup_fs_context *ctx = cgroup_fc2context(fc); + + if (data) + security_sb_eat_lsm_opts(data, &fc->security); + return parse_cgroup1_options(data, ctx); } static int cgroup_get_tree(struct fs_context *fc) { struct cgroup_namespace *ns = current->nsproxy->cgroup_ns; struct cgroup_fs_context *ctx = cgroup_fc2context(fc); - unsigned int root_flags; struct dentry *root; - int ret; /* Check if the caller has permission to mount. */ if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) return -EPERM; - ret = parse_cgroup_root_flags(ctx->data, &root_flags); - if (ret) - return ret; - cgrp_dfl_visible = true; cgroup_get_live(&cgrp_dfl_root.cgrp); @@ -2112,7 +2110,7 @@ static int cgroup_get_tree(struct fs_context *fc) if (IS_ERR(root)) return PTR_ERR(root); - apply_cgroup_root_flags(root_flags); + apply_cgroup_root_flags(ctx->flags); fc->root = root; return 0; } @@ -2126,7 +2124,7 @@ static const struct fs_context_operations cgroup_fs_context_ops = { static const struct fs_context_operations cgroup1_fs_context_ops = { .free = cgroup_fs_context_free, - .parse_monolithic = cgroup_parse_monolithic, + .parse_monolithic = cgroup1_parse_monolithic, .get_tree = cgroup1_get_tree, .reconfigure = cgroup1_reconfigure, }; @@ -5376,11 +5374,11 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early) */ int __init cgroup_init_early(void) { - static struct cgroup_sb_opts __initdata opts; + static struct cgroup_fs_context __initdata ctx; struct cgroup_subsys *ss; int i; - init_cgroup_root(&cgrp_dfl_root, &opts); + init_cgroup_root(&cgrp_dfl_root, &ctx); cgrp_dfl_root.cgrp.self.flags |= CSS_NO_REF; RCU_INIT_POINTER(init_task.cgroups, &init_css_set); -- cgit v1.3-7-g2ca7 From 8d2451f4994fa60a57617282bab91b98266a00b1 Mon Sep 17 00:00:00 2001 From: Al Viro Date: Thu, 17 Jan 2019 00:15:11 -0500 Subject: cgroup1: switch to option-by-option parsing [dhowells should be the author - it's carved out of his patch] Signed-off-by: Al Viro --- kernel/cgroup/cgroup-internal.h | 3 +- kernel/cgroup/cgroup-v1.c | 192 +++++++++++++++++++++++----------------- kernel/cgroup/cgroup.c | 20 ++--- 3 files changed, 117 insertions(+), 98 deletions(-) (limited to 'kernel') diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h index e627ff193dba..a7b5a41f170c 100644 --- a/kernel/cgroup/cgroup-internal.h +++ b/kernel/cgroup/cgroup-internal.h @@ -257,14 +257,15 @@ extern const struct proc_ns_operations cgroupns_operations; */ extern struct cftype cgroup1_base_files[]; extern struct kernfs_syscall_ops cgroup1_kf_syscall_ops; +extern const struct fs_parameter_description cgroup1_fs_parameters; int proc_cgroupstats_show(struct seq_file *m, void *v); bool cgroup1_ssid_disabled(int ssid); void cgroup1_pidlist_destroy_all(struct cgroup *cgrp); void cgroup1_release_agent(struct work_struct *work); void cgroup1_check_for_release(struct cgroup *cgrp); +int cgroup1_parse_param(struct fs_context *fc, struct fs_parameter *param); int cgroup1_get_tree(struct fs_context *fc); -int parse_cgroup1_options(char *data, struct cgroup_fs_context *ctx); int cgroup1_reconfigure(struct fs_context *ctx); #endif /* __CGROUP_INTERNAL_H */ diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c index 5c93643090e9..725e9f6fe80d 100644 --- a/kernel/cgroup/cgroup-v1.c +++ b/kernel/cgroup/cgroup-v1.c @@ -13,9 +13,12 @@ #include #include #include +#include #include +#define cg_invalf(fc, fmt, ...) ({ pr_err(fmt, ## __VA_ARGS__); -EINVAL; }) + /* * pidlists linger the following amount before being destroyed. The goal * is avoiding frequent destruction in the middle of consecutive read calls @@ -906,94 +909,117 @@ static int cgroup1_show_options(struct seq_file *seq, struct kernfs_root *kf_roo return 0; } -int parse_cgroup1_options(char *data, struct cgroup_fs_context *ctx) -{ - char *token, *o = data; - struct cgroup_subsys *ss; - int i; +enum cgroup1_param { + Opt_all, + Opt_clone_children, + Opt_cpuset_v2_mode, + Opt_name, + Opt_none, + Opt_noprefix, + Opt_release_agent, + Opt_xattr, +}; - while ((token = strsep(&o, ",")) != NULL) { - if (!*token) - return -EINVAL; - if (!strcmp(token, "none")) { - /* Explicitly have no subsystems */ - ctx->none = true; - continue; - } - if (!strcmp(token, "all")) { - ctx->all_ss = true; - continue; - } - if (!strcmp(token, "noprefix")) { - ctx->flags |= CGRP_ROOT_NOPREFIX; - continue; - } - if (!strcmp(token, "clone_children")) { - ctx->cpuset_clone_children = true; - continue; - } - if (!strcmp(token, "cpuset_v2_mode")) { - ctx->flags |= CGRP_ROOT_CPUSET_V2_MODE; - continue; - } - if (!strcmp(token, "xattr")) { - ctx->flags |= CGRP_ROOT_XATTR; - continue; - } - if (!strncmp(token, "release_agent=", 14)) { - /* Specifying two release agents is forbidden */ - if (ctx->release_agent) - return -EINVAL; - ctx->release_agent = - kstrndup(token + 14, PATH_MAX - 1, GFP_KERNEL); - if (!ctx->release_agent) - return -ENOMEM; - continue; - } - if (!strncmp(token, "name=", 5)) { - const char *name = token + 5; - - /* blocked by boot param? */ - if (cgroup_no_v1_named) - return -ENOENT; - /* Can't specify an empty name */ - if (!strlen(name)) - return -EINVAL; - /* Must match [\w.-]+ */ - for (i = 0; i < strlen(name); i++) { - char c = name[i]; - if (isalnum(c)) - continue; - if ((c == '.') || (c == '-') || (c == '_')) - continue; - return -EINVAL; - } - /* Specifying two names is forbidden */ - if (ctx->name) - return -EINVAL; - ctx->name = kstrndup(name, - MAX_CGROUP_ROOT_NAMELEN - 1, - GFP_KERNEL); - if (!ctx->name) - return -ENOMEM; +static const struct fs_parameter_spec cgroup1_param_specs[] = { + fsparam_flag ("all", Opt_all), + fsparam_flag ("clone_children", Opt_clone_children), + fsparam_flag ("cpuset_v2_mode", Opt_cpuset_v2_mode), + fsparam_string("name", Opt_name), + fsparam_flag ("none", Opt_none), + fsparam_flag ("noprefix", Opt_noprefix), + fsparam_string("release_agent", Opt_release_agent), + fsparam_flag ("xattr", Opt_xattr), + {} +}; - continue; - } +const struct fs_parameter_description cgroup1_fs_parameters = { + .name = "cgroup1", + .specs = cgroup1_param_specs, +}; +int cgroup1_parse_param(struct fs_context *fc, struct fs_parameter *param) +{ + struct cgroup_fs_context *ctx = cgroup_fc2context(fc); + struct cgroup_subsys *ss; + struct fs_parse_result result; + int opt, i; + + opt = fs_parse(fc, &cgroup1_fs_parameters, param, &result); + if (opt == -ENOPARAM) { + if (strcmp(param->key, "source") == 0) { + fc->source = param->string; + param->string = NULL; + return 0; + } for_each_subsys(ss, i) { - if (strcmp(token, ss->legacy_name)) + if (strcmp(param->key, ss->legacy_name)) continue; ctx->subsys_mask |= (1 << i); - break; + return 0; } - if (i == CGROUP_SUBSYS_COUNT) + return cg_invalf(fc, "cgroup1: Unknown subsys name '%s'", param->key); + } + if (opt < 0) + return opt; + + switch (opt) { + case Opt_none: + /* Explicitly have no subsystems */ + ctx->none = true; + break; + case Opt_all: + ctx->all_ss = true; + break; + case Opt_noprefix: + ctx->flags |= CGRP_ROOT_NOPREFIX; + break; + case Opt_clone_children: + ctx->cpuset_clone_children = true; + break; + case Opt_cpuset_v2_mode: + ctx->flags |= CGRP_ROOT_CPUSET_V2_MODE; + break; + case Opt_xattr: + ctx->flags |= CGRP_ROOT_XATTR; + break; + case Opt_release_agent: + /* Specifying two release agents is forbidden */ + if (ctx->release_agent) + return cg_invalf(fc, "cgroup1: release_agent respecified"); + ctx->release_agent = param->string; + param->string = NULL; + break; + case Opt_name: + /* blocked by boot param? */ + if (cgroup_no_v1_named) return -ENOENT; + /* Can't specify an empty name */ + if (!param->size) + return cg_invalf(fc, "cgroup1: Empty name"); + if (param->size > MAX_CGROUP_ROOT_NAMELEN - 1) + return cg_invalf(fc, "cgroup1: Name too long"); + /* Must match [\w.-]+ */ + for (i = 0; i < param->size; i++) { + char c = param->string[i]; + if (isalnum(c)) + continue; + if ((c == '.') || (c == '-') || (c == '_')) + continue; + return cg_invalf(fc, "cgroup1: Invalid name"); + } + /* Specifying two names is forbidden */ + if (ctx->name) + return cg_invalf(fc, "cgroup1: name respecified"); + ctx->name = param->string; + param->string = NULL; + break; } return 0; } -static int check_cgroupfs_options(struct cgroup_fs_context *ctx) +static int check_cgroupfs_options(struct fs_context *fc) { + struct cgroup_fs_context *ctx = cgroup_fc2context(fc); u16 mask = U16_MAX; u16 enabled = 0; struct cgroup_subsys *ss; @@ -1018,7 +1044,7 @@ static int check_cgroupfs_options(struct cgroup_fs_context *ctx) if (ctx->all_ss) { /* Mutually exclusive option 'all' + subsystem name */ if (ctx->subsys_mask) - return -EINVAL; + return cg_invalf(fc, "cgroup1: subsys name conflicts with all"); /* 'all' => select all the subsystems */ ctx->subsys_mask = enabled; } @@ -1028,7 +1054,7 @@ static int check_cgroupfs_options(struct cgroup_fs_context *ctx) * empty hierarchies must have a name). */ if (!ctx->subsys_mask && !ctx->name) - return -EINVAL; + return cg_invalf(fc, "cgroup1: Need name or subsystem set"); /* * Option noprefix was introduced just for backward compatibility @@ -1036,11 +1062,11 @@ static int check_cgroupfs_options(struct cgroup_fs_context *ctx) * the cpuset subsystem. */ if ((ctx->flags & CGRP_ROOT_NOPREFIX) && (ctx->subsys_mask & mask)) - return -EINVAL; + return cg_invalf(fc, "cgroup1: noprefix used incorrectly"); /* Can't specify "none" and some subsystems */ if (ctx->subsys_mask && ctx->none) - return -EINVAL; + return cg_invalf(fc, "cgroup1: none used incorrectly"); return 0; } @@ -1056,7 +1082,7 @@ int cgroup1_reconfigure(struct fs_context *fc) cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp); /* See what subsystems are wanted */ - ret = check_cgroupfs_options(ctx); + ret = check_cgroupfs_options(fc); if (ret) goto out_unlock; @@ -1070,7 +1096,7 @@ int cgroup1_reconfigure(struct fs_context *fc) /* Don't allow flags or name to change at remount */ if ((ctx->flags ^ root->flags) || (ctx->name && strcmp(ctx->name, root->name))) { - pr_err("option or name mismatch, new: 0x%x \"%s\", old: 0x%x \"%s\"\n", + cg_invalf(fc, "option or name mismatch, new: 0x%x \"%s\", old: 0x%x \"%s\"", ctx->flags, ctx->name ?: "", root->flags, root->name); ret = -EINVAL; goto out_unlock; @@ -1125,7 +1151,7 @@ int cgroup1_get_tree(struct fs_context *fc) cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp); /* First find the desired set of subsystems */ - ret = check_cgroupfs_options(ctx); + ret = check_cgroupfs_options(fc); if (ret) goto out_unlock; @@ -1192,7 +1218,7 @@ int cgroup1_get_tree(struct fs_context *fc) * can't create new one without subsys specification. */ if (!ctx->subsys_mask && !ctx->none) { - ret = -EINVAL; + ret = cg_invalf(fc, "cgroup1: No subsys list or none specified"); goto out_unlock; } diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 33da9eef3ef4..faba00caa197 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -2083,15 +2083,6 @@ static int cgroup_parse_monolithic(struct fs_context *fc, void *data) return parse_cgroup_root_flags(data, &ctx->flags); } -static int cgroup1_parse_monolithic(struct fs_context *fc, void *data) -{ - struct cgroup_fs_context *ctx = cgroup_fc2context(fc); - - if (data) - security_sb_eat_lsm_opts(data, &fc->security); - return parse_cgroup1_options(data, ctx); -} - static int cgroup_get_tree(struct fs_context *fc) { struct cgroup_namespace *ns = current->nsproxy->cgroup_ns; @@ -2124,7 +2115,7 @@ static const struct fs_context_operations cgroup_fs_context_ops = { static const struct fs_context_operations cgroup1_fs_context_ops = { .free = cgroup_fs_context_free, - .parse_monolithic = cgroup1_parse_monolithic, + .parse_param = cgroup1_parse_param, .get_tree = cgroup1_get_tree, .reconfigure = cgroup1_reconfigure, }; @@ -2175,10 +2166,11 @@ static void cgroup_kill_sb(struct super_block *sb) } struct file_system_type cgroup_fs_type = { - .name = "cgroup", - .init_fs_context = cgroup_init_fs_context, - .kill_sb = cgroup_kill_sb, - .fs_flags = FS_USERNS_MOUNT, + .name = "cgroup", + .init_fs_context = cgroup_init_fs_context, + .parameters = &cgroup1_fs_parameters, + .kill_sb = cgroup_kill_sb, + .fs_flags = FS_USERNS_MOUNT, }; static struct file_system_type cgroup2_fs_type = { -- cgit v1.3-7-g2ca7 From e34a98d5b226b84a3ed6da93e7a92e65cc1c81ba Mon Sep 17 00:00:00 2001 From: Al Viro Date: Thu, 17 Jan 2019 00:22:58 -0500 Subject: cgroup2: switch to option-by-option parsing [again, carved out of patch by dhowells] [NB: we probably want to handle "source" in parse_param here] Signed-off-by: Al Viro --- kernel/cgroup/cgroup.c | 62 +++++++++++++++++++++++++++----------------------- 1 file changed, 33 insertions(+), 29 deletions(-) (limited to 'kernel') diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index faba00caa197..d0cddfbdf5cf 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -54,6 +54,7 @@ #include #include #include +#include #include #include #include @@ -1772,26 +1773,37 @@ int cgroup_show_path(struct seq_file *sf, struct kernfs_node *kf_node, return len; } -static int parse_cgroup_root_flags(char *data, unsigned int *root_flags) -{ - char *token; +enum cgroup2_param { + Opt_nsdelegate, + nr__cgroup2_params +}; - *root_flags = 0; +static const struct fs_parameter_spec cgroup2_param_specs[] = { + fsparam_flag ("nsdelegate", Opt_nsdelegate), + {} +}; - if (!data || *data == '\0') - return 0; +static const struct fs_parameter_description cgroup2_fs_parameters = { + .name = "cgroup2", + .specs = cgroup2_param_specs, +}; - while ((token = strsep(&data, ",")) != NULL) { - if (!strcmp(token, "nsdelegate")) { - *root_flags |= CGRP_ROOT_NS_DELEGATE; - continue; - } +static int cgroup2_parse_param(struct fs_context *fc, struct fs_parameter *param) +{ + struct cgroup_fs_context *ctx = cgroup_fc2context(fc); + struct fs_parse_result result; + int opt; - pr_err("cgroup2: unknown option \"%s\"\n", token); - return -EINVAL; - } + opt = fs_parse(fc, &cgroup2_fs_parameters, param, &result); + if (opt < 0) + return opt; - return 0; + switch (opt) { + case Opt_nsdelegate: + ctx->flags |= CGRP_ROOT_NS_DELEGATE; + return 0; + } + return -EINVAL; } static void apply_cgroup_root_flags(unsigned int root_flags) @@ -2074,15 +2086,6 @@ static void cgroup_fs_context_free(struct fs_context *fc) kfree(ctx); } -static int cgroup_parse_monolithic(struct fs_context *fc, void *data) -{ - struct cgroup_fs_context *ctx = cgroup_fc2context(fc); - - if (data) - security_sb_eat_lsm_opts(data, &fc->security); - return parse_cgroup_root_flags(data, &ctx->flags); -} - static int cgroup_get_tree(struct fs_context *fc) { struct cgroup_namespace *ns = current->nsproxy->cgroup_ns; @@ -2108,7 +2111,7 @@ static int cgroup_get_tree(struct fs_context *fc) static const struct fs_context_operations cgroup_fs_context_ops = { .free = cgroup_fs_context_free, - .parse_monolithic = cgroup_parse_monolithic, + .parse_param = cgroup2_parse_param, .get_tree = cgroup_get_tree, .reconfigure = cgroup_reconfigure, }; @@ -2174,10 +2177,11 @@ struct file_system_type cgroup_fs_type = { }; static struct file_system_type cgroup2_fs_type = { - .name = "cgroup2", - .init_fs_context = cgroup_init_fs_context, - .kill_sb = cgroup_kill_sb, - .fs_flags = FS_USERNS_MOUNT, + .name = "cgroup2", + .init_fs_context = cgroup_init_fs_context, + .parameters = &cgroup2_fs_parameters, + .kill_sb = cgroup_kill_sb, + .fs_flags = FS_USERNS_MOUNT, }; int cgroup_path_ns_locked(struct cgroup *cgrp, char *buf, size_t buflen, -- cgit v1.3-7-g2ca7 From cf6299b1d00555cd10dc30d95b300d7084128a2c Mon Sep 17 00:00:00 2001 From: Al Viro Date: Thu, 17 Jan 2019 02:25:51 -0500 Subject: cgroup: stash cgroup_root reference into cgroup_fs_context Note that this reference is *NOT* contributing to refcount of cgroup_root in question and is valid only until cgroup_do_mount() returns. Signed-off-by: Al Viro --- kernel/cgroup/cgroup-internal.h | 3 ++- kernel/cgroup/cgroup-v1.c | 4 +++- kernel/cgroup/cgroup.c | 7 +++++-- 3 files changed, 10 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h index a7b5a41f170c..3c1613a7648c 100644 --- a/kernel/cgroup/cgroup-internal.h +++ b/kernel/cgroup/cgroup-internal.h @@ -41,6 +41,7 @@ extern void __init enable_debug_cgroup(void); * The cgroup filesystem superblock creation/mount context. */ struct cgroup_fs_context { + struct cgroup_root *root; unsigned int flags; /* CGRP_ROOT_* flags */ /* cgroup1 bits */ @@ -208,7 +209,7 @@ int cgroup_path_ns_locked(struct cgroup *cgrp, char *buf, size_t buflen, struct cgroup_namespace *ns); void cgroup_free_root(struct cgroup_root *root); -void init_cgroup_root(struct cgroup_root *root, struct cgroup_fs_context *ctx); +void init_cgroup_root(struct cgroup_fs_context *ctx); int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask); int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask); struct dentry *cgroup_do_mount(struct file_system_type *fs_type, int flags, diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c index 725e9f6fe80d..45a198c63d6e 100644 --- a/kernel/cgroup/cgroup-v1.c +++ b/kernel/cgroup/cgroup-v1.c @@ -1208,6 +1208,7 @@ int cgroup1_get_tree(struct fs_context *fc) if (root->flags ^ ctx->flags) pr_warn("new mount options do not match the existing superblock, will be ignored\n"); + ctx->root = root; ret = 0; goto out_unlock; } @@ -1234,7 +1235,8 @@ int cgroup1_get_tree(struct fs_context *fc) goto out_unlock; } - init_cgroup_root(root, ctx); + ctx->root = root; + init_cgroup_root(ctx); ret = cgroup_setup_root(root, ctx->subsys_mask); if (ret) diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index d0cddfbdf5cf..57f43f63363a 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -1915,8 +1915,9 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp) INIT_WORK(&cgrp->release_agent_work, cgroup1_release_agent); } -void init_cgroup_root(struct cgroup_root *root, struct cgroup_fs_context *ctx) +void init_cgroup_root(struct cgroup_fs_context *ctx) { + struct cgroup_root *root = ctx->root; struct cgroup *cgrp = &root->cgrp; INIT_LIST_HEAD(&root->root_list); @@ -2098,6 +2099,7 @@ static int cgroup_get_tree(struct fs_context *fc) cgrp_dfl_visible = true; cgroup_get_live(&cgrp_dfl_root.cgrp); + ctx->root = &cgrp_dfl_root; root = cgroup_do_mount(&cgroup2_fs_type, fc->sb_flags, &cgrp_dfl_root, CGROUP2_SUPER_MAGIC, ns); @@ -5374,7 +5376,8 @@ int __init cgroup_init_early(void) struct cgroup_subsys *ss; int i; - init_cgroup_root(&cgrp_dfl_root, &ctx); + ctx.root = &cgrp_dfl_root; + init_cgroup_root(&ctx); cgrp_dfl_root.cgrp.self.flags |= CSS_NO_REF; RCU_INIT_POINTER(init_task.cgroups, &init_css_set); -- cgit v1.3-7-g2ca7 From 71d883c37e8d4484207708af56685abb39703b04 Mon Sep 17 00:00:00 2001 From: Al Viro Date: Thu, 17 Jan 2019 02:44:07 -0500 Subject: cgroup_do_mount(): massage calling conventions pass it fs_context instead of fs_type/flags/root triple, have it return int instead of dentry and make it deal with setting fc->root. Signed-off-by: Al Viro --- kernel/cgroup/cgroup-internal.h | 3 +-- kernel/cgroup/cgroup-v1.c | 17 +++++----------- kernel/cgroup/cgroup.c | 45 +++++++++++++++++++++-------------------- 3 files changed, 29 insertions(+), 36 deletions(-) (limited to 'kernel') diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h index 3c1613a7648c..f7fd54f2973f 100644 --- a/kernel/cgroup/cgroup-internal.h +++ b/kernel/cgroup/cgroup-internal.h @@ -212,8 +212,7 @@ void cgroup_free_root(struct cgroup_root *root); void init_cgroup_root(struct cgroup_fs_context *ctx); int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask); int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask); -struct dentry *cgroup_do_mount(struct file_system_type *fs_type, int flags, - struct cgroup_root *root, unsigned long magic, +int cgroup_do_mount(struct fs_context *fc, unsigned long magic, struct cgroup_namespace *ns); int cgroup_migrate_vet_dst(struct cgroup *dst_cgrp); diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c index 45a198c63d6e..05f05d773adf 100644 --- a/kernel/cgroup/cgroup-v1.c +++ b/kernel/cgroup/cgroup-v1.c @@ -1141,7 +1141,6 @@ int cgroup1_get_tree(struct fs_context *fc) struct cgroup_fs_context *ctx = cgroup_fc2context(fc); struct cgroup_root *root; struct cgroup_subsys *ss; - struct dentry *dentry; int i, ret; /* Check if the caller has permission to mount. */ @@ -1253,21 +1252,15 @@ out_free: if (ret) return ret; - dentry = cgroup_do_mount(&cgroup_fs_type, fc->sb_flags, root, - CGROUP_SUPER_MAGIC, ns); - if (IS_ERR(dentry)) - return PTR_ERR(dentry); - - if (percpu_ref_is_dying(&root->cgrp.self.refcnt)) { - struct super_block *sb = dentry->d_sb; - dput(dentry); + ret = cgroup_do_mount(fc, CGROUP_SUPER_MAGIC, ns); + if (!ret && percpu_ref_is_dying(&root->cgrp.self.refcnt)) { + struct super_block *sb = fc->root->d_sb; + dput(fc->root); deactivate_locked_super(sb); msleep(10); return restart_syscall(); } - - fc->root = dentry; - return 0; + return ret; } static int __init cgroup1_wq_init(void) diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 57f43f63363a..64360a46d4df 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -2036,43 +2036,48 @@ out: return ret; } -struct dentry *cgroup_do_mount(struct file_system_type *fs_type, int flags, - struct cgroup_root *root, unsigned long magic, - struct cgroup_namespace *ns) +int cgroup_do_mount(struct fs_context *fc, unsigned long magic, + struct cgroup_namespace *ns) { - struct dentry *dentry; + struct cgroup_fs_context *ctx = cgroup_fc2context(fc); bool new_sb = false; + int ret = 0; - dentry = kernfs_mount(fs_type, flags, root->kf_root, magic, &new_sb); + fc->root = kernfs_mount(fc->fs_type, fc->sb_flags, ctx->root->kf_root, + magic, &new_sb); + if (IS_ERR(fc->root)) + ret = PTR_ERR(fc->root); /* * In non-init cgroup namespace, instead of root cgroup's dentry, * we return the dentry corresponding to the cgroupns->root_cgrp. */ - if (!IS_ERR(dentry) && ns != &init_cgroup_ns) { + if (!ret && ns != &init_cgroup_ns) { struct dentry *nsdentry; - struct super_block *sb = dentry->d_sb; + struct super_block *sb = fc->root->d_sb; struct cgroup *cgrp; mutex_lock(&cgroup_mutex); spin_lock_irq(&css_set_lock); - cgrp = cset_cgroup_from_root(ns->root_cset, root); + cgrp = cset_cgroup_from_root(ns->root_cset, ctx->root); spin_unlock_irq(&css_set_lock); mutex_unlock(&cgroup_mutex); nsdentry = kernfs_node_dentry(cgrp->kn, sb); - dput(dentry); - if (IS_ERR(nsdentry)) + dput(fc->root); + fc->root = nsdentry; + if (IS_ERR(nsdentry)) { + ret = PTR_ERR(nsdentry); deactivate_locked_super(sb); - dentry = nsdentry; + } } if (!new_sb) - cgroup_put(&root->cgrp); + cgroup_put(&ctx->root->cgrp); - return dentry; + return ret; } /* @@ -2091,7 +2096,7 @@ static int cgroup_get_tree(struct fs_context *fc) { struct cgroup_namespace *ns = current->nsproxy->cgroup_ns; struct cgroup_fs_context *ctx = cgroup_fc2context(fc); - struct dentry *root; + int ret; /* Check if the caller has permission to mount. */ if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) @@ -2101,14 +2106,10 @@ static int cgroup_get_tree(struct fs_context *fc) cgroup_get_live(&cgrp_dfl_root.cgrp); ctx->root = &cgrp_dfl_root; - root = cgroup_do_mount(&cgroup2_fs_type, fc->sb_flags, &cgrp_dfl_root, - CGROUP2_SUPER_MAGIC, ns); - if (IS_ERR(root)) - return PTR_ERR(root); - - apply_cgroup_root_flags(ctx->flags); - fc->root = root; - return 0; + ret = cgroup_do_mount(fc, CGROUP2_SUPER_MAGIC, ns); + if (!ret) + apply_cgroup_root_flags(ctx->flags); + return ret; } static const struct fs_context_operations cgroup_fs_context_ops = { -- cgit v1.3-7-g2ca7 From 6678889f0726910fc884c54f951d8c5646a04819 Mon Sep 17 00:00:00 2001 From: Al Viro Date: Thu, 17 Jan 2019 09:42:30 -0500 Subject: cgroup1_get_tree(): separate "get cgroup_root to use" into a separate helper Signed-off-by: Al Viro --- kernel/cgroup/cgroup-v1.c | 87 +++++++++++++++++++++++++---------------------- 1 file changed, 46 insertions(+), 41 deletions(-) (limited to 'kernel') diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c index 05f05d773adf..0d71fc98e73d 100644 --- a/kernel/cgroup/cgroup-v1.c +++ b/kernel/cgroup/cgroup-v1.c @@ -1135,7 +1135,15 @@ struct kernfs_syscall_ops cgroup1_kf_syscall_ops = { .show_path = cgroup_show_path, }; -int cgroup1_get_tree(struct fs_context *fc) +/* + * The guts of cgroup1 mount - find or create cgroup_root to use. + * Called with cgroup_mutex held; returns 0 on success, -E... on + * error and positive - in case when the candidate is busy dying. + * On success it stashes a reference to cgroup_root into given + * cgroup_fs_context; that reference is *NOT* counting towards the + * cgroup_root refcount. + */ +static int cgroup1_root_to_use(struct fs_context *fc) { struct cgroup_namespace *ns = current->nsproxy->cgroup_ns; struct cgroup_fs_context *ctx = cgroup_fc2context(fc); @@ -1143,16 +1151,10 @@ int cgroup1_get_tree(struct fs_context *fc) struct cgroup_subsys *ss; int i, ret; - /* Check if the caller has permission to mount. */ - if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) - return -EPERM; - - cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp); - /* First find the desired set of subsystems */ ret = check_cgroupfs_options(fc); if (ret) - goto out_unlock; + return ret; /* * Destruction of cgroup root is asynchronous, so subsystems may @@ -1166,12 +1168,8 @@ int cgroup1_get_tree(struct fs_context *fc) ss->root == &cgrp_dfl_root) continue; - if (!percpu_ref_tryget_live(&ss->root->cgrp.self.refcnt)) { - mutex_unlock(&cgroup_mutex); - msleep(10); - ret = restart_syscall(); - goto out_free; - } + if (!percpu_ref_tryget_live(&ss->root->cgrp.self.refcnt)) + return 1; /* restart */ cgroup_put(&ss->root->cgrp); } @@ -1200,16 +1198,14 @@ int cgroup1_get_tree(struct fs_context *fc) (ctx->subsys_mask != root->subsys_mask)) { if (!name_match) continue; - ret = -EBUSY; - goto out_unlock; + return -EBUSY; } if (root->flags ^ ctx->flags) pr_warn("new mount options do not match the existing superblock, will be ignored\n"); ctx->root = root; - ret = 0; - goto out_unlock; + return 0; } /* @@ -1217,22 +1213,16 @@ int cgroup1_get_tree(struct fs_context *fc) * specification is allowed for already existing hierarchies but we * can't create new one without subsys specification. */ - if (!ctx->subsys_mask && !ctx->none) { - ret = cg_invalf(fc, "cgroup1: No subsys list or none specified"); - goto out_unlock; - } + if (!ctx->subsys_mask && !ctx->none) + return cg_invalf(fc, "cgroup1: No subsys list or none specified"); /* Hierarchies may only be created in the initial cgroup namespace. */ - if (ns != &init_cgroup_ns) { - ret = -EPERM; - goto out_unlock; - } + if (ns != &init_cgroup_ns) + return -EPERM; root = kzalloc(sizeof(*root), GFP_KERNEL); - if (!root) { - ret = -ENOMEM; - goto out_unlock; - } + if (!root) + return -ENOMEM; ctx->root = root; init_cgroup_root(ctx); @@ -1240,23 +1230,38 @@ int cgroup1_get_tree(struct fs_context *fc) ret = cgroup_setup_root(root, ctx->subsys_mask); if (ret) cgroup_free_root(root); + return ret; +} + +int cgroup1_get_tree(struct fs_context *fc) +{ + struct cgroup_namespace *ns = current->nsproxy->cgroup_ns; + struct cgroup_fs_context *ctx = cgroup_fc2context(fc); + int ret; + + /* Check if the caller has permission to mount. */ + if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) + return -EPERM; + + cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp); + + ret = cgroup1_root_to_use(fc); + if (!ret && !percpu_ref_tryget_live(&ctx->root->cgrp.self.refcnt)) + ret = 1; /* restart */ -out_unlock: - if (!ret && !percpu_ref_tryget_live(&root->cgrp.self.refcnt)) { - mutex_unlock(&cgroup_mutex); - msleep(10); - return restart_syscall(); - } mutex_unlock(&cgroup_mutex); -out_free: - if (ret) - return ret; - ret = cgroup_do_mount(fc, CGROUP_SUPER_MAGIC, ns); - if (!ret && percpu_ref_is_dying(&root->cgrp.self.refcnt)) { + if (!ret) + ret = cgroup_do_mount(fc, CGROUP_SUPER_MAGIC, ns); + + if (!ret && percpu_ref_is_dying(&ctx->root->cgrp.self.refcnt)) { struct super_block *sb = fc->root->d_sb; dput(fc->root); deactivate_locked_super(sb); + ret = 1; + } + + if (unlikely(ret > 0)) { msleep(10); return restart_syscall(); } -- cgit v1.3-7-g2ca7 From cca8f32714d3a8bb4d109c9d7d790fd705b734e5 Mon Sep 17 00:00:00 2001 From: Al Viro Date: Thu, 17 Jan 2019 10:14:26 -0500 Subject: cgroup: store a reference to cgroup_ns into cgroup_fs_context ... and trim cgroup_do_mount() arguments (renaming it to cgroup_do_get_tree()) Signed-off-by: Al Viro --- kernel/cgroup/cgroup-internal.h | 4 ++-- kernel/cgroup/cgroup-v1.c | 8 +++----- kernel/cgroup/cgroup.c | 17 ++++++++++++----- 3 files changed, 17 insertions(+), 12 deletions(-) (limited to 'kernel') diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h index f7fd54f2973f..37cf709b7a0e 100644 --- a/kernel/cgroup/cgroup-internal.h +++ b/kernel/cgroup/cgroup-internal.h @@ -42,6 +42,7 @@ extern void __init enable_debug_cgroup(void); */ struct cgroup_fs_context { struct cgroup_root *root; + struct cgroup_namespace *ns; unsigned int flags; /* CGRP_ROOT_* flags */ /* cgroup1 bits */ @@ -212,8 +213,7 @@ void cgroup_free_root(struct cgroup_root *root); void init_cgroup_root(struct cgroup_fs_context *ctx); int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask); int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask); -int cgroup_do_mount(struct fs_context *fc, unsigned long magic, - struct cgroup_namespace *ns); +int cgroup_do_get_tree(struct fs_context *fc); int cgroup_migrate_vet_dst(struct cgroup *dst_cgrp); void cgroup_migrate_finish(struct cgroup_mgctx *mgctx); diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c index 0d71fc98e73d..571ef3447426 100644 --- a/kernel/cgroup/cgroup-v1.c +++ b/kernel/cgroup/cgroup-v1.c @@ -1145,7 +1145,6 @@ struct kernfs_syscall_ops cgroup1_kf_syscall_ops = { */ static int cgroup1_root_to_use(struct fs_context *fc) { - struct cgroup_namespace *ns = current->nsproxy->cgroup_ns; struct cgroup_fs_context *ctx = cgroup_fc2context(fc); struct cgroup_root *root; struct cgroup_subsys *ss; @@ -1217,7 +1216,7 @@ static int cgroup1_root_to_use(struct fs_context *fc) return cg_invalf(fc, "cgroup1: No subsys list or none specified"); /* Hierarchies may only be created in the initial cgroup namespace. */ - if (ns != &init_cgroup_ns) + if (ctx->ns != &init_cgroup_ns) return -EPERM; root = kzalloc(sizeof(*root), GFP_KERNEL); @@ -1235,12 +1234,11 @@ static int cgroup1_root_to_use(struct fs_context *fc) int cgroup1_get_tree(struct fs_context *fc) { - struct cgroup_namespace *ns = current->nsproxy->cgroup_ns; struct cgroup_fs_context *ctx = cgroup_fc2context(fc); int ret; /* Check if the caller has permission to mount. */ - if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) + if (!ns_capable(ctx->ns->user_ns, CAP_SYS_ADMIN)) return -EPERM; cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp); @@ -1252,7 +1250,7 @@ int cgroup1_get_tree(struct fs_context *fc) mutex_unlock(&cgroup_mutex); if (!ret) - ret = cgroup_do_mount(fc, CGROUP_SUPER_MAGIC, ns); + ret = cgroup_do_get_tree(fc); if (!ret && percpu_ref_is_dying(&ctx->root->cgrp.self.refcnt)) { struct super_block *sb = fc->root->d_sb; diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 64360a46d4df..0c6bef234a7c 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -2036,13 +2036,17 @@ out: return ret; } -int cgroup_do_mount(struct fs_context *fc, unsigned long magic, - struct cgroup_namespace *ns) +int cgroup_do_get_tree(struct fs_context *fc) { struct cgroup_fs_context *ctx = cgroup_fc2context(fc); bool new_sb = false; + unsigned long magic; int ret = 0; + if (fc->fs_type == &cgroup2_fs_type) + magic = CGROUP2_SUPER_MAGIC; + else + magic = CGROUP_SUPER_MAGIC; fc->root = kernfs_mount(fc->fs_type, fc->sb_flags, ctx->root->kf_root, magic, &new_sb); if (IS_ERR(fc->root)) @@ -2052,7 +2056,7 @@ int cgroup_do_mount(struct fs_context *fc, unsigned long magic, * In non-init cgroup namespace, instead of root cgroup's dentry, * we return the dentry corresponding to the cgroupns->root_cgrp. */ - if (!ret && ns != &init_cgroup_ns) { + if (!ret && ctx->ns != &init_cgroup_ns) { struct dentry *nsdentry; struct super_block *sb = fc->root->d_sb; struct cgroup *cgrp; @@ -2060,7 +2064,7 @@ int cgroup_do_mount(struct fs_context *fc, unsigned long magic, mutex_lock(&cgroup_mutex); spin_lock_irq(&css_set_lock); - cgrp = cset_cgroup_from_root(ns->root_cset, ctx->root); + cgrp = cset_cgroup_from_root(ctx->ns->root_cset, ctx->root); spin_unlock_irq(&css_set_lock); mutex_unlock(&cgroup_mutex); @@ -2089,6 +2093,7 @@ static void cgroup_fs_context_free(struct fs_context *fc) kfree(ctx->name); kfree(ctx->release_agent); + put_cgroup_ns(ctx->ns); kfree(ctx); } @@ -2106,7 +2111,7 @@ static int cgroup_get_tree(struct fs_context *fc) cgroup_get_live(&cgrp_dfl_root.cgrp); ctx->root = &cgrp_dfl_root; - ret = cgroup_do_mount(fc, CGROUP2_SUPER_MAGIC, ns); + ret = cgroup_do_get_tree(fc); if (!ret) apply_cgroup_root_flags(ctx->flags); return ret; @@ -2144,6 +2149,8 @@ static int cgroup_init_fs_context(struct fs_context *fc) if (!use_task_css_set_links) cgroup_enable_task_cg_lists(); + ctx->ns = current->nsproxy->cgroup_ns; + get_cgroup_ns(ctx->ns); fc->fs_private = ctx; if (fc->fs_type == &cgroup2_fs_type) fc->ops = &cgroup_fs_context_ops; -- cgit v1.3-7-g2ca7 From 23bf1b6be9c291a7130118dcc7384f72ac04d813 Mon Sep 17 00:00:00 2001 From: David Howells Date: Thu, 1 Nov 2018 23:07:26 +0000 Subject: kernfs, sysfs, cgroup, intel_rdt: Support fs_context Make kernfs support superblock creation/mount/remount with fs_context. This requires that sysfs, cgroup and intel_rdt, which are built on kernfs, be made to support fs_context also. Notes: (1) A kernfs_fs_context struct is created to wrap fs_context and the kernfs mount parameters are moved in here (or are in fs_context). (2) kernfs_mount{,_ns}() are made into kernfs_get_tree(). The extra namespace tag parameter is passed in the context if desired (3) kernfs_free_fs_context() is provided as a destructor for the kernfs_fs_context struct, but for the moment it does nothing except get called in the right places. (4) sysfs doesn't wrap kernfs_fs_context since it has no parameters to pass, but possibly this should be done anyway in case someone wants to add a parameter in future. (5) A cgroup_fs_context struct is created to wrap kernfs_fs_context and the cgroup v1 and v2 mount parameters are all moved there. (6) cgroup1 parameter parsing error messages are now handled by invalf(), which allows userspace to collect them directly. (7) cgroup1 parameter cleanup is now done in the context destructor rather than in the mount/get_tree and remount functions. Weirdies: (*) cgroup_do_get_tree() calls cset_cgroup_from_root() with locks held, but then uses the resulting pointer after dropping the locks. I'm told this is okay and needs commenting. (*) The cgroup refcount web. This really needs documenting. (*) cgroup2 only has one root? Add a suggestion from Thomas Gleixner in which the RDT enablement code is placed into its own function. [folded a leak fix from Andrey Vagin] Signed-off-by: David Howells cc: Greg Kroah-Hartman cc: Tejun Heo cc: Li Zefan cc: Johannes Weiner cc: cgroups@vger.kernel.org cc: fenghua.yu@intel.com Signed-off-by: Al Viro --- arch/x86/kernel/cpu/resctrl/internal.h | 16 +++ arch/x86/kernel/cpu/resctrl/rdtgroup.c | 185 +++++++++++++++++++++------------ fs/kernfs/kernfs-internal.h | 1 + fs/kernfs/mount.c | 89 +++++++--------- fs/sysfs/mount.c | 73 +++++++++---- include/linux/kernfs.h | 38 +++---- kernel/cgroup/cgroup-internal.h | 5 +- kernel/cgroup/cgroup.c | 31 +++--- 8 files changed, 262 insertions(+), 176 deletions(-) (limited to 'kernel') diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h index 822b7db634ee..e49b77283924 100644 --- a/arch/x86/kernel/cpu/resctrl/internal.h +++ b/arch/x86/kernel/cpu/resctrl/internal.h @@ -4,6 +4,7 @@ #include #include +#include #include #define MSR_IA32_L3_QOS_CFG 0xc81 @@ -40,6 +41,21 @@ #define RMID_VAL_ERROR BIT_ULL(63) #define RMID_VAL_UNAVAIL BIT_ULL(62) + +struct rdt_fs_context { + struct kernfs_fs_context kfc; + bool enable_cdpl2; + bool enable_cdpl3; + bool enable_mba_mbps; +}; + +static inline struct rdt_fs_context *rdt_fc2context(struct fs_context *fc) +{ + struct kernfs_fs_context *kfc = fc->fs_private; + + return container_of(kfc, struct rdt_fs_context, kfc); +} + DECLARE_STATIC_KEY_FALSE(rdt_enable_key); /** diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index 8388adf241b2..399601eda8e4 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -24,6 +24,7 @@ #include #include #include +#include #include #include #include @@ -32,6 +33,7 @@ #include #include #include +#include #include @@ -1858,46 +1860,6 @@ static void cdp_disable_all(void) cdpl2_disable(); } -static int parse_rdtgroupfs_options(char *data) -{ - char *token, *o = data; - int ret = 0; - - while ((token = strsep(&o, ",")) != NULL) { - if (!*token) { - ret = -EINVAL; - goto out; - } - - if (!strcmp(token, "cdp")) { - ret = cdpl3_enable(); - if (ret) - goto out; - } else if (!strcmp(token, "cdpl2")) { - ret = cdpl2_enable(); - if (ret) - goto out; - } else if (!strcmp(token, "mba_MBps")) { - if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) - ret = set_mba_sc(true); - else - ret = -EINVAL; - if (ret) - goto out; - } else { - ret = -EINVAL; - goto out; - } - } - - return 0; - -out: - pr_err("Invalid mount option \"%s\"\n", token); - - return ret; -} - /* * We don't allow rdtgroup directories to be created anywhere * except the root directory. Thus when looking for the rdtgroup @@ -1969,13 +1931,27 @@ static int mkdir_mondata_all(struct kernfs_node *parent_kn, struct rdtgroup *prgrp, struct kernfs_node **mon_data_kn); -static struct dentry *rdt_mount(struct file_system_type *fs_type, - int flags, const char *unused_dev_name, - void *data) +static int rdt_enable_ctx(struct rdt_fs_context *ctx) +{ + int ret = 0; + + if (ctx->enable_cdpl2) + ret = cdpl2_enable(); + + if (!ret && ctx->enable_cdpl3) + ret = cdpl3_enable(); + + if (!ret && ctx->enable_mba_mbps) + ret = set_mba_sc(true); + + return ret; +} + +static int rdt_get_tree(struct fs_context *fc) { + struct rdt_fs_context *ctx = rdt_fc2context(fc); struct rdt_domain *dom; struct rdt_resource *r; - struct dentry *dentry; int ret; cpus_read_lock(); @@ -1984,53 +1960,42 @@ static struct dentry *rdt_mount(struct file_system_type *fs_type, * resctrl file system can only be mounted once. */ if (static_branch_unlikely(&rdt_enable_key)) { - dentry = ERR_PTR(-EBUSY); + ret = -EBUSY; goto out; } - ret = parse_rdtgroupfs_options(data); - if (ret) { - dentry = ERR_PTR(ret); + ret = rdt_enable_ctx(ctx); + if (ret < 0) goto out_cdp; - } closid_init(); ret = rdtgroup_create_info_dir(rdtgroup_default.kn); - if (ret) { - dentry = ERR_PTR(ret); - goto out_cdp; - } + if (ret < 0) + goto out_mba; if (rdt_mon_capable) { ret = mongroup_create_dir(rdtgroup_default.kn, NULL, "mon_groups", &kn_mongrp); - if (ret) { - dentry = ERR_PTR(ret); + if (ret < 0) goto out_info; - } kernfs_get(kn_mongrp); ret = mkdir_mondata_all(rdtgroup_default.kn, &rdtgroup_default, &kn_mondata); - if (ret) { - dentry = ERR_PTR(ret); + if (ret < 0) goto out_mongrp; - } kernfs_get(kn_mondata); rdtgroup_default.mon.mon_data_kn = kn_mondata; } ret = rdt_pseudo_lock_init(); - if (ret) { - dentry = ERR_PTR(ret); + if (ret) goto out_mondata; - } - dentry = kernfs_mount(fs_type, flags, rdt_root, - RDTGROUP_SUPER_MAGIC, NULL); - if (IS_ERR(dentry)) + ret = kernfs_get_tree(fc); + if (ret < 0) goto out_psl; if (rdt_alloc_capable) @@ -2059,14 +2024,95 @@ out_mongrp: kernfs_remove(kn_mongrp); out_info: kernfs_remove(kn_info); +out_mba: + if (ctx->enable_mba_mbps) + set_mba_sc(false); out_cdp: cdp_disable_all(); out: rdt_last_cmd_clear(); mutex_unlock(&rdtgroup_mutex); cpus_read_unlock(); + return ret; +} + +enum rdt_param { + Opt_cdp, + Opt_cdpl2, + Opt_mba_mpbs, + nr__rdt_params +}; + +static const struct fs_parameter_spec rdt_param_specs[] = { + fsparam_flag("cdp", Opt_cdp), + fsparam_flag("cdpl2", Opt_cdpl2), + fsparam_flag("mba_mpbs", Opt_mba_mpbs), + {} +}; + +static const struct fs_parameter_description rdt_fs_parameters = { + .name = "rdt", + .specs = rdt_param_specs, +}; + +static int rdt_parse_param(struct fs_context *fc, struct fs_parameter *param) +{ + struct rdt_fs_context *ctx = rdt_fc2context(fc); + struct fs_parse_result result; + int opt; + + opt = fs_parse(fc, &rdt_fs_parameters, param, &result); + if (opt < 0) + return opt; - return dentry; + switch (opt) { + case Opt_cdp: + ctx->enable_cdpl3 = true; + return 0; + case Opt_cdpl2: + ctx->enable_cdpl2 = true; + return 0; + case Opt_mba_mpbs: + if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL) + return -EINVAL; + ctx->enable_mba_mbps = true; + return 0; + } + + return -EINVAL; +} + +static void rdt_fs_context_free(struct fs_context *fc) +{ + struct rdt_fs_context *ctx = rdt_fc2context(fc); + + kernfs_free_fs_context(fc); + kfree(ctx); +} + +static const struct fs_context_operations rdt_fs_context_ops = { + .free = rdt_fs_context_free, + .parse_param = rdt_parse_param, + .get_tree = rdt_get_tree, +}; + +static int rdt_init_fs_context(struct fs_context *fc) +{ + struct rdt_fs_context *ctx; + + ctx = kzalloc(sizeof(struct rdt_fs_context), GFP_KERNEL); + if (!ctx) + return -ENOMEM; + + ctx->kfc.root = rdt_root; + ctx->kfc.magic = RDTGROUP_SUPER_MAGIC; + fc->fs_private = &ctx->kfc; + fc->ops = &rdt_fs_context_ops; + if (fc->user_ns) + put_user_ns(fc->user_ns); + fc->user_ns = get_user_ns(&init_user_ns); + fc->global = true; + return 0; } static int reset_all_ctrls(struct rdt_resource *r) @@ -2239,9 +2285,10 @@ static void rdt_kill_sb(struct super_block *sb) } static struct file_system_type rdt_fs_type = { - .name = "resctrl", - .mount = rdt_mount, - .kill_sb = rdt_kill_sb, + .name = "resctrl", + .init_fs_context = rdt_init_fs_context, + .parameters = &rdt_fs_parameters, + .kill_sb = rdt_kill_sb, }; static int mon_addfile(struct kernfs_node *parent_kn, const char *name, diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h index 3d83b114bb08..379e3a9eb1ec 100644 --- a/fs/kernfs/kernfs-internal.h +++ b/fs/kernfs/kernfs-internal.h @@ -17,6 +17,7 @@ #include #include +#include struct kernfs_iattrs { struct iattr ia_iattr; diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c index 4d303047a4f8..36376cc5c9c2 100644 --- a/fs/kernfs/mount.c +++ b/fs/kernfs/mount.c @@ -22,16 +22,6 @@ struct kmem_cache *kernfs_node_cache; -static int kernfs_sop_remount_fs(struct super_block *sb, int *flags, char *data) -{ - struct kernfs_root *root = kernfs_info(sb)->root; - struct kernfs_syscall_ops *scops = root->syscall_ops; - - if (scops && scops->remount_fs) - return scops->remount_fs(root, flags, data); - return 0; -} - static int kernfs_sop_show_options(struct seq_file *sf, struct dentry *dentry) { struct kernfs_root *root = kernfs_root(kernfs_dentry_node(dentry)); @@ -60,7 +50,6 @@ const struct super_operations kernfs_sops = { .drop_inode = generic_delete_inode, .evict_inode = kernfs_evict_inode, - .remount_fs = kernfs_sop_remount_fs, .show_options = kernfs_sop_show_options, .show_path = kernfs_sop_show_path, }; @@ -222,7 +211,7 @@ struct dentry *kernfs_node_dentry(struct kernfs_node *kn, } while (true); } -static int kernfs_fill_super(struct super_block *sb, unsigned long magic) +static int kernfs_fill_super(struct super_block *sb, struct kernfs_fs_context *kfc) { struct kernfs_super_info *info = kernfs_info(sb); struct inode *inode; @@ -233,7 +222,7 @@ static int kernfs_fill_super(struct super_block *sb, unsigned long magic) sb->s_iflags |= SB_I_NOEXEC | SB_I_NODEV; sb->s_blocksize = PAGE_SIZE; sb->s_blocksize_bits = PAGE_SHIFT; - sb->s_magic = magic; + sb->s_magic = kfc->magic; sb->s_op = &kernfs_sops; sb->s_xattr = kernfs_xattr_handlers; if (info->root->flags & KERNFS_ROOT_SUPPORT_EXPORTOP) @@ -263,21 +252,20 @@ static int kernfs_fill_super(struct super_block *sb, unsigned long magic) return 0; } -static int kernfs_test_super(struct super_block *sb, void *data) +static int kernfs_test_super(struct super_block *sb, struct fs_context *fc) { struct kernfs_super_info *sb_info = kernfs_info(sb); - struct kernfs_super_info *info = data; + struct kernfs_super_info *info = fc->s_fs_info; return sb_info->root == info->root && sb_info->ns == info->ns; } -static int kernfs_set_super(struct super_block *sb, void *data) +static int kernfs_set_super(struct super_block *sb, struct fs_context *fc) { - int error; - error = set_anon_super(sb, data); - if (!error) - sb->s_fs_info = data; - return error; + struct kernfs_fs_context *kfc = fc->fs_private; + + kfc->ns_tag = NULL; + return set_anon_super_fc(sb, fc); } /** @@ -294,63 +282,60 @@ const void *kernfs_super_ns(struct super_block *sb) } /** - * kernfs_mount_ns - kernfs mount helper - * @fs_type: file_system_type of the fs being mounted - * @flags: mount flags specified for the mount - * @root: kernfs_root of the hierarchy being mounted - * @magic: file system specific magic number - * @new_sb_created: tell the caller if we allocated a new superblock - * @ns: optional namespace tag of the mount + * kernfs_get_tree - kernfs filesystem access/retrieval helper + * @fc: The filesystem context. * - * This is to be called from each kernfs user's file_system_type->mount() - * implementation, which should pass through the specified @fs_type and - * @flags, and specify the hierarchy and namespace tag to mount via @root - * and @ns, respectively. - * - * The return value can be passed to the vfs layer verbatim. + * This is to be called from each kernfs user's fs_context->ops->get_tree() + * implementation, which should set the specified ->@fs_type and ->@flags, and + * specify the hierarchy and namespace tag to mount via ->@root and ->@ns, + * respectively. */ -struct dentry *kernfs_mount_ns(struct file_system_type *fs_type, int flags, - struct kernfs_root *root, unsigned long magic, - bool *new_sb_created, const void *ns) +int kernfs_get_tree(struct fs_context *fc) { + struct kernfs_fs_context *kfc = fc->fs_private; struct super_block *sb; struct kernfs_super_info *info; int error; info = kzalloc(sizeof(*info), GFP_KERNEL); if (!info) - return ERR_PTR(-ENOMEM); + return -ENOMEM; - info->root = root; - info->ns = ns; + info->root = kfc->root; + info->ns = kfc->ns_tag; INIT_LIST_HEAD(&info->node); - sb = sget_userns(fs_type, kernfs_test_super, kernfs_set_super, flags, - &init_user_ns, info); - if (IS_ERR(sb) || sb->s_fs_info != info) - kfree(info); + fc->s_fs_info = info; + sb = sget_fc(fc, kernfs_test_super, kernfs_set_super); if (IS_ERR(sb)) - return ERR_CAST(sb); - - if (new_sb_created) - *new_sb_created = !sb->s_root; + return PTR_ERR(sb); if (!sb->s_root) { struct kernfs_super_info *info = kernfs_info(sb); - error = kernfs_fill_super(sb, magic); + kfc->new_sb_created = true; + + error = kernfs_fill_super(sb, kfc); if (error) { deactivate_locked_super(sb); - return ERR_PTR(error); + return error; } sb->s_flags |= SB_ACTIVE; mutex_lock(&kernfs_mutex); - list_add(&info->node, &root->supers); + list_add(&info->node, &info->root->supers); mutex_unlock(&kernfs_mutex); } - return dget(sb->s_root); + fc->root = dget(sb->s_root); + return 0; +} + +void kernfs_free_fs_context(struct fs_context *fc) +{ + /* Note that we don't deal with kfc->ns_tag here. */ + kfree(fc->s_fs_info); + fc->s_fs_info = NULL; } /** diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c index 92682fcc41f6..4cb21b558a85 100644 --- a/fs/sysfs/mount.c +++ b/fs/sysfs/mount.c @@ -13,34 +13,69 @@ #include #include #include +#include #include +#include +#include #include "sysfs.h" static struct kernfs_root *sysfs_root; struct kernfs_node *sysfs_root_kn; -static struct dentry *sysfs_mount(struct file_system_type *fs_type, - int flags, const char *dev_name, void *data) +static int sysfs_get_tree(struct fs_context *fc) { - struct dentry *root; - void *ns; - bool new_sb = false; + struct kernfs_fs_context *kfc = fc->fs_private; + int ret; - if (!(flags & SB_KERNMOUNT)) { + ret = kernfs_get_tree(fc); + if (ret) + return ret; + + if (kfc->new_sb_created) + fc->root->d_sb->s_iflags |= SB_I_USERNS_VISIBLE; + return 0; +} + +static void sysfs_fs_context_free(struct fs_context *fc) +{ + struct kernfs_fs_context *kfc = fc->fs_private; + + if (kfc->ns_tag) + kobj_ns_drop(KOBJ_NS_TYPE_NET, kfc->ns_tag); + kernfs_free_fs_context(fc); + kfree(kfc); +} + +static const struct fs_context_operations sysfs_fs_context_ops = { + .free = sysfs_fs_context_free, + .get_tree = sysfs_get_tree, +}; + +static int sysfs_init_fs_context(struct fs_context *fc) +{ + struct kernfs_fs_context *kfc; + struct net *netns; + + if (!(fc->sb_flags & SB_KERNMOUNT)) { if (!kobj_ns_current_may_mount(KOBJ_NS_TYPE_NET)) - return ERR_PTR(-EPERM); + return -EPERM; } - ns = kobj_ns_grab_current(KOBJ_NS_TYPE_NET); - root = kernfs_mount_ns(fs_type, flags, sysfs_root, - SYSFS_MAGIC, &new_sb, ns); - if (!new_sb) - kobj_ns_drop(KOBJ_NS_TYPE_NET, ns); - else if (!IS_ERR(root)) - root->d_sb->s_iflags |= SB_I_USERNS_VISIBLE; + kfc = kzalloc(sizeof(struct kernfs_fs_context), GFP_KERNEL); + if (!kfc) + return -ENOMEM; - return root; + kfc->ns_tag = netns = kobj_ns_grab_current(KOBJ_NS_TYPE_NET); + kfc->root = sysfs_root; + kfc->magic = SYSFS_MAGIC; + fc->fs_private = kfc; + fc->ops = &sysfs_fs_context_ops; + if (fc->user_ns) + put_user_ns(fc->user_ns); + fc->user_ns = get_user_ns(netns->user_ns); + fc->global = true; + return 0; } static void sysfs_kill_sb(struct super_block *sb) @@ -52,10 +87,10 @@ static void sysfs_kill_sb(struct super_block *sb) } static struct file_system_type sysfs_fs_type = { - .name = "sysfs", - .mount = sysfs_mount, - .kill_sb = sysfs_kill_sb, - .fs_flags = FS_USERNS_MOUNT, + .name = "sysfs", + .init_fs_context = sysfs_init_fs_context, + .kill_sb = sysfs_kill_sb, + .fs_flags = FS_USERNS_MOUNT, }; int __init sysfs_init(void) diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h index 44acb4c3659c..822a64e65b41 100644 --- a/include/linux/kernfs.h +++ b/include/linux/kernfs.h @@ -25,7 +25,9 @@ struct seq_file; struct vm_area_struct; struct super_block; struct file_system_type; +struct fs_context; +struct kernfs_fs_context; struct kernfs_open_node; struct kernfs_iattrs; @@ -167,7 +169,6 @@ struct kernfs_node { * kernfs_node parameter. */ struct kernfs_syscall_ops { - int (*remount_fs)(struct kernfs_root *root, int *flags, char *data); int (*show_options)(struct seq_file *sf, struct kernfs_root *root); int (*mkdir)(struct kernfs_node *parent, const char *name, @@ -268,6 +269,18 @@ struct kernfs_ops { #endif }; +/* + * The kernfs superblock creation/mount parameter context. + */ +struct kernfs_fs_context { + struct kernfs_root *root; /* Root of the hierarchy being mounted */ + void *ns_tag; /* Namespace tag of the mount (or NULL) */ + unsigned long magic; /* File system specific magic number */ + + /* The following are set/used by kernfs_mount() */ + bool new_sb_created; /* Set to T if we allocated a new sb */ +}; + #ifdef CONFIG_KERNFS static inline enum kernfs_node_type kernfs_type(struct kernfs_node *kn) @@ -353,9 +366,8 @@ int kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr); void kernfs_notify(struct kernfs_node *kn); const void *kernfs_super_ns(struct super_block *sb); -struct dentry *kernfs_mount_ns(struct file_system_type *fs_type, int flags, - struct kernfs_root *root, unsigned long magic, - bool *new_sb_created, const void *ns); +int kernfs_get_tree(struct fs_context *fc); +void kernfs_free_fs_context(struct fs_context *fc); void kernfs_kill_sb(struct super_block *sb); void kernfs_init(void); @@ -458,11 +470,10 @@ static inline void kernfs_notify(struct kernfs_node *kn) { } static inline const void *kernfs_super_ns(struct super_block *sb) { return NULL; } -static inline struct dentry * -kernfs_mount_ns(struct file_system_type *fs_type, int flags, - struct kernfs_root *root, unsigned long magic, - bool *new_sb_created, const void *ns) -{ return ERR_PTR(-ENOSYS); } +static inline int kernfs_get_tree(struct fs_context *fc) +{ return -ENOSYS; } + +static inline void kernfs_free_fs_context(struct fs_context *fc) { } static inline void kernfs_kill_sb(struct super_block *sb) { } @@ -545,13 +556,4 @@ static inline int kernfs_rename(struct kernfs_node *kn, return kernfs_rename_ns(kn, new_parent, new_name, NULL); } -static inline struct dentry * -kernfs_mount(struct file_system_type *fs_type, int flags, - struct kernfs_root *root, unsigned long magic, - bool *new_sb_created) -{ - return kernfs_mount_ns(fs_type, flags, root, - magic, new_sb_created, NULL); -} - #endif /* __LINUX_KERNFS_H */ diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h index 37cf709b7a0e..30e39f3932ad 100644 --- a/kernel/cgroup/cgroup-internal.h +++ b/kernel/cgroup/cgroup-internal.h @@ -41,6 +41,7 @@ extern void __init enable_debug_cgroup(void); * The cgroup filesystem superblock creation/mount context. */ struct cgroup_fs_context { + struct kernfs_fs_context kfc; struct cgroup_root *root; struct cgroup_namespace *ns; unsigned int flags; /* CGRP_ROOT_* flags */ @@ -56,7 +57,9 @@ struct cgroup_fs_context { static inline struct cgroup_fs_context *cgroup_fc2context(struct fs_context *fc) { - return fc->fs_private; + struct kernfs_fs_context *kfc = fc->fs_private; + + return container_of(kfc, struct cgroup_fs_context, kfc); } /* diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 0c6bef234a7c..747e5b17f9da 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -2039,18 +2039,14 @@ out: int cgroup_do_get_tree(struct fs_context *fc) { struct cgroup_fs_context *ctx = cgroup_fc2context(fc); - bool new_sb = false; - unsigned long magic; - int ret = 0; + int ret; + ctx->kfc.root = ctx->root->kf_root; if (fc->fs_type == &cgroup2_fs_type) - magic = CGROUP2_SUPER_MAGIC; + ctx->kfc.magic = CGROUP2_SUPER_MAGIC; else - magic = CGROUP_SUPER_MAGIC; - fc->root = kernfs_mount(fc->fs_type, fc->sb_flags, ctx->root->kf_root, - magic, &new_sb); - if (IS_ERR(fc->root)) - ret = PTR_ERR(fc->root); + ctx->kfc.magic = CGROUP_SUPER_MAGIC; + ret = kernfs_get_tree(fc); /* * In non-init cgroup namespace, instead of root cgroup's dentry, @@ -2078,7 +2074,7 @@ int cgroup_do_get_tree(struct fs_context *fc) } } - if (!new_sb) + if (!ctx->kfc.new_sb_created) cgroup_put(&ctx->root->cgrp); return ret; @@ -2094,19 +2090,15 @@ static void cgroup_fs_context_free(struct fs_context *fc) kfree(ctx->name); kfree(ctx->release_agent); put_cgroup_ns(ctx->ns); + kernfs_free_fs_context(fc); kfree(ctx); } static int cgroup_get_tree(struct fs_context *fc) { - struct cgroup_namespace *ns = current->nsproxy->cgroup_ns; struct cgroup_fs_context *ctx = cgroup_fc2context(fc); int ret; - /* Check if the caller has permission to mount. */ - if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) - return -EPERM; - cgrp_dfl_visible = true; cgroup_get_live(&cgrp_dfl_root.cgrp); ctx->root = &cgrp_dfl_root; @@ -2132,7 +2124,8 @@ static const struct fs_context_operations cgroup1_fs_context_ops = { }; /* - * Initialise the cgroup filesystem creation/reconfiguration context. + * Initialise the cgroup filesystem creation/reconfiguration context. Notably, + * we select the namespace we're going to use. */ static int cgroup_init_fs_context(struct fs_context *fc) { @@ -2151,11 +2144,15 @@ static int cgroup_init_fs_context(struct fs_context *fc) ctx->ns = current->nsproxy->cgroup_ns; get_cgroup_ns(ctx->ns); - fc->fs_private = ctx; + fc->fs_private = &ctx->kfc; if (fc->fs_type == &cgroup2_fs_type) fc->ops = &cgroup_fs_context_ops; else fc->ops = &cgroup1_fs_context_ops; + if (fc->user_ns) + put_user_ns(fc->user_ns); + fc->user_ns = get_user_ns(ctx->ns->user_ns); + fc->global = true; return 0; } -- cgit v1.3-7-g2ca7 From a18753747385b8b98577a18adc8ec99fda679044 Mon Sep 17 00:00:00 2001 From: David Howells Date: Thu, 1 Nov 2018 23:07:25 +0000 Subject: cpuset: Use fs_context Make the cpuset filesystem use the filesystem context. This is potentially tricky as the cpuset fs is almost an alias for the cgroup filesystem, but with some special parameters. This can, however, be handled by setting up an appropriate cgroup filesystem and returning the root directory of that as the root dir of this one. Signed-off-by: David Howells cc: Tejun Heo Signed-off-by: Al Viro --- kernel/cgroup/cpuset.c | 56 +++++++++++++++++++++++++++++++++++++------------- 1 file changed, 42 insertions(+), 14 deletions(-) (limited to 'kernel') diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index 479743db6c37..9758b03834ac 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -39,6 +39,7 @@ #include #include #include +#include #include #include #include @@ -372,25 +373,52 @@ static inline bool is_in_v2_mode(void) * users. If someone tries to mount the "cpuset" filesystem, we * silently switch it to mount "cgroup" instead */ -static struct dentry *cpuset_mount(struct file_system_type *fs_type, - int flags, const char *unused_dev_name, void *data) -{ - struct file_system_type *cgroup_fs = get_fs_type("cgroup"); - struct dentry *ret = ERR_PTR(-ENODEV); - if (cgroup_fs) { - char mountopts[] = - "cpuset,noprefix," - "release_agent=/sbin/cpuset_release_agent"; - ret = cgroup_fs->mount(cgroup_fs, flags, - unused_dev_name, mountopts); - put_filesystem(cgroup_fs); +static int cpuset_get_tree(struct fs_context *fc) +{ + struct file_system_type *cgroup_fs; + struct fs_context *new_fc; + int ret; + + cgroup_fs = get_fs_type("cgroup"); + if (!cgroup_fs) + return -ENODEV; + + new_fc = fs_context_for_mount(cgroup_fs, fc->sb_flags); + if (IS_ERR(new_fc)) { + ret = PTR_ERR(new_fc); + } else { + static const char agent_path[] = "/sbin/cpuset_release_agent"; + ret = vfs_parse_fs_string(new_fc, "cpuset", NULL, 0); + if (!ret) + ret = vfs_parse_fs_string(new_fc, "noprefix", NULL, 0); + if (!ret) + ret = vfs_parse_fs_string(new_fc, "release_agent", + agent_path, sizeof(agent_path) - 1); + if (!ret) + ret = vfs_get_tree(new_fc); + if (!ret) { /* steal the result */ + fc->root = new_fc->root; + new_fc->root = NULL; + } + put_fs_context(new_fc); } + put_filesystem(cgroup_fs); return ret; } +static const struct fs_context_operations cpuset_fs_context_ops = { + .get_tree = cpuset_get_tree, +}; + +static int cpuset_init_fs_context(struct fs_context *fc) +{ + fc->ops = &cpuset_fs_context_ops; + return 0; +} + static struct file_system_type cpuset_fs_type = { - .name = "cpuset", - .mount = cpuset_mount, + .name = "cpuset", + .init_fs_context = cpuset_init_fs_context, }; /* -- cgit v1.3-7-g2ca7 From 06a2ae56b5b88fa57cd56e0b99bd874135efdf58 Mon Sep 17 00:00:00 2001 From: David Howells Date: Thu, 1 Nov 2018 23:07:26 +0000 Subject: vfs: Add some logging to the core users of the fs_context log Add some logging to the core users of the fs_context log so that information can be extracted from them as to the reason for failure. Signed-off-by: David Howells Signed-off-by: Al Viro --- fs/super.c | 4 +++- kernel/cgroup/cgroup-v1.c | 2 +- 2 files changed, 4 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/fs/super.c b/fs/super.c index 0ebb5c11fa56..583a0124bc39 100644 --- a/fs/super.c +++ b/fs/super.c @@ -1467,8 +1467,10 @@ int vfs_get_tree(struct fs_context *fc) struct super_block *sb; int error; - if (fc->fs_type->fs_flags & FS_REQUIRES_DEV && !fc->source) + if (fc->fs_type->fs_flags & FS_REQUIRES_DEV && !fc->source) { + errorf(fc, "Filesystem requires source device"); return -ENOENT; + } if (fc->root) return -EBUSY; diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c index 571ef3447426..c126b34fd4ff 100644 --- a/kernel/cgroup/cgroup-v1.c +++ b/kernel/cgroup/cgroup-v1.c @@ -17,7 +17,7 @@ #include -#define cg_invalf(fc, fmt, ...) ({ pr_err(fmt, ## __VA_ARGS__); -EINVAL; }) +#define cg_invalf(fc, fmt, ...) invalf(fc, fmt, ## __VA_ARGS__) /* * pidlists linger the following amount before being destroyed. The goal -- cgit v1.3-7-g2ca7 From fe99a4f4d6022ec92f9b52a5528cb9b77513e7d1 Mon Sep 17 00:00:00 2001 From: Julia Cartwright Date: Tue, 12 Feb 2019 17:25:53 +0100 Subject: kthread: Convert worker lock to raw spinlock In order to enable the queuing of kthread work items from hardirq context even when PREEMPT_RT_FULL is enabled, convert the worker spin_lock to a raw_spin_lock. This is only acceptable to do because the work performed under the lock is well-bounded and minimal. Reported-by: Steffen Trumtrar Reported-by: Tim Sander Signed-off-by: Julia Cartwright Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner Tested-by: Steffen Trumtrar Reviewed-by: Petr Mladek Cc: Guenter Roeck Link: https://lkml.kernel.org/r/20190212162554.19779-1-bigeasy@linutronix.de --- include/linux/kthread.h | 4 ++-- kernel/kthread.c | 42 +++++++++++++++++++++--------------------- 2 files changed, 23 insertions(+), 23 deletions(-) (limited to 'kernel') diff --git a/include/linux/kthread.h b/include/linux/kthread.h index c1961761311d..6b8c064f0cbc 100644 --- a/include/linux/kthread.h +++ b/include/linux/kthread.h @@ -85,7 +85,7 @@ enum { struct kthread_worker { unsigned int flags; - spinlock_t lock; + raw_spinlock_t lock; struct list_head work_list; struct list_head delayed_work_list; struct task_struct *task; @@ -106,7 +106,7 @@ struct kthread_delayed_work { }; #define KTHREAD_WORKER_INIT(worker) { \ - .lock = __SPIN_LOCK_UNLOCKED((worker).lock), \ + .lock = __RAW_SPIN_LOCK_UNLOCKED((worker).lock), \ .work_list = LIST_HEAD_INIT((worker).work_list), \ .delayed_work_list = LIST_HEAD_INIT((worker).delayed_work_list),\ } diff --git a/kernel/kthread.c b/kernel/kthread.c index 087d18d771b5..5641b55783a6 100644 --- a/kernel/kthread.c +++ b/kernel/kthread.c @@ -599,7 +599,7 @@ void __kthread_init_worker(struct kthread_worker *worker, struct lock_class_key *key) { memset(worker, 0, sizeof(struct kthread_worker)); - spin_lock_init(&worker->lock); + raw_spin_lock_init(&worker->lock); lockdep_set_class_and_name(&worker->lock, key, name); INIT_LIST_HEAD(&worker->work_list); INIT_LIST_HEAD(&worker->delayed_work_list); @@ -641,21 +641,21 @@ repeat: if (kthread_should_stop()) { __set_current_state(TASK_RUNNING); - spin_lock_irq(&worker->lock); + raw_spin_lock_irq(&worker->lock); worker->task = NULL; - spin_unlock_irq(&worker->lock); + raw_spin_unlock_irq(&worker->lock); return 0; } work = NULL; - spin_lock_irq(&worker->lock); + raw_spin_lock_irq(&worker->lock); if (!list_empty(&worker->work_list)) { work = list_first_entry(&worker->work_list, struct kthread_work, node); list_del_init(&work->node); } worker->current_work = work; - spin_unlock_irq(&worker->lock); + raw_spin_unlock_irq(&worker->lock); if (work) { __set_current_state(TASK_RUNNING); @@ -812,12 +812,12 @@ bool kthread_queue_work(struct kthread_worker *worker, bool ret = false; unsigned long flags; - spin_lock_irqsave(&worker->lock, flags); + raw_spin_lock_irqsave(&worker->lock, flags); if (!queuing_blocked(worker, work)) { kthread_insert_work(worker, work, &worker->work_list); ret = true; } - spin_unlock_irqrestore(&worker->lock, flags); + raw_spin_unlock_irqrestore(&worker->lock, flags); return ret; } EXPORT_SYMBOL_GPL(kthread_queue_work); @@ -843,7 +843,7 @@ void kthread_delayed_work_timer_fn(struct timer_list *t) if (WARN_ON_ONCE(!worker)) return; - spin_lock(&worker->lock); + raw_spin_lock(&worker->lock); /* Work must not be used with >1 worker, see kthread_queue_work(). */ WARN_ON_ONCE(work->worker != worker); @@ -852,7 +852,7 @@ void kthread_delayed_work_timer_fn(struct timer_list *t) list_del_init(&work->node); kthread_insert_work(worker, work, &worker->work_list); - spin_unlock(&worker->lock); + raw_spin_unlock(&worker->lock); } EXPORT_SYMBOL(kthread_delayed_work_timer_fn); @@ -908,14 +908,14 @@ bool kthread_queue_delayed_work(struct kthread_worker *worker, unsigned long flags; bool ret = false; - spin_lock_irqsave(&worker->lock, flags); + raw_spin_lock_irqsave(&worker->lock, flags); if (!queuing_blocked(worker, work)) { __kthread_queue_delayed_work(worker, dwork, delay); ret = true; } - spin_unlock_irqrestore(&worker->lock, flags); + raw_spin_unlock_irqrestore(&worker->lock, flags); return ret; } EXPORT_SYMBOL_GPL(kthread_queue_delayed_work); @@ -951,7 +951,7 @@ void kthread_flush_work(struct kthread_work *work) if (!worker) return; - spin_lock_irq(&worker->lock); + raw_spin_lock_irq(&worker->lock); /* Work must not be used with >1 worker, see kthread_queue_work(). */ WARN_ON_ONCE(work->worker != worker); @@ -963,7 +963,7 @@ void kthread_flush_work(struct kthread_work *work) else noop = true; - spin_unlock_irq(&worker->lock); + raw_spin_unlock_irq(&worker->lock); if (!noop) wait_for_completion(&fwork.done); @@ -996,9 +996,9 @@ static bool __kthread_cancel_work(struct kthread_work *work, bool is_dwork, * any queuing is blocked by setting the canceling counter. */ work->canceling++; - spin_unlock_irqrestore(&worker->lock, *flags); + raw_spin_unlock_irqrestore(&worker->lock, *flags); del_timer_sync(&dwork->timer); - spin_lock_irqsave(&worker->lock, *flags); + raw_spin_lock_irqsave(&worker->lock, *flags); work->canceling--; } @@ -1045,7 +1045,7 @@ bool kthread_mod_delayed_work(struct kthread_worker *worker, unsigned long flags; int ret = false; - spin_lock_irqsave(&worker->lock, flags); + raw_spin_lock_irqsave(&worker->lock, flags); /* Do not bother with canceling when never queued. */ if (!work->worker) @@ -1062,7 +1062,7 @@ bool kthread_mod_delayed_work(struct kthread_worker *worker, fast_queue: __kthread_queue_delayed_work(worker, dwork, delay); out: - spin_unlock_irqrestore(&worker->lock, flags); + raw_spin_unlock_irqrestore(&worker->lock, flags); return ret; } EXPORT_SYMBOL_GPL(kthread_mod_delayed_work); @@ -1076,7 +1076,7 @@ static bool __kthread_cancel_work_sync(struct kthread_work *work, bool is_dwork) if (!worker) goto out; - spin_lock_irqsave(&worker->lock, flags); + raw_spin_lock_irqsave(&worker->lock, flags); /* Work must not be used with >1 worker, see kthread_queue_work(). */ WARN_ON_ONCE(work->worker != worker); @@ -1090,13 +1090,13 @@ static bool __kthread_cancel_work_sync(struct kthread_work *work, bool is_dwork) * In the meantime, block any queuing by setting the canceling counter. */ work->canceling++; - spin_unlock_irqrestore(&worker->lock, flags); + raw_spin_unlock_irqrestore(&worker->lock, flags); kthread_flush_work(work); - spin_lock_irqsave(&worker->lock, flags); + raw_spin_lock_irqsave(&worker->lock, flags); work->canceling--; out_fast: - spin_unlock_irqrestore(&worker->lock, flags); + raw_spin_unlock_irqrestore(&worker->lock, flags); out: return ret; } -- cgit v1.3-7-g2ca7 From ad01423aedaa7c6dd62d560b73a3cb39e6da3901 Mon Sep 17 00:00:00 2001 From: Sebastian Andrzej Siewior Date: Tue, 12 Feb 2019 17:25:54 +0100 Subject: kthread: Do not use TIMER_IRQSAFE The TIMER_IRQSAFE usage was introduced in commit 22597dc3d97b1 ("kthread: initial support for delayed kthread work") which modelled the delayed kthread code after workqueue's code. The workqueue code requires the flag TIMER_IRQSAFE for synchronisation purpose. This is not true for kthread's delay timer since all operations occur under a lock. Remove TIMER_IRQSAFE from the timer initialisation and use timer_setup() for initialisation purpose which is the official function. Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner Reviewed-by: Petr Mladek Link: https://lkml.kernel.org/r/20190212162554.19779-2-bigeasy@linutronix.de --- include/linux/kthread.h | 5 ++--- kernel/kthread.c | 5 +++-- 2 files changed, 5 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/include/linux/kthread.h b/include/linux/kthread.h index 6b8c064f0cbc..3d9d834c66a2 100644 --- a/include/linux/kthread.h +++ b/include/linux/kthread.h @@ -164,9 +164,8 @@ extern void __kthread_init_worker(struct kthread_worker *worker, #define kthread_init_delayed_work(dwork, fn) \ do { \ kthread_init_work(&(dwork)->work, (fn)); \ - __init_timer(&(dwork)->timer, \ - kthread_delayed_work_timer_fn, \ - TIMER_IRQSAFE); \ + timer_setup(&(dwork)->timer, \ + kthread_delayed_work_timer_fn, 0); \ } while (0) int kthread_worker_fn(void *worker_ptr); diff --git a/kernel/kthread.c b/kernel/kthread.c index 5641b55783a6..537335541267 100644 --- a/kernel/kthread.c +++ b/kernel/kthread.c @@ -835,6 +835,7 @@ void kthread_delayed_work_timer_fn(struct timer_list *t) struct kthread_delayed_work *dwork = from_timer(dwork, t, timer); struct kthread_work *work = &dwork->work; struct kthread_worker *worker = work->worker; + unsigned long flags; /* * This might happen when a pending work is reinitialized. @@ -843,7 +844,7 @@ void kthread_delayed_work_timer_fn(struct timer_list *t) if (WARN_ON_ONCE(!worker)) return; - raw_spin_lock(&worker->lock); + raw_spin_lock_irqsave(&worker->lock, flags); /* Work must not be used with >1 worker, see kthread_queue_work(). */ WARN_ON_ONCE(work->worker != worker); @@ -852,7 +853,7 @@ void kthread_delayed_work_timer_fn(struct timer_list *t) list_del_init(&work->node); kthread_insert_work(worker, work, &worker->work_list); - raw_spin_unlock(&worker->lock); + raw_spin_unlock_irqrestore(&worker->lock, flags); } EXPORT_SYMBOL(kthread_delayed_work_timer_fn); -- cgit v1.3-7-g2ca7 From 2b188cc1bb857a9d4701ae59aa7768b5124e262e Mon Sep 17 00:00:00 2001 From: Jens Axboe Date: Mon, 7 Jan 2019 10:46:33 -0700 Subject: Add io_uring IO interface The submission queue (SQ) and completion queue (CQ) rings are shared between the application and the kernel. This eliminates the need to copy data back and forth to submit and complete IO. IO submissions use the io_uring_sqe data structure, and completions are generated in the form of io_uring_cqe data structures. The SQ ring is an index into the io_uring_sqe array, which makes it possible to submit a batch of IOs without them being contiguous in the ring. The CQ ring is always contiguous, as completion events are inherently unordered, and hence any io_uring_cqe entry can point back to an arbitrary submission. Two new system calls are added for this: io_uring_setup(entries, params) Sets up an io_uring instance for doing async IO. On success, returns a file descriptor that the application can mmap to gain access to the SQ ring, CQ ring, and io_uring_sqes. io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize) Initiates IO against the rings mapped to this fd, or waits for them to complete, or both. The behavior is controlled by the parameters passed in. If 'to_submit' is non-zero, then we'll try and submit new IO. If IORING_ENTER_GETEVENTS is set, the kernel will wait for 'min_complete' events, if they aren't already available. It's valid to set IORING_ENTER_GETEVENTS and 'min_complete' == 0 at the same time, this allows the kernel to return already completed events without waiting for them. This is useful only for polling, as for IRQ driven IO, the application can just check the CQ ring without entering the kernel. With this setup, it's possible to do async IO with a single system call. Future developments will enable polled IO with this interface, and polled submission as well. The latter will enable an application to do IO without doing ANY system calls at all. For IRQ driven IO, an application only needs to enter the kernel for completions if it wants to wait for them to occur. Each io_uring is backed by a workqueue, to support buffered async IO as well. We will only punt to an async context if the command would need to wait for IO on the device side. Any data that can be accessed directly in the page cache is done inline. This avoids the slowness issue of usual threadpools, since cached data is accessed as quickly as a sync interface. Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c Reviewed-by: Hannes Reinecke Signed-off-by: Jens Axboe --- arch/x86/entry/syscalls/syscall_32.tbl | 2 + arch/x86/entry/syscalls/syscall_64.tbl | 2 + fs/Makefile | 1 + fs/io_uring.c | 1255 ++++++++++++++++++++++++++++++++ include/linux/fs.h | 9 + include/linux/sched/user.h | 2 +- include/linux/syscalls.h | 6 + include/uapi/asm-generic/unistd.h | 6 +- include/uapi/linux/io_uring.h | 95 +++ init/Kconfig | 9 + kernel/sys_ni.c | 2 + net/unix/garbage.c | 3 + 12 files changed, 1390 insertions(+), 2 deletions(-) create mode 100644 fs/io_uring.c create mode 100644 include/uapi/linux/io_uring.h (limited to 'kernel') diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl index 3cf7b533b3d1..481c126259e9 100644 --- a/arch/x86/entry/syscalls/syscall_32.tbl +++ b/arch/x86/entry/syscalls/syscall_32.tbl @@ -398,3 +398,5 @@ 384 i386 arch_prctl sys_arch_prctl __ia32_compat_sys_arch_prctl 385 i386 io_pgetevents sys_io_pgetevents __ia32_compat_sys_io_pgetevents 386 i386 rseq sys_rseq __ia32_sys_rseq +425 i386 io_uring_setup sys_io_uring_setup __ia32_sys_io_uring_setup +426 i386 io_uring_enter sys_io_uring_enter __ia32_sys_io_uring_enter diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index f0b1709a5ffb..6a32a430c8e0 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -343,6 +343,8 @@ 332 common statx __x64_sys_statx 333 common io_pgetevents __x64_sys_io_pgetevents 334 common rseq __x64_sys_rseq +425 common io_uring_setup __x64_sys_io_uring_setup +426 common io_uring_enter __x64_sys_io_uring_enter # # x32-specific system call numbers start at 512 to avoid cache impact diff --git a/fs/Makefile b/fs/Makefile index 293733f61594..8e15d6fc4340 100644 --- a/fs/Makefile +++ b/fs/Makefile @@ -30,6 +30,7 @@ obj-$(CONFIG_TIMERFD) += timerfd.o obj-$(CONFIG_EVENTFD) += eventfd.o obj-$(CONFIG_USERFAULTFD) += userfaultfd.o obj-$(CONFIG_AIO) += aio.o +obj-$(CONFIG_IO_URING) += io_uring.o obj-$(CONFIG_FS_DAX) += dax.o obj-$(CONFIG_FS_ENCRYPTION) += crypto/ obj-$(CONFIG_FILE_LOCKING) += locks.o diff --git a/fs/io_uring.c b/fs/io_uring.c new file mode 100644 index 000000000000..f68052290426 --- /dev/null +++ b/fs/io_uring.c @@ -0,0 +1,1255 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Shared application/kernel submission and completion ring pairs, for + * supporting fast/efficient IO. + * + * A note on the read/write ordering memory barriers that are matched between + * the application and kernel side. When the application reads the CQ ring + * tail, it must use an appropriate smp_rmb() to order with the smp_wmb() + * the kernel uses after writing the tail. Failure to do so could cause a + * delay in when the application notices that completion events available. + * This isn't a fatal condition. Likewise, the application must use an + * appropriate smp_wmb() both before writing the SQ tail, and after writing + * the SQ tail. The first one orders the sqe writes with the tail write, and + * the latter is paired with the smp_rmb() the kernel will issue before + * reading the SQ tail on submission. + * + * Also see the examples in the liburing library: + * + * git://git.kernel.dk/liburing + * + * io_uring also uses READ/WRITE_ONCE() for _any_ store or load that happens + * from data shared between the kernel and application. This is done both + * for ordering purposes, but also to ensure that once a value is loaded from + * data that the application could potentially modify, it remains stable. + * + * Copyright (C) 2018-2019 Jens Axboe + */ +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include "internal.h" + +#define IORING_MAX_ENTRIES 4096 + +struct io_uring { + u32 head ____cacheline_aligned_in_smp; + u32 tail ____cacheline_aligned_in_smp; +}; + +struct io_sq_ring { + struct io_uring r; + u32 ring_mask; + u32 ring_entries; + u32 dropped; + u32 flags; + u32 array[]; +}; + +struct io_cq_ring { + struct io_uring r; + u32 ring_mask; + u32 ring_entries; + u32 overflow; + struct io_uring_cqe cqes[]; +}; + +struct io_ring_ctx { + struct { + struct percpu_ref refs; + } ____cacheline_aligned_in_smp; + + struct { + unsigned int flags; + bool compat; + bool account_mem; + + /* SQ ring */ + struct io_sq_ring *sq_ring; + unsigned cached_sq_head; + unsigned sq_entries; + unsigned sq_mask; + struct io_uring_sqe *sq_sqes; + } ____cacheline_aligned_in_smp; + + /* IO offload */ + struct workqueue_struct *sqo_wq; + struct mm_struct *sqo_mm; + + struct { + /* CQ ring */ + struct io_cq_ring *cq_ring; + unsigned cached_cq_tail; + unsigned cq_entries; + unsigned cq_mask; + struct wait_queue_head cq_wait; + struct fasync_struct *cq_fasync; + } ____cacheline_aligned_in_smp; + + struct user_struct *user; + + struct completion ctx_done; + + struct { + struct mutex uring_lock; + wait_queue_head_t wait; + } ____cacheline_aligned_in_smp; + + struct { + spinlock_t completion_lock; + } ____cacheline_aligned_in_smp; + +#if defined(CONFIG_UNIX) + struct socket *ring_sock; +#endif +}; + +struct sqe_submit { + const struct io_uring_sqe *sqe; + unsigned short index; + bool has_user; +}; + +struct io_kiocb { + struct kiocb rw; + + struct sqe_submit submit; + + struct io_ring_ctx *ctx; + struct list_head list; + unsigned int flags; +#define REQ_F_FORCE_NONBLOCK 1 /* inline submission attempt */ + u64 user_data; + + struct work_struct work; +}; + +#define IO_PLUG_THRESHOLD 2 + +static struct kmem_cache *req_cachep; + +static const struct file_operations io_uring_fops; + +struct sock *io_uring_get_socket(struct file *file) +{ +#if defined(CONFIG_UNIX) + if (file->f_op == &io_uring_fops) { + struct io_ring_ctx *ctx = file->private_data; + + return ctx->ring_sock->sk; + } +#endif + return NULL; +} +EXPORT_SYMBOL(io_uring_get_socket); + +static void io_ring_ctx_ref_free(struct percpu_ref *ref) +{ + struct io_ring_ctx *ctx = container_of(ref, struct io_ring_ctx, refs); + + complete(&ctx->ctx_done); +} + +static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) +{ + struct io_ring_ctx *ctx; + + ctx = kzalloc(sizeof(*ctx), GFP_KERNEL); + if (!ctx) + return NULL; + + if (percpu_ref_init(&ctx->refs, io_ring_ctx_ref_free, 0, GFP_KERNEL)) { + kfree(ctx); + return NULL; + } + + ctx->flags = p->flags; + init_waitqueue_head(&ctx->cq_wait); + init_completion(&ctx->ctx_done); + mutex_init(&ctx->uring_lock); + init_waitqueue_head(&ctx->wait); + spin_lock_init(&ctx->completion_lock); + return ctx; +} + +static void io_commit_cqring(struct io_ring_ctx *ctx) +{ + struct io_cq_ring *ring = ctx->cq_ring; + + if (ctx->cached_cq_tail != READ_ONCE(ring->r.tail)) { + /* order cqe stores with ring update */ + smp_store_release(&ring->r.tail, ctx->cached_cq_tail); + + /* + * Write sider barrier of tail update, app has read side. See + * comment at the top of this file. + */ + smp_wmb(); + + if (wq_has_sleeper(&ctx->cq_wait)) { + wake_up_interruptible(&ctx->cq_wait); + kill_fasync(&ctx->cq_fasync, SIGIO, POLL_IN); + } + } +} + +static struct io_uring_cqe *io_get_cqring(struct io_ring_ctx *ctx) +{ + struct io_cq_ring *ring = ctx->cq_ring; + unsigned tail; + + tail = ctx->cached_cq_tail; + /* See comment at the top of the file */ + smp_rmb(); + if (tail + 1 == READ_ONCE(ring->r.head)) + return NULL; + + ctx->cached_cq_tail++; + return &ring->cqes[tail & ctx->cq_mask]; +} + +static void io_cqring_fill_event(struct io_ring_ctx *ctx, u64 ki_user_data, + long res, unsigned ev_flags) +{ + struct io_uring_cqe *cqe; + + /* + * If we can't get a cq entry, userspace overflowed the + * submission (by quite a lot). Increment the overflow count in + * the ring. + */ + cqe = io_get_cqring(ctx); + if (cqe) { + WRITE_ONCE(cqe->user_data, ki_user_data); + WRITE_ONCE(cqe->res, res); + WRITE_ONCE(cqe->flags, ev_flags); + } else { + unsigned overflow = READ_ONCE(ctx->cq_ring->overflow); + + WRITE_ONCE(ctx->cq_ring->overflow, overflow + 1); + } +} + +static void io_cqring_add_event(struct io_ring_ctx *ctx, u64 ki_user_data, + long res, unsigned ev_flags) +{ + unsigned long flags; + + spin_lock_irqsave(&ctx->completion_lock, flags); + io_cqring_fill_event(ctx, ki_user_data, res, ev_flags); + io_commit_cqring(ctx); + spin_unlock_irqrestore(&ctx->completion_lock, flags); + + if (waitqueue_active(&ctx->wait)) + wake_up(&ctx->wait); +} + +static void io_ring_drop_ctx_refs(struct io_ring_ctx *ctx, unsigned refs) +{ + percpu_ref_put_many(&ctx->refs, refs); + + if (waitqueue_active(&ctx->wait)) + wake_up(&ctx->wait); +} + +static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx) +{ + struct io_kiocb *req; + + if (!percpu_ref_tryget(&ctx->refs)) + return NULL; + + req = kmem_cache_alloc(req_cachep, __GFP_NOWARN); + if (req) { + req->ctx = ctx; + req->flags = 0; + return req; + } + + io_ring_drop_ctx_refs(ctx, 1); + return NULL; +} + +static void io_free_req(struct io_kiocb *req) +{ + io_ring_drop_ctx_refs(req->ctx, 1); + kmem_cache_free(req_cachep, req); +} + +static void kiocb_end_write(struct kiocb *kiocb) +{ + if (kiocb->ki_flags & IOCB_WRITE) { + struct inode *inode = file_inode(kiocb->ki_filp); + + /* + * Tell lockdep we inherited freeze protection from submission + * thread. + */ + if (S_ISREG(inode->i_mode)) + __sb_writers_acquired(inode->i_sb, SB_FREEZE_WRITE); + file_end_write(kiocb->ki_filp); + } +} + +static void io_complete_rw(struct kiocb *kiocb, long res, long res2) +{ + struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw); + + kiocb_end_write(kiocb); + + fput(kiocb->ki_filp); + io_cqring_add_event(req->ctx, req->user_data, res, 0); + io_free_req(req); +} + +/* + * If we tracked the file through the SCM inflight mechanism, we could support + * any file. For now, just ensure that anything potentially problematic is done + * inline. + */ +static bool io_file_supports_async(struct file *file) +{ + umode_t mode = file_inode(file)->i_mode; + + if (S_ISBLK(mode) || S_ISCHR(mode)) + return true; + if (S_ISREG(mode) && file->f_op != &io_uring_fops) + return true; + + return false; +} + +static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe, + bool force_nonblock) +{ + struct kiocb *kiocb = &req->rw; + unsigned ioprio; + int fd, ret; + + /* For -EAGAIN retry, everything is already prepped */ + if (kiocb->ki_filp) + return 0; + + fd = READ_ONCE(sqe->fd); + kiocb->ki_filp = fget(fd); + if (unlikely(!kiocb->ki_filp)) + return -EBADF; + if (force_nonblock && !io_file_supports_async(kiocb->ki_filp)) + force_nonblock = false; + kiocb->ki_pos = READ_ONCE(sqe->off); + kiocb->ki_flags = iocb_flags(kiocb->ki_filp); + kiocb->ki_hint = ki_hint_validate(file_write_hint(kiocb->ki_filp)); + + ioprio = READ_ONCE(sqe->ioprio); + if (ioprio) { + ret = ioprio_check_cap(ioprio); + if (ret) + goto out_fput; + + kiocb->ki_ioprio = ioprio; + } else + kiocb->ki_ioprio = get_current_ioprio(); + + ret = kiocb_set_rw_flags(kiocb, READ_ONCE(sqe->rw_flags)); + if (unlikely(ret)) + goto out_fput; + if (force_nonblock) { + kiocb->ki_flags |= IOCB_NOWAIT; + req->flags |= REQ_F_FORCE_NONBLOCK; + } + if (kiocb->ki_flags & IOCB_HIPRI) { + ret = -EINVAL; + goto out_fput; + } + + kiocb->ki_complete = io_complete_rw; + return 0; +out_fput: + fput(kiocb->ki_filp); + return ret; +} + +static inline void io_rw_done(struct kiocb *kiocb, ssize_t ret) +{ + switch (ret) { + case -EIOCBQUEUED: + break; + case -ERESTARTSYS: + case -ERESTARTNOINTR: + case -ERESTARTNOHAND: + case -ERESTART_RESTARTBLOCK: + /* + * We can't just restart the syscall, since previously + * submitted sqes may already be in progress. Just fail this + * IO with EINTR. + */ + ret = -EINTR; + /* fall through */ + default: + kiocb->ki_complete(kiocb, ret, 0); + } +} + +static int io_import_iovec(struct io_ring_ctx *ctx, int rw, + const struct sqe_submit *s, struct iovec **iovec, + struct iov_iter *iter) +{ + const struct io_uring_sqe *sqe = s->sqe; + void __user *buf = u64_to_user_ptr(READ_ONCE(sqe->addr)); + size_t sqe_len = READ_ONCE(sqe->len); + + if (!s->has_user) + return -EFAULT; + +#ifdef CONFIG_COMPAT + if (ctx->compat) + return compat_import_iovec(rw, buf, sqe_len, UIO_FASTIOV, + iovec, iter); +#endif + + return import_iovec(rw, buf, sqe_len, UIO_FASTIOV, iovec, iter); +} + +static ssize_t io_read(struct io_kiocb *req, const struct sqe_submit *s, + bool force_nonblock) +{ + struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; + struct kiocb *kiocb = &req->rw; + struct iov_iter iter; + struct file *file; + ssize_t ret; + + ret = io_prep_rw(req, s->sqe, force_nonblock); + if (ret) + return ret; + file = kiocb->ki_filp; + + ret = -EBADF; + if (unlikely(!(file->f_mode & FMODE_READ))) + goto out_fput; + ret = -EINVAL; + if (unlikely(!file->f_op->read_iter)) + goto out_fput; + + ret = io_import_iovec(req->ctx, READ, s, &iovec, &iter); + if (ret) + goto out_fput; + + ret = rw_verify_area(READ, file, &kiocb->ki_pos, iov_iter_count(&iter)); + if (!ret) { + ssize_t ret2; + + /* Catch -EAGAIN return for forced non-blocking submission */ + ret2 = call_read_iter(file, kiocb, &iter); + if (!force_nonblock || ret2 != -EAGAIN) + io_rw_done(kiocb, ret2); + else + ret = -EAGAIN; + } + kfree(iovec); +out_fput: + /* Hold on to the file for -EAGAIN */ + if (unlikely(ret && ret != -EAGAIN)) + fput(file); + return ret; +} + +static ssize_t io_write(struct io_kiocb *req, const struct sqe_submit *s, + bool force_nonblock) +{ + struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; + struct kiocb *kiocb = &req->rw; + struct iov_iter iter; + struct file *file; + ssize_t ret; + + ret = io_prep_rw(req, s->sqe, force_nonblock); + if (ret) + return ret; + /* Hold on to the file for -EAGAIN */ + if (force_nonblock && !(kiocb->ki_flags & IOCB_DIRECT)) + return -EAGAIN; + + ret = -EBADF; + file = kiocb->ki_filp; + if (unlikely(!(file->f_mode & FMODE_WRITE))) + goto out_fput; + ret = -EINVAL; + if (unlikely(!file->f_op->write_iter)) + goto out_fput; + + ret = io_import_iovec(req->ctx, WRITE, s, &iovec, &iter); + if (ret) + goto out_fput; + + ret = rw_verify_area(WRITE, file, &kiocb->ki_pos, + iov_iter_count(&iter)); + if (!ret) { + /* + * Open-code file_start_write here to grab freeze protection, + * which will be released by another thread in + * io_complete_rw(). Fool lockdep by telling it the lock got + * released so that it doesn't complain about the held lock when + * we return to userspace. + */ + if (S_ISREG(file_inode(file)->i_mode)) { + __sb_start_write(file_inode(file)->i_sb, + SB_FREEZE_WRITE, true); + __sb_writers_release(file_inode(file)->i_sb, + SB_FREEZE_WRITE); + } + kiocb->ki_flags |= IOCB_WRITE; + io_rw_done(kiocb, call_write_iter(file, kiocb, &iter)); + } + kfree(iovec); +out_fput: + if (unlikely(ret)) + fput(file); + return ret; +} + +/* + * IORING_OP_NOP just posts a completion event, nothing else. + */ +static int io_nop(struct io_kiocb *req, u64 user_data) +{ + struct io_ring_ctx *ctx = req->ctx; + long err = 0; + + /* + * Twilight zone - it's possible that someone issued an opcode that + * has a file attached, then got -EAGAIN on submission, and changed + * the sqe before we retried it from async context. Avoid dropping + * a file reference for this malicious case, and flag the error. + */ + if (req->rw.ki_filp) { + err = -EBADF; + fput(req->rw.ki_filp); + } + io_cqring_add_event(ctx, user_data, err, 0); + io_free_req(req); + return 0; +} + +static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, + const struct sqe_submit *s, bool force_nonblock) +{ + ssize_t ret; + int opcode; + + if (unlikely(s->index >= ctx->sq_entries)) + return -EINVAL; + req->user_data = READ_ONCE(s->sqe->user_data); + + opcode = READ_ONCE(s->sqe->opcode); + switch (opcode) { + case IORING_OP_NOP: + ret = io_nop(req, req->user_data); + break; + case IORING_OP_READV: + ret = io_read(req, s, force_nonblock); + break; + case IORING_OP_WRITEV: + ret = io_write(req, s, force_nonblock); + break; + default: + ret = -EINVAL; + break; + } + + return ret; +} + +static void io_sq_wq_submit_work(struct work_struct *work) +{ + struct io_kiocb *req = container_of(work, struct io_kiocb, work); + struct sqe_submit *s = &req->submit; + const struct io_uring_sqe *sqe = s->sqe; + struct io_ring_ctx *ctx = req->ctx; + mm_segment_t old_fs = get_fs(); + int ret; + + /* Ensure we clear previously set forced non-block flag */ + req->flags &= ~REQ_F_FORCE_NONBLOCK; + req->rw.ki_flags &= ~IOCB_NOWAIT; + + if (!mmget_not_zero(ctx->sqo_mm)) { + ret = -EFAULT; + goto err; + } + + use_mm(ctx->sqo_mm); + set_fs(USER_DS); + s->has_user = true; + + ret = __io_submit_sqe(ctx, req, s, false); + + set_fs(old_fs); + unuse_mm(ctx->sqo_mm); + mmput(ctx->sqo_mm); +err: + if (ret) { + io_cqring_add_event(ctx, sqe->user_data, ret, 0); + io_free_req(req); + } + + /* async context always use a copy of the sqe */ + kfree(sqe); +} + +static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s) +{ + struct io_kiocb *req; + ssize_t ret; + + /* enforce forwards compatibility on users */ + if (unlikely(s->sqe->flags)) + return -EINVAL; + + req = io_get_req(ctx); + if (unlikely(!req)) + return -EAGAIN; + + req->rw.ki_filp = NULL; + + ret = __io_submit_sqe(ctx, req, s, true); + if (ret == -EAGAIN) { + struct io_uring_sqe *sqe_copy; + + sqe_copy = kmalloc(sizeof(*sqe_copy), GFP_KERNEL); + if (sqe_copy) { + memcpy(sqe_copy, s->sqe, sizeof(*sqe_copy)); + s->sqe = sqe_copy; + + memcpy(&req->submit, s, sizeof(*s)); + INIT_WORK(&req->work, io_sq_wq_submit_work); + queue_work(ctx->sqo_wq, &req->work); + ret = 0; + } + } + if (ret) + io_free_req(req); + + return ret; +} + +static void io_commit_sqring(struct io_ring_ctx *ctx) +{ + struct io_sq_ring *ring = ctx->sq_ring; + + if (ctx->cached_sq_head != READ_ONCE(ring->r.head)) { + /* + * Ensure any loads from the SQEs are done at this point, + * since once we write the new head, the application could + * write new data to them. + */ + smp_store_release(&ring->r.head, ctx->cached_sq_head); + + /* + * write side barrier of head update, app has read side. See + * comment at the top of this file + */ + smp_wmb(); + } +} + +/* + * Undo last io_get_sqring() + */ +static void io_drop_sqring(struct io_ring_ctx *ctx) +{ + ctx->cached_sq_head--; +} + +/* + * Fetch an sqe, if one is available. Note that s->sqe will point to memory + * that is mapped by userspace. This means that care needs to be taken to + * ensure that reads are stable, as we cannot rely on userspace always + * being a good citizen. If members of the sqe are validated and then later + * used, it's important that those reads are done through READ_ONCE() to + * prevent a re-load down the line. + */ +static bool io_get_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s) +{ + struct io_sq_ring *ring = ctx->sq_ring; + unsigned head; + + /* + * The cached sq head (or cq tail) serves two purposes: + * + * 1) allows us to batch the cost of updating the user visible + * head updates. + * 2) allows the kernel side to track the head on its own, even + * though the application is the one updating it. + */ + head = ctx->cached_sq_head; + /* See comment at the top of this file */ + smp_rmb(); + if (head == READ_ONCE(ring->r.tail)) + return false; + + head = READ_ONCE(ring->array[head & ctx->sq_mask]); + if (head < ctx->sq_entries) { + s->index = head; + s->sqe = &ctx->sq_sqes[head]; + ctx->cached_sq_head++; + return true; + } + + /* drop invalid entries */ + ctx->cached_sq_head++; + ring->dropped++; + /* See comment at the top of this file */ + smp_wmb(); + return false; +} + +static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) +{ + int i, ret = 0, submit = 0; + struct blk_plug plug; + + if (to_submit > IO_PLUG_THRESHOLD) + blk_start_plug(&plug); + + for (i = 0; i < to_submit; i++) { + struct sqe_submit s; + + if (!io_get_sqring(ctx, &s)) + break; + + s.has_user = true; + ret = io_submit_sqe(ctx, &s); + if (ret) { + io_drop_sqring(ctx); + break; + } + + submit++; + } + io_commit_sqring(ctx); + + if (to_submit > IO_PLUG_THRESHOLD) + blk_finish_plug(&plug); + + return submit ? submit : ret; +} + +static unsigned io_cqring_events(struct io_cq_ring *ring) +{ + return READ_ONCE(ring->r.tail) - READ_ONCE(ring->r.head); +} + +/* + * Wait until events become available, if we don't already have some. The + * application must reap them itself, as they reside on the shared cq ring. + */ +static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, + const sigset_t __user *sig, size_t sigsz) +{ + struct io_cq_ring *ring = ctx->cq_ring; + sigset_t ksigmask, sigsaved; + DEFINE_WAIT(wait); + int ret; + + /* See comment at the top of this file */ + smp_rmb(); + if (io_cqring_events(ring) >= min_events) + return 0; + + if (sig) { + ret = set_user_sigmask(sig, &ksigmask, &sigsaved, sigsz); + if (ret) + return ret; + } + + do { + prepare_to_wait(&ctx->wait, &wait, TASK_INTERRUPTIBLE); + + ret = 0; + /* See comment at the top of this file */ + smp_rmb(); + if (io_cqring_events(ring) >= min_events) + break; + + schedule(); + + ret = -EINTR; + if (signal_pending(current)) + break; + } while (1); + + finish_wait(&ctx->wait, &wait); + + if (sig) + restore_user_sigmask(sig, &sigsaved); + + return READ_ONCE(ring->r.head) == READ_ONCE(ring->r.tail) ? ret : 0; +} + +static int io_sq_offload_start(struct io_ring_ctx *ctx) +{ + int ret; + + mmgrab(current->mm); + ctx->sqo_mm = current->mm; + + /* Do QD, or 2 * CPUS, whatever is smallest */ + ctx->sqo_wq = alloc_workqueue("io_ring-wq", WQ_UNBOUND | WQ_FREEZABLE, + min(ctx->sq_entries - 1, 2 * num_online_cpus())); + if (!ctx->sqo_wq) { + ret = -ENOMEM; + goto err; + } + + return 0; +err: + mmdrop(ctx->sqo_mm); + ctx->sqo_mm = NULL; + return ret; +} + +static void io_unaccount_mem(struct user_struct *user, unsigned long nr_pages) +{ + atomic_long_sub(nr_pages, &user->locked_vm); +} + +static int io_account_mem(struct user_struct *user, unsigned long nr_pages) +{ + unsigned long page_limit, cur_pages, new_pages; + + /* Don't allow more pages than we can safely lock */ + page_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; + + do { + cur_pages = atomic_long_read(&user->locked_vm); + new_pages = cur_pages + nr_pages; + if (new_pages > page_limit) + return -ENOMEM; + } while (atomic_long_cmpxchg(&user->locked_vm, cur_pages, + new_pages) != cur_pages); + + return 0; +} + +static void io_mem_free(void *ptr) +{ + struct page *page = virt_to_head_page(ptr); + + if (put_page_testzero(page)) + free_compound_page(page); +} + +static void *io_mem_alloc(size_t size) +{ + gfp_t gfp_flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_COMP | + __GFP_NORETRY; + + return (void *) __get_free_pages(gfp_flags, get_order(size)); +} + +static unsigned long ring_pages(unsigned sq_entries, unsigned cq_entries) +{ + struct io_sq_ring *sq_ring; + struct io_cq_ring *cq_ring; + size_t bytes; + + bytes = struct_size(sq_ring, array, sq_entries); + bytes += array_size(sizeof(struct io_uring_sqe), sq_entries); + bytes += struct_size(cq_ring, cqes, cq_entries); + + return (bytes + PAGE_SIZE - 1) / PAGE_SIZE; +} + +static void io_ring_ctx_free(struct io_ring_ctx *ctx) +{ + if (ctx->sqo_wq) + destroy_workqueue(ctx->sqo_wq); + if (ctx->sqo_mm) + mmdrop(ctx->sqo_mm); +#if defined(CONFIG_UNIX) + if (ctx->ring_sock) + sock_release(ctx->ring_sock); +#endif + + io_mem_free(ctx->sq_ring); + io_mem_free(ctx->sq_sqes); + io_mem_free(ctx->cq_ring); + + percpu_ref_exit(&ctx->refs); + if (ctx->account_mem) + io_unaccount_mem(ctx->user, + ring_pages(ctx->sq_entries, ctx->cq_entries)); + free_uid(ctx->user); + kfree(ctx); +} + +static __poll_t io_uring_poll(struct file *file, poll_table *wait) +{ + struct io_ring_ctx *ctx = file->private_data; + __poll_t mask = 0; + + poll_wait(file, &ctx->cq_wait, wait); + /* See comment at the top of this file */ + smp_rmb(); + if (READ_ONCE(ctx->sq_ring->r.tail) + 1 != ctx->cached_sq_head) + mask |= EPOLLOUT | EPOLLWRNORM; + if (READ_ONCE(ctx->cq_ring->r.head) != ctx->cached_cq_tail) + mask |= EPOLLIN | EPOLLRDNORM; + + return mask; +} + +static int io_uring_fasync(int fd, struct file *file, int on) +{ + struct io_ring_ctx *ctx = file->private_data; + + return fasync_helper(fd, file, on, &ctx->cq_fasync); +} + +static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx) +{ + mutex_lock(&ctx->uring_lock); + percpu_ref_kill(&ctx->refs); + mutex_unlock(&ctx->uring_lock); + + wait_for_completion(&ctx->ctx_done); + io_ring_ctx_free(ctx); +} + +static int io_uring_release(struct inode *inode, struct file *file) +{ + struct io_ring_ctx *ctx = file->private_data; + + file->private_data = NULL; + io_ring_ctx_wait_and_kill(ctx); + return 0; +} + +static int io_uring_mmap(struct file *file, struct vm_area_struct *vma) +{ + loff_t offset = (loff_t) vma->vm_pgoff << PAGE_SHIFT; + unsigned long sz = vma->vm_end - vma->vm_start; + struct io_ring_ctx *ctx = file->private_data; + unsigned long pfn; + struct page *page; + void *ptr; + + switch (offset) { + case IORING_OFF_SQ_RING: + ptr = ctx->sq_ring; + break; + case IORING_OFF_SQES: + ptr = ctx->sq_sqes; + break; + case IORING_OFF_CQ_RING: + ptr = ctx->cq_ring; + break; + default: + return -EINVAL; + } + + page = virt_to_head_page(ptr); + if (sz > (PAGE_SIZE << compound_order(page))) + return -EINVAL; + + pfn = virt_to_phys(ptr) >> PAGE_SHIFT; + return remap_pfn_range(vma, vma->vm_start, pfn, sz, vma->vm_page_prot); +} + +SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit, + u32, min_complete, u32, flags, const sigset_t __user *, sig, + size_t, sigsz) +{ + struct io_ring_ctx *ctx; + long ret = -EBADF; + int submitted = 0; + struct fd f; + + if (flags & ~IORING_ENTER_GETEVENTS) + return -EINVAL; + + f = fdget(fd); + if (!f.file) + return -EBADF; + + ret = -EOPNOTSUPP; + if (f.file->f_op != &io_uring_fops) + goto out_fput; + + ret = -ENXIO; + ctx = f.file->private_data; + if (!percpu_ref_tryget(&ctx->refs)) + goto out_fput; + + ret = 0; + if (to_submit) { + to_submit = min(to_submit, ctx->sq_entries); + + mutex_lock(&ctx->uring_lock); + submitted = io_ring_submit(ctx, to_submit); + mutex_unlock(&ctx->uring_lock); + + if (submitted < 0) + goto out_ctx; + } + if (flags & IORING_ENTER_GETEVENTS) { + min_complete = min(min_complete, ctx->cq_entries); + + /* + * The application could have included the 'to_submit' count + * in how many events it wanted to wait for. If we failed to + * submit the desired count, we may need to adjust the number + * of events to poll/wait for. + */ + if (submitted < to_submit) + min_complete = min_t(unsigned, submitted, min_complete); + + ret = io_cqring_wait(ctx, min_complete, sig, sigsz); + } + +out_ctx: + io_ring_drop_ctx_refs(ctx, 1); +out_fput: + fdput(f); + return submitted ? submitted : ret; +} + +static const struct file_operations io_uring_fops = { + .release = io_uring_release, + .mmap = io_uring_mmap, + .poll = io_uring_poll, + .fasync = io_uring_fasync, +}; + +static int io_allocate_scq_urings(struct io_ring_ctx *ctx, + struct io_uring_params *p) +{ + struct io_sq_ring *sq_ring; + struct io_cq_ring *cq_ring; + size_t size; + + sq_ring = io_mem_alloc(struct_size(sq_ring, array, p->sq_entries)); + if (!sq_ring) + return -ENOMEM; + + ctx->sq_ring = sq_ring; + sq_ring->ring_mask = p->sq_entries - 1; + sq_ring->ring_entries = p->sq_entries; + ctx->sq_mask = sq_ring->ring_mask; + ctx->sq_entries = sq_ring->ring_entries; + + size = array_size(sizeof(struct io_uring_sqe), p->sq_entries); + if (size == SIZE_MAX) + return -EOVERFLOW; + + ctx->sq_sqes = io_mem_alloc(size); + if (!ctx->sq_sqes) { + io_mem_free(ctx->sq_ring); + return -ENOMEM; + } + + cq_ring = io_mem_alloc(struct_size(cq_ring, cqes, p->cq_entries)); + if (!cq_ring) { + io_mem_free(ctx->sq_ring); + io_mem_free(ctx->sq_sqes); + return -ENOMEM; + } + + ctx->cq_ring = cq_ring; + cq_ring->ring_mask = p->cq_entries - 1; + cq_ring->ring_entries = p->cq_entries; + ctx->cq_mask = cq_ring->ring_mask; + ctx->cq_entries = cq_ring->ring_entries; + return 0; +} + +/* + * Allocate an anonymous fd, this is what constitutes the application + * visible backing of an io_uring instance. The application mmaps this + * fd to gain access to the SQ/CQ ring details. If UNIX sockets are enabled, + * we have to tie this fd to a socket for file garbage collection purposes. + */ +static int io_uring_get_fd(struct io_ring_ctx *ctx) +{ + struct file *file; + int ret; + +#if defined(CONFIG_UNIX) + ret = sock_create_kern(&init_net, PF_UNIX, SOCK_RAW, IPPROTO_IP, + &ctx->ring_sock); + if (ret) + return ret; +#endif + + ret = get_unused_fd_flags(O_RDWR | O_CLOEXEC); + if (ret < 0) + goto err; + + file = anon_inode_getfile("[io_uring]", &io_uring_fops, ctx, + O_RDWR | O_CLOEXEC); + if (IS_ERR(file)) { + put_unused_fd(ret); + ret = PTR_ERR(file); + goto err; + } + +#if defined(CONFIG_UNIX) + ctx->ring_sock->file = file; +#endif + fd_install(ret, file); + return ret; +err: +#if defined(CONFIG_UNIX) + sock_release(ctx->ring_sock); + ctx->ring_sock = NULL; +#endif + return ret; +} + +static int io_uring_create(unsigned entries, struct io_uring_params *p) +{ + struct user_struct *user = NULL; + struct io_ring_ctx *ctx; + bool account_mem; + int ret; + + if (!entries || entries > IORING_MAX_ENTRIES) + return -EINVAL; + + /* + * Use twice as many entries for the CQ ring. It's possible for the + * application to drive a higher depth than the size of the SQ ring, + * since the sqes are only used at submission time. This allows for + * some flexibility in overcommitting a bit. + */ + p->sq_entries = roundup_pow_of_two(entries); + p->cq_entries = 2 * p->sq_entries; + + user = get_uid(current_user()); + account_mem = !capable(CAP_IPC_LOCK); + + if (account_mem) { + ret = io_account_mem(user, + ring_pages(p->sq_entries, p->cq_entries)); + if (ret) { + free_uid(user); + return ret; + } + } + + ctx = io_ring_ctx_alloc(p); + if (!ctx) { + if (account_mem) + io_unaccount_mem(user, ring_pages(p->sq_entries, + p->cq_entries)); + free_uid(user); + return -ENOMEM; + } + ctx->compat = in_compat_syscall(); + ctx->account_mem = account_mem; + ctx->user = user; + + ret = io_allocate_scq_urings(ctx, p); + if (ret) + goto err; + + ret = io_sq_offload_start(ctx); + if (ret) + goto err; + + ret = io_uring_get_fd(ctx); + if (ret < 0) + goto err; + + memset(&p->sq_off, 0, sizeof(p->sq_off)); + p->sq_off.head = offsetof(struct io_sq_ring, r.head); + p->sq_off.tail = offsetof(struct io_sq_ring, r.tail); + p->sq_off.ring_mask = offsetof(struct io_sq_ring, ring_mask); + p->sq_off.ring_entries = offsetof(struct io_sq_ring, ring_entries); + p->sq_off.flags = offsetof(struct io_sq_ring, flags); + p->sq_off.dropped = offsetof(struct io_sq_ring, dropped); + p->sq_off.array = offsetof(struct io_sq_ring, array); + + memset(&p->cq_off, 0, sizeof(p->cq_off)); + p->cq_off.head = offsetof(struct io_cq_ring, r.head); + p->cq_off.tail = offsetof(struct io_cq_ring, r.tail); + p->cq_off.ring_mask = offsetof(struct io_cq_ring, ring_mask); + p->cq_off.ring_entries = offsetof(struct io_cq_ring, ring_entries); + p->cq_off.overflow = offsetof(struct io_cq_ring, overflow); + p->cq_off.cqes = offsetof(struct io_cq_ring, cqes); + return ret; +err: + io_ring_ctx_wait_and_kill(ctx); + return ret; +} + +/* + * Sets up an aio uring context, and returns the fd. Applications asks for a + * ring size, we return the actual sq/cq ring sizes (among other things) in the + * params structure passed in. + */ +static long io_uring_setup(u32 entries, struct io_uring_params __user *params) +{ + struct io_uring_params p; + long ret; + int i; + + if (copy_from_user(&p, params, sizeof(p))) + return -EFAULT; + for (i = 0; i < ARRAY_SIZE(p.resv); i++) { + if (p.resv[i]) + return -EINVAL; + } + + if (p.flags) + return -EINVAL; + + ret = io_uring_create(entries, &p); + if (ret < 0) + return ret; + + if (copy_to_user(params, &p, sizeof(p))) + return -EFAULT; + + return ret; +} + +SYSCALL_DEFINE2(io_uring_setup, u32, entries, + struct io_uring_params __user *, params) +{ + return io_uring_setup(entries, params); +} + +static int __init io_uring_init(void) +{ + req_cachep = KMEM_CACHE(io_kiocb, SLAB_HWCACHE_ALIGN | SLAB_PANIC); + return 0; +}; +__initcall(io_uring_init); diff --git a/include/linux/fs.h b/include/linux/fs.h index dedcc2e9265c..61aa210f0c2b 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -3517,4 +3517,13 @@ extern void inode_nohighmem(struct inode *inode); extern int vfs_fadvise(struct file *file, loff_t offset, loff_t len, int advice); +#if defined(CONFIG_IO_URING) +extern struct sock *io_uring_get_socket(struct file *file); +#else +static inline struct sock *io_uring_get_socket(struct file *file) +{ + return NULL; +} +#endif + #endif /* _LINUX_FS_H */ diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h index 39ad98c09c58..c7b5f86b91a1 100644 --- a/include/linux/sched/user.h +++ b/include/linux/sched/user.h @@ -40,7 +40,7 @@ struct user_struct { kuid_t uid; #if defined(CONFIG_PERF_EVENTS) || defined(CONFIG_BPF_SYSCALL) || \ - defined(CONFIG_NET) + defined(CONFIG_NET) || defined(CONFIG_IO_URING) atomic_long_t locked_vm; #endif diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 257cccba3062..3072dbaa7869 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -69,6 +69,7 @@ struct file_handle; struct sigaltstack; struct rseq; union bpf_attr; +struct io_uring_params; #include #include @@ -309,6 +310,11 @@ asmlinkage long sys_io_pgetevents_time32(aio_context_t ctx_id, struct io_event __user *events, struct old_timespec32 __user *timeout, const struct __aio_sigset *sig); +asmlinkage long sys_io_uring_setup(u32 entries, + struct io_uring_params __user *p); +asmlinkage long sys_io_uring_enter(unsigned int fd, u32 to_submit, + u32 min_complete, u32 flags, + const sigset_t __user *sig, size_t sigsz); /* fs/xattr.c */ asmlinkage long sys_setxattr(const char __user *path, const char __user *name, diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h index d90127298f12..87871e7b7ea7 100644 --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@ -740,9 +740,13 @@ __SC_COMP(__NR_io_pgetevents, sys_io_pgetevents, compat_sys_io_pgetevents) __SYSCALL(__NR_rseq, sys_rseq) #define __NR_kexec_file_load 294 __SYSCALL(__NR_kexec_file_load, sys_kexec_file_load) +#define __NR_io_uring_setup 425 +__SYSCALL(__NR_io_uring_setup, sys_io_uring_setup) +#define __NR_io_uring_enter 426 +__SYSCALL(__NR_io_uring_enter, sys_io_uring_enter) #undef __NR_syscalls -#define __NR_syscalls 295 +#define __NR_syscalls 427 /* * 32 bit systems traditionally used different diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h new file mode 100644 index 000000000000..ac692823d6f4 --- /dev/null +++ b/include/uapi/linux/io_uring.h @@ -0,0 +1,95 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +/* + * Header file for the io_uring interface. + * + * Copyright (C) 2019 Jens Axboe + * Copyright (C) 2019 Christoph Hellwig + */ +#ifndef LINUX_IO_URING_H +#define LINUX_IO_URING_H + +#include +#include + +/* + * IO submission data structure (Submission Queue Entry) + */ +struct io_uring_sqe { + __u8 opcode; /* type of operation for this sqe */ + __u8 flags; /* as of now unused */ + __u16 ioprio; /* ioprio for the request */ + __s32 fd; /* file descriptor to do IO on */ + __u64 off; /* offset into file */ + __u64 addr; /* pointer to buffer or iovecs */ + __u32 len; /* buffer size or number of iovecs */ + union { + __kernel_rwf_t rw_flags; + __u32 __resv; + }; + __u64 user_data; /* data to be passed back at completion time */ + __u64 __pad2[3]; +}; + +#define IORING_OP_NOP 0 +#define IORING_OP_READV 1 +#define IORING_OP_WRITEV 2 + +/* + * IO completion data structure (Completion Queue Entry) + */ +struct io_uring_cqe { + __u64 user_data; /* sqe->data submission passed back */ + __s32 res; /* result code for this event */ + __u32 flags; +}; + +/* + * Magic offsets for the application to mmap the data it needs + */ +#define IORING_OFF_SQ_RING 0ULL +#define IORING_OFF_CQ_RING 0x8000000ULL +#define IORING_OFF_SQES 0x10000000ULL + +/* + * Filled with the offset for mmap(2) + */ +struct io_sqring_offsets { + __u32 head; + __u32 tail; + __u32 ring_mask; + __u32 ring_entries; + __u32 flags; + __u32 dropped; + __u32 array; + __u32 resv1; + __u64 resv2; +}; + +struct io_cqring_offsets { + __u32 head; + __u32 tail; + __u32 ring_mask; + __u32 ring_entries; + __u32 overflow; + __u32 cqes; + __u64 resv[2]; +}; + +/* + * io_uring_enter(2) flags + */ +#define IORING_ENTER_GETEVENTS (1U << 0) + +/* + * Passed in for io_uring_setup(2). Copied back with updated info on success + */ +struct io_uring_params { + __u32 sq_entries; + __u32 cq_entries; + __u32 flags; + __u32 resv[7]; + struct io_sqring_offsets sq_off; + struct io_cqring_offsets cq_off; +}; + +#endif diff --git a/init/Kconfig b/init/Kconfig index c9386a365eea..53b54214a36e 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -1414,6 +1414,15 @@ config AIO by some high performance threaded applications. Disabling this option saves about 7k. +config IO_URING + bool "Enable IO uring support" if EXPERT + select ANON_INODES + default y + help + This option enables support for the io_uring interface, enabling + applications to submit and complete IO through submission and + completion rings that are shared between the kernel and application. + config ADVISE_SYSCALLS bool "Enable madvise/fadvise syscalls" if EXPERT default y diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index ab9d0e3c6d50..ee5e523564bb 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -46,6 +46,8 @@ COND_SYSCALL(io_getevents); COND_SYSCALL(io_pgetevents); COND_SYSCALL_COMPAT(io_getevents); COND_SYSCALL_COMPAT(io_pgetevents); +COND_SYSCALL(io_uring_setup); +COND_SYSCALL(io_uring_enter); /* fs/xattr.c */ diff --git a/net/unix/garbage.c b/net/unix/garbage.c index c36757e72844..f81854d74c7d 100644 --- a/net/unix/garbage.c +++ b/net/unix/garbage.c @@ -108,6 +108,9 @@ struct sock *unix_get_socket(struct file *filp) /* PF_UNIX ? */ if (s && sock->ops && sock->ops->family == PF_UNIX) u_sock = s; + } else { + /* Could be an io_uring instance */ + u_sock = io_uring_get_socket(filp); } return u_sock; } -- cgit v1.3-7-g2ca7 From edafccee56ff31678a091ddb7219aba9b28bc3cb Mon Sep 17 00:00:00 2001 From: Jens Axboe Date: Wed, 9 Jan 2019 09:16:05 -0700 Subject: io_uring: add support for pre-mapped user IO buffers If we have fixed user buffers, we can map them into the kernel when we setup the io_uring. That avoids the need to do get_user_pages() for each and every IO. To utilize this feature, the application must call io_uring_register() after having setup an io_uring instance, passing in IORING_REGISTER_BUFFERS as the opcode. The argument must be a pointer to an iovec array, and the nr_args should contain how many iovecs the application wishes to map. If successful, these buffers are now mapped into the kernel, eligible for IO. To use these fixed buffers, the application must use the IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED opcodes, and then set sqe->index to the desired buffer index. sqe->addr..sqe->addr+seq->len must point to somewhere inside the indexed buffer. The application may register buffers throughout the lifetime of the io_uring instance. It can call io_uring_register() with IORING_UNREGISTER_BUFFERS as the opcode to unregister the current set of buffers, and then register a new set. The application need not unregister buffers explicitly before shutting down the io_uring instance. It's perfectly valid to setup a larger buffer, and then sometimes only use parts of it for an IO. As long as the range is within the originally mapped region, it will work just fine. For now, buffers must not be file backed. If file backed buffers are passed in, the registration will fail with -1/EOPNOTSUPP. This restriction may be relaxed in the future. RLIMIT_MEMLOCK is used to check how much memory we can pin. A somewhat arbitrary 1G per buffer size is also imposed. Reviewed-by: Hannes Reinecke Signed-off-by: Jens Axboe --- arch/x86/entry/syscalls/syscall_32.tbl | 1 + arch/x86/entry/syscalls/syscall_64.tbl | 1 + fs/io_uring.c | 374 +++++++++++++++++++++++++++++++-- include/linux/syscalls.h | 2 + include/uapi/asm-generic/unistd.h | 4 +- include/uapi/linux/io_uring.h | 13 +- kernel/sys_ni.c | 1 + 7 files changed, 381 insertions(+), 15 deletions(-) (limited to 'kernel') diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl index 481c126259e9..2eefd2a7c1ce 100644 --- a/arch/x86/entry/syscalls/syscall_32.tbl +++ b/arch/x86/entry/syscalls/syscall_32.tbl @@ -400,3 +400,4 @@ 386 i386 rseq sys_rseq __ia32_sys_rseq 425 i386 io_uring_setup sys_io_uring_setup __ia32_sys_io_uring_setup 426 i386 io_uring_enter sys_io_uring_enter __ia32_sys_io_uring_enter +427 i386 io_uring_register sys_io_uring_register __ia32_sys_io_uring_register diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index 6a32a430c8e0..65c026185e61 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -345,6 +345,7 @@ 334 common rseq __x64_sys_rseq 425 common io_uring_setup __x64_sys_io_uring_setup 426 common io_uring_enter __x64_sys_io_uring_enter +427 common io_uring_register __x64_sys_io_uring_register # # x32-specific system call numbers start at 512 to avoid cache impact diff --git a/fs/io_uring.c b/fs/io_uring.c index 31f43ed894ba..c0c0f68568b5 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -45,6 +45,7 @@ #include #include #include +#include #include #include #include @@ -52,6 +53,8 @@ #include #include #include +#include +#include #include @@ -81,6 +84,13 @@ struct io_cq_ring { struct io_uring_cqe cqes[]; }; +struct io_mapped_ubuf { + u64 ubuf; + size_t len; + struct bio_vec *bvec; + unsigned int nr_bvecs; +}; + struct io_ring_ctx { struct { struct percpu_ref refs; @@ -113,6 +123,10 @@ struct io_ring_ctx { struct fasync_struct *cq_fasync; } ____cacheline_aligned_in_smp; + /* if used, fixed mapped user buffers */ + unsigned nr_user_bufs; + struct io_mapped_ubuf *user_bufs; + struct user_struct *user; struct completion ctx_done; @@ -732,6 +746,46 @@ static inline void io_rw_done(struct kiocb *kiocb, ssize_t ret) } } +static int io_import_fixed(struct io_ring_ctx *ctx, int rw, + const struct io_uring_sqe *sqe, + struct iov_iter *iter) +{ + size_t len = READ_ONCE(sqe->len); + struct io_mapped_ubuf *imu; + unsigned index, buf_index; + size_t offset; + u64 buf_addr; + + /* attempt to use fixed buffers without having provided iovecs */ + if (unlikely(!ctx->user_bufs)) + return -EFAULT; + + buf_index = READ_ONCE(sqe->buf_index); + if (unlikely(buf_index >= ctx->nr_user_bufs)) + return -EFAULT; + + index = array_index_nospec(buf_index, ctx->nr_user_bufs); + imu = &ctx->user_bufs[index]; + buf_addr = READ_ONCE(sqe->addr); + + /* overflow */ + if (buf_addr + len < buf_addr) + return -EFAULT; + /* not inside the mapped region */ + if (buf_addr < imu->ubuf || buf_addr + len > imu->ubuf + imu->len) + return -EFAULT; + + /* + * May not be a start of buffer, set size appropriately + * and advance us to the beginning. + */ + offset = buf_addr - imu->ubuf; + iov_iter_bvec(iter, rw, imu->bvec, imu->nr_bvecs, offset + len); + if (offset) + iov_iter_advance(iter, offset); + return 0; +} + static int io_import_iovec(struct io_ring_ctx *ctx, int rw, const struct sqe_submit *s, struct iovec **iovec, struct iov_iter *iter) @@ -739,6 +793,23 @@ static int io_import_iovec(struct io_ring_ctx *ctx, int rw, const struct io_uring_sqe *sqe = s->sqe; void __user *buf = u64_to_user_ptr(READ_ONCE(sqe->addr)); size_t sqe_len = READ_ONCE(sqe->len); + u8 opcode; + + /* + * We're reading ->opcode for the second time, but the first read + * doesn't care whether it's _FIXED or not, so it doesn't matter + * whether ->opcode changes concurrently. The first read does care + * about whether it is a READ or a WRITE, so we don't trust this read + * for that purpose and instead let the caller pass in the read/write + * flag. + */ + opcode = READ_ONCE(sqe->opcode); + if (opcode == IORING_OP_READ_FIXED || + opcode == IORING_OP_WRITE_FIXED) { + ssize_t ret = io_import_fixed(ctx, rw, sqe, iter); + *iovec = NULL; + return ret; + } if (!s->has_user) return -EFAULT; @@ -886,7 +957,7 @@ static int io_prep_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe) if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) return -EINVAL; - if (unlikely(sqe->addr || sqe->ioprio)) + if (unlikely(sqe->addr || sqe->ioprio || sqe->buf_index)) return -EINVAL; fd = READ_ONCE(sqe->fd); @@ -945,9 +1016,19 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, ret = io_nop(req, req->user_data); break; case IORING_OP_READV: + if (unlikely(s->sqe->buf_index)) + return -EINVAL; ret = io_read(req, s, force_nonblock, state); break; case IORING_OP_WRITEV: + if (unlikely(s->sqe->buf_index)) + return -EINVAL; + ret = io_write(req, s, force_nonblock, state); + break; + case IORING_OP_READ_FIXED: + ret = io_read(req, s, force_nonblock, state); + break; + case IORING_OP_WRITE_FIXED: ret = io_write(req, s, force_nonblock, state); break; case IORING_OP_FSYNC: @@ -976,28 +1057,46 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, return 0; } +static inline bool io_sqe_needs_user(const struct io_uring_sqe *sqe) +{ + u8 opcode = READ_ONCE(sqe->opcode); + + return !(opcode == IORING_OP_READ_FIXED || + opcode == IORING_OP_WRITE_FIXED); +} + static void io_sq_wq_submit_work(struct work_struct *work) { struct io_kiocb *req = container_of(work, struct io_kiocb, work); struct sqe_submit *s = &req->submit; const struct io_uring_sqe *sqe = s->sqe; struct io_ring_ctx *ctx = req->ctx; - mm_segment_t old_fs = get_fs(); + mm_segment_t old_fs; + bool needs_user; int ret; /* Ensure we clear previously set forced non-block flag */ req->flags &= ~REQ_F_FORCE_NONBLOCK; req->rw.ki_flags &= ~IOCB_NOWAIT; - if (!mmget_not_zero(ctx->sqo_mm)) { - ret = -EFAULT; - goto err; - } - - use_mm(ctx->sqo_mm); - set_fs(USER_DS); - s->has_user = true; s->needs_lock = true; + s->has_user = false; + + /* + * If we're doing IO to fixed buffers, we don't need to get/set + * user context + */ + needs_user = io_sqe_needs_user(s->sqe); + if (needs_user) { + if (!mmget_not_zero(ctx->sqo_mm)) { + ret = -EFAULT; + goto err; + } + use_mm(ctx->sqo_mm); + old_fs = get_fs(); + set_fs(USER_DS); + s->has_user = true; + } do { ret = __io_submit_sqe(ctx, req, s, false, NULL); @@ -1011,9 +1110,11 @@ static void io_sq_wq_submit_work(struct work_struct *work) cond_resched(); } while (1); - set_fs(old_fs); - unuse_mm(ctx->sqo_mm); - mmput(ctx->sqo_mm); + if (needs_user) { + set_fs(old_fs); + unuse_mm(ctx->sqo_mm); + mmput(ctx->sqo_mm); + } err: if (ret) { io_cqring_add_event(ctx, sqe->user_data, ret, 0); @@ -1317,6 +1418,198 @@ static unsigned long ring_pages(unsigned sq_entries, unsigned cq_entries) return (bytes + PAGE_SIZE - 1) / PAGE_SIZE; } +static int io_sqe_buffer_unregister(struct io_ring_ctx *ctx) +{ + int i, j; + + if (!ctx->user_bufs) + return -ENXIO; + + for (i = 0; i < ctx->nr_user_bufs; i++) { + struct io_mapped_ubuf *imu = &ctx->user_bufs[i]; + + for (j = 0; j < imu->nr_bvecs; j++) + put_page(imu->bvec[j].bv_page); + + if (ctx->account_mem) + io_unaccount_mem(ctx->user, imu->nr_bvecs); + kfree(imu->bvec); + imu->nr_bvecs = 0; + } + + kfree(ctx->user_bufs); + ctx->user_bufs = NULL; + ctx->nr_user_bufs = 0; + return 0; +} + +static int io_copy_iov(struct io_ring_ctx *ctx, struct iovec *dst, + void __user *arg, unsigned index) +{ + struct iovec __user *src; + +#ifdef CONFIG_COMPAT + if (ctx->compat) { + struct compat_iovec __user *ciovs; + struct compat_iovec ciov; + + ciovs = (struct compat_iovec __user *) arg; + if (copy_from_user(&ciov, &ciovs[index], sizeof(ciov))) + return -EFAULT; + + dst->iov_base = (void __user *) (unsigned long) ciov.iov_base; + dst->iov_len = ciov.iov_len; + return 0; + } +#endif + src = (struct iovec __user *) arg; + if (copy_from_user(dst, &src[index], sizeof(*dst))) + return -EFAULT; + return 0; +} + +static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg, + unsigned nr_args) +{ + struct vm_area_struct **vmas = NULL; + struct page **pages = NULL; + int i, j, got_pages = 0; + int ret = -EINVAL; + + if (ctx->user_bufs) + return -EBUSY; + if (!nr_args || nr_args > UIO_MAXIOV) + return -EINVAL; + + ctx->user_bufs = kcalloc(nr_args, sizeof(struct io_mapped_ubuf), + GFP_KERNEL); + if (!ctx->user_bufs) + return -ENOMEM; + + for (i = 0; i < nr_args; i++) { + struct io_mapped_ubuf *imu = &ctx->user_bufs[i]; + unsigned long off, start, end, ubuf; + int pret, nr_pages; + struct iovec iov; + size_t size; + + ret = io_copy_iov(ctx, &iov, arg, i); + if (ret) + break; + + /* + * Don't impose further limits on the size and buffer + * constraints here, we'll -EINVAL later when IO is + * submitted if they are wrong. + */ + ret = -EFAULT; + if (!iov.iov_base || !iov.iov_len) + goto err; + + /* arbitrary limit, but we need something */ + if (iov.iov_len > SZ_1G) + goto err; + + ubuf = (unsigned long) iov.iov_base; + end = (ubuf + iov.iov_len + PAGE_SIZE - 1) >> PAGE_SHIFT; + start = ubuf >> PAGE_SHIFT; + nr_pages = end - start; + + if (ctx->account_mem) { + ret = io_account_mem(ctx->user, nr_pages); + if (ret) + goto err; + } + + ret = 0; + if (!pages || nr_pages > got_pages) { + kfree(vmas); + kfree(pages); + pages = kmalloc_array(nr_pages, sizeof(struct page *), + GFP_KERNEL); + vmas = kmalloc_array(nr_pages, + sizeof(struct vm_area_struct *), + GFP_KERNEL); + if (!pages || !vmas) { + ret = -ENOMEM; + if (ctx->account_mem) + io_unaccount_mem(ctx->user, nr_pages); + goto err; + } + got_pages = nr_pages; + } + + imu->bvec = kmalloc_array(nr_pages, sizeof(struct bio_vec), + GFP_KERNEL); + ret = -ENOMEM; + if (!imu->bvec) { + if (ctx->account_mem) + io_unaccount_mem(ctx->user, nr_pages); + goto err; + } + + ret = 0; + down_read(¤t->mm->mmap_sem); + pret = get_user_pages_longterm(ubuf, nr_pages, FOLL_WRITE, + pages, vmas); + if (pret == nr_pages) { + /* don't support file backed memory */ + for (j = 0; j < nr_pages; j++) { + struct vm_area_struct *vma = vmas[j]; + + if (vma->vm_file && + !is_file_hugepages(vma->vm_file)) { + ret = -EOPNOTSUPP; + break; + } + } + } else { + ret = pret < 0 ? pret : -EFAULT; + } + up_read(¤t->mm->mmap_sem); + if (ret) { + /* + * if we did partial map, or found file backed vmas, + * release any pages we did get + */ + if (pret > 0) { + for (j = 0; j < pret; j++) + put_page(pages[j]); + } + if (ctx->account_mem) + io_unaccount_mem(ctx->user, nr_pages); + goto err; + } + + off = ubuf & ~PAGE_MASK; + size = iov.iov_len; + for (j = 0; j < nr_pages; j++) { + size_t vec_len; + + vec_len = min_t(size_t, size, PAGE_SIZE - off); + imu->bvec[j].bv_page = pages[j]; + imu->bvec[j].bv_len = vec_len; + imu->bvec[j].bv_offset = off; + off = 0; + size -= vec_len; + } + /* store original address for later verification */ + imu->ubuf = ubuf; + imu->len = iov.iov_len; + imu->nr_bvecs = nr_pages; + + ctx->nr_user_bufs++; + } + kfree(pages); + kfree(vmas); + return 0; +err: + kfree(pages); + kfree(vmas); + io_sqe_buffer_unregister(ctx); + return ret; +} + static void io_ring_ctx_free(struct io_ring_ctx *ctx) { if (ctx->sqo_wq) @@ -1325,6 +1618,7 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx) mmdrop(ctx->sqo_mm); io_iopoll_reap_events(ctx); + io_sqe_buffer_unregister(ctx); #if defined(CONFIG_UNIX) if (ctx->ring_sock) @@ -1689,6 +1983,60 @@ SYSCALL_DEFINE2(io_uring_setup, u32, entries, return io_uring_setup(entries, params); } +static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode, + void __user *arg, unsigned nr_args) +{ + int ret; + + percpu_ref_kill(&ctx->refs); + wait_for_completion(&ctx->ctx_done); + + switch (opcode) { + case IORING_REGISTER_BUFFERS: + ret = io_sqe_buffer_register(ctx, arg, nr_args); + break; + case IORING_UNREGISTER_BUFFERS: + ret = -EINVAL; + if (arg || nr_args) + break; + ret = io_sqe_buffer_unregister(ctx); + break; + default: + ret = -EINVAL; + break; + } + + /* bring the ctx back to life */ + reinit_completion(&ctx->ctx_done); + percpu_ref_reinit(&ctx->refs); + return ret; +} + +SYSCALL_DEFINE4(io_uring_register, unsigned int, fd, unsigned int, opcode, + void __user *, arg, unsigned int, nr_args) +{ + struct io_ring_ctx *ctx; + long ret = -EBADF; + struct fd f; + + f = fdget(fd); + if (!f.file) + return -EBADF; + + ret = -EOPNOTSUPP; + if (f.file->f_op != &io_uring_fops) + goto out_fput; + + ctx = f.file->private_data; + + mutex_lock(&ctx->uring_lock); + ret = __io_uring_register(ctx, opcode, arg, nr_args); + mutex_unlock(&ctx->uring_lock); +out_fput: + fdput(f); + return ret; +} + static int __init io_uring_init(void) { req_cachep = KMEM_CACHE(io_kiocb, SLAB_HWCACHE_ALIGN | SLAB_PANIC); diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 3072dbaa7869..3681c05ac538 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -315,6 +315,8 @@ asmlinkage long sys_io_uring_setup(u32 entries, asmlinkage long sys_io_uring_enter(unsigned int fd, u32 to_submit, u32 min_complete, u32 flags, const sigset_t __user *sig, size_t sigsz); +asmlinkage long sys_io_uring_register(unsigned int fd, unsigned int op, + void __user *arg, unsigned int nr_args); /* fs/xattr.c */ asmlinkage long sys_setxattr(const char __user *path, const char __user *name, diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h index 87871e7b7ea7..d346229a1eb0 100644 --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@ -744,9 +744,11 @@ __SYSCALL(__NR_kexec_file_load, sys_kexec_file_load) __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup) #define __NR_io_uring_enter 426 __SYSCALL(__NR_io_uring_enter, sys_io_uring_enter) +#define __NR_io_uring_register 427 +__SYSCALL(__NR_io_uring_register, sys_io_uring_register) #undef __NR_syscalls -#define __NR_syscalls 427 +#define __NR_syscalls 428 /* * 32 bit systems traditionally used different diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 5c457ea396e6..cf28f7a11f12 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -27,7 +27,10 @@ struct io_uring_sqe { __u32 fsync_flags; }; __u64 user_data; /* data to be passed back at completion time */ - __u64 __pad2[3]; + union { + __u16 buf_index; /* index into fixed buffers, if used */ + __u64 __pad2[3]; + }; }; /* @@ -39,6 +42,8 @@ struct io_uring_sqe { #define IORING_OP_READV 1 #define IORING_OP_WRITEV 2 #define IORING_OP_FSYNC 3 +#define IORING_OP_READ_FIXED 4 +#define IORING_OP_WRITE_FIXED 5 /* * sqe->fsync_flags @@ -103,4 +108,10 @@ struct io_uring_params { struct io_cqring_offsets cq_off; }; +/* + * io_uring_register(2) opcodes and arguments + */ +#define IORING_REGISTER_BUFFERS 0 +#define IORING_UNREGISTER_BUFFERS 1 + #endif diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index ee5e523564bb..1bb6604dc19f 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -48,6 +48,7 @@ COND_SYSCALL_COMPAT(io_getevents); COND_SYSCALL_COMPAT(io_pgetevents); COND_SYSCALL(io_uring_setup); COND_SYSCALL(io_uring_enter); +COND_SYSCALL(io_uring_register); /* fs/xattr.c */ -- cgit v1.3-7-g2ca7 From 21038f2baa05a0550f56f010f609a5c871b6a274 Mon Sep 17 00:00:00 2001 From: Song Liu Date: Mon, 25 Feb 2019 16:20:05 -0800 Subject: perf, bpf: Consider events with attr.bpf_event as side-band events Events with attr.bpf_event set should be considered as side-band events, as they carry information about BPF programs. Signed-off-by: Song Liu Cc: Alexei Starovoitov Cc: Daniel Borkmann Cc: Jiri Olsa Cc: Namhyung Kim Cc: Peter Zijlstra Cc: kernel-team@fb.com Cc: netdev@vger.kernel.org Fixes: 6ee52e2a3fe4 ("perf, bpf: Introduce PERF_RECORD_BPF_EVENT") Link: http://lkml.kernel.org/r/20190226002019.3748539-2-songliubraving@fb.com Signed-off-by: Arnaldo Carvalho de Melo --- kernel/events/core.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/events/core.c b/kernel/events/core.c index 5f59d848171e..dd9698ad3d66 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -4238,7 +4238,8 @@ static bool is_sb_event(struct perf_event *event) if (attr->mmap || attr->mmap_data || attr->mmap2 || attr->comm || attr->comm_exec || attr->task || attr->ksymbol || - attr->context_switch) + attr->context_switch || + attr->bpf_event) return true; return false; } -- cgit v1.3-7-g2ca7 From 5cd401ace914dc68556c6d2fcae0c349444d5f86 Mon Sep 17 00:00:00 2001 From: Dave Hansen Date: Mon, 25 Feb 2019 10:57:30 -0800 Subject: mm/resource: Return real error codes from walk failures walk_system_ram_range() can return an error code either becuase *it* failed, or because the 'func' that it calls returned an error. The memory hotplug does the following: ret = walk_system_ram_range(..., func); if (ret) return ret; and 'ret' makes it out to userspace, eventually. The problem s, walk_system_ram_range() failues that result from *it* failing (as opposed to 'func') return -1. That leads to a very odd -EPERM (-1) return code out to userspace. Make walk_system_ram_range() return -EINVAL for internal failures to keep userspace less confused. This return code is compatible with all the callers that I audited. Signed-off-by: Dave Hansen Reviewed-by: Bjorn Helgaas Acked-by: Michael Ellerman (powerpc) Cc: Dan Williams Cc: Dave Jiang Cc: Ross Zwisler Cc: Vishal Verma Cc: Tom Lendacky Cc: Andrew Morton Cc: Michal Hocko Cc: linux-nvdimm@lists.01.org Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org Cc: Huang Ying Cc: Fengguang Wu Cc: Borislav Petkov Cc: Yaowei Bai Cc: Takashi Iwai Cc: Jerome Glisse Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: linuxppc-dev@lists.ozlabs.org Cc: Keith Busch Signed-off-by: Dan Williams --- kernel/resource.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/resource.c b/kernel/resource.c index 915c02e8e5dd..ca7ed5158cff 100644 --- a/kernel/resource.c +++ b/kernel/resource.c @@ -382,7 +382,7 @@ static int __walk_iomem_res_desc(resource_size_t start, resource_size_t end, int (*func)(struct resource *, void *)) { struct resource res; - int ret = -1; + int ret = -EINVAL; while (start < end && !find_next_iomem_res(start, end, flags, desc, first_lvl, &res)) { @@ -462,7 +462,7 @@ int walk_system_ram_range(unsigned long start_pfn, unsigned long nr_pages, unsigned long flags; struct resource res; unsigned long pfn, end_pfn; - int ret = -1; + int ret = -EINVAL; start = (u64) start_pfn << PAGE_SHIFT; end = ((u64)(start_pfn + nr_pages) << PAGE_SHIFT) - 1; -- cgit v1.3-7-g2ca7 From b926b7f3baecb2a855db629e6822e1a85212e91c Mon Sep 17 00:00:00 2001 From: Dave Hansen Date: Mon, 25 Feb 2019 10:57:33 -0800 Subject: mm/resource: Move HMM pr_debug() deeper into resource code HMM consumes physical address space for its own use, even though nothing is mapped or accessible there. It uses a special resource description (IORES_DESC_DEVICE_PRIVATE_MEMORY) to uniquely identify these areas. When HMM consumes address space, it makes a best guess about what to consume. However, it is possible that a future memory or device hotplug can collide with the reserved area. In the case of these conflicts, there is an error message in register_memory_resource(). Later patches in this series move register_memory_resource() from using request_resource_conflict() to __request_region(). Unfortunately, __request_region() does not return the conflict like the previous function did, which makes it impossible to check for IORES_DESC_DEVICE_PRIVATE_MEMORY in a conflicting resource. Instead of warning in register_memory_resource(), move the check into the core resource code itself (__request_region()) where the conflicting resource _is_ available. This has the added bonus of producing a warning in case of HMM conflicts with devices *or* RAM address space, as opposed to the RAM- only warnings that were there previously. Signed-off-by: Dave Hansen Reviewed-by: Jerome Glisse Cc: Dan Williams Cc: Dave Jiang Cc: Ross Zwisler Cc: Vishal Verma Cc: Tom Lendacky Cc: Andrew Morton Cc: Michal Hocko Cc: linux-nvdimm@lists.01.org Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org Cc: Huang Ying Cc: Fengguang Wu Cc: Keith Busch Signed-off-by: Dan Williams --- kernel/resource.c | 9 +++++++++ mm/memory_hotplug.c | 5 ----- 2 files changed, 9 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/kernel/resource.c b/kernel/resource.c index ca7ed5158cff..35fe105d581e 100644 --- a/kernel/resource.c +++ b/kernel/resource.c @@ -1132,6 +1132,15 @@ struct resource * __request_region(struct resource *parent, conflict = __request_resource(parent, res); if (!conflict) break; + /* + * mm/hmm.c reserves physical addresses which then + * become unavailable to other users. Conflicts are + * not expected. Warn to aid debugging if encountered. + */ + if (conflict->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY) { + pr_warn("Unaddressable device %s %pR conflicts with %pR", + conflict->name, conflict, res); + } if (conflict != parent) { if (!(conflict->flags & IORESOURCE_BUSY)) { parent = conflict; diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index b9a667d36c55..e198974c968d 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -110,11 +110,6 @@ static struct resource *register_memory_resource(u64 start, u64 size) res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY; conflict = request_resource_conflict(&iomem_resource, res); if (conflict) { - if (conflict->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY) { - pr_debug("Device unaddressable memory block " - "memory hotplug at %#010llx !\n", - (unsigned long long)start); - } pr_debug("System RAM resource %pR cannot be added\n", res); kfree(res); return ERR_PTR(-EEXIST); -- cgit v1.3-7-g2ca7 From 2b539aefe9e48e3908cff02699aa63a8b9bd268e Mon Sep 17 00:00:00 2001 From: Dave Hansen Date: Mon, 25 Feb 2019 10:57:38 -0800 Subject: mm/resource: Let walk_system_ram_range() search child resources In the process of onlining memory, we use walk_system_ram_range() to find the actual RAM areas inside of the area being onlined. However, it currently only finds memory resources which are "top-level" iomem_resources. Children are not currently searched which causes it to skip System RAM in areas like this (in the format of /proc/iomem): a0000000-bfffffff : Persistent Memory (legacy) a0000000-afffffff : System RAM Changing the true->false here allows children to be searched as well. We need this because we add a new "System RAM" resource underneath the "persistent memory" resource when we use persistent memory in a volatile mode. Signed-off-by: Dave Hansen Cc: Keith Busch Cc: Dan Williams Cc: Dave Jiang Cc: Ross Zwisler Cc: Vishal Verma Cc: Tom Lendacky Cc: Andrew Morton Cc: Michal Hocko Cc: linux-nvdimm@lists.01.org Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org Cc: Huang Ying Cc: Fengguang Wu Cc: Borislav Petkov Cc: Bjorn Helgaas Cc: Yaowei Bai Cc: Takashi Iwai Cc: Jerome Glisse Signed-off-by: Dan Williams --- kernel/resource.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/resource.c b/kernel/resource.c index 35fe105d581e..e7f9d2a5db25 100644 --- a/kernel/resource.c +++ b/kernel/resource.c @@ -454,6 +454,9 @@ int walk_mem_res(u64 start, u64 end, void *arg, * This function calls the @func callback against all memory ranges of type * System RAM which are marked as IORESOURCE_SYSTEM_RAM and IORESOUCE_BUSY. * It is to be used only for System RAM. + * + * This will find System RAM ranges that are children of top-level resources + * in addition to top-level System RAM resources. */ int walk_system_ram_range(unsigned long start_pfn, unsigned long nr_pages, void *arg, int (*func)(unsigned long, unsigned long, void *)) @@ -469,7 +472,7 @@ int walk_system_ram_range(unsigned long start_pfn, unsigned long nr_pages, flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY; while (start < end && !find_next_iomem_res(start, end, flags, IORES_DESC_NONE, - true, &res)) { + false, &res)) { pfn = (res.start + PAGE_SIZE - 1) >> PAGE_SHIFT; end_pfn = (res.end + 1) >> PAGE_SHIFT; if (end_pfn > pfn) -- cgit v1.3-7-g2ca7 From 3fcc5530bcb2a879e32bd940e6eafee328ac3647 Mon Sep 17 00:00:00 2001 From: Alexei Starovoitov Date: Wed, 27 Feb 2019 18:30:44 -0800 Subject: bpf: fix build without bpf_syscall wrap bpf_stats_enabled sysctl with #ifdef Reported-by: Stephen Rothwell Fixes: 492ecee892c2 ("bpf: enable program stats") Signed-off-by: Alexei Starovoitov Acked-by: Song Liu Signed-off-by: Daniel Borkmann --- kernel/sysctl.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 86e0771352f2..7578e21a711b 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -224,9 +224,11 @@ static int proc_dostring_coredump(struct ctl_table *table, int write, #endif static int proc_dopipe_max_size(struct ctl_table *table, int write, void __user *buffer, size_t *lenp, loff_t *ppos); +#ifdef CONFIG_BPF_SYSCALL static int proc_dointvec_minmax_bpf_stats(struct ctl_table *table, int write, void __user *buffer, size_t *lenp, loff_t *ppos); +#endif #ifdef CONFIG_MAGIC_SYSRQ /* Note: sysrq code uses its own private copy */ @@ -1232,7 +1234,6 @@ static struct ctl_table kern_table[] = { .extra1 = &one, .extra2 = &one, }, -#endif { .procname = "bpf_stats_enabled", .data = &sysctl_bpf_stats_enabled, @@ -1242,6 +1243,7 @@ static struct ctl_table kern_table[] = { .extra1 = &zero, .extra2 = &one, }, +#endif #if defined(CONFIG_TREE_RCU) || defined(CONFIG_PREEMPT_RCU) { .procname = "panic_on_rcu_stall", @@ -3272,6 +3274,7 @@ int proc_doulongvec_ms_jiffies_minmax(struct ctl_table *table, int write, #endif /* CONFIG_PROC_SYSCTL */ +#ifdef CONFIG_BPF_SYSCALL static int proc_dointvec_minmax_bpf_stats(struct ctl_table *table, int write, void __user *buffer, size_t *lenp, loff_t *ppos) @@ -3293,7 +3296,7 @@ static int proc_dointvec_minmax_bpf_stats(struct ctl_table *table, int write, } return ret; } - +#endif /* * No sense putting this after each symbol definition, twice, * exception granted :-) -- cgit v1.3-7-g2ca7 From 10c3405f060397e565e4f75c403859f9a074bfa5 Mon Sep 17 00:00:00 2001 From: "Gustavo A. R. Silva" Date: Tue, 12 Feb 2019 14:54:30 -0600 Subject: perf: Mark expected switch fall-through MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit In preparation to enabling -Wimplicit-fallthrough, mark switch cases where we are expecting to fall through. This patch fixes the following warning: kernel/events/core.c: In function ‘perf_event_parse_addr_filter’: kernel/events/core.c:9154:11: warning: this statement may fall through [-Wimplicit-fallthrough=] kernel = 1; ~~~~~~~^~~ kernel/events/core.c:9156:3: note: here case IF_SRC_FILEADDR: ^~~~ Warning level 3 was used: -Wimplicit-fallthrough=3 This patch is part of the ongoing efforts to enable -Wimplicit-fallthrough. Signed-off-by: Gustavo A. R. Silva Cc: Alexander Shishkin Cc: Gustavo A. R. Silva Cc: Jiri Olsa Cc: Kees Kook Cc: Namhyung Kim Cc: Peter Zijlstra Link: http://lkml.kernel.org/r/20190212205430.GA8446@embeddedor Signed-off-by: Arnaldo Carvalho de Melo --- kernel/events/core.c | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/events/core.c b/kernel/events/core.c index dd9698ad3d66..6fb27b564730 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -9175,6 +9175,7 @@ perf_event_parse_addr_filter(struct perf_event *event, char *fstr, case IF_SRC_KERNELADDR: case IF_SRC_KERNEL: kernel = 1; + /* fall through */ case IF_SRC_FILEADDR: case IF_SRC_FILE: -- cgit v1.3-7-g2ca7 From 352d20d611414715353ee65fc206ee57ab1a6984 Mon Sep 17 00:00:00 2001 From: Peng Sun Date: Wed, 27 Feb 2019 22:36:25 +0800 Subject: bpf: drop refcount if bpf_map_new_fd() fails in map_create() In bpf/syscall.c, map_create() first set map->usercnt to 1, a file descriptor is supposed to return to userspace. When bpf_map_new_fd() fails, drop the refcount. Fixes: bd5f5f4ecb78 ("bpf: Add BPF_MAP_GET_FD_BY_ID") Signed-off-by: Peng Sun Acked-by: Martin KaFai Lau Signed-off-by: Alexei Starovoitov Signed-off-by: Daniel Borkmann --- kernel/bpf/syscall.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index f98328abe8fa..84470d1480aa 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -559,12 +559,12 @@ static int map_create(union bpf_attr *attr) err = bpf_map_new_fd(map, f_flags); if (err < 0) { /* failed to allocate fd. - * bpf_map_put() is needed because the above + * bpf_map_put_with_uref() is needed because the above * bpf_map_alloc_id() has published the map * to the userspace and the userspace may * have refcnt-ed it through BPF_MAP_GET_FD_BY_ID. */ - bpf_map_put(map); + bpf_map_put_with_uref(map); return err; } -- cgit v1.3-7-g2ca7 From 6a072128d262d2b98d31626906a96700d1fc11eb Mon Sep 17 00:00:00 2001 From: Pavel Tikhomirov Date: Thu, 23 Aug 2018 13:25:34 +0300 Subject: tracing: Fix event filters and triggers to handle negative numbers Then tracing syscall exit event it is extremely useful to filter exit codes equal to some negative value, to react only to required errors. But negative numbers does not work: [root@snorch sys_exit_read]# echo "ret == -1" > filter bash: echo: write error: Invalid argument [root@snorch sys_exit_read]# cat filter ret == -1 ^ parse_error: Invalid value (did you forget quotes)? Similar thing happens when setting triggers. These is a regression in v4.17 introduced by the commit mentioned below, testing without these commit shows no problem with negative numbers. Link: http://lkml.kernel.org/r/20180823102534.7642-1-ptikhomirov@virtuozzo.com Cc: stable@vger.kernel.org Fixes: 80765597bc58 ("tracing: Rewrite filter logic to be simpler and faster") Signed-off-by: Pavel Tikhomirov Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_events_filter.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c index 27821480105e..217ef481fbbb 100644 --- a/kernel/trace/trace_events_filter.c +++ b/kernel/trace/trace_events_filter.c @@ -1301,7 +1301,7 @@ static int parse_pred(const char *str, void *data, /* go past the last quote */ i++; - } else if (isdigit(str[i])) { + } else if (isdigit(str[i]) || str[i] == '-') { /* Make sure the field is not a string */ if (is_string_field(field)) { @@ -1314,6 +1314,9 @@ static int parse_pred(const char *str, void *data, goto err_free; } + if (str[i] == '-') + i++; + /* We allow 0xDEADBEEF */ while (isalnum(str[i])) i++; -- cgit v1.3-7-g2ca7 From 49ef5f45701c3dedc6aa6efee5d03756fea0f99d Mon Sep 17 00:00:00 2001 From: Masami Hiramatsu Date: Fri, 22 Feb 2019 01:16:43 +0900 Subject: tracing/kprobes: Use probe_kernel_read instead of probe_mem_read Use probe_kernel_read() instead of probe_mem_read() because probe_mem_read() is a kind of wrapper for switching memory read function between uprobes and kprobes. Link: http://lkml.kernel.org/r/20190222011643.3e19ade84a3db3e83518648f@kernel.org Signed-off-by: Masami Hiramatsu Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_kprobe.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c index 9eaf07f99212..99592c27465e 100644 --- a/kernel/trace/trace_kprobe.c +++ b/kernel/trace/trace_kprobe.c @@ -865,7 +865,7 @@ fetch_store_strlen(unsigned long addr) u8 c; do { - ret = probe_mem_read(&c, (u8 *)addr + len, 1); + ret = probe_kernel_read(&c, (u8 *)addr + len, 1); len++; } while (c && ret == 0 && len < MAX_STRING_SIZE); -- cgit v1.3-7-g2ca7 From 4b9113045b1745ec8512d6743680809edca6a74e Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Fri, 1 Mar 2019 14:33:11 -0800 Subject: bpf: fix u64_stats_init() usage in bpf_prog_alloc() We need to iterate through all possible cpus. Fixes: 492ecee892c2 ("bpf: enable program stats") Signed-off-by: Eric Dumazet Reported-by: Guenter Roeck Tested-by: Guenter Roeck Acked-by: Song Liu Signed-off-by: Daniel Borkmann --- kernel/bpf/core.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index 1c14c347f3cf..3f08c257858e 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -109,6 +109,7 @@ struct bpf_prog *bpf_prog_alloc(unsigned int size, gfp_t gfp_extra_flags) { gfp_t gfp_flags = GFP_KERNEL | __GFP_ZERO | gfp_extra_flags; struct bpf_prog *prog; + int cpu; prog = bpf_prog_alloc_no_stats(size, gfp_extra_flags); if (!prog) @@ -121,7 +122,12 @@ struct bpf_prog *bpf_prog_alloc(unsigned int size, gfp_t gfp_extra_flags) return NULL; } - u64_stats_init(&prog->aux->stats->syncp); + for_each_possible_cpu(cpu) { + struct bpf_prog_stats *pstats; + + pstats = per_cpu_ptr(prog->aux->stats, cpu); + u64_stats_init(&pstats->syncp); + } return prog; } EXPORT_SYMBOL_GPL(bpf_prog_alloc); -- cgit v1.3-7-g2ca7 From 3612af783cf52c74a031a2f11b82247b2599d3cd Mon Sep 17 00:00:00 2001 From: Daniel Borkmann Date: Fri, 1 Mar 2019 22:05:29 +0100 Subject: bpf: fix sanitation rewrite in case of non-pointers Marek reported that he saw an issue with the below snippet in that timing measurements where off when loaded as unpriv while results were reasonable when loaded as privileged: [...] uint64_t a = bpf_ktime_get_ns(); uint64_t b = bpf_ktime_get_ns(); uint64_t delta = b - a; if ((int64_t)delta > 0) { [...] Turns out there is a bug where a corner case is missing in the fix d3bd7413e0ca ("bpf: fix sanitation of alu op with pointer / scalar type from different paths"), namely fixup_bpf_calls() only checks whether aux has a non-zero alu_state, but it also needs to test for the case of BPF_ALU_NON_POINTER since in both occasions we need to skip the masking rewrite (as there is nothing to mask). Fixes: d3bd7413e0ca ("bpf: fix sanitation of alu op with pointer / scalar type from different paths") Reported-by: Marek Majkowski Reported-by: Arthur Fabre Signed-off-by: Daniel Borkmann Link: https://lore.kernel.org/netdev/CAJPywTJqP34cK20iLM5YmUMz9KXQOdu1-+BZrGMAGgLuBWz7fg@mail.gmail.com/T/ Acked-by: Song Liu Signed-off-by: Alexei Starovoitov --- kernel/bpf/verifier.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 8f295b790297..5fcce2f4209d 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -6920,7 +6920,8 @@ static int fixup_bpf_calls(struct bpf_verifier_env *env) u32 off_reg; aux = &env->insn_aux_data[i + delta]; - if (!aux->alu_state) + if (!aux->alu_state || + aux->alu_state == BPF_ALU_NON_POINTER) continue; isneg = aux->alu_state & BPF_ALU_NEG_VALUE; -- cgit v1.3-7-g2ca7 From e36202a844d4eff2ab07bcef998d7b4beda9761f Mon Sep 17 00:00:00 2001 From: Tetsuo Handa Date: Fri, 22 Feb 2019 18:59:40 +0900 Subject: printk: Remove no longer used LOG_PREFIX. When commit 5becfb1df5ac8e49 ("kmsg: merge continuation records while printing") introduced LOG_PREFIX, we used KERN_DEFAULT etc. as a flag for setting LOG_PREFIX in order to tell whether to call cont_add() (i.e. whether to append the message to "struct cont"). But since commit 4bcc595ccd80decb ("printk: reinstate KERN_CONT for printing continuation lines") inverted the behavior (i.e. don't append the message to "struct cont" unless KERN_CONT is specified) and commit 5aa068ea4082b39e ("printk: remove games with previous record flags") removed the last LOG_PREFIX check, setting LOG_PREFIX via KERN_DEFAULT etc. is no longer meaningful. Therefore, we can remove LOG_PREFIX and make KERN_DEFAULT empty string. Link: http://lkml.kernel.org/r/1550829580-9189-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp To: Steven Rostedt To: Linus Torvalds Cc: linux-kernel@vger.kernel.org Cc: Tetsuo Handa Signed-off-by: Tetsuo Handa Reviewed-by: Sergey Senozhatsky Signed-off-by: Petr Mladek --- include/linux/kern_levels.h | 2 +- include/linux/printk.h | 1 - kernel/printk/printk.c | 6 +----- 3 files changed, 2 insertions(+), 7 deletions(-) (limited to 'kernel') diff --git a/include/linux/kern_levels.h b/include/linux/kern_levels.h index d237fe854ad9..bf2389c26ae3 100644 --- a/include/linux/kern_levels.h +++ b/include/linux/kern_levels.h @@ -14,7 +14,7 @@ #define KERN_INFO KERN_SOH "6" /* informational */ #define KERN_DEBUG KERN_SOH "7" /* debug-level messages */ -#define KERN_DEFAULT KERN_SOH "d" /* the default kernel loglevel */ +#define KERN_DEFAULT "" /* the default kernel loglevel */ /* * Annotation for a "continued" line of log printout (only done after a diff --git a/include/linux/printk.h b/include/linux/printk.h index 55aa96975fa2..97aa12c928d4 100644 --- a/include/linux/printk.h +++ b/include/linux/printk.h @@ -18,7 +18,6 @@ static inline int printk_get_level(const char *buffer) if (buffer[0] == KERN_SOH_ASCII && buffer[1]) { switch (buffer[1]) { case '0' ... '7': - case 'd': /* KERN_DEFAULT */ case 'c': /* KERN_CONT */ return buffer[1]; } diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index b4d26388bc62..9b6783c158f9 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -345,7 +345,6 @@ static int console_msg_format = MSG_FORMAT_DEFAULT; enum log_flags { LOG_NEWLINE = 2, /* text ended with a newline */ - LOG_PREFIX = 4, /* text started with a prefix */ LOG_CONT = 8, /* text is a fragment of a continuation line */ }; @@ -1922,9 +1921,6 @@ int vprintk_store(int facility, int level, case '0' ... '7': if (level == LOGLEVEL_DEFAULT) level = kern_level - '0'; - /* fallthrough */ - case 'd': /* KERN_DEFAULT */ - lflags |= LOG_PREFIX; break; case 'c': /* KERN_CONT */ lflags |= LOG_CONT; @@ -1939,7 +1935,7 @@ int vprintk_store(int facility, int level, level = default_message_loglevel; if (dict) - lflags |= LOG_PREFIX|LOG_NEWLINE; + lflags |= LOG_NEWLINE; return log_output(facility, level, lflags, dict, dictlen, text, text_len); -- cgit v1.3-7-g2ca7 From ed581aaf99be10883c8364df16bd80a7b8f72efc Mon Sep 17 00:00:00 2001 From: Tom Zanussi Date: Mon, 4 Feb 2019 15:07:23 -0600 Subject: tracing: Use str_has_prefix() in synth_event_create() Since we now have a str_has_prefix() that returns the length, we can use that instead of explicitly calculating it. Link: http://lkml.kernel.org/r/03418373fd1e80030e7394b8e3e081c5de28a710.1549309756.git.tom.zanussi@linux.intel.com Cc: Joe Perches Signed-off-by: Tom Zanussi Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_events_hist.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c index 66386ba1425f..5b03b9a869bb 100644 --- a/kernel/trace/trace_events_hist.c +++ b/kernel/trace/trace_events_hist.c @@ -1316,8 +1316,8 @@ static int synth_event_create(int argc, const char **argv) /* This interface accepts group name prefix */ if (strchr(name, '/')) { - len = sizeof(SYNTH_SYSTEM "/") - 1; - if (strncmp(name, SYNTH_SYSTEM "/", len)) + len = str_has_prefix(name, SYNTH_SYSTEM "/"); + if (len == 0) return -EINVAL; name += len; } -- cgit v1.3-7-g2ca7 From 9f0bbf3115ca9f91f43b7c74e9ac7d79f47fc6c2 Mon Sep 17 00:00:00 2001 From: Tom Zanussi Date: Mon, 4 Feb 2019 15:07:24 -0600 Subject: tracing: Use strncpy instead of memcpy for string keys in hist triggers Because there may be random garbage beyond a string's null terminator, it's not correct to copy the the complete character array for use as a hist trigger key. This results in multiple histogram entries for the 'same' string key. So, in the case of a string key, use strncpy instead of memcpy to avoid copying in the extra bytes. Before, using the gdbus entries in the following hist trigger as an example: # echo 'hist:key=comm' > /sys/kernel/debug/tracing/events/sched/sched_waking/trigger # cat /sys/kernel/debug/tracing/events/sched/sched_waking/hist ... { comm: ImgDecoder #4 } hitcount: 203 { comm: gmain } hitcount: 213 { comm: gmain } hitcount: 216 { comm: StreamTrans #73 } hitcount: 221 { comm: mozStorage #3 } hitcount: 230 { comm: gdbus } hitcount: 233 { comm: StyleThread#5 } hitcount: 253 { comm: gdbus } hitcount: 256 { comm: gdbus } hitcount: 260 { comm: StyleThread#4 } hitcount: 271 ... # cat /sys/kernel/debug/tracing/events/sched/sched_waking/hist | egrep gdbus | wc -l 51 After: # cat /sys/kernel/debug/tracing/events/sched/sched_waking/hist | egrep gdbus | wc -l 1 Link: http://lkml.kernel.org/r/50c35ae1267d64eee975b8125e151e600071d4dc.1549309756.git.tom.zanussi@linux.intel.com Cc: Namhyung Kim Cc: stable@vger.kernel.org Fixes: 79e577cbce4c4 ("tracing: Support string type key properly") Signed-off-by: Tom Zanussi Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_events_hist.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c index 5b03b9a869bb..c7774fa119a7 100644 --- a/kernel/trace/trace_events_hist.c +++ b/kernel/trace/trace_events_hist.c @@ -5157,9 +5157,10 @@ static inline void add_to_key(char *compound_key, void *key, /* ensure NULL-termination */ if (size > key_field->size - 1) size = key_field->size - 1; - } - memcpy(compound_key + key_field->offset, key, size); + strncpy(compound_key + key_field->offset, (char *)key, size); + } else + memcpy(compound_key + key_field->offset, key, size); } static void -- cgit v1.3-7-g2ca7 From bf393fd4a3c888e6d407968f461900481bd0c041 Mon Sep 17 00:00:00 2001 From: Bart Van Assche Date: Fri, 1 Mar 2019 13:57:25 -0800 Subject: workqueue: Fix spelling in source code comments Change "execuing" into "executing" and "guarnateed" into "guaranteed". Cc: Lai Jiangshan Signed-off-by: Bart Van Assche Signed-off-by: Tejun Heo --- kernel/workqueue.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 29a4de4025be..1ef68ac89162 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -1321,7 +1321,7 @@ static bool is_chained_work(struct workqueue_struct *wq) worker = current_wq_worker(); /* - * Return %true iff I'm a worker execuing a work item on @wq. If + * Return %true iff I'm a worker executing a work item on @wq. If * I'm @worker, it's safe to dereference it without locking. */ return worker && worker->current_pwq->wq == wq; @@ -1619,7 +1619,7 @@ static void rcu_work_rcufn(struct rcu_head *rcu) * * Return: %false if @rwork was already pending, %true otherwise. Note * that a full RCU grace period is guaranteed only after a %true return. - * While @rwork is guarnateed to be executed after a %false return, the + * While @rwork is guaranteed to be executed after a %false return, the * execution may happen before a full RCU grace period has passed. */ bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork) -- cgit v1.3-7-g2ca7 From 3eb39f47934f9d5a3027fe00d906a45fe3a15fad Mon Sep 17 00:00:00 2001 From: Christian Brauner Date: Mon, 19 Nov 2018 00:51:56 +0100 Subject: signal: add pidfd_send_signal() syscall The kill() syscall operates on process identifiers (pid). After a process has exited its pid can be reused by another process. If a caller sends a signal to a reused pid it will end up signaling the wrong process. This issue has often surfaced and there has been a push to address this problem [1]. This patch uses file descriptors (fd) from proc/ as stable handles on struct pid. Even if a pid is recycled the handle will not change. The fd can be used to send signals to the process it refers to. Thus, the new syscall pidfd_send_signal() is introduced to solve this problem. Instead of pids it operates on process fds (pidfd). /* prototype and argument /* long pidfd_send_signal(int pidfd, int sig, siginfo_t *info, unsigned int flags); /* syscall number 424 */ The syscall number was chosen to be 424 to align with Arnd's rework in his y2038 to minimize merge conflicts (cf. [25]). In addition to the pidfd and signal argument it takes an additional siginfo_t and flags argument. If the siginfo_t argument is NULL then pidfd_send_signal() is equivalent to kill(, ). If it is not NULL pidfd_send_signal() is equivalent to rt_sigqueueinfo(). The flags argument is added to allow for future extensions of this syscall. It currently needs to be passed as 0. Failing to do so will cause EINVAL. /* pidfd_send_signal() replaces multiple pid-based syscalls */ The pidfd_send_signal() syscall currently takes on the job of rt_sigqueueinfo(2) and parts of the functionality of kill(2), Namely, when a positive pid is passed to kill(2). It will however be possible to also replace tgkill(2) and rt_tgsigqueueinfo(2) if this syscall is extended. /* sending signals to threads (tid) and process groups (pgid) */ Specifically, the pidfd_send_signal() syscall does currently not operate on process groups or threads. This is left for future extensions. In order to extend the syscall to allow sending signal to threads and process groups appropriately named flags (e.g. PIDFD_TYPE_PGID, and PIDFD_TYPE_TID) should be added. This implies that the flags argument will determine what is signaled and not the file descriptor itself. Put in other words, grouping in this api is a property of the flags argument not a property of the file descriptor (cf. [13]). Clarification for this has been requested by Eric (cf. [19]). When appropriate extensions through the flags argument are added then pidfd_send_signal() can additionally replace the part of kill(2) which operates on process groups as well as the tgkill(2) and rt_tgsigqueueinfo(2) syscalls. How such an extension could be implemented has been very roughly sketched in [14], [15], and [16]. However, this should not be taken as a commitment to a particular implementation. There might be better ways to do it. Right now this is intentionally left out to keep this patchset as simple as possible (cf. [4]). /* naming */ The syscall had various names throughout iterations of this patchset: - procfd_signal() - procfd_send_signal() - taskfd_send_signal() In the last round of reviews it was pointed out that given that if the flags argument decides the scope of the signal instead of different types of fds it might make sense to either settle for "procfd_" or "pidfd_" as prefix. The community was willing to accept either (cf. [17] and [18]). Given that one developer expressed strong preference for the "pidfd_" prefix (cf. [13]) and with other developers less opinionated about the name we should settle for "pidfd_" to avoid further bikeshedding. The "_send_signal" suffix was chosen to reflect the fact that the syscall takes on the job of multiple syscalls. It is therefore intentional that the name is not reminiscent of neither kill(2) nor rt_sigqueueinfo(2). Not the fomer because it might imply that pidfd_send_signal() is a replacement for kill(2), and not the latter because it is a hassle to remember the correct spelling - especially for non-native speakers - and because it is not descriptive enough of what the syscall actually does. The name "pidfd_send_signal" makes it very clear that its job is to send signals. /* zombies */ Zombies can be signaled just as any other process. No special error will be reported since a zombie state is an unreliable state (cf. [3]). However, this can be added as an extension through the @flags argument if the need ever arises. /* cross-namespace signals */ The patch currently enforces that the signaler and signalee either are in the same pid namespace or that the signaler's pid namespace is an ancestor of the signalee's pid namespace. This is done for the sake of simplicity and because it is unclear to what values certain members of struct siginfo_t would need to be set to (cf. [5], [6]). /* compat syscalls */ It became clear that we would like to avoid adding compat syscalls (cf. [7]). The compat syscall handling is now done in kernel/signal.c itself by adding __copy_siginfo_from_user_generic() which lets us avoid compat syscalls (cf. [8]). It should be noted that the addition of __copy_siginfo_from_user_any() is caused by a bug in the original implementation of rt_sigqueueinfo(2) (cf. 12). With upcoming rework for syscall handling things might improve significantly (cf. [11]) and __copy_siginfo_from_user_any() will not gain any additional callers. /* testing */ This patch was tested on x64 and x86. /* userspace usage */ An asciinema recording for the basic functionality can be found under [9]. With this patch a process can be killed via: #define _GNU_SOURCE #include #include #include #include #include #include #include #include #include #include static inline int do_pidfd_send_signal(int pidfd, int sig, siginfo_t *info, unsigned int flags) { #ifdef __NR_pidfd_send_signal return syscall(__NR_pidfd_send_signal, pidfd, sig, info, flags); #else return -ENOSYS; #endif } int main(int argc, char *argv[]) { int fd, ret, saved_errno, sig; if (argc < 3) exit(EXIT_FAILURE); fd = open(argv[1], O_DIRECTORY | O_CLOEXEC); if (fd < 0) { printf("%s - Failed to open \"%s\"\n", strerror(errno), argv[1]); exit(EXIT_FAILURE); } sig = atoi(argv[2]); printf("Sending signal %d to process %s\n", sig, argv[1]); ret = do_pidfd_send_signal(fd, sig, NULL, 0); saved_errno = errno; close(fd); errno = saved_errno; if (ret < 0) { printf("%s - Failed to send signal %d to process %s\n", strerror(errno), sig, argv[1]); exit(EXIT_FAILURE); } exit(EXIT_SUCCESS); } /* Q&A * Given that it seems the same questions get asked again by people who are * late to the party it makes sense to add a Q&A section to the commit * message so it's hopefully easier to avoid duplicate threads. * * For the sake of progress please consider these arguments settled unless * there is a new point that desperately needs to be addressed. Please make * sure to check the links to the threads in this commit message whether * this has not already been covered. */ Q-01: (Florian Weimer [20], Andrew Morton [21]) What happens when the target process has exited? A-01: Sending the signal will fail with ESRCH (cf. [22]). Q-02: (Andrew Morton [21]) Is the task_struct pinned by the fd? A-02: No. A reference to struct pid is kept. struct pid - as far as I understand - was created exactly for the reason to not require to pin struct task_struct (cf. [22]). Q-03: (Andrew Morton [21]) Does the entire procfs directory remain visible? Just one entry within it? A-03: The same thing that happens right now when you hold a file descriptor to /proc/ open (cf. [22]). Q-04: (Andrew Morton [21]) Does the pid remain reserved? A-04: No. This patchset guarantees a stable handle not that pids are not recycled (cf. [22]). Q-05: (Andrew Morton [21]) Do attempts to signal that fd return errors? A-05: See {Q,A}-01. Q-06: (Andrew Morton [22]) Is there a cleaner way of obtaining the fd? Another syscall perhaps. A-06: Userspace can already trivially retrieve file descriptors from procfs so this is something that we will need to support anyway. Hence, there's no immediate need to add another syscalls just to make pidfd_send_signal() not dependent on the presence of procfs. However, adding a syscalls to get such file descriptors is planned for a future patchset (cf. [22]). Q-07: (Andrew Morton [21] and others) This fd-for-a-process sounds like a handy thing and people may well think up other uses for it in the future, probably unrelated to signals. Are the code and the interface designed to permit such future applications? A-07: Yes (cf. [22]). Q-08: (Andrew Morton [21] and others) Now I think about it, why a new syscall? This thing is looking rather like an ioctl? A-08: This has been extensively discussed. It was agreed that a syscall is preferred for a variety or reasons. Here are just a few taken from prior threads. Syscalls are safer than ioctl()s especially when signaling to fds. Processes are a core kernel concept so a syscall seems more appropriate. The layout of the syscall with its four arguments would require the addition of a custom struct for the ioctl() thereby causing at least the same amount or even more complexity for userspace than a simple syscall. The new syscall will replace multiple other pid-based syscalls (see description above). The file-descriptors-for-processes concept introduced with this syscall will be extended with other syscalls in the future. See also [22], [23] and various other threads already linked in here. Q-09: (Florian Weimer [24]) What happens if you use the new interface with an O_PATH descriptor? A-09: pidfds opened as O_PATH fds cannot be used to send signals to a process (cf. [2]). Signaling processes through pidfds is the equivalent of writing to a file. Thus, this is not an operation that operates "purely at the file descriptor level" as required by the open(2) manpage. See also [4]. /* References */ [1]: https://lore.kernel.org/lkml/20181029221037.87724-1-dancol@google.com/ [2]: https://lore.kernel.org/lkml/874lbtjvtd.fsf@oldenburg2.str.redhat.com/ [3]: https://lore.kernel.org/lkml/20181204132604.aspfupwjgjx6fhva@brauner.io/ [4]: https://lore.kernel.org/lkml/20181203180224.fkvw4kajtbvru2ku@brauner.io/ [5]: https://lore.kernel.org/lkml/20181121213946.GA10795@mail.hallyn.com/ [6]: https://lore.kernel.org/lkml/20181120103111.etlqp7zop34v6nv4@brauner.io/ [7]: https://lore.kernel.org/lkml/36323361-90BD-41AF-AB5B-EE0D7BA02C21@amacapital.net/ [8]: https://lore.kernel.org/lkml/87tvjxp8pc.fsf@xmission.com/ [9]: https://asciinema.org/a/IQjuCHew6bnq1cr78yuMv16cy [11]: https://lore.kernel.org/lkml/F53D6D38-3521-4C20-9034-5AF447DF62FF@amacapital.net/ [12]: https://lore.kernel.org/lkml/87zhtjn8ck.fsf@xmission.com/ [13]: https://lore.kernel.org/lkml/871s6u9z6u.fsf@xmission.com/ [14]: https://lore.kernel.org/lkml/20181206231742.xxi4ghn24z4h2qki@brauner.io/ [15]: https://lore.kernel.org/lkml/20181207003124.GA11160@mail.hallyn.com/ [16]: https://lore.kernel.org/lkml/20181207015423.4miorx43l3qhppfz@brauner.io/ [17]: https://lore.kernel.org/lkml/CAGXu5jL8PciZAXvOvCeCU3wKUEB_dU-O3q0tDw4uB_ojMvDEew@mail.gmail.com/ [18]: https://lore.kernel.org/lkml/20181206222746.GB9224@mail.hallyn.com/ [19]: https://lore.kernel.org/lkml/20181208054059.19813-1-christian@brauner.io/ [20]: https://lore.kernel.org/lkml/8736rebl9s.fsf@oldenburg.str.redhat.com/ [21]: https://lore.kernel.org/lkml/20181228152012.dbf0508c2508138efc5f2bbe@linux-foundation.org/ [22]: https://lore.kernel.org/lkml/20181228233725.722tdfgijxcssg76@brauner.io/ [23]: https://lwn.net/Articles/773459/ [24]: https://lore.kernel.org/lkml/8736rebl9s.fsf@oldenburg.str.redhat.com/ [25]: https://lore.kernel.org/lkml/CAK8P3a0ej9NcJM8wXNPbcGUyOUZYX+VLoDFdbenW3s3114oQZw@mail.gmail.com/ Cc: "Eric W. Biederman" Cc: Jann Horn Cc: Andy Lutomirsky Cc: Andrew Morton Cc: Oleg Nesterov Cc: Al Viro Cc: Florian Weimer Signed-off-by: Christian Brauner Reviewed-by: Tycho Andersen Reviewed-by: Kees Cook Reviewed-by: David Howells Acked-by: Arnd Bergmann Acked-by: Thomas Gleixner Acked-by: Serge Hallyn Acked-by: Aleksa Sarai --- arch/x86/entry/syscalls/syscall_32.tbl | 1 + arch/x86/entry/syscalls/syscall_64.tbl | 1 + fs/proc/base.c | 9 +++ include/linux/proc_fs.h | 6 ++ include/linux/syscalls.h | 3 + include/uapi/asm-generic/unistd.h | 4 +- kernel/signal.c | 133 +++++++++++++++++++++++++++++++-- kernel/sys_ni.c | 1 + 8 files changed, 151 insertions(+), 7 deletions(-) (limited to 'kernel') diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl index 3cf7b533b3d1..234d91df8ca6 100644 --- a/arch/x86/entry/syscalls/syscall_32.tbl +++ b/arch/x86/entry/syscalls/syscall_32.tbl @@ -398,3 +398,4 @@ 384 i386 arch_prctl sys_arch_prctl __ia32_compat_sys_arch_prctl 385 i386 io_pgetevents sys_io_pgetevents __ia32_compat_sys_io_pgetevents 386 i386 rseq sys_rseq __ia32_sys_rseq +424 i386 pidfd_send_signal sys_pidfd_send_signal __ia32_sys_pidfd_send_signal diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index f0b1709a5ffb..58f4b3ad4fe0 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -343,6 +343,7 @@ 332 common statx __x64_sys_statx 333 common io_pgetevents __x64_sys_io_pgetevents 334 common rseq __x64_sys_rseq +424 common pidfd_send_signal __x64_sys_pidfd_send_signal # # x32-specific system call numbers start at 512 to avoid cache impact diff --git a/fs/proc/base.c b/fs/proc/base.c index 633a63462573..b6627c471078 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -3046,6 +3046,15 @@ static const struct file_operations proc_tgid_base_operations = { .llseek = generic_file_llseek, }; +struct pid *tgid_pidfd_to_pid(const struct file *file) +{ + if (!d_is_dir(file->f_path.dentry) || + (file->f_op != &proc_tgid_base_operations)) + return ERR_PTR(-EBADF); + + return proc_pid(file_inode(file)); +} + static struct dentry *proc_tgid_base_lookup(struct inode *dir, struct dentry *dentry, unsigned int flags) { return proc_pident_lookup(dir, dentry, diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h index d0e1f1522a78..52a283ba0465 100644 --- a/include/linux/proc_fs.h +++ b/include/linux/proc_fs.h @@ -73,6 +73,7 @@ struct proc_dir_entry *proc_create_net_single_write(const char *name, umode_t mo int (*show)(struct seq_file *, void *), proc_write_t write, void *data); +extern struct pid *tgid_pidfd_to_pid(const struct file *file); #else /* CONFIG_PROC_FS */ @@ -114,6 +115,11 @@ static inline int remove_proc_subtree(const char *name, struct proc_dir_entry *p #define proc_create_net(name, mode, parent, state_size, ops) ({NULL;}) #define proc_create_net_single(name, mode, parent, show, data) ({NULL;}) +static inline struct pid *tgid_pidfd_to_pid(const struct file *file) +{ + return ERR_PTR(-EBADF); +} + #endif /* CONFIG_PROC_FS */ struct net; diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 257cccba3062..5eb2e351675e 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -926,6 +926,9 @@ asmlinkage long sys_statx(int dfd, const char __user *path, unsigned flags, unsigned mask, struct statx __user *buffer); asmlinkage long sys_rseq(struct rseq __user *rseq, uint32_t rseq_len, int flags, uint32_t sig); +asmlinkage long sys_pidfd_send_signal(int pidfd, int sig, + siginfo_t __user *info, + unsigned int flags); /* * Architecture-specific system calls diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h index d90127298f12..c861e7d1053b 100644 --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@ -740,9 +740,11 @@ __SC_COMP(__NR_io_pgetevents, sys_io_pgetevents, compat_sys_io_pgetevents) __SYSCALL(__NR_rseq, sys_rseq) #define __NR_kexec_file_load 294 __SYSCALL(__NR_kexec_file_load, sys_kexec_file_load) +#define __NR_pidfd_send_signal 424 +__SYSCALL(__NR_pidfd_send_signal, sys_pidfd_send_signal) #undef __NR_syscalls -#define __NR_syscalls 295 +#define __NR_syscalls 425 /* * 32 bit systems traditionally used different diff --git a/kernel/signal.c b/kernel/signal.c index e1d7ad8e6ab1..268bed80244f 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -19,7 +19,9 @@ #include #include #include +#include #include +#include #include #include #include @@ -3429,6 +3431,16 @@ COMPAT_SYSCALL_DEFINE4(rt_sigtimedwait, compat_sigset_t __user *, uthese, #endif #endif +static inline void prepare_kill_siginfo(int sig, struct kernel_siginfo *info) +{ + clear_siginfo(info); + info->si_signo = sig; + info->si_errno = 0; + info->si_code = SI_USER; + info->si_pid = task_tgid_vnr(current); + info->si_uid = from_kuid_munged(current_user_ns(), current_uid()); +} + /** * sys_kill - send a signal to a process * @pid: the PID of the process @@ -3438,16 +3450,125 @@ SYSCALL_DEFINE2(kill, pid_t, pid, int, sig) { struct kernel_siginfo info; - clear_siginfo(&info); - info.si_signo = sig; - info.si_errno = 0; - info.si_code = SI_USER; - info.si_pid = task_tgid_vnr(current); - info.si_uid = from_kuid_munged(current_user_ns(), current_uid()); + prepare_kill_siginfo(sig, &info); return kill_something_info(sig, &info, pid); } +#ifdef CONFIG_PROC_FS +/* + * Verify that the signaler and signalee either are in the same pid namespace + * or that the signaler's pid namespace is an ancestor of the signalee's pid + * namespace. + */ +static bool access_pidfd_pidns(struct pid *pid) +{ + struct pid_namespace *active = task_active_pid_ns(current); + struct pid_namespace *p = ns_of_pid(pid); + + for (;;) { + if (!p) + return false; + if (p == active) + break; + p = p->parent; + } + + return true; +} + +static int copy_siginfo_from_user_any(kernel_siginfo_t *kinfo, siginfo_t *info) +{ +#ifdef CONFIG_COMPAT + /* + * Avoid hooking up compat syscalls and instead handle necessary + * conversions here. Note, this is a stop-gap measure and should not be + * considered a generic solution. + */ + if (in_compat_syscall()) + return copy_siginfo_from_user32( + kinfo, (struct compat_siginfo __user *)info); +#endif + return copy_siginfo_from_user(kinfo, info); +} + +/** + * sys_pidfd_send_signal - send a signal to a process through a task file + * descriptor + * @pidfd: the file descriptor of the process + * @sig: signal to be sent + * @info: the signal info + * @flags: future flags to be passed + * + * The syscall currently only signals via PIDTYPE_PID which covers + * kill(, . It does not signal threads or process + * groups. + * In order to extend the syscall to threads and process groups the @flags + * argument should be used. In essence, the @flags argument will determine + * what is signaled and not the file descriptor itself. Put in other words, + * grouping is a property of the flags argument not a property of the file + * descriptor. + * + * Return: 0 on success, negative errno on failure + */ +SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig, + siginfo_t __user *, info, unsigned int, flags) +{ + int ret; + struct fd f; + struct pid *pid; + kernel_siginfo_t kinfo; + + /* Enforce flags be set to 0 until we add an extension. */ + if (flags) + return -EINVAL; + + f = fdget_raw(pidfd); + if (!f.file) + return -EBADF; + + /* Is this a pidfd? */ + pid = tgid_pidfd_to_pid(f.file); + if (IS_ERR(pid)) { + ret = PTR_ERR(pid); + goto err; + } + + ret = -EINVAL; + if (!access_pidfd_pidns(pid)) + goto err; + + if (info) { + ret = copy_siginfo_from_user_any(&kinfo, info); + if (unlikely(ret)) + goto err; + + ret = -EINVAL; + if (unlikely(sig != kinfo.si_signo)) + goto err; + + if ((task_pid(current) != pid) && + (kinfo.si_code >= 0 || kinfo.si_code == SI_TKILL)) { + /* Only allow sending arbitrary signals to yourself. */ + ret = -EPERM; + if (kinfo.si_code != SI_USER) + goto err; + + /* Turn this into a regular kill signal. */ + prepare_kill_siginfo(sig, &kinfo); + } + } else { + prepare_kill_siginfo(sig, &kinfo); + } + + ret = kill_pid_info(sig, &kinfo, pid); + +err: + fdput(f); + return ret; +} +#endif /* CONFIG_PROC_FS */ + static int do_send_specific(pid_t tgid, pid_t pid, int sig, struct kernel_siginfo *info) { diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index ab9d0e3c6d50..f905f4f9f677 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -163,6 +163,7 @@ COND_SYSCALL(syslog); /* kernel/sched/core.c */ /* kernel/signal.c */ +COND_SYSCALL(pidfd_send_signal); /* kernel/sys.c */ COND_SYSCALL(setregid); -- cgit v1.3-7-g2ca7 From 27242c62b141240079d1ac8d35adcdc70cae8895 Mon Sep 17 00:00:00 2001 From: Tom Zanussi Date: Tue, 5 Mar 2019 10:11:59 -0600 Subject: tracing: Use strncpy instead of memcpy when copying comm for hist triggers Because there may be random garbage beyond a string's null terminator, code that might use the entire comm array e.g. histogram keys, can give unexpected results if that garbage is copied in too, so avoid that possibility by using strncpy instead of memcpy. Link: http://lkml.kernel.org/r/1eb9f096a8086c3c82c7fc087c900005143cec54.1551802084.git.tom.zanussi@linux.intel.com Signed-off-by: Tom Zanussi Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_events_hist.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c index c7774fa119a7..ca46339f3009 100644 --- a/kernel/trace/trace_events_hist.c +++ b/kernel/trace/trace_events_hist.c @@ -2141,7 +2141,7 @@ static inline void save_comm(char *comm, struct task_struct *task) return; } - memcpy(comm, task->comm, TASK_COMM_LEN); + strncpy(comm, task->comm, TASK_COMM_LEN); } static void hist_elt_data_free(struct hist_elt_data *elt_data) @@ -3557,7 +3557,7 @@ static bool cond_snapshot_update(struct trace_array *tr, void *cond_data) elt_data = context->elt->private_data; track_elt_data = track_data->elt.private_data; if (elt_data->comm) - memcpy(track_elt_data->comm, elt_data->comm, TASK_COMM_LEN); + strncpy(track_elt_data->comm, elt_data->comm, TASK_COMM_LEN); track_data->updated = true; -- cgit v1.3-7-g2ca7 From 85f726a35e504418607b77c5e7da165dc1ea63ce Mon Sep 17 00:00:00 2001 From: Tom Zanussi Date: Tue, 5 Mar 2019 10:12:00 -0600 Subject: tracing: Use strncpy instead of memcpy when copying comm in trace.c Because there may be random garbage beyond a string's null terminator, code that might use the entire comm array e.g. histogram keys, can give unexpected results if that garbage is copied in too, so avoid that possibility by using strncpy instead of memcpy. Link: http://lkml.kernel.org/r/1d6ebac26570c2a29ce9fb575379f17ef5c8b81b.1551802084.git.tom.zanussi@linux.intel.com Signed-off-by: Tom Zanussi Suggested-by: Steven Rostedt (VMware) Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 3835f7ed3293..e9cc47e59d25 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -1497,7 +1497,7 @@ __update_max_tr(struct trace_array *tr, struct task_struct *tsk, int cpu) max_data->critical_start = data->critical_start; max_data->critical_end = data->critical_end; - memcpy(max_data->comm, tsk->comm, TASK_COMM_LEN); + strncpy(max_data->comm, tsk->comm, TASK_COMM_LEN); max_data->pid = tsk->pid; /* * If tsk == current, then use current_uid(), as that does not use @@ -1923,7 +1923,7 @@ static inline char *get_saved_cmdlines(int idx) static inline void set_cmdline(int idx, const char *cmdline) { - memcpy(get_saved_cmdlines(idx), cmdline, TASK_COMM_LEN); + strncpy(get_saved_cmdlines(idx), cmdline, TASK_COMM_LEN); } static int allocate_cmdlines_buffer(unsigned int val, -- cgit v1.3-7-g2ca7 From e04b742f74c236202b7a505c2688068969d00e65 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Tue, 5 Mar 2019 15:42:27 -0800 Subject: kexec: export PG_offline to VMCOREINFO Right now, pages inflated as part of a balloon driver will be dumped by dump tools like makedumpfile. While XEN is able to check in the crash kernel whether a certain pfn is actuall backed by memory in the hypervisor (see xen_oldmem_pfn_is_ram) and optimize this case, dumps of other balloon inflated memory will essentially result in zero pages getting allocated by the hypervisor and the dump getting filled with this data. The allocation and reading of zero pages can directly be avoided if a dumping tool could know which pages only contain stale information not to be dumped. We now have PG_offline which can be (and already is by virtio-balloon) used for marking pages as logically offline. Follow up patches will make use of this flag also in other balloon implementations. Let's export PG_offline via PAGE_OFFLINE_MAPCOUNT_VALUE, so makedumpfile can directly skip pages that are logically offline and the content therefore stale. Please note that this is also helpful for a problem we were seeing under Hyper-V: Dumping logically offline memory (pages kept fake offline while onlining a section via online_page_callback) would under some condicions result in a kernel panic when dumping them. Link: http://lkml.kernel.org/r/20181119101616.8901-4-david@redhat.com Signed-off-by: David Hildenbrand Acked-by: Michael S. Tsirkin Acked-by: Dave Young Cc: "Kirill A. Shutemov" Cc: Baoquan He Cc: Omar Sandoval Cc: Arnd Bergmann Cc: Matthew Wilcox Cc: Michal Hocko Cc: Lianbo Jiang Cc: Borislav Petkov Cc: Kazuhito Hagio Cc: Alexander Duyck Cc: Alexey Dobriyan Cc: Boris Ostrovsky Cc: Christian Hansen Cc: David Rientjes Cc: Greg Kroah-Hartman Cc: Haiyang Zhang Cc: Jonathan Corbet Cc: Juergen Gross Cc: Julien Freche Cc: Kairui Song Cc: Konstantin Khlebnikov Cc: "K. Y. Srinivasan" Cc: Len Brown Cc: Michal Hocko Cc: Mike Rapoport Cc: Miles Chen Cc: Nadav Amit Cc: Naoya Horiguchi Cc: Pankaj gupta Cc: Pavel Machek Cc: Pavel Tatashin Cc: Rafael J. Wysocki Cc: "Rafael J. Wysocki" Cc: Stefano Stabellini Cc: Stephen Hemminger Cc: Stephen Rothwell Cc: Vitaly Kuznetsov Cc: Vlastimil Babka Cc: Xavier Deguillard Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/crash_core.c | 2 ++ 1 file changed, 2 insertions(+) (limited to 'kernel') diff --git a/kernel/crash_core.c b/kernel/crash_core.c index 933cb3e45b98..093c9f917ed0 100644 --- a/kernel/crash_core.c +++ b/kernel/crash_core.c @@ -464,6 +464,8 @@ static int __init crash_save_vmcoreinfo_init(void) VMCOREINFO_NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE); #ifdef CONFIG_HUGETLB_PAGE VMCOREINFO_NUMBER(HUGETLB_PAGE_DTOR); +#define PAGE_OFFLINE_MAPCOUNT_VALUE (~PG_offline) + VMCOREINFO_NUMBER(PAGE_OFFLINE_MAPCOUNT_VALUE); #endif arch_crash_save_vmcoreinfo(); -- cgit v1.3-7-g2ca7 From 5b56db37218e6503906c6057c177a84f0a0ba551 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Tue, 5 Mar 2019 15:42:45 -0800 Subject: PM/Hibernate: use pfn_to_online_page() Let's use pfn_to_online_page() instead of pfn_to_page() when checking for saveable pages to not save/restore offline memory sections. Link: http://lkml.kernel.org/r/20181119101616.8901-8-david@redhat.com Signed-off-by: David Hildenbrand Suggested-by: Michal Hocko Acked-by: Michal Hocko Acked-by: Pavel Machek Acked-by: Rafael J. Wysocki Cc: Len Brown Cc: Matthew Wilcox Cc: "Michael S. Tsirkin" Cc: Alexander Duyck Cc: Alexey Dobriyan Cc: Arnd Bergmann Cc: Baoquan He Cc: Borislav Petkov Cc: Boris Ostrovsky Cc: Christian Hansen Cc: Dave Young Cc: David Rientjes Cc: Greg Kroah-Hartman Cc: Haiyang Zhang Cc: Jonathan Corbet Cc: Juergen Gross Cc: Julien Freche Cc: Kairui Song Cc: Kazuhito Hagio Cc: "Kirill A. Shutemov" Cc: Konstantin Khlebnikov Cc: "K. Y. Srinivasan" Cc: Lianbo Jiang Cc: Mike Rapoport Cc: Miles Chen Cc: Nadav Amit Cc: Naoya Horiguchi Cc: Omar Sandoval Cc: Pankaj gupta Cc: Pavel Tatashin Cc: "Rafael J. Wysocki" Cc: Stefano Stabellini Cc: Stephen Hemminger Cc: Stephen Rothwell Cc: Vitaly Kuznetsov Cc: Vlastimil Babka Cc: Xavier Deguillard Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/power/snapshot.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c index 640b2034edd6..87e6dd57819f 100644 --- a/kernel/power/snapshot.c +++ b/kernel/power/snapshot.c @@ -1215,8 +1215,8 @@ static struct page *saveable_highmem_page(struct zone *zone, unsigned long pfn) if (!pfn_valid(pfn)) return NULL; - page = pfn_to_page(pfn); - if (page_zone(page) != zone) + page = pfn_to_online_page(pfn); + if (!page || page_zone(page) != zone) return NULL; BUG_ON(!PageHighMem(page)); @@ -1277,8 +1277,8 @@ static struct page *saveable_page(struct zone *zone, unsigned long pfn) if (!pfn_valid(pfn)) return NULL; - page = pfn_to_page(pfn); - if (page_zone(page) != zone) + page = pfn_to_online_page(pfn); + if (!page || page_zone(page) != zone) return NULL; BUG_ON(PageHighMem(page)); -- cgit v1.3-7-g2ca7 From abd02ac616e32d818a0478e68924beac8ba5e5d8 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Tue, 5 Mar 2019 15:42:50 -0800 Subject: PM/Hibernate: exclude all PageOffline() pages The content of pages that are marked PG_offline is not of interest (e.g. inflated by a balloon driver), let's skip these pages. In saveable_highmem_page(), move the PageReserved() check to a new check along with the PageOffline() check to separate it from the swsusp checks. [david@redhat.com: v2] Link: http://lkml.kernel.org/r/20181122100627.5189-9-david@redhat.com Link: http://lkml.kernel.org/r/20181119101616.8901-9-david@redhat.com Signed-off-by: David Hildenbrand Acked-by: Pavel Machek Acked-by: Rafael J. Wysocki Cc: Pavel Machek Cc: Len Brown Cc: Matthew Wilcox Cc: Michal Hocko Cc: "Michael S. Tsirkin" Cc: Alexander Duyck Cc: Alexey Dobriyan Cc: Arnd Bergmann Cc: Baoquan He Cc: Borislav Petkov Cc: Boris Ostrovsky Cc: Christian Hansen Cc: Dave Young Cc: David Rientjes Cc: Greg Kroah-Hartman Cc: Haiyang Zhang Cc: Jonathan Corbet Cc: Juergen Gross Cc: Julien Freche Cc: Kairui Song Cc: Kazuhito Hagio Cc: "Kirill A. Shutemov" Cc: Konstantin Khlebnikov Cc: "K. Y. Srinivasan" Cc: Lianbo Jiang Cc: Michal Hocko Cc: Mike Rapoport Cc: Miles Chen Cc: Nadav Amit Cc: Naoya Horiguchi Cc: Omar Sandoval Cc: Pankaj gupta Cc: Pavel Tatashin Cc: "Rafael J. Wysocki" Cc: Stefano Stabellini Cc: Stephen Hemminger Cc: Stephen Rothwell Cc: Vitaly Kuznetsov Cc: Vlastimil Babka Cc: Xavier Deguillard Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/power/snapshot.c | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c index 87e6dd57819f..4802b039b89f 100644 --- a/kernel/power/snapshot.c +++ b/kernel/power/snapshot.c @@ -1221,8 +1221,10 @@ static struct page *saveable_highmem_page(struct zone *zone, unsigned long pfn) BUG_ON(!PageHighMem(page)); - if (swsusp_page_is_forbidden(page) || swsusp_page_is_free(page) || - PageReserved(page)) + if (swsusp_page_is_forbidden(page) || swsusp_page_is_free(page)) + return NULL; + + if (PageReserved(page) || PageOffline(page)) return NULL; if (page_is_guard(page)) @@ -1286,6 +1288,9 @@ static struct page *saveable_page(struct zone *zone, unsigned long pfn) if (swsusp_page_is_forbidden(page) || swsusp_page_is_free(page)) return NULL; + if (PageOffline(page)) + return NULL; + if (PageReserved(page) && (!kernel_page_present(page) || pfn_is_nosave(pfn))) return NULL; -- cgit v1.3-7-g2ca7 From 98fa15f34cb379864757670b8e8743b21456a20e Mon Sep 17 00:00:00 2001 From: Anshuman Khandual Date: Tue, 5 Mar 2019 15:42:58 -0800 Subject: mm: replace all open encodings for NUMA_NO_NODE Patch series "Replace all open encodings for NUMA_NO_NODE", v3. All these places for replacement were found by running the following grep patterns on the entire kernel code. Please let me know if this might have missed some instances. This might also have replaced some false positives. I will appreciate suggestions, inputs and review. 1. git grep "nid == -1" 2. git grep "node == -1" 3. git grep "nid = -1" 4. git grep "node = -1" This patch (of 2): At present there are multiple places where invalid node number is encoded as -1. Even though implicitly understood it is always better to have macros in there. Replace these open encodings for an invalid node number with the global macro NUMA_NO_NODE. This helps remove NUMA related assumptions like 'invalid node' from various places redirecting them to a common definition. Link: http://lkml.kernel.org/r/1545127933-10711-2-git-send-email-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual Reviewed-by: David Hildenbrand Acked-by: Jeff Kirsher [ixgbe] Acked-by: Jens Axboe [mtip32xx] Acked-by: Vinod Koul [dmaengine.c] Acked-by: Michael Ellerman [powerpc] Acked-by: Doug Ledford [drivers/infiniband] Cc: Joseph Qi Cc: Hans Verkuil Cc: Stephen Rothwell Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/alpha/include/asm/topology.h | 3 ++- arch/ia64/kernel/numa.c | 2 +- arch/ia64/mm/discontig.c | 6 +++--- arch/powerpc/include/asm/pci-bridge.h | 3 ++- arch/powerpc/kernel/paca.c | 3 ++- arch/powerpc/kernel/pci-common.c | 3 ++- arch/powerpc/mm/numa.c | 14 +++++++------- arch/powerpc/platforms/powernv/memtrace.c | 5 +++-- arch/sparc/kernel/pci_fire.c | 3 ++- arch/sparc/kernel/pci_schizo.c | 3 ++- arch/sparc/kernel/psycho_common.c | 3 ++- arch/sparc/kernel/sbus.c | 3 ++- arch/sparc/mm/init_64.c | 6 +++--- arch/x86/include/asm/pci.h | 3 ++- arch/x86/kernel/apic/x2apic_uv_x.c | 7 ++++--- arch/x86/kernel/smpboot.c | 3 ++- drivers/block/mtip32xx/mtip32xx.c | 5 +++-- drivers/dma/dmaengine.c | 4 +++- drivers/infiniband/hw/hfi1/affinity.c | 3 ++- drivers/infiniband/hw/hfi1/init.c | 3 ++- drivers/iommu/dmar.c | 5 +++-- drivers/iommu/intel-iommu.c | 3 ++- drivers/misc/sgi-xp/xpc_uv.c | 3 ++- drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 5 +++-- include/linux/device.h | 2 +- init/init_task.c | 3 ++- kernel/kthread.c | 3 ++- kernel/sched/fair.c | 15 ++++++++------- lib/cpumask.c | 3 ++- mm/huge_memory.c | 13 +++++++------ mm/hugetlb.c | 3 ++- mm/ksm.c | 2 +- mm/memory.c | 7 ++++--- mm/memory_hotplug.c | 12 ++++++------ mm/mempolicy.c | 2 +- mm/page_alloc.c | 4 ++-- mm/page_ext.c | 2 +- net/core/pktgen.c | 3 ++- net/qrtr/qrtr.c | 3 ++- 39 files changed, 104 insertions(+), 74 deletions(-) (limited to 'kernel') diff --git a/arch/alpha/include/asm/topology.h b/arch/alpha/include/asm/topology.h index e6e13a85796a..5a77a40567fa 100644 --- a/arch/alpha/include/asm/topology.h +++ b/arch/alpha/include/asm/topology.h @@ -4,6 +4,7 @@ #include #include +#include #include #ifdef CONFIG_NUMA @@ -29,7 +30,7 @@ static const struct cpumask *cpumask_of_node(int node) { int cpu; - if (node == -1) + if (node == NUMA_NO_NODE) return cpu_all_mask; cpumask_clear(&node_to_cpumask_map[node]); diff --git a/arch/ia64/kernel/numa.c b/arch/ia64/kernel/numa.c index 92c376279c6d..1315da6c7aeb 100644 --- a/arch/ia64/kernel/numa.c +++ b/arch/ia64/kernel/numa.c @@ -74,7 +74,7 @@ void __init build_cpu_to_node_map(void) cpumask_clear(&node_to_cpu_mask[node]); for_each_possible_early_cpu(cpu) { - node = -1; + node = NUMA_NO_NODE; for (i = 0; i < NR_CPUS; ++i) if (cpu_physical_id(cpu) == node_cpuid[i].phys_id) { node = node_cpuid[i].nid; diff --git a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c index 8a965784340c..f9c36750c6a4 100644 --- a/arch/ia64/mm/discontig.c +++ b/arch/ia64/mm/discontig.c @@ -227,7 +227,7 @@ void __init setup_per_cpu_areas(void) * CPUs are put into groups according to node. Walk cpu_map * and create new groups at node boundaries. */ - prev_node = -1; + prev_node = NUMA_NO_NODE; ai->nr_groups = 0; for (unit = 0; unit < nr_units; unit++) { cpu = cpu_map[unit]; @@ -435,7 +435,7 @@ static void __init *memory_less_node_alloc(int nid, unsigned long pernodesize) { void *ptr = NULL; u8 best = 0xff; - int bestnode = -1, node, anynode = 0; + int bestnode = NUMA_NO_NODE, node, anynode = 0; for_each_online_node(node) { if (node_isset(node, memory_less_mask)) @@ -447,7 +447,7 @@ static void __init *memory_less_node_alloc(int nid, unsigned long pernodesize) anynode = node; } - if (bestnode == -1) + if (bestnode == NUMA_NO_NODE) bestnode = anynode; ptr = memblock_alloc_try_nid(pernodesize, PERCPU_PAGE_SIZE, diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h index aee4fcc24990..77fc21278fa2 100644 --- a/arch/powerpc/include/asm/pci-bridge.h +++ b/arch/powerpc/include/asm/pci-bridge.h @@ -10,6 +10,7 @@ #include #include #include +#include struct device_node; @@ -265,7 +266,7 @@ extern int pcibios_map_io_space(struct pci_bus *bus); #ifdef CONFIG_NUMA #define PHB_SET_NODE(PHB, NODE) ((PHB)->node = (NODE)) #else -#define PHB_SET_NODE(PHB, NODE) ((PHB)->node = -1) +#define PHB_SET_NODE(PHB, NODE) ((PHB)->node = NUMA_NO_NODE) #endif #endif /* CONFIG_PPC64 */ diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c index 913bfca09c4f..b8480127793d 100644 --- a/arch/powerpc/kernel/paca.c +++ b/arch/powerpc/kernel/paca.c @@ -11,6 +11,7 @@ #include #include #include +#include #include #include @@ -36,7 +37,7 @@ static void *__init alloc_paca_data(unsigned long size, unsigned long align, * which will put its paca in the right place. */ if (cpu == boot_cpuid) { - nid = -1; + nid = NUMA_NO_NODE; memblock_set_bottom_up(true); } else { nid = early_cpu_to_node(cpu); diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c index 88e4f69a09e5..4538e8ddde80 100644 --- a/arch/powerpc/kernel/pci-common.c +++ b/arch/powerpc/kernel/pci-common.c @@ -32,6 +32,7 @@ #include #include #include +#include #include #include @@ -132,7 +133,7 @@ struct pci_controller *pcibios_alloc_controller(struct device_node *dev) int nid = of_node_to_nid(dev); if (nid < 0 || !node_online(nid)) - nid = -1; + nid = NUMA_NO_NODE; PHB_SET_NODE(phb, nid); } diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 87f0dd004295..270cefb75cca 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -215,7 +215,7 @@ static void initialize_distance_lookup_table(int nid, */ static int associativity_to_nid(const __be32 *associativity) { - int nid = -1; + int nid = NUMA_NO_NODE; if (min_common_depth == -1) goto out; @@ -225,7 +225,7 @@ static int associativity_to_nid(const __be32 *associativity) /* POWER4 LPAR uses 0xffff as invalid node */ if (nid == 0xffff || nid >= MAX_NUMNODES) - nid = -1; + nid = NUMA_NO_NODE; if (nid > 0 && of_read_number(associativity, 1) >= distance_ref_points_depth) { @@ -244,7 +244,7 @@ out: */ static int of_node_to_nid_single(struct device_node *device) { - int nid = -1; + int nid = NUMA_NO_NODE; const __be32 *tmp; tmp = of_get_associativity(device); @@ -256,7 +256,7 @@ static int of_node_to_nid_single(struct device_node *device) /* Walk the device tree upwards, looking for an associativity id */ int of_node_to_nid(struct device_node *device) { - int nid = -1; + int nid = NUMA_NO_NODE; of_node_get(device); while (device) { @@ -454,7 +454,7 @@ static int of_drconf_to_nid_single(struct drmem_lmb *lmb) */ static int numa_setup_cpu(unsigned long lcpu) { - int nid = -1; + int nid = NUMA_NO_NODE; struct device_node *cpu; /* @@ -930,7 +930,7 @@ static int hot_add_drconf_scn_to_nid(unsigned long scn_addr) { struct drmem_lmb *lmb; unsigned long lmb_size; - int nid = -1; + int nid = NUMA_NO_NODE; lmb_size = drmem_lmb_size(); @@ -960,7 +960,7 @@ static int hot_add_drconf_scn_to_nid(unsigned long scn_addr) static int hot_add_node_scn_to_nid(unsigned long scn_addr) { struct device_node *memory; - int nid = -1; + int nid = NUMA_NO_NODE; for_each_node_by_type(memory, "memory") { unsigned long start, size; diff --git a/arch/powerpc/platforms/powernv/memtrace.c b/arch/powerpc/platforms/powernv/memtrace.c index 84d038ed3882..248a38ad25c7 100644 --- a/arch/powerpc/platforms/powernv/memtrace.c +++ b/arch/powerpc/platforms/powernv/memtrace.c @@ -20,6 +20,7 @@ #include #include #include +#include #include #include @@ -223,7 +224,7 @@ static int memtrace_online(void) ent = &memtrace_array[i]; /* We have onlined this chunk previously */ - if (ent->nid == -1) + if (ent->nid == NUMA_NO_NODE) continue; /* Remove from io mappings */ @@ -257,7 +258,7 @@ static int memtrace_online(void) */ debugfs_remove_recursive(ent->dir); pr_info("Added trace memory back to node %d\n", ent->nid); - ent->size = ent->start = ent->nid = -1; + ent->size = ent->start = ent->nid = NUMA_NO_NODE; } if (ret) return ret; diff --git a/arch/sparc/kernel/pci_fire.c b/arch/sparc/kernel/pci_fire.c index be71ae086622..0ca08d455e80 100644 --- a/arch/sparc/kernel/pci_fire.c +++ b/arch/sparc/kernel/pci_fire.c @@ -11,6 +11,7 @@ #include #include #include +#include #include #include @@ -416,7 +417,7 @@ static int pci_fire_pbm_init(struct pci_pbm_info *pbm, struct device_node *dp = op->dev.of_node; int err; - pbm->numa_node = -1; + pbm->numa_node = NUMA_NO_NODE; pbm->pci_ops = &sun4u_pci_ops; pbm->config_space_reg_bits = 12; diff --git a/arch/sparc/kernel/pci_schizo.c b/arch/sparc/kernel/pci_schizo.c index 934b97c72f7c..421aba00e6b0 100644 --- a/arch/sparc/kernel/pci_schizo.c +++ b/arch/sparc/kernel/pci_schizo.c @@ -12,6 +12,7 @@ #include #include #include +#include #include #include @@ -1347,7 +1348,7 @@ static int schizo_pbm_init(struct pci_pbm_info *pbm, pbm->next = pci_pbm_root; pci_pbm_root = pbm; - pbm->numa_node = -1; + pbm->numa_node = NUMA_NO_NODE; pbm->pci_ops = &sun4u_pci_ops; pbm->config_space_reg_bits = 8; diff --git a/arch/sparc/kernel/psycho_common.c b/arch/sparc/kernel/psycho_common.c index 81aa91e5c0e6..e90bcb6bad7f 100644 --- a/arch/sparc/kernel/psycho_common.c +++ b/arch/sparc/kernel/psycho_common.c @@ -5,6 +5,7 @@ */ #include #include +#include #include @@ -454,7 +455,7 @@ void psycho_pbm_init_common(struct pci_pbm_info *pbm, struct platform_device *op struct device_node *dp = op->dev.of_node; pbm->name = dp->full_name; - pbm->numa_node = -1; + pbm->numa_node = NUMA_NO_NODE; pbm->chip_type = chip_type; pbm->chip_version = of_getintprop_default(dp, "version#", 0); pbm->chip_revision = of_getintprop_default(dp, "module-revision#", 0); diff --git a/arch/sparc/kernel/sbus.c b/arch/sparc/kernel/sbus.c index 41c5deb581b8..32141e1006c4 100644 --- a/arch/sparc/kernel/sbus.c +++ b/arch/sparc/kernel/sbus.c @@ -15,6 +15,7 @@ #include #include #include +#include #include #include @@ -561,7 +562,7 @@ static void __init sbus_iommu_init(struct platform_device *op) op->dev.archdata.iommu = iommu; op->dev.archdata.stc = strbuf; - op->dev.archdata.numa_node = -1; + op->dev.archdata.numa_node = NUMA_NO_NODE; reg_base = regs + SYSIO_IOMMUREG_BASE; iommu->iommu_control = reg_base + IOMMU_CONTROL; diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c index b4221d3727d0..9e6bd868ba6f 100644 --- a/arch/sparc/mm/init_64.c +++ b/arch/sparc/mm/init_64.c @@ -976,13 +976,13 @@ static u64 __init memblock_nid_range_sun4u(u64 start, u64 end, int *nid) { int prev_nid, new_nid; - prev_nid = -1; + prev_nid = NUMA_NO_NODE; for ( ; start < end; start += PAGE_SIZE) { for (new_nid = 0; new_nid < num_node_masks; new_nid++) { struct node_mem_mask *p = &node_masks[new_nid]; if ((start & p->mask) == p->match) { - if (prev_nid == -1) + if (prev_nid == NUMA_NO_NODE) prev_nid = new_nid; break; } @@ -1208,7 +1208,7 @@ int of_node_to_nid(struct device_node *dp) md = mdesc_grab(); count = 0; - nid = -1; + nid = NUMA_NO_NODE; mdesc_for_each_node_by_name(md, grp, "group") { if (!scan_arcs_for_cfg_handle(md, grp, cfg_handle)) { nid = count; diff --git a/arch/x86/include/asm/pci.h b/arch/x86/include/asm/pci.h index 662963681ea6..e662f987dfa2 100644 --- a/arch/x86/include/asm/pci.h +++ b/arch/x86/include/asm/pci.h @@ -7,6 +7,7 @@ #include #include #include +#include #include #include #include @@ -141,7 +142,7 @@ cpumask_of_pcibus(const struct pci_bus *bus) int node; node = __pcibus_to_node(bus); - return (node == -1) ? cpu_online_mask : + return (node == NUMA_NO_NODE) ? cpu_online_mask : cpumask_of_node(node); } #endif diff --git a/arch/x86/kernel/apic/x2apic_uv_x.c b/arch/x86/kernel/apic/x2apic_uv_x.c index a555da094157..1e225528f0d7 100644 --- a/arch/x86/kernel/apic/x2apic_uv_x.c +++ b/arch/x86/kernel/apic/x2apic_uv_x.c @@ -27,6 +27,7 @@ #include #include #include +#include #include #include @@ -1390,7 +1391,7 @@ static void __init build_socket_tables(void) } /* Set socket -> node values: */ - lnid = -1; + lnid = NUMA_NO_NODE; for_each_present_cpu(cpu) { int nid = cpu_to_node(cpu); int apicid, sockid; @@ -1521,7 +1522,7 @@ static void __init uv_system_init_hub(void) new_hub->pnode = 0xffff; new_hub->numa_blade_id = uv_node_to_blade_id(nodeid); - new_hub->memory_nid = -1; + new_hub->memory_nid = NUMA_NO_NODE; new_hub->nr_possible_cpus = 0; new_hub->nr_online_cpus = 0; } @@ -1538,7 +1539,7 @@ static void __init uv_system_init_hub(void) uv_cpu_info_per(cpu)->p_uv_hub_info = uv_hub_info_list(nodeid); uv_cpu_info_per(cpu)->blade_cpu_id = uv_cpu_hub_info(cpu)->nr_possible_cpus++; - if (uv_cpu_hub_info(cpu)->memory_nid == -1) + if (uv_cpu_hub_info(cpu)->memory_nid == NUMA_NO_NODE) uv_cpu_hub_info(cpu)->memory_nid = cpu_to_node(cpu); /* Init memoryless node: */ diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index ccd1f2a8e557..c91ff9f9fe8a 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -56,6 +56,7 @@ #include #include #include +#include #include #include @@ -841,7 +842,7 @@ wakeup_secondary_cpu_via_init(int phys_apicid, unsigned long start_eip) /* reduce the number of lines printed when booting a large cpu count system */ static void announce_cpu(int cpu, int apicid) { - static int current_node = -1; + static int current_node = NUMA_NO_NODE; int node = early_cpu_to_node(cpu); static int width, node_width; diff --git a/drivers/block/mtip32xx/mtip32xx.c b/drivers/block/mtip32xx/mtip32xx.c index 88e8440e75c3..2f3ee4d6af82 100644 --- a/drivers/block/mtip32xx/mtip32xx.c +++ b/drivers/block/mtip32xx/mtip32xx.c @@ -40,6 +40,7 @@ #include #include #include +#include #include "mtip32xx.h" #define HW_CMD_SLOT_SZ (MTIP_MAX_COMMAND_SLOTS * 32) @@ -4018,9 +4019,9 @@ static int get_least_used_cpu_on_node(int node) /* Helper for selecting a node in round robin mode */ static inline int mtip_get_next_rr_node(void) { - static int next_node = -1; + static int next_node = NUMA_NO_NODE; - if (next_node == -1) { + if (next_node == NUMA_NO_NODE) { next_node = first_online_node; return next_node; } diff --git a/drivers/dma/dmaengine.c b/drivers/dma/dmaengine.c index f1a441ab395d..3a11b1092e80 100644 --- a/drivers/dma/dmaengine.c +++ b/drivers/dma/dmaengine.c @@ -63,6 +63,7 @@ #include #include #include +#include static DEFINE_MUTEX(dma_list_mutex); static DEFINE_IDA(dma_ida); @@ -386,7 +387,8 @@ EXPORT_SYMBOL(dma_issue_pending_all); static bool dma_chan_is_local(struct dma_chan *chan, int cpu) { int node = dev_to_node(chan->device->dev); - return node == -1 || cpumask_test_cpu(cpu, cpumask_of_node(node)); + return node == NUMA_NO_NODE || + cpumask_test_cpu(cpu, cpumask_of_node(node)); } /** diff --git a/drivers/infiniband/hw/hfi1/affinity.c b/drivers/infiniband/hw/hfi1/affinity.c index 2baf38cc1e23..4fe662c3bbc1 100644 --- a/drivers/infiniband/hw/hfi1/affinity.c +++ b/drivers/infiniband/hw/hfi1/affinity.c @@ -48,6 +48,7 @@ #include #include #include +#include #include "hfi.h" #include "affinity.h" @@ -777,7 +778,7 @@ void hfi1_dev_affinity_clean_up(struct hfi1_devdata *dd) _dev_comp_vect_cpu_mask_clean_up(dd, entry); unlock: mutex_unlock(&node_affinity.lock); - dd->node = -1; + dd->node = NUMA_NO_NODE; } /* diff --git a/drivers/infiniband/hw/hfi1/init.c b/drivers/infiniband/hw/hfi1/init.c index 7835eb52e7c5..441b06e2a154 100644 --- a/drivers/infiniband/hw/hfi1/init.c +++ b/drivers/infiniband/hw/hfi1/init.c @@ -54,6 +54,7 @@ #include #include #include +#include #include #include "hfi.h" @@ -1303,7 +1304,7 @@ static struct hfi1_devdata *hfi1_alloc_devdata(struct pci_dev *pdev, dd->unit = ret; list_add(&dd->list, &hfi1_dev_list); } - dd->node = -1; + dd->node = NUMA_NO_NODE; spin_unlock_irqrestore(&hfi1_devs_lock, flags); idr_preload_end(); diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c index 58dc70bffd5b..9c49300e9fb7 100644 --- a/drivers/iommu/dmar.c +++ b/drivers/iommu/dmar.c @@ -39,6 +39,7 @@ #include #include #include +#include #include #include @@ -477,7 +478,7 @@ static int dmar_parse_one_rhsa(struct acpi_dmar_header *header, void *arg) int node = acpi_map_pxm_to_node(rhsa->proximity_domain); if (!node_online(node)) - node = -1; + node = NUMA_NO_NODE; drhd->iommu->node = node; return 0; } @@ -1062,7 +1063,7 @@ static int alloc_iommu(struct dmar_drhd_unit *drhd) iommu->msagaw = msagaw; iommu->segment = drhd->segment; - iommu->node = -1; + iommu->node = NUMA_NO_NODE; ver = readl(iommu->reg + DMAR_VER_REG); pr_info("%s: reg_base_addr %llx ver %d:%d cap %llx ecap %llx\n", diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c index 78188bf7e90d..39a33dec4d0b 100644 --- a/drivers/iommu/intel-iommu.c +++ b/drivers/iommu/intel-iommu.c @@ -47,6 +47,7 @@ #include #include #include +#include #include #include #include @@ -1716,7 +1717,7 @@ static struct dmar_domain *alloc_domain(int flags) return NULL; memset(domain, 0, sizeof(*domain)); - domain->nid = -1; + domain->nid = NUMA_NO_NODE; domain->flags = flags; domain->has_iotlb_device = false; INIT_LIST_HEAD(&domain->devices); diff --git a/drivers/misc/sgi-xp/xpc_uv.c b/drivers/misc/sgi-xp/xpc_uv.c index 0441abe87880..9e443df44b3b 100644 --- a/drivers/misc/sgi-xp/xpc_uv.c +++ b/drivers/misc/sgi-xp/xpc_uv.c @@ -22,6 +22,7 @@ #include #include #include +#include #include #if defined CONFIG_X86_64 #include @@ -61,7 +62,7 @@ static struct xpc_heartbeat_uv *xpc_heartbeat_uv; XPC_NOTIFY_MSG_SIZE_UV) #define XPC_NOTIFY_IRQ_NAME "xpc_notify" -static int xpc_mq_node = -1; +static int xpc_mq_node = NUMA_NO_NODE; static struct xpc_gru_mq_uv *xpc_activate_mq_uv; static struct xpc_gru_mq_uv *xpc_notify_mq_uv; diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c index a4e7584a50cb..e100054a3765 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c @@ -27,6 +27,7 @@ #include #include #include +#include #include #include #include @@ -6418,7 +6419,7 @@ int ixgbe_setup_tx_resources(struct ixgbe_ring *tx_ring) { struct device *dev = tx_ring->dev; int orig_node = dev_to_node(dev); - int ring_node = -1; + int ring_node = NUMA_NO_NODE; int size; size = sizeof(struct ixgbe_tx_buffer) * tx_ring->count; @@ -6512,7 +6513,7 @@ int ixgbe_setup_rx_resources(struct ixgbe_adapter *adapter, { struct device *dev = rx_ring->dev; int orig_node = dev_to_node(dev); - int ring_node = -1; + int ring_node = NUMA_NO_NODE; int size; size = sizeof(struct ixgbe_rx_buffer) * rx_ring->count; diff --git a/include/linux/device.h b/include/linux/device.h index 6cb4640b6160..4d2f13e8c540 100644 --- a/include/linux/device.h +++ b/include/linux/device.h @@ -1095,7 +1095,7 @@ static inline void set_dev_node(struct device *dev, int node) #else static inline int dev_to_node(struct device *dev) { - return -1; + return NUMA_NO_NODE; } static inline void set_dev_node(struct device *dev, int node) { diff --git a/init/init_task.c b/init/init_task.c index 5aebe3be4d7c..26131e73aa6d 100644 --- a/init/init_task.c +++ b/init/init_task.c @@ -10,6 +10,7 @@ #include #include #include +#include #include #include @@ -154,7 +155,7 @@ struct task_struct init_task .vtime.state = VTIME_SYS, #endif #ifdef CONFIG_NUMA_BALANCING - .numa_preferred_nid = -1, + .numa_preferred_nid = NUMA_NO_NODE, .numa_group = NULL, .numa_faults = NULL, #endif diff --git a/kernel/kthread.c b/kernel/kthread.c index 087d18d771b5..ebebbcf3c5de 100644 --- a/kernel/kthread.c +++ b/kernel/kthread.c @@ -20,6 +20,7 @@ #include #include #include +#include #include static DEFINE_SPINLOCK(kthread_create_lock); @@ -675,7 +676,7 @@ __kthread_create_worker(int cpu, unsigned int flags, { struct kthread_worker *worker; struct task_struct *task; - int node = -1; + int node = NUMA_NO_NODE; worker = kzalloc(sizeof(*worker), GFP_KERNEL); if (!worker) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 310d0637fe4b..0e6a0ef129c5 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1160,7 +1160,7 @@ void init_numa_balancing(unsigned long clone_flags, struct task_struct *p) /* New address space, reset the preferred nid */ if (!(clone_flags & CLONE_VM)) { - p->numa_preferred_nid = -1; + p->numa_preferred_nid = NUMA_NO_NODE; return; } @@ -1180,13 +1180,13 @@ void init_numa_balancing(unsigned long clone_flags, struct task_struct *p) static void account_numa_enqueue(struct rq *rq, struct task_struct *p) { - rq->nr_numa_running += (p->numa_preferred_nid != -1); + rq->nr_numa_running += (p->numa_preferred_nid != NUMA_NO_NODE); rq->nr_preferred_running += (p->numa_preferred_nid == task_node(p)); } static void account_numa_dequeue(struct rq *rq, struct task_struct *p) { - rq->nr_numa_running -= (p->numa_preferred_nid != -1); + rq->nr_numa_running -= (p->numa_preferred_nid != NUMA_NO_NODE); rq->nr_preferred_running -= (p->numa_preferred_nid == task_node(p)); } @@ -1400,7 +1400,7 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page, * two full passes of the "multi-stage node selection" test that is * executed below. */ - if ((p->numa_preferred_nid == -1 || p->numa_scan_seq <= 4) && + if ((p->numa_preferred_nid == NUMA_NO_NODE || p->numa_scan_seq <= 4) && (cpupid_pid_unset(last_cpupid) || cpupid_match_pid(p, last_cpupid))) return true; @@ -1848,7 +1848,7 @@ static void numa_migrate_preferred(struct task_struct *p) unsigned long interval = HZ; /* This task has no NUMA fault statistics yet */ - if (unlikely(p->numa_preferred_nid == -1 || !p->numa_faults)) + if (unlikely(p->numa_preferred_nid == NUMA_NO_NODE || !p->numa_faults)) return; /* Periodically retry migrating the task to the preferred node */ @@ -2095,7 +2095,7 @@ static int preferred_group_nid(struct task_struct *p, int nid) static void task_numa_placement(struct task_struct *p) { - int seq, nid, max_nid = -1; + int seq, nid, max_nid = NUMA_NO_NODE; unsigned long max_faults = 0; unsigned long fault_types[2] = { 0, 0 }; unsigned long total_faults; @@ -2638,7 +2638,8 @@ static void update_scan_period(struct task_struct *p, int new_cpu) * the preferred node. */ if (dst_nid == p->numa_preferred_nid || - (p->numa_preferred_nid != -1 && src_nid != p->numa_preferred_nid)) + (p->numa_preferred_nid != NUMA_NO_NODE && + src_nid != p->numa_preferred_nid)) return; } diff --git a/lib/cpumask.c b/lib/cpumask.c index 8d666ab84b5c..087a3e9a0202 100644 --- a/lib/cpumask.c +++ b/lib/cpumask.c @@ -5,6 +5,7 @@ #include #include #include +#include /** * cpumask_next - get the next cpu in a cpumask @@ -206,7 +207,7 @@ unsigned int cpumask_local_spread(unsigned int i, int node) /* Wrap: we always want a cpu. */ i %= num_online_cpus(); - if (node == -1) { + if (node == NUMA_NO_NODE) { for_each_cpu(cpu, cpu_online_mask) if (i-- == 0) return cpu; diff --git a/mm/huge_memory.c b/mm/huge_memory.c index faf357eaf0ce..d066f7ca1ee8 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -33,6 +33,7 @@ #include #include #include +#include #include #include @@ -1475,7 +1476,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd) struct anon_vma *anon_vma = NULL; struct page *page; unsigned long haddr = vmf->address & HPAGE_PMD_MASK; - int page_nid = -1, this_nid = numa_node_id(); + int page_nid = NUMA_NO_NODE, this_nid = numa_node_id(); int target_nid, last_cpupid = -1; bool page_locked; bool migrated = false; @@ -1520,7 +1521,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd) */ page_locked = trylock_page(page); target_nid = mpol_misplaced(page, vma, haddr); - if (target_nid == -1) { + if (target_nid == NUMA_NO_NODE) { /* If the page was locked, there are no parallel migrations */ if (page_locked) goto clear_pmdnuma; @@ -1528,7 +1529,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd) /* Migration could have started since the pmd_trans_migrating check */ if (!page_locked) { - page_nid = -1; + page_nid = NUMA_NO_NODE; if (!get_page_unless_zero(page)) goto out_unlock; spin_unlock(vmf->ptl); @@ -1549,14 +1550,14 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd) if (unlikely(!pmd_same(pmd, *vmf->pmd))) { unlock_page(page); put_page(page); - page_nid = -1; + page_nid = NUMA_NO_NODE; goto out_unlock; } /* Bail if we fail to protect against THP splits for any reason */ if (unlikely(!anon_vma)) { put_page(page); - page_nid = -1; + page_nid = NUMA_NO_NODE; goto clear_pmdnuma; } @@ -1618,7 +1619,7 @@ out: if (anon_vma) page_unlock_anon_vma_read(anon_vma); - if (page_nid != -1) + if (page_nid != NUMA_NO_NODE) task_numa_fault(last_cpupid, page_nid, HPAGE_PMD_NR, flags); diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 8dfdffc34a99..3c504fa6b460 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -25,6 +25,7 @@ #include #include #include +#include #include #include @@ -887,7 +888,7 @@ static struct page *dequeue_huge_page_nodemask(struct hstate *h, gfp_t gfp_mask, struct zonelist *zonelist; struct zone *zone; struct zoneref *z; - int node = -1; + int node = NUMA_NO_NODE; zonelist = node_zonelist(nid, gfp_mask); diff --git a/mm/ksm.c b/mm/ksm.c index 6c48ad13b4c9..fd2db6a74d3c 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -598,7 +598,7 @@ static struct stable_node *alloc_stable_node_chain(struct stable_node *dup, chain->chain_prune_time = jiffies; chain->rmap_hlist_len = STABLE_NODE_CHAIN; #if defined (CONFIG_DEBUG_VM) && defined(CONFIG_NUMA) - chain->nid = -1; /* debug */ + chain->nid = NUMA_NO_NODE; /* debug */ #endif ksm_stable_node_chains++; diff --git a/mm/memory.c b/mm/memory.c index e11ca9dd823f..eb40f32295d2 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -69,6 +69,7 @@ #include #include #include +#include #include #include @@ -3586,7 +3587,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; struct page *page = NULL; - int page_nid = -1; + int page_nid = NUMA_NO_NODE; int last_cpupid; int target_nid; bool migrated = false; @@ -3653,7 +3654,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf) target_nid = numa_migrate_prep(page, vma, vmf->address, page_nid, &flags); pte_unmap_unlock(vmf->pte, vmf->ptl); - if (target_nid == -1) { + if (target_nid == NUMA_NO_NODE) { put_page(page); goto out; } @@ -3667,7 +3668,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf) flags |= TNF_MIGRATE_FAIL; out: - if (page_nid != -1) + if (page_nid != NUMA_NO_NODE) task_numa_fault(last_cpupid, page_nid, 1, flags); return 0; } diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 4f07c8ddfdd7..b3d3c64d15df 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -702,9 +702,9 @@ static void node_states_check_changes_online(unsigned long nr_pages, { int nid = zone_to_nid(zone); - arg->status_change_nid = -1; - arg->status_change_nid_normal = -1; - arg->status_change_nid_high = -1; + arg->status_change_nid = NUMA_NO_NODE; + arg->status_change_nid_normal = NUMA_NO_NODE; + arg->status_change_nid_high = NUMA_NO_NODE; if (!node_state(nid, N_MEMORY)) arg->status_change_nid = nid; @@ -1509,9 +1509,9 @@ static void node_states_check_changes_offline(unsigned long nr_pages, unsigned long present_pages = 0; enum zone_type zt; - arg->status_change_nid = -1; - arg->status_change_nid_normal = -1; - arg->status_change_nid_high = -1; + arg->status_change_nid = NUMA_NO_NODE; + arg->status_change_nid_normal = NUMA_NO_NODE; + arg->status_change_nid_high = NUMA_NO_NODE; /* * Check whether node_states[N_NORMAL_MEMORY] will be changed. diff --git a/mm/mempolicy.c b/mm/mempolicy.c index ee2bce59d2bf..76e7e4bc3335 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -2304,7 +2304,7 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long unsigned long pgoff; int thiscpu = raw_smp_processor_id(); int thisnid = cpu_to_node(thiscpu); - int polnid = -1; + int polnid = NUMA_NO_NODE; int ret = -1; pol = get_vma_policy(vma, addr); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 5361bd078493..1f9f1409df9b 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6016,7 +6016,7 @@ int __meminit __early_pfn_to_nid(unsigned long pfn, return state->last_nid; nid = memblock_search_pfn_nid(pfn, &start_pfn, &end_pfn); - if (nid != -1) { + if (nid != NUMA_NO_NODE) { state->last_start = start_pfn; state->last_end = end_pfn; state->last_nid = nid; @@ -6771,7 +6771,7 @@ unsigned long __init node_map_pfn_alignment(void) { unsigned long accl_mask = 0, last_end = 0; unsigned long start, end, mask; - int last_nid = -1; + int last_nid = NUMA_NO_NODE; int i, nid; for_each_mem_pfn_range(i, MAX_NUMNODES, &start, &end, &nid) { diff --git a/mm/page_ext.c b/mm/page_ext.c index 8c78b8d45117..762d5b7eb523 100644 --- a/mm/page_ext.c +++ b/mm/page_ext.c @@ -300,7 +300,7 @@ static int __meminit online_page_ext(unsigned long start_pfn, start = SECTION_ALIGN_DOWN(start_pfn); end = SECTION_ALIGN_UP(start_pfn + nr_pages); - if (nid == -1) { + if (nid == NUMA_NO_NODE) { /* * In this case, "nid" already exists and contains valid memory. * "start_pfn" passed to us is a pfn which is an arg for diff --git a/net/core/pktgen.c b/net/core/pktgen.c index 6ac919847ce6..f3f5a78cd062 100644 --- a/net/core/pktgen.c +++ b/net/core/pktgen.c @@ -158,6 +158,7 @@ #include #include #include +#include #include #include #include @@ -3625,7 +3626,7 @@ static int pktgen_add_device(struct pktgen_thread *t, const char *ifname) pkt_dev->svlan_cfi = 0; pkt_dev->svlan_id = 0xffff; pkt_dev->burst = 1; - pkt_dev->node = -1; + pkt_dev->node = NUMA_NO_NODE; err = pktgen_setup_dev(t->net, pkt_dev, ifname); if (err) diff --git a/net/qrtr/qrtr.c b/net/qrtr/qrtr.c index 86e1e37eb4e8..b37e6e0a1026 100644 --- a/net/qrtr/qrtr.c +++ b/net/qrtr/qrtr.c @@ -15,6 +15,7 @@ #include #include #include /* For TIOCINQ/OUTQ */ +#include #include @@ -101,7 +102,7 @@ static inline struct qrtr_sock *qrtr_sk(struct sock *sk) return container_of(sk, struct qrtr_sock, sk); } -static unsigned int qrtr_local_nid = -1; +static unsigned int qrtr_local_nid = NUMA_NO_NODE; /* for node ids */ static RADIX_TREE(qrtr_nodes, GFP_KERNEL); -- cgit v1.3-7-g2ca7 From 6b7e5cad651a2b1031a4c69a98f87e3532dd4cef Mon Sep 17 00:00:00 2001 From: Matthew Wilcox Date: Tue, 5 Mar 2019 15:43:41 -0800 Subject: mm: remove sysctl_extfrag_handler() sysctl_extfrag_handler() neglects to propagate the return value from proc_dointvec_minmax() to its caller. It's a wrapper that doesn't need to exist, so just use proc_dointvec_minmax() directly. Link: http://lkml.kernel.org/r/20190104032557.3056-1-willy@infradead.org Signed-off-by: Matthew Wilcox Reported-by: Aditya Pakki Acked-by: Mel Gorman Acked-by: Randy Dunlap Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/compaction.h | 2 -- kernel/sysctl.c | 2 +- mm/compaction.c | 8 -------- 3 files changed, 1 insertion(+), 11 deletions(-) (limited to 'kernel') diff --git a/include/linux/compaction.h b/include/linux/compaction.h index 68250a57aace..70d0256edd31 100644 --- a/include/linux/compaction.h +++ b/include/linux/compaction.h @@ -88,8 +88,6 @@ extern int sysctl_compact_memory; extern int sysctl_compaction_handler(struct ctl_table *table, int write, void __user *buffer, size_t *length, loff_t *ppos); extern int sysctl_extfrag_threshold; -extern int sysctl_extfrag_handler(struct ctl_table *table, int write, - void __user *buffer, size_t *length, loff_t *ppos); extern int sysctl_compact_unevictable_allowed; extern int fragmentation_index(struct zone *zone, unsigned int order); diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 7578e21a711b..9c78e06f7ba4 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1460,7 +1460,7 @@ static struct ctl_table vm_table[] = { .data = &sysctl_extfrag_threshold, .maxlen = sizeof(int), .mode = 0644, - .proc_handler = sysctl_extfrag_handler, + .proc_handler = proc_dointvec_minmax, .extra1 = &min_extfrag_threshold, .extra2 = &max_extfrag_threshold, }, diff --git a/mm/compaction.c b/mm/compaction.c index ef29490b0f46..c15b4bbc9e9e 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -1876,14 +1876,6 @@ int sysctl_compaction_handler(struct ctl_table *table, int write, return 0; } -int sysctl_extfrag_handler(struct ctl_table *table, int write, - void __user *buffer, size_t *length, loff_t *ppos) -{ - proc_dointvec_minmax(table, write, buffer, length, ppos); - - return 0; -} - #if defined(CONFIG_SYSFS) && defined(CONFIG_NUMA) static ssize_t sysfs_compact_node(struct device *dev, struct device_attribute *attr, -- cgit v1.3-7-g2ca7 From 5e1f0f098b4649fad53011246bcaeff011ffdf5d Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Tue, 5 Mar 2019 15:45:41 -0800 Subject: mm, compaction: capture a page under direct compaction Compaction is inherently race-prone as a suitable page freed during compaction can be allocated by any parallel task. This patch uses a capture_control structure to isolate a page immediately when it is freed by a direct compactor in the slow path of the page allocator. The intent is to avoid redundant scanning. 5.0.0-rc1 5.0.0-rc1 selective-v3r17 capture-v3r19 Amean fault-both-1 0.00 ( 0.00%) 0.00 * 0.00%* Amean fault-both-3 2582.11 ( 0.00%) 2563.68 ( 0.71%) Amean fault-both-5 4500.26 ( 0.00%) 4233.52 ( 5.93%) Amean fault-both-7 5819.53 ( 0.00%) 6333.65 ( -8.83%) Amean fault-both-12 9321.18 ( 0.00%) 9759.38 ( -4.70%) Amean fault-both-18 9782.76 ( 0.00%) 10338.76 ( -5.68%) Amean fault-both-24 15272.81 ( 0.00%) 13379.55 * 12.40%* Amean fault-both-30 15121.34 ( 0.00%) 16158.25 ( -6.86%) Amean fault-both-32 18466.67 ( 0.00%) 18971.21 ( -2.73%) Latency is only moderately affected but the devil is in the details. A closer examination indicates that base page fault latency is reduced but latency of huge pages is increased as it takes creater care to succeed. Part of the "problem" is that allocation success rates are close to 100% even when under pressure and compaction gets harder 5.0.0-rc1 5.0.0-rc1 selective-v3r17 capture-v3r19 Percentage huge-3 96.70 ( 0.00%) 98.23 ( 1.58%) Percentage huge-5 96.99 ( 0.00%) 95.30 ( -1.75%) Percentage huge-7 94.19 ( 0.00%) 97.24 ( 3.24%) Percentage huge-12 94.95 ( 0.00%) 97.35 ( 2.53%) Percentage huge-18 96.74 ( 0.00%) 97.30 ( 0.58%) Percentage huge-24 97.07 ( 0.00%) 97.55 ( 0.50%) Percentage huge-30 95.69 ( 0.00%) 98.50 ( 2.95%) Percentage huge-32 96.70 ( 0.00%) 99.27 ( 2.65%) And scan rates are reduced as expected by 6% for the migration scanner and 29% for the free scanner indicating that there is less redundant work. Compaction migrate scanned 20815362 19573286 Compaction free scanned 16352612 11510663 [mgorman@techsingularity.net: remove redundant check] Link: http://lkml.kernel.org/r/20190201143853.GH9565@techsingularity.net Link: http://lkml.kernel.org/r/20190118175136.31341-23-mgorman@techsingularity.net Signed-off-by: Mel Gorman Acked-by: Vlastimil Babka Cc: Andrea Arcangeli Cc: Dan Carpenter Cc: David Rientjes Cc: YueHaibing Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/compaction.h | 3 +- include/linux/sched.h | 4 +++ kernel/sched/core.c | 3 ++ mm/compaction.c | 31 +++++++++++++++----- mm/internal.h | 9 ++++++ mm/page_alloc.c | 73 +++++++++++++++++++++++++++++++++++++++++++--- 6 files changed, 111 insertions(+), 12 deletions(-) (limited to 'kernel') diff --git a/include/linux/compaction.h b/include/linux/compaction.h index 70d0256edd31..c960923d9ec2 100644 --- a/include/linux/compaction.h +++ b/include/linux/compaction.h @@ -93,7 +93,8 @@ extern int sysctl_compact_unevictable_allowed; extern int fragmentation_index(struct zone *zone, unsigned int order); extern enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order, unsigned int alloc_flags, - const struct alloc_context *ac, enum compact_priority prio); + const struct alloc_context *ac, enum compact_priority prio, + struct page **page); extern void reset_isolation_suitable(pg_data_t *pgdat); extern enum compact_result compaction_suitable(struct zone *zone, int order, unsigned int alloc_flags, int classzone_idx); diff --git a/include/linux/sched.h b/include/linux/sched.h index f9b43c989577..ebfb34fb9b30 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -47,6 +47,7 @@ struct pid_namespace; struct pipe_inode_info; struct rcu_node; struct reclaim_state; +struct capture_control; struct robust_list_head; struct sched_attr; struct sched_param; @@ -958,6 +959,9 @@ struct task_struct { struct io_context *io_context; +#ifdef CONFIG_COMPACTION + struct capture_control *capture_control; +#endif /* Ptrace state: */ unsigned long ptrace_message; kernel_siginfo_t *last_siginfo; diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 7cbb5658be80..916e956e92be 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2190,6 +2190,9 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p) INIT_HLIST_HEAD(&p->preempt_notifiers); #endif +#ifdef CONFIG_COMPACTION + p->capture_control = NULL; +#endif init_numa_balancing(clone_flags, p); } diff --git a/mm/compaction.c b/mm/compaction.c index 3084cee77fda..1cc871da3fda 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -2056,7 +2056,8 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order, return false; } -static enum compact_result compact_zone(struct compact_control *cc) +static enum compact_result +compact_zone(struct compact_control *cc, struct capture_control *capc) { enum compact_result ret; unsigned long start_pfn = cc->zone->zone_start_pfn; @@ -2225,6 +2226,11 @@ check_drain: } } + /* Stop if a page has been captured */ + if (capc && capc->page) { + ret = COMPACT_SUCCESS; + break; + } } out: @@ -2258,7 +2264,8 @@ out: static enum compact_result compact_zone_order(struct zone *zone, int order, gfp_t gfp_mask, enum compact_priority prio, - unsigned int alloc_flags, int classzone_idx) + unsigned int alloc_flags, int classzone_idx, + struct page **capture) { enum compact_result ret; struct compact_control cc = { @@ -2279,14 +2286,24 @@ static enum compact_result compact_zone_order(struct zone *zone, int order, .ignore_skip_hint = (prio == MIN_COMPACT_PRIORITY), .ignore_block_suitable = (prio == MIN_COMPACT_PRIORITY) }; + struct capture_control capc = { + .cc = &cc, + .page = NULL, + }; + + if (capture) + current->capture_control = &capc; INIT_LIST_HEAD(&cc.freepages); INIT_LIST_HEAD(&cc.migratepages); - ret = compact_zone(&cc); + ret = compact_zone(&cc, &capc); VM_BUG_ON(!list_empty(&cc.freepages)); VM_BUG_ON(!list_empty(&cc.migratepages)); + *capture = capc.page; + current->capture_control = NULL; + return ret; } @@ -2304,7 +2321,7 @@ int sysctl_extfrag_threshold = 500; */ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order, unsigned int alloc_flags, const struct alloc_context *ac, - enum compact_priority prio) + enum compact_priority prio, struct page **capture) { int may_perform_io = gfp_mask & __GFP_IO; struct zoneref *z; @@ -2332,7 +2349,7 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order, } status = compact_zone_order(zone, order, gfp_mask, prio, - alloc_flags, ac_classzone_idx(ac)); + alloc_flags, ac_classzone_idx(ac), capture); rc = max(status, rc); /* The allocation should succeed, stop compacting */ @@ -2400,7 +2417,7 @@ static void compact_node(int nid) INIT_LIST_HEAD(&cc.freepages); INIT_LIST_HEAD(&cc.migratepages); - compact_zone(&cc); + compact_zone(&cc, NULL); VM_BUG_ON(!list_empty(&cc.freepages)); VM_BUG_ON(!list_empty(&cc.migratepages)); @@ -2535,7 +2552,7 @@ static void kcompactd_do_work(pg_data_t *pgdat) if (kthread_should_stop()) return; - status = compact_zone(&cc); + status = compact_zone(&cc, NULL); if (status == COMPACT_SUCCESS) { compaction_defer_reset(zone, cc.order, false); diff --git a/mm/internal.h b/mm/internal.h index 31bb0be6fd52..9eeaf2b95166 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -209,6 +209,15 @@ struct compact_control { bool rescan; /* Rescanning the same pageblock */ }; +/* + * Used in direct compaction when a page should be taken from the freelists + * immediately when one is created during the free path. + */ +struct capture_control { + struct compact_control *cc; + struct page *page; +}; + unsigned long isolate_freepages_range(struct compact_control *cc, unsigned long start_pfn, unsigned long end_pfn); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 2e132b9e7a93..09bf2c5f8b4b 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -789,6 +789,57 @@ static inline int page_is_buddy(struct page *page, struct page *buddy, return 0; } +#ifdef CONFIG_COMPACTION +static inline struct capture_control *task_capc(struct zone *zone) +{ + struct capture_control *capc = current->capture_control; + + return capc && + !(current->flags & PF_KTHREAD) && + !capc->page && + capc->cc->zone == zone && + capc->cc->direct_compaction ? capc : NULL; +} + +static inline bool +compaction_capture(struct capture_control *capc, struct page *page, + int order, int migratetype) +{ + if (!capc || order != capc->cc->order) + return false; + + /* Do not accidentally pollute CMA or isolated regions*/ + if (is_migrate_cma(migratetype) || + is_migrate_isolate(migratetype)) + return false; + + /* + * Do not let lower order allocations polluate a movable pageblock. + * This might let an unmovable request use a reclaimable pageblock + * and vice-versa but no more than normal fallback logic which can + * have trouble finding a high-order free page. + */ + if (order < pageblock_order && migratetype == MIGRATE_MOVABLE) + return false; + + capc->page = page; + return true; +} + +#else +static inline struct capture_control *task_capc(struct zone *zone) +{ + return NULL; +} + +static inline bool +compaction_capture(struct capture_control *capc, struct page *page, + int order, int migratetype) +{ + return false; +} +#endif /* CONFIG_COMPACTION */ + /* * Freeing function for a buddy system allocator. * @@ -822,6 +873,7 @@ static inline void __free_one_page(struct page *page, unsigned long uninitialized_var(buddy_pfn); struct page *buddy; unsigned int max_order; + struct capture_control *capc = task_capc(zone); max_order = min_t(unsigned int, MAX_ORDER, pageblock_order + 1); @@ -837,6 +889,11 @@ static inline void __free_one_page(struct page *page, continue_merging: while (order < max_order - 1) { + if (compaction_capture(capc, page, order, migratetype)) { + __mod_zone_freepage_state(zone, -(1 << order), + migratetype); + return; + } buddy_pfn = __find_buddy_pfn(pfn, order); buddy = page + (buddy_pfn - pfn); @@ -3710,7 +3767,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order, unsigned int alloc_flags, const struct alloc_context *ac, enum compact_priority prio, enum compact_result *compact_result) { - struct page *page; + struct page *page = NULL; unsigned long pflags; unsigned int noreclaim_flag; @@ -3721,13 +3778,15 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order, noreclaim_flag = memalloc_noreclaim_save(); *compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac, - prio); + prio, &page); memalloc_noreclaim_restore(noreclaim_flag); psi_memstall_leave(&pflags); - if (*compact_result <= COMPACT_INACTIVE) + if (*compact_result <= COMPACT_INACTIVE) { + WARN_ON_ONCE(page); return NULL; + } /* * At least in one zone compaction wasn't deferred or skipped, so let's @@ -3735,7 +3794,13 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order, */ count_vm_event(COMPACTSTALL); - page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac); + /* Prep a captured page if available */ + if (page) + prep_new_page(page, order, gfp_mask, alloc_flags); + + /* Try get a page from the freelist if available */ + if (!page) + page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac); if (page) { struct zone *zone = page_zone(page); -- cgit v1.3-7-g2ca7 From dc50537bdd1a0804fa2cbc990565ee9a944e66fa Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Tue, 5 Mar 2019 15:45:48 -0800 Subject: kernel: cgroup: add poll file operation Cgroup has a standardized poll/notification mechanism for waking all pollers on all fds when a filesystem node changes. To allow polling for custom events, add a .poll callback that can override the default. This is in preparation for pollable cgroup pressure files which have per-fd trigger configurations. Link: http://lkml.kernel.org/r/20190124211518.244221-3-surenb@google.com Signed-off-by: Johannes Weiner Signed-off-by: Suren Baghdasaryan Cc: Dennis Zhou Cc: Ingo Molnar Cc: Jens Axboe Cc: Li Zefan Cc: Peter Zijlstra Cc: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/cgroup-defs.h | 4 ++++ kernel/cgroup/cgroup.c | 12 ++++++++++++ 2 files changed, 16 insertions(+) (limited to 'kernel') diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index 8fcbae1b8db0..aad3babef007 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -32,6 +32,7 @@ struct kernfs_node; struct kernfs_ops; struct kernfs_open_file; struct seq_file; +struct poll_table_struct; #define MAX_CGROUP_TYPE_NAMELEN 32 #define MAX_CGROUP_ROOT_NAMELEN 64 @@ -574,6 +575,9 @@ struct cftype { ssize_t (*write)(struct kernfs_open_file *of, char *buf, size_t nbytes, loff_t off); + __poll_t (*poll)(struct kernfs_open_file *of, + struct poll_table_struct *pt); + #ifdef CONFIG_DEBUG_LOCK_ALLOC struct lock_class_key lockdep_key; #endif diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index cef98502b124..17828333f7c3 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -3534,6 +3534,16 @@ static ssize_t cgroup_file_write(struct kernfs_open_file *of, char *buf, return ret ?: nbytes; } +static __poll_t cgroup_file_poll(struct kernfs_open_file *of, poll_table *pt) +{ + struct cftype *cft = of->kn->priv; + + if (cft->poll) + return cft->poll(of, pt); + + return kernfs_generic_poll(of, pt); +} + static void *cgroup_seqfile_start(struct seq_file *seq, loff_t *ppos) { return seq_cft(seq)->seq_start(seq, ppos); @@ -3572,6 +3582,7 @@ static struct kernfs_ops cgroup_kf_single_ops = { .open = cgroup_file_open, .release = cgroup_file_release, .write = cgroup_file_write, + .poll = cgroup_file_poll, .seq_show = cgroup_seqfile_show, }; @@ -3580,6 +3591,7 @@ static struct kernfs_ops cgroup_kf_ops = { .open = cgroup_file_open, .release = cgroup_file_release, .write = cgroup_file_write, + .poll = cgroup_file_poll, .seq_start = cgroup_seqfile_start, .seq_next = cgroup_seqfile_next, .seq_stop = cgroup_seqfile_stop, -- cgit v1.3-7-g2ca7 From 7e89a37c477c4caacde7b511c64720e20104945f Mon Sep 17 00:00:00 2001 From: Arnd Bergmann Date: Thu, 28 Feb 2019 15:22:53 +0100 Subject: ipc: Fix building compat mode without sysvipc As John Stultz noticed, my y2038 syscall series caused a link failure when CONFIG_SYSVIPC is disabled but CONFIG_COMPAT is enabled: arch/arm64/kernel/sys32.o:(.rodata+0x960): undefined reference to `__arm64_compat_sys_old_semctl' arch/arm64/kernel/sys32.o:(.rodata+0x980): undefined reference to `__arm64_compat_sys_old_msgctl' arch/arm64/kernel/sys32.o:(.rodata+0x9a0): undefined reference to `__arm64_compat_sys_old_shmctl' Add the missing entries in kernel/sys_ni.c for the new system calls. Cc: Laura Abbott Cc: John Stultz Cc: Thomas Gleixner Signed-off-by: Arnd Bergmann --- kernel/sys_ni.c | 3 +++ 1 file changed, 3 insertions(+) (limited to 'kernel') diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index 85e5ccec0955..62a6c8707799 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -202,6 +202,7 @@ COND_SYSCALL(msgget); COND_SYSCALL(old_msgctl); COND_SYSCALL(msgctl); COND_SYSCALL_COMPAT(msgctl); +COND_SYSCALL_COMPAT(old_msgctl); COND_SYSCALL(msgrcv); COND_SYSCALL_COMPAT(msgrcv); COND_SYSCALL(msgsnd); @@ -212,6 +213,7 @@ COND_SYSCALL(semget); COND_SYSCALL(old_semctl); COND_SYSCALL(semctl); COND_SYSCALL_COMPAT(semctl); +COND_SYSCALL_COMPAT(old_semctl); COND_SYSCALL(semtimedop); COND_SYSCALL(semtimedop_time32); COND_SYSCALL(semop); @@ -221,6 +223,7 @@ COND_SYSCALL(shmget); COND_SYSCALL(old_shmctl); COND_SYSCALL(shmctl); COND_SYSCALL_COMPAT(shmctl); +COND_SYSCALL_COMPAT(old_shmctl); COND_SYSCALL(shmat); COND_SYSCALL_COMPAT(shmat); COND_SYSCALL(shmdt); -- cgit v1.3-7-g2ca7 From abe420bfae528c92bd8cc5ecb62dc95672b1fd6f Mon Sep 17 00:00:00 2001 From: Joerg Roedel Date: Thu, 7 Feb 2019 12:59:13 +0100 Subject: swiotlb: Introduce swiotlb_max_mapping_size() The function returns the maximum size that can be remapped by the SWIOTLB implementation. This function will be later exposed to users through the DMA-API. Cc: stable@vger.kernel.org Reviewed-by: Konrad Rzeszutek Wilk Reviewed-by: Christoph Hellwig Signed-off-by: Joerg Roedel Signed-off-by: Michael S. Tsirkin --- include/linux/swiotlb.h | 5 +++++ kernel/dma/swiotlb.c | 5 +++++ 2 files changed, 10 insertions(+) (limited to 'kernel') diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h index 7c007ed7505f..d3980aeed4a0 100644 --- a/include/linux/swiotlb.h +++ b/include/linux/swiotlb.h @@ -76,6 +76,7 @@ bool swiotlb_map(struct device *dev, phys_addr_t *phys, dma_addr_t *dma_addr, size_t size, enum dma_data_direction dir, unsigned long attrs); void __init swiotlb_exit(void); unsigned int swiotlb_max_segment(void); +size_t swiotlb_max_mapping_size(struct device *dev); #else #define swiotlb_force SWIOTLB_NO_FORCE static inline bool is_swiotlb_buffer(phys_addr_t paddr) @@ -95,6 +96,10 @@ static inline unsigned int swiotlb_max_segment(void) { return 0; } +static inline size_t swiotlb_max_mapping_size(struct device *dev) +{ + return SIZE_MAX; +} #endif /* CONFIG_SWIOTLB */ extern void swiotlb_print_info(void); diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c index 1fb6fd68b9c7..9cb21259cb0b 100644 --- a/kernel/dma/swiotlb.c +++ b/kernel/dma/swiotlb.c @@ -662,3 +662,8 @@ swiotlb_dma_supported(struct device *hwdev, u64 mask) { return __phys_to_dma(hwdev, io_tlb_end - 1) <= mask; } + +size_t swiotlb_max_mapping_size(struct device *dev) +{ + return ((size_t)1 << IO_TLB_SHIFT) * IO_TLB_SEGSIZE; +} -- cgit v1.3-7-g2ca7 From 492366f7b4237257ef50ca9c431a6a0d50225aca Mon Sep 17 00:00:00 2001 From: Joerg Roedel Date: Thu, 7 Feb 2019 12:59:14 +0100 Subject: swiotlb: Add is_swiotlb_active() function This function will be used from dma_direct code to determine the maximum segment size of a dma mapping. Cc: stable@vger.kernel.org Reviewed-by: Konrad Rzeszutek Wilk Reviewed-by: Christoph Hellwig Signed-off-by: Joerg Roedel Signed-off-by: Michael S. Tsirkin --- include/linux/swiotlb.h | 6 ++++++ kernel/dma/swiotlb.c | 9 +++++++++ 2 files changed, 15 insertions(+) (limited to 'kernel') diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h index d3980aeed4a0..29bc3a203283 100644 --- a/include/linux/swiotlb.h +++ b/include/linux/swiotlb.h @@ -77,6 +77,7 @@ bool swiotlb_map(struct device *dev, phys_addr_t *phys, dma_addr_t *dma_addr, void __init swiotlb_exit(void); unsigned int swiotlb_max_segment(void); size_t swiotlb_max_mapping_size(struct device *dev); +bool is_swiotlb_active(void); #else #define swiotlb_force SWIOTLB_NO_FORCE static inline bool is_swiotlb_buffer(phys_addr_t paddr) @@ -100,6 +101,11 @@ static inline size_t swiotlb_max_mapping_size(struct device *dev) { return SIZE_MAX; } + +static inline bool is_swiotlb_active(void) +{ + return false; +} #endif /* CONFIG_SWIOTLB */ extern void swiotlb_print_info(void); diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c index 9cb21259cb0b..c873f9cc2146 100644 --- a/kernel/dma/swiotlb.c +++ b/kernel/dma/swiotlb.c @@ -667,3 +667,12 @@ size_t swiotlb_max_mapping_size(struct device *dev) { return ((size_t)1 << IO_TLB_SHIFT) * IO_TLB_SEGSIZE; } + +bool is_swiotlb_active(void) +{ + /* + * When SWIOTLB is initialized, even if io_tlb_start points to physical + * address zero, io_tlb_end surely doesn't. + */ + return io_tlb_end != 0; +} -- cgit v1.3-7-g2ca7 From 133d624b1cee16906134e92d5befb843b58bcf31 Mon Sep 17 00:00:00 2001 From: Joerg Roedel Date: Thu, 7 Feb 2019 12:59:15 +0100 Subject: dma: Introduce dma_max_mapping_size() The function returns the maximum size that can be mapped using DMA-API functions. The patch also adds the implementation for direct DMA and a new dma_map_ops pointer so that other implementations can expose their limit. Cc: stable@vger.kernel.org Reviewed-by: Konrad Rzeszutek Wilk Reviewed-by: Christoph Hellwig Signed-off-by: Joerg Roedel Signed-off-by: Michael S. Tsirkin --- Documentation/DMA-API.txt | 8 ++++++++ include/linux/dma-mapping.h | 8 ++++++++ kernel/dma/direct.c | 11 +++++++++++ kernel/dma/mapping.c | 14 ++++++++++++++ 4 files changed, 41 insertions(+) (limited to 'kernel') diff --git a/Documentation/DMA-API.txt b/Documentation/DMA-API.txt index e133ccd60228..acfe3d0f78d1 100644 --- a/Documentation/DMA-API.txt +++ b/Documentation/DMA-API.txt @@ -195,6 +195,14 @@ Requesting the required mask does not alter the current mask. If you wish to take advantage of it, you should issue a dma_set_mask() call to set the mask to the value returned. +:: + + size_t + dma_direct_max_mapping_size(struct device *dev); + +Returns the maximum size of a mapping for the device. The size parameter +of the mapping functions like dma_map_single(), dma_map_page() and +others should not be larger than the returned value. Part Id - Streaming DMA mappings -------------------------------- diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h index f6ded992c183..5b21f14802e1 100644 --- a/include/linux/dma-mapping.h +++ b/include/linux/dma-mapping.h @@ -130,6 +130,7 @@ struct dma_map_ops { enum dma_data_direction direction); int (*dma_supported)(struct device *dev, u64 mask); u64 (*get_required_mask)(struct device *dev); + size_t (*max_mapping_size)(struct device *dev); }; #define DMA_MAPPING_ERROR (~(dma_addr_t)0) @@ -257,6 +258,8 @@ static inline void dma_direct_sync_sg_for_cpu(struct device *dev, } #endif +size_t dma_direct_max_mapping_size(struct device *dev); + #ifdef CONFIG_HAS_DMA #include @@ -460,6 +463,7 @@ int dma_supported(struct device *dev, u64 mask); int dma_set_mask(struct device *dev, u64 mask); int dma_set_coherent_mask(struct device *dev, u64 mask); u64 dma_get_required_mask(struct device *dev); +size_t dma_max_mapping_size(struct device *dev); #else /* CONFIG_HAS_DMA */ static inline dma_addr_t dma_map_page_attrs(struct device *dev, struct page *page, size_t offset, size_t size, @@ -561,6 +565,10 @@ static inline u64 dma_get_required_mask(struct device *dev) { return 0; } +static inline size_t dma_max_mapping_size(struct device *dev) +{ + return 0; +} #endif /* CONFIG_HAS_DMA */ static inline dma_addr_t dma_map_single_attrs(struct device *dev, void *ptr, diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c index 355d16acee6d..6310ad01f915 100644 --- a/kernel/dma/direct.c +++ b/kernel/dma/direct.c @@ -380,3 +380,14 @@ int dma_direct_supported(struct device *dev, u64 mask) */ return mask >= __phys_to_dma(dev, min_mask); } + +size_t dma_direct_max_mapping_size(struct device *dev) +{ + size_t size = SIZE_MAX; + + /* If SWIOTLB is active, use its maximum mapping size */ + if (is_swiotlb_active()) + size = swiotlb_max_mapping_size(dev); + + return size; +} diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c index a11006b6d8e8..5753008ab286 100644 --- a/kernel/dma/mapping.c +++ b/kernel/dma/mapping.c @@ -357,3 +357,17 @@ void dma_cache_sync(struct device *dev, void *vaddr, size_t size, ops->cache_sync(dev, vaddr, size, dir); } EXPORT_SYMBOL(dma_cache_sync); + +size_t dma_max_mapping_size(struct device *dev) +{ + const struct dma_map_ops *ops = get_dma_ops(dev); + size_t size = SIZE_MAX; + + if (dma_is_direct(ops)) + size = dma_direct_max_mapping_size(dev); + else if (ops && ops->max_mapping_size) + size = ops->max_mapping_size(dev); + + return size; +} +EXPORT_SYMBOL_GPL(dma_max_mapping_size); -- cgit v1.3-7-g2ca7 From 78c3aff834f7939d1a2570e366e33f2a880d4d9e Mon Sep 17 00:00:00 2001 From: Arnd Bergmann Date: Mon, 4 Mar 2019 21:34:12 +0100 Subject: bpf: fix sysctl.c warning When CONFIG_BPF_SYSCALL or CONFIG_SYSCTL is disabled, we get a warning about an unused function: kernel/sysctl.c:3331:12: error: 'proc_dointvec_minmax_bpf_stats' defined but not used [-Werror=unused-function] static int proc_dointvec_minmax_bpf_stats(struct ctl_table *table, int write, The CONFIG_BPF_SYSCALL check was already handled, but the SYSCTL check is needed on top. Fixes: 492ecee892c2 ("bpf: enable program stats") Signed-off-by: Arnd Bergmann Reviewed-by: Kees Cook Reviewed-by: Christian Brauner Acked-by: Song Liu Signed-off-by: Daniel Borkmann --- kernel/sysctl.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 7578e21a711b..2322b12cb4cf 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -3274,7 +3274,7 @@ int proc_doulongvec_ms_jiffies_minmax(struct ctl_table *table, int write, #endif /* CONFIG_PROC_SYSCTL */ -#ifdef CONFIG_BPF_SYSCALL +#if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_SYSCTL) static int proc_dointvec_minmax_bpf_stats(struct ctl_table *table, int write, void __user *buffer, size_t *lenp, loff_t *ppos) -- cgit v1.3-7-g2ca7 From 20182390c4134478d795a096ddb8dddcc648e28a Mon Sep 17 00:00:00 2001 From: Daniel Borkmann Date: Mon, 4 Mar 2019 21:08:53 +0100 Subject: bpf: fix replace_map_fd_with_map_ptr's ldimm64 second imm field Non-zero imm value in the second part of the ldimm64 instruction for BPF_PSEUDO_MAP_FD is invalid, and thus must be rejected. The map fd only ever sits in the first instructions' imm field. None of the BPF loaders known to us are using it, so risk of regression is minimal. For clarity and consistency, the few insn->{src_reg,imm} occurrences are rewritten into insn[0].{src_reg,imm}. Add a test case to the BPF selftest suite as well. Fixes: 0246e64d9a5f ("bpf: handle pseudo BPF_LD_IMM64 insn") Signed-off-by: Daniel Borkmann Acked-by: Song Liu Signed-off-by: Alexei Starovoitov --- kernel/bpf/verifier.c | 10 +++++----- tools/testing/selftests/bpf/verifier/ld_imm64.c | 15 ++++++++++++++- 2 files changed, 19 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index a7b96bf0e654..ce166a002d16 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -6678,17 +6678,17 @@ static int replace_map_fd_with_map_ptr(struct bpf_verifier_env *env) /* valid generic load 64-bit imm */ goto next_insn; - if (insn->src_reg != BPF_PSEUDO_MAP_FD) { - verbose(env, - "unrecognized bpf_ld_imm64 insn\n"); + if (insn[0].src_reg != BPF_PSEUDO_MAP_FD || + insn[1].imm != 0) { + verbose(env, "unrecognized bpf_ld_imm64 insn\n"); return -EINVAL; } - f = fdget(insn->imm); + f = fdget(insn[0].imm); map = __bpf_map_get(f); if (IS_ERR(map)) { verbose(env, "fd %d is not pointing to valid bpf_map\n", - insn->imm); + insn[0].imm); return PTR_ERR(map); } diff --git a/tools/testing/selftests/bpf/verifier/ld_imm64.c b/tools/testing/selftests/bpf/verifier/ld_imm64.c index 28b8c805a293..3856dba733e9 100644 --- a/tools/testing/selftests/bpf/verifier/ld_imm64.c +++ b/tools/testing/selftests/bpf/verifier/ld_imm64.c @@ -122,7 +122,7 @@ .insns = { BPF_MOV64_IMM(BPF_REG_1, 0), BPF_RAW_INSN(BPF_LD | BPF_IMM | BPF_DW, 0, BPF_REG_1, 0, 1), - BPF_RAW_INSN(0, 0, 0, 0, 1), + BPF_RAW_INSN(0, 0, 0, 0, 0), BPF_EXIT_INSN(), }, .errstr = "not pointing to valid bpf_map", @@ -139,3 +139,16 @@ .errstr = "invalid bpf_ld_imm64 insn", .result = REJECT, }, +{ + "test14 ld_imm64: reject 2nd imm != 0", + .insns = { + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_RAW_INSN(BPF_LD | BPF_IMM | BPF_DW, BPF_REG_1, + BPF_PSEUDO_MAP_FD, 0, 0), + BPF_RAW_INSN(0, 0, 0, 0, 0xfefefe), + BPF_EXIT_INSN(), + }, + .fixup_map_hash_48b = { 1 }, + .errstr = "unrecognized bpf_ld_imm64 insn", + .result = REJECT, +}, -- cgit v1.3-7-g2ca7 From 4169680e9f7cdbf893f8885611b3235aeda94224 Mon Sep 17 00:00:00 2001 From: YueHaibing Date: Thu, 7 Mar 2019 16:26:36 -0800 Subject: kernel/panic.c: taint: fix debugfs_simple_attr.cocci warnings Use DEFINE_DEBUGFS_ATTRIBUTE rather than DEFINE_SIMPLE_ATTRIBUTE for debugfs files. Semantic patch information: Rationale: DEFINE_SIMPLE_ATTRIBUTE + debugfs_create_file() imposes some significant overhead as compared to DEFINE_DEBUGFS_ATTRIBUTE + debugfs_create_file_unsafe(). Generated by: scripts/coccinelle/api/debugfs/debugfs_simple_attr.cocci The _unsafe() part suggests that some of them "safeness responsibilities" are now panic.c responsibilities. The patch is OK since panic's clear_warn_once_fops struct file_operations is safe against removal, so we don't have to use otherwise necessary debugfs_file_get()/debugfs_file_put(). [sergey.senozhatsky.work@gmail.com: changelog addition] Link: http://lkml.kernel.org/r/1545990861-158097-1-git-send-email-yuehaibing@huawei.com Signed-off-by: YueHaibing Reviewed-by: Sergey Senozhatsky Cc: Kees Cook Cc: Borislav Petkov Cc: Steven Rostedt (VMware) Cc: Petr Mladek Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/panic.c | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/panic.c b/kernel/panic.c index f121e6ba7e11..0ae0d7332f12 100644 --- a/kernel/panic.c +++ b/kernel/panic.c @@ -642,16 +642,14 @@ static int clear_warn_once_set(void *data, u64 val) return 0; } -DEFINE_SIMPLE_ATTRIBUTE(clear_warn_once_fops, - NULL, - clear_warn_once_set, - "%lld\n"); +DEFINE_DEBUGFS_ATTRIBUTE(clear_warn_once_fops, NULL, clear_warn_once_set, + "%lld\n"); static __init int register_warn_debugfs(void) { /* Don't care about failure */ - debugfs_create_file("clear_warn_once", 0200, NULL, - NULL, &clear_warn_once_fops); + debugfs_create_file_unsafe("clear_warn_once", 0200, NULL, NULL, + &clear_warn_once_fops); return 0; } -- cgit v1.3-7-g2ca7 From a98eb6f19952f18a7e5ac55d6bd7bbbb2bdc8b88 Mon Sep 17 00:00:00 2001 From: Valdis Kletnieks Date: Thu, 7 Mar 2019 16:26:46 -0800 Subject: kernel/hung_task.c - fix sparse warnings sparse complains: CHECK kernel/hung_task.c kernel/hung_task.c:28:19: warning: symbol 'sysctl_hung_task_check_count' was not declared. Should it be static? kernel/hung_task.c:42:29: warning: symbol 'sysctl_hung_task_timeout_secs' was not declared. Should it be static? kernel/hung_task.c:47:29: warning: symbol 'sysctl_hung_task_check_interval_secs' was not declared. Should it be static? kernel/hung_task.c:49:19: warning: symbol 'sysctl_hung_task_warnings' was not declared. Should it be static? kernel/hung_task.c:61:28: warning: symbol 'sysctl_hung_task_panic' was not declared. Should it be static? kernel/hung_task.c:219:5: warning: symbol 'proc_dohung_task_timeout_secs' was not declared. Should it be static? Add the appropriate header file to provide declarations. Link: http://lkml.kernel.org/r/467.1548649525@turing-police.cc.vt.edu Signed-off-by: Valdis Kletnieks Cc: "Paul E. McKenney" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/hung_task.c | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/hung_task.c b/kernel/hung_task.c index 4a9191617076..0c11216171c9 100644 --- a/kernel/hung_task.c +++ b/kernel/hung_task.c @@ -19,6 +19,7 @@ #include #include #include +#include #include -- cgit v1.3-7-g2ca7 From b014bebab047e9fdf2df45f6504ccbeaca446321 Mon Sep 17 00:00:00 2001 From: Tetsuo Handa Date: Thu, 7 Mar 2019 16:26:50 -0800 Subject: kernel/hung_task.c: Use continuously blocked time when reporting. Since commit a2e514453861 ("kernel/hung_task.c: allow to set checking interval separately from timeout") added hung_task_check_interval_secs, setting a value different from hung_task_timeout_secs echo 0 > /proc/sys/kernel/hung_task_panic echo 120 > /proc/sys/kernel/hung_task_timeout_secs echo 5 > /proc/sys/kernel/hung_task_check_interval_secs causes confusing output as if the task was blocked for hung_task_timeout_secs seconds from the previous report. [ 399.395930] INFO: task kswapd0:75 blocked for more than 120 seconds. [ 405.027637] INFO: task kswapd0:75 blocked for more than 120 seconds. [ 410.659725] INFO: task kswapd0:75 blocked for more than 120 seconds. [ 416.292860] INFO: task kswapd0:75 blocked for more than 120 seconds. [ 421.932305] INFO: task kswapd0:75 blocked for more than 120 seconds. Although we could update t->last_switch_time after sched_show_task(t) if we want to report only every 120 seconds, reporting every 5 seconds might not be very bad for monitoring after a problematic situation has started. Thus, let's use continuously blocked time instead of updating previously reported time. [ 677.985011] INFO: task kswapd0:80 blocked for more than 122 seconds. [ 693.856126] INFO: task kswapd0:80 blocked for more than 138 seconds. [ 709.728075] INFO: task kswapd0:80 blocked for more than 154 seconds. [ 725.600018] INFO: task kswapd0:80 blocked for more than 170 seconds. [ 741.473133] INFO: task kswapd0:80 blocked for more than 186 seconds. Link: http://lkml.kernel.org/r/1551175083-10669-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa Acked-by: Dmitry Vyukov Cc: "Paul E. McKenney" Cc: Thomas Gleixner Cc: Peter Zijlstra Cc: Ingo Molnar Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/hung_task.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/hung_task.c b/kernel/hung_task.c index 0c11216171c9..f108a95882c6 100644 --- a/kernel/hung_task.c +++ b/kernel/hung_task.c @@ -127,7 +127,7 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout) if (sysctl_hung_task_warnings > 0) sysctl_hung_task_warnings--; pr_err("INFO: task %s:%d blocked for more than %ld seconds.\n", - t->comm, t->pid, timeout); + t->comm, t->pid, (jiffies - t->last_switch_time) / HZ); pr_err(" %s %s %.*s\n", print_tainted(), init_utsname()->release, (int)strcspn(init_utsname()->version, " "), -- cgit v1.3-7-g2ca7 From 21f63a5da2499a1286d36986d5e02db96c350d8d Mon Sep 17 00:00:00 2001 From: Mathieu Malaterre Date: Thu, 7 Mar 2019 16:26:53 -0800 Subject: kernel/sys: annotate implicit fall through There is a plan to build the kernel with -Wimplicit-fallthrough and this place in the code produced a warning (W=1). This commit remove the following warning: kernel/sys.c:1748:6: warning: this statement may fall through [-Wimplicit-fallthrough=] Link: http://lkml.kernel.org/r/20190114203347.17530-1-malat@debian.org Signed-off-by: Mathieu Malaterre Reviewed-by: Andrew Morton Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/sys.c | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/sys.c b/kernel/sys.c index f7eb62eceb24..dc5d9e636d48 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -1747,6 +1747,7 @@ void getrusage(struct task_struct *p, int who, struct rusage *r) if (who == RUSAGE_CHILDREN) break; + /* fall through */ case RUSAGE_SELF: thread_group_cputime_adjusted(p, &tgutime, &tgstime); -- cgit v1.3-7-g2ca7 From 513770f54edba8b19c2175a151e02f1dfc911d87 Mon Sep 17 00:00:00 2001 From: Rasmus Villemoes Date: Thu, 7 Mar 2019 16:27:48 -0800 Subject: dynamic_debug: move pr_err from module.c to ddebug_add_module This serves two purposes: First, we get a diagnostic if (though extremely unlikely), any of the calls of ddebug_add_module for built-in code fails, effectively disabling dynamic_debug. Second, I want to make struct _ddebug opaque, and avoid accessing any of its members outside dynamic_debug.[ch]. Link: http://lkml.kernel.org/r/20190212214150.4807-9-linux@rasmusvillemoes.dk Signed-off-by: Rasmus Villemoes Acked-by: Jason Baron Cc: David Sterba Cc: Greg Kroah-Hartman Cc: Ingo Molnar Cc: Petr Mladek Cc: "Rafael J . Wysocki" Cc: Steven Rostedt Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/module.c | 4 +--- lib/dynamic_debug.c | 4 +++- 2 files changed, 4 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/module.c b/kernel/module.c index 2ad1b5239910..7b1d437c1ea6 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -2720,9 +2720,7 @@ static void dynamic_debug_setup(struct module *mod, struct _ddebug *debug, unsig if (!debug) return; #ifdef CONFIG_DYNAMIC_DEBUG - if (ddebug_add_module(debug, num, mod->name)) - pr_err("dynamic debug error adding module: %s\n", - debug->modname); + ddebug_add_module(debug, num, mod->name); #endif } diff --git a/lib/dynamic_debug.c b/lib/dynamic_debug.c index 7b76f43edaef..7bdf98c37e91 100644 --- a/lib/dynamic_debug.c +++ b/lib/dynamic_debug.c @@ -849,8 +849,10 @@ int ddebug_add_module(struct _ddebug *tab, unsigned int n, struct ddebug_table *dt; dt = kzalloc(sizeof(*dt), GFP_KERNEL); - if (dt == NULL) + if (dt == NULL) { + pr_err("error adding module: %s\n", name); return -ENOMEM; + } /* * For built-in modules, name lives in .rodata and is * immortal. For loaded modules, name points at the name[] -- cgit v1.3-7-g2ca7 From a4507fedcd2580d510d8d91ac6b99537f869f62a Mon Sep 17 00:00:00 2001 From: Rasmus Villemoes Date: Thu, 7 Mar 2019 16:27:52 -0800 Subject: dynamic_debug: add static inline stub for ddebug_add_module For symmetry with ddebug_remove_module, and to avoid a bit of ifdeffery in module.c, move the declaration of ddebug_add_module inside #if defined(CONFIG_DYNAMIC_DEBUG) and add a corresponding no-op stub in the #else branch. Link: http://lkml.kernel.org/r/20190212214150.4807-10-linux@rasmusvillemoes.dk Signed-off-by: Rasmus Villemoes Acked-by: Jason Baron Cc: David Sterba Cc: Greg Kroah-Hartman Cc: Ingo Molnar Cc: Petr Mladek Cc: "Rafael J . Wysocki" Cc: Steven Rostedt Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/dynamic_debug.h | 10 ++++++++-- kernel/module.c | 2 -- 2 files changed, 8 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/include/linux/dynamic_debug.h b/include/linux/dynamic_debug.h index b17725400f75..3f8977cfa479 100644 --- a/include/linux/dynamic_debug.h +++ b/include/linux/dynamic_debug.h @@ -47,10 +47,10 @@ struct _ddebug { } __attribute__((aligned(8))); -int ddebug_add_module(struct _ddebug *tab, unsigned int n, - const char *modname); #if defined(CONFIG_DYNAMIC_DEBUG) +int ddebug_add_module(struct _ddebug *tab, unsigned int n, + const char *modname); extern int ddebug_remove_module(const char *mod_name); extern __printf(2, 3) void __dynamic_pr_debug(struct _ddebug *descriptor, const char *fmt, ...); @@ -152,6 +152,12 @@ do { \ #include #include +static inline int ddebug_add_module(struct _ddebug *tab, unsigned int n, + const char *modname) +{ + return 0; +} + static inline int ddebug_remove_module(const char *mod) { return 0; diff --git a/kernel/module.c b/kernel/module.c index 7b1d437c1ea6..0b9aa8ab89f0 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -2719,9 +2719,7 @@ static void dynamic_debug_setup(struct module *mod, struct _ddebug *debug, unsig { if (!debug) return; -#ifdef CONFIG_DYNAMIC_DEBUG ddebug_add_module(debug, num, mod->name); -#endif } static void dynamic_debug_remove(struct module *mod, struct _ddebug *debug) -- cgit v1.3-7-g2ca7 From 4b0470027528ba98f9617f4ceba328de71d2fe49 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Thu, 7 Mar 2019 16:29:30 -0800 Subject: kernel: workqueue: clarify wq_worker_last_func() caller requirements This function can only be called safely from very specific scheduler contexts. Document those. Link: http://lkml.kernel.org/r/20190206150528.31198-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner Suggested-by: Andrew Morton Acked-by: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/workqueue.c | 10 ++++++++++ 1 file changed, 10 insertions(+) (limited to 'kernel') diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 56814902bc56..d51c37dd9422 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -920,6 +920,16 @@ struct task_struct *wq_worker_sleeping(struct task_struct *task) * CONTEXT: * spin_lock_irq(rq->lock) * + * This function is called during schedule() when a kworker is going + * to sleep. It's used by psi to identify aggregation workers during + * dequeuing, to allow periodic aggregation to shut-off when that + * worker is the last task in the system or cgroup to go to sleep. + * + * As this function doesn't involve any workqueue-related locking, it + * only returns stable values when called from inside the scheduler's + * queuing and dequeuing paths, when @task, which must be a kworker, + * is guaranteed to not be processing any works. + * * Return: * The last work function %current executed as a worker, NULL if it * hasn't executed any work yet. -- cgit v1.3-7-g2ca7 From 7f2923c4f73f21cfd714d12a2d48de8c21f11cfe Mon Sep 17 00:00:00 2001 From: Christian Brauner Date: Thu, 7 Mar 2019 16:29:40 -0800 Subject: sysctl: handle overflow in proc_get_long proc_get_long() is a funny function. It uses simple_strtoul() and for a good reason. proc_get_long() wants to always succeed the parse and return the maybe incorrect value and the trailing characters to check against a pre-defined list of acceptable trailing values. However, simple_strtoul() explicitly ignores overflows which can cause funny things like the following to happen: echo 18446744073709551616 > /proc/sys/fs/file-max cat /proc/sys/fs/file-max 0 (Which will cause your system to silently die behind your back.) On the other hand kstrtoul() does do overflow detection but does not return the trailing characters, and also fails the parse when anything other than '\n' is a trailing character whereas proc_get_long() wants to be more lenient. Now, before adding another kstrtoul() function let's simply add a static parse strtoul_lenient() which: - fails on overflow with -ERANGE - returns the trailing characters to the caller The reason why we should fail on ERANGE is that we already do a partial fail on overflow right now. Namely, when the TMPBUFLEN is exceeded. So we already reject values such as 184467440737095516160 (21 chars) but accept values such as 18446744073709551616 (20 chars) but both are overflows. So we should just always reject 64bit overflows and not special-case this based on the number of chars. Link: http://lkml.kernel.org/r/20190107222700.15954-2-christian@brauner.io Signed-off-by: Christian Brauner Acked-by: Kees Cook Cc: "Eric W. Biederman" Cc: Luis Chamberlain Cc: Joe Lawrence Cc: Waiman Long Cc: Dominik Brodowski Cc: Al Viro Cc: Alexey Dobriyan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/sysctl.c | 40 +++++++++++++++++++++++++++++++++++++++- 1 file changed, 39 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 14f30b4a1b64..1877ebe85c95 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -67,6 +67,8 @@ #include #include +#include "../lib/kstrtox.h" + #include #include @@ -2117,6 +2119,41 @@ static void proc_skip_char(char **buf, size_t *size, const char v) } } +/** + * strtoul_lenient - parse an ASCII formatted integer from a buffer and only + * fail on overflow + * + * @cp: kernel buffer containing the string to parse + * @endp: pointer to store the trailing characters + * @base: the base to use + * @res: where the parsed integer will be stored + * + * In case of success 0 is returned and @res will contain the parsed integer, + * @endp will hold any trailing characters. + * This function will fail the parse on overflow. If there wasn't an overflow + * the function will defer the decision what characters count as invalid to the + * caller. + */ +static int strtoul_lenient(const char *cp, char **endp, unsigned int base, + unsigned long *res) +{ + unsigned long long result; + unsigned int rv; + + cp = _parse_integer_fixup_radix(cp, &base); + rv = _parse_integer(cp, base, &result); + if ((rv & KSTRTOX_OVERFLOW) || (result != (unsigned long)result)) + return -ERANGE; + + cp += rv; + + if (endp) + *endp = (char *)cp; + + *res = (unsigned long)result; + return 0; +} + #define TMPBUFLEN 22 /** * proc_get_long - reads an ASCII formatted integer from a user buffer @@ -2160,7 +2197,8 @@ static int proc_get_long(char **buf, size_t *size, if (!isdigit(*p)) return -EINVAL; - *val = simple_strtoul(p, &p, 0); + if (strtoul_lenient(p, &p, 0, val)) + return -EINVAL; len = p - tmp; -- cgit v1.3-7-g2ca7 From 32a5ad9c22852e6bd9e74bdec5934ef9d1480bc5 Mon Sep 17 00:00:00 2001 From: Christian Brauner Date: Thu, 7 Mar 2019 16:29:43 -0800 Subject: sysctl: handle overflow for file-max Currently, when writing echo 18446744073709551616 > /proc/sys/fs/file-max /proc/sys/fs/file-max will overflow and be set to 0. That quickly crashes the system. This commit sets the max and min value for file-max. The max value is set to long int. Any higher value cannot currently be used as the percpu counters are long ints and not unsigned integers. Note that the file-max value is ultimately parsed via __do_proc_doulongvec_minmax(). This function does not report error when min or max are exceeded. Which means if a value largen that long int is written userspace will not receive an error instead the old value will be kept. There is an argument to be made that this should be changed and __do_proc_doulongvec_minmax() should return an error when a dedicated min or max value are exceeded. However this has the potential to break userspace so let's defer this to an RFC patch. Link: http://lkml.kernel.org/r/20190107222700.15954-3-christian@brauner.io Signed-off-by: Christian Brauner Acked-by: Kees Cook Cc: Alexey Dobriyan Cc: Al Viro Cc: Dominik Brodowski Cc: "Eric W. Biederman" Cc: Joe Lawrence Cc: Luis Chamberlain Cc: Waiman Long [christian@brauner.io: v4] Link: http://lkml.kernel.org/r/20190210203943.8227-3-christian@brauner.io Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/sysctl.c | 3 +++ 1 file changed, 3 insertions(+) (limited to 'kernel') diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 1877ebe85c95..3fb1405f3f8c 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -129,6 +129,7 @@ static int __maybe_unused one = 1; static int __maybe_unused two = 2; static int __maybe_unused four = 4; static unsigned long one_ul = 1; +static unsigned long long_max = LONG_MAX; static int one_hundred = 100; static int one_thousand = 1000; #ifdef CONFIG_PRINTK @@ -1749,6 +1750,8 @@ static struct ctl_table fs_table[] = { .maxlen = sizeof(files_stat.max_files), .mode = 0644, .proc_handler = proc_doulongvec_minmax, + .extra1 = &zero, + .extra2 = &long_max, }, { .procname = "nr_open", -- cgit v1.3-7-g2ca7 From 9abdb50cda0ffe33bbb2e40cbad97b32fb7ff892 Mon Sep 17 00:00:00 2001 From: "Gustavo A. R. Silva" Date: Thu, 7 Mar 2019 16:29:47 -0800 Subject: kernel/gcov/gcc_3_4.c: use struct_size() in kzalloc() One of the more common cases of allocation size calculations is finding the size of a structure that has a zero-sized array at the end, along with memory for some number of elements for that array. For example: struct foo { int stuff; void *entry[]; }; instance = kzalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL); Instead of leaving these open-coded and prone to type mistakes, we can now use the new struct_size() helper: instance = kzalloc(struct_size(instance, entry, count), GFP_KERNEL); This code was detected with the help of Coccinelle. Link: http://lkml.kernel.org/r/20190109172445.GA15908@embeddedor Signed-off-by: Gustavo A. R. Silva Reviewed-by: Peter Oberparleiter Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/gcov/gcc_3_4.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/gcov/gcc_3_4.c b/kernel/gcov/gcc_3_4.c index 1e32e66c9563..2dddecbdbe6e 100644 --- a/kernel/gcov/gcc_3_4.c +++ b/kernel/gcov/gcc_3_4.c @@ -245,8 +245,7 @@ struct gcov_info *gcov_info_dup(struct gcov_info *info) /* Duplicate gcov_info. */ active = num_counter_active(info); - dup = kzalloc(sizeof(struct gcov_info) + - sizeof(struct gcov_ctr_info) * active, GFP_KERNEL); + dup = kzalloc(struct_size(dup, counts, active), GFP_KERNEL); if (!dup) return NULL; dup->version = info->version; @@ -364,8 +363,7 @@ struct gcov_iterator *gcov_iter_new(struct gcov_info *info) { struct gcov_iterator *iter; - iter = kzalloc(sizeof(struct gcov_iterator) + - num_counter_active(info) * sizeof(struct type_info), + iter = kzalloc(struct_size(iter, type_info, num_counter_active(info)), GFP_KERNEL); if (iter) iter->info = info; -- cgit v1.3-7-g2ca7 From 13610aa908dcfce77135bb799c0a10d0172da6ba Mon Sep 17 00:00:00 2001 From: Masahiro Yamada Date: Thu, 7 Mar 2019 16:29:53 -0800 Subject: kernel/configs: use .incbin directive to embed config_data.gz This slightly optimizes the kernel/configs.c build. bin2c is not very efficient because it converts a data file into a huge array to embed it into a *.c file. Instead, we can use the .incbin directive. Also, this simplifies the code; Makefile is cleaner, and the way to get the offset/size of the config_data.gz is more straightforward. I used the "asm" statement in *.c instead of splitting it into *.S because MODULE_* tags are not supported in *.S files. I also cleaned up kernel/.gitignore; "config_data.gz" is unneeded because the top-level .gitignore takes care of the "*.gz" pattern. [yamada.masahiro@socionext.com: v2] Link: http://lkml.kernel.org/r/1550108893-21226-1-git-send-email-yamada.masahiro@socionext.com Link: http://lkml.kernel.org/r/1549941160-8084-1-git-send-email-yamada.masahiro@socionext.com Signed-off-by: Masahiro Yamada Cc: Randy Dunlap Cc: Arnd Bergmann Cc: Alexander Popov Cc: Kees Cook Cc: Jonathan Corbet Cc: Thomas Gleixner Cc: Dan Williams Cc: Mathieu Desnoyers Cc: Richard Guy Briggs Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/dontdiff | 1 - kernel/.gitignore | 2 -- kernel/Makefile | 11 +---------- kernel/configs.c | 42 ++++++++++++++++++++---------------------- 4 files changed, 21 insertions(+), 35 deletions(-) (limited to 'kernel') diff --git a/Documentation/dontdiff b/Documentation/dontdiff index 2228fcc8e29f..ef25a066d952 100644 --- a/Documentation/dontdiff +++ b/Documentation/dontdiff @@ -106,7 +106,6 @@ compile.h* conf config config-* -config_data.h* config.mak config.mak.autogen conmakehash diff --git a/kernel/.gitignore b/kernel/.gitignore index b3097bde4e9c..6e699100872f 100644 --- a/kernel/.gitignore +++ b/kernel/.gitignore @@ -1,7 +1,5 @@ # # Generated files # -config_data.h -config_data.gz timeconst.h hz.bc diff --git a/kernel/Makefile b/kernel/Makefile index 6aa7543bcdb2..6c57e78817da 100644 --- a/kernel/Makefile +++ b/kernel/Makefile @@ -116,17 +116,8 @@ obj-$(CONFIG_GCC_PLUGIN_STACKLEAK) += stackleak.o KASAN_SANITIZE_stackleak.o := n KCOV_INSTRUMENT_stackleak.o := n -$(obj)/configs.o: $(obj)/config_data.h +$(obj)/configs.o: $(obj)/config_data.gz targets += config_data.gz $(obj)/config_data.gz: $(KCONFIG_CONFIG) FORCE $(call if_changed,gzip) - -filechk_ikconfiggz = \ - echo "static const char kernel_config_data[] __used = MAGIC_START"; \ - cat $< | scripts/bin2c; \ - echo "MAGIC_END;" - -targets += config_data.h -$(obj)/config_data.h: $(obj)/config_data.gz FORCE - $(call filechk,ikconfiggz) diff --git a/kernel/configs.c b/kernel/configs.c index 2df132b20217..b062425ccf8d 100644 --- a/kernel/configs.c +++ b/kernel/configs.c @@ -30,37 +30,35 @@ #include #include -/**************************************************/ -/* the actual current config file */ - /* - * Define kernel_config_data and kernel_config_data_size, which contains the - * wrapped and compressed configuration file. The file is first compressed - * with gzip and then bounded by two eight byte magic numbers to allow - * extraction from a binary kernel image: - * - * IKCFG_ST - * - * IKCFG_ED + * "IKCFG_ST" and "IKCFG_ED" are used to extract the config data from + * a binary kernel image or a module. See scripts/extract-ikconfig. */ -#define MAGIC_START "IKCFG_ST" -#define MAGIC_END "IKCFG_ED" -#include "config_data.h" - - -#define MAGIC_SIZE (sizeof(MAGIC_START) - 1) -#define kernel_config_data_size \ - (sizeof(kernel_config_data) - 1 - MAGIC_SIZE * 2) +asm ( +" .pushsection .rodata, \"a\" \n" +" .ascii \"IKCFG_ST\" \n" +" .global kernel_config_data \n" +"kernel_config_data: \n" +" .incbin \"kernel/config_data.gz\" \n" +" .global kernel_config_data_end \n" +"kernel_config_data_end: \n" +" .ascii \"IKCFG_ED\" \n" +" .popsection \n" +); #ifdef CONFIG_IKCONFIG_PROC +extern char kernel_config_data; +extern char kernel_config_data_end; + static ssize_t ikconfig_read_current(struct file *file, char __user *buf, size_t len, loff_t * offset) { return simple_read_from_buffer(buf, len, offset, - kernel_config_data + MAGIC_SIZE, - kernel_config_data_size); + &kernel_config_data, + &kernel_config_data_end - + &kernel_config_data); } static const struct file_operations ikconfig_file_ops = { @@ -79,7 +77,7 @@ static int __init ikconfig_init(void) if (!entry) return -ENOMEM; - proc_set_size(entry, kernel_config_data_size); + proc_set_size(entry, &kernel_config_data_end - &kernel_config_data); return 0; } -- cgit v1.3-7-g2ca7 From ec9672d57670d495404f36ab8b651bfefc0ea10b Mon Sep 17 00:00:00 2001 From: Greg Kroah-Hartman Date: Thu, 7 Mar 2019 16:29:56 -0800 Subject: kcov: no need to check return value of debugfs_create functions When calling debugfs functions, there is no need to ever check the return value. The function can work or not, but the code logic should never do something different based on this. Link: http://lkml.kernel.org/r/20190122152151.16139-46-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman Cc: Andrey Ryabinin Cc: Mark Rutland Cc: Arnd Bergmann Cc: "Steven Rostedt (VMware)" Cc: Dmitry Vyukov Cc: Anders Roxell Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/kcov.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/kcov.c b/kernel/kcov.c index c2277dbdbfb1..5b0bb281c1a0 100644 --- a/kernel/kcov.c +++ b/kernel/kcov.c @@ -444,10 +444,8 @@ static int __init kcov_init(void) * there is no need to protect it against removal races. The * use of debugfs_create_file_unsafe() is actually safe here. */ - if (!debugfs_create_file_unsafe("kcov", 0600, NULL, NULL, &kcov_fops)) { - pr_err("failed to create kcov in debugfs\n"); - return -ENOMEM; - } + debugfs_create_file_unsafe("kcov", 0600, NULL, NULL, &kcov_fops); + return 0; } -- cgit v1.3-7-g2ca7 From 39e07cb60860e3162fc377380b8a60409315681e Mon Sep 17 00:00:00 2001 From: Elena Reshetova Date: Thu, 7 Mar 2019 16:30:00 -0800 Subject: kcov: convert kcov.refcount to refcount_t atomic_t variables are currently used to implement reference counters with the following properties: - counter is initialized to 1 using atomic_set() - a resource is freed upon counter reaching zero - once counter reaches zero, its further increments aren't allowed - counter schema uses basic atomic operations (set, inc, inc_not_zero, dec_and_test, etc.) Such atomic variables should be converted to a newly provided refcount_t type and API that prevents accidental counter overflows and underflows. This is important since overflows and underflows can lead to use-after-free situation and be exploitable. The variable kcov.refcount is used as pure reference counter. Convert it to refcount_t and fix up the operations. **Important note for maintainers: Some functions from refcount_t API defined in lib/refcount.c have different memory ordering guarantees than their atomic counterparts. The full comparison can be seen in https://lkml.org/lkml/2017/11/15/57 and it is hopefully soon in state to be merged to the documentation tree. Normally the differences should not matter since refcount_t provides enough guarantees to satisfy the refcounting use cases, but in some rare cases it might matter. Please double check that you don't have some undocumented memory guarantees for this variable usage. For the kcov.refcount it might make a difference in following places: - kcov_put(): decrement in refcount_dec_and_test() only provides RELEASE ordering and control dependency on success vs. fully ordered atomic counterpart Link: http://lkml.kernel.org/r/1547634429-772-1-git-send-email-elena.reshetova@intel.com Signed-off-by: Elena Reshetova Suggested-by: Kees Cook Reviewed-by: David Windsor Reviewed-by: Hans Liljestrand Reviewed-by: Dmitry Vyukov Reviewed-by: Andrea Parri Cc: Mark Rutland Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/kcov.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/kcov.c b/kernel/kcov.c index 5b0bb281c1a0..2ee38727844a 100644 --- a/kernel/kcov.c +++ b/kernel/kcov.c @@ -20,6 +20,7 @@ #include #include #include +#include #include /* Number of 64-bit words written per one comparison: */ @@ -44,7 +45,7 @@ struct kcov { * - opened file descriptor * - task with enabled coverage (we can't unwire it from another task) */ - atomic_t refcount; + refcount_t refcount; /* The lock protects mode, size, area and t. */ spinlock_t lock; enum kcov_mode mode; @@ -228,12 +229,12 @@ EXPORT_SYMBOL(__sanitizer_cov_trace_switch); static void kcov_get(struct kcov *kcov) { - atomic_inc(&kcov->refcount); + refcount_inc(&kcov->refcount); } static void kcov_put(struct kcov *kcov) { - if (atomic_dec_and_test(&kcov->refcount)) { + if (refcount_dec_and_test(&kcov->refcount)) { vfree(kcov->area); kfree(kcov); } @@ -312,7 +313,7 @@ static int kcov_open(struct inode *inode, struct file *filep) if (!kcov) return -ENOMEM; kcov->mode = KCOV_MODE_DISABLED; - atomic_set(&kcov->refcount, 1); + refcount_set(&kcov->refcount, 1); spin_lock_init(&kcov->lock); filep->private_data = kcov; return nonseekable_open(inode, filep); -- cgit v1.3-7-g2ca7 From fd2081ffce4e8aa3b2085be3bc584523ddeedf02 Mon Sep 17 00:00:00 2001 From: YueHaibing Date: Thu, 7 Mar 2019 16:31:31 -0800 Subject: kernel/fork.c: remove duplicated include Remove duplicated include. Link: http://lkml.kernel.org/r/20181209062952.17736-1-yuehaibing@huawei.com Signed-off-by: YueHaibing Reviewed-by: Stephen Rothwell Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/fork.c | 1 - 1 file changed, 1 deletion(-) (limited to 'kernel') diff --git a/kernel/fork.c b/kernel/fork.c index 77059b211608..874e48c410f8 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -77,7 +77,6 @@ #include #include #include -#include #include #include #include -- cgit v1.3-7-g2ca7 From 5768402fd9c6e872252b5268ad85e3fbae4fe26b Mon Sep 17 00:00:00 2001 From: Alexander Shishkin Date: Fri, 15 Feb 2019 13:47:27 +0200 Subject: perf/ring_buffer: Use high order allocations for AUX buffers optimistically Currently, the AUX buffer allocator will use high-order allocations for PMUs that don't support hardware scatter-gather chaining to ensure large contiguous blocks of pages, and always use an array of single pages otherwise. There is, however, a tangible performance benefit in using larger chunks of contiguous memory even in the latter case, that comes from not having to fetch the next page's address at every page boundary. In particular, a task running under Intel PT on an Atom CPU shows 1.5%-2% less runtime penalty with a single multi-page output region in snapshot mode (no PMI) than with multiple single-page output regions, from ~6% down to ~4%. For the snapshot mode it does make a difference as it is intended to run over long periods of time. For this reason, change the allocation policy to always optimistically start with the highest possible order when allocating pages for the AUX buffer, desceding until the allocation succeeds or order zero allocation fails. Signed-off-by: Alexander Shishkin Signed-off-by: Peter Zijlstra (Intel) Cc: Andy Lutomirski Cc: Arnaldo Carvalho de Melo Cc: Borislav Petkov Cc: Dave Hansen Cc: H. Peter Anvin Cc: Jiri Olsa Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Rik van Riel Cc: Stephane Eranian Cc: Thomas Gleixner Cc: Vince Weaver Link: https://lkml.kernel.org/r/20190215114727.62648-2-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar --- kernel/events/ring_buffer.c | 32 +++++++++++++++----------------- 1 file changed, 15 insertions(+), 17 deletions(-) (limited to 'kernel') diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c index 678ccec60d8f..a4047321d7d8 100644 --- a/kernel/events/ring_buffer.c +++ b/kernel/events/ring_buffer.c @@ -598,29 +598,27 @@ int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event, { bool overwrite = !(flags & RING_BUFFER_WRITABLE); int node = (event->cpu == -1) ? -1 : cpu_to_node(event->cpu); - int ret = -ENOMEM, max_order = 0; + int ret = -ENOMEM, max_order; if (!has_aux(event)) return -EOPNOTSUPP; - if (event->pmu->capabilities & PERF_PMU_CAP_AUX_NO_SG) { - /* - * We need to start with the max_order that fits in nr_pages, - * not the other way around, hence ilog2() and not get_order. - */ - max_order = ilog2(nr_pages); + /* + * We need to start with the max_order that fits in nr_pages, + * not the other way around, hence ilog2() and not get_order. + */ + max_order = ilog2(nr_pages); - /* - * PMU requests more than one contiguous chunks of memory - * for SW double buffering - */ - if ((event->pmu->capabilities & PERF_PMU_CAP_AUX_SW_DOUBLEBUF) && - !overwrite) { - if (!max_order) - return -EINVAL; + /* + * PMU requests more than one contiguous chunks of memory + * for SW double buffering + */ + if ((event->pmu->capabilities & PERF_PMU_CAP_AUX_SW_DOUBLEBUF) && + !overwrite) { + if (!max_order) + return -EINVAL; - max_order--; - } + max_order--; } rb->aux_pages = kcalloc_node(nr_pages, sizeof(void *), GFP_KERNEL, -- cgit v1.3-7-g2ca7 From 43aa378b41700650e4ddbd068650f9fe4ab496df Mon Sep 17 00:00:00 2001 From: "Gustavo A. R. Silva" Date: Tue, 12 Feb 2019 14:54:30 -0600 Subject: perf/core: Mark expected switch fall-through MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit In preparation to enabling -Wimplicit-fallthrough, mark switch cases where we are expecting to fall through. This patch fixes the following warning: kernel/events/core.c: In function ‘perf_event_parse_addr_filter’: kernel/events/core.c:9154:11: warning: this statement may fall through [-Wimplicit-fallthrough=] kernel = 1; ~~~~~~~^~~ kernel/events/core.c:9156:3: note: here case IF_SRC_FILEADDR: ^~~~ Warning level 3 was used: -Wimplicit-fallthrough=3 This patch is part of the ongoing efforts to enable -Wimplicit-fallthrough. Signed-off-by: Gustavo A. R. Silva Signed-off-by: Peter Zijlstra (Intel) Cc: Alexander Shishkin Cc: Andy Lutomirski Cc: Arnaldo Carvalho de Melo Cc: Arnaldo Carvalho de Melo Cc: Borislav Petkov Cc: Dave Hansen Cc: H. Peter Anvin Cc: Jiri Olsa Cc: Kees Cook Cc: Linus Torvalds Cc: Namhyung Kim Cc: Peter Zijlstra Cc: Rik van Riel Cc: Stephane Eranian Cc: Thomas Gleixner Cc: Vince Weaver Link: https://lkml.kernel.org/r/20190212205430.GA8446@embeddedor Signed-off-by: Ingo Molnar --- kernel/events/core.c | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/events/core.c b/kernel/events/core.c index 5f59d848171e..68ff130b99e7 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -9174,6 +9174,7 @@ perf_event_parse_addr_filter(struct perf_event *event, char *fstr, case IF_SRC_KERNELADDR: case IF_SRC_KERNEL: kernel = 1; + /* fall through */ case IF_SRC_FILEADDR: case IF_SRC_FILE: -- cgit v1.3-7-g2ca7 From 3fe7522fb766f6ee76bf7bc2837f1e3cc52c4e27 Mon Sep 17 00:00:00 2001 From: Arnd Bergmann Date: Thu, 7 Mar 2019 08:52:12 +0100 Subject: locking/lockdep: Avoid a Clang warning Clang warns about a tentative array definition without a length: kernel/locking/lockdep.c:845:12: error: tentative array definition assumed to have one element [-Werror] There is no real reason to do this here, so just set the same length as in the real definition later in the same file. It has to be hidden in an #ifdef or annotated __maybe_unused though, to avoid the unused-variable warning if CONFIG_PROVE_LOCKING is disabled. Signed-off-by: Arnd Bergmann Signed-off-by: Peter Zijlstra (Intel) Cc: Alexander Shishkin Cc: Andrew Morton Cc: Andy Lutomirski Cc: Arnaldo Carvalho de Melo Cc: Bart Van Assche Cc: Borislav Petkov Cc: Dave Hansen Cc: Frederic Weisbecker Cc: H. Peter Anvin Cc: Jiri Olsa Cc: Joel Fernandes (Google) Cc: Linus Torvalds Cc: Paul E. McKenney Cc: Peter Zijlstra Cc: Rik van Riel Cc: Stephane Eranian Cc: Steven Rostedt (VMware) Cc: Tetsuo Handa Cc: Thomas Gleixner Cc: Vince Weaver Cc: Waiman Long Cc: Will Deacon Link: https://lkml.kernel.org/r/20190307075222.3424524-1-arnd@arndb.de Signed-off-by: Ingo Molnar --- kernel/locking/lockdep.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 21cb81fe6359..35a144dfddf5 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -842,7 +842,9 @@ static bool class_lock_list_valid(struct lock_class *c, struct list_head *h) return true; } -static u16 chain_hlocks[]; +#ifdef CONFIG_PROVE_LOCKING +static u16 chain_hlocks[MAX_LOCKDEP_CHAIN_HLOCKS]; +#endif static bool check_lock_chain_key(struct lock_chain *chain) { -- cgit v1.3-7-g2ca7 From 0126574fca2ce0f0d5beb9dade6efb823ff7407b Mon Sep 17 00:00:00 2001 From: Bart Van Assche Date: Sun, 3 Mar 2019 10:19:01 -0800 Subject: locking/lockdep: Only call init_rcu_head() after RCU has been initialized init_data_structures_once() is called for the first time before RCU has been initialized. Make sure that init_rcu_head() is called before the RCU head is used and after RCU has been initialized. Signed-off-by: Bart Van Assche Signed-off-by: Peter Zijlstra (Intel) Cc: Andrew Morton Cc: Andy Lutomirski Cc: Borislav Petkov Cc: Dave Hansen Cc: H. Peter Anvin Cc: Linus Torvalds Cc: Paul E. McKenney Cc: Peter Zijlstra Cc: Rik van Riel Cc: Thomas Gleixner Cc: Will Deacon Cc: longman@redhat.com Link: https://lkml.kernel.org/r/c20aa0f0-42ab-a884-d931-7d4ec2bf0cdc@acm.org Signed-off-by: Ingo Molnar --- kernel/locking/lockdep.c | 15 +++++++++++---- 1 file changed, 11 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 35a144dfddf5..34cdcbedda49 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -982,15 +982,22 @@ static inline void check_data_structures(void) { } */ static void init_data_structures_once(void) { - static bool initialization_happened; + static bool ds_initialized, rcu_head_initialized; int i; - if (likely(initialization_happened)) + if (likely(rcu_head_initialized)) return; - initialization_happened = true; + if (system_state >= SYSTEM_SCHEDULING) { + init_rcu_head(&delayed_free.rcu_head); + rcu_head_initialized = true; + } + + if (ds_initialized) + return; + + ds_initialized = true; - init_rcu_head(&delayed_free.rcu_head); INIT_LIST_HEAD(&delayed_free.pf[0].zapped); INIT_LIST_HEAD(&delayed_free.pf[1].zapped); -- cgit v1.3-7-g2ca7 From 009bb421b6ceb7916ce627023d0eb7ced04c8910 Mon Sep 17 00:00:00 2001 From: Bart Van Assche Date: Sun, 3 Mar 2019 14:00:46 -0800 Subject: workqueue, lockdep: Fix an alloc_workqueue() error path This patch fixes a use-after-free and a memory leak in an alloc_workqueue() error path. Repoted by syzkaller and KASAN: BUG: KASAN: use-after-free in __read_once_size include/linux/compiler.h:197 [inline] BUG: KASAN: use-after-free in lockdep_register_key+0x3b9/0x490 kernel/locking/lockdep.c:1023 Read of size 8 at addr ffff888090fc2698 by task syz-executor134/7858 CPU: 1 PID: 7858 Comm: syz-executor134 Not tainted 5.0.0-rc8-next-20190301 #1 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x172/0x1f0 lib/dump_stack.c:113 print_address_description.cold+0x7c/0x20d mm/kasan/report.c:187 kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317 __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132 __read_once_size include/linux/compiler.h:197 [inline] lockdep_register_key+0x3b9/0x490 kernel/locking/lockdep.c:1023 wq_init_lockdep kernel/workqueue.c:3444 [inline] alloc_workqueue+0x427/0xe70 kernel/workqueue.c:4263 ucma_open+0x76/0x290 drivers/infiniband/core/ucma.c:1732 misc_open+0x398/0x4c0 drivers/char/misc.c:141 chrdev_open+0x247/0x6b0 fs/char_dev.c:417 do_dentry_open+0x488/0x1160 fs/open.c:771 vfs_open+0xa0/0xd0 fs/open.c:880 do_last fs/namei.c:3416 [inline] path_openat+0x10e9/0x46e0 fs/namei.c:3533 do_filp_open+0x1a1/0x280 fs/namei.c:3563 do_sys_open+0x3fe/0x5d0 fs/open.c:1063 __do_sys_openat fs/open.c:1090 [inline] __se_sys_openat fs/open.c:1084 [inline] __x64_sys_openat+0x9d/0x100 fs/open.c:1084 do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe Allocated by task 7789: save_stack+0x45/0xd0 mm/kasan/common.c:75 set_track mm/kasan/common.c:87 [inline] __kasan_kmalloc mm/kasan/common.c:497 [inline] __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:470 kasan_kmalloc+0x9/0x10 mm/kasan/common.c:511 __do_kmalloc mm/slab.c:3726 [inline] __kmalloc+0x15c/0x740 mm/slab.c:3735 kmalloc include/linux/slab.h:553 [inline] kzalloc include/linux/slab.h:743 [inline] alloc_workqueue+0x13c/0xe70 kernel/workqueue.c:4236 ucma_open+0x76/0x290 drivers/infiniband/core/ucma.c:1732 misc_open+0x398/0x4c0 drivers/char/misc.c:141 chrdev_open+0x247/0x6b0 fs/char_dev.c:417 do_dentry_open+0x488/0x1160 fs/open.c:771 vfs_open+0xa0/0xd0 fs/open.c:880 do_last fs/namei.c:3416 [inline] path_openat+0x10e9/0x46e0 fs/namei.c:3533 do_filp_open+0x1a1/0x280 fs/namei.c:3563 do_sys_open+0x3fe/0x5d0 fs/open.c:1063 __do_sys_openat fs/open.c:1090 [inline] __se_sys_openat fs/open.c:1084 [inline] __x64_sys_openat+0x9d/0x100 fs/open.c:1084 do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe Freed by task 7789: save_stack+0x45/0xd0 mm/kasan/common.c:75 set_track mm/kasan/common.c:87 [inline] __kasan_slab_free+0x102/0x150 mm/kasan/common.c:459 kasan_slab_free+0xe/0x10 mm/kasan/common.c:467 __cache_free mm/slab.c:3498 [inline] kfree+0xcf/0x230 mm/slab.c:3821 alloc_workqueue+0xc3e/0xe70 kernel/workqueue.c:4295 ucma_open+0x76/0x290 drivers/infiniband/core/ucma.c:1732 misc_open+0x398/0x4c0 drivers/char/misc.c:141 chrdev_open+0x247/0x6b0 fs/char_dev.c:417 do_dentry_open+0x488/0x1160 fs/open.c:771 vfs_open+0xa0/0xd0 fs/open.c:880 do_last fs/namei.c:3416 [inline] path_openat+0x10e9/0x46e0 fs/namei.c:3533 do_filp_open+0x1a1/0x280 fs/namei.c:3563 do_sys_open+0x3fe/0x5d0 fs/open.c:1063 __do_sys_openat fs/open.c:1090 [inline] __se_sys_openat fs/open.c:1084 [inline] __x64_sys_openat+0x9d/0x100 fs/open.c:1084 do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe The buggy address belongs to the object at ffff888090fc2580 which belongs to the cache kmalloc-512 of size 512 The buggy address is located 280 bytes inside of 512-byte region [ffff888090fc2580, ffff888090fc2780) Reported-by: syzbot+17335689e239ce135d8b@syzkaller.appspotmail.com Signed-off-by: Bart Van Assche Signed-off-by: Peter Zijlstra (Intel) Cc: Andrew Morton Cc: Andy Lutomirski Cc: Borislav Petkov Cc: Dave Hansen Cc: H. Peter Anvin Cc: Linus Torvalds Cc: Paul E. McKenney Cc: Peter Zijlstra Cc: Rik van Riel Cc: Thomas Gleixner Cc: Will Deacon Fixes: 669de8bda87b ("kernel/workqueue: Use dynamic lockdep keys for workqueues") Link: https://lkml.kernel.org/r/20190303220046.29448-1-bvanassche@acm.org Signed-off-by: Ingo Molnar --- kernel/workqueue.c | 2 ++ 1 file changed, 2 insertions(+) (limited to 'kernel') diff --git a/kernel/workqueue.c b/kernel/workqueue.c index e163e7a7f5e5..ecdd5e6f2166 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -4194,6 +4194,8 @@ struct workqueue_struct *alloc_workqueue(const char *fmt, return wq; err_free_wq: + wq_unregister_lockdep(wq); + wq_free_lockdep(wq); free_workqueue_attrs(wq->unbound_attrs); kfree(wq); return NULL; -- cgit v1.3-7-g2ca7 From 69a106c00e8554a7e6b3f4bb2967332670f89337 Mon Sep 17 00:00:00 2001 From: Qian Cai Date: Wed, 6 Mar 2019 19:27:31 -0500 Subject: workqueue, lockdep: Fix a memory leak in wq->lock_name The following commit: 669de8bda87b ("kernel/workqueue: Use dynamic lockdep keys for workqueues") introduced a memory leak as wq_free_lockdep() calls kfree(wq->lock_name), but wq_init_lockdep() does not point wq->lock_name to the newly allocated slab object. This can be reproduced by running LTP fallocate04 followed by oom01 tests: unreferenced object 0xc0000005876384d8 (size 64): comm "fallocate04", pid 26972, jiffies 4297139141 (age 40370.480s) hex dump (first 32 bytes): 28 77 71 5f 63 6f 6d 70 6c 65 74 69 6f 6e 29 65 (wq_completion)e 78 74 34 2d 72 73 76 2d 63 6f 6e 76 65 72 73 69 xt4-rsv-conversi backtrace: [<00000000cb452883>] kvasprintf+0x6c/0xe0 [<000000004654ddac>] kasprintf+0x34/0x60 [<000000001c68f311>] alloc_workqueue+0x1f8/0x6ac [<0000000003c2ad83>] ext4_fill_super+0x23d4/0x3c80 [ext4] [<0000000006610538>] mount_bdev+0x25c/0x290 [<00000000bcf955ec>] ext4_mount+0x28/0x50 [ext4] [<0000000016e08fd3>] legacy_get_tree+0x4c/0xb0 [<0000000042b6a5fc>] vfs_get_tree+0x6c/0x190 [<00000000268ab022>] do_mount+0xb9c/0x1100 [<00000000698e6898>] ksys_mount+0x158/0x180 [<0000000064e391fd>] sys_mount+0x20/0x30 [<00000000ba378f12>] system_call+0x5c/0x70 Signed-off-by: Qian Cai Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: Bart Van Assche Cc: Andrew Morton Cc: Andy Lutomirski Cc: Borislav Petkov Cc: Dave Hansen Cc: H. Peter Anvin Cc: Linus Torvalds Cc: Paul E. McKenney Cc: Peter Zijlstra Cc: Rik van Riel Cc: Thomas Gleixner Cc: Will Deacon Cc: catalin.marinas@arm.com Cc: jiangshanlai@gmail.com Cc: tj@kernel.org Fixes: 669de8bda87b ("kernel/workqueue: Use dynamic lockdep keys for workqueues") Link: https://lkml.kernel.org/r/20190307002731.47371-1-cai@lca.pw Signed-off-by: Ingo Molnar --- kernel/workqueue.c | 2 ++ 1 file changed, 2 insertions(+) (limited to 'kernel') diff --git a/kernel/workqueue.c b/kernel/workqueue.c index ecdd5e6f2166..e14da4f599ab 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -3348,6 +3348,8 @@ static void wq_init_lockdep(struct workqueue_struct *wq) lock_name = kasprintf(GFP_KERNEL, "%s%s", "(wq_completion)", wq->name); if (!lock_name) lock_name = wq->name; + + wq->lock_name = lock_name; lockdep_init_map(&wq->lockdep_map, lock_name, &wq->key, 0); } -- cgit v1.3-7-g2ca7 From d9c1bb2f6a2157b38e8eb63af437cb22701d31ee Mon Sep 17 00:00:00 2001 From: Stephane Eranian Date: Thu, 7 Mar 2019 10:52:33 -0800 Subject: perf/core: Restore mmap record type correctly On mmap(), perf_events generates a RECORD_MMAP record and then checks which events are interested in this record. There are currently 2 versions of mmap records: RECORD_MMAP and RECORD_MMAP2. MMAP2 is larger. The event configuration controls which version the user level tool accepts. If the event->attr.mmap2=1 field then MMAP2 record is returned. The perf_event_mmap_output() takes care of this. It checks attr->mmap2 and corrects the record fields before putting it in the sampling buffer of the event. At the end the function restores the modified MMAP record fields. The problem is that the function restores the size but not the type. Thus, if a subsequent event only accepts MMAP type, then it would instead receive an MMAP2 record with a size of MMAP record. This patch fixes the problem by restoring the record type on exit. Signed-off-by: Stephane Eranian Acked-by: Peter Zijlstra (Intel) Cc: Andi Kleen Cc: Jiri Olsa Cc: Kan Liang Fixes: 13d7a2410fa6 ("perf: Add attr->mmap2 attribute to an event") Link: http://lkml.kernel.org/r/20190307185233.225521-1-eranian@google.com Signed-off-by: Arnaldo Carvalho de Melo --- kernel/events/core.c | 2 ++ 1 file changed, 2 insertions(+) (limited to 'kernel') diff --git a/kernel/events/core.c b/kernel/events/core.c index 6fb27b564730..514b8e014a2d 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -7189,6 +7189,7 @@ static void perf_event_mmap_output(struct perf_event *event, struct perf_output_handle handle; struct perf_sample_data sample; int size = mmap_event->event_id.header.size; + u32 type = mmap_event->event_id.header.type; int ret; if (!perf_event_mmap_match(event, data)) @@ -7232,6 +7233,7 @@ static void perf_event_mmap_output(struct perf_event *event, perf_output_end(&handle); out: mmap_event->event_id.header.size = size; + mmap_event->event_id.header.type = type; } static void perf_event_mmap_event(struct perf_mmap_event *mmap_event) -- cgit v1.3-7-g2ca7 From 0841625201b649c0b1bf0f0e25cf4401e68fb8fd Mon Sep 17 00:00:00 2001 From: Valdis Klētnieks Date: Tue, 12 Mar 2019 04:52:58 -0400 Subject: tracing/probes: Make reserved_field_names static sparse complains: CHECK kernel/trace/trace_probe.c kernel/trace/trace_probe.c:16:12: warning: symbol 'reserved_field_names' was not declared. Should it be static? Yes, it should be static. Link: http://lkml.kernel.org/r/2478.1552380778@turing-police Signed-off-by: Valdis Kletnieks Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_probe.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c index 89da34b326e3..cfcf77e6fb19 100644 --- a/kernel/trace/trace_probe.c +++ b/kernel/trace/trace_probe.c @@ -13,7 +13,7 @@ #include "trace_probe.h" -const char *reserved_field_names[] = { +static const char *reserved_field_names[] = { "common_type", "common_flags", "common_preempt_count", -- cgit v1.3-7-g2ca7 From cede666e2eb28dc8a680d1622fded533769f07a4 Mon Sep 17 00:00:00 2001 From: Valdis Klētnieks Date: Tue, 12 Mar 2019 04:58:32 -0400 Subject: trace/probes: Remove kernel doc style from non kernel doc comment CC kernel/trace/trace_kprobe.o kernel/trace/trace_kprobe.c:41: warning: cannot understand function prototype: 'struct trace_kprobe ' The real problem is that a comment looked like kerneldoc when it shouldn't be... Link: http://lkml.kernel.org/r/2812.1552381112@turing-police Signed-off-by: Valdis Kletnieks Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_kprobe.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c index d5fb09ebba8b..ceafa0a2b1d1 100644 --- a/kernel/trace/trace_kprobe.c +++ b/kernel/trace/trace_kprobe.c @@ -35,7 +35,7 @@ static struct dyn_event_operations trace_kprobe_ops = { .match = trace_kprobe_match, }; -/** +/* * Kprobe event core functions */ struct trace_kprobe { -- cgit v1.3-7-g2ca7 From 8cf7630b29701d364f8df4a50e4f1f5e752b2778 Mon Sep 17 00:00:00 2001 From: Zev Weiss Date: Mon, 11 Mar 2019 23:28:02 -0700 Subject: kernel/sysctl.c: add missing range check in do_proc_dointvec_minmax_conv This bug has apparently existed since the introduction of this function in the pre-git era (4500e91754d3 in Thomas Gleixner's history.git, "[NET]: Add proc_dointvec_userhz_jiffies, use it for proper handling of neighbour sysctls."). As a minimal fix we can simply duplicate the corresponding check in do_proc_dointvec_conv(). Link: http://lkml.kernel.org/r/20190207123426.9202-3-zev@bewilderbeest.net Signed-off-by: Zev Weiss Cc: Brendan Higgins Cc: Iurii Zaikin Cc: Kees Cook Cc: Luis Chamberlain Cc: [2.6.2+] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/sysctl.c | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 7bb3988425ee..0854197e0e67 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -2645,7 +2645,16 @@ static int do_proc_dointvec_minmax_conv(bool *negp, unsigned long *lvalp, { struct do_proc_dointvec_minmax_conv_param *param = data; if (write) { - int val = *negp ? -*lvalp : *lvalp; + int val; + if (*negp) { + if (*lvalp > (unsigned long) INT_MAX + 1) + return -EINVAL; + val = -*lvalp; + } else { + if (*lvalp > (unsigned long) INT_MAX) + return -EINVAL; + val = *lvalp; + } if ((param->min && *param->min > val) || (param->max && *param->max < val)) return -EINVAL; -- cgit v1.3-7-g2ca7 From 2bc4fc60fb3e1a4cae429b889720755ebe9085bb Mon Sep 17 00:00:00 2001 From: Zev Weiss Date: Mon, 11 Mar 2019 23:28:06 -0700 Subject: kernel/sysctl.c: define minmax conv functions in terms of non-minmax versions do_proc_do[u]intvec_minmax_conv() had included open-coded versions of do_proc_do[u]intvec_conv(); the duplication led to buggy inconsistencies (missing range checks). To reduce the likelihood of such problems in the future, we can instead refactor both to be defined in terms of their non-bounded counterparts (plus the added check). Link: http://lkml.kernel.org/r/20190207165138.5oud57vq4ozwb4kh@hatter.bewilderbeest.net Signed-off-by: Zev Weiss Cc: Brendan Higgins Cc: Iurii Zaikin Cc: Kees Cook Cc: Luis Chamberlain Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/sysctl.c | 59 +++++++++++++++++++++++++-------------------------------- 1 file changed, 26 insertions(+), 33 deletions(-) (limited to 'kernel') diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 0854197e0e67..e5da394d1ca3 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -2643,32 +2643,25 @@ static int do_proc_dointvec_minmax_conv(bool *negp, unsigned long *lvalp, int *valp, int write, void *data) { + int tmp, ret; struct do_proc_dointvec_minmax_conv_param *param = data; + /* + * If writing, first do so via a temporary local int so we can + * bounds-check it before touching *valp. + */ + int *ip = write ? &tmp : valp; + + ret = do_proc_dointvec_conv(negp, lvalp, ip, write, data); + if (ret) + return ret; + if (write) { - int val; - if (*negp) { - if (*lvalp > (unsigned long) INT_MAX + 1) - return -EINVAL; - val = -*lvalp; - } else { - if (*lvalp > (unsigned long) INT_MAX) - return -EINVAL; - val = *lvalp; - } - if ((param->min && *param->min > val) || - (param->max && *param->max < val)) + if ((param->min && *param->min > tmp) || + (param->max && *param->max < tmp)) return -EINVAL; - *valp = val; - } else { - int val = *valp; - if (val < 0) { - *negp = true; - *lvalp = -(unsigned long)val; - } else { - *negp = false; - *lvalp = (unsigned long)val; - } + *valp = tmp; } + return 0; } @@ -2717,22 +2710,22 @@ static int do_proc_douintvec_minmax_conv(unsigned long *lvalp, unsigned int *valp, int write, void *data) { + int ret; + unsigned int tmp; struct do_proc_douintvec_minmax_conv_param *param = data; + /* write via temporary local uint for bounds-checking */ + unsigned int *up = write ? &tmp : valp; - if (write) { - unsigned int val = *lvalp; + ret = do_proc_douintvec_conv(lvalp, up, write, data); + if (ret) + return ret; - if (*lvalp > UINT_MAX) - return -EINVAL; - - if ((param->min && *param->min > val) || - (param->max && *param->max < val)) + if (write) { + if ((param->min && *param->min > tmp) || + (param->max && *param->max < tmp)) return -ERANGE; - *valp = val; - } else { - unsigned int val = *valp; - *lvalp = (unsigned long) val; + *valp = tmp; } return 0; -- cgit v1.3-7-g2ca7 From a0bf842e89a3842162aa8514b9bf4611c86fee10 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Mon, 11 Mar 2019 23:30:26 -0700 Subject: swiotlb: add checks for the return value of memblock_alloc*() Add panic() calls if memblock_alloc() returns NULL. The panic() format duplicates the one used by memblock itself and in order to avoid explosion with long parameters list replace open coded allocation size calculations with a local variable. Link: http://lkml.kernel.org/r/1548057848-15136-19-git-send-email-rppt@linux.ibm.com Signed-off-by: Mike Rapoport Cc: Catalin Marinas Cc: Christophe Leroy Cc: Christoph Hellwig Cc: "David S. Miller" Cc: Dennis Zhou Cc: Geert Uytterhoeven Cc: Greentime Hu Cc: Greg Kroah-Hartman Cc: Guan Xuetao Cc: Guo Ren Cc: Guo Ren [c-sky] Cc: Heiko Carstens Cc: Juergen Gross [Xen] Cc: Mark Salter Cc: Matt Turner Cc: Max Filippov Cc: Michael Ellerman Cc: Michal Simek Cc: Paul Burton Cc: Petr Mladek Cc: Richard Weinberger Cc: Rich Felker Cc: Rob Herring Cc: Rob Herring Cc: Russell King Cc: Stafford Horne Cc: Tony Luck Cc: Vineet Gupta Cc: Yoshinori Sato Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/dma/swiotlb.c | 19 +++++++++++++------ 1 file changed, 13 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c index 9f5851064aea..dd6a8e2d53a7 100644 --- a/kernel/dma/swiotlb.c +++ b/kernel/dma/swiotlb.c @@ -199,6 +199,7 @@ void __init swiotlb_update_mem_attributes(void) int __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose) { unsigned long i, bytes; + size_t alloc_size; bytes = nslabs << IO_TLB_SHIFT; @@ -211,12 +212,18 @@ int __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose) * to find contiguous free memory regions of size up to IO_TLB_SEGSIZE * between io_tlb_start and io_tlb_end. */ - io_tlb_list = memblock_alloc( - PAGE_ALIGN(io_tlb_nslabs * sizeof(int)), - PAGE_SIZE); - io_tlb_orig_addr = memblock_alloc( - PAGE_ALIGN(io_tlb_nslabs * sizeof(phys_addr_t)), - PAGE_SIZE); + alloc_size = PAGE_ALIGN(io_tlb_nslabs * sizeof(int)); + io_tlb_list = memblock_alloc(alloc_size, PAGE_SIZE); + if (!io_tlb_list) + panic("%s: Failed to allocate %lu bytes align=0x%lx\n", + __func__, alloc_size, PAGE_SIZE); + + alloc_size = PAGE_ALIGN(io_tlb_nslabs * sizeof(phys_addr_t)); + io_tlb_orig_addr = memblock_alloc(alloc_size, PAGE_SIZE); + if (!io_tlb_orig_addr) + panic("%s: Failed to allocate %lu bytes align=0x%lx\n", + __func__, alloc_size, PAGE_SIZE); + for (i = 0; i < io_tlb_nslabs; i++) { io_tlb_list[i] = IO_TLB_SEGSIZE - OFFSET(i, IO_TLB_SEGSIZE); io_tlb_orig_addr[i] = INVALID_PHYS_ADDR; -- cgit v1.3-7-g2ca7 From 8a7f97b902f4fb0d94b355b6b3f1fbd7154cafb9 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Mon, 11 Mar 2019 23:30:31 -0700 Subject: treewide: add checks for the return value of memblock_alloc*() Add check for the return value of memblock_alloc*() functions and call panic() in case of error. The panic message repeats the one used by panicing memblock allocators with adjustment of parameters to include only relevant ones. The replacement was mostly automated with semantic patches like the one below with manual massaging of format strings. @@ expression ptr, size, align; @@ ptr = memblock_alloc(size, align); + if (!ptr) + panic("%s: Failed to allocate %lu bytes align=0x%lx\n", __func__, size, align); [anders.roxell@linaro.org: use '%pa' with 'phys_addr_t' type] Link: http://lkml.kernel.org/r/20190131161046.21886-1-anders.roxell@linaro.org [rppt@linux.ibm.com: fix format strings for panics after memblock_alloc] Link: http://lkml.kernel.org/r/1548950940-15145-1-git-send-email-rppt@linux.ibm.com [rppt@linux.ibm.com: don't panic if the allocation in sparse_buffer_init fails] Link: http://lkml.kernel.org/r/20190131074018.GD28876@rapoport-lnx [akpm@linux-foundation.org: fix xtensa printk warning] Link: http://lkml.kernel.org/r/1548057848-15136-20-git-send-email-rppt@linux.ibm.com Signed-off-by: Mike Rapoport Signed-off-by: Anders Roxell Reviewed-by: Guo Ren [c-sky] Acked-by: Paul Burton [MIPS] Acked-by: Heiko Carstens [s390] Reviewed-by: Juergen Gross [Xen] Reviewed-by: Geert Uytterhoeven [m68k] Acked-by: Max Filippov [xtensa] Cc: Catalin Marinas Cc: Christophe Leroy Cc: Christoph Hellwig Cc: "David S. Miller" Cc: Dennis Zhou Cc: Greentime Hu Cc: Greg Kroah-Hartman Cc: Guan Xuetao Cc: Guo Ren Cc: Mark Salter Cc: Matt Turner Cc: Michael Ellerman Cc: Michal Simek Cc: Petr Mladek Cc: Richard Weinberger Cc: Rich Felker Cc: Rob Herring Cc: Rob Herring Cc: Russell King Cc: Stafford Horne Cc: Tony Luck Cc: Vineet Gupta Cc: Yoshinori Sato Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/alpha/kernel/core_cia.c | 3 +++ arch/alpha/kernel/core_marvel.c | 6 ++++++ arch/alpha/kernel/pci-noop.c | 13 +++++++++++-- arch/alpha/kernel/pci.c | 11 ++++++++++- arch/alpha/kernel/pci_iommu.c | 12 ++++++++++++ arch/arc/mm/highmem.c | 4 ++++ arch/arm/kernel/setup.c | 6 ++++++ arch/arm/mm/mmu.c | 14 +++++++++++++- arch/arm64/kernel/setup.c | 8 +++++--- arch/arm64/mm/kasan_init.c | 10 ++++++++++ arch/c6x/mm/dma-coherent.c | 4 ++++ arch/c6x/mm/init.c | 3 +++ arch/csky/mm/highmem.c | 5 +++++ arch/h8300/mm/init.c | 3 +++ arch/m68k/atari/stram.c | 4 ++++ arch/m68k/mm/init.c | 3 +++ arch/m68k/mm/mcfmmu.c | 6 ++++++ arch/m68k/mm/motorola.c | 9 +++++++++ arch/m68k/mm/sun3mmu.c | 6 ++++++ arch/m68k/sun3/sun3dvma.c | 3 +++ arch/microblaze/mm/init.c | 8 ++++++-- arch/mips/cavium-octeon/dma-octeon.c | 3 +++ arch/mips/kernel/setup.c | 3 +++ arch/mips/kernel/traps.c | 3 +++ arch/mips/mm/init.c | 5 +++++ arch/nds32/mm/init.c | 12 ++++++++++++ arch/openrisc/mm/ioremap.c | 8 ++++++-- arch/powerpc/kernel/dt_cpu_ftrs.c | 5 +++++ arch/powerpc/kernel/pci_32.c | 3 +++ arch/powerpc/kernel/setup-common.c | 3 +++ arch/powerpc/kernel/setup_64.c | 4 ++++ arch/powerpc/lib/alloc.c | 3 +++ arch/powerpc/mm/hash_utils_64.c | 3 +++ arch/powerpc/mm/mmu_context_nohash.c | 9 +++++++++ arch/powerpc/mm/pgtable-book3e.c | 12 ++++++++++-- arch/powerpc/mm/pgtable-book3s64.c | 3 +++ arch/powerpc/mm/pgtable-radix.c | 9 ++++++++- arch/powerpc/mm/ppc_mmu_32.c | 3 +++ arch/powerpc/platforms/pasemi/iommu.c | 3 +++ arch/powerpc/platforms/powermac/nvram.c | 3 +++ arch/powerpc/platforms/powernv/opal.c | 3 +++ arch/powerpc/platforms/powernv/pci-ioda.c | 8 ++++++++ arch/powerpc/platforms/ps3/setup.c | 3 +++ arch/powerpc/sysdev/msi_bitmap.c | 3 +++ arch/s390/kernel/setup.c | 13 +++++++++++++ arch/s390/kernel/smp.c | 5 ++++- arch/s390/kernel/topology.c | 6 ++++++ arch/s390/numa/mode_emu.c | 3 +++ arch/s390/numa/numa.c | 6 +++++- arch/sh/mm/init.c | 6 ++++++ arch/sh/mm/numa.c | 4 ++++ arch/um/drivers/net_kern.c | 3 +++ arch/um/drivers/vector_kern.c | 3 +++ arch/um/kernel/initrd.c | 2 ++ arch/um/kernel/mem.c | 16 ++++++++++++++++ arch/unicore32/kernel/setup.c | 4 ++++ arch/unicore32/mm/mmu.c | 15 +++++++++++++-- arch/x86/kernel/acpi/boot.c | 3 +++ arch/x86/kernel/apic/io_apic.c | 5 +++++ arch/x86/kernel/e820.c | 3 +++ arch/x86/platform/olpc/olpc_dt.c | 3 +++ arch/x86/xen/p2m.c | 11 +++++++++-- arch/xtensa/mm/kasan_init.c | 4 ++++ arch/xtensa/mm/mmu.c | 3 +++ drivers/clk/ti/clk.c | 3 +++ drivers/macintosh/smu.c | 3 +++ drivers/of/fdt.c | 8 +++++++- drivers/of/unittest.c | 8 +++++++- drivers/xen/swiotlb-xen.c | 7 +++++-- kernel/dma/swiotlb.c | 4 ++-- kernel/power/snapshot.c | 3 +++ lib/cpumask.c | 3 +++ mm/kasan/init.c | 10 ++++++++-- mm/sparse.c | 21 +++++++++++++++++---- 74 files changed, 411 insertions(+), 32 deletions(-) (limited to 'kernel') diff --git a/arch/alpha/kernel/core_cia.c b/arch/alpha/kernel/core_cia.c index 466cd44d8b36..f489170201c3 100644 --- a/arch/alpha/kernel/core_cia.c +++ b/arch/alpha/kernel/core_cia.c @@ -332,6 +332,9 @@ cia_prepare_tbia_workaround(int window) /* Use minimal 1K map. */ ppte = memblock_alloc(CIA_BROKEN_TBIA_SIZE, 32768); + if (!ppte) + panic("%s: Failed to allocate %u bytes align=0x%x\n", + __func__, CIA_BROKEN_TBIA_SIZE, 32768); pte = (virt_to_phys(ppte) >> (PAGE_SHIFT - 1)) | 1; for (i = 0; i < CIA_BROKEN_TBIA_SIZE / sizeof(unsigned long); ++i) diff --git a/arch/alpha/kernel/core_marvel.c b/arch/alpha/kernel/core_marvel.c index c1d0c18c71ca..1db9d0eb2922 100644 --- a/arch/alpha/kernel/core_marvel.c +++ b/arch/alpha/kernel/core_marvel.c @@ -83,6 +83,9 @@ mk_resource_name(int pe, int port, char *str) sprintf(tmp, "PCI %s PE %d PORT %d", str, pe, port); name = memblock_alloc(strlen(tmp) + 1, SMP_CACHE_BYTES); + if (!name) + panic("%s: Failed to allocate %zu bytes\n", __func__, + strlen(tmp) + 1); strcpy(name, tmp); return name; @@ -118,6 +121,9 @@ alloc_io7(unsigned int pe) } io7 = memblock_alloc(sizeof(*io7), SMP_CACHE_BYTES); + if (!io7) + panic("%s: Failed to allocate %zu bytes\n", __func__, + sizeof(*io7)); io7->pe = pe; raw_spin_lock_init(&io7->irq_lock); diff --git a/arch/alpha/kernel/pci-noop.c b/arch/alpha/kernel/pci-noop.c index 091cff3c68fd..ae82061edae9 100644 --- a/arch/alpha/kernel/pci-noop.c +++ b/arch/alpha/kernel/pci-noop.c @@ -34,6 +34,9 @@ alloc_pci_controller(void) struct pci_controller *hose; hose = memblock_alloc(sizeof(*hose), SMP_CACHE_BYTES); + if (!hose) + panic("%s: Failed to allocate %zu bytes\n", __func__, + sizeof(*hose)); *hose_tail = hose; hose_tail = &hose->next; @@ -44,7 +47,13 @@ alloc_pci_controller(void) struct resource * __init alloc_resource(void) { - return memblock_alloc(sizeof(struct resource), SMP_CACHE_BYTES); + void *ptr = memblock_alloc(sizeof(struct resource), SMP_CACHE_BYTES); + + if (!ptr) + panic("%s: Failed to allocate %zu bytes\n", __func__, + sizeof(struct resource)); + + return ptr; } SYSCALL_DEFINE3(pciconfig_iobase, long, which, unsigned long, bus, @@ -54,7 +63,7 @@ SYSCALL_DEFINE3(pciconfig_iobase, long, which, unsigned long, bus, /* from hose or from bus.devfn */ if (which & IOBASE_FROM_HOSE) { - for (hose = hose_head; hose; hose = hose->next) + for (hose = hose_head; hose; hose = hose->next) if (hose->index == bus) break; if (!hose) diff --git a/arch/alpha/kernel/pci.c b/arch/alpha/kernel/pci.c index 97098127df83..64fbfb0763b2 100644 --- a/arch/alpha/kernel/pci.c +++ b/arch/alpha/kernel/pci.c @@ -393,6 +393,9 @@ alloc_pci_controller(void) struct pci_controller *hose; hose = memblock_alloc(sizeof(*hose), SMP_CACHE_BYTES); + if (!hose) + panic("%s: Failed to allocate %zu bytes\n", __func__, + sizeof(*hose)); *hose_tail = hose; hose_tail = &hose->next; @@ -403,7 +406,13 @@ alloc_pci_controller(void) struct resource * __init alloc_resource(void) { - return memblock_alloc(sizeof(struct resource), SMP_CACHE_BYTES); + void *ptr = memblock_alloc(sizeof(struct resource), SMP_CACHE_BYTES); + + if (!ptr) + panic("%s: Failed to allocate %zu bytes\n", __func__, + sizeof(struct resource)); + + return ptr; } diff --git a/arch/alpha/kernel/pci_iommu.c b/arch/alpha/kernel/pci_iommu.c index e4cf77b07742..3034d6d936d2 100644 --- a/arch/alpha/kernel/pci_iommu.c +++ b/arch/alpha/kernel/pci_iommu.c @@ -80,6 +80,9 @@ iommu_arena_new_node(int nid, struct pci_controller *hose, dma_addr_t base, " falling back to system-wide allocation\n", __func__, nid); arena = memblock_alloc(sizeof(*arena), SMP_CACHE_BYTES); + if (!arena) + panic("%s: Failed to allocate %zu bytes\n", __func__, + sizeof(*arena)); } arena->ptes = memblock_alloc_node(sizeof(*arena), align, nid); @@ -88,12 +91,21 @@ iommu_arena_new_node(int nid, struct pci_controller *hose, dma_addr_t base, " falling back to system-wide allocation\n", __func__, nid); arena->ptes = memblock_alloc(mem_size, align); + if (!arena->ptes) + panic("%s: Failed to allocate %lu bytes align=0x%lx\n", + __func__, mem_size, align); } #else /* CONFIG_DISCONTIGMEM */ arena = memblock_alloc(sizeof(*arena), SMP_CACHE_BYTES); + if (!arena) + panic("%s: Failed to allocate %zu bytes\n", __func__, + sizeof(*arena)); arena->ptes = memblock_alloc(mem_size, align); + if (!arena->ptes) + panic("%s: Failed to allocate %lu bytes align=0x%lx\n", + __func__, mem_size, align); #endif /* CONFIG_DISCONTIGMEM */ diff --git a/arch/arc/mm/highmem.c b/arch/arc/mm/highmem.c index 48e700151810..11f57e2ced8a 100644 --- a/arch/arc/mm/highmem.c +++ b/arch/arc/mm/highmem.c @@ -124,6 +124,10 @@ static noinline pte_t * __init alloc_kmap_pgtable(unsigned long kvaddr) pmd_k = pmd_offset(pud_k, kvaddr); pte_k = (pte_t *)memblock_alloc_low(PAGE_SIZE, PAGE_SIZE); + if (!pte_k) + panic("%s: Failed to allocate %lu bytes align=0x%lx\n", + __func__, PAGE_SIZE, PAGE_SIZE); + pmd_populate_kernel(&init_mm, pmd_k, pte_k); return pte_k; } diff --git a/arch/arm/kernel/setup.c b/arch/arm/kernel/setup.c index 375b13f7e780..5d78b6ac0429 100644 --- a/arch/arm/kernel/setup.c +++ b/arch/arm/kernel/setup.c @@ -867,6 +867,9 @@ static void __init request_standard_resources(const struct machine_desc *mdesc) boot_alias_start = phys_to_idmap(start); if (arm_has_idmap_alias() && boot_alias_start != IDMAP_INVALID_ADDR) { res = memblock_alloc(sizeof(*res), SMP_CACHE_BYTES); + if (!res) + panic("%s: Failed to allocate %zu bytes\n", + __func__, sizeof(*res)); res->name = "System RAM (boot alias)"; res->start = boot_alias_start; res->end = phys_to_idmap(end); @@ -875,6 +878,9 @@ static void __init request_standard_resources(const struct machine_desc *mdesc) } res = memblock_alloc(sizeof(*res), SMP_CACHE_BYTES); + if (!res) + panic("%s: Failed to allocate %zu bytes\n", __func__, + sizeof(*res)); res->name = "System RAM"; res->start = start; res->end = end; diff --git a/arch/arm/mm/mmu.c b/arch/arm/mm/mmu.c index 57de0dde3ae0..f3ce34113f89 100644 --- a/arch/arm/mm/mmu.c +++ b/arch/arm/mm/mmu.c @@ -721,7 +721,13 @@ EXPORT_SYMBOL(phys_mem_access_prot); static void __init *early_alloc(unsigned long sz) { - return memblock_alloc(sz, sz); + void *ptr = memblock_alloc(sz, sz); + + if (!ptr) + panic("%s: Failed to allocate %lu bytes align=0x%lx\n", + __func__, sz, sz); + + return ptr; } static void *__init late_alloc(unsigned long sz) @@ -994,6 +1000,9 @@ void __init iotable_init(struct map_desc *io_desc, int nr) return; svm = memblock_alloc(sizeof(*svm) * nr, __alignof__(*svm)); + if (!svm) + panic("%s: Failed to allocate %zu bytes align=0x%zx\n", + __func__, sizeof(*svm) * nr, __alignof__(*svm)); for (md = io_desc; nr; md++, nr--) { create_mapping(md); @@ -1016,6 +1025,9 @@ void __init vm_reserve_area_early(unsigned long addr, unsigned long size, struct static_vm *svm; svm = memblock_alloc(sizeof(*svm), __alignof__(*svm)); + if (!svm) + panic("%s: Failed to allocate %zu bytes align=0x%zx\n", + __func__, sizeof(*svm), __alignof__(*svm)); vm = &svm->vm; vm->addr = (void *)addr; diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c index 834b321a88f8..f8482fe5a190 100644 --- a/arch/arm64/kernel/setup.c +++ b/arch/arm64/kernel/setup.c @@ -208,6 +208,7 @@ static void __init request_standard_resources(void) struct memblock_region *region; struct resource *res; unsigned long i = 0; + size_t res_size; kernel_code.start = __pa_symbol(_text); kernel_code.end = __pa_symbol(__init_begin - 1); @@ -215,9 +216,10 @@ static void __init request_standard_resources(void) kernel_data.end = __pa_symbol(_end - 1); num_standard_resources = memblock.memory.cnt; - standard_resources = memblock_alloc_low(num_standard_resources * - sizeof(*standard_resources), - SMP_CACHE_BYTES); + res_size = num_standard_resources * sizeof(*standard_resources); + standard_resources = memblock_alloc_low(res_size, SMP_CACHE_BYTES); + if (!standard_resources) + panic("%s: Failed to allocate %zu bytes\n", __func__, res_size); for_each_memblock(memory, region) { res = &standard_resources[i++]; diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c index f37a86d2a69d..296de39ddee5 100644 --- a/arch/arm64/mm/kasan_init.c +++ b/arch/arm64/mm/kasan_init.c @@ -40,6 +40,11 @@ static phys_addr_t __init kasan_alloc_zeroed_page(int node) void *p = memblock_alloc_try_nid(PAGE_SIZE, PAGE_SIZE, __pa(MAX_DMA_ADDRESS), MEMBLOCK_ALLOC_KASAN, node); + if (!p) + panic("%s: Failed to allocate %lu bytes align=0x%lx nid=%d from=%llx\n", + __func__, PAGE_SIZE, PAGE_SIZE, node, + __pa(MAX_DMA_ADDRESS)); + return __pa(p); } @@ -48,6 +53,11 @@ static phys_addr_t __init kasan_alloc_raw_page(int node) void *p = memblock_alloc_try_nid_raw(PAGE_SIZE, PAGE_SIZE, __pa(MAX_DMA_ADDRESS), MEMBLOCK_ALLOC_KASAN, node); + if (!p) + panic("%s: Failed to allocate %lu bytes align=0x%lx nid=%d from=%llx\n", + __func__, PAGE_SIZE, PAGE_SIZE, node, + __pa(MAX_DMA_ADDRESS)); + return __pa(p); } diff --git a/arch/c6x/mm/dma-coherent.c b/arch/c6x/mm/dma-coherent.c index 0be289839ce0..0d3701bc88f6 100644 --- a/arch/c6x/mm/dma-coherent.c +++ b/arch/c6x/mm/dma-coherent.c @@ -138,6 +138,10 @@ void __init coherent_mem_init(phys_addr_t start, u32 size) dma_bitmap = memblock_alloc(BITS_TO_LONGS(dma_pages) * sizeof(long), sizeof(long)); + if (!dma_bitmap) + panic("%s: Failed to allocate %zu bytes align=0x%zx\n", + __func__, BITS_TO_LONGS(dma_pages) * sizeof(long), + sizeof(long)); } static void c6x_dma_sync(struct device *dev, phys_addr_t paddr, size_t size, diff --git a/arch/c6x/mm/init.c b/arch/c6x/mm/init.c index e83c04654238..fe582c3a1794 100644 --- a/arch/c6x/mm/init.c +++ b/arch/c6x/mm/init.c @@ -40,6 +40,9 @@ void __init paging_init(void) empty_zero_page = (unsigned long) memblock_alloc(PAGE_SIZE, PAGE_SIZE); + if (!empty_zero_page) + panic("%s: Failed to allocate %lu bytes align=0x%lx\n", + __func__, PAGE_SIZE, PAGE_SIZE); /* * Set up user data space diff --git a/arch/csky/mm/highmem.c b/arch/csky/mm/highmem.c index 53b1bfa4c462..3317b774f6dc 100644 --- a/arch/csky/mm/highmem.c +++ b/arch/csky/mm/highmem.c @@ -141,6 +141,11 @@ static void __init fixrange_init(unsigned long start, unsigned long end, for (; (k < PTRS_PER_PMD) && (vaddr != end); pmd++, k++) { if (pmd_none(*pmd)) { pte = (pte_t *) memblock_alloc_low(PAGE_SIZE, PAGE_SIZE); + if (!pte) + panic("%s: Failed to allocate %lu bytes align=%lx\n", + __func__, PAGE_SIZE, + PAGE_SIZE); + set_pmd(pmd, __pmd(__pa(pte))); BUG_ON(pte != pte_offset_kernel(pmd, 0)); } diff --git a/arch/h8300/mm/init.c b/arch/h8300/mm/init.c index a1578904ad4e..0f04a5e9aa4f 100644 --- a/arch/h8300/mm/init.c +++ b/arch/h8300/mm/init.c @@ -68,6 +68,9 @@ void __init paging_init(void) * to a couple of allocated pages. */ empty_zero_page = (unsigned long)memblock_alloc(PAGE_SIZE, PAGE_SIZE); + if (!empty_zero_page) + panic("%s: Failed to allocate %lu bytes align=0x%lx\n", + __func__, PAGE_SIZE, PAGE_SIZE); /* * Set up SFC/DFC registers (user data space). diff --git a/arch/m68k/atari/stram.c b/arch/m68k/atari/stram.c index 6ffc204eb07d..6152f9f631d2 100644 --- a/arch/m68k/atari/stram.c +++ b/arch/m68k/atari/stram.c @@ -97,6 +97,10 @@ void __init atari_stram_reserve_pages(void *start_mem) pr_debug("atari_stram pool: kernel in ST-RAM, using alloc_bootmem!\n"); stram_pool.start = (resource_size_t)memblock_alloc_low(pool_size, PAGE_SIZE); + if (!stram_pool.start) + panic("%s: Failed to allocate %lu bytes align=%lx\n", + __func__, pool_size, PAGE_SIZE); + stram_pool.end = stram_pool.start + pool_size - 1; request_resource(&iomem_resource, &stram_pool); stram_virt_offset = 0; diff --git a/arch/m68k/mm/init.c b/arch/m68k/mm/init.c index 933c33e76a48..8868a4c9adae 100644 --- a/arch/m68k/mm/init.c +++ b/arch/m68k/mm/init.c @@ -94,6 +94,9 @@ void __init paging_init(void) high_memory = (void *) end_mem; empty_zero_page = memblock_alloc(PAGE_SIZE, PAGE_SIZE); + if (!empty_zero_page) + panic("%s: Failed to allocate %lu bytes align=0x%lx\n", + __func__, PAGE_SIZE, PAGE_SIZE); /* * Set up SFC/DFC registers (user data space). diff --git a/arch/m68k/mm/mcfmmu.c b/arch/m68k/mm/mcfmmu.c index 492f953db31b..6cb1e41d58d0 100644 --- a/arch/m68k/mm/mcfmmu.c +++ b/arch/m68k/mm/mcfmmu.c @@ -44,6 +44,9 @@ void __init paging_init(void) int i; empty_zero_page = (void *) memblock_alloc(PAGE_SIZE, PAGE_SIZE); + if (!empty_zero_page) + panic("%s: Failed to allocate %lu bytes align=0x%lx\n", + __func__, PAGE_SIZE, PAGE_SIZE); pg_dir = swapper_pg_dir; memset(swapper_pg_dir, 0, sizeof(swapper_pg_dir)); @@ -51,6 +54,9 @@ void __init paging_init(void) size = num_pages * sizeof(pte_t); size = (size + PAGE_SIZE) & ~(PAGE_SIZE-1); next_pgtable = (unsigned long) memblock_alloc(size, PAGE_SIZE); + if (!next_pgtable) + panic("%s: Failed to allocate %lu bytes align=0x%lx\n", + __func__, size, PAGE_SIZE); bootmem_end = (next_pgtable + size + PAGE_SIZE) & PAGE_MASK; pg_dir += PAGE_OFFSET >> PGDIR_SHIFT; diff --git a/arch/m68k/mm/motorola.c b/arch/m68k/mm/motorola.c index 3f3d0bf36091..356601bf96d9 100644 --- a/arch/m68k/mm/motorola.c +++ b/arch/m68k/mm/motorola.c @@ -55,6 +55,9 @@ static pte_t * __init kernel_page_table(void) pte_t *ptablep; ptablep = (pte_t *)memblock_alloc_low(PAGE_SIZE, PAGE_SIZE); + if (!ptablep) + panic("%s: Failed to allocate %lu bytes align=%lx\n", + __func__, PAGE_SIZE, PAGE_SIZE); clear_page(ptablep); __flush_page_to_ram(ptablep); @@ -96,6 +99,9 @@ static pmd_t * __init kernel_ptr_table(void) if (((unsigned long)last_pgtable & ~PAGE_MASK) == 0) { last_pgtable = (pmd_t *)memblock_alloc_low(PAGE_SIZE, PAGE_SIZE); + if (!last_pgtable) + panic("%s: Failed to allocate %lu bytes align=%lx\n", + __func__, PAGE_SIZE, PAGE_SIZE); clear_page(last_pgtable); __flush_page_to_ram(last_pgtable); @@ -278,6 +284,9 @@ void __init paging_init(void) * to a couple of allocated pages */ empty_zero_page = memblock_alloc(PAGE_SIZE, PAGE_SIZE); + if (!empty_zero_page) + panic("%s: Failed to allocate %lu bytes align=0x%lx\n", + __func__, PAGE_SIZE, PAGE_SIZE); /* * Set up SFC/DFC registers diff --git a/arch/m68k/mm/sun3mmu.c b/arch/m68k/mm/sun3mmu.c index f736db48a2e1..eca1c46bb90a 100644 --- a/arch/m68k/mm/sun3mmu.c +++ b/arch/m68k/mm/sun3mmu.c @@ -46,6 +46,9 @@ void __init paging_init(void) unsigned long size; empty_zero_page = memblock_alloc(PAGE_SIZE, PAGE_SIZE); + if (!empty_zero_page) + panic("%s: Failed to allocate %lu bytes align=0x%lx\n", + __func__, PAGE_SIZE, PAGE_SIZE); address = PAGE_OFFSET; pg_dir = swapper_pg_dir; @@ -56,6 +59,9 @@ void __init paging_init(void) size = (size + PAGE_SIZE) & ~(PAGE_SIZE-1); next_pgtable = (unsigned long)memblock_alloc(size, PAGE_SIZE); + if (!next_pgtable) + panic("%s: Failed to allocate %lu bytes align=0x%lx\n", + __func__, size, PAGE_SIZE); bootmem_end = (next_pgtable + size + PAGE_SIZE) & PAGE_MASK; /* Map whole memory from PAGE_OFFSET (0x0E000000) */ diff --git a/arch/m68k/sun3/sun3dvma.c b/arch/m68k/sun3/sun3dvma.c index 4d64711d3d47..399f3d06125f 100644 --- a/arch/m68k/sun3/sun3dvma.c +++ b/arch/m68k/sun3/sun3dvma.c @@ -269,6 +269,9 @@ void __init dvma_init(void) iommu_use = memblock_alloc(IOMMU_TOTAL_ENTRIES * sizeof(unsigned long), SMP_CACHE_BYTES); + if (!iommu_use) + panic("%s: Failed to allocate %zu bytes\n", __func__, + IOMMU_TOTAL_ENTRIES * sizeof(unsigned long)); dvma_unmap_iommu(DVMA_START, DVMA_SIZE); diff --git a/arch/microblaze/mm/init.c b/arch/microblaze/mm/init.c index bd1cd4bff449..7e97d44f6538 100644 --- a/arch/microblaze/mm/init.c +++ b/arch/microblaze/mm/init.c @@ -374,10 +374,14 @@ void * __ref zalloc_maybe_bootmem(size_t size, gfp_t mask) { void *p; - if (mem_init_done) + if (mem_init_done) { p = kzalloc(size, mask); - else + } else { p = memblock_alloc(size, SMP_CACHE_BYTES); + if (!p) + panic("%s: Failed to allocate %zu bytes\n", + __func__, size); + } return p; } diff --git a/arch/mips/cavium-octeon/dma-octeon.c b/arch/mips/cavium-octeon/dma-octeon.c index e8eb60ed99f2..11d5a4e90736 100644 --- a/arch/mips/cavium-octeon/dma-octeon.c +++ b/arch/mips/cavium-octeon/dma-octeon.c @@ -245,6 +245,9 @@ void __init plat_swiotlb_setup(void) swiotlbsize = swiotlb_nslabs << IO_TLB_SHIFT; octeon_swiotlb = memblock_alloc_low(swiotlbsize, PAGE_SIZE); + if (!octeon_swiotlb) + panic("%s: Failed to allocate %zu bytes align=%lx\n", + __func__, swiotlbsize, PAGE_SIZE); if (swiotlb_init_with_tbl(octeon_swiotlb, swiotlb_nslabs, 1) == -ENOMEM) panic("Cannot allocate SWIOTLB buffer"); diff --git a/arch/mips/kernel/setup.c b/arch/mips/kernel/setup.c index 5151532ad959..8d1dc6c71173 100644 --- a/arch/mips/kernel/setup.c +++ b/arch/mips/kernel/setup.c @@ -919,6 +919,9 @@ static void __init resource_init(void) end = HIGHMEM_START - 1; res = memblock_alloc(sizeof(struct resource), SMP_CACHE_BYTES); + if (!res) + panic("%s: Failed to allocate %zu bytes\n", __func__, + sizeof(struct resource)); res->start = start; res->end = end; diff --git a/arch/mips/kernel/traps.c b/arch/mips/kernel/traps.c index fc511ecefec6..98ca55d62201 100644 --- a/arch/mips/kernel/traps.c +++ b/arch/mips/kernel/traps.c @@ -2294,6 +2294,9 @@ void __init trap_init(void) ebase = (unsigned long) memblock_alloc(size, 1 << fls(size)); + if (!ebase) + panic("%s: Failed to allocate %lu bytes align=0x%x\n", + __func__, size, 1 << fls(size)); /* * Try to ensure ebase resides in KSeg0 if possible. diff --git a/arch/mips/mm/init.c b/arch/mips/mm/init.c index c3b45e248806..bbb196ad5f26 100644 --- a/arch/mips/mm/init.c +++ b/arch/mips/mm/init.c @@ -252,6 +252,11 @@ void __init fixrange_init(unsigned long start, unsigned long end, if (pmd_none(*pmd)) { pte = (pte_t *) memblock_alloc_low(PAGE_SIZE, PAGE_SIZE); + if (!pte) + panic("%s: Failed to allocate %lu bytes align=%lx\n", + __func__, PAGE_SIZE, + PAGE_SIZE); + set_pmd(pmd, __pmd((unsigned long)pte)); BUG_ON(pte != pte_offset_kernel(pmd, 0)); } diff --git a/arch/nds32/mm/init.c b/arch/nds32/mm/init.c index d1e521cce317..1d03633f89a9 100644 --- a/arch/nds32/mm/init.c +++ b/arch/nds32/mm/init.c @@ -79,6 +79,9 @@ static void __init map_ram(void) /* Alloc one page for holding PTE's... */ pte = memblock_alloc(PAGE_SIZE, PAGE_SIZE); + if (!pte) + panic("%s: Failed to allocate %lu bytes align=0x%lx\n", + __func__, PAGE_SIZE, PAGE_SIZE); set_pmd(pme, __pmd(__pa(pte) + _PAGE_KERNEL_TABLE)); /* Fill the newly allocated page with PTE'S */ @@ -111,6 +114,9 @@ static void __init fixedrange_init(void) pud = pud_offset(pgd, vaddr); pmd = pmd_offset(pud, vaddr); fixmap_pmd_p = memblock_alloc(PAGE_SIZE, PAGE_SIZE); + if (!fixmap_pmd_p) + panic("%s: Failed to allocate %lu bytes align=0x%lx\n", + __func__, PAGE_SIZE, PAGE_SIZE); set_pmd(pmd, __pmd(__pa(fixmap_pmd_p) + _PAGE_KERNEL_TABLE)); #ifdef CONFIG_HIGHMEM @@ -123,6 +129,9 @@ static void __init fixedrange_init(void) pud = pud_offset(pgd, vaddr); pmd = pmd_offset(pud, vaddr); pte = memblock_alloc(PAGE_SIZE, PAGE_SIZE); + if (!pte) + panic("%s: Failed to allocate %lu bytes align=0x%lx\n", + __func__, PAGE_SIZE, PAGE_SIZE); set_pmd(pmd, __pmd(__pa(pte) + _PAGE_KERNEL_TABLE)); pkmap_page_table = pte; #endif /* CONFIG_HIGHMEM */ @@ -148,6 +157,9 @@ void __init paging_init(void) /* allocate space for empty_zero_page */ zero_page = memblock_alloc(PAGE_SIZE, PAGE_SIZE); + if (!zero_page) + panic("%s: Failed to allocate %lu bytes align=0x%lx\n", + __func__, PAGE_SIZE, PAGE_SIZE); zone_sizes_init(); empty_zero_page = virt_to_page(zero_page); diff --git a/arch/openrisc/mm/ioremap.c b/arch/openrisc/mm/ioremap.c index 051bcb4fefd3..a8509950dbbc 100644 --- a/arch/openrisc/mm/ioremap.c +++ b/arch/openrisc/mm/ioremap.c @@ -122,10 +122,14 @@ pte_t __ref *pte_alloc_one_kernel(struct mm_struct *mm) { pte_t *pte; - if (likely(mem_init_done)) + if (likely(mem_init_done)) { pte = (pte_t *)get_zeroed_page(GFP_KERNEL); - else + } else { pte = memblock_alloc(PAGE_SIZE, PAGE_SIZE); + if (!pte) + panic("%s: Failed to allocate %lu bytes align=0x%lx\n", + __func__, PAGE_SIZE, PAGE_SIZE); + } return pte; } diff --git a/arch/powerpc/kernel/dt_cpu_ftrs.c b/arch/powerpc/kernel/dt_cpu_ftrs.c index 28c076c771de..c66fd3ce6478 100644 --- a/arch/powerpc/kernel/dt_cpu_ftrs.c +++ b/arch/powerpc/kernel/dt_cpu_ftrs.c @@ -1005,6 +1005,11 @@ static int __init dt_cpu_ftrs_scan_callback(unsigned long node, const char of_scan_flat_dt_subnodes(node, count_cpufeatures_subnodes, &nr_dt_cpu_features); dt_cpu_features = memblock_alloc(sizeof(struct dt_cpu_feature) * nr_dt_cpu_features, PAGE_SIZE); + if (!dt_cpu_features) + panic("%s: Failed to allocate %zu bytes align=0x%lx\n", + __func__, + sizeof(struct dt_cpu_feature) * nr_dt_cpu_features, + PAGE_SIZE); cpufeatures_setup_start(isa); diff --git a/arch/powerpc/kernel/pci_32.c b/arch/powerpc/kernel/pci_32.c index d3f04f2d8249..0417fda13636 100644 --- a/arch/powerpc/kernel/pci_32.c +++ b/arch/powerpc/kernel/pci_32.c @@ -205,6 +205,9 @@ pci_create_OF_bus_map(void) of_prop = memblock_alloc(sizeof(struct property) + 256, SMP_CACHE_BYTES); + if (!of_prop) + panic("%s: Failed to allocate %zu bytes\n", __func__, + sizeof(struct property) + 256); dn = of_find_node_by_path("/"); if (dn) { memset(of_prop, -1, sizeof(struct property) + 256); diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c index f17868e19e2c..2e5dfb6e0823 100644 --- a/arch/powerpc/kernel/setup-common.c +++ b/arch/powerpc/kernel/setup-common.c @@ -461,6 +461,9 @@ void __init smp_setup_cpu_maps(void) cpu_to_phys_id = memblock_alloc(nr_cpu_ids * sizeof(u32), __alignof__(u32)); + if (!cpu_to_phys_id) + panic("%s: Failed to allocate %zu bytes align=0x%zx\n", + __func__, nr_cpu_ids * sizeof(u32), __alignof__(u32)); for_each_node_by_type(dn, "cpu") { const __be32 *intserv; diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c index ff0aac42bb33..ba404dd9ce1d 100644 --- a/arch/powerpc/kernel/setup_64.c +++ b/arch/powerpc/kernel/setup_64.c @@ -905,6 +905,10 @@ static void __ref init_fallback_flush(void) l1d_flush_fallback_area = memblock_alloc_try_nid(l1d_size * 2, l1d_size, MEMBLOCK_LOW_LIMIT, limit, NUMA_NO_NODE); + if (!l1d_flush_fallback_area) + panic("%s: Failed to allocate %llu bytes align=0x%llx max_addr=%pa\n", + __func__, l1d_size * 2, l1d_size, &limit); + for_each_possible_cpu(cpu) { struct paca_struct *paca = paca_ptrs[cpu]; diff --git a/arch/powerpc/lib/alloc.c b/arch/powerpc/lib/alloc.c index dedf88a76f58..ce180870bd52 100644 --- a/arch/powerpc/lib/alloc.c +++ b/arch/powerpc/lib/alloc.c @@ -15,6 +15,9 @@ void * __ref zalloc_maybe_bootmem(size_t size, gfp_t mask) p = kzalloc(size, mask); else { p = memblock_alloc(size, SMP_CACHE_BYTES); + if (!p) + panic("%s: Failed to allocate %zu bytes\n", __func__, + size); } return p; } diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c index 880a366c229c..0a4f939a8161 100644 --- a/arch/powerpc/mm/hash_utils_64.c +++ b/arch/powerpc/mm/hash_utils_64.c @@ -915,6 +915,9 @@ static void __init htab_initialize(void) linear_map_hash_slots = memblock_alloc_try_nid( linear_map_hash_count, 1, MEMBLOCK_LOW_LIMIT, ppc64_rma_size, NUMA_NO_NODE); + if (!linear_map_hash_slots) + panic("%s: Failed to allocate %lu bytes max_addr=%pa\n", + __func__, linear_map_hash_count, &ppc64_rma_size); } #endif /* CONFIG_DEBUG_PAGEALLOC */ diff --git a/arch/powerpc/mm/mmu_context_nohash.c b/arch/powerpc/mm/mmu_context_nohash.c index 22d71a58167f..1945c5f19f5e 100644 --- a/arch/powerpc/mm/mmu_context_nohash.c +++ b/arch/powerpc/mm/mmu_context_nohash.c @@ -461,10 +461,19 @@ void __init mmu_context_init(void) * Allocate the maps used by context management */ context_map = memblock_alloc(CTX_MAP_SIZE, SMP_CACHE_BYTES); + if (!context_map) + panic("%s: Failed to allocate %zu bytes\n", __func__, + CTX_MAP_SIZE); context_mm = memblock_alloc(sizeof(void *) * (LAST_CONTEXT + 1), SMP_CACHE_BYTES); + if (!context_mm) + panic("%s: Failed to allocate %zu bytes\n", __func__, + sizeof(void *) * (LAST_CONTEXT + 1)); #ifdef CONFIG_SMP stale_map[boot_cpuid] = memblock_alloc(CTX_MAP_SIZE, SMP_CACHE_BYTES); + if (!stale_map[boot_cpuid]) + panic("%s: Failed to allocate %zu bytes\n", __func__, + CTX_MAP_SIZE); cpuhp_setup_state_nocalls(CPUHP_POWERPC_MMU_CTX_PREPARE, "powerpc/mmu/ctx:prepare", diff --git a/arch/powerpc/mm/pgtable-book3e.c b/arch/powerpc/mm/pgtable-book3e.c index 53cbc7dc2df2..1032ef7aaf62 100644 --- a/arch/powerpc/mm/pgtable-book3e.c +++ b/arch/powerpc/mm/pgtable-book3e.c @@ -57,8 +57,16 @@ void vmemmap_remove_mapping(unsigned long start, static __ref void *early_alloc_pgtable(unsigned long size) { - return memblock_alloc_try_nid(size, size, MEMBLOCK_LOW_LIMIT, - __pa(MAX_DMA_ADDRESS), NUMA_NO_NODE); + void *ptr; + + ptr = memblock_alloc_try_nid(size, size, MEMBLOCK_LOW_LIMIT, + __pa(MAX_DMA_ADDRESS), NUMA_NO_NODE); + + if (!ptr) + panic("%s: Failed to allocate %lu bytes align=0x%lx max_addr=%lx\n", + __func__, size, size, __pa(MAX_DMA_ADDRESS)); + + return ptr; } /* diff --git a/arch/powerpc/mm/pgtable-book3s64.c b/arch/powerpc/mm/pgtable-book3s64.c index 92a3e4c39540..a4341aba0af4 100644 --- a/arch/powerpc/mm/pgtable-book3s64.c +++ b/arch/powerpc/mm/pgtable-book3s64.c @@ -197,6 +197,9 @@ void __init mmu_partition_table_init(void) BUILD_BUG_ON_MSG((PATB_SIZE_SHIFT > 36), "Partition table size too large."); /* Initialize the Partition Table with no entries */ partition_tb = memblock_alloc(patb_size, patb_size); + if (!partition_tb) + panic("%s: Failed to allocate %lu bytes align=0x%lx\n", + __func__, patb_size, patb_size); /* * update partition table control register, diff --git a/arch/powerpc/mm/pgtable-radix.c b/arch/powerpc/mm/pgtable-radix.c index e377684ac6ad..154472a28c77 100644 --- a/arch/powerpc/mm/pgtable-radix.c +++ b/arch/powerpc/mm/pgtable-radix.c @@ -53,13 +53,20 @@ static __ref void *early_alloc_pgtable(unsigned long size, int nid, { phys_addr_t min_addr = MEMBLOCK_LOW_LIMIT; phys_addr_t max_addr = MEMBLOCK_ALLOC_ANYWHERE; + void *ptr; if (region_start) min_addr = region_start; if (region_end) max_addr = region_end; - return memblock_alloc_try_nid(size, size, min_addr, max_addr, nid); + ptr = memblock_alloc_try_nid(size, size, min_addr, max_addr, nid); + + if (!ptr) + panic("%s: Failed to allocate %lu bytes align=0x%lx nid=%d from=%pa max_addr=%pa\n", + __func__, size, size, nid, &min_addr, &max_addr); + + return ptr; } static int early_map_kernel_page(unsigned long ea, unsigned long pa, diff --git a/arch/powerpc/mm/ppc_mmu_32.c b/arch/powerpc/mm/ppc_mmu_32.c index 6c8a60b1e31d..f29d2f118b44 100644 --- a/arch/powerpc/mm/ppc_mmu_32.c +++ b/arch/powerpc/mm/ppc_mmu_32.c @@ -340,6 +340,9 @@ void __init MMU_init_hw(void) */ if ( ppc_md.progress ) ppc_md.progress("hash:find piece", 0x322); Hash = memblock_alloc(Hash_size, Hash_size); + if (!Hash) + panic("%s: Failed to allocate %lu bytes align=0x%lx\n", + __func__, Hash_size, Hash_size); _SDR1 = __pa(Hash) | SDR1_LOW_BITS; Hash_end = (struct hash_pte *) ((unsigned long)Hash + Hash_size); diff --git a/arch/powerpc/platforms/pasemi/iommu.c b/arch/powerpc/platforms/pasemi/iommu.c index 86368e238f6e..044c6089462c 100644 --- a/arch/powerpc/platforms/pasemi/iommu.c +++ b/arch/powerpc/platforms/pasemi/iommu.c @@ -211,6 +211,9 @@ static int __init iob_init(struct device_node *dn) iob_l2_base = memblock_alloc_try_nid_raw(1UL << 21, 1UL << 21, MEMBLOCK_LOW_LIMIT, 0x80000000, NUMA_NO_NODE); + if (!iob_l2_base) + panic("%s: Failed to allocate %lu bytes align=0x%lx max_addr=%x\n", + __func__, 1UL << 21, 1UL << 21, 0x80000000); pr_info("IOBMAP L2 allocated at: %p\n", iob_l2_base); diff --git a/arch/powerpc/platforms/powermac/nvram.c b/arch/powerpc/platforms/powermac/nvram.c index 9360cdc408c1..86989c5779c2 100644 --- a/arch/powerpc/platforms/powermac/nvram.c +++ b/arch/powerpc/platforms/powermac/nvram.c @@ -519,6 +519,9 @@ static int __init core99_nvram_setup(struct device_node *dp, unsigned long addr) return -EINVAL; } nvram_image = memblock_alloc(NVRAM_SIZE, SMP_CACHE_BYTES); + if (!nvram_image) + panic("%s: Failed to allocate %u bytes\n", __func__, + NVRAM_SIZE); nvram_data = ioremap(addr, NVRAM_SIZE*2); nvram_naddrs = 1; /* Make sure we get the correct case */ diff --git a/arch/powerpc/platforms/powernv/opal.c b/arch/powerpc/platforms/powernv/opal.c index 727a7de08635..2b0eca104f86 100644 --- a/arch/powerpc/platforms/powernv/opal.c +++ b/arch/powerpc/platforms/powernv/opal.c @@ -171,6 +171,9 @@ int __init early_init_dt_scan_recoverable_ranges(unsigned long node, * Allocate a buffer to hold the MC recoverable ranges. */ mc_recoverable_range = memblock_alloc(size, __alignof__(u64)); + if (!mc_recoverable_range) + panic("%s: Failed to allocate %u bytes align=0x%lx\n", + __func__, size, __alignof__(u64)); for (i = 0; i < mc_recoverable_range_len; i++) { mc_recoverable_range[i].start_addr = diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index fa6af52b5219..3ead4c237ed0 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -3657,6 +3657,9 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np, pr_debug(" PHB-ID : 0x%016llx\n", phb_id); phb = memblock_alloc(sizeof(*phb), SMP_CACHE_BYTES); + if (!phb) + panic("%s: Failed to allocate %zu bytes\n", __func__, + sizeof(*phb)); /* Allocate PCI controller */ phb->hose = hose = pcibios_alloc_controller(np); @@ -3703,6 +3706,9 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np, phb->diag_data_size = PNV_PCI_DIAG_BUF_SIZE; phb->diag_data = memblock_alloc(phb->diag_data_size, SMP_CACHE_BYTES); + if (!phb->diag_data) + panic("%s: Failed to allocate %u bytes\n", __func__, + phb->diag_data_size); /* Parse 32-bit and IO ranges (if any) */ pci_process_bridge_OF_ranges(hose, np, !hose->global_number); @@ -3762,6 +3768,8 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np, pemap_off = size; size += phb->ioda.total_pe_num * sizeof(struct pnv_ioda_pe); aux = memblock_alloc(size, SMP_CACHE_BYTES); + if (!aux) + panic("%s: Failed to allocate %lu bytes\n", __func__, size); phb->ioda.pe_alloc = aux; phb->ioda.m64_segmap = aux + m64map_off; phb->ioda.m32_segmap = aux + m32map_off; diff --git a/arch/powerpc/platforms/ps3/setup.c b/arch/powerpc/platforms/ps3/setup.c index 658bfab3350b..4ce5458eb0f8 100644 --- a/arch/powerpc/platforms/ps3/setup.c +++ b/arch/powerpc/platforms/ps3/setup.c @@ -127,6 +127,9 @@ static void __init prealloc(struct ps3_prealloc *p) return; p->address = memblock_alloc(p->size, p->align); + if (!p->address) + panic("%s: Failed to allocate %lu bytes align=0x%lx\n", + __func__, p->size, p->align); printk(KERN_INFO "%s: %lu bytes at %p\n", p->name, p->size, p->address); diff --git a/arch/powerpc/sysdev/msi_bitmap.c b/arch/powerpc/sysdev/msi_bitmap.c index d45450f6666a..51a679a1c403 100644 --- a/arch/powerpc/sysdev/msi_bitmap.c +++ b/arch/powerpc/sysdev/msi_bitmap.c @@ -129,6 +129,9 @@ int __ref msi_bitmap_alloc(struct msi_bitmap *bmp, unsigned int irq_count, bmp->bitmap = kzalloc(size, GFP_KERNEL); else { bmp->bitmap = memblock_alloc(size, SMP_CACHE_BYTES); + if (!bmp->bitmap) + panic("%s: Failed to allocate %u bytes\n", __func__, + size); /* the bitmap won't be freed from memblock allocator */ kmemleak_not_leak(bmp->bitmap); } diff --git a/arch/s390/kernel/setup.c b/arch/s390/kernel/setup.c index d7920f3e76c6..2c642af526ce 100644 --- a/arch/s390/kernel/setup.c +++ b/arch/s390/kernel/setup.c @@ -378,6 +378,10 @@ static void __init setup_lowcore_dat_off(void) */ BUILD_BUG_ON(sizeof(struct lowcore) != LC_PAGES * PAGE_SIZE); lc = memblock_alloc_low(sizeof(*lc), sizeof(*lc)); + if (!lc) + panic("%s: Failed to allocate %zu bytes align=%zx\n", + __func__, sizeof(*lc), sizeof(*lc)); + lc->restart_psw.mask = PSW_KERNEL_BITS; lc->restart_psw.addr = (unsigned long) restart_int_handler; lc->external_new_psw.mask = PSW_KERNEL_BITS | PSW_MASK_MCHECK; @@ -419,6 +423,9 @@ static void __init setup_lowcore_dat_off(void) * all CPUs in cast *one* of them does a PSW restart. */ restart_stack = memblock_alloc(THREAD_SIZE, THREAD_SIZE); + if (!restart_stack) + panic("%s: Failed to allocate %lu bytes align=0x%lx\n", + __func__, THREAD_SIZE, THREAD_SIZE); restart_stack += STACK_INIT_OFFSET; /* @@ -495,6 +502,9 @@ static void __init setup_resources(void) for_each_memblock(memory, reg) { res = memblock_alloc(sizeof(*res), 8); + if (!res) + panic("%s: Failed to allocate %zu bytes align=0x%x\n", + __func__, sizeof(*res), 8); res->flags = IORESOURCE_BUSY | IORESOURCE_SYSTEM_RAM; res->name = "System RAM"; @@ -509,6 +519,9 @@ static void __init setup_resources(void) continue; if (std_res->end > res->end) { sub_res = memblock_alloc(sizeof(*sub_res), 8); + if (!sub_res) + panic("%s: Failed to allocate %zu bytes align=0x%x\n", + __func__, sizeof(*sub_res), 8); *sub_res = *std_res; sub_res->end = res->end; std_res->start = res->end + 1; diff --git a/arch/s390/kernel/smp.c b/arch/s390/kernel/smp.c index 5e3cccc408b8..3fe1c77c361b 100644 --- a/arch/s390/kernel/smp.c +++ b/arch/s390/kernel/smp.c @@ -658,7 +658,7 @@ void __init smp_save_dump_cpus(void) /* Allocate a page as dumping area for the store status sigps */ page = memblock_phys_alloc_range(PAGE_SIZE, PAGE_SIZE, 0, 1UL << 31); if (!page) - panic("ERROR: Failed to allocate %x bytes below %lx\n", + panic("ERROR: Failed to allocate %lx bytes below %lx\n", PAGE_SIZE, 1UL << 31); /* Set multi-threading state to the previous system. */ @@ -770,6 +770,9 @@ void __init smp_detect_cpus(void) /* Get CPU information */ info = memblock_alloc(sizeof(*info), 8); + if (!info) + panic("%s: Failed to allocate %zu bytes align=0x%x\n", + __func__, sizeof(*info), 8); smp_get_core_info(info, 1); /* Find boot CPU type */ if (sclp.has_core_type) { diff --git a/arch/s390/kernel/topology.c b/arch/s390/kernel/topology.c index 8992b04c0ade..8964a3f60aad 100644 --- a/arch/s390/kernel/topology.c +++ b/arch/s390/kernel/topology.c @@ -520,6 +520,9 @@ static void __init alloc_masks(struct sysinfo_15_1_x *info, nr_masks = max(nr_masks, 1); for (i = 0; i < nr_masks; i++) { mask->next = memblock_alloc(sizeof(*mask->next), 8); + if (!mask->next) + panic("%s: Failed to allocate %zu bytes align=0x%x\n", + __func__, sizeof(*mask->next), 8); mask = mask->next; } } @@ -538,6 +541,9 @@ void __init topology_init_early(void) if (!MACHINE_HAS_TOPOLOGY) goto out; tl_info = memblock_alloc(PAGE_SIZE, PAGE_SIZE); + if (!tl_info) + panic("%s: Failed to allocate %lu bytes align=0x%lx\n", + __func__, PAGE_SIZE, PAGE_SIZE); info = tl_info; store_topology(info); pr_info("The CPU configuration topology of the machine is: %d %d %d %d %d %d / %d\n", diff --git a/arch/s390/numa/mode_emu.c b/arch/s390/numa/mode_emu.c index bfba273c32c0..71a12a4f4906 100644 --- a/arch/s390/numa/mode_emu.c +++ b/arch/s390/numa/mode_emu.c @@ -313,6 +313,9 @@ static void __ref create_core_to_node_map(void) int i; emu_cores = memblock_alloc(sizeof(*emu_cores), 8); + if (!emu_cores) + panic("%s: Failed to allocate %zu bytes align=0x%x\n", + __func__, sizeof(*emu_cores), 8); for (i = 0; i < ARRAY_SIZE(emu_cores->to_node_id); i++) emu_cores->to_node_id[i] = NODE_ID_FREE; } diff --git a/arch/s390/numa/numa.c b/arch/s390/numa/numa.c index 2d1271e2a70d..8eb9e9743f5d 100644 --- a/arch/s390/numa/numa.c +++ b/arch/s390/numa/numa.c @@ -92,8 +92,12 @@ static void __init numa_setup_memory(void) } while (cur_base < end_of_dram); /* Allocate and fill out node_data */ - for (nid = 0; nid < MAX_NUMNODES; nid++) + for (nid = 0; nid < MAX_NUMNODES; nid++) { NODE_DATA(nid) = memblock_alloc(sizeof(pg_data_t), 8); + if (!NODE_DATA(nid)) + panic("%s: Failed to allocate %zu bytes align=0x%x\n", + __func__, sizeof(pg_data_t), 8); + } for_each_online_node(nid) { unsigned long start_pfn, end_pfn; diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c index a0fa4de03dd5..fceefd92016f 100644 --- a/arch/sh/mm/init.c +++ b/arch/sh/mm/init.c @@ -128,6 +128,9 @@ static pmd_t * __init one_md_table_init(pud_t *pud) pmd_t *pmd; pmd = memblock_alloc(PAGE_SIZE, PAGE_SIZE); + if (!pmd) + panic("%s: Failed to allocate %lu bytes align=0x%lx\n", + __func__, PAGE_SIZE, PAGE_SIZE); pud_populate(&init_mm, pud, pmd); BUG_ON(pmd != pmd_offset(pud, 0)); } @@ -141,6 +144,9 @@ static pte_t * __init one_page_table_init(pmd_t *pmd) pte_t *pte; pte = memblock_alloc(PAGE_SIZE, PAGE_SIZE); + if (!pte) + panic("%s: Failed to allocate %lu bytes align=0x%lx\n", + __func__, PAGE_SIZE, PAGE_SIZE); pmd_populate_kernel(&init_mm, pmd, pte); BUG_ON(pte != pte_offset_kernel(pmd, 0)); } diff --git a/arch/sh/mm/numa.c b/arch/sh/mm/numa.c index c4bde6148810..f7e4439deb17 100644 --- a/arch/sh/mm/numa.c +++ b/arch/sh/mm/numa.c @@ -43,6 +43,10 @@ void __init setup_bootmem_node(int nid, unsigned long start, unsigned long end) /* Node-local pgdat */ NODE_DATA(nid) = memblock_alloc_node(sizeof(struct pglist_data), SMP_CACHE_BYTES, nid); + if (!NODE_DATA(nid)) + panic("%s: Failed to allocate %zu bytes align=0x%x nid=%d\n", + __func__, sizeof(struct pglist_data), SMP_CACHE_BYTES, + nid); NODE_DATA(nid)->node_start_pfn = start_pfn; NODE_DATA(nid)->node_spanned_pages = end_pfn - start_pfn; diff --git a/arch/um/drivers/net_kern.c b/arch/um/drivers/net_kern.c index d80cfb1d9430..6e5be5fb4143 100644 --- a/arch/um/drivers/net_kern.c +++ b/arch/um/drivers/net_kern.c @@ -649,6 +649,9 @@ static int __init eth_setup(char *str) } new = memblock_alloc(sizeof(*new), SMP_CACHE_BYTES); + if (!new) + panic("%s: Failed to allocate %zu bytes\n", __func__, + sizeof(*new)); INIT_LIST_HEAD(&new->list); new->index = n; diff --git a/arch/um/drivers/vector_kern.c b/arch/um/drivers/vector_kern.c index 046fa9ea0ccc..596e7056f376 100644 --- a/arch/um/drivers/vector_kern.c +++ b/arch/um/drivers/vector_kern.c @@ -1576,6 +1576,9 @@ static int __init vector_setup(char *str) return 1; } new = memblock_alloc(sizeof(*new), SMP_CACHE_BYTES); + if (!new) + panic("%s: Failed to allocate %zu bytes\n", __func__, + sizeof(*new)); INIT_LIST_HEAD(&new->list); new->unit = n; new->arguments = str; diff --git a/arch/um/kernel/initrd.c b/arch/um/kernel/initrd.c index ce169ea87e61..1dcd310cb34d 100644 --- a/arch/um/kernel/initrd.c +++ b/arch/um/kernel/initrd.c @@ -37,6 +37,8 @@ int __init read_initrd(void) } area = memblock_alloc(size, SMP_CACHE_BYTES); + if (!area) + panic("%s: Failed to allocate %llu bytes\n", __func__, size); if (load_initrd(initrd, area, size) == -1) return 0; diff --git a/arch/um/kernel/mem.c b/arch/um/kernel/mem.c index 799b571a8f88..99aa11bf53d1 100644 --- a/arch/um/kernel/mem.c +++ b/arch/um/kernel/mem.c @@ -66,6 +66,10 @@ static void __init one_page_table_init(pmd_t *pmd) if (pmd_none(*pmd)) { pte_t *pte = (pte_t *) memblock_alloc_low(PAGE_SIZE, PAGE_SIZE); + if (!pte) + panic("%s: Failed to allocate %lu bytes align=%lx\n", + __func__, PAGE_SIZE, PAGE_SIZE); + set_pmd(pmd, __pmd(_KERNPG_TABLE + (unsigned long) __pa(pte))); if (pte != pte_offset_kernel(pmd, 0)) @@ -77,6 +81,10 @@ static void __init one_md_table_init(pud_t *pud) { #ifdef CONFIG_3_LEVEL_PGTABLES pmd_t *pmd_table = (pmd_t *) memblock_alloc_low(PAGE_SIZE, PAGE_SIZE); + if (!pmd_table) + panic("%s: Failed to allocate %lu bytes align=%lx\n", + __func__, PAGE_SIZE, PAGE_SIZE); + set_pud(pud, __pud(_KERNPG_TABLE + (unsigned long) __pa(pmd_table))); if (pmd_table != pmd_offset(pud, 0)) BUG(); @@ -126,6 +134,10 @@ static void __init fixaddr_user_init( void) fixrange_init( FIXADDR_USER_START, FIXADDR_USER_END, swapper_pg_dir); v = (unsigned long) memblock_alloc_low(size, PAGE_SIZE); + if (!v) + panic("%s: Failed to allocate %lu bytes align=%lx\n", + __func__, size, PAGE_SIZE); + memcpy((void *) v , (void *) FIXADDR_USER_START, size); p = __pa(v); for ( ; size > 0; size -= PAGE_SIZE, vaddr += PAGE_SIZE, @@ -146,6 +158,10 @@ void __init paging_init(void) empty_zero_page = (unsigned long *) memblock_alloc_low(PAGE_SIZE, PAGE_SIZE); + if (!empty_zero_page) + panic("%s: Failed to allocate %lu bytes align=%lx\n", + __func__, PAGE_SIZE, PAGE_SIZE); + for (i = 0; i < ARRAY_SIZE(zones_size); i++) zones_size[i] = 0; diff --git a/arch/unicore32/kernel/setup.c b/arch/unicore32/kernel/setup.c index 4b0cb68c355a..d3239cf2e837 100644 --- a/arch/unicore32/kernel/setup.c +++ b/arch/unicore32/kernel/setup.c @@ -207,6 +207,10 @@ request_standard_resources(struct meminfo *mi) continue; res = memblock_alloc_low(sizeof(*res), SMP_CACHE_BYTES); + if (!res) + panic("%s: Failed to allocate %zu bytes align=%x\n", + __func__, sizeof(*res), SMP_CACHE_BYTES); + res->name = "System RAM"; res->start = mi->bank[i].start; res->end = mi->bank[i].start + mi->bank[i].size - 1; diff --git a/arch/unicore32/mm/mmu.c b/arch/unicore32/mm/mmu.c index a40219291965..aa2060beb408 100644 --- a/arch/unicore32/mm/mmu.c +++ b/arch/unicore32/mm/mmu.c @@ -145,8 +145,13 @@ static pte_t * __init early_pte_alloc(pmd_t *pmd, unsigned long addr, unsigned long prot) { if (pmd_none(*pmd)) { - pte_t *pte = memblock_alloc(PTRS_PER_PTE * sizeof(pte_t), - PTRS_PER_PTE * sizeof(pte_t)); + size_t size = PTRS_PER_PTE * sizeof(pte_t); + pte_t *pte = memblock_alloc(size, size); + + if (!pte) + panic("%s: Failed to allocate %zu bytes align=%zx\n", + __func__, size, size); + __pmd_populate(pmd, __pa(pte) | prot); } BUG_ON(pmd_bad(*pmd)); @@ -349,6 +354,9 @@ static void __init devicemaps_init(void) * Allocate the vector page early. */ vectors = memblock_alloc(PAGE_SIZE, PAGE_SIZE); + if (!vectors) + panic("%s: Failed to allocate %lu bytes align=0x%lx\n", + __func__, PAGE_SIZE, PAGE_SIZE); for (addr = VMALLOC_END; addr; addr += PGDIR_SIZE) pmd_clear(pmd_off_k(addr)); @@ -426,6 +434,9 @@ void __init paging_init(void) /* allocate the zero page. */ zero_page = memblock_alloc(PAGE_SIZE, PAGE_SIZE); + if (!zero_page) + panic("%s: Failed to allocate %lu bytes align=0x%lx\n", + __func__, PAGE_SIZE, PAGE_SIZE); bootmem_init(); diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c index 2624de16cd7a..8dcbf6890714 100644 --- a/arch/x86/kernel/acpi/boot.c +++ b/arch/x86/kernel/acpi/boot.c @@ -935,6 +935,9 @@ static int __init acpi_parse_hpet(struct acpi_table_header *table) #define HPET_RESOURCE_NAME_SIZE 9 hpet_res = memblock_alloc(sizeof(*hpet_res) + HPET_RESOURCE_NAME_SIZE, SMP_CACHE_BYTES); + if (!hpet_res) + panic("%s: Failed to allocate %zu bytes\n", __func__, + sizeof(*hpet_res) + HPET_RESOURCE_NAME_SIZE); hpet_res->name = (void *)&hpet_res[1]; hpet_res->flags = IORESOURCE_MEM; diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c index 264e3221d923..53aa234a6803 100644 --- a/arch/x86/kernel/apic/io_apic.c +++ b/arch/x86/kernel/apic/io_apic.c @@ -2581,6 +2581,8 @@ static struct resource * __init ioapic_setup_resources(void) n *= nr_ioapics; mem = memblock_alloc(n, SMP_CACHE_BYTES); + if (!mem) + panic("%s: Failed to allocate %lu bytes\n", __func__, n); res = (void *)mem; mem += sizeof(struct resource) * nr_ioapics; @@ -2625,6 +2627,9 @@ fake_ioapic_page: #endif ioapic_phys = (unsigned long)memblock_alloc(PAGE_SIZE, PAGE_SIZE); + if (!ioapic_phys) + panic("%s: Failed to allocate %lu bytes align=0x%lx\n", + __func__, PAGE_SIZE, PAGE_SIZE); ioapic_phys = __pa(ioapic_phys); } set_fixmap_nocache(idx, ioapic_phys); diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c index 5203ee4e6435..6831c8437951 100644 --- a/arch/x86/kernel/e820.c +++ b/arch/x86/kernel/e820.c @@ -1092,6 +1092,9 @@ void __init e820__reserve_resources(void) res = memblock_alloc(sizeof(*res) * e820_table->nr_entries, SMP_CACHE_BYTES); + if (!res) + panic("%s: Failed to allocate %zu bytes\n", __func__, + sizeof(*res) * e820_table->nr_entries); e820_res = res; for (i = 0; i < e820_table->nr_entries; i++) { diff --git a/arch/x86/platform/olpc/olpc_dt.c b/arch/x86/platform/olpc/olpc_dt.c index b4ab779f1d47..ac9e7bf49b66 100644 --- a/arch/x86/platform/olpc/olpc_dt.c +++ b/arch/x86/platform/olpc/olpc_dt.c @@ -141,6 +141,9 @@ void * __init prom_early_alloc(unsigned long size) * wasted bootmem) and hand off chunks of it to callers. */ res = memblock_alloc(chunk_size, SMP_CACHE_BYTES); + if (!res) + panic("%s: Failed to allocate %zu bytes\n", __func__, + chunk_size); BUG_ON(!res); prom_early_allocated += chunk_size; memset(res, 0, chunk_size); diff --git a/arch/x86/xen/p2m.c b/arch/x86/xen/p2m.c index 055e37e43541..95ce9b5be411 100644 --- a/arch/x86/xen/p2m.c +++ b/arch/x86/xen/p2m.c @@ -181,8 +181,15 @@ static void p2m_init_identity(unsigned long *p2m, unsigned long pfn) static void * __ref alloc_p2m_page(void) { - if (unlikely(!slab_is_available())) - return memblock_alloc(PAGE_SIZE, PAGE_SIZE); + if (unlikely(!slab_is_available())) { + void *ptr = memblock_alloc(PAGE_SIZE, PAGE_SIZE); + + if (!ptr) + panic("%s: Failed to allocate %lu bytes align=0x%lx\n", + __func__, PAGE_SIZE, PAGE_SIZE); + + return ptr; + } return (void *)__get_free_page(GFP_KERNEL); } diff --git a/arch/xtensa/mm/kasan_init.c b/arch/xtensa/mm/kasan_init.c index 4852848a0c28..af7152560bc3 100644 --- a/arch/xtensa/mm/kasan_init.c +++ b/arch/xtensa/mm/kasan_init.c @@ -45,6 +45,10 @@ static void __init populate(void *start, void *end) pmd_t *pmd = pmd_offset(pgd, vaddr); pte_t *pte = memblock_alloc(n_pages * sizeof(pte_t), PAGE_SIZE); + if (!pte) + panic("%s: Failed to allocate %lu bytes align=0x%lx\n", + __func__, n_pages * sizeof(pte_t), PAGE_SIZE); + pr_debug("%s: %p - %p\n", __func__, start, end); for (i = j = 0; i < n_pmds; ++i) { diff --git a/arch/xtensa/mm/mmu.c b/arch/xtensa/mm/mmu.c index a4dcfd39bc5c..2fb7d1172228 100644 --- a/arch/xtensa/mm/mmu.c +++ b/arch/xtensa/mm/mmu.c @@ -32,6 +32,9 @@ static void * __init init_pmd(unsigned long vaddr, unsigned long n_pages) __func__, vaddr, n_pages); pte = memblock_alloc_low(n_pages * sizeof(pte_t), PAGE_SIZE); + if (!pte) + panic("%s: Failed to allocate %zu bytes align=%lx\n", + __func__, n_pages * sizeof(pte_t), PAGE_SIZE); for (i = 0; i < n_pages; ++i) pte_clear(NULL, 0, pte + i); diff --git a/drivers/clk/ti/clk.c b/drivers/clk/ti/clk.c index d0cd58534781..5d7fb2eecce4 100644 --- a/drivers/clk/ti/clk.c +++ b/drivers/clk/ti/clk.c @@ -351,6 +351,9 @@ void __init omap2_clk_legacy_provider_init(int index, void __iomem *mem) struct clk_iomap *io; io = memblock_alloc(sizeof(*io), SMP_CACHE_BYTES); + if (!io) + panic("%s: Failed to allocate %zu bytes\n", __func__, + sizeof(*io)); io->mem = mem; diff --git a/drivers/macintosh/smu.c b/drivers/macintosh/smu.c index 42cf68d15da3..6a844125cf2d 100644 --- a/drivers/macintosh/smu.c +++ b/drivers/macintosh/smu.c @@ -493,6 +493,9 @@ int __init smu_init (void) } smu = memblock_alloc(sizeof(struct smu_device), SMP_CACHE_BYTES); + if (!smu) + panic("%s: Failed to allocate %zu bytes\n", __func__, + sizeof(struct smu_device)); spin_lock_init(&smu->lock); INIT_LIST_HEAD(&smu->cmd_list); diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c index 9cc1461aac7d..4734223ab702 100644 --- a/drivers/of/fdt.c +++ b/drivers/of/fdt.c @@ -1181,7 +1181,13 @@ int __init __weak early_init_dt_reserve_memory_arch(phys_addr_t base, static void * __init early_init_dt_alloc_memory_arch(u64 size, u64 align) { - return memblock_alloc(size, align); + void *ptr = memblock_alloc(size, align); + + if (!ptr) + panic("%s: Failed to allocate %llu bytes align=0x%llx\n", + __func__, size, align); + + return ptr; } bool __init early_init_dt_verify(void *params) diff --git a/drivers/of/unittest.c b/drivers/of/unittest.c index 66037511f2d7..cccde756b510 100644 --- a/drivers/of/unittest.c +++ b/drivers/of/unittest.c @@ -2241,7 +2241,13 @@ static struct device_node *overlay_base_root; static void * __init dt_alloc_memory(u64 size, u64 align) { - return memblock_alloc(size, align); + void *ptr = memblock_alloc(size, align); + + if (!ptr) + panic("%s: Failed to allocate %llu bytes align=0x%llx\n", + __func__, size, align); + + return ptr; } /* diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c index bb7888429be6..877baf2a94f4 100644 --- a/drivers/xen/swiotlb-xen.c +++ b/drivers/xen/swiotlb-xen.c @@ -214,10 +214,13 @@ retry: /* * Get IO TLB memory from any location. */ - if (early) + if (early) { xen_io_tlb_start = memblock_alloc(PAGE_ALIGN(bytes), PAGE_SIZE); - else { + if (!xen_io_tlb_start) + panic("%s: Failed to allocate %lu bytes align=0x%lx\n", + __func__, PAGE_ALIGN(bytes), PAGE_SIZE); + } else { #define SLABS_PER_PAGE (1 << (PAGE_SHIFT - IO_TLB_SHIFT)) #define IO_TLB_MIN_SLABS ((1<<20) >> IO_TLB_SHIFT) while ((SLABS_PER_PAGE << order) > IO_TLB_MIN_SLABS) { diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c index dd6a8e2d53a7..56ac77a80b1f 100644 --- a/kernel/dma/swiotlb.c +++ b/kernel/dma/swiotlb.c @@ -215,13 +215,13 @@ int __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose) alloc_size = PAGE_ALIGN(io_tlb_nslabs * sizeof(int)); io_tlb_list = memblock_alloc(alloc_size, PAGE_SIZE); if (!io_tlb_list) - panic("%s: Failed to allocate %lu bytes align=0x%lx\n", + panic("%s: Failed to allocate %zu bytes align=0x%lx\n", __func__, alloc_size, PAGE_SIZE); alloc_size = PAGE_ALIGN(io_tlb_nslabs * sizeof(phys_addr_t)); io_tlb_orig_addr = memblock_alloc(alloc_size, PAGE_SIZE); if (!io_tlb_orig_addr) - panic("%s: Failed to allocate %lu bytes align=0x%lx\n", + panic("%s: Failed to allocate %zu bytes align=0x%lx\n", __func__, alloc_size, PAGE_SIZE); for (i = 0; i < io_tlb_nslabs; i++) { diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c index 4802b039b89f..f08a1e4ee1d4 100644 --- a/kernel/power/snapshot.c +++ b/kernel/power/snapshot.c @@ -965,6 +965,9 @@ void __init __register_nosave_region(unsigned long start_pfn, /* This allocation cannot fail */ region = memblock_alloc(sizeof(struct nosave_region), SMP_CACHE_BYTES); + if (!region) + panic("%s: Failed to allocate %zu bytes\n", __func__, + sizeof(struct nosave_region)); } region->start_pfn = start_pfn; region->end_pfn = end_pfn; diff --git a/lib/cpumask.c b/lib/cpumask.c index 087a3e9a0202..0cb672eb107c 100644 --- a/lib/cpumask.c +++ b/lib/cpumask.c @@ -165,6 +165,9 @@ EXPORT_SYMBOL(zalloc_cpumask_var); void __init alloc_bootmem_cpumask_var(cpumask_var_t *mask) { *mask = memblock_alloc(cpumask_size(), SMP_CACHE_BYTES); + if (!*mask) + panic("%s: Failed to allocate %u bytes\n", __func__, + cpumask_size()); } /** diff --git a/mm/kasan/init.c b/mm/kasan/init.c index fcaa1ca03175..ce45c491ebcd 100644 --- a/mm/kasan/init.c +++ b/mm/kasan/init.c @@ -83,8 +83,14 @@ static inline bool kasan_early_shadow_page_entry(pte_t pte) static __init void *early_alloc(size_t size, int node) { - return memblock_alloc_try_nid(size, size, __pa(MAX_DMA_ADDRESS), - MEMBLOCK_ALLOC_ACCESSIBLE, node); + void *ptr = memblock_alloc_try_nid(size, size, __pa(MAX_DMA_ADDRESS), + MEMBLOCK_ALLOC_ACCESSIBLE, node); + + if (!ptr) + panic("%s: Failed to allocate %zu bytes align=%zx nid=%d from=%llx\n", + __func__, size, size, node, (u64)__pa(MAX_DMA_ADDRESS)); + + return ptr; } static void __ref zero_pte_populate(pmd_t *pmd, unsigned long addr, diff --git a/mm/sparse.c b/mm/sparse.c index 77a0554fa5bd..7397fb4e78b4 100644 --- a/mm/sparse.c +++ b/mm/sparse.c @@ -65,11 +65,15 @@ static noinline struct mem_section __ref *sparse_index_alloc(int nid) unsigned long array_size = SECTIONS_PER_ROOT * sizeof(struct mem_section); - if (slab_is_available()) + if (slab_is_available()) { section = kzalloc_node(array_size, GFP_KERNEL, nid); - else + } else { section = memblock_alloc_node(array_size, SMP_CACHE_BYTES, nid); + if (!section) + panic("%s: Failed to allocate %lu bytes nid=%d\n", + __func__, array_size, nid); + } return section; } @@ -218,6 +222,9 @@ void __init memory_present(int nid, unsigned long start, unsigned long end) size = sizeof(struct mem_section*) * NR_SECTION_ROOTS; align = 1 << (INTERNODE_CACHE_SHIFT); mem_section = memblock_alloc(size, align); + if (!mem_section) + panic("%s: Failed to allocate %lu bytes align=0x%lx\n", + __func__, size, align); } #endif @@ -404,13 +411,18 @@ struct page __init *sparse_mem_map_populate(unsigned long pnum, int nid, { unsigned long size = section_map_size(); struct page *map = sparse_buffer_alloc(size); + phys_addr_t addr = __pa(MAX_DMA_ADDRESS); if (map) return map; map = memblock_alloc_try_nid(size, - PAGE_SIZE, __pa(MAX_DMA_ADDRESS), + PAGE_SIZE, addr, MEMBLOCK_ALLOC_ACCESSIBLE, nid); + if (!map) + panic("%s: Failed to allocate %lu bytes align=0x%lx nid=%d from=%pa\n", + __func__, size, PAGE_SIZE, nid, &addr); + return map; } #endif /* !CONFIG_SPARSEMEM_VMEMMAP */ @@ -420,10 +432,11 @@ static void *sparsemap_buf_end __meminitdata; static void __init sparse_buffer_init(unsigned long size, int nid) { + phys_addr_t addr = __pa(MAX_DMA_ADDRESS); WARN_ON(sparsemap_buf); /* forgot to call sparse_buffer_fini()? */ sparsemap_buf = memblock_alloc_try_nid_raw(size, PAGE_SIZE, - __pa(MAX_DMA_ADDRESS), + addr, MEMBLOCK_ALLOC_ACCESSIBLE, nid); sparsemap_buf_end = sparsemap_buf + size; } -- cgit v1.3-7-g2ca7 From 26fb3dae0a1ec78bdde4b5b72e0e709503e8c596 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Mon, 11 Mar 2019 23:30:42 -0700 Subject: memblock: drop memblock_alloc_*_nopanic() variants As all the memblock allocation functions return NULL in case of error rather than panic(), the duplicates with _nopanic suffix can be removed. Link: http://lkml.kernel.org/r/1548057848-15136-22-git-send-email-rppt@linux.ibm.com Signed-off-by: Mike Rapoport Acked-by: Greg Kroah-Hartman Reviewed-by: Petr Mladek [printk] Cc: Catalin Marinas Cc: Christophe Leroy Cc: Christoph Hellwig Cc: "David S. Miller" Cc: Dennis Zhou Cc: Geert Uytterhoeven Cc: Greentime Hu Cc: Guan Xuetao Cc: Guo Ren Cc: Guo Ren [c-sky] Cc: Heiko Carstens Cc: Juergen Gross [Xen] Cc: Mark Salter Cc: Matt Turner Cc: Max Filippov Cc: Michael Ellerman Cc: Michal Simek Cc: Paul Burton Cc: Richard Weinberger Cc: Rich Felker Cc: Rob Herring Cc: Rob Herring Cc: Russell King Cc: Stafford Horne Cc: Tony Luck Cc: Vineet Gupta Cc: Yoshinori Sato Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/arc/kernel/unwind.c | 3 +-- arch/sh/mm/init.c | 2 +- arch/x86/kernel/setup_percpu.c | 10 +++++----- arch/x86/mm/kasan_init_64.c | 14 ++++++++------ drivers/firmware/memmap.c | 2 +- drivers/usb/early/xhci-dbc.c | 2 +- include/linux/memblock.h | 35 ----------------------------------- kernel/dma/swiotlb.c | 2 +- kernel/printk/printk.c | 9 +-------- mm/memblock.c | 35 ----------------------------------- mm/page_alloc.c | 10 +++++----- mm/page_ext.c | 2 +- mm/percpu.c | 11 ++++------- mm/sparse.c | 6 ++---- 14 files changed, 31 insertions(+), 112 deletions(-) (limited to 'kernel') diff --git a/arch/arc/kernel/unwind.c b/arch/arc/kernel/unwind.c index d34f69eb1a95..271e9fafa479 100644 --- a/arch/arc/kernel/unwind.c +++ b/arch/arc/kernel/unwind.c @@ -181,8 +181,7 @@ static void init_unwind_hdr(struct unwind_table *table, */ static void *__init unw_hdr_alloc_early(unsigned long sz) { - return memblock_alloc_from_nopanic(sz, sizeof(unsigned int), - MAX_DMA_ADDRESS); + return memblock_alloc_from(sz, sizeof(unsigned int), MAX_DMA_ADDRESS); } static void *unw_hdr_alloc(unsigned long sz) diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c index fceefd92016f..70621324db41 100644 --- a/arch/sh/mm/init.c +++ b/arch/sh/mm/init.c @@ -202,7 +202,7 @@ void __init allocate_pgdat(unsigned int nid) get_pfn_range_for_nid(nid, &start_pfn, &end_pfn); #ifdef CONFIG_NEED_MULTIPLE_NODES - NODE_DATA(nid) = memblock_alloc_try_nid_nopanic( + NODE_DATA(nid) = memblock_alloc_try_nid( sizeof(struct pglist_data), SMP_CACHE_BYTES, MEMBLOCK_LOW_LIMIT, MEMBLOCK_ALLOC_ACCESSIBLE, nid); diff --git a/arch/x86/kernel/setup_percpu.c b/arch/x86/kernel/setup_percpu.c index 13af08827eef..4bf46575568a 100644 --- a/arch/x86/kernel/setup_percpu.c +++ b/arch/x86/kernel/setup_percpu.c @@ -106,22 +106,22 @@ static void * __init pcpu_alloc_bootmem(unsigned int cpu, unsigned long size, void *ptr; if (!node_online(node) || !NODE_DATA(node)) { - ptr = memblock_alloc_from_nopanic(size, align, goal); + ptr = memblock_alloc_from(size, align, goal); pr_info("cpu %d has no node %d or node-local memory\n", cpu, node); pr_debug("per cpu data for cpu%d %lu bytes at %016lx\n", cpu, size, __pa(ptr)); } else { - ptr = memblock_alloc_try_nid_nopanic(size, align, goal, - MEMBLOCK_ALLOC_ACCESSIBLE, - node); + ptr = memblock_alloc_try_nid(size, align, goal, + MEMBLOCK_ALLOC_ACCESSIBLE, + node); pr_debug("per cpu data for cpu%d %lu bytes on node%d at %016lx\n", cpu, size, node, __pa(ptr)); } return ptr; #else - return memblock_alloc_from_nopanic(size, align, goal); + return memblock_alloc_from(size, align, goal); #endif } diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c index 462fde83b515..8dc0fc0b1382 100644 --- a/arch/x86/mm/kasan_init_64.c +++ b/arch/x86/mm/kasan_init_64.c @@ -24,14 +24,16 @@ extern struct range pfn_mapped[E820_MAX_ENTRIES]; static p4d_t tmp_p4d_table[MAX_PTRS_PER_P4D] __initdata __aligned(PAGE_SIZE); -static __init void *early_alloc(size_t size, int nid, bool panic) +static __init void *early_alloc(size_t size, int nid, bool should_panic) { - if (panic) - return memblock_alloc_try_nid(size, size, - __pa(MAX_DMA_ADDRESS), MEMBLOCK_ALLOC_ACCESSIBLE, nid); - else - return memblock_alloc_try_nid_nopanic(size, size, + void *ptr = memblock_alloc_try_nid(size, size, __pa(MAX_DMA_ADDRESS), MEMBLOCK_ALLOC_ACCESSIBLE, nid); + + if (!ptr && should_panic) + panic("%pS: Failed to allocate page, nid=%d from=%lx\n", + (void *)_RET_IP_, nid, __pa(MAX_DMA_ADDRESS)); + + return ptr; } static void __init kasan_populate_pmd(pmd_t *pmd, unsigned long addr, diff --git a/drivers/firmware/memmap.c b/drivers/firmware/memmap.c index ec4fd253a4e9..d168c87c7d30 100644 --- a/drivers/firmware/memmap.c +++ b/drivers/firmware/memmap.c @@ -333,7 +333,7 @@ int __init firmware_map_add_early(u64 start, u64 end, const char *type) { struct firmware_map_entry *entry; - entry = memblock_alloc_nopanic(sizeof(struct firmware_map_entry), + entry = memblock_alloc(sizeof(struct firmware_map_entry), SMP_CACHE_BYTES); if (WARN_ON(!entry)) return -ENOMEM; diff --git a/drivers/usb/early/xhci-dbc.c b/drivers/usb/early/xhci-dbc.c index d2652dccc699..c9cfb100ecdc 100644 --- a/drivers/usb/early/xhci-dbc.c +++ b/drivers/usb/early/xhci-dbc.c @@ -94,7 +94,7 @@ static void * __init xdbc_get_page(dma_addr_t *dma_addr) { void *virt; - virt = memblock_alloc_nopanic(PAGE_SIZE, PAGE_SIZE); + virt = memblock_alloc(PAGE_SIZE, PAGE_SIZE); if (!virt) return NULL; diff --git a/include/linux/memblock.h b/include/linux/memblock.h index c077227e6d53..db69ad97aa2e 100644 --- a/include/linux/memblock.h +++ b/include/linux/memblock.h @@ -335,9 +335,6 @@ static inline phys_addr_t memblock_phys_alloc(phys_addr_t size, void *memblock_alloc_try_nid_raw(phys_addr_t size, phys_addr_t align, phys_addr_t min_addr, phys_addr_t max_addr, int nid); -void *memblock_alloc_try_nid_nopanic(phys_addr_t size, phys_addr_t align, - phys_addr_t min_addr, phys_addr_t max_addr, - int nid); void *memblock_alloc_try_nid(phys_addr_t size, phys_addr_t align, phys_addr_t min_addr, phys_addr_t max_addr, int nid); @@ -364,36 +361,12 @@ static inline void * __init memblock_alloc_from(phys_addr_t size, MEMBLOCK_ALLOC_ACCESSIBLE, NUMA_NO_NODE); } -static inline void * __init memblock_alloc_nopanic(phys_addr_t size, - phys_addr_t align) -{ - return memblock_alloc_try_nid_nopanic(size, align, MEMBLOCK_LOW_LIMIT, - MEMBLOCK_ALLOC_ACCESSIBLE, - NUMA_NO_NODE); -} - static inline void * __init memblock_alloc_low(phys_addr_t size, phys_addr_t align) { return memblock_alloc_try_nid(size, align, MEMBLOCK_LOW_LIMIT, ARCH_LOW_ADDRESS_LIMIT, NUMA_NO_NODE); } -static inline void * __init memblock_alloc_low_nopanic(phys_addr_t size, - phys_addr_t align) -{ - return memblock_alloc_try_nid_nopanic(size, align, MEMBLOCK_LOW_LIMIT, - ARCH_LOW_ADDRESS_LIMIT, - NUMA_NO_NODE); -} - -static inline void * __init memblock_alloc_from_nopanic(phys_addr_t size, - phys_addr_t align, - phys_addr_t min_addr) -{ - return memblock_alloc_try_nid_nopanic(size, align, min_addr, - MEMBLOCK_ALLOC_ACCESSIBLE, - NUMA_NO_NODE); -} static inline void * __init memblock_alloc_node(phys_addr_t size, phys_addr_t align, int nid) @@ -402,14 +375,6 @@ static inline void * __init memblock_alloc_node(phys_addr_t size, MEMBLOCK_ALLOC_ACCESSIBLE, nid); } -static inline void * __init memblock_alloc_node_nopanic(phys_addr_t size, - int nid) -{ - return memblock_alloc_try_nid_nopanic(size, SMP_CACHE_BYTES, - MEMBLOCK_LOW_LIMIT, - MEMBLOCK_ALLOC_ACCESSIBLE, nid); -} - static inline void __init memblock_free_early(phys_addr_t base, phys_addr_t size) { diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c index 56ac77a80b1f..53012db1e53c 100644 --- a/kernel/dma/swiotlb.c +++ b/kernel/dma/swiotlb.c @@ -256,7 +256,7 @@ swiotlb_init(int verbose) bytes = io_tlb_nslabs << IO_TLB_SHIFT; /* Get IO TLB memory from the low pages */ - vstart = memblock_alloc_low_nopanic(PAGE_ALIGN(bytes), PAGE_SIZE); + vstart = memblock_alloc_low(PAGE_ALIGN(bytes), PAGE_SIZE); if (vstart && !swiotlb_init_with_tbl(vstart, io_tlb_nslabs, verbose)) return; diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index 8eee85bb2687..6b7654b8001f 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -1143,14 +1143,7 @@ void __init setup_log_buf(int early) if (!new_log_buf_len) return; - if (early) { - new_log_buf = - memblock_alloc(new_log_buf_len, LOG_ALIGN); - } else { - new_log_buf = memblock_alloc_nopanic(new_log_buf_len, - LOG_ALIGN); - } - + new_log_buf = memblock_alloc(new_log_buf_len, LOG_ALIGN); if (unlikely(!new_log_buf)) { pr_err("log_buf_len: %lu bytes not available\n", new_log_buf_len); diff --git a/mm/memblock.c b/mm/memblock.c index a838c50ca9a8..0ab30d0185bc 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -1433,41 +1433,6 @@ void * __init memblock_alloc_try_nid_raw( return ptr; } -/** - * memblock_alloc_try_nid_nopanic - allocate boot memory block - * @size: size of memory block to be allocated in bytes - * @align: alignment of the region and block's size - * @min_addr: the lower bound of the memory region from where the allocation - * is preferred (phys address) - * @max_addr: the upper bound of the memory region from where the allocation - * is preferred (phys address), or %MEMBLOCK_ALLOC_ACCESSIBLE to - * allocate only from memory limited by memblock.current_limit value - * @nid: nid of the free area to find, %NUMA_NO_NODE for any node - * - * Public function, provides additional debug information (including caller - * info), if enabled. This function zeroes the allocated memory. - * - * Return: - * Virtual address of allocated memory block on success, NULL on failure. - */ -void * __init memblock_alloc_try_nid_nopanic( - phys_addr_t size, phys_addr_t align, - phys_addr_t min_addr, phys_addr_t max_addr, - int nid) -{ - void *ptr; - - memblock_dbg("%s: %llu bytes align=0x%llx nid=%d from=%pa max_addr=%pa %pF\n", - __func__, (u64)size, (u64)align, nid, &min_addr, - &max_addr, (void *)_RET_IP_); - - ptr = memblock_alloc_internal(size, align, - min_addr, max_addr, nid); - if (ptr) - memset(ptr, 0, size); - return ptr; -} - /** * memblock_alloc_try_nid - allocate boot memory block * @size: size of memory block to be allocated in bytes diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 3eb01dedfb50..03fcf73d47da 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6445,8 +6445,8 @@ static void __ref setup_usemap(struct pglist_data *pgdat, zone->pageblock_flags = NULL; if (usemapsize) { zone->pageblock_flags = - memblock_alloc_node_nopanic(usemapsize, - pgdat->node_id); + memblock_alloc_node(usemapsize, SMP_CACHE_BYTES, + pgdat->node_id); if (!zone->pageblock_flags) panic("Failed to allocate %ld bytes for zone %s pageblock flags on node %d\n", usemapsize, zone->name, pgdat->node_id); @@ -6679,7 +6679,8 @@ static void __ref alloc_node_mem_map(struct pglist_data *pgdat) end = pgdat_end_pfn(pgdat); end = ALIGN(end, MAX_ORDER_NR_PAGES); size = (end - start) * sizeof(struct page); - map = memblock_alloc_node_nopanic(size, pgdat->node_id); + map = memblock_alloc_node(size, SMP_CACHE_BYTES, + pgdat->node_id); if (!map) panic("Failed to allocate %ld bytes for node %d memory map\n", size, pgdat->node_id); @@ -7959,8 +7960,7 @@ void *__init alloc_large_system_hash(const char *tablename, size = bucketsize << log2qty; if (flags & HASH_EARLY) { if (flags & HASH_ZERO) - table = memblock_alloc_nopanic(size, - SMP_CACHE_BYTES); + table = memblock_alloc(size, SMP_CACHE_BYTES); else table = memblock_alloc_raw(size, SMP_CACHE_BYTES); diff --git a/mm/page_ext.c b/mm/page_ext.c index ab4244920e0f..d8f1aca4ad43 100644 --- a/mm/page_ext.c +++ b/mm/page_ext.c @@ -161,7 +161,7 @@ static int __init alloc_node_page_ext(int nid) table_size = get_entry_size() * nr_pages; - base = memblock_alloc_try_nid_nopanic( + base = memblock_alloc_try_nid( table_size, PAGE_SIZE, __pa(MAX_DMA_ADDRESS), MEMBLOCK_ALLOC_ACCESSIBLE, nid); if (!base) diff --git a/mm/percpu.c b/mm/percpu.c index 3f9fb3086a9b..2e6fc8d552c9 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -1905,7 +1905,7 @@ struct pcpu_alloc_info * __init pcpu_alloc_alloc_info(int nr_groups, __alignof__(ai->groups[0].cpu_map[0])); ai_size = base_size + nr_units * sizeof(ai->groups[0].cpu_map[0]); - ptr = memblock_alloc_nopanic(PFN_ALIGN(ai_size), PAGE_SIZE); + ptr = memblock_alloc(PFN_ALIGN(ai_size), PAGE_SIZE); if (!ptr) return NULL; ai = ptr; @@ -2496,7 +2496,7 @@ int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size, size_sum = ai->static_size + ai->reserved_size + ai->dyn_size; areas_size = PFN_ALIGN(ai->nr_groups * sizeof(void *)); - areas = memblock_alloc_nopanic(areas_size, SMP_CACHE_BYTES); + areas = memblock_alloc(areas_size, SMP_CACHE_BYTES); if (!areas) { rc = -ENOMEM; goto out_free; @@ -2729,8 +2729,7 @@ EXPORT_SYMBOL(__per_cpu_offset); static void * __init pcpu_dfl_fc_alloc(unsigned int cpu, size_t size, size_t align) { - return memblock_alloc_from_nopanic( - size, align, __pa(MAX_DMA_ADDRESS)); + return memblock_alloc_from(size, align, __pa(MAX_DMA_ADDRESS)); } static void __init pcpu_dfl_fc_free(void *ptr, size_t size) @@ -2778,9 +2777,7 @@ void __init setup_per_cpu_areas(void) void *fc; ai = pcpu_alloc_alloc_info(1, 1); - fc = memblock_alloc_from_nopanic(unit_size, - PAGE_SIZE, - __pa(MAX_DMA_ADDRESS)); + fc = memblock_alloc_from(unit_size, PAGE_SIZE, __pa(MAX_DMA_ADDRESS)); if (!ai || !fc) panic("Failed to allocate memory for percpu areas."); /* kmemleak tracks the percpu allocations separately */ diff --git a/mm/sparse.c b/mm/sparse.c index 7397fb4e78b4..69904aa6165b 100644 --- a/mm/sparse.c +++ b/mm/sparse.c @@ -330,9 +330,7 @@ sparse_early_usemaps_alloc_pgdat_section(struct pglist_data *pgdat, limit = goal + (1UL << PA_SECTION_SHIFT); nid = early_pfn_to_nid(goal >> PAGE_SHIFT); again: - p = memblock_alloc_try_nid_nopanic(size, - SMP_CACHE_BYTES, goal, limit, - nid); + p = memblock_alloc_try_nid(size, SMP_CACHE_BYTES, goal, limit, nid); if (!p && limit) { limit = 0; goto again; @@ -386,7 +384,7 @@ static unsigned long * __init sparse_early_usemaps_alloc_pgdat_section(struct pglist_data *pgdat, unsigned long size) { - return memblock_alloc_node_nopanic(size, pgdat->node_id); + return memblock_alloc_node(size, SMP_CACHE_BYTES, pgdat->node_id); } static void __init check_usemap_section_nr(int nid, unsigned long *usemap) -- cgit v1.3-7-g2ca7 From 31b265b3baaf55f209229888b7ffea523ddab366 Mon Sep 17 00:00:00 2001 From: Douglas Anderson Date: Fri, 8 Mar 2019 11:32:04 -0800 Subject: tracing: kdb: Fix ftdump to not sleep As reported back in 2016-11 [1], the "ftdump" kdb command triggers a BUG for "sleeping function called from invalid context". kdb's "ftdump" command wants to call ring_buffer_read_prepare() in atomic context. A very simple solution for this is to add allocation flags to ring_buffer_read_prepare() so kdb can call it without triggering the allocation error. This patch does that. Note that in the original email thread about this, it was suggested that perhaps the solution for kdb was to either preallocate the buffer ahead of time or create our own iterator. I'm hoping that this alternative of adding allocation flags to ring_buffer_read_prepare() can be considered since it means I don't need to duplicate more of the core trace code into "trace_kdb.c" (for either creating my own iterator or re-preparing a ring allocator whose memory was already allocated). NOTE: another option for kdb is to actually figure out how to make it reuse the existing ftrace_dump() function and totally eliminate the duplication. This sounds very appealing and actually works (the "sr z" command can be seen to properly dump the ftrace buffer). The downside here is that ftrace_dump() fully consumes the trace buffer. Unless that is changed I'd rather not use it because it means "ftdump | grep xyz" won't be very useful to search the ftrace buffer since it will throw away the whole trace on the first grep. A future patch to dump only the last few lines of the buffer will also be hard to implement. [1] https://lkml.kernel.org/r/20161117191605.GA21459@google.com Link: http://lkml.kernel.org/r/20190308193205.213659-1-dianders@chromium.org Reported-by: Brian Norris Signed-off-by: Douglas Anderson Signed-off-by: Steven Rostedt (VMware) --- include/linux/ring_buffer.h | 2 +- kernel/trace/ring_buffer.c | 5 +++-- kernel/trace/trace.c | 6 ++++-- kernel/trace/trace_kdb.c | 6 ++++-- 4 files changed, 12 insertions(+), 7 deletions(-) (limited to 'kernel') diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h index f1429675f252..1a40277b512c 100644 --- a/include/linux/ring_buffer.h +++ b/include/linux/ring_buffer.h @@ -128,7 +128,7 @@ ring_buffer_consume(struct ring_buffer *buffer, int cpu, u64 *ts, unsigned long *lost_events); struct ring_buffer_iter * -ring_buffer_read_prepare(struct ring_buffer *buffer, int cpu); +ring_buffer_read_prepare(struct ring_buffer *buffer, int cpu, gfp_t flags); void ring_buffer_read_prepare_sync(void); void ring_buffer_read_start(struct ring_buffer_iter *iter); void ring_buffer_read_finish(struct ring_buffer_iter *iter); diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 9a91479bbbfe..41b6f96e5366 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -4191,6 +4191,7 @@ EXPORT_SYMBOL_GPL(ring_buffer_consume); * ring_buffer_read_prepare - Prepare for a non consuming read of the buffer * @buffer: The ring buffer to read from * @cpu: The cpu buffer to iterate over + * @flags: gfp flags to use for memory allocation * * This performs the initial preparations necessary to iterate * through the buffer. Memory is allocated, buffer recording @@ -4208,7 +4209,7 @@ EXPORT_SYMBOL_GPL(ring_buffer_consume); * This overall must be paired with ring_buffer_read_finish. */ struct ring_buffer_iter * -ring_buffer_read_prepare(struct ring_buffer *buffer, int cpu) +ring_buffer_read_prepare(struct ring_buffer *buffer, int cpu, gfp_t flags) { struct ring_buffer_per_cpu *cpu_buffer; struct ring_buffer_iter *iter; @@ -4216,7 +4217,7 @@ ring_buffer_read_prepare(struct ring_buffer *buffer, int cpu) if (!cpumask_test_cpu(cpu, buffer->cpumask)) return NULL; - iter = kmalloc(sizeof(*iter), GFP_KERNEL); + iter = kmalloc(sizeof(*iter), flags); if (!iter) return NULL; diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index e9cc47e59d25..ccd759eaad79 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -4077,7 +4077,8 @@ __tracing_open(struct inode *inode, struct file *file, bool snapshot) if (iter->cpu_file == RING_BUFFER_ALL_CPUS) { for_each_tracing_cpu(cpu) { iter->buffer_iter[cpu] = - ring_buffer_read_prepare(iter->trace_buffer->buffer, cpu); + ring_buffer_read_prepare(iter->trace_buffer->buffer, + cpu, GFP_KERNEL); } ring_buffer_read_prepare_sync(); for_each_tracing_cpu(cpu) { @@ -4087,7 +4088,8 @@ __tracing_open(struct inode *inode, struct file *file, bool snapshot) } else { cpu = iter->cpu_file; iter->buffer_iter[cpu] = - ring_buffer_read_prepare(iter->trace_buffer->buffer, cpu); + ring_buffer_read_prepare(iter->trace_buffer->buffer, + cpu, GFP_KERNEL); ring_buffer_read_prepare_sync(); ring_buffer_read_start(iter->buffer_iter[cpu]); tracing_iter_reset(iter, cpu); diff --git a/kernel/trace/trace_kdb.c b/kernel/trace/trace_kdb.c index d953c163a079..810d78a8d14c 100644 --- a/kernel/trace/trace_kdb.c +++ b/kernel/trace/trace_kdb.c @@ -51,14 +51,16 @@ static void ftrace_dump_buf(int skip_lines, long cpu_file) if (cpu_file == RING_BUFFER_ALL_CPUS) { for_each_tracing_cpu(cpu) { iter.buffer_iter[cpu] = - ring_buffer_read_prepare(iter.trace_buffer->buffer, cpu); + ring_buffer_read_prepare(iter.trace_buffer->buffer, + cpu, GFP_ATOMIC); ring_buffer_read_start(iter.buffer_iter[cpu]); tracing_iter_reset(&iter, cpu); } } else { iter.cpu_file = cpu_file; iter.buffer_iter[cpu_file] = - ring_buffer_read_prepare(iter.trace_buffer->buffer, cpu_file); + ring_buffer_read_prepare(iter.trace_buffer->buffer, + cpu_file, GFP_ATOMIC); ring_buffer_read_start(iter.buffer_iter[cpu_file]); tracing_iter_reset(&iter, cpu_file); } -- cgit v1.3-7-g2ca7 From 1b986589680a2a5b6fc1ac196ea69925a93d9dd9 Mon Sep 17 00:00:00 2001 From: Martin KaFai Lau Date: Tue, 12 Mar 2019 10:23:02 -0700 Subject: bpf: Fix bpf_tcp_sock and bpf_sk_fullsock issue related to bpf_sk_release Lorenz Bauer [thanks!] reported that a ptr returned by bpf_tcp_sock(sk) can still be accessed after bpf_sk_release(sk). Both bpf_tcp_sock() and bpf_sk_fullsock() have the same issue. This patch addresses them together. A simple reproducer looks like this: sk = bpf_sk_lookup_tcp(); /* if (!sk) ... */ tp = bpf_tcp_sock(sk); /* if (!tp) ... */ bpf_sk_release(sk); snd_cwnd = tp->snd_cwnd; /* oops! The verifier does not complain. */ The problem is the verifier did not scrub the register's states of the tcp_sock ptr (tp) after bpf_sk_release(sk). [ Note that when calling bpf_tcp_sock(sk), the sk is not always refcount-acquired. e.g. bpf_tcp_sock(skb->sk). The verifier works fine for this case. ] Currently, the verifier does not track if a helper's return ptr (in REG_0) is "carry"-ing one of its argument's refcount status. To carry this info, the reg1->id needs to be stored in reg0. One approach was tried, like "reg0->id = reg1->id", when calling "bpf_tcp_sock()". The main idea was to avoid adding another "ref_obj_id" for the same reg. However, overlapping the NULL marking and ref tracking purpose in one "id" does not work well: ref_sk = bpf_sk_lookup_tcp(); fullsock = bpf_sk_fullsock(ref_sk); tp = bpf_tcp_sock(ref_sk); if (!fullsock) { bpf_sk_release(ref_sk); return 0; } /* fullsock_reg->id is marked for NOT-NULL. * Same for tp_reg->id because they have the same id. */ /* oops. verifier did not complain about the missing !tp check */ snd_cwnd = tp->snd_cwnd; Hence, a new "ref_obj_id" is needed in "struct bpf_reg_state". With a new ref_obj_id, when bpf_sk_release(sk) is called, the verifier can scrub all reg states which has a ref_obj_id match. It is done with the changes in release_reg_references() in this patch. While fixing it, sk_to_full_sk() is removed from bpf_tcp_sock() and bpf_sk_fullsock() to avoid these helpers from returning another ptr. It will make bpf_sk_release(tp) possible: sk = bpf_sk_lookup_tcp(); /* if (!sk) ... */ tp = bpf_tcp_sock(sk); /* if (!tp) ... */ bpf_sk_release(tp); A separate helper "bpf_get_listener_sock()" will be added in a later patch to do sk_to_full_sk(). Misc change notes: - To allow bpf_sk_release(tp), the arg of bpf_sk_release() is changed from ARG_PTR_TO_SOCKET to ARG_PTR_TO_SOCK_COMMON. ARG_PTR_TO_SOCKET is removed from bpf.h since no helper is using it. - arg_type_is_refcounted() is renamed to arg_type_may_be_refcounted() because ARG_PTR_TO_SOCK_COMMON is the only one and skb->sk is not refcounted. All bpf_sk_release(), bpf_sk_fullsock() and bpf_tcp_sock() take ARG_PTR_TO_SOCK_COMMON. - check_refcount_ok() ensures is_acquire_function() cannot take arg_type_may_be_refcounted() as its argument. - The check_func_arg() can only allow one refcount-ed arg. It is guaranteed by check_refcount_ok() which ensures at most one arg can be refcounted. Hence, it is a verifier internal error if >1 refcount arg found in check_func_arg(). - In release_reference(), release_reference_state() is called first to ensure a match on "reg->ref_obj_id" can be found before scrubbing the reg states with release_reg_references(). - reg_is_refcounted() is no longer needed. 1. In mark_ptr_or_null_regs(), its usage is replaced by "ref_obj_id && ref_obj_id == id" because, when is_null == true, release_reference_state() should only be called on the ref_obj_id obtained by a acquire helper (i.e. is_acquire_function() == true). Otherwise, the following would happen: sk = bpf_sk_lookup_tcp(); /* if (!sk) { ... } */ fullsock = bpf_sk_fullsock(sk); if (!fullsock) { /* * release_reference_state(fullsock_reg->ref_obj_id) * where fullsock_reg->ref_obj_id == sk_reg->ref_obj_id. * * Hence, the following bpf_sk_release(sk) will fail * because the ref state has already been released in the * earlier release_reference_state(fullsock_reg->ref_obj_id). */ bpf_sk_release(sk); } 2. In release_reg_references(), the current reg_is_refcounted() call is unnecessary because the id check is enough. - The type_is_refcounted() and type_is_refcounted_or_null() are no longer needed also because reg_is_refcounted() is removed. Fixes: 655a51e536c0 ("bpf: Add struct bpf_tcp_sock and BPF_FUNC_tcp_sock") Reported-by: Lorenz Bauer Signed-off-by: Martin KaFai Lau Signed-off-by: Alexei Starovoitov --- include/linux/bpf.h | 1 - include/linux/bpf_verifier.h | 40 +++++++++++++ kernel/bpf/verifier.c | 131 ++++++++++++++++++++++++------------------- net/core/filter.c | 6 +- 4 files changed, 115 insertions(+), 63 deletions(-) (limited to 'kernel') diff --git a/include/linux/bpf.h b/include/linux/bpf.h index a2132e09dc1c..f02367faa58d 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -193,7 +193,6 @@ enum bpf_arg_type { ARG_PTR_TO_CTX, /* pointer to context */ ARG_ANYTHING, /* any (initialized) argument is ok */ - ARG_PTR_TO_SOCKET, /* pointer to bpf_sock */ ARG_PTR_TO_SPIN_LOCK, /* pointer to bpf_spin_lock */ ARG_PTR_TO_SOCK_COMMON, /* pointer to sock_common */ }; diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h index 69f7a3449eda..7d8228d1c898 100644 --- a/include/linux/bpf_verifier.h +++ b/include/linux/bpf_verifier.h @@ -66,6 +66,46 @@ struct bpf_reg_state { * same reference to the socket, to determine proper reference freeing. */ u32 id; + /* PTR_TO_SOCKET and PTR_TO_TCP_SOCK could be a ptr returned + * from a pointer-cast helper, bpf_sk_fullsock() and + * bpf_tcp_sock(). + * + * Consider the following where "sk" is a reference counted + * pointer returned from "sk = bpf_sk_lookup_tcp();": + * + * 1: sk = bpf_sk_lookup_tcp(); + * 2: if (!sk) { return 0; } + * 3: fullsock = bpf_sk_fullsock(sk); + * 4: if (!fullsock) { bpf_sk_release(sk); return 0; } + * 5: tp = bpf_tcp_sock(fullsock); + * 6: if (!tp) { bpf_sk_release(sk); return 0; } + * 7: bpf_sk_release(sk); + * 8: snd_cwnd = tp->snd_cwnd; // verifier will complain + * + * After bpf_sk_release(sk) at line 7, both "fullsock" ptr and + * "tp" ptr should be invalidated also. In order to do that, + * the reg holding "fullsock" and "sk" need to remember + * the original refcounted ptr id (i.e. sk_reg->id) in ref_obj_id + * such that the verifier can reset all regs which have + * ref_obj_id matching the sk_reg->id. + * + * sk_reg->ref_obj_id is set to sk_reg->id at line 1. + * sk_reg->id will stay as NULL-marking purpose only. + * After NULL-marking is done, sk_reg->id can be reset to 0. + * + * After "fullsock = bpf_sk_fullsock(sk);" at line 3, + * fullsock_reg->ref_obj_id is set to sk_reg->ref_obj_id. + * + * After "tp = bpf_tcp_sock(fullsock);" at line 5, + * tp_reg->ref_obj_id is set to fullsock_reg->ref_obj_id + * which is the same as sk_reg->ref_obj_id. + * + * From the verifier perspective, if sk, fullsock and tp + * are not NULL, they are the same ptr with different + * reg->type. In particular, bpf_sk_release(tp) is also + * allowed and has the same effect as bpf_sk_release(sk). + */ + u32 ref_obj_id; /* For scalar types (SCALAR_VALUE), this represents our knowledge of * the actual value. * For pointer types, this represents the variable part of the offset diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index ce166a002d16..86f9cd5d1c4e 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -212,7 +212,7 @@ struct bpf_call_arg_meta { int access_size; s64 msize_smax_value; u64 msize_umax_value; - int ptr_id; + int ref_obj_id; int func_id; }; @@ -346,35 +346,15 @@ static bool reg_type_may_be_null(enum bpf_reg_type type) type == PTR_TO_TCP_SOCK_OR_NULL; } -static bool type_is_refcounted(enum bpf_reg_type type) -{ - return type == PTR_TO_SOCKET; -} - -static bool type_is_refcounted_or_null(enum bpf_reg_type type) -{ - return type == PTR_TO_SOCKET || type == PTR_TO_SOCKET_OR_NULL; -} - -static bool reg_is_refcounted(const struct bpf_reg_state *reg) -{ - return type_is_refcounted(reg->type); -} - static bool reg_may_point_to_spin_lock(const struct bpf_reg_state *reg) { return reg->type == PTR_TO_MAP_VALUE && map_value_has_spin_lock(reg->map_ptr); } -static bool reg_is_refcounted_or_null(const struct bpf_reg_state *reg) +static bool arg_type_may_be_refcounted(enum bpf_arg_type type) { - return type_is_refcounted_or_null(reg->type); -} - -static bool arg_type_is_refcounted(enum bpf_arg_type type) -{ - return type == ARG_PTR_TO_SOCKET; + return type == ARG_PTR_TO_SOCK_COMMON; } /* Determine whether the function releases some resources allocated by another @@ -392,6 +372,12 @@ static bool is_acquire_function(enum bpf_func_id func_id) func_id == BPF_FUNC_sk_lookup_udp; } +static bool is_ptr_cast_function(enum bpf_func_id func_id) +{ + return func_id == BPF_FUNC_tcp_sock || + func_id == BPF_FUNC_sk_fullsock; +} + /* string representation of 'enum bpf_reg_type' */ static const char * const reg_type_str[] = { [NOT_INIT] = "?", @@ -465,7 +451,8 @@ static void print_verifier_state(struct bpf_verifier_env *env, if (t == PTR_TO_STACK) verbose(env, ",call_%d", func(env, reg)->callsite); } else { - verbose(env, "(id=%d", reg->id); + verbose(env, "(id=%d ref_obj_id=%d", reg->id, + reg->ref_obj_id); if (t != SCALAR_VALUE) verbose(env, ",off=%d", reg->off); if (type_is_pkt_pointer(t)) @@ -2414,16 +2401,15 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno, /* Any sk pointer can be ARG_PTR_TO_SOCK_COMMON */ if (!type_is_sk_pointer(type)) goto err_type; - } else if (arg_type == ARG_PTR_TO_SOCKET) { - expected_type = PTR_TO_SOCKET; - if (type != expected_type) - goto err_type; - if (meta->ptr_id || !reg->id) { - verbose(env, "verifier internal error: mismatched references meta=%d, reg=%d\n", - meta->ptr_id, reg->id); - return -EFAULT; + if (reg->ref_obj_id) { + if (meta->ref_obj_id) { + verbose(env, "verifier internal error: more than one arg with ref_obj_id R%d %u %u\n", + regno, reg->ref_obj_id, + meta->ref_obj_id); + return -EFAULT; + } + meta->ref_obj_id = reg->ref_obj_id; } - meta->ptr_id = reg->id; } else if (arg_type == ARG_PTR_TO_SPIN_LOCK) { if (meta->func_id == BPF_FUNC_spin_lock) { if (process_spin_lock(env, regno, true)) @@ -2740,32 +2726,38 @@ static bool check_arg_pair_ok(const struct bpf_func_proto *fn) return true; } -static bool check_refcount_ok(const struct bpf_func_proto *fn) +static bool check_refcount_ok(const struct bpf_func_proto *fn, int func_id) { int count = 0; - if (arg_type_is_refcounted(fn->arg1_type)) + if (arg_type_may_be_refcounted(fn->arg1_type)) count++; - if (arg_type_is_refcounted(fn->arg2_type)) + if (arg_type_may_be_refcounted(fn->arg2_type)) count++; - if (arg_type_is_refcounted(fn->arg3_type)) + if (arg_type_may_be_refcounted(fn->arg3_type)) count++; - if (arg_type_is_refcounted(fn->arg4_type)) + if (arg_type_may_be_refcounted(fn->arg4_type)) count++; - if (arg_type_is_refcounted(fn->arg5_type)) + if (arg_type_may_be_refcounted(fn->arg5_type)) count++; + /* A reference acquiring function cannot acquire + * another refcounted ptr. + */ + if (is_acquire_function(func_id) && count) + return false; + /* We only support one arg being unreferenced at the moment, * which is sufficient for the helper functions we have right now. */ return count <= 1; } -static int check_func_proto(const struct bpf_func_proto *fn) +static int check_func_proto(const struct bpf_func_proto *fn, int func_id) { return check_raw_mode_ok(fn) && check_arg_pair_ok(fn) && - check_refcount_ok(fn) ? 0 : -EINVAL; + check_refcount_ok(fn, func_id) ? 0 : -EINVAL; } /* Packet data might have moved, any old PTR_TO_PACKET[_META,_END] @@ -2799,19 +2791,20 @@ static void clear_all_pkt_pointers(struct bpf_verifier_env *env) } static void release_reg_references(struct bpf_verifier_env *env, - struct bpf_func_state *state, int id) + struct bpf_func_state *state, + int ref_obj_id) { struct bpf_reg_state *regs = state->regs, *reg; int i; for (i = 0; i < MAX_BPF_REG; i++) - if (regs[i].id == id) + if (regs[i].ref_obj_id == ref_obj_id) mark_reg_unknown(env, regs, i); bpf_for_each_spilled_reg(i, state, reg) { if (!reg) continue; - if (reg_is_refcounted(reg) && reg->id == id) + if (reg->ref_obj_id == ref_obj_id) __mark_reg_unknown(reg); } } @@ -2820,15 +2813,20 @@ static void release_reg_references(struct bpf_verifier_env *env, * resources. Identify all copies of the same pointer and clear the reference. */ static int release_reference(struct bpf_verifier_env *env, - struct bpf_call_arg_meta *meta) + int ref_obj_id) { struct bpf_verifier_state *vstate = env->cur_state; + int err; int i; + err = release_reference_state(cur_func(env), ref_obj_id); + if (err) + return err; + for (i = 0; i <= vstate->curframe; i++) - release_reg_references(env, vstate->frame[i], meta->ptr_id); + release_reg_references(env, vstate->frame[i], ref_obj_id); - return release_reference_state(cur_func(env), meta->ptr_id); + return 0; } static int check_func_call(struct bpf_verifier_env *env, struct bpf_insn *insn, @@ -3047,7 +3045,7 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn memset(&meta, 0, sizeof(meta)); meta.pkt_access = fn->pkt_access; - err = check_func_proto(fn); + err = check_func_proto(fn, func_id); if (err) { verbose(env, "kernel subsystem misconfigured func %s#%d\n", func_id_name(func_id), func_id); @@ -3093,7 +3091,7 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn return err; } } else if (is_release_function(func_id)) { - err = release_reference(env, &meta); + err = release_reference(env, meta.ref_obj_id); if (err) { verbose(env, "func %s#%d reference has not been acquired before\n", func_id_name(func_id), func_id); @@ -3154,8 +3152,10 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn if (id < 0) return id; - /* For release_reference() */ + /* For mark_ptr_or_null_reg() */ regs[BPF_REG_0].id = id; + /* For release_reference() */ + regs[BPF_REG_0].ref_obj_id = id; } else { /* For mark_ptr_or_null_reg() */ regs[BPF_REG_0].id = ++env->id_gen; @@ -3170,6 +3170,10 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn return -EINVAL; } + if (is_ptr_cast_function(func_id)) + /* For release_reference() */ + regs[BPF_REG_0].ref_obj_id = meta.ref_obj_id; + do_refine_retval_range(regs, fn->ret_type, func_id, &meta); err = check_map_func_compatibility(env, meta.map_ptr, func_id); @@ -4665,11 +4669,19 @@ static void mark_ptr_or_null_reg(struct bpf_func_state *state, } else if (reg->type == PTR_TO_TCP_SOCK_OR_NULL) { reg->type = PTR_TO_TCP_SOCK; } - if (is_null || !(reg_is_refcounted(reg) || - reg_may_point_to_spin_lock(reg))) { - /* We don't need id from this point onwards anymore, - * thus we should better reset it, so that state - * pruning has chances to take effect. + if (is_null) { + /* We don't need id and ref_obj_id from this point + * onwards anymore, thus we should better reset it, + * so that state pruning has chances to take effect. + */ + reg->id = 0; + reg->ref_obj_id = 0; + } else if (!reg_may_point_to_spin_lock(reg)) { + /* For not-NULL ptr, reg->ref_obj_id will be reset + * in release_reg_references(). + * + * reg->id is still used by spin_lock ptr. Other + * than spin_lock ptr type, reg->id can be reset. */ reg->id = 0; } @@ -4684,11 +4696,16 @@ static void mark_ptr_or_null_regs(struct bpf_verifier_state *vstate, u32 regno, { struct bpf_func_state *state = vstate->frame[vstate->curframe]; struct bpf_reg_state *reg, *regs = state->regs; + u32 ref_obj_id = regs[regno].ref_obj_id; u32 id = regs[regno].id; int i, j; - if (reg_is_refcounted_or_null(®s[regno]) && is_null) - release_reference_state(state, id); + if (ref_obj_id && ref_obj_id == id && is_null) + /* regs[regno] is in the " == NULL" branch. + * No one could have freed the reference state before + * doing the NULL check. + */ + WARN_ON_ONCE(release_reference_state(state, id)); for (i = 0; i < MAX_BPF_REG; i++) mark_ptr_or_null_reg(state, ®s[i], id, is_null); diff --git a/net/core/filter.c b/net/core/filter.c index f274620945ff..36b6afacf83c 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -1796,8 +1796,6 @@ static const struct bpf_func_proto bpf_skb_pull_data_proto = { BPF_CALL_1(bpf_sk_fullsock, struct sock *, sk) { - sk = sk_to_full_sk(sk); - return sk_fullsock(sk) ? (unsigned long)sk : (unsigned long)NULL; } @@ -5266,7 +5264,7 @@ static const struct bpf_func_proto bpf_sk_release_proto = { .func = bpf_sk_release, .gpl_only = false, .ret_type = RET_INTEGER, - .arg1_type = ARG_PTR_TO_SOCKET, + .arg1_type = ARG_PTR_TO_SOCK_COMMON, }; BPF_CALL_5(bpf_xdp_sk_lookup_udp, struct xdp_buff *, ctx, @@ -5407,8 +5405,6 @@ u32 bpf_tcp_sock_convert_ctx_access(enum bpf_access_type type, BPF_CALL_1(bpf_tcp_sock, struct sock *, sk) { - sk = sk_to_full_sk(sk); - if (sk_fullsock(sk) && sk->sk_protocol == IPPROTO_TCP) return (unsigned long)sk; -- cgit v1.3-7-g2ca7 From f6d85f04e29859dd3ea65395c05925da352dae89 Mon Sep 17 00:00:00 2001 From: Mathieu Malaterre Date: Mon, 14 Jan 2019 21:31:13 +0100 Subject: blkcg: annotate implicit fall through There is a plan to build the kernel with -Wimplicit-fallthrough and this place in the code produced a warning (W=1). This commit remove the following warning: kernel/trace/blktrace.c:725:9: warning: this statement may fall through [-Wimplicit-fallthrough=] Signed-off-by: Mathieu Malaterre Signed-off-by: Jens Axboe --- kernel/trace/blktrace.c | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c index fac0ddf8a8e2..e1c6d79fb4cc 100644 --- a/kernel/trace/blktrace.c +++ b/kernel/trace/blktrace.c @@ -723,6 +723,7 @@ int blk_trace_ioctl(struct block_device *bdev, unsigned cmd, char __user *arg) #endif case BLKTRACESTART: start = 1; + /* fall through */ case BLKTRACESTOP: ret = __blk_trace_startstop(q, start); break; -- cgit v1.3-7-g2ca7 From 287c038c0b994dae7569d96eca154f6a7ff6b4a9 Mon Sep 17 00:00:00 2001 From: Masami Hiramatsu Date: Thu, 14 Mar 2019 13:30:09 +0900 Subject: tracing/probe: Check maxactive error cases Check maxactive on kprobe error case, because maxactive is only for kretprobe, not for kprobe. Also, maxactive should not be 0, it should be at least 1. Link: http://lkml.kernel.org/r/155253780952.14922.15784129810238750331.stgit@devnote2 Signed-off-by: Masami Hiramatsu Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_kprobe.c | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c index ceafa0a2b1d1..a14837545295 100644 --- a/kernel/trace/trace_kprobe.c +++ b/kernel/trace/trace_kprobe.c @@ -624,7 +624,11 @@ static int trace_kprobe_create(int argc, const char *argv[]) if (event) event++; - if (is_return && isdigit(argv[0][1])) { + if (isdigit(argv[0][1])) { + if (!is_return) { + pr_info("Maxactive is not for kprobe"); + return -EINVAL; + } if (event) len = event - &argv[0][1] - 1; else @@ -634,8 +638,8 @@ static int trace_kprobe_create(int argc, const char *argv[]) memcpy(buf, &argv[0][1], len); buf[len] = '\0'; ret = kstrtouint(buf, 0, &maxactive); - if (ret) { - pr_info("Failed to parse maxactive.\n"); + if (ret || !maxactive) { + pr_info("Invalid maxactive number\n"); return ret; } /* kretprobes instances are iterated over via a list. The -- cgit v1.3-7-g2ca7 From dec65d79fd269d05427c8167090bfc9c3d0b56c4 Mon Sep 17 00:00:00 2001 From: Masami Hiramatsu Date: Thu, 14 Mar 2019 13:30:20 +0900 Subject: tracing/probe: Check event name length correctly Ensure given name of event is not too long when parsing it, and fix to update event name offset correctly when the group name is given. For example, this makes probe event to check the "p:foo/" error case correctly. Link: http://lkml.kernel.org/r/155253782046.14922.14724124823730168629.stgit@devnote2 Signed-off-by: Masami Hiramatsu Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_probe.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c index cfcf77e6fb19..0dc13bff2847 100644 --- a/kernel/trace/trace_probe.c +++ b/kernel/trace/trace_probe.c @@ -159,6 +159,7 @@ int traceprobe_parse_event_name(const char **pevent, const char **pgroup, char *buf) { const char *slash, *event = *pevent; + int len; slash = strchr(event, '/'); if (slash) { @@ -173,10 +174,15 @@ int traceprobe_parse_event_name(const char **pevent, const char **pgroup, strlcpy(buf, event, slash - event + 1); *pgroup = buf; *pevent = slash + 1; + event = *pevent; } - if (strlen(event) == 0) { + len = strlen(event); + if (len == 0) { pr_info("Event name is not specified\n"); return -EINVAL; + } else if (len > MAX_EVENT_NAME_LEN) { + pr_info("Event name is too long\n"); + return -E2BIG; } return 0; } -- cgit v1.3-7-g2ca7 From b4443c17a3c9d652dc5d7679ddca867ee3cdaa9c Mon Sep 17 00:00:00 2001 From: Masami Hiramatsu Date: Thu, 14 Mar 2019 13:30:30 +0900 Subject: tracing/probe: Check the size of argument name and body Check the size of argument name and expression is not 0 and smaller than maximum length. Link: http://lkml.kernel.org/r/155253783029.14922.12650939303827581096.stgit@devnote2 Signed-off-by: Masami Hiramatsu Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_probe.c | 2 ++ kernel/trace/trace_probe.h | 1 + 2 files changed, 3 insertions(+) (limited to 'kernel') diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c index 0dc13bff2847..bf9d3d7e0c9d 100644 --- a/kernel/trace/trace_probe.c +++ b/kernel/trace/trace_probe.c @@ -554,6 +554,8 @@ int traceprobe_parse_probe_arg(struct trace_probe *tp, int i, char *arg, body = strchr(arg, '='); if (body) { + if (body - arg > MAX_ARG_NAME_LEN || body == arg) + return -EINVAL; parg->name = kmemdup_nul(arg, body - arg, GFP_KERNEL); body++; } else { diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h index 8a63f8bc01bc..2177c206de15 100644 --- a/kernel/trace/trace_probe.h +++ b/kernel/trace/trace_probe.h @@ -32,6 +32,7 @@ #define MAX_TRACE_ARGS 128 #define MAX_ARGSTR_LEN 63 #define MAX_ARRAY_LEN 64 +#define MAX_ARG_NAME_LEN 32 #define MAX_STRING_SIZE PATH_MAX /* Reserved field names */ -- cgit v1.3-7-g2ca7 From 5b7a96220900e3c3f6fb53908eb4602cda959376 Mon Sep 17 00:00:00 2001 From: Masami Hiramatsu Date: Thu, 14 Mar 2019 13:30:40 +0900 Subject: tracing/probe: Check event/group naming rule at parsing Check event and group naming rule at parsing it instead of allocating probes. Link: http://lkml.kernel.org/r/155253784064.14922.2336893061156236237.stgit@devnote2 Signed-off-by: Masami Hiramatsu Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_kprobe.c | 7 +------ kernel/trace/trace_probe.c | 8 ++++++++ kernel/trace/trace_uprobe.c | 5 +---- 3 files changed, 10 insertions(+), 10 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c index a14837545295..5265117ad7e8 100644 --- a/kernel/trace/trace_kprobe.c +++ b/kernel/trace/trace_kprobe.c @@ -221,7 +221,7 @@ static struct trace_kprobe *alloc_trace_kprobe(const char *group, tk->rp.maxactive = maxactive; - if (!event || !is_good_name(event)) { + if (!event || !group) { ret = -EINVAL; goto error; } @@ -231,11 +231,6 @@ static struct trace_kprobe *alloc_trace_kprobe(const char *group, if (!tk->tp.call.name) goto error; - if (!group || !is_good_name(group)) { - ret = -EINVAL; - goto error; - } - tk->tp.class.system = kstrdup(group, GFP_KERNEL); if (!tk->tp.class.system) goto error; diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c index bf9d3d7e0c9d..8f8411e7835f 100644 --- a/kernel/trace/trace_probe.c +++ b/kernel/trace/trace_probe.c @@ -172,6 +172,10 @@ int traceprobe_parse_event_name(const char **pevent, const char **pgroup, return -E2BIG; } strlcpy(buf, event, slash - event + 1); + if (!is_good_name(buf)) { + pr_info("Group name must follow the same rules as C identifiers\n"); + return -EINVAL; + } *pgroup = buf; *pevent = slash + 1; event = *pevent; @@ -184,6 +188,10 @@ int traceprobe_parse_event_name(const char **pevent, const char **pgroup, pr_info("Event name is too long\n"); return -E2BIG; } + if (!is_good_name(event)) { + pr_info("Event name must follow the same rules as C identifiers\n"); + return -EINVAL; + } return 0; } diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c index e335576b9411..52f033489377 100644 --- a/kernel/trace/trace_uprobe.c +++ b/kernel/trace/trace_uprobe.c @@ -266,10 +266,7 @@ alloc_trace_uprobe(const char *group, const char *event, int nargs, bool is_ret) { struct trace_uprobe *tu; - if (!event || !is_good_name(event)) - return ERR_PTR(-EINVAL); - - if (!group || !is_good_name(group)) + if (!event || !group) return ERR_PTR(-EINVAL); tu = kzalloc(SIZEOF_TRACE_UPROBE(nargs), GFP_KERNEL); -- cgit v1.3-7-g2ca7 From a039480e9e93896cadc5a91468964febb3c5d488 Mon Sep 17 00:00:00 2001 From: Masami Hiramatsu Date: Thu, 14 Mar 2019 13:30:50 +0900 Subject: tracing/probe: Verify alloc_trace_*probe() result Since alloc_trace_*probe() returns -EINVAL only if !event && !group, it should not happen in trace_*probe_create(). If we catch that case there is a bug. So use WARN_ON_ONCE() instead of pr_info(). Link: http://lkml.kernel.org/r/155253785078.14922.16902223633734601469.stgit@devnote2 Signed-off-by: Masami Hiramatsu Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_kprobe.c | 4 ++-- kernel/trace/trace_uprobe.c | 3 ++- 2 files changed, 4 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c index 5265117ad7e8..63aa0ab56051 100644 --- a/kernel/trace/trace_kprobe.c +++ b/kernel/trace/trace_kprobe.c @@ -693,9 +693,9 @@ static int trace_kprobe_create(int argc, const char *argv[]) tk = alloc_trace_kprobe(group, event, addr, symbol, offset, maxactive, argc, is_return); if (IS_ERR(tk)) { - pr_info("Failed to allocate trace_probe.(%d)\n", - (int)PTR_ERR(tk)); ret = PTR_ERR(tk); + /* This must return -ENOMEM otherwise there is a bug */ + WARN_ON_ONCE(ret != -ENOMEM); goto out; } diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c index 52f033489377..b54137ec7810 100644 --- a/kernel/trace/trace_uprobe.c +++ b/kernel/trace/trace_uprobe.c @@ -514,8 +514,9 @@ static int trace_uprobe_create(int argc, const char **argv) tu = alloc_trace_uprobe(group, event, argc, is_return); if (IS_ERR(tu)) { - pr_info("Failed to allocate trace_uprobe.(%d)\n", (int)PTR_ERR(tu)); ret = PTR_ERR(tu); + /* This must return -ENOMEM otherwise there is a bug */ + WARN_ON_ONCE(ret != -ENOMEM); goto fail_address_parse; } tu->offset = offset; -- cgit v1.3-7-g2ca7 From f01a7dbe98ae4265023fa5d3af0f076f0b18a647 Mon Sep 17 00:00:00 2001 From: Martynas Pumputis Date: Mon, 18 Mar 2019 16:10:26 +0100 Subject: bpf: Try harder when allocating memory for large maps It has been observed that sometimes a higher order memory allocation for BPF maps fails when there is no obvious memory pressure in a system. E.g. the map (BPF_MAP_TYPE_LRU_HASH, key=38, value=56, max_elems=524288) could not be created due to vmalloc unable to allocate 75497472B, when the system's memory consumption (in MB) was the following: Total: 3942 Used: 837 (21.24%) Free: 138 Buffers: 239 Cached: 2727 Later analysis [1] by Michal Hocko showed that the vmalloc was not trying to reclaim memory from the page cache and was failing prematurely due to __GFP_NORETRY. Considering dcda9b0471 ("mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic") and [1], we can replace __GFP_NORETRY with __GFP_RETRY_MAYFAIL, as it won't invoke OOM killer and will try harder to fulfil allocation requests. Unfortunately, replacing the body of the BPF map memory allocation function with the kvmalloc_node helper function is not an option at this point in time, given 1) kmalloc is non-optional for higher order allocations, and 2) passing __GFP_RETRY_MAYFAIL to the kmalloc would stress the slab allocator too much for large requests. The change has been tested with the workloads mentioned above and by observing oom_kill value from /proc/vmstat. [1]: https://lore.kernel.org/bpf/20190310071318.GW5232@dhcp22.suse.cz/ Signed-off-by: Martynas Pumputis Acked-by: Yonghong Song Cc: Michal Hocko Signed-off-by: Daniel Borkmann Link: https://lore.kernel.org/bpf/20190318153940.GL8924@dhcp22.suse.cz/ --- kernel/bpf/syscall.c | 22 +++++++++++++++------- 1 file changed, 15 insertions(+), 7 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 62f6bced3a3c..afca36f53c49 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -136,21 +136,29 @@ static struct bpf_map *find_and_alloc_map(union bpf_attr *attr) void *bpf_map_area_alloc(size_t size, int numa_node) { - /* We definitely need __GFP_NORETRY, so OOM killer doesn't - * trigger under memory pressure as we really just want to - * fail instead. + /* We really just want to fail instead of triggering OOM killer + * under memory pressure, therefore we set __GFP_NORETRY to kmalloc, + * which is used for lower order allocation requests. + * + * It has been observed that higher order allocation requests done by + * vmalloc with __GFP_NORETRY being set might fail due to not trying + * to reclaim memory from the page cache, thus we set + * __GFP_RETRY_MAYFAIL to avoid such situations. */ - const gfp_t flags = __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO; + + const gfp_t flags = __GFP_NOWARN | __GFP_ZERO; void *area; if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) { - area = kmalloc_node(size, GFP_USER | flags, numa_node); + area = kmalloc_node(size, GFP_USER | __GFP_NORETRY | flags, + numa_node); if (area != NULL) return area; } - return __vmalloc_node_flags_caller(size, numa_node, GFP_KERNEL | flags, - __builtin_return_address(0)); + return __vmalloc_node_flags_caller(size, numa_node, + GFP_KERNEL | __GFP_RETRY_MAYFAIL | + flags, __builtin_return_address(0)); } void bpf_map_area_free(void *area) -- cgit v1.3-7-g2ca7 From a23314e9d88d89d49e69db08f60b7caa470f04e1 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Tue, 5 Mar 2019 09:32:02 +0100 Subject: sched/cpufreq: Fix 32-bit math overflow Vincent Wang reported that get_next_freq() has a mult overflow bug on 32-bit platforms in the IOWAIT boost case, since in that case {util,max} are in freq units instead of capacity units. Solve this by moving the IOWAIT boost to capacity units. And since this means @max is constant; simplify the code. Reported-by: Vincent Wang Tested-by: Vincent Wang Signed-off-by: Peter Zijlstra (Intel) Acked-by: Rafael J. Wysocki Cc: Andy Lutomirski Cc: Borislav Petkov Cc: Chunyan Zhang Cc: Dave Hansen Cc: H. Peter Anvin Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Quentin Perret Cc: Rafael J. Wysocki Cc: Rik van Riel Cc: Thomas Gleixner Link: https://lkml.kernel.org/r/20190305083202.GU32494@hirez.programming.kicks-ass.net Signed-off-by: Ingo Molnar --- kernel/sched/cpufreq_schedutil.c | 59 +++++++++++++++++----------------------- 1 file changed, 25 insertions(+), 34 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c index 033ec7c45f13..1ccf77f6d346 100644 --- a/kernel/sched/cpufreq_schedutil.c +++ b/kernel/sched/cpufreq_schedutil.c @@ -48,10 +48,10 @@ struct sugov_cpu { bool iowait_boost_pending; unsigned int iowait_boost; - unsigned int iowait_boost_max; u64 last_update; unsigned long bw_dl; + unsigned long min; unsigned long max; /* The field below is for single-CPU policies only: */ @@ -303,8 +303,7 @@ static bool sugov_iowait_reset(struct sugov_cpu *sg_cpu, u64 time, if (delta_ns <= TICK_NSEC) return false; - sg_cpu->iowait_boost = set_iowait_boost - ? sg_cpu->sg_policy->policy->min : 0; + sg_cpu->iowait_boost = set_iowait_boost ? sg_cpu->min : 0; sg_cpu->iowait_boost_pending = set_iowait_boost; return true; @@ -344,14 +343,13 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time, /* Double the boost at each request */ if (sg_cpu->iowait_boost) { - sg_cpu->iowait_boost <<= 1; - if (sg_cpu->iowait_boost > sg_cpu->iowait_boost_max) - sg_cpu->iowait_boost = sg_cpu->iowait_boost_max; + sg_cpu->iowait_boost = + min_t(unsigned int, sg_cpu->iowait_boost << 1, SCHED_CAPACITY_SCALE); return; } /* First wakeup after IO: start with minimum boost */ - sg_cpu->iowait_boost = sg_cpu->sg_policy->policy->min; + sg_cpu->iowait_boost = sg_cpu->min; } /** @@ -373,47 +371,38 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time, * This mechanism is designed to boost high frequently IO waiting tasks, while * being more conservative on tasks which does sporadic IO operations. */ -static void sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time, - unsigned long *util, unsigned long *max) +static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time, + unsigned long util, unsigned long max) { - unsigned int boost_util, boost_max; + unsigned long boost; /* No boost currently required */ if (!sg_cpu->iowait_boost) - return; + return util; /* Reset boost if the CPU appears to have been idle enough */ if (sugov_iowait_reset(sg_cpu, time, false)) - return; + return util; - /* - * An IO waiting task has just woken up: - * allow to further double the boost value - */ - if (sg_cpu->iowait_boost_pending) { - sg_cpu->iowait_boost_pending = false; - } else { + if (!sg_cpu->iowait_boost_pending) { /* - * Otherwise: reduce the boost value and disable it when we - * reach the minimum. + * No boost pending; reduce the boost value. */ sg_cpu->iowait_boost >>= 1; - if (sg_cpu->iowait_boost < sg_cpu->sg_policy->policy->min) { + if (sg_cpu->iowait_boost < sg_cpu->min) { sg_cpu->iowait_boost = 0; - return; + return util; } } + sg_cpu->iowait_boost_pending = false; + /* - * Apply the current boost value: a CPU is boosted only if its current - * utilization is smaller then the current IO boost level. + * @util is already in capacity scale; convert iowait_boost + * into the same scale so we can compare. */ - boost_util = sg_cpu->iowait_boost; - boost_max = sg_cpu->iowait_boost_max; - if (*util * boost_max < *max * boost_util) { - *util = boost_util; - *max = boost_max; - } + boost = (sg_cpu->iowait_boost * max) >> SCHED_CAPACITY_SHIFT; + return max(boost, util); } #ifdef CONFIG_NO_HZ_COMMON @@ -460,7 +449,7 @@ static void sugov_update_single(struct update_util_data *hook, u64 time, util = sugov_get_util(sg_cpu); max = sg_cpu->max; - sugov_iowait_apply(sg_cpu, time, &util, &max); + util = sugov_iowait_apply(sg_cpu, time, util, max); next_f = get_next_freq(sg_policy, util, max); /* * Do not reduce the frequency if the CPU has not been idle @@ -500,7 +489,7 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time) j_util = sugov_get_util(j_sg_cpu); j_max = j_sg_cpu->max; - sugov_iowait_apply(j_sg_cpu, time, &j_util, &j_max); + j_util = sugov_iowait_apply(j_sg_cpu, time, j_util, j_max); if (j_util * max > j_max * util) { util = j_util; @@ -837,7 +826,9 @@ static int sugov_start(struct cpufreq_policy *policy) memset(sg_cpu, 0, sizeof(*sg_cpu)); sg_cpu->cpu = cpu; sg_cpu->sg_policy = sg_policy; - sg_cpu->iowait_boost_max = policy->cpuinfo.max_freq; + sg_cpu->min = + (SCHED_CAPACITY_SCALE * policy->cpuinfo.min_freq) / + policy->cpuinfo.max_freq; } for_each_cpu(cpu, policy->cpus) { -- cgit v1.3-7-g2ca7 From 4c47acd824aaaa8fc6dc519fb4e08d1522105b7a Mon Sep 17 00:00:00 2001 From: Konstantin Khlebnikov Date: Wed, 6 Mar 2019 20:11:42 +0300 Subject: sched/core: Fix buffer overflow in cgroup2 property cpu.max Add limit into sscanf format string for on-stack buffer. Signed-off-by: Konstantin Khlebnikov Signed-off-by: Peter Zijlstra (Intel) Acked-by: Tejun Heo Cc: Andy Lutomirski Cc: Borislav Petkov Cc: Dave Hansen Cc: H. Peter Anvin Cc: Johannes Weiner Cc: Li Zefan Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Rik van Riel Cc: Thomas Gleixner Fixes: 0d5936344f30 ("sched: Implement interface for cgroup unified hierarchy") Link: https://lkml.kernel.org/r/155189230232.2620.13120481613524200065.stgit@buzz Signed-off-by: Ingo Molnar --- kernel/sched/core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 6b2c055564b5..b7a4afdc33cb 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6943,7 +6943,7 @@ static int __maybe_unused cpu_period_quota_parse(char *buf, { char tok[21]; /* U64_MAX */ - if (!sscanf(buf, "%s %llu", tok, periodp)) + if (sscanf(buf, "%20s %llu", tok, periodp) < 1) return -EINVAL; *periodp *= NSEC_PER_USEC; -- cgit v1.3-7-g2ca7 From e25a7a944f1936b5134b7ee06bc432fc701e4aa3 Mon Sep 17 00:00:00 2001 From: Valentin Schneider Date: Mon, 11 Feb 2019 17:59:44 +0000 Subject: sched/fair: Comment some nohz_balancer_kick() kick conditions We now have a comment explaining the first sched_domain based NOHZ kick, so might as well comment them all. While at it, unwrap a line that fits under 80 characters. Co-authored-by: Peter Zijlstra Signed-off-by: Valentin Schneider Signed-off-by: Peter Zijlstra (Intel) Cc: Andy Lutomirski Cc: Borislav Petkov Cc: Dave Hansen Cc: Dietmar.Eggemann@arm.com Cc: H. Peter Anvin Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Rik van Riel Cc: Thomas Gleixner Cc: morten.rasmussen@arm.com Cc: vincent.guittot@linaro.org Link: https://lkml.kernel.org/r/20190211175946.4961-2-valentin.schneider@arm.com Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 13 +++++++++++-- 1 file changed, 11 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 8213ff6e365d..e6f7d39d4d45 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -9612,8 +9612,12 @@ static void nohz_balancer_kick(struct rq *rq) sd = rcu_dereference(rq->sd); if (sd) { - if ((rq->cfs.h_nr_running >= 1) && - check_cpu_capacity(rq, sd)) { + /* + * If there's a CFS task and the current CPU has reduced + * capacity; kick the ILB to see if there's a better CPU to run + * on. + */ + if (rq->cfs.h_nr_running >= 1 && check_cpu_capacity(rq, sd)) { flags = NOHZ_KICK_MASK; goto unlock; } @@ -9621,6 +9625,11 @@ static void nohz_balancer_kick(struct rq *rq) sd = rcu_dereference(per_cpu(sd_asym_packing, cpu)); if (sd) { + /* + * When ASYM_PACKING; see if there's a more preferred CPU + * currently idle; in which case, kick the ILB to move tasks + * around. + */ for_each_cpu_and(i, sched_domain_span(sd), nohz.idle_cpus_mask) { if (sched_asym_prefer(i, cpu)) { flags = NOHZ_KICK_MASK; -- cgit v1.3-7-g2ca7 From a0fe2cf086aef213d1b4bca1b1291a3dee8357c9 Mon Sep 17 00:00:00 2001 From: Valentin Schneider Date: Mon, 11 Feb 2019 17:59:45 +0000 Subject: sched/fair: Tune down misfit NOHZ kicks In this commit: 3b1baa6496e6 ("sched/fair: Add 'group_misfit_task' load-balance type") we set rq->misfit_task_load whenever the current running task has a utilization greater than 80% of rq->cpu_capacity. A non-zero value in this field enables misfit load balancing. However, if the task being looked at is already running on a CPU of highest capacity, there's nothing more we can do for it. We can currently spot this in update_sd_pick_busiest(), which prevents us from selecting a sched_group of group_type == group_misfit_task as the busiest group, but we don't do any of that in nohz_balancer_kick(). This means that we could repeatedly kick NOHZ CPUs when there's no improvements in terms of load balance to be done. Introduce a check_misfit_status() helper that returns true iff there is a CPU in the system that could give more CPU capacity to a rq's misfit task - IOW, there exists a CPU of higher capacity_orig or the rq's CPU is severely pressured by rt/IRQ. Signed-off-by: Valentin Schneider Signed-off-by: Peter Zijlstra (Intel) Cc: Andy Lutomirski Cc: Borislav Petkov Cc: Dave Hansen Cc: Dietmar.Eggemann@arm.com Cc: H. Peter Anvin Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Rik van Riel Cc: Thomas Gleixner Cc: morten.rasmussen@arm.com Cc: vincent.guittot@linaro.org Link: https://lkml.kernel.org/r/20190211175946.4961-3-valentin.schneider@arm.com Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 26 +++++++++++++++++++++++++- 1 file changed, 25 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index e6f7d39d4d45..f0d2f8a352bf 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -8058,6 +8058,18 @@ check_cpu_capacity(struct rq *rq, struct sched_domain *sd) (rq->cpu_capacity_orig * 100)); } +/* + * Check whether a rq has a misfit task and if it looks like we can actually + * help that task: we can migrate the task to a CPU of higher capacity, or + * the task's current CPU is heavily pressured. + */ +static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd) +{ + return rq->misfit_task_load && + (rq->cpu_capacity_orig < rq->rd->max_cpu_capacity || + check_cpu_capacity(rq, sd)); +} + /* * Group imbalance indicates (and tries to solve) the problem where balancing * groups is inadequate due to ->cpus_allowed constraints. @@ -9585,7 +9597,7 @@ static void nohz_balancer_kick(struct rq *rq) if (time_before(now, nohz.next_balance)) goto out; - if (rq->nr_running >= 2 || rq->misfit_task_load) { + if (rq->nr_running >= 2) { flags = NOHZ_KICK_MASK; goto out; } @@ -9623,6 +9635,18 @@ static void nohz_balancer_kick(struct rq *rq) } } + sd = rcu_dereference(per_cpu(sd_asym_cpucapacity, cpu)); + if (sd) { + /* + * When ASYM_CPUCAPACITY; see if there's a higher capacity CPU + * to run the misfit task on. + */ + if (check_misfit_status(rq, sd)) { + flags = NOHZ_KICK_MASK; + goto unlock; + } + } + sd = rcu_dereference(per_cpu(sd_asym_packing, cpu)); if (sd) { /* -- cgit v1.3-7-g2ca7 From b9a7b8831600afc51c9ba52c05f12db2266f01c7 Mon Sep 17 00:00:00 2001 From: Valentin Schneider Date: Mon, 11 Feb 2019 17:59:46 +0000 Subject: sched/fair: Skip LLC NOHZ logic for asymmetric systems The LLC NOHZ condition will become true as soon as >=2 CPUs in a single LLC domain are busy. On big.LITTLE systems, this translates to two or more CPUs of a "cluster" (big or LITTLE) being busy. Issuing a NOHZ kick in these conditions isn't desired for asymmetric systems, as if the busy CPUs can provide enough compute capacity to the running tasks, then we can leave the NOHZ CPUs in peace. Skip the LLC NOHZ condition for asymmetric systems, and rely on nr_running & capacity checks to trigger NOHZ kicks when the system actually needs them. Suggested-by: Morten Rasmussen Signed-off-by: Valentin Schneider Signed-off-by: Peter Zijlstra (Intel) Cc: Andy Lutomirski Cc: Borislav Petkov Cc: Dave Hansen Cc: Dietmar.Eggemann@arm.com Cc: H. Peter Anvin Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Rik van Riel Cc: Thomas Gleixner Cc: vincent.guittot@linaro.org Link: https://lkml.kernel.org/r/20190211175946.4961-4-valentin.schneider@arm.com Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 65 ++++++++++++++++++++++++++++++----------------------- 1 file changed, 37 insertions(+), 28 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index f0d2f8a352bf..51003e1c794d 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -9603,24 +9603,6 @@ static void nohz_balancer_kick(struct rq *rq) } rcu_read_lock(); - sds = rcu_dereference(per_cpu(sd_llc_shared, cpu)); - if (sds) { - /* - * If there is an imbalance between LLC domains (IOW we could - * increase the overall cache use), we need some less-loaded LLC - * domain to pull some load. Likewise, we may need to spread - * load within the current LLC domain (e.g. packed SMT cores but - * other CPUs are idle). We can't really know from here how busy - * the others are - so just get a nohz balance going if it looks - * like this LLC domain has tasks we could move. - */ - nr_busy = atomic_read(&sds->nr_busy_cpus); - if (nr_busy > 1) { - flags = NOHZ_KICK_MASK; - goto unlock; - } - - } sd = rcu_dereference(rq->sd); if (sd) { @@ -9635,6 +9617,21 @@ static void nohz_balancer_kick(struct rq *rq) } } + sd = rcu_dereference(per_cpu(sd_asym_packing, cpu)); + if (sd) { + /* + * When ASYM_PACKING; see if there's a more preferred CPU + * currently idle; in which case, kick the ILB to move tasks + * around. + */ + for_each_cpu_and(i, sched_domain_span(sd), nohz.idle_cpus_mask) { + if (sched_asym_prefer(i, cpu)) { + flags = NOHZ_KICK_MASK; + goto unlock; + } + } + } + sd = rcu_dereference(per_cpu(sd_asym_cpucapacity, cpu)); if (sd) { /* @@ -9645,20 +9642,32 @@ static void nohz_balancer_kick(struct rq *rq) flags = NOHZ_KICK_MASK; goto unlock; } + + /* + * For asymmetric systems, we do not want to nicely balance + * cache use, instead we want to embrace asymmetry and only + * ensure tasks have enough CPU capacity. + * + * Skip the LLC logic because it's not relevant in that case. + */ + goto unlock; } - sd = rcu_dereference(per_cpu(sd_asym_packing, cpu)); - if (sd) { + sds = rcu_dereference(per_cpu(sd_llc_shared, cpu)); + if (sds) { /* - * When ASYM_PACKING; see if there's a more preferred CPU - * currently idle; in which case, kick the ILB to move tasks - * around. + * If there is an imbalance between LLC domains (IOW we could + * increase the overall cache use), we need some less-loaded LLC + * domain to pull some load. Likewise, we may need to spread + * load within the current LLC domain (e.g. packed SMT cores but + * other CPUs are idle). We can't really know from here how busy + * the others are - so just get a nohz balance going if it looks + * like this LLC domain has tasks we could move. */ - for_each_cpu_and(i, sched_domain_span(sd), nohz.idle_cpus_mask) { - if (sched_asym_prefer(i, cpu)) { - flags = NOHZ_KICK_MASK; - goto unlock; - } + nr_busy = atomic_read(&sds->nr_busy_cpus); + if (nr_busy > 1) { + flags = NOHZ_KICK_MASK; + goto unlock; } } unlock: -- cgit v1.3-7-g2ca7 From cba368c1f01c27ed62fca7a853531845d263bb01 Mon Sep 17 00:00:00 2001 From: Martin KaFai Lau Date: Mon, 18 Mar 2019 10:37:13 -0700 Subject: bpf: Only print ref_obj_id for refcounted reg Naresh reported that test_align fails because of the mismatch at the verbose printout of the register states. The reason is due to the newly added ref_obj_id. ref_obj_id is only useful for refcounted reg. Thus, this patch fixes it by only printing ref_obj_id for refcounted reg. While at it, it also uses comma instead of space to separate between "id" and "ref_obj_id". Fixes: 1b986589680a ("bpf: Fix bpf_tcp_sock and bpf_sk_fullsock issue related to bpf_sk_release") Reported-by: Naresh Kamboju Signed-off-by: Martin KaFai Lau Acked-by: Andrii Nakryiko Signed-off-by: Alexei Starovoitov --- kernel/bpf/verifier.c | 13 +++++++++++-- 1 file changed, 11 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 86f9cd5d1c4e..5aa810882583 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -352,6 +352,14 @@ static bool reg_may_point_to_spin_lock(const struct bpf_reg_state *reg) map_value_has_spin_lock(reg->map_ptr); } +static bool reg_type_may_be_refcounted_or_null(enum bpf_reg_type type) +{ + return type == PTR_TO_SOCKET || + type == PTR_TO_SOCKET_OR_NULL || + type == PTR_TO_TCP_SOCK || + type == PTR_TO_TCP_SOCK_OR_NULL; +} + static bool arg_type_may_be_refcounted(enum bpf_arg_type type) { return type == ARG_PTR_TO_SOCK_COMMON; @@ -451,8 +459,9 @@ static void print_verifier_state(struct bpf_verifier_env *env, if (t == PTR_TO_STACK) verbose(env, ",call_%d", func(env, reg)->callsite); } else { - verbose(env, "(id=%d ref_obj_id=%d", reg->id, - reg->ref_obj_id); + verbose(env, "(id=%d", reg->id); + if (reg_type_may_be_refcounted_or_null(t)) + verbose(env, ",ref_obj_id=%d", reg->ref_obj_id); if (t != SCALAR_VALUE) verbose(env, ",off=%d", reg->off); if (type_is_pkt_pointer(t)) -- cgit v1.3-7-g2ca7 From 82efcab3b9f3ef59e9713237c6e3c05c3a95c1ae Mon Sep 17 00:00:00 2001 From: Bart Van Assche Date: Mon, 11 Mar 2019 16:02:55 -0700 Subject: workqueue: Only unregister a registered lockdep key The recent change to prevent use after free and a memory leak introduced an unconditional call to wq_unregister_lockdep() in the error handling path. If the lockdep key had not been registered yet, then the lockdep core emits a warning. Only call wq_unregister_lockdep() if wq_register_lockdep() has been called first. Fixes: 009bb421b6ce ("workqueue, lockdep: Fix an alloc_workqueue() error path") Reported-by: syzbot+be0c198232f86389c3dd@syzkaller.appspotmail.com Signed-off-by: Bart Van Assche Signed-off-by: Thomas Gleixner Cc: Peter Zijlstra Cc: Linus Torvalds Cc: Tejun Heo Cc: Qian Cai Link: https://lkml.kernel.org/r/20190311230255.176081-1-bvanassche@acm.org --- kernel/workqueue.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 4026d1871407..ddee541ea97a 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -4266,7 +4266,7 @@ struct workqueue_struct *alloc_workqueue(const char *fmt, INIT_LIST_HEAD(&wq->list); if (alloc_and_link_pwqs(wq) < 0) - goto err_free_wq; + goto err_unreg_lockdep; if (wq_online && init_rescuer(wq) < 0) goto err_destroy; @@ -4292,9 +4292,10 @@ struct workqueue_struct *alloc_workqueue(const char *fmt, return wq; -err_free_wq: +err_unreg_lockdep: wq_unregister_lockdep(wq); wq_free_lockdep(wq); +err_free_wq: free_workqueue_attrs(wq->unbound_attrs); kfree(wq); return NULL; -- cgit v1.3-7-g2ca7 From 0803278b0b4d8eeb2b461fb698785df65a725d9e Mon Sep 17 00:00:00 2001 From: Xu Yu Date: Thu, 21 Mar 2019 18:00:35 +0800 Subject: bpf: do not restore dst_reg when cur_state is freed MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Syzkaller hit 'KASAN: use-after-free Write in sanitize_ptr_alu' bug. Call trace: dump_stack+0xbf/0x12e print_address_description+0x6a/0x280 kasan_report+0x237/0x360 sanitize_ptr_alu+0x85a/0x8d0 adjust_ptr_min_max_vals+0x8f2/0x1ca0 adjust_reg_min_max_vals+0x8ed/0x22e0 do_check+0x1ca6/0x5d00 bpf_check+0x9ca/0x2570 bpf_prog_load+0xc91/0x1030 __se_sys_bpf+0x61e/0x1f00 do_syscall_64+0xc8/0x550 entry_SYSCALL_64_after_hwframe+0x49/0xbe Fault injection trace:  kfree+0xea/0x290  free_func_state+0x4a/0x60  free_verifier_state+0x61/0xe0  push_stack+0x216/0x2f0 <- inject failslab  sanitize_ptr_alu+0x2b1/0x8d0  adjust_ptr_min_max_vals+0x8f2/0x1ca0  adjust_reg_min_max_vals+0x8ed/0x22e0  do_check+0x1ca6/0x5d00  bpf_check+0x9ca/0x2570  bpf_prog_load+0xc91/0x1030  __se_sys_bpf+0x61e/0x1f00  do_syscall_64+0xc8/0x550  entry_SYSCALL_64_after_hwframe+0x49/0xbe When kzalloc() fails in push_stack(), free_verifier_state() will free current verifier state. As push_stack() returns, dst_reg was restored if ptr_is_dst_reg is false. However, as member of the cur_state, dst_reg is also freed, and error occurs when dereferencing dst_reg. Simply fix it by testing ret of push_stack() before restoring dst_reg. Fixes: 979d63d50c0c ("bpf: prevent out of bounds speculation on pointer arithmetic") Signed-off-by: Xu Yu Signed-off-by: Daniel Borkmann --- kernel/bpf/verifier.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 5aa810882583..f19d5e04c69d 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -3381,7 +3381,7 @@ do_sim: *dst_reg = *ptr_reg; } ret = push_stack(env, env->insn_idx + 1, env->insn_idx, true); - if (!ptr_is_dst_reg) + if (!ptr_is_dst_reg && ret) *dst_reg = tmp; return !ret ? -EFAULT : 0; } -- cgit v1.3-7-g2ca7 From 83d163124cf1104cca5b668d5fe6325715a60855 Mon Sep 17 00:00:00 2001 From: Jakub Kicinski Date: Thu, 21 Mar 2019 14:34:36 -0700 Subject: bpf: verifier: propagate liveness on all frames Commit 7640ead93924 ("bpf: verifier: make sure callees don't prune with caller differences") connected up parentage chains of all frames of the stack. It didn't, however, ensure propagate_liveness() propagates all liveness information along those chains. This means pruning happening in the callee may generate explored states with incomplete liveness for the chains in lower frames of the stack. The included selftest is similar to the prior one from commit 7640ead93924 ("bpf: verifier: make sure callees don't prune with caller differences"), where callee would prune regardless of the difference in r8 state. Now we also initialize r9 to 0 or 1 based on a result from get_random(). r9 is never read so the walk with r9 = 0 gets pruned (correctly) after the walk with r9 = 1 completes. The selftest is so arranged that the pruning will happen in the callee. Since callee does not propagate read marks of r8, the explored state at the pruning point prior to the callee will now ignore r8. Propagate liveness on all frames of the stack when pruning. Fixes: f4d7e40a5b71 ("bpf: introduce function calls (verification)") Signed-off-by: Jakub Kicinski Signed-off-by: Alexei Starovoitov --- kernel/bpf/verifier.c | 20 +++++++++++--------- tools/testing/selftests/bpf/verifier/calls.c | 25 +++++++++++++++++++++++++ 2 files changed, 36 insertions(+), 9 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index f19d5e04c69d..fd502c1f71eb 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -6078,15 +6078,17 @@ static int propagate_liveness(struct bpf_verifier_env *env, } /* Propagate read liveness of registers... */ BUILD_BUG_ON(BPF_REG_FP + 1 != MAX_BPF_REG); - /* We don't need to worry about FP liveness because it's read-only */ - for (i = 0; i < BPF_REG_FP; i++) { - if (vparent->frame[vparent->curframe]->regs[i].live & REG_LIVE_READ) - continue; - if (vstate->frame[vstate->curframe]->regs[i].live & REG_LIVE_READ) { - err = mark_reg_read(env, &vstate->frame[vstate->curframe]->regs[i], - &vparent->frame[vstate->curframe]->regs[i]); - if (err) - return err; + for (frame = 0; frame <= vstate->curframe; frame++) { + /* We don't need to worry about FP liveness, it's read-only */ + for (i = frame < vstate->curframe ? BPF_REG_6 : 0; i < BPF_REG_FP; i++) { + if (vparent->frame[frame]->regs[i].live & REG_LIVE_READ) + continue; + if (vstate->frame[frame]->regs[i].live & REG_LIVE_READ) { + err = mark_reg_read(env, &vstate->frame[frame]->regs[i], + &vparent->frame[frame]->regs[i]); + if (err) + return err; + } } } diff --git a/tools/testing/selftests/bpf/verifier/calls.c b/tools/testing/selftests/bpf/verifier/calls.c index 4004891afa9c..f2ccae39ee66 100644 --- a/tools/testing/selftests/bpf/verifier/calls.c +++ b/tools/testing/selftests/bpf/verifier/calls.c @@ -1940,3 +1940,28 @@ .errstr = "!read_ok", .result = REJECT, }, +{ + "calls: cross frame pruning - liveness propagation", + .insns = { + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_get_prandom_u32), + BPF_MOV64_IMM(BPF_REG_8, 0), + BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1), + BPF_MOV64_IMM(BPF_REG_8, 1), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_get_prandom_u32), + BPF_MOV64_IMM(BPF_REG_9, 0), + BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1), + BPF_MOV64_IMM(BPF_REG_9, 1), + BPF_MOV64_REG(BPF_REG_1, BPF_REG_0), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 1, 0, 4), + BPF_JMP_IMM(BPF_JEQ, BPF_REG_8, 1, 1), + BPF_LDX_MEM(BPF_B, BPF_REG_1, BPF_REG_2, 0), + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, 0, 0), + BPF_EXIT_INSN(), + }, + .prog_type = BPF_PROG_TYPE_SOCKET_FILTER, + .errstr_unpriv = "function calls to other bpf functions are allowed for root only", + .errstr = "!read_ok", + .result = REJECT, +}, -- cgit v1.3-7-g2ca7 From 5a07168d8d89b00fe1760120714378175b3ef992 Mon Sep 17 00:00:00 2001 From: Chen Jie Date: Fri, 15 Mar 2019 03:44:38 +0000 Subject: futex: Ensure that futex address is aligned in handle_futex_death() The futex code requires that the user space addresses of futexes are 32bit aligned. sys_futex() checks this in futex_get_keys() but the robust list code has no alignment check in place. As a consequence the kernel crashes on architectures with strict alignment requirements in handle_futex_death() when trying to cmpxchg() on an unaligned futex address which was retrieved from the robust list. [ tglx: Rewrote changelog, proper sizeof() based alignement check and add comment ] Fixes: 0771dfefc9e5 ("[PATCH] lightweight robust futexes: core") Signed-off-by: Chen Jie Signed-off-by: Thomas Gleixner Cc: Cc: Cc: Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/1552621478-119787-1-git-send-email-chenjie6@huawei.com --- kernel/futex.c | 4 ++++ 1 file changed, 4 insertions(+) (limited to 'kernel') diff --git a/kernel/futex.c b/kernel/futex.c index c3b73b0311bc..9e40cf7be606 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -3436,6 +3436,10 @@ static int handle_futex_death(u32 __user *uaddr, struct task_struct *curr, int p { u32 uval, uninitialized_var(nval), mval; + /* Futex address must be 32bit aligned */ + if ((((unsigned long)uaddr) % sizeof(*uaddr)) != 0) + return -1; + retry: if (get_user(uval, uaddr)) return -1; -- cgit v1.3-7-g2ca7 From bb2e320565f997273fe04035bb6c17f643da6f8a Mon Sep 17 00:00:00 2001 From: Valdis Kletnieks Date: Tue, 12 Mar 2019 04:17:56 -0400 Subject: genirq/devres: Remove excess parameter from kernel doc Building with 'make W=1' complains: CC kernel/irq/devres.o kernel/irq/devres.c:104: warning: Excess function parameter 'thread_fn' description in 'devm_request_any_context_irq' Remove it. Signed-off-by: Valdis Kletnieks Signed-off-by: Thomas Gleixner Link: https://lkml.kernel.org/r/31207.1552378676@turing-police --- kernel/irq/devres.c | 2 -- 1 file changed, 2 deletions(-) (limited to 'kernel') diff --git a/kernel/irq/devres.c b/kernel/irq/devres.c index 5d5378ea0afe..f808c6a97dcc 100644 --- a/kernel/irq/devres.c +++ b/kernel/irq/devres.c @@ -84,8 +84,6 @@ EXPORT_SYMBOL(devm_request_threaded_irq); * @dev: device to request interrupt for * @irq: Interrupt line to allocate * @handler: Function to be called when the IRQ occurs - * @thread_fn: function to be called in a threaded interrupt context. NULL - * for devices which handle everything in @handler * @irqflags: Interrupt type flags * @devname: An ascii name for the claiming device, dev_name(dev) if NULL * @dev_id: A cookie passed back to the handler function -- cgit v1.3-7-g2ca7 From e8750053d64a3317cbc15f8341f0f11ca751bfeb Mon Sep 17 00:00:00 2001 From: Valdis Kletnieks Date: Tue, 12 Mar 2019 04:38:35 -0400 Subject: time/jiffies: Make refined_jiffies static sparse complains: CHECK kernel/time/jiffies.c kernel/time/jiffies.c:92:20: warning: symbol 'refined_jiffies' was not declared. Should it be static? Its only used in file scope. Make it static. Signed-off-by: Valdis Kletnieks Signed-off-by: Thomas Gleixner Link: https://lkml.kernel.org/r/32342.1552379915@turing-police --- kernel/time/jiffies.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/time/jiffies.c b/kernel/time/jiffies.c index dc1b6f1929f9..ac9c03dd6c7d 100644 --- a/kernel/time/jiffies.c +++ b/kernel/time/jiffies.c @@ -89,7 +89,7 @@ struct clocksource * __init __weak clocksource_default_clock(void) return &clocksource_jiffies; } -struct clocksource refined_jiffies; +static struct clocksource refined_jiffies; int register_refined_jiffies(long cycles_per_second) { -- cgit v1.3-7-g2ca7 From 48084abf212052ca1d39fae064c581b1ce5b1fdf Mon Sep 17 00:00:00 2001 From: Valdis Kletnieks Date: Tue, 12 Mar 2019 05:33:48 -0400 Subject: watchdog/core: Make variables static sparse complains: CHECK kernel/watchdog.c kernel/watchdog.c:45:19: warning: symbol 'nmi_watchdog_available' was not declared. Should it be static? kernel/watchdog.c:47:16: warning: symbol 'watchdog_allowed_mask' was not declared. Should it be static? They're not referenced by name from anyplace else, make them static. Signed-off-by: Valdis Kletnieks Signed-off-by: Thomas Gleixner Link: https://lkml.kernel.org/r/7855.1552383228@turing-police --- kernel/watchdog.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/watchdog.c b/kernel/watchdog.c index 8fbfda94a67b..403c9bd90413 100644 --- a/kernel/watchdog.c +++ b/kernel/watchdog.c @@ -42,9 +42,9 @@ int __read_mostly watchdog_user_enabled = 1; int __read_mostly nmi_watchdog_user_enabled = NMI_WATCHDOG_DEFAULT; int __read_mostly soft_watchdog_user_enabled = 1; int __read_mostly watchdog_thresh = 10; -int __read_mostly nmi_watchdog_available; +static int __read_mostly nmi_watchdog_available; -struct cpumask watchdog_allowed_mask __read_mostly; +static struct cpumask watchdog_allowed_mask __read_mostly; struct cpumask watchdog_cpumask __read_mostly; unsigned long *watchdog_cpumask_bits = cpumask_bits(&watchdog_cpumask); -- cgit v1.3-7-g2ca7 From 93417a3fda2060f2a34e3341904024c5b6980d1f Mon Sep 17 00:00:00 2001 From: "Gustavo A. R. Silva" Date: Thu, 28 Feb 2019 15:37:14 -0600 Subject: genirq: Mark expected switch case fall-through MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit In preparation to enabling -Wimplicit-fallthrough, mark switch cases where we are expecting to fall through. With -Wimplicit-fallthrough added to CFLAGS: kernel/irq/manage.c: In function ‘irq_do_set_affinity’: kernel/irq/manage.c:198:3: warning: this statement may fall through [-Wimplicit-fallthrough=] cpumask_copy(desc->irq_common_data.affinity, mask); ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ kernel/irq/manage.c:199:2: note: here case IRQ_SET_MASK_OK_NOCOPY: ^~~~ Annotate it. Signed-off-by: Gustavo A. R. Silva Signed-off-by: Thomas Gleixner Cc: Kees Cook Link: https://lkml.kernel.org/r/20190228213714.GA9246@embeddedor --- kernel/irq/manage.c | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c index 9ec34a2a6638..1401afa0d58a 100644 --- a/kernel/irq/manage.c +++ b/kernel/irq/manage.c @@ -196,6 +196,7 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask, case IRQ_SET_MASK_OK: case IRQ_SET_MASK_OK_DONE: cpumask_copy(desc->irq_common_data.affinity, mask); + /* fall through */ case IRQ_SET_MASK_OK_NOCOPY: irq_validate_effective_affinity(data); irq_set_thread_affinity(desc); -- cgit v1.3-7-g2ca7 From 1da6c4d9140cb7c13e87667dc4e1488d6c8fc10f Mon Sep 17 00:00:00 2001 From: Daniel Borkmann Date: Mon, 25 Mar 2019 15:54:43 +0100 Subject: bpf: fix use after free in bpf_evict_inode syzkaller was able to generate the following UAF in bpf: BUG: KASAN: use-after-free in lookup_last fs/namei.c:2269 [inline] BUG: KASAN: use-after-free in path_lookupat.isra.43+0x9f8/0xc00 fs/namei.c:2318 Read of size 1 at addr ffff8801c4865c47 by task syz-executor2/9423 CPU: 0 PID: 9423 Comm: syz-executor2 Not tainted 4.20.0-rc1-next-20181109+ #110 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x244/0x39d lib/dump_stack.c:113 print_address_description.cold.7+0x9/0x1ff mm/kasan/report.c:256 kasan_report_error mm/kasan/report.c:354 [inline] kasan_report.cold.8+0x242/0x309 mm/kasan/report.c:412 __asan_report_load1_noabort+0x14/0x20 mm/kasan/report.c:430 lookup_last fs/namei.c:2269 [inline] path_lookupat.isra.43+0x9f8/0xc00 fs/namei.c:2318 filename_lookup+0x26a/0x520 fs/namei.c:2348 user_path_at_empty+0x40/0x50 fs/namei.c:2608 user_path include/linux/namei.h:62 [inline] do_mount+0x180/0x1ff0 fs/namespace.c:2980 ksys_mount+0x12d/0x140 fs/namespace.c:3258 __do_sys_mount fs/namespace.c:3272 [inline] __se_sys_mount fs/namespace.c:3269 [inline] __x64_sys_mount+0xbe/0x150 fs/namespace.c:3269 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x457569 Code: fd b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 cb b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007fde6ed96c78 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5 RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 0000000000457569 RDX: 0000000020000040 RSI: 0000000020000000 RDI: 0000000000000000 RBP: 000000000072bf00 R08: 0000000020000340 R09: 0000000000000000 R10: 0000000000200000 R11: 0000000000000246 R12: 00007fde6ed976d4 R13: 00000000004c2c24 R14: 00000000004d4990 R15: 00000000ffffffff Allocated by task 9424: save_stack+0x43/0xd0 mm/kasan/kasan.c:448 set_track mm/kasan/kasan.c:460 [inline] kasan_kmalloc+0xc7/0xe0 mm/kasan/kasan.c:553 __do_kmalloc mm/slab.c:3722 [inline] __kmalloc_track_caller+0x157/0x760 mm/slab.c:3737 kstrdup+0x39/0x70 mm/util.c:49 bpf_symlink+0x26/0x140 kernel/bpf/inode.c:356 vfs_symlink+0x37a/0x5d0 fs/namei.c:4127 do_symlinkat+0x242/0x2d0 fs/namei.c:4154 __do_sys_symlink fs/namei.c:4173 [inline] __se_sys_symlink fs/namei.c:4171 [inline] __x64_sys_symlink+0x59/0x80 fs/namei.c:4171 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe Freed by task 9425: save_stack+0x43/0xd0 mm/kasan/kasan.c:448 set_track mm/kasan/kasan.c:460 [inline] __kasan_slab_free+0x102/0x150 mm/kasan/kasan.c:521 kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528 __cache_free mm/slab.c:3498 [inline] kfree+0xcf/0x230 mm/slab.c:3817 bpf_evict_inode+0x11f/0x150 kernel/bpf/inode.c:565 evict+0x4b9/0x980 fs/inode.c:558 iput_final fs/inode.c:1550 [inline] iput+0x674/0xa90 fs/inode.c:1576 do_unlinkat+0x733/0xa30 fs/namei.c:4069 __do_sys_unlink fs/namei.c:4110 [inline] __se_sys_unlink fs/namei.c:4108 [inline] __x64_sys_unlink+0x42/0x50 fs/namei.c:4108 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe In this scenario path lookup under RCU is racing with the final unlink in case of symlinks. As Linus puts it in his analysis: [...] We actually RCU-delay the inode freeing itself, but when we do the final iput(), the "evict()" function is called synchronously. Now, the simple fix would seem to just RCU-delay the kfree() of the symlink data in bpf_evict_inode(). Maybe that's the right thing to do. [...] Al suggested to piggy-back on the ->destroy_inode() callback in order to implement RCU deferral there which can then kfree() the inode->i_link eventually right before putting inode back into inode cache. By reusing free_inode_nonrcu() from there we can avoid the need for our own inode cache and just reuse generic one as we currently do. And in-fact on top of all this we should just get rid of the bpf_evict_inode() entirely. This means truncate_inode_pages_final() and clear_inode() will then simply be called by the fs core via evict(). Dropping the reference should really only be done when inode is unhashed and nothing reachable anymore, so it's better also moved into the final ->destroy_inode() callback. Fixes: 0f98621bef5d ("bpf, inode: add support for symlinks and fix mtime/ctime") Reported-by: syzbot+fb731ca573367b7f6564@syzkaller.appspotmail.com Reported-by: syzbot+a13e5ead792d6df37818@syzkaller.appspotmail.com Reported-by: syzbot+7a8ba368b47fdefca61e@syzkaller.appspotmail.com Suggested-by: Al Viro Analyzed-by: Linus Torvalds Signed-off-by: Daniel Borkmann Acked-by: Alexei Starovoitov Acked-by: Linus Torvalds Acked-by: Al Viro Link: https://lore.kernel.org/lkml/0000000000006946d2057bbd0eef@google.com/T/ --- kernel/bpf/inode.c | 32 ++++++++++++++++++-------------- 1 file changed, 18 insertions(+), 14 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c index 2ada5e21dfa6..4a8f390a2b82 100644 --- a/kernel/bpf/inode.c +++ b/kernel/bpf/inode.c @@ -554,19 +554,6 @@ struct bpf_prog *bpf_prog_get_type_path(const char *name, enum bpf_prog_type typ } EXPORT_SYMBOL(bpf_prog_get_type_path); -static void bpf_evict_inode(struct inode *inode) -{ - enum bpf_type type; - - truncate_inode_pages_final(&inode->i_data); - clear_inode(inode); - - if (S_ISLNK(inode->i_mode)) - kfree(inode->i_link); - if (!bpf_inode_type(inode, &type)) - bpf_any_put(inode->i_private, type); -} - /* * Display the mount options in /proc/mounts. */ @@ -579,11 +566,28 @@ static int bpf_show_options(struct seq_file *m, struct dentry *root) return 0; } +static void bpf_destroy_inode_deferred(struct rcu_head *head) +{ + struct inode *inode = container_of(head, struct inode, i_rcu); + enum bpf_type type; + + if (S_ISLNK(inode->i_mode)) + kfree(inode->i_link); + if (!bpf_inode_type(inode, &type)) + bpf_any_put(inode->i_private, type); + free_inode_nonrcu(inode); +} + +static void bpf_destroy_inode(struct inode *inode) +{ + call_rcu(&inode->i_rcu, bpf_destroy_inode_deferred); +} + static const struct super_operations bpf_super_ops = { .statfs = simple_statfs, .drop_inode = generic_delete_inode, .show_options = bpf_show_options, - .evict_inode = bpf_evict_inode, + .destroy_inode = bpf_destroy_inode, }; enum { -- cgit v1.3-7-g2ca7 From ff9d31d0d46672e201fc9ff59c42f1eef5f00c77 Mon Sep 17 00:00:00 2001 From: Tom Zanussi Date: Wed, 20 Mar 2019 12:53:33 -0500 Subject: tracing: Remove unnecessary var_ref destroy in track_data_destroy() Commit 656fe2ba85e8 (tracing: Use hist trigger's var_ref array to destroy var_refs) centralized the destruction of all the var_refs in one place so that other code didn't have to do it. The track_data_destroy() added later ignored that and also destroyed the track_data var_ref, causing a double-free error flagged by KASAN. ================================================================== BUG: KASAN: use-after-free in destroy_hist_field+0x30/0x70 Read of size 8 at addr ffff888086df2210 by task bash/1694 CPU: 6 PID: 1694 Comm: bash Not tainted 5.1.0-rc1-test+ #15 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016 Call Trace: dump_stack+0x71/0xa0 ? destroy_hist_field+0x30/0x70 print_address_description.cold.3+0x9/0x1fb ? destroy_hist_field+0x30/0x70 ? destroy_hist_field+0x30/0x70 kasan_report.cold.4+0x1a/0x33 ? __kasan_slab_free+0x100/0x150 ? destroy_hist_field+0x30/0x70 destroy_hist_field+0x30/0x70 track_data_destroy+0x55/0xe0 destroy_hist_data+0x1f0/0x350 hist_unreg_all+0x203/0x220 event_trigger_open+0xbb/0x130 do_dentry_open+0x296/0x700 ? stacktrace_count_trigger+0x30/0x30 ? generic_permission+0x56/0x200 ? __x64_sys_fchdir+0xd0/0xd0 ? inode_permission+0x55/0x200 ? security_inode_permission+0x18/0x60 path_openat+0x633/0x22b0 ? path_lookupat.isra.50+0x420/0x420 ? __kasan_kmalloc.constprop.12+0xc1/0xd0 ? kmem_cache_alloc+0xe5/0x260 ? getname_flags+0x6c/0x2a0 ? do_sys_open+0x149/0x2b0 ? do_syscall_64+0x73/0x1b0 ? entry_SYSCALL_64_after_hwframe+0x44/0xa9 ? _raw_write_lock_bh+0xe0/0xe0 ? __kernel_text_address+0xe/0x30 ? unwind_get_return_address+0x2f/0x50 ? __list_add_valid+0x2d/0x70 ? deactivate_slab.isra.62+0x1f4/0x5a0 ? getname_flags+0x6c/0x2a0 ? set_track+0x76/0x120 do_filp_open+0x11a/0x1a0 ? may_open_dev+0x50/0x50 ? _raw_spin_lock+0x7a/0xd0 ? _raw_write_lock_bh+0xe0/0xe0 ? __alloc_fd+0x10f/0x200 do_sys_open+0x1db/0x2b0 ? filp_open+0x50/0x50 do_syscall_64+0x73/0x1b0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7fa7b24a4ca2 Code: 25 00 00 41 00 3d 00 00 41 00 74 4c 48 8d 05 85 7a 0d 00 8b 00 85 c0 75 6d 89 f2 b8 01 01 00 00 48 89 fe bf 9c ff ff ff 0f 05 <48> 3d 00 f0 ff ff 0f 87 a2 00 00 00 48 8b 4c 24 28 64 48 33 0c 25 RSP: 002b:00007fffbafb3af0 EFLAGS: 00000246 ORIG_RAX: 0000000000000101 RAX: ffffffffffffffda RBX: 000055d3648ade30 RCX: 00007fa7b24a4ca2 RDX: 0000000000000241 RSI: 000055d364a55240 RDI: 00000000ffffff9c RBP: 00007fffbafb3bf0 R08: 0000000000000020 R09: 0000000000000002 R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000000 R13: 0000000000000003 R14: 0000000000000001 R15: 000055d364a55240 ================================================================== So remove the track_data_destroy() destroy_hist_field() call for that var_ref. Link: http://lkml.kernel.org/r/1deffec420f6a16d11dd8647318d34a66d1989a9.camel@linux.intel.com Fixes: 466f4528fbc69 ("tracing: Generalize hist trigger onmax and save action") Reported-by: Steven Rostedt (VMware) Signed-off-by: Tom Zanussi Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_events_hist.c | 1 - 1 file changed, 1 deletion(-) (limited to 'kernel') diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c index ca46339f3009..795aa2038377 100644 --- a/kernel/trace/trace_events_hist.c +++ b/kernel/trace/trace_events_hist.c @@ -3713,7 +3713,6 @@ static void track_data_destroy(struct hist_trigger_data *hist_data, struct trace_event_file *file = hist_data->event_file; destroy_hist_field(data->track_data.track_var, 0); - destroy_hist_field(data->track_data.var_ref, 0); if (data->action == ACTION_SNAPSHOT) { struct track_data *track_data; -- cgit v1.3-7-g2ca7 From 3dee10da2e9ff220e054a8f158cc296c797fbe81 Mon Sep 17 00:00:00 2001 From: Frank Rowand Date: Thu, 21 Mar 2019 23:58:20 -0700 Subject: tracing: initialize variable in create_dyn_event() Fix compile warning in create_dyn_event(): 'ret' may be used uninitialized in this function [-Wuninitialized]. Link: http://lkml.kernel.org/r/1553237900-8555-1-git-send-email-frowand.list@gmail.com Cc: Masami Hiramatsu Cc: Ingo Molnar Cc: Tom Zanussi Cc: Ravi Bangoria Cc: stable@vger.kernel.org Fixes: 5448d44c3855 ("tracing: Add unified dynamic event framework") Signed-off-by: Frank Rowand Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_dynevent.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/trace/trace_dynevent.c b/kernel/trace/trace_dynevent.c index dd1f43588d70..fa100ed3b4de 100644 --- a/kernel/trace/trace_dynevent.c +++ b/kernel/trace/trace_dynevent.c @@ -74,7 +74,7 @@ int dyn_event_release(int argc, char **argv, struct dyn_event_operations *type) static int create_dyn_event(int argc, char **argv) { struct dyn_event_operations *ops; - int ret; + int ret = -ENODEV; if (argv[0][0] == '-' || argv[0][0] == '!') return dyn_event_release(argc, argv, NULL); -- cgit v1.3-7-g2ca7 From 9efb85c5cfac7e1f0caae4471446d936ff2163fe Mon Sep 17 00:00:00 2001 From: Hariprasad Kelam Date: Sun, 24 Mar 2019 00:05:23 +0530 Subject: ftrace: Fix warning using plain integer as NULL & spelling corrections Changed 0 --> NULL to avoid sparse warning Corrected spelling mistakes reported by checkpatch.pl Sparse warning below: sudo make C=2 CF=-D__CHECK_ENDIAN__ M=kernel/trace CHECK kernel/trace/ftrace.c kernel/trace/ftrace.c:3007:24: warning: Using plain integer as NULL pointer kernel/trace/ftrace.c:4758:37: warning: Using plain integer as NULL pointer Link: http://lkml.kernel.org/r/20190323183523.GA2244@hari-Inspiron-1545 Signed-off-by: Hariprasad Kelam Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/ftrace.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index fa79323331b2..26c8ca9bd06b 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -1992,7 +1992,7 @@ static void print_bug_type(void) * modifying the code. @failed should be one of either: * EFAULT - if the problem happens on reading the @ip address * EINVAL - if what is read at @ip is not what was expected - * EPERM - if the problem happens on writting to the @ip address + * EPERM - if the problem happens on writing to the @ip address */ void ftrace_bug(int failed, struct dyn_ftrace *rec) { @@ -2391,7 +2391,7 @@ __ftrace_replace_code(struct dyn_ftrace *rec, int enable) return ftrace_modify_call(rec, ftrace_old_addr, ftrace_addr); } - return -1; /* unknow ftrace bug */ + return -1; /* unknown ftrace bug */ } void __weak ftrace_replace_code(int mod_flags) @@ -3004,7 +3004,7 @@ ftrace_allocate_pages(unsigned long num_to_init) int cnt; if (!num_to_init) - return 0; + return NULL; start_pg = pg = kzalloc(sizeof(*pg), GFP_KERNEL); if (!pg) @@ -4755,7 +4755,7 @@ static int ftrace_set_addr(struct ftrace_ops *ops, unsigned long ip, int remove, int reset, int enable) { - return ftrace_set_hash(ops, 0, 0, ip, remove, reset, enable); + return ftrace_set_hash(ops, NULL, 0, ip, remove, reset, enable); } /** @@ -5463,7 +5463,7 @@ void ftrace_create_filter_files(struct ftrace_ops *ops, /* * The name "destroy_filter_files" is really a misnomer. Although - * in the future, it may actualy delete the files, but this is + * in the future, it may actually delete the files, but this is * really intended to make sure the ops passed in are disabled * and that when this function returns, the caller is free to * free the ops. @@ -5786,7 +5786,7 @@ void ftrace_module_enable(struct module *mod) /* * If the tracing is enabled, go ahead and enable the record. * - * The reason not to enable the record immediatelly is the + * The reason not to enable the record immediately is the * inherent check of ftrace_make_nop/ftrace_make_call for * correct previous instructions. Making first the NOP * conversion puts the module to the correct state, thus -- cgit v1.3-7-g2ca7 From 927cb78177ae3773d0d27404566a93cb8e88890c Mon Sep 17 00:00:00 2001 From: Paul Chaignon Date: Wed, 20 Mar 2019 13:58:27 +0100 Subject: bpf: remove incorrect 'verifier bug' warning The BPF verifier checks the maximum number of call stack frames twice, first in the main CFG traversal (do_check) and then in a subsequent traversal (check_max_stack_depth). If the second check fails, it logs a 'verifier bug' warning and errors out, as the number of call stack frames should have been verified already. However, the second check may fail without indicating a verifier bug: if the excessive function calls reside in dead code, the main CFG traversal may not visit them; the subsequent traversal visits all instructions, including dead code. This case raises the question of how invalid dead code should be treated. This patch implements the conservative option and rejects such code. Signed-off-by: Paul Chaignon Tested-by: Xiao Han Signed-off-by: Alexei Starovoitov --- kernel/bpf/verifier.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index fd502c1f71eb..6c5a41f7f338 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -1897,8 +1897,9 @@ continue_func: } frame++; if (frame >= MAX_CALL_FRAMES) { - WARN_ONCE(1, "verifier bug. Call stack is too deep\n"); - return -EFAULT; + verbose(env, "the call stack of %d frames is too deep !\n", + frame); + return -E2BIG; } goto process_func; } -- cgit v1.3-7-g2ca7 From 7dd47617114921fdd8c095509e5e7b4373cc44a1 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Tue, 26 Mar 2019 22:51:02 +0100 Subject: watchdog: Respect watchdog cpumask on CPU hotplug The rework of the watchdog core to use cpu_stop_work broke the watchdog cpumask on CPU hotplug. The watchdog_enable/disable() functions are now called unconditionally from the hotplug callback, i.e. even on CPUs which are not in the watchdog cpumask. As a consequence the watchdog can become unstoppable. Only invoke them when the plugged CPU is in the watchdog cpumask. Fixes: 9cf57731b63e ("watchdog/softlockup: Replace "watchdog/%u" threads with cpu_stop_work") Reported-by: Maxime Coquelin Signed-off-by: Thomas Gleixner Tested-by: Maxime Coquelin Cc: Peter Zijlstra Cc: Oleg Nesterov Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Don Zickus Cc: Ricardo Neri Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1903262245490.1789@nanos.tec.linutronix.de --- kernel/watchdog.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/watchdog.c b/kernel/watchdog.c index 403c9bd90413..6a5787233113 100644 --- a/kernel/watchdog.c +++ b/kernel/watchdog.c @@ -554,13 +554,15 @@ static void softlockup_start_all(void) int lockup_detector_online_cpu(unsigned int cpu) { - watchdog_enable(cpu); + if (cpumask_test_cpu(cpu, &watchdog_allowed_mask)) + watchdog_enable(cpu); return 0; } int lockup_detector_offline_cpu(unsigned int cpu) { - watchdog_disable(cpu); + if (cpumask_test_cpu(cpu, &watchdog_allowed_mask)) + watchdog_disable(cpu); return 0; } -- cgit v1.3-7-g2ca7 From 206b92353c839c0b27a0b9bec24195f93fd6cf7a Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Tue, 26 Mar 2019 17:36:05 +0100 Subject: cpu/hotplug: Prevent crash when CPU bringup fails on CONFIG_HOTPLUG_CPU=n Tianyu reported a crash in a CPU hotplug teardown callback when booting a kernel which has CONFIG_HOTPLUG_CPU disabled with the 'nosmt' boot parameter. It turns out that the SMP=y CONFIG_HOTPLUG_CPU=n case has been broken forever in case that a bringup callback fails. Unfortunately this issue was not recognized when the CPU hotplug code was reworked, so the shortcoming just stayed in place. When a bringup callback fails, the CPU hotplug code rolls back the operation and takes the CPU offline. The 'nosmt' command line argument uses a bringup failure to abort the bringup of SMT sibling CPUs. This partial bringup is required due to the MCE misdesign on Intel CPUs. With CONFIG_HOTPLUG_CPU=y the rollback works perfectly fine, but CONFIG_HOTPLUG_CPU=n lacks essential mechanisms to exercise the low level teardown of a CPU including the synchronizations in various facilities like RCU, NOHZ and others. As a consequence the teardown callbacks which must be executed on the outgoing CPU within stop machine with interrupts disabled are executed on the control CPU in interrupt enabled and preemptible context causing the kernel to crash and burn. The pre state machine code has a different failure mode which is more subtle and resulting in a less obvious use after free crash because the control side frees resources which are still in use by the undead CPU. But this is not a x86 only problem. Any architecture which supports the SMP=y HOTPLUG_CPU=n combination suffers from the same issue. It's just less likely to be triggered because in 99.99999% of the cases all bringup callbacks succeed. The easy solution of making HOTPLUG_CPU mandatory for SMP is not working on all architectures as the following architectures have either no hotplug support at all or not all subarchitectures support it: alpha, arc, hexagon, openrisc, riscv, sparc (32bit), mips (partial). Crashing the kernel in such a situation is not an acceptable state either. Implement a minimal rollback variant by limiting the teardown to the point where all regular teardown callbacks have been invoked and leave the CPU in the 'dead' idle state. This has the following consequences: - the CPU is brought down to the point where the stop_machine takedown would happen. - the CPU stays there forever and is idle - The CPU is cleared in the CPU active mask, but not in the CPU online mask which is a legit state. - Interrupts are not forced away from the CPU - All facilities which only look at online mask would still see it, but that is the case during normal hotplug/unplug operations as well. It's just a (way) longer time frame. This will expose issues, which haven't been exposed before or only seldom, because now the normally transient state of being non active but online is a permanent state. In testing this exposed already an issue vs. work queues where the vmstat code schedules work on the almost dead CPU which ends up in an unbound workqueue and triggers 'preemtible context' warnings. This is not a problem of this change, it merily exposes an already existing issue. Still this is better than crashing fully without a chance to debug it. This is mainly thought as workaround for those architectures which do not support HOTPLUG_CPU. All others should enforce HOTPLUG_CPU for SMP. Fixes: 2e1a3483ce74 ("cpu/hotplug: Split out the state walk into functions") Reported-by: Tianyu Lan Signed-off-by: Thomas Gleixner Tested-by: Tianyu Lan Acked-by: Greg Kroah-Hartman Cc: Konrad Wilk Cc: Josh Poimboeuf Cc: Mukesh Ojha Cc: Peter Zijlstra Cc: Jiri Kosina Cc: Rik van Riel Cc: Andy Lutomirski Cc: Micheal Kelley Cc: "K. Y. Srinivasan" Cc: Linus Torvalds Cc: Borislav Petkov Cc: K. Y. Srinivasan Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20190326163811.503390616@linutronix.de --- kernel/cpu.c | 20 ++++++++++++++++++-- 1 file changed, 18 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/cpu.c b/kernel/cpu.c index 025f419d16f6..6754f3ecfd94 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -564,6 +564,20 @@ static void undo_cpu_up(unsigned int cpu, struct cpuhp_cpu_state *st) cpuhp_invoke_callback(cpu, st->state, false, NULL, NULL); } +static inline bool can_rollback_cpu(struct cpuhp_cpu_state *st) +{ + if (IS_ENABLED(CONFIG_HOTPLUG_CPU)) + return true; + /* + * When CPU hotplug is disabled, then taking the CPU down is not + * possible because takedown_cpu() and the architecture and + * subsystem specific mechanisms are not available. So the CPU + * which would be completely unplugged again needs to stay around + * in the current state. + */ + return st->state <= CPUHP_BRINGUP_CPU; +} + static int cpuhp_up_callbacks(unsigned int cpu, struct cpuhp_cpu_state *st, enum cpuhp_state target) { @@ -574,8 +588,10 @@ static int cpuhp_up_callbacks(unsigned int cpu, struct cpuhp_cpu_state *st, st->state++; ret = cpuhp_invoke_callback(cpu, st->state, true, NULL, NULL); if (ret) { - st->target = prev_state; - undo_cpu_up(cpu, st); + if (can_rollback_cpu(st)) { + st->target = prev_state; + undo_cpu_up(cpu, st); + } break; } } -- cgit v1.3-7-g2ca7 From fcfc2aa0185f4a731d05a21e9f359968fdfd02e7 Mon Sep 17 00:00:00 2001 From: Andrei Vagin Date: Thu, 28 Mar 2019 20:44:13 -0700 Subject: ptrace: take into account saved_sigmask in PTRACE{GET,SET}SIGMASK There are a few system calls (pselect, ppoll, etc) which replace a task sigmask while they are running in a kernel-space When a task calls one of these syscalls, the kernel saves a current sigmask in task->saved_sigmask and sets a syscall sigmask. On syscall-exit-stop, ptrace traps a task before restoring the saved_sigmask, so PTRACE_GETSIGMASK returns the syscall sigmask and PTRACE_SETSIGMASK does nothing, because its sigmask is replaced by saved_sigmask, when the task returns to user-space. This patch fixes this problem. PTRACE_GETSIGMASK returns saved_sigmask if it's set. PTRACE_SETSIGMASK drops the TIF_RESTORE_SIGMASK flag. Link: http://lkml.kernel.org/r/20181120060616.6043-1-avagin@gmail.com Fixes: 29000caecbe8 ("ptrace: add ability to get/set signal-blocked mask") Signed-off-by: Andrei Vagin Acked-by: Oleg Nesterov Cc: "Eric W. Biederman" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/sched/signal.h | 18 ++++++++++++++++++ kernel/ptrace.c | 15 +++++++++++++-- 2 files changed, 31 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h index ae5655197698..e412c092c1e8 100644 --- a/include/linux/sched/signal.h +++ b/include/linux/sched/signal.h @@ -418,10 +418,20 @@ static inline void set_restore_sigmask(void) set_thread_flag(TIF_RESTORE_SIGMASK); WARN_ON(!test_thread_flag(TIF_SIGPENDING)); } + +static inline void clear_tsk_restore_sigmask(struct task_struct *tsk) +{ + clear_tsk_thread_flag(tsk, TIF_RESTORE_SIGMASK); +} + static inline void clear_restore_sigmask(void) { clear_thread_flag(TIF_RESTORE_SIGMASK); } +static inline bool test_tsk_restore_sigmask(struct task_struct *tsk) +{ + return test_tsk_thread_flag(tsk, TIF_RESTORE_SIGMASK); +} static inline bool test_restore_sigmask(void) { return test_thread_flag(TIF_RESTORE_SIGMASK); @@ -439,6 +449,10 @@ static inline void set_restore_sigmask(void) current->restore_sigmask = true; WARN_ON(!test_thread_flag(TIF_SIGPENDING)); } +static inline void clear_tsk_restore_sigmask(struct task_struct *tsk) +{ + tsk->restore_sigmask = false; +} static inline void clear_restore_sigmask(void) { current->restore_sigmask = false; @@ -447,6 +461,10 @@ static inline bool test_restore_sigmask(void) { return current->restore_sigmask; } +static inline bool test_tsk_restore_sigmask(struct task_struct *tsk) +{ + return tsk->restore_sigmask; +} static inline bool test_and_clear_restore_sigmask(void) { if (!current->restore_sigmask) diff --git a/kernel/ptrace.c b/kernel/ptrace.c index 771e93f9c43f..6f357f4fc859 100644 --- a/kernel/ptrace.c +++ b/kernel/ptrace.c @@ -29,6 +29,7 @@ #include #include #include +#include /* * Access another process' address space via ptrace. @@ -924,18 +925,26 @@ int ptrace_request(struct task_struct *child, long request, ret = ptrace_setsiginfo(child, &siginfo); break; - case PTRACE_GETSIGMASK: + case PTRACE_GETSIGMASK: { + sigset_t *mask; + if (addr != sizeof(sigset_t)) { ret = -EINVAL; break; } - if (copy_to_user(datavp, &child->blocked, sizeof(sigset_t))) + if (test_tsk_restore_sigmask(child)) + mask = &child->saved_sigmask; + else + mask = &child->blocked; + + if (copy_to_user(datavp, mask, sizeof(sigset_t))) ret = -EFAULT; else ret = 0; break; + } case PTRACE_SETSIGMASK: { sigset_t new_set; @@ -961,6 +970,8 @@ int ptrace_request(struct task_struct *child, long request, child->blocked = new_set; spin_unlock_irq(&child->sighand->siglock); + clear_tsk_restore_sigmask(child); + ret = 0; break; } -- cgit v1.3-7-g2ca7 From 676e4a6fe703f2dae699ee9d56f14516f9ada4ea Mon Sep 17 00:00:00 2001 From: Jesper Dangaard Brouer Date: Fri, 29 Mar 2019 10:18:00 +0100 Subject: xdp: fix cpumap redirect SKB creation bug We want to avoid leaking pointer info from xdp_frame (that is placed in top of frame) like commit 6dfb970d3dbd ("xdp: avoid leaking info stored in frame data on page reuse"), and followup commit 97e19cce05e5 ("bpf: reserve xdp_frame size in xdp headroom") that reserve this headroom. These changes also affected how cpumap constructed SKBs, as xdpf->headroom size changed, the skb data starting point were in-effect shifted with 32 bytes (sizeof xdp_frame). This was still okay, as the cpumap frame_size calculation also included xdpf->headroom which were reduced by same amount. A bug was introduced in commit 77ea5f4cbe20 ("bpf/cpumap: make sure frame_size for build_skb is aligned if headroom isn't"), where the xdpf->headroom became part of the SKB_DATA_ALIGN rounding up. This round-up to find the frame_size is in principle still correct as it does not exceed the 2048 bytes frame_size (which is max for ixgbe and i40e), but the 32 bytes offset of pkt_data_start puts this over the 2048 bytes limit. This cause skb_shared_info to spill into next frame. It is a little hard to trigger, as the SKB need to use above 15 skb_shinfo->frags[] as far as I calculate. This does happen in practise for TCP streams when skb_try_coalesce() kicks in. KASAN can be used to detect these wrong memory accesses, I've seen: BUG: KASAN: use-after-free in skb_try_coalesce+0x3cb/0x760 BUG: KASAN: wild-memory-access in skb_release_data+0xe2/0x250 Driver veth also construct a SKB from xdp_frame in this way, but is not affected, as it doesn't reserve/deduct the room (used by xdp_frame) from the SKB headroom. Instead is clears the pointers via xdp_scrub_frame(), and allows SKB to use this area. The fix in this patch is to do like veth and instead allow SKB to (re)use the area occupied by xdp_frame, by clearing via xdp_scrub_frame(). (This does kill the idea of the SKB being able to access (mem) info from this area, but I guess it was a bad idea anyhow, and it was already killed by the veth changes.) Fixes: 77ea5f4cbe20 ("bpf/cpumap: make sure frame_size for build_skb is aligned if headroom isn't") Signed-off-by: Jesper Dangaard Brouer Signed-off-by: Alexei Starovoitov --- kernel/bpf/cpumap.c | 13 ++++++++++--- 1 file changed, 10 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c index 8974b3755670..3c18260403dd 100644 --- a/kernel/bpf/cpumap.c +++ b/kernel/bpf/cpumap.c @@ -162,10 +162,14 @@ static void cpu_map_kthread_stop(struct work_struct *work) static struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu, struct xdp_frame *xdpf) { + unsigned int hard_start_headroom; unsigned int frame_size; void *pkt_data_start; struct sk_buff *skb; + /* Part of headroom was reserved to xdpf */ + hard_start_headroom = sizeof(struct xdp_frame) + xdpf->headroom; + /* build_skb need to place skb_shared_info after SKB end, and * also want to know the memory "truesize". Thus, need to * know the memory frame size backing xdp_buff. @@ -183,15 +187,15 @@ static struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu, * is not at a fixed memory location, with mixed length * packets, which is bad for cache-line hotness. */ - frame_size = SKB_DATA_ALIGN(xdpf->len + xdpf->headroom) + + frame_size = SKB_DATA_ALIGN(xdpf->len + hard_start_headroom) + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); - pkt_data_start = xdpf->data - xdpf->headroom; + pkt_data_start = xdpf->data - hard_start_headroom; skb = build_skb(pkt_data_start, frame_size); if (!skb) return NULL; - skb_reserve(skb, xdpf->headroom); + skb_reserve(skb, hard_start_headroom); __skb_put(skb, xdpf->len); if (xdpf->metasize) skb_metadata_set(skb, xdpf->metasize); @@ -205,6 +209,9 @@ static struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu, * - RX ring dev queue index (skb_record_rx_queue) */ + /* Allow SKB to reuse area used by xdp_frame */ + xdp_scrub_frame(xdpf); + return skb; } -- cgit v1.3-7-g2ca7 From 556a888a14afe27164191955618990fb3ccc9aad Mon Sep 17 00:00:00 2001 From: Jann Horn Date: Sat, 30 Mar 2019 03:12:32 +0100 Subject: signal: don't silently convert SI_USER signals to non-current pidfd The current sys_pidfd_send_signal() silently turns signals with explicit SI_USER context that are sent to non-current tasks into signals with kernel-generated siginfo. This is unlike do_rt_sigqueueinfo(), which returns -EPERM in this case. If a user actually wants to send a signal with kernel-provided siginfo, they can do that with pidfd_send_signal(pidfd, sig, NULL, 0); so allowing this case is unnecessary. Instead of silently replacing the siginfo, just bail out with an error; this is consistent with other interfaces and avoids special-casing behavior based on security checks. Fixes: 3eb39f47934f ("signal: add pidfd_send_signal() syscall") Signed-off-by: Jann Horn Signed-off-by: Christian Brauner --- kernel/signal.c | 13 ++++--------- 1 file changed, 4 insertions(+), 9 deletions(-) (limited to 'kernel') diff --git a/kernel/signal.c b/kernel/signal.c index b7953934aa99..f98448cf2def 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -3605,16 +3605,11 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig, if (unlikely(sig != kinfo.si_signo)) goto err; + /* Only allow sending arbitrary signals to yourself. */ + ret = -EPERM; if ((task_pid(current) != pid) && - (kinfo.si_code >= 0 || kinfo.si_code == SI_TKILL)) { - /* Only allow sending arbitrary signals to yourself. */ - ret = -EPERM; - if (kinfo.si_code != SI_USER) - goto err; - - /* Turn this into a regular kill signal. */ - prepare_kill_siginfo(sig, &kinfo); - } + (kinfo.si_code >= 0 || kinfo.si_code == SI_TKILL)) + goto err; } else { prepare_kill_siginfo(sig, &kinfo); } -- cgit v1.3-7-g2ca7 From 0e9f02450da07fc7b1346c8c32c771555173e397 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Tue, 19 Mar 2019 12:36:10 +0000 Subject: sched/fair: Do not re-read ->h_load_next during hierarchical load calculation A NULL pointer dereference bug was reported on a distribution kernel but the same issue should be present on mainline kernel. It occured on s390 but should not be arch-specific. A partial oops looks like: Unable to handle kernel pointer dereference in virtual kernel address space ... Call Trace: ... try_to_wake_up+0xfc/0x450 vhost_poll_wakeup+0x3a/0x50 [vhost] __wake_up_common+0xbc/0x178 __wake_up_common_lock+0x9e/0x160 __wake_up_sync_key+0x4e/0x60 sock_def_readable+0x5e/0x98 The bug hits any time between 1 hour to 3 days. The dereference occurs in update_cfs_rq_h_load when accumulating h_load. The problem is that cfq_rq->h_load_next is not protected by any locking and can be updated by parallel calls to task_h_load. Depending on the compiler, code may be generated that re-reads cfq_rq->h_load_next after the check for NULL and then oops when reading se->avg.load_avg. The dissassembly showed that it was possible to reread h_load_next after the check for NULL. While this does not appear to be an issue for later compilers, it's still an accident if the correct code is generated. Full locking in this path would have high overhead so this patch uses READ_ONCE to read h_load_next only once and check for NULL before dereferencing. It was confirmed that there were no further oops after 10 days of testing. As Peter pointed out, it is also necessary to use WRITE_ONCE() to avoid any potential problems with store tearing. Signed-off-by: Mel Gorman Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: Valentin Schneider Cc: Linus Torvalds Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Fixes: 685207963be9 ("sched: Move h_load calculation to task_h_load()") Link: https://lkml.kernel.org/r/20190319123610.nsivgf3mjbjjesxb@techsingularity.net Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index fdab7eb6f351..40bd1e27b1b7 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7784,10 +7784,10 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq) if (cfs_rq->last_h_load_update == now) return; - cfs_rq->h_load_next = NULL; + WRITE_ONCE(cfs_rq->h_load_next, NULL); for_each_sched_entity(se) { cfs_rq = cfs_rq_of(se); - cfs_rq->h_load_next = se; + WRITE_ONCE(cfs_rq->h_load_next, se); if (cfs_rq->last_h_load_update == now) break; } @@ -7797,7 +7797,7 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq) cfs_rq->last_h_load_update = now; } - while ((se = cfs_rq->h_load_next) != NULL) { + while ((se = READ_ONCE(cfs_rq->h_load_next)) != NULL) { load = cfs_rq->h_load; load = div64_ul(load * se->avg.load_avg, cfs_rq_load_avg(cfs_rq) + 1); -- cgit v1.3-7-g2ca7 From d08e411397cb6fcb3d3fb075c27a41975c99e88f Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (Red Hat)" Date: Mon, 7 Nov 2016 16:26:36 -0500 Subject: tracing/syscalls: Pass in hardcoded 6 into syscall_get_arguments() The only users that calls syscall_get_arguments() with a variable and not a hard coded '6' is ftrace_syscall_enter(). syscall_get_arguments() can be optimized by removing a variable input, and always grabbing 6 arguments regardless of what the system call actually uses. Change ftrace_syscall_enter() to pass the 6 args into a local stack array and copy the necessary arguments into the trace event as needed. This is needed to remove two parameters from syscall_get_arguments(). Link: http://lkml.kernel.org/r/20161107213233.627583542@goodmis.org Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_syscalls.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c index f93a56d2db27..e9f5bbbad6d9 100644 --- a/kernel/trace/trace_syscalls.c +++ b/kernel/trace/trace_syscalls.c @@ -314,6 +314,7 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id) struct ring_buffer_event *event; struct ring_buffer *buffer; unsigned long irq_flags; + unsigned long args[6]; int pc; int syscall_nr; int size; @@ -347,7 +348,8 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id) entry = ring_buffer_event_data(event); entry->nr = syscall_nr; - syscall_get_arguments(current, regs, 0, sys_data->nb_args, entry->args); + syscall_get_arguments(current, regs, 0, 6, args); + memcpy(entry->args, args, sizeof(unsigned long) * sys_data->nb_args); event_trigger_unlock_commit(trace_file, buffer, event, entry, irq_flags, pc); @@ -583,6 +585,7 @@ static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id) struct syscall_metadata *sys_data; struct syscall_trace_enter *rec; struct hlist_head *head; + unsigned long args[6]; bool valid_prog_array; int syscall_nr; int rctx; @@ -613,8 +616,8 @@ static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id) return; rec->nr = syscall_nr; - syscall_get_arguments(current, regs, 0, sys_data->nb_args, - (unsigned long *)&rec->args); + syscall_get_arguments(current, regs, 0, 6, args); + memcpy(&rec->args, args, sizeof(unsigned long) * sys_data->nb_args); if ((valid_prog_array && !perf_call_bpf_enter(sys_data->enter_event, regs, sys_data, rec)) || -- cgit v1.3-7-g2ca7 From e8458e7afa855317b14915d7b86ab3caceea7eb6 Mon Sep 17 00:00:00 2001 From: Kefeng Wang Date: Thu, 4 Apr 2019 15:45:12 +0800 Subject: genirq: Initialize request_mutex if CONFIG_SPARSE_IRQ=n When CONFIG_SPARSE_IRQ is disable, the request_mutex in struct irq_desc is not initialized which causes malfunction. Fixes: 9114014cf4e6 ("genirq: Add mutex to irq desc to serialize request/free_irq()") Signed-off-by: Kefeng Wang Signed-off-by: Thomas Gleixner Reviewed-by: Mukesh Ojha Cc: Marc Zyngier Cc: Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20190404074512.145533-1-wangkefeng.wang@huawei.com --- kernel/irq/irqdesc.c | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c index 13539e12cd80..9f8a709337cf 100644 --- a/kernel/irq/irqdesc.c +++ b/kernel/irq/irqdesc.c @@ -558,6 +558,7 @@ int __init early_irq_init(void) alloc_masks(&desc[i], node); raw_spin_lock_init(&desc[i].lock); lockdep_set_class(&desc[i].lock, &irq_desc_lock_class); + mutex_init(&desc[i].request_mutex); desc_set_defaults(i, &desc[i], node, NULL, NULL); } return arch_early_irq_init(); -- cgit v1.3-7-g2ca7 From b35f549df1d7520d37ba1e6d4a8d4df6bd52d136 Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (Red Hat)" Date: Mon, 7 Nov 2016 16:26:37 -0500 Subject: syscalls: Remove start and number from syscall_get_arguments() args At Linux Plumbers, Andy Lutomirski approached me and pointed out that the function call syscall_get_arguments() implemented in x86 was horribly written and not optimized for the standard case of passing in 0 and 6 for the starting index and the number of system calls to get. When looking at all the users of this function, I discovered that all instances pass in only 0 and 6 for these arguments. Instead of having this function handle different cases that are never used, simply rewrite it to return the first 6 arguments of a system call. This should help out the performance of tracing system calls by ptrace, ftrace and perf. Link: http://lkml.kernel.org/r/20161107213233.754809394@goodmis.org Cc: Oleg Nesterov Cc: Kees Cook Cc: Andy Lutomirski Cc: Dominik Brodowski Cc: Dave Martin Cc: "Dmitry V. Levin" Cc: x86@kernel.org Cc: linux-snps-arc@lists.infradead.org Cc: linux-kernel@vger.kernel.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-c6x-dev@linux-c6x.org Cc: uclinux-h8-devel@lists.sourceforge.jp Cc: linux-hexagon@vger.kernel.org Cc: linux-ia64@vger.kernel.org Cc: linux-mips@vger.kernel.org Cc: nios2-dev@lists.rocketboards.org Cc: openrisc@lists.librecores.org Cc: linux-parisc@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-riscv@lists.infradead.org Cc: linux-s390@vger.kernel.org Cc: linux-sh@vger.kernel.org Cc: sparclinux@vger.kernel.org Cc: linux-um@lists.infradead.org Cc: linux-xtensa@linux-xtensa.org Cc: linux-arch@vger.kernel.org Acked-by: Paul Burton # MIPS parts Acked-by: Max Filippov # For xtensa changes Acked-by: Will Deacon # For the arm64 bits Reviewed-by: Thomas Gleixner # for x86 Reviewed-by: Dmitry V. Levin Reported-by: Andy Lutomirski Signed-off-by: Steven Rostedt (VMware) --- arch/arc/include/asm/syscall.h | 7 ++-- arch/arm/include/asm/syscall.h | 23 ++--------- arch/arm64/include/asm/syscall.h | 22 ++--------- arch/c6x/include/asm/syscall.h | 41 ++++---------------- arch/csky/include/asm/syscall.h | 14 ++----- arch/h8300/include/asm/syscall.h | 34 ++++------------ arch/hexagon/include/asm/syscall.h | 4 +- arch/ia64/include/asm/syscall.h | 5 +-- arch/microblaze/include/asm/syscall.h | 4 +- arch/mips/include/asm/syscall.h | 3 +- arch/mips/kernel/ptrace.c | 2 +- arch/nds32/include/asm/syscall.h | 33 +++------------- arch/nios2/include/asm/syscall.h | 42 ++++---------------- arch/openrisc/include/asm/syscall.h | 6 +-- arch/parisc/include/asm/syscall.h | 30 ++++---------- arch/powerpc/include/asm/syscall.h | 8 ++-- arch/riscv/include/asm/syscall.h | 13 ++----- arch/s390/include/asm/syscall.h | 17 +++----- arch/sh/include/asm/syscall_32.h | 26 +++---------- arch/sh/include/asm/syscall_64.h | 4 +- arch/sparc/include/asm/syscall.h | 4 +- arch/um/include/asm/syscall-generic.h | 39 +++---------------- arch/x86/include/asm/syscall.h | 73 ++++++++--------------------------- arch/xtensa/include/asm/syscall.h | 16 ++------ include/asm-generic/syscall.h | 11 ++---- include/trace/events/syscalls.h | 2 +- kernel/seccomp.c | 2 +- kernel/trace/trace_syscalls.c | 4 +- lib/syscall.c | 2 +- 29 files changed, 113 insertions(+), 378 deletions(-) (limited to 'kernel') diff --git a/arch/arc/include/asm/syscall.h b/arch/arc/include/asm/syscall.h index 29de09804306..c7a4201ed62b 100644 --- a/arch/arc/include/asm/syscall.h +++ b/arch/arc/include/asm/syscall.h @@ -55,12 +55,11 @@ syscall_set_return_value(struct task_struct *task, struct pt_regs *regs, */ static inline void syscall_get_arguments(struct task_struct *task, struct pt_regs *regs, - unsigned int i, unsigned int n, unsigned long *args) + unsigned long *args) { unsigned long *inside_ptregs = &(regs->r0); - inside_ptregs -= i; - - BUG_ON((i + n) > 6); + unsigned int n = 6; + unsigned int i = 0; while (n--) { args[i++] = (*inside_ptregs); diff --git a/arch/arm/include/asm/syscall.h b/arch/arm/include/asm/syscall.h index 06dea6bce293..db969a2972ae 100644 --- a/arch/arm/include/asm/syscall.h +++ b/arch/arm/include/asm/syscall.h @@ -55,29 +55,12 @@ static inline void syscall_set_return_value(struct task_struct *task, static inline void syscall_get_arguments(struct task_struct *task, struct pt_regs *regs, - unsigned int i, unsigned int n, unsigned long *args) { - if (n == 0) - return; - - if (i + n > SYSCALL_MAX_ARGS) { - unsigned long *args_bad = args + SYSCALL_MAX_ARGS - i; - unsigned int n_bad = n + i - SYSCALL_MAX_ARGS; - pr_warn("%s called with max args %d, handling only %d\n", - __func__, i + n, SYSCALL_MAX_ARGS); - memset(args_bad, 0, n_bad * sizeof(args[0])); - n = SYSCALL_MAX_ARGS - i; - } - - if (i == 0) { - args[0] = regs->ARM_ORIG_r0; - args++; - i++; - n--; - } + args[0] = regs->ARM_ORIG_r0; + args++; - memcpy(args, ®s->ARM_r0 + i, n * sizeof(args[0])); + memcpy(args, ®s->ARM_r0 + 1, 5 * sizeof(args[0])); } static inline void syscall_set_arguments(struct task_struct *task, diff --git a/arch/arm64/include/asm/syscall.h b/arch/arm64/include/asm/syscall.h index ad8be16a39c9..55b2dab21023 100644 --- a/arch/arm64/include/asm/syscall.h +++ b/arch/arm64/include/asm/syscall.h @@ -65,28 +65,12 @@ static inline void syscall_set_return_value(struct task_struct *task, static inline void syscall_get_arguments(struct task_struct *task, struct pt_regs *regs, - unsigned int i, unsigned int n, unsigned long *args) { - if (n == 0) - return; - - if (i + n > SYSCALL_MAX_ARGS) { - unsigned long *args_bad = args + SYSCALL_MAX_ARGS - i; - unsigned int n_bad = n + i - SYSCALL_MAX_ARGS; - pr_warning("%s called with max args %d, handling only %d\n", - __func__, i + n, SYSCALL_MAX_ARGS); - memset(args_bad, 0, n_bad * sizeof(args[0])); - } - - if (i == 0) { - args[0] = regs->orig_x0; - args++; - i++; - n--; - } + args[0] = regs->orig_x0; + args++; - memcpy(args, ®s->regs[i], n * sizeof(args[0])); + memcpy(args, ®s->regs[1], 5 * sizeof(args[0])); } static inline void syscall_set_arguments(struct task_struct *task, diff --git a/arch/c6x/include/asm/syscall.h b/arch/c6x/include/asm/syscall.h index ae2be315ee9c..06db3251926b 100644 --- a/arch/c6x/include/asm/syscall.h +++ b/arch/c6x/include/asm/syscall.h @@ -46,40 +46,15 @@ static inline void syscall_set_return_value(struct task_struct *task, } static inline void syscall_get_arguments(struct task_struct *task, - struct pt_regs *regs, unsigned int i, - unsigned int n, unsigned long *args) + struct pt_regs *regs, + unsigned long *args) { - switch (i) { - case 0: - if (!n--) - break; - *args++ = regs->a4; - case 1: - if (!n--) - break; - *args++ = regs->b4; - case 2: - if (!n--) - break; - *args++ = regs->a6; - case 3: - if (!n--) - break; - *args++ = regs->b6; - case 4: - if (!n--) - break; - *args++ = regs->a8; - case 5: - if (!n--) - break; - *args++ = regs->b8; - case 6: - if (!n--) - break; - default: - BUG(); - } + *args++ = regs->a4; + *args++ = regs->b4; + *args++ = regs->a6; + *args++ = regs->b6; + *args++ = regs->a8; + *args = regs->b8; } static inline void syscall_set_arguments(struct task_struct *task, diff --git a/arch/csky/include/asm/syscall.h b/arch/csky/include/asm/syscall.h index 9a9cd81e66c1..c691fe92edc6 100644 --- a/arch/csky/include/asm/syscall.h +++ b/arch/csky/include/asm/syscall.h @@ -43,17 +43,11 @@ syscall_set_return_value(struct task_struct *task, struct pt_regs *regs, static inline void syscall_get_arguments(struct task_struct *task, struct pt_regs *regs, - unsigned int i, unsigned int n, unsigned long *args) + unsigned long *args) { - BUG_ON(i + n > 6); - if (i == 0) { - args[0] = regs->orig_a0; - args++; - n--; - } else { - i--; - } - memcpy(args, ®s->a1 + i, n * sizeof(args[0])); + args[0] = regs->orig_a0; + args++; + memcpy(args, ®s->a1, 5 * sizeof(args[0])); } static inline void diff --git a/arch/h8300/include/asm/syscall.h b/arch/h8300/include/asm/syscall.h index 924990401237..ddd483c6ca95 100644 --- a/arch/h8300/include/asm/syscall.h +++ b/arch/h8300/include/asm/syscall.h @@ -17,34 +17,14 @@ syscall_get_nr(struct task_struct *task, struct pt_regs *regs) static inline void syscall_get_arguments(struct task_struct *task, struct pt_regs *regs, - unsigned int i, unsigned int n, unsigned long *args) + unsigned long *args) { - BUG_ON(i + n > 6); - - while (n > 0) { - switch (i) { - case 0: - *args++ = regs->er1; - break; - case 1: - *args++ = regs->er2; - break; - case 2: - *args++ = regs->er3; - break; - case 3: - *args++ = regs->er4; - break; - case 4: - *args++ = regs->er5; - break; - case 5: - *args++ = regs->er6; - break; - } - i++; - n--; - } + *args++ = regs->er1; + *args++ = regs->er2; + *args++ = regs->er3; + *args++ = regs->er4; + *args++ = regs->er5; + *args = regs->er6; } diff --git a/arch/hexagon/include/asm/syscall.h b/arch/hexagon/include/asm/syscall.h index 4af9c7b6f13a..ae3a1e24fabd 100644 --- a/arch/hexagon/include/asm/syscall.h +++ b/arch/hexagon/include/asm/syscall.h @@ -37,10 +37,8 @@ static inline long syscall_get_nr(struct task_struct *task, static inline void syscall_get_arguments(struct task_struct *task, struct pt_regs *regs, - unsigned int i, unsigned int n, unsigned long *args) { - BUG_ON(i + n > 6); - memcpy(args, &(®s->r00)[i], n * sizeof(args[0])); + memcpy(args, &(®s->r00)[0], 6 * sizeof(args[0])); } #endif diff --git a/arch/ia64/include/asm/syscall.h b/arch/ia64/include/asm/syscall.h index 1d0b875fec44..8204c1ff70ce 100644 --- a/arch/ia64/include/asm/syscall.h +++ b/arch/ia64/include/asm/syscall.h @@ -63,12 +63,9 @@ extern void ia64_syscall_get_set_arguments(struct task_struct *task, unsigned long *args, int rw); static inline void syscall_get_arguments(struct task_struct *task, struct pt_regs *regs, - unsigned int i, unsigned int n, unsigned long *args) { - BUG_ON(i + n > 6); - - ia64_syscall_get_set_arguments(task, regs, i, n, args, 0); + ia64_syscall_get_set_arguments(task, regs, 0, 6, args, 0); } static inline void syscall_set_arguments(struct task_struct *task, diff --git a/arch/microblaze/include/asm/syscall.h b/arch/microblaze/include/asm/syscall.h index 220decd605a4..4b23e44e5c3c 100644 --- a/arch/microblaze/include/asm/syscall.h +++ b/arch/microblaze/include/asm/syscall.h @@ -82,9 +82,11 @@ static inline void microblaze_set_syscall_arg(struct pt_regs *regs, static inline void syscall_get_arguments(struct task_struct *task, struct pt_regs *regs, - unsigned int i, unsigned int n, unsigned long *args) { + unsigned int i = 0; + unsigned int n = 6; + while (n--) *args++ = microblaze_get_syscall_arg(regs, i++); } diff --git a/arch/mips/include/asm/syscall.h b/arch/mips/include/asm/syscall.h index 6cf8ffb5367e..a2b4748655df 100644 --- a/arch/mips/include/asm/syscall.h +++ b/arch/mips/include/asm/syscall.h @@ -116,9 +116,10 @@ static inline void syscall_set_return_value(struct task_struct *task, static inline void syscall_get_arguments(struct task_struct *task, struct pt_regs *regs, - unsigned int i, unsigned int n, unsigned long *args) { + unsigned int i = 0; + unsigned int n = 6; int ret; /* O32 ABI syscall() */ diff --git a/arch/mips/kernel/ptrace.c b/arch/mips/kernel/ptrace.c index 0057c910bc2f..3a62f80958e1 100644 --- a/arch/mips/kernel/ptrace.c +++ b/arch/mips/kernel/ptrace.c @@ -1419,7 +1419,7 @@ asmlinkage long syscall_trace_enter(struct pt_regs *regs, long syscall) sd.nr = syscall; sd.arch = syscall_get_arch(); - syscall_get_arguments(current, regs, 0, 6, args); + syscall_get_arguments(current, regs, args); for (i = 0; i < 6; i++) sd.args[i] = args[i]; sd.instruction_pointer = KSTK_EIP(current); diff --git a/arch/nds32/include/asm/syscall.h b/arch/nds32/include/asm/syscall.h index f7e5e86765fe..89a6ec8731d8 100644 --- a/arch/nds32/include/asm/syscall.h +++ b/arch/nds32/include/asm/syscall.h @@ -108,42 +108,21 @@ void syscall_set_return_value(struct task_struct *task, struct pt_regs *regs, * syscall_get_arguments - extract system call parameter values * @task: task of interest, must be blocked * @regs: task_pt_regs() of @task - * @i: argument index [0,5] - * @n: number of arguments; n+i must be [1,6]. * @args: array filled with argument values * - * Fetches @n arguments to the system call starting with the @i'th argument - * (from 0 through 5). Argument @i is stored in @args[0], and so on. - * An arch inline version is probably optimal when @i and @n are constants. + * Fetches 6 arguments to the system call (from 0 through 5). The first + * argument is stored in @args[0], and so on. * * It's only valid to call this when @task is stopped for tracing on * entry to a system call, due to %TIF_SYSCALL_TRACE or %TIF_SYSCALL_AUDIT. - * It's invalid to call this with @i + @n > 6; we only support system calls - * taking up to 6 arguments. */ #define SYSCALL_MAX_ARGS 6 void syscall_get_arguments(struct task_struct *task, struct pt_regs *regs, - unsigned int i, unsigned int n, unsigned long *args) + unsigned long *args) { - if (n == 0) - return; - if (i + n > SYSCALL_MAX_ARGS) { - unsigned long *args_bad = args + SYSCALL_MAX_ARGS - i; - unsigned int n_bad = n + i - SYSCALL_MAX_ARGS; - pr_warning("%s called with max args %d, handling only %d\n", - __func__, i + n, SYSCALL_MAX_ARGS); - memset(args_bad, 0, n_bad * sizeof(args[0])); - memset(args_bad, 0, n_bad * sizeof(args[0])); - } - - if (i == 0) { - args[0] = regs->orig_r0; - args++; - i++; - n--; - } - - memcpy(args, ®s->uregs[0] + i, n * sizeof(args[0])); + args[0] = regs->orig_r0; + args++; + memcpy(args, ®s->uregs[0] + 1, 5 * sizeof(args[0])); } /** diff --git a/arch/nios2/include/asm/syscall.h b/arch/nios2/include/asm/syscall.h index 9de220854c4a..792bd449d839 100644 --- a/arch/nios2/include/asm/syscall.h +++ b/arch/nios2/include/asm/syscall.h @@ -58,42 +58,14 @@ static inline void syscall_set_return_value(struct task_struct *task, } static inline void syscall_get_arguments(struct task_struct *task, - struct pt_regs *regs, unsigned int i, unsigned int n, - unsigned long *args) + struct pt_regs *regs, unsigned long *args) { - BUG_ON(i + n > 6); - - switch (i) { - case 0: - if (!n--) - break; - *args++ = regs->r4; - case 1: - if (!n--) - break; - *args++ = regs->r5; - case 2: - if (!n--) - break; - *args++ = regs->r6; - case 3: - if (!n--) - break; - *args++ = regs->r7; - case 4: - if (!n--) - break; - *args++ = regs->r8; - case 5: - if (!n--) - break; - *args++ = regs->r9; - case 6: - if (!n--) - break; - default: - BUG(); - } + *args++ = regs->r4; + *args++ = regs->r5; + *args++ = regs->r6; + *args++ = regs->r7; + *args++ = regs->r8; + *args = regs->r9; } static inline void syscall_set_arguments(struct task_struct *task, diff --git a/arch/openrisc/include/asm/syscall.h b/arch/openrisc/include/asm/syscall.h index 2db9f1cf0694..72607860cd55 100644 --- a/arch/openrisc/include/asm/syscall.h +++ b/arch/openrisc/include/asm/syscall.h @@ -56,11 +56,9 @@ syscall_set_return_value(struct task_struct *task, struct pt_regs *regs, static inline void syscall_get_arguments(struct task_struct *task, struct pt_regs *regs, - unsigned int i, unsigned int n, unsigned long *args) + unsigned long *args) { - BUG_ON(i + n > 6); - - memcpy(args, ®s->gpr[3 + i], n * sizeof(args[0])); + memcpy(args, ®s->gpr[3], 6 * sizeof(args[0])); } static inline void diff --git a/arch/parisc/include/asm/syscall.h b/arch/parisc/include/asm/syscall.h index 8bff1a58c97f..62a6d477fae0 100644 --- a/arch/parisc/include/asm/syscall.h +++ b/arch/parisc/include/asm/syscall.h @@ -18,29 +18,15 @@ static inline long syscall_get_nr(struct task_struct *tsk, } static inline void syscall_get_arguments(struct task_struct *tsk, - struct pt_regs *regs, unsigned int i, - unsigned int n, unsigned long *args) + struct pt_regs *regs, + unsigned long *args) { - BUG_ON(i); - - switch (n) { - case 6: - args[5] = regs->gr[21]; - case 5: - args[4] = regs->gr[22]; - case 4: - args[3] = regs->gr[23]; - case 3: - args[2] = regs->gr[24]; - case 2: - args[1] = regs->gr[25]; - case 1: - args[0] = regs->gr[26]; - case 0: - break; - default: - BUG(); - } + args[5] = regs->gr[21]; + args[4] = regs->gr[22]; + args[3] = regs->gr[23]; + args[2] = regs->gr[24]; + args[1] = regs->gr[25]; + args[0] = regs->gr[26]; } static inline long syscall_get_return_value(struct task_struct *task, diff --git a/arch/powerpc/include/asm/syscall.h b/arch/powerpc/include/asm/syscall.h index 1a0e7a8b1c81..5c9b9dc82b7e 100644 --- a/arch/powerpc/include/asm/syscall.h +++ b/arch/powerpc/include/asm/syscall.h @@ -65,22 +65,20 @@ static inline void syscall_set_return_value(struct task_struct *task, static inline void syscall_get_arguments(struct task_struct *task, struct pt_regs *regs, - unsigned int i, unsigned int n, unsigned long *args) { unsigned long val, mask = -1UL; - - BUG_ON(i + n > 6); + unsigned int n = 6; #ifdef CONFIG_COMPAT if (test_tsk_thread_flag(task, TIF_32BIT)) mask = 0xffffffff; #endif while (n--) { - if (n == 0 && i == 0) + if (n == 0) val = regs->orig_gpr3; else - val = regs->gpr[3 + i + n]; + val = regs->gpr[3 + n]; args[n] = val & mask; } diff --git a/arch/riscv/include/asm/syscall.h b/arch/riscv/include/asm/syscall.h index 6ea9e1804233..6adca1804be1 100644 --- a/arch/riscv/include/asm/syscall.h +++ b/arch/riscv/include/asm/syscall.h @@ -72,18 +72,11 @@ static inline void syscall_set_return_value(struct task_struct *task, static inline void syscall_get_arguments(struct task_struct *task, struct pt_regs *regs, - unsigned int i, unsigned int n, unsigned long *args) { - BUG_ON(i + n > 6); - if (i == 0) { - args[0] = regs->orig_a0; - args++; - n--; - } else { - i--; - } - memcpy(args, ®s->a1 + i, n * sizeof(args[0])); + args[0] = regs->orig_a0; + args++; + memcpy(args, ®s->a1, 5 * sizeof(args[0])); } static inline void syscall_set_arguments(struct task_struct *task, diff --git a/arch/s390/include/asm/syscall.h b/arch/s390/include/asm/syscall.h index 96f9a9151fde..ee0b1f6aa36d 100644 --- a/arch/s390/include/asm/syscall.h +++ b/arch/s390/include/asm/syscall.h @@ -56,27 +56,20 @@ static inline void syscall_set_return_value(struct task_struct *task, static inline void syscall_get_arguments(struct task_struct *task, struct pt_regs *regs, - unsigned int i, unsigned int n, unsigned long *args) { unsigned long mask = -1UL; + unsigned int n = 6; - /* - * No arguments for this syscall, there's nothing to do. - */ - if (!n) - return; - - BUG_ON(i + n > 6); #ifdef CONFIG_COMPAT if (test_tsk_thread_flag(task, TIF_31BIT)) mask = 0xffffffff; #endif while (n-- > 0) - if (i + n > 0) - args[n] = regs->gprs[2 + i + n] & mask; - if (i == 0) - args[0] = regs->orig_gpr2 & mask; + if (n > 0) + args[n] = regs->gprs[2 + n] & mask; + + args[0] = regs->orig_gpr2 & mask; } static inline void syscall_set_arguments(struct task_struct *task, diff --git a/arch/sh/include/asm/syscall_32.h b/arch/sh/include/asm/syscall_32.h index 6e118799831c..2bf1199a0595 100644 --- a/arch/sh/include/asm/syscall_32.h +++ b/arch/sh/include/asm/syscall_32.h @@ -48,30 +48,16 @@ static inline void syscall_set_return_value(struct task_struct *task, static inline void syscall_get_arguments(struct task_struct *task, struct pt_regs *regs, - unsigned int i, unsigned int n, unsigned long *args) { - /* - * Do this simply for now. If we need to start supporting - * fetching arguments from arbitrary indices, this will need some - * extra logic. Presently there are no in-tree users that depend - * on this behaviour. - */ - BUG_ON(i); /* Argument pattern is: R4, R5, R6, R7, R0, R1 */ - switch (n) { - case 6: args[5] = regs->regs[1]; - case 5: args[4] = regs->regs[0]; - case 4: args[3] = regs->regs[7]; - case 3: args[2] = regs->regs[6]; - case 2: args[1] = regs->regs[5]; - case 1: args[0] = regs->regs[4]; - case 0: - break; - default: - BUG(); - } + args[5] = regs->regs[1]; + args[4] = regs->regs[0]; + args[3] = regs->regs[7]; + args[2] = regs->regs[6]; + args[1] = regs->regs[5]; + args[0] = regs->regs[4]; } static inline void syscall_set_arguments(struct task_struct *task, diff --git a/arch/sh/include/asm/syscall_64.h b/arch/sh/include/asm/syscall_64.h index 43882580c7f9..4e8f6460c703 100644 --- a/arch/sh/include/asm/syscall_64.h +++ b/arch/sh/include/asm/syscall_64.h @@ -47,11 +47,9 @@ static inline void syscall_set_return_value(struct task_struct *task, static inline void syscall_get_arguments(struct task_struct *task, struct pt_regs *regs, - unsigned int i, unsigned int n, unsigned long *args) { - BUG_ON(i + n > 6); - memcpy(args, ®s->regs[2 + i], n * sizeof(args[0])); + memcpy(args, ®s->regs[2], 6 * sizeof(args[0])); } static inline void syscall_set_arguments(struct task_struct *task, diff --git a/arch/sparc/include/asm/syscall.h b/arch/sparc/include/asm/syscall.h index 053989e3f6a6..872dfee852d6 100644 --- a/arch/sparc/include/asm/syscall.h +++ b/arch/sparc/include/asm/syscall.h @@ -96,11 +96,11 @@ static inline void syscall_set_return_value(struct task_struct *task, static inline void syscall_get_arguments(struct task_struct *task, struct pt_regs *regs, - unsigned int i, unsigned int n, unsigned long *args) { int zero_extend = 0; unsigned int j; + unsigned int n = 6; #ifdef CONFIG_SPARC64 if (test_tsk_thread_flag(task, TIF_32BIT)) @@ -108,7 +108,7 @@ static inline void syscall_get_arguments(struct task_struct *task, #endif for (j = 0; j < n; j++) { - unsigned long val = regs->u_regs[UREG_I0 + i + j]; + unsigned long val = regs->u_regs[UREG_I0 + j]; if (zero_extend) args[j] = (u32) val; diff --git a/arch/um/include/asm/syscall-generic.h b/arch/um/include/asm/syscall-generic.h index 9fb9cf8cd39a..25d00acd1322 100644 --- a/arch/um/include/asm/syscall-generic.h +++ b/arch/um/include/asm/syscall-generic.h @@ -53,43 +53,16 @@ static inline void syscall_set_return_value(struct task_struct *task, static inline void syscall_get_arguments(struct task_struct *task, struct pt_regs *regs, - unsigned int i, unsigned int n, unsigned long *args) { const struct uml_pt_regs *r = ®s->regs; - switch (i) { - case 0: - if (!n--) - break; - *args++ = UPT_SYSCALL_ARG1(r); - case 1: - if (!n--) - break; - *args++ = UPT_SYSCALL_ARG2(r); - case 2: - if (!n--) - break; - *args++ = UPT_SYSCALL_ARG3(r); - case 3: - if (!n--) - break; - *args++ = UPT_SYSCALL_ARG4(r); - case 4: - if (!n--) - break; - *args++ = UPT_SYSCALL_ARG5(r); - case 5: - if (!n--) - break; - *args++ = UPT_SYSCALL_ARG6(r); - case 6: - if (!n--) - break; - default: - BUG(); - break; - } + *args++ = UPT_SYSCALL_ARG1(r); + *args++ = UPT_SYSCALL_ARG2(r); + *args++ = UPT_SYSCALL_ARG3(r); + *args++ = UPT_SYSCALL_ARG4(r); + *args++ = UPT_SYSCALL_ARG5(r); + *args = UPT_SYSCALL_ARG6(r); } static inline void syscall_set_arguments(struct task_struct *task, diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h index d653139857af..8dbb5c379450 100644 --- a/arch/x86/include/asm/syscall.h +++ b/arch/x86/include/asm/syscall.h @@ -91,11 +91,9 @@ static inline void syscall_set_return_value(struct task_struct *task, static inline void syscall_get_arguments(struct task_struct *task, struct pt_regs *regs, - unsigned int i, unsigned int n, unsigned long *args) { - BUG_ON(i + n > 6); - memcpy(args, ®s->bx + i, n * sizeof(args[0])); + memcpy(args, ®s->bx, 6 * sizeof(args[0])); } static inline void syscall_set_arguments(struct task_struct *task, @@ -116,63 +114,26 @@ static inline int syscall_get_arch(void) static inline void syscall_get_arguments(struct task_struct *task, struct pt_regs *regs, - unsigned int i, unsigned int n, unsigned long *args) { # ifdef CONFIG_IA32_EMULATION - if (task->thread_info.status & TS_COMPAT) - switch (i) { - case 0: - if (!n--) break; - *args++ = regs->bx; - case 1: - if (!n--) break; - *args++ = regs->cx; - case 2: - if (!n--) break; - *args++ = regs->dx; - case 3: - if (!n--) break; - *args++ = regs->si; - case 4: - if (!n--) break; - *args++ = regs->di; - case 5: - if (!n--) break; - *args++ = regs->bp; - case 6: - if (!n--) break; - default: - BUG(); - break; - } - else + if (task->thread_info.status & TS_COMPAT) { + *args++ = regs->bx; + *args++ = regs->cx; + *args++ = regs->dx; + *args++ = regs->si; + *args++ = regs->di; + *args = regs->bp; + } else # endif - switch (i) { - case 0: - if (!n--) break; - *args++ = regs->di; - case 1: - if (!n--) break; - *args++ = regs->si; - case 2: - if (!n--) break; - *args++ = regs->dx; - case 3: - if (!n--) break; - *args++ = regs->r10; - case 4: - if (!n--) break; - *args++ = regs->r8; - case 5: - if (!n--) break; - *args++ = regs->r9; - case 6: - if (!n--) break; - default: - BUG(); - break; - } + { + *args++ = regs->di; + *args++ = regs->si; + *args++ = regs->dx; + *args++ = regs->r10; + *args++ = regs->r8; + *args = regs->r9; + } } static inline void syscall_set_arguments(struct task_struct *task, diff --git a/arch/xtensa/include/asm/syscall.h b/arch/xtensa/include/asm/syscall.h index a168bf81c7f4..1504ce9a233a 100644 --- a/arch/xtensa/include/asm/syscall.h +++ b/arch/xtensa/include/asm/syscall.h @@ -59,23 +59,13 @@ static inline void syscall_set_return_value(struct task_struct *task, static inline void syscall_get_arguments(struct task_struct *task, struct pt_regs *regs, - unsigned int i, unsigned int n, unsigned long *args) { static const unsigned int reg[] = XTENSA_SYSCALL_ARGUMENT_REGS; - unsigned int j; - - if (n == 0) - return; - - WARN_ON_ONCE(i + n > SYSCALL_MAX_ARGS); + unsigned int i; - for (j = 0; j < n; ++j) { - if (i + j < SYSCALL_MAX_ARGS) - args[j] = regs->areg[reg[i + j]]; - else - args[j] = 0; - } + for (i = 0; i < 6; ++i) + args[i] = regs->areg[reg[i]]; } static inline void syscall_set_arguments(struct task_struct *task, diff --git a/include/asm-generic/syscall.h b/include/asm-generic/syscall.h index 0c938a4354f6..269e9412ef42 100644 --- a/include/asm-generic/syscall.h +++ b/include/asm-generic/syscall.h @@ -105,21 +105,16 @@ void syscall_set_return_value(struct task_struct *task, struct pt_regs *regs, * syscall_get_arguments - extract system call parameter values * @task: task of interest, must be blocked * @regs: task_pt_regs() of @task - * @i: argument index [0,5] - * @n: number of arguments; n+i must be [1,6]. * @args: array filled with argument values * - * Fetches @n arguments to the system call starting with the @i'th argument - * (from 0 through 5). Argument @i is stored in @args[0], and so on. - * An arch inline version is probably optimal when @i and @n are constants. + * Fetches 6 arguments to the system call. First argument is stored in +* @args[0], and so on. * * It's only valid to call this when @task is stopped for tracing on * entry to a system call, due to %TIF_SYSCALL_TRACE or %TIF_SYSCALL_AUDIT. - * It's invalid to call this with @i + @n > 6; we only support system calls - * taking up to 6 arguments. */ void syscall_get_arguments(struct task_struct *task, struct pt_regs *regs, - unsigned int i, unsigned int n, unsigned long *args); + unsigned long *args); /** * syscall_set_arguments - change system call parameter value diff --git a/include/trace/events/syscalls.h b/include/trace/events/syscalls.h index 44a3259ed4a5..b6e0cbc2c71f 100644 --- a/include/trace/events/syscalls.h +++ b/include/trace/events/syscalls.h @@ -28,7 +28,7 @@ TRACE_EVENT_FN(sys_enter, TP_fast_assign( __entry->id = id; - syscall_get_arguments(current, regs, 0, 6, __entry->args); + syscall_get_arguments(current, regs, __entry->args); ), TP_printk("NR %ld (%lx, %lx, %lx, %lx, %lx, %lx)", diff --git a/kernel/seccomp.c b/kernel/seccomp.c index 54a0347ca812..df27e499956a 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -149,7 +149,7 @@ static void populate_seccomp_data(struct seccomp_data *sd) sd->nr = syscall_get_nr(task, regs); sd->arch = syscall_get_arch(); - syscall_get_arguments(task, regs, 0, 6, args); + syscall_get_arguments(task, regs, args); sd->args[0] = args[0]; sd->args[1] = args[1]; sd->args[2] = args[2]; diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c index e9f5bbbad6d9..fa8fbff736d6 100644 --- a/kernel/trace/trace_syscalls.c +++ b/kernel/trace/trace_syscalls.c @@ -348,7 +348,7 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id) entry = ring_buffer_event_data(event); entry->nr = syscall_nr; - syscall_get_arguments(current, regs, 0, 6, args); + syscall_get_arguments(current, regs, args); memcpy(entry->args, args, sizeof(unsigned long) * sys_data->nb_args); event_trigger_unlock_commit(trace_file, buffer, event, entry, @@ -616,7 +616,7 @@ static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id) return; rec->nr = syscall_nr; - syscall_get_arguments(current, regs, 0, 6, args); + syscall_get_arguments(current, regs, args); memcpy(&rec->args, args, sizeof(unsigned long) * sys_data->nb_args); if ((valid_prog_array && diff --git a/lib/syscall.c b/lib/syscall.c index e8467e17b9a2..fb328e7ccb08 100644 --- a/lib/syscall.c +++ b/lib/syscall.c @@ -27,7 +27,7 @@ static int collect_syscall(struct task_struct *target, struct syscall_info *info info->data.nr = syscall_get_nr(target, regs); if (info->data.nr != -1L) - syscall_get_arguments(target, regs, 0, 6, + syscall_get_arguments(target, regs, (unsigned long *)&info->data.args[0]); put_task_stack(target); -- cgit v1.3-7-g2ca7 From 325aa19598e410672175ed50982f902d4e3f31c5 Mon Sep 17 00:00:00 2001 From: Stephen Boyd Date: Mon, 25 Mar 2019 11:10:26 -0700 Subject: genirq: Respect IRQCHIP_SKIP_SET_WAKE in irq_chip_set_wake_parent() If a child irqchip calls irq_chip_set_wake_parent() but its parent irqchip has the IRQCHIP_SKIP_SET_WAKE flag set an error is returned. This is inconsistent behaviour vs. set_irq_wake_real() which returns 0 when the irqchip has the IRQCHIP_SKIP_SET_WAKE flag set. It doesn't attempt to walk the chain of parents and set irq wake on any chips that don't have the flag set either. If the intent is to call the .irq_set_wake() callback of the parent irqchip, then we expect irqchip implementations to omit the IRQCHIP_SKIP_SET_WAKE flag and implement an .irq_set_wake() function that calls irq_chip_set_wake_parent(). The problem has been observed on a Qualcomm sdm845 device where set wake fails on any GPIO interrupts after applying work in progress wakeup irq patches to the GPIO driver. The chain of chips looks like this: QCOM GPIO -> QCOM PDC (SKIP) -> ARM GIC (SKIP) The GPIO controllers parent is the QCOM PDC irqchip which in turn has ARM GIC as parent. The QCOM PDC irqchip has the IRQCHIP_SKIP_SET_WAKE flag set, and so does the grandparent ARM GIC. The GPIO driver doesn't know if the parent needs to set wake or not, so it unconditionally calls irq_chip_set_wake_parent() causing this function to return a failure because the parent irqchip (PDC) doesn't have the .irq_set_wake() callback set. Returning 0 instead makes everything work and irqs from the GPIO controller can be configured for wakeup. Make it consistent by returning 0 (success) from irq_chip_set_wake_parent() when a parent chip has IRQCHIP_SKIP_SET_WAKE set. [ tglx: Massaged changelog ] Fixes: 08b55e2a9208e ("genirq: Add irqchip_set_wake_parent") Signed-off-by: Stephen Boyd Signed-off-by: Thomas Gleixner Acked-by: Marc Zyngier Cc: linux-arm-kernel@lists.infradead.org Cc: linux-gpio@vger.kernel.org Cc: Lina Iyer Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20190325181026.247796-1-swboyd@chromium.org --- kernel/irq/chip.c | 4 ++++ 1 file changed, 4 insertions(+) (limited to 'kernel') diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c index 3faef4a77f71..51128bea3846 100644 --- a/kernel/irq/chip.c +++ b/kernel/irq/chip.c @@ -1449,6 +1449,10 @@ int irq_chip_set_vcpu_affinity_parent(struct irq_data *data, void *vcpu_info) int irq_chip_set_wake_parent(struct irq_data *data, unsigned int on) { data = data->parent_data; + + if (data->chip->flags & IRQCHIP_SKIP_SET_WAKE) + return 0; + if (data->chip->irq_set_wake) return data->chip->irq_set_wake(data, on); -- cgit v1.3-7-g2ca7 From 9002b21465fa4d829edfc94a5a441005cffaa972 Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Fri, 5 Apr 2019 18:39:38 -0700 Subject: kernel/sysctl.c: fix out-of-bounds access when setting file-max Commit 32a5ad9c2285 ("sysctl: handle overflow for file-max") hooked up min/max values for the file-max sysctl parameter via the .extra1 and .extra2 fields in the corresponding struct ctl_table entry. Unfortunately, the minimum value points at the global 'zero' variable, which is an int. This results in a KASAN splat when accessed as a long by proc_doulongvec_minmax on 64-bit architectures: | BUG: KASAN: global-out-of-bounds in __do_proc_doulongvec_minmax+0x5d8/0x6a0 | Read of size 8 at addr ffff2000133d1c20 by task systemd/1 | | CPU: 0 PID: 1 Comm: systemd Not tainted 5.1.0-rc3-00012-g40b114779944 #2 | Hardware name: linux,dummy-virt (DT) | Call trace: | dump_backtrace+0x0/0x228 | show_stack+0x14/0x20 | dump_stack+0xe8/0x124 | print_address_description+0x60/0x258 | kasan_report+0x140/0x1a0 | __asan_report_load8_noabort+0x18/0x20 | __do_proc_doulongvec_minmax+0x5d8/0x6a0 | proc_doulongvec_minmax+0x4c/0x78 | proc_sys_call_handler.isra.19+0x144/0x1d8 | proc_sys_write+0x34/0x58 | __vfs_write+0x54/0xe8 | vfs_write+0x124/0x3c0 | ksys_write+0xbc/0x168 | __arm64_sys_write+0x68/0x98 | el0_svc_common+0x100/0x258 | el0_svc_handler+0x48/0xc0 | el0_svc+0x8/0xc | | The buggy address belongs to the variable: | zero+0x0/0x40 | | Memory state around the buggy address: | ffff2000133d1b00: 00 00 00 00 00 00 00 00 fa fa fa fa 04 fa fa fa | ffff2000133d1b80: fa fa fa fa 04 fa fa fa fa fa fa fa 04 fa fa fa | >ffff2000133d1c00: fa fa fa fa 04 fa fa fa fa fa fa fa 00 00 00 00 | ^ | ffff2000133d1c80: fa fa fa fa 00 fa fa fa fa fa fa fa 00 00 00 00 | ffff2000133d1d00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Fix the splat by introducing a unsigned long 'zero_ul' and using that instead. Link: http://lkml.kernel.org/r/20190403153409.17307-1-will.deacon@arm.com Fixes: 32a5ad9c2285 ("sysctl: handle overflow for file-max") Signed-off-by: Will Deacon Acked-by: Christian Brauner Cc: Kees Cook Cc: Alexey Dobriyan Cc: Matteo Croce Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/sysctl.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/sysctl.c b/kernel/sysctl.c index e5da394d1ca3..c9ec050bcf46 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -128,6 +128,7 @@ static int zero; static int __maybe_unused one = 1; static int __maybe_unused two = 2; static int __maybe_unused four = 4; +static unsigned long zero_ul; static unsigned long one_ul = 1; static unsigned long long_max = LONG_MAX; static int one_hundred = 100; @@ -1750,7 +1751,7 @@ static struct ctl_table fs_table[] = { .maxlen = sizeof(files_stat.max_files), .mode = 0644, .proc_handler = proc_doulongvec_minmax, - .extra1 = &zero, + .extra1 = &zero_ul, .extra2 = &long_max, }, { -- cgit v1.3-7-g2ca7 From 90c1cba2b3b3851c151229f61801919b2904d437 Mon Sep 17 00:00:00 2001 From: Bart Van Assche Date: Wed, 3 Apr 2019 16:35:52 -0700 Subject: locking/lockdep: Zap lock classes even with lock debugging disabled The following commit: a0b0fd53e1e6 ("locking/lockdep: Free lock classes that are no longer in use") changed the behavior of lockdep_free_key_range() from unconditionally zapping lock classes into only zapping lock classes if debug_lock == true. Not zapping lock classes if debug_lock == false leaves dangling pointers in several lockdep datastructures, e.g. lock_class::name in the all_lock_classes list. The shell command "cat /proc/lockdep" causes the kernel to iterate the all_lock_classes list. Hence the "unable to handle kernel paging request" cash that Shenghui encountered by running cat /proc/lockdep. Since the new behavior can cause cat /proc/lockdep to crash, restore the pre-v5.1 behavior. This patch avoids that cat /proc/lockdep triggers the following crash with debug_lock == false: BUG: unable to handle kernel paging request at fffffbfff40ca448 RIP: 0010:__asan_load1+0x28/0x50 Call Trace: string+0xac/0x180 vsnprintf+0x23e/0x820 seq_vprintf+0x82/0xc0 seq_printf+0x92/0xb0 print_name+0x34/0xb0 l_show+0x184/0x200 seq_read+0x59e/0x6c0 proc_reg_read+0x11f/0x170 __vfs_read+0x4d/0x90 vfs_read+0xc5/0x1f0 ksys_read+0xab/0x130 __x64_sys_read+0x43/0x50 do_syscall_64+0x71/0x210 entry_SYSCALL_64_after_hwframe+0x49/0xbe Reported-by: shenghui Signed-off-by: Bart Van Assche Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Waiman Long Cc: Will Deacon Fixes: a0b0fd53e1e6 ("locking/lockdep: Free lock classes that are no longer in use") # v5.1-rc1. Link: https://lkml.kernel.org/r/20190403233552.124673-1-bvanassche@acm.org Signed-off-by: Ingo Molnar --- kernel/locking/lockdep.c | 29 ++++++++++++----------------- 1 file changed, 12 insertions(+), 17 deletions(-) (limited to 'kernel') diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 34cdcbedda49..e16766ff184b 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -4689,8 +4689,8 @@ static void free_zapped_rcu(struct rcu_head *ch) return; raw_local_irq_save(flags); - if (!graph_lock()) - goto out_irq; + arch_spin_lock(&lockdep_lock); + current->lockdep_recursion = 1; /* closed head */ pf = delayed_free.pf + (delayed_free.index ^ 1); @@ -4702,8 +4702,8 @@ static void free_zapped_rcu(struct rcu_head *ch) */ call_rcu_zapped(delayed_free.pf + delayed_free.index); - graph_unlock(); -out_irq: + current->lockdep_recursion = 0; + arch_spin_unlock(&lockdep_lock); raw_local_irq_restore(flags); } @@ -4744,21 +4744,17 @@ static void lockdep_free_key_range_reg(void *start, unsigned long size) { struct pending_free *pf; unsigned long flags; - int locked; init_data_structures_once(); raw_local_irq_save(flags); - locked = graph_lock(); - if (!locked) - goto out_irq; - + arch_spin_lock(&lockdep_lock); + current->lockdep_recursion = 1; pf = get_pending_free(); __lockdep_free_key_range(pf, start, size); call_rcu_zapped(pf); - - graph_unlock(); -out_irq: + current->lockdep_recursion = 0; + arch_spin_unlock(&lockdep_lock); raw_local_irq_restore(flags); /* @@ -4911,9 +4907,8 @@ void lockdep_unregister_key(struct lock_class_key *key) return; raw_local_irq_save(flags); - if (!graph_lock()) - goto out_irq; - + arch_spin_lock(&lockdep_lock); + current->lockdep_recursion = 1; pf = get_pending_free(); hlist_for_each_entry_rcu(k, hash_head, hash_entry) { if (k == key) { @@ -4925,8 +4920,8 @@ void lockdep_unregister_key(struct lock_class_key *key) WARN_ON_ONCE(!found); __lockdep_free_key_range(pf, key, 1); call_rcu_zapped(pf); - graph_unlock(); -out_irq: + current->lockdep_recursion = 0; + arch_spin_unlock(&lockdep_lock); raw_local_irq_restore(flags); /* Wait until is_dynamic_key() has finished accessing k->hash_entry. */ -- cgit v1.3-7-g2ca7 From 07d7e12091f4ab869cc6a4bb276399057e73b0b3 Mon Sep 17 00:00:00 2001 From: Andrei Vagin Date: Sun, 7 Apr 2019 21:15:42 -0700 Subject: alarmtimer: Return correct remaining time To calculate a remaining time, it's required to subtract the current time from the expiration time. In alarm_timer_remaining() the arguments of ktime_sub are swapped. Fixes: d653d8457c76 ("alarmtimer: Implement remaining callback") Signed-off-by: Andrei Vagin Signed-off-by: Thomas Gleixner Reviewed-by: Mukesh Ojha Cc: Stephen Boyd Cc: John Stultz Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20190408041542.26338-1-avagin@gmail.com --- kernel/time/alarmtimer.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/time/alarmtimer.c b/kernel/time/alarmtimer.c index 2c97e8c2d29f..0519a8805aab 100644 --- a/kernel/time/alarmtimer.c +++ b/kernel/time/alarmtimer.c @@ -594,7 +594,7 @@ static ktime_t alarm_timer_remaining(struct k_itimer *timr, ktime_t now) { struct alarm *alarm = &timr->it.alarm.alarmtimer; - return ktime_sub(now, alarm->node.expires); + return ktime_sub(alarm->node.expires, now); } /** -- cgit v1.3-7-g2ca7 From 8c5165430c0194df92369162d1c7f53f8672baa5 Mon Sep 17 00:00:00 2001 From: Scott Wood Date: Wed, 10 Apr 2019 16:59:25 -0500 Subject: dma-debug: only skip one stackframe entry With skip set to 1, I get a traceback like this: [ 106.867637] DMA-API: Mapped at: [ 106.870784] afu_dma_map_region+0x2cd/0x4f0 [dfl_afu] [ 106.875839] afu_ioctl+0x258/0x380 [dfl_afu] [ 106.880108] do_vfs_ioctl+0xa9/0x720 [ 106.883688] ksys_ioctl+0x60/0x90 [ 106.887007] __x64_sys_ioctl+0x16/0x20 With the previous value of 2, afu_dma_map_region was being omitted. I suspect that the code paths have simply changed since the value of 2 was chosen a decade ago, but it's also possible that it varies based on which mapping function was used, compiler inlining choices, etc. In any case, it's best to err on the side of skipping less. Signed-off-by: Scott Wood Signed-off-by: Christoph Hellwig --- kernel/dma/debug.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/dma/debug.c b/kernel/dma/debug.c index 45d51e8e26f6..a218e43cc382 100644 --- a/kernel/dma/debug.c +++ b/kernel/dma/debug.c @@ -706,7 +706,7 @@ static struct dma_debug_entry *dma_entry_alloc(void) #ifdef CONFIG_STACKTRACE entry->stacktrace.max_entries = DMA_DEBUG_STACKTRACE_ENTRIES; entry->stacktrace.entries = entry->st_entries; - entry->stacktrace.skip = 2; + entry->stacktrace.skip = 1; save_stack_trace(&entry->stacktrace); #endif -- cgit v1.3-7-g2ca7 From 1d54ad944074010609562da5c89e4f5df2f4e5db Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Thu, 4 Apr 2019 15:03:00 +0200 Subject: perf/core: Fix perf_event_disable_inatomic() race Thomas-Mich Richter reported he triggered a WARN()ing from event_function_local() on his s390. The problem boils down to: CPU-A CPU-B perf_event_overflow() perf_event_disable_inatomic() @pending_disable = 1 irq_work_queue(); sched-out event_sched_out() @pending_disable = 0 sched-in perf_event_overflow() perf_event_disable_inatomic() @pending_disable = 1; irq_work_queue(); // FAILS irq_work_run() perf_pending_event() if (@pending_disable) perf_event_disable_local(); // WHOOPS The problem exists in generic, but s390 is particularly sensitive because it doesn't implement arch_irq_work_raise(), nor does it call irq_work_run() from it's PMU interrupt handler (nor would that be sufficient in this case, because s390 also generates perf_event_overflow() from pmu::stop). Add to that the fact that s390 is a virtual architecture and (virtual) CPU-A can stall long enough for the above race to happen, even if it would self-IPI. Adding a irq_work_sync() to event_sched_in() would work for all hardare PMUs that properly use irq_work_run() but fails for software PMUs. Instead encode the CPU number in @pending_disable, such that we can tell which CPU requested the disable. This then allows us to detect the above scenario and even redirect the IPI to make up for the failed queue. Reported-by: Thomas-Mich Richter Tested-by: Thomas Richter Signed-off-by: Peter Zijlstra (Intel) Acked-by: Mark Rutland Cc: Alexander Shishkin Cc: Arnaldo Carvalho de Melo Cc: Heiko Carstens Cc: Hendrik Brueckner Cc: Jiri Olsa Cc: Kees Cook Cc: Linus Torvalds Cc: Martin Schwidefsky Cc: Peter Zijlstra Cc: Thomas Gleixner Signed-off-by: Ingo Molnar --- kernel/events/core.c | 52 +++++++++++++++++++++++++++++++++++++-------- kernel/events/ring_buffer.c | 4 ++-- 2 files changed, 45 insertions(+), 11 deletions(-) (limited to 'kernel') diff --git a/kernel/events/core.c b/kernel/events/core.c index 72d06e302e99..534e01e7bc36 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -2009,8 +2009,8 @@ event_sched_out(struct perf_event *event, event->pmu->del(event, 0); event->oncpu = -1; - if (event->pending_disable) { - event->pending_disable = 0; + if (READ_ONCE(event->pending_disable) >= 0) { + WRITE_ONCE(event->pending_disable, -1); state = PERF_EVENT_STATE_OFF; } perf_event_set_state(event, state); @@ -2198,7 +2198,8 @@ EXPORT_SYMBOL_GPL(perf_event_disable); void perf_event_disable_inatomic(struct perf_event *event) { - event->pending_disable = 1; + WRITE_ONCE(event->pending_disable, smp_processor_id()); + /* can fail, see perf_pending_event_disable() */ irq_work_queue(&event->pending); } @@ -5810,10 +5811,45 @@ void perf_event_wakeup(struct perf_event *event) } } +static void perf_pending_event_disable(struct perf_event *event) +{ + int cpu = READ_ONCE(event->pending_disable); + + if (cpu < 0) + return; + + if (cpu == smp_processor_id()) { + WRITE_ONCE(event->pending_disable, -1); + perf_event_disable_local(event); + return; + } + + /* + * CPU-A CPU-B + * + * perf_event_disable_inatomic() + * @pending_disable = CPU-A; + * irq_work_queue(); + * + * sched-out + * @pending_disable = -1; + * + * sched-in + * perf_event_disable_inatomic() + * @pending_disable = CPU-B; + * irq_work_queue(); // FAILS + * + * irq_work_run() + * perf_pending_event() + * + * But the event runs on CPU-B and wants disabling there. + */ + irq_work_queue_on(&event->pending, cpu); +} + static void perf_pending_event(struct irq_work *entry) { - struct perf_event *event = container_of(entry, - struct perf_event, pending); + struct perf_event *event = container_of(entry, struct perf_event, pending); int rctx; rctx = perf_swevent_get_recursion_context(); @@ -5822,10 +5858,7 @@ static void perf_pending_event(struct irq_work *entry) * and we won't recurse 'further'. */ - if (event->pending_disable) { - event->pending_disable = 0; - perf_event_disable_local(event); - } + perf_pending_event_disable(event); if (event->pending_wakeup) { event->pending_wakeup = 0; @@ -10236,6 +10269,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu, init_waitqueue_head(&event->waitq); + event->pending_disable = -1; init_irq_work(&event->pending, perf_pending_event); mutex_init(&event->mmap_mutex); diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c index a4047321d7d8..2545ac08cc77 100644 --- a/kernel/events/ring_buffer.c +++ b/kernel/events/ring_buffer.c @@ -392,7 +392,7 @@ void *perf_aux_output_begin(struct perf_output_handle *handle, * store that will be enabled on successful return */ if (!handle->size) { /* A, matches D */ - event->pending_disable = 1; + event->pending_disable = smp_processor_id(); perf_output_wakeup(handle); local_set(&rb->aux_nest, 0); goto err_put; @@ -480,7 +480,7 @@ void perf_aux_output_end(struct perf_output_handle *handle, unsigned long size) if (wakeup) { if (handle->aux_flags & PERF_AUX_FLAG_TRUNCATED) - handle->event->pending_disable = 1; + handle->event->pending_disable = smp_processor_id(); perf_output_wakeup(handle); } -- cgit v1.3-7-g2ca7 From 15fab63e1e57be9fdb5eec1bbc5916e9825e9acb Mon Sep 17 00:00:00 2001 From: Matthew Wilcox Date: Fri, 5 Apr 2019 14:02:10 -0700 Subject: fs: prevent page refcount overflow in pipe_buf_get Change pipe_buf_get() to return a bool indicating whether it succeeded in raising the refcount of the page (if the thing in the pipe is a page). This removes another mechanism for overflowing the page refcount. All callers converted to handle a failure. Reported-by: Jann Horn Signed-off-by: Matthew Wilcox Cc: stable@kernel.org Signed-off-by: Linus Torvalds --- fs/fuse/dev.c | 12 ++++++------ fs/pipe.c | 4 ++-- fs/splice.c | 12 ++++++++++-- include/linux/pipe_fs_i.h | 10 ++++++---- kernel/trace/trace.c | 6 +++++- 5 files changed, 29 insertions(+), 15 deletions(-) (limited to 'kernel') diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c index 809c0f2f9942..64f4de983468 100644 --- a/fs/fuse/dev.c +++ b/fs/fuse/dev.c @@ -2034,10 +2034,8 @@ static ssize_t fuse_dev_splice_write(struct pipe_inode_info *pipe, rem += pipe->bufs[(pipe->curbuf + idx) & (pipe->buffers - 1)].len; ret = -EINVAL; - if (rem < len) { - pipe_unlock(pipe); - goto out; - } + if (rem < len) + goto out_free; rem = len; while (rem) { @@ -2055,7 +2053,9 @@ static ssize_t fuse_dev_splice_write(struct pipe_inode_info *pipe, pipe->curbuf = (pipe->curbuf + 1) & (pipe->buffers - 1); pipe->nrbufs--; } else { - pipe_buf_get(pipe, ibuf); + if (!pipe_buf_get(pipe, ibuf)) + goto out_free; + *obuf = *ibuf; obuf->flags &= ~PIPE_BUF_FLAG_GIFT; obuf->len = rem; @@ -2078,11 +2078,11 @@ static ssize_t fuse_dev_splice_write(struct pipe_inode_info *pipe, ret = fuse_dev_do_write(fud, &cs, len); pipe_lock(pipe); +out_free: for (idx = 0; idx < nbuf; idx++) pipe_buf_release(pipe, &bufs[idx]); pipe_unlock(pipe); -out: kvfree(bufs); return ret; } diff --git a/fs/pipe.c b/fs/pipe.c index bdc5d3c0977d..b1543b85c14a 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -189,9 +189,9 @@ EXPORT_SYMBOL(generic_pipe_buf_steal); * in the tee() system call, when we duplicate the buffers in one * pipe into another. */ -void generic_pipe_buf_get(struct pipe_inode_info *pipe, struct pipe_buffer *buf) +bool generic_pipe_buf_get(struct pipe_inode_info *pipe, struct pipe_buffer *buf) { - get_page(buf->page); + return try_get_page(buf->page); } EXPORT_SYMBOL(generic_pipe_buf_get); diff --git a/fs/splice.c b/fs/splice.c index de2ede048473..f30af82b850d 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -1588,7 +1588,11 @@ retry: * Get a reference to this pipe buffer, * so we can copy the contents over. */ - pipe_buf_get(ipipe, ibuf); + if (!pipe_buf_get(ipipe, ibuf)) { + if (ret == 0) + ret = -EFAULT; + break; + } *obuf = *ibuf; /* @@ -1660,7 +1664,11 @@ static int link_pipe(struct pipe_inode_info *ipipe, * Get a reference to this pipe buffer, * so we can copy the contents over. */ - pipe_buf_get(ipipe, ibuf); + if (!pipe_buf_get(ipipe, ibuf)) { + if (ret == 0) + ret = -EFAULT; + break; + } obuf = opipe->bufs + nbuf; *obuf = *ibuf; diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h index 5a3bb3b7c9ad..3f2a42c11e20 100644 --- a/include/linux/pipe_fs_i.h +++ b/include/linux/pipe_fs_i.h @@ -108,18 +108,20 @@ struct pipe_buf_operations { /* * Get a reference to the pipe buffer. */ - void (*get)(struct pipe_inode_info *, struct pipe_buffer *); + bool (*get)(struct pipe_inode_info *, struct pipe_buffer *); }; /** * pipe_buf_get - get a reference to a pipe_buffer * @pipe: the pipe that the buffer belongs to * @buf: the buffer to get a reference to + * + * Return: %true if the reference was successfully obtained. */ -static inline void pipe_buf_get(struct pipe_inode_info *pipe, +static inline __must_check bool pipe_buf_get(struct pipe_inode_info *pipe, struct pipe_buffer *buf) { - buf->ops->get(pipe, buf); + return buf->ops->get(pipe, buf); } /** @@ -178,7 +180,7 @@ struct pipe_inode_info *alloc_pipe_info(void); void free_pipe_info(struct pipe_inode_info *); /* Generic pipe buffer ops functions */ -void generic_pipe_buf_get(struct pipe_inode_info *, struct pipe_buffer *); +bool generic_pipe_buf_get(struct pipe_inode_info *, struct pipe_buffer *); int generic_pipe_buf_confirm(struct pipe_inode_info *, struct pipe_buffer *); int generic_pipe_buf_steal(struct pipe_inode_info *, struct pipe_buffer *); void generic_pipe_buf_release(struct pipe_inode_info *, struct pipe_buffer *); diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index c4238b441624..0f300d488c9f 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -6835,12 +6835,16 @@ static void buffer_pipe_buf_release(struct pipe_inode_info *pipe, buf->private = 0; } -static void buffer_pipe_buf_get(struct pipe_inode_info *pipe, +static bool buffer_pipe_buf_get(struct pipe_inode_info *pipe, struct pipe_buffer *buf) { struct buffer_ref *ref = (struct buffer_ref *)buf->private; + if (ref->ref > INT_MAX/2) + return false; + ref->ref++; + return true; } /* Pipe buffer operations for a buffer. */ -- cgit v1.3-7-g2ca7 From 8b39adbee805c539a461dbf208b125b096152b1c Mon Sep 17 00:00:00 2001 From: Bart Van Assche Date: Mon, 15 Apr 2019 10:05:38 -0700 Subject: locking/lockdep: Make lockdep_unregister_key() honor 'debug_locks' again If lockdep_register_key() and lockdep_unregister_key() are called with debug_locks == false then the following warning is reported: WARNING: CPU: 2 PID: 15145 at kernel/locking/lockdep.c:4920 lockdep_unregister_key+0x1ad/0x240 That warning is reported because lockdep_unregister_key() ignores the value of 'debug_locks' and because the behavior of lockdep_register_key() depends on whether or not 'debug_locks' is set. Fix this inconsistency by making lockdep_unregister_key() take 'debug_locks' again into account. Signed-off-by: Bart Van Assche Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Waiman Long Cc: Will Deacon Cc: shenghui Fixes: 90c1cba2b3b3 ("locking/lockdep: Zap lock classes even with lock debugging disabled") Link: http://lkml.kernel.org/r/20190415170538.23491-1-bvanassche@acm.org Signed-off-by: Ingo Molnar --- kernel/locking/lockdep.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index e16766ff184b..e221be724fe8 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -4907,8 +4907,9 @@ void lockdep_unregister_key(struct lock_class_key *key) return; raw_local_irq_save(flags); - arch_spin_lock(&lockdep_lock); - current->lockdep_recursion = 1; + if (!graph_lock()) + goto out_irq; + pf = get_pending_free(); hlist_for_each_entry_rcu(k, hash_head, hash_entry) { if (k == key) { @@ -4920,8 +4921,8 @@ void lockdep_unregister_key(struct lock_class_key *key) WARN_ON_ONCE(!found); __lockdep_free_key_range(pf, key, 1); call_rcu_zapped(pf); - current->lockdep_recursion = 0; - arch_spin_unlock(&lockdep_lock); + graph_unlock(); +out_irq: raw_local_irq_restore(flags); /* Wait until is_dynamic_key() has finished accessing k->hash_entry. */ -- cgit v1.3-7-g2ca7 From 5f843ed415581cfad4ef8fefe31c138a8346ca8a Mon Sep 17 00:00:00 2001 From: Masami Hiramatsu Date: Mon, 15 Apr 2019 15:01:25 +0900 Subject: kprobes: Fix error check when reusing optimized probes The following commit introduced a bug in one of our error paths: 819319fc9346 ("kprobes: Return error if we fail to reuse kprobe instead of BUG_ON()") it missed to handle the return value of kprobe_optready() as error-value. In reality, the kprobe_optready() returns a bool result, so "true" case must be passed instead of 0. This causes some errors on kprobe boot-time selftests on ARM: [ ] Beginning kprobe tests... [ ] Probe ARM code [ ] kprobe [ ] kretprobe [ ] ARM instruction simulation [ ] Check decoding tables [ ] Run test cases [ ] FAIL: test_case_handler not run [ ] FAIL: Test andge r10, r11, r14, asr r7 [ ] FAIL: Scenario 11 ... [ ] FAIL: Scenario 7 [ ] Total instruction simulation tests=1631, pass=1433 fail=198 [ ] kprobe tests failed This can happen if an optimized probe is unregistered and next kprobe is registered on same address until the previous probe is not reclaimed. If this happens, a hidden aggregated probe may be kept in memory, and no new kprobe can probe same address. Also, in that case register_kprobe() will return "1" instead of minus error value, which can mislead caller logic. Signed-off-by: Masami Hiramatsu Cc: Anil S Keshavamurthy Cc: David S . Miller Cc: Linus Torvalds Cc: Naveen N . Rao Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: stable@vger.kernel.org # v5.0+ Fixes: 819319fc9346 ("kprobes: Return error if we fail to reuse kprobe instead of BUG_ON()") Link: http://lkml.kernel.org/r/155530808559.32517.539898325433642204.stgit@devnote2 Signed-off-by: Ingo Molnar --- kernel/kprobes.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/kprobes.c b/kernel/kprobes.c index c83e54727131..b1ea30a5540e 100644 --- a/kernel/kprobes.c +++ b/kernel/kprobes.c @@ -709,7 +709,6 @@ static void unoptimize_kprobe(struct kprobe *p, bool force) static int reuse_unused_kprobe(struct kprobe *ap) { struct optimized_kprobe *op; - int ret; /* * Unused kprobe MUST be on the way of delayed unoptimizing (means @@ -720,9 +719,8 @@ static int reuse_unused_kprobe(struct kprobe *ap) /* Enable the probe again */ ap->flags &= ~KPROBE_FLAG_DISABLED; /* Optimize it again (remove from op->list) */ - ret = kprobe_optready(ap); - if (ret) - return ret; + if (!kprobe_optready(ap)) + return -EINVAL; optimize_kprobe(ap); return 0; -- cgit v1.3-7-g2ca7 From 52a44f83fc2d64a5e74d5d685fad2fecc7b7a321 Mon Sep 17 00:00:00 2001 From: Alexander Shishkin Date: Fri, 29 Mar 2019 11:12:12 +0200 Subject: perf/core: Fix the address filtering fix The following recent commit: c60f83b813e5 ("perf, pt, coresight: Fix address filters for vmas with non-zero offset") changes the address filtering logic to communicate filter ranges to the PMU driver via a single address range object, instead of having the driver do the final bit of math. That change forgets to take into account kernel filters, which are not calculated the same way as DSO based filters. Fix that by passing the kernel filters the same way as file-based filters. This doesn't require any additional changes in the drivers. Reported-by: Adrian Hunter Signed-off-by: Alexander Shishkin Signed-off-by: Peter Zijlstra (Intel) Cc: Arnaldo Carvalho de Melo Cc: Jiri Olsa Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Stephane Eranian Cc: Thomas Gleixner Cc: Vince Weaver Fixes: c60f83b813e5 ("perf, pt, coresight: Fix address filters for vmas with non-zero offset") Link: https://lkml.kernel.org/r/20190329091212.29870-1-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar --- kernel/events/core.c | 37 +++++++++++++++++++++---------------- 1 file changed, 21 insertions(+), 16 deletions(-) (limited to 'kernel') diff --git a/kernel/events/core.c b/kernel/events/core.c index 534e01e7bc36..dc7dead2d2cc 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -9077,26 +9077,29 @@ static void perf_event_addr_filters_apply(struct perf_event *event) if (task == TASK_TOMBSTONE) return; - if (!ifh->nr_file_filters) - return; - - mm = get_task_mm(event->ctx->task); - if (!mm) - goto restart; + if (ifh->nr_file_filters) { + mm = get_task_mm(event->ctx->task); + if (!mm) + goto restart; - down_read(&mm->mmap_sem); + down_read(&mm->mmap_sem); + } raw_spin_lock_irqsave(&ifh->lock, flags); list_for_each_entry(filter, &ifh->list, entry) { - event->addr_filter_ranges[count].start = 0; - event->addr_filter_ranges[count].size = 0; + if (filter->path.dentry) { + /* + * Adjust base offset if the filter is associated to a + * binary that needs to be mapped: + */ + event->addr_filter_ranges[count].start = 0; + event->addr_filter_ranges[count].size = 0; - /* - * Adjust base offset if the filter is associated to a binary - * that needs to be mapped: - */ - if (filter->path.dentry) perf_addr_filter_apply(filter, mm, &event->addr_filter_ranges[count]); + } else { + event->addr_filter_ranges[count].start = filter->offset; + event->addr_filter_ranges[count].size = filter->size; + } count++; } @@ -9104,9 +9107,11 @@ static void perf_event_addr_filters_apply(struct perf_event *event) event->addr_filters_gen++; raw_spin_unlock_irqrestore(&ifh->lock, flags); - up_read(&mm->mmap_sem); + if (ifh->nr_file_filters) { + up_read(&mm->mmap_sem); - mmput(mm); + mmput(mm); + } restart: perf_event_stop(event, 1); -- cgit v1.3-7-g2ca7 From 339bc4183596e1f68c2c98a03b87aa124107c317 Mon Sep 17 00:00:00 2001 From: Alexander Shishkin Date: Fri, 29 Mar 2019 11:13:38 +0200 Subject: perf/ring_buffer: Fix AUX record suppression The following commit: 1627314fb54a33e ("perf: Suppress AUX/OVERWRITE records") has an unintended side-effect of also suppressing all AUX records with no flags and non-zero size, so all the regular records in the full trace mode. This breaks some use cases for people. Fix this by restoring "regular" AUX records. Reported-by: Ben Gainey Tested-by: Ben Gainey Signed-off-by: Alexander Shishkin Signed-off-by: Peter Zijlstra (Intel) Cc: Cc: Arnaldo Carvalho de Melo Cc: Jiri Olsa Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Stephane Eranian Cc: Thomas Gleixner Cc: Vince Weaver Fixes: 1627314fb54a33e ("perf: Suppress AUX/OVERWRITE records") Link: https://lkml.kernel.org/r/20190329091338.29999-1-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar --- kernel/events/ring_buffer.c | 33 +++++++++++++++------------------ 1 file changed, 15 insertions(+), 18 deletions(-) (limited to 'kernel') diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c index 2545ac08cc77..5eedb49a65ea 100644 --- a/kernel/events/ring_buffer.c +++ b/kernel/events/ring_buffer.c @@ -455,24 +455,21 @@ void perf_aux_output_end(struct perf_output_handle *handle, unsigned long size) rb->aux_head += size; } - if (size || handle->aux_flags) { - /* - * Only send RECORD_AUX if we have something useful to communicate - * - * Note: the OVERWRITE records by themselves are not considered - * useful, as they don't communicate any *new* information, - * aside from the short-lived offset, that becomes history at - * the next event sched-in and therefore isn't useful. - * The userspace that needs to copy out AUX data in overwrite - * mode should know to use user_page::aux_head for the actual - * offset. So, from now on we don't output AUX records that - * have *only* OVERWRITE flag set. - */ - - if (handle->aux_flags & ~(u64)PERF_AUX_FLAG_OVERWRITE) - perf_event_aux_event(handle->event, aux_head, size, - handle->aux_flags); - } + /* + * Only send RECORD_AUX if we have something useful to communicate + * + * Note: the OVERWRITE records by themselves are not considered + * useful, as they don't communicate any *new* information, + * aside from the short-lived offset, that becomes history at + * the next event sched-in and therefore isn't useful. + * The userspace that needs to copy out AUX data in overwrite + * mode should know to use user_page::aux_head for the actual + * offset. So, from now on we don't output AUX records that + * have *only* OVERWRITE flag set. + */ + if (size || (handle->aux_flags & ~(u64)PERF_AUX_FLAG_OVERWRITE)) + perf_event_aux_event(handle->event, aux_head, size, + handle->aux_flags); rb->user_page->aux_head = rb->aux_head; if (rb_need_aux_wakeup(rb)) -- cgit v1.3-7-g2ca7 From 2e8e19226398db8265a8e675fcc0118b9e80c9e8 Mon Sep 17 00:00:00 2001 From: Phil Auld Date: Tue, 19 Mar 2019 09:00:05 -0400 Subject: sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup With extremely short cfs_period_us setting on a parent task group with a large number of children the for loop in sched_cfs_period_timer() can run until the watchdog fires. There is no guarantee that the call to hrtimer_forward_now() will ever return 0. The large number of children can make do_sched_cfs_period_timer() take longer than the period. NMI watchdog: Watchdog detected hard LOCKUP on cpu 24 RIP: 0010:tg_nop+0x0/0x10 walk_tg_tree_from+0x29/0xb0 unthrottle_cfs_rq+0xe0/0x1a0 distribute_cfs_runtime+0xd3/0xf0 sched_cfs_period_timer+0xcb/0x160 ? sched_cfs_slack_timer+0xd0/0xd0 __hrtimer_run_queues+0xfb/0x270 hrtimer_interrupt+0x122/0x270 smp_apic_timer_interrupt+0x6a/0x140 apic_timer_interrupt+0xf/0x20 To prevent this we add protection to the loop that detects when the loop has run too many times and scales the period and quota up, proportionally, so that the timer can complete before then next period expires. This preserves the relative runtime quota while preventing the hard lockup. A warning is issued reporting this state and the new values. Signed-off-by: Phil Auld Signed-off-by: Peter Zijlstra (Intel) Cc: Cc: Anton Blanchard Cc: Ben Segall Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Link: https://lkml.kernel.org/r/20190319130005.25492-1-pauld@redhat.com Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 25 +++++++++++++++++++++++++ 1 file changed, 25 insertions(+) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 40bd1e27b1b7..a4d9e14bf138 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4885,6 +4885,8 @@ static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer) return HRTIMER_NORESTART; } +extern const u64 max_cfs_quota_period; + static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer) { struct cfs_bandwidth *cfs_b = @@ -4892,6 +4894,7 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer) unsigned long flags; int overrun; int idle = 0; + int count = 0; raw_spin_lock_irqsave(&cfs_b->lock, flags); for (;;) { @@ -4899,6 +4902,28 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer) if (!overrun) break; + if (++count > 3) { + u64 new, old = ktime_to_ns(cfs_b->period); + + new = (old * 147) / 128; /* ~115% */ + new = min(new, max_cfs_quota_period); + + cfs_b->period = ns_to_ktime(new); + + /* since max is 1s, this is limited to 1e9^2, which fits in u64 */ + cfs_b->quota *= new; + cfs_b->quota = div64_u64(cfs_b->quota, old); + + pr_warn_ratelimited( + "cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us %lld, cfs_quota_us = %lld)\n", + smp_processor_id(), + div_u64(new, NSEC_PER_USEC), + div_u64(cfs_b->quota, NSEC_PER_USEC)); + + /* reset count so we don't come right back in here */ + count = 0; + } + idle = do_sched_cfs_period_timer(cfs_b, overrun, flags); } if (idle) -- cgit v1.3-7-g2ca7 From 1b02cd6a2d7f3e2a6a5262887d2cb2912083e42f Mon Sep 17 00:00:00 2001 From: luca abeni Date: Mon, 25 Mar 2019 14:15:30 +0100 Subject: sched/deadline: Correctly handle active 0-lag timers MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit syzbot reported the following warning: [ ] WARNING: CPU: 4 PID: 17089 at kernel/sched/deadline.c:255 task_non_contending+0xae0/0x1950 line 255 of deadline.c is: WARN_ON(hrtimer_active(&dl_se->inactive_timer)); in task_non_contending(). Unfortunately, in some cases (for example, a deadline task continuosly blocking and waking immediately) it can happen that a task blocks (and task_non_contending() is called) while the 0-lag timer is still active. In this case, the safest thing to do is to immediately decrease the running bandwidth of the task, without trying to re-arm the 0-lag timer. Signed-off-by: luca abeni Signed-off-by: Peter Zijlstra (Intel) Acked-by: Juri Lelli Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: chengjian (D) Link: https://lkml.kernel.org/r/20190325131530.34706-1-luca.abeni@santannapisa.it Signed-off-by: Ingo Molnar --- kernel/sched/deadline.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 6a73e41a2016..43901fa3f269 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -252,7 +252,6 @@ static void task_non_contending(struct task_struct *p) if (dl_entity_is_special(dl_se)) return; - WARN_ON(hrtimer_active(&dl_se->inactive_timer)); WARN_ON(dl_se->dl_non_contending); zerolag_time = dl_se->deadline - @@ -269,7 +268,7 @@ static void task_non_contending(struct task_struct *p) * If the "0-lag time" already passed, decrease the active * utilization now, instead of starting a timer */ - if (zerolag_time < 0) { + if ((zerolag_time < 0) || hrtimer_active(&dl_se->inactive_timer)) { if (dl_task(p)) sub_running_bw(dl_se, dl_rq); if (!dl_task(p) || p->state == TASK_DEAD) { -- cgit v1.3-7-g2ca7 From 3f2552f7e9c5abef2775c53f7af66532f8bf65bc Mon Sep 17 00:00:00 2001 From: Chang-An Chen Date: Fri, 29 Mar 2019 10:59:09 +0800 Subject: timers/sched_clock: Prevent generic sched_clock wrap caused by tick_freeze() tick_freeze() introduced by suspend-to-idle in commit 124cf9117c5f ("PM / sleep: Make it possible to quiesce timers during suspend-to-idle") uses timekeeping_suspend() instead of syscore_suspend() during suspend-to-idle. As a consequence generic sched_clock will keep going because sched_clock_suspend() and sched_clock_resume() are not invoked during suspend-to-idle which can result in a generic sched_clock wrap. On a ARM system with suspend-to-idle enabled, sched_clock is registered as "56 bits at 13MHz, resolution 76ns, wraps every 4398046511101ns", which means the real wrapping duration is 8796093022202ns. [ 134.551779] suspend-to-idle suspend (timekeeping_suspend()) [ 1204.912239] suspend-to-idle resume (timekeeping_resume()) ...... [ 1206.912239] suspend-to-idle suspend (timekeeping_suspend()) [ 5880.502807] suspend-to-idle resume (timekeeping_resume()) ...... [ 6000.403724] suspend-to-idle suspend (timekeeping_suspend()) [ 8035.753167] suspend-to-idle resume (timekeeping_resume()) ...... [ 8795.786684] (2)[321:charger_thread]...... [ 8795.788387] (2)[321:charger_thread]...... [ 0.057226] (0)[0:swapper/0]...... [ 0.061447] (2)[0:swapper/2]...... sched_clock was not stopped during suspend-to-idle, and sched_clock_poll hrtimer was not expired because timekeeping_suspend() was invoked during suspend-to-idle. It makes sched_clock wrap at kernel time 8796s. To prevent this, invoke sched_clock_suspend() and sched_clock_resume() in tick_freeze() together with timekeeping_suspend() and timekeeping_resume(). Fixes: 124cf9117c5f (PM / sleep: Make it possible to quiesce timers during suspend-to-idle) Signed-off-by: Chang-An Chen Signed-off-by: Thomas Gleixner Cc: Frederic Weisbecker Cc: Matthias Brugger Cc: John Stultz Cc: Kees Cook Cc: Corey Minyard Cc: Cc: Cc: Stanley Chu Cc: Cc: Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/1553828349-8914-1-git-send-email-chang-an.chen@mediatek.com --- kernel/time/sched_clock.c | 4 ++-- kernel/time/tick-common.c | 2 ++ kernel/time/timekeeping.h | 7 +++++++ 3 files changed, 11 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/time/sched_clock.c b/kernel/time/sched_clock.c index 094b82ca95e5..930113b9799a 100644 --- a/kernel/time/sched_clock.c +++ b/kernel/time/sched_clock.c @@ -272,7 +272,7 @@ static u64 notrace suspended_sched_clock_read(void) return cd.read_data[seq & 1].epoch_cyc; } -static int sched_clock_suspend(void) +int sched_clock_suspend(void) { struct clock_read_data *rd = &cd.read_data[0]; @@ -283,7 +283,7 @@ static int sched_clock_suspend(void) return 0; } -static void sched_clock_resume(void) +void sched_clock_resume(void) { struct clock_read_data *rd = &cd.read_data[0]; diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c index 529143b4c8d2..df401463a191 100644 --- a/kernel/time/tick-common.c +++ b/kernel/time/tick-common.c @@ -487,6 +487,7 @@ void tick_freeze(void) trace_suspend_resume(TPS("timekeeping_freeze"), smp_processor_id(), true); system_state = SYSTEM_SUSPEND; + sched_clock_suspend(); timekeeping_suspend(); } else { tick_suspend_local(); @@ -510,6 +511,7 @@ void tick_unfreeze(void) if (tick_freeze_depth == num_online_cpus()) { timekeeping_resume(); + sched_clock_resume(); system_state = SYSTEM_RUNNING; trace_suspend_resume(TPS("timekeeping_freeze"), smp_processor_id(), false); diff --git a/kernel/time/timekeeping.h b/kernel/time/timekeeping.h index 7a9b4eb7a1d5..141ab3ab0354 100644 --- a/kernel/time/timekeeping.h +++ b/kernel/time/timekeeping.h @@ -14,6 +14,13 @@ extern u64 timekeeping_max_deferment(void); extern void timekeeping_warp_clock(void); extern int timekeeping_suspend(void); extern void timekeeping_resume(void); +#ifdef CONFIG_GENERIC_SCHED_CLOCK +extern int sched_clock_suspend(void); +extern void sched_clock_resume(void); +#else +static inline int sched_clock_suspend(void) { return 0; } +static inline void sched_clock_resume(void) { } +#endif extern void do_timer(unsigned long ticks); extern void update_wall_time(void); -- cgit v1.3-7-g2ca7 From 738a7832d21e3d911fcddab98ce260b79010b461 Mon Sep 17 00:00:00 2001 From: Christian Brauner Date: Thu, 18 Apr 2019 12:18:39 +0200 Subject: signal: use fdget() since we don't allow O_PATH As stated in the original commit for pidfd_send_signal() we don't allow to signal processes through O_PATH file descriptors since it is semantically equivalent to a write on the pidfd. We already correctly error out right now and return EBADF if an O_PATH fd is passed. This is because we use file->f_op to detect whether a pidfd is passed and O_PATH fds have their file->f_op set to empty_fops in do_dentry_open() and thus fail the test. Thus, there is no regression. It's just semantically correct to use fdget() and return an error right from there instead of taking a reference and returning an error later. Signed-off-by: Christian Brauner Acked-by: Oleg Nesterov Cc: Arnd Bergmann Cc: "Eric W. Biederman" Cc: Kees Cook Cc: Thomas Gleixner Cc: Jann Horn Cc: David Howells Cc: "Michael Kerrisk (man-pages)" Cc: Andy Lutomirsky Cc: Andrew Morton Cc: Oleg Nesterov Cc: Aleksa Sarai Cc: Al Viro Signed-off-by: Linus Torvalds --- kernel/signal.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/signal.c b/kernel/signal.c index f98448cf2def..227ba170298e 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -3581,7 +3581,7 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig, if (flags) return -EINVAL; - f = fdget_raw(pidfd); + f = fdget(pidfd); if (!f.file) return -EBADF; -- cgit v1.3-7-g2ca7 From fabe38ab6b2bd9418350284c63825f13b8a6abba Mon Sep 17 00:00:00 2001 From: Masami Hiramatsu Date: Sun, 24 Feb 2019 01:50:20 +0900 Subject: kprobes: Mark ftrace mcount handler functions nokprobe Mark ftrace mcount handler functions nokprobe since probing on these functions with kretprobe pushes return address incorrectly on kretprobe shadow stack. Reported-by: Francis Deslauriers Tested-by: Andrea Righi Signed-off-by: Masami Hiramatsu Acked-by: Steven Rostedt Acked-by: Steven Rostedt (VMware) Cc: Linus Torvalds Cc: Mathieu Desnoyers Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/155094062044.6137.6419622920568680640.stgit@devbox Signed-off-by: Ingo Molnar --- kernel/trace/ftrace.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index 26c8ca9bd06b..b920358dd8f7 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -33,6 +33,7 @@ #include #include #include +#include #include @@ -6246,7 +6247,7 @@ void ftrace_reset_array_ops(struct trace_array *tr) tr->ops->func = ftrace_stub; } -static inline void +static nokprobe_inline void __ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip, struct ftrace_ops *ignored, struct pt_regs *regs) { @@ -6306,11 +6307,13 @@ static void ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip, { __ftrace_ops_list_func(ip, parent_ip, NULL, regs); } +NOKPROBE_SYMBOL(ftrace_ops_list_func); #else static void ftrace_ops_no_ops(unsigned long ip, unsigned long parent_ip) { __ftrace_ops_list_func(ip, parent_ip, NULL, NULL); } +NOKPROBE_SYMBOL(ftrace_ops_no_ops); #endif /* @@ -6337,6 +6340,7 @@ static void ftrace_ops_assist_func(unsigned long ip, unsigned long parent_ip, preempt_enable_notrace(); trace_clear_recursion(bit); } +NOKPROBE_SYMBOL(ftrace_ops_assist_func); /** * ftrace_ops_get_func - get the function a trampoline should call -- cgit v1.3-7-g2ca7 From 8f4a8c12cafe54535f334a09be7ec7f236134764 Mon Sep 17 00:00:00 2001 From: Sergey Senozhatsky Date: Thu, 18 Apr 2019 17:50:41 -0700 Subject: kernel/watchdog_hld.c: hard lockup message should end with a newline Separate print_modules() and hard lockup error message. Before the patch: NMI watchdog: Watchdog detected hard LOCKUP on cpu 1Modules linked in: nls_cp437 Link: http://lkml.kernel.org/r/20190412062557.2700-1-sergey.senozhatsky@gmail.com Signed-off-by: Sergey Senozhatsky Cc: Peter Zijlstra Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/watchdog_hld.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/watchdog_hld.c b/kernel/watchdog_hld.c index 71381168dede..247bf0b1582c 100644 --- a/kernel/watchdog_hld.c +++ b/kernel/watchdog_hld.c @@ -135,7 +135,8 @@ static void watchdog_overflow_callback(struct perf_event *event, if (__this_cpu_read(hard_watchdog_warn) == true) return; - pr_emerg("Watchdog detected hard LOCKUP on cpu %d", this_cpu); + pr_emerg("Watchdog detected hard LOCKUP on cpu %d\n", + this_cpu); print_modules(); print_irqtrace_events(current); if (regs) -- cgit v1.3-7-g2ca7 From a860fa7b96e1a1c974556327aa1aee852d434c21 Mon Sep 17 00:00:00 2001 From: Xie XiuQi Date: Sat, 20 Apr 2019 16:34:16 +0800 Subject: sched/numa: Fix a possible divide-by-zero sched_clock_cpu() may not be consistent between CPUs. If a task migrates to another CPU, then se.exec_start is set to that CPU's rq_clock_task() by update_stats_curr_start(). Specifically, the new value might be before the old value due to clock skew. So then if in numa_get_avg_runtime() the expression: 'now - p->last_task_numa_placement' ends up as -1, then the divider '*period + 1' in task_numa_placement() is 0 and things go bang. Similar to update_curr(), check if time goes backwards to avoid this. [ peterz: Wrote new changelog. ] [ mingo: Tweaked the code comment. ] Signed-off-by: Xie XiuQi Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: cj.chengjian@huawei.com Cc: Link: http://lkml.kernel.org/r/20190425080016.GX11158@hirez.programming.kicks-ass.net Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 4 ++++ 1 file changed, 4 insertions(+) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index a4d9e14bf138..35f3ea375084 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -2007,6 +2007,10 @@ static u64 numa_get_avg_runtime(struct task_struct *p, u64 *period) if (p->last_task_numa_placement) { delta = runtime - p->last_sum_exec_runtime; *period = now - p->last_task_numa_placement; + + /* Avoid time going backwards, prevent potential divide error: */ + if (unlikely((s64)*period < 0)) + *period = 0; } else { delta = p->se.avg.load_sum; *period = LOAD_AVG_MAX; -- cgit v1.3-7-g2ca7 From 7a0df7fbc14505e2e2be19ed08654a09e1ed5bf6 Mon Sep 17 00:00:00 2001 From: Tycho Andersen Date: Wed, 6 Mar 2019 13:14:13 -0700 Subject: seccomp: Make NEW_LISTENER and TSYNC flags exclusive As the comment notes, the return codes for TSYNC and NEW_LISTENER conflict, because they both return positive values, one in the case of success and one in the case of error. So, let's disallow both of these flags together. While this is technically a userspace break, all the users I know of are still waiting on me to land this feature in libseccomp, so I think it'll be safe. Also, at present my use case doesn't require TSYNC at all, so this isn't a big deal to disallow. If someone wanted to support this, a path forward would be to add a new flag like TSYNC_AND_LISTENER_YES_I_UNDERSTAND_THAT_TSYNC_WILL_JUST_RETURN_EAGAIN, but the use cases are so different I don't see it really happening. Finally, it's worth noting that this does actually fix a UAF issue: at the end of seccomp_set_mode_filter(), we have: if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) { if (ret < 0) { listener_f->private_data = NULL; fput(listener_f); put_unused_fd(listener); } else { fd_install(listener, listener_f); ret = listener; } } out_free: seccomp_filter_free(prepared); But if ret > 0 because TSYNC raced, we'll install the listener fd and then free the filter out from underneath it, causing a UAF when the task closes it or dies. This patch also switches the condition to be simply if (ret), so that if someone does add the flag mentioned above, they won't have to remember to fix this too. Reported-by: syzbot+b562969adb2e04af3442@syzkaller.appspotmail.com Fixes: 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace") CC: stable@vger.kernel.org # v5.0+ Signed-off-by: Tycho Andersen Signed-off-by: Kees Cook Acked-by: James Morris --- kernel/seccomp.c | 17 +++++++++++++++-- 1 file changed, 15 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/seccomp.c b/kernel/seccomp.c index 54a0347ca812..4b51d9cbcf06 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -502,7 +502,10 @@ out: * * Caller must be holding current->sighand->siglock lock. * - * Returns 0 on success, -ve on error. + * Returns 0 on success, -ve on error, or + * - in TSYNC mode: the pid of a thread which was either not in the correct + * seccomp mode or did not have an ancestral seccomp filter + * - in NEW_LISTENER mode: the fd of the new listener */ static long seccomp_attach_filter(unsigned int flags, struct seccomp_filter *filter) @@ -1258,6 +1261,16 @@ static long seccomp_set_mode_filter(unsigned int flags, if (flags & ~SECCOMP_FILTER_FLAG_MASK) return -EINVAL; + /* + * In the successful case, NEW_LISTENER returns the new listener fd. + * But in the failure case, TSYNC returns the thread that died. If you + * combine these two flags, there's no way to tell whether something + * succeeded or failed. So, let's disallow this combination. + */ + if ((flags & SECCOMP_FILTER_FLAG_TSYNC) && + (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER)) + return -EINVAL; + /* Prepare the new filter before holding any locks. */ prepared = seccomp_prepare_user_filter(filter); if (IS_ERR(prepared)) @@ -1304,7 +1317,7 @@ out: mutex_unlock(¤t->signal->cred_guard_mutex); out_put_fd: if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) { - if (ret < 0) { + if (ret) { listener_f->private_data = NULL; fput(listener_f); put_unused_fd(listener); -- cgit v1.3-7-g2ca7 From c6a9efa1d8353d8960d152e7d469d952b01495c0 Mon Sep 17 00:00:00 2001 From: Paul Chaignon Date: Wed, 24 Apr 2019 21:50:42 +0200 Subject: bpf: mark registers in all frames after pkt/null checks In case of a null check on a pointer inside a subprog, we should mark all registers with this pointer as either safe or unknown, in both the current and previous frames. Currently, only spilled registers and registers in the current frame are marked. Packet bound checks in subprogs have the same issue. This patch fixes it to mark registers in previous frames as well. A good reproducer for null checks looks as follow: 1: ptr = bpf_map_lookup_elem(map, &key); 2: ret = subprog(ptr) { 3: return ptr != NULL; 4: } 5: if (ret) 6: value = *ptr; With the above, the verifier will complain on line 6 because it sees ptr as map_value_or_null despite the null check in subprog 1. Note that this patch fixes another resulting bug when using bpf_sk_release(): 1: sk = bpf_sk_lookup_tcp(...); 2: subprog(sk) { 3: if (sk) 4: bpf_sk_release(sk); 5: } 6: if (!sk) 7: return 0; 8: return 1; In the above, mark_ptr_or_null_regs will warn on line 6 because it will try to free the reference state, even though it was already freed on line 3. Fixes: f4d7e40a5b71 ("bpf: introduce function calls (verification)") Signed-off-by: Paul Chaignon Signed-off-by: Alexei Starovoitov --- kernel/bpf/verifier.c | 76 +++++++++++++++++++++++++++++++-------------------- 1 file changed, 46 insertions(+), 30 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 6c5a41f7f338..09d5d972c9ff 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -4138,15 +4138,35 @@ static int check_alu_op(struct bpf_verifier_env *env, struct bpf_insn *insn) return 0; } +static void __find_good_pkt_pointers(struct bpf_func_state *state, + struct bpf_reg_state *dst_reg, + enum bpf_reg_type type, u16 new_range) +{ + struct bpf_reg_state *reg; + int i; + + for (i = 0; i < MAX_BPF_REG; i++) { + reg = &state->regs[i]; + if (reg->type == type && reg->id == dst_reg->id) + /* keep the maximum range already checked */ + reg->range = max(reg->range, new_range); + } + + bpf_for_each_spilled_reg(i, state, reg) { + if (!reg) + continue; + if (reg->type == type && reg->id == dst_reg->id) + reg->range = max(reg->range, new_range); + } +} + static void find_good_pkt_pointers(struct bpf_verifier_state *vstate, struct bpf_reg_state *dst_reg, enum bpf_reg_type type, bool range_right_open) { - struct bpf_func_state *state = vstate->frame[vstate->curframe]; - struct bpf_reg_state *regs = state->regs, *reg; u16 new_range; - int i, j; + int i; if (dst_reg->off < 0 || (dst_reg->off == 0 && range_right_open)) @@ -4211,20 +4231,9 @@ static void find_good_pkt_pointers(struct bpf_verifier_state *vstate, * the range won't allow anything. * dst_reg->off is known < MAX_PACKET_OFF, therefore it fits in a u16. */ - for (i = 0; i < MAX_BPF_REG; i++) - if (regs[i].type == type && regs[i].id == dst_reg->id) - /* keep the maximum range already checked */ - regs[i].range = max(regs[i].range, new_range); - - for (j = 0; j <= vstate->curframe; j++) { - state = vstate->frame[j]; - bpf_for_each_spilled_reg(i, state, reg) { - if (!reg) - continue; - if (reg->type == type && reg->id == dst_reg->id) - reg->range = max(reg->range, new_range); - } - } + for (i = 0; i <= vstate->curframe; i++) + __find_good_pkt_pointers(vstate->frame[i], dst_reg, type, + new_range); } /* compute branch direction of the expression "if (reg opcode val) goto target;" @@ -4698,6 +4707,22 @@ static void mark_ptr_or_null_reg(struct bpf_func_state *state, } } +static void __mark_ptr_or_null_regs(struct bpf_func_state *state, u32 id, + bool is_null) +{ + struct bpf_reg_state *reg; + int i; + + for (i = 0; i < MAX_BPF_REG; i++) + mark_ptr_or_null_reg(state, &state->regs[i], id, is_null); + + bpf_for_each_spilled_reg(i, state, reg) { + if (!reg) + continue; + mark_ptr_or_null_reg(state, reg, id, is_null); + } +} + /* The logic is similar to find_good_pkt_pointers(), both could eventually * be folded together at some point. */ @@ -4705,10 +4730,10 @@ static void mark_ptr_or_null_regs(struct bpf_verifier_state *vstate, u32 regno, bool is_null) { struct bpf_func_state *state = vstate->frame[vstate->curframe]; - struct bpf_reg_state *reg, *regs = state->regs; + struct bpf_reg_state *regs = state->regs; u32 ref_obj_id = regs[regno].ref_obj_id; u32 id = regs[regno].id; - int i, j; + int i; if (ref_obj_id && ref_obj_id == id && is_null) /* regs[regno] is in the " == NULL" branch. @@ -4717,17 +4742,8 @@ static void mark_ptr_or_null_regs(struct bpf_verifier_state *vstate, u32 regno, */ WARN_ON_ONCE(release_reference_state(state, id)); - for (i = 0; i < MAX_BPF_REG; i++) - mark_ptr_or_null_reg(state, ®s[i], id, is_null); - - for (j = 0; j <= vstate->curframe; j++) { - state = vstate->frame[j]; - bpf_for_each_spilled_reg(i, state, reg) { - if (!reg) - continue; - mark_ptr_or_null_reg(state, reg, id, is_null); - } - } + for (i = 0; i <= vstate->curframe; i++) + __mark_ptr_or_null_regs(vstate->frame[i], id, is_null); } static bool try_match_pkt_pointers(const struct bpf_insn *insn, -- cgit v1.3-7-g2ca7 From b987222654f84f7b4ca95b3a55eca784cb30235b Mon Sep 17 00:00:00 2001 From: Jann Horn Date: Thu, 4 Apr 2019 23:59:25 +0200 Subject: tracing: Fix buffer_ref pipe ops This fixes multiple issues in buffer_pipe_buf_ops: - The ->steal() handler must not return zero unless the pipe buffer has the only reference to the page. But generic_pipe_buf_steal() assumes that every reference to the pipe is tracked by the page's refcount, which isn't true for these buffers - buffer_pipe_buf_get(), which duplicates a buffer, doesn't touch the page's refcount. Fix it by using generic_pipe_buf_nosteal(), which refuses every attempted theft. It should be easy to actually support ->steal, but the only current users of pipe_buf_steal() are the virtio console and FUSE, and they also only use it as an optimization. So it's probably not worth the effort. - The ->get() and ->release() handlers can be invoked concurrently on pipe buffers backed by the same struct buffer_ref. Make them safe against concurrency by using refcount_t. - The pointers stored in ->private were only zeroed out when the last reference to the buffer_ref was dropped. As far as I know, this shouldn't be necessary anyway, but if we do it, let's always do it. Link: http://lkml.kernel.org/r/20190404215925.253531-1-jannh@google.com Cc: Ingo Molnar Cc: Masami Hiramatsu Cc: Al Viro Cc: stable@vger.kernel.org Fixes: 73a757e63114d ("ring-buffer: Return reader page back into existing ring buffer") Signed-off-by: Jann Horn Signed-off-by: Steven Rostedt (VMware) --- fs/splice.c | 4 ++-- include/linux/pipe_fs_i.h | 1 + kernel/trace/trace.c | 28 ++++++++++++++-------------- 3 files changed, 17 insertions(+), 16 deletions(-) (limited to 'kernel') diff --git a/fs/splice.c b/fs/splice.c index 3ee7e82df48f..e75807380caa 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -330,8 +330,8 @@ const struct pipe_buf_operations default_pipe_buf_ops = { .get = generic_pipe_buf_get, }; -static int generic_pipe_buf_nosteal(struct pipe_inode_info *pipe, - struct pipe_buffer *buf) +int generic_pipe_buf_nosteal(struct pipe_inode_info *pipe, + struct pipe_buffer *buf) { return 1; } diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h index 787d224ff43e..a830e9a00eb9 100644 --- a/include/linux/pipe_fs_i.h +++ b/include/linux/pipe_fs_i.h @@ -174,6 +174,7 @@ void free_pipe_info(struct pipe_inode_info *); void generic_pipe_buf_get(struct pipe_inode_info *, struct pipe_buffer *); int generic_pipe_buf_confirm(struct pipe_inode_info *, struct pipe_buffer *); int generic_pipe_buf_steal(struct pipe_inode_info *, struct pipe_buffer *); +int generic_pipe_buf_nosteal(struct pipe_inode_info *, struct pipe_buffer *); void generic_pipe_buf_release(struct pipe_inode_info *, struct pipe_buffer *); void pipe_buf_mark_unmergeable(struct pipe_buffer *buf); diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 21153e64bf1c..0cfa13a60086 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -7025,19 +7025,23 @@ struct buffer_ref { struct ring_buffer *buffer; void *page; int cpu; - int ref; + refcount_t refcount; }; +static void buffer_ref_release(struct buffer_ref *ref) +{ + if (!refcount_dec_and_test(&ref->refcount)) + return; + ring_buffer_free_read_page(ref->buffer, ref->cpu, ref->page); + kfree(ref); +} + static void buffer_pipe_buf_release(struct pipe_inode_info *pipe, struct pipe_buffer *buf) { struct buffer_ref *ref = (struct buffer_ref *)buf->private; - if (--ref->ref) - return; - - ring_buffer_free_read_page(ref->buffer, ref->cpu, ref->page); - kfree(ref); + buffer_ref_release(ref); buf->private = 0; } @@ -7046,14 +7050,14 @@ static void buffer_pipe_buf_get(struct pipe_inode_info *pipe, { struct buffer_ref *ref = (struct buffer_ref *)buf->private; - ref->ref++; + refcount_inc(&ref->refcount); } /* Pipe buffer operations for a buffer. */ static const struct pipe_buf_operations buffer_pipe_buf_ops = { .confirm = generic_pipe_buf_confirm, .release = buffer_pipe_buf_release, - .steal = generic_pipe_buf_steal, + .steal = generic_pipe_buf_nosteal, .get = buffer_pipe_buf_get, }; @@ -7066,11 +7070,7 @@ static void buffer_spd_release(struct splice_pipe_desc *spd, unsigned int i) struct buffer_ref *ref = (struct buffer_ref *)spd->partial[i].private; - if (--ref->ref) - return; - - ring_buffer_free_read_page(ref->buffer, ref->cpu, ref->page); - kfree(ref); + buffer_ref_release(ref); spd->partial[i].private = 0; } @@ -7125,7 +7125,7 @@ tracing_buffers_splice_read(struct file *file, loff_t *ppos, break; } - ref->ref = 1; + refcount_set(&ref->refcount, 1); ref->buffer = iter->trace_buffer->buffer; ref->page = ring_buffer_alloc_read_page(ref->buffer, iter->cpu_file); if (IS_ERR(ref->page)) { -- cgit v1.3-7-g2ca7 From 91862cc7867bba4ee5c8fcf0ca2f1d30427b6129 Mon Sep 17 00:00:00 2001 From: Wenwen Wang Date: Fri, 19 Apr 2019 21:22:59 -0500 Subject: tracing: Fix a memory leak by early error exit in trace_pid_write() In trace_pid_write(), the buffer for trace parser is allocated through kmalloc() in trace_parser_get_init(). Later on, after the buffer is used, it is then freed through kfree() in trace_parser_put(). However, it is possible that trace_pid_write() is terminated due to unexpected errors, e.g., ENOMEM. In that case, the allocated buffer will not be freed, which is a memory leak bug. To fix this issue, free the allocated buffer when an error is encountered. Link: http://lkml.kernel.org/r/1555726979-15633-1-git-send-email-wang6495@umn.edu Fixes: f4d34a87e9c10 ("tracing: Use pid bitmap instead of a pid array for set_event_pid") Cc: stable@vger.kernel.org Signed-off-by: Wenwen Wang Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 0cfa13a60086..46f68fad6373 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -496,8 +496,10 @@ int trace_pid_write(struct trace_pid_list *filtered_pids, * not modified. */ pid_list = kmalloc(sizeof(*pid_list), GFP_KERNEL); - if (!pid_list) + if (!pid_list) { + trace_parser_put(&parser); return -ENOMEM; + } pid_list->pid_max = READ_ONCE(pid_max); @@ -507,6 +509,7 @@ int trace_pid_write(struct trace_pid_list *filtered_pids, pid_list->pids = vzalloc((pid_list->pid_max + 7) >> 3); if (!pid_list->pids) { + trace_parser_put(&parser); kfree(pid_list); return -ENOMEM; } -- cgit v1.3-7-g2ca7 From d6097c9e4454adf1f8f2c9547c2fa6060d55d952 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Tue, 23 Apr 2019 22:03:18 +0200 Subject: trace: Fix preempt_enable_no_resched() abuse Unless the very next line is schedule(), or implies it, one must not use preempt_enable_no_resched(). It can cause a preemption to go missing and thereby cause arbitrary delays, breaking the PREEMPT=y invariant. Link: http://lkml.kernel.org/r/20190423200318.GY14281@hirez.programming.kicks-ass.net Cc: Waiman Long Cc: Linus Torvalds Cc: Ingo Molnar Cc: Will Deacon Cc: Thomas Gleixner Cc: the arch/x86 maintainers Cc: Davidlohr Bueso Cc: Tim Chen Cc: huang ying Cc: Roman Gushchin Cc: Alexei Starovoitov Cc: Daniel Borkmann Cc: stable@vger.kernel.org Fixes: 2c2d7329d8af ("tracing/ftrace: use preempt_enable_no_resched_notrace in ring_buffer_time_stamp()") Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/ring_buffer.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 41b6f96e5366..4ee8d8aa3d0f 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -762,7 +762,7 @@ u64 ring_buffer_time_stamp(struct ring_buffer *buffer, int cpu) preempt_disable_notrace(); time = rb_time_stamp(buffer); - preempt_enable_no_resched_notrace(); + preempt_enable_notrace(); return time; } -- cgit v1.3-7-g2ca7 From 9a4f26cc98d81b67ecc23b890c28e2df324e29f3 Mon Sep 17 00:00:00 2001 From: "Tobin C. Harding" Date: Tue, 30 Apr 2019 10:11:44 +1000 Subject: sched/cpufreq: Fix kobject memleak Currently the error return path from kobject_init_and_add() is not followed by a call to kobject_put() - which means we are leaking the kobject. Fix it by adding a call to kobject_put() in the error path of kobject_init_and_add(). Signed-off-by: Tobin C. Harding Cc: Greg Kroah-Hartman Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Rafael J. Wysocki Cc: Thomas Gleixner Cc: Tobin C. Harding Cc: Vincent Guittot Cc: Viresh Kumar Link: http://lkml.kernel.org/r/20190430001144.24890-1-tobin@kernel.org Signed-off-by: Ingo Molnar --- kernel/sched/cpufreq_schedutil.c | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c index 5c41ea367422..3638d2377e3c 100644 --- a/kernel/sched/cpufreq_schedutil.c +++ b/kernel/sched/cpufreq_schedutil.c @@ -771,6 +771,7 @@ out: return 0; fail: + kobject_put(&tunables->attr_set.kobj); policy->governor_data = NULL; sugov_tunables_free(tunables); -- cgit v1.3-7-g2ca7 From 26ae4f4406f88d82d79c85c11ac5fae18213cd38 Mon Sep 17 00:00:00 2001 From: Alexander Shishkin Date: Fri, 3 May 2019 11:55:35 +0300 Subject: perf/ring_buffer: Fix AUX software double buffering This recent commit: 5768402fd9c6e87 ("perf/ring_buffer: Use high order allocations for AUX buffers optimistically") overlooked the fact that the previous one page granularity of the AUX buffer provided an implicit double buffering capability to the PMU driver, which went away when the entire buffer became one high-order page. Always make the full-trace mode AUX allocation at least two-part to preserve the previous behavior and allow the implicit double buffering to continue. Reported-by: Ammy Yi Signed-off-by: Alexander Shishkin Acked-by: Peter Zijlstra Cc: Arnaldo Carvalho de Melo Cc: Jiri Olsa Cc: Linus Torvalds Cc: Stephane Eranian Cc: Thomas Gleixner Cc: Vince Weaver Cc: adrian.hunter@intel.com Fixes: 5768402fd9c6e87 ("perf/ring_buffer: Use high order allocations for AUX buffers optimistically") Link: http://lkml.kernel.org/r/20190503085536.24119-2-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar --- kernel/events/ring_buffer.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c index 5eedb49a65ea..674b35383491 100644 --- a/kernel/events/ring_buffer.c +++ b/kernel/events/ring_buffer.c @@ -610,8 +610,7 @@ int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event, * PMU requests more than one contiguous chunks of memory * for SW double buffering */ - if ((event->pmu->capabilities & PERF_PMU_CAP_AUX_SW_DOUBLEBUF) && - !overwrite) { + if (!overwrite) { if (!max_order) return -EINVAL; -- cgit v1.3-7-g2ca7