aboutsummaryrefslogtreecommitdiffstats
path: root/tools/perf/scripts/python/call-graph-from-postgresql.py (unfollow)
AgeCommit message (Collapse)AuthorFilesLines
2016-04-13sched/clock: Make local_clock()/cpu_clock() inlineDaniel Lezcano2-43/+31
The local_clock/cpu_clock functions were changed to prevent a double identical test with sched_clock_cpu() when HAVE_UNSTABLE_SCHED_CLOCK is set. That resulted in one line functions. As these functions are in all the cases one line functions and in the hot path, it is useful to specify them as static inline in order to give a strong hint to the compiler. After verification, it appears the compiler does not inline them without this hint. Change those functions to static inline. sched_clock_cpu() is called via the inlined local_clock()/cpu_clock() functions from sched.h. So any module code including sched.h will reference sched_clock_cpu(). Thus it must be exported with the EXPORT_SYMBOL_GPL macro. Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1460385514-14700-2-git-send-email-daniel.lezcano@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-13sched/clock: Remove pointless test in cpu_clock/local_clockDaniel Lezcano1-8/+2
In case the HAVE_UNSTABLE_SCHED_CLOCK config is set, the cpu_clock() version checks if sched_clock_stable() is not set and calls sched_clock_cpu(), otherwise it calls sched_clock(). sched_clock_cpu() checks also if sched_clock_stable() is set and, if true, calls sched_clock(). sched_clock() will be called in sched_clock_cpu() if sched_clock_stable() is true. Remove the duplicate test by directly calling sched_clock_cpu() and let the static key act in this function instead. Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1460385514-14700-1-git-send-email-daniel.lezcano@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-13sched/debug: Don't dump sched debug info in SysRq-WRabin Vincent1-1/+2
sysrq_sched_debug_show() can dump a lot of information. Don't print out all that if we're just trying to get a list of blocked tasks (SysRq-W). The information is still accessible with SysRq-T. Signed-off-by: Rabin Vincent <rabinv@axis.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1459777322-30902-1-git-send-email-rabin.vincent@axis.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-31sched/fair: Initiate a new task's util avg to a bounded valueYuyang Du3-2/+57
A new task's util_avg is set to full utilization of a CPU (100% time running). This accelerates a new task's utilization ramp-up, useful to boost its execution in early time. However, it may result in (insanely) high utilization for a transient time period when a flood of tasks are spawned. Importantly, it violates the "fundamentally bounded" CPU utilization, and its side effect is negative if we don't take any measure to bound it. This patch proposes an algorithm to address this issue. It has two methods to approach a sensible initial util_avg: (1) An expected (or average) util_avg based on its cfs_rq's util_avg: util_avg = cfs_rq->util_avg / (cfs_rq->load_avg + 1) * se.load.weight (2) A trajectory of how successive new tasks' util develops, which gives 1/2 of the left utilization budget to a new task such that the additional util is noticeably large (when overall util is low) or unnoticeably small (when overall util is high enough). In the meantime, the aggregate utilization is well bounded: util_avg_cap = (1024 - cfs_rq->avg.util_avg) / 2^n where n denotes the nth task. If util_avg is larger than util_avg_cap, then the effective util is clamped to the util_avg_cap. Reported-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Yuyang Du <yuyang.du@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bsegall@google.com Cc: morten.rasmussen@arm.com Cc: pjt@google.com Cc: steve.muckle@linaro.org Link: http://lkml.kernel.org/r/1459283456-21682-1-git-send-email-yuyang.du@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-31sched/fair: Update comments after a variable renameYuyang Du1-2/+2
The following commit: ed82b8a1ff76 ("sched/core: Move the sched_to_prio[] arrays out of line") renamed prio_to_weight to sched_prio_to_weight, but the old name was not updated in comments. Signed-off-by: Yuyang Du <yuyang.du@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1459292871-22531-1-git-send-email-yuyang.du@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-31sched/core: Add preempt checks in preempt_schedule() codeSteven Rostedt1-9/+59
While testing the tracer preemptoff, I hit this strange trace: <...>-259 0...1 0us : schedule <-worker_thread <...>-259 0d..1 0us : rcu_note_context_switch <-__schedule <...>-259 0d..1 0us : rcu_sched_qs <-rcu_note_context_switch <...>-259 0d..1 0us : rcu_preempt_qs <-rcu_note_context_switch <...>-259 0d..1 0us : _raw_spin_lock <-__schedule <...>-259 0d..1 0us : preempt_count_add <-_raw_spin_lock <...>-259 0d..2 0us : do_raw_spin_lock <-_raw_spin_lock <...>-259 0d..2 1us : deactivate_task <-__schedule <...>-259 0d..2 1us : update_rq_clock.part.84 <-deactivate_task <...>-259 0d..2 1us : dequeue_task_fair <-deactivate_task <...>-259 0d..2 1us : dequeue_entity <-dequeue_task_fair <...>-259 0d..2 1us : update_curr <-dequeue_entity <...>-259 0d..2 1us : update_min_vruntime <-update_curr <...>-259 0d..2 1us : cpuacct_charge <-update_curr <...>-259 0d..2 1us : __rcu_read_lock <-cpuacct_charge <...>-259 0d..2 1us : __rcu_read_unlock <-cpuacct_charge <...>-259 0d..2 1us : clear_buddies <-dequeue_entity <...>-259 0d..2 1us : account_entity_dequeue <-dequeue_entity <...>-259 0d..2 2us : update_min_vruntime <-dequeue_entity <...>-259 0d..2 2us : update_cfs_shares <-dequeue_entity <...>-259 0d..2 2us : hrtick_update <-dequeue_task_fair <...>-259 0d..2 2us : wq_worker_sleeping <-__schedule <...>-259 0d..2 2us : kthread_data <-wq_worker_sleeping <...>-259 0d..2 2us : pick_next_task_fair <-__schedule <...>-259 0d..2 2us : check_cfs_rq_runtime <-pick_next_task_fair <...>-259 0d..2 2us : pick_next_entity <-pick_next_task_fair <...>-259 0d..2 2us : clear_buddies <-pick_next_entity <...>-259 0d..2 2us : pick_next_entity <-pick_next_task_fair <...>-259 0d..2 2us : clear_buddies <-pick_next_entity <...>-259 0d..2 2us : set_next_entity <-pick_next_task_fair <...>-259 0d..2 3us : put_prev_entity <-pick_next_task_fair <...>-259 0d..2 3us : check_cfs_rq_runtime <-put_prev_entity <...>-259 0d..2 3us : set_next_entity <-pick_next_task_fair gnome-sh-1031 0d..2 3us : finish_task_switch <-__schedule gnome-sh-1031 0d..2 3us : _raw_spin_unlock_irq <-finish_task_switch gnome-sh-1031 0d..2 3us : do_raw_spin_unlock <-_raw_spin_unlock_irq gnome-sh-1031 0...2 3us!: preempt_count_sub <-_raw_spin_unlock_irq gnome-sh-1031 0...1 582us : do_raw_spin_lock <-_raw_spin_lock gnome-sh-1031 0...1 583us : _raw_spin_unlock <-drm_gem_object_lookup gnome-sh-1031 0...1 583us : do_raw_spin_unlock <-_raw_spin_unlock gnome-sh-1031 0...1 583us : preempt_count_sub <-_raw_spin_unlock gnome-sh-1031 0...1 584us : _raw_spin_unlock <-drm_gem_object_lookup gnome-sh-1031 0...1 584us+: trace_preempt_on <-drm_gem_object_lookup gnome-sh-1031 0...1 603us : <stack trace> => preempt_count_sub => _raw_spin_unlock => drm_gem_object_lookup => i915_gem_madvise_ioctl => drm_ioctl => do_vfs_ioctl => SyS_ioctl => entry_SYSCALL_64_fastpath As I'm tracing preemption disabled, it seemed incorrect that the trace would go across a schedule and report not being in the scheduler. Looking into this I discovered the problem. schedule() calls preempt_disable() but the preempt_schedule() calls preempt_enable_notrace(). What happened above was that the gnome-shell task was preempted on another CPU, migrated over to the idle cpu. The tracer stared with idle calling schedule(), which called preempt_disable(), but then gnome-shell finished, and it enabled preemption with preempt_enable_notrace() that does stop the trace, even though preemption was enabled. The purpose of the preempt_disable_notrace() in the preempt_schedule() is to prevent function tracing from going into an infinite loop. Because function tracing can trace the preempt_enable/disable() calls that are traced. The problem with function tracing is: NEED_RESCHED set preempt_schedule() preempt_disable() preempt_count_inc() function trace (before incrementing preempt count) preempt_disable_notrace() preempt_enable_notrace() sees NEED_RESCHED set preempt_schedule() (repeat) Now by breaking out the preempt off/on tracing into their own code: preempt_disable_check() and preempt_enable_check(), we can add these to the preempt_schedule() code. As preemption would then be disabled, even if they were to be traced by the function tracer, the disabled preemption would prevent the recursion. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20160321112339.6dc78ad6@gandalf.local.home Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-31sched/numa: Remove unnecessary NUMA dequeue update from non-SMP kernelsTim Chen1-0/+2
In account_entity_enqueue(), we do not do account_numa_enqueue() as NUMA balancing is not needed for UP kernels. Hence, we should remove the account_numa_dequeue() call from account_entity_dequeue() for UP kernels. Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1454366879.21738.29.camel@schen9-desk2.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-31sched/fair: Reset nr_balance_failed after active balancingSrikar Dronamraju1-6/+6
To force a task migration during active balancing, nr_balance_failed is set to cache_nice_tries + 1. However nr_balance_failed is not reset. As a side effect, the next regular load balance under the same sd, a cache hot task might be migrated, just because nr_balance_failed count is high. Resetting nr_balance_failed after a successful active balance ensures that a hot task is not unreasonably migrated. This can be verified by looking at othe number of hot task migrations reported by /proc/schedstat. Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1458735884-30105-1-git-send-email-srikar@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-31sched/cpuacct: Split usage accounting into user_usage and sys_usageDongsheng Yang1-27/+113
Sometimes, cpuacct.usage is not detailed enough to see how much CPU usage a group had. We want to know how much time it used in user mode and how much in kernel mode. This patch introduces more files to give this information: # ls /sys/fs/cgroup/cpuacct/cpuacct.usage* /sys/fs/cgroup/cpuacct/cpuacct.usage /sys/fs/cgroup/cpuacct/cpuacct.usage_percpu /sys/fs/cgroup/cpuacct/cpuacct.usage_user /sys/fs/cgroup/cpuacct/cpuacct.usage_percpu_user /sys/fs/cgroup/cpuacct/cpuacct.usage_sys /sys/fs/cgroup/cpuacct/cpuacct.usage_percpu_sys ... while keeping the ABI with the existing counter. Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> [ Ported to newer kernels. ] Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <htejun@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/aa171da036b520b51c79549e9b3215d29473f19d.1458635566.git.zhaolei@cn.fujitsu.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-31sched/cpuacct: Show all possible CPUs in cpuacct outputZhao Lei1-5/+5
Current code show stats of online CPUs in cpuacct.statcpus, show stats of present cpus in cpuacct.usage(_percpu), and using present CPUs for setting cpuacct.usage. It will cause inconsistent result when a CPU is online or offline or hotpluged. We should always use possible CPUs to avoid above problem. Here are the contents of a cpuacct.usage_percpu sysfs file, on a 4 CPU system with maxcpus=32: Before the patch: # cat cpuacct.usage_percpu 2456565 411435 1052897 832584 After the patch: # cat cpuacct.usage_percpu 2456565 411435 1052897 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Tejun Heo <htejun@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/a11d56cef12d0b4807f8be3a46bf9798c3014d59.1458635566.git.zhaolei@cn.fujitsu.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-30uapi/linux/stddef.h: Provide __always_inline to userspace headersDenys Vlasenko1-0/+4
Josh Boyer reported that my recent change to uapi/linux/swab.h broke the Qemu build: bc27fb68aaad ("include/uapi/linux/byteorder, swab: force inlining of some byteswap operations") Unfortunately, UAPI headers don't include compiler.h so fixing it there is not enough, add an __always_inline definition to uapi/linux/stddef.h instead. Testcase: "make headers_install" and try to compile this: #include <linux/swab.h> void main() {} Reported-by: Josh Boyer <jwboyer@fedoraproject.org> Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: David Rientjes <rientjes@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Thomas Graf <tgraf@suug.ch> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/1459289697-12875-1-git-send-email-dvlasenk@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-30tools/lib/lockdep: Fix unsupported 'basename -s' in run_tests.shSedat Dilek1-4/+8
Here on Ubuntu/precise I have GNU/coreutils v8.13 installed where 'basename -s' is not supported. The result is that run_tests.sh is not done properly. How to reproduce: $ cd $BUILD_DIR $ LC_ALL=C make -C tools/ liblockdep $ cd tools/lib/lockdep/ $ LC_ALL=C ./run_tests.sh basename: invalid option -- 's' Try `basename --help' for more information. ... timeout: failed to run command `./tests/': Permission denied FAILED! rm: cannot remove `tests/': Is a directory Due to unsupported basename the tests programs are not generated and cannot be removed. Fix this by doing a compatible basename invocation and check for the existence of generated tests programs. For more details see this LKML thread: http://marc.info/?t=145906667300001&r=1&w=2 Signed-off-by: Sedat Dilek <sedat.dilek@gmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sasha Levin <sasha.levin@oracle.com> (maintainer:LIBLOCKDEP) Cc: Shuah Khan <shuahkh@osg.samsung.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-fsdevel <linux-fsdevel@vger.kernel.org> Link: http://lkml.kernel.org/r/1459326169-7009-1-git-send-email-sedat.dilek@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-29locking/atomic, sched: Unexport fetch_or()Frederic Weisbecker2-21/+18
This patch functionally reverts: 5fd7a09cfb8c ("atomic: Export fetch_or()") During the merge Linus observed that the generic version of fetch_or() was messy: " This makes the ugly "fetch_or()" macro that the scheduler used internally a new generic helper, and does a bad job at it. " e23604edac2a Merge branch 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Now that we have introduced atomic_fetch_or(), fetch_or() is only used by the scheduler in order to deal with thread_info flags which type can vary across architectures. Lets confine fetch_or() back to the scheduler so that we encourage future users to use the more robust and well typed atomic_t version instead. While at it, fetch_or() gets robustified, pasting improvements from a previous patch by Ingo Molnar that avoids needless expression re-evaluations in the loop. Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1458830281-4255-4-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-29timers/nohz: Convert tick dependency mask to atomic_tFrederic Weisbecker3-34/+33
The tick dependency mask was intially unsigned long because this is the type on which clear_bit() operates on and fetch_or() accepts it. But now that we have atomic_fetch_or(), we can instead use atomic_andnot() to clear the bit. This consolidates the type of our tick dependency mask, reduce its size on structures and benefit from possible architecture optimizations on atomic_t operations. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1458830281-4255-3-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-29locking/atomic: Introduce atomic_fetch_or()Frederic Weisbecker1-0/+21
This is deemed to replace the type generic fetch_or() which brings a lot of issues such as macro induced block variable aliasing and sloppy types. Not to mention fetch_or() doesn't refer to any namespace, adding even more confusion. So lets provide an atomic_t version. Current and next users of fetch_or() are thus encouraged to use atomic_t. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1458830281-4255-2-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-26Linux 4.6-rc1Linus Torvalds1-2/+2
2016-03-26f2fs/crypto: fix xts_tweak initializationLinus Torvalds1-1/+1
Commit 0b81d07790726 ("fs crypto: move per-file encryption from f2fs tree to fs/crypto") moved the f2fs crypto files to fs/crypto/ and renamed the symbol prefixes from "f2fs_" to "fscrypt_" (and from "F2FS_" to just "FS" for preprocessor symbols). Because of the symbol renaming, it's a bit hard to see it as a file move: use git show -M30 0b81d07790726 to lower the rename detection to just 30% similarity and make git show the files as renamed (the header file won't be shown as a rename even then - since all it contains is symbol definitions, it looks almost completely different). Even with the renames showing as renames, the diffs are not all that easy to read, since so much is just the renames. But Eric Biggers noticed that it's not just all renames: the initialization of the xts_tweak had been broken too, using the inode number rather than the page offset. That's not right - it makes the xfs_tweak the same for all pages of each inode. It _might_ make sense to make the xfs_tweak contain both the offset _and_ the inode number, but not just the inode number. Reported-by: Eric Biggers <ebiggers3@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-26NTB: Remove _addr functions from ntb_hw_amdAllen Hubbe1-30/+0
Kernel zero day testing warned about address space confusion. A virtual iomem address was used where a physical address is expected. The offending functions implement an optional part of the api, so they are removed. They can be added later, after testing. Fixes: a1b3695820aa490e58915d720a1438069813008b Signed-off-by: Allen Hubbe <Allen.Hubbe@emc.com> Acked-by: Xiangliang Yu <Xiangliang.Yu@amd.com> Signed-off-by: Jon Mason <jdmason@kudzu.us>
2016-03-26orangefs: fix orangefs_superblock lockingAl Viro3-58/+47
* switch orangefs_remount() to taking ORANGEFS_SB(sb) instead of sb * remove from the list _before_ orangefs_unmount() - request_mutex in the latter will make sure that nothing observed in the loop in ORANGEFS_DEV_REMOUNT_ALL handling will get freed until the end of loop * on removal, keep the forward pointer and zero the back one. That way we can drop and regain the spinlock in the loop body (again, ORANGEFS_DEV_REMOUNT_ALL one) and still be able to get to the rest of the list. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-03-25orangefs: fix do_readv_writev() handling of error halfway throughAl Viro1-1/+1
Error should only be returned if nothing had been read/written. Otherwise we need to report a short read/write instead. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-03-25orangefs: have ->kill_sb() evict the VFS side of things firstAl Viro1-3/+3
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-03-25orangefs: sanitize ->llseek()Al Viro2-10/+3
a) open files can't have NULL inodes b) it's SEEK_END, not ORANGEFS_SEEK_END; no need to get cute. c) make_bad_inode() on lseek()? Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-03-25orangefs-bufmap.h: trim unused junkAl Viro1-9/+0
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-03-25orangefs: saner calling conventions for getting a slotAl Viro4-28/+16
just have it return the slot number or -E... - the caller checks the sign anyway Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-03-25orangefs_copy_{to,from}_bufmap(): don't pass bufmap pointerAl Viro3-23/+14
it's always __orangefs_bufmap Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-03-25orangefs: get rid of readdir_handle_sAl Viro1-63/+30
no point, really - we couldn't keep those across the calls of getdents(); it would be too easy to DoS, having all slots exhausted. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-03-25thp: fix typo in khugepaged_scan_pmd()Kirill A. Shutemov1-1/+1
!PageLRU should lead to SCAN_PAGE_LRU, not SCAN_SCAN_ABORT result. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25MAINTAINERS: fill entries for KASANAndrey Ryabinin1-0/+14
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Acked-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25mm/filemap: generic_file_read_iter(): check for zero reads unconditionallyNicolai Stange1-3/+4
If - generic_file_read_iter() gets called with a zero read length, - the read offset is at a page boundary, - IOCB_DIRECT is not set - and the page in question hasn't made it into the page cache yet, then do_generic_file_read() will trigger a readahead with a req_size hint of zero. Since roundup_pow_of_two(0) is undefined, UBSAN reports UBSAN: Undefined behaviour in include/linux/log2.h:63:13 shift exponent 64 is too large for 64-bit type 'long unsigned int' CPU: 3 PID: 1017 Comm: sa1 Tainted: G L 4.5.0-next-20160318+ #14 [...] Call Trace: [...] [<ffffffff813ef61a>] ondemand_readahead+0x3aa/0x3d0 [<ffffffff813ef61a>] ? ondemand_readahead+0x3aa/0x3d0 [<ffffffff813c73bd>] ? find_get_entry+0x2d/0x210 [<ffffffff813ef9c3>] page_cache_sync_readahead+0x63/0xa0 [<ffffffff813cc04d>] do_generic_file_read+0x80d/0xf90 [<ffffffff813cc955>] generic_file_read_iter+0x185/0x420 [...] [<ffffffff81510b06>] __vfs_read+0x256/0x3d0 [...] when get_init_ra_size() gets called from ondemand_readahead(). The net effect is that the initial readahead size is arch dependent for requested read lengths of zero: for example, since 1UL << (sizeof(unsigned long) * 8) evaluates to 1 on x86 while its result is 0 on ARMv7, the initial readahead size becomes 4 on the former and 0 on the latter. What's more, whether or not the file access timestamp is updated for zero length reads is decided differently for the two cases of IOCB_DIRECT being set or cleared: in the first case, generic_file_read_iter() explicitly skips updating that timestamp while in the latter case, it is always updated through the call to do_generic_file_read(). According to POSIX, zero length reads "do not modify the last data access timestamp" and thus, the IOCB_DIRECT behaviour is POSIXly correct. Let generic_file_read_iter() unconditionally check the requested read length at its entry and return immediately with success if it is zero. Signed-off-by: Nicolai Stange <nicstange@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25kasan: test fix: warn if the UAF could not be detected in kmalloc_uaf2Alexander Potapenko1-0/+2
Signed-off-by: Alexander Potapenko <glider@google.com> Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Andrey Konovalov <adech.fo@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Konstantin Serebryany <kcc@google.com> Cc: Dmitry Chernenkov <dmitryc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25mm, kasan: stackdepot implementation. Enable stackdepot for SLABAlexander Potapenko9-12/+391
Implement the stack depot and provide CONFIG_STACKDEPOT. Stack depot will allow KASAN store allocation/deallocation stack traces for memory chunks. The stack traces are stored in a hash table and referenced by handles which reside in the kasan_alloc_meta and kasan_free_meta structures in the allocated memory chunks. IRQ stack traces are cut below the IRQ entry point to avoid unnecessary duplication. Right now stackdepot support is only enabled in SLAB allocator. Once KASAN features in SLAB are on par with those in SLUB we can switch SLUB to stackdepot as well, thus removing the dependency on SLUB stack bookkeeping, which wastes a lot of memory. This patch is based on the "mm: kasan: stack depots" patch originally prepared by Dmitry Chernenkov. Joonsoo has said that he plans to reuse the stackdepot code for the mm/page_owner.c debugging facility. [akpm@linux-foundation.org: s/depot_stack_handle/depot_stack_handle_t] [aryabinin@virtuozzo.com: comment style fixes] Signed-off-by: Alexander Potapenko <glider@google.com> Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Andrey Konovalov <adech.fo@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Konstantin Serebryany <kcc@google.com> Cc: Dmitry Chernenkov <dmitryc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25arch, ftrace: for KASAN put hard/soft IRQ entries into separate sectionsAlexander Potapenko23-15/+51
KASAN needs to know whether the allocation happens in an IRQ handler. This lets us strip everything below the IRQ entry point to reduce the number of unique stack traces needed to be stored. Move the definition of __irq_entry to <linux/interrupt.h> so that the users don't need to pull in <linux/ftrace.h>. Also introduce the __softirq_entry macro which is similar to __irq_entry, but puts the corresponding functions to the .softirqentry.text section. Signed-off-by: Alexander Potapenko <glider@google.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Andrey Konovalov <adech.fo@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Konstantin Serebryany <kcc@google.com> Cc: Dmitry Chernenkov <dmitryc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25mm, kasan: add GFP flags to KASAN APIAlexander Potapenko8-42/+48
Add GFP flags to KASAN hooks for future patches to use. This patch is based on the "mm: kasan: unified support for SLUB and SLAB allocators" patch originally prepared by Dmitry Chernenkov. Signed-off-by: Alexander Potapenko <glider@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Andrey Konovalov <adech.fo@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Konstantin Serebryany <kcc@google.com> Cc: Dmitry Chernenkov <dmitryc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25mm, kasan: SLAB supportAlexander Potapenko12-22/+266
Add KASAN hooks to SLAB allocator. This patch is based on the "mm: kasan: unified support for SLUB and SLAB allocators" patch originally prepared by Dmitry Chernenkov. Signed-off-by: Alexander Potapenko <glider@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Andrey Konovalov <adech.fo@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Konstantin Serebryany <kcc@google.com> Cc: Dmitry Chernenkov <dmitryc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25kasan: modify kmalloc_large_oob_right(), add kmalloc_pagealloc_oob_right()Alexander Potapenko1-1/+27
This patchset implements SLAB support for KASAN Unlike SLUB, SLAB doesn't store allocation/deallocation stacks for heap objects, therefore we reimplement this feature in mm/kasan/stackdepot.c. The intention is to ultimately switch SLUB to use this implementation as well, which will save a lot of memory (right now SLUB bloats each object by 256 bytes to store the allocation/deallocation stacks). Also neither SLUB nor SLAB delay the reuse of freed memory chunks, which is necessary for better detection of use-after-free errors. We introduce memory quarantine (mm/kasan/quarantine.c), which allows delayed reuse of deallocated memory. This patch (of 7): Rename kmalloc_large_oob_right() to kmalloc_pagealloc_oob_right(), as the test only checks the page allocator functionality. Also reimplement kmalloc_large_oob_right() so that the test allocates a large enough chunk of memory that still does not trigger the page allocator fallback. Signed-off-by: Alexander Potapenko <glider@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Andrey Konovalov <adech.fo@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Konstantin Serebryany <kcc@google.com> Cc: Dmitry Chernenkov <dmitryc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25include/linux/oom.h: remove undefined oom_kills_count()/note_oom_kill()Tetsuo Handa1-2/+0
A leftover from commit c32b3cbe0d06 ("oom, PM: make OOM detection in the freezer path raceless"). Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25mm/page_alloc: prevent merging between isolated and other pageblocksVlastimil Babka1-13/+33
Hanjun Guo has reported that a CMA stress test causes broken accounting of CMA and free pages: > Before the test, I got: > -bash-4.3# cat /proc/meminfo | grep Cma > CmaTotal: 204800 kB > CmaFree: 195044 kB > > > After running the test: > -bash-4.3# cat /proc/meminfo | grep Cma > CmaTotal: 204800 kB > CmaFree: 6602584 kB > > So the freed CMA memory is more than total.. > > Also the the MemFree is more than mem total: > > -bash-4.3# cat /proc/meminfo > MemTotal: 16342016 kB > MemFree: 22367268 kB > MemAvailable: 22370528 kB Laura Abbott has confirmed the issue and suspected the freepage accounting rewrite around 3.18/4.0 by Joonsoo Kim. Joonsoo had a theory that this is caused by unexpected merging between MIGRATE_ISOLATE and MIGRATE_CMA pageblocks: > CMA isolates MAX_ORDER aligned blocks, but, during the process, > partialy isolated block exists. If MAX_ORDER is 11 and > pageblock_order is 9, two pageblocks make up MAX_ORDER > aligned block and I can think following scenario because pageblock > (un)isolation would be done one by one. > > (each character means one pageblock. 'C', 'I' means MIGRATE_CMA, > MIGRATE_ISOLATE, respectively. > > CC -> IC -> II (Isolation) > II -> CI -> CC (Un-isolation) > > If some pages are freed at this intermediate state such as IC or CI, > that page could be merged to the other page that is resident on > different type of pageblock and it will cause wrong freepage count. This was supposed to be prevented by CMA operating on MAX_ORDER blocks, but since it doesn't hold the zone->lock between pageblocks, a race window does exist. It's also likely that unexpected merging can occur between MIGRATE_ISOLATE and non-CMA pageblocks. This should be prevented in __free_one_page() since commit 3c605096d315 ("mm/page_alloc: restrict max order of merging on isolated pageblock"). However, we only check the migratetype of the pageblock where buddy merging has been initiated, not the migratetype of the buddy pageblock (or group of pageblocks) which can be MIGRATE_ISOLATE. Joonsoo has suggested checking for buddy migratetype as part of page_is_buddy(), but that would add extra checks in allocator hotpath and bloat-o-meter has shown significant code bloat (the function is inline). This patch reduces the bloat at some expense of more complicated code. The buddy-merging while-loop in __free_one_page() is initially bounded to pageblock_border and without any migratetype checks. The checks are placed outside, bumping the max_order if merging is allowed, and returning to the while-loop with a statement which can't be possibly considered harmful. This fixes the accounting bug and also removes the arguably weird state in the original commit 3c605096d315 where buddies could be left unmerged. Fixes: 3c605096d315 ("mm/page_alloc: restrict max order of merging on isolated pageblock") Link: https://lkml.org/lkml/2016/3/2/280 Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Hanjun Guo <guohanjun@huawei.com> Tested-by: Hanjun Guo <guohanjun@huawei.com> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Debugged-by: Laura Abbott <labbott@redhat.com> Debugged-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: <stable@vger.kernel.org> [3.18+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25drivers/memstick/host/r592.c: avoid gcc-6 warningArnd Bergmann1-2/+1
The r592 driver relies on behavior of the DMA mapping API that is normally observed but not guaranteed by the API. Instead it uses a runtime check to fail transfers if the API ever behaves When CONFIG_NEED_SG_DMA_LENGTH is not set, one of the checks turns into a comparison of a variable with itself, which gcc-6.0 now warns about: drivers/memstick/host/r592.c: In function 'r592_transfer_fifo_dma': drivers/memstick/host/r592.c:302:31: error: self-comparison always evaluates to false [-Werror=tautological-compare] (sg_dma_len(&dev->req->sg) < dev->req->sg.length)) { ^ The check itself is not a problem, so this patch just rephrases the condition in a way that gcc does not consider an indication of a mistake. We already know that dev->req->sg.length was initially R592_LFIFO_SIZE, so we can compare it to that constant again. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Maxim Levitsky <maximlevitsky@gmail.com> Cc: Quentin Lambert <lambert.quentin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2: extend enough credits for freeing one truncate record while replaying truncate recordsXue jiufei1-11/+8
Now function ocfs2_replay_truncate_records() first modifies tl_used, then calls ocfs2_extend_trans() to extend transactions for gd and alloc inode used for freeing clusters. jbd2_journal_restart() may be called and it may happen that tl_used in truncate log is decreased but the clusters are not freed, which means these clusters are lost. So we should avoid extending transactions in these two operations. Signed-off-by: joyce.xue <xuejiufei@huawei.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Acked-by: Joseph Qi <joseph.qi@huawei.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2: extend transaction for ocfs2_remove_rightmost_path() and ocfs2_update_edge_lengths() before to avoid inconsistency between inode and etXue jiufei1-28/+54
I found that jbd2_journal_restart() is called in some places without keeping things consistently before. However, jbd2_journal_restart() may commit the handle's transaction and restart another one. If the first transaction is committed successfully while another not, it may cause filesystem inconsistency or read only. This is an effort to fix this kind of problems. This patch (of 3): The following functions will be called while truncating an extent: ocfs2_remove_btree_range -> ocfs2_start_trans -> ocfs2_remove_extent -> ocfs2_truncate_rec -> ocfs2_extend_rotate_transaction -> jbd2_journal_restart if jbd2_journal_extend fail -> ocfs2_rotate_tree_left -> ocfs2_remove_rightmost_path -> ocfs2_extend_rotate_transaction -> ocfs2_unlink_subtree -> ocfs2_update_edge_lengths -> ocfs2_extend_trans -> jbd2_journal_restart if jbd2_journal_extend fail -> ocfs2_et_update_clusters -> ocfs2_commit_trans jbd2_journal_restart() may be called and it may happened that the buffers dirtied in ocfs2_truncate_rec() are committed while buffers dirtied in ocfs2_et_update_clusters() are not, the total clusters on extent tree and i_clusters in ocfs2_dinode is inconsistency. So the clusters got from ocfs2_dinode is incorrect, and it also cause read-only problem when call ocfs2_commit_truncate() with the error message: "Inode %llu has empty extent block at %llu". We should extend enough credits for function ocfs2_remove_rightmost_path and ocfs2_update_edge_lengths to avoid this inconsistency. Signed-off-by: joyce.xue <xuejiufei@huawei.com> Acked-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2/dlm: move lock to the tail of grant queue while doing in-place convertxuejiufei1-0/+6
We have found a bug when two nodes doing umount one after another. 1) Node 1 migrate a lockres that has 3 locks in grant queue such as N2(PR)<->N3(NL)<->N4(PR) to N2. After migration, lvb of the lock N3(NL) and N4(PR) are empty on node 2 because migration target do not copy lvb to these two lock. 2) Node 3 want to convert to PR, it can be granted in __dlmconvert_master(), and the order of these locks is unchanged. The lvb of the lock N3(PR) on node 2 is copyed from lockres in function dlm_update_lvb() while the lvb of lock N4(PR) is still empty. 3) Node 2 want to leave domain, it will migrate this lockres to node 3. Then node 2 will trigger the BUG in dlm_prepare_lvb_for_migration() when adding the lock N4(PR) to mres with the following message because the lvb of mres is already copied from lock N3(PR), but the lvb of lock N4(PR) is empty. "Mismatched lvb in lock cookie=%u:%llu, name=%.*s, node=%u" [akpm@linux-foundation.org: tweak comment] Signed-off-by: xuejiufei <xuejiufei@huawei.com> Acked-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2: solve a problem of crossing the boundary in updating backupsjiangyiwen1-1/+1
In update_backups() there exists a problem of crossing the boundary as follows: we assume that lun will be resized to 1TB(cluster_size is 32kb), it will include 0~33554431 cluster, in update_backups func, it will backup super block in location of 1TB which is the 33554432th cluster, so the phenomenon of crossing the boundary happens. Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Xue jiufei <xuejiufei@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2: fix occurring deadlock by changing ocfs2_wq from global to localjiangyiwen7-33/+32
This patch fixes a deadlock, as follows: Node 1 Node 2 Node 3 1)volume a and b are only mount vol a only mount vol b mounted 2) start to mount b start to mount a 3) check hb of Node 3 check hb of Node 2 in vol a, qs_holds++ in vol b, qs_holds++ 4) -------------------- all nodes' network down -------------------- 5) progress of mount b the same situation as failed, and then call Node 2 ocfs2_dismount_volume. but the process is hung, since there is a work in ocfs2_wq cannot beo completed. This work is about vol a, because ocfs2_wq is global wq. BTW, this work which is scheduled in ocfs2_wq is ocfs2_orphan_scan_work, and the context in this work needs to take inode lock of orphan_dir, because lockres owner are Node 1 and all nodes' nework has been down at the same time, so it can't get the inode lock. 6) Why can't this node be fenced when network disconnected? Because the process of mount is hung what caused qs_holds is not equal 0. Because all works in the ocfs2_wq are relative to the super block. The solution is to change the ocfs2_wq from global to local. In other words, move it into struct ocfs2_super. Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Xue jiufei <xuejiufei@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2/dlm: fix BUG in dlm_move_lockres_to_recovery_listJoseph Qi2-1/+13
When master handles convert request, it queues ast first and then returns status. This may happen that the ast is sent before the request status because the above two messages are sent by two threads. And right after the ast is sent, if master down, it may trigger BUG in dlm_move_lockres_to_recovery_list in the requested node because ast handler moves it to grant list without clear lock->convert_pending. So remove BUG_ON statement and check if the ast is processed in dlmconvert_remote. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Reported-by: Yiwen Jiang <jiangyiwen@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Tariq Saeed <tariq.x.saeed@oracle.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2/dlm: fix race between convert and recoveryJoseph Qi1-1/+10
There is a race window between dlmconvert_remote and dlm_move_lockres_to_recovery_list, which will cause a lock with OCFS2_LOCK_BUSY in grant list, thus system hangs. dlmconvert_remote { spin_lock(&res->spinlock); list_move_tail(&lock->list, &res->converting); lock->convert_pending = 1; spin_unlock(&res->spinlock); status = dlm_send_remote_convert_request(); >>>>>> race window, master has queued ast and return DLM_NORMAL, and then down before sending ast. this node detects master down and calls dlm_move_lockres_to_recovery_list, which will revert the lock to grant list. Then OCFS2_LOCK_BUSY won't be cleared as new master won't send ast any more because it thinks already be authorized. spin_lock(&res->spinlock); lock->convert_pending = 0; if (status != DLM_NORMAL) dlm_revert_pending_convert(res, lock); spin_unlock(&res->spinlock); } In this case, check if res->state has DLM_LOCK_RES_RECOVERING bit set (res is still in recovering) or res master changed (new master has finished recovery), reset the status to DLM_RECOVERING, then it will retry convert. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Reported-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Tariq Saeed <tariq.x.saeed@oracle.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2: fix a deadlock issue in ocfs2_dio_end_io_write()Ryan Ding1-4/+8
The code should call ocfs2_free_alloc_context() to free meta_ac & data_ac before calling ocfs2_run_deallocs(). Because ocfs2_run_deallocs() will acquire the system inode's i_mutex hold by meta_ac. So try to release the lock before ocfs2_run_deallocs(). Fixes: af1310367f41 ("ocfs2: fix sparse file & data ordering issue in direct io.") Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Acked-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2: fix disk file size and memory file size mismatchRyan Ding1-10/+17
When doing append direct write in an already allocated cluster, and fast path in ocfs2_dio_get_block() is triggered, function ocfs2_dio_end_io_write() will be skipped as there is no context allocated. As a result, the disk file size will not be changed as it should be. The solution is to skip fast path when we are about to change file size. Fixes: af1310367f41 ("ocfs2: fix sparse file & data ordering issue in direct io.") Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Acked-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2: take ip_alloc_sem in ocfs2_dio_get_block & ocfs2_dio_end_io_writeRyan Ding1-6/+17
Take ip_alloc_sem to prevent concurrent access to extent tree, which may cause the extent tree in an unstable state. Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2: fix ip_unaligned_aio deadlock with dio work queueRyan Ding5-36/+9
In the current implementation of unaligned aio+dio, lock order behave as follow: in user process context: -> call io_submit() -> get i_mutex <== window1 -> get ip_unaligned_aio -> submit direct io to block device -> release i_mutex -> io_submit() return in dio work queue context(the work queue is created in __blockdev_direct_IO): -> release ip_unaligned_aio <== window2 -> get i_mutex -> clear unwritten flag & change i_size -> release i_mutex There is a limitation to the thread number of dio work queue. 256 at default. If all 256 thread are in the above 'window2' stage, and there is a user process in the 'window1' stage, the system will became deadlock. Since the user process hold i_mutex to wait ip_unaligned_aio lock, while there is a direct bio hold ip_unaligned_aio mutex who is waiting for a dio work queue thread to be schedule. But all the dio work queue thread is waiting for i_mutex lock in 'window2'. This case only happened in a test which send a large number(more than 256) of aio at one io_submit() call. My design is to remove ip_unaligned_aio lock. Change it to a sync io instead. Just like ip_unaligned_aio lock, serialize the unaligned aio dio. [akpm@linux-foundation.org: remove OCFS2_IOCB_UNALIGNED_IO, per Junxiao Bi] Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2: code clean up for direct ioRyan Ding2-144/+10
Clean up ocfs2_file_write_iter & ocfs2_prepare_inode_for_write: * remove append dio check: it will be checked in ocfs2_direct_IO() * remove file hole check: file hole is supported for now * remove inline data check: it will be checked in ocfs2_direct_IO() * remove the full_coherence check when append dio: we will get the inode_lock in ocfs2_dio_get_block, there is no need to fall back to buffer io to ensure the coherence semantics. Now the drop dio procedure is gone. :) [akpm@linux-foundation.org: remove unused label] Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>