linux-dev - Linux kernel development work

Age	Commit message (Collapse)	Author	Files	Lines
2020-06-30	ring-buffer: speed up buffer resets by avoiding synchronize_rcu for each CPU	Nicholas Piggin	1	-14/+71
	On a 144 thread system, `perf ftrace` takes about 20 seconds to start up, due to calling synchronize_rcu() for each CPU. cat /proc/108560/stack 0xc0003e7eb336f470 __switch_to+0x2e0/0x480 __wait_rcu_gp+0x20c/0x220 synchronize_rcu+0x9c/0xc0 ring_buffer_reset_cpu+0x88/0x2e0 tracing_reset_online_cpus+0x84/0xe0 tracing_open+0x1d4/0x1f0 On a system with 10x more threads, it starts to become an annoyance. Batch these up so we disable all the per-cpu buffers first, then synchronize_rcu() once, then reset each of the buffers. This brings the time down to about 0.5s. Link: https://lkml.kernel.org/r/20200625053403.2386972-1-npiggin@gmail.com Tested-by: Anton Blanchard <anton@ozlabs.org> Acked-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-30	ring-buffer: Add rb_time_t 64 bit operations for speeding up 32 bit	Steven Rostedt (VMware)	1	-27/+243
	After a discussion with the new time algorithm to have nested events still have proper time keeping but required using local64_t atomic operations. Mathieu was concerned about the performance this would have on 32 bit machines, as in most cases, atomic 64 bit operations on them can be expensive. As the ring buffer's timing needs do not require full features of local64_t, a wrapper is made to implement a new rb_time_t operation that uses two longs on 32 bit machines but still uses the local64_t operations on 64 bit machines. There's a switch that can be made in the file to force 64 bit to use the 32 bit version just for testing purposes. All reads do not need to succeed if a read happened while the stamp being read is in the process of being updated. The requirement is that all reads must succed that were done by an interrupting event (where this event was interrupted by another event that did the write). Or if the event itself did the write first. That is: rb_time_set(t, x) followed by rb_time_read(t) will always succeed (even if it gets interrupted by another event that writes to t. The result of the read will be either the previous set, or a set performed by an interrupting event. If the read is done by an event that interrupted another event that was in the process of setting the time stamp, and no other event came along to write to that time stamp, it will fail and the rb_time_read() will return that it failed (the value to read will be undefined). A set will always write to the time stamp and return with a valid time stamp, such that any read after it will be valid. A cmpxchg may fail if it interrupted an event that was in the process of updating the time stamp just like the reads do. Other than that, it will act like a normal cmpxchg. The way this works is that the rb_time_t is made of of three fields. A cnt, that gets updated atomically everyting a modification is made. A top that represents the most significant 30 bits of the time, and a bottom to represent the least significant 30 bits of the time. Notice, that the time values is only 60 bits long (where the ring buffer only uses 59 bits, which gives us 18 years of nanoseconds!). The top two bits of both the top and bottom is a 2 bit counter that gets set by the value of the least two significant bits of the cnt. A read of the top and the bottom where both the top and bottom have the same most significant top 2 bits, are considered a match and a valid 60 bit number can be created from it. If they do not match, then the number is considered invalid, and this must only happen if an event interrupted another event in the midst of updating the time stamp. This is only used for 32 bits machines as 64 bit machines can get better performance out of the local64_t. This has been tested heavily by forcing 64 bit to use this logic. Link: https://lore.kernel.org/r/20200625225345.18cf5881@oasis.local.home Link: http://lkml.kernel.org/r/20200629025259.309232719@goodmis.org Inspired-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-30	ring-buffer: Incorporate absolute timestamp into add_timestamp logic	Steven Rostedt (VMware)	1	-18/+22
	Instead of calling out the absolute test for each time to check if the ring buffer wants absolute time stamps for all its recording, incorporate it with the add_timestamp field and turn it into flags for faster processing between wanting a absolute tag and needing to force one. Link: http://lkml.kernel.org/r/20200629025259.154892368@goodmis.org Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-30	ring-buffer: Have nested events still record running time stamp	Steven Rostedt (VMware)	1	-95/+181
	Up until now, if an event is interrupted while it is recorded by an interrupt, and that interrupt records events, the time of those events will all be the same. This is because events only record the delta of the time since the previous event (or beginning of a page), and to handle updating the time keeping for that of nested events is extremely racy. After years of thinking about this and several failed attempts, I finally have a solution to solve this puzzle. The problem is that you need to atomically calculate the delta and then update the time stamp you made the delta from, as well as then record it into the buffer, all this while at any time an interrupt can come in and do the same thing. This is easy to solve with heavy weight atomics, but that would be detrimental to the performance of the ring buffer. The current state of affairs sacrificed the time deltas for nested events for performance. The reason for previous failed attempts at solving this puzzle was because I was trying to completely avoid slow atomic operations like cmpxchg. I final came to the conclusion to always avoid cmpxchg is not possible, which is why those previous attempts always failed. But it is possible to pick one path (the most common case) and avoid cmpxchg in that path, which is the "fast path". The most common case is that an event will not be interrupted and have other events added into it. An event can detect if it has interrupted another event, and for these cases we can make it the slow path and use the heavy operations like cmpxchg. One more player was added to the game that made this possible, and that is the "absolute timestamp" (by Tom Zanussi) that allows us to inject a full 59 bit time stamp. (Of course this breaks if a machine is running for more than 18 years without a reboot!). There's barrier() placements around for being paranoid, even when they are not needed because of other atomic functions near by. But those should not hurt, as if they are not needed, they basically become a nop. Note, this also makes the race window much smaller, which means there are less slow paths to slow down the performance. The basic idea is that there's two main paths taken. 1) Not being interrupted between time stamps and reserving buffer space. In this case, the time stamps taken are true to the location in the buffer. 2) Was interrupted by another path between taking time stamps and reserving buffer space. The objective is to know what the delta is from the last reserved location in the buffer. As it is possible to detect if an event is interrupting another event before reserving data, space is added to the length to be reserved to inject a full time stamp along with the event being reserved. When an event is not interrupted, the write stamp is always the time of the last event written to the buffer. In path 1, there's two sub paths we care about: a) The event did not interrupt another event. b) The event interrupted another event. In case a, as the write stamp was read and known to be correct, the delta between the current time stamp and the write stamp is the delta between the current event and the previously recorded event. In case b, extra space was reserved to just put the full time stamp into the buffer. Which is done, as stated, in this path the time stamp taken is known to match the location in the buffer. In path 2, there's also two sub paths we care about: a) The event was not interrupted by another event since it reserved space on the buffer and re-reading the write stamp. b) The event was interrupted by another event. In case a, the write stamp is that of the last event that interrupted this event between taking the time stamps and reserving. As no event came in after re-reading the write stamp, that event is known to be the time of the event directly before this event and the delta can be the new time stamp and the write stamp. In case b, one or more events came in between reserving the event and re-reading he write stamp. Since this event's buffer reservation is between other events at this path, there's no way to know what the delta is. But because an event interrupted this event after it started, its fine to just give a zero delta, and take the same time stamp as the events that happened within the event being recorded. Here's the implementation of the design of this solution: All this is per cpu, and only needs to worry about nested events (not parallel events). The players: write_tail: The index in the buffer where new events can be written to. It is incremented via local_add() to reserve space for a new event. before_stamp: A time stamp set by all events before reserving space. write_stamp: A time stamp updated by events after it has successfully reserved space. /* Save the current position of write / [A] w = local_read(write_tail); barrier(); / Read both before and write stamps before touching anything / before = local_read(before_stamp); after = local_read(write_stamp); barrier(); / * If before and after are the same, then this event is not * interrupting a time update. If it is, then reserve space for adding * a full time stamp (this can turn into a time extend which is * just an extended time delta but fill up the extra space). / if (after != before) abs = true; ts = clock(); / Now update the before_stamp (everyone does this!) / [B] local_set(before_stamp, ts); / Now reserve space on the buffer / [C] write = local_add_return(len, write_tail); / Set tail to be were this event's data is / tail = write - len; if (w == tail) { / Nothing interrupted this between A and C / [D] local_set(write_stamp, ts); barrier(); [E] save_before = local_read(before_stamp); if (!abs) { / This did not interrupt a time update / delta = ts - after; } else { delta = ts; / The full time stamp will be in use / } if (ts != save_before) { / slow path - Was interrupted between C and E / / The update to write_stamp could have overwritten the update to * it by the interrupting event, but before and after should be * the same for all completed top events / after = local_read(write_stamp); if (save_before > after) local_cmpxchg(write_stamp, after, save_before); } } else { / slow path - Interrupted between A and C / after = local_read(write_stamp); temp_ts = clock(); barrier(); [F] if (write == local_read(write_tail) && after < temp_ts) { / This was not interrupted since C and F * The last write_stamp is still valid for the previous event * in the buffer. / delta = temp_ts - after; / OK to keep this new time stamp / ts = temp_ts; } else { / Interrupted between C and F * Well, there's no use to try to know what the time stamp * is for the previous event. Just set delta to zero and * be the same time as that event that interrupted us before * the reservation of the buffer. / delta = 0; } / No need to use full timestamps here */ abs = 0; } Link: https://lkml.kernel.org/r/20200625094454.732790f7@oasis.local.home Link: https://lore.kernel.org/r/20200627010041.517736087@goodmis.org Link: http://lkml.kernel.org/r/20200629025258.957440797@goodmis.org Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-23	ring-buffer: Zero out time extend if it is nested and not absolute	Steven Rostedt (VMware)	1	-1/+1
	Currently the ring buffer makes events that happen in interrupts that preempt another event have a delta of zero. (Hopefully we can change this soon). But this is to deal with the races of updating a global counter with lockless and nesting functions updating deltas. With the addition of absolute time stamps, the time extend didn't follow this rule. A time extend can happen if two events happen longer than 2^27 nanoseconds appart, as the delta time field in each event is only 27 bits. If that happens, then a time extend is injected with 2^59 bits of nanoseconds to use (18 years). But if the 2^27 nanoseconds happen between two events, and as it is writing the event, an interrupt triggers, it will see the 2^27 difference as well and inject a time extend of its own. But a recent change made the time extend logic not take into account the nesting, and this can cause two time extend deltas to happen moving the time stamp much further ahead than the current time. This gets all reset when the ring buffer moves to the next page, but that can cause time to appear to go backwards. This was observed in a trace-cmd recording, and since the data is saved in a file, with trace-cmd report --debug, it was possible to see that this indeed did happen! bash-52501 110d... 81778.908247: sched_switch: bash:52501 [120] S ==> swapper/110:0 [120] [12770284:0x2e8:64] <idle>-0 110d... 81778.908757: sched_switch: swapper/110:0 [120] R ==> bash:52501 [120] [509947:0x32c:64] TIME EXTEND: delta:306454770 length:0 bash-52501 110.... 81779.215212: sched_swap_numa: src_pid=52501 src_tgid=52388 src_ngid=52501 src_cpu=110 src_nid=2 dst_pid=52509 dst_tgid=52388 dst_ngid=52501 dst_cpu=49 dst_nid=1 [0:0x378:48] TIME EXTEND: delta:306458165 length:0 bash-52501 110dNh. 81779.521670: sched_wakeup: migration/110:565 [0] success=1 CPU:110 [0:0x3b4:40] and at the next page, caused the time to go backwards: bash-52504 110d... 81779.685411: sched_switch: bash:52504 [120] S ==> swapper/110:0 [120] [8347057:0xfb4:64] CPU:110 [SUBBUFFER START] [81779379165886:0x1320000] <idle>-0 110dN.. 81779.379166: sched_wakeup: bash:52504 [120] success=1 CPU:110 [0:0x10:40] <idle>-0 110d... 81779.379167: sched_switch: swapper/110:0 [120] R ==> bash:52504 [120] [1168:0x3c:64] Link: https://lkml.kernel.org/r/20200622151815.345d1bf5@oasis.local.home Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tom Zanussi <zanussi@kernel.org> Cc: stable@vger.kernel.org Fixes: dc4e2801d400b ("ring-buffer: Redefine the unimplemented RINGBUF_TYPE_TIME_STAMP") Reported-by: Julia Lawall <julia.lawall@inria.fr> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-05-14	ring-buffer: Remove all BUG() calls	Steven Rostedt (VMware)	1	-6/+6
	There's a lot of checks to make sure the ring buffer is working, and if an anomaly is detected, it safely shuts itself down. But there's a few cases that it will call BUG(), which defeats the point of being safe (it crashes the kernel when an anomaly is found!). There's no reason for them. Switch them all to either WARN_ON_ONCE() (when no ring buffer descriptor is present), or to RB_WARN_ON() (when a ring buffer descriptor is present). Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-05-14	ring-buffer: Don't deactivate the ring buffer on failed iterator reads	Steven Rostedt (VMware)	1	-15/+7
	If the function tracer is running and the trace file is read (which uses the ring buffer iterator), the iterator can get in sync with the writes, and caues it to fail to find a page with content it can read three times. This causes a warning and deactivation of the ring buffer code. Looking at the other cases of failure to get an event, it appears that there's a chance that the writer could cause them too. Since the iterator is a "best effort" to read the ring buffer if there's an active writer (the consumer reader is made for this case "see trace_pipe"), if it fails to get an event after three tries, simply give up and return NULL. Don't warn, nor disable the ring buffer on this failure. Link: https://lore.kernel.org/r/20200429090508.GG5770@shao2-debian Reported-by: kernel test robot <lkp@intel.com> Fixes: ff84c50cfb4b ("ring-buffer: Do not die if rb_iter_peek() fails more than thrice") Tested-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-03-27	ring-buffer/tracing: Have iterator acknowledge dropped events	Steven Rostedt (VMware)	1	-0/+16
	Have the ring_buffer_iterator set a flag if events were dropped as it were to go and peek at the next event. Have the trace file display this fact if it happened with a "LOST EVENTS" message. Link: http://lkml.kernel.org/r/20200317213417.045858900@goodmis.org Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-03-27	ring-buffer: Do not disable recording when there is an iterator	Steven Rostedt (VMware)	1	-2/+0
	Now that the iterator can handle a concurrent writer, do not disable writing to the ring buffer when there is an iterator present. Link: http://lkml.kernel.org/r/20200317213416.759770696@goodmis.org Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-03-27	ring-buffer: Make resize disable per cpu buffer instead of total buffer	Steven Rostedt (VMware)	1	-13/+30
	When the ring buffer becomes writable for even when the trace file is read, it must still not be resized. But since tracers can be activated while the trace file is being read, the irqsoff tracer can modify the per CPU buffers, and this can cause the reader of the trace file to update the wrong buffer's resize disable bit, as the irqsoff tracer swaps out cpu buffers. By making the resize disable per cpu_buffer, it makes the update follow the per cpu_buffer even if it's swapped out with the snapshot buffer and keeps the release of the trace file modifying the same data as the open did. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-03-19	ring-buffer: Optimize rb_iter_head_event()	Steven Rostedt (VMware)	1	-0/+3
	As it is fine to perform several "peeks" of event data in the ring buffer via the iterator before moving it forward, do not re-read the event, just return what was read before. Otherwise, it can cause inconsistent results, especially when testing multiple CPU buffers to interleave them. Link: http://lkml.kernel.org/r/20200317213416.592032170@goodmis.org Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-03-19	ring-buffer: Do not die if rb_iter_peek() fails more than thrice	Steven Rostedt (VMware)	1	-3/+10
	As the iterator will be reading a live buffer, and if the event being read is on a page that a writer crosses, it will fail and try again, the condition in rb_iter_peek() that only allows a retry to happen three times is no longer valid. Allow rb_iter_peek() to retry more than three times without killing the ring buffer, but only if rb_iter_head_event() had failed at least once. Link: http://lkml.kernel.org/r/20200317213416.452888193@goodmis.org Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-03-19	ring-buffer: Have rb_iter_head_event() handle concurrent writer	Steven Rostedt (VMware)	1	-31/+75
	Have the ring_buffer_iter structure have a place to store an event, such that it can not be overwritten by a writer, and load it in such a way via rb_iter_head_event() that it will return NULL and reset the iter to the start of the current page if a writer updated the page. Link: http://lkml.kernel.org/r/20200317213416.306959216@goodmis.org Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-03-19	ring-buffer: Add page_stamp to iterator for synchronization	Steven Rostedt (VMware)	1	-3/+7
	Have the ring_buffer_iter structure contain a page_stamp, such that it can be used to see if the writer entered the page the iterator is on. When going to a new page, the iterator will record the time stamp of that page. When reading events, it can copy the event to an internal buffer on the iterator (to be implemented later), then check the page's time stamp with its own to see if the writer entered the page. If so, it will need to try to read the event again. Link: http://lkml.kernel.org/r/20200317213416.163549674@goodmis.org Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-03-19	ring-buffer: Rename ring_buffer_read() to read_buffer_iter_advance()	Steven Rostedt (VMware)	1	-17/+6
	When the ring buffer was first created, the iterator followed the normal producer/consumer operations where it had both a peek() operation, that just returned the event at the current location, and a read(), that would return the event at the current location and also increment the iterator such that the next peek() or read() will return the next event. The only use of the ring_buffer_read() is currently to move the iterator to the next location and nothing now actually reads the event it returns. Rename this function to its actual use case to ring_buffer_iter_advance(), which also adds the "iter" part to the name, which is more meaningful. As the timestamp returned by ring_buffer_read() was never used, there's no reason that this new version should bother having returning it. It will also become a void function. Link: http://lkml.kernel.org/r/20200317213416.018928618@goodmis.org Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-03-19	ring-buffer: Have ring_buffer_empty() not depend on tracing stopped	Steven Rostedt (VMware)	1	-2/+23
	It was complained about that when the trace file is read, that the tracing is disabled, as the iterator expects writing to the buffer it reads is not updated. Several steps are needed to make the iterator handle a writer, by testing if things have changed as it reads. This step is to make ring_buffer_empty() expect the buffer to be changing. Note if the current location of the iterator is overwritten, then it will return false as new data is being added. Note, that this means that data will be skipped. Link: http://lkml.kernel.org/r/20200317213415.870741809@goodmis.org Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-21	ring-buffer: Remove abandoned macro RB_MISSED_FLAGS	Alex Shi	1	-2/+0
	This macro isn't used since commit d325c402964e ("ring-buffer: Remove unused function ring_buffer_page_len()"), so better to remove it. Link: http://lkml.kernel.org/r/1579586080-45300-1-git-send-email-alex.shi@linux.alibaba.com Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-14	ring-buffer: Fix kernel doc for rb_update_event()	Steven Rostedt (VMware)	1	-3/+3
	rb_update_event has changed without the kernel-doc update. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-14	ring-bufer: kernel-doc warning fixes	Fabian Frederick	1	-0/+3
	Also fixes a couple of typos Link: http://lkml.kernel.org/r/1401992525-10417-1-git-send-email-fabf@skynet.be Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Fabian Frederick <fabf@skynet.be> [ Found this deep in the abyss of my INBOX ] Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-13	tracing: Make struct ring_buffer less ambiguous	Steven Rostedt (VMware)	1	-62/+62
	As there's two struct ring_buffers in the kernel, it causes some confusion. The other one being the perf ring buffer. It was agreed upon that as neither of the ring buffers are generic enough to be used globally, they should be renamed as: perf's ring_buffer -> perf_buffer ftrace's ring_buffer -> trace_buffer This implements the changes to the ring buffer that ftrace uses. Link: https://lore.kernel.org/r/20191213140531.116b3200@gandalf.local.home Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-12-06	Fix up for "printk: Drop pr_warning definition"	Stephen Rothwell	1	-1/+1
	Link: http://lkml.kernel.org/r/20191206092503.303d6a57@canb.auug.org.au Cc: Linux Next Mailing List <linux-next@vger.kernel.org> Cc: Linux Kernel Mailing List <linux-kernel@vger.kernel.org> Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Petr Mladek <pmladek@suse.com>
2019-12-04	tracing: Do not create directories if lockdown is in affect	Steven Rostedt (VMware)	1	-0/+6
	If lockdown is disabling tracing on boot up, it prevents the tracing files from even bering created. But when that happens, there's several places that will give a warning that the files were not created as that is usually a sign of a bug. Add in strategic locations where a check is made to see if tracing is disabled by lockdown, and if it is, do not go further, and fail silently (but print that tracing is disabled by lockdown, without doing a WARN_ON()). Cc: Matthew Garrett <mjg59@google.com> Fixes: 17911ff38aa5 ("tracing: Add locked_down checks to the open calls of files created for tracefs") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-05-28	ring-buffer: Remove HAVE_64BIT_ALIGNED_ACCESS	Steven Rostedt (VMware)	1	-13/+4
	Commit c19fa94a8fed ("Add HAVE_64BIT_ALIGNED_ACCESS") added the config for architectures that required 64bit aligned access for all 64bit words. As the ftrace ring buffer stores data on 4 byte alignment, this config option was used to force it to store data on 8 byte alignment to make sure the data being stored and written directly into the ring buffer was 8 byte aligned as it would cause issues trying to write an 8 byte word on a 4 not 8 byte aligned memory location. But with the removal of the metag architecture, which was the only architecture to use this, there is no architecture supported by Linux that requires 8 byte aligne access for all 8 byte words (4 byte alignment is good enough). Removing this config can simplify the code a bit. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-05-15	Merge tag 'trace-v5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace	Linus Torvalds	1	-1/+1
	Pull tracing updates from Steven Rostedt: "The major changes in this tracing update includes: - Removal of non-DYNAMIC_FTRACE from 32bit x86 - Removal of mcount support from x86 - Emulating a call from int3 on x86_64, fixes live kernel patching - Consolidated Tracing Error logs file Minor updates: - Removal of klp_check_compiler_support() - kdb ftrace dumping output changes - Accessing and creating ftrace instances from inside the kernel - Clean up of #define if macro - Introduction of TRACE_EVENT_NOP() to disable trace events based on config options And other minor fixes and clean ups" * tag 'trace-v5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (44 commits) x86: Hide the int3_emulate_call/jmp functions from UML livepatch: Remove klp_check_compiler_support() ftrace/x86: Remove mcount support ftrace/x86_32: Remove support for non DYNAMIC_FTRACE tracing: Simplify "if" macro code tracing: Fix documentation about disabling options using trace_options tracing: Replace kzalloc with kcalloc tracing: Fix partial reading of trace event's id file tracing: Allow RCU to run between postponed startup tests tracing: Fix white space issues in parse_pred() function tracing: Eliminate const char[] auto variables ring-buffer: Fix mispelling of Calculate tracing: probeevent: Fix to make the type of $comm string tracing: probeevent: Do not accumulate on ret variable tracing: uprobes: Re-enable $comm support for uprobe events ftrace/x86_64: Emulate call function while updating in breakpoint handler x86_64: Allow breakpoints to emulate call instructions x86_64: Add gap to int3 to allow for call emulation tracing: kdb: Allow ftdump to skip all but the last few entries tracing: Add trace_total_entries() / trace_total_entries_cpu() ...
2019-04-26	trace: Fix preempt_enable_no_resched() abuse	Peter Zijlstra	1	-1/+1
	Unless the very next line is schedule(), or implies it, one must not use preempt_enable_no_resched(). It can cause a preemption to go missing and thereby cause arbitrary delays, breaking the PREEMPT=y invariant. Link: http://lkml.kernel.org/r/20190423200318.GY14281@hirez.programming.kicks-ass.net Cc: Waiman Long <longman@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: the arch/x86 maintainers <x86@kernel.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: huang ying <huang.ying.caritas@gmail.com> Cc: Roman Gushchin <guro@fb.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: stable@vger.kernel.org Fixes: 2c2d7329d8af ("tracing/ftrace: use preempt_enable_no_resched_notrace in ring_buffer_time_stamp()") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-04-02	ring-buffer: Fix ring buffer size in rb_write_something()	YueHaibing	1	-1/+1
	'cnt' should be used to calculate ring buffer size rather than data->cnt Link: http://lkml.kernel.org/r/1537704693-184237-1-git-send-email-yuehaibing@huawei.com Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-03-13	tracing: kdb: Fix ftdump to not sleep	Douglas Anderson	1	-2/+3
	As reported back in 2016-11 [1], the "ftdump" kdb command triggers a BUG for "sleeping function called from invalid context". kdb's "ftdump" command wants to call ring_buffer_read_prepare() in atomic context. A very simple solution for this is to add allocation flags to ring_buffer_read_prepare() so kdb can call it without triggering the allocation error. This patch does that. Note that in the original email thread about this, it was suggested that perhaps the solution for kdb was to either preallocate the buffer ahead of time or create our own iterator. I'm hoping that this alternative of adding allocation flags to ring_buffer_read_prepare() can be considered since it means I don't need to duplicate more of the core trace code into "trace_kdb.c" (for either creating my own iterator or re-preparing a ring allocator whose memory was already allocated). NOTE: another option for kdb is to actually figure out how to make it reuse the existing ftrace_dump() function and totally eliminate the duplication. This sounds very appealing and actually works (the "sr z" command can be seen to properly dump the ftrace buffer). The downside here is that ftrace_dump() fully consumes the trace buffer. Unless that is changed I'd rather not use it because it means "ftdump \| grep xyz" won't be very useful to search the ftrace buffer since it will throw away the whole trace on the first grep. A future patch to dump only the last few lines of the buffer will also be hard to implement. [1] https://lkml.kernel.org/r/20161117191605.GA21459@google.com Link: http://lkml.kernel.org/r/20190308193205.213659-1-dianders@chromium.org Reported-by: Brian Norris <briannorris@chromium.org> Signed-off-by: Douglas Anderson <dianders@chromium.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-02-06	ring-buffer: Remove unused function ring_buffer_page_len()	Miroslav Benes	1	-14/+0
	Commit 6b7e633fe9c2 ("tracing: Remove extra zeroing out of the ring buffer page") removed the only caller of ring_buffer_page_len(). The function is now unused and may be removed. Link: http://lkml.kernel.org/r/20181228133847.106177-1-mbenes@suse.cz Signed-off-by: Miroslav Benes <mbenes@suse.cz> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-12-31	Merge tag 'trace-v4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace	Linus Torvalds	1	-13/+81
	Pull tracing updates from Steven Rostedt: - Rework of the kprobe/uprobe and synthetic events to consolidate all the dynamic event code. This will make changes in the future easier. - Partial rewrite of the function graph tracing infrastructure. This will allow for multiple users of hooking onto functions to get the callback (return) of the function. This is the ground work for having kprobes and function graph tracer using one code base. - Clean up of the histogram code that will facilitate adding more features to the histograms in the future. - Addition of str_has_prefix() and a few use cases. There currently is a similar function strstart() that is used in a few places, but only returns a bool and not a length. These instances will be removed in the future to use str_has_prefix() instead. - A few other various clean ups as well. * tag 'trace-v4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (57 commits) tracing: Use the return of str_has_prefix() to remove open coded numbers tracing: Have the historgram use the result of str_has_prefix() for len of prefix tracing: Use str_has_prefix() instead of using fixed sizes tracing: Use str_has_prefix() helper for histogram code string.h: Add str_has_prefix() helper function tracing: Make function ‘ftrace_exports’ static tracing: Simplify printf'ing in seq_print_sym tracing: Avoid -Wformat-nonliteral warning tracing: Merge seq_print_sym_short() and seq_print_sym_offset() tracing: Add hist trigger comments for variable-related fields tracing: Remove hist trigger synth_var_refs tracing: Use hist trigger's var_ref array to destroy var_refs tracing: Remove open-coding of hist trigger var_ref management tracing: Use var_refs[] for hist trigger reference checking tracing: Change strlen to sizeof for hist trigger static strings tracing: Remove unnecessary hist trigger struct field tracing: Fix ftrace_graph_get_ret_stack() to use task and not current seq_buf: Use size_t for len in seq_buf_puts() seq_buf: Make seq_buf_puts() null-terminate the buffer arm64: Use ftrace_graph_get_ret_stack() instead of curr_ret_stack ...
2018-12-08	tracing: Add tracefs file buffer_percentage	Steven Rostedt (VMware)	1	-16/+23
	Add a "buffer_percentage" file, that allows users to specify how much of the buffer (percentage of pages) need to be filled before waking up a task blocked on a per cpu trace_pipe_raw file. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-12-08	ring-buffer: Add percentage of ring buffer full to wake up reader	Steven Rostedt (VMware)	1	-5/+66
	Instead of just waiting for a page to be full before waking up a pending reader, allow the reader to pass in a "percentage" of pages that have content before waking up a reader. This should help keep the process of reading the events not cause wake ups that constantly cause reading of the buffer. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-11-27	tracing: Replace synchronize_sched() and call_rcu_sched()	Paul E. McKenney	1	-6/+6
	Now that synchronize_rcu() waits for preempt-disable regions of code as well as RCU read-side critical sections, synchronize_sched() can be replaced by synchronize_rcu(). Similarly, call_rcu_sched() can be replaced by call_rcu(). This commit therefore makes these changes. Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: <linux-kernel@vger.kernel.org> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-09-17	ring-buffer: Allow for rescheduling when removing pages	Vaibhav Nagarnaik	1	-0/+2
	When reducing ring buffer size, pages are removed by scheduling a work item on each CPU for the corresponding CPU ring buffer. After the pages are removed from ring buffer linked list, the pages are free()d in a tight loop. The loop does not give up CPU until all pages are removed. In a worst case behavior, when lot of pages are to be freed, it can cause system stall. After the pages are removed from the list, the free() can happen while the work is rescheduled. Call cond_resched() in the loop to prevent the system hangup. Link: http://lkml.kernel.org/r/20180907223129.71994-1-vnagarnaik@google.com Cc: stable@vger.kernel.org Fixes: 83f40318dab00 ("ring-buffer: Make removal of ring buffer pages atomic") Reported-by: Jason Behmer <jbehmer@google.com> Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-08-16	tracing: Add SPDX License format tags to tracing files	Steven Rostedt (VMware)	1	-0/+1
	Add the SPDX License header to ease license compliance management. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-08-01	ring-buffer: Make ring_buffer_record_is_set_on() return bool	Steven Rostedt (VMware)	1	-1/+1
	The value of ring_buffer_record_is_set_on() is either true or false, so have its return value be bool. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-08-01	ring-buffer: Make ring_buffer_record_is_on() return bool	Steven Rostedt (VMware)	1	-1/+1
	The value of ring_buffer_record_is_on() is either true or false, so have its return value be bool. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-07-25	ring_buffer: tracing: Inherit the tracing setting to next ring buffer	Masami Hiramatsu	1	-0/+16
	Maintain the tracing on/off setting of the ring_buffer when switching to the trace buffer snapshot. Taking a snapshot is done by swapping the backup ring buffer (max_tr_buffer). But since the tracing on/off setting is defined by the ring buffer, when swapping it, the tracing on/off setting can also be changed. This causes a strange result like below: /sys/kernel/debug/tracing # cat tracing_on 1 /sys/kernel/debug/tracing # echo 0 > tracing_on /sys/kernel/debug/tracing # cat tracing_on 0 /sys/kernel/debug/tracing # echo 1 > snapshot /sys/kernel/debug/tracing # cat tracing_on 1 /sys/kernel/debug/tracing # echo 1 > snapshot /sys/kernel/debug/tracing # cat tracing_on 0 We don't touch tracing_on, but snapshot changes tracing_on setting each time. This is an anomaly, because user doesn't know that each "ring_buffer" stores its own tracing-enable state and the snapshot is done by swapping ring buffers. Link: http://lkml.kernel.org/r/153149929558.11274.11730609978254724394.stgit@devbox Cc: Ingo Molnar <mingo@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Tom Zanussi <tom.zanussi@linux.intel.com> Cc: Hiraku Toyooka <hiraku.toyooka@cybertrust.co.jp> Cc: stable@vger.kernel.org Fixes: debdd57f5145 ("tracing: Make a snapshot feature available from userspace") Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> [ Updated commit log and comment in the code ] Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-06-04	ring-buffer: Fix a bunch of typos in comments	Steven Rostedt (VMware)	1	-10/+10
	An anonymous source sent me a bunch of typo fixes in the comments of ring_buffer.c file. That source did not want to be associated to this patch because they don't want to be known as "one of those" commiters (you know who you are!). They gave me permission to sign this off in my own name. Suggested-by: One-of-those-commiters@YouKnowWhoYouAre.org Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-04-06	ring-buffer: Add set/clear_current_oom_origin() during allocations	Steven Rostedt (VMware)	1	-10/+38
	As si_mem_available() can say there is enough memory even though the memory available is not useable by the ring buffer, it is best to not kill innocent applications because the ring buffer is taking up all the memory while it is trying to allocate a great deal of memory. If the allocator is user space (because kernel threads can also increase the size of the kernel ring buffer on boot up), then after si_mem_available() says there is enough memory, set the OOM killer to kill the current task if an OOM triggers during the allocation. Link: http://lkml.kernel.org/r/20180404062340.GD6312@dhcp22.suse.cz Suggested-by: Michal Hocko <mhocko@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-04-06	ring-buffer: Check if memory is available before allocation	Steven Rostedt (VMware)	1	-0/+5
	The ring buffer is made up of a link list of pages. When making the ring buffer bigger, it will allocate all the pages it needs before adding to the ring buffer, and if it fails, it frees them and returns an error. This makes increasing the ring buffer size an all or nothing action. When this was first created, the pages were allocated with "NORETRY". This was to not cause any Out-Of-Memory (OOM) actions from allocating the ring buffer. But NORETRY was too strict, as the ring buffer would fail to expand even when there's memory available, but was taken up in the page cache. Commit 848618857d253 ("tracing/ring_buffer: Try harder to allocate") changed the allocating from NORETRY to RETRY_MAYFAIL. The RETRY_MAYFAIL would allocate from the page cache, but if there was no memory available, it would simple fail the allocation and not trigger an OOM. This worked fine, but had one problem. As the ring buffer would allocate one page at a time, it could take up all memory in the system before it failed to allocate and free that memory. If the allocation is happening and the ring buffer allocates all memory and then tries to take more than available, its allocation will not trigger an OOM, but if there's any allocation that happens someplace else, that could trigger an OOM, even though once the ring buffer's allocation fails, it would free up all the previous memory it tried to allocate, and allow other memory allocations to succeed. Commit d02bd27bd33dd ("mm/page_alloc.c: calculate 'available' memory in a separate function") separated out si_mem_availble() as a separate function that could be used to see how much memory is available in the system. Using this function to make sure that the ring buffer could be allocated before it tries to allocate pages we can avoid allocating all memory in the system and making it vulnerable to OOMs if other allocations are taking place. Link: http://lkml.kernel.org/r/1522320104-6573-1-git-send-email-zhaoyang.huang@spreadtrum.com CC: stable@vger.kernel.org Cc: linux-mm@kvack.org Fixes: 848618857d253 ("tracing/ring_buffer: Try harder to allocate") Requires: d02bd27bd33dd ("mm/page_alloc.c: calculate 'available' memory in a separate function") Reported-by: Zhaoyang Huang <huangzhaoyang@gmail.com> Tested-by: Joel Fernandes <joelaf@google.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-04-06	tracing: Mention trace_clock=global when warning about unstable clocks	Chris Wilson	1	-1/+2
	Mention the alternative of adding trace_clock=global to the kernel command line when we detect that we've used an unstable clock across a suspend/resume cycle. Link: http://lkml.kernel.org/r/20180330150132.16903-2-chris@chris-wilson.co.uk Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-03-10	ring-buffer: Add nesting for adding events within events	Steven Rostedt (VMware)	1	-3/+54
	The ring-buffer code has recusion protection in case tracing ends up tracing itself, the ring-buffer will detect that it was called at the same context (normal, softirq, interrupt or NMI), and not continue to record the event. With the histogram synthetic events, they are called while tracing another event at the same context. The recusion protection triggers because it detects tracing at the same context and stops it. Add ring_buffer_nest_start() and ring_buffer_nest_end() that will notify the ring buffer that a trace is about to happen within another trace and that it is intended, and not to trigger the recursion blocking. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-03-10	ring-buffer: Redefine the unimplemented RINGBUF_TYPE_TIME_STAMP	Tom Zanussi	1	-28/+76
	RINGBUF_TYPE_TIME_STAMP is defined but not used, and from what I can gather was reserved for something like an absolute timestamp feature for the ring buffer, if not a complete replacement of the current time_delta scheme. This code redefines RINGBUF_TYPE_TIME_STAMP to implement absolute time stamps. Another way to look at it is that it essentially forces extended time_deltas for all events. The motivation for doing this is to enable time_deltas that aren't dependent on previous events in the ring buffer, making it feasible to use the ring_buffer_event timetamps in a more random-access way, for purposes other than serial event printing. To set/reset this mode, use tracing_set_timestamp_abs() from the previous interface patch. Link: http://lkml.kernel.org/r/477b362dba1ce7fab9889a1a8e885a62c472f041.1516069914.git.tom.zanussi@linux.intel.com Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-03-10	ring-buffer: Add interface for setting absolute time stamps	Tom Zanussi	1	-0/+11
	Define a new function, tracing_set_time_stamp_abs(), which can be used to enable or disable the use of absolute timestamps rather than time deltas for a trace array. Only the interface is added here; a subsequent patch will add the underlying implementation. Link: http://lkml.kernel.org/r/ce96119de44c7fe0ee44786d15254e9b493040d3.1516069914.git.tom.zanussi@linux.intel.com Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com> Signed-off-by: Baohong Liu <baohong.liu@intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-02-11	vfs: do bulk POLL* -> EPOLL* replacement	Linus Torvalds	1	-2/+2
	This is the mindless scripted replacement of kernel use of POLL* variables as described by Al, done by this script: for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do L=`git grep -l -w POLL$V \| grep -v '^t' \| grep -v /um/ \| grep -v '^sa' \| grep -v '/poll.h$'\|grep -v '^D'` for f in $L; do sed -i "-es/^$[^\"]$$\<POLL$V\>$/\\1E\\2/" $f; done done with de-mangling cleanups yet to come. NOTE! On almost all architectures, the EPOLL constants have the same values as the POLL* constants do. But they keyword here is "almost". For various bad reasons they aren't the same, and epoll() doesn't actually work quite correctly in some cases due to this on Sparc et al. The next patch from Al will sort out the final differences, and we should be all done. Scripted-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-30	Merge branch 'misc.poll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	Linus Torvalds	1	-1/+1
	Pull poll annotations from Al Viro: "This introduces a __bitwise type for POLL### bitmap, and propagates the annotations through the tree. Most of that stuff is as simple as 'make ->poll() instances return __poll_t and do the same to local variables used to hold the future return value'. Some of the obvious brainos found in process are fixed (e.g. POLLIN misspelled as POLL_IN). At that point the amount of sparse warnings is low and most of them are for genuine bugs - e.g. ->poll() instance deciding to return -EINVAL instead of a bitmap. I hadn't touched those in this series - it's large enough as it is. Another problem it has caught was eventpoll() ABI mess; select.c and eventpoll.c assumed that corresponding POLL### and EPOLL### were equal. That's true for some, but not all of them - EPOLL### are arch-independent, but POLL### are not. The last commit in this series separates userland POLL### values from the (now arch-independent) kernel-side ones, converting between them in the few places where they are copied to/from userland. AFAICS, this is the least disruptive fix preserving poll(2) ABI and making epoll() work on all architectures. As it is, it's simply broken on sparc - try to give it EPOLLWRNORM and it will trigger only on what would've triggered EPOLLWRBAND on other architectures. EPOLLWRBAND and EPOLLRDHUP, OTOH, are never triggered at all on sparc. With this patch they should work consistently on all architectures" * 'misc.poll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits) make kernel-side POLL... arch-independent eventpoll: no need to mask the result of epi_item_poll() again eventpoll: constify struct epoll_event pointers debugging printk in sg_poll() uses %x to print POLL... bitmap annotate poll(2) guts 9p: untangle ->poll() mess ->si_band gets POLL... bitmap stored into a user-visible long field ring_buffer_poll_wait() return value used as return value of ->poll() the rest of drivers/*: annotate ->poll() instances media: annotate ->poll() instances fs: annotate ->poll() instances ipc, kernel, mm: annotate ->poll() instances net: annotate ->poll() instances apparmor: annotate ->poll() instances tomoyo: annotate ->poll() instances sound: annotate ->poll() instances acpi: annotate ->poll() instances crypto: annotate ->poll() instances block: annotate ->poll() instances x86: annotate ->poll() instances ...
2018-01-18	ring-buffer: Fix duplicate results in mapping context to bits in recursive lock	Steven Rostedt (VMware)	1	-2/+1
	In bringing back the context checks, the code checks first if its normal (non-interrupt) context, and then for NMI then IRQ then softirq. The final check is redundant. Since the if branch is only hit if the context is one of NMI, IRQ, or SOFTIRQ, if it's not NMI or IRQ there's no reason to check if it is SOFTIRQ. The current code returns the same result even if its not a SOFTIRQ. Which is confusing. pc & SOFTIRQ_OFFSET ? 2 : RB_CTX_SOFTIRQ Is redundant as RB_CTX_SOFTIRQ is 2! Fixes: a0e3a18f4baf ("ring-buffer: Bring back context level recursive checks") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-01-15	ring-buffer: Bring back context level recursive checks	Steven Rostedt (VMware)	1	-17/+45
	Commit 1a149d7d3f45 ("ring-buffer: Rewrite trace_recursive_(un)lock() to be simpler") replaced the context level recursion checks with a simple counter. This would prevent the ring buffer code from recursively calling itself more than the max number of contexts that exist (Normal, softirq, irq, nmi). But this change caused a lockup in a specific case, which was during suspend and resume using a global clock. Adding a stack dump to see where this occurred, the issue was in the trace global clock itself: trace_buffer_lock_reserve+0x1c/0x50 __trace_graph_entry+0x2d/0x90 trace_graph_entry+0xe8/0x200 prepare_ftrace_return+0x69/0xc0 ftrace_graph_caller+0x78/0xa8 queued_spin_lock_slowpath+0x5/0x1d0 trace_clock_global+0xb0/0xc0 ring_buffer_lock_reserve+0xf9/0x390 The function graph tracer traced queued_spin_lock_slowpath that was called by trace_clock_global. This pointed out that the trace_clock_global() is not reentrant, as it takes a spin lock. It depended on the ring buffer recursive lock from letting that happen. By removing the context detection and adding just a max number of allowable recursions, it allowed the trace_clock_global() to be entered again and try to retake the spinlock it already held, causing a deadlock. Fixes: 1a149d7d3f45 ("ring-buffer: Rewrite trace_recursive_(un)lock() to be simpler") Reported-by: David Weinehall <david.weinehall@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-12-27	ring-buffer: Do no reuse reader page if still in use	Steven Rostedt (VMware)	1	-0/+6
	To free the reader page that is allocated with ring_buffer_alloc_read_page(), ring_buffer_free_read_page() must be called. For faster performance, this page can be reused by the ring buffer to avoid having to free and allocate new pages. The issue arises when the page is used with a splice pipe into the networking code. The networking code may up the page counter for the page, and keep it active while sending it is queued to go to the network. The incrementing of the page ref does not prevent it from being reused in the ring buffer, and this can cause the page that is being sent out to the network to be modified before it is sent by reading new data. Add a check to the page ref counter, and only reuse the page if it is not being used anywhere else. Cc: stable@vger.kernel.org Fixes: 73a757e63114d ("ring-buffer: Return reader page back into existing ring buffer") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-12-27	ring-buffer: Mask out the info bits when returning buffer page length	Steven Rostedt (VMware)	1	-1/+5
	Two info bits were added to the "commit" part of the ring buffer data page when returned to be consumed. This was to inform the user space readers that events have been missed, and that the count may be stored at the end of the page. What wasn't handled, was the splice code that actually called a function to return the length of the data in order to zero out the rest of the page before sending it up to user space. These data bits were returned with the length making the value negative, and that negative value was not checked. It was compared to PAGE_SIZE, and only used if the size was less than PAGE_SIZE. Luckily PAGE_SIZE is unsigned long which made the compare an unsigned compare, meaning the negative size value did not end up causing a large portion of memory to be randomly zeroed out. Cc: stable@vger.kernel.org Fixes: 66a8cb95ed040 ("ring-buffer: Add place holder recording of dropped events") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>