Diffstat (limited to 'documentation/profile-manual/profile-manual-usage.rst')
-rw-r--r-- | documentation/profile-manual/profile-manual-usage.rst | 2624 |
1 files changed, 2624 insertions, 0 deletions
diff --git a/documentation/profile-manual/profile-manual-usage.rst b/documentation/profile-manual/profile-manual-usage.rst new file mode 100644 index 0000000000..32b04f6ff7 --- /dev/null +++ b/documentation/profile-manual/profile-manual-usage.rst | |||
@@ -0,0 +1,2624 @@ | |||
1 | .. SPDX-License-Identifier: CC-BY-2.0-UK | ||
2 | .. highlight:: shell | ||
3 | |||
4 | *************************************************************** | ||
5 | Basic Usage (with examples) for each of the Yocto Tracing Tools | ||
6 | *************************************************************** | ||
7 | |||
8 | | | ||
9 | |||
10 | This chapter presents basic usage examples for each of the tracing | ||
11 | tools. | ||
12 | |||
13 | .. _profile-manual-perf: | ||
14 | |||
15 | perf | ||
16 | ==== | ||
17 | |||
18 | The 'perf' tool is the profiling and tracing tool that comes bundled | ||
19 | with the Linux kernel. | ||
20 | |||
21 | Don't let the fact that it's part of the kernel fool you into thinking | ||
22 | that it's only for tracing and profiling the kernel - you can indeed use | ||
23 | it to trace and profile just the kernel, but you can also use it to | ||
24 | profile specific applications separately (with or without kernel | ||
25 | context), and you can also use it to trace and profile the kernel and | ||
26 | all applications on the system simultaneously to gain a system-wide view | ||
27 | of what's going on. | ||
28 | |||
29 | In many ways, perf aims to be a superset of all the tracing and | ||
30 | profiling tools available in Linux today, including all the other tools | ||
31 | covered in this HOWTO. The past couple of years have seen perf subsume a | ||
32 | lot of the functionality of those other tools and, at the same time, | ||
33 | those other tools have removed large portions of their previous | ||
34 | functionality and replaced it with calls to the equivalent functionality | ||
35 | now implemented by the perf subsystem. Extrapolation suggests that at | ||
36 | some point those other tools will simply become completely redundant and | ||
37 | go away; until then, we'll cover those other tools in these pages and in | ||
38 | many cases show how the same things can be accomplished in perf and the | ||
39 | other tools when it seems useful to do so. | ||
40 | |||
41 | The coverage below details some of the most common ways you'll likely | ||
42 | want to apply the tool; full documentation can be found either within | ||
43 | the tool itself or in the man pages at | ||
44 | `perf(1) <http://linux.die.net/man/1/perf>`__. | ||
45 | |||
46 | .. _perf-setup: | ||
47 | |||
48 | Perf Setup | ||
49 | ---------- | ||
50 | |||
51 | For this section, we'll assume you've already performed the basic setup | ||
52 | outlined in the ":ref:`profile-manual/profile-manual-intro:General Setup`" section. | ||
53 | |||
54 | In particular, you'll get the most mileage out of perf if you profile an | ||
55 | image built with the following in your ``local.conf`` file: :: | ||
56 | |||
57 | INHIBIT_PACKAGE_STRIP = "1" | ||
58 | |||
59 | perf runs on the target system for the most part. You can archive | ||
60 | profile data and copy it to the host for analysis, but for the rest of | ||
61 | this document we assume you've ssh'ed to the target and will be running | ||
62 | the perf commands there. | ||
63 | |||
64 | .. _perf-basic-usage: | ||
65 | |||
66 | Basic Perf Usage | ||
67 | ---------------- | ||
68 | |||
69 | The perf tool is pretty much self-documenting. To remind yourself of the | ||
70 | available commands, simply type 'perf', which will show you basic usage | ||
71 | along with the available perf subcommands: :: | ||
72 | |||
73 | root@crownbay:~# perf | ||
74 | |||
75 | usage: perf [--version] [--help] COMMAND [ARGS] | ||
76 | |||
77 | The most commonly used perf commands are: | ||
78 | annotate Read perf.data (created by perf record) and display annotated code | ||
79 | archive Create archive with object files with build-ids found in perf.data file | ||
80 | bench General framework for benchmark suites | ||
81 | buildid-cache Manage build-id cache. | ||
82 | buildid-list List the buildids in a perf.data file | ||
83 | diff Read two perf.data files and display the differential profile | ||
84 | evlist List the event names in a perf.data file | ||
85 | inject Filter to augment the events stream with additional information | ||
86 | kmem Tool to trace/measure kernel memory(slab) properties | ||
87 | kvm Tool to trace/measure kvm guest os | ||
88 | list List all symbolic event types | ||
89 | lock Analyze lock events | ||
90 | probe Define new dynamic tracepoints | ||
91 | record Run a command and record its profile into perf.data | ||
92 | report Read perf.data (created by perf record) and display the profile | ||
93 | sched Tool to trace/measure scheduler properties (latencies) | ||
94 | script Read perf.data (created by perf record) and display trace output | ||
95 | stat Run a command and gather performance counter statistics | ||
96 | test Runs sanity tests. | ||
97 | timechart Tool to visualize total system behavior during a workload | ||
98 | top System profiling tool. | ||
99 | |||
100 | See 'perf help COMMAND' for more information on a specific command. | ||
101 | |||
102 | |||
103 | Using perf to do Basic Profiling | ||
104 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
105 | |||
106 | As a simple test case, we'll profile the 'wget' of a fairly large file, | ||
107 | which is a minimally interesting case because it has both file and | ||
108 | network I/O aspects, and at least in the case of standard Yocto images, | ||
109 | it's implemented as part of busybox, so the methods we use to analyze it | ||
110 | can be used in a very similar way to the whole host of supported busybox | ||
111 | applets in Yocto. :: | ||
112 | |||
113 | root@crownbay:~# rm linux-2.6.19.2.tar.bz2; \ | ||
114 | wget http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2 | ||
115 | |||
116 | The quickest and easiest way to get some basic overall data about what's | ||
117 | going on for a particular workload is to profile it using 'perf stat'. | ||
118 | 'perf stat' basically profiles using a few default counters and displays | ||
119 | the summed counts at the end of the run: :: | ||
120 | |||
121 | root@crownbay:~# perf stat wget http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2 | ||
122 | Connecting to downloads.yoctoproject.org (140.211.169.59:80) | ||
123 | linux-2.6.19.2.tar.b 100% |***************************************************| 41727k 0:00:00 ETA | ||
124 | |||
125 | Performance counter stats for 'wget http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2': | ||
126 | |||
127 | 4597.223902 task-clock # 0.077 CPUs utilized | ||
128 | 23568 context-switches # 0.005 M/sec | ||
129 | 68 CPU-migrations # 0.015 K/sec | ||
130 | 241 page-faults # 0.052 K/sec | ||
131 | 3045817293 cycles # 0.663 GHz | ||
132 | <not supported> stalled-cycles-frontend | ||
133 | <not supported> stalled-cycles-backend | ||
134 | 858909167 instructions # 0.28 insns per cycle | ||
135 | 165441165 branches # 35.987 M/sec | ||
136 | 19550329 branch-misses # 11.82% of all branches | ||
137 | |||
138 | 59.836627620 seconds time elapsed | ||
139 | |||
140 | Many times such a simple-minded test doesn't yield much of | ||
141 | interest, but sometimes it does (see Real-world Yocto bug (slow | ||
142 | loop-mounted write speed)). | ||
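
The figures that 'perf stat' prints after the '#' are simple ratios of the raw counts. As a minimal sketch, we can recompute a few of them from the numbers in the run above. The helper names ``ipc``, ``miss_pct`` and ``cpus_utilized`` are made up for illustration and are not part of perf, and the last one assumes task-clock is reported in milliseconds:

```shell
# Hypothetical helpers (not part of perf) that recompute the derived
# figures 'perf stat' shows after the '#', using raw counts as input.

# instructions per cycle: $1 = instructions, $2 = cycles
ipc() { awk -v i="$1" -v c="$2" 'BEGIN { printf "%.2f\n", i / c }'; }

# miss percentage: $1 = misses, $2 = total events
miss_pct() { awk -v m="$1" -v t="$2" 'BEGIN { printf "%.2f\n", 100 * m / t }'; }

# CPUs utilized: $1 = task-clock (assumed msec), $2 = elapsed seconds
cpus_utilized() { awk -v t="$1" -v e="$2" 'BEGIN { printf "%.3f\n", (t / 1000) / e }'; }

ipc 858909167 3045817293                # prints 0.28
miss_pct 19550329 165441165             # prints 11.82
cpus_utilized 4597.223902 59.836627620  # prints 0.077
```

A CPU utilization this low suggests that the workload spends most of its elapsed time waiting (here, presumably on the network) rather than executing.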
143 | |||
144 | Also, note that 'perf stat' isn't restricted to a fixed set of counters | ||
145 | - basically any event listed in the output of 'perf list' can be tallied | ||
146 | by 'perf stat'. For example, suppose we wanted to see a summary of all | ||
147 | the events related to kernel memory allocation/freeing along with cache | ||
148 | hits and misses: :: | ||
149 | |||
150 | root@crownbay:~# perf stat -e kmem:* -e cache-references -e cache-misses wget http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2 | ||
151 | Connecting to downloads.yoctoproject.org (140.211.169.59:80) | ||
152 | linux-2.6.19.2.tar.b 100% |***************************************************| 41727k 0:00:00 ETA | ||
153 | |||
154 | Performance counter stats for 'wget http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2': | ||
155 | |||
156 | 5566 kmem:kmalloc | ||
157 | 125517 kmem:kmem_cache_alloc | ||
158 | 0 kmem:kmalloc_node | ||
159 | 0 kmem:kmem_cache_alloc_node | ||
160 | 34401 kmem:kfree | ||
161 | 69920 kmem:kmem_cache_free | ||
162 | 133 kmem:mm_page_free | ||
163 | 41 kmem:mm_page_free_batched | ||
164 | 11502 kmem:mm_page_alloc | ||
165 | 11375 kmem:mm_page_alloc_zone_locked | ||
166 | 0 kmem:mm_page_pcpu_drain | ||
167 | 0 kmem:mm_page_alloc_extfrag | ||
168 | 66848602 cache-references | ||
169 | 2917740 cache-misses # 4.365 % of all cache refs | ||
170 | |||
171 | 44.831023415 seconds time elapsed | ||
172 | |||
173 | So 'perf stat' gives us a nice easy | ||
174 | way to get a quick overview of what might be happening for a set of | ||
175 | events, but normally we'd need a little more detail in order to | ||
176 | understand what's going on well enough to act on it. | ||
177 | |||
178 | To dive down into the next level of detail, we can use 'perf record'/'perf | ||
179 | report', which will collect profiling data and present it to us using an | ||
180 | interactive text-based UI (or simply as text if we specify --stdio to | ||
181 | 'perf report'). | ||
182 | |||
183 | As our first attempt at profiling this workload, we'll simply run 'perf | ||
184 | record', handing it the workload we want to profile (everything after | ||
185 | 'perf record' and any perf options we hand it - here none - will be | ||
186 | executed in a new shell). perf collects samples until the process exits | ||
187 | and records them in a file named 'perf.data' in the current working | ||
188 | directory. :: | ||
189 | |||
190 | root@crownbay:~# perf record wget http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2 | ||
191 | |||
192 | Connecting to downloads.yoctoproject.org (140.211.169.59:80) | ||
193 | linux-2.6.19.2.tar.b 100% |************************************************| 41727k 0:00:00 ETA | ||
194 | [ perf record: Woken up 1 times to write data ] | ||
195 | [ perf record: Captured and wrote 0.176 MB perf.data (~7700 samples) ] | ||
196 | |||
197 | To see the results in a | ||
198 | 'text-based UI' (tui), simply run 'perf report', which will read the | ||
199 | perf.data file in the current working directory and display the results | ||
200 | in an interactive UI: :: | ||
201 | |||
202 | root@crownbay:~# perf report | ||
203 | |||
204 | .. image:: figures/perf-wget-flat-stripped.png | ||
205 | :align: center | ||
206 | |||
207 | The above screenshot displays a 'flat' profile, one entry for each | ||
208 | 'bucket' corresponding to the functions that were profiled during the | ||
209 | profiling run, ordered from the most popular to the least (perf has | ||
210 | options to sort in various orders and keys as well as display entries | ||
211 | only above a certain threshold and so on - see the perf documentation | ||
212 | for details). Note that this includes both userspace functions (entries | ||
213 | containing a [.]) and kernel functions accounted to the process (entries | ||
214 | containing a [k]). (perf has command-line modifiers that can be used to | ||
215 | restrict the profiling to kernel or userspace, among others). | ||
216 | |||
217 | Notice also that the above report shows an entry for 'busybox', which is | ||
218 | the executable that implements 'wget' in Yocto, but that instead of a | ||
219 | useful function name in that entry, it displays a not-so-friendly hex | ||
220 | value instead. The steps below will show how to fix that problem. | ||
221 | |||
222 | Before we do that, however, let's try running a different profile, one | ||
223 | which shows something a little more interesting. The only difference | ||
224 | between the new profile and the previous one is that we'll add the -g | ||
225 | option, which will record not just the address of a sampled function, | ||
226 | but the entire callchain to the sampled function as well: :: | ||
227 | |||
228 | root@crownbay:~# perf record -g wget http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2 | ||
229 | Connecting to downloads.yoctoproject.org (140.211.169.59:80) | ||
230 | linux-2.6.19.2.tar.b 100% |************************************************| 41727k 0:00:00 ETA | ||
231 | [ perf record: Woken up 3 times to write data ] | ||
232 | [ perf record: Captured and wrote 0.652 MB perf.data (~28476 samples) ] | ||
233 | |||
234 | |||
235 | root@crownbay:~# perf report | ||
236 | |||
237 | .. image:: figures/perf-wget-g-copy-to-user-expanded-stripped.png | ||
238 | :align: center | ||
239 | |||
240 | Using the callgraph view, we can actually see not only which functions | ||
241 | took the most time, but we can also see a summary of how those functions | ||
242 | were called and learn something about how the program interacts with the | ||
243 | kernel in the process. | ||
244 | |||
245 | Notice that each entry in the above screenshot now contains a '+' on the | ||
246 | left-hand side. This means that we can expand the entry and drill down | ||
247 | into the callchains that feed into that entry. Pressing 'enter' on any | ||
248 | one of them will expand the callchain (you can also press 'E' to expand | ||
249 | them all at the same time or 'C' to collapse them all). | ||
250 | |||
251 | In the screenshot above, we've toggled the ``__copy_to_user_ll()`` entry | ||
252 | and several subnodes all the way down. This lets us see which callchains | ||
253 | contributed to the profiled ``__copy_to_user_ll()`` function which | ||
254 | contributed 1.77% to the total profile. | ||
255 | |||
256 | As a bit of background explanation for these callchains, think about | ||
257 | what happens at a high level when you run wget to get a file out on the | ||
258 | network. Basically what happens is that the data comes into the kernel | ||
259 | via the network connection (socket) and is passed to the userspace | ||
260 | program 'wget' (which is actually a part of busybox, but that's not | ||
261 | important for now), which takes the buffers the kernel passes to it and | ||
262 | writes them to a disk file to save them. | ||
263 | |||
264 | The part of this process that we're looking at in the above call stacks | ||
265 | is the part where the kernel passes the data it's read from the socket | ||
266 | down to wget, i.e. a copy-to-user. | ||
267 | |||
268 | Notice also that there's a case here where a hex value is | ||
269 | displayed in the callstack, in the expanded ``sys_clock_gettime()`` | ||
270 | function. Later we'll see it resolve to a userspace function call in | ||
271 | busybox. | ||
272 | |||
273 | .. image:: figures/perf-wget-g-copy-from-user-expanded-stripped.png | ||
274 | :align: center | ||
275 | |||
276 | The above screenshot shows the other half of the journey for the data - | ||
277 | from the wget program's userspace buffers to disk. To get the buffers to | ||
278 | disk, the wget program issues a ``write(2)``, which does a ``copy-from-user`` to | ||
279 | the kernel, which then takes care via some circuitous path (probably | ||
280 | also present somewhere in the profile data), to get it safely to disk. | ||
281 | |||
282 | Now that we've seen the basic layout of the profile data and the basics | ||
283 | of how to extract useful information out of it, let's get back to the | ||
284 | task at hand and see if we can get some basic idea about where the time | ||
285 | is spent in the program we're profiling, wget. Remember that wget is | ||
286 | actually implemented as an applet in busybox, so while the process name | ||
287 | is 'wget', the executable we're actually interested in is busybox. So | ||
288 | let's expand the first entry containing busybox: | ||
289 | |||
290 | .. image:: figures/perf-wget-busybox-expanded-stripped.png | ||
291 | :align: center | ||
292 | |||
293 | Again, before we expanded we saw that the function was labeled with a | ||
294 | hex value instead of a symbol as with most of the kernel entries. | ||
295 | Expanding the busybox entry doesn't make it any better. | ||
296 | |||
297 | The problem is that perf can't find the symbol information for the | ||
298 | busybox binary, which is actually stripped out by the Yocto build | ||
299 | system. | ||
300 | |||
301 | One way around that is to put the following in your ``local.conf`` file | ||
302 | when you build the image: :: | ||
303 | |||
304 | INHIBIT_PACKAGE_STRIP = "1" | ||
305 | |||
306 | However, we already have an image with the binaries stripped, so | ||
307 | what can we do to get perf to resolve the symbols? Basically we need to | ||
308 | install the debuginfo for the busybox package. | ||
309 | |||
310 | To generate the debug info for the packages in the image, we can add | ||
311 | ``dbg-pkgs`` to :term:`EXTRA_IMAGE_FEATURES` in ``local.conf``. For example: :: | ||
312 | |||
313 | EXTRA_IMAGE_FEATURES = "debug-tweaks tools-profile dbg-pkgs" | ||
314 | |||
315 | Additionally, in order to generate the type of debuginfo that perf | ||
316 | understands, we also need to set | ||
317 | :term:`PACKAGE_DEBUG_SPLIT_STYLE` | ||
318 | in the ``local.conf`` file: :: | ||
319 | |||
320 | PACKAGE_DEBUG_SPLIT_STYLE = 'debug-file-directory' | ||
321 | |||
322 | Once we've done that, we can install the | ||
323 | debuginfo for busybox. The debug packages once built can be found in | ||
324 | ``build/tmp/deploy/rpm/*`` on the host system. Find the busybox-dbg-...rpm | ||
325 | file and copy it to the target. For example: :: | ||
326 | |||
327 | [trz@empanada core2]$ scp /home/trz/yocto/crownbay-tracing-dbg/build/tmp/deploy/rpm/core2_32/busybox-dbg-1.20.2-r2.core2_32.rpm root@192.168.1.31: | ||
328 | busybox-dbg-1.20.2-r2.core2_32.rpm 100% 1826KB 1.8MB/s 00:01 | ||
329 | |||
330 | Now install the debug rpm on the target: :: | ||
331 | |||
332 | root@crownbay:~# rpm -i busybox-dbg-1.20.2-r2.core2_32.rpm | ||
333 | |||
334 | Now that the debuginfo is installed, we see that the busybox entries now display | ||
335 | their functions symbolically: | ||
336 | |||
337 | .. image:: figures/perf-wget-busybox-debuginfo.png | ||
338 | :align: center | ||
339 | |||
340 | If we expand one of the entries and press 'enter' on a leaf node, we're | ||
341 | presented with a menu of actions we can take to get more information | ||
342 | related to that entry: | ||
343 | |||
344 | .. image:: figures/perf-wget-busybox-dso-zoom-menu.png | ||
345 | :align: center | ||
346 | |||
347 | One of these actions allows us to show a view that displays a | ||
348 | busybox-centric view of the profiled functions (in this case we've also | ||
349 | expanded all the nodes using the 'E' key): | ||
350 | |||
351 | .. image:: figures/perf-wget-busybox-dso-zoom.png | ||
352 | :align: center | ||
353 | |||
354 | Finally, we can see that now that the busybox debuginfo is installed, | ||
355 | the unresolved symbol in the ``sys_clock_gettime()`` entry | ||
356 | mentioned earlier is now resolved, and shows that the | ||
357 | ``sys_clock_gettime`` system call that was the source of 6.75% of the | ||
358 | copy-to-user overhead was initiated by the ``handle_input()`` busybox | ||
359 | function: | ||
360 | |||
361 | .. image:: figures/perf-wget-g-copy-to-user-expanded-debuginfo.png | ||
362 | :align: center | ||
363 | |||
364 | At the lowest level of detail, we can dive down to the assembly level | ||
365 | and see which instructions caused the most overhead in a function. | ||
366 | Pressing 'enter' on the 'udhcpc_main' function, we're again presented | ||
367 | with a menu: | ||
368 | |||
369 | .. image:: figures/perf-wget-busybox-annotate-menu.png | ||
370 | :align: center | ||
371 | |||
372 | Selecting 'Annotate udhcpc_main', we get a detailed listing of | ||
373 | percentages by instruction for the udhcpc_main function. From the | ||
374 | display, we can see that over 50% of the time spent in this function is | ||
375 | taken up by a couple of tests and the move of a constant (1) to a register: | ||
376 | |||
377 | .. image:: figures/perf-wget-busybox-annotate-udhcpc.png | ||
378 | :align: center | ||
379 | |||
380 | As a segue into tracing, let's try another profile using a different | ||
381 | counter, something other than the default 'cycles'. | ||
382 | |||
383 | The tracing and profiling infrastructure in Linux has become unified in | ||
384 | a way that allows us to use the same tool with a completely different | ||
385 | set of counters, not just the standard hardware counters that | ||
386 | traditional tools have had to restrict themselves to (of course the | ||
387 | traditional tools can also make use of the expanded possibilities now | ||
388 | available to them, and in some cases have, as mentioned previously). | ||
389 | |||
390 | We can get a list of the available events that can be used to profile a | ||
391 | workload via 'perf list': :: | ||
392 | |||
393 | root@crownbay:~# perf list | ||
394 | |||
395 | List of pre-defined events (to be used in -e): | ||
396 | cpu-cycles OR cycles [Hardware event] | ||
397 | stalled-cycles-frontend OR idle-cycles-frontend [Hardware event] | ||
398 | stalled-cycles-backend OR idle-cycles-backend [Hardware event] | ||
399 | instructions [Hardware event] | ||
400 | cache-references [Hardware event] | ||
401 | cache-misses [Hardware event] | ||
402 | branch-instructions OR branches [Hardware event] | ||
403 | branch-misses [Hardware event] | ||
404 | bus-cycles [Hardware event] | ||
405 | ref-cycles [Hardware event] | ||
406 | |||
407 | cpu-clock [Software event] | ||
408 | task-clock [Software event] | ||
409 | page-faults OR faults [Software event] | ||
410 | minor-faults [Software event] | ||
411 | major-faults [Software event] | ||
412 | context-switches OR cs [Software event] | ||
413 | cpu-migrations OR migrations [Software event] | ||
414 | alignment-faults [Software event] | ||
415 | emulation-faults [Software event] | ||
416 | |||
417 | L1-dcache-loads [Hardware cache event] | ||
418 | L1-dcache-load-misses [Hardware cache event] | ||
419 | L1-dcache-prefetch-misses [Hardware cache event] | ||
420 | L1-icache-loads [Hardware cache event] | ||
421 | L1-icache-load-misses [Hardware cache event] | ||
422 | . | ||
423 | . | ||
424 | . | ||
425 | rNNN [Raw hardware event descriptor] | ||
426 | cpu/t1=v1[,t2=v2,t3 ...]/modifier [Raw hardware event descriptor] | ||
427 | (see 'perf list --help' on how to encode it) | ||
428 | |||
429 | mem:<addr>[:access] [Hardware breakpoint] | ||
430 | |||
431 | sunrpc:rpc_call_status [Tracepoint event] | ||
432 | sunrpc:rpc_bind_status [Tracepoint event] | ||
433 | sunrpc:rpc_connect_status [Tracepoint event] | ||
434 | sunrpc:rpc_task_begin [Tracepoint event] | ||
435 | skb:kfree_skb [Tracepoint event] | ||
436 | skb:consume_skb [Tracepoint event] | ||
437 | skb:skb_copy_datagram_iovec [Tracepoint event] | ||
438 | net:net_dev_xmit [Tracepoint event] | ||
439 | net:net_dev_queue [Tracepoint event] | ||
440 | net:netif_receive_skb [Tracepoint event] | ||
441 | net:netif_rx [Tracepoint event] | ||
442 | napi:napi_poll [Tracepoint event] | ||
443 | sock:sock_rcvqueue_full [Tracepoint event] | ||
444 | sock:sock_exceed_buf_limit [Tracepoint event] | ||
445 | udp:udp_fail_queue_rcv_skb [Tracepoint event] | ||
446 | hda:hda_send_cmd [Tracepoint event] | ||
447 | hda:hda_get_response [Tracepoint event] | ||
448 | hda:hda_bus_reset [Tracepoint event] | ||
449 | scsi:scsi_dispatch_cmd_start [Tracepoint event] | ||
450 | scsi:scsi_dispatch_cmd_error [Tracepoint event] | ||
451 | scsi:scsi_eh_wakeup [Tracepoint event] | ||
452 | drm:drm_vblank_event [Tracepoint event] | ||
453 | drm:drm_vblank_event_queued [Tracepoint event] | ||
454 | drm:drm_vblank_event_delivered [Tracepoint event] | ||
455 | random:mix_pool_bytes [Tracepoint event] | ||
456 | random:mix_pool_bytes_nolock [Tracepoint event] | ||
457 | random:credit_entropy_bits [Tracepoint event] | ||
458 | gpio:gpio_direction [Tracepoint event] | ||
459 | gpio:gpio_value [Tracepoint event] | ||
460 | block:block_rq_abort [Tracepoint event] | ||
461 | block:block_rq_requeue [Tracepoint event] | ||
462 | block:block_rq_issue [Tracepoint event] | ||
463 | block:block_bio_bounce [Tracepoint event] | ||
464 | block:block_bio_complete [Tracepoint event] | ||
465 | block:block_bio_backmerge [Tracepoint event] | ||
466 | . | ||
467 | . | ||
468 | writeback:writeback_wake_thread [Tracepoint event] | ||
469 | writeback:writeback_wake_forker_thread [Tracepoint event] | ||
470 | writeback:writeback_bdi_register [Tracepoint event] | ||
471 | . | ||
472 | . | ||
473 | writeback:writeback_single_inode_requeue [Tracepoint event] | ||
474 | writeback:writeback_single_inode [Tracepoint event] | ||
475 | kmem:kmalloc [Tracepoint event] | ||
476 | kmem:kmem_cache_alloc [Tracepoint event] | ||
477 | kmem:mm_page_alloc [Tracepoint event] | ||
478 | kmem:mm_page_alloc_zone_locked [Tracepoint event] | ||
479 | kmem:mm_page_pcpu_drain [Tracepoint event] | ||
480 | kmem:mm_page_alloc_extfrag [Tracepoint event] | ||
481 | vmscan:mm_vmscan_kswapd_sleep [Tracepoint event] | ||
482 | vmscan:mm_vmscan_kswapd_wake [Tracepoint event] | ||
483 | vmscan:mm_vmscan_wakeup_kswapd [Tracepoint event] | ||
484 | vmscan:mm_vmscan_direct_reclaim_begin [Tracepoint event] | ||
485 | . | ||
486 | . | ||
487 | module:module_get [Tracepoint event] | ||
488 | module:module_put [Tracepoint event] | ||
489 | module:module_request [Tracepoint event] | ||
490 | sched:sched_kthread_stop [Tracepoint event] | ||
491 | sched:sched_wakeup [Tracepoint event] | ||
492 | sched:sched_wakeup_new [Tracepoint event] | ||
493 | sched:sched_process_fork [Tracepoint event] | ||
494 | sched:sched_process_exec [Tracepoint event] | ||
495 | sched:sched_stat_runtime [Tracepoint event] | ||
496 | rcu:rcu_utilization [Tracepoint event] | ||
497 | workqueue:workqueue_queue_work [Tracepoint event] | ||
498 | workqueue:workqueue_execute_end [Tracepoint event] | ||
499 | signal:signal_generate [Tracepoint event] | ||
500 | signal:signal_deliver [Tracepoint event] | ||
501 | timer:timer_init [Tracepoint event] | ||
502 | timer:timer_start [Tracepoint event] | ||
503 | timer:hrtimer_cancel [Tracepoint event] | ||
504 | timer:itimer_state [Tracepoint event] | ||
505 | timer:itimer_expire [Tracepoint event] | ||
506 | irq:irq_handler_entry [Tracepoint event] | ||
507 | irq:irq_handler_exit [Tracepoint event] | ||
508 | irq:softirq_entry [Tracepoint event] | ||
509 | irq:softirq_exit [Tracepoint event] | ||
510 | irq:softirq_raise [Tracepoint event] | ||
511 | printk:console [Tracepoint event] | ||
512 | task:task_newtask [Tracepoint event] | ||
513 | task:task_rename [Tracepoint event] | ||
514 | syscalls:sys_enter_socketcall [Tracepoint event] | ||
515 | syscalls:sys_exit_socketcall [Tracepoint event] | ||
516 | . | ||
517 | . | ||
518 | . | ||
519 | syscalls:sys_enter_unshare [Tracepoint event] | ||
520 | syscalls:sys_exit_unshare [Tracepoint event] | ||
521 | raw_syscalls:sys_enter [Tracepoint event] | ||
522 | raw_syscalls:sys_exit [Tracepoint event] | ||
523 | |||
524 | .. admonition:: Tying it Together | ||
525 | |||
526 | This is exactly the same set of events defined by the trace event | ||
527 | subsystem and exposed by ftrace/trace-cmd/kernelshark as files in | ||
528 | /sys/kernel/debug/tracing/events, by SystemTap as | ||
529 | kernel.trace("tracepoint_name") and (partially) accessed by LTTng. | ||
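
To make the correspondence concrete, the following sketch walks an events directory laid out as ``<subsystem>/<event>/`` and prints each entry in the ``subsystem:event`` form that 'perf list' uses. The function name ``list_tracepoints`` is hypothetical, and the default path assumes debugfs is mounted at /sys/kernel/debug and is readable (normally only by root):

```shell
# Hypothetical helper: print tracepoints as subsystem:event by walking
# an events directory laid out as <subsystem>/<event>/.
list_tracepoints() {
    # $1 = events directory; the default assumes debugfs is mounted
    events_dir="${1:-/sys/kernel/debug/tracing/events}"
    for d in "$events_dir"/*/*/; do
        # Skip the unexpanded pattern if nothing matched
        [ -d "$d" ] || continue
        sub=$(basename "$(dirname "$d")")
        printf '%s:%s\n' "$sub" "$(basename "$d")"
    done
}
```

On the target, running ``list_tracepoints | grep '^sched:'`` should then list names that also appear as Tracepoint events in the 'perf list' output.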
530 | |||
531 | Only a subset of these would be of interest to us when looking at this | ||
532 | workload, so let's choose the most likely subsystems (identified by the | ||
533 | string before the colon in the Tracepoint events) and do a 'perf stat' | ||
534 | run using only those wildcarded subsystems: :: | ||
535 | |||
536 | root@crownbay:~# perf stat -e skb:* -e net:* -e napi:* -e sched:* -e workqueue:* -e irq:* -e syscalls:* wget http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2 | ||
537 | Performance counter stats for 'wget http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2': | ||
538 | |||
539 | 23323 skb:kfree_skb | ||
540 | 0 skb:consume_skb | ||
541 | 49897 skb:skb_copy_datagram_iovec | ||
542 | 6217 net:net_dev_xmit | ||
543 | 6217 net:net_dev_queue | ||
544 | 7962 net:netif_receive_skb | ||
545 | 2 net:netif_rx | ||
546 | 8340 napi:napi_poll | ||
547 | 0 sched:sched_kthread_stop | ||
548 | 0 sched:sched_kthread_stop_ret | ||
549 | 3749 sched:sched_wakeup | ||
550 | 0 sched:sched_wakeup_new | ||
551 | 0 sched:sched_switch | ||
552 | 29 sched:sched_migrate_task | ||
553 | 0 sched:sched_process_free | ||
554 | 1 sched:sched_process_exit | ||
555 | 0 sched:sched_wait_task | ||
556 | 0 sched:sched_process_wait | ||
557 | 0 sched:sched_process_fork | ||
558 | 1 sched:sched_process_exec | ||
559 | 0 sched:sched_stat_wait | ||
560 | 2106519415641 sched:sched_stat_sleep | ||
561 | 0 sched:sched_stat_iowait | ||
562 | 147453613 sched:sched_stat_blocked | ||
563 | 12903026955 sched:sched_stat_runtime | ||
564 | 0 sched:sched_pi_setprio | ||
565 | 3574 workqueue:workqueue_queue_work | ||
566 | 3574 workqueue:workqueue_activate_work | ||
567 | 0 workqueue:workqueue_execute_start | ||
568 | 0 workqueue:workqueue_execute_end | ||
569 | 16631 irq:irq_handler_entry | ||
570 | 16631 irq:irq_handler_exit | ||
571 | 28521 irq:softirq_entry | ||
572 | 28521 irq:softirq_exit | ||
573 | 28728 irq:softirq_raise | ||
574 | 1 syscalls:sys_enter_sendmmsg | ||
575 | 1 syscalls:sys_exit_sendmmsg | ||
576 | 0 syscalls:sys_enter_recvmmsg | ||
577 | 0 syscalls:sys_exit_recvmmsg | ||
578 | 14 syscalls:sys_enter_socketcall | ||
579 | 14 syscalls:sys_exit_socketcall | ||
580 | . | ||
581 | . | ||
582 | . | ||
583 | 16965 syscalls:sys_enter_read | ||
584 | 16965 syscalls:sys_exit_read | ||
585 | 12854 syscalls:sys_enter_write | ||
586 | 12854 syscalls:sys_exit_write | ||
587 | . | ||
588 | . | ||
589 | . | ||
590 | |||
591 | 58.029710972 seconds time elapsed | ||
592 | |||
593 | |||
594 | |||
595 | Let's pick one of these tracepoints | ||
596 | and tell perf to do a profile using it as the sampling event: :: | ||
597 | |||
598 | root@crownbay:~# perf record -g -e sched:sched_wakeup wget http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2 | ||
599 | |||
600 | .. image:: figures/sched-wakeup-profile.png | ||
601 | :align: center | ||
602 | |||
603 | The screenshot above shows the results of running a profile using the | ||
604 | sched:sched_wakeup tracepoint, which shows the relative costs of various | ||
605 | paths to sched_wakeup. Note that sched_wakeup is the name of the | ||
606 | tracepoint; it's actually defined just inside ttwu_do_wakeup(), which | ||
607 | accounts for the function name actually displayed in the profile: | ||
608 | |||
609 | .. code-block:: c | ||
610 | |||
611 | /* | ||
612 | * Mark the task runnable and perform wakeup-preemption. | ||
613 | */ | ||
614 | static void | ||
615 | ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags) | ||
616 | { | ||
617 | trace_sched_wakeup(p, true); | ||
618 | . | ||
619 | . | ||
620 | . | ||
621 | } | ||
622 | |||
623 | A couple of the more interesting | ||
624 | callchains are expanded and displayed above, basically some network | ||
625 | receive paths that presumably end up waking up wget (busybox) when | ||
626 | network data is ready. | ||
627 | |||
628 | Note that because tracepoints are normally used for tracing, the default | ||
629 | sampling period for tracepoints is 1, i.e. for tracepoints perf will | ||
630 | sample on every event occurrence (this can be changed using the -c | ||
631 | option). This is in contrast to hardware counters such as | ||
632 | the default 'cycles' hardware counter used for normal profiling, where | ||
633 | sampling periods are much higher (in the thousands) because profiling | ||
634 | should have as low an overhead as possible and sampling on every cycle | ||
635 | would be prohibitively expensive. | ||
636 | |||
637 | Using perf to do Basic Tracing | ||
638 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
639 | |||
640 | Profiling is a great tool for solving many problems or for getting a | ||
641 | high-level view of what's going on with a workload or across the system. | ||
642 | It is however by definition an approximation, as suggested by the most | ||
643 | prominent word associated with it, 'sampling'. On the one hand, it | ||
644 | allows a representative picture of what's going on in the system to be | ||
645 | cheaply taken, but on the other hand, that cheapness limits its utility | ||
646 | when that data suggests a need to 'dive down' more deeply to discover | ||
647 | what's really going on. In such cases, the only way to see what's really | ||
648 | going on is to be able to look at (or summarize more intelligently) the | ||
649 | individual steps that go into the higher-level behavior exposed by the | ||
650 | coarse-grained profiling data. | ||
651 | |||
652 | As a concrete example, we can trace all the events we think might be | ||
653 | applicable to our workload: :: | ||
654 | |||
655 | root@crownbay:~# perf record -g -e skb:* -e net:* -e napi:* -e sched:sched_switch -e sched:sched_wakeup -e irq:* \ | ||
656 |     -e syscalls:sys_enter_read -e syscalls:sys_exit_read -e syscalls:sys_enter_write -e syscalls:sys_exit_write \ | ||
657 |     wget http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2 | ||
658 | |||
659 | We can look at the raw trace output using 'perf script' with no | ||
660 | arguments: :: | ||
661 | |||
662 | root@crownbay:~# perf script | ||
663 | |||
664 | perf 1262 [000] 11624.857082: sys_exit_read: 0x0 | ||
665 | perf 1262 [000] 11624.857193: sched_wakeup: comm=migration/0 pid=6 prio=0 success=1 target_cpu=000 | ||
666 | wget 1262 [001] 11624.858021: softirq_raise: vec=1 [action=TIMER] | ||
667 | wget 1262 [001] 11624.858074: softirq_entry: vec=1 [action=TIMER] | ||
668 | wget 1262 [001] 11624.858081: softirq_exit: vec=1 [action=TIMER] | ||
669 | wget 1262 [001] 11624.858166: sys_enter_read: fd: 0x0003, buf: 0xbf82c940, count: 0x0200 | ||
670 | wget 1262 [001] 11624.858177: sys_exit_read: 0x200 | ||
671 | wget 1262 [001] 11624.858878: kfree_skb: skbaddr=0xeb248d80 protocol=0 location=0xc15a5308 | ||
672 | wget 1262 [001] 11624.858945: kfree_skb: skbaddr=0xeb248000 protocol=0 location=0xc15a5308 | ||
673 | wget 1262 [001] 11624.859020: softirq_raise: vec=1 [action=TIMER] | ||
674 | wget 1262 [001] 11624.859076: softirq_entry: vec=1 [action=TIMER] | ||
675 | wget 1262 [001] 11624.859083: softirq_exit: vec=1 [action=TIMER] | ||
676 | wget 1262 [001] 11624.859167: sys_enter_read: fd: 0x0003, buf: 0xb7720000, count: 0x0400 | ||
677 | wget 1262 [001] 11624.859192: sys_exit_read: 0x1d7 | ||
678 | wget 1262 [001] 11624.859228: sys_enter_read: fd: 0x0003, buf: 0xb7720000, count: 0x0400 | ||
679 | wget 1262 [001] 11624.859233: sys_exit_read: 0x0 | ||
680 | wget 1262 [001] 11624.859573: sys_enter_read: fd: 0x0003, buf: 0xbf82c580, count: 0x0200 | ||
681 | wget 1262 [001] 11624.859584: sys_exit_read: 0x200 | ||
682 | wget 1262 [001] 11624.859864: sys_enter_read: fd: 0x0003, buf: 0xb7720000, count: 0x0400 | ||
683 | wget 1262 [001] 11624.859888: sys_exit_read: 0x400 | ||
684 | wget 1262 [001] 11624.859935: sys_enter_read: fd: 0x0003, buf: 0xb7720000, count: 0x0400 | ||
685 | wget 1262 [001] 11624.859944: sys_exit_read: 0x400 | ||
686 | |||
687 | This gives us a detailed, timestamped sequence of the recorded events as they | ||
688 | occurred within the workload. | ||
689 | |||
690 | In many ways, profiling can be viewed as a subset of tracing - | ||
691 | theoretically, if you have a set of trace events that's sufficient to | ||
692 | capture all the important aspects of a workload, you can derive any of | ||
693 | the results or views that a profiling run can. | ||
694 | |||
695 | Another aspect of traditional profiling is that while powerful in many | ||
696 | ways, it's limited by the granularity of the underlying data. Profiling | ||
697 | tools offer various ways of sorting and presenting the sample data, | ||
698 | which make it much more useful and amenable to user experimentation, but | ||
699 | in the end it can't be used in an open-ended way to extract data that | ||
700 | just isn't present as a consequence of the fact that conceptually, most | ||
701 | of it has been thrown away. | ||
702 | |||
703 | Full-blown detailed tracing data does however offer the opportunity to | ||
704 | manipulate and present the information collected during a tracing run in | ||
705 | an infinite variety of ways. | ||
706 | |||
707 | Another way to look at it is that there are only so many ways that the | ||
708 | 'primitive' counters can be used on their own to generate interesting | ||
709 | output; to get anything more complicated than simple counts requires | ||
710 | some amount of additional logic, which is typically very specific to the | ||
711 | problem at hand. For example, if we wanted to make use of a 'counter' | ||
712 | that maps to the value of the time difference between when a process was | ||
713 | scheduled to run on a processor and the time it actually ran, we | ||
714 | wouldn't expect such a counter to exist on its own, but we could derive | ||
715 | one called say 'wakeup_latency' and use it to extract a useful view of | ||
716 | that metric from trace data. Likewise, we really can't figure out from | ||
717 | standard profiling tools how much data every process on the system reads | ||
718 | and writes, along with how many of those reads and writes fail | ||
719 | completely. If we have sufficient trace data, however, we could with the | ||
720 | right tools easily extract and present that information, but we'd need | ||
721 | something other than pre-canned profiling tools to do that. | ||
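
As a minimal sketch of such a derived 'counter', the following standalone Python 3 code computes a per-task wakeup latency from a list of already-parsed events. The tuple layout, the sample values and the wakeup_latencies() helper are assumptions made for illustration; none of them are part of perf:

```python
# Hypothetical, already-parsed events: (timestamp_secs, event_name, pid).
# For sched_wakeup, pid is the task being woken; for sched_switch, pid is
# the task being switched in (next_pid). Values are made up for illustration.
events = [
    (100.000100, "sched_wakeup", 21),
    (100.000150, "sched_switch", 21),
    (100.000900, "sched_wakeup", 1209),
    (100.001000, "sched_wakeup", 21),
    (100.001300, "sched_switch", 1209),
    (100.001350, "sched_switch", 21),
]


def wakeup_latencies(events):
    """Return {pid: [latency_secs, ...]}, pairing each wakeup with the
    next switch-in of the same pid."""
    pending = {}     # pid -> timestamp of the not-yet-serviced wakeup
    latencies = {}   # pid -> list of measured latencies
    for ts, name, pid in events:
        if name == "sched_wakeup":
            # Keep the earliest outstanding wakeup for this pid.
            pending.setdefault(pid, ts)
        elif name == "sched_switch" and pid in pending:
            latencies.setdefault(pid, []).append(ts - pending.pop(pid))
    return latencies


for pid, values in sorted(wakeup_latencies(events).items()):
    print("pid %-6d max wakeup_latency: %.0f us" % (pid, max(values) * 1e6))
```

The point of the sketch is only that the 'counter' is a few lines of problem-specific logic layered on top of raw trace events, which is the kind of logic a pre-canned profiling tool has no place for.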
722 | |||
723 | Luckily, there is a general-purpose way to handle such needs, called | ||
724 | 'programming languages'. Making programming languages easily available | ||
725 | to apply to such problems given the specific format of data is called a | ||
726 | 'programming language binding' for that data and language. Perf supports | ||
727 | two programming language bindings, one for Python and one for Perl. | ||
728 | |||
729 | .. admonition:: Tying it Together | ||
730 | |||
731 | Language bindings for manipulating and aggregating trace data are of | ||
732 | course not a new idea. One of the first projects to do this was IBM's | ||
733 | DProbes dpcc compiler, an ANSI C compiler which targeted a low-level | ||
734 | assembly language running on an in-kernel interpreter on the target | ||
735 | system. This is exactly analogous to what Sun's DTrace did, except | ||
736 | that DTrace invented its own language for the purpose. Systemtap, | ||
737 | heavily inspired by DTrace, also created its own one-off language, | ||
738 | but rather than running the product on an in-kernel interpreter, | ||
739 | created an elaborate compiler-based machinery to translate its | ||
740 | language into kernel modules written in C. | ||
741 | |||
742 | Now that we have the trace data in perf.data, we can use 'perf script | ||
743 | -g' to generate a skeleton script with handlers for the read/write | ||
744 | entry/exit events we recorded: :: | ||
745 | |||
746 | root@crownbay:~# perf script -g python | ||
747 | generated Python script: perf-script.py | ||
748 | |||
749 | The skeleton script simply creates a python function for each event type in the | ||
750 | perf.data file. The body of each function simply prints the event name along | ||
751 | with its parameters. For example: | ||
752 | |||
753 | .. code-block:: python | ||
754 | |||
755 | def net__netif_rx(event_name, context, common_cpu, | ||
756 |         common_secs, common_nsecs, common_pid, common_comm, | ||
757 |         skbaddr, len, name): | ||
758 |     print_header(event_name, common_cpu, common_secs, common_nsecs, | ||
759 |                  common_pid, common_comm) | ||
760 | |||
761 |     print "skbaddr=%u, len=%u, name=%s\n" % (skbaddr, len, name), | ||
762 | |||
763 | We can run that script directly to print all of the events contained in the | ||
764 | perf.data file: :: | ||
765 | |||
766 | root@crownbay:~# perf script -s perf-script.py | ||
767 | |||
768 | in trace_begin | ||
769 | syscalls__sys_exit_read 0 11624.857082795 1262 perf nr=3, ret=0 | ||
770 | sched__sched_wakeup 0 11624.857193498 1262 perf comm=migration/0, pid=6, prio=0, success=1, target_cpu=0 | ||
771 | irq__softirq_raise 1 11624.858021635 1262 wget vec=TIMER | ||
772 | irq__softirq_entry 1 11624.858074075 1262 wget vec=TIMER | ||
773 | irq__softirq_exit 1 11624.858081389 1262 wget vec=TIMER | ||
774 | syscalls__sys_enter_read 1 11624.858166434 1262 wget nr=3, fd=3, buf=3213019456, count=512 | ||
775 | syscalls__sys_exit_read 1 11624.858177924 1262 wget nr=3, ret=512 | ||
776 | skb__kfree_skb 1 11624.858878188 1262 wget skbaddr=3945041280, location=3243922184, protocol=0 | ||
777 | skb__kfree_skb 1 11624.858945608 1262 wget skbaddr=3945037824, location=3243922184, protocol=0 | ||
778 | irq__softirq_raise 1 11624.859020942 1262 wget vec=TIMER | ||
779 | irq__softirq_entry 1 11624.859076935 1262 wget vec=TIMER | ||
780 | irq__softirq_exit 1 11624.859083469 1262 wget vec=TIMER | ||
781 | syscalls__sys_enter_read 1 11624.859167565 1262 wget nr=3, fd=3, buf=3077701632, count=1024 | ||
782 | syscalls__sys_exit_read 1 11624.859192533 1262 wget nr=3, ret=471 | ||
783 | syscalls__sys_enter_read 1 11624.859228072 1262 wget nr=3, fd=3, buf=3077701632, count=1024 | ||
784 | syscalls__sys_exit_read 1 11624.859233707 1262 wget nr=3, ret=0 | ||
785 | syscalls__sys_enter_read 1 11624.859573008 1262 wget nr=3, fd=3, buf=3213018496, count=512 | ||
786 | syscalls__sys_exit_read 1 11624.859584818 1262 wget nr=3, ret=512 | ||
787 | syscalls__sys_enter_read 1 11624.859864562 1262 wget nr=3, fd=3, buf=3077701632, count=1024 | ||
788 | syscalls__sys_exit_read 1 11624.859888770 1262 wget nr=3, ret=1024 | ||
789 | syscalls__sys_enter_read 1 11624.859935140 1262 wget nr=3, fd=3, buf=3077701632, count=1024 | ||
790 | syscalls__sys_exit_read 1 11624.859944032 1262 wget nr=3, ret=1024 | ||
791 | |||
792 | That in itself isn't very useful; after all, we can accomplish pretty much the | ||
793 | same thing by simply running 'perf script' without arguments in the same | ||
794 | directory as the perf.data file. | ||
795 | |||
796 | We can however replace the print statements in the generated function | ||
797 | bodies with whatever we want, and thereby make it infinitely more | ||
798 | useful. | ||
799 | |||
800 | As a simple example, let's just replace the print statements in the | ||
801 | function bodies with a simple function that does nothing but increment a | ||
802 | per-event count. When the program is run against a perf.data file, each | ||
803 | time a particular event is encountered, a tally is incremented for that | ||
804 | event. For example: | ||
805 | |||
806 | .. code-block:: python | ||
807 | |||
808 | def net__netif_rx(event_name, context, common_cpu, | ||
809 |         common_secs, common_nsecs, common_pid, common_comm, | ||
810 |         skbaddr, len, name): | ||
811 |     inc_counts(event_name) | ||
812 | |||
813 | Each event handler function in the generated code | ||
814 | is modified to do this. For convenience, we define a common function | ||
815 | called inc_counts() that each handler calls; inc_counts() simply tallies | ||
816 | a count for each event using the 'counts' hash, which is a specialized | ||
817 | hash (an 'autodict') that does Perl-like autovivification, a capability that's | ||
818 | extremely useful for the kinds of multi-level aggregation commonly used in | ||
819 | processing traces (see perf's documentation on the Python language | ||
820 | binding for details): | ||
821 | |||
822 | .. code-block:: python | ||
823 | |||
824 | counts = autodict() | ||
825 | |||
826 | def inc_counts(event_name): | ||
827 |     try: | ||
828 |         counts[event_name] += 1 | ||
829 |     except TypeError: | ||
830 |         counts[event_name] = 1 | ||
831 | |||
832 | Finally, at the end of the trace processing run, we want to print the | ||
833 | result of all the per-event tallies. For that, we use the special | ||
834 | 'trace_end()' function: | ||
835 | |||
836 | .. code-block:: python | ||
837 | |||
838 | def trace_end(): | ||
839 |     for event_name, count in counts.items(): | ||
840 |         print("%-40s %10s" % (event_name, count)) | ||
841 | |||
842 | The end result is a summary of all the events recorded in the trace: :: | ||
843 | |||
844 | skb__skb_copy_datagram_iovec 13148 | ||
845 | irq__softirq_entry 4796 | ||
846 | irq__irq_handler_exit 3805 | ||
847 | irq__softirq_exit 4795 | ||
848 | syscalls__sys_enter_write 8990 | ||
849 | net__net_dev_xmit 652 | ||
850 | skb__kfree_skb 4047 | ||
851 | sched__sched_wakeup 1155 | ||
852 | irq__irq_handler_entry 3804 | ||
853 | irq__softirq_raise 4799 | ||
854 | net__net_dev_queue 652 | ||
855 | syscalls__sys_enter_read 17599 | ||
856 | net__netif_receive_skb 1743 | ||
857 | syscalls__sys_exit_read 17598 | ||
858 | net__netif_rx 2 | ||
859 | napi__napi_poll 1877 | ||
860 | syscalls__sys_exit_write 8990 | ||
861 | |||
862 | Note that this is | ||
863 | pretty much exactly the same information we get from 'perf stat', which | ||
864 | goes a little way to support the idea mentioned previously that given | ||
865 | the right kind of trace data, higher-level profiling-type summaries can | ||
866 | be derived from it. | ||
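
The same tally can be tried outside of perf. The sketch below is a standalone Python 3 approximation of the handlers described above, in which collections.defaultdict stands in for perf's autodict and the short event list is made up for illustration:

```python
from collections import defaultdict

# Stand-in for perf's autodict: missing keys start at 0.
counts = defaultdict(int)


def inc_counts(event_name):
    counts[event_name] += 1


def trace_end():
    # Print one line per event type, sorted by name for stable output.
    for event_name, count in sorted(counts.items()):
        print("%-40s %10d" % (event_name, count))


# Hypothetical event stream; perf would instead call one handler per event.
for name in ["syscalls__sys_enter_read", "syscalls__sys_exit_read",
             "irq__softirq_raise", "syscalls__sys_enter_read"]:
    inc_counts(name)

trace_end()
```

Inside a real perf script the handlers and trace_end() are called by perf itself, so the driving loop at the bottom would not be needed.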
867 | |||
868 | For more details, see the documentation on using the `'perf script' python | ||
869 | binding <http://linux.die.net/man/1/perf-script-python>`__. | ||
870 | |||
871 | System-Wide Tracing and Profiling | ||
872 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
873 | |||
874 | The examples so far have focused on tracing a particular program or | ||
875 | workload - in other words, every profiling run has specified the program | ||
876 | to profile on the command line, e.g. 'perf record wget ...'. | ||
877 | |||
878 | It's also possible, and more interesting in many cases, to run a | ||
879 | system-wide profile or trace while running the workload in a separate | ||
880 | shell. | ||
881 | |||
882 | To do system-wide profiling or tracing, you typically use the -a flag to | ||
883 | 'perf record'. | ||
884 | |||
885 | To demonstrate this, open up one window and start the profile using the | ||
886 | -a flag (press Ctrl-C to stop tracing): :: | ||
887 | |||
888 | root@crownbay:~# perf record -g -a | ||
889 | ^C[ perf record: Woken up 6 times to write data ] | ||
890 | [ perf record: Captured and wrote 1.400 MB perf.data (~61172 samples) ] | ||
891 | |||
892 | In another window, run the wget test: :: | ||
893 | |||
894 | root@crownbay:~# wget http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2 | ||
895 | Connecting to downloads.yoctoproject.org (140.211.169.59:80) | ||
896 | linux-2.6.19.2.tar.b 100% \|*******************************\| 41727k 0:00:00 ETA | ||
897 | |||
898 | Here we see entries not only for our wget load, but for | ||
899 | other processes running on the system as well: | ||
900 | |||
901 | .. image:: figures/perf-systemwide.png | ||
902 | :align: center | ||
903 | |||
904 | In the snapshot above, we can see callchains that originate in libc, and | ||
905 | a callchain from Xorg that demonstrates that we're using a proprietary X | ||
906 | driver in userspace (notice the presence of 'PVR' and some other | ||
907 | unresolvable symbols in the expanded Xorg callchain). | ||
908 | |||
909 | Note also that we have both kernel and userspace entries in the above | ||
910 | snapshot. We can also tell perf to focus on userspace by providing a | ||
911 | modifier, in this case 'u', to the 'cycles' hardware counter when we | ||
912 | record a profile: :: | ||
913 | |||
914 | root@crownbay:~# perf record -g -a -e cycles:u | ||
915 | ^C[ perf record: Woken up 2 times to write data ] | ||
916 | [ perf record: Captured and wrote 0.376 MB perf.data (~16443 samples) ] | ||
917 | |||
918 | .. image:: figures/perf-report-cycles-u.png | ||
919 | :align: center | ||
920 | |||
921 | Notice that in the screenshot above we see only userspace entries ([.]). | ||
922 | |||
923 | Finally, we can press 'enter' on a leaf node and select the 'Zoom into | ||
924 | DSO' menu item to show only entries associated with a specific DSO. In | ||
925 | the screenshot below, we've zoomed into the 'libc' DSO which shows all | ||
926 | the entries associated with the libc-xxx.so DSO. | ||
927 | |||
928 | .. image:: figures/perf-systemwide-libc.png | ||
929 | :align: center | ||
930 | |||
931 | We can also use the system-wide -a switch to do system-wide tracing. | ||
932 | Here we'll trace a couple of scheduler events: :: | ||
933 | |||
934 | root@crownbay:~# perf record -a -e sched:sched_switch -e sched:sched_wakeup | ||
935 | ^C[ perf record: Woken up 38 times to write data ] | ||
936 | [ perf record: Captured and wrote 9.780 MB perf.data (~427299 samples) ] | ||
937 | |||
938 | We can look at the raw output using 'perf script' with no arguments: :: | ||
939 | |||
940 | root@crownbay:~# perf script | ||
941 | |||
942 | perf 1383 [001] 6171.460045: sched_wakeup: comm=kworker/1:1 pid=21 prio=120 success=1 target_cpu=001 | ||
943 | perf 1383 [001] 6171.460066: sched_switch: prev_comm=perf prev_pid=1383 prev_prio=120 prev_state=R+ ==> next_comm=kworker/1:1 next_pid=21 next_prio=120 | ||
944 | kworker/1:1 21 [001] 6171.460093: sched_switch: prev_comm=kworker/1:1 prev_pid=21 prev_prio=120 prev_state=S ==> next_comm=perf next_pid=1383 next_prio=120 | ||
945 | swapper 0 [000] 6171.468063: sched_wakeup: comm=kworker/0:3 pid=1209 prio=120 success=1 target_cpu=000 | ||
946 | swapper 0 [000] 6171.468107: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/0:3 next_pid=1209 next_prio=120 | ||
947 | kworker/0:3 1209 [000] 6171.468143: sched_switch: prev_comm=kworker/0:3 prev_pid=1209 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120 | ||
948 | perf 1383 [001] 6171.470039: sched_wakeup: comm=kworker/1:1 pid=21 prio=120 success=1 target_cpu=001 | ||
949 | perf 1383 [001] 6171.470058: sched_switch: prev_comm=perf prev_pid=1383 prev_prio=120 prev_state=R+ ==> next_comm=kworker/1:1 next_pid=21 next_prio=120 | ||
950 | kworker/1:1 21 [001] 6171.470082: sched_switch: prev_comm=kworker/1:1 prev_pid=21 prev_prio=120 prev_state=S ==> next_comm=perf next_pid=1383 next_prio=120 | ||
951 | perf 1383 [001] 6171.480035: sched_wakeup: comm=kworker/1:1 pid=21 prio=120 success=1 target_cpu=001 | ||
952 | |||
953 | .. _perf-filtering: | ||
954 | |||
955 | Filtering | ||
956 | ^^^^^^^^^ | ||
957 | |||
958 | Notice that there are a lot of events that don't really have anything to | ||
959 | do with what we're interested in, namely events that schedule 'perf' | ||
960 | itself in and out or that wake perf up. We can get rid of those by using | ||
961 | the '--filter' option - for each event we specify using -e, we can add a | ||
962 | --filter after that to filter out trace events that contain fields with | ||
963 | specific values: :: | ||
964 | |||
965 | root@crownbay:~# perf record -a -e sched:sched_switch --filter 'next_comm != perf && prev_comm != perf' -e sched:sched_wakeup --filter 'comm != perf' | ||
966 | ^C[ perf record: Woken up 38 times to write data ] | ||
967 | [ perf record: Captured and wrote 9.688 MB perf.data (~423279 samples) ] | ||
968 | |||
969 | |||
970 | root@crownbay:~# perf script | ||
971 | |||
972 | swapper 0 [000] 7932.162180: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/0:3 next_pid=1209 next_prio=120 | ||
973 | kworker/0:3 1209 [000] 7932.162236: sched_switch: prev_comm=kworker/0:3 prev_pid=1209 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120 | ||
974 | perf 1407 [001] 7932.170048: sched_wakeup: comm=kworker/1:1 pid=21 prio=120 success=1 target_cpu=001 | ||
975 | perf 1407 [001] 7932.180044: sched_wakeup: comm=kworker/1:1 pid=21 prio=120 success=1 target_cpu=001 | ||
976 | perf 1407 [001] 7932.190038: sched_wakeup: comm=kworker/1:1 pid=21 prio=120 success=1 target_cpu=001 | ||
977 | perf 1407 [001] 7932.200044: sched_wakeup: comm=kworker/1:1 pid=21 prio=120 success=1 target_cpu=001 | ||
978 | perf 1407 [001] 7932.210044: sched_wakeup: comm=kworker/1:1 pid=21 prio=120 success=1 target_cpu=001 | ||
979 | perf 1407 [001] 7932.220044: sched_wakeup: comm=kworker/1:1 pid=21 prio=120 success=1 target_cpu=001 | ||
980 | swapper 0 [001] 7932.230111: sched_wakeup: comm=kworker/1:1 pid=21 prio=120 success=1 target_cpu=001 | ||
981 | swapper 0 [001] 7932.230146: sched_switch: prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/1:1 next_pid=21 next_prio=120 | ||
982 | kworker/1:1 21 [001] 7932.230205: sched_switch: prev_comm=kworker/1:1 prev_pid=21 prev_prio=120 prev_state=S ==> next_comm=swapper/1 next_pid=0 next_prio=120 | ||
983 | swapper 0 [000] 7932.326109: sched_wakeup: comm=kworker/0:3 pid=1209 prio=120 success=1 target_cpu=000 | ||
984 | swapper 0 [000] 7932.326171: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/0:3 next_pid=1209 next_prio=120 | ||
985 | kworker/0:3 1209 [000] 7932.326214: sched_switch: prev_comm=kworker/0:3 prev_pid=1209 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120 | ||
986 | |||
987 | In this case, we've filtered out all events that have | ||
988 | 'perf' in their 'comm', 'prev_comm' or 'next_comm' fields. Notice that | ||
989 | there are still events recorded for perf, but that those events | ||
990 | don't have values of 'perf' for the filtered fields. Completely | ||
991 | filtering out anything from perf would require a bit more work, but for the | ||
992 | purpose of demonstrating how to use filters, it's close enough. | ||
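
To see why some lines attributed to perf survive, it can help to model the filters in userspace. The following Python 3 sketch is only an approximation of the in-kernel behavior; the record layout and sample values are assumptions made for illustration:

```python
# Hypothetical event records: 'current' is the task running when the event
# fired (the first column of 'perf script' output); 'fields' are the event's
# own fields, which are the only things a --filter expression can test.
events = [
    {"current": "perf", "name": "sched_wakeup",
     "fields": {"comm": "kworker/1:1"}},
    {"current": "perf", "name": "sched_switch",
     "fields": {"prev_comm": "perf", "next_comm": "kworker/1:1"}},
    {"current": "swapper", "name": "sched_switch",
     "fields": {"prev_comm": "swapper/0", "next_comm": "kworker/0:3"}},
]

# Userspace model of the two filter expressions used above.
filters = {
    "sched_switch": lambda f: f["next_comm"] != "perf" and f["prev_comm"] != "perf",
    "sched_wakeup": lambda f: f["comm"] != "perf",
}

# An event is kept only if the filter attached to its event type passes.
kept = [e for e in events if filters[e["name"]](e["fields"])]

for e in kept:
    print(e["current"], e["name"], e["fields"])
```

The first record is kept even though perf was the running task, because the sched_wakeup filter only examines the 'comm' field of the task being woken.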
993 | |||
994 | .. admonition:: Tying it Together | ||
995 | |||
996 | This is exactly the same set of event filters defined by the trace | ||
997 | event subsystem. See the ftrace/tracecmd/kernelshark section for more | ||
998 | discussion about these event filters. | ||
999 | |||
1000 | .. admonition:: Tying it Together | ||
1001 | |||
1002 | These event filters are implemented by a special-purpose | ||
1003 | pseudo-interpreter in the kernel and are an integral and | ||
1004 | indispensable part of the perf design as it relates to tracing. | ||
1005 | Kernel-based event filters provide a mechanism to precisely throttle | ||
1006 | the event stream that appears in user space, where it makes sense to | ||
1007 | provide bindings to real programming languages for postprocessing the | ||
1008 | event stream. This architecture allows for the intelligent and | ||
1009 | flexible partitioning of processing between the kernel and user | ||
1010 | space. Contrast this with other tools such as SystemTap, which does | ||
1011 | all of its processing in the kernel and as such requires a special | ||
1012 | project-defined language in order to accommodate that design, or | ||
1013 | LTTng, where everything is sent to userspace and as such requires a | ||
1014 | super-efficient kernel-to-userspace transport mechanism in order to | ||
1015 | function properly. While perf certainly can benefit from, for instance, | ||
1016 | advances in the design of the transport, it doesn't fundamentally | ||
1017 | depend on them. Basically, if you find that your perf tracing | ||
1018 | application is causing buffer I/O overruns, it probably means that | ||
1019 | you aren't taking enough advantage of the kernel filtering engine. | ||
1020 | |||
1021 | Using Dynamic Tracepoints | ||
1022 | ~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
1023 | |||
1024 | perf isn't restricted to the fixed set of static tracepoints listed by | ||
1025 | 'perf list'. Users can also add their own 'dynamic' tracepoints anywhere | ||
1026 | in the kernel. For instance, suppose we want to define our own | ||
1027 | tracepoint on do_fork(). We can do that using the 'perf probe' perf | ||
1028 | subcommand: :: | ||
1029 | |||
1030 | root@crownbay:~# perf probe do_fork | ||
1031 | Added new event: | ||
1032 | probe:do_fork (on do_fork) | ||
1033 | |||
1034 | You can now use it in all perf tools, such as: | ||
1035 | |||
1036 | perf record -e probe:do_fork -aR sleep 1 | ||
1037 | |||
1038 | Adding a new tracepoint via | ||
1039 | 'perf probe' results in an event with all the expected files and format | ||
1040 | in /sys/kernel/debug/tracing/events, just the same as for static | ||
1041 | tracepoints (as discussed in more detail in the trace events subsystem | ||
1042 | section): :: | ||
1043 | |||
1044 | root@crownbay:/sys/kernel/debug/tracing/events/probe/do_fork# ls -al | ||
1045 | drwxr-xr-x 2 root root 0 Oct 28 11:42 . | ||
1046 | drwxr-xr-x 3 root root 0 Oct 28 11:42 .. | ||
1047 | -rw-r--r-- 1 root root 0 Oct 28 11:42 enable | ||
1048 | -rw-r--r-- 1 root root 0 Oct 28 11:42 filter | ||
1049 | -r--r--r-- 1 root root 0 Oct 28 11:42 format | ||
1050 | -r--r--r-- 1 root root 0 Oct 28 11:42 id | ||
1051 | |||
1052 | root@crownbay:/sys/kernel/debug/tracing/events/probe/do_fork# cat format | ||
1053 | name: do_fork | ||
1054 | ID: 944 | ||
1055 | format: | ||
1056 | field:unsigned short common_type; offset:0; size:2; signed:0; | ||
1057 | field:unsigned char common_flags; offset:2; size:1; signed:0; | ||
1058 | field:unsigned char common_preempt_count; offset:3; size:1; signed:0; | ||
1059 | field:int common_pid; offset:4; size:4; signed:1; | ||
1060 | field:int common_padding; offset:8; size:4; signed:1; | ||
1061 | |||
1062 | field:unsigned long __probe_ip; offset:12; size:4; signed:0; | ||
1063 | |||
1064 | print fmt: "(%lx)", REC->__probe_ip | ||
1065 | |||
1066 | We can list all dynamic tracepoints currently in | ||
1067 | existence: :: | ||
1068 | |||
1069 | root@crownbay:~# perf probe -l | ||
1070 | probe:do_fork (on do_fork) | ||
1071 | probe:schedule (on schedule) | ||
1072 | |||
1073 | Let's record system-wide ('sleep 30' is a | ||
1074 | trick for recording system-wide: the command itself basically does nothing and | ||
1075 | simply exits after 30 seconds, which ends the recording): :: | ||
1076 | |||
1077 | root@crownbay:~# perf record -g -a -e probe:do_fork sleep 30 | ||
1078 | [ perf record: Woken up 1 times to write data ] | ||
1079 | [ perf record: Captured and wrote 0.087 MB perf.data (~3812 samples) ] | ||
1080 | |||
1081 | Using 'perf script' we can see each do_fork event that fired: :: | ||
1082 | |||
1083 | root@crownbay:~# perf script | ||
1084 | |||
1085 | # ======== | ||
1086 | # captured on: Sun Oct 28 11:55:18 2012 | ||
1087 | # hostname : crownbay | ||
1088 | # os release : 3.4.11-yocto-standard | ||
1089 | # perf version : 3.4.11 | ||
1090 | # arch : i686 | ||
1091 | # nrcpus online : 2 | ||
1092 | # nrcpus avail : 2 | ||
1093 | # cpudesc : Intel(R) Atom(TM) CPU E660 @ 1.30GHz | ||
1094 | # cpuid : GenuineIntel,6,38,1 | ||
1095 | # total memory : 1017184 kB | ||
1096 | # cmdline : /usr/bin/perf record -g -a -e probe:do_fork sleep 30 | ||
1097 | # event : name = probe:do_fork, type = 2, config = 0x3b0, config1 = 0x0, config2 = 0x0, excl_usr = 0, excl_kern | ||
1098 | = 0, id = { 5, 6 } | ||
1099 | # HEADER_CPU_TOPOLOGY info available, use -I to display | ||
1100 | # ======== | ||
1101 | # | ||
1102 | matchbox-deskto 1197 [001] 34211.378318: do_fork: (c1028460) | ||
1103 | matchbox-deskto 1295 [001] 34211.380388: do_fork: (c1028460) | ||
1104 | pcmanfm 1296 [000] 34211.632350: do_fork: (c1028460) | ||
1105 | pcmanfm 1296 [000] 34211.639917: do_fork: (c1028460) | ||
1106 | matchbox-deskto 1197 [001] 34217.541603: do_fork: (c1028460) | ||
1107 | matchbox-deskto 1299 [001] 34217.543584: do_fork: (c1028460) | ||
1108 | gthumb 1300 [001] 34217.697451: do_fork: (c1028460) | ||
1109 | gthumb 1300 [001] 34219.085734: do_fork: (c1028460) | ||
1110 | gthumb 1300 [000] 34219.121351: do_fork: (c1028460) | ||
1111 | gthumb 1300 [001] 34219.264551: do_fork: (c1028460) | ||
1112 | pcmanfm 1296 [000] 34219.590380: do_fork: (c1028460) | ||
1113 | matchbox-deskto 1197 [001] 34224.955965: do_fork: (c1028460) | ||
1114 | matchbox-deskto 1306 [001] 34224.957972: do_fork: (c1028460) | ||
1115 | matchbox-termin 1307 [000] 34225.038214: do_fork: (c1028460) | ||
1116 | matchbox-termin 1307 [001] 34225.044218: do_fork: (c1028460) | ||
1117 | matchbox-termin 1307 [000] 34225.046442: do_fork: (c1028460) | ||
1118 | matchbox-deskto 1197 [001] 34237.112138: do_fork: (c1028460) | ||
1119 | matchbox-deskto 1311 [001] 34237.114106: do_fork: (c1028460) | ||
1120 | gaku 1312 [000] 34237.202388: do_fork: (c1028460) | ||
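
Raw 'perf script' output like this is also easy to summarize with a few lines of code. The following Python 3 sketch counts do_fork events per command name; the regular expression and the count_events_by_comm() helper are assumptions written for this particular output layout (they would not cope with command names containing spaces), not something perf provides:

```python
import re
from collections import Counter

# A few lines copied from the 'perf script' output above.
sample_output = """\
matchbox-deskto  1197 [001] 34211.378318: do_fork: (c1028460)
pcmanfm  1296 [000] 34211.632350: do_fork: (c1028460)
pcmanfm  1296 [000] 34211.639917: do_fork: (c1028460)
gthumb  1300 [001] 34217.697451: do_fork: (c1028460)
"""

# Fields: comm, pid, [cpu], timestamp:, event name:
line_re = re.compile(r"^\s*(\S+)\s+(\d+)\s+\[(\d+)\]\s+([\d.]+):\s+(\w+):")


def count_events_by_comm(text, event="do_fork"):
    """Count occurrences of 'event' per command name; header lines
    starting with '#' simply fail to match and are skipped."""
    tally = Counter()
    for line in text.splitlines():
        match = line_re.match(line)
        if match and match.group(5) == event:
            tally[match.group(1)] += 1
    return tally


for comm, n in count_events_by_comm(sample_output).most_common():
    print("%-16s %d" % (comm, n))
```

For anything beyond a quick summary, the Python binding described earlier is the better fit, since it hands each event's fields to the script already parsed.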
1121 | |||
1122 | And using 'perf report' on the same file, we can see the | ||
1123 | callgraphs from starting a few programs during those 30 seconds: | ||
1124 | |||
1125 | .. image:: figures/perf-probe-do_fork-profile.png | ||
1126 | :align: center | ||
1127 | |||
1128 | .. admonition:: Tying it Together | ||
1129 | |||
1130 | The trace events subsystem accommodates static and dynamic tracepoints | ||
1131 | in exactly the same way - there's no difference as far as the | ||
1132 | infrastructure is concerned. See the ftrace section for more details | ||
1133 | on the trace event subsystem. | ||
1134 | |||
1135 | .. admonition:: Tying it Together | ||
1136 | |||
1137 | Dynamic tracepoints are implemented under the covers by kprobes and | ||
1138 | uprobes. kprobes and uprobes are also used by and in fact are the | ||
1139 | main focus of SystemTap. | ||
1140 | |||
1141 | .. _perf-documentation: | ||
1142 | |||
1143 | Perf Documentation | ||
1144 | ------------------ | ||
1145 | |||
1146 | Online versions of the man pages for the commands discussed in this | ||
1147 | section can be found here: | ||
1148 | |||
1149 | - The `'perf stat' manpage <http://linux.die.net/man/1/perf-stat>`__. | ||
1150 | |||
1151 | - The `'perf record' | ||
1152 | manpage <http://linux.die.net/man/1/perf-record>`__. | ||
1153 | |||
1154 | - The `'perf report' | ||
1155 | manpage <http://linux.die.net/man/1/perf-report>`__. | ||
1156 | |||
1157 | - The `'perf probe' manpage <http://linux.die.net/man/1/perf-probe>`__. | ||
1158 | |||
1159 | - The `'perf script' | ||
1160 | manpage <http://linux.die.net/man/1/perf-script>`__. | ||
1161 | |||
1162 | - Documentation on using the `'perf script' python | ||
1163 | binding <http://linux.die.net/man/1/perf-script-python>`__. | ||
1164 | |||
1165 | - The top-level `perf(1) manpage <http://linux.die.net/man/1/perf>`__. | ||
1166 | |||
1167 | Normally, you should be able to invoke the man pages via perf itself | ||
1168 | e.g. 'perf help' or 'perf help record'. | ||
1169 | |||
1170 | However, by default Yocto doesn't install man pages, while perf invokes | ||
1171 | the man pages for most help functionality. This is a bug and is being | ||
1172 | tracked in the Yocto Bugzilla: `Bug 3388 - perf: enable man pages for basic | ||
1173 | 'help' | ||
1174 | functionality <https://bugzilla.yoctoproject.org/show_bug.cgi?id=3388>`__. | ||
1175 | |||
1176 | The man pages in text form, along with some other files, such as a set | ||
1177 | of examples, can be found in the 'perf' directory of the kernel tree: :: | ||
1178 | |||
1179 | tools/perf/Documentation | ||
1180 | |||
1181 | There's also a nice perf tutorial on the perf | ||
1182 | wiki that goes into more detail than we do here in certain areas: `Perf | ||
1183 | Tutorial <https://perf.wiki.kernel.org/index.php/Tutorial>`__ | ||
1184 | |||
1185 | .. _profile-manual-ftrace: | ||
1186 | |||
1187 | ftrace | ||
1188 | ====== | ||
1189 | |||
1190 | 'ftrace' literally refers to the 'ftrace function tracer' but in reality | ||
1191 | this encompasses a number of related tracers along with the | ||
1192 | infrastructure that they all make use of. | ||
1193 | |||
1194 | .. _ftrace-setup: | ||
1195 | |||
1196 | ftrace Setup | ||
1197 | ------------ | ||
1198 | |||
1199 | For this section, we'll assume you've already performed the basic setup | ||
1200 | outlined in the ":ref:`profile-manual/profile-manual-intro:General Setup`" section. | ||
1201 | |||
ftrace, trace-cmd, and kernelshark run on the target system, and are
ready to go out-of-the-box - no additional setup is necessary. For the
rest of this section we assume you've ssh'ed from the host to the target
and will be running ftrace on the target. kernelshark is a GUI
application; if you use the '-X' option to ssh, you can have the
kernelshark GUI run on the target but display remotely on the host.
1208 | |||
1209 | Basic ftrace usage | ||
1210 | ------------------ | ||
1211 | |||
1212 | 'ftrace' essentially refers to everything included in the /tracing | ||
1213 | directory of the mounted debugfs filesystem (Yocto follows the standard | ||
1214 | convention and mounts it at /sys/kernel/debug). Here's a listing of all | ||
1215 | the files found in /sys/kernel/debug/tracing on a Yocto system: :: | ||
1216 | |||
1217 | root@sugarbay:/sys/kernel/debug/tracing# ls | ||
1218 | README kprobe_events trace | ||
1219 | available_events kprobe_profile trace_clock | ||
1220 | available_filter_functions options trace_marker | ||
1221 | available_tracers per_cpu trace_options | ||
1222 | buffer_size_kb printk_formats trace_pipe | ||
1223 | buffer_total_size_kb saved_cmdlines tracing_cpumask | ||
1224 | current_tracer set_event tracing_enabled | ||
1225 | dyn_ftrace_total_info set_ftrace_filter tracing_on | ||
1226 | enabled_functions set_ftrace_notrace tracing_thresh | ||
1227 | events set_ftrace_pid | ||
1228 | free_buffer set_graph_function | ||
1229 | |||
The files listed above are used for various purposes: some relate
directly to the tracers themselves, others are used to set tracing
options, and yet others actually contain the tracing output when a
tracer is in effect. The purpose of some files can be guessed from their
names, while others need explanation. We'll cover some of the files
below; for an explanation of the others, please see the ftrace
documentation.
1237 | |||
1238 | We'll start by looking at some of the available built-in tracers. | ||
1239 | |||
1240 | cat'ing the 'available_tracers' file lists the set of available tracers: :: | ||
1241 | |||
1242 | root@sugarbay:/sys/kernel/debug/tracing# cat available_tracers | ||
1243 | blk function_graph function nop | ||
1244 | |||
1245 | The 'current_tracer' file contains the tracer currently in effect: :: | ||
1246 | |||
1247 | root@sugarbay:/sys/kernel/debug/tracing# cat current_tracer | ||
1248 | nop | ||
1249 | |||
1250 | The above listing of current_tracer shows that the | ||
1251 | 'nop' tracer is in effect, which is just another way of saying that | ||
1252 | there's actually no tracer currently in effect. | ||
1253 | |||
1254 | echo'ing one of the available_tracers into current_tracer makes the | ||
1255 | specified tracer the current tracer: :: | ||
1256 | |||
1257 | root@sugarbay:/sys/kernel/debug/tracing# echo function > current_tracer | ||
1258 | root@sugarbay:/sys/kernel/debug/tracing# cat current_tracer | ||
1259 | function | ||
1260 | |||
1261 | The above sets the current tracer to be the 'function tracer'. This tracer | ||
1262 | traces every function call in the kernel and makes it available as the | ||
1263 | contents of the 'trace' file. Reading the 'trace' file lists the | ||
1264 | currently buffered function calls that have been traced by the function | ||
1265 | tracer: :: | ||
1266 | |||
1267 | root@sugarbay:/sys/kernel/debug/tracing# cat trace | less | ||
1268 | |||
1269 | # tracer: function | ||
1270 | # | ||
1271 | # entries-in-buffer/entries-written: 310629/766471 #P:8 | ||
1272 | # | ||
1273 | # _-----=> irqs-off | ||
1274 | # / _----=> need-resched | ||
1275 | # | / _---=> hardirq/softirq | ||
1276 | # || / _--=> preempt-depth | ||
1277 | # ||| / delay | ||
1278 | # TASK-PID CPU# |||| TIMESTAMP FUNCTION | ||
1279 | # | | | |||| | | | ||
1280 | <idle>-0 [004] d..1 470.867169: ktime_get_real <-intel_idle | ||
1281 | <idle>-0 [004] d..1 470.867170: getnstimeofday <-ktime_get_real | ||
1282 | <idle>-0 [004] d..1 470.867171: ns_to_timeval <-intel_idle | ||
1283 | <idle>-0 [004] d..1 470.867171: ns_to_timespec <-ns_to_timeval | ||
1284 | <idle>-0 [004] d..1 470.867172: smp_apic_timer_interrupt <-apic_timer_interrupt | ||
1285 | <idle>-0 [004] d..1 470.867172: native_apic_mem_write <-smp_apic_timer_interrupt | ||
1286 | <idle>-0 [004] d..1 470.867172: irq_enter <-smp_apic_timer_interrupt | ||
1287 | <idle>-0 [004] d..1 470.867172: rcu_irq_enter <-irq_enter | ||
1288 | <idle>-0 [004] d..1 470.867173: rcu_idle_exit_common.isra.33 <-rcu_irq_enter | ||
1289 | <idle>-0 [004] d..1 470.867173: local_bh_disable <-irq_enter | ||
1290 | <idle>-0 [004] d..1 470.867173: add_preempt_count <-local_bh_disable | ||
1291 | <idle>-0 [004] d.s1 470.867174: tick_check_idle <-irq_enter | ||
1292 | <idle>-0 [004] d.s1 470.867174: tick_check_oneshot_broadcast <-tick_check_idle | ||
1293 | <idle>-0 [004] d.s1 470.867174: ktime_get <-tick_check_idle | ||
1294 | <idle>-0 [004] d.s1 470.867174: tick_nohz_stop_idle <-tick_check_idle | ||
1295 | <idle>-0 [004] d.s1 470.867175: update_ts_time_stats <-tick_nohz_stop_idle | ||
1296 | <idle>-0 [004] d.s1 470.867175: nr_iowait_cpu <-update_ts_time_stats | ||
1297 | <idle>-0 [004] d.s1 470.867175: tick_do_update_jiffies64 <-tick_check_idle | ||
1298 | <idle>-0 [004] d.s1 470.867175: _raw_spin_lock <-tick_do_update_jiffies64 | ||
1299 | <idle>-0 [004] d.s1 470.867176: add_preempt_count <-_raw_spin_lock | ||
1300 | <idle>-0 [004] d.s2 470.867176: do_timer <-tick_do_update_jiffies64 | ||
1301 | <idle>-0 [004] d.s2 470.867176: _raw_spin_lock <-do_timer | ||
1302 | <idle>-0 [004] d.s2 470.867176: add_preempt_count <-_raw_spin_lock | ||
1303 | <idle>-0 [004] d.s3 470.867177: ntp_tick_length <-do_timer | ||
1304 | <idle>-0 [004] d.s3 470.867177: _raw_spin_lock_irqsave <-ntp_tick_length | ||
1305 | . | ||
1306 | . | ||
1307 | . | ||
1308 | |||
1309 | Each line in the trace above shows what was happening in the kernel on a given | ||
1310 | cpu, to the level of detail of function calls. Each entry shows the function | ||
1311 | called, followed by its caller (after the arrow). | ||
1312 | |||
1313 | The function tracer gives you an extremely detailed idea of what the | ||
1314 | kernel was doing at the point in time the trace was taken, and is a | ||
1315 | great way to learn about how the kernel code works in a dynamic sense. | ||
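
As a rough sketch of how the steps above fit together, the following
helper selects a tracer, records for a fixed interval, and saves the
buffered entries to a file. The helper name 'capture_trace' and the
'TRACING_DIR' variable are assumptions made for illustration; only the
'available_tracers', 'current_tracer', 'tracing_on' and 'trace' files
come from the directory listing shown earlier:

```shell
# Sketch only: capture_trace and TRACING_DIR are illustrative names.
# Override TRACING_DIR if debugfs is mounted somewhere else.
TRACING_DIR="${TRACING_DIR:-/sys/kernel/debug/tracing}"

# Usage: capture_trace TRACER SECONDS OUTFILE
capture_trace() {
    tracer="$1"
    seconds="$2"
    outfile="$3"

    if [ -z "$tracer" ]; then
        echo "capture_trace: no tracer given" >&2
        return 1
    fi

    # Only accept a tracer name that appears in available_tracers.
    if ! tr ' ' '\n' < "$TRACING_DIR/available_tracers" | grep -qxF "$tracer"; then
        echo "capture_trace: '$tracer' is not an available tracer" >&2
        return 1
    fi

    echo 0 > "$TRACING_DIR/tracing_on"         # pause recording during setup
    echo "$tracer" > "$TRACING_DIR/current_tracer"
    echo 1 > "$TRACING_DIR/tracing_on"         # start recording
    sleep "$seconds"
    echo 0 > "$TRACING_DIR/tracing_on"         # stop recording, keep the buffer
    cat "$TRACING_DIR/trace" > "$outfile"      # save the buffered entries
    echo nop > "$TRACING_DIR/current_tracer"   # leave no tracer in effect
}
```

For example, 'capture_trace function 2 /tmp/function.txt' would record
roughly two seconds of kernel function calls. Note that this sketch
leaves 'tracing_on' set to 0 when it finishes.
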
1316 | |||
1317 | .. admonition:: Tying it Together | ||
1318 | |||
1319 | The ftrace function tracer is also available from within perf, as the | ||
1320 | ftrace:function tracepoint. | ||
1321 | |||
1322 | It is a little more difficult to follow the call chains than it needs to | ||
1323 | be - luckily there's a variant of the function tracer that displays the | ||
1324 | callchains explicitly, called the 'function_graph' tracer: :: | ||
1325 | |||
1326 | root@sugarbay:/sys/kernel/debug/tracing# echo function_graph > current_tracer | ||
1327 | root@sugarbay:/sys/kernel/debug/tracing# cat trace | less | ||
1328 | |||
   # tracer: function_graph
   #
   # CPU  DURATION                  FUNCTION CALLS
   # |     |   |                     |   |   |   |
1333 | 7) 0.046 us | pick_next_task_fair(); | ||
1334 | 7) 0.043 us | pick_next_task_stop(); | ||
1335 | 7) 0.042 us | pick_next_task_rt(); | ||
1336 | 7) 0.032 us | pick_next_task_fair(); | ||
1337 | 7) 0.030 us | pick_next_task_idle(); | ||
1338 | 7) | _raw_spin_unlock_irq() { | ||
1339 | 7) 0.033 us | sub_preempt_count(); | ||
1340 | 7) 0.258 us | } | ||
1341 | 7) 0.032 us | sub_preempt_count(); | ||
1342 | 7) + 13.341 us | } /* __schedule */ | ||
1343 | 7) 0.095 us | } /* sub_preempt_count */ | ||
1344 | 7) | schedule() { | ||
1345 | 7) | __schedule() { | ||
1346 | 7) 0.060 us | add_preempt_count(); | ||
1347 | 7) 0.044 us | rcu_note_context_switch(); | ||
1348 | 7) | _raw_spin_lock_irq() { | ||
1349 | 7) 0.033 us | add_preempt_count(); | ||
1350 | 7) 0.247 us | } | ||
1351 | 7) | idle_balance() { | ||
1352 | 7) | _raw_spin_unlock() { | ||
1353 | 7) 0.031 us | sub_preempt_count(); | ||
1354 | 7) 0.246 us | } | ||
1355 | 7) | update_shares() { | ||
1356 | 7) 0.030 us | __rcu_read_lock(); | ||
1357 | 7) 0.029 us | __rcu_read_unlock(); | ||
1358 | 7) 0.484 us | } | ||
1359 | 7) 0.030 us | __rcu_read_lock(); | ||
1360 | 7) | load_balance() { | ||
1361 | 7) | find_busiest_group() { | ||
1362 | 7) 0.031 us | idle_cpu(); | ||
1363 | 7) 0.029 us | idle_cpu(); | ||
1364 | 7) 0.035 us | idle_cpu(); | ||
1365 | 7) 0.906 us | } | ||
1366 | 7) 1.141 us | } | ||
1367 | 7) 0.022 us | msecs_to_jiffies(); | ||
1368 | 7) | load_balance() { | ||
1369 | 7) | find_busiest_group() { | ||
1370 | 7) 0.031 us | idle_cpu(); | ||
1371 | . | ||
1372 | . | ||
1373 | . | ||
1374 | 4) 0.062 us | msecs_to_jiffies(); | ||
1375 | 4) 0.062 us | __rcu_read_unlock(); | ||
1376 | 4) | _raw_spin_lock() { | ||
1377 | 4) 0.073 us | add_preempt_count(); | ||
1378 | 4) 0.562 us | } | ||
1379 | 4) + 17.452 us | } | ||
1380 | 4) 0.108 us | put_prev_task_fair(); | ||
1381 | 4) 0.102 us | pick_next_task_fair(); | ||
1382 | 4) 0.084 us | pick_next_task_stop(); | ||
1383 | 4) 0.075 us | pick_next_task_rt(); | ||
1384 | 4) 0.062 us | pick_next_task_fair(); | ||
1385 | 4) 0.066 us | pick_next_task_idle(); | ||
1386 | ------------------------------------------ | ||
1387 | 4) kworker-74 => <idle>-0 | ||
1388 | ------------------------------------------ | ||
1389 | |||
1390 | 4) | finish_task_switch() { | ||
1391 | 4) | _raw_spin_unlock_irq() { | ||
1392 | 4) 0.100 us | sub_preempt_count(); | ||
1393 | 4) 0.582 us | } | ||
1394 | 4) 1.105 us | } | ||
1395 | 4) 0.088 us | sub_preempt_count(); | ||
1396 | 4) ! 100.066 us | } | ||
1397 | . | ||
1398 | . | ||
1399 | . | ||
1400 | 3) | sys_ioctl() { | ||
1401 | 3) 0.083 us | fget_light(); | ||
1402 | 3) | security_file_ioctl() { | ||
1403 | 3) 0.066 us | cap_file_ioctl(); | ||
1404 | 3) 0.562 us | } | ||
1405 | 3) | do_vfs_ioctl() { | ||
1406 | 3) | drm_ioctl() { | ||
1407 | 3) 0.075 us | drm_ut_debug_printk(); | ||
1408 | 3) | i915_gem_pwrite_ioctl() { | ||
1409 | 3) | i915_mutex_lock_interruptible() { | ||
1410 | 3) 0.070 us | mutex_lock_interruptible(); | ||
1411 | 3) 0.570 us | } | ||
1412 | 3) | drm_gem_object_lookup() { | ||
1413 | 3) | _raw_spin_lock() { | ||
1414 | 3) 0.080 us | add_preempt_count(); | ||
1415 | 3) 0.620 us | } | ||
1416 | 3) | _raw_spin_unlock() { | ||
1417 | 3) 0.085 us | sub_preempt_count(); | ||
1418 | 3) 0.562 us | } | ||
1419 | 3) 2.149 us | } | ||
1420 | 3) 0.133 us | i915_gem_object_pin(); | ||
1421 | 3) | i915_gem_object_set_to_gtt_domain() { | ||
1422 | 3) 0.065 us | i915_gem_object_flush_gpu_write_domain(); | ||
1423 | 3) 0.065 us | i915_gem_object_wait_rendering(); | ||
1424 | 3) 0.062 us | i915_gem_object_flush_cpu_write_domain(); | ||
1425 | 3) 1.612 us | } | ||
1426 | 3) | i915_gem_object_put_fence() { | ||
1427 | 3) 0.097 us | i915_gem_object_flush_fence.constprop.36(); | ||
1428 | 3) 0.645 us | } | ||
1429 | 3) 0.070 us | add_preempt_count(); | ||
1430 | 3) 0.070 us | sub_preempt_count(); | ||
1431 | 3) 0.073 us | i915_gem_object_unpin(); | ||
1432 | 3) 0.068 us | mutex_unlock(); | ||
1433 | 3) 9.924 us | } | ||
1434 | 3) + 11.236 us | } | ||
1435 | 3) + 11.770 us | } | ||
1436 | 3) + 13.784 us | } | ||
1437 | 3) | sys_ioctl() { | ||
1438 | |||
1439 | As you can see, the function_graph display is much easier | ||
1440 | to follow. Also note that in addition to the function calls and | ||
1441 | associated braces, other events such as scheduler events are displayed | ||
1442 | in context. In fact, you can freely include any tracepoint available in | ||
1443 | the trace events subsystem described in the next section by simply | ||
1444 | enabling those events, and they'll appear in context in the function | ||
1445 | graph display. Quite a powerful tool for understanding kernel dynamics. | ||
1446 | |||
Also notice that there are various annotations on the left hand side of
the display. For example, if the total time it took for a given function
to execute is above a certain threshold, a marker appears next to the
duration: a plus sign for more than 10 microseconds and an exclamation
point for more than 100 microseconds. Please see the ftrace
documentation for details on all these fields.
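
The duration column also lends itself to simple filtering. As a minimal
sketch, the following helper prints only those lines of function_graph
output whose duration meets a threshold given in microseconds. The name
'slow_calls' is hypothetical, and the parsing assumes the
'<duration> us |' column layout shown in the listing above:

```shell
# Sketch only: slow_calls is an illustrative name.
# Reads function_graph output on stdin and prints the lines whose
# duration is at least MIN_US microseconds.
# Usage: slow_calls MIN_US < trace
slow_calls() {
    awk -v min="$1" '
        {
            for (i = 2; i <= NF; i++) {
                if ($i == "|")
                    break                  # past the duration column
                if ($i == "us") {
                    # The field before "us" is the duration itself.
                    if ($(i - 1) + 0 >= min + 0)
                        print
                    break
                }
            }
        }'
}
```

For example, 'slow_calls 10 < trace' would keep only the entries that
took 10 microseconds or longer.
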
1452 | |||
1453 | The 'trace events' Subsystem | ||
1454 | ---------------------------- | ||
1455 | |||
1456 | One especially important directory contained within the | ||
1457 | /sys/kernel/debug/tracing directory is the 'events' subdirectory, which | ||
1458 | contains representations of every tracepoint in the system. Listing out | ||
1459 | the contents of the 'events' subdirectory, we see mainly another set of | ||
1460 | subdirectories: :: | ||
1461 | |||
1462 | root@sugarbay:/sys/kernel/debug/tracing# cd events | ||
1463 | root@sugarbay:/sys/kernel/debug/tracing/events# ls -al | ||
1464 | drwxr-xr-x 38 root root 0 Nov 14 23:19 . | ||
1465 | drwxr-xr-x 5 root root 0 Nov 14 23:19 .. | ||
1466 | drwxr-xr-x 19 root root 0 Nov 14 23:19 block | ||
1467 | drwxr-xr-x 32 root root 0 Nov 14 23:19 btrfs | ||
1468 | drwxr-xr-x 5 root root 0 Nov 14 23:19 drm | ||
1469 | -rw-r--r-- 1 root root 0 Nov 14 23:19 enable | ||
1470 | drwxr-xr-x 40 root root 0 Nov 14 23:19 ext3 | ||
1471 | drwxr-xr-x 79 root root 0 Nov 14 23:19 ext4 | ||
1472 | drwxr-xr-x 14 root root 0 Nov 14 23:19 ftrace | ||
1473 | drwxr-xr-x 8 root root 0 Nov 14 23:19 hda | ||
1474 | -r--r--r-- 1 root root 0 Nov 14 23:19 header_event | ||
1475 | -r--r--r-- 1 root root 0 Nov 14 23:19 header_page | ||
1476 | drwxr-xr-x 25 root root 0 Nov 14 23:19 i915 | ||
1477 | drwxr-xr-x 7 root root 0 Nov 14 23:19 irq | ||
1478 | drwxr-xr-x 12 root root 0 Nov 14 23:19 jbd | ||
1479 | drwxr-xr-x 14 root root 0 Nov 14 23:19 jbd2 | ||
1480 | drwxr-xr-x 14 root root 0 Nov 14 23:19 kmem | ||
1481 | drwxr-xr-x 7 root root 0 Nov 14 23:19 module | ||
1482 | drwxr-xr-x 3 root root 0 Nov 14 23:19 napi | ||
1483 | drwxr-xr-x 6 root root 0 Nov 14 23:19 net | ||
1484 | drwxr-xr-x 3 root root 0 Nov 14 23:19 oom | ||
1485 | drwxr-xr-x 12 root root 0 Nov 14 23:19 power | ||
1486 | drwxr-xr-x 3 root root 0 Nov 14 23:19 printk | ||
1487 | drwxr-xr-x 8 root root 0 Nov 14 23:19 random | ||
1488 | drwxr-xr-x 4 root root 0 Nov 14 23:19 raw_syscalls | ||
1489 | drwxr-xr-x 3 root root 0 Nov 14 23:19 rcu | ||
1490 | drwxr-xr-x 6 root root 0 Nov 14 23:19 rpm | ||
1491 | drwxr-xr-x 20 root root 0 Nov 14 23:19 sched | ||
1492 | drwxr-xr-x 7 root root 0 Nov 14 23:19 scsi | ||
1493 | drwxr-xr-x 4 root root 0 Nov 14 23:19 signal | ||
1494 | drwxr-xr-x 5 root root 0 Nov 14 23:19 skb | ||
1495 | drwxr-xr-x 4 root root 0 Nov 14 23:19 sock | ||
1496 | drwxr-xr-x 10 root root 0 Nov 14 23:19 sunrpc | ||
1497 | drwxr-xr-x 538 root root 0 Nov 14 23:19 syscalls | ||
1498 | drwxr-xr-x 4 root root 0 Nov 14 23:19 task | ||
1499 | drwxr-xr-x 14 root root 0 Nov 14 23:19 timer | ||
1500 | drwxr-xr-x 3 root root 0 Nov 14 23:19 udp | ||
1501 | drwxr-xr-x 21 root root 0 Nov 14 23:19 vmscan | ||
1502 | drwxr-xr-x 3 root root 0 Nov 14 23:19 vsyscall | ||
1503 | drwxr-xr-x 6 root root 0 Nov 14 23:19 workqueue | ||
1504 | drwxr-xr-x 26 root root 0 Nov 14 23:19 writeback | ||
1505 | |||
1506 | Each one of these subdirectories | ||
1507 | corresponds to a 'subsystem' and contains yet again more subdirectories, | ||
1508 | each one of those finally corresponding to a tracepoint. For example, | ||
1509 | here are the contents of the 'kmem' subsystem: :: | ||
1510 | |||
1511 | root@sugarbay:/sys/kernel/debug/tracing/events# cd kmem | ||
1512 | root@sugarbay:/sys/kernel/debug/tracing/events/kmem# ls -al | ||
1513 | drwxr-xr-x 14 root root 0 Nov 14 23:19 . | ||
1514 | drwxr-xr-x 38 root root 0 Nov 14 23:19 .. | ||
1515 | -rw-r--r-- 1 root root 0 Nov 14 23:19 enable | ||
1516 | -rw-r--r-- 1 root root 0 Nov 14 23:19 filter | ||
1517 | drwxr-xr-x 2 root root 0 Nov 14 23:19 kfree | ||
1518 | drwxr-xr-x 2 root root 0 Nov 14 23:19 kmalloc | ||
1519 | drwxr-xr-x 2 root root 0 Nov 14 23:19 kmalloc_node | ||
1520 | drwxr-xr-x 2 root root 0 Nov 14 23:19 kmem_cache_alloc | ||
1521 | drwxr-xr-x 2 root root 0 Nov 14 23:19 kmem_cache_alloc_node | ||
1522 | drwxr-xr-x 2 root root 0 Nov 14 23:19 kmem_cache_free | ||
1523 | drwxr-xr-x 2 root root 0 Nov 14 23:19 mm_page_alloc | ||
1524 | drwxr-xr-x 2 root root 0 Nov 14 23:19 mm_page_alloc_extfrag | ||
1525 | drwxr-xr-x 2 root root 0 Nov 14 23:19 mm_page_alloc_zone_locked | ||
1526 | drwxr-xr-x 2 root root 0 Nov 14 23:19 mm_page_free | ||
1527 | drwxr-xr-x 2 root root 0 Nov 14 23:19 mm_page_free_batched | ||
1528 | drwxr-xr-x 2 root root 0 Nov 14 23:19 mm_page_pcpu_drain | ||
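
The subsystem/tracepoint layout shown above can also be walked from a
script. The following is a minimal sketch that prints each tracepoint as
'subsystem:event'. The helper name 'list_tracepoints' and the
'EVENTS_DIR' variable are assumptions for illustration, and the sketch
identifies a tracepoint directory by the presence of a 'format' file:

```shell
# Sketch only: list_tracepoints and EVENTS_DIR are illustrative names.
EVENTS_DIR="${EVENTS_DIR:-/sys/kernel/debug/tracing/events}"

# Usage: list_tracepoints [SUBSYSTEM]
# With no argument, every subsystem is listed.
list_tracepoints() {
    subsystem="${1:-*}"
    # $subsystem is unquoted on purpose so that "*" expands as a glob.
    for fmt in "$EVENTS_DIR"/$subsystem/*/format; do
        [ -e "$fmt" ] || continue          # the glob matched nothing
        event_dir="${fmt%/format}"         # .../events/kmem/kmalloc
        system_dir="${event_dir%/*}"       # .../events/kmem
        echo "${system_dir##*/}:${event_dir##*/}"
    done
}
```

For example, 'list_tracepoints kmem' would print one line per entry in
the 'kmem' listing above, such as 'kmem:kmalloc'.
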
1529 | |||
1530 | Let's see what's inside the subdirectory for a | ||
1531 | specific tracepoint, in this case the one for kmalloc: :: | ||
1532 | |||
1533 | root@sugarbay:/sys/kernel/debug/tracing/events/kmem# cd kmalloc | ||
1534 | root@sugarbay:/sys/kernel/debug/tracing/events/kmem/kmalloc# ls -al | ||
1535 | drwxr-xr-x 2 root root 0 Nov 14 23:19 . | ||
1536 | drwxr-xr-x 14 root root 0 Nov 14 23:19 .. | ||
1537 | -rw-r--r-- 1 root root 0 Nov 14 23:19 enable | ||
1538 | -rw-r--r-- 1 root root 0 Nov 14 23:19 filter | ||
1539 | -r--r--r-- 1 root root 0 Nov 14 23:19 format | ||
1540 | -r--r--r-- 1 root root 0 Nov 14 23:19 id | ||
1541 | |||
The 'format' file for the tracepoint describes the layout of the event
in memory, which the various tracing tools that now make use of these
tracepoints rely on to parse the event and make sense of it. It also
contains a 'print fmt' field that allows tools like ftrace to display
the event as text. Here's what the format of the kmalloc event looks
like: ::
1548 | |||
1549 | root@sugarbay:/sys/kernel/debug/tracing/events/kmem/kmalloc# cat format | ||
1550 | name: kmalloc | ||
1551 | ID: 313 | ||
1552 | format: | ||
1553 | field:unsigned short common_type; offset:0; size:2; signed:0; | ||
1554 | field:unsigned char common_flags; offset:2; size:1; signed:0; | ||
1555 | field:unsigned char common_preempt_count; offset:3; size:1; signed:0; | ||
1556 | field:int common_pid; offset:4; size:4; signed:1; | ||
1557 | field:int common_padding; offset:8; size:4; signed:1; | ||
1558 | |||
1559 | field:unsigned long call_site; offset:16; size:8; signed:0; | ||
1560 | field:const void * ptr; offset:24; size:8; signed:0; | ||
1561 | field:size_t bytes_req; offset:32; size:8; signed:0; | ||
1562 | field:size_t bytes_alloc; offset:40; size:8; signed:0; | ||
1563 | field:gfp_t gfp_flags; offset:48; size:4; signed:0; | ||
1564 | |||
1565 | print fmt: "call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s", REC->call_site, REC->ptr, REC->bytes_req, REC->bytes_alloc, | ||
1566 | (REC->gfp_flags) ? __print_flags(REC->gfp_flags, "|", {(unsigned long)(((( gfp_t)0x10u) | (( gfp_t)0x40u) | (( gfp_t)0x80u) | (( | ||
1567 | gfp_t)0x20000u) | (( gfp_t)0x02u) | (( gfp_t)0x08u)) | (( gfp_t)0x4000u) | (( gfp_t)0x10000u) | (( gfp_t)0x1000u) | (( gfp_t)0x200u) | (( | ||
1568 | gfp_t)0x400000u)), "GFP_TRANSHUGE"}, {(unsigned long)((( gfp_t)0x10u) | (( gfp_t)0x40u) | (( gfp_t)0x80u) | (( gfp_t)0x20000u) | (( | ||
1569 | gfp_t)0x02u) | (( gfp_t)0x08u)), "GFP_HIGHUSER_MOVABLE"}, {(unsigned long)((( gfp_t)0x10u) | (( gfp_t)0x40u) | (( gfp_t)0x80u) | (( | ||
1570 | gfp_t)0x20000u) | (( gfp_t)0x02u)), "GFP_HIGHUSER"}, {(unsigned long)((( gfp_t)0x10u) | (( gfp_t)0x40u) | (( gfp_t)0x80u) | (( | ||
   gfp_t)0x20000u)), "GFP_USER"}, {(unsigned long)((( gfp_t)0x10u) | (( gfp_t)0x40u) | (( gfp_t)0x80u) | (( gfp_t)0x80000u)), "GFP_TEMPORARY"},
1572 | {(unsigned long)((( gfp_t)0x10u) | (( gfp_t)0x40u) | (( gfp_t)0x80u)), "GFP_KERNEL"}, {(unsigned long)((( gfp_t)0x10u) | (( gfp_t)0x40u)), | ||
1573 | "GFP_NOFS"}, {(unsigned long)((( gfp_t)0x20u)), "GFP_ATOMIC"}, {(unsigned long)((( gfp_t)0x10u)), "GFP_NOIO"}, {(unsigned long)(( | ||
1574 | gfp_t)0x20u), "GFP_HIGH"}, {(unsigned long)(( gfp_t)0x10u), "GFP_WAIT"}, {(unsigned long)(( gfp_t)0x40u), "GFP_IO"}, {(unsigned long)(( | ||
1575 | gfp_t)0x100u), "GFP_COLD"}, {(unsigned long)(( gfp_t)0x200u), "GFP_NOWARN"}, {(unsigned long)(( gfp_t)0x400u), "GFP_REPEAT"}, {(unsigned | ||
1576 | long)(( gfp_t)0x800u), "GFP_NOFAIL"}, {(unsigned long)(( gfp_t)0x1000u), "GFP_NORETRY"}, {(unsigned long)(( gfp_t)0x4000u), "GFP_COMP"}, | ||
1577 | {(unsigned long)(( gfp_t)0x8000u), "GFP_ZERO"}, {(unsigned long)(( gfp_t)0x10000u), "GFP_NOMEMALLOC"}, {(unsigned long)(( gfp_t)0x20000u), | ||
1578 | "GFP_HARDWALL"}, {(unsigned long)(( gfp_t)0x40000u), "GFP_THISNODE"}, {(unsigned long)(( gfp_t)0x80000u), "GFP_RECLAIMABLE"}, {(unsigned | ||
1579 | long)(( gfp_t)0x08u), "GFP_MOVABLE"}, {(unsigned long)(( gfp_t)0), "GFP_NOTRACK"}, {(unsigned long)(( gfp_t)0x400000u), "GFP_NO_KSWAPD"}, | ||
1580 | {(unsigned long)(( gfp_t)0x800000u), "GFP_OTHER_NODE"} ) : "GFP_NOWAIT" | ||
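
Because each 'field:' line follows the same 'field; offset; size;
signed' pattern, the record layout is easy to extract with a script.
The following is a minimal sketch that prints the name, offset and size
of every field. The helper name 'format_fields' is hypothetical, and the
parsing assumes the layout shown in the listing above:

```shell
# Sketch only: format_fields is an illustrative name.
# Prints "NAME OFFSET SIZE" for each field in a tracepoint format file.
# Usage: format_fields FORMAT_FILE   (or pipe the file on stdin)
format_fields() {
    awk -F';' '
        /^[ \t]*field:/ {
            # $1 looks like "field:unsigned long call_site";
            # the last word of the declaration is the field name.
            n = split($1, decl, " ")
            name = decl[n]
            sub(/^[ \t]*offset:/, "", $2)
            sub(/^[ \t]*size:/, "", $3)
            print name, $2, $3
        }' "$@"
}
```

Run against the kmalloc 'format' file above, this would print lines such
as 'call_site 16 8'.
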
1581 | |||
1582 | The 'enable' file | ||
1583 | in the tracepoint directory is what allows the user (or tools such as | ||
1584 | trace-cmd) to actually turn the tracepoint on and off. When enabled, the | ||
1585 | corresponding tracepoint will start appearing in the ftrace 'trace' file | ||
1586 | described previously. For example, this turns on the kmalloc tracepoint: :: | ||
1587 | |||
1588 | root@sugarbay:/sys/kernel/debug/tracing/events/kmem/kmalloc# echo 1 > enable | ||
1589 | |||
At the moment, we're not interested in the function tracer or
some other tracer that might be in effect, so we first turn it off.
Having done that, we still need to turn tracing on in order to see the
events in the output buffer: ::
1594 | |||
1595 | root@sugarbay:/sys/kernel/debug/tracing# echo nop > current_tracer | ||
1596 | root@sugarbay:/sys/kernel/debug/tracing# echo 1 > tracing_on | ||
1597 | |||
Now, if we look at the 'trace' file, we see nothing
but the kmalloc events we just turned on: ::
1600 | |||
1601 | root@sugarbay:/sys/kernel/debug/tracing# cat trace | less | ||
1602 | # tracer: nop | ||
1603 | # | ||
1604 | # entries-in-buffer/entries-written: 1897/1897 #P:8 | ||
1605 | # | ||
1606 | # _-----=> irqs-off | ||
1607 | # / _----=> need-resched | ||
1608 | # | / _---=> hardirq/softirq | ||
1609 | # || / _--=> preempt-depth | ||
1610 | # ||| / delay | ||
1611 | # TASK-PID CPU# |||| TIMESTAMP FUNCTION | ||
1612 | # | | | |||| | | | ||
1613 | dropbear-1465 [000] ...1 18154.620753: kmalloc: call_site=ffffffff816650d4 ptr=ffff8800729c3000 bytes_req=2048 bytes_alloc=2048 gfp_flags=GFP_KERNEL | ||
1614 | <idle>-0 [000] ..s3 18154.621640: kmalloc: call_site=ffffffff81619b36 ptr=ffff88006d555800 bytes_req=512 bytes_alloc=512 gfp_flags=GFP_ATOMIC | ||
1615 | <idle>-0 [000] ..s3 18154.621656: kmalloc: call_site=ffffffff81619b36 ptr=ffff88006d555800 bytes_req=512 bytes_alloc=512 gfp_flags=GFP_ATOMIC | ||
1616 | matchbox-termin-1361 [001] ...1 18154.755472: kmalloc: call_site=ffffffff81614050 ptr=ffff88006d5f0e00 bytes_req=512 bytes_alloc=512 gfp_flags=GFP_KERNEL|GFP_REPEAT | ||
1617 | Xorg-1264 [002] ...1 18154.755581: kmalloc: call_site=ffffffff8141abe8 ptr=ffff8800734f4cc0 bytes_req=168 bytes_alloc=192 gfp_flags=GFP_KERNEL|GFP_NOWARN|GFP_NORETRY | ||
1618 | Xorg-1264 [002] ...1 18154.755583: kmalloc: call_site=ffffffff814192a3 ptr=ffff88001f822520 bytes_req=24 bytes_alloc=32 gfp_flags=GFP_KERNEL|GFP_ZERO | ||
1619 | Xorg-1264 [002] ...1 18154.755589: kmalloc: call_site=ffffffff81419edb ptr=ffff8800721a2f00 bytes_req=64 bytes_alloc=64 gfp_flags=GFP_KERNEL|GFP_ZERO | ||
1620 | matchbox-termin-1361 [001] ...1 18155.354594: kmalloc: call_site=ffffffff81614050 ptr=ffff88006db35400 bytes_req=576 bytes_alloc=1024 gfp_flags=GFP_KERNEL|GFP_REPEAT | ||
1621 | Xorg-1264 [002] ...1 18155.354703: kmalloc: call_site=ffffffff8141abe8 ptr=ffff8800734f4cc0 bytes_req=168 bytes_alloc=192 gfp_flags=GFP_KERNEL|GFP_NOWARN|GFP_NORETRY | ||
1622 | Xorg-1264 [002] ...1 18155.354705: kmalloc: call_site=ffffffff814192a3 ptr=ffff88001f822520 bytes_req=24 bytes_alloc=32 gfp_flags=GFP_KERNEL|GFP_ZERO | ||
1623 | Xorg-1264 [002] ...1 18155.354711: kmalloc: call_site=ffffffff81419edb ptr=ffff8800721a2f00 bytes_req=64 bytes_alloc=64 gfp_flags=GFP_KERNEL|GFP_ZERO | ||
1624 | <idle>-0 [000] ..s3 18155.673319: kmalloc: call_site=ffffffff81619b36 ptr=ffff88006d555800 bytes_req=512 bytes_alloc=512 gfp_flags=GFP_ATOMIC | ||
1625 | dropbear-1465 [000] ...1 18155.673525: kmalloc: call_site=ffffffff816650d4 ptr=ffff8800729c3000 bytes_req=2048 bytes_alloc=2048 gfp_flags=GFP_KERNEL | ||
1626 | <idle>-0 [000] ..s3 18155.674821: kmalloc: call_site=ffffffff81619b36 ptr=ffff88006d554800 bytes_req=512 bytes_alloc=512 gfp_flags=GFP_ATOMIC | ||
1627 | <idle>-0 [000] ..s3 18155.793014: kmalloc: call_site=ffffffff81619b36 ptr=ffff88006d554800 bytes_req=512 bytes_alloc=512 gfp_flags=GFP_ATOMIC | ||
1628 | dropbear-1465 [000] ...1 18155.793219: kmalloc: call_site=ffffffff816650d4 ptr=ffff8800729c3000 bytes_req=2048 bytes_alloc=2048 gfp_flags=GFP_KERNEL | ||
1629 | <idle>-0 [000] ..s3 18155.794147: kmalloc: call_site=ffffffff81619b36 ptr=ffff88006d555800 bytes_req=512 bytes_alloc=512 gfp_flags=GFP_ATOMIC | ||
1630 | <idle>-0 [000] ..s3 18155.936705: kmalloc: call_site=ffffffff81619b36 ptr=ffff88006d555800 bytes_req=512 bytes_alloc=512 gfp_flags=GFP_ATOMIC | ||
1631 | dropbear-1465 [000] ...1 18155.936910: kmalloc: call_site=ffffffff816650d4 ptr=ffff8800729c3000 bytes_req=2048 bytes_alloc=2048 gfp_flags=GFP_KERNEL | ||
1632 | <idle>-0 [000] ..s3 18155.937869: kmalloc: call_site=ffffffff81619b36 ptr=ffff88006d554800 bytes_req=512 bytes_alloc=512 gfp_flags=GFP_ATOMIC | ||
1633 | matchbox-termin-1361 [001] ...1 18155.953667: kmalloc: call_site=ffffffff81614050 ptr=ffff88006d5f2000 bytes_req=512 bytes_alloc=512 gfp_flags=GFP_KERNEL|GFP_REPEAT | ||
1634 | Xorg-1264 [002] ...1 18155.953775: kmalloc: call_site=ffffffff8141abe8 ptr=ffff8800734f4cc0 bytes_req=168 bytes_alloc=192 gfp_flags=GFP_KERNEL|GFP_NOWARN|GFP_NORETRY | ||
1635 | Xorg-1264 [002] ...1 18155.953777: kmalloc: call_site=ffffffff814192a3 ptr=ffff88001f822520 bytes_req=24 bytes_alloc=32 gfp_flags=GFP_KERNEL|GFP_ZERO | ||
1636 | Xorg-1264 [002] ...1 18155.953783: kmalloc: call_site=ffffffff81419edb ptr=ffff8800721a2f00 bytes_req=64 bytes_alloc=64 gfp_flags=GFP_KERNEL|GFP_ZERO | ||
1637 | <idle>-0 [000] ..s3 18156.176053: kmalloc: call_site=ffffffff81619b36 ptr=ffff88006d554800 bytes_req=512 bytes_alloc=512 gfp_flags=GFP_ATOMIC | ||
1638 | dropbear-1465 [000] ...1 18156.176257: kmalloc: call_site=ffffffff816650d4 ptr=ffff8800729c3000 bytes_req=2048 bytes_alloc=2048 gfp_flags=GFP_KERNEL | ||
1639 | <idle>-0 [000] ..s3 18156.177717: kmalloc: call_site=ffffffff81619b36 ptr=ffff88006d555800 bytes_req=512 bytes_alloc=512 gfp_flags=GFP_ATOMIC | ||
1640 | <idle>-0 [000] ..s3 18156.399229: kmalloc: call_site=ffffffff81619b36 ptr=ffff88006d555800 bytes_req=512 bytes_alloc=512 gfp_flags=GFP_ATOMIC | ||
   dropbear-1465 [000] ...1 18156.399434: kmalloc: call_site=ffffffff816650d4 ptr=ffff8800729c3000 bytes_req=2048 bytes_alloc=2048 gfp_flags=GFP_KERNEL
1642 | <idle>-0 [000] ..s3 18156.400660: kmalloc: call_site=ffffffff81619b36 ptr=ffff88006d554800 bytes_req=512 bytes_alloc=512 gfp_flags=GFP_ATOMIC | ||
1643 | matchbox-termin-1361 [001] ...1 18156.552800: kmalloc: call_site=ffffffff81614050 ptr=ffff88006db34800 bytes_req=576 bytes_alloc=1024 gfp_flags=GFP_KERNEL|GFP_REPEAT | ||
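
Output in this form is straightforward to post-process. As a minimal
sketch, the following helper totals the kmalloc events per task,
printing the event count, bytes requested and bytes allocated. The name
'kmalloc_summary' is hypothetical, and the parsing assumes the
'bytes_req=' and 'bytes_alloc=' fields shown above and task names
without embedded spaces:

```shell
# Sketch only: kmalloc_summary is an illustrative name.
# Prints "TASK-PID COUNT BYTES_REQ BYTES_ALLOC" for each task seen.
# Usage: kmalloc_summary TRACE_FILE   (or pipe the trace on stdin)
kmalloc_summary() {
    awk '
        / kmalloc: / {
            req = 0
            alloc = 0
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^bytes_req=/)   { split($i, kv, "="); req = kv[2] }
                if ($i ~ /^bytes_alloc=/) { split($i, kv, "="); alloc = kv[2] }
            }
            # $1 is the TASK-PID column, e.g. "Xorg-1264".
            count[$1]++
            requested[$1] += req
            allocated[$1] += alloc
        }
        END {
            for (task in count)
                print task, count[task], requested[task], allocated[task]
        }' "$@" | LC_ALL=C sort
}
```

A large gap between the last two columns for a task suggests that its
requests are being rounded up to noticeably larger allocation sizes.
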
1644 | |||
1645 | To again disable the kmalloc event, we need to send 0 to the enable file: :: | ||
1646 | |||
1647 | root@sugarbay:/sys/kernel/debug/tracing/events/kmem/kmalloc# echo 0 > enable | ||
1648 | |||
1649 | You can enable any number of events or complete subsystems (by | ||
1650 | using the 'enable' file in the subsystem directory) and get an | ||
1651 | arbitrarily fine-grained idea of what's going on in the system by | ||
1652 | enabling as many of the appropriate tracepoints as applicable. | ||
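
As a minimal sketch of that idea, the following helper writes to the
'enable' file of either a single tracepoint or a whole subsystem. The
helper name 'set_event_enable' and the 'EVENTS_DIR' variable are
assumptions made for illustration:

```shell
# Sketch only: set_event_enable and EVENTS_DIR are illustrative names.
EVENTS_DIR="${EVENTS_DIR:-/sys/kernel/debug/tracing/events}"

# Usage: set_event_enable 0|1 SUBSYSTEM [EVENT]
# Without EVENT, the subsystem-level enable file is used.
set_event_enable() {
    value="$1"
    subsystem="$2"
    event="${3:-}"

    case "$value" in
        0|1) ;;
        *) echo "set_event_enable: value must be 0 or 1" >&2; return 1 ;;
    esac

    # Append "/EVENT" only when an event name was given.
    target="$EVENTS_DIR/$subsystem${event:+/$event}/enable"
    if [ ! -w "$target" ]; then
        echo "set_event_enable: cannot write $target" >&2
        return 1
    fi
    echo "$value" > "$target"
}
```

For example, 'set_event_enable 1 kmem' would turn on every tracepoint in
the 'kmem' subsystem, while 'set_event_enable 0 kmem kmalloc' would turn
off just the kmalloc event.
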
1653 | |||
1654 | A number of the tools described in this HOWTO do just that, including | ||
1655 | trace-cmd and kernelshark in the next section. | ||
1656 | |||
1657 | .. admonition:: Tying it Together | ||
1658 | |||
   These tracepoints and their representation are used not only by
   ftrace, but by many of the other tools covered in this document, and
   they form a central point of integration for the various tracers
   available in Linux. They are a central part of the instrumentation
   for the following tools: perf, lttng, ftrace, blktrace and SystemTap.
1664 | |||
1665 | .. admonition:: Tying it Together | ||
1666 | |||
1667 | Eventually all the special-purpose tracers currently available in | ||
1668 | /sys/kernel/debug/tracing will be removed and replaced with | ||
1669 | equivalent tracers based on the 'trace events' subsystem. | ||
1670 | |||
1671 | .. _trace-cmd-kernelshark: | ||
1672 | |||
1673 | trace-cmd/kernelshark | ||
1674 | --------------------- | ||
1675 | |||
trace-cmd is essentially an extensive command-line 'wrapper' interface
that hides the details of all the individual files in
/sys/kernel/debug/tracing, allowing users to specify particular events
within the /sys/kernel/debug/tracing/events/ subdirectory and to
collect traces without having to deal with those details directly.
1681 | |||
1682 | As yet another layer on top of that, kernelshark provides a GUI that | ||
1683 | allows users to start and stop traces and specify sets of events using | ||
1684 | an intuitive interface, and view the output as both trace events and as | ||
1685 | a per-CPU graphical display. It directly uses 'trace-cmd' as the | ||
1686 | plumbing that accomplishes all that underneath the covers (and actually | ||
1687 | displays the trace-cmd command it uses, as we'll see). | ||
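
trace-cmd can also be driven directly from the command line. The
following is only a sketch of what such an invocation might look like,
assuming a trace-cmd whose 'record' subcommand accepts '-e' to select
events and '-o' to name the output file. The helper name
'trace_subsystems' and the 'TRACE_CMD' variable are hypothetical:

```shell
# Sketch only: trace_subsystems and TRACE_CMD are illustrative names.
# Set TRACE_CMD=echo to print the command line instead of running it.
TRACE_CMD="${TRACE_CMD:-trace-cmd}"

# Usage: trace_subsystems OUTFILE SECONDS SUBSYSTEM...
trace_subsystems() {
    if [ "$#" -lt 3 ]; then
        echo "usage: trace_subsystems OUTFILE SECONDS SUBSYSTEM..." >&2
        return 1
    fi
    outfile="$1"
    seconds="$2"
    shift 2

    # Build one "-e NAME" pair per requested subsystem or event.
    events=""
    for name in "$@"; do
        events="$events -e $name"
    done

    # $events is unquoted on purpose so each pair splits into two words.
    # "sleep SECONDS" is the traced command, which bounds the recording.
    $TRACE_CMD record $events -o "$outfile" sleep "$seconds"
}
```

For example, 'trace_subsystems trace.dat 5 i915 drm' would record the
'i915' and 'drm' subsystems for about five seconds. The resulting file
can then be examined with 'trace-cmd report' or opened in kernelshark.
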
1688 | |||
1689 | To start a trace using kernelshark, first start kernelshark: :: | ||
1690 | |||
1691 | root@sugarbay:~# kernelshark | ||
1692 | |||
1693 | Then bring up the 'Capture' dialog by | ||
1694 | choosing from the kernelshark menu: :: | ||
1695 | |||
1696 | Capture | Record | ||
1697 | |||
1698 | That will display the following dialog, which allows you to choose one or more | ||
1699 | events (or even one or more complete subsystems) to trace: | ||
1700 | |||
1701 | .. image:: figures/kernelshark-choose-events.png | ||
1702 | :align: center | ||
1703 | |||
Note that these are exactly the same sets of events described in the
previous trace events subsystem section; in fact, that is where
trace-cmd gets them for kernelshark.
1707 | |||
1708 | In the above screenshot, we've decided to explore the graphics subsystem | ||
1709 | a bit and so have chosen to trace all the tracepoints contained within | ||
1710 | the 'i915' and 'drm' subsystems. | ||
1711 | |||
After doing that, we can start and stop the trace using the 'Run'
button in the lower right corner of the dialog (the same button
turns into the 'Stop' button after the trace has started):
1715 | |||
1716 | .. image:: figures/kernelshark-output-display.png | ||
1717 | :align: center | ||
1718 | |||
1719 | Notice that the right-hand pane shows the exact trace-cmd command-line | ||
1720 | that's used to run the trace, along with the results of the trace-cmd | ||
1721 | run. | ||
1722 | |||
1723 | Once the 'Stop' button is pressed, the graphical view magically fills up | ||
1724 | with a colorful per-cpu display of the trace data, along with the | ||
1725 | detailed event listing below that: | ||
1726 | |||
1727 | .. image:: figures/kernelshark-i915-display.png | ||
1728 | :align: center | ||
1729 | |||
1730 | Here's another example, this time a display resulting from tracing 'all | ||
1731 | events': | ||
1732 | |||
1733 | .. image:: figures/kernelshark-all.png | ||
1734 | :align: center | ||
1735 | |||
1736 | The tool is pretty self-explanatory, but for more detailed information | ||
1737 | on navigating through the data, see the `kernelshark | ||
1738 | website <http://rostedt.homelinux.com/kernelshark/>`__. | ||
1739 | |||
1740 | .. _ftrace-documentation: | ||
1741 | |||
1742 | ftrace Documentation | ||
1743 | -------------------- | ||
1744 | |||
1745 | The documentation for ftrace can be found in the kernel Documentation | ||
1746 | directory: :: | ||
1747 | |||
1748 | Documentation/trace/ftrace.txt | ||
1749 | |||
1750 | The documentation for the trace event subsystem can also be found in the kernel | ||
1751 | Documentation directory: :: | ||
1752 | |||
1753 | Documentation/trace/events.txt | ||
1754 | |||
1755 | There is a nice series of articles on using ftrace and trace-cmd at LWN: | ||
1756 | |||
1757 | - `Debugging the kernel using Ftrace - part | ||
1758 | 1 <http://lwn.net/Articles/365835/>`__ | ||
1759 | |||
1760 | - `Debugging the kernel using Ftrace - part | ||
1761 | 2 <http://lwn.net/Articles/366796/>`__ | ||
1762 | |||
1763 | - `Secrets of the Ftrace function | ||
1764 | tracer <http://lwn.net/Articles/370423/>`__ | ||
1765 | |||
1766 | - `trace-cmd: A front-end for | ||
1767 | Ftrace <https://lwn.net/Articles/410200/>`__ | ||
1768 | |||
1769 | There's more detailed documentation on kernelshark usage here: | ||
1770 | `KernelShark <http://rostedt.homelinux.com/kernelshark/>`__ | ||
1771 | |||
1772 | An amusing yet useful README (a tracing mini-HOWTO) can be found in | ||
1773 | ``/sys/kernel/debug/tracing/README``. | ||
1774 | |||
1775 | .. _profile-manual-systemtap: | ||
1776 | |||
1777 | systemtap | ||
1778 | ========= | ||
1779 | |||
1780 | SystemTap is a system-wide script-based tracing and profiling tool. | ||
1781 | |||
1782 | SystemTap scripts are C-like programs that are executed in the kernel to | ||
1783 | gather/print/aggregate data extracted from the context they end up being | ||
1784 | invoked under. | ||
1785 | |||
1786 | For example, this probe from the `SystemTap | ||
1787 | tutorial <http://sourceware.org/systemtap/tutorial/>`__ simply prints a | ||
1788 | line every time any process on the system open()s a file. For each line, | ||
1789 | it prints the executable name of the program that opened the file, along | ||
1790 | with its PID, and the name of the file it opened (or tried to open), | ||
1791 | which it extracts from the open syscall's argstr. | ||
1792 | |||
1793 | .. code-block:: none | ||
1794 | |||
1795 | probe syscall.open | ||
1796 | { | ||
1797 | printf ("%s(%d) open (%s)\n", execname(), pid(), argstr) | ||
1798 | } | ||
1799 | |||
1800 | probe timer.ms(4000) # after 4 seconds | ||
1801 | { | ||
1802 | exit () | ||
1803 | } | ||
1804 | |||
1805 | Normally, to execute this | ||
1806 | probe, you'd simply install systemtap on the system you want to probe | ||
1807 | and run the probe directly on that system. For example, assuming the | ||
1808 | file containing the above text is named trace_open.stp: :: | ||
1809 | |||
1810 | # stap trace_open.stp | ||
1811 | |||
1812 | What systemtap does under the covers to run this probe is 1) parse and | ||
1813 | convert the probe to an equivalent 'C' form, 2) compile the 'C' form | ||
1814 | into a kernel module, 3) insert the module into the kernel, which arms | ||
1815 | it, and 4) collect the data generated by the probe and display it to the | ||
1816 | user. | ||
1817 | |||
1818 | In order to accomplish steps 1 and 2, the 'stap' program needs access to | ||
1819 | the kernel build system that produced the kernel that the probed system | ||
1820 | is running. In the case of a typical embedded system (the 'target'), the | ||
1821 | kernel build system unfortunately isn't typically part of the image | ||
1822 | running on the target. It is normally available on the 'host' system | ||
1823 | that produced the target image however; in such cases, steps 1 and 2 are | ||
1824 | executed on the host system, and steps 3 and 4 are executed on the | ||
1825 | target system, using only the systemtap 'runtime'. | ||
1826 | |||
1827 | The systemtap support in Yocto assumes that only steps 3 and 4 are run | ||
1828 | on the target; it is possible to do everything on the target, but this | ||
1829 | section assumes only the typical embedded use-case. | ||
1830 | |||
1831 | So basically, to run a systemtap script on the target you need to | ||
1832 | 1) on the host system, compile the probe into a kernel module that | ||
1833 | matches the target kernel, 2) copy the module onto the target system, | ||
1834 | 3) insert the module into the target kernel, which arms it, and | ||
1835 | 4) collect the data generated by the probe and display it to the | ||
1836 | user. | ||
1837 | |||
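The four steps above can be sketched by hand as follows. This is only an illustration, not what 'crosstap' actually runs: the ``manual_crosstap`` and ``run`` helpers are hypothetical, and the kernel build tree path, architecture name and cross-toolchain prefix are values you would have to supply for your own build. The sketch assumes the standard ``stap`` options ``-p4`` (stop after building the module), ``-m``, ``-r``, ``-a`` and ``-B``, and that ``staprun`` is available on the target.

```shell
# Hypothetical helpers, not part of Yocto or SystemTap.
# Set DRY_RUN=1 to print each command instead of executing it.
run() {
    echo "+ $*"
    [ -n "${DRY_RUN:-}" ] || "$@"
}

manual_crosstap() {
    target=$1    # ssh destination, e.g. root@192.168.7.2
    script=$2    # SystemTap script, e.g. trace_open.stp
    kbuild=$3    # kernel build tree matching the target kernel (you supply this)
    arch=$4      # kernel architecture name, e.g. x86_64
    cross=$5     # cross-toolchain prefix, e.g. x86_64-poky-linux-
    mod=$(basename "$script" .stp)

    # Step 1 (host): stop after pass 4, leaving $mod.ko in the current directory.
    run stap -p4 -m "$mod" -r "$kbuild" -a "$arch" \
        -B "CROSS_COMPILE=$cross" "$script" || return 1
    # Step 2: copy the module to the target's home directory.
    run scp "$mod.ko" "$target:" || return 1
    # Steps 3 and 4 (target): load the module and stream the probe output.
    run ssh "$target" staprun "$mod.ko"
}
```

With ``DRY_RUN=1`` set, calling ``manual_crosstap root@192.168.7.2 trace_open.stp /path/to/kernel/build x86_64 x86_64-poky-linux-`` only prints the three commands, so you can inspect them before running anything.
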
1838 | .. _systemtap-setup: | ||
1839 | |||
1840 | systemtap Setup | ||
1841 | --------------- | ||
1842 | |||
1843 | Those are a lot of steps and a lot of details, but fortunately Yocto | ||
1844 | includes a script called 'crosstap' that will take care of those | ||
1845 | details, allowing you to simply execute a systemtap script on the remote | ||
1846 | target, with arguments if necessary. | ||
1847 | |||
1848 | In order to do this from a remote host, however, you need to have access | ||
1849 | to the build for the image you booted. The 'crosstap' script provides | ||
1850 | details on how to do this if you run the script on the host without | ||
1851 | having done a build: :: | ||
1852 | |||
1853 | $ crosstap root@192.168.1.88 trace_open.stp | ||
1854 | |||
1855 | Error: No target kernel build found. | ||
1856 | Did you forget to create a local build of your image? | ||
1857 | |||
1858 | 'crosstap' requires a local sdk build of the target system | ||
1859 | (or a build that includes 'tools-profile') in order to build | ||
1860 | kernel modules that can probe the target system. | ||
1861 | |||
1862 | Practically speaking, that means you need to do the following: | ||
1863 | - If you're running a pre-built image, download the release | ||
1864 | and/or BSP tarballs used to build the image. | ||
1865 | - If you're working from git sources, just clone the metadata | ||
1866 | and BSP layers needed to build the image you'll be booting. | ||
1867 | - Make sure you're properly set up to build a new image (see | ||
1868 | the BSP README and/or the widely available basic documentation | ||
1869 | that discusses how to build images). | ||
1870 | - Build an -sdk version of the image e.g.: | ||
1871 | $ bitbake core-image-sato-sdk | ||
1872 | OR | ||
1873 | - Build a non-sdk image but include the profiling tools: | ||
1874 | [ edit local.conf and add 'tools-profile' to the end of | ||
1875 | the EXTRA_IMAGE_FEATURES variable ] | ||
1876 | $ bitbake core-image-sato | ||
1877 | |||
1878 | Once you've build the image on the host system, you're ready to | ||
1879 | boot it (or the equivalent pre-built image) and use 'crosstap' | ||
1880 | to probe it (you need to source the environment as usual first): | ||
1881 | |||
1882 | $ source oe-init-build-env | ||
1883 | $ cd ~/my/systemtap/scripts | ||
1884 | $ crosstap root@192.168.1.xxx myscript.stp | ||
1885 | |||
1886 | .. note:: | ||
1887 | |||
1888 | 'crosstap' runs SystemTap scripts over ssh, so it assumes you can | ||
1889 | establish an ssh connection to the remote target. Please refer to the | ||
1890 | crosstap wiki page for details on verifying ssh connections. | ||
1891 | Also, the ability to ssh into the target system is not enabled by | ||
1892 | default in \*-minimal images. | ||
1893 | |||
1894 | So essentially what you need to | ||
1895 | do is build an SDK image or image with 'tools-profile' as detailed in | ||
1896 | the ":ref:`profile-manual/profile-manual-intro:General Setup`" section of this | ||
1897 | manual, and boot the resulting target image. | ||
1898 | |||
1899 | .. note:: | ||
1900 | |||
1901 | If you have a build directory containing multiple machines, you need | ||
1902 | to have the MACHINE you're connecting to selected in local.conf, and | ||
1903 | the kernel in that machine's build directory must match the kernel on | ||
1904 | the booted system exactly, or you'll get the above 'crosstap' message | ||
1905 | when you try to invoke a script. | ||
1906 | |||
1907 | Running a Script on a Target | ||
1908 | ---------------------------- | ||
1909 | |||
1910 | Once you've done that, you should be able to run a systemtap script on | ||
1911 | the target: :: | ||
1912 | |||
1913 | $ cd /path/to/yocto | ||
1914 | $ source oe-init-build-env | ||
1915 | |||
1916 | ### Shell environment set up for builds. ### | ||
1917 | |||
1918 | You can now run 'bitbake <target>' | ||
1919 | |||
1920 | Common targets are: | ||
1921 | core-image-minimal | ||
1922 | core-image-sato | ||
1923 | meta-toolchain | ||
1924 | meta-ide-support | ||
1925 | |||
1926 | You can also run generated qemu images with a command like 'runqemu qemux86-64' | ||
1927 | |||
1928 | Once you've done that, you can cd to whatever | ||
1929 | directory contains your scripts and use 'crosstap' to run the script: :: | ||
1930 | |||
1931 | $ cd /path/to/my/systemtap/script | ||
1932 | $ crosstap root@192.168.7.2 trace_open.stp | ||
1933 | |||
1934 | If you get an error connecting to the target e.g.: :: | ||
1935 | |||
1936 | $ crosstap root@192.168.7.2 trace_open.stp | ||
1937 | error establishing ssh connection on remote 'root@192.168.7.2' | ||
1938 | |||
1939 | Try ssh'ing to the target and see what happens: :: | ||
1940 | |||
1941 | $ ssh root@192.168.7.2 | ||
1942 | |||
1943 | A lot of the time, connection | ||
1944 | problems are due to specifying a wrong IP address or having a 'host key | ||
1945 | verification error'. | ||
1946 | |||
1947 | If everything worked as planned, you should see something like this | ||
1948 | (enter the password when prompted, or press enter if it's set up to use | ||
1949 | no password): | ||
1950 | |||
1951 | .. code-block:: none | ||
1952 | |||
1953 | $ crosstap root@192.168.7.2 trace_open.stp | ||
1954 | root@192.168.7.2's password: | ||
1955 | matchbox-termin(1036) open ("/tmp/vte3FS2LW", O_RDWR|O_CREAT|O_EXCL|O_LARGEFILE, 0600) | ||
1956 | matchbox-termin(1036) open ("/tmp/vteJMC7LW", O_RDWR|O_CREAT|O_EXCL|O_LARGEFILE, 0600) | ||
1957 | |||
1958 | .. _systemtap-documentation: | ||
1959 | |||
1960 | systemtap Documentation | ||
1961 | ----------------------- | ||
1962 | |||
1963 | The SystemTap language reference can be found here: `SystemTap Language | ||
1964 | Reference <http://sourceware.org/systemtap/langref/>`__ | ||
1965 | |||
1966 | Links to other SystemTap documents, tutorials, and examples can be found | ||
1967 | here: `SystemTap documentation | ||
1968 | page <http://sourceware.org/systemtap/documentation.html>`__ | ||
1969 | |||
1970 | .. _profile-manual-sysprof: | ||
1971 | |||
1972 | Sysprof | ||
1973 | ======= | ||
1974 | |||
1975 | Sysprof is a very easy to use system-wide profiler that consists of a | ||
1976 | single window with three panes and a few buttons which allow you to | ||
1977 | start, stop, and view the profile from one place. | ||
1978 | |||
1979 | .. _sysprof-setup: | ||
1980 | |||
1981 | Sysprof Setup | ||
1982 | ------------- | ||
1983 | |||
1984 | For this section, we'll assume you've already performed the basic setup | ||
1985 | outlined in the ":ref:`profile-manual/profile-manual-intro:General Setup`" section. | ||
1986 | |||
1987 | Sysprof is a GUI-based application that runs on the target system. For | ||
1988 | the rest of this document we assume you've ssh'ed to the target and will | ||
1989 | be running Sysprof there (you can use the '-X' option to ssh and | ||
1990 | have the Sysprof GUI run on the target but display remotely on the host | ||
1991 | if you want). | ||
1992 | |||
1993 | .. _sysprof-basic-usage: | ||
1994 | |||
1995 | Basic Sysprof Usage | ||
1996 | ------------------- | ||
1997 | |||
1998 | To start profiling the system, you simply press the 'Start' button. To | ||
1999 | stop profiling and to start viewing the profile data in one easy step, | ||
2000 | press the 'Profile' button. | ||
2001 | |||
2002 | Once you've pressed the profile button, the three panes will fill up | ||
2003 | with profiling data: | ||
2004 | |||
2005 | .. image:: figures/sysprof-copy-to-user.png | ||
2006 | :align: center | ||
2007 | |||
2008 | The left pane shows a list of functions and processes. Selecting one of | ||
2009 | those expands that function in the right pane, showing all its callees. | ||
2010 | Note that this caller-oriented display is essentially the inverse of | ||
2011 | perf's default callee-oriented callchain display. | ||
2012 | |||
2013 | In the screenshot above, we're focusing on ``__copy_to_user_ll()`` and | ||
2014 | looking up the callchain we can see that one of the callers of | ||
2015 | ``__copy_to_user_ll()`` is ``sys_read()`` and the complete callpath between them. | ||
2016 | Notice that this is essentially a portion of the same information we saw | ||
2017 | in the perf display shown in the perf section of this page. | ||
2018 | |||
2019 | .. image:: figures/sysprof-copy-from-user.png | ||
2020 | :align: center | ||
2021 | |||
2022 | Similarly, the above is a snapshot of the Sysprof display of a | ||
2023 | copy-from-user callchain. | ||
2024 | |||
2025 | Finally, looking at the third Sysprof pane in the lower left, we can see | ||
2026 | a list of all the callers of a particular function selected in the top | ||
2027 | left pane. In this case, the lower pane is showing all the callers of | ||
2028 | ``__mark_inode_dirty``: | ||
2029 | |||
2030 | .. image:: figures/sysprof-callers.png | ||
2031 | :align: center | ||
2032 | |||
2033 | Double-clicking on one of those functions will in turn change the focus | ||
2034 | to the selected function, and so on. | ||
2035 | |||
2036 | .. admonition:: Tying it Together | ||
2037 | |||
2038 | If you like sysprof's 'caller-oriented' display, you may be able to | ||
2039 | approximate it in other tools as well. For example, 'perf report' has | ||
2040 | the -g (--call-graph) option that you can experiment with; one of the | ||
2041 | options is 'caller' for an inverted caller-based callgraph display. | ||
2042 | |||
2043 | .. _sysprof-documentation: | ||
2044 | |||
2045 | Sysprof Documentation | ||
2046 | --------------------- | ||
2047 | |||
2048 | There doesn't seem to be any documentation for Sysprof, but maybe that's | ||
2049 | because it's pretty self-explanatory. The Sysprof website, however, is | ||
2050 | here: `Sysprof, System-wide Performance Profiler for | ||
2051 | Linux <http://sysprof.com/>`__ | ||
2052 | |||
2053 | LTTng (Linux Trace Toolkit, next generation) | ||
2054 | ============================================ | ||
2055 | |||
2056 | .. _lttng-setup: | ||
2057 | |||
2058 | LTTng Setup | ||
2059 | ----------- | ||
2060 | |||
2061 | For this section, we'll assume you've already performed the basic setup | ||
2062 | outlined in the ":ref:`profile-manual/profile-manual-intro:General Setup`" section. | ||
2063 | LTTng is run on the target system by ssh'ing to it. | ||
2064 | |||
2065 | Collecting and Viewing Traces | ||
2066 | ----------------------------- | ||
2067 | |||
2068 | Once you've applied the above commits and built and booted your image | ||
2069 | (you need to build the core-image-sato-sdk image or use one of the other | ||
2070 | methods described in the ":ref:`profile-manual/profile-manual-intro:General Setup`" section), you're ready to start | ||
2071 | tracing. | ||
2072 | |||
2073 | Collecting and viewing a trace on the target (inside a shell) | ||
2074 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
2075 | |||
2076 | First, from the host, ssh to the target: :: | ||
2077 | |||
2078 | $ ssh -l root 192.168.1.47 | ||
2079 | The authenticity of host '192.168.1.47 (192.168.1.47)' can't be established. | ||
2080 | RSA key fingerprint is 23:bd:c8:b1:a8:71:52:00:ee:00:4f:64:9e:10:b9:7e. | ||
2081 | Are you sure you want to continue connecting (yes/no)? yes | ||
2082 | Warning: Permanently added '192.168.1.47' (RSA) to the list of known hosts. | ||
2083 | root@192.168.1.47's password: | ||
2084 | |||
2085 | Once on the target, use these steps to create a trace: :: | ||
2086 | |||
2087 | root@crownbay:~# lttng create | ||
2088 | Spawning a session daemon | ||
2089 | Session auto-20121015-232120 created. | ||
2090 | Traces will be written in /home/root/lttng-traces/auto-20121015-232120 | ||
2091 | |||
2092 | Enable the events you want to trace (in this case all kernel events): :: | ||
2093 | |||
2094 | root@crownbay:~# lttng enable-event --kernel --all | ||
2095 | All kernel events are enabled in channel channel0 | ||
2096 | |||
2097 | Start the trace: :: | ||
2098 | |||
2099 | root@crownbay:~# lttng start | ||
2100 | Tracing started for session auto-20121015-232120 | ||
2101 | |||
2102 | And then stop the trace after a while or after running a particular workload that | ||
2103 | you want to trace: :: | ||
2104 | |||
2105 | root@crownbay:~# lttng stop | ||
2106 | Tracing stopped for session auto-20121015-232120 | ||
2107 | |||
2108 | You can now view the trace in text form on the target: :: | ||
2109 | |||
2110 | root@crownbay:~# lttng view | ||
2111 | [23:21:56.989270399] (+?.?????????) sys_geteuid: { 1 }, { } | ||
2112 | [23:21:56.989278081] (+0.000007682) exit_syscall: { 1 }, { ret = 0 } | ||
2113 | [23:21:56.989286043] (+0.000007962) sys_pipe: { 1 }, { fildes = 0xB77B9E8C } | ||
2114 | [23:21:56.989321802] (+0.000035759) exit_syscall: { 1 }, { ret = 0 } | ||
2115 | [23:21:56.989329345] (+0.000007543) sys_mmap_pgoff: { 1 }, { addr = 0x0, len = 10485760, prot = 3, flags = 131362, fd = 4294967295, pgoff = 0 } | ||
2116 | [23:21:56.989351694] (+0.000022349) exit_syscall: { 1 }, { ret = -1247805440 } | ||
2117 | [23:21:56.989432989] (+0.000081295) sys_clone: { 1 }, { clone_flags = 0x411, newsp = 0xB5EFFFE4, parent_tid = 0xFFFFFFFF, child_tid = 0x0 } | ||
2118 | [23:21:56.989477129] (+0.000044140) sched_stat_runtime: { 1 }, { comm = "lttng-consumerd", tid = 1193, runtime = 681660, vruntime = 43367983388 } | ||
2119 | [23:21:56.989486697] (+0.000009568) sched_migrate_task: { 1 }, { comm = "lttng-consumerd", tid = 1193, prio = 20, orig_cpu = 1, dest_cpu = 1 } | ||
2120 | [23:21:56.989508418] (+0.000021721) hrtimer_init: { 1 }, { hrtimer = 3970832076, clockid = 1, mode = 1 } | ||
2121 | [23:21:56.989770462] (+0.000262044) hrtimer_cancel: { 1 }, { hrtimer = 3993865440 } | ||
2122 | [23:21:56.989771580] (+0.000001118) hrtimer_cancel: { 0 }, { hrtimer = 3993812192 } | ||
2123 | [23:21:56.989776957] (+0.000005377) hrtimer_expire_entry: { 1 }, { hrtimer = 3993865440, now = 79815980007057, function = 3238465232 } | ||
2124 | [23:21:56.989778145] (+0.000001188) hrtimer_expire_entry: { 0 }, { hrtimer = 3993812192, now = 79815980008174, function = 3238465232 } | ||
2125 | [23:21:56.989791695] (+0.000013550) softirq_raise: { 1 }, { vec = 1 } | ||
2126 | [23:21:56.989795396] (+0.000003701) softirq_raise: { 0 }, { vec = 1 } | ||
2127 | [23:21:56.989800635] (+0.000005239) softirq_raise: { 0 }, { vec = 9 } | ||
2128 | [23:21:56.989807130] (+0.000006495) sched_stat_runtime: { 1 }, { comm = "lttng-consumerd", tid = 1193, runtime = 330710, vruntime = 43368314098 } | ||
2129 | [23:21:56.989809993] (+0.000002863) sched_stat_runtime: { 0 }, { comm = "lttng-sessiond", tid = 1181, runtime = 1015313, vruntime = 36976733240 } | ||
2130 | [23:21:56.989818514] (+0.000008521) hrtimer_expire_exit: { 0 }, { hrtimer = 3993812192 } | ||
2131 | [23:21:56.989819631] (+0.000001117) hrtimer_expire_exit: { 1 }, { hrtimer = 3993865440 } | ||
2132 | [23:21:56.989821866] (+0.000002235) hrtimer_start: { 0 }, { hrtimer = 3993812192, function = 3238465232, expires = 79815981000000, softexpires = 79815981000000 } | ||
2133 | [23:21:56.989822984] (+0.000001118) hrtimer_start: { 1 }, { hrtimer = 3993865440, function = 3238465232, expires = 79815981000000, softexpires = 79815981000000 } | ||
2134 | [23:21:56.989832762] (+0.000009778) softirq_entry: { 1 }, { vec = 1 } | ||
2135 | [23:21:56.989833879] (+0.000001117) softirq_entry: { 0 }, { vec = 1 } | ||
2136 | [23:21:56.989838069] (+0.000004190) timer_cancel: { 1 }, { timer = 3993871956 } | ||
2137 | [23:21:56.989839187] (+0.000001118) timer_cancel: { 0 }, { timer = 3993818708 } | ||
2138 | [23:21:56.989841492] (+0.000002305) timer_expire_entry: { 1 }, { timer = 3993871956, now = 79515980, function = 3238277552 } | ||
2139 | [23:21:56.989842819] (+0.000001327) timer_expire_entry: { 0 }, { timer = 3993818708, now = 79515980, function = 3238277552 } | ||
2140 | [23:21:56.989854831] (+0.000012012) sched_stat_runtime: { 1 }, { comm = "lttng-consumerd", tid = 1193, runtime = 49237, vruntime = 43368363335 } | ||
2141 | [23:21:56.989855949] (+0.000001118) sched_stat_runtime: { 0 }, { comm = "lttng-sessiond", tid = 1181, runtime = 45121, vruntime = 36976778361 } | ||
2142 | [23:21:56.989861257] (+0.000005308) sched_stat_sleep: { 1 }, { comm = "kworker/1:1", tid = 21, delay = 9451318 } | ||
2143 | [23:21:56.989862374] (+0.000001117) sched_stat_sleep: { 0 }, { comm = "kworker/0:0", tid = 4, delay = 9958820 } | ||
2144 | [23:21:56.989868241] (+0.000005867) sched_wakeup: { 0 }, { comm = "kworker/0:0", tid = 4, prio = 120, success = 1, target_cpu = 0 } | ||
2145 | [23:21:56.989869358] (+0.000001117) sched_wakeup: { 1 }, { comm = "kworker/1:1", tid = 21, prio = 120, success = 1, target_cpu = 1 } | ||
2146 | [23:21:56.989877460] (+0.000008102) timer_expire_exit: { 1 }, { timer = 3993871956 } | ||
2147 | [23:21:56.989878577] (+0.000001117) timer_expire_exit: { 0 }, { timer = 3993818708 } | ||
2148 | . | ||
2149 | . | ||
2150 | . | ||
2151 | |||
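A full kernel trace is long, so it can help to summarize it first. The following is a minimal sketch, not an LTTng feature: ``count_events`` is a hypothetical helper that assumes the text format shown above, in which the event name is the third whitespace-separated field followed by a colon. It is demonstrated here on a few lines copied from the sample output.

```shell
# Hypothetical helper: count events per name in 'lttng view'-style text on stdin.
# Prints "count name" pairs, most frequent first.
count_events() {
    awk 'NF >= 3 { name = $3; sub(/:$/, "", name); n[name]++ }
         END { for (e in n) print n[e], e }' | sort -k1,1nr -k2,2
}

# A few lines copied from the sample output above.
sample='[23:21:56.989770462] (+0.000262044) hrtimer_cancel: { 1 }, { hrtimer = 3993865440 }
[23:21:56.989771580] (+0.000001118) hrtimer_cancel: { 0 }, { hrtimer = 3993812192 }
[23:21:56.989791695] (+0.000013550) softirq_raise: { 1 }, { vec = 1 }
[23:21:56.989795396] (+0.000003701) softirq_raise: { 0 }, { vec = 1 }
[23:21:56.989800635] (+0.000005239) softirq_raise: { 0 }, { vec = 9 }'

printf '%s\n' "$sample" | count_events
```

For the sample lines this prints ``3 softirq_raise`` followed by ``2 hrtimer_cancel``. On the target, the same function can be fed directly with ``lttng view | count_events``.
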
2152 | You can now safely destroy the trace | ||
2153 | session (note that this doesn't delete the trace - it's still there in | ||
2154 | ~/lttng-traces): :: | ||
2155 | |||
2156 | root@crownbay:~# lttng destroy | ||
2157 | Session auto-20121015-232120 destroyed at /home/root | ||
2158 | |||
2159 | The trace is saved under the ~/lttng-traces directory, in a directory named | ||
2160 | after the session reported by 'lttng create' (you can change that name by | ||
2161 | supplying your own session name to 'lttng create'): :: | ||
2162 | |||
2163 | root@crownbay:~# ls -al ~/lttng-traces | ||
2164 | drwxrwx--- 3 root root 1024 Oct 15 23:21 . | ||
2165 | drwxr-xr-x 5 root root 1024 Oct 15 23:57 .. | ||
2166 | drwxrwx--- 3 root root 1024 Oct 15 23:21 auto-20121015-232120 | ||
2167 | |||
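The create, enable, start, stop and destroy steps above can be wrapped into one shell function. This is a minimal sketch: ``trace_workload`` and the ``LTTNG`` variable are hypothetical names introduced here, and the function uses only the ``lttng`` subcommands shown in this section.

```shell
# Hypothetical wrapper around the steps shown above.
# LTTNG can be overridden (e.g. LTTNG=echo) to preview the commands.
LTTNG="${LTTNG:-lttng}"

trace_workload() {
    session=$1    # session name passed to 'lttng create'
    shift         # remaining arguments: the workload command to trace

    "$LTTNG" create "$session" || return 1
    "$LTTNG" enable-event --kernel --all || return 1
    "$LTTNG" start || return 1
    "$@"                  # run the workload while tracing is active
    status=$?
    "$LTTNG" stop
    "$LTTNG" destroy      # ends the session; the trace files are kept
    return "$status"
}
```

For example, ``trace_workload mysession sleep 5`` records all kernel events for about five seconds and leaves the trace under ~/lttng-traces. The function returns the exit status of the workload, so it can be used in scripts that check whether the traced command succeeded.
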
2168 | Collecting and viewing a userspace trace on the target (inside a shell) | ||
2169 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
2170 | |||
2171 | For LTTng userspace tracing, you need to have a properly instrumented | ||
2172 | userspace program. For this example, we'll use the 'hello' test program | ||
2173 | generated by the lttng-ust build. | ||
2174 | |||
2175 | The 'hello' test program isn't installed on the rootfs by the lttng-ust | ||
2176 | build, so we need to copy it over manually. First cd into the build | ||
2177 | directory that contains the hello executable: :: | ||
2178 | |||
2179 | $ cd build/tmp/work/core2_32-poky-linux/lttng-ust/2.0.5-r0/git/tests/hello/.libs | ||
2180 | |||
2181 | Copy that over to the target machine: :: | ||
2182 | |||
2183 | $ scp hello root@192.168.1.20: | ||
2184 | |||
2185 | You now have the instrumented lttng 'hello world' test program on the | ||
2186 | target, ready to test. | ||
2187 | |||
2188 | First, from the host, ssh to the target: :: | ||
2189 | |||
2190 | $ ssh -l root 192.168.1.47 | ||
2191 | The authenticity of host '192.168.1.47 (192.168.1.47)' can't be established. | ||
2192 | RSA key fingerprint is 23:bd:c8:b1:a8:71:52:00:ee:00:4f:64:9e:10:b9:7e. | ||
2193 | Are you sure you want to continue connecting (yes/no)? yes | ||
2194 | Warning: Permanently added '192.168.1.47' (RSA) to the list of known hosts. | ||
2195 | root@192.168.1.47's password: | ||
2196 | |||
2197 | Once on the target, use these steps to create a trace: :: | ||
2198 | |||
2199 | root@crownbay:~# lttng create | ||
2200 | Session auto-20190303-021943 created. | ||
2201 | Traces will be written in /home/root/lttng-traces/auto-20190303-021943 | ||
2202 | |||
2203 | Enable the events you want to trace (in this case all userspace events): :: | ||
2204 | |||
2205 | root@crownbay:~# lttng enable-event --userspace --all | ||
2206 | All UST events are enabled in channel channel0 | ||
2207 | |||
2208 | Start the trace: :: | ||
2209 | |||
2210 | root@crownbay:~# lttng start | ||
2211 | Tracing started for session auto-20190303-021943 | ||
2212 | |||
2213 | Run the instrumented hello world program: :: | ||
2214 | |||
2215 | root@crownbay:~# ./hello | ||
2216 | Hello, World! | ||
2217 | Tracing... done. | ||
2218 | |||
2219 | And then stop the trace after a while or after running a particular workload | ||
2220 | that you want to trace: :: | ||
2221 | |||
2222 | root@crownbay:~# lttng stop | ||
2223 | Tracing stopped for session auto-20190303-021943 | ||
2224 | |||
2225 | You can now view the trace in text form on the target: :: | ||
2226 | |||
2227 | root@crownbay:~# lttng view | ||
2228 | [02:31:14.906146544] (+?.?????????) hello:1424 ust_tests_hello:tptest: { cpu_id = 1 }, { intfield = 0, intfield2 = 0x0, longfield = 0, netintfield = 0, netintfieldhex = 0x0, arrfield1 = [ [0] = 1, [1] = 2, [2] = 3 ], arrfield2 = "test", _seqfield1_length = 4, seqfield1 = [ [0] = 116, [1] = 101, [2] = 115, [3] = 116 ], _seqfield2_length = 4, seqfield2 = "test", stringfield = "test", floatfield = 2222, doublefield = 2, boolfield = 1 } | ||
2229 | [02:31:14.906170360] (+0.000023816) hello:1424 ust_tests_hello:tptest: { cpu_id = 1 }, { intfield = 1, intfield2 = 0x1, longfield = 1, netintfield = 1, netintfieldhex = 0x1, arrfield1 = [ [0] = 1, [1] = 2, [2] = 3 ], arrfield2 = "test", _seqfield1_length = 4, seqfield1 = [ [0] = 116, [1] = 101, [2] = 115, [3] = 116 ], _seqfield2_length = 4, seqfield2 = "test", stringfield = "test", floatfield = 2222, doublefield = 2, boolfield = 1 } | ||
2230 | [02:31:14.906183140] (+0.000012780) hello:1424 ust_tests_hello:tptest: { cpu_id = 1 }, { intfield = 2, intfield2 = 0x2, longfield = 2, netintfield = 2, netintfieldhex = 0x2, arrfield1 = [ [0] = 1, [1] = 2, [2] = 3 ], arrfield2 = "test", _seqfield1_length = 4, seqfield1 = [ [0] = 116, [1] = 101, [2] = 115, [3] = 116 ], _seqfield2_length = 4, seqfield2 = "test", stringfield = "test", floatfield = 2222, doublefield = 2, boolfield = 1 } | ||
2231 | [02:31:14.906194385] (+0.000011245) hello:1424 ust_tests_hello:tptest: { cpu_id = 1 }, { intfield = 3, intfield2 = 0x3, longfield = 3, netintfield = 3, netintfieldhex = 0x3, arrfield1 = [ [0] = 1, [1] = 2, [2] = 3 ], arrfield2 = "test", _seqfield1_length = 4, seqfield1 = [ [0] = 116, [1] = 101, [2] = 115, [3] = 116 ], _seqfield2_length = 4, seqfield2 = "test", stringfield = "test", floatfield = 2222, doublefield = 2, boolfield = 1 } | ||
2232 | . | ||
2233 | . | ||
2234 | . | ||
2235 | |||
2236 | You can now safely destroy the trace session (note that this doesn't delete the | ||
2237 | trace - it's still there in ~/lttng-traces): :: | ||
2238 | |||
2239 | root@crownbay:~# lttng destroy | ||
2240 | Session auto-20190303-021943 destroyed at /home/root | ||
2241 | |||
2242 | .. _lltng-documentation: | ||
2243 | |||
2244 | LTTng Documentation | ||
2245 | ------------------- | ||
2246 | |||
2247 | You can find the primary LTTng Documentation on the `LTTng | ||
2248 | Documentation <https://lttng.org/docs/>`__ site. The documentation on | ||
2249 | this site is appropriate for intermediate to advanced software | ||
2250 | developers who are working in a Linux environment and are interested in | ||
2251 | efficient software tracing. | ||
2252 | |||
2253 | For information on LTTng in general, visit the `LTTng | ||
2254 | Project <http://lttng.org/lttng2.0>`__ site. You can find a "Getting | ||
2255 | Started" link on this site that takes you to an LTTng Quick Start. | ||
2256 | |||
2257 | .. _profile-manual-blktrace: | ||
2258 | |||
2259 | blktrace | ||
2260 | ======== | ||
2261 | |||
2262 | blktrace is a tool for tracing and reporting low-level disk I/O. | ||
2263 | blktrace provides the tracing half of the equation; its output can be | ||
2264 | piped into the blkparse program, which renders the data in a | ||
2265 | human-readable form and does some basic analysis. | ||
2266 | |||
2267 | .. _blktrace-setup: | ||
2268 | |||
2269 | blktrace Setup | ||
2270 | -------------- | ||
2271 | |||
2272 | For this section, we'll assume you've already performed the basic setup | ||
2273 | outlined in the ":ref:`profile-manual/profile-manual-intro:General Setup`" | ||
2274 | section. | ||
2275 | |||
2276 | blktrace is an application that runs on the target system. You can run | ||
2277 | the entire blktrace and blkparse pipeline on the target, or you can run | ||
2278 | blktrace in 'listen' mode on the target and have blktrace and blkparse | ||
2279 | collect and analyze the data on the host (see the | ||
2280 | ":ref:`profile-manual/profile-manual-usage:Using blktrace Remotely`" section | ||
2281 | below). For the rest of this section we assume you've ssh'ed to the target and | ||
2282 | will be running blktrace there. | ||
2283 | |||
2284 | .. _blktrace-basic-usage: | ||
2285 | |||
2286 | Basic blktrace Usage | ||
2287 | -------------------- | ||
2288 | |||
2289 | To record a trace, simply run the 'blktrace' command, giving it the name | ||
2290 | of the block device you want to trace activity on: :: | ||
2291 | |||
2292 | root@crownbay:~# blktrace /dev/sdc | ||
2293 | |||
2294 | In another shell, execute a workload you want to trace. :: | ||
2295 | |||
2296 | root@crownbay:/media/sdc# rm linux-2.6.19.2.tar.bz2; wget http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2; sync | ||
2297 | Connecting to downloads.yoctoproject.org (140.211.169.59:80) | ||
2298 | linux-2.6.19.2.tar.b 100% \|*******************************\| 41727k 0:00:00 ETA | ||
2299 | |||
2300 | Press Ctrl-C in the blktrace shell to stop the trace. It | ||
2301 | will display how many events were logged, along with the per-cpu file | ||
2302 | sizes (blktrace records traces in per-cpu kernel buffers and simply | ||
2303 | dumps them to userspace for blkparse to merge and sort later). :: | ||
2304 | |||
2305 | ^C=== sdc === | ||
2306 | CPU 0: 7082 events, 332 KiB data | ||
2307 | CPU 1: 1578 events, 74 KiB data | ||
2308 | Total: 8660 events (dropped 0), 406 KiB data | ||
2309 | |||
2310 | If you examine the files saved to disk, you see multiple files, one per CPU and | ||
2311 | with the device name as the first part of the filename: :: | ||
2312 | |||
2313 | root@crownbay:~# ls -al | ||
2314 | drwxr-xr-x 6 root root 1024 Oct 27 22:39 . | ||
2315 | drwxr-sr-x 4 root root 1024 Oct 26 18:24 .. | ||
2316 | -rw-r--r-- 1 root root 339938 Oct 27 22:40 sdc.blktrace.0 | ||
2317 | -rw-r--r-- 1 root root 75753 Oct 27 22:40 sdc.blktrace.1 | ||
2318 | |||
2319 | To view the trace events, simply invoke 'blkparse' in the directory | ||
2320 | containing the trace files, giving it the device name that forms the | ||
2321 | first part of the filenames: :: | ||
2322 | |||
2323 | root@crownbay:~# blkparse sdc | ||
2324 | |||
2325 | 8,32 1 1 0.000000000 1225 Q WS 3417048 + 8 [jbd2/sdc-8] | ||
2326 | 8,32 1 2 0.000025213 1225 G WS 3417048 + 8 [jbd2/sdc-8] | ||
2327 | 8,32 1 3 0.000033384 1225 P N [jbd2/sdc-8] | ||
2328 | 8,32 1 4 0.000043301 1225 I WS 3417048 + 8 [jbd2/sdc-8] | ||
2329 | 8,32 1 0 0.000057270 0 m N cfq1225 insert_request | ||
2330 | 8,32 1 0 0.000064813 0 m N cfq1225 add_to_rr | ||
2331 | 8,32 1 5 0.000076336 1225 U N [jbd2/sdc-8] 1 | ||
2332 | 8,32 1 0 0.000088559 0 m N cfq workload slice:150 | ||
2333 | 8,32 1 0 0.000097359 0 m N cfq1225 set_active wl_prio:0 wl_type:1 | ||
2334 | 8,32 1 0 0.000104063 0 m N cfq1225 Not idling. st->count:1 | ||
2335 | 8,32 1 0 0.000112584 0 m N cfq1225 fifo= (null) | ||
2336 | 8,32 1 0 0.000118730 0 m N cfq1225 dispatch_insert | ||
2337 | 8,32 1 0 0.000127390 0 m N cfq1225 dispatched a request | ||
2338 | 8,32 1 0 0.000133536 0 m N cfq1225 activate rq, drv=1 | ||
2339 | 8,32 1 6 0.000136889 1225 D WS 3417048 + 8 [jbd2/sdc-8] | ||
2340 | 8,32 1 7 0.000360381 1225 Q WS 3417056 + 8 [jbd2/sdc-8] | ||
2341 | 8,32 1 8 0.000377422 1225 G WS 3417056 + 8 [jbd2/sdc-8] | ||
2342 | 8,32 1 9 0.000388876 1225 P N [jbd2/sdc-8] | ||
2343 | 8,32 1 10 0.000397886 1225 Q WS 3417064 + 8 [jbd2/sdc-8] | ||
2344 | 8,32 1 11 0.000404800 1225 M WS 3417064 + 8 [jbd2/sdc-8] | ||
2345 | 8,32 1 12 0.000412343 1225 Q WS 3417072 + 8 [jbd2/sdc-8] | ||
2346 | 8,32 1 13 0.000416533 1225 M WS 3417072 + 8 [jbd2/sdc-8] | ||
2347 | 8,32 1 14 0.000422121 1225 Q WS 3417080 + 8 [jbd2/sdc-8] | ||
2348 | 8,32 1 15 0.000425194 1225 M WS 3417080 + 8 [jbd2/sdc-8] | ||
2349 | 8,32 1 16 0.000431968 1225 Q WS 3417088 + 8 [jbd2/sdc-8] | ||
2350 | 8,32 1 17 0.000435251 1225 M WS 3417088 + 8 [jbd2/sdc-8] | ||
2351 | 8,32 1 18 0.000440279 1225 Q WS 3417096 + 8 [jbd2/sdc-8] | ||
2352 | 8,32 1 19 0.000443911 1225 M WS 3417096 + 8 [jbd2/sdc-8] | ||
2353 | 8,32 1 20 0.000450336 1225 Q WS 3417104 + 8 [jbd2/sdc-8] | ||
2354 | 8,32 1 21 0.000454038 1225 M WS 3417104 + 8 [jbd2/sdc-8] | ||
2355 | 8,32 1 22 0.000462070 1225 Q WS 3417112 + 8 [jbd2/sdc-8] | ||
2356 | 8,32 1 23 0.000465422 1225 M WS 3417112 + 8 [jbd2/sdc-8] | ||
2357 | 8,32 1 24 0.000474222 1225 I WS 3417056 + 64 [jbd2/sdc-8] | ||
2358 | 8,32 1 0 0.000483022 0 m N cfq1225 insert_request | ||
2359 | 8,32 1 25 0.000489727 1225 U N [jbd2/sdc-8] 1 | ||
2360 | 8,32 1 0 0.000498457 0 m N cfq1225 Not idling. st->count:1 | ||
2361 | 8,32 1 0 0.000503765 0 m N cfq1225 dispatch_insert | ||
2362 | 8,32 1 0 0.000512914 0 m N cfq1225 dispatched a request | ||
2363 | 8,32 1 0 0.000518851 0 m N cfq1225 activate rq, drv=2 | ||
2364 | . | ||
2365 | . | ||
2366 | . | ||
2367 | 8,32 0 0 58.515006138 0 m N cfq3551 complete rqnoidle 1 | ||
2368 | 8,32 0 2024 58.516603269 3 C WS 3156992 + 16 [0] | ||
2369 | 8,32 0 0 58.516626736 0 m N cfq3551 complete rqnoidle 1 | ||
2370 | 8,32 0 0 58.516634558 0 m N cfq3551 arm_idle: 8 group_idle: 0 | ||
2371 | 8,32 0 0 58.516636933 0 m N cfq schedule dispatch | ||
2372 | 8,32 1 0 58.516971613 0 m N cfq3551 slice expired t=0 | ||
2373 | 8,32 1 0 58.516982089 0 m N cfq3551 sl_used=13 disp=6 charge=13 iops=0 sect=80 | ||
2374 | 8,32 1 0 58.516985511 0 m N cfq3551 del_from_rr | ||
2375 | 8,32 1 0 58.516990819 0 m N cfq3551 put_queue | ||
2376 | |||
2377 | CPU0 (sdc): | ||
2378 | Reads Queued: 0, 0KiB Writes Queued: 331, 26,284KiB | ||
2379 | Read Dispatches: 0, 0KiB Write Dispatches: 485, 40,484KiB | ||
2380 | Reads Requeued: 0 Writes Requeued: 0 | ||
2381 | Reads Completed: 0, 0KiB Writes Completed: 511, 41,000KiB | ||
2382 | Read Merges: 0, 0KiB Write Merges: 13, 160KiB | ||
2383 | Read depth: 0 Write depth: 2 | ||
2384 | IO unplugs: 23 Timer unplugs: 0 | ||
2385 | CPU1 (sdc): | ||
2386 | Reads Queued: 0, 0KiB Writes Queued: 249, 15,800KiB | ||
2387 | Read Dispatches: 0, 0KiB Write Dispatches: 42, 1,600KiB | ||
2388 | Reads Requeued: 0 Writes Requeued: 0 | ||
2389 | Reads Completed: 0, 0KiB Writes Completed: 16, 1,084KiB | ||
2390 | Read Merges: 0, 0KiB Write Merges: 40, 276KiB | ||
2391 | Read depth: 0 Write depth: 2 | ||
2392 | IO unplugs: 30 Timer unplugs: 1 | ||
2393 | |||
2394 | Total (sdc): | ||
2395 | Reads Queued: 0, 0KiB Writes Queued: 580, 42,084KiB | ||
2396 | Read Dispatches: 0, 0KiB Write Dispatches: 527, 42,084KiB | ||
2397 | Reads Requeued: 0 Writes Requeued: 0 | ||
2398 | Reads Completed: 0, 0KiB Writes Completed: 527, 42,084KiB | ||
2399 | Read Merges: 0, 0KiB Write Merges: 53, 436KiB | ||
2400 | IO unplugs: 53 Timer unplugs: 1 | ||
2401 | |||
2402 | Throughput (R/W): 0KiB/s / 719KiB/s | ||
2403 | Events (sdc): 6,592 entries | ||
2404 | Skips: 0 forward (0 - 0.0%) | ||
2405 | Input file sdc.blktrace.0 added | ||
2406 | Input file sdc.blktrace.1 added | ||
2407 | |||
2408 | The report shows each event that was | ||
2409 | found in the blktrace data, along with a summary of the overall block | ||
2410 | I/O traffic during the run. You can look at the | ||
2411 | `blkparse <http://linux.die.net/man/1/blkparse>`__ manpage to learn the | ||
2412 | meaning of each field displayed in the trace listing. | ||
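
As a rough illustration of how the per-event lines can be post-processed, the following sketch counts events by the action column. In the default blkparse output shown above, that is the sixth whitespace-separated field (for example Q for queued, G for get request, I for inserted, D for issued, C for completed, M for back merge and m for message, as described in the blkparse manpage). The function name ``count_actions`` is hypothetical and not part of the blktrace tools, and the sketch assumes the default output format:

```shell
# count_actions: read blkparse text output on stdin and print one
# "<action> <count>" line per action letter, sorted by action.
# Assumes the default blkparse format, where field 1 is "major,minor"
# and field 6 is the action. Summary lines are skipped because their
# first field is not a "major,minor" pair.
count_actions() {
    awk '$1 ~ /^[0-9]+,[0-9]+$/ && NF >= 7 { n[$6]++ }
         END { for (a in n) print a, n[a] }' | LC_ALL=C sort
}

# Example usage, run in the directory containing the sdc.blktrace.* files:
#   blkparse sdc | count_actions
```

Counting by action gives a quick feel for the shape of a workload, such as how many requests were merged compared to how many were queued.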
2413 | |||
2414 | .. _blktrace-live-mode: | ||
2415 | |||
2416 | Live Mode | ||
2417 | ~~~~~~~~~ | ||
2418 | |||
2419 | blktrace and blkparse are designed from the ground up to be able to | ||
2420 | operate together in a 'pipe mode' where the stdout of blktrace can be | ||
2421 | fed directly into the stdin of blkparse: :: | ||
2422 | |||
2423 | root@crownbay:~# blktrace /dev/sdc -o - | blkparse -i - | ||
2424 | |||
2425 | This lets long-lived tracing sessions run without writing anything to | ||
2426 | disk. It also lets you look for certain conditions in the trace data | ||
2427 | in 'real-time', either by watching the trace output as it scrolls by | ||
2428 | on the screen or by passing it along to yet another program in the | ||
2429 | pipeline, such as grep, which can be used to identify and capture | ||
2430 | conditions of interest. | ||
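
For example, a small filter along the following lines could be placed at the end of the pipeline to keep only events with a given action. This is a sketch: ``filter_action`` is a hypothetical helper name, and it assumes the default blkparse output format, in which the sixth field is the action:

```shell
# filter_action ACTION: pass through only the per-event lines whose
# action column (field 6 in the default blkparse format) equals ACTION.
# Lines that do not start with a "major,minor" device pair are dropped.
filter_action() {
    awk -v act="$1" '$1 ~ /^[0-9]+,[0-9]+$/ && $6 == act'
}

# Example usage on a live trace (needs root and a real block device):
#   blktrace /dev/sdc -o - | blkparse -i - | filter_action C
```

Depending on how each program in the pipeline buffers its output, matching lines may appear in bursts rather than one at a time.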
2431 | |||
2432 | There's also a separate 'btrace' command that implements the | ||
2433 | blktrace-to-blkparse pipeline as a single command, so you don't have | ||
2434 | to type the full pipeline yourself: :: | ||
2435 | |||
2436 | root@crownbay:~# btrace /dev/sdc | ||
2437 | |||
2438 | Using blktrace Remotely | ||
2439 | ~~~~~~~~~~~~~~~~~~~~~~~ | ||
2440 | |||
2441 | blktrace traces block I/O, yet it normally also writes its trace data | ||
2442 | to a block device. In general, it's not a great idea to make the | ||
2443 | device being traced the same as the device the tracer writes to. To | ||
2444 | avoid perturbing the traced device at all, blktrace provides native | ||
2445 | support for sending all trace data over the network instead of | ||
2446 | writing it locally. | ||
2447 | |||
2448 | To have blktrace operate in this mode, start blktrace in server mode on | ||
2449 | the host system, which receives and stores the trace data, with the -l option: :: | ||
2450 | |||
2451 | $ blktrace -l | ||
2452 | server: waiting for connections... | ||
2453 | |||
2454 | On the target system being traced, use the -h option to connect to the | ||
2455 | host system, also passing it the device to trace: :: | ||
2456 | |||
2457 | root@crownbay:~# blktrace -d /dev/sdc -h 192.168.1.43 | ||
2458 | blktrace: connecting to 192.168.1.43 | ||
2459 | blktrace: connected! | ||
2460 | |||
2461 | On the host system, you should see this: :: | ||
2462 | |||
2463 | server: connection from 192.168.1.43 | ||
2464 | |||
2465 | In another shell on the target, execute a workload you want to trace. :: | ||
2466 | |||
2467 | root@crownbay:/media/sdc# rm linux-2.6.19.2.tar.bz2; wget http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2; sync | ||
2468 | Connecting to downloads.yoctoproject.org (140.211.169.59:80) | ||
2469 | linux-2.6.19.2.tar.b 100% \|*******************************\| 41727k 0:00:00 ETA | ||
2470 | |||
2471 | When it's done, do a Ctrl-C on the target system to stop the | ||
2472 | trace: :: | ||
2473 | |||
2474 | ^C=== sdc === | ||
2475 | CPU 0: 7691 events, 361 KiB data | ||
2476 | CPU 1: 4109 events, 193 KiB data | ||
2477 | Total: 11800 events (dropped 0), 554 KiB data | ||
2478 | |||
2479 | On the host system, you should also see a trace summary for the trace | ||
2480 | just ended: :: | ||
2481 | |||
2482 | server: end of run for 192.168.1.43:sdc | ||
2483 | === sdc === | ||
2484 | CPU 0: 7691 events, 361 KiB data | ||
2485 | CPU 1: 4109 events, 193 KiB data | ||
2486 | Total: 11800 events (dropped 0), 554 KiB data | ||
2487 | |||
2488 | The blktrace instance on the host will | ||
2489 | save the target output inside a hostname-timestamp directory: :: | ||
2490 | |||
2491 | $ ls -al | ||
2492 | drwxr-xr-x 10 root root 1024 Oct 28 02:40 . | ||
2493 | drwxr-sr-x 4 root root 1024 Oct 26 18:24 .. | ||
2494 | drwxr-xr-x 2 root root 1024 Oct 28 02:40 192.168.1.43-2012-10-28-02:40:56 | ||
2495 | |||
2496 | cd into that directory to see the output files: :: | ||
2497 | |||
2498 | $ ls -l | ||
2499 | -rw-r--r-- 1 root root 369193 Oct 28 02:44 sdc.blktrace.0 | ||
2500 | -rw-r--r-- 1 root root 197278 Oct 28 02:44 sdc.blktrace.1 | ||
2501 | |||
2502 | And run blkparse on the host system using the device name: :: | ||
2503 | |||
2504 | $ blkparse sdc | ||
2505 | |||
2506 | 8,32 1 1 0.000000000 1263 Q RM 6016 + 8 [ls] | ||
2507 | 8,32 1 0 0.000036038 0 m N cfq1263 alloced | ||
2508 | 8,32 1 2 0.000039390 1263 G RM 6016 + 8 [ls] | ||
2509 | 8,32 1 3 0.000049168 1263 I RM 6016 + 8 [ls] | ||
2510 | 8,32 1 0 0.000056152 0 m N cfq1263 insert_request | ||
2511 | 8,32 1 0 0.000061600 0 m N cfq1263 add_to_rr | ||
2512 | 8,32 1 0 0.000075498 0 m N cfq workload slice:300 | ||
2513 | . | ||
2514 | . | ||
2515 | . | ||
2516 | 8,32 0 0 177.266385696 0 m N cfq1267 arm_idle: 8 group_idle: 0 | ||
2517 | 8,32 0 0 177.266388140 0 m N cfq schedule dispatch | ||
2518 | 8,32 1 0 177.266679239 0 m N cfq1267 slice expired t=0 | ||
2519 | 8,32 1 0 177.266689297 0 m N cfq1267 sl_used=9 disp=6 charge=9 iops=0 sect=56 | ||
2520 | 8,32 1 0 177.266692649 0 m N cfq1267 del_from_rr | ||
2521 | 8,32 1 0 177.266696560 0 m N cfq1267 put_queue | ||
2522 | |||
2523 | CPU0 (sdc): | ||
2524 | Reads Queued: 0, 0KiB Writes Queued: 270, 21,708KiB | ||
2525 | Read Dispatches: 59, 2,628KiB Write Dispatches: 495, 39,964KiB | ||
2526 | Reads Requeued: 0 Writes Requeued: 0 | ||
2527 | Reads Completed: 90, 2,752KiB Writes Completed: 543, 41,596KiB | ||
2528 | Read Merges: 0, 0KiB Write Merges: 9, 344KiB | ||
2529 | Read depth: 2 Write depth: 2 | ||
2530 | IO unplugs: 20 Timer unplugs: 1 | ||
2531 | CPU1 (sdc): | ||
2532 | Reads Queued: 688, 2,752KiB Writes Queued: 381, 20,652KiB | ||
2533 | Read Dispatches: 31, 124KiB Write Dispatches: 59, 2,396KiB | ||
2534 | Reads Requeued: 0 Writes Requeued: 0 | ||
2535 | Reads Completed: 0, 0KiB Writes Completed: 11, 764KiB | ||
2536 | Read Merges: 598, 2,392KiB Write Merges: 88, 448KiB | ||
2537 | Read depth: 2 Write depth: 2 | ||
2538 | IO unplugs: 52 Timer unplugs: 0 | ||
2539 | |||
2540 | Total (sdc): | ||
2541 | Reads Queued: 688, 2,752KiB Writes Queued: 651, 42,360KiB | ||
2542 | Read Dispatches: 90, 2,752KiB Write Dispatches: 554, 42,360KiB | ||
2543 | Reads Requeued: 0 Writes Requeued: 0 | ||
2544 | Reads Completed: 90, 2,752KiB Writes Completed: 554, 42,360KiB | ||
2545 | Read Merges: 598, 2,392KiB Write Merges: 97, 792KiB | ||
2546 | IO unplugs: 72 Timer unplugs: 1 | ||
2547 | |||
2548 | Throughput (R/W): 15KiB/s / 238KiB/s | ||
2549 | Events (sdc): 9,301 entries | ||
2550 | Skips: 0 forward (0 - 0.0%) | ||
2551 | |||
2552 | You should see the trace events and summary just as you would have if you'd run | ||
2553 | the same command on the target. | ||
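
As a quick sanity check on a summary like the one above, the throughput figures appear consistent with dividing the total completed KiB by the duration of the trace and truncating to an integer. This is an assumption based on the numbers shown here, not a statement of how blkparse computes the value internally. The last timestamp in the excerpt is used as an approximation of the trace duration, and the ``throughput`` helper name is hypothetical:

```shell
# throughput KIB SECONDS: print KIB / SECONDS as an integer number of
# KiB/s, truncated toward zero.
throughput() {
    awk -v kib="$1" -v secs="$2" 'BEGIN { printf "%d\n", int(kib / secs) }'
}

# Totals and final timestamp taken from the remote trace above.
throughput 2752 177.266696560     # reads completed, in KiB
throughput 42360 177.266696560    # writes completed, in KiB
```

With these inputs the two calls print 15 and 238, which agrees with the 'Throughput (R/W): 15KiB/s / 238KiB/s' line in the summary.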
2554 | |||
2555 | Tracing Block I/O via 'ftrace' | ||
2556 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
2557 | |||
2558 | It's also possible to trace block I/O using only | ||
2559 | :ref:`profile-manual/profile-manual-usage:The 'trace events' Subsystem`, which | ||
2560 | can be useful for casual tracing if you don't want to bother dealing with the | ||
2561 | userspace tools. | ||
2562 | |||
2563 | To enable tracing for a given device, use /sys/block/xxx/trace/enable, | ||
2564 | where xxx is the device name. This for example enables tracing for | ||
2565 | /dev/sdc: :: | ||
2566 | |||
2567 | root@crownbay:/sys/kernel/debug/tracing# echo 1 > /sys/block/sdc/trace/enable | ||
2568 | |||
2569 | Once you've enabled tracing on the device(s) you want | ||
2570 | to trace, select the 'blk' tracer to turn tracing on: :: | ||
2571 | |||
2572 | root@crownbay:/sys/kernel/debug/tracing# cat available_tracers | ||
2573 | blk function_graph function nop | ||
2574 | |||
2575 | root@crownbay:/sys/kernel/debug/tracing# echo blk > current_tracer | ||
2576 | |||
2577 | Execute the workload you're interested in: :: | ||
2578 | |||
2579 | root@crownbay:/sys/kernel/debug/tracing# cat /media/sdc/testfile.txt | ||
2580 | |||
2581 | Then look at the output. Note that we read 'trace_pipe' rather than | ||
2582 | 'trace' here: reading trace_pipe blocks, so we can wait on the pipe | ||
2583 | for data to appear: :: | ||
2584 | |||
2585 | root@crownbay:/sys/kernel/debug/tracing# cat trace_pipe | ||
2586 | cat-3587 [001] d..1 3023.276361: 8,32 Q R 1699848 + 8 [cat] | ||
2587 | cat-3587 [001] d..1 3023.276410: 8,32 m N cfq3587 alloced | ||
2588 | cat-3587 [001] d..1 3023.276415: 8,32 G R 1699848 + 8 [cat] | ||
2589 | cat-3587 [001] d..1 3023.276424: 8,32 P N [cat] | ||
2590 | cat-3587 [001] d..2 3023.276432: 8,32 I R 1699848 + 8 [cat] | ||
2591 | cat-3587 [001] d..1 3023.276439: 8,32 m N cfq3587 insert_request | ||
2592 | cat-3587 [001] d..1 3023.276445: 8,32 m N cfq3587 add_to_rr | ||
2593 | cat-3587 [001] d..2 3023.276454: 8,32 U N [cat] 1 | ||
2594 | cat-3587 [001] d..1 3023.276464: 8,32 m N cfq workload slice:150 | ||
2595 | cat-3587 [001] d..1 3023.276471: 8,32 m N cfq3587 set_active wl_prio:0 wl_type:2 | ||
2596 | cat-3587 [001] d..1 3023.276478: 8,32 m N cfq3587 fifo= (null) | ||
2597 | cat-3587 [001] d..1 3023.276483: 8,32 m N cfq3587 dispatch_insert | ||
2598 | cat-3587 [001] d..1 3023.276490: 8,32 m N cfq3587 dispatched a request | ||
2599 | cat-3587 [001] d..1 3023.276497: 8,32 m N cfq3587 activate rq, drv=1 | ||
2600 | cat-3587 [001] d..2 3023.276500: 8,32 D R 1699848 + 8 [cat] | ||
2601 | |||
2602 | And this turns off tracing for the specified device: :: | ||
2603 | |||
2604 | root@crownbay:/sys/kernel/debug/tracing# echo 0 > /sys/block/sdc/trace/enable | ||
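
The steps above can be collected into a pair of small helper functions. This is a sketch under assumptions: the function names and the ``TRACING_DIR`` and ``SYS_BLOCK_DIR`` variables are hypothetical, introduced only so the paths can be overridden, and their defaults match the paths used in this section. When stopping, the sketch also switches back to the 'nop' tracer, which is listed in available_tracers above:

```shell
# Default locations, matching the paths used in this section.
TRACING_DIR="${TRACING_DIR:-/sys/kernel/debug/tracing}"
SYS_BLOCK_DIR="${SYS_BLOCK_DIR:-/sys/block}"

# blk_trace_start DEV: enable block tracing on DEV, then select the
# blk tracer.
blk_trace_start() {
    echo 1 > "$SYS_BLOCK_DIR/$1/trace/enable" &&
    echo blk > "$TRACING_DIR/current_tracer"
}

# blk_trace_stop DEV: switch back to the nop tracer, then disable
# block tracing on DEV.
blk_trace_stop() {
    echo nop > "$TRACING_DIR/current_tracer" &&
    echo 0 > "$SYS_BLOCK_DIR/$1/trace/enable"
}

# Example usage (as root):
#   blk_trace_start sdc
#   cat "$TRACING_DIR/trace_pipe"    # press Ctrl-C when done
#   blk_trace_stop sdc
```

Keeping the paths in variables makes it possible to try the functions against a scratch directory before pointing them at the real debugfs and sysfs files.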
2605 | |||
2606 | .. _blktrace-documentation: | ||
2607 | |||
2608 | blktrace Documentation | ||
2609 | ---------------------- | ||
2610 | |||
2611 | Online versions of the man pages for the commands discussed in this | ||
2612 | section can be found here: | ||
2613 | |||
2614 | - http://linux.die.net/man/8/blktrace | ||
2615 | |||
2616 | - http://linux.die.net/man/1/blkparse | ||
2617 | |||
2618 | - http://linux.die.net/man/8/btrace | ||
2619 | |||
2620 | The above manpages, along with manpages for the other blktrace utilities | ||
2621 | (btt, blkiomon, etc.), can be found in the /doc directory of the blktrace | ||
2622 | tools git repo: :: | ||
2623 | |||
2624 | $ git clone git://git.kernel.dk/blktrace.git | ||