
Binary event tracer #2417

Draft · xemul wants to merge 7 commits into master

Conversation

xemul (Contributor) commented Aug 30, 2024

This patch adds the ability to capture info about frequent events (such as queued/dispatched/completed IO requests) into an in-memory buffer and write it to a file. The goal is to collect a lot of events over a relatively short period of time (e.g. seconds to minutes), capturing exactly what is happening in the system instead of averaged statistical metrics. Events are collected in binary form, so a decoder is included.

In this PR the tracer is put on the io-queue to collect IO events. Example of the decoded output:

00000000:000.000002 --- buffer (skipped 0) ---
00000009:- OPENING: shard=0 text=0:main,1:exit,2:lr  
00000021:000.000105 IO Q 25180 r class 2 512
0000002c:000.000106 IO Q 24d80 r class 2 512
00000037:000.000116 IO D 25180
0000003e:000.000137 IO D 24d80
00000045:000.000190 IO C 24d80
0000004c:000.000190 IO C 25180
00000053:000.000304 IO Q 25180 r class 2 512
0000005e:000.000305 IO Q 24d80 r class 2 512
00000069:000.000306 IO D 25180
00000070:000.000306 IO D 24d80
00000077:000.000361 IO C 25180
0000007e:000.000371 IO C 24d80
...

The OPENING message includes the shard id and the id:name mapping for registered sched classes.
The IO event format above is
<offset in file>:<timestamp in usec> <event name Queue/Dispatch/Complete> <request id>
and, for the Queued event, also -- <direction r/w> <class ID> <length>

For IO there's already a set of per-request io_log.trace() calls, but it's not very efficient. First, it takes time to format the message and forward it to syslogd (on production servers). Second, each request traces three messages:

TRACE 2024-08-30 14:02:59,640 [shard 2:lr  ] io - dev 64771 : req 0x60200011ec80 queue  len 512 capacity 80612
TRACE 2024-08-30 14:02:59,640 [shard 2:main] io - dev 64771 : req 0x60200011ec80 submit
TRACE 2024-08-30 14:02:59,640 [shard 2:main] io - dev 64771 : req 0x60200011ec80 complete

that's 290 bytes. By contrast, this tracer generates 25 bytes per request. Also note that the log timestamp precision is msec, while IO events need at least usec resolution.

refs: #2403

The goal of the tracer is to capture events at a frequency that
exceeds the regular logging and metrics resolution. Basically, it's a
buffer (well -- a bunch of buffers) into which pre-defined trace points
put their binary data.

Overhead: when OFF it's a single if; when ON it's storing several
bytes of data, advancing a pointer and checking if the buffer is full.
The current implementation pre-allocates 8 buffers of 128k each. Spoiler: io
events take 25 bytes per request. At 10 kiops this size is able to store
trace data for ~4 seconds.
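
To make the hot-path cost concrete, here is a minimal sketch of the write
path described above (class and method names are hypothetical, not the
actual patch):

    #include <cstdint>
    #include <cstring>

    // Minimal sketch: one branch when tracing is OFF; a copy, a pointer
    // bump and a fullness check when it's ON.
    class tracer {
        bool _enabled = false;
        char* _cur = nullptr;   // write cursor into the current 128k buffer
        char* _end = nullptr;   // one past the end of the current buffer
    public:
        void put(const void* data, std::size_t len) noexcept {
            if (!_enabled) {    // the single "if" when tracing is OFF
                return;
            }
            if (_cur + len > _end) {
                roll_buffer();  // terminator + background flush + recycle
            }
            std::memcpy(_cur, data, len);
            _cur += len;
        }
    private:
        void roll_buffer() noexcept {
            // In the real patch: place the terminator event, hand the full
            // buffer to a background flush, and continue into a recycled
            // buffer (or count the event as skipped if none is free).
        }
    };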

When a buffer is full it's flushed to disk in a single write, then
recycled. If an event happens when no buffers are available, the tracer
counts the number of skipped events and puts that count into the next
available buffer, to at least let the user know it happened.

Tracing is started with a given duration (in seconds) and a total size of
flushed data (in bytes, naturally rounded up to 128k). Once either
of the limits is hit, tracing is stopped. This is so as not to disturb the
system too much, in terms of both run-time overhead and consumed disk space.

Message format is:

  event id (1 byte)
  timestamp (2 bytes): incremental, the duration since the previous event
  in usecs, thus holding up to ~64 msec

then comes the payload of arbitrary length. The length is not saved; it's
implied that the event id defines the payload length.
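
For illustration, a sketch of how such a header could be encoded (a
hypothetical helper, not the patch's actual code):

    #include <chrono>
    #include <cstdint>
    #include <cstring>

    // 1-byte event id + 2-byte delta timestamp; the payload (whose length
    // is implied by the id) follows right after these 3 bytes.
    struct event_header {
        std::uint8_t id;
        std::uint16_t delta_us;  // usec since previous event, caps at ~65.5 msec
    } __attribute__((packed));

    std::size_t encode_header(char* out, std::uint8_t id,
                              std::chrono::steady_clock::time_point now,
                              std::chrono::steady_clock::time_point& prev) {
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(now - prev).count();
        prev = now;
        event_header h{id, static_cast<std::uint16_t>(us)};
        std::memcpy(out, &h, sizeof(h));
        return sizeof(h);  // 3 bytes; the caller appends the payload
    }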

There are 4 service events

- buffer-head: placed at the head of each buffer; includes the number
  of skipped events (4 bytes) and a 4-byte timestamp offset. The latter
  is twice as wide as the per-event timestamp, since recycling a buffer
  may take more than 64 msec

- opening: an event that's placed when tracing is enabled; includes the
  shard number and whatever the enabling code wants to put there (spoiler:
  the IO tracer puts the id:name mapping for registered sched classes)

- tick: a timer puts this 3-byte event every 32 msec. This prevents the
  2-byte timestamps from overflowing in case no events happen for that
  long, and provides sanity milestones for the decoder

- terminator: when an event doesn't fit into the buffer's tail room, this
  1-byte event (without a timestamp) is placed there to tell the decoder
  to stop decoding the current buffer
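
Taken together, a decoder can walk one buffer with a loop along these lines
(the event ids and exact payload layouts below are made up for illustration;
the real ones are fixed by the patch):

    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    enum : std::uint8_t { EV_TERMINATOR = 0, EV_BUFFER_HEAD = 1, EV_TICK = 2 };

    void decode_buffer(const std::uint8_t* p, const std::uint8_t* end) {
        std::uint64_t ts_us = 0;
        while (p < end) {
            std::uint8_t id = *p++;
            if (id == EV_TERMINATOR) {
                break;               // no timestamp; rest of the buffer is unused
            }
            std::uint16_t delta;
            std::memcpy(&delta, p, 2);
            p += 2;
            ts_us += delta;
            if (id == EV_BUFFER_HEAD) {
                std::uint32_t skipped, ts_off;
                std::memcpy(&skipped, p, 4);     // events lost while no buffer was free
                std::memcpy(&ts_off, p + 4, 4);  // wider offset: recycling can exceed 64 msec
                p += 8;
                ts_us += ts_off;
                std::printf("--- buffer (skipped %u) ---\n", skipped);
            } else if (id == EV_TICK) {
                // no payload: only keeps the 2-byte deltas from overflowing
            } else {
                // dispatch on id to an event-specific payload decoder; the
                // id alone determines how many payload bytes to consume
                break;
            }
        }
    }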

Signed-off-by: Pavel Emelyanov <[email protected]>
As the message format is determined by the event id, the decoder should
know exactly which version generated a trace, and is thus tagged with the
same "release ID" in the repository.

Signed-off-by: Pavel Emelyanov <[email protected]>
The first trace generator is going to be the IO queue code, so put the
tracer object on board and introduce the enabler method. Events will come
in the next patch.

Signed-off-by: Pavel Emelyanov <[email protected]>
There's one interesting event generated by the fair-queue -- when the
queue starts waiting for shared capacity. In fact, the output throttler
code should live at the IO-queue level (see scylladb#2332), but currently
it's in the fair-queue, so the fair-queue will need to generate this event.

This patch adds tracer reference on fair-queue for future patching.

Signed-off-by: Pavel Emelyanov <[email protected]>
Add a CLI option that enables tracing after all preparations are done.
The workload jobs will then be traced and trace files will be available
afterwards.

Signed-off-by: Pavel Emelyanov <[email protected]>
There are 3 interesting events -- when a request is queued, when it's
dispatched and when it completes.

The first one comes with the request ID, class ID, direction and length. To
save some space in the trace buffer the length is trimmed down to
512-byte blocks, so 2 bytes are enough for requests of up to 32 MB. The
request "ID" is derived from the io_desc_read_write pointer, as it uniquely
identifies a request while it's in flight. For that, the pointer is
disguised -- shifted right by 3 bits, cutting off the always-zero low bits,
and then converted to a 32-bit number. This drops the upper part of the
pointer, but it seems to be "unique enough" for one shard. This event is
    1 (event id)
  + 2 (timestamp)
  + 1 (direction)
  + 1 (class)
  + 2 (length in blocks)
  + 4 (id)
  = 11 bytes

Dispatch and complete only include the request ID field. Each is thus
7 bytes (event id + timestamp + request id). In total, each request
generates 25 bytes in the buffer.
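
A sketch of the ID derivation and length trimming described above (the
helper names are hypothetical; io_desc_read_write is the descriptor type
named in the text):

    #include <cstdint>

    struct io_desc_read_write;  // in-flight request descriptor (opaque here)

    // Shift right by 3 to drop the always-zero alignment bits, then truncate
    // to 32 bits. The upper part of the pointer is lost, but the result stays
    // "unique enough" within one shard while the request is in flight.
    inline std::uint32_t trace_request_id(const io_desc_read_write* desc) noexcept {
        return static_cast<std::uint32_t>(reinterpret_cast<std::uintptr_t>(desc) >> 3);
    }

    // Length in 512-byte blocks: 2 bytes cover requests of up to 32 MB.
    inline std::uint16_t trace_length_blocks(std::uint64_t len_bytes) noexcept {
        return static_cast<std::uint16_t>(len_bytes >> 9);
    }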

Signed-off-by: Pavel Emelyanov <[email protected]>
There's only one interesting event for it -- when the queue enters the
"pending" state, waiting for shared capacity. No arguments here, so it
just takes 3 bytes in the buffer.

Signed-off-by: Pavel Emelyanov <[email protected]>
@xemul xemul requested a review from avikivity August 30, 2024 11:08
avikivity (Member) commented Sep 1, 2024

We should use statically defined tracepoints (<sys/sdt.h>) instead of an ad-hoc format/protocol/parser, like blktrace/blkparse. The challenge is getting the information out in a performant way. I tried systemtap dyninst and failed, several times. Please check if lttng works.

/cc @michoecho
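
(For context: a <sys/sdt.h> probe compiles to a single nop until a tracer
such as perf or bpftrace attaches, so the per-event cost is argument
marshalling plus getting the data out. A minimal sketch, with hypothetical
provider/probe names:)

    #include <sys/sdt.h>
    #include <cstdint>

    // USDT probe: a no-op unless a tracer attaches, at which point the
    // arguments become readable from outside the process.
    void trace_io_queue(std::uint32_t req_id, std::uint32_t len) {
        DTRACE_PROBE2(seastar_io, io_queue, req_id, len);
    }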

michoecho (Contributor) commented

> We should use statically defined tracepoints (<sys/sdt.h>) instead of an ad-hoc format/protocol/parser, like blktrace/blkparse. The challenge is getting the information out in a performant way. I tried systemtap dyninst and failed, several times. Please check if lttng works.
>
> /cc @michoecho

@xemul In other words: Avi has banned custom tracing patches until it is proven that existing tracing frameworks aren't good enough. (I had a custom tracing patch denied too, once.)

If you want to experiment with lttng, you can take https://github.com/scylladb/scylla-enterprise-seastar/commit/f0b870368f733bde6fd54cfbec6052080b5f2db5#diff-ceb4cd71fc94af90f390646cde7f7cc8bc673cbd7fd77208609b42fc634d0fd1 as a starting point. (But to make it compile, you will probably have to prepend LTTNG_UST_/lttng_ust_ to all lttng identifiers — there was an API change in lttng between v2.12 and v2.13.)
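
(For reference, the v2.13-style spelling of a tracepoint looks roughly like
this — a sketch from memory of the lttng-ust API, with hypothetical
provider/field names, so treat the boilerplate as approximate:)

    // tp.h -- lttng-ust >= 2.13 spelling (pre-2.13 used TRACEPOINT_EVENT etc.)
    #undef LTTNG_UST_TRACEPOINT_PROVIDER
    #define LTTNG_UST_TRACEPOINT_PROVIDER seastar_io

    #undef LTTNG_UST_TRACEPOINT_INCLUDE
    #define LTTNG_UST_TRACEPOINT_INCLUDE "./tp.h"

    #if !defined(SEASTAR_IO_TP_H) || defined(LTTNG_UST_TRACEPOINT_HEADER_MULTI_READ)
    #define SEASTAR_IO_TP_H

    #include <lttng/tracepoint.h>

    LTTNG_UST_TRACEPOINT_EVENT(seastar_io, queue,
        LTTNG_UST_TP_ARGS(uint32_t, req_id, uint16_t, len_blocks),
        LTTNG_UST_TP_FIELDS(
            lttng_ust_field_integer(uint32_t, req_id, req_id)
            lttng_ust_field_integer(uint16_t, len_blocks, len_blocks)
        )
    )

    #endif

    #include <lttng/tracepoint-event.h>

    // call site: lttng_ust_tracepoint(seastar_io, queue, id, blocks);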

@xemul xemul marked this pull request as draft September 4, 2024 07:14
xemul (Contributor, Author) commented Sep 4, 2024

> @xemul In other words: Avi has banned custom tracing patches until it is proven that existing tracing frameworks aren't good enough.

Yeah, Avi made it crystal clear :D

> (I had a custom tracing patch denied too, once.)

There must be a denied-binary-tracing club somewhere; failing that, we could found one

> If you want to experiment with lttng, you can take https://github.com/scylladb/scylla-enterprise-seastar/commit/f0b870368f733bde6fd54cfbec6052080b5f2db5#diff-ceb4cd71fc94af90f390646cde7f7cc8bc673cbd7fd77208609b42fc634d0fd1 as a starting point.

Thanks!
