Implement removal of files via blocks discarding according to configured bandwidth #2303

Draft · wants to merge 4 commits into master
Conversation


@pwrobelse pwrobelse commented Jun 26, 2024

To avoid I/O latency spikes caused by too many TRIM requests issued to the
drive over a short period of time when the filesystem is mounted with the
online discard option and many files are removed, the file_discard_queue
class is used.

Instead of relying on the unlink operation, which trims all extents of a
file at once, the approach discards the blocks of a file at a given pace.

When a user wants to remove a file, remove_file_via_blocks_discarding()
needs to be called. Its main purpose is to enqueue the work into the
file_discard_queue object associated with the device on which the file is stored.

The reactor invokes the file_discard_queue::try_discard_blocks() function via
a poller. The function checks whether the discard bandwidth allows another
block to be discarded. If bandwidth is available, the discard is performed.

The described approach ensures that the amount of data removed over time
remains at the configured level.

Refs: scylladb/scylladb#14940
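
For orientation, a minimal usage sketch of the new entry point (the enclosing namespace, header, and exact signature are assumptions for illustration; only the function name comes from the patch):

```cpp
#include <seastar/core/future.hh>
#include <seastar/core/sstring.hh>

// Hypothetical caller: instead of unlinking directly, enqueue the file for
// paced removal. The returned future is assumed to resolve once the file's
// blocks have been discarded and the file removed.
seastar::future<> remove_large_file(seastar::sstring path) {
    return seastar::remove_file_via_blocks_discarding(std::move(path));
}
```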

@pwrobelse pwrobelse force-pushed the implement-files-removal-via-blocks-discarding-according-to-given-rate branch 2 times, most recently from 5934da6 to 3559f90 on June 27, 2024 08:57
}

_currently_discarded_files.reserve(std::max(4u * _config._requests_per_second, uint64_t{256u}));
_refill_available_requests_timer.arm_periodic(std::chrono::seconds{1});

Contributor:

Consider you have 1M requests per second. AFAIU this means that the discarder will emit 1M requests almost instantly (modulo the magic 8u constant in try_discard_blocks) and then idle for the remaining (almost) whole second.

Contributor Author:

That's true. Also, if the timer expires right after the requests were issued, another 1M will be issued; with a periodic timer such a situation may happen. In that case the bandwidth in that second may be almost twice the configured one, which is wrong.

While creating file_discard_queue_global_test.cc I also had the impression that setting the periodic timer in the constructor is problematic. However, I wanted to push the draft ASAP to get some general feedback about the used approach. Also, I see some corner cases in other parts of the PR that need to be discussed.

Regarding the timer - currently I have two ideas:

  • decrease the period of the timer and refill only a fraction of the available shares, e.g. given a 1/n s period refill RPS/n shares - it could alleviate the "long idling" problem, but if the refill happens right after the first portion of discards was issued, the bandwidth in the first second may still be larger than the configured one (by BW/n)
  • do not use a periodic timer - instead arm the timer after issuing discards that decreased the number of shares

Regarding the magic constant - file_discard_queue processes files in FIFO order. If any file was opened, the queue tries to discard that file completely before opening the next ones. My intention was to not open too many files at once; 8 is just an arbitrary value.

Contributor:

You're trying to implement a token-bucket algorithm; it can be done without a timer and with smoother refill. All you need is to refill tokens not by timer, but synchronously, every time you try to grab tokens from the bucket. For that, there should be a function tokens = fn(duration<double>) that converts a duration into tokens. Every time you grab tokens, you first put some back using that fn() and note the time of the replenish to calculate the duration for the next refill. There's util/shared_token_bucket.hh that implements it, but it's node-wide, not shard-local, and thus has some extra complexity due to that.
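
A minimal shard-local sketch of that idea (illustrative only, not Seastar's util/shared_token_bucket.hh; all names here are made up):

```cpp
#include <algorithm>
#include <chrono>

// Token bucket refilled synchronously on every grab, as suggested above.
class discard_token_bucket {
    using clock = std::chrono::steady_clock;
    double _rate;                          // tokens (discard requests) per second
    double _burst;                         // cap on accumulated tokens
    double _tokens;                        // currently available tokens
    clock::time_point _last = clock::now();
public:
    discard_token_bucket(double rate, double burst)
        : _rate(rate), _burst(burst), _tokens(burst) {}

    // Returns true if one discard may be issued now; refills first based on
    // the time elapsed since the previous grab, so no timer is needed.
    bool try_grab() {
        auto now = clock::now();
        std::chrono::duration<double> elapsed = now - _last;
        _last = now;
        _tokens = std::min(_burst, _tokens + elapsed.count() * _rate);
        if (_tokens < 1.0) {
            return false;
        }
        _tokens -= 1.0;
        return true;
    }
};
```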

if (node["discard_block_size"]) {
mp.discard_block_size = parse_memory_size(node["discard_block_size"].as<std::string>());
}
if (node["discard_requests_per_second"]) {

Contributor:

How to calculate this number for a given node?

Contributor Author:

It is quite a tricky question. The experiments with diskplorer seemed to indicate that when block_size == 32MiB and discard_bw == write_bw, such huge latency spikes were not visible on i3.2xlarge.

However, what I am afraid of is whether the configured bandwidth would be sufficient - the queue of files to remove cannot grow indefinitely. Also, I am not sure how to measure the desired value via iotune - creating files to test a certain bandwidth may take a long time.
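
For illustration, a hypothetical io-properties entry with the new keys might look like this (the mountpoint and all values are made up, not recommendations; only the key names discard_block_size and discard_requests_per_second come from the patch):

```yaml
disks:
  - mountpoint: /var/lib/scylla
    read_iops: 100000
    read_bandwidth: 1GB
    write_iops: 80000
    write_bandwidth: 500MB
    discard_block_size: 32MB
    discard_requests_per_second: 16
```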


auto& params = mountpoint_it->second;
if (params.discard_block_size && params.discard_requests_per_second) {
return file_discard_queue_config{*params.discard_block_size, *params.discard_requests_per_second};

Contributor:

So the discard numbers in the io-properties file are not node-wide, but per-shard?

Contributor Author:

It works like that in the current state of the PR. It would be better to specify the value per device instead of per shard.

Contributor:

Then consider the case where the number of discard requests specified is smaller than the number of shards on the node. Is that possible? If yes, how should it be handled?


xemul commented Jun 27, 2024

Since we are going for fine-grained block-by-block removal (the "discard_block_size" parameter), why don't we route those discards via the regular io-queue?

To avoid I/O latency spikes caused by too many TRIM requests issued to the
drive over a short period of time when the filesystem is mounted with the
online discard option and many files are removed, the file_discard_queue
class is used.

Instead of relying on the unlink operation, which trims all extents of a
file at once, the approach discards the blocks of a file at a given pace.

When a user wants to remove a file, file_discard_queue::enqueue_file_removal()
needs to be called. Its main purpose is to store the information about the
file discard work in the internal data of the class.

The reactor invokes the file_discard_queue::try_discard_blocks() function via
a poller. The function checks whether the discard bandwidth allows another
block to be discarded. If bandwidth is available, the discard is performed.

The described approach ensures that the amount of removed data remains at
the configured level.

Signed-off-by: Patryk Wrobel <[email protected]>
This patch introduces unit tests for the file_discard_queue class.
To avoid interference between test cases, each of them manually
triggers file_discard_queue::try_discard_blocks().

In the next patches file_discard_queue will be plugged into the reactor
for each shard. Then another set of tests will be added as a separate
binary to check the full flow.

Signed-off-by: Patryk Wrobel <[email protected]>
To enable usage of file_discard_queue in seastar applications,
it had to be plugged into the reactor.

This patch introduces usage of file_discard_queue in the reactor.
The following changes have been implemented:
 - the remove_file_via_blocks_discarding() function is available and allows users to discard files according to a given bandwidth (using file_discard_queue)
 - the file_discard_queue_submission_pollfn poller was added to the reactor's main loop to trigger the work related to discarding files
 - the I/O properties contain two new optional parameters, discard_block_size and discard_requests_per_second, that allow a user to configure the discard bandwidth for a given mountpoint

Signed-off-by: Patryk Wrobel <[email protected]>
… removal

In order to ensure that the file discard queue is
properly used by the reactor and can be configured
via the I/O properties parameters, a new set of global
tests was introduced.

To avoid any interference between test cases, they
are run one after another in a seastar::thread.

The test cases verify the full flow, including usage
of the global utility functions, the reactor, pollers, and
the configured file discard queues. The bandwidth of the
discard operation is also verified.

Signed-off-by: Patryk Wrobel <[email protected]>
@pwrobelse pwrobelse force-pushed the implement-files-removal-via-blocks-discarding-according-to-given-rate branch from 3559f90 to e2ad5ba on June 27, 2024 09:39

@pwrobelse (Contributor Author):

Since we are going for fine-grained block-by-block removal (the "discard_block_size" parameter), why don't we route those discards via the regular io-queue?

Could you elaborate a little bit?

In the beginning I saw that there is no support for fallocate() in AIO. Therefore, I thought that plugging discards into the existing I/O queue is not viable. io_queue uses io_sink, which is drained by reactor_backend, which translates a seastar I/O request into an API-specific request for a given backend. In the case of the AIO backend, a seastar I/O request for fallocate() could not be translated into an AIO request.

Moreover, io_queue seems to be strongly read/write oriented. Even when I looked at io_queue::queue_one_request(), I felt that it has many more parameters than the logic related to discards needs.

That's the reason why I created a separate class called file_discard_queue.

When I try to think about routing via io_queue, I have a few ideas and I am not sure which one is correct (if any).

  1. Are you referring to making file_discard_queue an implementation detail of io_queue and plugging it into the existing implementation of that class? In that case the new logic related to the poller for file_discard_queue and the selection of the FDQ in the reactor could be removed. Also, we would have only a single config for the I/O queue that would get two new fields, instead of introducing another config for the FDQ. Making file_discard_queue a private member of io_queue would still ensure the separation of logic.
  2. Do you propose to have another dimension in the _streams array that would contain discard requests and avoid using io_sink for that dimension? Instead, dispatch() would issue ordinary discards via the thread_pool.


xemul commented Jun 27, 2024

In the beginning I saw that there is no support for fallocate() in AIO.

Correct.

Therefore, I thought that plugging discards into the existing I/O queue is not viable. io_queue uses io_sink, which is drained by reactor_backend, which translates a seastar I/O request into an API-specific request for a given backend. In the case of the AIO backend, a seastar I/O request for fallocate() could not be translated into an AIO request.

Yes, this just means that io_queue::poll_io_queue() will need to submit discards some other way, not via io_sink. One of the options is to convert io_desc_read_write into an abstract class with its dispatch() being a virtual method and have two different classes inherit from it -- the existing io_desc_read_write and io_desc_discard.

Do you propose to have another dimension in the _streams array that would contain discard requests and avoid using io_sink for that dimension? Instead, dispatch() would issue ordinary discards via the thread_pool.

Definitely not. Discards must compete with regular read-write requests in the same fair-queue, because they compete with each other in the kernel/disk. That's why I asked for a read-vs-discard measurement like this one, to see how it corresponds to the existing model.

You're right that currently io_queue is read-write oriented, but that's ... historical coincidence :) There's the queued_io_request thing: io_queue maintains a set of those and pushes them through the fair_queue, so that they obey both the shares-based classes and the outgoing throttler (*). There's nothing read-write specific there yet.

(*) The outgoing throttler is currently in fair_queue, but that's not quite correct either. The goal of the fair-queue is to provide cross-class fair scheduling. The goal of the outgoing throttler is to limit the number of requests sent to the disk. It is effectively a property of io_sink, and it's in fair_queue due to me being lazy or shortsighted (likely both) back when I implemented it. And my point above is that discards should be pushed through it in the first place.
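
A rough sketch of the suggested split (illustrative only; apart from io_desc_read_write and io_desc_discard, the names and signatures below are assumptions, not Seastar's actual internals):

```cpp
// Abstract descriptor: the fair_queue pushes requests through regardless of
// their kind, and dispatch() decides how each one reaches the disk.
class io_desc_base {
public:
    virtual ~io_desc_base() = default;
    virtual void dispatch() = 0;
};

// Existing read/write descriptor: submits via io_sink as today.
class io_desc_read_write : public io_desc_base {
public:
    void dispatch() override {
        // enqueue the prepared request into io_sink for the reactor backend
    }
};

// New discard descriptor: AIO has no fallocate() support, so it would be
// submitted via the syscall thread pool (e.g. fallocate(FALLOC_FL_PUNCH_HOLE)).
class io_desc_discard : public io_desc_base {
public:
    void dispatch() override {
        // issue the punch-hole/discard syscall on the reactor's thread_pool
    }
};
```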

@pwrobelse (Contributor Author):

Hi, I tried to prepare io_queue for request types that do not need to use io_sink according to your comment. Please find another PR: #2319
