Implement removal of files via blocks discarding according to configured bandwidth #2303

Draft · wants to merge 4 commits into master
Conversation


@pwrobelse pwrobelse commented Jun 26, 2024

To avoid I/O latency spikes caused by too many TRIM requests issued to the
drive over a short period of time when the filesystem is mounted with the
online discard option and many files are removed, the file_discard_queue
class is used.

Instead of relying on the unlink operation, which trims all extents of a
file at once, the approach discards the blocks of a file at a given pace.

When a user wants to remove a file, remove_file_via_blocks_discarding()
needs to be called. Its main purpose is to enqueue the work into the
file_discard_queue object associated with the device on which the file is stored.

The reactor invokes the file_discard_queue::try_discard_blocks() function via
a poller. The function checks whether the discard bandwidth allows another
block to be discarded. If bandwidth is available, the discard is performed.

The described approach ensures that the amount of data removed over time
remains at the configured level.

Refs: scylladb/scylladb#14940
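
For orientation, a minimal usage sketch of the new entry point (the enclosing namespace, header, and exact signature are assumptions for illustration; only the function name comes from the patch):

```cpp
#include <seastar/core/future.hh>
#include <seastar/core/sstring.hh>

// Hypothetical caller: instead of unlinking directly, enqueue the file for
// paced removal. The returned future is assumed to resolve once the file's
// blocks have been discarded and the file removed.
seastar::future<> remove_large_file(seastar::sstring path) {
    return seastar::remove_file_via_blocks_discarding(std::move(path));
}
```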

@pwrobelse pwrobelse force-pushed the implement-files-removal-via-blocks-discarding-according-to-given-rate branch 2 times, most recently from 5934da6 to 3559f90 on June 27, 2024 08:57
}

_currently_discarded_files.reserve(std::max(4u * _config._requests_per_second, uint64_t{256u}));
_refill_available_requests_timer.arm_periodic(std::chrono::seconds{1});

Contributor:

Consider you have 1M requests per second. AFAIU this means that the discarder will emit 1M requests almost instantly (modulo the magic 8u constant in try_discard_blocks) and then idle for the remaining (almost) whole second.

Contributor Author:

That's true. Also, if the timer expires right after the requests were issued, another 1M will be issued; with a periodic timer such a situation may happen. In that case the bandwidth in that second may be almost twice the configured one, which is wrong.

While creating file_discard_queue_global_test.cc I also had the impression that setting the periodic timer in the constructor is problematic. However, I wanted to push the draft ASAP to get some general feedback about the used approach. Also, I see some corner cases in other parts of the PR that need to be discussed.

Regarding the timer - currently I have two ideas:

  • decrease the period of the timer and refill only a fraction of the available shares, e.g. given a 1/n s period refill RPS/n shares - it could alleviate the "long idling" problem, but if the refill happens right after the first portion of discards was issued, the bandwidth in the first second may still be larger than the configured one (by BW/n)
  • do not use a periodic timer - instead arm the timer after issuing discards that decreased the number of shares

Regarding the magic constant - file_discard_queue processes files in FIFO order. If any file was opened, the queue tries to discard that file completely before opening the next ones. My intention was to not open too many files at once; 8 is just an arbitrary value.

Contributor:

You're trying to implement a token-bucket algorithm; it can be done without a timer and with smoother refill. All you need is to refill tokens not by timer, but synchronously, every time you try to grab tokens from the bucket. For that, there should be a function tokens = fn(duration<double>) that converts a duration into tokens. Every time you grab tokens, you first put some back using that fn() and note the time of the replenish to calculate the duration for the next refill. There's util/shared_token_bucket.hh that implements it, but it's node-wide, not shard-local, and thus has some extra complexity due to that.
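
A minimal shard-local sketch of that idea (illustrative only, not Seastar's util/shared_token_bucket.hh; all names here are made up):

```cpp
#include <algorithm>
#include <chrono>

// Token bucket refilled synchronously on every grab, as suggested above.
class discard_token_bucket {
    using clock = std::chrono::steady_clock;
    double _rate;                          // tokens (discard requests) per second
    double _burst;                         // cap on accumulated tokens
    double _tokens;                        // currently available tokens
    clock::time_point _last = clock::now();
public:
    discard_token_bucket(double rate, double burst)
        : _rate(rate), _burst(burst), _tokens(burst) {}

    // Returns true if one discard may be issued now; refills first based on
    // the time elapsed since the previous grab, so no timer is needed.
    bool try_grab() {
        auto now = clock::now();
        std::chrono::duration<double> elapsed = now - _last;
        _last = now;
        _tokens = std::min(_burst, _tokens + elapsed.count() * _rate);
        if (_tokens < 1.0) {
            return false;
        }
        _tokens -= 1.0;
        return true;
    }
};
```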

if (node["discard_block_size"]) {
mp.discard_block_size = parse_memory_size(node["discard_block_size"].as<std::string>());
}
if (node["discard_requests_per_second"]) {

Contributor:

How to calculate this number for a given node?

Contributor Author:

It is quite a tricky question. The experiments with diskplorer seemed to indicate that when block_size == 32MiB and discard_bw == write_bw, such huge latency spikes were not visible on i3.2xlarge.

However, what I am afraid of is whether the configured bandwidth would be sufficient - the queue of files to remove cannot grow indefinitely. Also, I am not sure how to measure the desired value via iotune - creating files to test a certain bandwidth may take a long time.
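
For illustration, a hypothetical io-properties entry with the new keys might look like this (the mountpoint and all values are made up, not recommendations; only the key names discard_block_size and discard_requests_per_second come from the patch):

```yaml
disks:
  - mountpoint: /var/lib/scylla
    read_iops: 100000
    read_bandwidth: 1GB
    write_iops: 80000
    write_bandwidth: 500MB
    discard_block_size: 32MB
    discard_requests_per_second: 16
```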


auto& params = mountpoint_it->second;
if (params.discard_block_size && params.discard_requests_per_second) {
return file_discard_queue_config{*params.discard_block_size, *params.discard_requests_per_second};

Contributor:

So the discard numbers in the io-properties file are not node-wide, but per-shard?

Contributor Author:

It works like that in the current state of the PR. It would be better to specify the value per device instead of per shard.

Contributor:

Then consider the case where the number of discard requests specified is smaller than the number of shards on the node. Is that possible? If yes, how should it be handled?


xemul commented Jun 27, 2024

Since we are going for fine-grained block-by-block removal (the "discard_block_size" parameter), why don't we route those discards via the regular io-queue?

To avoid I/O latency spikes caused by too many TRIM requests issued to the
drive over a short period of time when the filesystem is mounted with the
online discard option and many files are removed, the file_discard_queue
class is used.

Instead of relying on the unlink operation, which trims all extents of a
file at once, the approach discards the blocks of a file at a given pace.

When a user wants to remove a file, file_discard_queue::enqueue_file_removal()
needs to be called. Its main purpose is to store the information about the
file discard work in the internal data of the class.

The reactor invokes the file_discard_queue::try_discard_blocks() function via
a poller. The function checks whether the discard bandwidth allows another
block to be discarded. If bandwidth is available, the discard is performed.

The described approach ensures that the amount of removed data remains at
the configured level.

Signed-off-by: Patryk Wrobel <[email protected]>
This patch introduces unit tests for the file_discard_queue class.
To avoid interference between test cases, each of them manually
triggers file_discard_queue::try_discard_blocks().

In the next patches file_discard_queue will be plugged into the reactor
for each shard. Then another set of tests will be added as a separate
binary to check the full flow.

Signed-off-by: Patryk Wrobel <[email protected]>
To enable usage of file_discard_queue in seastar applications,
it had to be plugged into the reactor.

This patch introduces usage of file_discard_queue in the reactor.
The following changes have been implemented:
 - the remove_file_via_blocks_discarding() function is available and allows users to discard files according to a given bandwidth (using file_discard_queue)
 - the file_discard_queue_submission_pollfn poller was added to the reactor's main loop to trigger the work related to discarding files
 - the I/O properties contain two new optional parameters, discard_block_size and discard_requests_per_second, that allow a user to configure the discard bandwidth for a given mountpoint

Signed-off-by: Patryk Wrobel <[email protected]>
… removal

In order to ensure that the file discard queue is
properly used by the reactor and can be configured
via the I/O properties parameters, a new set of global
tests was introduced.

To avoid any interference between test cases, they
are run one after another in a seastar::thread.

The test cases verify the full flow, including usage
of the global utility functions, the reactor, pollers, and
the configured file discard queues. The bandwidth of the
discard operation is also verified.

Signed-off-by: Patryk Wrobel <[email protected]>
@pwrobelse pwrobelse force-pushed the implement-files-removal-via-blocks-discarding-according-to-given-rate branch from 3559f90 to e2ad5ba on June 27, 2024 09:39

@pwrobelse (Contributor Author):

Since we are going for fine-grained block-by-block removal (the "discard_block_size" parameter), why don't we route those discards via the regular io-queue?

Could you elaborate a little bit?

In the beginning I saw that there is no support for fallocate() in AIO. Therefore, I thought that plugging discards into the existing I/O queue is not viable. io_queue uses io_sink, which is drained by reactor_backend, which translates a seastar I/O request into an API-specific request for a given backend. In the case of the AIO backend, a seastar I/O request for fallocate() could not be translated into an AIO request.

Moreover, io_queue seems to be strongly read/write oriented. Even when I looked at io_queue::queue_one_request(), I felt that it has many more parameters than the logic related to discards needs.

That's the reason why I created a separate class called file_discard_queue.

When I try to think about routing via io_queue, I have a few ideas and I am not sure which one is correct (if any).

  1. Are you referring to making file_discard_queue an implementation detail of io_queue and plugging it into the existing implementation of that class? In that case the new logic related to the poller for file_discard_queue and the selection of the FDQ in the reactor could be removed. Also, we would have only a single config for the I/O queue that would get two new fields, instead of introducing another config for the FDQ. Making file_discard_queue a private member of io_queue would still ensure the separation of logic.
  2. Do you propose to have another dimension in the _streams array that would contain discard requests and avoid using io_sink for that dimension? Instead, dispatch() would issue ordinary discards via the thread_pool.


xemul commented Jun 27, 2024

In the beginning I saw that there is no support for fallocate() in AIO.

Correct.

Therefore, I thought that plugging discards into the existing I/O queue is not viable. io_queue uses io_sink, which is drained by reactor_backend, which translates a seastar I/O request into an API-specific request for a given backend. In the case of the AIO backend, a seastar I/O request for fallocate() could not be translated into an AIO request.

Yes, this just means that io_queue::poll_io_queue() will need to submit discards some other way, not via io_sink. One of the options is to convert io_desc_read_write into an abstract class with its dispatch() being a virtual method and have two different classes inherit from it -- the existing io_desc_read_write and io_desc_discard.

Do you propose to have another dimension in the _streams array that would contain discard requests and avoid using io_sink for that dimension? Instead, dispatch() would issue ordinary discards via the thread_pool.

Definitely not. Discards must compete with regular read-write requests in the same fair-queue, because they compete with each other in the kernel/disk. That's why I asked for a read-vs-discard measurement like this one, to see how it corresponds to the existing model.

You're right that currently io_queue is read-write oriented, but that's ... historical coincidence :) There's the queued_io_request thing: io_queue maintains a set of those and pushes them through the fair_queue, so that they obey both the shares-based classes and the outgoing throttler (*). There's nothing read-write specific there yet.

(*) The outgoing throttler is currently in fair_queue, but that's not quite correct either. The goal of the fair-queue is to provide cross-class fair scheduling. The goal of the outgoing throttler is to limit the number of requests sent to the disk. It is effectively a property of io_sink, and it's in fair_queue due to me being lazy or shortsighted (likely both) back when I implemented it. And my point above is that discards should be pushed through it in the first place.
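
A rough sketch of the suggested split (illustrative only; apart from io_desc_read_write and io_desc_discard, the names and signatures below are assumptions, not Seastar's actual internals):

```cpp
// Abstract descriptor: the fair_queue pushes requests through regardless of
// their kind, and dispatch() decides how each one reaches the disk.
class io_desc_base {
public:
    virtual ~io_desc_base() = default;
    virtual void dispatch() = 0;
};

// Existing read/write descriptor: submits via io_sink as today.
class io_desc_read_write : public io_desc_base {
public:
    void dispatch() override {
        // enqueue the prepared request into io_sink for the reactor backend
    }
};

// New discard descriptor: AIO has no fallocate() support, so it would be
// submitted via the syscall thread pool (e.g. fallocate(FALLOC_FL_PUNCH_HOLE)).
class io_desc_discard : public io_desc_base {
public:
    void dispatch() override {
        // issue the punch-hole/discard syscall on the reactor's thread_pool
    }
};
```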

@pwrobelse (Contributor Author):

Hi, I tried to prepare io_queue for request types that do not need to use io_sink according to your comment. Please find another PR: #2319
