
Add an API to support splitting a given range into shards individually #11510

Open · DuanChangfeng0708 opened this issue on Jul 15, 2024 · 6 comments


Writing to a continuous key space under heavy pressure can cause hot spots, driving up storage_server_write_queue_size or even triggering process_behind. Such hot spots may be known to the upper-layer services, yet they cannot avoid them. I wish there were an API to partition the shards ahead of time so that the write load is evenly distributed across multiple storage servers. That way, the upper-layer business could proactively divide the key space to reduce hot spots.

kakaiu (Member) commented on Jul 15, 2024

Thanks for the information. @DuanChangfeng0708, is the skewed traffic primarily from update operations (or insertions following deletions) or from insert operations? Which FDB version are you using?

@DuanChangfeng0708 (Author)

> Thanks for the information. @DuanChangfeng0708, is the skewed traffic primarily from update operations (or insertions following deletions) or from insert operations? Which FDB version are you using?

I'm using version 7.1.27. Sequential writes under heavy pressure can cause this problem, especially after a key range has been repeatedly cleared over a long period. We eventually found that all the key ranges written with the same prefix were concentrated on a few storage servers, so those storage servers fell too far behind in version. While Ratekeeper was fetching the ServerList it hit process_behind, which limited the TPS limit to 0 and caused a service interruption.
The root cause is that keys with the same prefix are all concentrated on the same shard, creating a hot spot.
In our upper layer, for better concurrency and to avoid hot spots, we already introduced our own notion of shards by appending hash strings to keys with the same prefix, but this doesn't help in FDB because all those keys still land in a single FDB shard. So I would like an API that lets the business layer control the granularity of FDB shards, making the load more even and avoiding hot spots.
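A minimal sketch of the key scheme I described (the prefix and bucket count are placeholders, not our real values); because the hash bucket comes after the common prefix, every key still falls inside one contiguous FDB key range:

```python
# Sketch of the application-level sharding described above; the prefix and
# bucket count are placeholders, not real values.
import hashlib

PREFIX = b"task/"   # common prefix shared by one class of keys (placeholder)
NUM_BUCKETS = 16    # application-level shard count (placeholder)

def sharded_key(task_id: str) -> bytes:
    # The hash bucket is appended *after* the common prefix, so keys are spread
    # across buckets lexicographically, but the whole range still shares PREFIX
    # and can therefore still sit inside a single FDB shard.
    bucket = int(hashlib.sha1(task_id.encode()).hexdigest(), 16) % NUM_BUCKETS
    return PREFIX + bytes([bucket]) + task_id.encode()

print(sharded_key("job-42"))
```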

kakaiu (Member) commented on Jul 16, 2024

Thanks! Currently, we have an experimental feature for manual shard splitting in release-7.1.
The fdbcli command is `redistribute <BeginKey> <EndKey>`.
The input range is passed to the data distributor (DD), which issues data moves covering all data within the range. Suppose the current shard boundaries are [a, c), [c, d), [d, f) and the input range is [b, d): the DD splits the shard [a, c) into [a, b) and [b, c) and triggers data moves for [a, b), [b, c), and [c, d).
On the main branch, we are developing a better solution for this requirement.
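As a toy illustration of that split (this is not the data distributor code; shard boundaries are modelled here as plain strings):

```python
# Toy model of the manual split described above; not the data distributor code.
def split_shards(boundaries, begin, end):
    """boundaries: sorted shard boundary points, e.g. ["a", "c", "d", "f"]."""
    new = sorted(set(boundaries) | {begin, end})
    moved = []
    for lo, hi in zip(boundaries, boundaries[1:]):   # original shards
        if lo < end and hi > begin:                  # shard overlaps the input range
            # Every fragment of an overlapping shard gets a data move.
            pieces = [b for b in new if lo <= b <= hi]
            moved += list(zip(pieces, pieces[1:]))
    return new, moved

# Shards [a, c), [c, d), [d, f) with input range [b, d):
print(split_shards(["a", "c", "d", "f"], "b", "d"))
# -> (['a', 'b', 'c', 'd', 'f'], [('a', 'b'), ('b', 'c'), ('c', 'd')])
```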

@DuanChangfeng0708 (Author)

> Thanks! Currently, we have an experimental feature for manual shard splitting in release-7.1. The fdbcli command is `redistribute <BeginKey> <EndKey>`. The input range is passed to the data distributor (DD), which issues data moves covering all data within the range. Suppose the current shard boundaries are [a, c), [c, d), [d, f) and the input range is [b, d): the DD splits the shard [a, c) into [a, b) and [b, c) and triggers data moves for [a, b), [b, c), and [c, d). On the main branch, we are developing a better solution for this requirement.

I have read the relevant implementation code, and I believe custom shards should be persisted and never be merged.

Consider a scenario (such as backend jobs) where a batch of tasks sharing the same prefix is created at 1am every day, and all tasks are completed and cleared at 2am. In that case, the keys written at 1am will hit hot spots because they share a prefix. After the deletion at 2am, the related shards merge again, and the same situation recurs the next day. The goal of customizing shards to distribute the load evenly would not be achieved.

kakaiu (Member) commented on Jul 21, 2024

> > Thanks! Currently, we have an experimental feature for manual shard splitting in release-7.1. The fdbcli command is `redistribute <BeginKey> <EndKey>`. The input range is passed to the data distributor (DD), which issues data moves covering all data within the range. Suppose the current shard boundaries are [a, c), [c, d), [d, f) and the input range is [b, d): the DD splits the shard [a, c) into [a, b) and [b, c) and triggers data moves for [a, b), [b, c), and [c, d). On the main branch, we are developing a better solution for this requirement.

> I have read the relevant implementation code, and I believe custom shards should be persisted and never be merged.

> Consider a scenario (such as backend jobs) where a batch of tasks sharing the same prefix is created at 1am every day, and all tasks are completed and cleared at 2am. In that case, the keys written at 1am will hit hot spots because they share a prefix. After the deletion at 2am, the related shards merge again, and the same situation recurs the next day. The goal of customizing shards to distribute the load evenly would not be achieved.

Yes. On the main branch/release-7.3, we have introduced another experimental feature, the `rangeconfig` command. It was initially developed to create a larger team with more storage servers to handle high read traffic on a specific range. It also allows customization of shard boundaries, and these custom boundaries are persisted.

@DuanChangfeng0708 (Author)

If this requirement is implemented, there is one detail worth noting: I want a range with a common prefix to be dispersed across different teams as much as possible, so I need a way to calculate the proportion of that range held by each team.
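A rough sketch of the kind of per-team ratio I mean, using the Python locality API (it counts shards rather than bytes, the prefix is a placeholder, and the default cluster file is assumed):

```python
# Rough sketch: estimate how the shards of one prefix range are spread across
# storage servers via the FDB Python locality API. Counts shards, not bytes.
from collections import Counter
import fdb

fdb.api_version(710)
db = fdb.open()                        # assumes the default cluster file

prefix = b"task/"                      # placeholder common prefix
begin, end = prefix, prefix + b"\xff"

# Keys at which shard boundaries fall inside [begin, end).
boundaries = list(fdb.locality.get_boundary_keys(db, begin, end))

counts = Counter()
tr = db.create_transaction()
for key in [begin] + boundaries:
    # Storage servers responsible for the shard that starts at `key`.
    for addr in fdb.locality.get_addresses_for_key(tr, key).wait():
        counts[addr] += 1

total = sum(counts.values())
for addr, n in counts.most_common():
    print(addr, f"{n / total:.1%}")
```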
