Batch GRV Rate Limit Exceeded is not always thrown #11500

Open
ScottDugas opened this issue Jul 11, 2024 · 2 comments

@ScottDugas

The tests noted in FoundationDB/fdb-record-layer#2813 will occasionally run forever due to this code:
https://github.com/FoundationDB/fdb-record-layer/blob/200ac05041a1af712f621a27b4c5c37f9eab001c/fdb-record-layer-core/src/main/java/com/apple/foundationdb/record/provider/foundationdb/storestate/FDBRecordStoreStateCacheEntry.java#L97-L100

That code combines two futures.
The first one, recordStore.loadRecordStoreStateAsync, does a regular read.
The second one does a snapshot get of SystemKeyspace.METADATA_VERSION_KEY.

The first future fails with Batch GRV request rate limit exceeded (code 1051).
The second future never completes.
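
To illustrate why the combined future can hang even though one input has already failed, here is a minimal plain-Java sketch. It does not use the Record Layer or the FoundationDB bindings: `failingRead`, `hangingRead`, and `combineFailFast` are hypothetical stand-ins invented for this example, and the exception type is a placeholder for error 1051. It relies on `CompletableFuture.thenCombine` waiting for both inputs before completing, even when the first input has completed exceptionally.

```java
import java.util.concurrent.CompletableFuture;

public class CombineHangDemo {
    /** Hypothetical stand-in for the regular read that fails with error 1051. */
    static CompletableFuture<String> failingRead() {
        CompletableFuture<String> f = new CompletableFuture<>();
        f.completeExceptionally(
                new IllegalStateException("Batch GRV request rate limit exceeded (1051)"));
        return f;
    }

    /** Hypothetical stand-in for the snapshot read that never completes. */
    static CompletableFuture<String> hangingRead() {
        return new CompletableFuture<>();
    }

    /**
     * Combines two futures, but fails as soon as either input fails
     * instead of waiting for the other input to finish.
     */
    static CompletableFuture<String> combineFailFast(CompletableFuture<String> a,
                                                     CompletableFuture<String> b) {
        CompletableFuture<String> result = a.thenCombine(b, (x, y) -> x + "/" + y);
        a.whenComplete((value, err) -> {
            if (err != null) {
                result.completeExceptionally(err);
            }
        });
        b.whenComplete((value, err) -> {
            if (err != null) {
                result.completeExceptionally(err);
            }
        });
        return result;
    }

    public static void main(String[] args) {
        // Plain thenCombine: stays pending because the second input never completes.
        CompletableFuture<String> plain =
                failingRead().thenCombine(hangingRead(), (x, y) -> x + "/" + y);
        System.out.println("plain combine done? " + plain.isDone());

        // Fail-fast variant: surfaces the first input's error immediately.
        CompletableFuture<String> failFast = combineFailFast(failingRead(), hangingRead());
        System.out.println("fail-fast combine failed? " + failFast.isCompletedExceptionally());
    }
}
```

A fail-fast combinator like this only hides the symptom in the caller; it does not explain why the second read never completes.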

I have tried to reproduce this in a more isolated environment, but it is proving tricky to get it to reliably start failing with Batch GRV request rate limit exceeded.

@ScottDugas
Author

ok, I created a reasonable reproduction at: https://github.com/FoundationDB/fdb-record-layer/pull/2823/files
About half the time, only the reads of SystemKeyspace.METADATA_VERSION_KEY time out, while the other operations fail with Batch GRV request rate limit exceeded.
The rest of the time, all the operations fail with timeouts.
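
The two outcomes described above (an explicit error versus a timeout) can be told apart per operation by bounding each wait. The following is a rough sketch in plain Java, not the code from the linked pull request; `OutcomeClassifier`, `Outcome`, and `classify` are names made up for this illustration, and the futures in `main` are placeholders for the real reads.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class OutcomeClassifier {
    enum Outcome { OK, FAILED, TIMED_OUT }

    /** Waits up to timeoutMillis and reports how the future ended. */
    static Outcome classify(CompletableFuture<?> future, long timeoutMillis) {
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return Outcome.OK;
        } catch (TimeoutException e) {
            // The operation neither succeeded nor failed within the bound.
            return Outcome.TIMED_OUT;
        } catch (ExecutionException e) {
            // The operation failed with an error (for example, a rate limit error).
            return Outcome.FAILED;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Outcome.FAILED;
        }
    }

    public static void main(String[] args) {
        // Placeholder for an operation that fails with an explicit error.
        CompletableFuture<String> rateLimited = new CompletableFuture<>();
        rateLimited.completeExceptionally(new IllegalStateException("rate limit exceeded"));

        // Placeholder for an operation that never completes.
        CompletableFuture<String> hanging = new CompletableFuture<>();

        System.out.println("rate-limited read: " + classify(rateLimited, 100));
        System.out.println("hanging read: " + classify(hanging, 100));
    }
}
```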

@PierreZ
Contributor

PierreZ commented Aug 9, 2024

I think we have stumbled across this issue in simulation. We have a very basic fork of the RL in Rust that we can simulate as an external workload. This morning we found a specific seed (5267156628) that fails the same way: transaction.get_metadata_version hangs.

FoundationDB 7.3 (v7.3.43)
source version 412531b5c97fa84343da94888cc949a4d29e8c29
protocol fdb00b073000000

Our testfile looks like this:

[[test]]
testTitle = 'QuotaWorkload'

[[test.workload]]
testName = 'External'
libraryName = 'ldb'
workloadName = 'QuotaWorkload'
libraryPath = './target/release'
iteration_count = 50

[[test.workload]]
testName = 'RandomClogging'
testDuration = 30.0
swizzle = 1

[[test.workload]]
testName = 'Attrition'
machinesToKill = 10
machinesToLeave = 3
reboot = true
testDuration = 30.0

[[test.workload]]
testName = 'Rollback'
testDuration = 30

[[test.workload]]
testName = 'ChangeConfig'
maxDelayBeforeChange = 30.0
coordinators = 'auto'

Let me know if we can help 😄
