Execution can't recover after crash #1440

Open
morph-dev opened this issue Sep 12, 2024 · 5 comments
Labels
shelf-stable Will not be closed by stale-bot trin execution

Comments

@morph-dev
Collaborator

While I was running trin execution, era1 deserialization failed (the cause is irrelevant to this issue).

When I tried to resume, execution failed soon afterwards with this error:
Error: database error: not found database error block_hash

After looking a bit more into it, I found the problem.

BlockExecutor::manage_block_hash_serve_window modifies the db directly after every processed block. If execution crashes (as happened to me) and we try to resume it, the stored block hashes are not the correct ones: the db holds the 256 blocks preceding the moment of the crash, not the 256 blocks preceding the saved checkpoint.
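
The failure mode described above can be reproduced with a small model. This is only a sketch under stated assumptions: `ToyDb`, `fake_hash`, `execute`, and `missing_after_resume` are hypothetical names, the checkpoint interval is invented, and an in-memory map stands in for the real database. It is not trin's actual code.

```rust
use std::collections::BTreeMap;

/// Size of the block-hash window mentioned in the issue.
const WINDOW: u64 = 256;

/// Hypothetical stand-in for the on-disk database.
#[derive(Default)]
struct ToyDb {
    /// block_number -> block_hash, written eagerly after every block.
    block_hashes: BTreeMap<u64, [u8; 32]>,
    /// Last block whose state was flushed (the checkpoint).
    checkpoint: u64,
}

/// Placeholder hash derived from the block number (illustration only).
fn fake_hash(number: u64) -> [u8; 32] {
    let mut hash = [0u8; 32];
    hash[..8].copy_from_slice(&number.to_be_bytes());
    hash
}

/// Processes blocks `from..=to`. The hash window is updated after every
/// block, but the checkpoint only advances every `flush_every` blocks.
fn execute(db: &mut ToyDb, from: u64, to: u64, flush_every: u64) {
    for number in from..=to {
        db.block_hashes.insert(number, fake_hash(number));
        if number >= WINDOW {
            db.block_hashes.remove(&(number - WINDOW));
        }
        if number % flush_every == 0 {
            db.checkpoint = number;
        }
    }
}

/// Block numbers whose hashes a resumed run needs but cannot find.
fn missing_after_resume(db: &ToyDb) -> Vec<u64> {
    let resume_at = db.checkpoint + 1;
    let oldest_needed = resume_at.saturating_sub(WINDOW);
    (oldest_needed..resume_at)
        .filter(|number| !db.block_hashes.contains_key(number))
        .collect()
}
```

Running `execute(&mut db, 1, 1500, 1000)` and then treating the run as crashed leaves the checkpoint at block 1000 while the stored window covers blocks 1245 to 1500, so every hash a resumed run would look up first is reported as missing.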

Possible solutions:

  1. (preferred) Keep track of block hashes in memory and flush them to the db when the rest of the state is flushed.
  2. Before execution starts, make sure we have all required block hashes in the db (and seed them if that's not the case).
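
Option 1 could look roughly like the sketch below. It assumes a hypothetical `PendingHashes` buffer and a `ToyDb` backed by an in-memory map; none of these names come from trin, and in a real database the flush would have to be a single atomic write together with the rest of the state.

```rust
use std::collections::BTreeMap;

const WINDOW: u64 = 256;

/// Hypothetical stand-in for the on-disk database.
#[derive(Default)]
struct ToyDb {
    block_hashes: BTreeMap<u64, [u8; 32]>,
    checkpoint: u64,
}

/// Block-hash changes buffered in memory between flushes.
/// `Some(hash)` means "insert", `None` means "delete".
#[derive(Default)]
struct PendingHashes {
    changes: BTreeMap<u64, Option<[u8; 32]>>,
}

impl PendingHashes {
    /// Records a processed block and schedules pruning of the block
    /// that just fell out of the window. Nothing touches the db yet.
    fn record_block(&mut self, number: u64, hash: [u8; 32]) {
        self.changes.insert(number, Some(hash));
        if number >= WINDOW {
            self.changes.insert(number - WINDOW, None);
        }
    }

    /// Reads through the buffer first, then falls back to the db.
    fn get(&self, db: &ToyDb, number: u64) -> Option<[u8; 32]> {
        match self.changes.get(&number) {
            Some(change) => *change,
            None => db.block_hashes.get(&number).copied(),
        }
    }

    /// Applies all buffered changes and advances the checkpoint.
    /// Here this is a plain loop; a real db would need one atomic write.
    fn flush(&mut self, db: &mut ToyDb, checkpoint: u64) {
        for (number, change) in std::mem::take(&mut self.changes) {
            match change {
                Some(hash) => {
                    db.block_hashes.insert(number, hash);
                }
                None => {
                    db.block_hashes.remove(&number);
                }
            }
        }
        db.checkpoint = checkpoint;
    }
}

/// Placeholder hash derived from the block number (illustration only).
fn hash_of(number: u64) -> [u8; 32] {
    let mut hash = [0u8; 32];
    hash[..8].copy_from_slice(&number.to_be_bytes());
    hash
}

/// Processes blocks 1..=1500, flushes once at block 1000, then "crashes":
/// the buffer is dropped and only the db contents survive.
fn run_then_crash() -> ToyDb {
    let mut db = ToyDb::default();
    let mut pending = PendingHashes::default();
    for number in 1..=1500u64 {
        pending.record_block(number, hash_of(number));
        if number == 1000 {
            pending.flush(&mut db, number);
        }
    }
    db
}
```

With this arrangement the db returned by `run_then_crash` holds exactly the hashes for blocks 745 to 1000, which is the window a run resuming from checkpoint 1000 needs.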
@morph-dev
Collaborator Author

Alternatively, we can just never delete block_number->block_hash entries from the db. Clearly, this is not the most optimized solution, but it is definitely the easiest one.

It's only ~64 bytes per block, so it's not the end of the world (a total of ~1.2 GB for the entire chain at the moment).
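
The estimate can be checked with simple arithmetic. The block count below is an assumed round figure of 20 million, not a number taken from this thread; the per-entry size is the ~64 bytes quoted above.

```rust
/// Approximate bytes per block_number -> block_hash entry (figure quoted above).
const BYTES_PER_ENTRY: u64 = 64;

/// Total storage if no entry is ever deleted.
fn total_bytes(block_count: u64) -> u64 {
    block_count * BYTES_PER_ENTRY
}

/// Converts bytes to decimal gigabytes.
fn as_gb(bytes: u64) -> f64 {
    bytes as f64 / 1e9
}
```

For an assumed 20,000,000 blocks, `total_bytes` gives 1,280,000,000 bytes, or 1.28 GB, which is in the same ballpark as the figure quoted above.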

@KolbyML
Member

KolbyML commented Sep 12, 2024

I think the right solution is to change from RocksDB to LMDB or MDBX. They are both ACID compliant, so a crash wouldn't cause this problem: we could set the database to finalize everything only once we are done with the full block execution cycle.

That would be better than one-off solutions like the ones listed above, which won't solve the root problem.

@KolbyML
Member

KolbyML commented Sep 14, 2024

#1451 (comment)
#1451 (comment)

These are additional comments I made on this problem and on why switching to an ACID database solves it.

@morph-dev
Collaborator Author

Why can't we use RocksDB? Instead of using rocksdb::DB, we can use rocksdb::TransactionDB or rocksdb::OptimisticTransactionDB.
The difference between transactions and optimistic transactions is explained here: https://github.com/facebook/rocksdb/wiki/Transactions .

I think in our case, we can even use rocksdb::DB::write. Might be the simplest solution.
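
The idea behind a batched write is that all changes belonging to one checkpoint either land together or not at all. The sketch below models that with the standard library only, so it runs without the rocksdb crate; `ToyDb`, `Batch`, `Op`, `checkpoint_batch`, the key names, and the `fail_at` parameter are all hypothetical. With RocksDB itself, the corresponding pieces would be a `WriteBatch` applied through `DB::write`.

```rust
use std::collections::BTreeMap;

/// One buffered operation (hypothetical model of a write batch entry).
enum Op {
    Put(Vec<u8>, Vec<u8>),
    Delete(Vec<u8>),
}

/// Operations collected for a single checkpoint.
#[derive(Default)]
struct Batch {
    ops: Vec<Op>,
}

impl Batch {
    fn put(&mut self, key: &[u8], value: &[u8]) {
        self.ops.push(Op::Put(key.to_vec(), value.to_vec()));
    }

    fn delete(&mut self, key: &[u8]) {
        self.ops.push(Op::Delete(key.to_vec()));
    }
}

/// Hypothetical in-memory key-value store.
#[derive(Default, Clone, PartialEq, Debug)]
struct ToyDb {
    data: BTreeMap<Vec<u8>, Vec<u8>>,
}

impl ToyDb {
    /// Applies the batch to a staged copy and swaps it in only if every
    /// operation succeeded. `fail_at` simulates a crash before operation i.
    fn write(&mut self, batch: Batch, fail_at: Option<usize>) -> Result<(), String> {
        let mut staged = self.data.clone();
        for (i, op) in batch.ops.into_iter().enumerate() {
            if fail_at == Some(i) {
                return Err(format!("simulated crash at op {}", i));
            }
            match op {
                Op::Put(key, value) => {
                    staged.insert(key, value);
                }
                Op::Delete(key) => {
                    staged.remove(&key);
                }
            }
        }
        self.data = staged;
        Ok(())
    }
}

/// Builds one batch holding state, block-hash window, and checkpoint updates.
/// Keys and values are placeholders for illustration.
fn checkpoint_batch(block: u64) -> Batch {
    let mut batch = Batch::default();
    batch.put(b"state_root", &block.to_be_bytes());
    batch.put(format!("block_hash/{}", block).as_bytes(), &[0xabu8; 32]);
    if block >= 256 {
        batch.delete(format!("block_hash/{}", block - 256).as_bytes());
    }
    batch.put(b"checkpoint", &block.to_be_bytes());
    batch
}
```

If `write` fails partway through, the store is left exactly as it was after the previous successful batch, so the block-hash window and the checkpoint cannot drift apart.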

@KolbyML
Member

KolbyML commented Sep 15, 2024

Erigon has a write up here

https://github.com/erigontech/erigon/wiki/Choice-of-storage-engine

They tried about five different database solutions and ended up with MDBX.

They say it isn't ACID,

> Why can't we use RocksDB? Instead of using rocksdb::DB, we can use rocksdb::TransactionDB or rocksdb::OptimisticTransactionDB. Difference between transaction and optimistic transaction can be found here: https://github.com/facebook/rocksdb/wiki/Transactions .
>
> I think in our case, we can even use rocksdb::DB::write. Might be the simplest solution.

This looks like a good start, since it seems more reliable than our current solution, but because various projects have pointed out issues with it, I am inclined to think it is a bad choice in the long term.
