Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revamp Timezone Processing and Management #8037

Open
pedroerp opened this issue Dec 14, 2023 · 6 comments
Open

Revamp Timezone Processing and Management #8037

pedroerp opened this issue Dec 14, 2023 · 6 comments
Labels
enhancement New feature or request

Comments

@pedroerp
Copy link
Contributor

pedroerp commented Dec 14, 2023

Description

There are a few important aspects regarding timezone processing in Velox:

  1. How timezones are represented and stored in a columnar format:

In order to efficiently represent timezones in Velox, timezones are mapped to 16 bit integers. Because the mappings should be compatible with data saved on disk by Presto java (e.g TimestampWithTimezone type), we import the mapping between timezone name (string) and integers from Presto. The mapping is automatically generated and available here:

https://github.com/facebookincubator/velox/blob/main/velox/type/tz/TimeZoneDatabase.cpp

Naturally, both sides of the mapping are available (name ->id; id -> name):

https://github.com/facebookincubator/velox/blob/main/velox/type/tz/TimeZoneMap.h

  1. How timezone conversion is performed:

Many functions use timezone information to perform timestamp/date/time conversion across timezones. This is done using an external library available here:

https://github.com/facebookincubator/velox/tree/main/velox/external/date

This library is compatible with the new timezone std::chrono standard in C++20:

https://en.cppreference.com/w/cpp/chrono/time_zone

It takes a string with the timezone name and leverages the local tzdata package to perform timezone conversions.

===

There are two problems with the current setup:

A) These two mappings are not synchronized. Timezone names available in the Presto mapping (item 1 above) are not necessarily understood by external/date. This means that in some cases timezone which we can represent, fail if we try to perform conversions (#7804 is an example).

B) Inefficient conversion. For timezones that happen to be available in both mappings, the common conversion code consists of (a) first mapping the stored integer id back to the string name, (b) using the string name to find the pointer to the correct conversion object, then (c) actually performing the conversion.

format_datetime() is an example:

https://github.com/facebookincubator/velox/blob/main/velox/functions/prestosql/DateTimeFunctions.h#L1286

===

To address the issues above, we should consider:

X) Ensure every timezone name supported in Velox (the ones in the Presto-based mapping) have a corresponding entry in external/date. We should have tests to cover every timezone. There is a good overlap, but they are not the same. Particularly, fixed offsets like "03:00" and "-07:30" are supported by Presto but not by external/date.

Y) Provide a cached mapping from "id" -> "timezone object pointer" to improve efficiency in the common conversion path.

Z) Work with the Presto community to add missing timezone names - it was reported that at least EST, CST, and similar are official timezone and available in tzdata/tzdump, but not supported by Presto (and hence not available in Velox).

Cc: @mbasmanova @PHILO-HE @aditi-pandit @majetideepak @zacw7 @tanjialiang @gggrace14 @svm1

@pedroerp pedroerp added the enhancement New feature or request label Dec 14, 2023
@PHILO-HE
Copy link
Contributor

PHILO-HE commented Dec 14, 2023

A) These two mappings are not synchronized. Timezone names available in the Presto mapping (item 1 above) are not necessarily understood by external/date. This means that in some cases timezone which we can represent, fail if we try to perform conversions (#7804 is an example).

Hi @pedroerp, does Velox's external/date come from this lib: https://github.com/HowardHinnant/date? Not sure whether it can understand more timezone patterns in the latest versions.

In addition, I note USE_OS_TZDB is used to build this external/lib. (https://github.com/facebookincubator/velox/blob/main/velox/external/date/CMakeLists.txt#L14), which makes this library only use OS-supplied timezone database. It may cause a few limitations.

@pedroerp
Copy link
Contributor Author

pedroerp commented Dec 14, 2023

Hi @pedroerp, does Velox's external/date come from this lib: https://github.com/HowardHinnant/date? Not sure whether it can understand more timezone patterns in the latest versions.

@PHILO-HE Yes, it was copied from that repo.

which makes this library only use OS-supplied timezone database. It may cause a few limitations.

Which limitations you have in mind? The OS timezone database should be our source of authority for timezone conversion, IMO.

@PHILO-HE
Copy link
Contributor

Hi @pedroerp, does Velox's external/date come from this lib: https://github.com/HowardHinnant/date? Not sure whether it can understand more timezone patterns in the latest versions.

@PHILO-HE Yes, it was copied from that repo.

which makes this library only use OS-supplied timezone database. It may cause a few limitations.

Which limitations you have in mind? The OS timezone database should be our source of authority for timezone conversion, IMO.

Hi @pedroerp, I investigated the external date lib recently. In the previous discussion, I guessed OS timezone database has some limitation, so it cannot recognize offset-based timezone like +08:00. But even though I set AUTO_DOWNLOAD=1, HAS_REMOTE_API=1 for this lib to use downloaded timezone database, such offset-based timezone still cannot be recognized (See PHILO-HE@b98d593).

I submitted an issue to that community to discuss offset-based timezone: HowardHinnant/date#823. It seems we can firstly convert the offset to "Etc/GMT+x" (only workable for offset who's a multiple of hours).

@svm1
Copy link
Collaborator

svm1 commented Apr 23, 2024

@PHILO-HE @pedroerp That's quite interesting - maybe implementing such a conversion can be a good incremental first step? As in, a function to convert whole-hour offsets to "Etc/GMT+x", and use a custom time zone class with minute precision, as outlined in the documentation, for the other cases. I could start putting that together if we want to go down that route.

@PHILO-HE
Copy link
Contributor

PHILO-HE commented Apr 24, 2024

@PHILO-HE @pedroerp That's quite interesting - maybe implementing such a conversion can be a good incremental first step? As in, a function to convert whole-hour offsets to "Etc/GMT+x", and use a custom time zone class with minute precision, as outlined in the documentation, for the other cases. I could start putting that together if we want to go down that route.

Hi @svm1, I have a PR to support timezone like "GMT+8". See #9591. We can do the similar conversion for other whole-hour offsets, like "+08:00", in separate PR.

@svm1
Copy link
Collaborator

svm1 commented Apr 24, 2024

Sorry, I must've missed that. Makes sense!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants