Add support for non-UTF-8 inputs #77

Open
lemire opened this issue Jan 20, 2023 · 4 comments

Comments

lemire (Member) commented Jan 20, 2023

The entire code base assumes UTF-8. To support UTF-16, we simply need to transcode (easy!).

lemire linked a pull request on Jan 22, 2023 that will close this issue

anonrig (Member) commented Feb 9, 2023

We should research the use cases for non-UTF-8 input support before working on this.

anonrig added the "good first issue" label on Feb 9, 2023

lemire (Member, Author) commented Feb 9, 2023

It is trivial to add a front-end transcoder to support any Unicode encoding. But yeah... not much demand so far.
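
As a rough illustration of the front-end transcoder idea, here is a minimal Python sketch. It is not the project's code, which does not appear in this thread. The names `to_utf8` and `parse_with_transcoding` are hypothetical, and the `parse_utf8` callback stands in for any parser that only accepts UTF-8.

```python
def to_utf8(raw: bytes, encoding: str) -> bytes:
    """Decode `raw` from `encoding` and re-encode it as UTF-8.

    errors="strict" makes malformed input (for example a lone UTF-16
    surrogate) raise UnicodeDecodeError instead of being replaced silently.
    """
    return raw.decode(encoding, errors="strict").encode("utf-8")


def parse_with_transcoding(raw: bytes, encoding: str, parse_utf8):
    """Hypothetical front end: transcode first, then call a UTF-8-only parser."""
    return parse_utf8(to_utf8(raw, encoding))
```

The cost of this approach is one decode and one encode pass, plus a temporary UTF-8 copy of the input, before the parser sees any data.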

anonrig (Member) commented Feb 9, 2023

I wonder if there is any usage from non-browser environments for this. @jasnell, is there any demand from Cloudflare Workers and/or Node.js for this?

cyyynthia commented

This thread is quite old, but given the labels I feel it wouldn't hurt to give my two cents. I've been privately hacking together a toy-project-grade JS engine and am looking at the ecosystem of relevant high-performance libraries.

I think that, in the context of JS engines, it would be interesting to have a fast path for UTF-16 (and eventually Latin-1) that doesn't "just" hide UTF-8 transcoding. Working with the engine's native encoding means no conversion work and no need to allocate working copies in UTF-8, which sounds desirable to me.
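
To illustrate what working in the native encoding could look like, the following Python sketch searches a UTF-16 buffer for an ASCII delimiter directly on 16-bit code units, without producing a UTF-8 copy. It relies on the fact that a UTF-16 code unit equal to 0x003A can only be `:`, since surrogates occupy 0xD800 to 0xDFFF. The names `scheme_end_utf16` and `NATIVE_UTF16` are made up for this example, and nothing here reflects how the library is actually implemented.

```python
import sys

# Codec name matching the machine's byte order (an assumption of this
# sketch: the buffer holds native-endian UTF-16 with no byte-order mark).
NATIVE_UTF16 = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"


def scheme_end_utf16(buf: bytes) -> int:
    """Return the index, in 16-bit code units, of the first ':' in `buf`,
    or -1 if there is none. The buffer is viewed in place, not copied."""
    units = memoryview(buf).cast("H")  # reinterpret bytes as native uint16
    for i, unit in enumerate(units):
        if unit == 0x3A:  # ':' is a single code unit in UTF-16
            return i
    return -1
```

For example, `scheme_end_utf16("https://example.com".encode(NATIVE_UTF16))` returns 5, the position of the colon that ends the scheme.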

As for providing front ends with transcoding logic, I am generally cautious about it: libraries tend to each use a different implementation, so binaries end up holding multiple implementations of the same function (without domain-specific shortcuts or optimizations), which isn't ideal. As long as there's good documentation stating that they're just QoL shortcuts, and there's a path that involves no transcoding, I'd be fine with that.

3 participants