Simplify handling of encoding #69

Open
wants to merge 1 commit into
base: python3_only

Delgan (Collaborator) commented Jan 18, 2019

Hi @Qix-

Taking advantage of the fact that we no longer need to support Python 2.7, I think we can greatly simplify how we handle string encoding.

This avoids the hacky ProxyBufferStreamWrapper class, the dubious to_unicode() and to_bytes() functions, and the conditional statement in write_stream().

Basically, we just format the exception as a unicode string and let the sys.stderr stream handle the encoding; there is no need to deal with .buffer bytes.

You may notice one side effect: UTF-8 characters like "天" are no longer displayed as-is on ASCII terminals. I think this is actually the correct behavior.

One thing I don't understand about the current implementation: on one hand, we test whether └ can be encoded and fall back to -> on error; on the other hand, we manage to print "天" in all cases. This results in tracebacks which may look like -> "天". It's paradoxical: either we can display UTF-8 characters or we can't. I suspect this is a source of the problems encountered by some users.
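To make the fallback concrete, here is a minimal sketch of that kind of encoding probe. The helper name pick_arrow is hypothetical and is not taken from the better_exceptions code base; it only illustrates the "try to encode, fall back on error" pattern described above.

```python
def pick_arrow(encoding):
    """Return the marker to draw before a value, given the target encoding.

    Hypothetical helper: probes whether the box-drawing character can be
    encoded and falls back to a plain ASCII arrow when it cannot.
    """
    cap = "\u2514"  # the box-drawing character '└'
    try:
        cap.encode(encoding)
    except (UnicodeEncodeError, LookupError):
        # The glyph is not representable (or the encoding is unknown).
        return "->"
    return cap


print(pick_arrow("utf-8"))
print(pick_arrow("ascii"))
```

The probe only decides which marker to use; it says nothing about whether the inspected values themselves can be encoded, which is where the inconsistency comes from.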

By writing unencoded unicode to sys.stderr, unprintable characters are automatically escaped by the stream's backslashreplace error handler, so the output is -> "\u5929" on ASCII terminals and └ "天" otherwise.
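The escaping can be reproduced without touching the real terminal. The sketch below assumes a text stream configured like Python 3's default sys.stderr, which uses the backslashreplace error handler; the render function is a hypothetical name used only for illustration.

```python
import io


def render(text, encoding):
    """Return the bytes produced when text is written to a stderr-like stream."""
    raw = io.BytesIO()
    # Same error handler as Python 3's default sys.stderr.
    stream = io.TextIOWrapper(raw, encoding=encoding, errors="backslashreplace")
    stream.write(text)
    stream.flush()
    return raw.getvalue()


print(render('-> "天"', "ascii"))
print(render('\u2514 "天"', "utf-8").decode("utf-8"))
```

With an ASCII stream the character is replaced by its \u5929 escape instead of raising UnicodeEncodeError; with a UTF-8 stream it is written unchanged.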

Also, I replaced locale.getpreferredencoding() with STREAM.encoding: we are writing to STREAM (sys.stderr), so why not use its own encoding? Using locale.getpreferredencoding() turned out to display mojibake for some users, so maybe this will fix it.
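The two values can differ, for example when PYTHONIOENCODING is set or when sys.stderr has been replaced by another object. A quick way to compare them, as a sketch (STREAM mirrors the name used above; the other variable names are illustrative):

```python
import locale
import sys

STREAM = sys.stderr  # the stream the traceback is written to

# Some replacement streams have no usable encoding, so guard the lookup.
stream_encoding = getattr(STREAM, "encoding", None)
locale_encoding = locale.getpreferredencoding()

print("stream:", stream_encoding, "| locale:", locale_encoding)
```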

I ran some tests on both Linux and Windows and compared exception formatting between the standard handler and better_exceptions for various locale and IO encoding settings. The handling of UTF-8 characters is now identical to that of the default exception handler, so I think this should reduce encoding-related problems.

This pull request is made for the python3_only branch.

Delgan (Collaborator, Author) commented Jan 19, 2019

@Qix- I realized there is actually a problem with this solution.

Given this code running on an ASCII terminal:

a = "天"
"天" * a

Without better_exceptions:

Traceback (most recent call last):
  File "a.py", line 6, in <module>
    "\u5929" * a
TypeError: can't multiply sequence by non-int of type 'str'

With better_exceptions/master:

Traceback (most recent call last):
  File "a.py", line 6, in <module>
    '天' * a
            -> '天'
TypeError: can't multiply sequence by non-int of type 'str'

With better_exceptions/simplify_encoding:

Traceback (most recent call last):
  File "a.py", line 6, in <module>
    '\u5929' * a
            -> '\u5929'
TypeError: can't multiply sequence by non-int of type 'str'

The column at which -> starts is computed incorrectly, because it is calculated from the source string before its non-ASCII characters are escaped. I have a solution but can't really apply it here because of other problems with source formatting. I made a branch based on this one that fixes source formatting, and where this can be fixed easily.
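For illustration only, one way to account for the shift is to measure the escaped prefix rather than the raw one. This is a sketch under the assumption that the stream uses the backslashreplace handler; it is not necessarily the fix implemented in the other branch, and both function names are hypothetical.

```python
def escape_for(text, encoding):
    """Return text as it appears once written with the backslashreplace handler."""
    return text.encode(encoding, "backslashreplace").decode(encoding)


def displayed_column(source, col, encoding):
    """Column of source[col] after the text before it has been escaped."""
    return len(escape_for(source[:col], encoding))


source = "'天' * a"
col = source.rindex("a")  # column of the variable in the raw source

print(col, displayed_column(source, col, "ascii"))
```

On an ASCII stream the single character "天" becomes the six-character escape \u5929, so everything after it moves five columns to the right.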
