Unicode characters larger than 16 bits in object names cause errors in browse mode #2090

Closed
nvaccessAuto opened this issue Feb 8, 2012 · 5 comments

Comments

@nvaccessAuto

Reported by jteh on 2012-02-08 04:57
Steps to reproduce:

  1. Try to read the following line in browse mode:
    `html
    Test line
    `
  2. Expected: "Test line" should be reported.
  3. Actual: Exception:
    Traceback (most recent call last):
      File "XMLFormatting.py", line 60, in parse
        self.parser.Parse(XMLText.encode('utf-8'))
    ExpatError: not well-formed (invalid token): line 1, column 5287

This occurs because the name of the div contains a Unicode character \U0001f44d which is larger than 16 bits. In UTF-16 (which is what Windows and Python use), this is represented using surrogates \ud83d\udc4d. The fact that we can't handle characters larger than 16 bits at all is ugly, but isn't the cause of the error. The error occurs because surrogates are invalid in XML, so our appendCharToXML function outputs a unich tag. This is fine in content, but it breaks when this happens in the middle of an attribute value - in this case, the name attribute - as you can't have a tag within an attribute value.
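To make the two points above concrete, here is a minimal Python sketch. It first derives the surrogate pair for U+1F44D, then shows that expat accepts a tag in element content but rejects one inside an attribute value. The helper names `to_surrogate_pair` and `is_well_formed`, and the exact shape of the `unich` tag (a `value` attribute holding the code unit), are assumptions for illustration and are not taken from the NVDA source.

```python
import xml.parsers.expat


def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not above U+FFFF")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


def is_well_formed(xml_text):
    """Return True if expat can parse xml_text as a complete document."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(xml_text.encode("utf-8"), True)
        return True
    except xml.parsers.expat.ExpatError:
        return False


high, low = to_surrogate_pair(0x1F44D)
print("%04x %04x" % (high, low))  # d83d dc4d

# Hypothetical markup: a unich tag standing in for one surrogate code unit.
tag_in_content = '<div name="x">Test <unich value="55357"/> line</div>'
tag_in_attribute = '<div name="a<unich value="55357"/>b">Test line</div>'

print(is_well_formed(tag_in_content))    # tag in content parses
print(is_well_formed(tag_in_attribute))  # "<" in an attribute value does not
```

The second document fails because a literal `<` is not permitted inside an attribute value, which matches the "invalid token" error in the traceback.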

This doesn't happen in 2011.3, as we previously just replaced all characters we couldn't output with a replacement character.

I'm not sure how we're going to work around this. Are you allowed to invent entities in XML?

@nvaccessAuto

Comment 1 by jteh on 2012-02-08 05:07
If we can't come up with anything better, a sorta hacky solution is to go back to using a replacement character just for attribute values, as we probably don't really care about the exact characters so much there.

@nvaccessAuto

Comment 2 by jteh on 2012-02-08 05:10
Changes:
Changed title from "Unicode characters larger than 16 bits in name cause errors browse mode" to "Unicode characters larger than 16 bits in object names cause errors in browse mode"

@nvaccessAuto

Comment 3 by mdcurran on 2012-02-08 05:35
I would be in favour of adding an argument to appendCharToXML which says whether it should use a replacement character or a tag, and then we only do replacement for attribute values. What we currently have is indeed rather dodgy; my mistake, I think. I'm happy to fix this if that is what we want.
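A rough sketch of that idea in Python follows, purely as an illustration. The real appendCharToXML may look quite different, and the names `append_char_to_xml`, `escape`, `is_attribute` and the `unich` tag format are assumptions made here. Code units that XML cannot carry become U+FFFD in attribute values and a tag in content.

```python
import xml.parsers.expat

REPLACEMENT_CHAR = u"\ufffd"
ESCAPES = {ord('"'): "&quot;", ord("<"): "&lt;", ord(">"): "&gt;", ord("&"): "&amp;"}


def is_valid_xml_char(code):
    """True if a single 16-bit code unit is a legal XML 1.0 character."""
    return code in (0x9, 0xA, 0xD) or 0x20 <= code <= 0xD7FF or 0xE000 <= code <= 0xFFFD


def append_char_to_xml(code, out, is_attribute=False):
    """Append one UTF-16 code unit to the list out, escaped for XML."""
    if code in ESCAPES:
        out.append(ESCAPES[code])
    elif is_valid_xml_char(code):
        out.append(chr(code))
    elif is_attribute:
        # Tags are not allowed inside attribute values, so fall back to U+FFFD.
        out.append(REPLACEMENT_CHAR)
    else:
        # Hypothetical tag format carrying the raw code unit.
        out.append('<unich value="%d"/>' % code)


def escape(code_units, is_attribute=False):
    out = []
    for code in code_units:
        append_char_to_xml(code, out, is_attribute)
    return "".join(out)


thumbs_up = [0xD83D, 0xDC4D]  # surrogate pair for U+1F44D
document = '<div name="%s">%s</div>' % (
    escape(thumbs_up, is_attribute=True),
    escape(thumbs_up),
)

# Parsing raises ExpatError if the document is not well-formed.
xml.parsers.expat.ParserCreate().Parse(document.encode("utf-8"), True)
```

With the flag set, the name attribute loses the exact character but the document stays well-formed, which is the trade-off suggested in comment 1.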

@nvaccessAuto

Comment 4 by jteh on 2012-02-08 06:38
I can't think of a better solution, so please go ahead with this one. :)

@nvaccessAuto

Comment 5 by jteh on 2012-02-12 22:54
Fixed in 2dae1e4.
Changes:
State: closed
