Unicode characters larger than 16 bits in object names cause errors in browse mode #2090

nvaccessAuto · 2012-02-08T04:57:29Z

Reported by jteh on 2012-02-08 04:57
Steps to reproduce:

Try to read the following line in browse mode:
```html
Test line
```
Expected: "Test line" should be reported.
Actual: Exception:
Traceback (most recent call last):
  File "XMLFormatting.py", line 60, in parse
    self.parser.Parse(XMLText.encode('utf-8'))
ExpatError: not well-formed (invalid token): line 1, column 5287

This occurs because the name of the div contains a Unicode character \U0001f44d which is larger than 16 bits. In UTF-16 (which is what Windows and Python use), this is represented using surrogates \ud83d\udc4d. The fact that we can't handle characters larger than 16 bits at all is ugly, but isn't the cause of the error. The error occurs because surrogates are invalid in XML, so our appendCharToXML function outputs a unich tag. This is fine in content, but it breaks when this happens in the middle of an attribute value - in this case, the name attribute - as you can't have a tag within an attribute value.
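To make the two points above concrete, here is a minimal Python 3 sketch. It first shows the UTF-16 surrogate pair for \U0001f44d, then uses expat to check that a tag is accepted in content but rejected inside an attribute value. The exact shape of the unich tag (`<unich value="..."/>`) and the helper name `is_well_formed` are assumptions made for illustration; they are not taken from the NVDA source.

```python
import struct
import xml.parsers.expat

# U+1F44D does not fit in 16 bits, so UTF-16 stores it as two code units.
ch = "\U0001F44D"
units = struct.unpack(">2H", ch.encode("utf-16-be"))
print([hex(u) for u in units])  # ['0xd83d', '0xdc4d']


def is_well_formed(xml_text):
    """Return True if expat accepts xml_text as a complete document."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(xml_text.encode("utf-8"), True)
        return True
    except xml.parsers.expat.ExpatError:
        return False


# A tag standing in for a surrogate is fine inside element content...
in_content = '<div>Test <unich value="55357"/> line</div>'
# ...but the same tag inside an attribute value is not well-formed XML,
# because a literal "<" may not appear in an attribute value.
in_attribute = '<div name="<unich value="55357"/>">Test line</div>'

print(is_well_formed(in_content))    # True
print(is_well_formed(in_attribute))  # False
```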

This doesn't happen in 2011.3, as we previously just replaced all characters we couldn't output with a replacement character.

I'm not sure how we're going to work around this. Are you allowed to invent entities in XML?


nvaccessAuto · 2012-02-08T05:07:02Z

Comment 1 by jteh on 2012-02-08 05:07
If we can't come up with anything better, a sorta hacky solution is to go back to using a replacement character just for attribute values, as we probably don't really care about the exact characters so much there.

nvaccessAuto · 2012-02-08T05:10:38Z

Comment 2 by jteh on 2012-02-08 05:10
Changes:
Changed title from "Unicode characters larger than 16 bits in name cause errors browse mode" to "Unicode characters larger than 16 bits in object names cause errors in browse mode"

nvaccessAuto · 2012-02-08T05:35:43Z

Comment 3 by mdcurran on 2012-02-08 05:35
I would be for adding an argument to appendCharToXML which says whether it should use a replacement character or a tag, and then we only do replacement for attribute values. What we currently have is indeed rather dodgy, my mistake I think. I'm happy to fix if this is what we want.
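A rough Python 3 sketch of this proposal might look like the following. It is not the actual appendCharToXML; the function name `append_char_to_xml`, the `in_attribute` flag, the `build_div` helper, and the unich tag format are all hypothetical and only illustrate the idea of choosing between a tag and a replacement character.

```python
import xml.parsers.expat

REPLACEMENT_CHAR = "\uFFFD"
ESCAPES = {'"': "&quot;", "<": "&lt;", ">": "&gt;", "&": "&amp;"}


def is_valid_xml_char(code_unit):
    """XML 1.0 Char ranges that fit in a single 16-bit code unit."""
    return (
        code_unit in (0x9, 0xA, 0xD)
        or 0x20 <= code_unit <= 0xD7FF
        or 0xE000 <= code_unit <= 0xFFFD
    )


def append_char_to_xml(code_unit, out, in_attribute=False):
    """Append one 16-bit code unit to the list of XML fragments in out.

    Units that are invalid in XML (such as surrogates) become a unich tag
    in content, but a replacement character inside attribute values,
    where a tag would break well-formedness.
    """
    if not is_valid_xml_char(code_unit):
        if in_attribute:
            out.append(REPLACEMENT_CHAR)
        else:
            out.append('<unich value="%d" />' % code_unit)
        return
    ch = chr(code_unit)
    out.append(ESCAPES.get(ch, ch))


def build_div(name_units, text_units):
    """Hypothetical caller: emit a div with a name attribute and text."""
    out = ['<div name="']
    for unit in name_units:
        append_char_to_xml(unit, out, in_attribute=True)
    out.append('">')
    for unit in text_units:
        append_char_to_xml(unit, out)
    out.append("</div>")
    return "".join(out)


# The name is the surrogate pair for U+1F44D; the content is plain text.
xml_text = build_div([0xD83D, 0xDC4D], [ord(c) for c in "Test line"])

# Parsing no longer raises ExpatError, since the attribute holds no tag.
parser = xml.parsers.expat.ParserCreate()
parser.Parse(xml_text.encode("utf-8"), True)
```

The trade-off is the one described in comment 1: the exact characters of the name are lost in the attribute, while content still carries the original code units through the unich tag.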

nvaccessAuto · 2012-02-08T06:38:54Z

Comment 4 by jteh on 2012-02-08 06:38
I can't think of a better solution, so please go ahead with this one. :)

nvaccessAuto · 2012-02-12T22:54:04Z

Comment 5 by jteh on 2012-02-12 22:54
Fixed in 2dae1e4.
Changes:
State: closed

nvaccessAuto added bug feature/browse-mode bug/regression labels Nov 10, 2015

nvaccessAuto assigned michaelDCurran Nov 10, 2015

nvaccessAuto added this to the 2012.1 milestone Nov 10, 2015

nvaccessAuto closed this as completed Nov 10, 2015

nvaccessAuto mentioned this issue Nov 10, 2015

Read HTML entities, unicode characters, other symbols #3805

Open
