Try to read the following line in browse mode:
```html
Test line
```
Expected: "Test line" should be reported.
Actual: Exception:
Traceback (most recent call last):
  File "XMLFormatting.py", line 60, in parse
    self.parser.Parse(XMLText.encode('utf-8'))
ExpatError: not well-formed (invalid token): line 1, column 5287
This occurs because the name of the div contains a Unicode character \U0001f44d which is larger than 16 bits. In UTF-16 (which is what Windows and Python use), this is represented using surrogates \ud83d\udc4d. The fact that we can't handle characters larger than 16 bits at all is ugly, but isn't the cause of the error. The error occurs because surrogates are invalid in XML, so our appendCharToXML function outputs a unich tag. This is fine in content, but it breaks when this happens in the middle of an attribute value - in this case, the name attribute - as you can't have a tag within an attribute value.
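To make the failure concrete, here is a minimal sketch in Python 3 (not NVDA's actual code). It shows the surrogate pair for \U0001f44d and checks with expat that a tag is accepted in content but not inside an attribute value. The exact `<unich value="..." />` spelling and the helper name `is_well_formed` are assumptions for illustration only.

```python
import xml.parsers.expat

# The character from the report, outside the 16-bit range.
ch = "\U0001f44d"

# Split its UTF-16 encoding into 16-bit code units (the surrogate pair).
utf16 = ch.encode("utf-16-le")
units = [int.from_bytes(utf16[i:i + 2], "little") for i in range(0, len(utf16), 2)]


def is_well_formed(xml_text):
    """Hypothetical helper: True if expat parses xml_text without error."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(xml_text.encode("utf-8"), True)
        return True
    except xml.parsers.expat.ExpatError:
        return False


# Assumed tag spelling: a unich tag in element content parses fine...
content_ok = is_well_formed('<div>Test <unich value="55357" /> line</div>')

# ...but the same tag inside the name attribute is not well-formed,
# because "<" may not appear in an attribute value.
attr_ok = is_well_formed('<div name="a<unich value="55357" />b">Test line</div>')

print([hex(u) for u in units], content_ok, attr_ok)
```

The second document fails at the first `<` inside the attribute value, which matches the "not well-formed (invalid token)" error above.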
This doesn't happen in 2011.3, as we previously just replaced all characters we couldn't output with a replacement character.
I'm not sure how we're going to work around this. Are you allowed to invent entities in XML?
Comment 1 by jteh on 2012-02-08 05:07
If we can't come up with anything better, a sorta hacky solution is to go back to using a replacement character just for attribute values, as we probably don't really care about the exact characters so much there.
Comment 2 by jteh on 2012-02-08 05:10
Changes:
Changed title from "Unicode characters larger than 16 bits in name cause errors browse mode" to "Unicode characters larger than 16 bits in object names cause errors in browse mode"
Comment 3 by mdcurran on 2012-02-08 05:35
I would be for adding an argument to appendCharToXML which says whether it should use a replacement character or a tag, and then we only do replacement for attribute values. What we currently have is indeed rather dodgy; my mistake, I think. I'm happy to fix if this is what we want.
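A rough sketch of that idea, written in Python purely for illustration. The real appendCharToXML is not shown in this ticket, so the function name `append_char_to_xml`, its `in_attribute` argument, the `<unich value="..." />` spelling, and the choice of U+FFFD as the replacement character are all assumptions rather than the actual implementation.

```python
import xml.parsers.expat

REPLACEMENT = "\ufffd"  # assumed replacement character
ESCAPES = {ord("<"): "&lt;", ord(">"): "&gt;", ord("&"): "&amp;", ord('"'): "&quot;"}


def is_valid_xml_char(code):
    """True if code is allowed by the XML 1.0 Char production."""
    return (
        code in (0x9, 0xA, 0xD)
        or 0x20 <= code <= 0xD7FF
        or 0xE000 <= code <= 0xFFFD
        or 0x10000 <= code <= 0x10FFFF
    )


def append_char_to_xml(code, out, in_attribute=False):
    """Hypothetical analogue: append one UTF-16 code unit to the list out."""
    if not is_valid_xml_char(code):
        # Tags are illegal inside attribute values, so fall back to a
        # replacement character there; keep the tag in element content.
        out.append(REPLACEMENT if in_attribute else '<unich value="%d" />' % code)
    elif code in ESCAPES:
        out.append(ESCAPES[code])
    else:
        out.append(chr(code))


def build(units, in_attribute):
    out = []
    for unit in units:
        append_char_to_xml(unit, out, in_attribute)
    return "".join(out)


# "a" followed by the surrogate pair for \U0001f44d, as UTF-16 code units.
name_units = [0x61, 0xD83D, 0xDC4D]
attr_text = build(name_units, in_attribute=True)
content_text = build(name_units, in_attribute=False)

# The combined document now parses; expat raises ExpatError otherwise.
doc = '<div name="%s">%s</div>' % (attr_text, content_text)
xml.parsers.expat.ParserCreate().Parse(doc.encode("utf-8"), True)
```

With this split, the attribute loses the exact character (each surrogate becomes a replacement character), while content still carries the original code units in tags.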
Reported by jteh on 2012-02-08 04:57