Automatic language detection based on unicode ranges #2990
Comments
Comment 2 by ragb (in reply to comment 1) on 2013-02-13 12:37
I think #1606 is only related to punctuation, although, to be honest, I don't understand that ticket's description very well.
Comment 3 by Ahiiron on 2013-05-21 14:35
Comment 4 by dineshkaushal on 2015-07-13 05:30 There is a Writing Script dialog within the preferences menu. This dialog has options to add/remove languages and move them up and down. I tested with two Devanagari languages, Hindi and Marathi, and I could get the proper language code for those languages in the log. Code is in branch in_t2990
Comment 6 by dineshkaushal on 2015-08-17 19:16
Comment 7 by jteh on 2015-09-21 05:09 Thanks for the changes, Dinesh. This looks pretty good. A few things:
gui
unicodeScriptHandler
unicodeScriptPrep
Documentation
Thanks!
Comment 8 by dineshkaushal on 2015-09-28 08:11
Comment 9 by dineshkaushal on 2015-10-07 13:34
Comment 11 by jteh on 2015-10-19 01:22
Comment 12 by MarcoZehe on 2015-10-19 10:46 In consequence: if I try to set my synth to German Anna in Vocalizer 2.0 for NVDA, it will still use the English Samantha voice for most things, even German web pages. I have to turn off language detection completely to get my old functionality back. This will, of course, also take away the language switching where the author used correct lang attributes on web sites or in Word documents.
Comment 16 by nishimotz on 2015-10-19 12:32 For example, the word 'Yomu' ('read' in Japanese) usually consists of two characters, 読む. The first one is an ideographic character (a Chinese character). To give the correct pronunciation, a Japanese TTS should take the two characters together. With this version of NVDA, the two characters are pronounced separately, so the reading of the first character is wrong. In unicodeScriptData.py, it seems that 0x8aad is in the range of "Han".
Comment 18 by jteh (in reply to comment 16) on 2015-10-26 11:04
Comment 19 by nvdakor on 2015-10-27 07:51
Comment 20 by nvdakor on 2015-10-27 07:53
Comment 21 by mohammed on 2015-10-27 13:47 Another GUI change would be to only have a Close button. I don't think OK and Cancel are functional in this dialog box. Thoughts? On another note, since #5427 is closed as fixed, I think it should be removed from the blocking tickets? Thanks.
Comment 22 by jteh on 2015-10-28 00:41
Comment 23 by jteh (in reply to comment 21) on 2015-10-28 00:51
They should be. Cancel should discard any changes you make (e.g. removing a language you didn't intend to remove), whereas OK saves them.
No, it shouldn't. Blocking indicates whether another ticket was required for this one, whether it's fixed yet or not. If it is fixed, it's still useful to know that it was required.
Comment 24 by dineshkaushal on 2015-10-28 07:32 The problem with Han and Hiragana is occurring because our algorithm assumes that each language has only one script. One possible solution is that during unicodeData building we could name all Han and Hiragana characters as something like HiraganaHan, and then add a language-to-script mapping for Japanese as HiraganaHan; we could do the same for Chinese and Korean. Another solution is that we could create script groups, add a check for the script group of each character, and not split strings within a script group. Could anyone explain which scripts are relevant for the Japanese, Chinese and Korean languages, and how the various scripts combine in these languages? Alternatively, a reliable reference resource would help.
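The script-group idea above can be sketched in Python. This is only an illustrative sketch, not NVDA's code: the ranges are the standard Unicode blocks for Hiragana, Katakana and the basic CJK Unified Ideographs, but grouping Han together with the kana scripts under a single "Japanese" group is just the proposal from this comment (and, since Han is shared with Chinese and Korean, a real implementation would still need per-language priorities on top of this).

```python
# Map Han, Hiragana and Katakana to one "Japanese" group so that
# mixed-script Japanese text (e.g. 読む) is not split between scripts.
# Ranges are standard Unicode blocks; the grouping is hypothetical.
HIRAGANA = range(0x3040, 0x30A0)
KATAKANA = range(0x30A0, 0x3100)
HAN = range(0x4E00, 0xA000)  # CJK Unified Ideographs (basic block only)

SCRIPT_GROUPS = {"Hiragana": "Japanese", "Katakana": "Japanese", "Han": "Japanese"}

def script_of(ch):
    cp = ord(ch)
    if cp in HIRAGANA:
        return "Hiragana"
    if cp in KATAKANA:
        return "Katakana"
    if cp in HAN:
        return "Han"
    return "Common"  # everything else, for this sketch

def group_of(ch):
    # Characters whose scripts share a group are treated as one run.
    script = script_of(ch)
    return SCRIPT_GROUPS.get(script, script)
```

With this, 読 (Han, 0x8aad) and む (Hiragana) fall into the same group, so the word would not be split.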
Comment 26 by nishimotz on 2015-10-28 08:49
I think such requirements exist because of Japanese TTS and symbol dictionaries, which already cover wider ranges of Unicode characters for historical reasons. If such a requirement applies only to Japanese users, I will implement a workaround only for Japanese.
Comment 27 by jteh on 2015-10-29 00:35
Comment 28 by nishimotz on 2015-10-29 03:01 https://vocalizer-nvda.com/docs/en/userguide.html#automatic-language-switching-settings I am asking them about the usage of this functionality. As far as I have heard, it should be possible to disable automatic language switching based on the content attribute and based on character codes separately for Japanese language users.
Comment 29 by jteh (in reply to comment 28) on 2015-10-29 03:09
To clarify, do you mean that these users disable language detection (using characters), but leave language switching for author-specified language enabled? Or are you saying the reverse? Or are you saying that different users have different settings, but all agree both need to be toggled separately? How well does the Vocalizer language detection implementation work for Japanese users? For what it's worth, I'm starting to think we should allow users to disable language detection (i.e. using characters) separately. At the very least, it provides a workaround if our language detection code gets it wrong. I'm not convinced it is necessary to separately disable author-specified language switching, though. If you disagree, can you explain why?
Comment 30 by nishimotz on 2015-10-29 03:51 For example, if a synthesizer supports English and Japanese, and if the actual content of a web site is written in Japanese characters but the element is incorrectly attributed as lang='en', the content cannot be accessed at all without turning off author-specified language switching. I am now investigating the implementation of Vocalizer language detection myself; however, I have heard that it is only useful for working with multilingual materials.
Comment 31 by nishimotz on 2015-10-29 12:41 The important feature is: by the way, it would be nice to allow disabling "language switching for author-specified language" while enabling "detect text language based on unicode characters" in some cases. For example, Microsoft Word already has the ability to detect content language based on character codes. I am now asking some friends about this, but it seems Japanese users of Microsoft Word cannot use NVDA's language switching because of this.
Comment 32 by James Teh <jamie@... on 2015-11-02 05:30 This is causing problems for quite a few languages and needs some additional work before it is ready.
Comment 33 by jteh on 2015-11-02 05:31
Comment 34 by mohammed on 2015-11-04 16:00 It'd be good if people here could try the automatic language implementation in the new add-on from Code Factory. For me it works if I choose an English voice from NVDA's voice settings dialog box. The only annoyance for me is that I hear punctuation marks with the Arabic voice regardless of the "Trust voice's language when processing characters and symbols" state. Jamie, could we make this functionality that has been reverted available as an add-on? Because for me, it is the most successful implementation where my primary language is English and Arabic is secondary. It worked perfectly for me.
Comment 35 by jteh (in reply to comment 34) on 2015-11-04 22:24
Do you mean that the Code Factory add-on includes its own language detection, or do you mean you were trying an NVDA next build which included this functionality (before it was reverted)? I assume the second, but just checking.
Unfortunately, no; it needs to integrate quite deeply into NVDA's speech code. However, work on this isn't being abandoned. It just needs more work before it's ready for widespread testing again.
No; you can't choose the specific voice. This choice is made by the synth. Supporting this will be possible using the same technique we will use to support synth switching.
Thanks @nishimotz for the fix.
But I thought this scenario should be covered by the Common Unicode category? My understanding is that the algorithm does not work for your example of 1個 because the number comes before the character 個. Can you verify whether a number coming after the Japanese character works fine?
In that case, instead of adding "Number" as a separate category, we could change the processing so that the language code applies to a preceding Common-category string if there is no earlier language code. This should solve the above scenario. The current implementation should take the default language code for the Common category, so this problem should not occur.
Could you also give me a log so that I can check what default language is showing for the above example with a Japanese synthesizer?
Thanks for other improvements as well, the code is looking better.
Original code treats numbers as the Common category. My modification treats digit numbers, for all languages, as their native script, so the preferred language priority is respected.
I have merged modifications proposed by @nishimotz and added a few unit tests for language detection.
Based on these unit tests, I found that language detection didn’t work properly for numbers if no preferred language was added.
I have made some corrections so that if there is no preferred language, the default language is used for numbers. The default language is the language reported by the synthesizer. I have also renamed a parameter to make it read better.
@nishimotz could you test the modifications and add more tests, especially for Japanese and Chinese?
Thanks
Original code treats numbers as the Common category.
Because detectScript() ignores the Common category, the language code of digit numbers will be the same as that of the preceding characters.
For example, even if Japanese has higher priority, "Excel 2016" is spoken in English to the end.
This is difficult for Japanese language users to understand.
My modification treats digit numbers, for all languages, as their native script, so the preferred language priority is respected.
For example, if Japanese has higher priority, "Excel" is spoken in English and "2016" in Japanese.
This is much easier to understand.
Use of the default language sounds good; however, I found an issue with your new revision. Setup:
procedure:
Tests are working as expected. The second parameter of detectLanguage() is given in speech.py. However, if automatic language detection is enabled in the NVDA voice settings, the locale value is set to the synthesizer's default language. Am I correct?
I have learned more about your code.
As per the original design, the second parameter of detectLanguage, i.e. defaultLanguage, was used to decide whether we would add a languageChange command or not. So if defaultLanguage is English, the text string is in Latin script, and the preferred language is English, then there would be no languageChange command, as it is added before calling this function.
The purpose of the preferred language was to choose a language from the list of languages that share the same script.
I thought the Common script property would take care of numbers and punctuation. The Common script seems to be working for punctuation, but for numbers I am not very sure. There could be the following scenarios:
If the language before and after a number is the same, then we could default to that language, and the Common property handles that very well.
If a language is followed by a number, we could speak the number in that language; but you suggested that for "Excel 2016" you want the number spoken in Japanese even though the text is in English. For that, we need a way to determine which language we should use for numbers.
If a number is followed by a language, we could solve that as well with the Common property along with backtracking.
If a number stands alone, then we don't know what to do, so we either speak it with the previous language or with the NVDA language selection.
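The scenarios above can be sketched as one resolution pass over the detected runs. This is an illustrative sketch only, not NVDA's implementation: `resolve_common_runs` and its (language, text) run representation are hypothetical stand-ins.

```python
# Resolve Common-category runs (digits, punctuation) against their
# neighbours: prefer the previous run's language, then backtrack to
# the next run's, then fall back to a default (the stand-alone case).
def resolve_common_runs(runs, default_lang="en"):
    """runs: list of (lang_or_None, text); None marks a Common run."""
    resolved = []
    for i, (lang, text) in enumerate(runs):
        if lang is None:
            prev_lang = next((l for l, _ in reversed(resolved) if l), None)
            next_lang = next((l for l, _ in runs[i + 1:] if l), None)
            lang = prev_lang or next_lang or default_lang
        resolved.append((lang, text))
    return resolved
```

For example, `[("en", "Excel "), (None, "2016")]` resolves the number to English (the preceding language), while a leading number before Japanese text picks up Japanese by backtracking.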
I have learned more about your code.
I am still not sure how the voice language (aka default language) and the preferences should be used.
For example, this test, written by me, fails.
It is because the second parameter of detectLanguage has higher priority than the preferred languages, so Number always respects the voice language.
Is this intended or not?
```python
def test_case1(self):
    combinedText = u"Windows 10 OCR"
    config.conf["languageDetection"]["preferredLanguages"] = ("ja",)
    languageDetection.updateLanguagePriorityFromConfig()
    detectedLanguageSequence = languageDetection.detectLanguage(combinedText, "en_US")
    self.compareSpeechSequence(detectedLanguageSequence, [
        LangChangeCommand("en"),
        u"Windows ",
        LangChangeCommand("ja"),
        u"10 ",
        LangChangeCommand("en"),
        u"OCR"
    ])
    config.conf["languageDetection"]["preferredLanguages"] = ()
    languageDetection.updateLanguagePriorityFromConfig()
```
Thank you for the clarifications regarding preferences. I made a new pull request which only adds tests for Japanese.
nvaccess#2990 Japanese test cases
@nishimotz I have included the test cases. Should I assume that these test cases are what Japanese users expect from NVDA language detection? Or do you propose any change regarding how we handle numbers?
As per your previous comment, "Windows 10 OCR" should be read in English.
So do you propose that we should go by either the synthesizer language or the language selected in NVDA?
I also request suggestions from others about this issue.
So far, Japanese language users can accept the behavior of the current implementation, I think.
Could you summarize what work needs to be done before you consider sending this as a PR to be reviewed? For Arabic this works as expected, and it seems this is true for Japanese too.
Are we going to get this in 2017.4?
I don’t understand why. I had submitted it almost a month and a half ago, with unit tests.
Because we have an RC now.
But wait for Mick's statement.
On 23.11.2017 at 16:25, dineshkaushal wrote:
> I don’t understand why. I had submitted it almost a month and a half ago, with unit tests.
On 23.11.2017, zstanecic wrote:
> I am afraid, no.
> @josephsl,
> @mdcurran
On 23.11.2017 at 15:39, dineshkaushal wrote:
> Are we going to get this in 2017.4?
Yes, it's now too late for this change to go into 2017.4. This is perhaps best anyway: the associated PR (#7629) is a large change which will take some time to review, and given the nature of the change, it will be good for many people to use it via master and next builds before it goes into a release.
OK, I will wait for comments after the review.
@dineshkaushal are you still considering continuing your work on this? It would be highly appreciated. Since a lot of work has been put into that PR, it would be really too bad if it were discontinued. Now that NVDA has been migrated to Python 3, I guess the PR is not compatible anymore.
cc: @mltony
I think we should implement this...
I already implemented this feature in the Tony's Enhancements add-on. However, for NVDA core I would argue that we can take this idea a step further and make use of a language detection library in order to distinguish languages properly, e.g. distinguishing English from German, which is not possible with just Unicode character analysis. VoiceOver can already do this. My cursory googling revealed multiple options available:
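As an illustration of what such libraries do internally, here is a toy trigram-based classifier. It is only a sketch of the statistical technique: real libraries train their trigram profiles on large corpora, not on the single stand-in sentences used here, and the sample sentences and thresholds are my own assumptions.

```python
# Toy trigram-based language identification, the statistical technique
# behind common language detection libraries. Profiles here are built
# from tiny stand-in sentences for demonstration only.
from collections import Counter

def trigrams(text):
    text = " %s " % text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

PROFILES = {
    "en": trigrams("the quick brown fox jumps over the lazy dog and then the cat"),
    "de": trigrams("der schnelle braune fuchs springt ueber den faulen hund und die katze"),
}

def guess_language(text):
    sample = trigrams(text)
    # Score each profile by the number of shared trigram occurrences.
    def score(profile):
        return sum(min(n, profile[g]) for g, n in sample.items())
    return max(PROFILES, key=lambda lang: score(PROFILES[lang]))
```

As the thread notes for the per-line text NVDA passes to synthesizers, this kind of probabilistic approach only becomes reliable with more input text.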
Hello! This is a big message... In a conversation with mohammad suliman mohmad.s93@gmail.com:
mohammad suliman wrote: Good! I can cooperate in several tasks, but not coding, since my skills
You wrote: First, we want to highlight that this PR is very needed for us
Yes, I know that, and we try to make our Vocalizer NVDA compatible as soon
You wrote: That means that if NVDA encounters a specific language, let's say Hebrew -
We think that this behavior is the preferred one for most users, but
We have chosen the second way, making it configurable, since many users
You wrote: - A new panel will be created for language detection feature, and it
As we have in Vocalizer, I will suggest a combobox to select the voice to
You wrote: - We propose the following components to be added also:
I disagree, and suggest 4 options, including: switch languages using Unicode character properties only
This is because we found on the web a lot of pages coded as using English,
You wrote: If you want also to get some coding logic or GUI from Vocalizer
And, finally, one suggestion: why not use, after the language selection through the character set, one
I have tried several similar tools, and this one proved to be the fastest
With more than 4 words the results are almost perfect... And it is easy to use in NVDA. It can get only the most probable language
Here is a small add-on I made for testing: https://www.dropbox.com/s/3jesk88koae35sg/languageDetect_1.0_Gen.nvda-addon?dl=1
I could not fully understand the speech module to try to include this in
The commands are: NVDA+Shift+l": "getLang",
Sorry for writing in private, but I think it is more productive...
Best regards, Rui Fontes
Reported by ragb on 2013-02-13 12:26
This is kind of a spin-off of #279.
As settled some time ago, this proposal aims to implement automatic text “language” detection for NVDA. The main goal of this feature is for users to read text in different languages (or, better said, language families) using the proper synthesizer voices. By using Unicode character ranges, one can determine at least the language family of a chunk of text: Latin-based (English, German, Portuguese, Spanish, French, …), Cyrillic (Russian, Ukrainian, …), kanji (Japanese, maybe Korean; I had that written already, but it is too much for my memory), Greek, Arabic (Arabic, Farsi), and others.
In broad terms, implementing this feature in NVDA requires adding a detection module to the speech subsystem that intercepts speech commands and inserts “fake” language commands telling the synth to change language, based on changes in the text's characters. An interface is also needed for the user to tell NVDA which particular language to choose for each language family, that is, which language to assume for Latin-based characters, which for Arabic-based characters, etc.
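The detection pass described above might look roughly like this. Everything in this sketch is a hedged stand-in: `FAMILY_TO_LANG`, `script_of` and `detect_sequence` are hypothetical names, the (language, text) tuples stand in for NVDA's real speech commands, and deriving the family from Unicode character names is a shortcut for proper script tables.

```python
# Sketch of a detection pass: split text into runs by script family and
# attach a language to each run, standing in for inserting language
# change commands into NVDA's speech sequence.
import unicodedata

# User-configured concrete language per family (illustrative values,
# not NVDA's actual configuration).
FAMILY_TO_LANG = {"LATIN": "en", "CYRILLIC": "ru", "ARABIC": "ar", "GREEK": "el"}

def script_of(ch):
    # Derive a coarse family from the Unicode character name,
    # e.g. "CYRILLIC SMALL LETTER A" -> "CYRILLIC".
    name = unicodedata.name(ch, "")
    for family in FAMILY_TO_LANG:
        if name.startswith(family):
            return family
    return None  # digits, punctuation, unknown: keep current language

def detect_sequence(text, default_lang="en"):
    """Return (lang, run) pairs, one per detected language run."""
    sequence, current, run = [], default_lang, ""
    for ch in text:
        family = script_of(ch)
        lang = FAMILY_TO_LANG.get(family, current)
        if lang != current and run:
            sequence.append((current, run))
            run = ""
        current, run = lang, run + ch
    if run:
        sequence.append((current, run))
    return sequence
```

Note that digits and punctuation fall through to the current language, matching the defaulting behavior described below for general characters.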
I’ve implemented a prototype of this feature in a custom Vocalizer driver, with no interface to choose the “proper” language. Preliminary testing with Arabic users, using Arabic and English Vocalizer voices, has shown good results; that is, people like the idea. The language detection code was adapted from the Guess_language module, removing some of the detection code which was not applicable (tri-gram detection for differentiating Latin languages, for instance).
I’ll explain the decision to use, for now, only Unicode-based language detection. Language detection could also be done using trigrams (see here for instance), dictionaries, or other heuristics of that kind. However, the text that is passed to the synthesizer each time is very small (a line of text, a menu name, etc.), which makes these processes, which are probabilistic by nature, very error-prone. In my testing, applying trigram detection to Latin languages in NVDA proved completely unusable, besides adding a noticeable delay when speaking. For bigger text content (books, articles, etc.) it seems to work well; however, I don’t know if this can be applied somehow in the future, say by analyzing virtual buffers.
Regarding punctuation, digits, and other general characters, I’m defaulting to the current language (and voice) of the synth.
I’ll create a branch with my detection module integrated within NVDA, with no interface.
Regarding the interface for selecting which language to assume for each language group (when applicable; Greek, for instance, maps only to itself), I see a dialog with various combo boxes, one for each language family, to choose the language to be used. I think restricting the available language choices to the languages of the current synth may improve usability. I don’t know where to put that dialog, or what to call it (“language detection options”?).
Any questions please ask.
Regards,
Rui Batista
Blocked by #5427, #5438