
Refactor speech to support callbacks, speech commands in control/format fields, priority output, etc. #4877

Closed
nvaccessAuto opened this issue Feb 3, 2015 · 19 comments
Labels
component/speech enhancement p4 https://github.com/nvaccess/nvda/blob/master/projectDocs/issues/triage.md#priority
Comments

@nvaccessAuto

Reported by jteh on 2015-02-03 00:56
A lot of potentially nice functionality is not possible (or is at least ridiculously painful) with our current speech framework. Speech needs a pretty big refactor to allow for such things. It needs to support:

  • Callbacks which are called at a requested point in the speech output
  • Callbacks which are called when a given synth finishes speaking and when overall speech has stopped
  • Speech commands in control/format field speech so that control and formatting info can be indicated by things other than just text
  • Priority output so that important messages can interrupt what is being spoken and/or be spoken after the next utterance without losing lower priority utterances already sent

Unfortunately, some of this is going to break backwards compatibility, but I think it's worth it in this case.
Blocking #905, #3188, #3286, #3736, #4089, #4233, #4433, #4874, #4966, #5026, #5104
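To make the bullet points above concrete, here is a minimal, self-contained sketch of what a richer speech sequence could look like. All names in it (SpeechCommand, CallbackCommand, BeepCommand, Priority, speak) are hypothetical illustrations for this ticket, not NVDA's actual API.

```python
from enum import IntEnum


class SpeechCommand:
    """Base class for non-text items in a speech sequence."""


class CallbackCommand(SpeechCommand):
    """Calls a function when speech output reaches this point in the sequence."""
    def __init__(self, callback):
        self.callback = callback


class BeepCommand(SpeechCommand):
    """Indicates a control or formatting change with a sound instead of text."""
    def __init__(self, hz, lengthMs):
        self.hz = hz
        self.lengthMs = lengthMs


class Priority(IntEnum):
    NORMAL = 0  # queued after everything already pending
    NEXT = 1    # spoken after the current utterance finishes
    NOW = 2     # interrupts now; lower priority utterances resume afterwards


def speak(sequence, priority=Priority.NORMAL):
    # Placeholder: a real implementation would queue the sequence, send text
    # to the synth and invoke commands as the synth reaches their positions.
    print(priority.name, sequence)


# Announce a link with a beep rather than the word "link", then run a
# callback once the text before it has actually been spoken.
speak(
    ["Read the", BeepCommand(880, 50), "documentation",
     CallbackCommand(lambda: print("finished speaking"))],
    priority=Priority.NORMAL,
)
```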

@nvaccessAuto
Author

Comment 2 by camlorn on 2015-02-04 17:37
+infinity. Glad to see this is at least a ticket now. I'll give input where and as I can; at the moment this is too ill-defined for me to really comment beyond suggesting you get a time machine so we can have this yesterday. If only.

@nvaccessAuto
Author

Comment 3 by leonarddr on 2015-02-06 16:35
I wonder, is #914 something which could be involved in this ticket?

@nvaccessAuto
Author

Comment 4 by camlorn on 2015-02-24 17:53
I've been thinking about this some.
I think we need to refactor how mixing works. There are two cases in which I would want to play a sound in parallel with speech.
The first is when NVDA says a fixed string, so basically anywhere that's not say all. In this case, I need to be allowed to potentially lengthen the buffer: if the speech takes less time than the sounds, simple addition will not work. Preparing the buffer as one chunk and passing it through the add-on with tags for specific sample ranges should be sufficient for this case, but may cause processing issues for larger chunks.
The second is say all or other "streaming" situations, and this is the harder one. Ideally, something like Unspoken can work in parallel with say all. But when you're applying filters, they have tails of a few hundred samples, not to mention the preceding situation. In this case, I don't want to lengthen the buffer, for the simple reason that it is not semantically separated into logical chunks.
And the problem with playing in parallel: variable latencies mean I won't actually be tightly aligned with the speech anymore.
I'm not sure how to fix these. I think we need to decide what kinds of manipulation we want to allow and disallow. I've got enough knowledge to talk about what we can potentially do to NVWave or something that sits a level above it, and in the worst case we allow add-ons to monkeypatch through a blessed interface or something. Obviously I want Unspoken to work in say all and other places where the object is said but not focused, but beyond that I'm open. Nevertheless, I do know most of the algorithms at this point, and I think we should start pinning down the capabilities. Ideally we can get input from more than just me, but I'm not sure anyone else is working on add-ons similar to Unspoken.
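A minimal sketch of the first case above (fixed-string speech), assuming plain Python lists of float samples at the same rate and channel count: pad whichever buffer is shorter so that straightforward sample addition works, lengthening the result if the sound outlasts the speech. This is purely illustrative, not NVDA or Unspoken code.

```python
def mix_with_padding(speech, sound, offset=0):
    """Mix `sound` into `speech` starting at `offset` samples, lengthening the
    result if the sound runs past the end of the speech."""
    total = max(len(speech), offset + len(sound))
    mixed = speech + [0.0] * (total - len(speech))  # lengthen the buffer
    for i, sample in enumerate(sound):
        mixed[offset + i] += sample
    return mixed


speech = [0.1] * 100  # a short fixed string, e.g. "button"
sound = [0.5] * 250   # a longer earcon
out = mix_with_padding(speech, sound, offset=40)
print(len(out))       # 290: the buffer was lengthened to fit the sound
```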

@nvaccessAuto
Author

Comment 5 by jteh (in reply to comment 4) on 2015-02-24 22:51
Replying to camlorn:

I think we need to refactor how mixing works. There's two cases in which I would want to play a sound in parallel with speech.

What you're suggesting (sample range tagging, etc.) requires that the speech framework has intimate knowledge of audio output. The biggest problem with this is that not all synths output audio through NVDA, so this actually isn't possible.

One approach which should cover at least some of your needs is that we allow a callback to specify that it wants to suspend further utterances until it is complete, at which point it will request that speech resume. One major problem with this is that a broken add-on can very easily break speech quite badly (more easily than it can already), so maybe we need to allow a maximum timeout for this or something.
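A rough sketch of that suspend-until-complete callback with a watchdog timeout, so a broken add-on can only stall speech for a bounded time. All names here are hypothetical, not part of NVDA.

```python
import threading


class SuspendableSpeechQueue:
    def __init__(self, maxSuspendSecs=2.0):
        self.maxSuspendSecs = maxSuspendSecs
        self._resumeEvent = threading.Event()
        self._resumeEvent.set()  # not suspended initially

    def suspend(self):
        """Called by a callback that wants to delay further utterances."""
        self._resumeEvent.clear()
        # Watchdog: resume automatically if the add-on never calls resume().
        threading.Timer(self.maxSuspendSecs, self._resumeEvent.set).start()

    def resume(self):
        self._resumeEvent.set()

    def speakNext(self, utterance):
        # Wait (up to the timeout) before sending the next utterance.
        self._resumeEvent.wait(timeout=self.maxSuspendSecs)
        print("speaking:", utterance)


q = SuspendableSpeechQueue()
q.suspend()                              # an add-on starts playing a sound
threading.Timer(0.5, q.resume).start()   # ...and resumes when it finishes
q.speakNext("next utterance")            # spoken once resumed, or after the timeout
```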

@nvaccessAuto
Author

Comment 6 by camlorn on 2015-02-25 03:06
How common is not playing through NVDA? Personally, I'd have no problem saying "either you give me the audio and semantic sample tagging or my add-on doesn't work" if that means I'm getting very accurate, non-drifting playback during say all, and I'm still not entirely convinced that applying filters directly to speech is a bad idea when possible. Bump the volume for bold, use chorus for underline, I don't know. I wish I could experiment with this now so that we could know whether this is super valuable or just me being me. Unfortunately, I don't feel confident enough in the quality of eSpeak's code to hack extra toggleable filters into it without it becoming a pretty hefty project.
And doesn't going through other things break NVDA's device selection? Also, what can't we get audio out of? This breaks backward compatibility anyway, and if there's nothing you can't reasonably request samples from, why not break it in that way too?
The callback would help with some, I think. In the common case, sounds are short. But the add-on will still need to know if it's a say all situation, and possibly if it needs to abort the sound instead. If the utterance is because I pressed something, it needs to bail gracefully. I think that timeouts should not be allowed here or at least set to a significantly large value; if you are an add-on developer using this feature, then you need to be aware that it is dangerous and treat it accordingly. It is worth noting that if I'm playing in parallel because there's no choice, then playing things that overlap slightly is also pretty trivial. If Libaudioverse is the backend (it is for the unreleased version of Unspoken), it's always playing anyway.
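A purely illustrative sketch of the "bump volume for bold" idea from the previous comment: apply a gain to sample ranges that a (hypothetical) synth has tagged as bold text. Nothing here corresponds to real NVDA or eSpeak code.

```python
def apply_bold_gain(samples, bold_ranges, gain=1.5):
    """samples: list of floats; bold_ranges: list of (start, end) sample indexes."""
    out = list(samples)
    for start, end in bold_ranges:
        for i in range(start, min(end, len(out))):
            out[i] *= gain
    return out


samples = [0.2] * 1000
louder = apply_bold_gain(samples, bold_ranges=[(200, 400)])
print(louder[300], louder[500])  # ~0.3 inside the bold range, 0.2 outside
```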

@nvaccessAuto
Author

Comment 7 by jteh on 2015-02-25 05:34
We still use direct output from SAPI4, SAPI5 and Audiologic. I believe the Festival and Acapela drivers do also. And no, this doesn't break device selection; the drivers just handle the initialisation themselves. There are also some who still want external synths. We can get samples from SAPI4 and SAPI5 and probably will do so in future, but the point is that we aren't going to drop support for this.

Aside from that, the stuff I'm working on relates to how the speech framework processes utterances, passes them to synths and calls callbacks. The synths still generate the samples after they receive the utterance. Therefore, I don't see how you could do sample tagging at this level anyway. You can have the synth fire callbacks when it outputs an index, which is what I plan to do.
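A sketch of that index-based approach, assuming hypothetical names: the framework allocates an index per callback, and the synth driver reports indexes as its audio output passes them.

```python
class IndexDispatcher:
    def __init__(self):
        self._callbacks = {}
        self._nextIndex = 0

    def registerCallback(self, callback):
        """Allocate an index for a callback command and remember its callback."""
        self._nextIndex += 1
        self._callbacks[self._nextIndex] = callback
        return self._nextIndex

    def onIndexReached(self, index):
        """Called by the synth driver when its output passes the given index."""
        callback = self._callbacks.pop(index, None)
        if callback:
            callback()


dispatcher = IndexDispatcher()
i = dispatcher.registerCallback(lambda: print("play sound here"))
# ... the utterance is sent to the synth with index i embedded in it ...
dispatcher.onIndexReached(i)  # fired from the synth's index callback
```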

@nvaccessAuto
Author

Comment 8 by camlorn on 2015-02-26 22:08
I get it, I'm just not thrilled. Things can be done without sample-accurate playback and, thinking about it more, I think that a callback to delay speech won't help much either. But it just seems like we could do a lot here, especially as we move forward and external synths finally, finally begin to finish dying.
If you use the synth's callbacks, doesn't it usually give you sample indexes? Even if it doesn't, you can still get this information by gathering samples in chunks of, say, 64: if a callback triggers, tag that chunk. I get that we can't have this, at least without custom synth drivers or something, but it's certainly possible. I'm personally not averse to having features of add-ons that depend on synth drivers supporting certain things, but maybe I'm being too ambitious.
Also, if the synth's latency is too high, there's no way to sync at all. SAPI is a pretty bad offender for this, at least from the few times I've used it.
How high level will the new speech commands be? Am I still monkeypatching stuff, or do we have semantic stuff like "is saying object" or something? Higher level, hookable speech commands would be nice; this would let me move my add-on into the synth itself if I wanted, for example.
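A sketch of the chunk-tagging idea mentioned above, entirely hypothetical: gather the synth's output in fixed-size chunks and tag the chunk during which an index was reported, giving an add-on approximate positions even without exact sample indexes.

```python
CHUNK = 64


def chunk_and_tag(samples, indexes_at):
    """indexes_at maps a sample position -> index number reported by the synth.
    Returns a list of (chunk, tags) pairs."""
    chunks = []
    for start in range(0, len(samples), CHUNK):
        chunk = samples[start:start + CHUNK]
        tags = [idx for pos, idx in indexes_at.items()
                if start <= pos < start + CHUNK]
        chunks.append((chunk, tags))
    return chunks


samples = [0.0] * 256
tagged = chunk_and_tag(samples, {70: 1, 200: 2})
print([tags for _, tags in tagged])  # [[], [1], [], [2]]
```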

@nvaccessAuto
Author

Comment 11 by camlorn on 2015-03-18 13:48
I had a thought on the syncing issue that might actually be workable. This might also be overly complicated, and maybe it's something we can do after this refactor if it proves necessary. It also has some unaddressed details, but I've been thinking about it for a few days and I can't see specific downsides that would make it unworkable.
First, include the Speex resampler. This may have other benefits, especially if integrated into NVWave; investigating the parts that don't apply to this ticket is on my to-do list. Waveout uses a linear resampler, and those aren't exactly good for large jumps. It's a couple of C source files that can be integrated pretty easily. This gives synths which wish to support the next part a convenient way to upsample their output to 44.1 kHz.
Second, either include a "play stereo samples" command or allow callbacks in whatever architecture exists to return samples they wish played. I like the former because it's the most common thing I think people are going to want to do, and there's no guarantee that the synth has to process it in real time.
Third, implement a background thread that can play these samples when they are passed over a queue. This isn't exactly as hard as it sounds, though it might involve two threads depending on the implementation; I can provide this code if we go this route. Any synth that doesn't want to or can't support highly accurate synchronisation can then delegate to this thread in the same way we would if we were getting callbacks. It might even be possible to map this onto a callback command for those synths that don't want to deal with it.
Finally, add the ability to splice the audio directly into the speech stream before playback for whatever synths we can handle inside NVDA itself. This involves upsampling to 44.1 kHz or thereabouts; 44.1 kHz is the most common rate and is sufficiently high for anything we'd want to do.
I figure this is at least a starting point, though I will admit that knowing how much of a problem differences in latency are will have to wait until we have the ability to play sound during say all and the like.
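A sketch of the background playback thread described above: buffers of stereo samples are pushed onto a queue and a worker thread plays them as they arrive. `play_buffer` is a stand-in for whatever actually writes to the audio device (NVWave or similar); all of this is illustrative rather than a proposed implementation.

```python
import queue
import threading


def play_buffer(samples):
    print("playing %d samples" % len(samples))  # placeholder for real audio output


def playback_worker(q):
    while True:
        samples = q.get()
        if samples is None:  # sentinel used to shut the thread down
            break
        play_buffer(samples)


sampleQueue = queue.Queue()
worker = threading.Thread(target=playback_worker, args=(sampleQueue,), daemon=True)
worker.start()

# A synth driver or a "play stereo samples" command would do something like:
sampleQueue.put([0.0] * 4410)  # ~50 ms of 44.1 kHz stereo-interleaved silence
sampleQueue.put(None)          # shut down when finished
worker.join()
```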


@ahicks92
Contributor

I thought of another possible use case, though I'm not sure how useful it would be: an option to monitor things in the background. I'm thinking at least ARIA live regions and the controller API, but it might also be useful to tag terminals for watching.
As an example, I'm currently programming something where I need to have two terminals open, one running a client and the other running a server. If I could tag them both with separate voices and monitor them even when they don't have focus, that might be very useful. It could also be horribly confusing, mind you, and it would probably help if this also included panning. But it's an interesting thought and maybe something to put on the "we want an add-on to be able to..." list at least. I don't think you could use it instantly, but I think you could train yourself to it if you tried a bit.
There is an old demo somewhere of a prototype system that did something like this for other stuff as well, speaking labels out of one speaker and control types out of the other in parallel. I wish I had a link to it. I'll see if I can find it, but I don't even know what to search for.
But we're still collecting use cases, so I figured I'd throw this out there anyway. I'm sure it has problems I haven't thought of, etc, but it's something to put on the list.

@jcsteh jcsteh removed this from the next milestone Jun 24, 2016
@jcsteh jcsteh added the p4 https://github.com/nvaccess/nvda/blob/master/projectDocs/issues/triage.md#priority label Jul 1, 2016
@ahicks92
Contributor

Got one more.

On the web, we currently don't read the title attribute when moving by arrows. I think this also applies to aria-label and maybe some other things. With this refactor, it would be possible to indicate these attributes efficiently, so that a user could know to check them in one way or another.

As it stands, we would have to say something like "has title" or just read it every time. Since this can be used on things like abbreviations, that is possibly less than useful.

There's already an issue about these not being read, but I don't remember which one at the moment; I just remember that it has a weird title and that I'm on the thread.

@amangano-edx

I posted on bugzilla about an issue with NVDA (and JAWS) on Firefox where the 'Remember Password' popup interrupts other aria alerts on the page (https://bugzilla.mozilla.org/show_bug.cgi?id=1323070). It was determined that this was a problem with the screen readers not being able to handle multiple alerts at the same time. Would the changes proposed here fix this issue?

@ahicks92
Contributor

It wouldn't, I don't think. It would allow cool experimental things like saying them all at the same time in different voices and maybe panning them across the sound field or something, but the core issue is probably about queuing things to be spoken, and we should already be able to do that.

@bhavyashah

List of tickets which may be related to, blocked by or dependent on this ticket:
#4966
#5026
#5096
#5104
#3286
#5638
#5862
#1229
#6360
#3564
#4738
#6685
#6688
NVDARemote/NVDARemote#110
#4433
#4629
#310
#7274
#3772
#3493
#4874
#4661
#2590
#2670
#3807
#279
#905
#847
#7427
#3188
#3736
#4086
#7594
#1398
#4089
#4233

@Adriani90
Collaborator

@michaelDCurran I propose to leave this issue open, because many issues referenced here contain use cases for this feature. After they are solved and closed, I propose closing this one as well.

@LeonarddeR
Collaborator

LeonarddeR commented Jul 17, 2019

I've looked at the initial description of this issue, and I think it is fully covered by #7599. Leaving this issue open until all the mentioned issues are solved, as @Adriani90 suggested, will probably mean that it stays open forever. Furthermore, there's nothing holding you back from referring to this issue to find the use cases for the framework grouped together.

Closing it.
