SFSpeechRecognitionResult discards previous transcripts with on-device option set to true

Hi everyone, I could use some help with on-device recognition. It seems that the speech recognition task discards whatever it has transcribed once a new sentence starts (or once it believes a new sentence has started) during a single audio session, when requiresOnDeviceRecognition is set to true.

This doesn't happen with requiresOnDeviceRecognition set to false.
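For reference, here is a minimal sketch of the kind of setup involved (illustrative only; authorization, audio-session configuration, and error handling are omitted):

import Speech
import AVFoundation

let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))!
let request = SFSpeechAudioBufferRecognitionRequest()
request.shouldReportPartialResults = true
request.requiresOnDeviceRecognition = true   // the setting this report is about

let audioEngine = AVAudioEngine()
let inputNode = audioEngine.inputNode
inputNode.installTap(onBus: 0, bufferSize: 1024,
                     format: inputNode.outputFormat(forBus: 0)) { buffer, _ in
    request.append(buffer)
}

// Keep a reference so the task isn't deallocated.
let task = recognizer.recognitionTask(with: request) { result, _ in
    if let result {
        // With requiresOnDeviceRecognition == true, this string can suddenly
        // lose everything transcribed before the most recent pause.
        print(result.bestTranscription.formattedString)
    }
}

audioEngine.prepare()
try? audioEngine.start()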

System environment: macOS 14 with Xcode 15, deploying to iOS 17

Thank you all!

Did some further research into this topic. It seems that the result's SFTranscriptionSegment array gets cleared once certain conditions are met.

By doing

if let result {
    // Log how many segments the current best transcription contains.
    print("Segment.size:  \(result.bestTranscription.segments.count)")
    // ... rest of the code
}

At the moment the contents get cleared, it prints this to the console:

...
Segment.size:  108
Segment.size:  110
Segment.size:  111
Segment.size:  112
Segment.size:  112
Segment.size:  114
Segment.size:  105
Segment.size:  1
Segment.size:  2
Segment.size:  3
Segment.size:  4
Segment.size:  5
Segment.size:  6
...

The question is how to determine which result will be the final one. By the way, the isFinal property of the result never turns true while the contents are being discarded.

Hi, did you find a way to resolve this weird behaviour?

@tom63001 You can check the segments’ timestamp and duration properties and use that as a proxy for whether what you’ve received is final.
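To illustrate that suggestion, here is a rough sketch of the check (the assumption that volatile, not-yet-settled segments report zero timestamp/duration is exactly that, an assumption, not documented behavior):

import Speech

// Treat only segments with populated timing as "settled" text.
// Assumption: volatile partial results often carry timestamp/duration of 0.
func settledText(from transcription: SFTranscription) -> String {
    let settled = transcription.segments.filter { $0.timestamp > 0 || $0.duration > 0 }
    return settled.map(\.substring).joined(separator: " ")
}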

Unfortunately this is all even more broken on iOS 18, where the bestTranscription just gets randomly erased after every pause in speech...

(Context: iPhone 12 running iOS 17.6.1, Xcode Version 15.4 (15F31d))

I see exactly what the OP above reported when running the Apple SpokenWord sample with no changes. However, changing this one line from true to false fixes the problem:
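(The line in question is presumably the requiresOnDeviceRecognition assignment in the sample's recognition setup, i.e. something like:)

recognitionRequest.requiresOnDeviceRecognition = true   // changing this to false avoids the loss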

I'm fine with the quality of the recognition differing between local and remote (presumably because the cloud might be better), but this isn't that; this feels very broken. Valid, recognized text is simply being thrown away after brief speaking pauses in the local-required case but not in the local-not-required case. In addition, when the flag is set to false (local recognition not required), the workaround still fixes it even when I have completely disabled all network connectivity on the iPhone (i.e. it cannot make a remote call, and the recognition is, by definition, being done locally).

Other notes of potential interest:

  • even if the workaround fixes it, part of my requirement is that it always works whether remote calls are possible or not; hence why I set the flag to require local recognition to true in the first place.
  • as reported above, the isFinal flag is never set to true while the earlier results are being discarded
  • I'm hearing that iOS 18 is even worse, specifically that setting requiresOnDeviceRecognition to false does not help as a workaround. I have not yet verified this on iOS 18 because it is in beta at this time.

Example to repro the bug:

[with requiresOnDeviceRecognition = true] speaking "add 1+2+3+4+ (go as long as you want with no brief pauses)" results in exactly what was spoken. Doing the same with a brief pause followed by "5+6" results in all text preceding "5+6" being thrown away. By "brief pause" I mean 1 1/2 to 2 seconds.

[with requiresOnDeviceRecognition = false] speaking the exact same as above with a pause as long as 2 minutes (maybe longer - I stopped testing at 2 mins) before adding "5+6" results in the full spoken text being returned (i.e. the result contains "add 1+2+3+4+5+6"). Again, this works even if iPhone networking is completely disabled.

Quick update now that iOS 18 is released rather than in beta.

The nice workaround I documented above of setting requiresOnDeviceRecognition to false no longer works. As of iOS 18, the loss of recognized words after a brief pause always happens regardless of whether that flag is set to true or false.

Apple folks: It would be nice to hear back from you on this. Do you concur that this reported behavior is a bug? Or, if by design, is there a recommended approach for coping with it?

I shared via Dropbox (should be publicly accessible) a quick video from my iPhone illustrating the issue:

https://www.dropbox.com/scl/fi/ci16tz76q9trxsuv1k1dx/audioBug.MP4?rlkey=pkywy8hanqasxya5myca3ezq4&e=1&dl=0

Hello, since iOS 18 I have had the same problem: when I make longer pauses while speaking, SFSpeechRecognitionResult somehow forgets the transcriptions from before the pause. I have found a workaround, which I described in a Stack Overflow post: https://stackoverflow.com/questions/79005416/sfspeechrecognitionresult-discards-previous-transcripts-when-making-long-break/79005417#79005417
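In case the link goes stale, the general shape of such a workaround looks roughly like the sketch below (an illustration of one possible approach, not necessarily the exact code from that post): detect when the segment array suddenly shrinks and stash the text accumulated so far.

import Speech

// Sketch: preserve text that the recognizer silently discards after a pause.
final class TranscriptAccumulator {
    private var stashed = ""          // text preserved from before any reset
    private var lastText = ""         // most recent full transcription seen
    private var lastSegmentCount = 0

    // Call from the recognition task's result handler; returns the combined text.
    func handle(_ result: SFSpeechRecognitionResult) -> String {
        let transcription = result.bestTranscription
        // A sudden drop in segment count means earlier text was discarded.
        if transcription.segments.count < lastSegmentCount, !lastText.isEmpty {
            stashed = stashed.isEmpty ? lastText : stashed + " " + lastText
        }
        lastSegmentCount = transcription.segments.count
        lastText = transcription.formattedString
        return stashed.isEmpty ? lastText : stashed + " " + lastText
    }
}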

I have the same issue.

Same issue on iOS 18
