Configuring Apple Vision Pro's microphones to effectively pick up other speakers' voices

Question

Created Jul ’24

I am developing a visionOS app that captions speech in real-world environments. Currently, I am using Apple's built-in speech recognizer. However, when I tested the app on a Vision Pro, the device seemed to pick up only the voice of the person wearing it. For example, while the speech recognition task is running and another person in front of me is talking, the system does not pick up their speech well.

I tried to set the AVAudioSession to be equally sensitive to all directions:

private func configureAudioSession() {
    do {
        // Record-only session; .measurement mode minimizes system-supplied signal processing.
        try audioSession.setCategory(.record, mode: .measurement)
        try audioSession.setActive(true)
        if #available(visionOS 1.0, *) {
            // Look on the first available input for a data source whose preferred polar pattern is omnidirectional.
            let availableDataSources = audioSession.availableInputs?.first?.dataSources
            if let omniDirectionalSource = availableDataSources?.first(where: { $0.preferredPolarPattern == .omnidirectional }) {
                try audioSession.setInputDataSource(omniDirectionalSource)
            }
        }
    } catch {
        print("Failed to set up audio session: \(error)")
    }
}
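
To check whether the session exposes any selectable data sources or polar patterns at all, a minimal diagnostic sketch could look like the following. The helper name logInputDataSources is hypothetical and not part of the app; it only assumes the audio session has already been configured and activated as in configureAudioSession, and it uses the standard AVAudioSession port and data source properties.

```swift
import AVFoundation

// Hypothetical helper (not part of the app above): logs each available input,
// its data sources, and the polar patterns that each data source reports.
func logInputDataSources(of session: AVAudioSession = AVAudioSession.sharedInstance()) {
    guard let inputs = session.availableInputs, !inputs.isEmpty else {
        print("No audio inputs available")
        return
    }
    for input in inputs {
        print("Input: \(input.portName) (\(input.portType.rawValue))")
        guard let sources = input.dataSources, !sources.isEmpty else {
            print("  No selectable data sources")
            continue
        }
        for source in sources {
            // supportedPolarPatterns is nil when the data source has no selectable patterns.
            let supported = source.supportedPolarPatterns?.map(\.rawValue) ?? []
            let selected = source.selectedPolarPattern?.rawValue ?? "none"
            print("  \(source.dataSourceName): supported=\(supported), selected=\(selected)")
        }
    }
}
```

If this prints no data sources for the first input, the `if let` in configureAudioSession never matches and setInputDataSource is never called.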

Here is how I set up speech recognition and configure the microphone input:

private func startSpeechRecognition(completion: @escaping (String) -> Void) {
    do {
        // Cancel the previous task if it's running.
        if let recognitionTask = recognitionTask {
            recognitionTask.cancel()
            self.recognitionTask = nil
        }

        // The audio session is already active; get the input node and turn off voice processing.
        let inputNode = audioEngine.inputNode
        try inputNode.setVoiceProcessingEnabled(false)

        // Create and configure the speech recognition request.
        recognitionRequest = SFSpeechAudioBufferRecognitionRequest()
        guard let recognitionRequest = recognitionRequest else { fatalError("Unable to create a recognition request") }
        recognitionRequest.shouldReportPartialResults = true

        // Keep speech recognition data on device.
        if #available(iOS 13, *) {
            recognitionRequest.requiresOnDeviceRecognition = true
        }

        // Create a recognition task for the speech recognition session.
        // Keep a reference to the task so that it can be canceled.
        recognitionTask = speechRecognizer?.recognitionTask(with: recognitionRequest) { result, error in
            if let result = result {
                // Report the latest transcription (partial or final).
                completion(result.bestTranscription.formattedString)
            } else if let error = error {
                completion("Recognition error: \(error.localizedDescription)")
            }

            if error != nil || result?.isFinal == true {
                // Stop the engine and clean up on error or once the final result arrives.
                self.audioEngine.stop()
                inputNode.removeTap(onBus: 0)
                self.recognitionRequest = nil
                self.recognitionTask = nil
            }
        }

        // Configure the microphone input: feed captured buffers to the recognition request.
        let recordingFormat = inputNode.outputFormat(forBus: 0)
        inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { (buffer, when) in
            self.recognitionRequest?.append(buffer)
        }

        audioEngine.prepare()
        try audioEngine.start()
    } catch {
        completion("Audio engine could not start: \(error.localizedDescription)")
    }
}
