【iOS10 SpeechRecognition】Best Practices for Speech Recognition: Transcribe as You Speak

Date: 2019-12-06

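First, I want to stress what "speech recognition" literally means as a requirement: the user speaks, and what they say is immediately converted into text and displayed. That is the feature developers actually need.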

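Before building this, I searched Google and Baidu for a ready-made wheel I could just use. The results were laughable. The articles all carried titles promising an in-depth study of this library. Inside, they called the API once to recognize a local audio file loaded from a URL, and that was it. They don't even meet the most basic requirement.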

Today I'll summarize two ways to implement this feature.

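There are two recognition request APIs, SFSpeechAudioBufferRecognitionRequest and SFSpeechURLRecognitionRequest. There are also two ways to receive results: block and delegate. I'll pair them up crosswise so that the two approaches together cover all four.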

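Before you start, register the user privacy permissions in Info.plist. Most of you already know this, but I'll mention it for completeness.

Privacy - Microphone Usage Description (NSMicrophoneUsageDescription)
Privacy - Speech Recognition Usage Description (NSSpeechRecognitionUsageDescription)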

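Then call requestAuthorization to request permission:

    [SFSpeechRecognizer requestAuthorization:^(SFSpeechRecognizerAuthorizationStatus status) {
        // Check the enum result; only ...Authorized permits recognition.
        // The handler may not run on the main thread: dispatch UI updates to the main queue.
        BOOL granted = (status == SFSpeechRecognizerAuthorizationStatusAuthorized);
        NSLog(@"speech recognition authorized: %d", granted);
    }];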

The microphone permission prompt also appears automatically the first time you start recording.

1. SFSpeechAudioBufferRecognitionRequest with a block

This approach breaks down into the following steps.

① Set up the audio engine

Add the following properties so the objects are easy to start, stop, and release:

@property(nonatomic,strong)SFSpeechRecognizer *bufferRec;
@property(nonatomic,strong)SFSpeechAudioBufferRecognitionRequest *bufferRequest;
@property(nonatomic,strong)SFSpeechRecognitionTask *bufferTask;
@property(nonatomic,strong)AVAudioEngine *bufferEngine;
@property(nonatomic,strong)AVAudioInputNode *buffeInputNode;

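I recommend initializing them in your start method, which keeps starting and stopping symmetric. If you prefer global instances, you can also initialize them just once.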

    self.bufferRec = [[SFSpeechRecognizer alloc]initWithLocale:[NSLocale localeWithLocaleIdentifier:@"zh_CN"]];
    self.bufferEngine = [[AVAudioEngine alloc]init];
    self.buffeInputNode = [self.bufferEngine inputNode];

② Create the speech recognition request

    self.bufferRequest = [[SFSpeechAudioBufferRecognitionRequest alloc]init];
    self.bufferRequest.shouldReportPartialResults = true;

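You can toggle shouldReportPartialResults to suit your needs. When it is off, the callback fires only once you finish a sentence. When it is on, it also fires for every partial fragment of speech.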
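With partial results on, each callback carries the whole transcription heard so far, and earlier words can still be revised. You may want to know which part of the on-screen text has stopped changing, for example to render it in a different color. One option is to compare two consecutive hypotheses. Below is a minimal sketch in plain C, which also compiles as Objective-C. `stable_prefix_len` is a hypothetical helper, not part of the Speech framework. You would feed it the `UTF8String` of two successive `formattedString` values.

```c
#include <stddef.h>

/* Hypothetical helper (not part of the Speech framework).
 * Returns the byte length of the longest common prefix of two UTF-8
 * strings, trimmed so it never ends in the middle of a multi-byte
 * character (Chinese characters are 3 bytes each in UTF-8). */
size_t stable_prefix_len(const char *prev, const char *cur) {
    size_t n = 0;
    while (prev[n] != '\0' && prev[n] == cur[n]) {
        n++;
    }
    /* A continuation byte (10xxxxxx) at position n means we stopped
     * inside a character: back off to that character's lead byte. */
    while (n > 0 && ((unsigned char)cur[n] & 0xC0) == 0x80) {
        n--;
    }
    return n;
}
```

The result is a byte count. To get the stable part back as an NSString, you could use something like `[[NSString alloc] initWithBytes:cur length:n encoding:NSUTF8StringEncoding]`.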

③ Create and run the task

    // Code outside the blocks is also setup: parameter initialization and so on
    self.bufferRequest = [[SFSpeechAudioBufferRecognitionRequest alloc]init];
    self.bufferRequest.shouldReportPartialResults = true;
    __weak ViewController *weakSelf = self;
    self.bufferTask = [self.bufferRec recognitionTaskWithRequest:self.bufferRequest resultHandler:^(SFSpeechRecognitionResult * _Nullable result, NSError * _Nullable error) {
            // Callback invoked when a result arrives (see step ④)
    }];
    
    // Install a tap on bus 0 of the input node and append every captured buffer to the request
    AVAudioFormat *format =[self.buffeInputNode outputFormatForBus:0];
    [self.buffeInputNode installTapOnBus:0 bufferSize:1024 format:format block:^(AVAudioPCMBuffer * _Nonnull buffer, AVAudioTime * _Nonnull when) {
        [weakSelf.bufferRequest appendAudioPCMBuffer:buffer];
    }];
    
    // Prepare and start the engine
    [self.bufferEngine prepare];
    NSError *error = nil;
    if (![self.bufferEngine startAndReturnError:&error]) {
        NSLog(@"%@",error.userInfo);
    }
    self.showBufferText.text = @"Waiting for command...";

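Both blocks are asynchronous callbacks, so the code outside them runs first. The normal startup sequence is:

1. The parameters are initialized and the engine is started.
2. The tap block that appends buffers is then called continuously.
3. Once enough audio has accumulated, the recognizer calls the resultHandler above with a result.

Sometimes the buffer callback fires even when nobody is speaking, but resultHandler is not called. Presumably the recognizer has some internal tolerance that ignores input whose level (power) is below a threshold.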
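Some numbers for perspective: each tap callback covers `bufferSize / sampleRate` seconds. At a 48 kHz input, 1024 frames is about 21 ms, or roughly 47 callbacks per second. Note that the bufferSize argument is only a request, and the system may deliver a different size. You might prefer to gate silent buffers yourself rather than rely on the recognizer's internal behavior. In that case, you could check each buffer's level inside the tap block before appending it.

The sketch below is plain C (valid Objective-C). Its assumptions:

- `tap_buffer_ms` and `buffer_is_loud` are hypothetical helpers.
- You would pass `buffer.floatChannelData[0]` and `buffer.frameLength`, assuming the input format is float PCM.
- The 0.1 amplitude threshold equals -20 dBFS.
- Dropping buffers changes the audio stream the recognizer sees, so test whether it helps.

```c
#include <stddef.h>

/* Milliseconds of audio covered by one tap callback. The bufferSize given
 * to installTapOnBus: is only a request, so treat this as an estimate. */
double tap_buffer_ms(unsigned int frames, double sample_rate) {
    return 1000.0 * (double)frames / sample_rate;
}

/* Hypothetical level gate: returns 1 if the RMS amplitude of the samples
 * reaches amp_threshold (linear, where 1.0 is full scale), else 0.
 * Comparing the mean square against threshold^2 avoids sqrt/log10. */
int buffer_is_loud(const float *samples, size_t count, double amp_threshold) {
    if (count == 0) {
        return 0;
    }
    double sum_sq = 0.0;
    for (size_t i = 0; i < count; i++) {
        sum_sq += (double)samples[i] * (double)samples[i];
    }
    double mean_sq = sum_sq / (double)count;
    return mean_sq >= amp_threshold * amp_threshold;
}
```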

④ The result callback

The result callback lives in the resultHandler block above. It receives two parameters, result and error, and you can act on the result there.

        if (result != nil) {
            // Each result carries the full transcription so far, so replace the text rather than append
            weakSelf.showBufferText.text = result.bestTranscription.formattedString;
        }
        if (error != nil) {
            NSLog(@"%@",error.userInfo);
        }

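Look at the properties of the result type, SFSpeechRecognitionResult. It holds the best transcription (bestTranscription) plus an array of alternative transcriptions (transcriptions). If you need precise matching, you should also run every alternative in that array through your filter.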
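Here is a minimal sketch of that filtering in plain C (valid Objective-C). It walks the candidates in order and returns the first one that contains the command phrase. In the app, the candidates would be each element of `result.transcriptions` converted with `formattedString.UTF8String`. Both `find_command` and the "contains" rule are assumptions for illustration. A real matcher might also strip punctuation or normalize text first.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical matcher: candidates are ordered best-first, like
 * result.transcriptions. Returns the index of the first candidate that
 * contains `command`, or -1 if none does. For valid UTF-8 input, a byte
 * match of the whole needle always lines up on character boundaries. */
int find_command(const char *const *candidates, size_t count, const char *command) {
    for (size_t i = 0; i < count; i++) {
        if (strstr(candidates[i], command) != NULL) {
            return (int)i;
        }
    }
    return -1;
}
```

In this design, a misheard best result (for example 打开登 instead of 打开灯) can still be rescued by a lower-ranked alternative.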

⑤ Stop listening

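    [self.bufferEngine stop];
    [self.buffeInputNode removeTapOnBus:0];
    // Tell the request that no more audio is coming, so the task can deliver its final result
    [self.bufferRequest endAudio];
    self.showBufferText.text = @"";
    self.bufferRequest = nil;
    self.bufferTask = nil;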

The bus argument (0 here) is the index of one of the node's connection points. You can think of it roughly as a port.

2. SFSpeechURLRecognitionRequest with a delegate

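The main difference between the two: the block approach is concise. The delegate leaves more room for custom requirements, because it exposes more callbacks across the result lifecycle.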

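The six methods below are self-explanatory from their names. One thing to note: the second method is called many times, while the third is called once, when an utterance is finished.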

// Called when the task first detects speech in the source audio
- (void)speechRecognitionDidDetectSpeech:(SFSpeechRecognitionTask *)task;

// Called for all recognitions, including non-final hypothesis
- (void)speechRecognitionTask:(SFSpeechRecognitionTask *)task didHypothesizeTranscription:(SFTranscription *)transcription;

// Called only for final recognitions of utterances. No more about the utterance will be reported
- (void)speechRecognitionTask:(SFSpeechRecognitionTask *)task didFinishRecognition:(SFSpeechRecognitionResult *)recognitionResult;

// Called when the task is no longer accepting new audio but may be finishing final processing
- (void)speechRecognitionTaskFinishedReadingAudio:(SFSpeechRecognitionTask *)task;

// Called when the task has been cancelled, either by client app, the user, or the system
- (void)speechRecognitionTaskWasCancelled:(SFSpeechRecognitionTask *)task;

// Called when recognition of all requested utterances is finished.
// If successfully is false, the error property of the task will contain error information
- (void)speechRecognitionTask:(SFSpeechRecognitionTask *)task didFinishSuccessfully:(BOOL)successfully;

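The idea behind this approach:

1. Build a recorder. It can be started and stopped manually, or it can be voice-activated, starting and stopping automatically based on volume like the Talking Tom app.
2. Save the recording to a local directory.
3. Read the file back with a URL request for recognition.

The steps break down as follows.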

① Set up the voice-activated recorder

It needs the following properties:

/** Recorder */
@property (nonatomic, strong) AVAudioRecorder *recorder;
/** Level monitor */
@property (nonatomic, strong) AVAudioRecorder *monitor;
/** URL of the recording file */
@property (nonatomic, strong) NSURL *recordURL;
/** URL of the monitor's file */
@property (nonatomic, strong) NSURL *monitorURL;
/** Timer */
@property (nonatomic, strong) NSTimer *timer;

Initializing the properties:

    // Recording settings
    NSDictionary *recordSettings = @{
        AVSampleRateKey: @14400.0f,
        AVFormatIDKey: @(kAudioFormatAppleIMA4),
        AVNumberOfChannelsKey: @2,
        AVEncoderAudioQualityKey: @(AVAudioQualityMax),
    };
    
    NSString *recordPath = [NSTemporaryDirectory() stringByAppendingPathComponent:@"record.caf"];
    _recordURL = [NSURL fileURLWithPath:recordPath];
    
    _recorder = [[AVAudioRecorder alloc] initWithURL:_recordURL settings:recordSettings error:NULL];
    
    // Level monitor: same settings, with metering enabled so the input level can be read
    NSString *monitorPath = [NSTemporaryDirectory() stringByAppendingPathComponent:@"monitor.caf"];
    _monitorURL = [NSURL fileURLWithPath:monitorPath];
    _monitor = [[AVAudioRecorder alloc] initWithURL:_monitorURL settings:recordSettings error:NULL];
    _monitor.meteringEnabled = YES;

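Don't fret over the constants in the settings dictionary. This is older code of mine that I copied over, and those settings simply request the highest audio quality.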
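To see what these settings cost in storage, here is the arithmetic. Apple IMA4 packs each channel into 34-byte packets of 64 frames, so the data rate is `sampleRate / 64 × 34 × channels` bytes per second. For the 14400 Hz stereo settings above, that is 225 packets/s × 68 bytes = 15,300 bytes/s, or about 0.9 MB per minute, ignoring the CAF header. The helper below is a hypothetical illustration in plain C (valid Objective-C).

```c
/* Approximate Apple IMA4 data rate in bytes per second.
 * IMA4 stores 64 frames per packet, using 34 bytes per channel
 * (a 2-byte header followed by 64 four-bit samples). */
double ima4_bytes_per_second(double sample_rate, unsigned int channels) {
    return sample_rate / 64.0 * 34.0 * (double)channels;
}
```

Speech seldom needs stereo. Dropping to one channel halves the figure, for example 16000 Hz mono gives 8,500 bytes/s.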

② Starting and stopping

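To start and stop based on volume, you need a monitor in addition to the recorder. The monitor reads the current level of the sound around the microphone through peakPowerForChannel:. You also need a timer that controls how often the level is checked. The code is roughly as follows: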

- (void)setupTimer {
    [self.monitor record];
    self.timer = [NSTimer scheduledTimerWithTimeInterval:0.1 target:self selector:@selector(updateTimer) userInfo:nil repeats:YES];
}

// Decide when to start and stop recording
- (void)updateTimer {

    // Meters must be refreshed before each read, otherwise the values are stale
    [self.monitor updateMeters];
    
    // Peak power of channel 0 in dB: -160.0 is complete silence, 0 is full scale
    float power = [self.monitor peakPowerForChannel:0];
    
    //        NSLog(@"%f", power);
    if (power > -20) {
        if (!self.recorder.isRecording) {
            NSLog(@"Start recording");
            [self.recorder record];
        }
    } else {
        if (self.recorder.isRecording) {
            NSLog(@"Stop recording");
            [self.recorder stop];
            [self recognition];
        }
    }
}
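One weakness of the timer logic above: a single 0.1 s tick below -20 dB stops the recording, so a short pause mid-sentence splits the utterance. For reference, -20 dB is 10% of full-scale amplitude, since 20·log10(0.1) = -20. A common remedy is hysteresis plus a "hangover": start above one level, and stop only after the level stays below a lower level for several consecutive ticks.

The sketch below is plain C (valid Objective-C). `VadGate`, `vad_update`, and the example numbers (-30 dB, a few ticks) are assumptions to tune, not values from the original code. In `updateTimer` you would pass `power`, then call `record` on `VAD_START`, and `stop` plus `recognition` on `VAD_STOP`.

```c
#include <stdbool.h>

/* Hypothetical voice-activity gate with hysteresis and a hangover.
 * Call vad_update once per timer tick with peakPowerForChannel:0. */
typedef struct {
    float start_db;    /* start recording when power rises above this, e.g. -20 */
    float stop_db;     /* a tick counts as quiet when power falls below this, e.g. -30 */
    int   hang_ticks;  /* consecutive quiet ticks required before stopping */
    int   quiet_count; /* consecutive quiet ticks seen so far */
    bool  recording;   /* current state of the gate */
} VadGate;

typedef enum { VAD_NONE, VAD_START, VAD_STOP } VadEvent;

VadEvent vad_update(VadGate *g, float power_db) {
    if (!g->recording) {
        if (power_db > g->start_db) {
            g->recording = true;
            g->quiet_count = 0;
            return VAD_START;
        }
        return VAD_NONE;
    }
    if (power_db < g->stop_db) {
        /* Quiet tick: stop only after hang_ticks of them in a row. */
        if (++g->quiet_count >= g->hang_ticks) {
            g->recording = false;
            g->quiet_count = 0;
            return VAD_STOP;
        }
    } else {
        /* Any tick at or above stop_db resets the hangover counter. */
        g->quiet_count = 0;
    }
    return VAD_NONE;
}
```

With a 0.1 s timer, `hang_ticks = 5` tolerates pauses of up to about half a second. Levels between `stop_db` and `start_db` keep an active recording going but never start a new one.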

③ The recognition request

- (void)recognition {
    // Stop the timer
    [self.timer invalidate];
    // Stop the monitor as well
    [self.monitor stop];
    // Delete the monitor's recording file
    [self.monitor deleteRecording];
    
    // Create the speech recognizer
    SFSpeechRecognizer *rec = [[SFSpeechRecognizer alloc]initWithLocale:[NSLocale localeWithLocaleIdentifier:@"zh_CN"]];
    // For English, use e.g. [NSLocale localeWithLocaleIdentifier:@"en_US"] instead
    
    // Recognize from a local audio file
    SFSpeechRecognitionRequest * request = [[SFSpeechURLRecognitionRequest alloc]initWithURL:_recordURL];
    [rec recognitionTaskWithRequest:request delegate:self];
}

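This snippet recognizes a local file and turns it into Chinese text. It is probably the most widely circulated one online, because you can write it without thinking. On its own, though, it is basically useless. Apart from WeChat's long-press feature that converts a voice message to text, I honestly can't think of an app that would parse a local audio file directly. Auto-generating lyrics for MP3s? Jay Chou's songs would be tough to parse, and recognition is limited to about one minute of audio anyway.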

④ Delegate method for the result

- (void)speechRecognitionTask:(SFSpeechRecognitionTask *)task didFinishRecognition:(SFSpeechRecognitionResult *)recognitionResult
{
    NSLog(@"%s",__FUNCTION__);
    NSLog(@"%@",recognitionResult.bestTranscription.formattedString);
    [self setupTimer];
}

This is the method you'll use most. Callbacks for other stages can be added as needed. This is only a simple demonstration; my demo project has more features:

https://github.com/dsxNiubility/SXSpeechRecognitionTwoWays

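iOS 10 made a big leap in speech-related features, mainly in two respects. One is the speech recognition covered above. The other is SiriKit, which can pass external information through to an app for handling. For now its limitations are obvious. It supports only the domains listed on Apple's site, such as ride booking and messaging, and can't even handle a request like "open Meituan and search for grilled-fish restaurants". So there isn't much more to dig into yet; let's wait for Apple's future updates.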
