Microsoft Cognitive Services is a new set of cloud APIs that can be used from practically any technical platform (Android apps, iOS apps, Windows apps, websites, etc.) to leverage capabilities such as natural language processing, speech, vision and knowledge exploration. These tasks, like many in the field of artificial intelligence, are easy for humans but difficult for computers. Now you can leverage the results of Microsoft Research, backed by powerful cloud computing, to bring these capabilities to your own applications.
Bing Speech API
The Bing Speech API offers the capabilities to:
- Convert speech audio to text
- Convert text to speech audio
This blog post describes how to build an app that converts speech audio to text.
You can try the speech recognition interactively on the Bing Speech API webpage.
Getting an API key
To use the Microsoft Cognitive Services APIs you need a suitable API key. Keys are requested from the Azure Portal: click New -> Intelligence & Analytics -> Cognitive Services APIs.
Choose Bing Speech API as the API type, enter an account name and complete the remaining fields. The free pricing tier F0 is sufficient for testing.
Your account should be created within a few minutes. Open it from the All resources list, select Keys and copy either of the two keys.
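Before calling the speech endpoint, the subscription key is exchanged for a short-lived access token. As a rough sketch of that step, here is how it could look in Python using only the standard library; the `issueToken` endpoint and the `Ocp-Apim-Subscription-Key` header are as documented at the time of writing, so verify them against the current documentation:

```python
import urllib.request

# Token endpoint for Cognitive Services (as documented at the time of writing).
TOKEN_URL = "https://api.cognitive.microsoft.com/sts/v1.0/issueToken"

def build_token_request(subscription_key):
    """Build the POST request that exchanges a subscription key for an access token."""
    return urllib.request.Request(
        TOKEN_URL,
        data=b"",  # empty body; the key travels in the header
        headers={"Ocp-Apim-Subscription-Key": subscription_key},
        method="POST",
    )

def fetch_token(subscription_key):
    """Perform the exchange; the response body is a bearer token with a limited lifetime."""
    with urllib.request.urlopen(build_token_request(subscription_key)) as resp:
        return resp.read().decode("utf-8")
```

The returned token is then sent as an `Authorization: Bearer` header on the actual recognition requests.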
Accessing the API
There are three ways of accessing the Bing Speech API:
- REST API. This is a standard REST API that can be accessed from any technical platform. However, it gives no partial results, only final results. Documentation for the REST API is available here. A sample (in C#) is available here.
- Client Library. This library streams audio to the service and can return partial results while the user is still speaking. The sample discussed below uses this library.
- Service Library. This is typically used by backend services and is only available in C#. It can be downloaded here.
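To make the REST option above concrete, here is a minimal sketch of how a recognition request could be built in Python. The endpoint URL, query parameters and audio format (16 kHz, 16-bit mono PCM WAV) are assumptions based on the REST documentation at the time of writing; `access_token` is a token obtained with your API key:

```python
import urllib.parse
import urllib.request

# Recognition endpoint as documented at the time of writing -- check the
# current Bing Speech REST documentation before relying on it.
RECOGNIZE_URL = ("https://speech.platform.bing.com/speech/recognition/"
                 "interactive/cognitiveservices/v1")

def build_recognition_request(access_token, wav_bytes, language="en-US"):
    """Build the POST that submits a short WAV clip (16 kHz, 16-bit mono PCM)."""
    query = urllib.parse.urlencode({"language": language, "format": "simple"})
    return urllib.request.Request(
        RECOGNIZE_URL + "?" + query,
        data=wav_bytes,  # the whole clip is sent at once: final results only
        headers={
            "Authorization": "Bearer " + access_token,
            "Content-Type": "audio/wav; codec=audio/pcm; samplerate=16000",
        },
        method="POST",
    )
```

Because the full clip is posted in one request, this path can only return a final result, which is exactly the limitation of the REST API noted above; the Client Library streams audio instead and can therefore surface partial results.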
My advice is to start by downloading one of the samples located on GitHub, for example the C# sample.
Download it, open it in Visual Studio, and you can compile and run it after entering your API key from the Azure Portal.
Inspecting the file MainWindow.xaml.cs shows how the library works. The interesting things start to happen in the StartButton_Click method; the methods CreateMicrophoneRecoClient and CreateMicrophoneRecoClientWithIntent are also good reading. However, the “intent” functionality requires a bit more configuration, and I will get back to that later.
Samples for other platforms
The same GitHub location also hosts equivalent samples for other platforms, such as Android and iOS.
Converting speech to text is useful, but it is even more useful if the API can also deduce what the speaker really wants and what the key words in the phrase are. That is the subject of my next blog post:
Make your apps understand speech with Microsoft Cognitive Services – part 2.