Windows PowerShell and the Text-to-Speech REST API (Part 3)

Summary: Use Windows PowerShell to access the Cognitive Services Text-to-Speech API.

Q: Hey, Scripting Guy!

I was reading up on how we could use PowerShell to communicate with Azure to gain an access token. I’ve been just itching to see how we can use this! Would you show us some examples of this in use in Azure?

—TL

A: Hello TL, I would be delighted to! This is a cool way to play with PowerShell as well!

If you remember from the last post, when we authenticated with the following lines to Cognitive Services, it returned a temporary access token.

Try
{
    [string]$Token=$NULL
    # Rest API Method
    [string]$Method='POST'
    # Rest API Endpoint
    [string]$Uri='https://api.cognitive.microsoft.com/sts/v1.0/issueToken'
    # Authentication Key
    [string]$AuthenticationKey='13775361233908722041033142028212'
    # Headers to pass to Rest API
    $Headers=@{'Ocp-Apim-Subscription-Key' = $AuthenticationKey }
    # Get Authentication Token to communicate with Text to Speech Rest API
    [string]$Token=Invoke-RestMethod -Method $Method -Uri $Uri -Headers $Headers
}
Catch [System.Net.Webexception]
{
    Write-Output 'Failed to Authenticate'
}
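If you plan to request tokens from more than one script, the authentication lines can be wrapped in a small function. The following is only a sketch: the function name Get-SpeechToken and its parameters are hypothetical and not part of the original script. It uses a generic catch because Windows PowerShell and newer PowerShell versions raise different exception types for failed web requests.

```powershell
# Hypothetical helper: Get-SpeechToken is an illustrative name, not part of the original script.
function Get-SpeechToken {
    param(
        [Parameter(Mandatory)][string]$AuthenticationKey,
        [string]$Uri = 'https://api.cognitive.microsoft.com/sts/v1.0/issueToken'
    )

    # Same header as in the article: the subscription key identifies the caller.
    $headers = @{ 'Ocp-Apim-Subscription-Key' = $AuthenticationKey }

    try {
        # -ErrorAction Stop makes sure any failure reaches the catch block.
        [string](Invoke-RestMethod -Method 'POST' -Uri $Uri -Headers $headers -ErrorAction Stop)
    }
    catch {
        # Rethrow with context so the caller can decide how to react.
        throw "Failed to authenticate against ${Uri}: $($_.Exception.Message)"
    }
}
```

Calling it would look like $Token = Get-SpeechToken -AuthenticationKey $AuthenticationKey, which needs a valid key and network access.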

The token was naturally stored in the variable $Token, so it was easy to remember. I suppose we could have named it $CompletelyUnthinkableVariableThatIsPointless, but we didn’t. Because we can use pretty descriptive names in PowerShell, we should. It makes documenting a script easier.

Our next task is to open up the documentation on the Cognitive Services API to see what information we need to supply. We can find everything we need to know here.

Under “HTTP headers,” we can see several pieces of information we need to supply.

HTTP headers table

X-Microsoft-OutputFormat sets the audio format of the file that is returned.

There are many industry standard types we can use. You’ll have to play with the returned output to determine which one meets your needs.

I found that ‘riff-16khz-16bit-mono-pcm’ is the format needed for a standard WAV file. I chose WAV specifically because I can use the internal Windows services to play a WAV file, without invoking a third-party application.

We’ll assign this to an appropriately named variable.

$AudioOutputType='riff-16khz-16bit-mono-pcm'

Both X-Search-AppId and X-Search-ClientID are just unique GUIDs that identify your application. In this case, we’re referring to the PowerShell script or function we’re creating.

The beautiful part is that you can do this in PowerShell right now, by using New-Guid:

Screenshot of PowerShell

If you’d like to be efficient and avoid typing (well, unless you do data entry for a living and you need to type it…I once had that job!), we can grab the Guid property and store it on the clipboard.

New-Guid | Select-Object -ExpandProperty Guid | Set-Clipboard

But the GUID format needed by the REST API uses only the hexadecimal digits, with no hyphens. We can fix that with a quick Replace method.

(New-Guid | Select-Object -ExpandProperty Guid).Replace('-','') | Set-Clipboard
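As an alternative sketch, the .NET Guid type can produce the hyphen-free form directly through its 'N' format specifier, so no Replace call is needed. The variable names match the ones used later in this post. Note that this generates fresh values on every run, so paste the results into your script if you want the IDs to stay the same.

```powershell
# 'N' formats a GUID as 32 hexadecimal digits with no hyphens or braces.
$XSearchAppID    = [guid]::NewGuid().ToString('N')
$XSearchClientID = [guid]::NewGuid().ToString('N')

# Show the generated values so they can be copied into a script.
$XSearchAppID
$XSearchClientID
```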

Run this once for each property, and paste it into a descriptive variable, like so:

$XSearchAppID='dccd93ecb3cf4535aac9350c9b5fb2f8'

$XSearchClientID='45b403b6ae0d4f9ca13ca05f61a58ab2'

UserAgent is just a unique name for your application. Pick a unique but sensible name.

$UserAgent='PowerShellTextToSpeechApp'

Finally, Authorization is the token that was generated earlier and is stored in $Token.

At this point, we put the headers together. Do you remember the headers from authentication? That hashtable was small, but the format is the same.

$Headers=@{'Ocp-Apim-Subscription-Key' = $AuthenticationKey }

You can string it all together like this:

$Headers=@{'Property1'='Value';'Property2'='Value';'Property3'='Value';'Property4'='Value';}

But as you add more information, it becomes hard to read for others working on your script. This is a great case for using backticks ( ` ) to separate the content out. Every time I think about backticks, I think of Patrick Warburton as “The Tick.”

Here is an example with the same information, spaced out with a space and then a backtick.

$Headers=@{ `
'Property1'='Value'; `
'Property2'='Value'; `
'Property3'='Value'; `
'Property4'='Value'; `
}
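Worth knowing as an alternative: inside a hashtable literal, PowerShell treats a line break as the end of an entry, so the same table can be spread over several lines without backticks or semicolons. This is just a sketch with the same placeholder names, and which style you pick is a matter of taste.

```powershell
# Inside @{ }, a line break ends an entry, so no backticks or semicolons are needed.
$Headers = @{
    'Property1' = 'Value'
    'Property2' = 'Value'
    'Property3' = 'Value'
    'Property4' = 'Value'
}

# Number of entries in the table.
$Headers.Count
```

One practical difference is that a stray space after a backtick breaks the line continuation, while this form has no such trap.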

Let’s populate the values for our header from the examples I provided earlier in this page.

$AudioOutputType='riff-16khz-16bit-mono-pcm'
$XSearchAppID='dccd93ecb3cf4535aac9350c9b5fb2f8'
$XSearchClientID='45b403b6ae0d4f9ca13ca05f61a58ab2'
$UserAgent='PowerShellTextToSpeechApp'

$Headers=@{ `
'Content-Type' = 'application/ssml+xml'; `
'X-Microsoft-OutputFormat' = $AudioOutputType; `
'X-Search-AppId' = $XSearchAppId; `
'X-Search-ClientId' = $XSearchClientId; `
'Authorization' = $Token `
}
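If you expect to rebuild the headers often (for example, each time a new token is issued), the same table can be produced by a small function. This is a minimal sketch: New-SpeechHeader and its parameter names are hypothetical, and the token value in the usage line is only a placeholder.

```powershell
# Hypothetical helper: New-SpeechHeader is an illustrative name, not part of the original script.
function New-SpeechHeader {
    param(
        [Parameter(Mandatory)][string]$Token,
        [Parameter(Mandatory)][string]$AppId,
        [Parameter(Mandatory)][string]$ClientId,
        [string]$OutputFormat = 'riff-16khz-16bit-mono-pcm'
    )

    # Same five headers as in the article, returned as one hashtable.
    @{
        'Content-Type'             = 'application/ssml+xml'
        'X-Microsoft-OutputFormat' = $OutputFormat
        'X-Search-AppId'           = $AppId
        'X-Search-ClientId'        = $ClientId
        'Authorization'            = $Token
    }
}

# Usage with placeholder values; a real call would pass the token from the authentication step.
$Headers = New-SpeechHeader -Token 'placeholder-token' `
    -AppId 'dccd93ecb3cf4535aac9350c9b5fb2f8' `
    -ClientId '45b403b6ae0d4f9ca13ca05f61a58ab2'
```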

With the header populated, we are now ready to proceed to our next major piece: actually taking text and converting it to audio content, by using Azure.

But we’ll touch upon that next time. Keep watching the blog and keep on scripting!

I invite you to follow the Scripting Guys on Twitter and Facebook. If you have any questions, send email to them at scripter@microsoft.com, or post your questions on the Official Scripting Guys Forum.

Sean Kearney, Premier Field Engineer, Microsoft

Frequent contributor to Hey, Scripting Guy!

 

 

Hey, Scripting Guy! Blog

PowerTip: Ensure that errors in PowerShell are caught

Summary: Here’s how to make sure your errors get caught with Try Catch Finally.

   Hey, Scripting Guy! I’ve been trying to use Try Catch Finally, but some of my errors aren’t getting caught. What could be the cause?

   Try Catch Finally only catches terminating errors, and many cmdlet errors are non-terminating by default. Make sure the cmdlet in question uses a “Stop” error action, for example with -ErrorAction Stop. Here’s a quick example:

try
{
    Get-ChildItem C:\Foo -ErrorAction Stop
}
catch [System.Management.Automation.ItemNotFoundException]
{
    'oops, I guess that folder was not there'
}
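To see the difference for yourself, here is a small side-by-side sketch. It assumes a randomly named folder under the temp directory does not exist; the variable names are made up for the demonstration.

```powershell
# Hypothetical path that is assumed not to exist on this machine.
$missing = Join-Path ([System.IO.Path]::GetTempPath()) ('no-such-folder-' + [guid]::NewGuid().ToString('N'))

$caughtWithoutStop = $false
try {
    # With the 'Continue' action the error is only written to the error stream
    # (discarded here with 2>$null), so the catch block never runs.
    Get-ChildItem $missing -ErrorAction Continue 2>$null
}
catch {
    $caughtWithoutStop = $true
}

$caughtWithStop = $false
try {
    # -ErrorAction Stop turns the same error into a terminating one.
    Get-ChildItem $missing -ErrorAction Stop
}
catch [System.Management.Automation.ItemNotFoundException] {
    $caughtWithStop = $true
}

"Caught without Stop: $caughtWithoutStop"
"Caught with Stop:    $caughtWithStop"
```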

 


Windows PowerShell and the Text-to-Speech REST API (Part 4)

Summary: Send and receive content to the Text-to-Speech API with PowerShell.

Q: Hey, Scripting Guy!

I was playing with the Text-to-Speech API. I have it almost figured out, but I’m stumbling over the final steps of formatting the SSML markup language. Could you lend me a hand?

—MD

A: Hello MD,

Glad to lend a hand to a Scripter in need! I remember having that same challenge the first time I worked with it. It’s actually not hard, but I needed a sample to work with.

First, let’s recall where we left off last time. We’ve accomplished the first two pieces for Cognitive Services Text-to-Speech:

  1. The authentication piece, to obtain a temporary token for communicating with Cognitive Services.
  2. Headers containing the audio format and our application’s unique parameters.

Next, we need to build the body of content we need to send up to Azure. The body contains some key pieces:

  • Region of the speech (for example, English US, Spanish, or French).
  • Text we need converted to speech.
  • Voice of the speaker (male or female).

For more information about all this, see the section “Supported locales and voice fonts” in Bing text to speech API.

The challenge I ran into was in just how to create the SSML content that was needed. SSML, which stands for Speech Synthesis Markup Language, is a standard for identifying just how speech should be spoken. Examples of this would be:

  • Content
  • Language
  • Speed

I could spend a lot of time reading up on it, but Azure gives you a great tool to create sample content without even trying! Check out Bing Speech, and look under the heading “Text to Speech.” In the text box, type in whatever you would like to hear.

In the sample below, I have entered in “Hello everyone, this is Azure Text to Speech.”

Screenshot of Bing Speech

Now if you select View SSML (the blue button), you can see the code in SSML that would have been the body we would have sent to Azure.

Screenshot of SSML code

You can copy and paste this into your editor of choice. From here, I will try to break down the content from our example.

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US"><voice xml:lang="en-US" name="Microsoft Server Speech Text to Speech Voice (en-US, JessaRUS)">Hello everyone, this is Azure Text to Speech</voice></speak>

The xml:lang value (en-US, which appears twice) is our locale. The name attribute of the voice element contains our service name mapping. The locale must always be matched with the service name mapping from the same row of the table it came from. The double quotes around each attribute value are equally important.

If you mix them up, Azure will wag its finger at you and give a nasty error back.

The text between the opening and closing voice tags is the actual content that we would like Azure to convert to speech.

Let’s take a sample from the table, and change this to an Australian female voice.

Table with two rows

We first replace the locale with “en-AU,” and then the service name mapping with “Microsoft Server Speech Text to Speech Voice (en-AU, Catherine).”

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-AU"><voice xml:lang="en-AU" name="Microsoft Server Speech Text to Speech Voice (en-AU, Catherine)">Hello everyone, this is Azure Text to Speech</voice></speak>

Now if we’d like to have her say something different, we just change the text between the voice tags.

How does this translate in Windows PowerShell?

We can take the three separate components (locale, service name mapping, and content), and store them in variables.

$Locale='en-US'
$ServiceNameMapping='Microsoft Server Speech Text to Speech Voice (en-US, JessaRUS)'
$Content='Hello everyone, this is Azure Text to Speech'

Now you can have a line like this in Windows PowerShell to dynamically build out the SSML content, and change only the pieces you typically need.

$Body='<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="'+$Locale+'"><voice xml:lang="'+$Locale+'" name="'+$ServiceNameMapping+'">'+$Content+'</voice></speak>'
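One thing to watch when the content comes from user input: characters such as & or < would break the markup. The sketch below builds the same SSML in a function, escapes the inserted values, and checks that the result is well-formed XML. The function name New-SsmlBody is hypothetical, and the sample sentence is made up for the demonstration.

```powershell
# Hypothetical helper: New-SsmlBody is an illustrative name, not part of the original script.
function New-SsmlBody {
    param(
        [Parameter(Mandatory)][string]$Locale,
        [Parameter(Mandatory)][string]$ServiceNameMapping,
        [Parameter(Mandatory)][string]$Content
    )

    # Escape XML special characters (&, <, >, quotes) in the values we insert.
    $safeContent = [System.Security.SecurityElement]::Escape($Content)
    $safeVoice   = [System.Security.SecurityElement]::Escape($ServiceNameMapping)

    # Template with numbered placeholders: {0} locale, {1} voice name, {2} text.
    $template = '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" ' +
                'xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="{0}">' +
                '<voice xml:lang="{0}" name="{1}">{2}</voice></speak>'
    $ssml = $template -f $Locale, $safeVoice, $safeContent

    # Casting to [xml] throws if the markup is not well formed.
    $null = [xml]$ssml
    $ssml
}

$Body = New-SsmlBody -Locale 'en-AU' `
    -ServiceNameMapping 'Microsoft Server Speech Text to Speech Voice (en-AU, Catherine)' `
    -Content 'Fish & chips, anyone?'
```

The [xml] cast is only a local sanity check; it does not confirm that the service accepts the locale and voice combination.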

At this point, we only need to call up the REST API to have it do the magic. But that is for another post!

See you next time when we finish playing with this cool technology!

Sean Kearney, Premier Field Engineer, Microsoft

Frequent contributor to Hey, Scripting Guy!

PowerTip: Use PowerShell to play WAV files

Summary: Make use of the native features of Windows through PowerShell to play sound.

   Hey, Scripting Guy! I’ve got some WAV files I would love to play without launching an application. Is there a way in Windows PowerShell to do this?

   You sure can! Using the System.Media.SoundPlayer object, you can do this quite easily. Here is an example of how to do this:

$PlayWav=New-Object System.Media.SoundPlayer
$PlayWav.SoundLocation='C:\Foo\Soundfile.wav'
$PlayWav.PlaySync()
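If you play files from several places in a script, a small wrapper can check that the file exists first and clean up the player afterwards. This is only a sketch: Invoke-WavPlayback is a hypothetical name, and System.Media.SoundPlayer is assumed to be available, which is the case on Windows.

```powershell
# Hypothetical helper: Invoke-WavPlayback is an illustrative name, not part of the original tip.
function Invoke-WavPlayback {
    param([Parameter(Mandatory)][string]$Path)

    # Fail early with a clear message if the file is not there.
    if (-not (Test-Path -LiteralPath $Path -PathType Leaf)) {
        throw "WAV file not found: $Path"
    }

    $player = New-Object System.Media.SoundPlayer
    # Hand the player a full path so it does not depend on the current directory.
    $player.SoundLocation = (Resolve-Path -LiteralPath $Path).ProviderPath
    try {
        # PlaySync blocks until playback has finished.
        $player.PlaySync()
    }
    finally {
        $player.Dispose()
    }
}
```

A call such as Invoke-WavPlayback -Path 'C:\Foo\Soundfile.wav' would then play the file from the tip above.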


Windows PowerShell and the Text-to-Speech REST API (Part 5)

Summary: Send and receive content to the Text-to-Speech API with PowerShell.

Q: Hey, Scripting Guy!

Could you give a buddy a hand in getting the last pieces together for the Text-to-Speech API?

—SR

A: Hello SR,

No problem at all. The last few posts, we dealt with the “Heavy Lifting” (which really wasn’t that heavy):

  • Authentication
  • Creating the headers
  • Defining the SSML body

Now at this point, we only need to call up the REST API to have it do the magic. So just like our Authentication API, we have the following pieces:

  • Endpoint
  • Method
  • Authentication
  • Headers
  • Body

We have the final three already. In a previous post, we built the headers (application ID, client ID, and audio format), and we’ve just finished building the body. The authentication token is contained within the headers.

What we need to know now is the endpoint, and what Invoke-RestMethod needs. This can all be found here at Bing Text to Speech API, under the “Authorization Token” section.

The endpoint is https://speech.platform.bing.com/synthesize. To determine the method required, just glance at the “Example: voice output request.” It shows you the method as the first line.

In many REST API examples, if you see an example of the output, the first line is often the method. You can use this trick in other scenarios with REST APIs in general.

All we need to do now is assemble the pieces, and call up Invoke-RestMethod. To avoid you running back and forth, I’ve assembled the entire script:

 

Try
{
    [string]$Token=$NULL
    # Rest API Method
    [string]$Method='POST'
    # Rest API Endpoint
    [string]$Uri='https://api.cognitive.microsoft.com/sts/v1.0/issueToken'
    # Authentication Key
    [string]$AuthenticationKey='13775361233908722041033142028212'
    # Headers to pass to Rest API
    $Headers=@{'Ocp-Apim-Subscription-Key' = $AuthenticationKey }
    # Get Authentication Token to communicate with Text to Speech Rest API
    [string]$Token=Invoke-RestMethod -Method $Method -Uri $Uri -Headers $Headers
}
Catch [System.Net.Webexception]
{
    Write-Output 'Failed to Authenticate'
}

$AudioOutputType='riff-16khz-16bit-mono-pcm'
$XSearchAppID='dccd93ecb3cf4535aac9350c9b5fb2f8'
$XSearchClientID='45b403b6ae0d4f9ca13ca05f61a58ab2'
$UserAgent='PowerShellTextToSpeechApp'

$Headers=@{ `
'Content-Type' = 'application/ssml+xml'; `
'X-Microsoft-OutputFormat' = $AudioOutputType; `
'X-Search-AppId' = $XSearchAppId; `
'X-Search-ClientId' = $XSearchClientId; `
'Authorization' = $Token `
}

# New content from this post below here

$Locale='en-US'
$ServiceNameMapping='Microsoft Server Speech Text to Speech Voice (en-US, JessaRUS)'
$Content='Hello everyone, this is Azure Text to Speech'
$Body='<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="'+$Locale+'"><voice xml:lang="'+$Locale+'" name="'+$ServiceNameMapping+'">'+$Content+'</voice></speak>'
$Endpoint='https://speech.platform.bing.com/synthesize'
$Method='POST'
$ContentType='application/ssml+xml'
$Filename='output.wav'

Invoke-RestMethod -Uri $Endpoint -Method $Method `
-Headers $Headers -ContentType $ContentType `
-Body $Body -UserAgent $UserAgent `
-OutFile $Filename

Notice that Invoke-RestMethod has an -OutFile parameter, which allows you to store received output directly as a file on the file system.

When this process is complete, you will have a WAV file (which is what we chose). It can be launched in your choice of audio application.
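Before handing the file to a player, you may want a quick sanity check that what was saved really is a WAV file. A WAV file begins with the ASCII text RIFF, followed by four size bytes and then WAVE. The sketch below checks just those bytes; Test-WavHeader is a hypothetical name, and this is not a full validation of the audio data.

```powershell
# Hypothetical helper: Test-WavHeader is an illustrative name, not part of the original script.
function Test-WavHeader {
    param([Parameter(Mandatory)][string]$Path)

    # Read the file through its full path so the current directory does not matter.
    $bytes = [System.IO.File]::ReadAllBytes((Resolve-Path -LiteralPath $Path).ProviderPath)
    if ($bytes.Length -lt 12) { return $false }

    $ascii = [System.Text.Encoding]::ASCII
    # Bytes 0-3 must be 'RIFF' and bytes 8-11 must be 'WAVE' (case-sensitive comparison).
    ($ascii.GetString($bytes, 0, 4) -ceq 'RIFF') -and ($ascii.GetString($bytes, 8, 4) -ceq 'WAVE')
}
```

After the script has run, Test-WavHeader -Path 'output.wav' should return True for the format chosen in this series.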

The API has locales because it accepts text in many languages (such as French). If you target the locale that matches your text, the voice speaks it with the appropriate accent and inflection.

Pretty cool, eh? If you’re tripping over any of the pieces, don’t forget to review the earlier four parts of this series to get it all ironed out!

That’s all for the moment, but keep a lookout for more in the way of PowerShell here.

Sean Kearney, Premier Field Engineer, Microsoft

Frequent contributor to Hey, Scripting Guy!

 
