
Getting Started with Speech Synthesis Markup Language (SSML)

SSML was designed by the W3C as an XML-based markup language for generating natural-sounding synthesized speech. Plivo <Speak> XML now supports SSML-based speech generation. SSML speech generation on Plivo is powered by Amazon Polly, the leader in SSML speech synthesis.

With normal text-to-speech, developers can only choose a basic male or female voice in a subset of languages. Plivo SSML supports 27 languages and over 40 voices, and also lets developers control pronunciation, pitch, volume, and more. Plivo’s root XML element for SSML tags is <Speak>, the same element used for basic TTS. For example:

<Response>
    <!-- Basic text-to-speech -->
    <Speak voice="MAN">Go Green, Go Plivo</Speak>
    <!-- Text-to-speech using SSML -->
    <Speak voice="Polly.Joey">
        <emphasis level="moderate">Go Green, Go Plivo</emphasis>
    </Speak>
</Response>

Amazon Polly voices can process text-to-speech for a maximum of 3000 characters using the <Speak> tag. For more information about SSML, see the W3C specifications.

Amazon Polly

Amazon Polly is a service that provides lifelike text-to-speech across several languages and locales. SSML support on Plivo is powered by Amazon Polly.

To synthesize SSML speech on Plivo, specify one of the Amazon Polly voices in the ‘voice’ attribute of Plivo’s <Speak> XML. Note that Polly voice names must be prefixed with "Polly." (for example, Polly.Joey).

For example:

<Response>
    <Speak voice="Polly.Joey">
        <emphasis level="moderate">Go Green, Go Plivo</emphasis>
    </Speak>
</Response>

A complete list of supported Polly voices is given in the SSML Voices section below.

SSML Tags

The following SSML tags are supported for use in Plivo’s XML:

Action | SSML Tag | Description
Adding a pause | <break> | Use this tag to include a pause in the speech.
Emphasizing words | <emphasis> | Use this tag to change the rate and volume of the speech.
Specifying another language for specific words | <lang> | Use this tag to set the natural language of the text.
Adding a pause between paragraphs | <p> | Use this tag to represent a paragraph.
Controlling volume, speaking rate, and pitch | <prosody> | Use this tag to modify the volume, pitch, and rate of the tagged text.
Adding a pause between sentences | <s> | Use this tag to represent a sentence. This will add a strong break before and after the tag.
Controlling how special types of words are spoken | <say-as> | Use this tag to describe how to interpret the text.
Pronouncing acronyms and abbreviations | <sub> | Use this tag to pronounce the specified words or phrases as different words or phrases.
Improving pronunciation by specifying parts of speech | <w> | Use this tag to customize the pronunciation of words by specifying the part of speech.
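
As a rough illustration of how several of these tags might be combined, the sketch below uses <break>, <prosody>, and <sub> inside one <Speak> element. The message text is made up for the example, and the attribute values (time="1s", rate="slow", volume="loud", alias) are standard SSML values documented by Amazon Polly; that Plivo passes them through unchanged is an assumption.

```xml
<Response>
    <!-- Hypothetical message text; the voice name is the one used elsewhere in this document -->
    <Speak voice="Polly.Joey">
        Thank you for calling.
        <break time="1s"/>
        <prosody rate="slow" volume="loud">Please listen carefully.</prosody>
        This service follows the <sub alias="World Wide Web Consortium">W3C</sub> standard.
    </Speak>
</Response>
```

Here the pause separates the greeting from the instruction, the prosody tag slows down and raises the volume of one sentence only, and the sub tag reads the abbreviation as its full name.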

Note: The following Amazon Polly-specific tags are not supported for use with Plivo XML:

  • <amazon:auto-breaths>
  • <amazon:effect name="drc">
  • <amazon:effect phonation="soft">
  • <amazon:effect vocal-tract-length>
  • <amazon:effect name="whispered">

SSML Voices

The following SSML voices are supported for use with Plivo XML:

Language | Female | Male
Australian English (en-AU) | Nicole | Russell
Brazilian Portuguese (pt-BR) | Vitória | Ricardo
Canadian French (fr-CA) | Chantal | -
Danish (da-DK) | Naja | Mads
Dutch (nl-NL) | Lotte | Ruben
French (fr-FR) | Lea, Celine | Mathieu
German (de-DE) | Vicki, Marlene | Hans
Hindi (hi-IN) | Aditi | -
Icelandic (is-IS) | Dora | Karl
Indian English (en-IN) | Raveena, Aditi | -
Italian (it-IT) | Carla | Giorgio
Japanese (ja-JP) | Mizuki | Takumi
Korean (ko-KR) | Seoyeon | -
Mandarin Chinese (cmn-CN) | Zhiyu | -
Norwegian (nb-NO) | Liv | -
Polish (pl-PL) | Ewa, Maja | Jacek, Jan
Portuguese - Iberic (pt-PT) | Ines | Cristiano
Romanian (ro-RO) | Carmen | -
Russian (ru-RU) | Tatyana | Maxim
Spanish - Castilian (es-ES) | Conchita | Enrique
Swedish (sv-SE) | Astrid | -
Turkish (tr-TR) | Filiz | -
UK English (en-GB) | Amy, Emma | Brian
US English (en-US) | Joanna, Salli, Kendra, Kimberly, Ivy | Matthew, Justin, Joey
US Spanish (es-US) | Penelope | Miguel
Welsh (cy-GB) | Gwyneth | -
Welsh English (en-GB-WLS) | - | Geraint
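
To show how a voice from this table might be selected, here is a minimal sketch that plays one greeting with a UK English voice and one with a French voice. It assumes that every listed voice is addressed the same way as Polly.Joey, that is, by adding the "Polly." prefix to the name, and that no separate language attribute is needed; the greeting text is invented for the example.

```xml
<Response>
    <!-- Voice names are taken from the voices table and prefixed with "Polly." -->
    <Speak voice="Polly.Amy">Welcome to our support line.</Speak>
    <Speak voice="Polly.Celine">Bienvenue sur notre ligne d'assistance.</Speak>
</Response>
```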

Character Limit

To ensure quick synthesis, a limit of 3000 characters is enforced on the text that can be synthesized in a single <Speak> element.
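
One possible way to handle text that exceeds this limit is to split it across several <Speak> elements in the same response. The sketch below assumes that the limit is counted per <Speak> element and that consecutive <Speak> elements are played in order; neither point is confirmed here, so treat it as a starting point to verify. The announcement text is a placeholder.

```xml
<Response>
    <!-- Each <Speak> element is assumed to count separately toward the 3000-character limit -->
    <Speak voice="Polly.Joey">First part of a long announcement, kept under 3000 characters.</Speak>
    <Speak voice="Polly.Joey">Second part of the announcement, also kept under 3000 characters.</Speak>
</Response>
```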

Pricing

Support for SSML-based speech synthesis is currently in Beta. While in Beta, SSML-based speech synthesis is free.

SSML-based speech synthesis will eventually be charged on the basis of the number of characters synthesized.

SSML Support In Plivo Server-Side SDKs

At the moment, only Plivo’s .NET SDK supports SSML tags in <Speak> XML. Support in the other server-side SDKs is planned. If you have any specific requests, please contact our support team.

Examples

The examples below use the Joey voice for US English (en-US). Use the voice attribute of the <Speak> tag to specify the voice for your text.

  • Say-as

The say-as tag describes how to interpret the text.

<Response>
    <Speak voice="Polly.Joey">
        The date is
        <say-as interpret-as="date">20180626</say-as>
    </Speak>
</Response>
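
Other interpret-as values can be used in the same way. The following sketch reads a number digit by digit and spells out a short code letter by letter. The values "digits" and "characters" are taken from Amazon Polly's documented say-as options and are assumed to behave the same through Plivo; the code and reference shown are invented.

```xml
<Response>
    <Speak voice="Polly.Joey">
        Your confirmation code is
        <say-as interpret-as="digits">4521</say-as>
        and your booking reference is
        <say-as interpret-as="characters">QX7</say-as>
    </Speak>
</Response>
```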
  • W

The w tag is used to customize the pronunciation of words by specifying the part of speech.

<Response>
    <Speak voice="Polly.Joey">
        The word
        <say-as interpret-as="characters">read</say-as>
        <s>
            may be interpreted as either the present simple form
        </s>
        <w role="amazon:VB">read</w>
        <s>or the past participle form</s>
        <w role="amazon:VBD">read</w>
    </Speak>
</Response>
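
  • Lang

The lang tag sets the natural language of the enclosed text. As a minimal sketch, the example below keeps the Joey voice but marks a single word as French so that it is pronounced accordingly. The xml:lang attribute and the fr-FR code follow standard SSML usage; the sentence itself is only an illustration, and how closely the result matches a native French voice is not something this sketch can promise.

```xml
<Response>
    <Speak voice="Polly.Joey">
        The French word for hello is
        <lang xml:lang="fr-FR">bonjour</lang>
    </Speak>
</Response>
```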