Amazon Polly is a service that turns text into lifelike speech. It enables the development of a whole class of applications that can convert text into speech in multiple languages.
This service can be used by chatbots, audio books, and other text-to-speech applications in conjunction with other AWS AI or machine learning (ML) services. For example, Amazon Lex and Amazon Polly can be combined to create a chatbot that engages in a two-way conversation with a user and performs certain tasks based on the user’s commands. Amazon Transcribe, Amazon Translate, and Amazon Polly can be combined to transcribe speech to text in the source language, translate it to a different language, and speak it.
In this post, we present an interesting approach for highlighting text as it’s being spoken using Amazon Polly. This solution can be used in many text-to-speech applications to do the following:
- Add visual capabilities to audio in books, websites, and blogs
- Increase comprehension when customers are trying to understand the text rapidly as it’s being spoken
Our solution gives the client (the browser, in this example) the ability to know what text (word or sentence) is being spoken by Amazon Polly at any instant. This enables the client to dynamically highlight the text as it’s being spoken. Such a capability is useful for providing visual aid to speech for the use cases mentioned previously.
Our solution can be extended to perform additional tasks besides highlighting text. For example, the browser can show images, play music, or perform other animations on the front end as the text is being spoken. This capability is useful for creating dynamic audio books, educational content, and richer text-to-speech applications.
At its core, the solution uses Amazon Polly to convert a string of text into speech. The text can be input from the browser or through an API call to the endpoint exposed by our solution. The speech generated by Amazon Polly is stored as an audio file (MP3 format) in an Amazon Simple Storage Service (Amazon S3) bucket.
However, using the audio file alone, the browser can’t find what parts of the text are being spoken at any instant because we don’t have granular information on when each word is spoken.
Amazon Polly provides a way to obtain this using speech marks. Speech marks are stored in a text file that shows the time (measured in milliseconds from the start of the audio) when each word or sentence is spoken.
Amazon Polly returns speech mark objects in a line-delimited JSON stream. A speech mark object contains the following fields:
- Time – The timestamp in milliseconds from the beginning of the corresponding audio stream
- Type – The type of speech mark (sentence, word, viseme, or SSML)
- Start – The offset in bytes (not characters) of the start of the object in the input text (not including viseme marks)
- End – The offset in bytes (not characters) of the object’s end in the input text (not including viseme marks)
- Value – This varies depending on the type of speech mark:
  - SSML – <mark> SSML tag
  - Viseme – The viseme name
  - Word or sentence – A substring of the input text as delimited by the start and end fields
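Because Start and End are byte offsets, slicing the original string by character index gives the wrong substring as soon as the text contains non-ASCII characters. The following is a minimal sketch of one way to handle this, assuming the offsets refer to the UTF-8 encoding of the input text; the helper name `slice_by_byte_offsets` is hypothetical and not part of the solution’s code.

```python
def slice_by_byte_offsets(text: str, start: int, end: int) -> str:
    """Return the part of `text` covered by a speech mark's start/end offsets."""
    # Assumption: the offsets count bytes of the UTF-8 encoded input text.
    encoded = text.encode("utf-8")
    return encoded[start:end].decode("utf-8")


text = "Café menu"
# "é" occupies two bytes in UTF-8, so "menu" starts at byte 6,
# even though it starts at character index 5.
print(slice_by_byte_offsets(text, 6, 10))
```

A browser client would need the equivalent conversion before mapping a speech mark back to a position in the displayed text.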
For example, the sentence “Mary had a little lamb” can give you the following speech marks file if you use SpeechMarkTypes = ["word", "sentence"] in the API call to obtain the speech marks:
The word “had” (at the end of line 3) begins 373 milliseconds after the audio stream begins, starts at byte 5, and ends at byte 8 of the input text.
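To make the format concrete, here is a hedged sketch of how a client could parse such a stream and look up the word being spoken at a given playback time. The stream below is illustrative: only the entry for “had” (373 ms, bytes 5 to 8, third line) is taken from the text, and every other timing is made up. The function names are hypothetical, and the lookup assumes the marks arrive sorted by time.

```python
import bisect
import json

# Illustrative speech marks stream. Only the third line ("had" at 373 ms,
# bytes 5 to 8) comes from the text; all other values are invented.
SPEECH_MARKS = """\
{"time": 0, "type": "sentence", "start": 0, "end": 22, "value": "Mary had a little lamb"}
{"time": 6, "type": "word", "start": 0, "end": 4, "value": "Mary"}
{"time": 373, "type": "word", "start": 5, "end": 8, "value": "had"}
{"time": 604, "type": "word", "start": 9, "end": 10, "value": "a"}
{"time": 643, "type": "word", "start": 11, "end": 17, "value": "little"}
{"time": 882, "type": "word", "start": 18, "end": 22, "value": "lamb"}
"""


def parse_speech_marks(stream: str) -> list[dict]:
    """Parse line-delimited JSON into a list of speech mark dicts."""
    return [json.loads(line) for line in stream.splitlines() if line.strip()]


def mark_at(marks: list[dict], elapsed_ms: int, mark_type: str = "word"):
    """Return the mark of `mark_type` being spoken at `elapsed_ms`, or None."""
    # Assumption: marks are already ordered by their "time" field.
    selected = [m for m in marks if m["type"] == mark_type]
    times = [m["time"] for m in selected]
    # Index of the last mark whose time is <= elapsed_ms.
    index = bisect.bisect_right(times, elapsed_ms) - 1
    return selected[index] if index >= 0 else None


marks = parse_speech_marks(SPEECH_MARKS)
current = mark_at(marks, 400)
print(current["value"], current["start"], current["end"])
```

In a browser, the same lookup would run against the audio element’s current playback position, and the returned start and end offsets would drive the highlight.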
The architecture of our solution is presented in the following diagram.
The Lambda function creates pre-signed URLs for the speech and speech marks files and returns them to the browser in the form of an array (7, 8, 9).
When the browser sends the text file to the API endpoint (3), it gets back two pre-signed URLs for the audio file and the speech marks file in one synchronous invocation (9). This is indicated by the key symbol next to the arrow.
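The pre-signing step could look roughly like the following sketch. It is not the solution’s actual Lambda code: the function name, parameter names, and the 300-second expiry are assumptions, and `s3_client` is expected to be a Boto3 S3 client created by the caller (for example with `boto3.client("s3")`).

```python
def presign_speech_files(s3_client, bucket: str, audio_key: str,
                         marks_key: str, expires_in: int = 300) -> list[str]:
    """Return pre-signed GET URLs for the audio and speech marks objects."""
    # The result is an array of two URLs: audio first, speech marks second.
    return [
        s3_client.generate_presigned_url(
            "get_object",
            Params={"Bucket": bucket, "Key": key},
            ExpiresIn=expires_in,  # lifetime of the URL in seconds (assumed value)
        )
        for key in (audio_key, marks_key)
    ]
```

Returning both URLs in one array matches the single synchronous round trip described above: the browser makes one request and receives everything it needs to play the audio and fetch the speech marks.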
To run this solution, you need an AWS account with an AWS Identity and Access Management (IAM) user who has permission to use Amazon CloudFront, Amazon API Gateway, Amazon Polly, Amazon S3, AWS Lambda, and AWS Step Functions.
Use Lambda to generate speech and speech marks
The following code invokes the Amazon Polly synthesize_speech function two times to fetch the audio and speech marks file. They’re run as asynchronous functions and coordinated to return the result at the same time using promises.
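As a rough illustration of the two-call pattern, here is a Python sketch that swaps the promises mentioned above for a thread pool. It is not the solution’s own code: the function name and the default voice are assumptions, and `polly_client` is expected to be a Boto3 Polly client supplied by the caller (for example `boto3.client("polly")`).

```python
from concurrent.futures import ThreadPoolExecutor


def synthesize_audio_and_marks(polly_client, text: str, voice_id: str = "Joanna"):
    """Request MP3 audio and speech marks for `text` in parallel.

    Returns a tuple (audio_bytes, speech_marks_text).
    """
    def fetch_audio() -> bytes:
        response = polly_client.synthesize_speech(
            Text=text, OutputFormat="mp3", VoiceId=voice_id
        )
        return response["AudioStream"].read()

    def fetch_marks() -> str:
        response = polly_client.synthesize_speech(
            Text=text,
            OutputFormat="json",  # speech marks come back as line-delimited JSON
            SpeechMarkTypes=["word", "sentence"],
            VoiceId=voice_id,
        )
        return response["AudioStream"].read().decode("utf-8")

    # Run both requests concurrently and wait for both results.
    with ThreadPoolExecutor(max_workers=2) as pool:
        audio_future = pool.submit(fetch_audio)
        marks_future = pool.submit(fetch_marks)
        return audio_future.result(), marks_future.result()
```

Both calls use the same text and voice, so the timings in the speech marks should line up with the generated audio.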
Instead of the previous approach, you can consider a few alternatives:
- Create both the speech marks and audio files inside a Step Functions state machine. The state machine can use a parallel branch to invoke two different Lambda functions: one to generate speech and another to generate speech marks. The code for this can be found in the using-step-functions subfolder in the GitHub repo.
- Invoke Amazon Polly asynchronously to generate the audio and speech marks. This approach can be used if the text content is large or the user doesn’t need a real-time response. For more details about creating long audio files, refer to Creating Long Audio Files.
- Have Amazon Polly create the presigned URL directly using the generate_presigned_url call on the Amazon Polly client in Boto3. If you go with this approach, Amazon Polly generates the audio and speech marks anew every time. In our current approach, we store these files in Amazon S3. Although these stored files aren’t accessible from the browser in our version of the code, you can modify the code to play previously generated audio files by fetching them from Amazon S3 (instead of regenerating the audio for the text again using Amazon Polly). We have more code examples for accessing Amazon Polly with Python in the AWS Code Library.
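The second and third alternatives might be sketched as follows. Both functions are hypothetical helpers, not code from the repo; `polly_client` is assumed to be a Boto3 Polly client passed in by the caller, the voice and expiry defaults are placeholders, and the use of `HttpMethod="GET"` so the URL can be fetched directly is an assumption.

```python
def start_async_synthesis(polly_client, text: str, bucket: str,
                          voice_id: str = "Joanna") -> str:
    """Start an asynchronous synthesis task that writes MP3 output to S3."""
    response = polly_client.start_speech_synthesis_task(
        Text=text,
        OutputFormat="mp3",
        VoiceId=voice_id,
        OutputS3BucketName=bucket,
    )
    # The task ID can be used later to poll for completion.
    return response["SynthesisTask"]["TaskId"]


def presign_polly_audio(polly_client, text: str, voice_id: str = "Joanna",
                        expires_in: int = 300) -> str:
    """Return a URL that asks Amazon Polly to synthesize `text` when fetched."""
    return polly_client.generate_presigned_url(
        "synthesize_speech",
        Params={"Text": text, "OutputFormat": "mp3", "VoiceId": voice_id},
        ExpiresIn=expires_in,
        HttpMethod="GET",  # assumption: lets a client fetch the URL directly
    )
```

The trade-off noted in the list applies here: with a presigned Amazon Polly URL nothing is stored, so every playback triggers a fresh synthesis, whereas the asynchronous task leaves a reusable file in Amazon S3.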
Create the solution
The entire solution is available from our GitHub repo. To create this solution in your account, follow the instructions in the README.md file. The solution includes an AWS CloudFormation template to provision your resources.
To clean up the resources created in this demo, perform the following steps:
- Delete the S3 buckets created to store the CloudFormation template (Bucket A), the source code (Bucket B), and the website (
- Delete the CloudFormation stack
- Delete the S3 bucket containing the speech files (pth-speech-[Suffix]). This bucket was created by the CloudFormation template to store the audio and speech marks files generated by Amazon Polly.
In this post, we showed an example of a solution that can highlight text as it’s being spoken using Amazon Polly. It was developed using the Amazon Polly speech marks feature, which provides us markers for where each word or sentence begins in an audio file.
The solution is available as a CloudFormation template. It can be deployed as is to any web application that performs text-to-speech conversion. This would be useful for adding visual capabilities to audio in books, avatars with lip-sync capabilities (using viseme speech marks), websites, and blogs, and for aiding people with hearing impairments.
It can be extended to perform additional tasks besides highlighting text. For example, the browser can show images, play music, and perform other animations on the front end while the text is being spoken. This capability can be useful for creating dynamic audio books, educational content, and richer text-to-speech applications.
We welcome you to try out this solution and learn more about the relevant AWS services from the following links. You can extend the functionality for your specific needs.
About the Author
Varad G Varadarajan is a Trusted Advisor and Field CTO for Digital Native Businesses (DNB) customers at AWS. He helps them architect and build innovative solutions at scale using AWS products and services. Varad’s areas of interest are IT strategy consulting, architecture, and product management. Outside of work, Varad enjoys creative writing, watching movies with family and friends, and traveling.