This post describes how to split large text documents into chunks small enough to be processed by Amazon Translate. I’ll show you how to split documents without splitting words or sentences so source text remains grammatically correct, and I’ll demonstrate this with sample code that uses the Python NLTK library for detecting sentence boundaries.
What is Amazon Translate?
Amazon Translate is a service for translating text on the Amazon Web Services (AWS) platform. It is designed to be used programmatically and supports interfaces in Python, Java, AWS Mobile SDK, and the AWS Command Line Interface (AWS CLI). The service currently supports translations between 54 different languages!
I have been using Amazon Translate to translate video transcripts in the Media Insights Engine, a video search engine I’ve built with a few colleagues. It indexes video transcripts, their translations, and data generated by Amazon Rekognition so that users can search for and find content in very large video collections. For example, a user could find The Terminator by searching for transcript excerpts like “Hasta la vista, baby,” or celebrity names like “Schwarzenegger,” or content moderation terms like “Violence”.
What are the limitations of Amazon Translate?
AWS services include default limitations that help prevent users from misusing the service or from accidentally provisioning more resources than they need. They control things like the maximum size of input objects, how often you can send requests, how long your requests can run, etc.
For example, AWS Lambda, a popular serverless computing platform, limits functions to running no more than 15 minutes and will shut down functions that exceed that limit without warning. Other services return errors that explicitly identify the limit that has been exceeded. For example, if you send input text to Amazon Translate that is too long, you’ll see an error like this:
An error occurred (TextSizeLimitExceededException) Input text size exceeds limit. Max length of request text allowed is 5000 bytes while in this request the text size is 5074 bytes
If you need a service to process more data than you’re allowed to put in a single request, try splitting the workload into multiple smaller parts. The preceding error message shows the limit that applies here: as described in the Guidelines and Limits documentation for Amazon Translate, the service accepts no more than 5,000 bytes of UTF-8 encoded text per request. You can check whether a string fits with the Python expression `len(source_text.encode('utf-8')) <= 5000`. If you’re working with a text document that is longer than 5,000 bytes, you can split the source text into smaller chunks, call Amazon Translate for each chunk, and combine the translated results once complete.
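The limit is measured in bytes, not characters, and the two differ for non-ASCII text. The following minimal sketch illustrates the difference. The names `utf8_size`, `fits_in_one_request`, and `MAX_REQUEST_BYTES` are hypothetical, and the sketch assumes that exactly 5,000 bytes is still allowed, which is how I read the error message above.

```python
# Assumed limit, taken from the error message above (hypothetical constant name).
MAX_REQUEST_BYTES = 5000


def utf8_size(text: str) -> int:
    """Return the number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def fits_in_one_request(text: str, limit: int = MAX_REQUEST_BYTES) -> bool:
    """Return True if the UTF-8 encoded text is no larger than the limit."""
    return utf8_size(text) <= limit


ascii_text = "a" * 5000
accented_text = "\u00e9" * 5000  # "é" takes 2 bytes in UTF-8

# Both strings have 5000 characters, but only the first fits in the limit.
print(len(ascii_text), utf8_size(ascii_text), fits_in_one_request(ascii_text))
print(len(accented_text), utf8_size(accented_text), fits_in_one_request(accented_text))
```

Both strings contain 5,000 characters, but the accented one encodes to 10,000 bytes, so counting characters alone would underestimate its size.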
To maintain the grammatical integrity of the source text, it must not be split in the middle of words or sentences. The simplest way to locate sentence boundaries is to look for a period followed by a capitalized word. A more accurate strategy also considers language-specific abbreviations that include a period but don’t necessarily end a sentence, such as the English title Missus abbreviated as Mrs. or the German title Frau abbreviated as Fr.
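To see why the simple rule falls short, here is a minimal sketch of it using Python’s `re` module. The function name `naive_split` and the sample sentences are made up for illustration, and this is not how NLTK detects boundaries.

```python
import re
from typing import List

# Naive rule: a sentence ends at ".", "!" or "?" followed by whitespace
# and a capitalized word.
NAIVE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def naive_split(text: str) -> List[str]:
    """Split text wherever the naive boundary rule matches."""
    return NAIVE_BOUNDARY.split(text)


simple = "It is raining. We will stay inside."
tricky = "I spoke with Mrs. Smith today. She agreed."

print(naive_split(simple))
# ['It is raining.', 'We will stay inside.']
print(naive_split(tricky))
# ['I spoke with Mrs.', 'Smith today.', 'She agreed.']
```

The second result breaks after “Mrs.”, which is the kind of error an abbreviation-aware tokenizer is meant to avoid.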
Splitting text with the Natural Language Toolkit for Python
The Natural Language Toolkit (NLTK) for Python provides a convenient way to split text into sentences for many different languages. The following Python code shows how to download pretrained models, or tokenizers, for dividing text into a list of sentences. An English language tokenizer is used in this example. You can find tokenizers for other languages under the
```python
# Be sure to first install nltk and boto3
import nltk.data
import boto3

# Define the source document that needs to be translated
source_document = "My little pony heart is yours..."

# Tell the NLTK data loader to look for resource files in /tmp/
nltk.data.path.append("/tmp/")
# Download NLTK tokenizers to /tmp/
nltk.download('punkt', download_dir='/tmp/')
# Load the English language tokenizer
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
# Split input text into a list of sentences
sentences = tokenizer.tokenize(source_document)
print("Input text length: " + str(len(source_document)))
print("Number of sentences: " + str(len(sentences)))

translated_text = ''
source_text_chunk = ''
translate_client = boto3.client('translate')
source_lang = "en"
target_lang = "fr"

for sentence in sentences:
    # Translate expects utf-8 encoded input to be no more than
    # 5000 bytes, so start a new chunk at the last sentence
    # boundary before that limit would be exceeded.
    if (len(sentence.encode('utf-8')) + len(source_text_chunk.encode('utf-8'))) < 5000:
        source_text_chunk = source_text_chunk + ' ' + sentence
    else:
        print("Translation input text length: " + str(len(source_text_chunk)))
        translation_chunk = translate_client.translate_text(
            Text=source_text_chunk,
            SourceLanguageCode=source_lang,
            TargetLanguageCode=target_lang)
        # The response is a dict; the translated string is under "TranslatedText"
        print("Translation output text length: " + str(len(translation_chunk["TranslatedText"])))
        translated_text = translated_text + ' ' + translation_chunk["TranslatedText"]
        source_text_chunk = sentence

# Translate the final chunk of input text
print("Translation input text length: " + str(len(source_text_chunk)))
translation_chunk = translate_client.translate_text(
    Text=source_text_chunk,
    SourceLanguageCode=source_lang,
    TargetLanguageCode=target_lang)
print("Translation output text length: " + str(len(translation_chunk["TranslatedText"])))
translated_text = translated_text + ' ' + translation_chunk["TranslatedText"]
print("Final translation text length: " + str(len(translated_text)))
```

Running the script prints output like the following:

```
$ python3 app.py
[nltk_data] Downloading package punkt to /tmp/...
[nltk_data]   Package punkt is already up-to-date!
Input text length: 32
Number of sentences: 1
Translation input text length: 33
Translation output text length: 36
Final translation text length: 37
```
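The chunking logic in the sample can also be separated from the AWS calls, which makes it easier to test without credentials. The following is a sketch under a few assumptions: the function name `chunk_sentences` and its `max_bytes` parameter are hypothetical, sentences are rejoined with a single space, and a sentence that is itself larger than the limit raises an error rather than being split.

```python
from typing import Iterable, List

# Assumed limit, taken from the error message earlier in this post.
MAX_REQUEST_BYTES = 5000


def chunk_sentences(sentences: Iterable[str],
                    max_bytes: int = MAX_REQUEST_BYTES) -> List[str]:
    """Group sentences into space-joined chunks of at most max_bytes UTF-8 bytes."""
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if len(sentence.encode("utf-8")) > max_bytes:
            # Assumption: fail loudly rather than split inside a sentence.
            raise ValueError("A single sentence exceeds the byte limit")
        # Only add a separating space when the chunk already has content.
        candidate = sentence if not current else current + " " + sentence
        if len(candidate.encode("utf-8")) <= max_bytes:
            current = candidate
        else:
            # The sentence does not fit, so close this chunk and start a new one.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


# A small limit makes the behavior easy to see.
example_chunks = chunk_sentences(["One two.", "Three four.", "Five."], max_bytes=20)
print(example_chunks)
# ['One two. Three four.', 'Five.']
```

Each returned chunk could then be passed as the `Text` argument to `translate_text`, as in the sample above.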
After you are finished running this example, be sure to turn off the AWS services or resources used to avoid incurring ongoing costs.
In this blog post, I reviewed how you can respect sentence boundaries when splitting large text documents into chunks to be processed by Amazon Translate. As long as you stay within the other service guidelines, there is virtually no limit to the size of documents you can translate with this strategy!
For more information about Amazon Translate, refer to the developer guide.