In the previous post, I described my need to parse many PDF documents in an automated way. We introduced Amazon Textract, a service that can do that at scale with surprising accuracy, and then went through the architecture and limitations. Today we will deploy the solution, parse some documents and use the results.
The complete Textract CloudFormation project is available on GitHub. Follow the instructions there to deploy the stack to one of the regions where Textract is available, for example us-west-2 (Oregon), which also happens to have the lowest Textract prices.
Once the CloudFormation stack is created, look at its Outputs to find the newly created S3 Bucket name. The deployment script will print it too.
~/amazon-textract-cloudformation $ ./deploy.sh
Upload your PDF files to: s3://s3-textract-abcd1234/upload/
Download results from: s3://s3-textract-abcd1234/output/
Now it’s time to upload a file:
aws s3 cp bank-statement.pdf s3://s3-textract-abcd1234/upload/
It will take between 30 seconds and 1 minute for Textract to process the file. You can watch the progress in the Step Functions console.
Once the processing is completed, download the “blocks file” bank-statement.pdf.blocks.json from the output folder:
aws s3 cp s3://s3-textract-abcd1234/output/bank-statement.pdf.blocks.json .
Now we will have a detailed look at how to make use of the output file.
Textract returns the parsed data as a list of Blocks.
The biggest block is a PAGE — one per document for images and single-page PDFs, more PAGEs are returned for multi-page PDFs. The PAGE links to a list of LINEs and each LINE links to a list of WORDs. For each LINE and WORD we get its content (text) and geometry — position and size relative to the PAGE.
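To make the hierarchy concrete, here is a minimal sketch of walking a raw block list by hand. The sample data is a hand-written, heavily trimmed stand-in for a real response (the IDs and text are invented), and children_of and page_outline are hypothetical helpers, not part of any library. The field names BlockType, Id, Text and Relationships follow the Textract response format.

```python
# Hand-written, trimmed sample of a Textract-style block list.
# The IDs and text are invented for illustration only.
sample_blocks = [
    {"BlockType": "PAGE", "Id": "p1",
     "Relationships": [{"Type": "CHILD", "Ids": ["l1", "l2"]}]},
    {"BlockType": "LINE", "Id": "l1", "Text": "ACCOUNT NAME",
     "Relationships": [{"Type": "CHILD", "Ids": ["w1", "w2"]}]},
    {"BlockType": "LINE", "Id": "l2", "Text": "STATEMENT",
     "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
    {"BlockType": "WORD", "Id": "w1", "Text": "ACCOUNT"},
    {"BlockType": "WORD", "Id": "w2", "Text": "NAME"},
    {"BlockType": "WORD", "Id": "w3", "Text": "STATEMENT"},
]

# Index every block by its Id so that references can be resolved quickly.
blocks_by_id = {block["Id"]: block for block in sample_blocks}


def children_of(block):
    """Return the blocks referenced by the CHILD relationships of `block`."""
    children = []
    for relationship in block.get("Relationships", []):
        if relationship["Type"] == "CHILD":
            children.extend(blocks_by_id[child_id] for child_id in relationship["Ids"])
    return children


def page_outline(blocks):
    """Build a {line text: [word texts]} outline from the PAGE blocks."""
    outline = {}
    for block in blocks:
        if block["BlockType"] != "PAGE":
            continue
        for line in children_of(block):
            if line["BlockType"] == "LINE":
                outline[line["Text"]] = [
                    word["Text"] for word in children_of(line)
                    if word["BlockType"] == "WORD"
                ]
    return outline


print(page_outline(sample_blocks))
```

Every reference has to be resolved through the Id index, which is why the raw output quickly becomes tedious to handle manually.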
The geometry of a PAGE is [0,0]x[1,1], we can think of it as [0, 0] to [100%, 100%] of the page size, with [0,0] being the top-left corner of the page. When a WORD is at a position [0.5, 0.5] it means that the WORD’s top-left corner is right in the middle of the page.
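As a rough illustration of these relative coordinates, the sketch below converts a box between page fractions and pixels. The helper names to_pixels and to_relative, the dictionary layout and the 1000x800 page size are assumptions made for this example only.

```python
def to_pixels(rel_box, page_width_px, page_height_px):
    """Scale a box given in page fractions (0..1) to pixel coordinates."""
    return {
        "left": rel_box["left"] * page_width_px,
        "top": rel_box["top"] * page_height_px,
        "width": rel_box["width"] * page_width_px,
        "height": rel_box["height"] * page_height_px,
    }


def to_relative(px_box, page_width_px, page_height_px):
    """Convert a box measured in pixels back to page fractions (0..1)."""
    return {
        "left": px_box["left"] / page_width_px,
        "top": px_box["top"] / page_height_px,
        "width": px_box["width"] / page_width_px,
        "height": px_box["height"] / page_height_px,
    }


# A word whose top-left corner sits right in the middle of the page.
word_box = {"left": 0.5, "top": 0.5, "width": 0.25, "height": 0.125}

# Hypothetical page rendered at 1000 x 800 pixels.
pixel_box = to_pixels(word_box, 1000, 800)
print(pixel_box)
print(to_relative(pixel_box, 1000, 800))
```

The same division is what turns measurements taken on a rendered page image into coordinates that can be compared with Textract's geometry.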
The raw data returned from Textract are quite hard to work with — it’s a bunch of entities with unique IDs, lists, references, geometries, and so on.
Fortunately, there exists a great little “hidden gem”: the Python trp module (where trp probably stands for Textract Results Parser, I’m guessing) that makes working with the returned data a breeze. I found it buried deep inside the amazon-textract-code-samples repository.
To make it easier to use, I have re-packaged it and published it on pypi.org. Now, all we need to do is run pip3 install textract-trp (requires Python 3.6 or newer).
With that module installed the baseline for all the further examples is this code:
import json
import trp

# Load and parse the "*.blocks.json" file returned from Textract
with open("bank-statement.pdf.blocks.json", "rt") as f:
    blocks = json.load(f)
document = trp.Document(blocks)
Let’s examine the document variable. For a start, we can print all the text extracted from the page.
We can also work with the text line by line (document.pages[0].lines) and drill down to the words (document.pages[0].lines[0].words). It’s best to play with it in a Python shell and explore the Document() and the related, nested objects: pages, lines, words, and so on.
Form fields have a key (name, label) and a value. In the bank statement above the keys are the account name and account number.
The values are now very easy to extract. Note that the key is case sensitive.
account_number = document.pages[0].form.getFieldByKey('ACCOUNT NUMBER')
print(account_number.key.text) # "ACCOUNT NUMBER"
print(account_number.value.text) # "87654321-00001"
Likewise, we can extract the “ACCOUNT NAME” field that should give us “MR HOME OWNER” value.
The statement number and the statement period are not recognised as form fields because they are not in the “key: value” format. However, we can still retrieve them by looking at the right place on the page.
What’s the right place though? We have to calculate the Bounding Box with coordinates relative to the page. To do that I rendered the page to an image and took some measurements.
With these values, we can now create a BoundingBox object.
bbox = trp.BoundingBox(top=115/818, left=850/1157,
Now with the bounding box, we can get a list of all Lines contained in that box. We can also verify that the box contains what we expect, for example, that the second line reads “FOR THE PERIOD”.
bbox_lines = document.pages[0].getLinesInBoundingBox(bbox)
if bbox_lines[1].text != "FOR THE PERIOD":
    raise ValueError("Unable to parse statement period")
If the box contains what we expect, we can get the statement number and the period from the other lines in bbox_lines.
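The idea of selecting lines by position can be sketched without any library. The code below is a simplified stand-in rather than the trp implementation: it assumes a line counts as being inside the box when its top-left corner falls within it, and the Box class, the top_left_inside helper and all sample lines and coordinates are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """A rectangle in page-relative coordinates (all values between 0 and 1)."""
    left: float
    top: float
    width: float
    height: float


def top_left_inside(inner, outer):
    """True when the top-left corner of `inner` lies within `outer`."""
    return (outer.left <= inner.left <= outer.left + outer.width
            and outer.top <= inner.top <= outer.top + outer.height)


# Invented lines as (text, box) pairs; not taken from the real statement.
lines = [
    ("STATEMENT NUMBER 12", Box(0.75, 0.15, 0.2, 0.02)),
    ("FOR THE PERIOD", Box(0.75, 0.18, 0.2, 0.02)),
    ("Date", Box(0.05, 0.40, 0.1, 0.02)),
]

# Hypothetical search area in the top-right part of the page.
search_area = Box(left=0.7, top=0.1, width=0.3, height=0.2)

found = [text for text, box in lines if top_left_inside(box, search_area)]
print(found)
```

Checking a known label such as “FOR THE PERIOD” before trusting the neighbouring lines guards against a layout change silently shifting the data out of the box.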
Tables are made up of Rows, and Rows are made up of Cells. We can loop through them and print them out as a CSV file, which is what my accountant wants!
Again, it’s a good idea to perform some validation: verify that the table has the expected number of cells and that the cells contain the expected text, just in case Textract misanalyses the document.
page = document.pages[0]

# Validate the top row - column names
top_row = page.tables[0].rows[0]
if top_row.cells[0].text.strip() != "Date" or len(top_row.cells) != 6:
    raise ValueError("Invalid table format")

# Loop through the rows and cells and print it in CSV format
for row in page.tables[0].rows[1:]:
    print(",".join(cell.text.strip() for cell in row.cells))
That gives us the desired CSV output.
06 Apr,SCHEDULED REPAYMENT,02-0123-0123456-00,,60.90,54772.34 OD
06 Apr,INTEREST CHARGE 35.45,87654321-00001,35.45,,54807.79 OD
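Joining cell text with commas works as long as no cell contains a comma itself. A slightly more defensive variant, sketched here with Python's standard csv module, quotes such cells automatically. The rows are plain lists of made-up values standing in for the cell texts, and rows_to_csv is a hypothetical helper name.

```python
import csv
import io

# Plain lists standing in for the cell texts of each table row (made-up values).
rows = [
    ["06 Apr ", "PAYMENT, REF 1 ", "", "60.90 "],
    ["07 Apr ", "INTEREST ", "35.45 ", ""],
]


def rows_to_csv(rows):
    """Serialise rows to CSV text, trimming whitespace and quoting where needed."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow([cell.strip() for cell in row])
    return buffer.getvalue()


print(rows_to_csv(rows), end="")
```

The cell containing a comma comes out wrapped in double quotes, so a spreadsheet still reads it as a single column.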
Of course, I have done some more parsing and formatting before saving the CSV file — dates, amounts, totals validations, etc. That’s not very interesting and will be different for your needs so we won’t go into the details here.
- Amazon Textract is a very powerful service for extracting structured information from PDFs and images. And with the right library for parsing the results, it’s very easy to use.
- I saved an hour of copying and pasting data from a bunch of PDFs by spending a day building a fully automated solution that then parsed the PDFs in a matter of seconds. Totally worth it 🙂