If you’re building a data lake in the AWS cloud, you’ll most likely want to have metadata and catalog search capability for the underlying data. We recommend Amazon Elasticsearch Service (Amazon ES) for storing and searching S3 keys, and S3 and object metadata. At minimum, your S3 keys include an object name, but they probably include additional, identifying path elements that you want to search. This post tells you how to set up Amazon ES so that you can search your key paths. Along the way, we’ll cover building a custom analyzer for your text fields.
Amazon Elasticsearch Service (Amazon ES) makes it easy to use Elasticsearch to search your application’s data. This post is about Elasticsearch features. Amazon ES is the management layer, through which data, mappings, search requests, etc. pass. For more information on the relationship between Amazon ES and Elasticsearch, see our documentation.
You send documents to Amazon ES along with a mapping. The mapping determines how Elasticsearch parses the values in your fields and creates terms that are the targets for your queries. When you send a query to Amazon ES, Elasticsearch matches the terms in the query with the terms in the document’s fields to determine whether that document matches the query. Elasticsearch scores and returns to you a ranked set of these matches as your query result.
In this article, we’ll focus on string values. Elasticsearch 6.x mappings support two field types for strings: text and keyword. Keyword fields are minimally processed and serve as the basis for exact matching. Text fields are analyzed and broken down into individual words, called terms. (Note that some earlier versions of Elasticsearch used different names for these types.)
Elasticsearch analyzes text fields, passing the text through several processing stages. It passes the stream of characters through a character filter, then applies a tokenizer to emit a stream of terms, and finally applies a token filter on the generated tokens.
You can customize the behavior of each of these stages, as we’ll see below.
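To make the order of the stages concrete, here is a minimal Python sketch of the same three-step flow. It is not how Elasticsearch is implemented; the function names and the sample stages (ampersand_to_and, split_on_whitespace, lowercase) are hypothetical and chosen only to illustrate the sequence.

```python
from typing import Callable, Iterable, List

CharFilter = Callable[[str], str]
Tokenizer = Callable[[str], List[str]]
TokenFilter = Callable[[List[str]], List[str]]


def analyze(text: str,
            char_filters: Iterable[CharFilter],
            tokenizer: Tokenizer,
            token_filters: Iterable[TokenFilter]) -> List[str]:
    """Run text through character filters, a tokenizer, then token filters."""
    for char_filter in char_filters:
        text = char_filter(text)          # stage 1: rewrite the character stream
    tokens = tokenizer(text)              # stage 2: split the characters into tokens
    for token_filter in token_filters:
        tokens = token_filter(tokens)     # stage 3: modify the emitted tokens
    return tokens


# Hypothetical stages, used only to show the order of operations.
def ampersand_to_and(text: str) -> str:
    return text.replace("&", " and ")


def split_on_whitespace(text: str) -> List[str]:
    return text.split()


def lowercase(tokens: List[str]) -> List[str]:
    return [token.lower() for token in tokens]


terms = analyze("Sunlight & Temperature",
                [ampersand_to_and], split_on_whitespace, [lowercase])
print(terms)  # ['sunlight', 'and', 'temperature']
```

Swapping any one of the three callables changes the resulting terms, which is the same lever a custom analyzer gives you.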
Elasticsearch has a number of analyzers built in, including:
- Whitespace – Creates terms by splitting source strings on whitespace and without any additional character or token filtering.
- Simple – Creates terms by splitting source strings on non-letters and converting text to lower case.
- Standard – Creates terms by applying Unicode text segmentation and converting text to lower case. This is the default analyzer for text fields.
- Keyword – The entire input string is used as a single term, unmodified.
- Language analyzers – These language-specific analyzers include specialized, per-language handling of text, including stemming and stop words.
Working with analyzers
You set the analyzer you want to use for a field by specifying it in the mapping when you create the index. For example, the following mapping sets the analyzer for my_field to be the whitespace analyzer:
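A mapping of this kind might look like the sketch below, written as a Python dict and printed as the JSON body of a create-index request. The index name scratch_index and the mapping type name doc are assumptions for illustration; Elasticsearch 6.x expects a single type name in the mapping.

```python
import json

# Hypothetical index name, used only for this example.
index_name = "scratch_index"

# "doc" is an assumed mapping type name; my_field uses the built-in
# whitespace analyzer instead of the default.
create_index_body = {
    "mappings": {
        "doc": {
            "properties": {
                "my_field": {"type": "text", "analyzer": "whitespace"}
            }
        }
    }
}

# The body would be sent with: PUT /scratch_index
print("PUT /" + index_name)
print(json.dumps(create_index_body, indent=2))
```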
Note: You can’t change the analyzer on an existing field in Amazon Elasticsearch Service. When testing your analyzer, use a scratch index that you can easily delete and recreate.
As we work through a sample analysis, we’ll imagine an IoT use case, where solar panels are out in the desert near Santa Fe, New Mexico. The power company has many such installations across the United States, and it stores data about temperature, ambient light, sunlight, etc. This data flows to S3 (s3://devices/region/state/city/geo-ip/date-time/device name.txt) with keys organized by region, state, city, and geo location of the field, and with hourly files for each device.
The power company searches this data by region, city, state, location, and device name. They also use Kibana to monitor for the presence of files based on the timestamp of those files.
You can see how Elasticsearch will analyze any string by using the _analyze API. We used the standard, simple, and whitespace analyzers to process the string “s3://devices/southwest/new-mexico/santa-fe/9wkdvgw781z9/2019-02-08-00/Location-15.txt”. For example, to see the output of the standard analyzer, use the call below. (Note: we’re using a Geohash for the location; 9wkdvgw781z9 specifies a location outside Santa Fe. See Wikipedia for further details about using Geohash.)
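As a rough sketch, the request body for the _analyze API takes an analyzer name and the text to analyze. The helper analyze_request below is hypothetical; it only builds and prints the JSON, and sending it to your domain endpoint is left out.

```python
import json

S3_KEY = ("s3://devices/southwest/new-mexico/santa-fe/"
          "9wkdvgw781z9/2019-02-08-00/Location-15.txt")


def analyze_request(analyzer: str, text: str) -> dict:
    """Build the JSON body for a call to the _analyze API."""
    return {"analyzer": analyzer, "text": text}


# One request per built-in analyzer compared in this post.
for name in ("standard", "simple", "whitespace"):
    print("GET /_analyze")
    print(json.dumps(analyze_request(name, S3_KEY), indent=2))
```

The response lists each token along with its position and character offsets, which makes it easy to compare analyzers side by side.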
Analyzing the preceding URL with the standard, simple, and whitespace analyzers yields the following terms:
Each of these results is almost, but not exactly, right. The standard analyzer parses the filename into three tokens and breaks both the city and state into two terms based on where the hyphens occur. The simple analyzer removes all numbers, including the timestamp, breaking the Geohash into multiple, numeral-free tokens. The whitespace analyzer emits a single token, which doesn’t allow for in-path searching. Each result also retains at least part of the “s3” from the URL scheme.
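The behavior of the simple and whitespace analyzers described above can be approximated in a few lines of Python. This is a sketch under the definitions given earlier (split on non-letters and lowercase, or split on whitespace only), not the Elasticsearch implementation; the standard analyzer is left out because Unicode text segmentation is harder to approximate faithfully.

```python
import re
from typing import List

S3_KEY = ("s3://devices/southwest/new-mexico/santa-fe/"
          "9wkdvgw781z9/2019-02-08-00/Location-15.txt")


def whitespace_terms(text: str) -> List[str]:
    """Approximate the whitespace analyzer: split on whitespace, nothing else."""
    return text.split()


def simple_terms(text: str) -> List[str]:
    """Approximate the simple analyzer: runs of letters, lowercased."""
    # [^\W\d_]+ matches one or more letters (word characters minus digits and "_").
    return [token.lower() for token in re.findall(r"[^\W\d_]+", text)]


print(whitespace_terms(S3_KEY))  # a single term: the whole key
print(simple_terms(S3_KEY))
# ['s', 'devices', 'southwest', 'new', 'mexico', 'santa', 'fe',
#  'wkdvgw', 'z', 'location', 'txt']
```

The second list shows the problems noted above: the timestamp disappears, the Geohash splits into wkdvgw and z, and a stray s survives from the URL scheme.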
Defining a custom analyzer
You define a custom analyzer in the settings portion of the request that creates your index, and apply it by name to fields in the mapping, as shown in the preceding section. To define a custom analyzer, run the following command:
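A request of this kind might look like the sketch below, written as a Python dict and printed as JSON. The analyzer name s3_path_analyzer and the field s3_key come from this post; the index name s3_index, the type name doc, the names s3_path_char_filter and s3_path_tokenizer, and the exact tokenizer pattern are assumptions made for illustration.

```python
import json

create_index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "s3_path_analyzer": {
                    "type": "custom",
                    "char_filter": ["s3_path_char_filter"],  # assumed name
                    "tokenizer": "s3_path_tokenizer",        # assumed name
                    "filter": ["lowercase"],
                }
            },
            "char_filter": {
                "s3_path_char_filter": {
                    "type": "mapping",
                    # Replace "s3:" and "/" with a space, escaped as \u0020.
                    "mappings": ["s3: => \\u0020", "/ => \\u0020"],
                }
            },
            "tokenizer": {
                "s3_path_tokenizer": {
                    "type": "simple_pattern",
                    # Assumed pattern: letters, digits, periods, and hyphens.
                    "pattern": "[a-zA-Z0-9\\.\\-]+",
                }
            },
        }
    },
    "mappings": {
        "doc": {  # assumed mapping type name for Elasticsearch 6.x
            "properties": {
                "s3_key": {"type": "text", "analyzer": "s3_path_analyzer"}
            }
        }
    },
}

# The body would be sent with: PUT /s3_index (hypothetical index name)
print(json.dumps(create_index_body, indent=2))
```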
Our example request defines a custom analyzer s3_path_analyzer, which uses a char_filter, tokenizer, and token filter (filter). The mapping applies the s3_path_analyzer to a single field, s3_key, by defining the field in the mapping properties and using the analyzer directive.
The s3_path_analyzer uses a mapping char_filter to replace instances of “s3:” and “/” with a space (U+0020). It uses a tokenizer that employs simple pattern matching to group together alphanumeric characters while retaining hyphens and periods in the same token. It uses the lowercase filter to lowercase all tokens for better matching.
This analyzer generates the terms [devices, southwest, new-mexico, santa-fe, 9wkdvgw781z9, 2019-02-08-00, location-15.txt].
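You can check these terms without a cluster by mimicking the three stages in plain Python. This is an approximation of the analyzer described above, not the Elasticsearch code path; the regular expression is an assumed equivalent of the tokenizer pattern.

```python
import re
from typing import List

S3_KEY = ("s3://devices/southwest/new-mexico/santa-fe/"
          "9wkdvgw781z9/2019-02-08-00/Location-15.txt")


def s3_path_terms(text: str) -> List[str]:
    """Approximate s3_path_analyzer: char filter, tokenizer, lowercase filter."""
    # Character filter: replace "s3:" and "/" with a space.
    filtered = text.replace("s3:", " ").replace("/", " ")
    # Tokenizer: keep runs of letters, digits, periods, and hyphens together.
    tokens = re.findall(r"[a-zA-Z0-9.\-]+", filtered)
    # Token filter: lowercase every token.
    return [token.lower() for token in tokens]


terms = s3_path_terms(S3_KEY)
print(terms)
# ['devices', 'southwest', 'new-mexico', 'santa-fe', '9wkdvgw781z9',
#  '2019-02-08-00', 'location-15.txt']
```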
Unlike in the previous example, where the analyzer was unable to correctly parse all of the terms, the custom analyzer correctly parses the S3 URL to generate terms for the city, state, geo location, date, and file name. Now you can search (and find!) based on all of the URL’s path elements.
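As an illustration, a search for two path elements at once could use a match query against s3_key. This is a sketch: the index name s3_index is hypothetical, and it assumes the query text is analyzed with the same analyzer as the field, which is the default for a match query.

```python
import json

# Require both path elements to be present in the key.
search_body = {
    "query": {
        "match": {
            "s3_key": {
                "query": "new-mexico 2019-02-08-00",
                "operator": "and",
            }
        }
    }
}

# The body would be sent with: POST /s3_index/_search (hypothetical index name)
print(json.dumps(search_body, indent=2))
```

Because the query string passes through s3_path_analyzer too, new-mexico and 2019-02-08-00 stay whole and line up with the indexed terms.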
We’ve only scratched the surface of Elasticsearch analysis and analyzers. The custom analyzer we just described works well for our example data, but won’t necessarily work in every example. For instance, retaining the periods in the data works in this case, but might not work in all cases. Experiment with the different analyzers that come out of the box with Elasticsearch! You’ll improve your matching and the relevance of your query results.
About the Author
Jon Handler (@_searchgeek) is a Principal Solutions Architect at Amazon Web Services based in Palo Alto, CA. Jon works closely with the CloudSearch and Elasticsearch teams, providing help and guidance to a broad range of customers who have search workloads that they want to move to the AWS Cloud. Prior to joining AWS, Jon’s career as a software developer included four years of coding a large-scale, eCommerce search engine.