Designing dataflows for multi-schema messages in AWS IoT Analytics

The AWS IoT Analytics platform offers configurable dataflows for processing messages from IoT devices, which can be either uni-schema or multi-schema. When you have billions of messages and millions of IoT devices, you need efficient designs for message-processing dataflows.

In this post, you do a deep dive into various aspects of designing dataflows. You learn about the types of dataflows, and then look at a simple use case and try to design a dataflow to match its example requirements.

Overview

This post discusses the following topics:

  • Types of dataflows, including the pros and cons of each, and examples
  • Rules in AWS IoT Core that send messages to AWS IoT Analytics channels
  • Use cases and analysis of how to choose a specific type of dataflow

Consider the following points in each processing phase when selecting a particular dataflow to process IoT messages.

Pre-processing phase:

  • Types of IoT devices
  • Volume of messages
  • Structure of messages (uni-schema vs. multi-schema)
  • Whether devices can be clustered in a group

Processing phase:

  • List of activities for desired transformation
  • Order of activities for desired transformation
  • Schedule of the processing dataflows
  • Desired attributes for the analysis

Post-processing phase:

  • Number of data stores needed
  • Data consumption
    • Ad hoc queries
    • Jupyter notebooks
    • Amazon QuickSight

Types of dataflows

To process IoT messages efficiently on the AWS IoT Analytics platform, understand the different dataflow designs and select the best one for a particular use case. It’s a challenging design problem—you must clean, process, enrich, and transform incoming IoT messages quickly and at a low cost.

For a simple dataflow for ingesting IoT messages, see Presenting AWS IoT Analytics: Delivering IoT Analytics at Scale and Faster than Ever Before.

Here are some general requirements for dataflows:

  1. A pipeline must have a source and destination for the messages.
  2. Each pipeline can read messages from only one channel and can write the messages to only one data store.
  3. Channels can send data through multiple pipelines.
  4. Data stores can generate multiple datasets, but each dataset can be generated from only one data store.

Therefore, each message must go through at least one channel, one pipeline, and one data store. Based on the previous inferences, you can design four different types of dataflows:

  • Type 1: One channel, one pipeline, one data store
  • Type 2: N channels, N pipelines, one data store
  • Type 3: One channel, N pipelines, N data stores
  • Type 4: One channel, N pipelines, one data store

Type 1: One channel, one pipeline, one data store

This design offers a simple yet powerful configuration. A single pipeline sources raw messages from one channel and stores the normalized messages in one data store, as shown in the following diagram.

Type 1 dataflow: 1 channel, 1 pipeline, 1 data store


To keep pipeline operations simple, it’s best if all the messages follow the same structure (uni-schema) or if they vary predictably. For multi-schema messages or when the message schema varies widely across the devices, you might use a more complex pipeline architecture or run an AWS Lambda function for analytical requirements.

Example: Consider a sensor placed in industrial equipment that emits messages about a specific parameter of the equipment, such as water level, internal temperature, or oil levels. The Type 1 dataflow would be best for this particular use case for the easy collection, transformation, and storage of transformed data for future analysis. Given the context of this example, here are the pros and cons of this kind of dataflow design:
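As a rough sketch, a Type 1 dataflow maps to one channel, one pipeline definition, and one data store. The resource names below are hypothetical; in practice you would pass a structure like this to the CreatePipeline API (for example, through boto3’s iotanalytics client):

```python
# Sketch of a Type 1 dataflow definition. The resource names
# (sensor_channel, sensor_pipeline, sensor_datastore) are hypothetical;
# you would pass this structure to the AWS IoT Analytics CreatePipeline API.

CHANNEL_NAME = "sensor_channel"      # assumed channel name
DATASTORE_NAME = "sensor_datastore"  # assumed data store name

# One pipeline: read raw messages from the channel, write to the data store.
pipeline_definition = {
    "pipelineName": "sensor_pipeline",
    "pipelineActivities": [
        {"channel": {"name": "source",
                     "channelName": CHANNEL_NAME,
                     "next": "store"}},
        {"datastore": {"name": "store",
                       "datastoreName": DATASTORE_NAME}},
    ],
}
```

Transformation activities (for example, filter or addAttribute) would be inserted between the channel activity and the datastore activity.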

Pros

  • Good for fixed or uni-schema message processing (or when the message schema varies predictably).
  • All the messages are in one data store, which makes it easier to generate the datasets (for example, no dependency on other data stores).
  • Increased flexibility for analyzing the data, as all the attributes stored are in a normalized manner in the data store.
  • Cost-effective, as each message is ingested and processed only one time and stored twice (once in the channel and once in the data store).

Cons

  • Messages with different schemas can’t be handled.
  • Activities in the pipeline can become complex.

Type 2: N channels, N pipelines, one data store

In this design, multiple pipelines each source data from their respective channel and store the transformed data in one data store, as shown in the following diagram.

Type 2 dataflow: N channels, N pipelines, 1 data store


Each channel can either source data from a single IoT rule or from multiple IoT rules. If you want to perform analysis on different attributes originating from different sources and in different schemas, you can design dedicated pipelines for transforming such multi-schema messages. With this kind of design, you can store relevant data in one data store.

Example: Consider the case of a car company, ABC Corporation, that manufactures three different types of car engines: Engine1, Engine2, and Engine3. These engines are designed for highly specialized applications, and ABC wants to analyze their performance while they’re actively deployed across thousands of cars.

Each type of engine is equipped with different types of sensors and emits data in a different message schema. This use case compares the engine performance across common parameters that can be derived from the engines’ sensor data. To do this, ABC needs to process sensor data originating from each type of engine and design a dedicated transformation pipeline to convert the data into a common format.
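To make the idea concrete, each dedicated pipeline could normalize its engine’s schema into the common format (for example, inside a Lambda activity). The field names here are invented for illustration, not ABC Corporation’s real schemas; only the pattern matters:

```python
# Hypothetical per-engine normalizers; each dedicated pipeline applies one.
# The source field names (ts, time, timestamp, rpm, revolutions, speed_rpm)
# are assumptions made for this sketch.

def normalize_engine1(msg):
    return {"engine": "Engine1", "epoch": msg["ts"], "rpm": msg["rpm"]}

def normalize_engine2(msg):
    return {"engine": "Engine2", "epoch": msg["time"], "rpm": msg["revolutions"]}

def normalize_engine3(msg):
    return {"engine": "Engine3", "epoch": msg["timestamp"], "rpm": msg["speed_rpm"]}
```

Because all three normalizers emit the same attributes, the shared data store holds directly comparable records.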

For these types of use cases, this kind of dataflow design may be a good choice. Given the context of this example, here are the pros and cons of this type:

Pros

  • Multi-schema messages are much easier to process than with the Type 1 dataflow.
  • Pipeline functionalities are easy to limit for a specific type of transformation.
  • Datasets are easy to generate from one data store, so you don’t have to worry about joining multiple data stores.
  • Messages only go through the pipeline one time, saving the cost of replication at the channel level.

Cons

  • The data store can grow quickly because it’s fed by multiple pipelines.
  • Queries must scan more data when creating datasets, which can increase the cost of creating datasets as the data store grows.

Type 3: One channel, N pipelines, N data stores

This design replicates the messages at the channel output. Each pipeline gets the same raw message received by the channel, and the channel acts as a broadcaster device. The following diagram shows this configuration.

Type 3 dataflow: One channel, N pipelines, N data stores


Each pipeline can also act as a filter dedicated to a specific type of message, so each data store holds only that particular type of message.

Example: Consider a fleet of cars emitting messages about engine, tires, cabin temperature, surrounding temperatures, and so on. You want to analyze each part’s performance, and can have dedicated pipelines that each process messages from individual parts and filter out the rest of the messages. At the end of each pipeline, each data store contains only the processed messages of certain parts.
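One of these dedicated pipelines might look like the following sketch, where a filter activity keeps only engine messages. The channel name, data store name, and the `part` attribute are assumptions:

```python
# Sketch of one Type 3 pipeline: it reads the same channel as its sibling
# pipelines, but a filter activity keeps only engine messages before they
# reach a dedicated data store. "fleet_channel", "engine_datastore", and
# the "part" attribute are assumed names for this illustration.
engine_pipeline_activities = [
    {"channel": {"name": "source",
                 "channelName": "fleet_channel",
                 "next": "keep_engine"}},
    {"filter": {"name": "keep_engine",
                "filter": "part = 'engine'",  # illustrative filter expression
                "next": "store"}},
    {"datastore": {"name": "store",
                   "datastoreName": "engine_datastore"}},
]
```

Sibling pipelines for tires, cabin temperature, and so on would differ only in their filter expression and target data store.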

Given the context of this example, here are the pros and cons of this kind of dataflow design:

Pros

  • Can perform multiple independent transformations on the same message.
  • By filtering messages in the pipelines, individual data stores can be designed to store only subsets of all IoT messages.
  • Queries for generating datasets don’t have to scan as much data when pipelines are configured to filter out messages.

Cons

  • Must consider the cost factor, as such configurations can increase cost due to the replication of messages among multiple pipelines. An alternative solution for such dataflow would be to create multiple channels, with each channel configured to receive specific types of messages from the IoT rule engine. In such a case, the solution would look like you repeated the Type 1 solution multiple times.

Type 4: One channel, N pipelines, one data store

As with Type 3, in this type of dataflow, the messages are replicated at a channel level. Each pipeline gets the same raw message received by the channel, as shown in the following diagram.

Type 4 dataflow: 1 channel, N pipelines, 1 data store


This kind of configuration is good for cases in which you want to derive different types of parameters. It’s also good for enriching the data with different types of information for data that originates from a common source.

Example: Consider route analysis for network routers. A network device manufacturing company performs trend analysis based on the health and performance of the network as the volume and composition of messages change over time. The company can use the data emitted by the network routers and design two different pipelines: one for health- and performance-related transformations and another for aggregations related to message volume and composition.

At the end, all transformed data can be put into one data store for further analysis based on notebooks or SQL-based datasets in the AWS IoT Analytics applications portal. Given the context of this example, here are the pros and cons of this kind of dataflow design:

Pros

  • Ability to perform multiple independent transformations on the same message.
  • The transformed messages end up in one data store, making it easier to generate multiple datasets.

Cons

  • Increased cost due to replicating messages among multiple pipelines. An alternative solution for such dataflow is to create multiple channels, with each channel configured to receive specific types of messages from an IoT rule. In such a case, the solution appears to repeat the Type 1 dataflow multiple times.

Connecting the AWS IoT rules engine to AWS IoT Analytics

AWS IoT rules give you the ability to interact with AWS services—forward messages to Amazon DynamoDB, notify users through Amazon SNS, or send messages to AWS IoT Analytics channels. There are two ways to connect rules to channels:

  • One AWS IoT rule can replicate messages to multiple channels, as shown in the following diagram.
1 IoT Rule to N IoT Analytics Channel


  • Multiple rules can forward messages to the same channel, as shown in the following diagram.
N IoT Rules to 1 AWS IoT Channel

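As a hedged sketch, a topic rule that forwards messages to a channel pairs a SQL statement with an iotAnalytics action, as accepted by the CreateTopicRule API. The topic filter, channel name, and role ARN below are placeholders:

```python
# Sketch of an AWS IoT topic rule payload with an iotAnalytics action.
# The topic filter, channel name, and role ARN are placeholders,
# not real resources.
topic_rule_payload = {
    "sql": "SELECT * FROM 'rtjm/+/weather/#'",  # assumed topic filter
    "ruleDisabled": False,
    "actions": [
        {"iotAnalytics": {
            "channelName": "weather_channel",  # assumed channel name
            "roleArn": "arn:aws:iam::123456789012:role/example-iot-analytics-role",
        }},
    ],
}
```

Adding more entries to the `actions` list replicates messages to multiple channels; conversely, several rules can name the same channel.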

Use case

Consider a weather sensor device and a barometer sensor device responsible for the following:

  • Weather sensor: Sends out messages about humidity and temperature readings.
  • Barometer sensor: Sends out messages about air pressure and temperature readings.

There are four different types of messages coming from these sensors, as shown in the following chart.

  1. Weather Sensor Messages
    1. Message 1
      {
        "type": "weather",
        "source": {
          "epoch": 1508629428,
          "humidity": 67,
          "unit": "RH",
          "full_topic": "rtjm/A020A6008464/weather/humidity"
        }
      }
    2. Message 2
      {
        "type": "weather",
        "source": {
          "epoch": 1508629550,
          "temperature": 12,
          "unit": "C",
          "full_topic": "rtjm/A020A6008464/weather/temperature"
        }
      }
  2. Barometer Sensor Messages
    1. Message 1
      {
        "type": "barometer",
        "epoch": 1508311359,
        "pressure": 100660.53,
        "unit": "Pa",
        "full_topic": "rtjm/A020A6008464/barometer/pressure"
      }
    2. Message 2
      {
        "type": "barometer",
        "epoch": 1508311370,
        "temperature": 19.41,
        "unit": "C",
        "full_topic": "rtjm/A020A6008464/barometer/temperature"
      }

Because the schemas of weather messages and barometer messages are different, the same method can’t be used to extract the key-value pairs from both.

For the final dataset, the goal is to see the attributes shown in the following list in one data store (the unit and full_topic attributes can be dropped).

  1. type – Type of device from which the data originated
  2. epoch – Timestamp when the message was generated
  3. humidity – Relative humidity as a percentage
  4. pressure – Barometric pressure reading (in pascals)
  5. temperature_weather – Temperature (in °C) sensed by the weather sensor
  6. temperature_barometer – Temperature (in °C) sensed by the barometer sensor
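A SQL dataset over the shared data store could then select exactly these attributes; the data store name sensors_datastore is an assumption for this sketch:

```python
# Hypothetical SQL action for a dataset that selects the six desired
# attributes from a shared data store named "sensors_datastore".
dataset_sql = (
    "SELECT type, epoch, humidity, pressure, "
    "temperature_weather, temperature_barometer "
    "FROM sensors_datastore"
)
```

This string would become the `sqlQuery` of a SQL dataset action in AWS IoT Analytics.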

Feed all the messages through the AWS IoT console.

The following section evaluates the four different types of dataflows and describes how to choose one for this use case. Remember the following list that you saw earlier in the Overview section.

Pre-processing phase:

  • Types of IoT devices – 2
  • Volume of messages – 10,000 per device
  • Structure of messages (uni-schema vs. multi-schema) – Multi-schema messages
  • Whether devices can be clustered in a group – Each device belongs to its own cluster.

Processing phase:

  • List of activities for desired transformation
    • Filter, addAttribute, removeAttribute
    • Lambda function
  • Order of activities for desired transformation
    • Filter, addAttribute, removeAttribute
  • Schedule of the processing dataflows
  • Desired attributes for the analysis
    • Type, epoch, humidity, pressure, temperature_weather, temperature_barometer

Post-processing phase:

  • Number of data stores needed
  • Data consumption

Choosing a dataflow

Now, consider each type of dataflow and evaluate its applicability to this use case:

  • Type 1: Because the messages have different schemas, a single pipeline can’t handle all incoming messages without having a complex architecture or Lambda function.
  • Type 2: This kind of dataflow allows you to process messages of multiple schemas through their designated channels. But be careful how you choose the attributes and messages you want to store in the data store; a data store should contain only those attributes and messages that add value to the datasets. For the messages shown earlier, you can set up one pipeline for processing weather messages and another pipeline for processing the barometer messages. Because the messages are partitioned among various channels, each message goes through the pipeline once, thus reducing time and cost. Therefore, Type 2 appears to be a good choice for this use case.
  • Type 3: This kind of configuration would not be a good choice for this use case. Why? You don’t intend to replicate the messages for multiple pipelines, and you don’t want the weather data and barometer data stored in different data stores.
  • Type 4: Similar to Type 3, this kind of configuration would not be a good choice for this use case, because you do not intend to replicate messages across multiple pipelines; each pipeline would be processing unnecessary data.

Solution

Based on these findings, the best choice for this use case is the Type 2 dataflow, with N channels, N pipelines, and one data store. You can create two channels, each sourcing messages from their respective IoT rule, and each type of device can have its own IoT rule. From the channels, respective pipelines can source the messages and push out the processed messages to the same data store.

Create two pipelines for processing messages, as the schemas of the weather and barometer messages are different. This also helps you handle the different types of messages originating from each device. The following diagram shows this configuration.

Sample solution for the use case


The weather pipeline has one Lambda function activity to perform the following operations:

  • Remove the fields that you marked as not required.
  • Filter out the messages where the humidity or temperature value is 0.
  • Rename the temperature field to “temperature_weather”.
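A minimal sketch of that Lambda function follows, assuming AWS IoT Analytics invokes it with a batch (list) of messages shaped like the weather samples shown earlier:

```python
# Hedged sketch of the weather pipeline's Lambda activity. AWS IoT Analytics
# passes a list of messages to the function and expects the transformed list
# back. Field names follow the sample weather messages above.

def lambda_handler(event, context):
    transformed = []
    for msg in event:
        source = msg.get("source", {})
        # Filter out messages where humidity or temperature reads 0.
        if source.get("humidity") == 0 or source.get("temperature") == 0:
            continue
        # Keep only the attributes needed for analysis (drops unit, full_topic).
        out = {"type": msg.get("type"), "epoch": source.get("epoch")}
        if "humidity" in source:
            out["humidity"] = source["humidity"]
        if "temperature" in source:
            # Rename temperature to temperature_weather.
            out["temperature_weather"] = source["temperature"]
        transformed.append(out)
    return transformed
```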

Meanwhile, the barometer pipeline has multiple pipeline activities to perform the following operations:

  • Remove the fields that you marked as not required.
  • Filter out messages where the pressure or temperature value is 0.
  • Rename the temperature field to “temperature_barometer”.
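Those three operations can be expressed with built-in pipeline activities rather than a Lambda function. In this sketch the channel and data store names are assumptions, the filter expression is illustrative, and the addAttribute/removeAttribute pair implements the rename by copying the value and then dropping the original field:

```python
# Sketch of the barometer pipeline's activity chain (names are assumptions).
# addAttribute copies temperature into temperature_barometer, and
# removeAttribute then drops the original field along with unit/full_topic.
barometer_activities = [
    {"channel": {"name": "source",
                 "channelName": "barometer_channel",
                 "next": "drop_zero"}},
    {"filter": {"name": "drop_zero",
                "filter": "pressure <> 0 AND temperature <> 0",  # illustrative expression
                "next": "rename_temp"}},
    {"addAttribute": {"name": "rename_temp",
                      "attributes": {"temperature": "temperature_barometer"},
                      "next": "drop_fields"}},
    {"removeAttribute": {"name": "drop_fields",
                         "attributes": ["temperature", "unit", "full_topic"],
                         "next": "store"}},
    {"datastore": {"name": "store",
                   "datastoreName": "sensors_datastore"}},
]
```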

As you remove the unnecessary attributes, you decrease the storage size needed for the processed messages.

Conclusion

This post described four different dataflow designs and how to choose one for various types of use cases. You learned to consider information at each processing phase, and how to define the correct order of the pipeline activities for transforming the messages.

Although this post is not exhaustive, it should guide you in making design choices to create your AWS IoT Analytics workflows. For more information, see Using AWS IoT Analytics to Prepare Data for QuickSight Time-Series Analysis.