Associating prediction results with input data using Amazon SageMaker Batch Transform

When you run predictions on large datasets, you may want to drop some input attributes before running the predictions. This is because those attributes don’t carry any signal or were not part of the dataset used to train your machine learning (ML) model. Similarly, it can be helpful to map the prediction results to all or part of the input data for analysis after the job is complete.

For example, consider a dataset that comes with an ID attribute. Commonly, an observation ID is a randomly generated or sequential number that carries no signal for a given ML problem. For this reason, it is usually not part of the training data attributes. However, when you make batch predictions, you may want your output to contain both the observation ID and the prediction result as a single record.

The Batch Transform feature in Amazon SageMaker enables you to run predictions on datasets stored in Amazon S3. Previously, you had to filter your input data before creating your batch transform job and join prediction results with desired input fields after the job was complete. Now, you can use Amazon SageMaker Batch Transform to exclude attributes before running predictions. You can also join the prediction results with partial or entire input data attributes when using data that is in CSV, text, or JSON format. This eliminates the need for any additional pre-processing or post-processing and accelerates the overall ML process.

This post demonstrates how you can use this new capability to filter input data for a batch transform job in Amazon SageMaker and join the prediction results with attributes from the input dataset.

Background

Amazon SageMaker is a fully managed service that covers the entire ML workflow. The service labels and prepares your data, chooses an algorithm, trains the model, tunes and optimizes it for deployment, makes predictions, and takes action.

Amazon SageMaker manages the provisioning of resources at the start of batch transform jobs. It releases the resources when the jobs are complete, so you pay only for what was used during the execution of your job. When the job is complete, Amazon SageMaker saves the prediction results in an S3 bucket that you specify.

Batch transform example

Use the public data set for breast cancer detection from UCI and train a binary classification model to detect whether a given tumor is likely to be malignant (1) or benign (0). This dataset comes with an ID attribute for each tumor, which you exclude during training and prediction. However, you bring it back in your final output and record it with the predicted probability of malignancy for each tumor from the batch transform job.

You can also download the companion Jupyter notebook. Each of the following sections in the post corresponds to a notebook section so that you can run the code for each step as you read along.

Setup

First, import common Python libraries for ML such as pandas and NumPy, along with the Amazon SageMaker and Boto3 libraries that you later use to run the training and batch transform jobs.

Also, set up your S3 bucket for uploading your training data, validation data, and the dataset against which you run the batch transform job. Amazon SageMaker stores the model artifact in this bucket, as well as the output of the batch transform job. Use a folder structure to keep the input datasets separate from the model artifacts and job outputs.

import os
import boto3
import sagemaker
import pandas as pd
import numpy as np

role = sagemaker.get_execution_role()
sess = sagemaker.Session()

bucket=sess.default_bucket()
prefix = 'sagemaker/breast-cancer-prediction-xgboost' # place to upload training files within the bucket

Data preparation

Download the public data set onto the notebook instance and look at a sample for preliminary analysis. Although the dataset for this example is small (with 569 observations and 32 columns), you can use the Amazon SageMaker Batch Transform feature on large datasets with petabytes of data.

data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data', header = None)

# specify columns extracted from wbdc.names
data.columns = ["id","diagnosis","radius_mean","texture_mean","perimeter_mean","area_mean","smoothness_mean",
                "compactness_mean","concavity_mean","concave points_mean","symmetry_mean","fractal_dimension_mean",
                "radius_se","texture_se","perimeter_se","area_se","smoothness_se","compactness_se","concavity_se",
                "concave points_se","symmetry_se","fractal_dimension_se","radius_worst","texture_worst",
                "perimeter_worst","area_worst","smoothness_worst","compactness_worst","concavity_worst",
                "concave points_worst","symmetry_worst","fractal_dimension_worst"] 

In the following table, the first column in your dataset is the ID of the tumors and the second is the diagnosis (M for malignant or B for benign). In the context of supervised learning, this is your target, or what you want to be able to predict. The following attributes are the features also known as predictors.

iddiagnosisradius_meantexture_meanperimeter_meanconcave points_worstsymmetry_worstfractal_dimension_worst
2888913049B11.2619.9673.720.093140.29550.07009
375901303B16.1716.07106.30.12510.31530.0896
4679113514B9.66818.161.060.0250.30570.07875
20387880M13.8123.7591.560.20130.44320.1086
14886973702B14.4415.1893.970.15990.26910.07683
118864877M15.7822.91105.70.20340.32740.1252
2248813129B13.2717.0284.550.096780.25060.07623
3649010877B13.416.9585.480.069870.27410.07582

After doing some minimal data preparation, split the data into three sets:

  • A training set consisting of 80% of your original data.
  • A validation set for your algorithm to perform the proper evaluation of the model.
  • A batch set that you set aside for now and use later to run a batch transform job using the new I/O join feature.

To train and validate the model, keep all the features, such as radius_mean, texture_mean, perimeter_mean, and so on. Drop the id attribute, because it has no relevance in determining whether a tumor is malignant.

When you have a trained model in production, you typically want to run predictions against it. One way to do that is to deploy the model for real-time predictions using the Amazon SageMaker hosting services.

However, in your case, you do not need real-time predictions. Instead, you have a backlog of tumors as a .csv file in Amazon S3, which consists of a list of tumors identified by their ID. Use a batch transform job to predict, for each tumor, the probability of being malignant. To create this backlog of tumors here, make a batch set with the id attribute but without the diagnosis attribute. That’s what you’re trying to predict with your batch transform job.

The following code example shows the configuration of data between the three datasets:

# replace the M/B diagnosis with a 1/0 boolean value
data['diagnosis']=data['diagnosis'].apply(lambda x: ((x =="M"))+0) 

# data split in three sets, training, validation and batch inference
rand_split = np.random.rand(len(data))
train_list = rand_split < 0.8
val_list = (rand_split >= 0.8) & (rand_split < 0.9)
batch_list = rand_split >= 0.9

data_train = data[train_list].drop(['id'],axis=1)
data_val = data[val_list].drop(['id'],axis=1)
data_batch = data[batch_list].drop(['diagnosis'],axis=1)

data_train = data[train_list].drop(['id'],axis=1)
data_val = data[val_list].drop(['id'],axis=1)
data_batch = data[batch_list].drop(['diagnosis'],axis=1)

Finally, upload these three datasets to S3.

train_file = 'train_data.csv'
data_train.to_csv(train_file,index=False,header=False)
sess.upload_data(train_file, key_prefix='{}/train'.format(prefix))

validation_file = 'validation_data.csv'
data_val.to_csv(validation_file,index=False,header=False)
sess.upload_data(validation_file, key_prefix='{}/validation'.format(prefix))

batch_file = 'batch_data.csv'
data_batch.to_csv(batch_file,index=False,header=False)
sess.upload_data(batch_file, key_prefix='{}/batch'.format(prefix))  

Training job

Use the Amazon SageMaker XGBoost built-in algorithm to quickly train a model for binary classification based on your training and validation datasets. Set the training objective to binary:logistic, which trains XGBoost to output the probability that an observation belongs to the positive class (malignant in this example), as shown in the following code example:

%%time
from time import gmtime, strftime
from sagemaker.amazon.amazon_estimator import get_image_uri

job_name = 'xgb-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
output_location = 's3://{}/{}/output/{}'.format(bucket, prefix, job_name)
image = get_image_uri(boto3.Session().region_name, 'xgboost')

sm_estimator = sagemaker.estimator.Estimator(image,
                                             role,
                                             train_instance_count=1,
                                             train_instance_type='ml.m5.4xlarge',
                                             train_volume_size=50,
                                             input_mode='File',
                                             output_path=output_location,
                                             sagemaker_session=sess)

sm_estimator.set_hyperparameters(objective="binary:logistic",
                                 max_depth=5,
                                 eta=0.2,
                                 gamma=4,
                                 min_child_weight=6,
                                 subsample=0.8,
                                 silent=0,
                                 num_round=100)

train_data = sagemaker.session.s3_input('s3://{}/{}/train'.format(bucket, prefix), distribution='FullyReplicated', 
                                        content_type='text/csv', s3_data_type='S3Prefix')
validation_data = sagemaker.session.s3_input('s3://{}/{}/validation'.format(bucket, prefix), distribution='FullyReplicated', 
                                             content_type='text/csv', s3_data_type='S3Prefix')
data_channels = {'train': train_data, 'validation': validation_data}

# Start training by calling the fit method in the estimator
sm_estimator.fit(inputs=data_channels, logs=True)

Batch transform

Use the Python SDK to kick off the batch transform job to run inferences on your batch dataset and store your inference results in S3.

Your batch dataset contains the id attribute in the first column. As I explained earlier, because this attribute was not used for training, you must use the new input filter capability. Also, before the I/O join feature, the output of the batch transform job would have been a list of probabilities, such as the following:

0.0226082857698

0.987275004387

0.836603999138

0.00795079022646

0.0182465240359

0.995905399323

0.0129367504269

0.961541593075

0.988895177841

It would have required some post-inference logic to map these probabilities to their corresponding input tumor. The Batch Transform filters make this easy.

Specify the input as the join source. Then, specify an output filter to indicate that you do not require the entire input (which is the ID followed by the 30 features). Instead, you want to present the tumor ID and its probability of being malignant. And you only want to show the first (id) and the last (inference result) columns.

The Batch Transform filters use JSONPath expressions to selectively extract the input or output data, based on your needs. For the supported JSONPath operators in Batch Transform, see this link.

JSONPath is developed to work with JSON data. To use it on CSV data, consider a CSV row as a JSON array with a zero-based index. For example, applying ‘$[0,1]’ on a row of ‘8810158, B, 13.110, 22.54, 87.02’ returns the first and second column of the row, which is ‘8810158, B’.

In this example, the input filter is “$[1:]”. You are excluding column 0 (id) before processing the inferences, and keeping everything from column 1 to the last column

(all the features or predictors). The output filter is “$[0,-1].” When presenting the output, you only want to keep column 0 (id) and the last (-1) column (inference_result), which is the probability of a given tumor to be malignant.

You could also consider bringing back input columns other than id. In your current example, where you happen to have the ground truth diagnosis, you could consider bringing it back in your output file next to the prediction. That way, you could do side-by-side comparisons and evaluate your predictions.

%%time

sm_transformer = sm_estimator.transformer(1, 'ml.m4.xlarge', assemble_with = 'Line', accept = 'text/csv')

# start a transform job
input_location = 's3://{}/{}/batch/{}'.format(bucket, prefix, batch_file) # use input data with ID column
sm_transformer.transform(input_location, split_type='Line', content_type='text/csv', input_filter='$[1:]', join_source='Input', output_filter='$[0,-1]')
sm_transformer.wait()

Result

Read the CSV output in S3 at your output location:

import json
import io
from urllib.parse import urlparse

def get_csv_output_from_s3(s3uri, file_name):
    parsed_url = urlparse(s3uri)
    bucket_name = parsed_url.netloc
    prefix = parsed_url.path[1:]
    s3 = boto3.resource('s3')
    obj = s3.Object(bucket_name, '{}/{}'.format(prefix, file_name))
    return obj.get()["Body"].read().decode('utf-8')

output = get_csv_output_from_s3(sm_transformer.output_path, '{}.out'.format(batch_file))
output_df = pd.read_csv(io.StringIO(output), sep=",", header=None)
output_df.sample(10)    

It should show the list of tumors identified by their ID and their corresponding probabilities of being malignant. Your file should look like the following:

844359,0.990931391716 

84458202,0.968179702759 

8510824,0.006071804557 

855138,0.780932843685 

857155,0.0154032697901 

857343,0.0171982143074 

861598,0.540158748627 

86208,0.992102086544 

862261,0.00940885581076 

862989,0.00758415739983 

864292,0.006071804557 

864685,0.0332484431565 

....

Conclusion

This post demonstrated how you can provide input and output filters for your batch transform jobs using the Amazon SageMaker Batch Transform feature. This eliminates the need to pre-process or post-process input and output data respectively. In addition, you can associate prediction results with their corresponding input data with the flexibility of keep all or part of the input data attributes. To learn more about this feature, see the Amazon SageMaker Developer Guide.


About the Authors

Ro Mullier is a Sr. Solutions Architect at AWS helping customers run a variety of applications on AWS and machine learning workloads in particular. In his spare time, he enjoy spending time with family and friends, playing soccer and competing in machine learning competitions.

 

 

 

Han Wang is a Software Development Engineer at AWS AI. She focuses on developing highly scalable and distributed machine learning platforms. In her spare time, she enjoys watching movies, hiking and playing “Zelda: Breath of the wild”.