Amazon SageMaker Adds Batch Transform Feature and Pipe Input Mode for TensorFlow Containers

At the New York Summit a few days ago we launched two new Amazon SageMaker features: Batch Transform, a new batch inference feature that lets customers make predictions in non-real-time scenarios across petabytes of data, and Pipe Input Mode support for TensorFlow containers. SageMaker remains one of my favorite services and we’ve covered it extensively on this blog and the machine learning blog. In fact, the rapid pace of innovation from the SageMaker team is a bit hard to keep up with. Since our last post on SageMaker’s Automatic Model Tuning with Hyperparameter Optimization, the team has launched four new built-in algorithms and tons of new features. Let’s take a look at the new Batch Transform feature.

Batch Transform

The Batch Transform feature is a high-performance and high-throughput method for transforming data and generating inferences. It’s ideal for scenarios where you’re dealing with large batches of data, don’t need sub-second latency, or need to both preprocess and transform the training data. The best part? You don’t have to write a single additional line of code to make use of this feature. You can take all of your existing models and start batch transform jobs based on them. This feature is available at no additional charge and you pay only for the underlying resources.

Let’s take a look at how we would do this for the built-in Object Detection algorithm. I followed the example notebook to train my object detection model. Now I’ll go to the SageMaker console and open the Batch Transform sub-console.

From there I can start a new batch transform job.

Here I can name my transform job, select which of my models I want to use, and choose the number and type of instances to run it on. Additionally, I can configure how many records are sent for inference concurrently and the maximum size of each payload. If I don’t manually specify these then SageMaker will select some sensible defaults.

Next I need to specify my input location. I can either use a manifest file or just load all the files in an S3 location. Since I’m dealing with images here I’ve manually specified my input content-type.
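
For the manifest option, here is a minimal sketch of building such a file in Python. It assumes the S3 manifest layout that SageMaker documents for manifest-file data sources: a JSON array whose first element is a prefix object, followed by object keys relative to that prefix. The helper name, bucket, and image names are hypothetical.

```python
import json


def build_manifest(prefix, relative_keys):
    """Return manifest JSON text: a prefix entry followed by keys relative to it."""
    if not prefix.startswith("s3://"):
        raise ValueError("prefix must be an S3 URI")
    entries = [{"prefix": prefix}] + list(relative_keys)
    return json.dumps(entries, indent=2)


# Hypothetical bucket and image names, for illustration only.
manifest_text = build_manifest(
    "s3://my-example-bucket/images/",
    ["street-001.jpg", "street-002.jpg"],
)
print(manifest_text)
```

The resulting text would be uploaded to S3 and its location given as the job input in place of a plain prefix.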

Finally, I’ll configure my output location and start the job!

Once the job is running, I can open the job detail page and follow the links to the metrics and the logs in Amazon CloudWatch.
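
The status shown on the job detail page can also be read through the API. The sketch below polls the DescribeTransformJob operation through a boto3 SageMaker client until the job reaches a terminal state. The function name, polling interval, and job name are assumptions made for illustration.

```python
import time

# States after which the job will not change again.
TERMINAL_STATES = {"Completed", "Failed", "Stopped"}


def wait_for_transform_job(client, job_name, poll_seconds=30, max_polls=120):
    """Poll DescribeTransformJob until the job reaches a terminal state."""
    for _ in range(max_polls):
        response = client.describe_transform_job(TransformJobName=job_name)
        status = response["TransformJobStatus"]
        if status in TERMINAL_STATES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"{job_name} still running after {max_polls} polls")


# Usage (requires boto3, AWS credentials, and a configured region):
#   import boto3
#   status = wait_for_transform_job(boto3.client("sagemaker"), "my-transform-job")
```

The client is passed in as an argument so the polling logic can be exercised with a stand-in object before pointing it at a real account.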

I can see the job is running and if I look at my results in S3 I can see the predicted labels for each image.

The transform generated one output JSON file per input file containing the detected objects.
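
As a rough sketch of working with one of these files, the snippet below keeps only the detections above a confidence threshold. It assumes the response layout of the built-in Object Detection algorithm, a "prediction" list whose entries are [class index, score, xmin, ymin, xmax, ymax]. The sample values are made up and are not real model output.

```python
import json


def confident_detections(output_text, threshold=0.5):
    """Return detections whose score is at or above the threshold."""
    detections = json.loads(output_text)["prediction"]
    return [
        {"class_id": int(d[0]), "score": d[1], "box": d[2:6]}
        for d in detections
        if d[1] >= threshold
    ]


# Made-up sample in the assumed layout, for illustration only.
sample = (
    '{"prediction": [[0.0, 0.91, 0.1, 0.2, 0.5, 0.8],'
    ' [3.0, 0.12, 0.4, 0.4, 0.6, 0.9]]}'
)
kept = confident_detections(sample)
print(kept)
```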

From here it would be easy to create a table for the bucket in AWS Glue and either query the results with Amazon Athena or visualize them with Amazon QuickSight.

Of course it’s also possible to start these jobs programmatically from the SageMaker API.
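
Here is a minimal sketch of what that could look like with boto3 and the CreateTransformJob operation. The request mirrors the console fields above: model, concurrency and payload limits, input location and content type, output location, and instances. The job name, model name, bucket paths, instance type, and limit values are placeholders chosen for illustration, not recommendations.

```python
def build_transform_request(job_name, model_name, input_uri, output_uri):
    """Assemble the arguments for a CreateTransformJob call."""
    return {
        "TransformJobName": job_name,
        "ModelName": model_name,
        # Optional tuning knobs; SageMaker picks defaults when they are omitted.
        "MaxConcurrentTransforms": 1,
        "MaxPayloadInMB": 6,
        "TransformInput": {
            "DataSource": {
                "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": input_uri}
            },
            "ContentType": "image/jpeg",
        },
        "TransformOutput": {"S3OutputPath": output_uri},
        "TransformResources": {"InstanceType": "ml.p2.xlarge", "InstanceCount": 1},
    }


def start_transform_job(request):
    """Submit the request; needs boto3, AWS credentials, and a configured region."""
    import boto3

    client = boto3.client("sagemaker")
    return client.create_transform_job(**request)


# Placeholder names, for illustration only.
request = build_transform_request(
    "my-object-detection-transform",
    "my-object-detection-model",
    "s3://my-example-bucket/images/",
    "s3://my-example-bucket/predictions/",
)
```

Calling start_transform_job(request) would submit the job. Building the request separately makes it easy to inspect or log before anything is launched.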

You can find a lot more detail on how to use batch transforms in your own containers in the documentation.

Pipe Input Mode for TensorFlow

Pipe input mode allows customers to stream their training dataset directly from Amazon Simple Storage Service (S3) into Amazon SageMaker using a highly optimized multi-threaded background process. This mode offers significantly better read throughput than File input mode, which must first download the data to the local Amazon Elastic Block Store (EBS) volume. This means your training jobs start sooner, finish faster, and use less disk space, lowering the costs associated with training your models. It has the added benefit of letting you train on datasets beyond the 16 TB EBS volume size limit.

Earlier this year, we ran some experiments with Pipe Input Mode and found that startup times were reduced by up to 87% on a 78 GB dataset, with throughput twice as fast in some benchmarks, ultimately resulting in up to a 35% reduction in total training time.

By adding support for Pipe Input Mode to TensorFlow we’re making it easier for customers to take advantage of the same increased speed available to the built-in algorithms. Let’s look at how this works in practice.

First, I need to make sure I have the sagemaker-tensorflow-extensions available for my training job. This gives us the new PipeModeDataset class which takes a channel and a record format as inputs and returns a TensorFlow dataset. We can use this in our input_fn for the TensorFlow estimator and read from the channel. The code sample below shows a simple example.

import tensorflow as tf
from sagemaker_tensorflow import PipeModeDataset

def input_fn(channel):
    # Simple example data - a labeled vector.
    features = {
        'data': tf.FixedLenFeature([], tf.string),
        'labels': tf.FixedLenFeature([], tf.int64),
    }
    
    # A function to parse record bytes to a labeled vector record
    def parse(record):
        parsed = tf.parse_single_example(record, features)
        return ({
            'data': tf.decode_raw(parsed['data'], tf.float64)
        }, parsed['labels'])

    # Construct a PipeModeDataset reading from the given channel (for example
    # 'training'), using the TFRecord encoding.
    ds = PipeModeDataset(channel=channel, record_format='TFRecord')

    # The PipeModeDataset is a TensorFlow Dataset and provides standard Dataset methods
    ds = ds.repeat(20)
    ds = ds.prefetch(10)
    ds = ds.map(parse, num_parallel_calls=10)
    ds = ds.batch(64)
    
    return ds

Then you can define your model the same way you would for a normal TensorFlow estimator. When you create the estimator, you just need to pass in input_mode='Pipe' as one of the parameters.
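
A hedged sketch of that last step with the SageMaker Python SDK follows. Only input_mode='Pipe' comes from the text above. The helper names, the entry point, and the role ARN are hypothetical, and the other constructor arguments (instance settings, framework version, and so on) differ between SDK versions, so they are left for the caller to supply.

```python
def with_pipe_mode(estimator_kwargs):
    """Return a copy of the estimator arguments with Pipe input mode enabled."""
    updated = dict(estimator_kwargs)
    updated["input_mode"] = "Pipe"
    return updated


def make_tensorflow_estimator(estimator_kwargs):
    """Create the estimator; needs the sagemaker package and AWS credentials."""
    from sagemaker.tensorflow import TensorFlow

    return TensorFlow(**with_pipe_mode(estimator_kwargs))


# Hypothetical script name and role ARN; add the instance and framework
# arguments that your SDK version requires before creating the estimator.
base_kwargs = {
    "entry_point": "train.py",
    "role": "arn:aws:iam::123456789012:role/ExampleSageMakerRole",
}
pipe_kwargs = with_pipe_mode(base_kwargs)
```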

Available Now

Both of these new features are available now at no additional charge, and I’m looking forward to seeing what customers can build with the batch transform feature. I can already tell you that it will help us with some of our internal ML workloads here in AWS Marketing.

As always, let us know what you think of these features in the comments or on Twitter!

Randall