Amazon Comprehend Launches Asynchronous Batch Operations

My colleague Jeff Barr last wrote about Amazon Comprehend, a service for discovering insights and relationships in text, when it launched at AWS re:Invent in 2017. Today, after iterating on customer feedback, we’re releasing a new asynchronous batch inferencing feature for Comprehend. Asynchronous batch operations work on documents stored in Amazon Simple Storage Service (S3) buckets and can perform all of the normal Comprehend operations like entity recognition, key phrase extraction, sentiment analysis, and language detection. These new asynchronous batch APIs support significantly larger documents than the single document and batch APIs, reducing the need for customers to truncate their documents for the service. Of course, all of the single document and batch synchronous API operations remain available for real-time results. The addition of asynchronous operations allows developers to choose the tools most suited for their applications. Let’s take a deeper look at the new API.

Asynchronous API Operations

The new batch APIs follow the same asynchronous call structure as Amazon Comprehend’s TopicDetection API. To analyze a collection of documents we first call one of the Start* APIs like StartDominantLanguageDetectionJob, StartEntitiesDetectionJob, StartKeyPhrasesDetectionJob, or StartSentimentDetectionJob.

Each of these APIs take an InputDataConfig and an OutputDataConfig that specify the incoming data format and location as well as where in S3 the results should be stored. The InputDataConfig specifies whether the input data should be treated as one document per file or one document per line.

Additionally we can name the job and include a unique request identifier for synchronization purposes. If we don’t supply these the Comprehend service will automatically generate them.

At the time of this writing the asynchronous operations support individual documents of up to 100KB for entity and key phrase detection, 1MB for language detection, and 5KB for sentiment detection. The total size of all files in the batch must be under 5GB and we cannot submit more than 1 million individual files per batch.

Now that we see what the API is doing let’s take a look at the updated console and start a job!

Amazon Comprehend Analysis Console

First I’ll navigate to the AWS Management Console and open Amazon Comprehend. Next I’ll select the new Analysis console.

From here I can create a new analysis job by clicking the create button in the top right of the console. I’ll create an entities detection job and select English as my document language. Then I’ll point the console at some sampe data.

Now I’ll configure my output data location and make sure the service role has access to that S3 bucket. Then I’ll start the job!

Here I can see the operation started in the console and I can wait until it’s complete to view the detailed results.

Here on the job page I can see the status of the job and the output location. If I download the results from the S3 location I can take a look at the detected entities in the sample text.

I’ve truncated the results here but mostly they look something like this:

  "Entities": [
      "BeginOffset": 875,
      "EndOffset": 899,
      "Score": 0.9936646223068237,
      "Text": "University of California",
      "Type": "ORGANIZATION"
      "BeginOffset": 903,
      "EndOffset": 911,
      "Score": 0.9519965648651123,
      "Text": "Berkeley",
      "Type": "LOCATION"
      "BeginOffset": 974,
      "EndOffset": 992,
      "Score": 0.9981470108032227,
      "Text": "Christopher Monroe",
      "Type": "PERSON"
      "BeginOffset": 997,
      "EndOffset": 1010,
      "Score": 0.9992995262145996,
      "Text": "Mikhail Lukin",
      "Type": "PERSON"
      "BeginOffset": 1095,
      "EndOffset": 1099,
      "Score": 0.9990954399108887,
      "Text": "2017",
      "Type": "DATE"
  "File": "Sample.txt",
  "Line": 8

Pretty cool! We could go through similar steps for setniment detection of key phrase detection. The fact that we can submit up to 5GB of data in a single batch means customers will spend less time transforming and truncating their documents.

I personally recommend a tool like AWS Step Functions for checking on the status of jobs programatically. It’s very easy to setup and build programatic analysis pipelines.

You could also use AWS Glue to call comprehend as part of your regular ETL operations as we mentioned in this blog post by Roy Hasson.

Additional Information

You can find detailed information on these new APIs in the documentation and learn more about the limits and best practices.

As previously mentioned the synchronous batch APIs are still available and work best for smaller sets of documents and smaller sizes.

As always don’t hesitate to share your feedback here or on twitter.