How to Store Large Attribute Values in DynamoDB

Learn how compressing large values can save you some $$$ on your DynamoDB bill

DynamoDB is a fully managed NoSQL database that delivers single-digit millisecond performance at any scale. To hold that promise, it comes with a couple of constraints and good practices that you need to follow. One of them is to keep your items as small as possible. This is true not only for performance but also for cost: with DynamoDB, you pay for the amount of data you read and write, as well as for storage. Reducing your data size is important if you want to reduce your monthly bill.

On top of that, DynamoDB also comes with some hard limits, including:

  • A single item cannot exceed 400 KB in size.
  • Query and Scan operations return at most 1 MB of data per call (after that, you have to paginate).

If you handle large amounts of data, you can hit those limitations very quickly.

For example, imagine that you are building a blog application (like Hashnode). You might store posts and comments in a DynamoDB table. These kinds of items contain free text that can be quite long and grow very fast. A blog post can easily reach 10 to 20 KB or more. When you know that half an RCU allows you to read 4 KB of data (provided that you are doing eventually consistent reads), we are talking about 2 to 3 RCUs for every read, and even more if the item has several other attributes!
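
To make that capacity math concrete, here is a quick back-of-the-envelope calculation. The 20 KB figure is just the example size from above; 0.5 RCU per 4 KB read (eventually consistent) and 1 WCU per 1 KB written are DynamoDB's standard units.

//capacity-math.js
const itemSizeKB = 20; // a long blog post, as in the example above

// eventually consistent reads cost 0.5 RCU per 4 KB block (rounded up)
const rcus = Math.ceil(itemSizeKB / 4) * 0.5;

// writes cost 1 WCU per 1 KB (rounded up)
const wcus = Math.ceil(itemSizeKB / 1);

console.log({ rcus, wcus }); // { rcus: 2.5, wcus: 20 }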

When dealing with such large data, AWS recommends compressing it and storing it as a Binary attribute. In this blog post, I will show you how to compress long text strings with gzip and how to store them in DynamoDB. We will then inspect the read and write units consumed and compare them with those of the uncompressed version.

1. Writes

In this demo, I'll be using Node.js, but you should easily be able to apply these techniques to your favourite programming language.

For the purpose of this test, we'll write a simple script that will first generate a dummy blog post using lorem-ipsum. We will then save it to DynamoDB twice: once as raw text (uncompressed) and once compressed with gzip. We will also make use of the ReturnConsumedCapacity property of DynamoDB so that it returns the consumed capacity (WCU) for both operations.

//write.js
const AWS = require('aws-sdk');
const { loremIpsum } = require('lorem-ipsum');
const { gzipSync } = require('zlib');

const content = loremIpsum({
    count: 20,
    units: "paragraph",
    format: "plain",
    paragraphLowerBound: 5,
    paragraphUpperBound: 15,
    sentenceLowerBound: 5,
    sentenceUpperBound: 15,
    suffix: "\n\n\n",
});

// output some stats about the text's length.
console.log(`Generated a text with ${content.length} characters and ${content.split(' ').length} words`);

// compress the content
const compressed = gzipSync(content);

// more stats about the content
console.log(`total size (uncompressed): ~${Math.round(content.length/1024)} KB`);
console.log(`total size (compressed): ~${Math.round(compressed.length/1024)} KB`);

// config DynamoDB
AWS.config.update({ region: 'eu-west-1' });
const dynamoDbClient = new AWS.DynamoDB();

dynamoDbClient.putItem({
    "TableName": "blog",
    "ReturnConsumedCapacity": "TOTAL",
    "Item": {
        "author": {
            "S": "bboure"
        },
        "slug": {
            "S": "raw-blog-post"
        },
        "title": {
            "S": "My blog post"
        },
        "content": {
            "S": content,
        }
    }
}).promise().then(result => {
    console.log('Write capacity for raw post', result.ConsumedCapacity );
});


dynamoDbClient.putItem({
    "TableName": "blog",
    "ReturnConsumedCapacity": "TOTAL",
    "Item": {
        "author": {
            "S": "bboure"
        },
        "slug": {
            "S": "compressed-blog-post"
        },
        "title": {
            "S": "My blog post"
        },
        "content": {
            "B": compressed,
        }
    }
}).promise().then(result => {
    console.log('Write capacity for compressed post', result.ConsumedCapacity );
});

In the script above, we generate a text of 20 paragraphs. Each paragraph will have between 5 and 15 sentences, and each sentence will be 5 to 15 words long. That is enough to generate a text of around 2000 words. Then, we compress the text and save both versions into DynamoDB.

Let's run the script:

$ node write.js
Generated a text with 12973 characters and 1943 words
total size (uncompressed): ~13 KB
total size (compressed): ~4 KB
Write capacity for compressed post { TableName: 'blog', CapacityUnits: 4 }
Write capacity for raw post { TableName: 'blog', CapacityUnits: 14 }

As you can see, the raw text was around 13 KB and consumed 14 WCUs, while the compressed one was only about 4 KB and consumed 4 WCUs. That looks right, since 1 WCU covers 1 KB of written data, rounded up, and the item size also includes the attribute names and the other attributes, not just the content.

By compressing the data, we just saved ourselves 10 WCUs. That's a gain of about 70%! Not only that, but we also reduced the item size by roughly 70%, and since DynamoDB also charges us for storage, that can make a huge difference to our AWS bill! 🎉

2. Reads

Now that we have saved our blog posts in DynamoDB, we want to read them back. Let's create a new script that reads both items and shows how many RCUs they consume.

//read.js
const AWS = require('aws-sdk');
const { gunzipSync } = require('zlib');

AWS.config.update({ region: 'eu-west-1' });
const dynamoDbClient = new AWS.DynamoDB();

dynamoDbClient.getItem({
    "TableName": "blog",
    "ReturnConsumedCapacity": "TOTAL",
    "Key": {
        "author": {
            "S": "bboure"
        },
        "slug": {
            "S": "raw-blog-post"
        },
    }
}).promise().then(result => {
    console.log('Read capacity for raw post', result.ConsumedCapacity );
});


dynamoDbClient.getItem({
    "TableName": "blog",
    "ReturnConsumedCapacity": "TOTAL",
    "Key": {
        "author": {
            "S": "bboure"
        },
        "slug": {
            "S": "compressed-blog-post"
        },
    }
}).promise().then(result => {
    console.log('Read capacity for compressed post', result.ConsumedCapacity );
    // uncompress post content
    const content = gunzipSync(result.Item.content.B).toString();
    console.log(`Original text with ${content.length} characters and ${content.split(' ').length} words`);
});

Let's run it:

$ node read.js
Read capacity for compressed post { TableName: 'blog', CapacityUnits: 0.5 }
Original text with 12973 characters and 1943 words
Read capacity for raw post { TableName: 'blog', CapacityUnits: 2 }

At read time, we only consumed 0.5 RCU against 2 for the uncompressed version. That's 4 times less! The numbers add up: with eventually consistent reads, every 4 KB (rounded up) costs 0.5 RCU, so the ~13 KB raw post spans four blocks (2 RCUs) while the ~4 KB compressed post fits in a single block (0.5 RCU). And as you can see, it is just as easy to decompress the data back into its original form.

3. Secondary indexes

Before we call it a day, there is one last test I'd like to run. Sometimes, you want to add secondary indexes to your table. In our blog example, we could add a GSI that indexes blog posts by author and sorts them by timestamp. One could argue that you should probably avoid projecting the entire blog content into all your indexes (and I would definitely agree with that), but sometimes you might not have a choice; so, for the sake of completeness, we'll try it out.

Let's run another script that tests just that. I'm not going to copy the full script again here. Instead, just know that I added a timestamp attribute to each item and created a GSI that projects all the attributes (Index name: timestamp, PK: author, SK: timestamp).
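
Roughly, the change looks like this. This is only a sketch, not the full script: the index name and key schema come from the description above, while treating the timestamp as a number and assuming on-demand capacity are assumptions (with provisioned capacity, the Create block would also need a ProvisionedThroughput section). On the write side, each Item in write.js simply gains an extra attribute along the lines of "timestamp": { "N": `${Date.now()}` }.

//add-gsi.js
// A sketch: index name and keys taken from the text above.
// Assumptions: timestamp stored as a number, table in on-demand capacity mode.
const AWS = require('aws-sdk');

AWS.config.update({ region: 'eu-west-1' });
const dynamoDbClient = new AWS.DynamoDB();

dynamoDbClient.updateTable({
    "TableName": "blog",
    "AttributeDefinitions": [
        { "AttributeName": "author", "AttributeType": "S" },
        { "AttributeName": "timestamp", "AttributeType": "N" }
    ],
    "GlobalSecondaryIndexUpdates": [{
        "Create": {
            "IndexName": "timestamp",
            "KeySchema": [
                { "AttributeName": "author", "KeyType": "HASH" },
                { "AttributeName": "timestamp", "KeyType": "RANGE" }
            ],
            "Projection": { "ProjectionType": "ALL" }
        }
    }]
}).promise().then(() => console.log('GSI creation started'));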

$ node write.js
Generated a text with 13673 characters and 1986 words
total size (uncompressed): ~13 KB
total size (compressed): ~4 KB
Write capacity for compressed post {
  TableName: 'blog',
  CapacityUnits: 12,
  Table: { CapacityUnits: 4 },
  GlobalSecondaryIndexes: { timestamp: { CapacityUnits: 8 } }
}
Write capacity for raw post {
  TableName: 'blog',
  CapacityUnits: 42,
  Table: { CapacityUnits: 14 },
  GlobalSecondaryIndexes: { timestamp: { CapacityUnits: 28 } }
}

As you can see, GSIs can be greedy in capacity units. That is because every write you make must be replicated to all of your indexes. Our secondary index alone consumed 8 WCUs for the compressed post and a whopping 28 WCUs for the uncompressed version! Add the 4 and 14 WCUs consumed by the table itself, and we are at 12 vs 42 WCUs in total!

Note: To be honest, I was expecting the GSI to consume the same amount of WCU as the table index (i.e.: 4 and 14). For some reason that I still don't understand, that amount is doubled. I could not find any information about why that happens. If you happen to know, please don't hesitate to drop a comment below. 🙏

Here, even though the saving in terms of percentage is about the same (70%), the absolute difference in capacity units starts to grow. We consumed 30 WCUs less with the compressed content! Over time, this can quickly add up.

Note that there would be no difference in terms of RCUs when reading the data back: unlike writes, reads only consume capacity on the index that you query, and that index only.
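
For example, listing an author's posts through that GSI would look something like this (a sketch: the index name comes from above, and the rest mirrors read.js):

//query-gsi.js
// A sketch: queries the timestamp GSI for all posts by a given author.
const AWS = require('aws-sdk');

AWS.config.update({ region: 'eu-west-1' });
const dynamoDbClient = new AWS.DynamoDB();

dynamoDbClient.query({
    "TableName": "blog",
    "IndexName": "timestamp",
    "ReturnConsumedCapacity": "TOTAL",
    "KeyConditionExpression": "author = :author",
    "ExpressionAttributeValues": {
        ":author": { "S": "bboure" }
    },
    "ScanIndexForward": false // newest posts first
}).promise().then(result => {
    console.log('Read capacity for GSI query', result.ConsumedCapacity);
});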

Conclusion

We just learned that by compressing large content before saving it in DynamoDB, you can expect to save up to 70% in WCUs, RCUs and storage cost. This is significant enough to be worth the extra effort of compressing/decompressing the data as you write/read it.

If you'd like to read more content like this, follow me on Twitter
