Databases on AWS

Database 101

Databases can be relational or non-relational. Relational databases have tables with rows and columns, and maintain consistent types within each column. Relational databases are based on relational algebra, which enables sophisticated querying capabilities. Non-relational databases store collections of documents; each document holds arbitrary key/value pairs, which don't need to be the same across documents. Non-relational databases thus have increased flexibility.
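
A quick illustrative sketch of the difference, using made-up field names: a relational table row has a fixed set of columns with consistent types, while documents in the same collection can carry different keys.

    # Relational: every row has the same columns with consistent types.
    # (Illustrative only; the "user" fields are hypothetical.)
    user_row = {"id": 1, "name": "Alice", "email": "alice@example.com", "age": 30}

    # Non-relational: each document may carry different key/value pairs.
    user_docs = [
        {"id": 1, "name": "Alice", "email": "alice@example.com"},
        {"id": 2, "name": "Bob", "age": 45, "preferences": {"theme": "dark"}},
    ]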

A data warehouse is a central repository of information that can be analyzed to make better informed decisions. These systems are a vital component of business intelligence tools. Data warehouses often collect information from many other databases for large scale analysis.

Databases can also be classified into OLTP and OLAP. These categories differ in terms of the type of queries you will run. OLTP, or Online Transaction Processing, is characterized by a large number of short transactions (CRUD). These systems emphasize fast query processing and data integrity. Conventional relational databases are OLTP. On the other hand, OLAP, or Online Analytical Processing, is characterized by a low volume of transactions. Queries are often very complex and involve aggregations. Data warehousing databases are OLAP.
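
The difference shows up in the queries themselves. A hedged sketch with a hypothetical orders table: the OLTP statement touches a single row, while the OLAP statement aggregates over the whole data set.

    # OLTP: many short CRUD transactions touching a few rows at a time.
    oltp_query = "UPDATE orders SET status = 'shipped' WHERE order_id = 42;"

    # OLAP: fewer, more complex queries that aggregate over large data sets.
    olap_query = """
    SELECT region, DATE_TRUNC('month', order_date) AS month, SUM(total) AS revenue
    FROM orders
    GROUP BY region, month;
    """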

ElastiCache

See the ElastiCache page.

Amazon Relational Database Service (RDS)

Amazon Relational Database Service (Amazon RDS) makes it easy to set up, operate, and scale a relational database in the cloud. Amazon offers Amazon Aurora, MySQL, MariaDB, PostgreSQL, Oracle, and Microsoft SQL Server as part of this service.
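
A minimal boto3 sketch of launching an RDS instance; the identifier, credentials, and sizing values below are placeholders, not recommendations.

    import boto3

    rds = boto3.client("rds")

    # Launch a small MySQL instance (all values are placeholders).
    rds.create_db_instance(
        DBInstanceIdentifier="my-rds-instance",
        Engine="mysql",
        DBInstanceClass="db.t3.micro",
        AllocatedStorage=20,            # GB
        MasterUsername="admin",
        MasterUserPassword="ChangeMe12345",
        BackupRetentionPeriod=7,        # days of automated backups (1-35)
        MultiAZ=False,
    )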

Backups

There are two types of backups for Amazon RDS: Automated Backups and Database Snapshots.

Whenever you restore either an Automated Backup or a manual Snapshot, the restored version of the database will be a new RDS instance with a new DNS endpoint.

Automated Backups

Automated Backups allow you to recover your database to any point in time within a retention period. The retention period can be between one and 35 days. Automated Backups take a full daily snapshot and also store transaction logs throughout the day. When you do a recovery, AWS will first choose the most recent daily backup and then apply the transaction logs relevant to that day. This allows you to do a point-in-time recovery down to a second, within the retention period.
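
A minimal boto3 sketch of a point-in-time restore; the identifiers and timestamp are placeholders, and remember that the result is a new instance with a new endpoint.

    import boto3
    from datetime import datetime, timezone

    rds = boto3.client("rds")

    # Restore to a specific second within the retention period.
    rds.restore_db_instance_to_point_in_time(
        SourceDBInstanceIdentifier="my-rds-instance",
        TargetDBInstanceIdentifier="my-rds-instance-restored",  # new instance, new endpoint
        RestoreTime=datetime(2024, 1, 15, 12, 30, 45, tzinfo=timezone.utc),
    )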

Automated Backups are enabled by default. The backup data is stored in S3 and you get free storage space equal to the size of your database. So if you have an RDS instance of 10GB, you will get 10GB worth of storage.

Backups are taken within a defined window. During the backup window, storage I/O may be suspended while your data is being backed up, and you may experience elevated latency.

Database Snapshots

DB Snapshots are done manually. They are stored even after you delete the original RDS instance, unlike automated backups.
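
A minimal boto3 sketch of taking a manual snapshot and restoring from it; the identifiers are placeholders.

    import boto3

    rds = boto3.client("rds")

    # Take a manual snapshot; it is retained even if the instance is later deleted.
    rds.create_db_snapshot(
        DBInstanceIdentifier="my-rds-instance",
        DBSnapshotIdentifier="my-manual-snapshot",
    )

    # Restoring the snapshot creates a brand new instance with its own endpoint.
    rds.restore_db_instance_from_db_snapshot(
        DBInstanceIdentifier="my-rds-instance-from-snapshot",
        DBSnapshotIdentifier="my-manual-snapshot",
    )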

Encryption

Encryption at rest is supported for all database types. Encryption is done using the AWS Key Management Service (KMS). Once your RDS instance is encrypted, the data stored at rest in the underlying storage is encrypted, as are its automated backups, read replicas, and snapshots.

At the present time, encrypting an already existing DB instance is not supported. To use Amazon RDS encryption for an existing database, you must first create a snapshot, make a copy of that snapshot, encrypt the copy, and then restore a new DB instance from the encrypted copy.
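
A boto3 sketch of that snapshot-copy approach; the identifiers and KMS key are placeholders, and in practice you would wait for each snapshot to become available before the next step.

    import boto3

    rds = boto3.client("rds")

    # 1. Snapshot the unencrypted instance.
    rds.create_db_snapshot(
        DBInstanceIdentifier="my-unencrypted-instance",
        DBSnapshotIdentifier="my-unencrypted-snapshot",
    )

    # 2. Copy the snapshot, specifying a KMS key to produce an encrypted copy.
    rds.copy_db_snapshot(
        SourceDBSnapshotIdentifier="my-unencrypted-snapshot",
        TargetDBSnapshotIdentifier="my-encrypted-snapshot",
        KmsKeyId="alias/aws/rds",   # placeholder KMS key
    )

    # 3. Restore a new, encrypted instance from the encrypted copy.
    rds.restore_db_instance_from_db_snapshot(
        DBInstanceIdentifier="my-encrypted-instance",
        DBSnapshotIdentifier="my-encrypted-snapshot",
    )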

Multi-AZ

Multi-AZ allows you to have an exact copy of your production database in another Availability Zone. AWS handles the replication for you, so when your production database is written to, the write is automatically synchronized to the standby database.

In the event of planned database maintenance, DB instance failure, or an Availability Zone failure, Amazon RDS will automatically failover to the standby so that database operations can resume quickly without administrative intervention (the DNS will automatically point to the new database).

Multi-AZ is for disaster recovery only. It is not primarily used for improving performance. For performance improvement, you need Read Replicas.
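
A minimal boto3 sketch of enabling Multi-AZ on an existing instance; the identifier is a placeholder.

    import boto3

    rds = boto3.client("rds")

    # Convert an existing instance to Multi-AZ; AWS provisions and syncs the standby.
    rds.modify_db_instance(
        DBInstanceIdentifier="my-rds-instance",
        MultiAZ=True,
        ApplyImmediately=True,   # otherwise applied in the next maintenance window
    )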

Read Replica

Read Replicas allow you to have a read-only copy of your production database. This is achieved by using asynchronous replication from the primary RDS instance to the read replica. You use read replicas primarily for very read-heavy database workloads.

Note that this is currently not available for SQL Server and Oracle.

  • Read Replica is used for scaling, not for DR!
  • You must have automatic backups turned on in order to deploy a read replica.
  • You can have up to 5 read replica copies of any database.
  • You can have read replicas of read replicas (but watch out for latency.)
  • Each read replica will have its own DNS end point.
  • You can have read replicas that have Multi-AZ.
  • You can create read replicas of Multi-AZ source databases.
  • Read replicas can be promoted to be their own database. This breaks the replication.
  • You can have a read replica in a second region.
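
A minimal boto3 sketch of creating read replicas; the identifiers, account number, and regions are placeholders.

    import boto3

    rds = boto3.client("rds")

    # Same-region read replica; it gets its own DNS endpoint.
    rds.create_db_instance_read_replica(
        DBInstanceIdentifier="my-rds-instance-replica-1",
        SourceDBInstanceIdentifier="my-rds-instance",
    )

    # Cross-region replica: call the API in the destination region and pass
    # the source instance's full ARN.
    rds_eu = boto3.client("rds", region_name="eu-west-1")
    rds_eu.create_db_instance_read_replica(
        DBInstanceIdentifier="my-rds-instance-replica-eu",
        SourceDBInstanceIdentifier="arn:aws:rds:us-east-1:123456789012:db:my-rds-instance",
    )

    # Promoting a replica to a standalone database breaks replication.
    rds.promote_read_replica(DBInstanceIdentifier="my-rds-instance-replica-1")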

DynamoDB

See separate DynamoDB page.

RedShift

Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the cloud. Customers can start small for just $0.25 per hour with no commitments or upfront costs and scale to a petabyte or more for $1,000 per terabyte per year, less than a tenth of most other data warehousing solutions.

Because data warehousing queries are complicated and often aggregate over columns, data warehousing databases use a different type of architecture, both at the database level and at the infrastructure layer.

Redshift charges for each compute node, backup, and data transfer.

Redshift is currently only available in one AZ. One can restore snapshots to new AZs in the event of an outage.

Redshift Configuration

A Redshift cluster can be Single Node or Multi-Node. A single node holds 160GB. A Multi-Node cluster has a Leader Node, which manages client connections and receives queries, and Compute Nodes, which store data and perform queries and computations. Redshift allows up to 128 compute nodes.
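
A minimal boto3 sketch of launching a multi-node Redshift cluster; all values are placeholders.

    import boto3

    redshift = boto3.client("redshift")

    # Multi-node cluster: one leader node plus the compute nodes requested here.
    redshift.create_cluster(
        ClusterIdentifier="my-redshift-cluster",
        NodeType="dc2.large",
        ClusterType="multi-node",
        NumberOfNodes=2,                 # compute nodes (up to 128)
        DBName="analytics",
        MasterUsername="awsuser",
        MasterUserPassword="ChangeMe12345",
        Encrypted=True,                  # at-rest encryption, keys managed by AWS by default
    )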

Columnar Data Storage

Instead of storing data as a series of rows, Amazon Redshift organizes the data by column. Unlike row-based systems, which are ideal for transaction processing, column-based systems are ideal for data warehousing and analytics, where queries often involve aggregates performed over large data sets. Since only the columns involved in the queries are processed and columnar data is stored sequentially on the storage media, column-based systems require far fewer I/Os, greatly improving query performance.
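
A toy Python sketch of why a columnar layout helps aggregate queries: summing one column only touches that column's values, instead of scanning every whole row. The data is made up.

    # Row-oriented layout: an aggregate over "amount" still reads whole rows.
    rows = [
        {"id": 1, "region": "us-east-1", "amount": 120.0},
        {"id": 2, "region": "eu-west-1", "amount": 80.0},
        {"id": 3, "region": "us-east-1", "amount": 55.0},
    ]
    total_row_oriented = sum(row["amount"] for row in rows)

    # Column-oriented layout: the "amount" column is stored contiguously,
    # so the aggregate touches only the data it actually needs.
    columns = {
        "id": [1, 2, 3],
        "region": ["us-east-1", "eu-west-1", "us-east-1"],
        "amount": [120.0, 80.0, 55.0],
    }
    total_column_oriented = sum(columns["amount"])

    assert total_row_oriented == total_column_oriented == 255.0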

Advanced Compression

Columnar data stores can also be compressed much more than row-based data stores because similar data is stored sequentially on disk. Amazon Redshift employs multiple compression techniques and can often achieve significant compression relative to traditional relational data stores. In addition, Amazon Redshift doesn't require indexes or materialized views and so uses less space than traditional relational database systems. When loading data into an empty table, Amazon Redshift automatically samples your data and selects the most appropriate compression scheme.
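
A toy run-length-encoding sketch of why columnar data compresses well: identical values stored next to each other collapse into short (value, count) pairs. This is only one of many encodings, and the column contents are made up.

    from itertools import groupby

    # A sorted/clustered column stores many identical values next to each other.
    region_column = ["eu-west-1"] * 4 + ["us-east-1"] * 5 + ["us-west-2"] * 3

    # Run-length encoding stores each run as (value, run length).
    encoded = [(value, len(list(run))) for value, run in groupby(region_column)]

    print(encoded)
    # [('eu-west-1', 4), ('us-east-1', 5), ('us-west-2', 3)]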

Massively Parallel Processing (MPP)

Amazon Redshift automatically distributes data and query loads across all nodes. Amazon Redshift makes it easy to add nodes to your data warehouse and enables you to maintain fast query performance as your data warehouse grows.

Redshift Security

  • Encrypted in transit using SSL.
  • Encrypted at rest using AES-256 encryption.
  • By default Redshift takes care of key management (you can manage your own keys if you want).

Aurora

This is an engine available in Amazon RDS.

Amazon Aurora is a MySQL-compatible, relational database engine that combines the speed and availability of high-end commercial databases with the simplicity and cost-effectiveness of open source databases. Amazon Aurora provides up to five times better performance than MySQL at a price point one tenth that of a commercial database while delivering similar performance and availability.
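
A minimal boto3 sketch of creating an Aurora cluster with one writer instance; all identifiers and sizes are placeholders.

    import boto3

    rds = boto3.client("rds")

    # Aurora storage lives at the cluster level; instances provide the compute.
    rds.create_db_cluster(
        DBClusterIdentifier="my-aurora-cluster",
        Engine="aurora-mysql",
        MasterUsername="admin",
        MasterUserPassword="ChangeMe12345",
    )

    # Add the primary (writer) instance to the cluster.
    rds.create_db_instance(
        DBInstanceIdentifier="my-aurora-writer",
        DBClusterIdentifier="my-aurora-cluster",
        Engine="aurora-mysql",
        DBInstanceClass="db.r5.large",
    )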

Aurora Scaling

Aurora storage starts at 10GB and scales in 10GB increments up to 64TB (this is Storage Autoscaling). Compute resources can scale up to 32 vCPUs and 244GB of memory.

Aurora maintains two copies of your data in each Availability Zone, with a minimum of three Availability Zones. This is six copies of your data!

Aurora is designed to transparently handle the loss of up to two copies of data without affecting database write availability and up to three copies without affecting read availability.

Aurora storage is also self-healing. Data blocks and disks are continuously scanned for errors and repaired automatically.

Aurora Replicas

Two types of replicas are available: Aurora Replicas (up to 15) and MySQL Read Replicas (up to 5).
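
A minimal boto3 sketch of adding an Aurora Replica, which is simply another instance created in the same cluster; the identifiers are placeholders and assume the cluster from the earlier sketch.

    import boto3

    rds = boto3.client("rds")

    # Each additional instance in the cluster serves as an Aurora Replica
    # (up to 15) and is a candidate failover target for the writer.
    rds.create_db_instance(
        DBInstanceIdentifier="my-aurora-replica-1",
        DBClusterIdentifier="my-aurora-cluster",
        Engine="aurora-mysql",
        DBInstanceClass="db.r5.large",
    )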