Everything you need to know about distributed SQL


Originally published on The New Stack on January 13, 2021

Distributed SQL is generating a lot of buzz lately, and for good reason. Relational databases are trusted by every company in the world to run their mission critical applications. But as the first innovators began shifting to online business and customer engagement, relational databases couldn’t keep up. NoSQL databases captured the spotlight with their ability to support modern web applications with massive workloads, but they did so by sacrificing consistency and data integrity. Distributed SQL is the best of both worlds.

Distributed SQL for Dummies book cover

Distributed SQL for Dummies

Read Now


What is distributed SQL?

A distributed SQL database is made up of multiple database instances working together (i.e., nodes in a cluster), with each one storing and querying a subset of the data rather than all of it. Simply add more instances to increase storage capacity and/or query performance. And of course, in order to guarantee strong consistency, all copies of data (i.e., replicas) are synchronously written at the same time.

Let’s go through a few examples where distributed SQL is a good fit:

  • E-commerce sites: A big sale or seasonal demand can send a huge traffic surge to an e-commerce site. Remember when tons of shoppers received 404 errors during Amazon Prime Day two years ago? Nightmare. In anticipation of these high traffic events, you need to make sure you have enough capacity to handle the surge.
  • Food delivery services: Uber Eats or GrubHub may find orders spike during dinner hours, while during breakfast hours they dip way down. There’s no reason to operate at peak capacity at all hours of the day and night when you know most transactions happen from 11:30-2:00 pm and 5:30-7:30 pm.
  • Streaming media: Communication companies such as Verizon and AT&T – which provide both free and paid video on demand services for millions of consumers – have to maintain massive amounts of data to ensure customers are getting the right programs, in the right format, at the right cost, at the right time.
  • Healthcare insurance: Open enrollment only happens for a few weeks out of the year. On healthcare.gov nearly 8.3 million consumers enroll in health insurance within a 6 week period. That’s a lot of capacity needed to handle that enormous surge during only a few short weeks of the year.

How companies achieve scale

Until recently, there haven’t been a lot of options when it comes to scaling relational databases. The first was to scale up, to replace an existing server with an even bigger one. And when the biggest commodity server was no longer enough, some companies turned to hardware appliances such as Oracle Exadata. The second was to scale out with sharding. This became an alternative for tech giants with deep engineering expertise and anyone else unwilling or unable to spend millions of dollars on a hardware appliance. However, while more cost effective than scaling up, sharding tends to be brittle, difficult to maintain and limited in scope – often specialized to support specific applications and queries. With distributed SQL you get uncompromising scalability for any application without having to spend a fortune on a hardware appliance or accept the limitations and technical burden of database sharding.

Database workloads are often quite variable. Distributed SQL is engineered to ramp up capacity in response to seasonal and business requirements. It lets you run with a few resources when that’s all that is needed, and quickly add more to handle spikes in your business. This elasticity is critical for a myriad of applications.

The ability to adapt to changing conditions on the fly is a key differentiator from other SQL iterations. Distributed SQL allows you to add nodes on demand and scale out. It also eliminates the issue of having to change your shard key or modifying how the data is split. Trust me – no one wants to spend their time sharding.

Some distributed SQL databases can also spread data across various database instances, constantly checking to ensure that the most optimal, efficient distribution of data is in use. If there is a change in customer behavior or there are new applications, it’s looking to see if moving the data around a little bit here and a little bit there will enhance performance.

This flexibility also mitigates downtime and cost concerns. For example, it could take 100 instances to meet your peak traffic conditions, which you don’t want to pay for 24/7 if you only need them all running 1% of the time. Customers can, depending on the solution they’re using, scale out and back on an hourly basis in order to cut their spend as much as possible. In the cloud it’s literally as easy as pushing a button when you need more capacity or to decrease it to save costs when traffic evens out.

Additionally, uniform data distribution is maintained as nodes are added and removed, or if data is inserted unevenly. This automatic data distribution approach removes the need to manage shards, and enables data availability to be maintained in the event of node loss.

Distributed SQL is engineered and designed to assume that the more instances you have in a distributed database, the more likely some are going to fail. That’s just reality. So, if there is an infrastructure failure, the fact that the data is distributed and replicated to a handful of other nodes ensures that it remains available. The database assumes instances will come and go without notice, and it handles topology changes automatically. This is a major advantage for businesses as it eliminates the costs, lost revenue, embarrassment and bad reputation that can result from downtime.

Eliminating trade-offs

Teams responsible for databases need their data to be durable, safe, consistent and the databases themselves to always be available and scalable. Yes, that’s asking a lot, but it’s all of critical importance to any business’ success.

In the not-too-distant past, few databases were able to guarantee that level of functionality. There were a lot of constraints, limitations and trade-offs – such as giving up data integrity and transactions to get scalability. One reason for these constraints is the amount of technical challenges involved in performing transactions across multiple database instances safely (e.g., being ACID-compliant). With the rise of distributed SQL, there’s no need to compromise anymore.

The future looks bright

Distributed SQL continues to evolve, responding to marketplace demands for more reliable, transparent, accurate data, delivered at ever faster speeds, and at a truly global level. Given that it’s a newer technology and has been engineered for the cloud, it’s cloud-native by definition. And if there’s anything the tech and business communities can agree on, it’s that the move to the cloud is basically a runaway train, as a recent survey of IT decision-makers demonstrated.

As we head into 2021 and beyond, I see the future for distributed SQL spanning multiple data centers and making that transparent to applications – letting them write anywhere and read anywhere. And not just within a data center, but at a truly global level. That will be the next maturity phase. The future looks bright for distributed SQL and you’ll be hearing a lot more about it.