Spanner is Google’s scalable, multi-version, globally distributed, and synchronously replicated database. It is the first system to distribute data at global scale and support externally consistent distributed transactions. At the highest level of abstraction, it is a database that shards data across many sets of Paxos state machines in datacenters spread all over the world 5. Replication is used for global availability and geographic locality; clients automatically fail over between replicas. Spanner automatically re-shards data across machines as the amount of data or the number of servers changes, and it automatically migrates data across machines (even across datacenters) to balance load and in response to failures 6. Spanner is designed for horizontal scaling to millions of machines across hundreds of datacenters and trillions of database rows, and it offers up to a 99.999% availability service-level agreement (SLA) 13.
Its focus is managing cross-datacenter replicated data. Even though many projects happily use Bigtable, Google has consistently received complaints from users about Bigtable’s lack of transactions and its difficulty of use for some kinds of applications: those that have complex, evolving schemas, or those that want strong consistency in the presence of wide-area replication 9. Many applications at Google chose Megastore because of its semi-relational data model and support for synchronous replication, despite its relatively poor write throughput. Therefore, Google made distributed transactions central to Spanner’s design. Spanner evolved from a Bigtable-like versioned key-value store into a temporal multi-version database. Data is stored in schematized semi-relational tables; data is versioned, and each version is automatically timestamped with its commit time 6.
Applications use Spanner for high availability; Google claims that Spanner achieves better than 99.999% availability 8. Because data is replicated within or across continents, availability is preserved even in the face of wide-area natural disasters. Most applications will probably replicate their data across 3 to 5 datacenters in one geographic region, but with relatively independent failure modes. That is, most applications will choose lower latency over higher availability, as long as they can survive 1 or 2 datacenter failures. The replication configurations for data can be dynamically controlled at a fine grain by applications. Applications can specify constraints to control which datacenters contain which data, how far data is from its users (to control read latency), how far replicas are from each other (to control write latency), and how many replicas are maintained (to control durability, availability, and read performance) 6.
Spanner has two features that are difficult to implement in a distributed database: it provides externally consistent 10 reads and writes, and globally consistent reads across the database at a timestamp. These features enable Spanner to support consistent backups, consistent MapReduce executions 11, and atomic schema updates, all at global scale, and even in the presence of ongoing transactions. It makes heavy use of hardware-assisted clock synchronization, using GPS receivers and atomic clocks, to ensure global consistency.
Implementation of Spanner:
Fig. 1 Spanner server organization 12.
Figure 1 illustrates the deployment view of servers in a Spanner universe. Each zone has one zonemaster, which assigns data to the zone’s spanservers (between one hundred and several thousand per zone) that serve data to clients. The per-zone location proxies are used by clients to locate the spanservers assigned to serve their data. The universe master is primarily a console that displays status information about all the zones for interactive debugging. The placement driver handles automated movement of data across zones on the timescale of minutes; it periodically communicates with the spanservers to find data that needs to be moved, either to meet updated replication constraints or to balance load.
Fig. 2 Spanserver software stack 6.
The spanserver software stack is shown in Figure 2. At the bottom, each spanserver is responsible for between 100 and 1000 instances of a data structure called a tablet. Unlike Bigtable, Spanner assigns timestamps to data, which is an important way in which Spanner is more like a multi-version database than a key-value store. To support replication, each spanserver implements a single Paxos state machine on top of each tablet. The implementation of Paxos is pipelined, to improve Spanner’s throughput in the presence of WAN latencies 6.
The Paxos state machines are used to implement a consistently replicated bag of mappings. Writes must initiate the Paxos protocol at the leader; reads access state directly from the underlying tablet at any replica that is sufficiently up-to-date. The set of replicas is collectively a Paxos group. At every replica that is a leader, a spanserver implements a lock table to implement concurrency control. Each spanserver also implements a transaction manager to support distributed transactions. If a transaction involves more than one Paxos group, those groups’ leaders coordinate to perform two-phase commit 6.
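The coordination described above can be made concrete with a small sketch. The class and function names below (PaxosGroupLeader, two_phase_commit) are illustrative, not Spanner’s actual API; the sketch only shows the shape of the protocol: every participating group leader must vote to prepare before the coordinator tells all groups to commit.

```python
# Hypothetical sketch of the two-phase commit that Spanner's transaction
# managers run when a transaction spans multiple Paxos groups.

class PaxosGroupLeader:
    def __init__(self, name):
        self.name = name
        self.prepared = False

    def prepare(self, txn_id):
        # Phase 1: a participant leader acquires locks and logs a prepare
        # record through its Paxos group; here we just record the vote.
        self.prepared = True
        return True

    def commit(self, txn_id, commit_ts):
        # Phase 2: apply the transaction at the chosen commit timestamp.
        assert self.prepared
        return (txn_id, commit_ts, self.name)

def two_phase_commit(coordinator, participants, txn_id, commit_ts):
    leaders = [coordinator] + participants
    # Phase 1: every group leader must vote to prepare, or the txn aborts.
    if not all(leader.prepare(txn_id) for leader in leaders):
        return "aborted"
    # Phase 2: the coordinator tells all groups to commit.
    for leader in leaders:
        leader.commit(txn_id, commit_ts)
    return "committed"
```

In the real system each prepare and commit record is itself replicated through Paxos, which is what makes the coordinator state durable despite failures.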
On top of the bag of key-value mappings, the Spanner implementation supports a bucketing abstraction called a directory, which is a set of contiguous keys that share a common prefix. When data is moved between Paxos groups, it is moved directory by directory. Spanner might move a directory to shed load from a Paxos group; to put directories that are frequently accessed together into the same group; or to move a directory into a group that is closer to its accessors. If a directory grows too large, Spanner shards it into multiple fragments 6.
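A minimal sketch of the directory abstraction, with invented keys and group names: keys sharing a common prefix form one directory (contiguous under lexicographic order), and a movedir-style operation reassigns a whole directory to another Paxos group.

```python
# Illustrative sketch: directories are sets of contiguous keys with a
# common prefix, and data moves between Paxos groups one directory at a time.

from collections import defaultdict

def bucket_into_directories(keys, prefix_len):
    dirs = defaultdict(list)
    for k in sorted(keys):  # lexicographic order keeps each directory contiguous
        dirs[k[:prefix_len]].append(k)
    return dict(dirs)

def move_directory(placement, prefix, dst_group):
    # Movedir-style operation: reassign every key in one directory at once.
    placement[prefix] = dst_group
    return placement
```

For example, all keys beginning with one user ID land in a single directory, so that user’s data stays together when it is rebalanced.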
Data Model Concerns:
The lack of cross-row transactions in Bigtable led to frequent complaints; the Bigtable authors had argued that general two-phase commit was too expensive to support because of the performance and availability problems it brings 9. At least 300 applications within Google use Megastore (despite its relatively low performance) because its data model is simpler to manage than Bigtable’s, and because of its support for synchronous replication across datacenters. Given the popularity of Dremel 12 as an interactive data-analysis tool, there was also a need to support a SQL-like query language in Spanner.
Spanner exposes the following set of data features to applications: a data model based on schematized semi-relational tables, a query language, and general-purpose transactions. Spanner supports SQL queries and transactions with the ability to scale out horizontally. Careful design is necessary to realize Spanner’s full benefit and ensure that an application can scale to arbitrary levels and maximize its performance. Two tools have a great impact on scalability: key definition and interleaving 7.
The application data model is layered on top of the directory-bucketed key-value mappings supported by the implementation. An application creates one or more databases in a universe. Each database can contain an unlimited number of schematized tables. Tables look like relational-database tables, with rows, columns, and versioned values. Spanner’s data model is not purely relational, in that rows must have names. Rows in a Spanner table are ordered lexicographically by primary key. This requirement is where Spanner still looks like a key-value store: the primary keys form the name for a row, and each table defines a mapping from the primary-key columns to the non-primary-key columns 6.
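This key-value view of a table can be sketched in a few lines. The table and column names below are invented for illustration; the point is that the primary-key columns form the row’s name, the table maps that name to the non-key columns, and a scan yields rows in lexicographic primary-key order.

```python
# Sketch of Spanner's key-value view of a schematized table.

class SpannerishTable:
    def __init__(self, key_cols, value_cols):
        self.key_cols, self.value_cols = key_cols, value_cols
        self.rows = {}  # primary-key tuple -> non-key column values

    def insert(self, **cols):
        key = tuple(cols[c] for c in self.key_cols)
        self.rows[key] = {c: cols[c] for c in self.value_cols}

    def scan(self):
        # Rows come back in lexicographic primary-key order, as in a tablet.
        return sorted(self.rows.items())

users = SpannerishTable(key_cols=["uid"], value_cols=["email"])
users.insert(uid="bob", email="bob@example.com")
users.insert(uid="alice", email="alice@example.com")
# scan() yields alice's row before bob's, because rows sort by primary key.
```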
Client applications declare the hierarchies in database schemas via the ‘interleave in’ declarations. This interleaving of tables to form directories is significant because it allows clients to describe the locality relationships that exist between multiple tables, which is necessary for good performance in a sharded, distributed database. Without it, Spanner would not know the most important locality relationships 6.
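The physical effect of interleaving can be shown with a toy layout, using invented table names (Users, Albums) and keys: an interleaved child row carries its parent’s key as a prefix, so lexicographic ordering places each user’s album rows immediately after that user’s row.

```python
# Toy layout: parent rows keyed by (uid,), interleaved child rows keyed
# by (uid, album_id). Tuple comparison gives lexicographic key order.
rows = {
    ("bob",): "Users row",
    ("alice",): "Users row",
    ("alice", "a2"): "Albums row",
    ("bob", "a1"): "Albums row",
}

# All of alice's data (her Users row, then her Albums rows) is contiguous,
# which is exactly the locality relationship interleaving declares.
layout = sorted(rows)
```

Because related rows are physically adjacent, they tend to land in the same directory and split, avoiding cross-group joins and cross-group transactions.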
Data within each Spanner replica is organized into two levels of physical hierarchy: database splits, then blocks. Splits hold contiguous ranges of rows and are the unit by which Spanner distributes a database across compute resources. Over time, splits may be broken into smaller parts, merged, or moved to other nodes in the instance to increase parallelism and allow the application to scale. If data is frequently written or read together, clustering it by carefully selecting primary keys and using interleaving can benefit both latency and throughput 7.
There are some datasets that are too large to fit on a single machine. There are also scenarios where the dataset is small, but the workload is too heavy for one machine to handle. This means that we need to find a way of splitting our data into separate pieces that can be stored on multiple machines. Our approach is to partition database tables into contiguous key ranges called splits. A single machine can serve multiple splits, and there is a fast lookup service for determining the machine(s) that serve a given key range. The details of how data is split and what machine(s) it resides on are transparent to Spanner users. The result is a system that can provide low latencies for both reads and writes, even under heavy workloads, at very large scale 14.
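The lookup service mentioned above amounts to mapping a key to the split whose range contains it. A minimal sketch, with invented split boundaries and machine names, using a binary search over sorted split start keys:

```python
# Illustrative split lookup: tables are partitioned into contiguous key
# ranges ("splits"); a key maps to the machine serving its split.

import bisect

split_starts = ["", "g", "p"]  # split 0: [""-"g"), split 1: ["g"-"p"), split 2: ["p"-...)
split_to_machine = ["node-a", "node-b", "node-c"]

def machine_for(key):
    # Find the rightmost split whose start key is <= the lookup key.
    i = bisect.bisect_right(split_starts, key) - 1
    return split_to_machine[i]
```

In the real system this mapping is replicated and cached, and it changes as splits are broken apart, merged, or moved; the sketch only shows the range-lookup idea.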
To make sure that data remains accessible despite failures, each split is replicated to multiple machines in distinct failure domains. Replication across the copies of a split is kept consistent by the Paxos algorithm. In Paxos, as long as most of the voting replicas for the split are up, one of those replicas can be elected leader to process writes and allow other replicas to serve reads 14.
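The availability condition reduces to a majority check. A minimal sketch (replica counts are illustrative): with 2f+1 voting replicas, writes can proceed and a leader can be elected as long as more than half the replicas are up.

```python
# Minimal Paxos-style quorum reasoning for one split.

def has_quorum(up_replicas, total_replicas):
    # A strict majority of voting replicas must be reachable.
    return up_replicas > total_replicas // 2

# 5 voting replicas tolerate 2 failures; 3 voting replicas tolerate 1.
```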
Spanner provides both read-only transactions and read-write transactions. The former is the preferred transaction type for operations (including SQL SELECT statements) that do not mutate data. Read-only transactions provide strong consistency and operate, by default, on the latest copy of the data, yet they run without any internal locking, which makes them faster and more scalable. Read-write transactions are used for transactions that insert, update, or delete data, including transactions that perform reads followed by a write. They are still highly scalable, but read-write transactions introduce locking and must be orchestrated by Paxos leaders 14.
Many previous distributed database systems have elected not to provide strong consistency guarantees because of the costly cross-machine communication that is usually required. Spanner is able to provide strongly consistent snapshots across the entire database using a Google-developed technology called TrueTime. TrueTime is an API that allows any machine in Google datacenters to know global time to within a small, explicitly bounded uncertainty. This allows different Spanner machines to reason about the ordering of transactional operations, often without any communication at all. In particular, Spanner assigns a timestamp to all reads and writes. Because Spanner uses multi-version concurrency control, the ordering guarantee on timestamps enables clients of Cloud Spanner to perform consistent reads across an entire database without blocking writes 15.
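The non-blocking read property follows from multi-versioning: each write is stored under its commit timestamp, and a read at time t returns the latest version no later than t. A minimal sketch of one multi-versioned cell (the class name and API are invented):

```python
# Sketch of multi-version reads at a timestamp: a snapshot read at time t
# returns the newest version with commit_ts <= t, so reads never block
# concurrent writes and later writes never change what the snapshot saw.

import bisect

class MVCCCell:
    def __init__(self):
        self.timestamps = []  # commit timestamps, kept sorted
        self.values = []

    def write(self, commit_ts, value):
        i = bisect.bisect(self.timestamps, commit_ts)
        self.timestamps.insert(i, commit_ts)
        self.values.insert(i, value)

    def read_at(self, ts):
        i = bisect.bisect_right(self.timestamps, ts)
        return self.values[i - 1] if i else None

cell = MVCCCell()
cell.write(10, "v1")
cell.write(20, "v2")
# A snapshot read at ts=15 sees v1; a later write at ts=30 cannot change that.
```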
Spanner provides clients with the strictest concurrency-control guarantee for transactions, a property called external consistency: if the start of a transaction T2 occurs after the commit of a transaction T1, then the commit timestamp of T2 must be greater than the commit timestamp of T1. Under external consistency, the system behaves as if all transactions were executed sequentially, even though Spanner runs them across multiple servers (and possibly in multiple datacenters) for higher performance and availability. TrueTime also enables Spanner to support atomic schema changes. Bigtable supports atomic schema changes in one datacenter, but its schema changes block all operations. A Spanner schema-change transaction is a generally non-blocking variant of a standard transaction 16.
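The mechanism behind this guarantee is TrueTime's commit wait, which can be sketched numerically. The function names and timestamp units (microseconds) below are illustrative: TT.now() returns an interval [earliest, latest] bounding true time, a leader picks a commit timestamp s no earlier than latest, then holds the commit until earliest exceeds s, so s is guaranteed to be in the past for every later-starting transaction.

```python
# Hedged sketch of TrueTime commit wait, with time in integer microseconds.

def tt_now(true_us, eps_us):
    # TrueTime exposes uncertainty explicitly: true time lies in the interval.
    return (true_us - eps_us, true_us + eps_us)

def commit_timestamp_and_release(true_us, eps_us):
    earliest, latest = tt_now(true_us, eps_us)
    s = latest            # commit timestamp: no earlier than TT.now().latest
    # Commit wait: hold the commit until TT.now().earliest > s, which at
    # worst takes about 2*eps_us of real time.
    release = s + eps_us  # earliest = t - eps > s  implies  t > s + eps
    return s, release

s, release = commit_timestamp_and_release(true_us=100_000, eps_us=3_000)
# Any transaction starting after `release` observes tt_now().earliest > s,
# so its own commit timestamp must exceed s: external consistency holds.
```

The cost of the guarantee is proportional to the clock uncertainty, which is why Spanner invests in GPS and atomic clocks to keep that uncertainty small.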
The CAP theorem says that a system can provide only two of three desirable properties: consistency (C); 100% availability for both reads and updates (A); and tolerance to network partitions (P). This leads to three kinds of systems, CA, CP, and AP, depending on the system’s goals. Despite being a globally distributed system, Spanner claims to be both consistent and highly available, which implies that partitions are effectively never observed; many are therefore skeptical of this claim 8.