Roacher of the Week: Amruta Ranade, Sr. Technical Writer at Cockroach Labs


(upbeat music) – Hi, I’m Amruta. I’m a writer on the education team, and my recent favorite project is the video that I made for Flex Friday. The video is about establishing balance, and I talk about the importance of establishing balance and giving yourself permission to do that. But it’s not always easy in the tech world, because you have so many other things demanding your attention. So I hope that video helps and inspires other people to also give themselves permission to establish balance in their life.

Cockroach University: The Keyspace, Replicas, and Ranges


(chiming music) – In this lesson, we’re going to introduce how your data is organized into a single abstract data structure, and describe how CockroachDB uses that to divide, replicate, and distribute the data across the cluster. This will give us a framework for understanding, at a high level, how the cluster scales out, which is important to everyone because it has performance implications. It will also help to build a conceptual foundation for things to come.

First, let’s talk about the ordering of your data in CockroachDB. Take all of your data and imagine it in a single grand structure we’re going to call the Keyspace. It’s an ordered set of key-value pairs, with the full path to each record in the key part of each entry, including where it lives and the primary key of the row. For now, we’re abstracting away lots of implementation details, like how a table is identified or where the indexes fit in, etc. The important thing is that everything we can use to find a record, including the primary key, is part of the Keyspace. The rest of the row’s columns are typically put into the value part of the KV store and don’t affect the ordering. So the Keyspace is all of this, with additional metadata, all together in one grand ordered data structure. For a single-node cluster, the Keyspace is a fairly accurate description of how the data is actually organized by the storage layer. But in a production cluster, the implementation gets complicated.

The reason it’s a useful abstraction is that the cluster divides the Keyspace into what we call ranges. When a range grows beyond a certain limit, 64 megabytes by default, it gets split into two. When those grow, each gets split again, and so on. Splits can also occur for other reasons, but ranges are important because they are the units that CockroachDB replicates and distributes to the nodes of the cluster.
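As a concrete sketch of how this looks in SQL, you can list a table’s ranges and adjust the split threshold through its zone configuration. The table name here is hypothetical, and the exact default values and allowed bounds depend on your CockroachDB version:

    -- Inspect how a (hypothetical) table is currently divided into ranges.
    SHOW RANGES FROM TABLE accounts;

    -- Adjust the size at which ranges split via the table's zone configuration.
    -- 67108864 bytes = 64 MB, mirroring the default split threshold described above.
    ALTER TABLE accounts CONFIGURE ZONE USING
        range_min_bytes = 16777216,
        range_max_bytes = 67108864;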
Here’s a visualization showing some data getting divided up into seven ranges, represented by flat ovals, and six nodes for our cluster, represented by the cylinders. Multiple copies of each range, called replicas, are distributed among the nodes to keep the cluster balanced. In CockroachDB, the default replication factor is three, and that’s what we see here: three replicas of each range. Moreover, each replica is always on a different node; they never double up. We can increase the replication factor to larger odd numbers, such as five or seven, to increase the resiliency. Here’s what five might look like. Higher replication factors do come with the cost of having to store, update, and synchronize more replicas of each range in your cluster. Sometimes that’s worth the cost, sometimes not. Know that your replication factor doesn’t have to be the same for all of your data. You can set a different replication factor for each database or table if you like, or get even more granular if you have an enterprise license.
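For instance, per-database and per-table replication factors are set through zone configurations. A minimal sketch, using hypothetical database and table names:

    -- Use a replication factor of 5 for everything in one database...
    ALTER DATABASE payments CONFIGURE ZONE USING num_replicas = 5;

    -- ...and a higher factor for one especially critical table.
    ALTER TABLE payments.ledger CONFIGURE ZONE USING num_replicas = 7;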
Okay, that was a lot, and fast, so let’s review. In this lesson, we went over the basic model we should have in our heads when we think about cluster data: the Keyspace. We learned that CockroachDB takes the Keyspace and divides it up into ranges, that it then makes copies of those ranges, called replicas, up to a replication factor, and that it distributes those replicas among the nodes of the cluster. Soon we’ll see more, but that’s it for this lesson. (chiming music)

Cockroach University: Resiliency in the Cluster


Having already seen a three-node cluster respond to node failure, we’re now going to look at what happens in a larger cluster, where each range is replicated to only a subset of the nodes. This is often done to scale out, by the way, distributing ranges with their read and write workloads across more nodes while keeping the replication factor constant. But it also has implications for resiliency, good implications. Let’s consider a cluster with seven nodes, still with a replication factor of three for every range.
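You can check what replication factor a cluster is currently using. A quick sketch (in some versions the keyword is FROM rather than FOR, and the output columns vary a bit):

    -- Show the zone configuration that most data inherits by default,
    -- including its num_replicas setting (3 unless it has been changed).
    SHOW ZONE CONFIGURATION FOR RANGE default;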
The replicas are distributed more or less evenly, and so are the leaders. When all nodes are up, a query acts in a very similar manner to what occurs in a three-node cluster. Reads are routed to leaseholders, which route their answers back to the gateway, and from there to the client. Writes are similar, but go through a distributed consensus before sending back an acknowledgement.

When a node goes down, the situation is also very similar to the three-node case, or it will be initially. As before, ranges with a lost leader will elect new ones within seconds. The cluster will remain available for reads and writes, but it’ll be in a fragile state. Now, in the three-node situation, the cluster had to remain in that fragile state indefinitely, until it could get a third node, because prior to that, only two nodes were up and both already had replicas of every range. In this cluster, though, CockroachDB has a trick up its sleeve. Here, we still have more than three nodes up, and some of them don’t have replicas of the affected ranges. But they could. In a few minutes, five minutes by default, the cluster will stop waiting for that lost node to return and declare it dead.
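That five-minute window comes from a cluster setting. As a sketch, you can inspect it, and operators sometimes lengthen it during planned maintenance so the cluster doesn’t start moving replicas around; the replacement value here is just an example:

    -- How long the cluster waits before declaring an unresponsive node dead.
    SHOW CLUSTER SETTING server.time_until_store_dead;

    -- Example: wait 15 minutes instead of the 5-minute default.
    SET CLUSTER SETTING server.time_until_store_dead = '15m0s';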
When that happens, the cluster will heal itself by up-replicating the under-replicated ranges to other nodes that don’t yet have them, and go from this state to something like this state. That node might be dead, but new replicas are put on the other nodes, with each starting out by taking a consistent snapshot of the range before becoming a follower. At that point, the cluster has actually healed itself. All ranges are again fully replicated in spite of the lost node. The cluster is actually once again resilient and could tolerate another node failure without loss of availability. It’s incredible! As long as the cluster doesn’t suffer a second lost node before declaring the first one dead and up-replicating, that is, because guaranteeing availability in the face of two simultaneous node failures would be mathematically impossible unless we had a higher replication factor. If we set that to at least five, we’ll still be guaranteed a majority for every range even if we lost two nodes together. Let’s rewind the clock and this time set that replication factor to five. Here’s what it would look like.
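A minimal sketch of what setting the replication factor to five could look like, applied to the default zone that most data inherits:

    -- Raise the default replication factor from 3 to 5.
    -- A write now needs 3 of 5 replicas (a majority), so the cluster
    -- can keep serving reads and writes with up to 2 nodes down.
    ALTER RANGE default CONFIGURE ZONE USING num_replicas = 5;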
Now, the Raft consensus protocol will at this point require three copies for a write to have a majority and be durable, but otherwise it works the same. We’ll kill two nodes. Ranges with lost leaders will elect new ones within seconds, the cluster will continue serving reads and writes, and just as before, those nodes will be declared dead five minutes later, at which point the cluster will up-replicate the affected ranges to the remaining nodes that don’t yet have them. Once this happens, the cluster is again in a resilient state, able to maintain availability through up to two more simultaneous node failures. Though now we’re down to five nodes left, and with a replication factor of five, all of our ranges are replicated to every node, so any further node failures will result in persistently under-replicated ranges until the cluster can get back to five. And that’s how resiliency works in CockroachDB.

What have we covered? We’ve looked at how up-replication can result in a cluster healing itself back to a resilient state whenever there are more nodes than a range’s replication factor. We’ve learned that the process starts five minutes after a node is lost, by default, and we’ve seen how increasing our replication factor beyond three empowers the cluster to be resilient even in the face of multiple, simultaneous node failures. And that’s the lesson.

Cockroach University: Availability and Durability in a Three-Node Cluster


Having seen how the cluster breaks its data down into ranges, replicates and distributes those ranges to the various nodes, and uses the Raft protocol to keep cluster data durable and consistent, we’re now going to look at how those nodes act together to keep data available in the face of node failure. Consider a three-node cluster, the smallest size for a resilient production deployment, and let’s connect a client. Whichever node the client connects to is called the gateway, and it’ll route queries wherever they belong. This is something we’ve been hinting at, but now’s a good time to make it explicit. The client can make any node its gateway just by connecting.
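As a small illustration, you can connect to any node and ask which node ended up being your gateway. This uses an internal helper, so treat it as a sketch rather than a stable API:

    -- Run from any connection; reports the node currently serving as your gateway.
    SELECT crdb_internal.node_id();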
Next, note that the leaders of the various ranges are not all on the same node. They’re distributed roughly equally among the three nodes. Suppose that a client sends a query asking for rows from two of those ranges. Here’s how the query might get answered while three nodes are up. First, the gateway would route the query to the appropriate leaseholders, and since it’s a read query, they would send the results back to the gateway, which would combine them and answer the client’s query. The only difference for a write is that there would be a consensus operation started by the leader for each Raft group. But the flow would otherwise be similar, with an acknowledgement returned by the leaseholder back to the gateway and on to the client. In a production cluster, there would potentially be many clients connected to each node, all doing this in parallel.
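As a sketch, a single read like the one described might look like this. The table is hypothetical, and the rows it touches could live in two different ranges with two different leaseholders:

    -- The gateway fans this out to each range's leaseholder and merges the results.
    SELECT id, balance
      FROM accounts
     WHERE id BETWEEN 100 AND 200;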
Okay, so what happens when a node goes down? Well, first, if a client is connected to that node, that client would need to find a new gateway. This problem would be solved by using a load balancer, which is crucial in production deployments. More interesting, though, is what happens to those ranges in the moments immediately following node failure. Suppose a write comes in just as a node goes down. If the leader is on a node that’s still up, there’s no problem. But for a range whose leader went down, there’s a short-term problem: no leader. That Raft group will hold an election, turning a follower into a new leader in a matter of seconds. The lease will be reassigned as well, and the gateway will route the write to the new leaseholder. Once it knows that the remaining two nodes have achieved consensus, it’ll acknowledge the write back to the gateway. So the cluster’s able to keep serving writes as well as reads, with perhaps a few seconds of latency, but only if the query comes in at exactly the wrong time and it touches a range that’s temporarily leaderless.

That said, we are down to two nodes at this point, and until that node comes back up, while the cluster is able to serve reads and writes just fine, it’s also in a fragile state. If a second node is lost at this point, consensus becomes impossible and the cluster will be unavailable until it’s back to at least two nodes. We want it to get back to a resilient state as quickly as we can. When that third node does come back up, it’ll rejoin the cluster, and assuming it hasn’t been too long, its ranges will rejoin their respective Raft groups as followers, replicating Raft entries, and the cluster will again be in a resilient state.

Okay, so what have we covered? Well, we’ve learned how the gateway node works. We’ve seen how a three-node cluster responds to maintain availability in the face of node failure, and we’ve seen how it recovers resiliency when a lost node reconnects. (music)

Cockroach University: A Brief History of Databases


– This history starts in 1970 with the publication of “A Relational Model of Data for Large Shared Data Banks”, an academic paper by Edgar F. Codd. That original paper introduced a beautiful way to model data. You build a bunch of cross-linked tables and store any piece of data just once. Such a database could answer any question, as long as the answer was stored somewhere within it. Disk space would be used efficiently, at a time when storage was expensive. It was marvelous! It was the future.

The first commercial implementation arrived in the late 1970s, and during the ’80s and ’90s, relational databases grew increasingly dominant, delivering rich indexes to make any query efficient; table joins, a term for read operations that pull together separate records into one; and transactions, which meant a combination of reads, and especially writes, across the database that need to happen together. These were essential. SQL, the structured query language, became the language of data, and software developers learned to use it to ask for what they wanted and let the database decide how to deliver it. Strict guarantees were engineered in to prevent surprises.

And in the first decade of the new millennium, for many business models, that all went out the window. Relational databases, architected around the assumption of running on a single machine, lacked something that became essential with the advent of the internet. They were painfully difficult to scale out. The volume of data that can be created by millions or billions of networked humans and devices is more than any single server can handle. When the workload grows so heavy that no single computer can bear the load, when the most expensive hardware on the market will be brought to its knees by the weight of an application, the only path is to move forward from a single database server to a cluster of database nodes working in concert. For a legacy SQL database architected to run on a single server, this was a painful process, requiring a massive investment of time, and often trade-offs and sacrifices of many of the features that brought developers to these databases in the first place.

By the late 2000s, SQL databases were still extremely popular. But for those who needed scale, there were other options. NoSQL had arrived on the scene. Google Bigtable, HDFS, and Cassandra are a few examples. These NoSQL databases were built to scale out easily and to tolerate node failures with minimal disruption. But they came with compromises in functionality. Typically, a lack of joins and transactions, or limited indexes; shortcomings the developers had to constantly engineer their way around. Scale became cheap, but relational guarantees didn’t come with it. Legacy SQL databases have tried to fill the gap in the years since with add-on features to help reduce the pain of scaling out. At the same time, NoSQL systems have been building out a subset of their missing SQL functionality. But none of these were architected from the ground up to deliver what we might call distributed SQL. And that’s where CockroachDB comes in.

Cockroach University: The Raft Protocol in CockroachDB


– Let’s look at how CockroachDB uses the Raft protocol to perform writes in a distributed and durable manner. This is required knowledge for anyone who wants to understand where CockroachDB’s guarantees come from. If you’re really interested in the details, follow the links below, because this is one area where anyone curious can always learn more. Raft is an algorithm that allows a distributed set of servers to agree on a value without losing the record of that value, even in the face of node failure. CockroachDB uses it to perform all writes.

Recall that CockroachDB organizes its data into a keyspace, divides it into ranges, and distributes replicas of each range throughout the cluster based on the replication factor. For CockroachDB, each range defines a Raft group. The cluster has seven ranges, so there will be seven Raft groups. Let’s look at one. Before we get into the details of Raft, though, CockroachDB has a concept called a lease, which is assigned to one of these replicas, called the leaseholder. Its job will be to serve reads on its own, bypassing Raft, but also to keep track of write commits, so it knows not to show writes until they’re durable. Let’s put a lease on one of those replicas. Now all reads and writes to the range will be sent to that node.
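To make that concrete, you can see how a table’s data maps to ranges, replicas, and leaseholders. A sketch with a hypothetical table (the exact output columns vary by version):

    -- Each row of output describes one range: the replicas that form its
    -- Raft group and the replica currently holding the lease.
    SHOW RANGES FROM TABLE accounts;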
Now let’s talk about Raft. The first thing to know about Raft is that replicas are either leaders or followers. Leaders coordinate the distributed write process, while followers assist. If a follower doesn’t see a heartbeat from a leader, it’ll get a randomized time-out, declare itself a candidate, and call for an election. A majority vote makes it a leader. The process takes seconds. Let’s elect a leader. I made the leader the same as the leaseholder, and while they’re different roles, in practice CockroachDB does a good job of keeping the lease with the leader for efficiency, so we’ll assume that scenario.

Writes are kicked off by the leaseholder, which tells the leader to begin the process. Here’s an insert. The leader first appends the command to its Raft log, which is an ordered set of commands on disk. The leader then proposes the write to the followers. Each follower will replicate the command in its own Raft log. I showed only one replication so far, since that’s enough for a majority. Even without hitting the third node, the write will persist through any single node failure. Consensus has been achieved, but the leader doesn’t know that yet, so the follower has to let it know. At this point, the leader knows the Raft command was replicated, so it can commit the write and notify the leaseholder to begin showing it to readers.
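A small sketch of what that looks like from the client’s side; the table and values are hypothetical, and the comments simply restate the flow described above:

    -- The gateway routes this to the range's leaseholder; the Raft leader
    -- appends it to its log and proposes it to the followers. Once a majority
    -- of replicas (2 of 3 at the default replication factor) have it on disk,
    -- the write commits and is acknowledged back to the client.
    INSERT INTO accounts (id, balance) VALUES (1, 100);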
Eventually, that write will go to every replica. Let’s look at our full cluster, with a leader for each range, to get a big-picture sense of things. Here, each range has one replica that’s its leader and its leaseholder. All SQL operations are routed to the appropriate leaseholders. Reads are returned, while writes are passed to the leaders to start building consensus. So, what have we learned? We’ve seen that the leaseholder ensures that readers only see committed writes, and that the replicas of a range together form a Raft group that elects one leader. We’ve seen how distributed consensus is achieved for writes, and that’s how Raft works in CockroachDB.