Roacher of the Week: Mikael Austin, Account Executive at Cockroach Labs


(gentle music) – Hi, I’m Mikael, I’m on the sales team, and an interesting problem that I worked on this week was helping a customer with their multi-region deployment. I worked with solutions engineers and our support team here at Cockroach Labs, and now they are set up successfully, so I’m happy.

Cockroach University: Resiliency in the Cluster


Having already seen a three-node cluster respond to node failure, we’re now going to look at what happens in a larger cluster, where each range is replicated to only a subset of the nodes. This is often done to scale out, by the way: distributing ranges, and their read and write workloads, across more nodes while keeping the replication factor constant. But it also has implications for resiliency, good implications.

Let’s consider a cluster with seven nodes, still with a replication factor of three for every range. The replicas are distributed more or less evenly, and so are the leaders. When all nodes are up, a query behaves much as it does in a three-node cluster. Reads are routed to leaseholders, which route their answers back to the gateway, and from there to the client. Writes are similar but go through distributed consensus before acknowledgement is sent back.

When a node goes down, the situation is also very similar to the three-node case, at least initially. As before, ranges that lost their leader will elect new ones within seconds. The cluster will remain available for reads and writes, but it’ll be in a fragile state. Now, in the three-node situation, the cluster had to remain in that fragile state indefinitely, until it could get a third node back, because prior to that only two nodes were up and both already had replicas of every range. In this seven-node cluster, though, CockroachDB has a trick up its sleeve. Here we still have more than three nodes up, and some of them don’t yet have replicas of the affected ranges. But they could. In a few minutes, five minutes by default, the cluster will stop waiting for that lost node to return and declare it dead.
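(As a side note, that five-minute window corresponds to a cluster setting. A minimal sketch of inspecting and adjusting it from a SQL shell might look like the following; the setting name is the one CockroachDB documents for this timeout, and the ten-minute value is just an illustrative choice.)

    -- Sketch: check how long the cluster waits before declaring an unresponsive node dead.
    SHOW CLUSTER SETTING server.time_until_store_dead;

    -- The default is 5 minutes; for example, wait 10 minutes instead:
    SET CLUSTER SETTING server.time_until_store_dead = '10m0s';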
When that happens, the cluster will heal itself by up-replicating the under-replicated ranges to other nodes that don’t yet have them, and go from this state to something like this state. That node might be dead, but new replicas are placed on the other nodes, with each one starting out by taking a consistent snapshot of the range before becoming a follower. At that point, the cluster has actually healed itself. All ranges are again fully replicated in spite of the lost node. The cluster is once again resilient and could tolerate another node failure without loss of availability. It’s incredible! As long as the cluster doesn’t suffer a second lost node before declaring the first one dead and up-replicating, that is, because guaranteeing availability in the face of two simultaneous node failures would be mathematically impossible, unless we had a higher replication factor. With a replication factor of three, losing two of a range’s three replicas leaves only one, short of the two needed for a majority. If we set the replication factor to at least five, we’ll still be guaranteed a majority for every range even if we lose two nodes together.

Let’s rewind the clock and this time set that replication factor to five. Here’s what it would look like. Now the Raft consensus protocol will require three copies for a write to reach a majority and be durable, but otherwise it works the same.
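(In practice, the replication factor is controlled through replication zone configurations. A minimal sketch of raising it to five, assuming we want the change cluster-wide via the default zone; the database and table names below are placeholders.)

    -- Sketch: raise the replication factor from 3 to 5 across the cluster
    -- by editing the default replication zone.
    ALTER RANGE default CONFIGURE ZONE USING num_replicas = 5;

    -- Or scope the change to a single database or table, for example:
    -- ALTER DATABASE my_db CONFIGURE ZONE USING num_replicas = 5;
    -- ALTER TABLE my_db.accounts CONFIGURE ZONE USING num_replicas = 5;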
We’ll kill two nodes. Ranges with lost leaders will elect new ones within seconds, the cluster will continue serving reads and writes, and just as before, those nodes will be declared dead five minutes later, at which point the cluster will up-replicate the affected ranges to the remaining nodes that don’t yet have them. Once this happens, the cluster is again in a resilient state, able to maintain availability through up to two more simultaneous node failures. Though now we’re down to five nodes left, and with a replication factor of five, all of our ranges are replicated to every node, so any further node failure will result in persistently under-replicated ranges until the cluster can get back to five nodes.

And that’s how resiliency works in CockroachDB. What have we covered? We’ve looked at how up-replication can result in a cluster healing itself back to a resilient state whenever there are more nodes than a range’s replication factor. We’ve learned that the process starts five minutes after a node is lost, by default, and we’ve seen how increasing our replication factor beyond three empowers the cluster to be resilient even in the face of multiple, simultaneous node failures. And that’s the lesson.

Cockroach University: Availability and Durability in a Three-Node Cluster


Having seen how the cluster breaks its data down into ranges, replicates and distributes those ranges to the various nodes, and uses the Raft protocol to keep the cluster’s data durable and consistent, we’re now going to look at how those nodes act together to keep data available in the face of node failure.

Consider a three-node cluster, the smallest size for a resilient production deployment, and let’s connect a client. Whichever node the client connects to is called the gateway, and it’ll route queries wherever they belong. This is something we’ve been hinting at, but now’s a good time to make it explicit: the client can make any node its gateway just by connecting. Next, note that the leaders of the various ranges are not all on the same node. They’re distributed roughly equally among the three nodes.
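(You can see this distribution for yourself from a SQL shell. A minimal sketch, using a hypothetical accounts table; the exact columns returned vary across CockroachDB versions.)

    -- Sketch: inspect which nodes hold replicas, and which holds the lease,
    -- for each of a table's ranges. 'accounts' is a placeholder table name.
    SHOW RANGES FROM TABLE accounts;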
Suppose that a client sends a query asking for rows from two of those ranges. Here’s how the query might get answered while all three nodes are up. First, the gateway would route the query to the appropriate leaseholders, and since it’s a read query, they would send the results back to the gateway, which would combine them and answer the client’s query. The only difference for a write is that there would be a consensus operation started by the leader for each Raft group, but the flow would otherwise be similar, with an acknowledgement returned by the leaseholder back to the gateway and on to the client.
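(If you want to see how a particular statement was actually executed, including whether it was distributed across nodes, EXPLAIN ANALYZE reports that kind of detail. A hedged sketch, again against the hypothetical accounts table; output format varies by version.)

    -- Sketch: execute the statement and report execution details,
    -- including whether the plan was distributed.
    EXPLAIN ANALYZE SELECT id, balance FROM accounts WHERE balance > 100;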
In a production cluster, there would potentially be many clients connected to each node, all doing this in parallel.

Okay, so what happens when a node goes down? Well, first, if a client is connected to that node, that client would need to find a new gateway. This problem would be solved by using a load balancer, which is crucial in production deployments. More interesting, though, is what happens to those ranges in the moments immediately following node failure. Suppose a write comes in just as a node goes down. If the range’s leader is on a node that’s still up, there’s no problem. But for a range whose leader went down, there’s a short-term problem: no leader. That Raft group will hold an election, turning a follower into a new leader in a matter of seconds. The lease will be reassigned as well, and the gateway will route the write to the new leaseholder. Once the new leaseholder knows that the remaining two nodes have achieved consensus, it’ll acknowledge the write back to the gateway. So the cluster is able to keep serving writes as well as reads, with perhaps a few seconds of latency, but only if the query comes in at exactly the wrong time and touches a range that’s temporarily leaderless.

That said, we are down to two nodes at this point, and until that third node comes back up, the cluster, while able to serve reads and writes just fine, is also in a fragile state. If a second node is lost at this point, consensus becomes impossible and the cluster will be unavailable until it’s back to at least two nodes. So we want to get back to a resilient state as quickly as we can. When that third node does come back up, it’ll rejoin the cluster and, assuming it hasn’t been too long, its ranges will rejoin their respective Raft groups as followers, replicating Raft entries, and the cluster will again be in a resilient state.

Okay, so what have we covered? Well, we’ve learned how the gateway node works. We’ve seen how a three-node cluster responds to maintain availability in the face of node failure, and we’ve seen how it recovers resiliency when a lost node reconnects. (music)

Cockroach University: A Brief History of Databases


– This history starts in 1970 with the publication of “A Relational Model of Data for Large Shared Data Banks”, an academic paper by Edgar F. Codd. That original paper introduced a beautiful way to model data: you build a bunch of cross-linked tables and store any piece of data just once. Such a database could answer any question, as long as the answer was stored somewhere within it. Disk space would be used efficiently, at a time when storage was expensive. It was marvelous! It was the future.

The first commercial implementation arrived in the late 1970s, and during the ’80s and ’90s relational databases grew increasingly dominant, delivering rich indexes to make any query efficient; table joins, a term for read operations that pull separate records together into one; and transactions, which meant a combination of reads and, especially, writes across the database that need to happen together. These were essential. SQL, the structured query language, became the language of data, and software developers learned to use it to ask for what they wanted and let the database decide how to deliver it. Strict guarantees were engineered in to prevent surprises.

And then, in the first decade of the new millennium, for many business models, that all went out the window. Relational databases, architected around the assumption of running on a single machine, lacked something that became essential with the advent of the internet: they were painfully difficult to scale out. The volume of data that can be created by millions or billions of networked humans and devices is more than any single server can handle. When the workload grows so heavy that no single computer can bear the load, when the most expensive hardware on the market will be brought to its knees by the weight of an application, the only path forward is to move from a single database server to a cluster of database nodes working in concert. For a legacy SQL database architected to run on a single server, this was a painful process, requiring a massive investment of time, and often trade-offs and sacrifices of many of the features that brought developers to these databases in the first place.

By the late 2000s, SQL databases were still extremely popular. But for those who needed scale, there were other options: NoSQL had arrived on the scene. Google Bigtable, HDFS, and Cassandra are a few examples. These NoSQL databases were built to scale out easily and to tolerate node failures with minimal disruption. But they came with compromises in functionality, typically a lack of joins and transactions, or limited indexes: shortcomings that developers had to constantly engineer their way around. Scale became cheap, but relational guarantees didn’t come with it.

Legacy SQL databases have tried to fill the gap in the years since, with add-on features to help reduce the pain of scaling out. At the same time, NoSQL systems have been building out a subset of their missing SQL functionality. But none of these were architected from the ground up to deliver what we might call distributed SQL. And that’s where CockroachDB comes in.