Roacher of the Week: Amruta Ranade, Sr. Technical Writer at Cockroach Labs


(upbeat music) – Hi, I’m Amruta. I’m a writer on the education team, and my recent favorite project is the video I made for Flex Friday. The video is about establishing balance, and in it I talk about why that matters and about giving yourself permission to do it, which isn’t always easy in the tech world because so many other things are demanding your attention. I hope the video helps and inspires other people to also give themselves permission to establish balance in their lives.

Roacher of the Week: Bilal Akhtar, Software Engineer at Cockroach Labs


(soft music) – Hey, I’m Bilal. I work on the storage team, and I’m based out of Toronto, so I’m one of the few remote engineers. One of the things I’ve worked on recently that I’ve really enjoyed was implementing the MVCC layer on top of Pebble, our new storage engine. MVCC stands for multiversion concurrency control, and it’s one of the features that sits at the very bottom of CockroachDB, in the storage layer. It handles all sorts of conflicts between different writes happening at the same time, and it all happens behind the scenes, so as a SQL-level end user you don’t need to worry too much about it. Porting this layer, which we currently use with RocksDB, over to Pebble lets us eventually move to Pebble as our storage engine down the line. If you’re familiar with Cockroach’s internals, you probably know it uses RocksDB under the hood. That might change down the line, and when it does, this work will be used on every Cockroach cluster for every single thing you do. So it could be amazing, or it could just not be used, we’ll see. (soft music) Bye, Dan. – [Cameraman] Bye, Bilal. (upbeat music)
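
For a rough sense of what an MVCC layer adds on top of a plain key-value store, here is a minimal Go sketch. It is a toy illustration only, not CockroachDB’s or Pebble’s actual code; the types, timestamps, and method names are invented for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// Toy MVCC: every write is stored under (key, timestamp) instead of
// overwriting in place, and a read at a given timestamp sees the newest
// version at or below that timestamp.

type version struct {
	ts    int64 // hypothetical logical timestamp
	value string
}

type mvccStore map[string][]version

// put records a new version; older versions are kept.
func (s mvccStore) put(key string, ts int64, value string) {
	s[key] = append(s[key], version{ts: ts, value: value})
	sort.Slice(s[key], func(i, j int) bool { return s[key][i].ts < s[key][j].ts })
}

// get returns the newest version visible at readTS.
func (s mvccStore) get(key string, readTS int64) (string, bool) {
	versions := s[key]
	for i := len(versions) - 1; i >= 0; i-- {
		if versions[i].ts <= readTS {
			return versions[i].value, true
		}
	}
	return "", false
}

func main() {
	s := mvccStore{}
	s.put("balance", 10, "100")
	s.put("balance", 20, "250")

	v, _ := s.get("balance", 15)
	fmt.Println(v) // 100: a read at ts=15 still sees the ts=10 write
	v, _ = s.get("balance", 25)
	fmt.Println(v) // 250
}
```

The real layer also has to deal with things like deletion tombstones, intents from in-flight transactions, and garbage collection of old versions, but the versioned-key idea is the same.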

Roacher of the Week: Mikael Austin, Account Executive at Cockroach Labs


(gentle music) – Hi, I’m Mikael. I’m on the sales team, and an interesting problem that I worked on this week was helping a customer with their multi-region deployment. I worked with solutions engineers and our support team here at Cockroach Labs, and now they are set up successfully, so I’m happy.

Cockroach University: The Keyspace, Replicas, and Ranges


(chiming music) – In this lesson, we’re going to introduce how your data is organized into a single abstract data structure and describe how CockroachDB uses that structure to divide, replicate, and distribute the data across the cluster. This will give us a framework for understanding, at a high level, how the cluster scales out, which matters to everyone because it has performance implications. It will also help build a conceptual foundation for things to come.

First, let’s talk about the ordering of your data in CockroachDB. Take all of your data and imagine it in a single grand structure we’re going to call the Keyspace. It’s an ordered set of key-value pairs, with the full path to each record in the key part of each entry, including where it lives and the primary key of the row. For now, we’re abstracting away lots of implementation details, like how a table is identified or where the indexes fit in. The important thing is that everything we can use to find a record, including the primary key, is part of the Keyspace. The rest of the row’s columns are typically put into the value part of the KV store and don’t affect the ordering. So the Keyspace is all of this, plus additional metadata, together in one grand ordered data structure. For a single-node cluster, the Keyspace is a fairly accurate description of how the data is actually organized by the storage layer, but in a production cluster the implementation gets more complicated.
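
To make the ordered key-value picture concrete, here is a small Go sketch of what a slice of the Keyspace could conceptually look like. The human-readable key encoding ("/table/index/primary-key") is a simplification invented for this example; CockroachDB’s real key encoding is binary and carries more metadata.

```go
package main

import (
	"fmt"
	"sort"
)

func main() {
	// A conceptual slice of the Keyspace: the key carries everything needed to
	// find a record (table, index, primary key); the value holds the rest of
	// the row's columns. The string format here is illustrative only.
	keyspace := map[string]string{
		"/users/primary/1":      "name=Ada, city=Toronto",
		"/users/primary/2":      "name=Bram, city=New York",
		"/orders/primary/500":   "user_id=1, total=19.99",
		"/orders/primary/501":   "user_id=2, total=5.00",
		"/orders/by_user/1/500": "", // a secondary index entry: the key alone locates the row
		"/orders/by_user/2/501": "",
	}

	// The defining property is ordering: iterate the keys in sorted order and
	// note that the values never affect where an entry sits.
	keys := make([]string, 0, len(keyspace))
	for k := range keyspace {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%-26s -> %s\n", k, keyspace[k])
	}
}
```
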
The reason the Keyspace is still a useful abstraction is that the cluster divides it into what we call ranges. When a range grows beyond a certain limit, 64 megabytes by default, it gets split into two. When those grow, each gets split again, and so on. Splits can also occur for other reasons, but ranges are important because they are the units that CockroachDB replicates and distributes to the nodes of the cluster.

Here’s a visualization showing some data getting divided up into seven ranges, represented by flat ovals, and six nodes for our cluster, represented by the cylinders. Multiple copies of each range, called replicas, are distributed among the nodes to keep the cluster balanced. In CockroachDB, the default replication factor is three, and that’s what we see here: three replicas of each range. Moreover, each replica is always on a different node; they never double up. We can increase the replication factor to larger odd numbers, such as five or seven, to increase resiliency. Here’s what five might look like. Higher replication factors do come with the cost of having to store, update, and synchronize more replicas of each range in your cluster. Sometimes that’s worth the cost, sometimes not. Note that your replication factor doesn’t have to be the same for all of your data. You can set a different replication factor for each database or table if you like, or get even more granular if you have an enterprise license.
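
In practice, the per-database and per-table replication factor mentioned above is controlled through zone configurations. The sketch below, in Go with the standard database/sql package, assumes a pgx driver, an insecure local cluster at localhost:26257, and a table named orders; those are assumptions for the example, so adjust them to your setup and check the zone configuration options available in your CockroachDB version.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/jackc/pgx/v5/stdlib" // registers the "pgx" driver for database/sql
)

func main() {
	// Connection details are assumptions for this example.
	db, err := sql.Open("pgx", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Raise the replication factor for one table from the default of 3 to 5.
	if _, err := db.Exec(`ALTER TABLE orders CONFIGURE ZONE USING num_replicas = 5`); err != nil {
		log.Fatal(err)
	}
	fmt.Println("orders table now asks for 5 replicas per range")

	// The same statement works at the database level, e.g.:
	//   ALTER DATABASE defaultdb CONFIGURE ZONE USING num_replicas = 5;
}
```
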
Okay, that was a lot, and fast, so let’s review. In this lesson, we went over the basic model we should have in our head when we think about cluster data: the Keyspace. We learned that CockroachDB takes the Keyspace and divides it up into ranges, that it then makes copies of those ranges, called replicas, up to a replication factor, and that it distributes those replicas among the nodes of the cluster. Soon we’ll see more, but that’s it for this lesson. (chiming music)

Roacher of the Week: Rafi Shamim, Software Engineer at Cockroach Labs


– I’m Rafi Shamim. I’m on the App Dev Team here at Cockroach Labs. What I’ve been working on recently that’s been pretty cool is getting our database to work with the Java programming language and the different tools that are used with it. We were just working on one called Hibernate, which allows application developers to write Java code without having to write SQL and have their application work with our database seamlessly, just like it would work with any other database. It’s been cool to do that because it’s an open-source project that a lot of people rely on, and it’ll be very useful for Java developers everywhere.

Roacher of the Week: Bram Gruneir, Technical Account Manager at Cockroach Labs


– Hi, my name is Bram Gruneir. I’m a technical account manager with the customer success team here, and I really enjoy helping customers get the database into production, and then make it better once it’s there. Specifically, things like speeding up queries, especially picking the right index when you’re doing geo-partitioning. It’s a tough problem, and a fun one to solve. (light music)

Cockroach University: Resiliency in the Cluster


Having already seen a three-node cluster respond to node failure, we’re now going to look at what happens in a larger cluster, where each range is replicated to only a subset of the nodes. This is often done to scale out, by the way: distributing ranges, with their read and write workloads, across more nodes while keeping the replication factor constant. But it also has implications for resiliency, good implications.

Let’s consider a cluster with seven nodes, still with a replication factor of three for every range. The replicas are distributed more or less evenly, and so are the leaders. When all nodes are up, a query acts in a very similar manner to what occurs in a three-node cluster. Reads are routed to leaseholders, which route their answers back to the gateway, and from there to the client. Writes are similar but go through distributed consensus before sending back an acknowledgement.

When a node goes down, the situation is also very similar to the three-node case, at least initially. As before, ranges with a lost leader will elect new ones within seconds. The cluster will remain available for reads and writes, but it’ll be in a fragile state. Now, in the three-node situation, the cluster had to remain in that fragile state indefinitely, until it could get a third node back, because prior to that only two nodes were up and both already had replicas of every range. In this cluster, though, CockroachDB has a trick up its sleeve. Here, we still have more than three nodes up, and some of them don’t have replicas of the affected ranges, but they could. In a few minutes, five minutes by default, the cluster will stop waiting for that lost node to return and declare it dead. When that happens, the cluster will heal itself by up-replicating the under-replicated ranges to other nodes that don’t yet have them, and go from this state to something like this state. That node might be dead, but new replicas are put on the other nodes, each starting out by taking a consistent snapshot of the range before becoming a follower.
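
The five-minute window mentioned above corresponds to the server.time_until_store_dead cluster setting, which defaults to five minutes. Here is a minimal Go sketch for changing it; the driver and connection string are assumptions for the example, and you should check the setting’s documentation for your CockroachDB version before touching it.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/jackc/pgx/v5/stdlib" // registers the "pgx" driver for database/sql
)

func main() {
	// Connection details are assumptions for this example.
	db, err := sql.Open("pgx", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// server.time_until_store_dead controls how long the cluster waits before
	// declaring an unresponsive node dead and up-replicating its ranges
	// elsewhere (default: 5 minutes). Lengthening it can, for example, let a
	// node ride out planned maintenance without triggering up-replication.
	if _, err := db.Exec(`SET CLUSTER SETTING server.time_until_store_dead = '15m0s'`); err != nil {
		log.Fatal(err)
	}
	fmt.Println("node-death timeout raised to 15 minutes")
}
```
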
At that point, the cluster has actually healed itself. All ranges are again fully replicated in spite of the lost node. The cluster is once again resilient and could tolerate another node failure without loss of availability. It’s incredible, as long as the cluster doesn’t suffer a second lost node before declaring the first one dead and up-replicating, because guaranteeing availability in the face of two simultaneous node failures would be mathematically impossible unless we had a higher replication factor. If we set that to at least five, we’ll still be guaranteed a majority for every range even if we lost two nodes together.
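
The arithmetic behind that claim is simple majority math: with a replication factor of n, a range needs floor(n/2) + 1 replicas for a Raft quorum, so it can lose floor((n-1)/2) of them and keep going. The short Go sketch below works that out for a few replication factors; it illustrates the reasoning and is not CockroachDB code.

```go
package main

import "fmt"

// majority returns the number of replicas needed for a Raft quorum;
// tolerable returns how many replicas can be lost while keeping that quorum.
func majority(replicas int) int  { return replicas/2 + 1 }
func tolerable(replicas int) int { return (replicas - 1) / 2 }

func main() {
	for _, rf := range []int{3, 5, 7} {
		fmt.Printf("replication factor %d: quorum needs %d, tolerates %d failure(s)\n",
			rf, majority(rf), tolerable(rf))
	}
	// replication factor 3: quorum needs 2, tolerates 1 failure(s)
	// replication factor 5: quorum needs 3, tolerates 2 failure(s)
	// replication factor 7: quorum needs 4, tolerates 3 failure(s)
}
```
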
Let’s rewind the clock and this time set that replication factor to five. Here’s what it would look like. Now the Raft consensus protocol will require three copies for a write to have a majority and be durable, but otherwise it works the same. We’ll kill two nodes. Ranges with lost leaders will elect new ones within seconds, the cluster will continue serving reads and writes, and just as before, those nodes will be declared dead five minutes later, at which point the cluster will up-replicate the affected ranges to the remaining nodes that don’t yet have them. Once this happens, the cluster is again in a resilient state, able to maintain availability through up to two more simultaneous node failures. Though now we’re down to five nodes left, and with a replication factor of five, all of our ranges are replicated to every node, so any further node failures will result in persistently under-replicated ranges until the cluster can get back to five. And that’s how resiliency works in CockroachDB.

What have we covered? We’ve looked at how up-replication can result in a cluster healing itself back to a resilient state whenever there are more nodes than a range’s replication factor. We’ve learned that the process starts five minutes after a node is lost, by default, and we’ve seen how increasing our replication factor beyond three empowers the cluster to be resilient even in the face of multiple, simultaneous node failures. And that’s the lesson.

Cockroach University: Availability and Durability in a Three-Node Cluster


Having seen how the cluster breaks its data down into ranges, replicates and distributes those ranges to the various nodes, and uses the Raft protocol to keep cluster data durable and consistent, we’re now gonna look at how those nodes act together to keep data available in the face of node failure.

Consider a three-node cluster, the smallest size for a resilient production deployment, and let’s connect a client. Whichever node the client connects to is called the gateway, and it’ll route queries wherever they belong. This is something we’ve been hinting at, but now’s a good time to make it explicit: the client can make any node its gateway just by connecting.
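
Because any node can serve as the gateway, applications typically connect through a single load-balanced address rather than picking a node themselves. Here is a minimal Go sketch of that pattern; the load balancer address, database name, user, and pgx driver are all assumptions for the example.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/jackc/pgx/v5/stdlib" // registers the "pgx" driver for database/sql
)

func main() {
	// "lb.example.internal" stands in for a load balancer that spreads client
	// connections across the nodes; whichever node accepts this connection
	// becomes the client's gateway and routes queries to the right ranges.
	db, err := sql.Open("pgx", "postgresql://app_user@lb.example.internal:26257/bank?sslmode=verify-full")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// A trivial round trip: the gateway handles it and returns the result.
	var one int
	if err := db.QueryRow(`SELECT 1`).Scan(&one); err != nil {
		log.Fatal(err)
	}
	fmt.Println("connected through the load balancer, SELECT 1 returned", one)
}
```
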
Next, note the leaders of the various ranges are not all on the same node. They’re distributed roughly equally among the three nodes. Suppose that a client sends a query asking for rows from two of those ranges. Here’s how the query might get answered while all three nodes are up. First, the gateway would route the query to the appropriate leaseholders, and since it’s a read query, they would send the results back to the gateway, which would combine them and answer the client’s query. The only difference for a write is that there would be a consensus operation started by the leader of each Raft group, but the flow would otherwise be similar, with an acknowledgement returned by the leaseholder back to the gateway and on to the client. In a production cluster, there would potentially be many clients connected to each node, all doing this in parallel.

Okay, so what happens when a node goes down? Well, first, if a client was connected to that node, it would need to find a new gateway. This problem is solved by using a load balancer, which is crucial in production deployments.
More interesting, though, is what happens to those ranges in the moments immediately following node failure. Suppose a write comes in just as a node goes down. If the range’s leader is on a node that’s still up, there’s no problem. But for a range whose leader went down, there’s a short-term problem: no leader. That Raft group will hold an election, turning a follower into a new leader in a matter of seconds. The lease will be reassigned as well, and the gateway will route the write to the new leaseholder. Once it knows that the remaining two nodes have achieved consensus, it’ll acknowledge the write back to the gateway. So the cluster’s able to keep serving writes as well as reads, with perhaps a few seconds of added latency, but only if the query comes in at exactly the wrong time and touches a range that’s temporarily leaderless.
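
On the client side, CockroachDB applications are generally expected to retry transactions that hit transient, retryable errors, whether those come from contention or from unlucky timing around events like a node failure. One common pattern is the crdb.ExecuteTx helper from the cockroach-go library, sketched below; the table, statements, and connection details are assumptions for the example, and this is a general client pattern rather than something specific to this lesson’s scenario.

```go
package main

import (
	"context"
	"database/sql"
	"log"

	"github.com/cockroachdb/cockroach-go/v2/crdb"
	_ "github.com/jackc/pgx/v5/stdlib" // registers the "pgx" driver for database/sql
)

func main() {
	// Connection details are assumptions for this example.
	db, err := sql.Open("pgx", "postgresql://root@localhost:26257/bank?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// ExecuteTx runs the closure inside a transaction and transparently
	// retries it when the database reports a retryable error.
	err = crdb.ExecuteTx(context.Background(), db, nil, func(tx *sql.Tx) error {
		if _, err := tx.Exec(`UPDATE accounts SET balance = balance - 10 WHERE id = $1`, 1); err != nil {
			return err
		}
		_, err := tx.Exec(`UPDATE accounts SET balance = balance + 10 WHERE id = $1`, 2)
		return err
	})
	if err != nil {
		log.Fatal(err)
	}
}
```
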
That said, we are down to two nodes at this point, and until that node comes back up, while the cluster is able to serve reads and writes just fine, it’s also in a fragile state. If a second node is lost at this point, consensus becomes impossible and the cluster will be unavailable until it’s back to at least two nodes. We want to get back to a resilient state as quickly as we can. When that third node does come back up, it’ll rejoin the cluster and, assuming it hasn’t been too long, its replicas will rejoin their respective Raft groups as followers, replicating Raft entries, and the cluster will again be in a resilient state.

Okay, so what have we covered? Well, we’ve learned how the gateway node works. We’ve seen how a three-node cluster responds to maintain availability in the face of node failure, and we’ve seen how it recovers resiliency when a lost node reconnects. (music)