Site Reliability Engineers — Keeping Google up and running 24/7


ANDREW WIDDOWSON: Greetings,
everyone, and welcome to the
Google Students Hangout On Air
about site reliability
engineering.
My name is Andrew Widdowson.
I’ve worked here at Google as an
SRE for the last six years
specializing in web search.
And I came here from Carnegie
Mellon University.
I’m joined by a panel of SREs
who are very excited to share
with you a little bit about
what SRE is all about.
We’ll start off with
introductions from Leslie.
LESLIE CHEUNG: Hi, my name is
Leslie Cheung. I’ve been an SRE
at Google for over
three years now.
I work on the main web server,
the front-end for all of your
search results and
your home page.
And my alma mater
is UCLA.
ANDREW WIDDOWSON: Thank
you, Leslie.
Dina?
DINA BETSER: Hi, everyone.
I’m Dina Betser.
I’m an ads quality SRE in
the Pittsburgh office.
I actually work on the system
that automatically decides
which ads should be
shown to users.
And this is one of the largest
machine-learning [? climates ?] in
production.
I’ve been a full-time SRE since
I graduated from MIT in July.
And I’d been a software
engineering intern at Google
before that.
ANDREW WIDDOWSON: Very cool.
And last but not least, Aaron.
AARON JOINER: Hi, my name
is Aaron Joiner.
I come to Google from a less
traditional computer science
background, more from a music
performance background and the
IT industry writ large.
I’ve been here for about seven
and a half years now.
And I work down in the bowels of
the Google infrastructure,
working on the building blocks
that keep all of the wonderful
services we run able to scale
to the size they do.
ANDREW WIDDOWSON: Very
cool, indeed.
All right, so what we’re going
to do today is we’re going to
talk about some of the most
common questions that we get
about SRE, and answer
those questions.
Let’s go ahead and start off.
Site reliability engineering–
what the heck does that mean?
I think it means different
things to different folks.
So I’d like to get
a take from–
let’s start with Leslie.
LESLIE CHEUNG: Yeah, so site
reliability engineering is
about balance.
So a big production service
has two things
that are going on.
There’s the reliability
aspect.
You don’t want this service
to go down.
You want it to be available
to users at all times.
And at the same time, you also
want to be able to push out
new features–
that newfangled feature
that is going to
serve our users better.
And the thing is these are
sometimes seen as two forces
that are colliding.
And the SRE role is about trying
to make those things
work together, so that way we
can push out new features in a
reliable way.
We want to be able to make changes,
but make sure that
we’re still serving our users.
ANDREW WIDDOWSON: Absolutely
right, Leslie.
I think that site reliability
engineering is definitely–
it enables both agility and
stability in our approach.
So these are not two sides
of the same coin.
They are, in fact, constructive
positives, right?
For another take then
on what site
reliability engineering is–
Dina?
DINA BETSER: Thanks.
I also think of being an SRE
as being a paramedic for
production issues.
As a person who carries a
pager, we’re the first
responders when something goes
wrong in the system.
But part of that is being
responsible for triaging the
issue, seeing how
severe it is.
Is it actually something that’s
harming live users?
Or is it something that we can
deal with on our own that
users will never see?
So being able to see how serious
an issue is, and
getting in touch with the right
people who can solve an
issue is really important
to Google.
ANDREW WIDDOWSON: Absolutely.
I like to think of site
reliability engineering kind
of as a fun mix.
I think of it as equal
parts cloud
mechanic and systems scientist.
But that’s just my
own hyperbole.
We’ll find out more about
this as we go.
So next up, let me just go ahead
and take care of the
elephant in the room.
A lot of folks assume that site
reliability engineering
is just some fancy-pants
term for the
heavy-duty operations role.
But the reality is it’s quite
different than that.
It is both operations, and it’s
also software engineering
development and planet-scale
science and engineering.
So let’s get a little bit more
of a sense for that.
I’m going to send this over
to Aaron, who has
some points to make.
AARON JOINER: Yeah, so I guess
essentially we’re trying to
figure out what do we do in
terms of being able to scale
the systems we run.
So clarify for me again,
I’m sorry–
ANDREW WIDDOWSON: Sure.
So, is SRE just some
fancy ops role?
Or how do we differ?
AARON JOINER: Right, sorry.
So essentially my background
is in the ops world.
So I’ve spent many years
in the operational
trenches, if you will.
And the nice thing about being
here at Google is that we
really don’t just run around
and push buttons in
production.
It’s our job to figure out how
to find those issues that we
run across from time to time.
And ultimately, as we were
saying earlier, automate
ourselves out of a job.
So the ability to take whatever
the new exciting
thing that’s happened today is,
and turn that into a way
that we can prevent that from
ever happening again.
ANDREW WIDDOWSON: Absolutely.
Thank you, Aaron.
And, Leslie, what’s
your take on this?
LESLIE CHEUNG: Yeah, as Aaron
was saying, we’re not
interested in handling the
burning building that
sometimes–
if the services were burdened
down by a lot of things that
are going on.
Like let’s say, we have to move
jobs from a machine or
restart a machine.
And that’s not what we’re
interested in at all.
That may come about
from time to time.
But the real interest that we
have is identifying
failure scenarios, things that
can go wrong, and figuring out
how we can design solutions that
either fix themselves, so
we don’t even get woken up in
the middle of the night with
these problems, or just handle
themselves transparently.
They’re by design.
If something fails, they’ll
still continue working.
And we don’t have to worry
so much about it.
And that’s the real
engineering part of this.
There’s a reason why it’s
called site reliability
engineering.
Because we’re actually
engineering solutions in the
long term to make our
own lives easier.
And whether that frees
up our own time–
we say automate ourselves out
of a job– but that doesn’t
mean that we’re not going
to have a job.
Because there are always longer-term,
big-picture things
that we have to think about, and
new features that we need
to work with in order
to make sure that
they roll out reliably.
ANDREW WIDDOWSON: Absolutely.
I like to think of the fact of
this automate ourselves out of a
job thing as being able to do
more with even more, right?
To use a silly analogy, in a lot
of ops work, it’s a lot of
repetitive banal barnacle
scrubbing.
And who wants to just scrub
barnacles on the ship?
You want to be able to steer
that ship, right?
So at Google, we are lucky
enough to have some fantastic
infrastructure that takes care
of a lot of the mundane and the
repetitive.
And we build from that.
So some pretty good stuff.
And we’re using our own science
and engineering
principles to go levels
beyond that.
So cool.
All right, now, of course we
have the traditional software
engineering role
here at Google.
Many SREs and software
engineers are working
together every day.
So what I want to know is,
is there a crossover
between the two roles?
What sort of collaborations
exist?
And we’ll send this right
back to you Leslie.
LESLIE CHEUNG: Yeah.
I mean, just to kind of preface
this, I was hired as a
software engineer at Google.
And then, just before I started,
SREs, they called me
and said, hey, would you like
to join our organization?
And this kind of shows that
SRE is pulling in software
engineering talent in order to
make sure that they’ve got the
knowledge there in order to
design the solutions that make
our systems reliable.
We definitely will do
consultations with software
dev groups on new features,
making sure that–
we’ve seen the failures that
have occurred in production,
what we’ve learned from
a lot of mistakes.
And there’s always still new
mistakes to be made, new
learning things.
And SREs are right there
responding on these incidents.
So we have a lot of knowledge.
And we also understand the
software systems that are
being deployed.
And we can work with developers
to make sure that
these things get rolled
out in a good fashion.
ANDREW WIDDOWSON: Absolutely.
And Dina, what’s your take?
DINA BETSER: Yeah, absolutely
everything that Leslie
says is 100% true.
I think that SREs work with
software engineers along every
step of the way.
And that includes when
they’re working on a
new push, a new binary.
And an example of this is just
last week when I was on call,
one of the developers on the
ads quality infrastructure
team, was making an improvement
to the system with
a port and replacing
a component.
And as the on-call, I was
responsible for making sure
that as he was doing the push,
that everything was proceeding
smoothly and there were
no unexpected alerts.
This whole time we are working
very closely together, and
making sure that the
push was a success.
So I like to think of this as
the SREs and the SWEs are
working together, just like
Mario and Luigi work together
to save the princess.
ANDREW WIDDOWSON: That’s
absolutely
what we’re doing here.
Now, thank you for that,
Leslie and Dina.
I’d like to just mention for
all of you who are
following along with our Google+
Hangout on Air, if you have
questions about the site
reliability engineering role,
which we’re discussing today
here at Google Students, you
can tag a question to us.
Use the hashtag #srehangout.
That’s no dashes, no spaces,
tag srehangout.
OK, we’ll take some questions
a little bit later on.
All right, I’d like to mention
some of the ways that I feel
SREs and software engineers
engage with each other.
I spent a lot of my time for
Google web search doing
proactive sorts of things.
SRE roles are often an equal
mix of proactive
and reactive work.
What I’m trying to do is get
out ahead of new exciting
features in web search.
And one of the perks in the job
is I get to see where web
search is going next.
I know you’re going to enjoy
the next several months of
Google’s search evolution.
But so when I’m meeting with
software engineers who are
excited to deploy these
features, I’m having
conversations with them to make
sure that their designs
are the best they can be so that
they’ll scale to the sort
of challenges that
we have, right?
A common conversation between
myself and another developer
might be something like, well,
how many users do you think
your feature can support
on our infrastructure?
100,000 users?
OK, well, what if you have
10 million users?
Where are the bottlenecks?
Where are the slow downs?
Where are the things that
you’ll need to scale?
And how can we make sure that
we do this right the first
time, so that on opening day,
you have the least eventful
launch, infrastructurally,
that you can have?
So having seen the different
failures and the different
sorts of ways that software can
explode here at Google, in
our very fault-tolerant systems,
I’m able to share
that experience with folks
who are maybe writing a
feature for the first time, or
writing a feature that’s 10
times as large as they’ve
ever written.
That’s the sort of value that
SRE can bring to the software
engineering organization,
working in tandem with them.
OK, so that was kind
of my parting
comment for that question.
So let’s see.
If we’ve established then how
SREs and software engineers
work together, how does the scope
of SRE work and software
engineering work differ?
Let’s go back to you, Dina.
DINA BETSER: Right, SREs
are software engineers.
They’re working on our live
production software.
So one thing that SREs focus
on is that everything they
work on has to do with
production code.
So they might be working
on monitoring code or
instrumentation, figuring out
how things will run when they
actually go live
on our network.
So one of the things that we’ve
been saying repeatedly
is we write things to automate
ourselves out of a job.
When we do this, we try to pick
the pain points that we
currently see, and make sure
that they never happen again.
So while developers may focus
most on just making their
software work, we make sure
that the software works on
tons of machines spread
across many, many data
centers around the world.
ANDREW WIDDOWSON: Absolutely.
And Aaron, what do you think?
AARON JOINER: So, like Dina, I
certainly think we spend a lot
of time working on the
monitoring infrastructure
aspects, more focusing on things
like the ability to–
if a developer comes to us and
says, we’re going to launch
this new feature, we’re going
to launch this new small
product, or even a large
planetary-scale thing like
web search, that we’ve had a
chance in advance to put in
place the tools that are going
to be able to allow them to do
that productively.
And large portions of our
lower-level infrastructure are
written directly by the SREs
from the monitoring
infrastructures and the
frameworks that make that easy
to do to roll-out
infrastructures, to be able to
build systems so you can deploy
that software easily to
our quintillions of machines
in production.
It’s really that scale
that I think is–
the familiarity with the
production scale is the key
differentiator.
ANDREW WIDDOWSON: Absolutely.
And we’re really lucky here at
Google to have code that is
highly instrumented, very
scalable in its
base library basis.
Being able to navigate through
the large grid scheduling
system that we have
at Google is–
I count myself lucky to
be able to do that.
And that’s part of the everyday job in site
reliability engineering.
OK so we have these site
reliability engineers.
We’ve talked a bit about what
they do, but for which teams
do they do what they
do, right?
Do Google SRE work on
certain products?
Do they work on all products?
Do all Google teams have an
SRE working on things?
How does this exactly work?
And we’ll go back to you,
Aaron, for that.
AARON JOINER: So for my part,
I’ve spent the last probably
three or four years mostly
taking stuff where they did
not currently have SRE support,
and helping bootstrap
it and get it up to the
standards where we’re willing
to have an SRE support it.
This might make you wonder,
well, who ran it before that?
Typically when new software is
introduced by a developer,
it’s run by the developer for at
least the first six months
of its production life cycle.
And that can include rolling
it out to live users.
And then we usually do encourage
those developers to
have SREs involved in helping
them get ready to roll it out.
It’s often at a more mature
point in the product life
cycle where they’ve had a chance
to stabilize it and get
the kinks worked out of it, if
you will, and have run a few
hurdles through it themselves,
where the SREs then come in
and help them really prepare to
be long-term supportable.
And of course, that’s not throw
it over the fence and it
will be running from then,
which is one of the nice
differentiators there.
ANDREW WIDDOWSON: Yeah, so what
Aaron’s talking about as
far as bootstrapping a service
up to the quality standard for
SRE teams, I think that
is a very valid point.
That’s a crucial time when
there’s a hand-off of part of
the responsibilities for a
product to the SRE teams.
Leslie, where do you want
to take this from here?
I know you have some thoughts.
LESLIE CHEUNG: Yeah, so Aaron
had said that we don’t start
out– when a new service comes
up, it’s not staffed
immediately with SRE.
We make sure that the devs are
responsible for making sure
that it’s stable.
And that they understand some
of the common pitfalls that
occur when you’re running
in production systems.
SRE teams are not only deployed
for mission-critical
services, or services that
Google sees the need for
making sure that we’re
serving users very
reliably on that service.
And they’ll either work with
the developers to make sure
that this comes up to speed, and
then we can deploy a team
of people who can
work with this.
Or once the project reaches
maturity we can actually have
a real team work with it and
take it over completely, and
then work with the developers
on the future, long-term
prospects of that service.
ANDREW WIDDOWSON: Absolutely.
It’s interesting to me that we
have the standards that we do
for the SRE organization.
I think this actually separates
out substantially
from a traditional operations
organization, where it’s like,
here’s what you get.
We’re giving this to you, right?
And in fact, most interestingly
enough, though
SRE provides a lot of value for
production services, if we
find that it’s not up to our
standards, we’ll hand it right
back to the developers, right?
So there’s a time and a place
for us to get involved.
And we do that as
a whole team.
Speaking of which, Dina, what’s
your take on which
projects do Google
SREs work on?
DINA BETSER: Right.
So SRE as a whole really tries
to support organizations with
a group of SREs.
So one thing that is important
to know is that a small
service is unlikely
to get a singleton
SRE working for them.
Because we believe that a group
of SREs working together
is much more likely to
have a bigger impact.
So for instance, sometimes
smaller services are grouped
together such that a group of
SREs can support all of those
services and collaboratively
work on larger and more
complex problems.
ANDREW WIDDOWSON: Absolutely.
Cool.
So it turns out we have a
question from our Google+
Hangout on Air audience.
This question comes
from [? Yukon ?]
[? Wong ?].
And his question is what kinds
of projects have you guys
worked on in the past?
Aaron, why don’t we
start with you.
AARON JOINER: So when I came to
Google, I guess the first
thing I got handed was our
serial console infrastructure.
I had worked on serial console
things before I came here.
And this sort of feeds
back into the
engineering aspect of SRE.
I, pretty much with me and a
small team of a couple other
people, wrote all of the code,
many thousands of lines of
Python from scratch to manage
integrating a planetary scale
serial console infrastructure
out of the gate.
It was quite a challenge
from what I was
used to in the past.
It was a lot of fun.
I’ve worked on what
we like to call–
well, I guess the best way to
say it, because I work in the
bowels of infrastructure on
things that aren’t public
facing, I have to choose my
words a little carefully.
I’ve worked on turning up our
new data centers and machines
when they come online.
So if you have–
you can imagine what large
chunks of machines that we
bring on, and turn web search
on on top of them or Gmail,
and various and other
sundry services.
And automating that process so
that it’s not done by a human
is something that allows us to
greatly increase our velocity.
Every time any of those machines
is sitting idle that
is dead time that is
literally money
ticking away on the clock.
So we work very hard
to make that
process fast and expedient.
And these days, I’ve started
to work pretty much in the
bowels of the infrastructure
still.
So doing things like working on
the system that safely and
carefully rolls out our
production Linux image to all
the machines in the fleet.
You can imagine that if we make
errors there, that it can
be catastrophically bad.
So we have to be very slow and
very careful so that we don’t
disrupt the systems running on
top of that, such as web
search and Gmail, et cetera.
ANDREW WIDDOWSON: Good stuff.
We all have a story about things
that we’ve worked on
here for Google SRE.
I have a couple I’d
like to share.
So I happen to, I think like
Aaron, have a background and
appreciation for music.
And so I jumped at the chance
when we launched a
music-related feature on
Google web search.
In fairness, it was only in
the Americas at the time.
It was a feature where you could
search for an artist,
album or track, and it would
pop up playing a
streaming bit of music.
The thing was, we weren’t
hosting that music ourselves.
We were hosting it with some
external partners.
And like any good site
reliability engineer, I had to
make sure that the entire
end-to-end user experience was
going to be fast, friendly,
perfect the first time.
So I actually worked with those
external companies to
load test them through the
commodity internet.
We used the same infrastructure
that Googlebot
uses to send a
representative
sample of user traffic.
And we streamed at tremendous
bandwidth rates over the
internet to make sure things
were going to work perfectly.
I’ve also had the pleasure and
privilege of working on
scaling up Google’s encrypted
search, which we launched a
few years back.
So if you use HTTPS Google web
search, myself, another SRE,
and a software engineer worked
together in tandem intensely
for a couple of weeks
to get that out the
door in a timely manner.
So privacy and freedom of
information being what it is,
I was very proud as a Googler
and as an individual to be
able to launch that feature–
couldn’t have happened
without SRE.
So those are some ideas of some
things that we worked on.
And thank you, [? Yukon, ?]
for asking that question.
If other folks have questions,
you can tag us at #SREhangout
on Google+.
OK.
So coming up next, another
question that many people ask
is what does a typical day
look like for SRE?
What are some of the typical
problems you face?
And the answer is, there really
is no such thing as a
typical day for SRE.
In fact, that quite frankly
hits the nail on the head.
It turns out that, as an
SRE, you are a Jack
or Jill of all trades.
You work across multiple
teams.
Your challenges adapt
on a daily basis.
Where perhaps a traditional
software engineer may be
working on the one feature that
they’re going to launch
this quarter, or this month,
or the one or two features,
SRE are moving within several
different domains, doing the
reactive consulting–
excuse me, the proactive
consulting we mentioned.
Is something going to scale?
Or spending their own amount
of time being in an on-call
rotation, some of the more
reactive stuff, meetings with
developers, meetings with
their teammates
to get stuff done.
So in my particular case, just
as the example I’ll take for
this question, I spend about
20% of my time in meetings,
keeping track of what’s coming
up, communicating changes to
folks, making sure everyone’s
on board.
I spend about another 20% of my
time as a senior SRE here
at Google, teaching newer SRE
how to be a good on-call
engineer, how to do the
diagnostics and sorts of
things that we do.
I spend maybe another 40%
of my time writing code.
As a software engineer, I’m
the tech lead of a project
that prevents abuse against web
search, denial of service
attacks, people who want to
unfairly copy our search
results, that sort of thing.
And the remainder of my time is
spent occasionally serving
shifts in our on-call
rotation.
So that’s a typical day or
an atypical day in SRE.
You never know quite what
you’re going to get.
So we just wanted to make sure
we covered that topic.
Now another topic that people
particularly have questions
about is, what it’s like
to be on call at Google?
What happens if you get paged,
and you’re dealing with an
emergency alert?
And I will send this off to
Aaron to lead that answer.
AARON JOINER: So I’ll start off
by admitting upfront that
I’m somewhat of an
on-call junkie.
I am delighted when the pager
goes off, because of the
intellectual puzzle that
it often presents.

Being woken up at 2:00 in the
morning and having to go fix
something is really something
I thrive on.
That said, I have to say that
at least for our particular
group– and I think
this is broadly
representative of SRE–
we work very hard to make sure
that our pager rotations are
not the meat grinder you might
imagine that on-call rotations
can be from some of
the industry.
We tend to have very well
established procedures for the
typical things that can go wrong
in the form of playbooks
and very good and concise
monitoring alerts.
And making sure that you don’t
have a pager that goes off
when there’s no need for a
human to respond to it.
Getting back into that
whole automate
yourself out of a job.
Let the machines fix it
where it’s practical
for them to do so.
So that’s my take on it.
Yeah, it’s a lot of fun
to hold a pager.
But it’s really not nearly as
bad here as it might be
elsewhere, so to speak.
ANDREW WIDDOWSON: Cool.
That’s definitely one
take on things.
I like that many SREs
are adrenaline junkies.
If that’s something that
intrigues you, take a look.
Leslie, I’d like to hear
your take on things.
LESLIE CHEUNG: Yeah.
When you get an emergency alert,
maybe you’re woken up,
or maybe you’re just
out and about.
It’s like getting a problem
dropped in front of you.
It’s like a puzzle and
you have to solve it.
And there might be some time
constraints with that as well.
It makes it a little exciting.
I think that when you’re
on call and get an
alert, you have to–
you’re now the lead person.
You’re coordinating the
effort to fix it.
But that doesn’t mean that
you’re the one person who’s on
call for this event.
There are many other engineers
who are also on call that you
can tap into.
There are other developers you
can ask what’s going on.
They may be experts in a certain
part of the system
that you need to look into.
And that’s part of your
reverse-engineering effort as
you’re trying to figure
out what went wrong.
You can tap into this vast
network of people.
So although you might be the
lead on this, you’re never
really alone when
you’re on call.
You have a very good
support network.
ANDREW WIDDOWSON: Absolutely.
And I’d like to underscore
some of your points, Leslie.
You’re fully supported by Google
and Google teams when
you’re on call.
It’s thrilling to think that as
the incident commander for
an outage what you say goes.
And it is your responsibility
to divine the ultimate root
cause of a problem, and make
sure to see it through all the
way to its completion.
At the same time, you have to
prioritize several things.
This is where the excitement and
the puzzle stuff comes in.
How do we prevent the
breakage that our
users would see, right?
It’s actually amazing to
consider that of all the
different things going on at
Google, there’s brokenness
going on every day.
But because we design our
systems to expect failure and
to route around things, much
of it is not seen.
That being said, the SREs can
swoop in and make sure that it
doesn’t become a problem
of epic proportions.
So that to me is really
fascinating.
And when I’m on call, to
reiterate Aaron’s point about
the adrenaline of things,
it’s absolutely a rush.
Because you’re out there
defending our users from
brokenness and making sure they
get the right results.
Thinking of it less from an
operations standpoint and more
from a scientist and engineer
standpoint, you can imagine
that any Google engineer’s
time is very valuable.
But an SRE who’s on call, even
doubly or triply so, because
of all of the stuff that’s
on the line.
Think for a moment about the
hundreds of different factors
that one could imagine that
might contribute to any sort
of problem that we
have at Google.
And like I said, there
are an unlimited
number of these things.
I like to think of it as kind of
a decision tree, or even a
search space, of things that we
have to consider before we
can find the root cause
of a problem.
It’s my job to take my
experiences and my skills that
I brought from school, from
work, and figure out how to
prune and navigate through that
search space so that I
spend my time most effectively,
nipping the
problem in the bud.
And I think that’s something
that SRE definitely
brings to the table.
So perhaps a bit more of an
insight into how we deal with
emergencies here at Google.
But it’s definitely part of our
company culture– pitch
in, get it done, fix the problem,
and fight for the user.
That’s the SRE way and
the Google way.
OK, so we have another question
from our audience.
As a reminder, you can
tag #SREhangout if you’d like to
contribute.
The question from [INAUDIBLE]
is, do you make use of machine
learning and artificial
intelligence algorithms
for your projects?
If yes, could you provide an
example where such algorithms
have proven useful during
your projects?
All right, I know several of
us can answer to that.
I’ll take a first
pass at this.
As a matter of fact, we use
machine learning and
artificial intelligence in our
abuse detection software that
we use here at Google.
I’m, in fact, leading a team
of folks who are doing all
sorts of clustering and analysis
to try to figure out
what behaviors for web search
are totally above board,
totally normal–
our users are on a variety
of different platforms–
versus all sorts of nefarious
behavior that’s trying to
attack and shut us down.
It’s an interesting arms race.
And so to the extent that we can
do some of our pre-work by
automatic classification
or through learning and
supervised systems,
we do that.
And then ultimately we’ll
step back and
measure and cut again.
So it’s definitely an exciting
thing on the
face of abuse detection.
I know that perhaps Aaron would
also like to contribute
to this question?
AARON JOINER: Sure, yeah.
Like I said, down in the bowels
of the infrastructure,
we deal with literally all of
the machines at Google.
And so as you can imagine,
that’s a lot of machines.
In order to keep track of what’s
gone wrong with any
particular machine at a point in
time, we need to be able to
diagnose what that machine’s
problem is, and be able to fix
it quickly and reliably.
There’s a couple neat
things we do here.
One of them, there’s actually
some insight into it in a
paper we published about
hard drive failure
rates a few years ago.
Most of the information gleaned
from that was done in
applying machine learning to
figuring out what are the
signals that this machine
will go bad in time.
And that’s not something I
personally worked on, but it
was something I was closely
associated with.
And things I’ve personally done
for monitoring why the
software on all those machines–
the sort of lowest
layer of software, the actual
production image itself–
as it rolls out, as we roll
out changes to the fleet,
things will break.
And there’ll be coincidental
breakages.
And being able to determine the
signal from the noise in
that condition can
be very tricky.
And we apply some neat machine
learning tricks to make sure
that when we’re rolling out a
bad change, we can tell the
difference from merely some of
those three or four or five,
nine events that happen
to fail
coincident with that roll-out.
ANDREW WIDDOWSON: So as you
can imagine, at Google we do
things at what we’ve been
referring to as planet
scale, right?
And you can imagine the
different sorts of data
sources we have on everything
from machine failures to
queries per second coming in
through our border routers or
what have you.
It’s a tremendous and
interesting data problem for
anyone who’s interested in
machine learning and
artificial intelligence.
And having the keys to the
castle to be able to make
changes and in an on-call
capacity, also allows you to
spend the time writing
software in that same
environment, which is
pretty darn cool,
if I do say so myself.
Maybe I’m biased.
All right, so we’re going
to go on to our
next question then.
So a lot of students ask me at
job and career fairs that we
host for Google, what are some
of the skills needed to be an
SRE, right?
So in fact, one of our students
online here has asked
a question.
Matthew has asked what is the
typical education level for an
SRE, which I think is a good
follow-on question.
So let’s start off, Leslie,
with what are some of the
skills needed to be an SRE?
LESLIE CHEUNG: Well, a lot of
people think that in all these
engineering roles you need
to have some type of a CS
background.
And while that does help, that’s
not necessarily the
only thing that we’re
looking for.
We’re also looking for people
who can be assertive when a
situation needs someone
to take a lead on.
You need to be able to say–
you need to be able to
tell people like this
is not a good design.
You need to be able to have
confidence in your knowledge
and be able to assert yourself,
either in an
emergency situation, or even
in design discussions.
You need to be able to
say, no, this may not
be the right way.
Maybe you should look at this
design because it will be more
fault tolerant.
You also need to have a good
head for triaging.
You want to evaluate when a few
different problems come up
all at once.
You want to be able to weigh
which one is most severe,
which one’s going to lead to the
most user impact, and then
attack that problem.
And those are skills that
aren’t necessarily taught
explicitly in formal
education.
But those are skills that are
really good to have as an SRE.
ANDREW WIDDOWSON: Absolutely
the case.
So Dina, what’s your
take on this?
DINA BETSER: Yeah, I definitely
agree with what
Leslie said.
I think it’s really important
in talking about the kind of
work that SREs do to remember
the reactive and proactive
kinds of work.
So since one of the things we
do is react to pages and
interrupt work, it’s important
to be able to
balance between our projects
and the interrupt work.
So one of the things that that
can require is being able to
focus on multiple projects at
the same time, which is a good
skill for SREs.
Another good skill is being
able to communicate
succinctly.
When you’re in the middle of
an outage, you want to make
sure that all of the
stakeholders and users of your
system know what’s going on
every step of the way, and
that everything is communicated
clearly.
Also to respond to the question,
what is the typical
education level for a Google SRE
from our audience, just on
my team alone, there’s a wide
array of education levels.
We have people who majored
in CS– typical.
We have people who
have master’s degrees.
We also have people who learned
mostly by doing, and
may not even have a degree.
But really it’s about what you
can do, and the skills that
you have that you can
bring to Google.
ANDREW WIDDOWSON: Absolutely.
And so, to the idea of skills
that we may not have learned
in the classroom as well, let’s
talk about some of the
skills and extra things you
can practice to help put
yourself on the career path
for site reliability
engineering.
Aaron, what do you think?
AARON JOINER: So for my two
cents, I’m probably the least
traditional in that respect in
that I did not come from the
CS background.
So I certainly had to learn
it the hard way,
if you will, myself.
And so what we see in SRE, as
far as educational
backgrounds, is that with people
who did come from a CS
background, we end up having to
help them learn some of those
more operational skill sets.
And if you’re currently a
student, it’s probably
something you could think about
as well to answer your
question directly.
But any chance you can get to
work with production Linux
systems, either at home or with
your local Linux users
group, or at a computer lab
at the University–
any hands-on exposure you get
to dealing with file systems
and dealing with actual LAMP
stacks serving websites, those
kinds of things are very
practical skills, running
the gamut across into
networking as well.
So those things are the sort of
core operational skill sets
that software engineers end up
having to pick up as they go,
not because it’s something
you necessarily need on a
day-to-day basis.
Our systems tend to abstract
away a lot of the sort of
mundanity of running a job on
a Linux machine at Google.
But that’s true until
they don’t.
So when things start to
fail, those skills
become extremely valuable.
And so do the intuitions that
Leslie was talking about
earlier, to be able to triage
those problems as you see them.
So again, as far as education
backgrounds, myself, I came as
a conservatory musician.
I certainly did not come from
a traditional CS background.
So you certainly can succeed in
this space on your merits,
as Dina was saying earlier.
You can prove that
you can do it.
And that’s all we’re really
after is success over
paperwork, if you will.
ANDREW WIDDOWSON: Absolutely
the case.
Myself as a more traditional
computer scientist, I took it
upon myself to diversify
my out-of-classroom
experience as well.
For example, I was the IT
manager of sorts at the campus
radio station at Carnegie
Mellon.
And I also interned with the
campus’s network engineering
and network development group,
to kind of get a whole
different perspective.
For those of you who are
computer science and
electrical and computer
engineering students perhaps
out there, and variations
thereof, obviously be solid on
your data structures
and algorithms.
Understand computational
complexity and run times.
Understand the fundamentals
really of what makes computer
science computer science.
But venture out a
little bit too.
Getting that extra experience
with Linux or servers or that
sort of stuff, will definitely
make you stand out as far as
site reliability engineering
is concerned.
And the balance of those two
things, regardless of which
way you came into SRE,
is a great thing.
And we will build you up on one,
if you are less familiar
with it than the other–
software engineering and the
more traditional systems
engineering roles.
So with that then, I think we
have spent a lot of good time
here talking about the site
reliability engineering role.
And I’m so glad we could
take this time to
discuss all of this.
I’m sure there will
be more questions.
If you have others, please feel
free to continue tagging
SREhangout on G+, and
we’ll help answer
some of those questions.
For now, for myself, for Dina,
for Aaron, and for Leslie, my
name is Andrew.
And thanks so much for attending
our Google Students
Hangout On Air about site
reliability engineering.
For more information about
Google and its job roles, you
can visit google.com/students,
your portal
to hiring at Google.
And of course, you can also
plus us on Google+.
That’s plus.google.com/
plusGoogleHangout.

Thanks so much.
Everyone say goodbye.
LESLIE CHEUNG: Bye.
DINA BETSER: Bye.
AARON JOINER: Bye.
