Site Reliability and Digital Business

I’m Michael Krigsman, industry analyst and
host of CXOTalk. And we’re here at Future Stack ’16, which
is New Relic’s conference being held in San Francisco. And I’m talking with Cameron Tuckerman-Lee,
who is a site reliability engineer for Airbnb. Hey Cameron, how are you doing?

Really good! How about you?

Good! We all know what Airbnb does, but what does
a site reliability engineer do?

I think that’s a good question. I think the role is very different depending
on what company you’re at. So, at a lot of companies, your SREs are
your operators. You have developers on one part of your building
that develop your applications, and then throw them over the metaphorical wall over to your
operators, who make sure that it’s running in production.

So, silos.

Yeah. So, at Airbnb, we don’t subscribe to that
model; we are in the DevOps model that is becoming very popular lately. So, the same engineers that are building applications
are also the ones that are running them, scaling them, and dealing with incidents. But because of that, there’s a new class of
tools that are required to make sure that they’re doing that efficiently and using best
practices; and so that’s what the SRE team does: it makes sure that the entire site is
reliable and available, and we do that by supporting the other teams that own their
applications.

What kind of tools help with this?

So, some of it is … a lot of it is learning. So when there are incidents, how do you make
sure that there’s good follow-up to that; that there’s learning from that. And so, there is this tooling around, like
post-mortems, and making sure that when incidents do occur, that if there are previous incidents
that were like this, you are able to get that data very quickly and understand it. It’s also getting the right people in the
room. So, how you do [that] with pager escalations,
how you deal with alerting; those are also owned by the site reliability team. You know, we’re also the ones that own and
maintain the integrations with some of our monitoring tools, like StatsD and New Relic. These are how, when there are incidents, that
we’re able to quickly triangulate where the problem is and what the impact was.

So it’s a combination of technology tools,
but also processes and approaches combined with data.

Absolutely. So, I think there’s lots of different good
ways to go about incident response, but a really not-great way to do that would be to
have everybody be doing it their own way, and have no consistency. So, having a team like SRE means that Airbnb
has a consistent approach to incident response, so when there are problems that need to get
escalated up the chain, they can get picked up and handled very quickly.

And, you’re very focused using the end-user
as a reference-point.

Absolutely.

Tell us about that.

I think no business likes having downtime. Obviously, there are financial implications
to any business, but there is a really personal human aspect to downtime at Airbnb. The situation I like to remind myself of to
motivate me is, you can imagine, you know: you’re going on vacation, just got off the
plane, you’re in the cab, you’re heading to your listing, you open up your application
to get its address, and you just see a 500. It would be a pretty bad or potentially scary
situation.

Yeah, very painful.

Yeah. And so, Airbnb really is nothing without our
community. I can’t imagine what the product would be
without the guests and hosts that trust us; so, making sure that we’re not just up and
available for taking bookings, but that people are able to rely on us is really important
to our business.

You mentioned the word “trust”. How does trust relate to technology, relate
to user experience; how does that web work?

It’s a good question. So, some might say that Airbnb is the hospitality
company, but some might also argue that we’re selling trust: the trust that you’re going
to be able to go to a stranger’s home, and feel welcome and have a good experience, and
be able to experience that neighborhood like a local. And so, the technology that goes into making
sure that people are what they say they are, that you’re able to interact with your host,
and get to know each other beforehand; that you’re able to, when you’re searching for
a listing, find a place that’s going to fit with the kind of neighborhood that you’re
looking for; I think all contribute to making sure that when you go someplace, you trust
that it’s going to be a good experience.

And how does that, then, connect to site reliability
engineering, and to other engineering functions inside Airbnb? How do you think about the connections?

I think this comes down to engineers feeling
like they’re very involved in the product. I don’t think that many engineers at Airbnb
feel like they’re just doing what they’re told – they’re shipping code, and once it’s
deployed, they don’t care about it anymore. They really feel like they need to own their
own impact; that’s the term that we throw around a lot.

“Own your own impact.”

“Own your own impact.” So, if you think something needs to get done,
if you think something’s not being done the right way, it’s up to you to stand up and
make that change happen. And so, this is from everybody from product
teams developing new features for guests and hosts to make their experience better, all
the way to the, say, reliability team that – you see that there’s issues that need to
get resolved, or there are some parts for processes that aren’t working out, we need
to step up and do something to make sure that our guests and hosts are going to have the
best experience that they can [get].

So you really do see it as a kind of chain
of linked tools and processes that have this ultimate combined impact on the user.

Absolutely. We want to have teams build on top of each
other, all the way until the teams that are building the actual experience that our users
see. We want to have a really strong foundation
for them, so that when they are building JavaScript frameworks [for] user interfaces, that they’re
able to trust that the back-end is going to stay up, that they’re able to trust that if
there are issues that go to production, that we’re able to tackle them very quickly and
roll back. And so, it really is a pyramid of supporting
each other.

And finally, what’s the data that you look
at?

There are a couple different parts of the
data that my team cares about. It’s everything from your traditional SRE
metrics, mean time to resolve, mean time to acknowledge, you know, when [it is] incident
response. My team is also starting to really care about
metrics around making sure that our on-call engineers are living healthy, productive lives;
making sure that work-life balance is something that extends [to] something when you’re on
call at 2 AM. I think it’s something important for industry
to start looking at. Lastly, the ones that are aligned with how
our users are seeing things; and these are what a lot of companies would call “service-level
objectives,” making sure that our response time is up, our error rates low, that [it
is] not just response time to sending out bytes to our CDN as fast, but also making
sure that when the browser does get that information, it’s also having fast load times. And that’s where things like application monitoring
with companies and products like New Relic come into play.

So, it is a very holistic view.

Absolutely.

We have been speaking with Cameron Tuckerman-Lee,
who is a site reliability engineer at Airbnb. Cameron, thanks a lot!

Thank you so much!
