How resilience and security shift left: An interview with the EVP Product & Engineering and CISO at FOX

    Originally published on Failure is Inevitable.

    Melody Hildebrandt is the Executive Vice President of Product & Engineering and CISO at FOX. Her career began with designing wargames for the Department of Defense. She has gained extensive experience in disaster planning, testing, security, and resilience at organizations such as Palantir. Recently, she led the effort to plan for and execute FOX's digital streaming of Super Bowl 54, including taking over an entire sound stage in the process.

    Ashar Rizqi, CEO and co-founder of Blameless, had the privilege of interviewing Melody about her personal story, as well as her thoughts on security and resilience in today’s constantly evolving world of technology. They shared perspectives on continuous learning and improvement, the operations behind the most recent Super Bowl, and how resilience and security are shifting left in the software lifecycle.

    This interview has been edited for length and clarity.

    Ashar: Thanks a lot for taking the time, Melody. Today, I wanted to discuss your perspective. Historically, product development, engineering, and security have been very siloed, but your case is unique in that you manage all these domains. I’m really excited about shedding light on your perspective because, in my opinion, that is where the world is heading. There's more consolidation happening. Product leaders, engineering leaders, and security leaders need to become like Iron Man, don the Iron Man suit, and get all these superpowers.

    Before we dive into that, first, you’ve had a pretty amazing career trajectory. Can you tell us about that journey and how you came to FOX?

    Melody: Sure. My career started by moving to France and living in the mountains, but probably the more interesting part of my career started when I began to work at a consultancy for the Department of Defense.

    At the time, I was quite unqualified for the job. Most of the people on that team had a master's or PhD-level education, which I did not have, as well as experience or a military background, which I also did not have. Really, I got the job because I met the partner in charge of the team, who is a passionate board game designer on the side. I'm a passionate board game player, so we found some common ground, and I was given a chance and entered at essentially the lowest level. It was literally called level zero. Pre-partner, employees are on a hierarchy of levels one through five, and I was level zero, not even on the ramp yet.

    It involved designing conventional board games for the Department of Defense, a lot of scenario planning. The whole exercise was to put people into experiences that would pressure test plans, which strangely ended up being actually relevant to my later career opportunities. The military capabilities that we're investing in today — let's say they actually deliver exactly as expected — how will they perform against what we think will be our adversary capabilities in a decade from now? That means bringing a lot of outside experts to pressure test scenarios, and then bringing a lot of senior military people — like four-star level generals — together.

    So I began with that. I was designing the scenarios and saw what it was like to have a plan and see how those plans survive contact with the enemy. In that case, since it was actually the Department of Defense, we could use that metaphor and it was less lame. I did that for five years. Through this experience I began to get interested, in general, in how employees at the Department of Defense get to work. I saw what they were able to use in terms of technology, and it was just so far behind what we as consumers were used to interacting with.

    Again, this is around 2011, so consumer technology was just far outstripping what the government employees had to work with. This was when consumer tech was the hot thing to do. If you were a young engineer graduating from Stanford, you probably weren't going into enterprise software, and you certainly weren't going to work on government software.

    There was one exception that I came across in the fledgling DC tech community: Palantir. Palantir was early to recognize that Silicon Valley technology should be brought to Washington, and that the operator had the right to be interacting with the best technology and to have Silicon-Valley-level engineers working on these problems, so I got interested in that.

    I had this tech blog on the side, and I wrote about Palantir on my blog. It caught their attention, and they offered me a job. I went to Palantir as one of their first non-traditional, non-software-engineer hires in Washington. The company was around 200 people, so pretty small, early in Palantir's journey. I worked with them for seven years, including moving to New York and helping to open their commercial office, where I started to get exposed to more data problems in the enterprise.

    As a big data company, Palantir got serendipitously pulled into a few of the responses to the biggest corporate breaches of the early 2010s. I led those responses, and while I didn't particularly have a relevant information security background, I had an incident response background from my previous days with the Department of Defense. That turned out to be really useful to our response. The problem set of cybersecurity breaches was also a great fit for our product, so I basically spent the next few years with Palantir focused on building the product to work on cybersecurity.

    Particularly, I was working on the problem around data analysis and cybersecurity, and that extended to money laundering and other enterprise problems around risk. Then the final chapter was when I pitched our product to News Corp and met my now boss. When he became the CTO of 21st Century Fox, he called me. I jumped ship to become the CISO at 21st Century Fox, and then, following the March 2019 spin-off of FOX by 21st Century Fox and the establishment of FOX as a standalone public company, I took on the expanded role leading engineering and product.

    Ashar: First of all, I think it's amazing that you went from zero background in security to leading security for a Fortune 2000 company, and are now transitioning into this expanded role. I mean, that's just such a phenomenal story all around. I'm curious to hear how that experience has been. How are you leveraging these experiences in this new expanded role? And how does it affect the way you think about product delivery, engineering KPIs, security KPIs, and all of this merging data across the software lifecycle?

    Melody: Well, I came with a strong bias around planning and testing plans, and that's expressed itself in a few ways. For example, at FOX, we designed and executed an executive tabletop exercise a few months before the Super Bowl. This was attended by top FOX executives to actually walk through a technically validated scenario of how FOX could face cybersecurity issues during the Super Bowl, and how we would respond.

    From my previous wargaming experience, I came in believing that this kind of simulation was extremely important to do, and had a perspective on how to do it well: to actually drive the right kinds of conversations and really test assumptions. This was less about “what are our technical capabilities?”, which we obviously test a lot too, and more about the executive response perspective.

    You think about how breaches are responded to, the quality of the response both technically as well as how you communicate about that, how you make decisions quickly, how you know who has decision rights, how to get decision-makers the information that they need, how to disseminate those decisions quickly so the company can respond and move forward... these are all incredibly important. If I were wearing a purely technical execution hat, I might miss that whole piece of it. It really taught me how important it was to be able to communicate these kinds of concepts to the executives so they could engage with questions, like in the scenario where we posited extortion. How do you have the executive team prepared for that kind of conversation, so that they can make the right decisions in the moment?

    We took a similar mindset to engineering planning for major events. For the Super Bowl, we ran a series of dress rehearsals, which were end-to-end tests that had minute-by-minute playbooks about transitions between programs. We introduced chaos into those tests. We had all of our third parties involved. The questions were: “Are we prepared for scaled testing, with all of that ready, and are we prepared to actually deal with incidents?”

    And we were ready. We processed over 100 Jira incidents during the game. I think it was the fact that we had such a machine around it that we could process these tickets and not lose our minds.

    We did have one serious issue with our video stack but it was essentially invisible to the users because we had planned for potential failure too. We had planned to be able to handle 30,000 requests per second, meaning if we lost users due to a technical failure, we could rapidly get them back into the stream. The issue could have been catastrophic, but for the end user experience, it ended up invisible or quite minor. We made a decision really fast about failover, we tested how we would deal with failover, and we had tested our entire architecture to be able to actually handle those requests per second to get people back. This ensured we wouldn’t poorly throttle people who might be stuck and retrying to get in.

    Ashar: So there are tools and tech supporting the foundation; that seems like the easy part to do. This is a pretty massive organization that we're talking about here, and to get them to respond and react and learn and be prepared, that's a phenomenal undertaking in a real-time situation with a tremendous amount of stress. What were specific highlights from the way that you structured the organization, the kind of culture you built internally to be able to achieve that? There aren't that many companies in the world that can arguably sort of say that and come out being successful. I'd love to understand some of the highlights around culture, team, and processes internally.

    Melody: We had over a dozen vendors on site, we had security, product, engineering, client, our entire engineering team in this room. We took over a sound stage on the Fox Studio lot. We're a distributed team across New York, Cincinnati and LA with a huge number of contractors, freelancers, and vendors who are all essential to our stack. That's one thing that makes our environment more complex.

    I think we benefited from the fact that the Super Bowl is such a clear rallying cry. It's a moment in time that we knew we were going to set the record if we were successful. If you're an engineer working at our CDN partners for example, this is really exciting. Everyone knew the importance of the moment. So we benefited from that. A moment like the Super Bowl, you get some of that “for free”, in terms of rallying towards an objective, a shared objective, and getting a lot of people aligned against that objective. The most important thing is everybody knew what mattered the most.

    We also socialized some of the key principles that were important to us, and what success looked like. We obviously had a number of concurrent streams we were planning to hit, which we knew was going to be at least five times our previous week. That's kind of a crazy thing to undertake but it’s clearly defined. That made it almost easier.

    The thing that took the most work was getting people to realize how important this question of resiliency was: "What if we lose our video stream and everyone has to get back in?"

    The team was really informed by observing what happened during a similar event a few years back on a different platform, where the app took a catastrophic multi-minute outage. They hit a moment they hadn't anticipated. It was a weird confluence of events, but the biggest problem then was they had over a million people on stream, all of whom got kicked out, all of whom had to get back in.

    During that incident, the average person was clicking like 30 times because they weren't getting back in, so the requests were just piling up. That turned what would have been an outage measured in seconds into an outage measured in many minutes. This concept of resiliency, of “if we lose our stream, how fast can we get everyone back in,” was what drove so much of our work on the client side, all the APIs, all of our third-party vendors. Our goal was 30,000 requests per second.
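
    As a back-of-the-envelope illustration of that amplification (not part of the interview), the Python sketch below combines three figures quoted above: one million viewers, roughly 30 clicks per person, and 30,000 requests per second. Putting them into one calculation is an assumption made purely for illustration, as is the idealized model in which every request is served at a steady rate with no other bottleneck. The function name `seconds_to_drain` is hypothetical.

```python
def seconds_to_drain(viewers: int, requests_per_viewer: int, capacity_rps: int) -> float:
    """Idealized time to serve every queued request at a steady rate."""
    return viewers * requests_per_viewer / capacity_rps


# Figures quoted in the interview; combining them is an illustrative assumption.
VIEWERS = 1_000_000
CAPACITY_RPS = 30_000

one_attempt = seconds_to_drain(VIEWERS, 1, CAPACITY_RPS)
thirty_attempts = seconds_to_drain(VIEWERS, 30, CAPACITY_RPS)

print(f"1 request per viewer: {one_attempt:.0f} s")               # 33 s
print(f"30 requests per viewer: {thirty_attempts / 60:.1f} min")  # 16.7 min
```

    Under these assumptions, one re-entry request per viewer clears in about half a minute, while 30 requests per viewer take more than a quarter of an hour, which is the seconds-versus-minutes gap described above.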

    Things were going to happen, and we had to be prepared to process them really fast. Again, on the day of the Super Bowl, we still hit unexpected issues. If we hadn't just processed 2,000 incidents over the previous three weeks in all of our tests, we would have gotten a little more riled up about something like that. But because of our preparation, it was like, "Okay, make sure it's on the big board," "Create the ticket," "Who's the incident commander?" We could just process it much more calmly.

    Ashar: Right. So 2,000 incidents is a tremendous amount of valuable data. How did you create the culture to learn and implement those learnings as quickly as possible? Because that's really what happened here: you had this data from 2,000 incidents, and everybody started creating these models. The output of that is you have models of response, and models of certain actions and tasks. How did you convert that so quickly into learnings? And secondly, is that still where you are today? What part of that culture that you've created has become the status quo moving forward?

    Melody: We’re still learning here. We did create this sense of a Super Bowl experience, where we asked, "What is the most stripped-down experience that allows 99.9% of the users to do what they want to do today, which is to watch the Super Bowl live?"

    So in practice, much of what the tests were actually about was running our client and knowing we have to run it in stripped down ways. We know we have to run it ungated, for example, so we had to take our authentication partner out of the loop, but by taking them out of the loop, that might cause 10 downstream unanticipated things ... there were just so many unknowns.

    So many of the tasks were first to get through all these corner cases, and run in production. Just making these client changes and API changes, we learned of strange corner contingencies. A lot of it was just real-time feedback into the dev teams: “We have to make these changes, and then we have to test again tomorrow.” That's what it was, really ironing out how our clients performed in the wild.

    The other piece was this sense of ownership: who's creating tickets, how are people assigned, who's incident manager, and so on. This was the Super Bowl, so it’s an extraordinary event and not how we could possibly run day to day. But by being so maniacally process-oriented, we had some key takeaways around things like what the role of an Incident Commander is.

    Another example: who is the executive in charge? I'm currently on PagerDuty for the week as the executive in charge, which is a new thing we've put in place.

    We know how to run flagship events now. We've got a lot of those learnings. We take this macro view, such as understanding the mean time to resolution of every event of a certain class. There probably could have been more learnings there, but again, it was such a unique event it's hard to extrapolate too much. There are cultural things we learned and process things we learned, but the individual things that happened provided signals to us, such as who is a responsive partner, who's really playing like they're on our team.
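
    As a rough illustration of the "mean time to resolution of every event of a certain class" mentioned above (not part of the interview), here is a minimal Python sketch. The `Incident` record, the class names, and all timestamps are invented for the example; nothing here reflects FOX's actual tooling or data.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    """Hypothetical incident record; field names are assumptions."""
    incident_class: str
    opened: datetime
    resolved: datetime


def mttr_by_class(incidents: list[Incident]) -> dict[str, float]:
    """Return mean time to resolution, in minutes, for each incident class."""
    durations: dict[str, list[float]] = defaultdict(list)
    for inc in incidents:
        minutes = (inc.resolved - inc.opened).total_seconds() / 60
        durations[inc.incident_class].append(minutes)
    return {cls: mean(vals) for cls, vals in durations.items()}


# Invented sample data, for illustration only.
start = datetime(2020, 1, 1, 12, 0)
sample = [
    Incident("video", start, start + timedelta(minutes=4)),
    Incident("video", start, start + timedelta(minutes=10)),
    Incident("auth", start, start + timedelta(minutes=3)),
]

result = mttr_by_class(sample)
print(result)  # {'video': 7.0, 'auth': 3.0}
```

    Grouping by class before averaging keeps a handful of slow incidents in one area from hiding how quickly the others were resolved.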

    Ashar: The Super Bowl became the new bar, right? 5X in terms of scale, here's how much we can spend, here's how much headcount we can do. You want to get to a point where you take the key learnings from that and then be able to apply that to some of the events that aren’t the Super Bowl. That is exciting and really important. And you're also speaking to the cultural aspects around documentation, understanding how we communicate with other teams and which teams work for us. That's really key here as well.

    You’ve applied your security hat to understand what security scenarios look like, and done the same for resilience scenarios, then dev scenarios and product scenarios, and that's signaled the hot fixes that you talked about, the test scenarios. How are you bringing all those things together moving forward?

    Melody: One thing with security at FOX: we run extraordinarily lean. Philosophically, I made this call two years ago when I first took the role, that we should not build a SOC. At Palantir, I spent a lot of time working with SOCs, and I just don't think it's what we need.

    So we kind of went through what Netflix has since termed "SOC-less" security. I didn't have this fully realized vision of it at the time; it was more this intuition that getting a huge bunch of people to sit in front of a screen was not going to move the needle on the problem. So we thought a lot more about it. One big piece of the puzzle was putting in place continuous testing.

    We have third parties that are continuously testing. It's not red team. It's purple team testing, so we're aware it's happening and it's this constant feedback loop into our operation. They find a way to do something and we're like, "All right, we didn't have coverage of this class of problem on our endpoint, we can fix that." So we're constantly learning and making changes to infrastructure in order to resolve those problems. In some cases the things we don't control are actually upstream in IT, who we can partner with and move the needle on.

    We think about how we can scale impact in that way, and that's through the learning and testing. One of my big projects for the platform this year, for the product and engineering team, is revamping our entire operations end to end, and part of that's because of what we're thinking about now. We need to be more progressive and use more automation.

    When we actually get everyone together, we can run an extraordinary event extremely well. We've run pay-per-views, and we've run the Super Bowl. So we know that in an extraordinary event, we'll show up, we have a war room, we can crush it. I think the question for us now is how do we actually scale that, so that it's not all heroics, because the team will be burnt out. For sports alone, it’s Thursday through Sunday, it's weekend work, it's evening work, and those are the same people who are building software, right? So how do you run both a very spiky live-ops operation for Sports and Entertainment, as well as run a daily operation for a News SVOD service that people expect to always work?

    This is what we're designing right now, and there's a lot to figure out. But we obviously want to learn from what we were able to do in the extraordinary event space, and find ways to make that scalable. What that means is building out an SRE function that's much larger than it is today, and having SREs embedded in more product areas. We're a shared service, we run a lot of products, and they have very different characteristics in terms of expectations.
