mrbusche.com

Building Secure and Reliable Systems Notes

Google released a new book, Building Secure and Reliable Systems, and it's pretty good. Everyone will want to read the first couple of chapters and then skim the rest. There are tons of headers throughout the book (one every other page on average), so you can easily skip to whatever sounds interesting.

Reliability and Security Tradeoff: Incident Management

Least Privilege

Classify access based on risk

A company may need three classifications: public, sensitive, and highly sensitive.
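As a rough illustration of the three levels, here is a minimal Java sketch. It assumes the levels are strictly ordered from least to most sensitive, which the notes don't state, and the names `Classification` and `readableWith` are hypothetical.

```java
// Hypothetical sketch: the three levels come from the note above; treating them
// as a strict ordering (PUBLIC < SENSITIVE < HIGHLY_SENSITIVE) is an assumption.
enum Classification {
    PUBLIC, SENSITIVE, HIGHLY_SENSITIVE;

    // True when a caller cleared up to `clearance` may read data at this level.
    // Enum compareTo follows declaration order, so lower levels sort first.
    boolean readableWith(Classification clearance) {
        return this.compareTo(clearance) <= 0;
    }
}

final class AccessDemo {
    public static void main(String[] args) {
        // A caller cleared only for PUBLIC data cannot read SENSITIVE data.
        System.out.println(Classification.SENSITIVE.readableWith(Classification.PUBLIC));
        // A caller cleared for SENSITIVE data can still read PUBLIC data.
        System.out.println(Classification.PUBLIC.readableWith(Classification.SENSITIVE));
    }
}
```

A real system would likely attach extra controls (approval, logging, justification) to the higher levels rather than rely on a single comparison.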

Keep Dependencies Up to Date and Rebuild Frequently

If your dependencies are up to date, it's likely you can apply a critical patch directly instead of needing to merge with a backlog of changes or apply multiple patches.

New releases and their security patches won't make it into your environment until you rebuild. Frequently rebuilding and redeploying your environment means that you'll be ready to roll out a new version when you need to, and that an emergency rollout can pick up the latest changes.

Release Frequently Using Automated Testing

Basic SRE principles recommend cutting and rolling out releases regularly to facilitate emergency changes. By splitting one large release into many smaller ones, you ensure that each release contains fewer changes, which are therefore less likely to require rollback. For a deeper exploration of this topic, see the "virtuous cycle" depicted in Figure 16-1 in the SRE workbook.

When each release contains fewer code changes, it's easier to understand what changed and pinpoint potential issues. When you need to roll out a security change, you can be more confident about the expected outcome.

Use Containers

As these changes roll out to each task, the system seamlessly moves serving traffic to another instance; see "Case Study 4: Running Hundreds of Microservices on a Shared Platform" in Chapter 7 of the SRE book. You can achieve similar results and avoid downtime while patching with blue/green deployments; see Chapter 16 in the SRE workbook.

To reduce the need for this kind of ad hoc patching, you should monitor the age of containers running in production and redeploy regularly enough to ensure that old containers aren't running. Similarly, to avoid redeploying older, unpatched images, you should enforce that only recently built containers can be deployed in production.
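The "only recently built containers" rule could be enforced with a check along these lines. This is a minimal sketch, not anything from the book: `ImageAgePolicy`, `isDeployable`, and the 30-day limit in the usage note are all hypothetical, and it assumes the deploy pipeline already knows each image's build timestamp.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical policy object; how the build timestamp is obtained (image label,
// registry metadata, and so on) is outside this sketch.
final class ImageAgePolicy {
    private final Duration maxAge;

    ImageAgePolicy(Duration maxAge) {
        this.maxAge = maxAge;
    }

    // True when the image was built no longer than maxAge before `now`.
    // A build timestamp in the future gives a negative age and is rejected.
    boolean isDeployable(Instant builtAt, Instant now) {
        Duration age = Duration.between(builtAt, now);
        return !age.isNegative() && age.compareTo(maxAge) <= 0;
    }
}
```

For example, `new ImageAgePolicy(Duration.ofDays(30)).isDeployable(builtAt, Instant.now())` would gate a deploy on a 30-day limit. The same comparison could feed a monitoring alert for containers that have been running too long.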

SQL Injection Vulnerabilities

In Java, use the Error Prone code checker, which provides a @CompileTimeConstant annotation for parameters.
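Here is a minimal sketch of how the annotation might be applied. To keep the example compilable without the Error Prone dependency, it declares a local stand-in annotation; in a real project you would import `com.google.errorprone.annotations.CompileTimeConstant` instead, since only the real annotation is checked. `UserQueries`, `trustedQuery`, and the query text are hypothetical.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Target;

// Stand-in so this sketch compiles on its own. It enforces nothing by itself:
// in a real build, delete it and import
// com.google.errorprone.annotations.CompileTimeConstant.
@Target(ElementType.PARAMETER)
@interface CompileTimeConstant {}

final class UserQueries {
    // A constant expression, so it is allowed as a @CompileTimeConstant argument.
    static final String FIND_BY_NAME = "SELECT id FROM users WHERE name = ?";

    // With Error Prone enabled, callers may pass only compile-time constants
    // (literals or static final constants) for `sql`.
    static String trustedQuery(@CompileTimeConstant final String sql) {
        return sql;
    }

    static String findByName() {
        return trustedQuery(FIND_BY_NAME);
    }

    // Error Prone would flag a call like the following, because the argument
    // is assembled from runtime data:
    //   trustedQuery("SELECT id FROM users WHERE name = '" + name + "'");
}
```

Because the query text can only be a constant, user-supplied values have to travel separately, for example as bound parameters of a `java.sql.PreparedStatement`.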

Automated Code Inspection Tools

Error Prone for Java and Clang-Tidy for C/C++ are widely used across projects at Google. Both of these analyzers allow engineers to add custom checks. For certain types of bugs, both Error Prone and Clang-Tidy can produce suggested fixes.

Debugging Techniques

Debugging is a skill that you can learn and practice. Chapter 12 of the SRE book offers two requirements for successful debugging:

Debugging: Record your observations and expectations

Write down what you see. Separately, write down your theories, even if you've already rejected them. Doing so has several advantages:

Distinguish horses from zebras

When you hear hoofbeats, do you first think of horses, or zebras? Instructors sometimes pose this question to medical students learning how to triage and diagnose diseases. It's a reminder that most ailments are common: most hoofbeats are caused by horses, not zebras. You can imagine why this is helpful advice for a medical student: they don't want to assume symptoms add up to a rare disease when, in fact, the condition is common and straightforward to remedy.

Reread the docs

After they found the warning message, they determined that it wasn't a zebra; it was a horse. Their code had never worked.

Debugging: Take a break

Giving yourself a bit of distance from an issue can often lead to new insights when you return to the problem. If you've been working heads-down on debugging and hit a lull, take a break: drink some water, go outside, get some exercise, or read a book. Bugs sometimes make themselves evident after a good sleep.

Security is a Team Responsibility

One example of this in practice is the way the team approaches security bugs. All engineers, including security team members, fix bugs and write code. If security teams only find and report bugs, they may lose touch with how hard it is to write bug-free code or fix bugs. This also helps mitigate the "us" versus "them" mentality that sometimes arises when security engineers don't contribute to traditional engineering tasks.

Who Is Responsible for Security and Reliability?

Who works on security and reliability in a given organization? We believe that security and reliability should be integrated into the lifecycle of systems; therefore, they're everyone's responsibility. We'd like to challenge the myth that organizations should place the burden for these concerns solely on dedicated experts.

We encourage organizations to make reliability and security the responsibility of everyone: developers, SREs, security engineers, test engineers, tech leads, managers, project managers, tech writers, executives, and so on. That way, the nonfunctional requirements described in Chapter 4 become a focus for the whole organization throughout a system's entire lifecycle.

Reduce Fear with Risk-Reduction Mechanisms

Here are some strategies you might want to try:

Dogfooding (or "eating your own dogfood") involves adopting a change before that change affects others. This is especially important if you're affecting the systems and processes that impact people's daily lives.

Opt in before mandatory

Overcommunicate and Be Transparent

When advocating for change, the means of communication can influence outcomes. As we discuss in Chapters 7 and 19, good communication is key to building buy-in and confidence in success. Giving people information and clear insight into how change is happening can reduce fear and build trust. We've found the following strategies to be successful: