Beyoncé rule, A-B-C test and Software at Scale Lessons from Google

Nilendu Misra
31 min read · Jun 4, 2020

A Complete Education

It is admirable how Google effectively “open sourced” its hard-earned wisdom. Thank you! Many places do many things exceptionally well — but the earnestness to coach and educate others is an extreme rarity among the “Big Techs”. These five books, the “Top N” Google Tech Talks (YouTube) and — depending on your role — coding best practices collectively form a top-notch technical and business education in themselves.

SEaG was well worth the two weeks I spent immersed in learning the Culture, Processes and Tools of Google's software teams. The writers are deeply knowledgeable, passionate about their craft, and offer insight after insight into topics ranging from human psychology to the flaws of over-embracing mock testing.

Themes or first principles articulated in this book-

  • Every problem is a scale problem once you’re successful.
  • At certain scale, every decision is a trade-off between efficiency and sustainability.
  • Every decision thus must be backed by data collected from the field, OR — if taken with a priori knowledge — must have actionable metrics attached to it so one can revise it if needed.

Central Thesis — Software Engineering vs. Coding

Not targeting algorithms, language, tools or libraries, SEaG focuses on “Software Engineering” as a system, i.e., stuff that is not taught at any school and can only be learned from experience. The book disambiguates “programming” from “engineering”, and filters each engineering component through the hardest engineering challenge — how to scale. Typical of Google, the big assertions are backed by data, e.g., they change roughly 1.5% of their code base a week.

It also offers a treasure of insights and ideas — from how Google made a billion-line code change at one time, to how its automated tests successfully process 50,000 change requests a day, to how Google's production system is perhaps one of the best machines humans ever engineered. Even if you are working nowhere near Google scale, the book covers a lot of fundamentals, especially on how to sustainably develop software and make objective trade-offs while doing so. I would highly recommend SEaG to any engineering leader trying to improve her team's game, or simply to become a better, well-rounded leader able to master the growing complexities of the software ecosystem.

3 fundamental principles that Software Engineering needs to embrace -

  1. Sustainability — how code changes with time
  2. Scale — how org, processes and tools change with time
  3. Trade-offs — what & how to decide to sustainably scale

Software Engineering: “Programming integrated over time”. Seen through the reverse lens, code thus becomes a derivative of software engineering.

A project is sustainable if, for the expected life span of the software, you are capable of reacting to valuable changes for either business or technical reasons.

The higher the stakes, the more imperfect the trade-off value metrics.

Your job as a leader is to aim for sustainability and to manage scaling costs for the org, the product and the development workflow.

Hyperbolic Discounting — when we start coding, we often implicitly assign the code a life span of hours or days. As the late-90s joke went — WORA — Write Once, Run Away!

Long-lived projects eventually have a different feel to them than programming assignments or startup development.

Hyrum’s Law: With a sufficient number of users of an API, it does not matter what you promise in the contract — all observable behaviors of your system will be depended on by somebody. Conceptually similar to entropy — your system will inevitably progress toward a high degree of disorder, chaos and instability; i.e., all abstractions are leaky.
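To make Hyrum's Law concrete, here is a minimal, hypothetical Java sketch (mine, not from the book): SettingsService only promises to return the user's settings, yet a caller quietly starts depending on the unspecified iteration order of the returned map.

import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of Hyrum's Law; class and method names are made up.
class SettingsService {
    Map<String, String> loadSettings() {
        // Insertion order is an implementation detail, not part of the contract.
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("theme", "dark");
        settings.put("locale", "en_US");
        return settings;
    }
}

class ReportRenderer {
    // This caller silently depends on "theme" arriving first. Swapping the map
    // for a HashMap breaks this observable behavior even though the documented
    // contract never changed.
    String renderFirstSetting(SettingsService service) {
        return service.loadSettings().entrySet().iterator().next().toString();
    }
}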

Code lies on a spectrum — at one end hacky and clever, at the other clean and maintainable. The most important decision is which way the codebase leans. It is programming if ‘clever’ is a compliment, and software engineering if ‘clever’ is an accusation!

SRE book talks about the complexity of managing one of the most complex machines created by humankind — Google Production system. This book focuses on the organization scale and processes to keep that machine running over time.

Scaling an org => sublinear scaling with regard to human interactions. You should need proportionally fewer humans in the loop as you grow.

The Beyoncé rule: “If you liked it, you should have put a CI test on it”. i.e., if an untested change caused an incident, it is not the change’s fault.
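A minimal sketch of what “putting a CI test on it” can look like, with made-up names (Slugifier is invented for illustration): if you rely on this formatting behavior, pin it with a test that runs on every change.

import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;

// Hypothetical example; the class under test is made up for illustration.
class Slugifier {
    static String slugify(String title) {
        return title.trim().toLowerCase().replaceAll("\\s+", "-");
    }
}

class SlugifierTest {
    // If an untested change breaks this behavior, CI catches it before commit.
    @Test
    void shouldLowercaseAndDashSeparateWords() {
        assertEquals("hello-world", Slugifier.slugify("  Hello World "));
    }
}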

An average Software Engineer produces a constant number of lines of code per unit time.

Treat whiteboard markers as precious goods. Well functioning teams use them a lot!

60–70% of developers build locally. But Google built its own distributed build system. Ultimately, even the distributed build gets bloated — Jevons Paradox — consumption of a resource increases in response to greater efficiency in its use.

Part ONE — Culture

  • Humans are mostly a collection of intermittent bugs.

Teams

  • Nobody likes to be criticized, especially for things that aren’t finished.
  • DevOps in one sentence: Get feedback as early as possible, run tests as early as possible, think about security & production environments as early as possible. This is known as “left shifting”.
  • Many eyes make all bugs shallow.
  • Three Pillars of Social Interaction — Humility, Respect, Trust.
  • Relationships always outlast projects.
  • You have two ways — one, learn, adapt and use the system; two, fight it steadily, as a small undeclared war, for the whole of your life.
  • Good Postmortem output — Summary, Timeline, Proximate Cause, Impact, Containment (Y/N), Resolved (Y/N) and Lessons Learned.
  • Psychological safety is the biggest thing in leading teams — take risks and learn to fail occasionally. This is a good short list to follow.
  • Software engineering is multiperson development of multiversion programs
  • Understand context before changing things. Chesterton's Fence is the mental model.
  • Google tends toward email-based workflows by default.
  • Being an expert and being kind are not mutually exclusive. No brilliant jerks!
  • Google CodeLabs are guided, hands-on tutorials that teach engineers new concepts or processes by combining examples, explanations and exercises.
  • Testing on the toilet (tips) and Learning on the Loo (productivity) are single-page newsletters inside toilet stalls.
  • 1–2% of Google engineers are readability reviewers in the code review process. They have demonstrated the capability to consistently write clear, idiomatic and maintainable code in a given language.
  • Code is read far more than it is written. Google uses a Monorepo — here is why.
  • Readability is high cost — trade-off of increased short-term review latency and upfront costs for long-term payoffs of higher-quality code.
  • Short-term code — e.g., experimental and the “Area 120” program — is exempted from readability review.
  • Knowledge is the most important, though intangible, capital for a software engineering org.
  • At a systemic level, encourage and reward those who take time to teach and broaden their expertise beyond (a) themselves, (b) their team, and (c) the organization. [Note: this gels well with the leadership philosophy of org first, team second and individuals third!]
  • On diversity -

Bias is the default

Don’t build for everyone. Build with everyone.

Don’t assume equity; measure equity throughout your systems.

Leading

  • A good reason to become a tech lead or manager is to scale yourself.
  • First rule of management? “Above all, resist the urge to manage.”
  • The cure for “management disease” is a liberal application of “servant leadership”: assume you're the butler or majordomo.
  • Traditional managers worry about how to get things done, whereas great managers worry about what things get done (and trust their team to figure out how to do it).
  • Being manager — “sometimes you get to be the tooth fairy, other times you have to be the dentist”.
  • Managers, don’t try to be everyone’s friend.
  • “A players hire A players; B players hire C players” — Steve Jobs
  • Engineers develop an excellent sense of skepticism and cynicism but this is often a liability when leading a team. You would do well to be less vocally skeptical while “still letting your team know you’re aware of the intricacies and obstacles involved in your work”.
  • As a leader -

Track Happiness — best leaders are amateur psychologists.

Let your team know when they’re doing well.

It’s easy to say “yes” to something that’s easy to undo.

Focus on intrinsic motivation through autonomy, mastery and purpose.

Delegation is really difficult to learn as it goes against all our instincts for efficiency and improvement.

  • Three “always” of leadership — Always be Deciding, Always be Leaving, Always be Scaling.
  • “Code Yellow” is Google’s term for “emergency hackathon to fix a critical problem”.
  • Good Management = 95% observation and listening + 5% making critical adjustments in just the right place.
  • To scale, aim to be in “uncomfortably exciting” space — Larry Page
  • All your means of communication — email, chat, meetings — could be a Denial-of-Service attack against your time and attention. You are the “finally” clause in a long list of code blocks!
  • In pure reactive mode, you spend every moment of your life on urgent things, but almost none of it is important. Mapping a path through the forest is incredibly important but rarely ever urgent.
  • Your brain operates in natural 90-minute cycles. Take breaks!
  • Give yourself permission to take a mental health day.

Metrics

  • Everyone has a goal to be “data driven”, but we often fall into the trap of “anecdata”.
  • Google uses GSM (Goals/Signals/Metrics) framework to select metrics to measure engineering productivity.

QUANTS — 5 components of productivity

Quality of code

Attention from engineers (measure context switches; state of flow)

Intellectual Complexity

Tempo (how quickly can engineers accomplish something) and velocity (how fast they can push their release out).

Satisfaction (e.g., survey tracking longitudinal metrics)

  • Quantitative metrics are useful because they give you power and scale. However, they don’t provide any context or narrative. When quantitative and qualitative metrics disagree, it is because the former do not capture the expected result!
  • Let go of the idea of measuring individuals and embrace measuring the aggregate.
  • Before measuring productivity, ask whether the result is actionable. If not, it is not worth measuring.

Part TWO — Processes

Rules

An organization of 30K engineers must care deeply about consistency. Consistency requires rules. With current tooling, the cost of adherence to rules has fallen dramatically. Some overarching principles for developing rules -

Pull their weight

Optimize for the reader — rules should favor “simple to read” over “simple to write”. E.g., plain if statements are often easier to read than terse conditional expressions. Engineers should leave explicit evidence of intended behavior in their code (a short sketch follows this list).

Be Consistent — e.g., if you go to a different office building, you expect your WiFi to just work. Consistency also allows for expert chunking. It is a big scale enabler with both technology and human parts of the org, and ensures resilience to time. E.g., naming convention, number of spaces to use for indentation etc. “It is often more important to have ONE answer than the answer itself”.

Avoid error-prone and surprising constructs

Concede to practicalities when necessary — “a foolish consistency is the hobgoblin of little minds”. Sometimes, performance or interoperability just matters more than other concerns.
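As mentioned above under “Optimize for the reader”, here is a hedged Java sketch (my own, not the book's): both methods compute the same discount, but the second leaves explicit evidence of the intended behavior where the reader looks for it.

// Hypothetical example; the pricing rules are made up for illustration.
class Pricing {
    // Clever and compact, but the reader has to decode the nested conditional.
    static double discountTerse(boolean member, int items) {
        return member ? (items > 10 ? 0.15 : 0.10) : (items > 10 ? 0.05 : 0.0);
    }

    // Simple to read: each business rule is spelled out explicitly.
    static double discountReadable(boolean isMember, int itemCount) {
        boolean bulkOrder = itemCount > 10;
        if (isMember && bulkOrder) {
            return 0.15;
        }
        if (isMember) {
            return 0.10;
        }
        if (bulkOrder) {
            return 0.05;
        }
        return 0.0;
    }
}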

The tech community debates small stuff like import ordering — bikeshedding, or Parkinson's law of triviality — because there is no clear, measurable benefit of one form over the others.

  • Whenever possible, strongly prefer to automate enforcement with tooling.
  • Automation ensures time resilience, and minimizes variance of interpretation.
  • It is unwise to apply technical solutions to social rules. E.g., leave the definition of “small” in “small change” to the engineer.
  • Google uses Error Prone for Java
  • Where domain expertise matters, e.g., formatting a matrix, humans outperform machines
  • Most code goes through a presubmit check

Code Review

  • Gerrit is used to review Git code for open source; Critique is the primary internal tool.
  • Code reviews take place BEFORE a change is committed. At least one LGTM is a necessary permission “bit” to allow a commit, provided all comments are resolved and the change is approved.
  • Code is a liability — like fuel: an airplane needs it but cannot take off with too much of it.
  • 3 aspects of review that require “approval” -

Correctness & comprehension check — does what it claims (LGTM bit)

Approval from Code Owners — tech lead or SME

Approval from readability — conformance with the language's style and best practices

  • Code review is one of the very few blanket mandates at Google
  • Cultural benefit — review reinforces to engineers that code is not “theirs” but part of a collective enterprise. Another psychological benefit is validation — works great against imposter syndrome or being too self-critical.
  • Code review feedback is expected within 24 hours.
  • Small changes are generally about 200 lines of code.
  • 35% of changes at Google are to a single file.
  • In essence, code review is the primary developer workflow upon which almost all other processes must hang, from testing to static analysis to CI.
  • Another way of looking at this — code review is the log of the development organization!

Documentation

  • Benefits are all necessarily downstream; unlike testing, documentation does not yield immediate benefits to the author.
  • Therefore it is viewed as an extra burden. Strive to make writing documentation easier.
  • “If I had more time, I would have written a shorter letter”
  • Two ways engineers encounter a document -

Seekers — know exactly what they want. They need consistency.

Stumblers — have vague ideas of what they are working with. They need clarity.

  • Another important audience distinction is between a customer (user of an API) and a provider (member of the project team). It is easier to document for the providers.
  • Documentation at Google is not yet a first class citizen.

Testing

  • Two needs — (a) the later a bug is caught, the more expensive it is to fix, (b) supports ability to change with confidence.
  • Tests derive their value from the trust engineers place in them. A bad test suite can be worse than no test suite at all.
  • GWS started a policy of engineer-driven automated testing.
  • If a single engineer writes ONLY 1 bug a month, a 100 engineer team will produce FIVE bugs every working day. Worse, fixing one complex bug will often result in another.
  • Automated test cycle = Writing Tests + Running Tests + Reacting to Test Failures.
  • An average piece of code at Google is expected to be modified dozens of times in its lifetime.
  • They started writing larger system-scale tests, but the tests were slower, less reliable and more difficult to debug than smaller tests.
  • Two distinct dimensions for every test — size and scope. Size = resources required to run a test case (memory, processes, and time). Scope = specific code path being verified.
  • Test Size

Small tests run in a single process — faster, more deterministic, single threaded. No database or network access involved. They aren't allowed to sleep or perform I/O or other blocking operations. “You can run them when coming back on BART”. They rely on test doubles, and because of these constraints they are rarely “flaky” (i.e., nondeterministic).

Medium tests run on a single machine, can span multiple processes/threads and make blocking calls. They cannot make network calls other than to localhost. E.g., use WebDriver that starts a real browser and controls it remotely via the test process. They are often slow and non-deterministic.

Large tests run wherever they want! Localhost restriction is lifted. They typically run during the build and release process to not impact developer workflow.

  • Flaky tests are generally rerun. But that is effectively trading CPU cycles for engineering time. Beyond a certain flakiness threshold, engineers lose confidence in tests, which is far worse than the lost productivity.
  • As you approach 1% flakiness, tests lose value. Google flaky rate hovers around 0.15%.
  • All tests should strive to be hermetic. Strongly discourage use of control flow statements (conditionals, loops) in a test. Make sure you write the test you’d like to read!
  • Test Scope

Narrow — often known as Unit Tests — validates logic in a small part of code. Catch logic bugs.

Medium — often known as Integration Tests — verify interactions between a small number of components. Catches issues between components.

Large — functional or end-to-end tests — validate interaction of several distinct parts, or emergent behaviors not expressed in a single class. Does sanity checks — should not be the primary method for catching bugs.

  • Google Test Pyramid — 80% of their tests are Unit, 15% integration, and 5% E2E.

Two anti-patterns — ice cream cone (too many manual tests) and hourglass (few integration tests because of tight coupling)

  • Code Coverage is not a gold standard metric. It only measures that a line was invoked, not what happened as a result. Measure coverage ONLY from small tests to avoid coverage inflation from larger tests.
  • Another danger of code coverage target — engineers start to treat 80% like a ceiling, rather than a floor.
  • Answering “Do we have enough tests” with a single number ignores a lot of context and is unlikely to be useful. Code coverage does provide some insight into untested code.
  • Google’s 2B lines of code are kept in a single, monolithic repository — the monorepo. It experiences 25M lines of change every week — half made by automated systems, and half by 30K engineers.
  • All changes are committed to head (trunk).
  • TAP (Test Automation Platform) is the key component of CI system.
  • Brittle tests (as opposed to flaky ones) overspecify expected outcomes or rely on extensive boilerplate — they resist actual change and fail even when unrelated changes are made.
  • Some of the worst offenders of brittle tests come from misuse of mock objects. Some engineers declared “no more mocks!”
  • The slower a test suite, the less frequently it will be run.
  • Secret to living with a large test suite is to treat it with respect. Reward engineers as much for rock-solid tests as for having a great feature launch. Treat your tests like production code.
  • A scale problem — new engineers would quickly outnumber existing team members. You need strong policies to indoctrinate the new engineers into testing! Google added an hour-long discussion on the value of automated testing to orientation. They also set up Test Certification. Level 1 — set up a continuous build, track code coverage, classify tests as small/medium/large, etc. By level 5, all tests were automated, fast tests were running before every commit, nondeterminism was removed, and every behavior was covered.
  • An internal dashboard applied social pressure by showing the level of every team.
  • Testing on the Toilet (ToT) was a one pager episode posted to bathroom walls. Created a lot of social chatter. Outsized public impact.
  • Today, expectations of testing are deeply embedded in daily developer workflow. Every change is expected to include BOTH the feature code and tests.
  • New testing tool — pH (Project Health). It continuously gathers dozens of metrics on the health of a project — such as test coverage and test latency — and measures them on a scale of one (worst) to five (best). A pH-1 project is a problem for the team to address immediately!
  • Searching for complex security vulnerability is difficult to automate (over humans).
  • In short, elevate testing into a cultural norm. It takes time.

Unit Testing

  • After preventing bugs, the most important purpose of a test is to improve engineering productivity.
  • They are fast, easy to write, rapidly increase coverage and serve as documentation.
  • An engineer executes thousands of unit tests (directly or indirectly) during a weekday.
  • Ideal tests are unchanging!
  • Best Practices (a short sketch follows at the end of this section) -

Test via Public API rather than implementation details.

Test State, Not Interactions — observe the system to see what it looks like after invoking. With interactions, check the system took an expected sequence of actions. Overreliance on mocking is the biggest reason for problematic interaction tests.

Test Behaviors, Not Methods. Behaviors can be expressed using the words “given”, “when” and “then” (also known as “arrange”, “act” and “assert”). The mapping between methods and behaviors is many-to-many.

Name tests after behavior being tested. Start the test name with the word “should”. If you need to use the word “and”, there is a good chance you are testing multiple behaviors.

Don’t put logic in tests. Complexity is introduced in the form of logic.

Tests are not Code — promote DAMP (Descriptive and Meaningful Phrases), not DRY! Tests should not need tests of their own. Duplication in tests is OK so long as the duplication makes the test simpler and clearer.

  • Test infrastructure (e.g., JUnit) — shared code across multiple test suites — is its own separate product. It must always have its own tests.
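A short sketch of several of the practices above, with made-up types: behavior-oriented names that start with “should”, a given/when/then structure, no logic in the test body, and a little DAMP duplication instead of a clever shared helper.

import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertThrows;
import org.junit.jupiter.api.Test;

// Hypothetical class under test; names are invented for illustration.
class ShoppingCart {
    private int totalCents = 0;

    void addItem(int priceCents) {
        if (priceCents < 0) {
            throw new IllegalArgumentException("price must be non-negative");
        }
        totalCents += priceCents;
    }

    int totalCents() {
        return totalCents;
    }
}

class ShoppingCartTest {
    @Test
    void shouldSumPricesOfAddedItems() {
        // Given
        ShoppingCart cart = new ShoppingCart();
        // When
        cart.addItem(250);
        cart.addItem(100);
        // Then
        assertEquals(350, cart.totalCents());
    }

    @Test
    void shouldRejectNegativePrices() {
        ShoppingCart cart = new ShoppingCart();
        assertThrows(IllegalArgumentException.class, () -> cart.addItem(-1));
    }
}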

Test Doubles

  • Similar to stunt doubles in a movie.
  • Fidelity refers to how closely the behavior of a test double resembles the behavior of the implementation it is replacing. Unit tests that use test doubles often need to be supported by larger-scope tests that exercise the real implementation.
  • Mocks are easy to write, difficult to maintain and rarely find bugs!
  • Code written without testing in mind typically needs to be refactored or rewritten before adding tests. Testable code makes it easy to inject dependencies.
  • Three techniques of test doubles (a sketch follows at the end of this section) -

Faking — lightweight implementation of API that behaves similar to the real but is not suitable for production. e.g., an in-memory DB.

Stubbing — giving behavior to a function — i.e., stub the return values. Typically done through mocking frameworks to reduce boilerplate.

Interaction Testing — validate how a function is called without actually calling the implementation. Sometimes also known as “mocking”.

  • Their first choice is to use real implementations of the system. This ensures higher fidelity — known as Classical Testing. They found mocking frameworks difficult to scale, and created the @DoNotMock annotation in Java.
  • Common cause of nondeterminism is code that is not hermetic — i.e., has external dependencies outside the control of a test.
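A hedged sketch of the three techniques using hypothetical types (UserStore, InMemoryUserStore) and the Mockito framework; the book's own examples differ, but the shape is the same.

import static org.junit.jupiter.api.Assertions.assertTrue;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.verify;
import static org.mockito.Mockito.when;

import java.util.HashMap;
import java.util.Map;
import org.junit.jupiter.api.Test;

// Hypothetical dependency to be doubled in tests.
interface UserStore {
    void save(String id, String name);
    boolean exists(String id);
}

// Faking: a lightweight in-memory implementation, good enough for tests,
// not suitable for production.
class InMemoryUserStore implements UserStore {
    private final Map<String, String> users = new HashMap<>();
    public void save(String id, String name) { users.put(id, name); }
    public boolean exists(String id) { return users.containsKey(id); }
}

class UserStoreDoublesTest {
    @Test
    void fakePreservesRealBehavior() {
        UserStore store = new InMemoryUserStore();
        store.save("u1", "Ada");
        assertTrue(store.exists("u1"));
    }

    @Test
    void stubReturnsCannedValues() {
        // Stubbing: give the function a return value; no real logic involved.
        UserStore store = mock(UserStore.class);
        when(store.exists("u1")).thenReturn(true);
        assertTrue(store.exists("u1"));
    }

    @Test
    void interactionTestVerifiesTheCall() {
        // Interaction testing: assert that save() was invoked, without a real store.
        UserStore store = mock(UserStore.class);
        store.save("u1", "Ada");
        verify(store).save("u1", "Ada");
    }
}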

Larger Testing

  • Fidelity — gives confidence about the overall system.
  • Large tests have a default timeout of 15 minutes or 1 hour. A handful also run for multiple hours or even days.
  • Non-hermetic and possibly non-deterministic. It is difficult to guarantee determinism if a test is non-hermetic.
  • One way to measure fidelity is in terms of environment — Unit Tests -> 1-process SUT -> Isolated SUT -> Staging -> Production. Increased fidelity comes at an increased cost and risk of failure.
  • Config changes are the number one reason for Google outages — e.g., 2013 Google outage from an untested network config change. But Unit Tests cannot verify configuration! Large tests should.
  • Configs are written in non-production languages, have faster (and often emergency) rollout cycles than binaries, and are more difficult to test.
  • Unit Tests are limited by the imagination of the developer. They deliberately eliminate the chaos of real dependencies and data — they work in a vacuum (Newtonian physics). They cannot test issues under load, or other unanticipated behaviors and side effects.
  • Value of large tests increases with time. “As we move to large-N layers of software, if the service doubles are low fidelity (say, 1-epsilon), the chance of bugs — when all put together — is exponential to N.” Just two low-fidelity (10% accurate) doubles in a large system push the likelihood of missed bugs to 99% (1 − 0.1 × 0.1 = 0.99)!
  • One key trait Google looks for in Test Engineers is “the ability to outline test strategy for its products”.
  • Three important design elements for large-scale tests -

SUT — system under test. E.g., split tests at UI/API boundaries; single-machine or cloud, etc.

Test Data — seeded, domain data; production copy; synthetic etc.

Verification — Manual, Assertions (like Unit Tests) or Differential (A/B comparison)

  • A-B tests simply test the differential between the public version and the new API, especially during migration. A-A tests compare a system to itself to eliminate non-determinism. A-B-C tests compare the last production version, the baseline build and a pending change, capturing the accumulated impact of the next-to-release version (a sketch follows at the end of this section).
  • Unit Tests verify code is “Working as implemented” rather than “Working as intended”.
  • Google does not do a lot of Automated UAT.
  • DiRT (Disaster Recovery Testing) — yearly fault injection into infrastructure at a “nearly planetary scale”. Contrast with “Chaos Engineering” which is like a low-grade, continuous fever to test the immune system. Google built Catzilla for the latter.
  • Minimizing flakiness starts with reducing scope of the test.
  • Larger tests MUST have documented owners or tests will rot.
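The differential (A-B) verification mentioned above, as a minimal sketch: the same request goes to the baseline and the candidate implementation, and the responses are compared. SearchService and both implementations are invented for illustration; a real A-B test would target deployed versions of a service.

import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;

// Hypothetical differential (A-B) check; all names are made up.
interface SearchService {
    String search(String query);
}

class LegacySearchService implements SearchService {
    public String search(String query) { return "results for " + query; }
}

class CandidateSearchService implements SearchService {
    public String search(String query) { return "results for " + query; }
}

class SearchAbDifferentialTest {
    private final SearchService baseline = new LegacySearchService();
    private final SearchService candidate = new CandidateSearchService();

    @Test
    void shouldMatchBaselineResults() {
        String query = "monorepo";
        // A-A would compare baseline against itself to flush out non-determinism.
        assertEquals(baseline.search(query), candidate.search(query));
    }
}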

Part THREE — Tools

Version Control

  • Time as a dimension is unnecessary in a programming task, but critical in a software engineering one.
  • Version control is the engineer’s primary tool for managing the interplay between raw source and time. i.e., an extension of a filesystem.
  • VCS(filename, time, branch) => file_contents
  • SW Development = sit down & write code. SW Engineering = produce working code and keep it useful for an extended period of time.
  • Development is an iterative cycle of branch-and-merge
  • Commit log is the trigger for a moment of reflection : “what have you accomplished since your last commit?” Marking a task complete brings closure.
  • Google does not use DVCS but a massive custom in-house centralized VCS (called Piper). DVCS has a massive cost of transmission to dev workstations with Google’s code base (2 B lines; 86 TB of data and metadata, ignoring branches). DVCS requires more explicit policy and norms than a centralized VCS.
  • Source-of-truth — whatever is most recently committed at the head. A DVCS inhibits the notion of a “single source of truth” — which copy is the latest?
  • Centralized VCS requires sublinear human effort to manage as the team grows (foundational principle at Google!) especially because it has one repo and one branch — one source of truth that is trivial to compute.
  • DVCS works great when orgs and sources of truth are hierarchical.
  • In version control, technology is only one part. Orgs need an equal amount of policy and usage convention on top of that.
  • VCS — how you manage your own code. Dependency Management — how you manage others’ code — is more challenging.
  • Uncommitted local changes are not conceptually different from committed changes on a branch. This “uncommitted work = branch” is a powerful idea to refactor tasks.
  • Small merges are easier than big ones. There are significant scaling risks when relying on dev branches.
  • Orgs become addicted to dev branches by not investing in better testing. As the org grows, the number of dev branches grows and more effort is wasted on coordinating the branch merge strategy. That effort does not scale well at all. That is why a single head/trunk is easier to scale. 25% of engineers are subjected to “regularly scheduled” merge strategy meetings!
  • Any effort spent in merging and retesting is pure overhead. Rather rely on trunk-based development by relying heavily on testing and CI — keeping the build green and disable incomplete/untested features at runtime.
  • Release branches — with a handful possible cherry-picked changes from trunk — are generally benign. They differ from dev branches on desired end state. A release branch is usually abandoned eventually. They are insurance.
  • In highest-functioning orgs, release branches are non-existent, replaced by CD — ability to release from trunk multiple times a day. It even eliminates the (small) overhead of cherry-picks.
  • Strong positive correlation between “trunk-based development”, “no long-lived dev branches” and good technical outcomes.
  • Remember and repeat this — branches are a drag on productivity.
  • 60–70K commits to the repo per day. It takes 15 seconds to create a new client on trunk, add a file and commit a change to Piper (the VCS tool).
  • Java has a standard shading practice that tweaks the names of a library's internal dependencies to hide them from the rest of the app.
  • One-Version Rule — Developers must never have a choice of “What version of this component should I depend upon”. Consistency has a profound impact at all levels of the organization. Removing choices in where to commit or what to depend on can result in significant simplification. The work to get around multi-version approach is all lost labor — engineers are not producing value, they’re working around tech debts.
  • Corollary: If a dev branch does exist, it should be short-lived. i.e., reduce WIP.
  • Build Horizon: Every job in prod needs to be rebuilt and redeployed every six months, maximum.
  • Git often has performance issues after a few million commits and tends to be slow to clone when repo includes large binary artifacts.
  • A (virtual) monorepo approach with a One-version rule cuts down the complexity of software development by a whole (difficult) dimension: time.

Code Search

  • Principle: “answering the next question about code in a single click”, even ones like “Fleet-wide, how many CPU cycles does it consume?”
  • 16% of Code Searches try to answer the question of where a specific piece of information exists. 33% are about examples of how others have done something.
  • Virtuous Cycle of good code browser — writing code that is easy to browse leads to not nesting hierarchies too deep, and using named types rather than generic things like int or strings.
  • Thinking about scaling: if every dev's IDE indexes the codebase locally, the total cost scales quadratically — the codebase and the number of workstations each grow linearly.
  • Domain Optimization: UX for code search is optimized for browsing and understanding code, not editing it, which is bulk of an IDE. Every mouse click in “read mode” is meaningful rather than a way to move cursor in “write mode”
  • Kythe (Code Search tool) index is rebuilt every day.
  • Stack frames are linked to source code.
  • Google processes more than 1 million search queries from devs within Code Search a day. Code search indexes about 1.5 TB of content and processes ~200 queries/sec with a median latency of less than 50 ms and median indexing latency (delta between commit and visibility) of under 10 sec.
  • RegEx processes about 100 MB/sec in RAM. Substring search can scale up to 1 GB/sec reducing computing needs by 10x.
  • Further scaling — moved the inverted index from RAM to Flash. Changed trigram index to n-gram to compensate for two orders higher access latency in Flash.
  • Ranking is based on two signals — query independent (depends only on the document, e.g., file views, number of references to it — can be computed offline) and dependent (depends on search query as well, must be cheap to compute since it’s for each query — e.g., the page rank).
  • Code search is an essential tool to scale — it helps developers understand code. Big boost to productivity.
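A purely illustrative sketch (my own weights and signals, not Google's) of how the two ranking signals described above might be combined: the query-independent part can be precomputed offline per document, while the query-dependent part must stay cheap because it runs for every query.

// Hypothetical ranking sketch; signal names and weights are made up.
class SearchRanker {
    // Query-independent: computed offline per document (file views, incoming references).
    static double queryIndependentScore(int fileViews, int incomingReferences) {
        return Math.log1p(fileViews) + 2.0 * Math.log1p(incomingReferences);
    }

    // Query-dependent: computed per query, so it must be cheap (e.g., filename match, match position).
    static double queryDependentScore(boolean matchInFilename, int firstMatchOffset) {
        return (matchInFilename ? 5.0 : 0.0) - 0.001 * firstMatchOffset;
    }

    static double score(int fileViews, int refs, boolean matchInFilename, int firstMatchOffset) {
        return queryIndependentScore(fileViews, refs)
                + queryDependentScore(matchInFilename, firstMatchOffset);
    }
}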

Build Systems & Philosophy

  • Besides free food, Google engineers LOVE the build system (actual survey results were cited)! It allows development to scale.
  • Automated Build System is all about 3 questions-

When to ship — automatically built, tested and pushed to prod at some regular frequency.

When to test — CRs automatically tested when sent for code review so both author and reviewer can see build or test issues; Tested again before merging into trunk.

What to test — esp when the change is LSC (Large Scale Change).

  • Modern build system is all about dependencies — “I need that before I can have this”.
  • Task-based build systems (Maven, Ant, Gradle) — the fundamental unit is the task. Build files describe how to perform the build by executing the tasks. These end up giving too much power to engineers and not enough power to the system. They are also often unable to parallelize task execution.
  • Artifact-based build system (Blaze, Bazel) — offers a small number of tasks defined by the system that engineers can configure in a limited way. They still have build files, but the buildfiles look very different. E.g., buildfiles in Blaze are declarative manifest describing a set of artifacts to build, their dependencies and a limited set of options that affect how they’re built. This is analogous to functional programming — it lets the system figure out how to execute. They also parallelize well when possible.
  • Blaze, e.g., is effectively a mathematical function that takes source files and compiler as inputs and produces binaries as outputs.
  • Software Supply Chain attacks are becoming more commonplace.
  • Huge builds with Blaze can be distributed across many machines.
  • 1:1:1 Rule — for Java, e.g., each directory contains a single package, target and BUILD file. Google favors significantly smaller modules — helps parallelizing. Advantages double with testing — more controlled, fine-grained subset of tests can be run against individual modules.
  • Don’t rely on the latest versions for external dependencies. Have a policy!
  • Insight: reframing build as centering around artifacts instead of tasks allows builds to scale at Google size. i.e., fine-grained modules scale better than coarse grained ones.

Code Review Tool

  • Critique (tool) Principles -

Simplicity — easy review without unnecessary choices

Trust — review is not for slowing down others, try to empower. Make changes open!

Communication — challenges are rarely solved through tooling. So keep it simple in the tool.

Workflow integration — easy to toggle from review- to browse- to edit- mode, or view test results alongside.

Anyone can do a “drive-by review”

  • Code Review Workflow — Create Change -> Request Review -> Comment -> Modify & Respond to Comments -> Change Approval -> Commit
  • Change Approval — 3-part scoring: LGTM; Approval; Number of unresolved comments.
  • Requestor may commit when the change has at least one LGTM, sufficient approvals and no unresolved comments. The first two are hard requirements; the unresolved-comments requirement is a soft one.
  • Code Review is a tool used for change archaeology
  • Time spent in code review is time not spent coding. Optimize it well. Having only two people (author and reviewer) is often sufficient for most cases and will keep the velocity high.
  • Trust and communication are core to code review. A tool can enhance the experience but cannot replace them.

Static Analysis

  • Part of Core Developer Workflow — Easy way to codify best practices, help keep code current to modern API versions and prevent tech debt.
  • Generally focus on newly introduced warnings.
  • Perception is a key aspect of the false-positive rate. If developers do not take positive action after an issue surfaces, it is an “effective false positive”.
  • Tricorder (the tool) analyzes more than 50,000 code review changes per day and often runs several analyses per second. It aims for less than 10% effective false positives.
  • ErrorProne extends Java compiler to identify AST antipatterns.

Consider this nasty bug — when f is an int, the code will compile, but the right shift by 32 is a no-op, so f is XOR-ed with itself and no longer affects the value produced.
result = 31 * result + (int) (f ^ (f >>> 32));
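A minimal sketch (mine, not the book's) of the non-buggy forms: for a long field the XOR-shift fold is correct (Long.hashCode does the same thing), while an int field needs no shift at all.

// Hypothetical helper showing the correct idiom for each field type.
class HashExamples {
    static int hashLongField(long f, int result) {
        return 31 * result + Long.hashCode(f); // same as (int) (f ^ (f >>> 32))
    }

    static int hashIntField(int f, int result) {
        return 31 * result + f; // an int already fits; no folding needed
    }
}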

  • Developers can click the “Not Useful” button on an analysis result to give feedback.
  • Presubmit checks can block committing a pending change.
  • When possible, static analysis is pushed to the compiler. Compiler checks are fast enough to not slow down the build.
  • Aim not to issue compiler warnings; they are usually ignored. Go was designed to turn such warnings into errors.
  • Some checks can be transferred to IDEs.

Dependency Management

  • Principle: All else being equal, prefer source control problems over dependency management problems.
  • Dependency graph for projects continues to expand over time.
  • SemVer — Semantic Versioning (major.minor.patch version) is foundational to managing dependencies. Version satisfiability solvers for SemVer are similar to SAT-solvers. Works well at small scale.
  • SemVer is a lossy estimate and represents only a subset of possible scope of changes.
  • Even with high test coverage, it is difficult to algorithmically separate “behavior change” from “bug fix to an unintended behavior”.
  • Hyrum’s Law (again): With enough users, any observable behavior of your system will come to be depended on by somebody.

Large-Scale Changes

  • Important questions for your org:

How many files can you update in a single commit?

How does your largest commit compare to the actual size of your codebase?

Would you be able to roll back that change?

  • As the number of engineers grows, the largest atomic change possible decreases.
  • One needs technical and social strategies for large-scale changes (LSC).
  • LSCs are almost always generated using automated tooling (e.g., cleanup anti-patterns; replacing deprecated libraries; compiler upgrades; migrating users). Majority of LSCs have near zero functional impact. i.e., they are behavior-preserving. LSCs could still affect a large number of customers.
  • Typically, the infrastructure team owns LSCs. “Unfunded mandates” have benefits diffused across the org and are thus unlikely to matter enough for individual teams to want to execute them on their own initiative. Centralizing the migration and accounting for its costs is always faster and cheaper than depending on individual teams. Metaphor: building new roads vs. fixing potholes.
  • “No Haunted Graveyards” — have no system so archaic that no one dares enter it.
  • LSCs really work only when the bulk effort is done by computers and not by humans.
  • 10–20% of changes in a project are the result of LSCs.
  • The largest series of LSCs ever executed removed more than 1 billion lines of code from repo over 3 days.
  • It is much easier to perform LSCs in statically typed languages than in dynamically typed ones because of the strength of compiler-based tools. Therefore, language choice is intimately tied to code lifespan; i.e., languages that are focused on dev productivity tend to be more difficult to maintain.
  • Sadly, systems we most want to decompose are those that are most resilient to doing so. They are the “plastic six-pack rings” of the code ecosystem.

Continuous Integration

  • CI = CB (Continuous Build) + CT (Continuous Tests)
  • CI + Release Automation = CD
  • Traditional CI just tests your binary but changes that break an app are more likely to be in loosely coupled external pieces (e.g., microservices).
  • As we shift to the right, the code change is subjected to progressively larger-scoped automation tests.
  • Fast Feedback Loops — as issues progress to the right (from dev to prod) they are farther away- on expertise, on time and on users.
  • Canarying may create complex version skew problems.
  • Experiments and feature flags are extremely powerful feedback loops as we move right. Test them as if they were code, but keep their configuration outside the code so you can change it dynamically.
  • Feedback from CI tests should not just be accessible but actionable.
  • In reality, CD with large binaries is infeasible. Doing selective CD through experiments or flags is a common strategy.
  • As an RC progresses through environments, its artifacts (binaries, containers) should not be recompiled or rebuilt. Orchestration tools (e.g., Kubernetes) help enforce consistency between deployments.
  • Mid-air Collision — when two unrelated changes to two different files cause a test to fail. This is generally a rare event but happens frequently with scale.
  • Presubmit — only run fast and high fidelity tests.
  • Run a comprehensive, automated test suite against an RC (Release Candidate) for (a) sanity check, (b) to allow for cherry picks and (c) for emergency pushes.
  • Probers — run the same suite of tests against production as we did against RC
  • Continuous testing is the defense-in-depth approach to catching bugs.
  • CI and Alerting are analogous — CI emphasizes the early “left” side of the workflow and surfaces test failures. Alerting focuses on the right end of the same workflow and surfaces metrics that cross certain thresholds. Both focus on fidelity and actionable notifications.
  • If it’s not actionable, it should not be an alert.
  • CI and alerting both have localized signals (unit tests; isolated stats/cause-based alerting) and cross-dependency signals (integration tests; black-box probes). The latter are the highest fidelity indicators of whether an aggregate system is working well, but we pay the cost in fidelity in flaky tests or difficulty in pinpointing root causes.
  • Cause-based alerts (e.g., CPU > 65%) are brittle as there may not be a connection between the threshold and system health, just like brittle tests are “false positives”. Both cause-based/threshold alerts and brittle tests still have value — at least by adding debug data if the system actually fails.
  • Big Idea: CI therefore could be thought of as “left shift” of alerting.
  • Extending that simile, achieving a 100% green rate on CI is as incredibly expensive as 100% uptime. However, unlike alerting, which is in a far more mature state, CI is often viewed as a “luxury feature”.
  • A change that passed presubmit has 95%+ likelihood of passing the rest of the tests. Therefore it is allowed to be checked-in.
  • Established Cultural Norm — strongly discouraging committing new work on top of known failing tests.
  • To deal with breakages, each team has a “Build Cop”, whose responsibility is keeping all the tests passing in their particular project, regardless of who breaks them. Rollback is Build Cop’s most effective tool. Any change to Google’s code base can be rolled back in two clicks!
  • The average wait time to submit a change is 11 minutes, often run in the background.
  • MTTCU (Mean Time To Clean Up) is a key metric that tracks the mean time to close bugs after a fix is submitted. Disabling failing tests that cannot be immediately fixed is a practical way to keep the suite green.
  • A CI system thus decides what tests to use, and when. It becomes more important as the codebase ages and grows. CI should optimize for quicker, more reliable tests on presubmit and slower, deterministic ones on post-submit.
  • Accessible, actionable and aggregated feedback from CI system is essential to productivity.

Continuous Delivery

  • Long-term lifecycle of software product involves

Rapid exploration of new ideas

Rapid responses to shifts or issues

Enabling dev velocity at scale

  • Faster is safer — smaller batches of changes result in higher quality
  • Velocity is a team sport!
  • YouTube is a monolithic Python codebase with a 50-hour manual regression testing cycle by a remote QA team on every release.
  • If releases are costly and risky, the instinct is to slow down the cadence and increase stability period. That is wrong! Adding more governance and oversight to the dev process and implementing risk reviews merely reward low-risk, therefore low-return tasks. A process version of bikeshedding!
  • Key to reliable continuous releases is to “flag guard” all changes and the ability to dynamically update the config for the flag.
  • i.e., decoupling the destiny of a feature from its release is a powerful lever for long-term sustainability (a sketch follows at the end of this section).
  • “Deadlines are certain, life is not” — don't let pending tasks block the release.
  • Deployment itself therefore is one of the key features of every successful release.
  • Toolchain to allow safety measures like dry-run verification, roll-back/roll-forward and reliable patching is an essential safety net.
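A hedged sketch of flag-guarding with made-up names (FlagStore, "new_checkout_flow"): the new code path ships dark, the flag lives in dynamically updatable config, and rollback becomes a config flip rather than a new release.

// Hypothetical flag-guarding example; names are invented for illustration.
interface FlagStore {
    boolean isEnabled(String flagName); // backed by dynamically updatable config
}

class CheckoutController {
    private final FlagStore flags;

    CheckoutController(FlagStore flags) {
        this.flags = flags;
    }

    String checkout(String cartId) {
        if (flags.isEnabled("new_checkout_flow")) {
            return newCheckout(cartId);   // code is deployed but off by default
        }
        return legacyCheckout(cartId);    // rollback = flip the flag, no new release
    }

    private String newCheckout(String cartId) { return "new:" + cartId; }
    private String legacyCheckout(String cartId) { return "legacy:" + cartId; }
}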

Compute as a Service

  • “I don’t try to understand the computers. I try to understand the programs” — Barbara Liskov
  • Big idea: kill bigger things when little things fail. E.g., if a program unexpectedly terminates, rather than restart-debug it, kill the machine and restart it somewhere else. Enforcing this continuously will make killing bigger things cheap. That will make bigger things effective little things! We know how to manage little things.
  • e.g., Latency degradation due to disk swap is so horrible that an out-of-memory-kill and migration to a different machine is universally preferable
  • Writing Software For Managed Compute -
  • Don’t architect for success, architect for failure. You cannot defeat entropy, but may be able to postpone it!
  • Pets vs. Cattle is a very powerful, though gory, metaphor. Rather than thinking of infrastructure/code/systems as pets (named, individual identity, attachment/coupling), design them as cattle (interchangeable, unnamed, no tears shed if one is killed).
  • However, make sure the “death” of cattle is well directed (i.e., planned) or, if not, well understood (so we can take necessary steps, like adding more cattle to the herd).
  • Batch jobs are interested in throughput. They are short-lived and restart friendly. They can share time and space on the machine with serving jobs (e.g., inbound customer calls) — long-lived and interested in latency — where serving jobs take precedence over them.
  • Managing state is generally an impediment to making the world cattle-centric. In-process state should be treated as transient, with “real storage” happening elsewhere.
  • Even persistent state can be managed by cattle through state replication.
  • Caching is good, but design noncached path to be able to serve full load (with higher latency) and not keel over if cache is down.
  • RTT is a friendly pattern when connecting to services — Retry, Timeout, Throttle. Retries need to be implemented correctly with (a) backoff, (b) graceful degradation and (c) jitter to avoid cascading failures (a sketch follows at the end of this section).
  • Containers are primarily an isolation mechanism, a way to enable multitenancy while minimizing interference between different tasks/processes sharing a single machine. A container is an abstraction boundary between deployed software and the actual machine it runs on.
  • Containers also provide a simple way to manage named resources on the machine (e.g., network ports, GPUs).
  • Core Cattle Principle — “Machines are anonymous; programs don’t care which machine they run on as long as it has the right characteristics”.
  • A unified management infrastructure allows us to avoid linear scaling factor for N different workloads. There are not N different practices, there is just one (e.g., Borg for Google).
  • A batch job can usually be killed without warning.
  • A serving job needs some early warning to prevent user-facing errors.
  • Compute infra has a naturally high lock-in factor. Even the smallest details of a compute solution can end up locked in. The large ecosystem of helper services increase lock-in — logging, monitoring, debugging, alerting, config languages etc.
  • Managed serverless model is attractive for adaptable scaling of resource cost at the low-traffic end.
  • At the truly high-traffic end, you will be limited by the underlying hard infrastructure, regardless of the compute solution! If your app needs 100,000 cores to process, the choice of serverless/container/other abstractions matter less than actually having 100,000 cores to serve.
  • To reduce lock-in, two strategies -

Use a lower-level public cloud and run a higher-level open source solution (OpenWhisk or KNative) on top of it. If migrating out, the open source solution becomes the boundary with the new system.

Or, run a hybrid cloud — have part of the overall workload in private infrastructure, possibly use public cloud as an overflow segment.
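Returning to the RTT bullet above, here is a minimal, hypothetical sketch of retries with capped attempts, exponential backoff and jitter; timeouts and throttling are assumed to be handled by the RPC layer, and the helper's name and signature are made up.

import java.time.Duration;
import java.util.Random;
import java.util.concurrent.Callable;

// Hypothetical retry helper for illustration only.
class Retrier {
    private static final Random RANDOM = new Random();

    static <T> T callWithRetries(Callable<T> call, int maxAttempts, Duration baseBackoff)
            throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e;
                if (attempt == maxAttempts - 1) {
                    break; // out of attempts; rethrow below
                }
                // Exponential backoff plus jitter so a crowd of failing clients
                // does not retry in lockstep and cascade the failure.
                long backoffMillis = baseBackoff.toMillis() * (1L << attempt);
                long jitterMillis = (long) (RANDOM.nextDouble() * backoffMillis);
                Thread.sleep(backoffMillis + jitterMillis);
            }
        }
        throw last;
    }
}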
