Good Code is like Good Joke — 25 Insights from Cloud Engineering

Nilendu Misra
Nov 6, 2023

“In theory, theory and practice are the same. In practice, they are not” is why such bottom-up insights and lessons from the field are the fastest way to learn real-life stuff. This series had a GREAT start with “Engineering Management”, I guess because it is way more subjective than Cloud Engineering and offered a variety of non-overlapping POVs. This one is a mixed bag, perhaps because “Cloud Engineering” was perceived amorphously by the authors. The scope was broad: from cloud-native (architecture), to cloud-ready (topology), to cloud operations, to choosing tech (e.g., Lambda/serverless), to -ilities and economics. It is like celebrating Halloween, Christmas and Labor Day together in a single long weekend. Overwhelming, but still good to celebrate!

Key insights -

— Real-time visibility across the entire DevOps lifecycle is key to winning in cloud.

— Operations, especially operations at scale, is extremely hard. So, wherever possible, use Managed Services.

Distinguish between “availability” and “uptime” and measure each separately, and concretely.
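
To make that distinction concrete, here is a minimal sketch. The definitions are my assumption, not something the source spells out: "uptime" is the fraction of health checks that found the service running, and "availability" is the fraction of real user requests that were served successfully. The `Probe` and `Request` records are hypothetical shapes made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    """One periodic health check (hypothetical record shape)."""
    up: bool


@dataclass
class Request:
    """One user request (hypothetical record shape)."""
    ok: bool


def uptime(probes: list[Probe]) -> float:
    """Fraction of health checks that found the service running.

    Assumes at least one probe was recorded.
    """
    return sum(p.up for p in probes) / len(probes)


def availability(requests: list[Request]) -> float:
    """Fraction of real requests that were served successfully.

    Assumes at least one request was recorded.
    """
    return sum(r.ok for r in requests) / len(requests)


# The service process never went down, but one in five requests failed.
probes = [Probe(up=True) for _ in range(10)]
requests = [Request(ok=(i % 5 != 0)) for i in range(100)]

print(f"uptime:       {uptime(probes):.1%}")
print(f"availability: {availability(requests):.1%}")
```

Under these assumed definitions the two numbers diverge: the service looks perfectly "up" while users see a meaningful share of failures, which is why measuring each one separately is worth the effort.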

— In FaaS/Serverless, calling a function synchronously increases debugging complexity.

Good code is like a good joke: it needs no explanation.

— “Building your app or platform on top of the abstractions that a cloud provider gives you does not make the underlying layers stop existing. In many cases, it makes them even more important.” That makes the failure modes LESS obvious than we were used to. Therefore, having “extreme visibility” into your systems will help “separate the issues at the layer you’re focused on from the fundamental system issues”. In other words, just because what is under the hood is now even less visible, don’t forget about it. Many recent “cloud failures” have been in networking fault domains.

Cloud is not optimized for replacing static infrastructures.

Containers, service meshes and serverless jumpstart dev productivity but they also change the attack surface of apps and infra.

— “Number of containers that are alive for 10 sec or less has doubled to 22%”. 73% of all containers live for 30 minutes or less.

— Adopt an “assume breach” stance for everything. Have a break-glass account.

— Ensure you have a thorough understanding of where and how secrets are secured.

Grey failures (transient degradation of services) are often worse than complete crashes, since the latter have a short feedback loop.

— Resilience engineering has existed as a sub-discipline within safety sciences. We just recently started applying its concepts in technology. Resilience can be thought of as a “socio-technical system” with Robustness (“system X has property Y that is robust in sense Z to perturbation W”); Reliability (consistent operations or service levels); Rebound (ability to deal with a chaotic situation using structures developed AND deployed BEFORE the chaos).

In other words, robustness protects systems against a SPECIFIC type of failure mode. When a system is robust in many dimensions, it approaches good resilience to failure.

Resilience is something you “do”, not something you “have”. Resilience is a verb.

Moving from one class of nines to the next is 10 times more expensive.
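
A quick back-of-the-envelope sketch of why each extra nine hurts. This does not demonstrate the cost claim itself; it only shows the related arithmetic that the yearly downtime budget shrinks tenfold with every added nine. The function name and the 365-day year are my own simplifying assumptions.

```python
# Allowed downtime per year for each "class of nines".
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget for a target with `nines` nines (2 means 99%)."""
    unavailability = 10 ** -nines  # e.g. 3 nines -> 0.001
    return MINUTES_PER_YEAR * unavailability


for n in range(2, 6):
    target = 100 * (1 - 10 ** -n)
    budget = allowed_downtime_minutes(n)
    print(f"{n} nines ({target:.3f}%): {budget:8.1f} min/year")
```

Going from 99% to 99.9% takes the budget from roughly 3.65 days a year to under 9 hours, so the room for slow, manual recovery disappears quickly.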

A production system really means a “system that someone else, anyone else, can hold you accountable for”.

The most common theme across incidents is that something, somewhere, was surprising.

Incidents are unplanned investments…your challenge is to maximize ROI.

— We used to think of scale in two dimensions: horizontal (more) and vertical (bigger). In cloud, think of “scale out” (when demand increases) and “scale in” (when demand decreases).

An architecture diagram is also a map of failure modes.

Async communication is a friend of Cloud Reliability.

Testing in production is a competitive advantage. The complexity of traffic patterns going through high-scale production systems is increasingly hard to reproduce in a controlled environment.

— Hundreds of open issues is fine, but if the repo has gone months (or, years!) without a release, THAT is a warning sign.

It is hard to write good tests for bad code.

— Platforms come and go. But first principles and patterns will always exist, because they are the ones and zeros.
