the playbook part 5: engineering management


“the playbook” is a mini-book delivered in 8 posts (a foreword and seven chapters.) It’s the tiny seed from which an innovative product engineering organization can grow.

Engineering Management

Embracing the Means of Software Production

Engineering is a Creative Pursuit

Since the beginning of the software revolution, business leaders have tried, unsuccessfully, to apply incompatible production and labor methodologies to the creation of software. Unfortunately, the means of production in software cannot be managed the way an assembly line can.

The problem is that software development is not like working in a factory. It’s knowledge work: a creative pursuit that requires careful planning, design thinking, and problem solving to engineer solutions to your business problems.

Typing isn’t the primary requirement of the job. Taking time to think about a problem, and about the impact your choices might have on the system, is as important as (and likely more important than) typing out lines of code. Business leaders who don’t understand this paradigm and say things like, “This should be easy,” or, “We just need you to do X,” are demonstrating that they don’t understand the means of production for their own business. “Engineering ain’t easy,” after all.

Engineers are not simple laborers; they are creatives, and their productivity can’t be measured the way a factory worker’s can.

Productive != Effective

The fact is, “productive” engineering teams are not necessarily “effective” engineering teams.

The job of a software development team is to solve complex business problems with code. By that measure, if you can solve those problems without writing software at all, or by creating as little of it as possible, then you win at software engineering.

Additionally, software systems are complex and code is seldom written in a vacuum. Everything you create will interact with other components of your system (some of which exist today, and others that will be conceived at some unknown date in the future.) If you aren’t careful, everything you “fix” can lead to another problem elsewhere. This whack-a-mole syndrome is a common symptom of not fully understanding the system you are extending with your new bit of functionality.

Engineering is a Team Sport

Because of this, it’s important to understand that software development is a team sport. You depend on the work of your teammates, and they depend on you, to push the software system forward together.

Our work must interact, be easy to understand and extend, and follow common patterns that make life easier for the next person who finds themselves editing it.

Lone wolves are therefore, at best, less effective and, at worst, more dangerous than developers who are collaborative and collegial by nature. You have been warned.

Managing Innovative Development Teams

So, how is it that “digitally native” software organizations manage their engineering teams with this understanding?

It all starts with recognizing that your developers are business people, not grunts. In order for business people to perform their jobs effectively, they must understand your business strategy, be aligned with the organization’s vision, and be given explicit trust to do their jobs.

Your vision must be clearly communicated so that each employee can articulate it in their own words. When your engineers are aligned on strategy they can also align their implementation of product features. This in turn reduces conflicts between teams and roles within your org.

Trust should be exemplified by a tolerance for failure. Failure, after all, is a stop along the way to success. Breaking the build or shipping software that fails to gain traction shouldn’t be met with punishment, humiliation, or negative feedback. Instead, depending on how much your organization learns from a failure, you might even want to reward the people behind it.

This creates an atmosphere where your workers are not afraid to try something hard, rather than an atmosphere of people hedging their bets.

This leaves the door open for innovation, because your people are more willing to stick their necks out and try something new whose benefits can far outweigh the risks of a temporary setback.

Effective development teams likewise rely less on extensive specifications/documentation and more on communicating the context of a project, its reason for existence, and the base metrics for success.

This enables your experts to use their specialty to develop a solution, versus simply standing on the assembly line pushing buttons. Because of this, your organization is able to leverage the cognitive capacity of your entire team, rather than just the brain power of your leadership team (who often do not have the same level of implementation expertise as your individual contributors.)

You unlock the full potential processing power of your Human Structure, rather than constraining it unnecessarily. (More on this in chapter 6.)

Understanding Risk & Cost of Failure

Every organization has a different risk profile. Boeing engineers’ code runs inside planes traversing the globe; the risks, and the costs associated with those risks, are incredibly high. If you’re running an e-commerce platform, the risks and associated costs are incredibly low by comparison. Your own software’s risk profile will give you your own unique view on this risk.

For most of us, the cost of failure is actually pretty low. Let’s say the worst case for us is a production outage which affects revenue for that time period, but no lives are at stake.

So, the worst case happens: an engineer releases some code into production and walks away from their desk to grab a coffee. They chat up some co-workers and, as their changes gradually propagate, your key metrics slowly slide from their normal range all the way to zero. The production environment has gone down, and it stays down for 30 minutes before a coworker identifies the problem, rolls back the toxic changes, and deploys the last known safe state.

Gradually it comes back up and you’re back to normal. You were down for 45 minutes to an hour. How much did this cost you? Well, let’s say this outage cost you around 5% of your daily revenue.

What does this actually cost you in revenue, and how much is it worth to ensure it doesn’t happen in the future? Let’s presume your application drives $100MM a year in revenue, which is around $274,000 a day. So, this outage cost you a little bit less than $14,000 (or 0.014% of your annual revenue) for this one-time incident.
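
If you want to sanity-check that math, a quick back-of-the-envelope calculation with those same illustrative numbers (the $100MM figure and the 5% loss are just this example’s assumptions) looks like this:

```python
# Back-of-the-envelope outage cost, using the illustrative numbers above.
annual_revenue = 100_000_000            # $100MM per year (example assumption)
daily_revenue = annual_revenue / 365    # ~$274,000 per day

outage_loss = daily_revenue * 0.05      # the outage cost ~5% of daily revenue
share_of_annual = outage_loss / annual_revenue

print(f"Daily revenue:       ${daily_revenue:,.0f}")   # ~$273,973
print(f"Cost of the outage:  ${outage_loss:,.0f}")     # ~$13,699
print(f"Share of annual:     {share_of_annual:.3%}")   # ~0.014%
```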

How much is it worth spending in process overhead, QA salaries, and opportunity cost (from slowing down your ability to learn rapidly) to mitigate this risk? Or is it better to improve your ability to react to such events, rather than trying to prevent them entirely?

Let’s presume this organization has an engineering team of 100 engineers and currently no QA staff. These engineers cost on average $125,000 a year in salary, or $12.5MM a year as a group. We decide to add 1 QA engineer for every 2 engineers (I’ve worked at orgs with this ratio) at a cost of $90,000 a year per person, or $4.5MM a year total. This adds a daily cost of roughly $12,300 to our company.
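
Running the same kind of quick math on the QA overhead (again, the headcounts and salaries are just the example figures above, not benchmarks):

```python
# QA overhead, using the illustrative headcounts and salaries above.
engineers = 100
engineer_salary = 125_000
qa_engineers = engineers // 2      # 1 QA engineer for every 2 engineers
qa_salary = 90_000

engineering_cost = engineers * engineer_salary   # $12.5MM per year
qa_cost = qa_engineers * qa_salary               # $4.5MM per year

print(f"QA cost per year:  ${qa_cost:,}")                        # $4,500,000
print(f"QA cost per day:   ${qa_cost / 365:,.0f}")               # ~$12,329
print(f"Added cost:        {qa_cost / engineering_cost:.0%}")    # 36%
```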

Adding this overhead also slows down the process of deploying changes, tacking anywhere from a day to a week of additional process onto every release in order to reduce this risk.

This means that for every change, where we used to learn within an hour whether things were good or bad (albeit at some revenue risk), we now need 24-100+ hours to learn the same thing, though without losing revenue to learn it.

However, we’ve increased our cost of production by 36% and reduced our speed by a substantial factor. Of course, this also ignores the fact that in this new structure problems will still arise. There will still be outages; they will just be less frequent.

Is this cost and loss of velocity worth it to avoid the occasional 0.014% loss in revenue?

Now you’re starting to understand why fewer orgs have large QA teams. So, what are they doing instead? They’re improving their reaction time to failure.

Improving Reaction Time

Software engineering has come a long way since I started. Things are far more complex than they used to be, but there are also a ton of new methodologies we can use to reduce the risks related to releasing new code into the world.

Staging server environments only give us so much ability to test our features: in part because production data can be far gnarlier than our sterile staging data, and in part because the more load your servers handle, the more opportunity there is for persnickety edge cases to pop up that staging can’t begin to replicate.

Fortunately, there are some great ways to gradually release code into the real world and track its performance before exposing that code to our entire user base.

Two of these that I want to talk about are “Blue/Green Deployments” and “Feature Flags.”

Blue/Green Deployments allow us to deploy quickly and roll back instantly as needed, without experiencing a full failure. Your Blue environment is what is currently live in production, which your user base is actively using right this moment. When you deploy, your code goes to identical server infrastructure (Green) that is separate from your active environment. Then, you slowly move a small fraction of your active users to the new build. As those users interact with the new servers, you can see whether anything goes wrong, slowly transferring more and more load to the new build until everyone is on the new version of your software. Alternatively, the moment you detect that something has gone wrong, you can instantly route everyone back to the Blue build and go fix your new code.

This vastly reduces the risk of loss, since complete failure is no longer part of your risk profile, and it allows you to identify and remedy problems much more quickly.
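
To make the mechanics concrete, here is a minimal sketch of that gradual traffic shift, assuming a hypothetical router and monitoring interface (set_green_traffic, green_error_rate, the step sizes, and the error threshold are all illustrative, not any particular tool’s API):

```python
import time

# Hypothetical blue/green rollout loop: shift traffic toward the Green (new)
# environment in small steps, rolling back to Blue if the error rate spikes.
TRAFFIC_STEPS = [1, 5, 25, 50, 100]   # percent of users routed to Green
ERROR_THRESHOLD = 0.01                # roll back if more than 1% of requests fail

def set_green_traffic(percent: int) -> None:
    """Placeholder for your load balancer / router API."""
    print(f"Routing {percent}% of traffic to Green")

def green_error_rate() -> float:
    """Placeholder for a query against your monitoring system."""
    return 0.0

def rollout() -> bool:
    for percent in TRAFFIC_STEPS:
        set_green_traffic(percent)
        time.sleep(5)                    # in practice: wait, watch the dashboards
        if green_error_rate() > ERROR_THRESHOLD:
            set_green_traffic(0)         # instant rollback: everyone back on Blue
            return False
    return True                          # Green now serves 100% of traffic

if __name__ == "__main__":
    print("Rollout complete" if rollout() else "Rolled back to Blue")
```

The important property is that rolling back is just a routing change, not a redeploy, which is what keeps your reaction time short.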

Feature Flags allow us to accomplish much the same thing with less operational overhead (no duplicate set of servers running at the same time, etc.) With this method we release code into production, but flag it as live or not. You can then open the new feature up to a subset of your users on the fly, test the results, and slowly increase its share of live usage. Again, we can watch the live results and, should we detect a problem, turn the flag off rapidly, restoring the previous behavior without waiting for a deployment.

While we save some operational cost this way, we do retain a small chance of catastrophic failure as we open the flag to more users (since the new code runs on the same servers as our safety cohort.) It’s a tradeoff, but it similarly lets us ease into the danger zone, reduce the risk of failure, and react rapidly when failure does occur.
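
As a rough sketch of how a percentage-based flag check might work (the flag store, feature name, and helper function here are hypothetical, not a specific feature-flagging library):

```python
import hashlib

# Hypothetical in-memory flag store: feature name -> percent of users enabled.
FLAGS = {"new_checkout_flow": 10}   # start by exposing 10% of users

def is_enabled(feature: str, user_id: str) -> bool:
    """Deterministically bucket each user into 0-99 and compare to the rollout %."""
    rollout_percent = FLAGS.get(feature, 0)
    bucket = int(hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest(), 16) % 100
    return bucket < rollout_percent

def checkout(user_id: str) -> str:
    # Application code picks a code path per request based on the flag.
    if is_enabled("new_checkout_flow", user_id):
        return "new checkout flow"
    return "existing checkout flow"

print(checkout("user-42"))
# Dialing FLAGS["new_checkout_flow"] up to 100 (or back down to 0) changes
# behavior immediately, with no deployment.
```

Hashing the user ID keeps each user in a stable cohort, so the same person doesn’t flip between the old and new behavior from one request to the next.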

Both of these methods let us maintain development and deployment speed while reducing the cost of failures. In so doing, we reduce our risk without introducing large salary and process costs to our business’s bottom line.

As with everything we’ve talked about, your organization has its own risk profile, but understanding that there are methods we can use to minimize these risks can save your company money and time, both of which are vital to your ability to rapidly test, iterate, and improve your software’s value.

Software’s Hidden Cost: Complexity

The one cost many engineers and team leaders fail to recognize is the cost of complexity. As your codebase and your team grow, it becomes important to avoid creating overly complex or hard-to-read code. The best code doesn’t just execute; it is also readable and clear in its purpose.

The simpler your team can make their code to read, process, and understand for someone who has never seen it, the easier it will be to maintain in the future. Kris Gale was the VP of Engineering at Yammer while I was there. Here are some of his thoughts on this:

“Embrace simplicity in your engineering. The best engineering usually isn’t showy or intense-looking. Given the same result, the simpler code is more valuable to your organization. This will often be unsatisfying to people’s egos, but the best engineers have nothing to prove.”

“When I actually became a real engineer, I realized the simpler I could build something and the less it needed documentation and illustration, the better off my coworkers were—the faster we could all build the thing we were hired to build.”

Dan North (a lifelong engineer who speaks often about software development, teams, programming, etc.) also talks about this concept. He says:

“If something doesn’t fit in your head, you cannot reason about it. It turns out, most of the stuff we deal with doesn’t fit in our heads: systems, software, architecture, organizations, products, domains don’t fit in our head. So, we come up with ways to cope. We fall back on our habits and routines ... which leads to unintended consequences.”

And so, what happens? We work around the situation, acting on instinct without understanding the thing we’re extending. We add to the problem, making the system more brittle, rather than taking the time to address the underlying complexity that exceeds our ability to reason about it.

This exacerbates the issue until things get uglier and uglier, and, finally, one day, it all falls apart. We have to stop moving forward in order to address the rat’s nest we’ve built up together.

How do we avoid this situation? You’re going to hate me, because I keep returning to this, but... Culture is how we avoid this.

We develop a common set of beliefs, design patterns, and solutions that provide consistency. This consistency is how we maintain simplicity, and we test it through careful review of one another’s work. In this manner, code reviews are not just about deciding whether your teammate’s code works, but whether it also makes sense within your team’s engineering culture.

Will anyone on our team be able to understand this code when they encounter it 6 months from now? If not, we need to try again.

“Consistency is the key. Consistency is the mechanism by which we can reduce the cognitive load or at least make it appropriate to the problem solved. Some problems are hard, but we don’t need to make them any harder than they are.” - Dan North


So, now that we understand a bit more about how Product and Engineering make your software sausage, and some of the tradeoffs in how those roles are utilized, let’s look at how you scale a software organization.

In part 6, “A Scaling Rosetta Stone,” we’ll look at how operational structures and software architectures resemble one another as organizations scale.
