Optimizing code is easy

The only hard part is literally everything else


"I'm out of ideas about how to optimize this"

Those are words I don't think I've ever heard when working on optimizing performance. Not to say it can't happen, mind you. It's just that there's rarely any shortage of ideas about possible ways to optimize things.

As with most things in software engineering, doing the actual work is the fun part, and the part that gets rewarded most often. It looks a lot better on your perf packet if you can say "shaved 1000s off of page load" than it does to say "figured out why our page load time is 1000s long" or "got management to chill out about it taking 2 quarters to shave 1000s off our page load time" (even though that last part is obviously the hardest).

There's a whole slew of things that are involved in optimizing app performance that are incredibly important and less fun. Bear in mind that none of these are impossible. It's just to say that they're difficult problems that need solving alongside doing the actual work of optimization itself.

It's tough to measure app latency

How difficult could it be to measure how fast your app is? It turns out to be pretty tough.

It's not so much the performance itself that's tough to measure. It's figuring out how those numbers have changed that's tough. Let's dig into why this is.

To start with, we need to come up with a number that represents your latency. Generally speaking, your latency measurements are going to have a long tail. That is to say, most of your users are going to have pretty consistent performance, except for that one user who for some reason takes 5 minutes to load the page.

So this rules out using an average. Averages are highly skewed by outliers, and you're going to have a lot of outliers.

So what about the median? This gets around the issue of your average being skewed by outliers.

But let's stop and think for a moment about what we're trying to measure. When we are measuring latency, we want to focus on the users that have a slow experience. If we're making optimizations for users at the 50th percentile, we're focusing on users who have an average experience, not a slow one.
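
To make that concrete, here's a small sketch (my own, with made-up numbers) of how the usual summary statistics treat a latency sample that contains a single extreme outlier:

```ts
// Made-up page load samples: nine ordinary users plus one pathological outlier.
const pageLoadsMs = [320, 350, 360, 370, 390, 410, 430, 450, 480, 300_000];

// Nearest-rank percentile over a sorted copy of the sample.
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[rank];
}

const mean = pageLoadsMs.reduce((sum, x) => sum + x, 0) / pageLoadsMs.length;

console.log(mean);                        // ~30356ms: dragged up by the single outlier
console.log(percentile(pageLoadsMs, 50)); // 390ms: hides the slow user entirely
console.log(percentile(pageLoadsMs, 95)); // 300000ms: the tail we actually care about
```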

That's why a generally-accepted best practice is to use the 90th or 95th percentile. Ok, so then all we have to do is measure a high percentile and our problems will be solved, right? Not so fast.

Performance measurements are a snapshot in time

Here's the thing about performance measurements in production: they're snapshots in time. It's like an election. We decide who wins an election by seeing who people want in office on one day every n years. But there are 364 other days in the year, and polling on one of those days might give a different result. The day we choose for the snapshot is essentially arbitrary; we pick a single day purely out of necessity.

Likewise, a performance measurement is a measure of a particular number of users making a particular number of requests on a particular number of days (or hours or weeks or months or... you get the idea). If you measure the exact same app with no changes on a different set of days, you're going to get a different result because you're measuring a different set of users making a different set of requests on different days.

Comparing different versions of the same app on different days adds yet another wrinkle. You can't just push a change out to production and compare before and after, because you'd be conflating your change with whatever else differs between those days.

So now I hear what you're saying: "Let's run an A/B experiment!" And I agree with you. But as someone once said about regexes: now you have two problems.

A/B experiments have the benefit of comparing two different changes over the same period of time, but they still have the problem of covering two different sets of users and two different sets of requests. Thus, you have to take the results of A/B experiments with a grain of salt.
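
How big that grain of salt needs to be is at least something you can estimate. Here's a rough sketch (my own, using synthetic data) that bootstraps the difference in p95 between two experiment arms to get a feel for how much of an apparent win could just be sampling noise:

```ts
// Synthetic latency samples standing in for the two experiment arms.
const control = Array.from({ length: 5000 }, () => 300 + Math.random() * 900);
const treatment = Array.from({ length: 5000 }, () => 280 + Math.random() * 900);

function p95(samples: number[]): number {
  const sorted = [...samples].sort((a, b) => a - b);
  return sorted[Math.ceil(0.95 * sorted.length) - 1];
}

// Resample each arm with replacement and recompute the p95 difference many times.
function bootstrapP95Delta(a: number[], b: number[], iterations = 2000): number[] {
  const resample = (xs: number[]) =>
    Array.from({ length: xs.length }, () => xs[Math.floor(Math.random() * xs.length)]);
  const deltas: number[] = [];
  for (let i = 0; i < iterations; i++) {
    deltas.push(p95(resample(b)) - p95(resample(a)));
  }
  return deltas.sort((x, y) => x - y);
}

const deltas = bootstrapP95Delta(control, treatment);
const low = deltas[Math.floor(0.025 * deltas.length)];
const high = deltas[Math.floor(0.975 * deltas.length)];
console.log(`p95 delta is roughly [${low.toFixed(0)}ms, ${high.toFixed(0)}ms]`);
// If that interval straddles zero, the "improvement" may just be different
// users making different requests.
```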

What are you even measuring?

Let's dig into an even deeper problem: what exactly are you trying to measure? Server response time? Time to first byte? Time until the page first renders? Time until the user can actually interact with it? Each of these captures something different.

I'm not saying you can't or shouldn't measure these things. What I'm saying is that you need a canonical measurement that the user cares about, one that puts the rest of your measurements into context. Which measurement should that be? That's a question only you can answer.
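
As one illustration (a sketch of mine, not a recommendation from this post), a browser app might treat Largest Contentful Paint as its canonical user-facing number, since it roughly tracks when the main content actually shows up. The endpoint and helper names here are placeholders:

```ts
// Report Largest Contentful Paint via the standard PerformanceObserver API.
const observer = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    // startTime is when the largest element painted, in ms since navigation start.
    reportMetric({ metric: 'LCP', valueMs: entry.startTime });
  }
});
observer.observe({ type: 'largest-contentful-paint', buffered: true });

// Placeholder for whatever logging pipeline you already have.
function reportMetric(payload: { metric: string; valueMs: number }): void {
  navigator.sendBeacon('/metrics', JSON.stringify(payload));
}
```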

Some questions to help get you started:

Explaining performance to management is tough

When your manager wants to look at your app's performance, they're expecting to see a neat little table like this one:

| Team   | Q1     | Q2     |
|--------|--------|--------|
| Team 1 | 1100ms | 900ms  |
| Team 2 | 700ms  | 1000ms |

Hey look, Team 1's latency improved by 200ms! Let's make sure to recognize that team (assuming your management does that, of course). But at the same time, Team 2's latency regressed by 300ms. Let's make sure we follow up with that team to ensure this gets fixed.

There are a couple of problems with this though:

  1. Latency measurements are a snapshot in time, so it's not clear whether a shift in the numbers was actually caused by anything your teams changed.
  2. Are you even sure your management is looking at the right measurements? It's not uncommon for them to be watching the wrong number, and getting them to look at better measurements is often tough.

That table is likely to look more like this:

| Team | Q1 | Q2 |
|------|----|----|
| Team 1 | 1139ms | 1098ms |
| Page owned by Team 2, but which depends on code from Team 3 | 1136ms | 1003ms |
| That one page that gets 10 QPS and is ostensibly owned by Team 2, but isn't really important enough to be properly maintained | 100ms | 1000ms |
| Infrastructure code owned by Team 4 that affects everyone | ????ms | ????ms |

Team 1's page improved by 41ms, but is that because of their code? Or is it a production issue of some kind? Or is it just normal variation?

Team 2's page also improved, by 133ms. Is this normal variation? Is it caused by Team 3's code?

And what impact does Team 4 have on all of this code? Who knows?

Performance complexity scales quadratically

If you can't figure out why your app is slow, then you can't fix it.

Have a simple, static webpage with one or two interactive elements? Debugging performance issues probably isn't all that tough.

It doesn't take very much for complexity to grow to the point where it becomes difficult to manage. For example, imagine the following flow for querying a theoretical user service to get the current user:

  1. To make the query, the browser first has to download and execute the JavaScript that issues it.
  2. To perform the query, it then has to connect to the backend, which involves DNS queries, TLS negotiation, and network round trips.
  3. The query has to go through some middleware (load balancers, caching proxies, query transformation layers, etc.) before it can actually reach the user service.
  4. The query reaches the user service, which checks the caching layer to see if the user is already stored.
  5. If the user isn't cached, the caching layer needs to query a database to get the current user.
  6. The result of this query gets passed back through the middleware from step 3.
  7. The result gets to the browser, which has to do some processing to transform the returned data.
  8. The result gets plugged into the UI, which involves styling, layout, and painting before it can actually be displayed.
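
Each of those hops can be timed, but only if someone instruments it. As a rough sketch (assuming an Express-style Node backend and hypothetical cache/database helpers, none of which come from this post), the server-side stages can be surfaced to the browser with the standard Server-Timing header:

```ts
import express, { Request, Response } from 'express';

// Hypothetical stand-ins for the caching layer (step 4) and database (step 5).
const cache = new Map<string, unknown>();
async function queryUserFromDb(sessionId: string): Promise<unknown> {
  return { id: sessionId, name: 'example user' }; // placeholder row
}

const app = express();

app.get('/api/current-user', async (req: Request, res: Response) => {
  const sessionId = String(req.query.session ?? 'anonymous');

  const start = Date.now();
  let user = cache.get(sessionId);               // step 4: check the cache
  const afterCache = Date.now();

  if (!user) {
    user = await queryUserFromDb(sessionId);     // step 5: fall back to the database
    cache.set(sessionId, user);
  }
  const afterDb = Date.now();

  // Each named entry shows up in DevTools and in
  // PerformanceResourceTiming.serverTiming on the client (step 7).
  res.set('Server-Timing', `cache;dur=${afterCache - start}, db;dur=${afterDb - afterCache}`);
  res.json(user);
});

app.listen(3000);
```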

That's 8 different places your page can slow down, each of which requires its own expertise and knowledge. And it can be any one of these steps, or all of them, that slows your app down. This grows to be multiple times as complex if you:

Conclusion

Again, none of this is to say that these problems are impossible to solve or that we should just say "performance is hard" and abandon the idea altogether. Rather, it's to say that there are a whole slew of issues that need to be considered when improving performance apart from the optimizations themselves.