Use the Golang context argument to indicate that a function can block

Posted on Mon, May 11, 2020 Golang

Once you've been programming long enough, you're bound to encounter issues where an application intermittently gets stuck for no obvious reason. With the root cause found and the issue resolved, you might ask yourself: "How can I keep this kind of bug from happening in the future?"

In this post I suggest a possible method for preventing this kind of issue with the help of the compiler: you indicate in a function's signature that it can block by giving it a Context argument, which lets the caller take the necessary precautions to avoid blocking for too long (or at all).

Similarity to the convention of errors as return values

In Go, whenever you call a function that returns an error, you must check for an error - or risk the function having not done what it was supposed to do. If you handle the error, then all's fine and well - you need not propagate it to your callers. If you don't handle it, then by convention, you simply propagate it up by returning the error.

For example, os.Getenv - func Getenv(key string) string cannot fail: it either returns the value of the environment variable, or an empty string if the environment variable did not exist. On the other hand, http.Get - func Get(url string) (resp *Response, err error) can fail: if it fails, you should handle the failure or tell your caller by returning an error yourself.

This means that when you write a function that calls other functions that can return errors, you are forced to explicitly make this choice: handle the error or propagate it to your callers.
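As a rough illustration of that choice, here is a minimal sketch. The helpers parsePort and portOrDefault are hypothetical and not part of any codebase discussed here: the first propagates a strconv.Atoi error to its caller, the second handles it by falling back to a default, so its signature no longer needs an error.

```go
package main

import (
	"fmt"
	"strconv"
)

// parsePort propagates the error: its caller has to decide what to do.
func parsePort(s string) (int, error) {
	port, err := strconv.Atoi(s)
	if err != nil {
		return 0, fmt.Errorf("parsing port %q: %w", s, err)
	}
	return port, nil
}

// portOrDefault handles the error itself, so it does not return one.
func portOrDefault(s string, fallback int) int {
	port, err := parsePort(s)
	if err != nil {
		return fallback
	}
	return port
}

func main() {
	if _, err := parsePort("http"); err != nil {
		fmt.Println("propagated:", err)
	}
	fmt.Println("handled:", portOrDefault("http", 8080))
}
```

Either way, the signature tells the caller which of the two choices was made.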

With IO or blocking operations, there is a similar complexity that emerges: you could call a function and not know how it would behave, absent documentation. Can it block? For how long? If it can block, how do you set a timeout? How do you cancel an ongoing operation? These are all questions you can only answer through the documentation or reading the code. If you answer these questions using the documentation - it might not be up to date. Some subtle property could have changed since the documentation was written that makes the function potentially blocking, but you won't know that as the function signature itself tells you nothing about this.

You can use contexts (Context from the context package) to surface the fact that your function performs a potentially blocking operation. The context argument also lets the caller cancel the operation or set a timeout, which forces them to handle the possibility that your function takes a variable and unknown amount of time to return.

When is this useful?

Imagine you were asked to implement a mechanism that reports logged errors to an external error tracking service, such as Bugsnag or Sentry, but the requirement is that it only report errors from production.

Your codebase has a configuration package that uses environment variables to determine the current configuration. You decide to add a function that tells you whether errors should be reported:

func ShouldReportErrors() bool {
  // Only the exact string "true" enables reporting; anything else disables it.
  return os.Getenv("SHOULD_REPORT_ERRORS") == "true"
}

As it is, this function can never block or fail - so it does not return an error (cannot fail), and it also does not take a context.Context parameter (cannot block).

Next, you've implemented a logging hook that sends logged errors to Sentry or Bugsnag, but only does so for production environments. You call ReportError from within your logging hook when you detect an error that should be logged. ReportError then checks if it should report errors (using the environment variable), and if so, takes care to report the error in a manner that does not block the logging hook.

func ReportError(err error) {
  if !configuration.ShouldReportErrors() {
    return
  }

  // ... report error asynchronously ...
}

At this point, ReportError cannot block, so it is safe to use from within your logging hook, which is called from every component that logs errors in your codebase, even components where blocking would be harmful.
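The "report error asynchronously" step in ReportError is left open above. One way it might be done, purely as a sketch, is a buffered channel with a non-blocking send. The names reportQueue and enqueueReport are assumptions for illustration, and the background worker that would drain the queue and talk to the error tracker is omitted.

```go
package main

import (
	"errors"
	"fmt"
)

// reportQueue is a hypothetical buffer between the logging hook and a
// background worker (not shown) that would send errors to the tracker.
var reportQueue = make(chan error, 2)

// enqueueReport never blocks: if the buffer is full, the error is dropped
// and the function returns false.
func enqueueReport(err error) bool {
	select {
	case reportQueue <- err:
		return true
	default:
		return false
	}
}

func main() {
	// Nothing drains the queue here, so the third send finds it full.
	for i := 1; i <= 3; i++ {
		queued := enqueueReport(errors.New("boom"))
		fmt.Printf("error %d queued: %v\n", i, queued)
	}
}
```

Dropping reports when the queue is full is a design choice: it trades completeness of error tracking for the guarantee that logging never stalls.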

What happens if ShouldReportErrors later becomes blocking?

Later, another engineer, perhaps in a different team, is tasked with changing the configuration package so that it fetches all configurations from a centralized configuration service instead of environment variables. A possible naive implementation would use the built-in http.Get to fetch the value from said configuration service:

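// Requires: encoding/json, fmt, net/http, os.
func ShouldReportErrors() bool {
  environment := os.Getenv("ENV")
  if environment == "" {
    environment = "dev"
  }
  url := fmt.Sprintf("http://configuration/%s/report-errors-external", environment)
  // No context and no explicit timeout: this call can block.
  res, err := http.Get(url)
  if err != nil {
    // log ...
    return false
  }
  defer res.Body.Close()
  if res.StatusCode != http.StatusOK {
    // log ...
    return false
  }

  var result bool
  err = json.NewDecoder(res.Body).Decode(&result)
  if err != nil {
    // log ...
    return false
  }

  return result
}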

All hell breaks loose

You have a function handling a request to an HTTP endpoint and an error occurs. You log it using your logging framework: logger.WithError(err).Error(…) and the code continues to the next request.

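Initially, everything works fine because http.Get returns quickly, assuming the configuration service is local and functioning. But what happens when the configuration service has an issue? http.Get uses http.DefaultClient, which sets no overall request timeout; the default transport only gives up on establishing a connection after 30s, so a call to an unresponsive service can block for 30s or longer.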

So now you have: myFunction() -> logging framework -> error report logging hook -> ReportError -> ShouldReportErrors, which blocks for 30s or more if the configuration service is unresponsive.

Where normally this would log and continue, now every time you log an error, it blocks. This quickly grinds your production system to a halt as any request hitting this kind of log is blocked, and memory consumption balloons as blocked requests pile up at the rate of incoming requests.

And all of this happened while:

None of them had the full picture, and none could reasonably be expected to be aware of all of this. Consider a codebase of hundreds to thousands of files, perhaps spread over multiple repositories and many teams. How could any one developer be expected to consistently deal with all of that complexity -- current and future -- when it is hiding behind many abstractions, such as a logging framework?

The problem here is that this function, through its signature, tells you nothing about whether it could block. This specific implementation would block for 30 seconds or more if the configuration service on the other end was unresponsive (the default transport's dial timeout is 30 seconds, and http.DefaultClient sets no overall timeout). Even worse, you might make this change -- and code relying on this function would keep compiling and working, even if it relies on the previous behavior of never blocking (maybe it holds a lock?).

So you find the issue, fix it, and then ask yourself -- how can I prevent this? How can I help other code authors see that this can happen when they write code, but without asking them to read a lot of code?

What can you do?

You can surface this complexity to the caller by adding a context.Context argument. The caller must then pass you a context that specifies when the operation should be canceled. If they want the operation to time out, they could use context.WithTimeout. They cannot call this function without making the choice of which context to pass you: the compiler will force them to deal with this complexity.
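From the caller's side, this might look roughly like the sketch below. slowLookup and lookupWithin are hypothetical names, and the blocking work is simulated with a timer rather than real I/O; the point is only that the caller picks the limit and the callee has to honor it.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// slowLookup is a hypothetical blocking operation that takes `work` to
// finish unless the context ends first.
func slowLookup(ctx context.Context, work time.Duration) error {
	select {
	case <-time.After(work):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// lookupWithin is the caller: it decides how long it is willing to wait.
func lookupWithin(limit, work time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), limit)
	defer cancel()
	return slowLookup(ctx, work)
}

func main() {
	// The work fits in the limit.
	fmt.Println(lookupWithin(50*time.Millisecond, time.Millisecond)) // <nil>
	// The work would take a second, but the caller only waits 10ms.
	fmt.Println(lookupWithin(10*time.Millisecond, time.Second)) // context deadline exceeded
}
```

Calling slowLookup without a context does not compile, so the caller cannot forget to make this decision.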

You might adapt ShouldReportErrors to use a Context this way:

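// Requires: context, encoding/json, fmt, net/http, os.
func ShouldReportErrors(ctx context.Context) (bool, error) {
  environment := os.Getenv("ENV")
  if environment == "" {
    environment = "dev"
  }
  url := fmt.Sprintf("http://configuration/%s/report-errors-external", environment)
  // Tie the request to ctx so that cancellation or a deadline aborts it.
  req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
  if err != nil {
    // log ...
    return false, err
  }
  res, err := http.DefaultClient.Do(req)
  if err != nil {
    // log ...
    return false, err
  }
  defer res.Body.Close()
  if res.StatusCode != http.StatusOK {
    // log ...
    // err is nil here, so build an error instead of returning (false, nil).
    return false, fmt.Errorf("unexpected status: %s", res.Status)
  }

  var result bool
  err = json.NewDecoder(res.Body).Decode(&result)
  if err != nil {
    // log ...
    return false, err
  }

  return result, nil
}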

Now, the responsibility to tie any blocking operations you do with the context and allow them to be terminated lies with you -- and not with the caller. The caller's responsibility ends with understanding this code may block, and specifying restrictions for the blocking using the context argument. Most importantly, they do not have to worry about the internals of your implementation, or your dependencies.
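To check that a request built with http.NewRequestWithContext really is cut short by the caller's deadline, one could point it at a local test server that never answers. This is only a sketch: fetchStatus and timesOutAgainstHungServer are hypothetical helpers, and the hung handler stands in for an unresponsive configuration service.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// fetchStatus performs a GET that is tied to ctx.
func fetchStatus(ctx context.Context, url string) (int, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return 0, err
	}
	res, err := http.DefaultClient.Do(req)
	if err != nil {
		return 0, err
	}
	defer res.Body.Close()
	return res.StatusCode, nil
}

// timesOutAgainstHungServer starts a local server that never answers and
// reports whether the call was ended by the context deadline.
func timesOutAgainstHungServer(limit time.Duration) bool {
	release := make(chan struct{})
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select { // hang until the test is over or the client goes away
		case <-release:
		case <-r.Context().Done():
		}
	}))
	defer srv.Close()
	defer close(release) // runs before srv.Close, so the handler can return

	ctx, cancel := context.WithTimeout(context.Background(), limit)
	defer cancel()
	_, err := fetchStatus(ctx, srv.URL)
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	fmt.Println("timed out:", timesOutAgainstHungServer(100*time.Millisecond))
}
```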

Returning the error leaves it up to the caller to decide what to do if fetching the value fails due to timeout or another error. Knowing that the function can block and fail, they might decide to try to fetch this configuration value just once and cache it if they judge their code to be sensitive to blocking here.
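A caller that chooses the "fetch once and cache" route might do something like the following sketch. fetchFlag is a hypothetical stand-in for a context-aware lookup such as ShouldReportErrors(ctx), and sync.Once is just one possible caching mechanism. Note that this version also caches a failed first fetch, which may or may not be what you want.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// fetchCount records how often the (hypothetical) slow fetch really runs.
var fetchCount int

// fetchFlag stands in for a context-aware call such as ShouldReportErrors(ctx).
func fetchFlag(ctx context.Context) (bool, error) {
	if err := ctx.Err(); err != nil {
		return false, err
	}
	fetchCount++
	return true, nil
}

var (
	once       sync.Once
	cachedFlag bool
	cachedErr  error
)

// cachedShouldReport runs fetchFlag on the first call only and returns the
// stored result (including a stored error) on every later call.
func cachedShouldReport(ctx context.Context) (bool, error) {
	once.Do(func() { cachedFlag, cachedErr = fetchFlag(ctx) })
	return cachedFlag, cachedErr
}

// callTwice shows that the second call is served from the cache.
func callTwice() (flag bool, fetches int) {
	ctx := context.Background()
	cachedShouldReport(ctx) // e.g. at startup, where blocking is acceptable
	flag, _ = cachedShouldReport(ctx)
	return flag, fetchCount
}

func main() {
	flag, fetches := callTwice()
	fmt.Println(flag, fetches) // true 1
}
```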

If you wanted to make the implementation never fail, you could decide on some strategy for what value to return if you get an error. For example, you could return false on every error, and also cache the result on an error so you fail fast. Either way, the context forces you to respect the caller's wishes: you mustn't block if the context is canceled or its timeout elapses.
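A never-failing variant could be sketched as a thin wrapper that still takes the caller's context but turns every error into false. The names fetchFlag, shouldReportOrFalse and errUnavailable are assumptions for illustration, the failure is simulated with a flag, and caching the failure to fail fast is left out to keep the example short.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// errUnavailable simulates the configuration service being down.
var errUnavailable = errors.New("configuration service unavailable")

// fetchFlag is a hypothetical stand-in for the context-aware, fallible lookup.
func fetchFlag(ctx context.Context, fail bool) (bool, error) {
	if err := ctx.Err(); err != nil {
		return false, err // respect a context that has already ended
	}
	if fail {
		return false, errUnavailable
	}
	return true, nil
}

// shouldReportOrFalse never fails: any error, including cancellation or a
// timeout, is turned into the safe default of not reporting.
func shouldReportOrFalse(ctx context.Context, fail bool) bool {
	ok, err := fetchFlag(ctx, fail)
	if err != nil {
		// log ...
		return false
	}
	return ok
}

// runBoth exercises the healthy and the degraded path.
func runBoth() (healthy, degraded bool) {
	ctx := context.Background()
	return shouldReportOrFalse(ctx, false), shouldReportOrFalse(ctx, true)
}

func main() {
	fmt.Println(runBoth()) // true false
}
```

The wrapper drops the error from its signature but keeps the context, so it still tells callers that it may block.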

This change also makes any calls to the original function's signature fail to compile (since they are now missing the context argument), which forces you to do one of the following:

My way of dealing with a similar issue at Rookout (where I work) was to introduce contexts to every blocking function. In our case, we had a deadlock caused by an interdependency -- an interdependency created through many layers of abstraction that evolved over time, rather than being architected (as is common with startups, where requirements change often). At neither call site could you tell that the operation would block without reading a LOT of code, which nobody does on the mere suspicion that a call might block. Much of the code that resulted in the deadlock existed before the interdependency came to be, so it was not written with it in mind and had to be changed, but the author of the interdependency would have had a hard time realizing that a change was necessary due to the sheer complexity of the system and the amount of code.

I like using the compiler as a tool to surface and handle complexity. I feel that contexts are a really powerful tool that isn't used enough in the third-party libraries I've seen. They let the caller group multiple operations under a single timeout (e.g. call 3 blocking functions and say "all of these should complete within 3 seconds" without implementing any complex logic), or use child contexts to stop components, all of their goroutines, and their in-flight operations by canceling the parent context.
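The "single timeout for several operations" idea might look like this sketch, in which step is a hypothetical blocking operation simulated with a timer and runAll gives three of them one shared budget.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// step is a hypothetical blocking operation that takes d unless ctx ends first.
func step(ctx context.Context, d time.Duration) error {
	select {
	case <-time.After(d):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// runAll gives three steps one shared budget instead of three separate
// timeouts, and returns how many steps completed.
func runAll(budget, perStep time.Duration) (int, error) {
	ctx, cancel := context.WithTimeout(context.Background(), budget)
	defer cancel()
	completed := 0
	for i := 0; i < 3; i++ {
		if err := step(ctx, perStep); err != nil {
			return completed, err
		}
		completed++
	}
	return completed, nil
}

func main() {
	// All three steps fit comfortably in the budget.
	fmt.Println(runAll(500*time.Millisecond, 10*time.Millisecond)) // 3 <nil>
	// The budget runs out partway through, so the remaining steps are skipped.
	fmt.Println(runAll(100*time.Millisecond, 60*time.Millisecond))
}
```

None of the steps needs to know about the others or about the overall budget; they only watch the context they were given.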

That's all! If you think I'm missing something or know about another tool I could use to protect against a different class of bugs, feel free to shoot me a mail - I'm eager to hear about it :)