Don't Panic! Handling Errors and Bugs in Go

Go is often characterized as a "small" language, with a carefully curated minimal set of features that together allow for effective programming in the large. In particular, in most cases there is only one way to solve a particular problem — whether enforced by the language itself or by community norms — and so Go code in one project will tend to be very similar to Go code in another project.

To a newcomer, it can appear that error handling is an exception to this rule: Go seems to provide two different error-handling mechanisms, to strongly encourage the use of one, and yet to frequently use the other. These two mechanisms are explicitly returning error values (the most common and recommended approach) and the so-called "panic", which abruptly aborts normal execution and unwinds the stack in a similar way to structured exception handling in other languages, terminating the program unless the panic is recovered.

While the use of panic is clearly discouraged in various documentation, it is also frequently used in real-world Go libraries and within the standard library itself. This gives the impression that the situation is not as straightforward as the documentation makes it appear — that there are actually valid reasons to use panic for error handling. The goal of this article is to take a pragmatic look at different ways Go libraries can and do handle errors of different types, and why each may be appropriate in certain situations.

This article is focused on error handling from the perspective of API design. That is, on modeling errors in the exported API of a library to help callers of that library write a program that is robust in the face of errors. Within the implementation details of a library the tradeoffs are often different and the consequences of a particular decision tend to fall on the library author rather than on library users. Poor API design, on the other hand, is an externality felt by all users of that API, with problems potentially repeated across dozens, hundreds, or thousands of other programs.

This is a subjective topic, with no absolute correct answer. You may disagree with some of the tradeoffs I propose here, and that is fine: you know better than I do what makes sense for your specific problem. The primary goal of this article is to introduce the decisions an API designer must make, not to dictate the answers to those questions.

Bugs vs. Errors

Before we begin, it's worth discussing what an "error" actually is. There are lots of reasons why a program might fail to proceed as its author hoped, such as a required file being missing on disk, the network being misconfigured, power being lost on the computer where it is running, the CPU itself having design flaws...

In practice, it is folly to try to handle all possible failure modes in your average program. As always, programming is a game of tradeoffs and as API designers we must weigh a number of competing concerns: Will handling this error cause a significant degradation of performance in the happy path? Can this error be detected and handled once at the start of the program rather than repeatedly during the program? Is it possible to handle this error at all?

The guidelines for review of library code submitted to the Go project itself (in standard libraries or in the "extension" libraries) seem at first glance to be very clear that panic should never be used, in the section simply titled Don't Panic:

See https://golang.org/doc/effective_go.html#errors. Don't use panic for normal error handling. Use error and multiple return values.

But the devil is in the details here. What is "normal error handling" anyway? Is there another abnormal kind of error handling? For the sake of this article, I'm going to use some different terminology that I find easier to keep straight in my head: bugs vs. errors.

An error, broadly speaking, is a problem that arises in the environment of the program: the program would've behaved as desired if only that important file hadn't been deleted, or the user's ISP weren't currently having an outage. Inappropriate user input is another very common kind of error: users will often mistype command lines, use incorrect grammar in configuration files, etc. A high-quality program will respond to errors either by working around them in some way or by producing an actionable error message for the user of the program.

A bug, on the other hand, is a problem within the program itself. Perhaps a developer didn't read a library's API documentation closely enough and passed an unacceptable argument to a function. Perhaps a particular list can legitimately be empty but we forgot to handle that situation.

This binary distinction is a coarse approximation, but I think a helpful one because it is approximately along this line that many of our API design decisions in the following sections will fall.

There will always be some ambiguity between errors and bugs, but we can try to decide many cases by thinking about whose "responsibility" it is to deal with a particular problem: perhaps you are writing a library that expects already-validated values as input, and so you consider invalid values as a bug in the caller. That caller, on the other hand, may consider those invalid values to be an error caused by invalid user input. As designers of library APIs we must consider carefully the scope of our library, and design its API so that callers can understand what is expected. Ideally, we want the compiler to check those assumptions.

Input, Processing, and Output

Another important consideration in software design is dealing with input and output. Many programs will begin by gathering outside data to operate on, and will end by emitting result data.

With "errors" defined as problems originating outside the program, it follows that errors will be most common within these input and output phases, as it is these which directly interact with the program's environment.

For example, the input phase might read a file from disk. There are lots of opportunities for error here: the file might not exist, the filesystem may be corrupt, the file may contain data that is not in a suitable format, it may contain too much or not enough data, and so on.

This leads to a general program structure as shown in the following example.
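A minimal single-file sketch of that structure follows. The article describes separate loader, doer, and writer packages; here they are approximated by the hypothetical functions loadDataFile, processData, and writeResultFile, and the data format (whitespace-separated whole numbers, with the largest one as the result) is an assumption made purely for illustration.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// parseData validates raw input text. Bad input is an error, not a bug, so it
// is reported with an error value. On success the result is never empty.
func parseData(src string) ([]int, error) {
	fields := strings.Fields(src)
	if len(fields) == 0 {
		return nil, fmt.Errorf("no numbers found")
	}
	data := make([]int, 0, len(fields))
	for _, f := range fields {
		n, err := strconv.Atoi(f)
		if err != nil {
			return nil, fmt.Errorf("%q is not a whole number", f)
		}
		data = append(data, n)
	}
	return data, nil
}

// loadDataFile is the input phase: it interacts with the environment, so it
// returns an error value for every problem it can detect.
func loadDataFile(filename string) ([]int, error) {
	src, err := os.ReadFile(filename)
	if err != nil {
		return nil, fmt.Errorf("cannot read data file: %w", err)
	}
	data, err := parseData(string(src))
	if err != nil {
		return nil, fmt.Errorf("invalid data file %s: %w", filename, err)
	}
	return data, nil
}

// processData is the processing phase. It requires data as produced by
// loadDataFile and panics on empty data, because that is a bug in the caller.
func processData(data []int) int {
	if len(data) == 0 {
		panic("processData called with empty data")
	}
	largest := data[0]
	for _, n := range data[1:] {
		if n > largest {
			largest = n
		}
	}
	return largest
}

// writeResultFile is the output phase, so it returns error values again.
func writeResultFile(filename string, result int) error {
	if err := os.WriteFile(filename, []byte(strconv.Itoa(result)+"\n"), 0o644); err != nil {
		return fmt.Errorf("cannot write result file: %w", err)
	}
	return nil
}

// run connects the phases; only the first and last can fail with an error.
func run(inFile, outFile string) error {
	data, err := loadDataFile(inFile)
	if err != nil {
		return err
	}
	return writeResultFile(outFile, processData(data))
}

func main() {
	dir, err := os.MkdirTemp("", "phases")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer os.RemoveAll(dir)
	in, out := filepath.Join(dir, "in.txt"), filepath.Join(dir, "out.txt")
	if err := os.WriteFile(in, []byte("3 9 4\n"), 0o644); err != nil {
		fmt.Println("error:", err)
		return
	}
	if err := run(in, out); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("wrote result to", out)
}
```

Note that run contains error-handling branches only around the input and output calls; the processing call in the middle reads as a plain expression.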

In this ideal situation, the developer of loader has guaranteed in its API documentation that the data return value will be valid and complete as long as the returned error is nil.

This in turn allowed the developer of doer to assume that validity, and consider it a bug in the calling program if it receives an invalid data value; it doesn't need to also return an error value. The writer too can perhaps assume that result is valid in some sense guaranteed by the doer API documentation, but it must still be prepared to handle errors when creating the result file.

What if the caller instead constructs that data value directly, and makes it invalid such that doer.ProcessData cannot produce a result? This function has no "normal error" channel with which to indicate that, and so its only recourse is to panic. However, this is clearly a bug in the calling program: doer.ProcessData mentioned in its documentation that it requires data in the form produced by loader.LoadDataFile, and so constructing that object some other way is incorrect usage, regardless of what environment the program is running in.

As API designers we can help callers write correct programs by making careful use of the type system so that the compiler can detect some kinds of incorrect usage:
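One way this might look is sketched below. The Data type stands in for doer.Data, and LoadData and ProcessData are hypothetical single-file stand-ins for loader.LoadDataFile and doer.ProcessData. The assumption is that Data keeps its contents in an unexported field, so code in other packages cannot populate one by hand.

```go
package main

import "fmt"

// Data is a hypothetical stand-in for doer.Data. Its only field is
// unexported, so other packages cannot fill it in directly.
type Data struct {
	values []int
}

// LoadData stands in for the loader: it validates its input and returns an
// error value if the input is unusable.
func LoadData(raw []int) (Data, error) {
	if len(raw) == 0 {
		return Data{}, fmt.Errorf("no values provided")
	}
	return Data{values: append([]int(nil), raw...)}, nil
}

// ProcessData accepts only Data, so passing an unvalidated []int is rejected
// by the compiler instead of being discovered at run time.
func ProcessData(d Data) int {
	if len(d.values) == 0 {
		// Still reachable through the zero value Data{}, which is a caller bug.
		panic("ProcessData called with a zero-value Data")
	}
	total := 0
	for _, v := range d.values {
		total += v
	}
	return total
}

func main() {
	d, err := LoadData([]int{1, 2, 3})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(ProcessData(d)) // prints 6

	// ProcessData([]int{1, 2, 3}) would not compile: []int is not Data.
}
```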

Depending on how doer.Data is specified, it may still be possible for a calling program to construct an incorrect value, but our use of a specialized type for the data helps the developer of the calling program to understand how to correctly connect these different components.

In this situation, it is reasonable to use panic to respond to incorrect input in doer.ProcessData because the only resolution to the problem is to correct the calling program, not to adjust the program's environment. The decision to use panic here is a tradeoff: since incorrect usage of this function is a bug rather than an error, we choose to carefully design the API to make this situation unlikely, which avoids placing an error-handling burden on correctly-implemented programs, often making the processing phase more readable.

We can see this tradeoff at play within the language itself: an out-of-bounds access to an array or slice is signalled via panic, rather than explicit error values, because handling these errors with explicit control flow would render many correct programs unreadable by introducing branches that can never be visited.
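To make the cost concrete, here is a small sketch comparing ordinary indexing with a hypothetical error-returning accessor named at. Go has no such accessor; it is invented here only to show what callers would have to write if indexing reported problems through error values.

```go
package main

import "fmt"

// at is a hypothetical error-returning alternative to the expression xs[i].
func at(xs []int, i int) (int, error) {
	if i < 0 || i >= len(xs) {
		return 0, fmt.Errorf("index %d out of range for length %d", i, len(xs))
	}
	return xs[i], nil
}

// sumWithErrors must handle an error on every access, even though the loop
// condition already guarantees that the error branch can never be taken.
func sumWithErrors(xs []int) (int, error) {
	total := 0
	for i := 0; i < len(xs); i++ {
		v, err := at(xs, i)
		if err != nil {
			return 0, err // never reached in this correct program
		}
		total += v
	}
	return total, nil
}

// sumWithIndexing relies on the language's behavior: an out-of-range index
// would panic, but a correct loop never produces one.
func sumWithIndexing(xs []int) int {
	total := 0
	for i := 0; i < len(xs); i++ {
		total += xs[i]
	}
	return total
}

func main() {
	xs := []int{1, 2, 3}
	a, _ := sumWithErrors(xs)
	fmt.Println(a, sumWithIndexing(xs)) // prints 6 6
}
```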

This leads to a rule of thumb: always use error values when processing input and producing output, since normal errors are most common in these phases. Use panic sparingly to signal program bugs in the main processing phase, along with careful API design to help callers avoid them, when the goal is to reduce error handling complexity in the processing phase of the calling program.

The remaining sections of this article are refinements of and guidelines for this high-level rule.

Know Your Audience

When a problem is detected, who is expected to fix it? What does that person need to know to make progress?

A panic is always directed at the developer of the calling program, and never at the end-user. In the event that an end-user does see a panic message, the user's only recourse is to contact the software developer for a corrected version of the program. Because of this, the default panic behavior includes a detailed stack trace for each active goroutine to help the developer identify the precise location where the problem was detected.

Conversely, panic is never an appropriate mechanism for messaging to the end-user. Problems with the environment (missing files, incorrect files, broken network connectivity, etc.) usually cannot be addressed by changes to the program, and so these problems should be reported via error values.

This often leads to a different problem: error messages at the wrong level of abstraction. The worst examples of this come when errors arise deep in a call stack and intermediate functions simply pass them through, rather than handling them directly. For example, consider this program that is parsing some JSON input, presumably as part of a larger input-processing stage:
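A minimal sketch of such a program might look like the following. The Config type and the parseConfig function are hypothetical. The use of json.Decoder rather than json.Unmarshal is also an assumption, chosen because Decoder.Decode reports completely empty input as io.EOF.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// Config is a hypothetical structure that the input is expected to describe.
type Config struct {
	Name string `json:"name"`
}

// parseConfig decodes buf as JSON. Any error from the JSON library is passed
// through to the caller unchanged, with no additional context.
func parseConfig(buf []byte) (*Config, error) {
	var cfg Config
	dec := json.NewDecoder(bytes.NewReader(buf))
	if err := dec.Decode(&cfg); err != nil {
		return nil, err
	}
	return &cfg, nil
}

func main() {
	_, err := parseConfig([]byte{})
	fmt.Println(err) // prints "EOF"
}
```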

If the sequence of bytes given in buf is not valid JSON, the error message from the JSON library will be returned directly to the caller. If buf is an empty byte slice, for example, the JSON parser may attempt to read and return io.EOF as its error.

If no other function in the call stack handles this error, it is likely to surface to the end user like this:

$ awesome-program
EOF

Not particularly helpful! The end-user may not even be aware that a file was being read and parsed as JSON here. An error return value from a function is, in effect, still a message to the direct caller of a function: even though it may be describing a more general environmental problem, it is often doing so with context and vocabulary common only between that caller and callee.

Go's JSON library knows that its caller is trying to parse JSON, but it doesn't know why. The caller presumably knows, and so it's the caller's responsibility to interpret and translate the error, re-framing the problem in a way that makes sense to its caller, and so on until eventually the caller is the end user themselves.
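As a sketch of that translation step, the hypothetical parseConfig function below knows it is reading a configuration file, so it rewrites the JSON library's errors in those terms. The Config type, the filename parameter, and the wording of the messages are all assumptions made for illustration.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
	"io"
)

// Config is a hypothetical structure that the input is expected to describe.
type Config struct {
	Name string `json:"name"`
}

// parseConfig decodes buf as JSON and re-frames any failure in terms that
// make sense to whoever supplied the file named by filename.
func parseConfig(filename string, buf []byte) (*Config, error) {
	var cfg Config
	err := json.NewDecoder(bytes.NewReader(buf)).Decode(&cfg)
	switch {
	case errors.Is(err, io.EOF):
		// Empty input: the bare "EOF" message would mean little to the user.
		return nil, fmt.Errorf("configuration file %s is empty; expected a JSON object", filename)
	case err != nil:
		// Keep the underlying detail, but say what was being attempted.
		return nil, fmt.Errorf("configuration file %s is not valid JSON: %w", filename, err)
	}
	return &cfg, nil
}

func main() {
	_, err := parseConfig("awesome.json", nil)
	fmt.Println(err)
	// prints "configuration file awesome.json is empty; expected a JSON object"
}
```

Wrapping with %w keeps the original error available to callers that want to inspect it with errors.Is or errors.As, while the message itself is written for the next audience up the call stack.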

By structuring a program or sub-program into separate input, processing, and output phases, this error translation process can be simplified: the call stack stays relatively shallow (the "processing" functions are not in turn calling parsers, for example) and at each phase the program is attempting to achieve a specific goal which can add important context to the error messages eventually returned to the user.

However we achieve it, it's always important for our programs to produce error messages that are understandable by their intended audience, with all of the context they need to understand and address the problem.

State Your Intentions

As API designers, our responsibility is to design an API that is easy to use correctly. Compile-time type checks are one tool in the API designer's toolbox, but are not a panacea. Go's type system is intentionally simple, and so it's not possible in practice to model all real-world expectations so that they can be checked by the compiler.

Another important tool in API design is idiom. As developers gain experience with a variety of different libraries, they develop a mental model for certain API design approaches that appear repeatedly. A very important idiom in Go is that of returning error values: unless otherwise stated, experienced Go developers will expect that if a function returns a non-nil error then any other return values should be assumed invalid.

When an API design steps away from common idiom, developers are likely to use it incorrectly. Sometimes deviations from idiom are warranted though, since each situation is unique.

When decisions in an API design cannot be modelled as type checks and step outside of common idiom, API documentation is our fallback. Go has a simple convention for documenting the intended contracts of functions using comments, which are rendered by the GoDoc tool.

A panic is never idiomatic, and therefore intentional panic situations should always be mentioned in documentation. Consider the standard reflect package for example: many methods of Value use panics to signal incorrect usage by the caller, but crucially they all also carefully document the correct usage and the consequences of violating it:

Bool returns v's underlying value. It panics if v's kind is not Bool.
Bytes returns v's underlying value. It panics if v's underlying value is not a slice of bytes.
Interface returns v's current value as an interface{}. It panics if the Value was obtained by accessing unexported struct fields.

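With that said, I would be remiss not to mention Hyrum's Law, which observes that with a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody. To mitigate this, it's best for any requirements in your documentation to be backed up by specific checks in code so that correct usage can emerge from trial and error as well as from careful reading of the documentation.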
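As a small illustration of backing a documented requirement with a check, consider the hypothetical Mean function below. Its doc comment follows the same "It panics if" convention as the reflect documentation, and the check at the top enforces that sentence rather than leaving the behavior for empty input to chance.

```go
package main

import "fmt"

// Mean returns the arithmetic mean of xs. It panics if xs is empty.
func Mean(xs []float64) float64 {
	if len(xs) == 0 {
		// Without this check the division below would quietly produce NaN,
		// and callers could come to depend on that undocumented result.
		panic("Mean called with an empty slice")
	}
	sum := 0.0
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

func main() {
	fmt.Println(Mean([]float64{1, 2, 3})) // prints 2
}
```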

Help Callers to Succeed

When an API designer decides to consider a particular problem a bug and respond to it with a panic, they can improve ergonomics (and thus encourage safe behavior) by providing convenient patterns of correct usage.

For example, in the previous section we saw that Go's own reflect package has a number of methods that panic under incorrect usage. Some have relatively simple definitions of correct usage, such as Value.Bool which works only for bool values. Others are more complex, such as Value.Interface which panics "if the value was obtained by accessing unexported struct fields".

Since a particular portion of a program may not know how a given value was obtained, the package also offers Value.CanInterface, which returns true only if Value.Interface could be called on the same value without a panic.

This combination of methods is optimizing for a presumed common case where a reflect.Value is both obtained and processed within the same component, and thus that component can "know" that it obtained the value in a way that allows Value.Interface to succeed, but allowing for a less-common situation where some fallback behavior or explicit error handling is needed:

A Real Example

So far we've mainly explored hypotheticals, aside from a brief look at some aspects of the standard library's reflect package. To put these ideas in perspective, I'd like to use an API of my own design which attempts to navigate all of these tradeoffs.

My library cty models types and values for applications that need to deal with data that can't be statically typed in the host program, such as data coming from arbitrary input files (e.g. JSON) or whose structure is defined by a separate plugin process.

I created it in response to a sequence of bugs in another program that resulted from working directly with interface{} values as the dynamic value representation while expecting only a subset of the values of that type.

For example, it is common for applications working with JSON to use encoding/json to unmarshal an arbitrary structure into an interface{} value and then use type assertions or reflection to work with that. The JSON library is constrained to only produce a specific subset of Go types that correspond approximately with JSON's own data types, but once these values pass into the larger program they may be interpreted by code with a different set of expectations, or may be mutated to include types that cannot be re-serialized as JSON later.

cty, then, essentially establishes a subset of possible types and values and aims to ensure that all of the documented invariants for those types and values are preserved as the values pass through a program. Whereas passing around interface{} values relies on convention and good behavior, cty enforces correct behavior through its API.

Working with dynamic data types creates many more potential runtime problems, and so designing cty raised some interesting questions about which of those problems are errors and which are bugs.

cty follows the "Input, Processing, and Output" model I described in an earlier section. The JSON package within cty (which is separate from Go's own) is one example of both input and output, converting byte buffers containing JSON syntax into values and vice-versa. The functions of this package return error values and, following my "Know Your Audience" principle, Unmarshal aims to return error messages that should make sense to the person who wrote the JSON input.

Once a program has obtained values of type cty.Value or cty.Type, the API design switches to treating incorrect arguments as caller bugs rather than errors. This optimizes for ergonomic use by correct programs that have performed any necessary validation or type checking ahead of processing, as we can see in the following (contrived) example:
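Because cty is a third-party module, the sketch below does not use it. Instead it defines a small hypothetical Value type whose AsValueSlice and Add methods imitate the panicking behavior that cty documents for its methods of the same names. The unmarshalNumberList and sumList functions are likewise invented for illustration, so treat this as a picture of the calling pattern rather than of cty's implementation.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Value is a hypothetical stand-in for a dynamically typed value: a number,
// a null, or a list of Values. It is not cty.Value.
type Value struct {
	null   bool
	isList bool
	num    float64
	elems  []Value
}

// Number returns a non-null number value.
func Number(n float64) Value { return Value{num: n} }

// AsValueSlice returns the elements of a non-null list, or panics if called
// on any other value.
func (v Value) AsValueSlice() []Value {
	if v.null || !v.isList {
		panic("AsValueSlice called on a value that is not a non-null list")
	}
	return v.elems
}

// Add returns the sum of v and other. Both must be non-null numbers; this
// method panics if not.
func (v Value) Add(other Value) Value {
	if v.null || v.isList || other.null || other.isList {
		panic("Add called with a value that is not a non-null number")
	}
	return Number(v.num + other.num)
}

// unmarshalNumberList is the input phase: malformed input is reported as an
// error. On success it returns a list whose elements are numbers or nulls.
func unmarshalNumberList(buf []byte) (Value, error) {
	var raw []*float64
	if err := json.Unmarshal(buf, &raw); err != nil {
		return Value{}, fmt.Errorf("expected a JSON array of numbers: %w", err)
	}
	elems := make([]Value, len(raw))
	for i, n := range raw {
		if n == nil {
			elems[i] = Value{null: true}
		} else {
			elems[i] = Number(*n)
		}
	}
	return Value{isList: true, elems: elems}, nil
}

// sumList is the processing phase. It contains no error handling because it
// assumes the input phase already established the value's type.
func sumList(list Value) Value {
	total := Number(0)
	for _, v := range list.AsValueSlice() {
		total = total.Add(v)
	}
	return total
}

func main() {
	list, err := unmarshalNumberList([]byte(`[1, 2, 3.5]`))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(sumList(list).num) // prints 6.5
}
```

In this stand-in, as in the situation the article describes for cty, an input of [null] passes the input phase and then causes Add to panic during processing.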

The cty.Value.AsValueSlice and cty.Value.Add methods used here are designed to assume validation was already performed during input and so will panic if their expectations are not met in order to reduce error-handling "noise" in the calling program. This is reflected in their documentation:

Add returns the sum of the receiver and the given other value. Both values must be numbers; this method will panic if not.
AsValueSlice returns a []cty.Value representation of a non-null, non-unknown value of any type that CanIterateElements, or panics if called on any other value.

In the latter case, we see an example of helping callers to succeed: the definition of what is iterable is complicated, and so cty also offers a method CanIterateElements so that a program that cannot assume a particular type can instead succinctly detect that and handle it, avoiding the panic.

It is important to note that this design doesn't prevent a program from panicking. It is possible to use the library incorrectly by failing to guarantee the correct type before calling AsValueSlice. The design tradeoff here is to provide convenient functions to ensure user input is valid early in the program, allowing for more direct code (with fewer conditional branches) in the "middle" of the program, which is likely to be the most complex part of the calling program and where readability is most important.

In programs like the above where the expected structure is known at compile time and it is only the values that vary, cty also allows a different approach of converting directly to specific native Go types during the input phase, allowing the Go compiler to ensure correctness:
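The idea can be sketched without cty as follows. The Value type and the toFloat64s conversion function are hypothetical stand-ins, not part of cty's API. The point is only that the conversion happens in the input phase and returns an error value, after which the processing code works with an ordinary []float64.

```go
package main

import "fmt"

// Value is a hypothetical dynamically typed value: a number or a null.
type Value struct {
	null bool
	num  float64
}

// toFloat64s belongs to the input phase: it converts dynamic values into a
// native []float64, returning an error for anything that does not fit.
func toFloat64s(vals []Value) ([]float64, error) {
	nums := make([]float64, len(vals))
	for i, v := range vals {
		if v.null {
			return nil, fmt.Errorf("element %d must not be null", i)
		}
		nums[i] = v.num
	}
	return nums, nil
}

// sum is the processing phase. Its argument type is checked by the compiler,
// so there is no dynamic type left to get wrong at run time.
func sum(nums []float64) float64 {
	total := 0.0
	for _, n := range nums {
		total += n
	}
	return total
}

func main() {
	nums, err := toFloat64s([]Value{{num: 1}, {num: 2.5}})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(sum(nums)) // prints 3.5

	_, err = toFloat64s([]Value{{null: true}})
	fmt.Println(err) // prints "element 0 must not be null"
}
```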

In this case we can do even more of the validation up front, and so the rest of the program need not worry about type-related panics at all. It may still have other panics to worry about, of course!

The API design of cty is not perfect by any means. In practical use I've found that it's easy for callers to allow null values in input but forget to handle them later: this is actually true of the first cty example above — it would panic if given [null] as input — and yet only an expert user of this library would spot that bug, and it is unfortunately a case likely to be missed during testing. Although it's an error on the user's part to provide null, it's a bug in the program that it isn't handled gracefully. (The second example actually fixes this by decoding into a Go type that cannot be nil, but that fix is by luck more than by care in this case.)

Do Panic?

As we've seen in previous sections, while returning error values is the primary way to handle errors in a Go library, there are also some situations where a panic can be appropriate in conjunction with other design work to create an API that is ergonomic and easy to use correctly.

Through thoughtful API design, we can reduce the cognitive overhead of error handling and improve readability by separating the concern of fraught interactions with the environment from the more predictable business of computation.

On the other hand, mistakes in API design — as with the modelling of null in my library cty — can create traps where users of your library can readily create incorrect programs, leading to crashes.

The decision of whether a particular problem is an error or a bug is always contextual and subjective: it is one of the many tradeoffs we must make when designing the API of a library and, in a broader sense, the overall architecture of a program.

The suggestions in this article can be summed up with an API design truism: good API design encourages correct usage, through careful application of language features, idiom, and documentation. Poor usage or over-usage of error values in an API will discourage callers from handling those errors carefully due to the increase in code complexity, while poor usage of panic will lead to software unreliability.

I hope this article will equip the reader with a good set of questions to ask when designing APIs, and that even if you disagree with some of my conclusions here — which I expect and welcome — you can do so knowingly, having considered all of the available options and their effects.