A model with a safety/performance tradeoff

This is the third section in a series on modeling AI governance. You might want to start at the beginning.

Building from our AI governance model, let’s make some assumptions that allow us to describe a scenario in which we have an explicit tradeoff between safety and performance.

The manager’s side

We start with our basic model, where the manager’s payoff is $\rho(x) - c(\hat x),$ with $x \sim \xi(\hat x)$ .

Let’s first assume that $\hat x$ consists of two inputs, $s$ and $p$, which represent effort put into safety and performance, respectively. We’ll then write the net cost as $c(s, p) = r(s + p),$ where $r$ is a scalar per-unit factor cost. We’ll assume for now that $\rho$ is a concave, increasing function on $\mathbb R$.

We now want to define $\xi$ so that $x$ is a random variable that tends to be higher when $p$ is greater (it tends upward with increasing performance), and has lower variance when $s$ is greater (it becomes more predictable with increasing safety). One way to do this is simply to let

x \sim \mathcal N(p, s^{-2}),

although that is somewhat arbitrary; really any distribution where the mean increases in $p$ and the variance decreases in $s$ will work (though some might be more appropriate).

If we do this, then, because we assumed $\rho$ to be concave, the manager is risk-averse and thus has an incentive to increase $s$ .
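To see this incentive in a concrete case, here is a minimal Python sketch. It assumes the specific payoff $\rho(x) = x - e^{-x}$ (borrowed from the example later in this section; the model itself only requires $\rho$ to be concave and increasing). Under that assumption and the Gaussian choice of $\xi$, the expected payoff has the closed form $p - \exp(-p + 1/(2s^2))$, which the simulation is only there to sanity-check. The function names are illustrative.

```python
import math
import random


def rho(x):
    """Illustrative concave, increasing payoff (an assumption, not required by the model)."""
    return x - math.exp(-x)


def expected_payoff_closed_form(s, p):
    """E[rho(x)] for x ~ N(p, s**-2), using E[exp(-x)] = exp(-p + 1 / (2 s^2))."""
    return p - math.exp(-p + 1.0 / (2.0 * s ** 2))


def expected_payoff_monte_carlo(s, p, n=200_000, seed=0):
    """Simulation check of the closed form; the standard deviation of x is 1 / s."""
    rng = random.Random(seed)
    return sum(rho(rng.gauss(p, 1.0 / s)) for _ in range(n)) / n


if __name__ == "__main__":
    p = 1.0  # performance held fixed while safety varies
    for s in (0.5, 1.0, 2.0, 4.0):
        exact = expected_payoff_closed_form(s, p)
        approx = expected_payoff_monte_carlo(s, p)
        print(f"s={s:4.1f}  closed form={exact:8.4f}  simulated={approx:8.4f}")
```

With $p$ held fixed, the closed form is increasing in $s$: shrinking the variance raises the expected payoff even though the mean of $x$ is unchanged, which is the risk-aversion effect described above.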

Let’s assume for now that there aren’t any transfers from the public to the manager, so that the manager receives only the intrinsic benefit $\rho(x)$, and look more closely at the manager’s optimization problem: the manager chooses $s, p$ to maximize

\mathbb E[\rho(\xi(s, p)) - r(s + p)].

We can write the above expectation as an integral, $\int f(x; s, p) \rho(x) dx - r(s + p),$ where $f(\cdot; s, p)$ is the density function for $x$ given parameters $s$ and $p$ . For an interior solution, the manager should choose $s$ and $p$ so that the marginal benefit from each is equal to the marginal cost:

r = \int \frac{\partial}{\partial s} f(x; s, p) \rho(x) dx = \int \frac{\partial}{\partial p} f(x; s, p) \rho(x) dx

What does this mean?

Increasing $s$ concentrates the density of $x$ around $p$ .
That’s good only if $\rho(x)$ is higher near $p$ .
Therefore, the manager wants to move $p$ to a point where $\rho(p)$ is higher and increase $s$ if $\rho(x)$ is sufficiently high near $p$ .
The manager does this until further benefits aren’t worth the cost.
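The marginal conditions above can be checked numerically in one special case. The sketch below assumes the Gaussian $\xi$ together with $\rho(x) = x - e^{-x}$ (my choice for illustration, taken from the example later in this section), so that $\mathbb E[\rho(x)] = p - \exp(-p + 1/(2s^2))$. In this parameterization the marginal benefit of $p$ is $1 + \exp(-p + 1/(2s^2))$, which always exceeds 1, so an interior solution only exists when $r > 1$. The function names and the value $r = 2$ are hypothetical.

```python
import math


def expected_payoff(s, p):
    """E[rho(x)] for x ~ N(p, s**-2) with the assumed rho(x) = x - exp(-x)."""
    return p - math.exp(-p + 1.0 / (2.0 * s ** 2))


def marginal_benefits(s, p, h=1e-6):
    """Central finite-difference estimates of dE/ds and dE/dp."""
    d_s = (expected_payoff(s + h, p) - expected_payoff(s - h, p)) / (2 * h)
    d_p = (expected_payoff(s, p + h) - expected_payoff(s, p - h)) / (2 * h)
    return d_s, d_p


def interior_solution(r):
    """Closed-form solution of the two first-order conditions; needs r > 1 here."""
    if r <= 1:
        raise ValueError("no interior solution in this parameterization unless r > 1")
    s = ((r - 1.0) / r) ** (1.0 / 3.0)
    p = 1.0 / (2.0 * s ** 2) - math.log(r - 1.0)
    return s, p


if __name__ == "__main__":
    r = 2.0  # illustrative per-unit factor cost
    s_star, p_star = interior_solution(r)
    mb_s, mb_p = marginal_benefits(s_star, p_star)
    print(f"s*={s_star:.4f}, p*={p_star:.4f}")
    print(f"marginal benefit of s={mb_s:.4f}, of p={mb_p:.4f}, marginal cost r={r}")
```

At the returned point, both finite-difference marginal benefits should come out approximately equal to $r$, which is exactly what the first-order conditions require.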

An example

Let’s take a simple example to get a feel for how this works. Let’s suppose that we can have either a “good” outcome or a “bad” outcome. Increasing $p$ increases the magnitude of the outcome (either good or bad), and increasing $s$ increases the probability that we get the good outcome:

\text P\{x = p\} = \frac{s}{1 + s}, \quad \text P\{x = -p\} = \frac{1}{1 + s}

Notice that in this example, increasing performance without also increasing safety can be bad for a risk-averse manager, since doing so increases the severity of a bad outcome.

A quick note: why are we using $s / (1 + s)$ for the probability of a good outcome? Why not just use $s$? This is just because we want $s$ to be able to take any positive value, but the probability needs to be between 0 and 1. $s$ therefore represents the odds of a good outcome rather than the probability.

Let’s suppose that the manager’s intrinsic payoff is $\rho(x) = x-e^{-x}$. The expected intrinsic payoff is

\mathbb E[\rho(x)] = \rho(p) \cdot \frac{s}{1+s} + \rho(-p) \cdot \frac{1}{1 + s}.

To hone our intuition, we can take a look at what this payoff is for a range of values of $s$ and $p$ :

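[Plots: expected intrinsic payoff $\mathbb E[\rho(x)]$ as a function of $s$ with $p$ held constant, and as a function of $p$ with $s$ held constant]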

In the first plot, we see that, with performance held constant, more safety is always helpful. In the second plot, we see that with safety held constant, more performance is only good up to a point, and thereafter increasing performance is too risky.
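These two observations are easy to reproduce. Here is a small Python sketch of the expected intrinsic payoff in the good/bad example, evaluated along each axis; the particular values of $s$ and $p$ are arbitrary choices for illustration.

```python
import math


def rho(x):
    """Manager's intrinsic payoff from the example."""
    return x - math.exp(-x)


def expected_payoff(s, p):
    """E[rho(x)] when x = p with probability s/(1+s) and x = -p otherwise."""
    prob_good = s / (1.0 + s)
    return prob_good * rho(p) + (1.0 - prob_good) * rho(-p)


if __name__ == "__main__":
    # Performance held constant at p = 1: payoff rises with safety.
    for s in (0.5, 1.0, 2.0, 4.0, 8.0):
        print(f"p=1.0, s={s:4.1f}: {expected_payoff(s, 1.0):8.4f}")

    # Safety held constant at s = 4: payoff rises, peaks, then falls with performance.
    for p in (0.5, 1.0, 1.5, 2.0, 3.0):
        print(f"s=4.0, p={p:4.1f}: {expected_payoff(4.0, p):8.4f}")
```

For this particular $\rho$, setting the derivative with respect to $p$ to zero gives $s(1 + e^{-p}) = 1 + e^{p}$, which simplifies to $p = \ln s$. So, ignoring costs, the turning point in the second plot sits at $p = \ln s$.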

We can also solve for the choices of $s$ and $p$ that maximize the manager’s net payoff for a range of values of $r$ – this gives us the following plots of solutions:

[Plots: the manager’s optimal $s$ and $p$ as functions of the per-unit factor cost $r$]

Both safety and performance decrease with an increase in the per-unit factor cost. That, of course, doesn’t have to be the case in general – exploring variations of this model where, e.g., safety increases with $r$ may be worthwhile (and I’m already working on this elsewhere).
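One way to compute solutions like these is sketched below in Python. This is not necessarily how the plots were produced; it is a simple approach under my own assumptions. For a given $p$, the first-order condition in $s$ is $(\rho(p) - \rho(-p))/(1+s)^2 = r$, which can be solved for $s$ directly, so only a one-dimensional grid search over $p$ is needed. The grid bounds and the values of $r$ are hypothetical.

```python
import math


def rho(x):
    """Manager's intrinsic payoff from the example."""
    return x - math.exp(-x)


def net_payoff(s, p, r):
    """Expected intrinsic payoff minus the factor cost r * (s + p)."""
    prob_good = s / (1.0 + s)
    return prob_good * rho(p) + (1.0 - prob_good) * rho(-p) - r * (s + p)


def best_safety(p, r):
    """Payoff-maximizing s for a given p >= 0, truncated at s = 0.

    Solves (rho(p) - rho(-p)) / (1 + s)**2 = r for s.
    """
    gap = rho(p) - rho(-p)
    return max(0.0, math.sqrt(max(gap, 0.0) / r) - 1.0)


def solve(r, p_max=10.0, steps=10_000):
    """Grid search over p in [0, p_max], with s set optimally for each p."""
    best = (net_payoff(0.0, 0.0, r), 0.0, 0.0)  # start from the corner s = p = 0
    for i in range(1, steps + 1):
        p = p_max * i / steps
        s = best_safety(p, r)
        value = net_payoff(s, p, r)
        if value > best[0]:
            best = (value, s, p)
    return best  # (net payoff, s*, p*)


if __name__ == "__main__":
    for r in (0.02, 0.05, 0.1):
        value, s_star, p_star = solve(r)
        print(f"r={r:.2f}: s*={s_star:.3f}, p*={p_star:.3f}, net payoff={value:.4f}")
```

The values of $r$ here are kept small on purpose: when $r$ is high enough, the search can return the corner $s = p = 0$ instead of an interior solution.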

The public’s side

Now let’s consider the public’s side of things: an important class of problems is where some social planner wants to choose some transfer rule $t$ to maximize public welfare

\mathbb E[u(x) - t(x)]

subject to the manager’s optimization, as discussed above, but with the transfers added to the manager’s payoff: $x \sim \xi(s^*, p^*),$ where

s^*, p^* = \text{argmax}_{s, p} \mathbb E[t(x) + \rho(x) - r(s + p)].

Typically, we also set some “individual rationality” condition that stipulates that the manager must be able to achieve some minimum expected payoff under a proposed scheme of transfers. This problem is, at its core, not too tricky: we expect the social planner to set $t$ to reward values of $x$ that are good for the public (for which $u(x)$ is high) and penalize values that are bad for the public (for which $u(x)$ is low). An important question here is why we would expect transfers to be necessary in the first place – after all, if the manager makes choices that optimize $\mathbb E[u(x)]$ of their own accord, then no transfers will be necessary.

One scenario is where $x$ is some sort of public good that the manager pays to produce but the public also benefits from. In this case, the manager may not produce enough on average from the perspective of the public, so the public would be willing to pay the manager to produce more: the simplest case of this would be where $\rho(x) = 0$ , so all the manager’s payoff comes from payments from the public – in that case, the manager has no intrinsic incentive to produce $x$ and only does so because the public wants them to.

Another, potentially more interesting scenario is where the public’s welfare differs from the manager’s in some important way; we’ll focus here on cases where the public is more risk-averse than the manager. Below, I show some example plots where the manager’s intrinsic payoff is $\rho(x) = x - e^{-x},$ but the public’s payoff is $u(x) = x - e^{-2x},$ which represents greater risk aversion on the public’s part.

[Plots: the manager’s and the public’s expected payoffs as functions of $s$ and of $p$]

In the first plot, you can see that the public benefits more from higher levels of safety. In the second plot, you can see that the public has a strong preference for lower levels of performance at any given level of safety. Clearly, the difference in risk aversion gives the public a reason to change the incentives faced by the manager. Figuring out the optimal incentives (as expressed through transfers) for a variety of scenarios seems like a useful area of inquiry.
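The gap in preferred performance can be made concrete with a short Python sketch. It uses the good/bad outcome example with the two payoff functions above, ignores the factor cost and any transfers (a simplification on my part), and finds by grid search the $p$ that each party would most prefer at a given level of safety. The helper names and grid settings are illustrative.

```python
import math


def rho(x):
    """Manager's intrinsic payoff."""
    return x - math.exp(-x)


def u(x):
    """Public's payoff, more risk-averse than rho."""
    return x - math.exp(-2.0 * x)


def expected(payoff, s, p):
    """Expected payoff when x = p with probability s/(1+s) and x = -p otherwise."""
    prob_good = s / (1.0 + s)
    return prob_good * payoff(p) + (1.0 - prob_good) * payoff(-p)


def preferred_performance(payoff, s, p_max=5.0, steps=5_000):
    """Grid search for the p in [0, p_max] that maximizes the expected payoff at safety s."""
    grid = [p_max * i / steps for i in range(steps + 1)]
    return max(grid, key=lambda p: expected(payoff, s, p))


if __name__ == "__main__":
    for s in (2.0, 4.0, 8.0):
        p_manager = preferred_performance(rho, s)
        p_public = preferred_performance(u, s)
        print(f"s={s:3.1f}: manager prefers p={p_manager:.3f}, public prefers p={p_public:.3f}")
```

At each of these safety levels, the public’s preferred $p$ comes out below the manager’s, which is the wedge that a transfer rule $t$ would need to close.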

What’s next

The examples given here are only meant to give an idea of how this all works; there are lots of other interesting assumptions we could make. In the next section, we’ll examine what happens in this model when we have multiple, competing managers.
