Safety and performance with competing managers

This is the fourth section in a series on modeling AI governance. You might want to start at the beginning.

In the last section, we considered a specialized version of the AI governance model that captures a tradeoff between safety and performance. In this section, we’ll expand the model to have multiple, competing actors.

To recap, we assume that our manager chooses $s$ and $p$ to maximize

$$\mathbb E[\rho(x)] - c(s, p),$$

where $x \sim \xi(s, p)$.

We showed that this can sometimes lead to an inefficient outcome from the perspective of the public, but there’s good reason to suspect that a lot of bad outcomes may be the result of competition between managers, which this version of the model doesn’t capture.

Luckily, we just need to make a couple of simple changes. Let’s suppose that we have $n$ managers; each manager chooses their own levels $s_i$ and $p_i$ of safety and performance, respectively. Let’s now define $s$ and $p$ (without subscripts) as the vectors of everyone’s safety and performance:

$$s = (s_1, \dots, s_n), \quad p = (p_1, \dots, p_n)$$

Then we’ll say that the outcomes for everyone are drawn from some probability distribution over $s$ and $p$:

$$x := (x_1, \dots, x_n) \sim \xi(s, p)$$

Safety as an externality

Let’s consider an example. In the last section, we considered a scenario where a single manager produces a good outcome ($x = p$) with probability $s/(1+s)$ and a bad outcome ($x = -p$) otherwise. Generalizing that to a scenario with multiple managers, let’s say now that each manager has an independent probability $1/(1+s_i)$ of causing a “disaster.” If any manager causes a disaster, then we get $x_i = -p_i$ for all $i$ (everyone has a bad outcome). If nobody causes a disaster, then we get $x_i = p_i$ for all $i$ (everyone has a good outcome).

For a given reward function $\rho$, expected reward is therefore

$$\mathbb E[\rho(x) \mid s, p] = \left( \prod_{i=1}^n \frac{s_i}{1 + s_i} \right) \rho(p) + \left( 1 - \prod_{i=1}^n \frac{s_i}{1+s_i} \right) \rho(-p)$$
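This expectation is straightforward to evaluate numerically. A minimal sketch (the function names are mine; $\rho$ is passed in as an arbitrary function of the outcome vector):

```python
import math

def no_disaster_prob(s):
    """Probability that no manager causes a disaster: prod_i s_i / (1 + s_i)."""
    return math.prod(si / (1 + si) for si in s)

def expected_reward(rho, s, p):
    """E[rho(x) | s, p] in the all-or-nothing disaster model:
    with probability q everyone gets x_i = p_i, otherwise x_i = -p_i."""
    q = no_disaster_prob(s)
    return q * rho(p) + (1 - q) * rho([-pi for pi in p])
```

For instance, with a single manager at $s = p = 1$ and $\rho = \sum_i x_i$, the good and bad outcomes are equally likely and the expected reward is zero.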

Let’s pause for a moment to consider what this model is describing: we’re assuming that every manager has a chance of causing a bad outcome, not just for themself but also for everyone else; safety is a positive externality, since the total benefit of a manager’s purchase of $s_i$ is greater than the benefit to that manager alone. (To be specific, under the assumptions here, safety is a public good – everyone benefits equally whenever anyone buys it.) Therefore, even if we only care about the aggregate payoff of the managers, we should generally expect the managers to be insufficiently safe. In the socially efficient (higher-safety) case, each manager individually has an incentive to free-ride on the safety purchased by the others, so they all defect to the competitive (lower-safety) equilibrium, even though that makes everyone worse off. If something like this turns out to be a good description of the real world, then a reasonable class of interventions for policy-makers to consider is those that keep organizations from defecting to the competitive equilibrium, since that will be in everyone’s interest.

Relating this to a plausible real-world scenario, let’s assume that we’re in a world with two managers, which we’ll call ShallowMind and ClosedAI. Let’s say that there is some action each manager could take that increases that manager’s payoff but introduces a small probability of a disaster. ShallowMind might decide that the small disaster probability is worth it for the boosted payoff, and take the risk. Meanwhile, ClosedAI might think the same thing and also take the risk. Now we have two failure points – either ShallowMind or ClosedAI could generate a disaster – which may be too much risk for the original tradeoff to be worth it. But it may be that neither ShallowMind nor ClosedAI can back out, since whichever one does will face a higher disaster risk from the other’s actions without even getting the extra payoff! The astute reader may recognize this as a case of the classic prisoner’s dilemma. Below, I’ve calculated the expected payoffs for a hypothetical case where taking the risky action triples a manager’s payoff but introduces an (independent) 1/2 probability that everyone gets zero payoff:

| ShallowMind \ ClosedAI | Safe | Risky |
| --- | --- | --- |
| **Safe** | 1, 1 | 0.5, 1.5 |
| **Risky** | 1.5, 0.5 | 0.75, 0.75 |
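These entries follow directly from the stated assumptions; a quick sketch (the function name and parameter defaults are mine):

```python
def expected_payoffs(risky_a, risky_b, base=1.0, boost=3.0, disaster_prob=0.5):
    """Expected payoffs for two managers who each choose Safe or Risky.
    Going risky multiplies that manager's payoff by `boost` but adds an
    independent `disaster_prob` chance of wiping out everyone's payoff."""
    # Probability that nobody triggers a disaster
    p_safe = ((1 - disaster_prob if risky_a else 1.0)
              * (1 - disaster_prob if risky_b else 1.0))
    return (p_safe * base * (boost if risky_a else 1.0),
            p_safe * base * (boost if risky_b else 1.0))
```

For example, `expected_payoffs(True, True)` recovers the (0.75, 0.75) cell: both managers triple their payoff, but it survives only with probability 1/4.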

It’s best for both ShallowMind and ClosedAI to act safely, but in the Nash equilibrium, both act riskily.

Looking at a concrete model

As part of an ongoing project, I’ve analyzed a specific version of the competing-managers model, which we’ll now consider. Let’s assume the following form for the reward function:

$$\rho_i(x) = \begin{cases} \frac{p_i}{\sum_{j=1}^n p_j} & x > 0 \\ -d_i & \text{otherwise} \end{cases}$$

We can think of this as describing a scenario where, if there is no disaster, the managers’ payoffs are determined in a contest in which each manager’s chance of winning is equal to their share of the total performance, and if there is a disaster, each manager pays a fixed cost $d_i$.

A manager’s expected payoff is therefore

$$\mathbb E[\rho_i(x) \mid s, p] = \left( \prod_{j=1}^n \frac{s_j}{1 + s_j} \right) \frac{p_i}{\sum_{j=1}^n p_j} - \left( 1 - \prod_{j=1}^n \frac{s_j}{1+s_j} \right) d_i$$

In a Nash equilibrium, each manager’s choice of $s_i$ and $p_i$ maximizes their net payoff

$$\mathbb E[\rho_i(x) \mid s, p] - r(s_i + p_i),$$

taking everyone else’s choices of safety and performance as given.
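One way to approximate such an equilibrium numerically is best-response iteration: each manager repeatedly re-optimizes against the others’ current choices. A rough sketch using a grid search (all function names, grids, and defaults are mine; this is not the exact solver used for the plots):

```python
import itertools
import math

def payoff(i, s, p, r, d=0.0):
    """Manager i's net expected payoff in the contest model:
    q * p_i / sum(p) - (1 - q) * d - r * (s_i + p_i),
    where q is the probability that nobody causes a disaster."""
    q = math.prod(sj / (1 + sj) for sj in s)
    return q * p[i] / sum(p) - (1 - q) * d - r * (s[i] + p[i])

def best_response(i, s, p, r, grid):
    """Grid-search manager i's best (s_i, p_i) given everyone else's choices."""
    best, best_val = (s[i], p[i]), -math.inf
    for si, pi in itertools.product(grid, repeat=2):
        s_try, p_try = list(s), list(p)
        s_try[i], p_try[i] = si, pi
        val = payoff(i, s_try, p_try, r)
        if val > best_val:
            best, best_val = (si, pi), val
    return best

def solve_equilibrium(n=2, r=0.05, sweeps=20):
    """Iterate best responses until they (hopefully) settle at a Nash equilibrium."""
    grid = [0.1 * k for k in range(1, 81)]  # coarse grid over [0.1, 8.0]
    s, p = [1.0] * n, [1.0] * n
    for _ in range(sweeps):
        for i in range(n):
            s[i], p[i] = best_response(i, s, p, r, grid)
    return s, p
```

Best-response iteration isn’t guaranteed to converge in general, but it behaves well in symmetric examples like this one, where the identical managers end up with (approximately) identical strategies.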

Below, I’ve plotted some solutions for the case where $n = 2$ and $d = 0$.

*[Plots: equilibrium safety and performance as functions of the price $r$]*

The result here is not too surprising: safety and performance both decrease with the per-unit price $r$. The players have the same strategies, which makes sense given that they are identical.

How can we make this more interesting and informative? Well, we might think that the costs of safety and performance don’t necessarily scale linearly. Let’s say that instead of being able to purchase them at a fixed rate of $r$, they are produced according to the production functions

$$s = A X_s^\alpha, \quad p = B X_p^\beta,$$

where $X_s$ and $X_p$ are purchased at the per-unit price $r$. (We can think of $X_s$ and $X_p$ as the factors of production for $s$ and $p$, or as the effort spent on safety and performance, respectively.)
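The production side is just a pair of power functions; a minimal sketch (the function name is mine):

```python
def produce(x_s, x_p, A=1.0, B=1.0, alpha=1.0, beta=1.0):
    """Turn safety effort X_s and performance effort X_p into realized
    safety and performance: s = A * X_s**alpha, p = B * X_p**beta."""
    return A * x_s ** alpha, B * x_p ** beta
```

With `alpha = beta = A = B = 1` this reduces to the original linear model, where effort and output coincide.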

What are the meanings of all these parameters? We can think of $A$ and $B$ as base productivity parameters and $\alpha$ and $\beta$ as scaling parameters. With higher $A$ (or $B$), the marginal cost of safety (or performance) is lower at any given level of $s$ (or $p$); with $\alpha < 1$ (or $\beta < 1$), safety (or performance) has decreasing returns to scale, and with $\alpha > 1$ (or $\beta > 1$), safety (or performance) has increasing returns to scale.

Since the managers are now choosing $X_s$ and $X_p$ rather than buying $s$ and $p$ directly, we’ll also write the cost function as

$$c(X_s, X_p) = r(X_s + X_p),$$

so each manager’s objective is now to choose $X_{s,i}$ and $X_{p,i}$ that will maximize

$$\mathbb E[\rho_i(x) \mid X_s, X_p] - r(X_{s,i} + X_{p,i})$$

taking the other managers’ choices as given.

Changes in productivity

Below, I’ve plotted the solutions when we let $\alpha = \beta = 1$ for both players, across a range of values of $A$. Unsurprisingly, you can see that increasing $A$ increases safety at any given price. We can also see that increasing $A$ increases performance – the players take advantage of the lower cost of safety to spend more on performance.

*[Plots: equilibrium safety and performance vs. $r$ for several values of $A$]*

In the next set of graphs, I show solutions as $B$ varies. You can see that performance increases with $B$, but safety is unaffected.

*[Plots: equilibrium safety and performance vs. $r$ for several values of $B$]*

Why don’t we see here the same substitution effect we saw when we increased $A$, with both $s$ and $p$ increasing in response? Again, externalities turn out to be important. We discussed before that safety is a positive externality in this model; what about performance? Since we assume that the payoff to manager $i$ in the case of a safe outcome is $\frac{p_i}{\sum_{j=1}^n p_j}$, one manager increasing their performance reduces the payoff of the other manager(s) – performance is a negative externality. (As a side note, the particular payoff function we used here assumes that the managers are in a zero-sum competition over performance, since they are fighting over a fixed total payoff of 1, but the basic idea remains the same if we loosen that assumption.)

To see what this means for our two-player example, let’s consider what would happen if $B$ increased for both managers, and one manager decided to put some of the money saved on performance into increased safety. If the other manager doesn’t do that – using the new productivity just to increase performance – then they benefit just as much from the first manager’s increased safety (since safety is a public good) and are now able to out-compete the first manager in terms of performance. Spending on safety makes it easier for each manager’s competitor to beat them out in performance, so it’s not a good strategy. To reiterate an earlier point, regulators might be able to intervene helpfully here by enforcing contracts between managers in cases where everyone would prefer to be safe but competitive pressures prevent it.

Changes in returns to scale

Below, I show solutions for a few different values of $\alpha$. The effect might seem the same as when we changed $A$, but you can see that here the effect is more pronounced at lower values of $r$ (when $X_s$ is more affordable and thus higher).

*[Plots: equilibrium safety and performance vs. $r$ for several values of $\alpha$]*

Doing the same thing for $\beta$ gives us the next set of plots. We get results analogous to what we’ve gotten so far:

*[Plots: equilibrium safety and performance vs. $r$ for several values of $\beta$]*

Making safety’s cost dependent on performance

“Hold on!” I’m sure you’re saying (being a very studious and attentive reader), “Why have we assumed that safety and performance are produced independently? Wouldn’t they realistically interact so the cost of one depends on the other?”

Ah, good point, reader. It’s hard to say exactly how safety and performance interact, but one reasonable assumption is that safety becomes more expensive as performance increases – it should be easier to ensure that a simple AI system is safe than an advanced one. We can incorporate this idea into our production functions with another scaling parameter, $\theta$, like so:

$$s = A X_s^\alpha p^{-\theta}, \quad p = B X_p^\beta$$
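Since $p$ doesn’t depend on $s$, the coupled system can still be evaluated in one pass – compute $p$ first, then discount safety by $p^{-\theta}$. A small sketch (the function name is mine):

```python
def produce_coupled(x_s, x_p, A=1.0, B=1.0, alpha=1.0, beta=1.0, theta=1.0):
    """s = A * X_s**alpha * p**(-theta), p = B * X_p**beta.
    Performance is unaffected by safety effort, so p can be computed first."""
    p = B * x_p ** beta
    s = A * x_s ** alpha * p ** (-theta)
    return s, p
```

Setting `theta=0` recovers the independent production functions from before.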

This makes the tradeoff between safety and performance more salient: we can now think of performance as having a cost not just in terms of $X_p$ but also in terms of safety.

Below, I show solutions for a few different values of $\theta$. The important thing to note is that, for higher $\theta$, the managers are more willing to substitute safety for performance as $r$ increases. In fact, at least in this particular example, for $\theta > 1$ safety actually increases with an increase in $r$.

*[Plots: equilibrium safety and performance vs. $r$ for several values of $\theta$]*

In fact, the general rule is that whenever $\theta > \alpha/\beta$, safety increases with $r$. When $\theta < \alpha/\beta$, safety decreases with $r$, and when $\theta = \alpha/\beta$ (as at $\theta = 1$ in the example above), safety does not change with $r$. I am still working out the mathematical details of how this works, but the intuition is not too bad: we can write

$$s = \frac{A}{B^\theta} X_s^\alpha X_p^{-\theta \beta},$$

so if $\theta = \alpha/\beta$, then $s \propto (X_s / X_p)^\alpha$. This means that if we want to change performance but leave safety unchanged, we need only match any change in $X_p$ one-to-one (in proportional terms) with a change in $X_s$, so that the ratio $X_s / X_p$ remains constant. If $\theta < \alpha/\beta$, then $X_s$ needs to change more slowly than $X_p$ if we want safety to remain constant. (That is, an increase in performance with safety held fixed requires only a less-than-proportional increase in $X_s$.) On the other hand, if $\theta > \alpha/\beta$, then it takes relatively more change in $X_s$ to maintain a given level of safety.
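The boundary case is easy to check numerically: when $\theta = \alpha/\beta$, scaling $X_s$ and $X_p$ by the same factor leaves safety unchanged, while away from the boundary it does not. A small sketch (the parameter values $\alpha = 2$, $\beta = 1$ are chosen purely for illustration):

```python
def safety(x_s, x_p, A=1.0, B=1.0, alpha=2.0, beta=1.0, theta=2.0):
    """s = (A / B**theta) * X_s**alpha * X_p**(-theta * beta)."""
    return (A / B ** theta) * x_s ** alpha * x_p ** (-theta * beta)

# With theta = alpha / beta (here 2/1), safety depends only on the
# ratio X_s / X_p, so doubling both inputs leaves it unchanged.
assert abs(safety(2.0, 3.0) - safety(4.0, 6.0)) < 1e-12

# Away from the boundary (theta != alpha / beta), the same doubling
# changes safety, so the effort ratio alone no longer pins it down.
assert safety(4.0, 6.0, theta=1.0) != safety(2.0, 3.0, theta=1.0)
```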

A useful way to think about this is in terms of costs of scaling. As mentioned previously, $\alpha$ measures how well safety can scale, while $\beta$ measures how well performance can scale. $\theta$ determines how the scaling of performance impacts the scaling of safety – if $\alpha > \theta\beta$, then safety can “out-scale” performance (or, put in terms of cost, safety costs won’t out-scale performance costs); otherwise, increases in performance must come at the expense of reduced safety.

This has important implications for AI governance. If we’re safety-conscious and think that safety costs will out-scale performance costs (i.e., we suspect that $\alpha < \theta\beta$ in the long run), then we may be inclined to limit performance (e.g., by increasing factor prices, as measured by our $r$ parameter) until we can figure out other ways to incentivize safety. The model, unfortunately, doesn’t tell us which values of the scaling parameters are accurate, but it does at least give us a structured way to play around with various assumptions and get a feel for the scenarios we ought to watch out for.


In the next section, we’ll investigate what happens in this model when we play around with the disaster-cost ($d$) parameter.
