Safety and performance with competing managers

This is the fourth section in a series on modeling AI governance. You might want to start at the beginning.

In the last section, we considered a specialized version of the AI governance model that captures a tradeoff between safety and performance. In this section, we’ll expand the model to have multiple, competing actors.

To recap, we assume that our manager chooses $s$ and $p$ to maximize

$$\mathbb E[\rho(x)] - c(s, p),$$

where $x \sim \xi(s, p)$.

We showed that this can sometimes lead to an inefficient outcome from the perspective of the public, but there’s good reason to suspect that a lot of bad outcomes may be the result of competition between managers, which this version of the model doesn’t capture.

Luckily, we just need to make a couple of simple changes. Let’s suppose that we have $n$ managers; each manager chooses their own levels $s_i$ and $p_i$ of safety and performance, respectively. Let’s now define $s$ and $p$ (without subscripts) as the vectors of everyone’s safety and performance:

$$s = (s_1, \dots, s_n), \quad p = (p_1, \dots, p_n)$$

Then we’ll say that the outcomes for everyone are drawn from some probability distribution over $s$ and $p$:

$$x := (x_1, \dots, x_n) \sim \xi(s, p)$$

Safety as an externality

Let’s consider an example. In the last section, we considered a scenario where a single manager produces a good outcome ($x = p$) with probability $s/(1+s)$ and a bad outcome ($x = -p$) otherwise. Generalizing that to a scenario with multiple managers, let’s say now that each manager has an independent probability $1/(1+s_i)$ of causing a “disaster.” If any manager causes a disaster, then we get $x_i = -p_i$ for all $i$ (everyone has a bad outcome). If nobody causes a disaster, then we get $x_i = p_i$ for all $i$ (everyone has a good outcome).

For a given reward function $\rho$, expected reward is therefore

$$\mathbb E[\rho(x) \mid s, p] = \left( \prod_{i=1}^n \frac{s_i}{1 + s_i} \right) \rho(p) + \left( 1 - \prod_{i=1}^n \frac{s_i}{1+s_i} \right) \rho(-p)$$
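This expectation is straightforward to evaluate numerically. A minimal sketch (the function names are mine; $\rho$ is passed in as an arbitrary function of the outcome vector):

```python
import math

def no_disaster_prob(s):
    """Probability that no manager causes a disaster: prod_i s_i / (1 + s_i)."""
    return math.prod(si / (1 + si) for si in s)

def expected_reward(rho, s, p):
    """E[rho(x) | s, p] in the all-or-nothing disaster model:
    with probability q everyone gets x_i = p_i, otherwise x_i = -p_i."""
    q = no_disaster_prob(s)
    return q * rho(p) + (1 - q) * rho([-pi for pi in p])
```

For instance, with a single manager at $s = p = 1$ and $\rho = \sum_i x_i$, the good and bad outcomes are equally likely and the expected reward is zero.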

Let’s pause for a moment to consider what this model is describing: we’re assuming that every manager has a chance of causing a bad outcome, not just for themself but also for everyone else; safety is a positive externality, since the total benefit of a manager’s purchase of $s_i$ is greater than the benefit to that manager alone. (To be specific, under the assumptions here, safety is a public good – everyone benefits equally whenever anyone buys it.) Therefore, even if we only care about the aggregate payoff of the managers, we should generally expect the managers to be insufficiently safe. In the socially efficient (higher-safety) case, each manager individually has an incentive to free-ride on the safety purchased by the others, so they all defect to the competitive (lower-safety) equilibrium, even though that makes everyone worse off. If something like this turns out to be a good description of the real world, then a reasonable class of interventions for policy-makers to consider is those that keep organizations from defecting to the competitive equilibrium, since that will be in everyone’s interest.

Relating this to a plausible real-world scenario, let’s assume that we’re in a world with two managers, which we’ll call ShallowMind and ClosedAI. Let’s say that there is some action each manager could take that increases that manager’s payoff but introduces a small probability of a disaster. ShallowMind might decide that the small disaster probability is worth it for the boosted payoff, and take the risk. Meanwhile, ClosedAI might think the same thing and also take the risk. Now we have two failure points – either ShallowMind or ClosedAI could generate a disaster – which may be too much risk for the original tradeoff to be worth it. But it may be that neither ShallowMind nor ClosedAI can back out, since whichever one does will face a higher disaster risk from the other’s actions without even getting the extra payoff! The astute reader may recognize this as a case of the classic prisoner’s dilemma. Below, I’ve calculated the expected payoffs for a hypothetical case where taking the risky action triples a manager’s payoff but introduces an (independent) 1/2 probability that everyone gets zero payoff:

| ShallowMind \ ClosedAI | Safe | Risky |
| --- | --- | --- |
| **Safe** | 1, 1 | 0.5, 1.5 |
| **Risky** | 1.5, 0.5 | 0.75, 0.75 |
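These entries follow directly from the stated assumptions; a quick sketch (the function name and parameter defaults are mine):

```python
def expected_payoffs(risky_a, risky_b, base=1.0, boost=3.0, disaster_prob=0.5):
    """Expected payoffs for two managers who each choose Safe or Risky.
    Going risky multiplies that manager's payoff by `boost` but adds an
    independent `disaster_prob` chance of wiping out everyone's payoff."""
    # Probability that nobody triggers a disaster
    p_safe = ((1 - disaster_prob if risky_a else 1.0)
              * (1 - disaster_prob if risky_b else 1.0))
    return (p_safe * base * (boost if risky_a else 1.0),
            p_safe * base * (boost if risky_b else 1.0))
```

For example, `expected_payoffs(True, True)` recovers the (0.75, 0.75) cell: both managers triple their payoff, but it survives only with probability 1/4.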

It’s best for both ShallowMind and ClosedAI to act safely, but in the Nash equilibrium, both act riskily.

Looking at a concrete model

As part of an ongoing project, I’ve analyzed a specific version of the competing-managers model, which we’ll now consider. Let’s assume the following form for the reward function:

$$\rho_i(x) = \begin{cases} \frac{p_i}{\sum_{j=1}^n p_j} & x > 0 \\ -d_i & \text{otherwise} \end{cases}$$

We can think of this as describing a scenario where, if there is no disaster, the managers’ payoffs are determined in a contest in which each manager’s chance of winning is equal to their share of the total performance, and if there is a disaster, each manager pays a fixed cost $d_i$.

A manager’s expected payoff is therefore

$$\mathbb E[\rho_i(x) \mid s, p] = \left( \prod_{j=1}^n \frac{s_j}{1 + s_j} \right) \frac{p_i}{\sum_{j=1}^n p_j} - \left( 1 - \prod_{j=1}^n \frac{s_j}{1+s_j} \right) d_i$$

In a Nash equilibrium, each manager’s choice of $s_i$ and $p_i$ maximizes their net payoff

$$\mathbb E[\rho_i(x) \mid s, p] - r(s_i + p_i),$$

taking everyone else’s choices of safety and performance as given.
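One way to approximate such an equilibrium numerically is best-response iteration: each manager repeatedly re-optimizes against the others’ current choices. A rough sketch using a grid search (all function names, grids, and defaults are mine; this is not the exact solver used for the plots):

```python
import itertools
import math

def payoff(i, s, p, r, d=0.0):
    """Manager i's net expected payoff in the contest model:
    q * p_i / sum(p) - (1 - q) * d - r * (s_i + p_i),
    where q is the probability that nobody causes a disaster."""
    q = math.prod(sj / (1 + sj) for sj in s)
    return q * p[i] / sum(p) - (1 - q) * d - r * (s[i] + p[i])

def best_response(i, s, p, r, grid):
    """Grid-search manager i's best (s_i, p_i) given everyone else's choices."""
    best, best_val = (s[i], p[i]), -math.inf
    for si, pi in itertools.product(grid, repeat=2):
        s_try, p_try = list(s), list(p)
        s_try[i], p_try[i] = si, pi
        val = payoff(i, s_try, p_try, r)
        if val > best_val:
            best, best_val = (si, pi), val
    return best

def solve_equilibrium(n=2, r=0.05, sweeps=20):
    """Iterate best responses until they (hopefully) settle at a Nash equilibrium."""
    grid = [0.1 * k for k in range(1, 81)]  # coarse grid over [0.1, 8.0]
    s, p = [1.0] * n, [1.0] * n
    for _ in range(sweeps):
        for i in range(n):
            s[i], p[i] = best_response(i, s, p, r, grid)
    return s, p
```

Best-response iteration isn’t guaranteed to converge in general, but it behaves well in symmetric examples like this one, where the identical managers end up with (approximately) identical strategies.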

Below, I’ve plotted some solutions for the case where $n = 2$ and $d = 0$.

*[Plots: equilibrium safety and performance as functions of the price $r$]*

The result here is not too surprising: safety and performance both decrease with the per-unit price $r$. The players have the same strategies, which makes sense given that they are identical.

How can we make this more interesting and informative? Well, we might think that the costs of safety and performance don’t necessarily scale linearly. Let’s say that instead of being able to purchase them at a fixed rate of $r$, they are produced according to the production functions

$$s = A X_s^\alpha, \quad p = B X_p^\beta,$$

where $X_s$ and $X_p$ are purchased at the per-unit price $r$. (We can think of $X_s$ and $X_p$ as the factors of production for $s$ and $p$, or as the effort spent on safety and performance, respectively.)
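The production side is just a pair of power functions; a minimal sketch (the function name is mine):

```python
def produce(x_s, x_p, A=1.0, B=1.0, alpha=1.0, beta=1.0):
    """Turn safety effort X_s and performance effort X_p into realized
    safety and performance: s = A * X_s**alpha, p = B * X_p**beta."""
    return A * x_s ** alpha, B * x_p ** beta
```

With `alpha = beta = A = B = 1` this reduces to the original linear model, where effort and output coincide.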

What are the meanings of all these parameters? We can think of $A$ and $B$ as base productivity parameters and $\alpha$ and $\beta$ as scaling parameters. With higher $A$ (or $B$), the marginal cost of safety (or performance) is lower at any given level of $s$ (or $p$); with $\alpha < 1$ (or $\beta < 1$), safety (or performance) has decreasing returns to scale, and with $\alpha > 1$ (or $\beta > 1$), safety (or performance) has increasing returns to scale.

Since the managers are now choosing $X_s$ and $X_p$ rather than buying $s$ and $p$ directly, we’ll also write the cost function as

$$c(X_s, X_p) = r(X_s + X_p),$$

so each manager’s objective is now to choose $X_{s,i}$ and $X_{p,i}$ that will maximize

$$\mathbb E[\rho_i(x) \mid X_s, X_p] - r(X_{s,i} + X_{p,i})$$

taking the other managers’ choices as given.

Changes in productivity

Below, I’ve plotted the solutions when we let $\alpha = \beta = 1$ for both players, across a range of values of $A$. Unsurprisingly, you can see that increasing $A$ increases safety at any given price. We can also see that increasing $A$ increases performance – the players take advantage of the lower cost of safety to spend more on performance.

*[Plots: equilibrium safety and performance vs. $r$ for several values of $A$]*

In the next set of graphs, I show solutions as $B$ varies. You can see that performance increases with $B$, but safety is unaffected.

*[Plots: equilibrium safety and performance vs. $r$ for several values of $B$]*

Why don’t we see here the same substitution effect we saw when we increased $A$, with both $s$ and $p$ increasing in response? Again, externalities turn out to be important. We discussed before that safety is a positive externality in this model; what about performance? Since we assume that the payoff to manager $i$ in the case of a safe outcome is $\frac{p_i}{\sum_{j=1}^n p_j}$, one manager increasing their performance reduces the payoff of the other manager(s) – performance is a negative externality. (As a side note, the particular payoff function we used here assumes that the managers are in a zero-sum competition over performance, since they are fighting over a fixed total payoff of 1, but the basic idea remains the same if we loosen that assumption.)

To see what this means for our two-player example, let’s consider what would happen if $B$ increased for both managers, and one manager decided to put some of the money saved on performance into increased safety. If the other manager doesn’t do that – using the new productivity just to increase performance – then they benefit just as much from the first manager’s increased safety (since safety is a public good) and are now able to out-compete the first manager in terms of performance. Spending on safety makes it easier for each manager’s competitor to beat them out in performance, so it’s not a good strategy. To reiterate an earlier point, regulators might be able to intervene helpfully here by enforcing contracts between managers in cases where everyone would prefer to be safe but competitive pressures prevent it.

Changes in returns to scale

Below, I show solutions for a few different values of $\alpha$. The effect might seem the same as when we changed $A$, but you can see that here the effect is more pronounced at lower values of $r$ (when $X_s$ is more affordable and thus higher).

*[Plots: equilibrium safety and performance vs. $r$ for several values of $\alpha$]*

Doing the same thing for $\beta$ gives us the next set of plots. We get results analogous to what we’ve gotten so far:

*[Plots: equilibrium safety and performance vs. $r$ for several values of $\beta$]*

Making safety’s cost dependent on performance

“Hold on!” I’m sure you’re saying (being a very studious and attentive reader), “Why have we assumed that safety and performance are produced independently? Wouldn’t they realistically interact so the cost of one depends on the other?”

Ah, good point, reader. It’s hard to say exactly how safety and performance interact, but one reasonable assumption is that safety becomes more expensive as performance increases – it should be easier to ensure that a simple AI system is safe than an advanced one. We can incorporate this idea into our production functions with another scaling parameter, $\theta$, like so:

$$s = A X_s^\alpha p^{-\theta}, \quad p = B X_p^\beta$$
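Since $p$ doesn’t depend on $s$, the coupled system can still be evaluated in one pass – compute $p$ first, then discount safety by $p^{-\theta}$. A small sketch (the function name is mine):

```python
def produce_coupled(x_s, x_p, A=1.0, B=1.0, alpha=1.0, beta=1.0, theta=1.0):
    """s = A * X_s**alpha * p**(-theta), p = B * X_p**beta.
    Performance is unaffected by safety effort, so p can be computed first."""
    p = B * x_p ** beta
    s = A * x_s ** alpha * p ** (-theta)
    return s, p
```

Setting `theta=0` recovers the independent production functions from before.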

This makes the tradeoff between safety and performance more salient: we can now think of performance as having a cost not just in terms of $X_p$ but also in terms of safety.

Below, I show solutions for a few different values of $\theta$. The important thing to note is that, for higher $\theta$, the managers are more willing to substitute safety for performance as $r$ increases. In fact, at least in this particular example, for $\theta > 1$ safety actually increases with an increase in $r$.

*[Plots: equilibrium safety and performance vs. $r$ for several values of $\theta$]*

In fact, the general rule is that whenever $\theta > \alpha/\beta$, safety increases with $r$. When $\theta < \alpha/\beta$, safety decreases with $r$, and when $\theta = \alpha/\beta$ (as at $\theta = 1$ in the example above), safety does not change with $r$. I am still working out the mathematical details of how this works, but the intuition is not too bad: we can write

$$s = \frac{A}{B^\theta} X_s^\alpha X_p^{-\theta \beta},$$

so if $\theta = \alpha/\beta$, then $s \propto (X_s / X_p)^\alpha$. This means that if we want to change performance but leave safety unchanged, we need only match any change in $X_p$ one-to-one (in proportional terms) with a change in $X_s$, so that the ratio $X_s / X_p$ remains constant. If $\theta < \alpha/\beta$, then $X_s$ needs to change more slowly than $X_p$ if we want safety to remain constant. (That is, an increase in performance with safety held fixed requires only a less-than-proportional increase in $X_s$.) On the other hand, if $\theta > \alpha/\beta$, then it takes relatively more change in $X_s$ to maintain a given level of safety.
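The boundary case is easy to check numerically: when $\theta = \alpha/\beta$, scaling $X_s$ and $X_p$ by the same factor leaves safety unchanged, while away from the boundary it does not. A small sketch (the parameter values $\alpha = 2$, $\beta = 1$ are chosen purely for illustration):

```python
def safety(x_s, x_p, A=1.0, B=1.0, alpha=2.0, beta=1.0, theta=2.0):
    """s = (A / B**theta) * X_s**alpha * X_p**(-theta * beta)."""
    return (A / B ** theta) * x_s ** alpha * x_p ** (-theta * beta)

# With theta = alpha / beta (here 2/1), safety depends only on the
# ratio X_s / X_p, so doubling both inputs leaves it unchanged.
assert abs(safety(2.0, 3.0) - safety(4.0, 6.0)) < 1e-12

# Away from the boundary (theta != alpha / beta), the same doubling
# changes safety, so the effort ratio alone no longer pins it down.
assert safety(4.0, 6.0, theta=1.0) != safety(2.0, 3.0, theta=1.0)
```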

A useful way to think about this is in terms of costs of scaling. As mentioned previously, $\alpha$ measures how well safety can scale, while $\beta$ measures how well performance can scale. $\theta$ determines how the scaling of performance impacts the scaling of safety – if $\alpha > \theta\beta$, then safety can “out-scale” performance (or, put in terms of cost, safety costs won’t out-scale performance costs); otherwise, increases in performance must come at the expense of reduced safety.

This has important implications for AI governance. If we’re safety-conscious and think that safety costs will out-scale performance costs (i.e., we suspect that $\alpha < \theta\beta$ in the long run), then we may be inclined to limit performance (e.g., by increasing factor prices, as measured by our $r$ parameter) until we can figure out other ways to incentivize safety. The model, unfortunately, doesn’t tell us which values of the scaling parameters are accurate, but it does at least give us a structured way to play around with various assumptions and get a feel for the scenarios we ought to watch out for.


In the next section, we’ll investigate what happens in this model when we play around with the disaster-cost ($d$) parameter.
