Friday, 8 March 2019

The Prisoner's Dilemma

The Prisoner's Dilemma has the following payoff matrix (payoffs shown as you / opponent, using the classic values):

                      Opponent cooperates    Opponent defects
  You cooperate             3 / 3                 0 / 5
  You defect                5 / 0                 1 / 1

Whatever move the opponent makes, you are better off defecting. However, if both players cooperate, they are both better off than if both players defect. The bonus from defecting when the other player cooperates is not so high that you could compensate them: the total payoff when both players cooperate is higher than the total payoff when one cooperates and the other defects.
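These conditions can be checked numerically. A quick sanity check, assuming the classic payoff values (these particular numbers are an illustration, using the standard convention T = temptation, R = reward, P = punishment, S = sucker's payoff):

```python
# Classic Prisoner's Dilemma payoffs (assumed standard values):
# T = temptation (you defect, opponent cooperates)
# R = reward (mutual cooperation)
# P = punishment (mutual defection)
# S = sucker's payoff (you cooperate, opponent defects)
T, R, P, S = 5, 3, 1, 0

# Whatever the opponent does, defecting pays more for you...
assert T > R and P > S
# ...but mutual cooperation beats mutual defection...
assert R > P
# ...and the total payoff under mutual cooperation exceeds the
# cooperate/defect split, so the defector cannot compensate the victim.
assert 2 * R > T + S
print("dilemma conditions hold")
```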

The highest scoring result for you occurs when you always defect and your opponent always cooperates. However this only happens in real life against exceptionally unsophisticated opponents, such as CooperateBot (who always cooperates). In real-life prisoner's dilemma tournaments, DefectBot (who always defects) performs worse than ChanceBot (cooperate with 50% probability).

If you are playing against a sophisticated opponent, the best outcome you can hope for is a string of consecutive cooperations. Defections will be punished in subsequent rounds, reducing both players' scores.

You should always defect on the last round (assuming the number of rounds is not random) because you cannot be punished for it. However, if you know your opponent is going to defect on the last round, then you should defect on the round before that, and so on. Some of the most successful algorithms always defect on the last two or even three rounds. This is clearly sub-optimal: if only there were a way for both players to cooperate on the last round...
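That reasoning can be expressed as a wrapper that plays some base strategy but defects on the last k rounds. A minimal sketch (the function names and the history-based interface are illustrative, not from any particular tournament implementation):

```python
def tit_for_tat(my_moves, opp_moves):
    # Cooperate first, then copy the opponent's previous move.
    return opp_moves[-1] if opp_moves else "C"

def end_game_defector(base_strategy, total_rounds, k=2):
    # Play base_strategy, but always defect on the final k rounds,
    # where defection cannot be punished (total_rounds must be known,
    # i.e. the number of rounds is not random).
    def strategy(my_moves, opp_moves):
        if len(my_moves) >= total_rounds - k:
            return "D"
        return base_strategy(my_moves, opp_moves)
    return strategy

bot = end_game_defector(tit_for_tat, total_rounds=10, k=2)
print(bot([], []))                # round 1: cooperate
print(bot(["C"] * 8, ["C"] * 8))  # round 9: defect (last two rounds)
```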

The classic prisoner's dilemma algorithm is "tit for tat", where you cooperate on the first move, then subsequently just replicate your opponent's last move. The problem with pure tit-for-tat is that it is possible to get stuck in a string of defections, which is bad for both parties. Therefore, tit-for-tat algorithms often have occasional forgiveness built in, where you cooperate twice in a row. It doesn't have to be at random, because your opponent cannot take advantage of it by continuing to defect -- they are better off getting back into a virtuous circle of cooperation. ("Tit for two tats", where you only punish two defections in a row, is too forgiving, and not successful. It allows your opponent to defect every other move with no penalty.) 
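A sketch of tit-for-tat with deterministic forgiveness (the every-fifth-round schedule is an arbitrary illustrative choice):

```python
def forgiving_tit_for_tat(my_moves, opp_moves, forgive_every=5):
    # Cooperate on the first move.
    if not opp_moves:
        return "C"
    # Deterministic forgiveness: every forgive_every rounds, cooperate
    # even after a defection, to break out of mutual-defection spirals.
    # It need not be random: the opponent gains nothing by exploiting
    # the schedule, since defections on other rounds are still punished.
    if opp_moves[-1] == "D" and len(opp_moves) % forgive_every == 0:
        return "C"
    # Otherwise, replicate the opponent's last move.
    return opp_moves[-1]
```

Note that forgiveness only fires after a defection; against a cooperating opponent this behaves exactly like plain tit-for-tat.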

Tit-for-tat is the foundation of many successful prisoner's dilemma strategies, because you must have some way of punishing defection, otherwise you get taken advantage of. It is not the only way to punish defection. Alternatively you can defect against someone who has defected too many times total in the past, or too many times as a percentage. The percentage strategy is better in a long tournament, because it allows more cooperation.
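The percentage variant might look like this (the 20% threshold is an arbitrary illustration):

```python
def percentage_punisher(my_moves, opp_moves, threshold=0.2):
    # Cooperate until the opponent's lifetime defection rate
    # exceeds the threshold; then defect.
    if not opp_moves:
        return "C"
    defection_rate = opp_moves.count("D") / len(opp_moves)
    return "D" if defection_rate > threshold else "C"
```

Because the rate is taken over the whole game, an early defection gets diluted as the game goes on, which is what allows more cooperation in a long tournament.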

It may seem that a good algorithm should have some way of taking advantage of unwise cooperation. An unwise cooperator is someone who too often responds to defection with cooperation. The idea is to defect early to see how the opponent reacts. Algorithms that do not identify unwise cooperators and take advantage of them are leaving money on the table.

In practice, however, this is not a good idea. One rarely encounters unwise cooperators in real life, because other algorithms will take advantage of them and eliminate them. The problem is that by probing with defections to try to identify unwise cooperators, you risk getting stuck in a cycle of mutual defection. One should not be the one to start this, therefore one should not probe. But you must be ready to defend yourself against probers. Probers will outcompete unwise cooperators, but those who do not probe will outcompete probers.

If you have the ability to simulate your opponent, this allows you to beat CooperateBot without the costs of probing. If you can simulate your opponent, the algorithm becomes "Simulate my opponent to find out whether defections by me will be punished. If so, cooperate. If not, defect." The probing gets carried out in the simulation, where it is costless (or relatively costless).

But against minimally sophisticated opponents, being able to simulate them doesn't gain you anything. Simple tit-for-tat is effective against someone who can simulate you, so you don't need to simulate them.

The inputs to simulation are the opponent's source-code, and the history of the game so far (or the number of rounds played so far, so you can recreate the history).
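Putting those pieces together, the simulation-based probe might be sketched like this, representing the opponent as a callable strategy (a stand-in for having its source code) plus the history of moves so far (names and interface are illustrative):

```python
def cooperate_bot(my_moves, opp_moves):
    return "C"

def tit_for_tat(my_moves, opp_moves):
    return opp_moves[-1] if opp_moves else "C"

def simulating_bot(opponent, my_moves, opp_moves):
    # Probe in simulation, where it is costless: if I defected this
    # round (while the opponent cooperated), what would the opponent
    # play next round? The opponent sees the histories from its own
    # perspective: its moves first, then mine.
    simulated_reply = opponent(opp_moves + ["C"], my_moves + ["D"])
    if simulated_reply == "D":
        return "C"   # defection would be punished: cooperate
    return "D"       # defection goes unpunished: exploit
```

This beats CooperateBot (it defects against it every round) while still cooperating with tit-for-tat, and no real-world defection was ever needed to find that out.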

Where simulation is really helpful is in the final round, or in the one-round prisoner's dilemma. In the one-round prisoner's dilemma, the final round is also the first and only round. You can't just automatically cooperate in the final round, because your opponent will simulate you and defect.

In the last round, you want to do whatever your opponent does. Ideally you would both cooperate rather than both defect. You want to say to your opponent "I will cooperate if you will." Simulation is effectively a form of communication. So you simulate your opponent, and do what he does.

If both players try to simulate each other, we need a way of terminating the infinite regression. Yudkowsky said "Simulate my opponent, and if it tries to simulate me, have its simulation of me output Cooperate, then do what it would do." This will achieve cooperation in the final round.

If you can achieve superrational cooperation with your opponent, you can do this in every round. Each round becomes independent and you always cooperate. The history becomes irrelevant.

"Cooperate" in Yudkowsky's algorithm is only the base case, to prevent infinite regression. The main strategy of Yudkowsky's algorithm is to do what the other player would do. It will cooperate against itself. (Only one of the players needs to have the terminating base case.)
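Under the simplifying assumption that each player is a function which receives its opponent's source code (here: the opponent function itself) as input, Yudkowsky's algorithm can be sketched as follows. Rather than intercepting the opponent's simulation of me mid-run, this toy version hands the opponent CooperateBot directly, which has the same effect:

```python
def cooperate_bot(opponent):
    return "C"

def defect_bot(opponent):
    return "D"

def yudkowsky_bot(opponent):
    # Simulate the opponent; where it would simulate me, it instead
    # sees CooperateBot (the base case that terminates the regress).
    # Then do whatever the opponent would do.
    return opponent(cooperate_bot)
```

Against itself this yields mutual cooperation (each side simulates the other simulating CooperateBot); against DefectBot it defects; and only one player needs the CooperateBot base case for the regress to terminate.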

In Hammond and Axelrod (2006), "The Evolution of Ethnocentrism", agents play a one-shot prisoner's dilemma, and agents of the same "color" make successful use of the strategy "cooperate with agents of the same color, defect against agents of different colors". DefectBots are suppressed by agents of different colors. This only applies to the one-shot game. In the iterated prisoner's dilemma, "ethnocentric" strategies would be dominated by the more sophisticated algorithms described above. There are benefits to cooperation with strangers as well as family, and "family" will not be so similar that they are guaranteed not to defect.
