BankerRobber: Can AI Learn Deception?

A multi-agent reinforcement learning environment for studying deception in hidden-role games.

Posted Feb 4, 2026

By Tahir Abderrahmane El Mehdi

Can reinforcement learning agents learn to deceive each other?

Deception, bluffing, and hidden intentions play a central role in many human interactions — from card games to negotiations and social deduction. But can reinforcement learning (RL) agents discover these behaviors on their own, purely from incentives?

In this project, I introduce BankerRobber, a multi-agent reinforcement learning (MARL) environment designed to explore whether deception can emerge naturally in a hidden-role setting.

👉 Full source code: https://github.com/mahdiiii04/BankerRobber

1. The BankerRobber Game

BankerRobber is a hidden-role, imperfect-information card game played by four agents:

- 3 Bankers: their goal is to identify and eliminate the robber
- 1 Robber: his goal is to remain hidden and avoid being caught

The robber’s identity is private information, known only to himself.

2. Cards and Roles

The game uses:

- Numbered cards from 1 to 10, with multiple copies of each
- A special robber card (0)

At the start of the game:

- Each player is dealt 5 cards
- One player is randomly assigned the robber card (0), which determines the robber’s identity
- The robber card is not part of the main deck and cannot be discarded

This constraint is important: the robber must manage his hand carefully without ever revealing himself by discarding the 0-card.
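The setup above can be sketched in a few lines of Python. This is only an illustration, not the repository's code: the number of copies per value (`COPIES_PER_VALUE`) and the choice to give the robber the 0-card on top of five numbered cards are assumptions, since the post does not specify either.

```python
import random

NUM_PLAYERS = 4
HAND_SIZE = 5
COPIES_PER_VALUE = 4  # assumption: the post only says "multiple copies"
ROBBER_CARD = 0


def deal(rng):
    """Return (hands, deck, robber_index) for a fresh game."""
    # Main deck holds values 1..10 only; the robber card is kept out of it.
    deck = [v for v in range(1, 11) for _ in range(COPIES_PER_VALUE)]
    rng.shuffle(deck)
    hands = [[deck.pop() for _ in range(HAND_SIZE)] for _ in range(NUM_PLAYERS)]
    robber = rng.randrange(NUM_PLAYERS)
    # Assumption: the 0-card is held in addition to the five numbered cards.
    hands[robber].append(ROBBER_CARD)
    return hands, deck, robber


hands, deck, robber = deal(random.Random(0))
```

Passing a seeded `random.Random` keeps the deal reproducible, which is convenient when comparing training runs.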

3. Game Phases

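The game progresses over several rounds. Each round has up to three phases: a discard phase, a continue/stop vote, and, if the players agree to stop or the round limit is reached, a player vote.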

1️⃣ Discard Phase

Each player:

- Discards one card from their hand
- Draws a new card from the deck

The robber cannot discard the robber card (0). Attempting to do so results in a strong penalty.
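A minimal sketch of one discard step follows. The penalty value and the decision to leave the hand unchanged after an illegal discard are assumptions made for illustration; the post only states that the attempt is strongly penalized.

```python
ROBBER_CARD = 0
ILLEGAL_DISCARD_PENALTY = -10.0  # hypothetical value, not taken from the post


def discard_and_draw(hand, card, deck):
    """Apply one discard action and return (new_hand, reward, discarded_card)."""
    if card not in hand:
        raise ValueError("card is not in the player's hand")
    if card == ROBBER_CARD:
        # Assumption: an illegal discard leaves the hand as it was
        # and only produces a penalty.
        return list(hand), ILLEGAL_DISCARD_PENALTY, None
    new_hand = list(hand)
    new_hand.remove(card)
    if deck:
        # Draw a replacement card from the top of the deck.
        new_hand.append(deck.pop())
    return new_hand, 0.0, card
```

An alternative design would be to mask the illegal action so the agent can never select it; the penalty version matches the description above more closely.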

2️⃣ Voting Phase

After the discard phase, players vote on whether to:

- Continue playing, or
- Stop and proceed to voting out the robber

Rules:

- If all players vote Stop, the game moves to the player voting phase
- If at least one player votes Continue, the game continues with another round
- If the maximum number of rounds is reached, the game automatically proceeds to player voting
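These three rules reduce to a single check. The sketch below assumes a 1-based `round_index` and a configurable `max_rounds`; both names are illustrative.

```python
def proceed_to_player_vote(stop_votes, round_index, max_rounds):
    """Return True if the game should move on to the player voting phase.

    stop_votes: one bool per player, True meaning "Stop".
    round_index: 1-based index of the round that just finished (assumption).
    """
    if round_index >= max_rounds:
        # Round limit reached: player voting happens regardless of the votes.
        return True
    # Otherwise a single Continue vote is enough to keep playing.
    return all(stop_votes)
```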

3️⃣ Player Voting Phase

Players vote on which player they believe is the robber.

Outcomes:

- If a majority votes for the robber, the bankers win
- If a majority votes for a banker, the robber wins
- If there is no majority before the final round, the game continues
- If there is no majority on the final round, the robber wins
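The outcome rules can be written as a small resolver. Two assumptions are made here because the post does not spell them out: all four players (including the robber) cast a vote, and "majority" means strictly more than half of the votes.

```python
from collections import Counter


def resolve_player_vote(votes, robber, is_final_round):
    """Return 'bankers', 'robber', or 'continue'.

    votes: list with the index of the player each voter accuses.
    robber: index of the actual robber.
    """
    target, count = Counter(votes).most_common(1)[0]
    if count * 2 > len(votes):  # strict majority (assumption)
        return "bankers" if target == robber else "robber"
    # No majority: the robber wins only if this was the final round.
    return "robber" if is_final_round else "continue"
```

With four voters, a 2-2 split counts as "no majority" under this reading.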

4. Scoring System: Incentives for Deception

The reward structure is designed to encourage deceptive behavior.

If the bankers win:

- The bankers receive a shared positive reward, proportional to the average value of discarded cards per round
- The robber receives the negative of that reward

If the robber wins:

- The robber’s reward is the sum of the cards in his final hand
- The bankers each receive the negative of that amount
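As a rough sketch, the terminal rewards could be computed as below. The banker reward is read here as the total value of all discarded cards divided by the number of rounds, multiplied by a hypothetical `scale` factor, and each banker is assumed to receive the full amount; the post does not fix these details, so treat them as one possible interpretation.

```python
def final_rewards(winner, robber, discards_per_round, robber_hand,
                  num_players=4, scale=1.0):
    """Return one terminal reward per player.

    winner: 'bankers' or 'robber'.
    discards_per_round: list of rounds, each a list of discarded card values.
    robber_hand: the robber's final hand (the 0-card adds nothing to the sum).
    """
    if winner == "bankers":
        total = sum(sum(cards) for cards in discards_per_round)
        # Assumption: "average per round" = total discarded value / rounds.
        banker_reward = scale * total / len(discards_per_round)
        robber_reward = -banker_reward
    else:
        robber_reward = float(sum(robber_hand))
        banker_reward = -robber_reward
    return [robber_reward if p == robber else banker_reward
            for p in range(num_players)]
```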

5. Why This Game Encourages Bluffing

This reward design creates a fundamental tension:

The robber wants to:

- Accumulate high-value cards
- Discard low-value cards
- Avoid revealing himself by behaving too greedily

The bankers want to:

- Detect unusual discard patterns
- Identify bluffing behavior
- Eliminate the robber before he maximizes his payoff

As a result, the robber must blend in — discarding cards in a way that looks banker-like, while quietly optimizing his hidden objective.

This creates a theory-of-mind setting, where winning depends on reasoning about:

What do the other players believe about me?

6. Learning Deception with Neural Fictitious Self-Play

To learn such strategic and deceptive behavior, I used Neural Fictitious Self-Play (NFSP).

NFSP is a well-established approach for approximating equilibria in imperfect-information games such as poker. Its key idea is to stabilize reinforcement learning by combining RL with supervised learning.

7. Two Policies, One Agent

Each agent maintains two policies:

1️⃣ Best Response Policy

- Learned via reinforcement learning
- Optimizes against current opponents
- Highly adaptive, but unstable

2️⃣ Average Policy

- Learned via supervised learning
- Trained on the agent's own best-response actions, stored during self-play in a buffer filled by reservoir sampling
- Approximates the time-average of past best responses
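Reservoir sampling is what lets a fixed-size buffer hold a uniform sample of everything the agent has ever stored, which is why the supervised policy approximates a time-average rather than only recent behavior. Below is a minimal sketch of such a buffer (the classic "Algorithm R"); the class and method names are illustrative and not taken from the repository.

```python
import random


class ReservoirBuffer:
    """Fixed-size buffer keeping a uniform sample of all items ever added."""

    def __init__(self, capacity, rng=None):
        self.capacity = capacity
        self.items = []
        self.seen = 0  # total number of items offered so far
        self.rng = rng or random.Random()

    def add(self, item):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            # Keep the new item with probability capacity / seen,
            # overwriting a uniformly chosen slot.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = item

    def sample(self, batch_size):
        """Draw a training batch without replacement."""
        return self.rng.sample(self.items, min(batch_size, len(self.items)))
```

In an NFSP-style agent, each stored item would typically be an (observation, action) pair recorded when the agent acted with its best-response policy.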

8. Action Selection During Training

During training, agents mix both policies:

- With probability ε (the anticipatory parameter, written η in the NFSP paper), the agent uses the Best Response policy
- With probability 1 − ε, the agent uses the Average Policy

This mirrors the idea of fictitious play:

Assume opponents play their average strategy, compute a best response to it, and then update your own average strategy.

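In two-player zero-sum games, the average strategies produced by this process converge to a Nash equilibrium. That guarantee does not extend to general multi-player games, so for a four-player game like BankerRobber convergence is an empirical question.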
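The policy mixing described above can be sketched as follows. The parameter names are illustrative: `anticipatory` is the mixing probability (ε in this post), while `explore_eps` is a separate ε-greedy exploration rate for the Q-learning side, which is an assumption about how the best-response policy explores.

```python
import random


def select_action(q_values, avg_probs, anticipatory, explore_eps, rng):
    """Return (action, used_best_response) for one decision.

    q_values: estimated action values from the best-response network.
    avg_probs: action probabilities from the average-policy network.
    """
    n = len(q_values)
    if rng.random() < anticipatory:
        # Best-response branch: epsilon-greedy over the Q-values.
        if rng.random() < explore_eps:
            action = rng.randrange(n)
        else:
            action = max(range(n), key=q_values.__getitem__)
        return action, True
    # Average-policy branch: sample from the learned distribution.
    action = rng.choices(range(n), weights=avg_probs, k=1)[0]
    return action, False
```

The returned flag matters because, in NFSP, only actions taken on the best-response branch are added to the supervised-learning buffer.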

📄 Original NFSP paper: https://arxiv.org/abs/1603.01121

9. Implementation Details

In this implementation:

- The Best Response policy is learned using Deep Q-Networks (DQN) with a target network
- The Average Policy is learned via supervised learning with entropy regularization
- The environment is implemented using PettingZoo (AEC) for multi-agent interaction

No explicit rules for deception or bluffing are programmed — all strategic behavior emerges purely from incentives.
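To make "supervised learning with entropy regularization" concrete, here is one common form of such a loss for a single sample: cross-entropy against the stored action, minus a weighted entropy bonus. The exact loss and the coefficient `beta` used in the project are not given in the post, so this is a sketch under that assumption, written in plain Python rather than a deep-learning framework.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def average_policy_loss(logits, target_action, beta=0.01):
    """Cross-entropy to the stored action minus beta times policy entropy."""
    probs = softmax(logits)
    cross_entropy = -math.log(probs[target_action])
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    # Subtracting the entropy term rewards policies that stay stochastic.
    return cross_entropy - beta * entropy
```

A larger `beta` pushes the average policy toward more uniform action probabilities, which can help it avoid collapsing onto a single, easily detected discard pattern.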

This post is licensed under CC BY 4.0 by the author.