BankerRobber: Can AI Learn Deception?

A multi-agent reinforcement learning environment for studying deception in hidden-role games.

Can reinforcement learning agents learn to deceive each other?

Deception, bluffing, and hidden intentions play a central role in many human interactions — from card games to negotiations and social deduction. But can reinforcement learning (RL) agents discover these behaviors on their own, purely from incentives?

In this project, I introduce BankerRobber, a multi-agent reinforcement learning (MARL) environment designed to explore whether deception can emerge naturally in a hidden-role setting.

👉 Full source code: https://github.com/mahdiiii04/BankerRobber


1. The BankerRobber Game

BankerRobber is a hidden-role, imperfect-information card game played by four agents:

  • 3 Bankers — their goal is to identify and eliminate the robber
  • 1 Robber — his goal is to remain hidden and avoid being caught

The robber’s identity is private information, known only to himself.


2. Cards and Roles

The game uses:

  • Numbered cards from 1 to 10, with multiple copies of each
  • A special robber card (0)

At the start of the game:

  • Each player is dealt 5 cards
  • One player is randomly assigned the robber card (0), which determines the robber’s identity
  • The robber card is not part of the main deck and cannot be discarded

(Figure: the cards used in the game)

This constraint is important: the robber must manage his hand carefully without ever revealing himself by discarding the 0-card.
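The setup above can be sketched in a few lines. Note two assumptions: the post doesn't say how many copies of each value the deck contains, and it doesn't say whether the robber card comes in addition to the five regular cards, so both are guesses here.

```python
import random

ROBBER_CARD = 0          # sentinel value for the robber card
COPIES_PER_VALUE = 4     # assumption: copy count isn't stated in the post
HAND_SIZE = 5
NUM_PLAYERS = 4

def deal(rng: random.Random):
    """Build the deck (values 1-10, several copies each), shuffle it,
    deal 5 cards to each of the 4 players, then hand the robber card
    to a randomly chosen player. The robber card never enters the deck."""
    deck = [v for v in range(1, 11) for _ in range(COPIES_PER_VALUE)]
    rng.shuffle(deck)
    hands = [[deck.pop() for _ in range(HAND_SIZE)] for _ in range(NUM_PLAYERS)]
    robber = rng.randrange(NUM_PLAYERS)
    hands[robber].append(ROBBER_CARD)   # private role marker, extra card (assumed)
    return deck, hands, robber
```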


3. Game Phases

The game progresses over several rounds. Each round consists of three phases.

1️⃣ Discard Phase

Each player:

  • Discards one card from their hand
  • Draws a new card from the deck

The robber cannot discard the robber card (0). Attempting to do so results in a strong penalty.
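The discard rule boils down to a small legality check. The penalty magnitude below is made up; the post only says the penalty is "strong".

```python
ROBBER_CARD = 0
ILLEGAL_DISCARD_PENALTY = -50.0   # assumption: the post only says "strong penalty"

def apply_discard(hand: list, card: int, deck: list) -> float:
    """Remove `card` from `hand` and draw a replacement from `deck`.
    Discarding the robber card is illegal: the hand is left untouched
    and a penalty reward is returned instead."""
    if card == ROBBER_CARD:
        return ILLEGAL_DISCARD_PENALTY   # robber tried to shed his role card
    hand.remove(card)
    if deck:                             # draw a replacement if any cards remain
        hand.append(deck.pop())
    return 0.0
```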


2️⃣ Voting Phase

After the discard phase, players vote on whether to:

  • Continue playing, or
  • Stop and proceed to voting out the robber

Rules:

  • If all players vote Stop, the game moves to the player voting phase
  • If at least one player votes Continue, the game continues with another round
  • If the maximum number of rounds is reached, the game automatically proceeds to player voting

3️⃣ Player Voting Phase

Players vote on which player they believe is the robber.

Outcomes:

  • If a majority votes for the robber, the bankers win
  • If a majority votes for a banker, the robber wins
  • If there is no majority before the final round, the game continues
  • If there is no majority on the final round, the robber wins
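The four outcomes above can be resolved with a simple tally. This is a sketch of how I read the rules, assuming "majority" means a strict majority of voters.

```python
from collections import Counter

def resolve_player_vote(votes: list, robber: int, is_final_round: bool) -> str:
    """Resolve the player-voting phase.

    votes: one accused player index per voter.
    Returns 'bankers', 'robber', or 'continue'."""
    majority = len(votes) // 2 + 1                 # strict majority of voters
    accused, count = Counter(votes).most_common(1)[0]
    if count >= majority:
        return "bankers" if accused == robber else "robber"
    # No majority: the game continues, unless this was the final round,
    # in which case the robber wins by default.
    return "robber" if is_final_round else "continue"
```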

4. Scoring System: Incentives for Deception

The reward structure is designed to encourage deceptive behavior.

If the bankers win:

  • The bankers receive a shared positive reward, proportional to the average value of discarded cards per round
  • The robber receives the negative of that reward

If the robber wins:

  • The robber’s reward is the sum of the cards in his final hand
  • The bankers each receive the negative of that amount
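The terminal rewards can be sketched as below. The proportionality constant for the bankers' reward is assumed to be 1, since the post only says "proportional".

```python
ROBBER_CARD = 0

def final_rewards(winner: str, discards: list, robber_hand: list,
                  num_rounds: int) -> tuple:
    """Compute terminal rewards as (banker_reward, robber_reward).

    discards: every card discarded over the whole game.
    The game is zero-sum between the two sides."""
    if winner == "bankers":
        # Shared banker reward: average value of discarded cards per round.
        avg_per_round = sum(discards) / max(num_rounds, 1)
        return avg_per_round, -avg_per_round
    # Robber wins: his payoff is the value of the cards he kept
    # (the robber card itself is worth 0 and is excluded).
    payoff = sum(c for c in robber_hand if c != ROBBER_CARD)
    return -payoff, payoff
```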

5. Why This Game Encourages Bluffing

This reward design creates a fundamental tension:

  • The robber wants to:

    • Accumulate high-value cards
    • Discard low-value cards
    • Avoid revealing himself by behaving too greedily
  • The bankers want to:

    • Detect unusual discard patterns
    • Identify bluffing behavior
    • Eliminate the robber before he maximizes his payoff

As a result, the robber must blend in — discarding cards in a way that looks banker-like, while quietly optimizing his hidden objective.

This creates a theory-of-mind setting, where winning depends on reasoning about:

What do the other players believe about me?


6. Learning Deception with Neuro-Fictitious Self-Play

To learn such strategic and deceptive behavior, I used Neuro-Fictitious Self-Play (NFSP).

NFSP is a well-established approach for solving multi-agent imperfect-information games, such as poker. Its key idea is to stabilize reinforcement learning by combining RL with supervised learning.


7. Two Policies, One Agent

Each agent maintains two policies:

1️⃣ Best Response Policy

  • Learned via reinforcement learning
  • Optimizes against current opponents
  • Highly adaptive, but unstable

2️⃣ Average Policy

  • Learned via supervised learning
  • Trained on actions stored during self-play using reservoir sampling
  • Approximates the time-average of past best responses
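The reservoir sampling mentioned above (Algorithm R) keeps a fixed-size, uniformly random sample of every (state, action) pair ever added, which is what lets the supervised policy approximate a time-average. A minimal sketch:

```python
import random

class ReservoirBuffer:
    """Fixed-size buffer holding a uniform random sample of all items
    ever added; used to train the average policy on past best-response
    actions."""
    def __init__(self, capacity: int, rng=None):
        self.capacity = capacity
        self.data = []
        self.n_seen = 0
        self.rng = rng or random.Random()

    def add(self, item) -> None:
        self.n_seen += 1
        if len(self.data) < self.capacity:
            self.data.append(item)
        else:
            # Replace a random slot with probability capacity / n_seen,
            # which keeps the stored sample uniform over all items seen.
            j = self.rng.randrange(self.n_seen)
            if j < self.capacity:
                self.data[j] = item
```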

8. Action Selection During Training

During training, agents mix both policies:

  • With probability ε (the anticipatory parameter, called η in the NFSP paper), the agent uses the Best Response policy
  • With probability 1 − ε, the agent uses the Average Policy

This mirrors the idea of fictitious play:

Assume opponents play their average strategy, compute a best response to it, and then update your own average strategy.

In two-player zero-sum games, this process is known to converge to a Nash equilibrium.
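The mixing rule is tiny in code. A sketch, with ε as the probability of playing the best response for the episode:

```python
import random

def select_policy(eps: float, rng: random.Random) -> str:
    """Pick which policy the agent follows this episode.

    eps: probability of acting with the best-response (RL) policy;
    with probability 1 - eps the agent follows its supervised
    average policy instead. In NFSP, episodes played with the best
    response also feed the average policy's training buffer."""
    return "best_response" if rng.random() < eps else "average"
```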

📄 Original NFSP paper: https://arxiv.org/abs/1603.01121


9. Implementation Details

In this implementation:

  • The Best Response policy is learned using Deep Q-Networks (DQN) with a target network
  • The Average Policy is learned via supervised learning with entropy regularization
  • The environment is implemented using PettingZoo (AEC) for multi-agent interaction

No explicit rules for deception or bluffing are programmed — all strategic behavior emerges purely from incentives.
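For the DQN target network mentioned above, the update amounts to copying the online weights into a lagged copy. A toy sketch with weights as flat lists (a real implementation operates on tensors); the soft-update option with tau < 1 is an alternative, not necessarily what this project uses:

```python
def update_target(online: list, target: list, tau: float = 1.0) -> list:
    """Blend online-network weights into the target network.
    tau=1.0 is a hard copy; tau<1 gives a Polyak soft update.
    The lagged target stabilizes the DQN bootstrap targets."""
    return [tau * o + (1.0 - tau) * t for o, t in zip(online, target)]
```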

This post is licensed under CC BY 4.0 by the author.