Deep CFR: When Regret Meets Neural Networks

A hands-on implementation of Deep Counterfactual Regret Minimization.

Posted Feb 1, 2026

Deep CFR

By Tahir Abderrahmane El Mehdi

5 min read

Deep CFR: When Regret Meets Neural Networks

Introduction

Have you ever wondered how computers learn to play poker?

Most modern poker-playing agents rely on Counterfactual Regret Minimization (CFR) — a foundational algorithm from game theory that is provably convergent to a Nash equilibrium in two-player zero-sum imperfect-information games, such as poker. CFR can also be extended to multiplayer settings.

However, classical CFR requires storing values for every infoset, which quickly becomes infeasible for large games. This is where Deep CFR comes in.

In this post, I walk through a from-scratch implementation of Deep CFR, based on the paper:

Brown & Sandholm (2019) — Deep Counterfactual Regret Minimization https://arxiv.org/abs/1811.00164

The full source code is available here: 👉 https://github.com/mahdiiii04/DeepCFR

Before diving into Deep CFR itself, we start with the experimental setup: Kuhn Poker.

1. Experimental Setup: Kuhn Poker

Kuhn Poker is one of the simplest poker games, making it ideal for studying imperfect-information algorithms.

2 players
3 cards: King (2), Queen (1), Jack (0)
Each player antes 1 chip
One betting round

Game Flow

Cards are dealt uniformly at random.
Player 0 can check (c) or bet (b).
Depending on the history, players may call (c), bet (b), or fold (f).
If no one folds, the higher card wins the pot.

Below is the full game tree (first-level infosets highlighted for clarity):

2. Implementing the Game Tree (`kuhn_state.py`)

The KuhnState class represents a node in the game tree.

  
class KuhnState:
    def __init__(self, cards, history="", bets=(1, 1), player=0):
        self.cards = cards
        self.history = history
        self.bets = list(bets)
        self.player = player

cards: [player0_card, player1_card]
history: string of past actions (e.g. "cb")
bets: current contribution of each player
player: whose turn it is (0 or 1)

Terminal States

  
def is_terminal(self):
    return self.history in ["cc", "bc", "cbc", "bf", "cbf"]

A state is terminal if the betting sequence has concluded.

Payoff Function

  
def payoff(self):
    pot = sum(self.bets)
    if self.history.endswith("f"):
        winner = 1 - self.player
    else:
        winner = 0 if self.cards[0] > self.cards[1] else 1

    return pot - self.bets[winner] if winner == 0 else - (pot - self.bets[1])

This function returns the payoff from Player 0’s perspective:

Positive → Player 0 gains
Negative → Player 0 loses

Since Kuhn Poker is zero-sum, Player 1’s payoff is simply the negative of this value.

Legal Actions

  
def legal_actions(self):
    if self.history == "":
        return ["c", "b"]
    if self.history == "c":
        return ["c", "b"]
    if self.history in ["b", "cb"]:
        return ["f", "c"]
    return []

State Transition

  
def next_state(self, action):
    new = copy.deepcopy(self)
    new.history += action

    if action in ["b", "c"] and self.history in ["b", "cb"]:
        new.bets[self.player] += 1
    elif action == "b":
        new.bets[self.player] += 1

    new.player = 1 - self.player
    return new

Infoset Representation

  
def infoset(self):
    return self.cards[self.player], self.history

An infoset is defined by:

The player’s private card
The public betting history

3. Counterfactual Regret Minimization (CFR)

At the core of CFR is a simple question:

“What would have happened if I had played a different action at this infoset?”

Cumulative Regret

\[R_i^T(I, a) = \sum_{t=1}^{T} r_i^t(I, a)\]

Instantaneous Regret

\[r_i^t(I, a) = v_i(\sigma^t_{I \rightarrow a}, I) - v_i(\sigma^t, I)\]

Where:

$v_i(\sigma^t_{I \rightarrow a}, I)$ is the counterfactual value if action $a$ is forced
$v_i(\sigma^t, I)$ is the value under the current strategy

Regret Matching

Regrets are converted into a strategy via Regret Matching:

\[\sigma(I, a) = \begin{cases} \dfrac{R_i^+(I, a)}{\sum_{a'} R_i^+(I, a')} & \text{if } \sum_{a'} R_i^+(I, a') > 0 \\ \dfrac{1}{|A(I)|} & \text{otherwise} \end{cases}\]

with:

\[R_i^+(I, a) = \max\{ R_i(I, a), 0 \}\]

Average Strategy

The strategy that converges to Nash equilibrium is the average strategy:

\[\hat{\sigma}_i(I, a) = \frac{\sum_{t=1}^{T} \pi^t(I)\,\sigma_i^t(I, a)} {\sum_{t=1}^{T} \pi^t(I)}\]

4. Deep CFR

Deep CFR replaces tabular storage with neural networks.

Networks

  
self.regret_net = RegretNet()
self.policy_net = PolicyNet()

Regret Network approximates cumulative regrets: $R_\theta(I) \approx R_i^T(I)$
Policy Network approximates the average strategy: $P_\theta(I) \approx \hat{\sigma}_i(I)$

Regret Matching via Network

  
def regret_matching(self, x, legal_actions):
    with torch.no_grad():
        regrets = self.regret_net(x)[:len(legal_actions)]
        regrets = torch.clamp(regrets, min=0)
        if regrets.sum() > 0:
            return regrets / regrets.sum()
        return torch.ones(len(legal_actions)) / len(legal_actions)

Tree Traversal

  
def traverse(self, state):
    if state.is_terminal():
        return state.payoff()

    card, history = state.infoset()
    x = encode_infoset(card, history)

    actions = state.legal_actions()
    strategy = self.regret_matching(x, actions)

    utils = torch.zeros(len(actions))
    for i, a in enumerate(actions):
        utils[i] = self.traverse(state.next_state(a))

    node_util = (strategy * utils).sum()
    regrets = utils - node_util

    self.regret_buffer.append((x, regrets.detach()))
    self.policy_buffer.append((x, strategy.detach()))

    return node_util

This traversal:

Computes counterfactual utilities
Stores training targets for both networks
Propagates expected values upward

Training Loop

  
def train(self, iterations=2000):
    for _ in range(iterations):
        cards = random.sample([0, 1, 2], 2)
        self.traverse(KuhnState(cards))

        self.train_net(self.regret_net, self.regret_buffer, self.r_opt)
        self.train_net(self.policy_net, self.policy_buffer, self.p_opt)

        self.regret_buffer.clear()
        self.policy_buffer.clear()

Each iteration:

Samples a random deal
Traverses the game tree once
Trains both networks
Clears buffers

5. Results

After 20,000 training iterations, the learned strategy achieves:

Winning Rates

After 10000 games:
Player 0 wins: 4962 (0.496)
Player 1 wins: 5038 (0.504)

This is exactly what we expect at a Nash equilibrium in a symmetric zero-sum game. The small difference is due to stochastic sampling.

Conclusion

Although Kuhn Poker is simple, it captures the essence of imperfect information. Deep CFR scales this idea to massive games like No-Limit Texas Hold’em, where it has achieved superhuman performance.

This project demonstrates how game theory, recursion, and deep learning come together in modern AI systems.

Implementations, MARL-GT

This post is licensed under CC BY 4.0 by the author.