Teaching an LLM to Think Strategically: Fine-Tuning Qwen2.5-7B on Game Theory with SFT + GRPO + Real-World Formulation

How we built the first RLVR-ready game theory dataset, trained a 7B model to solve Nash equilibria with 94% accuracy, and created a formulation dataset bridging real-world scenarios to formal game theory — all on consumer GPUs.

Why Game Theory + LLMs?

Game theory sits at the intersection of mathematics, economics, and computer science. It's the formal language of strategic interaction — from pricing wars between corporations to spectrum auctions worth billions, from nuclear deterrence to mechanism design in decentralized protocols. It underpins everything from how Google sells ads to how countries negotiate climate agreements.

Yet despite its importance, LLMs remain surprisingly bad at it.

Ask GPT-4 or Claude to find the Nash equilibrium of a 3×4 payoff matrix and you'll get confident-sounding nonsense. Ask them to compute Shapley values for a cooperative game and they'll hallucinate numbers that don't sum to the coalition value. Request a Bayesian Nash Equilibrium with incomplete information and watch the reasoning derail at the probability calculations.

The problem isn't that these models lack knowledge of game theory concepts — it's that they can't reliably execute the computations. They can describe what a Nash equilibrium is but can't find one in a non-trivial game. They know the Shapley value formula but fumble the combinatorial summation.

We set out to fix this. Inspired by DeepSeek-R1's two-phase approach of supervised fine-tuning followed by reinforcement learning with verifiable rewards, we built a complete three-phase pipeline:

Phase 1 (SFT): Teach the model what correct solutions look like
Phase 2 (GRPO): Teach the model to reason its way to correct solutions
Phase 3 (Formulator): Bridge the gap from real-world scenarios to formal game theory

The result: a 7B model that solves game theory problems with 94% accuracy (up from 82% base), with a 27.7 percentage point improvement on hard problems, and a companion dataset teaching models to formulate real-world strategic interactions as formal games — all trained on consumer GPUs.

 Pipeline:  Dataset → SFT (Phase 1) → GRPO (Phase 2) → Formulator (Phase 3)
            2,913     QLoRA            Verifiable        1,215 real-world
            verified  fine-tuning      rewards           scenarios across
            problems  (~2 hrs)         (~8 hrs)          6 domains

 Results:   Base 82% → Solver 94% → Reasoner 94% + better reasoning
            Hard problems: 66.7% → 94.4%  |  Bayesian games: 0% → 100%

The Dataset: GameTheory-Bench v3.0

🤗 Alogotron/GameTheory-Bench

Before we could train a model, we needed data — and not just any data. For reinforcement learning with verifiable rewards (RLVR), every solution must be computationally verified. No hand-waved reasoning. No "the answer is left as an exercise." No solutions generated by another LLM and assumed correct.

Building the Generator Pipeline

We wrote custom Python generators for eight categories of game theory problems. Each generator creates randomized problem instances, solves them programmatically, and verifies the solutions using the nashpy library and custom analytical solvers:

Category	Count	Key Problem Types
2×2 Normal Form	500	Prisoner's Dilemma, Stag Hunt, pure/mixed NE
N×M Normal Form	660	3×3, 3×4, 4×4 matrices, iterated elimination
Zero-Sum Games	300	Minimax, saddle points, mixed strategies
Sequential Games	386	Backward induction, Stackelberg, entry deterrence
Auction Theory	300	Vickrey, first-price sealed-bid, revenue equivalence
Bayesian Games	267	2-type BNE, signaling games
Cooperative Games	300	Shapley values, core analysis, voting power indices
Mechanism Design	200	VCG mechanisms, incentive compatibility
Total	2,913

Every single problem in the dataset includes: - Step-by-step reasoning: Full solution derivation, not just the final answer - Computational verification: Solutions validated against nashpy and analytical solvers - Structured metadata: Difficulty level, game type, information structure, number of players, and semantic tags - RLVR-compatible answer format: Machine-parseable answers for reward function verification

The Verification Layer

This is what separates GameTheory-Bench from simply generating problems with an LLM. Every solution goes through a verification pipeline:

import nashpy as nash
import numpy as np

def verify_nash_equilibrium(payoff_A, payoff_B, strategy_profile, tol=1e-6):
    game = nash.Game(payoff_A, payoff_B)
    sigma_1, sigma_2 = strategy_profile

    # Check: is sigma_1 a best response to sigma_2?
    payoffs_1 = payoff_A @ sigma_2
    best_payoff_1 = np.max(payoffs_1)
    actual_payoff_1 = sigma_1 @ payoffs_1
    assert abs(actual_payoff_1 - best_payoff_1) < tol, "Player 1 can deviate!"

    # Check: is sigma_2 a best response to sigma_1?
    payoffs_2 = sigma_1 @ payoff_B
    best_payoff_2 = np.max(payoffs_2)
    actual_payoff_2 = payoffs_2 @ sigma_2
    assert abs(actual_payoff_2 - best_payoff_2) < tol, "Player 2 can deviate!"

    return True  # Valid NE

Early dataset versions revealed why this matters: we caught subtle errors like mixed strategies that didn't sum to 1.0 due to floating-point drift, Nash equilibria that were actually only epsilon-equilibria, and Shapley value allocations with rounding errors that didn't sum to the grand coalition value.

Why Game Theory is Ideal for RLVR

The key insight driving this project: game theory problems have verifiable answers. A Nash equilibrium either satisfies the best-response conditions or it doesn't. A Shapley value either sums correctly or it doesn't. A minimax value either matches the optimal strategy or it doesn't.

This makes game theory a natural domain for GRPO-style training where reward signals must be programmatic and deterministic, not learned from human preferences. To our knowledge, GameTheory-Bench is the first RLVR-ready game theory dataset on HuggingFace.

Phase 1: The Solver (Supervised Fine-Tuning)

🤗 Alogotron/GameTheory-Solver

Phase 1 is conceptually straightforward: teach the model how to solve game theory problems by showing it thousands of worked examples. The model learns the mapping from problem description to structured solution through standard next-token prediction.

Why Qwen2.5-7B-Instruct?

We chose Qwen2.5-7B-Instruct as our base for several reasons: - Strong mathematical reasoning baseline among open 7B models - Excellent instruction following out of the box - Good tokenization of mathematical notation - Active community and well-supported in the HuggingFace ecosystem - Small enough to train on consumer hardware with QLoRA

Training Configuration

We used QLoRA (Quantized Low-Rank Adaptation) to make training feasible on consumer hardware:

# QLoRA Configuration
base_model: Qwen/Qwen2.5-7B-Instruct
lora_r: 64
lora_alpha: 128
quantization: 4-bit NF4 (bitsandbytes)
target_modules: [q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]
trainable_params: 161M / 7.8B (2.1%)
dropout: 0.05

# Training Hyperparameters
epochs: 3
batch_size: 2 (per device)
gradient_accumulation_steps: 4   # effective batch size = 2 x 2 GPUs x 4 = 16
learning_rate: 2e-4
scheduler: cosine
warmup_ratio: 0.03
max_seq_length: 4096
bf16: true
gradient_checkpointing: true

A note on the LoRA rank: we went with r=64, which is higher than the commonly recommended r=16 or r=32. Our reasoning was that game theory requires the model to learn genuinely new computational procedures (support enumeration, backward induction algorithms, Shapley value combinatorics) rather than just adapting its style or tone. The higher rank gives the adapter more capacity to encode these procedures.

Phase 1 Results

Training completed in approximately 2 hours on 2× RTX 3090s:

Metric	Value
Training Loss	0.1613
Eval Loss	0.0873
Token Accuracy	96.1%
Held-out Problem Accuracy	~80%

Performance varied significantly across categories:

Strong (>85%): 2×2 Normal Form, Zero-Sum, Sequential Games — these have clean matrix representations and near-algorithmic solution procedures
Good (70-85%): N×M Normal Form, Auction Theory — larger matrices introduce more computational steps; auction theory requires careful expected-value reasoning
Weaker (<70%): Bayesian Games, Cooperative Games — multi-step probabilistic reasoning and combinatorial computation push the 7B model's limits

The performance gradient makes intuitive sense. Normal-form games essentially reduce to linear algebra that the model can pattern-match. Bayesian games require maintaining belief distributions across types while simultaneously optimizing strategies — a fundamentally harder reasoning chain. This is exactly why Phase 2 exists.

Phase 2: The Reasoner (GRPO with Verifiable Rewards)

🤗 Alogotron/GameTheory-Reasoner

Phase 1 teaches the model what correct solutions look like. Phase 2 teaches it to reason its way there — to prefer correct reasoning over plausible-sounding mistakes, and to explore solution strategies it never saw in the training data.

GRPO in a Nutshell

Group Relative Policy Optimization (GRPO), introduced in the DeepSeek-Math paper, generates multiple candidate responses per prompt, scores them with reward functions, and updates the policy to favor higher-scoring completions relative to the group. Unlike PPO, it doesn't require a separate value/critic model — rewards come directly from verifiable functions, and advantage estimation is done by comparing candidates within each group.

This is computationally efficient (no critic model to train) and especially powerful when you have deterministic reward signals — which we do.

The Three Reward Functions

Designing reward functions for game theory required careful thought. We needed to balance mathematical correctness (the obvious goal) with solution quality signals that encourage well-structured reasoning. We settled on a weighted trio:

# Reward weights
ACCURACY_WEIGHT = 0.70   # Is the answer mathematically correct?
FORMAT_WEIGHT   = 0.15   # Is the solution well-structured?
REASONING_WEIGHT = 0.15  # Does it use appropriate game theory reasoning?

1. Accuracy Reward (weight: 0.70)

The crown jewel. This function extracts the model's numerical answers and verifies them against ground truth — but with crucial domain awareness:

def accuracy_reward(completions, answers, categories):
    rewards = []
    for completion, answer, category in zip(completions, answers, categories):
        predicted = extract_answer(completion)
        if predicted is None:
            rewards.append(0.0)  # couldn't parse any answer
            continue

        if category in ["2x2_normal_form", "nxm_normal_form"]:
            # Nash equilibria: ORDER-INDEPENDENT comparison
            # [(0.3, 0.7), (0.5, 0.5)] == [(0.5, 0.5), (0.3, 0.7)]
            reward = compare_equilibria_sets(predicted, answer, tol=1e-4)
        elif category == "zero_sum":
            # Minimax values: numerical tolerance for mixed strategies
            reward = compare_minimax(predicted, answer, tol=1e-4)
        elif category == "cooperative":
            # Shapley values: verify individual allocations AND sum
            reward = compare_shapley(predicted, answer, tol=1e-4)
        elif category == "sequential":
            # Subgame perfect equilibria: strategy profile comparison
            reward = compare_sequential(predicted, answer)
        elif category == "auction":
            # Bidding strategies and expected revenues
            reward = compare_auction(predicted, answer, tol=1e-3)
        elif category == "bayesian":
            # BNE: type-conditional strategy comparison
            reward = compare_bayesian(predicted, answer, tol=1e-4)
        elif category == "mechanism_design":
            # VCG payments and allocation rules
            reward = compare_mechanism(predicted, answer, tol=1e-3)
        else:
            reward = float(predicted.strip() == answer.strip())
        rewards.append(reward)
    return rewards

The order-independent Nash equilibria comparison deserves emphasis. A naive exact-match reward would penalize the model for listing equilibria as [(0.5, 0.5), (0.3, 0.7)] instead of [(0.3, 0.7), (0.5, 0.5)]. These are mathematically identical answer sets — punishing the model for arbitrary ordering injects noise into the reward signal and slows learning. Every category has similar domain-specific equivalences that the reward function must handle.

2. Format Reward (weight: 0.15)

Encourages structured, step-by-step solutions that are both human-readable and machine-parseable:

def format_reward(completions):
    rewards = []
    for completion in completions:
        score = 0.0
        # Does it break down into clear steps?
        if has_step_by_step_structure(completion):      score += 0.4
        # Does it clearly mark the final answer?
        if has_final_answer_block(completion):           score += 0.3
        # Is reasoning separated from computation?
        if has_clear_separation(completion):             score += 0.3
        rewards.append(score)
    return rewards

3. Reasoning Reward (weight: 0.15)

Measures whether the model actually engages with game theory concepts rather than just pattern-matching to an answer:

GAME_THEORY_CONCEPTS = [
    "nash equilibrium", "dominant strategy", "best response",
    "minimax", "backward induction", "bayesian", "shapley value",
    "expected utility", "mixed strategy", "subgame perfect",
    "incentive compatible", "individual rationality",
    "support enumeration", "indifference condition",
    "deviation", "coalition", "mechanism", "allocation rule",
]

def reasoning_reward(completions):
    rewards = []
    for completion in completions:
        concepts_used = sum(
            1 for c in GAME_THEORY_CONCEPTS
            if c in completion.lower()
        )
        # Normalize: 5+ concepts = full score
        score = min(concepts_used / 5.0, 1.0)
        rewards.append(score)
    return rewards

GRPO Training Configuration

# GRPO Configuration
base_model: Alogotron/GameTheory-Solver   # Start from Phase 1 checkpoint
lora_r: 32                            # Smaller rank: refining, not retraining
lora_alpha: 64
learning_rate: 5e-6                   # 40x lower than SFT phase
kl_beta: 0.04                         # KL divergence penalty
num_generations: 4                    # Candidates per prompt
max_completion_length: 2048
max_prompt_length: 1024
epochs: 1
bf16: true

Note the deliberate design choices: smaller LoRA rank (32 vs 64 in Phase 1) and much lower learning rate (5e-6 vs 2e-4). The goal of GRPO isn't to teach new knowledge — it's to reshape the model's reasoning behavior using the reward signal. We want gentle refinement, not radical rewriting.

The KL penalty (beta=0.04) prevents the model from collapsing to degenerate strategies that game the reward function. Without it, models can find reward-hacking shortcuts like always outputting the most common answer format regardless of the problem.

Phase 2 Training Results

Training ran for 750 steps on a single RTX 3090 over approximately 8 hours. Here's what the reward curves looked like:

Metric	Range During Training	Interpretation
Accuracy Reward	0.85 → 1.0	Model quickly learned to produce verifiably correct answers
Total Reward (weighted sum)	2.36 → 2.55	Steady improvement across all three reward dimensions
KL Divergence	0.004 – 0.015	Model improved without catastrophic forgetting

The KL divergence numbers are particularly telling. Values in the 0.004–0.015 range mean the GRPO-trained model stayed very close to the SFT checkpoint in probability space while meaningfully improving its reasoning quality. This is the sweet spot — the model learned better reasoning strategies without losing the foundational knowledge from Phase 1.

The Three-Way Benchmark

To properly evaluate the GRPO phase, we ran a controlled 3-way comparison: the base Qwen2.5-7B-Instruct (no fine-tuning), the Phase 1 Solver (SFT only), and the Phase 2 Reasoner (SFT + GRPO). All models were evaluated on the same held-out test set:

Model	Overall Accuracy	Reasoning Quality	Hard Problem Accuracy	Bayesian Games
Base Qwen2.5-7B	82%	0.53	66.7%	0%
Solver (SFT)	94%	0.51	83.3%	~50%
Reasoner (GRPO)	94%	0.54	94.4%	100%

Let's unpack these numbers:

Overall accuracy (82% → 94%): SFT alone provided the massive jump from base model to near-expert performance. GRPO maintained this level — it didn't regress, which is important given that RL fine-tuning can sometimes destabilize models.

Reasoning quality (0.53 → 0.51 → 0.54): Here's where GRPO earns its keep. Notice that SFT actually decreased reasoning quality slightly (0.53 → 0.51) — the model learned to produce correct answers but sometimes took shortcuts in its reasoning chains. GRPO reversed this trend, pushing reasoning quality to 0.54, a 6% improvement over the SFT model. The model doesn't just get the right answer; it gets there through better reasoning.

Hard problems (66.7% → 94.4%): This is the headline result. On the most challenging problems in our test set — complex multi-step games requiring extended reasoning chains — the Reasoner achieved a 27.7 percentage point improvement over the base model. These are exactly the problems where reasoning quality matters most: you can't pattern-match your way to a Bayesian Nash Equilibrium in a 3-type signaling game.

Bayesian games (0% → 100%): The most dramatic category-level improvement. Base Qwen2.5-7B scored a flat zero on Bayesian games — it simply couldn't maintain the multi-step probabilistic reasoning required. After SFT + GRPO, the model handles them perfectly. This validates our hypothesis that Bayesian games were the category most likely to benefit from GRPO's reasoning refinement.

Phase 3: The Formulator Dataset

🤗 Alogotron/GameTheory-Formulator

Phases 1 and 2 teach a model to solve game theory problems given a formal mathematical specification. But in practice, the hardest part isn't solving the game — it's recognizing that a real-world situation is a game and formulating it correctly.

Consider a business analyst trying to model a competitive pricing decision, a policy researcher analyzing voting behavior, or a security team designing defense strategies. They don't start with a payoff matrix — they start with a messy real-world scenario that needs to be translated into the language of game theory.

Phase 3 addresses this gap with the GameTheory-Formulator dataset: 1,215 problems that teach models to bridge natural language scenarios to formal game-theoretic analysis.

Dataset Design

Each problem in the Formulator dataset follows a complete pipeline:

Natural Language Scenario → 8-Step Formulation Process → Formal Game Specification
    → Solution Computation → Real-World Interpretation

The 8-step formulation process is the key pedagogical element. Rather than jumping directly from scenario to solution, each problem walks through:

Identify the players and their roles in the strategic interaction
Define the strategy spaces available to each player
Specify the information structure (complete/incomplete, perfect/imperfect)
Construct the payoff functions mapping strategy profiles to outcomes
Determine the appropriate solution concept (Nash, Bayesian Nash, subgame perfect, etc.)
Formulate the mathematical representation (normal form, extensive form, etc.)
Solve the game using the appropriate analytical method
Interpret the solution back in real-world terms with actionable insights

This structured formulation process teaches models not just what the answer is, but how to think about translating real-world complexity into tractable mathematical models.

Six Real-World Domains

The dataset spans six carefully chosen domains to ensure breadth and practical relevance:

Domain	Problems	Example Subtypes
Business & Economics	220	Pricing competition, market entry, supply chain coordination, oligopoly dynamics
Politics & Voting	200	Electoral competition, coalition formation, legislative bargaining, voting power analysis
Technology & Networks	200	Platform competition, network effects, standards adoption, cybersecurity games
Social Interactions	200	Trust games, public goods provision, social norms, collective action problems
Auctions & Market Design	195	Spectrum auctions, procurement design, matching markets, revenue optimization
Security & Conflict	200	Deterrence modeling, resource allocation, inspection games, arms race dynamics
Total	1,215	33 unique game subtypes

What Makes This Dataset Different

Existing game theory datasets (including our own GameTheory-Bench) present problems in mathematical form: "Given the following payoff matrix, find the Nash equilibrium." The Formulator dataset starts from the other end:

"TechCorp and DataInc are both considering entering the cloud infrastructure market in Southeast Asia. TechCorp has existing data center infrastructure but weak local partnerships. DataInc has strong local relationships but would need to build infrastructure from scratch. Market research suggests the region can profitably support at most one major new entrant..."

→ Step 1: Players are TechCorp and DataInc... → Step 2: Each player chooses {Enter, Don't Enter}... → Step 3: Simultaneous decision with incomplete information about competitor costs... → ... → Step 7: The Bayesian Nash Equilibrium reveals mixed-strategy entry with probabilities... → Step 8: TechCorp should enter with probability 0.73, leveraging infrastructure advantage...

This formulation-first approach is what makes the dataset valuable for training models that can serve as genuine strategic analysis tools — not just equation solvers.

Results and Demo

🎮 Alogotron/GameTheory-Solver-Demo

The interactive Gradio Space lets you test the model yourself — paste in a payoff matrix or describe a game scenario and watch it reason through the solution step by step.

Here's what a typical successful generation looks like for a 3×3 normal-form game:

Step 1: Identify the payoff matrix and players. We have a 3×3 bimatrix game with players Row and Column...

Step 2: Check for strictly dominated strategies. Comparing Row's strategies: R1 yields (3, 1, 4), R2 yields (2, 5, 1), R3 yields (1, 2, 6). No strict dominance found.

Step 3: Check for Column dominance... C2 strictly dominates C1. Eliminate C1.

Step 4: In the reduced 3×2 game, check again... R3 now dominates R1. Eliminate R1.

Step 5: Solve the remaining 2×2 game using indifference conditions...

Final Answer: Nash Equilibrium: σ₁ = (0.0, 0.6, 0.4), σ₂ = (0.0, 0.7, 0.3)

Summary of All Three Phases

Phase	What It Does	Key Achievement	Training Cost
Phase 1: Solver (SFT)	Teaches correct solution patterns	82% → 94% overall accuracy	~2 hrs on 2× RTX 3090
Phase 2: Reasoner (GRPO)	Refines reasoning quality	+27.7% on hard problems, 0%→100% Bayesian	~8 hrs on 1× RTX 3090
Phase 3: Formulator (Dataset)	Bridges real-world to formal analysis	1,215 problems across 6 domains	Generation pipeline

Lessons Learned

What Worked

Computational verification is non-negotiable. Early dataset versions had subtle errors — mixed strategies that didn't sum to 1.0 due to floating-point drift, equilibria that were only approximate, Shapley values with rounding errors. The nashpy verification layer caught these before they could poison training. If you're building an RLVR dataset, invest heavily in your verification pipeline.
Step-by-step solutions in training data are essential. We experimented with answer-only training data early on. Models trained this way achieved significantly lower accuracy — they could sometimes produce correct final answers but couldn't recover when their initial approach failed. The reasoning chains in the SFT data give the model recoverable intermediate states.
QLoRA is remarkably effective at this scale. Training only 2.1% of parameters with 4-bit quantization, we achieved results that we believe are competitive with what full fine-tuning would yield on this scale of data. The key was using a high LoRA rank (r=64) — game theory requires learning genuinely new computational skills, not just stylistic adaptation.
Category-aware reward functions are critical. A naive exact-match reward would be incredibly noisy for game theory. Nash equilibria can be listed in any order. Mixed strategies have floating-point tolerance issues. Shapley values must sum correctly as an additional verification. Every domain-specific equivalence you encode in the reward function improves the GRPO signal-to-noise ratio.
GRPO's impact is surgical, not broad. The 3-way benchmark revealed something important: GRPO didn't improve overall accuracy (SFT already achieved 94%). Instead, it dramatically improved performance on the hardest problems — the ones requiring extended reasoning chains. If your SFT model already handles easy cases well, GRPO's value is in the tail of the difficulty distribution.
Low KL divergence is the canary in the coal mine. Our KL values (0.004–0.015) indicate the model improved through genuine reasoning refinement rather than distribution collapse. If you see KL divergence spiking during GRPO training, your model is likely reward-hacking rather than learning. This is a more reliable training health metric than reward curves alone.

What Didn't Work (or Was Harder Than Expected)

Bayesian games are genuinely hard for 7B models — until GRPO. The multi-step probabilistic reasoning pushed the reasoning chain length to where base 7B models completely failed (0% accuracy). SFT helped partially, but GRPO was the breakthrough, taking Bayesian game performance to 100%. This suggests GRPO is particularly valuable for problem categories requiring long reasoning chains with intermediate verification steps.
Sequence length matters more than you think. Some cooperative game solutions (especially Shapley value computations with 4+ players) produce very long reasoning chains. We had to set max_seq_length=4096 and still occasionally see truncation. For Phase 2, we deliberately set max_completion_length=2048 to balance quality with computational cost.
Reward function calibration required iteration. Our first attempt weighted format reward at 0.30. The result: beautifully structured wrong answers. The model learned to produce impressive-looking step-by-step solutions that arrived at incorrect conclusions. Reducing format weight to 0.15 and increasing accuracy to 0.70 fixed this — a useful reminder that in RLVR, correctness must always dominate style.
Real-world formulation is a different skill than mathematical solving. Building the Formulator dataset revealed that the gap between "solve this payoff matrix" and "model this business scenario as a game" is enormous. The 8-step formulation process was essential — without structured intermediate steps, models struggle to bridge the abstraction gap.

Practical Tips for Your Own RLVR Projects

Start with SFT. Don't jump straight to GRPO — the model needs a foundation of domain knowledge first. GRPO refines reasoning; it doesn't teach facts.
Verify everything programmatically. If your domain has verifiable answers, use them. Hand-labeled preferences are expensive, noisy, and don't scale.
Consumer GPUs are enough. A single RTX 3090 (24GB VRAM) handled GRPO training comfortably in 8 hours. You don't need an H100 cluster for meaningful research at the 7B scale.
Watch your reward weights. Format and style rewards are useful but can easily dominate if weighted too heavily. Mathematical correctness should always be king.
Domain-aware comparison functions are worth the investment. Every hour you spend encoding mathematical equivalences into your reward function saves days of noisy training.
Evaluate on the hard tail. Overall accuracy can mask GRPO's real impact. Always benchmark on your hardest problems separately — that's where reasoning refinement pays off.

What's Next

Phase 4: The Formulator Model

With the Formulator dataset complete, the next step is training a model on it — creating an end-to-end system that can take a natural language description of a strategic interaction, formulate it as a formal game, solve it, and interpret the results. This closes the loop from real-world scenarios to mathematical analysis to actionable insights.

Dataset Expansion

Future dataset versions will target: - Evolutionary games: Replicator dynamics, evolutionarily stable strategies - Repeated games: Folk theorem applications, trigger strategies, discount factor analysis - Larger coalition games: 5+ player cooperative games with complex characteristic functions - Combinatorial auctions: Multi-item bidding strategies and computational complexity - Additional real-world domains: Healthcare resource allocation, environmental policy, supply chain resilience

Community Invitation

All assets are open-source under permissive licenses. We encourage the community to: - Test the model on your own game theory problems via the Demo Space - Use GameTheory-Bench for your own RLVR experiments across different model families - Build on the Formulator dataset to train real-world game theory assistants - Contribute new problem categories, generators, or verification methods - Report failure cases — they're the most valuable data for improvement

Links & Resources

Resource	Link
📊 Dataset (Solver training)	Alogotron/GameTheory-Bench
📊 Dataset (Formulator)	Alogotron/GameTheory-Formulator
🧠 Model — Phase 1 Solver	Alogotron/GameTheory-Solver
🧠 Model — Phase 2 Reasoner	Alogotron/GameTheory-Reasoner
🎮 Interactive Demo	Alogotron/GameTheory-Solver-Demo

Built with Qwen2.5, HuggingFace TRL, bitsandbytes, and too much coffee. The complete pipeline — from dataset generation to SFT to GRPO to real-world formulation — runs on consumer GPUs and is fully open-source. If you found this useful, give the repos a ⭐ and let us know what game theory problems stump the model — those failures are our roadmap.