Factorio Learning Environment

Jack Hopkins, Mart Bakler, Akbir Khan
¹Independent, ²Anthropic
*Equal contribution
[Demo carousel: Mine 16 Iron Ore per minute · Smelt 16 Iron Plates per minute · Make 16 Iron Gears per minute · Extract 250 Petroleum Gas per minute · Refine 16 Sulfur per minute · Make 16 Plastic Bars per minute · Build the largest possible factory]

Claude Sonnet 3.5 builds factories

Abstract

Large Language Models (LLMs) are rapidly saturating existing benchmarks, necessitating new open-ended evaluations. We introduce the Factorio Learning Environment (FLE), built on the game of Factorio, which tests agents in long-term planning, program synthesis, and resource optimization.

FLE provides open-ended and exponentially scaling challenges - from basic automation to complex factories processing millions of resource units per second. We provide two settings:

  1. Lab-play consisting of 24 structured tasks with fixed resources.
  2. Open-play with the unbounded task of building the largest factory from scratch on a procedurally generated map.

We demonstrate across both settings that models still lack strong spatial reasoning. In lab-play, we find that LLMs exhibit promising short-horizon skills, yet are unable to operate effectively in constrained environments, reflecting limitations in error analysis. In open-play, while LLMs discover automation strategies that improve growth (e.g. electric-powered drilling), they fail to achieve complex automation (e.g. electronic-circuit manufacturing).

Introduction

Large Language Models (LLMs) have demonstrated remarkable capabilities at solving complex question-answer (QA) problems, saturating benchmarks in factual recollection, reasoning and code generation. Benchmark saturation presents a critical challenge for the AI research community: how do we meaningfully evaluate and differentiate increasingly capable models?


We introduce the Factorio Learning Environment (FLE): a novel framework built upon the game of Factorio that addresses this challenge by enabling unbounded agent evaluation. FLE provides the infrastructure, API, and metrics for assessing frontier LLM agents in code generation, spatial reasoning and long-term planning. In this environment, agents must navigate rapidly scaling challenges—from basic resource extraction producing ~30 units/minute to sophisticated production chains processing millions of units/second. This dramatic growth in complexity, driven by geometric increases in research costs and the combinatorial expansion of interdependent production chains, creates natural curricula for evaluating increasingly capable agents.


Within FLE, we define two complementary evaluation protocols: (1) lab-play with structured, goal-oriented tasks that have clear completion criteria, allowing targeted assessment of specific capabilities, and (2) open-play with no predetermined end-state, supporting truly unbounded evaluation of an agent's ability to autonomously set and achieve increasingly complex goals.

Environment


Agents in FLE aim to optimise factories programmatically. Left: Agents aim to create increasingly efficient factories, advancing through technological tiers to produce more resources per second. Middle: We provide a Python API to Factorio which enables direct interaction with the environment through code. Right: Agents submit programs to the game server and receive rich feedback, enabling them to refine their strategies through an iterative process of exploration and refinement.

Agents develop policies through an interactive feedback loop. Using 23 core API tools, agents compose programs that interact with the environment and observe the results through stdout and stderr streams. The Python namespace allows agents to store variables and define functions for later use, enabling increasingly sophisticated strategies as experience grows. This approach mirrors the way human programmers learn - through iteration, debugging, and refinement based on direct feedback. Agent programs yield both a Production Score (PS) representing the economic value of all items produced, and milestones that reflect technological advancements.
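The feedback loop above can be sketched with a toy stand-in for the game server. The `MiniEnv` class below is purely illustrative (it is not the FLE API): it executes agent-submitted programs in a persistent Python namespace and returns stdout/stderr feedback, alongside a price-weighted production score computed over assumed item values.

```python
import io
import traceback
from contextlib import redirect_stdout

# Illustrative item values only; real FLE prices differ.
ITEM_VALUE = {"iron-ore": 3.1, "iron-plate": 5.8, "iron-gear-wheel": 13.0}

def production_score(produced: dict) -> float:
    """Economic value of everything the factory has produced."""
    return sum(ITEM_VALUE.get(item, 0.0) * n for item, n in produced.items())

class MiniEnv:
    """Toy stand-in for the FLE server: runs agent programs in a
    persistent namespace and returns (stdout, stderr) feedback."""

    def __init__(self):
        self.namespace = {}  # variables and functions survive across steps

    def step(self, program: str) -> tuple[str, str]:
        out = io.StringIO()
        try:
            with redirect_stdout(out):
                exec(program, self.namespace)
            return out.getvalue(), ""
        except Exception:
            return out.getvalue(), traceback.format_exc(limit=1)

env = MiniEnv()
# Step 1 defines state; step 2 reuses it, because the namespace persists.
stdout1, _ = env.step("ore = 16\nprint('mined', ore)")
stdout2, _ = env.step("print('smelting', ore, 'plates')")
# A buggy program yields error feedback rather than crashing the loop.
_, err = env.step("assemble_gears()")
```

An agent policy would read `stdout1`, `stdout2`, and `err` to decide what program to submit next, mirroring the iterate-debug-refine cycle described above.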

Experiments

To systematically evaluate agent capabilities in the Factorio Learning Environment, we introduce two complementary experimental settings that test different aspects of planning, automation, and resource management; namely open-play and lab-play.


We evaluate six frontier language models across both settings: Claude 3.5-Sonnet, GPT-4o, GPT-4o-Mini, Deepseek-v3, Gemini-2-Flash, and Llama-3.3-70B-Instruct. Each model interacts with the environment through a consistent prompting approach, receiving the API schema, a guide describing common patterns, and memory of past actions and observations.


Open-Play

Agents begin in a procedurally generated world with the instruction to "build the largest possible factory". This setting tests agents' ability to set appropriate goals, balance short-term production against long-term research, and navigate the complex tech tree and game map without external guidance.


Agent capabilities are clearly differentiated by their production scores in open-play. Left: By plotting Production Score (PS) against steps on a log/log scale, we can observe distinct performance trajectories for each model. More capable models not only achieve higher scores but demonstrate steeper growth curves, indicating better long-term planning. Milestone annotations show when the median agent first created key entities, revealing how quickly each model progresses through the tech tree. Right: Final rewards reveal how weaker models struggle to advance when complex automation and logistics become necessary.

Production strategies reveal differences in agent planning and capabilities. We track how various models produce items with multiple antecedent ingredients in open-play, showing not just what they build but how they approach factory design. Claude 3.5-Sonnet demonstrates sophisticated strategy by immediately beginning complex crafting and investing in research and automation, ultimately unlocking electric-mining-drills around step 3k - a decision that boosts iron-plate production by 50% thereafter. In contrast, less advanced models like GPT-4o-Mini produce minimal quantities of multi-ingredient items, revealing limitations in planning horizons. Interestingly, Deepseek showed stronger capabilities in lab-play than open-play, suggesting that its general capabilities exceed its objective-setting abilities in open-ended environments.

Lab-Play

Agents are provided with resources and given a time limit to achieve an objective. We task agents with building production lines for 24 distinct target entities of increasing complexity, ranging from a single resource mine requiring at most two machines (producing iron-ore) to a late-game entity requiring the coordination of close to 100 machines (producing utility-science-pack). The target entities cover items from the early to the late game, requiring agents to use a wide variety of the machines present in Factorio (drills, furnaces, assembling machines, oil refineries, chemical plants). As task difficulty naturally increases with resource requirements, this provides a measure of the complexity agents are capable of creating in a limited number of steps. All tasks provide the agent with sufficient resources to complete them, with all technologies unlocked.
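A lab-play task of this kind reduces to a throughput check: did the factory produce the target entity at the required rate within the step budget? The sketch below is a hypothetical task definition for illustration, not the actual FLE task schema (the field names and the step limit are our assumptions).

```python
from dataclasses import dataclass

@dataclass
class LabTask:
    target: str       # entity to produce, e.g. "iron-gear-wheel"
    quota: int        # units required per 60 in-game seconds
    step_limit: int   # agent steps allowed

    def success(self, produced_per_minute: float, steps_used: int) -> bool:
        """Task passes if the throughput quota is met within the budget."""
        return produced_per_minute >= self.quota and steps_used <= self.step_limit

# Hypothetical instance matching the "Make 16 Iron Gears per minute" task.
task = LabTask(target="iron-gear-wheel", quota=16, step_limit=128)
```

Harder tasks in the suite raise `quota` indirectly by choosing targets whose recipes require longer chains of antecedent ingredients, which is what produces the difficulty gradient discussed below.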


Item production complexity creates a natural difficulty gradient for agent evaluation. Top: We measure task success rates across the first 8 complexity levels, revealing a clear decline as target entity crafting complexity increases. Even the most capable models struggle with coordinating more than six machines when producing items with three or more ingredients. Bottom: Production progress over time shows a pattern of initial rapid advancement followed by stagnation or regression. This reveals a key limitation in current agents' abilities: they often break existing functional structures when attempting to scale production or add new factory sections. The high variance in task progress across runs further demonstrates the challenge of consistent performance in complex automation tasks.

Plastic bar manufacturing is the most challenging task successfully completed in lab-play. The factory consists of a steam-powered electricity generator (top-left), a coal mine with a storage buffer (top), a crude-oil-to-petroleum-gas pipeline (bottom) and a chemical plant (bottom-right). The chemical plant creates plastic bars using the coal and petroleum gas as inputs. By themselves, the cumulative raw resources generate a production score of 224. With this specific layout, the factory creates 40 plastic bars per 60 in-game seconds, for a production score of 352. This factory was created by Claude 3.5-Sonnet.

Even the strongest model (Claude) only completed 7/24 tasks in lab-play, illustrating substantial room for improvement in this benchmark.

Key Insights

Our experiments revealed several key patterns that highlight both the capabilities and limitations of current AI agents when faced with open-ended industrial challenges:

1. Coding skill predicts performance

Models with stronger coding abilities (Claude 3.5-Sonnet, GPT-4o) achieved higher Production Scores and completed more lab tasks. Claude outperformed others with a PS of 293,206 and 28 milestones, progressing beyond early-game resource extraction.

2. Technology investment drives growth

Only Claude consistently invested resources in researching new technologies, despite their importance for long-term progression. After deploying electric mining drills at step 3k, Claude's PS grew by 50% (from 200k to 300k), demonstrating the value of strategic investment.

3. Planning is essential in open-play

In open-play, agents frequently pursue short-sighted objectives — like Gemini-2.0 manually crafting 300+ wooden chests over 100 steps — rather than investing in research or scaling existing production. This reveals a telling discrepancy: while Gemini-2 and Deepseek demonstrate early-game automation capabilities in structured lab-play, they rarely attempt to create cohesive factories during open-ended exploration, resulting in poorer overall performance.

4. Spatial reasoning is a major limitation

All models exhibited limitations in spatial planning when constructing multi-section factories. Common failures included placing entities too close together, not allocating space for connections, or incorrect inserter placement - issues that severely impacted performance in complex tasks requiring coordination of multiple production lines.

5. Error recovery poses a significant challenge

Models frequently become trapped in repetitive error patterns, attempting the same invalid operations repeatedly rather than exploring alternative solutions. For instance, GPT-4o called the same API method incorrectly for 78 consecutive steps despite receiving identical error messages each time.
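One mitigation for this failure mode (our illustration, not a component of FLE) is for the agent harness to detect when consecutive steps fail with an identical error and signal the agent to change strategy:

```python
from collections import deque

class RepeatedErrorGuard:
    """Flags when the last `patience` steps all failed with the same
    error message, signalling the agent to try a different approach."""

    def __init__(self, patience: int = 3):
        self.patience = patience
        self.recent = deque(maxlen=patience)  # rolling window of errors

    def observe(self, stderr: str) -> bool:
        """Record one step's error output; return True if the agent is stuck."""
        self.recent.append(stderr)
        return (
            len(self.recent) == self.patience
            and all(e and e == self.recent[0] for e in self.recent)
        )

guard = RepeatedErrorGuard(patience=3)
# Three identical failures in a row trip the guard on the third step.
results = [guard.observe("AttributeError: no method place_belt")
           for _ in range(3)]
```

A successful step (empty stderr) resets the streak, since `all(...)` rejects empty strings in the window.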

6. Programming styles vary significantly

Models exhibited distinct coding approaches: Claude favored a REPL style with extensive print statements (43.3% of code lines) but few assertions (2.0%), while GPT-4o used a defensive style with more validation checks (12.8% assertions) and fewer prints (10.3%).
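The print/assert percentages can be computed with a simple line classifier like the sketch below. This is our approximation of the methodology; the paper's exact counting rules may differ (e.g. handling of multi-line statements).

```python
def style_profile(code: str) -> dict:
    """Percentage of non-empty code lines that are prints vs assertions."""
    lines = [ln.strip() for ln in code.splitlines() if ln.strip()]
    prints = sum(ln.startswith("print(") for ln in lines)
    asserts = sum(ln.startswith("assert ") for ln in lines)
    n = len(lines) or 1  # avoid division by zero on empty programs
    return {"print_pct": 100 * prints / n, "assert_pct": 100 * asserts / n}

# A REPL-style snippet: 4 code lines, 2 prints, 1 assertion.
sample = """
inv = get_inventory()
print(inv)
assert inv is not None
print('done')
"""
profile = style_profile(sample)
```

Applied to full agent transcripts, this kind of profile distinguishes Claude's exploratory REPL style from GPT-4o's defensive, assertion-heavy style.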

Conclusion

Our results show that even state-of-the-art LLMs struggle with the coordination and optimization challenges inherent in automation tasks. The rapidly scaling complexity of Factorio's technology tree creates evaluation scenarios that will remain challenging even as progress in AI research continues, allowing meaningful differentiation between increasingly capable models.

We release the Factorio Learning Environment as an open-source platform, along with our evaluation protocols and baseline implementations, to encourage research on agent capabilities in complex, open-ended domains.

BibTeX

@article{hopkins2025factorio,
  title   = {Factorio Learning Environment},
  author  = {Jack Hopkins and Mart Bakler and Akbir Khan},
  year    = {2025},
  journal = {arXiv preprint arXiv:2503.09617}
}
With thanks to Jack Kleeman and Minqi Jiang for their invaluable help with setting up compute resources and advice during the inception of this project. Thanks to Wube and the Factorio team for developing such a stimulating game.