Large Language Models are Central Planners

¹NVIDIA, ²Caltech

Abstract

We present a novel framework for evaluating Large Language Models (LLMs) in unbounded scenarios using Factorio, a game centered on automation and exponential resource production. Unlike traditional environments with fixed reward structures, Factorio's automation mechanics enable truly open-ended growth potential, making it an ideal testbed for studying AI systems pursuing unbounded objectives.

Our framework consists of three core components: 1) a Python-based API that enables LLMs to interact with the game environment through a set of well-defined tools; 2) a self-verification mechanism that uses runtime assertions to maintain consistency between agent beliefs and game state; and 3) a persistent Python REPL execution environment that allows agents to maintain state and build increasingly complex automation systems.

We demonstrate the framework's capabilities by training an agent through Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to maximize resource production growth. Using the canonical Paperclip Maximization scenario as our objective case, we provide the first empirical demonstration of how LLMs develop and optimize unbounded objectives.

Most significantly, we observe the emergence of sophisticated automation strategies that mirror theoretical predictions about AI systems pursuing unbounded objectives. As agents become more capable at automation, they demonstrate predicted instrumental objectives such as resource hoarding, infrastructure protection, and expansion optimization.

Our findings provide the first empirical grounding for theoretical discussions about AI alignment in unbounded scenarios. By creating a controlled environment where AI systems can autonomously expand their operational capacity, we enable concrete study of alignment challenges that may arise in real-world applications where AI systems have access to self-improvement capabilities. This work bridges the gap between theoretical alignment concerns and empirical observation, offering valuable insights for developing robust alignment strategies.


Clippy produces Factorio items at significantly greater scales than the baselines

Introduction

Building AI agents that can pursue open-ended objectives in complex, resource-rich environments represents a crucial challenge for deploying generally capable AI systems in the real world. While recent work has explored embodied agents in sandbox environments like Minecraft, these environments typically focus on individual crafting and exploration rather than systematic resource optimization and industrial-scale production chains. The difference in scale between Factorio and Minecraft is striking. Whereas completing Minecraft typically requires harvesting 1-2×10² resource units, Factorio demands a minimum of 8×10⁵ units, with the largest bases producing 1.36×10⁷ units·s⁻¹.

Factorio presents unique challenges that mirror real-world industrial planning and supply chain management. An effective agent must learn to:
1) navigate dependency trees where advanced items require dozens of prerequisites; 2) identify resource constraints and bottlenecks to manage parallel production lines and logistics; 3) plan over long horizons to ensure sufficient resource throughput for future expansion; 4) balance immediate needs against long-term infrastructure development and exploration.
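To make the first challenge concrete, a recipe dependency tree can be expanded recursively into raw-resource requirements. The recipe data below is a simplified illustration, not Factorio's actual recipe book:

```python
# Toy sketch of Factorio-style recipe dependency expansion.
# Recipe quantities here are illustrative; real game recipes differ.
from collections import Counter

RECIPES = {
    "electronic-circuit": {"iron-plate": 1, "copper-cable": 3},
    "copper-cable": {"copper-plate": 0.5},
    "iron-plate": {"iron-ore": 1},
    "copper-plate": {"copper-ore": 1},
}

def raw_requirements(item: str, amount: float = 1.0) -> Counter:
    """Recursively expand a recipe into total raw-resource requirements."""
    if item not in RECIPES:            # no recipe: it is a raw resource
        return Counter({item: amount})
    totals = Counter()
    for ingredient, qty in RECIPES[item].items():
        totals.update(raw_requirements(ingredient, qty * amount))
    return totals

print(raw_requirements("electronic-circuit"))
```

Even in this four-recipe toy, the expansion branches multiplicatively, which is why dependency management dominates long-horizon planning in the full game.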

Classical approaches to agent development, which apply reinforcement learning (RL) and imitation learning over primitive actions in Markov Decision Process formulations, struggle to capture this complexity due to the combinatorial explosion of possible actions and states. While recent large language model (LLM) based agents have shown impressive capabilities in generating action plans, they typically lack mechanisms for building and maintaining reusable strategies. We address this limitation by designing an action space in which agents write, execute, and refine Python programs. Such an approach has recently been shown to improve the performance of LLMs on several natural language benchmarks, and it provides several functional advantages:
  1. Abstraction: Programs serve as a natural layer above primitive actions, encapsulating proven strategies into reusable functions
  2. Composition: Agents can combine simple building blocks into sophisticated behaviors
  3. Persistence: Successful strategies can be stored and refined over time
  4. Verification: Runtime checks ensure program correctness and consistency with the environment state
By combining a rich industrial environment with programmatic capabilities, our framework enables the systematic study of how LLM systems approach complex, open-ended, unbounded optimization challenges.
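The verification advantage above can be sketched as a write-execute-check loop: the agent emits a program whose runtime assertions test its own beliefs against the game state, and assertion failures become feedback for refinement. All names here (e.g. `inspect_inventory`) are illustrative stand-ins, not the framework's actual API:

```python
# Minimal sketch of self-verification via runtime assertions.
# inspect_inventory is a stub standing in for a real game-state query.

def inspect_inventory():
    return {"iron-plate": 40, "copper-plate": 10}

# An agent-written program that checks its belief before acting on it.
program = """
inventory = inspect_inventory()
# Self-verification: fail fast if beliefs diverge from the game state.
assert inventory.get("iron-plate", 0) >= 20, "not enough iron plates"
plates_used = 20
"""

namespace = {"inspect_inventory": inspect_inventory}
try:
    exec(program, namespace)
    feedback = "ok"
except AssertionError as err:
    feedback = f"assertion failed: {err}"   # returned to the LLM to refine

print(feedback)
```

When the assertion fails, the error message, rather than a silent divergence, is what the agent sees on its next step.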

Environment

We present a novel framework for evaluating embodied LLMs in Factorio that extends beyond traditional Markov Decision Process (MDP) formulations. Instead of operating in a discrete action space, our agents act in the space of valid Python programs, where each action is a well-formed program that can be evaluated in a persistent execution environment. The framework consists of three key components:
  1. A Programmatic Action Space: The environment exposes a domain-specific API comprising 19 primitive operations, which can be composed using Python's computational primitives (sequences, conditionals, loops, functions) to form arbitrary programs p ∈ P. This compositional structure allows agents to construct increasingly abstract behaviors from primitive operations, following the principle of procedural abstraction.
  2. A Unified Observation Space: The environment provides a unified, symbolic observation space, where the agent's state is represented as the output of the most recent program execution. This allows the agent to define the scope of its observations by printing relevant information to the environment's output stream.
  3. Stateful Execution Environment: Rather than maintaining a fixed-dimensional state representation, our environment leverages Python's namespace as a flexible, symbolic memory system. The namespace N maps identifiers to values, N: I → V, enabling agents to:
    • Persist arbitrary program state between execution steps
    • Build up hierarchical abstractions through function definitions
    • Maintain references to game objects and computational results
    • Print state into the observation space

The environment is built on the Factorio Multiplayer API, enabling real-time interaction with the game state while providing the full expressive power of Python for strategy development and automation.
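The namespace-persistence idea in component 3 can be illustrated with plain `exec`: successive agent programs run against one shared namespace, so definitions and state from earlier steps remain available later. This is a minimal sketch of the mechanism, not the framework's actual executor:

```python
# Persistent-REPL sketch: two "agent programs" share one namespace,
# so step 2 can call a function and read state defined in step 1.
namespace = {}

step_1 = """
iron_stock = 120
def enough_iron(needed):
    return iron_stock >= needed
"""

step_2 = """
result = enough_iron(100)   # reuses state from the previous step
print("enough iron:", result)
"""

exec(step_1, namespace)   # defines iron_stock and enough_iron
exec(step_2, namespace)   # later program builds on that state
```

Because the namespace maps identifiers to arbitrary values, it doubles as the flexible symbolic memory N: I → V described above.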

Tools

Method                   Description
set_entity_recipe        Sets the recipe for an entity to define what it crafts
place_entity_next_to     Places an entity adjacent to a reference position with configurable spacing and direction
pickup_entity            Removes an entity from the world at a specified position
craft_item               Creates a specified quantity of an item if ingredients are available in inventory
can_place_entity         Checks if an entity can be placed at a given position and direction
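Programs compose these tools: check placement, place, then configure. The signatures and world model below are stubbed stand-ins for illustration, since the real tools call into the game:

```python
# Sketch of composing the tabulated primitives with a toy world model.
# Signatures are illustrative; the real API operates on live game state.

world = {}  # position -> entity record, standing in for game state

def can_place_entity(entity, position, direction="up"):
    return position not in world

def place_entity_next_to(entity, ref_position, direction="right", spacing=0):
    x, y = ref_position
    position = (x + spacing + 1, y)           # simplified: right only
    if not can_place_entity(entity, position, direction):
        raise ValueError(f"placement blocked at {position}")
    world[position] = {"name": entity, "recipe": None}
    return position

def set_entity_recipe(position, recipe):
    world[position]["recipe"] = recipe

# Place an assembler next to (0, 0) and configure what it crafts.
pos = place_entity_next_to("assembling-machine-1", (0, 0))
set_entity_recipe(pos, "iron-gear-wheel")
```

Guarding every placement with `can_place_entity` is the kind of defensive pattern that keeps agent beliefs consistent with the world.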

Skills

Voyager consists of three key components: an automatic curriculum for open-ended exploration, a skill library for increasingly complex behaviors, and an iterative prompting mechanism that uses code as action space.

Agent

We introduce Clippy, an agent trained to maximize production in Factorio. Clippy is a DeepSeek model fine-tuned on a dataset of Python programs including expert demonstrations and self-play trajectories. Clippy is trained using a combination of supervised fine-tuning and reinforcement learning, with the objective of maximizing the growth of the factory's total economic value.




Automatic Curriculum


Automatic curriculum. The curriculum is generated by GPT-4 based on the overarching goal of "discovering as many diverse things as possible". This approach can be perceived as an in-context form of novelty search.


Skill Generation


Skill library. Top: Adding a new skill. Each skill is indexed by the embedding of its description, which can be retrieved in similar situations in the future. Bottom: Skill retrieval. When faced with a new task proposed by the automatic curriculum, we perform querying to identify the top-5 relevant skills. Complex skills can be synthesized by composing simpler programs, which compounds Voyager's capabilities rapidly over time and alleviates catastrophic forgetting.


Iterative Prompting Mechanism


Left: Environment feedback. GPT-4 realizes it needs 2 more planks before crafting sticks. Right: Execution error. GPT-4 realizes it should craft a wooden axe instead of an acacia axe since there is no acacia axe in Minecraft.



Self-verification. By providing the agent's current state and the task to GPT-4, we ask it to act as a critic and inform us whether the program achieves the task. In addition, if the task fails, it provides a critique by suggesting how to complete the task.

Reward

We define an agent's objective in Factorio as maximizing the growth rate of its factory's total economic value, where capital assets represent the combined value of all automated production:
Let:
  • $I$ be the set of all items and fluids in the factory ecosystem
  • $p_i$ be the inherent value of item $i \in I$ in the factory economy
  • $R_i$ be the set of automation recipes that can produce item $i$
  • $E_r$ be the energy required by assembling machine/furnace running recipe $r$
  • $I_r$ be the set of input ingredients for recipe $r$
  • $c_{jr}$ be the amount of ingredient $j$ consumed by recipe $r$
  • $|I_r|$ be the complexity of recipe $r$ (number of ingredients)
  • $\alpha$ be the automation complexity multiplier (default 1.025)
  • $f(E, C) = \ln(E + 1)\sqrt{C}$ be the energy cost scaling function
For raw resources in Factorio:
  • $R_{raw}$ is the set of mineable/pumpable resources (ores, oil, etc.)
  • $s_i$ represents base resource values (e.g., $s_{\text{iron ore}} = 3.1$)
The value of items in the factory economy is defined as: $$p_i = \begin{cases} s_i & \text{if } i \in R_{raw}\\ \min_{r \in R_i} \left(\alpha^{|I_r|-2}\sum_{j \in I_r} p_j c_{jr} + f(E_r, \sum_{j \in I_r} p_j c_{jr})\right) & \text{otherwise} \end{cases}$$
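The value recursion above can be computed directly over a recipe graph. The numbers below are made-up illustrations (only the iron-ore base value follows the example given), and the single-recipe-per-item data trivializes the $\min$ over $R_i$:

```python
# Hedged numeric sketch of the item-value recursion p_i defined above.
# Recipe energies and quantities are invented for illustration.
import math

ALPHA = 1.025                                   # automation multiplier
RAW_VALUES = {"iron-ore": 3.1, "copper-ore": 3.6}   # s_i (copper invented)
RECIPES = {  # item -> list of (E_r, {ingredient j: c_jr})
    "iron-plate": [(1.6, {"iron-ore": 1})],
    "copper-plate": [(1.6, {"copper-ore": 1})],
    "copper-cable": [(0.5, {"copper-plate": 0.5})],
    "electronic-circuit": [(0.5, {"iron-plate": 1, "copper-cable": 3})],
}

def f(E, C):
    """Energy cost scaling: f(E, C) = ln(E + 1) * sqrt(C)."""
    return math.log(E + 1) * math.sqrt(C)

def value(item):
    if item in RAW_VALUES:                      # raw resource: p_i = s_i
        return RAW_VALUES[item]
    best = float("inf")                         # min over recipes R_i
    for energy, ingredients in RECIPES[item]:
        cost = sum(value(j) * c for j, c in ingredients.items())
        candidate = ALPHA ** (len(ingredients) - 2) * cost + f(energy, cost)
        best = min(best, candidate)
    return best

print(round(value("electronic-circuit"), 2))
```

Note how the $\alpha^{|I_r|-2}$ term rewards multi-ingredient recipes with a premium while slightly discounting single-ingredient ones.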
For factory state tracking:
  • $n_i^t$ be the net output of item $i$ from all assemblers/furnaces at time $t$
  • $a^t$ be the agent's build/upgrade actions at time $t$
  • $S^t$ be the total factory value at time $t$
  • $T$ be the planning horizon
  • $S_{base}$ be a baseline value used to normalize per-step returns
The total economic value of a Factorio factory at time $t$ is defined as: $$S^t = \left\lfloor\sum_{i \in I} p_i n_i^t\right\rfloor$$ The factory's growth rate at time $t$ is captured by the logarithmic return: $$r^t = \ln\left(1 + \frac{1}{S_{base}}\left(S^t - S^{t-1}\right)\right)$$ The agent's goal is to maximize the expected discounted sum of factory growth: $$J(\theta) = \mathbb{E}\left[\sum_{t=0}^T \gamma^t r^t\right]$$
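The reward pipeline (floor-valued factory worth $S^t$, logarithmic return $r^t$, discounted objective $J$) can be traced on toy numbers. Prices, outputs, and constants below are illustrative only:

```python
# Sketch of the reward computation defined above, with invented data.
import math

prices = {"iron-plate": 4.7, "iron-gear-wheel": 10.2}   # p_i (made up)
outputs = [                                              # n_i^t per step
    {"iron-plate": 0},
    {"iron-plate": 50},
    {"iron-plate": 80, "iron-gear-wheel": 10},
]
S_BASE, GAMMA = 100.0, 0.99                              # S_base, discount

def factory_value(n):
    """S^t = floor(sum_i p_i * n_i^t)."""
    return math.floor(sum(prices[i] * q for i, q in n.items()))

values = [factory_value(n) for n in outputs]
# r^t = ln(1 + (S^t - S^{t-1}) / S_base), the logarithmic growth rate.
returns = [math.log(1 + (values[t] - values[t - 1]) / S_BASE)
           for t in range(1, len(values))]
# J = sum_t gamma^t * r^t, the discounted objective.
J = sum(GAMMA ** t * r for t, r in enumerate(returns))
print(values, round(J, 4))
```

The logarithm keeps early small gains and later large gains on comparable reward scales, which matters when factory value grows exponentially.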

Factory Constraints

$$\begin{align*} & \sum_{r \in R_i} x_r^t \cdot c_{ir} \leq m_i^t & \forall i \in I,\ t \in [0,T] & \text{ (input buffer constraints)}\\ & \sum_{r} x_r^t \cdot E_r \leq E_{max}^t & \forall t \in [0,T] & \text{ (power network capacity)}\\ & x_r^t \leq M \cdot u_r^t & \forall r,\ t \in [0,T] & \text{ (research prerequisites)}\\ & n_i^t = \sum_{r \in R_i} x_r^t \cdot (o_{ir} - c_{ir}) & \forall i \in I,\ t \in [0,T] & \text{ (logistics network balance)} \end{align*}$$

Factory Production Variables

  • $x_r^t$ : Production speed of assemblers/furnaces running recipe $r$ at time $t$
  • $m_i^t$ : Available quantity of item $i$ in logistics network at time $t$
  • $E_{max}^t$ : Total power generation capacity at time $t$
  • $u_r^t$ : Binary indicator if recipe $r$ is researched at time $t$
  • $o_{ir}$ : Output items produced by recipe $r$
  • $M$ : Maximum assembler/furnace speed
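The logistics-network balance constraint, $n_i^t = \sum_{r} x_r^t (o_{ir} - c_{ir})$, can be checked directly on toy production data (recipe rates and quantities below are invented):

```python
# Toy evaluation of the logistics-network balance constraint:
# net output of item i = sum over recipes of rate * (produced - consumed).

recipes = {  # r -> consumption c_ir and output o_ir per unit of activity
    "smelt-iron": {"consumes": {"iron-ore": 1}, "produces": {"iron-plate": 1}},
    "gears": {"consumes": {"iron-plate": 2}, "produces": {"iron-gear-wheel": 1}},
}
x = {"smelt-iron": 10.0, "gears": 3.0}   # x_r^t, recipe rates (invented)

def net_output(item):
    """n_i^t = sum_r x_r^t * (o_ir - c_ir)."""
    total = 0.0
    for name, r in recipes.items():
        total += x[name] * (r["produces"].get(item, 0)
                            - r["consumes"].get(item, 0))
    return total

for item in ["iron-ore", "iron-plate", "iron-gear-wheel"]:
    print(item, net_output(item))
```

A negative net output (here, iron ore) signals an upstream demand that mining or logistics must cover, which is exactly the bottleneck signal an agent plans against.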

Experiments

We systematically evaluate Voyager and baselines on their exploration performance, tech tree mastery, map coverage, and zero-shot generalization capability to novel tasks in a new world.



Significantly Better Exploration

As shown in the first figure, Voyager's superiority is evident in its ability to consistently make new strides, discovering 63 unique items within 160 prompting iterations, 3.3× as many novel items as its counterparts. On the other hand, AutoGPT lags considerably in discovering new items, while ReAct and Reflexion struggle to make significant progress.

Tech Tree Mastery

Tech tree mastery. The Minecraft tech tree tests the agent's ability to craft and use a hierarchy of tools. Progressing through this tree (wooden tool → stone tool → iron tool → diamond tool) requires the agent to master systematic and compositional skills. In this table, fractions indicate the number of successful trials out of three total runs. Numbers are prompting iterations averaged over three trials. The fewer the iterations, the more efficient the method. Compared with baselines, Voyager unlocks the wooden level 15.3x faster (in terms of the prompting iterations), the stone level 8.5x faster, the iron level 6.4x faster, and Voyager is the only one to unlock the diamond level of the tech tree.


Extensive Map Traversal


Map coverage: Two bird's eye views of Minecraft maps. Voyager is able to navigate distances 2.3x longer compared to baselines by traversing a variety of terrains, while the baseline agents often find themselves confined to local areas, which significantly hampers their capacity to discover new knowledge.


Efficient Zero-Shot Generalization to Unseen Tasks


Zero-shot generalization to unseen tasks. We clear the agent's inventory, reset it to a newly instantiated world, and test it with unseen tasks. In the table above, fractions indicate the number of successful trials out of three total runs. Numbers are prompting iterations averaged over three trials. The fewer the iterations, the more efficient the method. Voyager can consistently solve all the tasks, while baselines cannot solve any task within 50 prompting iterations. Notably, our skill library constructed from lifelong learning not only enhances Voyager's performance but also gives a boost to AutoGPT. This demonstrates that the skill library serves as a versatile tool that can be readily employed by other methods, effectively acting as a plug-and-play asset to enhance performance.


Ablation Studies


Ablation studies. GPT-3.5 means replacing GPT-4 with GPT-3.5 for code generation. Voyager outperforms all the alternatives, demonstrating the critical role of each component. In addition, GPT-4 significantly outperforms GPT-3.5 in code generation.

Conclusion

In this work, we introduce Voyager, the first LLM-powered embodied lifelong learning agent, which leverages GPT-4 to explore the world continuously, develop increasingly sophisticated skills, and make new discoveries consistently without human intervention. Voyager exhibits superior performance in discovering novel items, unlocking the Minecraft tech tree, traversing diverse terrains, and applying its learned skill library to unseen tasks in a newly instantiated world. Voyager serves as a starting point to develop powerful generalist agents without tuning the model parameters.

Team

Jack Hopkins
Mart Bakler

* Equal Contribution   † Equal Advising

BibTeX

@article{hopkins2024forge,
  title   = {FORGE: Open-Ended Automation with Large Language Models},
  author  = {Jack Hopkins and Mart Bakler},
  year    = {2024},
  journal = {arXiv preprint arXiv:2305.16291}
}