Anatomy of a Starpilot Ace

From pixels to actions

Rudy Gilman, June 24, 2026

This is Ace.

Ace is an agent trained on the Atari-style environment Starpilot (Cobbe, Hesse, Hilton, Schulman. Leveraging Procedural Generation to Benchmark Reinforcement Learning. ICML, 2020).

The RL agent "Ace" playing Starpilot. High-resolution (left) vs low-resolution agent input (right)

Ace has thirty thousand parameters. Those parameters are spread across a simple architecture, a stripped-down version of the one used in Understanding RL Vision (Hilton, Cammarata, Carter, Goh, Olah. Understanding RL Vision. Distill, 2020).

Ace's actions appear complex, often sophisticated, but when we pop open the hood his movements reduce to pure mechanism. From high-level layer specialization down to individual kernel weights, Ace's circuitry is mostly legible.

The fourth layer during live play
The fourth layer refines the proto-actions and value estimate before sending them along to the policy logits and value head. The wide 5x5 kernels are visible on the upstream layer. Ace's input stream is visible in the inlaid window.

Ace's architecture has five convolutional layers with no residual branches. It sees one frame at a time (no recurrence, attention, frame stack, or velocity information). It's a distillation of a larger RL agent (a 3.7M-param ImpalaCNN), giving us a competent Starpilot player in a tractable, model-organism-sized package.

Model overview
Input RGB

The agent takes in low-resolution RGB frames

Vision primitives

The first two convolutional layers create color, edge and basic shape detectors

Formation of semantic objects

Vision primitives coalesce into discrete objects that map directly to Starpilot's primary elements

Vision becomes action

The final two layers use the relative positions of the semantic objects to drive action decisions

Highlights

  • Mechanistic trace of a sensorimotor agent from inputs to outputs. We trace a model from RGB pixels to final action commands with causal interventions at every stage. The entire sensorimotor chain is legible, though it's the legibility of Drosophila or C. elegans rather than an engineering blueprint (C. elegans: The structure of the nervous system of the nematode Caenorhabditis elegans. Phil. Trans. R. Soc. Lond. B, 1986. Drosophila: Neuronal wiring diagram of an adult brain. Nature, 2024).
  • Venn-diagram motif. The model generates actions using its wide CNN kernels to determine the relative position of incoming objects. These kernels look for Ace in part of the window and a target object on an opposing side. The transition from perception to planning happens precisely where representations shift from absolute to relative.
  • Relative position neurons directly drive action. "Enemy ahead" is synonymous with "should fire"—it's literally the same neuron. There is no "perception to action" divide, there is only a transformation of perception from worldcentric to egocentric coordinates.
  • Tonic baseline coding. Action channels in the final layers fire at high resting rates. Final action commands are determined by modulating in a narrow band around these tonic baselines. Ablating an action channel therefore produces the opposite of the action it promotes.
  • Live app sync. The organism is under the microscope to your right. Branch off on your own investigation at any time. This is the same tool used to conduct the analysis. Almost the same tool. The author's version is connected to a GPU backend, facilitating live ablations and clamps.

Pixels become semantic objects

Let's start in the third layer, where vision primitives coalesce into discrete objects. Ace's channels in this layer map directly to Starpilot's primary elements. (These same features appear in the same layer across five separate seeds.)

The primary elements of Starpilot. High-resolution render (left) and low-resolution agent input (right)

This layer is like a semantic segmentation map. Objects are represented in absolute positions, spatially-aligned with the input pixels. This layer is the most immediately "interpretable" in the sense that channels correspond to semantic concepts.

Channel L2-CH10: enemy fighters

Channel L2-CH10 detects enemy fighters. The channel is agnostic to where in the image the fighters appear, and agnostic to where they are relative to Ace's ship.

We can also jump directly to the maximum-activating dataset examples to confirm this is indeed an "enemy-ships" detector. Scroll the sidebar on the right to explore the exemplars. (You'll note this channel also activates at a mild level for bright stars, which Ace indeed sometimes fires at.)

Clamping a spatial patch to a high value causes Ace to attack it while otherwise playing normally. (When the demonstration requires a live backend we're showing a video recording instead of the live organism.)

Ace attacks empty space
Ace fires at empty space (left) as a result of the artificial clamp on the 'enemy-fighters' channel (right).
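The two interventions used throughout this note, clamping and ablation, can be sketched in a few lines. This is a minimal illustration, assuming activations are held as a NumPy array of shape (channels, height, width). The function names, the 20x8x8 shape, and the clamp value of 4.0 are hypothetical and are not taken from the author's tooling.

```python
import numpy as np

def ablate_channel(acts, channel):
    """Return a copy of acts with one channel zeroed everywhere."""
    out = acts.copy()
    out[channel] = 0.0
    return out

def clamp_patch(acts, channel, rows, cols, value):
    """Return a copy of acts with a rectangular patch of one channel set to value."""
    out = acts.copy()
    out[channel, rows[0]:rows[1], cols[0]:cols[1]] = value
    return out

# Hypothetical third-layer activations: 20 channels on an 8x8 grid.
acts = np.zeros((20, 8, 8))
acts[10, 2, 5] = 1.0  # a real fighter seen by the "enemy fighters" channel

# Paint a phantom fighter into empty space (rows 5-6, columns 1-2).
clamped = clamp_patch(acts, channel=10, rows=(5, 7), cols=(1, 3), value=4.0)

# Silence the channel entirely.
ablated = ablate_channel(acts, channel=10)
```

In a real run the modified array would replace the layer's output before the forward pass continues, so that everything downstream sees the phantom object or its absence.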

Channel L2-CH13: hard asteroids

Channel L2-CH13 detects hard asteroids, disambiguating them from harmless asteroids.

Ablating this channel causes Ace to attempt to fly behind hard asteroids as if they were harmless. Clamping an area to a moderate value causes Ace to avoid that area while otherwise playing normally.

Watch Ace refuse to fly into the top half of the screen. As far as he's concerned, it's filled with solid asteroid.

Ace avoids non-existent asteroid
positive clamp
Ace refuses to fly into an area he perceives to be occupied by a hard asteroid (artificial clamp on right) despite it being open space (left)

Channel L2-CH8: the "self"

The single most important concept in Ace's brain is the one representing Ace himself. It's implemented primarily by channel L2-CH8, with channels L2-CH1 and L2-CH5 acting as backups.

Ablating "self" causes Ace to flee to a corner, unable to respond to other actors. (All three channels are required for full ablation: L2-CH8, L2-CH1, and L2-CH5. Self-awareness degrades with the removal of each additional channel.)

self-ablation causes complete disorientation
Ace flees to corner when "self" is ablated

We'll see in the next layer how the downstream layers break when "self" is removed. We'll also see why Ace flees to the corner in response.

Channel L2-CH3: incoming fire

Channel L2-CH3 detects incoming fire from enemies. It does not activate on Ace's own fire despite the missile sprites being nearly identical.

Types of missiles
self, single
self, multiple
ENEMY station
ENEMY fighter

Enemy missiles may look identical to Ace's, but they usually have a different orientation. Incoming fire typically approaches from a vertical or diagonal inclination whereas Ace's own fire is always horizontal, and often in a line.

Let's watch how the convolution operation uses this orientation information to disambiguate the differing types of fire.

Differentiating between self and enemy fire

Zoom in on the convolutional operation itself.

Channel L2-CH3 tracks the enemy fire as it travels across the frame

Downstream channel L2-CH3 is strongly connected to upstream channel L1-CH11, which fires for orange and yellow. The kernel mediating their connection has a diagonal inclination.

The upstream channel is not discerning. It also fires for Ace's outgoing row of fire.

Ace's own fire, however, is suppressed by a negative weight scanning a different upstream channel, L1-CH1 "lines of fire".

The inhibition is sufficient to fully suppress the downstream channel from activating on Ace's own fire, leaving it free to detect only enemy missiles.
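The excitation-plus-inhibition arrangement described above can be reproduced on a toy grid. This is a sketch under stated assumptions: the 3x3 kernel size, the weight values, and the ReLU nonlinearity are hypothetical stand-ins, not Ace's actual weights. One kernel excites along a diagonal of the "orange and yellow" channel, and a second kernel applies negative weights to the "lines of fire" channel.

```python
import numpy as np

def correlate_valid(x, k):
    """Slide kernel k over map x (no padding) and sum the elementwise products."""
    kh, kw = k.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

# Hypothetical 3x3 weights: excitation along a diagonal, blanket inhibition.
k_orange = np.eye(3)          # reads the "orange and yellow" channel
k_lines = -np.ones((3, 3))    # reads the "lines of fire" channel

def enemy_fire(orange, lines):
    """Downstream 'incoming fire' response: sum both kernels, then ReLU."""
    pre = correlate_valid(orange, k_orange) + correlate_valid(lines, k_lines)
    return np.maximum(pre, 0.0)

# Enemy missile trail: a diagonal streak, absent from "lines of fire".
orange_enemy = np.zeros((5, 5))
orange_enemy[[1, 2, 3], [1, 2, 3]] = 1.0
enemy_response = enemy_fire(orange_enemy, np.zeros((5, 5)))

# Ace's own fire: a horizontal row, present in both upstream channels.
own_row = np.zeros((5, 5))
own_row[2, :] = 1.0
own_response = enemy_fire(own_row, own_row)
```

In this toy the diagonal streak drives the downstream response, while the horizontal row is cancelled everywhere because the inhibition outweighs the single diagonal cell the row overlaps.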

Before we leave the third layer, let's do a quick accounting of each of its twenty channels. Different seeds show the same concepts, though mixed differently and across different channels. The Starpilot environment prescribes the features necessary for survival. (This is mediated by model capacity. Ace's teacher, with 3.7M params, reflects more of the environment in its synapses than does Ace.)

Third-layer channel inventory

L2-CH0 — end-of-level rocket. Drives the value estimate upward.

L2-CH1 — self, backup.

L2-CH2 — asteroids, all. Soft and hard; Ace avoids positive clamps.

L2-CH3 — enemy fire; vertical station edges. Suppressed for Ace's own rows of fire.

L2-CH4 — enemy ships.

L2-CH5 — self, backup; all objects. Must also be silenced to fully ablate "self".

L2-CH6 — enemy stations, explosions, star centers.

L2-CH7 — all ships.

L2-CH8 — self, primary. Channels L2-CH1 and L2-CH5 are backups.

L2-CH9 — solid backgrounds. Feeds the value head; ablation drops the value estimate.

L2-CH10 — enemy fighters. Clamping a spot high makes Ace fire at it.

L2-CH11 — rows of own fire; missiles and explosions.

L2-CH12 — small orange objects, mostly missiles.

L2-CH13 — hard asteroids. Brown and grey.

L2-CH14 — end-of-level rocket, light asteroids, stars, bright objects. Drives positive value.

L2-CH15 — enemy fire and stations. Not explosions.

L2-CH16 — enemies with vertical left edges, especially stations.

L2-CH17 — enemy fire.

L2-CH18 — enemies, all types. Clamping high causes Ace to fire.

L2-CH19 — bright yellow, purple, blue backgrounds. Value inverted; positive clamp tanks the estimate.

Vision becomes action

Let's travel one layer deeper, where wide convolutional kernels use the discrete upstream representations to generate actions.

In the third layer we saw the representations were in absolute position, aligned with input pixels like a semantic segmentation map. The fourth and fifth layers, in contrast, look at the relative positioning of objects with respect to Ace. (The same motif occurs in all five seeds. The egocentric/allocentric split is well-established in biology. The rat hippocampus carries an allocentric, world-frame map in its place cells [The hippocampus as a spatial map. Brain Research, 1971], while posterior parietal neurons in monkeys code targets egocentrically. In humans, perception and action use different frames: a ventral stream maps the world allocentrically while a dorsal stream codes targets egocentrically for movement. Separate visual pathways for perception and action. Goodale & Milner, Trends in Neurosciences, 1992.)

At first glance, many channels appear to be firing simply for Ace himself. Look closely, however, and you'll see the blobs around Ace are oriented. Each of these channels is scanning upstream to look for "self" in one portion of its 5x5 convolution window and a different object (enemy, asteroid, incoming fire) in another part of the kernel. These oriented blobs represent where the salient objects are relative to Ace himself.

Channel L3-CH2: enemy above, fly up to intercept!

Channel L3-CH2 in the downstream layer signals "enemy incoming above, fly up to intercept". This downstream channel is created by looking for Ace in the bottom left of the kernel and an enemy in the top right.

Conv2d kernel (5x5) to detect "enemy above"
self in bottom left
enemy on top right
Creating the channel "fly up to intercept"

The channel fires at a constant, tonic level. It increases in activity as an enemy appears above, pulling Ace upwards.

One part of the kernel looks at the upstream channel "self" and looks for Ace in the bottom left corner.

Another part of the kernel scans the upstream channel "enemy ships" and looks for approaching fighters in the top-right.
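A toy version of this kernel shows how one channel can carry a tonic level from "self" and rise further when an enemy sits above and ahead. This is a sketch, assuming a single 5x5 window and a ReLU; the weights and bias are hypothetical values chosen to make the arithmetic easy to follow, not the weights of L3-CH2.

```python
import numpy as np

# Hypothetical 5x5 kernel slices, one per upstream channel.
k_self = np.zeros((5, 5))
k_self[3:, :2] = 2.0    # looks for Ace in the bottom-left corner
k_enemy = np.zeros((5, 5))
k_enemy[:2, 3:] = 1.0   # looks for an enemy in the top-right corner
bias = -1.0

def enemy_above(self_patch, enemy_patch):
    """Response of the toy 'enemy above, fly up' channel at one location."""
    pre = np.sum(self_patch * k_self) + np.sum(enemy_patch * k_enemy) + bias
    return max(pre, 0.0)

def dot_at(row, col):
    """A 5x5 patch with a single active cell."""
    patch = np.zeros((5, 5))
    patch[row, col] = 1.0
    return patch

empty = np.zeros((5, 5))
tonic = enemy_above(dot_at(4, 0), empty)           # Ace alone: resting level
excited = enemy_above(dot_at(4, 0), dot_at(0, 4))  # Ace plus enemy above
no_self = enemy_above(empty, dot_at(0, 4))         # enemy without Ace: silent
```

With these made-up numbers Ace alone gives a resting level of 1.0, Ace plus an enemy in the top-right gives 2.0, and an enemy with no Ace in the window gives 0.0.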

In the previous layer, Ace would respond to spatial patches by flying towards or away from the clamp. For "enemy incoming" and the other movement channels in this layer, however, we can place the spatial clamp anywhere; only the clamp's magnitude matters. We've entered the realm of action. (Ace uses global-average-pool to reduce spatial dimensions to the action space. Adding fully-connected layers does not improve performance.)
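Why the clamp's location stops mattering follows from the pooling step. The sketch below assumes a (channels, height, width) activation array that is pooled directly, ignoring the intermediate fifth layer and any edge effects; the shapes and values are hypothetical.

```python
import numpy as np

def global_average_pool(acts):
    """Collapse (channels, height, width) to one number per channel."""
    return acts.mean(axis=(1, 2))

# Hypothetical 4-channel layer sitting at a resting level of 0.5.
base = np.full((4, 8, 8), 0.5)

# The same 2x2 clamp on channel 2, placed in opposite corners.
top_left = base.copy()
top_left[2, 0:2, 0:2] = 3.0
bottom_right = base.copy()
bottom_right[2, 6:8, 6:8] = 3.0

pooled_a = global_average_pool(top_left)
pooled_b = global_average_pool(bottom_right)
```

Both placements pool to the same vector, so whatever reads the pooled values cannot tell where the clamp was placed, only how large it was.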

Clamping "enemy above, fly up" (L3-CH2) to a high value causes Ace to fly upwards. Ablating the channel causes him to fly downwards, as the tonic "go up" voice is now silenced. Intermediate clamp values make Ace resistant to flying upwards, though he will when pressed. (The same coding appears in motor cortex: as a monkey reaches toward targets, individual neurons fire around a tonic baseline, modulating up or down with direction, so silencing one swings the movement the opposite way. Neuronal Population Coding of Movement Direction. Georgopoulos et al., Science, 1986.)

Clamping L3-CH2 to control up and down behavior
only fly up
Clamping to a high level causes Ace to fly upwards.
reluctant to fly up
Clamping to a tonic, resting value makes Ace resistant to flying upwards, though he will when pressed.
only fly down
Ablation causes Ace to actively fly downwards.

This signal, created in the fourth layer, flows directly to the action logits. It is the primary mediator of the decision to fly upwards.
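The opposite reactions to clamping and ablation fall out of tonic coding. Here is a minimal sketch, assuming a toy two-action softmax head in which the channel drives the "up" logit and a constant "down" logit balances the channel's resting activity. The head, the resting activity of 1.0, and the function name are hypothetical.

```python
import numpy as np

def p_up(channel_activity):
    """Probability of 'up' under a toy two-action softmax head."""
    # Hypothetical head: the channel feeds the 'up' logit; the 'down' logit
    # is a constant chosen to balance the channel's resting activity of 1.0.
    logits = np.array([1.0 * channel_activity, 1.0])
    exp = np.exp(logits - logits.max())
    return float(exp[0] / exp.sum())

resting = p_up(1.0)   # tonic baseline: no preference either way
clamped = p_up(2.0)   # clamped high: 'up' is favoured
ablated = p_up(0.0)   # ablated: 'down' is favoured
```

Zeroing the channel does not return the head to neutral. It removes a vote that was always being cast, which tips the balance toward the opposing action.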

The Venn-diagram motif

Each of Ace's proto-action channels is a Venn diagram looking for "self" in one half of the kernel and a different object on an opposing side. Neither half alone is sufficient to trigger the action. These channels fire at a tonic rate because half of their pattern is Ace himself, always visible on the screen. (A similar code appears in monkey premotor cortex: neurons that fire for an object at a location relative to a body part, in a frame anchored to the body. Like Ace, the signal collapses when the body anchor is removed. Coding of Visual Space by Premotor Neurons. Graziano et al., Science, 1994.)

Obstacles push Ace away, enemies pull him in. Ace's final movement is the weighted sum of these pushes and pulls, discretized by the argmax on the policy head. (The same computational pattern appears in the iconic Mauthner cell in fish. Excitatory and inhibitory votes are integrated together to drive the fish's escape reflex, with mutual exclusion implemented by reciprocal inhibition rather than an argmax. The Mauthner cell half a century later: a neurobiological model for decision-making. Neuron, 2005.)

For Ace, the relative positioning of other objects is what action he should take. Perception, once cast to egocentric space, is action. "Enemy ahead" is synonymous with "should fire"—it's literally the same neuron.

We can now see why ablating "self" upstream is so damaging: it removes the frame of reference, destroying Ace's ability to judge the relative position of other objects. Ablating the "asteroid" channel causes asteroid-blindness but doesn't otherwise prevent normal play. Ablating the "self" channel, however, causes complete disorientation.

Before and after upstream ablation of "self"
BEFORE
Under normal conditions, the fourth layer specializes in Ace's position relative to other objects.
AFTER
When the single upstream concept of "self" is ablated, the downstream channels dedicated to relative positioning go silent.

The fight-or-flight axis

Ace fires missiles sparingly. (Pulling the trigger means he can't implement a movement action during that step. A cost on firing missiles was also imposed during training.) The primary mediator of the attack reflex is the inhibitory channel L3-CH4 "retreat". This single channel both promotes retreat and prevents attack. (Examples of mutual exclusion from living brains: In the mouse midbrain periaqueductal gray, one population drives freezing and quieting it flips the animal to flight. In C. elegans, the AVA and AVB command neurons cross-inhibit to toggle forward versus reverse crawling. In sea slugs, the escape swim actively suppresses feeding. As with Ace's fight and flight, both ride on a single node's activity. Midbrain circuits for defensive behaviour. Nature, 2016.)

The kernel aggregates multiple reasons for fleeing to the left. It looks for "self" on the left and a number of approaching dangers on the right.

Conv2d kernel (5x5) for "retreat-left"
self AND (hard asteroid OR incoming missiles OR other asteroids OR fire above)

This "retreat left" neuron directly inhibits Ace's attack reflex. Ablating the neuron causes him to fire nonstop. Clamping it to a high value prevents him from firing. A moderate value makes Ace timid, able to maneuver but unwilling to fire.
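The kernel's logic and its effect on firing can be written out as a small sketch. It is a boolean simplification of what is really a graded signal; the function names, the inhibition weight of 2.0, and the "fire drive" formula are hypothetical and serve only to show one channel gating another.

```python
def retreat_left(self_on_left, dangers_on_right):
    """Boolean reading of the kernel: self AND (any danger on the right)."""
    return self_on_left and any(dangers_on_right.values())

def fire_drive(enemy_ahead, retreat, inhibition=2.0):
    """Toy fire drive: excited by 'enemy ahead', inhibited by 'retreat'."""
    return enemy_ahead - inhibition * retreat

dangers = {
    "hard_asteroid": False,
    "incoming_missiles": True,
    "other_asteroids": False,
    "fire_above": False,
}

retreating = retreat_left(True, dangers)           # True: one danger is enough
blind = retreat_left(False, dangers)               # False: no "self", no retreat
normal = fire_drive(1.0, float(retreating))        # -1.0: firing suppressed
ablated = fire_drive(1.0, 0.0)                     # 1.0: nothing holds fire back
```

A positive drive stands in for "fire" and a negative one for "hold fire". Setting the retreat term to zero mirrors the ablation, where Ace fires nonstop.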

Full "flight"
Strong positive clamp causes Ace to retreat and prevents him from firing. Full "flight" mode.
Timid, no "fight"
Moderate positive clamp allows Ace to fly normally but strongly dampens propensity to fire. "Fight" mode prevented.
Full "fight"
Ablation causes Ace to only fire, stuck in "fight" mode.

The "fire missile" decision is also positively mediated by a Venn-diagram neuron. It looks for Ace in the center and an enemy on the right side.

Conv2d kernel (5x5) to trigger "enemy ahead, fire!"
self in center
enemy on right

Tracing upstream to RGB

Let's start at the "self" channel from the third layer and see how it's constructed from the raw blue and white input pixels.

Looking at the layer directly upstream, you can see the downstream "self" channel L2-CH8 is constructed from two primary upstream channels. The first is L1-CH16 "high-frequency light on dark", which also detects other small shapes like enemy fighters, stars, and missiles.

The second upstream channel is L1-CH9 "blue and white small objects", which also detects blue space-stations, blue-ish asteroids and edges of blue planets.

Each upstream channel also detects other objects, but their intersection creates a clean representation of Ace's spaceship.
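The intersection can be sketched as a thresholded sum. The example below uses a one-dimensional strip of five hypothetical positions and made-up binary activations; the threshold of 1.0 and the ReLU are assumptions, not Ace's actual weights.

```python
import numpy as np

# Five hypothetical positions: [Ace, enemy fighter, star, blue station, empty]
high_freq_light_on_dark = np.array([1.0, 1.0, 1.0, 0.0, 0.0])  # like L1-CH16
blue_white_small = np.array([1.0, 0.0, 0.0, 1.0, 0.0])         # like L1-CH9

# Sum the two channels and subtract a threshold so that only positions
# where both are active survive the ReLU.
self_channel = np.maximum(high_freq_light_on_dark + blue_white_small - 1.0, 0.0)
```

Only the first position, where both upstream channels fire, remains active. The fighter, star, and station each trigger one channel and are removed by the threshold.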

We can follow L1-CH9 "blue and white small objects" a layer upstream.

It's created primarily from upstream channel L0-CH11 "blue, light, constant" which detects blue and light. The kernel is a cookie-cutter that chomps out small objects from the solid blue background. (In biology, this is a center-surround receptive field, first described in the cat retina. Discharge patterns and functional organization of mammalian retina. Kuffler, J. Neurophysiology, 1953.)

small blue things detector
A cookie-cutter kernel applied to a blue constant
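A center-surround kernel of this kind is easy to try on a toy map. The sketch assumes a 3x3 kernel with a positive center and a negative surround that sum to zero, followed by a ReLU. The kernel size, weights, and the 8x8 map are hypothetical.

```python
import numpy as np

def correlate_same(x, k):
    """Cross-correlate a 2-D map with an odd-sized kernel, zero-padded to keep its shape."""
    pad = k.shape[0] // 2
    xp = np.pad(x, pad)
    out = np.zeros_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + k.shape[0], j:j + k.shape[1]] * k)
    return out

# Hypothetical cookie-cutter: excitatory center, inhibitory surround.
kernel = -np.ones((3, 3))
kernel[1, 1] = 8.0

# Upstream "blue" map: a large solid region plus one small isolated object.
blue = np.zeros((8, 8))
blue[0:5, 0:5] = 1.0   # solid blue background, e.g. a planet
blue[6, 6] = 1.0       # small blue object

small_blue = np.maximum(correlate_same(blue, kernel), 0.0)
```

The interior of the solid region cancels to zero, the isolated object responds at full strength, and the border of the region responds weakly, which is consistent with the channel also picking up the edges of blue planets.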

Channel "blue, light, constant" in turn reads directly from the RGB input.

Lots of blue, a bit of green, no red. The kernel uses very little of its 7x7 window.

blue detector
red
green
blue
blue detector is created from RGB by taking mostly blue, some green, and no red
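Because the kernel uses so little of its spatial window, the detector behaves much like a per-pixel weighting of the color channels. A minimal sketch follows, with hypothetical weights of 0.0, 0.3, and 1.0 for red, green, and blue; the real weights are not given in this note.

```python
import numpy as np

# Hypothetical per-pixel weights for (red, green, blue).
rgb_weights = np.array([0.0, 0.3, 1.0])

def blue_light(pixel_rgb):
    """Toy 'blue, light, constant' response for one RGB pixel in [0, 1]."""
    return float(np.dot(pixel_rgb, rgb_weights))

pure_blue = blue_light([0.0, 0.0, 1.0])
pure_red = blue_light([1.0, 0.0, 0.0])
white = blue_light([1.0, 1.0, 1.0])
```

Blue and white pixels both score highly while red scores zero, matching a channel that responds to "blue and light" rather than to blue alone.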

We could do this exercise for all of the discrete objects in the third layer, tracing their latent representations upstream to raw RGB inputs, but let's limit to just one more for now: enemy fighters.

The "enemy fighter" channel is created using a cookie-cutter kernel on a red/orange/purple background, similar to how Ace was chomped out of a blue background.

One layer further upstream, you can see the shape of the fighters' red wings imprinted on the RGB-facing kernels. The environment is reflected in the synapses of the organism.

enemy fighters detector
red
green
blue
Red wings imprinted in Ace's weights

What Ace fears

We saw above how ablating "self" caused Ace to move to a corner. Why that specific behavior?

The model's value estimate channel makes it clear: Ace is attempting to flee a noisy background.

Noisy, colorful backgrounds are associated with lower returns. A quick scan of the "least-predicted returns" exemplars reveals a gallery of Starpilot's most confusing backgrounds. When one of these backgrounds appears, Ace's estimate of future returns plummets. Orange backgrounds hide missiles. Blue backgrounds hide Ace himself, mimicking the "self" ablation from above.

Minimum activations from the value channel
Confusing backgrounds mean low returns

Ace refuses to fly over a bright blue background even when a juicy target entices him to attack. The upstream perception layers we explored above aren't sophisticated enough to disambiguate Ace from the blue planet. Ace's policy compensates by avoiding blue. (This shows a limitation of working with such a small model. Ace simply lacks the capacity to discern himself from a blue background. When the architecture is allowed to grow from 30k to 40k params, tellingly, some of the new channels are dedicated to disambiguating backgrounds despite the fact that backgrounds do not affect gameplay.)

Avoiding the blue planet
Ace won't fly over blue planet even to get a reward

What Ace loves

We saw what situations Ace estimates to be the least promising. What about the most promising?

Rows of enemy fighters to mow down. Watch his value estimation spike as he gets a column of enemy fighters in his sights. Even better when they're on a clear background.

Highest-value state
Ace's value estimation surges as enemies appear on a clear background in front of him.

Caveats and final notes

Redundancy and mixed selectivity. Multiple neurons perform the same task, and many neurons participate in multiple distinct circuits. (To interpretability researchers, this is "polysemanticity", perhaps driven by "superposition". Toy Models of Superposition. Elhage et al., 2022.) The concepts and circuits are clearly legible, and can be intervened on with predictable results, but they are not crisply delineated by layer or channel.

Step changes in representations aren't fully discrete. We portrayed the model as having two step changes: one from pixels to semantic objects, another from semantic objects to actions. Those step changes are real, but they are gradual at the margins. The second layer already has inklings of semantic coherence. The fourth and fifth layers have both absolute channels and relative channels side-by-side. Perception doesn't suddenly give way to action.

A single neuron is often sufficient, rarely necessary. We showed clamping a single channel is often sufficient to trigger a response. In many cases, however, that same channel is not necessary. Even a model of Ace's diminutive size develops supporting circuitry. (This is well documented in artificial and organic systems. The Hydra Effect: Emergent Self-repair in Language Model Computations. McGrath et al., 2023. From biology: Degeneracy and complexity in biological systems. PNAS, 2001.) Rotating the model's projections using PCA/SVD would "clean up" the representations, but we would lose the benefit of keeping the organism fully in situ.

Tonic baselines dominate, though the modulations are what influence behavior. When it comes to generating actions using the Venn-diagram motif, the tonic baseline provided by Ace himself dominates the signal. The perturbations around this baseline, though smaller in magnitude, are what differentiate behavior. The dynamic range is small, just a narrow band around the tonic baseline.

Value head. We touched only briefly on Ace's value-estimation apparatus, though a fair amount of the model is dedicated to this function. Value estimation is largely a reflection of the background pattern, as discussed, but it also encompasses relative positioning like "incoming bullet and nowhere to run" and "surrounded on all sides". These concepts use the same discrete representations formed in the third layer and the same egocentric information from layers four and five.

Behavioral cloning vs reinforcement learning. This model was distilled from a larger model, an ImpalaCNN with 3.7M parameters. An RL model trained from scratch with the same 30k-parameter architecture attains only 80% of the returns of the distilled agent and takes 4x longer to train. We found the parameter floor of 30k by iterating against the pretrained teacher, amortizing its cost across dozens of shorter training runs. The RL-trained, 30k-parameter model appears to have the same types of features as Ace, though, intriguingly, shifted later in the model. A mystery for another day.

Starpilot returns
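| Model | Returns | Params | Steps | Fire cost |
|---|---|---|---|---|
| Our teacher (Impala scale=4) | 35.1 | 2.7M | 420M | ✓ |
| Cobbe maxout (Impala scale=4) | 35.0 | 2.7M | 8B | |
| Cobbe default (Impala scale=4) | ~25 | 2.7M | 200M | |
| Distilled from teacher (Ace) | 28.1 | 30k | 415M | ✓ |
| PPO from scratch | 23.2 | 30k | 1.5B | ✓ |
| Random | 1.5 | - | - | |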
All models were evaluated on hard difficulty. Our models were all trained to convergence. Our models were trained with a small cost on firing missiles to make the "should fire" circuit more discerning.
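The comparison between the distilled agent and the from-scratch PPO agent can be checked against the table's own figures. The short calculation below uses only the returns and step counts listed above; the variable names are illustrative.

```python
# Figures from the Starpilot returns table.
distilled_return, scratch_return = 28.1, 23.2
distilled_steps, scratch_steps = 415e6, 1.5e9

# Fraction of the distilled agent's return reached by PPO from scratch.
return_ratio = scratch_return / distilled_return

# How many times more environment steps PPO from scratch consumed.
step_ratio = scratch_steps / distilled_steps

print(f"return ratio: {return_ratio:.2f}, step ratio: {step_ratio:.1f}x")
```

By these numbers the from-scratch agent reaches roughly 83% of the distilled agent's return using about 3.6 times as many steps, in line with the rounded "80%" and "4x" quoted in the text.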

Related work: This note was inspired by the Circuits lineage. Zoom In set out the program of reverse-engineering a network one feature and circuit at a time; Curve Circuits traced a complete circuit; and Understanding RL Vision brought the approach to a reinforcement-learning agent's perception (Zoom In: An Introduction to Circuits. Distill, 2020. Curve Circuits. Distill, 2021. Understanding RL Vision. Distill, 2020). The microscope interface has an antecedent in OpenAI's Microscope (OpenAI Microscope. 2020). The Starpilot environment comes from Procgen, a suite of procedurally-generated environments (Leveraging Procedural Generation to Benchmark Reinforcement Learning. ICML, 2020). Understanding and Controlling a Maze-Solving Policy Network analyzed another environment from this suite, steering a maze-solving policy by intervening on intermediary features (Turner et al. add a learned "cheese vector" to the policy's activations. Understanding and Controlling a Maze-Solving Policy Network. 2023).

The author is a student of biological systems, not an expert. The few references to organic systems included in this note do not begin to do justice to the richness of the literature.

This research note discussed only a few highlights. Every element of the Starpilot environment can be traced fully from RGB inputs → vision primitives in the first and second layers → the discrete objects in layer three → the Venn-diagram decision-making motif in the fourth and fifth layers → the final action logits.