The Teleport Contest

Over the next months I’ll help David Bau run a contest which will push the vibe-coding skills of its participants to their limits! The task is deceptively simple: translate a text-based game from C/Lua to maintainable JavaScript, so that it runs in the browser and behaves exactly the same, down to every keystroke and character. The game is NetHack, and its sprawling codebase spans over 440,000 lines of code and is more than 45 years old.

The contest is live at mazesofmenace.ai! I took my first steps in the contest, and this post is an early “field report” from that experience, combined with my reflections on what makes the contest a fun and worthwhile challenge. If you already read the announcement post, you can skip ahead to the next section.

Let’s start with the contest rules! We welcome any approach to porting the code. Participants start with a minimal skeleton port and 44 sample gameplays to use as tests; submissions are also scored using additional held-out gameplays. In Phase 1, points are earned for producing exactly the same screen in response to every keystroke. Phase 1 ends on November 29, and Phase 2 follows in December with slightly different rules: we reveal a slightly newer source codebase to port, and submissions are scored on their parity, divided by a factor that reflects how much of the codebase had to be updated. The most maintainable submission wins!

You can participate simply by forking the template teleport-contest repo.

Using coding agents may be the most interesting way to participate. Porting NetHack seems like exactly the sort of precisely specified, machine-verifiable task agents can excel at, as we know from cases like Anthropic’s Claude C Compiler experiment. However, I quickly learned that “verifiable” is not the same as “easy for an agent”.

First steps with a coding agent

For my initial attempts, I used OpenAI’s Codex. I intentionally decided to keep it simple at first. I used the CLI version of Codex, since I knew that eventually I’d have to run the agent in a loop. In my most successful attempt, I used GPT-5.4 and started by giving the agent the following prompt.

Read the README. Plan how to make the best possible submission to the contest. Let’s use the existing tests as a guide, making them pass as soon as possible while making the best possible submission. Let’s write the plan in ai/PLAN.md.

After checking that the plan the agent produced made sense, I added a paragraph telling it to commit often, push only when new tests pass, and update the plan with every commit.

Then, I started a new session, telling the agent to “Read README and ai/PLAN.md, and proceed with the plan”. I wanted to run the agent autonomously, which meant two things. One, letting it run actions without sandboxing or needing approval. Two, prodding the agent to continue every so often. To get started, I used a simple command which resumes the last session, tells the agent to “proceed”, and rings a bell when it stops.

codex --dangerously-bypass-approvals-and-sandbox exec resume --last "Proceed."; \
    printf '\a'

To get the first two tests to pass, it was enough to merely repeat this command 17 times, running the tests every so often to check that the agent was making progress.

Getting further will require a bit more sophistication! The agent needs to run autonomously for long stretches, which means running a similar command in a loop, inside a sandbox such as a Docker container where its actions don’t need to be monitored.
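
For illustration, a minimal unattended driver could look something like the Node script below. This is only a sketch: it assumes the codex CLI is installed and authenticated inside the container, and the iteration count is arbitrary.

// drive.js: a rough sketch of an unattended driver for the agent
const { execSync } = require("node:child_process");

for (let i = 0; i < 50; i++) {
  // resume the last Codex session and tell the agent to keep going
  execSync(
    'codex --dangerously-bypass-approvals-and-sandbox exec resume --last "Proceed."',
    { stdio: "inherit" },
  );
}

A real driver would likely also run the tests between iterations to track progress.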

Early takeaways

My initial attempts weren’t long, but they were enough to tell me that using agents takes some finesse.

On my first attempt, dubbed “Skynet”, I told the agent to “plan how to make as many tests pass as possible and proceed with the plan”. The agent read the README, the C sources, and the rest of the codebase. It made a plan, spent over 15 minutes working, and proudly told me that all the tests passed: it had memorized and hardcoded the expected outputs. Needless to say, I wasn’t quite pleased. Clearly, precise requests are almost a must when asking an agent to do a large task. Fortunately, the README describes the contest rules clearly, and the original C codebase perfectly specifies what the translation should do.

On my second attempt, I was more careful. As I explained above, I told the agent to “plan how to make the best possible submission” and then simply kept telling it to proceed.

Even making two tests pass was not easy for my agent. Eventually, it made a submission which ran correctly for two recorded sessions. However, the contest leaderboard said my submission failed all tests! I told my agent to take a look, and after 15 minutes it produced an explanation. The JavaScript translation read the C source files to load some constants, despite the contest rules in the README clearly explaining that submissions cannot access the filesystem.
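
I haven’t checked exactly which constants the agent was loading, but the shape of the fix is clear: bake them into the JavaScript itself, for example as a module generated once at build time, instead of parsing the C headers at runtime. A purely illustrative sketch (the file name is hypothetical; the values shown are NetHack’s map dimensions):

// constants.js: a hypothetical module generated once, at build time
// (not allowed at runtime: fs.readFileSync on the C headers)
module.exports = {
  COLNO: 80, // NetHack’s map width
  ROWNO: 21, // NetHack’s map height
};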

Hence, we need to verify an agent’s output even when we give it a precise specification! Given the codebase’s size, verifying the translation by hand may quickly become infeasible. Fortunately, we can use the original NetHack to produce clean input/output examples and use them as end-to-end tests. With carefully curated tests, we can monitor an agent’s progress and give it feedback.
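
The heart of such a test is small: replay the recorded keystrokes through the port and compare every rendered screen against the recording. Here is a sketch; the Game class, its methods, and the session file format are assumptions rather than the contest’s actual API.

// session01.e2e.test.js: a sketch of an end-to-end replay test
const assert = require("node:assert");
const { test } = require("node:test");
const { Game } = require("./game.js"); // hypothetical entry point of the port
const session = require("./sessions/session01.json"); // { seed, steps: [{ key, screen }] }

test("session01 replays keystroke for keystroke", () => {
  const game = new Game(session.seed);
  for (const step of session.steps) {
    game.press(step.key);
    // every screen must match the original NetHack, character for character
    assert.strictEqual(game.render(), step.screen);
  }
});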

What challenges await?

Even with tests, there are challenges to overcome!

Given a gameplay with a discrepancy, finding the actual bug may be non-trivial. Let’s say the discrepancy is an enemy wielding a different weapon. The real issue may be that at the start of the game, a door to a room was open instead of being closed, and the enemy wandered inside and found a new weapon. The door may have been open due to a minor flaw: two lines of code being out of order. NetHack is essentially a simulation of a fantasy world, heavily driven by a random number generator. To keep the simulation the same, the generator’s outputs must be used in the exact same order by the exact same procedures. In practice, my agent did seem to spend a lot of time chasing RNG discrepancies.
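
To see why order matters, picture the RNG as a single deterministic stream of numbers, which is roughly what NetHack’s generator provides. In the illustrative sketch below (not NetHack’s actual algorithm), swapping two seemingly unrelated calls changes both results and shifts every roll that comes after them.

// an illustrative deterministic RNG in the spirit of NetHack's rn2(n)
function makeRng(seed) {
  let s = seed >>> 0;
  return (n) => {
    s = (Math.imul(s, 1103515245) + 12345) >>> 0; // a simple LCG, for illustration only
    return s % n;
  };
}

const rn2 = makeRng(42);
const doorIsOpen = rn2(5) === 0; // original order: roll for the door first...
const monsterWeapon = rn2(3);    // ...then roll for the monster's weapon
// If the port evaluates these two rolls in the opposite order, both values
// change, and every later roll in the game drifts along with them.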

To find and fix such discrepancies, software developers would almost certainly use specialized methods, or maybe tools made specifically for this task. When Amazon was replacing a critical component that checked fine-grained access policies, the replacement also had to maintain exact parity with the original, as explained by Neha Rungta here. To ensure the replacement went well, differential testing was used: for months, the new component was run alongside the original just to make sure their outputs were exactly the same. To me, one of the most interesting parts of the contest will be seeing whether participants have their agents use such specialized methods, and what tools end up helping them! Perhaps we’ll even get to see agents make their own tools.
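
One concrete shape such a tool could take here is RNG tracing: instrument both the original C build and the JavaScript port to log every RNG call together with its call site, then diff the two traces to find the first divergence. This would live outside the submitted code, purely as a debugging aid, since submissions can’t touch the filesystem. Below is a sketch of the JavaScript side; rn2 and the trace format are assumptions.

// rng-trace.js: a sketch of a differential-testing aid, not part of a submission
const fs = require("node:fs");
const lines = [];

// wrap the port's RNG so every call records who asked and what came back
function traced(rn2) {
  return (n) => {
    const value = rn2(n);
    const caller = new Error().stack.split("\n")[2].trim();
    lines.push(`rn2(${n}) = ${value} @ ${caller}`);
    return value;
  };
}

// after a run, write the trace and diff it against one logged by the C build
function dumpTrace(path) {
  fs.writeFileSync(path, lines.join("\n") + "\n");
}

module.exports = { traced, dumpTrace };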

The challenges for the contestants and their agents don’t end with the difficulty of debugging! Producing example gameplays is a challenge in itself, since NetHack is an unsolved problem for AI. The Balrog benchmark shows that even in 2025, frontier large language models on average get through only 2% of the game, despite possessing extensive knowledge of the game when asked directly.

The contest also has elements of long-term planning and following long-term goals. We can be certain that a codebase as large and old as NetHack features many unique quirks which may not translate to JavaScript easily, with the C preprocessor being perhaps the simplest example. Even expert software developers would be unlikely to start the translation by making a perfect plan that accounts for all such quirks. Instead, they’d make a best-effort plan and re-evaluate as they made progress. Coding agents will likely need to proceed with imperfect plans and refactor as needed. This may be quite a bit harder for them than it seems! In the contest’s announcement post, David Bau describes the difficulties he experienced over the last four months. One of them was having to abandon 400k lines of translated code after the agents stopped making progress due to bad architectural decisions, and couldn’t let go of those decisions even with a lot of human help.

The road ahead

Porting NetHack to JavaScript is a very exciting task for agents for a number of reasons. On one hand, it’s a setting where agent-driven workflows can be unusually effective because the agent’s output can be verified precisely, much like in the case of the JustHTML ports and the Claude C Compiler experiment. On the other hand, the task still seems hard for agents. The feedback is sparse, the causal chains are long, and bad decisions can look successful for a long time. Ensuring steady progress seems to require carefully monitoring the agent, developing custom debugging tools and verification methods, and managing the agent’s long-term strategy.

The Teleport Contest will let its participants show us how far agentic coding can be pushed. As the contest website explains, getting started is as easy as forking a GitHub repo. I look forward to seeing what we’ll discover!

Alex Boruch-Gruszecki
Postdoc

Interested in LLM-based code synthesis informed by a deep understanding of programming languages.