Generating coding puzzles with LLMs

I have been working on a daily coding puzzle site called dailyprog. It’s like Wordle for programming problems. Each day you get one puzzle with a short narrative, a function stub, and a set of visible test cases. You write a solution, run it locally against the visible tests, then submit to verify against hidden tests on the server. There are four languages (JavaScript, Python, C, Go), visual output for some puzzles, and a share link so you can signal your competency to your friends.

The site needs a new puzzle every day. When I first started, I wrote the first few by hand. Pick a problem, design test cases, write a reference solution, verify it in the sandbox. Each puzzle took 30 to 45 minutes. That doesn’t scale past a week.

The obvious solution is to have an LLM write them. And the obvious problem is that LLMs write broken puzzles. They inline 20,000-element arrays in JSON, they produce solutions that don’t pass their own tests, they cycle through the same three algorithm patterns, and their “brute force” attempts are sometimes correct on every hidden test.

I built a generator that fixes this. Not by making the model smarter. By giving it a vocabulary for what a puzzle is, and then running its output through real code until it stops being wrong.

Picking a corpus

The first problem wasn’t technical. It was legal.

An LLM trained on the open web has seen every puzzle site. LeetCode, HackerRank, Codeforces, Project Euler. If you ask it to generate a coding puzzle, it will regurgitate something it memorized. The only way to constrain it toward original output is to give it a structured vocabulary and a set of seed problems it can remix. But the seed problems have to be legally safe to ingest into a product.

I surveyed nine sources. The table wasn’t encouraging:

Source	Problems	License	Usable?
Exercism	151 exercises	MIT	Yes, backbone
Codeforces	11,255 problems	Site ToS	Facts only, no statements
CP-Algorithms	170 articles	CC BY-SA 4.0	Technique names only
freeCodeCamp	~1,500 challenges	CC BY-SA 4.0	Yes but ShareAlike
USACO Guide	147 modules	CC BY-NC-SA 4.0	Reference only
CSES	300 problems	CC BY-NC-SA 4.0	Reference only
Project Euler	~900 problems	CC BY-NC-SA 4.0	Reference only
Rosetta Code	1,266 tasks	GFDL 1.2	Copyleft, awkward
OEIS	370K sequences	CC BY-SA 4.0 (reported)	Verify first

Three problems run through the whole list.

NonCommercial kills a third of them. USACO Guide, CSES, and Project Euler are all CC BY-NC-SA. The NC clause restricts the use, not the user: a for-profit product can’t serve NC-derived content even if the product itself is free. That makes them reference-only. Probably fine for private calibration, legally risky as ingestion into a generator that ships to users. I didn’t use them anyway.

ShareAlike is an infection risk. CC BY-SA and GFDL require derivative works to carry the same license. If you condition an LLM on BY-SA text and it produces a puzzle, is the output a derivative that inherits BY-SA? Nobody has litigated this. I didn’t want to be the test case. CP-Algorithms and freeCodeCamp are BY-SA, so I took only technique names (facts, not copyrightable) and left the prose alone. Rosetta Code’s GFDL is even worse: copyleft designed for documentation, awkward in a product.

Codeforces has the best data and the worst terms. 11,255 problems with 38 controlled topic tags and a calibrated Elo difficulty scale derived from actual solve data. But the ToS explicitly forbid republishing problem statements in anything with automatic testing (which is exactly what dailyprog is) and explicitly forbid publishing tests and checkers. I only kept the tag vocabulary and the difficulty scale, just metadata. I skipped the problem statements.

That left Exercism. MIT licensed, 151 cross-track exercises, each with a structured problem statement, a separate framing story, and canonical test data in a machine-readable JSON format with a published schema. No NonCommercial clause. No ShareAlike infection. The only obligation is keeping the copyright and license notice with the material (I attribute anyway). It’s the only source on the list where you can take everything and use it without thinking twice.

I ingested all 142 solvable exercises into a normalized authoring corpus: title, statement, interface signature, structured test cases with named inputs, and metadata. Then I had the LLM classify each one.
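
To make the shape of that corpus concrete, here is a minimal sketch of one normalized entry. The field names, the `checkEntry` helper, and the sample values are all assumptions for illustration, not the actual dailyprog schema.

```typescript
// Hypothetical shape of one normalized corpus entry (field names are assumptions).
interface CorpusTestCase {
  name: string;
  inputs: Record<string, unknown>; // named inputs, e.g. { year: 2000 }
  expected: unknown;
}

interface CorpusEntry {
  title: string;
  statement: string;
  signature: string; // interface signature as plain text
  tests: CorpusTestCase[];
  metadata: { source: string; license: string };
}

// Returns a list of problems with an entry; an empty list means it is usable.
function checkEntry(entry: CorpusEntry): string[] {
  const problems: string[] = [];
  if (entry.title.trim() === "") problems.push("missing title");
  if (entry.statement.trim() === "") problems.push("missing statement");
  if (entry.tests.length === 0) problems.push("no test cases");
  const names = new Set(entry.tests.map((t) => t.name));
  if (names.size !== entry.tests.length) problems.push("duplicate test names");
  return problems;
}

// Illustrative entry only; the wording is not copied from any source.
const sampleEntry: CorpusEntry = {
  title: "Leap",
  statement: "Decide whether a given year is a leap year.",
  signature: "isLeap(year: number): boolean",
  tests: [
    { name: "divisible by 400", inputs: { year: 2000 }, expected: true },
    { name: "divisible by 100 only", inputs: { year: 1900 }, expected: false },
  ],
  metadata: { source: "Exercism", license: "MIT" },
};

console.log(checkEntry(sampleEntry));
```

Keeping inputs named rather than positional is what lets the same test case be rendered into four different language stubs later.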

Building the ontology

Classification was a single deliberate pass. Read every problem’s full statement, its test cases, its canonical solution. Classify by topic, mechanics, difficulty, and the single algorithmic technique it teaches. Report confidence. Justify each judgment.

This took about an hour of LLM time. Not because 142 problems is a lot of text. Because each judgment needed reasoning, and reasoning tokens are the point. The output isn’t a set of labels. It’s an ontology.

The topic distribution that emerged:

pie showData
    title Puzzle topic distribution (142 problems)
    "strings & text" : 29
    "logic & simulation" : 21
    "math & number theory" : 21
    "arrays & sequences" : 20
    "encoding & ciphers" : 19
    "games & puzzles" : 17
    "parsing & evaluation" : 6
    "graphs & trees" : 5
    "geometry" : 3
    "dynamic programming" : 1

The technique enumeration: two pointers, sliding window, prefix sum, frequency map, binary search, monotonic stack, greedy, backtracking, BFS, DFS, dynamic programming, heap / top-k, union-find, tree traversal, linked list manipulation, matrix operations.

Two things surprised me about the classification. First, the LLM saw structure I wouldn’t have named. The “encoding & ciphers” cluster, for instance. I would have called those “string manipulation” and lost the distinction between text-processing puzzles that are about pattern matching and text-processing puzzles that are about translation between representations. The model caught it because it had to reason about each case individually.

Second, one problem failed: a pure refactoring exercise with no algorithmic core. The taxonomy has no “code-quality” category, so the LLM classified it as “logic-simulation,” which is wrong, and gave it confidence 0.65. The problem wasn’t the classification. The problem was that the corpus included something that isn’t a puzzle, and the low confidence correctly flagged it as an edge case.
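
A classification record along these lines could be sketched as follows. The technique list comes from the enumeration above; the field names, the difficulty scale, the `null` technique, and the 0.7 review threshold are assumptions chosen so that the 0.65 refactoring case gets flagged.

```typescript
// The technique vocabulary, taken from the enumeration in the text.
const TECHNIQUES = [
  "two pointers", "sliding window", "prefix sum", "frequency map",
  "binary search", "monotonic stack", "greedy", "backtracking",
  "BFS", "DFS", "dynamic programming", "heap / top-k",
  "union-find", "tree traversal", "linked list manipulation", "matrix operations",
] as const;

type Technique = (typeof TECHNIQUES)[number];

// Hypothetical record shape: one per classified exercise.
interface Classification {
  exercise: string;
  topic: string;
  mechanics: string[];
  difficulty: number;          // scale is an assumption
  technique: Technique | null; // null when there is no algorithmic core (assumption)
  confidence: number;          // 0..1, as reported by the model
  justification: string;
}

// Assumed threshold: anything below it goes to a human for review.
const REVIEW_THRESHOLD = 0.7;

function needsReview(c: Classification, threshold: number = REVIEW_THRESHOLD): boolean {
  return c.technique === null || c.confidence < threshold;
}

const refactoringCase: Classification = {
  exercise: "(pure refactoring exercise)",
  topic: "logic-simulation",
  mechanics: [],
  difficulty: 1,
  technique: null,
  confidence: 0.65,
  justification: "No algorithmic core; closest available topic chosen.",
};

console.log(needsReview(refactoringCase));
```

Typing the technique as a closed union means a label outside the vocabulary fails at compile time instead of silently widening the ontology.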

The repair loop

With the ontology in place, the generator is a state machine. A model proposes a puzzle. A validator runs it through every gate. On failure, the issues go back to the model with specific fix instructions. On success, the puzzle is written to disk.

stateDiagram-v2
    [*] --> generate: prompt + diversity constraints
    generate --> validate: JSON proposal
    validate --> write: all 4 gates pass
    validate --> repair: any gate fails
    repair --> generate: fix instructions
    write --> [*]: puzzle.json + solution.ts
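
The state machine above can be sketched as a plain loop. This is a minimal illustration, not the generator’s actual code: `runLoop`, the `Issue` and `Proposal` shapes, the stub generator, and the attempt cap of 4 are all assumptions, and a real model call would be asynchronous.

```typescript
// Hypothetical shapes: a proposal from the model and one validator finding.
type Proposal = { id: string };
type Issue = { gate: string; fix: string };

type Generate = (feedback: Issue[]) => Proposal; // stands in for the model call
type Validate = (proposal: Proposal) => Issue[]; // stands in for all gates

// Generate, validate, and feed issues back until a proposal passes or attempts run out.
function runLoop(
  generate: Generate,
  validate: Validate,
  maxAttempts: number,
): { proposal: Proposal; attempts: number } | null {
  let feedback: Issue[] = [];
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const proposal = generate(feedback);
    const issues = validate(proposal);
    if (issues.length === 0) return { proposal, attempts: attempt }; // the "write" state
    feedback = issues; // the "repair" state: fix instructions go back to the model
  }
  return null; // the loop gave up; rerun with a different hint
}

// Stub model: gets the id wrong once, then corrects it when given feedback.
const stubGenerate: Generate = (feedback) =>
  feedback.length === 0 ? { id: "Hidden Stash" } : { id: "hidden-stash" };

// Stub validator with a single gate.
const stubValidate: Validate = (p) =>
  /^[a-z][a-z0-9-]*$/.test(p.id)
    ? []
    : [{ gate: "schema", fix: `id must be a kebab-case slug, you wrote "${p.id}"` }];

const loopResult = runLoop(stubGenerate, stubValidate, 4);
console.log(loopResult);
```

The only state carried between attempts is the issue list, which keeps the repair prompt short and specific.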

The validator runs four checks:

Schema and quality. Does the JSON parse? Right types? Is the id a kebab-case slug? Are there distinct test names? Early versions returned raw Zod errors, and the model burned attempts on things like “id must be a string.” Now each failure class gets its own diagnosis: “id must be a kebab-case slug starting with a letter, here’s what you wrote, fix it.” The difference between “ZodError: invalid string” and a targeted fix instruction is the difference between a 4-attempt failure and a 2-attempt pass.

Reference solution. Run the model’s proposed answer through the production sandbox against every visible and hidden test case. If it fails, the model’s own puzzle defeats it. This catches most problems immediately.

Collapse. The model also proposes a brute-force attempt (the obvious naive solution). The hidden tests must defeat it. Either wrong answer on an edge case, or timeout on a large adversarial input. If the brute force passes everything, there’s no puzzle. A correct solution to a trivial problem is busywork.

Technique diversity. The model records which technique each puzzle teaches. Before generating, a prompt constraint scans the technique tags of the last 7 puzzles and tells the model not to repeat them. A separate gate then rejects any candidate whose technique still matches one of those recent puzzles.
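
Two of these checks are small enough to sketch directly: the targeted id diagnosis from the schema gate, and the collapse gate. The function names, the `GateIssue` shape, the exact slug regex, and the wording of the fix messages are assumptions; the real validator presumably runs the brute force in the sandbox to get its outcomes.

```typescript
// Hypothetical finding shape: which gate failed and what the model should do about it.
type GateIssue = { gate: string; fix: string };

// Assumed slug rule: lowercase letter first, then letters/digits in hyphen-separated groups.
const SLUG = /^[a-z][a-z0-9]*(-[a-z0-9]+)*$/;

// Schema gate, id only: say what was wrong and echo what the model wrote.
function checkId(id: unknown): GateIssue[] {
  if (typeof id !== "string") {
    return [{ gate: "schema", fix: `id must be a string, got ${typeof id}` }];
  }
  if (!SLUG.test(id)) {
    return [{
      gate: "schema",
      fix: `id must be a kebab-case slug starting with a letter; you wrote "${id}", fix it`,
    }];
  }
  return [];
}

// Outcome of running the brute-force attempt against one hidden test.
type Outcome = "pass" | "wrong-answer" | "timeout";

// Collapse gate: at least one hidden test must defeat the brute force.
function checkCollapse(bruteForceOutcomes: Outcome[]): GateIssue[] {
  const defeated = bruteForceOutcomes.some((o) => o !== "pass");
  if (defeated) return [];
  return [{
    gate: "collapse",
    fix: "the brute force passes every hidden test; add an edge case or a large adversarial input that defeats it",
  }];
}

console.log(checkId("Hidden Stash"));
console.log(checkCollapse(["pass", "pass", "timeout"]));
```

Note that an empty outcome list also fails the collapse gate: with no hidden tests, nothing defeats the brute force.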

The loop catches things the model doesn’t know it’s doing wrong. The biggest one is array inlining. When a puzzle needs a large adversarial test case (10,000 integers to force an O(n) vs O(n²) collapse), the model writes all 10,000 values as a literal JSON array. The payload hits 300 KB, and the model has spent its token budget on array elements instead of puzzle design.

The fix is compact array notation. Three sentinels are expanded before the tests run, among them ["__RPT__", n, value] for repeats and ["__RNG__", start, end] for ranges. A pre-expansion gate rejects any literal array over 50 elements. The model gets told “you inlined a 10,000-element array, use RNG instead” and tries again.

This sounds like a trivial formatting rule. It is. The model won’t follow it unless it’s enforced. LLMs are pattern-completers, and the pattern they’ve seen in training is “test cases have concrete values.” The gate makes that pattern fail until the model cooperates.
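
A minimal sketch of the expansion and the pre-expansion gate, covering only the two sentinels named above. The function names are hypothetical, and treating `__RNG__` as a half-open range (start included, end excluded) is an assumption; the text doesn’t say which convention the real expander uses.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

const MAX_LITERAL = 50; // literal arrays longer than this are rejected

// Pre-expansion gate: report every literal array over the limit, with its path.
function findOversized(value: Json, path: string = "$"): string[] {
  if (Array.isArray(value)) {
    const here = value.length > MAX_LITERAL
      ? [`${path}: you inlined a ${value.length}-element array, use a sentinel instead`]
      : [];
    return here.concat(value.flatMap((v, i) => findOversized(v, `${path}[${i}]`)));
  }
  if (value !== null && typeof value === "object") {
    return Object.entries(value).flatMap(([k, v]) => findOversized(v, `${path}.${k}`));
  }
  return [];
}

// Expand sentinels anywhere inside a JSON value before the tests run.
function expand(value: Json): Json {
  if (Array.isArray(value)) {
    const [tag, a, b] = value;
    if (value.length === 3 && tag === "__RPT__" && typeof a === "number" && b !== undefined) {
      const item = expand(b);
      return Array.from({ length: a }, () => item); // n copies of value
    }
    if (value.length === 3 && tag === "__RNG__" && typeof a === "number" && typeof b === "number") {
      // Assumption: half-open range [start, end)
      return Array.from({ length: Math.max(0, b - a) }, (_, i) => a + i);
    }
    return value.map((v) => expand(v));
  }
  if (value !== null && typeof value === "object") {
    const out: { [key: string]: Json } = {};
    for (const [k, v] of Object.entries(value)) out[k] = expand(v);
    return out;
  }
  return value;
}

const compactInput: Json = { nums: ["__RNG__", 0, 10000], pad: ["__RPT__", 3, 7] };
const expandedInput = expand(compactInput);
console.log(findOversized(compactInput).length, JSON.stringify(expandedInput).length);
```

The gate runs on the compact form, so a 10,000-element range passes as a three-element array while the same values written out literally would be rejected.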

The second thing the loop catches is catalogue-level technique drift. When I first ran the generator without diversity constraints, it produced three binary-search puzzles in a row. Different titles, different prompts, identical underlying algorithm. The model has no memory of the last puzzle it generated. The technique tags, recorded per puzzle and checked before each generation, give it that memory.
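
The memory described here amounts to a sliding window over recorded tags. A minimal sketch, assuming the history is a list of technique tags ordered oldest to newest; the function names and message wording are hypothetical.

```typescript
const RECENT_WINDOW = 7; // the last 7 puzzles, per the diversity check

// Techniques that may not be reused yet.
function recentTechniques(history: string[], window: number = RECENT_WINDOW): Set<string> {
  // slice(-0) would return the whole array, so guard the zero case explicitly
  const recent = window > 0 ? history.slice(-window) : [];
  return new Set(recent);
}

// Returns a fix instruction if the candidate repeats a recent technique, else null.
function diversityIssue(candidate: string, history: string[]): string | null {
  if (!recentTechniques(history).has(candidate)) return null;
  return `technique "${candidate}" was used in the last ${RECENT_WINDOW} puzzles; pick a different one`;
}

const driftHistory = ["binary search", "binary search", "binary search"];
console.log(diversityIssue("binary search", driftHistory));
console.log(diversityIssue("prefix sum", driftHistory));
```

The same set can be rendered into the prompt before generation and checked again afterwards, so the constraint and the gate never disagree about what counts as recent.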

Most puzzles close in 2 attempts. About 1 in 5 fails entirely and I run the generator again with a different pattern hint. The ones that ship are the ones that survive the loop. Hidden Stash is a prefix-sum problem wrapped in a story about stashing contraband in an apartment block. Garden Plots is a dynamic programming problem wrapped in a story about planting flowers. The titles don’t name the technique. The prompts don’t either. The classification exists in the authoring metadata, stripped before the puzzle reaches the client.

The site doesn’t know any of this exists. It just reads JSON files from content/puzzles/. The generator is a script that’s never imported by the app. It produces artifacts for the real system to consume.