empero research lab
Writing / Research · 19.06.26 · 7 min · kodee

Qwythos-9B: a 9B that checks its own work

Our biggest open-weights release yet — a full-parameter reasoning model distilled from Claude Mythos 5, with a 1M-token context, native tool use, and a +34-point MMLU jump over its base. Here's what's in it, the honest benchmark table, and how to run it.

We just shipped Qwythos-9B-Claude-Mythos-5-1M — our biggest open-weights model to date, and the new flagship over on Hugging Face. It's a full-parameter reasoning model built on a deeply uncensored Qwen3.5-9B base, post-trained on north of 500 million tokens of Claude Mythos and Claude Fable traces, with the chain-of-thought generated in-house by our rethink tool.

The short version: it reasons before it answers, ships with a 1-million-token context window out of the box, calls tools natively, and — the part I'm proudest of — checks its own specifics with those tools instead of guessing. Apache-2.0. Weights and GGUF builds are up now.

What's in it

  • Base: a deeply uncensored Qwen3.5-9B. Dense, with a hybrid attention stack (a 3:1 ratio of Gated-DeltaNet linear-attention layers to full-attention layers).
  • Training: full-parameter SFT, assistant-only loss, a two-phase curriculum (broad reasoning corpus → focused agentic + coding). bf16, paged 8-bit AdamW, no truncation.
  • Data: 500M+ tokens of Claude Mythos and Claude Fable traces. The chain-of-thought is structured by rethink, our in-house CoT tool, so the model learns to walk hypothesis → verification → conclusion before it commits to an answer.
  • License: Apache-2.0, inherited from the base.
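To make "assistant-only loss" concrete, here is a minimal sketch of the usual label-masking trick: tokens outside the assistant turns get the ignore index, so they contribute nothing to the cross-entropy. This is an illustration, not our training code; the function name, the token ids and the per-token role tags are all made up for the example.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def mask_non_assistant(token_ids, roles):
    """Return labels in which only assistant tokens count toward the loss.

    token_ids and roles are parallel lists: one (hypothetical) role tag per token.
    """
    if len(token_ids) != len(roles):
        raise ValueError("token_ids and roles must have the same length")
    return [
        tok if role == "assistant" else IGNORE_INDEX
        for tok, role in zip(token_ids, roles)
    ]


# Toy example with invented token ids: one system token, two user tokens,
# then four assistant tokens.
token_ids = [11, 12, 13, 21, 22, 23, 24]
roles = ["system", "user", "user",
         "assistant", "assistant", "assistant", "assistant"]

labels = mask_non_assistant(token_ids, roles)
print(labels)
```

The system and user positions come back as -100, so the gradient only flows from the tokens the model itself is supposed to produce.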

A million tokens, by default

Qwythos ships with YaRN rope-scaling already wired into config.json — factor 4.0 over the native 262,144-token architecture, for a full 1,048,576-token window with no flag to flip and no separate tokenizer:

"rope_parameters": {
  "rope_type": "yarn",
  "factor": 4.0,
  "original_max_position_embeddings": 262144
},
"max_position_embeddings": 1048576

This is Qwen's own official 1M recipe. What it unlocks in practice: whole-codebase reasoning without RAG chunking, long agentic trajectories with verbose tool output, and multi-document research that fits a dozen papers plus your draft in a single prompt.
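If you want to sanity-check the arithmetic in that config fragment, a few lines of Python will do it. This sketch only re-reads the values shown above; `scaled_window` is a helper name invented for the example, not something in the repo.

```python
import json

# The fragment from config.json shown above, wrapped in braces so it parses.
config_fragment = json.loads("""
{
  "rope_parameters": {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 262144
  },
  "max_position_embeddings": 1048576
}
""")


def scaled_window(cfg):
    """Context window implied by the YaRN factor and the native length."""
    rope = cfg["rope_parameters"]
    return int(rope["factor"] * rope["original_max_position_embeddings"])


window = scaled_window(config_fragment)
print(window, window == config_fragment["max_position_embeddings"])
```

The scaled window and `max_position_embeddings` should agree; if you edit one by hand, edit the other.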

One practical note: the hybrid Gated-DeltaNet stack keeps memory growth sub-quadratic below ~256k tokens, so a single H100/H200 comfortably handles 256k–512k; the full 1M wants tensor-parallel or aggressive KV-cache offload. YaRN trades a little short-context fidelity for the range — if you never go past the native 262k and want maximum sharpness, there's a config.json.pre_yarn backup to restore.

It uses tools — and corrects itself

Function calling works out of the box per Qwen3.5's spec. Pass tools=[...] to the chat template and the model emits valid <tool_call> blocks with the required parameters honored — no wrapper, no tool-specific fine-tune.
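On the receiving end you still have to pull the calls out of the generated text. Here is a minimal sketch, assuming the body of each <tool_call> block is a JSON object with "name" and "arguments" keys, as in earlier Qwen chat templates. If your template emits a different inner format, the parsing step needs to change. The `web_search` tool and the sample output are invented for the example.

```python
import json
import re

# Non-greedy match so several <tool_call> blocks in one reply are kept apart.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def extract_tool_calls(text):
    """Return the tool calls found in a model reply as a list of dicts."""
    calls = []
    for payload in TOOL_CALL_RE.findall(text):
        try:
            call = json.loads(payload)
        except json.JSONDecodeError:
            continue  # skip blocks whose body is not valid JSON
        if isinstance(call, dict) and "name" in call:
            calls.append(call)
    return calls


# Hypothetical model output, written by hand for this example.
sample_output = (
    "I should check this rather than guess.\n"
    "<tool_call>\n"
    '{"name": "web_search", "arguments": {"query": "hashcat TGS-REP mode"}}\n'
    "</tool_call>"
)

calls = extract_tool_calls(sample_output)
print(calls)
```

Dispatching each call to a real function, and feeding the result back as a tool message, is up to your agent loop.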

We ran a 7-prompt harness mixing capability demos with deliberately hard, closed-book facts where sampling-from-memory usually fails. Seven of seven succeeded. A few I think matter:

  • Count the primes below 100,000. It didn't recall a figure — it wrote a primality test, ran it in the Python executor, and reported 9,592.
  • What's the hashcat mode for a Kerberos TGS-REP ticket? The first search came back muddy. The model judged the results insufficient, refined its own query, and confirmed -m 13100 across multiple sources.
  • Is physostigmine indicated for organophosphate poisoning? It searched authoritative toxicology sources and got the safety-critical answer right: no — it's contraindicated; physostigmine is for the anticholinergic toxidrome. Getting that one wrong in the real world hurts someone.

That last example is the whole thesis. A 9B that knows when to look something up beats a much bigger model that confidently invents it. Full transcripts — every reasoning step, every tool call, every result — are in evals/tool_test_outputs.md.
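The prime count is easy to verify yourself. The sketch below is not the code the model wrote (that is in the transcripts); it swaps the per-number primality test for a Sieve of Eratosthenes, which is the shorter way to get the same count.

```python
def count_primes_below(limit):
    """Count primes strictly below limit with a Sieve of Eratosthenes."""
    if limit < 2:
        return 0
    is_prime = bytearray([1]) * limit  # is_prime[n] == 1 means "n is still a candidate"
    is_prime[0] = is_prime[1] = 0
    p = 2
    while p * p < limit:
        if is_prime[p]:
            # Cross out every multiple of p, starting at p*p.
            is_prime[p * p::p] = bytes(len(range(p * p, limit, p)))
        p += 1
    return sum(is_prime)


print(count_primes_below(100_000))  # 9592
```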

The numbers (the honest table)

Same harness (lm-evaluation-harness), same sampling, same prompts, against the base:

Task           Metric                  Base Qwen3.5-9B  Qwythos-9B  Δ
gsm8k          exact match (flexible)  0.670            0.860       +0.190
gsm8k          exact match (strict)    0.510            0.810       +0.300
mmlu           acc                     0.232            0.575       +0.343
arc_challenge  acc                     0.470            0.490       +0.020
arc_challenge  acc_norm                0.400            0.410       +0.010
gpqa_diamond   exact match (flexible)  0.630            0.580       −0.050

The MMLU gain of 34.3 points is the headline: a 0.575 mean across all 57 subjects, peaking around 0.78 on government/politics, 0.77 on college biology, 0.74 on conceptual physics. gsm8k-strict is up 30 points.

Not everything went up: gpqa-diamond slipped five points and arc-challenge was roughly flat. We publish the full table anyway, because hiding a regression is how you lose people's trust. Absolute MMLU for any 9B is sensitive to harness and few-shot count; what matters in this comparison is that both models were measured under identical settings.
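If you'd rather recompute the deltas than take the last column on faith, this short sketch does it from the scores in the table. The numbers are copied from above; nothing here reruns the evals.

```python
# (task, metric, base score, Qwythos score), copied from the table above.
rows = [
    ("gsm8k", "exact match (flexible)", 0.670, 0.860),
    ("gsm8k", "exact match (strict)", 0.510, 0.810),
    ("mmlu", "acc", 0.232, 0.575),
    ("arc_challenge", "acc", 0.470, 0.490),
    ("arc_challenge", "acc_norm", 0.400, 0.410),
    ("gpqa_diamond", "exact match (flexible)", 0.630, 0.580),
]

# Round to three decimals to match the precision of the published scores.
deltas = {
    (task, metric): round(tuned - base, 3)
    for task, metric, base, tuned in rows
}

for (task, metric), delta in deltas.items():
    print(f"{task:<14} {metric:<24} {delta:+.3f}")
```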

Uncensored, on purpose

Qwythos inherits a deeply uncensored base and we kept it that way. It's built to engage seriously with technically demanding questions across cybersecurity, red-team methodology, biology, pharmacology and clinical medicine — the domains where over-aligned models refuse, hedge into uselessness, or bury the real answer under disclaimer boilerplate. That's a deliberate research choice. If you're putting it in front of end users, add your own application-level review layer.

Run it

GGUF builds are up for llama.cpp / Ollama / LM Studio if you just want to pull and chat. To serve at long context:

# vLLM
vllm serve empero-ai/Qwythos-9B-Claude-Mythos-5-1M --max-model-len 1010000

# SGLang
SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 python -m sglang.launch_server \
  --model-path empero-ai/Qwythos-9B-Claude-Mythos-5-1M --context-length 1010000

It's a reasoning model, so give it room and don't decode greedily:

gen_kwargs = dict(
    temperature=0.6, top_p=0.95, top_k=20,
    repetition_penalty=1.05,
    max_new_tokens=16384,
)

With greedy decoding or a very low temperature (T ≤ 0.3), it can fall into repetition loops on long generations, a known reasoning-model failure mode; 0.6 cleanly avoids it. Every answer opens with a <think> block, so strip that span before showing the output to end users. You'll also want the Gated-DeltaNet kernels (flash-linear-attention plus a CUDA-matched causal_conv1d), or the linear-attention layers fall back to slow PyTorch. It's a text-only fine-tune; the base is multimodal, but we only trained the text path.
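Stripping the reasoning span is a one-regex job. Here is a minimal sketch, assuming the decoded text contains both the opening <think> and the closing </think> tag; `strip_think` is a name made up for the example, and the sample text is hand-written.

```python
import re

# Non-greedy, DOTALL: the reasoning span usually runs over many lines.
THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def strip_think(text):
    """Remove <think>...</think> spans and return only the visible answer."""
    visible = THINK_RE.sub("", text)
    # If generation was cut off mid-thought there is no closing tag:
    # drop everything from the dangling <think> onward.
    open_idx = visible.find("<think>")
    if open_idx != -1:
        visible = visible[:open_idx]
    return visible.strip()


raw = (
    "<think>\nHypothesis: 9,592. Verify with code first.\n</think>\n\n"
    "There are 9,592 primes below 100,000."
)
print(strip_think(raw))
```

If your serving stack puts the opening tag in the prompt and returns only the closing one, adjust the pattern accordingly. An empty result after stripping usually means max_new_tokens ran out before the model finished thinking.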

Get it

If you build something with it, tell us. And if you want the next drop in your inbox, the dispatch sign-up is on the home page.

— kodee
