Qwythos-9B: a 9B that checks its own work
Our biggest open-weights release yet — a full-parameter reasoning model distilled from Claude Mythos 5, with a 1M-token context, native tool use, and a +34-point MMLU jump over its base. Here's what's in it, the honest benchmark table, and how to run it.

We just shipped Qwythos-9B-Claude-Mythos-5-1M — our biggest open-weights model to date, and the new flagship over on Hugging Face. It's a full-parameter reasoning model built on a deeply uncensored Qwen3.5-9B base, post-trained on north of 500 million tokens of Claude Mythos and Claude Fable traces, with the chain-of-thought generated in-house by our rethink tool.
The short version: it reasons before it answers, ships with a 1-million-token context window out of the box, calls tools natively, and — the part I'm proudest of — checks its own specifics with those tools instead of guessing. Apache-2.0. Weights and GGUF builds are up now.
What's in it
- Base: a deeply uncensored Qwen3.5-9B — dense, with a hybrid attention stack (3:1 Gated-DeltaNet linear-attention to full attention).
- Training: full-parameter SFT, assistant-only loss, a two-phase curriculum (broad reasoning corpus → focused agentic + coding). bf16, paged 8-bit AdamW, no truncation.
- Data: 500M+ tokens of Claude Mythos and Claude Fable traces. The chain-of-thought is structured by
rethink, our in-house CoT tool, so the model learns to walk hypothesis → verification → conclusion before it commits to an answer. - License: Apache-2.0, inherited from the base.
A million tokens, by default
Qwythos ships with YaRN rope-scaling already wired into config.json — factor 4.0 over the native 262,144-token architecture, for a full 1,048,576-token window with no flag to flip and no separate tokenizer:
"rope_parameters": {
"rope_type": "yarn",
"factor": 4.0,
"original_max_position_embeddings": 262144
},
"max_position_embeddings": 1048576
This is Qwen's own official 1M recipe. What it unlocks in practice: whole-codebase reasoning without RAG chunking, long agentic trajectories with verbose tool output, and multi-document research that fits a dozen papers plus your draft in a single prompt.
One practical note: the hybrid Gated-DeltaNet stack keeps memory growth sub-quadratic below ~256k tokens, so a single H100/H200 comfortably handles 256k–512k; the full 1M wants tensor-parallel or aggressive KV-cache offload. YaRN trades a little short-context fidelity for the range — if you never go past the native 262k and want maximum sharpness, there's a config.json.pre_yarn backup to restore.
It uses tools — and corrects itself
Function calling works out of the box per Qwen3.5's spec. Pass tools=[...] to the chat template and the model emits valid <tool_call> blocks with the required parameters honored — no wrapper, no tool-specific fine-tune.
We ran a 7-prompt harness mixing capability demos with deliberately hard, closed-book facts where sampling-from-memory usually fails. Seven of seven succeeded. A few I think matter:
- Count the primes below 100,000. It didn't recall a figure — it wrote a primality test, ran it in the Python executor, and reported 9,592.
- What's the hashcat mode for a Kerberos TGS-REP ticket? The first search came back muddy. The model judged the results insufficient, refined its own query, and confirmed
-m 13100across multiple sources. - Is physostigmine indicated for organophosphate poisoning? It searched authoritative toxicology sources and got the safety-critical answer right: no — it's contraindicated; physostigmine is for the anticholinergic toxidrome. Getting that one wrong in the real world hurts someone.
That last example is the whole thesis. A 9B that knows when to look something up beats a much bigger model that confidently invents it. Full transcripts — every reasoning step, every tool call, every result — are in evals/tool_test_outputs.md.
The numbers (the honest table)
Same harness (lm-evaluation-harness), same sampling, same prompts, against the base:
| Task | Metric | Base Qwen3.5-9B | Qwythos-9B | Δ |
|---|---|---|---|---|
| gsm8k | exact match (flexible) | 0.670 | 0.860 | +0.190 |
| gsm8k | exact match (strict) | 0.510 | 0.810 | +0.300 |
| mmlu | acc | 0.232 | 0.575 | +0.343 |
| arc_challenge | acc | 0.470 | 0.490 | +0.020 |
| arc_challenge | acc_norm | 0.400 | 0.410 | +0.010 |
| gpqa_diamond | exact match (flexible) | 0.630 | 0.580 | −0.050 |
The MMLU +34.3 is the headline — 0.575 mean across all 57 subjects, peaking around 0.78 on government/politics, 0.77 on college biology, 0.74 on conceptual physics. gsm8k-strict is up 30 points.
Not everything went up: gpqa-diamond slipped five points and arc-challenge was roughly flat. We publish the full table anyway, because hiding a regression is how you lose people's trust. Absolute MMLU for any 9B is sensitive to harness and few-shot count; what matters in this comparison is that both models were measured under identical settings.
Uncensored, on purpose
Qwythos inherits a deeply uncensored base and we kept it that way. It's built to engage seriously with technically demanding questions across cybersecurity, red-team methodology, biology, pharmacology and clinical medicine — the domains where over-aligned models refuse, hedge into uselessness, or bury the real answer under disclaimer boilerplate. That's a deliberate research choice. If you're putting it in front of end users, add your own application-level review layer.
Run it
GGUF builds are up for llama.cpp / Ollama / LM Studio if you just want to pull and chat. To serve at long context:
# vLLM
vllm serve empero-ai/Qwythos-9B-Claude-Mythos-5-1M --max-model-len 1010000
# SGLang
SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 python -m sglang.launch_server \
--model-path empero-ai/Qwythos-9B-Claude-Mythos-5-1M --context-length 1010000
It's a reasoning model, so give it room and don't decode greedily:
gen_kwargs = dict(
temperature=0.6, top_p=0.95, top_k=20,
repetition_penalty=1.05,
max_new_tokens=16384,
)
At greedy or very-low-temperature (T ≤ 0.3) it can fall into repetition loops on long generations — a known reasoning-model failure mode; 0.6 cleanly avoids it. Every answer opens with a <think> block, so strip that span before showing it to end users. You'll also want the Gated-DeltaNet kernels (flash-linear-attention plus a CUDA-matched causal_conv1d), or the linear-attention layers fall back to slow PyTorch. It's a text-only fine-tune; the base is multimodal but we only trained the text path.
Get it
- Weights + model card: Qwythos-9B-Claude-Mythos-5-1M
- GGUF: Qwythos-9B-Claude-Mythos-5-1M-GGUF
- Full eval transcripts: tool_test_outputs.md
- The rest of the lab: empero.org
If you build something with it, tell us. And if you want the next drop in your inbox, the dispatch sign-up is on the home page.
— kodee