Qwythos-9B: a 9B that checks its own work

We just shipped Qwythos-9B-Claude-Mythos-5-1M — our biggest open-weights model to date, and the new flagship over on Hugging Face. It's a full-parameter reasoning model built on a deeply uncensored Qwen3.5-9B base, post-trained on north of 500 million tokens of Claude Mythos and Claude Fable traces, with the chain-of-thought generated in-house by our rethink tool.

The short version: it reasons before it answers, ships with a 1-million-token context window out of the box, calls tools natively, and — the part I'm proudest of — checks its own specifics with those tools instead of guessing. Apache-2.0. Weights and GGUF builds are up now.

What's in it

Base: a deeply uncensored Qwen3.5-9B — dense, with a hybrid attention stack (3:1 Gated-DeltaNet linear-attention to full attention).
Training: full-parameter SFT, assistant-only loss, a two-phase curriculum (broad reasoning corpus → focused agentic + coding). bf16, paged 8-bit AdamW, no truncation.
Data: 500M+ tokens of Claude Mythos and Claude Fable traces. The chain-of-thought is structured by rethink, our in-house CoT tool, so the model learns to walk hypothesis → verification → conclusion before it commits to an answer.
License: Apache-2.0, inherited from the base.

A million tokens, by default

Qwythos ships with YaRN rope-scaling already wired into config.json — factor 4.0 over the native 262,144-token architecture, for a full 1,048,576-token window with no flag to flip and no separate tokenizer:

"rope_parameters": {
  "rope_type": "yarn",
  "factor": 4.0,
  "original_max_position_embeddings": 262144
},
"max_position_embeddings": 1048576

This is Qwen's own official 1M recipe. What it unlocks in practice: whole-codebase reasoning without RAG chunking, long agentic trajectories with verbose tool output, and multi-document research that fits a dozen papers plus your draft in a single prompt.

One practical note: the hybrid Gated-DeltaNet stack keeps memory growth sub-quadratic below ~256k tokens, so a single H100/H200 comfortably handles 256k–512k; the full 1M wants tensor-parallel or aggressive KV-cache offload. YaRN trades a little short-context fidelity for the range — if you never go past the native 262k and want maximum sharpness, there's a config.json.pre_yarn backup to restore.

It uses tools — and corrects itself

Function calling works out of the box per Qwen3.5's spec. Pass tools=[...] to the chat template and the model emits valid <tool_call> blocks with the required parameters honored — no wrapper, no tool-specific fine-tune.

We ran a 7-prompt harness mixing capability demos with deliberately hard, closed-book facts where sampling-from-memory usually fails. Seven of seven succeeded. A few I think matter:

Count the primes below 100,000. It didn't recall a figure — it wrote a primality test, ran it in the Python executor, and reported 9,592.
What's the hashcat mode for a Kerberos TGS-REP ticket? The first search came back muddy. The model judged the results insufficient, refined its own query, and confirmed -m 13100 across multiple sources.
Is physostigmine indicated for organophosphate poisoning? It searched authoritative toxicology sources and got the safety-critical answer right: no — it's contraindicated; physostigmine is for the anticholinergic toxidrome. Getting that one wrong in the real world hurts someone.

That last example is the whole thesis. A 9B that knows when to look something up beats a much bigger model that confidently invents it. Full transcripts — every reasoning step, every tool call, every result — are in evals/tool_test_outputs.md.

The numbers (the honest table)

Same harness (lm-evaluation-harness), same sampling, same prompts, against the base:

Task	Metric	Base Qwen3.5-9B	Qwythos-9B	Δ
gsm8k	exact match (flexible)	0.670	0.860	+0.190
gsm8k	exact match (strict)	0.510	0.810	+0.300
mmlu	acc	0.232	0.575	+0.343
arc_challenge	acc	0.470	0.490	+0.020
arc_challenge	acc_norm	0.400	0.410	+0.010
gpqa_diamond	exact match (flexible)	0.630	0.580	−0.050

The MMLU +34.3 is the headline — 0.575 mean across all 57 subjects, peaking around 0.78 on government/politics, 0.77 on college biology, 0.74 on conceptual physics. gsm8k-strict is up 30 points.

Not everything went up: gpqa-diamond slipped five points and arc-challenge was roughly flat. We publish the full table anyway, because hiding a regression is how you lose people's trust. Absolute MMLU for any 9B is sensitive to harness and few-shot count; what matters in this comparison is that both models were measured under identical settings.

Uncensored, on purpose

Qwythos inherits a deeply uncensored base and we kept it that way. It's built to engage seriously with technically demanding questions across cybersecurity, red-team methodology, biology, pharmacology and clinical medicine — the domains where over-aligned models refuse, hedge into uselessness, or bury the real answer under disclaimer boilerplate. That's a deliberate research choice. If you're putting it in front of end users, add your own application-level review layer.

Run it

GGUF builds are up for llama.cpp / Ollama / LM Studio if you just want to pull and chat. To serve at long context:

# vLLM
vllm serve empero-ai/Qwythos-9B-Claude-Mythos-5-1M --max-model-len 1010000

# SGLang
SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 python -m sglang.launch_server \
  --model-path empero-ai/Qwythos-9B-Claude-Mythos-5-1M --context-length 1010000

It's a reasoning model, so give it room and don't decode greedily:

gen_kwargs = dict(
    temperature=0.6, top_p=0.95, top_k=20,
    repetition_penalty=1.05,
    max_new_tokens=16384,
)

At greedy or very-low-temperature (T ≤ 0.3) it can fall into repetition loops on long generations — a known reasoning-model failure mode; 0.6 cleanly avoids it. Every answer opens with a <think> block, so strip that span before showing it to end users. You'll also want the Gated-DeltaNet kernels (flash-linear-attention plus a CUDA-matched causal_conv1d), or the linear-attention layers fall back to slow PyTorch. It's a text-only fine-tune; the base is multimodal but we only trained the text path.

Get it

Weights + model card: Qwythos-9B-Claude-Mythos-5-1M
GGUF: Qwythos-9B-Claude-Mythos-5-1M-GGUF
Full eval transcripts: tool_test_outputs.md
The rest of the lab: empero.org

If you build something with it, tell us. And if you want the next drop in your inbox, the dispatch sign-up is on the home page.

— kodee