How I distilled Claude 4.6 Opus reasoning into a 24B code model — and every wall I hit along the way.
Devstral-Small-2-24B is Mistral's code-focused model. It's excellent at writing, reading, and reasoning about code. Claude 4.6 Opus is (as of early 2026) one of the best reasoning models on the planet. What happens if you combine them — taking Devstral's code specialisation and teaching it Claude's extended <think>...</think> reasoning style?
Nobody had published a Claude-distilled Devstral when I started this. First mover.
The hardware: a single RTX 3090 (24GB VRAM), 96GB DDR5 RAM, Pop!_OS. No cloud, no A100s, no budget beyond electricity.
This post is the honest account of what it actually took — the plan, the failures, the specific bugs, and the working solution.
Classic distillation copies a teacher model's probability distributions into a student. What we're doing is simpler and more practical: supervised fine-tuning (SFT) on teacher-generated data.
We take a dataset where Claude 4.6 Opus answered questions with its full reasoning trace exposed — the <think> block — and train Devstral to reproduce that pattern. The student never sees the teacher's weights, just its outputs. It's the same approach Jackrong used to build the most-downloaded reasoning distill on HuggingFace (436k downloads/month for Qwen3.5-27B).
The format we train on:
```
[INST]What is the derivative of x³?[/INST]<think>
The power rule states that d/dx[xⁿ] = n·xⁿ⁻¹.
So for x³, n=3, giving 3x².
</think>
The derivative of x³ is **3x²**.
```
Only the assistant turn (everything after [/INST]) is trained on. The instruction is masked. This is critical — you don't want the model to learn to reproduce prompts.
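What that masking looks like at the token level can be sketched directly. This is an illustration with made-up token IDs, not Unsloth's actual implementation: every label up to and including the response marker gets the ignore index, so cross-entropy only scores the assistant turn.

```python
# Sketch: mask every label through the last response marker, so the
# loss is computed only on the assistant turn. Token IDs are made up.
IGNORE_INDEX = -100  # label value that cross-entropy skips

def mask_instruction(input_ids, marker_ids):
    """Return labels with everything through the last marker masked out."""
    labels = list(input_ids)
    n = len(marker_ids)
    start = -1
    for i in range(len(input_ids) - n + 1):
        if input_ids[i:i + n] == marker_ids:
            start = i + n  # first label position after the marker
    if start == -1:
        return [IGNORE_INDEX] * len(labels)  # no marker found: mask all
    for i in range(start):
        labels[i] = IGNORE_INDEX
    return labels

# instruction tokens … [/INST] marker (9, 9) … assistant tokens
print(mask_instruction([1, 2, 3, 9, 9, 4, 5], [9, 9]))
# → [-100, -100, -100, -100, -100, 4, 5]
```

Unsloth's `train_on_responses_only` does this boundary-finding for you; the sketch just shows what ends up in the labels tensor.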
nohurry/Opus-4.6-Reasoning-3000x-filtered — 2,326 samples of Claude 4.6 Opus responses with full thinking traces, filtered for quality. Free on HuggingFace.
Why this one over alternatives?
- The filtered variant removes low-quality/trivial samples.
- It exposes clean `problem`, `thinking`, and `solution` fields to work with.

After length filtering (removing samples where thinking + solution exceeds 20,000 characters), we end up with 2,324 training samples. At 3 epochs, that's 1,743 training steps at an effective batch size of 4.
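The step count follows directly from those numbers; a quick sanity check:

```python
import math

samples, epochs, eff_batch = 2324, 3, 4
steps_per_epoch = math.ceil(samples / eff_batch)  # 2324 / 4 = 581 exactly
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # → 581 1743
```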
This is where things get complicated, and understanding it saves hours.
Devstral-Small-2-24B is published as Devstral-Small-2-24B-Instruct-2512 on HuggingFace. Download it, look at the config, and you'll see:
```json
{
  "architectures": ["Mistral3ForConditionalGeneration"],
  "model_type": "mistral3",
  "vision_config": { ... },
  "text_config": { ... }
}
```
It's a multimodal VLM. Devstral is built on Mistral Small 3.1, which is a vision-language model with a Pixtral vision encoder grafted on. Even though Devstral is used purely for code (no images), the vision components are baked into the published model architecture.
The instruct weights are also published in FP8 quantisation — a format that requires compute capability 8.9+ (L4, H100, etc.). An RTX 3090 is compute capability 8.6. So you can't load the official instruct weights at all.
Mistral also provides a BF16 base variant. I used that, dequantized the FP8 instruct weights on top, and extracted the text-only components.
The FP8 instruct model needs to be converted to BF16 before an RTX 3090 can use it. The process reads each safetensors shard, finds the FP8 tensors and their scale factors, and dequantizes:
```python
# For each FP8 tensor: dequantize with its stored inverse scale,
# then round-trip to BF16
dequant = tensor.to(torch.float32) * scale_inv
new_tensor = dequant.to(torch.bfloat16)
```
After conversion, update config.json to remove the quantization_config block and set torch_dtype: bfloat16. The BF16 model is ~45GB on disk.
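The config edit is small enough to script. Here is a sketch that assumes the stock Hugging Face layout (a top-level `quantization_config` key and a `torch_dtype` field):

```python
import json

def to_bf16_config(cfg: dict) -> dict:
    """Strip the FP8 quantization block and record the new dtype,
    so transformers loads the shards as plain BF16."""
    cfg = dict(cfg)
    cfg.pop("quantization_config", None)  # no-op if already absent
    cfg["torch_dtype"] = "bfloat16"
    return cfg

# Usage against the dequantized checkpoint:
# with open("Devstral-BF16/config.json") as f:
#     cfg = to_bf16_config(json.load(f))
# with open("Devstral-BF16/config.json", "w") as f:
#     json.dump(cfg, f, indent=2)
```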
This step is the one I didn't expect to need.
When you try to fine-tune the VLM (Mistral3ForConditionalGeneration) with Unsloth on a 24GB GPU, you hit a cascade of problems:
- `device_map='auto'` splits the model: the vision encoder goes to the GPU, and the language model layers get split across GPU and CPU.
- `device_map={'': 0}` → immediate OOM, because the language model plus the vision encoder doesn't fit.
- The vision encoder consumes VRAM the language model needs, pushing 20+ language model layers to CPU, and those CPU layers block training.
The fix: strip the vision components out entirely.
The safetensors weight naming tells you everything:
- `language_model.model.layers.X.*`: transformer layers we want
- `language_model.model.embed_tokens.*`: embeddings we want
- `language_model.lm_head.*`: the LM head we want
- `vision_tower.*`: skip
- `multi_modal_projector.*`: skip

```python
from safetensors.torch import load_file, save_file
import json
import os

os.makedirs("Devstral-textonly", exist_ok=True)

with open("Devstral-BF16/model.safetensors.index.json") as f:
    index = json.load(f)

def remap(name):
    """Map VLM tensor names onto plain Ministral3ForCausalLM names."""
    if name.startswith("language_model.model."):
        return "model." + name[len("language_model.model."):]
    if name.startswith("language_model.lm_head."):
        return "lm_head." + name[len("language_model.lm_head."):]
    if name.startswith("language_model."):
        return name[len("language_model."):]
    return name

new_weight_map = {}
for shard_file in sorted(set(index["weight_map"].values())):
    tensors = load_file(f"Devstral-BF16/{shard_file}")
    new_tensors = {}
    for name, tensor in tensors.items():
        if name.startswith(("vision_tower.", "multi_modal_projector.")):
            continue  # drop vision components entirely
        new_name = remap(name)
        new_tensors[new_name] = tensor
        new_weight_map[new_name] = shard_file
    save_file(new_tensors, f"Devstral-textonly/{shard_file}",
              metadata={"format": "pt"})

# Rebuild the index so the weight map matches the renamed tensors
index["weight_map"] = new_weight_map
with open("Devstral-textonly/model.safetensors.index.json", "w") as f:
    json.dump(index, f, indent=2)
```
Then create a config.json using the text_config from the VLM config, setting "model_type": "ministral3" and "architectures": ["Ministral3ForCausalLM"]. Copy over the tokenizer files.
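Deriving that config can be sketched the same way; this assumes the VLM config's `text_config` block is a complete causal-LM config on its own:

```python
import json

def make_text_only_config(vlm_cfg: dict) -> dict:
    """Derive a causal-LM config from the VLM's text_config block."""
    text_cfg = dict(vlm_cfg["text_config"])
    text_cfg["model_type"] = "ministral3"
    text_cfg["architectures"] = ["Ministral3ForCausalLM"]
    return text_cfg

# Usage:
# with open("Devstral-BF16/config.json") as f:
#     cfg = make_text_only_config(json.load(f))
# with open("Devstral-textonly/config.json", "w") as f:
#     json.dump(cfg, f, indent=2)
```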
The result: a clean Ministral3ForCausalLM with all 363 language model tensors, no vision components, loadable by FastLanguageModel. The text-only model is ~40GB on disk.
Disk space note: The BF16 VLM (45GB) + text-only extraction (40GB) requires ~85GB free. Extract shard by shard and delete source shards as you go if space is tight.
Even with the clean text-only model, loading with load_in_4bit=True on an RTX 3090 causes an OOM crash. The error looks like:
```
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB.
GPU 0 has total capacity 23.56 GiB of which 45 MiB is free.
Including non-PyTorch memory, this process has 23.49 GiB in use.
```
This makes no sense. A 24B model quantised to 4-bit should be ~13GB, well within 24GB.
The root cause: transformers 5.x changed its model loading to use a ThreadPoolExecutor with 4 workers by default. It submits all GPU-assigned tensor materialisation tasks concurrently. Each tensor loads in BF16 first, then bitsandbytes quantises to 4-bit.
With 4 concurrent workers, multiple MLP weight matrices load to GPU simultaneously in BF16. Each MLP layer: gate_proj (335MB) + up_proj (335MB) + down_proj (335MB) = ~1GB BF16 per layer. With several layers in flight at once, VRAM fills before quantisation can free anything — crash.
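The per-layer arithmetic checks out once you plug in the MLP dimensions. Hidden size 5120 and intermediate size 32768 are my assumption here, chosen because they reproduce the 335MB figure:

```python
hidden, intermediate = 5120, 32768  # assumed Devstral MLP dimensions
bytes_bf16 = 2

per_matrix = hidden * intermediate * bytes_bf16  # gate/up/down each
per_layer = 3 * per_matrix
print(per_matrix / 1e6, per_layer / 1e9)  # ≈ 335.5 MB, ≈ 1.0 GB
```

With four workers keeping several such layers in flight at once, a few GB of transient BF16 piles up before quantisation frees anything.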
The fix is one environment variable:
```python
os.environ["HF_DEACTIVATE_ASYNC_LOAD"] = "1"
```
This forces sequential tensor loading: materialize one tensor → quantise to 4-bit → free BF16 → next tensor. Memory stays flat. The 24B model loads using ~13.5GB VRAM, leaving 10GB headroom for training.
Unsloth enables torch.compile by default for speed. On Mistral3/Ministral3 architectures, this crashes immediately with:
```
Unsupported functorch tracing attempt
```
The fused cross-entropy loss in Unsloth uses torch.func.grad internally, which is incompatible with torch.compile for this architecture. Fix:
```python
os.environ["TORCH_COMPILE_DISABLE"] = "1"
```
Set this before any torch or unsloth imports.
Unsloth's SFTTrainer calls fix_untrained_tokens() during initialisation. For VLM-derived models, this hits NotImplementedError: Cannot copy out of meta tensor on meta-device tensors. Simple patch:
```python
import unsloth_zoo.tokenizer_utils as _tku

_orig = _tku.fix_untrained_tokens

def _patched(*args, **kwargs):
    try:
        return _orig(*args, **kwargs)
    except Exception as e:
        print(f"Skipping fix_untrained_tokens: {e}")

_tku.fix_untrained_tokens = _patched
```
Unsloth has a documented bug where trainer.train() with gradient_accumulation_steps >= 4 can produce NaN grad_norm values due to incorrect loss normalisation. Fix:
```python
from unsloth import unsloth_train

stats = unsloth_train(trainer)  # not trainer.train()
```
Training was running fine — then crashed at step 172:
```
torch.OutOfMemoryError: Tried to allocate 1.87 GiB
File "sdpa_dense_backward"
```
The cause: without Flash Attention 2 or xformers, PyTorch falls back to flex_attention, which materialises the full attention matrix for the backward pass. At sequence length 4096, that's a 4096×4096 float32 matrix per attention head; summed over a layer's heads, that accounts for the ~1.87GB allocation in the traceback. With only 5GB of VRAM free during training, it crashes.
Flash Attention 2 requires compute capability 8.9+ (same as FP8). RTX 3090 at 8.6 gets neither. Fix: reduce sequence length and add the expandable allocator:
```python
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
MAX_SEQ_LEN = 2048  # down from 4096
```
At 2048, the backward pass needs ~0.47GB — well within budget.
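The quadratic scaling is easy to verify. The head count of 32 is an assumption about the attention config, but the seq_len² ratio holds regardless:

```python
def attn_matrix_bytes(seq_len, n_heads=32, bytes_f32=4):
    """Full seq x seq float32 attention matrix, summed over heads."""
    return seq_len * seq_len * bytes_f32 * n_heads

gib = 2**30
print(attn_matrix_bytes(4096) / gib)  # → 2.0
print(attn_matrix_bytes(2048) / gib)  # → 0.5
```

Halving the sequence length quarters the backward-pass allocation, consistent with the drop from ~1.87GB to ~0.47GB.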
```python
import os

# All three env vars must be set before ANY torch/unsloth imports
os.environ["TORCH_COMPILE_DISABLE"] = "1"
os.environ["HF_DEACTIVATE_ASYNC_LOAD"] = "1"
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch
from unsloth import FastLanguageModel, unsloth_train
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

# ── Patch fix_untrained_tokens ───────────────────────────────────────────────
import unsloth_zoo.tokenizer_utils as _tku

_orig = _tku.fix_untrained_tokens

def _patched(*a, **k):
    try:
        return _orig(*a, **k)
    except Exception as e:
        print(f"Skipping fix_untrained_tokens: {e}")

_tku.fix_untrained_tokens = _patched

MODEL_NAME = "/path/to/Devstral-Small-2-24B-textonly"
OUTPUT_DIR = "/path/to/output/devstral-opus"
MAX_SEQ_LEN = 2048
LORA_RANK = 16
EPOCHS = 3
BATCH_SIZE = 1
GRAD_ACCUM = 4

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = MODEL_NAME,
    max_seq_length = MAX_SEQ_LEN,
    dtype = torch.bfloat16,
    load_in_4bit = True,
)

with open(f"{MODEL_NAME}/chat_template.jinja") as f:
    tokenizer.chat_template = f.read()

model = FastLanguageModel.get_peft_model(
    model,
    r = LORA_RANK,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = LORA_RANK,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 42,
)

dataset = load_dataset(
    "nohurry/Opus-4.6-Reasoning-3000x-filtered", split="train"
).shuffle(seed=3407)

dataset = dataset.filter(
    lambda x: len(x["thinking"]) + len(x["solution"]) < 20000, num_proc=4
)

def format_sample(example):
    assistant_content = f"<think>\n{example['thinking']}\n</think>\n\n{example['solution']}"
    messages = [
        {"role": "user", "content": example["problem"]},
        {"role": "assistant", "content": assistant_content},
    ]
    return {"text": tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=False
    )}

dataset = dataset.map(format_sample, num_proc=4)

from unsloth.chat_templates import train_on_responses_only

trainer = SFTTrainer(
    model=model, tokenizer=tokenizer, train_dataset=dataset,
    args=SFTConfig(
        output_dir=OUTPUT_DIR,
        per_device_train_batch_size=BATCH_SIZE,
        gradient_accumulation_steps=GRAD_ACCUM,
        num_train_epochs=EPOCHS,
        warmup_steps=50,
        learning_rate=1e-4,
        optim="adamw_8bit",
        fp16=False, bf16=True,
        gradient_checkpointing=True,
        max_grad_norm=1.0,
        logging_steps=5,
        save_strategy="steps", save_steps=100, save_total_limit=3,
        seed=42, report_to="none",
        dataset_text_field="text",
        max_seq_length=MAX_SEQ_LEN,
        packing=False, dataset_num_proc=4,
    ),
)

trainer = train_on_responses_only(
    trainer, instruction_part="[INST]", response_part="[/INST]",
)

stats = unsloth_train(trainer)
print(f"Done. Final loss: {stats.training_loss:.4f}")

model.save_pretrained(f"{OUTPUT_DIR}/lora")
tokenizer.save_pretrained(f"{OUTPUT_DIR}/lora")

model.save_pretrained_gguf(
    f"{OUTPUT_DIR}/gguf", tokenizer, quantization_method="q4_k_m",
)
```
| Parameter | Value |
|---|---|
| Base model | Devstral-Small-2-24B (Ministral3 architecture) |
| LoRA rank / alpha | 16 / 16 |
| Target modules | q, k, v, o, gate, up, down projections |
| Trainable parameters | 92,405,760 of 23,664,808,960 (0.39%) |
| Sequence length | 2,048 tokens |
| Effective batch size | 4 (1 per device × 4 accumulation) |
| Learning rate | 1e-4 with 50 warmup steps |
| Checkpoint used | 1,200 (end of epoch 2) |
| Time per step | ~11 seconds |
| Total training time | ~3.7 hours |
| VRAM used | 21.2GB / 24GB |
| Step | Epoch | Loss | Notes |
|---|---|---|---|
| 5 | 0.01 | 0.7949 | Warmup |
| 100 | 0.17 | 0.5708 | |
| 300 | 0.52 | 0.5800 | |
| 600 | 1.03 | 0.3559 | End of epoch 1 |
| 900 | 1.55 | 0.3858 | |
| 1100 | 1.89 | 0.3469 | |
| 1160 | 2.00 | 0.3752 | End of epoch 2 |
| 1200 | 2.07 | 0.1493 | Checkpoint used ✓ |
Checkpoint 1200 (just past end of epoch 2) was used for the final model. For reasoning distillation with a small dataset, epoch 3 overfits to the trace style — epoch 2 generalises better. The loss curve told the story.
| # | Error | Cause | Fix |
|---|---|---|---|
| 1 | FP8 quantization only supported on compute 8.9+ | RTX 3090 is 8.6 | Dequantize FP8→BF16 |
| 2 | device_map='auto' in distributed mode | Vision encoder splits model across GPU+CPU | Extract text-only weights |
| 3 | OutOfMemoryError during loading | transformers 5.x concurrent BF16 loader | HF_DEACTIVATE_ASYNC_LOAD=1 |
| 4 | Unsupported functorch tracing attempt | torch.compile + mistral3 incompatibility | TORCH_COMPILE_DISABLE=1 |
| 5 | Cannot copy out of meta tensor | fix_untrained_tokens + meta-device params | Monkey-patch to skip |
| 6 | grad_norm = NaN | Unsloth grad accumulation normalisation bug | Use unsloth_train() |
| 7 | OutOfMemoryError at step 172 | flex_attention backward materialises seq² matrix | seq_len 4096→2048 + expandable segments |
Seven distinct bugs, each requiring a different fix. That's the honest cost of being early on a new architecture with a new version of transformers.
Q4_K_M (14.3GB) · Q5_K_M (16.8GB) · LoRA adapter (370MB) — HuggingFace
Versions that work: Python 3.13, PyTorch 2.11.0+cu130, Unsloth 2026.3.10, transformers 5.3.0, bitsandbytes 0.49.2.