How I distilled Claude 4.6 Opus reasoning into a 24B code model — and every wall I hit along the way.
Devstral-Small-2-24B is Mistral's code-focused model. It's excellent at writing, reading, and reasoning about code. Claude 4.6 Opus is (as of early 2026) one of the best reasoning models on the planet. What happens if you combine them — taking Devstral's code specialisation and teaching it Claude's extended <think>...</think> reasoning style?
Nobody had published a Claude-distilled Devstral when I started this. First mover.
The hardware: a single RTX 3090 (24GB VRAM), 96GB DDR5 RAM, Pop!_OS. No cloud, no A100s, no budget beyond electricity.
This post is the honest account of what it actually took — the plan, the failures, the specific bugs, and the working solution.
Classic distillation copies a teacher model's probability distributions into a student. What we're doing is simpler and more practical: supervised fine-tuning (SFT) on teacher-generated data.
We take a dataset where Claude 4.6 Opus answered questions with its full reasoning trace exposed — the <think> block — and train Devstral to reproduce that pattern. The student never sees the teacher's weights, just its outputs. It's the same approach Jackrong used to build the most-downloaded reasoning distill on HuggingFace (436k downloads/month for Qwen3.5-27B).
The format we train on:
```
[INST]What is the derivative of x³?[/INST]<think>
The power rule states that d/dx[xⁿ] = n·xⁿ⁻¹.
So for x³, n=3, giving 3x².
</think>
The derivative of x³ is **3x²**.
```
Only the assistant turn (everything after [/INST]) is trained on. The instruction is masked. This is critical — you don't want the model to learn to reproduce prompts.
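What that masking looks like at the token level can be sketched directly. This is an illustration with made-up token IDs, not Unsloth's actual implementation: every label up to and including the response marker gets the ignore index, so cross-entropy only scores the assistant turn.

```python
# Sketch: mask every label through the last response marker, so the
# loss is computed only on the assistant turn. Token IDs are made up.
IGNORE_INDEX = -100  # label value that cross-entropy skips

def mask_instruction(input_ids, marker_ids):
    """Return labels with everything through the last marker masked out."""
    labels = list(input_ids)
    n = len(marker_ids)
    start = -1
    for i in range(len(input_ids) - n + 1):
        if input_ids[i:i + n] == marker_ids:
            start = i + n  # first label position after the marker
    if start == -1:
        return [IGNORE_INDEX] * len(labels)  # no marker found: mask all
    for i in range(start):
        labels[i] = IGNORE_INDEX
    return labels

# instruction tokens … [/INST] marker (9, 9) … assistant tokens
print(mask_instruction([1, 2, 3, 9, 9, 4, 5], [9, 9]))
# → [-100, -100, -100, -100, -100, 4, 5]
```

Unsloth's `train_on_responses_only` does this boundary-finding for you; the sketch just shows what ends up in the labels tensor.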
nohurry/Opus-4.6-Reasoning-3000x-filtered — 2,326 samples of Claude 4.6 Opus responses with full thinking traces, filtered for quality. Free on HuggingFace.
Why this one over alternatives?
- The filtered variant removes low-quality/trivial samples.
- It exposes clean `problem`, `thinking`, and `solution` fields to work with.

After length filtering (removing samples where thinking + solution exceeds 20,000 characters), we end up with 2,324 training samples. At 3 epochs, that's 1,743 training steps at an effective batch size of 4.
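The step count follows directly from those numbers; a quick sanity check:

```python
import math

samples, epochs, eff_batch = 2324, 3, 4
steps_per_epoch = math.ceil(samples / eff_batch)  # 2324 / 4 = 581 exactly
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # → 581 1743
```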
This is where things get complicated, and understanding it saves hours.
Devstral-Small-2-24B is published as Devstral-Small-2-24B-Instruct-2512 on HuggingFace. Download it, look at the config, and you'll see:
```json
{
  "architectures": ["Mistral3ForConditionalGeneration"],
  "model_type": "mistral3",
  "vision_config": { ... },
  "text_config": { ... }
}
```
It's a multimodal VLM. Devstral is built on Mistral Small 3.1, which is a vision-language model with a Pixtral vision encoder grafted on. Even though Devstral is used purely for code (no images), the vision components are baked into the published model architecture.
The instruct weights are also published in FP8 quantisation — a format that requires compute capability 8.9+ (L4, H100, etc.). An RTX 3090 is compute capability 8.6. So you can't load the official instruct weights at all.
Mistral also provides a BF16 base variant. I used that, dequantized the FP8 instruct weights on top, and extracted the text-only components.
The FP8 instruct model needs to be converted to BF16 before an RTX 3090 can use it. The process reads each safetensors shard, finds the FP8 tensors and their scale factors, and dequantizes:
```python
# For each FP8 tensor: dequantize with its stored inverse scale,
# then round-trip to BF16
dequant = tensor.to(torch.float32) * scale_inv
new_tensor = dequant.to(torch.bfloat16)
```
After conversion, update config.json to remove the quantization_config block and set torch_dtype: bfloat16. The BF16 model is ~45GB on disk.
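The config edit is small enough to script. Here is a sketch that assumes the stock Hugging Face layout (a top-level `quantization_config` key and a `torch_dtype` field):

```python
import json

def to_bf16_config(cfg: dict) -> dict:
    """Strip the FP8 quantization block and record the new dtype,
    so transformers loads the shards as plain BF16."""
    cfg = dict(cfg)
    cfg.pop("quantization_config", None)  # no-op if already absent
    cfg["torch_dtype"] = "bfloat16"
    return cfg

# Usage against the dequantized checkpoint:
# with open("Devstral-BF16/config.json") as f:
#     cfg = to_bf16_config(json.load(f))
# with open("Devstral-BF16/config.json", "w") as f:
#     json.dump(cfg, f, indent=2)
```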
This step is the one I didn't expect to need.
When you try to fine-tune the VLM (Mistral3ForConditionalGeneration) with Unsloth on a 24GB GPU, you hit a cascade of problems:
- `device_map='auto'` splits the model: the vision encoder goes to the GPU, and the language model layers get split across GPU and CPU.
- `device_map={'': 0}` → immediate OOM, because the language model plus the vision encoder doesn't fit.
- The vision encoder consumes VRAM the language model needs, pushing 20+ language model layers to CPU, and those CPU layers block training.
The fix: strip the vision components out entirely.
The safetensors weight naming tells you everything:
- `language_model.model.layers.X.*`: transformer layers we want
- `language_model.model.embed_tokens.*`: embeddings we want
- `language_model.lm_head.*`: the LM head we want
- `vision_tower.*`: skip
- `multi_modal_projector.*`: skip

```python
from safetensors.torch import load_file, save_file
import json
import os

os.makedirs("Devstral-textonly", exist_ok=True)

with open("Devstral-BF16/model.safetensors.index.json") as f:
    index = json.load(f)

def remap(name):
    """Map VLM tensor names onto plain Ministral3ForCausalLM names."""
    if name.startswith("language_model.model."):
        return "model." + name[len("language_model.model."):]
    if name.startswith("language_model.lm_head."):
        return "lm_head." + name[len("language_model.lm_head."):]
    if name.startswith("language_model."):
        return name[len("language_model."):]
    return name

new_weight_map = {}
for shard_file in sorted(set(index["weight_map"].values())):
    tensors = load_file(f"Devstral-BF16/{shard_file}")
    new_tensors = {}
    for name, tensor in tensors.items():
        if name.startswith(("vision_tower.", "multi_modal_projector.")):
            continue  # drop vision components entirely
        new_name = remap(name)
        new_tensors[new_name] = tensor
        new_weight_map[new_name] = shard_file
    save_file(new_tensors, f"Devstral-textonly/{shard_file}",
              metadata={"format": "pt"})

# Rebuild the index so the weight map matches the renamed tensors
index["weight_map"] = new_weight_map
with open("Devstral-textonly/model.safetensors.index.json", "w") as f:
    json.dump(index, f, indent=2)
```
Then create a config.json using the text_config from the VLM config, setting "model_type": "ministral3" and "architectures": ["Ministral3ForCausalLM"]. Copy over the tokenizer files.
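Deriving that config can be sketched the same way; this assumes the VLM config's `text_config` block is a complete causal-LM config on its own:

```python
import json

def make_text_only_config(vlm_cfg: dict) -> dict:
    """Derive a causal-LM config from the VLM's text_config block."""
    text_cfg = dict(vlm_cfg["text_config"])
    text_cfg["model_type"] = "ministral3"
    text_cfg["architectures"] = ["Ministral3ForCausalLM"]
    return text_cfg

# Usage:
# with open("Devstral-BF16/config.json") as f:
#     cfg = make_text_only_config(json.load(f))
# with open("Devstral-textonly/config.json", "w") as f:
#     json.dump(cfg, f, indent=2)
```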
The result: a clean Ministral3ForCausalLM with all 363 language model tensors, no vision components, loadable by FastLanguageModel. The text-only model is ~40GB on disk.
Disk space note: The BF16 VLM (45GB) + text-only extraction (40GB) requires ~85GB free. Extract shard by shard and delete source shards as you go if space is tight.
Even with the clean text-only model, loading with load_in_4bit=True on an RTX 3090 causes an OOM crash. The error looks like:
```
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB.
GPU 0 has total capacity 23.56 GiB of which 45 MiB is free.
Including non-PyTorch memory, this process has 23.49 GiB in use.
```
This makes no sense. A 24B model quantised to 4-bit should be ~13GB, well within 24GB.
The root cause: transformers 5.x changed its model loading to use a ThreadPoolExecutor with 4 workers by default. It submits all GPU-assigned tensor materialisation tasks concurrently. Each tensor loads in BF16 first, then bitsandbytes quantises to 4-bit.
With 4 concurrent workers, multiple MLP weight matrices load to GPU simultaneously in BF16. Each MLP layer: gate_proj (335MB) + up_proj (335MB) + down_proj (335MB) = ~1GB BF16 per layer. With several layers in flight at once, VRAM fills before quantisation can free anything — crash.
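The per-layer arithmetic checks out once you plug in the MLP dimensions. Hidden size 5120 and intermediate size 32768 are my assumption here, chosen because they reproduce the 335MB figure:

```python
hidden, intermediate = 5120, 32768  # assumed Devstral MLP dimensions
bytes_bf16 = 2

per_matrix = hidden * intermediate * bytes_bf16  # gate/up/down each
per_layer = 3 * per_matrix
print(per_matrix / 1e6, per_layer / 1e9)  # ≈ 335.5 MB, ≈ 1.0 GB
```

With four workers keeping several such layers in flight at once, a few GB of transient BF16 piles up before quantisation frees anything.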
The fix is one environment variable:
```python
os.environ["HF_DEACTIVATE_ASYNC_LOAD"] = "1"
```
This forces sequential tensor loading: materialize one tensor → quantise to 4-bit → free BF16 → next tensor. Memory stays flat. The 24B model loads using ~13.5GB VRAM, leaving 10GB headroom for training.
Unsloth enables torch.compile by default for speed. On Mistral3/Ministral3 architectures, this crashes immediately with:
```
Unsupported functorch tracing attempt
```
The fused cross-entropy loss in Unsloth uses torch.func.grad internally, which is incompatible with torch.compile for this architecture. Fix:
```python
os.environ["TORCH_COMPILE_DISABLE"] = "1"
```
Set this before any torch or unsloth imports.
Unsloth's SFTTrainer calls fix_untrained_tokens() during initialisation. For VLM-derived models, this hits NotImplementedError: Cannot copy out of meta tensor on meta-device tensors. Simple patch:
```python
import unsloth_zoo.tokenizer_utils as _tku

_orig = _tku.fix_untrained_tokens

def _patched(*args, **kwargs):
    try:
        return _orig(*args, **kwargs)
    except Exception as e:
        print(f"Skipping fix_untrained_tokens: {e}")

_tku.fix_untrained_tokens = _patched
```
Unsloth has a documented bug where trainer.train() with gradient_accumulation_steps >= 4 can produce NaN grad_norm values due to incorrect loss normalisation. Fix:
```python
from unsloth import unsloth_train

stats = unsloth_train(trainer)  # not trainer.train()
```
Training was running fine — then crashed at step 172:
```
torch.OutOfMemoryError: Tried to allocate 1.87 GiB
File "sdpa_dense_backward"
```
The cause: without Flash Attention 2 or xformers, PyTorch falls back to flex_attention, which materialises the full attention matrix for the backward pass. At sequence length 4096, that's a 4096×4096 float32 matrix per attention head; summed over a layer's heads, that accounts for the ~1.87GB allocation in the traceback. With only 5GB of VRAM free during training, it crashes.
Flash Attention 2 requires compute capability 8.9+ (same as FP8). RTX 3090 at 8.6 gets neither. Fix: reduce sequence length and add the expandable allocator:
```python
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
MAX_SEQ_LEN = 2048  # down from 4096
```
At 2048, the backward pass needs ~0.47GB — well within budget.
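The quadratic scaling is easy to verify. The head count of 32 is an assumption about the attention config, but the seq_len² ratio holds regardless:

```python
def attn_matrix_bytes(seq_len, n_heads=32, bytes_f32=4):
    """Full seq x seq float32 attention matrix, summed over heads."""
    return seq_len * seq_len * bytes_f32 * n_heads

gib = 2**30
print(attn_matrix_bytes(4096) / gib)  # → 2.0
print(attn_matrix_bytes(2048) / gib)  # → 0.5
```

Halving the sequence length quarters the backward-pass allocation, consistent with the drop from ~1.87GB to ~0.47GB.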
```python
import os

# All three env vars must be set before ANY torch/unsloth imports
os.environ["TORCH_COMPILE_DISABLE"] = "1"
os.environ["HF_DEACTIVATE_ASYNC_LOAD"] = "1"
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch
from unsloth import FastLanguageModel, unsloth_train
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

# ── Patch fix_untrained_tokens ───────────────────────────────────────────────
import unsloth_zoo.tokenizer_utils as _tku

_orig = _tku.fix_untrained_tokens

def _patched(*a, **k):
    try:
        return _orig(*a, **k)
    except Exception as e:
        print(f"Skipping fix_untrained_tokens: {e}")

_tku.fix_untrained_tokens = _patched

MODEL_NAME = "/path/to/Devstral-Small-2-24B-textonly"
OUTPUT_DIR = "/path/to/output/devstral-opus"
MAX_SEQ_LEN = 2048
LORA_RANK = 16
EPOCHS = 3
BATCH_SIZE = 1
GRAD_ACCUM = 4

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = MODEL_NAME,
    max_seq_length = MAX_SEQ_LEN,
    dtype = torch.bfloat16,
    load_in_4bit = True,
)

with open(f"{MODEL_NAME}/chat_template.jinja") as f:
    tokenizer.chat_template = f.read()

model = FastLanguageModel.get_peft_model(
    model,
    r = LORA_RANK,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = LORA_RANK,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 42,
)

dataset = load_dataset(
    "nohurry/Opus-4.6-Reasoning-3000x-filtered", split="train"
).shuffle(seed=3407)

dataset = dataset.filter(
    lambda x: len(x["thinking"]) + len(x["solution"]) < 20000, num_proc=4
)

def format_sample(example):
    assistant_content = f"<think>\n{example['thinking']}\n</think>\n\n{example['solution']}"
    messages = [
        {"role": "user", "content": example["problem"]},
        {"role": "assistant", "content": assistant_content},
    ]
    return {"text": tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=False
    )}

dataset = dataset.map(format_sample, num_proc=4)

from unsloth.chat_templates import train_on_responses_only

trainer = SFTTrainer(
    model=model, tokenizer=tokenizer, train_dataset=dataset,
    args=SFTConfig(
        output_dir=OUTPUT_DIR,
        per_device_train_batch_size=BATCH_SIZE,
        gradient_accumulation_steps=GRAD_ACCUM,
        num_train_epochs=EPOCHS,
        warmup_steps=50,
        learning_rate=1e-4,
        optim="adamw_8bit",
        fp16=False, bf16=True,
        gradient_checkpointing=True,
        max_grad_norm=1.0,
        logging_steps=5,
        save_strategy="steps", save_steps=100, save_total_limit=3,
        seed=42, report_to="none",
        dataset_text_field="text",
        max_seq_length=MAX_SEQ_LEN,
        packing=False, dataset_num_proc=4,
    ),
)

trainer = train_on_responses_only(
    trainer, instruction_part="[INST]", response_part="[/INST]",
)

stats = unsloth_train(trainer)
print(f"Done. Final loss: {stats.training_loss:.4f}")

model.save_pretrained(f"{OUTPUT_DIR}/lora")
tokenizer.save_pretrained(f"{OUTPUT_DIR}/lora")

model.save_pretrained_gguf(
    f"{OUTPUT_DIR}/gguf", tokenizer, quantization_method="q4_k_m",
)
```
| Parameter | Value |
|---|---|
| Base model | Devstral-Small-2-24B (Ministral3 architecture) |
| LoRA rank / alpha | 16 / 16 |
| Target modules | q, k, v, o, gate, up, down projections |
| Trainable parameters | 92,405,760 of 23,664,808,960 (0.39%) |
| Sequence length | 2,048 tokens |
| Effective batch size | 4 (1 per device × 4 accumulation) |
| Learning rate | 1e-4 with 50 warmup steps |
| Checkpoint used | 1,200 (end of epoch 2) |
| Time per step | ~11 seconds |
| Total training time | ~3.7 hours |
| VRAM used | 21.2GB / 24GB |
| Step | Epoch | Loss | Notes |
|---|---|---|---|
| 5 | 0.01 | 0.7949 | Warmup |
| 100 | 0.17 | 0.5708 | |
| 300 | 0.52 | 0.5800 | |
| 600 | 1.03 | 0.3559 | End of epoch 1 |
| 900 | 1.55 | 0.3858 | |
| 1100 | 1.89 | 0.3469 | |
| 1160 | 2.00 | 0.3752 | End of epoch 2 |
| 1200 | 2.07 | 0.1493 | Checkpoint used ✓ |
Checkpoint 1200 (just past end of epoch 2) was used for the final model. For reasoning distillation with a small dataset, epoch 3 overfits to the trace style — epoch 2 generalises better. The loss curve told the story.
| # | Error | Cause | Fix |
|---|---|---|---|
| 1 | FP8 quantization only supported on compute 8.9+ | RTX 3090 is 8.6 | Dequantize FP8→BF16 |
| 2 | device_map='auto' in distributed mode | Vision encoder splits model across GPU+CPU | Extract text-only weights |
| 3 | OutOfMemoryError during loading | transformers 5.x concurrent BF16 loader | HF_DEACTIVATE_ASYNC_LOAD=1 |
| 4 | Unsupported functorch tracing attempt | torch.compile + mistral3 incompatibility | TORCH_COMPILE_DISABLE=1 |
| 5 | Cannot copy out of meta tensor | fix_untrained_tokens + meta-device params | Monkey-patch to skip |
| 6 | grad_norm = NaN | Unsloth grad accumulation normalisation bug | Use unsloth_train() |
| 7 | OutOfMemoryError at step 172 | flex_attention backward materialises seq² matrix | seq_len 4096→2048 + expandable segments |
Seven distinct bugs, each requiring a different fix. That's the honest cost of being early on a new architecture with a new version of transformers.
Q4_K_M (14.3GB) · Q5_K_M (16.8GB) · LoRA adapter (370MB) — HuggingFace
Versions that work: Python 3.13, PyTorch 2.11.0+cu130, Unsloth 2026.3.10, transformers 5.3.0, bitsandbytes 0.49.2.