March 24, 2026 AI / LLM Fine-tuning

Fine-Tuning Devstral-Small-2-24B with Claude Opus Reasoning on an RTX 3090

How I distilled Claude 4.6 Opus reasoning into a 24B code model — and every wall I hit along the way.

Why Bother?

Devstral-Small-2-24B is Mistral's code-focused model. It's excellent at writing, reading, and reasoning about code. Claude 4.6 Opus is (as of early 2026) one of the best reasoning models on the planet. What happens if you combine them — taking Devstral's code specialisation and teaching it Claude's extended <think>...</think> reasoning style?

Nobody had published a Claude-distilled Devstral when I started this. First mover.

The hardware: a single RTX 3090 (24GB VRAM), 96GB DDR5 RAM, Pop!_OS. No cloud, no A100s, no budget beyond electricity.

This post is the honest account of what it actually took — the plan, the failures, the specific bugs, and the working solution.


What Is Knowledge Distillation Here?

Classic distillation copies a teacher model's probability distributions into a student. What we're doing is simpler and more practical: supervised fine-tuning (SFT) on teacher-generated data.

We take a dataset where Claude 4.6 Opus answered questions with its full reasoning trace exposed — the <think> block — and train Devstral to reproduce that pattern. The student never sees the teacher's weights, just its outputs. It's the same approach Jackrong used to build the most-downloaded reasoning distill on HuggingFace (436k downloads/month for Qwen3.5-27B).

The format we train on:

[INST]What is the derivative of x³?[/INST]<think>
The power rule states that d/dx[xⁿ] = n·xⁿ⁻¹.
So for x³, n=3, giving 3x².
</think>

The derivative of x³ is **3x²**.

Only the assistant turn (everything after [/INST]) is trained on. The instruction is masked. This is critical — you don't want the model to learn to reproduce prompts.
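The masking can be illustrated with a toy sketch. Unsloth's train_on_responses_only (used later in the training script) does the real version at the token level; the helper and token ids here are made up purely for illustration:

```python
IGNORE_INDEX = -100  # PyTorch's cross-entropy loss skips positions labelled -100

def mask_instruction(token_ids, response_start):
    """Copy of the input ids with the instruction span masked out of the loss.

    Everything before response_start (the [INST]...[/INST] span) becomes -100,
    so gradients only flow from the assistant turn.
    """
    return [IGNORE_INDEX] * response_start + list(token_ids[response_start:])

# Toy ids: positions 0-4 are the instruction, 5-7 the assistant response.
labels = mask_instruction([101, 7, 8, 9, 102, 55, 56, 57], response_start=5)
# labels == [-100, -100, -100, -100, -100, 55, 56, 57]
```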


The Dataset

nohurry/Opus-4.6-Reasoning-3000x-filtered — 2,326 samples of Claude 4.6 Opus responses with full thinking traces, filtered for quality. Free on HuggingFace.

Why this one over alternatives?

After length filtering (removing samples where thinking + solution exceeds 20,000 chars), we end up with 2,324 training samples. At 3 epochs that's 1,743 training steps at effective batch size 4.
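The step count follows directly from the hyperparameters used later in the training script:

```python
samples, epochs = 2324, 3
per_device_batch, grad_accum = 1, 4            # as configured in the script below
effective_batch = per_device_batch * grad_accum  # effective batch size 4
total_steps = samples * epochs // effective_batch
print(total_steps)  # 1743
```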


The Model: What Devstral Actually Is

This is where things get complicated, and understanding it saves hours.

Devstral-Small-2-24B is published as Devstral-Small-2-24B-Instruct-2512 on HuggingFace. Download it, look at the config, and you'll see:

{
  "architectures": ["Mistral3ForConditionalGeneration"],
  "model_type": "mistral3",
  "vision_config": { ... },
  "text_config": { ... }
}

It's a multimodal VLM. Devstral is built on Mistral Small 3.1, which is a vision-language model with a Pixtral vision encoder grafted on. Even though Devstral is used purely for code (no images), the vision components are baked into the published model architecture.

The instruct weights are also published in FP8 quantisation — a format that requires compute capability 8.9+ (L4, H100, etc.). An RTX 3090 is compute capability 8.6. So you can't load the official instruct weights at all.
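You can check this up front: `torch.cuda.get_device_capability()` returns the GPU's (major, minor) compute capability as a tuple, and FP8 kernels need at least (8, 9). A trivial helper (my own, not part of any library):

```python
def supports_native_fp8(capability):
    """True if a GPU's (major, minor) compute capability can run FP8 kernels."""
    return capability >= (8, 9)

# On a live system: capability = torch.cuda.get_device_capability(0)
print(supports_native_fp8((8, 6)))  # RTX 3090 -> False
print(supports_native_fp8((8, 9)))  # L4 -> True
```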

Mistral also provides a BF16 base variant. I used that, dequantized the FP8 instruct weights on top, and extracted the text-only components.


Step 1: Dequantize FP8 → BF16

The FP8 instruct model needs to be converted to BF16 before an RTX 3090 can use it. The process reads each safetensors shard, finds the FP8 tensors and their scale factors, and dequantizes:

# For each FP8 tensor:
dequant = tensor.to(torch.float32) * scale_inv
new_tensor = dequant.to(torch.bfloat16)

After conversion, update config.json to remove the quantization_config block and set torch_dtype: bfloat16. The BF16 model is ~45GB on disk.


Step 2: Extract the Text-Only Language Model

This step is the one I didn't expect to need.

When you try to fine-tune the VLM (Mistral3ForConditionalGeneration) with Unsloth on a 24GB GPU, you hit a cascade of problems:

  1. device_map='auto' splits the model: vision encoder goes to GPU, language model layers get split across GPU and CPU.
  2. Accelerate refuses to train a model spread across devices.
  3. Force everything to one GPU with device_map={'': 0} → immediate OOM because the language model + vision encoder doesn't fit.

The vision encoder consumes VRAM the language model needs, pushing 20+ language model layers to CPU — and those CPU layers block training.

The fix: strip the vision components out entirely.

The safetensors weight naming tells you everything:

import json
import os
from safetensors.torch import load_file, save_file

with open("Devstral-BF16/model.safetensors.index.json") as f:
    index = json.load(f)

os.makedirs("Devstral-textonly", exist_ok=True)

for shard_file in sorted(set(index["weight_map"].values())):
    tensors = load_file(f"Devstral-BF16/{shard_file}")
    new_tensors = {}

    for name, tensor in tensors.items():
        # Drop the vision encoder and projector entirely
        if name.startswith("vision_tower.") or name.startswith("multi_modal_projector."):
            continue

        # Strip the language_model. prefix so names match Ministral3ForCausalLM
        if name.startswith("language_model.model."):
            new_name = "model." + name[len("language_model.model."):]
        elif name.startswith("language_model.lm_head."):
            new_name = "lm_head." + name[len("language_model.lm_head."):]
        elif name.startswith("language_model."):
            new_name = name[len("language_model."):]
        else:
            new_name = name

        new_tensors[new_name] = tensor

    save_file(new_tensors, f"Devstral-textonly/{shard_file}")

Then create a config.json using the text_config from the VLM config, setting "model_type": "ministral3" and "architectures": ["Ministral3ForCausalLM"]. Regenerate model.safetensors.index.json with the renamed tensor keys, and copy over the tokenizer files.
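Building that config is mechanical. A sketch, assuming the field names shown in the VLM config earlier:

```python
import json

def extract_text_config(vlm_config):
    """Promote the nested text_config into a standalone CausalLM config."""
    text_config = dict(vlm_config["text_config"])
    text_config["model_type"] = "ministral3"
    text_config["architectures"] = ["Ministral3ForCausalLM"]
    text_config["torch_dtype"] = "bfloat16"
    # If a quantization_config survived the dequantization step, drop it too
    text_config.pop("quantization_config", None)
    return text_config

# Usage:
# with open("Devstral-BF16/config.json") as f:
#     cfg = extract_text_config(json.load(f))
# with open("Devstral-textonly/config.json", "w") as f:
#     json.dump(cfg, f, indent=2)
```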

The result: a clean Ministral3ForCausalLM with all 363 language model tensors, no vision components, loadable by FastLanguageModel. The text-only model is ~40GB on disk.

Disk space note: The BF16 VLM (45GB) + text-only extraction (40GB) requires ~85GB free. Extract shard by shard and delete source shards as you go if space is tight.


Step 3: The Loading Bug That Took Hours to Find

Even with the clean text-only model, loading with load_in_4bit=True on an RTX 3090 causes an OOM crash. The error looks like:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB.
GPU 0 has total capacity 23.56 GiB of which 45 MiB is free.
Including non-PyTorch memory, this process has 23.49 GiB in use.

This makes no sense. A 24B model quantised to 4-bit should be ~13GB, well within 24GB.

The root cause: transformers 5.x changed its model loading to use a ThreadPoolExecutor with 4 workers by default. It submits all GPU-assigned tensor materialisation tasks concurrently. Each tensor loads in BF16 first, then bitsandbytes quantises to 4-bit.

With 4 concurrent workers, multiple MLP weight matrices load to GPU simultaneously in BF16. Each MLP layer: gate_proj (335MB) + up_proj (335MB) + down_proj (335MB) = ~1GB BF16 per layer. With several layers in flight at once, VRAM fills before quantisation can free anything — crash.
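The arithmetic checks out under dimensions consistent with the ~335MB figure. Hidden size 5120 and intermediate size 32768 are my assumptions here, inferred from that figure rather than read from the config:

```python
hidden, intermediate = 5120, 32768           # assumed dims matching ~335MB/projection
bytes_per_proj = hidden * intermediate * 2   # BF16 = 2 bytes per element
per_layer = 3 * bytes_per_proj               # gate_proj + up_proj + down_proj
workers = 4                                  # transformers' default thread pool size

print(bytes_per_proj / 1e6)        # ~335.5 MB per projection
print(per_layer / 1e9)             # ~1.0 GB per layer
print(workers * per_layer / 1e9)   # ~4.0 GB potentially in flight at once
```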

The fix is one environment variable:

os.environ["HF_DEACTIVATE_ASYNC_LOAD"] = "1"

This forces sequential tensor loading: materialize one tensor → quantise to 4-bit → free BF16 → next tensor. Memory stays flat. The 24B model loads using ~13.5GB VRAM, leaving 10GB headroom for training.


Step 4: The torch_compile Crash

Unsloth enables torch.compile by default for speed. On Mistral3/Ministral3 architectures, this crashes immediately with:

Unsupported functorch tracing attempt

The fused cross-entropy loss in Unsloth uses torch.func.grad internally, which is incompatible with torch.compile for this architecture. Fix:

os.environ["TORCH_COMPILE_DISABLE"] = "1"

Set this before any torch or unsloth imports.


Step 5: The fix_untrained_tokens Crash

Unsloth's SFTTrainer calls fix_untrained_tokens() during initialisation. For VLM-derived models, this hits NotImplementedError: Cannot copy out of meta tensor on meta-device tensors. Simple patch:

import unsloth_zoo.tokenizer_utils as _tku
_orig = _tku.fix_untrained_tokens

def _patched(*args, **kwargs):
    try:
        return _orig(*args, **kwargs)
    except Exception as e:
        print(f"Skipping fix_untrained_tokens: {e}")

_tku.fix_untrained_tokens = _patched

Step 6: Gradient Accumulation NaN

Unsloth has a documented bug where trainer.train() with gradient_accumulation_steps >= 4 can produce NaN grad_norm values due to incorrect loss normalisation. Fix:

from unsloth import unsloth_train
stats = unsloth_train(trainer)  # not trainer.train()

Step 7: flex_attention OOM During Training

Training was running fine — then crashed at step 172:

torch.OutOfMemoryError: Tried to allocate 1.87 GiB
  File "sdpa_dense_backward"

The cause: without Flash Attention 2 or xformers, PyTorch falls back to flex_attention, which materialises the full attention matrix for the backward pass. At sequence length 4096, that's a 4096×4096 float32 matrix per head — about 64MB each, roughly 1.87GB across all heads. With only 5GB VRAM free during training, it crashes.

Flash Attention 2 requires compute capability 8.9+ (same as FP8). RTX 3090 at 8.6 gets neither. Fix: reduce sequence length and add the expandable allocator:

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
MAX_SEQ_LEN = 2048  # down from 4096

At 2048, the backward pass needs ~0.47GB — well within budget.
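Both figures are consistent with one seq×seq float32 matrix materialised per head. The head count below is inferred from the 1.87GiB number, not read from the model config, so treat it as an assumption:

```python
def attn_backward_bytes(seq_len, num_heads):
    """Memory for a full seq×seq float32 attention matrix, one per head."""
    return num_heads * seq_len * seq_len * 4  # float32 = 4 bytes

heads = 30  # inferred: 1.87 GiB / (4096^2 * 4 B) is roughly 30

print(attn_backward_bytes(4096, heads) / 2**30)  # ~1.87 GiB (the crash)
print(attn_backward_bytes(2048, heads) / 2**30)  # ~0.47 GiB (fits)
```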


The Complete Working Training Script

import os
# All three env vars must be set before ANY imports
os.environ["TORCH_COMPILE_DISABLE"] = "1"
os.environ["HF_DEACTIVATE_ASYNC_LOAD"] = "1"
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch
from unsloth import FastLanguageModel, unsloth_train
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

# ── Patch fix_untrained_tokens ───────────────────────────────────────────────
import unsloth_zoo.tokenizer_utils as _tku
_orig = _tku.fix_untrained_tokens
def _patched(*a, **k):
    try: return _orig(*a, **k)
    except Exception as e: print(f"Skipping fix_untrained_tokens: {e}")
_tku.fix_untrained_tokens = _patched

MODEL_NAME  = "/path/to/Devstral-Small-2-24B-textonly"
OUTPUT_DIR  = "/path/to/output/devstral-opus"
MAX_SEQ_LEN = 2048
LORA_RANK   = 16
EPOCHS      = 3
BATCH_SIZE  = 1
GRAD_ACCUM  = 4

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name     = MODEL_NAME,
    max_seq_length = MAX_SEQ_LEN,
    dtype          = torch.bfloat16,
    load_in_4bit   = True,
)

with open(f"{MODEL_NAME}/chat_template.jinja") as f:
    tokenizer.chat_template = f.read()

model = FastLanguageModel.get_peft_model(
    model,
    r                          = LORA_RANK,
    target_modules             = ["q_proj", "k_proj", "v_proj", "o_proj",
                                  "gate_proj", "up_proj", "down_proj"],
    lora_alpha                 = LORA_RANK,
    lora_dropout               = 0,
    bias                       = "none",
    use_gradient_checkpointing = "unsloth",
    random_state               = 42,
)

dataset = load_dataset(
    "nohurry/Opus-4.6-Reasoning-3000x-filtered", split="train"
).shuffle(seed=3407)

dataset = dataset.filter(
    lambda x: len(x["thinking"]) + len(x["solution"]) < 20000, num_proc=4
)

def format_sample(example):
    assistant_content = f"<think>\n{example['thinking']}\n</think>\n\n{example['solution']}"
    messages = [
        {"role": "user",      "content": example["problem"]},
        {"role": "assistant", "content": assistant_content},
    ]
    return {"text": tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=False
    )}

dataset = dataset.map(format_sample, num_proc=4)

from unsloth.chat_templates import train_on_responses_only

trainer = SFTTrainer(
    model=model, tokenizer=tokenizer, train_dataset=dataset,
    args=SFTConfig(
        output_dir=OUTPUT_DIR,
        per_device_train_batch_size=BATCH_SIZE,
        gradient_accumulation_steps=GRAD_ACCUM,
        num_train_epochs=EPOCHS,
        warmup_steps=50,
        learning_rate=1e-4,
        optim="adamw_8bit",
        fp16=False, bf16=True,
        gradient_checkpointing=True,
        max_grad_norm=1.0,
        logging_steps=5,
        save_strategy="steps", save_steps=100, save_total_limit=3,
        seed=42, report_to="none",
        dataset_text_field="text",
        max_seq_length=MAX_SEQ_LEN,
        packing=False, dataset_num_proc=4,
    ),
)

trainer = train_on_responses_only(
    trainer, instruction_part="[INST]", response_part="[/INST]",
)

stats = unsloth_train(trainer)
print(f"Done. Final loss: {stats.training_loss:.4f}")

model.save_pretrained(f"{OUTPUT_DIR}/lora")
tokenizer.save_pretrained(f"{OUTPUT_DIR}/lora")

model.save_pretrained_gguf(
    f"{OUTPUT_DIR}/gguf", tokenizer, quantization_method="q4_k_m",
)

The Numbers

| Parameter | Value |
|---|---|
| Base model | Devstral-Small-2-24B (Ministral3 architecture) |
| LoRA rank / alpha | 16 / 16 |
| Target modules | q, k, v, o, gate, up, down projections |
| Trainable parameters | 92,405,760 of 23,664,808,960 (0.39%) |
| Sequence length | 2,048 tokens |
| Effective batch size | 4 (1 per device × 4 accumulation) |
| Learning rate | 1e-4 with 50 warmup steps |
| Checkpoint used | 1,200 (end of epoch 2) |
| Time per step | ~11 seconds |
| Total training time | ~3.7 hours |
| VRAM used | 21.2GB / 24GB |

Loss Curve

| Step | Epoch | Loss | Notes |
|---|---|---|---|
| 5 | 0.01 | 0.7949 | Warmup |
| 100 | 0.17 | 0.5708 | |
| 300 | 0.52 | 0.5800 | |
| 600 | 1.03 | 0.3559 | End of epoch 1 |
| 900 | 1.55 | 0.3858 | |
| 1100 | 1.89 | 0.3469 | |
| 1160 | 2.00 | 0.3752 | End of epoch 2 |
| 1200 | 2.07 | 0.1493 | Checkpoint used ✓ |

Checkpoint 1200 (just past end of epoch 2) was used for the final model. For reasoning distillation with a small dataset, epoch 3 overfits to the trace style — epoch 2 generalises better. The loss curve told the story.


Summary: Every Bug Hit

| # | Error | Cause | Fix |
|---|---|---|---|
| 1 | FP8 quantization only supported on compute 8.9+ | RTX 3090 is 8.6 | Dequantize FP8 → BF16 |
| 2 | device_map='auto' in distributed mode | Vision encoder splits model across GPU+CPU | Extract text-only weights |
| 3 | OutOfMemoryError during loading | transformers 5.x concurrent BF16 loader | HF_DEACTIVATE_ASYNC_LOAD=1 |
| 4 | Unsupported functorch tracing attempt | torch.compile + mistral3 incompatibility | TORCH_COMPILE_DISABLE=1 |
| 5 | Cannot copy out of meta tensor | fix_untrained_tokens + meta-device params | Monkey-patch to skip |
| 6 | grad_norm = NaN | Unsloth grad accumulation normalisation bug | Use unsloth_train() |
| 7 | OutOfMemoryError at step 172 | flex_attention backward materialises seq² matrix | seq_len 4096→2048 + expandable segments |

Seven distinct bugs, each requiring a different fix. That's the honest cost of being early on a new architecture with a new version of transformers.


The Model

adamjen/Devstral-Small-2-24B-Opus-Reasoning

Q4_K_M (14.3GB) · Q5_K_M (16.8GB) · LoRA adapter (370MB) — HuggingFace

Versions that work: Python 3.13, PyTorch 2.11.0+cu130, Unsloth 2026.3.10, transformers 5.3.0, bitsandbytes 0.49.2.