Smaller models. Same API. Less compute.
Dystrio structurally compiles open-weight LLMs into smaller, faster, dense checkpoints. Standard Transformers format — loads in vLLM, TGI, llama.cpp, and any HuggingFace-compatible stack.
Available now on Hugging Face
Drop-in replacements for popular open-weight models. Every tier is a standard dense checkpoint — swap the model ID and go.
| Tier | Size | PPL Ratio (vs. base) | Prefill TPS | TTFT p95 | Model |
|---|---|---|---|---|---|
| sculpt-default | 12.0 GB | 0.923 | 11,594 (+10%) | 123 ms (−8%) | View → |
| sculpt-production | 11.3 GB | 1.134 | 12,094 (+15%) | 121 ms (−9%) | View → |
| sculpt-throughput | 10.4 GB | 1.297 | 12,667 (+20%) | 113 ms (−15%) | View → |
| sculpt-experimental | 9.6 GB | 1.996 | 13,596 (+29%) | 110 ms (−17%) | View → |
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a compressed checkpoint exactly like the base model: only the model ID changes.
model = AutoModelForCausalLM.from_pretrained(
    "dystrio/Mistral-7B-Instruct-v0.3-sculpt-production",
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "dystrio/Mistral-7B-Instruct-v0.3-sculpt-production"
)
```
Automated structural compilation
[Benchmark chart: Mistral 7B — sculpt-default tier]
Smaller. Faster. Dense. Drop-in. Full downstream benchmarks →
Transformer models allocate the same MLP width to every layer, regardless of how much each layer actually uses. Sculpt measures that per-layer activation demand, physically rewrites the MLP dimensions, stabilizes the result, and emits a standard dense model. Not masking. Not sparse pruning. Tensor shape recompilation.
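The shape of the core operation can be sketched in a few lines. This is an illustrative toy, not Dystrio's implementation: it scores each hidden unit of a single ReLU MLP layer by its mean activation magnitude on calibration inputs, then slices the weight matrices down to a smaller dense shape. The function name and parameters are hypothetical.

```python
import numpy as np

def shrink_mlp(w_in, w_out, calib_x, keep_frac=0.5):
    """Toy structural shrink of one MLP layer.

    w_in:    (d_model, d_hidden) up-projection weights
    w_out:   (d_hidden, d_model) down-projection weights
    calib_x: (n_samples, d_model) calibration activations
    """
    # 1. Measure demand: mean |activation| of each hidden unit on calibration data.
    hidden = np.maximum(calib_x @ w_in, 0.0)   # ReLU MLP for simplicity
    demand = np.abs(hidden).mean(axis=0)       # shape (d_hidden,)

    # 2. Keep the most-used units, drop the rest physically.
    k = max(1, int(keep_frac * w_in.shape[1]))
    keep = np.argsort(demand)[-k:]

    # 3. Emit genuinely smaller dense tensors: new shapes, no masks.
    return w_in[:, keep], w_out[keep, :]

rng = np.random.default_rng(0)
w_in = rng.normal(size=(8, 32))
w_out = rng.normal(size=(32, 8))
x = rng.normal(size=(64, 8))

w_in_small, w_out_small = shrink_mlp(w_in, w_out, x, keep_frac=0.25)
print(w_in_small.shape, w_out_small.shape)  # (8, 8) (8, 8)
```

The point of the sketch is the return type: the pruned layer is a smaller dense matrix that any standard runtime can multiply, rather than a full-size matrix with a sparsity mask.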
Quantization and structural recompilation are complementary — Sculpt composes with AWQ, GPTQ, GGUF, and existing optimization pipelines for compound savings.
For teams running Mixture-of-Experts models across multi-node GPU clusters. Forge observes expert routing patterns, builds a co-activation graph, and co-locates frequently activated experts to reduce cross-node communication. Same model. Same stack. Fewer hops.
Forge includes a prescriptive decision gate — it tells you whether placement will help for your specific workload, and recommends against it when it won't. 5/5 correct predictions across adversarial workloads.
Reshape the model. Not the stack.
The output is a model, not a runtime. Works wherever models work.
Structural optimization before quantization, fine-tuning, and serving. Compounds with every downstream step.
Quantifies expected gain and recommends whether to apply. When optimization won't help, Dystrio tells you.
Standard safetensors, standard config. Zero pipeline modification.
Try a smaller model today.
Browse compressed checkpoints on Hugging Face — or talk to us about structural compilation for your model fleet.