Structural Compiler for Open-Weight LLMs

Smaller models. Same API. Less compute.

Dystrio structurally compiles open-weight LLMs into smaller, faster, dense checkpoints. Standard Transformers format — loads in vLLM, TGI, llama.cpp, and any HuggingFace-compatible stack.

No Custom Kernels · No Runtime Changes · No Retraining
NVIDIA Inception Program Member

Available now on Hugging Face

Drop-in replacements for popular open-weight models. Every tier is a standard dense checkpoint — swap the model ID and go.

Mistral 7B Instruct v0.3
Tier                  Size     PPL ratio  Prefill TPS     TTFT p95
sculpt-default        12.0 GB  0.923      11,594 (+10%)   123 ms (−8%)
sculpt-production     11.3 GB  1.134      12,094 (+15%)   121 ms (−9%)
sculpt-throughput     10.4 GB  1.297      12,667 (+20%)   113 ms (−15%)
sculpt-experimental    9.6 GB  1.996      13,596 (+29%)   110 ms (−17%)
Baseline: 13.5 GB · PPL 12.60 · 10,557 prefill TPS · 133 ms TTFT p95 · A100 80GB, bf16
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "dystrio/Mistral-7B-Instruct-v0.3-sculpt-production",
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "dystrio/Mistral-7B-Instruct-v0.3-sculpt-production"
)
```
Works with vLLM · TGI · SGLang · llama.cpp · AWQ · GPTQ · GGUF — no code changes.

Automated structural compilation

1. Fingerprint: Detect architecture family, layer geometry, and MLP type. Automatic support for Llama, Mistral, Qwen, Phi, and Gemma.
2. Compile: Analyze layer-level structure, rewrite model dimensions for target efficiency, and stabilize with calibration and quality guardrails.
3. Benchmark: Measure perplexity, prefill TPS, decode TPS, and TTFT across workloads. Every tier is benchmarked automatically.
4. Publish: Emit a standard HuggingFace checkpoint with a model card and benchmark table, ready to deploy.
The factory runs end-to-end on a single GPU. New models are compiled within days of release.
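The four stages above can be sketched as a plain pipeline. This is an illustrative skeleton only: every name here (`fingerprint`, `compile_tier`, the stand-in metrics) is hypothetical and not Dystrio's actual API.

```python
# Illustrative sketch of the four-stage factory loop; all names are hypothetical.
from dataclasses import dataclass

SUPPORTED_FAMILIES = {"llama", "mistral", "qwen", "phi", "gemma"}

@dataclass
class Fingerprint:
    family: str
    num_layers: int

def fingerprint(config: dict) -> Fingerprint:
    # Stage 1: detect architecture family and layer geometry from the config.
    family = config["model_type"]
    if family not in SUPPORTED_FAMILIES:
        raise ValueError(f"unsupported family: {family}")
    return Fingerprint(family, config["num_hidden_layers"])

def compile_tier(fp: Fingerprint, size_ratio: float) -> dict:
    # Stage 2: rewrite model dimensions toward a target size ratio.
    return {"family": fp.family, "layers": fp.num_layers, "size_ratio": size_ratio}

def benchmark(artifact: dict) -> dict:
    # Stage 3: stand-in metric; the real factory measures PPL, TPS, and TTFT.
    return {"ppl_ratio": round(0.9 / artifact["size_ratio"], 3)}

def publish(artifact: dict, metrics: dict) -> dict:
    # Stage 4: emit the checkpoint plus a model card carrying the benchmarks.
    return {**artifact, "metrics": metrics}

fp = fingerprint({"model_type": "mistral", "num_hidden_layers": 32})
artifact = compile_tier(fp, 0.85)
card = publish(artifact, benchmark(artifact))
```

Each stage consumes only the previous stage's output, which is what lets the factory run unattended end to end.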

Mistral 7B — sculpt-default tier

Smaller. Faster. Dense. Drop-in. Full downstream benchmarks →

−11% model size: 13.5 GB → 12.0 GB
0.923 PPL ratio vs. baseline perplexity
+10% prefill throughput: 10,557 → 11,594 TPS
−8% TTFT p95: 133 ms → 123 ms
Benchmarked on A100-SXM4-80GB, bf16, deterministic mode. Single-GPU, standard HuggingFace Transformers. No custom kernels. Full benchmark tables on each model card.
Sculpt
Structural Inference Recompilation

Models allocate uniform width across every layer regardless of actual activation demand. Sculpt measures that demand, physically rewrites MLP dimensions, stabilizes the result, and emits a standard dense model. Not masking. Not sparse pruning. Tensor shape recompilation.
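To make "tensor shape recompilation" concrete, here is a minimal toy of the idea on a single gated MLP block: score each intermediate channel by measured activation demand, then physically slice the weight tensors so the result is a smaller dense MLP. This is an illustration of the general technique under made-up dimensions and a crude gating function, not Sculpt's actual method.

```python
# Toy structural width rewrite on one gate/up/down MLP block (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 64, 256

W_gate = rng.normal(size=(d_ff, d_model))
W_up = rng.normal(size=(d_ff, d_model))
W_down = rng.normal(size=(d_model, d_ff))

# Measure activation demand: mean |activation| per intermediate channel
# over a small calibration batch.
X = rng.normal(size=(512, d_model))
act = (X @ W_gate.T) * np.maximum(X @ W_up.T, 0)  # crude gated activation
demand = np.abs(act).mean(axis=0)                  # one score per channel

# Keep the top 75% of channels and physically slice the tensors.
# The output is a narrower *dense* MLP — no masks, no sparse formats.
keep = np.sort(np.argsort(demand)[-int(d_ff * 0.75):])
W_gate2, W_up2, W_down2 = W_gate[keep], W_up[keep], W_down[:, keep]
```

After slicing, the block's config simply records the new intermediate size, which is why the artifact stays a standard dense checkpoint.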

Quantization and structural recompilation are complementary — Sculpt composes with AWQ, GPTQ, GGUF, and existing optimization pipelines for compound savings.

Mistral-7B · Production tier
17% smaller. 15% faster prefill. Drop-in replacement.
+15% prefill throughput · ~17% weight reduction · dense, standard HF artifact
Mistral-7B · Throughput tier
Maximum usable compression for speed and edge deployment.
+20% prefill throughput · ~23% weight reduction · dense, standard HF artifact
Full-model recompilation. No runtime modifications. No sparse kernels. No serving stack changes.
Forge
Expert Placement for MoE Deployments

For teams running Mixture-of-Experts models across multi-node GPU clusters. Forge observes expert routing patterns, builds a co-activation graph, and co-locates frequently activated experts to reduce cross-node communication. Same model. Same stack. Fewer hops.

Forge includes a prescriptive decision gate — it tells you whether placement will help for your specific workload, and recommends against it when it won't. 5/5 correct predictions across adversarial workloads.
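The co-activation idea can be shown in miniature: count how often pairs of experts fire together, then greedily place heavy pairs on the same node. The routing traces, node counts, and greedy policy below are all invented for illustration; Forge itself observes real routing in production and emits a placement artifact.

```python
# Toy Forge-style placement from made-up routing traces (illustrative only).
from collections import Counter
from itertools import combinations

# Each trace: the set of experts a token's router activated together.
traces = [{0, 3}, {0, 3}, {0, 3}, {1, 2}, {1, 2}, {0, 2}, {1, 3}]

# Co-activation graph: edge weight = how often two experts fire together.
coact = Counter()
for t in traces:
    for a, b in combinations(sorted(t), 2):
        coact[(a, b)] += 1

# Greedy co-location: walk edges heaviest-first, preferring the node that
# already hosts one endpoint, while respecting per-node expert capacity.
num_nodes, capacity = 2, 2
placement, load = {}, Counter()
for (a, b), _ in coact.most_common():
    target = placement.get(a)
    if target is None:
        target = placement.get(b)
    if target is None:
        target = min(range(num_nodes), key=lambda n: load[n])
    for e in (a, b):
        if e not in placement:
            if load[target] >= capacity:  # spill to the least-loaded node
                target = min(range(num_nodes), key=lambda n: load[n])
            placement[e] = target
            load[target] += 1
```

Experts that frequently co-fire end up on the same node, so their joint activations stop crossing the interconnect.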

Multi-node · 8× A100 · 2 nodes
Validated on A100-SXM4 and H100 multi-node clusters with vLLM.
+3.5% throughput · −4.1% P95 tail latency · −86% throughput variance
Single node · 4× A100 NVLink
Even on NVLink, structure matters.
−30% P95/P99 tail latency · 17× throughput stability · +2.7% throughput under skew
Read-only observation. Output is a placement artifact you apply at deploy time. No runtime changes. Contact us about MoE optimization for your cluster.

Reshape the model. Not the stack.

Platform agnostic

The output is a model, not a runtime. Works wherever models work.

Composable

Structural optimization before quantization, fine-tuning, and serving. Compounds with every downstream step.

Prescriptive

Quantifies expected gain and recommends whether to apply. When optimization won't help, Dystrio tells you.

Portable

Standard safetensors, standard config. Zero pipeline modification.

PyTorch · vLLM · SGLang · TGI · llama.cpp · TensorRT-LLM · AWQ · GPTQ · GGUF · LoRA · Kubernetes

Try a smaller model today.

Browse compressed checkpoints on Hugging Face — or talk to us about structural compilation for your model fleet.