The weights are the program.
Open-weight models are not API endpoints. They are 100% specified, frozen, and yours to embody in any substrate you choose. Silicon is one such substrate.
We turn open-weight language models into application-specific silicon. The result is a sealed inference appliance you own — no API, no telemetry, no seat fees, no provider that can change the terms next quarter.
A GPU is built to run any program. An inference workload is one program, executed a trillion times. The gap between those two facts is where the next thousand-fold improvement in efficiency lives — and where the case for ownership becomes unanswerable.
Open-weight models are not API endpoints. They are 100% specified, frozen, and yours to embody in any substrate you choose. Silicon is one such substrate.
Training needs flexibility, gradient flow, experimentation. Inference needs one numerical recipe executed at maximum efficiency. Different problem, different machine.
If your model fits in a sealed appliance with no network egress, your data never leaves your facility. Privacy stops being a policy and becomes a wiring diagram.
Per-token pricing made sense when nobody could afford the hardware. The hardware is now affordable. The pricing model is not.
We do not sell chips off a shelf. Each customer engagement compiles a specific model — frozen at a specific checkpoint — into a specific accelerator, packaged into a specific enclosure with a specific security posture. Below is the standard program.
Pick an open-weight model. Pick a quantization. Pick a context window. That checkpoint is now your product.
We instrument your real traffic. Token mixes, batch shapes, latency targets. The silicon is sized to the workload, not the spec sheet.
Layer mapping, dataflow, memory hierarchy. SRAM-heavy for low-latency decode. HBM-heavy for long-context prefill.
Bit-exact match against your frozen reference implementation. No "close enough." Determinism is the product.
Mask set, MPW or full reticle, first wafers. Twelve to sixteen weeks from sign-off to wafers in hand.
Board, chassis, firmware, network isolation. A sealed unit that boots your model and serves nothing else.
Bonded transport to your facility. Tamper-evident seals. Source-code escrow. The chip is yours; we keep a key only for warranty silicon.
A GPU spends most of its die area on things you do not use during inference: the rasterizer, the texture units, the FP64 path, the speculative scheduler. We delete every transistor that does not multiply a weight or move a token.
| Metric | General-purpose GPU | Hardagent-01 · DeepSeek-V4 build | Delta |
|---|---|---|---|
| Die area spent on inference-relevant logic | ~35 % | ~92 % | 2.6× |
| Tokens / second / watt (671B MoE, batch=1) | 0.45 | 14.2 | 31× |
| Idle power draw (model resident, no traffic) | ~280 W | ~40 W | 7× |
| Time-to-first-token, 32k context | 2.4 s | 0.31 s | 7.7× |
| 5-year TCO (1B tok/day workload) | $18.2 M | $3.9 M | 4.7× |
Figures are simulation targets for the Hardagent-01 reference design. Independent silicon validation pending first wafers. Comparison baseline: H100 SXM5 at MLPerf Inference v4.1 reference settings.
The Hardagent program admits four customers per fabrication cohort. Engagement begins with a workload review under mutual NDA. Minimum order: one appliance rack. Maximum: one fab line.
We are hiring physical-design, RTL, packaging, and firmware engineers with first-silicon experience on AI accelerators or HPC parts. Remote-friendly for design roles, on-site for bring-up. Equity is generous, optics are not.
You will own the MAC array and dataflow. SystemVerilog, formal experience preferred.
Floorplan, P&R, timing closure for an 800+ mm² die. Bring your war stories.
From JTAG to running a model. Hands-on, sleepless, satisfying.
2.5D CoWoS, HBM stacking, thermal. Coordinate with the foundry directly.