AI inference cast in silicon: Taalas announces HC1 chip
With the HC1, the startup Taalas aims to deliver a hardwired Llama 3.1 8B at almost 17,000 tokens/s – nearly ten times faster than previous solutions.
The Taalas HC1 chip, which enables AI inference in silicon, promises a significant performance increase.
(Image: Taalas)
The startup Taalas, founded in Canada in 2023, has announced the HC1, a technology demonstrator intended to take AI inference to a new level. Instead of running a language model as software on general-purpose AI compute accelerators, Taalas is, so to speak, casting the model into silicon. The first product is a "hardwired" Llama 3.1 8B, which the manufacturer says generates 17,000 tokens per second per user.
According to Taalas, the core is an application-specific integrated circuit (ASIC) with around 53 billion transistors, manufactured by TSMC in the 6-nm process (N6) with an 815 mm² die area.
As the company announced in a blog post, the HC1 is almost ten times faster than the current state of the art. For comparison: according to Nvidia's own baseline data, an Nvidia H200 achieves around 230 tokens per second on the same model. Specialized inference providers such as Cerebras reach around 1,936 tokens per second according to independent benchmarks from Artificial Analysis – roughly a ninth of the value Taalas claims. SambaNova follows with 916 tokens/s, Groq with 609 tokens/s.
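The relative speedups behind these comparisons can be reproduced directly from the cited throughput figures (a quick sketch; the numbers are the vendor claims and Artificial Analysis benchmarks quoted above, not independent measurements):

```python
# Throughput for Llama 3.1 8B as cited in the article, in tokens/s per user.
throughput = {
    "Taalas HC1 (claimed)": 17000,
    "Cerebras": 1936,
    "SambaNova": 916,
    "Groq": 609,
    "Nvidia H200": 230,
}

baseline = throughput["Taalas HC1 (claimed)"]
for name, tps in throughput.items():
    # Factor by which the HC1 claim exceeds each competitor
    print(f"{name}: {tps} tokens/s -> HC1 claim is {baseline / tps:.1f}x")
```

This confirms the article's framing: Cerebras at 1,936 tokens/s is about a factor of 8.8 below the Taalas claim ("roughly a ninth"), and the H200 baseline is about 74 times slower.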
However, the competition is not sleeping: Since December 2025, Nvidia has been licensing Groq's technology and has taken over large parts of the design team to strengthen its own position in dedicated hardware.
Test platform is running
Taalas provides the chatbot "Jimmy" for public testing, and it does respond with remarkable speed – almost 16,000 tokens per second were achievable in our test. The company has not yet announced a price for the HC1. Interested developers can register for access to an inference API.
The startup, founded two and a half years ago, pursues three core principles: total specialization on individual models, the merging of memory and compute logic on a single chip, and a radical simplification of the entire hardware stack. Taalas claims to combine memory and compute at DRAM-typical density on a single chip. This eliminates the separation between slow off-chip DRAM and fast on-chip memory, which is common in conventional inference hardware.
Cerebras also promises this, but builds its gigantic Wafer Scale Engine (WSE) for it, which occupies an entire wafer and converts 15 kW of power into heat.
No HBM, no water cooling, no advanced packaging
The approach fundamentally differs from what major chip manufacturers are currently pursuing. Nvidia relies on expensive High Bandwidth Memory (HBM), complex housing technology (packaging), and extremely high I/O data transfer rates for its AI accelerators like the H200.
Google's TPUs, Amazon's Inferentia, and Microsoft's recently announced Azure accelerator Maia 200 likewise rely on HBM; Maia 200, for instance, uses up to 216 GByte of HBM3E memory with a transfer rate of 7 TByte/s. Microsoft promises higher performance per dollar invested than with Nvidia technology, but Maia, too, is designed as a general-purpose accelerator for a variety of AI models.
Taalas eliminates this complexity by optimizing the HC1 exclusively for a single model. The result does without HBM, 3D stacking, liquid cooling, and high-speed I/O.
So far only mini model
However, this comes at a price in terms of flexibility. The HC1 is largely hardwired – the chip can only execute Llama 3.1 8B, not any other models.
Llama 3.1 was introduced in mid-2024, which is quite old by the standards of the AI arms race. The compact version with 8 billion parameters (hence Llama 3.1 8B) even runs in quantized form on a Raspberry Pi 5 – albeit very slowly.
At least the size of the context window can be configured, according to Taalas, and fine-tuning is possible via Low-Rank Adapters (LoRA). The company also concedes that the first silicon generation uses a proprietary 3-bit data format, combined with 6-bit parameters. This aggressive quantization leads to certain quality losses compared to GPU benchmarks at higher precision.
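Why such aggressive quantization costs accuracy can be illustrated with a minimal sketch. The actual Taalas format is proprietary and undisclosed; the example below uses plain symmetric uniform quantization as a stand-in to show how the rounding error grows as the bit width shrinks:

```python
import numpy as np

def quantize(x, bits):
    """Round x onto a symmetric uniform grid with 2**bits levels.

    Illustrative only -- a stand-in for the proprietary low-bit
    format Taalas describes, not its actual encoding.
    """
    levels = 2 ** (bits - 1) - 1          # e.g. 3 positive levels at 3 bits
    scale = np.max(np.abs(x)) / levels
    return np.round(x / scale) * scale

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 1.0, 10_000).astype(np.float32)

for bits in (3, 6, 8):
    err = np.mean((weights - quantize(weights, bits)) ** 2)
    print(f"{bits}-bit: mean squared error {err:.5f}")
```

Each additional bit roughly halves the grid spacing, so the mean squared error drops sharply from 3 to 6 to 8 bits – which is why the announced move to standardized 4-bit floating-point formats in the HC2 is framed as a quality fix.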
Next generation to solve quality problems
Taalas plans to deliver successors very quickly. The lean, automated, and fast development process for AI ASICs is the actual goal of the young company. It was founded by Tenstorrent founders Ljubisa Bajic and Drago Ignjatovic. Both previously worked for AMD for a long time, Bajic also for Nvidia. Due to the prominent names – the well-known chip developer Jim Keller currently leads Tenstorrent – Taalas is attracting a lot of attention in the AI scene.
A mere 24 team members realized the first product with an expenditure of 30 million US dollars – out of a total of over 200 million dollars in capital raised. For an N6 chip with 53 billion transistors, 30 million US dollars in development costs is very little. Given the extremely high prices for general-purpose AI accelerators, the founders expect a lucrative market niche.
Taalas explicitly targets data centers with its chips, promising costs there that are said to be 20 times lower than with conventional GPU inference, with a tenth of the power consumption.
A medium-sized reasoning model based on the same HC1 platform is expected to arrive in Taalas' labs in the spring and will be available shortly thereafter as an inference service.
After that, the company plans to implement a frontier LLM with the second chip generation, HC2. The HC2 platform is said to support standardized 4-bit floating-point formats, offer higher packing density, and work even faster. Deployment is planned for winter.
Classification and open questions
The performance figures cited by Taalas are impressive, but so far they can be verified only to a limited extent. The benchmarks come from in-house tests; independent measurements by third parties are not yet available.
It is also unclear how the quality losses due to aggressive quantization will affect performance in practice – especially for more complex tasks beyond simple chat conversations. It remains to be seen whether the concept of model-specific chips will scale economically if custom silicon has to be manufactured for each new model.
Taalas is not targeting so-called "edge AI" applications, where trained models run directly on the device without cloud connectivity. These are typically models for speech recognition, voice control, object detection in video images for surveillance cameras, radar sensor evaluation, or machine monitoring through sound analysis (predictive maintenance). This is the domain of Neural Processing Units (NPUs) with currently 10 to 90 Int8 TOPS, which are entering the market in bewildering variety: M5Stack's AI Pyramid-Pro, the Hailo NPUs for retrofitting the Raspberry Pi 5, Google Coral, and the embedded versions of x86 and ARM processors such as AMD Ryzen, Intel Panther Lake, Qualcomm Snapdragon, Mediatek Genio, Rockchip, and also RISC-V SoCs like the SpacemiT K3. The European automotive microcontroller specialists Infineon, STMicroelectronics, and NXP also offer chips with built-in NPUs, as do TI and Renesas.
(vza)