
"Tensormesh is using the money to build a commercial version of the open-source LMCache utility, launched and maintained by Tensormesh co-founder Yihua Cheng. Used well, LMCache can reduce inference costs by as much as ten times - a power that's made it a staple in open-source deployments and drawn in integrations from heavy-hitters like Google and Nvidia. Now, Tensormesh is planning to parlay that academic reputation into a viable business."
"The heart of the key-value cache (or KV cache), a memory system used to process complex inputs more efficiently by condensing them down to their key values. In traditional architectures, the KV cache is discarded at the end of each query - but TensorMesh CEO Juchen Jiang argues that this is an enormous source of inefficiency. "It's like having a very smart analyst reading all the data, but they forget what they have learned after each question," says Tensormesh co-founder Junchen Jiang."
"Instead of discarding that cache, Tensormesh's systems hold onto it, allowing it to be redeployed when the model executes a similar process in a separate query. Because GPU memory is so precious, this can mean spreading data across several different storage layers, but the reward is significantly more inference power for the same server load. The change is particularly powerful for chat interfaces, since models need to continually refer back to the growing chat log as the conversation progresses."
Tensormesh launched out of stealth with $4.5 million in seed funding led by Laude Ventures, plus angel funding from Michael Franklin. The company is building a commercial version of the open-source LMCache utility, which can reduce inference costs by as much as ten times and has drawn integrations from Google and Nvidia. Tensormesh focuses on preserving and reusing key-value (KV) caches across queries instead of discarding them, redeploying those condensed representations and spreading the data across storage layers to make the most of limited GPU memory. The approach yields significantly higher inference throughput, especially for growing chat logs and agentic systems.
Read at TechCrunch