#vllm

from InfoWorld
4 days ago

Unlocking LLM superpowers: How PagedAttention helps the memory maze

KV blocks are like pages: instead of contiguous memory, PagedAttention divides the KV cache of each sequence into small, fixed-size KV blocks, and each block holds the keys and values for a set number of tokens. Tokens are like bytes: individual tokens within the KV cache are like the bytes within a page. Requests are like processes: each LLM request is managed like a process, with its "logical" KV blocks mapped to "physical" KV blocks in GPU memory.
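
The analogy maps naturally onto a small allocator. The sketch below is a rough Python illustration, not vLLM's actual code: it keeps a per-request block table that maps logical KV blocks onto a shared pool of physical blocks, and the block size, pool size, and function names are assumptions chosen for readability.

```python
# Minimal sketch of the paging analogy above, in plain Python. The names
# (block_tables, BLOCK_SIZE, ...) and sizes are illustrative assumptions,
# not vLLM's actual internals or API.

BLOCK_SIZE = 16           # tokens per KV block (like bytes per page)
NUM_PHYSICAL_BLOCKS = 8   # fixed pool of GPU-resident blocks (like page frames)

# Free list of physical block ids, managed like an OS page-frame allocator.
free_blocks = list(range(NUM_PHYSICAL_BLOCKS))

# Per-request "page table": logical block index -> physical block id.
block_tables: dict[str, list[int]] = {}
token_counts: dict[str, int] = {}

def append_token(request_id: str) -> int:
    """Record one more token for a request, allocating a new physical
    block only when the current logical block is full."""
    table = block_tables.setdefault(request_id, [])
    n = token_counts.get(request_id, 0)
    if n % BLOCK_SIZE == 0:               # first token, or current block is full
        if not free_blocks:
            raise MemoryError("KV cache exhausted")
        table.append(free_blocks.pop())   # map new logical block -> physical block
    token_counts[request_id] = n + 1
    # The token's keys/values would be written into physical block table[-1],
    # at slot n % BLOCK_SIZE.
    return table[-1]

def free_request(request_id: str) -> None:
    """Return all physical blocks of a finished request to the shared pool."""
    free_blocks.extend(block_tables.pop(request_id, []))
    token_counts.pop(request_id, None)

# Two requests share one physical pool with no contiguous per-request reservation.
for _ in range(20):
    append_token("req-A")
for _ in range(5):
    append_token("req-B")
print(block_tables)       # e.g. {'req-A': [7, 6], 'req-B': [5]}
free_request("req-A")     # req-A's blocks become reusable immediately
```
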
Artificial intelligence
from InfoQ
1 week ago

GenAI at Scale: What It Enables, What It Costs, and How To Reduce the Pain

My name is Mark Kurtz. I was the CTO at a startup called Neural Magic. We were acquired by Red Hat at the end of last year, and I'm now working under the CTO arm at Red Hat. I'm going to be talking about GenAI at scale: essentially, what it enables, a quick overview of that, what it costs, and generally how to reduce the pain. Running through a little bit more of the structure, we'll go through the state of LLMs and real-world deployment trends.
Artificial intelligence
from InfoWorld
2 weeks ago

Evolving Kubernetes for generative AI inference

Kubernetes now includes native AI inference features: vLLM support, inference benchmarking, LLM-aware routing, inference gateway extensions, and accelerator scheduling.
from HackerNoon
3 months ago

KV-Cache Fragmentation in LLM Serving & PagedAttention Solution | HackerNoon

Prior reservation wastes memory even if the context lengths are known in advance, demonstrating the inefficiencies in current KV-cache allocation strategies in production systems.
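
For a sense of scale, the sketch below works through the arithmetic of reservation waste; every figure in it (per-token KV size, maximum length, request lengths) is an illustrative assumption, not a number from the article.

```python
# Back-of-the-envelope illustration of reservation waste versus block-granular
# (paged) allocation. All numbers are illustrative assumptions.

MAX_SEQ_LEN = 2048             # tokens reserved per request under prior reservation
KV_BYTES_PER_TOKEN = 800_000   # ~0.8 MB/token, roughly a 13B-class model in fp16 (assumed)
BLOCK_SIZE = 16                # tokens per KV block under paged allocation

actual_lengths = [150, 900, 300, 2048, 60]   # hypothetical request context lengths

# Prior reservation: every request gets a contiguous max-length slab up front.
reserved = len(actual_lengths) * MAX_SEQ_LEN * KV_BYTES_PER_TOKEN

# Memory the requests actually need for their real lengths.
used = sum(actual_lengths) * KV_BYTES_PER_TOKEN

def round_up(n: int, k: int) -> int:
    """Round n up to the next multiple of k."""
    return ((n + k - 1) // k) * k

# Paged allocation only wastes the unused tail of each request's last block.
paged = sum(round_up(n, BLOCK_SIZE) for n in actual_lengths) * KV_BYTES_PER_TOKEN

print(f"pre-reserved:           {reserved / 1e9:.1f} GB")
print(f"actually used:          {used / 1e9:.1f} GB")
print(f"paged (block-granular): {paged / 1e9:.1f} GB")
print(f"reservation waste:      {100 * (1 - used / reserved):.0f}%")
```
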