vLLM
vLLM is an open-source library for fast and memory-efficient LLM inference and serving. Originally developed at UC Berkeley, vLLM introduced the PagedAttention algorithm, which revolutionized how LLM serving manages GPU memory by applying virtual memory concepts to the attention key-value cache.
Earlier serving systems reserved a contiguous region of GPU memory for each request's maximum possible sequence length, leaving much of it wasted through fragmentation. PagedAttention instead allocates the key-value cache in small fixed-size blocks on demand, so memory use grows with the actual sequence length and vLLM can serve far more concurrent requests per GPU, dramatically improving throughput and reducing the cost of LLM inference. The library has been widely adopted by AI companies and cloud providers as the foundation of their model serving infrastructure.
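The block-based allocation idea can be sketched in a few lines of Python. This is an illustrative toy, not vLLM's internal implementation: the class names (BlockAllocator, Sequence) and the pool size are hypothetical, though the block size of 16 tokens mirrors a common vLLM default.

```python
BLOCK_SIZE = 16  # tokens stored per KV-cache block (a common vLLM default)

class BlockAllocator:
    """Hands out fixed-size physical blocks from a shared memory pool."""
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("KV cache exhausted")
        return self.free_blocks.pop()

    def free(self, block: int) -> None:
        self.free_blocks.append(block)

class Sequence:
    """Maps a request's logical token positions to physical blocks on demand."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []  # logical block index -> physical block
        self.num_tokens = 0

    def append_token(self) -> None:
        # A new physical block is allocated only when the current one fills up,
        # so memory grows with the actual sequence, not a preallocated maximum.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

allocator = BlockAllocator(num_blocks=8)
seq = Sequence(allocator)
for _ in range(20):  # 20 tokens fit in ceil(20 / 16) = 2 blocks
    seq.append_token()
print(len(seq.block_table))        # -> 2
print(len(allocator.free_blocks))  # -> 6
```

Because blocks need not be contiguous, freed blocks from finished requests are immediately reusable by new ones, which is what eliminates the fragmentation that limited earlier serving systems.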
In the model-building toolchain layer, vLLM has become essential infrastructure for the agentic economy — the serving engine that makes it practical to run the LLMs powering AI agents at scale, with the throughput and cost efficiency required for production deployment.