Red Hat's container-based virtual inference server cuts AI operating costs
The new inference server from Red Hat is resource-optimized, platform-independent, and runs clustered in Kubernetes containers.
At its annual summit (20 to 21 May in Boston), Red Hat presented an inference server for trained AI models that runs platform-independently and, thanks to virtualization, with modest hardware requirements.
The server is based on the open-source vLLM project, which not only abstracts inference away from specific hardware but also manages model memory cleverly, in the style of virtual-memory paging, to use hardware resources efficiently. Red Hat has opted for containerization with Kubernetes, so the server runs on any container platform or hyperscaler that supports Kubernetes and provides the necessary hardware: GPUs from Nvidia and AMD or Google's TPUs. Edge deployments are also possible. According to Red Hat, all common models can be run on it.
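Because vLLM exposes an OpenAI-compatible HTTP API, a containerized deployment can be queried with standard client code. The following is a minimal sketch in Python; the endpoint URL, API key, and model name are placeholders for illustration, not values shipped by Red Hat:

```python
# Minimal sketch: querying a vLLM-based inference server through its
# OpenAI-compatible API. Endpoint, key, and model name are hypothetical.
from openai import OpenAI

client = OpenAI(
    base_url="http://inference.example.internal:8000/v1",  # hypothetical cluster endpoint
    api_key="EMPTY",  # vLLM's OpenAI-compatible server accepts a dummy key unless one is configured
)

response = client.chat.completions.create(
    model="example-org/example-model",  # placeholder model identifier
    messages=[{"role": "user", "content": "Summarize what an inference server does."}],
    max_tokens=128,
)

print(response.choices[0].message.content)
```

Since the interface mirrors the OpenAI API, existing client applications can be pointed at the in-house server simply by changing the base URL.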
Users can also cluster the server across several containers, for which Red Hat uses llm-d, a project that the company runs together with Google, IBM, Nvidia and others.
Ready-made containers on Hugging Face
With this architecture and additional model-compression techniques from Neural Magic, Red Hat promises that trained models will also run on older, cheaper hardware without requiring the latest Nvidia cards. The server can be operated independently of RHEL and OpenShift. Red Hat offers optimized and secured model containers on Hugging Face.
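For local or edge use, such compressed models can also be loaded directly with vLLM's Python API. A sketch under the assumption of a hypothetical pre-quantized checkpoint on Hugging Face; the model name and quantization format are assumptions, not a specific Red Hat artifact:

```python
# Minimal sketch: loading a quantized model with the vLLM Python API for
# local inference. The checkpoint name and quantization format are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="example-org/example-model-w4a16",  # hypothetical quantized checkpoint
    quantization="compressed-tensors",        # assumed weight format for Neural Magic-style compression
    gpu_memory_utilization=0.85,              # leave headroom on smaller, older GPUs
)

params = SamplingParams(temperature=0.2, max_tokens=128)
outputs = llm.generate(["Explain the benefit of quantized inference."], params)

for out in outputs:
    print(out.outputs[0].text)
```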
Inferencing refers to the actual operation of a fully trained model; it is the stage at which the model serves users and answers their requests.
(who)