vllm-openai
vLLM-OpenAI Overview
Secure your stack with a hardened vLLM-OpenAI image, freshly built by Minimus. Minimus images always include the most up-to-date version of every package and dependency in the image.
vLLM-OpenAI is a high-throughput inference engine for large language models. It exposes an OpenAI-compatible API server, making it a drop-in replacement for applications that use the OpenAI format.
Try It Out
The Minimus vLLM-OpenAI image is designed to run on a machine equipped with an NVIDIA GPU. This guide demonstrates how to serve a model and query its OpenAI-compatible API.
Start the Server
Run the container with GPU access enabled, serving the Qwen2.5-1.5B-Instruct model:
docker run -d --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --ipc=host \
  reg.mini.dev/vllm-openai \
  --model Qwen/Qwen2.5-1.5B-Instruct
The server will start listening on http://localhost:8000. Model download and loading may take a few minutes on first run.
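Because loading can take a few minutes, it may help to wait for the server before sending requests. The following is a minimal sketch, not part of the Minimus image: it assumes the server answers `GET /v1/models` (the OpenAI-compatible model listing) once the model is loaded. The helper names `list_served_models` and `wait_until_ready`, and the timeout values, are illustrative choices.

```python
import json
import time
import urllib.request


def list_served_models(base_url="http://localhost:8000"):
    """Return the model IDs reported by GET /v1/models."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
        payload = json.load(resp)
    return [entry["id"] for entry in payload.get("data", [])]


def wait_until_ready(probe=list_served_models, timeout_s=600.0, interval_s=5.0):
    """Poll `probe` until it returns a non-empty list or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            models = probe()
            if models:
                return models
        except OSError:
            # Covers refused connections, timeouts, and urllib's URLError:
            # the server is not accepting requests yet.
            pass
        if time.monotonic() >= deadline:
            raise TimeoutError("vLLM server did not become ready in time")
        time.sleep(interval_s)
```

Calling `wait_until_ready()` with no arguments polls http://localhost:8000 every five seconds for up to ten minutes and returns the list of served model IDs. The `probe` parameter exists only so that the polling loop can be exercised without a running server.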
To serve a gated or private model, pass your Hugging Face token:
docker run -d --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --ipc=host \
  -e HUGGING_FACE_HUB_TOKEN=<your-hf-token> \
  reg.mini.dev/vllm-openai \
  --model meta-llama/Llama-3.1-8B-Instruct
Send a Test Request
From another terminal, send a chat completion request:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello, what can you do?"}
    ]
  }'
You can also use the OpenAI Python client:
from openai import OpenAI

# The client requires an API key; a placeholder works unless the server enforces one.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# The model name must match the one passed to --model when the server was started.
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
For more deployment options (multi-GPU, Kubernetes, Helm), see the official deployment guide.
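If installing the `openai` package is not an option, the same chat completion request can be sent with only the Python standard library. This is a minimal sketch that mirrors the curl request shown earlier. The helper names `build_chat_request`, `extract_reply`, and `chat` are illustrative, and the response is assumed to follow the OpenAI chat completion shape (`choices[0].message.content`).

```python
import json
import urllib.request


def build_chat_request(model, user_text, base_url="http://localhost:8000",
                       system_text="You are a helpful assistant."):
    """Build (but do not send) a POST to /v1/chat/completions."""
    body = {
        "model": model,
        "messages": [
            {"role": "system", "content": system_text},
            {"role": "user", "content": user_text},
        ],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_json):
    """Pull the assistant text out of a chat completion response."""
    return response_json["choices"][0]["message"]["content"]


def chat(model, user_text, **kwargs):
    """Send the request and return the assistant's reply as a string."""
    request = build_chat_request(model, user_text, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as resp:
        return extract_reply(json.load(resp))
```

With the server from the earlier step running, `print(chat("Qwen/Qwen2.5-1.5B-Instruct", "Hello, what can you do?"))` prints the model's reply. Building the request separately from sending it keeps the payload easy to inspect or log before it goes over the wire.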
Technical Considerations
The vLLM-OpenAI image provided by Minimus is a slim, security-hardened alternative to the public image from Docker Hub. The images are largely interchangeable, with a few differences as noted below.
vLLM-OpenAI built by Minimus:
- The Minimus vLLM-OpenAI image is designed to run on a machine equipped with an NVIDIA GPU.
- Runs as root, as does the public image, to support required functions.
- Drill down on the version specification tab to see the default user, listening ports, entrypoint, volumes, environment variables, etc.
The Payoff
A hardened, minimal image that stays more secure over the long run and accrues vulnerabilities at a slower rate.
- See the risk reduction dashboard for a detailed CVE comparison over the past 30 days.
- Review the compliance report to see the default hardening and security configurations for the image.
Terms & Info
Trademark
This catalog is published by Minimus. All product names, logos, and marks shown, other than those belonging to Minimus, are owned by their respective rights holders and appear here only to identify the open source software each image contains. Minimus claims no ownership of those marks and implies no affiliation with, endorsement by, certification by, or sponsorship by any rights holder.
Disclaimer
Images are provided "as-is" without warranty of any kind. "Hardened" refers to the security configuration applied at the time of build and does not constitute a guarantee of ongoing security or absence of vulnerabilities. The free tier is provided without support, SLA, or guaranteed patching timelines. Security updates may be applied to paid subscriptions before or instead of free tier images. By pulling or using any image you agree to our Terms of Use.