vllm-openai

docker pull reg.mini.dev/vllm-openai

Updated 5 hours ago

vLLM-OpenAI Overview

Secure your stack with a hardened vLLM-OpenAI image, freshly built by Minimus. Minimus images always include the most up-to-date version of every package and dependency in the image.

vLLM-OpenAI is a high-throughput inference engine for large language models. It exposes an OpenAI-compatible API server, making it a drop-in replacement for applications that use the OpenAI format.

Try It Out

The Minimus vLLM-OpenAI image is designed to run on a machine equipped with an NVIDIA GPU. This guide demonstrates how to serve a model and query its OpenAI-compatible API.

Start the Server

Run the container with GPU access enabled, serving the Qwen2.5-1.5B-Instruct model:

docker run -d --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --ipc=host \
  reg.mini.dev/vllm-openai \
  --model Qwen/Qwen2.5-1.5B-Instruct

The server listens on http://localhost:8000. On first run, downloading and loading the model may take a few minutes, and requests will fail until loading completes.
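Because startup can take minutes, a script that talks to the server may want to wait until it is ready. The following is a minimal sketch, not part of the Minimus image: it assumes the server answers GET /health with HTTP 200 once the model is loaded, as upstream vLLM does. The names probe_health and wait_until_ready, and the timeout values, are illustrative.

```python
import time
import urllib.error
import urllib.request


def probe_health(url: str) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, reset, timed out, or non-2xx: not ready yet.
        return False


def wait_until_ready(url="http://localhost:8000/health",
                     timeout_s=600.0, interval_s=5.0,
                     probe=probe_health, sleep=time.sleep) -> bool:
    """Poll `probe(url)` until it succeeds or `timeout_s` elapses.

    `probe` and `sleep` are injectable so the loop can be exercised
    without a running server.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

Calling wait_until_ready() with no arguments polls http://localhost:8000/health every 5 seconds for up to 10 minutes and returns False if the server never becomes ready. If /health is unavailable in your setup, pointing the URL at /v1/models is a plausible alternative.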

To serve a gated or private model, pass your Hugging Face token:

docker run -d --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --ipc=host \
  -e HUGGING_FACE_HUB_TOKEN=<your-hf-token> \
  reg.mini.dev/vllm-openai \
  --model meta-llama/Llama-3.1-8B-Instruct

Send a Test Request

From another terminal, send a chat completion request:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello, what can you do?"}
    ]
  }'

You can also use the OpenAI Python client:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
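If installing the openai package is not an option, the same request can be sketched with the Python standard library alone. This is an illustrative sketch: the helper names build_chat_request, extract_reply, and chat are hypothetical, and the response is assumed to follow the OpenAI chat completion shape (choices[0].message.content), which is what the client example above reads.

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_text,
                       system_text="You are a helpful assistant."):
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system_text},
            {"role": "user", "content": user_text},
        ],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Pull the assistant text out of a chat completion response body."""
    data = json.loads(response_body)
    return data["choices"][0]["message"]["content"]


def chat(base_url, model, user_text):
    """Send one chat request and return the assistant's reply text."""
    req = build_chat_request(base_url, model, user_text)
    with urllib.request.urlopen(req, timeout=120) as resp:
        return extract_reply(resp.read())
```

With the server from the previous step running, chat("http://localhost:8000", "Qwen/Qwen2.5-1.5B-Instruct", "Hello!") returns the model's reply as a string.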

For more deployment options (multi-GPU, Kubernetes, Helm), see the official deployment guide.

Technical Considerations

The vLLM-OpenAI image provided by Minimus is a slim, security-hardened alternative to the public image from Docker Hub. The images are largely interchangeable, with a few differences as noted below.

vLLM-OpenAI built by Minimus:

The Minimus vLLM-OpenAI image is designed to run on a machine equipped with an NVIDIA GPU.
Runs as root, as the public image does, to support required functions.
Drill down into the version specification tab to see the default user, listening ports, entrypoint, volumes, environment variables, etc.

The Payoff

A hardened, minimal image that stays more secure over the long run and accrues vulnerabilities at a slower rate.

See the risk reduction dashboard for a detailed CVE comparison over the past 30 days.
Review the compliance report to see the default hardening and security configurations for the image.

Terms & Info

Trademark

This catalog is published by Minimus. All product names, logos, and marks shown, other than those belonging to Minimus, are owned by their respective rights holders and appear here only to identify the open source software each image contains. Minimus claims no ownership of those marks and implies no affiliation with, endorsement by, certification by, or sponsorship by any rights holder.

Disclaimer

Images are provided "as-is" without warranty of any kind. "Hardened" refers to the security configuration applied at the time of build and does not constitute a guarantee of ongoing security or absence of vulnerabilities. The free tier is provided without support, SLA, or guaranteed patching timelines. Security updates may be applied to paid subscriptions before or instead of free tier images. By pulling or using any image you agree to our Terms of Use.