
svc-llm

GPU-accelerated LLM routing, local inference, and vector search — the model layer of the Generate One platform.



Overview

svc-llm is the AI model layer of Generate One. It runs LiteLLM as a unified proxy that routes requests across cloud providers (Groq, Cerebras, OpenRouter) and local vLLM instances on an RTX 5090 GPU. It also hosts Qdrant for vector similarity search used by the knowledge and memory systems. All 8 model tiers and 3 GPU-resident vLLM processes are managed from this single compose stack.


🏗️ Architecture

```mermaid
graph TD
    subgraph svc-llm
        LiteLLM["LiteLLM Proxy\n:4000 → llm.generate.one"]
        Qdrant["Qdrant\n:6333"]
        VE["vllm-embedding\n:8100\nQwen3-Embedding-8B FP8\n30% VRAM"]
        VR["vllm-reranker\n:8101\nmxbai-rerank-large-v2\n15% VRAM"]
        VS["vllm-smallmodel\n:8102\nQwen3-8B-FP8\n45% VRAM"]
    end

    subgraph Cloud["Cloud Providers"]
        Groq
        Cerebras
        OpenRouter
    end

    LiteLLM --> Groq
    LiteLLM --> Cerebras
    LiteLLM --> OpenRouter
    LiteLLM --> VE
    LiteLLM --> VR
    LiteLLM --> VS

    Client --> LiteLLM
    Brain["g1-brain"] --> Qdrant
    Brain --> LiteLLM
```

📦 Services

| Service | Image | Port | Description |
|---|---|---|---|
| litellm | ghcr.io/berriai/litellm-database:main-stable | 4000 | LLM proxy with multi-provider routing + virtual keys |
| qdrant | qdrant/qdrant:latest | 6333, 6334 | Vector similarity search engine |
| vllm-embedding | vllm/vllm-openai:latest | 8100 | Qwen3-Embedding-8B (FP8, 30% VRAM) |
| vllm-reranker | vllm/vllm-openai:latest | 8101 | mxbai-rerank-large-v2 (FP16, 15% VRAM) |
| vllm-smallmodel | vllm/vllm-openai:latest | 8102 | Qwen3-8B-FP8 (45% VRAM, 16K context) |

🔧 Model Tiers

| Tier | Primary Model | Fallback Chain |
|---|---|---|
| svc-llm | K2-instruct (Groq) | OpenRouter |
| svc-llm-turbo | GPT-OSS-120B (Groq) | Cerebras, OpenRouter |
| svc-llm-mini | Qwen3-235B (Cerebras) | OpenRouter |
| svc-llm-code | GLM-4.7 (Cerebras) | OpenRouter |
| svc-llm-pro | K2.5 (OpenRouter) | — |
| svc-llm-code-pro | GLM-5 (OpenRouter) | — |
| svc-llm-micro | Qwen3-8B (local vLLM) | Cerebras, OpenRouter |
| g1-vlm | Qwen3-VL-30B (OpenRouter) | svc-llm-pro |
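The tier table above boils down to an ordered attempt list per tier: LiteLLM tries the primary model, then walks the fallback chain. A minimal sketch of that ordering (this is an illustration of the routing concept, not LiteLLM's actual router code; `attempt_order` is a hypothetical helper):

```python
# Illustrative sketch of tier -> attempt order from the table above.
# NOT LiteLLM's router implementation; it only models the failover order.
TIERS: dict[str, list[str]] = {
    "svc-llm":          ["K2-instruct (Groq)", "OpenRouter"],
    "svc-llm-turbo":    ["GPT-OSS-120B (Groq)", "Cerebras", "OpenRouter"],
    "svc-llm-mini":     ["Qwen3-235B (Cerebras)", "OpenRouter"],
    "svc-llm-code":     ["GLM-4.7 (Cerebras)", "OpenRouter"],
    "svc-llm-pro":      ["K2.5 (OpenRouter)"],
    "svc-llm-code-pro": ["GLM-5 (OpenRouter)"],
    "svc-llm-micro":    ["Qwen3-8B (local vLLM)", "Cerebras", "OpenRouter"],
    "g1-vlm":           ["Qwen3-VL-30B (OpenRouter)", "svc-llm-pro"],
}

def attempt_order(tier: str) -> list[str]:
    """Primary model first, then fallbacks, in the order they are tried."""
    return TIERS[tier]

print(attempt_order("svc-llm-turbo"))
```

Note that `g1-vlm` falls back to another *tier* (`svc-llm-pro`) rather than a provider, so its chain can recurse through that tier's own fallbacks.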

🖥️ GPU VRAM Budget (RTX 5090, 32 GB)

| Process | Allocation | VRAM |
|---|---|---|
| vllm-reranker | 15% | ~5.1 GB |
| vllm-embedding | 30% | ~9.7 GB |
| vllm-smallmodel | 45% | ~14.9 GB |
| **Total** | **90%** | **~29.7 GB** |
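Each percentage maps to a per-process `--gpu-memory-utilization` fraction in vLLM. A quick sketch of the budget math (exact GB figures depend on how the driver reports the 32 GB card, so these computed values differ slightly from the table's measured ones):

```python
# Sketch of the VRAM budget: each vLLM process is capped at a fraction of
# total GPU memory via vLLM's --gpu-memory-utilization flag.
TOTAL_VRAM_GB = 32  # RTX 5090

allocations = {
    "vllm-reranker": 0.15,
    "vllm-embedding": 0.30,
    "vllm-smallmodel": 0.45,
}

total_fraction = sum(allocations.values())
headroom_gb = TOTAL_VRAM_GB * (1 - total_fraction)

for name, frac in allocations.items():
    print(f"{name}: {frac:.0%} of {TOTAL_VRAM_GB} GB ≈ {TOTAL_VRAM_GB * frac:.1f} GB")
print(f"total: {total_fraction:.0%}, headroom ≈ {headroom_gb:.1f} GB")
```

The ~10% headroom is what keeps CUDA context overhead and transient allocations from pushing the three processes into out-of-memory errors.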

🔧 Configuration

| Variable | Description |
|---|---|
| LITELLM_MASTER_KEY | Master API key (sk- prefix required) |
| OPENROUTER_API_KEY | OpenRouter provider key |
| GROQ_API_KEY | Groq provider key |
| CEREBRAS_API_KEY | Cerebras provider key |
| LANGFUSE_PUBLIC_KEY | Langfuse tracing public key |
| LANGFUSE_SECRET_KEY | Langfuse tracing secret key |
| DATABASE_URL | PostgreSQL for LiteLLM DB (g1-core shared PG) |
| VALKEY_HOST / VALKEY_PASSWORD | Valkey cache (DB 7) |
| HF_TOKEN | HuggingFace token for model downloads |
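A minimal `.env` sketch with placeholder values — every value, hostname, and key prefix below is illustrative; `.env.example` in the repo is the authoritative template:

```ini
# Illustrative placeholders only — copy .env.example for the real template.
LITELLM_MASTER_KEY=sk-replace-me
OPENROUTER_API_KEY=replace-me
GROQ_API_KEY=replace-me
CEREBRAS_API_KEY=replace-me
LANGFUSE_PUBLIC_KEY=replace-me
LANGFUSE_SECRET_KEY=replace-me
# Hostnames here are hypothetical; point at the g1-core shared services.
DATABASE_URL=postgresql://litellm:replace-me@postgres-host:5432/litellm
VALKEY_HOST=valkey-host
VALKEY_PASSWORD=replace-me
HF_TOKEN=replace-me
```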

Mounted Files

| File | Purpose |
|---|---|
| litellm_config.yaml | Model routing config (tiers, fallbacks, providers) |
| guardrail_callback.py | Content safety guardrail (regex + LLM classification) |
| custom_jwt_auth.py | JWT + virtual key authentication handler |
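`litellm_config.yaml` follows LiteLLM's standard proxy config schema. A hedged sketch of how one tier with a fallback might be expressed — the model IDs, parameters, and fallback target below are illustrative, not the repo's actual configuration:

```yaml
# Illustrative fragment in LiteLLM's config schema — not the repo's real file.
model_list:
  - model_name: svc-llm-turbo
    litellm_params:
      model: groq/placeholder-model-id   # provider/model id (placeholder)
      api_key: os.environ/GROQ_API_KEY   # LiteLLM env-var reference syntax

router_settings:
  fallbacks:
    - svc-llm-turbo: ["svc-llm-mini"]    # fallback target is illustrative
```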

🔐 Security Features

  • Custom JWT auth — Authentik OIDC tokens accepted alongside virtual keys
  • Content guardrail — Regex patterns + LLM-based classification for prompt injection detection
  • Langfuse tracing — All requests traced for observability at observe.generate.one

🚀 Quick Start

```bash
# LiteLLM + Qdrant service directory
cd /data/coolify/services/t8c8cgogcs08o0gk0wwgksoo

# vLLM GPU service directory (local deploy — GPU reservation requires docker compose directly)
cd /data/coolify/services/lgscs08wo0socwsc0okw4cwo

# Apply changes (stagger vLLM restarts to avoid GPU contention)
docker compose up -d litellm
docker compose up -d qdrant

# View logs
docker compose logs -f litellm

# Test model routing
curl -s https://llm.generate.one/v1/models \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" | jq '.data[].id'
```

Note: vLLM services use `deploy.resources.reservations.devices` (GPU), which requires a local `docker compose up -d` rather than a Coolify deploy (Rule 59).
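One way to honour the "stagger vLLM restarts" comment above, restarting the GPU services one at a time so model loads don't contend for VRAM (service names come from the Services table; the 60-second gap is an illustrative guess, tune it to your observed model load times):

```bash
# Illustrative staggered restart — the sleep interval is a guess, not measured.
for svc in vllm-reranker vllm-embedding vllm-smallmodel; do
  docker compose up -d "$svc"
  sleep 60
done
```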


🔗 Key Endpoints

| Endpoint | Description |
|---|---|
| https://llm.generate.one/ | LiteLLM proxy (OpenAI-compatible) |
| https://llm.generate.one/v1/chat/completions | Chat API |
| https://llm.generate.one/v1/embeddings | Embedding API |
| https://llm.generate.one/health/liveness | Health check |
| http://litellm-t8c8cgogcs08o0gk0wwgksoo:4000/v1 | Internal endpoint for cross-stack calls |
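Because the proxy is OpenAI-compatible, the Chat API accepts a standard chat-completions body with a tier name in the `model` field. A sketch of the request payload (constructing it only — actually POSTing it to `https://llm.generate.one/v1/chat/completions` requires a valid Bearer key):

```python
import json

# Standard OpenAI-style chat-completions payload for the LiteLLM proxy.
# The "model" field takes any tier name from the Model Tiers table.
payload = {
    "model": "svc-llm",
    "messages": [
        {"role": "user", "content": "Say hello in one word."},
    ],
    "max_tokens": 16,
}

body = json.dumps(payload)
print(body)
```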

🔗 Dependencies

Depends on:

  • g1-core — PostgreSQL (litellm DB), Valkey (DB 7 for caching)
  • g1-observe — Langfuse for trace collection

Depended on by:

  • g1-brain — Graphiti + knowledge-mcp use LLM tiers for inference
  • g1-gpt — LibreChat routes all model requests through LiteLLM
  • g1-mcp — MCP tools use LLM tiers for query rewriting, classification
  • g1-agent-backend — PydanticAI agent uses LiteLLM for all completions

| Repo | Relationship |
|---|---|
| g1-core | PostgreSQL + Valkey backend |
| g1-brain | Knowledge/memory search uses Qdrant + LLM tiers |
| g1-observe | Langfuse receives LiteLLM traces |
| g1-gpt | LibreChat UI connects to LiteLLM |

🛡️ Part of Generate One

Generate One — AI infrastructure that answers to you.

Self-hosted, sovereign AI platform. generate.one

Licensed under AGPL-3.0.