# svc-llm ⚡

GPU-accelerated LLM routing, local inference, and vector search — the model layer of the Generate One platform.
## ✨ Overview
svc-llm is the AI model layer of Generate One. It runs LiteLLM as a unified proxy that routes requests across cloud providers (Groq, Cerebras, OpenRouter) and local vLLM instances on an RTX 5090 GPU. It also hosts Qdrant for vector similarity search used by the knowledge and memory systems. All 8 model tiers and 3 GPU-resident vLLM processes are managed from this single compose stack.
## 🏗️ Architecture

```mermaid
graph TD
    subgraph svc-llm
        LiteLLM["LiteLLM Proxy\n:4000 → llm.generate.one"]
        Qdrant["Qdrant\n:6333"]
        VE["vllm-embedding\n:8100\nQwen3-Embedding-8B FP8\n30% VRAM"]
        VR["vllm-reranker\n:8101\nmxbai-rerank-large-v2\n15% VRAM"]
        VS["vllm-smallmodel\n:8102\nQwen3-8B-FP8\n45% VRAM"]
    end
    subgraph Cloud["Cloud Providers"]
        Groq
        Cerebras
        OpenRouter
    end
    LiteLLM --> Groq
    LiteLLM --> Cerebras
    LiteLLM --> OpenRouter
    LiteLLM --> VE
    LiteLLM --> VR
    LiteLLM --> VS
    Client --> LiteLLM
    Brain["g1-brain"] --> Qdrant
    Brain --> LiteLLM
```
## 📦 Services

| Service | Image | Port | Description |
|---|---|---|---|
| litellm | `ghcr.io/berriai/litellm-database:main-stable` | 4000 | LLM proxy with multi-provider routing + virtual keys |
| qdrant | `qdrant/qdrant:latest` | 6333, 6334 | Vector similarity search engine |
| vllm-embedding | `vllm/vllm-openai:latest` | 8100 | Qwen3-Embedding-8B (FP8, 30% VRAM) |
| vllm-reranker | `vllm/vllm-openai:latest` | 8101 | mxbai-rerank-large-v2 (FP16, 15% VRAM) |
| vllm-smallmodel | `vllm/vllm-openai:latest` | 8102 | Qwen3-8B-FP8 (45% VRAM, 16K context) |
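Qdrant exposes a plain REST API on its HTTP port, so a liveness check needs nothing but curl. A hedged sketch — `qdrant` here is assumed to be the compose service name on the internal network; adjust if yours differs:

```shell
# Qdrant health probe + collection listing via its REST API (HTTP port 6333).
# "qdrant" is an assumed compose service hostname, not confirmed by this repo.
QDRANT_URL="http://qdrant:6333"

# Uncomment to probe a running instance:
# curl -s "$QDRANT_URL/healthz"
# curl -s "$QDRANT_URL/collections" | jq '.result.collections[].name'
echo "$QDRANT_URL"
```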
## 🔧 Model Tiers

| Tier | Primary Model | Fallback Chain |
|---|---|---|
| `svc-llm` | K2-instruct (Groq) | OpenRouter |
| `svc-llm-turbo` | GPT-OSS-120B (Groq) | Cerebras, OpenRouter |
| `svc-llm-mini` | Qwen3-235B (Cerebras) | OpenRouter |
| `svc-llm-code` | GLM-4.7 (Cerebras) | OpenRouter |
| `svc-llm-pro` | K2.5 (OpenRouter) | — |
| `svc-llm-code-pro` | GLM-5 (OpenRouter) | — |
| `svc-llm-micro` | Qwen3-8B (local vLLM) | Cerebras, OpenRouter |
| `g1-vlm` | Qwen3-VL-30B (OpenRouter) | svc-llm-pro |
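Tier names double as OpenAI-style model IDs, so switching tiers is just a change of the `model` field — the fallback chain runs server-side, invisible to the caller. A minimal smoke test, assuming `LITELLM_MASTER_KEY` is exported:

```shell
# Chat request targeting the svc-llm-turbo tier; fallback to Cerebras /
# OpenRouter is handled by the proxy, not the client.
CHAT_REQUEST='{"model": "svc-llm-turbo", "messages": [{"role": "user", "content": "ping"}], "max_tokens": 8}'

# Uncomment to send against the live proxy:
# curl -s https://llm.generate.one/v1/chat/completions \
#   -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
#   -H "Content-Type: application/json" \
#   -d "$CHAT_REQUEST"
echo "$CHAT_REQUEST"
```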
## 🖥️ GPU VRAM Budget (RTX 5090, 32 GB)
| Process | Allocation | VRAM |
|---|---|---|
| vllm-reranker | 15% | ~5.1 GB |
| vllm-embedding | 30% | ~9.7 GB |
| vllm-smallmodel | 45% | ~14.9 GB |
| Total | 90% | ~29.7 GB |
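The allocation percentages map directly onto vLLM's `--gpu-memory-utilization` flag, one value per process. A quick sanity check that the budgeted fractions stay under 1.0 (fractions copied from the table above):

```shell
# Sum the per-service GPU fractions and confirm they leave headroom
# (roughly 10% of VRAM) for CUDA context and allocator fragmentation.
TOTAL=$(awk 'BEGIN { printf "%.2f", 0.30 + 0.15 + 0.45 }')
awk -v t="$TOTAL" 'BEGIN { exit !(t < 1.0) }' && echo "budgeted fraction: $TOTAL"
```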
## 🔧 Configuration

| Variable | Description |
|---|---|
| `LITELLM_MASTER_KEY` | Master API key (`sk-` prefix required) |
| `OPENROUTER_API_KEY` | OpenRouter provider key |
| `GROQ_API_KEY` | Groq provider key |
| `CEREBRAS_API_KEY` | Cerebras provider key |
| `LANGFUSE_PUBLIC_KEY` | Langfuse tracing public key |
| `LANGFUSE_SECRET_KEY` | Langfuse tracing secret key |
| `DATABASE_URL` | PostgreSQL for LiteLLM DB (g1-core shared PG) |
| `VALKEY_HOST` / `VALKEY_PASSWORD` | Valkey cache (DB 7) |
| `HF_TOKEN` | HuggingFace token for model downloads |
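For orientation, a skeleton `.env` built from the table above — every value is a placeholder, and the PostgreSQL/Valkey hostnames are hypothetical (the real ones come from the g1-core stack):

```shell
# Placeholder .env sketch; no value or hostname here is real.
LITELLM_MASTER_KEY=sk-change-me          # must carry the sk- prefix
OPENROUTER_API_KEY=change-me
GROQ_API_KEY=change-me
CEREBRAS_API_KEY=change-me
LANGFUSE_PUBLIC_KEY=pk-lf-change-me
LANGFUSE_SECRET_KEY=sk-lf-change-me
DATABASE_URL=postgresql://litellm:change-me@postgres-host:5432/litellm   # hypothetical host
VALKEY_HOST=valkey-host                  # hypothetical host; cache uses DB 7
VALKEY_PASSWORD=change-me
HF_TOKEN=change-me
```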
### Mounted Files

| File | Purpose |
|---|---|
| `litellm_config.yaml` | Model routing config (tiers, fallbacks, providers) |
| `guardrail_callback.py` | Content safety guardrail (regex + LLM classification) |
| `custom_jwt_auth.py` | JWT + virtual key authentication handler |
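To show how a tier and its fallback are wired together, a minimal sketch of what one `litellm_config.yaml` entry might look like — field names follow LiteLLM's proxy config format, but the model path, hostname, and fallback shown here are illustrative, not copied from this repo:

```yaml
# Illustrative fragment only; the real config defines all 8 tiers.
model_list:
  - model_name: svc-llm-micro
    litellm_params:
      model: hosted_vllm/Qwen/Qwen3-8B-FP8     # assumed model path
      api_base: http://vllm-smallmodel:8102/v1  # local vLLM backend

router_settings:
  fallbacks:
    - svc-llm-micro: ["svc-llm-mini"]           # assumed fallback mapping
```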
## 🔐 Security Features

- Custom JWT auth — Authentik OIDC tokens accepted alongside virtual keys
- Content guardrail — regex patterns + LLM-based classification for prompt-injection detection
- Langfuse tracing — all requests traced for observability at observe.generate.one
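Virtual keys are minted through LiteLLM's `/key/generate` endpoint using the master key. A hedged sketch — the tier scope and duration below are illustrative:

```shell
# Request body for LiteLLM's /key/generate endpoint; the tier scope and
# duration are illustrative, and the call requires the master key.
KEY_REQUEST='{"models": ["svc-llm-mini"], "duration": "30d"}'

# Uncomment to mint a key against the live proxy:
# curl -s https://llm.generate.one/key/generate \
#   -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
#   -H "Content-Type: application/json" \
#   -d "$KEY_REQUEST"
echo "$KEY_REQUEST"
```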
## 🚀 Quick Start

```bash
# LiteLLM + Qdrant service directory
cd /data/coolify/services/t8c8cgogcs08o0gk0wwgksoo

# vLLM GPU service directory (local deploy — GPU reservation requires docker compose directly)
cd /data/coolify/services/lgscs08wo0socwsc0okw4cwo

# Apply changes (stagger vLLM restarts to avoid GPU contention)
docker compose up -d litellm
docker compose up -d qdrant

# View logs
docker compose logs -f litellm

# Test model routing
curl -s https://llm.generate.one/v1/models \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" | jq '.data[].id'
```

> Note: the vLLM services use `deploy: resources: reservations: devices` (GPU), which requires a local `docker compose up -d` rather than a Coolify deploy (Rule 59).
## 🔗 Key Endpoints

| Endpoint | Description |
|---|---|
| `https://llm.generate.one/` | LiteLLM proxy (OpenAI-compatible) |
| `https://llm.generate.one/v1/chat/completions` | Chat API |
| `https://llm.generate.one/v1/embeddings` | Embedding API |
| `https://llm.generate.one/health/liveness` | Health check |
| `http://litellm-t8c8cgogcs08o0gk0wwgksoo:4000/v1` (internal) | Cross-stack calls |
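The embeddings endpoint follows the OpenAI shape as well. A sketch — `MODEL_ALIAS` is a deliberate placeholder, since the actual alias for the embedding backend is defined in `litellm_config.yaml`:

```shell
# Embedding request payload; "MODEL_ALIAS" is a placeholder -- the real
# alias for Qwen3-Embedding-8B lives in litellm_config.yaml.
EMBED_REQUEST='{"model": "MODEL_ALIAS", "input": ["vector search with Qdrant"]}'

# Uncomment to run against the live endpoint:
# curl -s https://llm.generate.one/v1/embeddings \
#   -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
#   -H "Content-Type: application/json" \
#   -d "$EMBED_REQUEST"
echo "$EMBED_REQUEST"
```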
## 🔗 Dependencies
Depends on:
- g1-core — PostgreSQL (litellm DB), Valkey (DB 7 for caching)
- g1-observe — Langfuse for trace collection
Depended on by:
- g1-brain — Graphiti + knowledge-mcp use LLM tiers for inference
- g1-gpt — LibreChat routes all model requests through LiteLLM
- g1-mcp — MCP tools use LLM tiers for query rewriting, classification
- g1-agent-backend — PydanticAI agent uses LiteLLM for all completions
## 🔗 Related Repos
| Repo | Relationship |
|---|---|
| g1-core | PostgreSQL + Valkey backend |
| g1-brain | Knowledge/memory search uses Qdrant + LLM tiers |
| g1-observe | Langfuse receives LiteLLM traces |
| g1-gpt | LibreChat UI connects to LiteLLM |
## 🛡️ Part of Generate One

Generate One — AI infrastructure that answers to you.
Self-hosted, sovereign AI platform. generate.one

Licensed under AGPL-3.0.