generate-one/svc-tools

Fork 0

MetaMCP connector stability: multiple Claude.ai connectors disconnect when new CD leads launch #1

New issue

Closed

opened 2026-03-26 05:11:08 +00:00 by g1admin · 3 comments

g1admin commented

2026-03-26 05:11:08 +00:00

Owner

Problem

Claude.ai MCP connectors (via MetaMCP OAuth 2.1) disconnect when new CD lead sessions are launched. Pattern observed: 5 of 8 namespaces dropped (g1-code, g1-web, g1-time, g1-math, g1-presenter) while 3 remained connected (g1-brain, g1-coolify, g1-project). This happened after launching product-lead-v3 and llm-lead-v1 concurrently with an active architect session.

Impact

Operator must manually reconnect in Claude.ai settings
Interrupts active architect/lead sessions
Will get worse as more concurrent leads run

Suspected Cause

MetaMCP connection handling under concurrent OAuth 2.1 sessions. Possible: connection limits, memory ceiling, SSE/streamablehttp connection eviction when new sessions authenticate.

Investigation Needed

Check MetaMCP logs during lead standup — connection errors, OOM, restarts?
Check if MetaMCP has configurable connection limits
Monitor MetaMCP memory/CPU when 3+ CD sessions are active
Test: does the issue occur without OAuth 2.1 (pre-dispatch a00a93b9 state)?

References

MetaMCP OAuth 2.1 enabled via dispatch a00a93b9
Authentik OIDC provider metamcp-oidc (pk=23)
MetaMCP container in svc-tools Coolify project (vkk0088s8wgoskgo00okwgs0)

Priority: HIGH — affects all concurrent lead operations.

## Problem Claude.ai MCP connectors (via MetaMCP OAuth 2.1) disconnect when new CD lead sessions are launched. Pattern observed: 5 of 8 namespaces dropped (g1-code, g1-web, g1-time, g1-math, g1-presenter) while 3 remained connected (g1-brain, g1-coolify, g1-project). This happened after launching product-lead-v3 and llm-lead-v1 concurrently with an active architect session. ## Impact - Operator must manually reconnect in Claude.ai settings - Interrupts active architect/lead sessions - Will get worse as more concurrent leads run ## Suspected Cause MetaMCP connection handling under concurrent OAuth 2.1 sessions. Possible: connection limits, memory ceiling, SSE/streamablehttp connection eviction when new sessions authenticate. ## Investigation Needed 1. Check MetaMCP logs during lead standup — connection errors, OOM, restarts? 2. Check if MetaMCP has configurable connection limits 3. Monitor MetaMCP memory/CPU when 3+ CD sessions are active 4. Test: does the issue occur without OAuth 2.1 (pre-dispatch a00a93b9 state)? ## References - MetaMCP OAuth 2.1 enabled via dispatch a00a93b9 - Authentik OIDC provider metamcp-oidc (pk=23) - MetaMCP container in svc-tools Coolify project (vkk0088s8wgoskgo00okwgs0) Priority: HIGH — affects all concurrent lead operations.

g1admin commented

2026-03-26 05:13:08 +00:00

Author

Owner

ESCALATED TO P0

g1-brain just dropped. This is the critical namespace — Directus CMS, graph memory, knowledge tools. No lead can boot, read dispatches, or update state without it.

Updated failure pattern: 6/8 namespaces have now dropped at various points (g1-code, g1-web, g1-time, g1-math, g1-presenter, g1-brain). Only g1-coolify and g1-project have remained stable.

This is not intermittent — it's systemic. Reconnecting manually in Claude.ai settings is a temporary workaround but the disconnections recur when new CD lead sessions launch.

Immediate investigation path:

docker logs on MetaMCP container — look for connection errors, OOM, restarts
docker stats on MetaMCP — check memory/CPU under concurrent sessions
Check if MetaMCP is restarting entirely (which would drop ALL connectors) vs individual backend connections dying
Check Cloudflare analytics on mcp.generate.one for 5xx/timeout errors

## ESCALATED TO P0 g1-brain just dropped. This is the critical namespace — Directus CMS, graph memory, knowledge tools. No lead can boot, read dispatches, or update state without it. Updated failure pattern: 6/8 namespaces have now dropped at various points (g1-code, g1-web, g1-time, g1-math, g1-presenter, g1-brain). Only g1-coolify and g1-project have remained stable. This is not intermittent — it's systemic. Reconnecting manually in Claude.ai settings is a temporary workaround but the disconnections recur when new CD lead sessions launch. Immediate investigation path: 1. `docker logs` on MetaMCP container — look for connection errors, OOM, restarts 2. `docker stats` on MetaMCP — check memory/CPU under concurrent sessions 3. Check if MetaMCP is restarting entirely (which would drop ALL connectors) vs individual backend connections dying 4. Check Cloudflare analytics on mcp.generate.one for 5xx/timeout errors

g1admin commented

2026-03-26 05:28:00 +00:00

Author

Owner

Root Cause Analysis (T2 evidence, architect-v50)

Three smoking guns found:

1. ECONNREFUSED 161.184.162.156:8000 (18x in 30min)

MetaMCP is trying to reach something at the server's public IP on port 8000. Nothing listens there. Check MetaMCP DB mcp_servers table for any server with 161.184.162.156 in its URL — this is misconfigured and should use an internal container hostname.

2. Dead backends generating constant errors

plane-mcp → container gone (g1-plane deleted 2026-03-20)
section-mcp → container not running

These need to be removed from MetaMCP's mcp_servers table. Every health check cycle generates errors for these.

3. Time server STDIO crashes (3x in 30min)

time server is STDIO type (npx -y @theo.foobar/mcp-time). Crash handler fires repeatedly. STDIO servers run inside MetaMCP's process — crashes here can destabilize the parent.

Additional concern

1883 PIDs in the container — unusually high. Could be leaked processes from crashed STDIO servers accumulating.

Recommended fix order

Immediate: Remove plane-mcp and section-mcp from MetaMCP DB (dead backends)
Immediate: Find and fix the 161.184.162.156:8000 server URL in MetaMCP DB
Short-term: Evaluate replacing STDIO time server with a container-based one (g1-time already exists as a separate MCP)
Verify: After fixes, monitor for 30min — do connector drops stop?

## Root Cause Analysis (T2 evidence, architect-v50) Three smoking guns found: ### 1. ECONNREFUSED 161.184.162.156:8000 (18x in 30min) MetaMCP is trying to reach something at the server's **public IP** on port 8000. Nothing listens there. Check MetaMCP DB `mcp_servers` table for any server with `161.184.162.156` in its URL — this is misconfigured and should use an internal container hostname. ### 2. Dead backends generating constant errors - `plane-mcp` → container gone (g1-plane deleted 2026-03-20) - `section-mcp` → container not running These need to be removed from MetaMCP's `mcp_servers` table. Every health check cycle generates errors for these. ### 3. Time server STDIO crashes (3x in 30min) `time` server is STDIO type (`npx -y @theo.foobar/mcp-time`). Crash handler fires repeatedly. STDIO servers run inside MetaMCP's process — crashes here can destabilize the parent. ### Additional concern **1883 PIDs** in the container — unusually high. Could be leaked processes from crashed STDIO servers accumulating. ### Recommended fix order 1. **Immediate:** Remove plane-mcp and section-mcp from MetaMCP DB (dead backends) 2. **Immediate:** Find and fix the 161.184.162.156:8000 server URL in MetaMCP DB 3. **Short-term:** Evaluate replacing STDIO time server with a container-based one (g1-time already exists as a separate MCP) 4. **Verify:** After fixes, monitor for 30min — do connector drops stop?

g1admin commented

2026-03-27 23:44:03 +00:00

Author

Owner

P0 RESOLVED

All three root causes fixed by T2:

plane-mcp DELETED from MetaMCP DB — dead backend generating constant health check errors
sympy-mcp port FIXED — was listening on 8081 but MetaMCP pointed at 8000. This was the source of 18+ ECONNREFUSED/30min to 161.184.162.156:8000. Updated both URL and healthcheck to 8081.
graphiti-mcp FIXED (earlier) — 4 issues: fastmcp dependency, FalkorDB env vars, coolify network, SSE transport

Results:

ECONNREFUSED: 158+ → 0
All 9 namespaces healthy with correct tool counts
16 servers remaining (plane-mcp removed)

Monitoring needed: Connector stability should be observed over next 24h to confirm the fix holds under concurrent lead sessions. If connectors still drop, the remaining investigation items (STDIO time server crashes, PID count) may need attention.

Closing as resolved. Reopen if connectors continue dropping.

## P0 RESOLVED All three root causes fixed by T2: 1. **plane-mcp DELETED** from MetaMCP DB — dead backend generating constant health check errors 2. **sympy-mcp port FIXED** — was listening on 8081 but MetaMCP pointed at 8000. This was the source of 18+ ECONNREFUSED/30min to 161.184.162.156:8000. Updated both URL and healthcheck to 8081. 3. **graphiti-mcp FIXED** (earlier) — 4 issues: fastmcp dependency, FalkorDB env vars, coolify network, SSE transport Results: - ECONNREFUSED: 158+ → **0** - All 9 namespaces healthy with correct tool counts - 16 servers remaining (plane-mcp removed) **Monitoring needed:** Connector stability should be observed over next 24h to confirm the fix holds under concurrent lead sessions. If connectors still drop, the remaining investigation items (STDIO time server crashes, PID count) may need attention. Closing as resolved. Reopen if connectors continue dropping.