Architecture Overview
Gambi exposes an HTTP management plane, an OpenAI-compatible HTTP inference plane, and a participant tunnel between the hub and each registered participant.
System Diagram
┌──────────────────────────────────────────────┐
│                  GAMBI HUB                   │
│                                              │
│  Management API        Inference API         │
│  /v1/*                 /rooms/:code/v1/*     │
│                                              │
│  SSE events            Routing engine        │
│                                              │
│  Participant tunnel registry and sessions    │
└──────────────────────────────────────────────┘
       ▲              ▲               ▲
       │ HTTP         │ HTTP          │ WebSocket
       │              │               │
  ┌────┴────┐    ┌────┴─────┐    ┌────┴────────┐
  │ SDK and │    │ Apps and │    │ Participant │
  │ CLI ops │    │ AI tools │    │ runtimes    │
  └─────────┘    └──────────┘    └─────────────┘
Core Idea
Application clients still talk to Gambi over standard HTTP. That keeps the system compatible with OpenAI-style tooling and SDKs.
Participants no longer need to publish a network-reachable provider endpoint. Instead, the participant runtime opens a tunnel to the hub and forwards inference requests to its local or remote provider.
Registration Flow
- The participant runtime probes its provider endpoint locally.
- It registers with PUT /v1/rooms/:code/participants/:id.
- The hub returns { participant, roomId, tunnel }.
- The runtime opens GET /v1/rooms/:code/participants/:id/tunnel?token=....
- The hub upgrades the connection and starts forwarding tunnel requests.
- The runtime keeps sending management heartbeats.
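The two endpoints in the flow above can be derived from the hub base URL. The sketch below is illustrative only: the helper names are hypothetical, the http-to-ws scheme swap is an assumption about how a runtime would reach the tunnel, and the token is whatever the registration response supplied (its exact location in that response is not specified here).

```typescript
// Hypothetical helpers; only the two paths come from the registration flow.

/** Management URL a runtime would PUT to when registering a participant. */
function registrationUrl(hubBaseUrl: string, roomCode: string, participantId: string): string {
  const base = hubBaseUrl.replace(/\/+$/, ""); // drop trailing slashes
  return `${base}/v1/rooms/${encodeURIComponent(roomCode)}/participants/${encodeURIComponent(participantId)}`;
}

/**
 * Tunnel URL opened after registration.
 * Assumption: the WebSocket URL is the management URL with the scheme
 * swapped (http -> ws, https -> wss) and "/tunnel?token=..." appended.
 */
function tunnelUrl(hubBaseUrl: string, roomCode: string, participantId: string, token: string): string {
  const httpUrl = registrationUrl(hubBaseUrl, roomCode, participantId);
  const wsUrl = httpUrl.replace(/^http/, "ws");
  return `${wsUrl}/tunnel?token=${encodeURIComponent(token)}`;
}
```

A runtime would register against the first URL, then open a WebSocket to the second one and keep sending heartbeats on the management plane.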
Request Flow
- An application sends POST /rooms/:code/v1/responses or POST /rooms/:code/v1/chat/completions.
- The hub resolves routing by participant ID, model:<name>, or *.
- The hub forwards the request through the participant tunnel.
- The participant runtime forwards it to the real provider endpoint.
- The runtime streams the provider response back through the tunnel.
- The hub returns the response to the application client.
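From the application side, the first step of this flow is an ordinary OpenAI-style POST. The sketch below builds such a request for the Chat Completions path; the function and type names are hypothetical, and the hub address in the usage note is a placeholder rather than a documented default.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface InferenceRequest {
  method: "POST";
  url: string;
  headers: Record<string, string>;
  body: string;
}

/**
 * Builds a Chat Completions request against a room's inference plane.
 * The `model` value doubles as the routing selector: a participant ID,
 * "model:<name>", or "*".
 */
function buildChatCompletionsRequest(
  hubBaseUrl: string,
  roomCode: string,
  model: string,
  messages: ChatMessage[],
): InferenceRequest {
  const base = hubBaseUrl.replace(/\/+$/, ""); // drop trailing slashes
  return {
    method: "POST",
    url: `${base}/rooms/${encodeURIComponent(roomCode)}/v1/chat/completions`,
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, messages }),
  };
}
```

The result can be handed to any HTTP client, for example `fetch(req.url, { method: req.method, headers: req.headers, body: req.body })` with `req` built for a placeholder hub such as `http://localhost:3000`.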
Why This Split Exists
HTTP for apps
- standard OpenAI-compatible interface
- works with existing SDKs and tools
- easy to debug with normal HTTP tooling
WebSocket for participant transport
- lets providers stay on localhost
- keeps provider credentials on the participant runtime
- avoids asking participants to publish network endpoints just to join a room
SSE for observability
- one-way room event stream is enough for monitoring
- powers the TUI and operational clients
- keeps operational visibility separate from inference transport
Routing Rules
The model field controls participant selection:
| Value | Behavior |
|---|---|
| * or any | random available participant |
| model:<name> | first available participant matching that model |
| <participant-id> | specific participant |
A participant is available only when:
- its tunnel is connected
- it is not offline
- it is not already handling another request
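The routing table and the availability conditions can be read together as a small selection function. This is a sketch under stated assumptions, not the hub's routing engine: the `Participant` shape and its field names are hypothetical, and it assumes the availability check also applies when a specific participant ID is requested.

```typescript
// Hypothetical participant record; field names are illustrative.
interface Participant {
  id: string;
  model: string;
  tunnelConnected: boolean;
  offline: boolean;
  busy: boolean;
}

/** Mirrors the three availability conditions listed above. */
function isAvailable(p: Participant): boolean {
  return p.tunnelConnected && !p.offline && !p.busy;
}

/**
 * Resolves the request's model field to a participant, or undefined
 * when nothing matches. `random` is injectable so the wildcard case
 * can be tested deterministically.
 */
function selectParticipant(
  modelField: string,
  participants: Participant[],
  random: () => number = Math.random,
): Participant | undefined {
  const available = participants.filter(isAvailable);
  if (modelField === "*" || modelField === "any") {
    if (available.length === 0) return undefined;
    return available[Math.floor(random() * available.length)];
  }
  if (modelField.startsWith("model:")) {
    const wanted = modelField.slice("model:".length);
    return available.find((p) => p.model === wanted); // first match wins
  }
  // Assumption: a specific participant must also be available.
  return available.find((p) => p.id === modelField);
}
```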
Tunnel Protocol
The tunnel is a WebSocket between the hub and the participant runtime. Messages are JSON objects with a type field.
Server → participant:
- tunnel.request: a forwarded inference request. Includes requestId, HTTP method, path, headers, body, and a stream flag.
- tunnel.pong: reply to a participant ping.
Participant → server:
- tunnel.response.start: response headers and HTTP status for requestId.
- tunnel.response.chunk: one streamed body chunk for requestId.
- tunnel.response.end: the response body is complete.
- tunnel.response.error: the participant runtime failed to produce a response; includes a stage label and a human-readable message.
- tunnel.ping: keepalive from the participant.
See packages/core/src/tunnel-protocol.ts for the authoritative schemas.
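To make the message flow concrete, here is a rough sketch of the message types as a discriminated union, plus a function that folds one request's response messages into a single result. The message type names come from the lists above; the field names `status` and `chunk`, the string body encoding, and `collectResponse` itself are assumptions for illustration. The schema file referenced above remains the source of truth.

```typescript
type Headers = Record<string, string>;

// Server -> participant (field types are assumptions).
type ServerToParticipant =
  | { type: "tunnel.request"; requestId: string; method: string; path: string; headers: Headers; body: string; stream: boolean }
  | { type: "tunnel.pong" };

// Participant -> server (`status` and `chunk` are assumed field names).
type ParticipantToServer =
  | { type: "tunnel.response.start"; requestId: string; status: number; headers: Headers }
  | { type: "tunnel.response.chunk"; requestId: string; chunk: string }
  | { type: "tunnel.response.end"; requestId: string }
  | { type: "tunnel.response.error"; requestId: string; stage: string; message: string }
  | { type: "tunnel.ping" };

interface CollectedResponse {
  status: number;
  headers: Headers;
  body: string;
  done: boolean;
  error?: string;
}

/** Folds the raw JSON frames belonging to one requestId into a response. */
function collectResponse(requestId: string, rawFrames: string[]): CollectedResponse {
  const result: CollectedResponse = { status: 0, headers: {}, body: "", done: false };
  for (const raw of rawFrames) {
    const msg = JSON.parse(raw) as ParticipantToServer;
    // Skip keepalives and frames that belong to other requests.
    if (msg.type === "tunnel.ping" || msg.requestId !== requestId) continue;
    switch (msg.type) {
      case "tunnel.response.start":
        result.status = msg.status;
        result.headers = msg.headers;
        break;
      case "tunnel.response.chunk":
        result.body += msg.chunk;
        break;
      case "tunnel.response.end":
        result.done = true;
        break;
      case "tunnel.response.error":
        result.error = `${msg.stage}: ${msg.message}`;
        result.done = true;
        break;
    }
  }
  return result;
}
```

A real hub would stream chunks to the client as they arrive rather than buffering them; the buffering here only keeps the example short.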
Protocol Adaptation (Responses ↔ Chat Completions)
The default protocol is Responses. Chat Completions remains available for compatibility.
When the client and the participant do not speak the same surface natively, the hub adapts between them. Two practical consequences:
- a client using Responses can reach a participant that only exposes Chat Completions, and vice versa
- the adapter focuses on the message-level contract; stateful Responses features such as previous_response_id, store, and background may be limited or unsupported when the underlying participant is a Chat Completions endpoint
New integrations should prefer Responses. Fall back to Chat Completions only when you need explicit compatibility with an existing tool.
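As a rough illustration of what message-level adaptation means, the sketch below converts a simplified Responses-style request into a Chat Completions-style request and reports which stateful fields it had to drop. It is not the hub's adapter: the types cover only a small subset of each API, the function name is hypothetical, and the mapping of `max_output_tokens` to `max_tokens` is an assumption about what a compatible provider accepts.

```typescript
interface AdapterMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Simplified subset of a Responses request.
interface ResponsesRequest {
  model: string;
  input: string | AdapterMessage[];
  instructions?: string;
  max_output_tokens?: number;
  stream?: boolean;
  previous_response_id?: string;
  store?: boolean;
  background?: boolean;
}

// Simplified subset of a Chat Completions request.
interface ChatCompletionsRequest {
  model: string;
  messages: AdapterMessage[];
  max_tokens?: number;
  stream?: boolean;
}

function toChatCompletions(req: ResponsesRequest): { body: ChatCompletionsRequest; unsupported: string[] } {
  const messages: AdapterMessage[] = [];
  // Instructions become a leading system message.
  if (req.instructions !== undefined) messages.push({ role: "system", content: req.instructions });
  if (typeof req.input === "string") messages.push({ role: "user", content: req.input });
  else messages.push(...req.input);

  const body: ChatCompletionsRequest = { model: req.model, messages };
  if (req.max_output_tokens !== undefined) body.max_tokens = req.max_output_tokens;
  if (req.stream !== undefined) body.stream = req.stream;

  // Stateful features have no message-level equivalent, so report them.
  const unsupported: string[] = [];
  if (req.previous_response_id !== undefined) unsupported.push("previous_response_id");
  if (req.store === true) unsupported.push("store");
  if (req.background === true) unsupported.push("background");
  return { body, unsupported };
}
```

Whether an adapter should reject, ignore, or warn about the dropped fields is a design choice this sketch leaves open.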
Health Timings
Two constants drive liveness, both defined in packages/core/src/types.ts:
- HEALTH_CHECK_INTERVAL = 10_000 ms: cadence for participant heartbeats and for tunnel pings.
- PARTICIPANT_TIMEOUT = 30_000 ms: after this window without a heartbeat, the hub marks the participant offline. The tunnel uses the same window before closing a silent connection.
If you build a custom participant runtime, match these windows. createParticipantSession() does it for you.
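For a custom runtime, the relationship between the two windows can be sketched as follows. The constants are mirrored locally rather than imported, the helper names are hypothetical, and treating the boundary as strictly greater than the timeout is an assumption.

```typescript
// Local mirrors of the documented values, in milliseconds.
const HEALTH_CHECK_INTERVAL = 10_000;
const PARTICIPANT_TIMEOUT = 30_000;

/** True once more than PARTICIPANT_TIMEOUT has passed since the last heartbeat. */
function isParticipantOffline(lastHeartbeatAt: number, now: number): boolean {
  return now - lastHeartbeatAt > PARTICIPANT_TIMEOUT;
}

/** Number of heartbeat intervals that fit inside one timeout window. */
function heartbeatsPerTimeoutWindow(): number {
  return Math.floor(PARTICIPANT_TIMEOUT / HEALTH_CHECK_INTERVAL);
}
```

With these values, three heartbeat intervals fit in one timeout window, so a single delayed heartbeat does not by itself take a participant offline.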
Observability
The hub emits:
- llm.request
- llm.complete
- llm.error
llm.complete includes baseline metrics such as:
- ttftMs
- durationMs
- inputTokens
- outputTokens
- totalTokens
- tokensPerSecond
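A monitoring client could read these metrics from the room event stream roughly as below. The frame parsing follows the standard Server-Sent Events text format; everything specific to Gambi here is an assumption: that the event names above appear in the SSE `event` field, and that the metric fields sit at the top level of the JSON payload.

```typescript
interface LlmCompleteMetrics {
  ttftMs: number;
  durationMs: number;
  inputTokens: number;
  outputTokens: number;
  totalTokens: number;
  tokensPerSecond: number;
}

interface SseEvent {
  event: string;
  data: string;
}

/** Parses one SSE frame (the text between two blank lines). */
function parseSseFrame(frame: string): SseEvent {
  let event = "message"; // default event name in the SSE format
  const dataLines: string[] = [];
  for (const line of frame.split(/\r?\n/)) {
    if (line === "" || line.startsWith(":")) continue; // blank line or comment
    const colon = line.indexOf(":");
    const field = colon === -1 ? line : line.slice(0, colon);
    let value = colon === -1 ? "" : line.slice(colon + 1);
    if (value.startsWith(" ")) value = value.slice(1); // strip one leading space
    if (field === "event") event = value;
    else if (field === "data") dataLines.push(value);
  }
  return { event, data: dataLines.join("\n") };
}

/** Returns the metrics for an llm.complete frame, or undefined otherwise. */
function readCompleteMetrics(frame: string): LlmCompleteMetrics | undefined {
  const { event, data } = parseSseFrame(frame);
  if (event !== "llm.complete") return undefined;
  // Assumption: metric fields are top-level keys of the payload.
  return JSON.parse(data) as LlmCompleteMetrics;
}
```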
What Gambi Does Not Do
- it does not host the models itself
- it does not add built-in authentication to the hub
- it does not try to be an agent orchestrator yet
The future gambi agents direction builds above this transport layer rather than replacing it.