❱ Architecture

System architecture of the Majestix AI Inference Hub -- a multi-model AI gateway with credits-based billing, agentic orchestration, and 50+ pre-built agents.


System Overview

The platform consists of two Cloud Run services deployed in separate GCP projects, four AI providers, and shared infrastructure for storage, caching, billing, and observability.

                         +----------------------------------------------+
                         |              Clients                         |
                         |  Web App (React 19)  |  VSCode Extension     |
                         |  API Key (CLI/SDK)   |  Cloud Tasks (cron)   |
                         +----------+------------------------+----------+
                                    |                        |
                            Firebase Auth +            OIDC Bearer
                            App Check / API Key        (service account)
                                    |                        |
                      +-------------v-----------+   +--------v---------------------+
                      |  inference-api          |   |  agent-executor              |
                      |  Cloud Run              |   |  Cloud Run                   |
                      |  (inference-platform)   |   |  (inference-agents)          |
                      |                         |   |                              |
                      |  /chat, /code           |<--|  POST /internal/agent/code   |
                      |  /models                |   |                              |
                      |  /billing, /usage       |   |  /internal/agent/execute     |
                      |  /api-keys              |   |  /internal/agent/ensemble    |
                      |  /internal/agent/code   |   |  /internal/agent/swarm       |
                      |                         |   |                              |
                       |  Credits: reserve ->    |   |  Agentic loop, ensemble,     |
                       |  stream -> reconcile    |   |  swarm, tool execution       |
                       +--+----+----+----+-------+   +--+----------+----------------+
                         |    |    |    |              |          |
              +----------+    |    |    +------+    +--+          |
              v               v    v           v    v             v
        +----------+   +----------+ +-----+ +----------+   +----------+
        |Anthropic |   |  OpenAI  | |Redis| | Firestore|   |Cloud KMS |
        | Claude   |   | GPT-5.x  | |     | |  (both)  |   |(cred enc)|
        +----------+   +----------+ +-----+ +----------+   +----------+
        |Vertex AI |   |OpenRouter|
        |Gemini 3  |   |DS/Grok/  |         +----------+   +----------+
        |          |   |Qwen/Kimi |         | BigQuery |   | Pub/Sub  |
        +----------+   +----------+         |(analytics|   |(usage +  |
                                            | events)  |   | audit)   |
                                            +----------+   +----------+

Two Cloud Run Services

inference-api (Main API)

Cloud Run in the inference-platform GCP project.

The main API is a stateless model gateway. It handles authentication, credit billing, model routing, and streaming responses. It knows nothing about agent orchestration -- it simply processes LLM requests and charges credits. This separation keeps the main API simple, stateless, and horizontally scalable.

Responsibility    Detail
----------------  ---------------------------------------------------------------
Authentication    API key, Firebase Auth + App Check, OIDC (3 paths)
Model routing     18 models across 4 providers with automatic fallback
Credit billing    Reservation-based: reserve worst-case, stream, reconcile actual
Streaming         SSE protocol for all model responses
Sessions          Redis-backed conversation history (Fernet-encrypted)
Billing           Stripe subscriptions, top-ups, webhooks (11 event types)
Analytics         Pub/Sub event bus to BigQuery
Admin             Usage dashboards, model economics

Key endpoints: /chat, /code, /models, /billing, /usage, /api-keys, and /internal/agent/code (the inference callback used by the agent executor).

agent-executor (Agent Executor)

Cloud Run in the inference-agents GCP project (separate project for isolation).

The agent executor handles all orchestration: scheduled tasks, ensemble consensus, and swarm pipelines. It calls back to the main API for all LLM inference -- it never contacts providers directly and never handles credits.
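That callback contract can be sketched as follows. Only the /internal/agent/code path comes from the diagram above; the service URL, payload fields, and helper name are illustrative assumptions, not the actual schema:

```python
# Sketch of the executor -> main API callback. The executor never holds
# provider keys or credits; it sends an OIDC-authenticated request and the
# main API does the billing. Payload fields here are assumptions.
import json
import urllib.request

MAIN_API = "https://inference-api.example.run.app"  # hypothetical URL

def build_inference_request(user_id: str, model: str, messages: list,
                            oidc_token: str) -> urllib.request.Request:
    """Build the OIDC-authenticated callback; the main API, not the
    executor, reserves and reconciles the user's credits."""
    body = json.dumps({
        "user_id": user_id,      # so the main API bills the right account
        "model": model,
        "messages": messages,
        "stream": True,          # responses stream back over SSE
    }).encode()
    return urllib.request.Request(
        f"{MAIN_API}/internal/agent/code",
        data=body,
        headers={
            "Authorization": f"Bearer {oidc_token}",  # service-account OIDC
            "Content-Type": "application/json",
        },
        method="POST",
    )
```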

Responsibility         Detail
---------------------  ------------------------------------------------------
Agentic loops          LLM -> tools -> repeat until done or limits hit
Ensemble consensus     Drafter -> Critic -> Synthesizer loop (3 models)
Swarm pipelines        Sequential multi-agent execution with context passing
Tool execution         4 built-in tools with DNS pinning and URL validation
Security               8-layer defense-in-depth
Credential management  KMS decrypt at runtime, never stored in plaintext
Observability          OpenTelemetry traces to Cloud Trace
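The "LLM -> tools -> repeat" cycle above can be sketched in a few lines; call_llm, run_tool, and the iteration limit are hypothetical stand-ins for the real main-API callback and tool layer:

```python
# Sketch of the agentic loop: call the LLM, execute any requested tools,
# feed results back, and repeat until the model stops asking for tools or a
# limit is hit. Names and limits are illustrative, not the platform's code.
MAX_ITERATIONS = 10          # illustrative limit

def agentic_loop(messages, call_llm, run_tool):
    for _ in range(MAX_ITERATIONS):
        reply = call_llm(messages)          # routed through the main API
        if not reply.get("tool_calls"):     # model is done: final answer
            return reply["content"]
        for call in reply["tool_calls"]:    # execute each requested tool
            result = run_tool(call["name"], call["args"])
            messages.append({"role": "tool",
                             "name": call["name"],
                             "content": result})
    raise RuntimeError("iteration limit reached before completion")
```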

Key endpoints: /internal/agent/execute, /internal/agent/ensemble, and /internal/agent/swarm.
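As one concrete illustration of the tool-execution defenses (DNS pinning plus URL validation), an outbound-request guard might look like this minimal sketch; the function names and exact blocked ranges are assumptions:

```python
# Minimal sketch of URL validation with DNS pinning for outbound tool
# requests: resolve the host once, reject private/reserved addresses, then
# connect to the pinned IP so a later re-resolution cannot redirect the
# request (DNS-rebinding defense). Names here are illustrative.
import ipaddress
import socket
from urllib.parse import urlparse

def is_public_address(ip: str) -> bool:
    """Reject loopback, RFC 1918, link-local, and other non-global ranges."""
    addr = ipaddress.ip_address(ip)
    return not (addr.is_private or addr.is_loopback or
                addr.is_link_local or addr.is_reserved or addr.is_multicast)

def pin_url(url: str) -> tuple[str, str]:
    """Validate a tool-provided URL and return (hostname, pinned_ip)."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"blocked scheme: {parsed.scheme!r}")
    if not parsed.hostname:
        raise ValueError("URL has no hostname")
    # Resolve exactly once; every candidate address must be public.
    infos = socket.getaddrinfo(parsed.hostname, None)
    ips = {info[4][0] for info in infos}
    for ip in ips:
        if not is_public_address(ip):
            raise ValueError(f"blocked non-public address: {ip}")
    # The HTTP client then connects to the pinned IP while sending the
    # original Host header / SNI for parsed.hostname.
    return parsed.hostname, sorted(ips)[0]
```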


Four Providers

The main API routes model requests to four providers. When a provider fails, the system auto-retries on a comparable model from a different provider via the FALLBACK_MAP. Credit reservation accounts for fallback cost.

Provider    Models                                                        Connection
----------  ------------------------------------------------------------  ------------
Anthropic   claude-sonnet (4.6), claude-haiku (4.5), claude-opus (4.6)    Direct API
OpenAI      gpt-5-mini, gpt-5.2, gpt-5.2-codex, cmo-agent (fine-tuned)    Direct API
Vertex AI   gemini-3-flash, gemini-3.1-pro, gemini-3-image, gpt-5-image,  Google Cloud
            seedream-4.5
OpenRouter  llama-4-maverick, deepseek-v3.2, deepseek-r1, qwen3-coder,    Relay
            kimi-k2.5, grok-4.1-fast
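The FALLBACK_MAP mechanism can be sketched as follows; the map's name comes from the docs above, but these example pairings and the routing helper are assumptions:

```python
# Hypothetical sketch of cross-provider fallback. FALLBACK_MAP is named in
# the architecture docs; the pairings below are illustrative assumptions.
class ProviderError(Exception):
    """Raised when a provider call fails (timeout, 5xx, overload)."""

FALLBACK_MAP = {
    "claude-sonnet": "gpt-5.2",        # Anthropic -> OpenAI
    "gpt-5-mini": "gemini-3-flash",    # OpenAI -> Vertex AI
    "deepseek-v3.2": "claude-haiku",   # OpenRouter -> Anthropic
}

def route_with_fallback(model, call_provider):
    """Try the requested model; on provider failure, retry once on the
    mapped comparable model from a different provider."""
    try:
        return call_provider(model)
    except ProviderError:
        fallback = FALLBACK_MAP.get(model)
        if fallback is None:
            raise                      # no comparable model: surface the error
        return call_provider(fallback)
```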

All providers are normalized to the same SSE streaming protocol, including a shared error-event format when a provider call fails.
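A client-side sketch of consuming such a stream follows; the event names ("delta", "error", "done") are assumptions about the wire format, not the documented protocol:

```python
# Sketch of parsing an SSE stream on the client side. A blank line ends a
# frame; "event:" names the frame type and "data:" carries a JSON payload.
# The specific event names used by the platform are assumptions.
import json

def parse_sse(lines):
    """Yield (event, payload) pairs from raw SSE lines."""
    event, data = "message", []
    for line in lines:
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())
        elif line == "":                     # frame boundary
            if data:
                yield event, json.loads("\n".join(data))
            event, data = "message", []
```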


Key Infrastructure

Service            Purpose                                                       Project
-----------------  ------------------------------------------------------------  ------------------
Firestore          Users, API keys, agent tasks, executions, integrations        Both projects
Redis Memorystore  Sessions, API key cache, credit counters, rate limits,        inference-platform
                   usage cache
Cloud KMS          Agent credential encryption (split trust:                     inference-agents
                   encrypt-only / decrypt-only)
Cloud Tasks        Agent task scheduling (OIDC-authenticated dispatch)           inference-agents
Cloud Scheduler    Cron triggers for scheduled agent tasks                       inference-agents
BigQuery           Usage analytics, model economics, audit trail (no PII)        inference-platform
Pub/Sub            Durable event bus for usage + audit events                    inference-platform
Stripe             Subscriptions (Guru, Pro), one-time top-ups, customer portal  inference-platform
Firebase Hosting   Two web app deployments (green + indigo themes)               inference-platform
Cloud Trace        Distributed tracing via OpenTelemetry                         Both projects
Artifact Registry  Docker image storage for Cloud Run deployments                Both projects

See Infrastructure for detailed configuration of each service.


Design Principles

Ship v1 First

No premature optimization. Every feature ships as a working v1 before being refined. Architecture decisions favor simplicity and correctness over theoretical scalability.

Single Credit System

All clients -- web app, VSCode extension, CLI, scheduled agents -- use the same credits system. 1 credit = $0.001 USD, 1.4x margin on provider cost. The agent executor does not handle billing directly; it calls the main API for every LLM request, and the main API deducts credits from the user's account. This guarantees consistent billing regardless of how inference is triggered.
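A worked example of the pricing rule above. Integer microdollar accounting and rounding up to a whole credit are assumptions about the implementation, not the platform's actual code:

```python
# Worked example of the credit pricing rule: 1 credit = $0.001 USD with a
# 1.4x margin on provider cost. Integer microdollars avoid float rounding
# surprises; the ceiling to a whole credit is an assumption.
def credits_for_cost(provider_cost_usd: float) -> int:
    micro = round(provider_cost_usd * 1_000_000)  # dollars -> microdollars
    charged = micro * 14 // 10                    # apply the 1.4x margin
    return -(-charged // 1000)                    # ceil; 1 credit = 1000 microdollars

# A request that costs the platform $0.01 in provider fees is billed
# 0.01 * 1.4 = $0.014 -> 14 credits.
```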

SSE Everywhere

All model responses stream via Server-Sent Events. The protocol is consistent across direct chat, agentic tool-use, ensemble, and swarm.

Project-Level Isolation

The agent executor runs in a separate GCP project (inference-agents) from the main API (inference-platform). Service accounts, KMS keys, and Firestore collections are project-scoped. This ensures that a compromised agent execution cannot access billing data, user credentials, or provider API keys.

Reservation-Based Billing

Credits are reserved (worst-case estimate) before any provider call, then reconciled after the response completes. This prevents overcharges and handles provider failures gracefully.

The reservation accounts for fallback cost when a provider may fail and the system retries on a different (potentially more expensive) model.
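The reserve -> stream -> reconcile cycle can be sketched with an in-memory ledger standing in for the Redis credit counters; all names here are illustrative:

```python
# Sketch of reservation-based billing: hold a worst-case amount before the
# provider call, then settle to the actual cost afterwards. A plain object
# stands in for the Redis-backed credit counters.
class CreditLedger:
    def __init__(self, balance: int):
        self.balance = balance          # credits available
        self.reserved = 0               # credits held for in-flight requests

    def reserve(self, worst_case: int) -> None:
        if self.balance - self.reserved < worst_case:
            raise RuntimeError("insufficient credits")
        self.reserved += worst_case     # hold before any provider call

    def reconcile(self, worst_case: int, actual: int) -> None:
        # Release the hold and charge only what the stream actually cost;
        # on provider failure, actual=0 refunds the whole reservation.
        self.reserved -= worst_case
        self.balance -= min(actual, worst_case)
```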


Data Flow

Chat Request (Web or API Key)

1. The client authenticates via Firebase Auth + App Check (web) or an API key (CLI/SDK).
2. The main API reserves worst-case credits, routes the request to a provider (falling back to a comparable model on failure), and streams the response over SSE.
3. On completion, credits are reconciled to actual usage and a usage event is published to Pub/Sub for BigQuery analytics.

Scheduled Agent Execution

1. Cloud Scheduler fires a cron trigger, and Cloud Tasks dispatches an OIDC-authenticated request to the agent executor.
2. The executor runs the agentic loop, ensemble, or swarm, calling back to the main API for every LLM request.
3. The main API bills each callback through the normal credit flow, so scheduled work is charged exactly like interactive use.

Credit Flow

1. Reserve: hold a worst-case credit estimate (including potential fallback cost) before any provider call.
2. Stream: deliver the response over SSE while metering actual usage.
3. Reconcile: charge the actual cost and release the remainder of the hold; a failed request refunds the unused reservation.

