❱ Architecture

System architecture of the Majestix AI Inference Hub -- a multi-model AI gateway with credits-based billing, agentic orchestration, and 50+ pre-built agents.


System Overview

The platform consists of two Cloud Run services deployed in separate GCP projects, four AI providers, and shared infrastructure for storage, caching, billing, and observability.

                         +----------------------------------------------+
                         |              Clients                         |
                         |  Web App (React 19)  |  VSCode Extension     |
                         |  API Key (CLI/SDK)   |  Cloud Tasks (cron)   |
                         +----------+------------------------+----------+
                                    |                        |
                            Firebase Auth +            OIDC Bearer
                            App Check / API Key        (service account)
                                    |                        |
                      +-------------v-----------+   +--------v---------------------+
                      |  inference-api          |   |  agent-executor              |
                      |  Cloud Run              |   |  Cloud Run                   |
                      |  (inference-platform)   |   |  (inference-agents)          |
                      |                         |   |                              |
                      |  /chat, /code           |<--|  POST /internal/agent/code   |
                      |  /models                |   |                              |
                      |  /billing, /usage       |   |  /internal/agent/execute     |
                      |  /api-keys              |   |  /internal/agent/ensemble    |
                      |  /internal/agent/code   |   |  /internal/agent/swarm       |
                      |                         |   |                              |
                       |  Credits: reserve ->    |   |  Agentic loop, ensemble,     |
                       |  stream -> reconcile    |   |  swarm, tool execution       |
                       +--+----+----+----+-------+   +--+----------+----------------+
                         |    |    |    |              |          |
              +----------+    |    |    +------+    +--+          |
              v               v    v           v    v             v
        +----------+   +----------+ +-----+ +----------+   +----------+
        |Anthropic |   |  OpenAI  | |Redis| | Firestore|   |Cloud KMS |
        | Claude   |   | GPT-5.x  | |     | |  (both)  |   |(cred enc)|
        +----------+   +----------+ +-----+ +----------+   +----------+
        |Vertex AI |   |OpenRouter|
        |Gemini 3  |   |DS/Grok/  |         +----------+   +----------+
        |          |   |Qwen/Kimi |         | BigQuery |   | Pub/Sub  |
        +----------+   +----------+         |(analytics|   |(usage +  |
                                            | events)  |   | audit)   |
                                            +----------+   +----------+

Two Cloud Run Services

inference-api (Main API)

Cloud Run in the inference-platform GCP project.

The main API is a stateless model gateway. It handles authentication, credit billing, model routing, and streaming responses. It knows nothing about agent orchestration -- it simply processes LLM requests and charges credits. This separation keeps the main API simple, stateless, and horizontally scalable.

Responsibility    Detail
----------------  ---------------------------------------------------------------
Authentication    API key, Firebase Auth + App Check, OIDC (3 paths)
Model routing     18 models across 4 providers with automatic fallback
Credit billing    Reservation-based: reserve worst-case, stream, reconcile actual
Streaming         SSE protocol for all model responses
Sessions          Redis-backed conversation history (Fernet-encrypted)
Billing           Stripe subscriptions, top-ups, webhooks (11 event types)
Analytics         Pub/Sub event bus to BigQuery
Admin             Usage dashboards, model economics

Key endpoints: /chat, /code, /models, /billing, /usage, /api-keys, and /internal/agent/code (the inference callback used by the agent executor).

agent-executor (Agent Executor)

Cloud Run in the inference-agents GCP project (separate project for isolation).

The agent executor handles all orchestration: scheduled tasks, ensemble consensus, and swarm pipelines. It calls back to the main API for all LLM inference -- it never contacts providers directly and never handles credits.
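That callback contract can be sketched as follows. Only the /internal/agent/code path comes from the diagram above; the service URL, payload fields, and helper name are illustrative assumptions, not the actual schema:

```python
# Sketch of the executor -> main API callback. The executor never holds
# provider keys or credits; it sends an OIDC-authenticated request and the
# main API does the billing. Payload fields here are assumptions.
import json
import urllib.request

MAIN_API = "https://inference-api.example.run.app"  # hypothetical URL

def build_inference_request(user_id: str, model: str, messages: list,
                            oidc_token: str) -> urllib.request.Request:
    """Build the OIDC-authenticated callback; the main API, not the
    executor, reserves and reconciles the user's credits."""
    body = json.dumps({
        "user_id": user_id,      # so the main API bills the right account
        "model": model,
        "messages": messages,
        "stream": True,          # responses stream back over SSE
    }).encode()
    return urllib.request.Request(
        f"{MAIN_API}/internal/agent/code",
        data=body,
        headers={
            "Authorization": f"Bearer {oidc_token}",  # service-account OIDC
            "Content-Type": "application/json",
        },
        method="POST",
    )
```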

Responsibility         Detail
---------------------  ------------------------------------------------------
Agentic loops          LLM -> tools -> repeat until done or limits hit
Ensemble consensus     Drafter -> Critic -> Synthesizer loop (3 models)
Swarm pipelines        Sequential multi-agent execution with context passing
Tool execution         4 built-in tools with DNS pinning and URL validation
Security               8-layer defense-in-depth
Credential management  KMS decrypt at runtime, never stored in plaintext
Observability          OpenTelemetry traces to Cloud Trace
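The "LLM -> tools -> repeat" cycle above can be sketched in a few lines; call_llm, run_tool, and the iteration limit are hypothetical stand-ins for the real main-API callback and tool layer:

```python
# Sketch of the agentic loop: call the LLM, execute any requested tools,
# feed results back, and repeat until the model stops asking for tools or a
# limit is hit. Names and limits are illustrative, not the platform's code.
MAX_ITERATIONS = 10          # illustrative limit

def agentic_loop(messages, call_llm, run_tool):
    for _ in range(MAX_ITERATIONS):
        reply = call_llm(messages)          # routed through the main API
        if not reply.get("tool_calls"):     # model is done: final answer
            return reply["content"]
        for call in reply["tool_calls"]:    # execute each requested tool
            result = run_tool(call["name"], call["args"])
            messages.append({"role": "tool",
                             "name": call["name"],
                             "content": result})
    raise RuntimeError("iteration limit reached before completion")
```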

Key endpoints: /internal/agent/execute, /internal/agent/ensemble, and /internal/agent/swarm.
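As one concrete illustration of the tool-execution defenses (DNS pinning plus URL validation), an outbound-request guard might look like this minimal sketch; the function names and exact blocked ranges are assumptions:

```python
# Minimal sketch of URL validation with DNS pinning for outbound tool
# requests: resolve the host once, reject private/reserved addresses, then
# connect to the pinned IP so a later re-resolution cannot redirect the
# request (DNS-rebinding defense). Names here are illustrative.
import ipaddress
import socket
from urllib.parse import urlparse

def is_public_address(ip: str) -> bool:
    """Reject loopback, RFC 1918, link-local, and other non-global ranges."""
    addr = ipaddress.ip_address(ip)
    return not (addr.is_private or addr.is_loopback or
                addr.is_link_local or addr.is_reserved or addr.is_multicast)

def pin_url(url: str) -> tuple[str, str]:
    """Validate a tool-provided URL and return (hostname, pinned_ip)."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"blocked scheme: {parsed.scheme!r}")
    if not parsed.hostname:
        raise ValueError("URL has no hostname")
    # Resolve exactly once; every candidate address must be public.
    infos = socket.getaddrinfo(parsed.hostname, None)
    ips = {info[4][0] for info in infos}
    for ip in ips:
        if not is_public_address(ip):
            raise ValueError(f"blocked non-public address: {ip}")
    # The HTTP client then connects to the pinned IP while sending the
    # original Host header / SNI for parsed.hostname.
    return parsed.hostname, sorted(ips)[0]
```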


Four Providers

The main API routes model requests to four providers. When a provider fails, the system auto-retries on a comparable model from a different provider via the FALLBACK_MAP. Credit reservation accounts for fallback cost.

Provider    Models                                                        Connection
----------  ------------------------------------------------------------  ------------
Anthropic   claude-sonnet (4.6), claude-haiku (4.5), claude-opus (4.6)    Direct API
OpenAI      gpt-5-mini, gpt-5.2, gpt-5.2-codex, cmo-agent (fine-tuned)    Direct API
Vertex AI   gemini-3-flash, gemini-3.1-pro, gemini-3-image, gpt-5-image,  Google Cloud
            seedream-4.5
OpenRouter  llama-4-maverick, deepseek-v3.2, deepseek-r1, qwen3-coder,    Relay
            kimi-k2.5, grok-4.1-fast
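The FALLBACK_MAP mechanism can be sketched as follows; the map's name comes from the docs above, but these example pairings and the routing helper are assumptions:

```python
# Hypothetical sketch of cross-provider fallback. FALLBACK_MAP is named in
# the architecture docs; the pairings below are illustrative assumptions.
class ProviderError(Exception):
    """Raised when a provider call fails (timeout, 5xx, overload)."""

FALLBACK_MAP = {
    "claude-sonnet": "gpt-5.2",        # Anthropic -> OpenAI
    "gpt-5-mini": "gemini-3-flash",    # OpenAI -> Vertex AI
    "deepseek-v3.2": "claude-haiku",   # OpenRouter -> Anthropic
}

def route_with_fallback(model, call_provider):
    """Try the requested model; on provider failure, retry once on the
    mapped comparable model from a different provider."""
    try:
        return call_provider(model)
    except ProviderError:
        fallback = FALLBACK_MAP.get(model)
        if fallback is None:
            raise                      # no comparable model: surface the error
        return call_provider(fallback)
```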

All providers are normalized to the same SSE streaming protocol, including a shared error-event format when a provider call fails.
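A client-side sketch of consuming such a stream follows; the event names ("delta", "error", "done") are assumptions about the wire format, not the documented protocol:

```python
# Sketch of parsing an SSE stream on the client side. A blank line ends a
# frame; "event:" names the frame type and "data:" carries a JSON payload.
# The specific event names used by the platform are assumptions.
import json

def parse_sse(lines):
    """Yield (event, payload) pairs from raw SSE lines."""
    event, data = "message", []
    for line in lines:
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())
        elif line == "":                     # frame boundary
            if data:
                yield event, json.loads("\n".join(data))
            event, data = "message", []
```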


Key Infrastructure

Service            Purpose                                                       Project
-----------------  ------------------------------------------------------------  ------------------
Firestore          Users, API keys, agent tasks, executions, integrations        Both projects
Redis Memorystore  Sessions, API key cache, credit counters, rate limits,        inference-platform
                   usage cache
Cloud KMS          Agent credential encryption (split trust:                     inference-agents
                   encrypt-only / decrypt-only)
Cloud Tasks        Agent task scheduling (OIDC-authenticated dispatch)           inference-agents
Cloud Scheduler    Cron triggers for scheduled agent tasks                       inference-agents
BigQuery           Usage analytics, model economics, audit trail (no PII)        inference-platform
Pub/Sub            Durable event bus for usage + audit events                    inference-platform
Stripe             Subscriptions (Guru, Pro), one-time top-ups, customer portal  inference-platform
Firebase Hosting   Two web app deployments (green + indigo themes)               inference-platform
Cloud Trace        Distributed tracing via OpenTelemetry                         Both projects
Artifact Registry  Docker image storage for Cloud Run deployments                Both projects

See Infrastructure for detailed configuration of each service.


Design Principles

Ship v1 First

No premature optimization. Every feature ships as a working v1 before being refined. Architecture decisions favor simplicity and correctness over theoretical scalability.

Single Credit System

All clients -- web app, VSCode extension, CLI, scheduled agents -- use the same credits system. 1 credit = $0.001 USD, 1.4x margin on provider cost. The agent executor does not handle billing directly; it calls the main API for every LLM request, and the main API deducts credits from the user's account. This guarantees consistent billing regardless of how inference is triggered.
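A worked example of the pricing rule above. Integer microdollar accounting and rounding up to a whole credit are assumptions about the implementation, not the platform's actual code:

```python
# Worked example of the credit pricing rule: 1 credit = $0.001 USD with a
# 1.4x margin on provider cost. Integer microdollars avoid float rounding
# surprises; the ceiling to a whole credit is an assumption.
def credits_for_cost(provider_cost_usd: float) -> int:
    micro = round(provider_cost_usd * 1_000_000)  # dollars -> microdollars
    charged = micro * 14 // 10                    # apply the 1.4x margin
    return -(-charged // 1000)                    # ceil; 1 credit = 1000 microdollars

# A request that costs the platform $0.01 in provider fees is billed
# 0.01 * 1.4 = $0.014 -> 14 credits.
```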

SSE Everywhere

All model responses stream via Server-Sent Events. The protocol is consistent across direct chat, agentic tool-use, ensemble, and swarm.

Project-Level Isolation

The agent executor runs in a separate GCP project (inference-agents) from the main API (inference-platform). Service accounts, KMS keys, and Firestore collections are project-scoped. This ensures that a compromised agent execution cannot access billing data, user credentials, or provider API keys.

Reservation-Based Billing

Credits are reserved (worst-case estimate) before any provider call, then reconciled after the response completes. This prevents overcharges and handles provider failures gracefully.

The reservation accounts for fallback cost when a provider may fail and the system retries on a different (potentially more expensive) model.
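The reserve -> stream -> reconcile cycle can be sketched with an in-memory ledger standing in for the Redis credit counters; all names here are illustrative:

```python
# Sketch of reservation-based billing: hold a worst-case amount before the
# provider call, then settle to the actual cost afterwards. A plain object
# stands in for the Redis-backed credit counters.
class CreditLedger:
    def __init__(self, balance: int):
        self.balance = balance          # credits available
        self.reserved = 0               # credits held for in-flight requests

    def reserve(self, worst_case: int) -> None:
        if self.balance - self.reserved < worst_case:
            raise RuntimeError("insufficient credits")
        self.reserved += worst_case     # hold before any provider call

    def reconcile(self, worst_case: int, actual: int) -> None:
        # Release the hold and charge only what the stream actually cost;
        # on provider failure, actual=0 refunds the whole reservation.
        self.reserved -= worst_case
        self.balance -= min(actual, worst_case)
```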


Data Flow

Chat Request (Web or API Key)

1. The client authenticates via Firebase Auth + App Check (web) or an API key (CLI/SDK).
2. The main API reserves worst-case credits, routes the request to a provider (falling back to a comparable model on failure), and streams the response over SSE.
3. On completion, credits are reconciled to actual usage and a usage event is published to Pub/Sub for BigQuery analytics.

Scheduled Agent Execution

1. Cloud Scheduler fires a cron trigger, and Cloud Tasks dispatches an OIDC-authenticated request to the agent executor.
2. The executor runs the agentic loop, ensemble, or swarm, calling back to the main API for every LLM request.
3. The main API bills each callback through the normal credit flow, so scheduled work is charged exactly like interactive use.

Credit Flow

1. Reserve: hold a worst-case credit estimate (including potential fallback cost) before any provider call.
2. Stream: deliver the response over SSE while metering actual usage.
3. Reconcile: charge the actual cost and release the remainder of the hold; a failed request refunds the unused reservation.

