❱ Orchestration

The Majestix AI Inference Hub supports two advanced orchestration patterns beyond the standard single-agent agentic loop. Both patterns run on the agent-executor Cloud Run service in the inference-agents GCP project, isolated from the main API.


Overview

Standard agentic loops execute a single LLM with tool access in an iterative cycle. The orchestration layer adds two higher-order patterns that coordinate multiple models or agents to produce higher-quality outputs.

Pattern     Description                                            Use Case

Ensemble    3 models iterate on the same output until they         Content refinement, research
            reach quality consensus                                synthesis, code review

Swarm       Multiple agents execute sequentially as a pipeline,    Multi-stage workflows, content
            passing outputs forward                                pipelines, operational automation


Architecture

Both patterns are triggered via OIDC-authenticated internal endpoints on the agent executor service. They are not directly accessible to end users. The main API enqueues orchestration requests via Cloud Tasks, and the agent executor processes them asynchronously.
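The enqueue step can be sketched as follows. This is a minimal illustration of the Cloud Tasks HTTP-task payload the main API might build; the executor URL and service-account email here are placeholders, not the real deployment values, and the surrounding queue-client code is omitted.

```python
import json

# Placeholder values for illustration -- the real URL and service account
# come from the inference-api deployment configuration.
EXECUTOR_URL = "https://agent-executor.example.run.app/internal/orchestrate"
EXECUTOR_SA = "agent-executor@inference-agents.iam.gserviceaccount.com"

def build_orchestration_task(pattern: str, payload: dict) -> dict:
    """Build a Cloud Tasks HTTP task targeting the agent executor.

    The `oidc_token` block tells Cloud Tasks to mint an OIDC identity
    token for the given service account; the executor validates that
    token on receipt, so no user credentials ever leave the main API.
    """
    return {
        "http_request": {
            "http_method": "POST",
            "url": EXECUTOR_URL,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps({"pattern": pattern, **payload}).encode(),
            "oidc_token": {
                "service_account_email": EXECUTOR_SA,
                "audience": EXECUTOR_URL,
            },
        }
    }

task = build_orchestration_task("ensemble", {"max_rounds": 3})
```

In the real service this dict would be passed to the Cloud Tasks client's create-task call; here only the payload shape is shown.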

+------------------+       Cloud Tasks       +---------------------+
|                  | ----------------------> |                     |
|  inference-api   |                         |  agent-executor     |
|  (main API)      | <---------------------- |  (orchestration)    |
|                  |    /chat (model calls)   |                     |
+------------------+                         +---------------------+
        |                                            |
        |                                            |
   Providers                                   External APIs
   (OpenAI, Anthropic,                         (DNS-pinned,
    Vertex, OpenRouter)                         default-deny)

Request Flow

  1. Client submits a task to the main API (or an internal trigger fires).

  2. Main API enqueues the orchestration job via Cloud Tasks.

  3. Agent executor picks up the job and validates the request's OIDC service-account token.

  4. The executor calls back to the main API's /chat endpoint for all LLM inference.

  5. Credits are charged through the main API's billing system -- the executor never interacts with billing directly.

  6. Results are stored and/or delivered via webhook.
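Steps 3 through 6 above can be condensed into a sketch of the executor's job handler. The helper functions (`verify_oidc`, `call_chat`, `deliver_webhook`) are hypothetical stand-ins, passed in as parameters so the control flow is visible without the real service code.

```python
def process_job(job: dict, verify_oidc, call_chat, deliver_webhook) -> dict:
    """Hedged sketch of the executor-side request flow (steps 3-6)."""
    # Step 3: reject jobs whose OIDC identity is not an authorized caller.
    if not verify_oidc(job["oidc_token"]):
        raise PermissionError("unauthorized caller")

    # Step 4: all LLM inference goes back through the main API's /chat
    # endpoint, so credit reservation and reconciliation (step 5) happen
    # there -- the executor never touches billing directly.
    results = [call_chat(prompt) for prompt in job["prompts"]]

    # Step 6: store the outcome and/or deliver it via webhook.
    outcome = {"job_id": job["id"], "results": results}
    if job.get("webhook_url"):
        deliver_webhook(job["webhook_url"], outcome)
    return outcome
```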


Credit Billing

All model inference within orchestration patterns is billed through the main API's standard credits system. The orchestration layer itself adds no surcharge. Credits are reserved at the start of each model call and reconciled upon completion, identical to direct /chat requests.

  • Ensemble: Billed per model call per round. A 3-round ensemble with 3 models incurs up to 9 model calls.

  • Swarm: Billed per model call per agent. Each agent's agentic loop may involve multiple LLM calls depending on tool use iterations.

Budget enforcement is available at the swarm level via max_total_credits. Ensemble billing is bounded by max_rounds.
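The billing bounds above reduce to simple arithmetic. The sketch below shows the ensemble upper bound and a hypothetical budget gate of the kind `max_total_credits` implies; the function names are illustrative, not the actual API.

```python
def ensemble_max_calls(models: int, max_rounds: int) -> int:
    """Upper bound on billed calls: one call per model per round.
    e.g. a 3-round ensemble with 3 models -> at most 9 calls."""
    return models * max_rounds

def swarm_within_budget(spent_credits: float, next_call_cost: float,
                        max_total_credits: float) -> bool:
    """Hypothetical gate checked before each agent's next model call."""
    return spent_credits + next_call_cost <= max_total_credits
```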


Authentication

Both orchestration endpoints use OIDC service-to-service authentication. The agent executor validates that incoming requests originate from an authorized service account in the inference-platform project. User API keys and Firebase tokens are never sent to the executor -- the main API bridges authentication.
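The executor's authorization check might look like the sketch below. Cryptographic verification of the OIDC token would be done with a library such as google-auth; shown here are only the claim checks applied to the already-verified payload. The service-account email and audience URL are assumed names for illustration.

```python
# Assumed identifiers -- the real values are deployment configuration.
AUTHORIZED_SA = "inference-api@inference-platform.iam.gserviceaccount.com"
EXPECTED_AUDIENCE = "https://agent-executor.example.run.app"

def is_authorized(claims: dict) -> bool:
    """Claim checks on a cryptographically verified OIDC token payload."""
    return (
        claims.get("email") == AUTHORIZED_SA
        and claims.get("email_verified") is True
        and claims.get("aud") == EXPECTED_AUDIENCE
        and claims.get("iss") == "https://accounts.google.com"
    )
```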


Plan Limits

Plan    Ensemble Max Rounds    Swarm Max Agents

Free    Not available          Not available
Guru    2                      5
Pro     3                      10
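The limits in the table above lend themselves to a simple request-validation lookup. This is an illustrative sketch, not the actual enforcement code; `None` marks a pattern as unavailable on that plan.

```python
# Plan limits from the table above.
PLAN_LIMITS = {
    "free": {"ensemble_max_rounds": None, "swarm_max_agents": None},
    "guru": {"ensemble_max_rounds": 2, "swarm_max_agents": 5},
    "pro":  {"ensemble_max_rounds": 3, "swarm_max_agents": 10},
}

def validate_request(plan: str, pattern: str, size: int) -> bool:
    """Return True if `size` (rounds or agents) is allowed on `plan`."""
    key = "ensemble_max_rounds" if pattern == "ensemble" else "swarm_max_agents"
    limit = PLAN_LIMITS[plan][key]
    return limit is not None and size <= limit
```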

