❱ Orchestration

The Majestix AI Inference Hub supports two advanced orchestration patterns beyond the standard single-agent agentic loop. Both patterns run on the agent-executor Cloud Run service in the inference-agents GCP project, isolated from the main API.


Overview

Standard agentic loops execute a single LLM with tool access in an iterative cycle. The orchestration layer adds two higher-order patterns that coordinate multiple models or agents to produce higher-quality outputs.

Pattern     Description                                            Use Case

Ensemble    3 models iterate on the same output until they         Content refinement, research
            reach quality consensus                                synthesis, code review

Swarm       Multiple agents execute sequentially as a pipeline,    Multi-stage workflows, content
            passing outputs forward                                pipelines, operational automation


Architecture

Both patterns are triggered via OIDC-authenticated internal endpoints on the agent executor service. They are not directly accessible to end users. The main API enqueues orchestration requests via Cloud Tasks, and the agent executor processes them asynchronously.
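The enqueue step can be sketched as follows. This is a minimal illustration of the Cloud Tasks HTTP-task payload the main API might build; the executor URL and service-account email here are placeholders, not the real deployment values, and the surrounding queue-client code is omitted.

```python
import json

# Placeholder values for illustration -- the real URL and service account
# come from the inference-api deployment configuration.
EXECUTOR_URL = "https://agent-executor.example.run.app/internal/orchestrate"
EXECUTOR_SA = "agent-executor@inference-agents.iam.gserviceaccount.com"

def build_orchestration_task(pattern: str, payload: dict) -> dict:
    """Build a Cloud Tasks HTTP task targeting the agent executor.

    The `oidc_token` block tells Cloud Tasks to mint an OIDC identity
    token for the given service account; the executor validates that
    token on receipt, so no user credentials ever leave the main API.
    """
    return {
        "http_request": {
            "http_method": "POST",
            "url": EXECUTOR_URL,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps({"pattern": pattern, **payload}).encode(),
            "oidc_token": {
                "service_account_email": EXECUTOR_SA,
                "audience": EXECUTOR_URL,
            },
        }
    }

task = build_orchestration_task("ensemble", {"max_rounds": 3})
```

In the real service this dict would be passed to the Cloud Tasks client's create-task call; here only the payload shape is shown.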

+------------------+       Cloud Tasks       +---------------------+
|                  | ----------------------> |                     |
|  inference-api   |                         |  agent-executor     |
|  (main API)      | <---------------------- |  (orchestration)    |
|                  |    /chat (model calls)   |                     |
+------------------+                         +---------------------+
        |                                            |
        |                                            |
   Providers                                   External APIs
   (OpenAI, Anthropic,                         (DNS-pinned,
    Vertex, OpenRouter)                         default-deny)

Request Flow

  1. Client submits a task to the main API (or an internal trigger fires).

  2. Main API enqueues the orchestration job via Cloud Tasks.

  3. Agent executor picks up the job and validates the request's OIDC service-account token.

  4. The executor calls back to the main API's /chat endpoint for all LLM inference.

  5. Credits are charged through the main API's billing system -- the executor never interacts with billing directly.

  6. Results are stored and/or delivered via webhook.
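Steps 3 through 6 above can be condensed into a sketch of the executor's job handler. The helper functions (`verify_oidc`, `call_chat`, `deliver_webhook`) are hypothetical stand-ins, passed in as parameters so the control flow is visible without the real service code.

```python
def process_job(job: dict, verify_oidc, call_chat, deliver_webhook) -> dict:
    """Hedged sketch of the executor-side request flow (steps 3-6)."""
    # Step 3: reject jobs whose OIDC identity is not an authorized caller.
    if not verify_oidc(job["oidc_token"]):
        raise PermissionError("unauthorized caller")

    # Step 4: all LLM inference goes back through the main API's /chat
    # endpoint, so credit reservation and reconciliation (step 5) happen
    # there -- the executor never touches billing directly.
    results = [call_chat(prompt) for prompt in job["prompts"]]

    # Step 6: store the outcome and/or deliver it via webhook.
    outcome = {"job_id": job["id"], "results": results}
    if job.get("webhook_url"):
        deliver_webhook(job["webhook_url"], outcome)
    return outcome
```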


Credit Billing

All model inference within orchestration patterns is billed through the main API's standard credits system. The orchestration layer itself adds no surcharge. Credits are reserved at the start of each model call and reconciled upon completion, identical to direct /chat requests.

  • Ensemble: Billed per model call per round. A 3-round ensemble with 3 models incurs up to 9 model calls.

  • Swarm: Billed per model call per agent. Each agent's agentic loop may involve multiple LLM calls depending on tool use iterations.

Budget enforcement is available at the swarm level via max_total_credits. Ensemble billing is bounded by max_rounds.
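The billing bounds above reduce to simple arithmetic. The sketch below shows the ensemble upper bound and a hypothetical budget gate of the kind `max_total_credits` implies; the function names are illustrative, not the actual API.

```python
def ensemble_max_calls(models: int, max_rounds: int) -> int:
    """Upper bound on billed calls: one call per model per round.
    e.g. a 3-round ensemble with 3 models -> at most 9 calls."""
    return models * max_rounds

def swarm_within_budget(spent_credits: float, next_call_cost: float,
                        max_total_credits: float) -> bool:
    """Hypothetical gate checked before each agent's next model call."""
    return spent_credits + next_call_cost <= max_total_credits
```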


Authentication

Both orchestration endpoints use OIDC service-to-service authentication. The agent executor validates that incoming requests originate from an authorized service account in the inference-platform project. User API keys and Firebase tokens are never sent to the executor -- the main API bridges authentication.
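The executor's authorization check might look like the sketch below. Cryptographic verification of the OIDC token would be done with a library such as google-auth; shown here are only the claim checks applied to the already-verified payload. The service-account email and audience URL are assumed names for illustration.

```python
# Assumed identifiers -- the real values are deployment configuration.
AUTHORIZED_SA = "inference-api@inference-platform.iam.gserviceaccount.com"
EXPECTED_AUDIENCE = "https://agent-executor.example.run.app"

def is_authorized(claims: dict) -> bool:
    """Claim checks on a cryptographically verified OIDC token payload."""
    return (
        claims.get("email") == AUTHORIZED_SA
        and claims.get("email_verified") is True
        and claims.get("aud") == EXPECTED_AUDIENCE
        and claims.get("iss") == "https://accounts.google.com"
    )
```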


Plan Limits

Plan    Ensemble Max Rounds    Swarm Max Agents

Free    Not available          Not available
Guru    2                      5
Pro     3                      10
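The limits in the table above lend themselves to a simple request-validation lookup. This is an illustrative sketch, not the actual enforcement code; `None` marks a pattern as unavailable on that plan.

```python
# Plan limits from the table above.
PLAN_LIMITS = {
    "free": {"ensemble_max_rounds": None, "swarm_max_agents": None},
    "guru": {"ensemble_max_rounds": 2, "swarm_max_agents": 5},
    "pro":  {"ensemble_max_rounds": 3, "swarm_max_agents": 10},
}

def validate_request(plan: str, pattern: str, size: int) -> bool:
    """Return True if `size` (rounds or agents) is allowed on `plan`."""
    key = "ensemble_max_rounds" if pattern == "ensemble" else "swarm_max_agents"
    limit = PLAN_LIMITS[plan][key]
    return limit is not None and size <= limit
```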

