Agent Runtime

This document describes how agents run in mutx.dev, including their lifecycle, monitoring, and self-healing capabilities.

What is active today

  • POST /agents/heartbeat is the live runtime path for connected agents. It updates agents.status and last_heartbeat in the control plane.

  • Each runtime heartbeat now emits an agent.heartbeat outgoing webhook event for subscribers.

  • When a heartbeat changes the persisted agent status, MUTX also emits an agent.status outgoing webhook event.

  • The background monitor in src/api/services/monitoring.py still owns stale-agent detection, failure marking, alert resolution, and automatic recovery.

  • The more advanced self-healing components described later in this document are only partially implemented. They remain aspirational until they are connected to real schedulers and executors.
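
The heartbeat behavior listed above can be sketched as follows. This is a minimal in-memory illustration, not the MUTX implementation: the `Agent` dataclass, `handle_heartbeat`, and the `emit` callback are hypothetical names, and the payload shapes are assumed. Only the event names `agent.heartbeat` and `agent.status` and the updated fields come from this document.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Optional


@dataclass
class Agent:
    """Hypothetical stand-in for a row in the agents table."""
    id: str
    status: str = "unknown"
    last_heartbeat: Optional[datetime] = None


def handle_heartbeat(agent: Agent, reported_status: str,
                     emit: Callable[[str, dict], None]) -> None:
    """Apply one heartbeat and emit the matching webhook events."""
    previous_status = agent.status
    agent.status = reported_status
    agent.last_heartbeat = datetime.now(timezone.utc)

    # Every heartbeat produces an agent.heartbeat event.
    emit("agent.heartbeat", {"agent_id": agent.id, "status": agent.status})

    # agent.status is emitted only when the persisted status changed.
    if previous_status != reported_status:
        emit("agent.status", {"agent_id": agent.id,
                              "old": previous_status, "new": reported_status})


events = []
agent = Agent(id="agent-1")
handle_heartbeat(agent, "running", lambda name, payload: events.append(name))
handle_heartbeat(agent, "running", lambda name, payload: events.append(name))
print(events)  # ['agent.heartbeat', 'agent.status', 'agent.heartbeat']
```

The second heartbeat reports the same status, so it produces only the `agent.heartbeat` event.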


Overview

The Agent Runtime (src/api/services/agent_runtime.py:98) is the core execution engine that manages agent lifecycles, tool routing, and resource allocation.

┌─────────────────────────────────────────────────────────────────────────────────┐
│                           Agent Runtime Architecture                            │
│                                                                                 │
│  ┌───────────────────────────────────────────────────────────────────────────┐  │
│  │                         RuntimeManager                                    │  │
│  │  ┌─────────────────────────────────────────────────────────────────────┐  │  │
│  │  │                     AgentRuntime                                    │  │  │
│  │  │                                                                     │  │  │
│  │  │  ┌────────────────┐  ┌────────────────┐  ┌───────────────────────┐  │  │  │
│  │  │  │ RuntimeConfig  │  │ RuntimeState   │  │ ToolExecutionHandler  │  │  │  │
│  │  │  │ - timeout      │  │ - status       │  │ - register_handler    │  │  │  │
│  │  │  │ - max_agents   │  │ - metrics      │  │ - execute_tool        │  │  │  │
│  │  │  │ - retries      │  │ - active       │  │                       │  │  │  │
│  │  │  └────────────────┘  └────────────────┘  └───────────────────────┘  │  │  │
│  │  │                                                                     │  │  │
│  │  │  ┌───────────────────────────────────────────────────────────────┐  │  │  │
│  │  │  │                        Agent Registry                         │  │  │  │
│  │  │  │  ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐      │  │  │  │
│  │  │  │  │  Agent 1  │ │  Agent 2  │ │  Agent 3  │ │  Agent N  │      │  │  │  │
│  │  │  │  │(LangChain)│ │(OpenClaw) │ │   (n8n)   │ │           │      │  │  │  │
│  │  │  │  └───────────┘ └───────────┘ └───────────┘ └───────────┘      │  │  │  │
│  │  │  └───────────────────────────────────────────────────────────────┘  │  │  │
│  │  └─────────────────────────────────────────────────────────────────────┘  │  │
│  └───────────────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────────────┘

Agent Lifecycle

Lifecycle States

Creating an Agent

Execution Modes

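| Mode      | Method                 | Use Case                      |
|-----------|------------------------|-------------------------------|
| Async     | execute_agent()        | Non-blocking, high throughput |
| Sync      | execute_agent_sync()   | Simple scripts, CLI tools     |
| Streaming | execute_agent_stream() | Real-time output, chat UIs    |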
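
To make the difference between the three modes concrete, here is a minimal sketch. `SketchRuntime` is a hypothetical class: only the three method names come from the table above, while the arguments, return values, and the echo-style behavior are assumptions for illustration.

```python
import asyncio
from typing import AsyncIterator, List


class SketchRuntime:
    """Hypothetical runtime; only the three method names come from the table."""

    async def execute_agent(self, agent_id: str, prompt: str) -> str:
        await asyncio.sleep(0)  # placeholder for real non-blocking work
        return f"{agent_id}: {prompt}"

    def execute_agent_sync(self, agent_id: str, prompt: str) -> str:
        # Blocking wrapper; must not be called from inside a running event loop.
        return asyncio.run(self.execute_agent(agent_id, prompt))

    async def execute_agent_stream(self, agent_id: str,
                                   prompt: str) -> AsyncIterator[str]:
        # Yield output piece by piece so a caller can render it as it arrives.
        for token in prompt.split():
            await asyncio.sleep(0)
            yield token


async def collect_stream(runtime: SketchRuntime, agent_id: str,
                         prompt: str) -> List[str]:
    return [chunk async for chunk in runtime.execute_agent_stream(agent_id, prompt)]


runtime = SketchRuntime()
sync_result = runtime.execute_agent_sync("agent-1", "hello world")
stream_result = asyncio.run(collect_stream(runtime, "agent-1", "hello world"))
print(sync_result)    # agent-1: hello world
print(stream_result)  # ['hello', 'world']
```

In this sketch the sync method is a thin wrapper over the async one, which keeps a single code path; whether the real runtime is structured that way is not stated in this document.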


Agent Types

1. LangChain Agent

From src/api/integrations/langchain_agent.py:

Features:

  • Multiple LLM providers (OpenAI, Anthropic, Ollama)

  • Tool-augmented execution

  • Conversation memory

  • Streaming support

2. OpenClaw Agent

Multi-agent orchestration framework for complex workflows.

3. n8n Agent

Workflow automation with visual builder integration.


Tool Execution

Tool Handler Architecture

Built-in Tools

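| Tool             | Description                      | Example                      |
|------------------|----------------------------------|------------------------------|
| search_documents | Semantic search via vector store | query="deployment guide"     |
| get_time         | Current timestamp                | get_time()                   |
| calculator       | Safe math evaluation             | calculator(expression="2+2") |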
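
"Safe math evaluation" usually means evaluating arithmetic without calling `eval` on untrusted input. One common way to do that is to walk the parsed syntax tree and allow only numeric literals and a fixed set of operators. The sketch below shows that approach; it is an assumption about how such a tool could work, not the built-in tool's actual code.

```python
import ast
import operator
from typing import Union

# Only these operators are allowed; everything else is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def calculator(expression: str) -> Union[int, float]:
    """Evaluate basic arithmetic without executing arbitrary code."""

    def evaluate(node: ast.AST) -> Union[int, float]:
        if isinstance(node, ast.Expression):
            return evaluate(node.body)
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](evaluate(node.left),
                                              evaluate(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](evaluate(node.operand))
        # Names, calls, attribute access, and so on all end up here.
        raise ValueError("unsupported expression")

    return evaluate(ast.parse(expression, mode="eval"))


print(calculator(expression="2+2"))  # 4
```

An input such as `__import__('os')` parses to a call node, which is not in the allow-list, so it raises `ValueError` instead of running.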

Custom Tool Registration
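
As a rough illustration of registering a custom tool, the sketch below builds a small registry around the two method names shown in the architecture diagram, `register_handler` and `execute_tool`. The signatures, the duplicate-name check, and the `word_count` tool are assumptions; the real ToolExecutionHandler may differ.

```python
from typing import Any, Callable, Dict


class ToolExecutionHandler:
    """Sketch only: method names follow the diagram, signatures are assumed."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[..., Any]] = {}

    def register_handler(self, name: str, handler: Callable[..., Any]) -> None:
        # Refuse silent overwrites so two tools cannot share a name.
        if name in self._handlers:
            raise ValueError(f"tool already registered: {name}")
        self._handlers[name] = handler

    def execute_tool(self, name: str, **kwargs: Any) -> Any:
        try:
            handler = self._handlers[name]
        except KeyError:
            raise KeyError(f"unknown tool: {name}") from None
        return handler(**kwargs)


def word_count(text: str) -> int:
    """Hypothetical custom tool."""
    return len(text.split())


tools = ToolExecutionHandler()
tools.register_handler("word_count", word_count)
result = tools.execute_tool("word_count", text="agents run in mutx")
print(result)  # 4
```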


Monitoring

The MonitoringService (src/api/services/monitoring.py:363) collects per-agent metrics, derives a health status from them, and raises alerts at the severity levels listed below.

Metrics Collection

Health Status Levels

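| Status    | Condition                   | Action           |
|-----------|-----------------------------|------------------|
| HEALTHY   | All checks pass             | Normal operation |
| DEGRADED  | Performance below threshold | Log warning      |
| UNHEALTHY | Health check failed         | Trigger recovery |
| UNKNOWN   | No health data              | Skip monitoring  |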
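
The table can be read as a small decision procedure. The sketch below encodes it under stated assumptions: the `classify` function, its inputs, and the 1000 ms latency threshold are hypothetical, since this document does not say which performance measure or threshold the service uses. Only the four status names and their actions come from the table.

```python
from enum import Enum
from typing import Optional


class HealthStatus(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"
    UNKNOWN = "unknown"


# Action column of the table, keyed by status.
ACTIONS = {
    HealthStatus.HEALTHY: "Normal operation",
    HealthStatus.DEGRADED: "Log warning",
    HealthStatus.UNHEALTHY: "Trigger recovery",
    HealthStatus.UNKNOWN: "Skip monitoring",
}


def classify(check_passed: Optional[bool],
             latency_ms: Optional[float] = None,
             latency_threshold_ms: float = 1000.0) -> HealthStatus:
    """Map one health-check result to a status (threshold is an assumption)."""
    if check_passed is None:
        return HealthStatus.UNKNOWN      # no health data yet
    if not check_passed:
        return HealthStatus.UNHEALTHY    # the check itself failed
    if latency_ms is not None and latency_ms > latency_threshold_ms:
        return HealthStatus.DEGRADED     # passing, but slower than allowed
    return HealthStatus.HEALTHY


status = classify(check_passed=True, latency_ms=1500.0)
print(status.name, "->", ACTIONS[status])  # DEGRADED -> Log warning
```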

Alert Severity

| Level    | Threshold        | Example               |
|----------|------------------|-----------------------|
| INFO     | -                | Agent registered      |
| WARNING  | Error rate > 10% | High latency detected |
| ERROR    | Error rate > 25% | Agent unhealthy       |
| CRITICAL | Error rate > 50% | System failure        |
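
The thresholds translate directly into a lookup on the error rate. A minimal sketch, assuming the rate is computed as `error_count / request_count` and that an agent with no requests yet is reported as INFO (neither detail is stated in this document; the function name is hypothetical):

```python
def severity_for_error_rate(error_count: int, request_count: int) -> str:
    """Return the alert level for an error rate, using strict '>' comparisons."""
    if request_count == 0:
        return "INFO"  # assumption: no traffic means nothing to alert on
    rate = error_count / request_count
    if rate > 0.50:
        return "CRITICAL"
    if rate > 0.25:
        return "ERROR"
    if rate > 0.10:
        return "WARNING"
    return "INFO"


print(severity_for_error_rate(30, 100))  # ERROR
```

Because the comparisons are strict, a rate of exactly 10% stays at INFO and only exceeds the threshold at the next error.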

Metrics Collected

| Metric            | Type    | Description              |
|-------------------|---------|--------------------------|
| request_count     | Counter | Total requests processed |
| error_count       | Counter | Failed requests          |
| avg_latency_ms    | Gauge   | Average response time    |
| p95_latency_ms    | Gauge   | 95th percentile latency  |
| p99_latency_ms    | Gauge   | 99th percentile latency  |
| cpu_usage         | Gauge   | System CPU percentage    |
| memory_usage      | Gauge   | System memory percentage |
| uptime_percentage | Gauge   | Agent uptime ratio       |
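
For readers unfamiliar with the latency gauges, the sketch below derives them from a list of raw latency samples. It uses the nearest-rank percentile definition, which is one of several common conventions; this document does not say which one the service uses. The `percentile` and `summarize` helpers are hypothetical.

```python
import math
from typing import Dict, List


def percentile(samples: List[float], pct: int) -> float:
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms: List[float], error_count: int) -> Dict[str, float]:
    """Build the request and latency metrics from one batch of samples."""
    if not latencies_ms:
        raise ValueError("no requests recorded")
    return {
        "request_count": len(latencies_ms),
        "error_count": error_count,
        "avg_latency_ms": sum(latencies_ms) / len(latencies_ms),
        "p95_latency_ms": percentile(latencies_ms, 95),
        "p99_latency_ms": percentile(latencies_ms, 99),
    }


# 100 samples of 1 ms, 2 ms, ..., 100 ms.
summary = summarize([float(ms) for ms in range(1, 101)], error_count=4)
print(summary["avg_latency_ms"], summary["p95_latency_ms"], summary["p99_latency_ms"])
# 50.5 95.0 99.0
```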


Self-Healing

The SelfHealingService (src/api/services/self_healer.py:491) provides automatic recovery for failing agents.

Self-Healing Architecture

Recovery Actions

| Action     | Trigger                | Description                |
|------------|------------------------|----------------------------|
| RESTART    | 3 consecutive failures | Restart agent process      |
| ROLLBACK   | After failed restart   | Revert to stable version   |
| RECREATE   | Persistent failure     | Destroy and recreate agent |
| SCALE_UP   | High load              | Add more agent instances   |
| SCALE_DOWN | Low load               | Reduce resource usage      |
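
The first three actions form an escalation ladder. The sketch below shows one way to pick the next step. It reads "persistent failure" as "still failing after a rollback", which is an interpretation rather than something this document states, and it leaves out SCALE_UP and SCALE_DOWN because no load thresholds are given. `choose_recovery_action` and its parameters are hypothetical.

```python
from enum import Enum
from typing import Optional


class RecoveryAction(Enum):
    RESTART = "restart"
    ROLLBACK = "rollback"
    RECREATE = "recreate"


FAILURE_THRESHOLD = 3  # "3 consecutive failures" from the table


def choose_recovery_action(consecutive_failures: int,
                           restart_failed: bool = False,
                           rollback_failed: bool = False) -> Optional[RecoveryAction]:
    """Return the next escalation step, or None if no recovery is needed yet."""
    if consecutive_failures < FAILURE_THRESHOLD:
        return None
    if rollback_failed:
        return RecoveryAction.RECREATE   # assumed meaning of "persistent failure"
    if restart_failed:
        return RecoveryAction.ROLLBACK
    return RecoveryAction.RESTART


print(choose_recovery_action(3))                       # RecoveryAction.RESTART
print(choose_recovery_action(3, restart_failed=True))  # RecoveryAction.ROLLBACK
```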

Health Check Configuration

Recovery Flow

Recovery Time Tracking

The service tracks recovery metrics against a target recovery time of under 5 seconds.
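
A minimal sketch of how recovery durations could be measured and compared with the 5-second target. `RecoveryTimer` is a hypothetical helper, not part of the SelfHealingService; it uses a monotonic clock so that wall-clock adjustments do not distort the measurement.

```python
import time
from contextlib import contextmanager
from typing import Iterator, List

RECOVERY_TARGET_SECONDS = 5.0  # target stated in this document


class RecoveryTimer:
    """Hypothetical helper that records how long each recovery took."""

    def __init__(self) -> None:
        self.durations: List[float] = []

    @contextmanager
    def track(self) -> Iterator[None]:
        start = time.monotonic()
        try:
            yield
        finally:
            # Record the duration even if the recovery attempt raised.
            self.durations.append(time.monotonic() - start)

    def within_target(self) -> bool:
        return all(d < RECOVERY_TARGET_SECONDS for d in self.durations)


timer = RecoveryTimer()
with timer.track():
    time.sleep(0.01)  # placeholder for a real recovery action

print(len(timer.durations), timer.within_target())
```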


Configuration

Runtime Configuration
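
The architecture diagram lists three RuntimeConfig fields: timeout, max_agents, and retries. A configuration object with those fields might look like the sketch below. The types, default values, and validation rules are assumptions chosen for illustration and are not taken from agent_runtime.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuntimeConfig:
    """Sketch only: field names come from the diagram, defaults are assumed."""
    timeout: float = 30.0   # seconds allowed per agent execution (assumed unit)
    max_agents: int = 10    # upper bound on registered agents (assumed meaning)
    retries: int = 3        # retry attempts after a failed execution

    def __post_init__(self) -> None:
        # Fail fast on values that can never work.
        if self.timeout <= 0:
            raise ValueError("timeout must be positive")
        if self.max_agents < 1:
            raise ValueError("max_agents must be at least 1")
        if self.retries < 0:
            raise ValueError("retries must not be negative")


config = RuntimeConfig(timeout=60.0, max_agents=25)
print(config)  # RuntimeConfig(timeout=60.0, max_agents=25, retries=3)
```

Freezing the dataclass keeps the configuration immutable once the runtime has started, which avoids mid-run changes that only some components observe.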

Example Usage

