A cron daemon on the relay server monitors every service and recovers failed servers automatically.

The auto-healing daemon runs continuously on the relay server alongside the admin panel. When a server fails its health check, the daemon re-runs the full recovery sequence, then repeats the health check to confirm the server is back online. No manual intervention is required.

Zero config

Uses the same healthcheck and run hooks already defined in config.yaml.

Full recovery

Runs the preflight → start → postflight cycle on every failing server.

healthcheck-config

services:
  web:
    run:
      preflight:
        - bun install
      start:
        - pm2 start index.js --name web
      postflight:
        - pm2 list
    healthcheck:
      cmd: curl http://localhost:3000/api/healthcheck
      test: Server Running

How it works

What happens when a server fails its health check

The daemon polls each service on a recurring schedule. A single failed check triggers the full three-stage recovery sequence.

Health check runs

The daemon runs the configured healthcheck.cmd against every server in every service. If the command's output contains the healthcheck.test string, the server is considered healthy; otherwise the check fails.
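
The health rule above amounts to a substring match on the command's output. Here is a minimal Python sketch of that rule, assuming a plain case-sensitive match. The source does not say whether matching is case-sensitive or whether the command's exit code is also considered. `is_healthy` is a hypothetical name, not part of the tool.

```python
def is_healthy(output: str, expected: str) -> bool:
    """Return True when the healthcheck output contains the expected text.

    `output` stands for whatever healthcheck.cmd printed and `expected`
    for the healthcheck.test value. Case-sensitive matching is an assumption.
    """
    return expected in output


# Example using the healthcheck.test value from the config above.
print(is_healthy("Server Running on port 3000", "Server Running"))  # True
print(is_healthy("502 Bad Gateway", "Server Running"))              # False
```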

Recovery sequence fires

If the check fails, the daemon re-runs the full run.preflight → run.start → run.postflight sequence on that server — the same commands that ran during the original deployment.

Confirmation check

After the recovery sequence completes, the health check runs again to confirm the server is back online.
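
Taken together, the three steps form a check, recover, confirm cycle. The Python sketch below is one way to picture that flow; it is not the daemon's actual code. `heal_server`, `check`, and `recover` are hypothetical names, and the commands are passed in as callables so the flow can be followed without a real server. The returned labels are illustrative, since the source does not say what the daemon reports when the confirmation check fails.

```python
from typing import Callable


def heal_server(check: Callable[[], bool], recover: Callable[[], None]) -> str:
    """Run one check -> recover -> confirm cycle for a single server.

    `check` stands in for running healthcheck.cmd and matching healthcheck.test;
    `recover` stands in for the preflight -> start -> postflight sequence.
    Returns "healthy", "recovered", or "still-failing" (illustrative labels).
    """
    if check():
        return "healthy"  # first check passed, nothing to do
    recover()  # full recovery sequence
    # Confirmation check: the same health check, run once more.
    return "recovered" if check() else "still-failing"


# Usage: a fake server that stays down until recovery runs.
state = {"up": False}
result = heal_server(lambda: state["up"], lambda: state.update(up=True))
print(result)  # recovered
```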

Recovery stages

The same three-stage sequence from your config

No additional configuration is required. The daemon reuses the run and healthcheck blocks already defined for each service.

Stage 1

Preflight

Re-runs the preflight commands from the config: rebuilds, reinstalls, migrations, or anything else that must complete before the process starts.

Stage 2

Start

Re-runs start commands to bring the application process back up on the failed server.

Stage 3

Postflight

Re-runs the postflight commands for verification, reloads, or any follow-up checks after the process has started.
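
The stage order can be sketched as a small function that flattens a service's run block into the list of commands a recovery would execute. This is an illustration in Python under stated assumptions: `recovery_commands` is a hypothetical helper, the run block is represented as a plain dict mirroring the YAML, and a stage missing from the block is treated as empty.

```python
STAGES = ("preflight", "start", "postflight")  # order taken from the stages above


def recovery_commands(run_block: dict) -> list:
    """Return every command in a run block, in stage order.

    A stage missing from the block is treated as empty (an assumption).
    """
    commands = []
    for stage in STAGES:
        commands.extend(run_block.get(stage, []))
    return commands


# The `web` service from the earlier example, written as a dict.
web_run = {
    "preflight": ["bun install"],
    "start": ["pm2 start index.js --name web"],
    "postflight": ["pm2 list"],
}
print(recovery_commands(web_run))
# ['bun install', 'pm2 start index.js --name web', 'pm2 list']
```

Because the function reads only the run block, the same list would serve a first deployment and a later recovery, which matches the reuse described above.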

Configuration

Nothing extra to configure

Auto-healing is enabled automatically on every deployment. The daemon reads the same config.yaml used at deploy time.

config.yaml

services:
  api:
    instances: 2
    clusters: 2
    run:
      preflight:
        - bun install --production
      start:
        - pm2 start dist/server.js --name api
      postflight:
        - pm2 list
    healthcheck:
      cmd: curl -s http://localhost:4000/health
      test: ok

Any server that fails its health check, whether because of a crash, memory exhaustion, or a transient error, is put through the recovery sequence automatically. The healthcheck, run.preflight, run.start, and run.postflight fields drive both the initial deployment and the recovery process.

Next step

Take the same setup across providers

Once the deployment shape is clear, duplicate it to another cloud provider with a single config override.
