ECS Auto-Deploy Design

Date: 2026-05-14
Target: Alibaba ECS Chengdu (39.106.170.204)
Image: ghcr.io/lacatfly/twilight-drive
Status: Approved, ready for implementation plan

Goal

Replace manual install.sh-driven deployment with a GitHub Actions pipeline that ships every merged main commit to ECS with health-gated auto-rollback and zero open inbound ports.

Constraints

ECS sits in Chengdu behind GFW; no overseas SSH ingress is acceptable as a long-term surface.
ECS can reach ghcr.io (verified: 1.6s HTTP response). No Aliyun ACR mirror required at this stage.
Existing docker-build.yml already pushes tags latest, main, sha-<7> on push to main.
Existing cloudflared tunnel twilight-backend already terminates api.fsagent.cc on the box.
Paid users hit api.fsagent.cc; broken deploys must self-heal without operator intervention.

Architecture

push main
  │
  ├──► docker-build.yml (existing)
  │       build → push ghcr.io tags: latest, main, sha-<7>
  │
  └──► deploy-ecs.yml (NEW, workflow_run after docker-build success)
         1. install cloudflared on runner
         2. auth Cloudflare Access via service token
         3. ssh deploy@ssh-ecs.fsagent.cc (via cloudflared access ssh)
            → runs /opt/twilight/deploy.sh sha-<7>
                a. flock to serialize deploys
                b. validate tag against regex
                c. snapshot current IMAGE_TAG → PREV
                d. write new IMAGE_TAG to .env
                e. docker compose pull && up -d
                f. poll /healthz up to 30 times, 2s apart (about 60s)
                g. success → prune old images >7d
                h. fail → restore PREV, up -d, exit 1

The compose.yml references image: ghcr.io/lacatfly/twilight-drive:${IMAGE_TAG:-latest}. Tag pinning via .env makes rollback an env-flip plus up -d, not a registry hunt.
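
The env-flip described above can be sketched as a pair of small helpers. This is a minimal sketch rather than the deploy script itself: `current_image_tag` and `set_image_tag` are hypothetical names, GNU `sed -i` is assumed, and the env file path is a parameter so the helpers can be tried on a scratch file before touching `/opt/twilight/.env`.

```bash
#!/usr/bin/env bash
# Hypothetical helpers illustrating tag pinning through an env file.

# Print the value of IMAGE_TAG in the given env file (empty if absent).
current_image_tag() {
  local env_file="$1"
  sed -n 's/^IMAGE_TAG=//p' "$env_file" | tail -n 1
}

# Point IMAGE_TAG at a new tag, appending the line if it is missing.
set_image_tag() {
  local env_file="$1" tag="$2"
  if grep -q '^IMAGE_TAG=' "$env_file"; then
    sed -i "s|^IMAGE_TAG=.*|IMAGE_TAG=${tag}|" "$env_file"
  else
    printf 'IMAGE_TAG=%s\n' "$tag" >> "$env_file"
  fi
}
```

Under these assumptions a manual rollback would be `set_image_tag /opt/twilight/.env "$PREV"` followed by `docker compose up -d`, where `$PREV` is whatever `current_image_tag` printed before the deploy.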

Components

New

| Path | Purpose |
| --- | --- |
| `.github/workflows/deploy-ecs.yml` | `workflow_run` trigger off Docker build success; SSHes via cloudflared and runs `deploy.sh` |
| `deploy/ecs-deploy.sh` | ECS-side deploy script: lock, snapshot, swap tag, pull, up, healthcheck, rollback |
| `deploy/cloudflared-ssh.yml.template` | Ingress block to merge into existing tunnel config: `ssh-ecs.fsagent.cc → ssh://localhost:22` |

Modified

| Path | Change |
| --- | --- |
| `deploy/compose.yml` | `image:` uses `${IMAGE_TAG:-latest}` interpolation |
| `deploy/env.example` | Adds `IMAGE_TAG=latest` line |

Cloudflare Access (one-time manual)

Add hostname ssh-ecs.fsagent.cc to tunnel ingress: ssh://localhost:22.
Create Access application gated to service-token policy only (no IdP, no user logins).
Generate one service token; record CF-Access-Client-Id and CF-Access-Client-Secret.

GitHub Actions secrets (3 new)

| Secret | Source |
| --- | --- |
| `CF_ACCESS_CLIENT_ID` | from CF Access service token |
| `CF_ACCESS_CLIENT_SECRET` | from CF Access service token |
| `ECS_SSH_PRIVATE_KEY` | newly generated deploy-only ed25519 key |

ECS-side one-time prep

cloudflared daemon already installed; add ssh-ecs.fsagent.cc ingress, restart service.
Create unprivileged deploy user with docker group membership.
Append deploy public key to /home/deploy/.ssh/authorized_keys with forced-command restriction:
```
command="/opt/twilight/deploy.sh",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty ssh-ed25519 AAAA...
```
Key can ONLY execute deploy.sh; no shell escape.
Install /opt/twilight/deploy.sh, chmod 755, owned by deploy:deploy.
docker login ghcr.io once with a read-only PAT stored in /home/deploy/.docker/config.json (or DOCKER_CONFIG env on deploy invocation).
compose.yml and .env live under /opt/twilight, owned by deploy.
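
The forced-command entry in the prep steps above can be produced mechanically rather than typed by hand. The helper below is a sketch under stated assumptions: `restricted_key_line` is a hypothetical name, and the key file names in the comments are placeholders, not paths from this design.

```bash
#!/usr/bin/env bash
# Hypothetical helper: wrap a public key in the forced-command restrictions.
#
# Assumed workflow (file names are illustrative):
#   ssh-keygen -t ed25519 -N '' -C deploy -f ./id_deploy
#   restricted_key_line "$(cat ./id_deploy.pub)" >> /home/deploy/.ssh/authorized_keys

restricted_key_line() {
  local pubkey="$1"
  # Options must be one comma-separated field with no spaces outside quotes.
  local opts='command="/opt/twilight/deploy.sh",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty'
  printf '%s %s\n' "$opts" "$pubkey"
}
```

The private half of the key pair would then go into the `ECS_SSH_PRIVATE_KEY` secret and be deleted from the machine that generated it.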

ECS deploy script (reference)

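```bash
#!/usr/bin/env bash
set -euo pipefail

# SSH_ORIGINAL_COMMAND is set by sshd when forced-command is in effect.
# Workflow sends: "deploy sha-abc1234"; only the last word is used.
ORIG_CMD="${SSH_ORIGINAL_COMMAND:-}"
NEW_TAG="${ORIG_CMD##* }"
[[ "$NEW_TAG" =~ ^(sha-[a-f0-9]{7}|latest|main|v[0-9]+\.[0-9]+\.[0-9]+)$ ]] || {
  echo "invalid tag: $NEW_TAG" >&2; exit 2;
}

cd /opt/twilight
ENV_FILE=.env
LOCK=/tmp/twilight-deploy.lock

# Serialize deploys: a concurrent run fails fast instead of queueing.
exec 9>"$LOCK"
flock -n 9 || { echo "deploy already running" >&2; exit 3; }

# Snapshot the current tag; refuse to continue without a rollback target.
PREV_TAG=$(grep -E '^IMAGE_TAG=' "$ENV_FILE" | cut -d= -f2) || true
[[ -n "$PREV_TAG" ]] || { echo "no IMAGE_TAG in $ENV_FILE" >&2; exit 1; }
echo "rollback target: $PREV_TAG"

set_tag() { sed -i "s|^IMAGE_TAG=.*|IMAGE_TAG=$1|" "$ENV_FILE"; }

set_tag "$NEW_TAG"

# A failed pull must not leave .env pointing at an image that never ran.
if ! docker compose pull data; then
  set_tag "$PREV_TAG"
  echo "PULL FAILED -- .env restored to $PREV_TAG" >&2
  exit 1
fi

# Do not abort if up -d fails; let the health check trigger the rollback.
docker compose up -d data || echo "up -d failed; waiting on health check" >&2

# Up to 30 probes, 2s apart.
for i in {1..30}; do
  if curl -fsS --max-time 2 http://localhost:8081/healthz >/dev/null; then
    echo "healthy after ${i} tries"
    docker image prune -f --filter "until=168h" >/dev/null || true
    exit 0
  fi
  sleep 2
done

echo "HEALTH FAILED -- rolling back to $PREV_TAG" >&2
set_tag "$PREV_TAG"
docker compose up -d data
exit 1
```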
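
The probe loop in the script can also be read as a generic retry helper, which makes it easy to exercise without a running container. This is a sketch only: `wait_healthy` is a hypothetical function, not part of the reference script, and the probe command is passed in by the caller.

```bash
#!/usr/bin/env bash
# Hypothetical helper: retry a probe command until it succeeds or attempts run out.

# usage: wait_healthy <attempts> <delay_seconds> <probe command...>
wait_healthy() {
  local attempts="$1" delay="$2"
  shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    # The probe's own output is discarded; only its exit status matters.
    if "$@" >/dev/null 2>&1; then
      echo "healthy after ${i} tries"
      return 0
    fi
    # Skip the sleep after the final failed attempt.
    if ((i < attempts)); then sleep "$delay"; fi
  done
  return 1
}
```

With the values used in this design the call would look like `wait_healthy 30 2 curl -fsS --max-time 2 http://localhost:8081/healthz`, with the rollback branch taken when it returns non-zero.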

Deploy workflow (reference)

```yaml
name: Deploy to ECS

on:
  workflow_run:
    workflows: ["Docker build"]   # must match the name: of docker-build.yml
    types: [completed]
    branches: [main]

jobs:
  deploy:
    if: ${{ github.event.workflow_run.conclusion == 'success' }}
    runs-on: ubuntu-latest
    permissions:
      contents: read
    steps:
      - name: Install cloudflared
        run: |
          curl -fsSL https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64 \
            -o "$RUNNER_TEMP/cloudflared"
          sudo install -m 755 "$RUNNER_TEMP/cloudflared" /usr/local/bin/cloudflared

      - name: SSH deploy
        env:
          # cloudflared access ssh reads the service token from these variables
          # (equivalent to its --service-token-id / --service-token-secret flags).
          TUNNEL_SERVICE_TOKEN_ID: ${{ secrets.CF_ACCESS_CLIENT_ID }}
          TUNNEL_SERVICE_TOKEN_SECRET: ${{ secrets.CF_ACCESS_CLIENT_SECRET }}
          SSH_KEY: ${{ secrets.ECS_SSH_PRIVATE_KEY }}
          SHA: ${{ github.event.workflow_run.head_sha }}
        run: |
          mkdir -p ~/.ssh
          printf '%s\n' "$SSH_KEY" > ~/.ssh/id_deploy
          chmod 600 ~/.ssh/id_deploy
          SHORT="sha-${SHA::7}"
          ssh -o StrictHostKeyChecking=accept-new \
              -o ProxyCommand="cloudflared access ssh --hostname %h" \
              -i ~/.ssh/id_deploy \
              deploy@ssh-ecs.fsagent.cc "deploy $SHORT"
```
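
The workflow derives the image tag from the triggering commit by truncating the full SHA. A minimal sketch of that step, with an added sanity check, is shown here; `remote_command` is a hypothetical helper, and the 40-character check assumes `head_sha` is always a full lowercase hex SHA-1.

```bash
#!/usr/bin/env bash
# Hypothetical helper: turn a full commit SHA into the remote deploy command.

remote_command() {
  local sha="$1"
  # Refuse anything that is not a full 40-character lowercase hex SHA.
  if [[ ! "$sha" =~ ^[a-f0-9]{40}$ ]]; then
    echo "not a full commit SHA: $sha" >&2
    return 1
  fi
  # ${sha::7} keeps the first seven characters, matching the sha-<7> image tag.
  echo "deploy sha-${sha::7}"
}
```

Validating on the runner is belt and braces: the authoritative check remains the regex gate on the ECS side.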

Failure modes

| Failure | Handling |
| --- | --- |
| Invalid tag (injection attempt) | Regex gate in deploy.sh; exit 2 |
| Concurrent deploys | `flock` on `/tmp/twilight-deploy.lock`; exit 3 |
| GHCR pull fails | `set -e` aborts before `up -d`; previous container untouched |
| New container exits at boot | `up -d` returns but healthcheck fails; rollback fires |
| `/healthz` returns 5xx within 60s window | Probe loop exits non-zero; rollback fires |
| Rollback itself fails | exit 1 surfaces as workflow failure; manual recovery |
| Disk fills from old image layers | `image prune --until=168h` on every successful deploy |
| Schema migration breaks boot | Lifespan loads idempotent SQL; raise aborts container; healthcheck fails; rollback |
| Cloudflared tunnel down | Runner cannot SSH; workflow red; no half-deploy |
| Secret leaks in logs | Curls use `-S` not `-v`; no env echo; no `set -x` |
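
The exit codes named in the failure modes above are what an operator sees first in the Actions log, so a small lookup can make a red run easier to triage. This is an illustrative sketch: `explain_exit` is a hypothetical helper and the wording of each message is an assumption, not output from the deploy script.

```bash
#!/usr/bin/env bash
# Hypothetical helper: translate deploy.sh exit codes into a one-line summary.

explain_exit() {
  case "$1" in
    0) echo "deployed: new tag is healthy" ;;
    1) echo "failed: deploy aborted or rolled back" ;;
    2) echo "rejected: tag did not match the allowed pattern" ;;
    3) echo "skipped: another deploy holds the lock" ;;
    # Anything else did not come from the script's explicit exits,
    # for example an ssh or tunnel error on the runner side.
    *) echo "unknown exit code $1" ;;
  esac
}
```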

Security posture

Zero open inbound ports on ECS (Cloudflare tunnel only).
Cloudflare Access service token is revocable from CF dashboard without touching ECS.
SSH key uses forced-command — compromise of the key cannot get a shell, only re-run deploy.
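Deploy user is not root (docker group only). Docker group membership is effectively root-equivalent on the host, so the forced-command restriction is the control that limits what a stolen key can do.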
Workflow permissions: contents: read only; no write tokens.
Tag regex prevents arbitrary command injection via SSH_ORIGINAL_COMMAND.
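
The tag gate can be checked in isolation against a few hostile inputs. The sketch below reuses the regex and the `##* ` expansion from the deploy script; `parse_tag` and `TAG_RE` are hypothetical names introduced for illustration.

```bash
#!/usr/bin/env bash
# Sketch of the tag gate: take the last word of the SSH command and validate it.

TAG_RE='^(sha-[a-f0-9]{7}|latest|main|v[0-9]+\.[0-9]+\.[0-9]+)$'

# Print the validated tag, or return 2 if the last word is anything else.
parse_tag() {
  local original_command="$1"
  local tag="${original_command##* }"   # strip everything up to the last space
  [[ "$tag" =~ $TAG_RE ]] || return 2
  echo "$tag"
}
```

Text before the last space is discarded and never executed, so a command such as `deploy sha-abc1234;reboot` is rejected as a whole because its last word fails the pattern.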

Out of scope (deliberate YAGNI)

Blue/green or canary releases — single host, no traffic split mechanism, not worth it pre-revenue.
Multi-region failover — Vultr already retired, no DR target.
DB migrations beyond idempotent DuckDB SQL — no schema versioning tool needed yet.
Smoke tests beyond /healthz — add when warehouse routes develop known bad-state symptoms.
Aliyun ACR mirror — GHCR pulls work from Chengdu; revisit only if pull latency degrades.
Backfill orchestration — separate concern, ingest scheduler self-gates on trade_cal.

Success criteria

Merge to main triggers Docker build, then deploy, with no human action.
Healthy deploy: workflow green, container running new SHA, /healthz 200.
Broken deploy: workflow red, container running PREVIOUS SHA, /healthz 200, audit trail in Actions logs.
Zero open inbound ports on ECS; the only path in is the outbound-initiated cloudflared tunnel.
Total deploy time (build complete → healthy on prod) under 5 minutes.
