ECS Auto-Deploy Design

Date: 2026-05-14
Target: Alibaba ECS Chengdu (39.106.170.204)
Image: ghcr.io/lacatfly/twilight-drive
Status: Approved, ready for implementation plan

Goal

Replace manual install.sh-driven deployment with a GitHub Actions pipeline that ships every merged main commit to ECS with health-gated auto-rollback and zero open inbound ports.

Constraints

  • ECS sits in Chengdu behind GFW; no overseas SSH ingress is acceptable as a long-term surface.
  • ECS can reach ghcr.io (verified: 1.6s HTTP response). No Aliyun ACR mirror required at this stage.
  • Existing docker-build.yml already pushes tags latest, main, sha-<7> on push to main.
  • Existing cloudflared tunnel twilight-backend already terminates api.fsagent.cc on the box.
  • Paid users hit api.fsagent.cc; broken deploys must self-heal without operator intervention.

Architecture

push main

  ├──► docker-build.yml (existing)
  │       build → push ghcr.io tags: latest, main, sha-<7>

  └──► deploy-ecs.yml (NEW, workflow_run after docker-build success)
         1. install cloudflared on runner
         2. auth Cloudflare Access via service token
         3. ssh deploy@ssh-ecs.fsagent.cc (via cloudflared access ssh)
            → runs /opt/twilight/deploy.sh sha-<7>
                a. flock to serialize deploys
                b. validate tag against regex
                c. snapshot current IMAGE_TAG → PREV
                d. write new IMAGE_TAG to .env
                e. docker compose pull && up -d
                f. poll /healthz for 60s (30 × 2s)
                g. success → prune old images >7d
                h. fail → restore PREV, up -d, exit 1

The compose.yml references image: ghcr.io/lacatfly/twilight-drive:${IMAGE_TAG:-latest}. Tag pinning via .env makes rollback an env-flip plus up -d, not a registry hunt.
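
The fallback in that image reference follows the same rule as shell parameter expansion, so it can be illustrated in bash. This is a minimal sketch, not part of the deploy tooling: image_ref is a hypothetical helper, and its argument stands in for the IMAGE_TAG value that Compose would read from .env.

```bash
#!/usr/bin/env bash
# Mimic how ${IMAGE_TAG:-latest} resolves: use the value when it is set and
# non-empty, otherwise fall back to "latest".
image_ref() {
  local IMAGE_TAG="${1:-}"   # hypothetical argument standing in for .env
  echo "ghcr.io/lacatfly/twilight-drive:${IMAGE_TAG:-latest}"
}

image_ref sha-abc1234   # pinned deploy
image_ref ""            # IMAGE_TAG empty or missing
```

An empty or missing IMAGE_TAG therefore deploys whatever latest points at, which is why the deploy script always writes an explicit tag.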

Components

New

Path                                  Purpose
.github/workflows/deploy-ecs.yml      workflow_run trigger off Docker build success; SSHes via cloudflared and runs deploy.sh
deploy/ecs-deploy.sh                  ECS-side deploy script (installed on the host as /opt/twilight/deploy.sh): lock, snapshot, swap tag, pull, up, healthcheck, rollback
deploy/cloudflared-ssh.yml.template   Ingress block to merge into existing tunnel config: ssh-ecs.fsagent.cc → ssh://localhost:22

Modified

Path                 Change
deploy/compose.yml   image: uses ${IMAGE_TAG:-latest} interpolation
deploy/env.example   Adds IMAGE_TAG=latest line
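
Reading and flipping the tag pin in .env comes down to two small operations. The sketch below is illustrative only: get_image_tag and set_image_tag are hypothetical helper names, and it assumes the env file holds one KEY=value pair per line and ends with a newline.

```bash
#!/usr/bin/env bash
# Print the currently pinned tag (empty output if IMAGE_TAG is absent).
get_image_tag() {
  sed -n 's/^IMAGE_TAG=//p' "$1" | head -n 1
}

# Pin a new tag: rewrite the IMAGE_TAG line, or append one if it is missing.
set_image_tag() {
  local file=$1 tag=$2 tmp
  tmp=$(mktemp)
  if grep -q '^IMAGE_TAG=' "$file"; then
    sed "s|^IMAGE_TAG=.*|IMAGE_TAG=${tag}|" "$file" >"$tmp"
  else
    cat "$file" >"$tmp"
    echo "IMAGE_TAG=${tag}" >>"$tmp"
  fi
  # Write back through cat so the file keeps its owner and mode.
  cat "$tmp" >"$file"
  rm -f "$tmp"
}
```

A rollback is then set_image_tag .env "$PREV_TAG" followed by docker compose up -d, with no registry lookup involved.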

Cloudflare Access (one-time manual)

  • Add hostname ssh-ecs.fsagent.cc to tunnel ingress: ssh://localhost:22.
  • Create Access application gated to service-token policy only (no IdP, no user logins).
  • Generate one service token; record CF-Access-Client-Id and CF-Access-Client-Secret.
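
The ingress rule from the first step is short enough to generate rather than hand-edit. A minimal sketch, assuming the existing tunnel config already has an ingress: list; render_ssh_ingress is a hypothetical helper, and the indentation may need adjusting to match that file.

```bash
#!/usr/bin/env bash
# Print a cloudflared ingress rule that maps a hostname to the local sshd.
# Usage: render_ssh_ingress <hostname> [port]
render_ssh_ingress() {
  local hostname=$1 port=${2:-22}
  printf '  - hostname: %s\n    service: ssh://localhost:%s\n' "$hostname" "$port"
}

render_ssh_ingress ssh-ecs.fsagent.cc
```

The rule has to sit above the catch-all entry that ends the ingress list, because cloudflared matches rules top to bottom.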

GitHub Actions secrets (3 new)

Secret                    Source
CF_ACCESS_CLIENT_ID       from CF Access service token
CF_ACCESS_CLIENT_SECRET   from CF Access service token
ECS_SSH_PRIVATE_KEY       newly generated deploy-only ed25519 key

ECS-side one-time prep

  • cloudflared daemon already installed; add ssh-ecs.fsagent.cc ingress, restart service.
  • Create unprivileged deploy user with docker group membership.
  • Append deploy public key to /home/deploy/.ssh/authorized_keys with forced-command restriction:
    command="/opt/twilight/deploy.sh",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty ssh-ed25519 AAAA...
    Key can ONLY execute deploy.sh; no shell escape.
  • Install /opt/twilight/deploy.sh, chmod 755, owned by deploy:deploy.
  • docker login ghcr.io once with a read-only PAT stored in /home/deploy/.docker/config.json (or DOCKER_CONFIG env on deploy invocation).
  • compose.yml and .env live under /opt/twilight, owned by deploy.
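
The forced-command entry can be assembled from the public key instead of being typed by hand. This is a sketch under assumptions: authorized_keys_line is a hypothetical helper, and the key file names in the comments are placeholders.

```bash
#!/usr/bin/env bash
# Build the authorized_keys entry that pins a public key to deploy.sh.
# Usage: authorized_keys_line "<contents of the .pub file>"
authorized_keys_line() {
  local pubkey=$1
  printf 'command="/opt/twilight/deploy.sh",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty %s\n' "$pubkey"
}

# Typical use (file names are placeholders):
#   ssh-keygen -t ed25519 -N "" -C twilight-deploy -f deploy_key
#   authorized_keys_line "$(cat deploy_key.pub)" >> /home/deploy/.ssh/authorized_keys
```

With this entry in place, sshd ignores whatever command the client requests and runs deploy.sh instead, exposing the requested string only through SSH_ORIGINAL_COMMAND.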

ECS deploy script (reference)

bash
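#!/usr/bin/env bash
set -euo pipefail

# SSH_ORIGINAL_COMMAND is set by sshd when forced-command is in effect.
# Workflow sends: "deploy sha-abc1234". Default to empty so that a manual run
# fails with "invalid tag" instead of an unbound-variable error under set -u.
REQUEST="${SSH_ORIGINAL_COMMAND:-}"
NEW_TAG="${REQUEST##* }"   # keep only the last space-separated word
TAG_RE='^(sha-[a-f0-9]{7}|latest|main|v[0-9]+\.[0-9]+\.[0-9]+)$'
[[ "$NEW_TAG" =~ $TAG_RE ]] || {
  echo "invalid tag: $NEW_TAG" >&2; exit 2;
}

cd /opt/twilight
ENV_FILE=.env
LOCK=/tmp/twilight-deploy.lock

# Serialize deploys: fd 9 holds the lock until the script exits.
exec 9>"$LOCK"
flock -n 9 || { echo "deploy already running" >&2; exit 3; }

set_tag() {
  # Rewrite the IMAGE_TAG line, or append it if .env does not have one yet.
  if grep -qE '^IMAGE_TAG=' "$ENV_FILE"; then
    sed -i "s|^IMAGE_TAG=.*|IMAGE_TAG=$1|" "$ENV_FILE"
  else
    echo "IMAGE_TAG=$1" >>"$ENV_FILE"
  fi
}

# "|| true" stops pipefail from aborting when IMAGE_TAG is missing;
# fall back to the compose.yml default of "latest".
PREV_TAG=$(grep -E '^IMAGE_TAG=' "$ENV_FILE" | head -n 1 | cut -d= -f2 || true)
PREV_TAG="${PREV_TAG:-latest}"
echo "rollback target: $PREV_TAG"

set_tag "$NEW_TAG"

# A failed pull leaves the running container untouched; restore .env so it
# does not stay pinned to an image that was never pulled.
if ! docker compose pull data; then
  echo "PULL FAILED -- keeping $PREV_TAG" >&2
  set_tag "$PREV_TAG"
  exit 1
fi

# Do not let set -e abort here: a failed "up" must still reach the rollback.
docker compose up -d data || echo "up -d failed; relying on healthcheck" >&2

# 30 probes, 2 s apart (about 60 s; longer if probes hang until --max-time).
for i in {1..30}; do
  if curl -fsS --max-time 2 http://localhost:8081/healthz >/dev/null; then
    echo "healthy after ${i} tries"
    # -a also removes unused tagged images; without it only dangling ones go.
    docker image prune -af --filter "until=168h" >/dev/null || true
    exit 0
  fi
  sleep 2
done

echo "HEALTH FAILED -- rolling back to $PREV_TAG" >&2
set_tag "$PREV_TAG"
docker compose up -d data
exit 1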
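
The tag gate at the top of the script can be exercised on its own. The sketch below isolates the two steps, taking the last word of the requested command and matching it against the allow-list regex. parse_tag and valid_tag are hypothetical names introduced here for illustration.

```bash
#!/usr/bin/env bash
# Keep only the last space-separated word of the requested command.
parse_tag() {
  local request=$1
  echo "${request##* }"
}

# Accept sha-<7 hex>, latest, main, or a vX.Y.Z release tag; nothing else.
valid_tag() {
  local re='^(sha-[a-f0-9]{7}|latest|main|v[0-9]+\.[0-9]+\.[0-9]+)$'
  [[ "$1" =~ $re ]]
}

for request in "deploy sha-abc1234" "deploy v1.2.3" "deploy sha-abc1234; rm -rf /"; do
  tag=$(parse_tag "$request")
  if valid_tag "$tag"; then
    echo "accept: $tag"
  else
    echo "reject: $tag"
  fi
done
```

Because the regex is anchored at both ends and allows no shell metacharacters, anything that reaches the sed and docker compose calls is a plain tag string.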

Deploy workflow (reference)

yaml
name: Deploy to ECS

on:
  workflow_run:
    workflows: ["Docker build"]
    types: [completed]
    branches: [main]

jobs:
  deploy:
    if: ${{ github.event.workflow_run.conclusion == 'success' }}
    runs-on: ubuntu-latest
    permissions:
      contents: read
    steps:
      - name: Install cloudflared
        run: |
          # sudo avoids depending on /usr/local/bin being writable by the runner user.
          sudo curl -fsSL https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64 \
            -o /usr/local/bin/cloudflared
          sudo chmod +x /usr/local/bin/cloudflared

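      - name: SSH deploy
        env:
          # cloudflared access ssh reads the Access service token from these two
          # variables (equivalent to --service-token-id / --service-token-secret).
          TUNNEL_SERVICE_TOKEN_ID: ${{ secrets.CF_ACCESS_CLIENT_ID }}
          TUNNEL_SERVICE_TOKEN_SECRET: ${{ secrets.CF_ACCESS_CLIENT_SECRET }}
          SSH_KEY: ${{ secrets.ECS_SSH_PRIVATE_KEY }}
          SHA: ${{ github.event.workflow_run.head_sha }}
        run: |
          mkdir -p ~/.ssh
          chmod 700 ~/.ssh
          # Create the key file with owner-only permissions from the start.
          (umask 077; printf '%s\n' "$SSH_KEY" > ~/.ssh/id_deploy)
          SHORT="sha-${SHA::7}"
          ssh -o StrictHostKeyChecking=accept-new \
              -o IdentitiesOnly=yes \
              -o ProxyCommand="cloudflared access ssh --hostname %h" \
              -i ~/.ssh/id_deploy \
              deploy@ssh-ecs.fsagent.cc "deploy $SHORT"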
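
The workflow derives the image tag from the commit SHA with a bash substring expansion. A minimal sketch of that step with an added sanity check; short_tag is a hypothetical helper, and the check that the input is a full 40-character SHA is an assumption not present in the workflow.

```bash
#!/usr/bin/env bash
# Turn a full commit SHA into the sha-<7> tag that docker-build.yml pushes.
short_tag() {
  local sha=$1
  [[ "$sha" =~ ^[a-f0-9]{40}$ ]] || {
    echo "not a full commit SHA: $sha" >&2
    return 1
  }
  echo "sha-${sha::7}"   # first seven characters, same as ${SHA::7} above
}

short_tag 0123456789abcdef0123456789abcdef01234567
```

The result always matches the sha-[a-f0-9]{7} branch of the tag regex in deploy.sh, so a well-formed SHA cannot be rejected by the ECS-side gate.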

Failure modes

Failure                                  Handling
Invalid tag (injection attempt)          Regex gate in deploy.sh; exit 2
Concurrent deploys                       flock on /tmp/twilight-deploy.lock; exit 3
GHCR pull fails                          deploy.sh aborts before up -d, so the running container is untouched; .env must be reset to the previous tag before exiting
New container exits at boot              up -d returns but healthcheck fails; rollback fires
/healthz returns 5xx within 60s window   Probe loop exits non-zero; rollback fires
Rollback itself fails                    exit 1 surfaces as workflow failure; manual recovery
Disk fills from old image layers         docker image prune with --filter until=168h on every successful deploy (needs -a to remove superseded tagged images, not just dangling ones)
Schema migration breaks boot             Lifespan loads idempotent SQL; raise aborts container; healthcheck fails; rollback
Cloudflared tunnel down                  Runner cannot SSH; workflow red; no half-deploy
Secret leaks in logs                     Curls use -S not -v; no env echo; no set -x
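
Several of the rows above hinge on the health probe loop, which can be factored into a reusable function and tested without a running container. A minimal sketch: poll_health is a hypothetical name, and the probe is passed in as a command so that a stub can replace curl.

```bash
#!/usr/bin/env bash
# Run a probe command until it succeeds or the attempts run out.
# Usage: poll_health <attempts> <interval_seconds> <probe command...>
poll_health() {
  local attempts=$1 interval=$2
  shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      echo "healthy after ${i} tries"
      return 0
    fi
    sleep "$interval"
  done
  return 1   # caller decides what to do, e.g. roll back
}

# In deploy.sh this would be called roughly as:
#   poll_health 30 2 curl -fsS --max-time 2 http://localhost:8081/healthz
```

Returning a status instead of exiting keeps the rollback decision in one place in the caller.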

Security posture

  • Zero open inbound ports on ECS (Cloudflare tunnel only).
  • Cloudflare Access service token is revocable from CF dashboard without touching ECS.
  • SSH key uses forced-command — compromise of the key cannot get a shell, only re-run deploy.
  • Deploy user is not root, but docker group membership is root-equivalent on the host, so the forced-command restriction is what actually bounds the key.
  • Workflow permissions: contents: read only; no write tokens.
  • Tag regex prevents arbitrary command injection via SSH_ORIGINAL_COMMAND.

Out of scope (deliberate YAGNI)

  • Blue/green or canary releases — single host, no traffic split mechanism, not worth it pre-revenue.
  • Multi-region failover — Vultr already retired, no DR target.
  • DB migrations beyond idempotent DuckDB SQL — no schema versioning tool needed yet.
  • Smoke tests beyond /healthz — add when warehouse routes develop known bad-state symptoms.
  • Aliyun ACR mirror — GHCR pulls work from Chengdu; revisit only if pull latency degrades.
  • Backfill orchestration — separate concern, ingest scheduler self-gates on trade_cal.

Success criteria

  1. Merge to main triggers Docker build, then deploy, with no human action.
  2. Healthy deploy: workflow green, container running new SHA, /healthz 200.
  3. Broken deploy: workflow red, container running PREVIOUS SHA, /healthz 200, audit trail in Actions logs.
  4. Zero open inbound ports on ECS; the only path in is the outbound cloudflared tunnel.
  5. Total deploy time (build complete → healthy on prod) under 5 minutes.

Internal team document