Topic
ECS Auto-Deploy Design
Date: 2026-05-14
Target: Alibaba ECS Chengdu (39.106.170.204)
Image: ghcr.io/lacatfly/twilight-drive
Status: Approved, ready for implementation plan
Goal
Replace manual install.sh-driven deployment with a GitHub Actions pipeline that ships every merged main commit to ECS with health-gated auto-rollback and zero open inbound ports.
Constraints
- ECS sits in Chengdu behind GFW; no overseas SSH ingress is acceptable as a long-term surface.
- ECS can reach `ghcr.io` (verified: 1.6s HTTP response). No Aliyun ACR mirror required at this stage.
- Existing `docker-build.yml` already pushes tags `latest`, `main`, `sha-<7>` on push to `main`.
- Existing `cloudflared` tunnel `twilight-backend` already terminates `api.fsagent.cc` on the box.
- Paid users hit `api.fsagent.cc`; broken deploys must self-heal without operator intervention.
Architecture
```
push main
  │
  ├──► docker-build.yml (existing)
  │      build → push ghcr.io tags: latest, main, sha-<7>
  │
  └──► deploy-ecs.yml (NEW, workflow_run after docker-build success)
         1. install cloudflared on runner
         2. auth Cloudflare Access via service token
         3. ssh deploy@ssh-ecs.fsagent.cc (via cloudflared access ssh)
            → runs /opt/twilight/deploy.sh sha-<7>
               a. flock to serialize deploys
               b. validate tag against regex
               c. snapshot current IMAGE_TAG → PREV
               d. write new IMAGE_TAG to .env
               e. docker compose pull && up -d
               f. poll /healthz for 60s (30 × 2s)
               g. success → prune old images >7d
               h. fail → restore PREV, up -d, exit 1
```

The compose.yml references `image: ghcr.io/lacatfly/twilight-drive:${IMAGE_TAG:-latest}`. Tag pinning via `.env` makes rollback an env-flip plus `up -d`, not a registry hunt.
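The env-flip can be sketched in isolation. The helper below is hypothetical (`swap_image_tag` is not part of the design) and runs against a throwaway file instead of `/opt/twilight/.env`; it only illustrates that deploy and rollback are the same operation with different tags.

```bash
#!/usr/bin/env bash
# Hypothetical helper: set IMAGE_TAG in an env file and print the tag it replaced.
# Prints "latest" when the file had no IMAGE_TAG line (the compose default).
swap_image_tag() {
  local env_file=$1 new_tag=$2 prev tmp
  prev=$(sed -n 's/^IMAGE_TAG=//p' "$env_file" | tail -n 1)
  tmp=$(mktemp)
  # Keep every other setting, then append the new pin.
  grep -v '^IMAGE_TAG=' "$env_file" >"$tmp" || true
  printf 'IMAGE_TAG=%s\n' "$new_tag" >>"$tmp"
  cat "$tmp" >"$env_file"   # overwrite in place so file ownership and mode are kept
  rm -f "$tmp"
  printf '%s\n' "${prev:-latest}"
}

# Demo against a temporary file.
demo_env=$(mktemp)
printf 'IMAGE_TAG=sha-aaaaaaa\n' >"$demo_env"
prev=$(swap_image_tag "$demo_env" sha-bbbbbbb)   # deploy: pin the new tag
swap_image_tag "$demo_env" "$prev" >/dev/null    # rollback: flip back
cat "$demo_env"                                  # prints IMAGE_TAG=sha-aaaaaaa
rm -f "$demo_env"
```

In the real flow each flip would be followed by `docker compose up -d`, which is what actually moves the container to the pinned tag.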
Components
New
| Path | Purpose |
|---|---|
| `.github/workflows/deploy-ecs.yml` | `workflow_run` trigger off Docker build success; SSHes via cloudflared and runs `deploy.sh` |
| `deploy/ecs-deploy.sh` | ECS-side deploy script: lock, snapshot, swap tag, pull, up, healthcheck, rollback |
| `deploy/cloudflared-ssh.yml.template` | Ingress block to merge into existing tunnel config: `ssh-ecs.fsagent.cc` → `ssh://localhost:22` |
Modified
| Path | Change |
|---|---|
| `deploy/compose.yml` | `image:` uses `${IMAGE_TAG:-latest}` interpolation |
| `deploy/env.example` | Adds `IMAGE_TAG=latest` line |
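Compose's `${IMAGE_TAG:-latest}` follows the same rule as shell parameter expansion: the default applies when the variable is unset or empty. A small sketch of that behavior, using a hypothetical `resolve_image` function in place of Compose itself:

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for Compose interpolation of the image line.
resolve_image() {
  local IMAGE_TAG=${1-}   # local copy so the caller's environment is untouched
  printf 'ghcr.io/lacatfly/twilight-drive:%s\n' "${IMAGE_TAG:-latest}"
}

resolve_image sha-abc1234   # pinned tag is used as-is
resolve_image ""            # empty value falls back to latest
resolve_image               # unset value falls back to latest
```

This is why a `.env` without an `IMAGE_TAG` line keeps the current behavior of tracking `latest`.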
Cloudflare Access (one-time manual)
- Add hostname `ssh-ecs.fsagent.cc` to tunnel ingress: `ssh://localhost:22`.
- Create Access application gated to service-token policy only (no IdP, no user logins).
- Generate one service token; record `CF-Access-Client-Id` and `CF-Access-Client-Secret`.
GitHub Actions secrets (3 new)
| Secret | Source |
|---|---|
| `CF_ACCESS_CLIENT_ID` | from CF Access service token |
| `CF_ACCESS_CLIENT_SECRET` | from CF Access service token |
| `ECS_SSH_PRIVATE_KEY` | newly generated deploy-only ed25519 key |
ECS-side one-time prep
- `cloudflared` daemon already installed; add `ssh-ecs.fsagent.cc` ingress, restart service.
- Create unprivileged `deploy` user with docker group membership.
- Append deploy public key to `/home/deploy/.ssh/authorized_keys` with forced-command restriction:
  `command="/opt/twilight/deploy.sh",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty ssh-ed25519 AAAA...`
  Key can ONLY execute `deploy.sh`; no shell escape.
- Install `/opt/twilight/deploy.sh`, chmod 755, owned by `deploy:deploy`.
- `docker login ghcr.io` once with a read-only PAT stored in `/home/deploy/.docker/config.json` (or `DOCKER_CONFIG` env on deploy invocation).
- `compose.yml` and `.env` live under `/opt/twilight`, owned by `deploy`.
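Building the `authorized_keys` entry by hand is error-prone, so a small generator may help. This is a sketch only: `forced_command_entry` is a hypothetical helper, and the key material in the demo is a placeholder, not a real key.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prefix a public key with the forced-command options.
forced_command_entry() {
  local pubkey=$1
  local opts='command="/opt/twilight/deploy.sh",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty'
  # Accept only a single-line ed25519 public key with an optional comment.
  local re='^ssh-ed25519 [A-Za-z0-9+/=]+( [^[:space:]]+)?$'
  if [[ ! $pubkey =~ $re ]]; then
    echo "not an ed25519 public key" >&2
    return 1
  fi
  printf '%s %s\n' "$opts" "$pubkey"
}

# Placeholder key for illustration; a real one would come from the generated .pub file.
forced_command_entry "ssh-ed25519 AAAAC3placeholder deploy@ci"
```

On the ECS box the output would be appended to `/home/deploy/.ssh/authorized_keys`; rejecting non-ed25519 or multi-line input keeps a stray paste from adding an unrestricted key.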
ECS deploy script (reference)
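```bash
#!/usr/bin/env bash
set -euo pipefail

# SSH_ORIGINAL_COMMAND is set by sshd when forced-command is in effect.
# Workflow sends: "deploy sha-abc1234"; the last word is the tag.
NEW_TAG="${SSH_ORIGINAL_COMMAND:-}"
NEW_TAG="${NEW_TAG##* }"
[[ "$NEW_TAG" =~ ^(sha-[a-f0-9]{7}|latest|main|v[0-9]+\.[0-9]+\.[0-9]+)$ ]] || {
  echo "invalid tag: $NEW_TAG" >&2; exit 2;
}

cd /opt/twilight
ENV_FILE=.env
LOCK=/tmp/twilight-deploy.lock

# Hold an exclusive lock on fd 9 for the life of the script.
exec 9>"$LOCK"
flock -n 9 || { echo "deploy already running" >&2; exit 3; }

# Snapshot the current tag; fall back to "latest" if .env has no IMAGE_TAG line.
PREV_TAG=$(sed -n 's/^IMAGE_TAG=//p' "$ENV_FILE" | tail -n 1)
PREV_TAG=${PREV_TAG:-latest}
echo "rollback target: $PREV_TAG"

set_tag() {
  if grep -q '^IMAGE_TAG=' "$ENV_FILE"; then
    sed -i "s|^IMAGE_TAG=.*|IMAGE_TAG=$1|" "$ENV_FILE"
  else
    echo "IMAGE_TAG=$1" >>"$ENV_FILE"
  fi
}

rollback() {
  echo "$1 -- rolling back to $PREV_TAG" >&2
  set_tag "$PREV_TAG"
  docker compose up -d data
  exit 1
}

set_tag "$NEW_TAG"
# Restore .env if pull or start fails, so it never points at a tag that is not running.
docker compose pull data || rollback "PULL FAILED"
docker compose up -d data || rollback "START FAILED"

# 30 attempts, 2s apart; a probe that hangs can add up to 2s more per attempt.
for i in {1..30}; do
  if curl -fsS --max-time 2 http://localhost:8081/healthz >/dev/null; then
    echo "healthy after ${i} tries"
    # -a includes tagged images no container uses; without it only dangling images go.
    docker image prune -af --filter "until=168h" >/dev/null || true
    exit 0
  fi
  sleep 2
done

rollback "HEALTH FAILED"
```
Deploy workflow (reference)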
```yaml
name: Deploy to ECS
on:
  workflow_run:
    workflows: ["Docker build"]
    types: [completed]
    branches: [main]
jobs:
  deploy:
    if: ${{ github.event.workflow_run.conclusion == 'success' }}
    runs-on: ubuntu-latest
    permissions:
      contents: read
    steps:
      - name: Install cloudflared
        run: |
          sudo curl -fsSL https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64 \
            -o /usr/local/bin/cloudflared
          sudo chmod +x /usr/local/bin/cloudflared
      - name: SSH deploy
        env:
          # cloudflared access ssh reads the service token from these two variables
          # (env equivalents of --service-token-id / --service-token-secret).
          TUNNEL_SERVICE_TOKEN_ID: ${{ secrets.CF_ACCESS_CLIENT_ID }}
          TUNNEL_SERVICE_TOKEN_SECRET: ${{ secrets.CF_ACCESS_CLIENT_SECRET }}
          SSH_KEY: ${{ secrets.ECS_SSH_PRIVATE_KEY }}
          SHA: ${{ github.event.workflow_run.head_sha }}
        run: |
          umask 077   # key file is never readable by others, even briefly
          mkdir -p ~/.ssh
          printf '%s\n' "$SSH_KEY" > ~/.ssh/id_deploy
          SHORT="sha-${SHA::7}"
          ssh -o StrictHostKeyChecking=accept-new \
              -o IdentitiesOnly=yes \
              -o ProxyCommand="cloudflared access ssh --hostname %h" \
              -i ~/.ssh/id_deploy \
              deploy@ssh-ecs.fsagent.cc "deploy $SHORT"
```
Failure modes
| Failure | Handling |
|---|---|
| Invalid tag (injection attempt) | Regex gate in deploy.sh; exit 2 |
| Concurrent deploys | flock on /tmp/twilight-deploy.lock; exit 3 |
| GHCR pull fails | Deploy stops before the new tag is started; previous container keeps running |
| New container exits at boot | up -d returns but healthcheck fails; rollback fires |
| `/healthz` returns 5xx within 60s window | Probe loop exits non-zero; rollback fires |
| Rollback itself fails | exit 1 surfaces as workflow failure; manual recovery |
| Disk fills from old image layers | `docker image prune` with an `until=168h` filter on every successful deploy |
| Schema migration breaks boot | Lifespan loads idempotent SQL; raise aborts container; healthcheck fails; rollback |
| Cloudflared tunnel down | Runner cannot SSH; workflow red; no half-deploy |
| Secret leaks in logs | Curls use -S not -v; no env echo; no set -x |
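The health gate that most of these rows depend on can be sketched as a reusable loop. The names `poll_until_ok` and `flaky_probe` are hypothetical; the probe here is a stand-in that fails twice before succeeding, where the real script would run `curl` against `/healthz`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run a probe command up to N times, sleeping between failures.
# Returns 0 on the first success, 1 if every attempt fails.
poll_until_ok() {
  local attempts=$1 delay=$2 i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    if "$@" >/dev/null 2>&1; then
      echo "healthy after ${i} tries"
      return 0
    fi
    sleep "$delay"
  done
  return 1
}

# Stand-in probe simulating a slow boot: fails twice, then succeeds.
# The counter lives in a file, as it would for any external probe command.
probe_state=$(mktemp)
echo 0 >"$probe_state"
flaky_probe() {
  local n
  n=$(($(cat "$probe_state") + 1))
  echo "$n" >"$probe_state"
  ((n >= 3))
}

if poll_until_ok 5 0 flaky_probe; then
  echo "deploy kept"
else
  echo "rollback would fire here"
fi
rm -f "$probe_state"
```

With the reference script's settings the call would look like `poll_until_ok 30 2 curl -fsS --max-time 2 http://localhost:8081/healthz`.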
Security posture
- Zero open inbound ports on ECS (Cloudflare tunnel only).
- Cloudflare Access service token is revocable from CF dashboard without touching ECS.
- SSH key uses forced-command: a compromised key cannot get a shell, only re-run deploy.
- Deploy user is not root. Docker group membership is root-equivalent on the host, so the forced-command restriction is what bounds the key, not the account's privileges.
- Workflow permissions: `contents: read` only; no write tokens.
- Tag regex prevents arbitrary command injection via `SSH_ORIGINAL_COMMAND`.
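To see what the tag gate accepts and rejects, the regex from the reference script can be exercised on its own. The `valid_tag` wrapper and the sample command strings are illustrative assumptions; only the pattern comes from `deploy.sh`.

```bash
#!/usr/bin/env bash
# Same pattern as the tag gate in deploy.sh.
TAG_RE='^(sha-[a-f0-9]{7}|latest|main|v[0-9]+\.[0-9]+\.[0-9]+)$'
valid_tag() { [[ ${1-} =~ $TAG_RE ]]; }

# Illustrative SSH_ORIGINAL_COMMAND values; single quotes keep them literal.
for cmd in 'deploy sha-abc1234' 'deploy v1.2.3' 'deploy sha-abc1234;id' \
           'deploy $(reboot)' 'deploy sha-ABC1234'; do
  tag=${cmd##* }   # last word, as the deploy script extracts it
  if valid_tag "$tag"; then
    echo "accept: $tag"
  else
    echo "reject: $tag"
  fi
done
```

Because the pattern is anchored at both ends and allows no shell metacharacters, anything that reaches `sed` or `docker compose` is one of a small set of known shapes.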
Out of scope (deliberate YAGNI)
- Blue/green or canary releases: single host, no traffic split mechanism, not worth it pre-revenue.
- Multi-region failover: Vultr already retired, no DR target.
- DB migrations beyond idempotent DuckDB SQL: no schema versioning tool needed yet.
- Smoke tests beyond `/healthz`: add when warehouse routes develop known bad-state symptoms.
- Aliyun ACR mirror: GHCR pulls work from Chengdu; revisit only if pull latency degrades.
- Backfill orchestration: separate concern, ingest scheduler self-gates on `trade_cal`.
Success criteria
- Merge to `main` triggers Docker build, then deploy, with no human action.
- Healthy deploy: workflow green, container running new SHA, `/healthz` 200.
- Broken deploy: workflow red, container running PREVIOUS SHA, `/healthz` 200, audit trail in Actions logs.
- Zero open inbound ports on ECS; the only path in is the outbound-initiated tunnel.
- Total deploy time (build complete → healthy on prod) under 5 minutes.
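The 5-minute budget has to absorb the health window, so it helps to bound that window. A rough calculation, assuming the reference script's loop (30 attempts, 2s sleep, `--max-time 2` on each probe); `health_window` is a hypothetical helper:

```bash
#!/usr/bin/env bash
# Worst-case seconds spent in the probe loop: each failed attempt can cost
# up to the probe timeout plus the sleep that follows it.
health_window() {
  local attempts=$1 sleep_s=$2 timeout_s=$3
  echo $((attempts * (sleep_s + timeout_s)))
}

health_window 30 2 0   # probes fail instantly (connection refused): 60
health_window 30 2 2   # every probe hangs until --max-time: 120
```

A failed deploy therefore spends between one and two minutes in the health gate before rollback starts, which leaves the rest of the budget for the pull and the two `up -d` calls.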