主题
ECS Auto-Deploy Implementation Plan
For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (
- [ ]) syntax for tracking.
Goal: Auto-deploy every push to main from GitHub Actions to Alibaba ECS Chengdu via Cloudflare Tunnel SSH, with health-gated rollback and zero open inbound ports.
Architecture: docker-build.yml (existing) pushes image to GHCR with sha-<7> tag. A new deploy-ecs.yml (workflow_run trigger) installs cloudflared, SSHes through Cloudflare Access to a non-root deploy user on ECS, and invokes a forced-command script that swaps TWILIGHT_VERSION in ~/twilight/.env, runs docker compose up -d, polls /healthz, and rolls back on failure.
Tech Stack: GitHub Actions, Bash, Docker Compose, Cloudflare Tunnel + Access, cloudflared CLI, flock, curl. Bash unit tests use plain shell with PATH mocking (no bats dependency).
Reconciliation with spec: The design doc used IMAGE_TAG / /opt/twilight. The existing repo already uses TWILIGHT_VERSION / ~/twilight/.env (see deploy/compose.yml:20, deploy/env.example, deploy/alibaba-ecs-install.sh). This plan follows the existing convention; functional behavior is identical.
File Structure
Create:
deploy/ecs-deploy.sh— ECS-side script: lock, validate tag, snapshot, swap, pull, up, healthcheck, rollbacktests/deploy/test_ecs_deploy.sh— bash test harness with PATH-mocked docker/curltests/deploy/mocks/docker— mock docker binary used by teststests/deploy/mocks/curl— mock curl binary used by tests.github/workflows/deploy-ecs.yml— workflow_run trigger, SSH deploy via cloudflareddeploy/cloudflared-ssh-ingress.yml.snippet— ingress block to merge into existing tunnel configdocs/deploy/ecs-auto-deploy-runbook.md— one-time CF Access + GH secrets + ECS prep
Modify:
deploy/alibaba-ecs-install.sh— provisiondeployuser, installecs-deploy.shto/usr/local/bin/twilight-deploy, append forced-command authorized_keys entrydeploy/alibaba-ecs-deploy.md— link to runbook, document version pinning viaTWILIGHT_VERSION
No changes needed:
deploy/compose.yml— already uses${TWILIGHT_VERSION:-latest}at line 20deploy/env.example— already hasTWILIGHT_VERSION=latest
Task 1: Add bash test harness
Files:
Create:
tests/deploy/test_ecs_deploy.shCreate:
tests/deploy/mocks/dockerCreate:
tests/deploy/mocks/curl[ ] Step 1: Create mocks directory and mock docker
Write tests/deploy/mocks/docker:
bash
#!/usr/bin/env bash
# Mock docker for ecs-deploy.sh tests.
# Behavior controlled by env vars:
# MOCK_DOCKER_PULL_FAIL=1 -> `docker compose pull` exits 1
# MOCK_DOCKER_LOG=/tmp/x -> append each invocation's args to this file
log="${MOCK_DOCKER_LOG:-/dev/null}"
echo "docker $*" >> "$log"
case "$*" in
"compose pull "*)
[[ "${MOCK_DOCKER_PULL_FAIL:-0}" == "1" ]] && exit 1
exit 0
;;
"compose up -d "*) exit 0 ;;
"image prune"*) exit 0 ;;
*) exit 0 ;;
esacchmod +x tests/deploy/mocks/docker.
- [ ] Step 2: Create mock curl
Write tests/deploy/mocks/curl:
bash
#!/usr/bin/env bash
# Mock curl for ecs-deploy.sh tests.
# MOCK_CURL_HEALTHY_AFTER=N -> the Nth call onward returns 0 (default 1 = first)
# MOCK_CURL_NEVER_HEALTHY=1 -> always returns 22 (HTTP error)
state="${MOCK_CURL_STATE:-/tmp/mock-curl-state}"
count=$(( $(cat "$state" 2>/dev/null || echo 0) + 1 ))
echo "$count" > "$state"
if [[ "${MOCK_CURL_NEVER_HEALTHY:-0}" == "1" ]]; then
exit 22
fi
threshold="${MOCK_CURL_HEALTHY_AFTER:-1}"
(( count >= threshold )) && exit 0 || exit 22chmod +x tests/deploy/mocks/curl.
- [ ] Step 3: Create test harness skeleton
Write tests/deploy/test_ecs_deploy.sh:
bash
#!/usr/bin/env bash
# Plain-shell test harness for deploy/ecs-deploy.sh.
# Each test runs in an isolated tmpdir with mocked PATH.
set -u
REPO_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)"
SCRIPT="$REPO_ROOT/deploy/ecs-deploy.sh"
MOCKS_DIR="$REPO_ROOT/tests/deploy/mocks"
pass=0; fail=0
log() { printf '%s\n' "$*"; }
assert_eq() {
local got="$1" want="$2" msg="${3:-}"
if [[ "$got" == "$want" ]]; then ((pass++)); log " PASS ${msg}"; else
((fail++)); log " FAIL ${msg}"; log " want: $want"; log " got: $got";
fi
}
assert_contains() {
local hay="$1" needle="$2" msg="${3:-}"
if [[ "$hay" == *"$needle"* ]]; then ((pass++)); log " PASS ${msg}"; else
((fail++)); log " FAIL ${msg}"; log " missing: $needle"; log " in: $hay";
fi
}
setup() {
TMP="$(mktemp -d)"
cd "$TMP"
mkdir -p home/twilight/source/deploy
cp "$REPO_ROOT/deploy/compose.yml" home/twilight/source/deploy/
printf 'TWILIGHT_VERSION=sha-0000000\n' > home/twilight/.env
export TWILIGHT_HOME="$TMP/home/twilight"
export MOCK_DOCKER_LOG="$TMP/docker.log"
export MOCK_CURL_STATE="$TMP/curl.count"
: > "$MOCK_DOCKER_LOG"
: > "$MOCK_CURL_STATE"
export PATH="$MOCKS_DIR:$PATH"
}
teardown() { cd /; rm -rf "$TMP"; unset TWILIGHT_HOME MOCK_DOCKER_LOG MOCK_DOCKER_PULL_FAIL MOCK_CURL_HEALTHY_AFTER MOCK_CURL_NEVER_HEALTHY MOCK_CURL_STATE; }
# Tests appended in later tasks.
trap 'teardown 2>/dev/null || true' EXIT
log "==> running tests"
# (test invocations added in later tasks)
log "==> ${pass} passed, ${fail} failed"
exit $(( fail > 0 ? 1 : 0 ))chmod +x tests/deploy/test_ecs_deploy.sh.
- [ ] Step 4: Run the empty harness
bash
bash tests/deploy/test_ecs_deploy.shExpected: ==> 0 passed, 0 failed, exit 0.
- [ ] Step 5: Commit
bash
git add tests/deploy/
git commit -m "test(deploy): bash test harness scaffold for ecs-deploy.sh"Task 2: Tag validation (TDD)
Files:
Modify:
tests/deploy/test_ecs_deploy.shCreate:
deploy/ecs-deploy.sh[ ] Step 1: Append failing test for invalid tag
Insert immediately before the log "==> ${pass} passed" line in tests/deploy/test_ecs_deploy.sh:
bash
test_invalid_tag_rejected() {
setup
out=$(SSH_ORIGINAL_COMMAND="deploy not-a-tag" bash "$SCRIPT" 2>&1); rc=$?
assert_eq "$rc" "2" "invalid tag exits 2"
assert_contains "$out" "invalid tag" "invalid tag error message"
teardown
}
test_invalid_tag_rejected- [ ] Step 2: Run and verify failure
bash
bash tests/deploy/test_ecs_deploy.shExpected: 2 FAILs (script does not yet exist). Last line: ==> 0 passed, 2 failed, exit 1.
- [ ] Step 3: Create minimal ecs-deploy.sh
Write deploy/ecs-deploy.sh:
bash
#!/usr/bin/env bash
# Twilight Drive ECS-side deploy script.
# Invoked via SSH forced-command. SSH_ORIGINAL_COMMAND format: "deploy <tag>".
# Tag must match: sha-<7-hex> | latest | main | v<X>.<Y>.<Z>
set -euo pipefail
TWILIGHT_HOME="${TWILIGHT_HOME:-/home/deploy/twilight}"
ENV_FILE="$TWILIGHT_HOME/.env"
COMPOSE_DIR="$TWILIGHT_HOME/source/deploy"
LOCK_FILE="${TWILIGHT_LOCK:-/tmp/twilight-deploy.lock}"
HEALTH_URL="${TWILIGHT_HEALTH_URL:-http://localhost:8081/healthz}"
HEALTH_TRIES="${TWILIGHT_HEALTH_TRIES:-30}"
HEALTH_SLEEP="${TWILIGHT_HEALTH_SLEEP:-2}"
cmd="${SSH_ORIGINAL_COMMAND:-}"
NEW_TAG="${cmd##* }"
if ! [[ "$NEW_TAG" =~ ^(sha-[a-f0-9]{7}|latest|main|v[0-9]+\.[0-9]+\.[0-9]+)$ ]]; then
echo "invalid tag: $NEW_TAG" >&2
exit 2
fi
echo "tag accepted: $NEW_TAG"chmod +x deploy/ecs-deploy.sh.
- [ ] Step 4: Run test, verify pass
bash
bash tests/deploy/test_ecs_deploy.shExpected: ==> 2 passed, 0 failed, exit 0.
- [ ] Step 5: Add tests for valid tag shapes
Append before the summary line:
bash
test_valid_tags_accepted() {
for t in "sha-abc1234" "latest" "main" "v1.2.3"; do
setup
SSH_ORIGINAL_COMMAND="deploy $t" bash "$SCRIPT" >/dev/null 2>&1
rc=$?
assert_eq "$rc" "0" "valid tag '$t' accepted"
teardown
done
}
test_valid_tags_accepted(Note: this will fail at later steps once flock and rollback are added — those need full setup. Keep this test now; later tasks make it more thorough.)
- [ ] Step 6: Run, verify pass
bash
bash tests/deploy/test_ecs_deploy.shExpected: ==> 6 passed, 0 failed, exit 0.
- [ ] Step 7: Commit
bash
git add deploy/ecs-deploy.sh tests/deploy/test_ecs_deploy.sh
git commit -m "feat(deploy): tag validation for ecs-deploy.sh"Task 3: Concurrency lock (TDD)
Files:
Modify:
tests/deploy/test_ecs_deploy.shModify:
deploy/ecs-deploy.sh[ ] Step 1: Add failing test for concurrent deploy
Append before summary line:
bash
test_concurrent_deploy_rejected() {
setup
# Hold the lock from another process, then try to deploy.
( flock -x 9; sleep 2 ) 9>"$TMP/lock" &
holder=$!
sleep 0.2
out=$(TWILIGHT_LOCK="$TMP/lock" SSH_ORIGINAL_COMMAND="deploy sha-abc1234" \
bash "$SCRIPT" 2>&1); rc=$?
wait $holder 2>/dev/null
assert_eq "$rc" "3" "second deploy exits 3"
assert_contains "$out" "already running" "lock-held message"
teardown
}
test_concurrent_deploy_rejected- [ ] Step 2: Run, expect failure
bash
bash tests/deploy/test_ecs_deploy.shExpected: lock test FAILs (script does not yet attempt to acquire lock).
- [ ] Step 3: Add flock to ecs-deploy.sh
Edit deploy/ecs-deploy.sh. After the tag-validation block and before the final echo, insert:
bash
exec 9>"$LOCK_FILE"
if ! flock -n 9; then
echo "deploy already running" >&2
exit 3
fiRemove the temporary echo "tag accepted: $NEW_TAG" (no longer needed).
- [ ] Step 4: Run, verify pass
bash
bash tests/deploy/test_ecs_deploy.shExpected: all tests pass.
- [ ] Step 5: Commit
bash
git add deploy/ecs-deploy.sh tests/deploy/test_ecs_deploy.sh
git commit -m "feat(deploy): flock-based concurrency guard"Task 4: Snapshot + tag swap (TDD)
Files:
Modify:
tests/deploy/test_ecs_deploy.shModify:
deploy/ecs-deploy.sh[ ] Step 1: Add failing test
Append before summary line:
bash
test_env_tag_swapped() {
setup
SSH_ORIGINAL_COMMAND="deploy sha-abc1234" \
bash "$SCRIPT" >/dev/null 2>&1
rc=$?
got=$(grep '^TWILIGHT_VERSION=' "$TWILIGHT_HOME/.env" | cut -d= -f2)
assert_eq "$rc" "0" "deploy succeeds with healthy mock"
assert_eq "$got" "sha-abc1234" ".env TWILIGHT_VERSION updated"
}
test_env_tag_swapped(No teardown after — we leave $TMP around so a follow-up assert could inspect; setup will rebuild next run.)
- [ ] Step 2: Run, expect failure
Expected: rc may be 0 but .env unchanged, so the second assert FAILs.
- [ ] Step 3: Implement snapshot + swap
Edit deploy/ecs-deploy.sh. After the flock block, append:
bash
if [[ ! -f "$ENV_FILE" ]]; then
echo "missing env file: $ENV_FILE" >&2
exit 4
fi
PREV_TAG=$(grep -E '^TWILIGHT_VERSION=' "$ENV_FILE" | cut -d= -f2)
if [[ -z "$PREV_TAG" ]]; then
echo "missing TWILIGHT_VERSION in $ENV_FILE" >&2
exit 4
fi
echo "rollback target: $PREV_TAG"
echo "deploying: $NEW_TAG"
# Portable sed -i (GNU vs BSD).
if sed --version >/dev/null 2>&1; then
sed -i "s|^TWILIGHT_VERSION=.*|TWILIGHT_VERSION=$NEW_TAG|" "$ENV_FILE"
else
sed -i '' "s|^TWILIGHT_VERSION=.*|TWILIGHT_VERSION=$NEW_TAG|" "$ENV_FILE"
fi- [ ] Step 4: Run, verify pass
bash
bash tests/deploy/test_ecs_deploy.shExpected: all tests pass.
- [ ] Step 5: Commit
bash
git add deploy/ecs-deploy.sh tests/deploy/test_ecs_deploy.sh
git commit -m "feat(deploy): snapshot rollback tag and swap TWILIGHT_VERSION"Task 5: Pull + up via docker compose (TDD)
Files:
Modify:
tests/deploy/test_ecs_deploy.shModify:
deploy/ecs-deploy.sh[ ] Step 1: Add test for docker calls
Append before summary line:
bash
test_docker_compose_invoked() {
setup
SSH_ORIGINAL_COMMAND="deploy sha-abc1234" \
bash "$SCRIPT" >/dev/null 2>&1
log_content=$(cat "$MOCK_DOCKER_LOG")
assert_contains "$log_content" "compose pull data" "docker compose pull called"
assert_contains "$log_content" "compose up -d data" "docker compose up called"
teardown
}
test_docker_compose_invoked- [ ] Step 2: Run, expect failure
Expected: docker.log is empty.
- [ ] Step 3: Add docker compose calls
Edit deploy/ecs-deploy.sh, after the sed block append:
bash
cd "$COMPOSE_DIR"
docker compose pull data
docker compose up -d data- [ ] Step 4: Run, verify pass
bash
bash tests/deploy/test_ecs_deploy.shExpected: all tests pass.
- [ ] Step 5: Add test for pull failure aborts before up
Append before summary line:
bash
test_pull_failure_aborts() {
setup
MOCK_DOCKER_PULL_FAIL=1 SSH_ORIGINAL_COMMAND="deploy sha-abc1234" \
bash "$SCRIPT" >/dev/null 2>&1
rc=$?
log_content=$(cat "$MOCK_DOCKER_LOG")
assert_eq "$rc" "1" "pull failure exits non-zero"
assert_contains "$log_content" "compose pull data" "pull attempted"
if [[ "$log_content" == *"compose up -d data"* ]]; then
((fail++)); log " FAIL up should NOT have been called after pull fail"
else
((pass++)); log " PASS up not called after pull fail"
fi
teardown
}
test_pull_failure_aborts- [ ] Step 6: Run, verify pass
Expected: all tests pass. set -e already aborts on docker compose pull failure, so no code change needed.
- [ ] Step 7: Commit
bash
git add deploy/ecs-deploy.sh tests/deploy/test_ecs_deploy.sh
git commit -m "feat(deploy): pull + up via docker compose, abort on pull failure"Task 6: Health probe + rollback (TDD)
Files:
Modify:
tests/deploy/test_ecs_deploy.shModify:
deploy/ecs-deploy.sh[ ] Step 1: Add tests for healthy and unhealthy paths
Append before summary line:
bash
test_health_success_no_rollback() {
setup
SSH_ORIGINAL_COMMAND="deploy sha-abc1234" \
TWILIGHT_HEALTH_TRIES=5 TWILIGHT_HEALTH_SLEEP=0 \
bash "$SCRIPT" >/dev/null 2>&1
rc=$?
got=$(grep '^TWILIGHT_VERSION=' "$TWILIGHT_HOME/.env" | cut -d= -f2)
assert_eq "$rc" "0" "healthy deploy exits 0"
assert_eq "$got" "sha-abc1234" "version stays at new tag"
teardown
}
test_health_success_no_rollback
test_health_failure_triggers_rollback() {
setup
out=$(MOCK_CURL_NEVER_HEALTHY=1 SSH_ORIGINAL_COMMAND="deploy sha-abc1234" \
TWILIGHT_HEALTH_TRIES=3 TWILIGHT_HEALTH_SLEEP=0 \
bash "$SCRIPT" 2>&1)
rc=$?
got=$(grep '^TWILIGHT_VERSION=' "$TWILIGHT_HOME/.env" | cut -d= -f2)
ups=$(grep -c "compose up -d data" "$MOCK_DOCKER_LOG")
assert_eq "$rc" "1" "rollback path exits 1"
assert_eq "$got" "sha-0000000" "version restored to PREV"
assert_eq "$ups" "2" "compose up called twice (deploy + rollback)"
assert_contains "$out" "HEALTH FAILED" "failure message logged"
teardown
}
test_health_failure_triggers_rollback- [ ] Step 2: Run, expect failure
Expected: probe loop missing, both tests FAIL.
- [ ] Step 3: Implement probe + rollback
Edit deploy/ecs-deploy.sh, append after the docker compose up -d data line:
bash
sed_inplace() {
if sed --version >/dev/null 2>&1; then sed -i "$1" "$2"
else sed -i '' "$1" "$2"
fi
}
i=0
while (( i < HEALTH_TRIES )); do
if curl -fsS --max-time 2 "$HEALTH_URL" >/dev/null 2>&1; then
echo "healthy after $((i + 1)) tries"
docker image prune -f --filter "until=168h" >/dev/null 2>&1 || true
exit 0
fi
i=$((i + 1))
sleep "$HEALTH_SLEEP"
done
echo "HEALTH FAILED -- rolling back to $PREV_TAG" >&2
sed_inplace "s|^TWILIGHT_VERSION=.*|TWILIGHT_VERSION=$PREV_TAG|" "$ENV_FILE"
docker compose up -d data
exit 1Move the two earlier inline sed -i invocations to use sed_inplace (DRY): replace the existing if sed --version ... fi block in the snapshot section with:
bash
sed_inplace "s|^TWILIGHT_VERSION=.*|TWILIGHT_VERSION=$NEW_TAG|" "$ENV_FILE"Define sed_inplace once, near the top after env-var setup, so both call sites can use it.
- [ ] Step 4: Run, verify pass
bash
bash tests/deploy/test_ecs_deploy.shExpected: all tests pass.
- [ ] Step 5: Commit
bash
git add deploy/ecs-deploy.sh tests/deploy/test_ecs_deploy.sh
git commit -m "feat(deploy): health probe + auto-rollback on /healthz failure"Task 7: Cloudflare Tunnel SSH ingress snippet
Files:
Create:
deploy/cloudflared-ssh-ingress.yml.snippet[ ] Step 1: Write ingress snippet
Write deploy/cloudflared-ssh-ingress.yml.snippet:
yaml
# Merge into ~/.cloudflared/config.yml as a new entry in `ingress:`.
# Add ABOVE the catch-all `service: http_status:404` line.
#
# Hostname must already exist as a DNS record in Cloudflare pointing at
# the tunnel UUID. Cloudflare Access app must be configured for this
# hostname with a service-token policy (no IdP). See
# docs/deploy/ecs-auto-deploy-runbook.md.
- hostname: ssh-ecs.fsagent.cc
service: ssh://localhost:22- [ ] Step 2: Commit
bash
git add deploy/cloudflared-ssh-ingress.yml.snippet
git commit -m "deploy: cloudflared SSH ingress snippet for ssh-ecs.fsagent.cc"Task 8: Install.sh updates for deploy user + forced-command key
Files:
Modify:
deploy/alibaba-ecs-install.sh[ ] Step 1: Read current install.sh to find insertion point
bash
grep -n "function\|^# ====\|main()" deploy/alibaba-ecs-install.sh | head -30Identify a section where additional one-time setup runs (post-docker-install, pre-compose-up).
- [ ] Step 2: Add deploy-user provisioning function
In deploy/alibaba-ecs-install.sh, add a new function near the other setup helpers:
bash
setup_deploy_user() {
local deploy_user="${DEPLOY_USER:-deploy}"
local pubkey_file="${DEPLOY_PUBKEY_FILE:-}"
if [[ -z "$pubkey_file" ]]; then
warn "DEPLOY_PUBKEY_FILE not set; skipping deploy-user setup (see runbook)"
return 0
fi
if [[ ! -r "$pubkey_file" ]]; then
fail "DEPLOY_PUBKEY_FILE=$pubkey_file not readable"
fi
if ! id -u "$deploy_user" >/dev/null 2>&1; then
useradd -m -s /bin/bash "$deploy_user"
fi
usermod -aG docker "$deploy_user"
install -d -o "$deploy_user" -g "$deploy_user" -m 700 \
"/home/$deploy_user/.ssh"
local script_dest="/usr/local/bin/twilight-deploy"
install -m 755 -o root -g root \
"$(dirname "$0")/ecs-deploy.sh" "$script_dest"
# Set TWILIGHT_HOME for the deploy user so script finds the right tree.
local home_dir
home_dir="$(getent passwd "${SUDO_USER:-$USER}" | cut -d: -f6)"
local auth_keys="/home/$deploy_user/.ssh/authorized_keys"
local key_line
key_line="command=\"TWILIGHT_HOME=$home_dir/twilight $script_dest\",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty $(cat "$pubkey_file")"
touch "$auth_keys"
if ! grep -qF "$(cat "$pubkey_file")" "$auth_keys"; then
echo "$key_line" >> "$auth_keys"
fi
chown "$deploy_user:$deploy_user" "$auth_keys"
chmod 600 "$auth_keys"
log "deploy user '$deploy_user' provisioned; key authorized for $script_dest"
}Then call setup_deploy_user from main() after the docker/compose install steps and before any compose-up step.
- [ ] Step 3: Smoke-test syntax
bash
bash -n deploy/alibaba-ecs-install.shExpected: no output, exit 0.
- [ ] Step 4: Commit
bash
git add deploy/alibaba-ecs-install.sh
git commit -m "feat(install): provision deploy user with forced-command SSH key"Task 9: GitHub Actions deploy workflow
Files:
Create:
.github/workflows/deploy-ecs.yml[ ] Step 1: Write workflow
Write .github/workflows/deploy-ecs.yml:
yaml
name: Deploy to ECS
on:
workflow_run:
workflows: ["Docker build"]
types: [completed]
branches: [main]
concurrency:
group: deploy-ecs
cancel-in-progress: false
permissions:
contents: read
jobs:
deploy:
name: Deploy to Alibaba ECS
if: ${{ github.event.workflow_run.conclusion == 'success' }}
runs-on: ubuntu-latest
environment: production
steps:
- name: Install cloudflared
run: |
set -euo pipefail
curl -fsSL -o /tmp/cloudflared \
https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64
sudo install -m 755 /tmp/cloudflared /usr/local/bin/cloudflared
cloudflared --version
- name: Write SSH key
env:
SSH_KEY: ${{ secrets.ECS_SSH_PRIVATE_KEY }}
run: |
set -euo pipefail
mkdir -p ~/.ssh
umask 077
printf '%s\n' "$SSH_KEY" > ~/.ssh/id_deploy
chmod 600 ~/.ssh/id_deploy
- name: Deploy via cloudflared SSH
env:
CF_ACCESS_CLIENT_ID: ${{ secrets.CF_ACCESS_CLIENT_ID }}
CF_ACCESS_CLIENT_SECRET: ${{ secrets.CF_ACCESS_CLIENT_SECRET }}
DEPLOY_USER: ${{ vars.ECS_DEPLOY_USER }}
DEPLOY_HOST: ${{ vars.ECS_SSH_HOSTNAME }}
SHA: ${{ github.event.workflow_run.head_sha }}
run: |
set -euo pipefail
SHORT="sha-${SHA:0:7}"
echo "deploying $SHORT to $DEPLOY_HOST"
ssh \
-o StrictHostKeyChecking=accept-new \
-o UserKnownHostsFile=~/.ssh/known_hosts \
-o ConnectTimeout=15 \
-o ProxyCommand="cloudflared access ssh --hostname %h" \
-i ~/.ssh/id_deploy \
"$DEPLOY_USER@$DEPLOY_HOST" "deploy $SHORT"Notes baked into design:
concurrency: deploy-ecs / cancel-in-progress: falseserializes deploys at the workflow level (defense in depth alongsideflockon the box).environment: productionlets you add manual approval later by configuring the environment in GitHub repo settings; no approval today.vars.ECS_DEPLOY_USERandvars.ECS_SSH_HOSTNAMEare repo-level variables (not secrets) — runbook describes how to set them.[ ] Step 2: Validate YAML
bash
python3 -c "import yaml; yaml.safe_load(open('.github/workflows/deploy-ecs.yml'))"Expected: no output, exit 0.
- [ ] Step 3: Commit
bash
git add .github/workflows/deploy-ecs.yml
git commit -m "ci(deploy): workflow_run-triggered ECS deploy via cloudflared SSH"Task 10: Runbook for one-time setup
Files:
Create:
docs/deploy/ecs-auto-deploy-runbook.md[ ] Step 1: Write the runbook
Write docs/deploy/ecs-auto-deploy-runbook.md:
markdown
# ECS Auto-Deploy Runbook
One-time setup to enable GitHub Actions → Cloudflare Tunnel SSH → ECS deploys.
After this runs, every push to `main` deploys automatically with health-gated
rollback. See `docs/superpowers/specs/2026-05-14-ecs-deploy-design.md` for
the design rationale.
## Prereqs
- ECS box already provisioned via `deploy/alibaba-ecs-install.sh`.
- `cloudflared` tunnel `twilight-backend` already running on ECS.
- Repo admin access on GitHub (for secrets + variables).
- Cloudflare account access with Zero Trust → Access permissions.
## 1. Generate deploy SSH key (on your laptop)
```bash
ssh-keygen -t ed25519 -f ~/.ssh/twilight-ecs-deploy -C "twilight-ecs-deploy" -N ""
# Public key: ~/.ssh/twilight-ecs-deploy.pub -> upload to ECS
# Private key: ~/.ssh/twilight-ecs-deploy -> paste into GH secret
```
## 2. Provision deploy user on ECS
Copy the public key to the box, then re-run install.sh with the
`DEPLOY_PUBKEY_FILE` variable pointing at it:
```bash
scp ~/.ssh/twilight-ecs-deploy.pub root@39.106.170.204:/tmp/deploy.pub
ssh root@39.106.170.204
cd ~/twilight/source
DEPLOY_PUBKEY_FILE=/tmp/deploy.pub bash deploy/alibaba-ecs-install.sh
```
This creates user `deploy`, adds it to the docker group, installs
`/usr/local/bin/twilight-deploy`, and writes the forced-command entry to
`/home/deploy/.ssh/authorized_keys`.
Verify:
```bash
sudo -u deploy /usr/local/bin/twilight-deploy </dev/null 2>&1 || true
# Expected: "invalid tag:" — the script is reachable.
```
## 3. Configure Cloudflare Access
In Cloudflare dashboard → Zero Trust → Access → Applications → Add:
- Application type: **Self-hosted**
- Application name: `twilight-ssh-ecs`
- Subdomain: `ssh-ecs`
- Domain: `fsagent.cc`
- Identity providers: **none** (uncheck all)
- Policy: name `service-token-only`, Action `Service Auth`, Include `Service Token` → create a new token named `gh-actions-deploy`
- Copy the **Client ID** and **Client Secret** that appear — they only show once.
## 4. Add SSH ingress to cloudflared
On ECS, edit the tunnel config (location varies; commonly
`/etc/cloudflared/config.yml` or `~/.cloudflared/config.yml`):
Add the snippet from `deploy/cloudflared-ssh-ingress.yml.snippet` above the
catch-all 404 entry, then:
```bash
sudo systemctl restart cloudflared
```
Add the DNS record pointing the hostname at the tunnel:
```bash
cloudflared tunnel route dns twilight-backend ssh-ecs.fsagent.cc
```
## 5. Wire GitHub repo
In **Settings → Secrets and variables → Actions**:
Secrets:
- `ECS_SSH_PRIVATE_KEY` — paste contents of `~/.ssh/twilight-ecs-deploy`
- `CF_ACCESS_CLIENT_ID` — from step 3
- `CF_ACCESS_CLIENT_SECRET` — from step 3
Variables:
- `ECS_DEPLOY_USER` = `deploy`
- `ECS_SSH_HOSTNAME` = `ssh-ecs.fsagent.cc`
## 6. Smoke test from your laptop first
Confirm the path works before letting CI try:
```bash
cloudflared access ssh-config --hostname ssh-ecs.fsagent.cc
# Then:
ssh -o ProxyCommand="cloudflared access ssh --hostname %h" \
-i ~/.ssh/twilight-ecs-deploy \
deploy@ssh-ecs.fsagent.cc "deploy sha-abc1234"
# Expected: "invalid tag" if you used a fake SHA, OR a real deploy attempt
# if you used a real one.
```
## 7. Trigger a deploy
Push any small change to `main`. The Docker build runs; on success the
deploy workflow fires automatically. Watch in Actions tab.
## Recovery
If a deploy auto-rollback leaves the box on a known-good prior version, the
workflow run is red but service is healthy. Investigate at leisure.
If both new and previous fail (extremely unlikely — would mean previous was
already broken), SSH as `root`, set `TWILIGHT_VERSION` to a known-good tag
in `~/twilight/.env`, then `docker compose -f ~/twilight/source/deploy/compose.yml up -d backend`.
## Revocation
Revoke the deploy key without touching ECS by deleting the CF Access service
token in the Cloudflare dashboard. Even with the SSH private key, no
connection succeeds without a valid CF Access token. As a second layer,
remove `ECS_SSH_PRIVATE_KEY` from GH secrets.- [ ] Step 2: Commit
bash
git add docs/deploy/ecs-auto-deploy-runbook.md
git commit -m "docs(deploy): one-time setup runbook for ECS auto-deploy"Task 11: Link runbook from existing deploy doc
Files:
Modify:
deploy/alibaba-ecs-deploy.md[ ] Step 1: Add cross-reference
At the top of deploy/alibaba-ecs-deploy.md, after the first heading and any intro paragraph, add:
markdown
> **For auto-deploy:** This document covers manual deployment. To enable
> push-to-main auto-deploy via GitHub Actions, see
> [docs/deploy/ecs-auto-deploy-runbook.md](../docs/deploy/ecs-auto-deploy-runbook.md)
> after completing the manual install.- [ ] Step 2: Commit
bash
git add deploy/alibaba-ecs-deploy.md
git commit -m "docs(deploy): link auto-deploy runbook from manual deploy doc"Task 12: Final verification
- [ ] Step 1: Run all bash tests
bash
bash tests/deploy/test_ecs_deploy.shExpected: all tests pass, exit 0.
- [ ] Step 2: Lint the deploy script
bash
shellcheck deploy/ecs-deploy.sh deploy/alibaba-ecs-install.sh \
2>&1 | tee /tmp/shellcheck.logExpected: no SC2086, SC2046, SC2155 errors. Style warnings (SC2034 etc.) are acceptable but inspect each.
If shellcheck is unavailable: bash -n deploy/ecs-deploy.sh deploy/alibaba-ecs-install.sh (syntax-only fallback).
- [ ] Step 3: Validate workflow YAML
bash
python3 -c "import yaml; yaml.safe_load(open('.github/workflows/deploy-ecs.yml'))"
python3 -c "import yaml; yaml.safe_load(open('.github/workflows/docker-build.yml'))"Expected: no output, both exit 0.
- [ ] Step 4: Verify compose.yml unchanged
bash
git diff deploy/compose.ymlExpected: empty (no diff). The existing ${TWILIGHT_VERSION:-latest} already satisfies the design.
- [ ] Step 5: Open PR
bash
git push -u origin "$(git branch --show-current)"
gh pr create --title "feat(deploy): GHA → Cloudflare Tunnel SSH auto-deploy to ECS" \
--body "$(cat <<'EOF'
## Summary
- Adds `.github/workflows/deploy-ecs.yml` that fires on successful `Docker build` workflow_run on `main`.
- Adds `deploy/ecs-deploy.sh` with `flock` lock, tag-regex validation, `TWILIGHT_VERSION` swap in `~/twilight/.env`, docker compose pull + up, `/healthz` probe (30 × 2s), and auto-rollback on health failure.
- Adds bash test harness with PATH-mocked docker/curl.
- Wires deploy user + forced-command SSH key into `deploy/alibaba-ecs-install.sh`.
- Adds runbook for one-time CF Access + GH secrets configuration.
- Zero open inbound ports on ECS (Cloudflare Tunnel only).
## Test plan
- [x] `bash tests/deploy/test_ecs_deploy.sh` — all pass
- [x] `bash -n deploy/ecs-deploy.sh` / shellcheck clean
- [x] `python3 -c 'import yaml; yaml.safe_load(...)'` for both workflows
- [ ] Manual: run install.sh on ECS with `DEPLOY_PUBKEY_FILE`, smoke-test SSH from laptop per runbook §6
- [ ] Manual: push trivial change to main, verify deploy workflow fires and `/healthz` stays green
- [ ] Manual: push intentionally broken change, verify rollback triggers and prior version returns
EOF
)"- [ ] Step 6: Final commit if anything was tweaked during verification
If verification surfaced issues and they were fixed, ensure they are in separate commits before pushing. Otherwise this step is a no-op.
Self-review notes
- Spec coverage: all spec sections mapped — Architecture (Tasks 9, 7); Components (Tasks 2-6, 7, 9, 10); ECS-side prep (Task 8); failure modes (Tasks 2-6 cover invalid tag, lock, pull fail, health fail, rollback); security posture (Tasks 7, 8, 9 enforce CF tunnel + forced-command + minimal perms).
- Reconciled
IMAGE_TAG→TWILIGHT_VERSION(existing convention) and/opt/twilight→~/twilight(existing convention). Functional behavior unchanged. - Tests cover all failure modes the script handles (invalid tag, concurrent run, pull fail, health fail + rollback) plus the happy path.
- No placeholders — every step has the literal file content, command, or code block needed.
- YAGNI — no blue/green, no ACR mirror, no smoke tests beyond
/healthz(per spec).