ECS Auto-Deploy Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Auto-deploy every push to main from GitHub Actions to Alibaba ECS Chengdu via Cloudflare Tunnel SSH, with health-gated rollback and zero open inbound ports.

Architecture: docker-build.yml (existing) pushes image to GHCR with sha-<7> tag. A new deploy-ecs.yml (workflow_run trigger) installs cloudflared, SSHes through Cloudflare Access to a non-root deploy user on ECS, and invokes a forced-command script that swaps TWILIGHT_VERSION in ~/twilight/.env, runs docker compose up -d, polls /healthz, and rolls back on failure.

Tech Stack: GitHub Actions, Bash, Docker Compose, Cloudflare Tunnel + Access, cloudflared CLI, flock, curl. Bash unit tests use plain shell with PATH mocking (no bats dependency).

Reconciliation with spec: The design doc used IMAGE_TAG / /opt/twilight. The existing repo already uses TWILIGHT_VERSION / ~/twilight/.env (see deploy/compose.yml:20, deploy/env.example, deploy/alibaba-ecs-install.sh). This plan follows the existing convention; functional behavior is identical.

File Structure

Create:

deploy/ecs-deploy.sh — ECS-side script: lock, validate tag, snapshot, swap, pull, up, healthcheck, rollback
tests/deploy/test_ecs_deploy.sh — bash test harness with PATH-mocked docker/curl
tests/deploy/mocks/docker — mock docker binary used by tests
tests/deploy/mocks/curl — mock curl binary used by tests
.github/workflows/deploy-ecs.yml — workflow_run trigger, SSH deploy via cloudflared
deploy/cloudflared-ssh-ingress.yml.snippet — ingress block to merge into existing tunnel config
docs/deploy/ecs-auto-deploy-runbook.md — one-time CF Access + GH secrets + ECS prep

Modify:

deploy/alibaba-ecs-install.sh — provision deploy user, install ecs-deploy.sh to /usr/local/bin/twilight-deploy, append forced-command authorized_keys entry
deploy/alibaba-ecs-deploy.md — link to runbook, document version pinning via TWILIGHT_VERSION

No changes needed:

deploy/compose.yml — already uses ${TWILIGHT_VERSION:-latest} at line 20
deploy/env.example — already has TWILIGHT_VERSION=latest

Task 1: Add bash test harness

Files:

Create: tests/deploy/test_ecs_deploy.sh
Create: tests/deploy/mocks/docker
Create: tests/deploy/mocks/curl
[ ] Step 1: Create mocks directory and mock docker

Write tests/deploy/mocks/docker:

bash

#!/usr/bin/env bash
# Mock docker for ecs-deploy.sh tests.
# Behavior controlled by env vars:
#   MOCK_DOCKER_PULL_FAIL=1    -> `docker compose pull` exits 1
#   MOCK_DOCKER_LOG=/tmp/x     -> append each invocation's args to this file
log="${MOCK_DOCKER_LOG:-/dev/null}"
echo "docker $*" >> "$log"
case "$*" in
  "compose pull "*)
    [[ "${MOCK_DOCKER_PULL_FAIL:-0}" == "1" ]] && exit 1
    exit 0
    ;;
  "compose up -d "*) exit 0 ;;
  "image prune"*)    exit 0 ;;
  *) exit 0 ;;
esac

chmod +x tests/deploy/mocks/docker.

[ ] Step 2: Create mock curl

Write tests/deploy/mocks/curl:

bash

#!/usr/bin/env bash
# Mock curl for ecs-deploy.sh tests.
#   MOCK_CURL_HEALTHY_AFTER=N  -> the Nth call onward returns 0 (default 1 = first)
#   MOCK_CURL_NEVER_HEALTHY=1  -> always returns 22 (HTTP error)
state="${MOCK_CURL_STATE:-/tmp/mock-curl-state}"
count=$(( $(cat "$state" 2>/dev/null || echo 0) + 1 ))
echo "$count" > "$state"

if [[ "${MOCK_CURL_NEVER_HEALTHY:-0}" == "1" ]]; then
  exit 22
fi
threshold="${MOCK_CURL_HEALTHY_AFTER:-1}"
(( count >= threshold )) && exit 0 || exit 22

chmod +x tests/deploy/mocks/curl.

[ ] Step 3: Create test harness skeleton

Write tests/deploy/test_ecs_deploy.sh:

bash

#!/usr/bin/env bash
# Plain-shell test harness for deploy/ecs-deploy.sh.
# Each test runs in an isolated tmpdir with mocked PATH.
set -u

REPO_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)"
SCRIPT="$REPO_ROOT/deploy/ecs-deploy.sh"
MOCKS_DIR="$REPO_ROOT/tests/deploy/mocks"

pass=0; fail=0
log() { printf '%s\n' "$*"; }
assert_eq() {
  local got="$1" want="$2" msg="${3:-}"
  if [[ "$got" == "$want" ]]; then ((pass++)); log "  PASS ${msg}"; else
    ((fail++)); log "  FAIL ${msg}"; log "    want: $want"; log "    got:  $got";
  fi
}
assert_contains() {
  local hay="$1" needle="$2" msg="${3:-}"
  if [[ "$hay" == *"$needle"* ]]; then ((pass++)); log "  PASS ${msg}"; else
    ((fail++)); log "  FAIL ${msg}"; log "    missing: $needle"; log "    in: $hay";
  fi
}

setup() {
  TMP="$(mktemp -d)"
  cd "$TMP"
  mkdir -p home/twilight/source/deploy
  cp "$REPO_ROOT/deploy/compose.yml" home/twilight/source/deploy/
  printf 'TWILIGHT_VERSION=sha-0000000\n' > home/twilight/.env
  export TWILIGHT_HOME="$TMP/home/twilight"
  export MOCK_DOCKER_LOG="$TMP/docker.log"
  export MOCK_CURL_STATE="$TMP/curl.count"
  : > "$MOCK_DOCKER_LOG"
  : > "$MOCK_CURL_STATE"
  export PATH="$MOCKS_DIR:$PATH"
}
teardown() { cd /; rm -rf "$TMP"; unset TWILIGHT_HOME MOCK_DOCKER_LOG MOCK_DOCKER_PULL_FAIL MOCK_CURL_HEALTHY_AFTER MOCK_CURL_NEVER_HEALTHY MOCK_CURL_STATE; }

# Tests appended in later tasks.

trap 'teardown 2>/dev/null || true' EXIT
log "==> running tests"
# (test invocations added in later tasks)
log "==> ${pass} passed, ${fail} failed"
exit $(( fail > 0 ? 1 : 0 ))

chmod +x tests/deploy/test_ecs_deploy.sh.

[ ] Step 4: Run the empty harness

bash

bash tests/deploy/test_ecs_deploy.sh

Expected: ==> 0 passed, 0 failed, exit 0.

[ ] Step 5: Commit

bash

git add tests/deploy/
git commit -m "test(deploy): bash test harness scaffold for ecs-deploy.sh"

Task 2: Tag validation (TDD)

Files:

Modify: tests/deploy/test_ecs_deploy.sh
Create: deploy/ecs-deploy.sh
[ ] Step 1: Append failing test for invalid tag

Insert immediately before the log "==> ${pass} passed" line in tests/deploy/test_ecs_deploy.sh:

bash

test_invalid_tag_rejected() {
  setup
  out=$(SSH_ORIGINAL_COMMAND="deploy not-a-tag" bash "$SCRIPT" 2>&1); rc=$?
  assert_eq "$rc" "2" "invalid tag exits 2"
  assert_contains "$out" "invalid tag" "invalid tag error message"
  teardown
}
test_invalid_tag_rejected

[ ] Step 2: Run and verify failure

bash

bash tests/deploy/test_ecs_deploy.sh

Expected: 2 FAILs (script does not yet exist). Last line: ==> 0 passed, 2 failed, exit 1.

[ ] Step 3: Create minimal ecs-deploy.sh

Write deploy/ecs-deploy.sh:

bash

#!/usr/bin/env bash
# Twilight Drive ECS-side deploy script.
# Invoked via SSH forced-command. SSH_ORIGINAL_COMMAND format: "deploy <tag>".
# Tag must match: sha-<7-hex> | latest | main | v<X>.<Y>.<Z>
set -euo pipefail

TWILIGHT_HOME="${TWILIGHT_HOME:-/home/deploy/twilight}"
ENV_FILE="$TWILIGHT_HOME/.env"
COMPOSE_DIR="$TWILIGHT_HOME/source/deploy"
LOCK_FILE="${TWILIGHT_LOCK:-/tmp/twilight-deploy.lock}"
HEALTH_URL="${TWILIGHT_HEALTH_URL:-http://localhost:8081/healthz}"
HEALTH_TRIES="${TWILIGHT_HEALTH_TRIES:-30}"
HEALTH_SLEEP="${TWILIGHT_HEALTH_SLEEP:-2}"

cmd="${SSH_ORIGINAL_COMMAND:-}"
NEW_TAG="${cmd##* }"
if ! [[ "$NEW_TAG" =~ ^(sha-[a-f0-9]{7}|latest|main|v[0-9]+\.[0-9]+\.[0-9]+)$ ]]; then
  echo "invalid tag: $NEW_TAG" >&2
  exit 2
fi
echo "tag accepted: $NEW_TAG"

chmod +x deploy/ecs-deploy.sh.

[ ] Step 4: Run test, verify pass

bash

bash tests/deploy/test_ecs_deploy.sh

Expected: ==> 2 passed, 0 failed, exit 0.

[ ] Step 5: Add tests for valid tag shapes

Append before the summary line:

bash

test_valid_tags_accepted() {
  for t in "sha-abc1234" "latest" "main" "v1.2.3"; do
    setup
    SSH_ORIGINAL_COMMAND="deploy $t" bash "$SCRIPT" >/dev/null 2>&1
    rc=$?
    assert_eq "$rc" "0" "valid tag '$t' accepted"
    teardown
  done
}
test_valid_tags_accepted

(Note: this will fail at later steps once flock and rollback are added — those need full setup. Keep this test now; later tasks make it more thorough.)

[ ] Step 6: Run, verify pass

bash

bash tests/deploy/test_ecs_deploy.sh

Expected: ==> 6 passed, 0 failed, exit 0.

[ ] Step 7: Commit

bash

git add deploy/ecs-deploy.sh tests/deploy/test_ecs_deploy.sh
git commit -m "feat(deploy): tag validation for ecs-deploy.sh"

Task 3: Concurrency lock (TDD)

Files:

Modify: tests/deploy/test_ecs_deploy.sh
Modify: deploy/ecs-deploy.sh
[ ] Step 1: Add failing test for concurrent deploy

Append before summary line:

bash

test_concurrent_deploy_rejected() {
  setup
  # Hold the lock from another process, then try to deploy.
  ( flock -x 9; sleep 2 ) 9>"$TMP/lock" &
  holder=$!
  sleep 0.2
  out=$(TWILIGHT_LOCK="$TMP/lock" SSH_ORIGINAL_COMMAND="deploy sha-abc1234" \
        bash "$SCRIPT" 2>&1); rc=$?
  wait $holder 2>/dev/null
  assert_eq "$rc" "3" "second deploy exits 3"
  assert_contains "$out" "already running" "lock-held message"
  teardown
}
test_concurrent_deploy_rejected

[ ] Step 2: Run, expect failure

bash

bash tests/deploy/test_ecs_deploy.sh

Expected: lock test FAILs (script does not yet attempt to acquire lock).

[ ] Step 3: Add flock to ecs-deploy.sh

Edit deploy/ecs-deploy.sh. After the tag-validation block and before the final echo, insert:

bash

exec 9>"$LOCK_FILE"
if ! flock -n 9; then
  echo "deploy already running" >&2
  exit 3
fi

Remove the temporary echo "tag accepted: $NEW_TAG" (no longer needed).

[ ] Step 4: Run, verify pass

bash

bash tests/deploy/test_ecs_deploy.sh

Expected: all tests pass.

[ ] Step 5: Commit

bash

git add deploy/ecs-deploy.sh tests/deploy/test_ecs_deploy.sh
git commit -m "feat(deploy): flock-based concurrency guard"

Task 4: Snapshot + tag swap (TDD)

Files:

Modify: tests/deploy/test_ecs_deploy.sh
Modify: deploy/ecs-deploy.sh
[ ] Step 1: Add failing test

Append before summary line:

bash

test_env_tag_swapped() {
  setup
  SSH_ORIGINAL_COMMAND="deploy sha-abc1234" \
    bash "$SCRIPT" >/dev/null 2>&1
  rc=$?
  got=$(grep '^TWILIGHT_VERSION=' "$TWILIGHT_HOME/.env" | cut -d= -f2)
  assert_eq "$rc" "0" "deploy succeeds with healthy mock"
  assert_eq "$got" "sha-abc1234" ".env TWILIGHT_VERSION updated"
}
test_env_tag_swapped

(No teardown after — we leave $TMP around so a follow-up assert could inspect; setup will rebuild next run.)

[ ] Step 2: Run, expect failure

Expected: rc may be 0 but .env unchanged, so the second assert FAILs.

[ ] Step 3: Implement snapshot + swap

Edit deploy/ecs-deploy.sh. After the flock block, append:

bash

if [[ ! -f "$ENV_FILE" ]]; then
  echo "missing env file: $ENV_FILE" >&2
  exit 4
fi
PREV_TAG=$(grep -E '^TWILIGHT_VERSION=' "$ENV_FILE" | cut -d= -f2)
if [[ -z "$PREV_TAG" ]]; then
  echo "missing TWILIGHT_VERSION in $ENV_FILE" >&2
  exit 4
fi
echo "rollback target: $PREV_TAG"
echo "deploying:       $NEW_TAG"

# Portable sed -i (GNU vs BSD).
if sed --version >/dev/null 2>&1; then
  sed -i "s|^TWILIGHT_VERSION=.*|TWILIGHT_VERSION=$NEW_TAG|" "$ENV_FILE"
else
  sed -i '' "s|^TWILIGHT_VERSION=.*|TWILIGHT_VERSION=$NEW_TAG|" "$ENV_FILE"
fi

[ ] Step 4: Run, verify pass

bash

bash tests/deploy/test_ecs_deploy.sh

Expected: all tests pass.

[ ] Step 5: Commit

bash

git add deploy/ecs-deploy.sh tests/deploy/test_ecs_deploy.sh
git commit -m "feat(deploy): snapshot rollback tag and swap TWILIGHT_VERSION"

Task 5: Pull + up via docker compose (TDD)

Files:

Modify: tests/deploy/test_ecs_deploy.sh
Modify: deploy/ecs-deploy.sh
[ ] Step 1: Add test for docker calls

Append before summary line:

bash

test_docker_compose_invoked() {
  setup
  SSH_ORIGINAL_COMMAND="deploy sha-abc1234" \
    bash "$SCRIPT" >/dev/null 2>&1
  log_content=$(cat "$MOCK_DOCKER_LOG")
  assert_contains "$log_content" "compose pull data" "docker compose pull called"
  assert_contains "$log_content" "compose up -d data" "docker compose up called"
  teardown
}
test_docker_compose_invoked

[ ] Step 2: Run, expect failure

Expected: docker.log is empty.

[ ] Step 3: Add docker compose calls

Edit deploy/ecs-deploy.sh, after the sed block append:

bash

cd "$COMPOSE_DIR"
docker compose pull data
docker compose up -d data

[ ] Step 4: Run, verify pass

bash

bash tests/deploy/test_ecs_deploy.sh

Expected: all tests pass.

[ ] Step 5: Add test for pull failure aborts before up

Append before summary line:

bash

test_pull_failure_aborts() {
  setup
  MOCK_DOCKER_PULL_FAIL=1 SSH_ORIGINAL_COMMAND="deploy sha-abc1234" \
    bash "$SCRIPT" >/dev/null 2>&1
  rc=$?
  log_content=$(cat "$MOCK_DOCKER_LOG")
  assert_eq "$rc" "1" "pull failure exits non-zero"
  assert_contains "$log_content" "compose pull data" "pull attempted"
  if [[ "$log_content" == *"compose up -d data"* ]]; then
    ((fail++)); log "  FAIL up should NOT have been called after pull fail"
  else
    ((pass++)); log "  PASS up not called after pull fail"
  fi
  teardown
}
test_pull_failure_aborts

[ ] Step 6: Run, verify pass

Expected: all tests pass. set -e already aborts on docker compose pull failure, so no code change needed.

[ ] Step 7: Commit

bash

git add deploy/ecs-deploy.sh tests/deploy/test_ecs_deploy.sh
git commit -m "feat(deploy): pull + up via docker compose, abort on pull failure"

Task 6: Health probe + rollback (TDD)

Files:

Modify: tests/deploy/test_ecs_deploy.sh
Modify: deploy/ecs-deploy.sh
[ ] Step 1: Add tests for healthy and unhealthy paths

Append before summary line:

bash

test_health_success_no_rollback() {
  setup
  SSH_ORIGINAL_COMMAND="deploy sha-abc1234" \
    TWILIGHT_HEALTH_TRIES=5 TWILIGHT_HEALTH_SLEEP=0 \
    bash "$SCRIPT" >/dev/null 2>&1
  rc=$?
  got=$(grep '^TWILIGHT_VERSION=' "$TWILIGHT_HOME/.env" | cut -d= -f2)
  assert_eq "$rc" "0" "healthy deploy exits 0"
  assert_eq "$got" "sha-abc1234" "version stays at new tag"
  teardown
}
test_health_success_no_rollback

test_health_failure_triggers_rollback() {
  setup
  out=$(MOCK_CURL_NEVER_HEALTHY=1 SSH_ORIGINAL_COMMAND="deploy sha-abc1234" \
        TWILIGHT_HEALTH_TRIES=3 TWILIGHT_HEALTH_SLEEP=0 \
        bash "$SCRIPT" 2>&1)
  rc=$?
  got=$(grep '^TWILIGHT_VERSION=' "$TWILIGHT_HOME/.env" | cut -d= -f2)
  ups=$(grep -c "compose up -d data" "$MOCK_DOCKER_LOG")
  assert_eq "$rc" "1" "rollback path exits 1"
  assert_eq "$got" "sha-0000000" "version restored to PREV"
  assert_eq "$ups" "2" "compose up called twice (deploy + rollback)"
  assert_contains "$out" "HEALTH FAILED" "failure message logged"
  teardown
}
test_health_failure_triggers_rollback

[ ] Step 2: Run, expect failure

Expected: probe loop missing, both tests FAIL.

[ ] Step 3: Implement probe + rollback

Edit deploy/ecs-deploy.sh, append after the docker compose up -d data line:

bash

sed_inplace() {
  if sed --version >/dev/null 2>&1; then sed -i "$1" "$2"
  else sed -i '' "$1" "$2"
  fi
}

i=0
while (( i < HEALTH_TRIES )); do
  if curl -fsS --max-time 2 "$HEALTH_URL" >/dev/null 2>&1; then
    echo "healthy after $((i + 1)) tries"
    docker image prune -f --filter "until=168h" >/dev/null 2>&1 || true
    exit 0
  fi
  i=$((i + 1))
  sleep "$HEALTH_SLEEP"
done

echo "HEALTH FAILED -- rolling back to $PREV_TAG" >&2
sed_inplace "s|^TWILIGHT_VERSION=.*|TWILIGHT_VERSION=$PREV_TAG|" "$ENV_FILE"
docker compose up -d data
exit 1

Move the two earlier inline sed -i invocations to use sed_inplace (DRY): replace the existing if sed --version ... fi block in the snapshot section with:

bash

sed_inplace "s|^TWILIGHT_VERSION=.*|TWILIGHT_VERSION=$NEW_TAG|" "$ENV_FILE"

Define sed_inplace once, near the top after env-var setup, so both call sites can use it.

[ ] Step 4: Run, verify pass

bash

bash tests/deploy/test_ecs_deploy.sh

Expected: all tests pass.

[ ] Step 5: Commit

bash

git add deploy/ecs-deploy.sh tests/deploy/test_ecs_deploy.sh
git commit -m "feat(deploy): health probe + auto-rollback on /healthz failure"

Task 7: Cloudflare Tunnel SSH ingress snippet

Files:

Create: deploy/cloudflared-ssh-ingress.yml.snippet
[ ] Step 1: Write ingress snippet

Write deploy/cloudflared-ssh-ingress.yml.snippet:

yaml

# Merge into ~/.cloudflared/config.yml as a new entry in `ingress:`.
# Add ABOVE the catch-all `service: http_status:404` line.
#
# Hostname must already exist as a DNS record in Cloudflare pointing at
# the tunnel UUID. Cloudflare Access app must be configured for this
# hostname with a service-token policy (no IdP). See
# docs/deploy/ecs-auto-deploy-runbook.md.

  - hostname: ssh-ecs.fsagent.cc
    service: ssh://localhost:22

[ ] Step 2: Commit

bash

git add deploy/cloudflared-ssh-ingress.yml.snippet
git commit -m "deploy: cloudflared SSH ingress snippet for ssh-ecs.fsagent.cc"

Task 8: Install.sh updates for deploy user + forced-command key

Files:

Modify: deploy/alibaba-ecs-install.sh
[ ] Step 1: Read current install.sh to find insertion point

bash

grep -n "function\|^# ====\|main()" deploy/alibaba-ecs-install.sh | head -30

Identify a section where additional one-time setup runs (post-docker-install, pre-compose-up).

[ ] Step 2: Add deploy-user provisioning function

In deploy/alibaba-ecs-install.sh, add a new function near the other setup helpers:

bash

setup_deploy_user() {
  local deploy_user="${DEPLOY_USER:-deploy}"
  local pubkey_file="${DEPLOY_PUBKEY_FILE:-}"

  if [[ -z "$pubkey_file" ]]; then
    warn "DEPLOY_PUBKEY_FILE not set; skipping deploy-user setup (see runbook)"
    return 0
  fi
  if [[ ! -r "$pubkey_file" ]]; then
    fail "DEPLOY_PUBKEY_FILE=$pubkey_file not readable"
  fi

  if ! id -u "$deploy_user" >/dev/null 2>&1; then
    useradd -m -s /bin/bash "$deploy_user"
  fi
  usermod -aG docker "$deploy_user"

  install -d -o "$deploy_user" -g "$deploy_user" -m 700 \
    "/home/$deploy_user/.ssh"

  local script_dest="/usr/local/bin/twilight-deploy"
  install -m 755 -o root -g root \
    "$(dirname "$0")/ecs-deploy.sh" "$script_dest"

  # Set TWILIGHT_HOME for the deploy user so script finds the right tree.
  local home_dir
  home_dir="$(getent passwd "${SUDO_USER:-$USER}" | cut -d: -f6)"
  local auth_keys="/home/$deploy_user/.ssh/authorized_keys"
  local key_line
  key_line="command=\"TWILIGHT_HOME=$home_dir/twilight $script_dest\",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty $(cat "$pubkey_file")"

  touch "$auth_keys"
  if ! grep -qF "$(cat "$pubkey_file")" "$auth_keys"; then
    echo "$key_line" >> "$auth_keys"
  fi
  chown "$deploy_user:$deploy_user" "$auth_keys"
  chmod 600 "$auth_keys"

  log "deploy user '$deploy_user' provisioned; key authorized for $script_dest"
}

Then call setup_deploy_user from main() after the docker/compose install steps and before any compose-up step.

[ ] Step 3: Smoke-test syntax

bash

bash -n deploy/alibaba-ecs-install.sh

Expected: no output, exit 0.

[ ] Step 4: Commit

bash

git add deploy/alibaba-ecs-install.sh
git commit -m "feat(install): provision deploy user with forced-command SSH key"

Task 9: GitHub Actions deploy workflow

Files:

Create: .github/workflows/deploy-ecs.yml
[ ] Step 1: Write workflow

Write .github/workflows/deploy-ecs.yml:

yaml

name: Deploy to ECS

on:
  workflow_run:
    workflows: ["Docker build"]
    types: [completed]
    branches: [main]

concurrency:
  group: deploy-ecs
  cancel-in-progress: false

permissions:
  contents: read

jobs:
  deploy:
    name: Deploy to Alibaba ECS
    if: ${{ github.event.workflow_run.conclusion == 'success' }}
    runs-on: ubuntu-latest
    environment: production
    steps:
      - name: Install cloudflared
        run: |
          set -euo pipefail
          curl -fsSL -o /tmp/cloudflared \
            https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64
          sudo install -m 755 /tmp/cloudflared /usr/local/bin/cloudflared
          cloudflared --version

      - name: Write SSH key
        env:
          SSH_KEY: ${{ secrets.ECS_SSH_PRIVATE_KEY }}
        run: |
          set -euo pipefail
          mkdir -p ~/.ssh
          umask 077
          printf '%s\n' "$SSH_KEY" > ~/.ssh/id_deploy
          chmod 600 ~/.ssh/id_deploy

      - name: Deploy via cloudflared SSH
        env:
          CF_ACCESS_CLIENT_ID:     ${{ secrets.CF_ACCESS_CLIENT_ID }}
          CF_ACCESS_CLIENT_SECRET: ${{ secrets.CF_ACCESS_CLIENT_SECRET }}
          DEPLOY_USER:             ${{ vars.ECS_DEPLOY_USER }}
          DEPLOY_HOST:             ${{ vars.ECS_SSH_HOSTNAME }}
          SHA: ${{ github.event.workflow_run.head_sha }}
        run: |
          set -euo pipefail
          SHORT="sha-${SHA:0:7}"
          echo "deploying $SHORT to $DEPLOY_HOST"
          ssh \
            -o StrictHostKeyChecking=accept-new \
            -o UserKnownHostsFile=~/.ssh/known_hosts \
            -o ConnectTimeout=15 \
            -o ProxyCommand="cloudflared access ssh --hostname %h" \
            -i ~/.ssh/id_deploy \
            "$DEPLOY_USER@$DEPLOY_HOST" "deploy $SHORT"

Notes baked into design:

concurrency: deploy-ecs / cancel-in-progress: false serializes deploys at the workflow level (defense in depth alongside flock on the box).
environment: production lets you add manual approval later by configuring the environment in GitHub repo settings; no approval today.
vars.ECS_DEPLOY_USER and vars.ECS_SSH_HOSTNAME are repo-level variables (not secrets) — runbook describes how to set them.
[ ] Step 2: Validate YAML

bash

python3 -c "import yaml; yaml.safe_load(open('.github/workflows/deploy-ecs.yml'))"

Expected: no output, exit 0.

[ ] Step 3: Commit

bash

git add .github/workflows/deploy-ecs.yml
git commit -m "ci(deploy): workflow_run-triggered ECS deploy via cloudflared SSH"

Task 10: Runbook for one-time setup

Files:

Create: docs/deploy/ecs-auto-deploy-runbook.md
[ ] Step 1: Write the runbook

Write docs/deploy/ecs-auto-deploy-runbook.md:

markdown

# ECS Auto-Deploy Runbook

One-time setup to enable GitHub Actions → Cloudflare Tunnel SSH → ECS deploys.
After this runs, every push to `main` deploys automatically with health-gated
rollback. See `docs/superpowers/specs/2026-05-14-ecs-deploy-design.md` for
the design rationale.

## Prereqs

- ECS box already provisioned via `deploy/alibaba-ecs-install.sh`.
- `cloudflared` tunnel `twilight-backend` already running on ECS.
- Repo admin access on GitHub (for secrets + variables).
- Cloudflare account access with Zero Trust → Access permissions.

## 1. Generate deploy SSH key (on your laptop)

```bash
ssh-keygen -t ed25519 -f ~/.ssh/twilight-ecs-deploy -C "twilight-ecs-deploy" -N ""
# Public key:  ~/.ssh/twilight-ecs-deploy.pub  -> upload to ECS
# Private key: ~/.ssh/twilight-ecs-deploy      -> paste into GH secret
```

## 2. Provision deploy user on ECS

Copy the public key to the box, then re-run install.sh with the
`DEPLOY_PUBKEY_FILE` variable pointing at it:

```bash
scp ~/.ssh/twilight-ecs-deploy.pub root@39.106.170.204:/tmp/deploy.pub
ssh root@39.106.170.204
cd ~/twilight/source
DEPLOY_PUBKEY_FILE=/tmp/deploy.pub bash deploy/alibaba-ecs-install.sh
```

This creates user `deploy`, adds it to the docker group, installs
`/usr/local/bin/twilight-deploy`, and writes the forced-command entry to
`/home/deploy/.ssh/authorized_keys`.

Verify:

```bash
sudo -u deploy /usr/local/bin/twilight-deploy </dev/null 2>&1 || true
# Expected: "invalid tag:" — the script is reachable.
```

## 3. Configure Cloudflare Access

In Cloudflare dashboard → Zero Trust → Access → Applications → Add:

- Application type: **Self-hosted**
- Application name: `twilight-ssh-ecs`
- Subdomain: `ssh-ecs`
- Domain: `fsagent.cc`
- Identity providers: **none** (uncheck all)
- Policy: name `service-token-only`, Action `Service Auth`, Include `Service Token` → create a new token named `gh-actions-deploy`
- Copy the **Client ID** and **Client Secret** that appear — they only show once.

## 4. Add SSH ingress to cloudflared

On ECS, edit the tunnel config (location varies; commonly
`/etc/cloudflared/config.yml` or `~/.cloudflared/config.yml`):

Add the snippet from `deploy/cloudflared-ssh-ingress.yml.snippet` above the
catch-all 404 entry, then:

```bash
sudo systemctl restart cloudflared
```

Add the DNS record pointing the hostname at the tunnel:

```bash
cloudflared tunnel route dns twilight-backend ssh-ecs.fsagent.cc
```

## 5. Wire GitHub repo

In **Settings → Secrets and variables → Actions**:

Secrets:
- `ECS_SSH_PRIVATE_KEY` — paste contents of `~/.ssh/twilight-ecs-deploy`
- `CF_ACCESS_CLIENT_ID` — from step 3
- `CF_ACCESS_CLIENT_SECRET` — from step 3

Variables:
- `ECS_DEPLOY_USER` = `deploy`
- `ECS_SSH_HOSTNAME` = `ssh-ecs.fsagent.cc`

## 6. Smoke test from your laptop first

Confirm the path works before letting CI try:

```bash
cloudflared access ssh-config --hostname ssh-ecs.fsagent.cc
# Then:
ssh -o ProxyCommand="cloudflared access ssh --hostname %h" \
    -i ~/.ssh/twilight-ecs-deploy \
    deploy@ssh-ecs.fsagent.cc "deploy sha-abc1234"
# Expected: "invalid tag" if you used a fake SHA, OR a real deploy attempt
# if you used a real one.
```

## 7. Trigger a deploy

Push any small change to `main`. The Docker build runs; on success the
deploy workflow fires automatically. Watch in Actions tab.

## Recovery

If a deploy auto-rollback leaves the box on a known-good prior version, the
workflow run is red but service is healthy. Investigate at leisure.

If both new and previous fail (extremely unlikely — would mean previous was
already broken), SSH as `root`, set `TWILIGHT_VERSION` to a known-good tag
in `~/twilight/.env`, then `docker compose -f ~/twilight/source/deploy/compose.yml up -d backend`.

## Revocation

Revoke the deploy key without touching ECS by deleting the CF Access service
token in the Cloudflare dashboard. Even with the SSH private key, no
connection succeeds without a valid CF Access token. As a second layer,
remove `ECS_SSH_PRIVATE_KEY` from GH secrets.

[ ] Step 2: Commit

bash

git add docs/deploy/ecs-auto-deploy-runbook.md
git commit -m "docs(deploy): one-time setup runbook for ECS auto-deploy"

Task 11: Link runbook from existing deploy doc

Files:

Modify: deploy/alibaba-ecs-deploy.md
[ ] Step 1: Add cross-reference

At the top of deploy/alibaba-ecs-deploy.md, after the first heading and any intro paragraph, add:

markdown

> **For auto-deploy:** This document covers manual deployment. To enable
> push-to-main auto-deploy via GitHub Actions, see
> [docs/deploy/ecs-auto-deploy-runbook.md](../docs/deploy/ecs-auto-deploy-runbook.md)
> after completing the manual install.

[ ] Step 2: Commit

bash

git add deploy/alibaba-ecs-deploy.md
git commit -m "docs(deploy): link auto-deploy runbook from manual deploy doc"

Task 12: Final verification

[ ] Step 1: Run all bash tests

bash

bash tests/deploy/test_ecs_deploy.sh

Expected: all tests pass, exit 0.

[ ] Step 2: Lint the deploy script

bash

shellcheck deploy/ecs-deploy.sh deploy/alibaba-ecs-install.sh \
  2>&1 | tee /tmp/shellcheck.log

Expected: no SC2086, SC2046, SC2155 errors. Style warnings (SC2034 etc.) are acceptable but inspect each.

If shellcheck is unavailable: bash -n deploy/ecs-deploy.sh deploy/alibaba-ecs-install.sh (syntax-only fallback).

[ ] Step 3: Validate workflow YAML

bash

python3 -c "import yaml; yaml.safe_load(open('.github/workflows/deploy-ecs.yml'))"
python3 -c "import yaml; yaml.safe_load(open('.github/workflows/docker-build.yml'))"

Expected: no output, both exit 0.

[ ] Step 4: Verify compose.yml unchanged

bash

git diff deploy/compose.yml

Expected: empty (no diff). The existing ${TWILIGHT_VERSION:-latest} already satisfies the design.

[ ] Step 5: Open PR

bash

git push -u origin "$(git branch --show-current)"
gh pr create --title "feat(deploy): GHA → Cloudflare Tunnel SSH auto-deploy to ECS" \
  --body "$(cat <<'EOF'
## Summary

- Adds `.github/workflows/deploy-ecs.yml` that fires on successful `Docker build` workflow_run on `main`.
- Adds `deploy/ecs-deploy.sh` with `flock` lock, tag-regex validation, `TWILIGHT_VERSION` swap in `~/twilight/.env`, docker compose pull + up, `/healthz` probe (30 × 2s), and auto-rollback on health failure.
- Adds bash test harness with PATH-mocked docker/curl.
- Wires deploy user + forced-command SSH key into `deploy/alibaba-ecs-install.sh`.
- Adds runbook for one-time CF Access + GH secrets configuration.
- Zero open inbound ports on ECS (Cloudflare Tunnel only).

## Test plan

- [x] `bash tests/deploy/test_ecs_deploy.sh` — all pass
- [x] `bash -n deploy/ecs-deploy.sh` / shellcheck clean
- [x] `python3 -c 'import yaml; yaml.safe_load(...)'` for both workflows
- [ ] Manual: run install.sh on ECS with `DEPLOY_PUBKEY_FILE`, smoke-test SSH from laptop per runbook §6
- [ ] Manual: push trivial change to main, verify deploy workflow fires and `/healthz` stays green
- [ ] Manual: push intentionally broken change, verify rollback triggers and prior version returns
EOF
)"

[ ] Step 6: Final commit if anything was tweaked during verification

If verification surfaced issues and they were fixed, ensure they are in separate commits before pushing. Otherwise this step is a no-op.

Self-review notes

Spec coverage: all spec sections mapped — Architecture (Tasks 9, 7); Components (Tasks 2-6, 7, 9, 10); ECS-side prep (Task 8); failure modes (Tasks 2-6 cover invalid tag, lock, pull fail, health fail, rollback); security posture (Tasks 7, 8, 9 enforce CF tunnel + forced-command + minimal perms).
Reconciled IMAGE_TAG → TWILIGHT_VERSION (existing convention) and /opt/twilight → ~/twilight (existing convention). Functional behavior unchanged.
Tests cover all failure modes the script handles (invalid tag, concurrent run, pull fail, health fail + rollback) plus the happy path.
No placeholders — every step has the literal file content, command, or code block needed.
YAGNI — no blue/green, no ACR mirror, no smoke tests beyond /healthz (per spec).

ECS Auto-Deploy Implementation Plan ​

File Structure ​

Task 1: Add bash test harness ​

Task 2: Tag validation (TDD) ​

Task 3: Concurrency lock (TDD) ​

Task 4: Snapshot + tag swap (TDD) ​

Task 5: Pull + up via docker compose (TDD) ​

Task 6: Health probe + rollback (TDD) ​

Task 7: Cloudflare Tunnel SSH ingress snippet ​

Task 8: Install.sh updates for deploy user + forced-command key ​

Task 9: GitHub Actions deploy workflow ​

Task 10: Runbook for one-time setup ​

Task 11: Link runbook from existing deploy doc ​

Task 12: Final verification ​

Self-review notes ​

ECS Auto-Deploy Implementation Plan

File Structure

Task 1: Add bash test harness

Task 2: Tag validation (TDD)

Task 3: Concurrency lock (TDD)

Task 4: Snapshot + tag swap (TDD)

Task 5: Pull + up via docker compose (TDD)

Task 6: Health probe + rollback (TDD)

Task 7: Cloudflare Tunnel SSH ingress snippet

Task 8: Install.sh updates for deploy user + forced-command key

Task 9: GitHub Actions deploy workflow

Task 10: Runbook for one-time setup

Task 11: Link runbook from existing deploy doc

Task 12: Final verification

Self-review notes