
03 Agent Framework MVP — Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Build a framework-agnostic Python reference implementation of the 03 architecture (L2–L5 + Citation Protocol + Verifier + Model Router) that produces a verifiable closed-loop demo: ask "What is 600519's current P/E?" → Skill calls Tool → Tool returns value with cite metadata → Verifier passes → output emitted.

Architecture: Single-agent loop with deterministic Verifier. Citation Protocol uses tagged-union cite (kind: "tool" or kind: "competence"). All cross-layer calls JSON-serializable from day one (containers-ready). Hermes integration is a stub — wrapped later when internal API is documented.

Tech Stack:

  • Python 3.11+
  • Pydantic v2 (citation schema, Skill I/O contracts)
  • pytest + pytest-asyncio (TDD)
  • httpx (HTTP client; respx for mocking)
  • structlog (JSON tracing)
  • ruff + mypy (lint + type-check)
  • uv (package manager — fast, modern)

Repository layout (scaffolded by Task 1, filled in by the later tasks):

core/                       # framework-agnostic 03 implementation
├── __init__.py
├── citation.py             # Pydantic Cite, Claim envelopes
├── competences.py          # Competence registry
├── tools.py                # @tool decorator, ToolTrace
├── skills.py               # Skill base class
├── verifier.py             # deterministic Verifier
├── model_router.py         # task → model mapping
├── runtime.py              # single-agent loop + retry
└── examples/
    ├── __init__.py
    ├── price_tool.py       # get_price (Tushare; mock-backed in tests)
    └── price_summary.py    # PriceSummarySkill

tests/
├── conftest.py
├── unit/
│   ├── test_citation.py
│   ├── test_competences.py
│   ├── test_tools.py
│   ├── test_skills.py
│   ├── test_verifier.py
│   ├── test_model_router.py
│   ├── test_runtime.py
│   ├── test_price_tool.py
│   └── test_price_summary.py
└── integration/
    ├── __init__.py
    └── test_price_summary_loop.py

pyproject.toml
.python-version
.ruff.toml
mypy.ini

Task 1: Project Scaffold

Files:

  • Create: pyproject.toml

  • Create: .python-version

  • Create: .ruff.toml

  • Create: mypy.ini

  • Create: core/__init__.py

  • Create: tests/conftest.py

  • [ ] Step 1: Pin Python version

Create .python-version:

3.11
  • [ ] Step 2: Create pyproject.toml
toml
[project]
name = "twilight-drive-core"
version = "0.0.1"
description = "03 Agent Framework reference implementation"
requires-python = ">=3.11"
dependencies = [
  "pydantic>=2.6,<3",
  "httpx>=0.27,<1",
  "structlog>=24.1,<25",
]

[project.optional-dependencies]
dev = [
  "pytest>=8.0,<9",
  "pytest-asyncio>=0.23,<1",
  "respx>=0.21,<1",
  "ruff>=0.5,<1",
  "mypy>=1.10,<2",
]

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.hatch.build.targets.wheel]
packages = ["core"]

[tool.pytest.ini_options]
asyncio_mode = "auto"
testpaths = ["tests"]
  • [ ] Step 3: Create .ruff.toml
toml
line-length = 100
target-version = "py311"

[lint]
select = ["E", "F", "W", "I", "B", "UP", "SIM", "RUF"]
ignore = ["E501"]
  • [ ] Step 4: Create mypy.ini
ini
[mypy]
python_version = 3.11
strict = True
warn_unused_ignores = True
plugins = pydantic.mypy
  • [ ] Step 5: Empty package init

Create core/__init__.py:

python
"""Twilight Drive — 03 Agent Framework reference implementation."""
  • [ ] Step 6: Empty conftest

Create tests/conftest.py:

python
"""Shared pytest fixtures."""
  • [ ] Step 7: Install dependencies

Run:

bash
pip install -e ".[dev]"
# or, matching the uv-based tech stack above: uv pip install -e ".[dev]"

Expected: installs pydantic, httpx, structlog, pytest, pytest-asyncio, respx, ruff, and mypy with no errors.

  • [ ] Step 8: Verify pytest discovers nothing yet

Run: pytest -q. Expected: no tests ran.

  • [ ] Step 9: Commit
bash
git add pyproject.toml .python-version .ruff.toml mypy.ini core/ tests/
git commit -m "chore: scaffold core package and dev tooling"

Task 2: Citation Models

Files:

  • Create: core/citation.py
  • Create: tests/unit/test_citation.py

Cite is a discriminated union keyed on kind; the claim envelopes wrap a value (numeric) or a text statement (knowledge) plus its cite. A short usage sketch follows at the end of this task.

  • [ ] Step 1: Write failing test for ToolCite validation

Create tests/unit/test_citation.py:

python
import pytest
from datetime import datetime, timezone
from pydantic import ValidationError

from core.citation import Cite, ToolCite, CompetenceCite, NumericClaim, KnowledgeClaim


def test_tool_cite_accepts_required_fields():
    cite = ToolCite(
        source="tushare",
        table="daily_basic",
        fetched_at=datetime(2026, 4, 28, 8, 14, 22, tzinfo=timezone.utc),
        tool_call_id="tc_abc123",
    )
    assert cite.kind == "tool"


def test_competence_cite_accepts_id():
    cite = CompetenceCite(competence_id="comp.astock.fiscal_calendar.v1")
    assert cite.kind == "competence"


def test_cite_union_discriminates_by_kind():
    payload = {"kind": "tool", "source": "x", "fetched_at": "2026-04-28T00:00:00Z", "tool_call_id": "tc_1"}
    cite: Cite = ToolCite.model_validate(payload)
    assert isinstance(cite, ToolCite)


def test_numeric_claim_requires_cite():
    with pytest.raises(ValidationError):
        NumericClaim(value=42.5, metric="P/E", code="600519.SH", as_of="2026-04-27")  # type: ignore


def test_numeric_claim_round_trip():
    claim = NumericClaim(
        value=42.5,
        metric="P/E",
        code="600519.SH",
        as_of="2026-04-27",
        cite=ToolCite(source="tushare", fetched_at=datetime.now(timezone.utc), tool_call_id="tc_1"),
    )
    dumped = claim.model_dump_json()
    loaded = NumericClaim.model_validate_json(dumped)
    assert loaded == claim


def test_knowledge_claim_requires_competence_cite_or_tool_cite():
    claim = KnowledgeClaim(
        claim="A-share fiscal year ends Dec 31",
        cite=CompetenceCite(competence_id="comp.astock.fiscal_calendar.v1"),
    )
    assert claim.cite.kind == "competence"
  • [ ] Step 2: Run test to verify it fails

Run: pytest tests/unit/test_citation.py -v. Expected: FAIL at collection with ModuleNotFoundError: No module named 'core.citation' (the module does not exist yet).

  • [ ] Step 3: Implement core/citation.py
python
"""Citation Protocol — tagged-union Cite + claim envelopes.

A `Cite` is a discriminated union by `kind`:
  - "tool"       → originated from a tool call
  - "competence" → originated from a registered Competence

Every claim emitted by a Skill must carry a `cite`. The Verifier rejects
outputs that fail this contract.
"""

from __future__ import annotations

from datetime import datetime
from typing import Annotated, Literal, Union

from pydantic import BaseModel, Field


class ToolCite(BaseModel):
    kind: Literal["tool"] = "tool"
    source: str
    table: str | None = None
    fetched_at: datetime
    tool_call_id: str


class CompetenceCite(BaseModel):
    kind: Literal["competence"] = "competence"
    competence_id: str


Cite = Annotated[Union[ToolCite, CompetenceCite], Field(discriminator="kind")]


class NumericClaim(BaseModel):
    value: float
    metric: str
    code: str
    as_of: str  # ISO date or period (e.g. "2026Q1")
    cite: Cite


class KnowledgeClaim(BaseModel):
    claim: str
    cite: Cite
  • [ ] Step 4: Run tests to verify pass

Run: pytest tests/unit/test_citation.py -v. Expected: 6 passed.

  • [ ] Step 5: Type-check

Run: mypy core/citation.py. Expected: Success: no issues found.

  • [ ] Step 6: Commit
bash
git add core/citation.py tests/unit/test_citation.py
git commit -m "feat(citation): tagged-union Cite + numeric/knowledge claim envelopes"

Task 3: Competence Registry

Files:

  • Create: core/competences.py
  • Create: tests/unit/test_competences.py

Competences are registered at startup and looked up by ID. The registry rejects duplicate IDs and provides a resolution check used by the Verifier.

  • [ ] Step 1: Write failing tests

Create tests/unit/test_competences.py:

python
import pytest

from core.competences import Competence, CompetenceRegistry, DuplicateCompetenceError, UnknownCompetenceError


def test_register_and_lookup():
    reg = CompetenceRegistry()
    comp = Competence(
        id="comp.astock.fiscal_calendar.v1",
        statement="A-share fiscal year ends Dec 31",
        source="CSRC accounting standards (2014)",
    )
    reg.register(comp)
    assert reg.get("comp.astock.fiscal_calendar.v1") is comp
    assert reg.exists("comp.astock.fiscal_calendar.v1")


def test_duplicate_rejected():
    reg = CompetenceRegistry()
    comp = Competence(id="x", statement="...", source="...")
    reg.register(comp)
    with pytest.raises(DuplicateCompetenceError):
        reg.register(comp)


def test_unknown_lookup_raises():
    reg = CompetenceRegistry()
    with pytest.raises(UnknownCompetenceError):
        reg.get("nope")


def test_exists_false_for_unknown():
    reg = CompetenceRegistry()
    assert reg.exists("nope") is False
  • [ ] Step 2: Run to verify FAIL

Run: pytest tests/unit/test_competences.py -v. Expected: ImportError.

  • [ ] Step 3: Implement core/competences.py
python
"""Competence registry.

A Competence is a unit of domain knowledge with provenance:
  (id, statement, source). Competences are registered at startup;
  the Verifier checks `competence_id` references resolve.

LLMs cannot fabricate Competences — only registered IDs are valid
citation targets.
"""

from __future__ import annotations

from pydantic import BaseModel


class DuplicateCompetenceError(ValueError):
    """Raised when registering an ID that already exists."""


class UnknownCompetenceError(KeyError):
    """Raised when looking up an ID that was never registered."""


class Competence(BaseModel):
    id: str
    statement: str
    source: str
    ttl_days: int | None = None  # optional staleness override


class CompetenceRegistry:
    def __init__(self) -> None:
        self._items: dict[str, Competence] = {}

    def register(self, comp: Competence) -> None:
        if comp.id in self._items:
            raise DuplicateCompetenceError(comp.id)
        self._items[comp.id] = comp

    def get(self, comp_id: str) -> Competence:
        if comp_id not in self._items:
            raise UnknownCompetenceError(comp_id)
        return self._items[comp_id]

    def exists(self, comp_id: str) -> bool:
        return comp_id in self._items
  • [ ] Step 4: Run tests

Run: pytest tests/unit/test_competences.py -v. Expected: 4 passed.

  • [ ] Step 5: Commit
bash
git add core/competences.py tests/unit/test_competences.py
git commit -m "feat(competences): registry with duplicate + unknown protection"

Task 4: Tool Decorator + ToolTrace

Files:

  • Create: core/tools.py
  • Create: tests/unit/test_tools.py

The @tool decorator wraps a pure data function: it injects a generated tool_call_id, captures inputs/outputs into a ToolTrace, and returns the raw value. The trace is what the Verifier later consults.

  • [ ] Step 1: Write failing tests

Create tests/unit/test_tools.py:

python
from datetime import datetime, timezone
from core.tools import ToolTrace, tool


@tool(source="testsrc", table="t1")
def add(trace: ToolTrace, a: int, b: int) -> int:
    return a + b


def test_tool_returns_value_and_records_trace():
    trace = ToolTrace()
    result = add(trace, 2, 3)
    assert result == 5
    assert len(trace.calls) == 1
    call = trace.calls[0]
    assert call.tool_name == "add"
    assert call.args == {"a": 2, "b": 3}
    assert call.value == 5
    assert call.source == "testsrc"
    assert call.table == "t1"
    assert call.tool_call_id.startswith("tc_")
    assert isinstance(call.fetched_at, datetime)
    assert call.fetched_at.tzinfo == timezone.utc


def test_tool_call_ids_are_unique():
    trace = ToolTrace()
    add(trace, 1, 1)
    add(trace, 2, 2)
    ids = {c.tool_call_id for c in trace.calls}
    assert len(ids) == 2


def test_trace_lookup_by_id():
    trace = ToolTrace()
    add(trace, 7, 8)
    tcid = trace.calls[0].tool_call_id
    looked = trace.get(tcid)
    assert looked is trace.calls[0]
    assert trace.get("missing") is None
  • [ ] Step 2: Run to verify FAIL

Run: pytest tests/unit/test_tools.py -v. Expected: ImportError.

  • [ ] Step 3: Implement core/tools.py
python
"""Tool decorator + trace.

A Tool is a pure data function. The @tool decorator:
  - Generates a unique tool_call_id per invocation
  - Records call metadata into a ToolTrace
  - Returns the underlying value unchanged

The trace is the source of truth the Verifier uses to validate cites.
"""

from __future__ import annotations

import functools
import inspect
import secrets
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Callable, ParamSpec, TypeVar

P = ParamSpec("P")
R = TypeVar("R")


@dataclass
class ToolCall:
    tool_call_id: str
    tool_name: str
    source: str
    table: str | None
    args: dict[str, Any]
    value: Any
    fetched_at: datetime


@dataclass
class ToolTrace:
    calls: list[ToolCall] = field(default_factory=list)

    def record(self, call: ToolCall) -> None:
        self.calls.append(call)

    def get(self, tool_call_id: str) -> ToolCall | None:
        for c in self.calls:
            if c.tool_call_id == tool_call_id:
                return c
        return None


def _new_id() -> str:
    return f"tc_{secrets.token_hex(6)}"


def tool(*, source: str, table: str | None = None) -> Callable[[Callable[P, R]], Callable[P, R]]:
    """Decorate a function as a Tool.

    The decorated function MUST take a `trace: ToolTrace` parameter as its first arg.
    """

    def deco(fn: Callable[P, R]) -> Callable[P, R]:
        sig = inspect.signature(fn)
        param_names = list(sig.parameters.keys())
        if not param_names or param_names[0] != "trace":
            raise TypeError(f"@tool function {fn.__name__!r} must accept `trace: ToolTrace` as first param")

        @functools.wraps(fn)
        def wrapper(*args: P.args, **kwargs: P.kwargs) -> R:
            bound = sig.bind(*args, **kwargs)
            bound.apply_defaults()
            trace = bound.arguments["trace"]
            if not isinstance(trace, ToolTrace):
                raise TypeError(f"first arg to {fn.__name__!r} must be ToolTrace, got {type(trace).__name__}")

            value = fn(*args, **kwargs)

            recorded_args = {k: v for k, v in bound.arguments.items() if k != "trace"}
            trace.record(
                ToolCall(
                    tool_call_id=_new_id(),
                    tool_name=fn.__name__,
                    source=source,
                    table=table,
                    args=recorded_args,
                    value=value,
                    fetched_at=datetime.now(timezone.utc),
                )
            )
            return value

        return wrapper

    return deco
  • [ ] Step 4: Run tests

Run: pytest tests/unit/test_tools.py -v. Expected: 3 passed.

  • [ ] Step 5: Commit
bash
git add core/tools.py tests/unit/test_tools.py
git commit -m "feat(tools): @tool decorator + ToolTrace with unique call IDs"

Task 5: Skill Base Class

Files:

  • Create: core/skills.py
  • Create: tests/unit/test_skills.py

A Skill packages a name and a run(trace, **inputs) -> SkillResult method. Skills compose Tools inside run rather than calling data sources directly; the contract is that run returns claims (not raw values) plus an optional summary, and the base class enforces that every concrete Skill sets name.

  • [ ] Step 1: Write failing tests

Create tests/unit/test_skills.py:

python
from datetime import datetime, timezone
import pytest

from core.citation import NumericClaim, ToolCite
from core.skills import Skill, SkillResult
from core.tools import ToolTrace


class DummySkill(Skill):
    name = "dummy"

    def run(self, trace: ToolTrace, **inputs: object) -> SkillResult:
        return SkillResult(
            claims=[
                NumericClaim(
                    value=1.0,
                    metric="x",
                    code="000001",
                    as_of="2026-04-28",
                    cite=ToolCite(source="t", fetched_at=datetime.now(timezone.utc), tool_call_id="tc_xx"),
                )
            ],
            summary="dummy ran",
        )


def test_skill_returns_skill_result():
    s = DummySkill()
    out = s.run(ToolTrace())
    assert isinstance(out, SkillResult)
    assert len(out.claims) == 1


def test_skill_must_define_name():
    class NoName(Skill):
        def run(self, trace: ToolTrace, **inputs: object) -> SkillResult:
            return SkillResult(claims=[], summary="")

    with pytest.raises(TypeError):
        NoName()
  • [ ] Step 2: Run to verify FAIL

Run: pytest tests/unit/test_skills.py -v

  • [ ] Step 3: Implement core/skills.py
python
"""Skill base class.

Skills do reasoning + composition of Tools. They MUST NOT call data
sources directly. Skills return SkillResult, which is a list of claims
plus an optional human-readable summary.
"""

from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any

from core.citation import KnowledgeClaim, NumericClaim
from core.tools import ToolTrace


@dataclass
class SkillResult:
    claims: list[NumericClaim | KnowledgeClaim] = field(default_factory=list)
    summary: str = ""


class Skill:
    name: str = ""

    def __init_subclass__(cls, **kwargs: Any) -> None:
        super().__init_subclass__(**kwargs)
        # Allow abstract intermediate subclasses (those without `name`) but
        # reject instantiation via __init__ instead of here, since name
        # may be set on a deeper subclass.

    def __init__(self) -> None:
        if not self.name:
            raise TypeError(f"{type(self).__name__} must set `name`")

    def run(self, trace: ToolTrace, **inputs: Any) -> SkillResult:
        raise NotImplementedError
  • [ ] Step 4: Run tests

Run: pytest tests/unit/test_skills.py -v. Expected: 2 passed.

  • [ ] Step 5: Commit
bash
git add core/skills.py tests/unit/test_skills.py
git commit -m "feat(skills): Skill base class returning SkillResult"

Task 6: Verifier

Files:

  • Create: core/verifier.py
  • Create: tests/unit/test_verifier.py

The Verifier runs deterministic checks against a SkillResult + ToolTrace + CompetenceRegistry. Returns a structured pass/fail with reasons.

  • [ ] Step 1: Write failing tests covering each check

Create tests/unit/test_verifier.py:

python
from datetime import datetime, timedelta, timezone

from core.citation import CompetenceCite, KnowledgeClaim, NumericClaim, ToolCite
from core.competences import Competence, CompetenceRegistry
from core.skills import SkillResult
from core.tools import ToolCall, ToolTrace
from core.verifier import StalenessBudget, VerificationResult, Verifier


def _trace_with(tool_call_id: str = "tc_1", value: float = 42.5) -> ToolTrace:
    t = ToolTrace()
    t.record(
        ToolCall(
            tool_call_id=tool_call_id,
            tool_name="get_pe",
            source="tushare",
            table="daily_basic",
            args={},
            value=value,
            fetched_at=datetime.now(timezone.utc),
        )
    )
    return t


def _ok_numeric_claim(tool_call_id: str = "tc_1", value: float = 42.5) -> NumericClaim:
    return NumericClaim(
        value=value,
        metric="P/E",
        code="600519.SH",
        as_of=datetime.now(timezone.utc).date().isoformat(),
        cite=ToolCite(source="tushare", table="daily_basic", fetched_at=datetime.now(timezone.utc), tool_call_id=tool_call_id),
    )


def test_passes_when_all_checks_ok():
    reg = CompetenceRegistry()
    trace = _trace_with()
    result = SkillResult(claims=[_ok_numeric_claim()])
    res = Verifier(reg).verify(result, trace)
    assert isinstance(res, VerificationResult)
    assert res.ok, res.failures


def test_fails_when_tool_call_id_missing_from_trace():
    reg = CompetenceRegistry()
    trace = _trace_with("tc_real")
    result = SkillResult(claims=[_ok_numeric_claim(tool_call_id="tc_ghost")])
    res = Verifier(reg).verify(result, trace)
    assert not res.ok
    assert any("tool_call_id" in f.reason for f in res.failures)


def test_fails_when_value_mismatch():
    reg = CompetenceRegistry()
    trace = _trace_with("tc_1", value=99.9)
    result = SkillResult(claims=[_ok_numeric_claim(tool_call_id="tc_1", value=42.5)])
    res = Verifier(reg).verify(result, trace)
    assert not res.ok
    assert any("value" in f.reason.lower() for f in res.failures)


def test_fails_when_competence_unregistered():
    reg = CompetenceRegistry()
    claim = KnowledgeClaim(claim="x", cite=CompetenceCite(competence_id="comp.unknown"))
    res = Verifier(reg).verify(SkillResult(claims=[claim]), ToolTrace())
    assert not res.ok
    assert any("competence" in f.reason.lower() for f in res.failures)


def test_passes_when_competence_registered():
    reg = CompetenceRegistry()
    reg.register(Competence(id="comp.x", statement="...", source="..."))
    claim = KnowledgeClaim(claim="x", cite=CompetenceCite(competence_id="comp.x"))
    res = Verifier(reg).verify(SkillResult(claims=[claim]), ToolTrace())
    assert res.ok


def test_fails_when_stale():
    reg = CompetenceRegistry()
    old_ts = datetime.now(timezone.utc) - timedelta(days=10)
    trace = ToolTrace()
    trace.record(
        ToolCall(
            tool_call_id="tc_old",
            tool_name="get_pe",
            source="tushare",
            table="daily_basic",
            args={},
            value=42.5,
            fetched_at=old_ts,
        )
    )
    claim = NumericClaim(
        value=42.5, metric="P/E", code="600519.SH", as_of=old_ts.date().isoformat(),
        cite=ToolCite(source="tushare", table="daily_basic", fetched_at=old_ts, tool_call_id="tc_old"),
    )
    budget = StalenessBudget(default=timedelta(days=1))
    res = Verifier(reg, staleness=budget).verify(SkillResult(claims=[claim]), trace)
    assert not res.ok
    assert any("stale" in f.reason.lower() for f in res.failures)
  • [ ] Step 2: Run to verify FAIL

Run: pytest tests/unit/test_verifier.py -v

  • [ ] Step 3: Implement core/verifier.py
python
"""Verifier — deterministic checks over a SkillResult + ToolTrace.

Checks (per spec §3.3):
  1. Tool resolution: every ToolCite.tool_call_id is in the trace.
  2. Value match: claim.value == trace[tool_call_id].value (within float tol).
  3. Competence resolution: every CompetenceCite.competence_id is registered.
  4. Freshness: ToolCite.fetched_at within staleness budget.
"""

from __future__ import annotations

import math
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone

from core.citation import CompetenceCite, KnowledgeClaim, NumericClaim, ToolCite
from core.competences import CompetenceRegistry
from core.skills import SkillResult
from core.tools import ToolTrace


@dataclass
class StalenessBudget:
    default: timedelta = timedelta(days=1)
    per_metric: dict[str, timedelta] = field(default_factory=dict)

    def for_metric(self, metric: str) -> timedelta:
        return self.per_metric.get(metric, self.default)


@dataclass
class VerificationFailure:
    claim_index: int
    reason: str


@dataclass
class VerificationResult:
    ok: bool
    failures: list[VerificationFailure] = field(default_factory=list)


class Verifier:
    def __init__(
        self,
        registry: CompetenceRegistry,
        staleness: StalenessBudget | None = None,
        value_tolerance: float = 1e-9,
    ) -> None:
        self.registry = registry
        self.staleness = staleness or StalenessBudget(default=timedelta(days=365 * 10))
        self.tol = value_tolerance

    def verify(self, result: SkillResult, trace: ToolTrace) -> VerificationResult:
        failures: list[VerificationFailure] = []
        for i, claim in enumerate(result.claims):
            for f in self._check_claim(i, claim, trace):
                failures.append(f)
        return VerificationResult(ok=not failures, failures=failures)

    def _check_claim(
        self, i: int, claim: NumericClaim | KnowledgeClaim, trace: ToolTrace
    ) -> list[VerificationFailure]:
        out: list[VerificationFailure] = []
        cite = claim.cite

        if isinstance(cite, ToolCite):
            call = trace.get(cite.tool_call_id)
            if call is None:
                out.append(VerificationFailure(i, f"tool_call_id {cite.tool_call_id!r} not in trace"))
                return out

            if isinstance(claim, NumericClaim):
                if not _values_match(claim.value, call.value, self.tol):
                    out.append(
                        VerificationFailure(
                            i, f"value {claim.value!r} does not match trace value {call.value!r}"
                        )
                    )

                budget = self.staleness.for_metric(claim.metric)
                age = datetime.now(timezone.utc) - cite.fetched_at
                if age > budget:
                    out.append(VerificationFailure(i, f"stale: age {age} > budget {budget}"))

        elif isinstance(cite, CompetenceCite):
            if not self.registry.exists(cite.competence_id):
                out.append(VerificationFailure(i, f"competence {cite.competence_id!r} not registered"))

        return out


def _values_match(a: float, b: object, tol: float) -> bool:
    if not isinstance(b, (int, float)):
        return False
    return math.isclose(a, float(b), rel_tol=tol, abs_tol=tol)
  • [ ] Step 4: Run tests

Run: pytest tests/unit/test_verifier.py -v. Expected: 6 passed.

  • [ ] Step 5: Commit
bash
git add core/verifier.py tests/unit/test_verifier.py
git commit -m "feat(verifier): deterministic checks for tool/competence cites + staleness"

Task 7: Model Router

Files:

  • Create: core/model_router.py
  • Create: tests/unit/test_model_router.py

A small mapping from task_class to model identifier. No network calls in this MVP — just the routing decision. Skill code asks the router which model to use; the actual LLM client is injected separately.

  • [ ] Step 1: Write failing tests

Create tests/unit/test_model_router.py:

python
import pytest

from core.model_router import ModelRouter, TaskClass, UnknownTaskClassError


def test_default_routing():
    r = ModelRouter()
    assert r.route(TaskClass.PLANNING) == "hermes-default"
    assert r.route(TaskClass.PDF_PARSING) == "claude-sonnet"
    assert r.route(TaskClass.SYNTHESIS) == "claude-opus"
    assert r.route(TaskClass.EMBEDDING) == "bge-m3"


def test_override_per_class():
    r = ModelRouter(overrides={TaskClass.SYNTHESIS: "gpt-4o"})
    assert r.route(TaskClass.SYNTHESIS) == "gpt-4o"
    assert r.route(TaskClass.PLANNING) == "hermes-default"


def test_unknown_class_rejected_at_route():
    r = ModelRouter()
    with pytest.raises(UnknownTaskClassError):
        r.route("not-a-class")  # type: ignore[arg-type]
  • [ ] Step 2: Run to verify FAIL

Run: pytest tests/unit/test_model_router.py -v

  • [ ] Step 3: Implement core/model_router.py
python
"""Model Router — task class → model identifier.

The MVP returns a string ID; an LLM client layer (out of scope for this
plan) translates IDs into actual API calls.
"""

from __future__ import annotations

from enum import Enum


class TaskClass(str, Enum):
    PLANNING = "planning"
    PDF_PARSING = "pdf_parsing"
    SYNTHESIS = "synthesis"
    EMBEDDING = "embedding"


class UnknownTaskClassError(ValueError):
    pass


_DEFAULTS: dict[TaskClass, str] = {
    TaskClass.PLANNING: "hermes-default",
    TaskClass.PDF_PARSING: "claude-sonnet",
    TaskClass.SYNTHESIS: "claude-opus",
    TaskClass.EMBEDDING: "bge-m3",
}


class ModelRouter:
    def __init__(self, overrides: dict[TaskClass, str] | None = None) -> None:
        self._table: dict[TaskClass, str] = {**_DEFAULTS, **(overrides or {})}

    def route(self, task_class: TaskClass) -> str:
        if not isinstance(task_class, TaskClass):
            raise UnknownTaskClassError(f"{task_class!r} is not a TaskClass")
        return self._table[task_class]
  • [ ] Step 4: Run tests

Run: pytest tests/unit/test_model_router.py -v. Expected: 3 passed.

  • [ ] Step 5: Commit
bash
git add core/model_router.py tests/unit/test_model_router.py
git commit -m "feat(model-router): task-class → model id mapping with overrides"

Task 8: Single-Agent Runtime

Files:

  • Create: core/runtime.py
  • Create: tests/unit/test_runtime.py

The runtime executes a single Skill, passes its result through the Verifier, and on first failure invokes the Skill once more with the verifier feedback in inputs["verifier_feedback"]. After the second failure it returns a VerifiedFailure. No silent fallback.

  • [ ] Step 1: Write failing tests

Create tests/unit/test_runtime.py:

python
from datetime import datetime, timezone

from core.citation import NumericClaim, ToolCite
from core.competences import CompetenceRegistry
from core.runtime import RunOutcome, SingleAgentRuntime, VerifiedFailure, VerifiedSuccess
from core.skills import Skill, SkillResult
from core.tools import ToolCall, ToolTrace
from core.verifier import Verifier


class GoodSkill(Skill):
    name = "good"

    def run(self, trace: ToolTrace, **inputs: object) -> SkillResult:
        trace.record(
            ToolCall(
                tool_call_id="tc_1",
                tool_name="t",
                source="s",
                table=None,
                args={},
                value=42.5,
                fetched_at=datetime.now(timezone.utc),
            )
        )
        return SkillResult(
            claims=[
                NumericClaim(
                    value=42.5, metric="P/E", code="600519.SH", as_of="2026-04-28",
                    cite=ToolCite(source="s", fetched_at=datetime.now(timezone.utc), tool_call_id="tc_1"),
                )
            ],
            summary="ok",
        )


class FlakeySkill(Skill):
    """Fails first time (wrong value), succeeds second time."""
    name = "flakey"

    def __init__(self) -> None:
        super().__init__()
        self._calls = 0

    def run(self, trace: ToolTrace, **inputs: object) -> SkillResult:
        self._calls += 1
        wrong = self._calls == 1
        trace.record(
            ToolCall(
                tool_call_id=f"tc_{self._calls}", tool_name="t", source="s", table=None,
                args={}, value=42.5, fetched_at=datetime.now(timezone.utc),
            )
        )
        return SkillResult(
            claims=[
                NumericClaim(
                    value=999.0 if wrong else 42.5, metric="P/E", code="x", as_of="2026-04-28",
                    cite=ToolCite(
                        source="s", fetched_at=datetime.now(timezone.utc),
                        tool_call_id=f"tc_{self._calls}",
                    ),
                )
            ],
            summary="flakey",
        )


class AlwaysWrongSkill(Skill):
    name = "wrong"

    def run(self, trace: ToolTrace, **inputs: object) -> SkillResult:
        trace.record(
            ToolCall(
                tool_call_id="tc_w", tool_name="t", source="s", table=None,
                args={}, value=42.5, fetched_at=datetime.now(timezone.utc),
            )
        )
        return SkillResult(
            claims=[
                NumericClaim(
                    value=1.0, metric="P/E", code="x", as_of="2026-04-28",
                    cite=ToolCite(source="s", fetched_at=datetime.now(timezone.utc), tool_call_id="tc_w"),
                )
            ],
            summary="wrong",
        )


def _runtime() -> SingleAgentRuntime:
    return SingleAgentRuntime(verifier=Verifier(CompetenceRegistry()))


def test_success_on_first_try():
    out: RunOutcome = _runtime().run(GoodSkill())
    assert isinstance(out, VerifiedSuccess)
    assert out.attempts == 1
    assert out.result.summary == "ok"


def test_retries_once_on_failure_then_succeeds():
    out = _runtime().run(FlakeySkill())
    assert isinstance(out, VerifiedSuccess)
    assert out.attempts == 2


def test_returns_failure_after_two_attempts():
    out = _runtime().run(AlwaysWrongSkill())
    assert isinstance(out, VerifiedFailure)
    assert out.attempts == 2
    assert len(out.last_failures) > 0
  • [ ] Step 2: Run to verify FAIL

Run: pytest tests/unit/test_runtime.py -v

  • [ ] Step 3: Implement core/runtime.py
python
"""Single-agent runtime with retry-once-on-verifier-fail policy.

Returns either VerifiedSuccess or VerifiedFailure. Never silently
falls back to unverified output.
"""

from __future__ import annotations

from dataclasses import dataclass
from typing import Union

from core.skills import Skill, SkillResult
from core.tools import ToolTrace
from core.verifier import VerificationFailure, Verifier


@dataclass
class VerifiedSuccess:
    result: SkillResult
    trace: ToolTrace
    attempts: int


@dataclass
class VerifiedFailure:
    last_result: SkillResult
    trace: ToolTrace
    attempts: int
    last_failures: list[VerificationFailure]


RunOutcome = Union[VerifiedSuccess, VerifiedFailure]


class SingleAgentRuntime:
    def __init__(self, verifier: Verifier, max_attempts: int = 2) -> None:
        self.verifier = verifier
        self.max_attempts = max_attempts

    def run(self, skill: Skill, **inputs: object) -> RunOutcome:
        last_result: SkillResult | None = None
        last_failures: list[VerificationFailure] = []
        trace = ToolTrace()

        for attempt in range(1, self.max_attempts + 1):
            extra = {"verifier_feedback": last_failures} if last_failures else {}
            result = skill.run(trace, **inputs, **extra)
            ver = self.verifier.verify(result, trace)
            if ver.ok:
                return VerifiedSuccess(result=result, trace=trace, attempts=attempt)
            last_result = result
            last_failures = ver.failures

        assert last_result is not None
        return VerifiedFailure(
            last_result=last_result,
            trace=trace,
            attempts=self.max_attempts,
            last_failures=last_failures,
        )
  • [ ] Step 4: Run tests

Run: pytest tests/unit/test_runtime.py -v. Expected: 3 passed.

  • [ ] Step 5: Run the full suite

Run: pytest -q. Expected: all tests so far pass (~27 tests).

  • [ ] Step 6: Commit
bash
git add core/runtime.py tests/unit/test_runtime.py
git commit -m "feat(runtime): single-agent loop with retry-once-on-verifier-fail"

Task 9: Example Tool — get_pe

Files:

  • Create: core/examples/__init__.py
  • Create: core/examples/price_tool.py
  • Create: tests/unit/test_price_tool.py

This is a concrete tool that calls Tushare's daily_basic endpoint to fetch P/E. Real network is mocked with respx in tests; production reads TUSHARE_TOKEN from env.

  • [ ] Step 1: Write failing tests

Create tests/unit/test_price_tool.py:

python
import respx
from httpx import Response

from core.examples.price_tool import get_pe
from core.tools import ToolTrace


@respx.mock
def test_get_pe_parses_tushare_response():
    respx.post("http://api.tushare.pro").mock(
        return_value=Response(
            200,
            json={
                "code": 0,
                "msg": "",
                "data": {
                    "fields": ["ts_code", "trade_date", "pe_ttm"],
                    "items": [["600519.SH", "20260427", 35.42]],
                },
            },
        )
    )

    trace = ToolTrace()
    pe = get_pe(trace, code="600519.SH", trade_date="20260427", token="testtoken")

    assert pe == 35.42
    assert len(trace.calls) == 1
    call = trace.calls[0]
    assert call.tool_name == "get_pe"
    assert call.source == "tushare"
    assert call.table == "daily_basic"
    assert call.value == 35.42
    assert call.args == {"code": "600519.SH", "trade_date": "20260427", "token": "testtoken"}
  • [ ] Step 2: Implement core/examples/__init__.py
python
"""Concrete example Tools and Skills used in integration tests."""
  • [ ] Step 3: Implement core/examples/price_tool.py
python
"""get_pe — fetch trailing P/E from Tushare daily_basic.

Network call goes to http://api.tushare.pro. Token is passed explicitly
to keep the function pure and testable; in production, the calling
Skill resolves it from env (TUSHARE_TOKEN).
"""

from __future__ import annotations

import httpx

from core.tools import ToolTrace, tool


class TushareError(RuntimeError):
    pass


@tool(source="tushare", table="daily_basic")
def get_pe(trace: ToolTrace, *, code: str, trade_date: str, token: str) -> float:
    payload = {
        "api_name": "daily_basic",
        "token": token,
        "params": {"ts_code": code, "trade_date": trade_date},
        "fields": "ts_code,trade_date,pe_ttm",
    }
    resp = httpx.post("http://api.tushare.pro", json=payload, timeout=10.0)
    resp.raise_for_status()
    data = resp.json()
    if data.get("code") != 0:
        raise TushareError(data.get("msg") or "tushare error")
    items = data["data"]["items"]
    if not items:
        raise TushareError(f"no row for {code} on {trade_date}")
    return float(items[0][2])
  • [ ] Step 4: Run tests

Run: pytest tests/unit/test_price_tool.py -v. Expected: 1 passed.

  • [ ] Step 5: Commit
bash
git add core/examples/__init__.py core/examples/price_tool.py tests/unit/test_price_tool.py
git commit -m "feat(examples): get_pe tool against Tushare daily_basic"

Task 10: Example Skill — PriceSummarySkill

Files:

  • Create: core/examples/price_summary.py
  • Create: tests/unit/test_price_summary.py

A Skill that calls get_pe once and emits one NumericClaim. Uses the most recent ToolCall to build the cite.

  • [ ] Step 1: Write failing tests

Create tests/unit/test_price_summary.py:

python
import respx
from httpx import Response

from core.examples.price_summary import PriceSummarySkill
from core.tools import ToolTrace


@respx.mock
def test_price_summary_emits_cited_claim():
    respx.post("http://api.tushare.pro").mock(
        return_value=Response(200, json={
            "code": 0, "msg": "",
            "data": {"fields": ["ts_code", "trade_date", "pe_ttm"],
                     "items": [["600519.SH", "20260427", 35.42]]},
        })
    )

    trace = ToolTrace()
    result = PriceSummarySkill().run(trace, code="600519.SH", trade_date="20260427", token="t")

    assert len(result.claims) == 1
    claim = result.claims[0]
    assert claim.value == 35.42
    assert claim.metric == "P/E"
    assert claim.code == "600519.SH"
    assert claim.cite.kind == "tool"
    assert claim.cite.tool_call_id == trace.calls[0].tool_call_id
  • [ ] Step 2: Implement core/examples/price_summary.py
python
"""PriceSummarySkill — emits a NumericClaim with P/E for a given code/date."""

from __future__ import annotations

from typing import Any

from core.citation import NumericClaim, ToolCite
from core.examples.price_tool import get_pe
from core.skills import Skill, SkillResult
from core.tools import ToolTrace


class PriceSummarySkill(Skill):
    name = "price_summary"

    def run(self, trace: ToolTrace, **inputs: Any) -> SkillResult:
        code: str = inputs["code"]
        trade_date: str = inputs["trade_date"]
        token: str = inputs["token"]

        pe = get_pe(trace, code=code, trade_date=trade_date, token=token)
        last_call = trace.calls[-1]

        claim = NumericClaim(
            value=pe,
            metric="P/E",
            code=code,
            as_of=f"{trade_date[:4]}-{trade_date[4:6]}-{trade_date[6:]}",
            cite=ToolCite(
                source=last_call.source,
                table=last_call.table,
                fetched_at=last_call.fetched_at,
                tool_call_id=last_call.tool_call_id,
            ),
        )
        return SkillResult(claims=[claim], summary=f"{code} P/E = {pe} as of {trade_date}")
  • [ ] Step 3: Run tests

Run: pytest tests/unit/test_price_summary.py -v. Expected: 1 passed.

  • [ ] Step 4: Commit
bash
git add core/examples/price_summary.py tests/unit/test_price_summary.py
git commit -m "feat(examples): PriceSummarySkill emits cited P/E claim"

Task 11: End-to-End Integration Test

Files:

  • Create: tests/integration/__init__.py
  • Create: tests/integration/test_price_summary_loop.py

Closes the loop: PriceSummarySkill → Runtime → Verifier passes → output is structured + cite-valid.

  • [ ] Step 1: Empty integration package init

Create tests/integration/__init__.py:

python
"""Integration tests — full closed loop."""
  • [ ] Step 2: Write the integration test

Create tests/integration/test_price_summary_loop.py:

python
import respx
from httpx import Response

from core.competences import CompetenceRegistry
from core.examples.price_summary import PriceSummarySkill
from core.runtime import SingleAgentRuntime, VerifiedSuccess
from core.verifier import Verifier


@respx.mock
def test_full_loop_passes_verifier():
    respx.post("http://api.tushare.pro").mock(
        return_value=Response(200, json={
            "code": 0, "msg": "",
            "data": {"fields": ["ts_code", "trade_date", "pe_ttm"],
                     "items": [["600519.SH", "20260427", 35.42]]},
        })
    )

    runtime = SingleAgentRuntime(verifier=Verifier(CompetenceRegistry()))
    out = runtime.run(PriceSummarySkill(), code="600519.SH", trade_date="20260427", token="t")

    assert isinstance(out, VerifiedSuccess)
    assert out.attempts == 1
    claim = out.result.claims[0]
    assert claim.value == 35.42
    assert claim.cite.tool_call_id == out.trace.calls[0].tool_call_id
  • [ ] Step 3: Run

Run: pytest tests/integration -v. Expected: 1 passed.

  • [ ] Step 4: Run the full suite

Run: pytest -q. Expected: all unit + integration tests pass (~30 tests).

  • [ ] Step 5: Commit
bash
git add tests/integration
git commit -m "test(integration): full closed-loop with PriceSummarySkill + Verifier"

Task 12: Hermes Integration Discovery (Research, no code)

Files:

  • Create: docs/architecture/hermes-integration.md

This is a research-only task. The output is a Markdown doc that answers the open questions from spec §7 so a follow-up plan can wrap core/ inside Hermes.

  • [ ] Step 1: Read Hermes internal docs / source

Engineer reads internal Hermes docs and answers, in docs/architecture/hermes-integration.md:

  1. How are Skills registered in Hermes today? (decorator, manifest, registry call?)
  2. Does Hermes have a Competence registry or equivalent? If not, where can we attach our CompetenceRegistry?
  3. How does Hermes expose the agent loop's tool-call trace? Per-call hook? Final transcript?
  4. How does Hermes pass user input + receive Skill output? JSON envelope shape?
  5. Where does model selection happen — and can we plug ModelRouter in?
  6. Does Hermes have built-in retry semantics that conflict with our SingleAgentRuntime retry?
  • [ ] Step 2: Write the doc

Each question gets an answer with concrete file paths / API names. If a capability is missing, add a "Gap" section listing required adapter work.

  • [ ] Step 3: Update VitePress sidebar

Edit docs/.vitepress/config.ts architecture sidebar to add:

ts
{ text: "Hermes 集成调研", link: "/architecture/hermes-integration" },
  • [ ] Step 4: Build to confirm

Run: npm run docs:build. Expected: build succeeds.

  • [ ] Step 5: Commit
bash
git add docs/architecture/hermes-integration.md docs/.vitepress/config.ts
git commit -m "docs(hermes): integration discovery — Skill/Competence/trace API findings"

Task 13: CI for Tests

Files:

  • Create: .github/workflows/ci.yml

GitHub Actions runs ruff, mypy, and pytest on every push.

  • [ ] Step 1: Write workflow

Create .github/workflows/ci.yml:

yaml
name: CI

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  python:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
          cache: pip
      - run: pip install -e ".[dev]"
      - run: ruff check core tests
      - run: mypy core
      - run: pytest -q

  docs:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
          cache: npm
      - run: npm ci
      - run: npm run docs:build
  • [ ] Step 2: Push and watch
bash
git add .github/workflows/ci.yml
git commit -m "ci: ruff + mypy + pytest + vitepress build"
git push origin main

Open https://github.com/LaCatFly/twilight-drive/actions and confirm both jobs pass.

  • [ ] Step 3: Add CI badge to README (optional)

Edit README.md under the title:

markdown
[![CI](https://github.com/LaCatFly/twilight-drive/actions/workflows/ci.yml/badge.svg)](https://github.com/LaCatFly/twilight-drive/actions/workflows/ci.yml)
  • [ ] Step 4: Commit
bash
git add README.md
git commit -m "docs: add CI badge"
git push origin main

Spec Coverage Check

| Spec section (03-agent-framework.md) | Implemented in |
| --- | --- |
| §2 Layer architecture | Tasks 4 (L3), 5 (L4), 6 (Verifier), 8 (L5) |
| §3.1 Model Router | Task 7 |
| §3.2 Citation Protocol | Task 2 |
| §3.3 Verifier | Task 6 |
| §3.4 State + memory | Deferred — MVP runtime is per-call; session memory + opt-in persistence become Task 14+ in the next plan iteration once team review confirms scope |
| §3.5 Eval (source-grounded) | Achieved implicitly: the runtime and integration tests route every claim through the Verifier; metrics tracking is a follow-up |
| §4 Deployment | Task 1 (pyproject) + Task 13 (CI); containerization is post-MVP |
| §5 Hard rules 1–6 | All enforced by code (Tasks 4, 6, 8) |
| §7 Open items | Task 12 |

Explicitly out-of-scope for this plan:

  • Session memory (B) and per-user persistent memory (C) — small follow-up plan after MVP loop is green.
  • Skill catalog beyond PriceSummarySkill — additions are mechanical given the framework.
  • Production LLM client wiring (Hermes API, Claude, GPT) — Model Router only returns IDs; client is wired in Task 12 follow-up.

Self-Review Notes

  • Type/method names verified consistent across tasks: ToolTrace, ToolCall, Cite, ToolCite, CompetenceCite, NumericClaim, KnowledgeClaim, SkillResult, VerificationResult, VerifiedSuccess, VerifiedFailure, RunOutcome, Verifier.verify, ModelRouter.route, SingleAgentRuntime.run.
  • No placeholders. Every code step shows runnable code; every test step shows the assertion.
  • TDD discipline: every task starts with a failing test; commit happens after green.
