# 03 Agent Framework MVP — Implementation Plan
For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
Goal: Build a framework-agnostic Python reference implementation of the 03 architecture (L2–L5 + Citation Protocol + Verifier + Model Router) that produces a verifiable closed-loop demo: ask "What is 600519's current P/E?" → Skill calls Tool → Tool returns value with cite metadata → Verifier passes → output emitted.
Architecture: Single-agent loop with deterministic Verifier. Citation Protocol uses tagged-union cite (kind: "tool" or kind: "competence"). All cross-layer calls JSON-serializable from day one (containers-ready). Hermes integration is a stub — wrapped later when internal API is documented.
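Because every cross-layer call is JSON-serializable, the two cite variants reduce to plain tagged payloads on the wire. A minimal sketch — field names mirror the Task 2 citation schema, values are illustrative only:

```python
import json

# Illustrative wire shapes; "kind" is the discriminator the consumer switches on.
tool_cite = {
    "kind": "tool",
    "source": "tushare",
    "table": "daily_basic",
    "fetched_at": "2026-04-28T08:14:22+00:00",
    "tool_call_id": "tc_abc123",
}
competence_cite = {
    "kind": "competence",
    "competence_id": "comp.astock.fiscal_calendar.v1",
}


def cite_kind(payload: str) -> str:
    """A consumer dispatches on `kind` alone — no schema sniffing."""
    return json.loads(payload)["kind"]


assert cite_kind(json.dumps(tool_cite)) == "tool"
assert cite_kind(json.dumps(competence_cite)) == "competence"
```

The single `kind` field is what lets the Verifier and any future container boundary route a cite without inspecting the rest of the payload.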
Tech Stack:
- Python 3.11+
- Pydantic v2 (citation schema, Skill I/O contracts)
- pytest + pytest-asyncio (TDD)
- httpx (HTTP client; respx for mocking)
- structlog (JSON tracing)
- ruff + mypy (lint + type-check)
- uv (package manager — fast, modern)
Repository layout (will be created by Task 1):

```text
core/                        # framework-agnostic 03 implementation
├── __init__.py
├── citation.py              # Pydantic Cite, Claim envelopes
├── competences.py           # Competence registry
├── tools.py                 # @tool decorator, ToolTrace
├── skills.py                # Skill base class
├── verifier.py              # deterministic Verifier
├── model_router.py          # task → model mapping
├── runtime.py               # single-agent loop + retry
└── examples/
    ├── __init__.py
    ├── price_tool.py        # get_pe (Tushare; mock-backed in tests)
    └── price_summary.py     # PriceSummarySkill
tests/
├── conftest.py
├── unit/
│   ├── test_citation.py
│   ├── test_competences.py
│   ├── test_tools.py
│   ├── test_skills.py
│   ├── test_verifier.py
│   ├── test_model_router.py
│   ├── test_runtime.py
│   ├── test_price_tool.py
│   └── test_price_summary.py
└── integration/
    └── test_price_summary_loop.py
pyproject.toml
.python-version
.ruff.toml
mypy.ini
```

## Task 1: Project Scaffold
Files:
- Create: pyproject.toml
- Create: .python-version
- Create: .ruff.toml
- Create: mypy.ini
- Create: core/__init__.py
- Create: tests/conftest.py

- [ ] Step 1: Pin Python version

Create .python-version:

```text
3.11
```

- [ ] Step 2: Create pyproject.toml

```toml
[project]
name = "twilight-drive-core"
version = "0.0.1"
description = "03 Agent Framework reference implementation"
requires-python = ">=3.11"
dependencies = [
    "pydantic>=2.6,<3",
    "httpx>=0.27,<1",
    "structlog>=24.1,<25",
]

[project.optional-dependencies]
dev = [
    "pytest>=8.0,<9",
    "pytest-asyncio>=0.23,<1",
    "respx>=0.21,<1",
    "ruff>=0.5,<1",
    "mypy>=1.10,<2",
]

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.hatch.build.targets.wheel]
packages = ["core"]

[tool.pytest.ini_options]
asyncio_mode = "auto"
testpaths = ["tests"]
```

- [ ] Step 3: Create .ruff.toml

```toml
line-length = 100
target-version = "py311"

[lint]
select = ["E", "F", "W", "I", "B", "UP", "SIM", "RUF"]
ignore = ["E501"]
```

- [ ] Step 4: Create mypy.ini

```ini
[mypy]
python_version = 3.11
strict = True
warn_unused_ignores = True
plugins = pydantic.mypy
```

- [ ] Step 5: Empty package init

Create core/__init__.py:

```python
"""Twilight Drive — 03 Agent Framework reference implementation."""
```

- [ ] Step 6: Empty conftest

Create tests/conftest.py:

```python
"""Shared pytest fixtures."""
```

- [ ] Step 7: Install dependencies

Run:

```bash
uv pip install -e ".[dev]"
```

Expected: installs Pydantic, httpx, structlog, pytest, ruff, mypy. No errors.

- [ ] Step 8: Verify pytest discovers nothing yet

Run: pytest -q Expected: no tests ran

- [ ] Step 9: Commit

```bash
git add pyproject.toml .python-version .ruff.toml mypy.ini core/ tests/
git commit -m "chore: scaffold core package and dev tooling"
```

## Task 2: Citation Models
Files:
- Create: core/citation.py
- Create: tests/unit/test_citation.py
The Cite is a discriminated union; claim envelopes wrap a value (numeric) or text (knowledge) plus cite.
- [ ] Step 1: Write failing test for ToolCite validation
Create tests/unit/test_citation.py:
```python
import pytest
from datetime import datetime, timezone
from pydantic import TypeAdapter, ValidationError
from core.citation import Cite, ToolCite, CompetenceCite, NumericClaim, KnowledgeClaim


def test_tool_cite_accepts_required_fields():
    cite = ToolCite(
        source="tushare",
        table="daily_basic",
        fetched_at=datetime(2026, 4, 28, 8, 14, 22, tzinfo=timezone.utc),
        tool_call_id="tc_abc123",
    )
    assert cite.kind == "tool"


def test_competence_cite_accepts_id():
    cite = CompetenceCite(competence_id="comp.astock.fiscal_calendar.v1")
    assert cite.kind == "competence"


def test_cite_union_discriminates_by_kind():
    payload = {"kind": "tool", "source": "x", "fetched_at": "2026-04-28T00:00:00Z", "tool_call_id": "tc_1"}
    # Validate through the annotated union so the discriminator actually selects the variant.
    cite = TypeAdapter(Cite).validate_python(payload)
    assert isinstance(cite, ToolCite)


def test_numeric_claim_requires_cite():
    with pytest.raises(ValidationError):
        NumericClaim(value=42.5, metric="P/E", code="600519.SH", as_of="2026-04-27")  # type: ignore


def test_numeric_claim_round_trip():
    claim = NumericClaim(
        value=42.5,
        metric="P/E",
        code="600519.SH",
        as_of="2026-04-27",
        cite=ToolCite(source="tushare", fetched_at=datetime.now(timezone.utc), tool_call_id="tc_1"),
    )
    dumped = claim.model_dump_json()
    loaded = NumericClaim.model_validate_json(dumped)
    assert loaded == claim


def test_knowledge_claim_requires_competence_cite_or_tool_cite():
    claim = KnowledgeClaim(
        claim="A-share fiscal year ends Dec 31",
        cite=CompetenceCite(competence_id="comp.astock.fiscal_calendar.v1"),
    )
    assert claim.cite.kind == "competence"
```

- [ ] Step 2: Run test to verify it fails
Run: pytest tests/unit/test_citation.py -v Expected: FAIL with ImportError: cannot import name 'Cite' from 'core.citation'
- [ ] Step 3: Implement core/citation.py

```python
"""Citation Protocol — tagged-union Cite + claim envelopes.

A `Cite` is a discriminated union by `kind`:
- "tool" → originated from a tool call
- "competence" → originated from a registered Competence

Every claim emitted by a Skill must carry a `cite`. The Verifier rejects
outputs that fail this contract.
"""
from __future__ import annotations

from datetime import datetime
from typing import Annotated, Literal, Union

from pydantic import BaseModel, Field


class ToolCite(BaseModel):
    kind: Literal["tool"] = "tool"
    source: str
    table: str | None = None
    fetched_at: datetime
    tool_call_id: str


class CompetenceCite(BaseModel):
    kind: Literal["competence"] = "competence"
    competence_id: str


Cite = Annotated[Union[ToolCite, CompetenceCite], Field(discriminator="kind")]


class NumericClaim(BaseModel):
    value: float
    metric: str
    code: str
    as_of: str  # ISO date or period (e.g. "2026Q1")
    cite: Cite


class KnowledgeClaim(BaseModel):
    claim: str
    cite: Cite
```

- [ ] Step 4: Run tests to verify pass
Run: pytest tests/unit/test_citation.py -v Expected: 6 passed.
- [ ] Step 5: Type-check
Run: mypy core/citation.py Expected: Success: no issues found.
- [ ] Step 6: Commit
```bash
git add core/citation.py tests/unit/test_citation.py
git commit -m "feat(citation): tagged-union Cite + numeric/knowledge claim envelopes"
```

## Task 3: Competence Registry
Files:
- Create: core/competences.py
- Create: tests/unit/test_competences.py
Competences are registered at startup and looked up by ID. The registry rejects duplicate IDs and provides a resolution check used by the Verifier.
- [ ] Step 1: Write failing tests
Create tests/unit/test_competences.py:
```python
import pytest

from core.competences import Competence, CompetenceRegistry, DuplicateCompetenceError, UnknownCompetenceError


def test_register_and_lookup():
    reg = CompetenceRegistry()
    comp = Competence(
        id="comp.astock.fiscal_calendar.v1",
        statement="A-share fiscal year ends Dec 31",
        source="CSRC accounting standards (2014)",
    )
    reg.register(comp)
    assert reg.get("comp.astock.fiscal_calendar.v1") is comp
    assert reg.exists("comp.astock.fiscal_calendar.v1")


def test_duplicate_rejected():
    reg = CompetenceRegistry()
    comp = Competence(id="x", statement="...", source="...")
    reg.register(comp)
    with pytest.raises(DuplicateCompetenceError):
        reg.register(comp)


def test_unknown_lookup_raises():
    reg = CompetenceRegistry()
    with pytest.raises(UnknownCompetenceError):
        reg.get("nope")


def test_exists_false_for_unknown():
    reg = CompetenceRegistry()
    assert reg.exists("nope") is False
```

- [ ] Step 2: Run to verify FAIL
Run: pytest tests/unit/test_competences.py -v Expected: ImportError.
- [ ] Step 3: Implement core/competences.py

```python
"""Competence registry.

A Competence is a unit of domain knowledge with provenance:
(id, statement, source). Competences are registered at startup;
the Verifier checks that `competence_id` references resolve.

LLMs cannot fabricate Competences — only registered IDs are valid
citation targets.
"""
from __future__ import annotations

from pydantic import BaseModel


class DuplicateCompetenceError(ValueError):
    """Raised when registering an ID that already exists."""


class UnknownCompetenceError(KeyError):
    """Raised when looking up an ID that was never registered."""


class Competence(BaseModel):
    id: str
    statement: str
    source: str
    ttl_days: int | None = None  # optional staleness override


class CompetenceRegistry:
    def __init__(self) -> None:
        self._items: dict[str, Competence] = {}

    def register(self, comp: Competence) -> None:
        if comp.id in self._items:
            raise DuplicateCompetenceError(comp.id)
        self._items[comp.id] = comp

    def get(self, comp_id: str) -> Competence:
        if comp_id not in self._items:
            raise UnknownCompetenceError(comp_id)
        return self._items[comp_id]

    def exists(self, comp_id: str) -> bool:
        return comp_id in self._items
```

- [ ] Step 4: Run tests
Run: pytest tests/unit/test_competences.py -v Expected: 4 passed.
- [ ] Step 5: Commit
```bash
git add core/competences.py tests/unit/test_competences.py
git commit -m "feat(competences): registry with duplicate + unknown protection"
```

## Task 4: Tool Decorator + ToolTrace
Files:
- Create: core/tools.py
- Create: tests/unit/test_tools.py
The @tool decorator wraps a pure data function: it injects a generated tool_call_id, captures inputs/outputs into a ToolTrace, and returns the raw value. The trace is what the Verifier later consults.
- [ ] Step 1: Write failing tests
Create tests/unit/test_tools.py:
```python
from datetime import datetime, timezone

from core.tools import ToolTrace, tool


@tool(source="testsrc", table="t1")
def add(trace: ToolTrace, a: int, b: int) -> int:
    return a + b


def test_tool_returns_value_and_records_trace():
    trace = ToolTrace()
    result = add(trace, 2, 3)
    assert result == 5
    assert len(trace.calls) == 1
    call = trace.calls[0]
    assert call.tool_name == "add"
    assert call.args == {"a": 2, "b": 3}
    assert call.value == 5
    assert call.source == "testsrc"
    assert call.table == "t1"
    assert call.tool_call_id.startswith("tc_")
    assert isinstance(call.fetched_at, datetime)
    assert call.fetched_at.tzinfo == timezone.utc


def test_tool_call_ids_are_unique():
    trace = ToolTrace()
    add(trace, 1, 1)
    add(trace, 2, 2)
    ids = {c.tool_call_id for c in trace.calls}
    assert len(ids) == 2


def test_trace_lookup_by_id():
    trace = ToolTrace()
    add(trace, 7, 8)
    tcid = trace.calls[0].tool_call_id
    looked = trace.get(tcid)
    assert looked is trace.calls[0]
    assert trace.get("missing") is None
```

- [ ] Step 2: Run to verify FAIL
Run: pytest tests/unit/test_tools.py -v Expected: ImportError.
- [ ] Step 3: Implement core/tools.py

```python
"""Tool decorator + trace.

A Tool is a pure data function. The @tool decorator:
- Generates a unique tool_call_id per invocation
- Records call metadata into a ToolTrace
- Returns the underlying value unchanged

The trace is the source of truth the Verifier uses to validate cites.
"""
from __future__ import annotations

import functools
import inspect
import secrets
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Callable, ParamSpec, TypeVar

P = ParamSpec("P")
R = TypeVar("R")


@dataclass
class ToolCall:
    tool_call_id: str
    tool_name: str
    source: str
    table: str | None
    args: dict[str, Any]
    value: Any
    fetched_at: datetime


@dataclass
class ToolTrace:
    calls: list[ToolCall] = field(default_factory=list)

    def record(self, call: ToolCall) -> None:
        self.calls.append(call)

    def get(self, tool_call_id: str) -> ToolCall | None:
        for c in self.calls:
            if c.tool_call_id == tool_call_id:
                return c
        return None


def _new_id() -> str:
    return f"tc_{secrets.token_hex(6)}"


def tool(*, source: str, table: str | None = None) -> Callable[[Callable[P, R]], Callable[P, R]]:
    """Decorate a function as a Tool.

    The decorated function MUST take a `trace: ToolTrace` parameter as its first arg.
    """
    def deco(fn: Callable[P, R]) -> Callable[P, R]:
        sig = inspect.signature(fn)
        param_names = list(sig.parameters.keys())
        if not param_names or param_names[0] != "trace":
            raise TypeError(f"@tool function {fn.__name__!r} must accept `trace: ToolTrace` as first param")

        @functools.wraps(fn)
        def wrapper(*args: P.args, **kwargs: P.kwargs) -> R:
            bound = sig.bind(*args, **kwargs)
            bound.apply_defaults()
            trace = bound.arguments["trace"]
            if not isinstance(trace, ToolTrace):
                raise TypeError(f"first arg to {fn.__name__!r} must be ToolTrace, got {type(trace).__name__}")
            value = fn(*args, **kwargs)
            recorded_args = {k: v for k, v in bound.arguments.items() if k != "trace"}
            trace.record(
                ToolCall(
                    tool_call_id=_new_id(),
                    tool_name=fn.__name__,
                    source=source,
                    table=table,
                    args=recorded_args,
                    value=value,
                    fetched_at=datetime.now(timezone.utc),
                )
            )
            return value

        return wrapper

    return deco
```

- [ ] Step 4: Run tests
Run: pytest tests/unit/test_tools.py -v Expected: 3 passed.
- [ ] Step 5: Commit
```bash
git add core/tools.py tests/unit/test_tools.py
git commit -m "feat(tools): @tool decorator + ToolTrace with unique call IDs"
```

## Task 5: Skill Base Class
Files:
- Create: core/skills.py
- Create: tests/unit/test_skills.py
A Skill packages: a name, a list of tools it may use, and a run(trace, **inputs) -> ClaimEnvelope method. The Skill base class enforces that run returns claims (not raw values).
- [ ] Step 1: Write failing tests
Create tests/unit/test_skills.py:
```python
from datetime import datetime, timezone

import pytest

from core.citation import NumericClaim, ToolCite
from core.skills import Skill, SkillResult
from core.tools import ToolTrace


class DummySkill(Skill):
    name = "dummy"

    def run(self, trace: ToolTrace, **inputs: object) -> SkillResult:
        return SkillResult(
            claims=[
                NumericClaim(
                    value=1.0,
                    metric="x",
                    code="000001",
                    as_of="2026-04-28",
                    cite=ToolCite(source="t", fetched_at=datetime.now(timezone.utc), tool_call_id="tc_xx"),
                )
            ],
            summary="dummy ran",
        )


def test_skill_returns_skill_result():
    s = DummySkill()
    out = s.run(ToolTrace())
    assert isinstance(out, SkillResult)
    assert len(out.claims) == 1


def test_skill_must_define_name():
    class NoName(Skill):
        def run(self, trace: ToolTrace, **inputs: object) -> SkillResult:
            return SkillResult(claims=[], summary="")

    with pytest.raises(TypeError):
        NoName()
```

- [ ] Step 2: Run to verify FAIL
Run: pytest tests/unit/test_skills.py -v
- [ ] Step 3: Implement core/skills.py

```python
"""Skill base class.

Skills do reasoning + composition of Tools. They MUST NOT call data
sources directly. Skills return SkillResult, which is a list of claims
plus an optional human-readable summary.
"""
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any

from core.citation import KnowledgeClaim, NumericClaim
from core.tools import ToolTrace


@dataclass
class SkillResult:
    claims: list[NumericClaim | KnowledgeClaim] = field(default_factory=list)
    summary: str = ""


class Skill:
    name: str = ""

    # Abstract intermediate subclasses (those without `name`) are allowed;
    # instantiation is rejected in __init__ rather than __init_subclass__,
    # since `name` may be set on a deeper subclass.
    def __init__(self) -> None:
        if not self.name:
            raise TypeError(f"{type(self).__name__} must set `name`")

    def run(self, trace: ToolTrace, **inputs: Any) -> SkillResult:
        raise NotImplementedError
```

- [ ] Step 4: Run tests
Run: pytest tests/unit/test_skills.py -v Expected: 2 passed.
- [ ] Step 5: Commit
```bash
git add core/skills.py tests/unit/test_skills.py
git commit -m "feat(skills): Skill base class returning SkillResult"
```

## Task 6: Verifier
Files:
- Create: core/verifier.py
- Create: tests/unit/test_verifier.py
The Verifier runs deterministic checks against a SkillResult + ToolTrace + CompetenceRegistry. Returns a structured pass/fail with reasons.
- [ ] Step 1: Write failing tests covering each check
Create tests/unit/test_verifier.py:
```python
from datetime import datetime, timedelta, timezone

from core.citation import CompetenceCite, KnowledgeClaim, NumericClaim, ToolCite
from core.competences import Competence, CompetenceRegistry
from core.skills import SkillResult
from core.tools import ToolCall, ToolTrace
from core.verifier import StalenessBudget, VerificationResult, Verifier


def _trace_with(tool_call_id: str = "tc_1", value: float = 42.5) -> ToolTrace:
    t = ToolTrace()
    t.record(
        ToolCall(
            tool_call_id=tool_call_id,
            tool_name="get_pe",
            source="tushare",
            table="daily_basic",
            args={},
            value=value,
            fetched_at=datetime.now(timezone.utc),
        )
    )
    return t


def _ok_numeric_claim(tool_call_id: str = "tc_1", value: float = 42.5) -> NumericClaim:
    return NumericClaim(
        value=value,
        metric="P/E",
        code="600519.SH",
        as_of=datetime.now(timezone.utc).date().isoformat(),
        cite=ToolCite(source="tushare", table="daily_basic", fetched_at=datetime.now(timezone.utc), tool_call_id=tool_call_id),
    )


def test_passes_when_all_checks_ok():
    reg = CompetenceRegistry()
    trace = _trace_with()
    result = SkillResult(claims=[_ok_numeric_claim()])
    res = Verifier(reg).verify(result, trace)
    assert isinstance(res, VerificationResult)
    assert res.ok, res.failures


def test_fails_when_tool_call_id_missing_from_trace():
    reg = CompetenceRegistry()
    trace = _trace_with("tc_real")
    result = SkillResult(claims=[_ok_numeric_claim(tool_call_id="tc_ghost")])
    res = Verifier(reg).verify(result, trace)
    assert not res.ok
    assert any("tool_call_id" in f.reason for f in res.failures)


def test_fails_when_value_mismatch():
    reg = CompetenceRegistry()
    trace = _trace_with("tc_1", value=99.9)
    result = SkillResult(claims=[_ok_numeric_claim(tool_call_id="tc_1", value=42.5)])
    res = Verifier(reg).verify(result, trace)
    assert not res.ok
    assert any("value" in f.reason.lower() for f in res.failures)


def test_fails_when_competence_unregistered():
    reg = CompetenceRegistry()
    claim = KnowledgeClaim(claim="x", cite=CompetenceCite(competence_id="comp.unknown"))
    res = Verifier(reg).verify(SkillResult(claims=[claim]), ToolTrace())
    assert not res.ok
    assert any("competence" in f.reason.lower() for f in res.failures)


def test_passes_when_competence_registered():
    reg = CompetenceRegistry()
    reg.register(Competence(id="comp.x", statement="...", source="..."))
    claim = KnowledgeClaim(claim="x", cite=CompetenceCite(competence_id="comp.x"))
    res = Verifier(reg).verify(SkillResult(claims=[claim]), ToolTrace())
    assert res.ok


def test_fails_when_stale():
    reg = CompetenceRegistry()
    old_ts = datetime.now(timezone.utc) - timedelta(days=10)
    trace = ToolTrace()
    trace.record(
        ToolCall(
            tool_call_id="tc_old",
            tool_name="get_pe",
            source="tushare",
            table="daily_basic",
            args={},
            value=42.5,
            fetched_at=old_ts,
        )
    )
    claim = NumericClaim(
        value=42.5, metric="P/E", code="600519.SH", as_of=old_ts.date().isoformat(),
        cite=ToolCite(source="tushare", table="daily_basic", fetched_at=old_ts, tool_call_id="tc_old"),
    )
    budget = StalenessBudget(default=timedelta(days=1))
    res = Verifier(reg, staleness=budget).verify(SkillResult(claims=[claim]), trace)
    assert not res.ok
    assert any("stale" in f.reason.lower() for f in res.failures)
```

- [ ] Step 2: Run to verify FAIL
Run: pytest tests/unit/test_verifier.py -v
- [ ] Step 3: Implement core/verifier.py

```python
"""Verifier — deterministic checks over a SkillResult + ToolTrace.

Checks (per spec §3.3):
1. Tool resolution: every ToolCite.tool_call_id is in the trace.
2. Value match: claim.value == trace[tool_call_id].value (within float tol).
3. Competence resolution: every CompetenceCite.competence_id is registered.
4. Freshness: ToolCite.fetched_at within staleness budget.
"""
from __future__ import annotations

import math
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone

from core.citation import CompetenceCite, KnowledgeClaim, NumericClaim, ToolCite
from core.competences import CompetenceRegistry
from core.skills import SkillResult
from core.tools import ToolTrace


@dataclass
class StalenessBudget:
    default: timedelta = timedelta(days=1)
    per_metric: dict[str, timedelta] = field(default_factory=dict)

    def for_metric(self, metric: str) -> timedelta:
        return self.per_metric.get(metric, self.default)


@dataclass
class VerificationFailure:
    claim_index: int
    reason: str


@dataclass
class VerificationResult:
    ok: bool
    failures: list[VerificationFailure] = field(default_factory=list)


class Verifier:
    def __init__(
        self,
        registry: CompetenceRegistry,
        staleness: StalenessBudget | None = None,
        value_tolerance: float = 1e-9,
    ) -> None:
        self.registry = registry
        # Effectively disables the freshness check unless a budget is passed explicitly.
        self.staleness = staleness or StalenessBudget(default=timedelta(days=365 * 10))
        self.tol = value_tolerance

    def verify(self, result: SkillResult, trace: ToolTrace) -> VerificationResult:
        failures: list[VerificationFailure] = []
        for i, claim in enumerate(result.claims):
            failures.extend(self._check_claim(i, claim, trace))
        return VerificationResult(ok=not failures, failures=failures)

    def _check_claim(
        self, i: int, claim: NumericClaim | KnowledgeClaim, trace: ToolTrace
    ) -> list[VerificationFailure]:
        out: list[VerificationFailure] = []
        cite = claim.cite
        if isinstance(cite, ToolCite):
            call = trace.get(cite.tool_call_id)
            if call is None:
                out.append(VerificationFailure(i, f"tool_call_id {cite.tool_call_id!r} not in trace"))
                return out
            if isinstance(claim, NumericClaim):
                if not _values_match(claim.value, call.value, self.tol):
                    out.append(
                        VerificationFailure(
                            i, f"value {claim.value!r} does not match trace value {call.value!r}"
                        )
                    )
                budget = self.staleness.for_metric(claim.metric)
                age = datetime.now(timezone.utc) - cite.fetched_at
                if age > budget:
                    out.append(VerificationFailure(i, f"stale: age {age} > budget {budget}"))
        elif isinstance(cite, CompetenceCite):
            if not self.registry.exists(cite.competence_id):
                out.append(VerificationFailure(i, f"competence {cite.competence_id!r} not registered"))
        return out


def _values_match(a: float, b: object, tol: float) -> bool:
    if not isinstance(b, (int, float)):
        return False
    return math.isclose(a, float(b), rel_tol=tol, abs_tol=tol)
```

- [ ] Step 4: Run tests
Run: pytest tests/unit/test_verifier.py -v Expected: 6 passed.
- [ ] Step 5: Commit
```bash
git add core/verifier.py tests/unit/test_verifier.py
git commit -m "feat(verifier): deterministic checks for tool/competence cites + staleness"
```

## Task 7: Model Router
Files:
- Create: core/model_router.py
- Create: tests/unit/test_model_router.py
A small mapping from task_class to model identifier. No network calls in this MVP — just the routing decision. Skill code asks the router which model to use; the actual LLM client is injected separately.
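The injection seam between router and client can be sketched as follows — `LLMClient` and `EchoClient` are hypothetical names used only for this illustration, not files in this plan; only the routing entry mirrors the plan's defaults:

```python
from typing import Protocol


class LLMClient(Protocol):
    """Hypothetical client interface; the real client layer is out of scope here."""
    def complete(self, model_id: str, prompt: str) -> str: ...


class EchoClient:
    """Stand-in client for the sketch: records which model id was chosen."""
    def complete(self, model_id: str, prompt: str) -> str:
        return f"[{model_id}] {prompt}"


# Mirrors one entry of the plan's default routing table (synthesis → claude-opus).
ROUTES = {"synthesis": "claude-opus"}


def synthesize(client: LLMClient, prompt: str) -> str:
    # The Skill asks only for a routing decision; the injected client does the call.
    return client.complete(ROUTES["synthesis"], prompt)


assert synthesize(EchoClient(), "summarize Q1") == "[claude-opus] summarize Q1"
```

Keeping the router pure (a dict lookup) is what makes it trivially testable without any network access.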
- [ ] Step 1: Write failing tests
Create tests/unit/test_model_router.py:
```python
import pytest

from core.model_router import ModelRouter, TaskClass, UnknownTaskClassError


def test_default_routing():
    r = ModelRouter()
    assert r.route(TaskClass.PLANNING) == "hermes-default"
    assert r.route(TaskClass.PDF_PARSING) == "claude-sonnet"
    assert r.route(TaskClass.SYNTHESIS) == "claude-opus"
    assert r.route(TaskClass.EMBEDDING) == "bge-m3"


def test_override_per_class():
    r = ModelRouter(overrides={TaskClass.SYNTHESIS: "gpt-4o"})
    assert r.route(TaskClass.SYNTHESIS) == "gpt-4o"
    assert r.route(TaskClass.PLANNING) == "hermes-default"


def test_unknown_class_rejected_at_route():
    r = ModelRouter()
    with pytest.raises(UnknownTaskClassError):
        r.route("not-a-class")  # type: ignore[arg-type]
```

- [ ] Step 2: Run to verify FAIL
Run: pytest tests/unit/test_model_router.py -v
- [ ] Step 3: Implement core/model_router.py

```python
"""Model Router — task class → model identifier.

The MVP returns a string ID; an LLM client layer (out of scope for this
plan) translates IDs into actual API calls.
"""
from __future__ import annotations

from enum import Enum


class TaskClass(str, Enum):
    PLANNING = "planning"
    PDF_PARSING = "pdf_parsing"
    SYNTHESIS = "synthesis"
    EMBEDDING = "embedding"


class UnknownTaskClassError(ValueError):
    pass


_DEFAULTS: dict[TaskClass, str] = {
    TaskClass.PLANNING: "hermes-default",
    TaskClass.PDF_PARSING: "claude-sonnet",
    TaskClass.SYNTHESIS: "claude-opus",
    TaskClass.EMBEDDING: "bge-m3",
}


class ModelRouter:
    def __init__(self, overrides: dict[TaskClass, str] | None = None) -> None:
        self._table: dict[TaskClass, str] = {**_DEFAULTS, **(overrides or {})}

    def route(self, task_class: TaskClass) -> str:
        if not isinstance(task_class, TaskClass):
            raise UnknownTaskClassError(f"{task_class!r} is not a TaskClass")
        return self._table[task_class]
```

- [ ] Step 4: Run tests
Run: pytest tests/unit/test_model_router.py -v Expected: 3 passed.
- [ ] Step 5: Commit
```bash
git add core/model_router.py tests/unit/test_model_router.py
git commit -m "feat(model-router): task-class → model id mapping with overrides"
```

## Task 8: Single-Agent Runtime
Files:
- Create: core/runtime.py
- Create: tests/unit/test_runtime.py
The runtime executes a single Skill, passes its result through the Verifier, and on first failure invokes the Skill once more with the verifier feedback in inputs["verifier_feedback"]. After the second failure it returns a VerifiedFailure. No silent fallback.
- [ ] Step 1: Write failing tests
Create tests/unit/test_runtime.py:
```python
from datetime import datetime, timezone

from core.citation import NumericClaim, ToolCite
from core.competences import CompetenceRegistry
from core.runtime import RunOutcome, SingleAgentRuntime, VerifiedFailure, VerifiedSuccess
from core.skills import Skill, SkillResult
from core.tools import ToolCall, ToolTrace
from core.verifier import Verifier


class GoodSkill(Skill):
    name = "good"

    def run(self, trace: ToolTrace, **inputs: object) -> SkillResult:
        trace.record(
            ToolCall(
                tool_call_id="tc_1",
                tool_name="t",
                source="s",
                table=None,
                args={},
                value=42.5,
                fetched_at=datetime.now(timezone.utc),
            )
        )
        return SkillResult(
            claims=[
                NumericClaim(
                    value=42.5, metric="P/E", code="600519.SH", as_of="2026-04-28",
                    cite=ToolCite(source="s", fetched_at=datetime.now(timezone.utc), tool_call_id="tc_1"),
                )
            ],
            summary="ok",
        )


class FlakeySkill(Skill):
    """Fails first time (wrong value), succeeds second time."""

    name = "flakey"

    def __init__(self) -> None:
        super().__init__()
        self._calls = 0

    def run(self, trace: ToolTrace, **inputs: object) -> SkillResult:
        self._calls += 1
        wrong = self._calls == 1
        trace.record(
            ToolCall(
                tool_call_id=f"tc_{self._calls}", tool_name="t", source="s", table=None,
                args={}, value=42.5, fetched_at=datetime.now(timezone.utc),
            )
        )
        return SkillResult(
            claims=[
                NumericClaim(
                    value=999.0 if wrong else 42.5, metric="P/E", code="x", as_of="2026-04-28",
                    cite=ToolCite(
                        source="s", fetched_at=datetime.now(timezone.utc),
                        tool_call_id=f"tc_{self._calls}",
                    ),
                )
            ],
            summary="flakey",
        )


class AlwaysWrongSkill(Skill):
    name = "wrong"

    def run(self, trace: ToolTrace, **inputs: object) -> SkillResult:
        trace.record(
            ToolCall(
                tool_call_id="tc_w", tool_name="t", source="s", table=None,
                args={}, value=42.5, fetched_at=datetime.now(timezone.utc),
            )
        )
        return SkillResult(
            claims=[
                NumericClaim(
                    value=1.0, metric="P/E", code="x", as_of="2026-04-28",
                    cite=ToolCite(source="s", fetched_at=datetime.now(timezone.utc), tool_call_id="tc_w"),
                )
            ],
            summary="wrong",
        )


def _runtime() -> SingleAgentRuntime:
    return SingleAgentRuntime(verifier=Verifier(CompetenceRegistry()))


def test_success_on_first_try():
    out: RunOutcome = _runtime().run(GoodSkill())
    assert isinstance(out, VerifiedSuccess)
    assert out.attempts == 1
    assert out.result.summary == "ok"


def test_retries_once_on_failure_then_succeeds():
    out = _runtime().run(FlakeySkill())
    assert isinstance(out, VerifiedSuccess)
    assert out.attempts == 2


def test_returns_failure_after_two_attempts():
    out = _runtime().run(AlwaysWrongSkill())
    assert isinstance(out, VerifiedFailure)
    assert out.attempts == 2
    assert len(out.last_failures) > 0
```

- [ ] Step 2: Run to verify FAIL
Run: pytest tests/unit/test_runtime.py -v
- [ ] Step 3: Implement core/runtime.py

```python
"""Single-agent runtime with retry-once-on-verifier-fail policy.

Returns either VerifiedSuccess or VerifiedFailure. Never silently
falls back to unverified output.
"""
from __future__ import annotations

from dataclasses import dataclass
from typing import Union

from core.skills import Skill, SkillResult
from core.tools import ToolTrace
from core.verifier import VerificationFailure, Verifier


@dataclass
class VerifiedSuccess:
    result: SkillResult
    trace: ToolTrace
    attempts: int


@dataclass
class VerifiedFailure:
    last_result: SkillResult
    trace: ToolTrace
    attempts: int
    last_failures: list[VerificationFailure]


RunOutcome = Union[VerifiedSuccess, VerifiedFailure]


class SingleAgentRuntime:
    def __init__(self, verifier: Verifier, max_attempts: int = 2) -> None:
        self.verifier = verifier
        self.max_attempts = max_attempts

    def run(self, skill: Skill, **inputs: object) -> RunOutcome:
        last_result: SkillResult | None = None
        last_failures: list[VerificationFailure] = []
        trace = ToolTrace()
        for attempt in range(1, self.max_attempts + 1):
            extra = {"verifier_feedback": last_failures} if last_failures else {}
            result = skill.run(trace, **inputs, **extra)
            ver = self.verifier.verify(result, trace)
            if ver.ok:
                return VerifiedSuccess(result=result, trace=trace, attempts=attempt)
            last_result = result
            last_failures = ver.failures
        assert last_result is not None
        return VerifiedFailure(
            last_result=last_result,
            trace=trace,
            attempts=self.max_attempts,
            last_failures=last_failures,
        )
```

- [ ] Step 4: Run tests
Run: pytest tests/unit/test_runtime.py -v Expected: 3 passed.
- [ ] Step 5: Run the full suite
Run: pytest -q Expected: all tests so far pass (27 tests).
- [ ] Step 6: Commit
```bash
git add core/runtime.py tests/unit/test_runtime.py
git commit -m "feat(runtime): single-agent loop with retry-once-on-verifier-fail"
```

## Task 9: Example Tool — get_pe
Files:
- Create: core/examples/__init__.py
- Create: core/examples/price_tool.py
- Create: tests/unit/test_price_tool.py
This is a concrete tool that calls Tushare's daily_basic endpoint to fetch P/E. Real network is mocked with respx in tests; production reads TUSHARE_TOKEN from env.
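Production token resolution can be sketched as a small helper — `resolve_token` is a hypothetical name for illustration; the plan itself only specifies that the env var is TUSHARE_TOKEN:

```python
import os


def resolve_token() -> str:
    """Hypothetical helper: read TUSHARE_TOKEN from the environment, failing loudly.

    Keeping env access out of get_pe itself is what keeps the tool pure
    and directly testable against a mocked HTTP layer.
    """
    token = os.environ.get("TUSHARE_TOKEN", "")
    if not token:
        raise RuntimeError("TUSHARE_TOKEN is not set")
    return token


os.environ["TUSHARE_TOKEN"] = "demo-token"  # for this sketch only
assert resolve_token() == "demo-token"
```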
- [ ] Step 1: Write failing tests
Create tests/unit/test_price_tool.py:
```python
import respx
from httpx import Response

from core.examples.price_tool import get_pe
from core.tools import ToolTrace


@respx.mock
def test_get_pe_parses_tushare_response():
    respx.post("http://api.tushare.pro").mock(
        return_value=Response(
            200,
            json={
                "code": 0,
                "msg": "",
                "data": {
                    "fields": ["ts_code", "trade_date", "pe_ttm"],
                    "items": [["600519.SH", "20260427", 35.42]],
                },
            },
        )
    )
    trace = ToolTrace()
    pe = get_pe(trace, code="600519.SH", trade_date="20260427", token="testtoken")
    assert pe == 35.42
    assert len(trace.calls) == 1
    call = trace.calls[0]
    assert call.tool_name == "get_pe"
    assert call.source == "tushare"
    assert call.table == "daily_basic"
    assert call.value == 35.42
    assert call.args == {"code": "600519.SH", "trade_date": "20260427", "token": "testtoken"}
```

- [ ] Step 2: Implement core/examples/__init__.py

```python
"""Concrete example Tools and Skills used in integration tests."""
```

- [ ] Step 3: Implement core/examples/price_tool.py

```python
"""get_pe — fetch trailing P/E from Tushare daily_basic.

Network call goes to http://api.tushare.pro. Token is passed explicitly
to keep the function pure and testable; in production, the calling
Skill resolves it from env (TUSHARE_TOKEN).
"""
from __future__ import annotations

import httpx

from core.tools import ToolTrace, tool


class TushareError(RuntimeError):
    pass


@tool(source="tushare", table="daily_basic")
def get_pe(trace: ToolTrace, *, code: str, trade_date: str, token: str) -> float:
    payload = {
        "api_name": "daily_basic",
        "token": token,
        "params": {"ts_code": code, "trade_date": trade_date},
        "fields": "ts_code,trade_date,pe_ttm",
    }
    resp = httpx.post("http://api.tushare.pro", json=payload, timeout=10.0)
    resp.raise_for_status()
    data = resp.json()
    if data.get("code") != 0:
        raise TushareError(data.get("msg") or "tushare error")
    items = data["data"]["items"]
    if not items:
        raise TushareError(f"no row for {code} on {trade_date}")
    return float(items[0][2])
```

- [ ] Step 4: Run tests
Run: pytest tests/unit/test_price_tool.py -v Expected: 1 passed.
- [ ] Step 5: Commit

```bash
git add core/examples/__init__.py core/examples/price_tool.py tests/unit/test_price_tool.py
git commit -m "feat(examples): get_pe tool against Tushare daily_basic"
```

Task 10: Example Skill — PriceSummarySkill
Files:
- Create: core/examples/price_summary.py
- Create: tests/unit/test_price_summary.py
A Skill that calls get_pe once and emits one NumericClaim. Uses the most recent ToolCall to build the cite.
- [ ] Step 1: Write failing tests
Create tests/unit/test_price_summary.py:

```python
import respx
from httpx import Response

from core.examples.price_summary import PriceSummarySkill
from core.tools import ToolTrace


@respx.mock
def test_price_summary_emits_cited_claim():
    respx.post("http://api.tushare.pro").mock(
        return_value=Response(200, json={
            "code": 0, "msg": "",
            "data": {"fields": ["ts_code", "trade_date", "pe_ttm"],
                     "items": [["600519.SH", "20260427", 35.42]]},
        })
    )
    trace = ToolTrace()
    result = PriceSummarySkill().run(trace, code="600519.SH", trade_date="20260427", token="t")
    assert len(result.claims) == 1
    claim = result.claims[0]
    assert claim.value == 35.42
    assert claim.metric == "P/E"
    assert claim.code == "600519.SH"
    assert claim.cite.kind == "tool"
    assert claim.cite.tool_call_id == trace.calls[0].tool_call_id
```

- [ ] Step 2: Implement core/examples/price_summary.py
```python
"""PriceSummarySkill — emits a NumericClaim with P/E for a given code/date."""
from __future__ import annotations

from typing import Any

from core.citation import NumericClaim, ToolCite
from core.examples.price_tool import get_pe
from core.skills import Skill, SkillResult
from core.tools import ToolTrace


class PriceSummarySkill(Skill):
    name = "price_summary"

    def run(self, trace: ToolTrace, **inputs: Any) -> SkillResult:
        code: str = inputs["code"]
        trade_date: str = inputs["trade_date"]
        token: str = inputs["token"]
        pe = get_pe(trace, code=code, trade_date=trade_date, token=token)
        last_call = trace.calls[-1]
        claim = NumericClaim(
            value=pe,
            metric="P/E",
            code=code,
            as_of=f"{trade_date[:4]}-{trade_date[4:6]}-{trade_date[6:]}",
            cite=ToolCite(
                source=last_call.source,
                table=last_call.table,
                fetched_at=last_call.fetched_at,
                tool_call_id=last_call.tool_call_id,
            ),
        )
        return SkillResult(claims=[claim], summary=f"{code} P/E = {pe} as of {trade_date}")
```

- [ ] Step 3: Run tests
Run: pytest tests/unit/test_price_summary.py -v
Expected: 1 passed.
- [ ] Step 4: Commit

```bash
git add core/examples/price_summary.py tests/unit/test_price_summary.py
git commit -m "feat(examples): PriceSummarySkill emits cited P/E claim"
```

Task 11: End-to-End Integration Test
Files:
- Create: tests/integration/__init__.py
- Create: tests/integration/test_price_summary_loop.py

Closes the loop: PriceSummarySkill → Runtime → Verifier passes → output is structured and cite-valid.
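The runtime and verifier the test drives come from Tasks 6 and 8. Their interaction can be sketched roughly as below; stub Protocols stand in for the real classes, and the retry-once-on-verifier-fail policy mirrors Task 8's commit message, so treat this as a shape sketch rather than the shipped implementation:

```python
"""Sketch of the Task 8 single-agent loop the integration test exercises.

Illustration only: Skill/Verifier are stub Protocols; only the
run -> verify -> retry-once shape is the point.
"""
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Protocol


@dataclass
class ToolTrace:
    calls: list = field(default_factory=list)


class Skill(Protocol):
    def run(self, trace: ToolTrace, **inputs: Any) -> Any: ...


class Verifier(Protocol):
    def verify(self, result: Any, trace: ToolTrace) -> bool: ...


@dataclass
class VerifiedSuccess:
    result: Any
    trace: ToolTrace
    attempts: int


@dataclass
class VerifiedFailure:
    trace: ToolTrace
    attempts: int


@dataclass
class SingleAgentRuntime:
    verifier: Verifier

    def run(self, skill: Skill, **inputs: Any):
        # Retry-once policy: a fresh trace per attempt, at most two attempts.
        for attempt in (1, 2):
            trace = ToolTrace()
            result = skill.run(trace, **inputs)
            if self.verifier.verify(result, trace):
                return VerifiedSuccess(result=result, trace=trace, attempts=attempt)
        return VerifiedFailure(trace=trace, attempts=2)
```

A first-attempt pass therefore yields attempts == 1, which is what the integration test asserts.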
- [ ] Step 1: Empty integration package init
Create tests/integration/__init__.py:

```python
"""Integration tests — full closed loop."""
```

- [ ] Step 2: Write the integration test
Create tests/integration/test_price_summary_loop.py:

```python
import respx
from httpx import Response

from core.competences import CompetenceRegistry
from core.examples.price_summary import PriceSummarySkill
from core.runtime import SingleAgentRuntime, VerifiedSuccess
from core.verifier import Verifier


@respx.mock
def test_full_loop_passes_verifier():
    respx.post("http://api.tushare.pro").mock(
        return_value=Response(200, json={
            "code": 0, "msg": "",
            "data": {"fields": ["ts_code", "trade_date", "pe_ttm"],
                     "items": [["600519.SH", "20260427", 35.42]]},
        })
    )
    runtime = SingleAgentRuntime(verifier=Verifier(CompetenceRegistry()))
    out = runtime.run(PriceSummarySkill(), code="600519.SH", trade_date="20260427", token="t")
    assert isinstance(out, VerifiedSuccess)
    assert out.attempts == 1
    claim = out.result.claims[0]
    assert claim.value == 35.42
    assert claim.cite.tool_call_id == out.trace.calls[0].tool_call_id
```

- [ ] Step 3: Run
Run: pytest tests/integration -v
Expected: 1 passed.
- [ ] Step 4: Run the full suite
Run: pytest -q
Expected: all unit + integration tests pass (~22 tests).
- [ ] Step 5: Commit

```bash
git add tests/integration
git commit -m "test(integration): full closed-loop with PriceSummarySkill + Verifier"
```

Task 12: Hermes Integration Discovery (Research, no code)
Files:
- Create: docs/architecture/hermes-integration.md

This is a research-only task. The output is a Markdown doc that answers the open questions from spec §7 so a follow-up plan can wrap core/ inside Hermes.
- [ ] Step 1: Read Hermes internal docs / source
Engineer reads internal Hermes docs and answers, in docs/architecture/hermes-integration.md:
- How are Skills registered in Hermes today? (decorator, manifest, registry call?)
- Does Hermes have a Competence registry or equivalent? If not, where can we attach our CompetenceRegistry?
- How does Hermes expose the agent loop's tool-call trace? Per-call hook? Final transcript?
- How does Hermes pass user input and receive Skill output? JSON envelope shape?
- Where does model selection happen — and can we plug ModelRouter in?
- Does Hermes have built-in retry semantics that conflict with our SingleAgentRuntime retry?
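To make these questions concrete, a follow-up adapter might look roughly like the sketch below. Every Hermes-side assumption here is invented for illustration, pending the answers above: wrap_for_hermes, the flat JSON envelope, and the make_trace hook are placeholders, not a real Hermes API.

```python
"""Hypothetical adapter shape for wrapping a core/ Skill inside Hermes.

Everything Hermes-side is invented pending Task 12's answers: the
function name, envelope shape, and trace hook are placeholders.
"""
from __future__ import annotations

import json
from typing import Any, Callable


def wrap_for_hermes(skill: Any, make_trace: Callable[[], Any]) -> Callable[[str], str]:
    """Adapt a core/ Skill to a JSON-in/JSON-out callable.

    A JSON envelope at the boundary keeps the adapter containers-ready,
    per the plan's "all cross-layer calls JSON-serializable" rule.
    """
    def handler(request_json: str) -> str:
        inputs = json.loads(request_json)   # assumed envelope: flat kwargs
        trace = make_trace()                # Hermes may supply or we own the trace
        result = skill.run(trace, **inputs)
        return json.dumps({
            "skill": skill.name,
            "summary": result.summary,
            "n_claims": len(result.claims),
        })
    return handler
```

Whether Hermes actually accepts a callable of this shape, and where the trace comes from, are exactly the gaps the discovery doc should close.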
- [ ] Step 2: Write the doc
Each question gets an answer with concrete file paths / API names. If a capability is missing, add a "Gap" section listing required adapter work.
- [ ] Step 3: Update VitePress sidebar
Edit docs/.vitepress/config.ts architecture sidebar to add:

```ts
{ text: "Hermes 集成调研", link: "/architecture/hermes-integration" },
```

- [ ] Step 4: Build to confirm
Run: npm run docs:build
Expected: build succeeds.
- [ ] Step 5: Commit

```bash
git add docs/architecture/hermes-integration.md docs/.vitepress/config.ts
git commit -m "docs(hermes): integration discovery — Skill/Competence/trace API findings"
```

Task 13: CI for Tests
Files:
- Create: .github/workflows/ci.yml

GitHub Actions runs ruff, mypy, and pytest on every push and pull request to main; a second job builds the VitePress docs.
- [ ] Step 1: Write workflow
Create .github/workflows/ci.yml:

```yaml
name: CI
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
jobs:
  python:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
          cache: pip
          # no requirements.txt in this repo — key the pip cache off pyproject.toml
          cache-dependency-path: pyproject.toml
      - run: pip install -e ".[dev]"
      - run: ruff check core tests
      - run: mypy core
      - run: pytest -q
  docs:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
          cache: npm
      - run: npm ci
      - run: npm run docs:build
```

- [ ] Step 2: Push and watch
```bash
git add .github/workflows/ci.yml
git commit -m "ci: ruff + mypy + pytest + vitepress build"
git push origin main
```

Open https://github.com/LaCatFly/twilight-drive/actions and confirm both jobs pass.
- [ ] Step 3: Add CI badge to README (optional)
Edit README.md under the title:

```markdown
[![CI](https://github.com/LaCatFly/twilight-drive/actions/workflows/ci.yml/badge.svg)](https://github.com/LaCatFly/twilight-drive/actions/workflows/ci.yml)
```

- [ ] Step 4: Commit
```bash
git add README.md
git commit -m "docs: add CI badge"
git push origin main
```

Spec Coverage Check
| Spec section (03-agent-framework.md) | Implemented in |
|---|---|
| §2 Layer architecture | Tasks 4 (L3), 5 (L4), 6 (Verifier), 8 (L5) |
| §3.1 Model Router | Task 7 |
| §3.2 Citation Protocol | Task 2 |
| §3.3 Verifier | Task 6 |
| §3.4 State + memory | Deferred — MVP runtime is per-call; session memory + opt-in persistence become Task 14+ in the next plan iteration once team review confirms scope |
| §3.5 Eval (source-grounded) | Achieved implicitly: every test invokes the Verifier; metrics tracking is a follow-up |
| §4 Deployment | Task 1 (pyproject) + Task 13 (CI); containerization is post-MVP |
| §5 Hard rules 1–6 | All enforced by code (Tasks 4, 6, 8) |
| §7 Open items | Task 12 |
Explicitly out-of-scope for this plan:
- Session memory (B) and per-user persistent memory (C) — small follow-up plan after the MVP loop is green.
- Skill catalog beyond PriceSummarySkill — additions are mechanical given the framework.
- Production LLM client wiring (Hermes API, Claude, GPT) — Model Router only returns IDs; the client is wired in a Task 12 follow-up.
Self-Review Notes
- Type/method names verified consistent across tasks: ToolTrace, ToolCall, Cite, ToolCite, CompetenceCite, NumericClaim, KnowledgeClaim, SkillResult, VerificationResult, VerifiedSuccess, VerifiedFailure, RunOutcome, Verifier.verify, ModelRouter.route, SingleAgentRuntime.run.
- No placeholders. Every code step shows runnable code; every test step shows the assertion.
- TDD discipline: every task starts with a failing test; commit happens after green.