Topic
03 — Agent Framework Design
Status: Design (post-brainstorm) · Date: 2026-04-28 · Scope: L2–L5 of the A-share investment research Agent. L1 (storage) is deferred to 01-data-layer.md. Domain-specific Skills/Competences and the cross-vertical Harness are out of scope for this document.
1. Goal
Define a layered, citation-disciplined Agent runtime built on top of the team's existing Hermes framework, capable of producing investment research output where every numeric claim is verifiable and every knowledge claim is sourced.
This document fixes the architecture and protocols. It does not enumerate Skills, Competences, or tools — those belong to the implementation plan and to 01/02.
2. Layer Architecture
| Layer | Name | Description |
|---|---|---|
| L5 | Orchestrator | Hermes agent loop (existing) |
| L4 | Skills | Hermes Skills — verbs the Agent can DO |
| L4' | Competences | Hermes Competences — nouns/knowledge the Agent KNOWS; carry source provenance |
| L3 | Tools | Convention layer; Skills delegate data ops here |
| L2 | Data Access | JSON-serializable query interface over L1 |
| L1 | Storage | See 01-data-layer.md |

L4 and L4' are siblings, not stacked. Skills consume Competences during reasoning. Competences are not invoked like functions; they are knowledge with attribution.
2.1 Layer Responsibilities
L5 Orchestrator (Hermes)
- Owns the agent loop, tool dispatch, and turn lifecycle.
- Out of scope to redesign; we use Hermes as-is.
- Adds: a thin Model Router (see §3.1).
L4 Skills
- Hermes primitives. Python or JavaScript. Registered with JSON-Schema'd inputs and outputs.
- A Skill is a workflow (e.g., "do peer-comparison valuation", "summarize earnings call").
- Skills MUST NOT call data sources directly. Skills call L3 Tools.
- Skills MUST produce outputs that conform to the Citation Protocol (§3.2).
L4' Competences
- Hermes primitives. Knowledge units of the form (claim, source, as_of).
- Examples: "A-share fiscal year ends Dec 31 — source: CSRC accounting standards"; "Tushare `daily_basic` table is forward-adjusted unless `adj=None` — source: Tushare docs §188".
- Competences are referenced by ID in Agent output.
- The LLM cannot fabricate a Competence: only registered Competences are valid citation targets.
L3 Tools (convention)
- Pure data operations. No LLM. No reasoning.
- Python or JavaScript functions. Pure inputs → pure outputs. Side-effect-free where possible.
- Examples: `get_price(code, start, end)`, `get_fundamentals(code, period)`, `search_reports(query, filters)`.
- Reusable across Skills. Unit-testable.
L2 Data Access (convention)
- Hides L1 storage choices from L3.
- Exposes schema-stable queries; downstream code never embeds source-specific knowledge (e.g., "Tushare table name X").
- All inputs/outputs JSON-serializable. No pandas DataFrames across the boundary — pass paths or serialize.
L1 Storage — see 01-data-layer.md.
2.2 Why a Tool Layer When Hermes Doesn't Require One
Hermes Skills are arbitrary Python/JS — there is no enforced Tool primitive. We impose L3 by convention because without it Skills become 500-line monsters with embedded SQL, prompt logic, and data access intermingled. Convention is enforced via:
- `tools/` directory layout
- Code review checklist
- Lint rule (future): forbid `tushare`/`akshare`/raw HTTP imports inside `skills/`
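The future lint rule could start as a small AST walk over files under `skills/`. The blocklist below is illustrative: `tushare` and `akshare` come from this document; `requests`/`urllib`/`http` stand in for whatever raw-HTTP libraries are actually in use:

```python
import ast

# Illustrative blocklist; the real set depends on the team's dependencies.
FORBIDDEN = {"tushare", "akshare", "requests", "urllib", "http"}

def forbidden_imports(source: str) -> list[str]:
    """Return forbidden module names imported by a skills/ source file."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]  # relative imports have module=None
        else:
            continue
        for name in names:
            # Match on the top-level package, e.g. "urllib.request" -> "urllib".
            if name.split(".")[0] in FORBIDDEN:
                hits.append(name)
    return hits
```

Wired into CI, a non-empty result for any file under `skills/` would fail the build, turning the convention into an enforced rule.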
3. Cross-Cutting Protocols
3.1 Model Router
The Agent runtime does NOT hardcode the Hermes LLM as the only model. A ModelRouter selects the model per task class:
| Task class | Default model | Rationale |
|---|---|---|
| Routing / planning / simple tool calls | Hermes (Nous) | Cheap, team-familiar |
| Long PDF research-report parsing | Claude Sonnet (or GPT-4o) | Long context, table reading |
| Final synthesis with strict citation discipline | Claude Opus / GPT-4 | Strongest at avoiding fabrication |
| Embeddings | bge-m3 / OpenAI | Different problem class |
Implementation: a single function `route(task_class) -> ModelClient`. ~20 LOC. Configurable per environment. Treats Hermes as the default; other models opt-in per call.
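A minimal sketch of that function, using a string model tag where the real router would construct a `ModelClient`. The task-class keys and model tags are illustrative, drawn from the table above:

```python
# Sketch only: a production router would return a ModelClient instance
# and load this mapping from per-environment configuration.
ROUTES: dict[str, str] = {
    "routing": "hermes",
    "planning": "hermes",
    "tool_call": "hermes",
    "pdf_parsing": "claude-sonnet",   # long-context report parsing
    "synthesis": "claude-opus",       # strict citation discipline
    "embedding": "bge-m3",
}

def route(task_class: str) -> str:
    """Select a model per task class; Hermes is the default, others opt in."""
    return ROUTES.get(task_class, "hermes")
```

Because unknown task classes fall through to Hermes, adding a new task class never breaks routing; it just runs on the default until explicitly mapped.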
3.2 Citation Protocol
Every claim — numeric or qualitative — carries a cite. The cite is a tagged union with two variants:
(a) Tool-sourced — value originated from a tool call (data fetch, calculation):
```json
{
  "cite": {
    "kind": "tool",
    "source": "tushare",
    "table": "daily_basic",
    "fetched_at": "2026-04-28T08:14:22Z",
    "tool_call_id": "tc_abc123"
  }
}
```

(b) Competence-sourced — value or claim originated from registered domain knowledge:

```json
{
  "cite": {
    "kind": "competence",
    "competence_id": "comp.astock.fiscal_calendar.v1"
  }
}
```

Numeric claims wrap their value with the cite:

```json
{ "value": 42.5, "metric": "P/E", "code": "600519.SH", "as_of": "2026-04-27", "cite": { "kind": "tool", ... } }
```

Qualitative claims wrap their text with the cite:

```json
{ "claim": "A-share fiscal year ends December 31", "cite": { "kind": "competence", ... } }
```

The LLM cannot invent a `competence_id`; the Verifier rejects unresolved IDs (§3.3). The same applies to `tool_call_id` — it must correspond to a real tool call in the trace.
3.3 Verifier (Single-Agent + Deterministic Check)
The runtime is single-agent, not multi-agent. Output discipline comes from a deterministic Verifier (not a second LLM):
- Schema check: every claim has a `cite` of `kind: "tool"` or `kind: "competence"` per §3.2.
- Tool resolution: every `cite.tool_call_id` corresponds to a real tool call in the trace.
- Value match: cited value equals the value the tool actually returned.
- Competence resolution: every `cite.competence_id` is registered.
- Freshness (numeric, tool-sourced only): `as_of` is within the configured staleness budget for the metric. Competence-sourced claims are not subject to staleness checks (knowledge facts) unless the Competence itself declares a TTL.
On failure: retry once with the verifier feedback injected into the next turn. If still failing, surface the failure to the user. No silent fallback.
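The deterministic checks reduce to one pass over the claims against the tool-call trace and the Competence registry. A sketch, with the claim/cite shapes following §3.2 (the function name is illustrative, and the freshness check is omitted for brevity):

```python
def verify_claims(claims: list[dict], trace: dict, registry: set[str]) -> list[str]:
    """Return verification failures; an empty list means the output passes.

    trace maps tool_call_id -> the value the tool actually returned;
    registry is the set of registered competence_ids.
    """
    failures = []
    for i, claim in enumerate(claims):
        cite = claim.get("cite")
        # Schema check: cite must exist and declare a known kind.
        if not isinstance(cite, dict) or cite.get("kind") not in ("tool", "competence"):
            failures.append(f"claim {i}: missing or malformed cite")
            continue
        if cite["kind"] == "tool":
            call_id = cite.get("tool_call_id")
            # Tool resolution: the cited call must exist in the trace.
            if call_id not in trace:
                failures.append(f"claim {i}: unknown tool_call_id {call_id}")
            # Value match: the cited value must equal the tool's output.
            elif "value" in claim and trace[call_id] != claim["value"]:
                failures.append(f"claim {i}: cited value does not match tool output")
        else:
            # Competence resolution: the LLM cannot invent an ID.
            if cite.get("competence_id") not in registry:
                failures.append(f"claim {i}: unregistered competence_id")
    return failures
```

On retry, the returned failure strings are exactly the "verifier feedback" injected into the next turn.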
3.4 State and Memory
Two-tier:
- Session memory (always on). Per-conversation working set: previously fetched data, intermediate Skill outputs, currently-discussed tickers. Cleared when the session ends.
- Per-user persistent memory (opt-in). Long-term tracking ("you have been following 600519 since March"). Stored separately. Subject to TTL and explicit forget operations. MVP wires the storage interface but defaults to disabled.
Cached findings keep their cite metadata. When re-used in a later turn, the runtime re-validates freshness against the staleness budget (§3.3) before allowing the value to flow into a new claim. Stale values force a refetch.
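Re-validation on reuse might look like the following. The budgets shown are placeholders; the real per-metric table is an open item in §7:

```python
from datetime import date, timedelta

# Placeholder budgets — the actual per-metric table is TBD (see §7).
STALENESS_BUDGET = {
    "daily_price": timedelta(days=1),
    "fundamentals": timedelta(days=92),  # roughly one quarter
}

def is_fresh(metric: str, as_of: str, today: date) -> bool:
    """May a cached tool-sourced value flow into a new claim?"""
    budget = STALENESS_BUDGET.get(metric)
    if budget is None:
        return False  # unknown metric: force a refetch rather than guess
    return today - date.fromisoformat(as_of) <= budget
```

Defaulting unknown metrics to stale keeps the failure mode conservative: a missing budget entry causes an extra fetch, never a silently reused stale value.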
Cross-user shared memory (D in brainstorm) is explicitly out of scope for v1 — it risks contaminating reasoning with stale or biased prior research.
3.5 Evaluation
MVP eval = source-grounded verification. This is essentially free given §3.3: every Agent output is auto-verified, and the verification result is the eval signal.
Metrics tracked from day 1:
- % outputs passing verifier on first try
- % numeric claims with valid cites
- % competence references that resolve
- Median freshness of cited values
Deferred to post-MVP:
- Golden Q&A regression set (begin curating during MVP from real bugs)
- LLM-as-judge for qualitative quality
- Backtest-style outcome eval (requires v1 to be in production with real predictions)
4. Deployment Topology
MVP: Single Python process locally. JS Skills invoked via Node subprocess.
Production target: Containerized per-service. Hermes orchestrator + Skills runtimes (Python, JS) as separate containers.
Migration discipline (enforced from day 1, even in MVP):
- All cross-layer calls JSON-serializable.
- No shared in-process state between Skills and Tools.
- No pandas DataFrames across boundaries; pass paths or serialize to Parquet/JSON.
- Logging/tracing IDs propagate through every call.
If we honor these rules in MVP, the migration to containers is a deployment-config change, not a rewrite.
5. Hard Rules (Non-Negotiable)
- Skills never call data sources directly. All data access goes through L3 Tools.
- Every numeric claim in Agent output carries a `cite`. The Verifier rejects otherwise.
- Cross-layer calls are JSON-serializable. No in-process shortcuts.
- Competences cannot be fabricated. Only registered Competences are valid citation targets.
- Verifier failure surfaces to the user after one retry. No silent fallback to unverified output.
- The Hermes LLM is not hardcoded. Model selection goes through the Model Router.
6. Out of Scope / Deferred
| Concern | Owned by |
|---|---|
| L1 storage choice and schemas | 01-data-layer.md |
| Research-report-specific tools and parsing | 02-research-reports.md |
| Cross-vertical reuse (Harness abstraction) | 04-harness-framework.md |
| Documentation system | 05-documentation.md |
| Skill catalog (which Skills to build first) | Implementation plan |
| Competence catalog (which knowledge units to register first) | Implementation plan |
| Specific eval cases | Post-MVP |
| Multi-agent topology | v2, gated on eval data showing single-agent + Verifier is insufficient |
7. Open Items Before Implementation Plan
These need answers before writing-plans can produce a usable plan:
- Hermes internal API surface. This spec assumes we can register Skills and Competences and observe the agent loop. If Hermes lacks (e.g.) a Competence registry or per-call observability hooks, the implementation plan must include adapter work.
- Verifier hosting. Runs in-process with the orchestrator (simplest) or as a separate service (cleaner boundary)? Default: in-process for MVP.
- Trace storage. The Verifier needs the tool-call trace. Where does it live — memory only, or persisted? Default: memory for MVP, persisted for production.
- Staleness budget defaults. Per-metric freshness thresholds (e.g., daily price = 1 day; fundamentals = 1 quarter). To be tabulated alongside the data layer.
8. Decision Log (Brainstorm Provenance)
This design is the consolidation of the brainstorming session held 2026-04-27/28. Key choices:
| Decision | Choice | Rejected alternative |
|---|---|---|
| Runtime | Hermes via hosted API + custom orchestrator | Claude Code; self-hosted Hermes |
| Model strategy | Hermes default + Model Router | Hermes-only; frontier-only |
| Skill primitive | Hermes Skills (Python + JS) | Build new abstraction |
| Knowledge primitive | Hermes Competences (cite-anchored) | Free-form RAG |
| Topology | Single-agent + deterministic Verifier | Multi-agent (Researcher/Analyst/Critic) |
| Eval | Source-grounded verification | Golden Q&A only; LLM-as-judge |
| Memory | Session + opt-in per-user | Stateless; cross-user shared KB |
| Deployment | Local MVP → containers | Cloud-native from day 1 |