Topic
03 — Agent Framework Design
Status: Design (post-brainstorm) · Date: 2026-04-28 · Scope: L2–L5 of the A-share investment research Agent. L1 (storage) is deferred to 01-data-layer.md. Domain-specific Skills/Competences and the cross-vertical Harness are out of scope for this document.
1. Goal
Define a layered, citation-disciplined Agent runtime built on top of the team's existing Hermes framework, capable of producing investment research output where every numeric claim is verifiable and every knowledge claim is sourced.
This document fixes the architecture and protocols. It does not enumerate Skills, Competences, or tools — those belong to the implementation plan and to 01/02.
2. Layer Architecture
| Layer | Name | Description |
|---|---|---|
| L5 | Orchestrator | Hermes agent loop (existing) |
| L4 | Skills | Hermes Skills — verbs the Agent can DO |
| L4' | Competences | Hermes Competences — nouns/knowledge the Agent KNOWS; carry source provenance |
| L3 | Tools | Convention layer; Skills delegate data ops here |
| L2 | Data Access | JSON-serializable query interface over L1 |
| L1 | Storage | See 01-data-layer.md |

L4 and L4' are siblings, not stacked. Skills consume Competences during reasoning. Competences are not invoked like functions; they are knowledge with attribution.
2.1 Layer Responsibilities
L5 Orchestrator (Hermes)
- Owns the agent loop, tool dispatch, and turn lifecycle.
- Out of scope to redesign; we use Hermes as-is.
- Adds: a thin Model Router (see §3.1).
L4 Skills
- Hermes primitives. Python or JavaScript. Registered with JSON-Schema'd inputs and outputs.
- A Skill is a workflow (e.g., "do peer-comparison valuation", "summarize earnings call").
- Skills MUST NOT call data sources directly. Skills call L3 Tools.
- Skills MUST produce outputs that conform to the Citation Protocol (§3.2).
L4' Competences
- Hermes primitives. Knowledge units of the form (claim, source, as_of).
- Examples: "A-share fiscal year ends Dec 31 — source: CSRC accounting standards"; "Tushare `daily_basic` table is forward-adjusted unless `adj=None` — source: Tushare docs §188".
- Competences are referenced by ID in Agent output.
- The LLM cannot fabricate a Competence: only registered Competences are valid citation targets.
L3 Tools (convention)
- Pure data operations. No LLM. No reasoning.
- Python or JavaScript functions. Pure inputs → pure outputs. Side-effect-free where possible.
- Examples: `get_price(code, start, end)`, `get_fundamentals(code, period)`, `search_reports(query, filters)`.
- Reusable across Skills. Unit-testable.
L2 Data Access (convention)
- Hides L1 storage choices from L3.
- Exposes schema-stable queries; downstream code never embeds source-specific knowledge (e.g., "Tushare table name X").
- All inputs/outputs JSON-serializable. No pandas DataFrames across the boundary — pass paths or serialize.
L1 Storage — see 01-data-layer.md.
2.2 Why a Tool Layer When Hermes Doesn't Require One
Hermes Skills are arbitrary Python/JS — there is no enforced Tool primitive. We impose L3 by convention because without it Skills become 500-line monsters with embedded SQL, prompt logic, and data access intermingled. Convention is enforced via:
- `tools/` directory layout
- Code review checklist
- Lint rule (future): forbid `tushare`/`akshare`/raw HTTP imports inside `skills/`
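The future lint rule could start as a small AST walk over files under `skills/`. The blocklist below is illustrative: `tushare` and `akshare` come from this document; `requests`/`urllib`/`http` stand in for whatever raw-HTTP libraries are actually in use:

```python
import ast

# Illustrative blocklist; the real set depends on the team's dependencies.
FORBIDDEN = {"tushare", "akshare", "requests", "urllib", "http"}

def forbidden_imports(source: str) -> list[str]:
    """Return forbidden module names imported by a skills/ source file."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]  # relative imports have module=None
        else:
            continue
        for name in names:
            # Match on the top-level package, e.g. "urllib.request" -> "urllib".
            if name.split(".")[0] in FORBIDDEN:
                hits.append(name)
    return hits
```

Wired into CI, a non-empty result for any file under `skills/` would fail the build, turning the convention into an enforced rule.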
3. Cross-Cutting Protocols
3.1 Model Router
The Agent runtime does NOT hardcode the Hermes LLM as the only model. A ModelRouter selects the model per task class:
| Task class | Default model | Rationale |
|---|---|---|
| Routing / planning / simple tool calls | Hermes (Nous) | Cheap, team-familiar |
| Long PDF research-report parsing | Claude Sonnet (or GPT-4o) | Long context, table reading |
| Final synthesis with strict citation discipline | Claude Opus / GPT-4 | Strongest at avoiding fabrication |
| Embeddings | bge-m3 / OpenAI | Different problem class |
Implementation: a single function `route(task_class) -> ModelClient`. ~20 LOC. Configurable per environment. Treats Hermes as the default; other models opt-in per call.
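A minimal sketch of that function, using a string model tag where the real router would construct a `ModelClient`. The task-class keys and model tags are illustrative, drawn from the table above:

```python
# Sketch only: a production router would return a ModelClient instance
# and load this mapping from per-environment configuration.
ROUTES: dict[str, str] = {
    "routing": "hermes",
    "planning": "hermes",
    "tool_call": "hermes",
    "pdf_parsing": "claude-sonnet",   # long-context report parsing
    "synthesis": "claude-opus",       # strict citation discipline
    "embedding": "bge-m3",
}

def route(task_class: str) -> str:
    """Select a model per task class; Hermes is the default, others opt in."""
    return ROUTES.get(task_class, "hermes")
```

Because unknown task classes fall through to Hermes, adding a new task class never breaks routing; it just runs on the default until explicitly mapped.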
3.2 Citation Protocol
Every claim — numeric or qualitative — carries a cite. The cite is a tagged union with two variants:
(a) Tool-sourced — value originated from a tool call (data fetch, calculation):
```json
{
  "cite": {
    "kind": "tool",
    "source": "tushare",
    "table": "daily_basic",
    "fetched_at": "2026-04-28T08:14:22Z",
    "tool_call_id": "tc_abc123"
  }
}
```

(b) Competence-sourced — value or claim originated from registered domain knowledge:

```json
{
  "cite": {
    "kind": "competence",
    "competence_id": "comp.astock.fiscal_calendar.v1"
  }
}
```

Numeric claims wrap their value with the cite:

```json
{ "value": 42.5, "metric": "P/E", "code": "600519.SH", "as_of": "2026-04-27", "cite": { "kind": "tool", ... } }
```

Qualitative claims wrap their text with the cite:

```json
{ "claim": "A-share fiscal year ends December 31", "cite": { "kind": "competence", ... } }
```

The LLM cannot invent a `competence_id`; the Verifier rejects unresolved IDs (§3.3). The same applies to `tool_call_id` — it must correspond to a real tool call in the trace.
3.3 Verifier (Single-Agent + Deterministic Check)
The runtime is single-agent, not multi-agent. Output discipline comes from a deterministic Verifier (not a second LLM):
- Schema check: every claim has a `cite` of `kind: "tool"` or `kind: "competence"` per §3.2.
- Tool resolution: every `cite.tool_call_id` corresponds to a real tool call in the trace.
- Value match: cited value equals the value the tool actually returned.
- Competence resolution: every `cite.competence_id` is registered.
- Freshness (numeric, tool-sourced only): `as_of` is within the configured staleness budget for the metric. Competence-sourced claims are not subject to staleness checks (knowledge facts) unless the Competence itself declares a TTL.
On failure: retry once with the verifier feedback injected into the next turn. If still failing, surface the failure to the user. No silent fallback.
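The deterministic checks reduce to one pass over the claims against the tool-call trace and the Competence registry. A sketch, with the claim/cite shapes following §3.2 (the function name is illustrative, and the freshness check is omitted for brevity):

```python
def verify_claims(claims: list[dict], trace: dict, registry: set[str]) -> list[str]:
    """Return verification failures; an empty list means the output passes.

    trace maps tool_call_id -> the value the tool actually returned;
    registry is the set of registered competence_ids.
    """
    failures = []
    for i, claim in enumerate(claims):
        cite = claim.get("cite")
        # Schema check: cite must exist and declare a known kind.
        if not isinstance(cite, dict) or cite.get("kind") not in ("tool", "competence"):
            failures.append(f"claim {i}: missing or malformed cite")
            continue
        if cite["kind"] == "tool":
            call_id = cite.get("tool_call_id")
            # Tool resolution: the cited call must exist in the trace.
            if call_id not in trace:
                failures.append(f"claim {i}: unknown tool_call_id {call_id}")
            # Value match: the cited value must equal the tool's output.
            elif "value" in claim and trace[call_id] != claim["value"]:
                failures.append(f"claim {i}: cited value does not match tool output")
        else:
            # Competence resolution: the LLM cannot invent an ID.
            if cite.get("competence_id") not in registry:
                failures.append(f"claim {i}: unregistered competence_id")
    return failures
```

On retry, the returned failure strings are exactly the "verifier feedback" injected into the next turn.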
3.4 State and Memory
Two-tier:
- Session memory (always on). Per-conversation working set: previously fetched data, intermediate Skill outputs, currently-discussed tickers. Cleared when the session ends.
- Per-user persistent memory (opt-in). Long-term tracking ("you have been following 600519 since March"). Stored separately. Subject to TTL and explicit forget operations. MVP wires the storage interface but defaults to disabled.
Cached findings keep their cite metadata. When re-used in a later turn, the runtime re-validates freshness against the staleness budget (§3.3) before allowing the value to flow into a new claim. Stale values force a refetch.
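Re-validation on reuse might look like the following. The budgets shown are placeholders; the real per-metric table is an open item in §7:

```python
from datetime import date, timedelta

# Placeholder budgets — the actual per-metric table is TBD (see §7).
STALENESS_BUDGET = {
    "daily_price": timedelta(days=1),
    "fundamentals": timedelta(days=92),  # roughly one quarter
}

def is_fresh(metric: str, as_of: str, today: date) -> bool:
    """May a cached tool-sourced value flow into a new claim?"""
    budget = STALENESS_BUDGET.get(metric)
    if budget is None:
        return False  # unknown metric: force a refetch rather than guess
    return today - date.fromisoformat(as_of) <= budget
```

Defaulting unknown metrics to stale keeps the failure mode conservative: a missing budget entry causes an extra fetch, never a silently reused stale value.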
Cross-user shared memory (D in brainstorm) is explicitly out of scope for v1 — it risks contaminating reasoning with stale or biased prior research.
3.5 Evaluation
MVP eval = source-grounded verification. This is essentially free given §3.3: every Agent output is auto-verified, and the verification result is the eval signal.
Metrics tracked from day 1:
- % outputs passing verifier on first try
- % numeric claims with valid cites
- % competence references that resolve
- Median freshness of cited values
Deferred to post-MVP:
- Golden Q&A regression set (begin curating during MVP from real bugs)
- LLM-as-judge for qualitative quality
- Backtest-style outcome eval (requires v1 to be in production with real predictions)
4. Deployment Topology
MVP: Single Python process locally. JS Skills invoked via Node subprocess.
Production target: Containerized per-service. Hermes orchestrator + Skills runtimes (Python, JS) as separate containers.
Migration discipline (enforced from day 1, even in MVP):
- All cross-layer calls JSON-serializable.
- No shared in-process state between Skills and Tools.
- No pandas DataFrames across boundaries; pass paths or serialize to Parquet/JSON.
- Logging/tracing IDs propagate through every call.
If we honor these rules in MVP, the migration to containers is a deployment-config change, not a rewrite.
5. Hard Rules (Non-Negotiable)
- Skills never call data sources directly. All data access goes through L3 Tools.
- Every numeric claim in Agent output carries a `cite`. The Verifier rejects otherwise.
- Cross-layer calls are JSON-serializable. No in-process shortcuts.
- Competences cannot be fabricated. Only registered Competences are valid citation targets.
- Verifier failure surfaces to the user after one retry. No silent fallback to unverified output.
- The Hermes LLM is not hardcoded. Model selection goes through the Model Router.
6. Out of Scope / Deferred
| Concern | Owned by |
|---|---|
| L1 storage choice and schemas | 01-data-layer.md |
| Research-report-specific tools and parsing | 02-research-reports.md |
| Cross-vertical reuse (Harness abstraction) | 04-harness-framework.md |
| Documentation system | 05-documentation.md |
| Skill catalog (which Skills to build first) | Implementation plan |
| Competence catalog (which knowledge units to register first) | Implementation plan |
| Specific eval cases | Post-MVP |
| Multi-agent topology | v2, gated on eval data showing single-agent + Verifier is insufficient |
7. Open Items Before Implementation Plan
These need answers before writing-plans can produce a usable plan:
- Hermes internal API surface. This spec assumes we can register Skills and Competences and observe the agent loop. If Hermes lacks (e.g.) a Competence registry or per-call observability hooks, the implementation plan must include adapter work.
- Verifier hosting. Runs in-process with the orchestrator (simplest) or as a separate service (cleaner boundary)? Default: in-process for MVP.
- Trace storage. The Verifier needs the tool-call trace. Where does it live — memory only, or persisted? Default: memory for MVP, persisted for production.
- Staleness budget defaults. Per-metric freshness thresholds (e.g., daily price = 1 day; fundamentals = 1 quarter). To be tabulated alongside the data layer.
8. Decision Log (Brainstorm Provenance)
This design is the consolidation of the brainstorming session held 2026-04-27/28. Key choices:
| Decision | Choice | Rejected alternative |
|---|---|---|
| Runtime | Hermes via hosted API + custom orchestrator | Claude Code; self-hosted Hermes |
| Model strategy | Hermes default + Model Router | Hermes-only; frontier-only |
| Skill primitive | Hermes Skills (Python + JS) | Build new abstraction |
| Knowledge primitive | Hermes Competences (cite-anchored) | Free-form RAG |
| Topology | Single-agent + deterministic Verifier | Multi-agent (Researcher/Analyst/Critic) |
| Eval | Source-grounded verification | Golden Q&A only; LLM-as-judge |
| Memory | Session + opt-in per-user | Stateless; cross-user shared KB |
| Deployment | Local MVP → containers | Cloud-native from day 1 |