Skip to content

W6 WarehouseProvider — design spec

  • Status: Draft
  • Date: 2026-05-12
  • Phase: v0.3.0 W6
  • Depends on: PR #38 (W3 ingest + scheduler) merged and warehouse populated
  • Supersedes / closes: sketch in 2026-05-08-tushare-integration-design.md §4.3.3 (15 lines, type signature only)

1. Background

By W6 the DuckDB warehouse will hold normalized A-share quotes ingested by the W3 scheduler (normalized_daily, normalized_daily_basic, normalized_adj_factor, plus the v_daily_full view defined in 01-data-layer.md §6.6.1). The existing in-memory TushareDailyProvider (PR #32) still pays Tushare per-request RTT (~250 ms average, plus per-quota cost) for every historical lookup. With the warehouse live, those requests should be served locally — Tushare becomes the backfill source, not the request path.

WarehouseProvider is the layer that turns warehouse rows into the same canonical Quote objects every other Provider returns, so Composer's existing routing logic continues to work unchanged.

The Composer ROUTES table (PR #34) already lists "warehouse" as the first entry for historical and CLOSED_FINAL paths, but the provider itself doesn't exist on main yet — Composer skips unregistered ids (spec §6.2). W6 fills that slot.

2. Goals

  1. Cold-cache historical reads land on warehouse, not Tushare. A request for (600519.SH, 2024-01-02) after daily_cache eviction must hit normalized_daily rather than burning a Tushare call.
  2. Identical Quote contract. Callers cannot tell whether a Quote came from warehouse or upstream. Units, adjustment factor handling, as_of semantics — all unchanged.
  3. Strict freshness boundary. A request whose target trade_date exceeds the warehouse's most-recent ingest must raise ProviderUnavailable, not silently serve a stale-by-a-week row. Composer falls back to Tushare.
  4. No background fetches. WarehouseProvider is read-only; ingest is W3's responsibility. The provider never writes.

3. Non-goals

  • Backfill on-demand. If a row is missing the provider raises and Composer falls back. Backfill is W3 scheduler's job.
  • Multi-DB sharding. Single DuckDB file per spec §3.6.
  • Cross-source merge inside the provider. If warehouse disagrees with upstream, that surfaces via AuditLog cross-source diff (W7 TDX), not in WarehouseProvider.
  • INTRADAY_QUOTE. Warehouse is daily-snapshot; real-time spot stays with AkshareSpotProvider. Composer's ROUTES already encode this.
  • Schema migrations. The provider reads whatever schema W3 wrote. Schema is owned by 01-data-layer §6.2.

4. Capability set

python
class WarehouseProvider(DataProvider):
    id: ClassVar[str] = "warehouse"
    capabilities: ClassVar[set[Capability]] = {
        Capability.DAILY_QUOTE_HISTORICAL,
        Capability.DAILY_QUOTE_LATEST,
        Capability.ADJ_FACTOR,
        Capability.FUNDAMENTALS_QUARTERLY,
        Capability.INDEX_DAILY,
    }

Five capabilities, mirroring the sketch in tushare-integration §4.3.3. Each backs onto a specific table or view; see §5.

INTRADAY_QUOTE is omitted on purpose — warehouse rows are end-of-day. Adding it would force Composer to skip warehouse for OPEN state lookups, which the existing ROUTES table already handles.

5. Read patterns (per capability)

5.1 DAILY_QUOTE_HISTORICAL — get_quote(code, trade_date=YYYY-MM-DD)

sql
SELECT
  ts_code, trade_date,
  open, high, low, close, pre_close,
  vol, amount,
  adj_factor, adj_factor_latest, ingested_at
FROM v_daily_full v
LEFT JOIN LATERAL (
    SELECT adj_factor AS adj_factor_latest
    FROM normalized_adj_factor
    WHERE ts_code = v.ts_code
    ORDER BY trade_date DESC
    LIMIT 1
) latest ON true
WHERE v.ts_code = ? AND v.trade_date = ?;

Two adj-factor reads:

  • adj_factor for that specific trade_date (via the view's left-join)
  • adj_factor_latest for "most recent factor on file" (lateral subquery)

Both feed Quote.adj_factor / Quote.adj_factor_latest, which the existing close_qfq / close_hfq properties divide. Property math lives in Quote (base.py), not in this provider.

If the join returns null adj_factor (corporate-action gap), Quote still constructs — close_qfq returns None, callers see "raw only".

5.2 DAILY_QUOTE_LATEST — get_quote(code, trade_date=None)

The Composer only routes here for CLOSED_FINAL or PRE_OPEN states; during OPEN / CLOSING, the chain prefers akshare-spot or tushare-daily respectively (per ROUTES). So "latest" here means "the most recent row in the warehouse" — which is the previous trading day's close.

sql
SELECT ts_code, trade_date, open, high, low, close, pre_close,
       vol, amount, adj_factor, adj_factor_latest, ingested_at
FROM v_daily_full
WHERE ts_code = ?
ORDER BY trade_date DESC
LIMIT 1;

adj_factor_latest is whatever the latest row has — for a recently-ingested row this is the same value adj_factor.

5.3 ADJ_FACTOR — separate accessor (not via get_quote)

Capability.ADJ_FACTOR is claimed for parity with the existing TushareDailyProvider, which exposes _call(api_name="adj_factor", …) for the Composer's internal use. WarehouseProvider's equivalent:

python
def adj_factor_at(self, code: str, trade_date: str) -> float | None:
    """Returns the factor in effect for trade_date, or None if unknown."""

Not on the DataProvider ABC. Composer doesn't need it directly because adj_factor reaches Quote via the v_daily_full join above.

(If we later need standalone factor lookups — e.g. a /factor endpoint — this is the seam.)

5.4 FUNDAMENTALS_QUARTERLY

Out of band for get_quote (which returns a Quote, not fundamentals). A future WarehouseProvider.get_fundamentals(code, period) method reads normalized_income / normalized_balance / normalized_cashflow from 01-data-layer §6.2. The current Quote-only ABC doesn't accommodate this; the fundamentals method is a sibling, not an override.

This spec does not lock the fundamentals interface. That's W6.x follow-up. We're staking the capability so Composer's by_capability(...) lookup finds us when the time comes.

5.5 INDEX_DAILY

Same pattern as DAILY_QUOTE_HISTORICAL but against normalized_index_daily (TBD per 01-data-layer §5 TODO). Capability claimed, implementation deferred — capability declaration is cheap; the actual SELECT lands when index ingest goes live.

6. Failure semantics

ConditionRaisesComposer behavior
Row exists, all fields populated(returns Quote)OK
Warehouse file missing / corruptedProviderUnavailable("warehouse db unreadable: …")falls back to next in chain
Table exists but empty for this ts_codeProviderUnavailable("no rows for {code}")falls back
Row exists but missing close (data quality)CanonicalUnitViolation("close is null for {code}@{date}")does NOT fall back — surfaces 500
trade_date > max(trade_date) in normalized_dailyProviderUnavailable("warehouse stale: latest is {date}, requested {date}")falls back to tushare
trade_date not a trading day (weekend / holiday)CapabilityNotSupported("{date} is not a trading day")does NOT fall back — config bug
Lock contention with W3 writerretries via WarehousePool, falls through to ProviderUnavailable if pool exhaustedfalls back

Two design notes:

  • CanonicalUnitViolation blocks fallback. A null close in normalized_daily means the ingest script wrote a bad row — calling Tushare for the same row would mask the upstream / ingest bug. The 500 must surface so on-call notices.
  • Stale vs missing. Some callers test "is the warehouse warm yet?" by querying a code they know exists. The stale message must distinguish "we don't have this date yet" (transient) from "we don't have this code" (permanent). The audit log records both as ProviderUnavailable with the message text.

7. freshness() reporting

FreshnessReport for WarehouseProvider draws from two sources:

  1. source_freshness table (01-data-layer §6.7) — written by W3 ingest after each successful batch. Holds last_success, last_failure, rows_today per source key.
  2. In-memory counters (mirrors the pattern in TushareDailyProvider §120-127) — incremented on each get_quote call. These reflect serving activity, not ingest.

Reporting strategy: prefer in-memory if any get_quote happened this process lifetime; fall back to source_freshness row for tushare:daily if the provider hasn't been queried since startup.

python
def freshness(self) -> FreshnessReport:
    if self._serve_count > 0:
        return FreshnessReport(
            provider_id=self.id,
            last_success=self._last_serve_at,
            last_failure=self._last_failure_at,
            error_msg=self._last_failure_msg,
            rows_today=self._serve_count,
        )
    # Cold-start fallback: report ingest freshness.
    return self._freshness_from_source_freshness()

8. health() reporting

The merged convention (TushareDailyProvider.health(), AkshareSpotProvider.health()) is:

healthy = NOT (last_failure exists AND last_failure > last_success)

WarehouseProvider follows the same rule. Additionally, on first call after process start it does a lightweight readiness probe:

python
SELECT count(*) FROM normalized_daily WHERE trade_date >= today - INTERVAL 7 DAY;

If zero, health() returns unhealthy regardless of in-memory state — the warehouse is empty / unpopulated. This catches "container restarted before W3 finished its first ingest" without waiting for a get_quote call to fail.

9. ROUTES placement (no change)

The Composer ROUTES table (PR #34) already includes "warehouse" as the first entry for these (intent, MarketState) pairs:

intentMarketStatechain
historical*["warehouse", "tushare-daily"]
latestCLOSED_FINAL["warehouse", "tushare-daily"]
latestPRE_OPEN["warehouse", "tushare-daily"]
latestOPEN["akshare-spot", "warehouse", "tushare-daily"]
latestCLOSING["tushare-daily"] only

W6 only registers the provider; routing is already correct on main. Test goal: confirm that after lifespan registration, the first chain entry actually resolves and Composer no longer skips it.

10. daily_cache deletion plan

src/service/cache.py (DailyCache / FundamentalsCache) is the v0.2.0 cache that the W4 cutover (#49) replaces in the request path. But the module file still exists on main after W4, holding stale schema definitions and ~200 lines of dead code.

W6 deletes it:

  1. Verify the v0.2.0 envelope shape consumer (routes/price.py, routes/fundamentals.py) is fully on Composer/QuoteCache. (Confirmed when #49 / #51 merge.)
  2. Delete src/service/cache.py plus core/tests/integration/test_service_cache.py.
  3. Bump the v0.3.0 changelog entry: "removed legacy DailyCache / FundamentalsCache (replaced by Composer + QuoteCache in W4/W5)".

This is one commit inside the W6 PR; not a separate cleanup PR. Keeping both the new path and the old module file orphans the legacy code path where bugs can hide (the same defense in W4 spec §5.2).

11. Test plan

11.1 Unit (core/tests/unit/test_warehouse_provider.py)

  • Capability set declared correctly
  • get_quote happy path: pre-seed normalized_daily + normalized_adj_factor, query, assert Quote fields match (including adj_factor_latest)
  • get_quote returns Quote with close_qfq computed (sanity check the property math survives the warehouse path)
  • get_quote empty warehouse → ProviderUnavailable
  • get_quote row with NULL close → CanonicalUnitViolation
  • get_quote trade_date > max(warehouse) → ProviderUnavailable with the stale message
  • get_quote weekend trade_date → CapabilityNotSupported
  • freshness() returns in-memory after a call; reads source_freshness before any call
  • health() unhealthy when warehouse < 7 days old data

11.2 Integration with Composer

  • Register WarehouseProvider + TushareDailyProvider; mock both
  • Composer.get_quote for historical → warehouse called first
  • Warehouse raises ProviderUnavailable → Composer falls back to Tushare
  • Warehouse raises CanonicalUnitViolation → Composer does not fall back

11.3 Lifespan integration

Once the T2.7 lifespan injection (#47) lands:

  • Backend startup with WAREHOUSE_DB_PATH=... env → WarehouseProvider in registry
  • Backend startup without that env → registry has only Tushare + akshare (warehouse path is optional during transition)

12. Acceptance criteria

  • [ ] core/data/providers/warehouse.py exists and implements all five Capability claims (at least DAILY_QUOTE_HISTORICAL / LATEST fully; FUNDAMENTALS / INDEX may stub-raise CapabilityNotSupported with a TODO comment)
  • [ ] All unit tests in §11.1 pass
  • [ ] Composer integration test in §11.2 covers fallback semantics
  • [ ] src/service/cache.py and its test file deleted in the same PR
  • [ ] Lifespan registers WarehouseProvider when env var is set
  • [ ] Spec §4.3.3 sketch updated to point at this spec
  • [ ] Status doc entry: "v0.3.0 W6 WarehouseProvider live; old DailyCache removed"

13. Open questions

  1. INDEX_DAILY table name not yet finalized. 01-data-layer §5 marks index ingest as TODO. WarehouseProvider can claim the capability now; the SELECT lands when ingest does. Decision: claim it; stub the SELECT.

  2. Per-row source filtering. normalized_daily has a source column (default 'tushare') for future multi-source ingest (e.g. TDX from W7 spec #45). Should WarehouseProvider filter by source? Decision: no — read whatever's there. Cross-source diff happens at AuditLog, not here.

  3. adj_factor_latest semantics during corporate actions. If a split happens between trade_date and "today", the warehouse's "latest" factor and Tushare's "latest" factor agree (both reflect the new post-split factor). But during the split day itself they may disagree for ~1 day. Decision: accept the eventual- consistency window; AuditLog will surface the diff if W7 TDX is live.

  4. Cold-start latency. Loading 50M rows × 5GB DuckDB into mmap on container start takes ~5 seconds. Does the readiness probe in §8 add unwanted latency? Decision: the probe is one SELECT count(*) on an indexed column; <50 ms. Keep it.

  5. DuckDB connection pool semantics during W3 writes. 01-data-layer §6.6.2 notes "read connection during write tx sees old snapshot". If W3's ingest commit happens mid-request, do we serve stale or re-query? Decision: stale this request, fresh next. WarehousePool uses short-lived BEGIN; SELECT; COMMIT; per call so the next call sees the new commit.

14. Dependencies

  • Hard: PR #38 (W3 ingest + scheduler) merged AND warehouse populated with at least Stage 1 (T-730d) backfill complete. Without this, every get_quote raises ProviderUnavailable and Composer always falls back to Tushare — no observable behavior change.
  • Hard: PR #26 (WarehousePool) — already merged on main. W6 uses pool.read() for every query.
  • Soft: PR #47 (lifespan integration spec) — guides where the registration call lands. Not a literal merge dependency, but the pattern should match.
  • Soft: PR #49 (W4 cutover) — daily_cache deletion in §10 is cleaner if /price is already on Composer.
  • Followed by: W7 TDX (#45) — once warehouse is live, cross-source validation has a second source to compare against.

15. Cross-references

  • 2026-05-08-tushare-integration-design.md §4.3.3 — original sketch
  • 2026-05-08-tushare-integration-design.md §6.1 — ROUTES placement
  • 2026-05-08-tushare-integration-design.md §10 — W6 task list
  • 01-data-layer.md §6.2 — warehouse schema
  • 01-data-layer.md §6.6.1 — v_daily_full view
  • 01-data-layer.md §6.6.2 — WarehousePool / WarehouseWriter
  • 01-data-layer.md §6.7 — source_freshness metadata
  • 2026-05-10-lifespan-integration-spec.md §10 — W6 explicitly out-of-scope there
  • W4 spec §5.2 — daily_cache deletion precedent
  • W4 runbook — operator playbook (parallel)

团队内部文档