TDX (通达信) Cross-Source Validation — Design Spec
Date: 2026-05-10

Status: Draft (W7+, post-v0.3.0 scope)

Owner: liang

Sibling docs:

| Document | Relevance |
|---|---|
| 01-data-layer.md §8.6 | TDX deferred from v0.3.0 backfill scope (Tushare quota is sufficient) |
| 2026-05-08-tushare-integration-design.md §1, §6, §8 | 5 v0.2.0 incidents, Composer routing, provider_diff_log table |
| 2026-05-08-deployment-toolset-rollout.md §10 (Out of scope) | TDX cross-source validation marked as a separate W7+ spec |
1. Background & Motivation
v0.3.0 ships a single-source A-share data warehouse (Tushare). Spec §8 already defines provider_diff_log for cross-source diff monitoring, with three severity tiers and /quality/diff endpoint, but no second source exists yet.
Without independent verification, the warehouse cannot detect Tushare-side bugs:
- Adjustment-factor edge cases: ex-dividend days, splits with stub shares, transitional periods around corporate actions
- Renamed-stock (曾用名) delays: Tushare's `namechange` updates lag the actual stock listing change by days
- Delisted (退市) truncation: delisted ts_codes drop from `daily` after a delisted-history grace window
- Suspended (停牌) handling: Tushare marks suspended days variably (sometimes empty row, sometimes carried-forward close)
These bugs are not caught by the W2 architecture — the architecture validates our code (canonical units, cache TTL, source routing). It assumes Tushare's data itself is correct. TDX provides the second opinion.
Why TDX (not akshare / Yahoo / Alpha Vantage)
| Source | Independent? | A-share coverage | Cost |
|---|---|---|---|
| Tushare | (incumbent) | full | paid (¥200/yr) |
| akshare | partial; sometimes wraps Tushare or Sina | full | free |
| Yahoo Finance | yes, but its A-share coverage (1980s to now) is spotty | partial | free |
| Alpha Vantage | yes, but A-shares are not their focus | thin | paid above quota |
| TDX (通达信) | yes; desktop client / zip files sourced from exchange (交易所) data | full | free |
TDX wins on independence + coverage. The cost is engineering: the data is delivered as binary files, not through an HTTP API.
2. Goals
- Daily / weekly cross-source comparison of `daily` close prices for the entire A-share universe (~5,500 ts_codes).
- Automatic recording of disagreements above 0.01% to the existing `provider_diff_log` table (spec §8).
- Severity-tiered alerting:
  - 0.01–1%: `info` (rounding / minor discrepancies)
  - 1–5%: `warning` (suspicious; flag for review)
  - ≥5%: `critical` (likely data error; webhook alert)
- Manual review workflow for `critical` diffs: the operator gets a `/quality/diff?severity=critical` view and decides whether to retain Tushare or override.
- Detection of at least 5 known Tushare quirks within the first 30 days of running. (Acceptance criterion; list calibrated empirically.)
2a. Non-goals
- Real-time TDX integration (binary protocol's reliability outside Windows is questionable; TDX zips are the practical path).
- TDX as primary source. Tushare's coverage + ergonomics > TDX's.
- TDX for fundamentals / financial statements. TDX historically focuses on price/volume; statements are a Tushare specialty.
- Auto-correction. When Tushare and TDX disagree, the audit log captures both and surfaces for human review. No silent override.
- Replace Tushare. Tushare stays primary.
- Use TDX as Composer fallback for normal queries. Out of scope; Composer's fallback chain (akshare-spot, warehouse, tushare-daily) is sufficient for serving requests.
3. Research: TDX data acquisition
Three viable paths, ranked by simplicity / reliability:
3.1 ✅ TDX zip downloads (recommended)
- Source: https://www.tdx.com.cn/article/stockfin.html
- Format: zip files containing TDX-specific binary `.day` files
- Cadence: TDX publishes weekly / monthly batch dumps from exchange (交易所) data
- Cost: free; no auth needed
- Pros: stable URLs, no client / server protocol; download once, parse offline; no rate limits
- Cons: binary format requires parsing; weekly cadence means data is ~1-7 days behind real-time (acceptable for cross-source audit, NOT for serving requests)
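As a minimal sketch of the offline half of this path, the snippet below lists the `.day` members of an already-downloaded zip using Python's standard `zipfile` module. The helper name `iter_day_files` and the member naming (`sh600519.day`, possibly nested in sub-directories) are assumptions for illustration; the real archive layout should be checked against an actual TDX download.

```python
import io
import zipfile
from pathlib import PurePosixPath
from typing import BinaryIO, Iterator, Union


def iter_day_files(zip_source: Union[str, BinaryIO]) -> Iterator[tuple[str, bytes]]:
    """Yield (tdx_code, raw_bytes) for every .day member of a zip archive.

    Hypothetical helper: assumes members are named like 'sh600519.day'.
    """
    with zipfile.ZipFile(zip_source) as archive:
        for member in archive.namelist():
            path = PurePosixPath(member)
            if path.suffix.lower() != ".day":
                continue  # skip readmes, directories, other formats
            yield path.stem.lower(), archive.read(member)


if __name__ == "__main__":
    # Build a tiny in-memory zip so the sketch runs without a real download.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("sh/lday/sh600519.day", b"\x00" * 32)
        archive.writestr("readme.txt", "not a quote file")
    buffer.seek(0)
    for code, payload in iter_day_files(buffer):
        print(code, len(payload))
```

Reading members straight from the archive avoids an extraction step on disk, which fits the "download once, parse offline" property listed above.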
3.2 pytdx3 binary protocol library
- URL: https://github.com/lierwang19/pytdx3
- Python client speaking TDX's network protocol; talks to TDX servers run by 通达信 / brokerages.
- Pros: more recent data than zip dumps (sub-daily); same Python ecosystem as our backend
- Cons: server discovery + connection handling fragile; servers change IPs frequently; network reliability often poor outside business hours
→ Use as W9 fallback if zip downloads delay or break.
3.3 ❌ eltdx local reader (rejected)
- URL: https://github.com/electkismet/eltdx
- Reads `.day` binaries from a TDX desktop install (Windows-only).
- Doesn't fit our headless Linux backend.
→ Not viable.
3.4 Decision
W7 implements path 3.1 only. Path 3.2 is deferred to W9 and is picked up only if zip publishing breaks for more than 2 weeks running.
4. Provider design: TdxBatchProvider
Fits the existing DataProvider ABC (spec §4) with a constrained capability:
```python
from pathlib import Path

# DataProvider, Capability, Quote, FreshnessReport, HealthStatus and
# CapabilityNotSupported come from the existing provider framework (spec §4).


class TdxBatchProvider(DataProvider):
    id = "tdx-batch"
    capabilities = {Capability.DAILY_QUOTE_HISTORICAL}

    def __init__(
        self,
        zip_dir: Path,
        *,
        exchange: str = "SSE",
        max_age_days: int = 14,
    ) -> None:
        """zip_dir contains downloaded TDX zip files (or extracted .day files).

        max_age_days bounds the lookback (older zips skipped to keep memory low).
        """

    def get_quote(
        self, code: str, *, trade_date: str | None = None
    ) -> Quote:
        if trade_date is None:
            raise CapabilityNotSupported(
                "tdx-batch is historical-only; trade_date required"
            )
        # 1. Locate the zip / .day file containing trade_date for this code
        # 2. Parse binary (32-byte struct per row): date / open / high / low /
        #    close / amount / volume / reserved
        # 3. Convert units: TDX stores price * 100 (integer 分), volume * 1
        # 4. Verify canonical conversion → CanonicalUnitViolation if violated
        # 5. Return Quote with source="tdx-batch", freshness_seconds reflecting
        #    publication delay (typically ~3-7 days)
        ...

    def freshness(self) -> FreshnessReport: ...

    def health(self) -> HealthStatus: ...
```

Key constraints
- Capabilities limited: Only `DAILY_QUOTE_HISTORICAL`. TDX zips are weekly/monthly; latest data is days behind. Composer's ROUTES table intentionally does not include `tdx-batch`; it's never a fallback for user-facing queries.
- Freshness honesty: `freshness_seconds` reflects actual zip publication delay (at least ~3-7 days). Quote consumers should NOT treat this as real-time.
- Unit normalization: TDX stores prices as integer 分 (×100); volume as shares (not 手). The provider's boundary code converts. Violations → `CanonicalUnitViolation` (same pattern as v0.3.0 W2.1/W2.2).
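The boundary check described in the last bullet might look roughly like the sketch below. It is illustrative only: `CanonicalUnitViolation` here is a local stand-in for the project's exception, the helper names `fen_to_yuan` and `check_ohlc` are hypothetical, and rejecting non-positive prices is an assumption that would need revisiting for suspended days.

```python
class CanonicalUnitViolation(ValueError):
    """Local stand-in for the project's exception of the same name."""


def fen_to_yuan(raw_price: int) -> float:
    """Convert a TDX integer price (fen, i.e. yuan x 100) to float yuan."""
    if isinstance(raw_price, bool) or not isinstance(raw_price, int):
        # A float here usually means the value was already scaled upstream.
        raise CanonicalUnitViolation(f"expected integer fen, got {raw_price!r}")
    if raw_price <= 0:
        raise CanonicalUnitViolation(f"non-positive price: {raw_price}")
    return raw_price / 100.0


def check_ohlc(open_: float, high: float, low: float, close: float) -> None:
    """Reject bars whose converted prices are internally inconsistent."""
    if not (low <= open_ <= high and low <= close <= high):
        raise CanonicalUnitViolation(
            f"OHLC out of order: o={open_} h={high} l={low} c={close}"
        )


print(fen_to_yuan(168050))  # 1680.5
```

Failing at the boundary keeps a mis-scaled value from ever reaching the diff log, where it would look like a genuine cross-source disagreement.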
What TdxBatchProvider does NOT do
- Cache results (the `CrossSourceAuditor` orchestrator handles batch reads; per-call caching adds complexity)
- Retry binary parse failures (corrupt zip → fail loudly so operator notices)
- Talk to network at all (zips are pre-downloaded; the provider operates offline)
5. Cross-source diff workflow
A new orchestrator, CrossSourceAuditor, drives the comparison:
```python
from datetime import date


class CrossSourceAuditor:
    def __init__(
        self,
        warehouse_provider: DataProvider,  # tushare-daily / warehouse
        tdx_provider: TdxBatchProvider,
        audit_log: AuditLog,
        *,
        diff_threshold_pct: float = 0.01,
        warning_threshold_pct: float = 1.0,
        critical_threshold_pct: float = 5.0,
    ) -> None:
        self.warehouse_provider = warehouse_provider
        self.tdx_provider = tdx_provider
        self.audit_log = audit_log
        self.diff_threshold_pct = diff_threshold_pct
        self.warning_threshold_pct = warning_threshold_pct
        self.critical_threshold_pct = critical_threshold_pct

    def audit_window(
        self, *, start: date, end: date
    ) -> dict[str, int]:
        """Compare every (ts_code, trade_date) in window;
        write diffs to provider_diff_log; return summary counts."""
        counts = {"info": 0, "warning": 0, "critical": 0}
        for trade_date in trade_days_between(start, end):
            for ts_code in self.tdx_provider.codes_for(trade_date):
                wh_quote = self.warehouse_provider.get_quote(
                    ts_code, trade_date=trade_date
                )
                tdx_quote = self.tdx_provider.get_quote(
                    ts_code, trade_date=trade_date
                )
                severity = self._diff_or_record(wh_quote, tdx_quote)
                if severity is not None:
                    counts[severity] += 1
        return counts

    def _diff_or_record(self, a: Quote, b: Quote) -> str | None:
        """Record a diff if above threshold; return its severity, else None."""
        if a.close_raw is None or b.close_raw is None:
            return None  # one side missing; neither is "right", so skip
        if b.close_raw == 0:
            return None  # percentage diff is undefined for a zero reference close
        diff_pct = abs(a.close_raw - b.close_raw) / b.close_raw * 100.0
        if diff_pct < self.diff_threshold_pct:
            return None  # within rounding tolerance
        severity = (
            "critical" if diff_pct >= self.critical_threshold_pct
            else "warning" if diff_pct >= self.warning_threshold_pct
            else "info"
        )
        self.audit_log.record_diff(
            code=a.ts_code,
            trade_date=a.trade_date,
            provider_a=a.source,
            value_a=a.close_raw,
            provider_b=b.source,
            value_b=b.close_raw,
            severity=severity,
        )
        return severity
```

Schedule
Daily 03:00 (post-warehouse-refresh, pre-trading-day):

```
00:00  TDX publisher uploads new weekly batch (their cadence)
02:00  Our scheduler downloads new TDX zip (if newer than last)
03:00  CrossSourceAuditor.audit_window(start=today-7, end=today-3)
         writes diffs to provider_diff_log
         triggers webhook for any new critical
```

The 3-day lag at the end of the window (`end=today-3`) absorbs TDX publication delay; we audit "settled" data only.
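The window arithmetic in the 03:00 step (`start=today-7`, `end=today-3`) can be written out as a small helper. This is a sketch; the function name `audit_window_bounds` and its parameter names are hypothetical, and it works in calendar days, leaving trading-day filtering to `trade_days_between`.

```python
from datetime import date, timedelta


def audit_window_bounds(
    today: date, *, lookback_days: int = 7, settle_lag_days: int = 3
) -> tuple[date, date]:
    """Return (start, end) calendar dates for the nightly audit window."""
    if settle_lag_days > lookback_days:
        raise ValueError("settle_lag_days must not exceed lookback_days")
    start = today - timedelta(days=lookback_days)
    end = today - timedelta(days=settle_lag_days)
    return start, end


start, end = audit_window_bounds(date(2026, 5, 10))
print(start, end)  # 2026-05-03 2026-05-07
```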
Webhook hook
Spec §8.3 / §11 already define a webhook for critical diffs. W8 implements it (URL configurable via env / runtime.yaml). Default: log to stderr only (no external dependency).
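One possible shape for that hook is sketched below, using only the standard library. The function name `notify_critical`, the JSON payload shape and the 5-second timeout are assumptions, not part of spec §8.3 / §11. With no URL configured it logs to stderr, and any delivery failure is swallowed so the audit run continues.

```python
import json
import sys
import urllib.request
from typing import Optional


def notify_critical(
    diff: dict, webhook_url: Optional[str] = None, timeout: float = 5.0
) -> bool:
    """Best-effort alert. Returns True only if a webhook POST succeeded."""
    payload = json.dumps(diff, default=str)
    if not webhook_url:
        # Default behaviour: no external dependency, just log.
        print(f"[critical-diff] {payload}", file=sys.stderr)
        return False
    try:
        request = urllib.request.Request(
            webhook_url,
            data=payload.encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except Exception as exc:  # deliberately broad: alerting must never abort the audit
        print(f"[critical-diff] webhook failed ({exc}); payload: {payload}", file=sys.stderr)
        return False
```

Returning a boolean instead of raising lets the caller count failed deliveries without wrapping every call in its own `try` block.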
6. Implementation phases
W7 (1 week) — core path
- [ ] `core/data/providers/tdx_batch.py` (~150 lines)
- [ ] `core/data/cross_source_auditor.py` (~200 lines)
- [ ] `scripts/run_tdx_audit.py` (one-shot CLI)
- [ ] Unit tests for both modules (~250 lines): provider conformance suite + auditor diff math + severity classification
- [ ] Integration test with sample fixture data
W8 (3-5 days) — productionization
- [ ] APScheduler daily 03:00 cron in `src/service/main.py` lifespan
- [ ] `/quality/diff` endpoint exposes recent diffs (data already in table)
- [ ] Webhook config + first webhook fire on critical
- [ ] Deploy to tvps + monitor first week
W9 (deferred) — pytdx3 fallback
- [ ] If TDX zips break for >2 weeks, fall back to pytdx3 binary protocol
- [ ] Otherwise indefinitely deferred
7. Open questions
- TDX zip parsing: pytdx3.exhq.parser or hand-rolled? Probably pytdx3, with a thin wrapper that validates the parsed dict shape and converts units. Format is documented in TDX's reference; not novel.
- Storage of TDX raw data. Do we keep parsed TDX data in a `tdx_daily` raw layer? Yes: use the 01 §6.1 raw-layer pattern; it helps debug diffs ("did Tushare or TDX shift?"). Compressed JSON, ~50 MB/year for nationwide A-share daily.
- Disagreement attribution. When a diff is found, how do we know which side is correct? Log both. Flag for human review. No auto-correction.
- Holiday handling. TDX's binary `.day` doesn't include trade_cal; we rely on the warehouse's `normalized_trade_cal` (already populated by W1.2).
- Code-format mapping. TDX uses `sh600519` / `sz000001` / `bj430001`; Tushare uses `600519.SH` etc. Boundary normalization in `TdxBatchProvider._tdx_to_canonical_code()`.
- Should the auditor delete or expire old diff_log rows? Retention:
  - `info`: 30 days
  - `warning`: 90 days
  - `critical`: forever (manually reviewed)

  Add a `cleanup_old_diffs()` method, run weekly.
- How to detect Tushare's known quirks before W7 ships. Could pre-script a manual diff with `pytdx3` to find out which date ranges have the most disagreements; calibrate severity tiers from that sample. Optional preliminary research task.
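The code-format mapping question is mechanical enough to sketch. The version below is a free function rather than the `_tdx_to_canonical_code()` method named above, and the `.BJ` suffix for Beijing-listed codes is an assumption extrapolated from the `600519.SH` example.

```python
# Exchange prefix used by TDX -> suffix used in Tushare-style ts_codes.
_TDX_PREFIX_TO_SUFFIX = {"sh": "SH", "sz": "SZ", "bj": "BJ"}


def tdx_to_canonical_code(tdx_code: str) -> str:
    """Map a TDX code such as 'sh600519' to a ts_code such as '600519.SH'."""
    normalized = tdx_code.strip().lower()
    prefix, digits = normalized[:2], normalized[2:]
    well_formed = (
        prefix in _TDX_PREFIX_TO_SUFFIX
        and len(digits) == 6
        and digits.isascii()
        and digits.isdigit()
    )
    if not well_formed:
        raise ValueError(f"unrecognized TDX code: {tdx_code!r}")
    return f"{digits}.{_TDX_PREFIX_TO_SUFFIX[prefix]}"


print(tdx_to_canonical_code("sh600519"))  # 600519.SH
```

Rejecting anything that is not a two-letter prefix plus six digits keeps index or fund files with unusual names from being silently mapped to a stock code.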
8. Risks & mitigations
| # | Risk | Probability | Impact | Mitigation |
|---|---|---|---|---|
| 1 | TDX zip URL breaks | low | high | Manual download fallback; keep last successful zip locally; W9 pytdx3 path |
| 2 | pytdx3 server connectivity unstable | high | medium | Stick with zip path (3.1); pytdx3 is W9 only |
| 3 | TDX/Tushare unit conventions differ | medium | high | Validate at TdxBatchProvider boundary; CanonicalUnitViolation if mismatch (matches W2.1 pattern) |
| 4 | Diff log table grows huge | medium | medium | Retention policy (Open Q #6); 30/90/∞ tiers |
| 5 | Alert fatigue if Tushare quirks are common | medium | medium | Calibrate severity tiers from first 30 days of real data; tune thresholds |
| 6 | Webhook failure brings down the audit | low | medium | Webhook fire is best-effort; failure doesn't break auditor |
| 7 | TDX data is itself wrong (rare, but seen historically with corp actions) | medium | medium | This is exactly why we need the diff log + human review; not a bug |
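For risk 4, the 30/90/∞ retention tiers reduce to a per-severity cutoff date. The sketch below shows only that date arithmetic; the names `RETENTION_DAYS`, `retention_cutoff` and `is_expired` are hypothetical, and the actual deletion against `provider_diff_log` is left out because its schema lives in the sibling spec.

```python
from datetime import date, timedelta
from typing import Optional

# Days to keep each severity; None means keep forever.
RETENTION_DAYS = {"info": 30, "warning": 90, "critical": None}


def retention_cutoff(severity: str, today: date) -> Optional[date]:
    """Rows recorded before the returned date may be deleted; None = never."""
    days = RETENTION_DAYS[severity]
    return None if days is None else today - timedelta(days=days)


def is_expired(severity: str, recorded_on: date, today: date) -> bool:
    """True if a diff row of this severity is past its retention period."""
    cutoff = retention_cutoff(severity, today)
    return cutoff is not None and recorded_on < cutoff


print(retention_cutoff("info", date(2026, 5, 10)))  # 2026-04-10
```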
9. Acceptance criteria
- [ ] `TdxBatchProvider` passes the W2.5 conformance suite (id / capabilities / get_quote / freshness / health / error mapping)
- [ ] `CrossSourceAuditor` runs `audit_window` against last 7 days nightly, writing to `provider_diff_log`
- [ ] At least 5 known Tushare quirks (e.g. delisted-stock (退市股) truncation, adjustment-factor (复权因子) boundary) detected and logged within first 30 days
- [ ] `GET /quality/diff?severity=critical` shows recent critical disagreements
- [ ] `GET /quality/diff?severity=warning&since=...` filterable
- [ ] Webhook fires on first critical (then deduped per code/date)
- [ ] No regression in v0.3.0 W2 work; cross-source is purely additive
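The "deduped per code/date" criterion could be met with something as small as the sketch below. It keeps state in memory only, which is an assumption for illustration; a production version would presumably derive the same information from `provider_diff_log` so that deduplication survives restarts. The class name `CriticalAlertDeduper` is hypothetical.

```python
class CriticalAlertDeduper:
    """Remember which (code, trade_date) pairs have already triggered an alert."""

    def __init__(self) -> None:
        self._seen: set[tuple[str, str]] = set()

    def should_fire(self, code: str, trade_date: str) -> bool:
        """Return True the first time a (code, trade_date) pair is seen."""
        key = (code, trade_date)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


deduper = CriticalAlertDeduper()
print(deduper.should_fire("600519.SH", "20260506"))  # True
print(deduper.should_fire("600519.SH", "20260506"))  # False
```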
10. Out of scope (recap)
- TDX as primary or fallback source for normal queries
- TDX for fundamentals / financial statements
- Real-time TDX subscription
- Broker-specific TDX server selection
- TDX zip backup / archival to S3 / external storage
- Auto-correction of diffs (human review only)
11. Cross-references
- DataProvider ABC: `02-tushare-integration-design.md` §4 (already merged in v0.3.0 main)
- `provider_diff_log` table schema: `02-tushare-integration-design.md` §8 (PR #41 in v0.3.0)
- Severity tier semantics: `02-tushare-integration-design.md` §8.3
- 5 v0.2.0 incidents (the things v0.3.0 W2 architecture catches): `02-tushare-integration-design.md` §1
- Composer ROUTES (proof TDX is NOT a normal serving source): `02-tushare-integration-design.md` §6.1
- 01 §8.6: TDX deferred from W3 backfill scope
- Existing GitHub research notes: `github_research.md` (Data section)
Appendix A: TDX .day binary format reference
A .day file is a sequence of 32-byte fixed-width records:
| Offset | Field | Type | Notes |
|---|---|---|---|
| 0..3 | trade_date | uint32 | YYYYMMDD as integer |
| 4..7 | open | uint32 | price × 100 (integer 分) |
| 8..11 | high | uint32 | same scale |
| 12..15 | low | uint32 | same scale |
| 16..19 | close | uint32 | same scale |
| 20..23 | amount | float | turnover (成交额), in 元, single-precision |
| 24..27 | volume | uint32 | trading volume (成交量); 手 in TDX convention (verify) |
| 28..31 | reserved | uint32 | ignored |

Critical conversions for canonical Quote:
- price fields: divide by 100, store as `float` 元 (matches Tushare daily.close)
- amount: divide by 1000, store as `float` 千元 (matches Tushare daily.amount)
- volume: this appendix assumes TDX gives 手 (lots) already, stored as `int`, whereas §4 assumes 股 (shares); the unit must be verified before it is trusted. Validate by cross-checking against Tushare daily.vol on a known stock: if the TDX value is consistently 100× the Tushare value, then TDX gave 股 and we need to divide by 100.
The validation step is critical: incident #4 (浙文互联 5,362,159 万元) was exactly this kind of unit assumption error. Every TDX zip parse run should include a sanity check on a hand-picked stock's (close × volume) ≈ amount.
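The (close × volume) ≈ amount check can also double as a way to tell which volume unit a file uses. The sketch below is one way to phrase it; the function names and the 10% default tolerance are assumptions. The tolerance is loose on purpose, since the close is not the day's average traded price.

```python
def amount_matches(
    close_yuan: float, volume_shares: int, amount_yuan: float, tolerance: float = 0.10
) -> bool:
    """True if close x volume is within a relative tolerance of the reported amount."""
    if amount_yuan <= 0 or volume_shares <= 0:
        return False
    implied = close_yuan * volume_shares
    return abs(implied - amount_yuan) / amount_yuan <= tolerance


def diagnose_volume_unit(close_yuan: float, volume_raw: int, amount_yuan: float) -> str:
    """Guess whether volume_raw is in shares or in lots of 100 shares."""
    if amount_matches(close_yuan, volume_raw, amount_yuan):
        return "shares"
    if amount_matches(close_yuan, volume_raw * 100, amount_yuan):
        return "lots"
    return "unknown"  # neither reading is plausible: stop and investigate


print(diagnose_volume_unit(10.0, 1_000_000, 10_050_000.0))  # shares
print(diagnose_volume_unit(10.0, 10_000, 10_050_000.0))     # lots
```

An `"unknown"` result on the hand-picked stock is the signal to halt the parse run rather than write suspect rows.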