
TDX (通达信) Cross-Source Validation — Design Spec

Date: 2026-05-10
Status: Draft (W7+, post-v0.3.0 scope)
Owner: liang
Sibling docs:

01-data-layer.md §8.6: TDX deferred from v0.3.0 backfill scope (Tushare quota is sufficient)
2026-05-08-tushare-integration-design.md §1, §6, §8: 5 v0.2.0 incidents, Composer routing, provider_diff_log table
2026-05-08-deployment-toolset-rollout.md §10 (Out of scope): TDX cross-source validation marked as a standalone W7+ spec

1. Background & Motivation

v0.3.0 ships a single-source A-share data warehouse (Tushare). Spec §8 already defines provider_diff_log for cross-source diff monitoring, with three severity tiers and a /quality/diff endpoint, but no second source exists yet.

Without independent verification, the warehouse cannot detect Tushare-side bugs:

  • Adjustment-factor edge cases: ex-dividend days, splits with stub shares, transitional periods around corporate actions
  • 曾用名 (renamed-stock) delays: Tushare's namechange updates lag the actual stock listing change by days
  • 退市 (delisted) truncation: delisted ts_codes drop from daily after a delisted-history grace window
  • Suspended (停牌) handling: Tushare marks suspended days variably (sometimes empty row, sometimes carried-forward close)

These bugs are not caught by the W2 architecture — the architecture validates our code (canonical units, cache TTL, source routing). It assumes Tushare's data itself is correct. TDX provides the second opinion.

Why TDX (not akshare / Yahoo / Alpha Vantage)

Source          Independent?                                           A-share coverage   Cost
-------------   ----------------------------------------------------   ----------------   ----------------
Tushare         (incumbent)                                            full               paid (¥200/yr)
akshare         partial; sometimes wraps Tushare or Sina               full               free
Yahoo Finance   yes, but 1980s-to-now coverage of A-shares is spotty   partial            free
Alpha Vantage   yes, but A-shares are not their focus                  thin               paid above quota
TDX (通达信)    yes; desktop client / zip files from the exchanges     full               free

TDX wins on independence + coverage. The cost is engineering: the data is delivered as binary files, not via an HTTP API.


2. Goals

  1. Daily / weekly cross-source comparison of daily close prices for the entire A-share universe (~5,500 ts_codes).
  2. Automatic recording of disagreements above 0.01% to the existing provider_diff_log table (spec §8).
  3. Severity-tiered alerting:
    • 0.01–1%: info (rounding / minor discrepancies)
    • 1–5%: warning (suspicious; flag for review)
    • ≥5%: critical (likely data error; webhook alert)

  4. Manual review workflow for critical diffs — operator gets a /quality/diff?severity=critical view, decides to retain Tushare or override.
  5. Detection of at least 5 known Tushare quirks within the first 30 days of running. (Acceptance criterion; list calibrated empirically.)
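
The tiers in goal 3 can be sketched as a small pure function. This is a minimal illustration, not the project's implementation: the name classify_diff is hypothetical, the default thresholds are the ones listed above, and each lower bound is assumed to be inclusive.

```python
from typing import Optional


def classify_diff(
    value_a: float,
    value_b: float,
    *,
    diff_threshold_pct: float = 0.01,
    warning_threshold_pct: float = 1.0,
    critical_threshold_pct: float = 5.0,
) -> Optional[str]:
    """Return "info", "warning", "critical", or None when within tolerance."""
    if value_b == 0:
        raise ValueError("reference close is zero; relative diff is undefined")
    # Relative difference in percent, measured against value_b.
    diff_pct = abs(value_a - value_b) / abs(value_b) * 100.0
    if diff_pct < diff_threshold_pct:
        return None
    if diff_pct >= critical_threshold_pct:
        return "critical"
    if diff_pct >= warning_threshold_pct:
        return "warning"
    return "info"
```

For example, closes of 10.05 vs 10.00 (0.5%) fall in the info tier, 10.20 vs 10.00 (2%) in warning, and 11.00 vs 10.00 (10%) in critical.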

2a. Non-goals

  • Real-time TDX integration (binary protocol's reliability outside Windows is questionable; TDX zips are the practical path).
  • TDX as primary source. Tushare's coverage + ergonomics > TDX's.
  • TDX for fundamentals / financial statements. TDX historically focuses on price/volume; statements are a Tushare specialty.
  • Auto-correction. When Tushare and TDX disagree, the audit log captures both and surfaces for human review. No silent override.
  • Replace Tushare. Tushare stays primary.
  • Use TDX as Composer fallback for normal queries. Out of scope; Composer's fallback chain (akshare-spot, warehouse, tushare-daily) is sufficient for serving requests.

3. Research: TDX data acquisition

Three candidate paths, ranked by simplicity / reliability:

3.1 TDX zip downloads

  • Source: https://www.tdx.com.cn/article/stockfin.html
  • Format: zip files containing TDX-specific binary .day files
  • Cadence: TDX publishes weekly / monthly batch dumps built from exchange (交易所) data
  • Cost: free; no auth needed
  • Pros: stable URLs, no client / server protocol; download once, parse offline; no rate limits
  • Cons: binary format requires parsing; weekly cadence means data is ~1-7 days behind real-time (acceptable for cross-source audit, NOT for serving requests)
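
Once a zip has been downloaded, the offline half of this path might look like the sketch below. It uses only the standard library; the function name extract_day_files and the assumption that .day members may sit in nested folders with unique file names are illustrative, not confirmed details of the TDX archives.

```python
import zipfile
from pathlib import Path


def extract_day_files(zip_path: Path, out_dir: Path) -> list[Path]:
    """Extract only the .day members of a TDX zip into out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    extracted: list[Path] = []
    with zipfile.ZipFile(zip_path) as zf:
        # testzip() returns the first corrupt member name, or None if all are OK.
        bad = zf.testzip()
        if bad is not None:
            raise zipfile.BadZipFile(f"corrupt member in {zip_path.name}: {bad}")
        for info in zf.infolist():
            if info.is_dir() or not info.filename.lower().endswith(".day"):
                continue
            # Assumption: file names are unique, so internal folders can be
            # flattened. Using only the base name also avoids path traversal.
            target = out_dir / Path(info.filename).name
            target.write_bytes(zf.read(info))
            extracted.append(target)
    return sorted(extracted)
```

A corrupt archive raises immediately rather than being retried, so the operator sees the failure.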

3.2 pytdx3 binary protocol library

  • URL: https://github.com/lierwang19/pytdx3
  • Python client speaking TDX's network protocol; talks to TDX servers run by 通达信 / brokerages.
  • Pros: more recent data than zip dumps (sub-daily); same Python ecosystem as our backend
  • Cons: server discovery + connection handling fragile; servers change IPs frequently; network reliability often poor outside business hours

Use as W9 fallback if zip downloads are delayed or break.

3.3 ❌ eltdx local reader (rejected)

→ Not viable.

3.4 Decision

W7 implements path 3.1 only. Path 3.2 deferred to W9 unless zip publishing breaks for >2 weeks running.


4. Provider design: TdxBatchProvider

Fits the existing DataProvider ABC (spec §4) with a constrained capability:

python
class TdxBatchProvider(DataProvider):
    id = "tdx-batch"
    capabilities = {Capability.DAILY_QUOTE_HISTORICAL}

    def __init__(
        self,
        zip_dir: Path,
        *,
        exchange: str = "SSE",
        max_age_days: int = 14,
    ) -> None:
        """zip_dir contains downloaded TDX zip files (or extracted .day files).

        max_age_days bounds the lookback (older zips skipped to keep memory low).
        """

    def get_quote(
        self, code: str, *, trade_date: str | None = None
    ) -> Quote:
        if trade_date is None:
            raise CapabilityNotSupported(
                "tdx-batch is historical-only; trade_date required"
            )
        # 1. Locate the zip / .day file containing trade_date for this code
        # 2. Parse binary (32-byte struct per row): date / open / high / low /
        #    close / amount / volume / reserved
        # 3. Convert units: TDX stores price * 100 (integer 分); confirm the
        #    volume unit (股 vs. 手) per Appendix A before converting
        # 4. Verify canonical conversion → CanonicalUnitViolation if violated
        # 5. Return Quote with source="tdx-batch", freshness_seconds reflecting
        #    publication delay (typically ~3-7 days)

    def freshness(self) -> FreshnessReport: ...
    def health(self) -> HealthStatus: ...

Key constraints

  • Capabilities limited: Only DAILY_QUOTE_HISTORICAL. TDX zips are weekly/monthly; latest data is days behind. Composer's ROUTES table intentionally does not include tdx-batch — it's never a fallback for user-facing queries.
  • Freshness honesty: freshness_seconds reflects actual zip publication delay (~3-7 days at minimum). Quote consumers should NOT think this is real-time.
  • Unit normalization: TDX stores prices as integer 分 (×100). The volume unit (股 vs. 手) must be confirmed against Tushare before it is trusted; see Appendix A. The provider's boundary code converts. Violations → CanonicalUnitViolation (same pattern as v0.3.0 W2.1/W2.2).

What TdxBatchProvider does NOT do

  • Cache results (the CrossSourceAuditor orchestrator handles batch reads; per-call caching adds complexity)
  • Retry binary parse failures (corrupt zip → fail loudly so operator notices)
  • Talk to network at all (zips are pre-downloaded; the provider operates offline)

5. Cross-source diff workflow

A new orchestrator, CrossSourceAuditor, drives the comparison:

python
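from datetime import date

# DataProvider, TdxBatchProvider, AuditLog, Quote and trade_days_between
# are project-internal names imported from the existing codebase.


class CrossSourceAuditor:
    def __init__(
        self,
        warehouse_provider: DataProvider,  # tushare-daily / warehouse
        tdx_provider: TdxBatchProvider,
        audit_log: AuditLog,
        *,
        diff_threshold_pct: float = 0.01,
        warning_threshold_pct: float = 1.0,
        critical_threshold_pct: float = 5.0,
    ) -> None:
        self.warehouse_provider = warehouse_provider
        self.tdx_provider = tdx_provider
        self.audit_log = audit_log
        self.diff_threshold_pct = diff_threshold_pct
        self.warning_threshold_pct = warning_threshold_pct
        self.critical_threshold_pct = critical_threshold_pct

    def audit_window(
        self, *, start: date, end: date
    ) -> dict[str, int]:
        """Compare every (ts_code, trade_date) in window;
        write diffs to provider_diff_log; return summary counts."""
        counts = {"info": 0, "warning": 0, "critical": 0}
        for trade_date in trade_days_between(start, end):
            for ts_code in self.tdx_provider.codes_for(trade_date):
                wh_quote = self.warehouse_provider.get_quote(
                    ts_code, trade_date=trade_date
                )
                tdx_quote = self.tdx_provider.get_quote(
                    ts_code, trade_date=trade_date
                )
                severity = self._diff_or_record(wh_quote, tdx_quote)
                if severity is not None:
                    counts[severity] += 1
        return counts

    def _diff_or_record(self, a: Quote, b: Quote) -> str | None:
        """Record a diff if a and b disagree; return its severity, else None."""
        if a.close_raw is None or b.close_raw is None:
            return None  # one side missing; neither is "right", so skip
        if b.close_raw == 0:
            return None  # relative diff is undefined for a zero close

        diff_pct = abs(a.close_raw - b.close_raw) / b.close_raw * 100.0

        if diff_pct < self.diff_threshold_pct:
            return None  # within rounding tolerance

        severity = (
            "critical" if diff_pct >= self.critical_threshold_pct
            else "warning" if diff_pct >= self.warning_threshold_pct
            else "info"
        )

        self.audit_log.record_diff(
            code=a.ts_code,
            trade_date=a.trade_date,
            provider_a=a.source,
            value_a=a.close_raw,
            provider_b=b.source,
            value_b=b.close_raw,
            severity=severity,
        )
        return severity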

Schedule

Daily 03:00 (post-warehouse-refresh, pre-trading-day):

00:00  TDX publisher uploads new weekly batch (their cadence)
02:00  Our scheduler downloads new TDX zip (if newer than last)
03:00  CrossSourceAuditor.audit_window(start=today-7, end=today-3)
       writes diffs to provider_diff_log
       triggers webhook for any new critical

The 3-day lag at the end of the window (end=today-3) absorbs TDX publication delay: we audit "settled" data only.
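
The window arithmetic behind the 03:00 job can be sketched as follows. The helper name settled_window and its parameter names are hypothetical; the defaults mirror the start=today-7, end=today-3 call in the schedule.

```python
from datetime import date, timedelta


def settled_window(
    today: date, *, lookback_days: int = 7, settle_lag_days: int = 3
) -> tuple[date, date]:
    """Return (start, end) calendar dates for the nightly audit window."""
    if settle_lag_days > lookback_days:
        raise ValueError("settle_lag_days must not exceed lookback_days")
    return (
        today - timedelta(days=lookback_days),
        today - timedelta(days=settle_lag_days),
    )
```

These are calendar dates; filtering down to actual trading days is still left to trade_days_between inside audit_window.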

Webhook hook

Spec §8.3 / §11 already define a webhook hook for critical diffs. W8 implements it (URL configurable via env / runtime.yaml). Default: log to stderr only (no external dependency).


6. Implementation phases

W7 (1 week) — core path

  • [ ] core/data/providers/tdx_batch.py (~150 lines)
  • [ ] core/data/cross_source_auditor.py (~200 lines)
  • [ ] scripts/run_tdx_audit.py (one-shot CLI)
  • [ ] Unit tests for both modules (~250 lines): provider conformance suite + auditor diff math + severity classification
  • [ ] Integration test with sample fixture data

W8 (3-5 days) — productionization

  • [ ] APScheduler daily 03:00 cron in src/service/main.py lifespan
  • [ ] /quality/diff endpoint exposes recent diffs (data already in table)
  • [ ] Webhook config + first webhook fire on critical
  • [ ] Deploy to tvps + monitor first week

W9 (deferred) — pytdx3 fallback

  • [ ] If TDX zips break for >2 weeks, fall back to pytdx3 binary protocol
  • [ ] Otherwise indefinitely deferred

7. Open questions

  1. TDX zip parsing — pytdx3.exhq.parser or hand-rolled? Probably pytdx3, with a thin wrapper that validates the parsed dict shape and converts units. Format is documented in TDX's reference; not novel.
  2. Storage of TDX raw data. Do we keep parsed TDX data in a tdx_daily raw layer? Yes — use 01 §6.1 raw-layer pattern; helps debug diffs ("did Tushare or TDX shift?"). Compressed JSON, ~50 MB/year for nationwide A-share daily.
  3. Disagreement attribution. When diff found, how to know which is correct? Log both. Flag for human review. No auto-correction.
  4. Holiday handling. TDX's binary .day doesn't include trade_cal; we rely on warehouse's normalized_trade_cal (already populated by W1.2).
  5. Code-format mapping. TDX uses sh600519 / sz000001 / bj430001; Tushare uses 600519.SH etc. Boundary normalization in TdxBatchProvider._tdx_to_canonical_code().
  6. Should auditor delete or expire old diff_log rows? Retention:
    • info: 30 days
    • warning: 90 days
    • critical: forever (manually reviewed)
     Add a cleanup_old_diffs() method, run weekly.
  7. How to detect Tushare's known quirks before W7 ships. Could pre-script a manual diff with pytdx3 to find out which date ranges have the most disagreements; calibrate severity tiers from that sample. Optional preliminary research task.
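
For open question 5, the boundary normalization could be as small as the sketch below. It is written as a free function for illustration (the spec places it at TdxBatchProvider._tdx_to_canonical_code()), and it assumes the Tushare suffixes are .SH / .SZ / .BJ and that all codes have six digits.

```python
_TDX_PREFIX_TO_SUFFIX = {"sh": "SH", "sz": "SZ", "bj": "BJ"}


def tdx_to_canonical_code(tdx_code: str) -> str:
    """Convert e.g. 'sh600519' to '600519.SH'; raise ValueError if malformed."""
    code = tdx_code.strip().lower()
    prefix, digits = code[:2], code[2:]
    well_formed = (
        prefix in _TDX_PREFIX_TO_SUFFIX
        and len(digits) == 6
        and digits.isascii()
        and digits.isdigit()
    )
    if not well_formed:
        raise ValueError(f"unrecognized TDX code: {tdx_code!r}")
    return f"{digits}.{_TDX_PREFIX_TO_SUFFIX[prefix]}"
```

Rejecting anything unexpected keeps index files or other non-stock entries from silently entering the diff.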

8. Risks & mitigations

#   Risk                                                                       Probability   Impact   Mitigation
-   ------------------------------------------------------------------------   -----------   ------   ----------
1   TDX zip URL breaks                                                         low           high     Manual download fallback; keep last successful zip locally; W9 pytdx3 path
2   pytdx3 server connectivity unstable                                        high          medium   Stick with zip path (3.1); pytdx3 is W9 only
3   TDX/Tushare unit conventions differ                                        medium        high     Validate at TdxBatchProvider boundary; CanonicalUnitViolation if mismatch (matches W2.1 pattern)
4   Diff log table grows huge                                                  medium        medium   Retention policy (Open Q #6); 30/90/∞ tiers
5   Alert fatigue if Tushare quirks are common                                 medium        medium   Calibrate severity tiers from first 30 days of real data; tune thresholds
6   Webhook failure brings down the audit                                      low           medium   Webhook fire is best-effort; failure doesn't break auditor
7   TDX data is itself wrong (rare, but seen historically with corp actions)   medium        medium   This is exactly why we need the diff log + human review; not a bug
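
The retention mitigation for risk 4 reduces to a per-severity age check. The sketch below shows only that decision rule, under the 30/90/∞ tiers from open question 6; the names RETENTION and is_expired are hypothetical, and how cleanup_old_diffs() would apply it to provider_diff_log depends on a schema not shown here.

```python
from datetime import datetime, timedelta
from typing import Optional

# Retention per severity; None means keep forever.
RETENTION: dict[str, Optional[timedelta]] = {
    "info": timedelta(days=30),
    "warning": timedelta(days=90),
    "critical": None,
}


def is_expired(severity: str, recorded_at: datetime, now: datetime) -> bool:
    """True if a diff row of this severity is past its retention window."""
    max_age = RETENTION[severity]  # KeyError on an unknown severity is intentional
    return max_age is not None and now - recorded_at > max_age
```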

9. Acceptance criteria

  • [ ] TdxBatchProvider passes the W2.5 conformance suite (id / capabilities / get_quote / freshness / health / error mapping)
  • [ ] CrossSourceAuditor runs audit_window against last 7 days nightly, writing to provider_diff_log
  • [ ] At least 5 known Tushare quirks (e.g. 退市股 / delisted-stock truncation, 复权因子 / adjustment-factor boundary) detected and logged within first 30 days
  • [ ] GET /quality/diff?severity=critical shows recent critical disagreements
  • [ ] GET /quality/diff?severity=warning&since=... filterable
  • [ ] Webhook fires on first critical (then deduped per code/date)
  • [ ] No regression in v0.3.0 W2 work; cross-source is purely additive

10. Out of scope (recap)

  • TDX as primary or fallback source for normal queries
  • TDX for fundamentals / financial statements
  • Real-time TDX subscription
  • Broker-specific TDX server selection
  • TDX zip backup / archival to S3 / external storage
  • Auto-correction of diffs (human review only)

11. Cross-references

  • DataProvider ABC: 02-tushare-integration-design.md §4 (already merged in v0.3.0 main)
  • provider_diff_log table schema: 02-tushare-integration-design.md §8 (PR #41 in v0.3.0)
  • Severity tier semantics: 02-tushare-integration-design.md §8.3
  • 5 v0.2.0 incidents (the things v0.3.0 W2 architecture catches): 02-tushare-integration-design.md §1
  • Composer ROUTES (proof TDX is NOT a normal serving source): 02-tushare-integration-design.md §6.1
  • 01 §8.6: TDX deferred from W3 backfill scope
  • Existing GitHub research notes: github_research.md (Data section)

Appendix A: TDX .day binary format reference

A .day file is a sequence of 32-byte fixed-width records:

Offset   Field         Type     Notes
------   -----------   ------   ------------------------------------------
0..3     trade_date    uint32   YYYYMMDD as integer
4..7     open          uint32   price × 100 (integer 分)
8..11    high          uint32   same scale
12..15   low           uint32
16..19   close         uint32
20..23   amount        float    成交额 (元, single-precision)
24..27   volume        uint32   成交量 (手 in TDX convention; verify)
28..31   reserved      uint32   ignored

Critical conversions for canonical Quote:

  • price fields: divide by 100, store as float 元 (matches Tushare daily.close)
  • amount: divide by 1000, store as float 千元 (matches Tushare daily.amount)
  • volume: assumed to be 手 already; store as int. Validate by cross-checking against Tushare daily.vol on a known stock: if the TDX values are consistently 100× Tushare's, then TDX gave 股 and we need to divide by 100.
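
A decoder for this layout can be sketched with the standard struct module. Little-endian byte order is an assumption (the table above does not state it), and the DayBar and parse_day_bytes names are illustrative. Volume is deliberately passed through raw because its unit is still to be verified.

```python
import struct
from typing import Iterator, NamedTuple

# Assumption: little-endian; 5 x uint32, 1 x float32, 2 x uint32 = 32 bytes.
_RECORD = struct.Struct("<IIIIIfII")
assert _RECORD.size == 32


class DayBar(NamedTuple):
    trade_date: str   # "YYYYMMDD"
    open: float       # 元
    high: float       # 元
    low: float        # 元
    close: float      # 元
    amount_k: float   # 千元
    volume_raw: int   # unit unverified (股 or 手)


def parse_day_bytes(buf: bytes) -> Iterator[DayBar]:
    """Yield one DayBar per 32-byte record; fail loudly on a truncated file."""
    if len(buf) % _RECORD.size != 0:
        raise ValueError(f"truncated .day file: {len(buf)} bytes")
    for d, op, hi, lo, cl, amount, volume, _reserved in _RECORD.iter_unpack(buf):
        yield DayBar(
            str(d), op / 100, hi / 100, lo / 100, cl / 100, amount / 1000, volume
        )
```

Because parse_day_bytes is a generator, the truncation error surfaces when iteration starts, not when the function is called.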

The validation step is critical: incident #4 (浙文互联 5,362,159 万元) was exactly this kind of unit assumption error. Every TDX zip parse run should include a sanity check on a hand-picked stock: close × volume (in 股) ≈ amount (in 元).
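
One way to make that check mechanical is to test whether the day's average traded price (amount divided by share count) lands inside the day's low-to-high range under each candidate unit. This is a sketch under stated assumptions: the function name and the 1% tolerance are hypothetical, amount is taken in 元, and 1 手 is taken as 100 股.

```python
def infer_volume_unit(
    low: float,
    high: float,
    amount_yuan: float,
    volume_raw: int,
    *,
    tolerance: float = 0.01,
) -> str:
    """Return "shares" or "lots" depending on which unit makes the average
    traded price (amount / share count) fall inside [low, high]."""
    if volume_raw <= 0:
        raise ValueError("volume must be positive for the sanity check")
    lo, hi = low * (1 - tolerance), high * (1 + tolerance)
    for unit, shares_per_unit in (("shares", 1), ("lots", 100)):
        avg_price = amount_yuan / (volume_raw * shares_per_unit)
        if lo <= avg_price <= hi:
            return unit
    raise ValueError("amount/volume matches neither 股 nor 手; check units")
```

Using the low-to-high range instead of close alone avoids false alarms on volatile days, and a result that fits neither unit points to an amount-scale error of the kind seen in incident #4.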
