# Data Pipeline Bug Analysis

## Summary

The generated embeddings do not match the database 0_7 embeddings because of multiple bugs introduced while migrating the data pipeline from qlib to a standalone Polars implementation.

---

## Bugs Fixed

### 1. Market Classification (`FlagMarketInjector`) ✓ FIXED

**Original (incorrect):**
```python
market_0 = (instrument >= 600000)  # SH
market_1 = (instrument < 600000)   # SZ
```

**Fixed:**
```python
inst_str = str(instrument).zfill(6)
market_0 = inst_str.startswith('6')  # SH: 6xxxxx
market_1 = inst_str.startswith('0') | inst_str.startswith('3')  # SZ: 0xxxxx, 3xxxxx
market_2 = inst_str.startswith('4') | inst_str.startswith('8')  # NE: 4xxxxx, 8xxxxx
```

**Impact:** 167 instruments with 4xxxxx or 8xxxxx codes (新三板, the NEEQ "New Third Board") were misclassified: the numeric threshold put 4xxxxx codes in SZ and 8xxxxx codes in SH.

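A minimal sketch of the prefix-based rule as a plain Python function, assuming `instrument` is an integer code. The function names `classify_market` and `classify_market_old` and the sample codes are illustrative and not taken from the pipeline; the old threshold rule is included only to show where it goes wrong.

```python
def classify_market(instrument: int) -> tuple[bool, bool, bool]:
    """Return (market_0, market_1, market_2) flags for a numeric instrument code."""
    inst_str = str(instrument).zfill(6)          # zero-pad, e.g. 1 -> "000001"
    market_0 = inst_str.startswith("6")          # SH: 6xxxxx
    market_1 = inst_str.startswith(("0", "3"))   # SZ: 0xxxxx, 3xxxxx
    market_2 = inst_str.startswith(("4", "8"))   # NE: 4xxxxx, 8xxxxx
    return market_0, market_1, market_2


def classify_market_old(instrument: int) -> tuple[bool, bool]:
    """Original threshold rule (market_0, market_1), kept only for comparison."""
    return instrument >= 600000, instrument < 600000


# Illustrative codes, chosen only to cover each prefix.
for code in [600000, 1, 300750, 430047, 830799]:
    print(code, classify_market(code), classify_market_old(code))
```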
---

### 2. ColumnRemover Missing `IsN` ✓ FIXED

**Original (incorrect):**
```python
columns_to_remove = ['TotalValue_diff', 'IsZt', 'IsDt']
```

**Fixed:**
```python
columns_to_remove = ['TotalValue_diff', 'IsN', 'IsZt', 'IsDt']
```

**Impact:** The extra `IsN` column caused a feature dimension mismatch.

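One way to catch this kind of mismatch early is to check the feature width right after the columns are dropped. The sketch below is plain Python over a list of column names. `drop_columns`, the demo column names `feat_a` and `feat_b`, the placeholder `expected_width`, and the choice to raise on a missing column are all assumptions for illustration, not the pipeline's actual `ColumnRemover`.

```python
COLUMNS_TO_REMOVE = ["TotalValue_diff", "IsN", "IsZt", "IsDt"]


def drop_columns(columns: list[str], to_remove: list[str]) -> list[str]:
    """Return `columns` without `to_remove`, failing loudly on unknown names."""
    missing = [name for name in to_remove if name not in columns]
    if missing:
        raise KeyError(f"columns to remove not found: {missing}")
    blocked = set(to_remove)
    return [name for name in columns if name not in blocked]


# Hypothetical input: two feature columns plus the four helper columns.
raw_columns = ["feat_a", "feat_b", "TotalValue_diff", "IsN", "IsZt", "IsDt"]
kept = drop_columns(raw_columns, COLUMNS_TO_REMOVE)

expected_width = 2  # placeholder for the width the model expects
assert len(kept) == expected_width, f"expected {expected_width} features, got {len(kept)}"
print(kept)
```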
---

### 3. RobustZScoreNorm Applied to Wrong Columns ✓ FIXED

**Original (incorrect):**
Normalization was applied to all 341 features, including the market flags and `indus_idx`.

**Fixed:**
Only `alpha158 + alpha158_ntrl + market_ext + market_ext_ntrl` (330 features) are normalized. The remaining 11 features are excluded:
- Market flags (Limit, Stopping, IsTp, IsXD, IsXR, IsDR, market_0, market_1, market_2, IsST)
- indus_idx

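The selective normalization can be sketched in plain Python as follows. This is an illustration under stated assumptions: the median/MAD formula, the 1.4826 scaling factor and the clipping to ±3 follow the common robust z-score definition, while the actual pipeline applies pre-fitted qlib parameters. The helper names and the three demo rows are hypothetical.

```python
from statistics import median


def fit_robust_params(values: list[float]) -> tuple[float, float]:
    """Return (center, scale) from the median and the scaled MAD (assumed definition)."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    return center, mad * 1.4826 + 1e-12  # epsilon avoids division by zero


def robust_zscore(value: float, center: float, scale: float, clip: float = 3.0) -> float:
    """Normalize one value and clip it to [-clip, clip]."""
    z = (value - center) / scale
    return max(-clip, min(clip, z))


def normalize_rows(rows: list[dict], norm_columns: list[str]) -> list[dict]:
    """Normalize only `norm_columns`; every other column is passed through unchanged."""
    params = {c: fit_robust_params([r[c] for r in rows]) for c in norm_columns}
    out = []
    for row in rows:
        new_row = dict(row)
        for c in norm_columns:
            new_row[c] = robust_zscore(row[c], *params[c])
        out.append(new_row)
    return out


# Hypothetical rows: one continuous feature, one market flag, one industry index.
rows = [
    {"KMID": 1.0, "market_0": 1, "indus_idx": 12},
    {"KMID": 2.0, "market_0": 0, "indus_idx": 3},
    {"KMID": 4.0, "market_0": 1, "indus_idx": 7},
]
normalized = normalize_rows(rows, norm_columns=["KMID"])
print(normalized)
```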
---

## Critical Remaining Issue: Data Schema Mismatch

### `Limit` and `Stopping` Column Types Changed

**Original qlib pipeline expected:**
- `Limit`: **Boolean** flag (True = limit up)
- `Stopping`: **Boolean** flag (True = suspended trading)

**Current Parquet data has:**
- `Limit`: **Float64** price change percentage (0.0 to 1301.3)
- `Stopping`: **Float64** price change percentage

**Evidence:**
```
Limit values sample: [8.86, 9.36, 31.0, 7.32, 2.28, 6.39, 5.38, 4.03, 3.86, 9.89]
Limit == 0: only 2 rows
Limit > 0: 3738 rows
```

This is a **fundamental data schema change**: the current Parquet files contain different data from what the original VAE model was trained on.

**Possible fixes:**
1. Convert `Limit` and `Stopping` to boolean flags using a threshold
2. Find the original data source that had boolean flags
3. Re-train the VAE model with the new data schema

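The first option could look roughly like the sketch below, which also includes a cheap check for whether a column already holds flag-like values. The threshold of 9.8 is a placeholder for illustration only, not a recommended value, and `to_flag` and `looks_boolean` are hypothetical helper names. The sample values are the ones listed in the evidence above.

```python
def looks_boolean(values: list) -> bool:
    """True if every value is 0/1 (or False/True), i.e. usable as a flag as-is."""
    return all(v in (0, 1) for v in values)


def to_flag(value: float, threshold: float) -> bool:
    """Map a continuous value to a flag; the threshold must come from domain knowledge."""
    return value >= threshold


limit_sample = [8.86, 9.36, 31.0, 7.32, 2.28, 6.39, 5.38, 4.03, 3.86, 9.89]

# The continuous column fails the schema check.
print(looks_boolean(limit_sample))

PLACEHOLDER_THRESHOLD = 9.8  # assumption for the demo, not a validated cutoff
limit_flags = [to_flag(v, PLACEHOLDER_THRESHOLD) for v in limit_sample]
print(limit_flags, looks_boolean(limit_flags))
```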
---

## Correlation Results

After fixing bugs 1-3, the correlation between the generated embeddings and the database 0_7 embeddings is:

| Metric | Value |
|--------|-------|
| Mean correlation (32 dims) | 0.0068 |
| Median correlation | 0.0094 |
| Overall correlation | 0.2330 |

**Conclusion:** The embeddings remain essentially uncorrelated (≈0): the mean and median correlations are both below 0.01.

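The sketch below shows one way such metrics can be computed, and why an overall figure can sit well above the per-dimension ones. It assumes that "overall correlation" means the Pearson correlation over all values flattened together, which the report does not state. Under that assumption, offsets between dimensions alone can raise the overall number even when every dimension is uncorrelated. The toy 2-dimensional data is made up for the demonstration.

```python
from math import sqrt


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def per_dim_correlations(a: list[list[float]], b: list[list[float]]) -> list[float]:
    """Correlate each embedding dimension (column) separately."""
    return [pearson([r[d] for r in a], [r[d] for r in b]) for d in range(len(a[0]))]


def overall_correlation(a: list[list[float]], b: list[list[float]]) -> float:
    """Correlate all values at once after flattening (assumed meaning of 'overall')."""
    return pearson([v for r in a for v in r], [v for r in b for v in r])


# Toy embeddings (4 samples x 2 dims): the second dimension sits near 10 in both sets.
generated = [[0.0, 10.0], [1.0, 11.0], [0.0, 10.0], [1.0, 11.0]]
reference = [[0.0, 10.0], [0.0, 10.0], [1.0, 11.0], [1.0, 11.0]]

print(per_dim_correlations(generated, reference))  # each dimension: no correlation
print(overall_correlation(generated, reference))   # high, driven by the shared offset
```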
---

## Root Cause

The **Limit/Stopping data schema change** is the most likely root cause. The VAE model learned to encode features that included binary limit/stopping flags, but the standalone pipeline feeds it continuous price change percentages instead.

---

## Next Steps

1. **Verify original data schema:**
   - Check whether the original DolphinDB table had boolean `Limit` and `Stopping` columns
   - Compare with the current Parquet schema

2. **Fix the data loading:**
   - Either convert the continuous values to binary flags
   - Or use the correct boolean columns (`IsZt`, `IsDt`) for limit flags

3. **Verify feature order:**
   - Ensure the qlib RobustZScoreNorm parameters are applied in the correct order
   - Check that `[alpha158, alpha158_ntrl, market_ext, market_ext_ntrl]` matches the 330-parameter shape

4. **Re-run comparison:**
   - Generate new embeddings with the corrected pipeline
   - Compare the correlation with the database 0_7 embeddings

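For step 3, a small guard along these lines could confirm that the concatenated feature list lines up with the parameter vector. It is a sketch with hypothetical helper names and tiny made-up groups. A length check like this catches missing or extra columns, but it cannot detect a permuted order, which would require comparing the names against the order used when the parameters were fitted.

```python
def ordered_features(groups: dict[str, list[str]], group_order: list[str]) -> list[str]:
    """Concatenate the column names of each group in the given order."""
    return [name for group in group_order for name in groups[group]]


def check_alignment(feature_names: list[str], n_params: int) -> None:
    """Raise if the feature list cannot line up one-to-one with the parameters."""
    if len(set(feature_names)) != len(feature_names):
        raise ValueError("duplicate feature names")
    if len(feature_names) != n_params:
        raise ValueError(f"{len(feature_names)} features but {n_params} parameters")


GROUP_ORDER = ["alpha158", "alpha158_ntrl", "market_ext", "market_ext_ntrl"]

# Tiny hypothetical groups; the real ones should total 330 columns.
groups = {
    "alpha158": ["f1", "f2"],
    "alpha158_ntrl": ["f1_ntrl", "f2_ntrl"],
    "market_ext": ["m1"],
    "market_ext_ntrl": ["m1_ntrl"],
}
features = ordered_features(groups, GROUP_ORDER)
check_alignment(features, n_params=6)  # would be n_params=330 for the real data
print(features)
```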