# Data Pipeline Bug Analysis - Final Status

## Summary
After all identified bugs were fixed, the feature count matches the VAE input dimension (341), but the generated embeddings remain uncorrelated with the 0_7 version stored in the database.

**Latest Version**: v5
- Feature count: 341 ✓ (matches VAE input dim)
- Mean correlation with DB: 0.0050 (essentially zero)
- Status: All identified bugs fixed, but embeddings still differ
---

## Bugs Fixed
### 1. Market Classification (`FlagMarketInjector`) ✓ FIXED
- **Bug**: Used the numeric comparison `instrument >= 600000`, which misclassified NEEQ (新三板, "New Third Board") instruments
- **Fix**: Use string prefix matching with vocab_size=2 (not 3)
- **Impact**: 167 instruments corrected
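The difference between the two rules in bug 1 can be sketched as follows. This is a minimal illustration, not the actual `FlagMarketInjector` logic: the prefix tuple, the class ids, and the sample codes are assumptions, since the report does not list the real prefix table.

```python
# Hypothetical sketch: the prefix tuple and class ids are assumptions,
# not the actual FlagMarketInjector rules.
CLASS_1_PREFIXES = ("6",)  # assumed: codes starting with "6" form class 1


def classify_numeric(code: str) -> int:
    """Old behaviour: numeric threshold on the instrument code."""
    return 1 if int(code) >= 600000 else 0


def classify_prefix(code: str) -> int:
    """Fixed behaviour: string prefix match with two classes (vocab_size=2)."""
    return 1 if code.startswith(CLASS_1_PREFIXES) else 0


# "830799" stands in for an NEEQ-style code (assumed sample value).
samples = ["600000", "000001", "830799"]
numeric = [classify_numeric(c) for c in samples]
prefix = [classify_prefix(c) for c in samples]

# Codes whose class differs between the two rules.
changed = [c for c, a, b in zip(samples, numeric, prefix) if a != b]
print(changed)
```

Any code that is numerically above 600000 but does not carry the assumed prefix lands in a different class under the prefix rule, which is the kind of instrument the fix targets.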
### 2. ColumnRemover Missing `IsN` ✓ FIXED
- **Bug**: Removed only `IsZt` and `IsDt`, but not `IsN`
- **Fix**: Added `IsN` to the removal list
- **Impact**: Feature count alignment
### 3. RobustZScoreNorm Scope ✓ FIXED
- **Bug**: Applied normalization to all 341 features
- **Fix**: Normalize only 330 features (alpha158 + market_ext, each in original and neutralized form)
- **Impact**: Correct normalization scope
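The scope fix in bug 3 can be sketched as below: normalize only the leading 330 columns and pass the remaining 11 through untouched. The median/MAD formula, the 1.4826 factor, and clipping to ±3 are assumptions about how the normalizer works; they are not taken from this report, and the function names are hypothetical.

```python
from statistics import median

N_NORMALIZED = 330  # alpha158 + market_ext, original and neutralized
N_FEATURES = 341    # full VAE input width


def fit_robust_params(column):
    """Median and MAD-based scale for one feature column (assumed formula)."""
    center = median(column)
    mad = median(abs(v - center) for v in column)
    return center, mad * 1.4826 + 1e-12


def normalize_row(row, centers, scales, clip=3.0):
    """Normalize the first len(centers) entries; leave the rest unchanged."""
    n = len(centers)
    head = [
        max(-clip, min(clip, (v - c) / s))
        for v, c, s in zip(row[:n], centers, scales)
    ]
    return head + list(row[n:])


# Toy row: 330 normalizable values followed by 11 flag-like values.
row = [10.0] * N_NORMALIZED + [1.0] * (N_FEATURES - N_NORMALIZED)
out = normalize_row(row, [0.0] * N_NORMALIZED, [1.0] * N_NORMALIZED)
print(len(out), out[0], out[-1])
```

Because the split point comes from the length of the parameter list, passing 341 parameters by mistake would silently normalize the flag columns too, which matches the bug described above.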
### 4. Wrong Data Sources for Market Flags ✓ FIXED
- **Bug**: Used `Limit, Stopping` (Float64) from kline_adjusted
- **Fix**: Load from correct sources:
  - kline_adjusted: `IsZt, IsDt, IsN, IsXD, IsXR, IsDR` (Boolean)
  - market_flag: `open_limit, close_limit, low_limit, high_stop` (Boolean, 4 cols)
- **Impact**: Correct boolean flag data
### 5. Feature Count Mismatch ✓ FIXED
- **Bug**: 344 features (3 extra)
- **Fix**: Using vocab_size=2 and 4 market_flag cols brings the total to 341 features
- **Impact**: VAE input dimension matches
---

## Correlation Results (v5)
| Metric | Value |
|--------|-------|
| Mean correlation (32 dims) | 0.0050 |
| Median correlation | 0.0079 |
| Min | -0.0420 |
| Max | 0.0372 |
| Overall (flattened) | 0.2225 |

**Conclusion**: The embeddings remain essentially uncorrelated with the database version.

---
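The two kinds of figures in the Correlation Results table (per-dimension and flattened) could be computed along these lines. This is a sketch on toy data with hypothetical function names, not the script that produced the table. One possible reason for the flattened value being higher than the per-dimension values, not verified in this analysis, is that flattening also picks up offsets in mean and scale between dimensions, as the toy example shows.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def per_dim_correlations(ours, db):
    """Correlate column d of `ours` with column d of `db`, for every d."""
    dims = len(ours[0])
    return [
        pearson([r[d] for r in ours], [r[d] for r in db]) for d in range(dims)
    ]


def flattened_correlation(ours, db):
    """Correlate all values at once, ignoring which dimension they belong to."""
    return pearson([v for r in ours for v in r], [v for r in db for v in r])


# Toy embeddings (2 dims, 4 rows): the dimensions have very different means,
# but within each dimension the two sources are uncorrelated.
ours = [[0, 10], [1, 11], [0, 10], [1, 11]]
db = [[0, 10], [0, 10], [1, 11], [1, 11]]

per_dim = per_dim_correlations(ours, db)
flat = flattened_correlation(ours, db)
print(per_dim, round(flat, 3))
```

In this toy case every per-dimension correlation is zero while the flattened correlation is close to 1, so the per-dimension figures are the more meaningful ones for judging whether two embedding sets agree.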
## Possible Remaining Issues
1. **Different input data values**: The alpha158_0_7_beta Parquet files may contain different values than the original DolphinDB data used to train the VAE.

2. **Feature ordering mismatch**: The 330 RobustZScoreNorm parameters must be applied in exactly this order:
   - [0:158] = alpha158 original
   - [158:316] = alpha158_ntrl
   - [316:323] = market_ext original (7 cols)
   - [323:330] = market_ext_ntrl (7 cols)

3. **Industry neutralization differences**: Our `IndusNtrlInjector` implementation may differ from qlib's.

4. **Missing transformations**: There may be additional preprocessing steps not captured in handler.yaml.

5. **VAE model mismatch**: The VAE model may have been trained on data different from what handler.yaml specifies.
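The slice layout listed under the feature ordering item can be turned into a small check. The sketch below assumes the pipeline can report its feature blocks as `(name, n_columns)` pairs in emission order; that interface and the function name are hypothetical.

```python
# Layout copied from the slice list above: (block name, start, stop).
EXPECTED_LAYOUT = [
    ("alpha158", 0, 158),
    ("alpha158_ntrl", 158, 316),
    ("market_ext", 316, 323),
    ("market_ext_ntrl", 323, 330),
]


def check_layout(block_sizes):
    """Return a list of mismatches against EXPECTED_LAYOUT (empty if none).

    `block_sizes` is a list of (name, n_columns) in pipeline emission order.
    """
    problems = []
    offset = 0
    for i, (name, size) in enumerate(block_sizes):
        if i >= len(EXPECTED_LAYOUT):
            problems.append(f"unexpected extra block {name!r}")
            continue
        exp_name, start, stop = EXPECTED_LAYOUT[i]
        if (name, offset, offset + size) != (exp_name, start, stop):
            problems.append(
                f"slot {i}: got {name!r} [{offset}:{offset + size}], "
                f"expected {exp_name!r} [{start}:{stop}]"
            )
        offset += size
    if len(block_sizes) < len(EXPECTED_LAYOUT):
        problems.append("missing blocks")
    return problems


# Example: the two middle blocks are swapped, so two slots are reported.
swapped = [
    ("alpha158", 158),
    ("market_ext", 7),
    ("alpha158_ntrl", 158),
    ("market_ext_ntrl", 7),
]
for line in check_layout(swapped):
    print(line)
```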
---

## Recommended Next Steps
1. **Compare intermediate features**: Run both the qlib pipeline and our pipeline on the same input data and compare outputs at each step.

2. **Verify RobustZScoreNorm parameter order**: Check whether our feature ordering matches the order used during VAE training.

3. **Compare predictions, not embeddings**: Instead of comparing VAE embeddings, compare the final d033 model predictions with the original 0_7 predictions.

4. **Check alpha158 data source**: Verify that `stg_1day_wind_alpha158_0_7_beta_1D` contains the same data as the original DolphinDB `stg_1day_wind_alpha158_0_7_beta` table.
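The step-by-step comparison in the first recommendation could start from a helper like the one below. It assumes each pipeline can dump its intermediate output per stage as a flat list of floats; the stage names and the dictionary interface are hypothetical.

```python
def first_divergence(ours, reference, tol=1e-6):
    """Return (stage, index) of the first value differing by more than `tol`.

    Both arguments map stage name -> flat list of floats, inserted in
    pipeline order. Returns (stage, None) if a stage is missing or has a
    different length, and None if every stage matches. NaN differences are
    not flagged by this comparison.
    """
    for stage, ref_values in reference.items():
        our_values = ours.get(stage)
        if our_values is None or len(our_values) != len(ref_values):
            return stage, None
        for i, (a, b) in enumerate(zip(our_values, ref_values)):
            if abs(a - b) > tol:
                return stage, i
    return None


# Hypothetical stage names and toy values.
reference = {"raw": [1.0, 2.0], "neutralized": [0.5, 1.5], "normalized": [0.1, 0.2]}
ours = {"raw": [1.0, 2.0], "neutralized": [0.5, 1.5], "normalized": [0.1, 0.25]}

result = first_divergence(ours, reference)
print(result)
```

Reporting only the earliest divergent stage narrows the search to a single processor, since every later stage will differ as a consequence.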