# Alpha158 0_7 vs 0_7_beta Prediction Comparison

This directory contains a workflow for comparing predictions from Alpha158 version 0_7 (original) against version 0_7_beta (enhanced with VAE embeddings).

## Overview

The goal is to evaluate whether the beta version of the Alpha158 factors produces better predictions than the original 0_7 version when used with the d033 prediction model.

## Directory Structure

```
stock_1d/d033/alpha158_beta/
├── README.md                        # This file
├── config.yaml                      # VAE model configuration
├── pipeline.py                      # Main orchestration script
├── scripts/                         # Core pipeline scripts
│   ├── generate_beta_embedding.py   # Generate VAE embeddings from beta factors
│   ├── generate_returns.py          # Generate actual returns from kline data
│   ├── fetch_predictions.py         # Fetch original predictions from DolphinDB
│   ├── predict_with_embedding.py    # Generate predictions using beta embeddings
│   ├── compare_predictions.py       # Compare 0_7 vs 0_7_beta predictions
│   ├── dump_polars_dataset.py       # Dump raw and processed datasets using the polars pipeline
│   └── extract_qlib_params.py       # Extract RobustZScoreNorm parameters from the Qlib proc_list
├── src/                             # Source modules
│   └── qlib_loader.py               # Qlib data loader with configurable date range
├── config/                          # Configuration files
│   └── handler.yaml                 # Modified handler with configurable end date
├── data/                            # Data files (gitignored)
│   ├── robust_zscore_params/        # Pre-fitted normalization parameters
│   │   └── csiallx_feature2_ntrla_flag_pnlnorm/
│   │       ├── mean_train.npy
│   │       ├── std_train.npy
│   │       └── metadata.json
│   ├── embedding_0_7_beta.parquet
│   ├── predictions_beta_embedding.parquet
│   ├── original_predictions_0_7.parquet
│   ├── actual_returns.parquet
│   ├── raw_data_*.pkl               # Raw data before preprocessing
│   └── processed_data_*.pkl         # Processed data after preprocessing
└── data_polars/                     # Polars-generated datasets (gitignored)
    ├── raw_data_*.pkl
    └── processed_data_*.pkl
```

## Data Loading with Configurable Date Range

### handler.yaml Modification

The original `handler.yaml` uses the `<TODAY>` placeholder, which always loads data up to today's date. The modified version in `config/handler.yaml` uses a `<LOAD_END>` placeholder that can be set via arguments:

```yaml
# Original (always loads until today)
load_start: &load_start <SINCE_DATE>
load_end: &load_end <TODAY>

# Modified (configurable end date)
load_start: &load_start <LOAD_START>
load_end: &load_end <LOAD_END>
```

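The substitution itself can be sketched as a string replacement applied to the template before the YAML is parsed. This is an illustration only; `render_handler` and its fallback behavior are assumptions, not the project's actual loading code (which lives in `src/qlib_loader.py`):

```python
from datetime import date

# Template mirroring the two placeholder lines in config/handler.yaml
TEMPLATE = """\
load_start: &load_start <LOAD_START>
load_end: &load_end <LOAD_END>
"""

def render_handler(template: str, load_start: str, load_end: str = "") -> str:
    """Fill the placeholders; an empty load_end falls back to today's date,
    matching the behavior of the original <TODAY> template."""
    end = load_end or date.today().isoformat()
    return template.replace("<LOAD_START>", load_start).replace("<LOAD_END>", end)

print(render_handler(TEMPLATE, "2019-01-01", "2019-01-31"))
```
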
### Using qlib_loader.py

```python
from stock_1d.d033.alpha158_beta.src.qlib_loader import (
    load_data_from_handler,
    load_data_with_proc_list,
    load_and_dump_data,
)

# Load data with a configurable date range
df = load_data_from_handler(
    since_date="2019-01-01",
    end_date="2019-01-31",
    buffer_days=20,  # Extra days for diff calculations
    verbose=True,
)

# Load and apply the preprocessing pipeline
df_processed = load_data_with_proc_list(
    since_date="2019-01-01",
    end_date="2019-01-31",
    proc_list_path="/path/to/proc_list.proc",
    buffer_days=20,
)

# Load and dump both raw and processed data to pickle files
raw_df, processed_df = load_and_dump_data(
    since_date="2019-01-01",
    end_date="2019-01-31",
    output_dir="data/",
    fill_con_rating_nan=True,  # Fill NaN in the con_rating_strength column
    verbose=True,
)
```

### Key Features

1. **Configurable end date**: Unlike the original `handler.yaml`, the end date is now respected
2. **Buffer period handling**: Automatically loads extra days before `since_date` for diff calculations
3. **NaN handling**: Optional filling of NaN values in the `con_rating_strength` column
4. **Dual output**: Saves both raw (before proc_list) and processed (after proc_list) data

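The buffer-period handling in point 2 can be illustrated with a small date helper (a sketch; the function name and the calendar-day convention are assumptions, not the actual `qlib_loader.py` logic):

```python
from datetime import date, timedelta

def buffered_range(since_date: str, buffer_days: int = 20) -> tuple[str, str]:
    """Return (load_start, since_date): data is loaded from buffer_days
    calendar days before since_date so that diff features have enough
    history; rows before since_date are dropped after diffs are computed."""
    since = date.fromisoformat(since_date)
    load_start = since - timedelta(days=buffer_days)
    return load_start.isoformat(), since_date

print(buffered_range("2019-01-01", buffer_days=20))  # ('2018-12-12', '2019-01-01')
```
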
### Processor Fixes

The `qlib_loader.py` module includes fixed implementations of qlib processors that correctly handle the `::` separator column format:

- `FixedDiff` - Fixes a column-naming bug (creates proper `feature::col_diff` names)
- `FixedColumnRemover` - Handles the `::` separator format
- `FixedRobustZScoreNorm` - Uses the trained `mean_train`/`std_train` parameters from the pickle
- `FixedIndusNtrlInjector` - Industry neutralization with the `::` format
- `FixedFlagMarketInjector` - Adds `market_0`, `market_1` columns based on instrument codes
- `FixedFlagSTInjector` - Creates the `IsST` column from the `ST_S`, `ST_Y` flags

All fixed processors preserve the trained parameters from the original proc_list pickle.

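The `::` format separates a group prefix from the field name (e.g. `feature::close`). A minimal sketch of the parsing and of the corrected diff naming (illustrative helper names; the real implementations are the `Fixed*` classes above):

```python
def split_qlib_col(col: str) -> tuple[str, str]:
    """Split 'feature::close' into ('feature', 'close'); a plain name gets an empty group."""
    group, sep, field = col.partition("::")
    return (group, field) if sep else ("", col)

def diff_col_name(col: str) -> str:
    """Build the corrected diff name, e.g. 'feature::close' -> 'feature::close_diff'
    (the original bug produced malformed names by appending to the full string)."""
    group, field = split_qlib_col(col)
    return f"{group}::{field}_diff" if group else f"{field}_diff"

print(diff_col_name("feature::close"))  # feature::close_diff
```
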
### Polars Dataset Generation

The `scripts/dump_features.py` script generates datasets using a polars-based pipeline that replicates the qlib preprocessing:

```bash
# Generate merged features (flat columns)
python scripts/dump_features.py --start-date 2024-01-01 --end-date 2024-01-31 --groups merged

# Generate with struct columns (packed feature groups)
python scripts/dump_features.py --start-date 2024-01-01 --end-date 2024-01-31 --groups merged --pack-struct

# Generate specific feature groups
python scripts/dump_features.py --start-date 2024-01-01 --end-date 2024-01-31 --groups alpha158 market_ext
```

This script:

1. Loads data from Parquet files (alpha158, kline, market flags, industry flags)
2. Applies the full processor pipeline:
   - Diff processor (adds diff features)
   - FlagMarketInjector (adds `market_0`, `market_1`)
   - ColumnRemover (removes `log_size_diff`, `IsN`, `IsZt`, `IsDt`)
   - FlagToOnehot (converts the 29 industry flags to `indus_idx`)
   - IndusNtrlInjector (industry neutralization)
   - RobustZScoreNorm (using pre-fitted qlib parameters via `from_version()`)
   - Fillna (fills NaN with 0)
3. Saves to parquet/pickle format

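The numbered steps run as a sequential processor chain; a toy sketch of that orchestration (a plain dict of lists stands in for the polars DataFrame, and the step names are illustrative):

```python
def run_pipeline(df, steps):
    """Apply each processor step (DataFrame in, DataFrame out) in order."""
    for step in steps:
        df = step(df)
    return df

def fillna_zero(d):
    """Mimics the final Fillna step: replace missing values with 0."""
    return {c: [0.0 if v is None else v for v in vals] for c, vals in d.items()}

out = run_pipeline({"feature::a": [1.0, None]}, [fillna_zero])
print(out)  # {'feature::a': [1.0, 0.0]}
```
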
**Output modes:**

- **Flat mode (default)**: All columns as separate fields (348 columns for merged)
- **Struct mode (`--pack-struct`)**: Feature groups packed into struct columns:
  - `features_alpha158` (316 fields)
  - `features_market_ext` (14 fields)
  - `features_market_flag` (11 fields)

**Note**: The `FlagSTInjector` step is skipped because it fails silently even in the gold-standard qlib code (see `BUG_ANALYSIS_FINAL.md` for details).

Output structure:

- Raw data: ~204 columns (158 feature + 4 feature_ext + 12 feature_flag + 30 indus_flag)
- Processed data: 348 columns (318 alpha158 + 14 market_ext + 14 market_flag + 2 index)
- VAE input dimension: 341 (excluding indus_idx)

### RobustZScoreNorm Parameter Extraction

The pipeline uses pre-fitted normalization parameters extracted from Qlib's `proc_list.proc` file. These parameters are stored in `data/robust_zscore_params/` and can be loaded with the `RobustZScoreNorm.from_version()` method.

**Extract parameters from the Qlib proc_list:**

```bash
python scripts/extract_qlib_params.py --version csiallx_feature2_ntrla_flag_pnlnorm
```

This creates:

- `data/robust_zscore_params/{version}/mean_train.npy` - Pre-fitted mean parameters, shape (330,)
- `data/robust_zscore_params/{version}/std_train.npy` - Pre-fitted std parameters, shape (330,)
- `data/robust_zscore_params/{version}/metadata.json` - Feature column names and metadata

**Use in Polars processors:**

```python
from cta_1d.src.processors import RobustZScoreNorm

# Load pre-fitted parameters by version name
processor = RobustZScoreNorm.from_version("csiallx_feature2_ntrla_flag_pnlnorm")

# Apply normalization to the DataFrame
df = processor.process(df)
```

**Parameter details:**

- Fit period: 2013-01-01 to 2018-12-31
- Feature count: 330 (158 alpha158_ntrl + 158 alpha158_raw + 7 market_ext_ntrl + 7 market_ext_raw)
- Fields: `['feature', 'feature_ext']`

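The normalization itself reduces to a z-score with the pre-fitted per-column parameters, clipped to limit outliers. A numpy sketch (the clip range and the zero-scale guard are assumptions about what `RobustZScoreNorm.process` does, not its actual code):

```python
import numpy as np

def apply_robust_zscore(x, mean, std, clip=3.0):
    """Normalize with pre-fitted per-column location/scale, then clip outliers."""
    z = (x - mean) / np.where(std == 0, 1.0, std)  # guard against zero scale
    return np.clip(z, -clip, clip)

# In the pipeline, mean/std come from mean_train.npy / std_train.npy via np.load
x = np.array([[1.0, 10.0], [2.0, 40.0]])
mean = np.array([1.5, 25.0])
std = np.array([0.5, 15.0])
print(apply_robust_zscore(x, mean, std))
```
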
## Workflow
### 1. Generate Beta Embeddings

Generate VAE embeddings from the alpha158_0_7_beta factors:

```bash
python scripts/generate_beta_embedding.py --start-date 2019-01-01 --end-date 2020-11-30
```

This loads data from Parquet, applies the full feature transformation pipeline, and encodes the result with the VAE model.

Output: `data/embedding_0_7_beta.parquet`

### 2. Fetch Original Predictions

Fetch the original 0_7 predictions from DolphinDB:

```bash
python scripts/fetch_predictions.py --start-date 2019-01-01 --end-date 2020-11-30
```

Output: `data/original_predictions_0_7.parquet`

### 3. Generate Predictions with Beta Embeddings

Use the d033 model to generate predictions from the beta embeddings:

```bash
python scripts/predict_with_embedding.py --start-date 2019-01-01 --end-date 2020-11-30
```

Output: `data/predictions_beta_embedding.parquet`

### 4. Generate Actual Returns

Generate actual returns from kline data for IC calculation:

```bash
python scripts/generate_returns.py
```

Output: `data/actual_returns.parquet`

### 5. Compare Predictions

Compare the 0_7 vs 0_7_beta predictions:

```bash
python scripts/compare_predictions.py
```

This calculates:

- Prediction correlation (Pearson and Spearman)
- Daily correlation statistics
- IC metrics (mean, std, IR)
- RankIC metrics
- Top-tier returns (top 10%)

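The IC and RankIC metrics amount to a daily cross-sectional correlation between predictions and realized returns, summarized over days. A pandas sketch (column names `date`, `pred`, `ret` are assumptions; `compare_predictions.py` may organize this differently):

```python
import pandas as pd

def ic_metrics(df: pd.DataFrame, pred_col: str = "pred", ret_col: str = "ret") -> dict:
    """Per-date IC (Pearson) and RankIC (Spearman), then mean/std/IR summaries."""
    by_day = df.groupby("date")
    ic = by_day.apply(lambda g: g[pred_col].corr(g[ret_col]))
    rank_ic = by_day.apply(lambda g: g[pred_col].corr(g[ret_col], method="spearman"))
    return {
        "ic_mean": ic.mean(),
        "ic_std": ic.std(),
        "ic_ir": ic.mean() / ic.std(),        # information ratio
        "rank_ic_mean": rank_ic.mean(),
    }
```
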
## Quick Start

Run the full pipeline:

```bash
python pipeline.py --start-date 2019-01-01 --end-date 2020-11-30
```

Or run individual steps:

```bash
# Step 1: Generate embeddings
python scripts/generate_beta_embedding.py --start-date 2019-01-01 --end-date 2020-11-30

# Step 2: Fetch original predictions
python scripts/fetch_predictions.py --start-date 2019-01-01 --end-date 2020-11-30

# Step 3: Generate beta predictions
python scripts/predict_with_embedding.py

# Step 4: Generate returns
python scripts/generate_returns.py

# Step 5: Compare
python scripts/compare_predictions.py
```

## Data Dependencies

### Input Data (from Parquet)

- `/data/parquet/dataset/stg_1day_wind_alpha158_0_7_beta_1D/` - Alpha158 beta factors
- `/data/parquet/dataset/stg_1day_wind_kline_adjusted_1D/` - Market data (kline)
- `/data/parquet/dataset/stg_1day_gds_indus_flag_cc1_1D/` - Industry flags

### Models

- `/home/guofu/Workspaces/alpha/data_ops/tasks/dwm_feature_vae/model/csiallx_feature2_ntrla_flag_pnlnorm_vae4_dim32a_beta0001/module.pt` - VAE encoder
- `/home/guofu/Workspaces/alpha/data_ops/tasks/app_longsignal/model/host140_exp20_d033/module.pt` - d033 prediction model

### DolphinDB

- Table: `dfs://daily_stock_run_multicast/app_1day_multicast_longsignal_port`
- Version: `host140_exp20_d033`

## Key Metrics

The comparison script outputs:

| Metric | Description |
|--------|-------------|
| Pearson Correlation | Overall correlation between 0_7 and beta predictions |
| Spearman Correlation | Rank correlation between predictions |
| Daily Correlation | Mean and std of daily correlations |
| IC Mean | Average information coefficient |
| IC Std | Standard deviation of IC |
| IC IR | Information ratio (IC Mean / IC Std) |
| RankIC | Spearman correlation with returns |
| Top-tier Return | Average return of the top 10% of predictions |

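The top-tier return in the last row can be sketched as a per-day quantile cut (column names `date`, `pred`, `ret` are assumptions, as is the exact cutoff convention):

```python
import pandas as pd

def top_tier_return(df: pd.DataFrame, q: float = 0.9) -> float:
    """Average realized return of predictions at or above the q-th quantile, per day."""
    def day_top(g: pd.DataFrame) -> float:
        cutoff = g["pred"].quantile(q)
        return g.loc[g["pred"] >= cutoff, "ret"].mean()
    return df.groupby("date").apply(day_top).mean()
```
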
## Notes

- All scripts can be run from the `alpha158_beta/` directory
- Scripts use relative paths (`../data/`) to locate data files
- The VAE model expects 341 input features after the transformation pipeline
- The d033 model expects 32-dimensional embeddings with a 40-day lookback window