# CTA 1-Day Return Prediction
Experiments for predicting 1-day returns of CTA (Commodity Trading Advisor) futures.
## Data
- **Features**: alpha158, hffactor
- **Labels**: Return indicators (o2c_twap1min, o2o_twap1min, etc.)
- **Normalization**: dual (blend of zscore, cs_zscore, rolling_20, rolling_60)
## Notebooks
| Notebook | Purpose |
|----------|---------|
| `01_data_check.ipynb` | Load and validate CTA data |
| `02_label_analysis.ipynb` | Explore label distributions and blending |
| `03_baseline_xgb.ipynb` | Train baseline XGBoost model |
| `04_blend_comparison.ipynb` | Compare different normalization blends |
## Blend Configurations
Label blending combines four normalization methods:
- **zscore**: Fit-time mean/std normalization
- **cs_zscore**: Cross-sectional z-score per datetime
- **rolling_20**: 20-day rolling window normalization
- **rolling_60**: 60-day rolling window normalization
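To make the difference between the first two methods concrete, here is a minimal standard-library sketch. It assumes that `zscore` uses a single mean/std fitted once on the training sample, while `cs_zscore` recomputes mean/std within each datetime. The function names and the toy rows are hypothetical and are not taken from the project code.

```python
from collections import defaultdict
from statistics import mean, pstdev


def fit_zscore(values):
    """Return (mean, std) computed once on the fitting sample."""
    return mean(values), pstdev(values)


def cs_zscore(rows):
    """Z-score each (datetime, instrument, value) row within its own datetime."""
    by_dt = defaultdict(list)
    for dt, _, value in rows:
        by_dt[dt].append(value)
    stats = {dt: (mean(vals), pstdev(vals)) for dt, vals in by_dt.items()}

    out = []
    for dt, inst, value in rows:
        mu, sigma = stats[dt]
        # Guard against a degenerate cross-section with zero dispersion.
        out.append((dt, inst, (value - mu) / sigma if sigma > 0 else 0.0))
    return out


# Toy data: two instruments on two dates (illustrative values only).
rows = [
    ("2024-01-02", "A", 0.01),
    ("2024-01-02", "B", 0.03),
    ("2024-01-03", "A", -0.02),
    ("2024-01-03", "B", 0.02),
]

mu, sigma = fit_zscore([v for _, _, v in rows])
global_z = [(v - mu) / sigma for _, _, v in rows]  # one mean/std for all rows
cross_z = cs_zscore(rows)                          # one mean/std per datetime
```

The cross-sectional variant removes the common daily move across instruments, whereas the fit-time variant keeps it.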
Predefined weights (from `qshare.config.research.cta.labels`):
- `equal`: [0.25, 0.25, 0.25, 0.25]
- `zscore_heavy`: [0.5, 0.2, 0.15, 0.15]
- `rolling_heavy`: [0.1, 0.1, 0.3, 0.5]
- `cs_heavy`: [0.2, 0.5, 0.15, 0.15]
- `short_term`: [0.1, 0.1, 0.4, 0.4]
- `long_term`: [0.4, 0.2, 0.2, 0.2]
Default: [0.2, 0.1, 0.3, 0.4]
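The weights can be read as coefficients of a weighted sum over the four normalized labels. The sketch below rests on two assumptions that this README does not state explicitly: that each weight list follows the order zscore, cs_zscore, rolling_20, rolling_60, and that the blend is a plain linear combination. `blend_labels` and the sample values are hypothetical.

```python
# Assumed order of the weights; inferred from the configuration names.
METHODS = ("zscore", "cs_zscore", "rolling_20", "rolling_60")

# A subset of the configurations listed above, copied verbatim.
BLEND_WEIGHTS = {
    "equal": [0.25, 0.25, 0.25, 0.25],
    "zscore_heavy": [0.5, 0.2, 0.15, 0.15],
    "default": [0.2, 0.1, 0.3, 0.4],
}


def blend_labels(normalized, weights):
    """Weighted sum of the four normalized versions of one label value."""
    if len(weights) != len(METHODS):
        raise ValueError("expected one weight per normalization method")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights should sum to 1")
    return sum(w * normalized[name] for name, w in zip(METHODS, weights))


# One label value after each normalization (illustrative numbers only).
sample = {"zscore": 1.0, "cs_zscore": 0.5, "rolling_20": -0.2, "rolling_60": 0.1}
blended = blend_labels(sample, BLEND_WEIGHTS["default"])
```

Checking that the weights sum to 1 keeps the blended label on roughly the same scale as its inputs.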
## Processors Module
The `cta_1d.src.processors` module provides Polars-based data processors that replicate Qlib's preprocessing pipeline.
### Available Processors
| Processor | Description |
|-----------|-------------|
| `DiffProcessor` | Adds diff features with configurable period |
| `FlagMarketInjector` | Adds `market_0` and `market_1` columns based on instrument codes |
| `FlagSTInjector` | Creates an `IsST` column from ST flags |
| `ColumnRemover` | Removes specified columns |
| `FlagToOnehot` | Converts one-hot industry flags to a single index column |
| `IndusNtrlInjector` | Industry neutralization per datetime |
| `RobustZScoreNorm` | Robust z-score normalization using median/MAD |
| `Fillna` | Fills NaN values with specified value |
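As an illustration of what "robust z-score using median/MAD" means, here is a small standard-library sketch for a single feature. It is not the module's Polars implementation. The 1.4826 scale factor and the clip at ±3 are conventional choices assumed here, and all names are hypothetical.

```python
from statistics import median

MAD_SCALE = 1.4826  # makes the MAD comparable to a std under normality
EPS = 1e-12         # avoids division by zero for constant features


def fit_robust_params(values):
    """Fit the center (median) and scale (scaled MAD) on training values."""
    center = median(values)
    mad = median([abs(v - center) for v in values])
    return center, (mad + EPS) * MAD_SCALE


def robust_zscore(values, center, scale, clip=3.0):
    """Apply pre-fitted parameters and clip outliers to [-clip, clip]."""
    out = []
    for v in values:
        z = (v - center) / scale
        out.append(max(-clip, min(clip, z)))
    return out


# One outlier (100.0) barely moves the median or the MAD.
train = [1.0, 2.0, 3.0, 4.0, 100.0]
center, scale = fit_robust_params(train)
normalized = robust_zscore(train, center, scale)
```

Separating the fit step from the apply step mirrors the idea of reusing parameters fitted once on the training period.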
### RobustZScoreNorm with Pre-fitted Parameters
The `RobustZScoreNorm` processor supports loading pre-fitted parameters from Qlib's `proc_list.proc`:
```python
from cta_1d.src.processors import RobustZScoreNorm

# Method 1: Load from saved version (recommended)
processor = RobustZScoreNorm.from_version("csiallx_feature2_ntrla_flag_pnlnorm")

# Method 2: Load with direct parameters
# mean_array / std_array are pre-fitted arrays,
# e.g. loaded from mean_train.npy / std_train.npy
processor = RobustZScoreNorm(
    feature_cols=['KMID', 'KLEN', ...],
    use_qlib_params=True,
    qlib_mean=mean_array,
    qlib_std=std_array,
)

# Apply normalization
df = processor.process(df)
```
### Parameter Extraction
Extract parameters from Qlib's `proc_list.proc`:
```bash
python stock_1d/d033/alpha158_beta/scripts/extract_qlib_params.py \
    --proc-list /path/to/proc_list.proc \
    --version my_version
```
Output structure:
```
data/robust_zscore_params/{version}/
├── mean_train.npy   # Pre-fitted mean, shape (330,)
├── std_train.npy    # Pre-fitted std, shape (330,)
└── metadata.json    # Feature columns and metadata
```
### Pipeline Helper Functions
```python
from cta_1d.src.processors import create_processor_pipeline, get_final_feature_columns

# Create pipeline from processor configs
pipeline = create_processor_pipeline([
    {'type': 'Diff', 'columns': ['turnover', 'free_turnover']},
    {'type': 'RobustZScoreNorm', 'feature_cols': feature_cols},
    {'type': 'Fillna', 'value': 0},
])

# Get final feature columns after industry neutralization
final_cols = get_final_feature_columns(
    alpha158_cols=ALPHA158_COLS,
    market_ext_cols=MARKET_EXT_COLS,
)
```
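For readers curious how a config-driven pipeline of this kind can work, the following is a rough sketch of the general pattern: a registry maps each `type` string to a class, and the remaining keys become constructor arguments. It is not the actual `create_processor_pipeline` implementation. It works on plain lists of dicts rather than Polars DataFrames, and every class and function name in it is hypothetical.

```python
class FillnaSketch:
    """Replace missing values (None) with a constant."""

    def __init__(self, value=0):
        self.value = value

    def process(self, rows):
        return [
            {k: (self.value if v is None else v) for k, v in row.items()}
            for row in rows
        ]


class ColumnRemoverSketch:
    """Drop the named columns from every row."""

    def __init__(self, columns):
        self.columns = set(columns)

    def process(self, rows):
        return [
            {k: v for k, v in row.items() if k not in self.columns}
            for row in rows
        ]


# Hypothetical registry: maps the 'type' key of a config to a class.
REGISTRY = {"Fillna": FillnaSketch, "ColumnRemover": ColumnRemoverSketch}


def build_pipeline(configs):
    """Instantiate one processor per config dict, in order."""
    pipeline = []
    for cfg in configs:
        kwargs = dict(cfg)                 # copy so the caller's dict is untouched
        cls = REGISTRY[kwargs.pop("type")]
        pipeline.append(cls(**kwargs))
    return pipeline


def run_pipeline(pipeline, rows):
    """Feed the output of each processor into the next."""
    for processor in pipeline:
        rows = processor.process(rows)
    return rows


rows = [{"KMID": 0.1, "turnover": None, "tmp": 1}]
pipeline = build_pipeline([
    {"type": "ColumnRemover", "columns": ["tmp"]},
    {"type": "Fillna", "value": 0},
])
result = run_pipeline(pipeline, rows)
```

Because the processors run in list order, the position of normalization and `Fillna` steps in the config determines what each step sees.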