diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 0000000..94e918f
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,185 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Overview

Alpha Lab is a notebook-centric quantitative research framework built on the `qshare` library for exploring trading strategies and ML models. The codebase is organized around two prediction tasks:

- **cta_1d**: CTA (Commodity Trading Advisor) futures 1-day return prediction
- **stock_15m**: Stock 15-minute forward return prediction using high-frequency features

## Directory Structure

```
alpha_lab/
├── common/                  # Shared utilities
│   ├── __init__.py
│   ├── paths.py             # Path management
│   └── plotting.py          # Common plotting functions
│
├── cta_1d/                  # CTA 1-day return prediction
│   ├── __init__.py          # Re-exports from src/
│   ├── config.yaml          # Task configuration
│   ├── src/                 # Implementation modules
│   │   ├── __init__.py
│   │   ├── loader.py        # CTA1DLoader
│   │   ├── train.py         # Training functions
│   │   ├── backtest.py      # Backtest functions
│   │   └── labels.py        # Label blending utilities
│   └── *.ipynb              # Experiment notebooks
│
├── stock_15m/               # Stock 15-minute return prediction
│   ├── __init__.py          # Re-exports from src/
│   ├── config.yaml          # Task configuration
│   ├── src/                 # Implementation modules
│   │   ├── __init__.py
│   │   ├── loader.py        # Stock15mLoader
│   │   └── train.py         # Training functions
│   └── *.ipynb              # Experiment notebooks
│
└── results/                 # Output directory (gitignored)
```

## Common Commands

### Development Setup

```bash
# Install dependencies
pip install -r requirements.txt

# Create environment configuration
cp .env.template .env
# Edit .env with your DolphinDB host and data paths
```

### Running Experiments

```bash
# Start Jupyter for interactive experiments
jupyter notebook

# Train CTA model from config
python -m cta_1d.train --config cta_1d/config.yaml --output results/cta_1d/exp01

# Train Stock 15m model
python -m stock_15m.train --config stock_15m/config.yaml --output results/stock_15m/exp01

# Run CTA backtest
python -m cta_1d.backtest \
    --model results/cta_1d/exp01/model.json \
    --dt-range 2023-01-01 2023-12-31 \
    --output results/cta_1d/backtest_01
```

### Python API Usage

```python
# CTA 1D workflow
from cta_1d import CTA1DLoader, train_model, TrainConfig

loader = CTA1DLoader(return_type='o2c_twap1min', normalization='dual')
dataset = loader.load(dt_range=['2020-01-01', '2023-12-31'])

config = TrainConfig(dt_range=['2020-01-01', '2023-12-31'], feature_sets=['alpha158'])
model, metrics = train_model(config, output_dir='results/exp01')

# Stock 15m workflow
# Note: stock_15m exports its own train_model/TrainConfig, which would
# shadow the cta_1d names above -- import only what you need, or alias.
from stock_15m import Stock15mLoader

loader = Stock15mLoader(normalization_mode='dual')
dataset = loader.load(
    dt_range=['2020-01-01', '2023-12-31'],
    feature_path='/data/parquet/stock_1min_alpha158',
    kline_path='/data/parquet/stock_1min_kline'
)
```

## Architecture

### Module Organization

All implementation code lives in `src/` subdirectories:

- **`cta_1d/src/`**: CTA-specific implementations
  - `loader.py`: CTA1DLoader class
  - `train.py`: train_model, TrainConfig
  - `backtest.py`: run_backtest, BacktestConfig
  - `labels.py`: Label blending utilities

- **`stock_15m/src/`**: Stock-specific implementations
  - `loader.py`: Stock15mLoader class
  - `train.py`: train_model, TrainConfig

Root `__init__.py` files re-export public APIs for backward compatibility:

```python
from cta_1d import CTA1DLoader  # Imports from cta_1d.src
```

### Data Flow

Both tasks follow a consistent pattern:

1. **Loaders** (`src/loader.py`): Fetch data from DolphinDB (CTA) or Parquet files (Stock), apply normalization, compute sample weights, and return a `pl_Dataset`
2. **Training** (`src/train.py`): XGBoost with early stopping; outputs model JSON + metrics
3. **Backtest** (`src/backtest.py`): CTA-only; uses `qshare.eval.cta.backtest.CTABacktester` for strategy simulation

### Key Classes

- **`CTA1DLoader`**: Loads alpha158/hffactor features from DolphinDB; supports 5 normalization modes (`zscore`, `cs_zscore`, `rolling_20`, `rolling_60`, `dual`)
- **`Stock15mLoader`**: Loads Alpha158 on 1-min data; computes 15-min forward returns; normalization modes: `industry`, `cs_zscore`, `dual`
- **`pl_Dataset`**: From `qshare.data`; provides `.with_segments()`, `.split()`, `.to_numpy()` methods

### Normalization Modes

**CTA 1D** (`dual` blending):
- `zscore`: Fit-time mean/std normalization
- `cs_zscore`: Cross-sectional z-score per datetime
- `rolling_20/60`: Rolling window normalization
- `dual`: Weighted blend of the modes above (default weights: [0.2, 0.1, 0.3, 0.4])

**Stock 15m**:
- `industry`: Industry-neutralized returns
- `cs_zscore`: Cross-sectional z-score
- `dual`: 80% industry-neutral + 20% cs_zscore

### Experiment Tracking

Manual tracking in `results/{task}/README.md`:

```markdown
## 2025-01-15: Baseline XGB
- Notebook: `cta_1d/03_baseline_xgb.ipynb` (cells 1-50)
- Config: eta=0.5, lambda=0.1
- Train IC: 0.042
- Test IC: 0.038
- Notes: Dual normalization, 4 trades/day
```

### Dependencies on qshare

The codebase relies heavily on the `qshare` library (already installed in the venv):

- `qshare.data.pl_Dataset`: Dataset container with Polars backend
- `qshare.io.ddb`: DolphinDB session management
- `qshare.io.polars`: Parquet loading utilities
- `qshare.algo.polars`: Industry neutralization, cross-sectional z-score
- `qshare.eval.cta.backtest`: CTA backtesting framework
- `qshare.config.research.cta`: Predefined column lists (HFFACTOR_COLS)

### Configuration Files

YAML configs define data ranges, model hyperparameters, and output settings:

```yaml
data:
  dt_range: ['2020-01-01', '2023-12-31']
  feature_sets: [alpha158, hffactor]
  normalization: dual
model:
  type: xgb
  params: {eta: 0.05, max_depth: 6}
```

Load a config via `python -m cta_1d.train --config config.yaml`, or parse it directly with `yaml.safe_load()`.
diff --git a/README.md b/README.md
index 8deffe4..dc1b228 100644
--- a/README.md
+++ b/README.md
@@ -14,20 +14,33 @@ Quantitative research experiments for qshare library. This repository contains J
 ```
 alpha_lab/
 ├── common/                  # Shared utilities (keep minimal!)
+│   ├── __init__.py
 │   ├── paths.py             # Path management
 │   └── plotting.py          # Common plotting functions
 │
 ├── cta_1d/                  # CTA 1-day return prediction
+│   ├── __init__.py          # Re-exports from src/
+│   ├── config.yaml          # Task configuration
+│   ├── src/                 # Implementation modules
+│   │   ├── __init__.py
+│   │   ├── loader.py        # CTA1DLoader
+│   │   ├── train.py         # Training functions
+│   │   ├── backtest.py      # Backtest functions
+│   │   └── labels.py        # Label blending utilities
 │   ├── 01_data_check.ipynb
 │   ├── 02_label_analysis.ipynb
 │   ├── 03_baseline_xgb.ipynb
-│   ├── 04_blend_comparison.ipynb
-│   └── src/                 # Task-specific helpers
+│   └── 04_blend_comparison.ipynb
 │
 ├── stock_15m/               # Stock 15-minute return prediction
+│   ├── __init__.py          # Re-exports from src/
+│   ├── config.yaml          # Task configuration
+│   ├── src/                 # Implementation modules
+│   │   ├── __init__.py
+│   │   ├── loader.py        # Stock15mLoader
+│   │   └── train.py         # Training functions
 │   ├── 01_data_exploration.ipynb
-│   ├── 02_baseline_model.ipynb
-│   └── src/
+│   └── 02_baseline_model.ipynb
 │
 └── results/                 # Output directory (gitignored)
     ├── cta_1d/
@@ -47,6 +60,8 @@ cp .env.template .env
 
 ## Usage
 
+### Interactive (Notebooks)
+
 Start Jupyter and run notebooks interactively:
 
 ```bash
@@ -59,6 +74,33 @@ Each task directory contains numbered notebooks:
 - `03_*.ipynb` - Advanced experiments
 - `04_*.ipynb` - Comparisons and ablations
 
+### Command Line
+
+Train models from config files:
+
+```bash
+# CTA 1D
+python -m cta_1d.train --config cta_1d/config.yaml --output results/cta_1d/exp01
+
+# Stock 15m
+python -m stock_15m.train --config stock_15m/config.yaml --output results/stock_15m/exp01
+
+# CTA backtest
+python -m cta_1d.backtest \
+    --model results/cta_1d/exp01/model.json \
+    --dt-range 2023-01-01 2023-12-31 \
+    --output results/cta_1d/backtest_01
+```
+
+### Python API
+
+```python
+# Import from task root (re-exports from src/)
+from cta_1d import CTA1DLoader, train_model, TrainConfig
+from stock_15m import Stock15mLoader  # exports its own train_model/TrainConfig; alias them if you need both
+from common import create_experiment_dir
+```
+
 ## Experiment Tracking
 
 Experiments are tracked manually in `results/{task}/README.md`:
 
@@ -75,13 +117,18 @@
 ## Adding a New Task
 
 1. Create directory: `mkdir my_task`
-2. Add `src/` subdirectory for helpers
-3. Create numbered notebooks
-4. Add entry to `results/my_task/README.md`
+2. Add a `src/` subdirectory with:
+   - `__init__.py` - Export public APIs
+   - `loader.py` - Dataset loader class
+   - Other modules as needed
+3. Add a root `__init__.py` that re-exports from `src/`
+4. Create numbered notebooks
+5. Add an entry to `results/my_task/README.md`
 
 ## Best Practices
 
 1. **Keep it simple**: Only add to `common/` after 3+ copies
-2. **Notebook configs**: Define CONFIG dict in first cell for easy modification
-3. **Document results**: Update results README after significant runs
-4. **Git discipline**: Don't commit large files, results, or credentials
+2. **Module organization**: Place implementation in `src/`, re-export from the root `__init__.py`
+3. **Notebook configs**: Define a CONFIG dict in the first cell for easy modification
+4. **Document results**: Update the results README after significant runs
+5. **Git discipline**: Don't commit large files, results, or credentials
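The manual experiment-tracking convention described above can be scripted. The helper below is a hypothetical sketch (`append_experiment_entry` is not part of alpha_lab or qshare) that appends one entry in the `results/{task}/README.md` format shown in the Experiment Tracking sections:

```python
from datetime import date
from pathlib import Path


def append_experiment_entry(readme: Path, title: str, notebook: str,
                            config: str, metrics: dict, notes: str) -> None:
    """Append one experiment entry in the results-README format used above.

    Illustrative helper only; not part of the repository's API.
    """
    lines = [f"## {date.today().isoformat()}: {title}",
             f"- Notebook: `{notebook}`",
             f"- Config: {config}"]
    # One bullet per metric, e.g. "- Train IC: 0.042"
    lines += [f"- {name}: {value}" for name, value in metrics.items()]
    lines.append(f"- Notes: {notes}")
    # Append as a new section; existing entries are left untouched.
    with readme.open("a", encoding="utf-8") as f:
        f.write("\n" + "\n".join(lines) + "\n")
```

Called after a significant run, e.g. `append_experiment_entry(Path('results/cta_1d/README.md'), 'Baseline XGB', 'cta_1d/03_baseline_xgb.ipynb', 'eta=0.5, lambda=0.1', {'Train IC': 0.042, 'Test IC': 0.038}, 'Dual normalization, 4 trades/day')`.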