From 19f7c522e491d92f03e79dd76ba239bb5494dcb7 Mon Sep 17 00:00:00 2001
From: guofu
Date: Fri, 13 Feb 2026 13:27:13 +0800
Subject: [PATCH] Update documentation for src/ consolidation

- Add detailed directory structure to CLAUDE.md and README.md
- Document Module Organization section explaining src/ layout
- Add Python API import examples showing re-export pattern
- Add Command Line usage section with examples
- Update "Adding a New Task" instructions for src/ structure
- Add module organization best practice

Co-Authored-By: Claude Sonnet 4.5
---
CLAUDE.md | 185 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
README.md | 67 +++++++++++++++++---
2 files changed, 242 insertions(+), 10 deletions(-)
create mode 100644 CLAUDE.md
diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 0000000..94e918f
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,185 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Overview
+
+Alpha Lab is a quantitative research experiment framework for the `qshare` library. It uses a notebook-centric approach for exploring trading strategies and ML models. The codebase is organized around two prediction tasks:
+
+- **cta_1d**: CTA (Commodity Trading Advisor) futures 1-day return prediction
+- **stock_15m**: Stock 15-minute forward return prediction using high-frequency features
+
+## Directory Structure
+
+```
+alpha_lab/
+├── common/ # Shared utilities
+│ ├── __init__.py
+│ ├── paths.py # Path management
+│ └── plotting.py # Common plotting functions
+│
+├── cta_1d/ # CTA 1-day return prediction
+│ ├── __init__.py # Re-exports from src/
+│ ├── config.yaml # Task configuration
+│ ├── src/ # Implementation modules
+│ │ ├── __init__.py
+│ │ ├── loader.py # CTA1DLoader
+│ │ ├── train.py # Training functions
+│ │ ├── backtest.py # Backtest functions
+│ │ └── labels.py # Label blending utilities
+│ └── *.ipynb # Experiment notebooks
+│
+├── stock_15m/ # Stock 15-minute return prediction
+│ ├── __init__.py # Re-exports from src/
+│ ├── config.yaml # Task configuration
+│ ├── src/ # Implementation modules
+│ │ ├── __init__.py
+│ │ ├── loader.py # Stock15mLoader
+│ │ └── train.py # Training functions
+│ └── *.ipynb # Experiment notebooks
+│
+└── results/ # Output directory (gitignored)
+```
+
+## Common Commands
+
+### Development Setup
+
+```bash
+# Install dependencies
+pip install -r requirements.txt
+
+# Create environment configuration
+cp .env.template .env
+# Edit .env with your DolphinDB host and data paths
+```
+
+### Running Experiments
+
+```bash
+# Start Jupyter for interactive experiments
+jupyter notebook
+
+# Train CTA model from config
+python -m cta_1d.train --config cta_1d/config.yaml --output results/cta_1d/exp01
+
+# Train Stock 15m model
+python -m stock_15m.train --config stock_15m/config.yaml --output results/stock_15m/exp01
+
+# Run CTA backtest
+python -m cta_1d.backtest \
+ --model results/cta_1d/exp01/model.json \
+ --dt-range 2023-01-01 2023-12-31 \
+ --output results/cta_1d/backtest_01
+```
+
+### Python API Usage
+
+```python
+# CTA 1D workflow
+from cta_1d import CTA1DLoader, train_model, TrainConfig
+
+loader = CTA1DLoader(return_type='o2c_twap1min', normalization='dual')
+dataset = loader.load(dt_range=['2020-01-01', '2023-12-31'])
+
+config = TrainConfig(dt_range=['2020-01-01', '2023-12-31'], feature_sets=['alpha158'])
+model, metrics = train_model(config, output_dir='results/exp01')
+
+# Stock 15m workflow
+from stock_15m import Stock15mLoader, train_model, TrainConfig
+
+loader = Stock15mLoader(normalization_mode='dual')
+dataset = loader.load(
+ dt_range=['2020-01-01', '2023-12-31'],
+ feature_path='/data/parquet/stock_1min_alpha158',
+ kline_path='/data/parquet/stock_1min_kline'
+)
+```
+
+## Architecture
+
+### Module Organization
+
+All implementation code lives in `src/` subdirectories:
+
+- **`cta_1d/src/`**: CTA-specific implementations
+ - `loader.py`: CTA1DLoader class
+ - `train.py`: train_model, TrainConfig
+ - `backtest.py`: run_backtest, BacktestConfig
+ - `labels.py`: Label blending utilities
+
+- **`stock_15m/src/`**: Stock-specific implementations
+ - `loader.py`: Stock15mLoader class
+ - `train.py`: train_model, TrainConfig
+
+Root `__init__.py` files re-export public APIs for backward compatibility:
+```python
+from cta_1d import CTA1DLoader # Imports from cta_1d.src
+```
+
+### Data Flow
+
+Both tasks follow a consistent pattern:
+
+1. **Loaders** (`src/loader.py`): Fetch data from DolphinDB (CTA) or Parquet files (Stock), apply normalization, compute sample weights, and return a `pl_Dataset`
+2. **Training** (`src/train.py`): XGBoost with early stopping, outputs model JSON + metrics
+3. **Backtest** (`src/backtest.py`): CTA-only; uses `qshare.eval.cta.backtest.CTABacktester` for strategy simulation
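A schematic sketch of this three-stage flow is below; the function names mirror the docs (`load`/`train_model`/`run_backtest`) but the bodies are placeholders, not qshare code:

```python
# Schematic stand-ins for the three stages described above; shapes and
# return values are illustrative only, not the real qshare/task APIs.

def load(dt_range):
    """Stage 1: fetch + normalize; stands in for a loader's .load()."""
    return {"X": [[0.1], [0.2]], "y": [0.01, -0.02]}

def train_model(dataset):
    """Stage 2: fit a model, return (model, metrics) as train.py does."""
    model = {"type": "xgb", "n_trees": 1}
    metrics = {"train_ic": round(sum(dataset["y"]), 4)}
    return model, metrics

def run_backtest(model, dt_range):
    """Stage 3 (CTA only): strategy simulation, done by CTABacktester in qshare."""
    return {"dt_range": dt_range, "n_trades": 0}

dataset = load(["2020-01-01", "2023-12-31"])
model, metrics = train_model(dataset)
report = run_backtest(model, ["2023-01-01", "2023-12-31"])
```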
+
+### Key Classes
+
+- **`CTA1DLoader`**: Loads alpha158/hffactor features from DolphinDB; supports 5 normalization modes (`zscore`, `cs_zscore`, `rolling_20`, `rolling_60`, `dual`)
+- **`Stock15mLoader`**: Loads Alpha158 on 1-min data; computes 15-min forward returns; normalization modes: `industry`, `cs_zscore`, `dual`
+- **`pl_Dataset`**: From `qshare.data`; provides `.with_segments()`, `.split()`, `.to_numpy()` methods
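The `pl_Dataset` call pattern can be illustrated with a minimal stand-in; the real class lives in `qshare.data` and its signatures may differ:

```python
# Minimal stand-in showing the .with_segments()/.split()/.to_numpy() call
# pattern named above; NOT the real qshare.data.pl_Dataset implementation.
class FakeDataset:
    def __init__(self, rows):
        self.rows = rows          # list of (dt, features, label)
        self.segments = None

    def with_segments(self, train, test):
        """Tag date ranges as named segments; returns self for chaining."""
        self.segments = {"train": train, "test": test}
        return self

    def split(self, name):
        """Return the rows whose datetime falls inside the named segment."""
        lo, hi = self.segments[name]
        return [r for r in self.rows if lo <= r[0] <= hi]

    def to_numpy(self):
        """Here: plain nested lists standing in for (X, y) arrays."""
        X = [r[1] for r in self.rows]
        y = [r[2] for r in self.rows]
        return X, y

ds = FakeDataset([("2020-01-02", [0.1], 0.01), ("2023-06-01", [0.2], -0.02)])
train = ds.with_segments(("2020-01-01", "2022-12-31"),
                         ("2023-01-01", "2023-12-31")).split("train")
```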
+
+### Normalization Modes
+
+**CTA 1D** (`dual` blending):
+- `zscore`: Fit-time mean/std normalization
+- `cs_zscore`: Cross-sectional z-score per datetime
+- `rolling_20/60`: Rolling window normalization
+- `dual`: Weighted blend (default: [0.2, 0.1, 0.3, 0.4])
+
+**Stock 15m**:
+- `industry`: Industry-neutralized returns
+- `cs_zscore`: Cross-sectional z-score
+- `dual`: 80% industry-neutral + 20% cs_zscore
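The `dual` blending for both tasks can be sketched as a weighted sum of the per-mode normalized labels; the weights and mode names below come from this document, while the real implementation lives in `cta_1d/src/labels.py` and `qshare.algo.polars`:

```python
# Illustrative sketch of "dual" blending; weights/mode names are from the
# docs above, and the column data here is made up for demonstration.

def blend(variants, weights):
    """variants: {mode: [values]}, weights: {mode: float} -> weighted sum."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights should sum to 1"
    n = len(next(iter(variants.values())))
    return [sum(weights[m] * variants[m][i] for m in weights) for i in range(n)]

# CTA 1D default blend over its four component modes
cta_weights = {"zscore": 0.2, "cs_zscore": 0.1, "rolling_20": 0.3, "rolling_60": 0.4}

# Stock 15m blend: 80% industry-neutral + 20% cross-sectional z-score
stock_weights = {"industry": 0.8, "cs_zscore": 0.2}

variants = {"industry": [0.5, -1.0], "cs_zscore": [1.0, 0.0]}
blended = blend(variants, stock_weights)   # approximately [0.6, -0.8]
```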
+
+### Experiment Tracking
+
+Manual tracking in `results/{task}/README.md`:
+
+```markdown
+## 2025-01-15: Baseline XGB
+- Notebook: `cta_1d/03_baseline_xgb.ipynb` (cells 1-50)
+- Config: eta=0.5, lambda=0.1
+- Train IC: 0.042
+- Test IC: 0.038
+- Notes: Dual normalization, 4 trades/day
+```
+
+### Dependencies on qshare
+
+The codebase relies heavily on the `qshare` library (already installed in the venv):
+
+- `qshare.data.pl_Dataset`: Dataset container with Polars backend
+- `qshare.io.ddb`: DolphinDB session management
+- `qshare.io.polars`: Parquet loading utilities
+- `qshare.algo.polars`: Industry neutralization, cross-sectional z-score
+- `qshare.eval.cta.backtest`: CTA backtesting framework
+- `qshare.config.research.cta`: Predefined column lists (HFFACTOR_COLS)
+
+### Configuration Files
+
+YAML configs define data ranges, model hyperparameters, and output settings:
+
+```yaml
+data:
+ dt_range: ['2020-01-01', '2023-12-31']
+ feature_sets: [alpha158, hffactor]
+ normalization: dual
+model:
+ type: xgb
+ params: {eta: 0.05, max_depth: 6}
+```
+
+Configs are consumed by the CLI entry points (`python -m cta_1d.train --config config.yaml`) or parsed directly with `yaml.safe_load()`.
diff --git a/README.md b/README.md
index 8deffe4..dc1b228 100644
--- a/README.md
+++ b/README.md
@@ -14,20 +14,33 @@ Quantitative research experiments for qshare library. This repository contains J
```
alpha_lab/
├── common/ # Shared utilities (keep minimal!)
+│ ├── __init__.py
│ ├── paths.py # Path management
│ └── plotting.py # Common plotting functions
│
├── cta_1d/ # CTA 1-day return prediction
+│ ├── __init__.py # Re-exports from src/
+│ ├── config.yaml # Task configuration
+│ ├── src/ # Implementation modules
+│ │ ├── __init__.py
+│ │ ├── loader.py # CTA1DLoader
+│ │ ├── train.py # Training functions
+│ │ ├── backtest.py # Backtest functions
+│ │ └── labels.py # Label blending utilities
│ ├── 01_data_check.ipynb
│ ├── 02_label_analysis.ipynb
│ ├── 03_baseline_xgb.ipynb
-│ ├── 04_blend_comparison.ipynb
-│ └── src/ # Task-specific helpers
+│ └── 04_blend_comparison.ipynb
│
├── stock_15m/ # Stock 15-minute return prediction
+│ ├── __init__.py # Re-exports from src/
+│ ├── config.yaml # Task configuration
+│ ├── src/ # Implementation modules
+│ │ ├── __init__.py
+│ │ ├── loader.py # Stock15mLoader
+│ │ └── train.py # Training functions
│ ├── 01_data_exploration.ipynb
-│ ├── 02_baseline_model.ipynb
-│ └── src/
+│ └── 02_baseline_model.ipynb
│
└── results/ # Output directory (gitignored)
├── cta_1d/
@@ -47,6 +60,8 @@ cp .env.template .env
## Usage
+### Interactive (Notebooks)
+
Start Jupyter and run notebooks interactively:
```bash
@@ -59,6 +74,33 @@ Each task directory contains numbered notebooks:
- `03_*.ipynb` - Advanced experiments
- `04_*.ipynb` - Comparisons and ablations
+### Command Line
+
+Train models from config files:
+
+```bash
+# CTA 1D
+python -m cta_1d.train --config cta_1d/config.yaml --output results/cta_1d/exp01
+
+# Stock 15m
+python -m stock_15m.train --config stock_15m/config.yaml --output results/stock_15m/exp01
+
+# CTA Backtest
+python -m cta_1d.backtest \
+ --model results/cta_1d/exp01/model.json \
+ --dt-range 2023-01-01 2023-12-31 \
+ --output results/cta_1d/backtest_01
+```
+
+### Python API
+
+```python
+# Import from task root (re-exports from src/)
+from cta_1d import CTA1DLoader, train_model, TrainConfig
+from stock_15m import Stock15mLoader, train_model, TrainConfig
+from common import create_experiment_dir
+```
+
## Experiment Tracking
Experiments are tracked manually in `results/{task}/README.md`:
@@ -75,13 +117,18 @@ Experiments are tracked manually in `results/{task}/README.md`:
## Adding a New Task
1. Create directory: `mkdir my_task`
-2. Add `src/` subdirectory for helpers
-3. Create numbered notebooks
-4. Add entry to `results/my_task/README.md`
+2. Add `src/` subdirectory with:
+ - `__init__.py` - Export public APIs
+ - `loader.py` - Dataset loader class
+ - Other modules as needed
+3. Add root `__init__.py` that re-exports from `src/`
+4. Create numbered notebooks
+5. Add entry to `results/my_task/README.md`
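Steps 2-3 can be demonstrated end to end by building the layout in a temp directory; `my_task` and `MyTaskLoader` are placeholder names from the steps above, not real modules:

```python
# Demonstrates the re-export shim from step 3 with a throwaway package;
# "my_task"/"MyTaskLoader" are hypothetical names used only here.
import os, sys, tempfile

root = tempfile.mkdtemp()
pkg = os.path.join(root, "my_task")
os.makedirs(os.path.join(pkg, "src"))

# src/loader.py holds the implementation
with open(os.path.join(pkg, "src", "loader.py"), "w") as f:
    f.write("class MyTaskLoader:\n    pass\n")

# src/__init__.py exports the implementation package's public API
with open(os.path.join(pkg, "src", "__init__.py"), "w") as f:
    f.write("from .loader import MyTaskLoader\n")

# root __init__.py re-exports, so callers write `from my_task import ...`
with open(os.path.join(pkg, "__init__.py"), "w") as f:
    f.write("from .src import MyTaskLoader\n")

sys.path.insert(0, root)
from my_task import MyTaskLoader   # resolved via the re-export chain
```

The payoff is that notebooks and scripts import from the task root, so modules can move around inside `src/` without breaking callers.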
## Best Practices
1. **Keep it simple**: Only add to `common/` after 3+ copies
-2. **Notebook configs**: Define CONFIG dict in first cell for easy modification
-3. **Document results**: Update results README after significant runs
-4. **Git discipline**: Don't commit large files, results, or credentials
+2. **Module organization**: Place implementation in `src/`, re-export from root `__init__.py`
+3. **Notebook configs**: Define CONFIG dict in first cell for easy modification
+4. **Document results**: Update results README after significant runs
+5. **Git discipline**: Don't commit large files, results, or credentials