EulerForge CLI Reference
GitHub Repository: https://github.com/eulerwa/eulerforge
Korean version: cli.md
Global Options
--lang LANG — Set Output Language
Use --lang with any subcommand to change the CLI output language.
# Output in English
eulerforge --lang en train --preset configs/presets/qwen3.5_0.8b_dense_lora_sft.yml
# Output in Japanese
eulerforge --lang ja bench --preset configs/bench/sft_target_only.yml
| Language Code | Language |
|---|---|
| ko | 한국어 (default) |
| en | English |
| zh | 中文 |
| ja | 日本語 |
| es | Español |
You can also set the language via the EULERFORGE_LANG environment variable. The --lang flag takes precedence over the environment variable.
export EULERFORGE_LANG=en
eulerforge train --preset ... # Output in English
Commands
eulerforge train
Main training pipeline.
eulerforge train [OPTIONS]
| Option | Description |
|---|---|
| --preset PATH | YAML preset configuration file |
| --set KEY=VALUE | Override a configuration value (repeatable) |
| --output-dir DIR | Checkpoint output directory |
| --run-name NAME | Optional run name suffix |
| --print-config | Print the resolved configuration as JSON |
| --validate-only | Validate configuration only; do not train |
| --preflight | Load model, apply injection, verify phases — without training |
| --metrics-level LEVEL | Metrics level: minimal (default) or advanced (+ MoE routing stats) |
| --debug | Enable verbose debug logging |
| --debug-every N | Print debug logs every N steps (default: 50) |
| --debug-max-modules N | Maximum modules per debug section (default: 50) |
| --debug-topk-grad N | Number of top-K gradient norms to print (default: 20) |
| --debug-attn | Print attention projection summaries |
| --debug-trainable-names | Dump trainable parameter names |
Pipeline Continuation (Checkpoint Auto-Detection)
When model_name points to a previous EulerForge training checkpoint, the system automatically detects the base model and loads it correctly.
# SFT → DPO pipeline example
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_dpo.yml \
--set model_name=outputs/sft/final \
--output-dir outputs/dpo
How it works:
1. If model_name directory contains lora_info.json, it's recognized as an EulerForge LoRA checkpoint
2. Reads resolved_config.json from the parent directory to find the original base model
3. Auto-overrides the current config's injection settings (lora_r, lora_alpha, target_keywords, start_layer, etc.) with those from the previous checkpoint — prevents LoRA structure mismatch in cases like SFT(lora_r=48)→DPO(lora_r=24)
4. Loads the base model and applies LoRA injection (using checkpoint-derived parameters)
5. Restores the LoRA adapter weights from the previous checkpoint
Note: Auto-detection requires resolved_config.json. Only checkpoints produced by eulerforge train are supported. If you need a different LoRA structure for DPO, run it as a separate training session rather than through the pipeline.
Environment Variables
| Environment Variable | Description |
|---|---|
| EULERFORGE_MOE_PERF=1 | Enable per-module MoEFFN forward time/token distribution profiling. Outputs [MoEPerf] logs |
| EULERFORGE_MOE_DTYPE_DEBUG=1 | Enable dtype flow debug logging at MoEFFN merge points. Outputs [MoEDType] logs (hidden/router/weights/expert_out/buffer dtype) |
| EULERFORGE_TRAIN_DEBUG=1 | Verbose training loop debug: per-group LR, grad norm mean, param dtype distribution, dequantize stats (mean/std/min/max) on phase transitions. Outputs [TrainDebug]/[DequantDebug] logs |
eulerforge convert
Converts arbitrary JSONL into EulerForge standard raw JSONL. The core philosophy is --map-based mapping; recipes are provided as convenience shortcuts.
Three modes:
- Map mode (default): --map OUT_KEY=EXPR (repeatable) — general-purpose field mapping
- Recipe mode: --recipe <name> — built-in recipes for common data structure transformations
- Validate mode: --validate — validate input file schema without performing any conversion
# Inspect fields (explore input file structure)
eulerforge convert --task sft --input data/sft_10k.jsonl --print-sample-flat
# Map mode (flat fields)
eulerforge convert --task sft --input data/custom.jsonl --output data/out.jsonl \
--map prompt=instruction --map response=output
# Map mode (nested fields — automatic dot-path flattening)
eulerforge convert --task prompted_preference --input data/dpo_10k.jsonl --output data/dpo_10k_raw.jsonl \
--map prompt=instruction.value --map chosen=chosen.value --map rejected=rejected.value
# Recipe mode (messages array)
eulerforge convert --task sft --input data/sft_10k.jsonl --output data/sft_10k_raw.jsonl \
--recipe sft_messages --messages-expr json_record.messages
# Validate mode (schema validation without conversion)
eulerforge convert --task sft --input data/sft_10k_raw.jsonl --validate
| Option | Description |
|---|---|
| --task | Target task: sft, preference, prompted_preference |
| --input | Input JSONL file path |
| --output | Output standard raw JSONL path (not required for --validate / --print-sample-flat) |
| --map OUT_KEY=EXPR | Output key to field expression mapping (repeatable). E.g., --map prompt=instruction |
| --flatten dot\|none | Flatten mode: dot (default) flattens nested dicts to dot-keys; none navigates dot-paths directly |
| --join-sep SEP | Separator for joining list[str] values (default: newline) |
| --strict | Error if a resolved value is None, empty string, or empty list |
| --recipe | Built-in recipe name |
| --messages-expr EXPR | Dot-path to messages array (required with --recipe sft_messages) |
| --num-proc | Number of parallel workers (default: 1) |
| --overwrite | Overwrite output file if it already exists |
| --max-rows | Maximum number of rows to convert (sampling) |
| --validate | Validate-only mode: validate input schema, no conversion |
| --print-sample-flat | Print flattened key list from first 1-2 rows and exit (for field discovery) |
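The dot-key flattening behind --flatten dot can be sketched as a small recursive helper (the function name is illustrative; the behavior matches the nested-field example above, where chosen.value addresses {"chosen": {"value": ...}}):

```python
def flatten_dot(record: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into dot-keyed entries, e.g.
    {"instruction": {"value": "hi"}} -> {"instruction.value": "hi"}."""
    out = {}
    for key, value in record.items():
        dotted = f"{prefix}{key}"
        if isinstance(value, dict):
            out.update(flatten_dot(value, prefix=dotted + "."))
        else:
            out[dotted] = value
    return out
```

After flattening, a mapping like --map chosen=chosen.value reduces to a plain dictionary lookup on the flat record.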
Built-in Recipes
| Recipe | Task | Input Structure | Output |
|---|---|---|---|
| sft_messages | sft | Any dot-path to messages array (--messages-expr required) | {prompt, response} |
| sft_instruction_output | sft | {instruction, output} | {prompt, response} |
| dpo_nested_value | prompted_preference | {instruction.value, chosen.value, rejected.value} | {prompt, chosen, rejected} |
| sft_messages_v1 | sft | {json_record.messages: [{role,content}]} | {prompt, response} |
| dpo_nested_v1 | prompted_preference | {instruction.value, chosen.value, rejected.value} | {prompt, chosen, rejected} |
| passthrough_prompted_preference_v1 | prompted_preference | {prompt, chosen, rejected} | Same (extra keys removed) |
| passthrough_sft_prompt_response_v1 | sft | {prompt, response} | Same (extra keys removed) |
| passthrough_preference_v1 | preference | {chosen, rejected} | Same (extra keys removed) |
Details: tutorials/00_data_preprocessing.md
eulerforge preprocess
Converts raw JSONL into tokenized processed JSONL.
eulerforge preprocess --task TASK --input RAW.jsonl --output PROCESSED.jsonl --model-name MODEL
| Option | Description |
|---|---|
| --task | Task type: sft, preference, prompted_preference |
| --input | Input raw JSONL file path |
| --output | Output processed JSONL file path |
| --model-name | HuggingFace model/tokenizer name |
| --max-length | Maximum sequence length (default: 512) |
| --num-proc | Number of parallel workers (default: 50% of CPU cores) |
| --text-col | SFT text column name (default: text) |
| --prompt-col | Prompt column name (default: prompt) |
| --response-col | Response column name (default: response) |
| --chosen-col | Chosen column name (default: chosen) |
| --rejected-col | Rejected column name (default: rejected) |
Details: tutorials/00_data_preprocessing.md
eulerforge bench
Inference benchmark. Configure target/baseline/judge models via a YAML spec, sample from benchmark data, and compare inference results. The target supports API models (Ollama/OpenAI/Gemini) or locally trained HF checkpoints.
eulerforge bench [OPTIONS]
| Option | Description |
|---|---|
| --preset PATH | Bench YAML spec file path (required) |
| --set KEY=VALUE | Override a configuration value (repeatable) |
| --output-dir DIR | Results output directory |
| --validate-only | Validate configuration only |
| --dry-run | Extract data samples only (no model calls) |
| --target-output-dir PATH | Target local model: training output root directory |
| --checkpoint TYPE | Checkpoint type: final (default) \| latest \| best |
| --target-model-dir PATH | Target local model: specify HF save_pretrained directory directly |
| --target-device DEVICE | Target local model device override (e.g., cuda:0, cuda:1, cpu) |
Sequential Model Loading (OOM Prevention)
Models are loaded one at a time: each model is loaded, processes the entire dataset, and is then unloaded. No more than one model resides in GPU memory at any time.
Phase 1: Load target → run inference on all samples → unload
Phase 2: Load baseline → run inference on all samples → unload (when enabled)
Phase 3: Load judge → evaluate all samples → unload (when enabled)
Phase 4: Aggregate results → same output format as before
LocalHFClient.unload() deletes the model/tokenizer and calls torch.cuda.empty_cache(). For API clients (ChatClient/JudgeClient), unload() is a no-op.
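The phase loop above can be sketched with a minimal driver. The client class here is a stand-in for LocalHFClient and the API clients, not the real EulerForge interface; it only illustrates the load → infer-all → unload discipline:

```python
class FakeClient:
    """Stand-in client: load lazily, unload eagerly (real clients move a
    model on/off the GPU; API clients make unload() a no-op)."""
    def __init__(self, name: str):
        self.name, self.loaded = name, False
    def load(self):
        self.loaded = True
    def generate(self, prompt: str) -> str:
        return f"{self.name}:{prompt}"
    def unload(self):
        self.loaded = False  # real LocalHFClient: del model + empty_cache()

def run_bench(samples, target, baseline=None, judge=None):
    """Run each enabled phase over the full sample set before moving on,
    so at most one model is resident at a time."""
    results = {"target": [], "baseline": [], "judge": []}
    for phase, client in (("target", target), ("baseline", baseline), ("judge", judge)):
        if client is None:
            continue  # phase disabled in the YAML spec
        client.load()
        for s in samples:
            results[phase].append(client.generate(s))
        client.unload()
    return results
```
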
Execution Modes
| Condition | Mode | Output |
|---|---|---|
| Target only | target-only | Target responses |
| Target + baseline | comparison | Both outputs |
| Judge + target only | pointwise | Score (1-10) + explanation |
| Judge + target + baseline | pairwise | Winner (A/B/tie) + score + explanation |
Bench YAML Spec
Target specification methods (use exactly one):
bench:
task: sft # sft | preference
data_path: data/sft_1k_bench_raw.jsonl
sample:
k: 10
seed: 42
shuffle: true
generation:
max_new_tokens: 256
temperature: 0.7
top_p: 0.95
models:
target:
# Method A: API model (Ollama/OpenAI/Gemini)
provider: ollama # ollama | openai | gemini
model: "qwen3:0.6b"
base_url: "http://localhost:11434/v1"
# Method B: Training output directory (automatic checkpoint resolution)
# output_dir: "outputs/run_20260301_120000"
# checkpoint: "final" # final | latest | best (default: final)
# device: "auto" # default: auto
# dtype: "auto" # default: auto
# Method C: Specify HF model directory directly
# provider: local_hf # can be omitted (auto-configured)
# model_dir: "outputs/run_20260301_120000/final"
baseline:
enabled: false
provider: ollama
model: "qwen3:4b"
judge:
enabled: false
provider: ollama
model: "gemma3:27b"
mode: pointwise # pointwise | pairwise
mitigate_position_bias: true # pairwise A/B swap (2 rounds)
output:
out_dir: outputs/bench
save_jsonl: true
print_examples: true
print_max_chars: 1500
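mitigate_position_bias runs the pairwise judge twice with the A/B positions swapped. One plausible way to aggregate the two rounds is sketched below; the aggregation rule (agree → keep, disagree → tie) is an assumption for illustration, not the documented EulerForge behavior:

```python
def aggregate_pairwise(round1: str, round2: str) -> str:
    """round1 judges with A=target, B=baseline; round2 judges with positions
    swapped, so its verdict is flipped back before comparison."""
    flip = {"A": "B", "B": "A", "tie": "tie"}
    verdicts = (round1, flip[round2])
    if verdicts[0] == verdicts[1]:
        return verdicts[0]      # both rounds agree on the same model
    return "tie"                # disagreement suggests position bias
```
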
Usage Examples
# Using an API model
eulerforge bench --preset configs/bench/sft_target_only.yml
# Load final checkpoint from training output directory
eulerforge bench --preset configs/bench/sft_local.yml \
--target-output-dir outputs/run_20260301_120000
# Use latest checkpoint
eulerforge bench --preset configs/bench/sft_local.yml \
--target-output-dir outputs/run_20260301_120000 --checkpoint latest
# Specify model directory directly
eulerforge bench --preset configs/bench/sft_local.yml \
--target-model-dir outputs/run_20260301_120000/final
# Specify GPU (load target on GPU 1 when judge is using GPU 0)
eulerforge bench --preset configs/bench/sft_local.yml \
--target-output-dir outputs/run_20260301_120000 --target-device cuda:1
# Override configuration
eulerforge bench --preset configs/bench/sft_target_only.yml --set bench.sample.k=20
# Validate only
eulerforge bench --preset configs/bench/sft_target_only.yml --validate-only
# Preview samples (no model calls)
eulerforge bench --preset configs/bench/sft_target_only.yml --dry-run
Bench Data Format
| File | Keys | Task |
|---|---|---|
| sft_1k_bench_raw.jsonl | prompt, response | sft |
| dpo_1k_bench_raw.jsonl | prompt, chosen, rejected | preference |
Details: tutorials/11_bench.md
eulerforge eval (stub)
Perplexity evaluation (not yet implemented).
eulerforge grid
Hyperparameter grid / random / Bayesian search via Optuna. See the eulerforge grid section below for full documentation.
Training Types (training.type)
| Type | Description | Data Format | Reference Model |
|---|---|---|---|
| sft | Supervised Fine-Tuning (default) | {input_ids, attention_mask, labels} | Not required |
| dpo | Direct Preference Optimization | chosen/rejected pairs (DPO format) | Adapter disabled |
| orpo | Odds Ratio Preference Optimization | chosen/rejected pairs (DPO format) | Not required |
| rm | Reward Model (Bradley-Terry) | chosen/rejected pairs (DPO format) | Not required |
| ppo | PPO/RLHF | Prompt-only + reward model | Adapter disabled |
Common Training Settings
| Key | Alias | Default | Description |
|---|---|---|---|
| max_train_steps | max_steps | 10000 | Maximum training steps (micro-step basis) |
| batch_size | — | 4 | Batch size |
| grad_accum_steps | — | 1 | Gradient accumulation steps |
| lr | — | 1e-5 | Learning rate |
| weight_decay | — | 0.01 | Weight decay |
| warmup_steps | — | 500 | Warmup steps |
| max_grad_norm | — | 1.0 | Maximum gradient clipping norm |
| log_steps | — | 100 | Logging interval (micro-steps) |
| save_steps | — | 1000 | Checkpoint saving interval (micro-steps) |
| val_steps | — | save_steps value | Validation interval (micro-steps). Defaults to save_steps if unset |
Note: max_train_steps and max_steps are synonymous. If both are specified, max_steps takes precedence. In presets, max_train_steps is the canonical key.
Step Terminology:
- Micro-step: Each batch forward/backward constitutes 1 micro-step. max_steps is counted in micro-steps.
- Optimizer step: Occurs every grad_accum_steps batches. Total optimizer steps = max_steps / grad_accum_steps.
- Effective batch: batch_size x grad_accum_steps. The actual number of samples processed per optimizer step.
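The step arithmetic in one place (the numbers are illustrative, not defaults):

```python
batch_size, grad_accum_steps, max_steps = 4, 8, 10000

effective_batch = batch_size * grad_accum_steps   # samples per optimizer step
optimizer_steps = max_steps // grad_accum_steps   # max_steps counts micro-steps

assert effective_batch == 32
assert optimizer_steps == 1250
```
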
ORPO Options
training:
type: orpo
orpo_lambda: 1.0 # ORPO term weight (required, positive)
RM Options
training:
type: rm
# RewardHead is auto-generated (hidden_size → 1)
PPO Options
training:
type: ppo
ppo:
clip_range: 0.2 # PPO clipping range
kl_coef: 0.1 # KL penalty coefficient
epochs: 4 # PPO update epochs per batch
max_gen_len: 64 # Maximum generation length
temperature: 1.0 # Sampling temperature
reward_model:
model_name: "path/to/model" # Reward model path
checkpoint_path: "" # Checkpoint (optional)
Model Load Precision (model.load_precision)
Declare precision and quantization options for model loading in YAML.
model:
load_precision:
mode: int4 # fp32 | fp16 | bf16 | int8 | int4
compute_dtype: bf16 # Compute dtype for int8/int4 operations (fp16 | bf16)
quant_type: nf4 # int4 only: nf4 | fp4
double_quant: true # int4 only
dequantize_on_train: true # Auto-dequantize when base_ffn becomes trainable
Behavior by Mode
| Mode | torch_dtype | BitsAndBytes | Use Case |
|---|---|---|---|
| fp32 | float32 | — | CPU / debugging |
| fp16 | float16 | — | GPU memory savings |
| bf16 | bfloat16 | — | GPU default (Ampere+) |
| int8 | — | load_in_8bit | 8-bit quantization |
| int4 | — | load_in_4bit | 4-bit quantization (QLoRA) |
Phase Dequantize Policy
When dequantize_on_train=true (the default) and base_ffn becomes trainable during a phase transition,
quantized modules are replaced with nn.Linear and weights are restored to compute_dtype.
| Mode | Replacement Target | Dequantize Method |
|---|---|---|
| int4 | Linear4bit → nn.Linear | dequantize_4bit() |
| int8 | Linear8bitLt → nn.Linear | int8_vectorwise_dequant() |
| bf16/fp16/fp32 | No replacement | Set requires_grad directly |
Replacement is performed automatically just before requires_grad=True is set within set_trainable_by_groups().
Since quantized module forward() re-allocates int8/int4 data, simple parameter casting is insufficient — the module itself must be replaced.
Automatic target_layers scoping: When base_ffn is active and no explicit target_layers is specified, target layers are automatically derived from injection.start_layer/injection.num_layers. Only layers where MoE is injected are dequantized/unfrozen.
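Because the module itself must be swapped (not just its parameters cast), the core operation is "replace the object at a dotted path". A generic sketch of that swap, independent of bitsandbytes and illustrative only:

```python
def replace_module(root, dotted_name: str, factory):
    """Swap the attribute at dotted_name (e.g. "layers.mlp.up_proj") for a
    fresh object built by factory(old) — the same shape of operation used to
    turn a quantized Linear4bit back into nn.Linear before unfreezing."""
    *parents, leaf = dotted_name.split(".")
    obj = root
    for name in parents:
        obj = getattr(obj, name)
    setattr(obj, leaf, factory(getattr(obj, leaf)))
```

In the real pipeline, factory would build an nn.Linear from the dequantized weights in compute_dtype; here it is any callable taking the old module.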
Backward Compatibility
Using the legacy training.quant_bits produces a warning and automatic conversion:
- quant_bits: 4 → mode: int4
- quant_bits: 8 → mode: int8
- quant_bits: 16 → mode: bf16
Spec details: docs/fixtures/specs/load_precision_spec.md
Data Input (data Section)
EulerForge supports three standard data task types:
| Task | Raw Schema | Processed Schema |
|---|---|---|
| sft | {text} or {prompt, response} | {input_ids, attention_mask, labels} |
| preference | {chosen, rejected} | {chosen_input_ids, chosen_attention_mask, chosen_labels, rejected_input_ids, rejected_attention_mask, rejected_labels} |
| prompted_preference | {prompt, chosen, rejected} | Above + {prompt_input_ids, prompt_attention_mask, prompt_len} |
Configuration Examples
# Method 1: Already tokenized processed JSONL
data:
format: processed
path: data/sft_processed.jsonl
# Method 2: Raw JSONL (auto-preprocessed and cached during training)
data:
format: raw
task: prompted_preference
path: data/dpo_10k_raw.jsonl
max_length: 512
# num_proc: default = 50% of CPU cores (automatic)
# cache_dir: default = outputs/.cache (shared cache across runs)
#
# Cache filename convention: {input_filename}_{task}_{model_name}_len{max_length}.jsonl
# Example: dpo_10k_raw_prompted_preference_qwen3.5-0.8b-base_len512.jsonl
# Automatically reused when parameters match (no re-preprocessing)
# Method 3 (legacy): Specify processed_data_path directly
processed_data_path: data/sft_processed.jsonl
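The cache filename convention shown in Method 2 can be reproduced with a small helper. The function is hypothetical; the pattern itself comes from the comment above, and stripping the HF org prefix from the model name is an assumption inferred from the example:

```python
from pathlib import Path

def cache_filename(input_path: str, task: str, model_name: str, max_length: int) -> str:
    """{input_filename}_{task}_{model_name}_len{max_length}.jsonl"""
    stem = Path(input_path).stem
    model = model_name.split("/")[-1]  # assumption: drop the HF org prefix
    return f"{stem}_{task}_{model}_len{max_length}.jsonl"
```
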
Labels Masking Policy
- SFT text-only: labels = input_ids (full sequence causal LM)
- SFT prompt/response: Prompt token region set to -100; only the response contributes to the loss
- Preference: Full sequence labels for both chosen and rejected
- Prompted-Preference: Prompt tokens masked with -100; only completions contribute to loss/logp
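The prompt-masking rule for SFT prompt/response data, sketched on plain token-ID lists (the helper is illustrative; -100 is the standard ignore index for cross-entropy):

```python
IGNORE_INDEX = -100  # ignored by the cross-entropy loss

def build_labels(prompt_ids: list[int], response_ids: list[int]):
    """Return (input_ids, labels): labels mirror input_ids but mask the
    prompt region with -100, so only response tokens contribute to the loss."""
    input_ids = prompt_ids + response_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels
```
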
Integration Test Data Policy
- Integration tests: use 1k data (sft_1k.jsonl, dpo_1k.jsonl)
- Tutorials: based on 10k data (sft_10k_raw.jsonl, dpo_10k_raw.jsonl)
- Bench tests: use *_1k_bench_raw.jsonl
Data Validation
Before training starts, the first 5 rows of the processed data are checked for required keys. On missing keys, a 3-line error is shown:
Data: processed dataset row 0 missing 'input_ids'
Fix: Run `eulerforge preprocess ...` or set data.format=raw with schema mapping
See: docs/tutorials/en/00_data_preprocessing.md
data.task / training.type Compatibility
When data.format=raw, the compatibility between data.task and training.type is automatically validated:
| data.task | Compatible training.type |
|---|---|
| sft | sft, ppo |
| preference | dpo, orpo, rm |
| prompted_preference | dpo, orpo, rm |
| prompt_only | ppo |
On mismatch, an error is raised:
Data Config: data.task='sft' is incompatible with training.type='dpo'. Task 'sft' produces data for sft, ppo training
Fix: Set --set training.type=sft or change data.task to match
See: docs/tutorials/en/00_data_preprocessing.md
Configuration Override Mechanism
Use --set with dot-path notation to override values:
--set training.lr=2e-5
--set injection.lora_r=32
--set training.phases.1.step=100 # Update a specific list element's field (sparse)
--set training.phases.0.trainable=[lora,attn_lora] # List-of-list element value
List index override: Use training.phases.N.field=value to update only the field at index N. Other elements, and other fields within that element, are preserved from the preset. Out-of-range indices raise a clear error.
Type Inference
| Value | Parsed As |
|---|---|
| true | bool |
| false | bool |
| null | None |
| 42 | int |
| 3.14 | float |
| 2e-5 | float |
| hello | str |
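The inference rules in the table can be sketched as a parser (illustrative, not the real implementation; the order matters — keywords, then int, then float, then str):

```python
def parse_set_value(raw: str):
    """Infer the type of a --set value: true/false/null keywords first,
    then int, then float (which covers scientific notation like 2e-5),
    falling back to str."""
    keywords = {"true": True, "false": False, "null": None}
    if raw in keywords:
        return keywords[raw]
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw
```
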
Precedence
CLI --set overrides > YAML preset values
Output and Checkpoints
Training produces the following directory structure:
outputs/run_YYYYMMDD_HHMMSS/
├── resolved_config.json # Full resolved configuration (for reproducibility)
├── checkpoint-latest/ # Latest checkpoint
│ ├── model files...
│ ├── tokenizer files...
│ └── training_state.pt # Optimizer, scheduler, step, best_val_loss
├── checkpoint-best/ # Best validation loss checkpoint
│ └── ...
└── final/ # Final model after training completes
└── ...
checkpoint-best is saved at val_steps intervals and checkpoint-latest at save_steps intervals, independently. Example: with val_steps=500 and save_steps=1000, best is saved at step 500 and latest at step 1000.
Automatic Training Resume
If checkpoint-latest/training_state.pt exists, training resumes automatically from that point:
- Restores micro_step, optimizer_step, best_val_loss
- LR scheduler fast-forward: the cosine schedule is immediately restored to the optimizer step at resume (no re-warmup from 0)
- ETA calculation: remaining time is estimated from steps taken since resume, not from total steps
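The scheduler fast-forward amounts to evaluating the schedule at the restored optimizer step instead of step 0. A sketch with a standard linear-warmup + cosine-decay curve (an assumption; the exact EulerForge schedule may differ):

```python
import math

def cosine_lr(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warmup to base_lr, then cosine decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Resuming at optimizer step N just evaluates cosine_lr(N, ...) directly —
# there is no second warmup ramp from 0.
```
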
MoE Stability Validation
When using MoE strategies (mixture_lora, moe_expert_lora), the validator automatically performs stability checks.
Required Parameters
| Key | Role | Recommended Value | Rationale |
|---|---|---|---|
| moe.router_z_loss_coef | Router logit stabilization | 0.001 | ST-MoE: prevents softmax overflow |
| moe.load_balance.type | Load balancing policy | aux_loss | Prevents routing collapse |
| moe.load_balance.aux_loss_coef | Auxiliary loss weight | 0.01 | Required when type=aux_loss |
Optional Parameters
| Key | Role | Default | Recommended Range |
|---|---|---|---|
| moe.capacity_factor_train | Expert capacity upper bound during training | Model default | 1.0-2.0 (ST-MoE: 1.25) |
| moe.capacity_factor_eval | Expert capacity upper bound during evaluation | Model default | >= train value (ST-MoE: 2.0) |
| moe.router_dtype | Router computation precision | float32 | float32 (float16/bfloat16 can be numerically unstable) |
| moe.load_balance.bias_update_speed | Expert bias adaptation speed | — | Required when type=aux_loss_free (0.001) |
Error Message Format
MoE validation failures are reported in 3-line format:
MoE Config: moe.router_z_loss_coef is required for MoE strategies.
Fix: Set moe.router_z_loss_coef: 0.001 (ST-MoE recommended)
See: docs/tutorials/en/09_moe_stability_and_validation.md
Warnings (Non-fatal)
| Condition | Meaning |
|---|---|
| router_z_loss_coef = 0 | z-loss disabled — may be unstable in large-scale training |
| router_z_loss_coef > 0.1 | Unusually large value — excessive router constraint |
| load_balance.type = none | No load balancing — risk of routing collapse |
| aux_loss_coef = 0 | Effectively disables load balancing |
| router_dtype = float16/bfloat16 | Potential numerical instability in router softmax |
| capacity_factor_eval < train | Increased token dropping during evaluation |
| No phase includes router | Router not trained — cannot adapt to data |
Preflight Checks
--preflight performs runtime checks after loading the model:
eulerforge train --preset PRESET.yml --preflight
| Check Item | Description |
|---|---|
| Group parameter count | Error if a group referenced by a phase has 0 parameters |
| target_layers range | Error if indices exceed the model's layer count |
Details: tutorials/09_moe_stability_and_validation.md
Logging and Metrics
The training loop logs through a two-tier metrics system.
Metrics Levels
| Level | Recorded Items |
|---|---|
| minimal (default) | step, main_loss, total_loss, aux_loss, lr, grad_norm, throughput, tokens/samples, training-type-specific metrics |
| advanced | minimal + MoE routing stats (token_frac, entropy, importance_cv, router_logit_max, etc.) |
Minimal Metrics
| Tag | Description |
|---|---|
| train/main_loss | Primary loss (SFT/DPO/ORPO/RM/PPO) |
| train/total_loss | Total loss (main + aux * weight) |
| train/aux_loss | Sum of MoE auxiliary losses |
| train/learning_rate | Current learning rate |
| train/grad_norm | Global L2 gradient norm |
| train/tokens_seen | Cumulative training tokens (labels != -100) |
| train/samples_seen | Cumulative processed samples (preference: counted as pairs) |
| train/optimizer_step | Cumulative optimizer steps |
| train/micro_step | Cumulative micro-steps |
| train/effective_batch | Effective batch size (batch_size x grad_accum_steps) |
| DPO: train/reward_margin | Chosen-rejected reward margin |
| ORPO: train/sft_loss, train/orpo_loss | Individual SFT/ORPO losses |
| PPO: train/kl, train/reward_mean | KL divergence, mean reward |
Advanced Metrics (MoE Only)
| Tag | Description |
|---|---|
| moe/token_frac_mean | Mean token fraction per expert |
| moe/token_frac_std | Standard deviation of token fraction per expert |
| moe/token_frac_max | Fraction of the most selected expert |
| moe/entropy_mean | Mean router entropy |
| moe/importance_cv | Importance coefficient of variation (imbalance detection) |
| moe/aux_loss_total | Total aux loss sum |
| moe/router_logit_max | Maximum router logit (numerical explosion detection) |
| moe/num_moe_modules | Number of MoE modules |
LoRA Handoff Logging
When lora_handoff is configured, the following milestone events are automatically logged during training:
| Event | Log Message | Timing |
|---|---|---|
| Schedule init | [Handoff] Schedule: expert_lora(step 4000→6000, ...) | At training start |
| Fade start | [Handoff] expert_lora fade started at step 4000 (curve=cosine, ...) | When start_step is reached |
| Fade complete | [Handoff] expert_lora fade complete at step 6000 (scale=0.0000) | When start_step + duration_steps is reached |
| Freeze | [Handoff] expert_lora frozen at step 6000 (scale=0.0000) | When end_action=freeze |
| Ramp start | [Handoff] base_ffn_ramp started at step 2000 (LR x1.0→x3.0) | When ramp start_step is reached |
| Ramp complete | [Handoff] base_ffn_ramp complete at step 4000 (LR x3.0) | When ramp end_step is reached |
Each event is logged exactly once (one-shot). attn_lora follows the same pattern.
Periodic logs (at log_steps intervals) automatically include handoff state:
[Phase2] Step 400/600 (micro 4500/6000) | Loss: 0.1234 | LR: 1.00e-05 | ... | Handoff[expert_lora=0.75, attn_lora=0.80, ffn_lr_mult=2.50]
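The fade scale in the periodic log (e.g. expert_lora=0.75) can be computed as a cosine interpolation from 1.0 to 0.0 over the fade window. The formula below is the standard cosine fade and is assumed to match curve=cosine; the function name is illustrative:

```python
import math

def fade_scale(step: int, start_step: int, duration_steps: int) -> float:
    """Cosine fade from 1.0 at start_step to 0.0 at start_step + duration_steps."""
    if step <= start_step:
        return 1.0
    if step >= start_step + duration_steps:
        return 0.0
    t = (step - start_step) / duration_steps
    return 0.5 * (1.0 + math.cos(math.pi * t))
```
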
TensorBoard Integration
TensorBoard logging is an optional dependency:
pip install eulerforge[tb]
Configuration example:
logging:
metrics_level: advanced # "minimal" | "advanced"
tensorboard:
enabled: true # Enable TensorBoard logging (default: false)
log_dir: "outputs/tb" # Log directory
log_interval: 50 # Write to TensorBoard every N steps
max_experts_log: 16 # Log detailed stats for top N experts in advanced mode
Override metrics level via CLI:
eulerforge train --preset PRESET.yml --metrics-level advanced
If tensorboard is not installed, a warning is printed and training proceeds normally.
Details: tutorials/10_metrics_monitoring.md
eulerforge grid
Basic Usage
# Validate spec only (dry-run)
eulerforge grid configs/grid/sft_random_search.yml --dry-run
# Run
eulerforge grid configs/grid/sft_random_search.yml
# Specify project root explicitly
eulerforge grid configs/grid/sft_random_search.yml --project-root /path/to/project
Options
| Option | Default | Description |
|---|---|---|
| spec | (required) | Grid search YAML spec file path |
| --dry-run | false | Validate spec only; do not run training |
| --project-root DIR | cwd | Base directory for resolving relative paths |
Optuna Dependency
pip install eulerforge[hpo]
If optuna is not installed, a message suggesting pip install eulerforge[hpo] is shown and the process exits.
Optuna Sampler by Method
| Method | Sampler | Notes |
|---|---|---|
| grid | GridSampler | Discrete spaces only (categorical / choices) |
| random | RandomSampler | Supports both continuous and discrete |
| bayes | TPESampler | Supports both continuous and discrete |
Combining method: "grid" with type: "float" and low/high is an error. To search continuous floats in grid mode, use the choices: [val1, val2] format.
Output
<output_root>/
├── trial_0000/
│ ├── metrics.jsonl # Per-step metrics (train/total_loss, etc.)
│ └── checkpoint-latest/
├── trial_0001/
│ └── ...
├── summary.json # Overall results (best_trial + all_trials)
└── summary.csv
Spec format details: docs/fixtures/specs/grid_search_spec.md
Python API: eulerforge.loader
Public API for loading EulerForge-trained checkpoints directly from Python.
from eulerforge import load_model
load_model(path, *, checkpoint, device, dtype, load_precision) -> LoadedModel
| Parameter | Default | Description |
|---|---|---|
| path | (required) | Path to a run_dir or checkpoint_dir |
| checkpoint | "final" | Checkpoint selection for run_dir: final \| best \| latest |
| device | "auto" | "auto" \| "cpu" \| "cuda" \| "cuda:0", etc. |
| dtype | "auto" | "auto" \| "float32" \| "bfloat16", etc. |
| load_precision | None | Load precision: None \| "fp32" \| "fp16" \| "bf16" \| "int8" \| "int4" |
Return Type
@dataclass
class LoadedModel:
model: nn.Module # eval mode
tokenizer: PreTrainedTokenizer
metadata: ModelMetadata
@dataclass
class ModelMetadata:
strategy: str # "dense_lora" | "moe_expert_lora" | "mixture_lora" | "none"
backbone: str # "qwen3" | "llama" | "gemma3" | ""
path_type: str # "run_dir" | "checkpoint_dir"
checkpoint_dir: str # Actual checkpoint directory path
lora_config: dict | None # {"lora_r": int, "lora_alpha": float}
structure_preserved: bool # Whether MoE/MixtureLoRA structure is preserved
load_precision: str | None # "fp32" | "fp16" | "bf16" | "int8" | "int4" | None
Automatic Path Classification
| Path Type | Detection Criteria | Behavior |
|---|---|---|
| run_dir | resolved_config.json exists | Resolves subdirectory via checkpoint parameter |
| checkpoint_dir | config.json exists | Loads directly |
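The classification criteria in the table reduce to two file checks; a minimal sketch (the helper is illustrative, not part of the public API):

```python
from pathlib import Path

def classify_path(path: str) -> str:
    """resolved_config.json marks a run_dir; config.json marks a checkpoint_dir."""
    p = Path(path)
    if (p / "resolved_config.json").exists():
        return "run_dir"
    if (p / "config.json").exists():
        return "checkpoint_dir"
    raise ValueError(f"{path}: neither resolved_config.json nor config.json found")
```
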
Usage Examples
from eulerforge import load_model
# 1. Load from training output directory (final checkpoint)
result = load_model("outputs/run_20260311_163425")
print(result.metadata.strategy) # "moe_expert_lora"
print(result.metadata.backbone) # "qwen3"
# 2. Load best checkpoint
result = load_model("outputs/run_20260311_163425", checkpoint="best")
# 3. Specify checkpoint directory directly
result = load_model("outputs/run_20260311_163425/final", device="cuda:0")
# 4. Quantized loading (int4/int8)
result = load_model("outputs/run_20260311_163425", load_precision="int4")
result = load_model("outputs/run_20260311_163425", load_precision="bf16")
# 5. Inference
messages = [{"role": "user", "content": "Hello"}]
text = result.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = result.tokenizer(text, return_tensors="pt").to(result.model.device)
with torch.no_grad():
out = result.model.generate(**inputs, max_new_tokens=128)
print(result.tokenizer.decode(out[0], skip_special_tokens=True))
# 6. Using metadata
if result.metadata.structure_preserved:
print("This checkpoint preserves the MoE structure")
if result.metadata.lora_config:
print(f"LoRA r={result.metadata.lora_config['lora_r']}")
Loading Strategies
| Strategy | Loading Method | Structure Preserved |
|---|---|---|
| dense_lora | LoRA merge → dense model | No |
| moe_expert_lora | Reconstruct MoE architecture (expert+router) | Yes |
| mixture_lora | Reconstruct MixtureLoRA architecture | Yes |
| none | Load via from_pretrained() directly | N/A |
Note: moe_expert_lora and mixture_lora require resolved_config.json. Without it, the system falls back to averaging experts into a dense model (with a WARNING).
eulerforge export-hf
Exports an EulerForge checkpoint as a HuggingFace Transformers-compatible model.
Basic Usage
# dense_lora → merged HF model
eulerforge export-hf --checkpoint outputs/run_20260311_163425 --output ./exported
# MoE → custom_moe HF model (load with trust_remote_code=True)
eulerforge export-hf --checkpoint outputs/run_moe --output ./exported_moe
# dry-run (print plan only)
eulerforge export-hf --checkpoint outputs/run --output ./out --dry-run
# validate-only
eulerforge export-hf --checkpoint outputs/run --output ./out --validate-only
Options
| Option | Default | Description |
|---|---|---|
| --checkpoint | (required) | Path to run_dir or checkpoint_dir |
| --output | (required) | Export output directory (error if it already exists) |
| --format | auto | auto \| merged \| custom_moe |
| --select-checkpoint | final | final \| best \| latest |
| --dtype | auto | auto \| fp32 \| fp16 \| bf16 |
| --safe-serialization | True | Whether to use safetensors (disable with --no-safe-serialization) |
| --copy-tokenizer | True | Whether to copy the tokenizer (disable with --no-copy-tokenizer) |
| --dry-run | False | Print plan only; no actual export |
| --validate-only | False | Validate only |
| --skip-diversity-check | False | Skip expert diversity check (experts may not be differentiated with lightweight training) |
Export Format by Strategy
| Strategy | format=auto | Result | HF Loading |
|---|---|---|---|
| dense_lora | merged | Standard dense HF model | from_pretrained(path) |
| mixture_lora | custom_moe | base+router+N LoRA experts | from_pretrained(path, trust_remote_code=True) |
| moe_expert_lora | custom_moe | N expert FFN+router | from_pretrained(path, trust_remote_code=True) |
Key point: Exporting the moe_expert_lora/mixture_lora strategies as merged destroys the expert structure. format=merged with an MoE strategy raises a ValueError.
Spec details: docs/fixtures/specs/export_hf_spec.md
eulerforge pretrain (Plugin)
Plugin command: pretrain is provided via the plugin system. It is registered in the CLI only when eulerforge.plugins.pretrain_plugin is present. Excluding this module from a public distribution automatically hides the command.
Performs scratch pretraining on a HuggingFace-format model exported by EulerStack or a similar tool, using raw text data. This is a full-parameter causal LM training pipeline, completely separate from the train command (which applies LoRA/MoE injection).
eulerforge pretrain [OPTIONS]
| Option | Description |
|---|---|
| --preset PATH | Pretrain YAML preset path (required) |
| --set KEY=VALUE | Config override (repeatable) |
| --output-dir DIR | Output directory |
| --validate-only | Validate config only, no training |
pretrain vs train differences
| Item | train (fine-tuning) | pretrain (scratch) |
|---|---|---|
| Model load | from_pretrained (with trained weights) | from_pretrained (initialized weights) |
| Injection | LoRA/MoE applied | None (all parameters trainable) |
| Phase schedule | freeze/unfreeze control | None (everything trainable) |
| Data | instruction/preference format | raw text (packed chunking) |
| Forbidden keys | — | injection, moe, backbone → error |
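Packed chunking (packing: true) concatenates all tokenized documents into one stream and cuts it into fixed-length chunks. A minimal sketch; dropping the trailing remainder is a common convention and an assumption here:

```python
def pack_chunks(token_streams: list[list[int]], max_length: int) -> list[list[int]]:
    """Concatenate token streams and split into chunks of exactly max_length,
    discarding the final partial chunk."""
    flat = [tok for stream in token_streams for tok in stream]
    n_chunks = len(flat) // max_length
    return [flat[i * max_length:(i + 1) * max_length] for i in range(n_chunks)]
```
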
Pretrain YAML preset structure
# Device
device: "cuda:0" # cuda:0, cuda:1, cpu
# Model (EulerStack export directory)
model_dir: "outputs/full_hybrid_moe"
trust_remote_code: true
# Tokenizer (specify separately if not in model_dir)
tokenizer: "gpt2"
# Data
data:
path: "data/dolma_10k.jsonl"
text_column: "text" # text key in JSONL
max_length: 1024
packing: true # packed chunking (concatenate → split to fixed length)
# Training
training:
max_steps: 500
batch_size: 2
grad_accum_steps: 4
lr: 3.0e-4
weight_decay: 0.1
warmup_steps: 50
max_grad_norm: 1.0
log_steps: 10
save_steps: 250
dtype: "float32" # Hybrid models (Hyena/RetNet) require float32
amp: false # FFT ops don't support bf16 → disabled
seed: 42
Usage examples
# Basic run
eulerforge pretrain --preset configs/presets/pretrain/eulerstack_hybrid_moe.yml
# Override settings
eulerforge pretrain --preset ... --set training.max_steps=1000 --set training.lr=1e-4
# Validate only
eulerforge pretrain --preset ... --validate-only
# Specify output directory
eulerforge pretrain --preset ... --output-dir outputs/my_pretrain
Output structure
outputs/pretrain_YYYYMMDD_HHMMSS/
├── pretrain_config.json # Config snapshot
├── metrics.jsonl # Per-step loss, lr
├── checkpoint-250/ # Mid-training checkpoint
│ ├── config.json
│ ├── model.safetensors
│ └── tokenizer files...
└── final/ # Final model
├── config.json
├── model.safetensors
└── tokenizer files...
After pretraining, point eulerforge train's model_name to the final/ directory for LoRA fine-tuning.