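11. Inference Benchmark

Overview

eulerforge bench evaluates the inference quality of a fine-tuned model. A single YAML spec configures three models (target, baseline, judge), and the tool then runs response generation → comparison → evaluation automatically.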

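Ways to specify the target model:

Method	When to use
`target.model` (API model)	When using a model served by Ollama/OpenAI/Gemini
`target.output_dir` (training output)	Loads a checkpoint automatically from an `eulerforge train` output directory
`target.model_dir` (HF directory)	When pointing directly at a `save_pretrained()` result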

Prerequisites

1. Install Ollama and pull models

# Install Ollama (https://ollama.ai)
curl -fsSL https://ollama.ai/install.sh | sh

# Download models
ollama pull qwen3:0.6b      # target
ollama pull qwen3:4b         # baseline (optional)
ollama pull gemma3:27b       # judge (optional)

2. Bench data

EulerForge uses pre-generated bench data in the data/ directory:

File	Format	task
`data/sft_1k_bench_raw.jsonl`	`{prompt, response}`	sft
`data/dpo_1k_bench_raw.jsonl`	`{prompt, chosen, rejected}`	preference
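
Before a long run it can help to confirm that a JSONL file really has the fields listed in the table above. The following is a minimal sketch, not part of EulerForge: `REQUIRED_KEYS` and `check_bench_lines` are hypothetical names, and the only thing taken from this page is the set of field names per task.

```python
import json

# Required fields per task, copied from the table above.
REQUIRED_KEYS = {
    "sft": {"prompt", "response"},
    "preference": {"prompt", "chosen", "rejected"},
}


def check_bench_lines(lines, task):
    """Return (line_number, missing_keys) for each record that lacks a field."""
    required = REQUIRED_KEYS[task]
    problems = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = required - set(record)
        if missing:
            problems.append((lineno, sorted(missing)))
    return problems
```

An open file object can be passed directly as `lines`, for example `check_bench_lines(open("data/sft_1k_bench_raw.jsonl", encoding="utf-8"), "sft")`; an empty list means every record has the expected fields.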

Quick start

Evaluate the target model only

eulerforge bench --preset configs/bench/sft_target_only.yml

Overriding settings

# Change the number of samples
eulerforge bench --preset configs/bench/sft_target_only.yml --set bench.sample.k=20

# Change the model
eulerforge bench --preset configs/bench/sft_target_only.yml --set bench.models.target.model=qwen3:4b
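
To make the dotted `KEY=VALUE` form concrete, here is a minimal sketch of how such an override could be applied to a nested config. It is an illustration only: `apply_override` is a hypothetical helper, and the value parsing (JSON first, plain string otherwise) is an assumption rather than EulerForge's documented behavior.

```python
import json


def apply_override(config, assignment):
    """Apply one KEY=VALUE override with a dotted KEY to a nested dict, in place."""
    key, sep, raw = assignment.partition("=")
    if not sep:
        raise ValueError(f"expected KEY=VALUE, got {assignment!r}")
    try:
        value = json.loads(raw)  # "20" -> 20, "true" -> True
    except json.JSONDecodeError:
        value = raw  # anything else stays a string, e.g. "qwen3:4b"
    *parents, leaf = key.split(".")
    node = config
    for part in parents:
        node = node.setdefault(part, {})  # create missing levels on the way down
    node[leaf] = value
    return config
```

Under these assumptions, `bench.sample.k=20` sets an integer and `bench.models.target.model=qwen3:4b` sets a string, creating intermediate mappings where the preset did not define them.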

Dry run (inspect samples without calling any model)

eulerforge bench --preset configs/bench/sft_target_only.yml --dry-run

Run modes

1. Target Only

Generates and prints only the target model's responses.

bench:
  models:
    target:
      model: "qwen3:0.6b"
    baseline:
      enabled: false
    judge:
      enabled: false

2. Target + Baseline comparison

Prints both models' responses side by side.

# Option A: use an Ollama model as the baseline
bench:
  models:
    target:
      model: "qwen3:0.6b"
    baseline:
      enabled: true
      provider: ollama
      model: "qwen3:4b"
    judge:
      enabled: false

# Option B: load an HF model directly as the baseline (inference on GPU via transformers)
# Useful when you want to compare against the exact base model that was fine-tuned
bench:
  models:
    target:
      model_dir: "outputs/run_20260330_204433/final"
      device: "cuda:0"
      dtype: "float16"
    baseline:
      enabled: true
      model_dir: "Qwen/Qwen2.5-0.5B"   # HF Hub name or local path
      device: "cuda:0"
      dtype: "float16"
    judge:
      enabled: false

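Note: when model_dir is set on the baseline, the model is loaded directly with HF transformers and no provider is needed. This lets you compare against the exact same base model instead of Ollama's instruction-tuned variant. To avoid OOM, target and baseline are loaded sequentially.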

3. Pointwise Judge

The judge model scores the target response on a 1-10 scale. If a baseline is enabled, target and baseline are each scored in a separate pointwise pass.

bench:
  models:
    target:
      model: "qwen3:0.6b"
    baseline:
      enabled: true        # a baseline can also be enabled in pointwise mode
      model: "qwen3.5:0.8b"
    judge:
      enabled: true
      model: "gpt-oss:20b"
      mode: pointwise

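In pointwise mode with a baseline enabled:
- the target response and the baseline response are each scored independently on a 1-10 scale
- the summary reports the target average and the baseline average separately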

4. Pairwise Judge

The judge model compares the target against the baseline. With mitigate_position_bias: true, each pair is judged twice with the A/B order swapped, which reduces position bias.

bench:
  models:
    target:
      model: "qwen3:0.6b"
    baseline:
      enabled: true
      model: "qwen3:4b"
    judge:
      enabled: true
      model: "gemma3:27b"
      mode: pairwise
      mitigate_position_bias: true
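
The A/B swap can be sketched as follows. This is a simplified illustration, not EulerForge's implementation: `judge` is a hypothetical callable that returns "A", "B" or "tie", and the rule for combining the two passes (a win counts only when both orderings agree) is an assumption.

```python
def judge_pairwise(judge, prompt, target, baseline, mitigate_position_bias=True):
    """Return "target", "baseline" or "tie" for one sample.

    `judge(prompt, answer_a, answer_b)` is a hypothetical callable that
    returns "A", "B" or "tie".
    """
    # First pass: target is shown as A, baseline as B.
    first = {"A": "target", "B": "baseline", "tie": "tie"}[
        judge(prompt, target, baseline)
    ]
    if not mitigate_position_bias:
        return first
    # Second pass with the order swapped: baseline is now A, target is B.
    second = {"A": "baseline", "B": "target", "tie": "tie"}[
        judge(prompt, baseline, target)
    ]
    # Assumption: disagreement between the two orderings is reported as a tie.
    return first if first == second else "tie"
```

A judge that always prefers whichever answer is shown first yields "tie" under this scheme instead of a spurious win for the target, at the cost of two judge calls per sample.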

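Using a local HF model (evaluating training results)

Checkpoints produced by eulerforge train can be benchmarked directly.

Automatic LoRA/MoE checkpoint handling: if a checkpoint contains LoRA or MoE keys, they are handled automatically. Loading by strategy: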

Strategy	Saved structure	Bench loading	Inference model
dense_lora	base + LoRA (per-Linear)	`base + (B @ A) * (α/r)`	dense model
mixture_lora	base (1) + router + N LoRA experts	Rebuild MixtureLoRA structure → merge attention LoRA	MixtureLoRA model (structure preserved)
moe_expert_lora	N expert FFNs + router	Rebuild MoE structure → merge LoRA → load	MoE model (structure preserved)
moe_expert_lora + handoff	N expert FFNs (no LoRA) + router	Rebuild MoE structure → load directly	MoE model (structure preserved)

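moe_expert_lora: the injection settings (num_experts, top_k, etc.) are read from resolved_config.json and the MoE architecture is rebuilt. Experts are not averaged; inference runs on the N experts + router structure as saved.

mixture_lora: the injection settings are read from resolved_config.json and the MixtureLoRA structure is rebuilt with build_mixture_lora_for_ffn_layers(). The router + N LoRA experts structure is preserved, which keeps routing diversity. If resolved_config.json is missing, the fallback averages the experts and converts the result to a dense model.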
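
For the dense_lora case, the merge formula `base + (B @ A) * (α/r)` can be worked through on a tiny example. The sketch below uses plain Python lists so it runs without torch; `matmul` and `merge_lora` are illustrative names, not EulerForge functions, and the shapes follow the usual LoRA convention (B is out × r, A is r × in), which is assumed here.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def merge_lora(base, lora_a, lora_b, alpha, r):
    """Return base + (B @ A) * (alpha / r).

    Shapes (assumed): base is (out, in), lora_b is (out, r), lora_a is (r, in).
    """
    scale = alpha / r
    delta = matmul(lora_b, lora_a)  # low-rank update, same shape as base
    return [
        [w + d * scale for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(base, delta)
    ]


# Rank-1 example: a 2x2 identity base with alpha=2, r=1.
merged = merge_lora(
    base=[[1.0, 0.0], [0.0, 1.0]],
    lora_a=[[1.0, 2.0]],
    lora_b=[[1.0], [0.5]],
    alpha=2,
    r=1,
)
# merged == [[3.0, 4.0], [1.0, 3.0]]
```

After merging, the adapter matrices are no longer needed at inference time, which is why the table lists a plain dense model for this strategy.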

Automatic handling of quantized checkpoints: checkpoints quantized via model.load_precision are dequantized automatically by _dequantize_bnb_state_dict() before LoRA merging:
- int4/nf4: packed weight (N, 1) → dequantize_4bit() → full precision. Packed data that was cast to bf16 during handoff is also detected automatically by comparing numel.
- int8: the .SCB companion key (row-wise scale factor) is detected → int8 * (SCB / 127) → bfloat16. The .SCB and .weight_format companion keys are removed automatically.
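
The int8 path can be illustrated with plain lists. This sketch only mirrors the arithmetic stated above (`int8 * (SCB / 127)`, then dropping the companion keys); it is not the body of `_dequantize_bnb_state_dict()`, the function names are hypothetical, and the final cast to bfloat16 is left out.

```python
def dequantize_int8_rows(q_rows, scb):
    """Row-wise dequantization: w[i][j] = q[i][j] * (scb[i] / 127)."""
    return [[q * (scale / 127.0) for q in row] for row, scale in zip(q_rows, scb)]


def dequantize_state_dict(state):
    """Dequantize every '<name>.weight' that has a '<name>.SCB' companion key."""
    out = {}
    for key, value in state.items():
        if key.endswith((".SCB", ".weight_format")):
            continue  # companion keys are not carried over
        if key.endswith(".weight"):
            scb_key = key[: -len(".weight")] + ".SCB"
            if scb_key in state:
                value = dequantize_int8_rows(value, state[scb_key])
        out[key] = value
    return out
```

Weights without an `.SCB` companion pass through unchanged, so the same loop can be run over a checkpoint that mixes quantized and full-precision tensors.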

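Automatic key-prefix normalization: for some models such as Qwen3.5, the checkpoint keys (model.language_model.*) differ from the keys that from_config() expects (model.*). The prefix is remapped automatically after merging.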
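
The prefix remapping amounts to a key rename over the state dict. A minimal sketch follows; `normalize_prefix` is a hypothetical helper, and the two prefixes are the ones named in the paragraph above.

```python
def normalize_prefix(state, old="model.language_model.", new="model."):
    """Rename keys that start with `old` so that they start with `new` instead."""
    return {
        (new + key[len(old):] if key.startswith(old) else key): value
        for key, value in state.items()
    }
```

Keys that do not carry the old prefix, such as `lm_head.weight`, are returned unchanged.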

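Weight tying (expected behavior): models with tie_word_embeddings=True, such as Qwen3, do not store lm_head.weight in the checkpoint (it is shared with embed_tokens.weight). tie_weights() is called automatically at load time to restore it, so no "missing keys" warning is shown.

Automatic embedding resize: when the tokenizer has extra special tokens (<|im_start|>, <think>, etc.) so that len(tokenizer) > config.vocab_size, resize_token_embeddings() is called automatically. Without this step, token IDs produced by apply_chat_template fall outside the embedding range and raise CUDA error: device-side assert triggered.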

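Automatic float16 → bfloat16 conversion: dtype: "float16" is converted to bfloat16 automatically. float16 has a limited range with a maximum of about 65504. On linear-attention (Mamba) architectures such as Qwen3.5, the internal state can exceed that range during autoregressive generation, which produces NaN → CUDA error: device-side assert triggered. bfloat16 uses the same 16 bits of memory, but its range extends to about 3.4e38, which avoids this overflow.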
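
The range difference can be checked with the standard library alone. The sketch below is illustrative and independent of EulerForge: `fits_float16` relies on the half-precision format code of `struct`, and `to_bfloat16` emulates bfloat16 by keeping the top 16 bits of the float32 encoding (plain truncation, whereas real conversions usually round to nearest).

```python
import struct


def fits_float16(x):
    """True if x can be stored as an IEEE 754 half-precision value."""
    try:
        struct.pack("<e", x)  # "e" is the half-precision format code
        return True
    except OverflowError:
        return False


def to_bfloat16(x):
    """Round-trip x through bfloat16 by truncating its float32 bit pattern."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]
```

With these helpers, `fits_float16(65504.0)` is true and `fits_float16(1e5)` is false, while `to_bfloat16(1e5)` returns 99840.0: coarser than float16 would be inside its range, but still finite.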

Option A: point at the training output directory

bench:
  models:
    target:
      output_dir: "outputs/run_20260301_120000"   # training output root
      checkpoint: "final"                          # final | latest | best
      device: "auto"                               # default: auto
      dtype: "auto"                                # default: auto

How the checkpoint is resolved:

`checkpoint`	Search path
`final` (default)	`{output_dir}/final/`
`latest`	`{output_dir}/checkpoint-latest/` → most recent `checkpoint-N/`
`best`	`{output_dir}/checkpoint-best/`
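
The lookup in the table above can be sketched with pathlib. This is an approximation under stated assumptions, not the tool's code: `resolve_checkpoint` is a hypothetical name, and "most recent `checkpoint-N/`" is interpreted here as the directory with the largest numeric N.

```python
from pathlib import Path


def resolve_checkpoint(output_dir, checkpoint="final"):
    """Return the checkpoint directory for `checkpoint`, following the table above."""
    root = Path(output_dir)
    if checkpoint == "final":
        candidates = [root / "final"]
    elif checkpoint == "best":
        candidates = [root / "checkpoint-best"]
    elif checkpoint == "latest":
        numbered = [
            p for p in root.glob("checkpoint-*")
            if p.is_dir() and p.name.split("-", 1)[1].isdigit()
        ]
        # Assumption: "most recent" means the highest step number N.
        numbered.sort(key=lambda p: int(p.name.split("-", 1)[1]))
        candidates = [root / "checkpoint-latest"] + numbered[-1:]
    else:
        raise ValueError(f"unknown checkpoint type: {checkpoint!r}")
    for path in candidates:
        if path.is_dir():
            return path
    raise FileNotFoundError(f"No checkpoint found in {root} (checkpoint={checkpoint})")
```

Sorting by the integer value rather than by name matters once step counts differ in length, since `checkpoint-10` sorts before `checkpoint-2` as a string.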

Option B: point at the model directory directly

bench:
  models:
    target:
      model_dir: "outputs/run_20260301_120000/final"
      device: "cuda:0"
      dtype: "float16"

Overriding with CLI flags

# Benchmark right after training finishes
eulerforge bench --preset configs/bench/sft_local.yml \
  --target-output-dir outputs/run_20260301_120000

# Use the latest checkpoint (mid-training evaluation)
eulerforge bench --preset configs/bench/sft_local.yml \
  --target-output-dir outputs/run_20260301_120000 --checkpoint latest

# Specify the path directly
eulerforge bench --preset configs/bench/sft_local.yml \
  --target-model-dir outputs/run_20260301_120000/final

# Inspect the data with --dry-run
eulerforge bench --preset configs/bench/sft_local.yml \
  --target-output-dir outputs/run_20260301_120000 --dry-run

# Choose the GPU (load target on GPU 1 when the judge uses GPU 0)
eulerforge bench --preset configs/bench/sft_local.yml \
  --target-output-dir outputs/run_20260301_120000 --target-device cuda:1

Error messages (3-line format)

When no checkpoint is found:

Bench Config: No checkpoint found in outputs/run_xxx (checkpoint=final): 'final/' directory missing
Fix: Run eulerforge train first, or use --checkpoint latest/best, or --target-model-dir
See: docs/tutorials/11_bench.md

Mutual-exclusion error:

Bench Config: bench.models.target: exactly one of 'model', 'output_dir', 'model_dir' is allowed — found multiple
Fix: Remove all but one: model (API), output_dir (run dir), or model_dir (HF dir)
See: docs/tutorials/11_bench.md
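
The mutual-exclusion rule behind this error can be expressed in a few lines. The sketch below is hypothetical (`check_target_source` is not an EulerForge function) and only mirrors the rule that exactly one of `model`, `output_dir`, `model_dir` may be set on `bench.models.target`; the message wording is shortened.

```python
SOURCES = ("model", "output_dir", "model_dir")


def check_target_source(target):
    """Return the single source key set on `target`, or raise ValueError."""
    found = [key for key in SOURCES if target.get(key) is not None]
    if len(found) != 1:
        detail = "found multiple" if found else "found none"
        raise ValueError(
            "bench.models.target: exactly one of 'model', 'output_dir', "
            f"'model_dir' is allowed ({detail})"
        )
    return found[0]
```

Returning the key that was found lets the caller branch on the source type (API model, run directory, or HF directory) without re-inspecting the mapping.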

External APIs (OpenAI / Gemini)

OpenAI or Gemini can be used as the judge model:

bench:
  models:
    judge:
      enabled: true
      provider: openai           # ollama | openai | gemini
      model: "gpt-4o"
      base_url: "https://api.openai.com/v1"
      api_key_env: "OPENAI_API_KEY"   # read the API key from this environment variable
      mode: pointwise

export OPENAI_API_KEY="sk-..."
eulerforge bench --preset configs/bench/sft_with_openai_judge.yml

Output

Terminal output

============================================================
[Sample 0]
Prompt: 녹화 장비를 선택할 때 고려해야 할 중요한 요소는...

Target (qwen3:0.6b):
  녹화 장비를 선택할 때 고려해야 할 중요한 요소는...

Judge (pointwise): Target=7/10  Baseline=8/10
  [Target]   The response covers key factors clearly...
  [Baseline] The baseline provides a more structured answer...
============================================================
[Bench Summary]
  Task: sft
  Samples: 10
  Target: qwen3:0.6b
  Baseline: qwen3.5:0.8b
  Pointwise Target:   avg=6.4 min=4 max=9
  Pointwise Baseline: avg=7.8 min=6 max=9
============================================================

Saved files

outputs/bench/
├── bench_results.jsonl           # per-sample results (prompt, responses, scores)
├── bench_summary.json            # summary statistics
└── bench_resolved_config.json    # snapshot of the settings used

Bundled example YAML files

File	Description
`configs/bench/sft_target_only.yml`	SFT, target only
`configs/bench/sft_with_judge.yml`	SFT + pointwise judge
`configs/bench/preference_pairwise.yml`	Preference + pairwise judge

CLI options

Option	Description
`--preset PATH`	Bench YAML spec file (required)
`--set KEY=VALUE`	Settings override (repeatable)
`--output-dir DIR`	Output directory for results
`--validate-only`	Validate the configuration only
`--dry-run`	Sample extraction only (no model calls)
`--target-output-dir PATH`	Local target model: training output root directory
`--checkpoint TYPE`	Checkpoint type: `final` (default) \| `latest` \| `best`
`--target-model-dir PATH`	Local target model: HF save_pretrained directory, specified directly
`--target-device DEVICE`	Device override for the local target model (e.g. `cuda:0`, `cuda:1`, `cpu`). Default: auto

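Sequential model loading (OOM prevention)

Bench uses up to three models (target, baseline, judge). To avoid running out of GPU memory (OOM), it loads one model at a time, processes the entire dataset with it, and then unloads it.

Phase 1: load target → run inference on all samples → unload
Phase 2: load baseline → run inference on all samples → unload
Phase 3: load judge → evaluate all samples → unload
Phase 4: merge results → same output as before

- At no point are two or more models in memory at the same time.
- API clients (Ollama/OpenAI/Gemini) hold no GPU resources in the bench process, so unload is a no-op.
- Local HF models (LocalHFClient) delete the model and tokenizer on unload and call torch.cuda.empty_cache().
- Result output (JSONL, terminal, summary) is exactly the same as before.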
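
The phase structure can be sketched with a stand-in client. Everything here is illustrative: `FakeClient` and `run_phases` are hypothetical, the real LocalHFClient interface is not shown on this page, and the judge phase (which consumes the earlier responses) is simplified to the same load → process → unload loop.

```python
class FakeClient:
    """Stand-in for a model client; counts how many are loaded at once."""

    loaded = 0
    peak = 0

    def __init__(self, name):
        self.name = name

    def load(self):
        FakeClient.loaded += 1
        FakeClient.peak = max(FakeClient.peak, FakeClient.loaded)

    def generate(self, prompt):
        return f"{self.name}: {prompt}"

    def unload(self):
        FakeClient.loaded -= 1


def run_phases(clients, prompts):
    """Load one client at a time, process every prompt, then unload it."""
    results = {}
    for client in clients:
        client.load()
        try:
            results[client.name] = [client.generate(p) for p in prompts]
        finally:
            client.unload()  # always release before the next model loads
    return results
```

The `try`/`finally` ensures a failed generation still releases the model, so a later phase does not start with the previous model resident in memory.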

Details: cli.md

Ollama thinking-model support

Thinking models such as Qwen 3.5 respond through Ollama with an empty content field and return their reasoning in the reasoning field.

When the Ollama provider is used, EulerForge bench calls the native API (/api/chat) directly and sends think: false to disable thinking. With this, thinking models also return their answer in content as expected.

- Ollama provider: uses the native /api/chat API (supports disabling thinking)
- OpenAI/Gemini provider: uses the OpenAI-compatible /chat/completions API

Even if base_url in the config ends with /v1, the Ollama provider strips /v1 automatically and calls the native API.
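
The URL handling and request body can be sketched without making a network call. `build_ollama_chat_request` is a hypothetical helper, not EulerForge's client; the `model`, `messages`, `think` and `stream` fields follow Ollama's native chat API, and the example host and port are assumptions.

```python
def build_ollama_chat_request(base_url, model, prompt):
    """Return (url, payload) for Ollama's native chat endpoint."""
    root = base_url.rstrip("/")
    if root.endswith("/v1"):
        root = root[: -len("/v1")]  # drop the OpenAI-compatible suffix
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "think": False,   # ask thinking models to skip the reasoning phase
        "stream": False,  # return a single JSON object instead of a stream
    }
    return root + "/api/chat", payload
```

For a base_url of `http://localhost:11434/v1` this yields `http://localhost:11434/api/chat`, so the same config value works for both the OpenAI-compatible and the native endpoint.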
