LLM Training Data Pipeline Platform
A manifest-driven data pipeline that transforms raw data into high-quality LLM training data. Perform reproducible data processing with declarative YAML definitions.
Open Source
Three core pillars of EulerWeave: Data Sources, Processing Blocks, Production Outputs
Choose the track that matches your use case
| Track | Purpose | Description |
|---|---|---|
| `pretrain` | Pre-training | Normalize and refine web crawl data |
| `sft` | Supervised Fine-Tuning | Convert PDFs/documents into QnA training data |
| `dpo` | Preference Learning | Prepare comparison data in DPO format |

Create, validate, and run data pipelines with a single command
EulerWeave CLI command list
| Command | Description |
|---|---|
| `eulerweave new` | Create a new manifest YAML |
| `eulerweave validate` | Validate a manifest |
| `eulerweave plan` | Preview the execution plan and estimated cost |
| `eulerweave run` | Run the pipeline |
| `eulerweave export` | Export results to various formats |
| `eulerweave plugins list` | List installed plugins |
| `eulerweave plugins doctor` | Diagnose plugins |

17+ data processing blocks included in EulerWeave
| Block | Purpose |
|---|---|
| `normalize_text` | Whitespace cleanup, encoding normalization |
| `heuristic_filter` | Length and quality-based filtering |

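As a rough illustration of what whitespace cleanup and encoding normalization can involve, here is a minimal Python sketch. The function names, the choice of NFC, and the length thresholds are assumptions made for this example; they are not taken from EulerWeave's `normalize_text` or `heuristic_filter` blocks.

```python
import re
import unicodedata


def normalize_text_sketch(text: str) -> str:
    """Illustrative cleanup only; not EulerWeave's implementation."""
    # NFC composes characters so visually identical strings compare equal.
    text = unicodedata.normalize("NFC", text)
    # Unify Windows and old Mac line endings.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Trim leading and trailing whitespace on every line.
    text = "\n".join(line.strip() for line in text.split("\n"))
    # Allow at most one blank line between paragraphs.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def passes_length_filter(text: str, min_chars: int = 20, max_chars: int = 100_000) -> bool:
    """Hypothetical length heuristic: keep documents within a character range."""
    return min_chars <= len(text) <= max_chars
```

A record would typically be normalized first and filtered second, so that the length check sees the cleaned text.
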
| Block | Purpose |
|---|---|
| `dedup_minhash` | MinHash-based approximate deduplication |
| `dedup_exact` | SHA-256 exact deduplication |

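The two deduplication strategies named above can be sketched in plain Python. This is a simplified illustration using only the standard library: the function names, the `text` field, the word-shingle size, and the number of hash functions are all assumptions for the example, and EulerWeave's own blocks may work differently.

```python
import hashlib
from typing import Iterable, Iterator


def dedup_exact_sketch(records: Iterable[dict], field: str = "text") -> Iterator[dict]:
    """Yield only the first record for each distinct SHA-256 digest of `field`."""
    seen: set[str] = set()
    for record in records:
        digest = hashlib.sha256(record[field].encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield record


def word_shingles(text: str, k: int = 3) -> set[str]:
    """Overlapping k-word windows; short texts yield a single shingle."""
    tokens = text.split()
    return {" ".join(tokens[i:i + k]) for i in range(max(1, len(tokens) - k + 1))}


def minhash_signature(text: str, num_hashes: int = 64, k: int = 3) -> list[int]:
    """One minimum per salted hash function over the text's shingles."""
    shingle_set = word_shingles(text, k)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "big")
        signature.append(min(
            int.from_bytes(hashlib.sha256(salt + s.encode("utf-8")).digest()[:8], "big")
            for s in shingle_set
        ))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching positions approximates Jaccard similarity of the shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Exact hashing catches byte-identical copies cheaply, while signature comparison catches near-duplicates that differ by a few words; a production pipeline would usually add locality-sensitive hashing to avoid comparing every pair.
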
| Block | Purpose |
|---|---|
| `build_sft_messages` | Generate SFT format via field mapping |
| `build_sft_qna` | LLM-based multi-turn QnA generation |
| `build_langextract_qna` | LangExtract-style QnA generation |

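To make "SFT format via field mapping" concrete, the sketch below maps named source fields onto the common chat-style `messages` layout (`role` and `content` pairs). The mapping shape, the function name, and the field names are hypothetical; the actual configuration of `build_sft_messages` is not shown in this document.

```python
def build_messages_sketch(record: dict, mapping: dict[str, str]) -> dict:
    """Map source fields to chat roles.

    `mapping` is an assumed shape: role name -> source field name.
    Roles whose source field is missing or empty are skipped.
    """
    messages = []
    for role in ("system", "user", "assistant"):
        source_field = mapping.get(role)
        if source_field and record.get(source_field):
            messages.append({"role": role, "content": record[source_field]})
    return {"messages": messages}


# Example: a raw QnA row with hypothetical field names.
row = {"question": "What is MinHash?", "answer": "A similarity sketching technique."}
sft_record = build_messages_sketch(row, {"user": "question", "assistant": "answer"})
```

Keeping the mapping as data rather than code is what lets a declarative manifest describe the conversion without custom scripts.
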
| Block | Purpose |
|---|---|
| `metrics_text_basic` | Length distribution, character set statistics |
| `metrics_text_repetition` | n-gram duplication detection |
| `metrics_text_gibberish` | Gibberish and encoding anomaly detection |
| `metrics_text_boilerplate` | Web boilerplate detection |
| `metrics_perplexity` | Transformers-based text quality scoring |
| `metrics_pii_detect` | Email, phone, SSN, credit card detection |
| `metrics_token_stats` | Tokenization statistics |
| `metrics_record_schema_validate` | Data integrity validation |

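One simple way to quantify n-gram duplication is the share of n-grams that occur more than once in a document. The sketch below assumes whitespace tokenization and word trigrams; both choices, and the function name, are illustrative and not drawn from `metrics_text_repetition`.

```python
from collections import Counter


def duplicate_ngram_fraction(text: str, n: int = 3) -> float:
    """Fraction of n-gram occurrences that belong to a repeated n-gram."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        # Text shorter than n tokens has nothing to repeat.
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(count for count in counts.values() if count > 1)
    return repeated / len(ngrams)
```

A score near 0 suggests varied text, while a score near 1 suggests looping or templated content that a downstream filter might drop.
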
| Block | Purpose |
|---|---|
| `filter_pii_redact` | PII detection and masking |
| `export_jsonl` | JSONL output |
| `export_parquet` | Parquet output |
| `export_mds` | MDS streaming format |

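PII detection and masking is often approached with pattern matching as a first pass. The sketch below covers only email addresses and US-style SSNs with deliberately simple regular expressions; the patterns, placeholder tokens, and function name are assumptions for illustration, and they would miss many real-world formats that a dedicated block such as `filter_pii_redact` presumably handles.

```python
import re

# Deliberately simple, illustrative patterns; not exhaustive.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact_pii_sketch(text: str) -> tuple[str, dict[str, int]]:
    """Replace each match with a [LABEL] placeholder and count replacements."""
    counts: dict[str, int] = {}
    for label, pattern in PII_PATTERNS.items():
        text, replaced = pattern.subn(f"[{label}]", text)
        counts[label] = replaced
    return text, counts
```

Returning the counts alongside the masked text lets the same pass feed both a redaction step and a detection metric.
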
Learn EulerWeave quickly with step-by-step guides
Tutorials coming soon.
A complete pipeline manifest that generates SFT training data from PDFs
Install EulerWeave and run your first pipeline
Python 3.11+
Open source, declarative YAML definitions, reproducible data processing.