EulerWeave – Data Pipeline Platform

Key Features

Three core pillars of EulerWeave: Data Sources, Processing Blocks, Production Outputs

Diverse Data Sources

Local: JSONL, CSV, Parquet, TXT, HTML, PDF
Remote: HuggingFace Datasets, HuggingFace Hub, HTTPS, AWS S3
Extensible plugin system

17+ Data Processing Blocks

Normalization & Filtering: normalize_text, heuristic_filter
Deduplication: MinHash, SHA-256
SFT Task Build: LLM-based QnA generation
13+ Metric Blocks: Perplexity, PII, Repetition, Gibberish, etc.
PII detection and masking

Production-Ready Outputs

JSONL: OpenAI Chat format compatible
Parquet: For large-scale analytics
MDS/StreamingDataset: Optimized for distributed training
Compatible: Ollama, vLLM, TRL, HuggingFace Transformers

Pipeline Tracks

Choose the track that matches your use case

Track	Purpose	Description
`pretrain`	Pre-training	Normalize and refine web crawl data
`sft`	Supervised Fine-Tuning	Convert PDFs/documents into QnA training data
`dpo`	Preference Learning	Prepare comparison data in DPO format
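
Preference data for DPO is commonly stored as prompt/chosen/rejected triples, which is the convention TRL's DPO trainer accepts. The exact schema the `dpo` track emits is not stated here, so the field names below are an assumption, and `is_valid_dpo_record` is a hypothetical helper for illustration.

```python
import json

# Assumed TRL-style schema; EulerWeave's actual field names may differ.
REQUIRED_FIELDS = ("prompt", "chosen", "rejected")

def is_valid_dpo_record(record: dict) -> bool:
    """Check for non-empty string fields and two distinct responses."""
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            return False
    # A pair with identical responses carries no preference signal.
    return record["chosen"] != record["rejected"]

if __name__ == "__main__":
    sample = {
        "prompt": "Summarize the manual's safety section.",
        "chosen": "Wear gloves and disconnect power before servicing.",
        "rejected": "Safety is important.",
    }
    print(json.dumps(sample), is_valid_dpo_record(sample))
```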

CLI Quickstart

Create, validate, and run data pipelines, one command per step

# Create a new manifest
eulerweave new manifest.yaml --track sft

# Validate the manifest
eulerweave validate manifest.yaml

# Preview execution plan
eulerweave plan manifest.yaml --records 10000

# Run the pipeline
eulerweave run manifest.yaml --input data/train.jsonl --artifacts ./artifacts

# Export to MDS format
eulerweave export mds out/result.jsonl ./output/mds/ --shard-size 2000

# List plugins
eulerweave plugins list
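
The validate and run steps above can also be chained from a script. The following is a minimal sketch using only the subcommands and flags shown in this quickstart. It assumes, without confirmation from the source, that `eulerweave` exits with a non-zero status when validation fails. `build_commands` and `run_all` are hypothetical helper names.

```python
import subprocess

def build_commands(manifest: str, input_path: str, artifacts_dir: str) -> list[list[str]]:
    """Return the validate and run invocations as argument lists."""
    return [
        ["eulerweave", "validate", manifest],
        ["eulerweave", "run", manifest, "--input", input_path, "--artifacts", artifacts_dir],
    ]

def run_all(commands: list[list[str]]) -> None:
    """Run commands in order; check=True raises on the first non-zero exit.

    Assumption: a failed validation yields a non-zero exit status.
    """
    for command in commands:
        subprocess.run(command, check=True)

# Usage (requires eulerweave on PATH):
# run_all(build_commands("manifest.yaml", "data/train.jsonl", "./artifacts"))
```

Passing argument lists rather than a shell string avoids quoting problems with paths that contain spaces.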

CLI Reference

EulerWeave CLI command list

Command	Description
`eulerweave new`	Create a new manifest YAML
`eulerweave validate`	Validate manifest
`eulerweave plan`	Preview execution plan and estimated cost
`eulerweave run`	Run the pipeline
`eulerweave export`	Export results to various formats
`eulerweave plugins list`	List installed plugins
`eulerweave plugins doctor`	Diagnose plugins

Built-in Block List

17+ data processing blocks included in EulerWeave

Normalization & Filtering

Block	Purpose
`normalize_text`	Whitespace cleanup, encoding normalization
`heuristic_filter`	Length and quality-based filtering
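
As a rough illustration of what these two blocks do, the sketch below normalizes whitespace and Unicode form, then applies a length and character-quality filter. It is not EulerWeave's implementation. `min_length` mirrors the parameter used in the manifest example, while `min_alpha_ratio` is a hypothetical quality heuristic added for illustration.

```python
import re
import unicodedata

def normalize_text(text: str) -> str:
    """NFC-normalize, unify newlines, collapse spaces/tabs, and trim."""
    text = unicodedata.normalize("NFC", text)
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = re.sub(r"[ \t]+", " ", text)      # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)     # drop spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)   # at most one blank line in a row
    return text.strip()

def passes_heuristics(text: str, min_length: int = 100,
                      min_alpha_ratio: float = 0.5) -> bool:
    """Keep text that is long enough and mostly alphabetic."""
    if not text or len(text) < min_length:
        return False
    alpha = sum(ch.isalpha() for ch in text)
    return alpha / len(text) >= min_alpha_ratio

if __name__ == "__main__":
    raw = "  Hello \t world\r\n\r\n\r\n\r\nNext   paragraph  "
    print(repr(normalize_text(raw)))
```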

Deduplication

Block	Purpose
`dedup_minhash`	MinHash-based approximate deduplication
`dedup_exact`	SHA-256 exact deduplication
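
The two strategies differ in what counts as a duplicate: SHA-256 catches byte-identical text, while MinHash estimates the Jaccard similarity of shingle sets so near-duplicates can be caught too. The sketch below shows both ideas with the standard library only. It is not EulerWeave's implementation, and the shingle size, the number of hash functions, and the use of salted BLAKE2b as the hash family are all assumptions.

```python
import hashlib

def dedup_exact(texts: list[str]) -> list[str]:
    """Keep the first occurrence of each text, keyed by its SHA-256 digest."""
    seen: set[str] = set()
    kept = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept

def shingles(text: str, k: int = 5) -> set[str]:
    """Character k-grams of the lowercased text (whole text if shorter than k)."""
    text = text.lower()
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def minhash_signature(text: str, num_hashes: int = 64, k: int = 5) -> list[int]:
    """Minimum hash value per salted hash function over the shingle set."""
    grams = [g.encode("utf-8") for g in shingles(text, k)]
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")  # each salt acts as a distinct hash function
        signature.append(min(
            int.from_bytes(hashlib.blake2b(g, digest_size=8, salt=salt).digest(), "big")
            for g in grams
        ))
    return signature

def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

if __name__ == "__main__":
    a = minhash_signature("The quick brown fox jumps over the lazy dog")
    b = minhash_signature("The quick brown fox jumps over the lazy cat")
    print(estimated_jaccard(a, b))
```

A near-duplicate filter would drop a record when its estimated similarity to an already kept record exceeds a chosen threshold. At scale this comparison is usually paired with locality-sensitive hashing rather than done pairwise.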

Task Build (SFT)

Block	Purpose
`build_sft_messages`	Generate SFT format via field mapping
`build_sft_qna`	LLM-based multi-turn QnA generation
`build_langextract_qna`	LangExtract-style QnA generation

Metrics

Block	Purpose
`metrics_text_basic`	Length distribution, character set statistics
`metrics_text_repetition`	n-gram duplication detection
`metrics_text_gibberish`	Gibberish and encoding anomaly detection
`metrics_text_boilerplate`	Web boilerplate detection
`metrics_perplexity`	Transformers-based text quality scoring
`metrics_pii_detect`	Email, phone, SSN, credit card detection
`metrics_token_stats`	Tokenization statistics
`metrics_record_schema_validate`	Data integrity validation
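
To make the repetition metric concrete, one simple way to score n-gram duplication is the fraction of word n-grams that repeat an earlier n-gram in the same text. This is a sketch of that idea, not the formula `metrics_text_repetition` actually uses. The function name and the default `n` are assumptions.

```python
from collections import Counter

def ngram_repetition_ratio(text: str, n: int = 3) -> float:
    """Fraction of word n-grams that duplicate an earlier n-gram (0.0 = none)."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0  # text shorter than n words has nothing to repeat
    counts = Counter(ngrams)
    duplicates = sum(count - 1 for count in counts.values())
    return duplicates / len(ngrams)

if __name__ == "__main__":
    print(ngram_repetition_ratio("a b c a b c"))  # 4 trigrams, 1 repeat -> 0.25
```

Scores close to 1.0 indicate looping or templated text, which a downstream filter can drop.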

PII & Export

Block	Purpose
`filter_pii_redact`	PII detection and masking
`export_jsonl`	JSONL output
`export_parquet`	Parquet output
`export_mds`	MDS streaming format
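
Detection and masking of PII can be sketched with regular expressions, as below. The patterns cover only email addresses, US-style SSNs, and ten-digit phone numbers, and they are deliberately simple. The patterns, the `[LABEL]` placeholder style, and the function names are assumptions for illustration, not the rules `metrics_pii_detect` or `filter_pii_redact` apply.

```python
import re

# Illustrative patterns only; real-world PII detection needs broader coverage.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def detect_pii(text: str) -> dict[str, list[str]]:
    """Return the matches found for each PII label, omitting empty labels."""
    found = {label: pattern.findall(text) for label, pattern in PII_PATTERNS.items()}
    return {label: hits for label, hits in found.items() if hits}

def redact_pii(text: str) -> str:
    """Replace each match with a bracketed label such as [EMAIL]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

if __name__ == "__main__":
    sample = "Mail jane.doe@example.com or call 555-123-4567, SSN 123-45-6789."
    print(detect_pii(sample))
    print(redact_pii(sample))
```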

Tutorials

Learn EulerWeave quickly with step-by-step guides

Tutorials coming soon.

Manifest Example

A complete pipeline manifest that generates SFT training data from a PDF

version: 1
track: sft
inputs:
  - type: pdf
    uri: data/technical_manual.pdf
    options:
      strategy: auto
pipeline:
  - id: normalize
    type: normalize_text
    slot: normalize
  - id: filter
    type: heuristic_filter
    slot: filter
    params:
      min_length: 100
  - id: dedup
    type: dedup_exact
    slot: dedup
  - id: qna
    type: build_sft_qna
    slot: build_task
    params:
      model: "qwen3:32b"
      base_url: "http://localhost:11434"
  - id: export
    type: export_jsonl
    slot: export
exports:
  - type: jsonl
    path: out/training_data.jsonl
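
Each pipeline step in the manifest carries an `id`, a `type`, and a `slot`. The sketch below runs a quick structural check over a manifest that has already been loaded into a Python dict. It is a hypothetical pre-check for illustration, not what `eulerweave validate` does, and `eulerweave validate` remains the authoritative check. The rule that step ids must be unique is an assumption.

```python
REQUIRED_STEP_KEYS = ("id", "type", "slot")

def check_pipeline(manifest: dict) -> list[str]:
    """Return human-readable problems found in the manifest's pipeline steps."""
    problems = []
    seen_ids = set()
    for index, step in enumerate(manifest.get("pipeline", [])):
        for key in REQUIRED_STEP_KEYS:
            if key not in step:
                problems.append(f"step {index}: missing '{key}'")
        step_id = step.get("id")
        if step_id in seen_ids:
            # Assumed rule: ids identify steps, so they should not repeat.
            problems.append(f"step {index}: duplicate id '{step_id}'")
        elif step_id is not None:
            seen_ids.add(step_id)
    return problems

if __name__ == "__main__":
    manifest = {
        "version": 1,
        "track": "sft",
        "pipeline": [
            {"id": "normalize", "type": "normalize_text", "slot": "normalize"},
            {"id": "dedup", "type": "dedup_exact", "slot": "dedup"},
        ],
    }
    print(check_pipeline(manifest))
```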

Installation & Getting Started

Install EulerWeave and run your first pipeline

Installation

pip install eulerweave

# Full feature installation
pip install "eulerweave[pdf,llm,parquet]"

Requirements

Python 3.11+

GitHub

eulerwa/eulerweave

Start Your Data Pipeline with EulerWeave

Open source, declarative YAML definitions, reproducible data processing.
