Granary Dataset Creation Pipeline

Overview

This configuration drives the Granary pseudo-labelling pipeline – an open-source workflow that transforms large, noisy speech corpora into high-quality Automatic Speech Recognition (ASR) and Automatic Speech Translation (AST) training data for 25 European languages.

The first public release of Granary (≈ 643 k h ASR / ≈ 351 k h AST) was built from three openly available corpora and is published as nvidia/Granary.

Note — Per-language runs

The pipeline is executed once per language pair. For each run, set:

- source_lang / source_lang_full – audio and transcript language
- translation.target_lang / target_lang_full – translation language

For example, to obtain English audio with Italian translations choose source_lang: en and translation.target_lang: it. Separate runs are required for each additional language combination.
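Because every language pair needs its own run, a small driver script can generate one command per pair. The sketch below only builds the command lines and does not launch anything. It assumes that the language keys sit under the params block (see Tunable parameters) and can be overridden with the same key=value form that Quick start uses for input_manifest_file. The helper name build_command, the example pairs and the per-pair output directories are hypothetical.

```python
import shlex


def build_command(sdp_dir, manifest, output_dir, source_lang, target_lang):
    """Return the argv list for one Granary run (one language pair)."""
    return [
        "python", f"{sdp_dir}/main.py",
        "--config-path", f"{sdp_dir}/dataset_configs/multilingual/granary/",
        "--config-name", "config.yaml",
        f"input_manifest_file={manifest}",
        f"output_dir={output_dir}",
        f"sdp_dir={sdp_dir}",
        # Assumption: the language keys are overridable under the params block.
        f"params.source_lang={source_lang}",
        f"params.translation.target_lang={target_lang}",
    ]


# Illustrative language pairs; replace with the combinations you need.
pairs = [("en", "it"), ("en", "de")]

commands = [
    build_command(
        "/path/to/NeMo-speech-data-processor",
        "/path/to/input_manifest.json",
        f"/path/to/output/{src}-{tgt}",  # one output directory per pair
        src,
        tgt,
    )
    for src, tgt in pairs
]

for cmd in commands:
    print(shlex.join(cmd))
```

Each pair gets its own output directory here so that two runs sharing a source language do not write into the same ${output_dir}/${source_lang}/ folder. The source_lang_full and target_lang_full keys would need matching overrides in a real run.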

Note — GPU required

All Whisper, vLLM and Comet-QE stages expect at least one CUDA-capable GPU. Multi-GPU nodes are auto-detected when num_devices: -1 (default) is used.

Software prerequisites

Install NeMo-speech-data-processor plus the extra wheels required by specific processors:

FasterWhisperInference

pip install pytorch-lightning \
            "nvidia-cublas-cu12" \
            "nvidia-cudnn-cu12==9.*" \
            faster_whisper

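# Point the dynamic linker at the pip-installed cuBLAS and cuDNN libraries.
# The heredoc terminator (PY) must sit alone on its line.
export LD_LIBRARY_PATH=$(python - <<'PY'
import os, nvidia.cublas.lib, nvidia.cudnn.lib
print(os.path.dirname(nvidia.cublas.lib.__file__) + ":" + os.path.dirname(nvidia.cudnn.lib.__file__))
PY
)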

vLLMInference

pip install "optree>=0.13.0" vllm

CometoidWMTQualityEstimation

pip install pymarian

FastTextLangIdClassifier

pip install fasttext

ConvertToTarredAudioDataset (optional, only if tar-sharding is enabled)

pip install lhotse "nemo-toolkit[common]==2.2.1"

Quick start

1. Hardware – Linux box with NVIDIA GPU(s) and ≥ 16 GB VRAM (reference runs used A100-80 GB; smaller cards work with reduced batch sizes).
2. Install NeMo-speech-data-processor and the extras listed above.
3. Prepare the input manifest and set three mandatory YAML keys:
   - input_manifest_file – manifest with raw audio paths
   - output_dir – working/output directory
   - sdp_dir – root of the SDP tree (for prompt/regex assets)
4. Run the pipeline:

# Path to your local clone of NeMo-speech-data-processor
SDP_DIR=/path/to/NeMo-speech-data-processor

python ${SDP_DIR}/main.py \
    --config-path ${SDP_DIR}/dataset_configs/multilingual/granary/ \
    --config-name config.yaml \
    input_manifest_file=/path/to/input_manifest.json \
    output_dir=/path/to/output/dir \
    sdp_dir=${SDP_DIR}

Input and output formats

Input manifest

Each line is a JSON object with the source-audio path:

{"source_audio_filepath": "/path/to/file.flac"}
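A manifest in this format can be generated by walking a directory of audio files. The following is a minimal sketch, not part of SDP: the function name write_input_manifest and the set of file extensions are assumptions chosen for illustration.

```python
import json
from pathlib import Path

# Illustrative set of extensions; adjust to the formats in your corpus.
AUDIO_EXTENSIONS = {".flac", ".wav", ".mp3", ".opus"}


def write_input_manifest(audio_root, manifest_path):
    """Write one JSON object per audio file found under audio_root.

    Returns the number of entries written.
    """
    audio_files = sorted(
        p for p in Path(audio_root).rglob("*")
        if p.is_file() and p.suffix.lower() in AUDIO_EXTENSIONS
    )
    with open(manifest_path, "w", encoding="utf-8") as out:
        for path in audio_files:
            # The only key the pipeline input requires is source_audio_filepath.
            entry = {"source_audio_filepath": str(path.resolve())}
            out.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return len(audio_files)
```

Calling write_input_manifest("/path/to/audio", "/path/to/input_manifest.json") would produce a file that can be passed as input_manifest_file.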

Key outputs

${output_dir}/${source_lang}/manifest_46.json – final bilingual manifest containing audio_filepath, offset, duration, text (source) and answer (translation), plus constant decoder flags.
${output_dir}/${source_lang}/tarred_dataset/ – optional tarred-audio shards and shard_manifest.json when convert_to_audio_tarred_dataset.should_run: True.
All intermediate manifest_XX.json files are kept for audit/debug.
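To sanity-check a finished run, the final manifest can be summarised with a few lines of Python. This sketch assumes the output manifest uses the same one-JSON-object-per-line layout as the input manifest and that duration is given in seconds; summarize_manifest is a hypothetical helper, not an SDP function.

```python
import json


def summarize_manifest(manifest_path):
    """Return (number of segments, total audio hours) for an output manifest."""
    segments = 0
    total_seconds = 0.0
    with open(manifest_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            entry = json.loads(line)
            segments += 1
            # Assumption: duration is stored in seconds.
            total_seconds += float(entry["duration"])
    return segments, total_seconds / 3600.0
```

The same loop can be extended to print entry["text"] next to entry["answer"] when spot-checking source and translation pairs.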

Pipeline stages

The processors executed (indices match the config):

FfmpegConvert (0) – re-encode audio to 16 kHz/mono FLAC.
GetAudioDuration (1) – compute clip length.
RemoveFiles (2) – optionally delete originals (params.save_disk_space).
FasterWhisperInference (3) – pass 1 language detection.
LambdaExpression (4) – probability-based LID filtering.
DropSpecifiedFields (5) – remove temporary fields.
FasterWhisperInference (6, 14) – two-pass transcription (second run can slice by offset).
Segmentation & grooming (7–13) – split Whisper segments into atomic utterances.
Hallucination detection (18–20) – drop repeated n-grams, garbage tokens and common filler phrases.
PnC restoration (21–23) – Qwen-2.5-7B restores punctuation & capitalisation; optional regex clean-up.
Length & charset filtering (27–36) – word-ratio, character histogram and FastText checks.
Quality estimation (41–43) – keep pairs with Comet-QE score ≥ min_qe_score.
Constant flags (44) – add decoder directives (<|emo:undefined|>, itn, pnc, etc.).
Tarred dataset (46) – shard audio into num_shards tar files (optional).
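As an illustration of the idea behind the repeated n-gram check in the hallucination-detection stages, the sketch below flags text in which a large share of word n-grams repeat. It is not the SDP processor: the function names, the n-gram size and the 0.5 threshold are assumptions chosen for the example.

```python
from collections import Counter


def repeated_ngram_ratio(text, n=3):
    """Fraction of word n-grams that repeat an n-gram seen earlier in the text."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence beyond the first counts as a repeat.
    repeats = sum(count - 1 for count in counts.values())
    return repeats / len(ngrams)


def looks_hallucinated(text, n=3, max_ratio=0.5):
    """Flag text whose repeated n-gram ratio exceeds max_ratio (illustrative threshold)."""
    return repeated_ngram_ratio(text, n) > max_ratio
```

A looping transcript such as "thank you for watching" repeated several times scores high, while an ordinary sentence with no repeated trigrams scores 0.0.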

Tunable parameters

All knobs live under the params block.

Language
- source_lang / source_lang_full
- translation.target_lang / target_lang_full
Audio duration
- min_audio_duration – drop very short clips (seconds)
- max_audio_duration – drop very long clips (seconds)
Language-ID & text filtering
- min_audio_lid_probability – Whisper LID threshold
- translation.min_hist_token_ratio – charset-purity ratio
- translation.min_text_lid_probability – FastText LID threshold
Length & quality
- translation.max_len_diff_ratio – max(src / tgt) word ratio
- translation.min_qe_score – Comet-QE acceptance score
Tarred dataset
- convert_to_audio_tarred_dataset.should_run (bool)
- num_shards and buckets_num – shard layout
Misc.
- use_regex – regex preset for text normalisation
- save_disk_space – delete originals after conversion
- use_dask – enable distributed execution (not recommended)
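When several knobs are changed at once, it can be convenient to keep them in a nested dictionary and flatten it into dotted key=value overrides for the command line. The sketch below assumes that every key listed above is addressable under the params prefix; to_overrides is a hypothetical helper, and the numeric values are placeholders, not the configuration defaults.

```python
def to_overrides(tree, prefix="params"):
    """Flatten a nested dict into dotted key=value override strings."""
    overrides = []
    for key, value in tree.items():
        path = f"{prefix}.{key}"
        if isinstance(value, dict):
            # Recurse into nested blocks such as translation.*
            overrides.extend(to_overrides(value, path))
        else:
            overrides.append(f"{path}={value}")
    return overrides


# Placeholder values for illustration only; they are not the defaults.
example = {
    "min_audio_duration": 0.5,
    "max_audio_duration": 40.0,
    "translation": {"min_qe_score": 0.75},
}

for override in to_overrides(example):
    print(override)
```

The resulting strings can be appended to the python main.py command shown in Quick start.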

Advanced usage

Selective execution – override processors_to_run with a range of indices, e.g. "0:25".
Model swapping – every inference processor exposes either model_size_or_path (Whisper) or an embedded model: block (vLLM).
Resource tuning – num_devices = -1 uses all visible GPUs; set an integer to pin workers per stage.
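To preview which stages a processors_to_run value would cover, the range can be applied to a list of stage indices. This sketch assumes the range follows Python slice semantics (start inclusive, end exclusive), which should be verified against your SDP version; select_processors is a hypothetical helper that handles only a single index or a start:end range.

```python
def select_processors(processors, spec):
    """Select entries from a list using an 'N' or 'start:end' spec.

    Assumption: ranges behave like Python slices (end index excluded).
    """
    if ":" not in spec:
        return [processors[int(spec)]]
    start, _, end = spec.partition(":")
    return processors[slice(int(start) if start else None,
                            int(end) if end else None)]


# Stage indices 0..46, matching the numbering in the Pipeline stages list.
stage_indices = list(range(47))
selected = select_processors(stage_indices, "0:25")
print(selected[0], selected[-1], len(selected))
```

Under the slice assumption, "0:25" covers stages 0 through 24, so a later run would resume with "25:".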

References

- Koluguri et al. (2025). Granary: Speech Recognition and Translation Dataset in 25 European Languages (preprint). arXiv:2505.13404.
- nvidia/Granary dataset on Hugging Face.
- NeMo-SDP source code.
