This configuration drives the Granary pseudo-labelling pipeline – an open-source workflow that transforms large, noisy speech corpora into high-quality Automatic Speech Recognition (ASR) and Automatic Speech Translation (AST) training data for 25 European languages.
The first public release of Granary (≈ 643 k h of ASR and ≈ 351 k h of AST data) was built from three openly available corpora and is published as nvidia/Granary.
Note — Per-language runs
The pipeline is executed once per language pair. Set:

- `source_lang` / `source_lang_full` – audio & transcript language
- `translation.target_lang` / `translation.target_lang_full` – translation language

For example, to obtain English audio with Italian translations, choose `source_lang: en` and `translation.target_lang: it`. A separate run is required for each additional language combination.
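For the English-audio / Italian-translation example, the corresponding YAML fragment looks like this (the `*_full` values are illustrative):

```yaml
source_lang: en
source_lang_full: English
translation:
  target_lang: it
  target_lang_full: Italian
```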
Note — GPU required
All Whisper, vLLM and Comet-QE stages expect at least one CUDA-capable GPU. Multi-GPU nodes are auto-detected when `num_devices: -1` (the default) is used.
Install NeMo-speech-data-processor plus the extra wheels required by specific processors:
**FasterWhisperInference**

```shell
pip install pytorch-lightning \
    "nvidia-cublas-cu12" \
    "nvidia-cudnn-cu12==9.*" \
    faster_whisper
export LD_LIBRARY_PATH=$(python - <<'PY'
import os, nvidia.cublas.lib, nvidia.cudnn.lib
print(os.path.dirname(nvidia.cublas.lib.__file__) + ":" + os.path.dirname(nvidia.cudnn.lib.__file__))
PY
)
```

**vLLMInference**

```shell
pip install "optree>=0.13.0" vllm
```

**CometoidWMTQualityEstimation**

```shell
pip install pymarian
```

**FastTextLangIdClassifier**

```shell
pip install fasttext
```

**ConvertToTarredAudioDataset** (optional, only if tar-sharding is enabled)

```shell
pip install lhotse "nemo-toolkit[common]==2.2.1"
```

- Hardware – Linux box with NVIDIA GPU(s) and ≥ 16 GB VRAM (reference runs used A100-80 GB; smaller cards work with reduced batch sizes).
- Install NeMo-speech-data-processor and the extras listed above.
- Prepare the input manifest and set three mandatory YAML keys:
  - `input_manifest_file` – manifest with raw audio paths
  - `output_dir` – working/output directory
  - `sdp_dir` – root of the SDP tree (for prompt/regex assets)
- Run the pipeline:
```shell
# Path to your local clone of NeMo-speech-data-processor
SDP_DIR=/path/to/NeMo-speech-data-processor

python ${SDP_DIR}/main.py \
    --config-path ${SDP_DIR}/dataset_configs/multilingual/granary/ \
    --config-name config.yaml \
    input_manifest_file=/path/to/input_manifest.json \
    output_dir=/path/to/output/dir \
    sdp_dir=${SDP_DIR}
```

Each line of the input manifest is a JSON object with the source-audio path:
```json
{"source_audio_filepath": "/path/to/file.flac"}
```

Key outputs:

- `${output_dir}/${source_lang}/manifest_46.json` – final bilingual manifest containing `audio_filepath`, `offset`, `duration`, `text` (source) and `answer` (translation), plus constant decoder flags.
- `${output_dir}/${source_lang}/tarred_dataset/` – optional tarred-audio shards and `shard_manifest.json` when `convert_to_audio_tarred_dataset.should_run: True`.
- All intermediate `manifest_XX.json` files are kept for audit/debug.
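An input manifest of this shape can be generated with a few lines of Python; the sketch below uses a temporary directory with placeholder files standing in for your corpus root:

```python
import json
import tempfile
from pathlib import Path

# Demo setup: a temporary directory standing in for the corpus root.
audio_dir = Path(tempfile.mkdtemp())
(audio_dir / "a.flac").touch()
(audio_dir / "b.flac").touch()

# One JSON object per line, as the pipeline expects.
manifest_path = audio_dir / "input_manifest.json"
with manifest_path.open("w") as f:
    for audio_file in sorted(audio_dir.rglob("*.flac")):
        f.write(json.dumps({"source_audio_filepath": str(audio_file)}) + "\n")

print(manifest_path.read_text())
```

Point `audio_dir` at your real corpus (and adjust the extension) to produce a usable `input_manifest_file`.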
The processors executed (indices match the config):
- FfmpegConvert (0) – re-encode audio to 16 kHz/mono FLAC.
- GetAudioDuration (1) – compute clip length.
- RemoveFiles (2) – optionally delete originals (`params.save_disk_space`).
- FasterWhisperInference (3) – pass-1 language detection.
- LambdaExpression (4) – probability-based LID filtering.
- DropSpecifiedFields (5) – remove temporary fields.
- FasterWhisperInference (6, 14) – two-pass transcription (second run can slice by offset).
- Segmentation & grooming (7–13) – split Whisper segments into atomic utterances.
- Hallucination detection (18–20) – drop repeated n-grams, garbage tokens and common filler phrases.
- PnC restoration (21–23) – `Qwen-2.5-7B` restores punctuation & capitalisation; optional regex clean-up.
- Length & charset filtering (27–36) – word-ratio, character-histogram and FastText checks.
- Quality estimation (41–43) – keep pairs with Comet-QE score ≥ `min_qe_score`.
- Constant flags (44) – add decoder directives (`<|emo:undefined|>`, `itn`, `pnc`, etc.).
- Tarred dataset (46) – shard audio into `num_shards` tar files (optional).
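The repeated-n-gram part of the hallucination check (stages 18–20) can be illustrated with a minimal detector; this is a sketch with assumed thresholds, not the actual SDP implementation:

```python
from collections import Counter

def has_repeated_ngrams(text: str, n: int = 3, max_repeats: int = 2) -> bool:
    """Flag transcripts where any word n-gram occurs more than `max_repeats`
    times; looping output like this is a typical Whisper hallucination symptom."""
    words = text.lower().split()
    counts = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return any(c > max_repeats for c in counts.values())

print(has_repeated_ngrams("thank you thank you thank you thank you thank you"))  # → True
print(has_repeated_ngrams("the quick brown fox jumps over the lazy dog"))        # → False
```

The real stages additionally drop garbage tokens and common filler phrases, which a frequency check like this does not cover.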
All knobs live under the `params` block.
- Language
  - `source_lang` / `source_lang_full`
  - `translation.target_lang` / `translation.target_lang_full`
- Audio duration
  - `min_audio_duration` – drop very short clips (seconds)
  - `max_audio_duration` – drop very long clips (seconds)
- Language-ID & text filtering
  - `min_audio_lid_probability` – Whisper LID threshold
  - `translation.min_hist_token_ratio` – charset-purity ratio
  - `translation.min_text_lid_probability` – FastText LID threshold
- Length & quality
  - `translation.max_len_diff_ratio` – max (src / tgt) word ratio
  - `translation.min_qe_score` – Comet-QE acceptance score
- Tarred dataset
  - `convert_to_audio_tarred_dataset.should_run` (bool)
  - `num_shards` and `buckets_num` – shard layout
- Misc.
  - `use_regex` – regex preset for text normalisation
  - `save_disk_space` – delete originals after conversion
  - `use_dask` – enable distributed execution (not recommended)
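Put together, a `params` block might look like the following sketch (values and nesting are illustrative, not the shipped defaults):

```yaml
params:
  save_disk_space: False
  min_audio_duration: 1.0        # seconds
  max_audio_duration: 40.0       # seconds
  min_audio_lid_probability: 0.7
  translation:
    min_hist_token_ratio: 0.8
    min_text_lid_probability: 0.7
    max_len_diff_ratio: 2.0
    min_qe_score: 0.5
  convert_to_audio_tarred_dataset:
    should_run: False
    num_shards: 512
    buckets_num: 1
```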
- Selective execution – override `processors_to_run` with a range of indices, e.g. `"0:25"`.
- Model swapping – every inference processor exposes either `model_size_or_path` (Whisper) or an embedded `model:` block (vLLM).
- Resource tuning – `num_devices: -1` uses all visible GPUs; set an integer to pin workers per stage.
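As a sketch, repointing a Whisper stage at another checkpoint is a one-key change (the exact placement of the key inside the processor's config is an assumption here):

```yaml
# Illustrative override: use a different faster-whisper checkpoint
model_size_or_path: large-v3
```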
- Koluguri et al. (2025). "Granary: Speech Recognition and Translation Dataset in 25 European Languages" (preprint). arXiv:2505.13404.
- nvidia/Granary dataset on Hugging Face.
- NeMo-speech-data-processor (SDP) source code.