Speech Technology

Everyone works on translation these days. Here is a nice recent release:
https://github.com/cmots/STEB
Official code release for STEB: A Speech-to-Speech Translation Expressiveness Benchmark for Evaluating Beyond Translation Fidelity, an automatic evaluation toolkit for speech-to-speech translation systems.

Some new Persian stuff from Reza2kn
NeMo models:
https://huggingface.co/Reza2kn/Shenava-Koochik-v1.0
https://huggingface.co/Reza2kn/Shenava-Rizeh-v1.0
https://huggingface.co/Reza2kn/Shenava-Rizeh-Pizeh-v1.0
Datasets:
https://huggingface.co/datasets/Reza2kn/persian-asr-relabeled-gemini
Leaderboard:
https://huggingface.co/spaces/Reza2kn/persian-asr-double-benchmark

DCASE2026 Challenge results are out! https://dcase.community/challenge2026/

There is a competition on African voices going on, and there is still plenty of time to join and get the data:
https://afrivoice.github.io/afrivoice_eac_hackathon/
https://www.kaggle.com/competitions/afri-voices-east-africa-asr-hackathon

Several interesting NAR / diffusion systems were released recently:
https://huggingface.co/ibm-granite/granite-speech-4.1-2b-nar
Fast and reasonably accurate. Uses interesting CTC output guidance and a bidirectional LLM for post-correction, but it is kind of hard to adapt because the LLM is too specialized. Tedlium WER 4.39
https://github.com/taeyoun811/Whisfusion
Diffusion ASR with a Whisper-Small encoder and an SMDM-170M decoder. Not very accurate since it doesn't use CTC. Tedlium WER 18.03 (bad)
https://github.com/liuzhan22/Diffusion-ASR
From Cambridge. Uses the Whisper Large encoder and LLaDA 8B Instruct for correction. Can edit an existing AR hypothesis for better accuracy. Tedlium WER 7.05 (not very good yet)

Rare WER is interesting. It flips things around. Whisper V3 and Cohere (also AED) are still the best on rare words compared to LLM-based systems with better overall WER (Qwen). It actually confirms the intuition that Whisper usually gets all the special terms right. Rare WER has actually been explored before in papers, for example End-to-End Speech Recognition Contextualization with Large Language Models
https://arxiv.org/abs/2309.10917
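For anyone who wants to try this on their own test set, here is a minimal sketch of one possible rare-word error rate. The definition is an assumption, not the one used by any particular leaderboard or paper: align reference and hypothesis with plain Levenshtein alignment, then count only the reference words that appear in a rare-word list and were not recognized exactly. The function names and the rare-word list are hypothetical.

```python
def align(ref, hyp):
    """Levenshtein alignment of two word lists.

    Returns a list of (op, ref_word, hyp_word) with op in
    {"ok", "sub", "del", "ins"}.
    """
    n, m = len(ref), len(hyp)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match / substitution
                             dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1)         # insertion
    # Backtrace from the bottom-right corner.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                ops.append(("ok" if cost == 0 else "sub", ref[i - 1], hyp[j - 1]))
                i -= 1
                j -= 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("del", ref[i - 1], None))
            i -= 1
        else:
            ops.append(("ins", None, hyp[j - 1]))
            j -= 1
    ops.reverse()
    return ops


def rare_wer(pairs, rare_words):
    """Share of rare reference words that were substituted or deleted.

    Assumption: insertions are ignored, so this is closer to
    "1 - recall on rare words" than to a full WER.
    """
    errors = total = 0
    for ref, hyp in pairs:
        for op, ref_word, _ in align(ref.split(), hyp.split()):
            if ref_word in rare_words:
                total += 1
                errors += op != "ok"
    return errors / total if total else 0.0


pairs = [
    ("take ten milligrams of ibuprofen", "take ten milligrams of i be profen"),
    ("the patient needs amoxicillin", "the patient needs amoxicillin"),
]
print(rare_wer(pairs, {"ibuprofen", "amoxicillin"}))
```

The overall WER of such a system can look fine while this number is bad, which is the kind of flip described above.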

More or less recent tech from Microsoft. Interesting that the accuracy is still more or less the same as Whisper Large v3.
https://arxiv.org/abs/2604.00610
Speech LLMs are Contextual Reasoning Transcribers
Keqi Deng, Ruchao Fan, Bo Ren, Yiming Wang, Jinyu Li

Despite extensions to speech inputs, effectively leveraging the rich knowledge and contextual understanding of large language models (LLMs) in automatic speech recognition (ASR) remains non-trivial, as the task primarily involves direct speech-to-text mapping. To address this, this paper proposes chain-of-thought ASR (CoT-ASR), which constructs a reasoning chain that enables LLMs to first analyze the input speech and generate contextual analysis, thereby fully exploiting their generative capabilities. With this contextual reasoning, CoT-ASR then performs more informed speech recognition and completes both reasoning and transcription in a single pass. Moreover, CoT-ASR naturally supports user-guided transcription: while designed to self-generate reasoning,

Supposed to be good:
arxiv.org/abs/2605.28139
ARK-ASR-3B is a multilingual automatic speech recognition model. It achieves current state-of-the-art results on the Hugging Face Open ASR Leaderboard English short-form benchmark, with an average WER of 5.04%.
huggingface.co/AutoArk-AI/ARK

Recently published:
https://huggingface.co/marcoyang/spear-xlarge-speech-audio-v2
SPEAR XLarge v2 is the flagship open-source SPEAR encoder for unified speech and general-audio representation learning. This is the ICML 2026 accepted version of SPEAR: A Unified SSL Framework for Learning Speech and Audio Representations. This model is the XLarge v2 release, aligned with the model used in the ICML 2026 paper. Compared with the earlier XLarge v1, v2 is enhanced for complex acoustic scenes through token mixing, improving robustness for overlapped speech, noisy audio, and real-world sound mixtures while keeping SPEAR's unified speech-and-audio design. SPEAR XLarge v2 uses a Zipformer backbone with about 600M parameters, consisting of 13 Zipformer stacks. It produces 1280-dimensional frame-level representations at approximately 50 Hz from 16 kHz waveforms.
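A quick back-of-the-envelope check of what those numbers mean for the output size. This is only arithmetic on the figures quoted in the model card (16 kHz input, roughly 50 Hz frames, 1280 dimensions); the function name is hypothetical, and the real frame count will differ slightly depending on padding and the exact subsampling in the encoder.

```python
def expected_output_shape(num_samples, sample_rate=16_000,
                          frame_rate_hz=50, dim=1280):
    """Approximate (frames, dim) for a waveform of num_samples samples.

    Assumption: exactly frame_rate_hz frames per second, no edge effects.
    """
    duration_s = num_samples / sample_rate
    return (round(duration_s * frame_rate_hz), dim)


# 10 seconds of 16 kHz audio gives roughly 500 frames of 1280 dims each.
print(expected_output_shape(160_000))
```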

ScenA: Reference-Driven Multi-Speaker Audio Scene Generation from In-the-Wild Priors https://finmickey.github.io/scena/ Abstract. Existing multi-speaker dialogue systems bind speakers to utterances through structured supervision: per-turn tags, multi-stream transcriptions, or learnable speaker embeddings. These systems operate within speech-only pipelines that produce clean vocal sequences without the ambient texture of real conversations. We take a different approach. Our method, ScenA, conditions a text-to-audio flow-matching foundation model, pretrained on large-scale in-the-wild data, directly on multiple reference voices and a free-form natural language prompt that describes an entire multi-speaker audio scene. Leveraging such a foundational model allows us to inherit its capacity for natural, non-studio audio: background noise, room acoustics, overlapping dialogue, and spontaneous paralinguistic events, while adding multi-speaker control without any per-turn structure. Concretely, reference latents are concatenated into the model's token sequence and distinguished by lightweight identity-aware positional encodings. However, we identify a critical obstacle to this approach: the Reference Shortcut. During training under standard noise schedules, the model can identify the matching reference by acoustic similarity to the noisy target, bypassing the text prompt entirely. We address this with a high-noise-biased timestep distribution that forces the model to rely on the text prompt for speaker assignment. We evaluate ScenA on the CoVoMix2-Dialogue benchmark, showing that it outperforms existing multi-speaker systems on speaker-binding metrics while generating rich conversational audio with overlapping speech, emotional vocalizations, and ambient sound. Our results demonstrate the advantage of using a general-purpose audio model conditioned on a free-form scene description, rather than passing structured dialog scripts through a speech-only pipeline.
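To make the "high-noise-biased timestep distribution" idea a bit more concrete, here is a minimal sketch. The abstract does not say which distribution the authors use, so the Beta distribution, its parameters, and the convention that t = 1 means pure noise are all assumptions made for illustration. The point is only that drawing t mostly near the noisy end leaves the noisy target with little acoustic information to match against the references.

```python
import random


def sample_timestep_uniform(rng):
    """Standard choice: t uniform in [0, 1]."""
    return rng.random()


def sample_timestep_high_noise(rng, alpha=3.0, beta=1.0):
    """Hypothetical high-noise bias: Beta(alpha, beta) with alpha > beta.

    The mean is alpha / (alpha + beta), so Beta(3, 1) has mean 0.75 and
    puts most of the mass near t = 1 (pure noise in this convention).
    """
    return rng.betavariate(alpha, beta)


def noisy_target(clean, noise, t):
    """Linear interpolation path: t = 0 is clean data, t = 1 is pure noise."""
    return [(1.0 - t) * c + t * n for c, n in zip(clean, noise)]


rng = random.Random(0)
uniform_mean = sum(sample_timestep_uniform(rng) for _ in range(10_000)) / 10_000
biased_mean = sum(sample_timestep_high_noise(rng) for _ in range(10_000)) / 10_000
print(f"uniform mean t: {uniform_mean:.2f}, biased mean t: {biased_mean:.2f}")
```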

Odyssey 2026 on speaker verification starts today https://odyssey2026.inesc-id.pt/the-full-schedule/

Sudarshan Kamath from SmallestAI on how structure beats scale https://www.youtube.com/watch?v=14Cb7D8p-C4

I always used to think that CTC + LM is a good architecture for quick domain adaptation. Even WER tests demonstrated the advantage. But recent experiments with rare-word WER show that CTC + LM doesn't work as well as expected. Most systems that use n-gram shallow fusion show significantly worse rare WER than RNNT, and even worse than plain CTC without an LM. The thing is that plain Conformer accuracy is so good that a weak extra LM with a perplexity of 100-200 doesn't help much and can even make things worse by confusing rare words. And a stronger n-gram LM is harder to estimate: lower perplexity needs a more advanced LM architecture and a longer context, which is only available with transformers. An interesting flip of things. A strong LLM should help here of course, but the question is quick adaptation to the domain.
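A toy illustration of the effect, as a minimal sketch rather than a real decoder. It rescores a two-item n-best list with the usual shallow-fusion score, acoustic log-probability plus a weighted LM log-probability plus an optional length bonus. The add-one-smoothed bigram LM, the tiny corpus, the acoustic scores, and all names are invented for the example; real systems fuse inside beam search with a proper n-gram toolkit.

```python
import math
from collections import Counter


class BigramLM:
    """Tiny add-one-smoothed bigram LM, standing in for a weak n-gram LM."""

    def __init__(self, corpus):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sentence in corpus:
            words = ["<s>"] + sentence.split() + ["</s>"]
            self.unigrams.update(words)
            self.bigrams.update(zip(words, words[1:]))
        self.vocab_size = len(self.unigrams) + 1  # +1 slot for unseen words

    def logprob(self, sentence):
        words = ["<s>"] + sentence.split() + ["</s>"]
        total = 0.0
        for prev, cur in zip(words, words[1:]):
            numerator = self.bigrams[(prev, cur)] + 1
            denominator = self.unigrams[prev] + self.vocab_size
            total += math.log(numerator / denominator)
        return total


def shallow_fusion(nbest, lm, lm_weight=0.5, length_bonus=0.0):
    """Pick the hypothesis maximizing am + lm_weight * lm + length_bonus * len."""
    def score(item):
        text, am_logprob = item
        return (am_logprob
                + lm_weight * lm.logprob(text)
                + length_bonus * len(text.split()))
    return max(nbest, key=score)[0]


corpus = [
    "the patient needs a scan",
    "the patient needs rest",
    "the doctor needs a scan",
]
lm = BigramLM(corpus)

# (hypothesis, acoustic log-probability); the acoustic model prefers the rare word.
nbest = [
    ("the patient needs amoxicillin", -1.0),
    ("the patient needs a scan", -1.2),
]
print(shallow_fusion(nbest, lm, lm_weight=0.0))  # acoustic model alone
print(shallow_fusion(nbest, lm, lm_weight=0.5))  # with the weak LM fused in
```

With the LM switched off the rare word survives; with the weak LM fused in, the frequent in-corpus phrase wins even though the acoustic model preferred the rare word.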

ROVER gets good results of course. It's just interesting how corporations are going to compete for first place on the Hugging Face ASR leaderboard now
https://github.com/huggingface/open_asr_leaderboard/pull/165#issuecomment-4763128980
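For readers who haven't met ROVER, a heavily simplified sketch of the voting idea. Real ROVER builds a word transition network from all hypotheses and can use confidence scores; this toy version only aligns every other hypothesis to the first one (the backbone) with difflib and takes a majority vote per backbone slot, ignoring words inserted relative to the backbone. All names are hypothetical and this is not the code from the linked pull request.

```python
from collections import Counter
from difflib import SequenceMatcher

NULL = ""  # marks "this system has no word in this slot"


def votes_against_backbone(backbone, hyp):
    """For each backbone position, the word hyp proposes there (or NULL)."""
    slots = [NULL] * len(backbone)
    matcher = SequenceMatcher(None, backbone, hyp, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("equal", "replace"):
            for k in range(min(i2 - i1, j2 - j1)):
                slots[i1 + k] = hyp[j1 + k]
        # "delete": slots stay NULL. "insert": ignored in this simplified sketch.
    return slots


def rover_vote(hypotheses):
    """Majority vote per backbone slot; ties go to the earliest system."""
    backbone = hypotheses[0].split()
    columns = [[word] for word in backbone]
    for hyp in hypotheses[1:]:
        for column, word in zip(columns, votes_against_backbone(backbone, hyp.split())):
            column.append(word)
    output = []
    for column in columns:
        word, _ = Counter(column).most_common(1)[0]
        if word != NULL:
            output.append(word)
    return " ".join(output)


hypotheses = [
    "the bat sat on the mat",   # system 1 gets "cat" wrong
    "the cat sat on the mat",   # system 2 is correct
    "the cat sat on a mat",     # system 3 gets "the" wrong
]
print(rover_vote(hypotheses))
```

Each system makes a different error, so the vote recovers a sentence that two of the three systems did not produce on their own.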

Attention to details https://x.com/bryanwangxin/status/2064309590414836085

TTS models grow in size and get audio generation capabilities
https://huggingface.co/inclusionAI/Ming-omni-tts-16.8B-A3B
Ming-omni-tts is a high-performance unified audio generation model that achieves precise control over speech attributes and enables single-channel synthesis of speech, environmental sounds, and music. Powered by a custom 12.5Hz continuous tokenizer and Patch-by-Patch compression, it delivers competitive inference efficiency (3.1Hz). Additionally, the model features robust text normalization capabilities for the accurate and natural narration of complex mathematical and chemical expressions.

TTS model based on a 0.1B LLM backbone
https://huggingface.co/Aratako/MioTTS-0.1B
https://huggingface.co/tiiuae/Falcon-H1-Tiny-Multilingual-100M-Base
Interesting to test how well it performs for different languages.

Recent SynSIG seminars have finally been uploaded
https://www.youtube.com/@isca-synsig
for example
https://www.youtube.com/watch?v=M8n9I9eGyTM

DeepMind paper SURF: Separation via Unsupervised Remixing Flow
https://google.github.io/df-conformer/surf/
https://arxiv.org/abs/2606.04921
The goal of single-channel source separation is to reconstruct K sources given their mixture. In supervised settings where vast amounts of clean source data are available, this challenging, ill-posed problem has been addressed successfully by generative diffusion and flow-based prior models. However, access to such clean source samples is often limited. To bridge this gap, we present Separation via Unsupervised Remixing Flow (SURF), an unsupervised flow matching approach for source separation that learns directly from observed mixtures. This method relies on a novel combination of state-of-the-art supervised flow matching and regression-based self-supervised techniques. At a high level, starting from a teacher model, we utilize a “remixing” step to bootstrap the learning of a student flow model from the teacher’s estimates. We provide insights into the objectives optimized by this approach and draw a novel connection to the Wake-Sleep algorithm. Empirical evaluations on image and audio benchmarks demonstrate that SURF establishes a new state-of-the-art, significantly outperforming existing unsupervised methods.

Everyone looks for translation these days. A good paper covering the task complexity (naturalness, prosody, content). Omni systems are still unrealistic.
https://arxiv.org/abs/2606.03241
Benchmarking Speech-to-Speech Translation Models
Alkis Koudounas, Hayato Futami, Quentin Jodelet, Osamu Take, Shinji Watanabe, Emiru Tsunoo

Speech-to-speech translation (S2ST) has advanced rapidly, but offline evaluation lacks a unified protocol: studies report non-overlapping metric subsets, preventing direct comparisons. We introduce COMPASS, a unified and reproducible benchmarking framework integrating 46 metrics across eight dimensions, and deploy it on 1,248 model-language configurations from FLEURS and CVSS, spanning cascaded and end-to-end architectures over ten language pairs. Architectures exhibit complementary strengths: best-vs-worst gaps exceed 30% on naturalness and speaker preservation but remain within a few points on translation quality, so single-metric rankings systematically misrepresent system quality. Correlation filtering reduces 46 metrics to 10 per direction, with three axes requiring different metrics across XEN and ENX (e.g., TER/UTMOS vs. ChrF++/NISQA-MOS); these subsets preserve rankings (Spearman's ) while cutting evaluation time by . Human validation across dubbing, podcasts, and medical domains shows standalone MOS predictors fail to predict listener preference, while top domain-specific metrics correlate with human judgment (). We release COMPASS as a foundation for domain-aware S2ST evaluation.
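A minimal sketch of what correlation filtering of metrics can look like. The abstract does not describe the exact procedure, so the greedy rule (keep a metric only if its absolute Spearman correlation with every already-kept metric is below a threshold), the 0.9 threshold, and the score values are all assumptions invented for illustration. Only the metric names come from the abstract.

```python
import math


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation is Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))


def filter_correlated(scores, threshold=0.9):
    """Greedily keep metrics that are not redundant with the ones already kept."""
    kept = []
    for name in scores:
        if all(abs(spearman(scores[name], scores[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Invented per-system scores for five systems (not from the paper).
scores = {
    "ChrF++": [10, 20, 30, 40, 50],
    "TER": [90, 70, 65, 40, 30],        # ranks systems exactly like ChrF++, reversed
    "UTMOS": [3.1, 2.0, 4.2, 2.5, 3.0],  # measures something different
}
print(filter_correlated(scores))
```

TER is dropped here because it ranks the systems the same way as ChrF++ (just inverted), while UTMOS stays because it orders them differently.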