1 650
Subscribers
24 hours: no data
7 days: -1
30 days: +11
Attracting Subscribers
| Month | Subscriber Growth | Channels |
| --- | --- | --- |
| July '26 | 0 | 0 |
| June '26 | +29 | 1 |
| May '26 | +27 | 0 |
| April '26 | +36 | 0 |
| March '26 | +20 | 1 |
| February '26 | +38 | 0 |
| January '26 | +23 | 0 |
| December '25 | +16 | 0 |
| November '25 | +25 | 0 |
| October '25 | +26 | 1 |
| September '25 | +45 | 0 |
| August '25 | +18 | 0 |
| July '25 | +18 | 0 |
| June '25 | +28 | 1 |
| May '25 | +34 | 0 |
| April '25 | +34 | 1 |
| March '25 | +40 | 1 |
| February '25 | +36 | 0 |
| January '25 | +20 | 1 |
| December '24 | +56 | 2 |
| November '24 | +37 | 1 |
| October '24 | +137 | 3 |
| September '24 | +47 | 2 |
| August '24 | +51 | 1 |
| July '24 | +62 | 2 |
| June '24 | +34 | 1 |
| May '24 | +41 | 1 |
| April '24 | +33 | 1 |
| March '24 | +55 | 2 |
| February '24 | +56 | 2 |
| January '24 | +77 | 1 |
| December '23 | +96 | 3 |
| November '23 | +47 | 2 |
| October '23 | +37 | 1 |
| September '23 | +23 | 0 |
| August '23 | +24 | 0 |
| July '23 | +34 | 0 |
| June '23 | +20 | 0 |
| May '23 | +79 | 0 |
| April '23 | +51 | 0 |
| March '23 | +15 | 0 |
| February '23 | +6 | 0 |
| January '23 | +11 | 0 |
| December '22 | +12 | 0 |
| November '22 | +22 | 0 |
| October '22 | +14 | 0 |
| September '22 | +25 | 0 |
| August '22 | +53 | 0 |
| July '22 | +9 | 0 |
| June '22 | +16 | 0 |
| May '22 | +14 | 0 |
| April '22 | +26 | 0 |
| March '22 | +9 | 0 |
| February '22 | +8 | 0 |
| January '22 | +10 | 0 |
| December '21 | +16 | 0 |
| November '21 | +11 | 0 |
| October '21 | +16 | 0 |
| September '21 | +3 | 0 |
| August '21 | +24 | 0 |
| July '21 | +13 | 0 |
| June '21 | +16 | 0 |
| May '21 | +220 | 0 |
| Date | Subscriber Growth | Mentions | Channels |
| --- | --- | --- | --- |
| 03 July | 0 | | |
| 02 July | 0 | | |
| 01 July | 0 | | |
Channel Posts
Everyone works on translation these days; here is a nice recent release
https://github.com/cmots/STEB
Official code release for STEB: A Speech-to-Speech Translation Expressiveness Benchmark for Evaluating Beyond Translation Fidelity, an automatic evaluation toolkit for speech-to-speech translation systems.
| 2 | Some new Persian stuff from Reza2kn
NeMo models
https://huggingface.co/Reza2kn/Shenava-Koochik-v1.0
https://huggingface.co/Reza2kn/Shenava-Rizeh-v1.0
https://huggingface.co/Reza2kn/Shenava-Rizeh-Pizeh-v1.0
Datasets
https://huggingface.co/datasets/Reza2kn/persian-asr-relabeled-gemini
Leaderboard
https://huggingface.co/spaces/Reza2kn/persian-asr-double-benchmark | 455 |
| 3 | DCASE2026 Challenge results are out! https://dcase.community/challenge2026/ | 479 |
| 4 | There is a competition on African voices going on; there is still plenty of time to join and get the data
https://afrivoice.github.io/afrivoice_eac_hackathon/
https://www.kaggle.com/competitions/afri-voices-east-africa-asr-hackathon | 553 |
| 5 | Several interesting NAR / diffusion systems released recently
https://huggingface.co/ibm-granite/granite-speech-4.1-2b-nar
Fast and reasonably accurate. Uses CTC output as guidance and a bidirectional LLM for post-correction, but it is kind of hard to adapt because the LLM is too specialized
Tedlium WER 4.39
https://github.com/taeyoun811/Whisfusion
Diffusion ASR with Whisper-Small Encoder and SMDM-170M Decoder. Not very accurate since it doesn't use CTC
Tedlium WER 18.03 (bad)
https://github.com/liuzhan22/Diffusion-ASR
From Cambridge. Uses the Whisper Large encoder and LLaDA 8B Instruct for correction. Can edit an existing AR hypothesis for better accuracy
Tedlium WER 7.05 (not very good yet) | 556 |
| 6 | Rare WER is interesting. It flips things around: Whisper V3 and Cohere (also AED) are still the best on it, ahead of LLM-based systems with better overall WER (Qwen). It actually confirms the intuition that Whisper usually gets special terms right.
Rare WER has actually been explored before in papers, for example
End-to-End Speech Recognition Contextualization with Large Language Models
https://arxiv.org/abs/2309.10917 | 517 |
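For readers who have not met the metric, here is a minimal sketch of one way to compute a rare-word error rate. Definitions vary between papers; this version is an assumption, not the exact metric behind the numbers above. It aligns reference and hypothesis with edit distance and counts how many rare reference words were substituted or deleted. The function names and the toy rare-word list are hypothetical.

```python
from typing import List, Set, Tuple


def align(ref: List[str], hyp: List[str]) -> List[Tuple[str, str]]:
    """Levenshtein-align two word lists; an empty string marks a gap."""
    n, m = len(ref), len(hyp)
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + cost,  # match / substitution
                           dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1)          # insertion
    # Walk back from the corner to recover one optimal alignment.
    pairs: List[Tuple[str, str]] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            pairs.append((ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            pairs.append((ref[i - 1], ""))
            i -= 1
        else:
            pairs.append(("", hyp[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


def rare_word_error_rate(refs: List[str], hyps: List[str], rare: Set[str]) -> float:
    """Fraction of rare reference words that were not recognized exactly."""
    errors = total = 0
    for ref, hyp in zip(refs, hyps):
        for r, h in align(ref.split(), hyp.split()):
            if r in rare:
                total += 1
                errors += r != h
    return errors / total if total else 0.0


refs = ["the patient takes metoprolol daily", "take metoprolol now"]
hyps = ["the patient takes metro pool daily", "take metoprolol now"]
print(rare_word_error_rate(refs, hyps, {"metoprolol"}))  # 0.5
```

Note that this sketch ignores insertions, since they cannot be attributed to a reference word, so a system can score well on overall WER and still do badly here.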
| 7 | Fairly recent tech from Microsoft. Interesting that accuracy is still about the same as Whisper Large v3
https://arxiv.org/abs/2604.00610
Speech LLMs are Contextual Reasoning Transcribers
Keqi Deng, Ruchao Fan, Bo Ren, Yiming Wang, Jinyu Li
Despite extensions to speech inputs, effectively leveraging the rich knowledge and contextual understanding of large language models (LLMs) in automatic speech recognition (ASR) remains non-trivial, as the task primarily involves direct speech-to-text mapping. To address this, this paper proposes chain-of-thought ASR (CoT-ASR), which constructs a reasoning chain that enables LLMs to first analyze the input speech and generate contextual analysis, thereby fully exploiting their generative capabilities. With this contextual reasoning, CoT-ASR then performs more informed speech recognition and completes both reasoning and transcription in a single pass. Moreover, CoT-ASR naturally supports user-guided transcription: while designed to self-generate reasoning, | 586 |
| 8 | Supposed to be good arxiv.org/abs/2605.28139
ARK-ASR-3B is a multilingual automatic speech recognition model. It achieves current state-of-the-art results on the Hugging Face Open ASR Leaderboard English short-form benchmark, with an average WER of 5.04%
huggingface.co/AutoArk-AI/ARK | 658 |
| 9 | https://huggingface.co/marcoyang/spear-xlarge-speech-audio-v2 recently published
SPEAR XLarge v2 is the flagship open-source SPEAR encoder for unified speech and general-audio representation learning. This is the ICML 2026 accepted version of SPEAR: A Unified SSL Framework for Learning Speech and Audio Representations. This model is the XLarge v2 release, aligned with the model used in the ICML 2026 paper. Compared with the earlier XLarge v1, v2 is enhanced for complex acoustic scenes through token mixing, improving robustness for overlapped speech, noisy audio, and real-world sound mixtures while keeping SPEAR's unified speech-and-audio design.
SPEAR XLarge v2 uses a Zipformer backbone with about 600M parameters, consisting of 13 Zipformer stacks. It produces 1280-dimensional frame-level representations at approximately 50 Hz from 16 kHz waveforms. | 840 |
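A quick back-of-the-envelope sketch of what those numbers mean for output size. It assumes an exact 50 Hz frame rate and ignores edge effects, while the model card only says "approximately 50 Hz", so real shapes may differ by a few frames. The helper name is hypothetical and this does not call the model.

```python
SAMPLE_RATE_HZ = 16_000   # input waveform rate from the model card
FRAME_RATE_HZ = 50        # assumed exact; the card says "approximately 50 Hz"
EMBED_DIM = 1280          # frame-level representation size from the model card


def output_shape(num_samples: int) -> tuple:
    """Estimate (frames, dim) of the encoder output for a waveform length."""
    hop = SAMPLE_RATE_HZ // FRAME_RATE_HZ  # 320 samples per output frame
    return (num_samples // hop, EMBED_DIM)


# 10 seconds of 16 kHz audio
print(output_shape(10 * SAMPLE_RATE_HZ))  # (500, 1280)
```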
| 10 | ScenA: Reference-Driven Multi-Speaker Audio Scene Generation from In-the-Wild Priors
https://finmickey.github.io/scena/
Abstract. Existing multi-speaker dialogue systems bind speakers to utterances through structured supervision: per-turn tags, multi-stream transcriptions, or learnable speaker embeddings. These systems operate within speech-only pipelines that produce clean vocal sequences without the ambient texture of real conversations. We take a different approach. Our method, ScenA, conditions a text-to-audio flow-matching foundation model, pretrained on large-scale in-the-wild data, directly on multiple reference voices and a free-form natural language prompt that describes an entire multi-speaker audio scene. Leveraging such a foundational model allows us to inherit its capacity for natural, non-studio audio: background noise, room acoustics, overlapping dialogue, and spontaneous paralinguistic events, while adding multi-speaker control without any per-turn structure. Concretely, reference latents are concatenated into the model's token sequence and distinguished by lightweight identity-aware positional encodings. However, we identify a critical obstacle to this approach: the Reference Shortcut. During training under standard noise schedules, the model can identify the matching reference by acoustic similarity to the noisy target, bypassing the text prompt entirely. We address this with a high-noise-biased timestep distribution that forces the model to rely on the text prompt for speaker assignment. We evaluate ScenA on the CoVoMix2-Dialogue benchmark, showing that it outperforms existing multi-speaker systems on speaker-binding metrics while generating rich conversational audio with overlapping speech, emotional vocalizations, and ambient sound. Our results demonstrate the advantage of using a general-purpose audio model conditioned on a free-form scene description, rather than passing structured dialog scripts through a speech-only pipeline. | 846 |
| 11 | Odyssey 2026 on speaker verification starts today
https://odyssey2026.inesc-id.pt/the-full-schedule/ | 714 |
| 12 | Sudarshan Kamath from SmallestAI on how structure beats scale
https://www.youtube.com/watch?v=14Cb7D8p-C4 | 724 |
| 13 | I always used to think that CTC + LM is a good architecture for quick domain adaptation. Even WER tests demonstrated the advantage. But recent experiments with rare-word WER show that CTC + LM doesn't work as well as expected. Most systems that use n-gram shallow fusion demonstrate significantly worse rare WER than RNN-T, and even worse than plain CTC without an LM. The thing is that plain Conformer accuracy is so good that a weak extra LM with a perplexity of 100-200 doesn't help much and can even make things worse by confusing rare words. And a stronger n-gram LM is harder to estimate: lower perplexity needs a more advanced LM architecture and longer context, which is only available with transformers. An interesting flip of things. A strong LLM should help here of course, but the question is quick adaptation to the domain. | 881 |
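A toy sketch of how a weak LM can hurt a rare word under fusion. This is an illustration under loud assumptions: it rescores an n-best list instead of fusing inside beam search, the LM is a made-up unigram table standing in for an n-gram model, and all scores are invented. The combined score is `am_score + lm_weight * lm_score`.

```python
from typing import Dict, List, Tuple


def fuse(nbest: List[Tuple[str, float]], lm_logprob: Dict[str, float],
         lm_weight: float, oov_logprob: float = -10.0) -> str:
    """Return the hypothesis maximizing am_score + lm_weight * lm_score."""
    def lm_score(text: str) -> float:
        # Toy unigram LM: sum of per-word log-probabilities.
        return sum(lm_logprob.get(w, oov_logprob) for w in text.split())

    return max(nbest, key=lambda h: h[1] + lm_weight * lm_score(h[0]))[0]


# Hypothetical acoustic log-scores: the model prefers the correct rare word.
nbest = [("take metoprolol daily", -1.0), ("take metro pool daily", -1.6)]
# Hypothetical LM: the rare word is very unlikely, common words are cheap.
lm = {"take": -2.0, "daily": -3.0, "metro": -4.0, "pool": -4.0, "metoprolol": -12.0}

print(fuse(nbest, lm, lm_weight=0.0))  # take metoprolol daily
print(fuse(nbest, lm, lm_weight=0.3))  # take metro pool daily
```

With the LM switched off the acoustic model gets the rare word right; with a modest LM weight the common-word hypothesis wins, which is the kind of failure the rare WER numbers point at.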
| 14 | ROVER gets good results of course; it's just interesting how corporations are going to compete for first place on the Hugging Face ASR leaderboard now
https://github.com/huggingface/open_asr_leaderboard/pull/165#issuecomment-4763128980 | 839 |
| 15 | Attention to details
https://x.com/bryanwangxin/status/2064309590414836085 | 1 132 |
| 16 | TTS models grow in size and get audio generation capabilities
https://huggingface.co/inclusionAI/Ming-omni-tts-16.8B-A3B
Ming-omni-tts is a high-performance unified audio generation model that achieves precise control over speech attributes and enables single-channel synthesis of speech, environmental sounds, and music. Powered by a custom 12.5Hz continuous tokenizer and Patch-by-Patch compression, it delivers competitive inference efficiency (3.1Hz). Additionally, the model features robust text normalization capabilities for the accurate and natural narration of complex mathematical and chemical expressions. | 1 563 |
| 17 | TTS model based on 0.1B LLM backbone
https://huggingface.co/Aratako/MioTTS-0.1B
https://huggingface.co/tiiuae/Falcon-H1-Tiny-Multilingual-100M-Base
Interesting to test how well it performs for different languages | 1 162 |
| 18 | Recent SynSIG seminars finally uploaded
https://www.youtube.com/@isca-synsig
for example
https://www.youtube.com/watch?v=M8n9I9eGyTM | 1 372 |
| 19 | DeepMind paper
SURF: Separation via Unsupervised Remixing Flow
https://google.github.io/df-conformer/surf/
https://arxiv.org/abs/2606.04921
The goal of single-channel source separation is to reconstruct K sources given their mixture. In supervised settings where vast amounts of clean source data are available, this challenging, ill-posed problem has been addressed successfully by generative diffusion and flow-based prior models. However, access to such clean source samples is often limited. To bridge this gap, we present Separation via Unsupervised Remixing Flow (SURF), an unsupervised flow matching approach for source separation that learns directly from observed mixtures. This method relies on a novel combination of state-of-the-art supervised flow matching and regression-based self-supervised techniques. At a high level, starting from a teacher model, we utilize a “remixing” step to bootstrap the learning of a student flow model from the teacher’s estimates. We provide insights into the objectives optimized by this approach and draw a novel connection to the Wake-Sleep algorithm. Empirical evaluations on image and audio benchmarks demonstrate that SURF establishes a new state-of-the-art, significantly outperforming existing unsupervised methods. | 1 173 |
| 20 | Everyone looks for translation these days. Good paper covering the task complexity (naturalness, prosody, content). Omni systems still unrealistic
https://arxiv.org/abs/2606.03241
Benchmarking Speech-to-Speech Translation Models
Alkis Koudounas, Hayato Futami, Quentin Jodelet, Osamu Take, Shinji Watanabe, Emiru Tsunoo
Speech-to-speech translation (S2ST) has advanced rapidly, but offline evaluation lacks a unified protocol: studies report non-overlapping metric subsets, preventing direct comparisons. We introduce COMPASS, a unified and reproducible benchmarking framework integrating 46 metrics across eight dimensions, and deploy it on 1,248 model-language configurations from FLEURS and CVSS, spanning cascaded and end-to-end architectures over ten language pairs. Architectures exhibit complementary strengths: best-vs-worst gaps exceed 30% on naturalness and speaker preservation but remain within a few points on translation quality, so single-metric rankings systematically misrepresent system quality. Correlation filtering reduces 46 metrics to 10 per direction, with three axes requiring different metrics across XEN and ENX (e.g., TER/UTMOS vs. ChrF++/NISQA-MOS); these subsets preserve rankings (Spearman's ) while cutting evaluation time by . Human validation across dubbing, podcasts, and medical domains shows standalone MOS predictors fail to predict listener preference, while top domain-specific metrics correlate with human judgment (). We release COMPASS as a foundation for domain-aware S2ST evaluation. | 1 108 |
