Speech Technology

Some recent training results from our system https://alphacephei.com/nsh/2026/05/24/asr-details.html

Finally a good competition timeline, not two weeks to implement everything https://saigonaihub.com/OneVoiceAIChallenge
Dare to build the next generation of realtime translation devices powered by Edge AI
Presented by Saigon AI Hub and Qualcomm
🟠 24 May - 24 June 2026: Registration period
🟠 July 2026: Technical specification submission
🟠 August - September 2026: Prototype submission
🟠 October 2026: Field testing
🟠 November 2026: Grand finale at VNG Campus

If you decompose prosody you can do many nice things https://arxiv.org/abs/2605.05927
Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM
Wenqian Cui, Xiao-Hui Li, Daxin Tan, Qiyong Zheng, Irwin King
Speech large language models (SLMs) are typically built from text large language model (TLM) checkpoints, yet they still suffer from a substantial modality gap. Prior work has mainly attempted to reduce this gap from the output side by making speech generation more text-like, but the gap remains. We argue that the key remaining bottleneck lies on the input side. We propose TextPro-SLM, an SLM that makes spoken input more closely resemble that of a prosody-aware text LLM. TextPro-SLM combines WhisperPro, a unified speech encoder that produces synchronized text tokens and prosody embeddings, with an LLM backbone trained to preserve the semantic capabilities of the original TLM while learning paralinguistic understanding. Experiments show that TextPro-SLM achieves the lowest modality gap among leading SLMs at both 3B and 7B scales, while also delivering strong overall performance on paralinguistic understanding tasks. These gains are achieved with only roughly 1,000 hours of LLM training audio, suggesting that reducing the modality gap from the input side is both effective and data-efficient.

ASR robustness is still a large area to improve https://github.com/xzf-thu/Mega-ASR

Interesting comments by Desh on TML https://x.com/rdesh26/status/2054246456635150744, for example on Game-Time: Evaluating Temporal Dynamics in Spoken Language Models https://arxiv.org/abs/2509.26388

Following on this, NVIDIA implements a transformer encoder instead of a conformer https://github.com/NVIDIA-NeMo/NeMo/pull/15661

This is quite an insightful paper https://arxiv.org/abs/2601.20094
T-Mimi: A Transformer-based Mimi Decoder for Real-Time On-Phone TTS
Haibin Wu, Bach Viet Do, Naveen Suda, Julian Chan, Madhavan C R, Gene-Ping Yang, Yi-Chiao Wu, Naoyuki Kanda, Yossef Adi, Xin Lei, Yue Liu, Florian Metze, Yuzong Liu
Neural audio codecs provide promising acoustic features for speech synthesis, with representative streaming codecs like Mimi providing high-quality acoustic features for real-time Text-to-Speech (TTS) applications. However, Mimi's decoder, which employs a hybrid transformer and convolution architecture, introduces significant latency bottlenecks on edge devices due to the compute intensive nature of deconvolution layers which are not friendly for mobile-CPUs, such as the most representative framework XNNPACK. This paper introduces T-Mimi, a novel modification of the Mimi codec decoder that replaces its convolutional components with a purely transformer-based decoder, inspired by the TS3-Codec architecture. This change dramatically reduces on-device TTS latency from 42.1ms to just 4.4ms. Furthermore, we conduct quantization aware training and derive a crucial finding: the final two transformer layers and the concluding linear layers of the decoder, which are close to the waveform, are highly sensitive to quantization and must be preserved at full precision to maintain audio quality.
So on mobile CPUs transformers are faster than CNNs. Previous paper from the same author: https://arxiv.org/abs/2411.18803
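To make the quantization finding concrete, here is a rough sketch of a per-layer precision plan that keeps the last two transformer layers and the output linear layers in full precision. The layer names, the six-layer depth and the int8 target are assumptions for illustration only; the abstract does not give them, and this is not the authors' code.

```python
def precision_plan(layer_names, keep_last_transformer=2):
    """Map each decoder layer name to 'int8' or 'fp32'.

    Hypothetical naming convention (not from the paper): transformer
    blocks are 'transformer.<index>', output projections start with 'linear'.
    """
    transformer = [n for n in layer_names if n.startswith("transformer.")]
    # Sort by numeric index so "last" means closest to the waveform.
    transformer.sort(key=lambda n: int(n.split(".")[1]))
    if keep_last_transformer > 0:
        protected = set(transformer[-keep_last_transformer:])
    else:
        protected = set()

    plan = {}
    for name in layer_names:
        if name in protected or name.startswith("linear"):
            plan[name] = "fp32"   # quantization-sensitive, keep full precision
        else:
            plan[name] = "int8"   # assumed quantization target
    return plan


# Hypothetical 6-layer decoder with one output projection.
layers = [f"transformer.{i}" for i in range(6)] + ["linear_out"]
plan = precision_plan(layers)

# Latency ratio from the numbers quoted in the abstract.
speedup = 42.1 / 4.4
```

With the quoted latencies the ratio works out to roughly 9.6x.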

AppTek recently released a call center dataset (129 hours, role-played). Qwen3-ASR-1.7B leads again https://huggingface.co/datasets/apptek-com/apptek_callcenter_dialogues

HF introduced a private leaderboard https://huggingface.co/blog/open-asr-leaderboard-private-data Qwen is really good for English
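For reference, ASR leaderboards like this one typically rank systems by word error rate. A minimal sketch of WER as word-level edit distance divided by reference length is below; it assumes plain whitespace tokenization and skips the text normalization a real leaderboard would apply.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]  # deleting all i reference words so far
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)
```

For example, `wer("a b c d", "a c d")` counts one deletion over four reference words.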

For people interested in Georgian we also recommend checking https://huggingface.co/NMikka, there is good work there on finetuning major TTS engines (kokoro, qwen, f5, omni)

NAR-CTC thing is interesting, fast https://www.linkedin.com/feed/update/urn:li:activity:7455374206114115585/
The IBM worldwide speech research team is super excited to announce the release of three Granite 4.1 speech models (links in first comment) that explore different tradeoffs in terms of accuracy, throughput, and functionality:
⛰ granite-speech-4.1-2b (codename "Altius") adds punctuation and truecasing capabilities while also improving multilingual ASR accuracy via a novel dual-head graphemic/BPE CTC encoder with frame importance sampling
💪 granite-speech-4.1-2b-plus (codename "Fortius") adds speaker-attributed ASR, word-level time-stamps and improved support for long-form ASR
⚡️ granite-speech-4.1-2b-nar (codename "Citius") uses a new non-autoregressive architecture that performs parallel local edits to the CTC encoder hypothesis resulting in much higher throughput
🥇 Models 1 and 3 claimed first and third place on the #OpenASR leaderboard.
This was truly a worldwide team effort and I would like to express my sincere gratitude to all my IBM colleagues who made this possible: Samuel Thomas, Vishal Sunder., PhD, Jeff Kuo, Brian Kingsbury, Avihu Dekel, Zvi Kons, Hagai Aronowitz, Ron Hoory, Sashi Novitasari, Takashi Fukuda, Tohru Nagano, Masayuki Suzuki, Gakuto KURATA, Madison Lee, Luis Lastras and many more (apologies if I missed you!).
🔬 I hope to see many of you next week at #ICASSP2026 in Barcelona to chat about these models and share some funny anecdotes.
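As a reminder of what a "CTC encoder hypothesis" is, here is a sketch of standard greedy CTC decoding: take the best token per frame, merge repeats, drop blanks. This is the textbook procedure only, not IBM's NAR editing step, and blank id 0 is an assumption.

```python
def ctc_collapse(frame_ids, blank=0):
    """Collapse per-frame argmax token ids into a CTC hypothesis."""
    out = []
    prev = None
    for t in frame_ids:
        # Emit a token only when it differs from the previous frame
        # and is not the blank symbol.
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return out


# The blank between the two 1s keeps them as separate tokens.
hypothesis = ctc_collapse([0, 1, 1, 0, 1, 2, 2, 0])
```

Since every frame is decoded independently, the whole hypothesis comes out in one parallel pass, which is where the throughput advantage of non-autoregressive models comes from.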

The code for Facebook's LST is finally available https://github.com/facebookresearch/lst https://t.me/speechtech/2195

For a long time AudioSet was a big pain to download, now it is finally available on HF https://huggingface.co/datasets/agkphysics/AudioSet Overall, even small speech models need to understand non-speech sounds better. More on this later.

The Beyond Transcription Challenge, an IEEE SLT 2026 shared task tackling a foundational question in audio AI: can a model reason over speech without first converting it to text?
The research question: Current speech models still struggle to extract meaning directly from audio, especially when the signal includes overlapping speakers, ambient sounds, and room acoustics. Clinical note generation from doctor-patient conversations is an ideal stress test for this: it demands that a model attend to who said what, filter environmental noise, and produce faithful structured output. Yet on the Synth-DoPaCo dataset, end-to-end models hallucinate at alarming rates, with 99–100% of clinical claims unsupported by the source audio, compared to just 21–23% for traditional transcribe-then-summarize pipelines. BeTraC is a shared evaluation challenge aimed at closing this gap by advancing the technology.
Two competition tracks:
- Lightweight (≤ 6B params): Single end-to-end model, one invocation. Audio in, SOAP note out.
- Heavyweight (≤ 36B params): Tools and agents allowed. Only the final model generates text from audio.
The Synth-DoPaCo dataset: 8,800 synthetic doctor-patient conversations (~1,329 hrs), 66 ambient sound classes, room reverberation, Opus compression. Available now on Hugging Face.
Key dates:
- May 4, 2026: Open-Source Inclusion Proposals Deadline
- June 24, 2026: System submission deadline
- July 8, 2026: Challenge paper due
Data is live. Baselines are posted. Team registration is open. If you work on speech, audio understanding, or multimodal AI, we'd love to have you compete.
👇 https://betrac.github.io
group: betrac@googlegroups.com

https://github.com/OpenMOSS/MOSS-TTS-Nano 0.1B params, still many languages

The question is when we will see a 32B speech model https://huggingface.co/KRAFTON/Raon-Speech-9B

Rissa Cao, FishAudio CEO: a bit of marketing, but a very valid point on the importance of data and the lack of real high-quality data for speech systems https://www.linkedin.com/feed/update/urn:li:activity:7448399470251356160/
Early on, we made a mistake. We trained our TTS model on whatever voice data we could find online. It sounded great on podcasts. But terrible for creation, companionship, anime dubbing. Everything fell apart. The data distribution was wrong.