Speech Technology

https://huggingface.co/nyralabs/CrisperWhisper2.0_large
Most speech-to-text systems never actually decide whether to write down what was said or what was meant. They inherit that choice from their training data and apply it inconsistently. CrisperWhisper 2.0 makes it an explicit, controllable choice. One recording, two transcripts:

Verbatim, exactly what was said, in one consistent format:

[um] so we we need to, to reschedule the th- thursday meeting to [uh] march third at nine thirty [laughter]

Intended, the clean version the speaker meant, with numbers, dates, and emails formatted the way you'd write them:

So we need to reschedule the Thursday meeting to March 3 at 9:30.

On top of that:

Word-level timings. Around 30 ms mean boundary error on read speech and 41 ms on conversational speech, the most precise word timing of any system we benchmarked, on both.

Verbatimize. Upgrade transcripts you already have: given audio plus a trusted clean transcript, the model reproduces your content word-for-word and inserts only the disfluencies and vocal events actually present in the audio (rare-word recall jumps from 6.8% to 96.1% vs. re-transcribing). This turns the world's abundant clean corpora into verbatim ones, ready for TTS data, clinical speech analysis, and dataset construction.

Multilingual. Verbatim and intended modes work across most languages Whisper supports. CrisperWhisper 2.0 tops the Nyra Verbatim Speech Benchmark leaderboard for disfluency F1 across ten languages, ahead of every closed-source alternative we tested.

Seamless longform. Audio of any length, transcribed without the usual chunk-boundary artifacts: each window continues from the words already transcribed (conditional continuation), so there are no duplicated or dropped words at the seams and no fragile timestamp-token bookkeeping.

Production inference. A CTranslate2 runtime with speculative decoding and built-in mitigation of Whisper's looping-hallucination failure mode.
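The word-timing figures quoted in this post are mean boundary errors. A minimal sketch of how such a number could be computed, assuming predicted and reference words are already aligned one-to-one and that the error is averaged over both start and end boundaries. The model card may define the metric differently; the function name and the sample timings are illustrative only.

```python
def mean_boundary_error_ms(predicted, reference):
    """Mean absolute error over word start and end boundaries, in ms.

    Both arguments are lists of (start_seconds, end_seconds) tuples,
    assumed to be aligned word-for-word (an assumption of this sketch).
    """
    if len(predicted) != len(reference):
        raise ValueError("word lists must be aligned one-to-one")
    if not predicted:
        raise ValueError("need at least one word")
    total = 0.0
    for (p_start, p_end), (r_start, r_end) in zip(predicted, reference):
        total += abs(p_start - r_start) + abs(p_end - r_end)
    # Two boundaries per word; convert seconds to milliseconds.
    return 1000.0 * total / (2 * len(predicted))


# Hypothetical timings for three words (not real model output).
reference = [(0.00, 0.30), (0.30, 0.55), (0.60, 1.00)]
predicted = [(0.02, 0.33), (0.31, 0.50), (0.60, 1.04)]
print(round(mean_boundary_error_ms(predicted, reference), 1))
```

With these toy timings the six boundary errors sum to 0.15 s, so the mean is about 25 ms.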

Everyone builds self-improvement loops in LLMs. I wonder what they could look like in ASR/TTS. Not many publications on that yet.

We compared three LALM judges against a calibrated human panel across 15 dimensions of speech quality. The LALMs tracked humans closely on relevance, answer quality, and instruction following—what was said—but were much less reliable on naturalness, emotion, pronunciation, and overall feel—how it was said. https://research.withdavid.ai/blog/lalm-as-judge-vs-hitl

Interesting math on speech LLMs
https://arxiv.org/abs/2604.08003v1
Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs
Yuan Xie, Jiaqi Song, Guang Qiu, Xianliang Wang, Ming Lei, Jie Gao, Jie Wu

Integrating large language models (LLMs) into automatic speech recognition (ASR) has become a dominant paradigm. Although recent LLM-based ASR models have shown promising performance on public benchmarks, it remains challenging to balance recognition quality with latency and overhead, while hallucinations further limit real-world deployment. In this study, we revisit LLM-based ASR from an entropy allocation perspective and introduce three metrics to characterize how training paradigms allocate entropy reduction between the speech encoder and the LLM. To remedy entropy-allocation inefficiencies in prevailing approaches, we propose a principled multi-stage training strategy grounded in capability-boundary awareness, optimizing parameter efficiency and hallucination robustness. Specifically, we redesign the pretraining strategy to alleviate the speech-text modality gap, and further introduce an iterative asynchronous SFT stage between alignment and joint SFT to preserve functional decoupling and constrain encoder representation drift. Experiments on Mandarin and English benchmarks show that our method achieves competitive performance with state-of-the-art models using only 2.3B parameters, while also effectively mitigating hallucinations through our decoupling-oriented design.

Interesting project
https://github.com/Xiaobin-Rong/unipase
UniPASE: A Generative Model for Universal Speech Enhancement with High Fidelity and Low Hallucinations
Xiaobin Rong, Zheng Wang, Yushi Wang, Jun Gao, Jing Lu

Universal speech enhancement (USE) aims to restore speech signals from diverse distortions across multiple sampling rates. We propose UniPASE, an extension of the low-hallucination PASE framework tailored for USE. At its core is DeWavLM-Omni, a unified representation-level enhancement module fine-tuned from WavLM via knowledge distillation on a large-scale supervised multi-distortion dataset. This module directly converts degraded waveforms into clean and linguistically faithful phonetic representations, ensuring robust enhancement with minimal linguistic hallucination. Based on these enhanced phonetic representations, an Adapter generates enhanced acoustic representations containing rich acoustic details, which a neural Vocoder uses to reconstruct corresponding high-fidelity 16 kHz waveforms. A PostNet then converts the waveforms to 48 kHz before resampling them to their original rates, enabling seamless handling of inputs and outputs at multiple sampling rates. Experimental results on several evaluation datasets, covering sub-tasks and full tasks, demonstrate that UniPASE achieves superior or competitive performance compared with existing state-of-the-art models. The proposed model also serves as the backbone of our submission to the URGENT 2026 Challenge, which achieved 1st place in the objective evaluation. The source code and audio demos are available at this https URL.

SGLang-Omni does a serious job of optimizing speech models (Higgs TTS too) https://x.com/YichiZ03/status/2078588932191895976

Interesting codec https://huggingface.co/Scicom-intl/WideCodec

Outstanding paper at ACL 2026
https://github.com/HITsz-TMG/Lychee-FD
Hierarchical Acoustic-Semantic Modeling: Modality Separation and Semantic Coherence for Full-Duplex SLM

https://huggingface.co/AutoArk-AI/Audio8-ASR-0.1B Audio8-ASR-0.1B is a compact autoregressive ASR model whose language-model component has only 0.1B parameters. It supports multilingual speech recognition for languages including Chinese, English, French, German, Japanese, Korean, and Cantonese. We position it as one of the smallest usable performance ASR models in the LLM era. The audio encoder backbone is based on Qwen3-ASR-0.6B, with the audio adapter and projector trained as part of Audio8-ASR. The language-model backbone is based on Ref-Pretrain-Qwen-104M.

Part of the news on Inkling is that it actually handles audio https://x.com/huckiyang/status/2077625513384841679

We've just released a Zipformer Tajik model (you can use it with sherpa-onnx) https://huggingface.co/alphacep/vosk-model-tg It's an initial version; we will keep working on it.

Great part from the MERL paper
https://real-tse.github.io/assets/pdf/MERL-SA-Track2.pdf

VII. METRIC ATTACK
Many speech separation and target speech extraction systems optimize evaluation metrics either explicitly or implicitly during training, including metrics such as SI-SDR and PESQ. From the perspective of Goodhart’s Law, this practice fundamentally compromises the validity of such metrics as evaluation tools: once a metric becomes an optimization target, it ceases to function as an independent measure of system quality. [...] Given the fragility of non-intrusive metrics as demonstrated by our attack and also shown in [37], [38], we suggest that the Challenge Organizers either remove DNSMOS and spksim when calculating the official ranking or replace them with alternative speech quality and speaker similarity metrics that were not attacked either advertently or inadvertently by submitted systems.
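For reference, SI-SDR, one of the metrics named in the excerpt, is a simple closed-form function of the waveforms, which is part of why it is easy to optimize directly during training. A minimal pure-Python sketch of the commonly used scale-invariant SDR definition (zero-mean signals, reference scaled by projection). It is not taken from the MERL report, and the sample signals are made up.

```python
import math


def si_sdr_db(estimate, reference):
    """Scale-invariant signal-to-distortion ratio in dB."""
    if len(estimate) != len(reference) or not reference:
        raise ValueError("signals must be non-empty and of equal length")
    # Remove the mean so that a DC offset does not affect the score.
    est_mean = sum(estimate) / len(estimate)
    ref_mean = sum(reference) / len(reference)
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]
    ref_energy = sum(r * r for r in ref)
    if ref_energy == 0.0:
        raise ValueError("reference must not be constant")
    # Project the estimate onto the reference to get the scaled target.
    scale = sum(e * r for e, r in zip(est, ref)) / ref_energy
    target = [scale * r for r in ref]
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(n * n for n in noise)
    if noise_energy == 0.0:
        return math.inf
    if target_energy == 0.0:
        return -math.inf
    return 10.0 * math.log10(target_energy / noise_energy)


# Toy signals: the estimate is the reference plus a small error.
reference_signal = [1.0, -1.0, 1.0, -1.0]
noisy_estimate = [1.1, -0.9, 0.9, -1.1]
print(si_sdr_db(noisy_estimate, reference_signal))
```

Rescaling the estimate by any nonzero constant leaves the score unchanged, which is the "scale-invariant" part.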

FAD optimization during training https://github.com/voidful/fd-speech
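Assuming FAD here means Fréchet Audio Distance, the quantity is the Fréchet distance between Gaussians fitted to embeddings of real and generated audio. A minimal sketch using a diagonal-covariance simplification; the full metric uses full covariance matrices and a matrix square root. This is not the fd-speech implementation, and the embeddings are toy values.

```python
import math


def _mean_and_var(embeddings):
    """Per-dimension mean and population variance of a list of vectors."""
    n = len(embeddings)
    dim = len(embeddings[0])
    means = [sum(vec[d] for vec in embeddings) / n for d in range(dim)]
    variances = [
        sum((vec[d] - means[d]) ** 2 for vec in embeddings) / n
        for d in range(dim)
    ]
    return means, variances


def frechet_distance_diag(real, generated):
    """Fréchet distance between two diagonal Gaussians fitted to the inputs."""
    mu_r, var_r = _mean_and_var(real)
    mu_g, var_g = _mean_and_var(generated)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    # With diagonal covariances, Tr(S_r + S_g - 2 (S_r S_g)^(1/2))
    # reduces to a sum of squared differences of standard deviations.
    cov_term = sum(
        (math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(var_r, var_g)
    )
    return mean_term + cov_term


# Toy 2-D "embeddings": same spread, means shifted by 1 in each dimension.
real_embeddings = [[0.0, 0.0], [2.0, 2.0]]
generated_embeddings = [[1.0, 1.0], [3.0, 3.0]]
print(frechet_distance_diag(real_embeddings, generated_embeddings))
```

Every operation here is differentiable in the generated embeddings (away from zero variance), which is what makes a distance of this form usable as a training signal in principle.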

Turkish-only TTS, so not very applicable to a wide audience, but an interesting design (only a 200M DiT + the 25 Hz VAE from VoxCPM2) https://github.com/freyavoiceai/FreyaTTS

https://betrac.github.io/
https://www.linkedin.com/feed/update/urn:li:activity:7482403480888975360/

The winners of the Beyond Transcription Challenge! 🏆

Lightweight track (<6B params, no tools)
1. TalTech
2. NTT-HI-CS
3. KUSLP

Heavyweight track (<36B params)
1. TalTech
2. KUSLP
3. NTT-HI-CS

🧪 The Challenge
Teams were given 1,100 hours of fully synthetic doctor-patient conversations with reference SOAP notes (conversations roleplayed by Gemma 3, notes generated by Kimi K2, from our Interspeech paper), a list of allowed open-weight models and datasets, and one goal: build the best end-to-end audio-to-SOAP-note system possible.

📊 The Results
Systems were evaluated with an automated medical Concept F1 scorer on held-out conversations from the same distribution. All three top teams converged on the same recipe: supervised fine-tuning on the references, followed by reinforcement learning with Concept F1 as the reward. Their systems are remarkable.

Crushing hallucinations: the best baselines and cascaded systems we evaluated (built from Qwen 3 and Whisper components) hallucinate on more than 20% of claims. The top competition systems brought that below 1%.

❓ But does it generalize?
The obvious objection: isn't this overfitting to synthetic data? And isn't Concept F1 a very limited metric? So we tested it. During evaluation, teams also generated notes for 272 human-acted medical dialogues, which were not permitted for training and had no reference SOAP notes. Across n = 19 submitted systems, we asked two questions:

1. Does synthetic performance predict real performance? Yes, almost exactly. Real-data Concept F1 tracks held-out synthetic Concept F1 with a slope of 0.89 (lightweight) and 0.94 (heavyweight), essentially the identity line (r = 0.97–1.00).
2. Does Concept F1 predict LLM-as-a-judge quality? (judge pipeline using Gemma 4) Yes: r = 0.83–0.87. The agreement is tightest among the strongest systems and fans out below ~0.35 Concept F1, so the metric is most trustworthy exactly where it matters.

The synthetic data approach looks like a genuinely promising way forward. Plenty of open questions remain, but "train on synthetic, deploy on real" held up here.
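A minimal sketch of a concept-level F1 in the spirit of the metric described in this post, assuming medical concepts have already been extracted from the generated and reference notes as sets of normalized strings. The challenge's actual scorer, its extraction step, and its matching rules are not described here, so the function and the example concepts are purely illustrative.

```python
def concept_f1(predicted_concepts, reference_concepts):
    """F1 over exact-match concept sets (a simplifying assumption)."""
    predicted = set(predicted_concepts)
    reference = set(reference_concepts)
    if not predicted and not reference:
        return 1.0  # nothing to find and nothing claimed
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


# Hypothetical concepts from one reference note and one generated note.
reference_note = {"hypertension", "chest pain", "metformin", "shortness of breath"}
generated_note = {"hypertension", "chest pain", "aspirin"}
print(round(concept_f1(generated_note, reference_note), 3))
```

In this toy case, "aspirin" is a hallucinated concept that lowers precision, and the two missed concepts lower recall, giving an F1 of 4/7.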

So Hugging Face still has trouble adding the ensemble model to the leaderboard even though there is a pull request. At the same time, they added Modulate immediately after its release.
The Modulate CTO claims they trained the model on 500M hours of speech
https://www.linkedin.com/feed/update/urn:li:activity:7481395636882444288/
Modulate beats Azure by just 0.01 in WER and places only #4 on the private leaderboard.
Scaling doesn't work, it seems.
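For context on how small a WER gap can be, WER is the word-level edit distance divided by the number of reference words. A minimal sketch of the standard computation; leaderboards typically normalize text before scoring, which this sketch skips, and the example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One deleted word ("to") and one substituted word ("thursday" -> "tuesday").
ref_text = "we need to reschedule the thursday meeting"
hyp_text = "we need reschedule the tuesday meeting"
print(round(word_error_rate(ref_text, hyp_text), 3))
```

Two errors against seven reference words give a WER of 2/7 on this toy pair.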

Some recent audio annotation and TTS fine-tuning projects from LAION, with complicated pipelines.

https://github.com/LAION-AI/univeral-audio-annotation-pipeline
Produces structured JSON annotations from any audio file, covering speech transcription, speaker diarization, emotions, vocal bursts, sound effects, and music.
Best configuration, Gemma-12B + DiCoW: Nemotron 3.5 words + VibeVoice/Sortformer diarization + DiCoW overlap-aware ASR, fused by a text-only Gemma-4-12B LLM (no audio in the final step). It is the highest-Reward pipeline on SoundScape-Bench (0.253), rank 3 of all systems, nearly matching Gemini 3.5 Flash (0.256) and ahead of every other pipeline. (It trades precision for that recall: see the tradeoff note.)

https://github.com/LAION-AI/laionbox
LaionBox fine-tunes the DramaBox flow-matching transformer using LoRA (rank=128) with 6 differentiable auxiliary losses that push generated audio toward higher naturalness, quality, and voice cloning fidelity:
CLAP Naturalness: Maximizes perceptual naturalness via VoiceCLAP text similarity
Quality MLP: Binary classifier trained to distinguish real from synthetic audio
Centroid Real/Fake: Distribution matching toward real speech embeddings
Speaker Similarity: WavLM-SV voice identity preservation
Comb Filter Detector: Latent-space CNN detecting interference artifacts
Artifact Detector V2: Residual CNN for general artifact detection

https://real-tse.github.io/challenge/ Challenge results and reports. For example, MERL took 1st place in offline target speaker extraction: https://real-tse.github.io/assets/pdf/MERL-SA-Track2.pdf

Claude can rewrite Kaldi into a very tiny Rust codebase: https://github.com/Reza2kn/Vosk-Rust and even add new features like quantization.