Speech Technology

China: 59 354 · Technology and apps: 34 157
Subscribers: 1 647 (24 hours: no data; 7 days: +6; 30 days: +21)
Post views: 1 001 (24 hours: ~478; 48 hours: ~532)
Engagement rate: 60.78%
Posts per day: no data

Post archive

The NAR-CTC approach is interesting, and fast.
https://www.linkedin.com/feed/update/urn:li:activity:7455374206114115585/
The IBM worldwide speech research team is super excited to announce the release of three Granite 4.1 speech models (links in first comment) that explore different tradeoffs in terms of accuracy, throughput, and functionality:
⛰ granite-speech-4.1-2b (codename "Altius") adds punctuation and truecasing capabilities while also improving multilingual ASR accuracy via a novel dual-head graphemic/BPE CTC encoder with frame importance sampling
💪 granite-speech-4.1-2b-plus (codename "Fortius") adds speaker-attributed ASR, word-level time-stamps and improved support for long-form ASR
⚡️ granite-speech-4.1-2b-nar (codename "Citius") uses a new non-autoregressive architecture that performs parallel local edits to the CTC encoder hypothesis resulting in much higher throughput
🥇 Models 1 and 3 claimed first and third place on the #OpenASR leaderboard.
This was truly a worldwide team effort and I would like to express my sincere gratitude to all my IBM colleagues who made this possible: Samuel Thomas, Vishal Sunder, PhD, Jeff Kuo, Brian Kingsbury, Avihu Dekel, Zvi Kons, Hagai Aronowitz, Ron Hoory, Sashi Novitasari, Takashi Fukuda, Tohru Nagano, Masayuki Suzuki, Gakuto KURATA, Madison Lee, Luis Lastras and many more (apologies if I missed you!).
🔬 I hope to see many of you next week at #ICASSP2026 in Barcelona to chat about these models and share some funny anecdotes.
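For readers unfamiliar with the "CTC encoder hypothesis" that the NAR model is said to edit, here is a minimal sketch of standard greedy CTC decoding: take the best label per frame, merge consecutive repeats, drop blanks. This is not IBM's editing architecture, which the post does not describe in detail. The blank index (`BLANK = 0`), the function name, and the toy scores are all assumptions made for illustration.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol (illustrative)


def ctc_greedy_decode(frame_scores, blank=BLANK):
    """Collapse per-frame label scores into a label sequence (greedy CTC)."""
    # Pick the best-scoring label independently for every frame.
    best_per_frame = [
        max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores
    ]
    # Merge runs of identical labels, then drop the blanks.
    collapsed = [label for label, _ in groupby(best_per_frame)]
    return [label for label in collapsed if label != blank]


# Toy scores for 5 frames over 3 labels (0 = blank); values are made up.
frames = [
    [0.1, 0.8, 0.1],    # label 1
    [0.2, 0.7, 0.1],    # label 1 again, merged with the previous frame
    [0.9, 0.05, 0.05],  # blank
    [0.1, 0.7, 0.2],    # label 1, kept because a blank separates it
    [0.1, 0.1, 0.8],    # label 2
]
print(ctc_greedy_decode(frames))  # [1, 1, 2]
```

Every frame is decoded independently, which is why a CTC hypothesis is cheap to produce and a natural starting point for a non-autoregressive refinement step.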

The code for Facebook's LST is finally available https://github.com/facebookresearch/lst https://t.me/speechtech/2195

For a long time AudioSet was a big pain to download; it is finally available on HF https://huggingface.co/datasets/agkphysics/AudioSet Overall, even small speech models need to understand non-speech sounds better. More on this later.

https://www.deepl.com/en/press-release/deepl_launches_voice_api_for_real_time_speech_transcription_and_translation

The Beyond Transcription Challenge, an IEEE SLT 2026 shared task tackling a foundational question in audio AI: can a model reason over speech without first converting it to text?
The research question: Current speech models still struggle to extract meaning directly from audio, especially when the signal includes overlapping speakers, ambient sounds, and room acoustics. Clinical note generation from doctor-patient conversations is an ideal stress test for this: it demands that a model attend to who said what, filter environmental noise, and produce faithful structured output. Yet on the Synth-DoPaCo dataset, end-to-end models hallucinate at alarming rates, with 99–100% of clinical claims unsupported by the source audio, compared to just 21–23% for traditional transcribe-then-summarize pipelines.
BeTraC is a shared evaluation challenge aimed at closing this gap by advancing the technology.
Two competition tracks:
- Lightweight (≤ 6B params): Single end-to-end model, one invocation. Audio in, SOAP note out.
- Heavyweight (≤ 36B params): Tools and agents allowed. Only the final model generates text from audio.
The Synth-DoPaCo dataset: 8,800 synthetic doctor-patient conversations (~1,329 hrs), 66 ambient sound classes, room reverberation, Opus compression. Available now on Hugging Face.
Key dates:
- May 4, 2026: Open-Source Inclusion Proposals Deadline
- June 24, 2026: System submission deadline
- July 8, 2026: Challenge paper due
Data is live. Baselines are posted. Team registration is open. If you work on speech, audio understanding, or multimodal AI, we'd love to have you compete.
👇 https://betrac.github.io
Group: betrac@googlegroups.com
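To make the headline numbers concrete, here is a small sketch of how an "unsupported claim" rate like the 99–100% vs 21–23% figures could be computed, assuming each clinical claim in a generated note has already been judged supported or unsupported. The announcement does not say how the challenge scores this, so the function name and the judgments are hypothetical. The second part derives the average conversation length from the dataset figures quoted in the post.

```python
def unsupported_claim_rate(claim_supported):
    """Return the fraction of claims judged unsupported by the source audio."""
    if not claim_supported:
        raise ValueError("need at least one claim")
    unsupported = sum(1 for supported in claim_supported if not supported)
    return unsupported / len(claim_supported)


# Hypothetical judgments for one generated SOAP note: True means supported.
judgments = [True, False, False, True, False]
print(f"{unsupported_claim_rate(judgments):.0%}")  # 60%

# Average conversation length from the quoted figures (~1,329 h, 8,800 dialogs).
avg_minutes = 1329 * 60 / 8800
print(round(avg_minutes, 1))  # 9.1
```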

https://huggingface.co/nvidia/parakeet-unified-en-0.6b

https://github.com/OpenMOSS/MOSS-TTS-Nano 0.1B params, yet it still supports many languages

The question is when we will see a 32B speech model https://huggingface.co/KRAFTON/Raon-Speech-9B

Rissa Cao, FishAudio CEO: a bit of marketing, but a very valid point on the importance of data and the lack of real high-quality data for speech systems https://www.linkedin.com/feed/update/urn:li:activity:7448399470251356160/
Early on, we made a mistake. We trained our TTS model on whatever voice data we could find online. It sounded great on podcasts. But terrible for creation, companionship, anime dubbing. Everything fell apart. The data distribution was wrong.

Recently I was fighting with OOM in an ASR service, so this was just on time and very relevant:
Omni Model Inference: How We Move Tensors Between Stages https://x.com/GenAI_is_real/status/2041723531357196471

Interesting 1.6M TTS engine, based on StyleTTS https://github.com/tronghieuit/tiny-tts

Useful, since it is trained on rare private SFX data https://github.com/SonyResearch/Woosh

VoxCPM2 is the latest major release — a 2B parameter model trained on over 2 million hours of multilingual speech data, now supporting 30 languages, Voice Design, Controllable Voice Cloning, and 48kHz studio-quality audio output. Built on a MiniCPM-4 backbone. https://github.com/OpenBMB/VoxCPM

https://github.com/LilDevsy0117/Ultra-Sortformer

Good talk on SpeechLMs: https://www.youtube.com/watch?v=m65SiSnsZ3g
It explains the paper below. Basically, at different points in time one has to pick different layers of the text LM for the adapters: word boundaries require more linguistic knowledge, the middle of a word more acoustic knowledge. Adjusting the adapters accordingly gives big improvements.
https://arxiv.org/abs/2503.06211
Late Fusion and Multi-Level Fission Amplify Cross-Modal Transfer in Text-Speech LMs
Santiago Cuervo, Adel Moumen, Yanis Labrak, Sameer Khurana, Antoine Laurent, Mickael Rouvier, Phil Woodland, Ricard Marxer
Text-Speech Language Models (TSLMs) -- language models trained to jointly process and generate text and speech -- are commonly trained through an early modality fusion/fission approach, in which both modalities are fed and predicted from a shared backbone via linear layers. We hypothesize that this approach limits cross-modal transfer by neglecting feature compositionality -- specifically, the finer-grained nature of speech representations compared to text -- preventing the emergence of a shared feature hierarchy within model layers. In this paper, we argue that this limitation can be addressed through late fusion and fission, with a fission process that accesses both high- and low-level features for speech generation. Our models implementing these principles, SmolTolk, rival or surpass state-of-the-art TSLMs trained with orders of magnitude more compute, and achieve significantly improved cross-modal performance relative to early fusion/fission baselines. Representation analyses further suggest that our method enhances the model's ability to abstract higher-level, more semantic features from speech, and leads to increasingly shared representation spaces across layers.
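The idea of picking different layers at different positions can be illustrated with a toy per-position mixture of hidden states. This is only a sketch of the general idea and not the paper's actual fission module. The function names, shapes, and numbers are all invented for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_layers(layer_states, gate_logits):
    """Mix hidden states from several layers with per-position weights.

    layer_states: [num_layers][num_positions][dim] hidden states.
    gate_logits:  [num_positions][num_layers] unnormalized layer preferences.
    Returns [num_positions][dim] mixed features.
    """
    dim = len(layer_states[0][0])
    mixed = []
    for t, logits in enumerate(gate_logits):
        weights = softmax(logits)  # one weight per layer, summing to 1
        mixed.append([
            sum(w * layer_states[l][t][d] for l, w in enumerate(weights))
            for d in range(dim)
        ])
    return mixed


# Two layers, two positions, two features (all values invented).
low_level = [[1.0, 0.0], [1.0, 0.0]]   # stands in for an early, more acoustic layer
high_level = [[0.0, 1.0], [0.0, 1.0]]  # stands in for a late, more linguistic layer
gates = [
    [10.0, -10.0],  # position 0 strongly prefers the early layer
    [0.0, 0.0],     # position 1 weighs both layers equally
]
mixed = mix_layers([low_level, high_level], gates)
print(mixed)
```

In a trained system the gate logits would be predicted by the model rather than fixed by hand. Here they only show that each position can draw on a different depth of the backbone.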

https://github.com/k2-fsa/OmniVoice

An interesting community on Reddit, https://www.reddit.com/r/VoiceAutomationAI/, will host an AMA session with Tony Robinson, one of the most knowledgeable people I know.
Upcoming AMA with Dr Tony Robinson (Founder Speechmatics)
Excited to announce that Dr Tony Robinson will be joining Unio - The Voice AI Community powered by SLNG for a live AMA with builders & founders. If you’re building voice AI, you already know this: it works in demos… and breaks in production. Dr Tony has spent 36+ years in Voice AI, starting in 1989 at Cambridge where he built one of the earliest neural network based speech recognition systems, long before deep learning became mainstream. Today, Speechmatics powers voice AI across 50+ languages, with customers seeing 9x growth in voice agent adoption in 2025.
📅 Date: 27 March
⏰ Time: 10:30 AM PST / 11:00 PM IST
📍 Location: Reddit (r/VoiceAutomationAI)
For the next 24 hours, he’ll be answering questions about:
• What actually breaks in production voice AI (and how to fix it)
• Accents, noise, latency & real-world edge cases
• Designing reliable STT-LLM-TTS pipelines
• Lessons from 35+ years building speech systems
• Where voice AI is really heading (beyond the hype)
• What he’d do differently if starting today
If you're building in Voice AI, AI agents, or conversational automation, this is a rare opportunity to learn from someone who has been solving these problems for decades. Join the reddit community to drop questions👇 Link in the first comment.

Just another reminder that there is no point in ONNX: https://github.com/eschmidbauer/moonshine-c The source is pure C, 825 lines of code, and the executable is 40 KB. It runs ASR just fine.

DiTs power modern TTS systems, but their issues are rarely mentioned: longer training time and higher data requirements. Convolutions still make sense given that the data is locally uniform. Research like this still makes sense for us GPU-poor guys https://arxiv.org/abs/2603.09408v1
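One way to read the "locally uniform" argument is that the same local pattern looks the same wherever it occurs in the signal, so a single small shared kernel can handle every position. The sketch below, a plain 1D convolution with made-up values, shows that shifting the input simply shifts the output. It is an illustration of that reading and is not taken from the linked paper.

```python
def conv1d(signal, kernel):
    """Valid-mode 1D cross-correlation: one shared kernel slides over the signal."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


kernel = [1.0, -2.0, 1.0]            # 3 weights, regardless of signal length
pattern = [0.0, 1.0, 3.0, 1.0, 0.0]  # a made-up local event

early = pattern + [0.0] * 4          # event at the start
late = [0.0] * 4 + pattern           # same event, 4 samples later

out_early = conv1d(early, kernel)
out_late = conv1d(late, kernel)
print(out_early)  # [1.0, -4.0, 1.0, 1.0, 0.0, 0.0, 0.0]
print(out_late)   # [0.0, 0.0, 0.0, 1.0, 1.0, -4.0, 1.0]
```

The response to the event is identical in both outputs, only moved by 4 samples, and the parameter count stays at 3. That built-in prior is one plausible reason convolutional models can get by with less data and compute.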

Nice upsampler: trained for music, supports upsampling from 8 kHz (important) https://github.com/woongzip1/UniverSR