📝 Publications
🎙 Text-to-Speech Generation

Emilia-NV: A Non-Verbal Speech Dataset with Word-Level Annotation for Human-Like Speech Modeling
Huan Liao★, Qinke Ni★, Yuancheng Wang, Yiheng Lu, Haoyue Zhan, Pengyuan Xie, Qiang Zhang, Zhizheng Wu†
[Project] [Paper] [Dataset] [Code]
- The first large-scale dataset with word-level paralinguistic vocalization annotations (573 hours; 174,179 utterances).
- NVASR, a paralinguistic-aware ASR framework, scales fine-grained automatic annotation.
- Enabled controllable zero-shot TTS (CV2@Emilia-NV) with explicit token-level vocalization control for expressive speech synthesis.

SpeechJudge: Towards Human-Level Judgment for Speech Naturalness
Xueyao Zhang, Chaoren Wang, Huan Liao, Ziniu Li, Yuancheng Wang, Li Wang, Dongya Jia, Yuanzhe Chen, Xiulin Li, Zhuo Chen, Zhizheng Wu†
[Project] [Paper] [Dataset&Code]
- Established a closed-loop framework for speech naturalness alignment, unifying large-scale human preference data (SpeechJudge-Data), a challenging evaluation benchmark (SpeechJudge-Eval), and a generative reward model (SpeechJudge-GRM).

🎙 Audio Generation

BATON: Aligning Text-to-Audio Model with Human Preference Feedback
Huan Liao★, Haonan Han★, Kai Yang, Tianjiao Du, Rui Yang, Zunnan Xu, Qinmei Xu, Jingquan Liu, Jiasheng Lu, Xiu Li†
[Project] [Paper] [Dataset&Code]
- The first text-to-audio (TTA) system fine-tuned with human preference feedback.
- Curated a dataset of prompts paired with their generated audio, annotated with human feedback.
- Addressed the semantic omission of audio events and temporal disarray using a weighted preference strategy.

Controllable Text-to-Audio Generation with Training-Free Temporal Guidance Diffusion
Tianjiao Du, Jun Chen, Jiasheng Lu, Qinmei Xu, Huan Liao, Yupeng Chen, Zhiyong Wu†
- A training-free approach to controllable TTA generation, conditioned on the location and duration of the corresponding sound events.

Rhythmic Foley: A Framework for Seamless Audio-Visual Alignment in Video-to-Audio Synthesis
Zhiqi Huang★, Dan Luo★, Jun Wang†, Huan Liao, Zhiheng Li†, Zhiyong Wu†
- A framework for video-to-audio synthesis that preserves semantic integrity and achieves precise beat point synchronization.

🧙 3D/Motion Generation

AToM: Aligning Text-to-Motion Model at Event-Level with GPT-4Vision Reward
Haonan Han★, Xiangzuo Wu★, Huan Liao★, Zunnan Xu, Ronghui Li, Yachao Zhang†, Xiu Li†
- Enhances event-level alignment between generated motion and text prompts by leveraging a reward from GPT-4Vision.