Oral
-
Improving Musical Accompaniment Co-creation via Diffusion Transformers
Javier Nistal, Marco Pasini, Stefan Lattner
-
AV-DiT: Efficient Audio-Visual Diffusion Transformer for Joint Audio and Video Generation
Kai Wang, Shijian Deng, Jing Shi, Dimitrios Hatzinakos, Yapeng Tian
-
AudioSetCaps: Enriched Audio Captioning Dataset Generation Using Large Audio Language Models
Jisheng Bai, Haohe Liu, Mou Wang, Dongyuan Shi, Wenwu Wang, Mark D Plumbley, Woon-Seng Gan, Jianfeng Chen
-
LOCKEY: A Novel Approach to Model Authentication and Deepfake Tracking
Mayank Kumar Singh, Naoya Takahashi, Wei-Hsiang Liao, Yuki Mitsufuji
-
BLAP: Bootstrapping Language-Audio Pre-training for Music Captioning
Luca A Lanzendörfer, Constantin Pinkl, Nathanaël Perraudin, Roger Wattenhofer
-
Improving Source Extraction with Diffusion and Consistency Models
Tornike Karchkhadze, Mohammad Rasool Izadi, Shuo Zhang
Poster
Poster Presentation Session 1
Date: Dec 14th, 2024
Time: 11:00 am to 12:15 pm
Location: Meeting 114, 115
-
MusicScore: A Dataset for Music Score Modeling and Generation
Yuheng Lin, Zheqi Dai, Qiuqiang Kong
-
Improving Musical Accompaniment Co-creation via Diffusion Transformers
Javier Nistal, Marco Pasini, Stefan Lattner
-
Internalizing ASR with Implicit Chain of Thought for Efficient Speech-to-Speech Conversational LLM
Robin Shing-Hei Yuen, Timothy Tin-Long Tse, Jian Zhu
-
Latent Diffusion Model for Audio: Generation, Quality Enhancement, and Neural Audio Codec
Haohe Liu, Wenwu Wang, Mark D Plumbley
-
Continuous Autoregressive Models with Noise Augmentation Avoid Error Accumulation
Marco Pasini, Javier Nistal, Stefan Lattner, George Fazekas
-
A Closer Look at Neural Codec Resynthesis: Bridging the Gap between Codec and Waveform Generation
Alexander H. Liu, Qirui Wang, Yuan Gong, James R. Glass
-
Improving Source Extraction with Diffusion and Consistency Models
Tornike Karchkhadze, Mohammad Rasool Izadi, Shuo Zhang
-
SNAC: Multi-Scale Neural Audio Codec
Hubert Siuzdak, Florian Grötschla, Luca A Lanzendörfer
-
High Fidelity Text-Guided Music Editing via Single-Stage Flow Matching
Gael Le Lan, Bowen Shi, Zhaoheng Ni, Sidd Srinivasan, Anurag Kumar, Brian Ellis, David Kant, Varun K. Nagaraja, Ernie Chang, Wei-Ning Hsu, Yangyang Shi, Vikas Chandra
-
Improving Voice Quality in Speech Anonymization With Just Perception-Informed Losses
Suhita Ghosh, Tim Thiele, Frederic Lorbeer, Sebastian Stober
-
Taemin Kim, Wooyeol Baek, Heeseok Oh
-
Do music LLMs learn symbolic concepts? A pilot study using probing and intervention
Wenye Ma, Xinyue Li, Gus Xia
-
One-shot Text-aligned Virtual Instrument Generation Utilizing Diffusion Transformer
Qihui Yang, Jiahe Lei, Qiuqiang Kong
-
VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment
Bing Han, Long Zhou, Shujie Liu, Sanyuan Chen, Lingwei Meng, Yanmin Qian, Eric Liu, Sheng Zhao, Jinyu Li, Furu Wei
-
Three-modal guidance for symbolic music generation: melody, structure, texture
Daniel Alexander Lucht, David Philip Leins, Dimitri von Rütte, Alexandra Moringen
-
Decoding Musical Perception: Music Stimuli Reconstruction from Brain Activity
Matteo Ciferri, Matteo Ferrante, Nicola Toschi
-
Challenge on Sound Scene Synthesis: Evaluating Text-to-Audio Generation
Junwon Lee, Modan Tailleur, Mathieu Lagrange, Keunwoo Choi, Laurie M. Heller, Brian McFee, Keisuke Imoto, Yuki Okamoto
-
Artem Sokolov, Swapnil Bhosale, Xiatian Zhu
-
Enis Berk Çoban, Michael I Mandel, Johanna Devaney
-
Neural Audio Codec for Latent Music Representations
Luca A Lanzendörfer, Florian Grötschla, Amir Dellali, Roger Wattenhofer
-
Kazuki Yamauchi, Wataru Nakata, Yuki Saito, Hiroshi Saruwatari
-
Contextual Speech Emotion Recognition with Large Language Models and ASR-Based Transcriptions
Enshi Zhang, Christian Poellabauer
-
Parrot: Autoregressive Spoken Dialogue Language Modeling with Decoder-only Transformers
Ziqiao Meng, Qichao Wang, Wenqian Cui, Yifei Zhang, Bingzhe Wu, Irwin King, Liang Chen, Peilin Zhao
-
Articulatory Synthesis of Speech and Diverse Vocal Sounds via Optimization
Luke Mo, Manuel Cherep, Nikhil Singh, Quinn Langford, Patricia Maes
Poster Presentation Session 2
Date: Dec 14th, 2024
Time: 04:15 pm to 05:30 pm
Location: Meeting 114, 115
-
Xinhao Mei, Gael Le Lan, Haohe Liu, Zhaoheng Ni, Varun K. Nagaraja, Anurag Kumar, Yangyang Shi, Vikas Chandra
-
AudioSetCaps: Enriched Audio Captioning Dataset Generation Using Large Audio Language Models
Jisheng Bai, Haohe Liu, Mou Wang, Dongyuan Shi, Wenwu Wang, Mark D Plumbley, Woon-Seng Gan, Jianfeng Chen
-
Disentangling Multi-instrument Music Audio for Source-level Pitch and Timbre Manipulation
Yin-Jyun Luo, Kin Wai Cheuk, Woosung Choi, Wei-Hsiang Liao, Keisuke Toyama, Toshimitsu Uesaka, Koichi Saito, Chieh-Hsin Lai, Yuhta Takida, Simon Dixon, Yuki Mitsufuji
-
DART: Disentanglement of Accent and Speaker Representation in Multispeaker Text-to-Speech
Jan Melechovsky, Ambuj Mehrish, Berrak Sisman, Dorien Herremans
-
Contrastive Lyrics Alignment with a Timestamp-Informed Loss
Timon Kick, Florian Grötschla, Luca A Lanzendörfer, Roger Wattenhofer
-
DGFM: Full Body Dance Generation Driven by Music Foundation Models
Xinran Liu, Zhenhua Feng, Diptesh Kanojia, Wenwu Wang
-
LOCKEY: A Novel Approach to Model Authentication and Deepfake Tracking
Mayank Kumar Singh, Naoya Takahashi, Wei-Hsiang Liao, Yuki Mitsufuji
-
AV-DiT: Efficient Audio-Visual Diffusion Transformer for Joint Audio and Video Generation
Kai Wang, Shijian Deng, Jing Shi, Dimitrios Hatzinakos, Yapeng Tian
-
Multi-Source Music Generation with Latent Diffusion
Zhongweiyang Xu, Debottam Dutta, Yu-Lin Wei, Romit Roy Choudhury
-
Coarse-to-Fine Text-to-Music Latent Diffusion
Luca A Lanzendörfer, Tongyu Lu, Nathanaël Perraudin, Dorien Herremans, Roger Wattenhofer
-
MLADDC: Multi-Lingual Audio Deepfake Detection Corpus
Arth Juhul Shah, Ravindrakumar M. Purohit, Dharmendra H. Vaghera, Hemant Patil
-
BLAP: Bootstrapping Language-Audio Pre-training for Music Captioning
Luca A Lanzendörfer, Constantin Pinkl, Nathanaël Perraudin, Roger Wattenhofer
-
Vision Language Models Are Few-Shot Audio Spectrogram Classifiers
Satvik Dixit, Laurie Heller, Chris Donahue
-
FSD: Acoustic Echo Cancellation with Fewer Step Diffusion
Yang Liu, Li Wan, Yiteng Huang, Ming Sun, Changsheng Zhao, Zhaoheng Ni, Xinhao Mei, Yangyang Shi, Florian Metze
-
Benchmarking Music Generation Models and Metrics via Human Preference Studies
Ahmet Solak, Florian Grötschla, Luca A Lanzendörfer, Roger Wattenhofer
-
Spatially-Aware Losses for Enhanced Neural Acoustic Fields
Christopher A. Ick, Gordon Wichern, Yoshiki Masuyama, François Germain, Jonathan Le Roux
-
Diffusion-based Speech Enhancement: Demonstration of Performance and Generalization
Julius Richter, Timo Gerkmann
-
Style Mixture of Experts for Expressive Text-To-Speech Synthesis
Ahad Jawaid, Shreeram Suresh Chandra, Junchen Lu, Berrak Sisman
-
Sound-VECaps: Improving Audio Generation With Visual Enhanced Captions
Yi Yuan, Dongya Jia, Xiaobin Zhuang, Yuanzhe Chen, Zhengxi Liu, Zhuo Chen, Yuping Wang, Yuxuan Wang, Xubo Liu, Xiyuan Kang, Mark D Plumbley, Wenwu Wang
-
Generating Vocals from Lyrics and Musical Accompaniment
Georg Streich, Luca A Lanzendörfer, Florian Grötschla, Roger Wattenhofer
-
Text-to-Audio Generation via Bridging Audio Language Model and Latent Diffusion
Zhenyu Wang, Chenxing Li, Yong Xu, Chunlei Zhang, John H. L. Hansen, Dong Yu
-
SoundCTM: Uniting Score-based and Consistency Models for Text-to-Sound Generation
Koichi Saito, Dongjun Kim, Takashi Shibuya, Chieh-Hsin Lai, Zhi Zhong, Yuhta Takida, Yuki Mitsufuji
-
LoVA: Long-form Video-to-Audio Generation
Xin Cheng, Xihua Wang, Yihan Wu, Yuyue Wang, Ruihua Song
-
Text Prompt is Not Enough: Sound Event Enhanced Prompt Adapter for Target Style Audio Generation
Chenxu Xiong, Ruibo Fu, Shuchen Shi, Zhengqi Wen, Tao Wang, Chenxing Li, Chunyu Qiang, Yuankun Xie, Xin Qi, Guanjun Li, Zizheng Yang