X. Wang

Please Note

This page displays the records of the person named above and is not linked to a unique person identifier. This record may need to be merged to a profile.

8 records found: 5 conference papers, 3 journal articles

AnyoneNet: Synchronized Speech and Talking Head Generation for Arbitrary Persons

Journal article (2023) - Xinsheng Wang, Qicong Xie, Lei Xie, Jihua Zhu, Odette Scharenborg

Automatically generating videos in which synthesized speech is synchronized with lip movements in a talking head has great potential in many human-computer interaction scenarios. In this paper, we present an automatic method to generate synchronized speech and talking-head videos ...

Generating Images from Spoken Descriptions

Journal article (2021) - Xinsheng Wang, Tingting Qiao, Jihua Zhu, Alan Hanjalic, Odette Scharenborg

Text-based technologies, such as text translation from one language to another, and image captioning, are gaining popularity. However, approximately half of the world's languages are estimated to be lacking a commonly used written form. Consequently, these languages cannot benefi ...

Synthesizing Spoken Descriptions of Images

Journal article (2021) - Xinsheng Wang, Justin van der Hout, Jihua Zhu, Mark Hasegawa-Johnson, Odette Scharenborg

Image captioning technology has great potential in many scenarios. However, current text-based image captioning methods cannot be applied to approximately half of the world's languages due to these languages’ lack of a written form. To solve this problem, recently the image-to-sp ...

Align or attend? Toward More Efficient and Accurate Spoken Word Discovery Using Speech-to-Image Retrieval

Conference paper (2021) - Liming Wang, Xinsheng Wang, Mark Hasegawa-Johnson, Odette Scharenborg, Najim Dehak

Multimodal word discovery (MWD) is often treated as a byproduct of the speech-to-image retrieval problem. However, our theoretical analysis shows that some kind of alignment/attention mechanism is crucial for a MWD system to learn meaningful word-level representation. We verify o ...

Learning fine-grained semantics in spoken language using visual grounding

Conference paper (2021) - Xinsheng Wang, Tian Tian, Jihua Zhu, Odette Scharenborg

In the case of unwritten languages, acoustic models cannot be trained in the standard way, i.e., using speech and textual transcriptions. Recently, several methods have been proposed to learn speech representations using images, i.e., using visual grounding. Existing studies have ...

Show and speak: Directly synthesize spoken description of images

Conference paper (2021) - Xinsheng Wang, Siyuan Feng, Jihua Zhu, Mark Hasegawa-Johnson, Odette Scharenborg

This paper proposes a new model, referred to as the show and speak (SAS) model that, for the first time, is able to directly synthesize spoken descriptions of images, bypassing the need for any text or phonemes. The basic structure of SAS is an encoder-decoder architecture that t ...

Multimodal fusion of body movement signals for no-audio speech detection

Conference paper (2020) - Xinsheng Wang, Jihua Zhu, Odette Scharenborg

No-audio Multimodal Speech Detection is one of the tasks in MediaEval 2020, with the goal to automatically detect whether someone is speaking in social interaction on the basis of body movement signals. In this paper, a multimodal fusion method, combining signals obtained by an ...

S2IGAN: Speech-to-Image Generation via Adversarial Learning

Conference paper (2020) - Xinsheng Wang, Tingting Qiao, Jihua Zhu, Alan Hanjalic, Odette Scharenborg

An estimated half of the world’s languages do not have a written form, making it impossible for these languages to benefit from any existing text-based technologies. In this paper, a speech-to-image generation (S2IG) framework is proposed which translates speech descriptions to p ...