
Conference Paper/Proceeding/Abstract

Spoken Document Retrieval for an Unwritten Language: A Case Study on Gormati

Sanjay Booshanam, Kelly Chen, Ondrej Klejch, Thomas Reitmaier, Dani Kalarikalayil Raju, Electra Wallington, Nina Markl, Jen Pearson, Matt Jones, Simon Robinson, Peter Bell

Findings of the Association for Computational Linguistics: EMNLP 2025, Pages: 22497 - 22509

Swansea University Authors: Thomas Reitmaier, Jen Pearson, Matt Jones, Simon Robinson

  • 70213.pdf

    PDF | Accepted Manuscript

    Author accepted manuscript document released under the terms of a Creative Commons CC-BY licence using the Swansea University Research Publications Policy (rights retention).



Published in: Findings of the Association for Computational Linguistics: EMNLP 2025
ISBN: 979-8-89176-335-7
Published: Suzhou, China: Association for Computational Linguistics, 2025
Online Access: https://aclanthology.org/2025.findings-emnlp.1224/
URI: https://cronfa.swan.ac.uk/Record/cronfa70213
Abstract: Speakers of unwritten languages have the potential to benefit from speech-based automatic information retrieval systems. This paper proposes a speech embedding technique that facilitates such a system that can be used in a zero-shot manner on the target language. After conducting development experiments on several written Indic languages, we evaluate our method on a corpus of Gormati – an unwritten language – that was previously collected in partnership with an agrarian Banjara community in Maharashtra State, India, specifically for the purposes of information retrieval. Our system achieves a Top 5 retrieval rate of 87.9% on this data, giving hope that it may be usable by unwritten language speakers worldwide.
College: Faculty of Science and Engineering
Start Page: 22497
End Page: 22509