Conference Paper/Proceeding/Abstract
Spoken Document Retrieval for an Unwritten Language: A Case Study on Gormati
Findings of the Association for Computational Linguistics: EMNLP 2025, Pages: 22497 - 22509
Swansea University Authors:
Thomas Reitmaier, Jen Pearson, Matt Jones, Simon Robinson
PDF | Accepted Manuscript
Author accepted manuscript document released under the terms of a Creative Commons CC-BY licence using the Swansea University Research Publications Policy (rights retention).
| Published in: | Findings of the Association for Computational Linguistics: EMNLP 2025 |
|---|---|
| ISBN: | 979-8-89176-335-7 |
| Published: | Suzhou, China; Association for Computational Linguistics; 2025 |
| Online Access: | https://aclanthology.org/2025.findings-emnlp.1224/ |
| URI: | https://cronfa.swan.ac.uk/Record/cronfa70213 |
| Abstract: | Speakers of unwritten languages have the potential to benefit from speech-based automatic information retrieval systems. This paper proposes a speech embedding technique that facilitates such a system and that can be used in a zero-shot manner on the target language. After conducting development experiments on several written Indic languages, we evaluate our method on a corpus of Gormati, an unwritten language, that was previously collected in partnership with an agrarian Banjara community in Maharashtra State, India, specifically for the purposes of information retrieval. Our system achieves a Top 5 retrieval rate of 87.9% on this data, giving hope that it may be usable by speakers of unwritten languages worldwide. |
| College: | Faculty of Science and Engineering |
| Start Page: | 22497 |
| End Page: | 22509 |
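
The abstract reports a Top 5 retrieval rate, which is the fraction of spoken queries whose relevant document appears among the five highest-ranked results. This record does not describe how the paper embeds or scores speech, so the following is only a minimal sketch of how such a metric could be computed. It assumes fixed-length embedding vectors and cosine-similarity ranking; the function names and the toy data are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed scoring function)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_retrieval_rate(queries, documents, relevant, k=5):
    """Fraction of queries whose relevant document index is ranked in the top k.

    queries   -- list of query embedding vectors
    documents -- list of document embedding vectors
    relevant  -- relevant[i] is the index of the document that answers queries[i]
    """
    if not queries:
        raise ValueError("at least one query is required")
    hits = 0
    for q_idx, query in enumerate(queries):
        # Rank every document by similarity to the query, best first.
        ranked = sorted(
            range(len(documents)),
            key=lambda d: cosine(query, documents[d]),
            reverse=True,
        )
        if relevant[q_idx] in ranked[:k]:
            hits += 1
    return hits / len(queries)


# Toy, made-up embeddings purely for illustration.
documents = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[0.9, 0.1], [0.1, 0.9]]
relevant = [0, 2]

rate_at_1 = top_k_retrieval_rate(queries, documents, relevant, k=1)
rate_at_2 = top_k_retrieval_rate(queries, documents, relevant, k=2)
```

In this toy setup the second query's relevant document is ranked second, so it counts as a miss at k=1 and as a hit at k=2.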

