Conference Paper/Proceeding/Abstract

Spoken Document Retrieval for an Unwritten Language: A Case Study on Gormati

Sanjay Booshanam, Kelly Chen, Ondrej Klejch, Thomas Reitmaier, Dani Kalarikalayil Raju, Electra Wallington, Nina Markl, Jen Pearson, Matt Jones, Simon Robinson, Peter Bell

Findings of the Association for Computational Linguistics: EMNLP 2025, Pages: 22497 - 22509

Swansea University Authors: Thomas Reitmaier, Jen Pearson, Matt Jones, Simon Robinson

  • 70213.pdf

    PDF | Accepted Manuscript

    Author accepted manuscript document released under the terms of a Creative Commons CC-BY licence using the Swansea University Research Publications Policy (rights retention).

Abstract

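Speakers of unwritten languages have the potential to benefit from speech-based automatic information retrieval systems. This paper proposes a speech embedding technique that facilitates such a system that can be used in a zero-shot manner on the target language. After conducting development experiments on several written Indic languages, we evaluate our method on a corpus of Gormati – an unwritten language – that was previously collected in partnership with an agrarian Banjara community in Maharashtra State, India, specifically for the purposes of information retrieval. Our system achieves a Top 5 retrieval rate of 87.9% on this data, giving the hope that it may be useable by unwritten language speakers worldwide.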

Published in: Findings of the Association for Computational Linguistics: EMNLP 2025
ISBN: 979-8-89176-335-7
Published: Suzhou, China Association for Computational Linguistics 2025
Online Access: https://aclanthology.org/2025.findings-emnlp.1224/
URI: https://cronfa.swan.ac.uk/Record/cronfa70213
first_indexed 2025-08-21T14:23:09Z
last_indexed 2025-11-07T05:09:13Z
id cronfa70213
recordtype SURis
fullrecord <?xml version="1.0"?><rfc1807><datestamp>2025-11-05T09:42:00.8576017</datestamp><bib-version>v2</bib-version><id>70213</id><entry>2025-08-21</entry><title>Spoken Document Retrieval for an Unwritten Language: A Case Study on Gormati</title><swanseaauthors><author><sid>ccd66b64d11d76b9cd8b28e9d42a0ff0</sid><ORCID>0000-0003-2078-6699</ORCID><firstname>Thomas</firstname><surname>Reitmaier</surname><name>Thomas Reitmaier</name><active>true</active><ethesisStudent>false</ethesisStudent></author><author><sid>6d662d9e2151b302ed384b243e2a802f</sid><ORCID>0000-0002-1960-1012</ORCID><firstname>Jen</firstname><surname>Pearson</surname><name>Jen Pearson</name><active>true</active><ethesisStudent>false</ethesisStudent></author><author><sid>10b46d7843c2ba53d116ca2ed9abb56e</sid><ORCID>0000-0001-7657-7373</ORCID><firstname>Matt</firstname><surname>Jones</surname><name>Matt Jones</name><active>true</active><ethesisStudent>false</ethesisStudent></author><author><sid>cb3b57a21fa4e48ec633d6ba46455e91</sid><ORCID>0000-0001-9228-006X</ORCID><firstname>Simon</firstname><surname>Robinson</surname><name>Simon Robinson</name><active>true</active><ethesisStudent>false</ethesisStudent></author></swanseaauthors><date>2025-08-21</date><deptcode>MACS</deptcode><abstract>Speakers of unwritten languages have the potential to benefit from speech-based automatic information retrieval systems. This paper proposes a speech embedding technique that facilitates such a system that can be used in a zero-shot manner on the target language. After conducting development experiments on several written Indic languages, we evaluate our method on a corpus of Gormati &#x2013; an unwritten language &#x2013; that was previously collected in partnership with an agrarian Banjara community in Maharashtra State, India, specifically for the purposes of information retrieval.
Our system achieves a Top 5 retrieval rate of 87.9% on this data, giving the hope that it may be useable by unwritten language speakers worldwide.</abstract><type>Conference Paper/Proceeding/Abstract</type><journal>Findings of the Association for Computational Linguistics: EMNLP 2025</journal><volume/><journalNumber/><paginationStart>22497</paginationStart><paginationEnd>22509</paginationEnd><publisher>Association for Computational Linguistics</publisher><placeOfPublication>Suzhou, China</placeOfPublication><isbnPrint>979-8-89176-335-7</isbnPrint><isbnElectronic/><issnPrint/><issnElectronic/><keywords/><publishedDay>1</publishedDay><publishedMonth>11</publishedMonth><publishedYear>2025</publishedYear><publishedDate>2025-11-01</publishedDate><doi/><url>https://aclanthology.org/2025.findings-emnlp.1224/</url><notes/><college>COLLEGE NANME</college><department>Mathematics and Computer Science School</department><CollegeCode>COLLEGE CODE</CollegeCode><DepartmentCode>MACS</DepartmentCode><institution>Swansea University</institution><apcterm>Other</apcterm><funders/><projectreference/><lastEdited>2025-11-05T09:42:00.8576017</lastEdited><Created>2025-08-21T15:20:26.1985668</Created><path><level id="1">Faculty of Science and Engineering</level><level id="2">School of Mathematics and Computer Science - Computer Science</level></path><authors><author><firstname>Sanjay</firstname><surname>Booshanam</surname><order>1</order></author><author><firstname>Kelly</firstname><surname>Chen</surname><order>2</order></author><author><firstname>Ondrej</firstname><surname>Klejch</surname><order>3</order></author><author><firstname>Thomas</firstname><surname>Reitmaier</surname><orcid>0000-0003-2078-6699</orcid><order>4</order></author><author><firstname>Dani 
Kalarikalayil</firstname><surname>Raju</surname><order>5</order></author><author><firstname>Electra</firstname><surname>Wallington</surname><order>6</order></author><author><firstname>Nina</firstname><surname>Markl</surname><order>7</order></author><author><firstname>Jen</firstname><surname>Pearson</surname><orcid>0000-0002-1960-1012</orcid><order>8</order></author><author><firstname>Matt</firstname><surname>Jones</surname><orcid>0000-0001-7657-7373</orcid><order>9</order></author><author><firstname>Simon</firstname><surname>Robinson</surname><orcid>0000-0001-9228-006X</orcid><order>10</order></author><author><firstname>Peter</firstname><surname>Bell</surname><order>11</order></author></authors><documents><document><filename>70213__35159__7aa4f42c9a9741b5be52d6cfe03ae34d.pdf</filename><originalFilename>70213.pdf</originalFilename><uploaded>2025-09-22T13:59:02.9760494</uploaded><type>Output</type><contentLength>1416997</contentLength><contentType>application/pdf</contentType><version>Accepted Manuscript</version><cronfaStatus>true</cronfaStatus><documentNotes>Author accepted manuscript document released under the terms of a Creative Commons CC-BY licence using the Swansea University Research Publications Policy (rights retention).</documentNotes><copyrightCorrect>true</copyrightCorrect><language>eng</language><licence>https://creativecommons.org/licenses/by/4.0/</licence></document></documents><OutputDurs/></rfc1807>
title Spoken Document Retrieval for an Unwritten Language: A Case Study on Gormati
author_id_str_mv ccd66b64d11d76b9cd8b28e9d42a0ff0
6d662d9e2151b302ed384b243e2a802f
10b46d7843c2ba53d116ca2ed9abb56e
cb3b57a21fa4e48ec633d6ba46455e91
author_id_fullname_str_mv ccd66b64d11d76b9cd8b28e9d42a0ff0_***_Thomas Reitmaier
6d662d9e2151b302ed384b243e2a802f_***_Jen Pearson
10b46d7843c2ba53d116ca2ed9abb56e_***_Matt Jones
cb3b57a21fa4e48ec633d6ba46455e91_***_Simon Robinson
author Thomas Reitmaier
Jen Pearson
Matt Jones
Simon Robinson
author2 Sanjay Booshanam
Kelly Chen
Ondrej Klejch
Thomas Reitmaier
Dani Kalarikalayil Raju
Electra Wallington
Nina Markl
Jen Pearson
Matt Jones
Simon Robinson
Peter Bell
format Conference Paper/Proceeding/Abstract
container_title Findings of the Association for Computational Linguistics: EMNLP 2025
container_start_page 22497
publishDate 2025
institution Swansea University
isbn 979-8-89176-335-7
publisher Association for Computational Linguistics
college_str Faculty of Science and Engineering
hierarchytype
hierarchy_top_id facultyofscienceandengineering
hierarchy_top_title Faculty of Science and Engineering
hierarchy_parent_id facultyofscienceandengineering
hierarchy_parent_title Faculty of Science and Engineering
department_str School of Mathematics and Computer Science - Computer Science{{{_:::_}}}Faculty of Science and Engineering{{{_:::_}}}School of Mathematics and Computer Science - Computer Science
url https://aclanthology.org/2025.findings-emnlp.1224/
document_store_str 1
active_str 0
description Speakers of unwritten languages have the potential to benefit from speech-based automatic information retrieval systems. This paper proposes a speech embedding technique that facilitates such a system that can be used in a zero-shot manner on the target language. After conducting development experiments on several written Indic languages, we evaluate our method on a corpus of Gormati – an unwritten language – that was previously collected in partnership with an agrarian Banjara community in Maharashtra State, India, specifically for the purposes of information retrieval. Our system achieves a Top 5 retrieval rate of 87.9% on this data, giving the hope that it may be useable by unwritten language speakers worldwide.
published_date 2025-11-01T05:30:17Z