Conference Paper/Proceeding/Abstract
Spoken Document Retrieval for an Unwritten Language: A Case Study on Gormati
Findings of the Association for Computational Linguistics: EMNLP 2025, Pages: 22497 - 22509
Swansea University Authors: Thomas Reitmaier, Jen Pearson, Matt Jones, Simon Robinson
PDF | Accepted Manuscript
Author accepted manuscript document released under the terms of a Creative Commons CC-BY licence using the Swansea University Research Publications Policy (rights retention).
Abstract
Speakers of unwritten languages have the potential to benefit from speech-based automatic information retrieval systems. This paper proposes a speech embedding technique that facilitates such a system that can be used in a zero-shot manner on the target language. After conducting development expe...
| Published in: | Findings of the Association for Computational Linguistics: EMNLP 2025 |
|---|---|
| ISBN: | 979-8-89176-335-7 |
| Published: | Suzhou, China: Association for Computational Linguistics, 2025 |
| Online Access: | https://aclanthology.org/2025.findings-emnlp.1224/ |
| URI: | https://cronfa.swan.ac.uk/Record/cronfa70213 |
| first_indexed | 2025-08-21T14:23:09Z |
|---|---|
| last_indexed | 2025-11-07T05:09:13Z |
| id | cronfa70213 |
| recordtype | SURis |
| fullrecord |
<?xml version="1.0"?><rfc1807><datestamp>2025-11-05T09:42:00.8576017</datestamp><bib-version>v2</bib-version><id>70213</id><entry>2025-08-21</entry><title>Spoken Document Retrieval for an Unwritten Language: A Case Study on Gormati</title><swanseaauthors><author><sid>ccd66b64d11d76b9cd8b28e9d42a0ff0</sid><ORCID>0000-0003-2078-6699</ORCID><firstname>Thomas</firstname><surname>Reitmaier</surname><name>Thomas Reitmaier</name><active>true</active><ethesisStudent>false</ethesisStudent></author><author><sid>6d662d9e2151b302ed384b243e2a802f</sid><ORCID>0000-0002-1960-1012</ORCID><firstname>Jen</firstname><surname>Pearson</surname><name>Jen Pearson</name><active>true</active><ethesisStudent>false</ethesisStudent></author><author><sid>10b46d7843c2ba53d116ca2ed9abb56e</sid><ORCID>0000-0001-7657-7373</ORCID><firstname>Matt</firstname><surname>Jones</surname><name>Matt Jones</name><active>true</active><ethesisStudent>false</ethesisStudent></author><author><sid>cb3b57a21fa4e48ec633d6ba46455e91</sid><ORCID>0000-0001-9228-006X</ORCID><firstname>Simon</firstname><surname>Robinson</surname><name>Simon Robinson</name><active>true</active><ethesisStudent>false</ethesisStudent></author></swanseaauthors><date>2025-08-21</date><deptcode>MACS</deptcode><abstract>Speakers of unwritten languages have the potential to benefit from speech-based automatic information retrieval systems. This paper proposes a speech embedding technique that facilitates such a system that we can be used in a zero-shot manner on the target language. After conducting development experiments on several written Indic languages, we evaluate our method on a corpus of Gormati – an unwritten language – that was previously collected in partnership with an agrarian Banjara community in Maharashtra State, India, specifically for the purposes of information retrieval. 
Our system achieves a Top 5 retrieval rate of 87.9% on this data, giving the hope that it may be useable by unwritten language speakers worldwide.</abstract><type>Conference Paper/Proceeding/Abstract</type><journal>Findings of the Association for Computational Linguistics: EMNLP 2025</journal><volume/><journalNumber/><paginationStart>22497</paginationStart><paginationEnd>22509</paginationEnd><publisher>Association for Computational Linguistics</publisher><placeOfPublication>Suzhou, China</placeOfPublication><isbnPrint>979-8-89176-335-7</isbnPrint><isbnElectronic/><issnPrint/><issnElectronic/><keywords/><publishedDay>1</publishedDay><publishedMonth>11</publishedMonth><publishedYear>2025</publishedYear><publishedDate>2025-11-01</publishedDate><doi/><url>https://aclanthology.org/2025.findings-emnlp.1224/</url><notes/><college>COLLEGE NANME</college><department>Mathematics and Computer Science School</department><CollegeCode>COLLEGE CODE</CollegeCode><DepartmentCode>MACS</DepartmentCode><institution>Swansea University</institution><apcterm>Other</apcterm><funders/><projectreference/><lastEdited>2025-11-05T09:42:00.8576017</lastEdited><Created>2025-08-21T15:20:26.1985668</Created><path><level id="1">Faculty of Science and Engineering</level><level id="2">School of Mathematics and Computer Science - Computer Science</level></path><authors><author><firstname>Sanjay</firstname><surname>Booshanam</surname><order>1</order></author><author><firstname>Kelly</firstname><surname>Chen</surname><order>2</order></author><author><firstname>Ondrej</firstname><surname>Klejch</surname><order>3</order></author><author><firstname>Thomas</firstname><surname>Reitmaier</surname><orcid>0000-0003-2078-6699</orcid><order>4</order></author><author><firstname>Dani 
Kalarikalayil</firstname><surname>Raju</surname><order>5</order></author><author><firstname>Electra</firstname><surname>Wallington</surname><order>6</order></author><author><firstname>Nina</firstname><surname>Markl</surname><order>7</order></author><author><firstname>Jen</firstname><surname>Pearson</surname><orcid>0000-0002-1960-1012</orcid><order>8</order></author><author><firstname>Matt</firstname><surname>Jones</surname><orcid>0000-0001-7657-7373</orcid><order>9</order></author><author><firstname>Simon</firstname><surname>Robinson</surname><orcid>0000-0001-9228-006X</orcid><order>10</order></author><author><firstname>Peter</firstname><surname>Bell</surname><order>11</order></author></authors><documents><document><filename>70213__35159__7aa4f42c9a9741b5be52d6cfe03ae34d.pdf</filename><originalFilename>70213.pdf</originalFilename><uploaded>2025-09-22T13:59:02.9760494</uploaded><type>Output</type><contentLength>1416997</contentLength><contentType>application/pdf</contentType><version>Accepted Manuscript</version><cronfaStatus>true</cronfaStatus><documentNotes>Author accepted manuscript document released under the terms of a Creative Commons CC-BY licence using the Swansea University Research Publications Policy (rights retention).</documentNotes><copyrightCorrect>true</copyrightCorrect><language>eng</language><licence>https://creativecommons.org/licenses/by/4.0/</licence></document></documents><OutputDurs/></rfc1807> |
| title | Spoken Document Retrieval for an Unwritten Language: A Case Study on Gormati |
| spellingShingle | Spoken Document Retrieval for an Unwritten Language: A Case Study on Gormati Thomas Reitmaier Jen Pearson Matt Jones Simon Robinson |
| title_short | Spoken Document Retrieval for an Unwritten Language: A Case Study on Gormati |
| title_full | Spoken Document Retrieval for an Unwritten Language: A Case Study on Gormati |
| title_fullStr | Spoken Document Retrieval for an Unwritten Language: A Case Study on Gormati |
| title_full_unstemmed | Spoken Document Retrieval for an Unwritten Language: A Case Study on Gormati |
| title_sort | Spoken Document Retrieval for an Unwritten Language: A Case Study on Gormati |
| author_id_str_mv | ccd66b64d11d76b9cd8b28e9d42a0ff0 6d662d9e2151b302ed384b243e2a802f 10b46d7843c2ba53d116ca2ed9abb56e cb3b57a21fa4e48ec633d6ba46455e91 |
| author_id_fullname_str_mv | ccd66b64d11d76b9cd8b28e9d42a0ff0_***_Thomas Reitmaier 6d662d9e2151b302ed384b243e2a802f_***_Jen Pearson 10b46d7843c2ba53d116ca2ed9abb56e_***_Matt Jones cb3b57a21fa4e48ec633d6ba46455e91_***_Simon Robinson |
| author | Thomas Reitmaier Jen Pearson Matt Jones Simon Robinson |
| author2 | Sanjay Booshanam Kelly Chen Ondrej Klejch Thomas Reitmaier Dani Kalarikalayil Raju Electra Wallington Nina Markl Jen Pearson Matt Jones Simon Robinson Peter Bell |
| format | Conference Paper/Proceeding/Abstract |
| container_title | Findings of the Association for Computational Linguistics: EMNLP 2025 |
| container_start_page | 22497 |
| publishDate | 2025 |
| institution | Swansea University |
| isbn | 979-8-89176-335-7 |
| publisher | Association for Computational Linguistics |
| college_str | Faculty of Science and Engineering |
| hierarchytype | |
| hierarchy_top_id | facultyofscienceandengineering |
| hierarchy_top_title | Faculty of Science and Engineering |
| hierarchy_parent_id | facultyofscienceandengineering |
| hierarchy_parent_title | Faculty of Science and Engineering |
| department_str | School of Mathematics and Computer Science - Computer Science{{{_:::_}}}Faculty of Science and Engineering{{{_:::_}}}School of Mathematics and Computer Science - Computer Science |
| url | https://aclanthology.org/2025.findings-emnlp.1224/ |
| document_store_str | 1 |
| active_str | 0 |
| description | Speakers of unwritten languages have the potential to benefit from speech-based automatic information retrieval systems. This paper proposes a speech embedding technique that facilitates such a system that can be used in a zero-shot manner on the target language. After conducting development experiments on several written Indic languages, we evaluate our method on a corpus of Gormati – an unwritten language – that was previously collected in partnership with an agrarian Banjara community in Maharashtra State, India, specifically for the purposes of information retrieval. Our system achieves a Top 5 retrieval rate of 87.9% on this data, giving hope that it may be useable by unwritten language speakers worldwide. |
| published_date | 2025-11-01T05:30:17Z |
| _version_ | 1851097997432061952 |
| score | 11.089386 |

