Journal article · 584 views · 2 downloads
Asymmetric Cross-Scale Alignment for Text-Based Person Search
IEEE Transactions on Multimedia, Volume: 25, Pages: 7699 - 7709
Swansea University Author: Yuanbo Wu
PDF | Accepted Manuscript
Download (9.17MB)
DOI (Published version): 10.1109/tmm.2022.3225754
Abstract
Text-based person search (TBPS) is of significant importance in intelligent surveillance; it aims to retrieve pedestrian images with high semantic relevance to a given text description. This retrieval task is characterized by both modal heterogeneity and fine-grained matching. To implement this task, one needs to extract multi-scale features from both the image and text domains, and then perform cross-modal alignment. However, most existing approaches only consider alignment confined to individual scales, e.g., the image-sentence or region-phrase scale. Such a strategy presumes alignment during feature extraction while overlooking cross-scale alignment, e.g., image-phrase. In this paper, we present a transformer-based model to extract multi-scale representations, and perform Asymmetric Cross-Scale Alignment (ACSA) to precisely align the two modalities. Specifically, ACSA consists of a global-level alignment module and an asymmetric cross-attention module: the former aligns an image and texts on a global scale, while the latter applies the cross-attention mechanism to dynamically align cross-modal entities at the region/image-phrase scales. Extensive experiments on two benchmark datasets, CUHK-PEDES and RSTPReid, demonstrate the effectiveness of our approach.
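The asymmetric cross-attention the abstract describes can be illustrated with plain scaled dot-product cross-attention, where text-phrase embeddings act as queries attending over image-region embeddings. This is a minimal NumPy sketch of the general mechanism only, not the authors' ACSA implementation; the function name, dimensions, and toy data are all assumptions:

```python
import numpy as np

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention: each query (e.g. a text-phrase
    embedding) attends over all keys/values (e.g. image-region embeddings)
    and returns an attention-weighted combination of the values."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)           # (n_q, n_kv) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ values                          # (n_q, d) aligned features

# Toy example: 3 phrase embeddings attend over 5 region embeddings (dim 8).
rng = np.random.default_rng(0)
phrases = rng.standard_normal((3, 8))
regions = rng.standard_normal((5, 8))
aligned = cross_attention(phrases, regions, regions)
print(aligned.shape)  # (3, 8)
```

Each output row is a convex combination of the region embeddings, so every phrase is dynamically matched against all image regions — the "asymmetric" cross-scale pairing (phrase vs. whole set of regions) that the abstract contrasts with fixed same-scale alignment.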
Published in: IEEE Transactions on Multimedia
ISSN: 1520-9210; 1941-0077
Published: Institute of Electrical and Electronics Engineers (IEEE), 2023
Online Access: Check full text
URI: https://cronfa.swan.ac.uk/Record/cronfa62251
first_indexed: 2023-01-03T13:30:07Z
last_indexed: 2024-11-14T12:20:37Z
id: cronfa62251
recordtype: SURis
fullrecord:
<?xml version="1.0"?>
<rfc1807>
  <datestamp>2024-06-06T12:15:25.3716912</datestamp>
  <bib-version>v2</bib-version>
  <id>62251</id>
  <entry>2023-01-03</entry>
  <title>Asymmetric Cross-Scale Alignment for Text-Based Person Search</title>
  <swanseaauthors>
    <author>
      <sid>205b1ac5a767e977bebb5d6afd770784</sid>
      <ORCID>0000-0001-6119-058X</ORCID>
      <firstname>Yuanbo</firstname>
      <surname>Wu</surname>
      <name>Yuanbo Wu</name>
      <active>true</active>
      <ethesisStudent>false</ethesisStudent>
    </author>
  </swanseaauthors>
  <date>2023-01-03</date>
  <deptcode>MACS</deptcode>
  <abstract>Text-based person search (TBPS) is of significant importance in intelligent surveillance, which aims to retrieve pedestrian images with high semantic relevance to a given text description. This retrieval task is characterized with both modal heterogeneity and fine-grained matching. To implement this task, one needs to extract multi-scale features from both image and text domains, and then perform the cross-modal alignment. However, most existing approaches only consider the alignment confined at their individual scales, e.g., an image-sentence or a region-phrase scale. Such a strategy adopts the presumable alignment in feature extraction, while overlooking the cross-scale alignment, e.g., image-phrase. In this paper, we present a transformer-based model to extract multi-scale representations, and perform Asymmetric Cross-Scale Alignment (ACSA) to precisely align the two modalities. Specifically, ACSA consists of a global-level alignment module and an asymmetric cross-attention module, where the former aligns an image and texts on a global scale, and the latter applies the cross-attention mechanism to dynamically align the cross-modal entities in region/image-phrase scales. Extensive experiments on two benchmark datasets CUHK-PEDES and RSTPReid demonstrate the effectiveness of our approach.</abstract>
  <type>Journal Article</type>
  <journal>IEEE Transactions on Multimedia</journal>
  <volume>25</volume>
  <journalNumber/>
  <paginationStart>7699</paginationStart>
  <paginationEnd>7709</paginationEnd>
  <publisher>Institute of Electrical and Electronics Engineers (IEEE)</publisher>
  <placeOfPublication/>
  <isbnPrint/>
  <isbnElectronic/>
  <issnPrint>1520-9210</issnPrint>
  <issnElectronic>1941-0077</issnElectronic>
  <keywords/>
  <publishedDay>1</publishedDay>
  <publishedMonth>1</publishedMonth>
  <publishedYear>2023</publishedYear>
  <publishedDate>2023-01-01</publishedDate>
  <doi>10.1109/tmm.2022.3225754</doi>
  <url/>
  <notes/>
  <college>COLLEGE NANME</college>
  <department>Mathematics and Computer Science School</department>
  <CollegeCode>COLLEGE CODE</CollegeCode>
  <DepartmentCode>MACS</DepartmentCode>
  <institution>Swansea University</institution>
  <apcterm/>
  <funders>This work was supported by the National Natural Science Foundation of China (NSFC) under Grant 62176178, U19A2073, 62002096, and the Natural Science Foundation of Tianjin under Grant 19JCYBJC16000.</funders>
  <projectreference/>
  <lastEdited>2024-06-06T12:15:25.3716912</lastEdited>
  <Created>2023-01-03T13:27:18.4577396</Created>
  <path>
    <level id="1">Faculty of Science and Engineering</level>
    <level id="2">School of Mathematics and Computer Science - Computer Science</level>
  </path>
  <authors>
    <author><firstname>Zhong</firstname><surname>Ji</surname><orcid>0000-0002-2197-3739</orcid><order>1</order></author>
    <author><firstname>Junhua</firstname><surname>Hu</surname><order>2</order></author>
    <author><firstname>Deyin</firstname><surname>Liu</surname><orcid>0000-0002-0371-9921</orcid><order>3</order></author>
    <author><firstname>Yuanbo</firstname><surname>Wu</surname><orcid>0000-0001-6119-058X</orcid><order>4</order></author>
    <author><firstname>Ye</firstname><surname>Zhao</surname><orcid>0000-0002-8180-4697</orcid><order>5</order></author>
  </authors>
  <documents>
    <document>
      <filename>62251__26475__9db5b190500e40af9bfe9f45e66bbb4f.pdf</filename>
      <originalFilename>62251.pdf</originalFilename>
      <uploaded>2023-02-06T09:54:43.9331511</uploaded>
      <type>Output</type>
      <contentLength>9616949</contentLength>
      <contentType>application/pdf</contentType>
      <version>Accepted Manuscript</version>
      <cronfaStatus>true</cronfaStatus>
      <embargoDate>2024-12-05T00:00:00.0000000</embargoDate>
      <copyrightCorrect>true</copyrightCorrect>
      <language>eng</language>
    </document>
  </documents>
  <OutputDurs/>
</rfc1807>
title: Asymmetric Cross-Scale Alignment for Text-Based Person Search
author_id_str_mv: 205b1ac5a767e977bebb5d6afd770784
author_id_fullname_str_mv: 205b1ac5a767e977bebb5d6afd770784_***_Yuanbo Wu
author: Yuanbo Wu
author2: Zhong Ji; Junhua Hu; Deyin Liu; Yuanbo Wu; Ye Zhao
format: Journal article
container_title: IEEE Transactions on Multimedia
container_volume: 25
container_start_page: 7699
publishDate: 2023
institution: Swansea University
issn: 1520-9210; 1941-0077
doi_str_mv: 10.1109/tmm.2022.3225754
publisher: Institute of Electrical and Electronics Engineers (IEEE)
college_str: Faculty of Science and Engineering
hierarchytype:
hierarchy_top_id: facultyofscienceandengineering
hierarchy_top_title: Faculty of Science and Engineering
hierarchy_parent_id: facultyofscienceandengineering
hierarchy_parent_title: Faculty of Science and Engineering
department_str: School of Mathematics and Computer Science - Computer Science; Faculty of Science and Engineering
document_store_str: 1
active_str: 0
published_date: 2023-01-01T20:11:42Z
_version_: 1822524810477961216