
Journal article

Asymmetric Cross-Scale Alignment for Text-Based Person Search

Zhong Ji, Junhua Hu, Deyin Liu, Yuanbo Wu, Ye Zhao

IEEE Transactions on Multimedia, Volume: 25, Pages: 7699 - 7709

Swansea University Author: Yuanbo Wu

  • Accepted Manuscript under embargo until: 5th December 2024

Abstract

Text-based person search (TBPS) is of significant importance in intelligent surveillance, which aims to retrieve pedestrian images with high semantic relevance to a given text description. This retrieval task is characterized by both modal heterogeneity and fine-grained matching. To implement this task, one needs to extract multi-scale features from both image and text domains, and then perform cross-modal alignment. However, most existing approaches only consider the alignment confined to their individual scales, e.g., an image-sentence or a region-phrase scale. Such a strategy adopts the presumable alignment in feature extraction, while overlooking the cross-scale alignment, e.g., image-phrase. In this paper, we present a transformer-based model to extract multi-scale representations, and perform Asymmetric Cross-Scale Alignment (ACSA) to precisely align the two modalities. Specifically, ACSA consists of a global-level alignment module and an asymmetric cross-attention module, where the former aligns an image and texts on a global scale, and the latter applies the cross-attention mechanism to dynamically align the cross-modal entities in region/image-phrase scales. Extensive experiments on two benchmark datasets, CUHK-PEDES and RSTPReid, demonstrate the effectiveness of our approach.
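The paper's own ACSA module is not reproduced in this record. As a rough illustration of the cross-attention mechanism the abstract refers to, the sketch below shows a minimal single-head cross-modal attention step in NumPy, where text-phrase embeddings attend over image-region embeddings. All names, shapes, and values are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, d):
    """One direction of cross-modal attention: each query (e.g. a text
    phrase embedding) attends over the other modality's entities
    (e.g. image region embeddings) and returns aligned features."""
    scores = queries @ keys_values.T / np.sqrt(d)   # (n_q, n_kv) similarity
    weights = softmax(scores, axis=-1)              # attention over regions
    return weights @ keys_values                    # (n_q, d) aligned features

# Toy example: 3 phrase embeddings attend over 5 image-region embeddings.
rng = np.random.default_rng(0)
d = 8
phrases = rng.normal(size=(3, d))
regions = rng.normal(size=(5, d))
aligned = cross_attention(phrases, regions, d)
print(aligned.shape)  # (3, 8)
```

In the asymmetric setting described in the abstract, attention of this kind is applied across scales (e.g. whole-image features attending to phrase features), rather than only between entities at the same scale.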


Published in: IEEE Transactions on Multimedia
ISSN: 1520-9210 (print), 1941-0077 (electronic)
Published: Institute of Electrical and Electronics Engineers (IEEE), 2023

URI: https://cronfa.swan.ac.uk/Record/cronfa62251
DOI: 10.1109/tmm.2022.3225754
Published: 1 January 2023
Department: School of Mathematics and Computer Science - Computer Science, Faculty of Science and Engineering, Swansea University
Funders: This work was supported by the National Natural Science Foundation of China (NSFC) under Grants 62176178, U19A2073, and 62002096, and the Natural Science Foundation of Tianjin under Grant 19JCYBJC16000.