
Journal article

Asymmetric Cross-Scale Alignment for Text-Based Person Search

Zhong Ji, Junhua Hu, Deyin Liu, Yuanbo Wu, Ye Zhao

IEEE Transactions on Multimedia, Volume: 25, Pages: 7699 - 7709

Swansea University Author: Yuanbo Wu

  • Accepted Manuscript under embargo until: 5th December 2024

Abstract

Text-based person search (TBPS) is of significant importance in intelligent surveillance, which aims to retrieve pedestrian images with high semantic relevance to a given text description. This retrieval task is characterized by both modal heterogeneity and fine-grained matching. To implement this task, one needs to extract multi-scale features from both image and text domains, and then perform cross-modal alignment. However, most existing approaches only consider the alignment confined to their individual scales, e.g., an image-sentence or a region-phrase scale. Such a strategy adopts the presumable alignment in feature extraction, while overlooking the cross-scale alignment, e.g., image-phrase. In this paper, we present a transformer-based model to extract multi-scale representations, and perform Asymmetric Cross-Scale Alignment (ACSA) to precisely align the two modalities. Specifically, ACSA consists of a global-level alignment module and an asymmetric cross-attention module, where the former aligns an image and texts on a global scale, and the latter applies the cross-attention mechanism to dynamically align the cross-modal entities in region/image-phrase scales. Extensive experiments on two benchmark datasets, CUHK-PEDES and RSTPReid, demonstrate the effectiveness of our approach.
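The paper's own ACSA module is not reproduced in this record. As a rough illustration of the cross-attention mechanism the abstract refers to, the sketch below shows a minimal single-head cross-modal attention step in NumPy, where text-phrase embeddings attend over image-region embeddings. All names, shapes, and values are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, d):
    """One direction of cross-modal attention: each query (e.g. a text
    phrase embedding) attends over the other modality's entities
    (e.g. image region embeddings) and returns aligned features."""
    scores = queries @ keys_values.T / np.sqrt(d)   # (n_q, n_kv) similarity
    weights = softmax(scores, axis=-1)              # attention over regions
    return weights @ keys_values                    # (n_q, d) aligned features

# Toy example: 3 phrase embeddings attend over 5 image-region embeddings.
rng = np.random.default_rng(0)
d = 8
phrases = rng.normal(size=(3, d))
regions = rng.normal(size=(5, d))
aligned = cross_attention(phrases, regions, d)
print(aligned.shape)  # (3, 8)
```

In the asymmetric setting described in the abstract, attention of this kind is applied across scales (e.g. whole-image features attending to phrase features), rather than only between entities at the same scale.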


Published in: IEEE Transactions on Multimedia
ISSN: 1520-9210 (print), 1941-0077 (electronic)
Published: Institute of Electrical and Electronics Engineers (IEEE), 2023

URI: https://cronfa.swan.ac.uk/Record/cronfa62251
DOI: 10.1109/tmm.2022.3225754
Published: 1 January 2023
Department: School of Mathematics and Computer Science - Computer Science, Faculty of Science and Engineering, Swansea University
Funders: This work was supported by the National Natural Science Foundation of China (NSFC) under Grants 62176178, U19A2073, and 62002096, and the Natural Science Foundation of Tianjin under Grant 19JCYBJC16000.