Journal article
Asymmetric Cross-Scale Alignment for Text-Based Person Search
IEEE Transactions on Multimedia, Volume: 25, Pages: 7699 - 7709
Swansea University Author: Yuanbo Wu
PDF | Accepted Manuscript (9.17MB)
DOI (Published version): 10.1109/tmm.2022.3225754
Abstract
Text-based person search (TBPS), which aims to retrieve pedestrian images with high semantic relevance to a given text description, is of significant importance in intelligent surveillance. This retrieval task is characterized by both modal heterogeneity and fine-grained matching. To implement it, one needs to extract multi-scale features from both the image and text domains and then perform cross-modal alignment. However, most existing approaches only consider alignment confined to individual scales, e.g., the image-sentence or region-phrase scale. Such a strategy presumes alignment at a single scale during feature extraction while overlooking cross-scale alignment, e.g., image-phrase. In this paper, we present a transformer-based model to extract multi-scale representations and perform Asymmetric Cross-Scale Alignment (ACSA) to precisely align the two modalities. Specifically, ACSA consists of a global-level alignment module and an asymmetric cross-attention module: the former aligns an image and texts at the global scale, while the latter applies a cross-attention mechanism to dynamically align cross-modal entities at the region-phrase and image-phrase scales. Extensive experiments on two benchmark datasets, CUHK-PEDES and RSTPReid, demonstrate the effectiveness of our approach.
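The asymmetric cross-attention module described in the abstract can be pictured with a short sketch. This is a minimal, single-head illustration of the general cross-attention idea (phrase-level text features attending over region- or image-level visual features), not the authors' released implementation; the class name, feature dimension, and projection layout are assumptions made purely for exposition.

```python
# Minimal sketch (assumed names/shapes, not the paper's code): phrase
# tokens act as queries over visual tokens, so text attends to vision
# but not vice versa -- hence "asymmetric".
import torch
import torch.nn as nn
import torch.nn.functional as F

class AsymmetricCrossAttention(nn.Module):
    """Phrase features (queries) attend over visual features (keys/values)."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, phrases: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # phrases: (B, P, dim) phrase-level text features
        # visual:  (B, R, dim) region- or image-level visual features
        q = self.q_proj(phrases)
        k = self.k_proj(visual)
        v = self.v_proj(visual)
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)  # (B, P, R)
        return attn @ v  # visually grounded phrase features, (B, P, dim)

# Usage example: align 6 phrases against 49 image regions.
aca = AsymmetricCrossAttention(dim=512)
out = aca(torch.randn(2, 6, 512), torch.randn(2, 49, 512))
print(out.shape)  # torch.Size([2, 6, 512])
```

In this formulation, each phrase is re-expressed as a weighted sum of the visual tokens it most plausibly describes, which is one way the region-phrase and image-phrase scales mentioned in the abstract could be aligned dynamically.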
| Published in: | IEEE Transactions on Multimedia |
|---|---|
| ISSN: | 1520-9210, 1941-0077 |
| Published: | Institute of Electrical and Electronics Engineers (IEEE), 2023 |
| Online Access: | Check full text |
| URI: | https://cronfa.swan.ac.uk/Record/cronfa62251 |
| College: | Faculty of Science and Engineering |
| Funders: | This work was supported by the National Natural Science Foundation of China (NSFC) under Grants 62176178, U19A2073, and 62002096, and by the Natural Science Foundation of Tianjin under Grant 19JCYBJC16000. |
| Start Page: | 7699 |
| End Page: | 7709 |