
AR-Q-former: Historical Newspaper Article Separation Based on Multilingual Transformer Structure

A novel multimodal approach for article separation in historical newspapers leverages both textual data and visual information to improve segmentation accuracy.

Background

Historical newspapers are invaluable resources that record social, political, and economic events across different periods. Extracting information from these documents helps preserve cultural heritage and prevents the loss of important historical knowledge.


However, the digitisation process presents significant challenges, particularly in efficiently and accurately processing the diverse content of historical newspapers.


A central challenge is article separation, which involves segmenting a newspaper page into individual articles. A further difficulty is that historical newspapers combine two modalities, text and image, so extracting information from them requires multimodal processing.


To address these challenges, Dr Antoine Doucet and co-authors proposed Ar-Q-former, an article separation model built on a multimodal transformer structure. The approach leverages both textual data and visual information to improve segmentation accuracy.

A demonstration of article separation on a newspaper page and examples of text block linking.

Ar-Q-former

  • Ar-Q-former integrates both textual and visual information to identify article boundaries effectively. It employs a cross-modal transformer to process linked text blocks, combining textual content and visual cues to form coherent article segments.
  • In addition, it introduces the mask-image technique, which preserves the positional relationships between text blocks and further improves segmentation accuracy. The method assumes that the page image and the position and content of each text block are known. Each text block is connected to its neighbouring blocks below and to the right, effectively modelling the page structure.
  • For each connection (link), the text backbone produces the text semantic vectors of the text blocks at both ends of the connection. At the same time, a mask-image of these two blocks is constructed and passed through the image backbone to obtain its visual semantic vector.
  • Finally, the text and visual semantic vectors are fed into the cross-modal transformer (Ar-Q-former), which captures the semantics of the connection, and a classifier decides whether the connection should be preserved. The text blocks that remain connected then form an article; a sketch of this pipeline follows below.
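
To make the flow concrete, here is a minimal, hedged Python sketch of the per-link pipeline. The names (TextBlock, neighbour_links, classify_link, group_into_articles) and the simple geometric neighbour rule are illustrative assumptions, not the authors' implementation; the model inference step is left as a placeholder.

```python
# Hedged sketch of the per-link pipeline described above.
# All names and the neighbour rule are hypothetical placeholders.
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class TextBlock:
    idx: int
    text: str
    bbox: Tuple[int, int, int, int]  # (x, y, width, height) on the page

def neighbour_links(blocks: List[TextBlock]) -> List[Tuple[int, int]]:
    """Link each block to blocks lying roughly below or to its right
    (a simplified geometric rule; the paper's exact criterion may differ)."""
    links = []
    for a in blocks:
        ax, ay, aw, ah = a.bbox
        for b in blocks:
            if a.idx == b.idx:
                continue
            bx, by, _, _ = b.bbox
            below = by >= ay + ah and abs(bx - ax) < aw
            right = bx >= ax + aw and abs(by - ay) < ah
            if below or right:
                links.append((a.idx, b.idx))
    return links

def classify_link(a: TextBlock, b: TextBlock, page_image) -> bool:
    """Placeholder for: text backbone -> text vectors of both blocks,
    mask-image -> image backbone -> visual vector,
    Ar-Q-former + classifier -> keep or drop the link."""
    return True  # dummy decision so the sketch runs end to end

def group_into_articles(blocks: List[TextBlock],
                        kept_links: List[Tuple[int, int]]) -> List[List[int]]:
    """Union-find over preserved links: each connected component is an article."""
    parent = {b.idx: b.idx for b in blocks}

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in kept_links:
        parent[find(i)] = find(j)
    articles: Dict[int, List[int]] = {}
    for b in blocks:
        articles.setdefault(find(b.idx), []).append(b.idx)
    return list(articles.values())

# Toy usage: two vertically adjacent blocks merge into one article, while a
# distant block stays separate (the dummy classifier keeps every candidate link).
blocks = [TextBlock(0, "headline", (50, 50, 300, 40)),
          TextBlock(1, "body text", (50, 100, 300, 200)),
          TextBlock(2, "advert", (500, 300, 200, 100))]
page_image = None  # stand-in for the scanned page
kept = [(i, j) for i, j in neighbour_links(blocks)
        if classify_link(blocks[i], blocks[j], page_image)]
print(group_into_articles(blocks, kept))  # -> [[0, 1], [2]]
```

The final grouping step is a standard union-find over the preserved links, so each connected component of text blocks becomes one article.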

Contributions

The paper presents a novel multimodal article separation method, called Ar-Q-former, which combines visual and textual cues for accurate article segmentation in historical newspapers. It is the first approach to apply visual-textual modality interaction to the article separation task, and it extends the Q-Former structure so that it can obtain semantic information from both modalities by introducing an additional text query while preserving the vision query.
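
The extension can be pictured roughly as follows. This is a minimal PyTorch sketch, not the authors' implementation: the dimensions, the number of queries, the single-layer depth, and the way the two query sets are fused are all assumptions made for illustration.

```python
# Hedged PyTorch sketch of the extended Q-Former idea: learnable vision
# queries attend to image features while an added set of text queries
# attends to text features; sizes and fusion strategy are assumptions.
import torch
import torch.nn as nn

class ArQFormerBlock(nn.Module):
    def __init__(self, dim=256, heads=8, n_vis_q=32, n_txt_q=32):
        super().__init__()
        self.vis_queries = nn.Parameter(torch.randn(1, n_vis_q, dim) * 0.02)
        self.txt_queries = nn.Parameter(torch.randn(1, n_txt_q, dim) * 0.02)
        self.vis_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.classifier = nn.Linear(dim, 2)  # keep / drop the link

    def forward(self, image_feats, text_feats):
        b = image_feats.size(0)
        vq = self.vis_queries.expand(b, -1, -1)
        tq = self.txt_queries.expand(b, -1, -1)
        vq, _ = self.vis_cross(vq, image_feats, image_feats)  # vision query <- mask-image features
        tq, _ = self.txt_cross(tq, text_feats, text_feats)    # text query  <- text-block features
        q = torch.cat([vq, tq], dim=1)
        q, _ = self.self_attn(q, q, q)                        # let the two query sets interact
        return self.classifier(self.norm(q).mean(dim=1))      # per-link keep/drop logits

# toy shapes: batch of 4 links, 196 image patch features, 64 text token features
logits = ArQFormerBlock()(torch.randn(4, 196, 256), torch.randn(4, 64, 256))
```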

The model also introduces the mask-image method to model the positional relationships between text blocks effectively, improving the integration of image and text information. The researchers demonstrated the efficacy of the approach through extensive experiments on historical newspaper datasets, achieving competitive performance compared with existing methods.
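
To illustrate what a mask-image might look like in practice, the sketch below keeps the page pixels inside the two linked blocks' bounding boxes and blanks out everything else, so the resulting image still encodes where the blocks sit relative to each other on the page. This is one plausible reading of the technique; the paper's exact construction may differ, and the (x, y, w, h) box format is an assumption.

```python
# Hedged illustration of one way a "mask-image" for a linked block pair
# could be built; the authors' exact construction may differ.
import numpy as np

def make_mask_image(page: np.ndarray, box_a, box_b) -> np.ndarray:
    """Keep only the pixels inside the two blocks' boxes (x, y, w, h)."""
    masked = np.zeros_like(page)
    for x, y, w, h in (box_a, box_b):
        masked[y:y + h, x:x + w] = page[y:y + h, x:x + w]
    return masked

page = np.random.randint(0, 255, (1024, 768, 3), dtype=np.uint8)  # toy page image
mask_img = make_mask_image(page, (50, 100, 300, 200), (50, 320, 300, 180))
```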

Read the full article at this link.

This article was presented at the 19th International Conference on Document Analysis and Recognition (ICDAR), held in Wuhan, Hubei, China, in September 2025.