Using Bipartite Matching and Detection Transformers for Document Layout Analysis Transfer Learning

Abstract
  • Business documents contain useful information that could benefit from automatic interpretation. The task of document layout analysis seeks to identify and localize semantic structures in documents. Contemporary techniques approach this as a strictly visual task. However, recent progress in Natural Language Processing (NLP) has enabled the incorporation of language information, and multimodal techniques have been proposed for document layout analysis. These models rely on region-based object detection techniques, which require defining surrogate tasks such as region proposal and non-maximum suppression. This thesis presents LayoutLMDet, a multimodal layout analysis model. LayoutLMDet approaches object detection as a direct set prediction task, as described in "End-to-End Object Detection with Transformers". Using bipartite matching, LayoutLMDet removes the need for surrogate tasks, simplifying implementation. Leveraging a pretrained transformer encoder, LayoutLMDet achieves a mean average precision of 49.5 on the DocLayNet test dataset. A qualitative comparison of LayoutLMDet's performance on the DocBank dataset highlights the impact of data selection.
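
The bipartite matching step mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation. The cost function (negative class probability plus L1 box distance), the dictionary layout, and the names `match_cost` and `bipartite_match` are assumptions made for illustration. The sketch also swaps in a brute-force search over permutations where set-prediction detectors typically use the Hungarian algorithm, so it is only practical for a handful of boxes.

```python
from itertools import permutations


def l1(box_a, box_b):
    """L1 distance between two boxes given as (cx, cy, w, h) tuples."""
    return sum(abs(a - b) for a, b in zip(box_a, box_b))


def match_cost(pred, target):
    # Assumed, simplified cost: reward a high probability for the target's
    # class and penalise box distance. Real models usually add more terms.
    return -pred["probs"][target["label"]] + l1(pred["box"], target["box"])


def bipartite_match(preds, targets):
    """Return (prediction index, target index) pairs with minimal total cost.

    Assumes len(preds) >= len(targets). Predictions left unmatched would be
    trained towards a "no object" class in a set-prediction detector.
    """
    best_cost, best_assignment = float("inf"), ()
    # Brute force: try every way to give each target a distinct prediction.
    for perm in permutations(range(len(preds)), len(targets)):
        cost = sum(match_cost(preds[i], targets[j]) for j, i in enumerate(perm))
        if cost < best_cost:
            best_cost, best_assignment = cost, perm
    return [(i, j) for j, i in enumerate(best_assignment)]


# Hypothetical data: three predictions, two ground-truth regions.
# Class indices 0 and 1 are placeholder layout labels; 2 stands for "no object".
preds = [
    {"probs": [0.1, 0.8, 0.1], "box": (0.5, 0.2, 0.8, 0.1)},
    {"probs": [0.7, 0.2, 0.1], "box": (0.5, 0.6, 0.8, 0.5)},
    {"probs": [0.05, 0.05, 0.9], "box": (0.1, 0.1, 0.1, 0.1)},
]
targets = [
    {"label": 0, "box": (0.5, 0.6, 0.8, 0.5)},
    {"label": 1, "box": (0.5, 0.2, 0.8, 0.1)},
]

matches = bipartite_match(preds, targets)
print(matches)
```

Because every ground-truth region is paired with exactly one prediction, duplicate detections are penalised during training instead of being filtered afterwards, which is why no separate non-maximum suppression step is needed in this formulation.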

Rights Notes
  • Copyright © 2022 the author(s). Theses may be used for non-commercial research, educational, or related academic purposes only. Such uses include personal study, research, scholarship, and teaching. Theses may only be shared by linking to Carleton University Institutional Repository and no part may be used without proper attribution to the author. No part may be used for commercial purposes directly or indirectly via a for-profit platform; no adaptation or derivative works are permitted without consent from the copyright owner.
Date Created
  • 2023