LOTR: Face Landmark Localization Using Localization Transformer


Nowadays, face recognition systems are commonplace in our daily lives, with businesses eager to take advantage of their convenience and robustness. Over the past decade, face detection and facial landmark localization have become crucial to the performance of face recognition systems. While face detection refers to accurately detecting faces in an image, facial landmark localization focuses on estimating the positions of predefined key points in a detected face. These key points represent the different attributes of a human face, e.g., the contours of the face, eyes, nose, mouth, or eyebrows [1]. In recent years, facial landmark localization has become an important area of research in computer vision. It has aided in solving several computer vision problems such as face animation, 3D face reconstruction, synthesized face detection, emotion classification, and facial action unit detection. However, it remains challenging due to its sensitivity to face pose, illumination, and occlusion [2].

Proposed methods

Watchareeruetai et al. [1] propose the following novel ideas:

  1. A Transformer-based landmark localization network, namely the Localization Transformer (LOTR), which directly regresses landmark coordinates instead of relying on heatmaps.

  2. A modified loss function, namely the smooth-Wing loss, which addresses the gradient discontinuity and training stability issues of an existing loss function called the Wing loss [4].

LOTR: Localization Transformer

Fig. 1: An overview of the Localization Transformer (LOTR). It consists of three main modules: 1) a visual backbone, 2) a Transformer network, and 3) a landmark prediction head. This figure corresponds to Fig. 1 from the paper [1].
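The three-module pipeline in Fig. 1 can be sketched in PyTorch. The skeleton below is purely illustrative and is not the paper's architecture: the backbone, layer sizes, token pooling, and Transformer configuration are placeholder assumptions, chosen only to show how a feature map flows through a Transformer into a direct coordinate-regression head.

```python
import torch
import torch.nn as nn

class LOTRSketch(nn.Module):
    """Illustrative three-module sketch (backbone -> Transformer -> head).
    All layer choices here are placeholders, not the paper's configuration."""

    def __init__(self, n_landmarks=106, d_model=256):
        super().__init__()
        # 1) Visual backbone: any CNN producing a spatial feature map.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, d_model, kernel_size=7, stride=4, padding=3),
            nn.ReLU(),
        )
        # 2) Transformer network over the flattened feature-map tokens.
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        # 3) Landmark prediction head: direct (x, y) coordinate regression.
        self.head = nn.Linear(d_model, 2 * n_landmarks)

    def forward(self, img):
        f = self.backbone(img)                    # (B, C, H, W)
        tokens = f.flatten(2).transpose(1, 2)     # (B, H*W, C)
        z = self.transformer(tokens).mean(dim=1)  # pool tokens to (B, C)
        return self.head(z).view(img.size(0), -1, 2)  # (B, L, 2)
```

For a 64x64 input, the sketch emits one (x, y) pair per landmark, matching the direct coordinate-regression idea without any intermediate heatmaps.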

Smooth-Wing Loss

Fig. 2: Comparison of Wing loss and smooth-Wing loss (top) and their gradient (bottom) in the global view (left). For the Wing loss (blue dashed lines), the gradient changes abruptly at the points |x| = w (bottom-middle) and at x = 0 (bottom-right). On the other hand, the proposed smooth-Wing loss (orange solid lines) is designed to eliminate these gradient discontinuities. This figure corresponds to Fig. 2 from the paper [1].
Eq. 1: The Wing loss, where w is the threshold, ϵ is a parameter controlling the steepness of the logarithmic part, and c = w − w·ln(1 + w/ϵ) ensures the two pieces of the loss meet at |x| = w.
Eq. 2: The smooth-Wing loss, a modification of the Wing loss in Eq. 1 with an inner threshold t, 0 < t < w.
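The Wing loss of Eq. 1 is simple to sketch in code. The NumPy function below follows the caption's definition (logarithmic near zero, linear beyond the threshold w, with c chosen so the two pieces meet); the default values w = 10 and ϵ = 2 are illustrative assumptions, not values prescribed by this article.

```python
import numpy as np

def wing_loss(x, w=10.0, eps=2.0):
    """Wing loss (Eq. 1) applied elementwise to residuals x.

    Logarithmic for |x| < w, linear otherwise; c = w - w*ln(1 + w/eps)
    makes the two pieces meet at |x| = w.
    """
    c = w - w * np.log(1.0 + w / eps)
    ax = np.abs(x)
    return np.where(ax < w, w * np.log(1.0 + ax / eps), ax - c)
```

Note that while the loss value itself is continuous at |x| = w, its gradient is not (the blue dashed curve in Fig. 2), which is exactly the discontinuity the smooth-Wing loss is designed to remove.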


Dataset & pre-processing

Watchareeruetai et al. [1] conducted experiments to measure the performance of the proposed LOTR models on two benchmark datasets: 1) the 106-point JD landmark dataset [10] and 2) the Wider Facial Landmarks in-the-Wild (WFLW) dataset [11].


Table 1 shows the different configurations of the LOTR models used in the experiments on these two datasets.

Table 1: The architectures of the different LOTR models

Evaluation Metrics

Watchareeruetai et al. [1] used standard metrics, namely the normalized mean error (NME), the failure rate, and the area under the curve (AUC) of the cumulative distribution, to evaluate the LOTR models and compare them with other landmark localization algorithms.
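These metrics can be sketched as follows. This is a minimal NumPy sketch of how NME and failure rate are typically computed; the per-image normalization factor d and the failure threshold (0.10 is a common choice) depend on each benchmark's protocol and are assumptions here, not details taken from this article.

```python
import numpy as np

def per_image_nme(pred, gt, d):
    """Per-image normalized mean error.

    pred, gt: (N, L, 2) predicted / ground-truth landmark coordinates.
    d: (N,) per-image normalization factor; benchmarks differ on its
       definition (e.g., inter-ocular distance or bounding-box size).
    """
    return np.linalg.norm(pred - gt, axis=-1).mean(axis=-1) / d

def nme(pred, gt, d):
    # Dataset-level NME: mean of the per-image errors.
    return float(per_image_nme(pred, gt, d).mean())

def failure_rate(pred, gt, d, thr=0.10):
    # Fraction of images whose NME exceeds the threshold.
    return float(np.mean(per_image_nme(pred, gt, d) > thr))
```

The AUC reported by such benchmarks is then the area under the cumulative distribution of the per-image NME values up to a fixed error cutoff.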



On the WFLW dataset, Watchareeruetai et al. [1] compare the proposed LOTR-HR+ model with several state-of-the-art methods, including Look-at-Boundary (LAB) [11], Wing loss [4], adaptive Wing loss (AWing) [14], LUVLi [9], Gaussian vector (GV) [15], and Heatmap-In-Heatmap (HIH) [16].

Table 2: Comparison with state-of-the-art methods on the WFLW dataset.
Fig. 3: Sample images of the test set of the WFLW dataset with predicted landmarks from the LOTR-HR+ model. Each column displays the images with different subsets. Each row displays images with a different range of NMEs: < 0.05 (top), 0.05–0.06 (middle), and > 0.06 (bottom). This figure corresponds to Fig. 3 from the paper [1].


Watchareeruetai et al. [1] evaluate the performance of the LOTR-M, LOTR-M+, and LOTR-R+ models on the test set of the first Grand Challenge of 106-Point Facial Landmark Localization against the top two ranked algorithms submitted to the challenge [17].

Table 3: The evaluation results for different LOTR models on the JD-landmark test set; * and ** denote the first and second place entries.


The proposed LOTR models outperform other algorithms, including the two current heatmap-based methods on the JD-landmark challenge leaderboard, and are comparable with several state-of-the-art methods on the WFLW dataset. The results suggest that the Transformer-based direct coordinate regression is a promising approach for robust facial landmark localization.


AI researchers from Sertis Vision Lab, namely Ukrit Watchareeruetai, Benjaphan Sommana, Sanjana Jain, Ankush Ganguly, and Aubin Samacoits, developed the novel LOTR framework for face landmark localization in collaboration with quantitative researchers Pavit Noinongyao and Samuel W. F. Earp from QIS Capital. Nakarin Sritrakool, who at the time of this research was affiliated with the Department of Mathematics and Computer Science, Faculty of Science, Chulalongkorn University, contributed immensely to this work during his internship at Sertis Vision Lab.


[1] U. Watchareeruetai et al., “LOTR: Face Landmark Localization Using Localization Transformer,” in IEEE Access, vol. 10, pp. 16530–16543, 2022, doi: 10.1109/ACCESS.2022.3149380.




