Summer Research Fellowship Programme of India's Science Academies

Identification of Text, Figures and Mathematical Expressions in Lecture Videos

Mukesh N Chugani

Amrita Vishwa Vidyapeetham, Coimbatore, Tamil Nadu 641112

Guided by:

Dr. Dinesh Babu Jayagopi

International Institute of Information Technology, Bangalore, Karnataka 560100


Lecture videos are usually rich with different types of information and understanding this can prove to be quite useful for solving problems in the domain of Educational Data Mining (EDM). The figures, text and mathematical expressions in the slides can serve as vital cues for understanding lecture videos. Though the problem of analyzing lecture videos is gaining the attention of the research community, most of the works have concentrated on the localization and recognition of text in lecture videos. Localization of mathematical expressions and figures has been overlooked. In this work, we identify the regions of text, figures and expressions in a particular slide of a lecture using computer vision techniques. This involves the localization of a region of the projected slide and the identification of the three regions in each slide. This information can be used to evaluate the composition of these cues in each slide of the presentation. Poor methods of teaching can play a major role in the distraction of students and hence can hinder effective learning. These results, combined with the existing methods to analyze student engagement in classrooms, can provide a way to assess the effectiveness of a presentation. Feedback regarding how to enhance the presentation for each sub-topic can then be provided to the speaker. As student engagement is the key to successful classroom learning, this can prove to be a robust way of analyzing and evaluating the presentation skills of a speaker.

Keywords: Educational Data Mining, Video Analysis, Presentation Evaluation, Lecture Video, Computer Vision


 MOOC Massive Open Online Courses
 OCW OpenCourseWare
 IIITB International Institute of Information Technology, Bangalore
 OCR Optical Character Recognition
 ASR Automatic Speech Recognition
 CRNN Convolutional Recurrent Neural Network
 STN Spatial Transformer Network
 SSD Single Shot Detector
 RPN Region Proposal Network
 IoU Intersection over Union



As recording video has become easier, many institutions publish their lecture videos online in order to enhance accessibility. In most lecture videos, the lecturer uses slide presentations instead of a conventional blackboard. Presentations make it easier to display figures and videos, which often leads to a better understanding of the topic. In addition, students can review the lectures at a convenient time. As a result, there has been a surge in the availability of MOOC and OCW lecture videos. There is a pressing need to organize and analyze these lecture videos to extract valuable information that can aid ongoing educational research.

One of the research problems in this domain is to understand the contents of the projected presentation. The content in the slides can be broadly categorized into text, mathematical expressions, figures, and tables. One approach would be to first localize each of these in video frames and then try to recognize the actual content within. This work aims at identifying the regions of text, figures, and mathematical expressions in each slide using computer vision techniques. This can provide a quantitative estimate of the composition of each category over a particular period of time. These results, coupled with the existing methods of analyzing student engagement, can help improve the effectiveness of the presentation by providing timely feedback. Various topics can then be presented more effectively by adjusting the presentation accordingly. In addition, the content of the localized regions can be recognized in order to enable video indexing, searching and automatic extraction of class notes.

Since one of the major signs of successful classroom learning is consistent student engagement, this approach is a robust way of assessing and enhancing presentations.

Problem Statement

Localization of text, figures and mathematical expressions in lecture videos and quantitative estimation of the composition of each category over any particular segment of the video.

Objectives of the Research

• Cleaning the lecture video dataset containing the recorded lectures of IIITB.
• Annotating the slide region in each frame and developing a system to retain only the frames with unique slides, to avoid redundancy.
• Annotating text, figures, and mathematical expressions in each unique frame.
• Training a model to localize these regions.
• Evaluating the results.


The present work focuses only on the lecture videos captured with a fixed camera. The changes in the positioning of the camera haven't been accounted for. Also, only the lecture videos containing slides have been studied. Other modes of presentation such as blackboards and paper aren't considered.



There have been very few studies related to the recognition of text, especially in lecture videos. Liška, et al, 2007 proposed a system for automatic indexing to enable searching sequences. Slides are extracted using a VGA frame grabber and OCR is applied to extract essential keywords from the slides. These keywords, along with the corresponding slide timestamps, enable indexing of the video. The recognition of words was found to be less accurate because the words weren't localized beforehand. Moreover, access to digital slides was required in order to perform recognition. Wang, et al, 2003 worked on developing a system to synchronize video clips and their respective electronic slides by matching title and content. Keyframes are picked temporally and a geometry-based algorithm is used to detect text regions in them. After localization, OCR is used for recognition. Although the proposed algorithm performed well for the objective of synchronization, the technique used for localization can't be relied upon since geometry-based algorithms are sensitive to noise.

For text detection in natural scene images, Epshtein, et al, 2010 proposed a method using the Stroke Width Transform (SWT). This novel image operator computes a stroke width for each image pixel and exploits the fact that text in scene images tends to have a nearly constant stroke width. Building on this, Yang, et al, 2011 proposed an approach for automated lecture video indexing. For text recognition, they adopted a multi-hypotheses framework consisting of OCR, spell checking and processes for result merging. Using the geometrical information of detected text lines, an algorithm for slide structure analysis was implemented. Yang and Meinel, 2014, on the other hand, combined the results of OCR and Automatic Speech Recognition (ASR) in order to automatically generate keywords for lecture videos. For preprocessing the video, video segmentation and key-frame selection were employed. Textual metadata was then extracted by applying ASR to the lecture audio tracks and video OCR to the keyframes. This approach enabled content-based video search and browsing.

An essential step for localizing and recognizing lecture video content is the detection of the slide region. Though the frame rate can be adjusted so as to reduce the number of duplicate slides, duplicates cannot be avoided completely. With the detection of slide regions, slide transitions in presentations can be easily recognized. Jeong, et al, 2014 leverage image processing techniques to mark the slide region in a video. The frames are binarized with Otsu's threshold (Otsu, 1979). Each white contour is bounded by a rectangle of minimum area. These rectangles serve as candidates for the slide region. The largest rectangle with an aspect ratio close to 4:3 or 16:9 is chosen as the required slide region. One of the major drawbacks is that the results depend directly on the binarizing threshold, so the detection is highly sensitive to light exposure and the dimensions of the projection. This sub-problem needs to be addressed with a more scalable technique.
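The thresholding and candidate-selection steps described above can be sketched in plain Python. This is a minimal illustration, not the implementation of Jeong, et al, 2014: the function names, the axis-aligned `(x, y, width, height)` rectangles (instead of rotated minimum-area rectangles) and the relative tolerance `tol` are assumptions made for the example, and contour extraction is left out.

```python
from typing import List, Optional, Sequence, Tuple

Rect = Tuple[int, int, int, int]  # hypothetical layout: (x, y, width, height)


def otsu_threshold(pixels: Sequence[int]) -> int:
    """Return the 8-bit gray level that maximizes the between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0
    for t in range(256):
        w_bg += hist[t]          # pixels with value <= t form the background
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg      # remaining pixels form the foreground
        if w_fg == 0:
            break
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def pick_slide_rect(rects: List[Rect], tol: float = 0.1) -> Optional[Rect]:
    """Pick the largest rectangle whose aspect ratio is near 4:3 or 16:9."""
    targets = (4 / 3, 16 / 9)
    candidates = [
        r for r in rects
        if r[3] > 0 and any(abs(r[2] / r[3] - a) <= tol * a for a in targets)
    ]
    # Largest area wins; None if no rectangle has a plausible aspect ratio.
    return max(candidates, key=lambda r: r[2] * r[3], default=None)


if __name__ == "__main__":
    dark_and_bright = [10] * 50 + [200] * 50
    print(otsu_threshold(dark_and_bright))
    print(pick_slide_rect([(0, 0, 1900, 400), (100, 50, 800, 600), (10, 10, 160, 90)]))
```

Because the threshold is derived from the histogram of each frame, a change in room lighting or projector brightness shifts the binarization, which is the sensitivity noted above.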

One of the most recent and comprehensive works related to lecture videos is that of Dutta, et al, 2018. A new dataset, LectureVideoDB, consisting of frames from various lecture videos, has been introduced. The efficacy of existing state-of-the-art handwritten and scene text recognition techniques on the text in lecture videos has been investigated. Four modalities of lectures, namely slides, whiteboard, paper, and blackboard, are considered. For the task of word detection, techniques like EAST (Zhou, et al, 2017) and TextBoxes++ (Liao, et al, 2018) have been used. For the recognition of text, pre-trained models of CRNN and CRNN-STN have been employed. It has been shown that the existing methods perform well only on slides but not on other modalities and that there is a need to improve them.


There have been only a few studies conducted for the localization and recognition of text in lecture videos. Moreover, the detection of mathematical expressions and figures in lecture videos has been overlooked. Hence, there is a need to detect these along with the text in order to conduct a comprehensive content-based analysis of lecture videos.


Pipeline Overview

A lecture video of approximately 10 s duration is fed as input. It is split into frames at a constant rate. In the preprocessing step, redundant frames (with the same slide projected) are removed. Text detection is performed on the remaining frames using techniques like EAST or TextBoxes++. Along with this, mathematical expression detection is done with an SSD or RPN-based approach in which the model is pre-trained on a dataset containing printed equations and is fine-tuned on the IIITB dataset. A similar technique is used for the detection of figures. The final output metrics comprise the average number of words per slide (unique frame), equations per slide and figures per slide over that period of time.
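The final step of the pipeline, averaging the detections over the unique slides of a segment, can be sketched as follows. The `SlideCounts` record and the dictionary keys are hypothetical names chosen for this example; the per-slide counts are assumed to come from the detectors described above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SlideCounts:
    """Detections counted on one unique slide (hypothetical record)."""
    words: int
    equations: int
    figures: int


def composition_metrics(slides: List[SlideCounts]) -> Dict[str, float]:
    """Average number of words, equations and figures per unique slide."""
    if not slides:
        # No unique slide in the segment: report zeros instead of dividing by 0.
        return {"words_per_slide": 0.0,
                "equations_per_slide": 0.0,
                "figures_per_slide": 0.0}
    n = len(slides)
    return {
        "words_per_slide": sum(s.words for s in slides) / n,
        "equations_per_slide": sum(s.equations for s in slides) / n,
        "figures_per_slide": sum(s.figures for s in slides) / n,
    }


if __name__ == "__main__":
    segment = [SlideCounts(words=40, equations=2, figures=1),
               SlideCounts(words=20, equations=0, figures=1)]
    print(composition_metrics(segment))
```

Averaging per unique slide rather than per frame keeps a slide that stays on screen for a long time from dominating the estimate.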

Figure: Pipeline overview diagram

Data Collection

The dataset was collected from three one-day seminar sessions conducted at IIITB. Each session was recorded using an HD video camera with a resolution of 1920x1080 pixels. The camera was fixed at all times so as to capture both the speaker and the projected presentation, while the lecturer was free to move. As a result, the dataset also contains frames in which the slide is occluded by the speaker. The participants were graduates from the same institute. The same dataset has been used in Thomas, 2018 for analyzing the engagement between the teacher and the students.

Data Preprocessing

Each video was converted into a sequence of frames. Because the exposure to light was non-uniform, the content of some slides wasn't recognizable by the human eye. All the frames were inspected manually and such frames were removed. The number of frames corresponding to each lecturer was capped at 1500 in order to limit redundancy. In addition, contiguous segments of frames contained the same slide of the presentation. Avoiding this kind of duplication was essential in order to achieve satisfactory results. For training purposes, frames with each unique slide were handpicked. Originally, the dataset contained 44,466 frames. After cleaning, it was reduced to 722 frames. The reduced dataset was then used for the following steps of the pipeline.
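For the training set the unique frames were handpicked, but the removal of contiguous duplicate frames could also be automated. The sketch below is one possible approach and not the method used in this work: it keeps a frame only when it differs enough from the last kept frame. The grayscale frame representation, the mean-absolute-difference measure and the `threshold` value are all assumptions made for illustration.

```python
from typing import List, Sequence

Frame = Sequence[Sequence[int]]  # rows of 8-bit gray values (assumed format)


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Mean absolute pixel difference between two frames of equal size."""
    total = 0
    count = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            total += abs(pa - pb)
            count += 1
    return total / count if count else 0.0


def unique_slide_indices(frames: List[Frame], threshold: float = 10.0) -> List[int]:
    """Indices of frames that differ from the previously kept frame."""
    kept: List[int] = []
    for i, frame in enumerate(frames):
        # Compare against the last kept frame so slow drift does not accumulate.
        if not kept or mean_abs_diff(frames[kept[-1]], frame) > threshold:
            kept.append(i)
    return kept


if __name__ == "__main__":
    slide_a = [[0, 0], [0, 0]]
    slide_a_noisy = [[1, 0], [0, 2]]      # same slide with sensor noise
    slide_b = [[200, 200], [200, 200]]    # a different slide
    print(unique_slide_indices([slide_a, slide_a_noisy, slide_b, slide_b]))
```

In practice the comparison would be restricted to the localized slide region, since a moving speaker changes the rest of the frame even when the slide stays the same.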

Figure: Sample processed frames from the dataset (a frame from collection phase 1 and a frame from collection phase 2)

In order to pick unique frames, the slide region had to be localized in each frame. All the slide regions were first annotated using the labeling tool LabelImg. Because a slide has well-defined boundaries and similar properties across frames, object detection techniques were employed to detect slide regions. A pre-trained SSD MobileNetV2 model was fine-tuned on our training set, and its performance was evaluated on the test portion of the dataset. This localization also supports the subsequent detection tasks.

Text Detection

Tian, et al, 2016 proposed a novel technique to localize text lines in an image. One of the essential advantages of this technique was that no post-processing was required. We needed a metric to quantify the composition of text in slides. The number of lines of text couldn't be chosen, as different presentations could have different font sizes. Moreover, different slides of the same presentation could also have different font sizes. Hence, the number of words per slide was chosen as the metric for quantification of text. As mentioned in Dutta, et al, 2018, EAST and TextBoxes++ performed well in localizing text, especially in slides. The off-the-shelf EAST model was chosen for the task of word localization as it proved to be slightly better than TextBoxes++.
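The words-per-slide metric can be computed from the detector output by counting the word boxes that fall inside the localized slide region. The sketch below assumes axis-aligned boxes in `(x_min, y_min, x_max, y_max)` form and uses the box centre as the membership test; both are simplifications chosen for this example, since a word detector such as EAST can also produce rotated boxes.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed: (x_min, y_min, x_max, y_max)


def count_words_in_slide(word_boxes: List[Box], slide_box: Box) -> int:
    """Count word boxes whose centre lies inside the slide region."""
    sx0, sy0, sx1, sy1 = slide_box
    count = 0
    for x0, y0, x1, y1 in word_boxes:
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        if sx0 <= cx <= sx1 and sy0 <= cy <= sy1:
            count += 1
    return count


if __name__ == "__main__":
    slide = (100.0, 100.0, 900.0, 700.0)
    words = [
        (120.0, 120.0, 200.0, 150.0),   # inside the slide
        (850.0, 650.0, 930.0, 690.0),   # centre still inside the slide
        (0.0, 0.0, 50.0, 20.0),         # e.g. text on a poster outside the slide
    ]
    print(count_words_in_slide(words, slide))
```

Filtering by the slide region keeps text elsewhere in the frame, such as signage in the room, from inflating the count.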

Expression Detection

The dataset available from ICDAR 2013 on mathematical formula identification and recognition has been used. Our dataset was annotated with the locations of mathematical expressions. A pre-trained SSD MobileNetV2 model was fine-tuned on the ICDAR dataset and then tested on our dataset. SSD MobileNet is easier to train because it has fewer parameters than RPN-based two-stage detectors such as Faster R-CNN. The method targets both isolated and in-line mathematical expressions.

Figure Detection

Figures are often regarded as the best way to convey and teach complicated topics. Figures in slides may include graphs, network diagrams, charts or tables. The difficulty with the detection or localization of figures in slides is that there are very few common characteristics among all possible figures. Also, they don't have a definite shape or size. This is the prime reason for the poor performance of object-detection methods in detecting figures.


Slide Localization

The SSD MobileNetV2 model was trained for about 6000 steps, which took around 4 hours to complete.

Slide detection results

Metric          IoU        Area    maxDets  Value
Avg. Precision  0.50:0.95  all     100      0.855
Avg. Precision  0.5        all     100      1
Avg. Precision  0.75       all     100      1
Avg. Precision  0.50:0.95  small   100      NA
Avg. Precision  0.50:0.95  medium  100      NA
Avg. Precision  0.50:0.95  large   100      0.855
Avg. Recall     0.50:0.95  all     1        0.883
Avg. Recall     0.50:0.95  all     10       0.883
Avg. Recall     0.50:0.95  all     100      0.883
Avg. Recall     0.50:0.95  small   100      NA
Avg. Recall     0.50:0.95  medium  100      NA
Avg. Recall     0.50:0.95  large   100      0.883

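It can be observed that the mean Average Precision (mAP), averaged over IoU thresholds from 0.50 to 0.95, is around 0.85, while the average recall is around 0.88. At the individual IoU thresholds of 0.5 and 0.75 the average precision is 1, which indicates that the slide region is detected correctly in the test frames at those thresholds. The lower averaged value reflects detections whose boxes do not overlap the annotation tightly enough at the stricter thresholds. Below are a few inferences on the test set.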
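The gap between the average precision of 1 at IoU 0.5 and 0.75 and the value of 0.855 averaged over 0.50:0.95 follows from how IoU thresholds work. The sketch below is an illustration with made-up boxes, assuming axis-aligned boxes in `(x_min, y_min, x_max, y_max)` form: a prediction with an IoU of 0.78 counts as a correct detection at the thresholds up to 0.75 but as a miss at 0.80 and above.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed: (x_min, y_min, x_max, y_max)

# The ten thresholds 0.50, 0.55, ..., 0.95 used for the averaged metric.
IOU_THRESHOLDS: List[float] = [t / 100 for t in range(50, 100, 5)]


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def thresholds_passed(pred: Box, truth: Box) -> List[float]:
    """Thresholds at which the prediction counts as a correct detection."""
    score = iou(pred, truth)
    return [t for t in IOU_THRESHOLDS if score >= t]


if __name__ == "__main__":
    predicted = (0.0, 0.0, 100.0, 100.0)   # made-up detection
    annotated = (0.0, 0.0, 100.0, 78.0)    # made-up ground truth
    print(iou(predicted, annotated))
    print(thresholds_passed(predicted, annotated))
```

A detector can therefore find every slide and still score below 1 on the averaged metric if some of its boxes are slightly loose.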

Figure: Slide detection inference (sample inferences 1 and 2)

Word Detection

As mentioned, the off-the-shelf EAST model performs well on this dataset. Its performance on slides has already been tested in Dutta, et al, 2018, as shown in Fig 4.

Fig 4: Evaluation of EAST and TextBoxes++ done by Dutta, et al, 2018

Fig 5 shows the result of running the text detection on our dataset.

Fig 5: Word detection inference (sample inferences 1 and 2)

Expression Detection

A pre-trained object detection model (SSD MobileNetV2) was fine-tuned on a printed mathematical equation dataset, on which it performs well. Its performance was then tested on our dataset. Fig 6 shows an inference of the model on the printed dataset. It can be observed that the model detects isolated formulae successfully but doesn't detect any in-line expressions.

Fig 6: Inference on printed mathematical dataset


Understanding content in slides from lecture videos remains a difficult task. Though the text detection methods fare well on the dataset, the results of equation and figure detection aren't as good. Newer techniques need to be introduced for this purpose. In addition, a few heuristics based on the characteristics of figures and equations can be employed. Nevertheless, these techniques combined give a fair estimate of the composition of each element in a presentation.


I am grateful to the Science Academies for giving me this wonderful opportunity. I would like to thank my guide, Dr. Dinesh J, for his constant guidance and support throughout the project. I would also like to thank my mentor, Ms. Chinchu T, for helping me figure out the various intricacies involved. I am also thankful to my family and friends for their continuous support.


• Liška, Miloš and Rusňák, Vít and Hladká, Eva (2007). Automated Hypermedia Authoring for Individualized Learning. In Program and Abstracts of 8th International Conference on Information Technology Based Higher Education and Training. Kumamoto, Japan: Kumamoto University. 4 pp.

• Wang, Feng and Ngo, Chong-Wah and Pong, Ting-Chuen (2003). Synchronization of lecture videos and electronic slides by video text analysis.

• Epshtein, Boris and Ofek, Eyal and Wexler, Yonatan (2010). Detecting text in natural scenes with stroke width transform.

• Yang, Haojin and Siebert, Maria and Luhne, Patrick and Sack, Harald and Meinel, Christoph (2011). Lecture Video Indexing and Analysis Using Video OCR Technology.

• Yang, Haojin and Meinel, Christoph (2014). Content Based Lecture Video Retrieval Using Speech and Video Text Information. Vol. 7.

• Jeong, Hyun Ji and Kim, Tak-Eun and Kim, Hyeon Gyu and Kim, Myoung Ho (2014). Automatic detection of slide transitions in lecture videos. Vol. 74.

• Otsu, Nobuyuki (1979). A Threshold Selection Method from Gray-Level Histograms. Vol. 9.

• Dutta, Kartik and Mathew, Minesh and Krishnan, Praveen and Jawahar, C.V (2018). Localizing and Recognizing Text in Lecture Videos.

• Zhou, Xinyu and Yao, Cong and Wen, He and Wang, Yuzhi and Zhou, Shuchang and He, Weiran and Liang, Jiajun (2017). EAST: An Efficient and Accurate Scene Text Detector.

• Liao, Minghui and Shi, Baoguang and Bai, Xiang (2018). TextBoxes++: A Single-Shot Oriented Scene Text Detector. Vol. 27.

• Thomas, Chinchu (2018). Multimodal Teaching and Learning Analytics for Classroom and Online Educational Settings.

• LabelImg. Github. [Online]. Available: https://github.com/tzutalin/labelImg

• Tian, Zhi and Huang, Weilin and He, Tong and He, Pan and Qiao, Yu (2016). Detecting Text in Natural Image with Connectionist Text Proposal Network.
