Application-Oriented Segmentation of Printed Sindhi Text for Document Recognition and Natural Language Processing Systems

Authors

  • Pir Bakhsh Khokhar, Institute of Information and Communication Technologies, MUET, Jamshoro, Pakistan, https://orcid.org/0000-0001-9993-4572
  • Shahnawaz Talpur, Institute of Information and Communication Technologies, MUET, Jamshoro, Pakistan, https://orcid.org/0000-0002-2660-6145
  • Muhammad Ismail, Department of Computer Science, Sukkur IBA University, Sukkur, Pakistan, https://orcid.org/0000-0002-2274-4276
  • Hassan Abbas, Department of Business Administration, Sukkur IBA University, Sukkur, Pakistan
  • Muhammad Asif Khan, Department of Computer Science, Sukkur IBA University, Sukkur, Pakistan

DOI:

https://doi.org/10.52584/QRJ.2302.03

Keywords:

Document Image Analysis, Printed Sindhi Script, Pixel-Based Segmentation, Vertical Projection Profiles, Skew Detection and Correction, OCR Preprocessing, Cursive Script Processing

Abstract

Text segmentation of printed Sindhi documents is a fundamental requirement for building reliable OCR and NLP systems for the Sindhi language. The cursive nature of the Sindhi script, with its complex joining of characters, makes it difficult to identify lines and characters automatically. This paper presents a new application-driven method for accurately separating lines and ligatures in printed images of Sindhi text. Adaptive thresholding is applied first so that noisy and skewed images are handled robustly; pixel counting then detects the top and bottom boundaries of each text line for line segmentation. Ligature segmentation is performed by applying a vertical projection profile technique to the extracted lines. An improved skew correction algorithm addresses two main text-related problems: unaligned text lines and inconsistent spacing between lines. The system was evaluated on 400 printed Sindhi text images and achieved a line segmentation accuracy of 98.3%. The segmented ligatures form a basis for Sindhi OCR development, which in turn supports applications in speech recognition, text mining, and other digital language processing tasks.
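
As a rough illustration of the pixel-counting and vertical-profile ideas described in the abstract, the following minimal sketch segments a small binary image into text lines and then into column runs within a line. It is not the authors' implementation: the function names (find_runs, segment_lines, segment_ligatures), the min_count parameter, and the assumption that the image is already binarized and deskewed (1 = ink, 0 = background) are all hypothetical choices made for this example.

```python
from typing import List, Tuple

# Assumed input: an already binarized, deskewed image; 1 = ink, 0 = background.
Image = List[List[int]]


def find_runs(profile: List[int], min_count: int = 1) -> List[Tuple[int, int]]:
    """Return (start, end) pairs, end exclusive, where profile >= min_count."""
    runs = []
    start = None
    for i, count in enumerate(profile):
        if count >= min_count and start is None:
            start = i                      # a run of ink begins here
        elif count < min_count and start is not None:
            runs.append((start, i))        # the run ended at the previous index
            start = None
    if start is not None:                  # run reaches the end of the profile
        runs.append((start, len(profile)))
    return runs


def segment_lines(img: Image) -> List[Tuple[int, int]]:
    """Count ink pixels per row; each non-empty run of rows is one text line."""
    row_profile = [sum(row) for row in img]
    return find_runs(row_profile)


def segment_ligatures(img: Image, top: int, bottom: int) -> List[Tuple[int, int]]:
    """Count ink pixels per column inside rows [top, bottom); gaps split ligatures."""
    width = len(img[0]) if img else 0
    col_profile = [sum(img[r][c] for r in range(top, bottom)) for c in range(width)]
    return find_runs(col_profile)


if __name__ == "__main__":
    sample = [
        [0, 0, 0, 0, 0, 0],
        [1, 1, 0, 0, 1, 0],
        [1, 0, 0, 0, 1, 1],
        [0, 0, 0, 0, 0, 0],
        [0, 1, 1, 1, 0, 0],
        [0, 0, 0, 0, 0, 0],
    ]
    for top, bottom in segment_lines(sample):
        print((top, bottom), segment_ligatures(sample, top, bottom))
```

The column runs are returned in left-to-right order; because Sindhi is written right to left, a real pipeline would likely reverse them to recover reading order. Real scans would also need a threshold above 1 (min_count) or smoothing of the profiles to tolerate noise, and diacritic dots separated from their base shapes would need extra handling that this sketch does not attempt.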

Published

2025-12-30