3 March 2025

by Lester Caine
Posted to Lester's Rants

Time, I think, to give up on this as a bad job. I have several components of the process to produce a fully digitized version of the Model Engineering magazines, but nothing that gets close to producing a usable result complete with individual images. Tesseract is very good at the text side of things, but only when there is nothing but text to work with. Things like adverts with text inside frames confuse it, and images pick up a smattering of 'recognised' text where no actual text is present. I have a copy of scribeocr running on the web stack here and this produces more accurate text overlays than the ocrmypdf version, but either is perfectly adequate as a transparent layer on the raw scans.

gImageReader has some tantalising facilities when used in text mode, as it allows me to select images and save them one at a time, but it does not provide the location information needed to help build a page. Switching to hocr mode just results in a lot of 'noise', with numerous graphic elements in the tree, but none of them the actual images on the page. The resulting docx file has a few usable pages, but most are not, whatever 'page segmentation mode' I select. What is missing here is the ability to select areas to be recognised or tagged as images and add THOSE to the hocr tree. I can edit some of the rogue image blocks to highlight the adjacent larger image, but that change is not reflected in any of the outputs.
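For what it is worth, the missing step of adding an image area to the hocr tree can be sketched by hand outside the GUI tools. The following is only a rough illustration, not something any of the packages above provide: it assumes the hocr file is well-formed XHTML, that pages are `div` elements with class `ocr_page`, and that `ocr_photo` is an acceptable class for a picture region. The function name `add_image_region`, the ids and the bounding box numbers are all made up for the example.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Keep the XHTML namespace as the default one when writing the tree back out.
ET.register_namespace("", XHTML_NS)

# Tiny stand-in for a real hocr file (hypothetical ids and coordinates).
SAMPLE_HOCR = """<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<div class="ocr_page" id="page_1" title="bbox 0 0 2480 3508">
<span class="ocr_line" id="line_1_1" title="bbox 200 150 2200 210">Some text</span>
</div>
</body>
</html>"""


def add_image_region(hocr_text, page_id, bbox, region_id):
    """Return hocr text with an image region appended to the named page.

    bbox is (x0, y0, x1, y1) in page pixels, as picked by hand.
    The 'ocr_photo' class is an assumption about how to tag the region.
    """
    root = ET.fromstring(hocr_text)
    for div in root.iter(f"{{{XHTML_NS}}}div"):
        if div.get("class") == "ocr_page" and div.get("id") == page_id:
            x0, y0, x1, y1 = bbox
            region = ET.SubElement(div, f"{{{XHTML_NS}}}div")
            region.set("class", "ocr_photo")
            region.set("id", region_id)
            region.set("title", f"bbox {x0} {y0} {x1} {y1}")
            return ET.tostring(root, encoding="unicode")
    raise ValueError(f"no ocr_page with id {page_id!r}")


if __name__ == "__main__":
    print(add_image_region(SAMPLE_HOCR, "page_1", (300, 900, 1400, 1800), "photo_1_1"))
```

Whether any downstream converter would then honour such a region when building the page is a separate question, and nothing tried so far suggests that it would.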

The Third Party Tools page in the tesseract-ocr GitHub repo needs some major surgery, as many of the projects listed are simply dead or have not been updated in 10+ years. Some are still hosted on SourceForge, which shows just how old they are. None, however, seems ever to have had the ability to add images to the resulting conversion tree.

I downloaded PDF_Extractor and OpenCV, but neither is giving any help on the main target path. Several other packages are also now on the machine, but their main purpose seems to be simply to extract existing text from the pdf files rather than extracting the layout of the pages. The LibreOffice extension should apparently also do that, but although it is installed, it does not even appear on the menu to allow me to try.

I've taken the time to have another go at FineReader. I had the licence key, but the link to the software download no longer worked. So I've found a copy of the installation package on Torrent and now have it working on the Windows machine. It's running rather slowly, so processing all of the PDFs is not going to happen, but cherry-picking compilation articles can be handled at least.