1 January 2020
There is a lot of material on the MEDW site about Processing Scans, where I have been converting the large pile of Model Engineer and other magazines to recycling fodder. The resulting electronic copies are now filed away in several places, but they still need a little tidying up: while they do have a searchable text layer, originally I was making them fully digital using FineReader OCR. The licence for that no longer works and I am loath to sign up to the annual subscription to get it working again, so I've been digging into the various options on Linux. The starting point was to try the 'ChatBot' to see what it could come up with, but much of its procrastination was repeating things we had already found to be dead, or that simply did not do what I was ASKING FOR.
Bugger ... it did not save again ... can wait until the morning now!
OK, so what had I written yesterday? Some of the links are on the wiki page, but the main gist was the performance of Mistral in answering simple questions about OCR and outputting the result to LibreOffice. I have run out of time twice on the sessions I had discussing this with Mistral, which was unable to say just what the limits are, but since few of my negative replies affect what it actually produces, starting over is not really a problem. I spent several hours working through things which in many cases simply did not work, despite downloading several extra packages. So now that I know some of the answers, can I pose any better questions to finally get better answers?
The fundamental problem is that while PDFArranger does a very nice job of converting the portrait A3 scans of the original magazines, produced by the document feeder, into a properly ordered set of A4 pages in the pdf document, it does NOT produce a set of A4 images. The pdf standard provides a simpler way of handling the original scans, and most pdf viewers can handle that and give the appearance that the right A4 pages are in the document. However, some pdf viewers get confused and output the raw A3 images as they are stored internally. In the past I have made several attempts to 'print' the document as A4 pages, which are better suited to further OCR processing, but without success. Mistral did come up with the option to ACTUALLY print the pages and rescan them, which just seems a little crass? Once I got Mistral away from all the dross around the whole process and concentrating on JUST the pre-processing step, a GhostScript command line appeared which did exactly what I wanted. It produces a full set of individual page images, but at least they are in the right orientation, and I can reduce the resolution to 300DPI from the 600DPI of the original scans, which the OCR process seems to prefer? These images can then be recombined into a smaller pdf file, and OCRmyPDF handles adding the searchable layer. This results in a final file about 60% smaller, even though the intermediate pdf is 4 times bigger!
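A minimal sketch of that pre-processing chain, written in Python as one way of scripting it. This is not the exact command line Mistral produced; it only uses Ghostscript, img2pdf and OCRmyPDF options that are known to exist. The function names, the `page-%03d.png` naming, the use of img2pdf for the recombining step, and the `-dUseCropBox` flag (which assumes the A4 pages are defined as crop boxes over the stored A3 scans) are all assumptions for illustration.

```python
import subprocess
from pathlib import Path


def ghostscript_cmd(src_pdf, out_dir, dpi=300):
    """Build a Ghostscript command that renders each pdf page to its own PNG."""
    return [
        "gs", "-dNOPAUSE", "-dBATCH", "-dSAFER",
        "-sDEVICE=png16m",   # 24-bit colour PNG, one file per page
        f"-r{dpi}",          # render resolution in DPI
        "-dUseCropBox",      # ASSUMPTION: the A4 pages are crop boxes on the A3 scans
        f"-sOutputFile={Path(out_dir) / 'page-%03d.png'}",
        str(src_pdf),
    ]


def img2pdf_cmd(images, out_pdf):
    """Build an img2pdf command that recombines the page images into one pdf."""
    return ["img2pdf", *[str(p) for p in images], "-o", str(out_pdf)]


def ocrmypdf_cmd(in_pdf, out_pdf):
    """Build an OCRmyPDF command that adds the searchable text layer."""
    return ["ocrmypdf", str(in_pdf), str(out_pdf)]


def clean_a4(src_pdf, work_dir, final_pdf, dpi=300):
    """Run the three steps in order; needs gs, img2pdf and ocrmypdf installed."""
    work = Path(work_dir)
    work.mkdir(parents=True, exist_ok=True)
    subprocess.run(ghostscript_cmd(src_pdf, work, dpi), check=True)
    # Zero-padded names sort correctly for documents of up to 999 pages.
    pages = sorted(work.glob("page-*.png"))
    merged = work / "merged.pdf"  # the (large) intermediate pdf
    subprocess.run(img2pdf_cmd(pages, merged), check=True)
    subprocess.run(ocrmypdf_cmd(merged, final_pdf), check=True)
```

Calling something like `clean_a4("magazine.pdf", "work", "magazine-a4.pdf")` would run all three tools in sequence (the file names here are hypothetical). Keeping the command builders separate from the runner makes it easy to print a command and try it by hand before committing a whole batch of magazines to it.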
So the next step is to convert the now tidier pdf into a fully electronic version of the text and images. This is where FineReader used to make a nice job, and I have a few documents which are much smaller, with clean backgrounds and individual images, and it is this I am trying to replicate. Rather than converting a couple of thousand magazines to this format, the first step has always been to select all the relevant pages across several magazines to produce a consolidated project document, such as the Superba Traction Engine project from the late 1980s. I have several projects packaged like this on the MEDW site, but I am not sure of the copyright rules about making them publicly available. The processing to clean A4 images is a good start even without the final step, as it removes all of the unnecessary material from the intermediate images.
So where are we on that final step? That is where we go next on OCR processing on Linux, and in the meantime I can tidy up the script that converts the current pdfs to clean A4 versions. I still need to establish whether throwing away the higher resolution is worth doing, as the line drawings are much better at the higher resolution. Making these available as an appendix is on the TODO list, and one of the remaining Windows programs, Vectorizer, does a nice job of producing these as CAD drawings.
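To put a number on what is being thrown away, here is a small sketch of the arithmetic, assuming a standard A4 page of 210 mm x 297 mm. The function name is just for illustration.

```python
MM_PER_INCH = 25.4
A4_MM = (210, 297)  # standard A4 width and height in millimetres


def a4_pixels(dpi):
    """Return the (width, height) of an A4 page in pixels at the given DPI."""
    width_mm, height_mm = A4_MM
    return (round(width_mm / MM_PER_INCH * dpi),
            round(height_mm / MM_PER_INCH * dpi))


low_w, low_h = a4_pixels(300)
high_w, high_h = a4_pixels(600)
# Doubling the DPI doubles both dimensions, so roughly four times the pixels.
ratio = (high_w * high_h) / (low_w * low_h)
print(f"300DPI: {low_w}x{low_h}, 600DPI: {high_w}x{high_h}, ratio {ratio:.2f}")
```

Halving the resolution keeps only about a quarter of the pixels, which is why fine line drawings suffer far more than body text does.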