Well, the new tax year is here, and while in the past my accountant would do the tax return on the 29th January of the following year, I don't have to wait for that now, and the tax man is sitting on money I want back, so I need to do the accounts over the next week or so. The starting point for that is the PayPal history file and the HSBC bank account statements. While in the past a full PayPal dump would take a few hours, this one was available within 30 minutes. It's a pity that the HSBC accounts are not so easy! The business statements I can at least get as spreadsheets a month at a time, but the personal account statements are only available as PDF documents, and on top of that the current style of PDF makes it difficult to use 'cut and paste' to grab the transaction details.
So I have just spent a lot of the day trying to create a clean CSV version of the PDFs. I even have a full set of paper copies that I then rescanned to remove the messy hidden text, which I now suspect has been created using Tesseract, as everybody seems to use today. The problem is that none of its modes understands a bank statement grid, producing vertical lists of text ... without the 'gaps' caused by blank fields. I am sure in the past I did simply cut and paste stuff from the PDF to the spreadsheet, so just what has changed today? Then I found a package called Tabula, which was listed as handling tabular documents. The first problem was getting it working on Linux, but once again it's a Java application and, just like Eclipse over the weekend, it just crashes. I had better luck on the Windows box, and at tea time I ran all of the original PDF files through it and saved them to 12 tabs in a LibreOffice spreadsheet. Tabula has to have the messy text layer, but it post-processes it to fit a grid, which adds in all the 'spaces'. Save did not seem to work, but cut and paste from the web-based interface is perfect, or at least as perfect as the original statements. I would prefer each transaction on its own line rather than some being spread across two lines. Having the numbers a line below is confusing at times, but I am not sure it's worth manually processing things to fix it. The next job is simply to identify business expenses from the whole, and as I've mainly used the business account, there SHOULD be very few over the 40 or so pages anyway.
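If the two-line transactions ever do become worth fixing, a small script could probably do the joining instead of doing it by hand. What follows is only a rough Python sketch under some assumptions of mine: that each spreadsheet tab is exported as CSV, that the columns are named date, description, paid_out, paid_in and balance (made-up names, not anything Tabula itself produces), and that a split transaction always carries its numbers on its last line. The sample rows are invented purely for illustration.

```python
import csv
import io

# Assumed column names; the real export will need these adjusting.
AMOUNTS = ("paid_out", "paid_in", "balance")


def cell(row, key):
    """Return a cell as stripped text, treating missing cells as empty."""
    return (row.get(key) or "").strip()


def combine(group):
    """Fold a group of rows into one: first date, joined text, last row's numbers."""
    last = group[-1]
    merged = {
        "date": next((cell(r, "date") for r in group if cell(r, "date")), ""),
        "description": " ".join(
            cell(r, "description") for r in group if cell(r, "description")
        ),
    }
    for key in AMOUNTS:
        merged[key] = cell(last, key)
    return merged


def merge_split_rows(rows):
    """Join text-only rows onto the following row that holds the amounts."""
    merged, group = [], []
    for row in rows:
        group.append(row)
        if any(cell(row, key) for key in AMOUNTS):
            # This row has numbers, so the transaction is complete.
            merged.append(combine(group))
            group = []
    if group:
        # Trailing text with no numbers: keep it rather than lose it.
        merged.append(combine(group))
    return merged


# Invented sample data: the first transaction is split over two lines.
sample = """date,description,paid_out,paid_in,balance
01 Apr,EXAMPLE SHOP,,,
,CARD PAYMENT,12.50,,987.50
02 Apr,EXAMPLE REFUND,,3.00,990.50
"""

for row in merge_split_rows(csv.DictReader(io.StringIO(sample))):
    print(row)
```

The rule "a row with no numbers belongs to the next row that has some" is the whole trick, so any statement line that legitimately has no amounts at all would get glued onto its neighbour and would need checking by eye.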
Of course, going forward I really need to keep on top of this, and while I no longer have access to the Sage accounts package, one of the options on Linux should be the way to go. Yet another todo list item!