Software to scan paper forms and extract data?

Afternoon forumites,

Is anyone aware of software that can take a scanned copy of something (or deal directly with the physical document) and extract the text and numerical data from those scanned copies into a database?

The biggest issue is that whilst the fields are labelled the same way on every form, their locations differ from form to form, so it's not like a machine-marked test or suchlike where every copy being scanned has an identical layout. Unfortunately the forms cannot be standardised, because the processes used by those completing them are not standardised. This means that whilst the data are all entered using word processing software or similar (no handwriting), the fonts, text sizes and locations vary.

It essentially needs to be able to find the named fields on the form, extract the data associated with each field, and put that data into the relevant field in the database. A human will check that what is being input is correct, as we are not dealing with thousands a day; we're talking about maybe 20k pages per year.
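A minimal sketch of that "find the label, take the value next to it" idea, assuming the scan has already been run through OCR into plain text with one field per line. The field labels and the `extract_fields` / `needs_review` helpers are made up for illustration, not taken from any particular product:

```python
import re

# Hypothetical field labels; the real forms' labels are not given here.
FIELD_LABELS = ["Provider Name", "Reference Number", "Date of Visit"]


def extract_fields(ocr_text, labels=FIELD_LABELS):
    """Return {label: value or None}, searching for each label on any line.

    Assumes each label starts a line and its value follows on the same
    line, with or without a colon. Position on the page is ignored.
    """
    results = {}
    for label in labels:
        pattern = re.compile(
            r"^[ \t]*" + re.escape(label) + r"[ \t]*:?[ \t]*(.*)$",
            re.IGNORECASE | re.MULTILINE,
        )
        match = pattern.search(ocr_text)
        value = match.group(1).strip() if match else ""
        results[label] = value or None
    return results


def needs_review(results):
    """List the labels with no value found, for a human to check."""
    return [label for label, value in results.items() if value is None]


sample = "Reference Number: AB-123\n\n  provider name   Acme Care Ltd\nDate of Visit:\n"
fields = extract_fields(sample)
print(fields)
print("Check by hand:", needs_review(fields))
```

Because the search keys on the label text rather than on coordinates, the order and spacing of the fields can change without breaking it; anything it can't find is handed to the person doing the checking.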

Any suggestions at all would be very much appreciated.

Hugh
 
When I used to work in a document storage company we used software from Kofax called Ascent Capture. It was pretty powerful, so it could perhaps do what you describe above.
 
Is it a case of there being an infinite number of variations in the form format and layout, or are the variations limited to a certain number?

Is it possible the solution could be for the system to identify which version of the form is being used, and then know where to look for the info on that variant?
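Roughly what I mean, as a sketch only: it assumes each form variant carries some distinctive marker text and that you'd maintain a small set of rules per variant. The variant names, marker phrases and patterns below are all invented for illustration:

```python
import re

# Hypothetical variants: a marker phrase to recognise each one,
# plus a regex per field saying where that variant keeps its data.
VARIANTS = {
    "provider_a": {
        "marker": "Form A rev 2",
        "fields": {"reference": r"Ref No\.\s*(\S+)"},
    },
    "provider_b": {
        "marker": "Provider B Submission",
        "fields": {"reference": r"Reference:\s*(\S+)"},
    },
}


def identify_variant(ocr_text):
    """Return the name of the first variant whose marker appears, else None."""
    lowered = ocr_text.lower()
    for name, spec in VARIANTS.items():
        if spec["marker"].lower() in lowered:
            return name
    return None


def extract_for_variant(ocr_text):
    """Identify the variant, then apply that variant's field patterns."""
    name = identify_variant(ocr_text)
    if name is None:
        return None, {}
    fields = {}
    for field, pattern in VARIANTS[name]["fields"].items():
        match = re.search(pattern, ocr_text)
        fields[field] = match.group(1) if match else None
    return name, fields


print(extract_for_variant("Provider B Submission\nReference: 998\n"))
```

This only works if the number of variants is manageable; an unrecognised form comes back as `None` and would need handling by hand.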
 
When I used to work in a document storage company we used software from Kofax called Ascent Capture. It was pretty powerful, so it could perhaps do what you describe above.
I've just watched some videos on that software...and wow, that is some impressive stuff! It's astonishing how it lets you auto-input data that it's not sure on just by clicking on it on the form. Great, great stuff. I will be investigating this one further (I suspect it's well out of the price range, but I see they've done a year's free trial before...). Thank you.
 
I would look at capturing this data using forms on a tablet instead of using paper.
Unfortunately that's not possible, as there are ~90 providers putting the information from their internal systems into the forms, so a tablet isn't an option. They don't all use the same internal software either, so we can't work with a single software provider to develop a nice little module that does all the work; there are just too many of them!
 
Is it a case of there being an infinite number of variations in the form format and layout, or are the variations limited to a certain number?

Is it possible the solution could be for the system to identify which version of the form is being used, and then know where to look for the info on that variant?
I suppose it's theoretically infinite as some of the forms do not have e.g. page cutoffs, so the same input box can change size by multiple lines, thus pushing the other information all over the place. Ideally the forms would be standardised, but they're not.
It looks like Kofax can identify which form type is being used, and also where the information sits on the page for each type. So perhaps it can accommodate variation within a form type too? Not sure.
 
Another option would be for them to fill in a fixed format PDF file, so they just enter text in the relevant box and as the format is locked they can't mess it up. But that's not much help as you've already said you can't standardise.

I've only used home-use OCR software and found it to be pretty hit and miss. For what you're doing, a commercial solution like the one Mr Plow suggested is probably the way to go.
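On the hit-and-miss point: one way to soften OCR misreads of the field labels is fuzzy matching against the list of labels you expect. A small sketch using Python's standard `difflib`; the label list, the `match_label` helper and the 0.75 cutoff are assumptions for illustration, and the cutoff would need tuning on real scans:

```python
import difflib

# Hypothetical list of the labels expected on the forms.
KNOWN_LABELS = ["Provider Name", "Reference Number", "Date of Visit"]


def match_label(ocr_label, known=KNOWN_LABELS, cutoff=0.75):
    """Map a possibly misread label to the closest known label, or None."""
    by_lower = {label.lower(): label for label in known}
    hits = difflib.get_close_matches(
        ocr_label.strip().lower(), list(by_lower), n=1, cutoff=cutoff
    )
    return by_lower[hits[0]] if hits else None


# "rn" misread for "m" is a typical OCR slip.
print(match_label("Reference Nurnber"))
print(match_label("zzzz"))
```

This only helps with the labels; misread values would still need a person to check them.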
 
Another option would be for them to fill in a fixed format PDF file, so they just enter text in the relevant box and as the format is locked they can't mess it up. But that's not much help as you've already said you can't standardise.

I've only used home-use OCR software and found it to be pretty hit and miss. For what you're doing, a commercial solution like the one Mr Plow suggested is probably the way to go.

That would be ideal. We did get in with a software provider used by some of those who send the forms to us, and they built the form into their system so that forms were auto-completed and auto-submitted to us...but that company has since gone bust and no one uses their wares any more. Real kicker.

I'll give OCR a try out, just to see what it's like. Thank you for the suggestion. Kofax certainly looks like the dog's danglies, but their initial starter pack is $30k+ so I suspect it's not possible for this project. What a fantastic looking bit of kit though!
 