George Papaioannou: Data digitizer Project - by Open Knowledge Foundation

For the past few weeks I have been trying to digitize some data from scanned PDF files. They are part of the monthly statistical bulletins that the Greek statistical agency has recently made available online (access the digital library here). To be more specific, I am interested in about 10 variables on social benefits, in order to analyse potential political economic cycles. I am using the monthly bulletins to capture any monthly (hence really short-term) variation and to discover any politically influenced changes before an election. However, a major issue with Greek data is that even for elementary variables such as the CPI it is difficult to find the data already digitized and available in a file that data-processing software can use, e.g. CSV. The work becomes harder when someone needs monthly data going back as far as 1970 (as in my case).

Thus, the only available option is to sit down and digitize the data by hand, which is a very tedious, time-consuming process that may well degrade the quality of the data, since the human mind is prone to errors regardless of how careful someone is. I have been working on that task for about 2 months now. Bear in mind that we are talking about 10 variables times 12 months per year times 41 years, which makes a total of 4,920 data records! This has been the manual way of digitizing data.
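To make the size of the task concrete, here is a minimal Python sketch that lays out a blank long-format CSV template with one row per figure to be transcribed and counts the rows. The variable names (`benefit_01` and so on) are hypothetical placeholders, and the year range 1970 through 2010 is an assumption that matches the 41 years mentioned above; it is only an illustration of the count, not the layout used in the bulletins.

```python
import csv
import io
from itertools import product

# Hypothetical placeholder names; the actual variables are not listed in the post.
VARIABLES = [f"benefit_{i:02d}" for i in range(1, 11)]
YEARS = range(1970, 2011)  # assumption: 1970 through 2010 inclusive, i.e. 41 years
MONTHS = range(1, 13)


def empty_records():
    """Yield one (variable, year, month, value) row per figure to transcribe."""
    for variable, year, month in product(VARIABLES, YEARS, MONTHS):
        yield (variable, year, month, "")


def write_template(stream):
    """Write a blank long-format CSV template; return the number of data rows."""
    writer = csv.writer(stream)
    writer.writerow(["variable", "year", "month", "value"])
    count = 0
    for row in empty_records():
        writer.writerow(row)
        count += 1
    return count


if __name__ == "__main__":
    buffer = io.StringIO()
    print(write_template(buffer))  # 10 variables * 41 years * 12 months
```

Writing to a real file instead of the in-memory buffer would give a template that can be filled in one cell at a time while reading the scanned bulletins.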

Yesterday I came across a massive project (massive in terms of its potential once it is completed) that is under way at the Open Knowledge Foundation (OKF). It is a data digitizer: a piece of software intended to automatically digitize data from, e.g., PDF files and put them into a spreadsheet (CSV) while keeping the layout of the cells and columns as far as possible. The initiative seems extremely clever and impressive. As anyone who has had to digitize data manually can comfortably say, this project, when finished and if successful, could greatly increase the number of existing, ready-to-use datasets, since many PDF data that cannot currently be processed would become available for statistical analysis or application development in a matter of machine seconds!

You can view all the details of the Data Digitizer project HERE.

There is also a small demo showing some features of the project, which is still under construction, HERE.

Having suffered through the process of manually digitizing data, I want to thank the minds and the hands behind this project.

1 comment:

Jenny Molloy, 23 November 2011 at 05:17
Dear George

Sadly, the OKF data digitiser is aimed at aiding manual transcription of tabular data from PDFs/images rather than automating the process as you describe. That would obviously be very awesome, but we quickly decided it was impossible in a day (Data Digitiser was hacked together at an Open Science Workshop in a matter of hours!), if not impossible full stop.

My impression of automated digitisation that maintains tabular structure is that many have tried extremely hard and all have failed thus far, although if anyone knows of any open projects that are getting close then let us know and we will very gladly make that feature available!

Jenny Molloy
Coordinator, OKF Open Data in Science working group

18 November 2011
