I’ve been playing with text analysis and in the course of doing this came up with a couple of posts on my geeky blog:
The first post reiterates my view that data archives should be content agnostic – ie, an archive just stores stuff and focuses on finding it, curating it, and making it available. That's different from a presentation service, which can be a user interface for either generalised or specialised search, or some special analytic process that works over the archived content.
I’ve speculated a little about what we can do by applying text analysis techniques to text-based repository content. In the second post I make the point that while we can treat text as data, that’s not its only use, and we need to consider how people will actually use the content.
And this is where the topic modelling stuff has been helpful. When we think of text we tend to think of books and articles, but equally well it could be scanned data printouts or old line-printer SPSS output.
Provided the documents are legible enough to run through one of the open-source OCR engines, we could potentially feed the recovered text into word-cloud or topic-modelling software to extract key terms. In other words, we could build an index of what a set of scanned items may be about, rather than just archiving it as ‘scanned listing of PNG land tenure’, making it easy for someone to decide which documents may be worth looking at further …