By Keith Kintigh, Adam Brin, Michael Simeone and Mary Whelan
When people talk about the problems, and the potential, of Big Data, they are often referring to gigantic research or business output files. But an equally vexing Big Data problem involves wrangling thousands of small files, like PDFs. Addressing this problem is a key component of the NEH-funded DAHA Project, because we anticipate that the DAHA library will contain over 1,600 grey-literature archaeological reports, on the order of 400,000 pages of information-rich text. Our ultimate goal is not to sequester these documents in the archive, but to stimulate and enable new uses that advance scholarship. For efficient ways to search and analyze that many documents at once, we turn to computer science and the Digital Humanities, using Natural Language Processing (NLP) tools developed and applied in those disciplines.
Of course, you can do a simple word search in the DAHA archive in tDAR right now and get useful results. But word searches have limitations (spurious results, missed spelling variations, etc.), and ultimately we would like to search the entire corpus with a complex query like “Find all the reports that describe 12th-century excavated pit structures from New Mexico or Arizona that have a southern recess and are associated with above-ground pueblos with 10 or more rooms.” We aren’t there yet, but NLP approaches and tools are moving us closer to that goal.
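To see why a plain keyword search struggles, consider a minimal sketch in Python. The excerpts below are invented for illustration only; they are not drawn from the DAHA corpus.

```python
# Minimal sketch of why literal word search falls short: three hypothetical
# report excerpts that all describe the same kind of feature, phrased differently.
excerpts = [
    "Excavation exposed a pit structure with a southern recess.",
    "The pithouse dates to the twelfth century (A.D. 1100s).",
    "A pit house was recorded adjacent to an above-ground pueblo.",
]

query = "pit structure"
hits = [text for text in excerpts if query in text.lower()]
print(len(hits), "of", len(excerpts), "excerpts matched")
# Only 1 of 3 excerpts matches, even though all three are relevant:
# "pithouse" and "pit house" are missed, and the 12th-century date is
# expressed in words that a keyword query cannot connect to "12th century".
```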
For the DAHA project, we are focusing on the NLP branch known as Named Entity Recognition (NER). Working with this framework in tDAR will allow us to automatically extract standard who, what, where, and when references from each DAHA document, enriching the metadata records and thereby greatly improving a user’s query and discovery experience.
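As a rough illustration of what NER produces, here is a minimal sketch using spaCy’s general-purpose English model. spaCy is not one of the toolkits we compared for DAHA; it simply shows the span-plus-label output that NER yields, and the example sentence is our own.

```python
# Illustration only: off-the-shelf NER with spaCy's small English model
# (requires: python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")
text = "Emil Haury excavated at Snaketown, Arizona, in 1934-35 for Gila Pueblo."
doc = nlp(text)

for ent in doc.ents:
    print(ent.text, ent.label_)
# Output pairs spans of text with generic labels such as PERSON, GPE, DATE,
# and ORG. Domain-specific categories like Ceramic Type or Site Number are
# not in a general-purpose model and require custom training, which is where
# our toolkit comparison comes in.
```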
So far, we have started to work out a workflow, identified a test set of DAHA documents, and had a human tag words and phrases in one document. Our entity tags include Ceramic Type, Culture, Location, Person, Institution, Archaeological Site Name, Site Number, and Date. Next, we experimented with three NER toolkits (the Stanford NLP Toolkit, Apache OpenNLP, and the University of Sheffield’s GATE) to see which one(s) worked best on our corpus. We’ll describe our NLP comparison and results in detail in the “Natural Language Processing and the Digital Archive of Huhugam Archaeology – Part II” blog post.
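One way to compare a toolkit’s output against the human-tagged document is exact-match precision and recall over entity spans. The sketch below uses made-up placeholder spans purely to show the arithmetic; it is not our actual evaluation code.

```python
# Hedged sketch: score predicted entities against gold (human) annotations,
# treating each entity as a (start, end, label) span and requiring exact matches.
# The spans below are hypothetical placeholders.
gold = {(0, 10, "Person"), (25, 34, "Site Name"), (40, 44, "Date")}
pred = {(0, 10, "Person"), (25, 34, "Location"), (40, 44, "Date")}

true_pos = len(gold & pred)
precision = true_pos / len(pred) if pred else 0.0
recall = true_pos / len(gold) if gold else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```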