20. June 2018 · The Center for Digital Antiquity Presentations at the AZ Historic Preservation Conference

The lively and well-attended Arizona Statewide Historic Preservation Conference was held earlier this month at the Hotel Valley Ho in Scottsdale. The Center for Digital Antiquity organized two sessions for the conference.


One of the sessions highlighted the Digital Archive of Huhugam Archaeology (DAHA) project, underway at the Center in collaboration with the Amerind Museum, ASU Libraries, the ASU Center for Archaeology and Society, other ASU scholars, Pueblo Grande Museum, the City of Phoenix Archaeologist office, and other public agencies. Also involved are Archaeology Southwest, Desert Archaeology, Statistical Research, Inc., and a number of other CRM firms in southern Arizona.


Leigh Anne Ellison organized the session and summarized the various aspects of DAHA. David Martinez, Frank McManamon, and Adam Brin also presented. Martinez described the dialogue with tribal communities that is part of the project. McManamon summarized the building of content for DAHA in a collection in tDAR, the Digital Archaeological Record. Brin summarized the project's research on natural language processing and “text mining,” which will enable more detailed research on the rich body of technical reports and other documents assembled in the digital archive. Learn more about this and our other session here.

08. December 2017 · Natural Language Processing and the Digital Archive of Huhugam Archaeology – Part II

By Adam Brin

For the DAHA Project we tested several natural language processing (NLP) systems: NLTK, Stanford, GATE, and Apache OpenNLP. To evaluate them, we extracted a 10-page section of “FRANK MIDVALE’S INVESTIGATION OF THE SITE OF LA CIUDAD” by David R. Wilcox (tDAR # 4405) and ran the text through each of the tools at its base settings, and then through a tool that stripped out artifacts from the Optical Character Recognition (OCR) process. We then grouped and ranked the results, comparing them to a baseline produced by a person. Each unique result was counted as valid or invalid using the human recognition as the baseline, though in a few cases the human missed or miscounted values. The result was a mostly quantitative analysis of the different engines and an attempt to rate the quality of each system.
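As a rough illustration of the valid/invalid counting described above, the following is a minimal sketch, not the procedure actually used for the project. It assumes exact matching after simple normalization, whereas the real review involved human judgment (the reported counts include partial credit such as 15.5 and do not always sum to the number found). The function names and the sample engine output are hypothetical; only "Frank Midvale" and "David R. Wilcox" come from the report title and author cited above.

```python
import re


def normalize(entity):
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return re.sub(r"\s+", " ", entity).strip().lower()


def score_entities(found, baseline):
    """Count an engine's unique entities as valid or invalid against a human baseline.

    Assumption: an entity is valid only if it exactly matches a baseline
    entity after normalization. A human reviewer would be more lenient.
    """
    found_unique = {normalize(e) for e in found}
    baseline_unique = {normalize(e) for e in baseline}
    valid = found_unique & baseline_unique
    invalid = found_unique - baseline_unique
    return {
        "found": len(found_unique),
        "valid": len(valid),
        "invalid": len(invalid),
    }


# Hypothetical sample data for the "Person" category.
human_people = ["Frank Midvale", "David R. Wilcox"]
engine_people = ["Frank Midvale", "frank  midvale", "David R. Wilcox", "La Ciudad"]

print(score_entities(engine_people, human_people))
```

In this sample, the duplicate spelling of "Frank Midvale" is counted once, and "La Ciudad" (a site, not a person) is counted as invalid.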

Engine                    Institution            Person                 Location
                          found  valid  invalid  found  valid  invalid  found  valid  invalid
human                     35     32     3        62     62     0        27     22     4
stanford                  43     23     18       69     46     17       45     21     25
apache                    42     24     15.5     29     25     3        31     11     19
apache (2 hour training)  42     24     15.5     47     45     3        31     11     19
gate                      0      0      0        66     37     18       1      1      2
NLTK                      78     9      69       102    37     65       2      1      1

The results showed that, out of the box, the Stanford tool was the most accurate overall, although it also produced a number of invalid matches. NLTK was the most aggressive: it returned the most matches and the most invalid matches for institutions and people, though not for locations. The Apache toolkit did reasonably well on the ratio of valid matches to invalid ones. It seemed successful enough that we spent a few hours determining whether it could be easily trained to improve match quality. The results with “people” were quite good. Stanford found initialized names (e.g., E. K. Smith) while the Apache tool did not, but we were able to train the Apache tool quite easily to recognize this format. We were also able to train the Apache tool to identify citation references, e.g. (Smith 2017), which should further improve the matches.
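For readers unfamiliar with how such training works, Apache OpenNLP's name finder is trained on plain-text sentences, one per line, with whitespace-separated tokens and entities wrapped in `<START:type> ... <END>` markers. The sketch below shows one way to generate such lines from token spans. It is an illustration only: the post does not describe how the project's training data was prepared, and the helper name and example sentence are hypothetical.

```python
def to_opennlp_line(tokens, spans):
    """Render one tokenized sentence in OpenNLP name finder training format.

    tokens: list of token strings.
    spans:  list of (start, end, label) tuples with end exclusive;
            spans are assumed not to overlap.
    """
    starts = {start: label for start, _, label in spans}
    ends = {end for _, end, _ in spans}
    out = []
    for i, token in enumerate(tokens):
        if i in ends:
            out.append("<END>")  # close a span that ended before this token
        if i in starts:
            out.append(f"<START:{starts[i]}>")  # open a span at this token
        out.append(token)
    if len(tokens) in ends:
        out.append("<END>")  # close a span that runs to the end of the sentence
    return " ".join(out)


# Hypothetical sentence containing an initialized name.
tokens = ["Excavations", "by", "E.", "K.", "Smith", "continued", "."]
print(to_opennlp_line(tokens, [(2, 5, "person")]))
```

A file of such lines, covering initialized names and any other formats the default model misses, is what the toolkit's name finder trainer consumes.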

Engine                           Institution  Person   Location
human                            100.00%      100.00%  100.00%
stanford                         71.88%       74.19%   95.45%
apache                           75.00%       40.32%   50.00%
apache (2 hour person training)  75.00%       72.58%   50.00%
gate                             0.00%        59.68%   4.55%
NLTK                             28.13%       59.68%   4.55%
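The percentages above appear to be each engine's valid matches divided by the human baseline's valid matches (32 institutions, 62 people, 22 locations). The post does not state the formula, but every cell is consistent with it. The short sketch below reproduces the calculation from the counts in the first table. The variable and function names are our own, and `Decimal` with half-up rounding is used so that exact ties such as 28.125 round to 28.13, matching the table.

```python
from decimal import Decimal, ROUND_HALF_UP

# Valid-match counts copied from the first table.
VALID = {
    "human": {"institution": 32, "person": 62, "location": 22},
    "stanford": {"institution": 23, "person": 46, "location": 21},
    "apache": {"institution": 24, "person": 25, "location": 11},
    "apache (2 hour training)": {"institution": 24, "person": 45, "location": 11},
    "gate": {"institution": 0, "person": 37, "location": 1},
    "NLTK": {"institution": 9, "person": 37, "location": 1},
}


def share_of_baseline(valid, baseline_valid):
    """Return valid / baseline_valid as a percentage rounded half-up to 2 places."""
    pct = Decimal(100 * valid) / Decimal(baseline_valid)
    return pct.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


baseline = VALID["human"]
for engine, counts in VALID.items():
    row = {kind: share_of_baseline(n, baseline[kind]) for kind, n in counts.items()}
    print(engine, {kind: f"{pct}%" for kind, pct in row.items()})
```

Note that this measure only rewards finding what the human found. It does not penalize invalid matches, which is why the found and invalid columns of the first table still matter when comparing engines.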

Based on the ease of training the Apache tool and the challenge of the Stanford tool’s license, which makes it more difficult to integrate into the tDAR infrastructure, we plan to move forward with the Apache OpenNLP toolkit. One other note: GATE was challenging to configure.
