08. December 2017 · Natural Language Processing and the Digital Archive of Huhugam Archaeology – Part II

By Adam Brin

For the DAHA Project we tested several NLP systems, including NLTK, Stanford, GATE, and Apache OpenNLP. To evaluate them, we extracted a 10-page section of “FRANK MIDVALE’S INVESTIGATION OF THE SITE OF LA CIUDAD” by David R. Wilcox (tDAR # 4405) and ran the text through each of the tools at its base settings, and then through a tool that stripped out artifacts from the Optical Character Recognition (OCR) process. We then grouped and ranked the results, comparing them to a baseline produced by a person. Each unique result was counted as valid or invalid against that human baseline, though in a few cases the human missed or miscounted values. The result was a mostly quantitative analysis of the different engines and an attempt to rate the quality of each system.
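The counting step can be sketched roughly as follows. This is a minimal illustration in Python, not the script used for the project: the function names, the normalization rule, and the sample entities are all assumptions made for the example.

```python
def normalize(entity: str) -> str:
    """Collapse whitespace and case so OCR spacing differences do not create duplicates."""
    return " ".join(entity.split()).lower()


def score_engine(found, baseline):
    """Count the unique entities an engine reported and how many match the human baseline.

    found    -- iterable of raw entity strings reported by one engine
    baseline -- iterable of entity strings the human reviewer marked as valid
    """
    unique_found = {normalize(e) for e in found}
    valid_set = {normalize(e) for e in baseline}
    valid = unique_found & valid_set
    invalid = unique_found - valid_set
    return {"found": len(unique_found), "valid": len(valid), "invalid": len(invalid)}


# Hypothetical sample data, not taken from the evaluation itself.
human_people = ["Frank Midvale", "David R. Wilcox"]
engine_people = ["Frank  Midvale", "frank midvale", "La Ciudad"]

print(score_engine(engine_people, human_people))
# {'found': 2, 'valid': 1, 'invalid': 1}
```

In the actual evaluation each result was judged by a person, so partial matches could be weighed with judgment (note the half counts such as 15.5 in the results), which an exact set comparison like this cannot capture.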

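| Engine | Institution found | Institution valid | Institution invalid | Person found | Person valid | Person invalid | Location found | Location valid | Location invalid |
|---|---|---|---|---|---|---|---|---|---|
| human | 35 | 32 | 3 | 62 | 62 | 0 | 27 | 22 | 4 |
| stanford | 43 | 23 | 18 | 69 | 46 | 17 | 45 | 21 | 25 |
| apache | 42 | 24 | 15.5 | 29 | 25 | 3 | 31 | 11 | 19 |
| apache (2 hour training) | 42 | 24 | 15.5 | 47 | 45 | 3 | 31 | 11 | 19 |
| gate | 0 | 0 | 0 | 66 | 37 | 18 | 1 | 1 | 2 |
| NLTK | 78 | 9 | 69 | 102 | 37 | 65 | 2 | 1 | 1 |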

The results showed that, out of the box, the Stanford tool was the most accurate, although it also returned a number of invalid matches. NLTK was the most aggressive: it returned the most matches and the most invalid matches in every category except locations. The Apache toolkit did fairly well on the ratio of valid to invalid matches. It seemed promising enough that we spent a few hours determining whether it could easily be trained to improve match quality. The results for “people” were quite good. Stanford found names given with initials (E. K. Smith), while the Apache tool did not; we were able to train the Apache tool to recognize this format quite easily. We were also able to train the Apache tool to identify citation references, e.g. (Smith 2017), which would also improve the matches.
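For readers unfamiliar with how this kind of training works: the Apache OpenNLP name finder learns from plain text, one sentence per line, with each entity wrapped in `<START:type> ... <END>` markers separated from the surrounding tokens by whitespace. As a rough sketch of how training lines for names with initials and for citation references could be drafted, the Python snippet below wraps spans matched by two simple regular expressions. The patterns, the `citation` type label, and the sample sentence are assumptions for illustration; this post does not describe how the DAHA training data was actually prepared, and regex-made annotations like these would need human review and proper tokenization before use.

```python
import re

# Assumed patterns, for illustration only.
# One or more initials followed by a capitalized surname, e.g. "E. K. Smith".
INITIALED_NAME = re.compile(r"\b(?:[A-Z]\.\s?)+[A-Z][a-z]+\b")
# Parenthetical author-year citation, e.g. "(Smith 2017)".
CITATION = re.compile(r"\([A-Z][a-z]+ \d{4}\)")


def annotate(sentence: str) -> str:
    """Wrap regex matches in OpenNLP name-finder training markers."""
    # Pad each marker with spaces so it stays a separate whitespace-delimited token.
    sentence = INITIALED_NAME.sub(lambda m: f" <START:person> {m.group(0)} <END> ", sentence)
    sentence = CITATION.sub(lambda m: f" <START:citation> {m.group(0)} <END> ", sentence)
    # Collapse the extra spaces introduced by the padding.
    return " ".join(sentence.split())


# Hypothetical sentence, not taken from the source document.
print(annotate("The trench was recorded by E. K. Smith (Smith 2017)."))
# The trench was recorded by <START:person> E. K. Smith <END> <START:citation> (Smith 2017) <END> .
```

The names are matched before the citations so that the surname inside a citation is not touched by the first pattern's markers; a real training set would also need many more sentences than a quick experiment like this suggests.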

| Engine | Institution | Person | Location |
|---|---|---|---|
| human | 100.00% | 100.00% | 100.00% |
| stanford | 71.88% | 74.19% | 95.45% |
| apache | 75.00% | 40.32% | 50.00% |
| apache (2 hour person training) | 75.00% | 72.58% | 50.00% |
| gate | 0.00% | 59.68% | 4.55% |
| NLTK | 28.13% | 59.68% | 4.55% |
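The percentages above appear to be each engine's valid count divided by the human reviewer's valid count for the same category (32 institutions, 62 people, 22 locations). The short Python check below recomputes them from the counts under that reading; the function and variable names are illustrative, and half-up rounding is assumed because it reproduces the published figures.

```python
from decimal import Decimal, ROUND_HALF_UP

# Valid counts per engine as (institution, person, location), copied from the counts above.
valid_counts = {
    "human": (32, 62, 22),
    "stanford": (23, 46, 21),
    "apache": (24, 25, 11),
    "apache (2 hour person training)": (24, 45, 11),
    "gate": (0, 37, 1),
    "NLTK": (9, 37, 1),
}

baseline = valid_counts["human"]


def percent_of_human(counts):
    """Express each valid count as a percentage of the human valid count."""
    return tuple(
        (Decimal(100) * c / b).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
        for c, b in zip(counts, baseline)
    )


for engine, counts in valid_counts.items():
    inst, person, loc = percent_of_human(counts)
    print(f"{engine:<32} {inst:>7}% {person:>7}% {loc:>7}%")
```

Read this way, the figures measure something close to recall against the human baseline. They say nothing about invalid matches, which is why NLTK and GATE tie on Person at 59.68% even though NLTK returned 65 invalid results to GATE's 18.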

Based on the ease of training the Apache tool, and the challenge of the Stanford tool’s license, which makes it more difficult to integrate into the tDAR infrastructure, we plan on moving forward with the Apache OpenNLP toolkit. One other note: GATE was challenging to configure.

View Raw OCR Text Used in Analysis
View Raw Results
