Natural Language Processing and the Digital Archive of Huhugam Archaeology – Part II

08. December 2017 · Comments Off · Categories: Uncategorized · Tags: DAHA, Hohokam, Huhugam, Natural Language Processing, NER, NLP, tDAR

By Adam Brin

For the DAHA Project we tested various NLP systems including NLTK, Stanford, GATE, and Apache OpenNLP. To evaluate them, we extracted a 10-page section of “FRANK MIDVALE’S INVESTIGATION OF THE SITE OF LA CIUDAD” by David R. Wilcox (tDAR # 4405), and ran the text through each of the tools at their base settings, and then through a tool that stripped out artifacts from the Object Character Recognition (OCR) process. We then grouped and ranked the results comparing them to a baseline run by a person. Each unique result was validated and counted as valid or invalid, using the human recognition as the baseline, though in a few cases, the human missed values, or miscounted values. The result was a mostly quantitative analysis of the different engines and an attempt to rate the quality of each system.

	Institution			Person			Location
Engine	found	valid	invalid	found	valid	invalid	found	valid	invalid
human	35	32	3	62	62	0	27	22	4
stanford	43	23	18	69	46	17	45	21	25
apache	42	24	15.5	29	25	3	31	11	19
apache (2 hour training)	42	24	15.5	47	45	3	31	11	19
gate	0	0	0	66	37	18	1	1	2
NLTK	78	9	69	102	37	65	2	1	1

The results showed that out of the box, the Stanford tool was definitely more accurate, although it had a number of invalid matches too. The NLTK was the most aggressive and had the most matches, and invalid matches (except for locations). The Apache toolkit did pretty well with the ratio of valid matches to invalid ones. It seemed successful enough, that a few hours were spent on determining if it could be easily trained to improve the match quality. The results with “people” were quite good. Stanford found initialized names (E. K. Smith) while the Apache tool did not. We were able to train the Apache tool quite easily to recognize this format. We also were able to train the Apache tool to identify citation references e.g. (Smith 2017) which also would improve the matches.

Engine	Institution	Person	Location
human	100.00%	100.00%	100.00%
stanford	71.88%	74.19%	95.45%
apache	75.00%	40.32%	50.00%
apache (2 hour person training)	75.00%	72.58%	50.00%
gate	0.00%	59.68%	4.55%
NLTK	28.13%	59.68%	4.55%

Based on the ease of training the Apache tool, and the challenge of the Stanford tool’s license, which makes it more difficult for integrating it into tDAR infrastructure, we plan on moving forward with the Apache OpenNLP toolkit. One other note on GATE was the challenge of configuring it.The results showed that the out of the box, the Stanford tool was definitely more accurate, although it had a number of invalid matches too. The NLTK was the most aggressive and had the most matches, and invalid matches (except for locations). The Apache toolkit did pretty well with the ratio of valid matches to invalid ones. It seemed successful enough, that a few hours were spent on determining if it could be easily trained to improve the match quality. The results with “people” were quite good. Stanford found initialized names (E. K. Smith) while the Apache tool did not. We were able to train the Apache tool quite easily to recognize this format. We also were able to train the Apache tool to identify citation references e.g. (Smith 2017) which also would improve the matches.

View Raw OCR Text Used in Analysis
View Raw Results

Comments closed.

DAHA

Natural Language Processing and the Digital Archive of Huhugam Archaeology – Part II

Search

Recent News Items