23. July 2018 · If you could apply Text Mining to your archaeological research, what would it look like?

By Rachel Fernandez and Mary Whelan

Computer automated part-of-speech analysis of an English sentence.

If you Google “archaeology” and “text mining” you get a pretty small number of results. While both the Archaeology Data Service (ADS) in the UK and Open Context, the American open data and publication service, have worked on projects that apply text mining techniques to archaeological reports, it is probably safe to say that most archaeologists are unfamiliar with how text mining or data science can contribute to archaeological research. This isn’t surprising since most of us specialize in the analysis of some material culture product, not the analysis of written reports. But as McManamon et al. (2017) have pointed out, the volume of new archaeological articles, books, compliance reports and other documents is so large, and grows so quickly, that no one person can possibly read it all. Text mining offers automated approaches that help with this “data deluge.”

Illustration of the stages from binary computer code, to ASCII code, to an English language sentence.

This past April, the Digital Archive of Huhugam Archaeology (DAHA) team hosted an NEH-funded research workshop focused on Archaeology and Text Mining. Led by DAHA investigators Michael Simeone, Keith Kintigh, and Adam Brin, the workshop brought together nine expert panelists, including archaeologists, Native American scholars, and Digital Humanities researchers, for a day of discussion with the DAHA team. We first asked the panelists simply to talk about how they use digital texts in their own work, whatever their discipline.

Working in small groups, the participants discussed the epistemology of research in their respective fields and quickly turned the conversation to how text mining digital documents could benefit each field. Ideas ranged from pulling artifact counts and site descriptions out of standardized CRM reports to supporting demographic studies. Professor David Abbott, a Huhugam researcher in ASU's School of Human Evolution and Social Change, brought up the potential for text mining to advance synthetic research.

Next, we wanted to hear their ideas about how text mining might make the DAHA digital text corpus more useful to a broad and diverse audience. Joshua MacFadyen, Assistant Professor of Environmental Humanities in the School of Historical, Philosophical, and Religious Studies and the School of Sustainability at ASU, raised an important question about user experience: how can we record and trace the impact of these tools on a diverse user group? In addition, David Martinez, a member of the Gila River Indian Community and Associate Professor of American Indian Studies at ASU, hopes that this technology can help bridge the gap between community users and researchers and allow community members to interpret the data in ways that interest their communities. Overall, text processing was thought to be most useful for filtering and analyzing data, rather than for providing direct insights.

Finally, we asked all the participants to reflect on what we accomplished and to identify useful directions the DAHA team can take with Text Mining tools and technology. Some of the possible features to come out of DAHA include: automated tools for extracting specific text sections, such as tables, references, or headings; use of ontologies to make searching documents more efficient; topic-related searches; and the ability to gather a set of reports and have the data linked to its various sources.

Illustration of the steps in Named Entity Recognition

The DAHA grant proposal focused specifically on Natural Language Processing applications for text analysis. But during the workshop, a number of more general Text Mining approaches were mentioned, including:

  • Corpus Statistics (word frequencies across the corpus; a short example follows this list)
  • Concordance
  • N-Grams
  • Advanced queries across multiple texts
  • Named Entity Recognition
  • Topic Modeling
  • Sentiment Analysis
  • Network Analysis
  • Text visualization options
  • GeoParsing (geographic information extraction)
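
To make the first two items concrete, here is a minimal Python sketch (using the NLTK library, not any particular DAHA tool) that computes word frequencies and bigrams for a single document. The file name report.txt is a placeholder for any plain-text document of your own, for example the OCR text of one report.

    import nltk
    from collections import Counter
    from nltk.util import ngrams

    nltk.download("punkt", quiet=True)  # tokenizer models used by word_tokenize

    # "report.txt" is a placeholder file name: substitute any plain-text document.
    text = open("report.txt", encoding="utf-8").read().lower()
    tokens = [t for t in nltk.word_tokenize(text) if t.isalpha()]

    word_freq = Counter(tokens)               # corpus statistics: word frequencies
    bigram_freq = Counter(ngrams(tokens, 2))  # N-grams (here, bigrams)

    print(word_freq.most_common(10))
    print(bigram_freq.most_common(10))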

If you are interested in applying text mining tools to a document corpus of importance to you, there are several good, free or open-source toolkits that each support one or more of these approaches:

  1. AntConc (http://www.laurenceanthony.net/software/antconc/ ) is an easy-to-use set of tools for analyzing any type of text corpus. AntConc provides tools to calculate word frequency, concordance, N-Grams, and corpus statistics.
  2. ConText (http://context.lis.illinois.edu/ ) is a package of tools that allow you to do topic modeling, sentiment analysis, part-of-speech analysis, and visualization of a text corpus.
  3. MALLET, the MAchine Learning for LanguagE Toolkit (http://mallet.cs.umass.edu/index.php), is a very popular topic modeling tool; a small Python illustration of topic modeling appears after this list.
  4. Ora-Lite (http://www.casos.cs.cmu.edu/projects/ora/software.php ) is a software package that helps you identify and visualize networks (e.g., people who communicated with each other or places that are linked) in text data.
  5. Voyant (https://voyant-tools.org/docs/#!/guide/start ) is a free online service that supports a number of corpus analysis tools, including visualization.
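
To give a flavor of what topic modeling produces, here is a minimal Python sketch using the gensim library as a stand-in for MALLET. The three toy documents and their token lists are invented for illustration and are not drawn from the DAHA corpus.

    from gensim import corpora, models

    # Toy, hand-made token lists standing in for real tokenized reports.
    docs = [
        ["ceramic", "sherd", "redware", "buffware", "vessel"],
        ["canal", "irrigation", "field", "water", "system"],
        ["pit", "structure", "hearth", "floor", "posthole"],
    ]

    dictionary = corpora.Dictionary(docs)           # word <-> id mapping
    corpus = [dictionary.doc2bow(d) for d in docs]  # bag-of-words vectors

    # Fit a tiny LDA topic model; each "topic" is a weighted list of words.
    lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
    for topic_id, words in lda.print_topics():
        print(topic_id, words)
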
    _______________________
    1 McManamon, Francis P., Keith W. Kintigh, Leigh Anne Ellison, and Adam Brin. 2017. "The Digital Archaeological Record (tDAR): An Archive for 21st Century Digital Archaeology Curation." Advances in Archaeological Practice 5(3):238–249.
08. December 2017 · Natural Language Processing and the Digital Archive of Huhugam Archaeology – Part II

By Adam Brin

For the DAHA Project we tested several NLP systems, including NLTK, Stanford, GATE, and Apache OpenNLP. To evaluate them, we extracted a 10-page section of “FRANK MIDVALE’S INVESTIGATION OF THE SITE OF LA CIUDAD” by David R. Wilcox (tDAR # 4405), ran the text through each of the tools at their base settings, and then through a tool that stripped out artifacts from the Optical Character Recognition (OCR) process. We then grouped and ranked the results, comparing them to a baseline produced by a person. Each unique result was checked and counted as valid or invalid, using the human tagging as the baseline, though in a few cases the human missed or miscounted values. The result was a mostly quantitative analysis of the different engines and an attempt to rate the quality of each system.
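
As a rough illustration of what running a tool “at its base settings” involves, the sketch below applies NLTK’s default named entity chunker (one of the tools we compared) to a single made-up sentence that loosely echoes the report title. This is not our evaluation pipeline, just the out-of-the-box call.

    import nltk

    # Pre-trained resources for NLTK's default tokenizer, tagger, and NE chunker
    for pkg in ["punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"]:
        nltk.download(pkg, quiet=True)

    sentence = ("Frank Midvale investigated the site of La Ciudad "
                "for Arizona State University.")

    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)   # part-of-speech tags
    tree = nltk.ne_chunk(tagged)    # default named entity chunker

    # Collect (entity type, entity text) pairs such as PERSON or ORGANIZATION
    entities = [(subtree.label(), " ".join(word for word, tag in subtree.leaves()))
                for subtree in tree if hasattr(subtree, "label")]
    print(entities)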

                                Institution              Person                   Location
Engine                          found  valid  invalid   found  valid  invalid   found  valid  invalid
human                            35     32      3        62     62      0        27     22      4
stanford                         43     23     18        69     46     17        45     21     25
apache                           42     24     15.5      29     25      3        31     11     19
apache (2 hour training)         42     24     15.5      47     45      3        31     11     19
gate                              0      0      0        66     37     18         1      1      2
NLTK                             78      9     69       102     37     65         2      1      1

The results showed that, out of the box, the Stanford tool was definitely more accurate, although it too had a number of invalid matches. NLTK was the most aggressive, with the most matches and the most invalid matches (except for locations). The Apache toolkit did pretty well on the ratio of valid to invalid matches. It seemed successful enough that we spent a few hours determining whether it could easily be trained to improve match quality. The results for people were quite good. Stanford found initialized names (E. K. Smith) while the Apache tool did not; we were able to train the Apache tool quite easily to recognize this format. We were also able to train it to identify citation references, e.g. (Smith 2017), which should further improve the matches.
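
For readers curious what that training involves, the sketch below shows the inline-markup format that Apache OpenNLP’s name finder expects for training sentences: one whitespace-tokenized sentence per line, with entities wrapped in <START:type> ... <END> markers. The sentences and file name are hypothetical examples, not our actual DAHA training data.

    # Hypothetical training sentences in Apache OpenNLP's name-finder format
    # (one whitespace-tokenized sentence per line, entities marked inline).
    training_lines = [
        "Excavations were directed by <START:person> E. K. Smith <END> in 1987 .",
        "The ceramic chronology follows <START:person> Smith <END> ( 2017 ) .",
    ]

    # "person.train" is a placeholder file name for the training data.
    with open("person.train", "w", encoding="utf-8") as f:
        f.write("\n".join(training_lines) + "\n")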

Valid matches found by each engine, as a percentage of the valid matches found by the human baseline:

Engine                            Institution   Person     Location
human                               100.00%     100.00%    100.00%
stanford                             71.88%      74.19%     95.45%
apache                               75.00%      40.32%     50.00%
apache (2 hour person training)      75.00%      72.58%     50.00%
gate                                  0.00%      59.68%      4.55%
NLTK                                 28.13%      59.68%      4.55%

Based on the ease of training the Apache tool, and on the Stanford tool’s license, which makes it more difficult to integrate into the tDAR infrastructure, we plan to move forward with the Apache OpenNLP toolkit. One other note on GATE: it was a challenge to configure.

View Raw OCR Text Used in Analysis
View Raw Results

04. December 2017 · Natural Language Processing and the Digital Archive of Huhugam Archaeology – Part I

By Keith Kintigh, Adam Brin, Michael Simeone and Mary Whelan

When people talk about the problems, and potential, of Big Data they are often referring to research or business output files that are gigantic. But an equally vexing Big Data problem involves wrangling thousands of small files, like PDFs. Addressing this problem is a key component of the NEH-funded DAHA Project, because we anticipate that the DAHA library will contain over 1,600 grey-literature archaeological reports, on the order of 400,000 pages of information-rich text. Our ultimate goal is not to sequester these documents in the archive, but to stimulate and enable new uses that advance scholarship. For efficient ways to search and analyze that many documents at once, we turn to computer science and the Digital Humanities, using Natural Language Processing (NLP) tools developed and applied in those disciplines.

Natural Language Processing tool kits that we compared for the DAHA project

Of course you can do a simple word search in the DAHA archive in tDAR now and get useful results. But word searches have limitations (spurious results, spelling variations missed, etc.) and ultimately what we would like to do is search the entire corpus using a complex query like “Find all the reports that describe 12th century excavated pit structures from New Mexico or Arizona that have a southern recess and are associated with above-ground pueblos with 10 or more rooms.” We aren’t there yet, but NLP approaches and tools are moving us closer to that goal.

For the DAHA project, we are focusing on the branch of NLP known as Named Entity Recognition (NER). Working with this framework in tDAR will allow us to automatically extract standard who, what, where, and when references from each DAHA document, enriching the metadata records and thus greatly improving a user’s query and discovery experience.

Preliminary workflow for DAHA Named Entity Extraction tasks

So far we have started to work out a workflow, identified a test set of DAHA documents, and asked a human to tag words and phrases in one document. Our entity tags include Ceramic Type, Culture, Location, Person, Institution, Archaeological Site Name, Site Number, and Date. Next we experimented with three NER toolkits (Stanford’s NLP Toolkit, Apache’s OpenNLP, and the University of Sheffield’s GATE) to see which one(s) worked best on our corpus. We’ll describe our NLP comparison and results in detail in the “Natural Language Processing and the Digital Archive of Huhugam Archaeology – Part II” blog post.
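
Generic NER models recognize people, organizations, and places, but not domain-specific entity types like Site Number or Ceramic Type, which is why human tagging and custom training are needed. Purely as an illustration (not our actual workflow), the sketch below uses simple regular expressions to pull candidate Arizona site numbers and four-digit dates out of an invented sentence; real documents would need far more robust patterns or a trained model.

    import re

    # Invented example sentence; both patterns are deliberately simplified.
    text = ("The pit structure at AZ U:9:1 (ASM) was excavated in 1987 "
            "and again in 1992.")

    # Arizona State Museum style site numbers, e.g. "AZ U:9:1 (ASM)"
    site_number = re.compile(r"\bAZ\s+[A-Z]{1,2}:\d+:\d+(?:\s*\((?:ASM|ASU)\))?")
    # Bare four-digit years, a very rough stand-in for a "Date" tagger
    year = re.compile(r"\b(1\d{3}|20\d{2})\b")

    print("Site Number:", site_number.findall(text))
    print("Date:", year.findall(text))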

Kintigh, Keith. 2015. “Extracting Information from Archaeological Texts.” Open Archaeology 1:96–101. DOI: https://doi.org/10.1515/opar-2015-0004