16. August 2018 · A Crowd Sourced User Survey for the Digital Archive of Huhugam Archaeology (DAHA)

By Keith Kintigh and Mary Whelan

In October 2017, the DAHA team designed a survey to assess the relevant information-related needs of the Digital Archive of Huhugam Archaeology’s key user communities: archaeologists and others working in cultural heritage management who are concerned with Huhugam archaeology. We wanted to distribute the survey to as many Huhugam archaeologists as possible, so we sought the help of the 177 members of the Arizona Archaeological Council (AAC), the professional organization of archaeologists working in Arizona. Although most AAC members are not Huhugam archaeology specialists, we thought this group contained much of the audience we were targeting. We also contacted an additional 28 Huhugam archaeologists not affiliated with AAC. Most individuals received an initial email request to participate (containing the link to the survey) and at least one email reminder.

Between 18 October and 30 November 2017, we received 49 anonymous responses. We were encouraged by the 24% response rate (49 of the 205 individuals contacted), especially given that the distribution included a substantial number of individuals for whom the survey was not relevant. The survey was designed to elicit a reasonable number of responses from the community of Huhugam archaeologists, and it was successful in that regard; it was not designed to obtain a statistically representative sample.

We published the full DAHA User Survey Report in the “Reports in Digital Archaeology Publication Series.”  Reports in Digital Archaeology is an online publication series devoted to issues regarding research and practice in digital archiving of archaeological materials and archaeologically related data. Below are some of the survey highlights.

Analysis of the Results

Because the goal of the survey was to provide feedback from Huhugam archaeologists for use in developing the DAHA archive, the questions focused primarily on two areas: what research questions are of most interest to the user communities, and what IT tools and technological support would enhance and expand the user experience with the DAHA digital library in tDAR.

The results confirmed our beliefs that there is a perceived need for DAHA and that the archive will be heavily used by Huhugam archaeologists. The survey’s responses on how archaeologists use reports and what features they want to see in DAHA indicate that we should focus development on features that facilitate efficient discovery of the desired documents and that allow users to find or extract the specific types of information they are looking for within reports. The results are helpful both in prioritizing the kinds of resources to add to DAHA and in developing natural language processing (NLP) tools.

Table 1. What do you see as the three most important questions in Huhugam Archaeology?

Count Subject
21 Understanding the End of Classic/Huhugam Collapse
16 Huhugam Connections to Descendent Communities
14 Huhugam Organization
11 Preclassic/Classic Transition
10 Internal Hohokam Interaction
9 Adaptation to Environment
7 Identity/Ethnicity/Ideology
7 Modeling/Refinement of Population
6 Methods Issues
5 Water Management/Irrigation/River Flow
5 Relevance to today
5 Subsistence & Production
4 Chronology Refinement
4 Early Agricultural to Pioneer Period
4 External Interaction – Including with Mesoamerica & Pueblo areas
3 Nature of Classic Period
3 Huhugam Origins
3 Resilience of Huhugam

Several survey questions provided valuable information concerning the research topics of most interest to the user communities (Table 1). Those results will likely be of interest to many Hohokam archaeologists and will help structure the organization of the final DAHA archive, as well as guide decisions about the most important documents to include in the archive.

Table 2.  What features would help you in using grey literature reports to advance knowledge of Huhugam society?

Count Feature
13 Keyword Search/Index to reports [already implemented]
12 Full Text Search [already implemented]
4 Good Abstracts/Summaries of Scope & Results
3 Master (Annotated) Bibliography of Huhugam Reports
3 Spatial Search [already implemented]
2 Extract Tables as Spreadsheets
2 Organization Search Output to Facilitate Selection
2 Topic Search
1 List of Analysis Types Reported
1 Indication if Full Text is Available in Search Result
1 Indication if Report is Peer Reviewed or Agency Approved
1 Abstract preview before download [already implemented]
1 Quick Response Time
1 Partial Download
1 Connect Tabulated Data with Associated Text
1 Integrate with AZSite
1 Voice Search

A later question (Table 2) was most useful in directing the development of natural language processing tools and in adding or enhancing tDAR search and access features. The two most common requests shown in Table 2, keyword and full-text search, are core features built into tDAR from its beginning. The report abstracts are generally extracted and made available on the metadata pages as the document summary. Like full-text and keyword search, spatial search is a core feature of tDAR that has been available from the beginning. Being able to extract document tables as spreadsheets is a challenging request that we are considering.

Question by question and full text survey results are available in tDAR:

04. December 2017 · Natural Language Processing and the Digital Archive of Huhugam Archaeology – Part I

By Keith Kintigh, Adam Brin, Michael Simeone and Mary Whelan

When people talk about the problems, and potential, of Big Data, they are often referring to research or business output files that are gigantic. But an equally vexing Big Data problem involves wrangling thousands of small files, like PDFs. Addressing this problem is a key component of the NEH-funded DAHA Project because we anticipate that the DAHA library will contain over 1,600 grey-literature archaeological reports, on the order of 400,000 pages of information-rich text. Our ultimate goal is not to sequester these documents in the archive, but to stimulate and enable new uses that advance scholarship. For efficient ways to search and analyze that many documents at once, we turn to computer science and Digital Humanities, using Natural Language Processing (NLP) tools developed and applied in those disciplines.

Natural Language Processing tool kits that we compared for the DAHA project


Of course, you can do a simple word search in the DAHA archive in tDAR now and get useful results. But word searches have limitations (spurious results, missed spelling variations, etc.), and ultimately what we would like to do is search the entire corpus using a complex query like “Find all the reports that describe 12th-century excavated pit structures from New Mexico or Arizona that have a southern recess and are associated with above-ground pueblos with 10 or more rooms.” We aren’t there yet, but NLP approaches and tools are moving us closer to that goal.
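
The spelling-variation limitation can be illustrated with two spellings used for the same archaeological tradition, "Huhugam" and "Hohokam". The following is a minimal Python sketch, not a description of how tDAR search works: the variant table, the function names, and the sample sentences are all hypothetical and invented for illustration. It shows a literal word search missing one spelling while a variant-aware search finds both.

```python
import re

# Hypothetical variant table: maps a canonical term to spellings that
# should be treated as the same word. Illustrative only, not a tDAR feature.
VARIANTS = {"huhugam": ["huhugam", "hohokam"]}


def literal_search(term, documents):
    """Return indices of documents containing the exact term (case-insensitive)."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return [i for i, text in enumerate(documents) if pattern.search(text)]


def variant_search(term, documents):
    """Like literal_search, but also matches known spelling variants of the term."""
    spellings = VARIANTS.get(term.lower(), [term])
    alternatives = "|".join(re.escape(s) for s in spellings)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    return [i for i, text in enumerate(documents) if pattern.search(text)]


# Made-up sentences standing in for report text.
reports = [
    "Excavations exposed a Hohokam canal segment.",
    "The Huhugam platform mound was mapped in detail.",
    "Survey results for the northern parcel were negative.",
]

print(literal_search("Huhugam", reports))  # finds only the second report
print(variant_search("Huhugam", reports))  # finds the first and second reports
```

A hand-maintained variant table like this one does not scale to a whole corpus, which is part of why the more general NLP approaches discussed here are attractive.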

For the DAHA project, we are focusing on the NLP branch known as Named Entity Recognition (NER). Working with this framework in tDAR will allow us to automatically extract standard who, what, where, and when references from each DAHA document. Those references will improve the metadata records, which in turn will greatly improve a user’s query and discovery experience.
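
To give a feel for what entity extraction produces, here is a deliberately simple Python sketch. It is a toy rule-based tagger built from word lists and one regular expression, not the statistical approach used by NER tool kits, and it is not the DAHA implementation. The gazetteer entries, the date pattern, the labels, and the sample sentence are all assumptions made for illustration.

```python
import re

# Toy gazetteers: hypothetical word lists, not drawn from DAHA or tDAR.
GAZETTEER = {
    "Culture": ["Huhugam", "Hohokam"],
    "Location": ["Arizona", "New Mexico"],
}

# Assumed date format for this sketch: "A.D." followed by a 3- or 4-digit year.
DATE_PATTERN = re.compile(r"A\.D\.\s*\d{3,4}")


def extract_entities(text):
    """Return (start, end, label, matched_text) tuples sorted by position."""
    found = []
    for label, terms in GAZETTEER.items():
        for term in terms:
            for m in re.finditer(r"\b" + re.escape(term) + r"\b", text):
                found.append((m.start(), m.end(), label, m.group()))
    for m in DATE_PATTERN.finditer(text):
        found.append((m.start(), m.end(), "Date", m.group()))
    return sorted(found)


# Made-up sentence standing in for report text.
sample = "Hohokam settlement near Phoenix, Arizona, expanded after A.D. 1150."
for start, end, label, value in extract_entities(sample):
    print(f"{label:10} {value!r} at characters {start}-{end}")
```

Note that "Phoenix" goes untagged because it is not in the word list. Gaps like this are where trained NER models, which use context rather than fixed lists, are expected to do better.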

Preliminary workflow for DAHA Named Entity Extraction tasks


So far we have begun to work out a workflow, identified a test set of DAHA documents, and asked a human to tag words and phrases in one document. Our entity tags include Ceramic Type, Culture, Location, Person, Institution, Archaeological Site Name, Site Number and Date. Next we experimented with three NER tool kits (Stanford’s NLP Toolkit, Apache’s OpenNLP, and the University of Sheffield’s GATE) to see which one(s) worked best on our corpus. We’ll describe our NLP comparison and results in detail in the “Natural Language Processing and the Digital Archive of Huhugam Archaeology – Part II” blog post.
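
One common way to judge how well a tool kit works against a human-tagged document is to compute precision, recall, and F1 over the tagged entities. This post does not say which measure the DAHA team used, so the Python sketch below is an assumption about how such a comparison could be scored. The `score` function and both tag sets are hypothetical and made up for illustration.

```python
def score(predicted, gold):
    """Compare two collections of (text, label) pairs.

    Returns (precision, recall, f1), counting a prediction as correct
    only when both the text and the label match a human tag exactly.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up tags standing in for one human-tagged document.
human_tags = {
    ("Snaketown", "Archaeological Site Name"),
    ("Arizona", "Location"),
    ("Sacaton Red-on-buff", "Ceramic Type"),
    ("A.D. 1150", "Date"),
}

# Made-up output from a hypothetical tool kit run on the same document.
toolkit_tags = {
    ("Arizona", "Location"),
    ("A.D. 1150", "Date"),
    ("Sacaton", "Location"),  # a ceramic type mistaken for a place name
}

precision, recall, f1 = score(toolkit_tags, human_tags)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Exact matching is a strict choice: a tool kit that tags "Sacaton" instead of "Sacaton Red-on-buff" gets no partial credit. A real comparison might relax this, for example by accepting overlapping spans.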

Kintigh, Keith
2015. “Extracting Information from Archaeological Texts.” Open Archaeology 1: 96–101.
DOI: https://doi.org/10.1515/opar-2015-0004