Comparing Tools

The tools examined during the past three modules, Voyant, Kepler, and Palladio, allowed for different levels of data analysis and visualization. One can see that these tools serve different needs dictated by the kind of data set that needs to be examined. Voyant enables researchers to text-mine large volumes of mostly unstructured text. Kepler produces map visualizations from files that have been tagged with geographical locations. Palladio creates network visualizations from highly structured files.

One could say that Voyant offered the opportunity for a relatively open-ended exploration of the WPA Slave Narratives. As a text mining tool, Voyant proved to be very effective when examining a large volume of relatively unstructured data. The five tools included in Voyant (Cirrus, Reader, Trends, Summary and Contexts) provide different entry points into the data and different ways in which that data can be re-organized, explored and visualized. Although Voyant is largely meant to be used by researchers, I can also see how it could be used by public historians, museum professionals and teachers. The Cirrus tool, for instance, produces powerful visualizations that can enrich lectures and exhibits. In contrast, I found the Trends tool more difficult to manipulate and read. It was easy to see how many times a word appeared in the different state collections, but I could not easily explore other kinds of trends such as chronological distribution, age or gender. These limitations are understandable given that text mining searches for individual words or groups of words, not for categories of words.
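
To make the mechanics concrete, here is a minimal sketch of the kind of raw frequency count that underlies a Cirrus-style word cloud. This is not Voyant's actual implementation; the file name and the stop word list are illustrative.

```python
# A minimal sketch of the frequency counting behind a Cirrus-style
# word cloud. The file name and stop word list are illustrative.
from collections import Counter
from pathlib import Path
import re

STOP_WORDS = {"the", "and", "a", "to", "of", "in", "was", "i", "he", "she"}

def top_terms(path, n=25):
    """Return the n most frequent non-stop-word terms in a text file."""
    text = Path(path).read_text(encoding="utf-8").lower()
    words = re.findall(r"[a-z']+", text)
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)

print(top_terms("florida_narratives.txt"))  # hypothetical file name
```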

Kepler used more structured data than Voyant did. For this reason, we were able to illustrate different aspects of the WPA Slave Narratives. The maps we produced, using geo-tagged CSV files, allowed us to see the relative volume of interviews conducted in a particular region. It was also possible to create a map that presented a timeline. Yet the visualizations produced with Kepler did not give us any idea of the content of the interviews. Thus, I found Kepler to be a very good complement to Voyant. While Voyant gave us the possibility of analyzing the content of the interviews, Kepler enabled us to visualize the broader geographical and chronological context in which the interviews were conducted. I also found the visualizations created with Kepler easier to read and manipulate than those produced with Voyant. This is not a criticism of the effectiveness of Voyant. When working with Kepler we used a smaller, more structured data set prepared to answer more focused questions about time, place and volume, while Voyant was meant to facilitate a more open-ended exploration of the ideas contained in a much larger set of documents.
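
For readers curious about what such a geo-tagged file looks like, here is a hedged sketch of preparing one in Python. Kepler.gl can generally auto-detect columns named latitude/longitude and build a timeline from a date column, though the rows and column names below are invented for illustration.

```python
# A sketch of writing a geo-tagged CSV of the kind Kepler.gl accepts.
# The rows and column names are invented examples, not the course data.
import csv

rows = [
    {"state": "Florida", "latitude": 30.4383, "longitude": -84.2807,
     "interview_date": "1937-05-14", "interview_count": 12},
    {"state": "Georgia", "latitude": 33.7490, "longitude": -84.3880,
     "interview_date": "1937-06-02", "interview_count": 31},
]

with open("interviews_geotagged.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
```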

The last tool we used was Palladio, a network analysis and visualization tool. Of all the tools examined, Palladio required the richest and most structured data set. The goal of this tool is to allow researchers to identify patterns of connection or relationships between different categories of data. Palladio was very effective at producing visualizations of different types of relationships. For instance, we were able to create a map showing where interviewees had been enslaved in relation to where they were interviewed. We were also able to illustrate connections between the topics addressed in the interviews and the gender, age, or type of work of former slaves. In this regard, Palladio proved to be the most flexible of all three tools in terms of the kinds of questions it could help researchers explore. But the power of the tool was only made possible by the quality of the data and the way in which it was structured.
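
Palladio itself works from an uploaded table rather than code, but the underlying idea is an edge list connecting categories. A sketch of that idea using the networkx library, with hypothetical column values, might look like this.

```python
# A minimal sketch of the edge list behind a Palladio-style network
# view, assuming a table with one row per interview. The topic and
# occupation values are hypothetical.
import networkx as nx

interviews = [
    {"topic": "family", "occupation": "field hand"},
    {"topic": "religion", "occupation": "house servant"},
    {"topic": "family", "occupation": "house servant"},
]

G = nx.Graph()
for row in interviews:
    # Connect each topic to each occupation it co-occurs with.
    G.add_edge(("topic", row["topic"]), ("occupation", row["occupation"]))

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
```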

Experimenting with these tools made me more aware of the challenges and potential inherent in the use of digitized sources. Ultimately, the use of any of these tools requires that data be digitized and structured to some degree and in light of particular questions. For this reason, I think it is important to have different types of tools that work with different types of files: tools that allow for more open-ended questions, like Voyant, or for a more focused exploration, like Palladio. In either case, the larger challenge is to ensure that the digitization and preparation of the data is done thoughtfully and professionally. The ultimate effectiveness of any of these tools will largely depend on the quality of the data and the expertise of the researchers using it.

Voyant and Text Mining

Working with Voyant was quite intimidating at first. In many ways, it confirmed the impressions I had formed from reading about other text mining projects and about text mining in general.

Sources and Materials

Text mining allows scholars to work with large collections of text, what is technically called a corpus. By applying text mining techniques to these collections, scholars can discern trends in the use of words and/or phrases. The advantage of using text mining techniques lies precisely in the number of sources that can be “read” in a relatively short amount of time. For example, in the three projects examined during this module, we saw that in America’s Public Bible: A Commentary (APBAC) the author looked at two major newspaper databases, Chronicling America and Nineteenth Century U.S. Newspapers. Between these, the project used more than 13 million pages. Robots Reading Vogue (RRV) used every issue published by Vogue, around 400,000 pages, plus 2,700 covers. Signs@40 used the archive of the journal Signs from 1975 to 2014. In all three cases, no single human being would be able to read the corpora used in these projects in a single lifetime.

However, not all large collections of text are equally useful or available for text mining. The use of computational methods for text mining requires that text collections be digitized using high-quality techniques to minimize mistakes. Furthermore, text collections also need to be in the public domain, or researchers must acquire the necessary permissions for text mining.

In our exercise with Voyant we worked with the WPA Slave Narratives, which include more than two thousand interviews with former slaves conducted by staff of the Federal Writers’ Project of the Works Progress Administration. The materials were made available to us already cleaned and organized in 17 easy-to-use files. Having read how difficult and time-consuming it can be simply to prepare a corpus for text mining, I was grateful to have this part of the process done for me. However, it is important not to forget that anyone hoping to embark on a text mining project will have to invest time and expertise in making sure the sources are adequately digitized and formatted.
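
As a purely hypothetical illustration of how one might load such prepared files for processing outside Voyant, assuming one plain-text file per state in a local directory:

```python
# A small sketch of loading prepared state files into a corpus,
# assuming one plain-text file per state. The directory is illustrative.
from pathlib import Path

corpus = {p.stem: p.read_text(encoding="utf-8")
          for p in sorted(Path("wpa_narratives").glob("*.txt"))}

print(len(corpus), "state collections loaded")
```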

What can we learn?

If one is lucky and/or persistent enough to secure the rights to a significant corpus, text mining can tell us several things about the text collection, the people who created and organized it, and the world in which it originated. One common use of text mining is seeking trends in the usage of particular words or phrases. This is done in all three of the projects examined, although each uses this ability in different ways. For instance, in APBAC, text mining is used to detect specific biblical passages. This allows the author to find out how often a particular passage was used and in what context. In RRV one can find two examples, Word Vectors and n-gram Search, where text mining is used to discern the evolution of word usage over time. Another use of text mining is topic modeling, which traces words used in a particular context to detect important topics within a set of texts. This is used prominently in Signs@40. In general, the text mining tools used in these projects tell us about the evolution of language, ideas and practices over a period of time as reflected in the pages of publications or documents.
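
To give a flavor of topic modeling in the spirit of Signs@40, here is a hedged sketch using scikit-learn’s latent Dirichlet allocation, rather than whatever implementation the project itself used. The documents below are toy stand-ins for real texts.

```python
# A sketch of topic modeling with scikit-learn's LDA. The docs are
# toy stand-ins, not the actual corpus of any project discussed here.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["mother sold children plantation",
        "church prayer meeting sunday",
        "cotton field work overseer"]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

terms = vectorizer.get_feature_names_out()
for i, weights in enumerate(lda.components_):
    top = [terms[j] for j in weights.argsort()[-4:][::-1]]
    print(f"topic {i}: {', '.join(top)}")
```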

Working with Voyant was a little confusing at first. It took me some time to understand how to manipulate the different visualizations and grasp what they were telling me. However, once I started to get a better sense of what the tools allowed me to read, I started to see their potential. The Cirrus tool may seem like an oversimplification of a long and complex text. In some ways it is, but it is this ability to present a relatively simple image of what the text is about that makes it useful. One must remember that the goal of these visualizations is not to give us a deep analysis of what this vast amount of text says or means; their objective is to help us identify a few entry points, a few questions that one could explore when facing a large number of documents. Many of the terms that appeared most frequently in the whole of the corpus were clearly a function of patterns of speech and regular conversation. Words like “old”, “house”, and “slaves” were among the most frequently used terms. However, when I started focusing either on individual terms, or on specific states and terms, I started to find some interesting things. For instance, the term “mother” appeared quite prominently in the general word cloud, but if one focused on the links to this word one saw that it was most frequently connected to words like “father”, “sold”, “died”, “married”, and “children”. These, quite literally, painted a picture. I could imagine tracing the phrases that include the word mother to investigate or illustrate how motherhood was experienced or remembered by former slaves.
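
The “links” around a word can be approximated by simple collocate counting: which words most often appear near a target term within a small window. A minimal sketch, with a hypothetical file name, follows; Voyant’s own scoring may differ.

```python
# A minimal sketch of collocate counting around a target word.
# The file name is hypothetical; Voyant's own method may differ.
from collections import Counter
from pathlib import Path
import re

def collocates(path, target="mother", window=5, n=10):
    """Count words appearing within `window` tokens of `target`."""
    text = Path(path).read_text(encoding="utf-8").lower()
    words = re.findall(r"[a-z']+", text)
    counts = Counter()
    for i, w in enumerate(words):
        if w == target:
            counts.update(words[max(0, i - window):i])
            counts.update(words[i + 1:i + 1 + window])
    return counts.most_common(n)

print(collocates("all_states.txt"))
```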

What questions could we ask?

Text mining analysis allows us to answer primarily general questions about the contents of a large collection of documents. Since it focuses primarily on text, it can answer questions about how language is used, how people articulated their ideas and practices, and how all of these evolved over time. However, one has to cultivate a healthy skepticism when working with text mining techniques. First, anyone’s ability to identify meaningful entry points into a large corpus is limited or enhanced by their understanding of the historical and historiographical context in which those sources were created. In this regard, for instance, it was useful to know that there is a body of research that has investigated the experiences of female slaves and that this historiography has given particular attention to motherhood. I am not an expert in this field, but I knew enough to know that following that term could lead to some interesting questions. A second factor that can affect the questions we could ask through text mining has to do with the chronological or geographical coverage of the collection in question. Some of the collections used in the projects examined in this module covered no less than forty years. This meant that those working on those collections could ask questions about change or continuity over time. The Slave Narratives collection was different in that, chronologically, it covered a relatively short period. Even though the memories that interviewers tried to elicit went back many years, the actual interviews were collected during a span of only two years. However, the interviews covered a large portion of the country: seventeen states were represented in the dataset we used. In light of this, the nature of the questions one can reasonably ask of these interviews is quite different. Rather than focusing on how themes may have changed over time, one would ask how interviews in one state differ from those in another.

For instance, using Voyant, I found it very useful to identify differences between the collections that could tell us more about how the interviews were collected and how to read them. One exercise that was particularly useful was looking at the distinctive terms identified in two sets of documents. One of the states I examined was Florida, where I examined ten distinctive terms. It was interesting that three of these were names of places located in Florida and two were family names. I thought this would be quite typical of all collections, but when I examined the interviews from Georgia, I was surprised that most of the distinctive terms in those interviews were related to dialect; only one was the name of a place, and there were no family or personal names. One would need to investigate these collections further to account for these differences.
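
A comparable (though not necessarily identical) way to surface distinctive terms per collection is TF-IDF, which weights terms that are frequent in one document but rare across the rest. A hedged sketch with scikit-learn, using toy stand-in texts:

```python
# A sketch of surfacing distinctive terms per collection with TF-IDF.
# The texts are toy stand-ins, not the actual state collections.
from sklearn.feature_extraction.text import TfidfVectorizer

collections = {
    "florida": "jacksonville tallahassee plantation mother old house",
    "georgia": "dey wuz gwine plantation mother old house",
}

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(collections.values())
terms = vectorizer.get_feature_names_out()

for state, row in zip(collections, X.toarray()):
    top = [terms[j] for j in row.argsort()[::-1][:5]]
    print(state, "->", ", ".join(top))
```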

Historians typically divide sources between primary and secondary, and this distinction determines the kinds of questions they can ask. It is not news to any experienced historian that sources such as the Slave Narratives are difficult to place in either one of these buckets. Working with Voyant, however, highlights the importance of understanding when and how the Slave Narratives can be used as primary sources and when and how they can serve as secondary sources. Since text mining allows us to capture the totality of a corpus and then break it down into smaller pieces, one should be careful that, in trying to put it back together, one does not draw connections that are not warranted by the historical or historiographical context.

In the hands of a patient and knowledgeable historian, Voyant, and text mining in general, can be powerful tools. Despite the care and time one needs to invest in acquiring rights, cleaning data, testing algorithms, and so on, text mining makes it possible to examine what would otherwise be an impossibly large amount of text, and thus offers a different perspective on one of the oldest and most valuable expressive and communication tools we have: words.