
Digital Humanities and Crowdsourcing

Back in the mid-1990s, as an undergraduate student, I participated in a project in which we were recruited to make an inventory of the secondary sources on Mexican history available in the public and private libraries of Mexico City. We were all given a long list of books and charged with visiting one library, determining whether it held copies of the listed books, and adding other sources that were not included in the original list. At the time, only a few academic libraries had digitized their catalogues and, in most cases, the process was still ongoing. What was not yet available was the possibility of searching these catalogues remotely. Thus, the goal of our project was to produce a bibliography that would allow students and researchers to locate particular sources without having to spend time visiting different libraries throughout the city. A project like this became unnecessary when digital catalogues were made available online. In this case, technology was able to solve one problem, but crowdsourcing remains a useful approach for processes and activities for which modern technology has not been able to provide solutions.

In my last post I talked about how AI chatbots, such as ChatGPT, were the latest technological response to the question of how to process vast amounts of information. Despite the progress that AI technology has achieved, it seems clear that there are still good reasons to continue using human beings in the process of producing knowledge. The idea of using large numbers of people to collect or process large bodies of documents, sources or data is not new. As my experience shows, this approach has been useful when the volume of the materials that needed to be processed was too large, and the nature of the work did not require too much training or expertise. The advent of digital technologies has made it possible to reach more people as potential contributors. Also, the increasing digitization of documents and other types of sources has created a (hopefully) virtuous circle in which crowdsourcing can be used as a means to increase access to historical and cultural materials and, by creating greater access, it can also encourage more engagement from the public.

As we have seen in this module, crowdsourcing is useful for opening the process of knowledge production to a wider public. This can be done by asking people to write and edit entries in a project such as Wikipedia, or to contribute individual experiences of a particular event, as was possible in the September 11 Digital Archive. In both cases, members of the public are instructed to work under a set of rules, but are otherwise encouraged to be part of a collective exercise of data collection and interpretation. In these cases, technology has made it possible to reach a larger number of people while also lowering the barriers to their participation.

Other uses for crowdsourcing involve the processing of large volumes of material that cannot be achieved using computers. Transcription projects such as Transcribing Bentham, or photograph description projects such as NASA on the Commons, emanate from the need to make materials more accessible through digitization. But this goal also requires the infrastructure by which documents and images can be more accurately catalogued, searched, and studied. Current technology enables us to publish digital photographs and documents on library and museum websites, but it is not yet capable of reading all types of handwriting or describing the contents of a photograph. For these, we still need people. So it is no surprise that large digitization projects are often accompanied by a crowdsourcing component. An example of this is Linked Jazz, where members of the public are asked to characterize the relationships described in interviews with jazz artists.

Thus, projects that lend themselves to a crowdsourcing approach will typically involve the creation or processing of large volumes of material, require little or no expertise from contributors, and aim at expanding access to the process of knowledge production. Transcription, basic annotation and description, and collection are processes that, with the right planning and circumstances, can be achieved through crowdsourcing.

Careful planning does make a difference to the success of crowdsourcing projects. First, it is important to cast a wide enough net to reach as many people as possible. Second, it is vital to create the infrastructure and incentives that can keep contributors engaged. This is best achieved when a project can explain how the use, preparation and processing of sources lie at the foundation of knowledge production, thus turning what are often seen as meaningless tasks into essential steps in the preservation, dissemination and interpretation of sources. By allowing people access to sources that would normally be reserved for specialists, crowdsourcing projects start by establishing a relationship of trust that is further enhanced when contributors are entrusted to re-tell, read, transcribe, describe or interpret. This trust, however, can be intimidating if contributors do not feel supported and confident that they will have the time to learn while on the job. Thus, it is vital that crowdsourcing projects develop mechanisms for support and communication with contributors, as well as interfaces that are easy to use. But the most important incentive, in my view, is the feeling that by committing to a task, contributors can develop expertise and the value of their work will continue to increase. I believe this cultivates greater engagement and ownership of the overall project.

Ultimately, the success of crowdsourcing projects can be measured by their ability to engage the public in the process of knowledge production. If contributors feel they can enhance their skills and are empowered to make meaningful contributions, they will remain committed to the value and success of the larger project.

Crowdsourcing

More than 20 years after its creation, Wikipedia illustrates the benefits and pitfalls of crowdsourced information. Although it started as a site distrusted by most, it has managed to create a model for open collaboration that has gradually earned a healthy measure of trust among the public and even the scholarly community. One reason why many of us have come to appreciate Wikipedia, despite its many flaws, is that it has worked hard to bolster what my teachers used to call its “critical apparatus”. That is, it has created the means for other scholars to critique the content and structure of the material presented in its many pages.

As a young student of history, I was taught that the citations in my papers constituted the “critical apparatus.” Among scholars, it is through citations and references that one is able to support one’s claims. The critical apparatus created through these tools constitutes the main source of authority for any given piece of scholarship. However, the public’s reliance on reference works, such as encyclopedias, is contingent on their reputation, which in turn is built upon their editorial processes and the people they employ. Wikipedia’s model of open collaboration encourages the use of citations and references, but it also allows the public not only to participate in the editorial process, but also to see what changes are being proposed for individual articles and to discuss those changes with others involved in editing a page. This opening-up of the editorial process adds, in my opinion, to the critical apparatus of Wikipedia articles.

The examination of the “Digital Humanities” article in Wikipedia serves as a good illustration of how tools such as the “Revision History” or the “Talk History” expand our ability to evaluate the quality of a page. A cursory read of the Wikipedia article on “Digital Humanities” shows a reasonably well-organized article with numerous references, a bibliography and recommendations for further reading. Given what we have learned about the field of Digital Humanities, it is reasonable to ask how complete and up-to-date the article is, how recently and how often the page has been edited, and to what degree. When we looked up a term or topic in a traditional encyclopedia, we would accept that many entries would be out of date, given the time lag between when an article was written and published and when it was read. The level of authority of any given piece decreases as time passes. However, as my father used to say about national constitutions, the authority of any set of statements requires a balance between stability and flexibility. I would argue that the authority of an encyclopedia article is also contingent on how stable and flexible it is. The fact that an article can be updated as new information becomes available, but does not change radically from one version to the next, is a factor that adds to its authority.

The “Digital Humanities” article was last edited in October 2023 but, for the past two years, the edits have been relatively minor. One can appreciate this by looking at the revision history tool, where it is possible to see when the page was first created and to examine every single version of the article. The revision history shows the date when an edit was made, how many bytes were added, how many words were added or deleted, and a brief description of the changes. One also has the possibility of comparing two versions, although I found this particular feature difficult to use because the changes appear out of the context of the larger page. I found it easier to open different versions individually and then choose which ones I wanted to explore further.

Other important information that can be ascertained from the revision history is included in the Statistics information. Here, one can find how many times a particular page has been edited and at what rate. In the case of the “Digital Humanities” article we can see that since 2006, when the page was originally created, it has been edited 1009 times and 459 editors have made contributions. On average, each user has made 2.2 edits and the page has been edited every 6.4 days. However, most of the substantive edits took place in the earlier years of the page. During the past 365 days, only 7 edits have taken place and these have been relatively minor.
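The figures above can be cross-checked with a little arithmetic. The following is only a rough sketch in Python: the variable names are my own, and the "implied age" calculation assumes that the 6.4-day figure is simply the age of the page divided by the number of edits, which the Statistics page does not state explicitly.

```python
# Figures quoted above for the "Digital Humanities" article.
total_edits = 1009
total_editors = 459
avg_days_between_edits = 6.4

# Average number of edits per contributor.
edits_per_editor = total_edits / total_editors

# Assumption: the average interval is page age / number of edits,
# so multiplying back gives the approximate age of the page.
implied_age_days = total_edits * avg_days_between_edits
implied_age_years = implied_age_days / 365.25

# Recent pace: 7 edits during the past 365 days.
recent_days_between_edits = 365 / 7

print(round(edits_per_editor, 1))           # 2.2
print(round(implied_age_years, 1))          # 17.7
print(round(recent_days_between_edits, 1))  # 52.1
```

An implied age of roughly 17.7 years is consistent with a page created in 2006 and consulted in late 2023. The recent pace, about one edit every 52 days against a lifetime average of one every 6.4 days, is another way of seeing how much the editing activity has slowed.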

As a reader, one can take some confidence from the fact that the page seems to have reached some stability and that further changes to it seem to be of a minor nature. As a student of the Digital Humanities, one could wonder whether this means that debates about the definition, history, and scope of the field have been resolved. For this, the Talk History tool may prove very useful. Here, one can see which issues have preoccupied the editors, how they explain an edit, and when they seek comments or advice on a particular change. In the case of the “Digital Humanities” article we can see that the “Concerns” section of the talk history includes a lot of comments about the history and the definition of Digital Humanities. There are also questions about balancing the different types of tools that are included in the Methods and Projects sections of the article.

Traditional encyclopedias presented themselves as sources of authority due to their editorial processes. Some were open about who the authors of their articles were; others kept them anonymous. In an attempt to foster participation, while also maintaining transparency and accountability, Wikipedia offers contributors both options. One can add an edit to a page without disclosing a name or creating a user profile. In this case, the edit is recorded under the contributor's IP address. But many contributors do create a user name and a profile, and this can be used as another means to read and evaluate a particular entry. By looking at the Statistics page one can determine which contributors have made the most contributions and when. From there one can also see their profiles, if they have created one. Of the ten most prolific contributors to the “Digital Humanities” article (at least in terms of words), two are identified only by their IP address, and two have profile names but no information under their profile. Two more are students participating in a Wikipedia-related curriculum. Only about five of them have been active within the last five years, which again speaks to how relatively stable the page has become. Only one of the contributors, its creator in fact, identifies himself as a scholar and professional in the digital humanities. A couple of contributors identify themselves as scholars in computing or the humanities.

Learning about the identities, and maybe even the qualifications, of contributors may be more relevant to students and scholars than to the casual visitor. The latter can rely on the general Assessment of a page, which can also be found on the Statistics page and is explained on the Assessments page. But for scholars, students or other experts, learning about the contributors to a page is very useful to contextualize and understand the criteria they use when explaining and justifying their editorial decisions. The revision history and the talk history enable us to do a historiographical analysis of any given article; the more we know about the authors of these versions, the more informed our analysis and evaluation will be.

As an organization, Wikipedia has made substantial investments in building a critical apparatus that supports its reputation as an authoritative source of knowledge; not because Wikipedia articles are perfect, but because anyone can identify their sources and follow the process by which they have been created. In contrast, ChatGPT has stayed closer to a notion of “the wisdom of crowds” in that its source of authority is contingent upon the vast volumes of information used to teach the chatbot. In the case of Wikipedia, the crowds involved in creating its articles are people, engaged in an ongoing discussion of what to add, delete or change. ChatGPT uses much of this knowledge to produce elegantly explained answers in a way that seems very simple and clear to the reader. However, we do not get any sense of the sources that were used or the criteria used to select information. In this case ChatGPT lacks a “critical apparatus” on which to rest its authority. It relies purely on a crowd of anonymous sources. ChatGPT is able to elegantly synthesize large bodies of knowledge in the clearest and simplest way. However, sources like Wikipedia demonstrate that even a work of synthesis needs to make choices, and it is important for readers to understand what the justifications for these choices are. The questions of accuracy that we used to have about Wikipedia are multiplied in the case of ChatGPT, since we have no mechanism to assess its accuracy, completeness, or judgements. ChatGPT can produce very clear explanations and narratives, but in this case, clarity may obscure the inherent complexity and messiness of knowledge production.