Start Over

Challenges and opportunities for digital history

Authors :: Ian N. Gregory
Publication Year :: 2014
Abstract: The challenge for digital historians is deceptively simple: it is to do good history that combines the computer’s ability to search and summarize, with the researcher’s ability to interpret and argue. This involves both developing an understanding of how to use digital sources appropriately, and more importantly, using digital sources and methods to deliver new scholarship that enhances our understanding of the past. There are plenty of sources available; the challenge is to make use of them to deliver on their potential. There have been false dawns for digital history, or “history and computing,” in the past (Boonstra et al. 2004). Until very recently, computers were primarily associated with performing calculations on numbers. This has resulted in them becoming fundamental tools in fields such as economic history, historical demography and, through the use of geographical information systems (GIS)1, historical geography. These are, however, relatively small fields within the discipline as a whole and much of the work that has been done in them has taken place outside of History departments in, for example, Economics, Sociology, and Geography. As most historians work with texts, it is hardly surprising that this style of computing has made little impact on the wider discipline. Within the last few years, however, there has been a fundamental shift in computing in which, put simply, computers have moved from being number crunching machines to become an information technology where much of the information that they contain is in textual form. This has been associated with the creation of truly massive amounts of digital textual content. This ranges from social media and the internet, to private sector digitization projects such as Google Books and the Gale/Cengage collections, to the more limited investment from the academic and charitable sectors (Thomas and Johnson 2013). Thus, computers are now inextricably concerned with texts – exactly the type of source that is central to the study of history. As a consequence, many historians have become “digital historians” almost without realizing it through making use of the vast number of sources that are now available from their desktop. So is everything in the garden that is digital history currently rosy? The answer, judging by work such as Hitchcock (2013) and the responses to it (Knights 2013; Prescott 2013), seems to be a resounding no. Many criticisms are centered on the digital sources themselves, whose quality is lower than that might be hoped. Digitizing a document is usually a two-stage process: first a digital image of the document is created as a bitmap, then the textual content is encoded as machine readable text. The two are then often brought together such that a user can type a search term, this is located in the text, and then the user can be shown the appropriate image of the page. The first of the two stages is relatively simple using a scanner or camera and, if done properly, only results in relatively minor abstractions from the original as the result is a facsimile copy. The second stage, however, is hugely problematic involving either the text being manually typed, or optical character recognition (OCR) software being used to automatically identify letters from the bitmap image. Both of these are slow, expensive, and errorprone. OCR tends to be used on largescale projects: it is faster and cheaper but tends to result in far more errors. Whatever approach is used, checking the results is very difficult. Common approaches involve carefully typing up“gold standard”samples of parts of the source and comparing these with bulk-entered material to give a percentage of words or letters that have errors. Understanding what the consequences of these scores mean in practice is difficult. Even without error, if the text is removed from the page scans then they are heavily abstracted from the original and much potentially useful information is lost. Once created, digital sources are often interrogated using techniques that are not properly understood but are nevertheless used uncritically. The classic example that combines both the data capture and uncritical use problems is typing a keyword search into a web interface, which returns a list of hits sorted by “relevance.” As Hitchcock (2013) points out, most historians using digital sources do this without having any idea of the implications either of the data capture that created the digital copy of the source, and thus whether the search will miss words as a result of spelling variations derived from digitization errors, or of how the search engines decides what is – and, more importantly, is not – “relevant.” While using search engines may be problematic, in reality they are the only digital tool that most historians use, indeed there is a lack of widely used techniques that can be used to interrogate, summarize, and understand the large volumes of material that are available. So what do digital historians need to do? The answer, I would argue, is to remember that they are first and foremost historians and that historians fundamentally are in the business of taking complex, incomplete sources that are full of biases and errors, and interpreting them critically to develop an argument that answers a research question. Digital sources do not change this