Cinema Context Digifil Datasprint 28 May 2020

Introduction

The primary goal of this Cinema Context Digifil datasprint is to provide cleaned and enriched data based on the results of the Digifil project, in which we automatically extracted information on film programming from digitised newspapers. Beyond producing a clean dataset, the datasprint has a secondary purpose: to assess the quality of the Digifil data more precisely, to help estimate how much (wo)manpower is needed to clean the data after automatic extraction, and to find ways to further streamline and improve this process of data cleaning and enrichment.

The information on this site is structured as follows: this introduction page gives some general background; the communication page explains how we can contact each other; the Digifil Editor page describes the layout of the Digifil Editor; and the envisioned workflow is then described step by step in two sections. A final page is reserved for reporting results.

Some background

In 2018, Clariah funded the Digifil project, which aimed at automatically extracting, digitising and publishing film screening data from the weekly ‘filmladders’ (film listings) published in the historical newspapers that are available in the Delpher repository created by the Dutch Royal Library. Since the current data collection in the online Dutch cinema encyclopaedia Cinema Context contains data on film programming up until 1948, the Digifil project has focused on filling the gap for the period after 1948.

In order to automatically extract the film programmes, a series of techniques and strategies was developed: first, to identify the film listings, a needle in the haystack of the OCR'd newspaper pages, and secondly to correctly parse those film listings in order to translate them into structured data of film screenings consisting of three basic components: cinema names, film titles and dates. Moreover, we have tried to identify those film titles by matching them to known title repositories such as the Internet Movie Database (IMDb). For a more detailed description of the project, see the DIGIFIL final report (pdf).
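As a rough illustration of the last two steps, the sketch below shows one way a parsed screening (cinema name, film title, date) could be represented and how an OCR-damaged title might be matched against a list of known titles. It is not the Digifil implementation: the `Screening` class, the `match_title` function, the `KNOWN_TITLES` list and the 0.8 similarity cutoff are all hypothetical, and the matching uses Python's standard `difflib` module rather than whatever method the project actually used.

```python
from dataclasses import dataclass
from datetime import date
from difflib import get_close_matches
from typing import Optional, Sequence


@dataclass
class Screening:
    """One film screening: the three basic components named in the text."""
    cinema: str
    title: str
    start_date: date


# Hypothetical stand-in for a title repository such as IMDb.
KNOWN_TITLES = ["Rear Window", "Roman Holiday", "High Noon"]


def match_title(
    ocr_title: str,
    known_titles: Sequence[str] = KNOWN_TITLES,
    cutoff: float = 0.8,  # assumed threshold, not taken from the project
) -> Optional[str]:
    """Return the most similar known title, or None if nothing is close enough."""
    # Compare case-insensitively, but return the title in its original casing.
    lookup = {t.lower(): t for t in known_titles}
    hits = get_close_matches(ocr_title.lower(), list(lookup), n=1, cutoff=cutoff)
    return lookup[hits[0]] if hits else None


# Example: an OCR error turns "w" into "vv".
row = Screening(cinema="Rex", title="Rear Windovv", start_date=date(1962, 5, 4))
print(match_title(row.title))  # prints "Rear Window"
```

A title that has no sufficiently similar candidate returns `None`, which is the kind of row that would be left for a human editor to check.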

The project has delivered hundreds of thousands of rows of film programmes, but their trustworthiness varies and at this point they are not sufficiently reliable to import into the Cinema Context database. OCR errors in the digitised newspapers lead to mistakes. Sometimes cinemas are not recognised by the algorithm (especially cinema names that are prone to OCR errors, such as ‘Rex’ or ‘City’). Often film titles are misread by the system and/or are not linked to the correct title in IMDb or Cinema Context. Human eyes are therefore needed to check and correct mistakes before we can actually put the data to use for scholarly research into the history of Dutch cinema culture. Data that is processed during this datasprint will be added to the Cinema Context database after a final editorial check. An updated version of the database will become openly available as a dump in DANS EASY.

The dataset

In order to have a clear playing field for this datasprint, we’ve selected a sample from the Digifil dataset. We will be working on film screenings that took place in Amsterdam in the sample years 1952, 1962 and 1972. The choice of these years and this city was predicated on a research pilot we are planning, in which we want to compare patterns of film programming in Amsterdam and Antwerp; for Antwerp, programming data is only available for those sample years. Additionally, we want to investigate to what extent the quality of the data varies between these three decades (and why).