Let’s Talk about Palm Leafs – From Minimal Data to Text Understanding

A tutorial by Magnus Bender, Marcel Gehrke, and Tanya Braun at KI 2023

46th German Conference on Artificial Intelligence, 26-29 September 2023, Berlin, Germany

Website of Tutorial

  • Date: September 26, 2023
  • Time: 2pm - 6pm
  • Room: C 251 (C building)

Introduction

In recent years, large language models have greatly improved the state of the art for text understanding. However, large language models are often computationally expensive and work best in areas with huge amounts of training data. Unfortunately, there are areas where little data is available. For example, in digital humanities, researchers investigate poems written on palm leaves in Old Tamil. They only have a few hundred or maybe a thousand poems (documents). In such a setting, taking a general pre-trained large language model (none exists for Old Tamil) and further training it by subsampling from the corpus reaches its limits, given the limited data available. Nonetheless, support for text understanding or information retrieval is of great value to these researchers.

Therefore, in this tutorial, we give an overview of how different tasks can be performed with only minimal data available. We will use examples from the field of digital humanities to illustrate particular challenges. Among these examples, we will look at the above-mentioned poems on palm leaves, which include in-line annotations that are hard to distinguish from the actual poem unless one already knows the poem. Another example is critical editions, in which scholars combine many poems, transcriptions, translations, their annotations or comments, and a dictionary. When these editions are merged, the challenge lies in identifying parts of editions that are extensions to or revisions of other critical editions. During our journey, we touch upon long-standing concepts such as topic modelling and hidden Markov models and how they still help in text understanding with minimal data. Further, we show how these approaches perform compared with large language models in areas with minimal data.

Presenters

A collaborative effort between

Target Audience, Prerequisite Knowledge, and Learning Goals

The tutorial will be mostly self-contained. We assume familiarity with concepts such as topic modelling, but we will revisit all necessary definitions. The tutorial is therefore potentially interesting for all researchers interested in text understanding, including AI researchers as well as researchers from other fields such as digital humanities.

In reference to the call, we talk about AI and digital humanities as well as machine learning and related methods, which are main topics of the KI 2023 call for papers. The goal of this tutorial is two-fold:

  1. to provide an overview of recent developments in text understanding with minimal data, with a focus on digital humanities as an application area, and
  2. to discuss new directions for investigation.

Further, this tutorial nicely complements the Workshop on Humanities-Centred AI, which took place at the last two iterations of the KI conference.

Agenda (including presentation material)

  1. Introduction to Semantic Systems (PDF)
  2. Supervised Learning (PDF)
    • Subjective content descriptions
    • Corpus enrichment
    • Inline annotations
  3. Unsupervised and Relational Learning (PDF)
    • Unsupervised estimation of SCDs
    • Continuous improvement by feedback
    • Labelling of SCDs
    • Inter- and intra-SCD relations
  4. Summary (PDF)