This research project covers my dissertation and is now largely complete. The follow-up project is Intelligent Subjective Agents.
Subjective Content Descriptions
Approaching language understanding
An important task in Natural Language Processing (NLP) is language understanding. The goal is to make human-written natural language texts machine-understandable. What it means to understand a text often depends on the use case at hand, so NLP comprises a large number of techniques.
In general, natural language is complex and far from easy to understand, especially when processed automatically. This remains true even if we set aside the differences between languages and consider only one, e.g., English. In the modern world, humans do not only interact with other humans. Many interactions also take place with software, called agents, that can make decisions and execute actions based on input. The input may be natural language, which the agent needs to understand in order to make a decision and execute an action. Thus, NLP is an important field of research connecting computer science and linguistics.
SCDs in General
Subjective Content Descriptions (SCDs) provide a framework in the domain of NLP for working with corpora of text documents. SCDs are annotations associated with locations, e.g., sentences, in a corpus. Similarly to post-it notes, SCDs can be understood as hints containing additional data for understanding and automatically processing natural language text. Thus, SCDs make the content near the SCD’s location explicit by providing descriptions, references, or explanations.
Having a corpus annotated with SCDs benefits an agent working with this corpus. The agent can access an SCD’s content, i.e., the text written on the post-it note, as a further source of information, and it can also follow the SCD’s references. These references consist of all the sentences the same SCD is associated with, i.e., multiple sentences sharing the same post-it note, as well as relations between different SCDs.
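The post-it note picture can be sketched as a small data structure. This is a minimal illustration, not the representation used in the publications: the class names `SCD` and `AnnotatedCorpus`, the `(document id, sentence index)` locations, and all method names are assumptions made for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SCD:
    """One 'post-it note': its content plus everything it refers to."""
    scd_id: str
    content: str
    # Hypothetical location format: (document id, sentence index).
    locations: set[tuple[str, int]] = field(default_factory=set)
    # Ids of other SCDs this SCD is related to.
    related: set[str] = field(default_factory=set)


class AnnotatedCorpus:
    """Sentences of a corpus together with the SCDs attached to them."""

    def __init__(self) -> None:
        self.sentences: dict[tuple[str, int], str] = {}
        self.scds: dict[str, SCD] = {}

    def add_sentence(self, doc_id: str, index: int, text: str) -> None:
        self.sentences[(doc_id, index)] = text

    def annotate(self, scd_id: str, content: str, doc_id: str, index: int) -> SCD:
        # Reuse the SCD if it already exists, so one note can cover many sentences.
        scd = self.scds.setdefault(scd_id, SCD(scd_id, content))
        scd.locations.add((doc_id, index))
        return scd

    def relate(self, first_id: str, second_id: str) -> None:
        # Store the relation symmetrically on both SCDs.
        self.scds[first_id].related.add(second_id)
        self.scds[second_id].related.add(first_id)

    def scds_at(self, doc_id: str, index: int) -> list[SCD]:
        return [s for s in self.scds.values() if (doc_id, index) in s.locations]

    def sentences_sharing(self, scd_id: str) -> list[str]:
        # All sentences that share the same post-it note, in location order.
        return [self.sentences[loc] for loc in sorted(self.scds[scd_id].locations)]
```

An agent could then call `sentences_sharing` to jump from one sentence to all other sentences carrying the same SCD, or read `related` to move between SCDs.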
Techniques using SCDs
SCDs provide a flexible way to add additional data to sequences of words organized in corpora of text documents. Generally, SCDs can contain any type of data and serve as a basis for NLP techniques for understanding textual data. For example, SCD-based techniques can support agents in the following tasks:
- estimating SCDs for a single previously unseen text document using the Most Probably Suited SCD (MPS2CD) algorithm [1,2],
- classifying a text document as related, extended, revised, or unrelated to a corpus [2],
- moving the SCDs from one corpus to another similar corpus by adapting the SCDs’ domain [3],
- separating SCDs and actual content being interleaved in text documents [4,5], or
- enriching SCDs in a corpus already sparsely associated with SCDs [6].
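To make the first task in the list more concrete, the following sketch assigns each sentence of a new document the SCD whose word counts are most similar to the sentence. It is an illustrative stand-in and not the MPS2CD algorithm from [1,2]: the bag-of-words representation, the cosine similarity, and the names `tokenize`, `cosine`, `most_suited_scd`, and `estimate_scds` are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_suited_scd(sentence: str, scd_words: dict[str, Counter]):
    """Return (scd_id, similarity) of the best matching SCD, or (None, 0.0)."""
    vec = Counter(tokenize(sentence))
    best_id, best_score = None, 0.0
    for scd_id, words in scd_words.items():
        score = cosine(vec, words)
        if score > best_score:
            best_id, best_score = scd_id, score
    return best_id, best_score


def estimate_scds(sentences: list[str], scd_words: dict[str, Counter]) -> list:
    """Estimate one SCD id (or None) per sentence of an unseen document."""
    return [most_suited_scd(s, scd_words)[0] for s in sentences]
```

The similarity values returned by `most_suited_scd` could also feed a decision about how a document relates to the corpus, but the criteria for that classification are described in [2] and are not reproduced here.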
We can look at SCDs from different perspectives. SCDs may represent the topics of a corpus and thus provide a topic model. But SCDs can also simply provide a set of textual annotations for a corpus, similar to post-it notes in a book.
SCDs are based on some fundamental intentions and objectives that are common to all techniques using SCDs. These fundamentals make SCDs valuable for Information Retrieval (IR) agents. SCDs require neither large corpora nor huge data sets for training. Especially when working with domain-specific texts or rare languages, the size of a corpus is limited and there is not enough data to train, e.g., a large language model (LLM). SCDs nevertheless provide an applicable technique in such cases, e.g., for old Tamil poems [4,5]. Furthermore, SCDs require fewer computational resources than LLMs and do not require special hardware such as graphics cards. An SCD-based IR agent can run on off-the-shelf hardware and thus in a resource-efficient manner. In summary, SCDs represent a lean computing approach to text understanding.
Current Topics
USEM: UnSupervised Estimation of SCD Matrices
A user supplies a corpus to an Information Retrieval (IR) agent and needs to retrieve documents with similar content and highlight relevant locations in the retrieved documents. To do so, the IR agent needs SCDs referencing sentences of similar content across various documents in the corpus, but most text documents are not associated with SCDs. USEM estimates these SCDs in an unsupervised manner.
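The idea of SCDs that reference sentences of similar content across documents can be illustrated with a simple greedy grouping. This sketch is not USEM itself: the bag-of-words vectors, the cosine similarity, the `threshold` parameter, and the function names are assumptions chosen to keep the example short.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"\w+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def group_similar_sentences(sentences, threshold=0.5):
    """Greedily group (doc_id, text) pairs into SCD-like groups.

    Each group keeps the summed word counts of its sentences and the
    indexes of the sentences it references.
    """
    groups = []
    for idx, (_doc_id, text) in enumerate(sentences):
        vec = Counter(tokenize(text))
        best, best_sim = None, threshold
        for group in groups:
            sim = cosine(vec, group["words"])
            if sim >= best_sim:
                best, best_sim = group, sim
        if best is None:
            # No sufficiently similar group yet: start a new one.
            groups.append({"words": vec, "members": [idx]})
        else:
            # Attach the sentence and fold its words into the group.
            best["words"].update(vec)
            best["members"].append(idx)
    return groups
```

Each resulting group plays the role of one unlabeled SCD: its `members` are the referenced sentences, possibly from different documents, and its `words` form a word distribution that later steps can use.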
- Magnus Bender, Tanya Braun, Ralf Möller, Marcel Gehrke
Unsupervised Estimation of Subjective Content Descriptions in an Information System
in International Journal of Semantic Computing, 2024
DOI PDF
- Magnus Bender, Tanya Braun, Ralf Möller, Marcel Gehrke
Unsupervised Estimation of Subjective Content Descriptions
in 17th IEEE International Conference on Semantic Computing (ICSC 2023)
DOI PDF
LESS: LEan computing for Selective Summaries
SCDs estimated by USEM lack meaningful descriptions, i.e., labels consisting of short summaries. The agent and its users need labels to identify relevant SCDs and documents. We present LESS, which creates labels for SCDs using the word distributions of the SCDs.
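One way to picture labeling from word distributions is to select, among candidate sentences, the one whose words carry the most weight under the SCD’s distribution. The sketch below is an assumption-laden illustration and not the LESS procedure from the paper: the scoring by average word probability and the names `tokenize` and `select_label` are invented for this example.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"\w+", text.lower())


def select_label(word_dist: Counter, candidates: list[str]) -> str:
    """Pick the candidate sentence that best matches the SCD's word distribution."""
    total = sum(word_dist.values()) or 1  # avoid division by zero for empty SCDs

    def score(sentence: str) -> float:
        tokens = tokenize(sentence)
        if not tokens:
            return 0.0
        # Average relative frequency of the sentence's words within the SCD.
        return sum(word_dist[t] / total for t in tokens) / len(tokens)

    return max(candidates, key=score)
```

Because the score is an average per word, very short candidates are favored. A real system would likely constrain label length or filter stop words, which this sketch leaves out.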
- Magnus Bender, Tanya Braun, Ralf Möller, Marcel Gehrke
LESS is More: LEan Computing for Selective Summaries
in KI 2023: Advances in Artificial Intelligence. Lecture Notes in Computer Science, Springer.
DOI PDF URL
(Re)FrESH: (Relation-preserving) Feedback-reliant Enhancement of SCDs by Humans
A human interacts with our IR agent and a response contains an erroneous part. Such errors, like faulty SCDs, should be sent back to the agent by the human as feedback. Then, the agent needs to incorporate the feedback and remove or correct the erroneous part of its internally used SCDs. However, removing a faulty sentence with an SCD from a previously trained model is a difficult task, and often the model needs to be retrained from scratch. To circumvent this, we present FrESH and ReFrESH to keep the SCDs fresh and maintained with human feedback.
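Why feedback can be incorporated without retraining becomes clearer under a simple assumption: if an SCD’s word counts are the sum of the counts of its sentences, a faulty sentence can be subtracted again. The sketch below illustrates only this idea and is not the FrESH or ReFrESH algorithm: the class `SCDStore` and its methods are hypothetical, and relations between SCDs are not handled.

```python
from collections import Counter


class SCDStore:
    """Keeps per-SCD word counts as the sum of the counts of member sentences."""

    def __init__(self) -> None:
        self.words: dict[str, Counter] = {}
        # scd_id -> sentence_id -> word counts contributed by that sentence
        self.members: dict[str, dict[str, Counter]] = {}

    def add(self, scd_id: str, sentence_id: str, text: str) -> None:
        counts = Counter(text.lower().split())
        self.members.setdefault(scd_id, {})[sentence_id] = counts
        self.words.setdefault(scd_id, Counter()).update(counts)

    def remove_faulty(self, scd_id: str, sentence_id: str) -> None:
        """Apply human feedback: undo the contribution of one faulty sentence."""
        counts = self.members[scd_id].pop(sentence_id)
        self.words[scd_id].subtract(counts)
        self.words[scd_id] += Counter()  # in-place add drops zero and negative entries
        if not self.members[scd_id]:
            # No sentence left: the SCD itself is removed.
            del self.members[scd_id]
            del self.words[scd_id]
```

The cost of `remove_faulty` depends only on the length of the removed sentence, not on the size of the corpus, which is the property that makes an update cheaper than retraining from scratch.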
- Magnus Bender, Kira Schwandt, Ralf Möller, Marcel Gehrke
FrESH – Feedback-reliant Enhancement of Subjective Content Descriptions by Humans
in Proceedings of the Humanities-Centred AI (CHAI) Workshop at KI2023, 46th German Conference on Artificial Intelligence, 2023
DOI PDF URL
- Magnus Bender, Tanya Braun, Ralf Möller, Marcel Gehrke
ReFrESH – Relation-preserving Feedback-reliant Enhancement of Subjective Content Descriptions
in 18th IEEE International Conference on Semantic Computing (ICSC 2024) – Best Paper Award
DOI PDF
Relations and Complementarity
So far, our agent relies on similarity measures to identify related documents, which are used as a response or for corpus extension. However, similarity may not be appropriate when looking for new information or different aspects of the same content. Therefore, we combine complementarity- and similarity-based identification of documents.
- Magnus Bender, Felix Kuhr, Tanya Braun
To Extend or not to Extend? Enriching a Corpus with Complementary and Related Documents
in International Journal of Semantic Computing, 2022
DOI PDF
- Magnus Bender, Felix Kuhr, Tanya Braun
To Extend or not to Extend? Complementary Documents
in 16th IEEE International Conference on Semantic Computing (ICSC 2022)
DOI PDF
Further Resources
Links
- My Dissertation about SCDs
- Tutorial at KI 2023
- List of Publications
- Project Site @ Universität Hamburg
References
- Felix Kuhr, Tanya Braun, Magnus Bender, Ralf Möller
To Extend or not to Extend? Context-specific Corpus Enrichment
in Proceedings of AI 2019: Advances in Artificial Intelligence, 2019, Springer
DOI PDF
- Felix Kuhr, Magnus Bender, Tanya Braun, Ralf Möller
Augmenting and Automating Corpus Enrichment
in International Journal of Semantic Computing, 2020
DOI PDF
- Felix Kuhr, Magnus Bender, Tanya Braun, Ralf Möller
Context-specific Adaptation of Subjective Content Descriptions
in 15th IEEE International Conference on Semantic Computing (ICSC 2021)
DOI PDF
- Magnus Bender, Tanya Braun, Marcel Gehrke, Felix Kuhr, Ralf Möller, Simon Schiff
Identifying and Translating Subjective Content Descriptions Among Texts
in 15th IEEE International Conference on Semantic Computing (ICSC 2021)
DOI PDF
- Magnus Bender, Tanya Braun, Marcel Gehrke, Felix Kuhr, Ralf Möller, Simon Schiff
Identifying and Translating Subjective Content Descriptions Among Texts
in International Journal of Semantic Computing, 2021
DOI PDF
- Felix Kuhr, Bjarne Witten, Ralf Möller
Corpus-Driven Annotation Enrichment
in 13th IEEE International Conference on Semantic Computing (ICSC 2019)
DOI