Subjective Content Descriptions

Approaching language understanding

An important task in the area of Natural Language Processing (NLP) is language understanding. The goal is to make natural language texts written by humans understandable to machines. What counts as understanding a text often depends on the actual use case at hand, which is why NLP comprises a large variety of techniques.

In general, natural language is quite complex and far from being easily understandable, especially when processed automatically. This remains true even if we abstract from different languages and consider only one, e.g., English. In a modern world, humans do not only interact with other humans. Many interactions also take place with software, called agents, that can make decisions and execute actions based on input. The input may be natural language, which needs to be understood by the agent to make a decision and execute an action. Thus, NLP is an important field of research connecting computer science and linguistics.

SCDs in General

Subjective Content Descriptions (SCDs) provide a framework in the domain of NLP for working with corpora of text documents. SCDs are annotations associated with locations, e.g., sentences, in a corpus. Similar to post-it notes, SCDs can be understood as hints containing additional data for understanding and automatically processing natural language text. Thus, SCDs make the content near an SCD's location explicit by providing descriptions, references, or explanations.

Having a corpus annotated with SCDs is a benefit for an agent working with this corpus. The agent can access an SCD's content, i.e., the text written on the post-it note, as a further source of information, but also the SCD's references. These references consist of all the sentences the same SCD is associated with, i.e., multiple sentences sharing the same post-it note, as well as relations between different SCDs.
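
As a rough illustration of how an agent might hold SCDs internally, the following is a minimal sketch in Python. All names and fields here are our own assumptions for illustration, not a prescribed format: each SCD stores its content, the sentence locations it is associated with, and optional relations to other SCDs.

```python
from dataclasses import dataclass, field

@dataclass
class SCD:
    """One subjective content description ("post-it note")."""
    content: str                                                   # text written on the note
    locations: set[tuple[str, int]] = field(default_factory=set)   # (document id, sentence index)
    related: set[str] = field(default_factory=set)                 # ids of related SCDs

# A corpus annotated with SCDs, keyed by an SCD id (illustrative layout).
corpus_scds: dict[str, SCD] = {
    "scd-1": SCD(
        content="Explains the term 'lean computing'.",
        locations={("doc-a", 3), ("doc-b", 17)},   # two sentences share this note
    ),
}
```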

Techniques using SCDs

SCDs provide a flexible way to add additional data to sequences of words organized in corpora of text documents. Generally, SCDs can contain any type of data, and SCD-based NLP techniques help agents understand textual data. For example, SCDs can support agents in performing the following tasks:

  • Estimating SCDs for a single previously unseen text document using the Most Probably Suited SCD (MPS2CD) algorithm [1,2] (see the sketch after this list),
  • classifying a text document as related, extended, revised, or unrelated to a corpus [2],
  • moving the SCDs from one corpus to another similar corpus by adapting the SCDs’ domain [3],
  • separating SCDs and actual content being interleaved in text documents [4,5], or
  • enriching SCDs in a corpus already sparsely associated with SCDs [6].
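
To give an intuition for the MPS2CD idea referenced above, here is a minimal, hedged sketch: we assume an SCD-word matrix in which each row holds the word distribution of one SCD, and we pick the SCD whose distribution is most similar (here: cosine similarity) to the bag-of-words vector of an unseen sentence. The matrix layout, vocabulary, and similarity choice are illustrative assumptions; see [1,2] for the actual algorithm.

```python
import numpy as np

def most_probably_suited_scd(sentence: str,
                             scd_word_matrix: np.ndarray,
                             vocabulary: dict[str, int]) -> int:
    """Return the row index of the SCD whose word distribution best matches the sentence.

    scd_word_matrix: shape (num_scds, vocab_size), each row a word distribution of one SCD.
    vocabulary: maps a word to its column index.
    """
    # Bag-of-words vector for the sentence (words outside the vocabulary are ignored).
    bow = np.zeros(scd_word_matrix.shape[1])
    for word in sentence.lower().split():
        idx = vocabulary.get(word)
        if idx is not None:
            bow[idx] += 1.0

    # Cosine similarity between the sentence vector and every SCD row.
    norms = np.linalg.norm(scd_word_matrix, axis=1) * (np.linalg.norm(bow) or 1.0)
    similarities = scd_word_matrix @ bow / np.where(norms == 0, 1.0, norms)
    return int(np.argmax(similarities))
```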

We can view SCDs from different perspectives. SCDs may represent the topics of a corpus and thus provide a topic model. However, SCDs can also simply provide a set of textual annotations for a corpus, similar to post-it notes in a book.

SCDs are based on fundamental intentions and objectives that are common to all techniques using SCDs. These fundamentals make SCDs valuable for IR agents. SCDs require neither large corpora nor huge data sets for training. Especially when working with domain-specific texts or rare languages, the size of a corpus is limited and there is not enough data to train, e.g., an LLM. Nevertheless, SCDs provide an applicable technique in such cases, e.g., for old Tamil poems [4,5]. Furthermore, SCDs require fewer computational resources than LLMs and do not require special hardware like graphics cards. An SCD-based IR agent can run on off-the-shelf hardware and thus operates in a resource-efficient manner. In summary, SCDs represent a lean computing approach to text understanding.

Current Topics

USEM: UnSupervised Estimation of SCD Matrices

A user supplies a corpus to an Information Retrieval (IR) agent and needs to retrieve documents with similar content as well as highlight relevant locations in the retrieved documents. For this, the IR agent needs SCDs referencing sentences of similar content across various documents in the corpus, but most text documents are not associated with SCDs. USEM therefore estimates SCD matrices in an unsupervised manner.
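
As a hedged illustration of what unsupervised estimation of an SCD matrix can look like, the sketch below clusters simple sentence count vectors and treats each cluster's aggregated word counts as the word distribution of one estimated SCD. The clustering choice (k-means) and all names are our own assumptions for illustration; the published USEM approach is described in the papers listed below.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

def estimate_scd_matrix(sentences: list[str], num_scds: int):
    """Estimate an SCD-word matrix without any annotations (illustrative sketch).

    Returns (matrix, assignments): matrix has shape (num_scds, vocab_size) with one
    word distribution per estimated SCD; assignments maps each sentence to an SCD.
    """
    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(sentences)            # (num_sentences, vocab_size)

    # Group sentences with similar word usage; each cluster becomes one estimated SCD.
    assignments = KMeans(n_clusters=num_scds, n_init=10).fit_predict(counts)

    matrix = np.zeros((num_scds, counts.shape[1]))
    for scd_id in range(num_scds):
        cluster_counts = counts[assignments == scd_id].sum(axis=0)   # summed word counts
        total = cluster_counts.sum()
        if total > 0:
            matrix[scd_id] = np.asarray(cluster_counts).ravel() / total   # normalize to a distribution
    return matrix, assignments
```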

  • Magnus Bender, Tanya Braun, Ralf Möller, Marcel Gehrke
    Unsupervised Estimation of Subjective Content Descriptions in an Information System
    in International Journal of Semantic Computing, 2024
    DOI PDF
  • Magnus Bender, Tanya Braun, Ralf Möller, Marcel Gehrke
    Unsupervised Estimation of Subjective Content Descriptions
    in 17th IEEE International Conference on Semantic Computing (ICSC 2023)
    DOI PDF

LESS: LEan computing for Selective Summaries

SCDs estimated by USEM lack meaningful descriptions, i.e., labels consisting of short summaries. Labels are important for the agent and its users to identify relevant SCDs and documents. We present LESS, which creates labels for SCDs using the word distributions of the SCDs.
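
A hedged sketch of how a short label could be derived from an SCD's word distribution: take the few most probable words of the SCD's row in the SCD-word matrix. The function name and the top-k choice are illustrative assumptions; the actual LESS method is described in the paper below.

```python
import numpy as np

def label_for_scd(scd_row: np.ndarray, index_to_word: dict[int, str], top_k: int = 3) -> str:
    """Build a short label from the most probable words of one SCD's word distribution."""
    top_indices = np.argsort(scd_row)[::-1][:top_k]   # indices of the top_k highest probabilities
    return " ".join(index_to_word[i] for i in top_indices)

# Example: the SCD assigns most probability mass to "lean" and "computing".
vocab = {0: "lean", 1: "computing", 2: "the", 3: "corpus"}
print(label_for_scd(np.array([0.4, 0.35, 0.05, 0.2]), vocab, top_k=2))  # prints "lean computing"
```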

  • Magnus Bender, Tanya Braun, Ralf Möller, Marcel Gehrke
    LESS is More: LEan Computing for Selective Summaries
    in KI 2023: Advances in Artificial Intelligence. Lecture Notes in Computer Science, Springer.
    DOI PDF URL

(Re)FrESH: (Relation-preserving) Feedback-reliant Enhancement of SCDs by Humans

A human interacts with our IR agent and a response contains an erroneous part. Such errors, like faulty SCDs, should be sent back to the agent by the human as feedback. Then, the agent needs to incorporate the feedback and remove or correct the erroneous part of its internally used SCDs. However, removing a faulty sentence with an SCD from a previously trained model is a difficult task; often the model needs to be retrained from scratch. To circumvent this, we present FrESH and ReFrESH to keep the SCDs fresh and maintained with human feedback.
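
To illustrate the kind of incremental update such feedback can trigger, here is a hedged sketch: when a user reports a sentence as wrongly associated with an SCD, the sentence's word counts are subtracted from that SCD's row of a count-based SCD-word matrix and the location is dropped, instead of retraining from scratch. This update rule and all names are our own simplification, not the published FrESH/ReFrESH procedure.

```python
import numpy as np

def remove_feedback(scd_count_matrix: np.ndarray,
                    scd_locations: dict[int, set[tuple[str, int]]],
                    scd_id: int,
                    sentence: str,
                    location: tuple[str, int],
                    vocabulary: dict[str, int]) -> None:
    """Undo one faulty SCD-sentence association in place, without retraining.

    scd_count_matrix: shape (num_scds, vocab_size), raw word counts per SCD.
    scd_locations: for each SCD id, the sentence locations it is associated with.
    """
    for word in sentence.lower().split():
        idx = vocabulary.get(word)
        if idx is not None:
            # Subtract the sentence's contribution, never dropping below zero.
            scd_count_matrix[scd_id, idx] = max(0.0, scd_count_matrix[scd_id, idx] - 1.0)
    scd_locations[scd_id].discard(location)
```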

  • Magnus Bender, Kira Schwandt, Ralf Möller, Marcel Gehrke
    FrESH – Feedback-reliant Enhancement of Subjective Content Descriptions by Humans
    in Proceedings of the Humanities-Centred AI (CHAI) Workshop at KI2023, 46th German Conference on Artificial Intelligence, 2023
    DOI PDF URL
  • Magnus Bender, Tanya Braun, Ralf Möller, Marcel Gehrke
    ReFrESH – Relation-preserving Feedback-reliant Enhancement of Subjective Content Descriptions
    in 18th IEEE International Conference on Semantic Computing (ICSC 2024) – Best Paper Award
    DOI PDF

Relations and Complementarity

Our agent so far relies on similarity measures to identify related documents, which are used as responses or for corpus extension. However, similarity may not be appropriate when looking for new information or different aspects of the same content. Therefore, we combine complementarity-based and similarity-based identification of documents.
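
As a hedged sketch of combining the two criteria, the snippet below scores candidate documents by a weighted mix of a similarity score and a complementarity score, here crudely approximated as the share of terms a candidate adds beyond the query document. The weighting scheme and the complementarity proxy are illustrative assumptions; the combination actually used is described in the papers below.

```python
def combined_score(query_terms: set[str], candidate_terms: set[str], weight: float = 0.5) -> float:
    """Weighted mix of similarity (Jaccard overlap) and complementarity (share of new terms)."""
    if not query_terms or not candidate_terms:
        return 0.0
    similarity = len(query_terms & candidate_terms) / len(query_terms | candidate_terms)
    complementarity = len(candidate_terms - query_terms) / len(candidate_terms)
    return weight * similarity + (1.0 - weight) * complementarity

# Rank candidate documents (each given as a set of terms) for a query document.
query = {"scd", "corpus", "annotation"}
candidates = {"doc-a": {"scd", "corpus", "label"}, "doc-b": {"scd", "annotation", "feedback"}}
ranked = sorted(candidates, key=lambda d: combined_score(query, candidates[d]), reverse=True)
```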

  • Magnus Bender, Felix Kuhr, Tanya Braun
    To Extend or not to Extend? Enriching a Corpus with Complementary and Related Documents
    in International Journal of Semantic Computing, 2022
    DOI PDF
  • Magnus Bender, Felix Kuhr, Tanya Braun
    To Extend or not to Extend? Complementary Documents
    in 16th IEEE International Conference on Semantic Computing (ICSC 2022)
    DOI PDF

Further Resources

References

  1. Felix Kuhr, Tanya Braun, Magnus Bender, Ralf Möller
    To Extend or not to Extend? Context-specific Corpus Enrichment
    in Proceedings of AI 2019: Advances in Artificial Intelligence, 2019, Springer
    DOI PDF
  2. Felix Kuhr, Magnus Bender, Tanya Braun, Ralf Möller
    Augmenting and Automating Corpus Enrichment
    in International Journal of Semantic Computing, 2020
    DOI PDF
  3. Felix Kuhr, Magnus Bender, Tanya Braun, Ralf Möller
    Context-specific Adaptation of Subjective Content Descriptions
    in 15th IEEE International Conference on Semantic Computing (ICSC 2021)
    DOI PDF
  4. Magnus Bender, Tanya Braun, Marcel Gehrke, Felix Kuhr, Ralf Möller, Simon Schiff
    Identifying and Translating Subjective Content Descriptions Among Texts
    in 15th IEEE International Conference on Semantic Computing (ICSC 2021)
    DOI PDF
  5. Magnus Bender, Tanya Braun, Marcel Gehrke, Felix Kuhr, Ralf Möller, Simon Schiff
    Identifying and Translating Subjective Content Descriptions Among Texts
    in International Journal of Semantic Computing, 2021
    DOI PDF
  6. Felix Kuhr, Bjarne Witten, Ralf Möller
    Corpus-Driven Annotation Enrichment
    in 13th IEEE International Conference on Semantic Computing (ICSC 2019)
    DOI