DigitAI for Localized TEI / XML Assistance

An experiment with Small Language Models (SLMs), Retrieval Augmented Generation (RAG), and Model Context Protocol (MCP)

DigitAI Student Developers: Alexander C. Fisher, Hadleigh Jae Bills, Michael Simons

Faculty Mentor & Co-developer: Dr. Elisa Beshero-Bondar, Prof. of Digital Humanities

 

What DigitAI will be:

DigitAI will have broad applications. It will:

  • Inspect eXtensible Markup Language (XML) documents and schemas (coded rules governing data structures),

  • Identify inconsistencies and recommend improvements in markup,

  • (With agency) apply transformation scripts to supply new encoding and help complete unfinished projects.

This project was made possible, in part, by a seed grant from the Penn State Office of the Vice President for Commonwealth Campuses, and by Penn State funding for Undergraduate Research.

We are... Penn State funded!

newtFire {dh} @

Digital Media, Arts, and Technology ("DIGIT") @

Customizable desktop AI system trained to apply the Guidelines of the Text Encoding Initiative (TEI)

  • TEI: used by students, scholars, and editors in systematic encoding of structure and data in historic / cultural heritage documents in arts/humanities
  • Notoriously complex / challenging to learn
  • Usually customized for projects, which can be intricate and "bespoke".

Hypotheses:

  • If a Small Language Model (SLM) were given the TEI Guidelines and XML documents as marked semantic units (XML elements and attribute nodes) rather than as arbitrarily "chunked" text strings . . .
  • If that SLM could access and apply XML processing tools like XPath, XSLT, and XQuery . . .
  • We can develop a small RAG / MCP system that runs on a laptop computer, is highly customizable, and does not require expensive access to Large Language Models.
  • We can learn how to make the powerful but approximating statistical machinery of language models address and consult a highly structured data archive.
  • We can teach an SLM to "look stuff up" with XML tooling.
  • We can apply the strengths of language models to "speak to" the findings and speculate about how to help / answer a question / complete a task set by a human prompter.
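
The first hypothesis, feeding a model semantic units instead of arbitrary text chunks, can be sketched as follows. This is a minimal illustration using only Python's standard library; the sample letter, the function name `semantic_chunks`, and the choice of `<p>` as the unit are assumptions made for the example, not the project's actual pipeline.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI fragment, invented for this illustration.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div type="letter" n="1">
        <p>Dear <persName ref="#mary">Mary</persName>, I write from <placeName>Bath</placeName>.</p>
        <p>Yours, <persName ref="#jane">Jane</persName></p>
      </div>
    </body>
  </text>
</TEI>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def semantic_chunks(xml_text, unit="p"):
    """Return one chunk per <unit> element, keeping its path and child markup."""
    root = ET.fromstring(xml_text)
    chunks = []

    def walk(node, path):
        here = path + "/" + local_name(node.tag)
        if local_name(node.tag) == unit:
            chunks.append({
                "path": here,
                # Collapse whitespace so the chunk text reads as one line.
                "text": " ".join("".join(node.itertext()).split()),
                # Descendant elements with their attributes (the markup a
                # character-count chunker would cut through or discard).
                "children": [(local_name(c.tag), dict(c.attrib))
                             for c in node.iter() if c is not node],
            })
        for child in node:
            walk(child, here)

    walk(root, "")
    return chunks


for chunk in semantic_chunks(SAMPLE):
    print(chunk["path"], "|", chunk["text"], "|", chunk["children"])
```

Each chunk carries its location in the hierarchy and the names and attributes of the elements inside it, so a retrieval step can match on markup as well as on words.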

Stage 1 (2025): The Graph RAG Challenge

 

 

1. Neo4j Knowledge Graph RAG: Attempted (with XSLT => JSON) to map XML hierarchies into a property graph of nodes and relationships for search and retrieval.

  • "Code Bloat" Problem! We discovered that XML dimensional relationships (attributes, namespaces, and deep nesting) expanded exponentially when converted to nodes/edges, creating a database that was less efficient than the source files.

  • Finalizing the graph is a problem: it remains incomplete and is difficult to update.
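
The "code bloat" effect can be illustrated with a naive XML-to-property-graph mapping. The sketch below is an assumption about what such a mapping might look like (one node per element, attribute, and text run, plus one edge for each relationship); it is not the project's XSLT => JSON conversion, and the labels `HAS_CHILD`, `HAS_ATTRIBUTE`, and `HAS_TEXT` are hypothetical.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical one-line XML fragment, invented for this illustration.
SAMPLE = '<div type="letter" n="1"><p>Dear <persName ref="#mary">Mary</persName></p></div>'


def to_property_graph(xml_text):
    """Naively map an XML tree to lists of node and edge records."""
    root = ET.fromstring(xml_text)
    nodes, edges = [], []

    def add_node(label, **props):
        nodes.append({"id": len(nodes), "label": label, **props})
        return nodes[-1]["id"]

    def walk(elem, parent_id=None):
        elem_id = add_node("Element", name=elem.tag)
        if parent_id is not None:
            edges.append({"from": parent_id, "to": elem_id, "type": "HAS_CHILD"})
        # Every attribute becomes its own node plus an edge.
        for key, value in elem.attrib.items():
            attr_id = add_node("Attribute", name=key, value=value)
            edges.append({"from": elem_id, "to": attr_id, "type": "HAS_ATTRIBUTE"})
        # Leading text becomes a node too (tail text is ignored in this sketch).
        if elem.text and elem.text.strip():
            text_id = add_node("Text", value=elem.text.strip())
            edges.append({"from": elem_id, "to": text_id, "type": "HAS_TEXT"})
        for child in elem:
            walk(child, elem_id)

    walk(root)
    return nodes, edges


nodes, edges = to_property_graph(SAMPLE)
graph_json = json.dumps({"nodes": nodes, "edges": edges})
print(len(nodes), "nodes,", len(edges), "edges")
print("source characters:", len(SAMPLE), "| graph JSON characters:", len(graph_json))
```

Even for a three-element fragment, the serialized graph is several times longer than the source, which is the pattern described above at small scale.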

2. and 3. BGE-M3 & FAISS: Used for translation (into vector embeddings), indexing, and filtering of information from the RAG. Together they work rapidly and efficiently to filter the graph relationship data.

4. Qwen2-7B Language Model

  • The "brain" that synthesizes all the data retrieved from the "RAG" and the vector embeddings into a human-readable answer.
  • Processes the top-ranked nodes retrieved by FAISS to generate natural language responses.
  • Supports 29+ languages!
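
The retrieve-then-generate flow in items 2 through 4 can be sketched in miniature. In this illustration a toy bag-of-words vector stands in for BGE-M3 embeddings and a brute-force cosine ranking stands in for a FAISS index; neither library is called, and the passages, function names, and prompt layout are all assumptions made for the example.

```python
import math
import re
from collections import Counter

# Hypothetical passages standing in for retrieved graph nodes.
PASSAGES = [
    "persName marks a personal name",
    "placeName marks a place name",
    "div marks a text division",
]


def embed(text):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def top_k(query, passages, k=2):
    """Brute-force ranking (stand-in for a vector index lookup)."""
    q = embed(query)
    return sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)[:k]


def build_prompt(question, passages):
    """Assemble the text that would be handed to the language model."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {question}"


question = "Which element marks a personal name?"
print(build_prompt(question, top_k(question, PASSAGES)))
```

The language model only sees the assembled prompt, so the quality of its answer depends on what the ranking step places in the context.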

Stage 2 (2026): Agentic SLM + MCP uses XML stack tools

  • Model Context Protocol (MCP): Replaces the static Graph DB with a new dynamic system in which the SLM applies and adapts MCP code scripts directly to raw XML files.

  • Tool-Augmented Agency: Instead of "searching" a database, the model acts as an agent capable of executing XML stack tools:

    • XPath/XQuery: For investigating and reporting on XML code patterns + inconsistencies

    • XSLT: For adding new markup / transforming the original XML.

  • Schema-Aware Reasoning: We provide the model with "starter scripts" and schema code (which describes the rules and structure of an XML project), allowing the SLM to navigate XML data with its original logic rather than through a bloated translation.

  • More Efficient? Eliminates the "translation tax" of converting XML to Graph, keeping the data footprint small while increasing query precision.
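
The tool-augmented approach above can be sketched as a small dispatcher that runs a named tool against raw XML. This is a hedged illustration only: it does not use an MCP SDK, the tool names and request shape are hypothetical, and Python's `xml.etree.ElementTree` (which supports only a limited XPath subset) stands in for a full XPath / XQuery processor.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical TEI fragment with one inconsistency: a persName without @ref.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <p><persName ref="#mary">Mary</persName> met <persName>Jane</persName>.</p>
  </body></text>
</TEI>"""


def xpath_tool(xml_text, path):
    """Return the text of every element matched by a (limited) XPath."""
    root = ET.fromstring(xml_text)
    return ["".join(e.itertext()) for e in root.findall(path, NS)]


def missing_attribute_tool(xml_text, element, attribute):
    """Report elements of a given name that lack a given attribute."""
    root = ET.fromstring(xml_text)
    return ["".join(e.itertext())
            for e in root.iterfind(f".//tei:{element}", NS)
            if attribute not in e.attrib]


# Hypothetical registry: the model would choose a tool name and arguments.
TOOLS = {"xpath": xpath_tool, "missing_attribute": missing_attribute_tool}


def dispatch(request, xml_text):
    """Run the tool named in a model-issued request against raw XML."""
    tool = TOOLS.get(request["tool"])
    if tool is None:
        return {"error": f"unknown tool: {request['tool']}"}
    return {"result": tool(xml_text, **request["args"])}


print(dispatch({"tool": "xpath", "args": {"path": ".//tei:persName[@ref]"}}, SAMPLE))
print(dispatch({"tool": "missing_attribute",
                "args": {"element": "persName", "attribute": "ref"}}, SAMPLE))
```

The XML is queried in place, so nothing has to be converted or kept in sync with a separate database; the model's job is reduced to choosing a tool and arguments and then explaining the result.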

DigitAI-FacultyShowcase-Poster

By Elisa Beshero-Bondar

DigitAI for Localized TEI/XML Assistance: A Poster
