DigitAI for Localized TEI / XML Assistance

An experiment with Small Language Models (SLMs), Retrieval Augmented Generation (RAG), and Model Context Protocol (MCP)

DigitAI Student Developers: Alexander C. Fisher, Hadleigh Jae Bills, Michael Simons

Faculty Mentor & Co-developer: Dr. Elisa Beshero-Bondar, Prof. of Digital Humanities

What DigitAI will be:

DigitAI will have broad applications. It will:

Inspect eXtensible Markup Language (XML) and schemas (coded rules governing data structures).
Identify inconsistencies and recommend improvements in markup.
(With agency) Apply transformation scripts to supply new encoding and help complete unfinished projects.

This project was made possible, in part, by a seed grant from the Penn State Office of the Vice President for Commonwealth Campuses, and by Penn State funding for Undergraduate Research.

We are... Penn State funded!

newtFire {dh} @

Digital Media, Arts, and Technology ("DIGIT") @

Customizable desktop AI system trained to apply the Guidelines of the Text Encoding Initiative (TEI)

TEI: used by students, scholars, and editors in systematic encoding of structure and data in historic / cultural heritage documents in arts/humanities
Notoriously complex / challenging to learn
Usually customized for projects, which can be intricate and "bespoke".

Hypotheses:

If a Small Language Model (SLM) were given the TEI Guidelines and XML documents as marked semantic units (XML elements and attribute nodes) rather than as arbitrarily "chunked" text strings . . .
If that SLM could access and apply XML processing tools like XPath, XSLT, and XQuery . . .
We can develop a small RAG/MCP system that runs on a laptop computer, is highly customizable, and does not require expensive access to Large Language Models.
We can learn how to make the powerful, approximating statistical machinery that language models use for Natural Language Processing address and consult a highly structured data archive.
We can teach an SLM to "look stuff up" with XML tooling.
We can apply the strengths of language models to "speak to" the findings and speculate about how to help, answer a question, or complete a task set by a human prompter.
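The first hypothesis contrasts two ways of handing XML to a model. A minimal Python sketch of that contrast follows, using only the standard library. The sample document, the function names, and the record layout are assumptions made for illustration; they are not DigitAI's code or data.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI-like sample, not taken from any DigitAI project file.
SAMPLE = """<text><body>
<p n="1">Call me <persName ref="#ishmael">Ishmael</persName>.</p>
<p n="2">Some years ago, I went to <placeName>sea</placeName>.</p>
</body></text>"""


def fixed_size_chunks(raw, size=40):
    """Arbitrary chunking: slice the serialized XML every `size` characters."""
    return [raw[i:i + size] for i in range(0, len(raw), size)]


def element_chunks(raw, tag="p"):
    """Semantic chunking: one record per element, keeping its attributes."""
    root = ET.fromstring(raw)
    return [
        {
            "tag": el.tag,
            "attrs": dict(el.attrib),
            # itertext() gathers the text of the element and its descendants
            "text": "".join(el.itertext()).strip(),
        }
        for el in root.iter(tag)
    ]


if __name__ == "__main__":
    for chunk in fixed_size_chunks(SAMPLE):
        print(repr(chunk))   # slices ignore tag and element boundaries
    for record in element_chunks(SAMPLE):
        print(record)        # each record is a complete <p> with its @n
```

The fixed-size slices can cut through tags and attribute values, while the element records keep each paragraph and its attributes together as one retrievable unit.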

Stage 1 (2025): The Graph RAG Challenge

1. Neo4j Knowledge Graph RAG: Attempted (with XSLT => JSON) to map XML hierarchies into a property graph of nodes and relationships for search and retrieval.

"Code Bloat" Problem! We discovered that XML dimensional relationships (attributes, namespaces, and deep nesting) expanded exponentially when converted to nodes/edges, creating a database that was less efficient than the source files.
Finalizing the graph is a problem: incomplete, difficult to update.

2. and 3. BGE-M3 (embedding model) & FAISS (vector similarity search): Used to translate retrieved information into vector embeddings, then index and filter it. Works rapidly and efficiently to filter the graph relationship data.

4. Qwen2-7B Language Model

The "brain" that synthesizes all the data retrieved from the "RAG" and the vector embeddings into a human-readable answer.
Processes the top-ranked nodes retrieved by FAISS to generate natural language responses.
Supports 29+ languages!
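To make the "Code Bloat" problem from step 1 concrete, here is a minimal Python sketch that flattens a small XML fragment into node and edge lists and counts them. It is one possible mapping, written for illustration only: it is not the project's XSLT => JSON pipeline or its Neo4j schema, and the sample fragment, labels, and relationship names are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment, chosen only to show how the counts grow.
SAMPLE = (
    '<div type="chapter" n="1"><head>Loomings</head>'
    '<p>Call me <persName ref="#ishmael">Ishmael</persName>.</p></div>'
)


def xml_to_graph(raw):
    """Flatten an XML tree into (nodes, edges) lists.

    Every element and every attribute becomes a node; parent/child and
    element/attribute links become edges. Text nodes are left out here,
    so a fuller mapping would be larger still.
    """
    nodes, edges = [], []

    def visit(el, parent_id=None):
        el_id = len(nodes)
        nodes.append({"id": el_id, "label": "Element", "name": el.tag})
        if parent_id is not None:
            edges.append((parent_id, "HAS_CHILD", el_id))
        for name, value in el.attrib.items():
            attr_id = len(nodes)
            nodes.append(
                {"id": attr_id, "label": "Attribute", "name": name, "value": value}
            )
            edges.append((el_id, "HAS_ATTRIBUTE", attr_id))
        for child in el:
            visit(child, el_id)

    visit(ET.fromstring(raw))
    return nodes, edges


if __name__ == "__main__":
    nodes, edges = xml_to_graph(SAMPLE)
    print(len(SAMPLE), "characters of source XML")
    print(len(nodes), "nodes and", len(edges), "edges in the graph")
```

Even in this tiny case, one short line of XML becomes seven node records and six edge records, each carrying its own identifiers and labels.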

Stage 2 (2026): Agentic SLM + MCP uses XML stack tools

Model Context Protocol (MCP): Replaces the static Graph DB with a new dynamic system in which the SLM applies and adapts MCP code scripts directly to raw XML files.
Tool-Augmented Agency: Instead of "searching" a database, the model acts as an agent capable of executing XML stack tools:
- XPath/XQuery: For investigating and reporting on XML code patterns + inconsistencies
- XSLT: For adding new markup / transforming the original XML.
Schema-Aware Reasoning: We provide the model with "starter scripts" and schema code (which describes the rules and structure of an XML project). This allows the SLM to navigate XML data with its original logic rather than a bloated translation.
More Efficient? Eliminates the "translation tax" of converting XML to Graph, keeping the data footprint small while increasing query precision.
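The tool-augmented approach in Stage 2 can be sketched in Python as a small tool registry that a model's structured request is dispatched to. This is a hedged illustration, not the project's MCP server: the dispatcher, the `xpath_report` tool name, the call format, and the sample document are all assumptions, and the standard library's ElementTree supports only a subset of XPath (a real setup would likely call a full XPath/XQuery/XSLT processor).

```python
import xml.etree.ElementTree as ET

# The TEI namespace URI; the "tei" prefix is our own choice for queries.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical TEI sample, not taken from any DigitAI project file.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<p><persName ref="#ishmael">Ishmael</persName> met <persName>Queequeg</persName>.</p>
</body></text></TEI>"""


def xpath_report(xml_text, path):
    """Tool: run an XPath-subset query and describe each matching element."""
    root = ET.fromstring(xml_text)
    return [
        {
            "tag": el.tag.split("}")[-1],      # drop the {namespace} prefix
            "attrs": dict(el.attrib),
            "text": "".join(el.itertext()),
        }
        for el in root.findall(path, TEI_NS)
    ]


# Registry of tools the model may request by name (hypothetical).
TOOLS = {"xpath_report": xpath_report}


def run_tool(call, xml_text):
    """Dispatch a model-issued call shaped like {"tool": name, "args": {...}}."""
    tool = TOOLS[call["tool"]]
    return tool(xml_text, **call["args"])


if __name__ == "__main__":
    call = {"tool": "xpath_report", "args": {"path": ".//tei:persName"}}
    matches = run_tool(call, SAMPLE)
    # Inconsistency check: which persName elements lack a @ref attribute?
    missing_ref = [m["text"] for m in matches if "ref" not in m["attrs"]]
    print(missing_ref)
```

The query runs against the raw XML file itself, so no intermediate graph has to be built or kept in sync with the source.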

DigitAI-FacultyShowcase-Poster

By Elisa Beshero-Bondar

DigitAI for Localized TEI/XML Assistance: A Poster

Elisa Beshero-Bondar

Professor of Digital Humanities and Chair of the Digital Media, Arts, and Technology Program at Penn State Erie, The Behrend College.
