Internal workshop on infrastructure and research challenges in AI for digital humanities

AI4DH team members met for an internal workshop to familiarise themselves with each other’s research topics and explore opportunities for productive collaboration. The workshop also aimed to present ongoing research to new team members (PhD students and a postdoctoral researcher) and to ensure that everyone is up to date with the current state of our projects and future directions.

Antoine Doucet, ERA Chair holder, presented his ongoing research on natural language processing tools for historical newspaper analysis and the comparative study of historical newspaper collections. The objectives of this research are: (1) to develop and share advanced tools and datasets for pre-assessing the quality and usability of digitised historical documents, and (2) to adapt language-independent tools for information extraction from historical sources. Future plans include the systematic re-evaluation of optical character recognition (OCR) quality, the proposal of new evaluation metrics, and the comparison or mutual enrichment of post-OCR and post-automatic speech recognition (ASR) outputs.
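
To give a concrete flavour of the kind of quality pre-assessment involved, the short sketch below computes a character error rate (CER) between an OCR transcription and a ground-truth reference using edit distance. It is a minimal example of one standard metric, not the project’s own evaluation tooling, and the sample strings are invented.

    # Minimal sketch: character error rate (CER) as one standard measure of OCR quality.
    # Illustrative only; the project's actual metrics and tools may differ.

    def levenshtein(a: str, b: str) -> int:
        """Edit distance between two strings (insertions, deletions, substitutions)."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                curr.append(min(prev[j] + 1,                # deletion
                                curr[j - 1] + 1,            # insertion
                                prev[j - 1] + (ca != cb)))  # substitution
            prev = curr
        return prev[-1]

    def cer(ocr_output: str, reference: str) -> float:
        """Character error rate: edit distance normalised by reference length."""
        return levenshtein(ocr_output, reference) / max(len(reference), 1)

    # Hypothetical example strings for illustration.
    print(cer("Tlie qnick brown fox", "The quick brown fox"))  # roughly 0.16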

Marko Robnik-Šikonja, together with Matej Klemen, presented their work on injecting knowledge into large language models (LLMs). The objectives of their research are: (1) to enhance the semantic awareness of LLMs, and (2) to evaluate the impact of injected linguistic knowledge on LLM performance in realistic evaluation tasks. Their current work focuses on setting up agent-based, resource-grounded multilingual grammatical analysis. A research paper has been submitted, and a preprint is available at https://www.arxiv.org/abs/2512.00214.

Polona Tratnik, together with Jan Babnik, presented their research on the development of AI infrastructure for data-driven folkloristics. Their objectives are: (1) to develop advanced AI infrastructure for studying the socio-cultural dimensions of folktales, and (2) to provide nuanced explanations of the social functions of folktales in multilingual contexts. A research paper has been submitted, and a preprint is available at https://arxiv.org/abs/2510.18561.

Marko Robnik-Šikonja, together with Aleš Žagar, presented their research on explainability in generative large language models. The objectives are: (1) to enhance transparency and robustness in generative LLMs, and (2) to adapt general explainability methodologies to specific challenging domains. Their approach consists of three phases: first, analysing domain-specific knowledge such as ontologies, knowledge graphs and annotation guidelines; second, semi-automatically generating explainability datasets using LLMs; and finally, training LLMs on both the original task and the generated datasets using fine-tuning and retrieval-augmented methods.
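
As a rough sketch of what the second phase might look like in practice, the snippet below turns domain knowledge-graph triples into prompts asking an LLM to produce an explanation, pairing each generated explanation with its source fact for later human review. The triple format, prompt wording and the generate stub are hypothetical placeholders, not the team’s actual pipeline.

    # Hypothetical sketch of phase 2: turning knowledge-graph triples into an
    # explainability dataset with LLM-generated explanations. Names and prompt
    # wording are illustrative placeholders, not the project's actual pipeline.

    def generate(prompt: str) -> str:
        """Stub standing in for a call to any large language model."""
        raise NotImplementedError("plug in an LLM client here")

    def build_explainability_examples(triples):
        """Each triple (subject, relation, object) becomes one example pairing a
        question with an explanation grounded in the source fact."""
        examples = []
        for subj, rel, obj in triples:
            prompt = (f"Fact: {subj} {rel} {obj}.\n"
                      f"Explain in one or two sentences why this holds, "
                      f"referring only to the fact above.")
            examples.append({
                "source_fact": (subj, rel, obj),
                "question": f"Why does '{subj} {rel} {obj}' hold?",
                "explanation": generate(prompt),  # to be reviewed by a human annotator
            })
        return examples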

Igor Vobič presented his ongoing research, conducted together with Aleš Žagar and Boris Mance, on tools for social science analysis of news media in less-resourced languages. The objective is to develop automated tools for content analysis of news media in low-resource languages, including Slovene and other South Slavic languages. News media produce content at a scale that exceeds human analytical capacity. AI therefore enables large-scale content analysis, systematic cross-media comparison, longitudinal and historical analysis, as well as rapid exploratory analysis and pattern detection. The team examined news diversity in the Slovenian media environment using AI methods, investigating how structural conditions in the Slovenian journalism system shape patterns of pluralisation and homogenisation, and how ownership, organisational and ideological factors influence visibility and symbolic power across media outlets. A paper based on this research was recently published and is available at https://doi.org/10.22572/mi.30.2.1.
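
To illustrate how such patterns of pluralisation and homogenisation can be quantified, the sketch below computes a normalised entropy over an outlet’s topic distribution as one simple diversity measure. It is an illustrative example, not the method used in the published study, and the topic counts are invented.

    # Illustrative sketch only: normalised Shannon entropy of an outlet's topic
    # distribution as one simple diversity measure. The published study's actual
    # methodology may differ; the counts below are invented.
    from math import log2

    def topic_diversity(topic_counts: dict[str, int]) -> float:
        """Normalised entropy in [0, 1]: 1 = topics evenly covered, 0 = single topic."""
        total = sum(topic_counts.values())
        probs = [c / total for c in topic_counts.values() if c > 0]
        if len(probs) <= 1:
            return 0.0
        entropy = -sum(p * log2(p) for p in probs)
        return entropy / log2(len(probs))

    # Hypothetical counts of articles per topic for one outlet.
    print(topic_diversity({"politics": 120, "economy": 80, "culture": 40, "sport": 60}))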

Simon Krek, together with Slavko Žitnik and Timotej Knez, presented their work on producing semantic data and evaluation datasets from lexical resources. The objectives are: (1) to develop machine-readable knowledge representations to better understand how meaning is constructed, and (2) to produce datasets for evaluating meaning in human–computer interaction contexts. Their completed work includes compiling lexicographic data from multiple sources, constructing a lexical pre-training corpus, defining a multidimensional evaluation framework, and training and evaluating the GaMS model on the resulting dataset.
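
As a purely illustrative sketch of what a machine-readable lexical representation and a derived evaluation item might look like, the snippet below defines a minimal entry structure and turns one sense into a definition-matching item. The field names and structure are hypothetical and do not reflect the project’s actual data model or the GaMS evaluation setup.

    # Hypothetical sketch of a machine-readable lexical entry and an evaluation
    # item derived from it. Field names are illustrative only.
    from dataclasses import dataclass, field

    @dataclass
    class Sense:
        definition: str
        examples: list[str] = field(default_factory=list)

    @dataclass
    class LexicalEntry:
        lemma: str
        pos: str                                   # part of speech
        senses: list[Sense] = field(default_factory=list)

    def to_eval_item(entry: LexicalEntry, sense_index: int) -> dict:
        """Turn one sense into a definition-matching evaluation item, using the
        entry's other senses as distractors."""
        sense = entry.senses[sense_index]
        return {
            "word": entry.lemma,
            "context": sense.examples[0] if sense.examples else "",
            "gold_definition": sense.definition,
            "distractors": [s.definition for i, s in enumerate(entry.senses) if i != sense_index],
        }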

Newly joined AI4DH members also presented their work. Karim El Haff has joined the team as a postdoctoral researcher and will focus on data-driven research and analysis of regional folklore using natural language processing methods. Domen Vreš has joined as a PhD student and will research the adaptation of large language models; he is currently the lead developer of an open-source large language model for Slovene. Amélie Quilichini, also a PhD student, will investigate cross-lingual comparative analysis across large document collections, with a particular focus on the comparative study of Norwegian languages in 20th-century newspapers to gain insights into their historical development. Rebeka Kropivšek Leskovar, a PhD student, will work on computational methods for narrative analysis, with research interests in human–technology interaction, the co-evolution of technology and culture, user agency, and critical engagement.