Automated Content Analysis for Digital Humanities and Corpus Linguists (Workshop)
Facilitated by Gerold Schneider
Textual data can be analysed increasingly well with automated techniques. Text is no longer unstructured data: topics, trends, sentiments, and answers to questions can be extracted automatically.
This workshop introduces a range of distant reading techniques, including document classification, keyword detection, cognitive associations with word embeddings, topic modelling, new visualisation methods, and transformer-based approaches, which we apply to large-scale political, historical and literary corpora.
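By way of illustration only (the workshop's own materials draw on the R scripts in the reference below), a topic model can be fitted in a few lines. The following is a minimal Python sketch using scikit-learn; the toy documents, the number of topics, and the choice of library are assumptions made for the example, not course code.

```python
# Minimal topic-modelling sketch with scikit-learn (illustrative only, not the workshop's R scripts).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy documents standing in for a real corpus (assumption for the example).
documents = [
    "the parliament debated the new health policy",
    "the hospital reported rising patient numbers",
    "the novel describes life in Victorian London",
    "election campaigns dominated the newspapers",
    "Dickens portrays poverty and the workhouse",
]

# Build a document-term matrix of word counts.
vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(documents)

# Fit a small LDA model; n_components is the number of topics requested.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(dtm)

# Print the highest-weighted words for each topic.
terms = vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    top_terms = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"Topic {topic_idx}: {', '.join(top_terms)}")
```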
Transformer-based approaches to language, often called large language models (LLMs), include GPT-3 to GPT-5; smaller open models can also be run on your own computer, for instance with Ollama, and we will do this together.
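As a hedged sketch of what running a local model can look like: the snippet below queries a locally running Ollama server over its HTTP API. It assumes Ollama is serving on its default port (11434) and that a model such as "llama3" has already been pulled; the model name and the prompt are placeholders, not part of the workshop materials.

```python
# Minimal sketch: query a locally running Ollama model over its HTTP API.
# Assumes Ollama is running on the default port and "llama3" has been pulled (placeholders).
import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",
        "prompt": "Summarise the main themes of Oliver Twist in two sentences.",
        "stream": False,  # ask for one complete JSON response instead of a token stream
    },
    timeout=120,
)
response.raise_for_status()
print(response.json()["response"])  # the model's generated text
```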
Example materials will include media and political texts, medical texts, historical texts, and works by Charles Dickens and other authors.
The session will combine conceptual input with hands-on exercises. Participants are warmly encouraged to bring their own datasets and research questions, which may be incorporated into the session.
Resources: Please bring your own computer.
Requirements: Basic knowledge of R or Python is helpful but not required.
References
- Schneider, Gerold (2024). Text Analytics for Corpus Linguistics and Digital Humanities: Simple R Scripts and Tools. Bloomsbury. https://www.bloomsbury.com/uk/text-analytics-for-corpus-linguistics-and-digital-humanities-9781350370821/