Is that a dog or just a far-away horse?

Exploring visual collections using AI

Father Ted and Dougal discuss the finer points of scale and relative size.

AI4Illuminations

In 2023 the Vatican showcased their AI4Illuminations project at the annual IIIF Conference. The idea of the project was to apply machine learning tools and techniques to their illustrated manuscripts and automatically catalogue objects of interest (such as illustrations of different animals).

As part of our exploration of emerging AI technologies we wondered if we could try applying the same techniques to our collections – how difficult would it be to get started? How successful would these techniques be when applied to our diverse collections?

Computer Vision and YOLO

Computer Vision is the field of extracting information from images. The models which AI4Illuminations explored belong to a family of Computer Vision techniques called Object Detection, which tries to recognise the objects in a scene and determine where they are in the image. This technology is commonly found in applications such as security cameras (to distinguish between a person and a cat at your front door), smartphones (where AI models identify scenes and subjects when you take a photo), and self-driving cars (to navigate the environment).

Many object detection models are available, and some of them are widely used and heavily optimised for speed (a model that has to run on every frame of video from a security camera needs to be very fast). The model that AI4Illuminations used was YOLOv5 (“You Only Look Once”). For our experiments we used a newer model from the same family, YOLOv8, which was released in 2023.

Computer Vision models such as YOLO can identify objects and indicate their location within a scene. The number beside each object represents the model’s confidence level in its prediction, with a maximum value of 1.0.
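
As a rough illustration of how detections and their confidence values might be obtained and filtered, here is a minimal sketch. It assumes the open-source `ultralytics` Python package and its pre-trained `yolov8n.pt` weights; the helper names, the sample values and the 0.5 threshold are hypothetical and are not taken from our experiments.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str         # class name, e.g. "horse"
    confidence: float  # model confidence, between 0.0 and 1.0
    box: tuple         # (x1, y1, x2, y2) in pixels


def detect(image_path, weights="yolov8n.pt"):
    """Run a pre-trained YOLOv8 model on one image.

    Assumes `pip install ultralytics`; the import is lazy so the rest of
    this sketch can be used without the package installed.
    """
    from ultralytics import YOLO

    result = YOLO(weights)(image_path)[0]  # one Results object per input image
    return [
        Detection(result.names[int(cls)], float(conf), tuple(xyxy.tolist()))
        for cls, conf, xyxy in zip(
            result.boxes.cls, result.boxes.conf, result.boxes.xyxy
        )
    ]


def keep_confident(detections, threshold=0.5):
    """Drop detections whose confidence falls below the threshold."""
    return [d for d in detections if d.confidence >= threshold]


# Hand-made sample detections (hypothetical values) to show the filtering step.
sample = [
    Detection("horse", 0.91, (10, 20, 200, 180)),
    Detection("dog", 0.32, (220, 90, 300, 170)),
]
confident = keep_confident(sample)
```

Raising the threshold reduces false positives at the cost of missing some genuine objects, so the right value depends on the collection.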

These models are trained on huge collections of labelled images such as COCO, so they are likely to be biased towards modern photos and realistic-looking objects rather than cartoons or illustrations. It’s possible to re-train the models yourself to make them perform better on your specific content, although doing so requires a lot of well-labelled data.

Another interesting aspect of these models is that although they are small enough to run on most computers they can also be accelerated by AI hardware. This is a hot topic in the PC world as manufacturers release “AI PCs” with integrated NPUs (Neural Processing Units) or powerful GPUs (Graphics Processing Units). We were able to get YOLOv8 running on a range of hardware including CPUs, GPUs and NPUs.

Working with collections

Manchester Digital Collections is a digital platform which contains thousands of images of manuscripts, photographs and other types of material, including some beautiful illustrations. The images on MDC are available using a standard called IIIF, which allows images to be downloaded and re-used.
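
To give a sense of what requesting an image over IIIF looks like, the sketch below builds a URL following the IIIF Image API pattern of `{region}/{size}/{rotation}/{quality}.{format}`. The server address and image identifier are placeholders, not real MDC values, and the function name is hypothetical.

```python
from urllib.parse import quote


def iiif_image_url(base, identifier, region="full", size="max",
                   rotation=0, quality="default", fmt="jpg"):
    """Build an IIIF Image API URL.

    region: "full" or "x,y,w,h" in pixels of the full-size image
    size:   "max", or e.g. "!640,640" for the best fit inside 640 by 640
    """
    return (
        f"{base.rstrip('/')}/{quote(identifier, safe='')}"
        f"/{region}/{size}/{rotation}/{quality}.{fmt}"
    )


# Placeholder server and identifier, for illustration only.
BASE = "https://iiif.example.org/image"

# The whole image, scaled down to fit the model's 640 by 640 input.
whole = iiif_image_url(BASE, "MS-EXAMPLE-001", size="!640,640")

# A 640 by 640 crop taken from the full-resolution image.
crop = iiif_image_url(BASE, "MS-EXAMPLE-001", region="1000,500,640,640")
```

Because the region and size are part of the URL, the image server does the cropping and scaling, so full-resolution files do not need to be downloaded first.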

We selected a number of different collections to experiment with:

Japanese Maps features some realistic line drawings of boats.
Text and Image is a collection of illustrated manuscripts and printed books.
Persian Manuscripts is a large collection with lavishly illustrated volumes including many examples of animals.
Magic, Monsters and Macabre is a themed collection which includes illustrations of fantastic animals and black and white photographs.

Scale and resolution

YOLOv8 works with images which are just 640 by 640 pixels, around 60 times smaller than the high-resolution images created by our photographers. This limitation stems from both the design of YOLO as a fast real-time model and the increasing computational demands of larger blocks of image data. Images are resized before being processed by the model, which means that objects which occupy only a small part of the original image are reduced to very few pixels and are unlikely to be detected.
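
The arithmetic behind this can be sketched in a few lines. The 5000 by 5000 pixel image below is a hypothetical size, chosen only because it is consistent with the "around 60 times" figure; the sketch also assumes the image is scaled so that its longer side fits the 640 pixel input while keeping its proportions.

```python
def resize_scale(width, height, target=640):
    """Scale factor that fits the longer side of an image into `target` pixels."""
    return target / max(width, height)


def pixel_ratio(width, height, target=640):
    """How many times more pixels the original has than a target-by-target input."""
    return (width * height) / (target * target)


def size_after_resize(object_px, width, height, target=640):
    """Size in pixels of an object once the whole image has been resized."""
    return object_px * resize_scale(width, height, target)


# Hypothetical 5000 x 5000 pixel photograph.
print(round(pixel_ratio(5000, 5000)))                 # 61
print(round(size_after_resize(100, 5000, 5000), 1))   # 12.8
```

In this example a figure that is 100 pixels wide in the original is left with about 13 pixels after resizing, which gives the model very little detail to work with.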

In this full-size image only some of the larger objects have been detected.

We explored some approaches to mitigating this problem, such as SAHI, although we also saw some good results by simply cropping the high-resolution images at different zoom levels.

Slicing Aided Hyper Inference (SAHI) is a technique which automatically divides an image into smaller slices for inference.

A cropped image allows smaller features to be detected, such as these people in the top left of the previous image.
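
The slicing idea described above can be sketched as follows. This is a simplified illustration of cutting an image into overlapping tiles and mapping detections back to full-image coordinates, not the SAHI library's own implementation; the function names, the 640 pixel tile size and the 20% overlap are assumptions.

```python
def slice_boxes(width, height, tile=640, overlap=0.2):
    """Return (x1, y1, x2, y2) crop boxes covering the image with overlapping tiles."""
    step = max(1, int(tile * (1 - overlap)))  # distance between tile origins

    def starts(length):
        if length <= tile:
            return [0]  # a single tile covers this dimension
        positions = list(range(0, length - tile, step))
        positions.append(length - tile)  # final tile sits flush with the edge
        return positions

    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in starts(height)
        for x in starts(width)
    ]


def to_full_image(box, tile_box):
    """Shift a box detected inside a tile back into full-image coordinates."""
    offset_x, offset_y = tile_box[0], tile_box[1]
    x1, y1, x2, y2 = box
    return (x1 + offset_x, y1 + offset_y, x2 + offset_x, y2 + offset_y)


tiles = slice_boxes(1000, 1000)
```

Each tile would be passed to the model separately. Objects that fall in the overlap can be detected twice, so a complete pipeline also needs a step to merge duplicate boxes, which this sketch leaves out.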

False positives and bias

Without being re-trained the YOLOv8 model can only detect 80 types of object (these are taken from the COCO training data). These classes of object include a lot of modern ones which aren’t suitable for our collections.

The COCO (Common Objects in Context) dataset includes only 80 classes of object.

Additionally, the model can struggle when trying to distinguish between object classes which are visually similar such as horses and dogs (especially when we are looking at illustrations rather than photographs). In cases like this a human might be able to infer the type of animal based on the relative size of the objects but the model could mis-identify them.

The training data used to create a model can also introduce bias if it contains more (or better) examples of one object over another.

Horses and dogs can be tricky to distinguish because their features are so similar.
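
One simple way to cope with the modern classes mentioned earlier in this section is to keep only the labels which make sense for a given collection. The sketch below is a hypothetical post-processing step; the allow-list and the sample detections are illustrative choices, although every label used is a genuine COCO class name.

```python
from collections import Counter

# COCO classes that could plausibly appear in historical illustrations
# (an illustrative selection, not a list we used in our experiments).
RELEVANT_CLASSES = {
    "person", "horse", "dog", "cat", "bird",
    "sheep", "cow", "elephant", "bear", "boat",
}


def filter_relevant(detections, allowed=RELEVANT_CLASSES):
    """Keep detections whose label is on the allow-list."""
    return [d for d in detections if d["label"] in allowed]


def count_labels(detections):
    """Count how often each label appears, e.g. to build a simple index."""
    return Counter(d["label"] for d in detections)


# Hypothetical detections from an illustrated manuscript page.
page = [
    {"label": "horse", "confidence": 0.88},
    {"label": "cell phone", "confidence": 0.41},
    {"label": "dog", "confidence": 0.57},
    {"label": "horse", "confidence": 0.63},
    {"label": "baseball bat", "confidence": 0.35},
]
kept = filter_relevant(page)
summary = count_labels(kept)
```

An allow-list removes obviously anachronistic labels, but it cannot correct a horse that has been labelled as a dog, since both classes are plausible for the collection.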

Context in images

While models such as YOLOv8 are good at detecting individual objects they aren’t able to understand the broader context of a scene. There may be clues about how objects relate to each other (or to people in a scene) which would require a higher level of understanding of the world to interpret.

YOLOv8 doesn’t include classes for musical instruments, but a human would be able to infer from the rest of the scene that this object is unlikely to be a baseball bat.

The future of AI in image collections

AI4Illuminations showed how in addition to running pre-trained models on image collections it was also possible to re-train them to identify new types of objects (such as mythical beasts). The project also explored building interfaces to view or query the detected objects using IIIF annotations, which is something we could explore with our collections too.
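
To show roughly what a detected object looks like as an annotation, the sketch below converts a bounding box into a W3C Web Annotation of the kind used by IIIF viewers. Whether AI4Illuminations used exactly this shape is not something we can confirm; the canvas and annotation URIs are placeholders, the function name is hypothetical, and the sketch assumes the canvas has the same pixel dimensions as the image the box was measured on.

```python
import json


def detection_to_annotation(canvas_id, label, box, annotation_id):
    """Describe one detected object as a Web Annotation targeting part of a canvas."""
    x1, y1, x2, y2 = (round(v) for v in box)
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "id": annotation_id,
        "type": "Annotation",
        "motivation": "tagging",
        "body": {"type": "TextualBody", "value": label, "format": "text/plain"},
        # The xywh fragment selects a rectangle: left, top, width, height.
        "target": f"{canvas_id}#xywh={x1},{y1},{x2 - x1},{y2 - y1}",
    }


# Placeholder URIs, for illustration only.
annotation = detection_to_annotation(
    canvas_id="https://example.org/canvas/1",
    label="horse",
    box=(10, 20, 200, 180),
    annotation_id="https://example.org/annotation/1",
)
print(json.dumps(annotation, indent=2))
```

Storing detections in this form means they could be displayed as overlays in an IIIF viewer or searched by label, rather than being tied to one piece of software.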

In addition to these experiments with running pre-trained models the library is collaborating with the Oxford Visual Geometry Group on more tailored approaches to AI in digital humanities and our collections, for example applying image segmentation techniques to manuscripts.

Looking more broadly, sophisticated Multi-Modal Large Language Models are starting to become available which promise to work with images, audio and video in addition to traditional text queries. These large models can work with images in a more general way than a specialist model such as YOLO (for example, identifying many more types of objects or answering complex questions about a scene). The downside of these models is that they can hallucinate (generate false or nonsensical information) or behave in unpredictable ways, in addition to being much more expensive to use.

Multi-Modal Large Language Models such as ChatGPT 4o could make sophisticated inferences about images, although the models available today are prone to making mistakes.

Tom Higgins (Senior Software Developer)