Elizabeth Wilson Collection

XML Coding with AI, Episode 2: The Hallucinations Strike Back

This blog post follows the previous post, XML Coding with AI, Episode 1: A New Hope. After the initial excitement of using AI to automate XML cataloguing, I encountered one of the most challenging aspects of working with AI: ‘hallucinations’, or the confident creation of things that didn’t exist. In this post I will outline some of the challenges I have faced in trying to use generative AI for structured cataloguing, and share the tips and tricks I have developed for avoiding some of these problems.

The AI goes rogue

PDF catalogue arranged in a non-standard but still fairly structured way. This is easy for a human to read but Google’s PDF reader struggled with the different rows and columns.

As I mentioned in the previous episode, the first thing I tried to do with access to context-sensitive AI tools was to convert existing catalogues into formats that could be more easily read by machines. As with the JSON conversion during our workshop, this was an intermediate step between the ‘analogue’ catalogue and a fully digital, searchable XML file ready for our archives hub, but creating machine-readable data from physical media has other uses too. One time-consuming project for the map collection involved converting a 400-page PDF catalogue to an Excel file for future import into a database such as EMu, our internal collections management system. Previous attempts to do this using Google Sheets (uploading the PDF, having Google process the file and then copying the result into an Excel file) had produced a poorly formatted output that required a lot of manual fixing. This therefore seemed an ideal task for generative AI, which could ‘read’ the document and produce accurate metadata.

At first, I attempted to upload the entire 400-page PDF and requested that the contents be converted to text. This was far too much for the system to handle, although unlike with a traditional programme it was opaque about its capacity. Instead, it confidently reassured me that a large file would take a lot of time to process, and that it was working on it. I tried several times, at one point leaving my PC on overnight to give it time to process the file. Each time it would apologise and recommit itself to the task. After multiple failed attempts I googled the problem, and found that others had encountered this issue. The real answer was that it was unable to process the large file, but was unwilling or unable to explain the true problem and had therefore invented a solution (wait longer). This would be a harbinger of things to come! I realised I needed to be less trusting of the things generative AI communicated to me about itself, and the confident way in which it seemed to undertake tasks.
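The fix that eventually worked, splitting the source into small pieces before asking the AI to do anything with it, can be sketched in a few lines. This is a minimal illustration, assuming the catalogue has already been exported to plain text with form-feed characters marking page breaks; the function and its name are illustrative, not part of any particular tool.

```python
# A minimal sketch, assuming the 400-page catalogue has been exported to
# plain text with form-feed characters ("\f") marking page breaks.
# Splitting the text into small batches keeps each request well inside
# the limits that caused the silent failures described above.

def split_into_batches(text: str, pages_per_batch: int = 5) -> list[str]:
    """Split form-feed-delimited page text into batches of a few pages."""
    pages = [p for p in text.split("\f") if p.strip()]
    return [
        "\f".join(pages[i:i + pages_per_batch])
        for i in range(0, len(pages), pages_per_batch)
    ]

# Example: a 12-page document becomes three batches (5 + 5 + 2 pages).
doc = "\f".join(f"Page {n} contents" for n in range(1, 13))
batches = split_into_batches(doc)
```

Each small batch can then be pasted into the chat individually, which is exactly the piecemeal approach described below.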

Initial success breaking down the contents of the PDF into a structured list.

Attempts to transcribe shorter segments of the PDF catalogue were more successful, but established a pattern of errors that would repeat during the XML conversion process later. Although the AI initially proved capable of transcribing the PDF’s contents, problems quickly emerged. By default it attempted to shorten or summarise the data, but I was able to instruct it to transcribe the data verbatim and guide it with plain-text instructions towards the type of categorisation I wanted for this specific dataset. This meant longer responses, however, and when it reached its limit (seemingly around 10 records) it would gloss over this with a reassuring message, e.g. “// Additional entries would follow with the same fields, even if data is blank.”

To adapt to this capacity restriction, I started requesting transcriptions of the data piecemeal, such as ‘the next five pages’. This worked initially, but as the conversation continued it started to stumble on specific sections, seemingly without cause. This resulted in output that omitted entries, or even replaced existing entries in the catalogue with new ones that sounded correct but were actually false: the dreaded ‘hallucinations’ of generative AI! After going back and forth over these issues for a while, the AI began repeating errors and offering up increasingly nonsensical data. I appeared to have pushed beyond the limits of its capabilities, and having to present data in small chunks and check it thoroughly for errors meant that generative AI was no longer an effective time-saving tool.

Managing the current limitations of AI

Trying to diagnose the cause of hallucinations. The AI sounds very convincing, but will continue to repeat these errors if asked to do too much at once (and sometimes even with small datasets). It would regularly claim to be checking its output against the source file but there was no evidence of this in the details log.
Asking ChatGPT what instructions I should give it to translate the contents of 23 boxes into structured JSON. This was clearly too much in one go, so it provided data for 2 boxes before ending with “Further entries for Box 3…”.

Using what I had learned from the PDF file conversions, I then approached more ‘oven-ready’ datasets to convert into XML. Once again, there appeared to be a hidden limit to how much ChatGPT could reliably process in one interaction, even though it would confidently assert that it was executing every task correctly and in full. This was a learning experience, as it is different from a standard programme, which will usually produce the same result (say, printing a file to PDF) regardless of the size of the task, with the only difference being the time taken. With ChatGPT, I soon realised that reducing the amount of work required per instruction increased the reliability of the outputs and reduced the likelihood of omissions, hallucinations or critical errors. Accuracy improved significantly with a few adjustments to my approach, namely:

  1. Reducing quantity of data per instruction, e.g. ‘XML for these 5 items’ rather than ‘XML for this handlist’.
  2. Pasting data directly into the chat box, rather than uploading a file and expecting the AI also to ‘read’ the data from the correct place in the file, to reduce the amount of work requested at one time.
  3. Creating some persistent rules about how the AI should process the data, for example: ‘use the information verbatim, leaving out no details unless explicitly instructed to summarise’.
  4. If the AI was misunderstanding or persistently ignoring a specific aspect of an instruction, asking it how best to instruct it to achieve the task – essentially asking for help with a skill known as ‘prompt engineering’.
  5. CHECKING EVERYTHING. Although I was doing this from the beginning, over time I came to understand better the areas that were most likely to contain errors, and became better at spotting where errors had crept in.
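The steps above can be sketched as a simple loop. The ask_ai() function here is a hypothetical stand-in for whichever chat service is being used, and the record format is invented purely for illustration; the point is that the persistent rules travel with every small batch, and every reply is checked for omissions before moving on.

```python
# A sketch of the workflow described in the list above. ask_ai() is a
# hypothetical placeholder, not a real API call; here it simply echoes
# the records back so the control flow can be demonstrated end to end.

RULES = (
    "Use the information verbatim, leaving out no details "
    "unless explicitly instructed to summarise."
)

def ask_ai(prompt: str) -> str:
    """Hypothetical stand-in for a call to a generative AI service."""
    return prompt.split("RECORDS:\n", 1)[1]

def convert_in_batches(records: list[str], batch_size: int = 5) -> list[str]:
    outputs = []
    for i in range(0, len(records), batch_size):
        batch = records[i:i + batch_size]
        # Steps 1-3: a small batch, pasted directly, with persistent rules.
        prompt = (
            f"{RULES}\nConvert these records to XML.\nRECORDS:\n"
            + "\n".join(batch)
        )
        reply = ask_ai(prompt)
        # Step 5: CHECK EVERYTHING - flag any record silently dropped.
        missing = [r for r in batch if r not in reply]
        if missing:
            raise ValueError(f"AI output omitted records: {missing}")
        outputs.append(reply)
    return outputs

results = convert_in_batches([f"Map {n}" for n in range(1, 13)])
```

With 12 records and a batch size of 5, this makes three small requests rather than one large one, which is exactly the trade-off that made the outputs reliable.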

Accuracy and accountability

The issue of hallucinated data represents a serious and potentially fundamental flaw in the use of generative AI by information professionals. Our field relies heavily on providing information that is reliable and verifiable, yet all of the generative AI tools I have used so far (ChatGPT, Gemini and Copilot) seem to produce confidently expressed falsehoods as a routine part of their programming. My impression is that the AI prioritises doing a task over doing it effectively, sacrificing accuracy, completeness or anything else it needs to in order to ‘meet expectations’. To fulfil our code of conduct while using generative AI, users quite rightly have to be vigilant against these mistruths, checking outputs line by line. Paradoxically, this problem may become worse as AI becomes more advanced. It is relatively easy for a human to spot errors when they are present in 20% of the data produced, because trust is low. It could be much harder to remain vigilant against errors that are present in only 1% or less of the data, despite those errors being no less damaging to the usability and trustworthiness of the catalogue.
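One way to stay vigilant at scale is to automate part of the checking. The sketch below assumes a simple XML shape, an item element with a title child per record, invented purely for illustration; comparing the titles in the AI-generated XML against the source list surfaces both omissions and invented entries, which matters more as error rates fall and manual line-by-line vigilance becomes harder to sustain.

```python
# A minimal cross-check sketch, assuming (for illustration only) that
# each catalogue entry becomes an <item> element with a <title> child.
# It reports entries the AI dropped and entries it invented.

import xml.etree.ElementTree as ET

def cross_check(source_titles: list[str], xml_text: str) -> dict:
    root = ET.fromstring(xml_text)
    output_titles = [t.text for t in root.iter("title")]
    return {
        "omitted": sorted(set(source_titles) - set(output_titles)),
        "invented": sorted(set(output_titles) - set(source_titles)),
    }

# Invented example data: one entry omitted, one hallucinated.
source = ["Map of Lancashire", "Plan of Salford", "Chart of the Mersey"]
generated = """<catalogue>
  <item><title>Map of Lancashire</title></item>
  <item><title>Chart of the Mersey</title></item>
  <item><title>Survey of Stockport</title></item>
</catalogue>"""

report = cross_check(source, generated)
```

A check like this cannot catch subtler corruptions within an entry, so it supplements rather than replaces human review.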

Despite all this, I haven’t given up on AI. If anything, the challenges have helped me refine how I use it. In the next episode, Return of the AI, I’ll focus on the ways this technology has genuinely enhanced my work – from XML coding to fact-checking, and even improving my own writing.



Discover more from AIIA Insights: Official blog of the Directorate of AI and Ideas Adoption at The University of Manchester Library
