Part 128: Track AI hallucinations with Zabbix

Back in part 107 of this blog I went bananas and set up a local Llava AI model to recognize what is in an image. But how reliably does Llava (or at least its smallest model) actually recognize what is in the picture? Let's test it with Zabbix!

Methodology

For this test, I am

making my Mac capture a snapshot image from our CCTV once per minute
letting Llava describe the image and store the response in a text file
letting the Zabbix agent pick up the contents of that text file
making Zabbix check if the description contains "person", "bird", "dog" etc.
counting the results, aggregated in one-hour windows
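
To make the last step concrete: once every check yields a 1 or a 0, the hourly count is simply the sum of those values per hour. Zabbix does this with its own aggregation, so the following is only a rough sketch of the same arithmetic outside Zabbix. The function name hourly_counts and the "epoch_seconds flag" input format are my assumptions for illustration, not anything Zabbix produces.

```shell
#!/bin/sh
# hourly_counts (hypothetical helper): reads "epoch_seconds flag" lines on
# stdin, where flag is 1 or 0, and prints "hour_start_epoch count" for each
# one-hour window that has samples.
hourly_counts() {
    awk '{ hour = int($1 / 3600) * 3600; sum[hour] += $2 }
         END { for (h in sum) print h, sum[h] }' | sort -n
}

# Made-up samples: three in the first hour, one in the next hour.
printf '%s\n' '0 1' '60 0' '120 1' '3600 1' | hourly_counts
# prints:
# 0 2
# 3600 1
```

With one sample per minute, each hourly count can be anything from 0 to 60.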

Ready to see how consistent the results are? So am I. Let's start.

The capture loop

As this is just a temporary test, I have the following shell loop running on my Mac inside a tmux session:

while true; do curl --output frontdoor_camera.jpg "http://my.camera.address:88/cgi-bin/CGIProxy.fcgi?cmd=snapPicture2&usr=…" ; ollama run llava "Describe in one sentence what you see in this CCTV image taken outdoors, be sarcastic: ./frontdoor_camera.jpg" >frontdoorcamera.txt; sleep 60; done

We live in a very quiet area and the camera is pointing at our own yard, so the image should be roughly the same all the time, especially during nights.

With this setup, my frontdoorcamera.txt contains quirky descriptions such as

It's a beautiful day outside - not. The sky is a bit gloomy and there are some clouds present. A small house with a porch is visible in the background, along with a car parked on the road. I can tell it's summer because of the lush green trees all around!

Tracking interesting keywords

But how do you track the occurrence of keywords in Zabbix? In my case, I created a bunch of dependent items, with the text file contents as the master item.

Then, in the preprocessing of each item, I have these hazardous regular expressions that probably catch too much, but you get the idea.

If the regular expression matches, everything gets replaced with 1; if it does not match, the result is set to 0. This way I get the metrics in numeric format.
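
If you want to try the same match-or-not logic outside Zabbix, here is a minimal shell sketch of it. It is not the Zabbix preprocessing itself; the function name keyword_flag and the example patterns are made up for illustration and are not the exact expressions I use in the items.

```shell
#!/bin/sh
# keyword_flag PATTERN TEXT (hypothetical helper)
# Prints 1 if TEXT matches the extended regular expression PATTERN
# (case-insensitive), otherwise prints 0.
keyword_flag() {
    if printf '%s\n' "$2" | grep -Eiq -- "$1"; then
        echo 1
    else
        echo 0
    fi
}

description="A small house with a porch is visible in the background, along with a car parked on the road."

keyword_flag 'car' "$description"          # prints 1
keyword_flag 'person|people' "$description" # prints 0
```

This also shows why such expressions catch too much: a bare `car` would match "scary" just as happily. Adding `-w` to the grep call limits matches to whole words.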

The results

What does all this look like?

For the commentary part, here are a few comments:

Yes, for the reliability of this test, I am likely shooting myself in the foot, as the "be sarcastic" instruction makes these descriptions pretty wild. But hey, I don't like to be boring, not in this blog at least.

For the graphs: here is a graph showing the hourly counts of how many times any of the terms were mentioned. Apart from the car, which should be in the picture all the time, especially during nighttime, not too many of these things should show up. However, this is what Llava thought:

I'm sure (yes, I reviewed the pictures) that we didn't have a person around that much in the middle of the night, or the other things that get mentioned. Maybe Llava sees a ghost and I do not. Anyway, with the nighttime images, Llava clearly hallucinates a lot.

However, the situation changes soon after sunrise. With a clear CCTV image, the results are much better and closer to the actual truth.

With this kind of technique, and with more thought and effort put into it, you too can observe the quality of your AI responses in a maybe different way, just by using Zabbix.
