I recently wrote about a way to make Quepid, the relevance testing tool, work with pretty much any website. I did this by working around Quepid’s expectation of a nice tidy search API that returns JSON, instead using JavaScript to pull field data out of raw HTML. The next step is to create relevance judgements – manually scoring the returned results as good, bad or something in between – a task best done by experts in whatever field your website or service covers.
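As a quick recap of that approach, the JavaScript essentially parses the raw HTML of a results page and maps each hit into named fields – something like the sketch below. The CSS selectors and field names here are invented for illustration; a real site will need its own (the previous post has the details).

```javascript
// A rough recap of the approach from the previous post: parse the raw HTML of a
// results page and map each hit into named fields. The CSS selectors and field
// names here are invented for illustration – a real site needs its own.
function extractResults(rawHtml) {
  const doc = new DOMParser().parseFromString(rawHtml, 'text/html');
  return Array.from(doc.querySelectorAll('.search-result')).map(el => ({
    id:      el.querySelector('a')?.href,                  // use the result link as a document id
    title:   el.querySelector('h2')?.textContent.trim(),
    snippet: el.querySelector('.snippet')?.textContent.trim(),
  }));
}
```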
There’s a minor problem with this: judging relevance is not very exciting. In a perfect world, we’d have an army of domain experts able to devote their time to marking up thousands of search results. We’d be able to return to these experts regularly, to ask them to judge any new results that appear due to our improvements to the search algorithm. We’d have enough of them to figure out biases due to their differing opinions, because of course they wouldn’t always agree.
In the real world however, this is a difficult thing to achieve. If you’re running legal search, you probably need people with a legal qualification – and these people are busy (and can charge clients a lot of money for every 15 minutes of their time). If you’re selling DIY equipment you need someone who understands the difference between emulsion and gloss paint, but they may be serving customers in a shop. I’ve seen plenty of teams struggle to find domain experts and convince them to do some judgements (although rewarding them sometimes helps – a prize for the most productive judge?). Quite often it’s the search team themselves who end up doing the majority of the judging, which comes with a risk of confirmation bias and, again, isn’t very exciting for them.
An AI Judge to the rescue
Perhaps we can use an AI to create some relevance judgements? This isn’t a new idea of course – it’s quite common to use one AI system to evaluate the output of another. There’s a risk of ‘marking your own homework’, so definitely don’t judge one LLM’s output with the same LLM (I’m sure Claude is a great fan of Claude) – but with sufficient guardrails and some manual validation it seems a valid strategy for massively expanding how many relevance judgements we can carry out.
Quepid recently added a feature to allow AI-powered judging of search results. This currently uses OpenAI – so to try it out you’re going to need an API key and some credits (luckily Quepid itself is entirely free).
Cases, Books and Teams in Quepid
There are a few concepts in Quepid I should probably explain before going any further.
Case
A case is a set of test queries which might be aimed at testing a particular area of search. Let’s say we’re trying to figure out why two-word queries about bread don’t work very well – we might have a Case with sour dough, french baguette, sliced loaf etc. as our queries:

Cases are one of the earliest concepts in Quepid, from back in the day when it only worked for one user. Each user can have lots of Cases. Each Case can be tied to a different search endpoint if necessary, with different search parameters, but you can also easily clone them to make multiple Cases work with the same backend configuration.
Teams
A team is a group of judges, who collaborate on a particular task. Teams are how you share Cases (and some other things like search endpoints). Here’s my new Team (where I’m a little lonely at present, but notice I can add an AI judge to keep me company!):

Books
A book is a collection of judgements, which can be carried out by a group of different people. Books can be shared between Teams and can be linked to one or more Cases, from which they import query/document pairs – these are the things that require judging (basically ‘is this document relevant to this query?’). Books also have some useful statistics showing how complete the judging process is – great for working with a Team. I’m going to create a Book, and then show how to populate it from a Case:

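Conceptually, each entry in a Book is a query/document pair that collects ratings from one or more judges – something like this hypothetical illustration (not Quepid’s actual schema):

```javascript
// A hypothetical illustration of what a Book holds – not Quepid's actual schema.
// Each query/document pair gathers ratings from one or more judges.
const queryDocPair = {
  query: 'sour dough',
  docId: 'https://example.com/recipes/sourdough-starter', // imported from the Case's results
  fields: { title: 'How to make a sourdough starter' },
  judgements: [
    { judge: 'Charlie', rating: 2 },     // a human judge, on a graded scale such as 0–3
    { judge: 'Homer (AI)', rating: 3 },  // the AI judge we're about to create
  ],
};
```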
Creating an AI for relevance judgements
First we’re going to spin up an AI Judge, which will join our Team in Quepid. This is a pretty simple process: go to our Team and click Create AI Judge. I’ve given it a name (because Homer’s favourite saying is…) and entered an OpenAI key. There’s a sample prompt for OpenAI which we’ll leave alone for now:

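To give a feel for what the AI judge is doing, here’s a heavily simplified sketch of the kind of call it makes to OpenAI for each query/document pair. This isn’t Quepid’s actual implementation, and the prompt wording, model choice and 0–3 scale are my own assumptions:

```javascript
// A heavily simplified sketch of an AI judge – not Quepid's actual implementation.
// The prompt wording, model choice and 0–3 scale are assumptions for illustration.
async function judgePair(apiKey, query, doc) {
  const response = await fetch('https://api.openai.com/v1/chat/completions', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${apiKey}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      model: 'gpt-4o-mini',
      messages: [
        {
          role: 'system',
          content: 'You are a search relevance judge. Rate how well the document answers the query ' +
                   'on a scale from 0 (irrelevant) to 3 (perfect), and briefly explain your reasoning.',
        },
        { role: 'user', content: `Query: ${query}\nDocument: ${JSON.stringify(doc)}` },
      ],
    }),
  });
  const data = await response.json();
  return data.choices[0].message.content; // a rating plus an explanation
}
```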
We now need to populate our Book with some query/document pairs from a Case. First we select the Case ‘Two Word Baking’, click the Share menu option at the top, and share this Case with our ‘Master Bakers’ Team. Now click the Judgements menu option at the top, and we can populate the Book ‘Let’s Make More Dough’: select this Book, check the Populate Book option and then click the green button at the bottom right:

Selecting our Book, we can watch it being populated with the query/document pairs (this takes a little while):

Next we need to click Settings and allow our AI judge to work with this Book:

Click ‘Update Book’ and we’re ready to run our AI judge!
Go, Homer, go!
To create the AI-powered relevance judgements we select the Judgement Stats option in our Book. The Prepare to Judge button lets us start the process:

On the dialog box that pops up we select Judge All Pairs and then click Judge Documents:

…and we’re off! Once the process is finished we can click Judgements and see what happened – click the number in the ID column to see more information, including an explanation of why OpenAI made the rating (don’t click anything on this page unless you want to change the rating!):

Returning to our Case we can see the scores, rolled up into overall metrics – it seems we need to do some work to improve our bread-related searches:

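As an aside on how individual judgements become a single number: here’s a minimal sketch of nDCG@10, one common graded-relevance metric, assuming ratings on a 0–3 scale. This is the standard formula, not Quepid’s own code:

```javascript
// A minimal sketch of rolling graded judgements up into a metric – here nDCG@10.
// Assumes ratings on a 0–3 scale; this is the standard formula, not Quepid's code.
function dcg(ratings) {
  return ratings.reduce(
    (sum, rating, i) => sum + (Math.pow(2, rating) - 1) / Math.log2(i + 2), 0);
}

function ndcgAt10(ratingsInRankOrder) {
  const actual = dcg(ratingsInRankOrder.slice(0, 10));
  const ideal = dcg([...ratingsInRankOrder].sort((a, b) => b - a).slice(0, 10));
  return ideal === 0 ? 0 : actual / ideal;
}

// e.g. judged ratings for the results returned for 'sour dough', in rank order:
console.log(ndcgAt10([3, 1, 0, 2, 0, 0, 1, 0, 0, 0]).toFixed(2));
```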
Can we trust Homer – and can we make him better?
Although it’s exciting to see an AI automatically creating relevance judgements, we can’t assume these judgements are perfect. It’s worth reviewing them against some kind of ground truth – in this case, someone who knows about food – to make sure we can rely on the data.
In general, there is some evidence that AI-powered judgements are reliable, as my ex-colleague Scott Stults writes in Using GPT for Relevancy Judgements.
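A simple sanity check in practice is to have a human expert judge a sample of the same query/document pairs and measure how often the AI agrees – roughly like the sketch below, which assumes a 0–3 scale (a more rigorous analysis might use an agreement statistic such as Cohen’s kappa):

```javascript
// A rough sketch of a sanity check: compare the AI's ratings against a human-judged
// sample of the same query/document pairs. Assumes a 0–3 graded scale; a more
// rigorous analysis might use an agreement statistic such as Cohen's kappa.
function agreement(humanRatings, aiRatings) {
  let exact = 0, withinOne = 0;
  humanRatings.forEach((human, i) => {
    const diff = Math.abs(human - aiRatings[i]);
    if (diff === 0) exact++;
    if (diff <= 1) withinOne++;
  });
  return {
    exactMatch: exact / humanRatings.length,          // identical ratings
    withinOneGrade: withinOne / humanRatings.length,  // off by at most one grade
  };
}

// e.g. the same ten pairs rated by a human expert and by Homer:
console.log(agreement([3, 2, 0, 1, 3, 0, 2, 1, 0, 3],
                      [3, 1, 0, 1, 3, 1, 2, 2, 0, 3]));
```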
Adding more context
Another issue is that we may not be providing enough information to OpenAI to make a judgement. We may have to work on our data extraction (more JavaScript!). Some things that (in my limited experience) may help are listed below, with a sketch of the extraction side after the list:
- Retrieving pictures – useful for datasets where visual cues are important, for example books or fashion. OpenAI can interpret the picture data, and may spot that a dress has a flower pattern even if the dress isn’t described as ‘floral’.
- Retrieving data pointed to by a URL. If one of the returned search fields is a URL – for example, a link to the entire document when only a snippet is provided in the search results, or a product page with many more details – you can ask OpenAI to follow this URL and consider this when judging. For example, edit the AI Judge settings in the Team page and add something like this to the prompt:
“If there is a field in the document called ‘refurl’, follow this link and also evaluate this page.”
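To make fields like these available to the judge, the (hypothetical) extraction sketch from earlier could be extended to surface an image URL and a refurl for each result – again, the selectors and field names are invented for illustration:

```javascript
// Extending the earlier hypothetical extraction sketch to give the judge more
// context: an image URL for visual cues, and a 'refurl' pointing at the full
// document or product page. Selectors and field names are invented for illustration.
function extractResultsWithContext(rawHtml) {
  const doc = new DOMParser().parseFromString(rawHtml, 'text/html');
  return Array.from(doc.querySelectorAll('.search-result')).map(el => ({
    title:   el.querySelector('h2')?.textContent.trim(),
    snippet: el.querySelector('.snippet')?.textContent.trim(),
    image:   el.querySelector('img')?.src,      // visual cues, e.g. a floral pattern
    refurl:  el.querySelector('a')?.href,       // link referenced by the prompt above
  }));
}
```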
Watch your spend!
Also remember to watch your API credits – even the small amount of AI judging done for this blog post cost around $0.30 of OpenAI usage.
Conclusion
AI-powered relevance judgements give us a way to quickly ramp up our ability to test search and help us get past the ‘cold start’ problem often encountered, where human judges are unavailable or unwilling. It’s exciting to see this feature available in Quepid – and it’s very easy to get started.
In the future, we hope to see other AI providers become available in the tool – perhaps a locally hosted LLM would reduce the cost, and if it were fine-tuned for the domain we’re working with it might also improve judgement quality.
If you need help setting up your relevance judgement process & adding AI to the mix then please get in touch.