
Fuzzy search for typo tolerance on a free OpenSearch cluster


Fat fingers, cats on the keyboard and simply bad spelling – all of these can lead to search queries that don’t match anything in your index. To keep your users happy you need to cope with whatever they type and try to figure out what they actually mean. Below I’ll show you how to use fuzzy search techniques in OpenSearch to broaden the list of potential matches, how this works under the hood in Lucene – and when fuzzy search can go wrong! You’ll learn some neat tricks for protecting data that should never be fuzzified, and how to test and measure your new, improved query structures.

To demonstrate this I’m going to use the free OpenSearch tier recently launched by Aiven – this provides 4GB RAM, 20GB disk storage and the full set of OpenSearch functionality, including Dashboards. This is a great sandbox for your OpenSearch projects – it’s even big enough for experimenting with vector search – and it won’t charge you anything until you choose to upgrade.

After signing up all I have to do is select the free tier option, which gives me an Aiven Console page, and then wait for it to finish ‘Building’. Note the connection information provided: we’ll need some of this below.

Creating some sample data

To create my searchable index I’m using an OpenSearch demo project provided by the Aiven team, which uses a recipe dataset from Kaggle. However, I’m going to change things around a bit and use a different dataset from the e-commerce domain. The steps to follow are:

  • Install Python 3.7+
  • Run through the steps on the demo project page to get everything set up, but you only need to go as far as setting up a connection to your free OpenSearch instance.
  • Download this sports equipment dataset from Kaggle, extract the file Sports_ECommerce_Products_Data.csv and put it in the same folder as the demo project
  • To save time, we won’t index all of the products in this dataset – there are a lot of them
  • Modify the load_data function in the provided index.py so it reads the new data format:
def load_data():
    """Read the CSV file and send each row to OpenSearch as a document."""
    doc_id = 1
    with open('Sports_ECommerce_Products_Data.csv', newline='') as csvfile:
        linereader = csv.reader(csvfile, delimiter=',', quotechar='"')
        for row in linereader:
            # print(row)
            document = {
                "title": row[4],
                "description": row[0],
                "description_fuzzy": row[0],
                "price": float(row[1]),
                "discounted_price": float(row[2]),
                "discount": float(row[3]),
            }
            print(document)
            response = client.index(index=INDEX_NAME, body=document, id=doc_id, refresh=True)
            # print(response)
            doc_id += 1
            if doc_id > 100:
                break

(You can see I’ve added a few print statements for debugging – it’s up to you whether you leave them in or comment them out.) It’s pretty simple: open the CSV file, read each line, build an OpenSearch document and use the client.index function to send it off to OpenSearch. Once we’ve indexed 100 items we stop. Note the description_fuzzy field, which we’ll need later.

  • Just for consistency I’ve also modified the index name in config.py:
    INDEX_NAME = "cricket"
  • Now we can try and index our data! Open a console window in the Python folder (I’m using Powershell) and type:
    python index.py load-data
  • It will take a few minutes to index everything. To make sure it’s working, type:
    python search.py multi-match title description bat
    and you should see some search results.

Searching our index using Dev Tools

We’re going to interact with our new index using OpenSearch Dashboards’ Dev Tools. Returning to the Aiven console, I select the OpenSearch Dashboards tab, click the Service URI link, log in with the provided credentials and accept the default configuration options. Once I can see the OpenSearch Dashboards screen I simply open Dev Tools from the menu and we’re in. Dev Tools shows a default search query – if we change this to match our index name ‘cricket’ we can run a query simply by clicking the ‘play’ arrow at the top of the text block:

GET cricket/_search
{
  "query": {
    "match_all": {}
  }
}

Let’s change our query to the (more useful) multi_match, which will search more fields, and add a parameter to further narrow our search:

GET cricket/_search
{
  "query": {
    "multi_match": {
      "query": "yonex bag",
      "fields": ["title","description"],
      "minimum_should_match": 2
    }
  }
}

We get 16 results for this query.
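The same query can be run from Python with the opensearch-py client the demo project already uses. Here’s a minimal sketch – build_multi_match is a made-up helper, and you’d substitute your own Service URI from the Aiven console:

```python
# from opensearchpy import OpenSearch  # pip install opensearch-py

def build_multi_match(text, fields, minimum_should_match=None, fuzziness=None):
    """Build a multi_match query body like the Dev Tools examples above."""
    multi_match = {"query": text, "fields": fields}
    if minimum_should_match is not None:
        multi_match["minimum_should_match"] = minimum_should_match
    if fuzziness is not None:
        multi_match["fuzziness"] = fuzziness
    return {"query": {"multi_match": multi_match}}

# With a connected client (using the Service URI and credentials from the console):
# client = OpenSearch(hosts=[SERVICE_URI])
# results = client.search(index="cricket",
#                         body=build_multi_match("yonex bag", ["title", "description"],
#                                                minimum_should_match=2))
```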

Handling typos with fuzzy search

It’s quite common to find misspelled words in your query logs – which should always be your first port of call when tuning search, as they tell you what your users are actually searching for (not what you think they’re searching for) and the words they use, which don’t always match the words in your source data. If we try a misspelling against our test index:

GET cricket/_search
{
  "query": {
    "multi_match": {
    "query": "yanex bag",
    "fields": ["title","description"],
    "minimum_should_match": 2
    }
  }
}

as expected we get zero results, as no products match the misspelled ‘yanex’. We could use a spelling suggester to offer corrections – which has the advantage of letting users know they’ve made a spelling error, and in some cases lets them keep the original spelling if that’s what they actually meant. But we can also broaden the search query automatically to cope with typos, using fuzzy search:

GET cricket/_search
{
  "query": {
    "multi_match": {
      "query": "yanex bag",
      "fields": ["title","description"],
      "minimum_should_match": 2, 
      "fuzziness" : "AUTO"
      }
   }
}

This brings back our previous 16 results.

A broader query with fuzzy search

Fuzzy search broadens the query to include variants based on edit distance – the number of single-character changes needed to turn one term into another. To see what’s happening under the hood we can use OpenSearch’s Profile API – let’s start with our first (non-fuzzy) query, which doesn’t match anything:

GET cricket/_search?human=true
{
  "profile": "true",
  "query": {
    "multi_match": {
    "query": "yanex bag",
    "fields": ["title","description"],
    "minimum_should_match": 2
    }
  }
}

In the (very) verbose output we can see strings like this, showing the actual Lucene queries running under the hood:

"query": [
  {
    "type": "DisjunctionMaxQuery",
    "description": "((+title:yanex +title:bag) | (+description:yanex +description:bag))",

If we add the fuzzy search back in:

GET cricket/_search?human=true
{
  "profile": "true",
  "query": {
    "multi_match": {
      "query": "yanex bag",
      "fields": ["title","description"],
      "minimum_should_match": 2,
      "fuzziness": "AUTO"
      }
    }
}


we can see the expansions:

 "description": """((+(description:yonex)^0.8 +(description:bag (description:bas)^0.6666666 (description:bat)^0.6666666 (description:big)^0.6666666))

Note that the alternative words are weighted less than 1 – after all, we want documents that do match ‘bag’ to score higher than those that match ‘bas’ or ‘big’. (The weights here correspond to 1 minus the edit distance divided by the term length: one edit on the three-letter ‘bag’ gives 1 − 1/3 ≈ 0.67, and one edit on the five-letter ‘yonex’ gives 0.8.) The misspelled query will now bring back some results.
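We can sketch the distance calculation itself in a few lines of Python. Lucene’s fuzzy matching actually uses Damerau–Levenshtein by default, which also counts a transposition of two adjacent characters as a single edit, but plain Levenshtein is close enough for illustration:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(edit_distance("yanex", "yonex"))  # 1: substitute a -> o
print(edit_distance("bag", "bas"))      # 1
print(edit_distance("bag", "big"))      # 1
```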

A problem with numbers

Fuzzy search isn’t always a good idea, especially when we need exact matching – e.g. for numbers. Say we’re looking for an exact model number, size or even an age (on a birthday card for example). Model number ‘1223’ may be a very different product to model number ‘1273’, but they’re only one edit distance apart – and you don’t want a 21st birthday card for someone who is turning 31. We can simulate this problem with our test index:

GET cricket/_search?human=true
{
  "profile": "true",
  "query": {
    "multi_match": {
      "query": "master 1000",
      "fields": ["title","description"],
      "minimum_should_match": 2,
      "fuzziness": "AUTO"
      }
    }
}

We get four results, only one of which is correct (let’s assume for the sake of argument that these other results are a really bad result for your user, although in this case I suspect the products are very similar). If we look at the expansions we can see why – fuzzy search is adding numbers like ‘1008’, ‘1500’ and ‘5000’ to our query:

"description": """(MatchNoDocsQuery("empty BooleanQuery") | (+((description:buster)^0.6666666 description:master) +(description:1000 (description:1008)^0.75 (description:1500)^0.75 (description:5000)^0.75 (description:7000)^0.75)))""",
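Each of these expansions really is a single edit away from ‘1000’ – for equal-length model numbers, counting differing positions (a lower bound on edit distance) is enough to see the problem. A quick illustrative check:

```python
def substitutions(a: str, b: str) -> int:
    """Count differing positions between two equal-length strings."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

print(substitutions("1000", "5000"))  # 1 - within fuzzy range, but a different product
print(substitutions("1000", "1008"))  # 1
print(substitutions("1223", "1273"))  # 1
```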

Protecting numbers from fuzzy search

One central (and often misunderstood) concept in search is that your index doesn’t actually store the words in the documents you send to it. Instead, it stores tokens – modified forms of these words – that can look very different. The modifications are controlled by a process called analysis, which can include steps like stemming, plural removal, adding synonyms & removing accents – this helps with matching (if, for example, I want to list all ‘cafes’ when the user types ‘café’). In fact, analysis can also let you completely rewrite a source term – and we can use this to protect numbers from fuzzy search.

We’re going to replace numbers with their text equivalents – “1” becomes “one”, “2” becomes “two” etc. We’re hoping that this prevents queries for ‘1’ (which will turn into ‘one’) matching documents containing ‘2’ (as it will be turned into ‘two’, which is a good few edits away from ‘one’). Just to make sure we can see where the replacements have been made we can wrap the words in underscore characters.
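As a plain-Python sketch of the substitution we’re about to configure (the helper name is made up – the real work will be done by an OpenSearch char filter):

```python
DIGIT_WORDS = {
    "0": "_zero_", "1": "_one_", "2": "_two_", "3": "_three_", "4": "_four_",
    "5": "_five_", "6": "_six_", "7": "_seven_", "8": "_eight_", "9": "_nine_",
}

def spell_out_digits(text: str) -> str:
    """Replace each digit with its underscore-wrapped word equivalent."""
    return "".join(DIGIT_WORDS.get(ch, ch) for ch in text)

print(spell_out_digits("master 1000"))  # master _one__zero__zero__zero_
```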

Let’s start by creating a new field for our index – description_fuzzy. We’ll use the same text from the source data as we do for the description field, but analyze it differently. We’ll also need to specify various analysis settings. As we can’t change the settings of an existing index we’ll tear it down first:

DELETE /cricket
PUT /cricket
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0
    },
    "analysis": {
      "analyzer": {
        "custom_number_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "custom_number_filter"
          ]
        }
      },
      "char_filter": {
        "custom_number_filter": {
          "type": "mapping",
          "mappings": [
            "1 => _one_",
            "2 => _two_",
            "3 => _three_",
            "4 => _four_",
            "5 => _five_",
            "6 => _six_",
            "7 => _seven_",
            "8 => _eight_",
            "9 => _nine_",
            "0 => _zero_"
          ]
        }
      }
    }
  },
  "mappings": {
      "properties": {
        "description": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "description_fuzzy": {
          "type": "text",
          "analyzer" : "custom_number_analyzer",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "discount": {
          "type": "float"
        },
        "discounted_price": {
          "type": "float"
        },
        "price": {
          "type": "float"
        },
        "title": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
}
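Before re-indexing, it’s worth sanity-checking the new analyzer with the _analyze API in Dev Tools – it returns the tokens produced for a piece of text, so we can confirm the digit substitutions without indexing anything:

GET cricket/_analyze
{
  "analyzer": "custom_number_analyzer",
  "text": "master 1000"
}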

Now we need to re-index – run python index.py load-data again from the local machine. Let’s try a query against our new field:

GET cricket/_search?human=true
{
  "profile": "true",
  "query": {
    "match": {
      "description_fuzzy": {
        "query" :  "master 1000",
        "fuzziness": "AUTO",
        "minimum_should_match": 2
        }
      }
    }
  }

We only get a single result – and the expansion shows why. Even though fuzzy search is trying words close in edit distance to ‘Master’, it isn’t doing the same for our encoded numbers:

"description": "+((description_fuzzy:Buster)^0.6666666 (description_fuzzy:Master)^0.8333333) +description_fuzzy:one__zero__zero__zero",

A gotcha – one becomes nine

One thing to watch out for is that ‘one’ and ‘nine’, and ‘five’ and ‘nine’, are only 2 edits apart – remember that the fuzzy search algorithm includes adding new characters, so one -> none -> nine is only two hops. The AUTO setting used here for fuzzy search follows these rules, according to the OpenSearch documentation:

- Terms containing 0–2 characters: Requires an exact match (0 edits).
- Terms containing 3–5 characters: Allows a maximum of 1 edit.
- Terms containing 6 or more characters: Allows a maximum of 2 edits.
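These thresholds can be sketched as a tiny function (a hypothetical helper, just mirroring the documented rules) to see which of our encoded numbers allow edits:

```python
def auto_fuzziness(term: str) -> int:
    """Maximum edits allowed by fuzziness AUTO for a term of this length."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2

print(auto_fuzziness("_one_"))   # 5 characters -> 1 edit allowed
print(auto_fuzziness("_nine_"))  # 6 characters -> 2 edits allowed
```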

So it’s feasible that _nine_, which has 6 characters, could turn into _one_ or _five_ when fuzzified. We can see this by searching for ‘Master 9000’, which returns two results, ‘Master 1000’ and ‘Master 5000’.

To prevent this we can use _n_ine_ instead in our mapping list:

"9 => _n_ine_",

and when we run this (remember to tear down and re-index as above) we’ll only get the Master 9000 product in our result list.

Conclusion

We’ve seen how OpenSearch can cope automatically with typos, and how this works under the hood – remember that it’s not always obvious how expansions are generated, and there can be unexpected gotchas. Offline testing is a great way to try out these kinds of tricks, which you can easily do with Dev Tools or a tool like Quepid.

Thanks must go to the Aiven team for creating their new free tier – interestingly they also provide free tiers for other projects like MySQL, Valkey, Postgres & Kafka, so you could set up an entire stack with source data storage, an ingest pipeline and search engine without spending anything. I’m looking forward to trying out more fun experiments like this one!

