Suggest, sometimes called auto-suggest, type-ahead search or auto-complete, is now an essential search feature ever since Google added it almost 5 years ago.
Lucene has a number of implementations; I previously described
Here I describe our newest suggester:
Unlike the existing suggesters, which generally find suggestions whose whole prefix matches the current user input, this suggester will find matches of tokens anywhere in the user input and in the suggestion; this is why it has Infix in its name.
You can see it in action at the example Jira search applicationthat I built to showcase various Lucene features.
For example, if you enter
Unlike the existing suggesters, this new suggester does not use a specialized data-structure such as FSTs. Instead, it's an "ordinary" Lucene index under-the-hood, making use of
It also uses the new index sorter APIs to pre-sort all postings by suggested weight at index time, and at lookup time uses a custom
Since the suggestions are sorted solely by weight, and no other relevance criteria, this suggester is a good fit for applications that have a strong a-priori weighting for each suggestion, such as a movie search engine ranking suggestions by popularity, recency or a blend, for each movie. In Jira search I rank each suggestion (Jira issue) by how recently it was updated.
Specifically, there is no penalty for suggestions with matching tokens far from the beginning, which could mean the relevance is poor in some cases; an alternative approach (patch is on the issue) uses FSTs instead, which can require that the matched tokens are within the first three tokens, for example. This would also be possible with
One nice benefit of an index-based approach is
Performance is reasonable: somewhere between
Analysis fun
As with
Which suggester should you use for your application? Impossible to say! You'll have to test each of Lucene's offerings and pick one. Auto-suggest is an area where one-size-does-not-fit-all, so it's great that Lucene is picking up a number of competing implementations. Whichever you use, please give us feedback so we can further iterate and improve!
Lucene has a number of implementations; I previously described
AnalyzingSuggester
. Since then, FuzzySuggester
was also added, which extends AnalyzingSuggester
by also accepting mis-spelled inputs. Here I describe our newest suggester:
AnalyzingInfixSuggester
, now going through iterations on the LUCENE-4845Jira issue. Unlike the existing suggesters, which generally find suggestions whose whole prefix matches the current user input, this suggester will find matches of tokens anywhere in the user input and in the suggestion; this is why it has Infix in its name.
You can see it in action at the example Jira search applicationthat I built to showcase various Lucene features.
For example, if you enter
japan
you should see various issues suggested, including: - SOLR-4945: Japanese Autocomplete and Highlighter broken
- LUCENE-3922: Add Japanese Kanji number normalization to Kuromoji
- LUCENE-3921: Add decompose compound Japanese Katakana token capability to Kuromoji
Unlike the existing suggesters, this new suggester does not use a specialized data-structure such as FSTs. Instead, it's an "ordinary" Lucene index under-the-hood, making use of
EdgeNGramTokenFilter
to index the short prefixes of each token, up to length 3 by default, for fast prefix querying. It also uses the new index sorter APIs to pre-sort all postings by suggested weight at index time, and at lookup time uses a custom
Collector
to stop after finding the first N matching hits since these hits are the best matches when sorting by weight. The lookup method lets you specify whether all terms must be found, or any of the terms (Jira searchrequires all terms). Since the suggestions are sorted solely by weight, and no other relevance criteria, this suggester is a good fit for applications that have a strong a-priori weighting for each suggestion, such as a movie search engine ranking suggestions by popularity, recency or a blend, for each movie. In Jira search I rank each suggestion (Jira issue) by how recently it was updated.
Specifically, there is no penalty for suggestions with matching tokens far from the beginning, which could mean the relevance is poor in some cases; an alternative approach (patch is on the issue) uses FSTs instead, which can require that the matched tokens are within the first three tokens, for example. This would also be possible with
AnalyzingInfixSuggester
using an index-time analyzer that dropped all but the first three tokens. One nice benefit of an index-based approach is
AnalyzingInfixSuggester
handles highlighting of the matched tokens (red color, above), which has unfortunately proven difficult to provide with the FST-based suggesters. Another benefit is, in theory, the suggester could support near-real-time indexing, but I haven't exposed that in the current patch and probably won't for some time (patches welcome!). Performance is reasonable: somewhere between
AnalyzingSuggester
and FuzzySuggester
, between 58 - 100 kQPS (details on the issue). Analysis fun
As with
AnalyzingSuggester
, AnalyzingInfixSuggester
let's you separately configure the index-time vs. search-time analyzers. With Jira search, I enabled stop-word removal at index time, but not at search time, so that a query like or
would still successfully find any suggestions containing words starting with or
, rather than dropping the term entirely. Which suggester should you use for your application? Impossible to say! You'll have to test each of Lucene's offerings and pick one. Auto-suggest is an area where one-size-does-not-fit-all, so it's great that Lucene is picking up a number of competing implementations. Whichever you use, please give us feedback so we can further iterate and improve!