Quantcast
Channel: Changing Bits
Viewing all articles
Browse latest Browse all 52

SuggestStopFilter carefully removes stop words for suggesters

$
0
0
Lucene now has a nice set of suggesters that use an analyzer to tokenize the suggestions: AnalyzingSuggester, FuzzySuggester and AnalyzingInfixSuggester. Using an analyzer is powerful because it lets you customize exactly how suggestions are matched: you can normalize case, apply stemming, match across different synonym forms, etc.

One of the most common things you'll do with your analyzer is to remove stop-words using StopFilter. Unfortunately, if you try this, you'll quickly notice that the stop filter is too aggressive because it happily removes the last token even if the user isn't done typing it yet. For example if the user has typed "a", you'd expect suggestions like apple, aardvark, etc., but you won't get that because StopFilter removed the "a" token.

You could try using StopFilter only while indexing, which was my first attempt with the suggestions at jirasearch.mikemccandless.com, but then, at least for AnalyzingInfixSuggester, you'll fail to get matches when you pass allTermsRequired=true because the suggester then requires that even stop words find matches.

Finally, you could use the new StopSuggestFilterat lookup time: this filter is just like StopFilter except when the token is the very last token, it checks the offset for that token and if the offset indicates that the token has ended without any further non-token characters, then the token is preserved. The token is also marked as a keyword, so that any later stem filters won't change it. This way a query "a" can find "apple", but a query "a " (with a trailing space) will find nothing because the "a" will be removed.

I've pushed StopSuggestFilter to jirasearch.mikemccandless.comand it seems to be working well so far!

Viewing all articles
Browse latest Browse all 52

Trending Articles