feat(search): add support for multiple stopword sets #2391

Open
wants to merge 5 commits into
base: v29

Contributor

@tharropoulos tharropoulos commented Jun 4, 2025

Change Summary

  • Allow comma-separated stopword set names in search queries
  • Merge all specified stopword sets into a single unified set
  • Add validation to ensure all referenced stopword sets exist
  • Add tests for multi-stopword set functionality

Demo

curl "http://localhost:8108/collections/test/documents/search?q=query&stopwords=set1,set2" \
  -H "X-TYPESENSE-API-KEY: xyz" \
  -H "Content-Type: application/json" \
  -X GET

PR Checklist

- Allow comma-separated stopword set names in search queries
- Implement combined stopword processing with `##COMBINED##:` prefix
- Merge all specified stopword sets into a single unified set
- Add validation to ensure all referenced stopword sets exist
- Add tests for multi-stopword set functionality
… handling

- Add `parse_stopword_set_names()` utility function in `StringUtils` for comma-separated parsing
- Remove `##COMBINED##:` prefix handling and simplify stopword set processing logic
- Refactor collection manager to store original parameter value instead of creating combined identifiers
- Consolidate stopword parsing logic in single location for better maintainability
- Use iterator-based insertion for combining multiple stopword sets
- instead of manually trimming whitespace, use the `split` function to avoid redundant code
- add `get_combined_stopwords()` method to `StopwordsManager` class
- consolidate comma-separated stopword set parsing and merging logic
- replace inline stopword combination code in `collection.cpp`
- improve error handling for missing stopword sets
- reduce code duplication and improve maintainability
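
The parse-and-merge flow described in the commits above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `StopwordStore` is a hypothetical stand-in for whatever `StopwordsManager` holds internally, the signatures of `parse_stopword_set_names()` and `get_combined_stopwords()` are assumptions (only the names come from the commit messages), and the hand-written trimming stands in for the project's own `split` helper.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the stopword storage: set name -> stopwords.
using StopwordStore = std::map<std::string, std::set<std::string>>;

// Split a comma-separated parameter value into trimmed, non-empty set names.
// (Assumed signature; the PR only names the function.)
std::vector<std::string> parse_stopword_set_names(const std::string& param) {
    std::vector<std::string> names;
    const char* whitespace = " \t\r\n";
    std::size_t start = 0;
    while (start <= param.size()) {
        std::size_t end = param.find(',', start);
        if (end == std::string::npos) {
            end = param.size();
        }
        const std::string piece = param.substr(start, end - start);
        const std::size_t first = piece.find_first_not_of(whitespace);
        if (first != std::string::npos) {
            const std::size_t last = piece.find_last_not_of(whitespace);
            names.push_back(piece.substr(first, last - first + 1));
        }
        start = end + 1;
    }
    return names;
}

// Merge every referenced set into one. Returns false and fills `error`
// if any referenced set does not exist; `combined` is only written on success.
bool get_combined_stopwords(const StopwordStore& store,
                            const std::string& param,
                            std::set<std::string>& combined,
                            std::string& error) {
    std::set<std::string> merged;
    for (const std::string& name : parse_stopword_set_names(param)) {
        const auto it = store.find(name);
        if (it == store.end()) {
            error = "Stopword set not found: " + name;
            return false;
        }
        // Iterator-based insertion: the std::set de-duplicates overlapping words.
        merged.insert(it->second.begin(), it->second.end());
    }
    combined = std::move(merged);
    return true;
}
```

With sets `set1 = {a, the}` and `set2 = {the, of}`, a parameter value of `set1, set2` would yield the union `{a, of, the}` in this sketch, while `set1,nope` would fail validation and report the missing name.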
@tharropoulos tharropoulos marked this pull request as ready for review June 4, 2025 13:23
… parsing

- Replace combined stopwords processing with individual set iteration
- Parse stopword set names and process each set separately
- Add error logging for missing stopword sets
- Remove tokens from both `tokens` and `tokens_non_stemmed` arrays per set
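
The per-set iteration in this last commit might look something like the sketch below. It rests on several assumptions: `StopwordStore` and `apply_stopword_sets()` are hypothetical names, a missing set is logged and skipped (the commit mentions error logging but not what happens afterwards), and each array is filtered independently against the stopword set, which may differ from how the real code keeps `tokens` and `tokens_non_stemmed` aligned.

```cpp
#include <algorithm>
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for the stopword storage: set name -> stopwords.
using StopwordStore = std::map<std::string, std::set<std::string>>;

// Erase every token that appears in `stopwords` (erase-remove idiom).
void erase_stopwords(std::vector<std::string>& tokens,
                     const std::set<std::string>& stopwords) {
    tokens.erase(std::remove_if(tokens.begin(), tokens.end(),
                                [&stopwords](const std::string& token) {
                                    return stopwords.count(token) > 0;
                                }),
                 tokens.end());
}

// Process each named set separately rather than building a combined set first.
void apply_stopword_sets(const StopwordStore& store,
                         const std::vector<std::string>& set_names,
                         std::vector<std::string>& tokens,
                         std::vector<std::string>& tokens_non_stemmed) {
    for (const std::string& name : set_names) {
        const auto it = store.find(name);
        if (it == store.end()) {
            // Assumption: log and continue with the remaining sets.
            std::cerr << "Stopword set not found: " << name << "\n";
            continue;
        }
        erase_stopwords(tokens, it->second);
        erase_stopwords(tokens_non_stemmed, it->second);
    }
}
```

Compared with merging first, this variant avoids allocating a temporary combined set per query at the cost of one pass over the token arrays per stopword set.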