Verify word embedding model downloader #5532
Labels:
- enhancement: New feature or request
- good first issue: Good for newcomers
- NLP: Issues / questions around text processing
- P2: Priority of the issue for triage purpose: Needs to be fixed at some point.
- up-for-grabs: A good issue to fix if you are trying to contribute to the project
An internal user reported a stall during the `.Fit()` call of the word embedding transform.
On first use, the word embedding transform downloads the word embedding model from the CDN.
To test:
Check the local folder and `~/.local/share/mlnet-resources/WordVectors/` for a file named `wiki.en.vec`.
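One way to script this check, as a rough sketch: the program below looks for `wiki.en.vec` in two places and prints the size of any copy it finds. Treating the current working directory as the "local folder" is an assumption, and so is the idea that an unexpectedly small file may point to an incomplete download.

```csharp
using System;
using System.IO;

public static class CheckWordVectors
{
    public static void Main()
    {
        const string fileName = "wiki.en.vec";

        // Assumption: "local folder" means the current working directory.
        string home = Environment.GetFolderPath(Environment.SpecialFolder.UserProfile);
        string[] candidateDirs =
        {
            Directory.GetCurrentDirectory(),
            Path.Combine(home, ".local", "share", "mlnet-resources", "WordVectors"),
        };

        foreach (string dir in candidateDirs)
        {
            string path = Path.Combine(dir, fileName);
            if (File.Exists(path))
            {
                // Report the size so a partial download is easier to spot.
                long bytes = new FileInfo(path).Length;
                Console.WriteLine($"{path}: found, {bytes} bytes");
            }
            else
            {
                Console.WriteLine($"{path}: not found");
            }
        }
    }
}
```

Running it before and after the first `.Fit()` call should show whether the model file appeared and whether its size is still changing.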
Example code:
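A minimal sketch of such a pipeline, not a verified repro. The `TextData` class, the column names (`Tokens`, `TextFeatures`, `Embedding`), and the sample sentence are illustrative assumptions. The choice of `FastTextWikipedia300D` is inferred from the `wiki.en.vec` file name and the fastText reference in this issue.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using Microsoft.ML;
using Microsoft.ML.Transforms.Text;

// Hypothetical input row type, used for illustration only.
public class TextData
{
    public string Text { get; set; }
}

public static class Program
{
    public static void Main()
    {
        var mlContext = new MLContext();

        // Illustrative sample data (assumption).
        var samples = new List<TextData>
        {
            new TextData { Text = "Café number 42 is OPEN today." },
        };
        var data = mlContext.Data.LoadFromEnumerable(samples);

        // Clean the text before the embedding lookup:
        // drop numbers, keep diacritics, and lowercase.
        var options = new TextFeaturizingEstimator.Options
        {
            OutputTokensColumnName = "Tokens",
            KeepNumbers = false,
            KeepDiacritics = true,
            CaseMode = TextNormalizingEstimator.CaseMode.Lower,
        };

        var pipeline = mlContext.Transforms.Text
            .FeaturizeText("TextFeatures", options, nameof(TextData.Text))
            .Append(mlContext.Transforms.Text.ApplyWordEmbedding(
                "Embedding",
                "Tokens",
                WordEmbeddingEstimator.PretrainedModelKind.FastTextWikipedia300D));

        // On first use the pretrained model is downloaded,
        // so time the call to make a stall visible.
        var stopwatch = Stopwatch.StartNew();
        var model = pipeline.Fit(data);
        stopwatch.Stop();
        Console.WriteLine($"Fit completed in {stopwatch.Elapsed}.");
    }
}
```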
The code here shows a full example of `FeaturizeText` for use with `ApplyWordEmbedding`. Specifically, it creates the tokens for `ApplyWordEmbedding` by removing numbers, keeping diacritics, and lowercasing, to match how the fastText model was created. This text cleaning reduces the out-of-vocabulary (OOV) issue in the word embedding. For any specific dataset, these options can be tested.
Side note:
We should make a sample of `FeaturizeText` with `ApplyWordEmbedding`. I wrote the above since I couldn't locate one to link to in this issue.
Additional user report: #5450 (comment)