An internal user reported a stall during .Fit() of the word embedding transform.
On first use, the word embedding transform downloads the word embedding model from the CDN.
To test:
Clear any copies of the fastText 300D word embedding file from the local machine (a cleanup sketch follows this list)
Check the local folder, and ~/.local/share/mlnet-resources/WordVectors/, for a file named wiki.en.vec
Create example code using FastTextWikipedia300D (6.6GB) in the word embedding transform
Time how long it takes to download (or fail); see the timing sketch after the example code
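For the cleanup step, here is a minimal sketch (not from the original report; the helper name and the Windows cache path are assumptions, only the ~/.local/share location comes from this issue) that deletes any cached copy of wiki.en.vec so the next .Fit() must download it again:

using System;
using System.IO;

// Hypothetical helper (not part of ML.NET) to force a fresh download:
// delete any cached copy of wiki.en.vec before running the repro.
static void ClearCachedWordVectors()
{
    var candidates = new[]
    {
        // Current working directory
        Path.Combine(Environment.CurrentDirectory, "wiki.en.vec"),
        // ~/.local/share/mlnet-resources/WordVectors/ on Linux;
        // the equivalent under %LOCALAPPDATA% on Windows (assumed)
        Path.Combine(
            Environment.GetFolderPath(Environment.SpecialFolder.LocalApplicationData),
            "mlnet-resources", "WordVectors", "wiki.en.vec"),
    };

    foreach (var path in candidates)
    {
        if (File.Exists(path))
        {
            Console.WriteLine($"Deleting cached model: {path}");
            File.Delete(path);
        }
    }
}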
Example code:
var featurizeTextOptions = new TextFeaturizingEstimator.Options()
{
    // Produce cleaned tokens for input to the word embedding transform
    OutputTokensColumnName = "OutputTokens",

    // Text cleaning (not shown is stop word removal)
    KeepDiacritics = true,  // Non-default
    KeepPunctuations = false,
    KeepNumbers = false,    // Non-default
    CaseMode = TextNormalizingEstimator.CaseMode.Lower,

    // Row-wise normalization (see: NormalizeLpNorm)
    Norm = TextFeaturizingEstimator.NormFunction.L2,

    // Use ML.NET's built-in stop word remover (non-default)
    StopWordsRemoverOptions = new StopWordsRemovingEstimator.Options()
    {
        Language = TextFeaturizingEstimator.Language.English
    },

    // ngram options
    WordFeatureExtractor = new WordBagEstimator.Options()
    {
        NgramLength = 2,
        UseAllLengths = true, // Produce both unigrams and bigrams
        Weighting = NgramExtractingEstimator.WeightingCriteria.Tf, // Can also use TF-IDF or IDF
    },

    // chargram options
    CharFeatureExtractor = new WordBagEstimator.Options()
    {
        NgramLength = 3,
        UseAllLengths = false, // Produce only tri-chargrams and not single/double characters
        Weighting = NgramExtractingEstimator.WeightingCriteria.Tf, // Can also use TF-IDF or IDF
    },
};

// Featurization pipeline
var pipeline = mlContext.Transforms.Conversion.MapValueToKey("Label", "Label") // Needed for multi-class to convert string labels to the Key type
    // Create ngrams, and cleaned tokens for the word embedding
    .Append(mlContext.Transforms.Text.FeaturizeText("FeaturesText", featurizeTextOptions, new[] { "InputText" })) // Use above options object
    // Word embedding transform reads in the cleaned tokens from the text featurizer
    .Append(mlContext.Transforms.Text.ApplyWordEmbedding("FeaturesWordEmbedding",
        "OutputTokens", WordEmbeddingEstimator.PretrainedModelKind.FastTextWikipedia300D))
    // Feature vector is the concatenation of the ngrams from the text transform, and the word embeddings
    .Append(mlContext.Transforms.Concatenate("Features", new[] { "FeaturesText", "FeaturesWordEmbedding" }))
    // Enable if numeric features are also included. Normalization is generally unneeded if only
    // using the output from FeaturizeText as it's row-wise normalized w/ an L2 norm; word
    // embeddings are also well behaved.
    //.Append(mlContext.Transforms.NormalizeMinMax("Features", "Features"))
    // Cache the featurized dataset in memory for added speed
    .AppendCacheCheckpoint(mlContext);

// Trainer
var trainer = mlContext.MulticlassClassification.Trainers.OneVersusAll(
        mlContext.BinaryClassification.Trainers.AveragedPerceptron(
            labelColumnName: "Label", numberOfIterations: 10, featureColumnName: "Features"),
        labelColumnName: "Label")
    .Append(mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel", "PredictedLabel"));

var trainingPipeline = pipeline.Append(trainer);
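For the timing step, a minimal driver sketch (assumed, not part of the original report; the TextData class and the sample rows are illustrative, though the InputText and Label column names match the pipeline above) that wraps the first .Fit() in a Stopwatch:

using System;
using System.Diagnostics;
using Microsoft.ML;

// Hypothetical input schema; the column names match the ones the
// transforms above reference.
public class TextData
{
    public string InputText { get; set; }
    public string Label { get; set; }
}

// ... after building trainingPipeline as above ...
var samples = new[]
{
    new TextData { InputText = "The 3 cafés opened in 2019!", Label = "A" },
    new TextData { InputText = "Another small example row.", Label = "B" },
};
var dataView = mlContext.Data.LoadFromEnumerable(samples);

// The first Fit() triggers the CDN download of wiki.en.vec (6.6GB), so the
// elapsed time is dominated by the download (or shows the stall/failure).
var stopwatch = Stopwatch.StartNew();
var model = trainingPipeline.Fit(dataView);
stopwatch.Stop();
Console.WriteLine($"Fit completed in {stopwatch.Elapsed}");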
The pipeline code above shows a full example of FeaturizeText for use with ApplyWordEmbedding. Specifically, it creates the tokens for ApplyWordEmbedding by removing numbers, keeping diacritics, and lowercasing to match how the fastText model was created. The text cleaning reduces the out-of-vocabulary (OOV) issue in the word embedding. For any specific dataset, these options can be tested.
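To verify the cleaning on a given dataset, one option (a sketch, assuming the model and dataView from the timing sketch above) is to inspect the OutputTokens column that ApplyWordEmbedding consumes:

using System.Linq;

var transformed = model.Transform(dataView);
// Materialize the variable-length OutputTokens text vector as string arrays.
var tokens = transformed.GetColumn<string[]>("OutputTokens").First();
// With the options above, "The 3 cafés opened in 2019!" should come out
// roughly as ["cafés", "opened"]: numbers and punctuation dropped, stop
// words removed, case lowered, diacritics kept.
Console.WriteLine(string.Join(", ", tokens));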
Side note:
We should make a sample of FeaturizeText with ApplyWordEmbedding. I wrote the above since I couldn't locate one to link to in this issue.
Additional user report: #5450 (comment)