An internal user reported a stall during .Fit() of the word embedding transform.
On first use, the word embedding transform downloads the word embedding model from the CDN.
To test:
Clear any copies of the fastText 300D word embedding file from the local machine (a cleanup sketch follows this list)
Check the local folder, and ~/.local/share/mlnet-resources/WordVectors/, for a file named wiki.en.vec
Create example code using FastTextWikipedia300D (6.6GB) in the word embedding transform
Time how long it takes to download (or fail); see the timing sketch after the example code
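For the cleanup step, here is a minimal sketch (not from the original report; the helper name and the Windows cache path are assumptions, only the ~/.local/share location comes from this issue) that deletes any cached copy of wiki.en.vec so the next .Fit() must download it again:

using System;
using System.IO;

// Hypothetical helper (not part of ML.NET) to force a fresh download:
// delete any cached copy of wiki.en.vec before running the repro.
static void ClearCachedWordVectors()
{
    var candidates = new[]
    {
        // Current working directory
        Path.Combine(Environment.CurrentDirectory, "wiki.en.vec"),
        // ~/.local/share/mlnet-resources/WordVectors/ on Linux;
        // the equivalent under %LOCALAPPDATA% on Windows (assumed)
        Path.Combine(
            Environment.GetFolderPath(Environment.SpecialFolder.LocalApplicationData),
            "mlnet-resources", "WordVectors", "wiki.en.vec"),
    };

    foreach (var path in candidates)
    {
        if (File.Exists(path))
        {
            Console.WriteLine($"Deleting cached model: {path}");
            File.Delete(path);
        }
    }
}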
Example code:
var featurizeTextOptions = new TextFeaturizingEstimator.Options()
{
    // Produce cleaned tokens for input to the word embedding transform
    OutputTokensColumnName = "OutputTokens",

    // Text cleaning (not shown is stop word removal)
    KeepDiacritics = true,  // Non-default
    KeepPunctuations = false,
    KeepNumbers = false,    // Non-default
    CaseMode = TextNormalizingEstimator.CaseMode.Lower,

    // Row-wise normalization (see: NormalizeLpNorm)
    Norm = TextFeaturizingEstimator.NormFunction.L2,

    // Use ML.NET's built-in stop word remover (non-default)
    StopWordsRemoverOptions = new StopWordsRemovingEstimator.Options()
    {
        Language = TextFeaturizingEstimator.Language.English
    },

    // ngram options
    WordFeatureExtractor = new WordBagEstimator.Options()
    {
        NgramLength = 2,
        UseAllLengths = true, // Produce both unigrams and bigrams
        Weighting = NgramExtractingEstimator.WeightingCriteria.Tf, // Can also use TF-IDF or IDF
    },

    // chargram options
    CharFeatureExtractor = new WordBagEstimator.Options()
    {
        NgramLength = 3,
        UseAllLengths = false, // Produce only tri-chargrams and not single/double characters
        Weighting = NgramExtractingEstimator.WeightingCriteria.Tf, // Can also use TF-IDF or IDF
    },
};

// Featurization pipeline
var pipeline = mlContext.Transforms.Conversion.MapValueToKey("Label", "Label") // Needed for multi-class to convert string labels to the Key type
    // Create ngrams, and cleaned tokens for the word embedding
    .Append(mlContext.Transforms.Text.FeaturizeText("FeaturesText", featurizeTextOptions, new[] { "InputText" })) // Use above options object
    // Word embedding transform reads in the cleaned tokens from the text featurizer
    .Append(mlContext.Transforms.Text.ApplyWordEmbedding("FeaturesWordEmbedding",
        "OutputTokens", WordEmbeddingEstimator.PretrainedModelKind.FastTextWikipedia300D))
    // Feature vector is the concatenation of the ngrams from the text transform, and the word embeddings
    .Append(mlContext.Transforms.Concatenate("Features", new[] { "FeaturesText", "FeaturesWordEmbedding" }))
    // Enable if numeric features are also included. Normalization is generally unneeded if only
    // using the output from FeaturizeText as it's row-wise normalized w/ an L2 norm; word
    // embeddings are also well behaved.
    //.Append(mlContext.Transforms.NormalizeMinMax("Features", "Features"))
    // Cache the featurized dataset in memory for added speed
    .AppendCacheCheckpoint(mlContext);

// Trainer
var trainer = mlContext.MulticlassClassification.Trainers.OneVersusAll(
        mlContext.BinaryClassification.Trainers.AveragedPerceptron(
            labelColumnName: "Label", numberOfIterations: 10, featureColumnName: "Features"),
        labelColumnName: "Label")
    .Append(mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel", "PredictedLabel"));

var trainingPipeline = pipeline.Append(trainer);
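For the timing step, a minimal driver sketch (assumed, not part of the original report; the TextData class and the sample rows are illustrative, though the InputText and Label column names match the pipeline above) that wraps the first .Fit() in a Stopwatch:

using System;
using System.Diagnostics;
using Microsoft.ML;

// Hypothetical input schema; the column names match the ones the
// transforms above reference.
public class TextData
{
    public string InputText { get; set; }
    public string Label { get; set; }
}

// ... after building trainingPipeline as above ...
var samples = new[]
{
    new TextData { InputText = "The 3 cafés opened in 2019!", Label = "A" },
    new TextData { InputText = "Another small example row.", Label = "B" },
};
var dataView = mlContext.Data.LoadFromEnumerable(samples);

// The first Fit() triggers the CDN download of wiki.en.vec (6.6GB), so the
// elapsed time is dominated by the download (or shows the stall/failure).
var stopwatch = Stopwatch.StartNew();
var model = trainingPipeline.Fit(dataView);
stopwatch.Stop();
Console.WriteLine($"Fit completed in {stopwatch.Elapsed}");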
The pipeline code above shows a full example of FeaturizeText for use with ApplyWordEmbedding. Specifically, it creates the tokens for ApplyWordEmbedding by removing numbers, keeping diacritics, and lowercasing to match how the fastText model was created. The text cleaning reduces the out-of-vocabulary (OOV) issue in the word embedding. For any specific dataset, these options can be tested.
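To verify the cleaning on a given dataset, one option (a sketch, assuming the model and dataView from the timing sketch above) is to inspect the OutputTokens column that ApplyWordEmbedding consumes:

using System.Linq;

var transformed = model.Transform(dataView);
// Materialize the variable-length OutputTokens text vector as string arrays.
var tokens = transformed.GetColumn<string[]>("OutputTokens").First();
// With the options above, "The 3 cafés opened in 2019!" should come out
// roughly as ["cafés", "opened"]: numbers and punctuation dropped, stop
// words removed, case lowered, diacritics kept.
Console.WriteLine(string.Join(", ", tokens));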
Side note:
We should make a sample of FeaturizeText with ApplyWordEmbedding. I wrote the above since I couldn't locate one to link to in this issue.
Additional user report: #5450 (comment)