The Wayback Machine - https://web.archive.org/web/20220324205139/https://github.com/dotnet/machinelearning/issues/5532

Verify word embedding model downloader #5532

Open
justinormont opened this issue Dec 5, 2020 · 1 comment
Labels
enhancement good first issue NLP P2 up-for-grabs

Comments

justinormont (Contributor) commented Dec 5, 2020

An internal user reported a stall during the .Fit() of the word embedding transform.

On first use of the word embedding transform, it downloads the word embedding model from the CDN.

To test:

  1. Clear any copies of the fastText 300D word embedding file from the local machine.
    Check the local folder and ~/.local/share/mlnet-resources/WordVectors/ for a file named wiki.en.vec
  2. Create example code using the FastTextWikipedia300D model (6.6 GB) in the word embedding transform
  3. Time how long it takes to download (or fail)
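Step 1 can be sketched as a small shell snippet (the cache path comes from this issue; the `dotnet run` line assumes a hypothetical sample project named `WordEmbeddingSample`):

```shell
# Check whether the FastTextWikipedia300D model is already cached locally.
# Paths per the issue text: the working directory and the mlnet-resources cache.
check_model() {
    model="$HOME/.local/share/mlnet-resources/WordVectors/wiki.en.vec"
    if [ -f "$model" ] || [ -f "./wiki.en.vec" ]; then
        echo "cached"
    else
        echo "not cached"
    fi
}
check_model

# To reproduce the download, delete the cached file and time the first .Fit():
#   rm -f "$HOME/.local/share/mlnet-resources/WordVectors/wiki.en.vec"
#   time dotnet run --project ./WordEmbeddingSample   # hypothetical sample project
```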

Example code:

var featurizeTextOptions = new TextFeaturizingEstimator.Options()
{
    // Produce cleaned tokens for input to the word embedding transform
    OutputTokensColumnName = "OutputTokens", 

    // Text cleaning (not shown is stop word removal)
    KeepDiacritics = true, // Non-default
    KeepPunctuations = false,
    KeepNumbers = false, // Non-default
    CaseMode = TextNormalizingEstimator.CaseMode.Lower,

    // Row-wise normalization (see: NormalizeLpNorm)
    Norm = TextFeaturizingEstimator.NormFunction.L2,

    // Use ML.NET's built-in stop word remover (non-default)
    StopWordsRemoverOptions = new StopWordsRemovingEstimator.Options() { Language = TextFeaturizingEstimator.Language.English },

    // ngram options
    WordFeatureExtractor = new WordBagEstimator.Options()
    {
        NgramLength = 2,
        UseAllLengths = true, // Produce both unigrams and bigrams
        Weighting = NgramExtractingEstimator.WeightingCriteria.Tf, // Can also use TF-IDF or IDF
    },

    // chargram options
    CharFeatureExtractor = new WordBagEstimator.Options()
    {
        NgramLength = 3,
        UseAllLengths = false, // Produce only tri-chargrams and not single/double characters
        Weighting = NgramExtractingEstimator.WeightingCriteria.Tf, // Can also use TF-IDF or IDF
    },
};

// Featurization pipeline
var pipeline = mlContext.Transforms.Conversion.MapValueToKey("Label", "Label") // Needed for multi-class to convert string labels to the Key type
    
    // Create ngrams, and cleaned tokens for the word embedding
    .Append(mlContext.Transforms.Text.FeaturizeText("FeaturesText", featurizeTextOptions, new[] { "InputText" })) // Use above options object

    // Word embedding transform reads in the cleaned tokens from the text featurizer
    .Append(mlContext.Transforms.Text.ApplyWordEmbedding("FeaturesWordEmbedding", 
        "OutputTokens", WordEmbeddingEstimator.PretrainedModelKind.FastTextWikipedia300D))

    // Feature vector is the concatenation of the ngrams from the text transform, and the word embeddings
    .Append(mlContext.Transforms.Concatenate("Features", new[] { "FeaturesText", "FeaturesWordEmbedding" }))

    // Enable if numeric features are also included. Normalization is generally unneeded if only using the output from FeaturizeText, as it's row-wise normalized with an L2 norm; word embeddings are also well behaved.
    //.Append(mlContext.Transforms.NormalizeMinMax("Features", "Features"))

    // Cache the featurized dataset in memory for added speed
    .AppendCacheCheckpoint(mlContext);

// Trainer 
var trainer = mlContext.MulticlassClassification.Trainers.OneVersusAll(
        mlContext.BinaryClassification.Trainers.AveragedPerceptron(
            labelColumnName: "Label", numberOfIterations: 10, featureColumnName: "Features"),
        labelColumnName: "Label")
    .Append(mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel", "PredictedLabel"));
    .Append(mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel", "PredictedLabel"));

var trainingPipeline = pipeline.Append(trainer);

The code here shows a full example of FeaturizeText used with ApplyWordEmbedding. Specifically, it creates the tokens for ApplyWordEmbedding by removing numbers, keeping diacritics, and lowercasing the text to match how the fastText model was created. This text cleaning reduces out-of-vocabulary (OOV) misses in the word embedding. These options can be tuned for any specific dataset.
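For step 3 of the repro, wrapping the first .Fit() call in a Stopwatch is enough to time the download (a minimal sketch; `trainingData` is assumed to be an IDataView loaded elsewhere, and `trainingPipeline` is the pipeline built above):

```csharp
using System;
using System.Diagnostics;

// Time the first .Fit(); with a cold cache this includes the 6.6 GB model download.
var stopwatch = Stopwatch.StartNew();
var model = trainingPipeline.Fit(trainingData); // trainingData: IDataView, loaded elsewhere
stopwatch.Stop();
Console.WriteLine($"Fit() took {stopwatch.Elapsed} (includes model download on first use).");
```

On subsequent runs the cached wiki.en.vec is used, so the timing difference between a cold and warm cache isolates the download itself.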

Side note:
We should make a sample of FeaturizeText with ApplyWordEmbedding. I wrote the above since I couldn't locate one to link to in this issue.

Additional user report: #5450 (comment)

@justinormont justinormont added good first issue up-for-grabs NLP labels Dec 5, 2020
@antoniovs1029 antoniovs1029 added enhancement P2 labels Dec 28, 2020
pree-T commented Dec 31, 2021

I want to work on this. Can anyone help me?
