Add IC, SI, ER tasks to SUPERB #2884

Merged
anton-l merged 5 commits into huggingface:master from add-superb-classification on Sep 20, 2021

anton-l (Member) commented Sep 9, 2021

This PR adds three additional classification tasks to SUPERB:

Intent Classification

Dataset URL seems to be down at the moment :( See the note below.
S3PRL source: https://github.com/s3prl/s3prl/blob/master/s3prl/downstream/fluent_commands/dataset.py
Instructions: https://github.com/s3prl/s3prl/tree/master/s3prl/downstream#ic-intent-classification---fluent-speech-commands

Speaker Identification

Manual download script:

mkdir VoxCeleb1
cd VoxCeleb1

wget https://thor.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partaa
wget https://thor.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partab
wget https://thor.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partac
wget https://thor.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partad
cat vox1_dev* > vox1_dev_wav.zip
unzip vox1_dev_wav.zip

wget https://thor.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_test_wav.zip
unzip vox1_test_wav.zip

# download the official SUPERB train-dev-test split
wget https://raw.githubusercontent.com/s3prl/s3prl/master/s3prl/downstream/voxceleb1/veri_test_class.txt

S3PRL source: https://github.com/s3prl/s3prl/blob/master/s3prl/downstream/voxceleb1/dataset.py
Instructions: https://github.com/s3prl/s3prl/tree/master/s3prl/downstream#sid-speaker-identification
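
Editor's sketch: the split file veri_test_class.txt fetched by the script above assigns each utterance to a train, dev, or test split. The snippet below shows one way to read it. It assumes each line has the form "<split_id> <relative wav path>" with 1 = train, 2 = dev, 3 = test, and that the first path component is the speaker ID; both are assumptions to check against the downloaded file, not something this PR states. The names read_split_file and speaker_id are hypothetical helpers, not part of datasets or S3PRL.

```python
from collections import defaultdict

# Assumption: split IDs map to split names like this (verify against the file).
SPLIT_NAMES = {"1": "train", "2": "dev", "3": "test"}


def read_split_file(path):
    """Group wav paths by split, assuming lines of '<split_id> <wav_path>'."""
    splits = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            split_id, wav_path = line.split(maxsplit=1)
            splits[SPLIT_NAMES[split_id]].append(wav_path)
    return dict(splits)


def speaker_id(wav_path):
    """Assumption: the first path component names the speaker, e.g. 'id10001'."""
    return wav_path.split("/")[0]
```

An unknown split ID raises a KeyError here on purpose, so a format mismatch is noticed early instead of being silently dropped.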

Emotion Recognition

Manual download requires going through a slow application process, see the note below.
S3PRL source: https://github.com/s3prl/s3prl/blob/master/s3prl/downstream/emotion/IEMOCAP_preprocess.py
Instructions: https://github.com/s3prl/s3prl/tree/master/s3prl/downstream#er-emotion-recognition

⚠️ Note

These datasets either require manual downloads or have broken/unstable links. You can get all necessary archives in this repo: https://huggingface.co/datasets/anton-l/superb_source_data_dumps/tree/main

anton-l requested review from albertvillanova and lewtun on Sep 9, 2021
anton-l (Member, Author) commented Sep 9, 2021

Sorry for the late PR, uploading 10+GB files to the hub through a VPN was an adventure 😅

lewtun (Member) commented Sep 9, 2021

Thank you so much for adding these subsets @anton-l!

> These datasets either require manual downloads or have broken/unstable links. You can get all necessary archives in this repo: https://huggingface.co/datasets/anton-l/superb_source_data_dumps/tree/main

Are we allowed to make these datasets public or would that violate the terms of their use?

anton-l (Member, Author) commented Sep 9, 2021

@lewtun These ones all have non-permissive licences, so the mirrored data I linked is open only to the HF org for now. But we could try contacting the authors to ask if they'd like to host these with us.
For example VoxCeleb1 now has direct links (the ones in the script) that don't require form submission and passwords, but they ban IPs after each download for some reason :(

lewtun (Member) approved these changes Sep 13, 2021

Thank you for adding these subsets @anton-l - a very clean and elegant implementation 🥳 !

Let's wait for @lhoestq or @albertvillanova to give this a pass, before we merge :)

patrickvonplaten requested a review from lhoestq on Sep 13, 2021
lewtun (Member) commented Sep 13, 2021

> @lewtun These ones all have non-permissive licences, so the mirrored data I linked is open only to the HF org for now. But we could try contacting the authors to ask if they'd like to host these with us.
> For example VoxCeleb1 now has direct links (the ones in the script) that don't require form submission and passwords, but they ban IPs after each download for some reason :(

I think there would be a lot of value added if the authors would be willing to host their data on the HF Hub! As an end-user of datasets, I've found I'm more likely to explore a dataset if I'm able to quickly pull the subsets without needing a manual download. Perhaps we can tell them that the Hub offers several advantages like versioning and interactive exploration (with datasets-viewer)?

lhoestq (Member) left a comment

Cool thanks ! It looks all good - I just added some comments about the labels in the dataset card

datasets/superb/README.md
mkdir VoxCeleb1
cd VoxCeleb1
wget https://thor.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partaa
wget https://thor.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partab
wget https://thor.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partac
wget https://thor.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partad
cat vox1_dev* > vox1_dev_wav.zip
unzip vox1_dev_wav.zip
wget https://thor.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_test_wav.zip
unzip vox1_test_wav.zip
# download the official SUPERB train-dev-test split
wget https://raw.githubusercontent.com/s3prl/s3prl/master/s3prl/downstream/voxceleb1/veri_test_class.txt
lhoestq (Member) commented Sep 16, 2021

Windows users may not like this xD

anton-l (Member, Author) commented Sep 17, 2021

Can't wait to remove this if the VoxCeleb authors agree to move the files to the Hub haha
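
Editor's sketch: until the files can be pulled from the Hub, the `cat` and `unzip` steps of the script can be reproduced without Unix tools using only the Python standard library. The snippet assumes the four vox1_dev_wav_parta? files and vox1_test_wav.zip have already been downloaded into one directory under the same names as in the script; downloading is left out on purpose. The function names join_parts and extract_voxceleb1 are hypothetical, not part of datasets.

```python
import shutil
import zipfile
from pathlib import Path


def join_parts(part_paths, output_path):
    """Concatenate split archive parts in the given order (the `cat` step)."""
    with open(output_path, "wb") as out:
        for part in part_paths:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)


def extract_voxceleb1(data_dir):
    """Rebuild vox1_dev_wav.zip from its parts, then unpack both archives."""
    data_dir = Path(data_dir)
    # Sorting gives partaa, partab, partac, partad, the order `cat` would use.
    parts = sorted(data_dir.glob("vox1_dev_wav_parta?"))
    dev_zip = data_dir / "vox1_dev_wav.zip"
    join_parts(parts, dev_zip)
    for archive in (dev_zip, data_dir / "vox1_test_wav.zip"):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(data_dir)
```

Calling extract_voxceleb1("VoxCeleb1") would then stand in for the `cat` and both `unzip` lines. The glob only matches the part files, so rerunning it does not feed the rebuilt zip back into itself.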

anton-l merged commit 830b997 into huggingface:master on Sep 20, 2021
6 checks passed
anton-l deleted the add-superb-classification branch on Sep 20, 2021
4 participants