huggingface / datasets Public
Add IC, SI, ER tasks to SUPERB #2884
Sorry for the late PR, uploading 10+GB files to the hub through a VPN was an adventure.
Thank you so much for adding these subsets @anton-l!
@lewtun These ones all have non-permissive licences, so the mirrored data I linked is open only to the HF org for now. But we could try contacting the authors to ask if they'd like to host these with us.
Thank you for adding these subsets @anton-l - a very clean and elegant implementation
Let's wait for @lhoestq or @albertvillanova to give this a pass, before we merge :)
I think there would be a lot of value added if the authors would be willing to host their data on the HF Hub! As an end-user of
Cool, thanks! It all looks good. I just added some comments about the labels in the dataset card.
mkdir VoxCeleb1
cd VoxCeleb1
wget https://thor.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partaa
wget https://thor.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partab
wget https://thor.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partac
wget https://thor.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partad
cat vox1_dev* > vox1_dev_wav.zip
unzip vox1_dev_wav.zip
wget https://thor.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_test_wav.zip
unzip vox1_test_wav.zip
# download the official SUPERB train-dev-test split
wget https://raw.githubusercontent.com/s3prl/s3prl/master/s3prl/downstream/voxceleb1/veri_test_class.txt
Windows users may not like this xD
Can't wait to remove this if the VoxCeleb authors agree to move the files to the Hub haha
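In the meantime, the `cat` + `unzip` steps of the download script could be done portably with the Python standard library. This is only a sketch, not part of the PR: the helper names `join_parts` and `join_and_extract` are hypothetical, and it assumes the part files have already been downloaded into one directory. It globs `vox1_dev_wav_part*` rather than `vox1_dev*`, so a previously joined zip is not picked up again on a re-run.

```python
import shutil
import zipfile
from pathlib import Path


def join_parts(part_paths, out_path):
    """Concatenate split archive parts, in the given order, into one file."""
    with open(out_path, "wb") as out:
        for part in part_paths:
            with open(part, "rb") as src:
                # Stream in chunks so multi-GB parts are never held in memory.
                shutil.copyfileobj(src, out)
    return Path(out_path)


def join_and_extract(data_dir, prefix="vox1_dev_wav_part", zip_name="vox1_dev_wav.zip"):
    """Join all `<prefix>*` parts found in data_dir into a zip and extract it there."""
    data_dir = Path(data_dir)
    # Lexicographic order matches the shell glob: partaa, partab, partac, ...
    parts = sorted(data_dir.glob(prefix + "*"))
    if not parts:
        raise FileNotFoundError(f"no files matching {prefix}* in {data_dir}")
    archive = join_parts(parts, data_dir / zip_name)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(data_dir)
    return archive
```

Calling `join_and_extract("VoxCeleb1")` after fetching the four parts would stand in for the `cat` and first `unzip` lines; the test archive is already a single zip and only needs extracting.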
This PR adds 3 additional classification tasks to SUPERB
Intent Classification
Dataset URL seems to be down at the moment :( See the note below.
S3PRL source: https://github.com/s3prl/s3prl/blob/master/s3prl/downstream/fluent_commands/dataset.py
Instructions: https://github.com/s3prl/s3prl/tree/master/s3prl/downstream#ic-intent-classification---fluent-speech-commands
Speaker Identification
Manual download script:
S3PRL source: https://github.com/s3prl/s3prl/blob/master/s3prl/downstream/voxceleb1/dataset.py
Instructions: https://github.com/s3prl/s3prl/tree/master/s3prl/downstream#sid-speaker-identification
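For reference, a rough sketch of how the `veri_test_class.txt` split file could be turned into train/validation/test lists. This rests on assumptions not stated in the PR: that each line has the form `<split_id> <speaker_id>/<video_id>/<file>.wav`, and that split ids 1, 2, 3 mean train, dev, test, which is how I understand the S3PRL loader to read it. The function names and the sample lines in the usage note are made up for illustration.

```python
from collections import defaultdict

# Assumed mapping of the numeric split id in the first column.
SPLIT_NAMES = {1: "train", 2: "validation", 3: "test"}


def read_split_file(lines):
    """Group relative wav paths by split name, skipping blank lines."""
    splits = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        split_id, rel_path = line.split(maxsplit=1)
        splits[SPLIT_NAMES[int(split_id)]].append(rel_path)
    return dict(splits)


def speaker_id(rel_path):
    """Return the speaker label, assumed to be the first path component."""
    return rel_path.split("/")[0]
```

With hypothetical input such as `["1 id10001/abc/00001.wav", "3 id10002/xyz/00002.wav"]`, `read_split_file` would put the first path under `"train"` and the second under `"test"`, and `speaker_id` would give `id10001` and `id10002` as labels.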
Emotion Recognition
Manual download requires going through a slow application process, see the note below.
S3PRL source: https://github.com/s3prl/s3prl/blob/master/s3prl/downstream/emotion/IEMOCAP_preprocess.py
Instructions: https://github.com/s3prl/s3prl/tree/master/s3prl/downstream#er-emotion-recognition
These datasets either require manual downloads or have broken/unstable links. You can get all necessary archives in this repo: https://huggingface.co/datasets/anton-l/superb_source_data_dumps/tree/main