Add IC, SI, ER tasks to SUPERB #2884

anton-l · 2021-09-09T11:56:03Z

This PR adds three additional classification tasks to SUPERB:

Intent Classification

Dataset URL seems to be down at the moment :( See the note below.
S3PRL source: https://github.com/s3prl/s3prl/blob/master/s3prl/downstream/fluent_commands/dataset.py
Instructions: https://github.com/s3prl/s3prl/tree/master/s3prl/downstream#ic-intent-classification---fluent-speech-commands

Speaker Identification

Manual download script:

mkdir VoxCeleb1
cd VoxCeleb1
            
wget https://thor.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partaa
wget https://thor.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partab
wget https://thor.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partac
wget https://thor.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partad
cat vox1_dev* > vox1_dev_wav.zip
unzip vox1_dev_wav.zip
            
wget https://thor.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_test_wav.zip
unzip vox1_test_wav.zip
            
# download the official SUPERB train-dev-test split
wget https://raw.githubusercontent.com/s3prl/s3prl/master/s3prl/downstream/voxceleb1/veri_test_class.txt

S3PRL source: https://github.com/s3prl/s3prl/blob/master/s3prl/downstream/voxceleb1/dataset.py
Instructions: https://github.com/s3prl/s3prl/tree/master/s3prl/downstream#sid-speaker-identification
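
The split file fetched in the last step of the script can be read with a few lines of Python. This is only a minimal sketch: it assumes each line of veri_test_class.txt looks like `<split_id> <speaker_id>/<video_id>/<clip>.wav`, with split ids 1, 2 and 3 standing for train, dev and test. That layout, the `SPLIT_NAMES` mapping, the sample lines and the helper name `read_split` are assumptions made for illustration and are not confirmed anywhere in this thread.

```python
from collections import defaultdict

# Assumed meaning of the numeric split id at the start of each line.
SPLIT_NAMES = {1: "train", 2: "dev", 3: "test"}


def read_split(lines):
    """Group (speaker_id, relative_path) pairs by split name.

    `lines` is any iterable of strings, e.g. an open file handle.
    """
    splits = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        split_id, rel_path = line.split(maxsplit=1)
        # Assumption: the first path component is the speaker id.
        speaker_id = rel_path.split("/")[0]
        splits[SPLIT_NAMES[int(split_id)]].append((speaker_id, rel_path))
    return dict(splits)


# Hypothetical sample lines in the assumed format.
sample = [
    "1 id10001/1zcIwhmdeo4/00001.wav",
    "2 id10001/7gWzIy6yIIk/00002.wav",
    "3 id10002/0_laIeN-Q44/00001.wav",
]
splits = read_split(sample)
```

With the real file, the same helper could be called as `read_split(open("veri_test_class.txt", encoding="utf-8"))`.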

Emotion Recognition

Manual download requires going through a slow application process; see the note below.
S3PRL source: https://github.com/s3prl/s3prl/blob/master/s3prl/downstream/emotion/IEMOCAP_preprocess.py
Instructions: https://github.com/s3prl/s3prl/tree/master/s3prl/downstream#er-emotion-recognition

⚠️ Note

These datasets either require manual downloads or have broken/unstable links. You can get all necessary archives in this repo: https://huggingface.co/datasets/anton-l/superb_source_data_dumps/tree/main

anton-l · 2021-09-09T11:57:50Z

Sorry for the late PR, uploading 10+GB files to the hub through a VPN was an adventure 😅

lewtun · 2021-09-09T14:18:24Z

Thank you so much for adding these subsets @anton-l!

> These datasets either require manual downloads or have broken/unstable links. You can get all necessary archives in this repo: https://huggingface.co/datasets/anton-l/superb_source_data_dumps/tree/main

Are we allowed to make these datasets public or would that violate the terms of their use?

anton-l · 2021-09-09T15:12:26Z

@lewtun These ones all have non-permissive licences, so the mirrored data I linked is open only to the HF org for now. But we could try contacting the authors to ask if they'd like to host these with us.
For example, VoxCeleb1 now has direct links (the ones in the script) that don't require form submission and passwords, but they ban IPs after each download for some reason :(

lewtun

Thank you for adding these subsets @anton-l - a very clean and elegant implementation 🥳 !

Let's wait for @lhoestq or @albertvillanova to give this a pass before we merge :)

lewtun · 2021-09-13T09:16:09Z

> @lewtun These ones all have non-permissive licences, so the mirrored data I linked is open only to the HF org for now. But we could try contacting the authors to ask if they'd like to host these with us.
> For example VoxCeleb1 now has direct links (the ones in the script) that don't require form submission and passwords, but they ban IPs after each download for some reason :(

I think there would be a lot of value added if the authors would be willing to host their data on the HF Hub! As an end-user of datasets, I've found I'm more likely to explore a dataset if I'm able to quickly pull the subsets without needing a manual download. Perhaps we can tell them that the Hub offers several advantages like versioning and interactive exploration (with datasets-viewer)?

lhoestq

Cool, thanks! It all looks good. I just added some comments about the labels in the dataset card.

lhoestq · 2021-09-16T16:58:54Z

datasets/superb/superb.py

+            mkdir VoxCeleb1
+            cd VoxCeleb1
+
+            wget https://thor.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partaa
+            wget https://thor.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partab
+            wget https://thor.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partac
+            wget https://thor.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partad
+            cat vox1_dev* > vox1_dev_wav.zip
+            unzip vox1_dev_wav.zip
+
+            wget https://thor.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_test_wav.zip
+            unzip vox1_test_wav.zip
+
+            # download the official SUPERB train-dev-test split
+            wget https://raw.githubusercontent.com/s3prl/s3prl/master/s3prl/downstream/voxceleb1/veri_test_class.txt


Windows users may not like this xD

anton-l · 2021-09-17

Can't wait to remove this if the VoxCeleb authors agree to move the files to the Hub haha

Add IC, SI, ER

528a504

anton-l requested review from albertvillanova and lewtun Sep 9, 2021

Style checks

561e2bb

lewtun approved these changes Sep 13, 2021

patrickvonplaten requested a review from lhoestq Sep 13, 2021

anton-l added 2 commits Sep 15, 2021

Explain VoxCeleb labels

c28e572

Add other examples for KS clips

36be907

lhoestq reviewed Sep 16, 2021

Add notes on ClassLabel values

fb11ba5

anton-l merged commit 830b997 into huggingface:master Sep 20, 2021
6 checks passed

anton-l deleted the add-superb-classification branch Sep 20, 2021

albertvillanova mentioned this pull request Oct 4, 2021

Fix Windows paths in SUPERB benchmark datasets #3009

Merged
