wikiann dataset is missing columns #2130
Comments
Here is the TFDS version of this dataset: https://www.tensorflow.org/datasets/catalog/wikiann
Hi ! It would be nice to include the The objective is to use |
Hi @lhoestq
I am not sure how to contribute to the repository or how things work. Could you let me know how to access the datasets so that I can contribute? Maybe I could do it then.
Cool ! Let me give you some context:

Contribution guide
You can find the contribution guide here: https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md
It explains how to set up your dev environment in a few steps.

Dataset loading
Each Dataset is defined by a Table that has many rows (one row = one example) and columns (one column = one feature). The WikiANN dataset script is here: https://github.com/huggingface/datasets/blob/master/datasets/wikiann/wikiann.py
It includes everything needed to load the WikiANN dataset.

Define a new column
Each column has a name and a type. You can see how the features of WikiANN are defined in datasets/datasets/wikiann/wikiann.py, lines 245 to 263 in c98e4b8. Ideally we would have one additional feature "spans":

    "spans": datasets.Sequence(datasets.Value("string")),

Compute the content of each row
To build the WikiANN rows, the _generate_examples method is used. This function currently yields:

    yield guid_index, {"tokens": tokens, "ner_tags": ner_tags, "langs": langs}

The objective would be to return instead something like:

    spans = get_spans(tokens, ner_tags)
    yield guid_index, {"tokens": tokens, "ner_tags": ner_tags, "langs": langs, "spans": spans}

Let me know if you have questions !
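For illustration, here is a minimal sketch of what a get_spans helper could look like. It is not the implementation from the repository: it assumes the tags are IOB2 strings (B-PER, I-PER, O, ...) and that each span is rendered as a "TYPE: entity text" string, and both the helper name and that output format are assumptions made for this example.

```python
from typing import List, Optional


def get_spans(tokens: List[str], tags: List[str]) -> List[str]:
    """Group IOB2-tagged tokens into "TYPE: entity text" strings.

    Hypothetical helper: the output format is an assumption, not taken
    from the WikiANN loading script.
    """
    spans: List[str] = []
    current_type: Optional[str] = None  # entity type of the span being built
    current_tokens: List[str] = []      # tokens collected for that span

    def flush() -> None:
        # Emit the span collected so far (if any) and reset the state.
        nonlocal current_type, current_tokens
        if current_type is not None:
            spans.append(f"{current_type}: {' '.join(current_tokens)}")
        current_type, current_tokens = None, []

    for token, tag in zip(tokens, tags):
        if tag == "O":
            # Outside any entity: close the open span, if there is one.
            flush()
            continue
        prefix, _, entity_type = tag.partition("-")
        if prefix == "B" or entity_type != current_type:
            # A new entity starts here (this also covers an I- tag that
            # has no matching B- tag before it).
            flush()
            current_type = entity_type
        current_tokens.append(token)

    flush()  # close a span that runs to the end of the sentence
    return spans


example = get_spans(
    ["Barack", "Obama", "visited", "Paris"],
    ["B-PER", "I-PER", "O", "B-LOC"],
)
print(example)  # ['PER: Barack Obama', 'LOC: Paris']
```

The result is a list of strings, so it fits a column declared as a sequence of string values.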
Hi
The WikiANN dataset needs a "spans" column, which is necessary to be able to use the dataset, but this column is missing from the Hugging Face datasets version. Could you please have a look? Thank you @lhoestq