wikiann dataset is missing columns #2130

Open
dorost1234 opened this issue Mar 29, 2021 · 4 comments
dorost1234 commented Mar 29, 2021

Hi
The WikiANN dataset needs a "spans" column, which is necessary to be able to use this dataset, but this column is missing from the huggingface datasets version. Could you please have a look? Thank you @lhoestq

dorost1234 (Author) commented Mar 29, 2021

Here is the TFDS version of this dataset: https://www.tensorflow.org/datasets/catalog/wikiann
where there is a spans column. This is really necessary to be able to use the data, and I would appreciate your help @lhoestq

lhoestq (Member) commented Mar 29, 2021

Hi !
Apparently you can get the spans from the NER tags using tags_to_spans defined here:

https://github.com/tensorflow/datasets/blob/c7096bd38e86ed240b8b2c11ecab9893715a7d55/tensorflow_datasets/text/wikiann/wikiann.py#L81-L126

It would be nice to include the spans field in this dataset as in TFDS. This could be a good first issue for new contributors !

The objective is to use tags_to_spans in the _generate_examples method of the dataset script to create the spans for each example.

dorost1234 (Author) commented Mar 29, 2021

Hi @lhoestq
Thank you very much for the help, it would be very nice to have it included. Here is the full code; one also needs to convert the tags to strings first:

from datasets import load_dataset

def tags_to_spans(tags):
  """Convert tags to spans."""
  spans = set()
  span_start = 0
  span_end = 0
  active_conll_tag = None
  for index, string_tag in enumerate(tags):
    # Actual BIO tag.
    bio_tag = string_tag[0]
    assert bio_tag in ["B", "I", "O"], "Invalid Tag"
    conll_tag = string_tag[2:]
    if bio_tag == "O":
      # The span has ended.
      if active_conll_tag:
        spans.add((active_conll_tag, (span_start, span_end)))
      active_conll_tag = None
      # We don't care about tags we are
      # told to ignore, so we do nothing.
      continue
    elif bio_tag == "B":
      # We are entering a new span; reset indices and active tag to new span.
      if active_conll_tag:
        spans.add((active_conll_tag, (span_start, span_end)))
      active_conll_tag = conll_tag
      span_start = index
      span_end = index
    elif bio_tag == "I" and conll_tag == active_conll_tag:
      # We're inside a span.
      span_end += 1
    else:
      # This is the case the bio label is an "I", but either:
      # 1) the span hasn't started - i.e. an ill formed span.
      # 2) We have IOB1 tagging scheme.
      # We'll process the previous span if it exists, but also include this
      # span. This is important, because otherwise, a model may get a perfect
      # F1 score whilst still including false positive ill-formed spans.
      if active_conll_tag:
        spans.add((active_conll_tag, (span_start, span_end)))
      active_conll_tag = conll_tag
      span_start = index
      span_end = index
  # Last token might have been a part of a valid span.
  if active_conll_tag:
    spans.add((active_conll_tag, (span_start, span_end)))
  # Return sorted list of spans
  return sorted(list(spans), key=lambda x: x[1][0])

dataset = load_dataset("wikiann", "en", split="train")

# Map integer class labels back to their string NER tags.
ner_tags = {
  0: "O",
  1: "B-PER",
  2: "I-PER",
  3: "B-ORG",
  4: "I-ORG",
  5: "B-LOC",
  6: "I-LOC",
}

def get_spans(tokens, tags):
  """Convert tags to textspans."""
  spans = tags_to_spans(tags)
  text_spans = [
      x[0] + ": " + " ".join([tokens[i]
                              for i in range(x[1][0], x[1][1] + 1)])
      for x in spans
  ]
  if not text_spans:
    text_spans = ["None"]
  return text_spans


# Print the spans for the first few examples.
for i, d in enumerate(dataset):
  tokens = d["tokens"]
  # Convert integer labels to string tags before computing spans.
  tags = [ner_tags[tag] for tag in d["ner_tags"]]
  spans = get_spans(tokens, tags)
  print("spans ", spans)
  print(d)
  if i > 10:
    break

I am not sure how to contribute to the repository and how things work. Could you let me know how one can access the datasets to be able to contribute to the repository? Maybe I could do it then.
Thanks

lhoestq (Member) commented Mar 29, 2021

Cool ! Let me give you some context:

Contribution guide

You can find the contribution guide here:

https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md

It explains how to set up your dev environment in a few steps.

Dataset loading

Each dataset is defined by a table that has many rows (one row = one example) and columns (one column = one feature).
To change how a dataset is constructed, you have to modify its dataset script, which you can find here:

https://github.com/huggingface/datasets/blob/master/datasets/wikiann/wikiann.py

It includes everything needed to load the WikiANN dataset.
You can load a modified version of wikiann.py locally with load_dataset("path/to/wikiann.py").
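
For example, once you've edited your local copy, something like this lets you check the result ("path/to/wikiann.py" being a placeholder for wherever your modified script lives):

from datasets import load_dataset

# load the dataset from the modified local script instead of the released wikiann script
dataset = load_dataset("path/to/wikiann.py", "en", split="train")

print(dataset.features)  # the new "spans" feature should show up here
print(dataset[0])        # and each example should now contain a "spans" list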

Define a new column

Each column has a name and a type. You can see how the features of WikiANN are defined here:

features = datasets.Features(
    {
        "tokens": datasets.Sequence(datasets.Value("string")),
        "ner_tags": datasets.Sequence(
            datasets.features.ClassLabel(
                names=[
                    "O",
                    "B-PER",
                    "I-PER",
                    "B-ORG",
                    "I-ORG",
                    "B-LOC",
                    "I-LOC",
                ]
            )
        ),
        "langs": datasets.Sequence(datasets.Value("string")),
    }
)

Ideally we would have one additional feature "spans":

        "spans": datasets.Sequence(datasets.Value("string")),

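With that line added, the full features definition would look something like this (sketch, only the "spans" line is new):

features = datasets.Features(
    {
        "tokens": datasets.Sequence(datasets.Value("string")),
        "ner_tags": datasets.Sequence(
            datasets.features.ClassLabel(
                names=["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]
            )
        ),
        "langs": datasets.Sequence(datasets.Value("string")),
        "spans": datasets.Sequence(datasets.Value("string")),
    }
)
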
Compute the content of each row

To build the WikiANN rows, the _generate_examples method of the dataset script is used. This function yields one Python dictionary for each example:

yield guid_index, {"tokens": tokens, "ner_tags": ner_tags, "langs": langs}

The objective would be to instead yield something like

spans = get_spans(tokens, ner_tags)
yield guid_index, {"tokens": tokens, "ner_tags": ner_tags, "langs": langs, "spans": spans}
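
Note that inside the dataset script the ner_tags should still be the original string tags ("B-PER", "I-LOC", ...; the ClassLabel feature takes care of converting them to integers afterwards), so get_spans can be applied to them directly, without the integer-to-string mapping from your snippet above. A rough sketch with illustrative values (not actual dataset content), assuming tags_to_spans and get_spans are added as module-level helpers in wikiann.py:

# inside _generate_examples, for an accumulated example such as
tokens = ["John", "Smith", "lives", "in", "Paris"]
ner_tags = ["B-PER", "I-PER", "O", "O", "B-LOC"]
langs = ["en"] * len(tokens)

spans = get_spans(tokens, ner_tags)  # -> ["PER: John Smith", "LOC: Paris"]
yield guid_index, {"tokens": tokens, "ner_tags": ner_tags, "langs": langs, "spans": spans}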

Let me know if you have questions !
