wikiann dataset is missing columns #2130

Open
dorost1234 opened this issue Mar 29, 2021 · 4 comments
dorost1234 commented Mar 29, 2021

Hi
The WikiANN dataset needs a "spans" column, which is necessary to be able to use this dataset, but this column is missing from the huggingface datasets version. Could you please have a look? Thank you @lhoestq

dorost1234 (Author) commented Mar 29, 2021

Here is the TFDS version of this dataset: https://www.tensorflow.org/datasets/catalog/wikiann
where there is a spans column. This is really necessary to be able to use the data, and I would appreciate your help @lhoestq

lhoestq (Member) commented Mar 29, 2021

Hi !
Apparently you can get the spans from the NER tags using tags_to_spans defined here:

https://github.com/tensorflow/datasets/blob/c7096bd38e86ed240b8b2c11ecab9893715a7d55/tensorflow_datasets/text/wikiann/wikiann.py#L81-L126

It would be nice to include the spans field in this dataset as in TFDS. This could be a good first issue for new contributors !

The objective is to use tags_to_spans in the _generate_examples method of the dataset script to create the spans for each example.

dorost1234 (Author) commented Mar 29, 2021

Hi @lhoestq
Thank you very much for the help, it would be very nice to have it included. Here is the full code; one also needs to convert the tags to strings first:

from datasets import load_dataset

def tags_to_spans(tags):
  """Convert tags to spans."""
  spans = set()
  span_start = 0
  span_end = 0
  active_conll_tag = None
  for index, string_tag in enumerate(tags):
    # Actual BIO tag.
    bio_tag = string_tag[0]
    assert bio_tag in ["B", "I", "O"], "Invalid Tag"
    conll_tag = string_tag[2:]
    if bio_tag == "O":
      # The span has ended.
      if active_conll_tag:
        spans.add((active_conll_tag, (span_start, span_end)))
      active_conll_tag = None
      # We don't care about tags we are
      # told to ignore, so we do nothing.
      continue
    elif bio_tag == "B":
      # We are entering a new span; reset indices and active tag to new span.
      if active_conll_tag:
        spans.add((active_conll_tag, (span_start, span_end)))
      active_conll_tag = conll_tag
      span_start = index
      span_end = index
    elif bio_tag == "I" and conll_tag == active_conll_tag:
      # We're inside a span.
      span_end += 1
    else:
      # This is the case the bio label is an "I", but either:
      # 1) the span hasn't started - i.e. an ill formed span.
      # 2) We have IOB1 tagging scheme.
      # We'll process the previous span if it exists, but also include this
      # span. This is important, because otherwise, a model may get a perfect
      # F1 score whilst still including false positive ill-formed spans.
      if active_conll_tag:
        spans.add((active_conll_tag, (span_start, span_end)))
      active_conll_tag = conll_tag
      span_start = index
      span_end = index
  # Last token might have been a part of a valid span.
  if active_conll_tag:
    spans.add((active_conll_tag, (span_start, span_end)))
  # Return sorted list of spans
  return sorted(list(spans), key=lambda x: x[1][0])

dataset = load_dataset("wikiann", "en", split="train")

# Map integer class labels back to their string NER tags.
ner_tags = {
  0: "O",
  1: "B-PER",
  2: "I-PER",
  3: "B-ORG",
  4: "I-ORG",
  5: "B-LOC",
  6: "I-LOC",
}

def get_spans(tokens, tags):
  """Convert tags to textspans."""
  spans = tags_to_spans(tags)
  text_spans = [
      x[0] + ": " + " ".join([tokens[i]
                              for i in range(x[1][0], x[1][1] + 1)])
      for x in spans
  ]
  if not text_spans:
    text_spans = ["None"]
  return text_spans


# Print the spans for the first few examples.
for i, d in enumerate(dataset):
  tokens = d["tokens"]
  # Convert integer labels to string tags before computing spans.
  tags = [ner_tags[tag] for tag in d["ner_tags"]]
  spans = get_spans(tokens, tags)
  print("spans ", spans)
  print(d)
  if i > 10:
    break

I am not sure how to contribute to the repository and how things work. Could you let me know how one can access the datasets to be able to contribute to the repository? Maybe I could do it then.
Thanks

lhoestq (Member) commented Mar 29, 2021

Cool ! Let me give you some context:

Contribution guide

You can find the contribution guide here:

https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md

It explains how to set up your dev environment in a few steps.

Dataset loading

Each dataset is defined by a table that has many rows (one row = one example) and columns (one column = one feature).
To change how a dataset is constructed, you have to modify its dataset script, which you can find here:

https://github.com/huggingface/datasets/blob/master/datasets/wikiann/wikiann.py

It includes everything needed to load the WikiANN dataset.
You can load a modified version of wikiann.py locally with load_dataset("path/to/wikiann.py").
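
For example, once you've edited your local copy, something like this lets you check the result ("path/to/wikiann.py" being a placeholder for wherever your modified script lives):

from datasets import load_dataset

# load the dataset from the modified local script instead of the released wikiann script
dataset = load_dataset("path/to/wikiann.py", "en", split="train")

print(dataset.features)  # the new "spans" feature should show up here
print(dataset[0])        # and each example should now contain a "spans" list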

Define a new column

Each column has a name and a type. You can see how the features of WikiANN are defined here:

features = datasets.Features(
    {
        "tokens": datasets.Sequence(datasets.Value("string")),
        "ner_tags": datasets.Sequence(
            datasets.features.ClassLabel(
                names=[
                    "O",
                    "B-PER",
                    "I-PER",
                    "B-ORG",
                    "I-ORG",
                    "B-LOC",
                    "I-LOC",
                ]
            )
        ),
        "langs": datasets.Sequence(datasets.Value("string")),
    }
)

Ideally we would have one additional feature "spans":

        "spans": datasets.Sequence(datasets.Value("string")),

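With that line added, the full features definition would look something like this (sketch, only the "spans" line is new):

features = datasets.Features(
    {
        "tokens": datasets.Sequence(datasets.Value("string")),
        "ner_tags": datasets.Sequence(
            datasets.features.ClassLabel(
                names=["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]
            )
        ),
        "langs": datasets.Sequence(datasets.Value("string")),
        "spans": datasets.Sequence(datasets.Value("string")),
    }
)
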
Compute the content of each row

To build the WikiANN rows, the _generate_examples method of the dataset script is used. This function yields one Python dictionary for each example:

yield guid_index, {"tokens": tokens, "ner_tags": ner_tags, "langs": langs}

The objective would be to instead yield something like

spans = get_spans(tokens, ner_tags)
yield guid_index, {"tokens": tokens, "ner_tags": ner_tags, "langs": langs, "spans": spans}
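
Note that inside the dataset script the ner_tags should still be the original string tags ("B-PER", "I-LOC", ...; the ClassLabel feature takes care of converting them to integers afterwards), so get_spans can be applied to them directly, without the integer-to-string mapping from your snippet above. A rough sketch with illustrative values (not actual dataset content), assuming tags_to_spans and get_spans are added as module-level helpers in wikiann.py:

# inside _generate_examples, for an accumulated example such as
tokens = ["John", "Smith", "lives", "in", "Paris"]
ner_tags = ["B-PER", "I-PER", "O", "O", "B-LOC"]
langs = ["en"] * len(tokens)

spans = get_spans(tokens, ner_tags)  # -> ["PER: John Smith", "LOC: Paris"]
yield guid_index, {"tokens": tokens, "ner_tags": ner_tags, "langs": langs, "spans": spans}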

Let me know if you have questions !
