The Wayback Machine - https://web.archive.org/web/20210829163006/https://github.com/topics/nlp-datasets
Here are 99 public repositories matching this topic...
curated collection of papers for the nlp practitioner 📖 👩‍🔬
Open linguistic datasets: a Russian sentiment lexicon, a semantics dataset, an associative graph, and a dataset of spelling errors and typos.
multi_task_NLP is a utility toolkit enabling NLP developers to easily train and infer a single model for multiple tasks.
Updated
Aug 25, 2021
Python
Chinese and English NER datasets, English-Chinese machine translation datasets, and a Chinese word segmentation dataset.
Updated
Feb 3, 2021
Python
Implementation of Very Deep Convolutional Neural Network for Text Classification
Updated
Jan 3, 2021
Python
TriggerNER: Learning with Entity Triggers as Explanations for Named Entity Recognition (ACL 2020)
Updated
Jul 22, 2021
Python
UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language
Updated
Jul 17, 2021
Python
A Constrained Text Generation Challenge Towards Generative Commonsense Reasoning
Updated
Jun 8, 2021
Python
What Twitter reveals about the differences between cities and the monoculture of the Bay Area
Updated
May 31, 2019
Jupyter Notebook
The release of the FreebaseQA data set (NAACL 2019).
Code and data for "Summarising Historical Text in Modern Languages" (EACL 2021)
Updated
Apr 22, 2021
Jupyter Notebook
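Manually curated corpus of medical-industry vocabulary and terminology. Can be used to train NLP models for speech recognition, dialogue systems, and similar tasks.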
Extracts Transcript and Summary (Abstractive and Extractive) from the AMI Meeting Corpus
Updated
Dec 4, 2019
Python
Bothub is an open platform for predicting, training and sharing NLP datasets in multiple languages
Updated
Aug 10, 2021
Makefile
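Chinese character dataset, including information about each character such as stroke count, radical, pinyin, and English definitions/synonyms.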
Turkish writings dataset that promotes creativity, content, composition, grammar, spelling and punctuation.
Updated
Feb 4, 2018
Jupyter Notebook
Reading the data from OPIEC - an Open Information Extraction corpus
Updated
Jun 12, 2019
Java
Model training, custom generative function and training for raplyrics.eu - A rap music lyrics generation project
Updated
Oct 20, 2019
Python
datasets with text data for use in NLP, Text analysis, information extraction, ML research.
Updated
Feb 1, 2019
Jupyter Notebook
Code Repo for the ACL21 paper "Common Sense Beyond English: Evaluating and Improving Multilingual LMs for Commonsense Reasoning"
Updated
Aug 15, 2021
Python
Question Answering System using BiDAF Model on SQuAD v2.0
Updated
Sep 2, 2020
Python
Implementation of the semi-structured inference model in our ACL 2020 paper, INFOTABS: Inference on Tables as Semi-structured Data.
Updated
Jul 15, 2020
Python
Updated
Oct 13, 2020
Java
Library for generation of Russian names
Updated
Apr 23, 2019
Python
The Mueller Report Corpus V 0.1
Loads OpenSubtitles v2018 dataset without having to load everything into memory at once. Works well with pytorch.
Updated
Aug 26, 2020
Python
English loanwords in Japanese
Updated
Mar 31, 2021
Python
Open Finnish NLP datasets
Rather than the current system, in which each sub-corpus has its own folder with its own code, create a top-level downloads.sh which can re-assemble the sub-corpora. Separately, have the downloaded and pre-processed sub-corpora ready to be referenced from the ADR and NMT repos as submodules, etc.