NLP & Language
DarNERcorp — Named Entity Recognition in Moroccan Darija
About
DarNERcorp is a manually annotated named entity recognition (NER) corpus of 65,905 tokens in Moroccan Darija. It covers five entity categories: persons, locations, organizations, date/time, and miscellaneous. The corpus is available on Mendeley Data (V4) and was published in 2023 in Data in Brief.
https://data.mendeley.com/datasets/286sss4k9v/4
In the same category
Goud-sum (HuggingFace) — Darija Summarization Dataset
158k articles + headlines from Goud.ma — Darija/MSA text summarization dataset
Darija Open Dataset (DODa)
100k+ Darija↔English entries; largest open-source Darija translation dataset
MA_Open_Datasets — Goud.ma
Goud news articles in CSV format — alternative distribution of Goud data
MA_Open_Datasets — LeMatin
Le Matin newspaper articles by category — nation, economy, culture, sports