NLP & Language
Darija Open Dataset (DODa)
About
DODa is the largest open-source Darija↔English translation dataset on GitHub, released under CC BY-NC 4.0. It contains 100,000+ entries in total, including 1,300+ nouns, 1,000+ verbs, and 45,000+ sentences. Subcategories cover food, animals, body, health, and education. It is a standard resource for Darija NLP.
https://darija-open-dataset.github.io
In the same category
Goud-sum (HuggingFace) — Darija Summarization Dataset
158k articles + headlines from Goud.ma — Darija/MSA text summarization dataset
MA_Open_Datasets — Goud.ma
Goud news articles in CSV format — alternative distribution of Goud data
MA_Open_Datasets — LeMatin
Le Matin newspaper articles by category — nation, economy, culture, sports
MA_Open_Datasets — MoroccoWorldNews
Morocco news articles dataset from MoroccoWorldNews