NLP & Languages

Darija Open Dataset (DODa)

About

DODa is the largest open-source dataset for Darija↔English translation, hosted on GitHub under the CC BY-NC 4.0 license. It contains 1,300+ nouns, 1,000+ verbs, and 45,000+ sentences, for 100,000+ entries in total. Entries are organized into subcategories such as food, animals, the human body, health, and education. The project aims to be the standard resource for Darija NLP research.

https://darija-open-dataset.github.io

Visit the site

In the same category

Goud-sum (HuggingFace) — Darija Summarization Dataset

158k articles with headlines from Goud.ma, forming a Darija/MSA text summarization dataset

MA_Open_Datasets — Goud.ma

Goud news articles in CSV format, an alternative distribution of the Goud data

MA_Open_Datasets — LeMatin

Le Matin newspaper articles by category: nation, économie, culture, sport

MA_Open_Datasets — MoroccoWorldNews

Morocco news articles dataset from MoroccoWorldNews