NLP & Language
Darija-Dataset-Builder — IlyasFardaouix
About
Scalable pipeline for building Moroccan Darija NLP datasets for LLM training. It provides tools and libraries for extracting, processing, and organizing Darija text data.
https://github.com/IlyasFardaouix/darija-dataset-builder
In the same category
Goud-sum (HuggingFace) — Darija Summarization Dataset
158k articles + headlines from Goud.ma — Darija/MSA text summarization dataset
Darija Open Dataset (DODa)
100k+ Darija↔English entries — largest open-source Darija translation dataset
MA_Open_Datasets — Goud.ma
Goud news articles in CSV format — alternative distribution of Goud data
MA_Open_Datasets — LeMatin
Le Matin newspaper articles by category — nation, economy, culture, sports