The RedPajama-Data repository by Together Computer provides datasets for pretraining large language models (LLMs). It is inspired by the training data recipe described in Meta AI's LLaMA paper and aims to provide open, high-quality training data for foundation models.
Key Features:
- Open Datasets: RedPajama reproduces the LLaMA training data composition using publicly available sources.
- Diverse Sources: The RedPajama-1T dataset draws on seven slices: Common Crawl, C4, GitHub, books, arXiv, Wikipedia, and Stack Exchange.
- Preprocessing Pipelines: The repository contains scripts for cleaning, deduplicating, and tokenizing text (a minimal sketch of such a step follows this list).
- Data Format & Access: The datasets are available in processed and raw forms, with per-document metadata that supports filtering; see the loading example under Usage below.
Usage:
Researchers and organizations can use RedPajama-Data to pretrain their own LLMs while ensuring transparency and reproducibility.
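As a hedged sketch of access, the dataset slices are published on the Hugging Face Hub and can be streamed with the `datasets` library. The hub id `togethercomputer/RedPajama-Data-1T`, the `arxiv` config name, and the `text` record field used below are assumptions based on the hub listings and should be verified against the repository's documentation.

```python
from datasets import load_dataset

# Stream one slice rather than materializing terabytes of data on disk.
# trust_remote_code may be required for script-based datasets.
ds = load_dataset(
    "togethercomputer/RedPajama-Data-1T",  # hub id; verify against the repo
    "arxiv",                               # config name is an assumption
    split="train",
    streaming=True,
    trust_remote_code=True,
)

# Inspect a few records; each is expected to carry a "text" field plus metadata.
for i, example in enumerate(ds):
    print(example["text"][:200])
    if i == 2:
        break
```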