The RedPajama-Data repository by Together Computer provides datasets for pretraining large language models (LLMs). It is inspired by the training data recipe described in Meta AI's LLaMA paper and aims to provide open, high-quality training data for foundation models.
Key Features:
- Open Datasets: RedPajama reproduces the LLaMA training data composition using publicly available sources.
- Diverse Sources: The RedPajama-1T dataset draws on seven slices: Common Crawl, C4, GitHub, books, arXiv, Wikipedia, and Stack Exchange.
- Preprocessing Pipelines: The repository contains scripts for cleaning, deduplicating, and tokenizing text (a minimal sketch of such a step follows this list).
- Data Format & Access: The datasets are available in processed and raw forms, with per-document metadata that supports filtering; see the loading example under Usage below.
Usage:
Researchers and organizations can use RedPajama-Data to pretrain their own LLMs while ensuring transparency and reproducibility.
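As a hedged sketch of access, the dataset slices are published on the Hugging Face Hub and can be streamed with the `datasets` library. The hub id `togethercomputer/RedPajama-Data-1T`, the `arxiv` config name, and the `text` record field used below are assumptions based on the hub listings and should be verified against the repository's documentation.

```python
from datasets import load_dataset

# Stream one slice rather than materializing terabytes of data on disk.
# trust_remote_code may be required for script-based datasets.
ds = load_dataset(
    "togethercomputer/RedPajama-Data-1T",  # hub id; verify against the repo
    "arxiv",                               # config name is an assumption
    split="train",
    streaming=True,
    trust_remote_code=True,
)

# Inspect a few records; each is expected to carry a "text" field plus metadata.
for i, example in enumerate(ds):
    print(example["text"][:200])
    if i == 2:
        break
```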