Argilla

A tool designed to create high-quality datasets for training language models, using distilabel and large language models (LLMs) for tailored data generation. (github.com)

Build datasets using natural language

Introduction

Synthetic Data Generator is a tool that allows you to create high-quality datasets for training and fine-tuning language models. It leverages the power of distilabel and LLMs to generate synthetic data tailored to your specific needs. The announcement blog goes over a practical example of how to use it but you can also watch the video to see it in action.

Supported Tasks:

Text Classification
Chat Data for Supervised Fine-Tuning
Retrieval Augmented Generation

This tool simplifies the process of creating custom datasets, enabling you to:

Describe the characteristics of your desired application
Iterate on sample datasets
Produce full-scale datasets
Push your datasets to the Hugging Face Hub and/or Argilla

By using the Synthetic Data Generator, you can rapidly prototype and create datasets for, accelerating your AI development process.

Installation

You can simply install the package with:

pip install synthetic-dataset-generator

Quickstart

from synthetic_dataset_generator import launch

launch()