Argilla
🎯

Argilla

A tool designed to create high-quality datasets for training language models, using distilabel and large language models (LLMs) for tailored data generation. (github.com)

image

Build datasets using natural language

image

Introduction

Synthetic Data Generator is a tool that allows you to create high-quality datasets for training and fine-tuning language models. It leverages the power of distilabel and LLMs to generate synthetic data tailored to your specific needs. The announcement blog goes over a practical example of how to use it but you can also watch the video to see it in action.

Supported Tasks:

  • Text Classification
  • Chat Data for Supervised Fine-Tuning
  • Retrieval Augmented Generation

This tool simplifies the process of creating custom datasets, enabling you to:

  • Describe the characteristics of your desired application
  • Iterate on sample datasets
  • Produce full-scale datasets
  • Push your datasets to the Hugging Face Hub and/or Argilla

By using the Synthetic Data Generator, you can rapidly prototype and create datasets for, accelerating your AI development process.

Installation

You can simply install the package with:

pip install synthetic-dataset-generator

Quickstart

from synthetic_dataset_generator import launch

launch()