A tool designed to create high-quality datasets for training language models, using distilabel and large language models (LLMs) for tailored data generation.
Build datasets using natural language
Introduction
Synthetic Data Generator is a tool for creating high-quality datasets for training and fine-tuning language models. It uses distilabel and LLMs to generate synthetic data tailored to your specific needs. The announcement blog post walks through a practical example, and a video shows the tool in action.
Supported Tasks:
- Text Classification
- Chat Data for Supervised Fine-Tuning
- Retrieval Augmented Generation
This tool simplifies the process of creating custom datasets, enabling you to:
- Describe the characteristics of your desired application
- Iterate on sample datasets
- Produce full-scale datasets
- Push your datasets to the Hugging Face Hub and/or Argilla
By using the Synthetic Data Generator, you can rapidly prototype and create datasets, accelerating your AI development process.
Installation
Install the package with pip:
pip install synthetic-dataset-generator
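To confirm that the install succeeded before launching, a small check like the following may help. It is a minimal sketch using only the Python standard library; it assumes the import name is synthetic_dataset_generator (as used in the Quickstart) and that the distribution name matches the pip command above. The helper names is_installed and installed_version are illustrative and not part of the package.

```python
import importlib.util
from importlib import metadata


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


def installed_version(distribution_name: str):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return metadata.version(distribution_name)
    except metadata.PackageNotFoundError:
        return None


# Assumed names: import name from the Quickstart, distribution name from the pip command.
print("importable:", is_installed("synthetic_dataset_generator"))
print("version:", installed_version("synthetic-dataset-generator"))
```

If the first line prints False, the package is most likely installed into a different Python environment than the one running the check.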
Quickstart
from synthetic_dataset_generator import launch
launch()
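Because the tool can push datasets to the Hugging Face Hub, it can be convenient to check for credentials before starting the app. The sketch below wraps the Quickstart call in such a check. It assumes the token is supplied through an environment variable named HF_TOKEN, which this section does not confirm; the helpers missing_settings and main are hypothetical and not part of the package.

```python
import os


def missing_settings(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


def main():
    # Assumption: the Hub token is read from HF_TOKEN; adjust to your setup.
    missing = missing_settings(["HF_TOKEN"])
    if missing:
        raise SystemExit("Missing environment variables: " + ", ".join(missing))

    # Imported here so the check above runs before the app is loaded.
    from synthetic_dataset_generator import launch

    launch()
```

Call main() from your own entry point; failing early with a clear message avoids discovering a missing token only when a dataset is pushed.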