TFDV
🤱

TFDV

TensorFlow Data Validation (TFDV) can analyze training and serving data to:

The core API supports each piece of functionality, with convenience methods that build on top and can be called in the context of notebooks.

Computing descriptive data statistics

TFDV can compute descriptive statistics that provide a quick overview of the data in terms of the features that are present and the shapes of their value distributions. Tools such as Facets Overview can provide a succinct visualization of these statistics for easy browsing.

For example, suppose that path points to a file in the TFRecord format (which holds records of type tensorflow.Example). The following snippet illustrates the computation of statistics using TFDV:

    stats = tfdv.generate_statistics_from_tfrecord(data_location=path)

The returned value is a DatasetFeatureStatisticsList protocol buffer. The example notebook contains a visualization of the statistics using Facets Overview:

    tfdv.visualize_statistics(stats)
image

The previous example assumes that the data is stored in a TFRecord file. TFDV also supports CSV input format, with extensibility for other common formats. You can find the available data decoders here. In addition, TFDV provides the tfdv.generate_statistics_from_dataframe utility function for users with in-memory data represented as a pandas DataFrame.

In addition to computing a default set of data statistics, TFDV can also compute statistics for semantic domains (e.g., images, text). To enable computation of semantic domain statistics, pass a tfdv.StatsOptions object with enable_semantic_domain_stats set to True to tfdv.generate_statistics_from_tfrecord.