TensorFlow Data Validation (TFDV) can analyze training and serving data to:
- compute descriptive statistics
- infer a schema
- detect data anomalies.
The core API supports each of these functions, and convenience methods built on top of it can be called from notebooks.
Computing descriptive data statistics
TFDV can compute descriptive statistics that provide a quick overview of the data in terms of the features that are present and the shapes of their value distributions. Tools such as Facets Overview can provide a succinct visualization of these statistics for easy browsing.
For example, suppose that path points to a file in the TFRecord format (which holds records of type tensorflow.Example). The following snippet illustrates the computation of statistics using TFDV:
import tensorflow_data_validation as tfdv

stats = tfdv.generate_statistics_from_tfrecord(data_location=path)
The returned value is a DatasetFeatureStatisticsList protocol buffer. The example notebook contains a visualization of the statistics using Facets Overview:
tfdv.visualize_statistics(stats)
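Because the result is a protocol buffer, it can also be inspected programmatically. The following is a minimal sketch rather than part of the TFDV API: summarize_statistics is a hypothetical helper, and it assumes that the statistics proto exposes the datasets, num_examples, features, path and name fields.

```python
def summarize_statistics(stats):
    """Map each dataset name to its example count and feature names.

    `stats` is assumed to be a DatasetFeatureStatisticsList such as the one
    returned by tfdv.generate_statistics_from_tfrecord.
    """
    summary = {}
    for dataset in stats.datasets:
        feature_names = []
        for feature in dataset.features:
            # A feature is identified either by a path (a list of steps)
            # or by a plain name, depending on how the proto was produced.
            steps = list(feature.path.step)
            feature_names.append("/".join(steps) if steps else feature.name)
        summary[dataset.name] = {
            "num_examples": dataset.num_examples,
            "features": feature_names,
        }
    return summary
```

Calling summarize_statistics(stats) on the value computed above would give a plain dictionary that is easy to print or compare in a test.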
The statistics computation above assumes that the data is stored in a TFRecord file. TFDV also supports the CSV input format, with extensibility for other common formats. The available data decoders are listed in the TFDV source repository. In addition, TFDV provides the tfdv.generate_statistics_from_dataframe utility function for users with in-memory data represented as a pandas DataFrame.
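The alternative input paths might be used as in the sketch below. The wrapper function names and the toy DataFrame are hypothetical and only for illustration. The sketch assumes that tfdv.generate_statistics_from_csv reads the column names from the first line of the file when no column names are passed.

```python
import pandas as pd


def statistics_from_dataframe(df):
    """Compute statistics for an in-memory pandas DataFrame."""
    # Imported inside the function so this module loads without TFDV installed.
    import tensorflow_data_validation as tfdv
    return tfdv.generate_statistics_from_dataframe(dataframe=df)


def statistics_from_csv(csv_path):
    """Compute statistics for a CSV file whose first line is a header."""
    import tensorflow_data_validation as tfdv
    return tfdv.generate_statistics_from_csv(data_location=csv_path)


# Hypothetical toy data, for illustration only.
df = pd.DataFrame({"age": [34, 51, 27], "city": ["Oslo", "Lima", "Pune"]})
```

With TFDV installed, statistics_from_dataframe(df) should return the same kind of DatasetFeatureStatisticsList as the TFRecord example, so it can be passed to tfdv.visualize_statistics in the same way.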
In addition to computing a default set of data statistics, TFDV can also compute statistics for semantic domains (e.g., images, text). To enable computation of semantic domain statistics, pass a tfdv.StatsOptions object with enable_semantic_domain_stats set to True to tfdv.generate_statistics_from_tfrecord.
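A minimal sketch of enabling semantic domain statistics follows. The wrapper name semantic_domain_statistics is hypothetical, and path is assumed to point to a TFRecord file as in the earlier example.

```python
def semantic_domain_statistics(path):
    """Compute statistics, including semantic domain stats, for a TFRecord file."""
    # Imported inside the function so this module loads without TFDV installed.
    import tensorflow_data_validation as tfdv

    # Semantic domain statistics are off by default and must be requested.
    options = tfdv.StatsOptions(enable_semantic_domain_stats=True)
    return tfdv.generate_statistics_from_tfrecord(
        data_location=path, stats_options=options
    )
```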