Data is one of the most crucial components of your machine learning model. Collecting the right data is not enough. You also need to put the right processes in place to clean, analyze, and transform the data as needed, so that the model can extract as much signal from that data as possible. Models deployed in production, especially, require lots and lots of data. This is data that likely won't fit in memory, may be spread across multiple files, or may come from an input pipeline.

The tf.data API enables you to build those complex input pipelines from simple, reusable pieces. For example, the pipeline might be a structured dataset that requires normalization, feature crosses, or bucketization. An image model might aggregate data from files in a distributed file system, apply random perturbations to each image, and merge randomly selected images into a batch for training. The pipeline for a text model might involve extracting symbols from raw text data, converting them to embedding identifiers with a lookup table, and then batching together sequences of different lengths. The tf.data API makes it possible to handle large amounts of data, read it in different file and data formats, and perform those complex transformations.

The tf.data API introduces the tf.data.Dataset abstraction, which represents a sequence of elements in which each element consists of one or more components. For example, in an image pipeline, an element might be a single training example, with a pair of tensor components representing the image and its label. There are two distinct ways to create a dataset: a data source constructs a dataset from data stored in memory or in one or more files, while a data transformation constructs a dataset from one or more tf.data.Dataset objects.

Large datasets tend to be sharded, or broken apart into multiple files, which can be loaded progressively. Remember that you train on mini-batches of data; you don't need to have the entire dataset in memory. One mini-batch is all you need for one training step. The Dataset API will help you create input functions for your model that load data progressively, throttling it.

There are specialized dataset classes that can read data from text files like CSVs, TFRecord files, or fixed-length record files. Datasets can be created from many different file formats. Use TextLineDataset to instantiate a dataset object that is composed of, as you might guess, the lines of one or more text files. TFRecordDataset reads records from TFRecord files, and FixedLengthRecordDataset creates a dataset of fixed-length records from one or more binary files. For anything else, you can use the generic Dataset class and add your own decoding code.

Okay, let's walk through an example of TFRecordDataset. At the beginning, the TFRecordDataset op is created and executed. It produces a variant tensor representing a dataset, which is stored in the corresponding Python object. Next, the shuffle op is created and executed, using the output of the TFRecordDataset op as its input, connecting the two stages of our input pipeline so far. Next, the user-defined function is traced and passed as an attribute to the map op, along with the shuffled dataset variant as its input. Finally, the batch op is created and executed, creating the final stage of our input pipeline.
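To make the stages concrete, here is a minimal sketch of that pipeline in Python. The TFRecord file names and the feature specification (an encoded image plus an integer label) are hypothetical placeholders; adapt both to your own data.

import tensorflow as tf

# Hypothetical shards of a larger dataset.
filenames = ["train-00000.tfrecord", "train-00001.tfrecord"]

def parse_example(serialized):
    # User-defined function that will be traced and attached to the map op.
    feature_spec = {
        "image": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.int64),
    }
    parsed = tf.io.parse_single_example(serialized, feature_spec)
    image = tf.io.decode_jpeg(parsed["image"])
    image = tf.image.resize(image, [224, 224])  # uniform shape so batching works
    return image, parsed["label"]

dataset = tf.data.TFRecordDataset(filenames)  # data source: TFRecordDataset op
dataset = dataset.shuffle(buffer_size=1000)   # shuffle op
dataset = dataset.map(parse_example)          # map op with the traced function
dataset = dataset.batch(32)                   # batch op: one mini-batch per training step

Each call returns a new tf.data.Dataset, so each stage is built from the dataset variant produced by the previous one, mirroring the chain of ops described above.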
When a for loop is used to enumerate the elements of the dataset, the __iter__ method is invoked on the dataset, which triggers the creation and execution of two ops. First, an AnonymousIterator op is created and executed, which results in the creation of an iterator resource. Subsequently, this resource, along with the batched dataset variant, is passed into the MakeIterator op, initializing the state of the iterator resource with the dataset. When the __next__ method is called, it triggers the creation and execution of the IteratorGetNext op, passing in the iterator resource as its input. Note that this op is created only once, but executed as many times as there are elements in the input pipeline. Finally, when the Python iterator object goes out of scope, the DeleteIterator op is executed to make sure that the iterator resource is properly disposed of. And, to state the obvious, properly disposing of the iterator resource is essential, as it is not uncommon for iterator resources to allocate hundreds of megabytes to gigabytes of memory because of internal buffering.
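As a companion sketch, here is what that iterator lifecycle looks like from Python, assuming the dataset object built earlier. The explicit iter and next calls are only there to make the individual ops visible; a plain for loop does the same thing under the hood.

# The for loop calls __iter__ once (AnonymousIterator + MakeIterator)
# and __next__ once per element (IteratorGetNext).
for images, labels in dataset:
    pass  # one training step per mini-batch would go here

# Driving the iterator explicitly makes the lifecycle visible:
iterator = iter(dataset)         # creates and initializes the iterator resource
images, labels = next(iterator)  # IteratorGetNext: one element per call
del iterator                     # resource goes out of scope; DeleteIterator disposes of it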