Normally, during each half of the ALS training loop, we'd feed the algorithm whole rows or columns at a time. But since it's hard to know which stage it's in, we'll just feed both, so that it always has the right data regardless of the stage. Remember, this will be fed in batches of rows and columns from the ratings matrix. Hopefully, we will go through the matrix multiple times so things will even out, and you'll end up processing all the rows and columns. It can be important to make sure that the batch size doesn't evenly divide the dataset length, so the rollover offset creates different groupings in each batch and the same batches don't continually repeat themselves.

Graphically, going off of our user-movie example, we have our user-by-movie matrix R, which we are hoping to factorize into our user factors matrix U and our movie factors matrix V, made up of k latent factors. We iterate alternately until convergence: fixing V and computing U, then fixing U and computing V, and so on.

Coming back to the training input function, remember that we read in batches of rows and columns at the same time, and they are stored in a SparseTensor, whose details we will discuss in the next section. Remember, we don't need labels because, due to the alternation, the labels come from the features we are not currently solving for.

Okay, so we just need rows and columns of a ratings matrix and then we can continue? Well, most data warehouses for systems with millions of users and millions of items don't store the complete Cartesian product. That would be a huge waste of space because that matrix is extremely sparse. Instead, ratings data is usually stored so that each interaction becomes a record or row. There may be an interaction timestamp, and most definitely there will be a column that contains an identifier of the user of that interaction, another column for the item that the user interacted with, and then a column for the actual interaction data, whether that is the number of stars, a like or dislike, or the duration of the interaction. However, these are usually represented as tables, not as actual matrices with row indices and column indices.

Let's look at an example. In this example, visitor ID contains a unique user identifier and content ID contains a unique item ID, which here are the video IDs. The actual interaction data in this example is session duration, which is probably going to be used as implicit feedback, where we infer that if a user has a longer session, they are more likely to like that content than content with shorter sessions. Notice anything unusual with this table? Unless these are row indices for the users of an entire galaxy, those are some really large numbers for visitor ID, many orders of magnitude more than all the people on Earth. The same goes for content ID, even though 100 million is much more believable than the 73 quintillion visitors. Remember, these two columns need to map into a contiguous matrix, so they should become the indices of its rows and columns. Also, it would help if we scale session duration to be a small number.

What we need to create is a mapping. We'll map visitor ID to user ID, content ID to item ID, and session duration to rating. This mapping needs to be saved to persistent storage, because we need to map input values to the mapped values not just during training, but also during inference. This way, you can quickly take any visitor ID, content ID, and session duration, and get the corresponding mapped values used by the model.
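For concreteness, here is a minimal sketch of building and persisting such a mapping, assuming the interaction table lives in a pandas DataFrame. The column names (visitorId, contentId, session_duration), the sample values, the 300-second cap on duration, and the JSON file names are all hypothetical choices for this sketch, not something the course prescribes.

```python
import json

import pandas as pd

# A tiny stand-in for the interaction table described above; the column
# names and values are made up for illustration.
raw = pd.DataFrame({
    "visitorId": [7342553199794207347, 5248281639260170400, 7342553199794207347],
    "contentId": [299913879, 299931963, 299922287],
    "session_duration": [5263, 41, 120],
})

# Map the huge, sparse IDs onto contiguous 0-based row and column indices.
visitor_to_user = {v: i for i, v in enumerate(raw["visitorId"].unique())}
content_to_item = {c: i for i, c in enumerate(raw["contentId"].unique())}

ratings = pd.DataFrame({
    "userId": raw["visitorId"].map(visitor_to_user),
    "itemId": raw["contentId"].map(content_to_item),
    # Scale session duration down to a small rating-like value; the
    # 300-second cap is an arbitrary choice for this sketch.
    "rating": raw["session_duration"].clip(upper=300) / 300.0,
})

# Persist the mappings so the same lookups are available at inference time.
with open("visitor_to_user.json", "w") as f:
    json.dump({str(k): v for k, v in visitor_to_user.items()}, f)
with open("content_to_item.json", "w") as f:
    json.dump({str(k): v for k, v in content_to_item.items()}, f)

print(ratings)
```

The important design point is that the lookup tables are written out alongside the training data, so the exact same visitor-to-user and content-to-item indices can be reused later.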
Speaking of inference, when making predictions, you need access at prediction time not just to these mappings, but to the entire input dataset, because you may want to filter out any previously interacted-with items, such as previous purchases, views, or ratings, to provide the top-k recommendations of new items. Should you recommend an already-rated item to a user? For some problems, users don't want to be recommended things they've already bought or seen, like a movie. However, for a restaurant they liked, they may want to return.

Also, we can distribute our data instead of sending whole rows and columns in our input function to a single worker. Each worker's mini-batch consists of a subset of rows of the matrix. The training step computes the new values for the corresponding row factors. However, the update depends on the full column factor matrix, which would be costly to fetch at each step. So we use a trick: we precompute the Gramian G, which is just the matrix product XᵀX of the factor matrix we are holding fixed. Given G, a worker only needs to look at the subset of rows of V corresponding to the non-zero entries in its input to compute the update, as sketched at the end of this section. Now we can use gather and scatter to perform the fetches and updates, and use custom C++ kernels for the compute. This is much easier to distribute and scale.

When using the WALS estimator, it is important to have the inputs in the correct format. What should we do with the table here in the input function before it is used by the estimator? Do we map client ID to integers in the range zero inclusive to the number of clients exclusive? Do we map product ID to integers in the range zero inclusive to the number of products exclusive? Do we map sentiment from a string representation to a numeric representation? Or maybe some combination?

The correct answer is F. The client ID is represented as an alphanumeric string in this data, so we need to map each string into an integer representing that client's user index. The product ID at least isn't a string, but it is a long integer that does not represent an actual matrix index, so we need to map each one to an integer representing that product's item index. As for the rating, we probably have an example of explicit feedback. However, it is a string, so we will need an ordinal mapping, perhaps from the lowest sentiment to the highest sentiment as integers, and we can even scale that to be between zero and one or some other range. The main goal is to eventually get the ratings into a numeric format. Therefore, we had to create all three mappings and save them in persistent storage to be used for future training and inference.
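Coming back to the Gramian trick mentioned above, here is a minimal NumPy sketch of the row-factor update in the simplest unweighted case, where unobserved entries are treated as zeros. The real WALS estimator applies per-entry weights and custom kernels, but the way a precomputed Gramian lets each worker touch only a few rows of V is the same. The function and variable names are made up for illustration.

```python
import numpy as np

def update_row_factors(row_batch, V, reg=0.1):
    """Solve for the user factors of a batch of rows, reusing the
    precomputed Gramian G = V^T V instead of the full V per row.

    row_batch: list of (item_indices, ratings) pairs, one per user row,
               holding only the non-zero entries of that row.
    V:         (num_items, k) item-factor matrix, currently held fixed.
    """
    k = V.shape[1]
    # The Gramian is computed once per sweep over V, not once per row.
    G = V.T @ V                                # (k, k)
    A = G + reg * np.eye(k)
    new_rows = np.zeros((len(row_batch), k))
    for i, (item_idx, ratings) in enumerate(row_batch):
        # Only the rows of V at this row's non-zero columns are needed.
        V_sub = V[item_idx]                    # (nnz, k)
        b = V_sub.T @ ratings                  # (k,)
        new_rows[i] = np.linalg.solve(A, b)    # (G + reg*I) u = V^T r
    return new_rows

# Toy usage: 2 users, 4 items, k = 3 latent factors.
rng = np.random.default_rng(0)
V = rng.normal(size=(4, 3))
batch = [(np.array([0, 2]), np.array([5.0, 3.0])),
         (np.array([1, 3]), np.array([4.0, 1.0]))]
U_new = update_row_factors(batch, V)
print(U_new.shape)  # (2, 3)
```

Because G summarizes all of V in a small k-by-k matrix, the per-row work scales with the number of non-zero entries in that row rather than with the total number of items, which is what makes the gather/scatter, multi-worker version of this cheap to run.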