So, in this module, we will implement an LSTM auto-encoder based anomaly detector in Keras. For this demo, we will use a dataset provided by Case Western Reserve University. We start with an ETL process, which stands for Extract, Transform, Load: it acquires data remotely, transforms it, and loads it into our final storage system, Object Storage in this case. The researchers recorded different sessions of accelerometer data, with sensors placed at different positions on a test apparatus, for healthy and faulty bearings, sampled at between 12 and 48 kilohertz. All the files are in Matlab format, and different sub-pages are provided for download, containing healthy data and different types of faulty data.

There is a relatively small amount of healthy data available, only four recordings. So let's download these four files first. This is straightforward. After the download completes, we move those files to a sub-folder called "CWR-healthy". Let's have a look at the contents. Okay, great, we have our four Matlab files. Luckily, the SciPy package of Python has a function to read Matlab files. Let's examine the object it returns by just loading a single file. It is a Python dictionary with two important entries for the two accelerometers; each one contains a vibration sensor time series as a NumPy array.

The read folder function traverses a whole folder of Matlab files and extracts the two NumPy arrays. Data quality is not 100 percent optimal, therefore we have to filter out some crappy data where the lengths of the time series in the two entries of a Matlab file are not the same. We add a column to the array which indicates the file ID, so that we know the source of the data in case we need it later. Some Matlab files don't contain two sensor readings; in that case, we fill the other column with zeros. We do this for every file and finally append everything to the final array, which we return (a rough sketch of this function follows below).

Let's execute this function on the healthy dataset. So we've created a NumPy array; this is what we expect. The array contains three columns: one is the file ID, and the other two are the time series for the two accelerometers. For convenience, we create a Pandas DataFrame out of the NumPy array and write it to a CSV file. It looks nice. Note that Pandas created a fourth column for us, which contains a continuous sequence number; this can come in handy at some point. Time series are useless if you don't get them in the correct order, so we can sort on this column to ensure correct ordering.

Let's have a look at the faulty data. Many more entries are present here, and picking them by hand is quite tedious, so we will automate this: we just extract the URLs from the HTML page and only download those. First, we get the HTML file. Then we filter the lines containing "mat" out of the HTML. Then we split those lines a bit in order to extract the URL of each Matlab file, and then it looks like this. Once we have a URL, we download the file. We do this for all three web pages containing the different types of anomalies, and finally move everything into a folder called "CWR_faulty". This takes a while, so let's go for a walk and come back later.

So, this is done, and we've downloaded all the files. Again, we create a NumPy array containing all the data we want and finally store it in a CSV file. This file is a bit bigger, so let's have a look. Okay, it's 1.4 GB. It makes sense to use Apache Spark for processing files of this size.
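To make this a bit more concrete, here is a minimal sketch of what such a read-folder helper could look like in Python. Be aware that the function name read_folder, the output column names, and the way I pick the accelerometer entries out of the .mat dictionary by their "_time" suffix are my own assumptions for illustration, not necessarily what the course notebook does.

```python
import os
import numpy as np
import pandas as pd
from scipy.io import loadmat  # SciPy's reader for Matlab .mat files

def read_folder(folder):
    """Traverse a folder of .mat files and stack the two accelerometer
    time series, plus a file ID column, into one NumPy array."""
    result = np.empty((0, 3))
    for file_id, file_name in enumerate(sorted(os.listdir(folder))):
        if not file_name.endswith('.mat'):
            continue
        mat = loadmat(os.path.join(folder, file_name))
        # The entries ending in '_time' hold the accelerometer time series
        # (the exact key names vary from file to file).
        series = [mat[key].flatten() for key in mat if key.endswith('_time')]
        if len(series) == 0:
            continue
        if len(series) >= 2 and len(series[0]) != len(series[1]):
            continue  # filter out files whose two time series differ in length
        first = series[0]
        # Some files contain only one sensor reading; fill the other column with zeros.
        second = series[1] if len(series) >= 2 else np.zeros(len(first))
        ids = np.full(len(first), file_id)
        result = np.append(result, np.column_stack((ids, first, second)), axis=0)
    return result

healthy = read_folder('CWR-healthy')
df = pd.DataFrame(healthy, columns=['file_id', 'acc_1', 'acc_2'])
df.to_csv('cwr_healthy.csv')  # Pandas writes its running index as an extra column
```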
So we create an Apache Spark DataFrame out of them and, at the same time, we register the DataFrame as a temporary view, so that we can run SQL statements against it. For example, let's check how many samples we have per file. It's between a quarter and half a million. This is sufficient. But note that the data is sampled between 12 and 48 kilohertz, so this is only a couple of seconds worth of data. We do the same for the faulty data. Since we are sorting ascending, we are sure that we have at least two to three seconds worth of data per instance.

Now it's time to finish our ETL process by storing both DataFrames as Parquet files in Object Storage. Parquet is a very cool, compressed column-store format, and from the 1.4 GB I think we will end up with only around 200 megabytes. In order to get the credentials for Object Storage, the most convenient and easy way is to just upload a dummy file. I've done this already, and then we insert a SparkSession DataFrame into the notebook. Once the credentials are inserted, you can use Object Storage like an ordinary file system, or like HDFS, within Apache Spark in a notebook. Let's store both DataFrames as Parquet files in Object Storage. Note that I'm temporarily changing the file names, since I've done this already and haven't deleted those files. So we have now written more than 33 million samples.

Let me just double-check where the data has been stored. To do so, we enter the IBM Cloud console and click on the appropriate Object Storage service. We can see that the Parquet file has been split into multiple sub-files. We could prevent this by calling the repartition function on the DataFrame, but let's skip this for now. It's anyway a good practice to repartition the data according to the number of workers we have in the Spark service.

Now it's time for the actual implementation. Note that ETL often takes between 80 and 95 percent of your time, so now we deserve to have some fun. Again, we use the same credentials in order to read the files from Object Storage. We could have used the local file system which Data Science Experience provides as well, especially since it provides a staging area of nearly 100 terabytes for us for free. But this file system is volatile, so we are better off using Object Storage, which is permanent. So let's create the two DataFrames and register them as temporary query tables in order to be able to execute SQL statements against them using Apache Spark.

We need a couple of imports, then we define a class called "LossHistory". This is a so-called callback handler, which is called by Keras every now and then to record a trajectory of losses during training. That "every now and then" is actually at the beginning of every training epoch. Now we define that we are working on count-based windows of 100 samples. This means we are using 100 past samples to predict 100 future samples; in a sense, this is exactly what time series forecasting does. The dimension is two, since we have two accelerometer sensor readings per instance. Now we create an instance of our callback handler and start with a sequential model. We add an LSTM layer which has 15 neurons and expects its input as a 100-by-2 2D array. The output layer is of dimension two again, since we are going back to our input shape, which is the two dimensions of the two accelerometer readings. Now we compile the model and define two important parameters.
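Before we look at those two parameters, here is a rough sketch of how the callback handler and the model could look in code. The 15 LSTM neurons and the 100-by-2 input shape come from this walkthrough, and the loss and optimizer are the ones discussed next; the tensorflow.keras import path, the use of return_sequences with a Dense output layer, and recording the loss at the end of each epoch (rather than at the beginning) are my own assumptions.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.callbacks import Callback

timesteps = 100  # count-based tumbling window: 100 samples per window
dim = 2          # two accelerometer readings per sample

class LossHistory(Callback):
    """Callback handler: Keras calls it during training so that we can
    record the trajectory of losses."""
    def on_train_begin(self, logs=None):
        self.losses = []
    def on_epoch_end(self, epoch, logs=None):
        self.losses.append(logs.get('loss'))

lossHistory = LossHistory()

model = Sequential()
# LSTM layer with 15 neurons, expecting 100-by-2 windows as input
model.add(LSTM(15, input_shape=(timesteps, dim), return_sequences=True))
# Output of dimension two again: back to the shape of the two accelerometer readings
model.add(Dense(dim))
# Mean absolute error loss and the Adam optimizer (the two parameters discussed next)
model.compile(loss='mae', optimizer='adam')
```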
First, we tell Keras to compute the loss with a mean absolute error function, and then we use the Adam gradient descent optimizer. We define a train function; it calls fit on the model with the following important parameters. We train for 20 epochs, which means we are showing the data 20 times to the model. A batch size of 72 means that after every 72 samples the weight parameters are updated. Please note also that we are passing the data twice, as input and as output; this is how an auto-encoder works.

In order to train the model, we want to use the individual data instances, in other words the data which comes from the individual Matlab files. To be 100 percent sure that we are not messing up the ordering, we sort the data based on the sequence column. Then we filter by file ID; this ID we've obtained directly from the file name of the Matlab file. Finally, we return only the two columns containing accelerometer data. We unwrap the DataFrame to obtain the RDD and map over it in order to unwrap the data from the Row objects. Collect pulls the data back to the driver as a Python array, and this we wrap as a NumPy array. Since we're working with count-based tumbling windows of size 100, we need to make sure that the data fits into those windows. Therefore, we just remove all data which doesn't fit into multiples of 100. So trim now contains the remainder, which we simply cut away using NumPy slicing syntax. Then we reshape this 2D array into a 3D one, taking the tumbling window size into account. This makes the array perfectly match the shape of the input of our neural network. Finally, we return the array.

Since we have this cool and handy function in place, we can now iterate over an array of file IDs. We can obtain those from the DataFrame as well, by selecting the distinct values of that field, since we have stored a file ID with each sample. Note that, again, we need to unwrap and collect the data back to the driver. Since training a neural network takes a fair amount of time, let's measure it and start a stopwatch here. The body of this loop will be executed for each file ID of the healthy dataset, and for each file ID we get a recording which we can directly feed into the neural network for training. After the loop has finished, we see that it ran on three recordings and took roughly 11 minutes for all of them, around four minutes per recording. (A rough sketch of the windowing function and this training loop is shown below, after this walkthrough.)

Now let's plot the trajectory of such a training run in order to see how it converges. It actually converges really fast, which is very nice, especially since our neural network is fairly simple. Later, we will see more complex neural networks with more complex data. The two spikes which we are seeing here mostly occur because of the switch between different recordings of the training dataset.

Now let's examine a faulty recording and see how the neural network behaves. We expect a sudden increase in loss, especially at the beginning of training. We append the losses obtained from this recording to the previous losses from training on the healthy dataset, and plot them. And we see a clear spike in loss once the neural network sees faulty data for the first time. Note that the loss decreases over time, so there are a couple of additional steps needed to turn this into an out-of-the-box anomaly detector. Let's examine this in the next video. So we've done our homework for now; let's see how we can turn this into a solution, exemplified by a cognitive IoT real-time anomaly detection system.
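As promised, here is a rough sketch putting the main pieces of this walkthrough together: the windowing helper and the training loop. It builds on the model, lossHistory, timesteps, and dim defined in the sketch above; the Spark DataFrame name df_healthy and the column names file_id, sequence, de, and fe are placeholders I made up, so the actual notebook will differ.

```python
import numpy as np

def create_trimmed_recording(df, file_id):
    """Pull one recording out of the Spark DataFrame, keep it in order,
    and reshape it into tumbling windows of 100 samples."""
    rows = (df.where(df.file_id == file_id)           # one recording per Matlab file
              .orderBy('sequence')                    # make sure ordering is correct
              .select('de', 'fe')                     # the two accelerometer columns
              .rdd.map(lambda row: [row.de, row.fe])  # unwrap the Row objects
              .collect())                             # pull the data back to the driver
    data = np.array(rows)
    trim = len(data) % timesteps              # remainder that doesn't fit a full window
    if trim > 0:
        data = data[:-trim]                   # cut it away with NumPy slicing
    return data.reshape(-1, timesteps, dim)   # 2D -> 3D: (windows, 100, 2)

def train(data):
    # Pass the data twice, as input and as target: that's the auto-encoder idea.
    model.fit(data, data, epochs=20, batch_size=72, callbacks=[lossHistory])

# Obtain the distinct file IDs of the healthy recordings and train on each of them.
file_ids = [row.file_id for row in df_healthy.select('file_id').distinct().collect()]
for file_id in file_ids:
    train(create_trimmed_recording(df_healthy, file_id))
```

One design note: the `if trim > 0` guard matters, because slicing with `[:-0]` would return an empty array rather than the untrimmed one.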