Hi. Welcome to the DeepLearning4J Overview.
In this section I'm going to provide an overview of
the tools that the DeepLearning4J project provides.
Any code that I write for this Coursera course will be shared in this Git repo.
So you can go to this GitHub location and take a look at any
of the code that has been generated for this course.
So what DeepLearning4J does is provide a toolkit for doing deep learning on the JVM.
It consists of a number of subprojects.
A large part of our work as data scientists is spent ingesting, processing,
preprocessing, normalizing, standardizing, and otherwise manipulating our data source.
And our data source might be a comma-separated value file;
it might be a collection of images, or video or audio files.
DataVec is the subproject in DL4J that provides tools for ETL.
The processing that takes place when
a neural network trains is numeric processing of arrays.
So it's matrix-to-matrix multiplication,
and ND4J handles that.
You can think of ND4J as the equivalent of what NumPy is for Python;
you can think of it as NumPy for the JVM.
Since we're doing matrix-to-matrix multiplication,
libnd4j provides the native libraries for execution on GPUs and/or CPUs.
And then DeepLearning4J, this is where we define our neural net,
configure it, and then train it.
So let's talk about DataVec first.
So DataVec needs to get your data into a numeric array,
sometimes called a tensor.
Sometimes perhaps a more appropriate term would be an indexed n-dimensional array,
a multi-dimensional array of values.
And DataVec helps you get from your data source into that numeric array.
Your data source might be log files, text documents,
voice samples, tabular data,
images, video, and more.
Some of the features that DataVec provides: transformation.
I may need to transform a list of classes
or a numeric representation of classes to a one-hot representation.
I may need to join data sets.
I may need to transform values.
I may need to reorganize the data into another schema.
I will need to scale my data,
perhaps to between zero and one, so that we have consistent ranges of values.
DataVec provides tools for that kind of normalizing and standardizing.
A neural network trains best if the data is shuffled,
so DataVec provides tools to assist with
shuffling the data at many points along the pipeline,
and then splitting our data into test and train.
In order to train a neural net when we're doing supervised learning,
we need some way to get the label for the data.
And the label might be stored as part of the file path.
So we might have a collection of images of cats in a directory named Cats;
we would extract the label from that directory name.
And a collection of images of dogs might be in a directory named Dogs.
DataVec provides some tools for doing that.
If the label is in the path,
perhaps in the name of the file,
we can use PathLabelGenerator.
If the label is in the parent path, like I described
with the Cats and Dogs directories,
then we could use ParentPathLabelGenerator.
If the label is stored as a column in your CSV data,
then we provide the label index to the RecordReaderDataSetIterator.
And there will be an example of code in
the next section where you'll actually see that particular use case.
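As a rough sketch of how one of those label generators gets wired in (the image dimensions and directory path here are hypothetical, not from the course repo), a ParentPathLabelGenerator can be handed to an ImageRecordReader so that directory names like Cats and Dogs become the labels:

    import org.datavec.api.io.labels.ParentPathLabelGenerator;
    import org.datavec.api.split.FileSplit;
    import org.datavec.image.loader.NativeImageLoader;
    import org.datavec.image.recordreader.ImageRecordReader;
    import java.io.File;
    import java.util.Random;

    // Assumed layout: /data/animals/Cats/*.jpg and /data/animals/Dogs/*.jpg
    ParentPathLabelGenerator labelMaker = new ParentPathLabelGenerator();
    ImageRecordReader imageReader = new ImageRecordReader(64, 64, 3, labelMaker); // height, width, channels
    // initialize(...) throws IOException; the random seed shuffles the file listing
    imageReader.initialize(new FileSplit(new File("/data/animals"),
            NativeImageLoader.ALLOWED_FORMATS, new Random(123)));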
So, some commonly used features in DataVec:
a RecordReader to read our files or input,
converting them to a list of writables.
We'll pass our RecordReader an InputSplit, which says where in
the file system the data lives, and that file system could be HDFS, S3,
or any file path Java can interpret.
Normalizing our data: standardizing and scaling.
Transform processes to modify the schema of the data,
join datasets, replace strings, and extract labels.
So here's a quick diagram of some of the available ETL paths.
There are more, but here you see some of the classes, some of
the tools, that you would use depending on your data source,
where the label is,
and what type of RecordReader we're going to use
to read it, depending on how the data is stored.
And down here you'll see where we convert to an INDArray.
That's our RecordReaderDataSetIterator.
It takes what the RecordReader provides,
a list of writables (you can think of that as a list of records),
and converts it into a multi-dimensional array of features,
and then, if we're doing supervised learning,
an additional multi-dimensional array of labels.
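Continuing the CSV sketch from earlier (the label column, class count, and batch size here are placeholder values), the RecordReaderDataSetIterator is what turns those lists of writables into feature and label INDArrays:

    import org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator;
    import org.nd4j.linalg.dataset.DataSet;
    import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;

    int labelIndex = 4;    // column in the CSV that holds the label
    int numClasses = 3;    // number of label classes
    int batchSize = 50;
    DataSetIterator iterator =
            new RecordReaderDataSetIterator(recordReader, batchSize, labelIndex, numClasses);
    DataSet batch = iterator.next();
    // batch.getFeatures() is the feature INDArray; batch.getLabels() holds one-hot labels.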
I couldn't put all the available record readers
here in this slide; there are too many of them.
So here's a link to a web page that provides
a list of many of the available record readers.
You may need to preprocess your data.
An example would be images, where the pixel values might be 0 to
255, defining the range of a color in that pixel.
To process that data in a neural net, you may want to scale
those values to between zero and one.
And we could use MinMax scaling.
We could use NormalizerMinMaxScaler where we find the observed men and the observed maps.
In the case of the images we know what the potential Max and the potential Min is,
but your data might not necessarily be images and you might need to extract
the observed Max and the observed Min and then
apply those and set the observed Max to one and the observed Min to zero.
NormalizerMinMaxScaler needs to read through
the whole dataset to extract the min and max,
the global min and max.
NormalizerStandardize avoids that initial pass
by maintaining a moving column-wise variance and mean,
thereby eliminating the need for a separate preprocessing pass over the data.
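A minimal sketch of both normalizers, assuming the iterator from the earlier snippet (fit() makes the statistics-gathering pass, and setPreProcessor() then applies the scaling to every batch the iterator emits):

    import org.nd4j.linalg.dataset.api.preprocessor.NormalizerMinMaxScaler;
    import org.nd4j.linalg.dataset.api.preprocessor.NormalizerStandardize;

    // Scale every feature column into [0, 1] using the observed min and max.
    NormalizerMinMaxScaler minMax = new NormalizerMinMaxScaler(0, 1);
    minMax.fit(iterator);              // pass over the data to collect min and max
    iterator.setPreProcessor(minMax);

    // Or standardize each column to zero mean and unit variance instead.
    NormalizerStandardize standardize = new NormalizerStandardize();
    standardize.fit(iterator);
    iterator.setPreProcessor(standardize);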
So, CSVRecordReader is commonly
used if we have CSV data, and we'll have an example of that.
We'll also have an example in this course of CSVSequenceRecordReader,
where we're generating a time-series structure out of the data that's stored.
There's ImageRecordReader if you are reading images,
and there are examples of that in the DeepLearning4J
examples GitHub repo.
You won't necessarily be doing one in this course.
There's JacksonRecordReader if I were reading JSON.
And ParentPathLabelGenerator is quite commonly used.
Transform, TransformProcess, and TransformProcess.Builder
allow you to choose which columns you'd like to use,
and perhaps perform computations or transformations on those columns.
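As an illustrative sketch (the schema and column names here are invented), a TransformProcess is built from a Schema describing the incoming data plus a chain of operations:

    import org.datavec.api.transform.TransformProcess;
    import org.datavec.api.transform.schema.Schema;

    // Describe the incoming columns.
    Schema inputSchema = new Schema.Builder()
            .addColumnString("name")
            .addColumnCategorical("species", "cat", "dog")
            .addColumnsDouble("height", "weight")
            .build();

    // Choose the columns to keep and one-hot encode the categorical column.
    TransformProcess tp = new TransformProcess.Builder(inputSchema)
            .removeColumns("name")
            .categoricalToOneHot("species")
            .build();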
Now let's talk about ND4J.
ND4J is our numeric, scientific computing library.
One of its main features is a versatile n-dimensional array object.
So we'll be creating indexed n-dimensional arrays, and then our neural net will be
processing those and generating its output, which will
also be an indexed n-dimensional array.
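A quick sketch of what that looks like in practice (the values here are arbitrary):

    import org.nd4j.linalg.api.ndarray.INDArray;
    import org.nd4j.linalg.factory.Nd4j;

    INDArray a = Nd4j.create(new double[][]{{1, 2}, {3, 4}});  // a 2x2 matrix
    INDArray b = Nd4j.ones(2, 2);                              // a 2x2 matrix of ones
    INDArray c = a.mmul(b);                                    // matrix-to-matrix multiplication
    System.out.println(c);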
ND4J provides multi-platform functionality and support, including GPUs.
Neural nets take a lot of computing power
and perform significantly faster on
graphics processing units, or GPUs, and in order to switch from CPU
to GPU in DL4J, it's as easy as changing the configuration of the ND4J backend.
So it's a simple change to your POM file, and then you execute on GPUs rather than CPUs.
Among the tools we're frequently using from ND4J is
DataSet, and a DataSet is a collection of INDArrays,
one for the features and one for the labels. And then there's DataSetIterator;
that RecordReaderDataSetIterator is where we move from DataVec parsing,
processing, and configuring our input into generating the INDArray.
So we use RecordReaderDataSetIterator for
processing that data and getting it into ND4J to pass to our neural net.
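Pulling those pieces together, a typical loop over the DataSetIterator looks roughly like this (model is a hypothetical, already-configured network, not defined here):

    import org.nd4j.linalg.api.ndarray.INDArray;
    import org.nd4j.linalg.dataset.DataSet;

    while (iterator.hasNext()) {
        DataSet batch = iterator.next();
        INDArray features = batch.getFeatures();  // one INDArray for the inputs
        INDArray labels = batch.getLabels();      // one INDArray for the labels
        // model.fit(batch);  // a configured network would train on each batch
    }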
Libnd4j, this is the C++ engine that powers ND4J.
We need the speed and
the native processing support of C++, and libnd4j provides that for us.
DeepLearning4J, this is where we build our neural nets, and we can
configure our neural nets to execute on CPUs or GPUs
by changing that line in our POM file.
We can also specify whether we want standalone or
parallel processing, so you can build a simple neural net that runs locally on CPUs,
switch your POM file, and it's executing on GPUs.
Wrap that neural net in ParallelWrapper or
SparkDl4jMultiLayer and it will now execute in parallel,
in the latter case across a collection of Spark nodes on your Spark cluster.
And that's one of the focuses of this course:
we're going to demonstrate deploying your neural net's
code onto the Spark cluster that IBM's Data Science Experience provides.
And then ParallelWrapper:
if we had a collection of GPUs on a single machine and we needed to do parallel processing
across those GPUs, we would use ParallelWrapper.
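A rough sketch of wrapping a network in ParallelWrapper (the worker count, and the model and trainData variables, are placeholders for illustration):

    import org.deeplearning4j.parallelism.ParallelWrapper;

    ParallelWrapper wrapper = new ParallelWrapper.Builder<>(model)
            .workers(4)              // e.g. one worker per GPU
            .prefetchBuffer(8)       // batches pre-loaded per worker
            .averagingFrequency(3)   // average model parameters every 3 iterations
            .build();
    wrapper.fit(trainData);          // trainData is a DataSetIterator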
So if you want more information:
GitHub DeepLearning4J, that's where we store the code.
That's the actual source code.
There's a Gitter chat if you need help from one of our engineers or a member of the community;
we're constantly monitoring that Gitter channel you see right here.
And then there's our website, deeplearning4j.org.
You can go there and get assistance,
some documentation, et cetera.
Thank you.