0:05
OK. Obviously, we've already looked at loading in data in the previous couple of weeks.
You can't do data analysis without getting the data into H2O first.
But in this video, I'm just going to take
a more rigorous look at the various options we have.
So, these are the currently supported data file formats.
CSV is the one we've been using and we'll continue to use throughout this course.
ORC is only available if you're using Hadoop.
SVMLight is a sparse data format.
Be aware that when you load it into H2O,
my understanding is it will become dense.
It will be expanded out.
So if your data file is very sparse,
you may hit memory issues there.
And the rest you'll know about if you need them.
Generally, use CSV if you can.
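Just as a rough sketch of what that looks like in the Python client (the file path here is only a placeholder, not a file from the course):

```python
import h2o

h2o.init()  # connect to, or start, a local H2O instance

# Parse the CSV into an H2OFrame; header and column types are detected automatically
frame = h2o.import_file("data/airlines.csv")
print(frame.dim)  # [number of rows, number of columns]
```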
H2O supports zip and Gzip formats.
If your CSV data is already zipped on a remote server,
it will download the zip file,
unzip it and just deal with it, which is very nice.
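So, as a sketch, pointing the import at a gzipped file on a web server works exactly the same way as a plain CSV (the URL below is made up):

```python
import h2o

h2o.init()

# H2O downloads the archive, decompresses it and parses the CSV inside, all in one call
flights = h2o.import_file("https://example.com/data/flights.csv.gz")
```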
H2O will also analyze your data when it reads it in.
In the case of a CSV file,
it will look at the top row and ask: does that look like column names?
If it does, it will use them.
If the top row looks like the rest of the data,
it will instead assign default column names,
which I think are C1,
C2, C3 and so on.
It will also analyze each column to decide its data type:
whether it's an integer column,
a numeric column, a factor,
a string and so on.
Generally, it's good at this.
Sometimes it gets it wrong,
so always go and check.
It will also compute some basic statistics on the numeric columns: the mean,
the standard deviation I think,
and the range of the data.
In the case of categorical (enum/factor) columns,
it will count how many rows you have in each possible category.
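To see what the parser decided, and to correct it when it guesses wrong, you can do something like the following in the Python client (the file path and column names are hypothetical):

```python
import h2o

h2o.init()
frame = h2o.import_file("data/airlines.csv")

print(frame.types)                 # e.g. {'Year': 'int', 'Carrier': 'enum', ...}
frame.describe()                   # per-column min, max, mean, sigma, missing counts
counts = frame["Carrier"].table()  # counts of each level of a categorical column

# Override a wrong guess at parse time, e.g. keep a numeric-looking code as a factor
frame = h2o.import_file("data/airlines.csv",
                        col_types={"FlightNum": "enum"})
```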
Let's look at the data sources that you can load from.
The file system, very obvious;
we are actually going to come back to that one.
S3, that's Amazon S3 of course,
and support for other cloud providers is being added in future.
It is a very good idea to put your data on S3 if
you're running a cluster on Amazon AWS machines.
HDFS makes a lot of sense if you have the data in a Hadoop cluster already.
The last one there, JDBC, allows you to load data from a SQL database.
We're not going to look at that any further in this course.
Please go and look at the documentation if you're interested.
Currently, MySQL, Postgres and MariaDB are the supported databases.
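Just to give a flavour of what these sources look like in the Python client (the bucket names, hosts, table and credentials below are all placeholders, not from the course):

```python
import h2o

h2o.init()

local_frame = h2o.import_file("/data/train.csv")                 # file system of the H2O server
s3_frame    = h2o.import_file("s3://my-bucket/train.csv")        # Amazon S3
hdfs_frame  = h2o.import_file("hdfs://namenode/data/train.csv")  # Hadoop HDFS

# SQL database over JDBC (MySQL shown; the JDBC driver must be available to H2O)
sql_frame = h2o.import_sql_table(
    connection_url="jdbc:mysql://localhost:3306/mydb",
    table="citibike",
    username="user",
    password="password",
)
```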
For the file system, there are two approaches for loading:
import file and upload file.
And the distinction is where the file exists,
or more importantly, whether H2O can see the file or not.
Now if you're running H2O on your local machine,
the same machine as the client,
this distinction doesn't matter.
And that means you should just always use import file.
When you have a client in one place and the server running on another host,
whether it's another host on your LAN or
another host in your data center or another host on
the other side of the world in a cloud server, the distinction matters.
With import file, H2O has to be able to see the file
from wherever it's running.
So if you're using the file system,
it has to be on the local file system of the H2O server.
If you're using S3 or HDFS, again,
your access credentials have to allow access from wherever that is sitting.
If you use upload file,
this will take a file on the local file system of your client.
And just as the name suggests,
what it will do is
first upload the data to the H2O server;
then H2O will use import file on that now-local file,
and then it will delete the temporary file.
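In code the two look almost identical; the difference is only in whose file system the path refers to. A minimal sketch, assuming a remote H2O server (the host name and both paths are placeholders):

```python
import h2o

# Placeholder host: the client runs here, the H2O server runs on another machine
h2o.init(ip="remote-h2o-host", port=54321)

# import_file: the path must be visible to the H2O server process
server_frame = h2o.import_file("/data/on/server/train.csv")

# upload_file: the path is on the client machine; the data is pushed to the
# server over HTTP, parsed there, and the temporary copy is then removed
client_frame = h2o.upload_file("/data/on/laptop/train.csv")
```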
Last week, we looked at creating artificial data using as.h2o in R or h2o.H2OFrame in Python.
If you look under the surface of what's actually happening there,
it is saving the data to a temporary file on your disk and using upload file,
and then, again, the server will use import file.
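For instance, building a frame directly from Python data, something like the snippet below, goes through that same upload-then-import path behind the scenes (the data itself is just a toy example):

```python
import h2o
import pandas as pd

h2o.init()

pdf = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})
hf = h2o.H2OFrame(pdf)  # written to a temporary file, uploaded, then parsed on the server
```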
Generally, you should prefer import file because it's more efficient.
Also, when you use import file on very large files,
H2O can read the data in parallel.
So if you're using an eight-node cluster,
where possible H2O will read one eighth of your file directly onto each machine.
When you are going to be importing the same file a lot,
it is well worth putting it onto S3 or a web server,
or an HDFS system sitting next to your H2O server.
If you're only going to be doing it once, it doesn't matter.
Do whichever way is easiest.
And to recap, the most important thing,
do check the manual for the latest information about what is supported and what isn't.