Let's say you have a data structure like a Player.
So you have players who are either footballers or cricketers,
and not all of the cricketers bowl.
So some cricketers have bowling averages,
but everyone bats, so all cricketers have batting averages.
So you have this hierarchy, and you want to represent this
in a relational database, because you want to persist the data.
If you use a relational database to persist an object hierarchy,
you run into the object-relational impedance mismatch.
For example, we said that all cricketers have batting averages,
but not all of them have bowling averages.
However, if you go look at the table itself, there'll be a Name, there'll be
a Club, there'll be a BattingAverage, there'll be a BowlingAverage.
What is the BattingAverage of a football player?
It makes absolutely no sense, but we have to have the column.
So we basically go ahead and put a null in there, and once you do that
you have data integrity problems from that point onwards.
So how do you prevent this kind of an issue with an object
to relational mapping?
Well one way is if you can store objects directly, and
that's what Cloud Datastore on GCP lets you do.
So Datastore scales up to terabytes of data,
whereas a relational database typically goes up to a few gigabytes.
Datastore can go up to terabytes, and
what you're storing in Datastore is conceptually like a Hashmap.
It maps a key or an id to an object, so you store the entire object against that key.
So when you are writing to Datastore you're writing an entire object, but
when you are reading from it, that's searching: you can search with the key, but
you can also search by a property.
So you can look for
all cricket players whose batting average is greater than 30 runs a game, right?
You want to update, again you can update just the batting average of a player, and
you can update this in a transactional way.
So Datastore supports transactions, it allows you to read and
write structured data.
It is a pretty good replacement for use cases in your application,
where you may be using a relational database to persist your data.
However, this replacement is something that you would have to do explicitly.
Unlike the things that we talked about in the previous chapter, you can't just lift it over.
For example, in the previous chapter we said if you have a Spark
program that you are running on a Hadoop cluster on-premises
and you want to run it on GCP, just run it on Dataproc;
pretty much all of your code migrates unchanged.
If you have a MySQL code base,
well, whatever you're doing with your MySQL on-premises you can do with MySQL on
Google Cloud using Cloud SQL. Those are easy migrations, right?
Take what you have, take those use cases that you have,
just move them to the cloud, but when we talk about something like Datastore,
now it's not that easy a migration.
You have to change the code that you write, because the way you interact with
Datastore is different from the way you'd interact with a relational database.
So how do you interact with Datastore?
Well, the way you work with Datastore is that it's like a persistent Hashmap.
So for example let's say we want to persist objects that are author objects.
You'd say I have an Author class, and it's an Entity.
@Entity is an annotation that you add, and I'm showing you Java here, but
this works with a variety of object-oriented languages.
And you say that the author is distinguished by their email address,
so the email address is the id field, and you mark it with @Id.
We want to search for authors by name, so
we'd like the name property to be indexed. And just to show you that you can have
has-a relationships, an author has a bunch of different interests.
Same thing with guestbook entries: you store guestbook entries, each entry has
an id that makes it unique, and it basically has a Parent Key pointing to an Author.
That's the person who wrote the GuestbookEntry, and that's a relationship.
You have messages, which apparently we're never going to search on,
because the message is not indexed.
We're not going to search for
guestbook entries based on the text of the message.
And we have dates, and that's something that we might want to search on.
So once you have an Entity, you have an Author:
an Author has an email, which is the id, and a name, which is indexed.
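As a rough sketch, here's what that Author entity (and a matching GuestbookEntry) might look like with Objectify annotations in Java; the exact class and field names shown on screen may differ, so treat these as illustrative.

```java
import com.googlecode.objectify.Key;
import com.googlecode.objectify.annotation.Entity;
import com.googlecode.objectify.annotation.Id;
import com.googlecode.objectify.annotation.Index;
import com.googlecode.objectify.annotation.Parent;

import java.util.ArrayList;
import java.util.Date;
import java.util.List;

// Author is identified by email (@Id) and searchable by name (@Index).
@Entity
class Author {
  @Id String email;                            // the key: the author's email address
  @Index String name;                          // indexed, so we can filter authors by name
  List<String> interests = new ArrayList<>();  // a has-a relationship

  private Author() {}                          // Objectify needs a no-arg constructor
  Author(String email, String name) { this.email = email; this.name = name; }
}

// Each GuestbookEntry gets a unique id and points at the Author who wrote it.
@Entity
class GuestbookEntry {
  @Id Long id;                 // left null so Datastore allocates a unique id on save
  @Parent Key<Author> author;  // the relationship to the Author who wrote the entry
  String message;              // not indexed: we never search on the message text
  @Index Date date;            // indexed: we might want to search by date

  private GuestbookEntry() {}
}
```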
So you want to create an Author: you basically call the constructor,
just as you would do for any plain old Java object.
A new Author, xjin@bu.edu, name is xjin, and
you have your Author object, but at this point the Author object is only in memory.
You want to save it, you basically call save passing in the entity.
ofy here is from the Objectify library,
which is one of several Java libraries that help you deal with Datastore.
So this code is showing you Objectify: we save the entity, and
at this point the xjin object has been persisted.
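A minimal sketch of that create-and-save flow, assuming the Author class above and that Objectify has the entity registered at startup:

```java
import static com.googlecode.objectify.ObjectifyService.ofy;

void saveAuthor() {
  // Assumes ObjectifyService.register(Author.class) was called at startup.
  Author xjin = new Author("xjin@bu.edu", "xjin");  // only in memory at this point
  ofy().save().entity(xjin).now();                  // now() persists it synchronously
}
```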
If you want to read it, if you want to search for it,
what you can do is say load all authors and filter them by name Ha Jin,
and because name is an indexed field we can do this.
We can filter by name Ha Jin, and we will get back an iterable of authors.
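Sketched again with Objectify, that name query might look roughly like this; the important part is that you get back an iterable rather than a single object:

```java
// Filter on the indexed name property; the result is an Iterable because
// the matching set could be far larger than what fits in memory.
Iterable<Author> authors =
    ofy().load().type(Author.class).filter("name", "Ha Jin").iterable();

for (Author a : authors) {
  // process each matching author here
}
```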
Because Datastore scales up to terabytes, when you search on one of these
indexed properties, what comes back could be gigabytes of data,
much more than can fit into memory, so we give you back an iterable.
If you know you're going to get back only one item, as in the second
example here, where you're loading authors and finding the id xjin@bu.edu,
you're going to get one author back, so you just get back the Author object.
We call it jh within the code, and now we can update the name of jh:
you can say jh.name = Jin Xuefei and then save that entity.
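And the load-by-id and update step might look roughly like this:

```java
// Loading by id returns a single Author rather than an Iterable.
Author jh = ofy().load().type(Author.class).id("xjin@bu.edu").now();

// Update just one field and save the entity back to Datastore.
jh.name = "Jin Xuefei";
ofy().save().entity(jh).now();
```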
Another reason that relational databases may not work very well, and
we've discussed this in the module review section of the previous chapter,
is if you have high throughput needs.
If you have sensors that are distributed all across the world, and
you're basically getting back millions of messages a minute,
that's not something that Cloud SQL can handle very well.
That's not something a relational database can handle very well,
and it's essentially an append-only operation.
We're just getting data and saving it, and
we don't need transactional support.
And because we are willing to give up transactional support, the capacity of
Bigtable is no longer the terabytes that Datastore can support,
but petabytes.
On the other hand what we've given up is the ability to update
just a single field of the object, we have to write an entirely new row.
The idea is that if we get a new object, we basically append it to the table, and
then we read from the latest data and go backwards,
so that the very first object that we find for a particular key
is the latest version of that object.
So Bigtable is really good for high throughput scenarios,
where you want to be not in the business of managing infrastructure,
you want something to be as NoOps as possible.
With Bigtable you basically deal with flattened data, it's not for
hierarchical data, it's flattened and you search only based on the key.
And because you can search only based on the key, the key itself and
the way you design it becomes extremely important.
Number one, you want to think about the key as being the query that you're
going to make, because again you can only search based on the key.
You cannot search based on any property, and because you can't search fast based on
properties, you're going to be searching based on keys, so you want your key itself
to be designed such that you can find the stuff that you want quickly.
And the key should be designed such that there are no hotspots:
you don't want all of your objects, all of your rows,
falling into the same bucket; you want things to be distributed.
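As a purely illustrative example of that advice (not from the course itself), a row key for sensor readings might put the sensor id first, so writes spread across sensors instead of hotspotting on time, and a reversed timestamp next, so the newest reading for a sensor sorts first:

```java
// Hypothetical row-key scheme: sensorId first avoids hotspotting on time,
// and the reversed timestamp makes the latest reading for a sensor sort first.
static String rowKey(String sensorId, long timestampMillis) {
  long reversed = Long.MAX_VALUE - timestampMillis;
  return sensorId + "#" + reversed;
}
```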
Tables themselves should be tall and narrow, right?
Tall because you keep appending to it.
Narrow, why?
The idea being that if you have Boolean flags, for example,
rather than having a column for each flag with a value of 0 or 1,
maybe you just have a column that says
these are the only flags that are true on this object.
This kind of thing becomes extremely useful if you're trying to store, for
example, users' ratings.
A user may rate only five out of the thousands
of items in your catalog, so rather than have thousands of columns, one for every item,
you simply store item, rating pairs for
the things that they've actually rated, and each of those can be a new column.
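A sketch of that tall-and-narrow idea using the HBase API; the "ratings" column family, the user row key, and the item ids are made up here, and only the items the user actually rated become columns:

```java
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// One row per user; a column qualifier appears only for items the user actually rated.
static void saveRatings(Table table) throws java.io.IOException {
  Put put = new Put(Bytes.toBytes("user#42"));
  put.addColumn(Bytes.toBytes("ratings"), Bytes.toBytes("item#1090"), Bytes.toBytes("5"));
  put.addColumn(Bytes.toBytes("ratings"), Bytes.toBytes("item#2215"), Bytes.toBytes("3"));
  table.put(put);  // the Table comes from an HBase connection, as shown below
}
```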
The reason to use Bigtable is that it's NoOps: it's automatically balanced,
it's automatically replicated, it's automatically compacted, it's essentially NoOps.
You don't have to manage any of that infrastructure,
you can deal with extremely high throughput data.
This is how you work with Bigtable: you work with it using the HBase API.
So that's why what you're importing is org.apache.hadoop.hbase.
So you work with it the way you would normally work with HBase:
you basically get a connection.
You go to the connection and you get your table, you create a Put operation,
you add all of your columns, and then you put that into the table, and
you've basically added a new row to the table.
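Putting that together, a minimal sketch of writing one row to Bigtable through the HBase API might look like this; the project id, instance id, table name, and column family are placeholders, and BigtableConfiguration.connect is the Cloud Bigtable HBase client's helper for getting a connection, so check the client docs for the exact setup in your environment:

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

import com.google.cloud.bigtable.hbase.BigtableConfiguration;

public class BigtableWriteExample {
  public static void main(String[] args) throws Exception {
    // Get a connection to the Bigtable instance (project/instance ids are placeholders),
    // then work with it exactly as you would with HBase.
    try (Connection connection =
             BigtableConfiguration.connect("my-project", "my-instance");
         Table table = connection.getTable(TableName.valueOf("sensor-readings"))) {

      // Create a Put keyed by the row key, add the columns, and write the row.
      Put put = new Put(Bytes.toBytes("sensor-17#9223372035854775807"));
      put.addColumn(Bytes.toBytes("metrics"), Bytes.toBytes("temperature"),
                    Bytes.toBytes("21.5"));
      put.addColumn(Bytes.toBytes("metrics"), Bytes.toBytes("humidity"),
                    Bytes.toBytes("0.43"));
      table.put(put);
    }
  }
}
```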