0:44
So the problem that computers have is they have to come up with a way,
I mean, computers don't understand letters, actually.
What computers understand is numbers, and so
we had to come up with a mapping between letters and numbers.
And so we came up with a mapping, and there's been many mappings historically.
The one that sort of is the most common mapping of
the 1980s is this mapping called ASCII.
The American Standard Code for Information Interchange.
And it says basically, this number equals this letter.
So for example, in 'Hello world',
the number for capital H is 72.
Somebody just decided that capital H was going to be 72.
Lowercase e, the number is 101.
And newline is 10.
So if you were really and truly going to look at
what was going on inside the computer, it's storing these as numbers.
But the problem is, there are only 128 of these,
which means you can't fit every character into the range 0 through 127.
And so,
in the early days we just kind of dealt with whatever characters were possible.
Like I said, when I started you could only do uppercase.
You couldn't even do lowercase, and so
there is this function, as long as you're dealing with simple values,
that lets you say, hey, what is the actual value for the letter H?
And it's called ord, which stands for ordinal.
What's the ordinal?
What is the number corresponding to H, and that's 72.
What's the number corresponding to lowercase e? It's 101.
And what's the number corresponding to newline? And that is a 10.
Remember, newline is a single character.
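If you want to try this yourself, here's roughly what it looks like in an
interactive Python 3 session:

    >>> ord('H')
    72
    >>> ord('e')
    101
    >>> ord('\n')
    10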
3:00
zzz, all lowercase, and that's because all uppercase letters
are less than all lowercase letters.
Actually this could have been AAA, that's what I should have said there, okay?
So don't worry about that.
Just know that they are all numbers.
And in the early days, life was simple.
We would store every character in a byte of memory,
otherwise known as eight bits of memory.
It's the same thing when you say I have a many gigabyte USB stick: a
16 gigabyte USB stick means there are 16 billion bytes of memory on there.
Which means we could have put 16 billion characters on it in the old days, okay?
So the problem is in the old days we just had so
few characters that we could put one character in a byte.
And so the ord function tells us the numeric value of a simple ASCII character.
And so like I said, if you take a look at this,
the lowercase e is 101 and the capital H is 72.
And then the newline, which shows up here as line feed, is 10.
Now we could represent these in hexadecimal which is base 16 or
octal which is base 8.
Or actual binary which is what's really going on which is nothing but 0's and 1's.
But this is the binary for 10: 0001010.
And so these three columns are just alternate versions of the same numbers.
The numbers go up to 127; if you look at the binary,
you can see this is actually seven bits of binary.
You can see that 127 is all 1's, so the table starts at all 0's and goes to all 1's.
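As a quick check, Python's built-in hex(), oct(), and bin() functions will
show you those alternate versions of the same number; a session might look
like this:

    >>> hex(10)    # base 16
    '0xa'
    >>> oct(10)    # base 8
    '0o12'
    >>> bin(10)    # base 2, the binary for newline
    '0b1010'
    >>> bin(127)   # top of the ASCII range, seven 1's
    '0b1111111'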
And so 0's and 1's are what computers always do.
And if you go all the way back to the hardware, the little wires and
stuff, the wires are carrying 0's and 1's.
So this is what we did: in the 60s and
70s we just said, whatever we're capable of squeezing in, we're totally happy.
We're not going to have anything tricky. And like I said,
about halfway through my undergraduate career I started to see lowercase letters.
I'm like that's really beautiful, lowercase letters.
Now, the real world is nothing like this.
There are all kinds of characters.
And they had to come up with a scheme by which we could map these characters.
And for a while there were a whole bunch of incompatible ways to represent
characters beyond this ASCII, also known as Latin, character set.
Arabic character sets, Asian character sets,
these other character sets just completely invented their own ways of representing characters.
And so you had these situations where Japanese computers pretty much couldn't
talk to American computers or European computers at all.
I mean, the Japanese computers just had their own way of representing characters.
And the American computers had their own way of representing characters and
they just couldn't talk.
But they invented this thing called Unicode.
And so Unicode is this universal code with room for over a million
characters and hundreds of different character sets.
So that instead of saying sorry,
you don't fit with your language from some South Sea island,
it's okay. We've got space in Unicode for that.
And so Unicode has lots and lots of characters, not 128.
Lots and lots of characters.
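Just as a taste, ord() and chr() work on any Unicode character, not only
ASCII; for example, the Euro sign lives well above 128:

    >>> ord('€')
    8364
    >>> chr(8364)
    '€'
    >>> chr(72)
    'H'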
6:13
And so there was a time, like I said in the 70s and
the 80s where everyone had something different.
And even like in the early 2000s, what happened was that as the Internet
came out it became an important issue to have a way to exchange data.
And so we kind of had to say oh, well, it's not sufficient for
Japanese computers to talk to Japanese computers
and American computers to talk to American computers, we want Japanese and
American computers to exchange data.
So they built these character sets, and so there is Unicode,
which is sort of this abstraction of all the different possible characters.
And there are different ways of representing them inside of computers.
And so there's a couple of simple things that you might think are good ideas that
turn out to be not such good ideas, although they're used.
So the first thing: these UTF-16, UTF-32, and
UTF-8 are basically ways of representing a larger set of characters.
Now the gigantic one is 32 bits, which is 4 bytes.
It's 4 times as much data for a single character.
And so that's quite a lot of data, so
you're dividing the number of characters by four.
So if this is 16 GB, it can only handle 4 billion characters, or something.
Divided by four, right, four bytes per character.
And so that's not so efficient.
And then there's a compromise, UTF-16, which uses two bytes,
but then you'd have to pick:
UTF-32 can do all the characters,
and UTF-16 can do sort of lots of character sets.
But even though you might instinctively think that UTF-32
is better than UTF-16, and that UTF-8 is the worst,
it turns out that UTF-8 is the best.
So UTF-8 basically says each character is going to be one, two, three, or
four bytes, and there are little marks that tell it when to go from one byte to four.
The nice thing about it is that UTF-8 overlaps with ASCII, right?
And so if the only characters you're putting in are from the original ASCII,
or Latin, character set, then UTF-8 and ASCII are literally the same thing.
And then it uses special byte values that are not part of ASCII to indicate
flipping from one-byte characters to two-byte characters or
three-byte or four-byte characters.
So it's a variable length.
And so you can automatically detect, you can just be reading through a string and
say, whoa I just saw this weird marker character.
I must be in UTF-8.
And then if I'm in UTF-8, then I can sort of expand this and
represent all those character sets and all the characters in those character sets.
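You can watch that variable length happen by encoding strings and counting
bytes; plain ASCII stays at one byte per character, while a character like
the Euro sign expands to three:

    >>> 'Hello'.encode('utf-8')   # pure ASCII: one byte per character
    b'Hello'
    >>> len('Hello'.encode('utf-8'))
    5
    >>> '€'.encode('utf-8')       # one character, three bytes
    b'\xe2\x82\xac'
    >>> len('€'.encode('utf-8'))
    3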
8:43
And so what happened was, they went through all these things and
as you can see from this graph, the graph doesn't really say much other than
the fact that UTF-8 is awesome and getting awesomer.
And every other way of representing data is becoming less awesome, right?
And this is 2012, so that's a long time ago.
So this was like, UTF-8 rocks, and
that's really because, as soon as these ideas came out, it was really clear
that UTF-8 is the best practice for encoding data moving between systems.
And that's why we're talking about this right now.
Finally, with this networking we're doing with sockets, we're moving data between systems.
So your American computer might be talking to a computer in Japan and
you've got to know what character set's coming out, right?
And you might be getting Japanese characters even though everything I've
shown you is non-Japanese characters or Asian characters or whatever, right?
So UTF-8 turns out to be the best practice
if you're moving a file between two systems.
Or if you're moving network data between two systems we recommend,
the world recommends UTF-8, okay?
So if you think about your computer, inside your computer,
the strings that are inside your Python, like x = 'hello world',
we don't really care what their internal representation is.
And if there's a file, usually the Python running on the computer and
the file have the same character set. It might be UTF-8 inside Python,
it might be UTF-8 inside the file, but we don't care.
You open a file, and
that's why we didn't have to talk about this when we were opening files.
Even though you might some day encounter a file that's different than
your normal character set, it's rare, okay?
So files are inside the computer.
Strings are inside the computer.
But network connections are not inside the computer and when we get to databases
we're going to see they're not inside of the computer either.
And so this is also something that's changed from Python 2 to Python 3.
It was actually a big deal, a big thing.
And most people think it's great, I actually think it's great.
Some people are grumpy about it,
but I think those are just people who fear change.
So, there were two kinds of strings in Python 2:
there was a normal old string and a Unicode string.
And so you can see that Python 2 would be able to make a string constant,
and that's type str, and it would make a Unicode string by prefixing u before the quote.
And that was a separate kind of thing, and then you had to convert back and
forth between Unicode and strings.
What we've done in Python 3 is this is a regular string and
this is a Unicode string, but you'll notice they're both strings.
So it means that inside of the world of Python, if we're pulling stuff in,
you might have to convert it.
But inside Python everything is Unicode.
You don't have to worry about it,
every string is kind of the same whether it has Asian characters or
Latin characters or Spanish characters or French characters, it's just fine.
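You can check this in any Python 3 interpreter; the u prefix is still
accepted, but it no longer makes any difference:

    >>> type('Hello')
    <class 'str'>
    >>> type(u'Hello')
    <class 'str'>
    >>> u'Hello' == 'Hello'
    True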
So this simplifies things,
but then there are certain things that we're going to have to be responsible for.
11:29
So there's one kind of string that we sort of haven't used yet
but that becomes important, and it's present in both Python 2 and Python 3.
Remember how I said in the old days a character and a byte were the same thing?
And so there's always been a thing like a byte string and
they do this by prefixing the b.
And that says, this is a string of bytes, one byte per character. And
if you look at a byte string in Python 2,
and then you look at a regular string in Python 2, they're both type str.
The bytes are the same as the string, and the Unicode is different.
So bytes and regular strings are the same in Python 2,
and regular strings and Unicode strings are different in Python 2.
12:31
And now the byte string and the regular string are different, okay?
So bytes turns out to be raw data: it might be UTF-8, it might be UTF-16,
it might be ASCII.
We don't know what it is, we don't know what its encoding is.
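You can see that difference directly in Python 3:

    >>> type(b'Hello')
    <class 'bytes'>
    >>> type('Hello')
    <class 'str'>
    >>> b'Hello' == 'Hello'    # bytes and str never compare equal in Python 3
    False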
So it turns out that this is the thing we have to manage when we're dealing with
data from the outside.
So in Python 3 all the strings internally are Unicode.
Not UTF-8, not UTF-16, not UTF-32,
and if you just open a file, it pretty much usually works.
But if you talk to a network, now we have to understand this.
Now the key thing is we have to decode this stuff.
We have to realize what is the character set of the stuff we're pulling in.
Now the beauty is that, because 99% or maybe 100% of the stuff you're
going to run across just uses UTF-8, it turns out to be relatively simple.
13:28
So, there's this little decode operation; if you look at this code right here,
when we talk to an external resource we get a byte array back, like the socket
gives us an array of bytes which represent characters.
But they need to be decoded, since they could be UTF-8, UTF-16, or ASCII.
So there is this function that's part of byte arrays, so
data.decode says figure this thing out.
And the nice thing is that you can tell it what character set it is, but
by default it assumes UTF-8 or ASCII dynamically,
because ASCII and UTF-8 are upward compatible with one another.
So if it's like old data, you're probably getting ASCII, if it's newer data,
you're probably getting UTF-8.
And literally, it's a law of diminishing returns,
it's very rare that you get anything other than those two,
so you almost never have to tell it what it is, right?
So you just say decode it, look at it.
It might be ASCII, might be UTF-8, but
whatever it is, by the time decode is done with it, it's a string.
It's all Unicode inside of Python. And out here, this is bytes.
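Here's the shape of that in a session, with a made-up byte string standing
in for what a socket might hand you:

    >>> data = b'Hello world'   # raw bytes, as if received from a socket
    >>> text = data.decode()    # defaults to UTF-8, which covers ASCII too
    >>> text
    'Hello world'
    >>> type(text)
    <class 'str'>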
14:42
And you also can see, when we're looking at the sending of the data,
we're going to turn it into bytes.
So encode takes this string and makes it into bytes.
So this is going to be bytes that are properly encoded in UTF-8.
Again, you could have put the encoding here, 'utf-8', but it just assumes UTF-8.
And this is all ASCII, so the encode doesn't actually change anything, but that's okay.
And then we're sending the bytes out with the send command.
So we have to send the stuff out, then we receive it, we decode it,
when we send it we encode it.
Out in this real world is where the UTF-8 is.
Here we just have Unicode.
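And the other direction looks like this; the message here is hypothetical,
but this is the pattern right before a send:

    >>> msg = 'Hello world\n'
    >>> data = msg.encode()     # str -> bytes, UTF-8 by default
    >>> data
    b'Hello world\n'
    >>> type(data)
    <class 'bytes'>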
15:21
And so, before we do the send and after we receive, we have to encode and
decode this stuff so that it all works out correctly.
And so you can look at the documentation for both the encode and the decode.
Decode is a method in the bytes class.
And you can see that we can tell it the encoding,
you can say if it's not UTF-8; ASCII and UTF-8 are the same thing here.
The default is UTF-8, which is probably all you're ever going to use,
and the same thing is true, strings can be encoded using UTF-8 into a byte array and
then we send that byte array out to the outside world.
16:08
On the way out, we have an internal string.
Before we send it, we have to encode it, and then we send it.
Getting stuff back, we receive it, it comes back as bytes.
We happen to know it's UTF-8 or we're letting it automatically detect UTF-8 and
decode it and now we have a string.
And now internally inside of Python we can write files,
we can do all kinds of stuff in and out, and it all sort of works together.
It's just that out here it's UTF-8.
This is the outside world.
And so you kind of have to look at your program and
say okay, when am I talking to the outside world?
Well in this case, it's when I'm talking to a socket, right?
I'm talking to a socket, so I have to know enough to encode and
decode as I go in and out of the socket.
So it looks kind of weird when you all of a sudden start seeing these
encodes and decodes.
But they actually make sense.
There's sort of like this barrier between this outside world and our inside world.
So that inside our data is all completely consistent and we can mix strings
from various sources without regard to the character set of those strings.
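Putting it all together, here's a minimal sketch of that pattern, assuming a
host like data.pr4e.org that serves a plain-text file over HTTP; notice the
encode on the way out and the decode on the way in:

    import socket

    mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    mysock.connect(('data.pr4e.org', 80))

    # Outbound: our internal Unicode string must become bytes before send()
    cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()
    mysock.send(cmd)

    while True:
        data = mysock.recv(512)       # inbound: raw bytes from the outside world
        if len(data) < 1:
            break
        print(data.decode(), end='')  # bytes -> str, assuming UTF-8/ASCII

    mysock.close()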
So now what we're going to do is we're going to rewrite that program.
It's a short program, but we're going to make it even shorter.