In this course, you will learn to design the computer architecture of complex modern microprocessors.

From the course by Princeton University

Computer Architecture

From the lesson

Multithreading

This lecture covers different types of multithreading.

- David Wentzlaff, Assistant Professor, Electrical Engineering

Okay. So now we're going to move off of vectors and talk about a near cousin of vectors: how you can have vector computing in your desktop today. A lot of this was done by Ruby Lee here at Princeton; she added a lot of multimedia extensions to the HP PA-RISC architecture. There were a couple of other people involved in this, but she was pretty influential in getting it done.

The idea here is that if you have a wide register, say you're doing 64-bit additions, and you don't want to do 64-bit additions, or don't actually have 64-bit data lying around, you could cut it in half and do two 32-bit operations at the same time. Or you can use that same ALU to do four 16-bit or eight 8-bit operations. This is called SIMD, or Single Instruction, Multiple Data. You'll also hear it called short SIMD instructions, because typically the vector length is pretty short, or multimedia extensions. You have an instruction which says, I want to do two 32-bit adds, we'll say, at the same time.
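To make the lane idea concrete, here is a minimal Python sketch of one 64-bit word treated as two 32-bit, four 16-bit, or eight 8-bit lanes. The helper names `pack`, `unpack`, and `lanewise_add` are made up for illustration, and wrapping on overflow is an assumption (real multimedia extensions also offer saturating variants).

```python
def pack(lanes, width):
    """Pack unsigned lane values into one 64-bit word, lane 0 in the low bits."""
    assert 64 % width == 0 and len(lanes) == 64 // width
    lane_mask = (1 << width) - 1
    word = 0
    for i, value in enumerate(lanes):
        word |= (value & lane_mask) << (i * width)
    return word


def unpack(word, width):
    """Split a 64-bit word back into its 64 // width lanes."""
    lane_mask = (1 << width) - 1
    return [(word >> (i * width)) & lane_mask for i in range(64 // width)]


def lanewise_add(a, b, width):
    """Reference semantics of a packed add: every lane is added on its own
    and wraps modulo 2**width, so nothing spills into the neighbouring lane."""
    lane_mask = (1 << width) - 1
    sums = [(x + y) & lane_mask
            for x, y in zip(unpack(a, width), unpack(b, width))]
    return pack(sums, width)
```

For example, `lanewise_add(pack([1, 2], 32), pack([10, 20], 32), 32)` describes the "two 32-bit adds at the same time" instruction from the lecture.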

This was popularized in x86, at least, by MMX, which was the first implementation of this. It's gone on from there to SSE, SSE2, SSE3, SSE4, and now Intel AVX. The differences between MMX and all the different SSEs largely have to do with the length of the register and how many instructions they had. In AVX we've gone to wider, 256-bit registers, and it's extensible to, I think, 1024 bits.

One thing I do want to point out about this, which is interesting, is that it requires changes to your datapath. If you have an adder doing a 64-bit add, and now you want to do eight 8-bit adds, you need to cut the carry chain in seven places. Now, that's if you have a basic adder. It gets a little more complicated if you have something like a carry-lookahead adder, because you may not have a simple place to go snip the carry chain. There is still some place to cut it, but your original design might have propagated across a boundary where you now need to cut. So this is definitely a challenge.
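A software analogue can make the carry-chain cut easier to see. The sketch below is not how the hardware is built; it is a well-known bit trick that emulates the cut on a 64-bit integer by keeping each lane's top bit out of the main addition. The names `high_bit_mask`, `swar_add`, and `carry_cuts` are hypothetical.

```python
MASK64 = (1 << 64) - 1


def high_bit_mask(width):
    """A 64-bit mask with only the top bit of every lane set."""
    return sum(1 << (i * width + width - 1) for i in range(64 // width))


def carry_cuts(width):
    """Number of places the 64-bit carry chain has to be cut."""
    return 64 // width - 1


def swar_add(a, b, width):
    """Packed add on a 64-bit word without letting carries cross lanes."""
    h = high_bit_mask(width)
    # Add everything except each lane's top bit; the largest carry this
    # produces lands in the lane's own top bit, never in the next lane.
    low = (a & ~h) + (b & ~h)
    # Fold the top bits back in with XOR, which discards the carry-out.
    return (low ^ ((a ^ b) & h)) & MASK64
```

With 8-bit lanes, `0x00FF + 0x0001` done as a plain 64-bit add gives `0x0100`, so the carry leaks into lane 1, while `swar_add(0x00FF, 0x0001, 8)` wraps lane 0 to zero and leaves lane 1 alone.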

Also, for things like multiplies, if you want to do eight 8-bit multiplies, the structure looks a little bit different. But the big insight here is that you had that logic anyway. You're just effectively adding muxes on the carry chains in the datapath. And some operations you don't even need to add. Obviously, if you're operating on something like eight 8-bit values and you want to do the logical OR of them, you don't need to add a special instruction for that.
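The point about logical OR can be checked with a small sketch: bitwise operations have no carries, so the ordinary full-width OR already gives the lane-by-lane answer. The function name `or_lanewise` is invented for this illustration.

```python
def or_lanewise(a, b):
    """OR two 64-bit words one 8-bit lane at a time (the slow, explicit way)."""
    result = 0
    for i in range(8):
        lane_a = (a >> (8 * i)) & 0xFF
        lane_b = (b >> (8 * i)) & 0xFF
        result |= (lane_a | lane_b) << (8 * i)
    return result


def or_full_width(a, b):
    """The ordinary 64-bit OR the processor already has."""
    return a | b
```

Both functions return the same word for any pair of 64-bit inputs, which is why no packed OR instruction is needed.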

From an implementation perspective, this is what I was trying to get at here: you have independent adds going on, and they all happen in parallel.

So why do we like multimedia extensions, or these short vector instructions? Let's compare them to our big vector machines. One of the major differences is that you can't control the vector length. The vector length is fixed by the width of the native data type of your instruction set. And strided and scatter-gather operations are hard to do, because typically you just have a single load and store, and you use the processor's ordinary load and store instructions.

The processor doesn't care. In the same way that unary or logical operations don't need special instructions to do short-vector, or single instruction multiple data, operations, you don't need special instructions for SIMD data to be able to do loads and stores. You just load the data and store the data. This is actually starting to change a little bit: some of the newer versions of these extensions do have some scatter-gather support.
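As a rough illustration of the difference between an ordinary load and a gather, the sketch below models memory as a Python `bytes` object. The names `load_unit_stride` and `gather_bytes` are hypothetical, and the base-plus-8-bit-index scheme is an assumption chosen for this example, not a description of any particular instruction set.

```python
def load_unit_stride(memory, address):
    """An ordinary 64-bit load: eight consecutive bytes, lane 0 in the low bits."""
    return int.from_bytes(memory[address:address + 8], "little")


def gather_bytes(memory, base, index_word):
    """Emulated gather: lane i of the result is memory[base + index_i],
    where index_i is the i-th 8-bit lane of index_word."""
    result = 0
    for i in range(8):
        index = (index_word >> (8 * i)) & 0xFF
        result |= memory[base + index] << (8 * i)
    return result
```

Because each index lane here is only 8 bits wide, a gather can reach at most 256 bytes past `base`; the lanes are too narrow to carry a full address on their own.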

It's a little bit harder, if you think about it, because you can't necessarily hold a full address in a vector element. So it's not like you can do indexed addressing directly, because the full address may not fit in there. But, in essence, they've come up with some way to do scatter and gather operations.

A couple of things about the vector register length being limited: you can't do as much work in one operation. You can't necessarily do 64 operations in one instruction, like we did with our vector length of 64. So that's a problem.
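A back-of-the-envelope count shows how the limited register length turns into more instructions. The function name `instructions_needed` is made up, and the 32-bit element size is an assumption, since the lecture does not say how wide the 64 vector elements are.

```python
def instructions_needed(num_elements, register_bits, element_bits):
    """How many packed instructions it takes to apply one operation
    to num_elements values, given the register and element widths."""
    lanes = register_bits // element_bits
    # Ceiling division: a partly filled last register still costs an instruction.
    return -(-num_elements // lanes)


# 64 elements of 32 bits each, on 64-bit, 128-bit, and 256-bit registers.
counts = {bits: instructions_needed(64, bits, 32) for bits in (64, 128, 256)}
```

Under these assumptions a 256-bit register needs 8 instructions where a vector machine with a vector length of 64 needs 1, and a 64-bit register needs 32.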

Unfortunately, what happens here is that you end up having to do more operations and issue more instructions, so you're effectively increasing the bandwidth you need out of your fetch unit. So it's not as good.

Finally, I just wanted to say that these multimedia extensions are starting to move a little bit towards vector processors as they add richer instruction sets. As we get to SSE4, for instance, or SSE4.2, there are more instructions in x86 that can do fancier things. And the vector length is even getting longer, up to 1024 bits.
