In this course, you will learn to design the computer architecture of complex modern microprocessors.

From the course by Princeton University

Computer Architecture

From the lesson

Vector Processors and GPUs

This lecture covers the vector processor and optimizations for vector processors.

- David Wentzlaff, Assistant Professor

Electrical Engineering

So, an important aspect of all this vector work is: how do you compile for this? Well, thankfully, we actually have compilers that can do automatic vectorization. And one of the challenges here, if you look at this element-wise multiply, is that you have one loop running and another loop running, and your compiler needs to figure out that it can merge those loops and run them at the same time.

And compilers have actually gotten pretty sophisticated. If you look at the Cray compiler now, it can basically do outer-loop parallelism, it can handle certain types of parallelism with loop-carried dependencies, and it can vectorize all of this. But that requires some pretty deep compiler analysis. This works especially well for things like Fortran codes, where you don't have random pointers pointing in different places.

C code gets a little bit harder. So, what if you don't want to execute the same code on all the elements of your vector?

Well, that could be a problem. So, here we have a piece of C code which loops over some big vector. It checks whether each value is greater than zero, and only if it's greater than zero does it do the next operation. Now, there have been extensions to vector processors that allow effectively predicated, or masked, operations on a per-element basis across the vector. The way you would do this is: you load the entire vector, set a mask register, which holds a one or a zero per element as the result of this comparison, and then do the operation. You can basically put this together with these bit-by-bit comparisons and get slightly different control flow for the different elements within a vector.

Just to show the implementation of this: if we look at how to actually implement masking, one way to do it is you actually do every operation.

So, let's say you're doing a multiply and your vector length is 64. You do all 64 multiplies, but you just disable the write to the register file on the elements that have the mask bit turned off. Or, you could have a much fancier implementation which takes out the work that doesn't have to be done, but the control for that is quite a bit harder, and I would say the simple implementation is probably more common. The fancy version is hard to justify largely because, if you have the resources anyway, say if you have multiple lanes, it might just make sense to go execute a sort of null operation later.

Something else that's pretty common

in vectors is reductions. What I mean by a reduction is: let's say you have an array and you want to add all the elements of the array into one variable. That's sort of a vector-to-scalar operation, and you can't really do it with what we've discussed so far; there's no vector operation that will operate across all of these values and combine them into something useful. But what you can do is apply some software tricks. One of the tricks is: you take the whole vector and instead treat it as two vectors. You cut it in half, overlap the two halves, and do parallel adds. Then you take the result of that, cut it in half again, overlap those two parts, and do adds again. So you can do lots of parallel adds and effectively build a reduction operation by building a tree of adds. If we have our vector here, we cut it in half and add this part to this part, and the result is half the size; cut that in half, add this part to that part, and the result is half the size again; and we keep doing adds. So we can use our vector arithmetic to effectively do a reduction.

We're about out of time here.

Let me quickly talk about scatter-gather; the idea isn't that deep, though the implementation of it can be very hard. Say we want A[D[i]]: we want to index one vector using values held in another vector. This is called a gather. Scatter is the other direction, when you're doing a store through a double level of indirection, an index of an index. And in the instruction set in your book, they actually have an instruction to do this.

LVI, here. Well, what that basically does is it takes each element of vector D, indexes into vector C, and that element is the result. The problem with this, of course, is that your accesses are not going to be nicely laid out in memory; you're going to be sort of jumping around in memory.

Let's stop here for today, and we'll talk a little bit more about vectors and GPUs next time.
