0:07

Let's look at some of the technical challenges involved in

astronomical data and calculations and how we tackle those challenges.

We're going to want to look at how much data we want to store, how

long it takes to search through it, and how long it takes to do calculations.

And in each of those cases, we'll see

there's a brute force method, and a smart method.

And in fact we usually want to do both.

And then right at the end we'll wrap up with looking at how

we try to make data access over the internet as easy as possible.


0:48

So let's talk about how much data we need.

First of all, let's think about a single CCD image.

So maybe one CCD image is about 4,000 pixels across, and each of those pixels stores a number with 16 bits, which is 2 bytes, of information.

And if you do the sums, that adds up to 32 megabytes.
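That sum can be checked in a couple of lines of Python:

```python
# Rough storage estimate for a single CCD image, using the numbers above.
pixels_per_side = 4_000          # ~4,000 pixels across
bytes_per_pixel = 2              # 16 bits = 2 bytes
image_bytes = pixels_per_side ** 2 * bytes_per_pixel
image_megabytes = image_bytes / 1e6
print(image_megabytes)           # 32.0 megabytes
```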

1:24

So that's what a single CCD image might require for storage.

Not too much by modern standards.

However, as you've seen, we tend to use large mosaic cameras with many CCDs put together in a big array.

Those gigapixel cameras produce much bigger images.

For a mosaic camera, a single image might be 1 to a few gigabytes.

1:55

So that's a big image. Now what about if we want to survey the whole sky?

Well you could ask how many pixels are there over the whole sky?

If we try to pave the sky with CCD images, what do we need?

Well it depends how fine the pixels are,

but let's suppose the pixels are about one third - 0.3 -

of an arc second, so that makes

the pixels small enough to get reasonably decent images.

2:31

Then again, assume we store 16-bit numbers for each pixel.

Oh, and by the way, a 16-bit number is enough to store values up to about 65,000, and that's typically the range CCDs produce.

2:53

Now actually, of course, we want to do the same thing at several different wavelengths, different colors, and we also want to repeat the survey over time.

So in practice, for a typical modern sky survey, the whole sky will come to something like petabyte scales.
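As a back-of-envelope check, here's that whole-sky estimate in Python. The pixel scale and bit depth come from the text; the sky area is the standard ~41,253 square degrees, and the numbers of colour bands and repeat visits are purely illustrative assumptions:

```python
# Whole-sky pixel budget, back-of-envelope.
sky_sq_degrees = 41_253                    # area of the whole sky
arcsec2_per_sq_degree = 3600 ** 2
pixels_per_arcsec2 = (1 / 0.3) ** 2        # 0.3-arcsecond pixels
bytes_per_pixel = 2                        # 16 bits

one_pass_bytes = (sky_sq_degrees * arcsec2_per_sq_degree
                  * pixels_per_arcsec2 * bytes_per_pixel)
print(one_pass_bytes / 1e12)               # ~12 terabytes for one pass

# Several colour bands and repeat visits multiply that up (assumed numbers):
bands, epochs = 5, 20
survey_bytes = one_pass_bytes * bands * epochs
print(survey_bytes / 1e15)                 # ~1.2 petabytes
```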

Now, a reminder about all of these petas and gigas and so on... a factor of 10 to the power 3, a thousand in scientific notation, that's kilo, as in kilobyte, etcetera.

10 to the 6, that's big M for mega.

10 to the power 9, a billion, is a big G for giga.

10 to the power 12, big T for tera.

And then we'd get up to 10 to the power 15, a thousand trillion.

That's a big P for peta.

So, a petabyte is a thousand trillion bytes and

remember each byte is 8 bits in computer speak.

So that's how big a sky survey needs to be.

4:04

Now today if you've got a reasonably good laptop, you know, you could

get yourself a 1 terabyte disk, it's not that unusual to do so.

So maybe to store, for yourself, a sky survey, you need about 1000 laptops.

4:24

So it's doable, but really a bit daft.

If every astronomer has to have 1,000 laptops to store

their own copies of all their favourite databases, that's just silly.

So here's the smart solution. What it drives

us into is what's known as a service economy.

Around the world, there's a handful of big data centers, one of which is here in Edinburgh, and there are a number of others,

4:49

where we store the major sky surveys on big computers with lots of disks.

Astronomers around the world, through the internet, go and get only the data they want, or run calculations or searches on our servers here.

So, that's the smart way to do it.


5:13

It's not just the pixels, the images, that

astronomers want, but it's catalogues of objects.

If you imagine here's an image of the sky,

and there are lots of stars and galaxies on it.

We have software that goes over here, spots each of these objects.

Each of those objects becomes a row in a table, and that makes a catalogue of objects.

There are lots of columns, and each of these columns is a different piece of information about that object.

So, that, then, is our catalogue.

So, how big are these catalogues?

Well, it's not nearly as big as the pixel data.

5:57

So for instance, out of a sky survey we might have a billion objects, or a few billion, and maybe there might be 50 of these columns.

And if each of those entries is a couple of bytes, then we end up with something like 100 gigabytes for a big sky survey.
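Again, the sum is easy to check:

```python
# Catalogue size estimate: a billion rows, 50 columns, 2 bytes per entry.
n_objects = 1_000_000_000
n_columns = 50
bytes_per_value = 2
total_bytes = n_objects * n_columns * bytes_per_value
print(total_bytes / 1e9)   # 100.0 gigabytes
```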

So that's no problem to store, however searching through it is something else.

And that's what we'll look at next.


So how do we search through a table of a billion objects and

find just the one we want, that redshift seven quasar or the killer rock or whatever?

So imagine here we've got lots of rows in our table, and this is sitting on the hard drive on our computer, and then let's imagine over here is the CPU in the computer, the bit that does the calculating.

Now, in order to do our search, essentially what we have to do is take one row of data, bring it into the CPU, do a calculation, and decide whether we want that one or not.

And then we take the next row and do it again,

and the next row and do it again and so on.

Now, imagine all this data streaming from the hard drive to the CPU. A good PC will run at gigahertz rates, so in principle you can stream a billion rows of data like this through in a split second.

It's not a problem.

However, it doesn't really work like that.

Any search process like this, any transfer from a hard drive to the CPU, is done in lots of chunks, and each one of those has some kind of overhead. That overhead may be only a few milliseconds, but when you multiply a few milliseconds by a billion, you're into a much longer time.

It could take days to search through your big database.
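To see why the overhead dominates, here's a rough comparison; the 2-millisecond figure is an assumed illustration of the "few milliseconds" per access mentioned above:

```python
# Ideal streaming versus per-access overhead for a billion-row table.
rows = 1_000_000_000
rows_per_second = 1e9                # streaming at gigahertz rates
ideal_seconds = rows / rows_per_second
print(ideal_seconds)                 # ~1 second if nothing else got in the way

overhead_seconds = 2e-3              # assumed overhead per access (2 ms)
worst_case_days = rows * overhead_seconds / 86_400
print(worst_case_days)               # ~23 days if every row paid that overhead
```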

7:59

Now, modern solid state disks, as opposed to spinning hard drives, have much smaller overheads - they're faster - but they are also much more expensive, which is why data centers, at least scientific ones, are not using them.

8:16

So the key point is that you don't

actually necessarily have to search through everything every time.

The first thing - and this is just the same

as say Google or Amazon do - you save the most

popular searches so they can be brought back quickly

the next time somebody asks pretty much the same thing.

The next thing is that, of the various columns here, you figure out which are the most commonly searched, and put your database in the right order so that you can search through them quickly.

8:48

And then you can pick examples of these columns to

make an index on and search through those particularly quickly.
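Here's a toy sketch of why an index helps, using Python's `bisect` on a hypothetical pre-sorted brightness column; the numbers are made up:

```python
import bisect

# Hypothetical catalogue column: object brightness in magnitudes,
# pre-sorted so it can act as an index, as described above.
magnitudes = sorted([18.2, 21.7, 15.3, 19.9, 16.8, 22.4, 17.5])

# Brute-force scan: touch every row to find objects brighter than 18.0.
brute = [m for m in magnitudes if m < 18.0]

# Indexed search: a binary search finds the cut-off in O(log n) steps,
# then we slice - no need to examine the rest of the table.
cut = bisect.bisect_left(magnitudes, 18.0)
indexed = magnitudes[:cut]

print(brute == indexed)   # True: same answer, far fewer comparisons at scale
```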


So let's talk about astronomical calculations, and how long they take.

And I'll take as an example the so-called N-body calculations that cosmological theorists use.

And the idea here is that we take lots of fake matter particles, and if you take any two of those particles, we can calculate the gravitational force between them, and that tells us how they're going to move over time.

But we need to take every possible pair

of particles and calculate the forces between all

those particles, to understand how the whole ensemble

of particles is going to evolve with time.
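The pairwise loop described above can be sketched in one dimension. This is a toy illustration - the masses, positions, gravitational constant and softening term are all made up - not a production N-body code:

```python
# Brute-force pairwise gravity in 1-D: every pair is visited once,
# so the work grows as O(n^2) with the number of particles.
def pairwise_forces(masses, positions, G=1.0, eps=1e-3):
    n = len(masses)
    forces = [0.0] * n
    for i in range(n):
        for j in range(i + 1, n):
            dx = positions[j] - positions[i]
            # Inverse-square attraction; eps softens very close encounters.
            f = G * masses[i] * masses[j] * dx / (abs(dx) ** 3 + eps)
            forces[i] += f      # particle i pulled towards j
            forces[j] -= f      # equal and opposite
    return forces

f = pairwise_forces([1.0, 1.0], [0.0, 1.0])
print(f)   # two equal and opposite forces
```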

Now, a big calculation might have a million fake matter particles, and a really big one might have 100 million fake matter particles. That step from a million to 100 million makes a big difference, as we'll see.

But first of all, let's make the basic point.

Let's imagine we've got one of

these big simulations with 100 million particles.

So, that's 10 to the power 8 particles.

10:19

Now, on a fast computer, doing one calculation for each of those particles is going to take less than a second.

It's not a problem.

Let's just say that takes 1 second.

However, as I just described, we need to do not one calculation per

particle, but one calculation for every pair of particles. So we need

to do 10 to the power 8 times 10 to the power 8 calculations.

Okay, every particle in that 10 to the 8 has to be paired with all the others.

10:53

So then, that's an enormous amount of time.

That's going to take years.
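Putting numbers on that, at the assumed rate of 10 to the 8 calculations per second:

```python
# Time for one timestep of a brute-force N-body sum with 10^8 particles.
n = 10 ** 8
pair_calculations = n * n            # 10^16 pairwise force calculations
calcs_per_second = 10 ** 8           # assumed: the "10^8 in 1 second" rate above
seconds = pair_calculations / calcs_per_second
years = seconds / (365.25 * 86_400)
print(years)                         # ~3 years for a single timestep
```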

And that essentially would make one frame, one time step

in that simulation movie that we saw earlier.

And you need lots of those to see how the universe evolves.

So this is just very difficult.

So the brute-force solution is to say: okay, if this is what one PC can do, we need a supercomputer.

So a supercomputer is really just like thousands of computers chained together, working in parallel.

So a big supercomputer might have several thousand nodes, and every one of those nodes might have 10 or 20 cores in it.

So there could effectively be many thousands of CPUs working in parallel.

11:45

But even then, there are two snags. It still wouldn't be fast enough, with this sort of calculation, to get things done really quickly - and the other snag is that those machines cost millions of pounds.

We'd like to do something a bit smarter.

12:03

So the smart solution is to do with being approximate.

What I've described here is what you have to do if you're going to do this calculation exactly: every particle and its effect on every other particle.

But there are shortcuts you can take, which have to do with the fact that particles further away are, as individual pairs, less important.

(Although they add up to a lot.)

Now, we haven't got time to explain exactly how this works.

For the mathematically minded: instead of a problem that scales as n times n, we can get a speed-up so that it scales as n times the logarithm of n.

Now this makes a very serious

difference to the speed of our calculation.

So for instance let's start... if we imagine we have

our 10 to the 6 particles and let's say...

14:04

If we do the n times the logarithm of

n method, that comes out to about 30 minutes.

Now notice that's still quite a long time to calculate one timestep in this simulation, but at least it's something plausible that we can do.
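Under the same assumed rate as the brute-force estimate, the two scalings can be compared directly. Real codes have larger constant factors, which is why the lecture quotes about 30 minutes rather than seconds; the relative speed-up is the point here:

```python
import math

# Comparing n^2 with n*log(n) scaling for n = 10^8 particles.
n = 10 ** 8
rate = 10 ** 8                      # pair calculations per second (assumed)

brute_seconds = n * n / rate        # ~10^8 seconds: years per timestep
smart_seconds = n * math.log2(n) / rate
speedup = brute_seconds / smart_seconds
print(round(speedup))               # millions of times faster
```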

So big calculations are very difficult.


So we've been talking about technical challenges, how

much data there is to store, how hard that is, how long it takes to search, how long

it takes to do calculations, and all those problems are about machine time,

how long it takes a computer or a super computer to perform these tasks.

But the real world problem is not just

about machine time, it's about human time as well.

15:00

So for example what we don't want is that every time we go and get some data we have

to spend a whole afternoon working out how this particular website works, what

we have to do, and when we get the data we have to write a special

piece of software to deal with that data and plot it on top of something else.

All of that just uses up astronomer time, even if the machines are very fast.

Now, the modern internet that we're used to - when we're looking

for information, doing our shopping, etc, is very point and click.

It's very easy.

It took a lot of effort to get it that way, but it's automated.

And we want astronomy to work the same way, grabbing data, mixing and matching it.

That ideal is known as the virtual observatory.

And the secret - to either the virtual observatory or to the internet as a whole - is the same thing.

It's standardization.

What we need, essentially, is for everything to have the same screw threads, so the bits fit together.

The different web pages, the data sets, and so on.

So for the internet, that's about things like TCP/IP, the basic internet protocols, or HTTP, which defines how you speak to a website, or HTML, how you write the content of a website.

So all those things are standards which were agreed

internationally, and that's what makes it all magic and easy.

In the same way, in astronomy we want to standardize the format of data, the data access protocols, and so on.

And we're in the middle of that process as we speak.
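As one concrete example of such a standard, the Virtual Observatory community (the IVOA) defines a Simple Cone Search: a plain HTTP request with standardized RA, DEC and search-radius parameters. The endpoint below is a placeholder, and this sketch only builds the query string rather than contacting a real service:

```python
from urllib.parse import urlencode

# Sketch of an IVOA Simple Cone Search request - one of the Virtual
# Observatory's standard "screw threads". The base URL is hypothetical;
# real data centers publish their own endpoints.
base_url = "https://example.org/scs"              # placeholder endpoint
params = {"RA": 180.0, "DEC": -1.5, "SR": 0.1}    # position and radius, degrees
query = base_url + "?" + urlencode(params)
print(query)
```

Because every compliant service accepts the same three parameters and returns results in the same standard table format, the same client code works against any of them - that is the standardization the text describes.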