0:02

So that's the R Profiler.

The R Profiler is built around a function in R called Rprof, and Rprof is used to start the profiler in R.

One thing to note is that R must be compiled with profiler support, so it's not something that's going to be built in in all cases. However, I'd say in 99.9% of cases it is; R will only be compiled without profiler support in some very special circumstances. So chances are your version of R can use the profiler.

The other function that's useful is the summaryRprof function, which takes the output from the profiler and summarizes it in a way that's readable, because the raw output from the profiler is generally not very usable.
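A minimal sketch of the basic workflow (the file name and the example computation are my own choices for illustration, not from the lecture):

```r
## Start the profiler, writing call-stack samples to a file
Rprof("profile.out")

## Some work worth profiling: repeatedly fit a linear model
x <- rnorm(1e5)
y <- 2 * x + rnorm(1e5)
for (i in 1:20) fit <- lm(y ~ x)

## Stop the profiler
Rprof(NULL)

## Summarize the raw samples into readable tables
summaryRprof("profile.out")
```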

Â 1:04

So what the Rprof function basically does is keep track of the function call stack at regularly sampled intervals. As your function is running, it queries the function call stack, that is, the chain of functions that call other functions that call other functions, and it just prints it out. That's all it does: at very quick intervals, every 0.02 seconds, it prints out the function call stack.

So the first thing you'll notice is that if your function takes less than 0.02 seconds to run, the profiler will be useless, because it will never sample the function call stack. In general, if your program runs very quickly, the profiler is not useful. But of course, if your program runs very quickly, you probably wouldn't think to run the profiler in the first place, so it's usually not a problem. You really want to use the profiler in situations where your code is taking much longer, at least on the order of seconds.
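One quick way to check whether your code is even in profiler territory is to time it first (this check is my suggestion, not from the lecture):

```r
## If elapsed time is well under 0.02 seconds, Rprof won't see anything
et <- system.time({
  x <- rnorm(1e5)
  y <- 2 * x + rnorm(1e5)
  fit <- lm(y ~ x)
})["elapsed"]
et
```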

So here's just a quick example of the raw output that comes from the profiler. Generally speaking, you will never use this output directly, but I thought it might be interesting to look at what's going on. You can see that I'm just calling the lm function, with a univariate outcome and a univariate predictor.

Â 2:22

And what happens here, as you can see, is that each line of this output is the function call stack at one sampling point. The very right is the top-level call, the bottom of the stack, so to speak, and the very left is the most deeply nested call. So reading from right to left: lm was called, and lm called eval, and eval called eval, and eval called model.frame, which called model.frame.default, which called eval again, and eval in the list. So all these functions call each other, and you can see how deep the function call stack goes. As you go further into the evaluation, the function call stack changes; at the very bottom you can see that lm called lm.fit.
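For the flavor of it, a couple of lines of raw Rprof output look roughly like this (reconstructed for illustration only, not copied from the lecture's actual run; the first line records the sampling interval in microseconds):

```
sample.interval=20000
"eval" "eval" "model.frame.default" "model.frame" "eval" "eval" "lm"
"lm.fit" "lm"
```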

Â 3:19

So that kind of raw output is not particularly easy to read, so we use the summaryRprof function to tabulate the Rprof output and calculate how much time is spent in each function. The idea is that once you have the function call stack, you know that each line of the output is separated by 0.02 seconds; that's the frequency at which it is sampled.

So, given that, you can calculate how many seconds are spent in each function, because if a function appears in the call stack, then you must be spending some time in it. There are two methods for normalizing the data that you get from the R Profiler. One is called by.total, which divides the time spent in each function by the total run time.
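A sketch of how the two normalizations show up in practice (the file name and workload are my own choices; by.total and by.self are the actual component names in the list that summaryRprof returns):

```r
## Profile a repeated lm fit, then look at both normalizations
Rprof("profile.out")
x <- rnorm(1e5)
y <- 2 * x + rnorm(1e5)
for (i in 1:20) fit <- lm(y ~ x)
Rprof(NULL)

s <- summaryRprof("profile.out")
head(s$by.total)  ## time in each function, including the functions it calls
head(s$by.self)   ## time in each function, minus time in lower-level calls
```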

Â 4:12

So it's important to realize that there are two separate concepts here: by.total and by.self. The basic idea is that by.total, normalizing by the total amount of time spent in a function, basically tells you how many times that function appeared anywhere in the call stack. And so, for example, 100% of your time is spent in the top-level function: if the function that you call is, say, lm, you spend 100% of your time in that function, because it was at the top level.

But the reality is that often the top-level functions don't really do anything that's important; all they do is call helper functions that do the real work. So chances are, if your function is spending a lot of time doing something, it's spending that time in the helper functions being called by the top-level function to do all the work. And so it's often not very interesting to know how much time is spent in these top-level functions, because that's not where the real work occurs.

All right, so what you really want to know is how much time is spent in the top-level function after subtracting out all the functions that it calls. It may turn out that a lot of time is spent in the top-level function even after you subtract out all of the lower-level functions, and that would be something interesting. But most of the time you will notice that when you subtract out all the lower-level functions that get called, there's very little time spent in the top-level function itself, because all the work and all the computation is being done in the lower-level functions. So that's where you want to focus your efforts.

So the by.self format is, I think, the most interesting one to use, because it tells you how much time is being spent in a given function after subtracting out all of the time spent in the lower-level functions that it calls. It gives you, I think, a more accurate picture of which functions are truly taking up the most time and which functions you might want to target for optimization later on.

Â 6:31

And you can see that the second-place winner was the lm.fit function. I mentioned that lm.fit is where a lot of the computation occurs, and that was about three and a half seconds, so about half of the time was spent in that function. Now, you also see a number of functions here, model.frame, model.frame.default, eval, that don't really involve computation, but there is a reasonable amount of time spent within those functions, so that's kind of interesting.

Â 7:16

In particular, lm.fit calls a Fortran routine for inverting a matrix, and in most large-scale regression problems, that's where all the computation occurs. The next function that takes a lot of time, apparently about 11% of it, is the as.list function, for the as.list.data.frame method.

It's not immediately clear why so much time is being spent in this function, but it may be something you want to investigate, because it may be something that's not very important to the core computation. And so you can go down the list here and see how much time is being spent in various functions, and you'll see that a lot of these functions don't directly pertain to the core computation; they really pertain more to formatting of the data and things like that.

Â 8:08

The last part of the summaryRprof output is just the sample interval, so you can see at what time interval the sampling of the function call stack took place; you can see it's 0.02 seconds. And there's the sampling time, which is just the total amount of time that the expression took to run. I think this is equivalent to the elapsed time from the system.time function.
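Both of those pieces can be read straight off the summaryRprof result (file name and workload assumed for illustration, as before):

```r
## Profile a short CPU-bound workload
Rprof("profile.out")
x <- rnorm(1e5)
y <- 2 * x + rnorm(1e5)
for (i in 1:10) fit <- lm(y ~ x)
Rprof(NULL)

s <- summaryRprof("profile.out")
s$sample.interval  ## the sampling interval: 0.02 seconds by default
s$sampling.time    ## total elapsed time for the profiled expression
```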

Â So that's a quick tour of the R profiler in R,

Â it's a very handy tool for doing performance analysis, R code to

Â give you some useful feedback and I find often highlights functions that

Â you may not have suspected as being kind of time hogs or bottlenecks.

Â 8:51

That's because they're not really core to the real computation that you're working on. So the profiler can be really useful, I think, for highlighting these kinds of situations, and it often finds things that are unexpected.

The summaryRprof function summarizes the output from Rprof and gives you the percent of time spent in each function. And I think the by.self normalization is the most useful for highlighting bottlenecks in your code.

Â 9:17

One of the implications of using the profiler is that it's useful to break your code into functions. Rather than having one massive function, it's useful to break your code into logical pieces as different functions, so that the profiler can use this information to tell you where the time is being spent.

Remember, the profiler prints out the function call stack, and if you break your code into multiple little functions, the function names that you give them will serve as little identifiers in the function call stack, telling you where the code is spending the most time. So that's another little strategy that can be useful when you're profiling your R code.
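A small sketch of that strategy (the function names and workload here are made up for illustration): wrapping each logical step in its own named function makes those names show up as rows in the profiler's summary.

```r
## Hypothetical analysis split into named helper functions
load_data <- function(n) {
  data.frame(x = rnorm(n), y = rnorm(n))
}

fit_model <- function(d) {
  lm(y ~ x, data = d)
}

summarize_fit <- function(fit) {
  coef(summary(fit))
}

run_analysis <- function(n = 1e5, reps = 20) {
  d <- load_data(n)
  for (i in seq_len(reps)) out <- summarize_fit(fit_model(d))
  out
}

Rprof("analysis.out")
run_analysis()
Rprof(NULL)

## The named helpers now appear as identifiable rows in the summary
summaryRprof("analysis.out")$by.total
```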

The last thing worth learning is that if your R code, or any other R code it calls, calls C or Fortran code, that C or Fortran code is like a black box: it's not profiled. You won't see any information about that code. All you will know is that some time is spent there, but you won't know any details beyond that.
