Usually, you'll add those values back into the data frame.

This is particularly important when you're doing something like prediction,

where you wanna create new variables that you might use as predictors.

Common variables that people create are missingness indicators.

So, pointing out where you might have missing data,

that's one really common variable to create.

Also cutting up quantitative variables,

creating variables that are factor versions of a quantitative variable

corresponding to particular values that may be of interest.

Or applying transformations to deal with data that have strange distributions.
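
To make those three kinds of variables concrete, here is a minimal sketch on a made-up numeric vector. The values, the break points, and the choice of a log10 transform are all assumptions for illustration, not something taken from the restaurant data.

```r
# Toy vector with two missing values (made-up numbers)
x <- c(2.5, NA, 40, 7, NA, 120)

# Missingness indicator: TRUE wherever x is NA
x_missing <- is.na(x)

# Factor version of x, cut at illustrative break points
x_group <- cut(x, breaks = c(0, 10, 50, Inf))

# Log transform, one common choice for right-skewed data
x_log <- log10(x)

table(x_group, useNA = "ifany")
```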

So, we're gonna be using this Baltimore restaurant data set as an example.

So again, you could go to their URL and then click over here on this export button

to get the URL for a particular version of the data set.

In this case, I'm using the CSV version.

Again, I see if the data directory exists.

Then I download the file from the internet,

and I load it into R using read.csv.
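
Those steps might look roughly like the sketch below. The helper name fetch_csv, the "data" directory, and the file name are hypothetical, and the URL is left for you to fill in with the CSV link you get from the export button.

```r
# Hypothetical helper: create the directory if needed, download, then read
fetch_csv <- function(url, dest_dir = "data", file_name = "restaurants.csv") {
  # Create the data directory only if it does not exist yet
  if (!file.exists(dest_dir)) {
    dir.create(dest_dir)
  }
  dest <- file.path(dest_dir, file_name)
  # Download the file from the given URL
  download.file(url, destfile = dest, quiet = TRUE)
  # Load the downloaded CSV into a data frame
  read.csv(dest)
}

# Usage (fileUrl is a placeholder for the CSV link you copied):
# restData <- fetch_csv(fileUrl)
```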

Now, I've got a data set, restData, which is the restaurant data that was created

from this file.

And so, one thing that I'm gonna talk about first before I get into

analyzing a data set is creating sequences.

Sequences are often used to index different operations that you're going to

be doing on data, and so it's good to be able to know how to create them.

The command that you use in R is seq.

And so, usually what you do is you tell seq the minimum value and

the maximum value.

And then there's two ways in which you usually specify how many values to

generate.

One is to use the by argument, so by=2 means it starts at 1 and

then creates new values, increasing each new value by 2.

Another way to do it is to specify the length of the vector.

And so what that means is, it'll start at 1 and

end at 10, and it will create exactly three values.
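
As a quick sketch of those two calls, using 1 and 10 as the minimum and maximum from the description above (the variable names are just for illustration, and length.out is the full name of the length argument):

```r
# Step from 1 up to at most 10 in increments of 2
s_by <- seq(1, 10, by = 2)          # 1 3 5 7 9

# Exactly three evenly spaced values from 1 to 10
s_len <- seq(1, 10, length.out = 3) # 1.0 5.5 10.0
```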

A third way is to say, suppose you have a vector x and it has five values in it.

And suppose you wanna create an index, so that you can loop over those five values.

You might use this seq(along = x), which will basically create a vector

of the same length as x, but with consecutive indices that you can use for

looping or accessing subsets of data sets.
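
Here is a small sketch of that third form. The five values in x are arbitrary, and seq_along(x) is shown next to seq(along = x) because it is the dedicated base R function for the same job; the loop body is just a placeholder operation.

```r
# An arbitrary vector with five values
x <- c(1, 3, 8, 25, 100)

# Both produce the consecutive indices 1 2 3 4 5
idx_a <- seq(along = x)   # "along" partially matches the along.with argument
idx_b <- seq_along(x)

# Use the indices to loop over the elements of x
doubled <- numeric(length(x))
for (i in idx_b) {
  doubled[i] <- 2 * x[i]
}
```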

So, one kind of variable that you might wanna create is a variable that indicates

which subset each observation belongs to.

So for example, here we have these restaurant data.

We actually have the neighborhoods that the restaurants are in.

And so, I might wanna find all the restaurants that are in two neighborhoods

near me, Roland Park and Homeland.

And so, if I use the %in% operator, it will allow me to find

only the restaurants that are in those two neighborhoods.

It will return true if you're in that neighborhood and false if you're not.

And then what I'm doing is,

I'm appending this nearMe variable onto the restaurant data set.

So, this allows me to now subset that data set only by the restaurants that are near

me, which is kind of a nice, clean way of subsetting the data set,

as opposed to always having to use this longer %in% expression.
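
A minimal sketch of that pattern is below. It uses a toy stand-in for the restaurant data, so the rows are made up, and the column name neighborhood is an assumption about how the real file labels that field.

```r
# Toy stand-in for the restaurant data (made-up rows)
restData <- data.frame(
  name = c("Cafe A", "Diner B", "Grill C", "Bistro D"),
  neighborhood = c("Roland Park", "Downtown", "Homeland", "Canton"),
  stringsAsFactors = FALSE
)

# TRUE for restaurants in either neighborhood, FALSE otherwise
restData$nearMe <- restData$neighborhood %in% c("Roland Park", "Homeland")

# Count how many restaurants are near me and how many are not
table(restData$nearMe)

# Subset using the stored indicator instead of repeating the %in% test
nearby <- restData[restData$nearMe, ]
```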