Another approach to getting data is the GET command,
which comes from the httr package.
For doing things like this where there's an open,
easily accessible website, accessing it with
connections like we talked about before might be the easiest way.
But we'll talk in a minute about why the httr package can be very useful in other
settings.
So here I've loaded the httr package, and then what I do is I take that
same URL I was using before, and I just GET the URL and store the result in html2.
And then what I have to do now is extract the content from that
HTML page.
So I do that with the content function, and I say I'm gonna extract the content as text,
just one big text string.
And then I can use the htmlParse command to parse out that text, and
get the parsed HTML.
And so this parsed HTML is gonna look exactly like what I would
have gotten if I had used the XML package to parse the page directly.
And so then I can use xpathSApply to extract out the title
of the page again like I did before.
So that's how you can use GET to do the exact same sort of
exercise that you would have done with the XML package.
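The steps just described can be sketched roughly like this. It is only a minimal sketch: the function names title_from_text and page_title are made up for illustration, the URL in the last comment is a placeholder rather than the one used in the lecture, and the encoding argument is an added assumption.

```r
library(httr)  # GET(), content()
library(XML)   # htmlParse(), xpathSApply(), xmlValue()

# Hypothetical helper: parse one big HTML text string and
# return the text of its <title> node(s).
title_from_text <- function(html_text) {
  parsedHtml <- htmlParse(html_text, asText = TRUE)
  xpathSApply(parsedHtml, "//title", xmlValue)
}

# Hypothetical helper: GET the URL, extract the body as text, then parse it.
page_title <- function(url) {
  html2 <- GET(url)
  # encoding = "UTF-8" is an assumption; adjust it for the page you fetch
  content2 <- content(html2, as = "text", encoding = "UTF-8")
  title_from_text(content2)
}

# Offline check of the parsing step on a small hand-written page
demo_title <- title_from_text(
  "<html><head><title>Demo page</title></head><body></body></html>"
)

# Network usage (placeholder URL, substitute your own):
# page_title("http://example.com/")
```

Splitting the parsing step from the download step is just a convenience here, so the XPath part can be tried without a network connection.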
In some other cases, you might have to do something a little bit more complicated.
So if you navigate to this webpage with your browser,
you'll see that it requires a username and password input.
And so if I just try to get that page using the GET command from the httr
package, I get a response that says Status: 401, and
that's because I haven't been authenticated, so I wasn't able to log in.
So what you can do with the httr package is actually authenticate yourself
for websites.
You do this by making the request again and assigning the result, which will carry what we're gonna call the handle.
We go and get that website, so we pass it the URL again, but
then we also use the authenticate command, and we give it the username and the password.
In this case, this is just a test website, and so
the username is user and the password is password.
And so you can test out things like this to see if you can get access.
In this case now, the response is Status: 200, which means we
were authenticated and actually were able to get access to the page.
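A rough sketch of that comparison follows. The address is an assumption, since the talk only calls it a test website: it points at httpbin's basic-auth endpoint, set up here to accept the username user and the password password. The helper name compare_auth is also made up for illustration.

```r
library(httr)

# Assumed test endpoint (not named in the talk): an httpbin-style
# basic-auth URL that accepts "user" / "password".
auth_url <- "http://httpbin.org/basic-auth/user/password"

# Hypothetical helper: request the page without and with credentials
# and return both HTTP status codes.
compare_auth <- function(url, user, password) {
  pg1 <- GET(url)                                # no credentials
  pg2 <- GET(url, authenticate(user, password))  # HTTP basic authentication
  c(without = status_code(pg1), with = status_code(pg2))
}

# Network usage:
# compare_auth(auth_url, "user", "password")
```

If the endpoint behaves as assumed, the first status should be 401 and the second 200, matching what is described above.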
And now we can look at the names of this pg2,
which gives you all the different components.
These include the cookies that we have for this page, and
then the handle that we used to access it and all that.
And so, we can then use the content function to extract the content from that
website, after having logged in through R.
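As a small sketch, looking inside an authenticated response such as pg2 might go like this. The helper name inspect_response is made up for illustration, and the exact component names you see depend on the installed httr version.

```r
library(httr)

# Hypothetical helper: summarise a response object returned by GET().
inspect_response <- function(pg2) {
  list(
    components = names(pg2),    # e.g. url, status_code, cookies, handle
    cookies    = cookies(pg2),  # cookies stored with this response
    # encoding = "UTF-8" is an assumption; adjust it for the page you fetch
    body       = content(pg2, as = "text", encoding = "UTF-8")
  )
}

# Network usage, with pg2 obtained from an authenticated GET():
# info <- inspect_response(pg2)
# info$components
```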
So make sure that you use handles, because if you use a handle then you can
save the authentication across multiple accesses to a website.
So if you set google to be a handle for a particular website,
then what you can do is tell GET to use that handle, and
you can say go get a specific path, or later you can give it a different path.
So for example, if you authenticate this handle one time
then the cookies will stay with that handle and you'll be authenticated.
You won't have to keep authenticating over and over again as you access that website.
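Reusing one handle across requests can be sketched as follows. The address http://google.com and the two paths are assumptions chosen for illustration, and get_paths is a made-up helper name.

```r
library(httr)

# Create the handle once; it is reused for every request below.
# The address is an assumption for illustration.
google <- handle("http://google.com")

# Hypothetical helper: request two different paths through the same handle,
# so cookies set by the first request are available to the second.
get_paths <- function(h) {
  pg1 <- GET(handle = h, path = "/")
  pg2 <- GET(handle = h, path = "search")
  c(status_code(pg1), status_code(pg2))
}

# Network usage:
# get_paths(google)
```

Creating the handle does not contact the website by itself; only the GET calls do.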