Simplest way to download a web page and print the content to stdout with boost

Francesco S. Carta

gervaz said:
Hi all,
can you show me the easiest way to download a web page (e.g. http://www.nytimes.com) and print the output to stdout using the Boost
library?

Thanks,
Mattia

Yes, we can :)

Sorry, but you should try to find the way by yourself first. It's
not hard: split the problem up and ask Google, find pointers and follow
them, try to write some code and compile it. If you don't succeed, you
can post your attempts here and someone will eventually point out the
mistakes.
 
gervaz

Ok, nice advice :p

Here's what I've done (adapted from what I found reading the docs and
googling):

#include <iostream>
#include <boost/asio.hpp>

int main()
{
    boost::asio::io_service io_service;
    boost::asio::ip::tcp::resolver resolver(io_service);
    boost::asio::ip::tcp::resolver::query query("www.nytimes.com", "http");
    boost::asio::ip::tcp::resolver::iterator iter = resolver.resolve(query);
    boost::asio::ip::tcp::resolver::iterator end;
    boost::asio::ip::tcp::endpoint endpoint;
    while (iter != end)
    {
        endpoint = *iter++;
        std::cout << endpoint << std::endl;
    }

    boost::asio::ip::tcp::socket socket(io_service);
    socket.connect(endpoint);

    boost::asio::streambuf request;
    std::ostream request_stream(&request);
    request_stream << "GET / HTTP/1.0\r\n";
    request_stream << "Host: localhost \r\n";
    request_stream << "Accept: */*\r\n";
    request_stream << "Connection: close\r\n\r\n";

    boost::asio::write(socket, request);

    boost::asio::streambuf response;
    boost::asio::read_until(socket, response, "\r\n\r\n");

    std::cout << &response << std::endl;

    return 0;
}

But I'm not able to retrieve the entire web content.
Other questions:
- the while loop looks like an iterator loop, but what does
boost::asio::ip::tcp::resolver::iterator end stand for? Is it a zero
value?
- to see the output I had to use &response, why?

Thanks. I came up with this solution in 30 minutes and I still have
a lot of C++ to learn.

Mattia
 
Francesco S. Carta

gervaz said:
But I'm not able to retrieve the entire web content.
Other questions:
- the while loop looks like an iterator loop, but what does
boost::asio::ip::tcp::resolver::iterator end stand for? Is it a zero
value?

Whatever the value, in the framework of STL iterators the "end" one is
simply something used to mark the end of the container / stream /
whatever, so that you know there are no more data / objects to get. You
shouldn't worry about its actual value. I don't know the details either;
maybe there is something wrong with your program and I'll have a look,
but I'm pressed for time and I wanted to drop in my 2 cents.

- to see the output I had to use &response, why?

It's not good to pass the address of a container to an ostream
unless you're sure its actual representation matches that of a
null-terminated C-style string. In this case I suppose you have to convert
that buffer to something else in order to print its data.

There is also the chance that you have to

- call "read_until" to fill the buffer
- pick out the data from the buffer (possibly flushing / emptying
it)

multiple times, until there is no more data to fill it.
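
A note on the &response question, as a minimal sketch rather than a definitive answer: std::ostream has a standard operator<< overload that takes a std::streambuf* and copies the buffer's readable contents to the stream, and boost::asio::streambuf derives from std::streambuf. So `std::cout << &response` streams the buffer's contents; it is not treated as a C-style string. The example uses a plain std::stringbuf as a stand-in so that it needs no Boost, and the helper name buffer_to_string is made up for illustration.

```cpp
#include <cassert>
#include <sstream>
#include <streambuf>
#include <string>

// Drain whatever is currently readable from `buf` into a std::string.
// (buffer_to_string is a hypothetical helper, not part of Boost or the standard.)
std::string buffer_to_string(std::streambuf* buf)
{
    assert(buf != nullptr);
    std::ostringstream out;
    out << buf;   // std::ostream::operator<<(std::streambuf*), the same
                  // overload that `std::cout << &response` selects
    return out.str();
}
```

With the program from the post, `buffer_to_string(&response)` would return the bytes that read_until has placed in the buffer so far.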

Hope that helps you refine your attempt.
 
Francesco S. Carta

I've played with your program a bit. Up to the line:

    request_stream << "GET / HTTP/1.0\r\n";

it should all be fine.

In particular, the loop that checks for the end of the endpoint list
is fine because, as it seems, those iterators automatically mean
"end" if you don't assign anything to them. This works differently
from, say, a std::list, where you have to explicitly call the
end() method of a list instance.
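
The "default-constructed iterator means end" convention is not specific to Boost; the standard stream iterators behave the same way. Here is a minimal sketch using std::istream_iterator, chosen only as a Boost-free analogue of the resolver iterator (collect_words is a hypothetical helper name):

```cpp
#include <cassert>
#include <istream>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Read whitespace-separated words until the iterator compares equal to a
// default-constructed one, which acts as the end-of-stream sentinel.
std::vector<std::string> collect_words(std::istream& in)
{
    std::istream_iterator<std::string> iter(in);
    std::istream_iterator<std::string> end;   // default-constructed == "end"

    std::vector<std::string> words;
    while (iter != end)
    {
        words.push_back(*iter++);   // same shape as `endpoint = *iter++;`
    }
    assert(iter == end);            // the loop can only exit at the sentinel
    return words;
}
```

Note that in the original loop `endpoint` ends up holding the last resolved entry, since each iteration overwrites it.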

The first problem with your code is where you send the server the
"Host" header. You should replace "localhost" with the domain name you
want to read from - in this case:
    request_stream << "Host: www.nytimes.com\r\n";
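
To keep the Host header in sync with the server actually being contacted, the request text can be built from a single host string. This is only a sketch; build_request and its parameters are names invented for this example.

```cpp
#include <cassert>
#include <string>

// Build a minimal HTTP/1.0 GET request for `path` on `host`.
// The Host header carries the server's own name, not "localhost".
std::string build_request(const std::string& host,
                          const std::string& path = "/")
{
    assert(!host.empty());
    std::string req;
    req += "GET " + path + " HTTP/1.0\r\n";
    req += "Host: " + host + "\r\n";
    req += "Accept: */*\r\n";
    req += "Connection: close\r\n";
    req += "\r\n";                 // blank line ends the header section
    return req;
}
```

In the program from the post, the four `request_stream << ...` lines could then become `request_stream << build_request("www.nytimes.com");`.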

Then we have the (missing) loop to retrieve the data.

The function "read_until" that you are calling will throw when the
socket has no more data to read. Consider also that all overloads
of that function return a size_t with the number of bytes that it has
transferred to the buffer.

It seems you have to catch the exception in order to know when to
stop calling it. Another option is to use the "read_until" overload
that doesn't throw (it takes an error_code argument instead) and
check whether the returned size_t is zero, in which case you would
break out of the loop.
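
The "keep reading until a read delivers nothing" loop can be sketched without Boost by using a plain std::istream as a stand-in for the socket. This is an analogue of the idea, not Asio code: read_until_eof and chunk_size are hypothetical names, and with Asio the stopping condition would come from the error_code or the exception mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Read fixed-size chunks from `in` until a read transfers zero bytes.
std::string read_until_eof(std::istream& in, std::size_t chunk_size = 512)
{
    assert(chunk_size > 0);
    std::string data;
    std::vector<char> chunk(chunk_size);
    for (;;)
    {
        in.read(chunk.data(), static_cast<std::streamsize>(chunk.size()));
        const std::streamsize got = in.gcount();
        if (got == 0)
            break;                 // nothing transferred: no more data
        data.append(chunk.data(), static_cast<std::size_t>(got));
    }
    return data;
}
```

A single call that stops at "\r\n\r\n" returns once the blank line ending the HTTP headers has been seen, which is one reason the original program does not print the whole page; a loop of this shape keeps going until the peer has nothing more to send.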

So far we're just filling the buffer. To print it out you have to
build a std::istream on top of it and get the data out through the
istream.

Try to read_until "\r\n" rather than "\r\n\r\n", then use getline on the
istream to read into a string.
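
The istream-plus-getline step might look roughly like this. It is a sketch that takes any std::streambuf* (a boost::asio::streambuf would qualify, since it derives from std::streambuf); extract_lines is a made-up name, and stripping the trailing '\r' is an addition of this example because getline only splits on '\n'.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <streambuf>
#include <string>
#include <vector>

// Pull lines out of a stream buffer through a std::istream, dropping the
// '\r' that HTTP line endings leave at the end of each line.
std::vector<std::string> extract_lines(std::streambuf* buf)
{
    assert(buf != nullptr);
    std::istream in(buf);
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line))
    {
        if (!line.empty() && line.back() == '\r')
            line.pop_back();
        lines.push_back(line);
    }
    return lines;
}
```

An empty string in the result marks the blank line between the response headers and the body.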

If you want I'll post my (working?) code, but since I've learned a lot
by digging my own way through, I think you can benefit from doing the same.

Happy coding, and feel free to ask for further details if you want.
Heck, reading Boost's template declarations is not much fun...

(I can't exclude that I've said something wrong; this is
new to me too, and I hope more experienced users out there will correct me
in that case.)
 
gervaz

Ok, so far my shortest result:

#include <cstdlib>
#include <iostream>
#include <string>
#include <boost/asio.hpp>

// Print an error message and terminate.
void error(const char* p1, const char* p2 = "")
{
    std::cerr << p1 << ' ' << p2 << '\n';
    std::exit(1);
}

int main(int argc, char* argv[])
{
    if (argc != 2) error("Wrong number of arguments!");

    std::string host(argv[1]);

    // Connect to the host's HTTP port; the socket behaves like an iostream.
    boost::asio::ip::tcp::iostream s(host, "http");

    s << "GET / HTTP/1.0\r\n";
    s << "Host: " << host;
    s << "\r\n\r\n" << std::flush;

    // std::cout << s.rdbuf();

    std::string line;
    while (std::getline(s, line))
    {
        std::cout << line << std::endl;
    }

    return 0;
}

Now, I'm wondering how to handle the connection through a proxy. Any
help?

Ciao,
Mattia
 
red floyd

Now, I'm wondering how to handle the connection through a proxy. Any
help?

Read the RFCs for HTTP and figure it out? At this point, you are WAYYY
beyond the bounds of C++ (even assuming "how do I use Boost to do
this").
How HTTP proxies work is completely off topic here.
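
For readers who land here with the same question, a minimal sketch of the plain-HTTP case: per the HTTP/1.1 RFCs, a client talking to a proxy connects to the proxy instead of the origin server and puts the full URL in the request line. Everything named here (build_proxy_request, its parameters) is hypothetical, and authentication and HTTPS tunnelling via CONNECT are not covered.

```cpp
#include <cassert>
#include <string>

// Build an HTTP/1.0 GET request in the "absolute-form" that proxies expect:
// the request line carries the whole URL, the Host header still names the
// origin server.
std::string build_proxy_request(const std::string& host,
                                const std::string& path = "/")
{
    assert(!host.empty());
    assert(!path.empty() && path[0] == '/');
    std::string req;
    req += "GET http://" + host + path + " HTTP/1.0\r\n";
    req += "Host: " + host + "\r\n";
    req += "Connection: close\r\n";
    req += "\r\n";
    return req;
}
```

The request would then be written to a connection opened to the proxy's own address and port (whatever your network uses) rather than to the target host.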
 
Francesco S. Carta

Uh... we can handle a socket as a simple iostream in Boost? Very nice
to know, well done :)

By the way, floyd is obviously right, diving into proxy issues is
definitely off topic here, you'll find plenty of advice on other
groups (and using search engines as well, of course).

Good luck and all the best going forward :)
 
Jorgen Grahn

red floyd said:
Read the RFCs for HTTP and figure it out? At this point, you are WAYYY
beyond the bounds of C++ (even assuming "how do I use Boost to do
this").
How HTTP proxies work is completely off topic here.

Or, he may want to avoid coding all of that himself and just use system()
or popen() etc. to run a tool which already does all that and more, such as
wget or curl. It depends on what his ultimate goal is.

/Jorgen
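
A minimal sketch of the popen() route, assuming a POSIX system (popen and pclose are POSIX functions, not standard C++). The helper name run_and_capture is invented for this example.

```cpp
#include <cassert>
#include <cstdio>
#include <stdexcept>
#include <string>

// Run `command` through the shell and return everything it writes to stdout.
std::string run_and_capture(const std::string& command)
{
    assert(!command.empty());
    FILE* pipe = popen(command.c_str(), "r");
    if (!pipe)
        throw std::runtime_error("popen failed");

    std::string output;
    char chunk[4096];
    std::size_t got;
    while ((got = std::fread(chunk, 1, sizeof chunk, pipe)) > 0)
        output.append(chunk, got);

    pclose(pipe);
    return output;
}
```

Assuming curl is installed, `std::cout << run_and_capture("curl -s http://www.nytimes.com/");` would print the page, and curl takes care of redirects and proxies. Since the command goes through the shell, untrusted input should never be pasted into it.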
 
