I am a software developer and would like to outsource some of the work for a project I’m working on.
I would like a C++ web spider class that accepts a URL as a parameter and returns the links and the images on the page. Please see a possible API:
CSimpleSpider web;
bool success = [login to view URL]("[login to view URL]");
List<string>* urls = [login to view URL]();
List<string>* imgs = [login to view URL]();
[login to view URL](); // allow me to call ProcessPage again
(Excuse me if my C++ syntax isn’t spot on, it’s been a while since I used STL.)
As well as the class, you must provide a small command-line test app utilising the class that takes a URL as a command-line parameter and prints all links and image URLs.
Important aspects of the project are:
1) You must build in functionality for [login to view URL] and ignore an input URL that should not be spidered AND ignore links on the page that should not be spidered
2) Links to videos, binaries, etc., should be excluded from the returned URL list
3) Links to images should be included in the returned images list
4) Relative links should be expanded (e.g. a link of “../[login to view URL]” on “[login to view URL]” should be translated into “[login to view URL]”)
5) Links should have session IDs and bookmarks removed
6) Spider should detect excessive page sizes (i.e. over 512 KB) and stop
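To illustrate the kind of behaviour meant by points 2, 3, 5 and 6, here is a rough sketch only, not a required design. All function names, the file-extension lists and the session parameter names are my own assumptions for the example; session IDs embedded in the path (e.g. ";jsessionid=...") and relative-link expansion are not covered here.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: lower-case copy of a string (ASCII only).
static std::string toLower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Point 5: remove the bookmark (#fragment) and any query parameter whose
// name looks like a session ID. The key list is an assumption.
std::string stripFragmentAndSession(const std::string& url) {
    const std::string u = url.substr(0, url.find('#'));
    const std::size_t q = u.find('?');
    if (q == std::string::npos) return u;

    static const std::vector<std::string> sessionKeys = {
        "phpsessid", "jsessionid", "sessionid", "sid"};
    const std::string base = u.substr(0, q);
    const std::string query = u.substr(q + 1);

    std::string kept;
    std::size_t start = 0;
    while (start <= query.size()) {
        std::size_t end = query.find('&', start);
        if (end == std::string::npos) end = query.size();
        const std::string param = query.substr(start, end - start);
        const std::string key = toLower(param.substr(0, param.find('=')));
        const bool isSession =
            std::find(sessionKeys.begin(), sessionKeys.end(), key) != sessionKeys.end();
        if (!param.empty() && !isSession) {
            if (!kept.empty()) kept += '&';
            kept += param;
        }
        start = end + 1;
    }
    return kept.empty() ? base : base + '?' + kept;
}

// Points 2 and 3: decide by file extension which list a link belongs in.
enum class LinkKind { Page, Image, Excluded };

LinkKind classifyByExtension(const std::string& url) {
    // Only the part before any query string or fragment is inspected.
    const std::string path = toLower(url.substr(0, url.find_first_of("?#")));
    const std::size_t slash = path.rfind('/');
    const std::size_t dot = path.rfind('.');
    if (dot == std::string::npos ||
        (slash != std::string::npos && dot < slash)) {
        return LinkKind::Page;  // last path segment has no extension
    }
    const std::string ext = path.substr(dot + 1);

    // Both lists are illustrative assumptions, not a complete set.
    static const std::vector<std::string> images = {
        "jpg", "jpeg", "png", "gif", "bmp", "webp"};
    static const std::vector<std::string> excluded = {
        "exe", "zip", "gz", "tar", "iso", "mp4", "avi", "mov", "mpg", "wmv"};

    if (std::find(images.begin(), images.end(), ext) != images.end())
        return LinkKind::Image;
    if (std::find(excluded.begin(), excluded.end(), ext) != excluded.end())
        return LinkKind::Excluded;
    return LinkKind::Page;
}

// Point 6: append a downloaded chunk only while the page stays within
// the limit (512 KB by default); false means the spider should stop.
bool appendWithinLimit(std::string& body, const std::string& chunk,
                       std::size_t limit = 512 * 1024) {
    if (body.size() + chunk.size() > limit) return false;
    body += chunk;
    return true;
}
```

In this sketch a link would first go through stripFragmentAndSession and then classifyByExtension, so that "Page" links end up in the URL list, "Image" links in the images list, and "Excluded" links are dropped. Classifying by extension is only a heuristic; checking the Content-Type header would be more reliable if that is practical.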
PLEASE SEE MISC CODING DETAILS BELOW.