We are looking for a developer to build a web crawler that collects information about websites and sends it to a central XML-RPC server. The central XML-RPC server will be built by our own developers. The project can be delivered in phases. Please provide some information about your experience with crawlers.
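To make the crawler-to-server hand-off concrete, here is a minimal sketch in Python of how results could be encoded and sent over XML-RPC. The method name `crawler.submit_results`, the `job_id` parameter, and the shape of the result records are all assumptions for illustration; the real interface will be defined by the central server.

```python
import xmlrpc.client

# Hypothetical method name; the real one is defined by the central server.
SUBMIT_METHOD = "crawler.submit_results"


def encode_submission(job_id, results):
    """Return the XML-RPC request body for one job's results (no network I/O)."""
    return xmlrpc.client.dumps((job_id, results), methodname=SUBMIT_METHOD)


def submit_results(server_url, job_id, results):
    """Send the results to the central server and return its reply."""
    with xmlrpc.client.ServerProxy(server_url) as proxy:
        # getattr is used because the method name contains a dot.
        return getattr(proxy, SUBMIT_METHOD)(job_id, results)
```

`encode_submission` is handy for inspecting or logging the payload offline, while `submit_results` performs the actual call once a server URL is known.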
## Deliverables
Program functions:

- Crawl the web for the information listed below
- Collect unknown sites and add them to the crawl queue
- Save the requested information in an XML file
- It must be possible to run multiple crawlers at different locations at the same time
- The program must use threads for speed
- Crawling speed is very important (roughly 15 Mbps or more)
- Request a job (a list of URLs) from a remote server (XML)
- Send job results (an XML file with results) to a remote server
- Respect the robots exclusion standard
- Use keep-alive requests where possible
- Show the progress and statistics of the crawling job

Configuration:

- Crawling depth
- Crawling speed at different times of day or levels of system activity
- Number of threads
- User agent of the crawler
- List of domains that will not be crawled
- List of known applications, statistics services, advertisements, etc.
- List of RBL servers to check

Statistics specifications:

- Speed (kbps/Mbps; bits, not bytes)
- Failed DNS resolutions
- Failed connections
- Time left for the current job
- Running time of the current job

Some useful information:

- [login to view URL]
- [login to view URL]
- [login to view URL]
- [login to view URL]
- [login to view URL]

Data resources:

- [login to view URL]
- [login to view URL]
- [login to view URL]

IP & AS numbers:

- ftp://[login to view URL]
- ftp://[login to view URL]
- ftp://[login to view URL]
- ftp://[login to view URL]
- ftp://[login to view URL]

Information to collect:

- Website Title
- Meta Description
- Meta Keywords
- Languages
- DMOZ Listing: [location]
- DMOZ Title
- DMOZ Description
- Alexa Related Sites
- Alexa Trend/Rank
- Server Type
- IP Address
- IP Country
- IP AS Number
- Hosting Provider
- HTTP Response Code
- P3P (Platform for Privacy Preferences) Settings
- Blacklist Status
- SSL Certificate valid from
- SSL Certificate valid until
- SSL Certificate authority
- SSL Certificate single root: [is the certificate chained?]
- Website Status
- Domain Registrar
- Domain registered: [date]
- Domain expires: [date]
- Registrar Status
- Name Servers
- Mail Exchanges
- SPF (Sender Policy Framework) settings
- Thumbnail: [a small image of the page]
- Known applications: [Outlook Web Access, phpMyAdmin, Drupal, PHP-Nuke, etc.]
- Outgoing URLs on this site: [links to another domain]
- Available RSS/XML/Atom feeds: [application/rss+xml]
- SearchProvider: [application/opensearchdescription+xml]
- Site statistics: [Nedstat, Analytics, etc.]
- Advertisements: [AdSense, Yahoo, MSN, IntelliTXT, etc.]
- Charset
- Use of JavaScript
- Use of cookies
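As an illustration of a few of the fields above (title, meta description and keywords, feeds, outgoing URLs), here is a minimal Python sketch that parses one page and writes the result as XML. The element names (`site`, `meta_description`, `outgoing_urls`, and so on) are assumptions, since the posting does not fix an XML schema, and the sketch only handles feeds declared via `<link type="...">`.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import xml.etree.ElementTree as ET

# MIME types treated as feeds (assumed list, extend as needed).
FEED_TYPES = ("application/rss+xml", "application/atom+xml")


class PageInfoParser(HTMLParser):
    """Collects title, meta description/keywords, feeds and outgoing links."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.base_host = urlparse(base_url).netloc
        self.title = ""
        self.meta = {}
        self.feeds = []
        self.outgoing = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name in ("description", "keywords"):
                self.meta[name] = attrs.get("content") or ""
        elif tag == "link":
            if (attrs.get("type") or "").lower() in FEED_TYPES and attrs.get("href"):
                self.feeds.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "a" and attrs.get("href"):
            url = urljoin(self.base_url, attrs["href"])
            host = urlparse(url).netloc
            # Only links that point to another host count as outgoing.
            if host and host != self.base_host:
                self.outgoing.append(url)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def page_to_xml(url, html):
    """Parse one page and return its collected fields as an XML string."""
    parser = PageInfoParser(url)
    parser.feed(html)
    parser.close()
    site = ET.Element("site", {"url": url})
    ET.SubElement(site, "title").text = parser.title.strip()
    ET.SubElement(site, "meta_description").text = parser.meta.get("description", "")
    ET.SubElement(site, "meta_keywords").text = parser.meta.get("keywords", "")
    feeds = ET.SubElement(site, "feeds")
    for feed_url in parser.feeds:
        ET.SubElement(feeds, "feed").text = feed_url
    outgoing = ET.SubElement(site, "outgoing_urls")
    for link in parser.outgoing:
        ET.SubElement(outgoing, "url").text = link
    return ET.tostring(site, encoding="unicode")
```

Fields such as WHOIS data, SSL certificate details, or blacklist status need separate network lookups and are not covered by this sketch.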
## Platform
Windows XP