Scrape 500K web pages

Closed · Posted Nov 12, 2007 · Paid on delivery

We are reposting this project because our previous coder was not able to complete it in time. Please bid only if you are confident you can complete the entire project within 1 week. Note that this project requires multiple instances, each using a different IP address. In your bid, please briefly describe how you plan to complete the project.

We need a developer to scrape approximately 530,000 web pages cached by a search engine. The developer will go through 26 different search terms, parse each result page for specific URLs, and then scrape a cached copy of the actual page. For each cached page, the developer will extract certain bits of content and insert them into our MySQL database using the provided schema. The specific content is conveniently annotated with CSS, so the developer can easily use XPath or simple regular expressions to parse it. The content being scraped is in the public domain.

The project (including the scrape) should be completed in one week. You should run at least 20 instances/threads in parallel, each using a different IP address. When you restart a thread, it needs to use a different IP address. Given 2 seconds per page, we expect the scrape to take less than 1.5 days.

It is important to note that the search engine will likely limit the scraper to 500-5,000 requests per IP address per 24-hour period. Doing the math, for 500,000 pages and a 24-hour limit of 1,000 requests per IP address, this equates to 500 different IP addresses. Of course, if you spread the task over four days, you only need 125 different IP addresses.

We will pay for pre-approved server/bandwidth usage. You are free to use your own servers or a virtual server like EC2. We are somewhat platform agnostic, but being Rails and Java guys, we have a definite preference for solutions in one of the two. For each milestone, the developer will send a snapshot of the latest codebase, and we will validate and sign off on the scraped data.
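The capacity figures quoted above can be checked with a few lines of arithmetic. This is only a sketch in Python (the posting itself prefers Rails or Java), and the function names `ips_needed` and `wall_clock_hours` are illustrative, not part of any provided code. It assumes the same IP addresses are reused on each day and that only cached-page fetches count against the limit.

```python
import math


def ips_needed(total_pages: int, per_ip_daily_limit: int, days: int = 1) -> int:
    """Distinct IPs required, assuming each IP is reused on every day."""
    return math.ceil(total_pages / (per_ip_daily_limit * days))


def wall_clock_hours(total_pages: int, seconds_per_page: float, workers: int) -> float:
    """Elapsed hours if `workers` instances fetch pages in parallel."""
    return total_pages * seconds_per_page / workers / 3600


print(ips_needed(500_000, 1_000))                   # 500
print(ips_needed(500_000, 1_000, days=4))           # 125
print(round(wall_clock_hours(530_000, 2, 20), 1))   # 14.7
```

With 20 parallel instances at 2 seconds per page, the fetch time is roughly 14.7 hours, consistent with the "less than 1.5 days" estimate. At the two ends of the quoted rate-limit range, the same function gives 1,000 IPs (limit of 500 per day) and 100 IPs (limit of 5,000 per day) for a single-day scrape.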

## Deliverables

There are three milestones, and you will be paid a partial fee for completing each one. For each milestone, the developer will send a snapshot of the latest codebase and instructions to execute the code, and we will validate and sign off on the scraped data. A milestone will not be considered complete until we have validated the accuracy of the data. All 3 milestones must be completed within 1 week.

1. Develop the search results scraper and scrape all the cached_profile_urls for the first two search terms. (5%)
2. Develop the profile scraper and scrape the 20 urls we provide. (5%)
3. Scrape each of the profile_cached_urls in the search_results table. (90%)
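As a rough illustration of what the profile scraper in milestone 2 might do, the sketch below pulls text out of elements identified by their CSS class. The posting suggests XPath or simple regular expressions; this example swaps in Python's standard-library `html.parser` purely to stay dependency-free. The class names `profile-name` and `profile-bio` and the sample HTML are hypothetical, since the real markup and database schema are not given in the posting.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClassTextExtractor(HTMLParser):
    """Collect the text inside elements whose class list contains a wanted class."""

    def __init__(self, wanted_classes):
        super().__init__()
        self.wanted = set(wanted_classes)
        self.fields = {}       # class name -> extracted text
        self._current = None   # class currently being captured, if any
        self._depth = 0        # nesting depth inside the captured element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._current is not None:
            self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            if name in self.wanted:
                self._current, self._depth, self._chunks = name, 1, []
                break

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self._current is None:
            return
        self._depth -= 1
        if self._depth == 0:
            # Collapse runs of whitespace before storing the field.
            self.fields[self._current] = " ".join("".join(self._chunks).split())
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._chunks.append(data)


# Hypothetical cached-page fragment, for illustration only.
sample = ('<div><h1 class="profile-name">Ada <b>Lovelace</b></h1>'
          '<p class="profile-bio">Wrote the <i>first</i> program.</p></div>')
parser = ClassTextExtractor({"profile-name", "profile-bio"})
parser.feed(sample)
print(parser.fields)
# {'profile-name': 'Ada Lovelace', 'profile-bio': 'Wrote the first program.'}
```

The resulting dictionary maps one field per class name, which would then be written to the corresponding columns of the provided MySQL schema. Real cached pages are likely messier than this fragment, so a production scraper would need validation against the 20 sample urls before the full run.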

## Platform

Rails or Java preferred.

Engineering, Java, MySQL, Perl, PHP, Python, Ruby on Rails, Software Architecture, Software Testing, XML, XSLT

Project ID: #3467529

About the project

10 bids · Remote project · Active Nov 28, 2007

10 freelancers are bidding an average of $808 for this job

kishil

See private message.

$850 USD in 7 days
(30 reviews)
5.8
Sefidel

See private message.

$850 USD in 7 days
(15 reviews)
7.0
meteorindia

See private message.

$850 USD in 7 days
(4 reviews)
4.7
sharkinfo2004

See private message.

$841.5 USD in 7 days
(6 reviews)
3.6
huyvtrany2k9

See private message.

$850 USD in 7 days
(4 reviews)
5.2
vw6742929vw

See private message.

$850 USD in 7 days
(7 reviews)
3.6
sergejv

See private message.

$850 USD in 7 days
(3 reviews)
2.7
jwbvw

See private message.

$850 USD in 7 days
(3 reviews)
3.5
hutsolvw

See private message.

$765 USD in 7 days
(1 review)
3.2
javajia

See private message.

$525.3 USD in 7 days
(0 reviews)
0.0