Speedy high-volume web page scraper

I have a software product that reads online text and creates a detailed profile. Each profile is then compared with other profiles so that recommendations can be served.

The profiling engine is a single-server Java application that is served off Tomcat. It has a REST API.
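For illustration only, a client handing scraped text to a REST API like this one might build its request as sketched below. The posting does not describe the API, so the endpoint URL, the JSON field names `title` and `text`, and the class `ProfileRequestBuilder` are all hypothetical. The sketch assumes Java 11 or later for `java.net.http`.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

class ProfileRequestBuilder {

    /** Minimal JSON string escaping (quotes, backslashes, control characters). */
    static String jsonString(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters become 4-digit hex escapes.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }

    /** Builds (but does not send) a POST carrying the page title and article text. */
    static HttpRequest build(String endpoint, String title, String text) {
        // Field names are assumptions; the real API may expect something different.
        String body = "{\"title\":" + jsonString(title) + ",\"text\":" + jsonString(text) + "}";
        return HttpRequest.newBuilder(URI.create(endpoint))
                .header("Content-Type", "application/json; charset=utf-8")
                .POST(HttpRequest.BodyPublishers.ofString(body, StandardCharsets.UTF_8))
                .build();
    }
}
```

The request would then be sent with `HttpClient.send` or `HttpClient.sendAsync`. A real client would normally use a JSON library instead of hand-rolled escaping.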

Until now, the profiles have reached my server via full-text RSS feeds or XML files (for which I then write a custom parser in Java).

I now have a project where I will receive a high volume of URLs (around 80,000 arriving over the course of a day) and will need to 'scrape' the text off these pages before passing it to the profiling engine.
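As a rough sizing check, 80,000 URLs spread over 24 hours (86,400 seconds) averages about 0.93 URLs per second, although arrivals are unlikely to be uniform. A minimal sketch of a bounded worker pool for this kind of load follows. The `fetcher` function stands in for the real HTTP download and the `sink` stands in for the hand-off to the profiling engine; both, along with the class name `ScrapePipeline`, are assumptions for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.BiConsumer;
import java.util.function.Function;

class ScrapePipeline {

    /** Average URLs per second implied by a daily volume. */
    static double averageRatePerSecond(long urlsPerDay) {
        return urlsPerDay / 86_400.0; // seconds in a day
    }

    /**
     * Fetches every URL on a fixed-size pool and passes (url, body) to the sink.
     * Returns true if all tasks finished within the timeout.
     */
    static boolean processAll(List<String> urls,
                              Function<String, String> fetcher,
                              BiConsumer<String, String> sink,
                              int workers,
                              long timeoutSeconds) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (String url : urls) {
            pool.submit(() -> {
                try {
                    sink.accept(url, fetcher.apply(url));
                } catch (RuntimeException e) {
                    // One bad page must not stop the batch; a real system would log and retry.
                    System.err.println("Failed: " + url + " (" + e.getMessage() + ")");
                }
            });
        }
        pool.shutdown(); // accept no new tasks, let queued ones finish
        return pool.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("Average rate (URLs/s): " + averageRatePerSecond(80_000));

        // Stub fetcher and sink so the sketch runs without network access.
        Map<String, String> results = new ConcurrentHashMap<>();
        boolean done = processAll(List.of("http://example.com/a", "http://example.com/b"),
                url -> "<html>stub body for " + url + "</html>",
                results::put, 4, 10);
        System.out.println("Finished: " + done + ", pages: " + results.size());
    }
}
```

Because page downloads are dominated by network wait rather than CPU, the pool size can usually be set well above the core count. The right value depends on the target sites and would need measuring.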

For this development, operational speed is very important. The 'scraper' needs to be fast enough to handle the expected volume, but also accurate enough that most of the page 'junk' does not adversely affect the resulting profile.

Ideally the web scraper will take the page 'title' and 'article' text and use these for profiling.
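A minimal sketch of pulling out the title and the visible text, using only the standard library. The class name `PageText` and its methods are illustrative. Regular expressions are a simplification that works for reasonably well-formed pages; a tolerant HTML parser would likely be more robust on real-world markup.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PageText {
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // Elements whose contents are never shown to the reader.
    private static final Pattern NON_CONTENT = Pattern.compile(
            "<(script|style|noscript)[^>]*>.*?</\\1>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    /** Returns the contents of the first title element, or "" if there is none. */
    static String title(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? collapse(m.group(1)) : "";
    }

    /** Returns the page text with scripts, styles, the title, and all tags removed. */
    static String visibleText(String html) {
        String s = NON_CONTENT.matcher(html).replaceAll(" ");
        s = TITLE.matcher(s).replaceAll(" ");
        s = TAG.matcher(s).replaceAll(" ");
        return collapse(s); // note: HTML entities such as &amp; are not decoded here
    }

    private static String collapse(String s) {
        return s.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><title> My  Page </title><style>p{}</style></head>"
                + "<body><script>var x=1;</script><p>Hello <b>world</b></p></body></html>";
        System.out.println(title(html));
        System.out.println(visibleText(html));
    }
}
```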

However, there will not be a standard format for these pages and so the web scraper needs to be fairly generic too.
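One generic approach that does not depend on a site's layout is to split the page into blocks and keep only those with enough text and a low proportion of link text, since navigation and footers tend to be short and link-heavy. The sketch below is one such heuristic, not a definitive method. The class name `ArticleGuesser`, the list of block-level tags, and the thresholds are assumptions that would need tuning against real pages.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ArticleGuesser {
    private static final Pattern NON_CONTENT = Pattern.compile(
            "<(script|style|noscript)[^>]*>.*?</\\1>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // Opening or closing tags treated as boundaries between text blocks.
    private static final Pattern BLOCK_BOUNDARY = Pattern.compile(
            "</?(p|div|td|li|article|section|h[1-6]|ul|ol|table|tr|br)\\b[^>]*>",
            Pattern.CASE_INSENSITIVE);
    private static final Pattern LINK = Pattern.compile(
            "<a\\b[^>]*>(.*?)</a>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    /**
     * Returns the text of blocks that have at least minChars characters and whose
     * share of link text does not exceed maxLinkDensity (0.0 to 1.0).
     */
    static List<String> contentBlocks(String html, int minChars, double maxLinkDensity) {
        String cleaned = NON_CONTENT.matcher(html).replaceAll(" ");
        List<String> kept = new ArrayList<>();
        for (String block : BLOCK_BOUNDARY.split(cleaned)) {
            String text = plain(block);
            if (text.isEmpty() || text.length() < minChars) {
                continue; // too short to be article text
            }
            int linkChars = 0;
            Matcher m = LINK.matcher(block);
            while (m.find()) {
                linkChars += plain(m.group(1)).length();
            }
            double linkDensity = (double) linkChars / text.length();
            if (linkDensity <= maxLinkDensity) {
                kept.add(text);
            }
        }
        return kept;
    }

    /** Strips tags and collapses whitespace. */
    private static String plain(String fragment) {
        return TAG.matcher(fragment).replaceAll(" ").replaceAll("\\s+", " ").trim();
    }
}
```

For example, `ArticleGuesser.contentBlocks(html, 40, 0.3)` would drop a navigation bar made up entirely of links while keeping a long paragraph of plain prose.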

Please get in contact if you feel you can achieve this, but you must have experience in this field.

Skills: HTML, Java, PHP


About the employer:
(0 reviews) London, United Kingdom

Project ID: #1010422

11 freelancers are bidding an average of $1095 for this job


Hello, please check your inbox. Thanks.

$1380 USD in 15 days
(112 reviews)
$750 USD in 5 days
(48 reviews)

Hello, we have extensive experience in web scraping. Detailed information on our experience will be sent by PM. We can handle 100-150K web sources (URLs) per day (we have had a few servers doing this for years). Looking...

$960 USD in 25 days
(2 reviews)

Please check PMB

$1200 USD in 15 days
(10 reviews)

Hello, please check PMB.

$1500 USD in 12 days
(1 review)

Can we discuss? Refer to PMB.

$750 USD in 7 days
(9 reviews)

I can help you really quickly! Check your inbox.

$750 USD in 3 days
(3 reviews)

See PM for details.

$1500 USD in 20 days
(6 reviews)

Hi, I have 4+ years of experience and believe in quality output in programming. I have done several projects related to yours and can easily do this job. Please check my PMB and be a part of my services forever. I also have good clie...

$1450 USD in 10 days
(0 reviews)

Please see PM.

$800 USD in 20 days
(0 reviews)

I have long experience in J2EE and have built many scrapers in Java using HtmlUnit or Jakarta Commons.

$1000 USD in 10 days
(0 reviews)