I am looking for someone who can develop the following multithreaded Python script:
In a database, there are tons of URLs that need to be visited (column page_url).
When a site is visited, it needs to be checked whether a certain URL is still in the source text of that site. This URL is part of the DB as well (column image_url).
If the image link is still found in the source code, the value „YES“ is written to a certain DB column and checking continues with step 2.
If the image link is no longer found in the source code, the value „NO“ is written to a certain DB column. Nothing more happens.
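A minimal sketch of this first check, assuming the page source has already been downloaded and that "still in the source text" means an exact substring match. The table name `pages` and the result column `image_found` are hypothetical placeholders, since the posting only speaks of "a certain DB column"; SQLite is used purely for illustration.

```python
import sqlite3


def image_status(page_source: str, image_url: str) -> str:
    """Return "YES" if image_url occurs verbatim in the page source, else "NO"."""
    return "YES" if image_url in page_source else "NO"


def record_image_status(conn, page_url, page_source, image_url):
    """Store the result of the image check for one page_url and return it.

    Assumption: a table `pages` with columns page_url and image_found exists.
    """
    status = image_status(page_source, image_url)
    conn.execute(
        "UPDATE pages SET image_found = ? WHERE page_url = ?",
        (status, page_url),
    )
    conn.commit()
    return status
```

Only rows that end up with „YES“ would go on to the text-string check.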
For the websites where the image URL is still online, it needs to be checked whether a certain text string (taken from the DB) is included in the source code of the website.
Here, it needs to be differentiated where the text string is found. For that task, occurrences of the text string inside the image_url need to be excluded.
First check: Is the text string present at all on the specific URL (page_url from the database)?
If NO, write „NO“ to a certain column of the DB.
If YES, continue checking:
If the text string is part of an „alt-tag“, write „ALT“ to a certain column of the DB.
If the text string is part of a „title-tag“, write „MOUSEOVER“ to a certain column of the DB.
If the text string is not part of the image_url, not part of an alt-tag and not part of a title-tag, write „YES“ to a certain column of the DB.
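The classification above could be sketched roughly as follows. This is only one possible reading of the requirements: it assumes „alt-tag“ and „title-tag“ mean the HTML `alt` and `title` attributes, that matching is exact and case-sensitive, and that ALT takes precedence over MOUSEOVER when both apply. The function name `classify_text` is made up for illustration.

```python
from html.parser import HTMLParser


class _AttrCollector(HTMLParser):
    """Collects the values of all alt and title attributes in a document."""

    def __init__(self):
        super().__init__()
        self.alt_values = []
        self.title_values = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value is None:  # attribute without a value, e.g. <img alt>
                continue
            if name == "alt":
                self.alt_values.append(value)
            elif name == "title":
                self.title_values.append(value)


def classify_text(page_source: str, image_url: str, text: str) -> str:
    """Return "NO", "ALT", "MOUSEOVER" or "YES" for one page."""
    # Exclude matches that only exist because the text is part of image_url.
    cleaned = page_source.replace(image_url, "")
    if text not in cleaned:
        return "NO"
    collector = _AttrCollector()
    collector.feed(cleaned)
    collector.close()
    if any(text in value for value in collector.alt_values):
        return "ALT"  # assumed precedence: alt is checked before title
    if any(text in value for value in collector.title_values):
        return "MOUSEOVER"
    return "YES"
```

For example, `classify_text('<img src="http://x.com/a.jpg" alt="Photo by Jane Doe">', "http://x.com/a.jpg", "Jane Doe")` returns `"ALT"`.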
- As there are a lot of sites that need to be checked, I need the Python script to be multithreaded.
- A list of proxies will be provided that shall be used for accessing the page_urls. The proxy used shall be changed on each visit of a new page_url.
- There shall be the option to set a waiting time between accessing two page_urls by one thread.
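The three requirements above (threads, a new proxy per visit, a per-thread waiting time) might be combined as in the sketch below. It uses only the standard library; the names `ProxyRotator`, `fetch_via_proxy` and `check_all` are hypothetical, the proxy list is assumed to be non-empty and in a form like `http://host:port`, and failed requests are simply returned as `None` rather than retried.

```python
import itertools
import threading
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


class ProxyRotator:
    """Hands out proxies round-robin; safe to call from several threads."""

    def __init__(self, proxies):
        self._cycle = itertools.cycle(proxies)
        self._lock = threading.Lock()

    def next(self):
        with self._lock:
            return next(self._cycle)


def fetch_via_proxy(page_url, proxy, timeout=20):
    """Download page_url through the given proxy and return its source text."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(handler)
    with opener.open(page_url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def check_all(page_urls, proxies, workers=8, delay=1.0, fetch=fetch_via_proxy):
    """Visit every page_url; return (page_url, proxy, source) in input order."""
    rotator = ProxyRotator(proxies)

    def visit(page_url):
        proxy = rotator.next()  # a different proxy for each visit
        try:
            source = fetch(page_url, proxy)
        except Exception:
            source = None  # unreachable page or dead proxy
        time.sleep(delay)  # waiting time before this thread takes the next URL
        return page_url, proxy, source

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(visit, page_urls))
```

The `fetch` parameter exists so that the download step can be swapped out, for instance for a different HTTP client or for a stub during testing.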
16 freelancers are bidding an average of $225 for this job
Hi, I have implemented a Python crawler script with proxies that is similar to your project. Please check my finished project: [url removed, login to view] Cheers.
Hello, I have a lot of knowledge and experience for this job. If you hire me, this project will be done efficiently and fast. Feel free to contact me if you have any questions. Kind regards, Nino Rasic