I need a simple script:
step 0: you get a file with a list of URLs (hundreds or thousands); they come in all sorts of formats (subdomains, with or without https, many SLDs/TLDs).
step 1: you extract the domain names from the URLs and generate a sorted list of unique domains; this is not as simple as it sounds, since the function doing it must be able to tokenize any URL format and handle any form of TLD (like .[url removed, login to view], .fr, .[url removed, login to view], ... for example).
step 2: clean the list to remove certain domains, such as free blogs or .gov sites.
step 3: scrape [url removed, login to view] to get one piece of data about some of the domains.
step 4: scrape [url removed, login to view] to get some data for a short list of domains (without getting banned for overuse).
step 5: scrape two pieces of data from the [url removed, login to view] page for each domain in the list.
step 6: sort the list and output it as a flat file.
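Steps 1, 2 and 6 above can be sketched with the Python standard library alone. Note that robust TLD handling really requires the Public Suffix List (e.g. via the tldextract package); the small TWO_LEVEL_SUFFIXES set and the BLOCKED_DOMAINS values below are illustrative stand-ins, not a complete rule set.

```python
from urllib.parse import urlparse

# Hypothetical, incomplete stand-in for the Public Suffix List.
TWO_LEVEL_SUFFIXES = {"co.uk", "com.au", "co.jp", "com.br"}
# Hosts to drop in step 2 (free blogs); example values only.
BLOCKED_DOMAINS = {"blogspot.com", "wordpress.com"}

def registered_domain(url: str) -> str:
    """Reduce any URL form to its registrable domain, lower-cased."""
    if "://" not in url:
        url = "http://" + url          # urlparse needs a scheme to find the host
    host = urlparse(url).hostname or ""
    parts = host.lower().strip(".").split(".")
    if len(parts) >= 3 and ".".join(parts[-2:]) in TWO_LEVEL_SUFFIXES:
        return ".".join(parts[-3:])    # e.g. example.co.uk
    return ".".join(parts[-2:]) if len(parts) >= 2 else host

def clean_domains(urls):
    """Steps 1, 2, 6: extract, dedupe, filter .gov and free blogs, sort."""
    domains = {registered_domain(u) for u in urls if u.strip()}
    kept = [d for d in domains
            if d and not d.endswith(".gov") and d not in BLOCKED_DOMAINS]
    return sorted(kept)
```

Feeding the result through a plain text write gives the flat-file output of step 6.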
Potential for long-term work with the right programmer(s).
7 freelancers bid an average of $159 for this job
Hi, I have gone through the requirements and understood what you need. Please contact me to discuss it further.
I clearly understood your project requirements. I will be able to deliver as per your specifications.
I'm interested in your job offer. I have extensive experience in web scraping. I live in Russia and am ready for remote work. We can communicate via ICQ, MSN, Google Talk, Skype or any other messenger. I'm a highly experienc …
Hi, I have 4 years' experience in Linux & shell scripting. As this is a small task of a few lines, I will deliver it in 2 days. Thanks & regards, Rishi Tiwari
Hello: I would prefer writing the script in Python or shell script. Python's requests module and urlparse work beautifully for such tasks. After extracting the domain names and cleaning, I shall make requests to the …
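On the "without getting banned" concern in step 4, a common approach is to space requests out with a minimum interval. A minimal pacing sketch, stdlib only; the 30-second timeout and the interval value are assumptions to tune per target site:

```python
import time
import urllib.request

class Throttle:
    """Sleep as needed so calls are at least `min_interval` seconds apart."""
    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self):
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()

def polite_fetch(url: str, throttle: Throttle) -> str:
    """Fetch one page, honouring the throttle before each request."""
    throttle.wait()
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

A random jitter on top of the fixed interval, plus respecting robots.txt, makes the scraper considerably less likely to trip rate limits.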