I need a simple script:
step 0: you get a file with a list of URLs (hundreds or thousands), in all sorts of formats (subdomains, https, many SLDs/TLDs).
step 1: extract the domain names from the URLs and generate a sorted list of unique domains. This is not as simple as it sounds: the function must be able to tokenize any URL format and handle any form of TLD (for example .[url removed, login to view], .fr, .[url removed, login to view], ...).
step 2: clean the list by removing certain domains, such as free blog hosts or .gov sites.
step 3: scrape [url removed, login to view] to get one data point about some of the domains.
step 4: scrape [url removed, login to view] to get some data for a short list of domains (without getting banned for excessive usage).
step 5: scrape two data points from the [url removed, login to view] page for each domain in the list.
step 6: sort the list and output as a flat file.
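To make steps 1, 2 and 6 more concrete, here is a rough Python sketch, not a finished solution. Everything named in it is an assumption for illustration: the function names, the small hard-coded set of multi-label suffixes, and the exclusion rules are all hypothetical. A real version would most likely need the full Public Suffix List (for example through a package such as tldextract) instead of the sample set used here. The scraping steps are left out because the target sites are not named above.

```python
import re
from urllib.parse import urlsplit

# Assumption: a tiny illustrative sample of multi-label suffixes.
# A real run would load the full Public Suffix List instead.
MULTI_LABEL_SUFFIXES = {"co.uk", "org.uk", "com.au", "co.jp", "com.br"}

# Assumption: hypothetical exclusion rules for step 2.
EXCLUDED_DOMAINS = {"blogspot.com", "wordpress.com"}
EXCLUDED_ENDINGS = (".gov",)

SCHEME_RE = re.compile(r"[a-zA-Z][a-zA-Z0-9+.\-]*://")


def registrable_domain(url):
    """Return the registrable domain of a URL, or None if none is found."""
    url = url.strip()
    if not url:
        return None
    # urlsplit only fills in the host when a scheme or leading "//" is present.
    if not SCHEME_RE.match(url) and not url.startswith("//"):
        url = "//" + url
    try:
        host = urlsplit(url).hostname
    except ValueError:  # e.g. malformed IPv6 brackets
        return None
    if not host:
        return None
    labels = host.lower().rstrip(".").split(".")
    # Skip bare hostnames and IPv4 addresses.
    if len(labels) < 2 or labels[-1].isdigit():
        return None
    # Keep three labels when the last two form a known multi-label suffix.
    keep = 3 if ".".join(labels[-2:]) in MULTI_LABEL_SUFFIXES else 2
    if len(labels) < keep:
        return None
    return ".".join(labels[-keep:])


def is_excluded(domain):
    """Step 2: drop free blog hosts and .gov domains (sample rules only)."""
    return domain in EXCLUDED_DOMAINS or domain.endswith(EXCLUDED_ENDINGS)


def clean_domains(urls):
    """Steps 1 and 2: unique, filtered, sorted registrable domains."""
    domains = {registrable_domain(u) for u in urls}
    domains.discard(None)
    return sorted(d for d in domains if not is_excluded(d))


def write_flat_file(domains, path):
    """Step 6: write one domain per line."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.writelines(d + "\n" for d in domains)
```

Calling clean_domains on the lines of the input file and passing the result to write_flat_file would give the sorted flat file. Deduplicating with a set before sorting keeps the work manageable even for thousands of URLs.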
Potential for long-term work with the right programmer(s).
7 freelancers have bid an average of $159 for this job
Hi, I have 4 years of experience in Linux and shell scripting. As this is a small task of a few lines, I will deliver it in 2 days. Thanks and regards, Rishi Tiwari