I need a PHP script that does the following:
step 0: you receive a file containing a list of URLs (hundreds or thousands); they come in all sorts of formats (with or without subdomains, http/https, many different SLDs/TLDs).
step 1: extract the domain names from the URLs and generate a sorted list of unique domains; this is not as simple as it sounds, because the function must be able to parse any URL format and handle any form of TLD, including multi-part suffixes (like .[url removed, login to view], .fr, .[url removed, login to view], ... for example).
step 2: clean the list by removing unwanted domains, such as free blog hosts or .gov sites.
step 3: scrape [url removed, login to view] to get one piece of data about some of the domains.
step 4: scrape [url removed, login to view] to get some data for a short list of domains (throttling requests so the scraper does not get banned for overuse).
step 5: scrape two pieces of data from the [url removed, login to view] page for each domain in the list.
step 6: sort the list and output it as a flat file.
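To give a sense of what step 1 involves, here is a minimal sketch of registrable-domain extraction. A real implementation should load the full Public Suffix List (publicsuffix.org) rather than the tiny hard-coded sample of multi-part suffixes used below; the function name and the `$multiPartSuffixes` set are my own illustrative choices, not part of the posting.

```php
<?php
// Sketch of step 1: extract registrable domains from mixed-format URLs.
// NOTE: $multiPartSuffixes is a tiny illustrative sample; a production
// version should use the complete Public Suffix List.
function extractRegistrableDomain(string $url): ?string
{
    static $multiPartSuffixes = ['co.uk', 'org.uk', 'com.au', 'co.jp', 'com.br'];

    // parse_url() needs a scheme to locate the host; add one if missing.
    if (!preg_match('#^[a-z][a-z0-9+.-]*://#i', $url)) {
        $url = 'http://' . $url;
    }
    $host = parse_url($url, PHP_URL_HOST);
    if ($host === null || $host === false) {
        return null;
    }
    $host   = strtolower(trim($host, '.'));
    $labels = explode('.', $host);
    $n      = count($labels);
    if ($n < 2) {
        return null; // bare hostnames like "localhost" carry no domain
    }

    // If the last two labels form a known multi-part suffix (e.g. co.uk),
    // the registrable domain spans three labels; otherwise two.
    $lastTwo = implode('.', array_slice($labels, -2));
    if ($n >= 3 && in_array($lastTwo, $multiPartSuffixes, true)) {
        return implode('.', array_slice($labels, -3));
    }
    return $lastTwo;
}
```

With that helper, the dedupe/sort/output of steps 0, 1, and 6 would reduce to something like: `$domains = array_unique(array_filter(array_map('extractRegistrableDomain', file('urls.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES)))); sort($domains); file_put_contents('out.txt', implode("\n", $domains) . "\n");`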
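Step 2 could be handled with a simple blocklist plus a pattern for government TLDs. This sketch assumes PHP 8+ (for `str_ends_with`); the `$blockedHosts` entries are hypothetical examples of free blog hosts, to be replaced by the client's actual exclusion list.

```php
<?php
// Sketch of step 2: drop free-blog hosts and .gov domains from the list.
function filterDomains(array $domains): array
{
    // Hypothetical blocklist; extend with the real set of exclusions.
    $blockedHosts = ['blogspot.com', 'wordpress.com', 'tumblr.com'];

    return array_values(array_filter($domains, function (string $d) use ($blockedHosts): bool {
        // Matches .gov and two-letter-country variants like .gov.uk.
        if (preg_match('/\.gov(\.[a-z]{2})?$/', $d)) {
            return false;
        }
        foreach ($blockedHosts as $blocked) {
            if ($d === $blocked || str_ends_with($d, '.' . $blocked)) {
                return false;
            }
        }
        return true;
    }));
}
```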
Or you can propose your own method.
Potential for long-term work with the right programmer(s).