Read everything below to fully understand the project. Do not bid until you have read it in full. A personal message sent along with your bid will be given extra attention. If a requested feature drastically increases the price, quote the cost both with and without it so that I can compare your bid to the others fairly.
During the process, it is very important that we stay in contact with one another.
I need a program that I can run on Windows to extract email addresses from the URLs in an existing CSV file and save the results back into that same file, which also contains other data.
The CSV has this column structure:
[url removed, login to view]
- I only need emails ending in these TLDs: .com, .co, .net, .biz, .us (see the extraction sketch after this list).
- Separate emails with commas if more than one is found.
- Multi-threading, adjustable by the user (1-30 threads).
- Must load the data into a database (e.g., SQLite) for scraping. Sometimes I will use this for 100 URLs and sometimes for 100k URLs, so it is important that results be saved to the CSV or the DB as they come in, in case of a loss of internet or a PC restart (see the resume sketch below).
- Must be able to read URLs in these formats: http, www, and [url removed, login to view] (see the normalization sketch below).
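To make the email rule concrete, here is a rough Python sketch of what I mean; the regex, names, and structure are illustrative only, not requirements:

```python
import re

# Assumption: only these five TLDs are wanted, per the list above.
ALLOWED_TLDS = (".com", ".co", ".net", ".biz", ".us")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_emails(html: str) -> str:
    """Return matching emails as one comma-separated string, no duplicates."""
    found = []
    for email in EMAIL_RE.findall(html):
        if email.lower().endswith(ALLOWED_TLDS) and email not in found:
            found.append(email)
    return ",".join(found)
```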
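Similarly, a sketch of how the three URL formats could be normalized before fetching (the function name is my own):

```python
def normalize_url(raw: str) -> str:
    """Turn 'http://...', 'www.example.com', or a bare domain into a
    fetchable URL. Real code should probably try https as well."""
    raw = raw.strip()
    if not raw.lower().startswith(("http://", "https://")):
        raw = "http://" + raw  # covers the 'www.' and bare-domain rows
    return raw
```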
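And for the database/restart requirement, something along these lines is what I picture; the schema and names are guesses for illustration, and all database writes stay on the main thread:

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

def open_db(path="scrape.db"):
    # One row per URL with a 'done' flag so an interrupted run can resume.
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS jobs (
                      url   TEXT PRIMARY KEY,
                      email TEXT,
                      done  INTEGER DEFAULT 0)""")
    return db

def run(db, scrape_one, threads):
    # 'threads' is the user-selected value (1-30).
    todo = [row[0] for row in db.execute("SELECT url FROM jobs WHERE done = 0")]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        for url, emails in zip(todo, pool.map(scrape_one, todo)):
            db.execute("UPDATE jobs SET email = ?, done = 1 WHERE url = ?",
                       (emails, url))
            db.commit()  # commit per URL so a crash loses at most one result
```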
The program will pull the URL (which I can always make column A), scrape the website for email addresses, and post the results into the Email column (column B).
The program needs three scraping modes to help with speed. Do not scrape external URLs or redirects.
1) Slow - Full scan of the entire website (max 50 URLs)
2) Medium - Scrape only the links found on the initial landing page and stop after 30 URLs
3) Fastest - Scrape only these pages: the landing page, contact-us, contact, contactus, about, about-us, aboutus, staff. These pages often have extensions (php, jsp, htm, aspx, html, etc.), and case matters in URL paths, so we also have to try Contact-us, Contact, Contactus, About, About-us, Aboutus, Staff, ContactUs, Contact-Us, About-Us. And sometimes the "contact" page is a folder, such as [url removed, login to view] (max 15 domains). A candidate page list is sketched below.
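Here is that sketch for 'Fastest' mode; the variants and extensions come from the list above, and the code just spells out the combinations (names are illustrative):

```python
# Case variants are spelled out because URL paths are case-sensitive.
PAGE_NAMES = [
    "contact-us", "contact", "contactus", "about", "about-us", "aboutus",
    "staff", "Contact-us", "Contact", "Contactus", "Contact-Us", "ContactUs",
    "About", "About-us", "Aboutus", "About-Us", "Staff",
]
# "" and "/" cover the extensionless and folder-style contact pages.
EXTENSIONS = ["", "/", ".php", ".jsp", ".htm", ".html", ".aspx"]

def fastest_pages(root):
    """The landing page plus every name/extension combination to try."""
    urls = [root]
    for name in PAGE_NAMES:
        for ext in EXTENSIONS:
            urls.append(root.rstrip("/") + "/" + name + ext)
    return urls
```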
I will use as many threads as I can and run all URLs in 'Fastest' mode. Then, for any domains whose URLs still have no emails, I will run Slow or Medium (since those take longer).
One GUI where I will select the file, watch the progress, and, if possible, specify the URL/time limit for each mode (Slow, Medium, Fastest). If that increases the price, let me know. I may later decide that a time limit is better than a URL limit, and I will want the ability to change this without rewriting the program (a possible settings layout is sketched below).
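To be clear about the "change it later" point, I picture the limits living in a small settings structure rather than being hard-coded; the field names and values here are only a suggestion, with the defaults taken from the mode descriptions above:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ModeConfig:
    url_limit: Optional[int]     # max pages per domain, None means no cap
    time_limit: Optional[float]  # max seconds per domain, None means no cap

# Editable in the GUI; switching to time limits later is a settings change.
MODES = {
    "slow":    ModeConfig(url_limit=50, time_limit=None),
    "medium":  ModeConfig(url_limit=30, time_limit=None),
    "fastest": ModeConfig(url_limit=15, time_limit=None),
}
```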
The program will save the results into a new CSV file whose name defaults to the original file name with the word RESULTS added to the end. If it cannot default to the original file name, it should call itself [url removed, login to view]
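For the output name, I mean something like this (the fallback file name was redacted above, so the sketch only shows the default case):

```python
from pathlib import Path

def results_path(source_csv):
    """'leads.csv' -> 'leads RESULTS.csv' in the same folder. Whether
    RESULTS is attached with a space is up to you; this is one option."""
    src = Path(source_csv)
    return src.with_name(src.stem + " RESULTS" + src.suffix)
```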
Since many websites use contact forms instead of listing an email, it would be nice to know this so that I do not keep trying to process those. Maybe the program can detect the <form> tag and put FORM in column B so that I can skip those rows and still keep them for my records.
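Roughly, the FORM flag would work like this, reusing the extract_emails sketch from earlier; detecting the tag with a substring test is just the simplest illustration:

```python
def email_or_form(html):
    """Write emails when found; otherwise flag pages that have a <form>."""
    emails = extract_emails(html)  # from the extraction sketch above
    if emails:
        return emails
    return "FORM" if "<form" in html.lower() else ""
```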
I will want to test this along the way. The demo you provide will need to handle at least 50-100 URLs; it is much harder to judge performance with a smaller list.
I want the source code once the project is completed. As long as you are available, I will continue to work with you when changes are needed, but if you cannot be reached, I will have to take it to someone else for help.
Two weeks of support once the project is finalized. Some emails will inevitably be missed, so revisions will be needed.