We have 500K resumes (CVs) for the USA in PDF, Word (doc and docx), text and HTML formats (just those 4 formats) that need to be parsed to extract:
email address (if one exists; otherwise "none" should be recorded for that resume)
address (the most important piece being the ZIP/postal code)
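To make the two fields above concrete, here is a rough sketch of how they might be pulled out once a resume has already been converted to plain text. It is only an illustration under assumptions: the function name extract_fields, the simplified email pattern and the "two-letter state followed by a ZIP" rule are all hypothetical choices, not a stated requirement, and real resumes will need more robust handling.

```python
import re

# Simplified email pattern (an assumption; not a full RFC 5322 matcher).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Assumed US convention: two-letter state abbreviation, then a 5-digit ZIP,
# optionally followed by the +4 extension. Only the 5-digit part is captured.
ZIP_RE = re.compile(r"\b[A-Z]{2}[ ,]+(\d{5})(?:-\d{4})?\b")


def extract_fields(text: str) -> dict:
    """Return the first email and ZIP code found in plain resume text."""
    email_match = EMAIL_RE.search(text)
    zip_match = ZIP_RE.search(text)
    return {
        # "none" is recorded when no email is present, as requested above.
        "email": email_match.group(0) if email_match else "none",
        # Using None for a missing ZIP is an assumption of this sketch.
        "zip": zip_match.group(1) if zip_match else None,
    }


if __name__ == "__main__":
    sample = "Jane Doe\n123 Main St, Springfield, IL 62704\njane.doe@example.com\n"
    print(extract_fields(sample))
```

Anchoring the ZIP on a preceding state abbreviation is meant to avoid picking up other 5-digit numbers (phone fragments, years of employment ranges, and so on), at the cost of missing addresses written without a state.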
We are developing an internal Drupal site to provide retention, indexing and search for resumes/CVs.
The intention is to clean up the 500K objects to ensure they are valid resumes (20% could be junk) before they are imported into the new system.
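As a starting point for discussion, the "is this a valid resume" check could be something as simple as the sketch below. Everything in it is an assumption for illustration: the function name looks_like_resume, the keyword list and both thresholds are hypothetical and would need tuning against a hand-checked sample of the real files.

```python
import re

# Hypothetical section headings that commonly appear in resumes.
RESUME_KEYWORDS = (
    "experience", "education", "skills",
    "employment", "objective", "summary",
)


def looks_like_resume(text: str, min_words: int = 50, min_keywords: int = 2) -> bool:
    """Crude junk filter: enough words and enough resume-style headings."""
    words = re.findall(r"[A-Za-z]+", text.lower())
    if len(words) < min_words:
        # Too short to be a real resume (assumed threshold).
        return False
    vocabulary = set(words)
    found = [kw for kw in RESUME_KEYWORDS if kw in vocabulary]
    return len(found) >= min_keywords


if __name__ == "__main__":
    sample = "Experience " + "worked on projects " * 20 + "Education BS Skills Python"
    print(looks_like_resume(sample))
    print(looks_like_resume("lorem ipsum"))
```

Files that fail a check like this would probably be better set aside for manual review than deleted outright, since a keyword test will misjudge unusual layouts.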
I expect there will be a lot more questions, so feel free to ask.