We need some Postgres programming to build a matching algorithm that identifies duplicate customer records in our database of around 120 million customer records (which may or may not be unique).
I have a specific approach in mind: score each potentially matching customer on specific attributes, then choose the one with the highest score above a certain threshold.
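To illustrate the kind of scoring I mean, here is a rough sketch in Python. It is only meant to show the logic; the real implementation needs to live in Postgres. The field names, weights, and threshold value are placeholders I made up for the example, not our actual rules, and whether the threshold comparison is strict is still open.

```python
# Illustrative only: field names, weights, and threshold are placeholders.
WEIGHTS = {
    "email": 50,
    "phone": 30,
    "last_name": 15,
    "first_name": 10,
    "postal_code": 10,
}
THRESHOLD = 60


def score(a, b):
    """Sum the weights of attributes present on both records and equal
    after trimming whitespace and ignoring case."""
    total = 0
    for field, weight in WEIGHTS.items():
        left, right = a.get(field), b.get(field)
        if left and right and left.strip().lower() == right.strip().lower():
            total += weight
    return total


def best_match(record, candidates):
    """Return (candidate, score) for the highest-scoring candidate whose
    score reaches THRESHOLD, or None when no candidate qualifies."""
    scored = [(c, score(record, c)) for c in candidates]
    qualified = [pair for pair in scored if pair[1] >= THRESHOLD]
    return max(qualified, key=lambda pair: pair[1]) if qualified else None
```

The same function would serve both use cases: run over the existing data to find duplicates, and run against a single incoming record before it is inserted.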
The second element of the project involves creating households for customers who likely live within the same home.
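One possible way to think about households, sketched in Python purely for illustration: reduce each address to a normalized key and group customers that share a key. The abbreviation table, the field names (`id`, `street`, `postal_code`), and the choice to compare only the first five digits of the postal code are all assumptions for this example.

```python
import re
from collections import defaultdict

# Hypothetical abbreviation table; a real one would be much larger.
ABBREVIATIONS = {
    "street": "st",
    "avenue": "ave",
    "road": "rd",
    "north": "n",
    "south": "s",
    "east": "e",
    "west": "w",
}


def address_key(street, postal_code):
    """Build a comparison key: lowercase, drop punctuation, abbreviate
    common words, and append the first five characters of the postal code."""
    words = re.sub(r"[^a-z0-9 ]", "", street.lower()).split()
    words = [ABBREVIATIONS.get(word, word) for word in words]
    return " ".join(words) + "|" + postal_code.strip()[:5]


def households(customers):
    """Map each address key to the list of customer ids at that address."""
    groups = defaultdict(list)
    for customer in customers:
        key = address_key(customer["street"], customer["postal_code"])
        groups[key].append(customer["id"])
    return dict(groups)
```

An exact key like this would miss misspelled street names, so I would expect a real solution to combine it with the typo tolerance described further down.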
The third element is to create a table of nicknames so that common name variants (bill / william, robert / bob, jake / jacob) are scored as equal.
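As a small illustration of how the nickname table would be used (Python here, just to show the idea), each first name is reduced to a canonical form before comparison. The dictionary stands in for the table and contains only the three pairs mentioned above; the function names are hypothetical.

```python
# Stand-in for the nickname table: nickname -> canonical name.
NICKNAMES = {
    "bill": "william",
    "bob": "robert",
    "jake": "jacob",
}


def canonical_first_name(name):
    """Lowercase and trim the name, then replace a known nickname with
    its canonical form. Unknown names are returned unchanged."""
    cleaned = name.strip().lower()
    return NICKNAMES.get(cleaned, cleaned)


def first_names_match(a, b):
    """True when both names reduce to the same canonical form."""
    return canonical_first_name(a) == canonical_first_name(b)
```

A single lookup like this assumes every nickname belongs to exactly one full name. Some nicknames can stand for more than one name, so the actual table may need to store (name, nickname) pairs instead.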
The fourth element is to create a table of invalid email addresses so that these addresses are not used for matching (even though two or more customers may share one of them).
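In other words, an email that appears in the invalid table should be treated as if the customer had no email at all. A minimal Python sketch of that rule follows; the two entries in the set are made-up examples, not real rows from our data.

```python
# Made-up examples; the real table would be built from our own data.
INVALID_EMAILS = {"none@none.com", "noemail@example.com"}


def usable_email(email):
    """Return the normalized email, or None when it is missing, blank,
    or listed as invalid."""
    cleaned = (email or "").strip().lower()
    if not cleaned or cleaned in INVALID_EMAILS:
        return None
    return cleaned


def emails_match(a, b):
    """Two emails match only when both are usable and equal."""
    left, right = usable_email(a), usable_email(b)
    return left is not None and left == right
```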
The fifth element of the project is to allow for some degree of typos or misspellings, for example transpositions, so that Johnson would match Johnsno.
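One measure that treats a transposition as a single typo is the optimal string alignment distance (a restricted form of Damerau-Levenshtein). The Python sketch below is only an illustration of that measure, not a requirement to use it; the `max_distance` default of 1 is an assumption. Plain Levenshtein distance would count the Johnson / Johnsno swap as two edits rather than one.

```python
def osa_distance(a, b):
    """Optimal string alignment distance: insertions, deletions,
    substitutions, and swaps of two adjacent characters each cost 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or exact match
            )
            # Adjacent transposition, e.g. "on" <-> "no".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


def names_close(a, b, max_distance=1):
    """True when the two names are within max_distance edits, ignoring case."""
    return osa_distance(a.strip().lower(), b.strip().lower()) <= max_distance
```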
In the past we attempted similar functionality using Postgres's full text search, but that didn't give us enough control over what qualifies as a match.
The application that will consume this is Ruby on Rails, but because of the database size, we cannot get the performance we need by doing the matching in Ruby.
Successful completion of this project will include 1) the code necessary to create whatever table structures, functions, triggers, etc. the matching algorithm needs, 2) documentation for that code, 3) the table structure for the nicknames, 4) documentation on how to test, and 5) the code necessary for identifying households.
Your proposal should include specific details about 1) this project, and 2) your approach to solving this problem - specifically how your code will allow us to a) identify duplicates within the existing data, and b) check for duplicates before inserting new customers.
I really don't need to know how many years of experience you or your team have, or in which languages and technologies - I mainly want evidence that you understand the problem we're trying to solve and how you intend to solve it.
Here are some examples of customers we would like to match:
1335 Amble Way
Madison, WI 55008
1532 Fourth Street North
Madison WI 55008