I need a mini-app (Compiled C on Linux) that groups similar sentences together.
I have 100,000 sentences (say in a PostgreSQL DB, Unicode text). It must perform VERY fast, by indexing each root word to a 16-bit integer (which would reduce its memory footprint), then re-creating a new data structure with sentence delimiters and sentence lengths. Group the sentences into buckets of similar sentence length.
Then iterate through doing word-by-word comparisons (16-bit comparisons).
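A minimal sketch of the indexing step, under some loud assumptions: words are already reduced to their roots and separated by single spaces, the whole vocabulary fits in 65,535 distinct ids, and id 0 is kept free to serve as the sentence delimiter. The names WordDict, word_id and encode_sentence are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 131072u   /* power of two, twice the 16-bit id space */
#define MAX_IDS    65535u    /* assumption: id 0 is reserved as the sentence delimiter */

typedef struct {
    char    *keys[TABLE_SIZE]; /* owned copies of the words, NULL = empty slot */
    uint16_t ids[TABLE_SIZE];  /* id stored alongside each key */
    uint32_t count;            /* number of distinct words seen so far */
} WordDict;

/* FNV-1a string hash, used only to pick a starting slot. */
static uint32_t fnv1a(const char *s) {
    uint32_t h = 2166136261u;
    while (*s) { h ^= (unsigned char)*s++; h *= 16777619u; }
    return h;
}

/* Returns the id for `word`, assigning the next free id on first sight.
   Returns 0 if the 16-bit id space is exhausted or memory runs out. */
uint16_t word_id(WordDict *d, const char *word) {
    assert(d != NULL && word != NULL);
    uint32_t slot = fnv1a(word) & (TABLE_SIZE - 1);
    while (d->keys[slot] != NULL) {               /* linear probing */
        if (strcmp(d->keys[slot], word) == 0) return d->ids[slot];
        slot = (slot + 1) & (TABLE_SIZE - 1);
    }
    if (d->count >= MAX_IDS) return 0;
    size_t len = strlen(word) + 1;
    char *copy = malloc(len);
    if (copy == NULL) return 0;
    memcpy(copy, word, len);
    d->keys[slot] = copy;
    d->ids[slot] = (uint16_t)(++d->count);
    return d->ids[slot];
}

/* Encodes a space-separated sentence (modified in place by strtok) into ids.
   Returns the number of words written to `out`. */
size_t encode_sentence(WordDict *d, char *sentence, uint16_t *out, size_t max_out) {
    size_t n = 0;
    for (char *tok = strtok(sentence, " "); tok != NULL && n < max_out;
         tok = strtok(NULL, " "))
        out[n++] = word_id(d, tok);
    return n;
}
```

The dictionary is about 1.3 MB, so it should be allocated with calloc(1, sizeof(WordDict)) or declared static rather than placed on the stack. The word count returned by encode_sentence can double as the key for the length buckets.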
Two algorithms are acceptable:
1. Simple: take a source sentence and iterate through, XORing word by word (irrespective of word order or word frequency). If more than x words are outstanding, then it is NOT a similar sentence. Here x would be 25% of the total number of words.
We leave such a large gap so that we don't need to worry about word roots.
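The simple pass could look roughly like the sketch below. It is one possible reading of the description, not a definitive one: "outstanding" is taken to mean words of either sentence with no equal id in the other, and "total words" is taken to mean the word count of the longer sentence. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Counts words of `a` that have no equal id anywhere in `b`.
   Order and repetition are ignored; (x ^ y) == 0 means the two ids match. */
static size_t unmatched(const uint16_t *a, size_t na,
                        const uint16_t *b, size_t nb) {
    size_t missing = 0;
    for (size_t i = 0; i < na; i++) {
        int found = 0;
        for (size_t j = 0; j < nb && !found; j++)
            found = ((a[i] ^ b[j]) == 0);
        if (!found) missing++;
    }
    return missing;
}

/* Returns 1 if the sentences pass the coarse 25% filter, 0 otherwise. */
int roughly_similar(const uint16_t *a, size_t na,
                    const uint16_t *b, size_t nb) {
    assert(na > 0 && nb > 0);            /* empty sentences are not expected here */
    size_t longest = na > nb ? na : nb;
    size_t limit = longest / 4;          /* x = 25% of the word count, rounded down */
    size_t outstanding = unmatched(a, na, b, nb) + unmatched(b, nb, a, na);
    return outstanding <= limit;
}
```

The nested loop is quadratic in sentence length, which is usually acceptable for sentences of a few dozen words; comparing only within the same or neighbouring length buckets keeps the number of pairs down.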
From the smaller data set, we then proceed to do a classic Levenshtein comparison, but with an upper bound of x deviation: once it detects more than, say, 10% deviation, it exits that comparison. Here it is a character-by-character comparison.
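A sketch of a Levenshtein distance with an early exit, assuming the comparison runs over bytes (so a multi-byte UTF-8 character counts as several characters) and that the caller turns the 10% figure into an absolute bound such as strlen(a) / 10. The name bounded_levenshtein is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns the edit distance between a and b, or max_dist + 1 as soon as the
   distance is known to exceed max_dist. Returns -1 on allocation failure. */
long bounded_levenshtein(const char *a, const char *b, size_t max_dist) {
    assert(a != NULL && b != NULL);
    size_t la = strlen(a), lb = strlen(b);
    size_t diff = la > lb ? la - lb : lb - la;
    if (diff > max_dist) return (long)max_dist + 1;  /* length gap alone is too large */

    size_t *prev = malloc((lb + 1) * sizeof *prev);
    size_t *cur  = malloc((lb + 1) * sizeof *cur);
    if (prev == NULL || cur == NULL) { free(prev); free(cur); return -1; }

    for (size_t j = 0; j <= lb; j++) prev[j] = j;
    for (size_t i = 1; i <= la; i++) {
        cur[0] = i;
        size_t row_min = cur[0];
        for (size_t j = 1; j <= lb; j++) {
            size_t sub = prev[j - 1] + (a[i - 1] != b[j - 1]);
            size_t del = prev[j] + 1;
            size_t ins = cur[j - 1] + 1;
            size_t best = sub < del ? sub : del;
            cur[j] = best < ins ? best : ins;
            if (cur[j] < row_min) row_min = cur[j];
        }
        if (row_min > max_dist) {        /* later rows can never drop below this */
            free(prev); free(cur);
            return (long)max_dist + 1;
        }
        size_t *tmp = prev; prev = cur; cur = tmp;
    }
    size_t d = prev[lb];
    free(prev); free(cur);
    return d > max_dist ? (long)max_dist + 1 : (long)d;
}
```

The early exit relies on the fact that the minimum of each row of the distance table never decreases from one row to the next, so once a whole row exceeds the bound the pair can be rejected without finishing the table.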
The app should read from a folder of .gz files that contain the text, and it could use a text boundary to distinguish each sentence.
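A sketch of the sentence-splitting part only, assuming the contents of a .gz file have already been decompressed into a NUL-terminated buffer (for example with zlib) and that the boundary is a fixed string. The function name and the boundary used in the usage note are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Splits `text` in place on each occurrence of `boundary`, storing a pointer to
   every non-empty sentence in `out`. Returns the number of sentences stored. */
size_t split_on_boundary(char *text, const char *boundary,
                         char **out, size_t max_out) {
    assert(text != NULL && boundary != NULL && boundary[0] != '\0');
    size_t blen = strlen(boundary), n = 0;
    char *start = text;
    while (n < max_out) {
        char *hit = strstr(start, boundary);
        if (hit != NULL) *hit = '\0';          /* terminate the current sentence */
        if (*start != '\0') out[n++] = start;  /* skip empty pieces */
        if (hit == NULL) break;                /* no further boundary in the text */
        start = hit + blen;
    }
    return n;
}
```

With a boundary such as "\n--\n", the buffer "first one\n--\nsecond one\n--\n" yields two sentences. The same boundary string could be reused when writing the grouped output file.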
The output would need to be a new text file that sorts every sentence into groups of similarity - separated by a text boundary.
I need something in 36 hours. A mediocre algorithm is fine.
10 freelancers are bidding an average of $418 for this job
You can trust my expertise, I can finish in time, thanks a lot! I am very proficient in C and C++. I have 16 years of C++ development experience now, and have worked for more than 7 years. My work is online game developin...
Hello, I'm a C developer with 6+ years of experience and a mathematician with a number of publications. I'm also a participant in and problem writer for many algorithm competitions (Topcoder, ACM ICPC, etc). Just 2 weeks...
Hello, I have more than 6 years of experience writing software with Python. I can make a very fast, maintainable script for this in Cython if you are interested. Consider that: 1 - The main slowdown is from cache...
Hi, I'm free, so I can do this type of job quickly. As you have 36 hours for the job, let's not waste time and get started.
We can form a small team to deliver the assignment and split the work into reading, algorithm, and output. I expect you will need documentation as well. Also, as tomorrow is Sunday, please share the date which is ex...
Hello, I am an experienced algorithm designer and would really like to work on your project. I appreciate how detailed your project description is and have understood every aspect of it. Award me the project and I w...
Hi, I have 4 years of experience in C/C++ development in a Linux environment. Looking forward to your response to discuss further. Regards, Akram
Hi, I hope you are doing well, sir. I read your message below, and I assure you that I can help you build a mini-app (compiled C on Linux) that groups similar sentences together, as well as possible per your given requir...
Dear Prospective Hiring Manager, thank you for giving me a chance to bid on your project. I am a serious bidder here; I have already worked on a similar project before and can deliver as you have mentioned. I have...
I have expertise in C/C++. My plan to solve this: 1) You give me an example of the dataset. 2) I do rapid prototyping in Python and show you the approximate result of the algorithm's execution and timing. 3) If you like i...