C code to index large text library and find similar -- 2
Budget $200-600 USD
I need a mini-app (Compiled C on Linux) that groups similar sentences together.
I have 100,000 sentences (say in a PostgreSQL DB, Unicode text). It must perform VERY fast - by indexing each root word to a 16-bit integer (which would reduce its memory footprint), then re-creating a new data structure with sentence delimiters and sentence length. Group into buckets of similar sentence length.
Then iterate through doing word-by-word comparisons (16-bit comparisons).
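As a rough illustration of the indexing step, here is a minimal sketch of mapping words to 16-bit IDs with an open-addressing hash table. Everything named here (`word_index`, `word_id`, `encode_sentence`, the FNV-1a hash, the 1024-byte sentence buffer) is an assumption for illustration, not part of the brief. It also skips root-word extraction (stemming) and treats words as raw UTF-8 bytes split on whitespace. Note that a 16-bit ID caps the vocabulary at 65,536 distinct words; this sketch reserves 0xFFFF as an "out of IDs" marker.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 131072u  /* power of two, larger than the 65,535 usable IDs */
#define NO_ID      0xFFFFu  /* reserved: returned when no ID can be assigned */

typedef struct {
    char    *keys[TABLE_SIZE];  /* owned copies of the words, NULL = empty slot */
    uint16_t ids[TABLE_SIZE];
    uint16_t next_id;
} word_index;

/* The table is over 1 MB, so allocate it on the heap (zeroed = all slots empty). */
static word_index *word_index_new(void) {
    return calloc(1, sizeof(word_index));
}

/* FNV-1a hash over the bytes of the word (UTF-8 is hashed as raw bytes). */
static uint32_t hash_word(const char *s) {
    uint32_t h = 2166136261u;
    while (*s) { h ^= (unsigned char)*s++; h *= 16777619u; }
    return h;
}

/* Return the ID of `word`, assigning the next free ID the first time it is seen. */
static uint16_t word_id(word_index *wi, const char *word) {
    uint32_t slot = hash_word(word) & (TABLE_SIZE - 1);
    while (wi->keys[slot] != NULL) {               /* linear probing */
        if (strcmp(wi->keys[slot], word) == 0) return wi->ids[slot];
        slot = (slot + 1) & (TABLE_SIZE - 1);
    }
    if (wi->next_id == NO_ID) return NO_ID;        /* 16-bit ID space exhausted */
    size_t len = strlen(word) + 1;
    char *copy = malloc(len);
    if (copy == NULL) return NO_ID;
    memcpy(copy, word, len);
    wi->keys[slot] = copy;
    wi->ids[slot] = wi->next_id;
    return wi->next_id++;
}

/* Encode a whitespace-separated sentence into IDs; returns the word count.
   Sentences longer than the local buffer are truncated in this sketch. */
static size_t encode_sentence(word_index *wi, const char *sentence,
                              uint16_t *out, size_t cap) {
    char buf[1024];
    size_t n = 0;
    assert(wi != NULL && sentence != NULL && out != NULL);
    strncpy(buf, sentence, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';
    for (char *tok = strtok(buf, " \t\n"); tok != NULL && n < cap;
         tok = strtok(NULL, " \t\n"))
        out[n++] = word_id(wi, tok);
    return n;
}
```

The word count returned by `encode_sentence` can double as the bucket key for grouping sentences of similar length.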
Two algos are acceptable:-
1. Simple - Take a source sentence and iterate through, XORing word by word (irrespective of word order or word frequency). If more than x words are left outstanding, then it is NOT a similar sentence. Here x would be 25% of the total number of words.
We leave such large gap so that we don't need to worry about word roots.
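A minimal sketch of this coarse filter, under a few assumptions: sentences are already encoded as arrays of 16-bit word IDs, "outstanding" words are the distinct IDs found in one sentence but not the other, and "total words" means the length of the longer sentence (the brief does not say which). The function names are hypothetical. The sketch uses `==` on the IDs, which is equivalent to testing that the XOR of two IDs is zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 1 if `id` occurs among the first n entries of s, else 0. */
static int contains(const uint16_t *s, size_t n, uint16_t id) {
    for (size_t i = 0; i < n; i++)
        if (s[i] == id) return 1;   /* same as (s[i] ^ id) == 0 */
    return 0;
}

/* Count the distinct IDs of a that do not occur anywhere in b. */
static size_t unmatched(const uint16_t *a, size_t na,
                        const uint16_t *b, size_t nb) {
    size_t count = 0;
    for (size_t i = 0; i < na; i++) {
        if (contains(a, i, a[i])) continue;   /* repeat of an earlier word: frequency is ignored */
        if (!contains(b, nb, a[i])) count++;
    }
    return count;
}

/* Similar if the outstanding words are at most 25% of the longer sentence. */
static int roughly_similar(const uint16_t *a, size_t na,
                           const uint16_t *b, size_t nb) {
    assert(a != NULL && b != NULL);
    size_t longer = na > nb ? na : nb;
    size_t outstanding = unmatched(a, na, b, nb) + unmatched(b, nb, a, na);
    return outstanding * 4 <= longer;   /* integer form of outstanding <= 0.25 * longer */
}
```

The nested loops are quadratic in sentence length, which is usually acceptable for short sentences; sorting each ID array once would allow a linear merge instead.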
From the smaller data set - we then proceed to do a classic Levenshtein comparison - but with an upper bound of x deviation - meaning after it detects more than, say, 10% deviation - it exits that comparison. Here it is a character-by-character comparison.
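One way to sketch the bounded comparison is a single-row Levenshtein that gives up as soon as every entry in the current row exceeds the allowed distance. The function name and return convention below are assumptions for illustration, and the sketch compares bytes rather than Unicode code points, so a multi-byte UTF-8 character counts as several edits.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Edit distance between a and b, or max_dist + 1 as soon as the distance
   is known to exceed max_dist. Returns (size_t)-1 if allocation fails. */
static size_t bounded_levenshtein(const char *a, const char *b, size_t max_dist) {
    assert(a != NULL && b != NULL);
    size_t la = strlen(a), lb = strlen(b);
    size_t diff = la > lb ? la - lb : lb - la;
    if (diff > max_dist) return max_dist + 1;    /* the length gap alone is too large */

    size_t *row = malloc((lb + 1) * sizeof *row);
    if (row == NULL) return (size_t)-1;
    for (size_t j = 0; j <= lb; j++) row[j] = j; /* distance from the empty prefix of a */

    for (size_t i = 1; i <= la; i++) {
        size_t diag = row[0];                    /* value from the previous row, column j-1 */
        row[0] = i;
        size_t row_min = row[0];
        for (size_t j = 1; j <= lb; j++) {
            size_t up = row[j];                  /* value from the previous row, column j */
            size_t best = diag + (a[i - 1] == b[j - 1] ? 0 : 1);  /* substitute or match */
            if (up + 1 < best) best = up + 1;                     /* delete from a */
            if (row[j - 1] + 1 < best) best = row[j - 1] + 1;     /* insert into a */
            row[j] = best;
            diag = up;
            if (best < row_min) row_min = best;
        }
        /* The final distance can never be below the smallest value in any row. */
        if (row_min > max_dist) { free(row); return max_dist + 1; }
    }
    size_t d = row[lb];
    free(row);
    return d > max_dist ? max_dist + 1 : d;
}
```

For the 10% bound mentioned above, a caller might pass `max_dist` as the longer sentence's length divided by 10 and treat any result above `max_dist` as "not similar".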
The app should communicate with a folder of .gz files that contain the text and it could use a text boundary to distinguish each sentence.
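A small sketch of the sentence-splitting half of this step, assuming the contents of a .gz file have already been decompressed into a NUL-terminated buffer (for example with zlib's `gzopen` and `gzread`). The function name and the boundary string used in the test are made up for illustration; the brief does not specify the boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Split `text` in place at every occurrence of `boundary`.
   Pointers to the non-empty pieces are stored in `out` (at most `cap`);
   returns how many pieces were stored. */
static size_t split_on_boundary(char *text, const char *boundary,
                                char **out, size_t cap) {
    size_t blen = strlen(boundary), n = 0;
    assert(blen > 0);                       /* an empty boundary would never advance */
    char *start = text;
    while (n < cap) {
        char *hit = strstr(start, boundary);
        if (hit != NULL) *hit = '\0';       /* terminate the current piece */
        if (*start != '\0') out[n++] = start;   /* skip empty pieces */
        if (hit == NULL) break;             /* no further boundary: done */
        start = hit + blen;
    }
    return n;
}
```

Because the split happens in place, the pieces remain valid only as long as the original buffer does.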
The output would need to be a new text file that sorts every sentence into groups of similarity - separated by a text boundary.
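The grouping and output step could be sketched with a union-find (disjoint-set) structure: every pair of sentences judged similar is joined, and each resulting set is written out as one group. This is an assumption about how groups are formed; it makes similarity transitive (if A matches B and B matches C, all three land in one group), which the brief does not require. The function names and the boundary string are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Find the representative of i's group, shortening the path as we go. */
static size_t find_root(size_t *parent, size_t i) {
    while (parent[i] != i) {
        parent[i] = parent[parent[i]];
        i = parent[i];
    }
    return i;
}

/* Merge the groups containing a and b. */
static void join(size_t *parent, size_t a, size_t b) {
    size_t ra = find_root(parent, a), rb = find_root(parent, b);
    if (ra != rb) parent[rb] = ra;
}

/* Write each group's sentences one per line, with `boundary` on its own line
   between groups. Returns the number of groups written.
   The double scan is quadratic and kept only for brevity. */
static int write_groups(FILE *out, const char **sentences, size_t *parent,
                        size_t n, const char *boundary) {
    int groups = 0;
    assert(out != NULL);
    for (size_t r = 0; r < n; r++) {
        if (find_root(parent, r) != r) continue;      /* not a group representative */
        if (groups++ > 0) fprintf(out, "%s\n", boundary);
        for (size_t i = 0; i < n; i++)
            if (find_root(parent, i) == r) fprintf(out, "%s\n", sentences[i]);
    }
    return groups;
}
```

With 100,000 sentences the quadratic scan in `write_groups` would be slow; sorting sentence indices by their root first would bring the output pass down to a single linear sweep.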
I need something very soon. A mediocre algorithm is fine.
To be awarded: explain in 1-2 sentences your proposed approach, and bid a base amount plus a bonus on completion. Come in cheap, and get the big reward after you have delivered.
11 freelancers are bidding an average of $422 for this job
Hello, since the number of records is not so big, I'd load the entire DB in RAM for faster [login to view URL] need to run it on a PC, right?
I am very proficient in C and C++. I have 16 years of C++ development experience, and have worked for more than 7 years. My work is online game development, mainly focused on the server side, using C++ under Linux environ…
Dear sir. Your project attracted my attention at first glance, because I have extensive experience in C programming. I'm really confident about your project, and very eager to join it. If we have a chance to…
I read your project description carefully and I'm very interested in your project. But I have some questions about it. If you have enough time to discuss your project with me, please contact me. An…
Hi, I must say this is a very interesting and challenging project. I have done some work on a similar project and did research on how Twitter search works on large volumes. I would suggest the Lucene search library to create ind…
.................................................................................................................................................................................................................
Dear Sir, I have gone through the project description and am interested in taking it up. The posted bid amount is indicative; I can give a more accurate one once more details are shared. Looking forward to hearing from you. Thanks
Dear Employer, due to my own interest in such natural language processing problems, I have already developed your described approach into a first unoptimized prototype to see how fast it can process and group 100k sentence…
Hello, I am an expert in C/C++/Python/Data Structures/Algorithms. For word indexing, I propose using a trie structure (character tree). Leaf nodes would carry the index value. We could also use a hash table for indexing, bu…
Dear hiring manager, I am a senior web scraping expert with 13 years of experience. I have strong skills and a lot of experience in web scraping (10M Amazon product images scraping, cryptocurrency marke…
Hi there, interesting project you have there. Here is my approach. I have a data structure library in C which is in development but will meet this project's needs, as some of the data structures have been implemented.…