Closed

C code to index large text library and find similar -- 2

I need a mini-app (Compiled C on Linux) that groups similar sentences together.

I have 100,000 sentences (say in a PostgreSQL DB, Unicode text). It must perform VERY fast: first by indexing each root word to a 16-bit integer (which would reduce its memory footprint), then by re-creating a new data structure with sentence delimiters and sentence lengths. Group the sentences into buckets of similar sentence length.
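As a rough sketch of the indexing step described above, the following C code maps each distinct word to a 16-bit ID with a small open-addressing hash table and encodes a sentence as an array of those IDs. The names `WordDict`, `dict_intern` and `encode_sentence`, the table size, and the 4096-byte sentence limit are all assumptions made for illustration. Reducing words to their roots (stemming) is not shown, and a 16-bit ID caps the vocabulary at 65,536 distinct words.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DICT_SLOTS 131072u  /* power of two, twice the 65,536 possible IDs */
#define MAX_WORDS  65536u   /* everything a 16-bit ID can address */

typedef struct {
    char    *words[DICT_SLOTS];  /* NULL marks an empty slot */
    uint16_t ids[DICT_SLOTS];
    uint32_t count;              /* number of IDs handed out so far */
} WordDict;                      /* about 1.3 MB: allocate with calloc */

/* 32-bit FNV-1a hash of a NUL-terminated string. */
static uint32_t fnv1a(const char *s) {
    uint32_t h = 2166136261u;
    while (*s) { h ^= (unsigned char)*s++; h *= 16777619u; }
    return h;
}

/* Returns the ID of word, assigning the next free ID on first sight.
   Returns -1 if the vocabulary is full or memory runs out. */
int32_t dict_intern(WordDict *d, const char *word) {
    assert(d != NULL && word != NULL);
    uint32_t i = fnv1a(word) & (DICT_SLOTS - 1);
    while (d->words[i] != NULL) {            /* linear probing */
        if (strcmp(d->words[i], word) == 0) return d->ids[i];
        i = (i + 1) & (DICT_SLOTS - 1);
    }
    if (d->count >= MAX_WORDS) return -1;
    size_t n = strlen(word) + 1;
    char *copy = malloc(n);
    if (copy == NULL) return -1;
    memcpy(copy, word, n);
    d->words[i] = copy;
    d->ids[i] = (uint16_t)d->count;
    return (int32_t)d->count++;
}

/* Splits sentence on ASCII whitespace (safe for UTF-8 input) and writes one
   ID per word into out. Returns the word count, or -1 on overflow/failure. */
int encode_sentence(WordDict *d, const char *sentence, uint16_t *out, int cap) {
    char buf[4096];
    size_t len = strlen(sentence);
    if (len >= sizeof buf) return -1;
    memcpy(buf, sentence, len + 1);          /* strtok modifies its input */
    int n = 0;
    for (char *tok = strtok(buf, " \t\r\n"); tok != NULL;
         tok = strtok(NULL, " \t\r\n")) {
        if (n == cap) return -1;
        int32_t id = dict_intern(d, tok);
        if (id < 0) return -1;
        out[n++] = (uint16_t)id;
    }
    return n;
}
```

A caller would allocate the dictionary once with `calloc(1, sizeof(WordDict))` and reuse it for every sentence, so that identical words in different sentences receive the same ID.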

Then iterate through, doing word-by-word comparisons (16-bit comparisons).
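One possible way to build the length buckets before the comparison pass is a counting sort over word counts, sketched below. The function name `bucket_by_length`, the `MAX_LEN` cap, and the layout of the `order`/`start` arrays are assumptions for this example, not part of the posting.

```c
#include <assert.h>
#include <string.h>

#define MAX_LEN 255  /* assumption: longer sentences share the last bucket */

/* Counting sort of sentence indices by word count.
   order[] receives the n indices grouped by length; start[len] is the
   offset of bucket len inside order[], and start[MAX_LEN + 1] == n. */
void bucket_by_length(const int *lengths, int n, int *order,
                      int start[MAX_LEN + 2]) {
    assert(lengths != NULL && order != NULL && start != NULL && n >= 0);
    memset(start, 0, (MAX_LEN + 2) * sizeof start[0]);
    /* Count how many sentences have each length. */
    for (int i = 0; i < n; i++) {
        int len = lengths[i] > MAX_LEN ? MAX_LEN : lengths[i];
        start[len + 1]++;
    }
    /* Prefix sums turn the counts into bucket offsets. */
    for (int len = 0; len <= MAX_LEN; len++) start[len + 1] += start[len];
    /* Place each index into the next free position of its bucket. */
    int next[MAX_LEN + 1];
    memcpy(next, start, sizeof next);
    for (int i = 0; i < n; i++) {
        int len = lengths[i] > MAX_LEN ? MAX_LEN : lengths[i];
        order[next[len]++] = i;
    }
}
```

With this layout, the sentences of length `len` are `order[start[len]]` up to `order[start[len + 1] - 1]`, so a comparison loop can restrict itself to a bucket and its neighbours instead of scanning all 100,000 sentences.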

Two algorithms are acceptable:

1. Simple - Take a source sentence and iterate through it, XORing word by word against each candidate (irrespective of word order or word frequency). If more than x words are left outstanding, then it is NOT a similar sentence. Here x would be 25% of the total number of words.

We leave such a large gap so that we don't need to worry about word roots.
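The coarse word filter could be sketched as follows. This example interprets the "XOR" loosely as a set symmetric difference: it sorts and de-duplicates both ID arrays (so order and frequency are ignored) and counts the IDs that occur in only one sentence. Taking x as 25% of the source sentence's word count is an assumption, since the posting does not say which total is meant; `unmatched_words`, `passes_word_filter` and `MAX_SENT_WORDS` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAX_SENT_WORDS 256  /* assumed upper bound on words per sentence */

static int cmp_u16(const void *a, const void *b) {
    uint16_t x = *(const uint16_t *)a, y = *(const uint16_t *)b;
    return (x > y) - (x < y);
}

/* Copies ids into out, sorts them, drops duplicates; returns distinct count. */
static int sorted_unique(const uint16_t *ids, int n, uint16_t *out) {
    memcpy(out, ids, (size_t)n * sizeof *ids);
    qsort(out, (size_t)n, sizeof *out, cmp_u16);
    int m = 0;
    for (int i = 0; i < n; i++)
        if (m == 0 || out[m - 1] != out[i]) out[m++] = out[i];
    return m;
}

/* Number of distinct word IDs present in exactly one of the two sentences,
   or -1 if a sentence is longer than MAX_SENT_WORDS. */
int unmatched_words(const uint16_t *a, int na, const uint16_t *b, int nb) {
    assert(a != NULL && b != NULL && na >= 0 && nb >= 0);
    uint16_t sa[MAX_SENT_WORDS], sb[MAX_SENT_WORDS];
    if (na > MAX_SENT_WORDS || nb > MAX_SENT_WORDS) return -1;
    int ma = sorted_unique(a, na, sa), mb = sorted_unique(b, nb, sb);
    int i = 0, j = 0, diff = 0;
    while (i < ma && j < mb) {               /* merge-style walk */
        if (sa[i] == sb[j]) { i++; j++; }
        else if (sa[i] < sb[j]) { i++; diff++; }
        else { j++; diff++; }
    }
    return diff + (ma - i) + (mb - j);       /* leftovers are unmatched too */
}

/* Nonzero if at most 25% of the source sentence's word count is unmatched. */
int passes_word_filter(const uint16_t *src, int nsrc,
                       const uint16_t *cand, int ncand) {
    int diff = unmatched_words(src, nsrc, cand, ncand);
    return diff >= 0 && diff * 4 <= nsrc;
}
```

Only pairs that pass this filter would move on to the more expensive character-level comparison.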

From the smaller data set, we then proceed to a classic Levenshtein comparison, but with an upper bound of x deviation: once it detects more than, say, 10% deviation, it exits that comparison. Here it is a character-by-character comparison.
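A bounded Levenshtein comparison with early exit might look like the sketch below. It uses the standard two-row dynamic programme and stops as soon as every cell in the current row exceeds the bound, because the final distance can never be smaller than the minimum of any row. The function name `bounded_levenshtein` is an assumption, and the comparison is byte-wise: for Unicode text stored as UTF-8, a multi-byte character counts as several edits unless the input is first decoded to code points.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns the edit distance between a and b, or max_dist + 1 as soon as
   the distance is known to exceed max_dist. Returns -1 if malloc fails. */
int bounded_levenshtein(const char *a, const char *b, int max_dist) {
    assert(a != NULL && b != NULL && max_dist >= 0);
    size_t la = strlen(a), lb = strlen(b);
    size_t gap = la > lb ? la - lb : lb - la;
    if (gap > (size_t)max_dist) return max_dist + 1;  /* length gap alone is too big */

    int *prev = malloc((lb + 1) * sizeof *prev);
    int *curr = malloc((lb + 1) * sizeof *curr);
    if (prev == NULL || curr == NULL) { free(prev); free(curr); return -1; }

    for (size_t j = 0; j <= lb; j++) prev[j] = (int)j;
    for (size_t i = 1; i <= la; i++) {
        curr[0] = (int)i;
        int row_min = curr[0];
        for (size_t j = 1; j <= lb; j++) {
            int sub = prev[j - 1] + (a[i - 1] != b[j - 1]);
            int del = prev[j] + 1;
            int ins = curr[j - 1] + 1;
            int best = sub < del ? sub : del;
            if (ins < best) best = ins;
            curr[j] = best;
            if (best < row_min) row_min = best;
        }
        if (row_min > max_dist) {            /* early exit: bound exceeded */
            free(prev); free(curr);
            return max_dist + 1;
        }
        int *tmp = prev; prev = curr; curr = tmp;
    }
    int d = prev[lb];
    free(prev); free(curr);
    return d > max_dist ? max_dist + 1 : d;
}
```

For the 10% bound mentioned above, a caller could pass something like `(int)(strlen(a) / 10)` as `max_dist` and treat any result greater than `max_dist` as "not similar".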

The app should communicate with a folder of .gz files that contain the text and it could use a text boundary to distinguish each sentence.
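Splitting the text on a boundary could be sketched as below, assuming the contents of a .gz file have already been decompressed into a NUL-terminated buffer (for instance with zlib's `gzread`, which is not shown here). The function name `next_sentence` and the boundary string used in the usage note are hypothetical; the posting does not specify the actual boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Finds the next non-empty sentence at or after *cursor.
   On success, stores its length in *len, advances *cursor past the
   boundary that follows it, and returns a pointer to the sentence start
   (not NUL-terminated). Returns NULL when no sentences remain. */
const char *next_sentence(const char **cursor, const char *boundary,
                          size_t *len) {
    assert(cursor != NULL && *cursor != NULL && boundary != NULL && len != NULL);
    size_t blen = strlen(boundary);
    if (blen == 0) return NULL;              /* an empty boundary is meaningless */
    const char *p = *cursor;
    while (*p != '\0') {
        const char *q = strstr(p, boundary); /* next boundary, if any */
        size_t n = q != NULL ? (size_t)(q - p) : strlen(p);
        const char *next = q != NULL ? q + blen : p + n;
        if (n > 0) {
            *len = n;
            *cursor = next;
            return p;
        }
        p = next;                            /* skip empty spans */
    }
    *cursor = p;
    return NULL;
}
```

A caller would start with `cursor` pointing at the beginning of the buffer and call `next_sentence(&cursor, "\n--\n", &len)` in a loop until it returns NULL, copying or encoding each sentence as it goes.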

The output would need to be a new text file that sorts every sentence into groups of similarity - separated by a text boundary.
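One way to turn pairwise "similar" decisions into output groups is a union-find (disjoint-set) structure, sketched below: every similar pair is merged, and the sentences are then written out sorted by group, with a boundary line after each group. This is an assumed design rather than something the posting prescribes, and the names `group_find`, `group_union` and `write_groups` are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Follows parent links to the group representative, halving the path. */
int group_find(int *parent, int i) {
    while (parent[i] != i) {
        parent[i] = parent[parent[i]];
        i = parent[i];
    }
    return i;
}

/* Merges the groups of sentences i and j; the lower index becomes the root. */
void group_union(int *parent, int i, int j) {
    int ri = group_find(parent, i), rj = group_find(parent, j);
    if (ri < rj) parent[rj] = ri;
    else if (rj < ri) parent[ri] = rj;
}

typedef struct { int root; int index; } Member;

static int cmp_member(const void *a, const void *b) {
    const Member *x = a, *y = b;
    if (x->root != y->root) return (x->root > y->root) - (x->root < y->root);
    return (x->index > y->index) - (x->index < y->index);
}

/* Writes one sentence per line, grouped, with the boundary line after each
   group. Returns the number of groups, or -1 if malloc fails. */
int write_groups(FILE *out, const char **sentences, int n, int *parent,
                 const char *boundary) {
    assert(out != NULL && sentences != NULL && parent != NULL && boundary != NULL);
    if (n <= 0) return 0;
    Member *m = malloc((size_t)n * sizeof *m);
    if (m == NULL) return -1;
    for (int i = 0; i < n; i++) {
        m[i].root = group_find(parent, i);
        m[i].index = i;
    }
    qsort(m, (size_t)n, sizeof *m, cmp_member);  /* members of a group become adjacent */
    int groups = 0;
    for (int i = 0; i < n; i++) {
        fprintf(out, "%s\n", sentences[m[i].index]);
        if (i + 1 == n || m[i + 1].root != m[i].root) {
            fprintf(out, "%s\n", boundary);
            groups++;
        }
    }
    free(m);
    return groups;
}
```

The `parent` array starts with `parent[i] = i` for every sentence, and `group_union` is called for each pair that passes both the word filter and the bounded Levenshtein check.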

I need something very soon. A mediocre algorithm is fine.

To be awarded: explain in 1-2 sentences your proposed approach, and bid a base amount plus a bonus on completion. Come in cheap, and get the big reward after you have delivered.

Skills: C Programming, C# Programming, C++ Programming, Linux, Python


About the employer:
(16 reviews) Ultimo, Australia

Project ID: #17629535

11 freelancers are bidding an average of $422 for this job

quantumcube

Hello, since the number of records is not so big, I'd load the entire DB in RAM for faster [login to view URL] need to run it on a PC, right?

USD in 10 days
(21 reviews)
7.4
hbxfnzwpf

I am very proficient in C and C++. I have 16 years of C++ development experience, and have worked for more than 7 years. My work is online game development, mainly focused on the server side, using C++ under Linux environ…

USD in 5 days
(181 reviews)
7.2
dinhfreedom

Dear sir. Your project attracted my attention at first glance, because I have extensive experience in C programming. I'm really confident about your project and very eager to join it. If we have a chance to…

USD in 10 days
(78 reviews)
6.6
polarjin2017

[login to view URL] I read your project description carefully and I'm very interested in your project, but I have some questions about it. If you have enough time to discuss your project with me, please contact me. An…

USD in 10 days
(48 reviews)
5.9
erShashi

Hi, I must say this is a very interesting and challenging project. I have done some work on a similar project and researched how Twitter search works at large volume. I would suggest the Lucene search library to create ind…

USD in 25 days
(38 reviews)
5.3
freelancerSolvit

.................................................................................................................................................................................................................

USD in 10 days
(32 reviews)
4.8
magadhmindslx

Dear Sir, I have gone through the project description and am interested in taking it up. The posted bid amount is indicative; I can give a more accurate one once more details are shared. Looking forward to hearing from you. Thanks

USD in 10 days
(17 reviews)
3.5
mbenkendorf

Dear Employer, due to my own interest in such natural language processing problems, I have already developed your described approach into a first unoptimized prototype to see how fast it can process and group 100k sentence…

USD in 3 days
(4 reviews)
3.4
mdolgun

Hello, I am an expert in C/C++/Python/data structures/algorithms. For word indexing, I propose using a trie structure (character tree). Leaf nodes would carry the index value. We could also use a hash table for indexing, bu…

USD in 7 days
(5 reviews)
2.3
teamspirit3

Dear hiring manager, I am a senior web scraping expert with 13 years of experience. I have strong skills and extensive experience in web scraping (10M Amazon product image scraping, cryptocurrency marke…

USD in 2 days
(3 reviews)
1.6
TobiObadiah

Hi there, interesting project you have there. Here is my approach. I have a data structure library in C which is in development but will meet this project's needs, as some of the data structures have already been implemented.…

USD in 4 days
(0 reviews)
0.0