Closed

C code to index large text library and find similar

I need a mini-app (Compiled C on Linux) that groups similar sentences together.

I have 100,000 sentences (say in a PostgreSQL DB, Unicode text). It must perform VERY fast - by indexing each root word to a 16-bit integer (which would reduce its memory footprint), then re-creating a new data structure with sentence delimiters and sentence lengths. Sentences are then grouped into buckets of similar length.
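A minimal sketch of the word-to-16-bit-id indexing step, under stated assumptions: words are split on single spaces, no stemming to root words is done, and text is treated as raw bytes (which works for UTF-8 as long as equal words have equal byte sequences). The names `word_dict`, `dict_intern` and `encode_sentence` are hypothetical, not part of the brief.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DICT_SLOTS     131072u  /* power of two, larger than 65535 so probing always ends */
#define DICT_MAX_WORDS 65535u   /* ids 1..65535; 0 is reserved for "no id available" */

typedef struct {
    char    *words[DICT_SLOTS]; /* owned copies of the words; NULL marks an empty slot */
    uint16_t ids[DICT_SLOTS];
    uint32_t count;             /* number of distinct words stored so far */
} word_dict;                    /* about 1.3 MB: allocate with calloc, not on the stack */

/* FNV-1a hash over the bytes of a NUL-terminated string. */
static uint32_t fnv1a(const char *s) {
    uint32_t h = 2166136261u;
    while (*s) { h ^= (unsigned char)*s++; h *= 16777619u; }
    return h;
}

/* Returns the id of word, assigning the next free id if it is new.
   Returns 0 if the dictionary is full or memory runs out. */
uint16_t dict_intern(word_dict *d, const char *word) {
    assert(d != NULL && word != NULL);
    uint32_t i = fnv1a(word) & (DICT_SLOTS - 1);
    while (d->words[i] != NULL) {               /* linear probing */
        if (strcmp(d->words[i], word) == 0) return d->ids[i];
        i = (i + 1) & (DICT_SLOTS - 1);
    }
    if (d->count >= DICT_MAX_WORDS) return 0;
    size_t n = strlen(word) + 1;
    char *copy = malloc(n);
    if (copy == NULL) return 0;
    memcpy(copy, word, n);
    d->words[i] = copy;
    d->ids[i] = (uint16_t)++d->count;
    return d->ids[i];
}

/* Splits s on spaces and writes one id per word into out (at most cap ids).
   Returns the number of ids written, i.e. the sentence length in words. */
size_t encode_sentence(word_dict *d, const char *s, uint16_t *out, size_t cap) {
    size_t n = 0;
    char word[128];
    while (*s && n < cap) {
        while (*s == ' ') s++;                  /* skip separators */
        size_t len = 0;
        while (*s && *s != ' ') {
            if (len < sizeof word - 1) word[len++] = *s; /* overlong words are truncated */
            s++;
        }
        if (len == 0) break;                    /* only trailing spaces were left */
        word[len] = '\0';
        out[n++] = dict_intern(d, word);
    }
    return n;
}
```

The returned word count can double as the bucket key for grouping sentences of similar length. A 16-bit id caps the vocabulary at 65,535 distinct words, so a real data set would need a check that the vocabulary actually fits.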

Then iterate through, doing word-by-word comparisons (16-bit comparisons).

Two algorithms are acceptable:

1. Simple - Take a source sentence and iterate through it, XORing word by word (irrespective of word order or word frequency). If more than x words are left outstanding, then it is NOT a similar sentence. Here x would be 25% of the total number of words.

We leave such a large gap so that we don't need to worry about word roots.
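One way to read the "simple" comparison is sketched below. It is an interpretation, not the brief's exact method: instead of a literal XOR it sorts both id arrays and counts the ids that cannot be paired off, which ignores word order. Two points are assumptions: repeated words are counted individually, and "total words" is taken as the combined word count of both sentences. The function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* qsort comparator for 16-bit word ids, ascending. */
static int cmp_u16(const void *a, const void *b) {
    uint16_t x = *(const uint16_t *)a, y = *(const uint16_t *)b;
    return (x > y) - (x < y);
}

/* Counts ids that cannot be paired off between a and b.
   Both arrays must already be sorted ascending. */
size_t unmatched_words(const uint16_t *a, size_t na,
                       const uint16_t *b, size_t nb) {
    assert(a != NULL && b != NULL);
    size_t i = 0, j = 0, unmatched = 0;
    while (i < na && j < nb) {
        if (a[i] == b[j])     { i++; j++; }        /* paired off */
        else if (a[i] < b[j]) { i++; unmatched++; } /* only in a */
        else                  { j++; unmatched++; } /* only in b */
    }
    return unmatched + (na - i) + (nb - j);        /* leftovers on either side */
}

/* Sorts both arrays in place, then applies the 25% rule:
   similar if unmatched <= 25% of the combined word count (assumed meaning). */
bool roughly_similar(uint16_t *a, size_t na, uint16_t *b, size_t nb) {
    qsort(a, na, sizeof *a, cmp_u16);
    qsort(b, nb, sizeof *b, cmp_u16);
    return unmatched_words(a, na, b, nb) * 4 <= na + nb;
}
```

Because `roughly_similar` reorders its inputs, a caller that still needs the original word order should pass copies. Sorting each sentence once up front and calling `unmatched_words` directly avoids repeated sorting across the many pairwise comparisons.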

From the smaller data set, we then proceed to do a classic Levenshtein comparison, but with an upper bound of x deviation: once it detects more than, say, 10% deviation, it exits that comparison. Here it is a character-by-character comparison.
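The bounded comparison could be sketched as a two-row Levenshtein that gives up as soon as the limit cannot be met. Assumptions: "character" means byte here (multi-byte UTF-8 characters would count as several edits), and the caller converts the 10% figure into an absolute `limit`. The name `bounded_levenshtein` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Returns the edit distance between a and b if it is <= limit,
   otherwise limit + 1 as soon as that is known.
   Returns (size_t)-1 if memory allocation fails. */
size_t bounded_levenshtein(const char *a, const char *b, size_t limit) {
    assert(a != NULL && b != NULL);
    size_t la = strlen(a), lb = strlen(b);
    size_t diff = la > lb ? la - lb : lb - la;
    if (diff > limit) return limit + 1;   /* length gap alone exceeds the limit */

    size_t *prev = malloc((lb + 1) * sizeof *prev);
    size_t *cur  = malloc((lb + 1) * sizeof *cur);
    if (prev == NULL || cur == NULL) { free(prev); free(cur); return (size_t)-1; }

    for (size_t j = 0; j <= lb; j++) prev[j] = j;
    for (size_t i = 1; i <= la; i++) {
        cur[0] = i;
        size_t row_min = cur[0];
        for (size_t j = 1; j <= lb; j++) {
            size_t sub  = prev[j - 1] + (a[i - 1] != b[j - 1]);
            size_t del  = prev[j] + 1;
            size_t ins  = cur[j - 1] + 1;
            size_t best = sub < del ? sub : del;
            if (ins < best) best = ins;
            cur[j] = best;
            if (best < row_min) row_min = best;
        }
        /* The final distance is never below the smallest value in a row,
           so the comparison can stop early here. */
        if (row_min > limit) { free(prev); free(cur); return limit + 1; }
        size_t *tmp = prev; prev = cur; cur = tmp;
    }
    size_t dist = prev[lb];
    free(prev);
    free(cur);
    return dist > limit ? limit + 1 : dist;
}
```

For a 10% bound, a caller might pass `limit = strlen(a) / 10` and treat any result above `limit` as "not similar".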

The app should communicate with a folder of .gz files that contain the text and it could use a text boundary to distinguish each sentence.

The output would need to be a new text file that sorts every sentence into groups of similarity - separated by a text boundary.

I need something in 36 hours. A mediocre algorithm is fine.

Skills: C Programming, C# Programming, C++ Programming, Linux, Python


About the employer:
(14 reviews) Ultimo, Australia

Project ID: #17551738

14 freelancers are bidding an average of $457 for this job

hbxfnzwpf

You can trust my expertise, I can finish in time, thanks a lot! I am very proficient in c and c++. I have 16 years c++ developing experience now, and have worked for more than 7 years. My work is online game developin…

USD in 1 day
(132 reviews)
6.9
TopDev727

Hi I am c++ expert I can do perfectly your project. I want to discuss more on chat. I will wait for your contact. Thanks

USD in 10 days
(39 reviews)
6.3
chongyin429

Hello, I have read your proposal and it is very interesting for me, because I have very strong skills in Algorithm and Programming in C/C++/C#. This project can be divided into 3 main parts. 1 - Input data from Postg…

USD in 3 days
(18 reviews)
6.0
ITPyramid85

hello, how are you. i read your bid carefully. i am c/c++ expert and have full experience for 10 years. c++ language is my top skill. i can provide most quality and high speed. if you want to success, please contact…

USD in 10 days
(4 reviews)
5.5
JMITSolution

I'm very interested in your project. I have a lot of experience with c and c++ programming and algorithms on string. I've checked the project in detail and I can finish your project. I'm a full stack developer and I…

USD in 10 days
(39 reviews)
5.1
dstepanenko

Hello, I'm c developer with 6+ years of experience and mathematician with a number of publications. Also I'm participant and problem writer of many algorithm competitions (Topcoder, ACM ICPC, etc). Just 2 weeks…

USD in 1 day
(19 reviews)
4.5
nmsandroid

We can make a small team for delivering the assignment. Can break the work for read, algorithm and output process. I hope you will need documentation also. Also as tomorrow is Sunday please share the date which is ex…

USD in 3 days
(22 reviews)
3.8
MzHashmi

Hi im free so i can do this type of jobs in quick manner as you have 36 hours for the job lets dont waste the time and get it started

USD in 10 days
(5 reviews)
3.2
jjmutumi

Hello, I have more than 6 years experience writing software with Python. I can make a very fast, maintainable script for this in Cython if you are interested? Consider that: 1 - The main slowdown is from cache…

USD in 0 days
(2 reviews)
3.0
codingedward

Hello, I am an experienced algorithm designer and would really like to work on your project. I appreciate how detailed your project description is and have understood every aspect of it. Award me the project and I w…

USD in 3 days
(11 reviews)
3.1
ansarias21

Hi, I have 4 years of experience in C/C++ development in Linux environment. Looking forward for your response to discuss further. Regards, Akram

USD in 1 day
(4 reviews)
0.4
humrobo

Hi, Hope you doing well sir. i read your message given below. i make sure you that i can help you to build mini-app (Compiled C on Linux) that groups similar sentences together, as better as per your given requir…

USD in 10 days
(1 review)
0.0
itsparx

Dear Prospect Hiring Manager. Thank you for giving me a chance to bid on your project. i am a serious bidder here and i have already worked on a similar project before and can deliver as u have mentioned. I have…

USD in 10 days
(0 reviews)
0.0
Anpera

I have expertise in C/C++. My plan to solve this thing: 1) You give me example of dataset 2) I do rapid prototyping in python and show you approximate result of algorithm execution and timing. 3) If you like i…

USD in 2 days
(0 reviews)
0.0