A library for Windows to extract the plaintext of several file formats
I need a library, delivered in both .dll and .lib forms, that extracts the plaintext from the file formats listed below:
• Microsoft Word Files (.doc, .docx)
• Microsoft Excel Files (.xls, .xlsx)
• Microsoft Access (.mdb, .accdb)
• Adobe Acrobat Files (.pdf)
• IBM Lotus Notes database file (.nsf)
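As an illustration of how the formats above could be told apart, here is a minimal sketch that maps a file extension to a format. The names FileFormat, EqualsNoCase and FormatFromPath are hypothetical and not part of the required interface; the code sticks to C++03.

```cpp
#include <cstddef>
#include <cwchar>
#include <cwctype>

// Hypothetical format identifiers, one per supported family of extensions.
enum FileFormat {
    FORMAT_UNKNOWN,
    FORMAT_WORD,    // .doc, .docx
    FORMAT_EXCEL,   // .xls, .xlsx
    FORMAT_ACCESS,  // .mdb, .accdb
    FORMAT_PDF,     // .pdf
    FORMAT_NOTES    // .nsf
};

// Case-insensitive comparison of two null-terminated wide strings.
static bool EqualsNoCase(const wchar_t* a, const wchar_t* b)
{
    while (*a != L'\0' && *b != L'\0') {
        if (std::towlower(*a) != std::towlower(*b)) return false;
        ++a;
        ++b;
    }
    return *a == *b; // equal only if both strings ended together
}

// Returns the format implied by the extension of path, or FORMAT_UNKNOWN.
static FileFormat FormatFromPath(const wchar_t* path)
{
    static const struct { const wchar_t* ext; FileFormat format; } kTable[] = {
        { L".doc",  FORMAT_WORD   }, { L".docx",  FORMAT_WORD   },
        { L".xls",  FORMAT_EXCEL  }, { L".xlsx",  FORMAT_EXCEL  },
        { L".mdb",  FORMAT_ACCESS }, { L".accdb", FORMAT_ACCESS },
        { L".pdf",  FORMAT_PDF    }, { L".nsf",   FORMAT_NOTES  }
    };

    if (path == NULL) return FORMAT_UNKNOWN;
    const wchar_t* dot = std::wcsrchr(path, L'.');
    if (dot == NULL) return FORMAT_UNKNOWN;
    // A dot that belongs to a directory name is not a file extension.
    if (std::wcschr(dot, L'\\') != NULL || std::wcschr(dot, L'/') != NULL)
        return FORMAT_UNKNOWN;

    for (std::size_t i = 0; i < sizeof(kTable) / sizeof(kTable[0]); ++i) {
        if (EqualsNoCase(dot, kTable[i].ext)) return kTable[i].format;
    }
    return FORMAT_UNKNOWN;
}
```

Extension matching is only one option. Since .doc and .xls share the OLE compound file container and .docx and .xlsx are both ZIP packages, an implementation may prefer to inspect the file contents as well.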
The library must export a callable function with the following signature:
BOOL ExtractPlaintextFromFile(PCTSTR FilePath, TextCallback Callback);
The first parameter will be a pointer to a Unicode string containing the path of the file to extract the plaintext from (ex. D:\[url removed, login to view])
The second parameter will be a plaintext processing callback, explained below.
The return value must be TRUE on success and FALSE on error.
The callback function must have the following signature; each parameter is explained below.
typedef BOOL (*TextCallback)(PCTSTR Text, SIZE_T TextLength, PCTSTR SourceFile);
Text: Pointer to a buffer that contains all or part of the extracted plaintext in Unicode. If the file is extracted in chunks or parts, the callback can safely be called again, pointing to the next chunk or part.
TextLength: Length, in characters, of the buffer pointed to by Text.
SourceFile: Pointer to a buffer that contains the path of the originating source file (ex. D:\[url removed, login to view])
Return value: TRUE on success, FALSE on error.
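To make the chunked-callback contract concrete, here is a minimal sketch. CollectText, g_collected and DeliverInChunks are hypothetical names used only for illustration. The sketch assumes that Text is not necessarily null-terminated (so only TextLength characters are read) and that the library stops and reports an error when the callback returns FALSE; neither point is stated in the requirements. The non-Windows typedefs are stand-ins so the sketch compiles anywhere.

```cpp
#include <cstddef>
#include <string>

#ifdef _WIN32
#ifndef UNICODE
#define UNICODE            // make PCTSTR resolve to a wide-character pointer
#endif
#include <windows.h>
#else
// Stand-ins for the Windows types (assumption, not part of the spec).
typedef int BOOL;
typedef const wchar_t* PCTSTR;
typedef std::size_t SIZE_T;
#define TRUE 1
#define FALSE 0
#endif

typedef BOOL (*TextCallback)(PCTSTR Text, SIZE_T TextLength, PCTSTR SourceFile);

// The signature carries no user-data parameter, so this sketch keeps the
// accumulated text in a global.
static std::wstring g_collected;

// Example callback: appends each chunk to g_collected.
static BOOL CollectText(PCTSTR Text, SIZE_T TextLength, PCTSTR SourceFile)
{
    (void)SourceFile; // unused in this example
    if (Text == NULL && TextLength != 0) return FALSE;
    if (TextLength != 0) g_collected.append(Text, TextLength);
    return TRUE;
}

// Library-side helper: hands text to the callback in pieces of at most
// chunkChars characters, stopping at the first FALSE returned.
static BOOL DeliverInChunks(const std::wstring& text, SIZE_T chunkChars,
                            PCTSTR sourceFile, TextCallback callback)
{
    if (callback == NULL || chunkChars == 0) return FALSE;
    for (std::wstring::size_type pos = 0; pos < text.size(); pos += chunkChars) {
        std::wstring::size_type n = text.size() - pos;
        if (n > chunkChars) n = chunkChars;
        if (!callback(text.data() + pos, (SIZE_T)n, sourceFile)) return FALSE;
    }
    return TRUE;
}
```

A real ExtractPlaintextFromFile would decode the file and feed the callback in a similar way, returning FALSE if either extraction or the callback fails.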
The project must be delivered as one or two .sln files (Visual Studio solution files), at the developer's discretion.
If only one .sln file is provided, it must compile everything from scratch, up to and including a demo application.
If two .sln files are provided, one must be for all the dependencies of the project (external libraries and such) and the other for the main library and the demo application.
The demo application must be a simple application that calls the extracting function with a provided sample file for each of the supported file formats.
The callback function of the demo must simply save the extracted contents to a file with the .txt extension added. (ex. D:\[url removed, login to view]).
The goal of the demo is to extract all the plaintext from all the included sample files.
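A possible shape for the demo callback is sketched below. MakeOutputPath and SaveTextCallback are hypothetical names. The sketch assumes the output is written as raw wide characters (UTF-16LE without a byte order mark on Windows) and that the file is opened in append mode so that repeated calls for the same source file add to the same output; the requirements do not fix either choice. The non-Windows branch is an ASCII-only stand-in so the sketch compiles off Windows.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

#ifdef _WIN32
#ifndef UNICODE
#define UNICODE            // make PCTSTR resolve to a wide-character pointer
#endif
#include <windows.h>
#else
// Stand-ins for the Windows types (assumption, not part of the spec).
typedef int BOOL;
typedef const wchar_t* PCTSTR;
typedef std::size_t SIZE_T;
#define TRUE 1
#define FALSE 0
#endif

// Output path is the source path with ".txt" appended.
static std::wstring MakeOutputPath(PCTSTR sourceFile)
{
    return std::wstring(sourceFile) + L".txt";
}

// Demo callback: appends the received chunk to "<SourceFile>.txt".
static BOOL SaveTextCallback(PCTSTR Text, SIZE_T TextLength, PCTSTR SourceFile)
{
    if (SourceFile == NULL || (Text == NULL && TextLength != 0)) return FALSE;

    const std::wstring outPath = MakeOutputPath(SourceFile);
#ifdef _WIN32
    FILE* out = _wfopen(outPath.c_str(), L"ab");
#else
    // ASCII-only narrowing, good enough for a non-Windows test build.
    const std::string narrow(outPath.begin(), outPath.end());
    FILE* out = std::fopen(narrow.c_str(), "ab");
#endif
    if (out == NULL) return FALSE;

    std::size_t written = 0;
    if (TextLength != 0) {
        written = std::fwrite(Text, sizeof(wchar_t), TextLength, out);
    }
    const int closeResult = std::fclose(out); // always close, even on short write
    return (written == TextLength && closeResult == 0) ? TRUE : FALSE;
}
```

The demo would then call ExtractPlaintextFromFile once per sample file, passing SaveTextCallback. Because the file is opened in append mode, the demo should delete any existing output file before each run.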
The included sample files were uploaded as a multipart rar file due to upload file size limitations.
As expected, converting from formats with special formatting like PDFs to plaintext can lead to loss of text positioning or format. This is no problem for my requirements. As long as all available text from the document is extracted, superfluous whitespace is not a problem.
Additionally, the library must meet the following technical specifications:
• It must be coded in C or C++ (Avoid using C++0x/C++11)
• It must be able to run on any version of Windows from Windows XP SP1 to the latest version. (Windows XP SP1 to SP3, Windows Vista Retail to SP2, Windows 7 Retail and SP1, Windows 8, Windows 8.1 and Windows 10)
• The library must be self-contained. This means that it should not depend on any external libraries, installed programs, DLLs or frameworks that are not included in a clean installation of Windows XP SP1 (That is, an installation of Windows XP with SP1 with no extra programs or system updates installed).
• It must not have any graphical interface, play any sound, or generate any kind of alert to the user.
• You must deliver all the source code that generates the final library; no precompiled libraries will be accepted.
• You must document all the external libraries used by the library, including the version used, direct download link and detailed notes about any changes to the original source code of such libraries.
• The final binaries should be compiled using Visual Studio 2010 or higher and compiled with the Runtime Library option set to Multi-threaded (/MT).
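Two of the constraints above can be guarded at compile time. The fragment below is a sketch that could sit at the top of a common header: it sets the minimum Windows API level to XP (0x0501) if the build has not set one, and rejects a build that uses the DLL runtime, relying on the fact that MSVC defines _DLL only under /MD or /MDd.

```cpp
// Target the Windows XP API level unless the build already chose one.
// This must appear before any Windows header is included.
#ifndef _WIN32_WINNT
#define _WIN32_WINNT 0x0501
#endif

// MSVC defines _DLL for /MD and /MDd only, so its presence means the
// static runtime (/MT) requirement is not being met.
#if defined(_MSC_VER) && defined(_DLL)
#error "Build with the static runtime (/MT), not /MD."
#endif
```

Note that with Visual Studio 2012 or later, the project would likely also need one of the XP-compatible platform toolsets (for example v110_xp) for the binaries to load on Windows XP.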
If you are interested in the job please answer this request with the following information:
• Estimated time of development.
• What is your favorite pet animal?
Your proposal will be subject to approval.