To understand this project, you must first understand the purpose and usage of CPE 2.3: [login to view URL]
I would like a standalone microservice created (preferably in PHP using CodeIgniter 4) that accepts four strings in a POST request and returns the top X matches for that request. This is effectively a fuzzy search system that tries to find a CPE for an application version, should one exist in the dictionary. The problem could be solved with a machine learning model, natural language processing, and/or distance algorithms.
As guidance, the API could use Levenshtein distance, keyword counting, Jaccard index, longest common substring, the Hunt–Szymanski algorithm, Hamming distance, Damerau–Levenshtein distance, or a combination of these. The selected algorithms should be used to find a match for the POSTed application in the CPE dictionary ([login to view URL]), should one exist.
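As a rough illustration of how two of the measures above could be blended into a single 0-100 score, here is a minimal sketch. It is written in Python for brevity rather than the preferred PHP, and the function names, the whitespace tokenisation, and the 50/50 weighting are assumptions for illustration, not requirements of this brief.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Scale the edit distance into the range 0.0 (different) to 1.0 (identical)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def jaccard(a: str, b: str) -> float:
    """Jaccard index over lower-cased, whitespace-separated tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def combined_score(a: str, b: str, weight: float = 0.5) -> int:
    """Blend both measures into an integer score out of 100 (weight is an assumption)."""
    blended = (weight * levenshtein_similarity(a.lower(), b.lower())
               + (1 - weight) * jaccard(a, b))
    return round(blended * 100)
```

Character-level distance tolerates typos and small spelling differences, while the token-level Jaccard index rewards shared words regardless of order, so the two tend to fail on different inputs. The right weighting would need tuning against the example inputs.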
The microservice/REST API itself is very simple: no authentication or security controls, and only one interface to POST the app data. The majority of the work will be perfecting the CPE matching process, and then enhancing the performance.
Inputs:
- (string) publisher
- (string) name
- (string) version
- (string) operating_system
Outputs:
- Array of arrays containing CPE matches with score:
[
[
"cpe": (string) cpe2.3_uri,
"score": (int) closeness out of 100
]
]
Components:
- Local copy of the CPE dictionary
- REST request interface
The workflow:
- Request received containing publisher, name, version, operating_system.
- Service runs a fuzzy search in the local CPE dictionary for possible matches.
- Service runs each potential CPE through various distance/similarity algorithms.
- Service returns best CPE 2.3 match(es) if any are found. The returned CPE must match the application submitted (publisher, name, and version).
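The workflow above can be sketched roughly as follows, in Python for illustration only. Everything here is an assumption rather than a specification: the three-entry `DICTIONARY`, the `NOISE` word list, the 0.4/0.6 weighting, and the `min_score` threshold are hypothetical, and `difflib.SequenceMatcher` stands in for whichever distance algorithms are finally chosen. The `operating_system` input is left out to keep the sketch short.

```python
from difflib import SequenceMatcher

# Hypothetical miniature dictionary: (vendor, product, version, CPE string).
DICTIONARY = [
    ("valve", "steam", "2.10.91.91", "cpe:2.3:a:valve:steam:2.10.91.91:*:*:*:*:*:*:*"),
    ("valve", "steam_link", "1.0", "cpe:2.3:a:valve:steam_link:1.0:*:*:*:*:*:*:*"),
    ("citrix", "workspace", "2102", "cpe:2.3:a:citrix:workspace:2102:*:*:*:*:windows:*:*"),
]

# Assumed list of company suffixes that carry no matching signal.
NOISE = {"corporation", "corp", "inc", "systems", "ltd", "llc"}


def normalize(text):
    """Lower-case, replace punctuation with spaces, drop noise words, join with '_'."""
    cleaned = "".join(c if c.isalnum() else " " for c in text.lower())
    return "_".join(t for t in cleaned.split() if t not in NOISE)


def similarity(a, b):
    """Stand-in similarity in [0, 1]; swap in the chosen algorithms here."""
    return SequenceMatcher(None, a, b).ratio()


def find_matches(publisher, name, version, top_x=3, min_score=80):
    vendor, product = normalize(publisher), normalize(name)
    results = []
    for cpe_vendor, cpe_product, cpe_version, cpe in DICTIONARY:
        # The returned CPE must match the submitted version exactly.
        if cpe_version != version:
            continue
        score = round(100 * (0.4 * similarity(vendor, cpe_vendor)
                             + 0.6 * similarity(product, cpe_product)))
        if score >= min_score:
            results.append({"cpe": cpe, "score": score})
    results.sort(key=lambda r: r["score"], reverse=True)
    return results[:top_x]
```

Returning an empty list when nothing clears the threshold is deliberate in this sketch: given the accuracy requirement, no answer is preferable to a weak one. A linear scan is only workable for a toy dictionary; a real implementation would shortlist candidates through an index first.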
Worked example:
- User POSTs:
publisher: "Valve Corporation"
name: "Steam"
version: "2.10.91.91"
- System responds:
[
"cpe": "cpe:2.3:a:valve:steam:2.10.91.91:*:*:*:*:*:*:*",
"score": 90 (or whatever the distance score was)
]
It's really important that we do not return many false positives. Ideally, we require 98% accuracy (no more than 2 false positives per 100 requests).
Some rules... we're only interested in CPEs where:
1) The "version" is not "-".
2) The value of "part" is "a"
3) "update" is "*"
4) "target_sw" is one of ['*', 'windows', 'windows_10', 'windows_server', 'x64', 'x86', '.net', '.net_framework', 'desktop', 'edge', 'internet_explorer', 'internet_explorer_10', 'internet_explorer_11']
5) "target_hw" is one of ['x64', 'x86', 'nuc', '-', '*', 'arm64', 'intel64', 'amd64']
In addition, the API should maintain an in-memory cache to speed up responses to repeated requests.
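A minimal sketch of how rules 1 to 5 could be applied to a dictionary entry, again in Python for illustration. It assumes the standard CPE 2.3 formatted-string layout (`cpe:2.3:` followed by 11 colon-separated attributes); the names `parse_cpe` and `is_eligible` are hypothetical. The split treats a colon preceded by a backslash as escaped, which is a simplification that does not handle an escaped backslash directly before a colon.

```python
import re

# Attribute order of a CPE 2.3 formatted string after the "cpe:2.3:" prefix.
FIELDS = ("part", "vendor", "product", "version", "update", "edition",
          "language", "sw_edition", "target_sw", "target_hw", "other")

ALLOWED_TARGET_SW = {"*", "windows", "windows_10", "windows_server", "x64", "x86",
                     ".net", ".net_framework", "desktop", "edge", "internet_explorer",
                     "internet_explorer_10", "internet_explorer_11"}
ALLOWED_TARGET_HW = {"x64", "x86", "nuc", "-", "*", "arm64", "intel64", "amd64"}


def parse_cpe(cpe):
    """Split a CPE 2.3 formatted string into a dict of its 11 attributes."""
    parts = re.split(r"(?<!\\):", cpe)  # split on colons that are not escaped
    if len(parts) != 13 or parts[0] != "cpe" or parts[1] != "2.3":
        raise ValueError(f"not a CPE 2.3 formatted string: {cpe!r}")
    # Remove escaping backslashes so that e.g. "\.net" compares equal to ".net".
    return {name: re.sub(r"\\(.)", r"\1", value)
            for name, value in zip(FIELDS, parts[2:])}


def is_eligible(cpe):
    """Apply rules 1-5: version set, application part, no update, allowed targets."""
    f = parse_cpe(cpe)
    return (f["version"] != "-"
            and f["part"] == "a"
            and f["update"] == "*"
            and f["target_sw"] in ALLOWED_TARGET_SW
            and f["target_hw"] in ALLOWED_TARGET_HW)
```

Running this filter once when the dictionary is loaded, rather than per request, would shrink the search space before any fuzzy matching happens.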
Here are some examples of false positives:
Request 1:
"Microsoft .NET Framework Targeting Pack","4.8.04161","Microsoft Corporation"
Response 1:
"Microsoft .NET Framework Targeting Pack","4.8.04161","Microsoft Corporation"
Request 2:
"Citrix Workspace Inside","[login to view URL]","Citrix Systems, Inc.","Windows"
Response 2:
"Citrix Workspace 2102 for Windows","cpe:2.3:a:citrix:workspace:2102:*:*:*:*:windows:*:*","citrix","workspace","2102"
Request 3:
"Office","18.2205.1091.0","Microsoft Corporation"
Response 3:
"Microsoft Office","cpe:2.3:a:microsoft:office:-:*:*:*:*:*:*:*","microsoft","office","-"
Attachments:
- [login to view URL] is a list of example inputs
Other projects that have attempted this, for inspiration (I've tried them all; none of them work correctly or are accurate enough):
- [login to view URL] and [login to view URL]
- [login to view URL]
- [login to view URL]