Dear Phyve staff,
Thank you for posting this interesting and challenging project.
If you are serious about building something like this, I would be glad to offer my expertise.
I have successfully built complex custom site crawlers and data mining tools, so I can offer a solution that is effective in two respects: the high concurrency that is preferable when a large number of sites must be crawled, and the data processing that will be required to extract the semantic data.
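To illustrate the concurrency side, here is a minimal sketch of a bounded parallel crawl loop. The URLs and the fetch() stub are placeholders, not part of any real system; in practice fetch() would perform actual HTTP retrieval with politeness delays and error handling.

```python
# Sketch of a high-concurrency crawl using a bounded thread pool.
# fetch() is a stand-in for real HTTP retrieval; the URLs are hypothetical.
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Placeholder for an actual HTTP request (e.g. via urllib or a crawler framework).
    return f"<html>content of {url}</html>"

def crawl(urls, max_workers=32):
    # Fetch many pages in parallel; a bounded pool keeps resource use predictable
    # even when thousands of sites are queued.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(urls, pool.map(fetch, urls)))

pages = crawl(["https://shop-a.example/p/1", "https://shop-b.example/p/2"])
```

The bounded pool is the important design choice here: it lets the crawler saturate network I/O without overwhelming either our machine or the target sites.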
You should be aware that extracting semantic data from "any site", as you write, is not easily done, because the machine-readable structure only specifies the visual presentation of the data. Every site can label fields like "product name", "price", or "category" differently, so there will be no watertight solution. However, I have experience writing advanced data mining engines, which we can use to systematically apply structure to the data: the spider downloads the raw data from each site, and by comparison with other sites we determine what is likely to be a price, a product name, and so on.
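As a small example of what such a classification heuristic might look like, the sketch below flags tokens that resemble prices. The regular expression and the sample tokens are illustrative assumptions; a real engine would combine many such signals and confirm candidates by cross-site comparison.

```python
import re

# Hypothetical heuristic: a token matching a currency pattern is likely a price.
PRICE_RE = re.compile(r"^(?:\$|€|£)?\s?\d{1,6}(?:[.,]\d{2})?$")

def looks_like_price(token):
    return bool(PRICE_RE.match(token.strip()))

def classify_fields(tokens):
    # Split raw scraped tokens into price candidates and everything else;
    # cross-site comparison would then confirm or reject the candidates.
    prices = [t for t in tokens if looks_like_price(t)]
    others = [t for t in tokens if not looks_like_price(t)]
    return prices, others

prices, others = classify_fields(["$19.99", "Acme Widget", "Toys & Games"])
```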
This means we should start with a couple of sites (ideally ones that share many products) and refine the algorithms step by step from there.
Regarding storing the data for fast retrieval, I propose using a database or a CLucene index (a project I have contributed to).
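To show the retrieval pattern an index like CLucene provides, here is a deliberately simplified toy inverted index. The documents and class are invented for illustration only; CLucene itself offers this lookup with ranking, tokenization, and on-disk storage.

```python
from collections import defaultdict

# Toy inverted index: maps each term to the set of documents containing it.
class InvertedIndex:
    def __init__(self):
        self.postings = defaultdict(set)
        self.docs = {}

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, term):
        # One dictionary lookup per term: this is what makes retrieval fast,
        # regardless of how many documents are stored.
        return sorted(self.postings.get(term.lower(), set()))

idx = InvertedIndex()
idx.add(1, "Red running shoes")
idx.add(2, "Blue running jacket")
```

The key point is that queries never scan the documents; they go straight from term to matching document IDs.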
Best regards,
Isidor Zeuner