As I move to the end of this iteration of Whatson, I am stalling on the final step involving Answer Type computation. In the Hot Water Tank architecture of a Question-Answer system, all the source data gets crawled and indexed internally. During indexing, Natural Language Processing (NLP) is applied to the data, tagging Person entities, Location entities, and others. This tagging is key to answering user questions correctly. You see, questions get analyzed using similar NLP. An Answer Type is computed to figure out what kind of question is being asked and what kind of answer will do. Is it a Person question? A Location question? Correct answers can be obtained by matching the Answer Type to the content tagged during indexing.
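To make the matching step concrete, here is a minimal sketch in Python. It is not Whatson's actual code: the rule table, the `TaggedEntity` class, and the function names are all hypothetical, and a real system would use a proper NLP model rather than looking at the first word of the question.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical rules: map a leading question word to an Answer Type.
ANSWER_TYPE_RULES = {"who": "Person", "where": "Location"}


@dataclass(frozen=True)
class TaggedEntity:
    """An entity tagged during indexing (illustrative structure only)."""
    text: str
    entity_type: str  # e.g. "Person" or "Location"
    doc_id: str


def compute_answer_type(question: str) -> Optional[str]:
    """Guess the Answer Type from the first word of the question."""
    words = question.strip().lower().split()
    if not words:
        return None
    return ANSWER_TYPE_RULES.get(words[0])


def candidate_answers(question: str, index: List[TaggedEntity]) -> List[TaggedEntity]:
    """Return the indexed entities whose tag matches the question's Answer Type."""
    answer_type = compute_answer_type(question)
    if answer_type is None:
        return []
    return [entity for entity in index if entity.entity_type == answer_type]
```

Given an index holding a Person entity and a Location entity, a "Where ..." question would return only the Location entity as a candidate answer.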
The Hot Water Tank architecture is solid, but I can’t help feeling it is too rigid, like Relational Database design with all its unique identifiers and keys. Relational Database design is also very effective at returning accurate results, as long as you are only asking canned questions about a known range of content. In this post, I describe an alternative “Tank-less” architecture.
Obviously, I am playing on the Tank/Tank-less options for heating water in one’s home. The Tank option heats a large quantity of water in advance, waiting for you to turn the tap on. Similarly, in the Tank architecture, a massive first crawl of all external data source content is performed, followed by a massive first indexing of that content, just like filling a hot water tank. Updates to the index can be performed on just the delta, but changes to the NLP model may require complete re-indexing.
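The difference between a delta update and a full re-index can be sketched roughly as follows. This is only an illustration under my own assumptions: the `TankIndex` class, the content-hash check, and the `model_version` string are hypothetical, and the tagger is just any function that turns text into a list of tags.

```python
import hashlib
from typing import Callable, Dict, List, Optional, Tuple

Tagger = Callable[[str], List[str]]


class TankIndex:
    """Toy index: doc_id -> (content hash, tags). Hypothetical design."""

    def __init__(self) -> None:
        self.model_version: Optional[str] = None
        self.entries: Dict[str, Tuple[str, List[str]]] = {}

    def update(self, docs: Dict[str, str], tagger: Tagger, model_version: str) -> List[str]:
        """Tag new or changed documents and return the ids that were (re)tagged.

        If the NLP model version changed, the cached tags are stale,
        so the whole index is dropped and every document is tagged again.
        """
        if model_version != self.model_version:
            self.entries.clear()
            self.model_version = model_version

        retagged = []
        for doc_id, text in docs.items():
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            cached = self.entries.get(doc_id)
            if cached is not None and cached[0] == digest:
                continue  # unchanged document: keep the existing tags
            self.entries[doc_id] = (digest, tagger(text))
            retagged.append(doc_id)
        return retagged
```

With the same model version, only documents whose content changed are tagged again; bumping the model version forces the whole tank to be refilled.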
The Tank-less option heats water on demand. In the Tank-less architecture, the massive index is replaced with a lightweight Data Source Catalog, containing structured descriptions of the data sources and programmatic interfaces for searching them. When a user submits a question, the text is still analyzed using NLP and an Answer Type is still computed. What is new is an extra step in which the system uses the NLP analysis and the Data Source Catalog to choose the appropriate sources for answers. A small amount of content is retrieved using the search interfaces in the Catalog. NLP tagging is then applied to just that small amount of content. Correct answers can be obtained by matching the Answer Type to the content just tagged.
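The Tank-less flow described above could look something like the sketch below. Everything here is an assumption made for illustration: the `CatalogEntry` fields, the first-word Answer Type rules, and the tiny gazetteer standing in for real NLP tagging. Each catalog entry carries a structured description (the Answer Types the source can serve) and a search function standing in for the source's programmatic interface.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set, Tuple


@dataclass
class CatalogEntry:
    """One Data Source Catalog record (hypothetical structure)."""
    name: str
    answer_types: Set[str]              # structured description of the source
    search: Callable[[str], List[str]]  # programmatic search interface


# Stand-ins for real NLP: first-word rules and a tiny gazetteer.
ANSWER_TYPE_RULES = {"who": "Person", "where": "Location"}
GAZETTEER = {"Ada Lovelace": "Person", "London": "Location"}


def compute_answer_type(question: str) -> Optional[str]:
    words = question.strip().lower().split()
    return ANSWER_TYPE_RULES.get(words[0]) if words else None


def tag(text: str) -> List[Tuple[str, str]]:
    """Tag known entities found in a retrieved passage."""
    return [(name, etype) for name, etype in GAZETTEER.items() if name in text]


def answer(question: str, catalog: List[CatalogEntry]) -> List[str]:
    answer_type = compute_answer_type(question)
    if answer_type is None:
        return []
    # Extra step: choose sources from the catalog descriptions.
    sources = [entry for entry in catalog if answer_type in entry.answer_types]
    answers = []
    for source in sources:
        # Retrieve a small amount of content, then tag only that content.
        for passage in source.search(question):
            for text, etype in tag(passage):
                if etype == answer_type:
                    answers.append(text)
    return answers
```

Note that nothing is tagged until a question arrives, and only the passages returned by the chosen sources are ever tagged.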
Replacing the massive index with a lightweight catalog is an experimental approach. The catalog must contain sufficient description of the data sources to make accurate choices about where to search for answers. In addition, the description must be structured to allow for automated choices of the sources. I’m not sure yet if this will work best in the long run, but I have a good feeling about it. In my next post, I will weigh the advantages and disadvantages of both architectures. It will be no surprise that I favour the Tank-less architecture for a cognitive system.