Every so often the game changes. Newton thought time was a constant. Einstein showed that time slows down for travelers moving near the speed of light. A change of singular proportion is happening in computing today because of the challenges of big data and the rise of Strong Natural Language Processing technologies.
Step back to the world of small data and database design 101. An entity such as a Customer is defined by a list of attributes: Name, Address, Phone Number, and so on. These attributes are structured as fields in a Customer table. Each Customer record is assigned a unique identifier (UID). The Customer ID is a primary key, allowing database designers to create relationships between Customers and other entities. Each record in a Products table has a Product ID, and each record in an Invoices table has an Invoice ID. The Invoices table will have extra foreign key columns for Customer ID and Product ID so that queries can efficiently pull out a purchase history for a customer.
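The schema described above can be sketched in a few lines. This is a minimal illustration using Python's built-in SQLite module; the table and column names are assumptions chosen to match the article's example, not a prescribed design.

```python
import sqlite3

# Build the three tables from the example: Customers and Products each
# have a primary key (a UID), and Invoices carries foreign keys to both.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customers (
    customer_id INTEGER PRIMARY KEY,
    name TEXT, address TEXT, phone TEXT
);
CREATE TABLE Products (
    product_id INTEGER PRIMARY KEY,
    name TEXT
);
CREATE TABLE Invoices (
    invoice_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES Customers(customer_id),
    product_id INTEGER REFERENCES Products(product_id)
);
""")
conn.execute("INSERT INTO Customers VALUES (1, 'John Smith', '1 Main St', '555-0100')")
conn.execute("INSERT INTO Products VALUES (10, 'Twinkies')")
conn.execute("INSERT INTO Invoices VALUES (100, 1, 10)")

# The foreign keys let a query pull a purchase history efficiently.
rows = conn.execute("""
    SELECT c.name, p.name
    FROM Invoices i
    JOIN Customers c ON c.customer_id = i.customer_id
    JOIN Products  p ON p.product_id  = i.product_id
    WHERE c.customer_id = 1
""").fetchall()
print(rows)  # [('John Smith', 'Twinkies')]
```

The join works only because every record carries a UID; that is the guarantee the next section shows breaking down across organizations.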
Traditional database design is effective, as long as you don’t want to integrate databases across organizations. In the Customers database, John Smith has a specific Customer ID. Good enterprise design will share that identifier across databases, but in another enterprise John Smith has a completely different UID. Assuming permissions to share data, the only way to line up these two UIDs is to compare the record fields all over again: Name, Address, Phone Number. Hopefully the information has not changed and there are not too many typos.
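Lining up two UIDs by comparing record fields can be sketched as follows. This is a deliberately naive illustration; the field names and the two sample records are invented for the example, and a real record-linkage system would use fuzzy matching and weighted scores rather than exact comparison.

```python
def normalize(s: str) -> str:
    # Crude normalization: lowercase, keep only letters and digits.
    return "".join(ch for ch in s.lower() if ch.isalnum())

def same_customer(a: dict, b: dict) -> bool:
    # Compare the shared fields (Name, Address, Phone Number) after
    # normalization, since the two organizations' UIDs cannot be compared.
    return all(normalize(a[f]) == normalize(b[f])
               for f in ("name", "address", "phone"))

# The same person, stored under two unrelated UIDs in two enterprises.
org_a = {"uid": "C-1017", "name": "John Smith",
         "address": "1 Main St.", "phone": "555-0100"}
org_b = {"uid": "78-553", "name": "john smith",
         "address": "1 Main St", "phone": "555 0100"}

print(same_customer(org_a, org_b))  # True
```

Note how fragile this is: a changed phone number or a typo in either record and the match silently fails, which is exactly the weakness the article describes.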
The largest volume of data being generated today is unstructured: documents, emails, blog posts, tweets, and so forth. Most of this data is accessible on the open web. This is the world of ‘big data.’ Search technologies help. A Google search yields possible matches, ordered by a sophisticated ranking algorithm, and a likely match is often found on the first page of results. We accept this as a good thing. Wouldn’t it be nice if the old I’m Feeling Lucky button could correctly answer a question in one try?
Natural Language Processing (NLP) is a big data technology. Beyond keyword matching, NLP parses words and sentences for meaning. Standard patterns identify people, companies, and locations. Custom domain models are built to identify business concepts and detect relationships. NLP transforms unstructured content into structured data. One begins to wonder if NLP can replace good-old database design and its unique identifiers. Not quite. There is still an unsatisfactory margin of error. John Smith might get identified correctly as a Customer who purchased Twinkies, but NLP might struggle in other cases with variants in proper names and products. Error rates get compounded. If person identification is 95% accurate and product identification is 90% accurate, the overall confidence is only 0.95 × 0.90 ≈ 86%. NLP still depends on human search and analytics for uniquely resolving an answer.
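The compounding of error rates is just multiplication of independent accuracies, which a short calculation makes concrete. The 95% and 90% figures are the article's own; the assumption that the two identification steps fail independently is mine.

```python
# Each extraction stage must succeed for the combined fact
# ("this person bought this product") to be correct.
person_accuracy = 0.95   # person identification
product_accuracy = 0.90  # product identification

# Assuming independent errors, confidences multiply.
overall = person_accuracy * product_accuracy
print(round(overall, 3))  # 0.855, i.e. roughly 86%
```

Add a third 90%-accurate stage, say relation detection, and overall confidence drops below 77%; each extraction step erodes the result further.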
Stronger NLP technologies are emerging. In 2011, IBM challenged the world’s two best Jeopardy players to compete with its Watson supercomputer. The game requires nuanced knowledge of human language and culture. Watson used NLP on a database of 200 million pages of structured and unstructured content, including Wikipedia. The thing is, the game permits only one answer to a question, not pages of possible answers. Certainly, Watson would come up with a list of likely answers, ranked by probability, but it could only submit its single best answer. To win, Watson had to answer correctly more often than its skilled human competitors throughout the game. Watson won. Strong NLP is the ability to process big data to produce one correct answer.[1] Strong NLP is a singularity in computing history. We can say good-bye to traditional database design and its unique identifiers. I can imagine much bigger changes.
[1] Of course, it would have to be able to answer successive questions correctly as well. Also, if there are two equally correct answers, both answers would be given. This would not work on Jeopardy, but it would be necessary in real life.