Search engine vs Database in BI part 1: data structure

BI applications face a real data deluge these days, and engineers need new approaches to deal with it. Is a search engine, well known for handling huge data sets, suited to a BI context?

Business intelligence (BI for short) is the discipline of gathering and structuring data to support decision making. One major technical challenge is that BI generally involves large volumes of data, which cause trouble for the classic database approach. Search engines (SEs) are a newer approach to handling data. Is a search engine a go-to solution for BI usage?

SEs are particularly good at dealing with large amounts of data. This quality rests on smart architectural solutions to common problems. In essence, an SE:

  • crawls raw data
  • indexes the data (classifies and interprets it)
  • provides functionalities to look for data related to a theme (search)
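
The three steps above can be sketched with a toy inverted index. This is only an illustration of the general idea, not how any particular search engine is implemented; the sample emails and the function names (`tokenize`, `build_index`, `search`) are assumptions made up for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_index(documents):
    """Index step: map each word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Search step: return the ids of documents containing every query word."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Crawl step, faked here with hypothetical raw data: email id -> body.
emails = {
    1: "I'm not here right now",
    2: "The quarterly report is here",
    3: "See you next week",
}

index = build_index(emails)
hits = search(index, "here now")  # ids of emails containing both words
```

Note that the index only knows which words appear in which document; it knows nothing about who wrote to whom, which is the point of the next section.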

To illustrate, we’ll take as an example a BI program that targets employees’ emails. All the company’s communications are classified, and then one can search for emails and compute indicators on this data.



Structured vs. unstructured data: how ambiguous your data are

The classification process in BI has to understand the meaning of each email. Suppose I write down this piece of information: Baptiste Pierre I’m not here now.



We can interpret different facts:

  • Baptiste is not here and notified Pierre;
  • Baptiste is not here and notified Pierre just now;
  • Baptiste and Pierre are not here.

We can say that the data is ambiguous.

Now suppose the data we receive is: Baptiste to Pierre said “I’m not here right now”.



Now we understand without ambiguity which fact the piece of information relates to:

  • Baptiste said to Pierre that he wasn’t here when he sent the email.

This is due to the structural elements that we added: the word “to”, the word “said”, and the quotation marks around “I’m not here right now”.

To be able to understand what is going on in the company, we need the data to be structured.
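
As a rough sketch of what "structured" means here, the same email can be held as a record with named fields. The `Email` class, its field names and the timestamp are hypothetical, chosen only to mirror the example above.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Email:
    """A structured email: every piece of information has a named role."""
    sender: str
    recipient: str
    sent_at: datetime
    body: str


def describe(email):
    """State the single fact that the structured record expresses."""
    return (
        f"{email.sender} said to {email.recipient} "
        f"at {email.sent_at:%H:%M}: {email.body}"
    )


# Unstructured: just text, with no indication of who said what to whom.
raw = "I'm not here now"

# Structured: the roles are explicit (the timestamp is an invented value).
email = Email(
    sender="Baptiste",
    recipient="Pierre",
    sent_at=datetime(2013, 1, 7, 9, 30),
    body="I'm not here right now",
)

summary = describe(email)
```

With the record, an indicator such as "how many emails did Baptiste send to Pierre" becomes a simple filter on `sender` and `recipient`; with the raw string alone it would require guessing.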

Search engines are particularly known for dealing with unstructured data. They perform well when a user wants to match an unstructured question against unstructured documents, like books or websites. Take the most famous one: Google. Finding an exact match can be complicated.
For instance, I wanted to know what Obama said to Sarkozy last week, so I typed: “obama said to sarkozy last week”. The first result was a quote from Sarkozy to Obama. The next links were about joint statements by the two men. Only the fourth link was about a quote from Obama to Sarkozy.

The factual details described in my search and in the websites are not correctly interpreted.
Another issue is that “last week” is badly interpreted: I wanted statements made within the last week, not articles containing the words “last week”. The root of the issue is the unstructured nature of the data I was looking for.
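
The "last week" problem can be illustrated with a small sketch that contrasts a keyword match with a filter on a structured date field. The documents, their texts and their dates are invented for the example, and the two functions are hypothetical helpers, not the behaviour of any real search engine.

```python
from datetime import date, timedelta

# Invented documents: each has a structured publication date and free text.
documents = [
    {"id": 1, "published": date(2012, 5, 14),
     "text": "A statement from one president to the other"},
    {"id": 2, "published": date(2011, 11, 3),
     "text": "What the two presidents discussed last week"},
]


def keyword_search(docs, phrase):
    """Unstructured reading: keep documents whose text contains the phrase."""
    return [d["id"] for d in docs if phrase.lower() in d["text"].lower()]


def published_within(docs, today, days=7):
    """Structured reading: keep documents published in the last `days` days."""
    start = today - timedelta(days=days)
    return [d["id"] for d in docs if start <= d["published"] <= today]


today = date(2012, 5, 18)  # assumed "current" date for the example
by_words = keyword_search(documents, "last week")
by_date = published_within(documents, today)
```

The keyword reading returns the old article that happens to contain the words, while the date reading returns the recent one. The second query is only possible because the publication date exists as a separate, typed field.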



The business item concept

We saw that structured data are essential in a BI context to remove ambiguities. SEs are not particularly designed for dealing with structured data, even though a solution exists: one should deal with “business items”. This concept, proposed by Exalead, is in practice really hard to maintain. Check this article for further info. We’ll see why in the next paragraph.



Consistency of the data: what conforms

My first section was about data structure. Recent SEs support data structure to some extent by introducing the concept of “business item”. This concept brings clever solutions to performance-related technical issues but also introduces functional limitations.

The first obvious limitation concerns consistency. Consistency means making sure that every single piece of data respects a standard: for instance, that an email has a sender, and that this sender exists in our reference list of users. With an SE, the conformity checks must be carried out entirely by the application using it. As there is no centralized description of the whole structure with its constraints, every developer must check the details carefully. To be fair, large database-based BI applications must also drop some of the conformity checks for performance reasons.
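
To show what a centralized constraint looks like on the database side, here is a minimal sketch using SQLite from Python's standard library. The table and column names, the addresses and the `try_insert` helper are assumptions for the example; the point is only that the rule "the sender must exist" is declared once in the schema rather than re-checked by every developer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# The reference list of users, and emails whose sender must appear in it.
conn.execute("CREATE TABLE users (address TEXT PRIMARY KEY)")
conn.execute(
    """
    CREATE TABLE emails (
        id     INTEGER PRIMARY KEY,
        sender TEXT NOT NULL REFERENCES users(address),
        body   TEXT NOT NULL
    )
    """
)
conn.execute("INSERT INTO users VALUES ('baptiste@example.com')")


def try_insert(sender, body):
    """Insert an email; return False if the database rejects it."""
    try:
        conn.execute(
            "INSERT INTO emails (sender, body) VALUES (?, ?)", (sender, body)
        )
        return True
    except sqlite3.IntegrityError:
        return False


accepted = try_insert("baptiste@example.com", "I'm not here right now")
rejected = try_insert("unknown@example.com", "Who am I?")
stored = conn.execute("SELECT COUNT(*) FROM emails").fetchone()[0]
```

The second insert is refused by the database itself because its sender is unknown. In an SE used as the main repository, an equivalent check would have to live in application code.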



Managing changes: how the data repository can evolve after its design

The second problem is that in BI it is common to go back to already stored data and discover a new way to use it, a new interesting indicator to build: for example, splitting mail addresses in two, one part for the name and another for the domain.

Usually, data inside an SE is dead data: it cannot be changed in new ways. By contrast, a database can easily update data that is already stored.
Sure, Solr and Exalead provide such functionality, but only as a limited upsert (cancel and replace). It means that if one wants to update data, one needs to extract every business item, change it and push it back into the repository. It is like not having the “search and replace” function in Excel. This limitation causes headaches for engineers when the functional requirements change.
This can explain why some people who run a proof of concept around an SE as the main data repository are disappointed later on: they can’t easily manage changes.
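
The difference can be sketched with the address-splitting example. The database side uses SQLite from Python's standard library; the SE side is a plain dictionary standing in for a document repository, so it is only an analogy for the extract, change and push-back cycle and does not use the Solr or Exalead APIs. All names and addresses are hypothetical.

```python
import sqlite3

# Database side: the new requirement is handled by statements run in place.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emails (id INTEGER PRIMARY KEY, sender TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO emails (sender) VALUES (?)",
    [("baptiste@example.com",), ("pierre@example.org",)],
)
conn.execute("ALTER TABLE emails ADD COLUMN sender_name TEXT")
conn.execute("ALTER TABLE emails ADD COLUMN sender_domain TEXT")
conn.execute(
    """
    UPDATE emails
    SET sender_name   = substr(sender, 1, instr(sender, '@') - 1),
        sender_domain = substr(sender, instr(sender, '@') + 1)
    """
)
rows = conn.execute(
    "SELECT sender_name, sender_domain FROM emails ORDER BY id"
).fetchall()

# SE side (toy stand-in): every document is pulled out, changed and replaced.
search_index = {
    1: {"sender": "baptiste@example.com"},
    2: {"sender": "pierre@example.org"},
}


def reindex_all(index):
    """Extract each document, add the new fields, and push it back whole."""
    for doc_id in list(index):
        doc = dict(index[doc_id])                  # extract
        name, _, domain = doc["sender"].partition("@")
        doc["sender_name"] = name                  # change
        doc["sender_domain"] = domain
        index[doc_id] = doc                        # cancel and replace


reindex_all(search_index)
```

Both sides end with the same fields, but the database does it with one declarative statement over stored data, while the other side requires a program that walks through every item.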



SE or not SE?

One quality of an SE is the speed at which it indexes large volumes of unstructured data. Its drawback is that it stores data in an ambiguous way and does not let users change the structure later on.

This is why at Inovia we decided to let the SE work only as an extension of a database. In our guidelines, the SE shouldn’t be the main repository of data.

We have a database that allows us to finely structure data, and a search engine for the unstructured data, where the database is no longer relevant. Read this article for further info about the features of SEs over databases.

Postgres has a mechanism like this called Full Text Search. We will discuss in another post a concrete business case that successfully mixed the two approaches, database and search engine.