Enterprises today have unprecedented opportunities to improve their business results by collecting and processing very large datasets, commonly referred to as big data. These datasets are generated not only by a wide range of enterprise systems and databases but also by emerging social media and Internet of Things infrastructures. Big data sets are rarely useful unless they are analyzed to discover knowledge and produce insights about business processes. Such analysis enables organizations to improve their operations and make better decisions. Knowledge extraction from data can also drive the development of machine learning agents that are able to understand and assess various situations in order to boost automation and accelerate decision making. Machine learning processes are in many cases based on past observations (i.e., data about past situations) and are used to trigger process improvement. Nevertheless, as in human learning, the development of machine learning agents depends on the business context of the problem at hand. Specifically, the extraction of knowledge patterns from past data is highly dependent on the context (e.g., time, location, cause, people, and objects involved) of the collected datasets. Therefore, there is a very close relationship between data analytics techniques (e.g., statistical processing) and intrinsic knowledge of the business domain at hand. Data science is not simply about finding patterns in data. It is rather about identifying recurring patterns that can solve real business problems. As a result, data scientists must be able to understand, appreciate and fully leverage domain knowledge in the course of the analysis process.
Domain Knowledge in the Data Mining Lifecycle
The importance of domain knowledge is evident in all data mining processes, such as the popular CRISP-DM (Cross Industry Standard Process for Data Mining) and KDD (Knowledge Discovery in Databases). For example, CRISP-DM comprises several steps that are based on domain knowledge, including:
A Business Understanding phase, where the data mining problem at hand is formulated from a business viewpoint. Domain knowledge is very important at this phase in order to articulate a tangible business problem and its challenges.
A Data Understanding phase, where the data are observed, inspected and visualized in order to understand whether they are appropriate and sufficient for the targeted issue. In this phase, domain knowledge is key to understanding whether the data reflect the problem domain, as well as whether they are representative and free of bias.
A Modelling phase, where different data mining and machine learning models (e.g., Bayesian techniques, regression, decision trees, neural networks) are considered and applied in order to derive knowledge that solves the business problem at hand. Domain knowledge can be invaluable in selecting the right techniques and building an acceptable model, as it helps to select a model with proper expressiveness while avoiding pitfalls such as overfitting the model to the data.
An Evaluation phase, where different models are evaluated in terms of their suitability for the given problem. The evaluation requires comparing the performance metrics of each model (e.g., classification error rates) against the needs of the enterprise in the specified business context. It’s not possible to judge the performance and suitability of a model in the context where it will be deployed simply by looking at a performance indicator in isolation. For example, an accuracy of 85% can be very good for the automatic classification of loan applications (i.e., with the remaining 15% to be screened by humans), but very poor for classifying individuals as patients of a rare disease: when more than 99.5% of the individuals do not suffer from the disease, a trivial model that always predicts “healthy” already exceeds 99.5% accuracy.
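The point about judging accuracy in context can be made concrete by comparing a model against the trivial strategy of always predicting the more common class. The following is a minimal sketch; the function names are hypothetical, the 0.5% prevalence is taken as an upper bound from the rare-disease example, and the 50% rate for loan applications is purely an assumption for illustration.

```python
def majority_baseline_accuracy(positive_rate: float) -> float:
    """Accuracy of a trivial model that always predicts the more common class."""
    return max(positive_rate, 1.0 - positive_rate)


def gap_over_baseline(accuracy: float, positive_rate: float) -> float:
    """How far a model's accuracy sits above (or below) the trivial baseline."""
    return accuracy - majority_baseline_accuracy(positive_rate)


# Rare disease: at most 0.5% of individuals are positive (from the example above).
rare_disease_gap = gap_over_baseline(accuracy=0.85, positive_rate=0.005)

# Loan applications: a 50% approval rate is an assumed, illustrative figure.
loan_gap = gap_over_baseline(accuracy=0.85, positive_rate=0.5)

print(f"rare disease gap over baseline: {rare_disease_gap:+.3f}")
print(f"loan screening gap over baseline: {loan_gap:+.3f}")
```

A negative gap means the model is worse than doing nothing clever at all, which is why the same 85% figure reads so differently in the two business contexts.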
Overfitting Avoidance: When Data Science and Statistics are not Enough
Domain knowledge is extremely important in supervised learning, where learning depends on a set of past observations (i.e., training data). Even though the data are derived from real-life settings, they are not always representative of all scenarios. This is the source of the so-called overfitting problem, where a machine learning agent performs very well on the training dataset but exhibits poor performance when applied to additional (new) data. Consider, for example, the case where you want to model the income of young workers in a given city as a function of their age. If you sample a large portion of the city’s population, chances are that you will find a proper relationship between these two parameters. However, if you sample only a few neighbourhoods, including ones where both very rich people and workers live, the result might deviate significantly from reality, as it will be affected by outliers (i.e., very rich people) and randomness (i.e., youngsters with unusually well-paid jobs).
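The neighbourhood sampling problem can be illustrated with a few lines of code. This is only a sketch: the neighbourhood names and income figures are invented for illustration and do not come from any real dataset.

```python
from statistics import mean, median

# Hypothetical annual incomes (in thousands) of young workers per neighbourhood.
city = {
    "north": [24, 26, 27, 25, 28],
    "south": [22, 23, 25, 24, 26],
    "east": [27, 29, 26, 28, 30],
    "hills": [31, 250, 29, 400, 33],  # wealthy area containing outliers
}


def pooled(neighbourhoods):
    """Concatenate the incomes of the selected neighbourhoods."""
    return [income for name in neighbourhoods for income in city[name]]


full_sample = pooled(city)                  # every neighbourhood
narrow_sample = pooled(["east", "hills"])   # only a few neighbourhoods

print("mean, full sample:    ", mean(full_sample))
print("mean, narrow sample:  ", mean(narrow_sample))
print("median, full sample:  ", median(full_sample))
print("median, narrow sample:", median(narrow_sample))
```

In this made-up data the narrow sample inflates the mean far more than the median, but neither statistic reveals on its own that whole neighbourhoods are missing. Someone who knows the city has to spot that.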
Data scientists tend to apply heuristic methods that help them avoid “overfitting”. One of these methods involves penalizing complex functions, as a means of favouring descriptions of more general structural patterns (i.e., improving generalization). Another measure that helps to avoid “overfitting” is cross-validation, which boils down to splitting the available data into training and test datasets (typically repeated over several different splits), so that the different models built on the training data can be tested on data they have not seen. In this context, data scientists also tend to compare various models. When several models exhibit more or less the same performance, the simplest one is chosen, as it is likely to generalize better.
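The heuristics above (a complexity penalty, cross-validation, and a preference for the simplest model) can be sketched together in plain Python. This is an illustrative sketch, not a reference implementation: it assumes a one-dimensional linear model with an L2 (ridge) penalty on the slope, interleaved folds, and a hypothetical 5% tolerance for "more or less the same performance". All function names and the synthetic data are invented for the example.

```python
from statistics import mean


def fit_ridge_1d(xs, ys, penalty):
    """Fit y = w*x + b; a larger penalty shrinks the slope w towards zero."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    w = sxy / (sxx + penalty)  # assumes sxx + penalty > 0
    return w, my - w * mx


def cv_error(xs, ys, penalty, k=5):
    """Mean squared error on held-out data, averaged over k interleaved folds."""
    fold_errors = []
    for fold in range(k):
        train = [i for i in range(len(xs)) if i % k != fold]
        test = [i for i in range(len(xs)) if i % k == fold]
        w, b = fit_ridge_1d([xs[i] for i in train], [ys[i] for i in train], penalty)
        fold_errors.append(mean((ys[i] - (w * xs[i] + b)) ** 2 for i in test))
    return mean(fold_errors)


def pick_simplest(xs, ys, penalties, tolerance=0.05, k=5):
    """Among penalties whose error is within `tolerance` of the best, take the largest."""
    errors = {p: cv_error(xs, ys, p, k) for p in penalties}
    best = min(errors.values())
    close_enough = [p for p, e in errors.items() if e <= best * (1 + tolerance)]
    return max(close_enough), errors


# Synthetic data (an assumption): a linear trend plus small deterministic noise.
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 + ((7 * i) % 5 - 2) * 0.5 for i, x in enumerate(xs)]
penalties = [0.0, 1.0, 100.0, 1e6]

chosen, errors = pick_simplest(xs, ys, penalties)
print("chosen penalty:", chosen)
print("cross-validated errors:", errors)
```

Note that the tolerance itself is a judgment call. Deciding how much accuracy can be traded for simplicity is one of the places where knowledge of the business problem enters an otherwise mechanical procedure.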
Despite the existence of “rules of thumb” for detecting and alleviating overfitting, true solutions cannot be found without domain knowledge. Domain knowledge is important for detecting problems in the given datasets, such as seasonality or outliers. It can also serve as a basis for identifying inter-dependent attributes (e.g., one property defining another) and for explaining the patterns derived from a model. For instance, in datasets collected from a machine in order to build a condition monitoring or predictive maintenance application, there are insights that can only be provided by a domain expert. For example, the fact that a machine was not operating at all, or was operating at low speed, for a given time interval can only be deduced with the help of a domain expert. Likewise, identifying possible faulty situations that existed before the machine was started and its data were collected also requires the involvement of a domain expert.
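Once a domain expert has explained what "not operating" and "low speed" mean for a given machine, that knowledge can be encoded as an explicit rule applied before any model is trained. The sketch below is hypothetical throughout: the `Reading` fields, the state labels, the 500 rpm threshold and the sample readings are all assumptions made for illustration, standing in for values an expert would supply.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    timestamp: int      # seconds since the start of recording (assumed unit)
    speed_rpm: float    # spindle speed
    vibration: float    # sensor value a monitoring model would learn from


def operating_state(reading: Reading, min_operating_rpm: float) -> str:
    """Label a reading using a threshold supplied by a domain expert."""
    if reading.speed_rpm <= 0:
        return "stopped"
    if reading.speed_rpm < min_operating_rpm:
        return "low_speed"
    return "normal"


def training_subset(readings, min_operating_rpm: float):
    """Keep only readings taken while the machine ran under normal conditions."""
    return [r for r in readings if operating_state(r, min_operating_rpm) == "normal"]


# Invented sample data for illustration only.
readings = [
    Reading(0, 0.0, 0.01),      # machine stopped
    Reading(60, 120.0, 0.05),   # warming up at low speed
    Reading(120, 900.0, 0.31),  # normal operation
    Reading(180, 950.0, 0.29),  # normal operation
]

usable = training_subset(readings, min_operating_rpm=500.0)
print(f"{len(usable)} of {len(readings)} readings kept for training")
```

Without such a rule, a model could easily learn that low vibration is the "healthy" pattern when it merely reflects a machine that was switched off.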
Domain Knowledge Implications
The importance of domain knowledge has significant implications in various aspects of data mining and machine learning processes. In particular:
Team Building: The assembly of a proper data science and data analytics team becomes extremely challenging as a result of the need to involve business experts. The experts should actively participate in the team and are expected to work in association with IT, database, data mining and machine learning experts. Given the proclaimed talent gap in big data technologies, building an expert team in many cases is a tough challenge.
Data Collection: The process of collecting and consolidating datasets must be supervised and reviewed by domain experts. Assembling representative datasets is a key prerequisite for starting your project on the right foot.
Variations in Accuracy and ROI (Return-On-Investment): Companies of similar size and in the same industry are likely to get very different returns from more or less similar investments in big data analytics. The degree of involvement of domain experts can explain such variations. Even though domain experts’ time is expensive and hard to secure, their involvement leads to more accurate and more effective knowledge extraction.
Balancing data science expertise with domain knowledge is key to succeeding in big data analytics projects. It’s therefore worth investing time and effort in building a proper team, which will harmoniously combine the skills of business experts, data scientists, and IT experts.