A quick reflection on 2016 IT trends reveals the importance of data analytics for a wide range of innovative applications. The list of relevant examples is endless: Artificial Intelligence (AI) chatbots that open new horizons in customer service, connected cars that analyze data to enable safer driving, smart cities that automatically optimize water and energy resources, factories that derive predictive insights for machine and production-system maintenance, and a variety of machine intelligence and Internet-of-Things applications. These applications do not just leverage conventional data mining techniques; rather, they deploy richer and more effective methods such as deep learning. The data-driven trend is likely to intensify and expand in the coming years, which makes it important for companies to understand how to deploy and fully leverage advanced analytics.
It’s about Big Data Technologies
The rapidly expanding use of data analytics is largely due to the proliferation of data sources (e.g., legacy databases, open data, sensors, IoT devices) and of the data volumes they produce. Nowadays, analytics are applied over Big Data, i.e., datasets characterized by the famous Vs:
- Volume: arbitrarily large data volumes that exceed the capacity of state-of-the-art distributed databases. For example, several terabytes of data are generated daily on popular social networking sites such as Twitter.
- Variety: Data analytics applications bring together and process data from a multitude of sources, in various formats. These are likely to include structured, semi-structured and unstructured data, such as relational databases, e-mails and social media respectively.
- Velocity: Data-driven applications such as smart city and fintech systems have to deal with a multitude of data streams that reach the analytics system at very high ingestion rates (see the sketch after this list).
- Veracity: A great deal of the data comes from noisy sources, which introduces uncertainty about the content and semantics of the data streams.
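To make the Velocity point above concrete, here is a minimal sketch of a sliding-window counter for a high-rate event stream; the one-minute window and the sensor keys are illustrative assumptions, not part of any particular product.

```python
from collections import deque
import time

WINDOW_SECONDS = 60  # illustrative one-minute sliding window


class SlidingWindowCounter:
    """Counts events per key over the last WINDOW_SECONDS."""

    def __init__(self):
        self.events = deque()  # (timestamp, key) pairs, oldest first

    def ingest(self, key, timestamp=None):
        now = timestamp if timestamp is not None else time.time()
        self.events.append((now, key))
        # Drop events that have fallen out of the window.
        while self.events and self.events[0][0] < now - WINDOW_SECONDS:
            self.events.popleft()

    def counts(self):
        tally = {}
        for _, key in self.events:
            tally[key] = tally.get(key, 0) + 1
        return tally


# Example: ingest a burst of sensor readings and query the current window.
counter = SlidingWindowCounter()
for sensor in ["s1", "s2", "s1"]:
    counter.ingest(sensor)
print(counter.counts())  # e.g. {'s1': 2, 's2': 1}
```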
A Big Data system is not limited to a single technology deployment. Rather, Big Data requires the structured integration and use of multiple technologies, including distributed systems, SQL and NoSQL databases, and security mechanisms, as well as scalable data storage and processing in the cloud. As a result, companies need to invest in building the right technology infrastructure for Big Data.
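As one small illustration of such integration, the sketch below moves rows from a relational source (SQLite standing in for a legacy database) into a document store via pymongo; the table, its fields and the MongoDB instance at localhost:27017 are assumptions made purely for the example.

```python
import sqlite3
from pymongo import MongoClient  # assumes a reachable MongoDB server

# Relational side: a stand-in legacy database with one small table.
rel = sqlite3.connect(":memory:")
rel.execute("CREATE TABLE customers (id INTEGER, name TEXT, city TEXT)")
rel.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Alice", "Athens"), (2, "Bob", "Berlin")],
)

# NoSQL side: reshape each row as a JSON-like document and persist it.
client = MongoClient("mongodb://localhost:27017")  # assumed local instance
docs = [
    {"_id": cid, "name": name, "city": city}
    for cid, name, city in rel.execute("SELECT id, name, city FROM customers")
]
client.analytics_db.customers.insert_many(docs)
```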
Big Data technologies provide the means for collecting, unifying, persisting and managing the data. The real business value, however, stems from the analytics.
Machine Learning and Statistics for Data Analytics
The analysis of the data typically relies on machine learning or statistics, which are two complementary approaches. Machine learning (ML) enables systems to learn how to operate on the basis of past datasets. With ML, systems become more autonomous, since they can act on knowledge they extract rather than on preprogrammed instructions. An ML system is trained on historic data, without any need for prior assumptions about the underlying relationships among the features that comprise the data; it is then able to make predictions about new datasets. ML is typically applied to datasets with multiple features (i.e., many dimensions), such as healthcare or retail customer data.
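A minimal sketch of this train-on-history, predict-on-new-data workflow might look as follows, using scikit-learn on synthetic many-feature data that stands in for, say, retail customer records; the classifier choice and feature count are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for historic data with many features (high dimensionality).
X_hist, y_hist = make_classification(n_samples=1000, n_features=40, random_state=0)

# Training extracts the input/output relationship from the data itself,
# without prior assumptions about how the features interact.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_hist, y_hist)

# The trained model can now make predictions about previously unseen data.
X_new, _ = make_classification(n_samples=5, n_features=40, random_state=1)
print(model.predict(X_new))
```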
Statistics, on the other hand, is based on an understanding of the statistical properties of the datasets, which gives rise to insights similar to those that would be produced by running an experiment many times. Statistical modeling is usually used for datasets with low dimensionality, i.e., very few features.
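By contrast, a statistical model over a low-dimensional dataset can be as simple as the following sketch: an ordinary least-squares fit over a single synthetic feature with scipy, where the slope, intercept and p-value directly summarize the statistical properties of the data; the data itself is made up for illustration.

```python
import numpy as np
from scipy import stats

# A low-dimensional dataset: one feature, one response (synthetic).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.5 * x + 1.0 + rng.normal(scale=2.0, size=50)

# Ordinary least squares: the fitted parameters and p-value summarize
# the dataset's statistical properties, much as repeating an experiment
# many times would.
result = stats.linregress(x, y)
print(f"slope={result.slope:.2f}, intercept={result.intercept:.2f}, "
      f"p-value={result.pvalue:.3g}")
```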
A key prerequisite for developing effective ML and/or statistical models is a sound business understanding of the problem at hand, which helps a scientist select proper models and parameters for discovering business knowledge in the data. Finding a proper data analytics model involves iterative cycles of discovering and evaluating a model. Each cycle also involves the selection and preparation of training and test datasets, which are used to evaluate and benchmark a given model. Several iterations are usually needed before the ultimate selection and deployment of a high-performance ML model that renders accurate results. Overall, the process of data analytics using statistics and ML consumes considerable time and effort, and involves experts from different disciplines such as data scientists, business experts and IT professionals.
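One such discover-and-evaluate cycle could be sketched as follows, benchmarking a couple of candidate scikit-learn models by cross-validation before a final hold-out check; the candidate list and dataset are assumptions chosen for brevity.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Prepare separate training and test datasets for benchmarking.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# One cycle: evaluate each candidate model by 5-fold cross-validation.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
scores = {
    name: cross_val_score(model, X_train, y_train, cv=5).mean()
    for name, model in candidates.items()
}

# Select the best performer and confirm its accuracy on held-out data.
best_name = max(scores, key=scores.get)
best_model = candidates[best_name].fit(X_train, y_train)
print(best_name, scores[best_name], best_model.score(X_test, y_test))
```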
Deep Learning
Deep Learning (DL) is a newer approach to data analytics and learning that is gaining momentum. It is based on the use of multilayer neural networks for analyzing Big Data from different origins and of varying formats, including multimedia data sources. Beyond multilayer neural networks, DL is also driven by the availability of large amounts of data and by the emergence of powerful computational devices such as Graphics Processing Units (GPUs). DL should not be seen merely as an evolution of ML; rather, it is a disruptive approach that will power a great deal of the emerging AI and cognitive computing systems, such as augmented reality for people with disabilities. In the coming years, the growing availability of multimedia data will lead to a proliferating number of DL applications, which are considered the future of data analytics.
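For a flavor of what a multilayer neural network looks like in code, here is a minimal feed-forward network trained with PyTorch on random stand-in data; the layer sizes, optimizer and training data are all assumptions chosen for brevity, not a recipe for a real DL application.

```python
import torch
from torch import nn

# A small multilayer network: two hidden layers with nonlinear activations.
model = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 2),
)

# Random stand-in data; real DL applications would feed large volumes of
# (often multimedia) training data, typically on GPUs.
X = torch.randn(256, 20)
y = torch.randint(0, 2, (256,))

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(10):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()
print(f"final training loss: {loss.item():.3f}")
```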
Data-Driven Culture
Beyond a sound understanding of the technological and business aspects of data analytics, enterprises need to develop a data-driven culture by revising their processes to leverage the abundant data available, including enterprise data sources, open databases and social media. This cultural shift is in several cases much more difficult than the technology deployment itself.