Visualization is an integral part of most data-intensive applications, as it is rarely possible to understand their outcomes without visualizing the underlying datasets. This is also the case for the wave of BigData applications, which cope with very large volumes of data. In most cases, data visualization aims at providing ergonomic and user-friendly representations of data-driven outcomes. However, in BigData applications, visualization has two additional goals: first, to boost the identification of insights such as non-obvious or hidden patterns of knowledge, and second, to ease navigation and browsing of very large datasets. As such, data visualization in BigData is an integral part of data analysis, which helps end-users of BigData applications to identify knowledge patterns, predict trends and present insights to stakeholders. Visualizations present tabular and spatial data in visual formats that are typically more appealing for stakeholders, while at the same time making ideas easier to convey.
The importance of visualization has given rise to a wide array of diagrams and charts that visualize different aspects of the data and the insights present in it. Likewise, a large number of tools that facilitate the creation of various charts from the source data have emerged. Such tools are essential for creating effective representations of datasets, and they also enable story creation and story-telling based on large amounts of raw data.
Data Visualization in the BigData Applications Lifecycle
In one of our earlier posts, we presented popular methodologies for developing and deploying data mining applications, such as methodologies based on CRISP-DM (Cross Industry Standard Process for Data Mining) and KDD (Knowledge Discovery in Databases). The activities specified in these methodologies include two that depend heavily on visualization:
- Data Understanding: Prior to applying any analytics or machine learning model, data scientists need to understand the nature and characteristics of their training datasets, including, for example, the distributions of various parameters and the correlations between them. In the case of BigData applications, this requires a proper visualization of the training datasets, as it is almost impossible to review and understand data properties in their raw format.
- Application deployment – User interface tasks: The final phase of a BigData analytics application involves the integration and deployment of machine learning or data mining models. This also involves the implementation of a proper user interface with data visualization capabilities. In the case of BigData, effective visualization is key to understanding results that are hidden in very large amounts of raw data.
To facilitate data understanding and application-level visualization, data scientists and other stakeholders employ a large number of different diagrams.
Data Visualization Types and Diagrams
There are many different types of charts and diagrams for visualizing datasets. Most of us are quite familiar with the basic diagrams that are part of popular spreadsheet applications, such as histograms, line charts and bar charts. For example, a histogram represents a dataset as a series of rectangles, where the height of each rectangle is proportional to the number of data points in an interval and its width equals the size of that interval. Histograms are suitable for visualizing the distribution of the data. Likewise, line charts are used to depict the evolution of data parameters in relation to other parameters.
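To make the histogram description concrete, here is a minimal sketch in Python that groups values into fixed-width intervals and prints a text-only histogram. The function names, the sample `ages` list and the bin width of 10 are illustrative assumptions, not part of any particular tool.

```python
from collections import Counter


def histogram_counts(values, bin_width):
    """Count how many values fall into each fixed-width interval."""
    counts = Counter()
    for v in values:
        # Each value belongs to the interval [start, start + bin_width).
        start = (v // bin_width) * bin_width
        counts[start] += 1
    return dict(sorted(counts.items()))


def render(counts, bin_width):
    """Draw one row per interval; the bar length equals the count."""
    lines = []
    for start, n in counts.items():
        lines.append(f"{start:>3}-{start + bin_width:<3} {'#' * n}")
    return "\n".join(lines)


# Hypothetical sample data, used only for illustration.
ages = [21, 23, 25, 31, 34, 35, 36, 38, 42, 47, 58]
counts = histogram_counts(ages, bin_width=10)
print(render(counts, 10))
```

A charting library would draw rectangles instead of `#` characters, but the binning step that determines their heights is the same.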
Beyond these basic diagrams, BigData projects take advantage of additional types of visualizations, which are effective in consolidating and summarizing very large datasets. These additional diagrams have their roots in both statistics and data mining. Some prominent examples follow:
- Box Plot: A Box and Whisker Plot (or simply Box Plot) provides the means for visualizing data distributions through their quartiles. Such diagrams are characterized by the presence of “whiskers”, i.e. lines that extend from the boxes and are used to illustrate variability outside the upper and lower quartiles. Box plots are drawn either vertically or horizontally and can depict outliers (i.e. values that fall beyond the whiskers) as individual dots. Box Plots are good for BigData applications because they take up less space than histograms or density plots, which is essential when you have to cope with very large datasets that can comprise multiple groups of data.
- Stream Graphs (ThemeRiver): This chart type displaces values around a varying central baseline. Stream Graphs are used to display changes in the data over time for many different data categories. To this end, they use a flow-like shape that is inspired by a river metaphor (i.e. they resemble a river stream). Note that the size of each individual stream shape is proportional to the values in each category, while the axis that the streams flow along represents the timescale. Stream Graphs can be colored in order to distinguish each category or to give different emphasis to each category’s quantitative values through varying color shades. ThemeRiver diagrams are well suited for BigData datasets since they ease the discovery of trends and patterns over time and across many different categories. It is possible to identify seasonal peaks and periodic patterns, and also to visualize the volatility of large groups of items/assets in a given timeframe.
- Word (or Tag) Clouds: This visualization type depicts how frequently words appear in a given fragment of text (e.g., document, body). In particular, the diagram depicts each word with a size that is proportional to its frequency, i.e. words with the highest frequency appear larger than all others. Typically, the words are arranged in a cluster or cloud. However, it is also possible to present the words in other layouts, such as horizontal lines, columns or within a given shape. Word Clouds can also be colorful: color can be used to display another data variable associated with the displayed word. Word/Tag Clouds are very popular in the era of BigData, especially in applications involving the display of statistics about content (e.g., documents, books, websites, blogs).
- Venn Diagrams: These diagrams display logical relationships between different sets of items by representing each set with a separate circle. Within each circle, the diagram depicts objects/entities with common properties. Venn diagrams are commonly used to identify overlapping entities across two or more datasets, i.e. entities residing in the intersection area of the circles that represent these datasets. Venn diagrams are sometimes called Set diagrams as well.
- Mind-map (Brainstorm): These are diagrams used to map related ideas, words, images, and concepts. Brainstorms are usually structured as follows: A central node is connected to some major categories, while lesser categories appear as their subcategories. The diagram develops in a hierarchical fashion and provides a tool for generating ideas, finding associations, organizing information and visualizing structures. In BigData applications it can serve as a basis for identifying classifications and sub-categories for large amounts of data.
- Donut charts: These are like the popular pie charts, but with an area of their center cut out. They aim at making it easier to compare multiple charts side by side, since they make differences between corresponding slices easier to notice. Indeed, donut charts de-emphasize the central area and allow their readers to focus more on reading the length of the arcs, rather than on comparing the proportions between slices. Last, but not least, Donut charts are more space efficient than conventional pie charts, as their central area can be used to display additional information.
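As an illustration of the Box Plot entry above, the following sketch computes the numbers a box plot is drawn from: the quartiles, the whisker ends and the outliers. It assumes the common convention that whiskers reach the most extreme data points within 1.5 times the interquartile range of the box; tools differ on this rule, so treat it as one possible choice. The function name and sample values are hypothetical.

```python
import statistics


def box_plot_summary(values):
    """Return quartiles, whisker ends and outliers for a list of numbers."""
    data = sorted(values)
    # Quartiles via linear interpolation over the sorted data.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumed convention: points beyond 1.5 * IQR from the box are outliers.
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


# Hypothetical sample: eight ordinary values and one extreme value.
print(box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 40]))
```

However large the dataset, the plot needs only these few numbers per group, which is why box plots stay compact when many groups are shown side by side.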
The above list of visualization types is certainly non-exhaustive. A large number of additional diagrams are used in BigData systems for different purposes and applications.
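Returning to the Word Cloud entry in the list, the sizing rule it describes (font size proportional to word frequency) can be sketched in a few lines of Python. The function name, the `max_size` of 40 and the sample sentence are assumptions made for illustration. The sketch does no stop-word filtering, which a real application would probably want.

```python
import re
from collections import Counter


def word_sizes(text, max_size=40, top=5):
    """Map the most frequent words to font sizes proportional to their counts."""
    words = re.findall(r"[a-z']+", text.lower())
    most_common = Counter(words).most_common(top)
    if not most_common:
        return {}
    highest = most_common[0][1]
    # The most frequent word gets max_size; the others scale down linearly.
    return {word: max_size * count / highest for word, count in most_common}


# Hypothetical input text.
sizes = word_sizes("data big data visual data big")
for word, size in sizes.items():
    print(f"{word}: {size:.1f}")
```

A rendering tool would then place each word in the cloud at the computed size, optionally using color for a second variable.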
Data Visualization Tools
The creation of BigData visualizations is largely a matter of using appropriate tools that can produce the various diagrams in a fast and configurable way. There are already many tools that can facilitate this production. Available tools vary not only in terms of their functionalities and sophistication, but also in terms of the programming languages and platforms that they support. As a prominent example, Candela is an open-source visualization tool for JavaScript developers and data scientists. Likewise, the Datawrapper tool supports visualization for mobile devices and provides the means for creating several popular charts in seconds. As another example, MyHeatMap is a tool that focuses on the interactive visualization of geographic data, including the production of heatmaps. There are also tools, such as Palladio, that provide various visualizations of large sets of historical data. Palladio supports different visualization types, such as map views, graph views, and list views. It can visualize data from different source formats such as .CSV and .tab files.
Note also that the major vendors offer advanced tools for data visualization. Prominent examples include the business intelligence tools from Tableau, Google and Oracle, which offer extreme versatility not only in terms of input data sources and formats but also in terms of supported data visualizations.
Visualization is an integral and important part of any non-trivial BigData project. Understanding and deploying the best ways to visualize data is something that could set one apart from competitors. This requires, however, learning and mastering data visualization types beyond conventional diagrams, and using the right data visualization tools for optimal productivity. While this involves a significant learning curve, it’s certainly an investment that pays off!