BigData and Artificial Intelligence are becoming the main drivers of the digital transformation of businesses worldwide. Enterprises are increasingly trying to extract intelligence from very large volumes of messy data that include structured, unstructured and semi-structured datasets from a variety of different sources. In this effort, they tend to employ a cohort of data scientists, who typically combine math, statistics and programming skills. Their math and statistics skills allow them to derive insights and knowledge from complex datasets, while their programming and engineering skills enable them to prototype software systems that take advantage of this knowledge.
Despite their engineering skills, data scientists tend to be less proficient in software development than typical programmers and software engineers. The main reason for this is that their programming effort is focused almost exclusively on manipulating and visualizing large, complicated datasets, rather than on exploiting the full features and capabilities of general-purpose (Turing-complete) programming languages. Likewise, data scientists are usually proficient mainly in the programming languages and tools best suited to data-intensive applications.
The rising popularity of BigData and data science has led to the emergence of many relevant programming languages, as well as to the establishment of developer ecosystems around them. Each of these languages has pros and cons when it comes to implementing data-intensive projects. In the following paragraphs, we attempt a brief presentation of the most popular languages and tools, while illustrating the criteria that could drive a data scientist to select which one to use.
Languages and Tools for Data Scientists
As already outlined, there is a rich set of programming environments for data intensive applications. Popular data handling languages (such as R and Python) are complemented by mainstream general purpose languages that offer data analytics libraries (e.g., Java). At the same time, there are also more specialized toolkits which are devoted to specific data mining approaches such as deep neural networks and deep learning. A representative list is as follows:
R: R is a free software environment for statistical computing and graphics. It emerged twenty years ago as a free alternative to commercial statistical software (e.g., Matlab) and has been gradually gaining momentum ever since. R is very appealing to the data science community for two main reasons. First, its simplicity, which enables data scientists to manipulate complex datasets and visualize graphs with only a few lines of code. Second, its large and vibrant ecosystem, which offers a rich set of extensions, libraries and tools that can support virtually any programming task.
Python: Python is another intuitive and easy to learn language which is increasingly adopted by data scientists worldwide. It provides sophisticated data mining capabilities while being more versatile than R for the development of prototypes and products beyond data mining scripts and graphs. Python also comes with a growing ecosystem of developers, applications and tools.
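As a minimal sketch of the conciseness described above, the following few lines compute summary statistics over a small dataset using only Python's standard library. The readings are invented for illustration; real projects would typically reach for libraries such as pandas.

```python
# A minimal sketch of concise data manipulation in Python (stdlib only).
# The dataset below is invented for illustration.
from statistics import mean, stdev

# Hypothetical sensor readings (e.g., daily temperatures in Celsius)
readings = [21.3, 19.8, 22.1, 20.5, 23.0, 18.9, 21.7]

# Summary statistics in a couple of lines
summary = {
    "count": len(readings),
    "mean": round(mean(readings), 2),
    "stdev": round(stdev(readings), 2),
    "range": (min(readings), max(readings)),
}
print(summary)
```

The same analysis in a lower-level general-purpose language would typically take noticeably more code, which is a large part of Python's appeal for exploratory work.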
Julia: Julia is a dynamic, high-level, high-performance language and the latest entry in the field of programming languages for data-intensive systems. It is fast and expressive, and in several applications it outperforms R in speed and Python in scalability while still being easy to learn. Most importantly, despite its quite short lifetime (work on it began in 2009, with a first public release in 2012), it is rapidly evolving and will soon be able to support the full range of functionalities currently provided by R and Python.
Java: Java remains one of the languages with the largest mass of developers. It comes with various frameworks for data mining and data science such as Weka and RapidMiner. Java frameworks aren’t as simple as R and Python, nor as strong in data visualization. Nevertheless, they still represent an excellent choice for integrating data mining and machine learning systems with the vast amount of Java-based systems and libraries. Note that Java remains a primary choice when it comes to storing and managing very large datasets. Specifically, the popular Hadoop framework for BigData persistence and batch processing (including its HDFS filesystem) is Java-based and comes with other Java tools such as the Hive query framework.
Scala: Scala is an example of a JVM language: it compiles to Java bytecode, executes on the Java Virtual Machine and is supported by Java development environments such as IntelliJ IDEA. However, Scala also has several functional programming features, in an effort to combine the best of both the functional and object-oriented worlds. Scala is a primary language of choice for popular BigData frameworks such as Apache Spark.
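The functional style that Scala brings to frameworks like Spark can be illustrated with the classic word-count pattern: a "map" step that transforms records, followed by a "reduce" step that merges partial results. The toy below is pure Python on an in-memory list, not Spark's actual distributed API, but the shape of the computation is the same.

```python
# A toy, pure-Python sketch of the map/reduce style that Spark (written in
# Scala) applies to distributed datasets. Input lines are invented.
from functools import reduce
from collections import Counter

lines = ["big data and ai", "ai and data science", "big ideas"]

# "map" phase: split each line into words
words = [w for line in lines for w in line.split()]

# "reduce" phase: merge per-word counts into a single result
counts = reduce(lambda acc, w: acc + Counter([w]), words, Counter())
print(counts["data"])
```

In Spark the same pipeline would run across a cluster, with each transformation applied to partitions of the data in parallel.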
Apache Kafka: Kafka has recently been gaining momentum as one of the leading platforms for handling data streams. The need for such platforms has intensified with the proliferation of applications that process streams with high ingestion rates, such as internet-of-things applications. Kafka supports near-real-time processing as it can handle messages at very high throughput. Note, however, that its delivery guarantees depend on configuration: settings that favor raw speed can, under failures, lead to message loss.
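The throughput-versus-durability trade-off mentioned above is largely a matter of producer configuration. The sketch below shows real Kafka producer settings (as used by clients such as confluent-kafka) that favor durability; the broker address is a placeholder, and this is a configuration fragment rather than a running client.

```python
# Kafka producer settings that favor durability over raw throughput.
# Shown as a plain configuration sketch; a real client (e.g., confluent-kafka)
# would be constructed from a dict like this.
producer_config = {
    "bootstrap.servers": "localhost:9092",  # hypothetical broker address
    "acks": "all",               # wait for all in-sync replicas to confirm
    "enable.idempotence": True,  # avoid duplicates when sends are retried
    "retries": 5,                # retry transient send failures
}
print(producer_config["acks"])
```

With `acks` set to `0` or `1` instead, producers confirm writes before full replication, which is faster but is exactly the configuration under which messages can be lost.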
TensorFlow: TensorFlow™ is an open source software library for numerical computation using data flow graphs. It is extensively used for data mining solutions that leverage deep neural networks, and its flexible architecture allows it to run on both CPUs and GPUs. TensorFlow is not the only option for programming deep learning applications; there are also Java-based frameworks such as Deeplearning4j.
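To make "numerical computation using data flow graphs" concrete, the toy below builds a tiny graph in plain Python, where each node holds an operation and the nodes feeding into it, and evaluation flows along the edges. This is only an illustration of the concept, not TensorFlow's actual API.

```python
# A toy data flow graph: each node holds an operation and its input nodes.
# Purely illustrative -- not TensorFlow's API.
class Node:
    def __init__(self, op, *inputs):
        self.op, self.inputs = op, inputs

    def eval(self):
        # Evaluate upstream nodes first, then apply this node's operation
        return self.op(*(n.eval() for n in self.inputs))

def const(v):
    return Node(lambda: v)

# Graph for (2 + 3) * 4
a, b, c = const(2.0), const(3.0), const(4.0)
add = Node(lambda x, y: x + y, a, b)
mul = Node(lambda x, y: x * y, add, c)
print(mul.eval())  # → 20.0
```

Representing computation as a graph is what lets frameworks like TensorFlow schedule the same program onto CPUs, GPUs or clusters without changing the user's code.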
Go: Go is a Google-developed language which can be used to develop robust machine learning infrastructures, including blockchain-based applications. It emerged in 2009 and already underpins many data-intensive projects.
Selection Criteria – Making the Right Choice
The above list is not exhaustive but indicative of the variety of different options that are offered to data scientists. Data science teams are therefore confronted with the challenge of selecting the best framework for their task at hand. This selection could be done based on one or more of the following criteria:
- Performance and Scalability: Scalability and performance are major concerns for large-scale data-intensive applications. Different languages offer different performance and scalability characteristics, and this can be one of the main criteria for selecting the language to use.
- Integration with other software systems: Some languages offer excellent data processing capabilities yet they lack functionalities for integration with other complex software systems. In such cases, data scientists are likely to select a language (e.g., Java) not based on its data-related capabilities but rather on the richness and versatility of its integration libraries.
- Visualization: There are data-intensive applications for which visualization is a primary concern. In such cases, data scientists are likely to opt for choices like R and Python which are powerful in graphs and visualization.
- Cost: Most of the listed solutions and environments are based on free open source libraries, which is key to keeping licensing costs low. On the other hand, in some cases data scientists are offered the opportunity of accessing costly, enterprise-scale toolkits for statistical data analysis (such as MATLAB, SPSS and SAS).
- Training and Education resources: In most cases, data scientists will invest time in becoming proficient with the selected language. To this end, they seek languages with abundant, easily accessible training resources. Mainstream programming languages come with a wealth of on-line courses on sites like Udemy and Cognitive Class.
- Tools and Ecosystem support: Data scientists’ productivity is highly dependent on the availability of developer-friendly tools for data processing and for prototyping data-intensive systems. In this context, data scientists are likely to select a language based on the richness and friendliness of the tools it comes with. Similarly, they will seek languages with a strong ecosystem, which eases access to libraries, frameworks and scripts that can support a variety of tasks.
- Scope and purpose of the project: Several of the presented languages and frameworks are specialized to specific data science tasks such as stream processing or deep learning. Projects with specialized requirements are also likely to limit the selection space as they drive the selection towards a much smaller subset of candidate frameworks and tools.
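One simple way to apply the criteria above is a weighted score per candidate language: weight each criterion by its priority for the project, rate each candidate against it, and compare totals. The weights and ratings below are entirely invented for illustration; a real team would supply its own.

```python
# A minimal sketch of weighted scoring over selection criteria.
# All weights and ratings are invented for illustration.
weights = {"performance": 0.3, "integration": 0.2,
           "visualization": 0.2, "ecosystem": 0.3}

# Hypothetical 1-5 ratings per candidate language
ratings = {
    "R":      {"performance": 3, "integration": 2, "visualization": 5, "ecosystem": 4},
    "Python": {"performance": 3, "integration": 4, "visualization": 4, "ecosystem": 5},
    "Java":   {"performance": 4, "integration": 5, "visualization": 2, "ecosystem": 4},
}

def score(lang):
    # Weighted sum of the ratings across all criteria
    return sum(weights[c] * ratings[lang][c] for c in weights)

best = max(ratings, key=score)
print(best, round(score(best), 2))
```

Changing the weights (say, prioritizing visualization for a reporting-heavy project) can change the winner, which is the point: the criteria matter less than how the team prioritizes them.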
Overall, in the BigData and AI era, data scientists and developers are offered a wealth of programming languages, environments and tools. It’s therefore up to them to set the proper selection criteria and prioritize them according to the project at hand. Moreover, we advise data scientists to stay vigilant about the emergence of new languages and tools, as the data mining and analytics communities are constantly innovating in this space. In your data science journey, you are certainly not alone.