Machine learning (ML) is the way to go when you have a large dataset and a problem that calls for pattern recognition or predictive analysis. The proliferation of free, open source software has made machine learning easier to implement, both on a single machine and at scale. From face recognition to spam filters, open source software is becoming the tool of choice.

Machine learning is undergoing something of a renaissance. Almost every day brings an advance on what already exists. From lip reading to image recognition, progress is so fast that even the most enthusiastic programmers struggle to keep up.

Open sourcing a project is a smart way to grow a community of talent in a particular field. In open source machine learning, Google's TensorFlow is undoubtedly the one to beat, and it outpaces newer entrants on many metrics.

Given the paradigm shift machine learning can inspire, it is important that it stays open source. That allows anyone, from any part of the world, to join the revolution so that no one is left behind. The open source tools below have libraries for different programming languages, including Java, Python, Scala, C++, R, Go, and JavaScript. In no particular order, here are 13 open source tools to make the most of machine learning.

1. Shogun

The venerable Shogun was created in 1999. It is written in C++, but it has bindings for Python, Java, C#, Octave, Matlab, Lua and R. The latest version, 6.0.0, adds support for Scala and Microsoft Windows.

Shogun's main competitor is Mlpack, another C++-based library, which arrived in 2011. Mlpack is said to be faster and easier to work with than competing libraries.

2. Scikit-learn

Python has a large collection of libraries for nearly every application and is easy to adopt. This is why it has become a go-to programming language for science, math and statistics.

Scikit-learn builds on existing Python packages such as SciPy, NumPy and Matplotlib for math and science work. The resulting libraries can be used in interactive workbench applications or embedded into other software. Because the kit is available under a BSD license, it is fully open and reusable.

3. Apache Mahout

Apache Mahout went hand-in-hand with Hadoop for a long time, but a good number of its algorithms can now run independently of it. That makes them useful for stand-alone applications that may later be moved into Hadoop, and for Hadoop projects that may be spun off into stand-alone applications.

Recent versions have expanded support for the Spark framework and improved support for the ViennaCL library.

4. Accord.NET Framework

Accord is a signal processing and machine learning framework for .NET. It has a set of libraries for processing audio signals and image streams. Its vision algorithms can be used for face detection, tracking moving objects or stitching images together.

Accord also has a set of libraries that provide a more conventional set of machine learning functions, which range from decision-tree systems to neural networks.
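To make the "decision-tree" end of that range concrete, here is a minimal sketch of the simplest possible tree, a one-level "decision stump", written in plain Python. It does not use Accord's API (Accord is a .NET framework); the function names and the toy data are hypothetical and exist only for illustration.

```python
from typing import List, Tuple


def fit_stump(xs: List[float], ys: List[int]) -> Tuple[float, int]:
    """Find the single threshold that misclassifies the fewest points.

    Returns (threshold, label_above): predict label_above when
    x > threshold and the opposite label otherwise. Labels are 0 or 1.
    """
    best = (float("inf"), 0.0, 1)  # (errors, threshold, label_above)
    values = sorted(set(xs))
    # Candidate thresholds sit halfway between neighbouring feature values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    for threshold in candidates:
        for label_above in (0, 1):
            errors = sum(
                (label_above if x > threshold else 1 - label_above) != y
                for x, y in zip(xs, ys)
            )
            if errors < best[0]:
                best = (errors, threshold, label_above)
    return best[1], best[2]


def predict_stump(model: Tuple[float, int], x: float) -> int:
    """Apply the learned threshold rule to one value."""
    threshold, label_above = model
    return label_above if x > threshold else 1 - label_above


# Hypothetical toy data: hours of sunlight vs. whether a plant flowered.
hours = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
flowered = [0, 0, 0, 1, 1, 1]

model = fit_stump(hours, flowered)
print(model, predict_stump(model, 5.0), predict_stump(model, 2.5))
```

A full decision tree repeats this split recursively on each side of the threshold, and a library such as Accord adds the many refinements a sketch like this leaves out.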

5. H2O

H2O's algorithms are geared toward business processes, such as trend or fraud prediction, rather than image analysis. H2O can run in standalone fashion against HDFS stores, on top of YARN, in MapReduce or in an Amazon EC2 instance.

The H2O framework provides bindings for R, Python and Scala, which let you use all the libraries available on those platforms. Hadoop fans, on the other hand, can interact with H2O through Java.

6. Spark MLlib

The primary language for working in MLlib is Java, but Python users can connect MLlib with the NumPy library. R users, on the other hand, can plug into Spark as of version 1.5. MLlib offers a large set of algorithms and is built to run at speed and scale.

Scala users can write code against MLlib. MLbase is another project that builds on MLlib to make it easier to obtain results. Instead of writing code, users make queries in a declarative, SQL-like language.

7. GoLearn

GoLearn is a machine learning library for Google's Go language. It was created with the twin goals of simplicity and customizability.

The simplicity lies in the way data is loaded and handled in the library, which is modeled on SciPy and R. The customizability lies in the way some data structures can be extended in an application.

8. Cloudera Oryx

Oryx comes from the makers of the Cloudera Hadoop distribution. It uses Apache Kafka and the Spark processing framework to run ML models on real-time data. Oryx is well suited to projects that need to make decisions in the moment, such as live anomaly detection informed by both new and historical data.

The newer version, 2.0, is a near-complete redesign, with loosely coupled components in a lambda architecture. New algorithms, and new abstractions for those algorithms, can now be added at any time.
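The idea of combining historical and new data can be sketched in a few lines of plain Python. This is not Oryx code and does not use Kafka or Spark; it is a toy illustration in which a "batch" step summarizes historical readings and a "speed" step checks each new reading against that summary. The class and function names, the sample readings and the z-score threshold of 3.0 are all assumptions made for the example.

```python
import math
from typing import Iterable, List


class RunningStats:
    """Mean and variance maintained incrementally (Welford's algorithm)."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self) -> float:
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


def batch_layer(history: Iterable[float]) -> RunningStats:
    """Build the initial model from historical data."""
    stats = RunningStats()
    for x in history:
        stats.update(x)
    return stats


def speed_layer(stats: RunningStats, stream: Iterable[float],
                z_threshold: float = 3.0) -> List[float]:
    """Flag new readings far from the model; learn from the rest."""
    anomalies = []
    for x in stream:
        if stats.std > 0 and abs(x - stats.mean) / stats.std > z_threshold:
            anomalies.append(x)  # flagged, so not folded into the model
        else:
            stats.update(x)
    return anomalies


# Hypothetical sensor readings.
history = [10.0, 11.0, 9.0, 10.0, 10.0, 11.0, 9.0, 10.0]
stream = [10.0, 12.0, 50.0, 9.0]

model = batch_layer(history)
anomalies = speed_layer(model, stream)
print(anomalies)
```

In a real lambda architecture the batch and speed steps run as separate, loosely coupled services, with a serving layer answering queries; here they are simply two functions sharing one object.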

9. ConvNetJS

ConvNetJS is a JavaScript library designed to serve as a data workbench for neural network machine learning. Node.js users can install the NPM version.

The library is designed to handle JavaScript's asynchronicity properly. For example, training operations can be given a callback to execute the moment they complete. It also includes plenty of demo examples.
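The callback pattern described above is not specific to JavaScript. As a rough analogue, and not ConvNetJS's own API, the following Python sketch runs a toy training function on a worker thread and registers a callback that fires when training completes. The `train` function, its learning rate and the sample data are hypothetical; `ThreadPoolExecutor` and `add_done_callback` come from Python's standard library.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import List, Tuple


def train(samples: List[Tuple[float, float]],
          epochs: int = 200, lr: float = 0.01) -> float:
    """Fit y = w * x by stochastic gradient descent and return w."""
    w = 0.0
    for _ in range(epochs):
        for x, y in samples:
            # Gradient of the squared error (w * x - y) ** 2 with respect to w.
            w -= lr * 2 * (w * x - y) * x
    return w


results: List[float] = []


def on_trained(future: Future) -> None:
    """Callback invoked once the training job has completed."""
    results.append(future.result())


# Hypothetical data following y = 2x.
samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

with ThreadPoolExecutor(max_workers=1) as pool:
    job = pool.submit(train, samples)
    job.add_done_callback(on_trained)
    # The main thread is free to do other work here while training runs.

# Leaving the with-block waits for the worker, so the callback has run.
print(results)
```

The caller never blocks on the training step itself; it only says what should happen when the result is ready, which is the same division of labor the callback gives a ConvNetJS user in the browser.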

10. Deeplearn.js

Deeplearn.js is another project suited for deep learning in a web browser. You can train the neural network models in any modern browser without the need for additional client-side software.

Deeplearn.js can also perform GPU-accelerated computation through the WebGL API, so performance does not depend only on the system's CPU. Users of Google's TensorFlow will find Deeplearn.js easy to use, as it is modeled on TensorFlow.

11. Weka

Weka is a collection of Java machine learning algorithms designed specifically for data mining. Its functionality can be extended through a package system that includes both official and unofficial packages. It comes with a book that explains the software and its associated techniques.

Weka is not targeted at Hadoop users, but the recent versions work well with Hadoop with the help of a set of wrappers. Weka still does not support Spark.

12. TensorFlow

TensorFlow is still the leading open source machine learning library. It is easy to use with Python, and it has a few experimental APIs in Go and Java. Its introductory material includes a section for machine learning beginners and a section for professionals.

TensorFlow is the leading open source machine learning tool on GitHub. It has the largest community, as well as the most projects.

13. PaddlePaddle

PaddlePaddle is a fairly new entry. It is a product of researchers at Baidu, the Chinese search company often described as China's Google. Baidu has a set of fairly advanced artificial intelligence (AI) labs, which are run by an ex-Stanford professor.

Paddle is an acronym for Parallel Distributed Deep Learning. It is promoted as a flexible, efficient, scalable and easy-to-use deep learning platform. The getting started page gives beginners a head start.


Machine learning (ML) enables computers to learn without being explicitly programmed. ML has evolved from artificial intelligence through computational learning theory and pattern recognition. It explores the study and construction of algorithms that can learn from data and make precise predictions.
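A tiny example shows what learning "without being explicitly programmed" means in practice. In the sketch below, the program is never told the formula for converting Celsius to Fahrenheit; it estimates the slope and intercept from example pairs using ordinary least squares. The function name and the sample readings are made up for illustration, and only Python's standard library is used.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical paired readings; the conversion rule itself is never coded.
celsius = [0.0, 10.0, 20.0, 30.0, 40.0]
fahrenheit = [32.0, 50.0, 68.0, 86.0, 104.0]

slope, intercept = fit_line(celsius, fahrenheit)
predicted = slope * 25.0 + intercept  # prediction for an unseen input
print(slope, intercept, predicted)
```

The tools in this list apply the same principle to far larger datasets and far more flexible models.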

Machine learning is now deployed across a wide range of computing tasks where designing and programming explicit, efficient algorithms is impractical. Making ML tools open source allows for more extensive research into their improvement, which in turn speeds up the evolution of the technology.

There are many more open source machine learning tools out there. How many have you used, and which of them do you think is the most useful? Let us hear your opinion in the comment box.

Published November 18, 2017


Software Developer

Lucy is the Development & Programming Correspondent for She is currently based in Sydney.
