In the flurry of buzzwords surrounding innovative technology, especially when they are thrown around for marketing rather than for information, it is quite easy to get lost. There is no shortage of ‘neural networks’, all of them artificial, ranging from ‘recurrent’ to ‘generative adversarial’ and a multitude of others. Artificial intelligence, machine learning, predictive analytics: all are usually ill-defined concepts that aim to impress rather than educate.
The boundaries of each technique are indeed not very well defined, as is appropriate for a sector in a state of flux. However, we can bring some order to this chaos by subsuming all techniques that try to mimic human intelligence under the field of “Artificial Intelligence (AI)”.
“Machine Learning (ML)” is then the subfield covering techniques that try to imitate the human learning process (or at least what we currently know about it). Finally, “Deep Learning (DL)” is the use of a specific technology, namely multi-layer neural networks, to perform machine learning tasks.
Machine learning techniques often consist of presenting the machine with human-selected and curated inputs, along with a suitable description of their nature. These can be, for example, photos with their descriptions, machine states labelled by their suitability, or texts along with their translations into a foreign language. The idea is that the machine, using suitable algorithms, will ‘generalize’ from the labels or descriptions provided, and will then be able to apply them to new, unseen inputs.
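To make this concrete, here is a minimal sketch of the supervised idea in Python, using scikit-learn and its bundled handwritten-digit dataset purely as an illustration (the library and dataset are our choice here, not something special about the idea itself):

```python
# A minimal sketch of supervised learning: labelled examples in, a model that
# generalizes to unseen inputs out. scikit-learn's digits dataset is used only
# for illustration.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()  # 8x8 images of handwritten digits, already labelled
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=2000)  # a simple, 'shallow' classifier
clf.fit(X_train, y_train)                # 'learn' from the curated, labelled inputs

print("Accuracy on unseen digits:", clf.score(X_test, y_test))
```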
The first serious attempts at machine learning were usually binary (or, less often, multi-class) classifiers: they separated inputs into two (or more) categories. For some simple tasks – and even somewhat more complicated ones, like handwritten character recognition – they were hugely successful.
Their main drawback was their patent user-unfriendliness. They usually needed a lot of expertise and a careful choice of the representation of the inputs presented to them. For example, a picture had to be rasterized – rendered into pixels – and the intensity and tone of each pixel fed into the machine.
A text could be grammatically analysed and the parts of speech presented, with their positions, to the algorithm. Human experts performed this ‘feature extraction’ from the raw data, and only then could the machine consume vast amounts of these features and make the required generalization.
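Below is a rough illustration of what such hand-crafted ‘feature extraction’ might look like in Python; the specific features (mean intensity, simple edge measures) and the toy data are arbitrary choices made only for the sake of the example:

```python
# Sketch of human-designed 'feature extraction': instead of feeding raw pixels,
# an expert picks a few summary numbers that a shallow classifier can digest.
# The features and the random toy data below are illustrative choices only.
import numpy as np
from sklearn.svm import SVC

def extract_features(image: np.ndarray) -> np.ndarray:
    """Reduce a rasterized grayscale image to a few hand-picked numbers."""
    horizontal_edges = np.abs(np.diff(image, axis=0)).mean()
    vertical_edges = np.abs(np.diff(image, axis=1)).mean()
    return np.array([image.mean(), image.std(), horizontal_edges, vertical_edges])

rng = np.random.default_rng(0)
images = rng.random((200, 28, 28))            # stand-ins for rasterized photos
labels = rng.integers(0, 2, size=200)         # stand-in binary labels

features = np.stack([extract_features(img) for img in images])
shallow_classifier = SVC(kernel="linear").fit(features, labels)
```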
The most significant improvement of Deep Learning is that it does away with the need for a human expert to create a suitable machine representation. The data can be fed into the machine ‘raw’, in the form in which they exist in nature (digitized, of course, by appropriate sensors) or on the internet.
In this way, Deep Learning is a manifestation of ‘Representation Learning’. This approach relies on the machine, or rather the algorithm, to automatically discover the abstract representations needed to classify the inputs, or to pick out the relevant one from the multitude fed into it. DL algorithms are representation-learning techniques with many levels.
Starting from the raw data, a DL system extracts some features, passes them through non-linear transfer functions to the next level, where they are combined and fed onwards, and so on. At each level the features gain in abstraction (and generally, though not necessarily, shrink in number). This multiplicity of levels gives DL its ‘Deep’ moniker.
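A minimal sketch of this layered idea, assuming PyTorch and arbitrary layer sizes, might look as follows; each `Linear` step is a learned transformation and each `ReLU` is the non-linear transfer function between levels:

```python
# Each layer applies a learned linear map followed by a non-linear transfer
# function, producing progressively more abstract (and here, fewer) features.
# The layer sizes are arbitrary choices for illustration.
import torch.nn as nn

deep_net = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),   # raw pixels -> first-level features
    nn.Linear(256, 128), nn.ReLU(),   # combinations of first-level features
    nn.Linear(128, 64),  nn.ReLU(),   # still more abstract representations
    nn.Linear(64, 10),                # final scores, one per category
)
```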
In more detail, every machine learning technique, deep or shallow, starts by presenting the machine with a solved problem. Usually this takes the form of an annotated list of, say, images or sentences that have already been categorized, or in which the feature we want the machine to recognize has been highlighted.
The machine accepts the input and produces a series of scores, one for each desired category. We want the highest output score to correspond to the correct category for the current input. This, of course, is rarely the case for an ‘untrained’ machine. However, since the score the machine should be aiming for is ‘1’ for the correct category and ‘0’ for all others, the machine can compute the distance of its output from the desired one. It can then adjust the internal parameters that define it, usually called weights, in a way that makes its output look more like the desired output.
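A toy version of one such training step, again assuming PyTorch and using a simple squared distance from the ‘1 for the correct category, 0 for the rest’ target (in practice a cross-entropy loss is more common), could look like this:

```python
# One training step: compare the network's scores with the desired output
# (1 for the correct category, 0 for the rest) and nudge the weights so that
# the distance shrinks. Sizes and the input sample are toy placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

net = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 10))
optimizer = torch.optim.SGD(net.parameters(), lr=0.01)

x = torch.rand(1, 784)                                          # one toy input
target = F.one_hot(torch.tensor([3]), num_classes=10).float()   # correct class is '3'

scores = net(x)                                                 # current scores
loss = F.mse_loss(torch.softmax(scores, dim=1), target)         # distance from desired output
loss.backward()                                                 # how should each weight change?
optimizer.step()                                                # adjust the weights accordingly
optimizer.zero_grad()
```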
Deep Learning techniques are an extension of the decades-old method of Artificial Neural Networks. These were initially conceived to simulate the way human (or animal) neurons process the information they receive from the world. They were quite successful in their many forms and were considered the hope for the eventual emergence of a ‘true’ AI.
However, in their typical applications of image recognition and speech understanding, they ran up against the ‘selectivity-invariance dilemma’. Simple classifiers could become very good at classifying the samples used in their training while remaining hopeless at generalization, i.e. at classifying previously unseen samples. Alternatively, they could describe general characteristics of classes (e.g. “this is an animal since it has four legs”) but err hilariously when seeing variants of the taught examples (e.g. “a dog standing on two legs is not an animal”).
Linear classifiers like the original neural networks can therefore only carve out very simple regions in the ‘feature space’ of inputs (the space of distinguishing characteristics of the inputs). Alternative methods developed since can classify non-linearly, but they are bad at generalization, since they tend to give more weight to the central, dominant characteristics of a class. This is the reason that, as noted before, automated classification needed a significant human investment in isolating the distinguishing but invariant attributes of each class and then feeding them to the “shallow” classifier.
Deep Learning methods, whose primary representative is the Deep Neural Network (DNN), have many layers of the simple, traditional learning modules. Most of these transform their inputs in a non-linear way and feed their processed output to the next layer. Each module can augment the selectivity and the invariance of the internal representation at the same time.
In this way, a ‘deep’ network (typically with more than ten layers) can ‘learn’ to separate classes with minute differences while still being able to classify unseen examples. With suitable training, DNNs are far less prone to latching onto characteristics seen only in the examples used for their training, a tendency usually termed overfitting in ML.
It is probably apparent from the above description that no fundamentally new, revolutionary technology lies behind the vast and fast expansion of DNN techniques. Indeed, DNNs are mostly ANNs with non-linear transfer functions, hundreds of nodes and millions or tens of millions of weights.
It was instead the exponential improvement in the efficiency of processing and memory chips that allowed the easier and faster handling of the tens of millions of weight-update operations a DNN needs in order to be trained. The newer trend of ‘cloud computing’, where the most demanding tasks are delegated to far-off clusters of powerful processors, has made the use of DNNs even more accessible to the mass of users.
The defining example of AI in general, perhaps reflecting the huge importance that vision has for our everyday life, is image recognition. DNNs have been used from the onset for image recognition tasks (the first appearance of the term came in a paper that significantly improved some standard image recognition tests, using a DNN).
The tasks where DNNs are used include image segmentation, classification, object detection and the recognition of regions within images. They are used, for example, in traffic-sign recognition by self-driving cars, face recognition, text recognition in images, and other tasks.
They have already reached near-human performance and are used by multinationals such as Facebook (recognition of friends’ faces), Alphabet (self-driving cars), Adobe (text in Optical Character Recognition) and others, each in the image recognition tasks of interest to it.
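As a rough illustration of how such image recognition capability is typically reused, the sketch below loads one of torchvision’s pretrained ImageNet classifiers; the image file name is hypothetical, and the exact `weights` argument depends on the torchvision version installed:

```python
# Reusing a DNN already trained for image recognition. The preprocessing
# constants are the standard ImageNet normalization values; the image path
# is a hypothetical placeholder.
import torch
from torchvision import models, transforms
from PIL import Image

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.eval()

image = preprocess(Image.open("street_scene.jpg")).unsqueeze(0)  # hypothetical file
with torch.no_grad():
    scores = model(image)
predicted_class = scores.argmax(dim=1).item()  # index into the 1000 ImageNet categories
```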
Hearing is undoubtedly almost as important as seeing, and speech recognition systems have also been using DNNs to make advances that have already penetrated our everyday lives, with systems like Siri (Apple) and Alexa (Amazon) available to consumers, offering high performance and few errors.
Machine ‘hearing’ and executing simple commands (or simply transcribing speech into written text) is not as powerful as real language understanding, where DNNs have also been used with great success. Analysis of text and language using DNNs has enabled, for example, ‘sentiment analysis’, which is important in analysing everything from tweets to customer reviews.
It has also immensely helped automatic translation, where performance has improved significantly even for translation between languages for which no large corpus of parallel texts exists to aid direct training. DNNs have also been used in the literary or forensic analysis of texts to authenticate their writers.
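As one concrete, if simplified, example of DNN-based sentiment analysis, the Hugging Face `transformers` library exposes a ready-made pipeline (it downloads a pretrained model on first use):

```python
# DNN-based 'sentiment analysis' via the transformers pipeline API.
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")
reviews = [
    "The battery lasts all day and the screen is gorgeous.",
    "Arrived broken and support never answered my emails.",
]
for review in reviews:
    print(review, "->", sentiment(review)[0])  # e.g. {'label': 'POSITIVE', 'score': 0.99}
```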
The primary motive for the development of ANNs was, as explained above, the simulation of biological neurons. Biologists have not given up on this aim, and are enlisting the new technology of DNNs as their helper. Beyond simulating brain networks and giving insights into the biological process of learning, though, DNNs are also used in biology in many other ways.
For example, DNNs have been used to decipher the gene expression networks in real organisms, given the presence of metabolites and/or messenger RNAs at given moments. They have also been used to predict the target molecules of drugs, or even to predict which substances would be toxic, given just their chemical structure.
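A highly simplified sketch of the toxicity-prediction idea follows; the molecular ‘fingerprint’ vectors and labels are random stand-ins, so this shows only the shape of such a model, not a validated one:

```python
# Toy sketch: molecules encoded as fixed-length fingerprint bit-vectors
# (random stand-ins here) and a small DNN trained to predict toxic / non-toxic.
import torch
import torch.nn as nn

fingerprints = torch.randint(0, 2, (500, 1024)).float()   # 500 stand-in molecules
toxicity = torch.randint(0, 2, (500, 1)).float()          # stand-in labels

net = nn.Sequential(nn.Linear(1024, 128), nn.ReLU(),
                    nn.Linear(128, 1), nn.Sigmoid())
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

for _ in range(10):                                        # a few training passes
    loss = nn.functional.binary_cross_entropy(net(fingerprints), toxicity)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```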
Last, but not least, DNNs have started to be introduced to help resolve physics questions. The vast data volumes produced by modern high-energy physics detectors and astronomical observatories are an obvious target for automated analysis.
Because of this, physicists were among the first enthusiastic users of the ANNs of yore, and they have recently embraced DNNs as well. Besides helping with data analysis in sub-disciplines with enormous data-production rates, DNNs have recently started selecting viable models from among alternatives, given the raw data, in a way substituting for the physicists themselves!
A common criticism of the ever-greater penetration of DL into all aspects of academic and commercial life is that the machines are usually ‘black boxes’. Indeed, despite the efforts of AI researchers, the exact process by which a DNN makes its ‘decisions’ is not only hidden from the average user but also hard to explain even for an expert.
Indeed, due to the non-linearity of the transfer functions operating between the nodes of a DNN, and the possibly millions or tens of millions of weights adjusted in every step of training, it is virtually impossible to explain in simple terms the ‘reasons’ behind a ‘decision’.
Those who counter this criticism assert that the same is true for most of our own, human, everyday decisions: usually only in hindsight, and with a lot of rationalisation, can we ‘explain’ our actions and decision process to others and justify them. However, as DNNs start controlling important aspects of human life (for example, bail or loan decisions), it is vital to keep a clear chain of liability and responsibility.
If, for example, a DNN-based decision leads to a fatal mistake, will the user, the programmer or the ‘trainer’ be liable for damages? The problem is exacerbated in possible uses of DNNs in, for example, selecting targets for weapons. These ethical issues are still unresolved, and they are only one aspect of the new problems that the use of AI in general creates for society.
Since DNNs are essentially computer programs, increasingly running on specialized hardware, they are of course prone to malicious use and hacking. Researchers have already seen unexplained problematic behaviour in image classifiers, for example, where nonsense images are classified with high confidence into real-world object categories.
Others have also managed to confuse DNNs by modifying images (e.g. adding optical “noise”) by an amount undetectable to the human eye. The difficulty of explaining the ‘reasoning’ behind a DNN’s decision delays the debugging of these issues; in other words, nobody is sure why DNNs behave this way.
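For the curious, the sketch below shows the flavour of one such attack (the fast gradient sign method): every pixel is nudged slightly in the direction that most increases the classifier’s loss. The `model` argument is assumed to be any differentiable image classifier, such as the pretrained one sketched earlier:

```python
# Fast-gradient-sign-style perturbation: a tiny, human-imperceptible change
# to each pixel, chosen to increase the model's loss as much as possible.
import torch
import torch.nn.functional as F

def fgsm_perturb(model, image, true_label, epsilon=0.01):
    """Return a slightly perturbed copy of `image` that the model is more
    likely to misclassify. `image` is a batched tensor, `true_label` a tensor
    of class indices."""
    image = image.clone().requires_grad_(True)
    loss = F.cross_entropy(model(image), true_label)
    loss.backward()
    adversarial = image + epsilon * image.grad.sign()  # step each pixel by +/- epsilon
    return adversarial.detach()
```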
People have already set DNNs against each other to accelerate or enhance this adversarial process. In the setup called Generative Adversarial Networks, one DNN (the ‘generator’) learns to produce inputs that cause the biggest possible ‘confusion’ in its adversary (the ‘discriminator’), i.e. to make it commit the grossest misclassification. This partially sidesteps the ‘black box’ problem (no human intermediary who has to understand the network is needed at all), but, of course, used in the wrong way it could damage infrastructure, affect the lives of many and aid criminality.
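A minimal, toy sketch of this adversarial setup in PyTorch is given below; the network sizes and the ‘real’ data are placeholders chosen only to keep the example short:

```python
# Minimal GAN sketch: a generator learns to produce samples that a
# discriminator cannot tell apart from (toy) real data.
import torch
import torch.nn as nn

generator = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))
discriminator = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
bce = nn.BCELoss()

real_data = torch.randn(128, 2) + 3.0  # stand-in 'real' samples

for _ in range(100):
    # Discriminator step: label real samples 1, generated samples 0.
    fake = generator(torch.randn(128, 16)).detach()
    d_loss = (bce(discriminator(real_data), torch.ones(128, 1)) +
              bce(discriminator(fake), torch.zeros(128, 1)))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator step: try to make the discriminator call the fakes 'real'.
    fake = generator(torch.randn(128, 16))
    g_loss = bce(discriminator(fake), torch.ones(128, 1))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
```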
In general, DNNs cannot escape the societal issues that come with any new technology; as is often said, technology is neither good nor bad – it is only as good or as bad as its human users. As we delegate more and more tasks to the welcome automation and extensive capabilities of DNNs, we should strengthen our safeguards against their misuse, as we do for any other technology.
As outlined above, Deep Learning already has numerous applications; it has even invaded our everyday life, hidden inside many applications and gadgets. The truth is that, for now, claims for Deep Learning’s usefulness are even more numerous and varied than the real applications. Searching the contents of a random scientific journal, in almost any discipline, will reveal at least one article in which the authors have used, or propose to use, Deep Learning techniques. However, the most exciting prospects for the use of Deep Learning rest in the combination of the technique with two other subfields of AI.
First, DL nowadays is mostly used for ‘supervised’ learning tasks. These are tasks where the correct ‘answer’ (classification or solution) is provided by an (almost always human) supervisor. Nevertheless, kids – and animals – do not learn things by memorizing long lists of objects and their categories provided by adults.
They instead explore the world unsupervised and make their own inferences and generalizations as they go. Creating a DL system that can not only be fed ‘raw’ data but also knows where to look for it is an exciting prospect.
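One established way to learn from unlabelled data, shown here only as a simplified illustration of the idea, is an autoencoder: the network is asked merely to reconstruct its input, so its compressed middle layer must discover structure in the ‘raw’ data on its own. Sizes and data below are toy placeholders.

```python
# Learning without labels: an autoencoder reconstructs its input, forcing the
# narrow middle layer to find a useful compressed representation by itself.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 32), nn.ReLU())
decoder = nn.Sequential(nn.Linear(32, 784), nn.Sigmoid())
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

unlabelled_images = torch.rand(256, 784)  # stand-in unlabelled data

for _ in range(20):
    reconstruction = decoder(encoder(unlabelled_images))
    loss = nn.functional.mse_loss(reconstruction, unlabelled_images)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```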
These new combined systems are expected to play a significant role in two critical areas. One is computer vision, where DL-based systems will quickly analyse the critical part of a presented scene while deciding by themselves what is essential. The other is natural language processing, where the analysis and understanding of texts and speech, as well as translation, will be much improved by systems that recognize each word’s role in a text in a particular language. This would undoubtedly help areas where current speech recognition and translation systems severely lag, like humour, irony and sentiment recognition.
Finally, the dream of all AI engineers, a true intelligence, will probably have a big DL-based part combined with an as-yet-unknown complex reasoning system. The reasoning part will make the decisions for action, based on the analysed information that the DL-based part will provide after capturing it from the surrounding environment. That would indeed be an Artificial Intelligence worthy of the name!