Attention in Deep Learning

In deep learning, backpropagation constrains learning to be smooth, because it relies on differentiation. Deep learning models can learn hierarchical representations automatically from the data; the whole process works by incrementally tuning the weights of a neural network with respect to the loss function of interest. Attention is a mechanism that introduces a dictionary-like structure into the neural network learning process.

Bidirectional Encoder Representations from Transformers (BERT)

BERT is a variation of the original transformer model. Its main feature is that it aims to be easily transferable between tasks in different domains, with no architectural changes and minimal fine-tuning.

Transformers

Since the breakthrough paper, Transformers have been achieving state-of-the-art results in many tasks. Here I explain in simple terms what a transformer is.

Attention Based Mechanisms

Attention is essentially a set of weights that depend on the input. These weights can depend on the input to the input layer (self-attention) or on input that is the output of a hidden layer (attention). Attention mechanisms are very powerful because we don't have to bias the network architecture toward the way we think the network should perform the task. For example, CNNs are network architectures specialized to process 2D or 3D data: the convolution operations process the input in a specific way, and the neurons see the input through their corresponding receptive fields. All of this machinery is architectural bias, built in because experts think the network will perform better if it processes the input this way. But what if there is a better way to process 2D or 3D inputs? What if we let the model learn by itself how to process the inputs for a given task? Enter the attention layers.
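To make "weights that depend on the input" concrete, here is a minimal sketch assuming scaled dot-product self-attention, which is one common form and not necessarily the one the post has in mind. To keep it short, the queries, keys and values are the raw input vectors themselves (real attention layers usually apply learned projections first), and the names `softmax`, `dot`, `self_attention` and `tokens` are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(tokens):
    """tokens: list of equal-length vectors. Returns (outputs, weights).

    Assumption: queries, keys and values are the inputs themselves,
    with no learned projection matrices.
    """
    d = len(tokens[0])
    outputs, all_weights = [], []
    for query in tokens:
        # The weights are computed from the input, not stored as parameters.
        scores = [dot(query, key) / math.sqrt(d) for key in tokens]
        weights = softmax(scores)
        # Each output is a weighted average of all input vectors.
        mixed = [sum(w * tok[i] for w, tok in zip(weights, tokens))
                 for i in range(d)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Changing `tokens` changes the printed weight matrix, which is the point: unlike a convolution kernel, the mixing pattern is not fixed by the architecture.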

Adversarial Attacks


Machine Learning Tokyo Meetup

I have been living in Tokyo for four months now, and I'm currently doing machine learning consultancy for a Japanese company. I was very curious about the software development scene here in Tokyo, especially machine learning. Japan is well known worldwide for robotics and hardware innovation, and I was wondering if it is just as strong in modern AI. So I started searching for machine learning meetups and came across a group called the Deep learning Otemachi group. After work, I decided to go and check out their talks at the Marunouchi Building near Tokyo station.

Detailed Guide to AWS CodeDeploy Installation

Because I found the AWS CodeDeploy documentation kind of hard to grasp, here is a quick reference if you want to set up a complete CodeDeploy solution. Create a bucket with the policy from bucket-policy.txt, and be sure to change the principal to your account.

Convolutional neural net for teeth detection

In this blog post, you will learn how to create a complete machine learning pipeline that tells whether or not a person in a picture is showing their teeth. We will see the main challenges this problem poses and tackle some common problems that arise along the way. By combining OpenCV libraries for face detection with our own convolutional neural network for teeth recognition, we will create a very capable system that can handle unseen data without losing significant performance. For quick prototyping we are going to use the Caffe deep learning framework, but you can use other cool frameworks like TensorFlow or Keras.

Neural Language Model with Caffe

In this blog post, I will explain how you can implement a neural language model in Caffe using Bengio's Neural Model architecture and Hinton's Coursera Octave code. This is just a practical exercise I did to see if it was possible to model this problem in Caffe. A neural language model predicts the next word given a set of previous words, and the predicted word has to relate to the preceding context.

Stochastic Gradient Descent recommendations

If your training data is highly redundant and you compute the gradient on one half of it and then on the other half, you will get almost the same answer from both. It is therefore better to compute the gradient on the first half, update the weights, and then compute the gradient on the second half. Typically we use a mini-batch size of 10, 100 or 1000 examples. So we can conclude that, for large and redundant datasets, it is usually not optimal to compute the gradient for your weights on the entire training set.
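The "compute on a batch, update, then move to the next batch" idea can be sketched as follows. This is a minimal illustration under assumptions not in the post: a one-variable linear model fitted with squared error, noise-free toy data, and hypothetical names and hyperparameters (`minibatch_sgd`, `batch_size`, `lr`, `epochs`).

```python
import random

def minibatch_sgd(xs, ys, batch_size=10, lr=0.1, epochs=200, seed=0):
    """Fit y ~ w * x + b by mini-batch SGD on mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error over this mini-batch only.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            # Update right away, so the next batch already sees better weights.
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Toy data generated from y = 3x + 1 (an assumption for illustration).
xs = [i / 100 for i in range(100)]
ys = [3.0 * x + 1.0 for x in xs]
w, b = minibatch_sgd(xs, ys)
print(round(w, 3), round(b, 3))
```

With 100 examples and a batch size of 10, each pass over the data performs ten weight updates instead of the single update a full-batch gradient would give.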

The Perceptron

In machine learning, the first version of a neural network is the perceptron, and it is quite important to understand it in order to grasp concepts that come up again and again, even in the latest ML topics. The perceptron is a network that receives a fixed number of numeric inputs. Those inputs are connected to a special neuron by a set of synapses, or weights. The neuron calculates the weighted sum of the inputs and then applies an activation function, in this case a binary one.
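The weighted sum followed by a binary activation can be sketched in a few lines. The bias term and the hand-picked weights below are assumptions made for illustration (they make the neuron behave like a logical AND), and the function name `perceptron` is hypothetical.

```python
def perceptron(inputs, weights, bias):
    """Weighted sum of the inputs followed by a binary step activation."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if weighted_sum > 0 else 0

# Hand-picked (not learned) parameters that implement a logical AND.
and_weights, and_bias = [1.0, 1.0], -1.5

outputs = [perceptron([a, b], and_weights, and_bias)
           for a in (0, 1) for b in (0, 1)]
print(outputs)  # [0, 0, 0, 1]
```

Only the input pair (1, 1) pushes the weighted sum above zero, so it is the only one that fires the neuron.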

Detailed Guide to AWS CodeDeploy Installation

Although it is not machine learning related, it can help you with your EC2 environment if you are planning to use AWS ML. This setup was very handy for speeding up my Python deployments to EC2 instances.

Gradient Descent

What is Gradient Descent

Gradient descent is a very useful algorithm in machine learning and mathematics because it allows you to find a local minimum of a function (or a local maximum, if you follow the gradient uphill instead). Remember that a function can be plotted as a curve or a surface, depending on the number of variables it has. This surface can be imagined as a landscape with mountains going up and holes going down, where each point is the result of the function evaluated at specific values of the independent variables. The goal is to find the highest or the lowest point of that landscape, which is the problem of finding the global maximum or global minimum. Gradient descent does not guarantee the global optimum; it helps us find a local one. In practice, a local minimum is often good enough to tackle real-world machine learning problems.
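The difference between a local and a global minimum is easy to see on a small example. The sketch below is an illustration under assumptions that are not from the post: the function f(x) = x^4 - 3x^2 + x, the step size and the starting points are all chosen by hand, and the names `f`, `f_prime` and `gradient_descent` are hypothetical.

```python
def f(x):
    """A one-variable function with two separate valleys."""
    return x**4 - 3 * x**2 + x

def f_prime(x):
    """Derivative of f, computed by hand."""
    return 4 * x**3 - 6 * x + 1

def gradient_descent(start, lr=0.01, steps=1000):
    x = start
    for _ in range(steps):
        x -= lr * f_prime(x)  # step against the slope, i.e. downhill
    return x

left = gradient_descent(start=-2.0)
right = gradient_descent(start=2.0)
print(round(left, 3), round(f(left), 3))
print(round(right, 3), round(f(right), 3))
```

Both runs stop at a point where the slope is zero, but they end up in different valleys: the run started on the left reaches the lower (global) minimum, while the run started on the right settles in a shallower local one.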

Classification vs Regression comparison

Our primary goal in machine learning is to predict a value based on input values. Depending on the desired output, we need to choose the correct cost function and the correct neurons to represent the final data correctly. These output values can have different forms of representation:

When data is not enough, Synthetic Data

A common problem in machine learning is the amount of data that we have; we cannot always have gigabytes of information in our databases. Data is the most precious resource in machine learning, and it is understandable that it is often scarce for our problems. Those who own most of the data are the kings of the machine learning world. For example, Google has been gathering our data for years with only one objective in mind: to mine this data using sophisticated statistical or machine learning algorithms.

An easy introduction to Machine Learning part 2

Generalization

When you train a machine learning algorithm, the input data changes the internal values of the model. You can train a model on your data so well that it produces an output function that reproduces the training set exactly. This can be risky, because your real goal is to identify the hidden rule that describes your data. If your model can deduce this hidden rule, it will behave very well on unseen data. The capability of the model to deduce this hidden rule is what is called generalization: when a model was trained only on the training set but learned the rules that describe the data, we say the model generalizes well.

Prediction vs Inference in Machine Learning

In machine learning, sometimes we need to know the relationships within the data: we need to know whether some predictors or features are correlated with the output value. On the other hand, sometimes we don’t care about this type of dependency and only want to predict a correct value. Here we are talking about inference vs prediction.

Machine Learning and Visual Pattern Recognition

Every single moment our brain is exposed to a vast amount of information in different forms: light intensities, sounds, touch sensations, smells and a gazillion others. It turns out that our brain does an outstanding job of getting familiar with all the new information that arrives every millisecond. Thanks to our memory and pattern recognition abilities, we can somehow understand and remember abstract concepts from previous experiences of the real world.

An easy introduction to Machine Learning part 1

Maybe you have already heard about machine learning and some of the amazing things it can do. In this series of posts I’m going to explain in a very easy way what machine learning is and why it is so important for the future of technology. Machine learning covers a lot of topics and has many branches, but I like to define it as a set of techniques and specialized algorithms that generate mathematical functions from pure data. The outputs of these generated functions are of special interest to us because they can give very good estimates about new data based on previous knowledge. In other words, the generated function will try to describe the data no matter how it is structured.