  • Danny Gallagher

Machine Learning in Neuroscience: A Primer

Machine learning has exploded into the neuroscience community over the past few years. Everyone wants to talk about it, if only to use the words “machine learning” or “artificial intelligence” (a far broader category that is often incorrectly used interchangeably with machine learning). Machine learning is far more than a buzzword: it allows us to analyze rich datasets in exciting new ways. However, as with any new technology or methodology in our field, it is important to distinguish between capability and hype.

The basics: What is machine learning?

To help with this, I am here to give a brief primer on machine learning in neuroscience. My goal is to introduce you to the basic concepts of machine learning, so there will be no specific algorithms or hieroglyphic-looking formulas. Hopefully, by the end of this post, you will have a working knowledge of what machine learning is, how it is used in the neuroscience community, and what precautions need to be taken when using it. So, let’s make like our computers and get learning!

Traditionally, programmers would give their programs specific sets of rules which would determine how they acted. The programmer would have to anticipate various conditions which might occur and set the program to act accordingly. So, if your program trades stocks, you may tell it to buy a stock if it drops to a certain percentage of its year-long average, and to sell if it rises in the same way. A real stock-trading program may include hundreds more parameters, with various combinations all leading the program to take specific, predetermined action. There are obvious difficulties with this. First, it is time-consuming for a programmer to try to predict every combination of parameters and determine how a program should act in each situation. Second, what if the programmer doesn’t know what to do in the first place? If the programmer does not know when to buy and sell stocks, the program will only make the same mistakes.

Enter machine learning. At its core, this is the idea that we can give computers data and let them learn from it, ultimately using it to make predictions. Though this sounds simple, it’s a huge advancement in computing. Now, instead of defining each way the program acts, the programmer feeds it data and gives input on how it should learn. The stock-trading program, for instance, may be given a decade’s worth of data on hundreds of different stocks and try to classify which patterns lead to success and which lead to failure. The program can then use those patterns to decide when and what to buy and sell.
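To make the contrast concrete, here is a minimal sketch in Python. Everything in it is invented for illustration: the function names, the 0.8 cutoff, and the toy price history are assumptions, and a real trading program would be far more involved. The first function is a hand-written rule; the second chooses its cutoff from labelled past examples.

```python
# Hand-written rule: the programmer picks the threshold in advance.
def rule_based_decision(price, yearly_average, buy_below=0.8):
    """Buy when the price falls below a fixed fraction of the yearly average."""
    return "buy" if price < buy_below * yearly_average else "hold"


# Learned rule: the threshold is chosen from labelled historical examples.
def learn_threshold(ratios, outcomes):
    """Pick the price/average ratio cutoff that best separates past outcomes.

    ratios   -- price divided by yearly average at each past decision point
    outcomes -- True if buying at that moment turned out well, else False
    """
    best_cutoff, best_correct = None, -1
    for cutoff in sorted(set(ratios)):
        # Predict "good buy" whenever the ratio is at or below the cutoff.
        correct = sum((r <= cutoff) == good for r, good in zip(ratios, outcomes))
        if correct > best_correct:
            best_cutoff, best_correct = cutoff, correct
    return best_cutoff


# Invented toy history, for illustration only.
past_ratios = [0.60, 0.70, 0.75, 0.90, 1.00, 1.10, 1.20]
past_outcomes = [True, True, True, False, False, False, False]

learned_cutoff = learn_threshold(past_ratios, past_outcomes)
```

In the first function the programmer has to know the right cutoff; in the second, the cutoff comes out of the data, which also means it can only be as good as that data.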

Computers that know us better than we know ourselves

In neuroscience, machine learning is typically used to classify states, particularly when using rich datasets consisting of multiple different properties, like electroencephalography (EEG) or combinations of peripheral nervous system measures (skin conductance (EDA/GSR), heart rate (ECG), etc.). The hope is that a sophisticated computer can uncover patterns which a human analyst otherwise could not. Scientists use this method frequently to examine states of cognitive load (how much stuff your brain is trying to process) (Kothe & Makeig, 2011; Stevens, Galloway, & Berka, 2006; Berka et al., 2007), and many are trying to apply it to the detection of various emotions (Soroush, Maghooli, Setarehdan & Nasrabadi, 2017).

To understand how this works, it helps to break the process down into simple steps. For context, I will use a composite of a rather common experimental archetype in which investigators want to classify emotions (see Heraz, Razaki & Frasson, 2007; Sohaib, Hagelbäck, Hilborn & Jerčić, 2013; Wang, Nie & Lu, 2014). In our case, they want to use EEG to determine when somebody is happy or not.

The steps I describe will not capture the full complexity of practical machine learning processes (there are many tiny but important decisions a machine-learning practitioner has to make throughout the process), but will provide a basic framework which may help you understand the pipeline.

A simplified version of a typical machine learning pipeline

Let’s break it down:

Feed the program data.

You have to give your program data from which to learn. This is referred to as training data. In our case, we want to classify happy states, so we need data from when our subjects are happy and when they are not. Many experiments try to do this by collecting physiological data (like EEG) while people look at happy or neutral stimuli (often pictures, videos, or audio). We can then tell the program which training data means “happy” and which means “not happy” so that it knows how to classify future data. Our definitions of “happy” and “not happy” are referred to as our ground truth, or what we tell the program is true. This will be discussed in more detail later.
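As a rough sketch of what labelled training data can look like in code, the snippet below pairs each trial's features with the label we assign from the stimulus shown. The `Trial` class, the `build_training_set` helper, and the numbers are all hypothetical stand-ins, not features from any real EEG recording.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trial:
    features: List[float]  # measurements from one recording segment (hypothetical)
    label: int             # ground truth we assign: 1 = "happy", 0 = "not happy"


def build_training_set(feature_rows, stimulus_types):
    """Pair each recording's features with a label derived from the stimulus shown."""
    return [
        Trial(features=list(row), label=1 if stimulus == "happy" else 0)
        for row, stimulus in zip(feature_rows, stimulus_types)
    ]


# Invented numbers standing in for features extracted from EEG.
feature_rows = [[0.9, 1.4], [0.2, 0.3], [1.1, 1.2], [0.1, 0.5]]
stimulus_types = ["happy", "neutral", "happy", "neutral"]

training_set = build_training_set(feature_rows, stimulus_types)
```

Note that the label comes from the stimulus, not from the participant: the program is told "happy" because a happy picture was on screen, which is exactly the assumption a ground truth rests on.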

Create a model to fit the data.

Next, the program tries to extract patterns from the data. The hope is that our model represents a general pattern of physiological “happy” and “not happy” which we can then try to fit to future data, allowing us to later determine one’s emotional states. There are many ways of going about this, but describing them specifically is beyond the scope of this post (for a relatively easy read about common algorithms, try this; for a more adventurous review that pertains specifically to neuroscience, try this).

An example of what a simple, regression-based model might look like. The curve does its best to differentiate between the two datasets (represented by different colors). This model may then be used on new datasets, with points falling below being classified as orange and points above as blue. Note, due to outliers, the classification is not 100% accurate.
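For readers who want to see one possible version of a regression-based classifier, here is a minimal sketch of logistic regression trained by gradient descent, using only the Python standard library. It is one common choice, not necessarily the model pictured above, and the data points, learning rate, and number of passes are all invented for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_regression(points, labels, learning_rate=0.1, epochs=10000):
    """Fit weights and a bias by batch gradient descent on the log loss."""
    weights = [0.0] * len(points[0])
    bias = 0.0
    n = len(points)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(points, labels):
            # Difference between the predicted probability and the true label.
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi / n
            grad_b += error / n
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias


def predict(weights, bias, x):
    """Return 1 ("happy") if the modelled probability is at least 0.5, else 0."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if sigmoid(z) >= 0.5 else 0


# Invented, well-separated toy data: two features per trial.
points = [[1.0, 1.0], [1.5, 0.5], [0.5, 1.5], [1.0, 0.0],
          [3.0, 3.0], [3.5, 2.5], [2.5, 3.5], [4.0, 3.0]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

weights, bias = fit_logistic_regression(points, labels)
training_predictions = [predict(weights, bias, x) for x in points]
```

The fitted weights and bias define a straight boundary between the two groups. Real physiological data is rarely this cleanly separated, which is why the later steps matter.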

Test and refine our model.

In practice, our first attempt at building a model may not accurately fit our data. For instance, maybe it only correctly classifies data from our training set 50% of the time (this percentage is often referred to as a model’s “classification accuracy” or “prediction accuracy”). In this case, we will want to refine our model, either by tweaking what we have or using an entirely different algorithm. By doing this over and over, we can hopefully create a model that can give us an accurate classification.
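Classification accuracy itself is simple to compute: it is the share of trials where the model's label matches the ground truth. A minimal sketch, with an invented function name and invented labels:

```python
def classification_accuracy(predicted_labels, true_labels):
    """Fraction of trials where the model's label matches the ground truth."""
    if len(predicted_labels) != len(true_labels):
        raise ValueError("Both label lists must have the same length.")
    matches = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return matches / len(true_labels)


# Invented example: the model gets 3 of 4 trials right.
predicted = ["happy", "not happy", "happy", "happy"]
actual = ["happy", "not happy", "not happy", "happy"]
accuracy = classification_accuracy(predicted, actual)
```

Keep in mind that with two equally frequent classes, 50% is what pure guessing would achieve, so a model at that level has learned nothing useful.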

Validate our model.

Perhaps we’ve reached the point where our model correctly classifies our training data a high percentage of the time. However, our goal is not to classify our training data, but to classify future data, when we don’t know the “answer.” Thus, we need to see if our model generalizes to other “unseen” data. To do this, practitioners often split the original data, training the model on one portion and validating it on another, held-out portion known as the test set (for more detail on how this is done in modern processes, see here). If the model does not accurately predict the test set, we need to rework it or try a new one.
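The split described above can be sketched in a few lines. This is a simplified, assumed version (one random shuffle, one held-out portion); the function name, the 25% fraction, and the fixed seed are illustrative choices, and practical pipelines often use more elaborate schemes.

```python
import random


def split_train_test(trials, test_fraction=0.25, seed=0):
    """Shuffle the trials and hold out a fraction of them for validation."""
    shuffled = list(trials)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in trial identifiers; in practice each would carry features and a label.
all_trials = list(range(20))
train_trials, test_trials = split_train_test(all_trials)
```

The key property is that no trial appears in both portions, so the model is scored only on data it never saw during training.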

A model that fits the training data very well but does not fit unseen data is said to be “overfitting.” This is actually a very common machine learning problem. The model trains too much and becomes so optimized to the specifics of the training data that it now only represents that sample as opposed to the whole population. For instance, this model goes out of its way to work around specific data points, even though those points seem to be outliers and likely do not generalize well. For this reason, if someone tries to tell you that their model has a 98% prediction accuracy, make sure you ask about how it performed in validation tests!
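Overfitting is easy to reproduce. The sketch below is an assumed toy setup, not taken from any study: the features and labels are generated independently at random, so there is no real pattern to find. A nearest-neighbor classifier, which simply memorizes the training set, still scores perfectly on the data it was trained on while doing no better than a coin flip on new data.

```python
import random


def nearest_neighbor_label(train_points, train_labels, query):
    """Return the label of the training point closest to the query."""
    distances = [sum((a - b) ** 2 for a, b in zip(p, query)) for p in train_points]
    return train_labels[distances.index(min(distances))]


def accuracy(train_points, train_labels, points, labels):
    """Share of points whose nearest-neighbor prediction matches their label."""
    predictions = [nearest_neighbor_label(train_points, train_labels, p) for p in points]
    return sum(pred == y for pred, y in zip(predictions, labels)) / len(labels)


rng = random.Random(0)


def random_dataset(n):
    # Features and labels are drawn independently, so there is nothing real to learn.
    points = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(n)]
    labels = [rng.randint(0, 1) for _ in range(n)]
    return points, labels


train_points, train_labels = random_dataset(100)
test_points, test_labels = random_dataset(200)

# Each training point is its own nearest neighbor, so this comes out perfect.
train_accuracy = accuracy(train_points, train_labels, train_points, train_labels)
# On unseen points the memorized labels carry no information.
test_accuracy = accuracy(train_points, train_labels, test_points, test_labels)
```

A perfect training score on pure noise is the clearest sign that training accuracy alone tells you very little.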


After iterating through these steps enough times to create a good model, we can use it on future data to try to determine when people are happy in other situations.

In a field where different activity is often represented by incredibly complex patterns, the ability to have a machine learn these patterns for us is a massive step. While a human experimenter may be able to infer conclusions from a certain defined metric (like the amplitude of a waveform at a given time), a machine can infer from minute changes in many different metrics at once. Considering that some aspects of physiological responses may be shared across many different mental states, the ability to use multiple different aspects of your data at once is a huge advantage. Though the value of one parameter may potentially be indicative of a range of different conclusions, the combination of many may help us significantly narrow down that range.
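The narrowing-down idea can be illustrated with a deliberately crude sketch. The lookup table below is hypothetical: the measure names are placeholders and the states are labelled A to E precisely because they are not real physiological findings. A real classifier weighs evidence rather than crossing options off a list, but the logic of combining measures is similar.

```python
# Hypothetical lookup: which states are compatible with each observed measure.
# The state names are placeholders, not real physiological findings.
compatible_states = {
    "heart_rate_high": {"state_A", "state_B", "state_C"},
    "skin_conductance_high": {"state_B", "state_C", "state_D"},
    "eeg_feature_high": {"state_C", "state_E"},
}


def states_consistent_with(observations):
    """Return the states compatible with every observed measure."""
    candidate_sets = [compatible_states[name] for name in observations]
    return set.intersection(*candidate_sets)


# One measure alone leaves three candidates.
one_measure = states_consistent_with(["heart_rate_high"])
# All three measures together leave only one.
all_measures = states_consistent_with(
    ["heart_rate_high", "skin_conductance_high", "eeg_feature_high"]
)
```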

If only it were this easy… ©momius [Adobe Stock]

Machine learning is also used frequently for real-time applications, such as brain-computer interfaces that use physiological feedback to adjust in the moment. One example is an interface that adapts to someone’s cognitive load so as not to overburden the user with information. Instead of needing an analyst to monitor the data and tell a program what to do, the program decides for itself.

So there you go. All we have to do is give our computers the data and let them decide what means what. Problem solved. Great article.

As you could probably guess, it’s not quite that simple.

Glitches in the system: the problems with machine learning

As I mentioned, machine learning represents an incredible advancement in data science and can be a valuable tool in a neuroscientist’s kit. However, it is by no means perfect. Any academic will say that you can’t simply give the computer your data and expect an accurate result (contrary to what some marketers may suggest). The problem with artificial intelligence in this case is that its “intelligence” is artificial. It doesn’t know anything about what you are measuring; it’s just finding the best pattern it can to describe a series of numbers. A common phrase in the field is “garbage in, garbage out,” meaning that your output is only as good as the underlying data behind it. So, in neuroscience specifically, what adds “garbage” into our conclusions?

Noisy data.

If you have ever worked with physiological data, you don’t need me to explain the problem of noise. “Noise” refers to factors that you are not interested in measuring but that affect the data anyway. These factors distort the “signal” in which you are actually interested. In physiology, noise is commonplace in essentially every measurement. Whether it be random electrical noise or artifact (data arising from unwanted influences) caused by a participant moving too much during the experiment, the scientist will have to deal with a lot of “garbage” on top of the actual signal of interest. A great hope of machine learning is to cut through random noise to find a true underlying pattern, but again, a computer will not know which data is signal and which is noise. Thus, the noisier your input data, the noisier your output.

This is especially true when the noise is not simply random, but introduces bias. For instance, going back to our happiness experiment, if some of the “happy” stimuli cause participants to laugh, the electrical activity caused simply by the movement of laughter will influence the EEG data. If that same movement is not present when the participants view the neutral stimuli, we now have noise that is specifically biased to one condition. This bias means our machine learning program may be more likely to interpret the noise as part of the pattern of interest. When we then go to apply our model to future studies, it may be more likely to call someone “happy” simply because they are moving around.
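The laughter problem can be shown with a toy simulation. All of the numbers here are invented: a small emotion effect, a much larger movement artifact, and a classifier that just learns a cutoff halfway between the two class averages. Because movement only occurs in the "happy" training trials, the cutoff ends up tracking movement rather than emotion.

```python
from statistics import mean


def simulated_amplitude(is_happy, is_moving):
    """Toy EEG feature: baseline + small emotion effect + large movement artifact."""
    emotion_effect = 1.0 if is_happy else 0.0       # signal of interest (invented size)
    movement_artifact = 20.0 if is_moving else 0.0  # e.g. laughter (invented size)
    return 10.0 + emotion_effect + movement_artifact


# Training data: participants laugh (move) only while viewing happy stimuli.
happy_training = [simulated_amplitude(True, True) for _ in range(5)]
neutral_training = [simulated_amplitude(False, False) for _ in range(5)]

# "Learn" a cutoff halfway between the two class averages.
cutoff = (mean(happy_training) + mean(neutral_training)) / 2


def classify(amplitude):
    return "happy" if amplitude > cutoff else "not happy"


# New data where movement and emotion no longer go together.
fidgeting_but_neutral = classify(simulated_amplitude(False, True))
happy_but_still = classify(simulated_amplitude(True, False))
```

In this setup the model would score perfectly on its own training data, yet it labels a fidgeting neutral participant as happy and a calm happy participant as not happy.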

Thus, as with any experiment, it is imperative to analyze the cleanest data possible. The easiest way to achieve this is through good experimental design and solid technical recording practices. Basically, record as little noise as possible in the first place. After that, you can apply filters and reject/correct “dirty” segments of data. Just know that these processes will, by nature, distort your data, and so must be used with finesse. It is also important to know that these procedures work better after the fact, as opposed to in real-time, which is why I always encourage people to analyze the results after all the data has been collected if real-time monitoring is not imperative.
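As a rough illustration of the two cleanup steps mentioned above, here is a sketch of a very simple filter (a moving average) and a very simple rejection rule (drop any segment whose amplitude exceeds a limit). Both are stand-ins chosen for brevity; the function names, the window size, and the 100-unit limit are assumptions, and real pipelines typically use more careful methods.

```python
def moving_average(signal, window=3):
    """Smooth a signal with a centered moving average (edges use a shorter window)."""
    half = window // 2
    smoothed = []
    for i in range(len(signal)):
        segment = signal[max(0, i - half): i + half + 1]
        smoothed.append(sum(segment) / len(segment))
    return smoothed


def reject_noisy_segments(segments, max_abs_amplitude=100.0):
    """Keep only segments whose largest absolute value stays within the limit."""
    return [s for s in segments if max(abs(v) for v in s) <= max_abs_amplitude]


# Invented samples: filtering flattens the peaks along with the noise.
raw = [0, 3, 0, 3, 0]
smoothed = moving_average(raw)

# Invented segments: the second contains a large artifact and is dropped.
segments = [[1, -2, 3], [5, 250, -4]]
clean_segments = reject_noisy_segments(segments)
```

Notice that the smoothed values never reach the original peaks of 3. That is the distortion the text warns about: a filter cannot tell a real peak from a noisy one, so it changes both.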

At the end of the day, it is important to remember that machine learning is not a substitute for collecting good, clean data. Even if your model’s prediction accuracy is very high, it is not useful if it’s predicting a biased dataset. This brings me to my next point.

Ground “truth.”

As mentioned, using machine learning to classify things requires the use of a “ground truth,” where we tell the machine what the training data represents. This is referred to as supervised learning, since we are supervising the way the program trains by telling it what certain datasets mean. You can also create an unsupervised program without a ground truth if you simply want to cluster data or make associations, but this is less utilized in neuroscience due to the interest in classifying mental states.
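For contrast with supervised learning, here is a minimal sketch of an unsupervised approach: a one-dimensional, two-group version of k-means clustering. No labels are supplied; the program only groups similar values together. The function name, the starting centers, and the data are invented for illustration.

```python
def two_means_1d(values, iterations=10):
    """Group 1-D values into two clusters by repeatedly updating two centers."""
    centers = [min(values), max(values)]  # simple deterministic starting guess
    labels = []
    for _ in range(iterations):
        # Assign each value to its nearest center.
        labels = [0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
                  for v in values]
        # Move each center to the average of the values assigned to it.
        for k in (0, 1):
            members = [v for v, lab in zip(values, labels) if lab == k]
            if members:
                centers[k] = sum(members) / len(members)
    return labels, centers


# Invented measurements that happen to fall into two groups.
values = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
cluster_labels, cluster_centers = two_means_1d(values)
```

The output says which values belong together, but not what either group means. Deciding that cluster 1 is "happy" would still require outside knowledge, which is why classification studies lean on supervised methods.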

A perfect ground truth is objective. We know that dataset A represents B. However, this is rarely the case in the real world, especially when dealing with something as complex as the brain.

Let’s go back to our happiness indicator. As is common in these types of studies (Heraz, Razaki & Frasson, 2007; Sohaib, Hagelbäck, Hilborn & Jerčić, 2013), let’s say we determined “happiness” by showing participants “happy” pictures from a validated database.

Happy (left) and neutral (right) pictures, taken from the Open Affective Standardized Image Set (OASIS) database (Kurdi, Lozano, & Banaji, 2017)

The first question we must ask is “did these pictures actually make the participant happy?” We can (and should) assess this by simply asking them to rate how positive or negative they feel after each picture. Of course, there is always the possibility that our respondent is lying, perhaps because they don’t want us to think of them as some monster who hates puppies in teacups.
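One simple way to use such ratings is to keep only the trials where the participant's report agrees with the label we intended. The sketch below assumes a 1 to 7 rating scale with 4 as the neutral midpoint; the scale, the dictionary keys, and the ratings themselves are all hypothetical.

```python
def keep_confirmed_trials(trials, neutral_point=4):
    """Keep trials whose self-reported rating matches the intended label.

    Each trial is a dict with hypothetical keys:
      "intended" -- "happy" or "not happy", based on the stimulus shown
      "rating"   -- self-reported valence on an assumed 1-7 scale
    """
    confirmed = []
    for trial in trials:
        felt_positive = trial["rating"] > neutral_point
        if felt_positive == (trial["intended"] == "happy"):
            confirmed.append(trial)
    return confirmed


# Invented ratings for four trials.
trials = [
    {"intended": "happy", "rating": 6},
    {"intended": "happy", "rating": 3},      # picture did not land as intended
    {"intended": "not happy", "rating": 4},
    {"intended": "not happy", "rating": 7},  # "neutral" picture felt positive
]
confirmed_trials = keep_confirmed_trials(trials)
```

This kind of check only catches mismatches that participants actually report, so it does nothing about the dishonest-respondent problem.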

Even if we could reasonably determine that the pictures made our respondents happy, we don’t know what other reactions those same pictures instilled. Perhaps the pictures were all arousing, biasing our algorithm toward other arousing stimuli. Or perhaps the pictures triggered nostalgia, leading to additional activity that had nothing to do with emotion, but instead had to do with the process of retrieving memories.

In addition, it is hard to be sure which physiological reactions are stimulus-specific. What if our program picks up on responses that have little to do with emotion, and much more to do with our reactions to certain colors or certain physical features? We can mitigate this by “aiding” our program, deciding to include only those features of the data that we have some contextual reason to expect are related to our variable of interest. However, given the complexity of our bodies, and the fact that certain signals are often correlated with multiple different state