Machine Learning in Neuroscience: A Primer

Machine learning has exploded into the neuroscience community over the past few years. Everyone wants to talk about it, if only to use the words “machine learning” or “artificial intelligence” (a far broader category that is often incorrectly used interchangeably with machine learning). Machine learning is far more than a buzzword: it allows us to analyze rich datasets in exciting new ways. However, as with any new technology or methodology in our field, it is important to distinguish between capability and hype.

To help with this, I am here to give a brief primer on machine learning in neuroscience. My goal is to introduce you to the basic concepts of machine learning, so no specific algorithms or hieroglyphic-looking formulas. Hopefully, by the end of this post, you will have a working knowledge of what machine learning is, how it is used in the neuroscience community, and what precautions need to be taken when using it. So, let’s make like our computers and get learning!

The basics: What is machine learning?

Traditionally, programmers would give their programs specific sets of rules which would determine how they acted. The programmer would have to anticipate various conditions which might occur and set the program to act accordingly. So, if your program trades stocks, you may tell it to buy a stock if it dropped to a certain percentage of its year-long average, and to sell if it rose in the same way. A real stock-trading program may include hundreds more parameters, with various combinations all leading the program to take specific, predetermined action. There are obvious difficulties with this. For one, it is time consuming for a programmer to try to predict every combination of parameters and determine how a program should act in each situation. Secondly, what if the programmer doesn’t know what to do in the first place? If the programmer does not know when to buy and sell stocks, the program will only make the same mistakes.
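To make the rule-based approach concrete, here is a minimal sketch of the kind of hand-written logic described above. The function name and the 90%/110% thresholds are my own made-up assumptions for illustration, not values from any real trading system.

```python
def rule_based_decision(price, yearly_average, buy_ratio=0.90, sell_ratio=1.10):
    """Return 'buy', 'sell', or 'hold' using fixed thresholds chosen by the programmer."""
    if price <= yearly_average * buy_ratio:
        return "buy"   # price has dropped far enough below its year-long average
    if price >= yearly_average * sell_ratio:
        return "sell"  # price has risen far enough above its year-long average
    return "hold"      # every other situation the programmer anticipated

print(rule_based_decision(price=85.0, yearly_average=100.0))   # buy
print(rule_based_decision(price=100.0, yearly_average=100.0))  # hold
print(rule_based_decision(price=115.0, yearly_average=100.0))  # sell
```

Every threshold here is a guess the programmer had to make in advance, which is exactly the weakness discussed above.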

Enter machine learning. At its core, this is the idea that we can give computers data and let them learn from it, ultimately using it to make predictions. Though this sounds simple, it’s a huge advancement in computing. Now, instead of defining each way the program acts, the programmer feeds it data and gives input on how it should learn. The stock-trading program, for instance, may be given a decade’s worth of data on hundreds of different stocks and try to classify which patterns lead to success and which lead to failure. The program can then use those patterns to decide when and what to buy and sell.
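As a toy illustration of that shift, the sketch below lets the program pick its own buy threshold from labelled history instead of having one typed in. The data are invented, and a single learned cutoff is a deliberately tiny stand-in for what a real model would do.

```python
def learn_threshold(ratios, outcomes):
    """Pick the price/average ratio cutoff that best separates good buys from bad ones."""
    best_cutoff, best_correct = None, -1
    for cutoff in sorted(set(ratios)):
        # Predict "good buy" whenever the ratio is at or below the candidate cutoff.
        correct = sum((r <= cutoff) == good for r, good in zip(ratios, outcomes))
        if correct > best_correct:
            best_cutoff, best_correct = cutoff, correct
    return best_cutoff

# Hypothetical history: price divided by year-long average, and whether buying paid off.
history_ratios = [0.80, 0.85, 0.88, 0.95, 1.00, 1.05, 1.10]
history_good = [True, True, True, False, False, False, False]

cutoff = learn_threshold(history_ratios, history_good)
print(cutoff)  # 0.88
```

The programmer still decides how the program learns (here, by trying every cutoff), but the threshold itself comes from the data.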

Computers that know us better than we know ourselves

In neuroscience, machine learning is typically used to classify states, particularly when using rich datasets consisting of multiple different properties, like electroencephalogram (EEG) or combinations of peripheral nervous system measures (skin conductance (EDA/GSR), heart rate (ECG), etc.). The hope is that a sophisticated computer can uncover patterns which a human analyst otherwise could not. Scientists use this method frequently to examine states of cognitive load (how much stuff your brain is trying to process) (Kothe & Makeig, 2011; Stevens, Galloway, & Berka, 2006; Berka et al., 2007), and many are trying to apply it to the detection of various emotions (Soroush, Maghooli, Setarehdan & Nasrabadi, 2017).

To understand how this works, it helps to break the process down into simple steps. For context, I will use a composite of a rather common experimental archetype in which investigators want to classify emotions (see Heraz, Razaki & Frasson, 2007; Sohaib, Qureshi, Hagelbäck, Hilborn & Jerčić, 2013; Wang, Nie & Lu, 2014). In our case, they want to use EEG to determine whether somebody is happy or not.

The steps I describe will not capture the full complexity of practical machine learning processes (there are many tiny but important decisions a machine-learning practitioner has to make throughout the process), but will provide a basic framework which may help understand the pipeline.

A simplified version of a typical machine learning pipeline

Let’s break it down:

  1. Feed the program data. You have to give your program data from which to learn. This is referred to as training data. In our case, we want to classify happy states, so we need data from when our subjects are happy and when they are not. Many experiments try to do this by collecting physiological data (like EEG) while people look at happy or neutral stimuli (often pictures, videos, or audio). We can then tell the program which training data means “happy” and which means “not happy” so that it knows how to classify future data. Our definitions of “happy” and “not happy” are referred to as our ground truth, or what we tell the program is true. This will be discussed in more detail later.
  2. Create a model to fit the data. Next, the program tries to extract patterns from the data. The hope is that our model represents a general pattern of physiological “happy” and “not happy” which we can then try to fit to future data, allowing us to later determine one’s emotional states. There are many ways of going about this, but describing them specifically is beyond the scope of this post (for a relatively easy read about common algorithms, see Le, 2018; for a more adventurous review that pertains specifically to neuroscience, see Lemm, Blankertz, Dickhaus & Müller, 2011).
An example of what a simple, regression-based model might look like. The curve does its best to differentiate between the two datasets (represented by different colors). This model may then be used on new datasets, with points falling below being classified as orange and points above as blue. Note, due to outliers, the classification is not 100% accurate.
  3. Test and refine our model. In practice, our first attempt at building a model may not accurately fit our data. For instance, maybe it correctly classifies data from our training set only 50% of the time (this percentage is often referred to as a model’s “classification accuracy” or “prediction accuracy”). In this case, we will want to refine our model, either by tweaking what we have or by using an entirely different algorithm. By doing this over and over, we can hopefully create a model that gives us an accurate classification.
  4. Validate our model. Perhaps we’ve reached the point where our model correctly classifies our training data a high percentage of the time. However, our goal is not to classify our training data, but to classify future data, when we don’t know the “answer.” Thus, we need to see if our model generalizes to other “unseen” data. To do this, practitioners often split the original data, training the model on one portion and validating it on another (for more detail on how this is done in modern processes, see Gupta, 2017). If the model does not accurately predict the held-out portion, we need to rework it or try a new one.

    A model that fits the training data very well but does not fit unseen data is said to be “overfitting.” This is actually a very common machine learning problem. The model trains too much and becomes so optimized to the specifics of the training data that it now only represents that sample as opposed to the whole population. For instance, an overfit model goes out of its way to work around specific data points, even though those points seem to be outliers and likely do not generalize well. For this reason, if someone tries to tell you that their model has a 98% prediction accuracy, make sure you ask about how it performed in validation tests!
  5. Profit! After iterating through these steps enough times to create a good model, we can use it on future data to try to determine when people are happy in other situations.
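The steps above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the two “features” are random numbers standing in for real EEG measures, the “centroid” model is just one simple choice among many, and the “memorizer” is included only to show what overfitting looks like when training and validation accuracy are compared.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

def make_trial(happy):
    """One synthetic trial: two made-up features and its ground-truth label."""
    center = 1.5 if happy else 0.0
    return [rng.gauss(center, 1.0), rng.gauss(center, 1.0)], happy

def distance_sq(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

# Step 1: labelled data, split into a training part and a held-out validation part.
trials = [make_trial(i % 2 == 0) for i in range(400)]
rng.shuffle(trials)
train, validation = trials[:300], trials[300:]

# Step 2, model A: summarise each class by its average feature vector (centroid).
def fit_centroids(data):
    centroids = {}
    for label in (True, False):
        rows = [features for features, y in data if y == label]
        centroids[label] = [sum(column) / len(rows) for column in zip(*rows)]
    return centroids

def predict_centroid(centroids, features):
    # "Happy" if the trial lies closer to the happy centroid than to the other one.
    return distance_sq(features, centroids[True]) < distance_sq(features, centroids[False])

# Step 2, model B: memorise every training trial and copy the label of the closest one.
def predict_memorizer(memory, features):
    return min(memory, key=lambda item: distance_sq(features, item[0]))[1]

# Steps 3 and 4: measure accuracy on the training data and on the unseen data.
def accuracy(predict, model, data):
    return sum(predict(model, features) == y for features, y in data) / len(data)

centroids = fit_centroids(train)
centroid_train = accuracy(predict_centroid, centroids, train)
centroid_validation = accuracy(predict_centroid, centroids, validation)
memorizer_train = accuracy(predict_memorizer, train, train)
memorizer_validation = accuracy(predict_memorizer, train, validation)

print(f"centroid model: train {centroid_train:.2f}, validation {centroid_validation:.2f}")
print(f"memorizer:      train {memorizer_train:.2f}, validation {memorizer_validation:.2f}")
```

The memorizer scores perfectly on its own training data because each trial is its own nearest neighbor, yet it loses that edge on the held-out trials. That gap between training and validation accuracy is the warning sign to ask about.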

In a field where different activity is often represented by incredibly complex patterns, the ability to have a machine learn these patterns for us is a massive step. While a human experimenter may be able to infer conclusions from a certain defined metric (like the amplitude of a waveform at a given time), a machine can infer from minute changes in many different metrics at once. Considering that some aspects of physiological responses may be shared across many different mental states, the ability to use multiple different aspects of your data at once is a huge advantage. Though the value of one parameter may potentially be indicative of a range of different conclusions, the combination of many may help us significantly narrow down that range.

Machine learning is also used frequently for real-time applications, such as brain-computer interfaces that use physiological feedback to adjust in the moment. An example of this is an interface that adapts to someone’s cognitive load so as not to overburden the user with information. Instead of needing an analyst to monitor the data and tell a program what to do, the program decides for itself.

If only it were this easy…

So there you go. All we have to do is give our computers the data and let them decide what means what. Problem solved. Great article.

As you could probably guess, it’s not quite that simple.

Glitches in the system: the problems with machine learning

As I mentioned, machine learning represents an incredible advancement in data science and can be a valuable tool in a neuroscientist’s kit. However, it is by no means perfect. Any academic will say that you can’t simply give the computer your data and expect an accurate result (contrary to what some marketers may suggest). The problem with artificial intelligence in this case is that its “intelligence” is artificial. It doesn’t know anything about what you are measuring; it’s just finding the best pattern it can to describe a series of numbers. A common phrase in the field is “garbage in, garbage out,” meaning that your output is only as good as the underlying data behind it. So, in neuroscience specifically, what adds “garbage” into our conclusions?

Noisy data. If you have ever worked with physiological data, you don’t need me to explain the problem of noise. “Noise” refers to factors that you are not interested in measuring but that affect the data anyway. These factors distort the “signal” in which you are actually interested. In physiology, noise is commonplace in essentially every measurement. Whether it be random electrical noise or artifact (data arising from unwanted influences) caused by a participant moving too much during the experiment, the scientist will have to deal with a lot of “garbage” on top of the actual signal of interest. A great hope of machine learning is to cut through random noise to find a true underlying pattern, but again, a computer will not know which data are signal and which are noise. Thus, the noisier your input data, the noisier your output.

This is especially true when the noise is not simply random, but introduces bias. For instance, going back to our happiness experiment, if some of the “happy” stimuli cause participants to laugh, the electrical activity caused simply by the movement of laughter will influence the EEG data. If that same movement is not present when the participants view the neutral stimuli, we now have noise that is specifically biased to one condition. This bias means our machine learning program may be more likely to interpret the noise as part of the pattern of interest. When we then go to apply our model to future studies, it may be more likely to call someone “happy” simply because they are moving around.
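Here is a small sketch of that laughter problem, under loudly stated assumptions: both features are invented numbers, the “emotion” feature differs only slightly between conditions, and the “movement” feature is large only in the happy training trials. The nearest-centroid classifier is my own choice for brevity.

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is repeatable

def make_trial(happy):
    """Features: [weak 'emotion' signal, movement artifact]. All values are made up."""
    emotion = rng.gauss(0.5 if happy else 0.0, 1.0)
    # The confound: only the happy stimuli make participants laugh and move.
    movement = rng.gauss(5.0 if happy else 0.0, 1.0)
    return [emotion, movement], happy

training = [make_trial(i % 2 == 0) for i in range(200)]

def fit_centroids(data):
    """Average feature vector for the happy trials and for the neutral trials."""
    centroids = {}
    for label in (True, False):
        rows = [features for features, y in data if y == label]
        centroids[label] = [sum(column) / len(rows) for column in zip(*rows)]
    return centroids

def predict_happy(centroids, features):
    def distance_sq(center):
        return sum((p - q) ** 2 for p, q in zip(features, center))
    return distance_sq(centroids[True]) < distance_sq(centroids[False])

model = fit_centroids(training)

fidgeting_but_neutral = [0.0, 5.0]  # no emotion signal, lots of movement
still_but_happy = [0.5, 0.0]        # typical happy signal, no movement

print(predict_happy(model, fidgeting_but_neutral))  # True
print(predict_happy(model, still_but_happy))        # False
```

The model has effectively learned “moving means happy,” because movement was the easiest thing separating the two training conditions.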

Thus, as with any experiment, it is imperative to analyze the cleanest data possible. The easiest way to achieve this is through good experimental design and solid technical recording practices. Basically, record as little noise as possible in the first place. After that, you can apply filters and reject/correct “dirty” segments of data. Just know that these processes will, by nature, distort your data, and so must be used with finesse. It is also important to know that these procedures work better after the fact, as opposed to in real-time, which is why I always encourage people to analyze the results after all the data has been collected if real-time monitoring is not imperative.
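As one hedged example of rejecting “dirty” segments, the sketch below drops any epoch whose amplitude exceeds a fixed limit. The 100 microvolt limit and the tiny four-sample epochs are assumptions chosen for illustration; a real pipeline would tune the criterion to the recording and usually combine it with filtering.

```python
def reject_noisy_epochs(epochs, max_abs_microvolts=100.0):
    """Split epochs into clean ones and the indices of those rejected for large deflections."""
    clean, rejected = [], []
    for index, epoch in enumerate(epochs):
        if max(abs(sample) for sample in epoch) > max_abs_microvolts:
            rejected.append(index)  # likely artifact, e.g. the participant moved
        else:
            clean.append(epoch)
    return clean, rejected

# Made-up data: three short epochs of one channel, in microvolts.
epochs = [
    [12.0, -8.5, 20.1, -15.3],
    [10.2, 250.0, -310.4, 18.7],  # large deflection
    [-5.1, 7.9, -11.0, 3.3],
]

clean, rejected = reject_noisy_epochs(epochs)
print(len(clean), rejected)  # 2 [1]
```

Note that rejection throws data away, so if one condition produces more artifacts than the other, the cleaned dataset can end up unbalanced.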

At the end of the day, it is important to remember that machine learning is not a substitute for collecting good, clean data. Even if your model’s prediction accuracy is very high, it is not useful if it’s predicting a biased dataset. This brings me to my next point.

Ground “truth.” As mentioned, using machine learning to classify things requires the use of a “ground truth,” where we tell the machine what the training data represents. This is referred to as supervised learning, since we are supervising the way the program trains by telling it what certain datasets mean. You can also create an unsupervised program without a ground truth if you simply want to cluster data or make associations, but this is less utilized in neuroscience due to the interest in classifying mental states.
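For contrast, here is a bare-bones sketch of the unsupervised case: a two-cluster k-means on a single made-up measurement. No labels are supplied, so the program can only say that the values fall into two groups, not what either group means. The function name and the data are hypothetical.

```python
def two_means(values, iterations=10):
    """Group one-dimensional values into two clusters (k-means with k = 2)."""
    centers = [min(values), max(values)]  # simple deterministic starting guess
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for value in values:
            # Assign each value to whichever center is currently closer.
            nearest = 0 if abs(value - centers[0]) <= abs(value - centers[1]) else 1
            groups[nearest].append(value)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

measurements = [1.0, 1.5, 0.5, 5.0, 5.5, 4.5]  # made-up values, no labels attached
centers, groups = two_means(measurements)
print(centers)  # [1.0, 5.0]
print(groups)   # [[1.0, 1.5, 0.5], [5.0, 5.5, 4.5]]
```

Deciding whether those two clusters correspond to “happy” and “not happy” would still require a ground truth, which is why supervised approaches dominate here.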

A perfect ground truth is objective. We know that dataset A represents B. However, this is rarely the case in the real world, especially when dealing with something as complex as the brain.

Let’s go back to our happiness indicator. As is common in these types of studies (Heraz, Razaki & Frasson, 2007; Sohaib, Qureshi, Hagelbäck, Hilborn & Jerčić, 2013), let’s say we determined “happiness” by showing participants “happy” pictures from a validated database.

Happy (left) and neutral (right) pictures, taken from the Open Affective Standardized Image Set (OASIS) database (Kurdi, Lozano & Banaji, 2017)

The first question we must ask is “did these pictures actually make the participant happy?” We can (and should) assess this by simply asking them to rate how positive or negative they feel after each picture. Of course, there is always the possibility that our respondent is lying, perhaps because they don’t want us to think of them as some monster who hates puppies in teacups.

Even if we could reasonably determine that the pictures made our respondents happy, we don’t know what other reactions those same pictures instilled. Perhaps the pictures were all arousing, biasing our algorithm toward other arousing stimuli. Or perhaps the pictures triggered nostalgia, leading to additional activity that had nothing to do with emotion, but instead had to do with the process of retrieving memories.

In addition, it is hard to be sure which physiological reactions are stimulus-specific. What if our program picks up on responses that have little to do with emotion, and much more to do with our reactions to certain colors or certain physical features? We can mitigate this by “aiding” our program: including only those features of the data that we have some contextual reason to expect are related to our variable of interest. However, given the complexity of our bodies, and the fact that certain signals are often correlated with multiple different states or activities, this may be easier said than done.

For similar reasons, it is hard to know whether the “happy” patterns generated by our pictures generalize to other stimuli. Do they also apply to video? Audio? What about real life? Neuroscientists traditionally attempt to keep stimuli between conditions as similar as possible because they know that even small changes can affect our brains’ responses.

The crux of the issue is that our ground truth is anything but objective. We are complex creatures and our responses to anything are going to be equally complex. Our machine learning model may predict both the training data and unseen data very well, but that doesn’t mean it predicts our variable of interest. This may seem hopeless (if you are feeling down, I refer you back to the puppy in the teacup), but it does not mean that our conclusions are useless. It simply means that we cannot take what our program says on blind faith. Just as in traditional experiments, we have to approach our conclusions with skepticism, and consider in the experimental design process how to minimize potential confounds.

Speaking of experimental design, I have one more point that is often overlooked…

We’re all basically the same, right? Of course not! We are individuals! Inside each of us surges a wild entanglement of biology and experience: millions of tiny differences forming the lattice that separates us from the rest of the world. It allows us to make our own unique marks on the world. It allows me to be me and you to be you. It also makes neuroscientists hate both of us.

People are variable, as are their physiological signals. This is why sample size matters; only by aggregating enough people can we try to smooth out this variance and start generalizing conclusions. And that’s for one metric. Now imagine a program that is trying to glean patterns from dozens of different metrics, all of which can vary across each participant. For this reason, it can be difficult for machine-learning models to accurately predict certain states.
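The “smoothing out” can be put in numbers with the standard error of the mean, a textbook statistical result rather than anything specific to the studies cited here. The between-subject standard deviation of 10 units below is an arbitrary assumption.

```python
import math

def standard_error(between_subject_sd, n_participants):
    """Standard error of the mean: how uncertain the group average is for a given sample size."""
    return between_subject_sd / math.sqrt(n_participants)

for n in (4, 25, 100):
    print(n, standard_error(10.0, n))
# 4 5.0
# 25 2.0
# 100 1.0
```

Cutting the uncertainty in half takes four times as many participants, and that is for a single metric.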

Some scientists suggest that you can create more accurate models by training a separate model for each subject, then using that individualized model to make predictions about the same person (Berka et al., 2007). However, this requires you to lengthen your experiment to collect training data from each participant, which may be less feasible. In addition, with this paradigm you cannot know in advance whether you will achieve a high prediction accuracy, whereas a previously trained model comes with that information. Finally, even when training individually, it is important to make sure that the subject is in the same mental state during training and during the actual study. For instance, if you show them pictures for an hour before even starting the experiment, they will likely be more fatigued when looking at new items than they were when you collected the data used to create the model. Before deciding to use machine learning, you should be cognizant that it may require an additional investment in your experiment.

Wrapping it up

Hopefully, you now know the basics of machine learning, how it is applied in neuroscience, and when we should be skeptical. The technology is truly amazing, and I assume it will only become more prevalent in this as well as many other fields. However, it is not yet perfect. As scientists, we cannot shirk our duty to the truth. We must understand that a machine learning program knows nothing of neuroscience or emotion or whatever else we feed it; it only knows the data it’s given. It is therefore useful as another tool in our belts. We use it to dig ever deeper toward the truth, but ultimately the conclusions must be our own.

Works Cited

Berka, C., Levendowski, D. J., Lumicao, M. N., Yau, A., Davis, G., Zivkovic, V. T., … & Craven, P. L. (2007). EEG correlates of task engagement and mental workload in vigilance, learning, and memory tasks. Aviation, Space, and Environmental Medicine, 78(5), B231-B244.

Gupta, P. Cross-Validation in Machine Learning (2017, June). Retrieved from

Heraz, A., Razaki, R., & Frasson, C. (2007, July). Using machine learning to predict learner emotional state from brainwaves. In Advanced Learning Technologies, 2007. ICALT 2007. Seventh IEEE International Conference on (pp. 853-857). IEEE.

Kothe, C. A., & Makeig, S. (2011, August). Estimation of task workload from EEG data: new and current tools and perspectives. In Engineering in Medicine and Biology Society, EMBC, 2011 Annual International Conference of the IEEE (pp. 6547-6551). IEEE.

Kurdi, B., Lozano, S., & Banaji, M. R. (2017). Introducing the open affective standardized image set (OASIS). Behavior Research Methods, 49(2), 457-470.

Le, J. A Tour of The Top 10 Algorithms for Machine Learning Newbies (2018, January). Retrieved from

Lemm, S., Blankertz, B., Dickhaus, T., & Müller, K. R. (2011). Introduction to machine learning for brain imaging. Neuroimage, 56(2), 387-399.

Sohaib, A. T., Qureshi, S., Hagelbäck, J., Hilborn, O., & Jerčić, P. (2013, July). Evaluating classifiers for emotion recognition using EEG. In International conference on augmented cognition (pp. 492-501). Springer, Berlin, Heidelberg.

Soroush, M. Z., Maghooli, K., Setarehdan, S. K., & Nasrabadi, A. M. (2017). A Review on EEG Signals Based Emotion Recognition. International Clinical Neuroscience Journal, 4(4), 118-129.

Stevens, R., Galloway, T., & Berka, C. (2006). Integrating EEG models of cognitive load with machine learning models of scientific problem solving. Augmented Cognition: Past, Present and Future, 2, 55-65.

Wang, X. W., Nie, D., & Lu, B. L. (2014). Emotional state classification from EEG data using machine learning approach. Neurocomputing, 129, 94-106.