All posts by Kathryn Ambroze

The Underlying Red Flags of Usability Testing

Usability testing has emerged in the market research world as a promising method to measure how consumers interact with a product, service or system. The goal of usability testing is to assess concepts such as understandability, learnability, operability, attractiveness and compliance in the consumer experience (Rusu et al., 2015). Generally, when conducting usability testing, consumers complete a task while being observed by a researcher who notes any issues they encounter. At its core, usability testing aims to minimize or eliminate distraction and find ways to optimize information for the consumer. It also seeks to maximize product effectiveness and efficiency while improving consumer satisfaction (Rusu et al., 2015). These recommendations on how to address performance issues are then used to further product development.

Staying competitive in any market is a challenge. Those striving to be “the best” must be able to outline what to beat the competition on, and how, in whichever category is prioritized. A prime example can be seen in the transportation industry, where the growth of mobile apps has reshaped competition among companies like Uber, Lyft and traditional taxi services. To remain competitive, Lyft invested in usability testing for its mobile app. The results suggested that the app needed to be revamped, and the company redesigned the interface to fit consumers’ wants and needs by simplifying maps and driver information (Chen, 2016). By continually striving to improve and remain up-to-date, Lyft stays competitive in the field.

Having an unbiased consumer complete tasks or interact with an item allows the researcher observing the participant to gain a better understanding of enjoyment, satisfaction, confusion or criticism in real time throughout the entire test. Companies like Lyft utilize usability testing to advance in the market without squandering funds or time. Additionally, feedback taken directly from the target audience minimizes the risk of a product failing, thus improving sales and helping the company better position itself within the market (Hawley, 2010).

So, if usability testing works, why isn’t everyone using it? It is a good question, with a couple of red flags among the answers…

The first red flag of usability testing lies within the typical sales pitch about why it is so great: it overpromises, guaranteeing more concrete solutions than it can provide. A big part of usability testing’s appeal is the claim that it uncovers a lot of answers. Such broad and lofty goals produce vague conclusions that are challenging to apply to real-world situations (from deliriously opening a bottle of medicine from bed during a sick day to online shopping at midnight for the perfect date outfit). The oversimplified gray areas in usability testing can create confusion. Tractinsky (2017) questions whether usability contributes to satisfaction, or satisfaction contributes to usability. Such entangled variables are possible because the current definitions are so fluid. The lack of consistency allows researchers to structure usability testing to best satisfy the goals of a particular project. What usability is, and the way it should be tested and measured, should remain consistent, regardless of the type of design activity it is applied to.

Usability testing serves as an umbrella for many different trial methods, so many that there is no standard approach. Generic guidelines invite varied interpretations of overarching measures, making it hard to reach noteworthy conclusions. Whatever mode of usability testing is applied to a study should neither be determined by the findings, nor altered to fit the framework of a research question (Arnowitz, Dykstra-Erickson, Chi, Cheng, & Sampson, 2005). Let’s say you wanted to test a new type of scissors. Comparative usability testing, where multiple scissors are compared to one another, versus exploratory usability testing, which probes the content and functionality of a single pair of scissors, are vastly different approaches to determining how the new scissors model resonates with a consumer. This concern could be rectified with a consistent, standardized methodology (e.g. a questionnaire) or definition. If you want to run a scissors study with usability testing, a clear, step-by-step approach for getting valid results should be meticulously determined. Essentially: walk, don’t run with scissors.
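To make the idea of a standardized instrument concrete: one widely used questionnaire is the System Usability Scale (SUS), which has a fixed, well-defined scoring rule. The sketch below shows that rule; the SUS is not discussed in the sources cited here, so treat it as one illustrative example of what consistent, comparable measurement looks like.

```python
def sus_score(responses):
    """Score a System Usability Scale questionnaire: ten 1-5 Likert responses.

    Odd-numbered items are positively worded (contribution = response - 1);
    even-numbered items are negatively worded (contribution = 5 - response).
    The summed contributions are multiplied by 2.5, giving a 0-100 score.
    """
    if len(responses) != 10 or not all(1 <= r <= 5 for r in responses):
        raise ValueError("SUS needs ten responses on a 1-5 scale")
    total = sum(
        (r - 1) if i % 2 == 0 else (5 - r)  # items 1, 3, 5, ... sit at even indices
        for i, r in enumerate(responses)
    )
    return total * 2.5

# A respondent who strongly agrees with every positive item (5) and
# strongly disagrees with every negative item (1) scores the maximum.
print(sus_score([5, 1, 5, 1, 5, 1, 5, 1, 5, 1]))  # → 100.0
```

Because every SUS study scores responses identically, results from a scissors study and a mobile-app study land on the same 0-100 scale, which is exactly the kind of comparability the ad hoc approaches above lack.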

Usability testing delays research progression in the same way that interactive product design is hindered by debatable measures (Veral & Macías, 2019). A system must be developed to build a solid foundation of preliminary research so that future work can progress from initial findings. The current ambiguity makes it nearly impossible to compare studies, since approaches differ so widely. If the objective is to create a reputable usability test, it should have a consistent framework with rigorous definitions and strong evidence to reach sound conclusions.

Part of the general problem involves recognizing that any type of usability testing is a construct, not a real-world phenomenon. Maybe your dog is barking, your roommate is on the phone or the neighbor’s car alarm is going off for the third time today. Whatever the situation may be, a controlled environment will never duplicate a real-life scenario, especially because environments are constantly changing. All research shares this constraint, but it must be acknowledged before deciding how an experimental measure will be conducted. Similarly, usability testing seeks access to the user experience; yet that is never entirely possible, since the participant is always aware of being part of a study. The non-naturalistic methodology (i.e., being told to complete a task) will not reflect the consumer’s real experience.

There are various quantitative metrics used to validate design concepts in usability testing: time spent on a task, success and failure rates, number of clicks needed to complete a task, etc. These measures cannot guarantee that the result is attributable to the product, system or service. For example, there is no guarantee that the person is focusing on the task rather than on a meal from two days ago, which could inflate the time spent using the item. For a stimulus to become important to a user, there must be some motivation to take part in the task (Hertzum, 2016). Understanding these underlying mechanisms will lead to a stronger methodology for analyzing usability results.
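The metrics listed above are straightforward to compute; the hard part is interpretation. A minimal sketch of how such task-level summaries are typically tallied follows (the field names and the convention of timing only successful attempts are illustrative assumptions, not prescribed by the sources):

```python
from statistics import mean, median

def summarize_task(observations):
    """Summarize common quantitative usability metrics for one task.

    `observations` is a list of dicts with hypothetical keys:
    'success' (bool), 'seconds' (float), 'clicks' (int).
    """
    n = len(observations)
    # By convention here, time-on-task is summarized over successful attempts only,
    # since failed attempts often end arbitrarily (give-up, timeout, moderator stop).
    times = [o["seconds"] for o in observations if o["success"]]
    return {
        "success_rate": sum(o["success"] for o in observations) / n,
        "median_time_s": median(times) if times else None,
        "mean_clicks": mean(o["clicks"] for o in observations),
    }

sessions = [
    {"success": True,  "seconds": 42.0,  "clicks": 7},
    {"success": True,  "seconds": 58.0,  "clicks": 9},
    {"success": False, "seconds": 120.0, "clicks": 15},
]
print(summarize_task(sessions))
```

Note that nothing in these numbers says *why* the failed session took 120 seconds and 15 clicks; a distracted participant and a genuinely confusing interface produce identical rows, which is precisely the confound described above.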

The lack of a stable procedure in usability testing can make components of the research problematic. One controversial point within usability research is sample size. It is common practice in usability testing to recruit 5-7 participants. This range is heavily debated; a sample of 5 is nonetheless used frequently in practice because it is budget-friendly (Lindgaard & Chattratichart, 2007). While sample size in all research remains somewhat subjective, it should be noted that having only a few participants provides a limited amount of feedback.
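The 5-participant convention traces back to the classic problem-discovery model, which predicts the proportion of usability problems found as 1 − (1 − p)ⁿ for n participants, each detecting a given problem with probability p (Nielsen & Landauer's often-quoted average is p ≈ 0.31). A quick sketch shows why small samples look attractive on paper, while the model's strong assumption (every problem is equally detectable by every participant) is exactly what critics such as Lindgaard and Chattratichart question:

```python
def proportion_found(n_participants, p_detect=0.31):
    """Expected share of usability problems uncovered by n participants,
    under the classic 1 - (1 - p)^n discovery model.

    p_detect is the assumed per-participant chance of hitting any given
    problem; 0.31 is the oft-quoted average, but real studies vary widely.
    """
    return 1 - (1 - p_detect) ** n_participants

for n in (1, 5, 7, 15):
    print(n, round(proportion_found(n), 2))
```

With p = 0.31 the curve flattens quickly, which is the usual argument for stopping at 5; but with a rarer problem (say p = 0.1), `proportion_found(5, 0.1)` is only about 0.41, so the budget-friendly sample silently misses most of the hard-to-hit issues.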

There are other common research methods that are frequently used yet should be questioned for their effectiveness and efficiency. For example, one usability approach is “thinking aloud,” where the participant speaks during a task while an evaluator observes behavior and listens. This method can be divided into two variants: classic and relaxed thinking aloud. Both force the participant to make his or her experience explicit, which may or may not match what is verbally expressed. The consumer can say anything that comes to mind, which may be unnatural and forced, especially over prolonged testing periods (Hertzum, 2016). It may also be distracting for the consumer to complete the task while simultaneously explaining each step verbally, creating a time lag in task completion. With such obvious confounds, why is this still used as a method? It may be that no other measure has been developed to extract consumer opinions that supersedes thinking aloud.

Qualitative data in usability testing includes observations of body language, hand movement, facial expressions and facial changes like squinting. Many of these metrics are left to the interpretation of the observer, who comes with his or her own set of biases. Yes, you can record the research session; however, a video recording only provides one angle of the respondent and still does not reveal what the consumer is considering. Think about the subjectivity of hand motions: a gesture’s intention can vary culturally and socially. Hand gestures are used to communicate, but it is very challenging to know exactly what something like propping your head up with your hands means.

Certain components of usability testing have evolved substantially over the last few years. As interaction paradigms, technology and software development rapidly advanced, the potential avenues of usability testing have followed suit. Several studies have refined components of usability testing, such as the questionnaire and how responses are evaluated, to make them more standardized. Advancements include using number scales rather than descriptive words when describing a task, and developing systematic ways to compare response charts automatically (Huisman & Van Hout, 2010; Merčun, 2014; Adikari et al., 2016; Berkman & Karahoca, 2016). This progress should not be undermined; it suggests that professionals are aware of the strides necessary to practice usability testing appropriately.

Usability testing, like anything in life, comes with unexpected complications. Each interface presents its own unique challenges, pushing the focus toward developing more user-friendly goods or services (Krug, 2000). For example, web design testing can appear simple at first; however, most companies now need a web design to be compatible not only with a desktop browser but also with a mobile app, tablets and smartphones. These interfaces differ dramatically both visually and in terms of human-technology interaction. Something as simple as scrolling behavior can have a huge influence on which areas of the stimuli draw strong interest. Learning to account for these unforeseen difficulties during research will help improve usability testing.

Changing trends in familiarity can also influence results within usability testing, and target audiences are another vital component of any study. Among the general population, 77% of Americans own smartphones, with younger adults more likely to use the mobile device as their primary connection to the internet (Pew Research Center, 2018). Whether someone uses a mouse or a touchscreen can make a huge difference in how he or she interacts, depending on his or her fluency with the technology. Finding measures that surface hidden issues, such as the target audience’s technological capabilities, helps address the battles usability testing is facing.


Considering all these red flags, it makes sense why usability testing isn’t a magic one-size-fits-all solution. We live in a world that demands an extremely fast turnaround in technology. Some companies bypass usability testing completely and simply release as much content as possible to the public to see what sticks. Other companies, such as Lyft, want to conduct studies with the fastest turnaround possible. The last decade alone has seen extensive growth in usability because of advancements in smartphones, AI, website design and more. It is a challenge to meet the needs of such a rapidly growing market.

Currently, usability testing can be applied in any field (alluding to the potential that it may be spread too thin to be effective). Usability testing tries to grasp how consistently a user interacts with an interface (memorability) and how readily a user understands how to use an app (learnability). To further probe concepts like memorability and learnability, research tools such as eye-tracking have been integrated into usability studies to help measure parameters such as visual attention (Frith, 2019). Any item being tested includes variables that influence a participant’s ability to perform a specific task. Color, texture, font, font size, images and format are just some of the many components that contribute to the overall picture of how a stimulus is presented. Usability testing has the potential to provide truly powerful tools for optimizing products; however, it needs to define itself and progress with the competitive space in which it is being utilized.

Consumers have quirks that influence their interactions with software, products or services. Usability professionals must challenge themselves to recognize the disconnect between testing methodologies and objectives. Defined terms will help develop a standard methodology and, ultimately, promote stronger research.

At HCD, we advocate for always using the right tool for the right question. Our motto is “Prove it.” For usability testing, our goal is to make sure that the methodology we use is appropriate for the experience, and that the results are meaningful and actionable. This list of red flags highlights the need for this sort of approach. As a blanket field of research, usability testing can be useful but is fraught with misuse. If you make sure to use the right tool for the right question, however, you may just be able to ensure product success. Please contact Cara Silvestri (cara.silvestri@hcdi.net) for additional information about how we can help you better your product, service or system.


Citations:

Adikari, S., McDonald, C., & Campbell, J. (2016). Quantitative analysis of desirability in user experience. arXiv preprint arXiv:1606.03544.

Arnowitz, J., Dykstra-Erickson, E., Chi, T., Cheng, K., & Sampson, F. (2005). Usability as science. Interactions, 12(2), 7-8.

Berkman, M. I., & Karahoca, D. (2016). Re-assessing the usability metric for user experience (UMUX) scale. Journal of Usability Studies, 11(3), 89-109.

Chen, J. (2016, June 21). Lyft re-design case study. UX Collective. Retrieved from https://uxdesign.cc/lyft-re-design-case-study-3df099c0ce45

Frith, K. H. (2019). User experience design: The critical first step for app development. Nursing Education Perspectives, 40(1), 65-66.

Hawley, M. (2010). Rapid desirability testing: A case study. Accessed online, 15(04), 2010.

Hertzum, M. (2016). Usability testing: Too early? Too much talking? Too many problems? Journal of Usability Studies, 11(3), 83-88.

Huisman, G., & Van Hout, M. (2010). The development of a graphical emotion measurement instrument using caricatured expressions: The LEMtool. In Emotion in HCI: Designing for People. Proceedings of the 2008 International Workshop (pp. 5-8).

Krug, S. (2000). Don’t Make Me Think!: A Common Sense Approach to Web Usability. Pearson Education India.

Lindgaard, G., & Chattratichart, J. (2007, April). Usability testing: What have we overlooked? In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 1415-1424). ACM.

Merčun, T. (2014). Evaluation of information visualization techniques: Analysing user experience with reaction cards. In Proceedings of the Fifth Workshop on Beyond Time and Errors: Novel Evaluation Methods for Visualization (pp. 103-109). ACM.

Pew Research Center. (2018). Demographics of mobile device ownership and adoption in the United States. Retrieved from www.pewinternet.org/factsheet/mobile/

Rusu, C., Rusu, V., Roncagliolo, S., & González, C. (2015). Usability and user experience: What should we care about? International Journal of Information Technologies and Systems Approach (IJITSA), 8(2), 1-12.

Tractinsky, N. (2017). The usability construct: A dead end? Human–Computer Interaction, 33(2), 131-177.

Veral, R., & Macías, J. A. (2019). Supporting user-perceived usability benchmarking through a developed quantitative metric. International Journal of Human-Computer Studies, 122, 184-195.