
12. Usability Evaluation

Estimated time to read: 104 minutes

TL;DR

Why this matters: Your design decisions are hypotheses. Evaluation tests if they work for real users in real contexts. Finding problems early (paper prototype) costs 1 hour to fix; finding them after launch costs 100 hours + reputation damage.

Theory (FG1): Understand evaluation methods, when to use each, and why. Know Nielsen's 10 Heuristics.

Practice (Project): Apply systematic evaluation to your MVP:

  1. Inspection first (no users needed) — Find structural issues yourself with heuristic evaluation and cognitive walkthrough
  2. Then test with real users (2 people, Thinking Aloud) — Discover real breakdowns and mental model mismatches
  3. Document findings (severity ratings, brief recommendations) — NOT implement fixes (out of scope)
  4. Automated accessibility check — Use Accessibility Scanner (Android) and WAVE (Browser Plugin) to detect accessibility violations (contrast, missing labels, semantics)

Result: Evidence-based understanding of your design’s strengths and critical weaknesses — including heuristic findings, user test results, and automated accessibility violations. These insights inform your FG2 reflection; implementing fixes is beyond this course’s scope.

Building on Design Knowledge

This chapter directly applies principles from Design:

  • Nielsen's Heuristics → Design chapter contains full explanations with examples (Good ✅ vs Bad ❌)
  • This chapter focuses on APPLYING those heuristics to systematically evaluate your design
  • Design principles → You'll test if users understand your visual hierarchy, color choices, navigation patterns

Prerequisites: Read the Design chapter first. You need to understand the principles before you can evaluate against them.

12.1 Usability vs. UX

We discussed the differences in more detail in the introduction.

  • Usability: can target users achieve goals effectively, efficiently, satisfactorily in a specific context?
  • User Experience (UX): broader — includes emotions, trust, value, aesthetics, meaning over time.

Rule for your project

Always combine two perspectives:

  1. Expert-based inspection (heuristics / walkthrough) — Fast, find structural issues
  2. User-based evaluation (usability testing with tasks) — Reveal real breakdowns and mental model mismatches

Why this order? Don't waste users' time on problems you can find yourself! Fix obvious issues first (inspection), then discover unexpected problems (testing).

12.2 Software Ergonomics (What you optimize)

Usability work improves:

  • Learnability (first-time use)
  • Efficiency (repeated use)
  • Error prevention & recovery
  • Transparency (feedback, system status)
  • Accessibility (usable for more people, more contexts)

Minimum Standard: Automated Accessibility Testing

In addition to heuristic evaluation and user testing, you must run an automated accessibility check: Accessibility Scanner (Android) and WAVE (browser plugin).

These tools automatically detect:

  • Insufficient color contrast
  • Missing semantic labels
  • Touch targets that are too small
  • Missing alternative text
  • Structural/ARIA issues

Automated tools do NOT replace usability testing — but they catch critical accessibility violations quickly and objectively.
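To make "insufficient color contrast" concrete, here is a minimal sketch of the contrast-ratio calculation as defined in WCAG 2.x, which scanners of this kind typically apply. The function names are hypothetical, and the 4.5:1 threshold is the WCAG AA level for normal-size text; neither comes from this chapter, and the tools themselves may apply additional rules.

```python
def _linearize(channel: int) -> float:
    """Convert an 8-bit sRGB channel (0-255) to linear light (WCAG 2.x formula)."""
    c = channel / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb: tuple) -> float:
    """Relative luminance of an (R, G, B) color, between 0.0 and 1.0."""
    r, g, b = (_linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(foreground: tuple, background: tuple) -> float:
    """Contrast ratio between two colors, from 1.0 (identical) to 21.0."""
    l1 = relative_luminance(foreground)
    l2 = relative_luminance(background)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


def passes_aa_normal_text(foreground: tuple, background: tuple) -> bool:
    """Assumed threshold: WCAG AA for normal-size text requires at least 4.5:1."""
    return contrast_ratio(foreground, background) >= 4.5


# Mid-grey text (#777777) on a white background narrowly fails AA.
grey_on_white = contrast_ratio((119, 119, 119), (255, 255, 255))
print(f"{grey_on_white:.2f}", passes_aa_normal_text((119, 119, 119), (255, 255, 255)))
```

A check like this is useful for spot-testing a single color pair from your theme; the scanners remain the required tools because they also cover labels, touch targets, and structure.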

12.3 Motivation

Steve Krug, Don't Make Me Think

"Testing one user is 100% better than testing none."

You are not your user:

  • You know how the app works (they don't)
  • You know what each button does (they have to figure it out)
  • You made design decisions for reasons (they don't know those reasons)
  • You've tested it 100 times (they use it for the first time)

Result: What seems obvious to you is confusing to users.

graph LR
    A[Skip Evaluation] --> B[Launch with Issues]
    B --> C[Users Struggle]
    C --> D[Bad Reviews]
    C --> E[Support Tickets]
    C --> F[User Churn]

    G[Evaluate Early] --> H[Find Issues Cheap]
    H --> I[Fix Before Launch]
    I --> J[Good Experience]
    J --> K[Success]

Design decisions are hypotheses. Usability evaluation tests whether these hypotheses hold in real use.

Usability Evaluation answers the question:

Does the system actually work for users in their context?

Evaluation is not a final step, but a tool for learning and iteration throughout the design process.

12.4 Evaluation Timing

Evaluation happens throughout the development process, not just at the end:

timeline
    title Evaluation Throughout Development

    After Paper Prototypes : Paper Prototype
           : Informal validation
           : 2-3 people
           : Find major flow issues

    After Digital Prototypes : Digital Prototype
           : Quick validation
           : 3-5 people
           : Validate design decisions

    Late in Development Phase : MVP Implementation
            : Full Evaluation
            : Heuristic + Thinking Aloud
            : YOUR PROJECT FOCUS

    Project End : Final Check
            : Regression testing
            : Optional SUS
            : Document findings

Your project timeline:

  1. Early prototyping → Quick informal tests (covered in Design)
  2. Late in development → Systematic evaluation (this chapter's focus)
    • Heuristic evaluation of implemented app
    • Cognitive walkthrough of critical flows
    • Thinking Aloud with 2 users
  3. Project end → Document findings for FG2 reflection

Industry Practice

In professional projects, evaluation timing depends on the development process model. See Development Process for how evaluation fits into:

  • Double Diamond → Evaluation happens in "Develop" and "Deliver" phases
  • Agile UX → Continuous testing parallel to development sprints
  • Design Sprint → Day 5 is dedicated to user testing and validation

The key principle: Test early, test often, iterate continuously.

12.5 Formative vs. Summative Evaluation

Formative evaluations are used during the design process to identify strengths and weaknesses in a prototype, guiding iterative improvements before final release. They focus on uncovering usability issues and refining the interface through repeated testing and revision.

Summative evaluations, on the other hand, assess the performance of a finished product, often comparing it to benchmarks, previous versions, or competitors. They provide a big-picture view of usability and are typically conducted just before or after a product launch to measure success and inform future updates.

While formative evaluations are ongoing and iterative, summative evaluations serve as checkpoints to validate the final design and ensure it meets user needs and business goals. Both types are essential for creating effective, user-centered products 1.

Two Independent Dimensions

Purpose (formative vs. summative) and who evaluates (expert vs. real users) are independent of each other. Any method can — in principle — serve either purpose:

| | Expert-based (Inspection) | User-based (Testing) |
|---|---|---|
| Formative (improve) | Heuristic Evaluation, Cognitive Walkthrough | Thinking Aloud |
| Summative (assess) | Expert audit / benchmark | SUS, Task Completion Rate, A/B Test |

In practice: inspection methods are almost always formative; user-based testing is usually formative but becomes summative when you measure final quality (e.g., SUS score after launch).

12.6 Usability Inspection

Why Inspect First?

Don't waste users' time on problems you can find yourself!

Inspection methods let you find and fix structural issues quickly before involving real users. This makes user testing more productive — users can focus on discovering unexpected problems, not obvious violations of established principles.

Your workflow:

  1. First: Heuristic Evaluation + Cognitive Walkthrough (you + team) + Accessibility Scan
  2. Then: Thinking Aloud with real users (after fixing obvious issues)

Inspection methods are expert-based evaluations where you (or others) review the design without involving actual users — and are almost always used formatively, to identify structural issues before user testing.

12.6.1 Heuristic Evaluation

Method

Systematically check design against established heuristics (Nielsen's 10).

Process:

graph LR
    A[1. Brief] --> B[2. Individual Review]
    B --> C[3. Document Issues]
    C --> D[4. Severity Rating]
    D --> E[5. Consolidate]
    E --> F[6. Prioritize Fixes]

Full Explanations in Design Chapter

For detailed explanations of each heuristic with multiple examples, see Design chapter - Nielsen's 10 Usability Heuristics.

The list below provides quick reminders for evaluation. Use the Design chapter as your reference when explaining heuristics in detail.

Quick list for evaluation:

  1. Visibility of system status — System shows what's happening (feedback within 0.1-1 second)
  2. Match between system and real world — Use language and concepts from user's perspective
  3. User control and freedom — Users need "emergency exit" (Undo, Cancel)
  4. Consistency and standards — Same words/actions mean same thing everywhere
  5. Error prevention — Better to prevent errors than write good error messages
  6. Recognition rather than recall — Make options visible instead of requiring memory
  7. Flexibility and efficiency of use — Shortcuts for experts, without hindering novices
  8. Aesthetic and minimalist design — No irrelevant content competing for attention
  9. Help users recognize, diagnose, recover from errors — Error messages in plain language
  10. Help and documentation — Easy to search and task-focused

Application: For each screen, ask — "Which heuristics are violated?"

How to Conduct a Heuristic Evaluation

Each evaluator independently:

  1. Perform each task
  2. Check against each heuristic
  3. Note violations

Tips:

  • Go through system 2-3 times (first = overview, then detailed)
  • Think as target persona
  • Document context (which screen, which action)
  • For Heuristic #7 (Efficiency): Count interaction steps for frequent tasks
    • Identify 2-3 most frequent tasks from your user research
    • Count total interactions: taps + screens + text inputs
    • Target: ≤ 3 taps for primary actions
    • Compare: How many steps do competitor apps need?

Example Path Length Analysis:

| Task | Frequency | Steps | Issue? |
|---|---|---|---|
| Log workout | Daily (from user research) | Home → Tap "Add" → Select type → Enter data → Save = 5 taps + 2 screens + 3 inputs | ⚠️ Check if reducible |
| View history | Weekly | Home → Tap "History" = 1 tap + 1 screen | ✅ Efficient |
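The path length analysis can be kept in a small script so that counts stay comparable across tasks. The following is a sketch only: the `TaskPath` class and `exceeds_tap_target` helper are hypothetical names, the numbers are the example values from the table, and the default limit of 3 taps is the target for primary actions stated in the tips.

```python
from dataclasses import dataclass


@dataclass
class TaskPath:
    """Interaction counts for one frequent task (hypothetical structure)."""
    name: str
    taps: int
    screens: int
    text_inputs: int

    @property
    def total_interactions(self) -> int:
        # Total interactions = taps + screens + text inputs
        return self.taps + self.screens + self.text_inputs


def exceeds_tap_target(path: TaskPath, max_taps: int = 3) -> bool:
    """True if the task needs more taps than the target for primary actions."""
    return path.taps > max_taps


paths = [
    TaskPath("Log workout", taps=5, screens=2, text_inputs=3),
    TaskPath("View history", taps=1, screens=1, text_inputs=0),
]

for path in paths:
    flag = "check if reducible" if exceeds_tap_target(path) else "efficient"
    print(f"{path.name}: {path.total_interactions} interactions, {flag}")
```

Whether a flagged task is really a problem still needs judgment: entering workout data may justify more steps than simply opening a screen.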

12.6.2 Cognitive Walkthrough 2

Method

Step-by-step analysis: Can users figure out what to do and understand what happens?

Developed by: Wharton, Rieman, Lewis & Polson (1994)

Focus: Learnability (first-time use, no training)

Four Key Questions (per action):

The method systematically asks four questions for EACH user action:

  1. Will users try to achieve the right effect?

    • Do they understand what this action does?
    • Is it the goal they want?
  2. Will users notice that the correct action is available?

    • Is the control visible?
    • Is it where they expect it?
  3. Will users associate the correct action with the effect they want?

    • Is the label clear?
    • Does it match their mental model?
  4. If the action is performed, will users see that progress is being made?

    • Is there feedback?
    • Does the system state change visibly?

Example Walkthrough

Task: "Log a workout"

Step 1: User on Home Screen

| Question | Answer | Issue? |
|---|---|---|
| 1. Try to achieve right effect? | Yes, "Log Workout" button visible | |
| 2. Notice correct action? | Yes, prominent button at top | |
| 3. Associate action with goal? | Yes, label is clear | |
| 4. See progress if performed? | Button press → Form opens | |
Step 2: User on Add Workout Form

| Question | Answer | Issue? |
|---|---|---|
| 1. Try to achieve right effect? | Yes, wants to enter workout data | |
| 2. Notice correct action? | Yes, form fields visible | |
| 3. Associate action with goal? | Mostly. "Duration" field unclear unit (min? hours?) | ⚠️ |
| 4. See progress if performed? | Text appears in field | |
Issue Found:

**Problem:** Duration field doesn't show unit
**Impact:** User might enter "30" thinking hours, not minutes
**Fix:** Add helper text "Duration (minutes)"
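If you walk through several tasks, recording the four answers per step in a consistent structure makes it easy to pull out every flagged answer afterwards. This is one possible sketch, not a prescribed format: the `Answer` and `Step` classes and the `find_issues` function are hypothetical, and the data mirrors the "Log a workout" example above.

```python
from dataclasses import dataclass

# The four cognitive walkthrough questions, asked for every user action.
QUESTIONS = (
    "Will users try to achieve the right effect?",
    "Will users notice that the correct action is available?",
    "Will users associate the correct action with the effect they want?",
    "If the action is performed, will users see that progress is being made?",
)


@dataclass
class Answer:
    ok: bool   # False means the evaluator flagged a potential problem
    note: str  # short justification or description of the problem


@dataclass
class Step:
    name: str
    answers: list  # exactly four Answer objects, in question order


def find_issues(steps: list) -> list:
    """Return (step name, question number, note) for every flagged answer."""
    issues = []
    for step in steps:
        if len(step.answers) != len(QUESTIONS):
            raise ValueError(f"{step.name}: expected {len(QUESTIONS)} answers")
        for number, answer in enumerate(step.answers, start=1):
            if not answer.ok:
                issues.append((step.name, number, answer.note))
    return issues


walkthrough = [
    Step("User on Home Screen", [
        Answer(True, 'Yes, "Log Workout" button visible'),
        Answer(True, "Yes, prominent button at top"),
        Answer(True, "Yes, label is clear"),
        Answer(True, "Button press opens the form"),
    ]),
    Step("User on Add Workout Form", [
        Answer(True, "Yes, wants to enter workout data"),
        Answer(True, "Yes, form fields visible"),
        Answer(False, '"Duration" field does not show its unit'),
        Answer(True, "Text appears in field"),
    ]),
]

issues = find_issues(walkthrough)
for step_name, number, note in issues:
    print(f"{step_name} / Q{number}: {note}")
```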

Optional Method: Keystroke-Level Model (KLM)

Analytic (or predictive) approaches, such as the Goals, Operators, Methods and Selection rules (GOMS) family of models, which includes the Keystroke-Level Model (KLM), are used to forecast user behavior and streamline task completion 3. KLM predicts task completion time by counting low-level operations (keystrokes, mouse movements).

Why NOT emphasized in this mobile course:

  • KLM was developed for desktop interfaces with frequent switching between keyboard and mouse
  • Better for mobile: Count interaction steps (taps + screens) as part of Heuristic #7 evaluation (see above)
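For completeness, a KLM prediction is simply a sum of operator times. The sketch below assumes the commonly cited desktop estimates from Card, Moran and Newell (K = keystroke or click, P = point with the mouse, H = move hands between devices, M = mental preparation); these values and the example operator sequence are not taken from this chapter and are only illustrative.

```python
# Assumed operator times in seconds (commonly cited desktop estimates).
OPERATOR_SECONDS = {
    "K": 0.28,  # keystroke or button press, average typist
    "P": 1.10,  # point at a target with the mouse
    "H": 0.40,  # move hands between keyboard and mouse
    "M": 1.35,  # mental preparation before an action
}


def predict_seconds(operators: str) -> float:
    """Sum the operator times for a sequence such as 'MHPK'."""
    unknown = set(operators) - set(OPERATOR_SECONDS)
    if unknown:
        raise ValueError(f"Unknown operators: {sorted(unknown)}")
    return sum(OPERATOR_SECONDS[op] for op in operators)


# Hypothetical desktop task: think, reach for the mouse, point at a button, click.
print(f"{predict_seconds('MHPK'):.2f} s")  # 3.13 s
```

These operators describe keyboard-and-mouse work, which is why counting taps and screens remains the recommended approach for your mobile app.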

12.7 Usability Evaluation (With Real Users)

Why Test After Inspection?

You've already found and (ideally) fixed structural issues with heuristic evaluation and cognitive walkthrough. Now real users can help you discover:

  • Unexpected problems you didn't anticipate
  • Mental model mismatches that experts miss
  • Real-world usage patterns that violate your assumptions

Testing is more productive when obvious issues are already fixed!

Do not ask real users to find problems you already know about or could easily find yourself. See the following article on how the new terminal at Frankfurt Airport was tested: "Alle Abläufe im Test: 8.000 Komparsen testen neues Terminal 3 in Frankfurt" ("All processes put to the test: 8,000 extras test the new Terminal 3 in Frankfurt").

Usability evaluation involves observing real users as they attempt to complete tasks with your system.

5 users

"For really low-overhead projects, it's often optimal to test as few as 2 users per study [note: for your project, 2 users]. For some other projects, 8 users, or sometimes even more, might be better. For most projects, however, you should stay with the tried-and-true: 5 users per usability test." 4

According to the Devil's Quadrangle, it’s always essential to weigh the different requirements against each other.
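The trade-off behind these numbers can be illustrated with the problem-discovery model from the cited Nielsen and Landauer work, where the share of problems found with n users is 1 - (1 - L)^n. The sketch below assumes L = 0.31, the average per-user discovery rate reported in Nielsen's article; your project's real rate is unknown, so treat the output as a rough guide, not a guarantee.

```python
def share_of_problems_found(users: int, discovery_rate: float = 0.31) -> float:
    """Expected share of usability problems found after testing `users` people.

    Model: 1 - (1 - L)^n, where L is the share of problems a single
    user reveals (assumed 0.31 here) and n is the number of users.
    """
    if users < 0:
        raise ValueError("users must not be negative")
    if not 0.0 <= discovery_rate <= 1.0:
        raise ValueError("discovery_rate must be between 0 and 1")
    return 1.0 - (1.0 - discovery_rate) ** users


for n in (1, 2, 5, 8):
    print(f"{n} users: {share_of_problems_found(n):.0%}")
```

Under this assumption, 2 users already reveal roughly half of the problems and 5 users about 84%, which is why adding more participants quickly yields diminishing returns.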

Pilot Test = Mandatory 5

Before any real usability test, conduct a pilot test:

  • 1 person (teammate or friend)
  • Test the test materials
  • Find issues with: tasks, instructions, recording setup, timing

Why pilot testing matters:

  • Catches ambiguous task wording
  • Reveals timing issues (tasks too long/short)
  • Tests recording setup
  • Validates consent forms
  • Prevents wasting participant time with broken materials

12.7.1 Thinking Aloud Method

Gold Standard Method

Participants verbalize their thoughts while performing tasks. You observe and take notes.

Why it works:

  • Reveals mental model
  • Exposes confusion immediately
  • Shows what users notice (or miss)
  • Identifies pain points naturally

12.7.2 Test Plan

  1. Define Goals
## Usability Test Goals

**Research Questions:**
- Can users log a workout in under 2 minutes?
- Do users understand the progress visualization?
- Can users find and edit past workouts?

**Success Criteria:**
- 80%+ task completion rate
- < 3 critical errors per session
- Average SUS score > 68
  2. Select Tasks (5-8 tasks)

Tasks from User Stories

Your test tasks should come from the user stories you defined in Requirements Engineering.

Each user story becomes a test task. This ensures you're testing what users actually need to do.

Good Tasks:

  • Realistic (from user research)
  • Prioritized (most important first)
  • Clear success criteria
  • No step-by-step instructions

Example Task:

## Task 1: Log Workout
"You just finished a 30-minute run and want to track it. 
Show me how you would do that."

**Success:** New workout appears in history
**Time limit:** 3 minutes
**Errors:** Count if user goes to wrong screen

Bad Task (Don't do this):

❌ "Tap the + button, then select Run, then enter 30 minutes"
(This gives away the solution!)

12.7.2.1 3. Recruit Participants

Who:

  • Match your personas
  • 2+ participants for your project (5+ ideal)
  • Mix of experience levels
  • Not your friends/family (too biased)

How:

  • Fellow students (different course/semester)
  • Social media (student groups)
  • Incentives (coffee voucher, €10)

Screening questions:

1. How often do you exercise? (Daily/Weekly/Rarely)
2. Do you currently use any fitness app? (Yes/No, Which one?)
3. Have you used [Your App Name] before? (Yes → Exclude)
4. Are you comfortable thinking aloud while using an app? (Yes/No)

12.7.2.2 4. Prepare Materials

Checklist:

- [ ] Working prototype/MVP (fully charged)
- [ ] Task cards (one per page, printed)
- [ ] Consent form (recording, data handling)
- [ ] Note-taking template
- [ ] Recording equipment (phone camera on tripod)
- [ ] Backup device (if one fails)
- [ ] SUS questionnaire (post-test)
- [ ] Pens, paper, water for participant

12.7.3 Moderator Script 6

Consistent Introduction

Use the same script for all participants to reduce variability.

12.7.3.1 Introduction

"Hi [Name], thank you for participating!

Today we're testing a fitness tracking app. Important: We're testing 
the app, not you. There are no wrong answers, and you can't break 
anything.

Your job is to think aloud: Tell me what you're looking at, what 
you're thinking, what you expect to happen. For example, if you see 
a button, say what you think it does before you tap it.

This will feel a bit unnatural at first, but it really helps us 
understand your experience.

You can stop at any time or skip a task if you want. This session 
will take about 30 minutes.

We'll record the screen to help us analyze later. The recording is 
only for our team and will be deleted after the project.

Do you have any questions before we start?

[Wait for questions]

Okay, let's do a quick practice task first..."

12.7.3.2 During Tasks

Your role:

  • ✅ Observe silently
  • ✅ Take notes
  • ✅ Use neutral prompts if participant stops talking
  • ❌ Don't help
  • ❌ Don't explain features
  • ❌ Don't defend the design

Neutral Prompts (if participant stops talking):

  • "What are you thinking right now?"
  • "What are you looking for?"
  • "What do you expect to happen if you tap that?"
  • "Can you tell me more about what you're seeing?"

Avoid leading questions:

  • ❌ "Do you like the button placement?"
  • ❌ "Isn't this feature useful?"
  • ❌ "Would you prefer it here or there?"

If participant is stuck (after ~30 sec):

  1. First: "What are you trying to do?"
  2. Then: "What would you expect to see/do?"
  3. Last resort: "Let's move to the next task" (mark task as failed)

12.7.3.3 Post-Task Questions

"Thank you! Now I have a few quick questions:

1. Overall, how easy or difficult was it to complete these tasks?
   (Very easy / Easy / Neutral / Difficult / Very difficult)

2. What did you like most about the app?

3. What was most confusing or frustrating?

4. Is there anything you expected to see that wasn't there?

5. Would you use this app if it were available? Why or why not?

6. Any other feedback?

Thank you very much for your time and your feedback!"

12.7.3.4 SUS Questionnaire

See Metrics section below.


12.7.4 Common Mistakes (Avoid!)

❌ Helping Too Much

Don't explain features during test.

If stuck, ask: "What would you expect?"

❌ Defending Design

Don't justify decisions during test.

Note the issue, fix it later.

❌ Leading Questions

Don't ask "Do you like...?"

Ask "What do you think...?"

❌ Skipping Pilot

Always do a pilot test first.

Find issues with test materials.

❌ Tasks with Instructions

Don't: "Tap the + button"

Do: "Log a workout"

❌ Unclear Success

Define success criteria per task.

Know when task is complete.

12.7.5 Observation Notes

Template:

# Usability Test Session Notes

**Date:** 2026-02-15
**Participant:** P3 (Casual runner, uses Strava)
**Moderator:** [Your name]
**Observer:** [Teammate]

## Task 1: Log Workout

**Time:** 62 seconds
**Completion:** ✅ Success
**Errors:** 1 (tried "Stats" first before finding "Log")

**Observations:**
- Hesitated 8 sec before tapping any button
- Said: "I'm looking for a plus icon, but I see a button that says 'Log Workout'"
- Expected form to have more fields (heart rate, weather)
- Confused by "Duration" label - asked "Is that minutes or hours?"
- Positive reaction to success message: "Oh good, it's saved!"

**Issues:**
- [ ] Low discoverability of "Log Workout" button (expected + icon)
- [ ] "Duration" field needs unit label

---

## Task 2: Check Goal Progress

**Time:** 45 seconds
**Completion:** ✅ Success (with hint)
**Errors:** 0

**Observations:**
- Immediately went to "Stats" tab
- Said: "I assume progress is in stats?"
- Found weekly goal widget easily
- Commented: "Nice visualization, I understand it"

**Issues:**
- None for this task

---

[Continue for all tasks...]

---

## Post-Task Quotes

> "Overall pretty easy to use. I wish the plus button was more obvious."

> "I like the clean design. Not too cluttered."

> "The stats screen is my favorite. Clear and motivating."

---

## SUS Score: 78 (Grade B)

12.8 Severity Rating

Severity combines three factors (see reference 7): Frequency × Impact × Persistence

Where:

  • Frequency: How often does this problem occur?
  • Impact: How difficult is it for users to overcome?
  • Persistence: Is this one-time or repeated annoyance?

| Rating | Pattern | Description | Fix Priority |
|---|---|---|---|
| 0 - Not a problem | | Not actually a usability issue | Ignore |
| 1 - Cosmetic | Low frequency + Low impact | Minor issue, doesn't affect usability | Fix if time |
| 2 - Minor | High frequency + Low impact | Small problem, workaround exists | Fix before launch |
| 3 - Major | Low frequency + High impact | Serious problem, difficult to work around | Fix now |
| 4 - Catastrophic | High frequency + High impact | Users cannot complete task | Fix immediately |
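The pattern column of the rating table can be read as a simple lookup. The following sketch encodes only frequency and impact, exactly as the table does; persistence is left to the evaluator's judgment, and the function name and the sample findings are hypothetical.

```python
def severity_rating(is_problem: bool, high_frequency: bool, high_impact: bool) -> int:
    """Map the frequency/impact pattern to a 0-4 rating as in the table."""
    if not is_problem:
        return 0  # not actually a usability issue
    if high_impact:
        return 4 if high_frequency else 3  # catastrophic vs. major
    return 2 if high_frequency else 1      # minor vs. cosmetic


# Hypothetical findings: (title, is_problem, high_frequency, high_impact)
findings = [
    ("Icon slightly misaligned", True, False, False),
    ('"Duration" field has no unit', True, True, False),
    ("Save button hidden behind keyboard", True, True, True),
]

# Sort so the most severe issues come first in the report.
rated = sorted(
    ((severity_rating(p, f, i), title) for title, p, f, i in findings),
    reverse=True,
)
for rating, title in rated:
    print(rating, title)
```

Treat the result as a starting point for discussion in the team: a rarely occurring problem that recurs in every session may deserve a higher rating than the lookup suggests.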

12.9 Metrics

12.9.1 Quantitative Metrics

12.9.1.1 Task Completion Rate

Completion Rate = (Successful completions / Total attempts)

Example:
5 participants × 3 tasks = 15 attempts
13 successful completions
Completion Rate = 13/15 = 87%
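The same calculation as a small helper, using the example numbers above. The function name is hypothetical; the guard against zero attempts is an addition for robustness.

```python
def completion_rate(successful: int, attempts: int) -> float:
    """Successful completions divided by total attempts, as a fraction."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    if not 0 <= successful <= attempts:
        raise ValueError("successful must be between 0 and attempts")
    return successful / attempts


participants, tasks = 5, 3
attempts = participants * tasks  # 15 attempts in total
rate = completion_rate(13, attempts)
print(f"{rate:.0%}")  # 87%
```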

12.9.1.2 Time on Task

Average Time = Sum of completion times / Number of successful completions

Example Task "Log Workout":
P1: 45 sec
P2: 62 sec
P3: 38 sec
P4: Failed (don't count)
P5: 51 sec

Average = (45 + 62 + 38 + 51) / 4 = 49 seconds

Use:

  • Compare to your target (e.g., "under 60 seconds")
  • Identify slow tasks for improvement
  • Track improvement across iterations
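A minimal sketch of the time-on-task calculation, using the "Log Workout" example values. Failed attempts are stored as `None` and excluded from the average, as in the example; the function name and the 60-second target are illustrative.

```python
from statistics import mean

# Completion times in seconds; None marks a failed attempt (not counted).
log_workout_times = {"P1": 45, "P2": 62, "P3": 38, "P4": None, "P5": 51}


def average_time_on_task(times: dict) -> float:
    """Average completion time over successful attempts only."""
    completed = [t for t in times.values() if t is not None]
    if not completed:
        raise ValueError("no successful completions to average")
    return mean(completed)


average = average_time_on_task(log_workout_times)
target_seconds = 60  # example target: "under 60 seconds"
print(average, "meets target" if average < target_seconds else "misses target")
```

Because failures are excluded, always report time on task together with the completion rate; a fast average over very few successes can be misleading.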

12.9.2 Qualitative Metrics

12.9.2.1 System Usability Scale (SUS) 8

Industry Standard Questionnaire

10 questions, 5-point Likert scale, yields score 0-100
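The 0-100 score follows from the standard SUS scoring rule: odd-numbered (positively worded) items contribute the response minus 1, even-numbered (negatively worded) items contribute 5 minus the response, and the sum is multiplied by 2.5. A minimal sketch, assuming responses are coded 1 (strongly disagree) to 5 (strongly agree) in questionnaire order:

```python
def sus_score(responses: list) -> float:
    """Compute the SUS score (0-100) from ten responses coded 1-5."""
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 responses")
    if any(not 1 <= r <= 5 for r in responses):
        raise ValueError("each response must be between 1 and 5")

    total = 0
    for item, response in enumerate(responses, start=1):
        if item % 2 == 1:
            total += response - 1  # odd items: positively worded
        else:
            total += 5 - response  # even items: negatively worded
    return total * 2.5


# A participant who answers "neutral" (3) everywhere scores exactly 50.
print(sus_score([3] * 10))  # 50.0
```

With only 2 participants, report the individual scores alongside the average; a single SUS score says little on its own.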

12.10 Evaluation Report Template

After completing your evaluation (heuristic evaluation + thinking aloud), document your findings in a brief report:

# Usability Evaluation Report: [App Name]

**Date:** [Date]
**Evaluators:** [Names]
**Methods Used:** Heuristic Evaluation, Cognitive Walkthrough, Accessibility Scan, Thinking Aloud (2 participants)

---

## Executive Summary

Brief overview of evaluation process and key findings (3-4 sentences).

---

## Methods

### Heuristic Evaluation
- **Evaluators:** [Number and roles]
- **Screens evaluated:** [List]
- **Heuristics applied:** Nielsen's 10 Usability Heuristics

### Cognitive Walkthrough
- **Tasks analyzed:** [List critical tasks]
- **Focus:** First-time use, learnability

### Automated Accessibility Scan
- **Tools Used:** Accessibility Scanner (Android) / WAVE (Browser)
- **Screens Tested:** [List]
- **Violations Found:** [Number + brief summary]

### Thinking Aloud Testing
- **Participants:** 2 (P1: [brief description], P2: [brief description])
- **Tasks:** [List of tasks tested]
- **Duration:** 30 minutes per session

---

## Findings

### Critical Issues (Severity 4)

**Issue #1: [Brief title]**

- **Location:** [Screen/component]
- **Heuristic Violated:** [#X - Name]
- **Severity:** 4 - Catastrophic
- **Description:** [What happens, why it's critical]
- **Evidence:** Observed in 2/2 user tests; both participants unable to complete Task X
- **Recommendation:** [Brief suggestion for fix]

### Major Issues (Severity 3)

[Same format as above]

### Minor Issues (Severity 2)

[Same format - can be briefer or listed]

### Cosmetic Issues (Severity 1)

[Brief list]

---

## Metrics Summary

### Task Completion Rates
- Task 1 (Log Workout): 100% (2/2)
- Task 2 (Check Progress): 50% (1/2)
- Task 3 (Edit Workout): 100% (2/2)
- **Overall:** 83% (5/6 task attempts)

### Time on Task
- Average time to log workout: 45 seconds (target: <60s) ✅
- Average time to find progress: 90 seconds (target: <60s) ❌

### SUS Scores
- P1: 72 (Grade B - Okay)
- P2: 65 (Grade D - Poor)
- **Average:** 68.5 (Grade C - Average)

---

## Prioritized Recommendations

### Must Fix (Before Final Delivery)
1. [Issue with severity 3-4]
2. [Issue with severity 3-4]

### Should Fix (If Time Permits)
1. [Issue with severity 2]
2. [Issue with severity 2]

### Future Improvements
1. [Issue with severity 1]
2. [Additional features/enhancements]

---

## Conclusion

[2-3 sentences summarizing overall usability, what went well, what needs work]

Fixing Issues Is NOT Part of This Course

Important: Your evaluation report documents problems and suggests solutions, but implementing fixes is beyond the scope of this course.

What you DO:

  • ✅ Systematically find and document issues
  • ✅ Rate severity and prioritize
  • ✅ Suggest potential solutions
  • ✅ Reflect on findings in Fachgespräch 2

What you DON'T do:

  • ❌ Implement recommended fixes in code
  • ❌ Re-test after fixes
  • ❌ Iterate on design based on findings

Value for your learning: You gain critical experience in identifying usability problems systematically — a skill that translates directly to professional practice. In real projects, evaluation findings inform the next iteration; here, they inform your reflection on the design process.

Evaluation with Implemented App

You'll conduct this evaluation on your implemented Flutter app, not on prototypes. See Flutter Development for implementation details.

This means you're evaluating real, working functionality — which makes findings more realistic and actionable (even if you don't implement fixes in this course).

12.11 Self-Assessment

Test Your Understanding (Fachgespräch 1)

  1. On a timeline from start to launch, when should you conduct inspection vs. testing methods?
  2. Why conduct inspection methods first (no users) before testing with real users?
  3. What is the difference between formative and summative evaluation?
  4. Why is the introduction critical before a Thinking Aloud session?
  5. Why is a pilot test mandatory?
  6. Write a good task description for your project idea — what makes it good?
  7. Which participants should you recruit for usability testing?
  8. What are common moderator mistakes during testing?

12.12 Further Reading

Jeffrey Rubin. Handbook of usability testing: how to plan, design, and conduct effective tests. Wiley Pub, 2nd edition, 2008. ISBN 978-0-470-18548-3.

Helen Sharp, Yvonne Rogers, and Jenny Preece. Interaction design: beyond human-computer interaction. John Wiley & Sons, Inc, sixth edition, 2023. ISBN 978-1-119-90109-9 978-1-119-90111-2.


  1. Alita Kendrick. Formative vs. summative evaluations. 2019. URL: https://www.nngroup.com/articles/formative-vs-summative-evaluations/ (visited on 2025-02-04). 

  2. Cathleen Wharton, John Rieman, Clayton Lewis, and Peter Polson. The cognitive walkthrough method: a practitioner's guide, pages 105–140. John Wiley & Sons, Inc., USA, 1994. URL: https://dl.acm.org/doi/10.5555/189200.189214

  3. Daniel Cunha, Rui P. Duarte, and Carlos A. Cunha. KLM-GOMS detection of interaction patterns through the execution of unplanned tasks. In Osvaldo Gervasi, Beniamino Murgante, Sanjay Misra, Chiara Garau, Ivan Blečić, David Taniar, Bernady O. Apduhan, Ana Maria A.C. Rocha, Eufemia Tarantino, and Carmelo Maria Torre, editors, Computational Science and Its Applications – ICCSA 2021, 203–219. Cham, 2021. Springer International Publishing. URL: https://link.springer.com/chapter/10.1007/978-3-030-86960-1_15

  4. Jakob Nielsen and Thomas K. Landauer. A mathematical model of the finding of usability problems. 1993. URL: https://www.nngroup.com/articles/how-many-test-users/

  5. Kara Pernice. Pilot testing: getting it right (before) the first time. 2017. URL: https://www.nngroup.com/articles/pilot-testing/ (visited on 2025-01-29). 

  6. Maria Rosala. Checklist for moderating a usability test. 2022. URL: https://www.nngroup.com/articles/usability-checklist/ (visited on 2025-02-04). 

  7. Jakob Nielsen. Severity ratings for usability problems. 1994. URL: https://www.nngroup.com/articles/how-to-rate-the-severity-of-usability-problems/ (visited on 2025-02-04). 

  8. John Brooke. SUS: a quick and dirty usability scale. Usability Eval. Ind., 189, November 1995. URL: https://www.researchgate.net/publication/228593520_SUS_A_quick_and_dirty_usability_scale