Evaluation Methodology and Benchmarking Framework


GRANT AGREEMENT NO.: 608775
PROJECT ACRONYM: INDICATE
PROJECT TITLE: Indicator-based Interactive Decision Support and Information Exchange Platform for Smart Cities
FUNDING SCHEME: STREP
THEMATIC PRIORITY: EeB.ICT.2013.6.4
PROJECT START DATE: 1st October 2013
DURATION: 36 Months

DELIVERABLE 6.1 Evaluation Methodology and Benchmarking Framework

Review History

Date        Submitted By / Reviewed By                               Version
24-02-15    Submitted by John Loane on behalf of the DKIT team       1
02-03-15    First Review: Aidan Melia, IES                           1
06-03-15    Second Review: Stephen Purcell, FAC                      1
09-03-15    Submitted by John Loane on behalf of the DKIT team       2

Dissemination Level
PU    Public
PP    Restricted to other programme participants (including the Commission Services)
RE    Restricted to a group specified by the consortium (including the Commission Services)
CO    Confidential, only for members of the consortium (including the Commission Services)

X

This project has received funding from the European Union’s Seventh Framework Programme for research, technological development and demonstration under grant agreement no. 608775


Table of Contents

EXECUTIVE SUMMARY
1 METHODOLOGY
  1.1 Introduction
  1.2 The User, User Categories and User Expectations
    1.2.1 Urban Planners
    1.2.2 Public Authorities
    1.2.3 Developers (Architects/Engineers/Designers)
    1.2.4 Main Contractors
    1.2.5 Technology Providers (ICT and RET)
    1.2.6 Material and Solution Manufacturers
    1.2.7 Energy Utility Companies
    1.2.8 R&D
  1.3 Overview of Evaluation Methodology and Benchmarking Framework
2 HEURISTIC EVALUATION
  2.1 Introduction
  2.2 Heuristics and Experts who will evaluate INDICATE
3 METRICS FOR VALIDATING INDICATE
  3.1 Introduction
  3.2 Usability Metrics
    3.2.1 Performance Metrics
      3.2.1.1 Task Success
      3.2.1.2 Time-on-task
      3.2.1.3 Errors
      3.2.1.4 Efficiency
      3.2.1.5 Learnability
    3.2.2 Issue-Based Metrics
    3.2.3 Self-Reported Metrics
  3.3 Ethics
  3.4 Number of Participants
  3.5 Combining Metrics to Give Single Usability Score
4 QUESTIONNAIRES
  4.1 Introduction
  4.2 System Usability Scale (SUS) applied to INDICATE
  4.3 Intuitiveness
  4.4 Microsoft Desirability
  4.5 INDICATE Survey Management and Processing Application
5 INTERVIEWS
  5.1 Introduction
  5.2 Interview Data Analysis
6 BENCHMARKING
  6.1 Introduction
  6.2 Predicted versus Real
  6.3 INDICATE vs Other Tools
7 ORGANIZATION OF EVALUATION AND BENCHMARKING ACTIVITIES
8 CONCLUSIONS
REFERENCES



EXECUTIVE SUMMARY
This document presents the intended evaluation methodology and benchmarking framework that will be used to evaluate the software tools produced in the INDICATE project. As the project evolves it is intended that this document will also evolve to better evaluate the actual tool produced in the project.

In chapter one we summarize results from D1.1 and D1.3, which identified stakeholders and their expectations of the INDICATE project. These users and their expectations are a crucial part of evaluating the project, as they are the people that the tool is being designed for. If the stakeholders are not able to use the tool or are not satisfied with its performance then the tool has failed. They also serve a very important purpose in giving formative feedback on the prototype tool during the development process. Finally in chapter one, we give an overview of the evaluation methodology and benchmarking framework. The evaluation methodology involves coming up with the metrics that will be used to assess INDICATE, expert heuristic evaluation of the tool and user evaluation of the tool implemented using task-based assessment, questionnaires and interviews.

Chapter two details the heuristic evaluation. The goal is to have experts identify any serious usability problems before end-user testing. Using heuristic evaluation prior to user testing will reduce the number and severity of usability problems discovered by users. However, the issues found in a heuristic evaluation are usually different to those identified by user testing, so one cannot replace the other.

Chapter three details the metrics and user evaluation of the system. The metrics are broken down into three broad categories: performance metrics, issue-based metrics and satisfaction metrics. Performance metrics include task success, time-on-task, errors, efficiency and learnability and are measured by observing the user trying to complete a series of tasks with the tool. Issue-based metrics identify anything that causes the user problems in completing the tasks. These issues will be identified by observing the user performing the tasks and asking the user at the end of each task about any difficulties encountered. Each issue will be categorised as low, medium or high severity depending on its effect on user performance of the task. Satisfaction metrics measure how users feel about the system and will be measured using a number of questionnaires, which produce both qualitative and quantitative data. The System Usability Scale produces a single number, which measures the usability of the system. The INTUI scale produces four numbers, which measure Gut Feeling, Verbalizability, Effortlessness and Magical Experience. The Microsoft Desirability toolkit asks the user to associate words with their experience of the software and rank them, and then leads to a follow-up interview where qualitative data about the experience is gathered. Finally in this chapter we detail the ethics that will be followed in interacting with users, the number of test users needed and how we will combine metrics to produce a single usability score for the tool.

Chapters four and five detail the questionnaires and interview formats that will be used and how the data will be analysed. Chapter six details how the tool will be benchmarked. We approach this from two sides. We will use real-time energy usage data gathered in the Living Lab based in DKIT to assess the Virtual City Model, Dynamic Simulation Model and algorithms developed in the INDICATE project. We will also ask the test users during the interviews about their experience of the INDICATE tool versus other tools that they use to carry out similar tasks. Finally, Chapter seven details the organization of the evaluation and benchmarking activities.



1 METHODOLOGY
In this chapter we describe the methodology that will be used to evaluate and benchmark the INDICATE tool. In section 1.1 we give an introduction to usability testing and note the importance of choosing the right test users. In section 1.2 we give a summary of work presented in D1.1 and D1.3, which detail the users that will test the INDICATE tool and their expectations of the tool. These individuals will be the test users we will use to carry out the tasks detailed in Chapters 3, 4 and 5. Finally in section 1.3 we give an overview of the evaluation methodology and benchmarking framework that will be used to assess the INDICATE tool.

1.1 Introduction
Usability testing is an essential part of the software development process. Usability has been defined by the ISO as the ‘effectiveness, efficiency and satisfaction with which a specified set of users can achieve a specified set of tasks in a particular environment’. Usability is an essential quality of any application and it is recognised that an iterative, user-centred usability testing process, whereby designs are iteratively evaluated and improved, is essential for developing usable applications (Hartson, Andre & Williges, 2003).

Research has considered usability experts, intended end users, novice evaluators and domain experts as participants of usability testing. Each participant brings a certain level of technical expertise, domain knowledge and motivation (Tullis & Albert, 2008). As such, the participant plays a vital role in the usability problems that are discovered. There are also other factors to consider which may influence the usability issues identified, including the tasks participants are asked to complete, the environment used, the observers and so on. While some research suggests five test users is the ‘magic number’ for identifying roughly 80% of the usability problems in web applications (Nielsen, 2000), others consider this view naïve and argue that a higher number of users is desirable (Spool & Schroeder, 2001; Woolrych & Cockton, 2001). However, there is still no agreement across usability practitioners on how many users are enough.

A number of usability testing methods exist and there are numerous varied opinions on both their practicality and their effectiveness. Usability inspection methods (UIMs), such as heuristic evaluations and cognitive walkthroughs, are relatively cheap to carry out (Bias, 1994; Nielsen, 1994b; Wharton et al, 1994). While heuristic evaluations are thorough, providing lots of ‘hits’ in terms of problems, there can also be many false alarms, as it is difficult to determine which problems will actually impede a user’s ability to successfully complete a given task. More recently, it has been argued that UIMs are not as effective as traditional user testing with real users, as UIMs ‘predict’ problems rather than report on problems observed with real users (Liljegren, 2006). However, using heuristic evaluation prior to user testing will reduce the number and severity of usability problems discovered by users. Furthermore, the issues found in a heuristic evaluation are usually different to those identified by user testing, so one cannot replace the other. Thus, both heuristic evaluation with expert users and usability testing with potential INDICATE users will constitute the INDICATE evaluation framework.

1.2 The User, User Categories and User Expectations
In this section we summarize work carried out in deliverables D1.1 and D1.3 to detail the users and their expectations of the INDICATE tool. These users and their expectations of the tool will be crucial to the testing that is detailed in chapters 3, 4 and 5. As part of the evaluation process we will ask these users to classify themselves along two axes: knowledge of the domain and extent of computer experience. This classification will be used in the analysis of the results to see if there is a difference between novice and expert users’ experience of the INDICATE tool. The users fall into eight different categories, which we summarize below.



1.2.1 Urban Planners
At the Genoa test site Pier Paolo Tomiolo is an architect who is responsible for urban planning in the regional administration. Dr. Gabriella Minervini works for the Department of Environment in the Regione Liguria; she is responsible for the environmental impact assessment of the new Galliera project. Maurizio Sinigaglia and Silvia Capurro are architects who are responsible for the development plan of the city of Genoa. Rita Pizzone is an architect with Architectural Heritage and Landscape of Liguria who is involved because the New Galliera project includes listed buildings and lies in a region of landscape protection. Simon Brun is an engineer who is responsible for new hospitals in Liguria. At the Dundalk test site Catherine Duff is an executive town planner at Louth County Council and has extensive experience in planning, including projects such as smarter planning. Table 1 below summarizes the expectations that these users have of the INDICATE tool. These expectations will be used to inform the tasks carried out to evaluate the INDICATE tool in chapter 3.

Table 1: Urban Planners Target Group

Need: Sustainable urban planning; optimize the community’s land use and infrastructure.
Expectation for INDICATE: Holistic vision of a city to enhance urban sustainability.

Need: Optimise efficiency and minimize energy use.
Expectation for INDICATE: Simulations to balance load and demand in real time.

Need: Understand interactions between buildings, RET, local distribution networks and the grid.
Expectation for INDICATE: Dynamic Simulation Modelling, which allows the interactions between the city and its subsystems to be modelled.

1.2.2 Public Authorities
At the Genoa test site Franco Giodice is an architect who is vice director of sector investments in the health and social services department of the Liguria region. At the Dundalk test site Louth County Council are partners in the INDICATE project. Padraig O’Hora is a senior executive engineer in the European and Energy office. Table 2 below summarizes the expectations that these users have of the INDICATE tool. These expectations will be used to inform the tasks carried out to evaluate the INDICATE tool in chapter 3.

Table 2: Public Authorities Target Group

Need: Plan a sustainable Smart city (Smart environment, Smart mobility, Smart living).
Expectation for INDICATE: Holistic vision of a city to enhance urban sustainability.

Need: Reduce energy consumption and carbon emissions.
Expectation for INDICATE: Dynamic Simulation Modelling.

Need: Integrate Renewable Energy Technologies (RET).
Expectation for INDICATE: 3D urban modelling to assess the impact of the integrated technologies.

Need: Optimise existing systems and increase energy efficiency.
Expectation for INDICATE: Solutions to model the interactions between buildings, installed systems and the grids.

Need: Support to define and validate regulations and directives in the urban environment.
Expectation for INDICATE: Tools able to understand how the regulatory requirements, the policies and the standards influence the approach taken to scheme development and the selection of any methodology.


1.2.3 Developers (Architects/Engineers/Designers)
At the New Galliera project in Genoa, Paola Brecia (OBR) has designed the urban scale and building envelope, Riccardo Curci (Steam) has designed the energy systems and their integration, and Lisiero (D’Appalonia) is the author of the environmental impact analysis. In Dundalk, David McDonnell is Chief Executive of the Smart Eco Hub and is responsible for creating business opportunities, living labs and stimulating innovation in the Dundalk region. Table 3 below summarizes the expectations that these users have of the INDICATE tool. These expectations will be used to inform the tasks carried out to evaluate the INDICATE tool in chapter 3.

Table 3: Developers Target Group

Need: Support to optimise existing buildings and integrate new technologies in the city environment.
Expectation for INDICATE: Tool to analyse the buildings’ and the districts’ environment and understand how the different infrastructures of the city are related to one another; 3D urban modelling to assess the impact of the integrated technologies.

Need: Centralized technology portfolio to evaluate different solutions.
Expectation for INDICATE: Dynamic Simulation Modelling, which allows evaluations of different solutions.

1.2.4 Main Contractors
In Dundalk, Kingspan and Glen Dimplex have been involved in carrying out many upgrades to local council houses. Table 4 below summarizes the expectations that these users have of the INDICATE tool. These expectations will be used to inform the tasks carried out to evaluate the INDICATE tool in chapter 3.

Table 4: Main Contractors Target Group

Need: Efficient project coordination.
Expectation for INDICATE: Simulation and energy-based DSM taking account of buildings and their interactions with the urban environment.

Need: Clear and global view of the process and actors involved.
Expectation for INDICATE: Solutions to connect decision makers and experts to enable exchange of experience and best practice.

1.2.5 Technology Providers (ICT and RET)
In Genoa, engineer Borgiorni of SIRAM, the energy management company of the hospital, has installed a combined engine for electricity and heating and is developing a programme of centralized HVAC control. In Dundalk, Derek Roddy is CEO of Climote, a company who provide remote control for home heating. Damian McCann is a corporate account manager with Viatel, Digiweb group. He has been involved with the Louth County Council Broadband Forum and is assisting with developing Dundalk as a smart town. Table 5 below summarizes the expectations that these users have of the INDICATE tool. These expectations will be used to inform the tasks carried out to evaluate the INDICATE tool in chapter 3.


Table 5: Technology Providers Target Group

Need: Increase market share of their technologies.
Expectation for INDICATE: Software to simulate and demonstrate the increase of energy efficiency with the integration of new technologies.

Need: Support to the development of new technologies and solutions.
Expectation for INDICATE: Solutions to analyse and compare the efficiency of different technologies and estimate ROI; software that provides a holistic vision of a city.

1.2.6 Material and Solution Manufacturers
In Dundalk Derek Roddy is CEO of Climote, a company who provide remote control for home heating. Table 6 below summarizes the expectations that these users have of the INDICATE tool. These expectations will be used to inform the tasks carried out to evaluate the INDICATE tool in chapter 3.

Table 6: Material and Solution Manufacturers Target Group

Need: Increase market share of their products.
Expectation for INDICATE: Software to simulate and demonstrate the increase of energy efficiency with the integration of new products.

Need: New market opportunities.
Expectation for INDICATE: Solutions to analyse and compare the efficiency of different technologies and estimate ROI.

Need: Support to test products and solutions in the city environment.
Expectation for INDICATE: 3D urban modelling to assess the impact of the installed solution.

1.2.7 Energy Utility Companies
In Genoa, engineer Borgiorni of SIRAM, the energy management company of the hospital, has installed a combined engine for electricity and heating and is developing a programme of centralized HVAC control. At the Dundalk site Declan Meally has worked with SEAI for 10 years and has been head of the Department of Emerging Sectors since 2012. Table 7 below summarizes the expectations that these users have of the INDICATE tool. These expectations will be used to inform the tasks carried out to evaluate the INDICATE tool in chapter 3.

Table 7: Energy Utilities Target Group

Need: Support to develop new energy-related solutions and to test and implement solutions.
Expectation for INDICATE: Tool to estimate the revenues and ROI for each infrastructure improvement.

Need: More competitive prices and tariff plans.
Expectation for INDICATE: Tool able to simulate the balance of load and demand in real time and to evaluate different tariff plans.



1.2.8 R&D
Barry Grennan (Xerox) has been the Business Centre Manager with Xerox since 2002, with responsibility for the colour toner manufacturing operation in Xerox’s Dundalk plant. Barry has a particular interest in Demand Side Management and is currently implementing it at the Xerox plant in Dundalk. Table 8 below summarizes the expectations that these users have of the INDICATE tool. These expectations will be used to inform the tasks carried out to evaluate the INDICATE tool in chapter 3.

Table 8: R&D Target Group

Need: Support to test new research and new solutions in the city environment.
Expectation for INDICATE: Dynamic Simulation Modelling, which allows different solutions to be evaluated.



1.3 Overview of Evaluation Methodology and Benchmarking Framework

Figure 1 below details the evaluation methodology. First we started with a review of the literature and ISO 9241-11 usability guidelines to come up with the metrics that we will use to evaluate the INDICATE tool. Of particular use here were (Tullis & Albert, 2008) and (Nielsen, 1993) both of which give pragmatic common sense approaches to evaluating software. Having identified the metrics we wish to measure we then detail how we go about measuring them. First we carry out a heuristic evaluation where a usability expert assesses the tool before it is used in user testing. In order to carry out user testing we first identify appropriate tasks and then ask the user to use the think aloud technique while carrying out the tasks. When all tasks and questionnaires are completed we then interview the user to further investigate the user experience.

Figure 1: Evaluation Methodology. The original flowchart links the literature and ISO 9241-11 usability guidelines to the usability goals, the heuristic evaluation guidelines (Nielsen) and the heuristic evaluation (Chapter 2), and to the performance and satisfaction metrics and task design guidelines that feed task design (Chapter 3), the questionnaires (Chapter 4) and the think aloud technique; interviews (Chapter 5) are followed by thematic analysis, and the analysis of results (D6.2) produces a list of fixes and improvements required, feeding back to the prototype (D6.2) and the final tool (D6.3, D6.4).



Figure 2 below details the benchmarking framework. Here we aim to benchmark the algorithms developed as part of the INDICATE tool by comparing their predictions against real time energy usage data gathered in the Living Lab located in DKIT. Access to the real energy usage data will allow us to benchmark the prediction algorithms and models including the Virtual City Model and the Dynamic Simulation Model. We will benchmark INDICATE against other tools by asking the users during the interview to compare their experiences with INDICATE against tools that they have previously used. We will ask users to compare them in terms of speed, user interface, data requirements and portability.

Figure 2: Benchmarking Framework. Benchmarking is split into two strands: Predicted vs Real and INDICATE vs Other Tools.



2 HEURISTIC EVALUATION

2.1 Introduction
Heuristic evaluation is a usability evaluation method (UEM) typically employed to identify usability problems in interactive systems. It is one of many expert review methodologies. With heuristic evaluation “the expert reviewers critique an interface to determine conformance with a short list of design heuristics” (Shneiderman and Plaisant, 2005). It is important that the experts are familiar with the heuristics and capable of interpreting and applying them. Formal expert reviews have proven to be effective as a starting point for evaluating new or revised interfaces (Nielsen and Mack, 1994). The expertise of such reviewers may be in the application area, or in user interface design. Typically, expert reviews can happen both at the start and end of a design phase. The output is usually a report highlighting any identified problems and recommending design changes that should be integrated before system deployment.

Heuristics are so called as they are rules of thumb rather than specific usability guidelines. However, they are well reported in the literature and often used in expert evaluation of interactive systems. Heuristic evaluation has a number of advantages, including the ability to provide quick and relatively inexpensive feedback to designers at an early stage of the design process.

2.2 Heuristics and Experts who will evaluate INDICATE
Within the INDICATE project, heuristic evaluation will take place at the beginning of the evaluation phase, once a pilot system is available for testing and before usability testing with end users begins. The goal is to have experts identify any serious usability problems before end-user testing. Using heuristic evaluation prior to user testing will reduce the number and severity of usability problems discovered by users. However, the issues found in a heuristic evaluation are usually different to those identified by user testing, so one cannot replace the other.

Guidelines that INDICATE will be evaluated against
There exist numerous guidelines and heuristics for evaluating interactive systems. However, the best known and most often used are the heuristics developed by Nielsen (Nielsen, 1994) and Shneiderman (Shneiderman and Plaisant, 2005). Jakob Nielsen outlined 10 general principles of interaction design (Nielsen, 1994):

1 Visibility of system status – The system should always keep users informed about what is going on, through appropriate feedback within a reasonable time.
2 Match between system and the real world – The system should speak the users’ language, with words, phrases and concepts familiar to the user, rather than system-oriented terms. Follow real-world conventions, making information appear in a natural and logical order.
3 User control and freedom – Users often choose system functions by mistake and will need a clearly marked “emergency exit” to leave the unwanted state without having to go through an extended dialogue. Support undo and redo.
4 Consistency and standards – Users should not have to wonder whether different words, situations or actions mean the same thing. Follow platform conventions.
5 Error prevention – Even better than good error messages is a careful design that prevents a problem from occurring in the first place. Either eliminate error-prone conditions or check for them and present users with a confirmation option before they commit to the action.



6 Recognition rather than recall – Minimise the user’s memory load by making objects, actions and options visible. The user should not have to remember information from one part of the dialogue to another. Instructions for use of the system should be visible or easily retrievable whenever appropriate.
7 Flexibility and efficiency of use – Accelerators – often unseen by the novice user – may speed up the interaction for the expert user such that the system can cater to both inexperienced and experienced users. Allow users to tailor frequent actions.
8 Aesthetic and minimalist design – Dialogues should not contain information which is irrelevant or rarely needed. Every extra unit of information in a dialogue competes with the relevant units of information and diminishes their relative visibility.
9 Help users recognize, diagnose and recover from errors – Error messages should be expressed in plain language (no codes), precisely indicate the problem and constructively suggest a solution.
10 Help and documentation – Even though it is better if the system can be used without documentation, it may be necessary to provide help and documentation. Any such information should be easy to search, focused on the user’s task, list concrete steps to be carried out and not be too large.

Shneiderman also outlined 8 golden rules of interface design – rules or principles that are applicable in most interactive systems and which were defined based on experience over two decades of design and evaluation work. These rules can be fine-tuned for the specific system being evaluated. They include:

1 Strive for consistency – E.g. identical terminology used throughout, consistent colour, layout, fonts etc.
2 Cater to universal usability – Recognise the needs of diverse users and design for plasticity. Novice-expert differences, age ranges, disabilities, and technology diversity should guide the design requirements. Implementing features for novices (such as explanations) and experts (such as short cuts) can improve the design.
3 Offer informative feedback – For every action the user carries out, there should be feedback from the system. This can be modest for frequent actions, or more substantial for infrequent actions.
4 Design dialogs to yield closure – Ensure that sequences of actions are grouped and have a beginning, a middle and an end.
5 Prevent errors – As much as possible, design the system so that users cannot make serious errors. If a user error occurs, provide simple, constructive and specific instructions to recover from that error.
6 Permit easy reversal of actions – Actions should be reversible, as much as possible. Allowing a user to ‘undo’ something reduces anxiety and encourages exploration of unfamiliar options.
7 Support internal locus of control – Experienced users like to sense that they are in charge of the system and that the system responds to their actions. Inability to perform a certain action, get the required data, or tedious sequences of actions can cause dissatisfaction and frustration.
8 Reduce short term memory load – Human short-term memory is limited in terms of its processing power, and research has shown that humans can adequately remember seven plus or minus two chunks of information. This means that interfaces should be kept simple and not overloaded with information.

Conducting the Evaluation

Define the Tasks
A starting point for the heuristic evaluation will be the evaluation of the tasks and actions (nouns and verbs) of the INDICATE system. These tasks will be defined prior to the heuristic evaluation and evaluators will be asked to carry them out and then comment on the corresponding interface objects and actions.


Identify the Users
As with all UEMs, there is no consensus on how many experts are required to carry out a heuristic evaluation. Some research suggests one expert is enough, whereas other research reports that different experts tend to find different problems with interfaces, and so recommends that 3-5 expert reviewers are recruited. While user interface design experts are familiar with the field of evaluating interactive systems, they may not be familiar with the application area. Therefore, we will recruit one interface expert as well as one application domain expert to conduct our heuristic evaluation.

Review the Heuristics
With an expert review, evaluators should know and understand the above heuristics to be able to assign a problem to one of them. Each evaluator will review the INDICATE system individually. INDICATE HCI researchers will combine Nielsen’s and Shneiderman’s heuristics, removing duplicates. Evaluators will be provided with a template with each heuristic outlined, asking the evaluator to evaluate the system against that particular heuristic and outline to what degree it has been satisfied. Evaluators will be asked to provide comments beside each heuristic and to use screen grabs if they feel this will help to illustrate their point.

Writing the Report
Once the evaluators have completed their evaluation template, all the feedback will be combined, the usability problems will be clustered into thematic areas and categorised into different levels of severity. Usability problems will be categorized as critical, serious, medium or low, using a decision tree (Travis, 2009). The report will rank recommendations by importance and expected effort level for redesign/implementation. The report will also outline a summary of all findings, including a list of all usability problems in a table, with their severity ranking, ease of fixing and the heuristic violated. Specific problem areas will be highlighted, including evidence of that problem occurring in the interface and a recommendation on how to resolve the issue.



3 METRICS FOR VALIDATING INDICATE

3.1 Introduction
In this chapter we will consider the metrics that will be used to assess the INDICATE tool. In section 3.2 we will introduce three kinds of metrics: performance metrics, issue-based metrics and self-reported metrics. In sections 3.2.1, 3.2.2 and 3.2.3 we will describe the metrics that will be used in the evaluation, how they will be gathered and how the data will be analysed. Finally we will consider how all of the data will be combined to produce an overall usability score for the INDICATE tool. This chapter follows methods detailed in (Tullis & Albert, 2008).

3.2 Usability Metrics
We can break usability metrics down into three broad categories: performance metrics, issue-based metrics and self-reported metrics. Performance is all about what the user actually does in interacting with the product. We will measure this by asking the users to perform specific tasks with the INDICATE tool, with the details given in section 3.2.1. Issue-based metrics will be gathered by questioning users after they have performed each task and are detailed in section 3.2.2. Self-reported or satisfaction metrics assess users’ overall experience of, and feelings about, the product and will be gathered by asking users to complete standard questionnaires after all tasks have been completed. The details are given in section 3.2.3 below.

3.2.1 Performance Metrics
Five types of performance metrics will be used in assessing INDICATE.
1. Task success measures how effectively users are able to complete a given set of tasks. We will detail sample tasks and report results as binary success.
2. Time-on-task measures how much time is spent on each task.
3. Errors are mistakes made during a task and will be used to find confusing or misleading parts of the interface.
4. Efficiency will be calculated using task success and time-on-task measures.
5. Learnability will allow us to measure how performance changes over time.

3.2.1.1 Task Success
The tasks detailed below are informed by the user expectations detailed above in section 1.2. For each task we will give a clear end state and define the criteria for success. Users will be asked to verbally articulate the answer after completing the task. Tasks will be given to the user one at a time to give a clear start condition to each task and to facilitate the timing of each task. These tasks will be refined when a prototype version of the INDICATE tool is available. Table 9 below gives some sample tasks.

Table 9: Tasks for Evaluation

Task 1
Aim: Holistic vision of a city.
Task: What are the total population, energy usage and renewable energy production in the city?
End Condition: Three values.
Success Criteria: Correct three values.

Task 2
Aim: Simulations to balance load and demand in real time.
Task: At what time of the day does maximum import from the grid happen? What are 3 options to meet this demand?
End Condition: Time, 3 options.
Success Criteria: Time, plus 3 options from a list of all possible options available before the task starts.

Task 3
Aim: Dynamic simulation modelling to model interactions between the city and its subsystems.
Task: Rank each subsystem on the demand (buildings, transport, public services) and supply (centralized, distributed) sides in terms of their energy consumption or production.
End Condition: Two sets of rankings.
Success Criteria: Two sets of rankings to be compared to values worked out before the task.

Task 4
Aim: 3D urban modelling to assess the impact of integrated technologies.
Task: Add 100 kWp of solar PV to buildings and assess their impact on the city – what is the expected output from the panels? How does this affect the city’s import from the grid? How will these panels affect the cityscape?
End Condition: PV output, impact on grid, impact on cityscape.
Success Criteria: Correct values for PV production and decrease in import from the grid, and whether the panels will have any impact on the cityscape.

Task 5
Aim: Understand regulatory requirements, policies and standards.
Task: Add a new school to the model and list the restrictions imposed on the model.
End Condition: A list of restrictions.
Success Criteria: List of restrictions from a list of options available before the task starts.

Task 6
Aim: Solutions to connect decision makers and experts to enable the exchange of experience and best practice.
Task: Following on from Task 5, list the experts suggested by the software to help with this task.
End Condition: List of experts.
Success Criteria: List of experts from a list of options available before the task starts.

Task 7
Aim: Demonstrate the increase in energy efficiency with the integration of new technologies.
Task: If we retrofit all homes in the city with triple-glazed windows, how would this affect the energy demand of the city?
End Condition: kWh value for decrease in demand.
Success Criteria: kWh value for decrease in demand.

Task 8
Aim: Analyse and compare the efficiency of different technologies and estimate the ROI for infrastructure investment.
Task: Compare Task 7 with adding a solar panel for hot water to the roof of all homes in the city. What would the kWh saving be in this case? What is the ROI for both tasks? Which is better value?
End Condition: kWh saving, ROI, which is better.
Success Criteria: kWh saving, ROI, which is better.

Task 9
Aim: Use DSM to evaluate different tariff plans.
Task: Given two local tariff plans for public services in the city, use DSM to assess which is better value for the local authority.
End Condition: A statement about which of the tariff plans is better value.
Success Criteria: Statement about which tariff plan is better value for the public authority.

Users will be given either a success (1) or a failure (0) for each task. We will use the numeric score to calculate the average as well as confidence intervals for the success of each task, as detailed in Table 10 below.


Table 10: Summary measures for task completion

          P1   P2   P3   P4   P5   Average   Confidence Interval (95%)
Task 1     1    1    1    1    0     80%          ±39%
Task 2     0    0    1    1    0     40%          ±48%
Task 3     1    1    1    1    1    100%           ±0%
Task 4     0    0    1    1    1     60%          ±48%
Task 5     0    1    1    1    1     80%          ±39%
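To make the calculation concrete, the short Python sketch below (illustrative only, not part of the INDICATE toolset) computes the mean completion rate and 95% confidence interval for the sample data in Table 10, using the sample standard deviation of the binary scores with a normal (z = 1.96) multiplier, which reproduces the intervals shown above.

```python
import statistics
from math import sqrt

# Binary task-success scores (1 = success, 0 = failure) for five
# participants, taken from the sample data in Table 10.
scores = {
    "Task 1": [1, 1, 1, 1, 0],
    "Task 2": [0, 0, 1, 1, 0],
    "Task 3": [1, 1, 1, 1, 1],
    "Task 4": [0, 0, 1, 1, 1],
    "Task 5": [0, 1, 1, 1, 1],
}

Z_95 = 1.96  # normal-approximation multiplier for a 95% interval

for task, values in scores.items():
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)            # sample standard deviation
    margin = Z_95 * stdev / sqrt(len(values))   # half-width of the interval
    print(f"{task}: {mean:.0%} +/- {margin:.0%}")
```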

Results will be presented on a per task level with average completion rate as well as a confidence interval for each measure reported as in Figure 3 below.

Figure 3: Presentation of results showing the mean task success rate (% successful) and 95% confidence interval for each task.

3.2.1.2 Time-on-task
The time-on-task measure will give us information about the efficiency of the tool: the shorter the time it takes to complete any of the tasks, the better. Time-on-task will be measured as the time from the user receiving the task to the time that they verbalize the answer. A stopwatch will be used to measure the time in seconds. A screen recording of the user’s screen and audio during the test will also be gathered and can be used to check timed events. Time data for each task will be tabulated and summary data, including average, median, maximum, minimum and confidence intervals, will be reported as in Table 11 below.


Table 11: Summary results for time-on-task (seconds)

          P1    P2    P3    P4    P5   Average   Median   Upper bound   Lower bound   Confidence Interval
Task 1   259   253    42    38    33     125       42         259            33            ±105
Task 2   112    64    51   108   142      95      108         142            51             ±33
Task 3   135   278    60   115    66     130      115         278            60             ±77
Task 4    58   160    57   146    47      93       58         160            47             ±48
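A similar sketch (again illustrative only) produces the summary statistics for the time-on-task data in Table 11, using the same normal-multiplier interval as above.

```python
import statistics
from math import sqrt

# Time-on-task in seconds for five participants (sample data from Table 11).
times = {
    "Task 1": [259, 253, 42, 38, 33],
    "Task 2": [112, 64, 51, 108, 142],
    "Task 3": [135, 278, 60, 115, 66],
    "Task 4": [58, 160, 57, 146, 47],
}

for task, values in times.items():
    mean = statistics.mean(values)
    margin = 1.96 * statistics.stdev(values) / sqrt(len(values))  # 95% CI half-width
    print(f"{task}: mean {mean:.0f}s, median {statistics.median(values):.0f}s, "
          f"range {min(values)}-{max(values)}s, CI +/- {margin:.0f}s")
```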

The time-­‐on-­‐task results will be presented by finding the average time for each task as well as confidence intervals and the results will be graphed as in Figure 4.

Figure 4: Time-on-task showing mean times (seconds) and 95% confidence intervals for each task.

3.2.1.3 Errors
Errors are incorrect actions that may lead to task failure or inefficiency. Errors can include entering incorrect data, making the wrong choice in a menu or dropdown list, taking an incorrect sequence of actions or failing to take a key action. Once we have a working prototype of the tool we will make a list of all possible actions a user can do with the tool and then define the different types of errors that can be made with the product. We will organize errors by task, with each task having multiple opportunities for error. We will record the number of errors for each task for each user, so the number of errors for each task will be between zero and the maximum number of errors for that task. The errors will be counted while observing the users completing each task and can be verified from the screen recordings. In order to see the tasks that are producing the most errors we will take the total number of errors per task and divide it by the total number of error opportunities to give an error rate. We will also calculate the average number of errors made by each participant for each task.
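The sketch below illustrates one way the error rate and the average errors per participant could be computed. The tasks, error counts and opportunity counts are hypothetical placeholders, since the real error taxonomy will only be defined once the prototype is available.

```python
# Illustrative sketch only: hypothetical error counts, not INDICATE data.
errors_per_participant = {
    "Task 1": [0, 2, 1, 0, 3],   # errors observed for P1..P5
    "Task 2": [1, 0, 0, 2, 1],
}
error_opportunities = {"Task 1": 6, "Task 2": 4}   # defined per task in advance

for task, counts in errors_per_participant.items():
    total = sum(counts)
    opportunities = error_opportunities[task] * len(counts)
    print(f"{task}: error rate {total / opportunities:.0%}, "
          f"average errors per participant {total / len(counts):.1f}")
```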


3.2.1.4 Efficiency
The Common Industry Format (CIF) for Usability Test Reports (NIST, 2001) specifies that the “core measure of efficiency” is the ratio of the task completion rate to the mean time per task, where time per task is commonly expressed in minutes. An example of calculating this efficiency metric is given in Table 12 below.

Table 12: Calculating Efficiency Metric

          Completion Rate (%)   Task Time (mins)   Percent Efficiency
Task 1            80                  1.5                  53
Task 2            60                  1.7                  35
Task 3           100                  1.0                 100
Task 4            40                  2.1                  19
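The same calculation in Python, using the values from Table 12 (illustrative only):

```python
# Efficiency as defined by the CIF: task completion rate divided by the mean
# time-on-task in minutes (sample data from Table 12).
tasks = {
    # task: (completion rate %, mean task time in minutes)
    "Task 1": (80, 1.5),
    "Task 2": (60, 1.7),
    "Task 3": (100, 1.0),
    "Task 4": (40, 2.1),
}

for task, (completion, minutes) in tasks.items():
    efficiency = completion / minutes   # percent efficiency
    print(f"{task}: {efficiency:.0f}% efficient")
```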

The results can be presented in graph form by showing the average efficiency metric for each task as shown in Figure 5 below.

Figure 5: Efficiency metric (percent efficiency, completion/time) for each task.

3.2.1.5 Learnability
Learnability is a measure of how easy it is to learn something and can be measured by examining how much time and effort is required to become proficient at something. We will measure this by carrying out the tasks detailed in Table 9 five times with a single user, with a gap of two weeks between each trial. For each trial we will use average time-on-task, averaged over all tasks. The data will be presented as in Figure 6 below and the slope of the curve will indicate how difficult the system is to learn. We also hope to see a flattening out of the curve, which indicates that users have no more to learn about the system and have reached maximum performance. We hope that five iterations will be enough to reach this point.



Figure 6: Learnability, showing mean time-on-task (seconds) across Trials 1 to 5.
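As a simple illustration of how the learnability data might be summarised, the sketch below computes the trial-to-trial improvement from hypothetical mean time-on-task values; the numbers are placeholders, not measurements.

```python
# Hypothetical mean time-on-task values (seconds, averaged over all tasks) for
# one user across five trials spaced two weeks apart; placeholders only.
trial_means = [52, 38, 30, 26, 25]

# Improvement between consecutive trials; differences approaching zero suggest
# the curve is flattening and the user has little left to learn.
gains = [earlier - later for earlier, later in zip(trial_means, trial_means[1:])]
print("Trial-to-trial improvement (s):", gains)            # [14, 8, 4, 1]
relative = (trial_means[0] - trial_means[-1]) / trial_means[0]
print(f"Overall improvement: {relative:.0%}")               # 52%
```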

3.2.2 Issue-Based Metrics
In order to gather issue-based metrics we will use an in-person, task-based study and ask the user to think aloud. Users will be asked to verbalize their thoughts as they work through the tasks, reporting what they are doing, what they are trying to accomplish, how confident they are about their decisions, their expectations and why they performed certain actions. At the end of each task from Table 9 users will be asked to rate the usability of the system for that task. Observers will also look out for verbal expressions of confusion, frustration, dissatisfaction, pleasure or surprise, as well as non-verbal behaviours such as facial expressions. If the user provides a low usability score they will be asked to explain what the problem was and why they rated the system that way.

We will use a three-level system to classify the severity of usability issues. Severity ratings will be assigned by the observer based on observation of the user and questioning of the user after each task is complete. The three levels are:

Low: Any issue that annoys or frustrates users but does not play a role in task failure. This issue may only reduce efficiency or satisfaction a small amount.
Medium: Any issue that contributes to but does not directly cause task failure. These issues have an impact on effectiveness, efficiency and satisfaction.
High: Any issue that directly leads to task failure. These issues have a big impact on effectiveness, efficiency and satisfaction.

We will report the number of unique usability issues identified, classified by severity rating. These unique issues will also be documented and form part of the report for WP6.2, which will feed back into the design of the final tool. We will use graphs such as Figure 7 below to summarize these issues.



Figure 7: Number of unique usability issues per design iteration, ranked by severity (Low, Medium, High).
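The sketch below shows one possible way of tallying unique issues by severity for reporting; the issues listed are hypothetical examples, not findings.

```python
from collections import Counter

# Hypothetical issue log from one round of testing; each unique issue is
# recorded once with the severity assigned by the observer.
issues = [
    ("Units not shown on energy demand chart", "Low"),
    ("PV panel dialog hides the Save button", "High"),
    ("Tariff comparison requires an unexplained file format", "Medium"),
    ("Search for experts returns unsorted results", "Low"),
]

counts = Counter(severity for _, severity in issues)
for severity in ("Low", "Medium", "High"):
    print(f"{severity}: {counts.get(severity, 0)} unique issue(s)")
```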

3.2.3 Self-Reported Metrics
Self-reported metrics give information about users’ perception of the tool. They express how users feel about the system and whether they enjoy using it or not. We will use three standard questionnaires to gather these metrics:

System Usability Scale: This consists of ten statements to which users rate their level of agreement. Half the statements are positively worded and half negatively worded. A 5-point scale is used for each. A technique for combining the ten ratings into an overall score (on a scale from 0 to 100) is given, producing an overall usability score with 100 representing a perfect score. We give full details in section 4.2.

Intuitive Interaction: The INTUI model explores the phenomenon of intuitive interaction from a User Experience (UX) perspective. It combines insights from psychological research on intuitive decision making and user research in HCI, as well as insights from interview studies into subjective feelings related to intuitive interaction. This phenomenological approach acknowledges the multi-dimensional nature of the concept and also reveals important influencing factors and starting points for design. The INTUI model suggests four components of intuitive interaction, namely Gut Feeling (G), Verbalizability (V), Effortlessness (E) and Magical Experience (X). We give full details in section 4.3.

Microsoft Desirability Toolkit: This is made up of 118 product reaction cards containing words such as “Useful”, “Consistent” and “Sophisticated”. On completion of a usability test, users are asked to sort through the cards and pick the top five that most closely match their personal reactions to the system they have just used. These five cards then become the basis of a post-test interview. We give full details in section 4.4.

These three questionnaires will be administered after the user has completed the usability test. Full details of how the results will be processed are given in Chapter 4 below.



3.3 Ethics
We will follow guidelines from (Nielsen, 1993) in conducting the usability tests.

Before the test we will:
• Have everything ready before the user shows up.
• Emphasize that it is the system that is being tested, not the user.
• Acknowledge that the software is new and untested, and may have problems.
• Let users know that they can stop at any time.
• Explain the screen recording and the actions of the observer.
• Tell the user that the test results will be kept completely confidential.
• Make sure that we have answered all of the user’s questions before proceeding.

During the test we will:
• Try to give the user an early success experience.
• Hand out the test tasks one at a time.
• Keep a relaxed atmosphere in the test room, serve tea/coffee and take breaks.
• Avoid disruptions: close the door and post a sign. Disable the telephone.
• Never indicate in any way that the user is making mistakes or is too slow.
• Minimize the number of observers at the test.
• Not allow the user’s management to observe the test.
• If necessary, stop the test if it becomes too unpleasant.

After the test we will:
• End by stating that the user has helped us find areas of improvement.
• Never report results in such a way that individual users can be identified.
• Only show screen recordings outside the usability group with the user’s permission.



3.4 Number of Participants
There will be two distinct rounds of testing in order to evaluate the tool. The first round will result in Deliverable 6.2 in month 25 and can be characterized as a formative usability test. For this test we will use 5 users in Dundalk and 5 users in Genoa. This number comes from research that shows that about 80% of usability issues will be observed with the first five participants (Lewis, 1994; Nielsen & Landauer, 1993; Virzi, 1992). As shown in Figure 8 below, with 10 users we will have a 90% chance of detecting a problem that affects 31% of users and a 65% chance of detecting a problem that affects 10% of users. We will also use the two groups to assess any differences between users in Dundalk and Genoa and any differences between novice and expert users.

Figure 8: Difference in sample sizes needed to have an 85% chance of detecting a problem that affects 10% of users vs 32% of users, (Source http://www.measuringu.com/five-­‐users.php).

For the final round of testing, which will result in Deliverables 6.3 and 6.4 in month 35 and is a summative assessment of the tool, we will use 9 users in Dundalk and 9 users in Genoa. This will give us an 85% chance of detecting a problem that affects 10% of users and again allow analysis of Dundalk and Genoa users and novice and expert users.
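These sample sizes follow the standard detection model of (Nielsen & Landauer, 1993): the probability of observing, at least once, a problem that affects a proportion p of users in a test with n participants is 1 - (1 - p)^n. The short sketch below (illustrative only) reproduces two of the figures quoted above.

```python
# Standard detection model: probability of seeing at least once a problem
# that affects a proportion p of users in a test with n participants.
def detection_probability(p: float, n: int) -> float:
    return 1 - (1 - p) ** n

print(f"{detection_probability(0.10, 10):.0%}")  # ~65% chance with 10 users
print(f"{detection_probability(0.10, 18):.0%}")  # ~85% chance with 18 users (final round)
```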



3.5 Combining Metrics to Give Single Usability Score
In order to combine the different metrics into an overall usability score we will compare each data point to a usability goal and present one single metric based on the number of users who achieved a combined set of goals. These goals will be finalized when a prototype version of the software is available. Table 13 below gives a sample of how this will be calculated. For this table the goals for each task are 80% task completion, average time-on-task of less than 410 seconds, an average of less than 5 errors per task, efficiency of 75%, a SUS score above 66% and Gut Feeling, Verbalizability, Effortlessness and Magical Experience scores of more than 5. With this sample data the overall usability score is 50%, representing the fact that two out of four users have met all of the goals for each task.

Table 13: Combined Metrics

Participant 1: Task Completion 85%, Time on Task 300 s, Errors 2, Efficiency 80%, SUS 80, Gut Feeling 6, Verbalisability 6, Effortlessness 6, Magical Experience 6, Goal Met: 1
Participant 2: Task Completion 70%, Time on Task 250 s, Errors 4, Efficiency 80%, SUS 77, Gut Feeling 5, Verbalisability 5, Effortlessness 5, Magical Experience 5, Goal Met: 0
Participant 3: Task Completion 90%, Time on Task 400 s, Errors 3, Efficiency 85%, SUS 60, Gut Feeling 7, Verbalisability 6, Effortlessness 7, Magical Experience 6, Goal Met: 1
Participant 4: Task Completion 82%, Time on Task 450 s, Errors 5, Efficiency 90%, SUS 66, Gut Feeling 3, Verbalisability 4, Effortlessness 6, Magical Experience 6, Goal Met: 0
Average: Task Completion 81.75%, Time on Task 350 s, Errors 3.5, Efficiency 83.75%, SUS 71, Gut Feeling 5.25, Verbalisability 5.25, Effortlessness 6, Magical Experience 5.75, Goal Met: 50%
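The sketch below illustrates the combination rule: a participant contributes to the overall score only if every individual goal is met. The data are hypothetical (chosen so that two of four participants meet all goals, as in the example above), and the goal thresholds are those stated in the text, interpreted here as minimum acceptable values.

```python
# Hypothetical sample data (not the exact values of Table 13), used only to
# illustrate how the single usability score would be derived.
participants = [
    {"completion": 85, "time": 290, "errors": 2, "efficiency": 82, "sus": 78, "intui": [6, 6, 7, 6]},
    {"completion": 72, "time": 260, "errors": 4, "efficiency": 80, "sus": 75, "intui": [5, 6, 5, 5]},
    {"completion": 90, "time": 395, "errors": 3, "efficiency": 85, "sus": 72, "intui": [7, 6, 7, 6]},
    {"completion": 81, "time": 440, "errors": 5, "efficiency": 88, "sus": 68, "intui": [4, 5, 6, 6]},
]

def meets_goals(p):
    """True only if every individual usability goal is met."""
    return (p["completion"] >= 80          # 80% task completion
            and p["time"] < 410            # average time-on-task under 410 s
            and p["errors"] < 5            # fewer than 5 errors per task
            and p["efficiency"] >= 75      # efficiency of 75%
            and p["sus"] > 66              # SUS score above 66
            and all(score > 5 for score in p["intui"]))  # G, V, E, X above 5

score = sum(meets_goals(p) for p in participants) / len(participants)
print(f"Overall usability score: {score:.0%}")   # 50%
```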



4 QUESTIONNAIRES

4.1 Introduction
We will use three questionnaires (the System Usability Scale, Intuitive Interaction and the Microsoft Desirability Toolkit) to gather users’ personal feelings about the system after they have completed the usability test. These questionnaires will result in a mixture of quantitative and qualitative scores, as detailed in the sections below.

4.2 System Usability Scale (SUS) applied to INDICATE
The System Usability Scale is a ‘quick and dirty’ method that allows reliable, low cost assessment of usability. It is a simple 10-item scale giving a global view of subjective assessments of usability. SUS covers a variety of aspects of system usability, such as the need for support, training and complexity. Within INDICATE, SUS will be administered to participants who have just evaluated the INDICATE system, and before any interview or debriefing takes place. SUS yields a single number representing a composite measure of the overall usability of the system being studied. Scores for individual items are meaningless on their own. SUS scores have a range of 0 to 100, where 100 indicates a more usable system. The ten SUS statements are listed below; each is rated on a 5-point scale from 1 (Strongly disagree) to 5 (Strongly agree).

1. I think that I would like to use this application frequently.
2. I found the application unnecessarily complex.
3. I thought the application was easy to use.
4. I think that I would need the support of a technical person to be able to use this application.
5. I found the various functions in this system were well integrated.
6. I thought there was too much inconsistency in this system.
7. I would imagine that most people could learn to use this application very quickly.
8. I found the application very cumbersome to use.
9. I felt confident using the application.
10. I needed to learn a lot of things before I could get going with the application.

In order to interpret the SUS scores we will average the scores from all users and generate confidence intervals. These numbers will then be compared to the data in Figure 9 (Tullis and Albert, 2008). In a review of 50 studies that reported average SUS scores across a total of 129 conditions, Tullis and Albert found that the average SUS score was 66% with a median of 69%; the 25th percentile was 57% and the 75th percentile was 77%. We will therefore regard an average SUS score under 60% as poor and one over 80% as good.
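A minimal sketch of the averaging and confidence-interval step (hypothetical scores; a t-based 95% interval is one reasonable choice, as the document does not prescribe a particular method):

```python
from math import sqrt
import statistics

sus_scores = [72.5, 65.0, 80.0, 77.5, 60.0, 85.0, 70.0, 67.5]  # hypothetical per-user SUS scores

n = len(sus_scores)
mean = statistics.mean(sus_scores)
sd = statistics.stdev(sus_scores)
t_crit = 2.365  # two-sided 95% t critical value for n - 1 = 7 degrees of freedom
margin = t_crit * sd / sqrt(n)
print(f"Mean SUS {mean:.1f}, 95% CI [{mean - margin:.1f}, {mean + margin:.1f}]")
```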

Figure 9: Frequency distribution of average SUS scores for 129 conditions from 50 studies; the x-axis shows average SUS score bins from <=40 to 91-100 and the y-axis shows frequency (Source: measuringuserexperience.com).


4.3 Intuitiveness

The following details of the INTUI model come from the intuitive interaction website (http://intuitiveinteraction.net/model/):
“The INTUI model explores the phenomenon of intuitive interaction from a User Experience (UX) perspective. It combines insights from psychological research on intuitive decision making and user research in HCI as well as insights from interview studies into subjective feelings related to intuitive interaction. This phenomenological approach acknowledges the multi-dimensional nature of the concept and also reveals important influencing factors and starting points for design. The INTUI model suggests four components of intuitive interaction, namely Gut Feeling (G), Verbalizability (V), Effortlessness (E) and Magical Experience (X). Intuitive interaction is typically experienced as being guided by feelings. It is an unconscious, non-analytical process. This widely parallels what we know from research in psychology about intuitive decision making in general. For example, Hammond (1996) describes intuition as a "cognitive process that somehow produces an answer, solution, or idea without the use of a conscious, logically defensible step-by-step process." In consequence, the result of this process, i.e. the insight gained through intuition, is difficult to explain and cannot be justified by articulating the logical steps behind the judgment process. Despite the complex mental processes underlying intuitive decisions, the decision maker is not aware of this complexity, and the process of decision making is perceived as rather vague, uncontrolled and guided by feelings rather than reason. Intuition is simply perceived as a "gut feeling". This also became visible in our user studies and people's reports on intuitive interaction with different kinds of products. Many participants based their judgment of a product's intuitiveness on the fact that they used it without conscious thinking and just followed what felt right. Users may not be able to verbalize the single decisions and operating steps within intuitive interaction. Researchers in the field of intuitive decision making have discussed different mechanisms that could also be relevant for users' decisions while interacting with technology. For example, Wickens et al. (1998) argue that this is because intuitive decisions are based on stored memory associations rather than reasoning per se. Another factor is implicit learning. Gigerenzer (2013) argues that especially persons with high experience in a specific subject make the best decisions but, nevertheless, are the most incapable when it comes to explaining their decisions. They apply a rule but are unaware of the rule they follow. This is because the rule was never learnt explicitly but relies on implicit learning, and this missing insight into the process of knowledge acquisition means that it is hardly memorable or verbalizable. The aspect of decision making without explicit information also becomes visible in the position of Westcott (1968), stating that “intuition can be said to occur when an individual reaches a conclusion on the basis of less explicit information than is ordinarily required to reach that conclusion.” Similarly, Vaughan (1979) describes the phenomenon of intuition as “knowing without being able to explain how we know”. Klein (1998) rather sees the reasons for the missing verbalizability of intuitive decisions in the nature of human decision making per se. He claims that people in general have difficulties with observing themselves and their inner processes and, thus, have trouble explaining the basis of their judgments and decisions.
Intuitive interaction typically appears quick and effortless. In our studies, many users emphasized that they handled the product without any strain. Before starting conscious thinking, they had already reached their goal. This is also mirrored in descriptions of intuitive decision making in psychology. For example, Hogarth (2001) claims that “The essence of intuition or intuitive responses is that they are reached with little apparent effort, and typically without conscious awareness.” In general, intuition produces quick answers and tendencies of action; it allows for the extraction of relevant information without making use of slower, analytical processes. On a neuronal basis, the quick decision process may be explained by the much faster speed of unconscious processing (Baars, 1988; Clark et al., 1997).

Intuitive interaction is often experienced as magical. In our studies in the field of interactive products, this was reflected in enthusiastic reactions where users emphasized that the interaction was something "special", "extraordinary", "stunning", "amazing", "absolutely surprising", or even "magical". Research in the field of intuitive decision making reveals a number of mechanisms that may add to this impression. First of all, most people are not aware of the cognitive processes and the prior knowledge underlying intuition, so that intuition appears to be a supernatural gift (Cappon, 1994). They are not aware that they acquired that knowledge themselves rather than receiving it by magic or revelation. And even if one knows about intuitive processing and the role of prior knowledge, it is still not directly perceivable. As Klein (1998) argues, the access to previously stored memories usually does not activate single, specific elements but rather refers to sets of similar elements. This aggregated form of knowledge makes one's own contribution to intuition hard to grasp, and people may not become aware of the actual source of their intuition. In the field of interactive products, the experience of magical interaction may further be supported by introducing a new technology or interaction concept so far not applied in that product domain (e.g., introducing the scroll wheel in the domain of mp3 players).”

The questions from INTUI are listed below in Figure 10.

Figure 10: INTUI questionnaire


The INTUI survey will result in metrics for Gut Feeling (G), Verbalizability (V), Effortlessness (E) and Magical Experience (X). For each metric we will report the average and confidence intervals. We do not have normative data comparable to that available for SUS, so we will use these metrics to compare different iterations of the INDICATE tool against each other, looking for an increase in these values from prototype to final system.

4.4 Microsoft Desirability

Traditional usability testing is an excellent way of measuring whether users can complete tasks efficiently. However, it has been less successful at measuring intangible aspects of user experience, such as the desirability of continuing to use a product. During the post-evaluation interview, we will integrate a measure based on Microsoft's Desirability Toolkit (Benedek & Miner, accessed 2008), whereby users will be presented with a list of 118 adjectives (both positive and negative) on separate cue cards. Participants will be asked to choose all adjectives they feel applied to their usage of INDICATE, and the evaluator will record the choices. From this list, the participant will then be asked to choose the 5 adjectives that most closely match their personal reaction to the system. These five adjectives will be used by the evaluator as the basis for a guided interview. This type of measure is a particularly good way to detect usability problems, as it can uncover user reactions and opinions that might not come to light with a questionnaire alone. Furthermore, presenting users with both positive and negative adjectives encourages critical responses, which is important when uncovering usability problems. This method of data capture will be embedded as a 20-minute workshop session after task completion. The technique results in qualitative data; the most important data comes from the discussion with the participant to determine their reaction to each chosen adjective and how they apply it to the product being evaluated.

4.5 INDICATE Survey Management and Processing Application

In order to administer the tasks and questionnaires we have developed a survey management and processing framework. This is a Ruby on Rails web application with a MySQL database and an HTML5-compatible user interface. An overview of the application architecture is given in Figure 11 below.

Figure 11: Survey Management and Processing Application (main components: Surveys, Schedules, Question List, Questions, Answers, Users and the MySQL database)



A survey consists of a set of scheduled questions that are created and managed using a web-based interface, which stores entries in a MySQL relational database. Users are added to the survey and are asked questions at intervals determined by the schedule. As survey questions are answered, the results are stored in the application database, processed to produce final scores according to the standard rules for each questionnaire, and can be visualised using the web interface. For example, for the SUS questionnaire we first sum the score contributions for each item. Each item's score contribution ranges from 0 to 4. For items 1, 3, 5, 7 and 9, the score contribution is the scale position minus one. For items 2, 4, 6, 8 and 10, the contribution is 5 minus the scale position. We then multiply the sum of the contributions by 2.5 to obtain the overall SUS score. Innovative features of this application are instant evaluation of survey results, removal of transcription errors and inclusion of task timing in survey results. The schema for the database is shown in Figure 12 below.

Figure 12: Database schema for Survey Management App
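As an illustration of the SUS scoring rule described above, a minimal sketch (in Python for readability; the survey application itself is implemented in Ruby on Rails):

```python
def sus_score(responses):
    """Compute a SUS score from ten item responses on a 1-5 scale.

    Odd-numbered items (1, 3, 5, 7, 9) contribute (response - 1);
    even-numbered items (2, 4, 6, 8, 10) contribute (5 - response).
    The summed contributions are multiplied by 2.5 to give a 0-100 score.
    """
    if len(responses) != 10 or not all(1 <= r <= 5 for r in responses):
        raise ValueError("SUS needs ten responses, each between 1 and 5")
    contributions = [
        (r - 1) if i % 2 == 0 else (5 - r)  # index 0 is item 1, index 1 is item 2, ...
        for i, r in enumerate(responses)
    ]
    return 2.5 * sum(contributions)

# Example: a fairly positive (hypothetical) set of responses.
print(sus_score([4, 2, 5, 1, 4, 2, 5, 2, 4, 1]))  # 85.0
```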



5 INTERVIEWS

5.1 Introduction

When each participant has completed the usability test and questionnaires, we will hold a short semi-structured interview to gauge their experience of using the INDICATE tool. The semi-structured approach allows the questioning to probe more deeply into interesting issues as they arise. We will follow a top-down approach, starting with a general question regarding the overall task and progressing to more specific questions that encourage the user to elaborate on their responses. Interviews are a good evaluation method for eliciting user preferences and attitudes. They may also reveal problems that were not observed during task completion. The INDICATE evaluation questions will be planned in advance, with a series of questions centred on an overall evaluation question. Interviews are not a controlled experimental technique. We will ensure, however, that interviews with different participants are as consistent as possible; the evaluator may choose to adapt the questions for different participants to get the most benefit.

5.2 Interview Data Analysis

Each INDICATE participant interview will be audio recorded and transcribed verbatim. The interviews will produce qualitative data for analysis. We will perform thematic analysis on the data, using a grounded theory approach, and will use NVivo to manage the data. The general approach to the analysis of qualitative data involves four stages:

1. Collect the data, organise it and prepare it for analysis.
2. Code or ‘describe’ the data.
3. Classify or categorise the data into themes.
4. Look for connections in the data, interpret them and provide explanation or meaning.

Qualitative data analysis is more susceptible to bias than quantitative analysis, as people perform the coding. To control the impact of individual researcher interpretation, we will employ a commonly used coding technique (emergent coding), have two researchers experienced in thematic analysis and coding perform the coding on transcripts, and employ statistical methods to evaluate validity and reliability. This approach is recommended and outlined in the textbook “Research Methods in Human-Computer Interaction” (Chapter 11, Analyzing Qualitative Data) and is discussed in more detail below.

5.2.1 Analysing Text Content

Coding is the term used to denote analysis of text-based content. Coding involves “interacting with data, making comparisons between data and in doing so, deriving concepts to stand for those data, then developing those concepts in terms of their properties and dimensions” (Corbin and Strauss, 2008, p. 66). We will employ emergent coding in the analysis of our data. Two researchers will independently examine a subset of the text-based data from the interview transcripts (specifically, one interview transcript), and each will develop a list of coding categories based on their own interpretation of the data. Both researchers will then compare their lists, examine and discuss the differences, and decide on a list that both agree on. Next, the codes of both coders will be compared and reliability measures computed. If a high reliability score is achieved, both researchers can move on to coding the entire data set. Otherwise, the above process needs to be repeated until a satisfactory reliability score is achieved.


The next step is to identify concepts or categories from the codes. We will use a mixed-methods approach to identifying coding categories, including:

• Examining existing theoretical frameworks. A number of taxonomies have been created in the HCI field to help understand data from usability studies, for example Norman's taxonomy of mistakes and slips, which includes categories such as ‘description errors’, ‘data-driven errors’ and ‘mode errors’ (2002).
• Researcher-denoted concepts (new concepts that arise that might not be covered in existing taxonomies, for example concepts specific to INDICATE).

We will build a code structure, a hierarchy of concepts with each level representing more detail. This will support comparison of the data. Comparisons will be made within each coding category, between different participants (e.g. experts vs novices) and with the literature.

5.2.2 Ensuring validity and reliability

Given the possibility of researcher bias in performing thematic analysis of interview data, we must ensure the analysis is valid and reliable. Validity refers to using well-documented procedures to increase the accuracy of results; we will follow the procedure outlined in Lazar et al. (2012). Reliability refers to consistency: if two researchers independently coding the data come to the same conclusions, then the analysis is considered reliable. This can be measured by calculating the percentage of agreement between coders.
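A minimal sketch of the reliability calculation: simple percentage agreement as named above, plus Cohen's kappa as one common chance-corrected companion measure (the use of kappa is an assumption for illustration, not specified in the text):

```python
from collections import Counter

def percent_agreement(codes_a, codes_b):
    """Proportion of segments to which both coders assigned the same code."""
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return matches / len(codes_a)

def cohens_kappa(codes_a, codes_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(codes_a)
    observed = percent_agreement(codes_a, codes_b)
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical codes assigned by two coders to ten interview segments.
coder1 = ["nav", "nav", "data", "error", "nav", "data", "data", "error", "nav", "data"]
coder2 = ["nav", "data", "data", "error", "nav", "data", "nav", "error", "nav", "data"]
print(percent_agreement(coder1, coder2))        # 0.8
print(round(cohens_kappa(coder1, coder2), 2))   # ~0.69
```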

5.2.3 Interview analysis report

The output of this piece of evaluation will be a report outlining the categories of findings, a description of the relationships amongst the data and a reliability measure.



6 BENCHMARKING

6.1 Introduction

In this chapter we detail the benchmarking of the INDICATE tool. We will approach this in two ways. First, we will use real-time energy usage data gathered in the Living Lab at DKIT to assess how accurate the predictions of the INDICATE tool are. Secondly, while conducting the interviews we will ask users how the INDICATE tool compares to the existing tools that they already use.

6.2 Predicted versus Real

The first stage in benchmarking the INDICATE tool will be to compare its predictions for energy consumption against real-time energy usage data gathered in the Living Lab at DKIT. The energy monitoring in the Living Lab, based at DKIT in Dundalk, consists of real-time monitoring of 16 apartments in Great Northern Haven, 10 council houses in Muirhevena Mór, a local school (O’Fiaich College) and the DKIT campus. All of these sources, other than the DKIT campus, have monitoring installed which allows real-time energy usage data to be sent to a cloud-based data aggregation service that gathers the data and stores it in a database at DKIT. The data in this database is then processed automatically to extract daily and hourly energy usage information.
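A minimal sketch of the kind of aggregation those processing scripts perform, grouping time-stamped readings into hourly and daily totals (hypothetical readings; the real scripts and database schema are not reproduced here):

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical (timestamp, kWh) readings as received from the aggregation service.
readings = [
    ("2015-01-05 09:10", 0.12),
    ("2015-01-05 09:40", 0.15),
    ("2015-01-05 10:05", 0.09),
    ("2015-01-06 09:20", 0.11),
]

hourly = defaultdict(float)
daily = defaultdict(float)
for stamp, kwh in readings:
    ts = datetime.strptime(stamp, "%Y-%m-%d %H:%M")
    hourly[ts.strftime("%Y-%m-%d %H:00")] += kwh   # bucket by hour
    daily[ts.strftime("%Y-%m-%d")] += kwh          # bucket by day

print({k: round(v, 2) for k, v in hourly.items()})
print({k: round(v, 2) for k, v in daily.items()})
```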


Figure 13: Monitoring kit in Muirhevena Mór homes showing, clockwise from top left: gas meter, temperature and humidity sensor, comms kit and electricity sensors


Figure 13 above shows some of the hardware used to collect gas, electricity, temperature and humidity data at one of the test sites. All of the above systems other than the DKIT campus send real-time energy usage data to a cloud-based energy data aggregator (Figure 14). Data from the DKIT campus is currently entered into the system manually, but this is something that we are actively looking to automate. The aggregator is essentially a database where the usage data is stored, along with a number of scripts that process the data and extract hourly and daily usage figures; this processed data is again stored in the database.


Figure 14: Community Energy Data Store

This data can be graphed and presented as in Figure 15 below. We will use this data to test the predictions and algorithms developed in the INDICATE tool.

Figure 15: O’Fiaich College solar PV production for 2012
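One simple way to report this predicted-versus-real comparison is a mean absolute percentage error over matching periods; MAPE is used here purely for illustration, as the document does not fix a particular error metric:

```python
def mape(predicted, measured):
    """Mean absolute percentage error between predicted and measured values."""
    return 100 * sum(abs(p - m) / m for p, m in zip(predicted, measured)) / len(measured)

# Hypothetical daily energy figures (kWh): INDICATE predictions vs Living Lab measurements.
predicted = [118.0, 126.5, 101.2, 97.8]
measured = [112.4, 131.0, 104.9, 92.3]
print(f"MAPE: {mape(predicted, measured):.1f}%")
```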

6.3 INDICATE vs Other Tools

We will benchmark the INDICATE tool against other tools by asking users at interview about their experiences of using both and which they would prefer for specific tasks. We will ask them to compare the tools along a range of metrics including speed, user interface, data requirements and portability.



7 ORGANIZATION OF EVALUATION AND BENCHMARKING ACTIVITIES

The evaluation and benchmarking activities will be organised in the following sequence:

1. Pilot of all tasks, questionnaires and interviews with the prototype GUI. Inputs: D5.1 (due in month 24) will deliver the prototype GUI, D4.2 (due in month 19) will deliver the DSM model and D4.3 (due in month 24) will deliver the first prototype of the VCM.
2. Heuristic evaluation of the prototype GUI.
3. Formative assessment of the tool using user testing, questionnaires and interviews. The heuristic evaluation and formative assessment will result in D6.2 INDICATE Functional Testing, Usability and Performance Evaluation, due in month 25. This will give a list of improvements needed for the final tool.
4. Summative assessment of the final INDICATE tool. Inputs: D5.2 (due in month 34) will deliver the final GUI, D4.4 (due in month 33) the final VCM, D3.4 (due in month 27) the CCI and D3.3 (due in month 33) the Sustainable Urban Indicators. The summative assessment will result in D6.3 and D6.4, the final evaluations of the tool, due in month 35.



8 CONCLUSIONS

In this document we have presented a pragmatic evaluation methodology and benchmarking framework for the INDICATE tool. At the heart of the evaluation and benchmarking are the users and their expectations of the tool. Users will be classified based on their experience of using GIS tools as well as their domain expertise. This will allow the identification of groups within the user base and an assessment of whether the INDICATE tool is more or less useful for different user groups.

We have identified the metrics that will be used in the evaluation and how these will be gathered and analysed. A heuristic evaluation will be carried out by an expert in order to catch problems with the interface before it is presented to users. The results of the heuristic evaluation will be fed back to the developers so that the problems highlighted can be corrected in the prototype tool. We will then carry out a formative user test in order to further assess the tool. We will use real-world tasks, based on users' expectations, and aim for a broad task base that captures the full functionality of the tool. By assessing these tasks with a limited number of users we hope to get very good quality feedback with a manageable workload for those carrying out the assessments.

We will present the metrics individually and also combine them to create an overall usability measure for the project. This will allow developers to see in detail where the problems are, while also giving project managers an overall metric for measuring the progress of the project. In the assessment of the final tool we will use more test users to help us find less common problems; as the tool progresses from prototype to final version, more users are needed to find the subtler issues that remain.

We have developed a custom survey tool that can be used to administer all of the tasks and questionnaires in the user evaluation. This tool allows the customization of the questions that are asked and stores the results directly in a database. We can further develop the tool to automate the processing of the qualitative survey data.

This is still a working document; once a prototype of the software is available we will refine the tasks and the goals for those tasks.



REFERENCES

Baars, B. J. (1988). A Cognitive Theory of Consciousness. Cambridge: Cambridge University Press.
Benedek, J. and Miner, T. (accessed 2008). Measuring Desirability: New Methods for Evaluating Desirability in a Usability Lab Setting. Available at: www.microsoft.com/usability/uepostings/desirabilitytoolkit.doc
Bias, R. (1994). The Pluralistic Usability Walkthrough: Coordinated Empathies. In Usability Inspection Methods, J. Nielsen and R. Mack (Eds.), Wiley, 63-76.
Cappon, D. (1994). A New Approach to Intuition. Omni, 16(1), 34-38.
Clark, A., & Boden, M. A. (1997). Being There: Putting Brain, Body, and World Together Again. Cambridge, MA: MIT Press.
Corbin, J. and Strauss, A. (2008). Basics of Qualitative Research, 3rd edition. Los Angeles, CA: Sage Publications.
Gigerenzer, G. (2013). Interview. HaysWorld Magazine, 1/2013.
Hammond, K. R. (1996). Human Judgment and Social Policy: Irreducible Uncertainty, Inevitable Error, Unavoidable Injustice. New York, USA: Oxford University Press.
Hartson, H. R., Andre, T. S. and Williges, R. C. (2003). Criteria for Evaluating Usability Evaluation Methods. International Journal of Human-Computer Interaction, 15(1), 145-181.
Hogarth, R. M. (2001). Educating Intuition. Chicago: University of Chicago Press.
Klein, G. (1998). Sources of Power: How People Make Decisions. Cambridge, MA: MIT Press.
Lazar, J., Feng, J. H. and Hochheiser, H. (2012). Research Methods in Human-Computer Interaction. Wiley and Sons Ltd.
Lewis, J. R. (1994). Sample sizes for usability studies: Additional considerations. Human Factors, 36, 368-378.
Liljegren, E. (2006). Usability in a Medical Technology Context: Assessment of Methods for Usability Evaluation of Medical Equipment. International Journal of Industrial Ergonomics, 36(4), 345-352.
Nielsen, J. (2000). Why You Only Need to Test with 5 Users. Alertbox, http://www.useit.com/alertbox/20000319.html
Nielsen, J. (1994). Heuristic Evaluation. In Nielsen, J. and Mack, R. L. (Eds.), Usability Inspection Methods, John Wiley and Sons, New York, NY.
Nielsen, J. (1994b). Heuristic Evaluation. In Usability Inspection Methods, J. Nielsen and R. Mack (Eds.), Wiley, 25-62.
Nielsen, J. (1993). Usability Engineering. Boston: Academic Press. ISBN 0-12-518405-0 (hardcover), 0-12-518406-9 (softcover).
Nielsen, J., & Landauer, T. K. (1993). A mathematical model of the finding of usability problems. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 206-213). Amsterdam: ACM.
Nielsen, J. and Mack, R. L. (1994). Usability Inspection Methods. John Wiley and Sons, New York, NY.
Norman, D. (2002). The Design of Everyday Things. New York: Basic Books.
Shneiderman, B. and Plaisant, C. (2005). Designing the User Interface: Strategies for Effective Human-Computer Interaction, 4th edition. Addison-Wesley.
Spool, J. and Schroeder, W. (2001). Testing Websites: Five Users is Nowhere Near Enough. In CHI '01 Extended Abstracts, ACM, 285-286.
Travis, D. (2009). How to Prioritise Usability Problems. Available at: http://www.userfocus.co.uk/articles/prioritise.html
Tullis, T. and Albert, B. (2008). Measuring the User Experience. Morgan Kaufmann Series in Interactive Technologies.
Vaughan, F. E. (1979). Awakening Intuition. Garden City, USA: Anchor Press.
Virzi, R. A. (1992). Refining the test phase of usability evaluation: How many subjects is enough? Human Factors, 34, 457-471.
Westcott, M. R. (1968). Toward a Contemporary Psychology of Intuition: A Historical, Theoretical, and Empirical Inquiry. New York, USA: Holt, Rinehart and Winston.
Wharton, C., Bradford, J., Jeffries, R. & Franzke, M. (1992). Applying Cognitive Walkthroughs to More Complex User Interfaces: Experience, Issues and Recommendations. In CHI '92, ACM Press, 381-388.
Wickens, C. D., Gordon, S. E., & Liu, Y. (1998). An Introduction to Human Factors Engineering. New York, USA: Addison-Wesley Educational Publishers Inc.
Woolrych, A. and Cockton, G. (2001). Why and When Five Test Users aren't Enough. In IHM-HCI, 105-108.


