How pre-processing decisions affect the reliability and validity of the approach–avoidance task: Evidence from simulations and multiverse analyses with six datasets

Abstract

Reaction time (RT) data are often pre-processed before analysis by rejecting outliers and errors and aggregating the data. In stimulus–response compatibility paradigms such as the approach–avoidance task (AAT), researchers often decide how to pre-process the data without an empirical basis, leading to the use of methods that may harm data quality. To provide this empirical basis, we investigated how different pre-processing methods affect the reliability and validity of the AAT. Our literature review revealed 108 unique pre-processing pipelines among 163 examined studies. Using empirical datasets, we found that validity and reliability were negatively affected by retaining error trials, by replacing error RTs with the mean RT plus a penalty, and by retaining outliers. In the relevant-feature AAT, bias scores were more reliable and valid if computed with D-scores; medians were less reliable and more unpredictable, while means were also less valid. Simulations revealed bias scores were likely to be less accurate if computed by contrasting a single aggregate of all compatible conditions with that of all incompatible conditions, rather than by contrasting separate averages per condition. We also found that multilevel model random effects were less reliable, valid, and stable, arguing against their use as bias scores. We call upon the field to drop these suboptimal practices to improve the psychometric properties of the AAT. We also call for similar investigations in related RT-based bias measures such as the implicit association task, as their commonly accepted pre-processing practices involve many of the aforementioned discouraged methods.

Highlights

• Rejecting RTs deviating more than 2 or 3 SD from the mean gives more reliable and valid results than other outlier rejection methods in empirical data
• Removing error trials gives more reliable and valid results than retaining them or replacing them with the block mean and an added penalty
• Double-difference scores are more reliable than compatibility scores under most circumstances
• More reliable and valid results are obtained both in simulated and real data by using double-difference D-scores, which are obtained by dividing a participant's double mean difference score by the SD of their RTs

Keywords: Approach-avoidance task (AAT) · Bias scores · Reliability · Validity · Outlier exclusion · Simulation · Multiverse analysis

Correspondence: Sercan Kahveci, sercan.kahveci@plus.ac.at. Affiliations: Department of Psychology, Paris-Lodron-University of Salzburg, Hellbrunner Straße 34, 5020 Salzburg, Austria; Centre for Cognitive Neuroscience, Paris-Lodron-University of Salzburg, Salzburg, Austria; Behavioural Science Institute, Radboud University, Nijmegen, The Netherlands.

Introduction

Stimulus–response compatibility tasks like the approach–avoidance task (AAT; Solarz, 1960), the extrinsic affective Simon task (De Houwer, 2003), and the implicit association task (IAT; Greenwald et al., 1998) have been used for over 60 years to measure attitudes without directly asking the participant. Their strength lies in the fact that they measure stimulus–response compatibility implicitly through reaction times (RTs), which avoids the methodological issues associated with self-reports, such as social desirability and experimenter demand.
In turn, however, they are subject to all the methodological issues associated with RT tasks, such as occasional incorrect responses, outlying RTs, and the large quantity of data, which cannot be meaningfully interpreted until it is reduced. As such, the data usually undergo some kind of pre-processing before analysis, whereby error trials and outliers are dealt with in some manner (or not), often followed by aggregation of the data into an easily interpretable bias score.

There are many methods available to perform each of these pre-processing steps. However, there is no clear-cut answer on which methods are preferable and under which circumstances, leaving researchers to find their way through this garden of forking paths on their own. Decisions may be made on the basis of their effect on the data, thereby inflating the likelihood of obtaining spurious results (Gelman & Loken, 2013). Researchers may also choose the same pre-processing pipeline as in already published work. This allows for comparable results, but makes the quality of the findings of an entire line of research dependent on the efficiency of the most popular pre-processing pipeline. In the best case, the commonly accepted set of decisions reliably recovers the true effect and allows the field to progress based on grounded conclusions. In the worst case, it makes the measurement less reliable, thereby misleading researchers with conclusions based on random noise and null findings that mask true effects. Hence, both heterogeneity in pre-processing decisions and low reliability can contribute to inconsistent results across studies that investigate the exact same effect, and thus play a role in the ongoing replication crisis in psychological science. Ideally, pre-processing decisions would be made based on empirical findings that demonstrate which options yield the best results (see e.g. Berger & Kiefer, 2021; Ratcliff, 1993).

The literature on the AAT is no stranger to these issues. The field did not take up the methods which Krieglmeyer and Deutsch (2010) found to lead to the highest reliability and validity (i.e. either strict slow RT cutoffs or data transformation). Many labs have since settled into their own pre-processing pipeline without a firm empirical basis for their decisions, making it unclear whether differing results are due to differences in task setup or in pre-processing after data collection. For example, using the same task setup in which participants approach and avoid stimuli with a joystick, one study found a correlation between spider fear and spider avoidance bias (e.g. Reinecke et al., 2010), while another did not (e.g. Krieglmeyer & Deutsch, 2010). It is unclear whether this difference occurred because the former study did not remove outliers whereas the latter removed all RTs above 1500 ms, or because the former study featured 2.66 times more test trials and 2.25 times more practice trials than the latter, or because the two studies used different scoring algorithms.

Low reliability has also been a problem in the AAT literature (Loijen et al., 2020), at least for certain variants of it. The irrelevant-feature AAT manipulates the contingency between the approach or avoidance response and a task-irrelevant stimulus feature, for example, by measuring chocolate approach–avoidance bias by requiring participants to approach stimuli surrounded by a green frame and avoid stimuli surrounded by a blue frame, thereby making it irrelevant whether the stimulus itself contains chocolate. This task is reported in the literature as unreliable, with reliabilities below zero (Kahveci, Van Bockstaele, et al., 2020; Lobbestael et al., 2016; Wittekind et al., 2019), though reliabilities around .40 (Cousijn et al., 2014) and even .80 (Machulska et al., 2022) have been reported on individual occasions. It has seen frequent use because its indirect nature conceals the goal of the experiment and thus makes it less susceptible to experimenter demand. The relevant-feature AAT, in contrast, directly manipulates the contingency between a task-relevant feature of the stimulus and the response, for example, by measuring chocolate approach–avoidance bias by requiring participants to approach chocolate stimuli during one block and to avoid them during another block. This task usually has a higher reliability, ranging from around .50 (Kahveci, Van Bockstaele, et al., 2020), to around .70 (Hofmann et al., 2009; Van Alebeek et al., 2021), up to around .90 (Zech et al., 2022); however, the direct nature of its instructions makes it easy for the participant to figure out what the task is about.
In the present study, we probed the extent of pre-processing heterogeneity in the literature on the AAT, and we made an effort towards reducing it by examining the reliability and validity obtained through a wide range of pre-processing decisions using a multiverse analysis, thereby limiting the range of acceptable pre-processing methods to only the most reliable and valid approaches. The multiverse analysis methodology, advocated by Steegen et al. (2016), involves applying every combination of reasonable analysis decisions to the data, to probe how variable the analysis outcomes can be, and to what extent each analysis decision contributes to this variability. We know of one study so far that has examined the impact of pre-processing methods on the reliability of the AAT, though it did not utilize multiverse analysis. Krieglmeyer and Deutsch (2010) applied a number of different outlier handling methods to the data and compared the resulting bias scores on the basis of their split-half reliability and overall effect size, finding that the relevant-feature AAT is most reliable when no outlier correction is applied, while the irrelevant-feature AAT benefits from very strict outlier rejection, e.g. removing all RTs above 1000 ms or deviating more than 1.5 SDs from the mean. Additionally, Parsons (2022) was the first to examine the effect of pre-processing decisions on reliability, though he looked at the dot-probe, Stroop, and Flanker tasks rather than the AAT.

Our study instead focused on the AAT, but also extended these studies methodologically by examining criterion validity, as high reliability is a prerequisite for, but not a guarantee of, high validity; if we focused solely on reliability, we would risk achieving highly reliable, but invalid data. Reliability represents how well a measurement will correlate with the same measurement performed again (Spearman, 1904), but it is agnostic on what is actually measured. Hence, one could measure something reliably, but that something might be an artifact rather than the effect one was looking for. For example, participants tend to be slower in the beginning of the experiment when they are trying to adapt to the task, and some are slower than others. This initial slowness is a large interpersonal difference that can be measured reliably, but it has little to do with cognitive bias. If we only focus on reliability, we may erroneously believe that our analysis should focus on this initial slowness rather than ignore it.

We also extended these previous studies by examining simulated as well as real data: simulated data allow for a detailed analysis of the conditions under which different outlier rejection and bias scoring methods are more or less reliable, but only real data can be used to examine how validity is affected by bias scoring and error and outlier handling. Previous simulation studies examining outlier rejection have assumed that extreme RTs are unrelated to the individual's actual underlying score (Berger & Kiefer, 2021); if, in real data, the approach–avoidance bias expresses itself through errors and extreme RTs (e.g. stronger bias leading to more errors when avoiding desired stimuli), then it could turn out to be preferable to keep them in the data.
Data structure and methodological challenges of the AAT

In this section, we will discuss the characteristics of the AAT to understand the methodological challenges that need to be addressed when pre-processing its data. In the AAT, participants view different stimuli and give either a speeded approach or avoidance response depending on a feature of the stimulus. Responses are typically given with a joystick or a similar device that simulates approach toward or avoidance of a given stimulus (Wittekind et al., 2021), though simple buttons are sometimes used instead. Depending on the input device, this allows for the measurement of different types of response times per trial, which we term initiation time, movement duration, and completion time (terms previously used by Barton et al., 2021; Tzavella et al., 2021). The time from stimulus onset to shortly after response onset (initiation time) indicates how long it took the participant to initiate a response; the time from response onset until response completion (movement duration) indicates the speed of the approach or avoidance movement. The two are often added together to represent the latency from stimulus onset until response completion (completion time). Approach–avoidance bias scores quantify the extent to which responses slow down or speed up due to the compatibility between stimulus and response.

A typical AAT trial features one out of two stimulus categories (target and control) and requires a response in one out of two directions (approach and avoid), resulting in four different types of trials. RTs to these four types of trials can be decomposed into three independent contrasts, which are detailed in Table 1.

The first contrast is the RT difference between responses to the two stimulus categories (rows in Table 1), regardless of response direction. This difference can be caused by the familiarity, visual characteristics, or prototypicality of the stimulus as a stand-in for its stimulus category, among other causes. As shown in Table 1, this factor contaminates any difference score between single-direction responses to one stimulus category versus another. If this is ignored, we may erroneously conclude that a familiar stimulus category is approached faster than a less familiar category, even though all responses to the familiar stimulus category are faster, regardless of response direction.

The second contrast is the RT difference between approach and avoidance trials, regardless of stimulus content (columns in Table 1). This difference can be caused by the relative ease with which approach and avoidance movements can be made, which can be influenced by irrelevant factors like the individual's anatomy and posture as well as by the (biomechanical) setup of the response device. This factor contaminates any difference score between approach and avoid trials within a single stimulus category. For example, a study found that patients with anorexia nervosa avoid, rather than approach, thin female bodies (Leins et al., 2018). Does that mean that women with anorexia, counterintuitively, have an avoidance bias away from these stimuli? Such an interpretation would not be valid, since an identical avoidance bias was demonstrated for normal-weight bodies in the same patient group as well as in healthy individuals, indicating that avoidance responses were simply faster overall and not specific to thin bodies.

The third contrast is the approach–avoidance bias, represented by the difference between approaching and avoiding a target stimulus category, relative to the difference between approaching and avoiding a reference stimulus category. As shown in Table 1, this double difference can be interpreted as an approach or avoidance bias towards one particular stimulus type relative to another.
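To make the decomposition concrete, the following minimal R sketch (our own illustration with invented cell means, not code from any of the reviewed studies) computes the three contrasts from the four cell means of Table 1:

    # Hypothetical mean RTs (ms) of one participant in the four AAT cells
    rt <- c(avoid_target = 740, approach_target = 680,   # quadrants A and B
            avoid_control = 720, approach_control = 700) # quadrants C and D

    # Contrast 1: stimulus-category effect (rows of Table 1)
    category_effect <- mean(rt[c("avoid_target", "approach_target")]) -
      mean(rt[c("avoid_control", "approach_control")])

    # Contrast 2: general approach-avoidance effect (columns of Table 1)
    movement_effect <- mean(rt[c("avoid_target", "avoid_control")]) -
      mean(rt[c("approach_target", "approach_control")])

    # Contrast 3: the double-difference (bias) score, [A - B] - [C - D];
    # positive values indicate that the target is approached faster than
    # it is avoided, relative to the control category
    bias <- (rt[["avoid_target"]] - rt[["approach_target"]]) -
      (rt[["avoid_control"]] - rt[["approach_control"]])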
The current study

The current article consists of four studies. As a first step, we reviewed the literature to gain insight into which pre-processing decisions are in use in the field (Study 1). We discuss thereafter which methods are potentially problematic and consider alternative methods, giving extra consideration to robust and novel approaches. Next, we performed a simulation study to compare two ways of aggregating data from four conditions, those being double-difference scores and compatibility scores (Study 2). We followed up with a simulation study to compare the impact of outliers on the reliability of scores derived using a number of outlier detection methods and scoring algorithms (Study 3). And lastly, we compared these pre-processing methods on how they affect the reliability and validity of real datasets in a multiverse analysis (Study 4).

Study 1: Literature review

Introduction

We performed a focused scoping review of the AAT literature to examine which pre-processing decisions are used in the field of AAT research. The intention was not to be exhaustive or systematic, but to tap into the variability in pre-processing decisions in the field to orient the rest of this project.

Methods

We reviewed 213 articles retrieved from Google Scholar using the keywords "approach–avoidance task OR approach avoidance task" published between 2005 and 2020. We rejected 65 articles after reading the abstract or full text, since they featured no AAT, only a training-AAT, or a variant of the AAT that departs strongly from the original paradigm (e.g. by allowing participants to freely choose whether to approach or avoid a stimulus). We also excluded one experiment which featured multiple pre-processing tracks, as we would otherwise have to count two full pre-processing pipelines for a single experiment. We thus retained 143 articles containing a total of 163 AATs. When an article contained multiple AAT experiments, all were included as separate entries and counted as such.
The experiments were coded on the following variables: instruction type (relevant-feature, irrelevant-feature), response device (e.g. joystick, keyboard, mouse), RT definition (initiation time, movement duration, completion time), inclusion of some sort of training or therapy, the research population, the target stimuli, the type of reported reliability index if any (e.g. even-odd split-half, Cronbach's alpha of stimulus-specific bias scores), absolute outlier exclusion rules (e.g. any RTs above 2000 ms), adaptive outlier exclusion rules (e.g. 3 SD above each participant's mean), error handling rules (e.g. include error trials in analyses, force participants to give correct responses), performance-based participant exclusion rules (e.g. more than 35% errors), score-based exclusion rules (e.g. bias scores deviating more than 3 SD from the sample mean), and the summary statistic used (e.g. double mean difference scores, median category-specific difference scores, simple means).

Results

A total of 163 experiments from 143 articles were examined. Below, we describe the number and percentage of experiments that utilized specific methods in their design, pre-processing, and analysis. The full results of this review can be found in this study's online repository: https://doi.org/10.17605/OSF.IO/YFX2C

Table 1. AAT trial types, difference scores, and their components

Avoid target (Quadrant A): target stimulus recognition + general avoid speed + avoidance facilitation of target
Approach target (Quadrant B): target stimulus recognition + general approach speed + approach facilitation of target
Target-specific difference score (A − B) = (general avoid speed − general approach speed) + (avoidance facilitation of target − approach facilitation of target)

Avoid control (Quadrant C): control stimulus recognition + general avoid speed + avoidance facilitation of control
Approach control (Quadrant D): control stimulus recognition + general approach speed + approach facilitation of control
Control-specific difference score (C − D) = (general avoid speed − general approach speed) + (avoidance facilitation of control − approach facilitation of control)

(Negative) avoid-specific difference score (A − C) = (target stimulus recognition − control stimulus recognition) + (avoidance facilitation of target − avoidance facilitation of control)
(Negative) approach-specific difference score (B − D) = (target stimulus recognition − control stimulus recognition) + (approach facilitation of target − approach facilitation of control)
Double-difference score ([A − B] − [C − D]) = (avoidance facilitation of target − approach facilitation of target) − (avoidance facilitation of control − approach facilitation of control)

Note: This table is a schematic depiction of single- and double-difference scores and the RT components they consist of. Each quadrant describes the RT components we hypothesize to constitute the RTs of the combination of stimulus and response that the quadrant represents. When read from top to bottom, the bottom row represents the result of subtracting the middle row from the top row.
When read from left to right, the right column represents the result of subtracting the middle column from the left column.

Response device

Joysticks were by far the most popular response device (132; 80.98%). They were followed by keyboards (8; 4.91%), button boxes (7; 4.29%), touchscreens (5; 3.07%), computer mice (5; 3.07%), and other/multiple/unknown devices (8; 4.91%).

Instructions

The irrelevant-feature AAT was the most popular task type (119; 73.01%), followed by the relevant-feature AAT (41; 25.15%). A small number (3; 1.84%) used both task types in the same experiment.

Reliability measures

Reliability was not examined in the majority of experiments (125; 76.69%); most that did examine reliability used a single reliability measure (36; 22.1%), and some used two (2; 1.23%). Split-half reliability was the most common measure (19; 11.7%). The types of split-half reliability included temporal split-half, which splits the experiment halfway (5; 3.07%); even-odd split-half, which splits the data by even versus uneven trial number (5; 3.07%); and randomized split-half, which averages together the correlations between many random splits (5; 3.07%); other studies did not mention the type of split-half used (4; 2.45%). Cronbach's alpha was the next most common reliability measure (16; 9.82%).
Most experiments computed Cronbach's alpha on the basis of the covariance matrix of stimulus-specific bias scores (11; 6.75%), while a minority computed Cronbach's alpha for RTs in a single movement direction, grouping them per stimulus (2; 1.23%), and some did not clarify how they computed Cronbach's alpha (3; 1.84%). The least common measure was test-retest reliability (4; 2.45%).

RT measures

Most studies used a single RT measure (153; 93.9%) but some used multiple (10; 6.13%). Out of all examined experiments, most did not report how RTs were defined (69; 42.33%), but those that did used completion time (50; 30.7%), initiation time (43; 26.4%), or movement duration (9; 5.52%).

Outlier rejection rules

Many experiments applied no outlier rejection (62; 38%), while those that did applied either only absolute outlier rejection methods (38; 23.3%), only adaptive outlier rejection methods (24; 14.7%), or both together (39; 23.9%). Frequencies of absolute outlier rejection rules (78; 47.9%) are shown in Table 2, and frequencies of adaptive outlier rejection methods (63; 38.7%) are shown in Table 3.

Table 2. Frequencies of upper and lower RT cutoffs in the reviewed literature

Upper cutoffs: >1000 ms: 3 (1.84%); >1500 ms: 21 (12.88%); >1700 ms: 1 (0.61%); >2000 ms: 33 (20.25%); >3000 ms: 5 (3.07%); >3500 ms: 1 (0.61%); >4000 ms: 1 (0.61%); >5000 ms: 1 (0.61%); >10,000 ms: 3 (1.84%); no upper cutoff: 94 (57.67%).
Lower cutoffs: <100 ms: 4 (2.45%); <150 ms: 12 (7.36%); <200 ms: 23 (14.11%); <250 ms: 2 (1.23%); <300 ms: 11 (6.75%); <350 ms: 6 (3.68%); no lower cutoff: 105 (64.42%).
Of the 163 experiments, 87 (53.37%) applied neither an upper nor a lower cutoff; the most common combination was an upper cutoff of 2000 ms with a lower cutoff of 200 ms (20; 12.27%). The full cross-tabulation is available in the online repository.

Table 3. Frequencies of adaptive outlier rejection methods in the reviewed literature

Rejection on both sides: upper and/or lower 1%: 10 (6.13%); upper and/or lower 2%: 1 (0.61%); 1.5 SD: 1 (0.61%); 2 SD: 3 (1.84%); 2.5 SD: 7 (4.29%); 3 SD: 28 (17.18%); subtotal: 50 (30.67%).
Rejection on the upper side only: 2 SD: 2 (1.23%); 2.5 SD: 1 (0.61%); 3 SD: 7 (4.29%); multiple methods: 1 (0.61%); subtotal: 11 (6.75%).
Unclear: 2 (1.23%); none: 100 (61.35%).

Error rules

In most experiments, the authors excluded error trials (115; 70.55%), while others either included them (34; 20.86%), replaced them with the block mean RT of correct trials plus a penalty (7; 4.29%), or required participants to give correct responses to complete the trial (7; 4.29%).

Bias score algorithms

We categorized the observed bias score algorithms where possible and gave them systematic names, which will be used in the remainder of the article. They are shown in Table 4.

Table 4. Bias score algorithms and how frequently they have been used. mean(·) and mdn(·) denote the participant's mean and median RT in the given cells.

Median category-specific difference: 47 (28.83%): mdn(RT avoid target) − mdn(RT approach target)
Mean category-specific difference: 32 (19.63%): mean(RT avoid target) − mean(RT approach target)
Category-specific difference D-score: 6 (3.68%): [mean(RT avoid target) − mean(RT approach target)] / SD(RT target)
Double median difference: 18 (11.04%): [mdn(RT avoid target) − mdn(RT approach target)] − [mdn(RT avoid control) − mdn(RT approach control)]
Double mean difference: 3 (1.84%): [mean(RT avoid target) − mean(RT approach target)] − [mean(RT avoid control) − mean(RT approach control)]
Median compatibility score: 4 (2.45%): mdn(RT avoid target or approach control) − mdn(RT approach target or avoid control)
Mean compatibility score: 4 (2.45%): mean(RT avoid target or approach control) − mean(RT approach target or avoid control)
Compatibility D-score: 1 (0.61%): [mean(RT avoid target or approach control) − mean(RT approach target or avoid control)] / SD(RT)
Median movement-specific difference scores: 4 (2.45%): mdn(RT avoid target) − mdn(RT avoid control) and mdn(RT approach target) − mdn(RT approach control)
Mean movement-specific difference scores: 3 (1.84%): mean(RT avoid target) − mean(RT avoid control) and mean(RT approach target) − mean(RT approach control)
Multiple: 9 (5.52%); Other: 3 (1.84%); None: 20 (12.27%); Unclear: 9 (5.52%). Total: 163 (100%).

Participant rejection rules

It was uncommon for participants to be rejected based on bad performance (42; 25.8%), but it is unclear whether this is because participants performed well in most studies or because their performance simply was not examined. If participants were rejected, it was most commonly on the basis of error rates (34; 20.9%), with an error rate above 25% being the most common cutoff (12; 7.36%), followed by error rates of 35% (6; 3.68%) and 20% (4; 2.45%). Less often, participants were rejected because they had RTs that were too slow (6; 3.68%), or because they had too few trials remaining after error and outlier removal combined (4; 2.45%). In a minority of studies, participants were rejected not (only) due to high error rates or slow RTs, but (also) because their bias scores were too outlying (11; 6.75%), their scores had too much influence on the regression outcome (1; 0.61%), or for unclear reasons relating to the magnitude of their scores (1; 0.61%). Many of the examined experiments gave the impression that no participant rejection rule was defined beforehand, but that participants were rejected following data inspection.

Pipelines

We empirically observed a total of 108 unique pre-processing pipelines across 163 studies, out of 218,400 possible combinations, computed by multiplying the numbers of all unique observed pre-processing methods at each step with each other.

Discussion

We found that some pre-processing methods were quite common (e.g. excluding trials deviating more than 3 SD from the participant mean), but there is still much heterogeneity in the literature, as only a few studies used identical pre-processing methods, which makes it difficult to discern whether divergent results in the literature are due to differences in experimental design, pre-processing, or chance. In the following discussion of Study 1, we will review the observed and hypothetical new pre-processing decisions based on methodological considerations, in anticipation of Studies 2, 3, and 4.

Outlier rejection

Various methods are used to flag and remove implausible or extreme RTs. This is especially important considering that non-robust statistics are much more strongly influenced by individual extreme outliers than by a multitude of regular RTs; as such, outliers inflate type I and II error rates by suppressing effects that exist and creating effects that do not exist in real life (Dixon, 1953; Ratcliff, 1993).

Fixed RT cutoffs

It seems sensible to remove outliers based on a cutoff that is adapted to the specific study but fixed across participants in that study. This is based on the reasoning that there is a high likelihood that RTs above or below certain values have a different origin than the mental process being measured (Ratcliff, 1993). The removal of such RTs is thus thought to enhance the validity of the data. For example, when a participant forgets the instructions and tries to remember them, this can result in a 4-second RT caused by memory search rather than by stimulus recognition and decision-making. The same goes for fast outliers: it is known that people only begin to recognize a stimulus after about 150 ms, and they only begin giving above-chance responses from 300 ms onwards (Fabre-Thorpe, 2011). Given this, a 50 ms RT is most likely not related to the stimulus that has just been shown on the screen. It remains unclear, however, what the ideal cutoffs are.
Ratcliff (1993) found that an RT cutoff of 1500 ms led to results with decent power to detect a group difference, both when said group difference was in the mean of the distribution and when it was in the tail of the distribution instead. This study, however, utilized simulated data that may not correspond with effects observed in real life. Further complexity is introduced by the fact that some stimuli are more visually or conceptually complex than others and may thus require more processing time before the participant is capable of responding correctly using the cognitive mechanism under study.

Means and SDs

By far the most common adaptive outlier rejection method is to remove all RTs that deviate more than 3 SD from the participant's mean. Ratcliff (1993) found that very strict SD boundaries (e.g. M + 1.5 SD) reasonably salvage power to detect group differences when the group difference is in the group means, but significantly weaken power when groups primarily differ in the length of the right tail of the distribution; this suggests that the benefit of using means and SDs can depend on the nature of the task. Both means and SDs are themselves also highly influenced by outliers.
Thus, when using this method, extreme outliers widen the SD and mask smaller outliers that would otherwise have been detected after an initial exclusion of extreme outliers. Additionally, means and SDs are only correct descriptors for symmetric distributions, which RT distributions are not; as such, this method is prone to removing only slow outliers, while ignoring extremely fast RTs that technically do not deviate more than 3 SD from the mean but are nevertheless theoretically implausible. Hence, this method is often combined with absolute outlier cutoffs that eliminate extreme RTs at both ends of the distribution before means and SDs are computed for further outlier rejection.

Alternatively, one may turn to robust estimators of the mean and SD, such as the median and the median absolute deviation (MAD; Hampel, 1985), respectively. Unlike the mean, the median is not disproportionately affected by outliers compared to non-outlying datapoints, as the median assigns equal leverage to every data point. Hence, it is not affected by the outliers' extremeness, but only by their count. Similarly, the MAD is a robust alternative to the SD, calculated by computing the deviation of all data points from the median, removing the sign from these values, computing their median, and multiplying this value by a constant to approximate the value that the SD would take on in a normal distribution.
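To illustrate how these rule families differ in practice, here is a minimal R sketch of the three methods discussed so far (our own illustration; the 200/2000 ms cutoffs are example values):

    flag_outliers <- function(rt, method = c("fixed", "sd", "mad"),
                              lower = 200, upper = 2000, mult = 3) {
      method <- match.arg(method)
      switch(method,
        # Fixed cutoffs: the same bounds for every participant
        fixed = rt < lower | rt > upper,
        # Mean +/- mult * SD: both estimators are themselves widened by outliers
        sd = abs(rt - mean(rt)) > mult * sd(rt),
        # Median +/- mult * MAD: robust counterpart; R's mad() already applies
        # the normal-consistency constant (1.4826)
        mad = abs(rt - median(rt)) > mult * mad(rt)
      )
    }

    # An extreme slow RT widens the SD and masks a moderate one,
    # whereas the MAD rule is unaffected by the outlier's extremeness
    set.seed(1)
    rts <- c(rgamma(100, shape = 4, scale = 100) + 300, 1600, 4000)
    sum(flag_outliers(rts, "sd"))   # flags only the 4000 ms RT
    sum(flag_outliers(rts, "mad"))  # flags both the 1600 and 4000 ms RTs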
Percentiles

One of the less common ways to deal with outliers was to remove a fixed percentage of the fastest and slowest RTs from the data (10 out of 163 studies). This method has the advantage of not ignoring fast outliers. However, it is independent of the characteristics of the RT distributions under investigation, and is thus likely to remove either too few or too many outliers, depending on the data.

Outlier tests

While much less common than any of the aforementioned methods, another method that deserves mention is the significance testing of outliers. The Grubbs test (Grubbs, 1950) was used by e.g. Saraiva et al. (2013), among others, to detect whether the highest or lowest value in the data significantly deviates from the distribution. When found to be significantly different, this value is removed, and the process is repeated until no more significant outliers are detected.

Error handling

Study 1 revealed four ways of dealing with error trials: including them in the analyses, excluding them, replacing them with the block mean plus a penalty, or requiring the participant to give a correct response during the task itself and defining the RT as the time from stimulus onset until the correct response. Which method is ultimately the best depends on whether error trial RTs contain information on approach–avoidance bias. After all, some implicit tasks are based entirely on errors (e.g. Payne, 2001). Raw error counts sometimes show approach–avoidance bias effects (Ernst et al., 2013; Gračanin et al., 2018; van Peer et al., 2007) but often they do not (Glashouwer et al., 2020; Heuer et al., 2007; Kahveci et al., 2021; Neimeijer et al., 2019; Radke et al., 2017; von Borries et al., 2012), and it is unclear why some studies find such an effect while others do not. Therefore, we will examine how the different types of error handling affect reliability and validity (Study 4).
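As an illustration, a minimal R sketch of the two most common post hoc options, exclusion and replacement with the block mean of correct trials plus a penalty (our own illustration; the 600 ms penalty and the column names are example choices):

    # df: one row per trial, with columns subject, block, rt,
    # and error (TRUE for incorrect responses)
    exclude_errors <- function(df) df[!df$error, ]

    replace_errors <- function(df, penalty = 600) {
      # Mean RT of correct trials, computed per subject and block
      block_mean <- ave(ifelse(df$error, NA, df$rt), df$subject, df$block,
                        FUN = function(x) mean(x, na.rm = TRUE))
      df$rt[df$error] <- block_mean[df$error] + penalty
      df
    }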
Bias score computation algorithms

Category-specific and movement-specific difference scores

The category-specific difference score is the most popular bias scoring algorithm (93 in 163 studies). To compute it, one subtracts aggregated approach RTs from aggregated avoidance RTs for a single stimulus category (Table 1: quadrant A minus B, or C minus D). Movement-specific difference scores are less popular, but they similarly contrast a single condition with another, in this case by subtracting the approach or avoidance RT for a target stimulus from the approach or avoidance RT of a control stimulus. In the resulting score, positive values imply that the target stimulus is approached or avoided faster than the control stimulus (Table 1: quadrant C minus A, and D minus B; note that the Table displays these the other way around). As we discussed in the introduction, these scores are problematic when interpreted on their own, because they do not account for interpersonal and overall differences in how fast participants perform approach and avoidance movements, and how fast they classify the stimuli into their categories, respectively. This contamination with motor or classification effects can produce bias scores with extremely high reliabilities that do not correlate with any relevant interpersonal metric, because the difference score consists primarily of contaminant rather than stimulus-related approach–avoidance bias (as found by e.g. Kahveci, Meule, et al., 2020). To hold any meaning, they need to be contrasted with their opposite equivalent, i.e. approach scores with avoid scores, and target stimuli with control stimuli. This can be done through subtraction or by comparing the two scores in an analysis, such as ANOVA. Therefore, we primarily focus on double-difference scores in this article.

Double-difference scores

Double-difference scores cancel out effects other than stimulus category-specific approach–avoidance bias, by subtracting approach–avoidance scores for a control or comparison stimulus category from those of a target stimulus category (Table 1: quadrants [A − B] − [C − D]). They represent the advantage of approaching over avoiding one stimulus category relative to another. We will examine how mean-based and median-based double-difference algorithms compare in their ability to recover reliable and valid approach–avoidance bias scores.

Compatibility scores

These scores involve averaging all RTs from the bias-compatible conditions together, and subtracting this from the average of all RTs in the bias-incompatible conditions taken together. When one measures approach–avoidance bias towards palatable food, for example, the bias-compatible conditions involve approaching food and avoiding control stimuli (quadrants B and C of Table 1), while the bias-incompatible conditions involve avoiding food and approaching control stimuli (quadrants A and D of Table 1). When there is an equal number of trials in each of the four conditions, compatibility scores (e.g. mean(RT avoid target or approach control) − mean(RT approach target or avoid control)) are thus functionally identical (though halved in size) to double-difference scores, which can be reformulated as [mean(RT avoid target) + mean(RT approach control)] − [mean(RT approach target) + mean(RT avoid control)]. However, when there is an unequal number of trials in the conditions contained within the averages, the condition with more trials has a larger influence on the average than the condition with fewer trials, which can reintroduce RT influences that a double-difference score is meant to account for, such as stimulus-independent differences in approach–avoidance speed. Imagine, for example, that a participant particularly struggled with avoiding palatable food stimuli and made a disproportionate number of errors in this particular condition. After error exclusion, their final dataset contains 20 avoid-food trials and 40 trials of each other condition. The mean RT of the incompatible conditions is then more strongly influenced by the 40 approach-control trials than by the 20 avoid-food trials, and it fails to cancel out the stimulus-independent difference between approach and avoid trials. Therefore, the compatibility score is almost always an impure measure of approach–avoidance bias, as we will show in a further analysis.

The D-score correction

The D-score correction controls for the fact that larger differences between conditions emerge when a participant has a wider RT distribution, which occurs when they respond more slowly, as demonstrated by Wagenmakers and Brown (2007). The D-score was introduced by Greenwald et al. (2003) for the Implicit Association Task and was also adopted in AAT research (Wiers et al., 2011). Many different types of D-scores have been reported in the AAT literature, with the common thread being that a mean-based difference score of some kind is divided by the SD of the participant's RTs. It makes sense to cancel out the effect of narrower or wider SDs, as these can have a myriad of causes other than underlying approach–avoidance bias, such as age, fatigue, and speed-accuracy trade-offs. However, this slowing cannot be entirely disentangled from the slower responding that may occur when individuals have more difficulty performing the task due to a strong and rigid approach–avoidance bias. Hence, it is as of yet unclear whether the D-score correction helps or hurts the validity of the AAT, and it will therefore be examined here.
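For concreteness, a minimal R sketch (our own; the column names are assumptions) that computes the double mean difference, double median difference, and double-difference D-score for each participant from trial-level data:

    # df: one row per trial, with columns subject, movement
    # ("approach"/"avoid"), stimulus ("target"/"control"), and rt
    bias_scores <- function(df) {
      do.call(rbind, lapply(split(df, df$subject), function(d) {
        cell <- function(mov, stim, fun)
          fun(d$rt[d$movement == mov & d$stimulus == stim])
        dd <- function(fun)  # [A - B] - [C - D], per Table 1
          (cell("avoid", "target", fun) - cell("approach", "target", fun)) -
          (cell("avoid", "control", fun) - cell("approach", "control", fun))
        data.frame(subject            = d$subject[1],
                   double_mean_diff   = dd(mean),
                   double_median_diff = dd(median),
                   # D-score: the mean-based double difference divided by
                   # the SD of the participant's RTs
                   dscore             = dd(mean) / sd(d$rt))
      }))
    }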
Multilevel random effects

This scoring method was recently introduced by Zech et al. (2020). It involves fitting a mixed model and extracting the by-participant random slopes representing the desired contrast between conditions. For example, a contrast between approach and avoidance can be retrieved by extracting the random effect of movement direction (0 = avoid, 1 = approach), and a double-difference score can be retrieved by extracting the interaction between movement direction and stimulus category (0 = control, 1 = target). This method allows for the inclusion of known covariates influencing individual RTs, such as trial number, temporal proximity of error trials, and individual stimulus recognition speeds. Due to its novelty and good performance in the aforementioned study, we included this approach here and chose to examine it in the following analyses.

Study 2: Susceptibility of compatibility scores to confounding caused by differences in trial count between conditions

Introduction and method

As mentioned, compatibility scores are a problematic measure of approach–avoidance bias when the number of trials in each condition is unequal, which is bound to be the case when outliers and error trials are removed. We demonstrated this by simulating AAT datasets and examining how reliability is impacted by the removal of trials from one specific condition.

Examined methods

We examined double-difference and compatibility score variants of the four archetypal data aggregation methods described in Study 1: means, medians, D-scores, and multilevel random effects. The formulas for the first three of these methods are described in Table 4. As for the multilevel methods, multilevel double-difference scores were computed by extracting the per-participant random effect coefficients of a movement × stimulus-type interaction (computed in R with the formula RT ~ movement_direction * stimulus_category + (movement_direction * stimulus_category | Subject)), whereas multilevel compatibility scores were computed by extracting per-participant random coefficients of a main effect of stimulus-to-movement congruence (computed in R with the formula RT ~ congruence + (congruence | Subject)).
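A minimal sketch of this scoring approach with lme4, following the model formulas quoted above (our own illustration; it assumes numeric 0/1-coded columns and may differ in detail from the implementation of Zech et al., 2020):

    library(lme4)

    # Multilevel double-difference scores: by-participant random slopes of
    # the movement-by-stimulus interaction (movement_direction: 0 = avoid,
    # 1 = approach; stimulus_category: 0 = control, 1 = target)
    m_dd <- lmer(RT ~ movement_direction * stimulus_category +
                   (movement_direction * stimulus_category | Subject),
                 data = df)
    dd_scores <- ranef(m_dd)$Subject[["movement_direction:stimulus_category"]]

    # Multilevel compatibility scores: by-participant random slopes of
    # stimulus-to-movement congruence (0 = incompatible, 1 = compatible)
    m_comp <- lmer(RT ~ congruence + (congruence | Subject), data = df)
    comp_scores <- ranef(m_comp)$Subject[["congruence"]]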
Dataset simulation

We simulated AAT datasets to produce values distributed with a right skew similar to real AAT data and with adjustable differences between conditions and between subjects. For each participant, we first randomly generated the mean RT, SD, movement direction RT difference, stimulus category RT difference, and bias effect RT difference (the true bias score), based on a predetermined sample-wide mean and SD for each parameter. After this, we generated gamma-distributed RTs whose means and SDs were shifted such that they matched the predetermined parameters of their respective condition and participant. To be able to generate data with properties similar to those of real studies, we used means and SDs of the aforementioned parameters from the relevant-feature AAT described by Lender et al. (2018) with errors and outliers (RT < 200 ms or RT > 2000 ms) removed; these parameters are described in Appendix 1. Each dataset featured 36 participants, each having 256 trials divided into four conditions. This data simulation procedure is available through the function aat_simulate() in the AATtools package (Kahveci, 2020) for R (R Core Team, 2020).
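The core of such a generator is a gamma deviate that is rescaled and shifted to hit a target mean and SD. A simplified sketch of the idea (our own; the full implementation is aat_simulate() in AATtools, and shape = 3 is an example value):

    # Draw n right-skewed RTs with a given mean and SD by rescaling and
    # shifting a fixed-shape gamma distribution
    rshiftedgamma <- function(n, mean, sd, shape = 3) {
      scale <- sd / sqrt(shape)       # gamma SD equals sqrt(shape) * scale
      shift <- mean - shape * scale   # gamma mean equals shape * scale
      rgamma(n, shape = shape, scale = scale) + shift
    }

    # For example, 64 trials of one condition for one simulated participant
    rts <- rshiftedgamma(64, mean = 750, sd = 150)
    c(mean(rts), sd(rts))  # close to 750 and 150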
Analysis procedure

We simulated 1000 datasets based on the properties from Lender et al. (2018). We also simulated 1000 datasets where the RT difference between approach and avoid trials was doubled, as we hypothesized that unequal trial count is especially problematic for compatibility scores when RT differences between approach and avoidance trials are large. In each dataset, we removed one trial per participant from the approach-target condition (removing trials from any of the other conditions instead should lead to identical effects on the compatibility score). After this, we computed double-difference scores and compatibility scores from the data using the aforementioned four archetypal data aggregation methods. We repeated this procedure of trial removal and score computation until 16 trials remained per participant in that condition.

Outcome measures

We evaluated the accuracy of the bias scores by correlating them with the predetermined true score on which the participants' data were based. We refer to this measure as (true score) recoverability. We chose this measure since it is intuitively easy to understand on its own, it is computationally much less costly than permutated split-half reliability, and it is equivalent to the square root of reliability. Writing an observed score as the sum of a true score T and a measurement error E that is uncorrelated with it (Cov(T, E) = 0):

    Cor(T, T + E) = Cov(T, T + E) / √[Var(T) · Var(T + E)]
                  = [Cov(T, T) + Cov(T, E)] / {√Var(T) · √[Var(T) + Var(E) + 2·Cov(T, E)]}
                  = Var(T) / {√Var(T) · √[Var(T) + Var(E)]}
                  = √{Var(T) / [Var(T) + Var(E)]}

where the Spearman-Brown-corrected split-half correlation is an estimator of the reliability,

    Cor(T + E, T + E′) = Var(T) / [Var(T) + Var(E)]

with E′ denoting the error of a second, parallel measurement.

To be able to compare double-difference and compatibility scores, we also computed the probability that a randomly drawn double-difference score would be better than a randomly drawn compatibility score at recovering the true score. We arrived at this probability by computing the mean proportion of recoverability values of double-difference scores that were greater than the recoverability values of each compatibility score. This was done separately for each aggregation method and number of missing trials.
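This equivalence is easy to verify numerically; in the following self-contained sketch (our own), two parallel measurements stand in for the split halves after Spearman-Brown correction:

    set.seed(42)
    n  <- 1e5
    T  <- rnorm(n)      # true scores
    x1 <- T + rnorm(n)  # observed scores
    x2 <- T + rnorm(n)  # the same measurement repeated with new error

    reliability    <- cor(x1, x2)  # estimates Var(T) / (Var(T) + Var(E))
    recoverability <- cor(T, x1)   # estimates the square root of that ratio

    c(reliability, recoverability^2)  # both approximately .50 here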
Results and discussion

As depicted in Fig. 1, bias scores became increasingly inaccurate as the trial count became more unequal across conditions. This decrease in accuracy was larger for compatibility scores than for double-difference scores, and it was larger when there was more variability between the simulated participants in how much faster or slower they were to approach or avoid. Overall, the probability of a double-difference score being better than a compatibility score at recovering the true score was almost always above chance, being .79 at most. These probabilities are further depicted in Table 5.

[Figure 1. Effect of unequal trial count per condition on the recoverability of the true score from double-difference and compatibility scores based on means, medians, D-scores, and multilevel random effects]

Table 5. Probability of double-difference scores having higher true score recovery than compatibility scores (columns: 0, 16, 32, and 48 missing trials)

Average variability in the difference between approach and avoidance RTs:
  Multilevel-based: .46, .46, .49, .50
  Mean-based: .50, .51, .54, .54
  Median-based: .52, .55, .58, .55
  D-score-based: .50, .50, .54, .56
Large variability in the difference between approach and avoidance RTs:
  Multilevel-based: .46, .50, .60, .69
  Mean-based: .50, .54, .66, .75
  Median-based: .54, .61, .71, .75
  D-score-based: .50, .54, .67, .79

Double-difference scores only performed worse than compatibility scores when computed using multilevel analysis, given either relatively small differences in trial count between conditions or average variability in the difference between approach and avoidance RTs. This contrast was driven by multilevel double-difference scores performing worse than their mean- and D-score-based counterparts, while multilevel compatibility scores performed on par. When bias scores were computed using medians, compatibility scores underperformed relative to double-difference scores even when trial counts were equal across conditions. In all other cases, the two scoring methods performed identically given equal trial counts across conditions, but diverged when these became unequal. Comparing the four score aggregation methods, D-scores best recovered the true score, followed by mean-based scores, multilevel-based scores, and lastly, median-based scores.

Given the finding that compatibility scores perform either on par with or worse than double-difference scores, we see little reason to use them when double-difference scores are available. We thus recommend using double-difference scores instead of compatibility scores, and we will do so ourselves in the remainder of the article. The only exception to these findings is the case of multilevel-analysis-based scores, where compatibility scores were superior to double-difference scores unless the conditions were unequal in trial count and simultaneously had a large difference between approach and avoidance RTs. We will therefore report on multilevel compatibility scores.

Study 3: Simulation study of outlier rejection and scoring algorithms

Introduction and methods

Given the heterogeneity in the literature revealed in Study 1, we chose to empirically examine the impact of outlier rejection methods and scoring algorithms on reliability using data simulation. We simulated datasets to be able to control the number of outliers in the data, and we applied every unique combination of outlier rejection method and scoring algorithm to these datasets. We compared the methods to each other in their ability to recover the true scores on which the simulated data were based.

Examined methods

The examined bias computation algorithms included the double mean difference score, double median difference score, double-difference D-score, and multilevel compatibility scores, as described in the previous studies. We also examined a number of outlier detection methods:
• Sample-wide removal of the slowest and/or fastest percentile of trials (1% / 99%), because it is a common method in the AAT literature;
• Per-participant removal of RTs exceeding the mean by 3 SD (M ± 3 SD), because it is similarly common;
• Per-participant removal of RTs exceeding the mean by 2 SD (M ± 2 SD), as a representative of the stricter SD-based outlier removal methods that is sufficiently different from the aforementioned 3 SD method that its effects on the data will be more detectable;
• Per-participant removal of RTs deviating from the median by more than 3 MADs (median ± 3 MAD), to be able to contrast the common 3 SD method with its robust counterpart;
• Repeated outlier testing and removal using one- or two-sided Grubbs tests (Grubbs), to represent outlier removal methods based on statistical testing rather than on boundaries calculated from the data;
• No outlier rejection (None), as a baseline to contrast these methods against.
We did not examine absolute outlier cutoffs in this study, as we were concerned that these, unlike adaptive outlier rejection methods, would be too sensitive to the arbitrary properties of our current simulation (such as the mean and SD of the RTs), and would hence require the manipulation of these properties as well, which falls outside the scope of this article.

Analysis procedure

We generated 1000 datasets in the same manner as in Study 2, with each dataset having the same properties as the relevant-feature AAT study of Lender et al. (2018) with outliers and error trials included. These datasets each had 36 participants with 256 trials each, spread across 2×2 conditions. When examining category-specific difference scores, we excluded all trials pertaining to the control condition from the data; when examining double-difference scores, the full dataset was used. In each dataset, we iteratively replaced one additional random RT with a slow outlier (mean = μ_participant + 1200 ms, SD = 400 ms, gamma-distributed, shape = 3) in every participant's data (i.e. first one outlier, then two, then three, and so on), after which we separately applied each combination of one outlier rejection method for slow RTs and one bias scoring algorithm to the data. This was done 32 times per dataset, until 12.5% of each participant's data consisted of outliers. To obtain the reliability of these combinations, we utilized the same true score recoverability measure that we used in Study 2; that is, we computed the correlation between the computed bias scores and the true (double-difference) bias scores that the data were generated from. We thus obtained, for each of the 1000 datasets, the recoverability of the true score from bias scores computed with each combination of outlier rejection method and bias score algorithm, from data with 0 to 32 outliers per participant. These recoverability values were averaged across datasets to gain an overview of how recoverable the true score was through each combination of methods at each number of outliers.
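A sketch of this injection step (our own; it reuses rshiftedgamma() from the earlier simulation sketch and approximates μ_participant with the participant's current mean RT):

    # Replace one randomly chosen RT per participant with a slow outlier
    # drawn from a shifted gamma distribution centered 1200 ms above the
    # participant's mean RT; call repeatedly to add outliers one at a time
    inject_slow_outliers <- function(df) {
      for (s in unique(df$subject)) {
        rows <- which(df$subject == s)
        i <- sample(rows, 1)
        df$rt[i] <- rshiftedgamma(1, mean = mean(df$rt[rows]) + 1200,
                                  sd = 400, shape = 3)
      }
      df
    }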
mean difference scores and double-difference D-scores were This procedure was repeated in another 1000 datasets, more strongly affected by outliers (means: r = .80 to r = 0 32 except we iteratively replaced one random RT with a fast .47, D-scores: r = .81 to r = .47 with no outlier removal), 0 32 outlier (mean = μ – 500 ms, SD = 50 ms, gamma- but they were better at recovering the true score when there participant distributed, shape = 3) in each participant’s data and applied were few outliers or when outliers were excluded with M + outlier rejection to fast RTs before we computed bias scores 2 SD or median + 3 MAD; across virtually all outlier rejec- and recoverability. We also repeated the same process in tion methods and numbers of outliers, D-scores were better another 1000 datasets where we iteratively replaced one ran- than double mean difference scores at recovering the true dom RT with a fast outlier and another with a slow outlier. score (correlations were, on average, .01 higher, up to .05). Multilevel compatibility scores showed the strongest decline Results and discussion in true score recoverability following the addition of outliers (r = .79 to r = .35 with no outlier removal), and outlier 0 32 In this section we report on results regarding the double-dif- rejection failed to bring multilevel compatibility scores back ference scores. Outcomes relating to category-specific differ - on par with the other algorithms (e.g. when combined with ence scores were almost identical in pattern but lower in overall 3 MAD outlier rejection, multilevel: r = .73, and D-score: recoverability, and can be viewed in Appendix 2. We report on r = .78). In addition, we report in Appendix 2 how the multilevel compatibility scores rather than multilevel double- multilevel compatibility score also produces a much wider difference scores since the former were better at recovering the range of correlations with the true score than the other meth- true score in virtually all occasions. The results of the simula- ods do, making it especially difficult to know whether any tions for double-difference scores are depicted in Fig.  2. The single application of this method will produce scores with sensitivity and specificity of the examined outlier rejection the expected reliability; D-scores, in comparison, produced methods is also further discussed in Appendix 2. Correlations scores with the least variable correlation with the true score, of bias scores with other aspects of the underlying data are also indicating that this method is not only highly reliable but reported in Appendix 2. These reveal that multilevel bias scores also consistently reliable. Overall, slow outliers strongly are contaminated with variance from the participant mean RT. decreased true score recoverability (the reduction of recov- Whenever we report a correlation in this section, the associated erability from 0 to 32 outliers was between .03 and .44). number of outliers is reported as a subscript. Fast outliers Slow outliers Grubbs’ test and mean – 3 SDs almost completely failed The true score recoverability of the outlier rejection methods to detect fast outliers, performing no better than no outlier followed a similar pattern across all bias scoring algorithms. rejection (Fig. 2). 
The best recoverability of the true score was obtained when classifying all RTs faster than 2 SD below the mean as outliers, especially in data with many outliers (r_16 = .73–.78). The median − 3 MAD method also led to better reliabilities (r_16 = .72–.77) than no outlier rejection (r_16 = .69–.76). Furthermore, compared to no outlier removal (r_0 = .76–.81), removal of the fastest percentile of trials actually led to a decline in reliability, which was especially noticeable when there were few or no fast outliers in the data (r_0 = .73–.78).

Again, double median difference scores were only very slightly affected by outliers (r_0 = .75 to r_32 = .73), followed by D-scores (r_0 = .81 to r_32 = .70), double mean difference scores (r_0 = .80 to r_32 = .69), and lastly multilevel compatibility scores (r_0 = .79 to r_32 = .65); but median difference scores had lower reliability in the absence of outliers and never exceeded the reliability of D-scores when all trials more than 2 SD below the mean were removed. Overall, fast outliers had a relatively small influence on reliability (the reduction of reliability from 0 to 32 outliers was between .02 and .14).

Bilateral outliers

Outlier rejection on data containing both slow and fast outliers led to results resembling a combination of the aforementioned findings, with the largest influence coming from slow outliers. Again, rejecting the top and bottom 1% of RTs reduced rather than improved reliability when there were few to no outliers (percentile outlier removal: r_0 = .73–.79, compared to no outlier removal: r_0 = .75–.81). Bias scores were most reliable if outliers were removed by rejecting RTs deviating more than 3 MAD from the participant median (r_32 = .50–.68), but with fewer outliers, reliabilities were on par when outliers were removed by rejecting RTs deviating more than 2 SD from the participant mean (2 SD: r_8 = .74–.78; 3 MAD: r_8 = .74–.79).

Conclusion

For outlier rejection methods, it can be concluded that percentile-based outlier detection removes too few outliers when there are many, and removes too many outliers when there are few, to the point of making bias scores less accurate under common circumstances (e.g. Fig. 2, second row, lines with squares). Accordingly, percentile-based outlier exclusion appears to be disadvantageous. Given both slow and fast outliers, the remaining outlier rejection methods did not strongly differ in effectiveness when there were few outliers, but when there were many, median ± 3 MAD (Fig. 2, row 3, lines with upward triangles) outperformed mean ± 2 SD, which in turn outperformed Grubbs' test and mean ± 3 SD (Fig. 2, row 3: diamonds, downward triangles, and circles). Mean ± 3 SD and Grubbs' test also failed to reject most fast outliers (Fig. 2, row 2: circles and downward triangles), which suggests there is little point in using these methods to remove fast outliers; one should thus combine these two methods with an absolute lower outlier cutoff of, for instance, 200 ms.

Among the algorithms, double-difference D-scores and double mean difference scores were most reliable when there were few outliers; in data with many slow and fast outliers (>8%), they were outclassed by double median difference scores despite outlier rejection. Multilevel compatibility scores were less reliable than the aforementioned methods when there were no outliers, they became more unreliable when there were more outliers in the data, and their reliabilities were more inconsistent than those of the other methods. Worryingly, applying outlier rejection was not enough to make multilevel compatibility scores as reliable as those derived with the methods not based on multilevel analysis. This casts doubt on whether the use of this scoring method is justifiable. Median difference scores were shown to be nearly unaffected by outliers, but they were less reliable than the other methods when there were few outliers and outlier rejection was applied. Hence, it appears that the robustness of median-based scores may be outweighed by the reliability and consistency of mean-based scores, and especially of D-scores, in conjunction with appropriate outlier rejection.

Study 4: Comparison of validity and reliability of pre-processing pipelines on real data

Introduction and methods

We next examined the effect of different pre-processing decisions on reliability and validity in six real datasets.

Description of the examined datasets and their criterion validity measures

We selected datasets to cover appetitive and aversive stimulus categories, relevant- and irrelevant-feature task instructions, and joystick and touchscreen input, to get results that can generalize to a wide range of future AAT studies. Datasets were only eligible if they measured both an initiation RT and a full-motion RT, if they featured a target and a control category, and if their bias scores were significantly correlated with a criterion variable. Properties of the datasets, such as mean RT and error rate, are shown in Appendix 2.

Datasets for "Erotica"

We used data from a single experiment fully described in Kahveci, Van Bockstaele, et al. (2020). In short, 63 men performed an AAT featuring eight blocks with 40 trials each.
Study 4: Comparison of validity and reliability of pre-processing pipelines on real data

Introduction and methods

We next examined the effect of different pre-processing decisions on reliability and validity in six real datasets.

Description of the examined datasets and their criterion validity measures

We selected datasets to cover appetitive and aversive stimulus categories, relevant- and irrelevant-feature task instructions, and joystick and touchscreen input, to obtain results that can generalize to a wide range of future AAT studies. Datasets were only eligible if they measured both an initiation RT and a full motion RT, if they featured a target and a control category, and if their bias scores were significantly correlated with a criterion variable. Properties of the datasets, such as mean RT and error rate, are shown in Appendix 2.

Datasets for "Erotica" We used data from a single experiment fully described in Kahveci, Van Bockstaele, et al. (2020). In short, 63 men performed an AAT featuring eight blocks with 40 trials each. In four of these blocks, they had to classify images of women on the basis of whether the images were erotic or not (relevant-feature), and in the other four blocks they had to classify the images on the basis of hair color (irrelevant-feature). Half of the participants responded with the joystick and the other half using the keyboard. For analysis in the current study, five participants were removed: one with incomplete data, and four with a mean RT over 1000 ms. As the criterion validity measure, we chose the participants' self-reported number of porn-viewing sessions per week, as we found that approach–avoidance scores correlated more strongly with this score than with other constructs measured in the study.

Datasets for "Foods" We used data from a single experiment fully described in Lender et al. (2018). In short, 117 participants performed one of three joystick AATs involving food and object stimuli, where the correct movement direction was determined by different elements: stimulus content (N = 37), picture frame (N = 44), and a shape displayed in the middle of the stimulus (N = 36). Each task involved two blocks of 128 trials each. For the current study, we selected the content-based AAT as the relevant-feature task to be analyzed, and the frame-based AAT as the irrelevant-feature AAT to be analyzed. For analysis in the current study, we removed one participant from the relevant-feature AAT with an error rate above 50%. As the criterion variable, we chose the restrictive eating scale (α = .90) of the Dutch Eating Behavior Questionnaire (van Strien et al., 1986), as we found that approach–avoidance bias scores correlated more strongly with this score than with other constructs measured in the study.

Datasets for "Spiders" For the relevant-feature AAT involving spiders, we used data from a single study fully described in Van Alebeek et al. (2023).
In short, 85 participants performed a relevant-feature AAT on a touchscreen where they were shown pictures of 16 spiders and 16 leaves, and were required to approach and avoid on the basis of stimulus content. Approaching involved sliding the hand towards the stimulus and then dragging it back to the screen center, while avoidance involved sliding the hand away from the stimulus. The task involved 128 trials divided into two blocks, and was embedded in a larger experiment which also included AATs involving butterflies, office articles, and edible and rotten food. As the criterion variable, we chose the Spider Anxiety Screening (α = .88; Rinck et al., 2002).

For the irrelevant-feature AAT involving spiders, we used data from a single study fully described in Rinck et al. (2021). In short, participants performed an irrelevant-feature go/no-go AAT on a touchscreen, where they were shown images of 16 spiders, 16 leaves, and 16 butterflies. Participants approached or avoided the spiders and leaves based on their position on the screen, while they were required not to respond to the butterflies. Responses always involved lifting the hand off the touchscreen, touching the stimulus, and then sliding it toward the other side of the screen. Thus, stimuli at the top of the screen were dragged closer and thus approached, while stimuli at the bottom of the screen were moved further away and thus avoided. After excluding all the no-go trials, the experiment consisted of 128 trials in a single block. The Spider Anxiety Screening was again used as the criterion variable (α = .92).

Multiverse analysis

The six aforementioned datasets were pre-processed through many different pipelines, after which we computed the split-half reliability using the function aat_splithalf() in R package AATtools (Kahveci, 2020), as well as the criterion validity using Spearman correlations.
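In outline, one random split works as sketched below. This is a simplified, hand-rolled stand-in for aat_splithalf(), with hypothetical data-frame and scoring-function names, and it assumes a Spearman–Brown correction of the half-length correlation.

```r
# One randomized split-half correlation for an arbitrary bias score
split_half_once <- function(data, score_fun) {
  halves <- lapply(split(data, data$subject), function(d) {
    h <- sample(rep(1:2, length.out = nrow(d)))   # random half assignment per trial
    c(score_fun(d[h == 1, ]), score_fun(d[h == 2, ]))
  })
  scores <- do.call(rbind, halves)
  r <- cor(scores[, 1], scores[, 2], use = "complete.obs")
  2 * r / (1 + r)                                 # Spearman-Brown correction
}

# reliability <- mean(replicate(6000, split_half_once(aat_data, bias_score)))
```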
We computed the average of 6000 random split-half correlations to obtain the randomized split-half reliability. We used 6000 iterations because we found in an analysis reported in Appendix 3 that, at most, 6000 random splits are needed to ensure that at least 95% of average split-half coefficients deviate less than .005 from the grand average of 100,000 splits. The examined components of the pipeline included the definition of the RT (initiation time, completion time), the lower RT limit (0 ms, 200 ms, 350 ms), the upper RT limit (1500 ms, 2000 ms, 10,000 ms), the adaptive outlier rule (none, mean ± 2 SD, mean ± 3 SD, median ± 3 MAD, <1% and >99%, Grubbs' test), the error rule (keep errors, remove errors, replace errors with the block mean + 600 ms; further called error penalization), the algorithm type (category-specific difference, double-difference), and the algorithm aggregation method (mean difference, median difference, D-score, multilevel category-specific difference or compatibility). This led to a total of 2592 pipelines per dataset. The examined pipeline components were selected on the basis of their common use and methodological rigor as revealed by the literature review in Study 1, on the basis of results from the analyses in Studies 2 and 3, and with emphasis on newly (re)emerging methods in the field (e.g., Grubbs' test: Saraiva et al., 2013). In each analysis, the pre-processing steps were applied in the following order: the RT measure was selected, the lower and upper RT cutoffs were applied, error trials were excluded if required, outliers were excluded, error trials were penalized if required, and the bias scores were computed. During the computation of split-half reliability, participants were excluded from individual iterations if their bias score in either half deviated more than 3 SD from the sample mean of that half, to ensure correlations were not driven by outliers.
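Crossing these components reproduces the reported pipeline count; the sketch below does so with our own level labels.

```r
# The multiverse grid: every combination of the pipeline components
pipelines <- expand.grid(
  rt_type      = c("initiation", "completion"),
  lower_cutoff = c(0, 200, 350),
  upper_cutoff = c(1500, 2000, 10000),
  outlier_rule = c("none", "mean2sd", "mean3sd", "median3mad", "percentile", "grubbs"),
  error_rule   = c("keep", "remove", "penalize"),
  score_type   = c("category-specific", "double-difference"),
  aggregation  = c("mean", "median", "dscore", "multilevel")
)
nrow(pipelines)  # 2592 = 2 * 3 * 3 * 6 * 3 * 2 * 4
```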
Validity of category-specific difference scores

To gain an overview of the psychometric properties of category-specific difference scores, we performed a number of tests. We computed category-specific bias scores for target and control stimuli by subtracting participants' median approach RT from their median avoid RT, both computed from initiation times of correct responses. We computed the correlation between bias scores for target and control stimuli. Across the whole of the multiverse analyses, we also computed the rank correlation between reliability and criterion validity per dataset and algorithm type (category-specific difference, double-difference). We excluded multilevel compatibility scores from analyses involving the irrelevant-feature AAT for reasons which are explained further in the results section.

Decision trees

Following the computation of reliability and criterion validity for each pipeline, we applied the Fisher z-transformation to the reliability and validity values to be able to analyze differences at both low and high levels of reliability and validity. We submitted the z-transformed reliabilities and validities as dependent variables to linear mixed decision tree analyses with random intercepts for dataset and fixed predictors for RT type, lower RT cutoff, upper RT cutoff, adaptive outlier rejection rule, error rule, and aggregation method. We used an alpha level of .001 and a maximum tree depth of 6 to prevent the decision trees from becoming too large to display. Decision trees were generated using R package glmertree (Fokkema et al., 2018). For display in plots and tables, the z-transformed correlations were averaged and then converted back to regular correlations.
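A sketch of this analysis for one outcome follows. The formula layout follows glmertree's three-part interface (global terms | random effects | partitioning variables); the data-frame and variable names are ours, and passing alpha and maxdepth straight through to the tree-growing control reflects our reading of the package interface.

```r
library(glmertree)

# Fisher z-transform so differences are comparable at low and high values
multiverse$z_rel <- atanh(multiverse$reliability)

# Mixed-effects model tree: random intercept per dataset,
# pipeline components as partitioning variables
tree <- lmertree(
  z_rel ~ 1 | dataset |
    rt_type + lower_cutoff + upper_cutoff + outlier_rule + error_rule + aggregation,
  data = multiverse, alpha = .001, maxdepth = 6
)

# For display: average the z values, then back-transform to a correlation
tanh(mean(multiverse$z_rel))
```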
Results and discussion

Validity of category-specific difference scores

Table 6 Correlations between target- and control-specific difference scores, and t-tests comparing control-specific bias scores to zero

                                 Correlation between          Correlation between reliability and criterion validity
                                 category-specific difference
                                 scores of target and control Category-specific difference   Double-difference
Instructions         Stimuli     r        p                   r         p                    r        p
Relevant-feature     Erotic      0        .988                .20       <.001                .31      <.001
                     Food        −.40     .017                −.36      <.001                .15      <.001
                     Spider      −.14     .189                −.20      <.001                .31      <.001
Irrelevant-feature   Erotic      .46      <.001               −.05      .152                 −.01     .661
                     Food        .38      .011                −.45      <.001                −.16     <.001
                     Spider      .35      .001                .39       <.001                .08      .010

As can be seen in Table 6, target- and control-specific difference scores were positively correlated in all irrelevant-feature AATs, indicating that a significant portion of the variance in target-specific and control-specific stimuli is shared; this shared variance may originate from the interpersonal variability in participants' overall approach–avoidance RT differences, as we speculated in Study 1. Conversely, there was a significant negative correlation for two of the three relevant-feature AATs, indicating that category-specific difference scores to target and control stimuli are related to a source of variance that increases one bias score but decreases the other, such as response slowdown between blocks.

As reported in Table 6, when bias scores were computed with double-difference scores, reliability and criterion validity were positively correlated in four datasets, negatively in one, and not at all in one. When bias scores were computed with category-specific difference scores, reliability and criterion validity were negatively correlated in three studies, positively in two, and not at all in one. We expected positive correlations between reliability and criterion validity, as more reliable measures are less influenced by noise and could hypothetically capture the approach–avoidance bias more accurately, enabling stronger correlations with measures of similar constructs; negative correlations would imply that when bias scores become more reliable, they get better at measuring a construct that is different from implicit approach–avoidance bias, which would cast doubt upon the validity of the scores. These findings thus err more towards supporting than rejecting the idea that category-specific difference scores run a risk of being contaminated with sources of variance unrelated to approach–avoidance bias of the target stimuli, and that they run a higher risk than double-difference scores of becoming less valid as they become more reliable. In the remainder of this results section, we will therefore report on double-difference scores, while results on category-specific difference scores can be gleaned in Appendix 4.

Variability in reliability and validity of different bias scoring algorithms

We sought to gain an overview of how much the various bias scoring algorithms are perturbed by other pre-processing decisions. Figure 3 and Table 7 depict the mean reliability and criterion validity of the various datasets, as well as several measures of spread. Criterion validity and especially reliability were found to strongly fluctuate depending on which pre-processing pipeline was used. Comparing task types, the irrelevant-feature AATs were, on average, less valid and much less reliable, and their reliabilities and validities were more strongly perturbed by pre-processing decisions. Variability of reliability estimates was especially strong in multilevel compatibility scores in the irrelevant-feature AAT, with extreme values reaching into the range of 1 as well as −1. This is likely due to the fact that small random effects are difficult to identify in multilevel models and can thus get contaminated with other aspects of the data, such as the mean RT, as we demonstrated in Appendix 2. Multilevel compatibility scores may therefore not be valid for the irrelevant-feature AAT, nor for any other task with very small effect sizes. Hence, we do not analyze multilevel compatibility scores in the irrelevant-feature AAT in the remainder of this article.

Fig. 3 Distributions of reliability and validity coefficients acquired through different pre-processing pipelines in the six analyzed datasets. This figure depicts the distribution of reliability and criterion validity estimates from all different pre-processing pipelines. A wide distribution implies that differing pre-processing decisions had a large influence on the resulting reliability or criterion validity. Criterion validity is based on the correlation between approach–avoidance bias and a variable that was preselected on the basis of its significant correlation with approach–avoidance bias scores in that particular dataset. It is therefore of little value to focus on how high or low this value is in absolute terms. Rather, we guide the reader to focus on the spread or uncertainty of this value. In all cases, the validity of the irrelevant-feature AAT datasets is more spread out than that of the relevant-feature AATs

Table 7 Means, confidence intervals, and variability estimates for reliability and validity outcomes over all pipelines

                                  Reliability                          Criterion validity
AAT type             Algorithm    Mean r   95% CI        SD (z)        Mean r   95% CI        SD (z)
Relevant-feature     Multilevel   .74      .46, .88      .12           .27      .03, .45      .09
                     Mean         .74      .45, .88      .12           .27      .05, .44      .08
                     Median       .72      .49, .84      .07           .33      .23, .48      .06
                     D-score      .77      .55, .89      .10           .28      .05, .45      .10
Irrelevant-feature   Multilevel   .23      −.67, .89     .60           .12      −.22, .38     .14
                     Mean         −.05     −.61, .27     .21           .23      .03, .52      .11
                     Median       −.09     −.56, .18     .18           .22      .01, .39      .09
                     D-score      .01      −.32, .22     .12           .23      .03, .50      .10

Note: SDs represent the pooled SD of z-transformed, not raw, reliability and criterion validity estimates. Pooling was done by computing the variance within each dataset first and then averaging across datasets. Criterion validity is based on the correlation between approach–avoidance bias and a variable that was preselected on the basis of its significant correlation with approach–avoidance bias scores in that particular dataset. It is therefore of little value to focus on how high or low this value is in absolute terms. Rather, we guide the reader to focus on the spread or uncertainty of this value. In all cases, the validity of the irrelevant-feature AAT datasets is more spread out than that of the relevant-feature AATs.
Reliability decision trees

We used decision trees to deconstruct the complex nonlinear relationships between different factors in how they influence the reliability and validity of the six AATs.

The reliability decision tree of the relevant-feature AATs is depicted in Fig. 4. The most influential decision was how to handle error trials: penalization (.71) gave worse reliability than error removal or retention (.76). The second most influential decision was the algorithm: double-difference D-scores were the most reliable (.78) but could lead to lower reliability if lax outlier rules (upper RT limit of 10,000 ms, no adaptive outlier exclusion or percentile-based exclusion) were applied to completion times (.72); multilevel compatibility and double mean difference scores came in closely after (.76), and the only thing harming their reliability was retention (.73) rather than removal of errors (.76). Double median difference scores were the least reliable (.73) and were further harmed by the stricter outlier removal methods (.71; median ± 3 MAD, M ± 2 SD).

Fig. 4 Decision tree of factors influencing the reliability of the relevant-feature AAT. The factors that the data were split by are denoted on each node, and the factor levels by which the data were split are depicted on the edges emerging from these nodes. The numbers displayed in each node represent the average reliability achieved by the decisions that led to that node. Particularly reliable and unreliable pathways are respectively depicted in green and grey

The reliability decision tree of the irrelevant-feature AAT is depicted in Fig. 5. Reliability was very low for this task. Again, error penalization harmed reliability (−.13), though less for D-scores (−.02). Algorithm was the second most important decision when errors were not penalized: double median difference scores gave bad reliability (~ −.10), except with the use of completion times and less strict upper RT limits like 2000 ms or above (.03). Double mean difference scores and double-difference D-scores benefited the most from removal of error trials and from outlier handling with any method (.05) other than percentiles.

Fig. 5 Decision tree of factors influencing reliability in the irrelevant-feature AAT. The factors that the data were split by are denoted on each node, and the factor levels by which the data were split are depicted on the edges emerging from these nodes. The numbers displayed in each node represent the average reliability achieved by the decisions that led to that node. Particularly reliable and unreliable pathways are respectively depicted in green and grey

Validity decision trees

We used the same methodology to construct decision trees for validity.

As depicted in Fig. 6, the criterion validity of the relevant-feature AAT was much less strongly perturbed by pre-processing decisions than its reliability was. Once again, error penalization was harmful to criterion validity on average (.23) compared to error removal and retention (.32). However, if error trial RTs were penalized, validity could be salvaged with the use of a combination of double median difference scores, completion times, and a 1500 ms RT cutoff (.37). When error trials were not penalized, validity was higher for double median difference scores and double-difference D-scores (.33) than for double mean difference scores or multilevel compatibility scores (.30). Additionally, validity often benefited slightly from removal of error trials and, subsequently, from the use of completion times rather than initiation times.

Fig. 6 Decision tree of factors influencing the criterion validity of the relevant-feature AAT. The factors that the data were split by are denoted on each node, and the factor levels by which the data were split are depicted on the edges emerging from these nodes. The numbers displayed in each node represent the average criterion validity achieved by the decisions that led to that node. Particularly valid and invalid pathways are respectively depicted in green and grey

As depicted in Fig. 7, criterion validity outcomes were more ambiguous for the irrelevant-feature AAT. Criterion validity was higher with outlier rejection methods that were neither strict nor lax, i.e., M ± 3 SD and the Grubbs test (.25); and validity could only be harmed within this branch by the combination of retaining error trials and using completion times (.19). With the other outlier rejection methods, validity was best when error trials were removed or penalized (.22) rather than kept (.19). Unlike in every other decision tree, error penalization did not lower the outcome measure in this case.

Fig. 7 Decision tree of factors influencing the criterion validity of the irrelevant-feature AAT. The factors that the data were split by are denoted on each node, and the factor levels by which the data were split are depicted on the edges emerging from these nodes. The numbers displayed in each node represent the average criterion validity achieved by the decisions that led to that node. Particularly valid and invalid pathways are respectively depicted in green and grey
General discussion

There is a long chain of often arbitrary pre-processing decisions that researchers have to make before analyzing their data, and the wide variability in outcomes this can generate threatens replicability and scientific progress. Only recently have researchers begun to investigate the consequences of different decisions (Steegen et al., 2016), and a comprehensive study for the field of AAT research has so far been missing. We aimed to fill this gap here.

Our selective literature review in Study 1 revealed a wide range of pre-processing practices in AAT studies. We subsequently used simulations (Studies 2 and 3) and analyses on real data (Study 4) to compare many of these practices, and obtained several findings that can inform further RT research. Importantly, we found large variability in the obtained reliability and validity outcomes depending on the chosen pre-processing pipeline. This highlights the fact that the varying practices do indeed muddy the waters of whether an effect is present or absent, and likewise, that there is much to be gained from an informed choice of the study's pre-processing pipeline. In turn, we will discuss the findings on error handling, outlier rejection, score computation, RT measurement, and instruction type. We will derive from these findings a set of recommendations, which are summarized in Table 8. We also consider implications for other RT-based implicit measures.

Error trials

Most striking was the finding that replacing error RTs with the block mean RT plus a penalty (e.g., 600 ms) frequently led to lower reliability and validity. In Study 1 we found that this method was used in 7 out of 163 reviewed studies, likely due to the influence of the implicit association task literature, in which this method is common. Furthermore, there was a smaller but noticeable disadvantage in reliability and validity when error trials were kept rather than removed, especially when trial completion times were used as the RT measure. Errors were kept in the data in 34 out of 163 reviewed studies.
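For clarity, the penalization rule under discussion amounts to the following (a sketch; the vector names are ours):

```r
# Replace each error RT with its block's mean correct RT plus a 600-ms penalty
penalize_errors <- function(rt, error, block, penalty = 600) {
  block_means <- tapply(rt[!error], block[!error], mean)
  rt[error] <- block_means[as.character(block[error])] + penalty
  rt
}
```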
Outliers

Regarding RT cutoffs, we found that the reliability and validity of real data were unaffected by the presence or absence of lower RT cutoffs; hence, this particular pre-processing decision may not strongly influence reliability and validity outcomes. Upper RT cutoffs did influence outcomes, though not that frequently. A cutoff of 1500 ms showed good validity for completion times in the relevant-feature AAT, while a cutoff of 1500 or 2000 ms showed slightly better reliability than a cutoff of 10,000 ms under very specific conditions. Despite this ambiguity, however, we do suggest that reasonably chosen lower and upper RT cutoffs be applied: both slow and fast outliers, however rare and insignificant they may be, still represent invalid data, and slow outliers still have a strong impact on subsequently applied adaptive outlier removal methods and RT aggregation.

We found no clear pattern regarding which outlier rejection method produced better results in real data; we did, however, find that removing outliers was better than not doing so. This contrasts with the results of our simulations, where true score recoverability followed a consistent pattern across outlier rejection methods from best to worst: median ± 3 MAD > mean ± 2 SD > repeated Grubbs' tests > mean ± 3 SD > 1st & 99th percentiles > none. In our simulations, we found that dealing with outliers by rejecting the lowest and highest RT percentiles across the dataset can actually harm reliability, since this method does not distinguish between real outliers and regular RTs in very fast or slow individuals. It was used in 10 out of 163 reviewed studies. Furthermore, almost all fast outliers remained in our simulated data when we rejected RTs deviating more than 3 SD from the individual mean or RTs that were significant outliers on Grubbs' test; hence, these methods should be used in conjunction with fixed cutoffs for fast outliers. Rejecting RTs deviating more than 2 SD from the individual mean produced the best reliability outcomes in simulations involving fast or few slow outliers, but in real data the reliability and validity of this outlier rejection method often performed on equal footing with rejecting RTs deviating more than 3 SD from the mean. Berger and Kiefer (2021) found in a series of simulation studies that SD-based outlier rejection, in contrast to MAD-based outlier rejection, is less prone to inflating type I error. Based on these findings and the prior establishment of methods in the field, our preference thus goes towards either rejecting outliers deviating more than 2 SD from the mean, or towards rejecting RTs deviating more than 3 SD after very fast and very slow RTs have been removed with fixed cutoffs, as reported in Table 8.

We have two explanations for why the outcomes for outlier rejection in simulated and real data were divergent. First, the outlier rejection methods may have produced more divergent outcomes for our simulations simply because we simulated a large number of outliers: when the number of simulated outliers was smaller and more consistent with what occurs in real data (e.g., 4% of trials), the outlier rejection methods were much less distinguishable. Though less likely, an alternative explanation is that the simulation was based on incorrect assumptions. We assumed that RT differences between conditions are represented by shifts in the bulk of the RT distribution, rather than in the presence of more extreme RTs in one condition than in the other; depending on which of these two assumptions is used, results can be quite different, as demonstrated by Ratcliff (1993). This assumption may have favored outlier rejection methods that remove a larger number of extreme RTs, such as the MAD. More lenient outlier rejection methods would be favored if differences between conditions instead originated from differences in the number of extreme RTs. Future research should investigate whether RT differences between conditions in the AAT are represented by a larger number of extreme RTs or by shifts in the bulk of the RT distribution.

Table 8 Recommendations for pre-processing AAT data

Outliers
  Less reliable/valid: not rejecting outliers
  Ambiguous outcomes: removing the lowest and highest percentile of RTs sample-wide
  More reliable/valid: rejecting RTs deviating more than 2 SD from the mean; rejecting RTs deviating more than 3 SD from the mean; rejecting RTs deemed outliers by repeated Grubbs' tests; rejecting RTs deviating more than 3 MADs from the median; preceding the aforementioned methods with the removal of RTs below and above reasonable fixed cutoffs*

Error trials
  Less reliable/valid: not removing error trials; replacing error trials with the block mean plus a penalty
  More reliable/valid: removing error trials

Bias score computation
  Less reliable/valid: compatibility scores; multilevel double-difference scores; category-specific multilevel scores in the irrelevant-feature AAT; category-specific difference scores*
  Ambiguous outcomes: double median difference scores; multilevel compatibility scores in the relevant-feature AAT
  More reliable/valid: double-difference D-scores in conjunction with outlier rejection; double mean difference scores in conjunction with outlier rejection

Note: Recommendations are displayed in order, with the worst and best methods displayed at the top of each list. * = primarily based on theoretical considerations
Scoring algorithms

Regarding scoring algorithms, we reasoned that category-specific difference scores (approach stimuli − avoid stimuli) are confounded with stimulus-independent individual differences in approach–avoidance speed, and hence, these should always be contrasted with a reference stimulus category. In Study 4, we found that increasing the reliability of a category-specific difference score often decreases its validity, and that category-specific difference scores for target and control stimuli are positively correlated in the irrelevant-feature AAT, supporting the idea that these scores are contaminated, and become more contaminated when they are more reliable. We therefore opted to focus the majority of this article on double-difference scores. However, our concerns about category-specific difference scores need to be corroborated with more conclusive evidence in future empirical studies, which manipulate or track factors that differentially influence approach and avoidance RTs, such as posture, fatigue, muscle mass, and response labelling.

We demonstrated using simulations that compatibility scores become more inaccurate than double-difference scores when there is an unequal number of trials in different conditions, and they confer no benefits over double-difference scores; the only exception to this was in multilevel random effect scores. We found that multilevel random effects are inaccurate compared to other methods, and become increasingly inaccurate when bias scores are modelled with three model terms, as in a double-difference score, rather than with a main effect, as in a compatibility score. Hence, double-difference scores should be preferred over compatibility scores except when bias scores are computed through multilevel modelling.

Among these, double-difference D-scores consistently had the highest validity and reliability and the lowest variability in outcomes, both in simulated and real data; we therefore express our clear preference for double-difference D-scores over the other methods. Double mean difference scores had more variable outcomes and were often slightly less reliable and valid.

We found in both simulations and real data—to our surprise—that double median difference scores led to lower reliability than double mean difference scores or double-difference D-scores in conjunction with adequate outlier rejection. Double median difference scores were only more reliable in simulated data with many outliers (>8%). In validity, there was not as much of a difference between algorithms so long as errors and outliers were removed. Hence, we draw no strong conclusions on whether double median difference scores are to be discouraged or not.

We found that multilevel compatibility scores were more strongly affected by outliers than any other scoring algorithm, and applying outlier rejection did not fully remedy this issue. Multilevel compatibility scores also had the largest unpredictability in outcomes in both simulations and real data, and especially in the irrelevant-feature AAT. We hypothesize that this is due to the fact that it can be difficult for multilevel models to identify small random effects, as occur in the irrelevant-feature AAT, where bias scores explain only a small proportion of the RT variance. Hence, we recommend against using multilevel random effect scores in the irrelevant-feature AAT, and we remain ambivalent about their use in the relevant-feature AAT. Further research is needed to demonstrate whether this method has any advantages that make it preferable over the algorithms that do not use mixed modelling. In particular, multilevel modelling (or per-participant regression) could account for trial-level contaminants of RTs, such as post-error slowing, fatigue, and learning effects, and for stimulus-specific confounds such as recognition speed or visuospatial complexity. It is as of yet unclear how exactly these contaminants could best be modelled, and whether their inclusion benefits the validity of the bias scores.
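For readers implementing these scores, minimal per-participant versions of the three non-multilevel double-difference algorithms are sketched below; the column names rt, movement, and stimcat are our placeholders.

```r
# Double mean difference: contrast the avoid-approach difference for
# target stimuli against that for control stimuli, cell by cell
double_mean_diff <- function(d) {
  m <- tapply(d$rt, list(d$stimcat, d$movement), mean)
  (m["target", "avoid"] - m["target", "approach"]) -
    (m["control", "avoid"] - m["control", "approach"])
}

# Double median difference: the same contrast computed from medians
double_median_diff <- function(d) {
  m <- tapply(d$rt, list(d$stimcat, d$movement), median)
  (m["target", "avoid"] - m["target", "approach"]) -
    (m["control", "avoid"] - m["control", "approach"])
}

# Double-difference D-score: the double mean difference divided by
# the SD of the participant's RTs
double_diff_dscore <- function(d) double_mean_diff(d) / sd(d$rt)
```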
RT definitions

Regarding RT definitions, our findings were somewhat inconclusive: the only consistent pattern was that completion times are less reliable and valid when error trials are also kept in the data. We therefore cannot draw any conclusions as to which of these two RT definitions is preferable. We suggest that the RT definition be chosen on the basis of theoretical considerations and previous research in a specific field. As we found in our own previous research with touchscreen-based AATs, approach–avoidance biases may express themselves primarily at the movement planning stage, such as when the target stimuli are foods (Kahveci et al., 2021; Van Alebeek et al., 2021), or during movement execution, such as when the target stimuli are spiders (Rinck et al., 2021). Since very few studies have explored the outcomes of multiple RT definitions (see also: Rotteveel & Phaf, 2004; Solarz, 1960), we recommend that this be done more often in future research.
Limitations and future directions

We were unable to investigate on which basis to include or exclude participants from AAT studies, for example, on the basis of extreme mean RT, error rate, outlier count, or bias score relative to the rest of the sample. Such an investigation would require the analysis of far more datasets, and hence, this is to be addressed by future research. For now, it may be sensible to reject participants on the basis of preset criteria regarding error rates and mean RTs, as these can signal that the data of a particular participant do not sufficiently represent the mental process under study. Similarly, when non-robust analysis methods are used, outlying bias scores should be removed.

We did not investigate the impact of several less common pre-processing approaches that address problematic aspects of the data overlooked by most reviewed methods. RT transformations, such as square root, natural logarithm, and inverse transformations, can reduce the rightward skew of the RT distribution and thereby de-emphasize the influence of slow RTs on subsequently computed bias scores that are based on means or regression. Similarly, there are a number of outlier rejection methods that can deal with skewed distributions, such as the use of interquartile ranges and exclusion using asymmetric SDs; these currently remain unexplored in this article and the wider AAT literature. RTs can also be excluded by their temporal position within the block, as participants are often still memorizing the instructions at the start of the block; hence, exclusion of trials at the start of the block is a recommended pre-processing step for the brief IAT (Nosek et al., 2014).

Generalization to other RT paradigms

The current methodological findings cause concern for how data are analyzed in other RT tasks. However, it is difficult to forecast how the examined methods affect the validity of other tasks, as other tasks might depend on aspects of the data that are masked by this study's recommendations. Ideally, the current multiverse decision tree methodology could be applied to every popular experimental paradigm to confirm whether it is beneficial or detrimental how these tasks are currently pre-processed. It is particularly important that such a multiverse analysis is performed on paradigms where the most commonly used pre-processing pipelines include methods we found to be detrimental. The IAT, for example, is commonly analyzed by penalizing error trials and including outliers in the data. These recommendations by Greenwald et al. (2003) were adopted in a minority of AAT studies that used the D-score (e.g., Ferentzi et al., 2018; Lindgren et al., 2015; Van Alebeek et al., 2021), and are contradicted by our findings.

This being said, a number of our findings are purely statistical in nature and can be expected to generalize regardless of the paradigm. We demonstrated in Study 2 that less accurate aggregated scores are obtained when averaging together two conditions with unequal trial counts (as in compatibility scores) instead of computing separate averages for each and adding those together (as in double-difference scores); the sketch below illustrates the difference. This disadvantageous practice is common in research on the IAT (Greenwald et al., 2003). Additionally, rejecting the top and bottom 1% of RTs as outliers will also lead to the removal of an inappropriately low or high number of trials in other paradigms, although this method currently sees little use outside the AAT literature. Lastly, fast outliers will also remain undetected in other paradigms when most of the adaptive outlier rejection methods that we examined are applied.
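A minimal illustration of the two aggregation strategies follows (placeholder column names; compatible marks approach-target and avoid-control trials):

```r
# Compatibility score: one grand average over ALL compatible trials versus
# one over ALL incompatible trials, so over-sampled cells dominate
compatibility_score <- function(d) {
  mean(d$rt[!d$compatible]) - mean(d$rt[d$compatible])
}

# Double-difference score: average the four stimulus-by-movement cells
# separately before contrasting, so each cell carries equal weight
double_difference_score <- function(d) {
  m <- tapply(d$rt, list(d$stimcat, d$movement), mean)
  (m["target", "avoid"] - m["target", "approach"]) -
    (m["control", "avoid"] - m["control", "approach"])
}
```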
Finally, it remains to be explored further in the AAT and in other paradigms whether it is problematic to contrast two response conditions to target stimuli without further contrasting these to control stimuli, as with category-specific difference scores. Stimulus-independent biases favoring one response over the other are common and cannot always be prevented through good experimental design. It remains to be shown, however, how influential they truly are, especially when responses consist of mere button-presses rather than full-limb movements as with the joystick.

Conclusions

Far from delivering a one-size-fits-all recommendation for pre-processing the AAT, our review, simulations, and multiverse decision tree analyses have recovered a number of more reliable and valid methods, while eliminating a smaller number of methodologically harmful "forking paths" in the garden of AAT pre-processing decisions, as shown in Table 8. As some of these harmful practices are highly common (e.g., error trial retention or penalization) or even dominate the field (e.g., median category-specific difference scores), we hope that the recommendations of the current study will help to significantly improve the overall reliability and validity of future AAT studies.

Appendix 1

Parameter retrieval procedure

Parameters were computed for the six datasets with and without errors and outliers (defined as RTs below 200 ms or above 2000 ms). For the main effect of movement direction, we computed per participant the mean difference for approach minus avoid trials; for the main effect of stimulus category, we computed the mean difference for trials featuring the target minus control stimuli. For the effect of bias score, we computed the mean difference between trials featuring approach of target stimuli and avoidance of control stimuli, minus trials featuring avoidance of target stimuli and approach of control stimuli. For the RT mean and SD, we computed the mean and SD of each participant's RTs before and after the subtraction of the aforementioned movement, stimulus, and bias effects from the RTs. After this, parameter means and SDs were computed across participants. These parameters are reproduced in Appendix Tables 9 and 10.

Table 9 Sample characteristics of the six datasets used in the study

Content  Task type           Outliers  N subjects  Mean N trials  Mean errors  Mean RT  Mean RT var.  Full RT SD  Full RT SD var.  Residual RT SD  Residual RT SD var.
Erotica  Relevant-feature    Raw       58          160            10.5         538.09   77.1          175.79      80.81            171.35          79.46
                             Clipped   58          148.91         -            536.15   68.98         150.67      44.41            146.31          43.9
         Irrelevant-feature  Raw       58          160            13.62        617.42   107.52        213.55      108.77           211.19          108.07
                             Clipped   58          145.5          -            608.04   95.19         180.41      63.29            177.92          62.97
Foods    Relevant-feature    Raw       36          255.75         24.42        618.46   96.93         203.2       79.46            196.9           78.45
                             Clipped   36          231.33         -            632.24   90.07         165.87      50.72            158.37          49.87
         Irrelevant-feature  Raw       44          241.39         34.36        527.26   106.84        187.36      80.14            185.14          79.66
                             Clipped   44          207.02         -            535.23   97.34         158.09      55.09            155.42          54.4
Spiders  Relevant-feature    Raw       85          128            2.42         548.22   90.57         147.1       71.31            140.69          67.55
                             Clipped   85          121.16         -            561.47   76.93         124.49      43.57            117.5           41.68
         Irrelevant-feature  Raw       86          128            5.31         539.46   72.39         124.33      68.24            119.51          65.75
                             Clipped   86          122.15         -            539.16   68.82         114.47      42.85            109.57          41.94

Table 10 Effect size means and variances of RT contrasts in the six datasets used in the study

Content  Task type           Outliers  Pull effect  Pull effect var.  Pull effect size  Stim. effect  Stim. effect var.  Stim. effect size  Bias effect  Bias effect var.  Bias effect size
Erotica  Relevant-feature    Raw       −20.12       36.56             −.55              −16.92        40.51              −.42               26.12        52.22             .5
                             Clipped   −25.87       36.78             −.7               −18.58        35.13              −.53               20.99        37.63             .56
         Irrelevant-feature  Raw       −19          38.21             −.5               15.93         37.75              .42                −.06         33.83             0
                             Clipped   −28.57       33.32             −.86              11.36         25.81              .44                3.33         31.79             .1
Foods    Relevant-feature    Raw       −30.62       35.24             −.87              −30.88        36.21              −.85               39.26        69.91             .56
                             Clipped   −39.21       40.5              −.97              −30.94        32.5               −.95               38.97        60.13             .65
         Irrelevant-feature  Raw       −27.61       38.91             −.71              −4.52         26.41              −.17               1.01         25.6              .04
                             Clipped   −33.08       34.45             −.96              −1.18         29.1               −.04               1.22         23.66             .05
Spiders  Relevant-feature    Raw       −27.04       45.96             −.59              −25.05        45.02              −.56               −7.81        62.48             −.12
                             Clipped   −31.99       39.16             −.82              −27.77        34.28              −.81               −7.46        52.67             −.14
         Irrelevant-feature  Raw       −35.84       42.22             −.85              11.52         39.22              .29                −5.5         35.56             −.15
                             Clipped   −34.93       36.94             −.95              10.52         35.22              .3                 −8.31        26.94             −.31
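Read literally, the movement, stimulus, and bias contrasts described in the parameter retrieval procedure amount to the following per-participant computations (a sketch with our placeholder column names):

```r
# Movement direction effect: approach minus avoid
movement_effect <- function(d) {
  mean(d$rt[d$movement == "approach"]) - mean(d$rt[d$movement == "avoid"])
}

# Stimulus category effect: target minus control
stimulus_effect <- function(d) {
  mean(d$rt[d$stimcat == "target"]) - mean(d$rt[d$stimcat == "control"])
}

# Bias effect: compatible cells (approach target, avoid control)
# minus incompatible cells (avoid target, approach control)
bias_effect <- function(d) {
  compat <- (d$stimcat == "target") == (d$movement == "approach")
  mean(d$rt[compat]) - mean(d$rt[!compat])
}
```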
Appendix 2

Additional findings in Study 3

Category-specific difference scores and the impact of outliers and outlier removal

Category-specific difference scores showed the same pattern as double-difference scores in how outliers and outlier rejection affected their ability to recover the true score on average, as depicted in Appendix Fig. 8.

Fig. 8 True score recoverability changes due to exclusion of outliers across the four category-specific difference score methods

Variability in true score recoverability of double-difference scores

We also computed the SD, rather than the mean, of the true score recoverability. The SD was computed on the basis of Fisher r-to-z transformed correlations, rather than untransformed correlations. The transformation was applied to minimize the influence of average correlation magnitude on correlation dispersion. The resulting SD represents how unpredictable the correlation between the computed and true score is. The results are depicted in Appendix Fig. 9. They reveal that multilevel compatibility scores are highly variable in their correlation with the true score, compared to the other methods. The results also highlight that median-based scores are not more stable than mean-based scores; on the contrary, D-scores had the smallest SD of their correlation with the true score.

Fig. 9 Variability of true score recoverability computed with different outlier rejection methods and scoring algorithms at different numbers of outliers

Outlier detection rates of outlier exclusion methods given varying numbers of outliers

The outlier detection rates of different outlier rejection procedures are depicted in Appendix Fig. 10. The percentile method stands out as having the highest false negative rate of all outlier detection methods once the data contain more than 1% of outliers, which makes sense, but also the highest false positive rate with fast outliers, which may be because it detects outliers across the entire sample and not within participants. Rejecting RTs deviating more than 2 SD from the participant mean led to the highest true positive rates and lowest false negative rates for fast RTs (i.e., the highest sensitivity). For slow RTs, however, the 2 SD method had the highest false positive rate when there were very few outliers, which was apparently to no detriment to the reliability of the data, and this false positive rate was greatly reduced when there were more outliers in the data.

Fig. 10 Outlier detection rates for different detection methods and types of outliers
Spurious correlates of bias score algorithms

We not only computed correlations between the computed bias score and the true underlying bias score, but also with other parameters that were used to generate the data, including the true mean RT, the true movement direction effect (irrespective of stimulus), and the true stimulus category effect (irrespective of movement direction). The results at an outlier count of 0 are depicted in Appendix Fig. 11. Three aspects of these results are worth noting. First, target-specific bias scores are unsurprisingly correlated with movement direction effects. Second, only the double mean difference score and the double-difference D-score consistently have a decent correlation with the true bias effect, while all other algorithms had below-zero correlations on some occasions. Third and most importantly, the correlation between multilevel-based scores and mean RT is highly spread out in both the positive and negative directions, with almost perfect correlations within the realm of possibility; this is not a property of the data, given that the other bias scores do not feature such extreme correlations. This contamination is sure to reduce the validity of multilevel bias scores as well as make them artificially reliable, given that mean RT is a highly reliable variable.

Fig. 11 Correlations of bias scores with parameters used to generate the data

Appendix 3

Determining the ideal number of split-halves

We next determined the ideal number of split-halves to use for real data, as there are, to our knowledge, no recommendations on this. We split the six real datasets 100,000 times, computed bias scores for both halves in each split, and recorded the correlation between scores for both halves. From this large pool of split-half correlations, we added one random correlation at a time to a pool and averaged the correlations in the pool together, recording the resulting aggregated correlation for each pool size from 1 to 20,000. This was done 200 times each for double mean difference scores, double median difference scores, and double-difference D-scores. To analyze the accuracy associated with each pool size, we computed the absolute difference between the average correlation for the pool and that for the entire set of 100,000 splits. For each pool size, we counted the number of average correlations deviating more than .005 from the grand average. We then computed, for each of the six datasets and three algorithms, the largest number of iterations below which more than 5% of split halves deviated more than .005 from the grand average. If less than 5% of averages deviated more than .005 from the grand average, this was deemed an acceptable number of splits.
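In code, the pooling analysis reduces to tracking a running average (a sketch; all_split_correlations stands in for the 100,000 recorded split-half correlations):

```r
# Grow a pool of random split-half correlations and record how far the
# running average strays from the grand average of all splits
pool_accuracy <- function(split_rs, max_pool = 20000) {
  grand <- mean(split_rs)
  pool <- sample(split_rs, max_pool)
  running_avg <- cumsum(pool) / seq_along(pool)
  abs(running_avg - grand)                      # deviation per pool size
}

# deviations <- replicate(200, pool_accuracy(all_split_correlations))
# prop_deviant <- rowMeans(deviations > .005)   # share of runs off by > .005
```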
Fig. 12 Percentage of pooled split-half correlations deviating more than .005 from the grand average as a function of the number of split-half correlations included in the pool

Appendix Fig. 12 depicts the gradual increase in accuracy in split-half reliability estimation as more split-halves were averaged together. Appendix Table 11 depicts the largest number of iterations above which less than 95% of average sets deviated less than .005 from the grand mean of split-half correlations for each scoring algorithm and dataset.

Table 11 Largest number of pooled split-half correlations at which more than 5% of pool averages deviated more than .005 from the grand average

Dataset                      Double mean difference   Double median difference   Double-difference D-score
Irrelevant-feature Erotic    2340                     2800                       1980
Irrelevant-feature Food      3460                     5180                       1740
Irrelevant-feature Spider    1580                     1500                       860
Relevant-feature Erotic      1480                     1320                       700
Relevant-feature Food        540                      720                        340
Relevant-feature Spider      380                      560                        280

To obtain accurate split-half estimates, D-scores required the fewest iterations, as did the relevant-feature AATs, which tend to be more reliable—for these, 2000 iterations would be more than enough. Mean and median double-difference scores in irrelevant-feature AAT datasets may require more than 5500 split-half iterations to obtain stable results.

Appendix 4

Decision trees for category-specific scores

For category-specific difference scores, we generated decision trees in the exact same manner as was described in Study 4. The only difference in methodology was that bias scores were computed with only the target stimuli, thus ignoring the control stimuli. Reliability outcomes for the relevant-feature AAT are depicted in Appendix Fig. 13, and for the irrelevant-feature AAT in Appendix Fig. 14. Criterion validity outcomes for the relevant-feature AAT are depicted in Appendix Fig. 15, and for the irrelevant-feature AAT in Appendix Fig. 16.

Fig. 13 Decision tree of factors influencing the reliability of the relevant-feature AAT, as computed with category-specific difference scores

Fig. 14 Decision tree of factors influencing the reliability of the irrelevant-feature AAT, as computed with category-specific difference scores

Fig. 15 Decision tree of factors influencing the criterion validity of the relevant-feature AAT, as computed with category-specific difference scores

Fig. 16 Decision tree of factors influencing the criterion validity of the irrelevant-feature AAT, as computed with category-specific difference scores

Acknowledgements The authors would like to thank Johannes Klackl, Max Primbs, and Joppe Klein Breteler for their helpful methodological suggestions, and Julia Klier for her help with performing the literature review.

Code availability All analysis scripts can be found in this study's online repository: https://doi.org/10.17605/OSF.IO/YFX2C

Authors' contributions Sercan Kahveci: conceptualization, software, formal analysis, data curation, resources, writing – original draft, writing – review & editing, visualization. Mike Rinck: resources, writing – review & editing. Hannah van Alebeek: resources, writing – review & editing. Jens Blechert: resources, writing – review & editing, supervision.
Funding Open access funding provided by Paris Lodron University of Salzburg. Hannah van Alebeek and Sercan Kahveci were supported by the Doctoral College "Imaging the Mind" (FWF; W1233-B). Hannah van Alebeek was additionally supported by the project "Mapping neural mechanisms of appetitive behaviour" (FWF; KLI762-B). Mike Rinck was supported by the Behavioural Science Institute of Radboud University.

Data availability The datasets generated and/or analyzed in the current study can be found in this study's online repository: https://doi.org/10.17605/OSF.IO/YFX2C

Declarations

Conflicts of interest The authors declare no conflicts of interest.

Ethics approval Not applicable.

Consent to participate Not applicable.

Consent for publication Not applicable.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References

Barton, T., Constable, M. D., Sparks, S., & Kritikos, A. (2021). Self-bias effect: Movement initiation to self-owned property is speeded for both approach and avoidance actions. Psychological Research, 85(4), 1391–1406. https://doi.org/10.1007/s00426-020-01325-0

Berger, A., & Kiefer, M. (2021). Comparison of different response time outlier exclusion methods: A simulation study. Frontiers in Psychology, 12, 675558. https://doi.org/10.3389/fpsyg.2021.675558

Cousijn, J., Luijten, M., & Wiers, R. W. (2014). Mechanisms underlying alcohol-approach action tendencies: The role of emotional primes and drinking motives. Frontiers in Psychiatry, 5, 44. https://doi.org/10.3389/fpsyt.2014.00044

De Houwer, J. (2003). The extrinsic affective Simon task. Experimental Psychology, 50(2), 77–85. https://doi.org/10.1026/1618-3169.50.2.77

Dixon, W. J. (1953). Processing data for outliers. Biometrics, 9(1), 74–89. https://doi.org/10.2307/3001634

Ernst, L. H., Ehlis, A.-C., Dresler, T., Tupak, S. V., Weidner, A., & Fallgatter, A. J. (2013). N1 and N2 ERPs reflect the regulation of automatic approach tendencies to positive stimuli. Neuroscience Research, 75(3), 239–249. https://doi.org/10.1016/j.neures.2012.12.005

Fabre-Thorpe, M. (2011). The characteristics and limits of rapid visual categorization. Frontiers in Psychology, 2, 243. https://doi.org/10.3389/fpsyg.2011.00243

Ferentzi, H., Scheibner, H., Wiers, R. W., Becker, E. S., Lindenmeyer, J., Beisel, S., & Rinck, M. (2018). Retraining of automatic action tendencies in individuals with obesity: A randomized controlled trial. Appetite, 126, 66–72. https://doi.org/10.1016/j.appet.2018.03.016

Fokkema, M., Smits, N., Zeileis, A., Hothorn, T., & Kelderman, H. (2018). Detecting treatment-subgroup interactions in clustered data with generalized linear mixed-effects model trees. Behavior Research Methods, 50(5), 2016–2034. https://doi.org/10.3758/s13428-017-0971-x

Gelman, A., & Loken, E. (2013). The garden of forking paths: Why multiple comparisons can be a problem, even when there is no "fishing expedition" or "p-hacking" and the research hypothesis was posited ahead of time. Retrieved on 10 August, 2021, from http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf

Glashouwer, K. A., Timmerman, J., & de Jong, P. J. (2020). A personalized approach-avoidance modification intervention to reduce negative body image: A placebo-controlled pilot study. Journal of Behavior Therapy and Experimental Psychiatry, 68, 101544. https://doi.org/10.1016/j.jbtep.2019.101544

Gračanin, A., Krahmer, E., Rinck, M., & Vingerhoets, A. J. J. M. (2018). The effects of tears on approach–avoidance tendencies in observers. Evolutionary Psychology, 16(3), 1474704918791058. https://doi.org/10.1177/1474704918791058

Greenwald, A. G., McGhee, D. E., & Schwartz, J. L. (1998). Measuring individual differences in implicit cognition: The implicit association test. Journal of Personality and Social Psychology, 74(6), 1464–1480. https://doi.org/10.1037//0022-3514.74.6.1464

Greenwald, A. G., Nosek, B. A., & Banaji, M. R. (2003). Understanding and using the implicit association test: I. An improved scoring algorithm. Journal of Personality and Social Psychology, 85, 197–216. https://doi.org/10.1037/0022-3514.85.2.197

Grubbs, F. E. (1950). Sample criteria for testing outlying observations. Annals of Mathematical Statistics, 21, 27–58. https://doi.org/10.1214/aoms/1177729885

Hampel, F. R. (1985). The breakdown points of the mean combined with some rejection rules. Technometrics, 27(2), 95–107. https://doi.org/10.2307/1268758
Heuer, K., Rinck, M., & Becker, E. S. (2007). Avoidance of emotional facial expressions in social anxiety: The approach–avoidance task. Behaviour Research and Therapy, 45(12), 2990–3001. https://doi.org/10.1016/j.brat.2007.08.010

Hofmann, W., Friese, M., & Gschwendner, T. (2009). Men on the "pull": Automatic approach-avoidance tendencies and sexual interest behavior. Social Psychology, 40(2), 73–78. https://doi.org/10.1027/1864-9335.40.2.73

Kahveci, S. (2020). AATtools: Reliability and scoring routines for the approach-avoidance task. R package version 0.0.1. Retrieved on 12 December, 2022, from https://cran.r-project.org/package=AATtools

Kahveci, S., Meule, A., Lender, A., & Blechert, J. (2020). Food approach bias is moderated by desire to eat specific foods. Appetite, 154, 104758. https://doi.org/10.1016/j.appet.2020.104758

Kahveci, S., Van Bockstaele, B., Blechert, J., & Wiers, R. W. (2020). Pulling for pleasure? Erotic approach-bias associated with porn use, not problems. Learning and Motivation, 72, 101656. https://doi.org/10.1016/j.lmot.2020.101656

Kahveci, S., Van Alebeek, H., Berking, M., & Blechert, J. (2021). Touchscreen-based assessment of food approach biases: Investigating reliability and item-specific preferences. Appetite, 163, 105190. https://doi.org/10.1016/j.appet.2021.105190

Krieglmeyer, R., & Deutsch, R. (2010). Comparing measures of approach-avoidance behaviour: The manikin task vs. two versions of the joystick task. Cognition & Emotion, 24(5), 810–828. https://doi.org/10.1080/02699930903047298

Leins, J., Waldorf, M., Kollei, I., Rinck, M., & Steins-Loeber, S. (2018). Approach and avoidance: Relations with the thin body ideal in women with disordered eating behavior. Psychiatry Research, 269, 286–292. https://doi.org/10.1016/j.psychres.2018.08.029

Lender, A., Meule, A., Rinck, M., Brockmeyer, T., & Blechert, J. (2018). Measurement of food-related approach–avoidance biases: Larger biases when food stimuli are task relevant. Appetite, 125, 42–47. https://doi.org/10.1016/j.appet.2018.01.032

Lindgren, K. P., Wiers, R. W., Teachman, B. A., Gasser, M. L., Westgate, E. C., Cousijn, J., ... Neighbors, C. (2015). Attempted training of alcohol approach and drinking identity associations in US undergraduate drinkers: Null results from two studies. PLOS ONE, 10(8), e0134642. https://doi.org/10.1371/journal.pone.0134642

Lobbestael, J., Cousijn, J., Brugman, S., & Wiers, R. W. (2016). Approach and avoidance towards aggressive stimuli and its relation to reactive and proactive aggression. Psychiatry Research, 240, 196–201. https://doi.org/10.1016/j.psychres.2016.04.038

Loijen, A., Vrijsen, J. N., Egger, J. I. M., Becker, E. S., & Rinck, M. (2020). Biased approach-avoidance tendencies in psychopathology: A systematic review of their assessment and modification. Clinical Psychology Review, 77, 101825. https://doi.org/10.1016/j.cpr.2020.101825

Machulska, A., Kleinke, K., & Klucken, T. (2022). Same same, but different: A psychometric examination of three frequently used experimental tasks for cognitive bias assessment in a sample of healthy young adults. Behavior Research Methods. https://doi.org/10.3758/s13428-022-01804-9

Neimeijer, R. A., Roefs, A., Glashouwer, K. A., Jonker, N. C., & de Jong, P. J. (2019). Reduced automatic approach tendencies towards task-relevant and task-irrelevant food pictures in Anorexia Nervosa. Journal of Behavior Therapy and Experimental Psychiatry, 65, 101496. https://doi.org/10.1016/j.jbtep.2019.101496

Nosek, B. A., Bar-Anan, Y., Sriram, N., Axt, J., & Greenwald, A. G. (2014). Understanding and using the brief implicit association test: Recommended scoring procedures. PLoS One, 9(12), e110938. https://doi.org/10.1371/journal.pone.0110938
1037/ 1528- 3542.4. 2. Lender, A., Meule, A., Rinck, M., Brockmeyer, T., & Blechert, J. 156 (2018). Measurement of food-related approach–avoidance biases: Saraiva, A. C., Schüür, F., & Bestmann, S. (2013). Emotional valence and Larger biases when food stimuli are task relevant. Appetite, 125, contextual affordances flexibly shape approach-avoidance movements. 42–47. https:// doi. org/ 10. 1016/j. appet. 2018. 01. 032 Frontiers in Psychology, 4, 933. https:// doi. org/ 10. 3389/ fpsyg. 2013. Lindgren, K. P., Wiers, R. W., Teachman, B. A., Gasser, M. L., Westgate, 00933 E. C., Cousijn, J., ... Neighbors, C. (2015). Attempted training of Solarz, A. K. (1960). Latency of instrumental responses as a function of alcohol approach and drinking identity associations in US under- compatibility with the meaning of eliciting verbal signs. Journal of graduate drinkers: Null results from two studies. PLOS ONE, 10(8), Experimental Psychology, 59(4), 239. https://doi. or g/10. 1037/ h0047 274 e0134642. https:// doi. org/ 10. 1371/ journ al. pone. 01346 42 Spearman, C. (1904). The proof and measurement of association Lobbestael, J., Cousijn, J., Brugman, S., & Wiers, R. W. (2016). between two things. The American Journal of Psychology, Approach and avoidance towards aggressive stimuli and its rela- 15(1), 72–101. https:// doi. org/ 10. 2307/ 14121 59 tion to reactive and proactive aggression. Psychiatry Research, Steegen, S., Tuerlinckx, F., Gelman, A., & Vanpaemel, W. (2016). 240, 196–201. https:// doi. org/ 10. 1016/j. psych res. 2016. 04. 038 Increasing transparency through a multiverse analysis. Perspec- Loijen, A., Vrijsen, J. N., Egger, J. I. M., Becker, E. S., & Rinck, M. (2020). tives on Psychological Science, 11(5), 702–712. https://d oi.o rg/ Biased approach-avoidance tendencies in psychopathology: A system-10. 1177/ 17456 91616 658637 atic review of their assessment and modification. Clinical Psychology Tzavella, L., Lawrence, N. S., Button, K. S., Hart, E. A., Holmes, N. Review, 77, 101825. https:// doi. org/ 10. 1016/j. cpr. 2020. 101825 M., Houghton, K., ... Adams, R. C. (2021). Effects of go/no-go Machulska, A., Kleinke, K., & Klucken, T. (2022). Same same, but training on food-related action tendencies, liking and choice. different: A psychometric examination of three frequently used Royal Society Open Science, 8(8), 210666. https:// doi. org/ 10. experimental tasks for cognitive bias assessment in a sample of 1098/ rsos. 210666 healthy young adults. Behavior Research Methods. https://doi. or g/ Van Alebeek, H., Kahveci, S., & Blechert, J. (2021). Improving 10. 3758/ s13428- 022- 01804-9 the touchscreen-based food approach-avoidance task: remedi- Neimeijer, R. A., Roefs, A., Glashouwer, K. A., Jonker, N. C., & ated block-order effects and initial findings regarding validity de Jong, P. J. (2019). Reduced automatic approach tendencies [version 3; peer review: 2 approved with reservations]. Open towards task-relevant and task-irrelevant food pictures in Anorexia Research Europe, 1, 15. https:// doi. org/ 10. 12688/ openr eseur Nervosa. Journal of Behavior Therapy and Experimental Psy-ope. 13241.3 chiatry, 65, 101496. https:// doi. org/ 10. 1016/j. jbtep. 2019. 101496 Van Alebeek, H., Kahveci, S., Rinck, M., & Blechert, J. (2023). Nosek, B. A., Bar-Anan, Y., Sriram, N., Axt, J., & Greenwald, A. Touchscreen-based approach-avoidance responses to appetitive G. (2014). Understanding and using the brief implicit associa- and threatening stimuli. 
Journal of Behavior Therapy and Exper- tion test: Recommended scoring procedures. PLoS One, 9(12), imental Psychiatry, 78, 101806. https:// doi. org/ 10. 1016/j. jbtep. e110938. https:// doi. org/ 10. 1371/ journ al. pone. 01109 382022. 101806 Parsons, S. (2022). Exploring reliability heterogeneity with multiverse analy- van Peer, J. M., Roelofs, K., Rotteveel, M., van Dijk, J. G., Spinhoven, ses: Data processing decisions unpredictably influence measurement P., & Ridderinkhof, K. R. (2007). The effects of cortisol adminis - reliability. Meta-Psychology, 6. https://doi. or g/10. 15626/ MP .2020. 2577 tration on approach–avoidance behavior: An event-related poten- Payne, B. K. (2001). Prejudice and perception: the role of automatic tial study. Biological Psychology, 76(3), 135–146. https://doi. or g/ and controlled processes in misperceiving a weapon. Journal of 10. 1016/j. biops ycho. 2007. 07. 003 Personality and Social Psychology, 81(2), 181–192. https://do i. van Strien, T., Frijters, J. E. R., Bergers, G. P. A., & Defares, P. B. (1986). org/ 10. 1037// 0022- 3514. 81.2. 181 The Dutch Eating Behavior Questionnaire (DEBQ) for assessment R Core Team. (2020). R: A language and environment for statistical of restrained, emotional, and external eating behavior. Interna- computing. R Foundation for Statistical Computing. tional Journal of Eating Disorders, 5(2), 295–315. https:// doi. org/ Radke, S., Volman, I., Kokal, I., Roelofs, K., de Bruijn, E. R. A., & 10. 1002/ 1098- 108X(198602) 5: 2< 295:: AID- EAT22 60050 209>3. Toni, I. (2017). Oxytocin reduces amygdala responses during 0. CO;2-T threat approach. Psychoneuroendocrinology, 79, 160–166. https:// von Borries, A. K. L., Volman, I., de Bruijn, E. R. A., Bulten, B. H., doi. org/ 10. 1016/j. psyne uen. 2017. 02. 028 Verkes, R. J., & Roelofs, K. (2012). Psychopaths lack the auto- Ratcliff, R. (1993). Methods for dealing with reaction time outliers. matic avoidance of social threat: Relation to instrumental aggres- Psychological Bulletin, 114(3), 510–532. https://doi. or g/10. 1037/ sion. Psychiatry Research, 200(2), 761–766. https:// doi. org/ 10. 0033- 2909. 114.3. 5101016/j. psych res. 2012. 06. 026 Reinecke, A., Becker, E. S., & Rinck, M. (2010). Three indirect tasks assess- Wagenmakers, E.-J., & Brown, S. (2007). On the linear relation ing implicit threat associations and behavioral response tendencies: between the mean and the standard deviation of a response time Test-retest reliability and validity. Zeitschrift für Psychologie/Journal of distribution. Psychological Review, 114(3), 830–841. https:// doi. Psychology, 218(1), 4–11. https://d oi.o rg/1 0.1 027/0 044-3 409/a 00000 2org/ 10. 1037/ 0033- 295X. 114.3. 830 1 3 Behavior Research Methods Wiers, R. W., Eberl, C., Rinck, M., Becker, E. S., & Lindenmeyer, Zech, H. G., Rotteveel, M., van Dijk, W. W., & van Dillen, L. F. (2020). J. (2011). Retraining automatic action tendencies changes alco- A mobile approach-avoidance task. Behavior Research Methods, holic patients' approach bias for alcohol and improves treatment 52(5), 2085–2097. https:// doi. org/ 10. 3758/ s13428- 020- 01379-3 outcome. Psychological Science, 22, 490–497. https://doi. or g/10. Zech, H. G., Gable, P., van Dijk, W. W., & van Dillen, L. F. (2022). Test- 1177/ 09567 97611 400615 retest reliability of a smartphone-based approach-avoidance task: Wittekind, C. E., Reibert, E., Takano, K., Ehring, T., Pogarell, O., & Effects of retest period, stimulus type, and demographics. Behavior Ruther, T. (2019). 
Approach-avoidance modification as an add-on Research Methods. https:// doi. org/ 10. 3758/ s13428- 022- 01920-6 in smoking cessation: A randomized-controlled study. Behaviour Research and Therapy, 114, 35–43. https:// doi. org/ 10. 1016/j. brat. Open practices statement The data and materials for all experiments 2018. 12. 004 are available at https:// doi. org/ 10. 17605/ OSF. IO/ YFX2C and none of Wittekind, C. E., Blechert, J., Schiebel, T., Lender, A., Kahveci, S., the experiments were preregistered. & Kühn, S. (2021). Comparison of different response devices to assess behavioral tendencies towards chocolate in the approach- Publisher’s note Springer Nature remains neutral with regard to avoidance task. Appetite, 165, 105294. https:// doi. org/ 10. 1016/j. jurisdictional claims in published maps and institutional affiliations. appet. 2021. 105294 1 3 http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Behavior Research Methods Springer Journals

In turn, however, they are subject to all the methodological issues associated with RT tasks, such as occasional incorrect responses, outlying RTs, and the large quantity of data, which cannot be meaningfully interpreted until it is reduced. As such, the data usually undergo some kind of pre-processing before analysis, whereby error trials and outliers are dealt with in some manner (or not), often followed by aggregation of the data into an easily interpretable bias score.

There are many methods available to perform each of these pre-processing steps. However, there is no clear-cut answer as to which methods are preferable and under which circumstances, leaving researchers to find their way through this garden of forking paths on their own. Decisions may be made on the basis of their effect on the data, thereby inflating the likelihood of obtaining spurious results (Gelman & Loken, 2013). Alternatively, researchers may choose the same pre-processing pipeline as in already published work. This allows for comparable results, but makes the quality of the findings of an entire line of research dependent on the efficiency of the most popular pre-processing pipeline. In the best case, the commonly accepted set of decisions reliably recovers the true effect and allows the field to progress based on grounded conclusions. In the worst case, it makes the measurement less reliable, thereby misleading researchers with conclusions based on random noise and null findings that mask true effects. Hence, both heterogeneity in pre-processing decisions and low reliability can contribute to inconsistent results across studies that investigate the exact same effect, and thus play a role in the ongoing replication crisis in psychological science. Ideally, pre-processing decisions would be made based on empirical findings that demonstrate which options yield the best results (see e.g. Berger & Kiefer, 2021; Ratcliff, 1993).

The literature on the AAT is no stranger to these issues. The field did not take up the methods which Krieglmeyer and Deutsch (2010) found to lead to the highest reliability and validity (i.e. either strict slow-RT cutoffs or data transformation). Many labs have since settled into their own pre-processing pipeline without a firm empirical basis for their decisions, making it unclear whether differing results are due to differences in task setup or in pre-processing after data collection. For example, using the same task setup in which participants approach and avoid stimuli with a joystick, one study found a correlation between spider fear and spider avoidance bias (e.g. Reinecke et al., 2010), while another did not (e.g. Krieglmeyer & Deutsch, 2010). It is unclear whether this difference occurred because the former study did not remove outliers whereas the latter removed all RTs above 1500 ms, or because the former study featured 2.66 times more test trials and 2.25 times more practice trials than the latter, or because the two studies used different scoring algorithms.

Low reliability has also been a problem in the AAT literature (Loijen et al., 2020), at least for certain variants of the task. The irrelevant-feature AAT manipulates the contingency between the approach or avoidance response and a task-irrelevant stimulus feature, for example, by measuring chocolate approach–avoidance bias by requiring participants to approach stimuli surrounded by a green frame and avoid stimuli surrounded by a blue frame, thereby making it irrelevant whether the stimulus itself contains chocolate. This task is reported in the literature as unreliable, with reliabilities below zero (Kahveci, Van Bockstaele, et al., 2020; Lobbestael et al., 2016; Wittekind et al., 2019), though reliabilities around .40 (Cousijn et al., 2014) and even .80 have been reported on individual occasions (Machulska et al., 2022). It has seen frequent use because its indirect nature conceals the goal of the experiment and thus makes it less susceptible to experimenter demand. The relevant-feature AAT, in contrast, directly manipulates the contingency between a task-relevant feature of the stimulus and the response, for example, by measuring chocolate approach–avoidance bias by requiring participants to approach chocolate stimuli during one block and to avoid them during another block. This task usually has a high reliability, ranging from around .90 (Zech et al., 2022) through around .70 (Hofmann et al., 2009; Van Alebeek et al., 2021) down to .50 (Kahveci, Van Bockstaele, et al., 2020); however, the direct nature of its instructions makes it easy for the participant to figure out what the task is about.
In the present study, we probed the extent of pre-processing heterogeneity in the literature on the AAT, and we made an effort towards reducing it by examining the reliability and validity obtained through a wide range of pre-processing decisions using a multiverse analysis, thereby limiting the range of acceptable pre-processing methods to only the most reliable and valid approaches. The multiverse analysis methodology, advocated by Steegen et al. (2016), involves applying every combination of reasonable analysis decisions to the data, to probe how variable the analysis outcomes can be, and to what extent each analysis decision contributes to this variability. We know of one study so far that has examined the impact of pre-processing methods on the reliability of the AAT, though it did not utilize multiverse analysis: Krieglmeyer and Deutsch (2010) applied a number of different outlier-handling methods to the data and compared the resulting bias scores on the basis of their split-half reliability and overall effect size, finding that the relevant-feature AAT is most reliable when no outlier correction is applied, while the irrelevant-feature AAT benefits from very strict outlier rejection, e.g. removing all RTs above 1000 ms or deviating more than 1.5 SD from the mean. Additionally, Parsons (2022) was the first to examine the effect of pre-processing decisions on reliability, though he looked at the dot-probe, Stroop, and Flanker tasks rather than the AAT. Our study instead focused on the AAT, but also extended these studies methodologically by examining criterion validity, as high reliability is a prerequisite for, but not a guarantee of, high validity; if we focused solely on reliability, we would risk achieving highly reliable, but invalid, data. Reliability represents how well a measurement will correlate with the same measurement performed again (Spearman, 1904), but it is agnostic about what is actually measured. Hence, one could measure something reliably, but that something might be an artifact rather than the effect one was looking for. For example, participants tend to be slower in the beginning of the experiment when they are trying to adapt to the task, and some are slower than others. This initial slowness is a large interpersonal difference that can be measured reliably, but it has little to do with cognitive bias. If we only focused on reliability, we might erroneously believe that our analysis should focus on this initial slowness rather than ignore it.

We also extended these previous studies by examining simulated as well as real data: simulated data allow for a detailed analysis of the conditions under which different outlier rejection and bias scoring methods are more or less reliable, but only real data can be used to examine how validity is affected by bias scoring and by error and outlier handling. Previous simulation studies examining outlier rejection have assumed that extreme RTs are unrelated to the individual's actual underlying score (Berger & Kiefer, 2021); if, in real data, the approach–avoidance bias expresses itself through errors and extreme RTs (e.g. a stronger bias leading to more errors when avoiding desired stimuli), then it could turn out to be preferable to keep them in the data.

Data structure and methodological challenges of the AAT

In this section, we will discuss the characteristics of the AAT to understand the methodological challenges that need to be addressed when pre-processing its data. In the AAT, participants view different stimuli and give either a speeded approach or avoidance response depending on a feature of the stimulus. Responses are typically given with a joystick or a similar device that simulates approach toward or avoidance of a given stimulus (Wittekind et al., 2021), though simple buttons are sometimes used instead. Depending on the input device, this allows for the measurement of different types of response times per trial, which we term initiation time, movement duration, and completion time (terms previously used by Barton et al., 2021, and Tzavella et al., 2021). The time from stimulus onset to shortly after response onset (initiation time) indicates how long it took the participant to initiate a response; the time from response onset until response completion (movement duration) indicates the speed of the approach or avoidance movement. The two are often added together to represent the latency from stimulus onset until response completion (completion time). Approach–avoidance bias scores quantify the extent to which responses slow down or speed up due to the compatibility between stimulus and response.
A typical AAT trial features one out of two stimulus categories (target and control) and requires a response in one out of two directions (approach and avoid), resulting in four different types of trials. RTs to these four types of trials can be decomposed into three independent contrasts, which are detailed in Table 1. The first contrast is the RT difference between responses to the two stimulus categories (rows in Table 1), regardless of response direction. This difference can be caused by the familiarity, visual characteristics, or prototypicality of the stimulus as a stand-in for its stimulus category, among other causes. As shown in Table 1, this factor contaminates any difference score between single-direction responses to one stimulus category versus another. If this is ignored, we may erroneously conclude that a familiar stimulus category is approached faster than a less familiar category, even though all responses to the familiar stimulus category are faster, regardless of response direction.

The second contrast is the RT difference between approach and avoidance trials, regardless of stimulus content (columns in Table 1). This difference can be caused by the relative ease with which approach and avoidance movements can be made, which can be influenced by irrelevant factors like the individual's anatomy and posture as well as by the (biomechanical) setup of the response device. This factor contaminates any difference score between approach and avoid trials within a single stimulus category. For example, a study found that patients with anorexia nervosa avoid, rather than approach, thin female bodies (Leins et al., 2018). Does that mean that women with anorexia, counterintuitively, have an avoidance bias away from these stimuli? Such an interpretation would not be valid, since an identical avoidance bias was demonstrated for normal-weight bodies in the same patient group as well as in healthy individuals, indicating that avoidance responses were simply faster and not specific to thin bodies.

The third contrast is the approach–avoidance bias, and is represented by the difference between approaching and avoiding a target stimulus category, relative to the difference between approaching and avoiding a reference stimulus category. As shown in Table 1, this double difference can be interpreted as an approach or avoidance bias towards one particular stimulus type relative to another.
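To make this decomposition concrete, here is a minimal R sketch that computes the three contrasts from per-condition mean RTs; the data frame and its column names are hypothetical and chosen only for illustration.

```r
# Hypothetical per-condition mean RTs for two participants
rt <- data.frame(
  subject  = rep(1:2, each = 4),
  stimulus = rep(c("target", "target", "control", "control"), times = 2),
  movement = rep(c("avoid", "approach"), times = 4),
  mean_rt  = c(720, 650, 700, 660,  690, 685, 670, 665)
)

contrasts <- t(sapply(split(rt, rt$subject), function(d) {
  A <- d$mean_rt[d$stimulus == "target"  & d$movement == "avoid"]
  B <- d$mean_rt[d$stimulus == "target"  & d$movement == "approach"]
  C <- d$mean_rt[d$stimulus == "control" & d$movement == "avoid"]
  D <- d$mean_rt[d$stimulus == "control" & d$movement == "approach"]
  c(stimulus_contrast = (A + B) / 2 - (C + D) / 2,  # rows of Table 1
    movement_contrast = (A + C) / 2 - (B + D) / 2,  # columns of Table 1
    double_difference = (A - B) - (C - D))          # approach-avoidance bias
}))
contrasts
```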
The current study

The current article consists of four studies. As a first step, we reviewed the literature to gain insight into which pre-processing decisions are in use in the field (Study 1). We discuss thereafter which methods are potentially problematic and consider alternative methods, giving extra consideration to robust and novel approaches. Next, we performed a simulation study to compare two ways of aggregating data from four conditions, those being double-difference scores and compatibility scores (Study 2). We followed up with a simulation study to compare the impact of outliers on the reliability of scores derived using a number of outlier detection methods and scoring algorithms (Study 3). And lastly, we compared these pre-processing methods on how they affect the reliability and validity of real datasets in a multiverse analysis (Study 4).

Study 1: Literature review

Introduction

We performed a focused scoping review of the AAT literature to examine which pre-processing decisions are used in the field of AAT research. The intention was not to be exhaustive or systematic, but to tap into the variability in pre-processing decisions in the field to orient the rest of this project.

Methods

We reviewed 213 articles retrieved from Google Scholar using the keywords "approach–avoidance task OR approach avoidance task", published between 2005 and 2020. We rejected 65 articles after reading the abstract or full text, since they featured no AAT, only a training AAT, or a variant of the AAT that departs strongly from the original paradigm (e.g. by allowing participants to freely choose whether to approach or avoid a stimulus). We also excluded one experiment which featured multiple pre-processing tracks, as we would otherwise have had to count two full pre-processing pipelines for a single experiment. We thus retained 143 articles containing a total of 163 AATs. When an article contained multiple AAT experiments, all were included as separate entries and counted as such.
The experiments were coded on the following variables: instruction type (relevant-feature, irrelevant-feature), response device (e.g. joystick, keyboard, mouse), RT definition (initiation time, movement duration, completion time), inclusion of some sort of training or therapy, the research population, the target stimuli, the type of reported reliability index if any (e.g. even-odd split-half, Cronbach's alpha of stimulus-specific bias scores), absolute outlier exclusion rules (e.g. any RTs above 2000 ms), adaptive outlier exclusion rules (e.g. 3 SD above each participant's mean), error handling rules (e.g. include error trials in analyses, force participants to give correct responses), performance-based participant exclusion rules (e.g. more than 35% errors), score-based exclusion rules (e.g. bias scores deviating more than 3 SD from the sample mean), and the summary statistic used (e.g. double mean difference scores, median category-specific difference scores, simple means).

Results

A total of 163 experiments from 143 articles were examined. Below, we describe the number and percentage of experiments that utilized specific methods in their design, pre-processing, and analysis. The full results of this review can be found in this study's online repository: https://doi.org/10.17605/OSF.IO/YFX2C

Table 1. AAT trial types, difference scores, and their components

- Target, avoid (Quadrant A): target stimulus recognition + general avoid speed + avoidance facilitation of target
- Target, approach (Quadrant B): target stimulus recognition + general approach speed + approach facilitation of target
- A minus B: target-specific difference score = (general avoid speed − general approach speed) + (avoidance facilitation of target − approach facilitation of target)
- Control, avoid (Quadrant C): control stimulus recognition + general avoid speed + avoidance facilitation of control
- Control, approach (Quadrant D): control stimulus recognition + general approach speed + approach facilitation of control
- C minus D: control-specific difference score = (general avoid speed − general approach speed) + (avoidance facilitation of control − approach facilitation of control)
- A minus C: (negative) avoid-specific difference score = (target stimulus recognition − control stimulus recognition) + (avoidance facilitation of target − avoidance facilitation of control)
- B minus D: (negative) approach-specific difference score = (target stimulus recognition − control stimulus recognition) + (approach facilitation of target − approach facilitation of control)
- (A − B) minus (C − D): double-difference score = (avoidance facilitation of target − approach facilitation of target) − (avoidance facilitation of control − approach facilitation of control)

Note: This table is a schematic depiction of single- and double-difference scores and the RT components they consist of. Each quadrant describes which RT components we hypothesize to constitute the RTs of the combination of stimulus and response that the quadrant represents; each difference entry results from subtracting the indicated quadrants.
Table 2. Frequencies of upper and lower RT cutoffs in the reviewed literature (n = 163)

- Upper cutoffs: >1000 ms: 3 (1.84%); >1500 ms: 21 (12.88%); >1700 ms: 1 (0.61%); >2000 ms: 33 (20.25%); >3000 ms: 5 (3.07%); >3500 ms: 1 (0.61%); >4000 ms: 1 (0.61%); >5000 ms: 1 (0.61%); >10,000 ms: 3 (1.84%); no upper cutoff: 94 (57.67%)
- Lower cutoffs: <100 ms: 4 (2.45%); <150 ms: 12 (7.36%); <200 ms: 23 (14.11%); <250 ms: 2 (1.23%); <300 ms: 11 (6.75%); <350 ms: 6 (3.68%); no lower cutoff: 105 (64.42%)
- No absolute cutoffs at all: 87 (53.37%)

Response device

Joysticks were by far the most popular response device (132; 80%). They were followed by keyboards (8; 4.91%), button boxes (7; 4.29%), touchscreens (5; 3.07%), computer mice (5; 3.07%), and other/multiple/unknown devices (8; 4.91%).

Instructions

The irrelevant-feature AAT was the most popular task type (119; 73.01%), followed by the relevant-feature AAT (41; 25.15%). A small number (3; 1.84%) used both task types in the same experiment.

Reliability measures

Reliability was not examined in the majority of experiments (125; 76.69%); most experiments that did examine reliability used a single reliability measure (36; 22.1%), and some used two (2; 1.23%). Split-half reliability was the most common measure (19; 11.7%). The types of split-half reliability included temporal split-half, which is splitting the experiment halfway (5; 3.07%); even-odd split-half, which is splitting the data by even versus uneven trial number (5; 3.07%); and randomized split-half, which is averaging together the correlations between many random splits (5; 3.07%); other studies did not mention the type of split-half used (4; 2.45%). Cronbach's alpha was the next most common reliability measure (16; 9.82%). Most experiments computed Cronbach's alpha on the basis of the covariance matrix of stimulus-specific bias scores (11; 6.75%), while a minority computed Cronbach's alpha for RTs in a single movement direction, grouping them per stimulus (2; 1.23%), and some did not clarify how they computed Cronbach's alpha (3; 1.84%). The least common measure was test-retest reliability (4; 2.45%).
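Of these reliability indices, the split-half variants are straightforward to compute by hand. Below is a minimal R sketch of the even-odd variant with a Spearman-Brown correction; the trial-level data frame and its columns (subject, trial, rt, stimulus, movement) are hypothetical, and dedicated routines for this purpose exist in the AATtools package (Kahveci, 2020).

```r
# Even-odd split-half reliability of a double mean difference score
double_diff <- function(d) {
  with(d, (mean(rt[stimulus == "target"  & movement == "avoid"]) -
           mean(rt[stimulus == "target"  & movement == "approach"])) -
          (mean(rt[stimulus == "control" & movement == "avoid"]) -
           mean(rt[stimulus == "control" & movement == "approach"])))
}

split_half_evenodd <- function(data) {
  odd  <- subset(data, trial %% 2 == 1)
  even <- subset(data, trial %% 2 == 0)
  s_odd  <- sapply(split(odd,  odd$subject),  double_diff)
  s_even <- sapply(split(even, even$subject), double_diff)
  r <- cor(s_odd, s_even)
  2 * r / (1 + r)  # Spearman-Brown correction for the halved trial count
}
```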
RT measures

Most studies used a single RT measure (153; 93.9%), but some used multiple (10; 6.13%). Out of all examined experiments, most did not report how RTs were defined (69; 42.33%), but those that did used completion time (50; 30.7%), initiation time (43; 26.4%), or movement duration (9; 5.52%).

Outlier rejection rules

Many experiments applied no outlier rejection (62; 38%), while those that did either applied only absolute outlier rejection methods (38; 23.3%), only adaptive outlier rejection methods (24; 14.7%), or both together (39; 23.9%). Frequencies of absolute outlier rejection rules (78; 47.9%) are shown in Table 2, and frequencies of adaptive outlier rejection methods (63; 38.7%) are shown in Table 3.

Table 3. Frequencies of adaptive outlier rejection methods in the reviewed literature

- Applied to both sides of the RT distribution (50; 30.67% in total): upper and/or lower 1%: 10 (6.13%); upper and/or lower 2%: 1 (0.61%); 1.5 SD: 1 (0.61%); 2 SD: 3 (1.84%); 2.5 SD: 7 (4.29%); 3 SD: 28 (17.18%)
- Applied to the upper side only (11; 6.75% in total): 2 SD: 2 (1.23%); 2.5 SD: 1 (0.61%); 3 SD: 7 (4.29%); multiple methods: 1 (0.61%)
- Unclear: 2 (1.23%); none: 100

Error rules

In most experiments, the authors excluded error trials (115; 70.55%), while others either included them (34; 20.86%), replaced them with a block mean RT of correct trials plus a penalty (7; 4.29%), or required participants to give correct responses to complete the trial (7; 4.29%).

Bias score algorithms

We categorized the observed bias score algorithms where possible, and gave them systematic names, which will be used in the remainder of the article. They are shown in Table 4.

Table 4. Bias score algorithms and how frequently they have been used

- Median category-specific difference (47; 28.83%): median RT(avoid target) − median RT(approach target)
- Mean category-specific difference (32; 19.63%): mean RT(avoid target) − mean RT(approach target)
- Category-specific difference D-score (6; 3.68%): [mean RT(avoid target) − mean RT(approach target)] / SD(RT target)
- Double median difference (18; 11.04%): [median RT(avoid target) − median RT(approach target)] − [median RT(avoid control) − median RT(approach control)]
- Double mean difference (3; 1.84%): [mean RT(avoid target) − mean RT(approach target)] − [mean RT(avoid control) − mean RT(approach control)]
- Median compatibility score (4; 2.45%): median RT(avoid target or approach control) − median RT(approach target or avoid control)
- Mean compatibility score (4; 2.45%): mean RT(avoid target or approach control) − mean RT(approach target or avoid control)
- Compatibility D-score (1; 0.61%): [mean RT(avoid target or approach control) − mean RT(approach target or avoid control)] / SD(RT)
- Median movement-specific difference scores (4; 2.45%): median RT(avoid target) − median RT(avoid control), and median RT(approach target) − median RT(approach control)
- Mean movement-specific difference scores (3; 1.84%): mean RT(avoid target) − mean RT(avoid control), and mean RT(approach target) − mean RT(approach control)
- Multiple (9; 5.52%); other (3; 1.84%); none (20; 12.27%); unclear (9; 5.52%). Total: 163 (100%)

Participant rejection rules

It was uncommon for participants to be rejected based on bad performance (42; 25.8%), but it is unclear whether this is because participants performed well in most studies or because their performance simply was not examined. If participants were rejected, it was most commonly on the basis of error rates (34; 20.9%), with an error rate above 25% being the most common cutoff (12; 7.36%), followed by error rates of 35% (6; 3.68%) and 20% (4; 2.45%). Less often, participants were rejected because they had RTs that were too slow (6; 3.68%), or because they had too few trials remaining after error and outlier removal combined (4; 2.45%). In a minority of studies, participants were rejected not (only) due to high error rates or slow RTs, but (also) because their bias scores were too outlying (11; 6.75%), their scores had too much influence on the regression outcome (1; 0.61%), or for unclear reasons relating to the magnitude of their scores (1; 0.61%). Many of the examined experiments gave the impression that no participant rejection rule was defined beforehand, but that participants were rejected following data inspection.

Pipelines

We empirically observed a total of 108 unique pre-processing pipelines across 163 studies, out of 218,400 possible combinations, computed by multiplying the numbers of all unique observed pre-processing methods at each step with each other.

Discussion

We found that some pre-processing methods were quite common (e.g. excluding trials deviating more than 3 SD from the participant mean), but there is still much heterogeneity in the literature, as only a few studies used identical pre-processing methods, which makes it difficult to discern whether divergent results in the literature are due to differences in experimental design, pre-processing, or chance. In the following discussion of Study 1, we will review the observed and hypothetical new pre-processing decisions based on methodological considerations, in anticipation of Studies 2, 3, and 4.
Outlier rejection

Various methods are used to flag and remove implausible or extreme RTs. This is especially important considering that non-robust statistics are much more strongly influenced by individual extreme outliers than by a multitude of regular RTs; as such, outliers inflate Type I and II error rates by suppressing effects that exist and creating effects that do not exist in real life (Dixon, 1953; Ratcliff, 1993).

Fixed RT cutoffs It seems sensible to remove outliers based on a cutoff that is adapted to the specific study but fixed across participants in that study. This is based on the reasoning that there is a high likelihood that RTs above or below certain values have a different origin than the mental process being measured (Ratcliff, 1993). The removal of such RTs is thus thought to enhance the validity of the data. For example, when a participant forgets the instructions and tries to remember them, this can result in a 4-second RT caused by memory search rather than by stimulus recognition and decision-making. The same goes for fast outliers: it is known that people only begin to recognize a stimulus after about 150 ms, and they only begin giving above-chance responses from 300 ms onwards (Fabre-Thorpe, 2011). Given this, a 50-ms RT is most likely not related to the stimulus that has just been shown on the screen. It remains unclear, however, what the ideal cutoffs are. Ratcliff (1993) found that an RT cutoff of 1500 ms led to results with decent power to detect a group difference, both when said group difference was in the mean of the distribution and when it was in the tail of the distribution instead. This study, however, utilized simulated data that may not correspond with effects observed in real life. Further complexity is introduced by the fact that some stimuli are more visually or conceptually complex than others and may thus require more processing time before the participant is capable of responding correctly using the cognitive mechanism under study.
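As a minimal illustration, fixed cutoffs amount to a single filtering step. The 200–2000 ms window below mirrors the bounds used for the empirical parameter estimation later in this article; the column name rt is hypothetical.

```r
# Remove RTs outside absolute, sample-wide bounds
apply_fixed_cutoffs <- function(data, lower = 200, upper = 2000) {
  subset(data, rt >= lower & rt <= upper)
}
```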
Means and SDs By far the most common adaptive outlier rejection method is to remove all RTs that deviate more than 3 SD from the participant's mean. Ratcliff (1993) found that very strict SD boundaries (e.g. M + 1.5 SD) reasonably salvage power to detect group differences when the group difference is in the group means, but significantly weaken power when groups primarily differ in the length of the right tail of the distribution; this suggests that the benefit of using means and SDs can depend on the nature of the task. Both means and SDs are themselves also highly influenced by outliers. Thus, when using this method, extreme outliers widen the SD and mask smaller outliers that would otherwise have been detected after an initial exclusion of extreme outliers. Additionally, means and SDs are only correct descriptors for symmetric distributions, which RT distributions are not; as such, this method is prone to removing only slow outliers, while ignoring extremely fast RTs that technically do not deviate more than 3 SD from the mean but are nevertheless theoretically implausible. Hence, this method is often combined with absolute outlier cutoffs that eliminate extreme RTs at both ends of the distribution before means and SDs are computed for further outlier rejection.

Alternatively, one may turn to robust estimators of the mean and SD, such as the median and the median absolute deviation (MAD; Hampel, 1985), respectively. Unlike the mean, the median is not disproportionately affected by outliers compared to non-outlying datapoints, as the median assigns equal leverage to every data point. Hence, it is not affected by the outliers' extremeness, but only by their count. Similarly, the MAD is a robust alternative to the SD, calculated by computing the deviation of all data points from the median, removing the sign from these values, computing their median, and multiplying this value by a constant to approximate the value that the SD would take on in a normal distribution.
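The two adaptive rules can be sketched in a few lines of R; note that R's built-in mad() implements exactly the procedure described above, with a default constant of 1.4826. The column names are hypothetical.

```r
# Per-participant adaptive outlier rejection: mean +/- k*SD or median +/- k*MAD
reject_adaptive <- function(data, method = c("sd", "mad"), k = 3) {
  method <- match.arg(method)
  center <- ave(data$rt, data$subject, FUN = if (method == "sd") mean else median)
  spread <- ave(data$rt, data$subject, FUN = if (method == "sd") sd else mad)
  subset(data, abs(rt - center) <= k * spread)
}

# Usage: clean <- reject_adaptive(aat_data, method = "mad", k = 3)
```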
Percentiles One of the less common ways to deal with outliers was to remove a fixed percentage of the fastest and slowest RTs from the data (10 out of 163 studies). This method has the advantage of not ignoring fast outliers. However, it is independent of the characteristics of the RT distributions under investigation, and is thus likely to remove either too few or too many outliers, depending on the data.

Outlier tests While much less common than any of the aforementioned methods, another method that deserves mention is the significance testing of outliers. The Grubbs test (Grubbs, 1950) was used by e.g. Saraiva et al. (2013), among others, to detect whether the highest or lowest value in the data significantly deviates from the distribution. When found to be significantly different, this value is removed, and the process is repeated until no more significant outliers are detected.
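A sketch of this repeated testing procedure for slow RTs (the one-sided case) is given below; the critical value follows the formula associated with Grubbs (1950), and the alpha level is an arbitrary illustrative choice.

```r
# Repeatedly test and remove the slowest RT until it is no longer a
# significant outlier (one-sided Grubbs test)
grubbs_prune <- function(x, alpha = .05) {
  repeat {
    n <- length(x)
    if (n < 3) return(x)
    g <- (max(x) - mean(x)) / sd(x)  # Grubbs statistic for the maximum
    t <- qt(alpha / n, df = n - 2, lower.tail = FALSE)
    g_crit <- ((n - 1) / sqrt(n)) * sqrt(t^2 / (n - 2 + t^2))
    if (g <= g_crit) return(x)       # largest value is not a significant outlier
    x <- x[-which.max(x)]
  }
}
```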
Error handling

Study 1 revealed four ways of dealing with error trials: including them in the analyses, excluding them, replacing them with the block mean plus a penalty, or requiring the participant to give a correct response during the task itself and defining the RT as the time from stimulus onset until the correct response. Which method is ultimately the best depends on whether error trial RTs contain information on approach–avoidance bias. After all, some implicit tasks are based entirely on errors (e.g. Payne, 2001). Raw error counts sometimes show approach–avoidance bias effects (Ernst et al., 2013; Gračanin et al., 2018; van Peer et al., 2007), but often they do not (Glashouwer et al., 2020; Heuer et al., 2007; Kahveci et al., 2021; Neimeijer et al., 2019; Radke et al., 2017; von Borries et al., 2012), and it is unclear why some studies find such an effect while others do not. Therefore, we will examine how the different types of error handling affect reliability and validity (Study 4).
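The three error-handling rules that can be applied after data collection can be sketched as follows; the accuracy and block columns and the 600-ms penalty are hypothetical choices for illustration, as the reviewed studies differed in the penalty they added.

```r
# Include, exclude, or replace error-trial RTs
handle_errors <- function(data, rule = c("exclude", "include", "replace"),
                          penalty = 600) {
  rule <- match.arg(rule)
  if (rule == "include") return(data)
  if (rule == "exclude") return(subset(data, accuracy == 1))
  # "replace": error RTs become the block mean of correct trials plus a penalty
  correct_rt <- ifelse(data$accuracy == 1, data$rt, NA)
  block_mean <- ave(correct_rt, data$subject, data$block,
                    FUN = function(x) mean(x, na.rm = TRUE))
  data$rt <- ifelse(data$accuracy == 1, data$rt, block_mean + penalty)
  data
}
```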
Bias score computation algorithms

Category-specific and movement-specific difference scores The category-specific difference score is the most popular bias scoring algorithm (93 of 163 studies). To compute it, one subtracts aggregated approach RTs from aggregated avoidance RTs for a single stimulus category (Table 1: quadrant A minus B, or C minus D). Movement-specific difference scores are less popular, but they similarly contrast a single condition with another, in this case by subtracting the approach or avoidance RT for a target stimulus from the approach or avoidance RT of a control stimulus. In the resulting score, positive values imply that the target stimulus is approached or avoided faster than the control stimulus (Table 1: quadrant C minus A, and D minus B; note that the table displays these the other way around). As we discussed in the introduction, these scores are problematic when interpreted on their own, because they do not account for interpersonal and overall differences in how fast participants perform approach and avoidance movements, and how fast they classify the stimuli into their categories, respectively. This contamination with motor or classification effects can produce bias scores with extremely high reliabilities that do not correlate with any relevant interpersonal metric, because the difference score consists primarily of contaminant rather than stimulus-related approach–avoidance bias (as found by e.g. Kahveci, Meule, et al., 2020). To hold any meaning, they need to be contrasted with their opposite equivalent, i.e. approach scores with avoid scores, and target stimuli with control stimuli. This can be done through subtraction or by comparing the two scores in an analysis, such as an ANOVA. Therefore, we primarily focus on double-difference scores in this article.

Double-difference scores Double-difference scores cancel out effects other than stimulus category-specific approach–avoidance bias, by subtracting approach–avoidance scores for a control or comparison stimulus category from those of a target stimulus category (Table 1: quadrants [A−B]−[C−D]). They represent the advantage of approaching over avoiding one stimulus category relative to another. We will examine how mean-based and median-based double-difference algorithms compare in their ability to recover reliable and valid approach–avoidance bias scores.

Compatibility scores These scores involve averaging all RTs from the bias-compatible conditions together and subtracting this from the average of all RTs in the bias-incompatible conditions taken together. When one measures approach–avoidance bias towards palatable food, for example, the bias-compatible conditions involve approaching food and avoiding control stimuli (quadrants B and C of Table 1), while the bias-incompatible conditions involve avoiding food and approaching control stimuli (quadrants A and D of Table 1). When there is an equal number of trials in each of the four conditions, compatibility scores (i.e. mean RT(avoid target or approach control) − mean RT(approach target or avoid control)) are thus functionally identical (though halved in size) to double-difference scores (which can be reformulated as [mean RT(avoid target) + mean RT(approach control)] − [mean RT(approach target) + mean RT(avoid control)]). However, when there is an unequal number of trials in the conditions contained within the averages, the condition with more trials has a larger influence on the average than the condition with fewer trials, which can reintroduce RT influences that a double-difference score is meant to account for, such as stimulus-independent differences in approach–avoidance speed. Imagine, for example, that a participant particularly struggled with avoiding palatable food stimuli and made a disproportionate number of errors in this particular condition. After error exclusion, their final dataset contains 20 avoid-food trials and 40 trials of each other condition. The mean RT of the incompatible conditions is then more strongly influenced by the 40 approach-control trials than by the 20 avoid-food trials, and it fails to cancel out the stimulus-independent difference between approach and avoid trials. Therefore, the compatibility score is almost always an impure measure of approach–avoidance bias, as we will show in a further analysis.
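To make these definitions concrete, the sketch below computes the main aggregates for one participant's trial-level data; the column names are hypothetical, and the AATtools package (Kahveci, 2020) offers tested implementations of such scores.

```r
# Double mean/median difference, double-difference D-score, and
# compatibility score for one participant
bias_scores <- function(d) {
  m <- function(stim, move, f = mean) f(d$rt[d$stimulus == stim & d$movement == move])
  dmd  <- (m("target", "avoid") - m("target", "approach")) -
          (m("control", "avoid") - m("control", "approach"))
  dmed <- (m("target", "avoid", median) - m("target", "approach", median)) -
          (m("control", "avoid", median) - m("control", "approach", median))
  incompatible <- (d$stimulus == "target"  & d$movement == "avoid") |
                  (d$stimulus == "control" & d$movement == "approach")
  c(double_mean   = dmd,
    double_median = dmed,
    d_score       = dmd / sd(d$rt),  # divide by the participant's RT SD
    compatibility = mean(d$rt[incompatible]) - mean(d$rt[!incompatible]))
}

# Usage across participants: t(sapply(split(trials, trials$subject), bias_scores))
```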
The D-score correction The D-score correction controls for the fact that larger differences between conditions emerge when a participant has a wider RT distribution, which occurs when they respond more slowly, as demonstrated by Wagenmakers and Brown (2007). The D-score was introduced by Greenwald et al. (2003) for the implicit association task and was also adopted in AAT research (Wiers et al., 2011). Many different types of D-scores have been reported in the AAT literature, with the common thread being that a mean-based difference score of some kind is divided by the SD of the participant's RTs. It makes sense to cancel out the effect of narrower or wider SDs, as these can result from a myriad of factors other than underlying approach–avoidance bias, such as age, fatigue, and speed-accuracy trade-offs. However, this slowing cannot be entirely disentangled from the slower responding that may occur when individuals have more difficulty performing the task due to a strong and rigid approach–avoidance bias. Hence, it is as of yet unclear whether the D-score correction helps or hurts the validity of the AAT, and it will therefore be examined here.

Multilevel random effects This scoring method was recently introduced by Zech et al. (2020). It involves fitting a mixed model and extracting the by-participant random slopes representing the desired contrast between conditions. For example, a contrast between approach and avoidance can be retrieved by extracting the random effect of movement direction (0 = avoid, 1 = approach), and a double-difference score can be retrieved by extracting the interaction between movement direction and stimulus category (0 = control, 1 = target). This method allows for the inclusion of known covariates influencing individual RTs, such as trial number, temporal proximity of error trials, and individual stimulus recognition speeds. Due to its novelty and good performance in the aforementioned study, we included this approach here and chose to examine it in the following analyses.
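In R, this extraction can be sketched with lme4, mirroring the model formulas quoted in Study 2 below; the data frame aat_data and its columns are hypothetical, and the exact name of the random-effect column depends on the factor coding used.

```r
library(lme4)

# Fit a mixed model with by-participant random slopes
fit <- lmer(rt ~ movement * stimulus + (movement * stimulus | subject),
            data = aat_data)

# The per-participant random slopes of the interaction serve as
# multilevel double-difference scores
re <- ranef(fit)$subject
bias <- re[["movementavoid:stimulustarget"]]  # column name depends on factor levels
```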
1, bias scores became increasingly inac- recover the true scores on which the simulated data were based. curate as the trial count became more unequal across condi- tions. This decrease in accuracy was larger for compatibility Examined methods scores than for double-difference scores, and it was larger when there was more variability between the simulated par- The examined bias computation algorithms included the ticipants in how much faster or slower they were to approach double mean difference score, double median difference or avoid. Overall, the probability of a double-difference score, double-difference D-score, and multilevel compat - score being better than a compatibility score at recovering ibility scores, as described in the previous studies. We the true score was almost always above chance, being .79 examined a number of outlier detection methods. First, we at most. These probabilities are further depicted in Table 5. examined the sample-wide removal of the slowest and/or Double-difference scores only performed worse than fastest percentile of trials (1% / 99%), because it is a com- compatibility scores when computed using multilevel mon method in the AAT literature; we examined the per- analysis, given either relatively small differences in trial participant removal of RTs exceeding the mean by 3 SD (M count between conditions, or given average variability ± 3 SD), because it is similarly common; we examined the in the difference between approach and avoidance RTs. per-participant removal of RTs exceeding the mean by 2 This contrast was driven by multilevel double-difference SD (M ± 2 SD), as a representative of the more strict SD- scores performing worse than their mean- and D-score- based outlier removal methods that is sufficiently different based counterparts, while multilevel compatibility scores from the aforementioned 3 SD method such that its effects performed on par. When bias scores were computed on the data will be more detectable; we examined per- using medians, compatibility scores underperformed rel- participant removal of RTs exceeding the median by ± 3 ative to double-difference scores even when trial counts MADs (median ± 3 MAD), to be able to contrast the com- were equal across conditions. In all other cases, the mon 3 SD method to its robust counterpart; we examined 1 3 Behavior Research Methods Fig. 1 Effect of unequal trial count per condition on the recoverability of the true score from double-difference and compatibility scores that were based on means, medians, D-scores, and multilevel random effects repeated outlier testing and removal using one- or two- Analysis procedure sided Grubbs tests (Grubbs), to represent outlier removal methods based on statistical testing rather than boundaries We generated 1000 datasets in the same manner as Study calculated from the data; and lastly, we contrasted these 2, with each dataset having the same properties as the methods to no outlier rejection (None). We did not examine relevant-feature AAT study of Lender et al. (2018) with absolute outlier cutoffs in this study, as we were concerned outliers and error trials included. These datasets each had that these, unlike adaptive outlier rejection methods, were 36 participants with 256 trials each, spread across 2×2 too sensitive to the arbitrary properties of our current simu- conditions. 
Analysis procedure

We generated 1000 datasets in the same manner as Study 2, with each dataset having the same properties as the relevant-feature AAT study of Lender et al. (2018) with outliers and error trials included. These datasets each had 36 participants with 256 trials each, spread across 2×2 conditions. When examining category-specific difference scores, we excluded all trials pertaining to the control condition from the data; when examining double-difference scores, the full dataset was used. In each dataset, we iteratively replaced one random additional RT with a slow outlier (mean = μ_participant + 1200 ms, SD = 400 ms, gamma-distributed, shape = 3) in every participant's data (i.e. first one outlier, then two, then three), after which we separately applied each combination of one outlier rejection method to slow RTs and one bias scoring algorithm to the data. This was done 32 times per dataset, until 12.5% of each participant's data consisted of outliers. To obtain the reliability of these combinations, we utilized the same true score recoverability measure that we used in Study 2; that is, we computed the correlation between the computed bias scores and the true (double-difference) bias scores that the data was generated from. We thus obtained, for each of the 1000 datasets, the recoverability of the true score from bias scores that were computed with each combination of outlier rejection method and bias score algorithm, from data with 0 to 32 outliers per participant. These recoverability values were averaged across datasets to gain an overview of how recoverable the true score was through each combination of methods at each number of outliers.
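One way to parameterize such an outlier distribution is sketched below; the shift-plus-gamma construction reproduces the stated mean and SD, though the study's exact implementation may differ:

```r
# Overwrite one random trial with a slow outlier: gamma-distributed with
# shape 3 and SD 400 ms, shifted so its mean is the participant mean + 1200 ms
add_slow_outlier <- function(rt, shape = 3, sd_out = 400, offset = 1200) {
  scale <- sd_out / sqrt(shape)                # gamma SD = sqrt(shape) * scale
  shift <- mean(rt) + offset - shape * scale   # gamma mean = shape * scale
  i <- sample(seq_along(rt), 1)
  rt[i] <- shift + rgamma(1, shape = shape, scale = scale)
  rt
}
```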
mean difference scores and double-difference D-scores were This procedure was repeated in another 1000 datasets, more strongly affected by outliers (means: r = .80 to r = 0 32 except we iteratively replaced one random RT with a fast .47, D-scores: r = .81 to r = .47 with no outlier removal), 0 32 outlier (mean = μ – 500 ms, SD = 50 ms, gamma- but they were better at recovering the true score when there participant distributed, shape = 3) in each participant’s data and applied were few outliers or when outliers were excluded with M + outlier rejection to fast RTs before we computed bias scores 2 SD or median + 3 MAD; across virtually all outlier rejec- and recoverability. We also repeated the same process in tion methods and numbers of outliers, D-scores were better another 1000 datasets where we iteratively replaced one ran- than double mean difference scores at recovering the true dom RT with a fast outlier and another with a slow outlier. score (correlations were, on average, .01 higher, up to .05). Multilevel compatibility scores showed the strongest decline Results and discussion in true score recoverability following the addition of outliers (r = .79 to r = .35 with no outlier removal), and outlier 0 32 In this section we report on results regarding the double-dif- rejection failed to bring multilevel compatibility scores back ference scores. Outcomes relating to category-specific differ - on par with the other algorithms (e.g. when combined with ence scores were almost identical in pattern but lower in overall 3 MAD outlier rejection, multilevel: r = .73, and D-score: recoverability, and can be viewed in Appendix 2. We report on r = .78). In addition, we report in Appendix 2 how the multilevel compatibility scores rather than multilevel double- multilevel compatibility score also produces a much wider difference scores since the former were better at recovering the range of correlations with the true score than the other meth- true score in virtually all occasions. The results of the simula- ods do, making it especially difficult to know whether any tions for double-difference scores are depicted in Fig.  2. The single application of this method will produce scores with sensitivity and specificity of the examined outlier rejection the expected reliability; D-scores, in comparison, produced methods is also further discussed in Appendix 2. Correlations scores with the least variable correlation with the true score, of bias scores with other aspects of the underlying data are also indicating that this method is not only highly reliable but reported in Appendix 2. These reveal that multilevel bias scores also consistently reliable. Overall, slow outliers strongly are contaminated with variance from the participant mean RT. decreased true score recoverability (the reduction of recov- Whenever we report a correlation in this section, the associated erability from 0 to 32 outliers was between .03 and .44). number of outliers is reported as a subscript. Fast outliers Slow outliers Grubbs’ test and mean – 3 SDs almost completely failed The true score recoverability of the outlier rejection methods to detect fast outliers, performing no better than no outlier followed a similar pattern across all bias scoring algorithms. rejection (Fig. 2). 
The best recoverability of the true score was obtained when classifying all RTs faster than 2 SDs below the mean as outliers, especially in data with many outliers (r = .73–.78). The median − 3 MAD method also led to better reliabilities (r₁₆ = .72–.77) than no outlier rejection (r₁₆ = .69–.76). Furthermore, compared to no outlier removal (r₀ = .76–.81), removal of the fastest percentile of trials actually led to a decline in reliability, which was especially noticeable when there were few or no fast outliers in the data (r₀ = .73–.78).

Again, double median difference scores were only very slightly affected by outliers (r₀ = .75 to r₃₂ = .73), followed by D-scores (r₀ = .81 to r₃₂ = .70), double mean difference scores (r₀ = .80 to r₃₂ = .69), and lastly multilevel compatibility scores (r₀ = .79 to r₃₂ = .65); but median difference scores had lower reliability in the absence of outliers and never exceeded the reliability of D-scores when all trials more than 2 SD below the mean were removed. Overall, fast outliers had a relatively small influence on the reliability (the reduction of reliability from 0 to 32 outliers was between .02 and .14).

Fig. 2 True score recoverability changes due to exclusion of outliers across the four double-difference scoring methods

Bilateral outliers

Outlier rejection on data containing both slow and fast outliers led to results resembling a combination of the aforementioned findings, with the largest influence coming from slow outliers. Again, rejecting the top and bottom 1% of RTs reduced rather than improved the reliability when there were few to no outliers (percentile outlier removal: r₀ = .73–.79; compared to no outlier removal: r₀ = .75–.81). Bias scores were most reliable if outliers were removed by rejecting RTs deviating more than 3 MAD from the participant median (r₃₂ = .50–.68), but with fewer outliers, reliabilities were on par when outliers were removed by rejecting RTs deviating more than 2 SD from the participant mean (2 SD: r₈ = .74–.78; 3 MAD: r₈ = .74–.79).

Conclusion

For outlier rejection methods, it can be concluded that percentile-based outlier detection removes too few outliers when there are many, and it removes too many outliers when there are few, to the point of making bias scores less accurate under common circumstances (e.g. Fig. 2, second row; lines with squares). Accordingly, percentile-based outlier exclusion appears to be disadvantageous. Given both slow and fast outliers, the remaining outlier rejection methods did not strongly differ in effectiveness when there were few outliers, but when there were many, median ± 3 MAD (Fig. 2, row 3; lines with upward triangles) outperformed mean ± 2 SD, which in turn outperformed Grubbs' test and mean ± 3 SD (Fig. 2, row 3; diamonds, downward triangles, and circles). Mean ± 3 SD and Grubbs' test also failed to reject most fast outliers (Fig. 2, row 2, circles and downward triangles), which suggests there is little point to using these methods to remove fast outliers; one should thus combine these two methods with an absolute outlier cutoff like, for instance, 200 ms.

Among the algorithms, double-difference D-scores and double mean difference scores were most reliable when there were few outliers; in data with many slow and fast outliers (>8%), they were outclassed by double median difference scores despite outlier rejection. Multilevel compatibility scores were less reliable than the aforementioned methods when there were no outliers, they became more unreliable when there were more outliers in the data, and their reliabilities were more inconsistent than those of other methods. Worryingly, applying outlier rejection was not enough to make multilevel compatibility scores as reliable as those derived with the methods not based on multilevel analysis. This casts doubt upon whether the use of this scoring method is justifiable. Median difference scores were shown to be nearly unaffected by outliers, but they were less reliable than the other methods when there were few outliers and outlier rejection was applied. Hence, it appears that the robustness of median-based scores may be outweighed by the reliability and consistency of mean-based scores and especially D-scores in conjunction with appropriate outlier rejection.
Study 4: Comparison of validity and reliability of pre-processing pipelines on real data

Introduction and methods

We next examined the effect of different pre-processing decisions on reliability and validity in six real datasets.

Description of the examined datasets and their criterion validity measures

We selected datasets to cover appetitive and aversive stimulus categories, relevant- and irrelevant-feature task instructions, and joystick and touchscreen input, to get results that can generalize to a wide range of future AAT studies. Datasets were only eligible if they measured both an initiation RT and a full motion RT, if they featured a target and control category, and if their bias scores were significantly correlated with a criterion variable. Properties of the datasets, such as mean RT and error rate, are shown in Appendix 2.

Datasets for "Erotica" We used data from a single experiment fully described in Kahveci, Van Bockstaele, et al. (2020). In short, 63 men performed an AAT featuring eight blocks with 40 trials each. In four of these blocks, they had to classify images of women on the basis of whether the images were erotic or not (relevant-feature), and in the other four blocks they had to classify the images on the basis of hair color (irrelevant-feature). Half of the participants responded with the joystick and the other half using the keyboard. For analysis in the current study, five participants were removed: one with incomplete data, and four with a mean RT over 1000 ms. As the criterion validity measure, we chose the participants' self-reported number of porn-viewing sessions per week, as we found that approach–avoidance scores correlated more strongly with this score than with other constructs measured in the study.

Datasets for "Foods" We used data from a single experiment fully described in Lender et al. (2018). In short, 117 participants performed either of three joystick-AATs involving food and object stimuli where the correct movement direction was determined on the basis of different elements: stimulus content (N = 37), picture frame (N = 44), and a shape displayed in the middle of the stimulus (N = 36). Each task involved two blocks of 128 trials each. For the current study, we selected the content-based AAT as the relevant-feature task to be analyzed, and we selected the frame-based AAT as the irrelevant-feature AAT to be analyzed. For analysis in the current study, we removed one participant from the relevant-feature AAT with an error rate above 50%. As the criterion variable, we chose the restrictive eating scale (α = .90) of the Dutch Eating Behavior Questionnaire (van Strien et al., 1986), as we found that approach–avoidance bias scores correlated more strongly with this score than with other constructs measured in the study.

Datasets for "Spiders" For the relevant-feature AAT involving spiders, we used data from a single study fully described in Van Alebeek et al. (2023).
In short, 85 participants performed a relevant-feature AAT on a touchscreen where they were shown pictures of 16 spiders and 16 leaves, and were required to approach and avoid on the basis of stimulus content. Approaching involved sliding the hand towards the stimulus and then dragging it back to the screen center, while avoidance involved sliding the hand away from the stimulus. The task involved 128 trials divided into two blocks, and was embedded in a larger experiment which also included AATs involving butterflies, office articles, and edible and rotten food. As the criterion variable, we chose the Spider Anxiety Screening (α = .88; Rinck et al., 2002).

For the irrelevant-feature AAT involving spiders, we used data from a single study fully described in Rinck et al. (2021). In short, participants performed an irrelevant-feature go/no-go AAT on a touchscreen, where they were shown images of 16 spiders, 16 leaves, and 16 butterflies. Participants approached or avoided the spiders and leaves based on their position on the screen, while they were required not to respond to the butterflies. Responses always involved lifting the hand off the touchscreen and touching the stimulus, and then sliding it toward the other side of the screen. Thus, stimuli at the top of the screen were dragged closer and thus approached, while stimuli at the bottom of the screen were moved further away and thus avoided. After excluding all the no-go trials, the experiment consisted of 128 trials in a single block. The Spider Anxiety Screening was again used as criterion variable (α = .92).

Multiverse analysis

The six aforementioned datasets were pre-processed through many different pipelines, after which we computed the split-half reliability using the function aat_splithalf() in R package AATtools (Kahveci, 2020), as well as the criterion validity using Spearman correlations. We computed the average of 6000 random split-half correlations to obtain the randomized split-half reliability. We used 6000 iterations because we found in an analysis reported in Appendix 3 that, at most, 6000 random splits are needed to ensure that at least 95% of average split-half coefficients deviate less than .005 from the grand average of 100,000 splits.
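The logic of one such random split is sketched below; this is a schematic stand-in rather than the interface of aat_splithalf(), and the bias-score step is deliberately left as a placeholder:

```r
splithalf_once <- function(d) {   # d: trial-level data with subject and rt
  d$half <- ave(d$rt, d$subject, FUN = function(x)
    sample(rep(1:2, length.out = length(x))))       # random within-subject split
  score <- function(dd) tapply(dd$rt, dd$subject, mean)  # placeholder; a real
  r <- cor(score(d[d$half == 1, ]),                      # analysis computes a
           score(d[d$half == 2, ]),                      # bias score here
           use = "complete.obs")
  2 * r / (1 + r)                 # Spearman-Brown correction to full test length
}
# reliability <- mean(replicate(6000, splithalf_once(d)))
```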
The examined components of the pipeline included the definition of the RT (initiation time, completion time), the lower RT limit (0 ms, 200 ms, 350 ms), the upper RT limit (1500 ms, 2000 ms, 10,000 ms), the adaptive outlier rule (none, mean ± 2 SD, mean ± 3 SD, median ± 3 MAD, <1% and >99%, Grubbs' test), the error rule (keep errors, remove errors, replace errors with the block mean + 600 ms; further called error penalization), the algorithm type (category-specific difference, double-difference), and the algorithm aggregation method (mean difference, median difference, D-score, multilevel category-specific difference or compatibility). This led to a total of 2592 pipelines per dataset. The examined pipeline components were selected on the basis of their common use and methodological rigor as revealed by the literature review in Study 1, on the basis of results from the analyses in Studies 2 and 3, and with emphasis on newly (re)emerging methods in the field (e.g. Grubbs' test: Saraiva et al., 2013). In each analysis, the pre-processing steps were applied in the following order: the RT measure was selected, the lower and upper RT cutoffs were applied, error trials were excluded if required, outliers were excluded, error trials were penalized if required, and the bias scores were computed. During the computation of split-half reliability, participants were excluded from individual iterations if their bias score in either half deviated more than 3 SD from the sample mean of that half, to ensure correlations were not driven by outliers.

Validity of category-specific difference scores

To gain an overview of the psychometric properties of category-specific difference scores, we performed a number of tests. We computed category-specific bias scores for target and control stimuli by subtracting participants' median approach RT from their median avoid RT, both computed from initiation times of correct responses. We computed the correlation between bias scores for target and control stimuli. Across the whole of the multiverse analyses, we also computed the rank correlation between reliability and criterion validity per dataset and algorithm type (category-specific difference, double-difference). We excluded multilevel compatibility scores from analyses involving the irrelevant-feature AAT for reasons which are explained further in the results section.

Decision trees

Following the computation of reliability and criterion validity for each pipeline, we applied the Fisher z-transformation to the reliability and validity values to be able to analyze differences at both low and high levels of reliability and validity. We submitted the z-transformed reliabilities and validities as dependent variables to linear mixed decision tree analyses with random intercepts for dataset and fixed predictors for RT type, lower RT cutoff, upper RT cutoff, adaptive outlier rejection rule, error rule, and aggregation method. We used an alpha level of .001 and a maximum tree depth of 6 to prevent the decision trees from becoming too large to display. Decision trees were generated using R package glmertree (Fokkema et al., 2018). For display in plots and tables, the z-transformed correlations were averaged and then converted back to regular correlations.
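A hedged sketch of such a model-based tree is given below; the data are simulated and the variable names are ours, while the three-part formula and the alpha/maxdepth arguments follow the glmertree documentation:

```r
library(glmertree)
set.seed(1)
pipelines <- expand.grid(
  dataset      = factor(1:6),
  rt_type      = factor(c("initiation", "completion")),
  error_rule   = factor(c("keep", "remove", "penalize")),
  outlier_rule = factor(c("none", "2SD", "3SD", "3MAD", "percentile", "Grubbs"))
)
# Toy outcome: z-transformed reliability, lower when errors are penalized
pipelines$z <- atanh(.7) - .2 * (pipelines$error_rule == "penalize") +
  rnorm(nrow(pipelines), sd = .1)

tree <- lmertree(z ~ 1 | dataset | rt_type + error_rule + outlier_rule,
                 data = pipelines, alpha = .001, maxdepth = 6)
plot(tree, which = "tree")
```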
Results and discussion

Validity of category-specific difference scores

As can be seen in Table 6, target- and control-specific difference scores were positively correlated in all irrelevant-feature AATs, indicating that a significant portion of the variance in target-specific and control-specific bias scores is shared; this shared variance may originate from the interpersonal variability in participants' overall approach–avoidance RT differences, as we speculated in Study 1. Conversely, there was a significant negative correlation for two of the three relevant-feature AATs, indicating that category-specific difference scores to target and control stimuli are related to a source of variance that increases one bias score but decreases the other, such as response slowdown between blocks.

Table 6 Correlations between target- and control-specific difference scores, and correlations between reliability and criterion validity

Instructions | Stimuli | Target–control correlation r | p | Reliability–validity correlation (category-specific difference) r | p | Reliability–validity correlation (double-difference) r | p
Relevant-feature | Erotic | 0 | .988 | .20 | <.001 | .31 | <.001
Relevant-feature | Food | −.40 | .017 | −.36 | <.001 | .15 | <.001
Relevant-feature | Spider | −.14 | .189 | −.20 | <.001 | .31 | <.001
Irrelevant-feature | Erotic | .46 | <.001 | −.05 | .152 | −.01 | .661
Irrelevant-feature | Food | .38 | .011 | −.45 | <.001 | −.16 | <.001
Irrelevant-feature | Spider | .35 | .001 | .39 | <.001 | .08 | .010

As reported in Table 6, when bias scores were computed with double-difference scores, reliability and criterion validity were positively correlated in four datasets, negatively in one, and not at all in one. When bias scores were computed with category-specific difference scores, reliability and criterion validity were negatively correlated in three studies, positively in two, and not at all in one. We expected positive correlations between reliability and criterion validity, as more reliable measures are less influenced by noise and could hypothetically capture the approach–avoidance bias more accurately, enabling stronger correlations with measures of similar constructs; negative correlations would imply that when bias scores become more reliable, they get better at measuring a construct that is different from implicit approach–avoidance bias; this would cast doubt upon the validity of the scores.

These findings thus err more towards supporting than rejecting the idea that category-specific difference scores run a risk of being contaminated with sources of variance unrelated to approach–avoidance bias of the target stimuli, and they run a higher risk than double-difference scores of becoming less valid as they become more reliable. In the remainder of this results section we will therefore report on double-difference scores, while results on category-specific difference scores can be gleaned in Appendix 4.

Variability in reliability and validity of different bias scoring algorithms

We sought to gain an overview of how much the various bias scoring algorithms are perturbed by other pre-processing decisions. Figure 3 and Table 7 depict the mean reliability and criterion validity of the various datasets, as well as several measures of spread. Criterion validity and especially reliability were found to strongly fluctuate depending on which pre-processing pipeline was used. Comparing task types, the irrelevant-feature AATs were, on average, less valid and much less reliable, and their reliabilities and validities were more strongly perturbed by pre-processing decisions. Variability of reliability estimates was especially strong in multilevel compatibility scores in the irrelevant-feature AAT, with extreme values reaching into the range of 1 as well as −1. This is likely due to the fact that small random effects are difficult to identify in multilevel models and can thus get contaminated with other aspects of the data such as the mean RT, as we demonstrated in Appendix 2. Hence, multilevel compatibility scores may not be valid for the irrelevant-feature AAT, or for any other task with very small effect sizes. Accordingly, we do not analyze multilevel compatibility scores in the irrelevant-feature AAT in the remainder of this article.
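The pooling used for the spread estimates in Table 7 below can be written compactly; the per-dataset vectors here are simulated stand-ins for the actual per-pipeline estimates:

```r
set.seed(1)
r_by_dataset <- replicate(6, tanh(rnorm(2592, mean = atanh(.7), sd = .1)),
                          simplify = FALSE)      # 6 datasets x 2592 pipelines
z <- lapply(r_by_dataset, atanh)                 # Fisher z-transformation
tanh(mean(unlist(z)))                            # mean, back-transformed to r
sqrt(mean(sapply(z, var)))                       # pooled SD: within-dataset
                                                 # variances, then averaged
```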
Fig. 3 Distributions of reliability and validity coefficients acquired through different pre-processing pipelines in the six analyzed datasets. This figure depicts the distribution of reliability and criterion validity estimates from all different pre-processing pipelines. A wide distribution implies that differing pre-processing decisions had a large influence on the resulting reliability or criterion validity. Criterion validity is based on the correlation between approach–avoidance bias and a variable that was preselected on the basis of its significant correlation with approach–avoidance bias scores in that particular dataset. It is therefore of little value to focus on how high or low this value is in absolute terms. Rather, we guide the reader to focus on the spread or uncertainty of this value. In all cases, the validity of the irrelevant-feature AAT datasets is more spread out than that of the relevant-feature AATs

Table 7 Means, confidence intervals, and variability estimates for reliability and validity outcomes over all pipelines

AAT type | Algorithm | Reliability: mean r | 95% CI | SD of z | Criterion validity: mean r | 95% CI | SD of z
Relevant-feature | Multilevel | .74 | .46, .88 | .12 | .27 | .03, .45 | .09
Relevant-feature | Mean | .74 | .45, .88 | .12 | .27 | .05, .44 | .08
Relevant-feature | Median | .72 | .49, .84 | .07 | .33 | .23, .48 | .06
Relevant-feature | D-score | .77 | .55, .89 | .10 | .28 | .05, .45 | .10
Irrelevant-feature | Multilevel | .23 | −.67, .89 | .60 | .12 | −.22, .38 | .14
Irrelevant-feature | Mean | −.05 | −.61, .27 | .21 | .23 | .03, .52 | .11
Irrelevant-feature | Median | −.09 | −.56, .18 | .18 | .22 | .01, .39 | .09
Irrelevant-feature | D-score | .01 | −.32, .22 | .12 | .23 | .03, .50 | .10

Note: SDs represent the pooled SD of z-transformed, not raw, reliability and criterion validity estimates. Pooling was done by computing the variance within each dataset first and then averaging across datasets. Criterion validity is based on the correlation between approach–avoidance bias and a variable that was preselected on the basis of its significant correlation with approach–avoidance bias scores in that particular dataset.

Reliability decision trees

We used decision trees to deconstruct the complex nonlinear relationships between different factors in how they influence the reliability and validity of the six AATs.

The reliability decision tree of the relevant-feature AATs is depicted in Fig. 4. The most influential decision was how to handle error trials: penalization (.71) gave worse reliability than error removal or retention (.76). The second most influential decision was algorithm: double-difference D-scores were the most reliable (.78) but could lead to lower reliability if lax outlier rules (upper RT limit of 10,000 ms, no adaptive outlier exclusion or percentile-based) were applied to completion times (.72); multilevel compatibility and double mean difference scores came in closely after (.76), and the only thing harming their reliability was retention (.73) rather than removal of errors (.76). Double median difference scores were the least reliable (.73) and were further harmed by the stricter outlier removal methods (.71; median ± 3 MAD, M ± 2 SD).

Fig. 4 Decision tree of factors influencing the reliability of the relevant-feature AAT. The factors that the data was split by are denoted on each node, and the factor levels by which the data was split are depicted on the edges emerging from these nodes. The numbers displayed in each node represent the average reliability achieved by the decisions that led to that node. Particularly reliable and unreliable pathways are respectively depicted in green and grey

The reliability decision tree of the irrelevant-feature AAT is depicted in Fig. 5. Reliability was very low for this task. Again, error penalization harmed reliability (−.13), though less so for D-scores (−.02). Algorithm was the second most important decision when errors were not penalized: double median difference scores gave bad reliability (~ −.10), except with the use of completion times and less strict upper RT limits like 2000 ms or above (.03). Double mean difference scores and double-difference D-scores benefited the most from removal of error trials and from outlier handling with any method (.05) other than percentiles.

Fig. 5 Decision tree of factors influencing reliability in the irrelevant-feature AAT. The factors that the data was split by are denoted on each node, and the factor levels by which the data was split are depicted on the edges emerging from these nodes. The numbers displayed in each node represent the average reliability achieved by the decisions that led to that node. Particularly reliable and unreliable pathways are respectively depicted in green and grey
Validity decision trees

We used the same methodology to construct decision trees for validity.

As depicted in Fig. 6, the criterion validity of the relevant-feature AAT was much less strongly perturbed by pre-processing decisions than its reliability was. Once again, error penalization was harmful to criterion validity on average (.23) compared to error removal and retention (.32). However, if error trial RTs were penalized, validity could be salvaged with the use of a combination of double median difference scores, completion times, and a 1500 ms RT cutoff (.37). When error trials were not penalized, validity was higher for double median difference scores and double-difference D-scores (.33) than for double mean difference scores or multilevel compatibility scores (.30). Additionally, validity often benefited slightly from removal of error trials, and subsequently, from the use of completion times rather than initiation times.

Fig. 6 Decision tree of factors influencing the criterion validity of the relevant-feature AAT. The factors that the data was split by are denoted on each node, and the factor levels by which the data was split are depicted on the edges emerging from these nodes. The numbers displayed in each node represent the average criterion validity achieved by the decisions that led to that node. Particularly valid and invalid pathways are respectively depicted in green and grey

As depicted in Fig. 7, criterion validity outcomes were more ambiguous for the irrelevant-feature AAT. Criterion validity was higher with outlier rejection methods that were neither strict nor lax, i.e. M ± 3 SD and the Grubbs test (.25); and validity could only be harmed within this branch by the combination of retaining error trials and using completion times (.19).
With the other outlier rejection methods, validity was best when error trials were removed or penalized (.22) rather than kept (.19). Unlike in every other decision tree, error penalization did not lower the outcome measure in this case.

Fig. 7 Decision tree of factors influencing the criterion validity of the irrelevant-feature AAT. The factors that the data was split by are denoted on each node, and the factor levels by which the data was split are depicted on the edges emerging from these nodes. The numbers displayed in each node represent the average criterion validity achieved by the decisions that led to that node. Particularly valid and invalid pathways are respectively depicted in green and grey

General discussion

There is a long chain of often arbitrary pre-processing decisions that researchers have to make before analyzing their data, and the wide variability in outcomes this can generate threatens replicability and scientific progress. Only recently have researchers begun to investigate the consequences of different decisions (Steegen et al., 2016), and a comprehensive study for the field of AAT research has so far been missing. We aimed to fill this gap here.

Our selective literature review in Study 1 revealed a wide range of pre-processing practices in AAT studies. We subsequently used simulations (Studies 2 and 3) and analyses on real data (Study 4) to compare many of these practices, and obtained several findings that can inform further RT research. Importantly, we found large variability in the obtained reliability and validity outcomes depending on the chosen pre-processing pipeline. This highlights the fact that the varying practices do indeed muddy the waters of whether an effect is present or absent, and likewise, that there is much to be gained from the informed choice of the study's pre-processing pipeline. We will discuss, in turn, the findings on error handling, outlier rejection, score computation, RT measurement, and instruction type. We will derive from these findings a set of recommendations, which are summarized in Table 8. We also consider implications for other RT-based implicit measures.

Error trials

Most striking was the finding that replacing error RTs with the block mean RT plus a penalty (e.g. 600 ms) frequently led to lower reliability and validity. In Study 1 we found that this method was used in 7 out of 163 reviewed studies, likely due to the influence of the implicit association task literature, in which this method is common. Furthermore, there was a smaller but noticeable disadvantage in reliability and validity when error trials were kept rather than removed, especially when trial completion times were used as the RT measure. Errors were kept in the data in 34 out of 163 reviewed studies.
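To make the compared options concrete, here is a minimal sketch of the three error-handling rules for a single block of trials; that the base RT for penalization is the mean of correct trials only is our assumption:

```r
handle_errors <- function(rt, error, rule = c("remove", "keep", "penalize")) {
  rule <- match.arg(rule)
  if (rule == "remove") return(rt[!error])
  if (rule == "penalize")                  # block mean of correct trials
    rt[error] <- mean(rt[!error]) + 600    # plus the 600 ms penalty
  rt                                       # "keep" leaves error RTs in place
}
handle_errors(c(450, 520, 480, 610), c(FALSE, TRUE, FALSE, FALSE), "penalize")
```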
Outliers

Regarding RT cutoffs, we found that reliability and validity of real data were unaffected by the presence or absence of lower RT cutoffs; hence, this particular pre-processing decision may not strongly influence reliability and validity outcomes. Upper RT cutoffs did influence outcomes, though not that frequently. A cutoff of 1500 ms showed good validity for completion times in the relevant-feature AAT, while a cutoff of 1500 or 2000 ms showed slightly better reliability than a cutoff of 10,000 ms under very specific conditions. Despite this ambiguity, however, we do suggest that reasonably chosen lower and upper RT cutoffs be applied: both slow and fast outliers, however rare and insignificant they may be, still represent invalid data, and slow outliers still have a strong impact on subsequently applied adaptive outlier removal methods and RT aggregation.

We found no clear pattern regarding which outlier rejection method produced better results in real data; we did, however, find that removing outliers was better than not doing so. This contrasts with the results of our simulations, where true score recoverability followed a consistent pattern across outlier rejection methods from best to worst: median ± 3 MAD > mean ± 2 SD > repeated Grubbs tests > mean ± 3 SD > 1st & 99th percentiles > none. In our simulations, we found that dealing with outliers by rejecting the lowest and highest RT percentiles across the dataset can actually harm reliability, since this method does not distinguish between real outliers and regular RTs in very fast or slow individuals. It was used in 10 out of 163 reviewed studies. Furthermore, almost all fast outliers remained in our simulated data when we rejected RTs deviating more than 3 SDs from the individual mean or RTs that were significant outliers on Grubbs' test, and hence, these methods should be used in conjunction with fixed cutoffs for fast outliers. Rejecting RTs deviating more than 2 SDs from the individual mean produced the best reliability outcomes in simulations involving fast or few slow outliers, but in real data the reliability and validity of this outlier rejection method often performed on equal footing with rejecting RTs deviating more than 3 SDs from the mean. Berger and Kiefer (2021) found in a series of simulation studies that SD-based outlier rejection, in contrast to MAD-based outlier rejection, is less prone to inflating type I error. Based on these findings and the prior establishment of methods in the field, our preference thus goes towards either rejecting outliers deviating more than 2 SD from the mean, or towards rejecting RTs deviating more than 3 SD after very fast and very slow RTs have been removed, as reported in Table 8.

We have two explanations for why the outcomes for outlier rejection in simulated and real data were divergent. First, the outlier rejection methods may have produced more divergent outcomes for our simulations simply because we simulated a large number of outliers: when the number of simulated outliers was smaller and more consistent with what occurs in real data (e.g. 4% of trials), the outlier rejection methods were much less distinguishable. Though less likely, an alternative explanation is that the simulation was based on incorrect assumptions. We assumed that RT differences between conditions are represented by shifts in the bulk of the RT distribution, rather than in the presence of more extreme RTs in one condition than in the other; depending on which of these two assumptions is used, results can be quite different, as demonstrated by Ratcliff (1993). This assumption may have favored outlier rejection methods that remove a larger number of extreme RTs, such as the MAD. More lenient outlier rejection methods would be favored if differences between conditions instead originated from differences in the number of extreme RTs. Future research should investigate whether RT differences between conditions in the AAT are represented by a larger number of extreme RTs or by shifts in the bulk of the RT distribution.
Table 8 Recommendations for pre-processing AAT data

Outliers
• Less reliable/valid methods: not rejecting outliers
• Methods with ambiguous outcomes: removing the lowest and highest percentile of RTs sample-wide
• More reliable/valid alternatives: rejecting RTs deviating more than 2 SD from the mean; rejecting RTs deviating more than 3 SD from the mean; rejecting RTs deemed outliers by repeated Grubbs' tests; rejecting RTs deviating more than 3 MADs from the median; preceding the aforementioned methods with the removal of RTs below and above reasonable fixed cutoffs*

Error trials
• Less reliable/valid methods: not removing error trials; replacing error trials with the block mean plus a penalty
• More reliable/valid alternatives: removing error trials

Bias score computation
• Less reliable/valid methods: compatibility scores; multilevel double-difference scores; category-specific multilevel scores in the irrelevant-feature AAT; category-specific difference scores*
• Methods with ambiguous outcomes: double median difference scores; multilevel compatibility scores in the relevant-feature AAT
• More reliable/valid alternatives: double-difference D-scores in conjunction with outlier rejection; double mean difference scores in conjunction with outlier rejection

Note: Recommendations are displayed in order, with the worst and best methods displayed at the top of each list. * = primarily based on theoretical considerations

Scoring algorithms

Regarding scoring algorithms, we reasoned that category-specific difference scores (approach stimuli − avoid stimuli) are confounded with stimulus-independent individual differences in approach–avoidance speed, and hence, these should always be contrasted with a reference stimulus category. In Study 4, we found that increasing the reliability of a category-specific difference score often decreases its validity, and that category-specific difference scores for target and control stimuli are positively correlated in the irrelevant-feature AAT, supporting the idea that these scores are contaminated, and become more contaminated when they are more reliable. We therefore opted to focus the majority of this article on double-difference scores. However, our concerns about category-specific difference scores need to be corroborated with more conclusive evidence in future empirical studies, which manipulate or track factors that differentially influence approach and avoidance RTs, such as posture, fatigue, muscle mass, and response labelling.

We demonstrated using simulations that compatibility scores become more inaccurate than double-difference scores when there is an unequal number of trials in different conditions, and they confer no benefits over double-difference scores; the only exception to this was in multilevel random effect scores. We found that multilevel random effects are inaccurate compared to other methods, and become increasingly inaccurate when bias scores are modelled with three model terms, as in a double-difference score, rather than with a main effect, as in a compatibility score. Hence, double-difference scores should be preferred over compatibility scores except when bias scores are computed through multilevel modelling.

Among these, double-difference D-scores consistently had the highest validity and reliability and the lowest variability in outcomes, both in simulated and real data; we therefore express our clear preference for double-difference D-scores over the other methods. Double mean difference scores had more variable outcomes and were often slightly less reliable and valid.

We found in both simulations and real data, to our surprise, that double median difference scores led to lower reliability than double mean difference scores or double-difference D-scores in conjunction with adequate outlier rejection. Double median difference scores were only more reliable in simulated data with many outliers (>8%). In validity, there was not as much of a difference between algorithms so long as errors and outliers were removed. Hence, we draw no strong conclusions on whether double median difference scores are to be discouraged or not.
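For reference, a minimal sketch of the double-difference D-score for a single participant is given below; the column names are illustrative, and a real pipeline would first apply the error and outlier handling discussed above:

```r
double_dscore <- function(d) {  # d: one participant's trials (rt, move, stim)
  m  <- tapply(d$rt, list(d$move, d$stim), mean)
  dd <- (m["avoid", "target"]  - m["approach", "target"]) -
        (m["avoid", "control"] - m["approach", "control"])
  dd / sd(d$rt)   # double mean difference scaled by the participant's RT SD
}
```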
scores when there is an unequal number of trials in differ - We found that multilevel compatibility scores were ent conditions, and they confer no benefits over double-dif- more strongly affected by outliers than any other scor- ference scores; the only exception to this was in multilevel ing algorithm, and applying outlier rejection did not fully random effect scores. We found that multilevel random remedy this issue. Multilevel compatibility scores also effects are inaccurate compared to other methods, and had the largest unpredictability in outcomes both sim- become increasingly inaccurate when bias scores are mod- ulations and real data, and especially in the irrelevant- elled with three model terms as in a double-die ff rence score, feature AAT. We hypothesize that this is due to the fact rather than with a main effect, as in a compatibility score. that it can be difficult for multilevel models to identify Hence, double-difference scores should be preferred over small random effects, as occur in the irrelevant-feature 1 3 Behavior Research Methods AAT, where bias scores explain only a small propor- We did not investigate the impact of several less com- tion of the RT variance. Hence, we recommend against mon pre-processing approaches that address problematic using multilevel random effect scores in the irrelevant- aspects of the data overlooked by most reviewed meth- feature AAT, and we remain ambivalent about their use ods. RT transformations, such as square root, natural in the relevant-feature AAT. Further research is needed logarithm, and inverse transformations, can reduce the to demonstrate whether this method has any advantages rightward skew of the RT distribution and thereby de- that make it preferable over the algorithms that do not use emphasize the inf luence of slow RTs on subsequently mixed modelling. In particular, multilevel modelling (or computed bias scores that are based on means or regres- per-participant regression) could account for trial-level sion. Similarly, there are a number of outlier rejection contaminants of RTs, such as post-error slowing, fatigue methods that can deal with skewed distributions, such and learning effects, and stimulus-specific confounds as the use of interquartile ranges, and exclusion using such as recognition speed, or visuospatial complexity. It asymmetric SDs; these currently remain unexplored in is as of yet unclear how exactly these contaminants could this article and the wider AAT literature. RTs can also be best be modelled, and whether their inclusion benefits the excluded by their temporal position within the block, as validity of the bias scores. participants are often still memorizing the instructions at the start of the block; hence, exclusion of trials at the RT definitions start of the block is a recommended pre-processing step for the brief IAT (Nosek et al., 2014). Regarding RT definitions, our findings were somewhat inconclusive: the only consistent pattern was that com- Generalization to other RT paradigms pletion times are less reliable and valid when error trials are also kept in the data. We therefore cannot draw any The current methodological findings cause concern for how conclusions as to which of these two RT definitions is data are analyzed in other RT tasks. However, it is difficult preferable. 
RT definitions

Regarding RT definitions, our findings were somewhat inconclusive: the only consistent pattern was that completion times are less reliable and valid when error trials are also kept in the data. We therefore cannot draw any conclusions as to which of these two RT definitions is preferable. We suggest that the RT definition be chosen on the basis of theoretical considerations and previous research in a specific field. As we found in our own previous research with touchscreen-based AATs, approach–avoidance biases may express themselves primarily at the movement planning stage, such as when the target stimuli are foods (Kahveci et al., 2021; Van Alebeek et al., 2021), or during movement execution, such as when the target stimuli are spiders (Rinck et al., 2021). Since very few studies have explored the outcomes of multiple RT definitions (see also: Rotteveel & Phaf, 2004; Solarz, 1960), we recommend that this be done more often in future research.

Limitations and future directions

We were unable to investigate on which basis to include or exclude participants from AAT studies, for example, on the basis of extreme mean RT, error rate, outlier count, or bias score relative to the rest of the sample. Such an investigation would require the analysis of far more datasets, and hence, this is to be addressed by future research. For now, it may be sensible to reject participants on the basis of preset criteria regarding error rates and mean RTs, as these can signal that the data of a particular participant do not sufficiently represent the mental process under study. Similarly, when non-robust analysis methods are used, outlying bias scores should be removed.

We did not investigate the impact of several less common pre-processing approaches that address problematic aspects of the data overlooked by most reviewed methods. RT transformations, such as square root, natural logarithm, and inverse transformations, can reduce the rightward skew of the RT distribution and thereby de-emphasize the influence of slow RTs on subsequently computed bias scores that are based on means or regression. Similarly, there are a number of outlier rejection methods that can deal with skewed distributions, such as the use of interquartile ranges and exclusion using asymmetric SDs; these currently remain unexplored in this article and the wider AAT literature. RTs can also be excluded by their temporal position within the block, as participants are often still memorizing the instructions at the start of the block; hence, exclusion of trials at the start of the block is a recommended pre-processing step for the brief IAT (Nosek et al., 2014).
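To illustrate the de-emphasis that such transformations achieve, consider a toy set of RTs with one slow outlier; the numbers are ours, purely for illustration:

```r
rt <- c(420, 510, 480, 470, 2400)   # toy RTs (ms) with one slow outlier
mean(rt)                            # 856: the raw mean is dragged far upward
exp(mean(log(rt)))                  # ~651: log-domain mean resists the outlier
mean(sqrt(rt))^2                    # ~736: square-root domain, in between
1 / mean(1 / rt)                    # ~557: inverse (harmonic) domain, strongest
```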
Generalization to other RT paradigms

The current methodological findings cause concern for how data are analyzed in other RT tasks. However, it is difficult to forecast how the examined methods affect the validity of other tasks, as other tasks might depend on aspects of the data that are masked by this study's recommendations. Ideally, the current multiverse decision tree methodology could be applied to every popular experimental paradigm to confirm whether the way these tasks are currently pre-processed is beneficial or detrimental. It is particularly important that such a multiverse analysis is performed on paradigms where the most commonly used pre-processing pipelines include methods we found to be detrimental. The IAT, for example, is commonly analyzed by penalizing error trials and including outliers in the data. These recommendations by Greenwald et al. (2003) were adopted in a minority of AAT studies that used the D-score (e.g. Ferentzi et al., 2018; Lindgren et al., 2015; Van Alebeek et al., 2021), and are contradicted by our findings.

This being said, a number of our findings are purely statistical in nature and can be expected to generalize regardless of the paradigm. We demonstrated in Study 2 that less accurate aggregated scores are obtained when averaging together two conditions with unequal trial counts (as in compatibility scores) instead of computing separate averages for each and adding those together (as in double-difference scores). This disadvantageous practice is common in research on the IAT (Greenwald et al., 2003). Additionally, rejecting the top and bottom 1% of RTs as outliers will also lead to the removal of an inappropriately low or high number of trials in other paradigms, although this method currently sees little use outside the AAT literature. Lastly, fast outliers will also remain undetected in other paradigms when most of the adaptive outlier rejection methods that we examined are applied.

Finally, it remains to be explored further in the AAT and in other paradigms whether it is problematic to contrast two response conditions to target stimuli without further contrasting these to control stimuli, as with category-specific difference scores. Stimulus-independent biases favoring one response over the other are common and cannot always be prevented through good experimental design. It remains to be shown, however, how influential they truly are, especially when responses consist of mere button-presses rather than full-limb movements as with the joystick.

Conclusions

Far from delivering a one-size-fits-all recommendation for pre-processing the AAT, our review, simulations, and multiverse decision tree analyses have recovered a number of more reliable and valid methods, while eliminating a smaller number of methodologically harmful "forking paths" in the garden of AAT pre-processing decisions, as shown in Table 8. As some of these harmful practices are highly common (e.g. error trial retention or penalization) or even dominate the field (e.g. median category-specific difference scores), we hope that the recommendations of the current study will help to significantly improve the overall reliability and validity of future AAT studies.

Appendix 1

Parameter retrieval procedure

Parameters were computed for the six datasets with and without errors and outliers (defined as RTs below 200 ms or above 2000 ms). For the main effect of movement direction, we computed per participant the mean difference for approach minus avoid trials; for the main effect of stimulus category, we computed the mean difference for trials featuring the target minus control stimuli. For the effect of bias score, we computed the mean difference between trials featuring approach of target stimuli and avoidance of control stimuli, minus trials featuring avoidance of target stimuli and approach of control stimuli. For RT mean and SD, we computed the mean and SD of each participant's RTs before and after the subtraction of the aforementioned movement, stimulus, and bias effects from the RTs. After this, parameter means and SDs were computed across participants. These parameters are reproduced in Appendix Tables 9 and 10.

Table 9 Sample characteristics of the six datasets used in the study

Content | Task type | Outliers | N subjects | Mean N trials | Mean errors | Mean RT | Mean RT var. | Full RT SD | Full RT SD var. | Residual RT SD | Residual RT SD var.
Erotica | Relevant-feature | Raw | 58 | 160 | 10.5 | 538.09 | 77.1 | 175.79 | 80.81 | 171.35 | 79.46
Erotica | Relevant-feature | Clipped | 58 | 148.91 | – | 536.15 | 68.98 | 150.67 | 44.41 | 146.31 | 43.9
Erotica | Irrelevant-feature | Raw | 58 | 160 | 13.62 | 617.42 | 107.52 | 213.55 | 108.77 | 211.19 | 108.07
Erotica | Irrelevant-feature | Clipped | 58 | 145.5 | – | 608.04 | 95.19 | 180.41 | 63.29 | 177.92 | 62.97
Foods | Relevant-feature | Raw | 36 | 255.75 | 24.42 | 618.46 | 96.93 | 203.2 | 79.46 | 196.9 | 78.45
Foods | Relevant-feature | Clipped | 36 | 231.33 | – | 632.24 | 90.07 | 165.87 | 50.72 | 158.37 | 49.87
Foods | Irrelevant-feature | Raw | 44 | 241.39 | 34.36 | 527.26 | 106.84 | 187.36 | 80.14 | 185.14 | 79.66
Foods | Irrelevant-feature | Clipped | 44 | 207.02 | – | 535.23 | 97.34 | 158.09 | 55.09 | 155.42 | 54.4
Spiders | Relevant-feature | Raw | 85 | 128 | 2.42 | 548.22 | 90.57 | 147.1 | 71.31 | 140.69 | 67.55
Spiders | Relevant-feature | Clipped | 85 | 121.16 | – | 561.47 | 76.93 | 124.49 | 43.57 | 117.5 | 41.68
Spiders | Irrelevant-feature | Raw | 86 | 128 | 5.31 | 539.46 | 72.39 | 124.33 | 68.24 | 119.51 | 65.75
Spiders | Irrelevant-feature | Clipped | 86 | 122.15 | – | 539.16 | 68.82 | 114.47 | 42.85 | 109.57 | 41.94

Table 10 Effect size means and variances of RT contrasts in the six datasets used in the study

Content | Task type | Outliers | Pull effect | Pull effect var. | Pull effect size | Stim. effect | Stim. effect var. | Stim. effect size | Bias effect | Bias effect var. | Bias effect size
Erotica | Relevant-feature | Raw | −20.12 | 36.56 | −.55 | −16.92 | 40.51 | −.42 | 26.12 | 52.22 | .5
Erotica | Relevant-feature | Clipped | −25.87 | 36.78 | −.7 | −18.58 | 35.13 | −.53 | 20.99 | 37.63 | .56
Erotica | Irrelevant-feature | Raw | −19 | 38.21 | −.5 | 15.93 | 37.75 | .42 | −.06 | 33.83 | 0
Erotica | Irrelevant-feature | Clipped | −28.57 | 33.32 | −.86 | 11.36 | 25.81 | .44 | 3.33 | 31.79 | .1
Foods | Relevant-feature | Raw | −30.62 | 35.24 | −.87 | −30.88 | 36.21 | −.85 | 39.26 | 69.91 | .56
Foods | Relevant-feature | Clipped | −39.21 | 40.5 | −.97 | −30.94 | 32.5 | −.95 | 38.97 | 60.13 | .65
Foods | Irrelevant-feature | Raw | −27.61 | 38.91 | −.71 | −4.52 | 26.41 | −.17 | 1.01 | 25.6 | .04
Foods | Irrelevant-feature | Clipped | −33.08 | 34.45 | −.96 | −1.18 | 29.1 | −.04 | 1.22 | 23.66 | .05
Spiders | Relevant-feature | Raw | −27.04 | 45.96 | −.59 | −25.05 | 45.02 | −.56 | −7.81 | 62.48 | −.12
Spiders | Relevant-feature | Clipped | −31.99 | 39.16 | −.82 | −27.77 | 34.28 | −.81 | −7.46 | 52.67 | −.14
Spiders | Irrelevant-feature | Raw | −35.84 | 42.22 | −.85 | 11.52 | 39.22 | .29 | −5.5 | 35.56 | −.15
Spiders | Irrelevant-feature | Clipped | −34.93 | 36.94 | −.95 | 10.52 | 35.22 | .3 | −8.31 | 26.94 | −.31
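A compact sketch of the per-participant contrasts described in this appendix (column names are illustrative stand-ins for the datasets' variables):

```r
effects_per_subject <- function(d) {   # d: one participant's trials
  movement <- mean(d$rt[d$move == "approach"]) - mean(d$rt[d$move == "avoid"])
  stimulus <- mean(d$rt[d$stim == "target"])   - mean(d$rt[d$stim == "control"])
  compat   <- (d$move == "approach" & d$stim == "target") |
              (d$move == "avoid"    & d$stim == "control")
  bias     <- mean(d$rt[compat]) - mean(d$rt[!compat])
  c(movement = movement, stimulus = stimulus, bias = bias)
}
# per_subject <- t(sapply(split(dat, dat$subject), effects_per_subject))
```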
Appendix 2

Additional findings in Study 3

Category-specific difference scores and the impact of outliers and outlier removal

Category-specific difference scores showed the same pattern as double-difference scores in how they were impacted by outliers and outlier rejection in how well they were able to recover the true score on average, as depicted in Appendix Fig. 8.

Fig. 8 True score recoverability changes due to exclusion of outliers across the four category-specific difference score methods

Variability in true score recoverability of double-difference scores

We also computed the SD, rather than the mean, of the true score recoverability. The SD was computed on the basis of Fisher r-to-z transformed correlations, rather than untransformed correlations. The transformation was applied to minimize the influence of average correlation magnitude on correlation dispersion. The resulting SD represents how unpredictable the correlation between the computed and true score is. The results are depicted in Appendix Fig. 9. They reveal that multilevel compatibility scores are highly variable in their correlation with the true score, compared to the other methods. The results also highlight that median-based scores are not more stable than mean-based scores; on the contrary, D-scores had the smallest SD of their correlation with the true score.

Fig. 9 Variability of true score recoverability computed with different outlier rejection methods and scoring algorithms at different numbers of outliers

Outlier detection rates of outlier exclusion methods given varying numbers of outliers

The outlier detection rates of different outlier rejection procedures are depicted in Appendix Fig. 10. The percentile method stands out as having the highest false negative rate of all outlier detection methods after the data contain more than 1% of outliers, which makes sense, but also the highest false positive rate with fast outliers, which may be because it detects outliers across the entire sample and not within participants. Rejecting RTs deviating more than 2 SD from the participant mean led to the highest true positive rates and lowest false negative rates for fast RTs (i.e. highest sensitivity). For slow RTs, however, the 2 SD method had the highest false positive rate when there were very few outliers, which was apparently to no detriment to the reliability of the data, and this false positive rate was greatly reduced when there were more outliers in the data.

Fig. 10 Outlier detection rates for different detection methods and types of outliers
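When the injected outliers are known, these rates are straightforward to score; a minimal sketch (argument names are ours):

```r
detection_rates <- function(flagged, is_outlier) {
  c(true_positive  = mean(flagged[is_outlier]),   # sensitivity
    false_positive = mean(flagged[!is_outlier]))  # 1 - specificity
}
```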
Spurious correlates of bias score algorithms

We not only computed correlations between the computed bias score and the true underlying bias score, but also with other parameters that were used to generate the data, including the true mean RT, the true movement direction effect (irrespective of stimulus), and the true stimulus category effect (irrespective of movement direction). The results at an outlier count of 0 are depicted in Appendix Fig. 11. Three aspects of these results are worth noting. First, target-specific bias scores are unsurprisingly correlated with movement direction effects. Second, only the double mean difference score and the double-difference D-score consistently have a decent correlation with the true bias effect, while all other algorithms had below-zero correlations on some occasions. Third and most importantly, the correlation between multilevel-based scores and mean RT is highly spread out both in the positive and negative directions, with almost perfect correlations within the realm of possibility; this is not a property of the data, given that other bias scores do not feature such extreme correlations. This contamination is sure to reduce the validity of multilevel bias scores as well as make them artificially reliable, given that mean RT is a highly reliable variable.

Fig. 11 Correlations of bias scores with parameters used to generate the data

Appendix 3

Determining the ideal number of split-halves

We next determined the ideal number of split-halves to use for real data, as there are, to our knowledge, no recommendations on this. We split the six real datasets 100,000 times, computed bias scores for both halves in each split, and recorded the correlation between scores for both halves. From this large pool of split-half correlations, we added one random correlation at a time to a pool and averaged the correlations in the pool together, recording the resulting aggregated correlation for each pool size from 1 to 20,000. This was done 200 times for double mean difference scores, double median difference scores, and double-difference D-scores. To analyze the accuracy associated with each pool size, we computed the absolute difference between the average correlation for the pool and that for the entire set of 100,000 splits. For each pool size, we counted the number of average correlations deviating more than .005 from the grand average. We then computed, for each of the six datasets and three algorithms, the largest number of iterations below which more than 5% of pool averages deviated more than .005 from the grand average; if less than 5% of averages deviated more than .005 from the grand average, this was deemed an acceptable number of splits. Appendix Fig. 12 depicts the gradual increase in accuracy in split-half reliability estimation as more split-halves were averaged together. Appendix Table 11 reports this threshold for each scoring algorithm and dataset.

Fig. 12 Percentage of pooled split-half correlations deviating more than .005 from the grand average as a function of the number of split-half correlations included in the pool
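The pooling analysis can be mimicked in a few lines; the pool of split-half correlations below is simulated rather than drawn from the study's 100,000 splits:

```r
set.seed(1)
r_pool <- tanh(rnorm(1e5, mean = atanh(.7), sd = .1))  # simulated split-half rs
grand  <- mean(r_pool)
off_at <- function(k) abs(mean(sample(r_pool, k)) - grand) > .005
mean(replicate(200, off_at(2000)))   # share of pools of 2000 splits off by > .005
```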
Appendix Fig. 12 depicts the gradual increase in the accuracy of split-half reliability estimation as more split-halves were averaged together. Appendix Table 11 lists, for each scoring algorithm and dataset, the largest number of pooled split-half correlations at which more than 5% of pool averages still deviated more than .005 from the grand average.

Fig. 12 Percentage of pooled split-half correlations deviating more than .005 from the grand average as a function of the number of split-half correlations included in the pool

Table 11 Largest number of pooled split-half correlations at which more than 5% of pool averages deviated more than .005 from the grand average

| Dataset | Double mean difference | Double median difference | Double-difference D-score |
|---|---|---|---|
| Irrelevant-feature Erotic | 2340 | 2800 | 1980 |
| Irrelevant-feature Food | 3460 | 5180 | 1740 |
| Irrelevant-feature Spider | 1580 | 1500 | 860 |
| Relevant-feature Erotic | 1480 | 1320 | 700 |
| Relevant-feature Food | 540 | 720 | 340 |
| Relevant-feature Spider | 380 | 560 | 280 |

To obtain accurate split-half estimates, D-scores required the fewest iterations, as did the relevant-feature AATs, which tend to be more reliable; for these, 2000 iterations would be more than enough. Mean and median double-difference scores in irrelevant-feature AAT datasets may require more than 5500 split-half iterations to obtain stable results.

Appendix 4

Decision trees for category-specific scores

For category-specific difference scores, we generated decision trees in exactly the same manner as described in Study 4. The only difference in methodology was that bias scores were computed with only the target stimuli, thus ignoring the control stimuli. Reliability outcomes for the relevant-feature AAT are depicted in Appendix Fig. 13, and for the irrelevant-feature AAT in Appendix Fig. 14. Criterion validity outcomes for the relevant-feature AAT are depicted in Appendix Fig. 15, and for the irrelevant-feature AAT in Appendix Fig. 16.

Fig. 13 Decision tree of factors influencing the reliability of the relevant-feature AAT, as computed with category-specific difference scores

Fig. 14 Decision tree of factors influencing the reliability of the irrelevant-feature AAT, as computed with category-specific difference scores

Fig. 15 Decision tree of factors influencing the criterion validity of the relevant-feature AAT, as computed with category-specific difference scores

Fig. 16 Decision tree of factors influencing the criterion validity of the irrelevant-feature AAT, as computed with category-specific difference scores
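The tree-growing procedure itself is documented in Study 4 of the article. As a hedged illustration of the general idea (predicting a psychometric outcome from pre-processing factors across a multiverse of pipelines), the sketch below grows a regression tree with the `rpart` package over a hypothetical results table; `rpart`, the factor names, and the simulated effect sizes are all assumptions for illustration, not the authors' method or findings.

```r
# Hypothetical sketch: a decision tree over a multiverse results table,
# one row per simulated pre-processing pipeline.
library(rpart)

set.seed(4)
n <- 200
outlier_rule <- factor(sample(c("none", "2SD", "3SD", "percentile"), n, TRUE))
error_trials <- factor(sample(c("remove", "retain", "replace"), n, TRUE))
algorithm    <- factor(sample(c("mean", "median", "D-score"), n, TRUE))
reliability  <- .5 + .15 * (algorithm == "D-score") -
                .10 * (outlier_rule == "none") + rnorm(n, 0, .05)
multiverse <- data.frame(outlier_rule, error_trials, algorithm, reliability)

# Regression tree: which pipeline factors best separate reliability levels?
tree <- rpart(reliability ~ outlier_rule + error_trials + algorithm,
              data = multiverse, method = "anova")
plot(tree); text(tree, use.n = TRUE)  # inspect which factors split first
```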
Acknowledgements The authors would like to thank Johannes Klackl, Max Primbs, and Joppe Klein Breteler for their helpful methodological suggestions, and Julia Klier for her help with performing the literature review.

Code availability All analysis scripts can be found in this study's online repository: https://doi.org/10.17605/OSF.IO/YFX2C

Authors' contributions Sercan Kahveci: conceptualization, software, formal analysis, data curation, resources, writing – original draft, writing – review & editing, visualization. Mike Rinck: resources, writing – review & editing. Hannah van Alebeek: resources, writing – review & editing. Jens Blechert: resources, writing – review & editing, supervision.

Funding Open access funding provided by Paris Lodron University of Salzburg. Hannah van Alebeek and Sercan Kahveci were supported by the Doctoral College "Imaging the Mind" (FWF; W1233-B). Hannah van Alebeek was additionally supported by the project "Mapping neural mechanisms of appetitive behaviour" (FWF; KLI762-B). Mike Rinck was supported by the Behavioural Science Institute of Radboud University.

Data availability The datasets generated and/or analyzed in the current study can be found in this study's online repository: https://doi.org/10.17605/OSF.IO/YFX2C

Declarations

Conflicts of interest The authors declare no conflicts of interest.

Ethics approval Not applicable.

Consent to participate Not applicable.

Consent for publication Not applicable.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Open practices statement The data and materials for all experiments are available at https://doi.org/10.17605/OSF.IO/YFX2C and none of the experiments were preregistered.

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
