Title: | Robust Outliers Detection |
---|---|
Description: | Detecting outliers using robust methods, i.e. the Median Absolute Deviation (MAD) for univariate outliers; Leys, Ley, Klein, Bernard, & Licata (2013) <doi:10.1016/j.jesp.2013.03.013> and the Mahalanobis-Minimum Covariance Determinant (MMCD) for multivariate outliers; Leys, C., Klein, O., Dominicy, Y. & Ley, C. (2018) <doi:10.1016/j.jesp.2017.09.011>. The better-known but less robust Mahalanobis distance method is also provided, for comparison purposes only. |
Authors: | Olivier Klein [aut], Marie Delacre [aut, cre] |
Maintainer: | Marie Delacre <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.0.0.3 |
Built: | 2025-02-28 05:17:34 UTC |
Source: | https://github.com/mdelacre/routliers |
The Sense of Coherence was assessed with the SOC-13 (Antonovsky, 1987): 7-point Likert scale (13 items). Anxiety and depression were assessed with the HSCL-25 (Derogatis, Lipman, Rickels, Uhlenhuth & Covi, 1974). Subjects had to indicate on a 4-point Likert scale how much they were bothered or upset by each trouble during the last 14 days (1 = not at all; 2 = a little; 3 = quite a few; 4 = a lot).
data(Attacks)
A data frame with 2077 rows and 46 variables:
age of participants, in years
were participants present in Brussels during the terrorist attacks; 1 = yes; -1 = no
participant gender, 1 = female; -1 = male
Do you have the feeling that you don't really care about what goes on around you?: 1 = Very rarely or rarely; 7 = Often
item1 reversed
Has it happened in the past that you were surprised by the behavior of people you thought you knew very well?: 1 = Never; 7 = Always
item2 reversed
Has it happened that people you counted on disappointed you?: 1 = Never; 7 = Always
sense of coherence, item3 reversed
Until now, your life: 1 = Has had no clear goals or purpose at all; 7 = Has had very clear goals and purpose
Do you have the feeling that you are being treated unfairly?: 1 = Very often; 7 = Very rarely or never
Do you have the feeling that you are in an unfamiliar situation and don't know what to do?: 1 = Very often; 7 = Very rarely or never
Doing the things you do every day is: 1 = A source of pleasure and satisfaction; 7 = A source of deep pain and boredom
item7 reversed
Do you have confused ideas or feelings?: 1 = Very often; 7 = Very rarely or never
Do you ever have inner feelings that you would rather not have?: 1 = Very often; 7 = Very rarely or never
Many people (even those with a strong character) sometimes feel like sad sacks. Have you ever had this feeling in the past?: 1 = Never; 7 = Very often
item10 reversed
When something happens, you generally find that: 1 = You overestimate or underestimate its importance; 7 = You see things in the right proportion
Do you have the feeling that the things you do in your daily life have little meaning?: 1 = Very often; 7 = Very rarely or never
Do you have the feeling that you are not sure you can keep yourself under control?: 1 = Very often; 7 = Very rarely or never
Headache
Trembling
Fatigue or dizziness
Nervousness, inner agitation
Sudden fear for no particular reason
Constantly fearful or anxious
Racing heartbeat
Feeling tense, stressed
Anxiety or panic attack
So restless that it is hard to sit still
Lack of energy, everything goes more slowly than usual
Blames oneself easily
Cries easily
Thinks of killing oneself
Poor appetite
Sleep problems
Feeling of hopelessness when thinking about the future
Discouraged, gloomy
Feeling of loneliness
Loss of sexual interest and desire
Feeling of being trapped or held prisoner
Agitated or worries a lot
No interest in anything
Feeling that everything is tiring
Feeling of being useless
The original items are in French; English translations are given here.
Participants have to answer many questions (in an 11-page survey). For 5 questions (indicated by $$ at the beginning of the question), they are told that there is a correct answer and that they will earn $0.06 if they provide this correct answer. At the beginning of the experiment, they are also told that they will earn a $0.60 bonus if they choose answer E on the last question (whether or not this is the correct answer).
data(Intention)
age
Did participants choose to have a reminder? (1 = yes; 0 = no). Note that in conditions 2 and 4, participants had no choice, and therefore 0 is coded for all subjects in these two conditions
Condition 1 = free-reminder-through-association condition: participants read that they can choose to have (for free) an image of an elephant (presented on screen) that would appear at the bottom of page 11 as a reminder to select answer E; Condition 2 = non condition: no reminders; Condition 3 = costly-reminder-through-association condition: participants read that if they pay $0.03, an image of an elephant (presented on screen) would appear at the bottom of page 11 as a reminder to select answer E; Condition 4 = forced-reminder-through-association condition: participants read that an image of an elephant (presented on screen) would appear at the bottom of page 11 as a reminder to select answer E.
Did participants earn $0.60 bonus? (1 = yes; 0 = no)
No available information
How much was paid for a reminder? ($0.00 or $0.03)
No available information
Earned money for answering E on the last question: $0.00 (if E was not selected) or $0.60 (if E was selected)
Gender; 0 = male; 1 = female
participants id
Earned money at the beginning ($0.06 for all participants)
First question for which participants earn a $0.03 bonus if they provide the correct answer
Second question for which participants earn a $0.03 bonus if they provide the correct answer
Third question for which participants earn a $0.03 bonus if they provide the correct answer
Fourth question for which participants earn a $0.03 bonus if they provide the correct answer
Fifth question for which participants earn a $0.03 bonus if they provide the correct answer
Intention$final_problem minus Intention$fee_for reminder; there are 4 possible outcomes: (1) $-0.03, if a reminder was paid and answer E was not selected on the last question; (2) $0.00, if no reminder was paid and answer E was not selected on the last question; (3) $0.57, if a reminder was paid and answer E was selected on the last question; (4) $0.60, if no reminder was paid and answer E was selected on the last question
equals Intention$Total_Amount_Earned in all but one condition: in condition 1 (free-reminder-through-association condition), Intention$Total_Amount_Earned_if.forced.to.pay.for.cue = Intention$Total_Amount_Earned - 0.03
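The four possible net outcomes listed above follow from simple subtraction. The sketch below enumerates them; the column names `fee_paid`, `bonus` and `net` are hypothetical and chosen only for illustration, they are not the names used in the Intention data frame.

```r
# Hypothetical column names, used only to illustrate the arithmetic
outcomes <- expand.grid(
  fee_paid = c(0, 0.03),  # amount paid for a reminder
  bonus    = c(0, 0.60)   # amount earned for selecting answer E
)

# Net amount = bonus earned on the last question minus the reminder fee
outcomes$net <- outcomes$bonus - outcomes$fee_paid
outcomes
```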
For 6 scenarios, participants have to evaluate the wrongness of actions on a scale ranging from 1 (not ok) to 5 (completely ok). Contributors: Biljana Jokic, Iris Zezelj. OSF link: https://osf.io/8wqvc/
data(Morality)
a data frame with 145 rows and 10 columns
participant id
Is participant English or Serbian?
Is the person in the scenario someone participants know (i.e. colleague, neighbor) ?
A girl pushing another kid off a swing because she really wants to use it before going home
A woman cutting up a national flag into small pieces and using it to clean her house
A man eating his food with his hands, like most of his family members, also in public, after he washes them
A loving man who promised his dying mother that he would visit her grave every week but didn't keep his promise because he was very busy
Two cousins kissing each other passionately on the mouth, in secret, because they are in love
Eating our dog that was hit by a car in front of our house and was killed
average judgment across all scenarios
Detecting univariate outliers using the robust median absolute deviation
outliers_mad(x, b, threshold, na.rm)
Argument | Description |
---|---|
x | vector of values from which we want to compute outliers |
b | constant depending on the assumed distribution underlying the data, that equals 1/Q(0.75). When the normal distribution is assumed, the constant 1.4826 is used (and it makes the MAD and SD of normal distributions comparable). |
threshold | the number of MADs from the median beyond which a value is considered an outlier |
na.rm | set whether missing values should be excluded (na.rm = TRUE) or not (na.rm = FALSE); defaults to TRUE |
Returns Call, median, MAD, limits of acceptable range of values, number of outliers
    #### Run outliers_mad
    x <- runif(150, -100, 100)
    outliers_mad(x, b = 1.4826, threshold = 3, na.rm = TRUE)

    #### Results can be stored in an object.
    data(Intention)
    res1 <- outliers_mad(Intention$age)

    # Moreover, a list of elements can be extracted from the function,
    # such as all the extremely high values,
    # That will be sorted in ascending order

    #### The function should be performed on dimension rather than on isolated items
    data(Attacks)
    SOC <- rowMeans(Attacks[, c("soc1r", "soc2r", "soc3r", "soc4", "soc5", "soc6",
                                "soc7r", "soc8", "soc9", "soc10r", "soc11", "soc12", "soc13")])
    res <- outliers_mad(x = SOC)
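To make the MAD rule concrete, here is a minimal base-R sketch of the computation described by the arguments `b` and `threshold`. The helper `mad_limits()` is a hypothetical name, not a package function, and treating values strictly outside the limits as outliers is an assumption; the package's internal implementation may differ.

```r
# Hypothetical helper illustrating the MAD rule (not part of the package)
mad_limits <- function(x, b = 1.4826, threshold = 3, na.rm = TRUE) {
  if (na.rm) x <- x[!is.na(x)]
  center <- stats::median(x)
  # stats::mad() scales the raw median absolute deviation by `constant`
  spread <- stats::mad(x, center = center, constant = b)
  lower <- center - threshold * spread
  upper <- center + threshold * spread
  list(median = center, MAD = spread, limits = c(lower, upper),
       outliers = x[x < lower | x > upper])
}

# Under normality, b = 1 / Q(0.75), with Q the standard normal quantile function
b_normal <- 1 / stats::qnorm(0.75)

x <- c(2, 3, 3, 4, 4, 4, 5, 5, 6, 30)
res <- mad_limits(x)
res$limits
res$outliers
```

In this toy vector the median is 4 and the scaled MAD is 1.4826, so with `threshold = 3` only the value 30 falls outside the acceptable range.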
Detecting multivariate outliers using the Mahalanobis distance
outliers_mahalanobis(x, alpha, na.rm)
Argument | Description |
---|---|
x | matrix of bivariate values from which we want to compute outliers |
alpha | nominal type I error probability (by default .01) |
na.rm | set whether missing values should be excluded (na.rm = TRUE) or not (na.rm = FALSE); defaults to TRUE |
Returns Call, Max distance, number of outliers
    #### Run outliers_mahalanobis
    data(Attacks)
    SOC <- rowMeans(Attacks[, c("soc1r", "soc2r", "soc3r", "soc4", "soc5", "soc6", "soc7r",
                                "soc8", "soc9", "soc10r", "soc11", "soc12", "soc13")])
    HSC <- rowMeans(Attacks[, 22:46])
    res <- outliers_mahalanobis(x = cbind(SOC, HSC), na.rm = TRUE)

    # A list of elements can be extracted from the function,
    # such as the position of outliers in the dataset
    # and the coordinates of outliers
    res$outliers_pos
    res$outliers_val
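For readers who want to see the classical approach spelled out, the following base-R sketch flags observations whose squared Mahalanobis distance exceeds a chi-square quantile at level 1 - alpha. Using the chi-square cutoff with as many degrees of freedom as there are columns is the conventional choice and is assumed here; it is not taken from the package source.

```r
c1 <- c(1, 4, 3, 6, 5, 2, 1, 3, 2, 4, 7, 3, 6, 3, 4, 6)
c2 <- c(1, 3, 4, 6, 5, 7, 1, 4, 3, 7, 50, 8, 8, 15, 10, 6)
x <- cbind(c1, c2)
alpha <- 0.01

# Squared Mahalanobis distances from the sample mean and covariance
d2 <- stats::mahalanobis(x, center = colMeans(x), cov = stats::cov(x))

# Assumed cutoff: chi-square quantile with ncol(x) degrees of freedom
cutoff <- stats::qchisq(1 - alpha, df = ncol(x))

outliers_pos <- which(d2 > cutoff)  # row positions of flagged observations
outliers_val <- x[outliers_pos, , drop = FALSE]
outliers_pos
```

Because the mean and covariance are themselves pulled toward extreme points, this classical distance can mask outliers, which is why the description above presents it for comparison only.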
Detecting multivariate outliers using the Minimum Covariance Determinant approach
outliers_mcd(x, h, alpha, na.rm)
Argument | Description |
---|---|
x | matrix of bivariate values from which we want to compute outliers |
h | proportion of dataset to use in order to compute sample means and covariances |
alpha | nominal type I error probability (by default .01) |
na.rm | set whether missing values should be excluded (na.rm = TRUE) or not (na.rm = FALSE); defaults to TRUE |
Returns Call, Max distance, number of outliers
    #### Run outliers_mcd
    # The default is to use 75% of the dataset in order to compute sample means and covariances
    # This proportion equals 1 - breakdown point (i.e. h = .75 <--> breakdown point = .25)
    # This breakdown point is recommended by Leys et al. (2018)
    data(Attacks)
    SOC <- rowMeans(Attacks[, c("soc1r", "soc2r", "soc3r", "soc4", "soc5", "soc6", "soc7r",
                                "soc8", "soc9", "soc10r", "soc11", "soc12", "soc13")])
    HSC <- rowMeans(Attacks[, 22:46])
    res <- outliers_mcd(x = cbind(SOC, HSC), h = .75)
    res

    # Moreover, a list of elements can be extracted from the function,
    # such as the position of outliers in the dataset
    # and the coordinates of outliers
    res$outliers_pos
    res$outliers_val
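The idea behind the MCD approach can be sketched with `MASS::cov.mcd()`, which estimates location and scatter from a subset of the observations. This is only an illustration under assumptions: mapping `h` to `quantile.used = floor(h * n)` and using a chi-square cutoff are choices made here, and MASS applies its own reweighting step, so results need not match `outliers_mcd()` exactly.

```r
c1 <- c(1, 4, 3, 6, 5, 2, 1, 3, 2, 4, 7, 3, 6, 3, 4, 6)
c2 <- c(1, 3, 4, 6, 5, 7, 1, 4, 3, 7, 50, 8, 8, 15, 10, 6)
x <- cbind(c1, c2)
h <- 0.75
alpha <- 0.01

set.seed(1)  # the subset search can be random for larger datasets
n_used <- floor(h * nrow(x))  # assumed mapping from h to a subset size
fit <- MASS::cov.mcd(x, quantile.used = n_used)

# Robust squared distances based on the MCD center and covariance
d2_robust <- stats::mahalanobis(x, center = fit$center, cov = fit$cov)
cutoff <- stats::qchisq(1 - alpha, df = ncol(x))

outliers_pos <- which(d2_robust > cutoff)
outliers_pos
```

Since the center and covariance are estimated from the most concentrated part of the data, an extreme point such as the observation with `c2 = 50` cannot inflate the scatter estimate and hide itself.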
plotting data and highlighting univariate outliers detected with the outliers_mad function
plot_outliers_mad(res, x, pos_display = FALSE)
Argument | Description |
---|---|
res | result of the outliers_mad function from which we want to create a plot |
x | data on which the outliers_mad function was performed |
pos_display | set whether the position of outliers in the dataset should be displayed on the graph (pos_display = TRUE) or not (pos_display = FALSE) |
None
    #### Run outliers_mad and perform plot_outliers_mad on the result
    data(Intention)
    res <- outliers_mad(Intention$age)
    plot_outliers_mad(res, x = Intention$age)

    ### when the number of outliers is small, one can display the outliers position in the dataset
    x <- c(rnorm(10), 3)
    res2 <- outliers_mad(x)
    plot_outliers_mad(res2, x, pos_display = TRUE)
plotting data and highlighting multivariate outliers detected with the Mahalanobis distance approach
plot_outliers_mahalanobis(res, x, pos_display = FALSE)
Argument | Description |
---|---|
res | result of the outliers_mahalanobis function from which we want to create a plot |
x | matrix of multivariate values from which we want to compute outliers. Last column of the matrix is considered as the DV in the regression line. |
pos_display | set whether the position of outliers in the dataset should be displayed on the graph (pos_display = TRUE) or not (pos_display = FALSE) |
plotting data and highlighting multivariate outliers detected with the Mahalanobis distance approach. Additionally, the plot displays two regression lines: the first one including all data and the second one including all observations but the detected outliers. This shows how much the detected outliers influence the regression line.
None
    #### Run plot_outliers_mahalanobis
    data(Attacks)
    SOC <- rowMeans(Attacks[, c("soc1r", "soc2r", "soc3r", "soc4", "soc5", "soc6",
                                "soc7r", "soc8", "soc9", "soc10r", "soc11", "soc12", "soc13")])
    HSC <- rowMeans(Attacks[, 22:46])
    res <- outliers_mahalanobis(x = cbind(SOC, HSC))
    plot_outliers_mahalanobis(res, x = cbind(SOC, HSC))

    # it's also possible to display the position of the multivariate outliers on the graph
    # preferably, when the number of multivariate outliers is not too high
    c1 <- c(1, 4, 3, 6, 5, 2, 1, 3, 2, 4, 7, 3, 6, 3, 4, 6)
    c2 <- c(1, 3, 4, 6, 5, 7, 1, 4, 3, 7, 50, 8, 8, 15, 10, 6)
    res2 <- outliers_mahalanobis(x = cbind(c1, c2))
    plot_outliers_mahalanobis(res2, x = cbind(c1, c2), pos_display = TRUE)

    # When no outliers are detected, only one regression line is displayed
    c3 <- c(1, 4, 3, 6, 5)
    c4 <- c(1, 3, 4, 6, 5)
    res3 <- outliers_mahalanobis(x = cbind(c3, c4))
    plot_outliers_mahalanobis(res3, x = cbind(c3, c4))
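The effect of outliers on the regression line can also be inspected numerically. The sketch below, which relies only on base R and not on the package, fits one line on all observations and one on the observations left after removing those flagged by a classical Mahalanobis rule (chi-square cutoff assumed), taking the last column as the DV. The names `iv`, `dv` and `slopes` are illustrative.

```r
c1 <- c(1, 4, 3, 6, 5, 2, 1, 3, 2, 4, 7, 3, 6, 3, 4, 6)
c2 <- c(1, 3, 4, 6, 5, 7, 1, 4, 3, 7, 50, 8, 8, 15, 10, 6)
x <- cbind(c1, c2)

# Flag outliers with the classical Mahalanobis distance (assumed cutoff)
d2 <- stats::mahalanobis(x, center = colMeans(x), cov = stats::cov(x))
is_outlier <- d2 > stats::qchisq(0.99, df = ncol(x))

# Last column is treated as the DV, the first as the predictor
dat <- data.frame(iv = x[, 1], dv = x[, ncol(x)])
fit_all   <- stats::lm(dv ~ iv, data = dat)
fit_clean <- stats::lm(dv ~ iv, data = dat[!is_outlier, ])

slopes <- c(all = unname(stats::coef(fit_all)["iv"]),
            clean = unname(stats::coef(fit_clean)["iv"]))
slopes
```

In this toy example a single extreme observation makes the slope estimated on all data several times steeper than the slope estimated without it.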
plotting data and highlighting multivariate outliers detected with the MCD function. Additionally, the plot displays two regression lines: the first one including all data and the second one including all observations but the detected outliers. This shows how much the detected outliers influence the regression line.
plot_outliers_mcd(res, x, pos_display = FALSE)
Argument | Description |
---|---|
res | result of the outliers_mcd function from which we want to create a plot |
x | matrix of multivariate values from which we want to compute outliers. Last column of the matrix is considered as the DV in the regression line. |
pos_display | set whether the position of outliers in the dataset should be displayed on the graph (pos_display = TRUE) or not (pos_display = FALSE) |
None
    #### Run plot_outliers_mcd
    data(Attacks)
    SOC <- rowMeans(Attacks[, c("soc1r", "soc2r", "soc3r", "soc4", "soc5", "soc6",
                                "soc7r", "soc8", "soc9", "soc10r", "soc11", "soc12", "soc13")])
    HSC <- rowMeans(Attacks[, 22:46])
    res <- outliers_mcd(x = cbind(SOC, HSC), na.rm = TRUE, h = .75)
    plot_outliers_mcd(res, x = cbind(SOC, HSC))

    # it's also possible to display the position of the multivariate outliers on the graph
    # preferably, when the number of multivariate outliers is not too high
    c1 <- c(1, 4, 3, 6, 5, 2, 1, 3, 2, 4, 7, 3, 6, 3, 4, 6)
    c2 <- c(1, 3, 4, 6, 5, 7, 1, 4, 3, 7, 50, 8, 8, 15, 10, 6)
    res2 <- outliers_mcd(x = cbind(c1, c2), na.rm = TRUE)
    plot_outliers_mcd(res2, x = cbind(c1, c2), pos_display = TRUE)

    # When no outliers are detected, only one regression line is displayed
    c3 <- c(1, 2, 3, 1, 4, 3, 5, 5)
    c4 <- c(1, 2, 3, 1, 5, 3, 5, 5)
    res3 <- outliers_mcd(x = cbind(c3, c4), na.rm = TRUE)
    plot_outliers_mcd(res3, x = cbind(c3, c4), pos_display = TRUE)