Statistical test for my Masterthesis

Emerging · January 18, 2021

Hi all,

I'm currently working on my master's thesis. It's about whether the pandemic has influenced the phishing methods that criminals use. I've got a dataset of phishing emails and I'm currently doing a keyword analysis, counting how many times a certain keyword appears in phishing emails on a given day. I've normalized the count by dividing it by the number of phishing emails on that particular day (since the number of emails per day varies). So the data is now the average count of a given keyword per email, per day.
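To make the normalization concrete, here is a minimal sketch of what I mean. The function name `daily_keyword_rate` and the toy emails are made up for illustration, and it assumes the data is available as (day, email text) pairs; it also does plain substring matching, so "covid" would match "covid19" too.

```python
from collections import defaultdict


def daily_keyword_rate(emails, keyword):
    """Return {day: keyword occurrences that day / number of emails that day}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    needle = keyword.lower()
    for day, text in emails:
        totals[day] += 1
        # Case-insensitive substring count; no word-boundary handling.
        hits[day] += text.lower().count(needle)
    return {day: hits[day] / totals[day] for day in totals}


# Hypothetical toy data: (day, email body) pairs.
emails = [
    ("2020-03-10", "Your invoice is attached"),
    ("2020-03-10", "Covid relief payment: claim your covid grant"),
    ("2020-03-11", "Covid test results available"),
    ("2020-03-11", "Please confirm your bank details"),
]
rates = daily_keyword_rate(emails, "covid")
```

Each day then contributes one value per keyword, which is the unit I would like to test on.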

Now I was wondering how I can test this statistically. I'm thinking about a paired-samples t-test, where I've created two groups: before and after the start of the pandemic. Any suggestions?

Thank you!

Prometheus · January 18, 2021

Some quick thoughts.

Paired? That implies you are able to match individual criminals' pre and post scores. If so, great.

You need to start by looking at the distribution of your data. It's count data that has been normalised - I'll bet you it's not normally (or Student's t) distributed. You may also suffer from an inflated number of zero counts (i.e. some individuals just don't use some words that others do).
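As a rough first look, something like the sketch below would show the share of zero days and how skewed the values are. The helper name `describe_daily_rates` and the numbers are invented for illustration, and the skewness here is the simple moment-based version, so treat it as a screening aid rather than a formal normality test.

```python
import statistics


def describe_daily_rates(rates):
    """Summarise one keyword's daily rates: centre, zero share and skewness."""
    n = len(rates)
    mean = statistics.fmean(rates)
    sd = statistics.pstdev(rates)
    zero_share = sum(1 for r in rates if r == 0) / n
    # Moment-based skewness: about 0 for symmetric data, > 0 for a long right tail.
    skewness = sum((r - mean) ** 3 for r in rates) / (n * sd ** 3) if sd > 0 else 0.0
    return {
        "n": n,
        "mean": mean,
        "median": statistics.median(rates),
        "zero_share": zero_share,
        "skewness": skewness,
    }


# Made-up daily rates for a single keyword.
rates = [0.0, 0.0, 0.0, 0.1, 0.0, 0.2, 0.0, 1.5, 0.0, 0.3]
summary = describe_daily_rates(rates)
```

A large zero share together with a mean well above the median is the kind of pattern that makes a plain t-test doubtful.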

What are the actual outcomes here? Presumably you have many keywords, so it's not a single outcome but a collection. Or do you pool all these results into a single outcome somehow?

Best bet is to consult a statistician who will tsk at you for talking to them after you collected the data. Next best is to delve into the literature and see what other researchers have done - particularly if there's a statistician on the paper, they will likely have encountered many of the problems you will.

Emerging · January 18, 2021

No, what I've got is, per day, the average number of times a certain keyword appears in a phishing email. I want to test whether this number grew after the start of the pandemic (11th March). Something like an interrupted time series analysis.
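To show what I have in mind, here is a minimal sketch of a segmented regression, one common way to set up an interrupted time series. It assumes the model y = b0 + b1*t + b2*post + b3*(t - t0)*post, where post is 1 from the break day t0 onwards. The function names and the data are made up, it only gives point estimates (no standard errors or p-values), and ordinary least squares ignores autocorrelation between days, so it is a starting point and not a finished analysis.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_interrupted_series(y, break_index):
    """Least-squares fit of a level change and slope change at break_index."""
    rows = []
    for t in range(len(y)):
        post = 1.0 if t >= break_index else 0.0
        rows.append([1.0, float(t), post, post * (t - break_index)])
    k = 4
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yt for r, yt in zip(rows, y)) for i in range(k)]
    b0, b1, b2, b3 = solve_linear_system(xtx, xty)
    return {"intercept": b0, "pre_slope": b1, "level_change": b2, "slope_change": b3}


# Made-up, noise-free series: jump of 0.5 and extra slope of 0.02 from day 10.
break_index = 10
y = [
    0.1 + 0.01 * t + (0.5 + 0.02 * (t - break_index) if t >= break_index else 0.0)
    for t in range(20)
]
fit = fit_interrupted_series(y, break_index)
```

Here `level_change` would be the immediate jump at 11th March and `slope_change` the change in trend afterwards.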

Prometheus · January 19, 2021

Ah, you want to statistically compare time series. That sounds tricky, though a quick google shows it might be possible, but I can't help you there. I would ask whether you really want to compare trends over time, which is what you're doing with time series, or whether you only care about comparing the means of the pre and post groups, in which case more standard stats tests will suffice.

Dagl1 · January 19, 2021

A few questions and maybe some suggestions (that you need to verify and check, or use as jumping-off points). I am not a statistician, but I do have experience with statistics. Hopefully I don't say anything blatantly wrong.

So you have k keywords, and their averages on a single day. Do you put the k keywords into a single variable KEYWORD, or do you want to measure whether there is a difference for each of the k keywords? If you want to do k comparisons, please apply some type of multiple comparison correction. Your p-value cut-off (assuming 0.05 is used) only means the chance of a type 1 error is 5% for a single test, so if you compare 100 times, we need to make sure that the chance of at least one false positive doesn't go up, and we do this by reducing the acceptable p-value.
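As an illustration of such a correction, here is a small sketch of the Holm-Bonferroni procedure, which is one common choice (plain Bonferroni, dividing the cut-off by k, is the simplest). The keyword names and p-values are invented, and the family-wise error figure assumes the 100 tests are independent.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The smallest p-value is compared with alpha/m, the next with alpha/(m-1), ...
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject


# Chance of at least one false positive in 100 independent tests at 0.05.
familywise_error = 1 - (1 - 0.05) ** 100

# Invented p-values, one per keyword.
keywords = ["covid", "invoice", "vaccine", "bank"]
p_values = [0.001, 0.04, 0.012, 0.30]
significant = dict(zip(keywords, holm_bonferroni(p_values)))
```

Note that "invoice" with p = 0.04 would pass an uncorrected 0.05 cut-off but does not survive the correction.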

You will have to provide some more information about your data:
Is the variance constant?
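Is the average of your data a function of time? (So can you make a y = aT + b graph? If so, the series is not stationary, and as far as I understand from a video about it, it would need to be differenced or detrended before the ARMA part of the ARIMA I talk about later can be applied.)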
How long are the measurement periods, is seasonality included in your analysis?

Now, depending on these questions and some other assumptions (that I don't know but you will need to test), and on what exactly you want to know, there are different options. You seem to be interested in a change in frequency, so one could of course take the average frequency of a keyword in a similar period before and after the pandemic, and then just test the two groups. Each group will have n datapoints (the number of days in each period), so you can calculate SEM, SD and 95% CI and do regular t-tests. As Prometheus has said, t-tests have assumptions, such as the distribution of your data, and you will have to check each of these assumptions separately. If you do not meet all the assumptions, you can use a non-parametric alternative to the t-test (one whose requirements your data does meet).
With this approach, you measure frequency over almost a year, so seasonality is something to think of. Taking datapoints from the same days in the previous year or years could of course be a way around this seasonality problem.
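One example of such a non-parametric alternative is a permutation test on the difference in means, sketched below. This is only an illustration with made-up numbers and a made-up function name; it assumes the daily values are exchangeable between the two periods, which autocorrelated or seasonal data may well violate, so you would need to check that for your own dataset.

```python
import random
import statistics


def permutation_test(pre, post, n_permutations=10_000, seed=0):
    """Two-sided permutation test for mean(post) - mean(pre)."""
    rng = random.Random(seed)
    observed = statistics.fmean(post) - statistics.fmean(pre)
    pooled = list(pre) + list(post)
    n_pre = len(pre)
    extreme = 0
    for _ in range(n_permutations):
        # Randomly reassign days to the "pre" and "post" groups.
        rng.shuffle(pooled)
        diff = statistics.fmean(pooled[n_pre:]) - statistics.fmean(pooled[:n_pre])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add one to numerator and denominator so the p-value is never exactly 0.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Made-up daily rates for one keyword, before and after the start of the pandemic.
pre = [0.10, 0.12, 0.08, 0.11, 0.09, 0.10, 0.13, 0.07]
post = [0.30, 0.28, 0.35, 0.31, 0.27, 0.33, 0.29, 0.32]
observed, p_value = permutation_test(pre, post)
```

If you run this per keyword, the resulting p-values are exactly the kind that need the multiple comparison correction mentioned earlier.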

If you do want to analyse your data as a time series, there are several things you need to consider:

What exactly do you want to analyze? Whether the time series are 'different' really depends on what you are particularly interested in. Are you interested in the total frequency over time (AUC), in a change in frequencies within the week (a shift from Monday to Friday), or in something else?

From a quick google search, ARIMA seems to be one approach, but I am not entirely sure how to apply that to two datasets. I suppose you would do ARIMA for both and see if they predict similar things, if not then they are different. But if this is the case, then you would still need to find a way to define when the two predictions are 'statistically different'. Maybe someone more into this part of analysis could help out here. https://stats.stackexchange.com/questions/35129/how-to-compare-two-time-series
https://stats.stackexchange.com/questions/19103/how-to-statistically-compare-two-time-series

Another thing I found is a fixed-effects ANOVA. The dataset there seems similar to yours, if you can overlay the data from the same days before and after the pandemic. Note that ANOVAs are generally used for 3 or more groups, which is the case on this website, so there is most likely a t-test-like variant that you want to use: https://stats.stackexchange.com/questions/12902/comparison-of-time-series-sets

Although not that important at the moment, I do wonder whether overlaying March 1 2019 and March 1 2020 is actually the best option, as the days of the week change with such an approach; it may be better to shift the data to match days of the week instead of dates. I am not sure if that is actually better, but I think it may be an interesting thing to note in your discussion.
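A quick sketch of what matching on day of the week could look like: going back exactly 52 weeks (364 days) always lands on the same weekday. The function name is made up and this is just one possible alignment, not necessarily the right one for your data.

```python
from datetime import date, timedelta


def same_weekday_previous_year(day):
    """Return the date 52 weeks (364 days) earlier, which falls on the same weekday."""
    return day - timedelta(weeks=52)


day_2020 = date(2020, 3, 1)
matched_2019 = same_weekday_previous_year(day_2020)
same_calendar_date_2019 = date(2019, 3, 1)

# weekday(): Monday is 0, Sunday is 6.
weekday_matches = matched_2019.weekday() == day_2020.weekday()
calendar_date_matches = same_calendar_date_2019.weekday() == day_2020.weekday()
```

So March 1 2020 would be paired with March 3 2019 instead of March 1 2019, which falls on a different weekday.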

I hope this at least helps you a bit, but it is important to check for yourself what you are and are not allowed to do! A lot of tests that you can perform in SPSS have nice documentation somewhere on the internet that includes the assumptions which need to be met. I like this website a lot: https://statistics.laerd.com/spss-tutorials/linear-regression-using-spss-statistics.php

Kind regards,
Dagl
