Dependent variable bias? Strong correlation between baseline value and change over time.


Hello people,


Using PSPP, I ran some basic linear regressions to examine the following correlation:

I included data from over 100 countries and looked at both baseline values and values 15 years later. I calculated the change over time for both the independent and dependent variable and plotted them in a graph.


Results were as follows:

A weak R-squared value that was nonetheless highly significant (p < 0.0001), with a negative trendline. No confounding was detected from other variables.

I found a "strong" correlation between the baseline value of the dependent variable and its subsequent change over the 15-year follow-up period. The R-squared value was > 0.7, and the correlation was positive: larger changes during follow-up were related to higher baseline values.


My problem is as follows:

Most values of the dependent variable dropped over the 15-year follow-up period. When I added the baseline value of the dependent variable to the model, no noteworthy correlation was left between the independent and dependent variables (p > 0.50).

Would it be correct to assume the negative correlation between the independent and dependent variables was (probably) caused by the strong correlation between the two values of the dependent variable?
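One way to check this intuition is a small simulation (a sketch in Python with made-up numbers, not the actual WHO data): when the change is computed as follow-up minus baseline, the baseline measurement enters the change with a minus sign, so the two are correlated by construction even when the true process does not depend on baseline at all.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a true underlying value per country, plus
# independent measurement noise at baseline and at follow-up.
true = rng.normal(50.0, 10.0, size=120)
baseline = true + rng.normal(0.0, 8.0, size=120)
followup = true + rng.normal(0.0, 8.0, size=120)

# change = follow-up - baseline contains "-baseline", so it is
# automatically correlated with the baseline measurement even
# though nothing in the simulated process depends on baseline.
change = followup - baseline
r = np.corrcoef(baseline, change)[0, 1]
print(round(r, 2))
```

With these (arbitrary) noise levels the correlation comes out substantially negative purely by construction; this is the "mathematical coupling" / regression-to-the-mean effect that can dominate baseline-versus-change analyses.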


Subgroup analyses of the correlation between the independent and dependent variables after 15 years showed the following:

-Decreases in the independent variable were not linked to changes in the dependent variable.

-Increases in the dependent variable were not linked to changes in the independent variable.





Let me check I have this right. The first column is the risk of contracting measles in 2000. How is this calculated? Is it how many people get measles per 1,000? One country had 200, and that sounds like an awful lot. The second column is the percent change in measles vaccination uptake between 2000 and 2015. The third column is the change in the risk of contracting measles between 2000 and 2015. It's a strange way of looking at relationships. Exactly what hypothesis do you want to test? There may be an easier way of looking at the data.



Thank you for taking the time.

That is indeed what the columns mean. Incidence rates are calculated per 1,000 live births. The numbers are probably correct; they are from the WHO website. The test is meant to describe the correlation between changes in vaccination rates and changes in measles incidence (2nd and 3rd columns). The first column is included to show the correlation between baseline measles incidence and incidence changes over time. Given that this correlation is much stronger than the one between vaccinations and measles incidence, I was wondering how justified it is to adjust for baseline incidence when we look at the relation between changes in vaccination and changes in measles incidence during follow-up.


I was thinking that an effect from vaccinations would have to remain relevant when it is added as the second independent variable (next to baseline measles incidence) in a regression model, with changes in measles incidence as the dependent variable.
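A sketch of that two-predictor model (hypothetical numbers, assuming ordinary least squares as in PSPP's linear regression; the variable names and effect sizes here are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100

# Hypothetical stand-ins for the three columns described above.
baseline_incidence = rng.normal(50.0, 15.0, size=n)
vacc_change = rng.normal(5.0, 2.0, size=n)
# Construct an incidence change that depends on both predictors.
incidence_change = (-0.5 * baseline_incidence
                    - 2.0 * vacc_change
                    + rng.normal(0.0, 3.0, size=n))

# OLS with two independent variables: change ~ baseline + vaccination.
X = np.column_stack([np.ones(n), baseline_incidence, vacc_change])
coef, *_ = np.linalg.lstsq(X, incidence_change, rcond=None)
print([round(c, 2) for c in coef])  # intercept, baseline slope, vaccination slope
```

If a vaccination effect truly exists independently of baseline incidence, its coefficient should survive the adjustment, as it does in this constructed example; if it vanishes, the "effect" may have been carried entirely by the baseline term.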

  • 1 year later...


R^2 is not a measure of correlation. It is the proportion of variability in the dependent variable explained by the independent variables. For example, an R^2 of 24% means your regression model explains 24% of the variability in the dependent variable, which means you are missing 76% of the needed information.
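That interpretation can be reproduced with a short sketch (simulated data, not from the thread; the noise level is chosen so the fit explains roughly a quarter of the variance):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, size=200)
y = 1.0 * x + rng.normal(0.0, 1.8, size=200)  # noisy linear relation

# Fit y ~ x and compute R^2: the share of Var(y) explained by the fit.
slope, intercept = np.polyfit(x, y, 1)
resid = y - (slope * x + intercept)
r2 = 1.0 - resid.var() / y.var()
print(round(r2, 2))
```

Here R^2 lands near 0.24, i.e. the regression accounts for about a quarter of the variability in y and the residual noise accounts for the rest.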


Now you can take the square root of R^2 to get Pearson's r, and that is an estimate of the correlation. But keep in mind that correlation is a very specific type of association, one that is linear in nature. You can still have non-linear associations, at which point you would need to add a polynomial term to your model, do a transformation, or use more advanced options.
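A minimal check of that r-versus-R^2 relationship (simulated data; for a one-predictor linear model the identity R^2 = r^2 is exact, and the sign of r comes from the slope):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=150)
y = 2.0 * x + rng.normal(scale=1.0, size=150)

# Pearson's r, and R^2 from the simple regression y ~ x.
r = np.corrcoef(x, y)[0, 1]
slope, intercept = np.polyfit(x, y, 1)
resid = y - (slope * x + intercept)
r2 = 1.0 - resid.var() / y.var()

# With a single predictor, R^2 equals r squared, so sqrt(R^2)
# recovers |r|; the direction must be read off the slope.
print(round(r2 - r**2, 10))
```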


A low R^2 means your model is a poor fit, and a low p-value does not mean you have a working model; it could merely mean that what you have fits the response better than nothing at all. You would want to use ANOVA, AIC, BIC, or some other type of model comparison to see whether your model works better than a mean-only model.
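A sketch of such a comparison against a mean-only model, using one common Gaussian least-squares form of AIC (the error variance is profiled out; constants are dropped, which is fine for comparing models on the same data):

```python
import numpy as np

def aic_gaussian(resid, k):
    # AIC for a least-squares fit with k regression parameters,
    # Gaussian errors, variance profiled out: n*log(RSS/n) + 2*(k+1).
    n = resid.size
    rss = float(resid @ resid)
    return n * np.log(rss / n) + 2 * (k + 1)

rng = np.random.default_rng(4)
x = rng.normal(size=100)
y = 0.8 * x + rng.normal(size=100)

slope, intercept = np.polyfit(x, y, 1)
aic_fit = aic_gaussian(y - (slope * x + intercept), k=2)
aic_mean = aic_gaussian(y - y.mean(), k=1)  # intercept-only model
print(aic_fit < aic_mean)  # lower AIC = preferred model
```

In this constructed example the regression beats the mean-only model; with a genuinely uninformative predictor the comparison would go the other way despite a possibly "significant" p-value.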


If you want to build a regression model, use model-selection steps to find a better-fitting model. If you merely want to report on the correlation between two variables, regardless of whatever else is going on, you can use Pearson's r as a statistical summary of that correlation. A decently fitting regression model would let you explore the correlations in much more depth, but if your model does not meet the assumptions of a linear model and is a poor fit, you will not get reliable information.

Edited by Jeremiahcp
