I think it’s fairly well-established that the more predictors you add to a model, the higher the R² (with ordinary least squares, the in-sample R² can never decrease when a predictor is added). Therefore, it can be tricky to compare dueling latent variables with unequal numbers of indicators. It’s not clear to me, however, that aggregating across the indicators solves this problem. It seems to persist even when a single average value is used to represent all of the latent variable’s indicators.
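To make the first point concrete, here is a minimal sketch using plain NumPy and simulated data. Everything in it (the helper `r_squared`, the sample size, the ten pure-noise predictors) is an assumption made up for illustration. It fits ordinary least squares with a growing set of predictors that have nothing to do with the outcome and records the in-sample R² each time.

```python
import numpy as np

def r_squared(X, y):
    """In-sample R^2 of an OLS fit of y on X (an intercept is added here)."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(0)
n = 200
signal = rng.normal(size=n)             # the one predictor that matters
y = 0.5 * signal + rng.normal(size=n)   # simulated outcome
noise = rng.normal(size=(n, 10))        # predictors unrelated to y

# Refit with 0, 1, ..., 10 noise predictors added to the real one.
r2_by_count = []
for extra in range(noise.shape[1] + 1):
    X = np.column_stack([signal, noise[:, :extra]])
    r2_by_count.append(r_squared(X, y))

for extra, r2 in enumerate(r2_by_count):
    print(f"{extra:2d} noise predictors: R^2 = {r2:.4f}")
```

Because each model nests the previous one, the printed R² values can only stay flat or creep upward, even though the added predictors are pure noise. That is the mechanism that makes a construct with more indicators look stronger than one with fewer.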
An argument against using latent variable regression
This gentleman on ResearchGate suggests that you should not do hierarchical regression with latent variables: a latent variable will typically produce a higher R-squared than a manifest variable, which inflates its apparent explanatory power and makes the conclusions less insightful.
I think an easy workaround would be to take random samples of items and run the test repeatedly, so that each factor is represented by the same number of variables.
A partial explanation of this phenomenon
When one person asks why R² differs between aggregated variables and latent variables with multiple indicators, this is the reply:
Here is the full post, including the original question: https://stats.stackexchange.com/questions/59613/why-does-latent-variable-modelling-in-regression-tend-to-push-r-squared-up
A possible solution:
- Rather than using all indicators, randomly sample k indicators from each latent variable for use in the analysis, where k is the smallest number of indicators belonging to any of the latent variables under study.
- Run the analysis as previously planned
- Run it again 1,000 times, randomly sampling indicators each time
- Take the mean R across all of the regressions
- I’m not sure how to get the p-value of the correlation; I’ll have to work that out separately
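The steps above can be sketched roughly as follows. This is only an illustration under some loud assumptions. It uses ordinary least squares on the sampled indicators as a stand-in for the real latent variable analysis, and it records R² (take the square root if R is what you want). The names `balanced_r2` and `blocks` and the toy data are hypothetical. It does not attempt the p-value question.

```python
import numpy as np

def r_squared(X, y):
    """In-sample R^2 of an OLS fit of y on X (an intercept is added here)."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

def balanced_r2(blocks, y, n_reps=1000, seed=0):
    """Repeatedly sample k indicators per construct and record R^2.

    blocks maps a construct name to an (n_obs, n_indicators) array.
    k is the smallest indicator count across all constructs.
    """
    rng = np.random.default_rng(seed)
    k = min(items.shape[1] for items in blocks.values())
    draws = {name: np.empty(n_reps) for name in blocks}
    for rep in range(n_reps):
        for name, items in blocks.items():
            # Sample k distinct indicator columns for this construct.
            cols = rng.choice(items.shape[1], size=k, replace=False)
            draws[name][rep] = r_squared(items[:, cols], y)
    return k, draws

# Toy data (assumed): construct A has 8 indicators, construct B has 3.
rng = np.random.default_rng(1)
n = 300
factor_a = rng.normal(size=n)
factor_b = rng.normal(size=n)
y = 0.4 * factor_a + 0.4 * factor_b + rng.normal(size=n)
blocks = {
    "A": factor_a[:, None] + rng.normal(size=(n, 8)),
    "B": factor_b[:, None] + rng.normal(size=(n, 3)),
}

k, draws = balanced_r2(blocks, y, n_reps=200)
mean_r2 = {name: vals.mean() for name, vals in draws.items()}
print(k, mean_r2)
```

The function returns every draw rather than only the mean, so the spread of R² across resamples can be inspected as well. Note that a construct with exactly k indicators gets the same indicator set on every draw, so only the larger constructs actually vary.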