Looking for someone to point me in the right direction so I can start researching a solution to the problem below:
Problem
Linear regression with high dimensionality - p=29, n = 5000ish, input variables are generally quite highly correlated
When using the model for prediction, data sets regularly have one or more missing input parameters. At the moment I just refit a least-squares solution from the training data with that input deleted. This seems to lead to quite unstable results. Stability is important for my application, more so than absolute accuracy in some senses.
--
Regularisation (e.g. ridge) feels like it should help, but (and I'm not formally trained in stats) as I understand it, that will reduce the variance of the model fitted with all input variables, and doesn't necessarily achieve anything for model stability when input parameters are deleted.
Thanks in advance.
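One way to examine this is to refit on the reduced column set both with and without a ridge penalty and compare the two fits. A minimal sketch on synthetic data; the correlated inputs, the penalty `lam`, and the helper names `fit_without` and `predict` are all assumptions for illustration, not the poster's data or code:

```python
import numpy as np

def fit_without(X, y, missing, lam=0.0):
    """Refit on the training data with the `missing` columns dropped.

    lam = 0 is ordinary least squares; lam > 0 is ridge on standardised inputs.
    """
    drop = set(missing)
    keep = [j for j in range(X.shape[1]) if j not in drop]
    Xk = X[:, keep]
    mu, sd = Xk.mean(axis=0), Xk.std(axis=0)
    Z = (Xk - mu) / sd                      # standardise so the penalty is scale-free
    y0 = y.mean()
    beta = np.linalg.solve(Z.T @ Z + lam * np.eye(len(keep)), Z.T @ (y - y0))
    return keep, mu, sd, y0, beta

def predict(model, Xnew):
    keep, mu, sd, y0, beta = model
    return y0 + ((Xnew[:, keep] - mu) / sd) @ beta

# Synthetic stand-in: 29 inputs that all share one latent factor (highly correlated).
rng = np.random.default_rng(0)
n, p = 5000, 29
latent = rng.normal(size=(n, 1))
X = latent + 0.1 * rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

ols = fit_without(X, y, missing=[3], lam=0.0)
ridge = fit_without(X, y, missing=[3], lam=50.0)
print(np.linalg.norm(ols[4]), np.linalg.norm(ridge[4]))
```

Repeating this for several choices of `missing` and comparing the predictions of the reduced models against the full model gives a direct measure of how much (if at all) the penalty stabilises things.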
Now my buddy and I are both young, good-looking guys who are successful in our careers and have had success wooing women, but my buddy has had more success with securing long-term girlfriends and not much with short-term flings, while I have had more success with short-term flings but not long-term relationships, since my personality tends to repel women within a few weeks (I am highly disagreeable and argumentative, and frankly lacking in empathy or shame, which I am trying to work on). This made me question how I ever managed to get women to tolerate me at all, since I have talked to other young men who have told me about their difficulty in attracting women, and I couldn't find any great faults in their personalities which I don't have ten-fold in severity. In addition, my own experience with couples that I know, and those I don't know whom I've seen holding hands walking down the street, has given me an intuitive belief that most of the variation in who partners up with whom is explained by physical attributes rather than personality.
But how can we test this hypothesis in a statistically robust manner? I thought about the website that Mark Zuckerberg made when he was at Harvard, Facemash I think? AFAIK Zuckerberg scraped the pictures of female students from a publicly available Harvard webpage, and then hosted those photos on the Facemash website. If you visited the website, two photos of different female students would appear, and you would select the photo which you found more attractive, making the whole thing a sort of contest. Something akin to an ELO score, or a Glicko score, could then be updated iteratively after each contest, and ostensibly after many such contests the most attractive women would have the highest ELO scores, and the least attractive women would have the lowest scores; the initial prior assumption is that the ELO score for each woman is equal.
I thought that this scheme, done in a more ethical manner with permission of course, might be an excellent way to figure out how important personality really is in determining attractiveness for men. Note that if we compute ELO scores based purely upon photographs, then that ELO score should reflect physical attractiveness exclusively.
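The iterative update described above can be sketched as follows. This is a minimal illustration of a standard Elo update, assuming the conventional 400-point scale and a K-factor of 32 (both are my assumptions, as are the placeholder labels `A`, `B`, `C`); every rating starts equal, as in the prior mentioned above:

```python
def expected_score(r_a, r_b):
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(r_winner, r_loser, k=32.0):
    """Return the new (winner, loser) ratings after one pairwise contest."""
    delta = k * (1.0 - expected_score(r_winner, r_loser))
    return r_winner + delta, r_loser - delta

# Everyone starts with the same rating.
ratings = {"A": 1500.0, "B": 1500.0, "C": 1500.0}

# Each tuple is (preferred photo, other photo) from one contest.
contests = [("A", "B"), ("A", "C"), ("B", "C")]
for winner, loser in contests:
    ratings[winner], ratings[loser] = update(ratings[winner], ratings[loser])

print(ratings)
```

Each contest moves the two ratings by equal and opposite amounts, so the total rating in the pool stays constant.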
Now we can expect that the rating deviation for the P-ELO scores for the men will be high, since 100 women making 12 phone calls each will result in an average of 100*12/100 = 12 phone calls for each man, but that may be enough information to make statistically significant inferences given how we will proceed. We proceed as follows:
Now we can compare the rankings of each man according to their ELO scores (based purely on the physical), P-ELO scores (based purely on personality), and O-ELO scores (based on in person interaction). If we find that the rankings are more consistent between the P-ELO and O-ELO rankings, then that would suggest that personality is more important than the physical in determining attractiveness. However if we find more consistency between the ELO and O-ELO rankings, then that would suggest that the physical stuff is more important than personality in determining attractiveness.
For example, say I am ranked #25 in ELO, meaning there are 24 men with a higher ELO than myself, and that my buddy is ranked #26. However, because of my unpleasant personality, my P-ELO ranking is #75, while my buddy's P-ELO ranking is #26 since he's a "nice" guy. Now if personality is more important than the physical, we would expect my buddy's O-ELO ranking to be higher than mine. What we would actually find is the part that really interests me.
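One way to put a number on "more consistent" is a rank correlation between each pair of rankings. The sketch below uses Spearman's coefficient for tie-free rankings; the choice of Spearman and the five-person rankings are assumptions made purely for illustration:

```python
def spearman(rank_a, rank_b):
    """Spearman rank correlation for two tie-free rankings of the same items."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Hypothetical rankings of the same five men (1 = best).
elo_rank = [1, 2, 3, 4, 5]      # photos only
p_elo_rank = [5, 4, 3, 2, 1]    # personality only
o_elo_rank = [1, 3, 2, 4, 5]    # in-person interaction

print(spearman(elo_rank, o_elo_rank))    # agreement of photo ranking with in-person ranking
print(spearman(p_elo_rank, o_elo_rank))  # agreement of personality ranking with in-person ranking
```

Whichever of the two coefficients is larger indicates which ranking tracks the in-person ranking more closely; with only about 12 contests per man, the uncertainty in each ranking would need to be taken into account before reading much into the difference.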
I am looking for advice and comments on the statistical validity of analyzing this based on ELO scores, and if I am missing any confounding effects. Thanks
]]>also called Benford's law.
If you realize the potential of this like I do please comment or argue if you like.
]]>Let's assume I have time-history data from a pressure probe placed in a duct where a gas flows. Of course, by using a Fourier transform (DFT or FFT) I can resolve the significant frequencies occurring in the duct. If I construct an attractor from these data, can this transformation provide additional information compared to the DFT/FFT? Does attractor visualization and its post-processing offer "some better point of view"?
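To make the comparison concrete, here is a minimal sketch of both views of the same signal: the FFT spectrum and an attractor reconstructed by time-delay embedding. The synthetic 50 Hz signal, the sampling rate, and the embedding parameters `dim` and `tau` are assumptions for illustration only:

```python
import numpy as np

def delay_embed(x, dim, tau):
    """Time-delay embedding: row i is (x[i], x[i+tau], ..., x[i+(dim-1)*tau])."""
    x = np.asarray(x, dtype=float)
    n = len(x) - (dim - 1) * tau
    return np.column_stack([x[i * tau : i * tau + n] for i in range(dim)])

# Synthetic stand-in for the pressure record: a pure 50 Hz tone sampled at 1 kHz.
fs = 1000.0
t = np.arange(1000) / fs
pressure = np.sin(2 * np.pi * 50.0 * t)

# Frequency view: location of the largest spectral peak.
spectrum = np.abs(np.fft.rfft(pressure))
freqs = np.fft.rfftfreq(len(pressure), d=1.0 / fs)
dominant = freqs[np.argmax(spectrum)]

# Phase-space view: each row of `points` is one point on the reconstructed attractor.
points = delay_embed(pressure, dim=3, tau=5)
print(dominant, points.shape)
```

For a purely periodic signal the embedded points trace a closed loop and add little to the spectrum; the phase-space picture becomes more interesting when the dynamics are nonlinear or chaotic, where the spectrum is broadband but the attractor may still show structure.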
]]>I'm a college sophomore in an applied math BS program with a statistics concentration at a state university. I originally came from a computer science background, but I decided to pursue this after really enjoying my AP Stats class my senior year of high school. I'm currently taking my first mathematical statistics class and have a job working in R with a professor.
Even though I still enjoy the subject in the classroom, I'm not 100% informed on what's out there in terms of career options (leaning towards data science right now).
While still in college, how should one prepare for the best chance at a statistics job? Is going on to graduate school recommended in the field?
Thanks for your time!!
]]>Thanks for the answer.
]]>
I'm Koen Van Loon from Belgium and there is something that has been on my mind for quite some time now, but I'm not smart enough/don't have the proper education to solve it myself, so I thought you guys might be able to help.
I was thinking about evolution and came up with an experiment which I would like to know the answer to.
Let's say we have a highly sophisticated robot, either biological or mechanical (like those things from Boston Dynamics). If we randomly change this robot, the chance that it becomes worse is naturally higher than the chance that it is improved by the change.
This is the experiment:
We have a robot which needs to complete a 'survival' obstacle course. It needs to jump over things, climb ladders, dodge objects, etc. When it survives the course, there is a machine at the end which it needs to step into. The machine scans all its properties, destroys the robot, and creates 5 new ones. The new robots (or children) each differ a little bit from the 'parent' robot. The change is random.
These new robots will now also attempt the course, and if they survive the same will happen to them.
The original robot starts with a 'skill level' of 10 points. A robot is able to survive the course if its skill level is above 0 points (1 is the minimum). Otherwise it won't survive and won't be able to make 'children'.
The children differ randomly from their parent. There is a 99% chance that the random change decreases the skill level by 1 point and a 1% chance that it increases it by 1 point.
Question: What is the chance that these robots still exist over 100 000 generations?
Thanks in advance.
]]>
Not all equalities are the same. There are at least these variants:
Definition (or substituting symbol or replacing symbol, etc.)
Example: fine structure constant,
\[\alpha\overset{{\scriptstyle \textrm{def}}}{=}\frac{e^{2}}{4\pi\epsilon_{0}\hbar c}\]
Identity (or universal equivalence under known or specified set of previous rules that may be algebraic, geometric, etc.)
Example: relation between sine and cosine,
\[\sin^{2}\theta+\cos^{2}\theta\equiv1\]
Equation (proposed equivalence considered in relation to the finding of "solvers" or "solutions")
Example: Pythagorean Theorem assumed true, solve for c_{1}, given c_{2} = 3 and h = 5,
\[5^{2}=\left(c_{1}\right)^{2}+3^{2}\]
Formula (proposed equivalence under non-universal, somehow hidden, or not necessarily specified assumptions)
Example: Pythagorean Theorem,
\[h^{2}\overset{\cdot}{=}\left(c_{1}\right)^{2}+\left(c_{2}\right)^{2}\]
Some formulas can become equations once you give values to terms or supply additional information. And yes, sometimes the distinction between what is a formula and what is an identity can be blurry, depending on how we look at the defining "valid rules." E.g., the relation between sine and cosine can be seen as an identity if we take Euler's identities,
\[\sin\theta\equiv\frac{e^{i\theta}-e^{-i\theta}}{2i}\]
\[\cos\theta\equiv\frac{e^{i\theta}+e^{-i\theta}}{2}\]
as the point of departure; or a formula, if we adopt Pythagoras' theorem plus geometrical definitions of sine and cosine. It may also depend on assumptions about curvature of space, etc.
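Taking the two exponential expressions for sine and cosine as given, the step to the identity can be written out explicitly (a short worked check, nothing beyond expanding the squares):
\[\sin^{2}\theta+\cos^{2}\theta\equiv-\frac{\left(e^{i\theta}-e^{-i\theta}\right)^{2}}{4}+\frac{\left(e^{i\theta}+e^{-i\theta}\right)^{2}}{4}\]
\[\equiv\frac{-\left(e^{2i\theta}-2+e^{-2i\theta}\right)+\left(e^{2i\theta}+2+e^{-2i\theta}\right)}{4}\equiv\frac{4}{4}\equiv1\]
No geometric assumption enters this chain, which is why it reads as an identity from this point of departure.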
Needless to say, most people who use mathematics on a regular basis don't need to be reminded of these distinctions, because they intuitively know what they're about. The danger is when people start playing with equalities (especially definitions, as I've seen) thinking they have a different value than they really do. Also needless to say, but better said, the symbols for eq., id., form., and def. are not intended for general use, but just to illustrate how confusing all this proves to be to many people.
]]>At some point during my studies I had to read and subsequently discuss a paper. I remember it as one of the more difficult papers I had ever read, although after the discussion I understood everything. I recently looked back at the paper and now see that I do not understand it anymore (or maybe never did). https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5474757/ My question is specifically about the false discovery rate metric used here (see text and picture at the end of this post). For context, I will briefly explain what they do in this paper before asking my question:
In this paper the researchers introduced random mutations into cells (on average each cell would have a single mutation) with the aim of identifying genes involved in oxidative phosphorylation (a process that can utilise both galactose and glucose). Galactose is only used in oxidative phosphorylation, while cells can use other pathways to create energy from glucose. Thus, by letting mutated cells replicate a few times and then splitting them up into glucose- or galactose-containing medium, they can identify mutations that only target oxidative phosphorylation (as cells with these mutations will remain alive in the glucose-containing medium but die in the galactose-containing medium).
What I am wondering about is their usage of false discovery rate, as they obtain genes 'with a given false discovery rate'; this implies to me that they are talking about local false discovery rates, in contrast to the false discovery rate of a given experiment (I know it as the rate of false discoveries versus the total amount of hypotheses tested). From what I understand (from this paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC520755/), local false discovery rate is the chance that a given gene is a false positive.
I am however not sure if that is the case or that my mind has made up this interpretation to let it all make sense. Below is an excerpt of the original paper and the specific passages that discuss the false discovery rate. If anyone could correct me, verify that my interpretation is right, or supply general comments, that would be greatly appreciated!
-Dagl
Stay safe and healthy!
We also observed this expected enrichment of positive controls at the gene level, and the known OXPHOS disease genes scored significantly better in galactose compared with glucose, as measured by the false discovery rate (FDR) (Figure 1E). Moreover, the fraction of OXPHOS disease genes below 10% FDR in galactose was enriched 39-fold compared with the background of all genes.
We used the MAGeCK scores to define a set of genes that were specifically necessary for survival in galactose relative to glucose. We filtered out 92 genes with an FDR below 30% in glucose medium because these likely represent broadly lethal genes that cause non-specific cell death. We then identified hit genes enriched for lethality in galactose at three FDR thresholds: 191 “high confidence” hits at 10%, an additional 48 hits between 10% and 20%, and 61 hits between 20% and 30% (Figures 1F and 2).
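The excerpt does not say how the per-gene FDR values are computed. The sketch below assumes they behave like Benjamini-Hochberg adjusted p-values (often called q-values); that is an assumption on my part, and the p-values are made up. Under that reading, "hits at 10% FDR" means the set of genes whose adjusted value is at most 0.10, and the 10% describes the expected share of false positives within that set rather than a per-gene probability:

```python
import numpy as np

def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the original order."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)          # p_(i) * m / i
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]   # enforce monotonicity
    q = np.empty(m)
    q[order] = np.minimum(ranked, 1.0)
    return q

# Made-up p-values for six hypothetical genes.
p = [0.001, 0.008, 0.02, 0.04, 0.3, 0.9]
q = bh_qvalues(p)

for threshold in (0.10, 0.20, 0.30):
    print(threshold, int(np.sum(q <= threshold)), "genes")
```

A local false discovery rate, by contrast, is a statement about each individual gene and needs an explicit model of the null and alternative score distributions.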
]]>
I was reading an article about the Monte Carlo method. The link to the article is:
http://ib.berkeley.edu/labs/slatkin/eriq/classes/guest_lect/mc_lecture_notes.pdf
I found the equation in the following attached image.
1) In the equation in the attached image, we are assigning a value to a function. What does this mean?
2) What will the expectation function do with this value?
3) Please guide me by providing examples of the functions f_X(x), E(g(X)) and g(x).
Zulfi.
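The image is not reproduced here, so the following assumes the equation is the usual definition E(g(X)) = ∫ g(x) f_X(x) dx. As a concrete example (my own choice, not taken from the lecture notes), let X be uniform on [0, 1], so f_X(x) = 1 on that interval, and let g(x) = x². The exact expectation is 1/3, and the Monte Carlo estimate is simply the average of g over random draws of X:

```python
import numpy as np

def g(x):
    """The function whose expectation we want."""
    return x ** 2

rng = np.random.default_rng(0)
samples = rng.uniform(0.0, 1.0, size=200_000)   # draws of X with density f_X(x) = 1 on [0, 1]

estimate = g(samples).mean()                    # Monte Carlo estimate of E(g(X))
exact = 1.0 / 3.0                               # integral of x^2 * 1 over [0, 1]
print(estimate, exact)
```

Here f_X describes how X is distributed, g is applied to each draw, and E(g(X)) is the single number that the sample average approaches as the number of draws grows.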
]]>Specifically - by using regularisation when fitting a very high order polynomial, intuitively the regularisation part of the objective function should kill off some of the higher order terms.
However, since the penalty is on the L2 norm of the weights, it will penalise larger weights preferentially, and at first glance this would appear to penalise the lower order terms first.
Since the data should be normalised this is not a problem in reality (i.e. the higher order terms in the data matrix are smaller, so their weights are larger). But in that case my intuition would be that the weight penalty reduces all weights at roughly similar rates, i.e. unrelated to the exponent.
Hope that roughly makes sense ... any thoughts?
Thanks in advance.
Ed
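A quick way to check this intuition is to fit the same polynomial with and without the penalty and look at the weights degree by degree. A minimal sketch; the target function, the degree, the noise level and the penalty `lam` are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=200)
y = np.sin(3.0 * x) + 0.1 * rng.normal(size=200)

degree = 8
P = np.column_stack([x ** k for k in range(1, degree + 1)])  # columns x^1 ... x^8
Z = (P - P.mean(axis=0)) / P.std(axis=0)                     # normalise each column
yc = y - y.mean()

def ridge(Z, yc, lam):
    """Closed-form ridge weights; lam = 0 gives ordinary least squares."""
    return np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ yc)

w_ols = ridge(Z, yc, 0.0)
w_ridge = ridge(Z, yc, 1.0)

for k, (a, b) in enumerate(zip(w_ols, w_ridge), start=1):
    print(f"x^{k}: OLS {a:+.3f}   ridge {b:+.3f}")
```

Because the powers of x are strongly correlated with each other, the shrinkage acts along the principal directions of the normalised data matrix rather than on each exponent separately, so the printed ratios need not line up neatly with the degree.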
Thank you for reading this question. I am trying to understand dynamic programming in order to find a solution to an optimization problem. I understand optimization through linear programming, but it is not enough to solve my problem.
What I would like to know is how dynamic programming deals with the optimization of the process as a whole. I mention LP because I can see how LP works (objective and constraints), and maybe DP has similarities.
Maybe the answer to my question is to study DP more, but any advice would be good.
I feel like a loop should be closed as in LP, but I do not see the concept or concepts which ensure that.
I'll appreciate any comment, thank you.
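In DP, the piece that "closes the loop" is the recursion itself: the best value for the whole problem is defined in terms of the best values of smaller subproblems, so optimising every subproblem once yields the overall optimum. The 0/1 knapsack problem is a standard small example; it is used here only as an illustration, since the actual problem is not described:

```python
def knapsack(values, weights, capacity):
    """Maximum total value of items that fit within `capacity` (each item used at most once)."""
    # best[c] = best value achievable with capacity c using the items seen so far
    best = [0] * (capacity + 1)
    for value, weight in zip(values, weights):
        # Go downwards so each item is counted at most once.
        for c in range(capacity, weight - 1, -1):
            # Either skip the item, or take it and use the best solution
            # for the remaining capacity (the already-solved subproblem).
            best[c] = max(best[c], best[c - weight] + value)
    return best[capacity]

print(knapsack(values=[60, 100, 120], weights=[10, 20, 30], capacity=50))
```

The objective is the quantity stored in `best`, and the constraints are encoded in which subproblems each step is allowed to refer to, which is the rough counterpart of the objective and constraints in an LP.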
]]>I have attached both question and solution
I can see the application of Fourier's law on the right-hand side of the solution, but why is the right-hand side multiplied by area? Also, on the left, what has happened to Ko?
Thanks
]]>I am part of a research project on the intuitive understanding of probabilities in contour plots, and we are seeking participants for a short online experiment.
Participation is completely anonymous and will take less than 15 minutes of your time. Just use the browser on your laptop or desktop PC (no mobile phones!) to go to: https://survey.fl.dlr.de/index.php/627715?lang=en
With your participation in the experiment you directly contribute to current basic research of the Friedrich-Schiller-University Jena and the German Aerospace Center. We would be very happy if you could support our work.
Thank you!
]]>Hope someone can help, and please forgive my ignorance on the subject.
-Dagl
]]>Graphically, projecting normals from the centres of the lines A-A'', B-B'', C-C'' (also p-p'') gives me the centre of the single rotation I seek, at the point r where they cross.
Question: how do I do that mathematically, given only p = (7.160299318411282, 0), rotation1 = -360/21° and rotation2 = 18°?
Caveat: The above procedure does not work for all combinations of two rotations. E.g., in the next image p = (50, 0) and the rotations are -45° and +45°, which results in the normals to the bisectors all being parallel!
I know that affine transformations using homogeneous coordinates can be composed [https://en.wikipedia.org/wiki/Transformation_matrix#Composing_and_inverting_transformations], but I am stuck on how to utilise that here, as in the environment in which I am doing this (Lua embedded in an FEA package) I only have two mechanisms available: rotation about a point and translation in the XY plane.
Question2: Assuming that I get a solution to Q1 above, is my only option to deal with the Caveat case, to compare the angles of rotation and do something different if they are equal?
Thanks.
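A sketch of the calculation, treating points as complex numbers. The images are not available here, so the two rotation centres are assumptions: the example takes the first rotation about the origin and the second about p. The code is Python for illustration only; the arithmetic carries over directly to Lua:

```python
import cmath
import math

def compose_rotations(c1, a1_deg, c2, a2_deg, tol=1e-12):
    """Compose a rotation by a1 about c1 followed by a rotation by a2 about c2.

    Returns ("rotation", centre, angle_deg), or ("translation", offset) when
    the two angles cancel and no single centre exists.
    """
    r1 = cmath.exp(1j * math.radians(a1_deg))
    r2 = cmath.exp(1j * math.radians(a2_deg))
    # z -> c2 + r2 * ((c1 + r1 * (z - c1)) - c2)  ==  rot * z + t
    rot = r2 * r1
    t = c2 + r2 * (c1 - r1 * c1 - c2)
    if abs(1 - rot) < tol:
        return ("translation", t)
    centre = t / (1 - rot)              # the fixed point of z -> rot * z + t
    return ("rotation", centre, math.degrees(cmath.phase(rot)))

p = complex(7.160299318411282, 0.0)
print(compose_rotations(0j, -360.0 / 21.0, p, 18.0))
print(compose_rotations(0j, -45.0, complex(50.0, 0.0), 45.0))
```

The second call covers the caveat: when the angles sum to zero (or a multiple of 360°), the composite is a pure translation, so that case does need its own branch, and the translation mechanism can be used for it.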
]]>
Can anyone clarify for me please.
]]>
I saw this article on Cha Cha that said it was about nine millimeters, but that CAN'T be right!
How thick is a sheet of printer paper?
]]>I would like to know whether or not there is a statistic that can differentiate between the case at the top left and the case at the top right. Clearly R^{2} does not do so. One could plot the residuals, and the non-random distribution sometimes becomes apparent. However, what I was hoping to find is some number, preferably one that would be calculated by a statistics program, that could be compared in the two situations. I am reading Motulsky's book Intuitive Biostatistics (that is where I first saw the Anscombe quartet), but I have not found anything in his book yet. I am presently using ProStat, which has both a calculation of COD (which I am pretty sure is R^{2}), as well as a calculation of "Corrl" which is said by the user manual to indicate "how closely the two variables approximate a linear relationship to each other." I note the presence of squared differences in the numerator of COD, which are not found in Corrl.
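One number that does separate the two cases is the improvement in R^{2} when a quadratic term is added to the straight-line fit. The sketch below assumes the top-left and top-right panels are Anscombe's sets I and II (the image is not available here), and choosing the quadratic term as the lack-of-fit check is my own suggestion:

```python
import numpy as np

# Anscombe's quartet, sets I and II (they share the same x values).
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

def r_squared(x, y, degree):
    """R^2 of a least-squares polynomial fit of the given degree."""
    residuals = y - np.polyval(np.polyfit(x, y, degree), x)
    return 1.0 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

for name, y in (("set I", y1), ("set II", y2)):
    lin, quad = r_squared(x, y, 1), r_squared(x, y, 2)
    print(f"{name}: linear R^2 = {lin:.3f}, quadratic R^2 = {quad:.3f}, gain = {quad - lin:.3f}")
```

The linear R^{2} values are nearly identical, but the quadratic term barely helps for set I while it explains almost all of the remaining variation for set II. A formal version of the same comparison is an F-test on the added term, which most statistics packages report.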
]]>
Joe has investments in Company A, Company B, and Company C.
Joe is fated to earn $25.00 from Company A within 2 days from now.
Joe is fated to earn $45.00 from Company B within 3 days from now.
Joe is fated to earn $100.00 from Company C within 5 days from now.
Joe is fated to earn no more than $26.00 from Company C and Company B on day 1 (1 day from now).
Joe is fated to earn at least $14.00 from Company A and Company C on day 2 (2 days from now).
Joe has to earn twice as much money on the first day as on the second day from Companies A, B, and C, and twice as much on the second day as on the third day. This can be expressed algebraically as Joe earning x money on day 3 (3 days from now), 2x money on day 2 (2 days from now), and 4x money on day 1 (1 day from now).
Joe can earn whatever amount of money (that satisfies the other conditions) from Companies A, B, and C on day 4 and day 5 (4 and 5 days from now).
What is the lowest amount of money Joe can earn on day 1 (1 day from now) from Companies A, B, and C? Explain your reasoning.
P.S. How come there doesn't seem to be good formulas to use for this question?
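One way to formalise the conditions is as a small linear program, from which a bound follows by hand. The reading below is an assumption: "within k days" is taken to mean the whole amount arrives on days 1 to k, the $26.00 and $14.00 conditions apply to the two companies combined, and all daily amounts are non-negative. Let a_d, b_d, c_d be the earnings from Companies A, B, C on day d, so that
\[a_{1}+a_{2}=25,\qquad b_{1}+b_{2}+b_{3}=45,\qquad c_{1}+\dots+c_{5}=100\]
\[a_{1}+b_{1}+c_{1}=4x,\qquad a_{2}+b_{2}+c_{2}=2x,\qquad b_{3}+c_{3}=x\]
From the day 1 cap b_{1}+c_{1}\le26 and then the day 2 floor a_{2}+c_{2}\ge14,
\[a_{1}\ge4x-26\;\Rightarrow\;a_{2}=25-a_{1}\le51-4x\;\Rightarrow\;c_{2}\ge14-a_{2}\ge4x-37\]
Adding the three daily totals gives 7x=25+45+(c_{1}+c_{2}+c_{3}), so
\[c_{2}\le c_{1}+c_{2}+c_{3}=7x-70\;\Rightarrow\;4x-37\le7x-70\;\Rightarrow\;x\ge11\]
Under these assumptions day 1 is therefore at least 4x = 44, and the bound is attained by, for example,
\[\left(a_{1},a_{2}\right)=\left(18,7\right),\qquad\left(b_{1},b_{2},b_{3}\right)=\left(26,8,11\right),\qquad\left(c_{1},c_{2},c_{3}\right)=\left(0,7,0\right)\]
with the remaining 93 from Company C arriving on days 4 and 5. There is no single closed formula because the answer depends on which inequalities end up binding, which is exactly what linear programming is designed to determine.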
]]>]]>
Using PSPP, I was doing some basic linear regressions. Examining the following correlation:
I included data from over 100 countries and looked at both baseline values and values 15 years later, calculating the differences in value for both the independent and dependent variable and plotting them in a graph.
Results were as follows:
A weak R-squared value which was nonetheless highly significant (P < 0.0001), with a negative trendline. No confounding was detected from other variables.
I found a "strong" correlation between baseline values of the dependent variable and its subsequent changes in value during the 15-year follow-up period. The R-squared value was > 0.7. The correlation was positive: higher changes during follow-up were related to higher baseline values.
My problem is as follows:
Most values for the dependent variable dropped over the 15-year follow-up period. When I added baseline values for the dependent variable to the model, there was no noteworthy correlation left between the independent and dependent variable (P > 0.50).
Would it be correct to assume the negative correlation between the independent and dependent variable was (probably) caused by the strong correlation between the 2 values for the dependent variable?
Subgroup analyses of the correlation between the independent- and dependent variable after 15 years showed the following:
-Decreases in values for the independent variable were not linked to changes of the dependent variable.
-Increases in values for the dependent variable were not linked to changes of the independent variable.
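The pattern described above, an association that disappears once the baseline is in the model, can be reproduced with a small simulation in which the baseline drives both changes and the independent variable has no direct effect. This is a sketch on synthetic data with made-up coefficients and a deliberately large sample; it shows that such a mechanism is possible, not that it is what happened in the real data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000                                      # large synthetic sample for a stable illustration
baseline = rng.normal(size=n)                 # baseline value of the dependent variable
dx = -0.5 * baseline + rng.normal(size=n)     # change in the independent variable
dy = 0.8 * baseline + 0.5 * rng.normal(size=n)  # change in the dependent variable (no dx effect)

def ols(columns, y):
    """Least-squares coefficients: intercept first, then one per column."""
    A = np.column_stack([np.ones(len(y))] + list(columns))
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

simple = ols([dx], dy)               # change regressed on change only
adjusted = ols([dx, baseline], dy)   # the same model with the baseline added

print("slope on dx without baseline:", simple[1])
print("slope on dx with baseline:   ", adjusted[1])
```

In the simulated data the unadjusted slope is clearly negative even though the change in the independent variable has no effect of its own, and it shrinks to roughly zero once the baseline is included.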
]]>