Statistical analysis of a two-dimensional data set
I work in the computational biology field, and I was recently given this particular dataset:
The data itself is generated by identifying and scoring residue interactions of a protein over time. Briefly, if two residues make contacts within a frame (unit time), then this pair is given a score based on the type and the strength of the contact.
An example data set would look like this:
The problem given by my PI is to come up with a metric to score every residue (not residue pair) based on this dataset. In other words, I'd like a way to score each residue based on the frequency and strength of the contacts it makes.
This problem is very difficult for me because each residue has multiple scores (from multiple contact pairs) within a frame, and across multiple frames. In my very primitive understanding, for every residue there are two dimensions that require statistical analysis: its contacting pairs and time. My training before this has been purely biological, and hence I have little knowledge of statistics. I tried using Z-scores, but they do not seem to work well with two dimensions.
I have attached a portion of the dataset. It contains 25 frames for demonstration purposes. I would like a detailed explanation of which method to use, why that method is appropriate, and how to implement it.
This is my first time posting; I hope the bounty amount is appropriate for such a question.
Thank you very much!
Hello Mathe, thank you so much for your proposed answer. I am mostly trying to understand the theory. One comment on disadvantage one: this is the intention of the scores. Because the score already accounts for contact type and strength, the mode of interaction is already normalized in this dataset. An interaction between i and j and an interaction between i and k should be considered the same.
Because of the above, I'm a bit worried about the equation, specifically that ui and uj are added without coefficients. I agree that ui and uj should be of the same order, because the two residues should contribute equally to the interaction. If I'm understanding this correctly, the error term is key to balancing the equation, especially because ui and uj are the solutions of the equation (which hopefully has only one solution).
I wonder if it makes sense to give ui and uj pair-specific constants (e.g. sigma*ui + lambda*uj), or to use ui * uj with a single pair-specific constant (I understand this may greatly complicate things). Another comment, on the time term: a protein breathes over time, with periods. Would it be possible to model the time term as sigmoidal instead of linear? It's very late here, so I'll think more about this tomorrow and may have more thoughts.
Hi Lhagura, thank you for your comments. Let me address them now.
1. ui and uj are not unknowns to be solved in an algebraic equation, as in "2x + 3y = 5, x + y = 1". Fundamentally, they are coefficients in a linear regression. Understanding linear regression is crucial here.
2. I don't believe it would make sense to add a constant to ui and uj in each equation. Assuming the contribution of residue i is constant across pairs and time frames, a single coefficient ui is enough. For the same reason, it would not be necessary to add a coefficient for ui * uj. Furthermore, I don't think we have enough observations to accurately estimate an interaction effect ui * uj for all pairs.
3. It is possible to model time in a sigmoidal fashion instead of linear. This would require more effort on the fitting procedure, but it is possible. What information do we have on what the sigmoidal function should be?
4. Check whether the estimated coefficients in the proposed regression model make sense. They can be read at the end of the R file.
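To make point 1 concrete, here is a minimal sketch of the idea (in Python rather than the R used in the actual answer, and with made-up residue names and scores): each observation's score is regressed on one dummy column per residue plus a time column, so the fitted coefficient for each residue is its per-residue score u_r.

```python
import numpy as np

# Toy contact records in the shape described in the thread:
# (residue_i, residue_j, frame, score). Names and values are made up.
data = [
    ("A10", "B25", 1, 2.0),
    ("A10", "C40", 1, 1.5),
    ("B25", "C40", 2, 0.5),
    ("A10", "B25", 2, 2.2),
]

residues = sorted({r for i, j, _t, _s in data for r in (i, j)})
idx = {r: k for k, r in enumerate(residues)}

# Design matrix: one dummy column per residue (1 when the residue is a
# member of the pair), plus one column for the frame; no intercept.
X = np.zeros((len(data), len(residues) + 1))
y = np.array([s for _i, _j, _t, s in data])
for row, (i, j, t, _s) in enumerate(data):
    X[row, idx[i]] = 1.0
    X[row, idx[j]] = 1.0
    X[row, -1] = t

# Least-squares fit: each coefficient u_r is a per-residue score.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
u = dict(zip(residues, coef[:-1]))  # per-residue scores
beta_time = coef[-1]                # linear time effect
```

The same design matrix is what a formula like score ~ residue dummies + frame builds behind the scenes in R's lm.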
Regarding 3: I have experimented with a sigmoidal function, and because there are so many variables (287), it would be very difficult to guarantee convergence of a nonlinear regression on the complete dataset. I would suggest instead adding a polynomial of degree 3 in the time frame to improve the fit.
Hi Mathe, thank you for your comments. I ran the code on both the sample data and the whole dataset. What worries me a bit is that the standard errors are of the same order as (and sometimes larger than) the coefficients. If I understand the meaning correctly, that means a positive coefficient has some probability of actually being negative (since the standard error extends into the negative realm).
I gave the time term another thought. I can think of two interpretations in terms of real biological meaning. 1) If the protein is undergoing a state transition, meaning it is moving from one state to another, then a first- or third-order time term makes sense (especially considering the cascade effect of protein movement, a polynomial of degree 3 makes a lot of sense).
That is what the current model is suited for. 2) If the protein is in equilibrium, meaning the protein conformation (and hence the contact profile) follows a normal or binomial distribution, I don't know whether modeling the time term linearly is a good approximation, and I'm unsure it's even necessary to model the time term at all in that case. This particular set captures a state transition, so could you please suggest how to change the time term into a polynomial of degree 3?
Laghura: 1. In this model an estimated coefficient can be statistically zero. This would mean that, on average, a residue makes no contribution to a pair. If this is unrealistic, one could try improving the fit of the model, or adding more observations so the standard errors shrink.
Laghura: 2. Let me adapt the code to account for a polynomial of degree 3. I will edit my initial response with the new code.
Laghura: 3. One could also omit the time variable if this isn't well suited for the dynamics of the situation.
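To illustrate point 1 with made-up numbers (these are not from the actual fit): when a standard error is on the same order as its coefficient, the usual 95% confidence interval crosses zero, which is exactly the situation observed above.

```python
# Hypothetical coefficient and standard error on the same order,
# chosen only to illustrate the concern raised in the thread.
coef = 0.8
stderr = 0.9

t_stat = coef / stderr  # well below the usual ~2 threshold
ci_low, ci_high = coef - 1.96 * stderr, coef + 1.96 * stderr

# The 95% confidence interval crosses zero: the data cannot rule out
# a zero (or even negative) contribution for this residue.
crosses_zero = ci_low < 0 < ci_high
```

More observations (frames) shrink the standard error roughly like 1/sqrt(n), which is why adding frames was suggested as a remedy.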
I just uploaded two versions of the code. v1 has no polynomial of degree 3, but it updates the regression call: it is important to run the regression with 'score ~ . - 1', since this model assumes no intercept (this was overlooked in the previous code). v2 adds the polynomial of degree 3 and also runs the regression with 'score ~ . - 1'. I also updated the description of the two models (with and without the polynomial).
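For readers without access to the R files, here is a rough sketch of the degree-3, no-intercept idea using synthetic data; it is a Python analogue of the R call lm(score ~ . - 1) with frame, frame^2, and frame^3 as predictors (the residue dummy columns are omitted here to keep the sketch short).

```python
import numpy as np

rng = np.random.default_rng(0)
frames = rng.integers(1, 1001, size=200).astype(float)

# Standardizing the frame index before raising it to powers keeps the
# three polynomial columns from being badly collinear.
t = (frames - frames.mean()) / frames.std()

# Degree-3 polynomial in time with no intercept column — the analogue
# of dropping the intercept via 'score ~ . - 1' in R.
X = np.column_stack([t, t**2, t**3])

# Synthetic response, for demonstration only.
y = 0.5 * t - 0.2 * t**3 + rng.normal(scale=0.1, size=200)

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
```

In the full model these three columns would simply be appended to the residue dummy columns of the existing design matrix.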
Hi Mathe, I think the degree-3 polynomial fits much better than the original. I'm actually very satisfied with the current model. I really appreciate your help in building this model!
I am a statistician and I would gladly work on this problem, because it seems interesting. It would involve some back-and-forth discussion between us: for example, how many residues, residue pairs, and frames will there be in the real dataset? This is relevant because it tells us how much information is available for these variables, or sources of information. Given the back-and-forth nature of the work and the expertise it requires, I would also suggest increasing the bounty.
Should we assume the real dataset looks very much like the sample dataset you provided?
Yes, the sample is part of the real dataset. If I remember correctly, the full dataset contains 1000 frames and 311 residues. Regarding the bounty, I will definitely add more to it: for now I'll increase it to $150, and will add to it if the situation demands.
Would you be ok with an implementation in R?
Sorry, I missed the question about residue pairs. The residue pairs shown in the dataset are the ones with non-zero scores. In the script that generates this dataset, every pair (among all 311 residues) is calculated. The current dataset has ~440K unique entries, to give you a sense of the available data. It can be extended further by adding more frames if necessary.
R should be fine. I mainly use Python, but I can read and write R scripts.