# Statistics involving a numerical variable and a nonnumerical variable (date)

For example, say that I am looking at the bloom date for yellow trout lilies in comparison to how far away they are from open water.

My data would look like this:

| The distance the flower is from open water (m) | Date flower completely blooms |
|---|---|
| 0 | 03/04/2021 |
| 5 | 06/04/2021 |

and so on...

Now how do I calculate whether these two factors are even correlated? And how do I find whether the result found is statistically significant? Do I use a correlation test, a t-test, ANOVA, etc.? I don't know what would be appropriate. Most of all, is there even a way to perform a statistical test with a non-numerical variable (in this case, the date shown in the table)?

I would appreciate the answer to all of these questions. Please put it very simply as I am very bad with understanding how statistics work.

The first step in thinking about these sorts of problems is to recognize what relationship you are trying to understand. In the example given, we have two different types of data for each flower: the distance from open water and the date the flower blooms completely.

One reasonable question to ask of this data is: does being farther from open water change when flowers reach full bloom? In other words, we are asking whether the bloom date changes as the distance changes. The most straightforward way to check is a simple visual test: graph the two variables against each other.

Now comes the problem of dealing with our date data. While not obvious, dates can be treated as numeric data. The key is noting that you can convert a date into the form "number of days since...". For example, reading the given date 03/04/2021 as day/month/year (3 April 2021), we could recode it as "number of days since January 1", which would be 92. The other date, 06/04/2021, would then be 95. By recasting our "non-numeric" dates this way, we get valid numeric variables that are usable in statistical methods.
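The recoding described above can be sketched in a few lines of Python. This is only an illustration: the helper name `days_since_jan1` is made up for this example, and it assumes the dates are written as day/month/year, which is the reading that gives 92 and 95.

```python
from datetime import date, datetime


def days_since_jan1(date_text: str) -> int:
    """Recode a date as the number of days since January 1 of the same year.

    Assumes the text is in day/month/year form, e.g. "03/04/2021" = 3 April 2021.
    """
    d = datetime.strptime(date_text, "%d/%m/%Y").date()
    return (d - date(d.year, 1, 1)).days


bloom_dates = ["03/04/2021", "06/04/2021"]
bloom_days = [days_since_jan1(text) for text in bloom_dates]
print(bloom_days)  # [92, 95]
```

If the dates were actually recorded as month/day/year, the format string would need to be `"%m/%d/%Y"` instead and the resulting numbers would differ.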

It is important to remember that whichever way we choose to recode our dates into numbers, we are making a choice that we need to justify as relevant to the problem at hand. It wouldn't really make sense to use "days since January 1st, 1970" when we're measuring growing trends within a given season. Something simple like the number of days since the start of that year can make sense since the flowers will likely be blooming in the spring, so it's relevant.

Once we have converted our dates into a reasonable number like we described above, your first step should be to make a scatterplot comparing the Distance v. Days. This way you will be able to see visually what is going on.

Suppose that being farther from water does result in later bloom dates. Then what you will likely see in your scatterplot is something that roughly follows a line. This would give an indication of the direction of your trend. There's no guarantee this relationship exists, so it could just look like random noise with no discernible pattern.

In many cases, this graphical check is enough to answer the question "is there a relationship?", but sometimes you may want to go a bit further and do some additional statistical tests. One way to do so is to treat this data as a linear regression problem, i.e. fitting a line between distance and days on your scatterplot. In practice, this finds the line that fits the data best. That line will have the form:

$$\mathrm{Days} = \alpha + \beta \times \mathrm{Distance}$$

Where $\alpha$ and $\beta$ are the parameters that the statistical method will estimate. Nowadays, we would put our data into software, such as Excel or R, to get the result of this method. What you get out of the methods are the values of $\alpha$ and $\beta$ that best fit your data.
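To make the fitting step concrete, here is a minimal sketch of what the software does, using ordinary least squares written out by hand. The data are hypothetical: only the first two points (0 m, day 92 and 5 m, day 95) come from the example above, and the remaining four are invented purely for illustration. The function name `fit_line` is also just a label for this sketch.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of ys = alpha + beta * xs.

    Returns (alpha, beta): the intercept and the slope.
    """
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: how x and y vary together, divided by how much x varies.
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    beta = s_xy / s_xx
    # Intercept: the fitted line passes through the point of means.
    alpha = y_mean - beta * x_mean
    return alpha, beta


# HYPOTHETICAL data: only the first two points appear in the question.
distances = [0, 5, 10, 15, 20, 25]   # metres from open water
days = [92, 95, 94, 98, 99, 101]     # days since January 1 at full bloom

alpha, beta = fit_line(distances, days)
print(f"alpha (intercept) = {alpha:.2f} days")
print(f"beta (slope)      = {beta:.3f} days per metre")
```

With real data you would normally let Excel, R, or a statistics library do this for you; the point here is only that $\alpha$ and $\beta$ come out of a fixed, mechanical calculation on the two columns.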

To understand what these mean, let's look more closely at the linear regression that we wrote down. Since $\mathrm{Distance}$ can be 0, we see that when distance is 0, our model just becomes $\mathrm{Days} = \alpha$. This means that $\alpha$ is the baseline number of days until full bloom at 0 meters. Another way to think of it is as the intercept of the graph.

That means that $\beta$ is the slope or, more specifically, it measures how much the distance affects the days until full bloom. This is great, because $\beta$ is more or less exactly what our question is asking about! One additional by-product of our statistical method is that we not only get estimates for these parameters, but we get their uncertainty as well. There's not only one line that could have fit the data: the randomness in the process means that we get a whole range of values that are compatible with our data. In practice, this often appears as something called a "95% confidence interval" for our parameter $\beta$, and whatever software we use will likely give it to us.

Interpreting these intervals can sometimes be tricky, but a common way to do so is to see whether the interval contains 0. (If you've ever heard of a "p-value", this is closely related: for a standard regression, a 95% confidence interval that excludes 0 corresponds to a p-value below 0.05.) If our 95% confidence interval does not contain zero, then the result is commonly referred to as "statistically significant". However, if our confidence interval does contain 0, then we often say that the result is non-significant. A non-significant result would mean that our data are consistent with there being no relationship between distance to open water and bloom date.
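The interval check above can be sketched as follows, using the usual textbook formula for the standard error of a regression slope. Everything here is an assumption for illustration. The data are hypothetical (only the first two points come from the question), the name `slope_confidence_interval` is made up, and the critical value 2.776 is the standard t-table value for a 95% interval with 4 degrees of freedom, so it is valid only for exactly 6 data points.

```python
import math


def slope_confidence_interval(xs, ys, t_crit):
    """Return (beta, lower, upper): the fitted slope and its confidence interval.

    t_crit is the t-distribution critical value for len(xs) - 2 degrees of freedom.
    """
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    beta = s_xy / s_xx
    alpha = y_mean - beta * x_mean
    # Residuals: how far each observed point sits from the fitted line.
    sse = sum((y - (alpha + beta * x)) ** 2 for x, y in zip(xs, ys))
    # Standard error of the slope, with n - 2 degrees of freedom.
    se_beta = math.sqrt(sse / (n - 2) / s_xx)
    return beta, beta - t_crit * se_beta, beta + t_crit * se_beta


# HYPOTHETICAL data: only the first two points appear in the question.
distances = [0, 5, 10, 15, 20, 25]   # metres from open water
days = [92, 95, 94, 98, 99, 101]     # days since January 1 at full bloom

T_CRIT_DF4 = 2.776  # 95% two-sided t critical value for 6 - 2 = 4 degrees of freedom

beta, lower, upper = slope_confidence_interval(distances, days, T_CRIT_DF4)
contains_zero = lower <= 0 <= upper
print(f"slope = {beta:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
print("non-significant" if contains_zero else "statistically significant")
```

For a different number of observations, the critical value has to be looked up again for the new degrees of freedom; statistical software does this automatically and reports the interval directly.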

That said, there are numerous other models that one may want to explore when answering these sorts of statistical questions. The methods that I described above are very common and fairly simple, but more sophisticated methods may be warranted depending on the problem.