Missing data in daily diaries
Why do we sometimes collect data from patients daily, rather than weekly or monthly? One of the main reasons is variability in the patient’s condition.
Take the example of asthma. Over a week or a month, an asthma sufferer may experience a baseline level of breathing difficulty, but that doesn’t tell the whole story. If you’ve got a respiratory condition or know others who have, you can probably see where I am going with this. Asthma symptoms, as is the case with many conditions, vary daily.
Susan and the Multiverse
Let’s consider multi-dimensional Susan. Susan exists in more than one universe. In both Universe A and Universe B, she has asthma and in both she is asked to record her symptoms.
In Universe A, she is so limited by her condition that she spends Sunday, Monday and Friday in bed. Thursday is a good day and she gets out into the garden for a while. On the other three days, her condition is moderate. In Universe B, meanwhile, Susan’s symptoms are consistent, and she feels a little unwell on most days. If she put a daily number on her condition, on a scale of zero to 10, it might look like this:
Universe A | Universe B |
Sunday 8 | Sunday 6 |
Monday 10 | Monday 6 |
Tuesday 5 | Tuesday 6 |
Wednesday 4 | Wednesday 7 |
Thursday 1 | Thursday 5 |
Friday 9 | Friday 6 |
Saturday 5 | Saturday 6 |
In both universes, the average score across the week is 6, representing a moderate level of asthma. But these are two very different Susans with two very different experiences of asthma.
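To make the difference concrete, here is a minimal sketch that summarises both weeks using the scores from the table. The `summarise` helper and the choice of standard deviation and range as measures of spread are illustrative assumptions, not a prescribed analysis.

```python
from statistics import mean, pstdev

# Daily severity scores (0-10), Sunday to Saturday, copied from the table.
universe_a = [8, 10, 5, 4, 1, 9, 5]
universe_b = [6, 6, 6, 7, 5, 6, 6]

def summarise(scores):
    """Return the weekly mean plus two simple measures of day-to-day spread."""
    return {
        "mean": mean(scores),
        "sd": round(pstdev(scores), 2),       # population standard deviation
        "range": max(scores) - min(scores),   # worst day minus best day
    }

summary_a = summarise(universe_a)
summary_b = summarise(universe_b)
print("Universe A:", summary_a)
print("Universe B:", summary_b)
```

Both weeks share the same mean, but the spread in Universe A is several times larger, which is exactly the information a single weekly average throws away.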
Gathering each day’s threads to weave the tapestry of knowledge
Whether you’re collecting patient data for clinical research or treatment monitoring, it’s important to capture these fluctuations in symptom severity.
In clinical research, we’re trying to create new medication to control or reduce symptoms, so we need to know about fluctuations in severity. In treatment monitoring, we want to measure a treatment’s impact on symptoms, so we want to be able to differentiate between a reduction in symptom severity (less severe symptoms) and a reduction in symptom variability.
As we have seen, if we ask patients to think about their average severity over a period of time, this leads to an “average”, which loses the nuance of the “bad days”. The picture is less detailed, and the knowledge we derive from it far less useful.
This is where daily diary data comes in. Being able to ask patients about their symptoms each day (or in real time using Ecological Momentary Assessment) can help us to capture that daily variability.
What a difference a day makes…?
It ain’t that simple though. One issue that arises when using daily diaries is missing data.
Missing data happens for many reasons. Maybe a patient just forgot, was too busy with other things or was too sick. In each of these cases, the missing data means something. If a patient forgets to record their data, this could be missing completely at random. (That phrase, ‘missing completely at random’, by the way, is an actual factual stats term – which is usually abbreviated as MCAR. So, if you didn’t know that before, now you can act like a fancy statistician. You have my permission.)
The other explanations for missing data are a bit more complex. If data is missing because the patient is too sick, this is certainly not missing completely at random (in fact it is called ‘missing not at random’ or MNAR). In this case we assume that the score the patient would have recorded, had they completed the diary, is related to the severity that prevented them from completing it. To help us with this, we can rely on a useful summary table produced by a really great author in one of their papers, which I have adapted below:
Missingness Mechanism | Definition | Daily Diary Example | Issues |
MCAR (Missing Completely At Random) | Missingness is unrelated to the construct being observed. | Technical malfunction. Forgetting. | Increased variability. |
MAR (Missing At Random) | After conditioning on observed data, missingness is not related to the unobserved missing value. | Other daily or continuous measures (e.g. heart rate). Previous assessments. Too busy. | Increased variability and potential bias. |
MNAR (Missing Not At Random) | The missingness is directly related to the unobserved missing value. | Too sick. Too busy. | Increased variability and bias. |
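The difference between MCAR and MNAR can be seen in a small simulation. This is only a sketch under made-up assumptions: the scores are drawn uniformly from 0 to 10, the MCAR missingness rate is 30%, and the MNAR rule (probability of a missing entry equal to the score divided by 20) is invented purely for illustration.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical "true" daily severity scores (0-10) for many patient-days.
true_scores = [rng.randint(0, 10) for _ in range(10_000)]

def observe_mcar(scores, p_missing=0.3):
    """Drop each entry with the same probability, whatever its value."""
    return [s for s in scores if rng.random() >= p_missing]

def observe_mnar(scores):
    """Drop entries more often on bad days: P(missing) = score / 20 (assumed)."""
    return [s for s in scores if rng.random() >= s / 20]

true_mean = mean(true_scores)
mcar_mean = mean(observe_mcar(true_scores))
mnar_mean = mean(observe_mnar(true_scores))

print(f"true mean: {true_mean:.2f}")
print(f"MCAR mean: {mcar_mean:.2f}")  # close to the true mean
print(f"MNAR mean: {mnar_mean:.2f}")  # too low: the worst days are missing
```

Under these assumptions the MCAR estimate stays close to the truth, while the MNAR estimate makes patients look healthier than they are, because the worst days are the ones most likely to go unrecorded.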
Autoregressive data
So, what can we do when data is missing? First, it helps to understand the structure of this kind of data, which has a cool property: it is autoregressive. This means each day’s data is more strongly related to data from days that are close in time than to data from days that are further away.
Knowing this, we know that we might be able to guess what the missing data would look like, based on the data we have. Is it perfect? No. Is it better than just averaging what you have? Probably. The key question is: “how much missing data can I have and still get a reliable score?”
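One simple way to use that structure is to fill a missing day from its nearest observed neighbours. The sketch below uses plain linear interpolation, which is my stand-in for illustration and not necessarily the method behind the simulation results in this post; `fill_from_neighbours` is a hypothetical helper name.

```python
def fill_from_neighbours(week):
    """Fill None entries using the nearest observed days.

    Gaps between two observed days are linearly interpolated; gaps at the
    start or end of the week copy the nearest observed value.
    """
    observed = [i for i, v in enumerate(week) if v is not None]
    if not observed:
        raise ValueError("no observed days to impute from")
    filled = list(week)
    for i, v in enumerate(week):
        if v is not None:
            continue
        before = [j for j in observed if j < i]
        after = [j for j in observed if j > i]
        if before and after:
            lo, hi = before[-1], after[0]
            weight = (i - lo) / (hi - lo)
            filled[i] = week[lo] + weight * (week[hi] - week[lo])
        elif before:
            filled[i] = week[before[-1]]   # trailing gap: carry last value forward
        else:
            filled[i] = week[after[0]]     # leading gap: carry next value back
    return filled

# A week (Sunday to Saturday) with Tuesday missing.
week = [6, 6, None, 7, 5, 6, 6]
print(fill_from_neighbours(week))
```

This works best when symptoms drift gradually from day to day. For a patient whose scores swing sharply, a filled-in value can be a long way from what they would have reported.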
Luckily we figured that one out with simulation so you don’t have to. In general, as long as we have at least four of the seven days of data in a week, the estimates we make don’t seem too far off from the truth. It is nuanced, and there are better ways of handling this data than taking the mean (that’s for another post), but in general you can’t go far wrong following this rule.
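As a minimal sketch, the rule of thumb can be written as a scoring function that refuses to return a weekly score when too few days were completed. It uses the plain mean of the observed days because that is the simplest option; the `weekly_score` name and the use of `None` for a missing day are assumptions for illustration.

```python
from statistics import mean

MIN_DAYS = 4  # the "at least 4 of 7 days" rule of thumb

def weekly_score(week, min_days=MIN_DAYS):
    """Mean of the observed days, or None if too few days were completed."""
    observed = [v for v in week if v is not None]
    if len(observed) < min_days:
        return None  # not enough data for a reliable weekly score
    return mean(observed)

print(weekly_score([6, 6, None, 7, 5, None, None]))        # four days observed
print(weekly_score([6, None, None, 7, None, None, None]))  # only two days observed
```

Returning `None` rather than a number keeps unreliable weeks visible in the dataset, so they can be handled deliberately later instead of being silently averaged in.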
Is it better to have loved and lost?
Handling missing data is a last resort. It would, of course, be better if the data were not missing in the first place. Thoughtful planning at the design stage can help ensure it doesn’t slip through our fingers.
One approach is to only collect necessary data. This principle sounds obvious (and ethical) but it is often neglected. I have worked on plenty of trials that aimed to collect data every day for a year or more, only to end up using a fraction of it in the analysis.
If you’re planning to use data from the seven days before each monthly clinic visit, then reduce the patient burden by only collecting it in this window. This is not only efficient, it can also help stop your patients getting, well, fed up.
Logistical options also exist. They are more resource intensive but can lead to fuller datasets. These include sending an electronic reminder if a questionnaire has not been completed by a given time, or making a phone call if someone seems to be skipping a day.
Phone calls can also act as an important wellness check. In an MNAR scenario, finding out through a call that the patient didn’t complete the questionnaire because they aren’t doing well isn’t just useful statistical information; it also enables intervention if necessary.
All of which shows that the problem of missing data can be solved with the right tools. If you need help with your missing data analysis, your strategy to avoid it, or daily diary data analyses in general – let us know!