Question Re. "Exclusion of "Don't Knows", Refusals, Not Stated from CCHS/RRFSS analysis" posted 2009/07/06 4:31 PM
Response 1 to "Exclusion of "Don't Knows", Refusals, Not Stated from CCHS/RRFSS analysis" posted 2009/07/08 10:43 AM
Response 2 to "Exclusion of "Don't Knows", Refusals, Not Stated from CCHS/RRFSS analysis" posted 2009/07/08 5:13 PM
Response 3 to "Exclusion of "Don't Knows", Refusals, Not Stated from CCHS/RRFSS analysis" posted 2009/07/09 11:16 AM
Response 4 to "Exclusion of "Don't Knows", Refusals, Not Stated from CCHS/RRFSS analysis" posted 2009/07/09 4:16 PM
Response 5 to "Exclusion of "Don't Knows", Refusals, Not Stated from CCHS/RRFSS analysis" posted 2009/07/09 5:36 PM
Question Re. "Do Stats Can data release guidelines apply to the CV for the difference between ratios?" posted 2009/07/30 9:59 AM
Response 1 to "Do Stats Can data release guidelines apply to the CV for the difference between ratios?" posted 2009/08/08 8:38 AM
Response 2 to "Do Stats Can data release guidelines apply to the CV for the difference between ratios?" posted 2009/08/06 2:13 PM
Response 3 to "Do Stats Can data release guidelines apply to the CV for the difference between ratios?" posted 2009/08/06 6:07 PM
The health analysts at our health unit have been doing a lot of analysis lately using self-reported data such as CCHS and RRFSS, and we had a discussion last week around excluding "Don't Knows", Refusals, and Not Stated responses from the analysis if they make up <5% of responses (which is our usual practice).
This exclusion is one of the items listed under the "Analysis Check List" for APHEO indicators which use CCHS or RRFSS data, as follows:
Users need to consider whether or not to exclude the ‘Refusal', ‘Don't Know' and ‘Not Stated' response categories in the denominator. Rates published in most reports, including Statistics Canada's publication Health Reports generally exclude these response categories. In removing not stated responses from the denominator, the assumption is that the missing values are random, and this is not always the case. This is particularly important when the proportion in these response categories is high.
Whenever we release data from these sources, we do include the caveat that the assumption that the missing values are random may not always hold. However, there are a number of instances where we are pretty convinced that these responses are not random (e.g., BMI, condom use, heavy drinking), and we have some discomfort with excluding them.
The approach that we are thinking of taking is to include the responses in tables or crosstabs as a legitimate response category. That way, we are not making assumptions about these responses, but rather acknowledging them and the impact that they may or may not have on the analysis.
Are others out there excluding or including these responses in your analysis, and why or why not? I'm sure this debate has happened before, so I'm just wanting to get others' perspectives on this.
In the RRFSS, we have guidelines about showing DK, R and NS responses and, like your practice in ..., we must include those responses if they are over 5% of responses. The guidelines don't, however, discuss how to deal with these response categories when under 5%.
When completing my thesis ..., after collecting my data, we came up against this very scenario. I went to my prof with a most puzzled look and asked, "So, now what do I do? I thought this was a dichotomous question and was just being polite by including that Don't Know category, but I must have been wrong because a lot of people answered ‘Don't know.'" She said, "If Don't Know is a reasonable response, then it must be included in your analysis." We went through my entire survey and identified all of the questions for which Don't Know was a reasonable response. In some cases, it was over 5% and in others, it was under. In all cases, Don't Know was included in the denominator.
Refusals and Not Stated. Those are such tricky responses to deal with and I would imagine that for touchy subjects, they can be much higher than 5%. If you're convinced that they're not random, complete an analysis comparing ‘Refusal and Not Stated' responders with those that did respond on a number of different variables that also impact on the outcome (e.g. income, age, sex, marital status). Then, highlight variables with significant differences to show that the groups are indeed different and, as such, should be handled differently in the analysis or the interpretation.
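The comparison described above can be run as a basic 2×2 chi-square test. A minimal sketch (the function and the counts are invented purely for illustration; in practice you would use your stats package's cross-tab procedures on real survey data):

```python
def chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic for a 2x2 table:
                    responded   refused / not stated
    low income         a              b
    high income        c              d
    """
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Invented counts: does refusal differ by income group?
stat = chi2_2x2(10, 20, 20, 10)
significant = stat > 3.84  # chi-square critical value, 1 df, alpha = 0.05
print(round(stat, 2), significant)  # 6.67 True
```

A significant result would support handling the refusers separately in the analysis or at least in the interpretation, as described above.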
I will admit that in a thesis environment, there's a lot more time to do all of this than in the health unit environment. When I'm strapped for time, I present the number, state a reason why the number is an underestimate, and then cite a few peer reviewed journal articles that support this statement.
This is a very interesting topic, and I'm having a hard time figuring out how to find published literature on the subject. As near as I can figure, this is the topic of non-response. If that's true, then the problem of the non-random distribution of DK, R, and NS "responses" is really the same problem as differential non-response bias. Fortunately, there are ways to deal with this. For simplicity, I'll consider "Don't Know", "Refused", and "Not Stated" to be the same. I suppose there is a subtle argument to be made that "Don't Know" is a valid response in itself, but for most questions and purposes in public health epidemiology this seems like a distinction without a difference.
The RRFSS solution described, of including DK, R, and NS (non-)responses as a separate category if they make up more than 5% of responses, is interesting. I understand the rationale: i.e., that if the number of non-responses (DK, R, NS, or missing) is low, then the potential non-response bias will be low. If greater than 5%, I suppose the argument is that transparency demands that we include the non-response category. However, by including a DK/R/NS category when they constitute more than 5% of responses, you solve nothing because, in practical terms, this often amounts to relegating the non-responses to the non-positive category.
Take a hypothetical smoking question with 10 "Yes", 10 "No", and 10 DK/R/NS responses. Excluding the DK/R/NS responses, the smoking proportion is 10/20 = 50%. Including the DK/R/NS responses, the smoking proportion is 10/30 = 33%. Either inclusion or exclusion involves a major caveat, but neither resolves the issue of whether the smoking proportion is 50% or 33%. If you exclude the DK/R/NS responses, you are implicitly saying that half the non-responses were really "yes" and half were really "no". If you include the DK/R/NS category, you implicitly suggest that they were all "no". Thus, simple inclusion/exclusion is not very helpful if the amount of non-response is significant. Indeed, I would argue that if simple inclusion or exclusion is to be the solution, you should opt to EXCLUDE the non-response categories. In that way, you implicitly allocate non-responses to the remaining categories in proportion to the valid responses. That would seem to create less bias than including the non-response as a separate category and then reporting the proportion "positive" from the complete distribution. StatCan avoids this problem by always reporting the complete distribution on the Health Indicators website. We often don't have that luxury. The RRFSS advice should probably be something more like: "if you have more than X% non-response, then inclusion or exclusion of the various non-response categories is up to you, but you need to include a caveat, since neither inclusion nor exclusion adequately solves the problem of item non-response." And, of course, here we're only talking about item non-response, not unit non-response (i.e., selection bias), which is probably a bigger problem.
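The arithmetic behind those two proportions, and the implicit assumptions, can be pinned down in a few lines (a sketch using the same hypothetical counts of 10 smokers, 10 non-smokers, and 10 DK/R/NS responses):

```python
yes, no, mv = 10, 10, 10  # smokers, non-smokers, DK/R/NS responses

# Excluding the non-responses: they are implicitly allocated
# in proportion to the valid answers (here, 50/50).
p_excluded = yes / (yes + no)

# Keeping them in the denominator while reporting percent positive:
# they are implicitly treated as if every non-respondent were a "no".
p_included = yes / (yes + no + mv)

print(f"{p_excluded:.0%} vs {p_included:.0%}")  # 50% vs 33%
```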
Anyway, as I said above, there are better ways to deal with item non-response than simple exclusion (or inclusion of a "missing" category). Simple exclusion works fine when the proportion of item non-response is "low". What is low? 5%, 6%, 8%, 10%...sure. 20%...probably not. There is room here for a judgment call.
When non-response is not "low", it may be worth the effort to do imputation. Imputation can be as simple or as complicated as you like. It isn't a solution to the problem of bias, but it is a way to try to obtain the best possible estimates given the data you have. The most common methods involve regression modeling the variable to be imputed as a function of a set of independent variables and then predicting the value of the missing data. A newer technique that is less widely implemented is multiple imputation, where the variance in the regression model used to predict the missing values is captured by creating multiple imputed datasets, and this extra variance is captured in all estimates derived from the dataset. Stata 10 has an imputation command that uses the regression method above, as well as a hot deck imputation command. The new Stata 11 (which I don't have yet) has implemented multiple imputation. SPSS has a missing values add-on, which I'm not familiar with but which appears to use regression methods. In some commands, SPSS has an option to replace missing values with the mean of the variable in question, but that is a rather primitive method of imputation and would tend to perpetuate any non-random item response bias already present. Plus, it must only be available for continuous variables where using a mean would make sense. If you don't have SPSS's missing data add-on, you can probably figure out how to kludge together a procedure using regression methods and predicted values. Or you could use R, which has some very nice imputation commands like -transcan- in the Hmisc library by Frank Harrell. That's what I use to impute missing values in categorical data. But I hesitate to recommend it, since I seem to get an eye-rolling response whenever I mention R on this list.
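The Stata, SPSS, and R commands mentioned above can't be reproduced here, but the core idea of model-based imputation — predict the missing value from covariates that were observed — can be sketched in miniature. This toy version uses conditional-mean imputation within strata of a single covariate, which is far cruder than the regression and multiple-imputation methods described above; the function and data are invented for illustration:

```python
from collections import defaultdict

def impute_by_stratum(records, covariate, target):
    """Fill missing values of `target` with the observed mean of `target`
    within each level of `covariate` (toy conditional-mean imputation;
    real work would use regression or multiple imputation)."""
    sums = defaultdict(lambda: [0.0, 0])
    for r in records:
        if r[target] is not None:
            s = sums[r[covariate]]
            s[0] += r[target]
            s[1] += 1
    out = []
    for r in records:
        r = dict(r)  # copy so the original data are untouched
        if r[target] is None:
            total, n = sums[r[covariate]]
            r[target] = total / n  # stratum-specific proportion
        out.append(r)
    return out

# Invented data: smoking (1/0, None = missing) by age group
data = [
    {"age": "young", "smokes": 1}, {"age": "young", "smokes": 1},
    {"age": "young", "smokes": 0}, {"age": "young", "smokes": None},
    {"age": "old", "smokes": 0},   {"age": "old", "smokes": 0},
    {"age": "old", "smokes": 1},   {"age": "old", "smokes": None},
]
imputed = impute_by_stratum(data, "age", "smokes")
print([r["smokes"] for r in imputed])
```

The missing young respondent gets the young smoking proportion (2/3) and the missing old respondent gets the old proportion (1/3), rather than an overall mean, which is the whole point of conditioning on covariates.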
So, what to do? Given how people tend to reduce a distribution to a single summary measure, like the proportion who smoke, if I were pressed for time, I think I would tend toward treating the various non-response categories as missing. If I had the time and inclination I would (and do) impute the missing values. Once I get Stata 11, I'll probably really go to town and use multiple imputation, since they've apparently integrated the multiple imputation procedures with all the rest of their commands, including the survey commands. So, apparently I'll be able to simultaneously compute correct survey design-based estimates, as well as impute missing values on the fly while integrating the extra variance that comes from using another estimation procedure to predict the missing values. Wow, exciting stuff!
..., this is a great discussion.
The best background readings for this would be various big fat textbooks on survey methods, which include full chapters on item-specific missing data and separate this from whole people missing from the entire survey.
I'll speak very briefly about DK versus other kinds of item-specific missing data and then go back to missing data as a chunk.
A segue into the meaning of Don't Know.
In a perfect world (where we don't live), one would separate out ‘Don't Knows' for further study, since a DK can mean any of the following:
- I don't understand the question or what you are talking about.
- I have never thought about that matter, at all, and therefore have no information to offer you.
- I neither agree nor disagree but DK is the only ‘neutral' category available
- I don't want to answer (REFUSE) because it is sensitive, etc.
If you read classic texts on questions and answers in survey methods, and literature on opinion measurement, you'll get advice to always INVITE people to say DK and explore why (as above). I've seen cases where DK is very telling. One example is multiple "dk" options seen in a CCHS measure on "Have you been screened with mammography" which offers "I have been, but don't know how long ago" as separate from "I don't know what a mammogram is". Whoever wrote that question gets an A+ -- pain in the butt to analyse but better quality data.
A survey which does NOT give DK options may be somewhat non-comparable from one which tries to force choice. If you don't let people say DK, they might pick other categories somewhat at random and so it is ‘complete' information, but may be less valid. Similarly, if the same question in different sources has quite different DK percentages, they also might not be comparable - EVER, no matter what you do. I remember asking "do you think smoking should be banned on school property" in the late 1980s. Very few people said "DK". By the late 1990s, the question needed to be unfolded into several questions: "inside, outside, just off school property but pissing off the neighbours" and "by teachers, students, students under or over 18?" Etc etc etc. Why bother trying to track %agree over time periods where the entire socio-political issue was so _qualitatively_ different. We WANT easy binary %s but truth doesn't work that way.
Back to missing values in the dataset (now we'll lump DK, Refused, etc. together, because we'll assume there are too few observations of each to fuss about).
What you do with item-specific missing data may depend on the ROLE the variable is playing in the analysis or scientific inquiry.
1) The MVs are in a question which is the primary OUTCOME of interest or thing you are trying to measure.
Advice (for what it's worth):
Report actual percentages of DK refused etc in as much detail as you can stomach. It is of interest.
If you remove them from denominator and numerator (i.e., reporting percent of valid answers), you ARE making the assumption that the true values (if you knew them) among missing would be distributed among the valid answers in the same distribution as among those who gave a valid response. (e.g., if percent of valid=Y is 45%, you assume that percent of true=Y among missing would be 45% also).
If you have small total percents missing, then this assumption has little risk.
You MIGHT consider a simplistic sensitivity analysis (presume all DK are Y and then all are N - how different are your percents?). This is what a person can do in his/her head so long as %excluded is reported in the fine print.
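That in-your-head sensitivity analysis is easy to write down. A sketch (the function name is invented) using the hypothetical 10 Yes / 10 No / 10 missing smoking example: assume every non-response is really a No, then every one a Yes, and report the resulting range:

```python
def sensitivity_bounds(yes, no, mv):
    """Bounds on the true proportion positive:
    lower bound = all missing are No; upper bound = all missing are Yes."""
    n = yes + no + mv
    return yes / n, (yes + mv) / n

low, high = sensitivity_bounds(10, 10, 10)
print(f"true smoking proportion between {low:.1%} and {high:.1%}")
# true smoking proportion between 33.3% and 66.7%
```

If the bounds are close together (i.e., the percent missing is small), the inclusion/exclusion decision hardly matters; if they are far apart, as here, no single reported percentage is trustworthy without a caveat.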
You MIGHT consider a sophisticated sensitivity analysis. To do this with INTERNAL data, you can use the multiple-imputation procedures coming out in Stata. A warning: even with more sophisticated routines, peer reviewers will still likely have antibodies against the word "IMPUTATION".
The BEST way to do this is with an EXTERNAL validation study in which you find a way to track down the truth among a sample of the would-be missing observations and use that information in a sophisticated statistical sensitivity analysis, such as using probabilistic simulation. I've rarely seen that resources are available for such things - but if there is enough value in the answer, then the resources should be found. You will see this in literature as corrections for measurement error and misclassification error using external validation data.
2) the MVs are in a question which is a CATEGORICAL PREDICTOR variable of interest in a model of disease occurrence or some other outcome, such as an attitude etc.
Again, include as many categories for the predictor variable as you can, including keeping DK as one of your dummy variable categories (i.e., turn YES vs. NO into YES, NO, and UTD - unable to determine - categories, and use NO or YES as the reference group). You'll get advice on this from epidemiology journals focusing on regression analysis. You don't want to use the MV category as the reference group, because then all your beta coefficients associated with that variable lack precision. An example I've seen was where the category "I know I do not have a family history of Crohn's disease" was very different from "I don't know what Crohn's disease is" in terms of risk of future colorectal cancer. With a family history of colon cancer, you tend to learn about bowel disease and cease to say ‘dk'.
Advantages: you don't lose observations and statistical power from your analysis, and you have not committed any new misclassification error by lumping them somewhere they may not be.
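A sketch of that coding scheme (the helper is invented, not any package's API): expand Yes/No/DK into indicator columns with No as the reference category, so the DK/UTD group keeps its own coefficient rather than being forced into Yes or No:

```python
def dummy_code(responses, reference="no"):
    """Turn a yes/no/dk response vector into indicator columns,
    omitting the reference category (here, No). DK, Refused, and
    missing are lumped into a single UTD (unable to determine) level."""
    levels = [lvl for lvl in ("yes", "no", "utd") if lvl != reference]
    mapped = ["utd" if r in ("dk", "refused", None) else r for r in responses]
    return {lvl: [1 if m == lvl else 0 for m in mapped] for lvl in levels}

cols = dummy_code(["yes", "no", "dk", "yes", None])
print(cols)  # {'yes': [1, 0, 0, 1, 0], 'utd': [0, 0, 1, 0, 1]}
```

Each column would then enter the regression model as its own predictor, with No as the implicit reference group.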
3) The MVs are in a question which is a CATEGORICAL potential CONFOUNDER variable of interest in a model.
Again, I'd include MV as a separate dummy value. Excluding the observations drops power and may create selection bias. Collapsing the MVs into another valid category may introduce misclassification error and result in RESIDUAL CONFOUNDING.
No reason to impute for the above. AND, remember that the regression coefficient for Yes versus No may not be comparable across studies where the percent DK differs (because of measurement differences or because the meaning of DK differs), no matter what you do.
4) The MVs are in a CONTINUOUS or QUANTITY variable which is the OUTCOME of interest.
As above, if you leave them out, you are assuming that the distribution (this time continuous) of real answers among missing would be the same as among valid, if you had the truth. Also as above, the remote possibility exists of getting external validation data and doing a sensitivity analysis. Again, your multiple imputation route might help here as well, but might not be more acceptable politically than just going with raw data and caveats about selection and measurement error possibilities.
5) The MVs are in CONTINUOUS or QUANTITY type POTENTIAL CONFOUNDER variables in a regression model.
There are a couple of simplistic regression techniques for this too - to avoid dropping all those missing observations.
First, dummy plus continuous (for messing around more than publishing):
You can create a binary variable for ‘missingness' in the quantity variable and then, for all the people missing, you can assign (impute) them the mean value of the non-missing observations. Run a lot of interaction terms for missingness*other predictors that matter, and you may conclude that you are comfortable going ahead excluding the missing observations listwise and accepting the lower statistical power. If you leave the MISSINGNESS plus partially-imputed quantity variables in the model (and keep the extra observations), the variance associated with THAT variable will be underestimated. You will also have some residual confounding for that confounder, but you did anyway. These sins might be less important than dropping all those observations listwise, where you could STILL have residual confounding, AND greater selection bias, AND poorer power for what you were really interested in. BUT, if you try to publish this (especially if you use the dreaded "I" word - imputation - in your write-up), the peer reviewers will probably not like it.
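The "dummy plus continuous" device can be sketched in a few lines (an invented helper, for exploration rather than publication, as cautioned above): fill each missing quantity with the mean of the observed values, and carry a separate missingness flag alongside it into the model:

```python
def mean_fill_with_flag(values):
    """Return (filled, missing_flag): missing entries get the mean of the
    observed values; the flag marks which entries were imputed so that
    'missingness' can enter the model as its own binary variable."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    filled = [mean if v is None else v for v in values]
    flag = [1 if v is None else 0 for v in values]
    return filled, flag

drinks = [2, None, 4, 6, None]  # drinks per week; None = DK/refused
filled, flag = mean_fill_with_flag(drinks)
print(filled, flag)  # [2, 4.0, 4, 6, 4.0] [0, 1, 0, 0, 1]
```

Both `filled` and `flag` would then go into the regression together, keeping the observations with missing drink counts in the model.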
Second, one more category in an already categorized quantity variable.
You can also create categories for the quantity variable and add one more category reflecting missingness:
E.g., number of drinks in the past week as:
Missing or UTD
Comments: First, you might ALREADY have your continuous variable chopped up in quantity "bins" such as the above. A lot of would-be continuous constructs are actually obtained from people using categories (for good cognitive reasons). So, you are NOT introducing new measurement error, but you would be introducing new selection error by dropping the missing observations. Now that I think of it, this technique should be used more often than I see it.
A lot of people also tend to CHOP UP things that are measured continuously for reporting purposes (i.e., create categories or ‘bins' out of something that is continuous in the data set). We all should know that chopping up continuous data is not ideal from the standpoint of regression analysis. It introduces measurement error (assuming people within quantity bands gave the same info, which they did not), reduces power, and is not the most parsimonious way to deal with potential non-linear relationships. But we still do it a lot because it is appealing and we get away with it. (I know this, but I've still often made the decision to do it anyway, and I probably will again.) If we're already going to get away with it, it is not adding any new sins to keep the missing observations as a category.
6) The MVs are in a question which is a REALLY IMPORTANT PREDICTOR VARIABLE OR REALLY IMPORTANT potential CONFOUNDER of interest in a model.
Buy Stata 11 and go the multiple imputation route - especially if you can back that up with external validation study data on the most likely distribution of the unknown information.
Thank goodness someone is finally going to put this in Stata so I don't have to use R. ROFLOL.
I like ...'s idea of reporting the percent positive excluding the MVs, and then also reporting the percent missing or valid from that estimate. I started doing this routinely in my breastfeeding surveillance reports because I found a rather suspicious negative relationship over time between the percentage of mothers successfully contacted and the percentage breastfeeding. Whenever the percent contacted went down, the percent breastfeeding went up, which leads me to suspect that we are better at contacting the kind of women who are likely to breastfeed, and that we have to be careful not to report an increasing breastfeeding rate when the real cause may be a decreasing contact rate. So, I continue to report the percent of women breastfeeding (by month), excluding those we haven't been able to contact for follow-up, but I report the monthly contact success rate right beside it. This is really a solution to unit non-response in this case, but I think it is also good practice for item non-response.
I also agree, as usual, with pretty much everything else ... wrote, except her comment about including a dummy variable for the missing values (rather than a missing value category) in regression modeling with a potential confounder. I don't see how there is any difference between including a missing value category and including a missing value indicator variable - I think they are equivalent methods, and both are likely to lead not only to residual confounding, but to complete confounding. Here is what Rothman et al. (Modern Epi, 3rd Edition) have to say on that subject:
"Unfortunately, there are some methods commonly used in epidemiology that can be invalid even if data are missing completely at random. One such technique creates a special missing category for a variable with missing values, and then uses this category in the analysis as if it were just a special level of the variable. In reality, such a category is a mix of actual levels of the variable. As a result, the category can yield completely confounded results if the variable is a confounder (Vach and Blettner, 1991) and can thus lead to biased estimates of the overall effect of the study exposure. An EQUIVALENT METHOD, sometimes recommended for regression analyses, is to create a special "missing-value" indicator variable for each variable with missing values. This variable equals 1 for subjects whose values are missing and 0 otherwise. This missing-indicator approach is just as biased as the "missing-category" approach (Greenland and Finkle, 1995). For handling ordinary missing-data problems, both the missing-category and missing-indicator approaches should be avoided in favor of other methods; even the complete-subject method is usually preferable, despite its limitations."
The "other methods" to be favoured, according to Rothman et al, include imputation and inverse-probability weighting methods. So, in other words, excluding missing values is better than including them, but imputation is even better. I don't have his book on me at the moment, but as I recall Frank Harrell gives essentially the same advice in favour of imputation in his excellent book Regression Modeling Strategies.
I want to emphasize one additional point, which is that while it is great, in theory, to report the relative frequency of all of the categories of an outcome variable, including the various flavours of MVs, the advisability of this really depends on the audience for the report. If you are submitting it to a peer-reviewed journal mainly for consumption by other scientists, it might make some sense to include a category (or categories) for the MVs. But for most lay audiences (including public health staff), I think we need to be cognizant that they will automatically reduce all of the detail to a summary measure, like the percent positive. I think it is part of my job to anticipate this and provide the "least bad" summary measure. To go back to my earlier example, if you gave public health staff the data below, what would they take from it?
Yes: 10 (33%)
No: 10 (33%)
MV: 10 (33%)
If you give them that frequency distribution, most people will interpret this to say that the proportion who smoke is 33%, which is a strong distortion of the data. It would be better, I think, to anticipate how they'll interpret the frequency distribution and instead give them this:
Yes: 10 (50%)*
No: 10 (50%)*
* based on 67% valid observations
Or something like that.... Now, you might argue that social desirability bias may cause more smokers than non-smokers to refuse to answer the question. If so, then 33% is probably an underestimate of the smoking prevalence and 50% is probably an overestimate. That's a pretty wide margin of possible values for the smoking prevalence, so some resolution is needed. This is where imputation can help you narrow down the true value. If you model the data, you might find that a model containing age, sex, and education strongly predicts smoking behaviour. You can use that model to predict (impute) the missing values in the dataset. It isn't a perfect method since you're basing your imputation model on the problematic data itself (possible validity problem if the missingness is strongly non-random), but nothing short of an expensive and difficult external validation study will be a perfect solution and I've rarely seen that done at the local health unit level. You might partly mitigate the potential validity problem by using the published literature to help choose the covariates used to impute the data.
As ... notes, there is a long-standing (perhaps old-fashioned) bias against imputation. One reason is the possible validity problem noted above. But I think it is also because imputation was done badly in the old days, and because it seems like a sneaky way to maintain sample size. However, similar "validity" arguments are made against design-based survey weighting, against survey research in general (given low modern response rates), and even against predictive regression methods in general. Yes, those arguments have merit and we must make our best efforts to mitigate bias, but they are hardly deal-breakers. Regression modeling of relationships between variables in a dataset is perhaps the most common analytical procedure in the epidemiological literature, so it seems illogical to be against applying those same methods to estimate the missing values in the dataset and get the most out of the data you've collected. It seems to me that, aside from preventing bias through awesome response rates or correcting it through awesome external validation studies, imputation is the "least bad" way to deal with item non-response. And if you don't have time for imputation, leave the MVs out of the frequency distribution but try to report the percent valid (or missing) somewhere nearby.
Thanks to ... for continuing this fascinating discussion.
...'s advice to be critical of, and read well beyond, the approaches I describe is well-placed. To be clear, I wasn't trying to promote these as gospel, and they don't eliminate the need (where it is warranted) to do further work to look for BIAS in estimates that could be attributable to missing data. You need to look at what Greenland suggests as other approaches, which will be assessments of potential bias (actually effect modification across participants and non-participants in a question or the whole questionnaire). External validation studies are one approach (but these have limits). The multiple-imputation approach, using available information about missing-data patterns within the data, is another (but it still has limits). And there are others.
The dummy var approaches are still better than even dumber approaches, such as:
- collapse DK with No and sweep it under the rug (e.g., adjustment for family history Y v. N v. DK _should_ control for confounding better than adjustment for Y v. No+DK)
- just don't include a potential confounder at all because the item-specific missing data rate is too high to admit to.
- Going with just non-missing, and sweeping the extent of missing data under the rug (showing the DK in models at least highlights their existence to the reader).
- Trying to go with all non-missing data but then having nobody accept your work at all because the percent of observations dropped was too high - whereas it is also possible that selection bias may NOT have invalidated your findings.
No perfect solution to dealing with things that are hard to measure. We live with a lot of imperfect measures.
Hi everyone. I've been doing some bootstrapping this morning, looking at the difference between ratios. I have a situation where the CI for that difference excludes 0 (barely), indicating a significant difference, but the CV listed for the difference is greater than 33.3.
My question is, are we supposed to apply the same CV guidelines to differences between ratios as Stats Can requires for release of the ratios themselves?
Here is a response from Statistics Canada:
In response to your question... this is the message I received from the Health Reports analysts...
It is our practice to look at the CVs for all point estimates. When we test for significance between estimates, we check that the p-value is less than 0.05. Based on this p-value, we report the results of the significance tests, regardless of the CV of the difference between the estimates.
The question of CV versus p-value has come up before, and it has often confused and distressed people.
This might help.
If what you are doing is estimating a descriptive parameter or point estimate (e.g., reporting a descriptive proportion, number, or mean), then a large CV indicates you have too little precision to be sure that your estimate of that mean or proportion reflects the true value. You might say "we can't conclude we have a good estimate of that proportion," so we SUPPRESS reporting of the point estimate to keep someone from using it as though it were a precise estimate. You would not suppress saying that you attempted to calculate the estimate; you just would not report its value.
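For a descriptive estimate, the CV is just the standard error expressed relative to the estimate. A sketch applying the commonly cited Statistics Canada CCHS cut-offs (release below 16.6%, release with caution between 16.6% and 33.3%, suppress above 33.3%); the estimate and bootstrap SE here are invented:

```python
def cv_percent(estimate, se):
    """Coefficient of variation of a point estimate, in percent."""
    return 100 * se / estimate

def release_rule(cv):
    """Commonly cited StatCan release categories for descriptive estimates."""
    if cv <= 16.6:
        return "release"
    if cv <= 33.3:
        return "release with caution (E)"
    return "suppress (F)"

cv = cv_percent(0.5, 0.1)    # invented proportion and bootstrap SE
print(cv, release_rule(cv))  # 20.0 release with caution (E)
```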
The role of the CV is often more confusing for tests of hypotheses regarding relationships or group differences, because there is enormous overlap, mathematically and philosophically (though not perfect), between the test statistic to which the p-value is assigned and the CV treated as a test statistic.
When I'm talking about hypotheses regarding ‘effect sizes', examples are:
- difference between two proportions or means, where the null is no difference (i.e., the difference = 0); or
- beta coefficient in a regression model of any kind, where the null beta value = 0.
For each of these, the underlying p-value is (often) based on a Wald-type test statistic which gets larger (and p-values smaller) when the ratio of effect_size:standard_error gets larger. The CV gets larger when the ratio of the standard_error:effect_size gets larger. In other words, one is nearly the inverse of the other. The two are very very closely related. Effect sizes which are not statistically significant tend to have CV in the non-reportable range, and vice versa. Much of the time, they disagree only because things are on the borderline, and if something is not significant, it should be reported as such.
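That near-inverse relationship is easy to see numerically (a sketch with an invented effect size and bootstrap SE): the Wald z is effect/SE, the CV is 100 × SE/effect, so their product is always 100, and a CV of exactly 33.3% corresponds to z = 3, which is well beyond the usual 1.96 significance threshold:

```python
def wald_z(effect, se):
    """Wald-type test statistic: effect size over its standard error."""
    return effect / se

def cv_pct(effect, se):
    """Coefficient of variation of the effect estimate, in percent."""
    return 100 * se / effect

effect, se = 0.30, 0.10   # invented difference in proportions and its SE
z = wald_z(effect, se)    # about 3.0 -> statistically significant
cv = cv_pct(effect, se)   # about 33.3 -> right at the release cut-off
print(round(z, 2), round(cv, 1))  # 3.0 33.3
```

So an effect that just clears the 33.3% CV cut-off is already significant at well below p = 0.05, which is why the two criteria usually agree except near the borderline.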
There is a problem with interpretation of advice to ‘SUPPRESS' statistics which have a large CV. In the case of a non-significant finding for an effect of interest, you would SUPPRESS reporting, say, that there is a difference between the two sexes in some rate. You would NOT SUPPRESS the fact of having tested for the difference. You would report it as a non-statistically-significant difference, suppress any conclusion that sex mattered, and you wouldn't go on to apply the non-significant point estimate to any further calculations or policy decisions.
Some people have argued that the StatCan bootstrap tools should not present CV for tests of effect size (such as in the output for regression) because of this confusion, but there certainly should be no harm in reporting this statistic if people aren't misled by it.
A special case might be where something is statistically significant (i.e., one can reasonably conclude that the null hypothesis of no-difference can be excluded) but the CV is still large. This might be the case, say, where you have a very large odds ratio as a point estimate, which is larger, in your sample, than is likely to have happened by chance alone; however, the upper and lower bounds of its confidence interval are very far apart, so you might be unconfident about whether the odds ratio is small (albeit non-null) or absolutely huge. It would be a statistically significant finding, but still an imprecise estimate of the effect size. In some scientific enquiries a conclusion of a non-zero relationship is all we need to achieve. In others, we also need tight intervals around the estimate of the magnitude of the effect (for example in cost-benefit analysis of acting to prevent or treat something), so the CV for statistically significant findings might still provide a useful note of caution in the application of the statistically significant point estimate.
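This "significant but imprecise" case can be sketched with a hypothetical logistic-regression result (the point estimate and standard error below are invented for illustration):

```python
import math

# Hypothetical result: odds ratio of 8, with a large standard
# error on the log-odds scale.
log_or = math.log(8.0)   # point estimate on the log scale
se = 0.9                 # hypothetical SE of log(OR)

z = log_or / se                       # Wald statistic
lo = math.exp(log_or - 1.96 * se)     # 95% CI, lower bound (OR scale)
hi = math.exp(log_or + 1.96 * se)     # 95% CI, upper bound (OR scale)
cv = 100.0 * se / log_or              # CV of the log-OR estimate

print(f"OR = 8.0, 95% CI ({lo:.2f}, {hi:.2f}), z = {z:.2f}, CV = {cv:.0f}%")
```

The interval excludes 1, so the association is statistically significant, yet the CV exceeds 33.3% and the interval spans more than a thirty-fold range: we can conclude the effect is real, but not whether it is modest or enormous.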
There is always value in showing confidence intervals - for significant effects and non-significant effects. Effect estimates _with confidence intervals_ need not be suppressed in the case of a large CV, especially if those confidence intervals are being communicated to a reasonably sophisticated user audience.
Thank you ... and ... for injecting some sanity into the discussion of CVs. Too often the discussion revolves around observances and prohibitions that more resemble doctrinaire fetish than empirical science and free inquiry. I think it is more sensible to think of the release guidelines as just that: guidelines, not unbreakable laws of nature. I have long argued that the presentation of a confidence interval along with a point estimate is sufficient to indicate the precision of the estimate in question (perhaps with a warning if the interval is very wide). I think suppression is a rather extreme way to handle imprecision. Indeed, StatCan's most recent advice indicates that a warning will suffice when CVs are "large". NHANES guidelines similarly do not absolutely require suppression of statistically unreliable estimates, but rather encourage people to include a warning that the estimate may be statistically unreliable.
As both ... and ... indicate, the testing of the statistical significance of effect measures replaces the need to read the CV leaves at the bottom of the tea cup, since they measure essentially the same thing. When there is a conflict between the size of the CV and the statistical significance of a test, you should default to the test value (or better yet to the confidence interval). Why? Because the "unacceptable" CV cut-off of 33.3% is simply arbitrary. If you recall, back in the day the OHS data had an unacceptable CV cut-off of 25%. NHANES III recommended a warning at 30%. The arbitrary values we set for statistical significance testing (or confidence intervals) at least have the virtue of near-universal use and acceptance in fields that utilize statistics to prove a point. Even at that, we recognize the traditional alpha level as just a convention. There is no vengeful god that will strike you down for using a 90% confidence interval instead of a 95% confidence interval, or for setting your level of significance at 0.1 instead of 0.05. Yet, some seem to feel that StatCan's guidelines are like holy writ enjoining all to honour the magical 33.3% value on pain of excommunication. I am sure that the good statisticians at StatCan do not feel that way about their guidelines at all.
All of this reminds me of something my good friend RB used to do to assuage the fickle CV gods, and which convinced me early of the futility of blind obedience to CV guidelines. If a rare characteristic was being estimated, it commonly occurred that the CV of the estimate would be too large for publication. However, if one simply turned the question around and estimated the percentage of people who *didn't* have the rare characteristic, then the CV of the estimate was just fine. Such a simple and reasonable solution spoke volumes about RB's cleverness, but it also angered me somewhat to see a good mind forced to waste energy on what amounted to sophistry (or rather the evasion of sophistry). One sees similar folly when StatCan publishes estimates and suppresses the "not stated" category because the CV is too large, even when all of the other categories are plainly displayed.... Need I say more?
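RB's trick works because a proportion and its complement share exactly the same standard error, while the CV divides that same error by two very different denominators. A hypothetical sketch (SRS standard error, invented numbers):

```python
import math

# Hypothetical rare characteristic: 3% prevalence, effective n = 200.
p = 0.03
n = 200
se = math.sqrt(p * (1 - p) / n)   # identical SE for p and for 1 - p

cv_rare = 100.0 * se / p               # CV of the 3% estimate
cv_complement = 100.0 * se / (1 - p)   # CV of the 97% estimate

print(f"CV of {p:.0%}: {cv_rare:.1f}%   CV of {1 - p:.0%}: {cv_complement:.1f}%")
```

The 3% estimate would be suppressed under a 33.3% cut-off, while the logically identical 97% estimate sails through with a CV of about 1% -- the same information, judged publishable or not depending on which way the question is phrased.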
Not to get too far off track, but a similar argument occurs in legal circles over literal vs. pragmatic (or constructivist) approaches to law. We all know that strict legal literalism can lead to some rather absurd conclusions, such as charging an employer with racial discrimination because they implemented an affirmative action program, or charging a surrogate mother with human trafficking. Such actions may meet a literal reading of certain statutes, but they reflect neither the intent of the statute nor the standard of criminality in our society. Thus, modern Canadian jurisprudence has, for the most part, decided to make pragmatic interpretations of the law based on particular circumstances and what is considered "reasonable" behaviour today. Similarly, while CV guidelines serve a certain purpose, which is to be respected, they must be interpreted in context by "judges" (that's us) well-versed in their meaning and purpose. Thus, I'm not criticizing StatCan...obviously they are more than capable of taking a thoughtful approach, and they are not wrong to publish guidelines to help us interpret the reliability of their data. I think it is up to us to understand the purpose of the guidelines and then make judgments about how to respect the spirit they embody while maximizing the utility of the data available to us.