To DS or not to DS? – Data Science W231 | Behind the Data: Humans and Values

By Maya Miller-Vedam

A passing comment in w261 yesterday reminded me of some of our w231 conversations in the past few weeks. We were discussing the benefits of SVMs vs Logistic Regression in different situations and the professor cited ‘heart attacks’ as an example of data that is not linearly separable (i.e. inherently noisy). Specifically he said “there are lots of cases where, out of two people who are exactly the same, one has a heart attack and the other doesn’t.” In the context of his point (that sometimes both algorithms perform similarly) this was a helpful elaboration. However, it struck me that the statement ‘two people who are exactly the same…’ was also an example of the way that data science can encode assumptions without anyone thinking about them.

In reality, what my professor meant by that phrase isn’t that two people really are exactly the same but rather that their representations in the feature space are the same. In this case, the feature space consists of medical facts like age, weight, diet, ‘stress level’, family history, etc. Importantly, the feature space will never include all possible risk factors because there is always the possibility of risk factors that the researchers wouldn’t know to look for or aren’t able to measure. So, yes, it’s likely (given a wealth of research) that heart attacks are an inherently noisy phenomenon … but it’s also possible that the noisiness is not actually inherent but is a reflection of the fact that our features fail to capture the full range of predictive factors (I think this is called a ‘content validity’ problem, yes?).
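To make the ‘same representation, different outcome’ idea concrete, here is a toy sketch. It is entirely made up: the two binary ‘risk factors’ and the rule linking them to the outcome are assumptions for illustration, not medical claims. The outcome is fully determined by a measured factor plus a hidden one, yet a model that only sees the measured factor would find the data ‘noisy’.

```python
import random

def simulate(n=10000, seed=0):
    """Generate toy patients as (measured, hidden, outcome) tuples.

    Both factors are hypothetical binary risk factors. The outcome is
    a deterministic function of the two: there is no randomness in the
    outcome itself once both factors are known.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        measured = rng.randint(0, 1)  # a factor the researchers record
        hidden = rng.randint(0, 1)    # a factor nobody knows to record
        outcome = int(measured == 1 and hidden == 1)
        rows.append((measured, hidden, outcome))
    return rows

def best_accuracy(rows, key):
    """Accuracy of the best possible rule that only sees key(row).

    For each distinct feature value we predict the majority outcome,
    which is the most any classifier restricted to those features
    could achieve on this data.
    """
    counts = {}
    for row in rows:
        group = counts.setdefault(key(row), [0, 0])
        group[row[2]] += 1
    return sum(max(c) for c in counts.values()) / len(rows)

rows = simulate()
acc_measured = best_accuracy(rows, key=lambda r: r[0])       # measured factor only
acc_full = best_accuracy(rows, key=lambda r: (r[0], r[1]))   # both factors

print("best accuracy, measured feature only:", acc_measured)
print("best accuracy, with the hidden feature:", acc_full)
```

With both factors the outcome is perfectly predictable, while with only the measured one, ‘identical’ patients end up with different outcomes. Nothing in the measured-only view tells you which of the two situations you are in.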

What is most interesting to me is that I think in many situations it will be impossible to know whether you are looking at data with a content validity problem or at a phenomenon that is simply noisy. To be clear, I am not trying to contest the medical establishment’s understanding of the causes of heart attacks… just pointing out an example of what seems like an information science challenge. Data scientists are fond of the truism ‘garbage in, garbage out,’ but what if it’s impossible to know if your data is ‘garbage’ to begin with?

From this angle, the ‘heart attacks’ case is an interesting contrast to the diversity hiring problem we discussed in class two weeks ago. There, we concluded that the features available to an HR department algorithm were pretty clearly inadequate when it comes to capturing the specific meaning of ‘success’ and of ‘diversity’ that we’d actually want to optimize for. In our discussion we attributed this partly to a problem with categories (e.g. single-race check boxes fail to capture multi-racial applicants); partly to a measurement challenge (e.g. how do you document diversity of opinion or life experience?); and partly to a problematic assumption that the success of diverse applicants is causally attributable to features of their own background as opposed to features of the environment in which they work. At the time, it felt pretty easy to conclude that, as a result of this insurmountable (“content validity”?) problem, we shouldn’t be trying to use DS to solve the problem of increasing diversity in hiring. I stand by that conclusion, but am now wondering whether it will always be so clear cut.

As I understand our readings from Harding, Jurgenson and Valentine, the point of talking about the ways in which human researchers and institutions have influenced the content and construction of the features in our data isn’t to invalidate our results but rather to make sure that we apply those results appropriately (and yes, doubt them appropriately). In the ‘heart attacks’ case I take this to mean that I should not expect a deep learning algorithm to discover a new feature space with more predictive power unless I can feed it substantively different data, data that is not constrained by medical researchers’ judgements about what is or isn’t ‘relevant’ or ‘valid’ patient data. That being said, the medical research community’s formulation of the feature space is certainly a useful one for ruling out proposed risk factors and is likely also still a somewhat useful one for predicting risk. It does make sense to trust the output of existing heart attack predictive algorithms as long as we foreground the understanding that these predictions may come with wide confidence intervals. In the diversity hiring case, I would trust an applicant profile screening algorithm to identify racially diverse applicants who resemble current successful employees; I just wouldn’t assume that those applicants will be automatically successful or particularly diverse along any dimension other than racial self-identification. I also suspect that applicants identified by this kind of algorithm would have just as easily been identified by traditional HR methods.