Selection, sampling, reporting bias
Bias of an estimator
Inductive bias
Of course, these raise ethical issues, too
Unjustified basis for differentiation
Practical irrelevance
Moral irrelevance
It is domain specific
Concerned with important opportunities that affect people’s life chances
It is feature specific
Concerned with socially salient qualities that have served as the basis for unjustified and systematically adverse treatment in the past
Extends to marketing and advertising; not limited to final decision
This list sets aside the complex web of laws that regulates the government
Disparate Treatment
Formal
or
Intentional
Disparate Impact
Unjustified
or
Avoidable
Formal: explicitly considering class membership
Even if it is relevant
Intentional: purposefully attempting to discriminate without direct reference to class membership
Pretext or ‘motivating factor’
1. Plaintiff must first establish that decision procedure has a disparate impact
‘Four-fifths rule’
2. Defendant must provide a justification for making decisions in this way
‘Business necessity’ and ‘job-related’
3. Finally, plaintiff has opportunity to show that defendant could achieve same goal using a different procedure that would result in a smaller disparity
‘Alternative practice’
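The four-fifths rule in step 1 is simple arithmetic: a group's selection rate is compared with that of the most-selected group, and a ratio below 0.8 is generally treated as initial evidence of disparate impact. A minimal sketch follows; the counts are hypothetical, and the result is a screening heuristic rather than a legal conclusion.

```python
def selection_rate(selected, total):
    """Fraction of a group's applicants who received the favorable outcome."""
    return selected / total

def impact_ratio(rate_group, rate_reference):
    """Ratio of a group's selection rate to the reference group's rate."""
    return rate_group / rate_reference

# Hypothetical counts, invented for illustration only.
rate_a = selection_rate(selected=48, total=80)  # reference (highest) rate
rate_b = selection_rate(selected=27, total=60)
ratio = impact_ratio(rate_b, rate_a)

# Below four-fifths: the plaintiff's step-1 showing would typically be met.
flagged = ratio < 0.8
```

Steps 2 and 3 (justification and alternative practice) are not computations and are not modeled here.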
Procedural fairness
Equality of opportunity
Distributive justice
Minimized inequality of outcome
Ricci v. DeStefano
Texas House Bill 588
Automated underwriting increased approval rates for minority and low-income applicants by 30% while improving the overall accuracy of default predictions
Gates, Perry, Zorn (2002)
and that bias still affects formal assessments
McKay, McDaniel (2006)
Only what the data supports?
Withhold protected features?
Automate decision-making, thereby limiting discretion?
Police records measure “some complex interaction between criminality, policing strategy, and community-policing relations”
Lum, Isaac (2016)
Future observations of crime confirm predictions
Fewer opportunities to observe crime that contradicts predictions
Initial bias may compound over time
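The compounding mechanism can be sketched with a deliberately crude toy model. This is not the analysis of Lum and Isaac; the allocation rule, the counts, and the function name `simulate_patrols` are all assumptions made for illustration. Two districts have the same true crime rate, but only the patrolled district generates new records.

```python
def simulate_patrols(recorded, true_rate, rounds):
    """Each round, patrol the district with the most recorded crime.

    Only the patrolled district can add new records, so crime elsewhere
    goes unobserved and can never contradict the prediction.
    """
    recorded = list(recorded)
    for _ in range(rounds):
        target = max(range(len(recorded)), key=lambda i: recorded[i])
        recorded[target] += true_rate[target]
    return recorded

# Hypothetical starting records: district 0 is ahead by a single report.
final = simulate_patrols(recorded=[11, 10], true_rate=[5, 5], rounds=20)
share_district_0 = final[0] / sum(final)
```

Despite identical true rates, nearly all recorded crime ends up in the district that started one report ahead.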
Learn to predict hiring decisions
Learn to predict who will succeed on the job (e.g., annual review score)
Learn to predict how employees will score on objective measure (e.g., sales)
Features may be less informative or less reliably collected for certain parts of the population
A feature set that supports accurate predictions for the majority group may not for a minority group
Different models with the same reported accuracy can have a very different distribution of error across population
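A small illustration of the last point, using made-up labels and predictions: both hypothetical models below are right on 6 of 8 people, but one concentrates its errors in the majority group and the other in the minority group.

```python
import numpy as np

def accuracy_by_group(y, pred, a):
    """Return {group: accuracy within that group}."""
    y, pred, a = map(np.asarray, (y, pred, a))
    return {str(g): float((pred[a == g] == y[a == g]).mean()) for g in np.unique(a)}

# Hypothetical data: six people in group "a", two in group "b".
y = np.array([1, 0, 1, 0, 1, 0, 1, 0])
a = np.array(["a"] * 6 + ["b"] * 2)
pred_1 = np.array([0, 1, 1, 0, 1, 0, 1, 0])  # both errors fall in group "a"
pred_2 = np.array([1, 0, 1, 0, 1, 0, 0, 1])  # both errors fall in group "b"

overall_1 = float((pred_1 == y).mean())
overall_2 = float((pred_2 == y).mean())
by_group_1 = accuracy_by_group(y, pred_1, a)
by_group_2 = accuracy_by_group(y, pred_2, a)
```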
In many cases, making accurate predictions will mean considering features that are correlated with class membership
With sufficiently rich data, class memberships will be unavoidably encoded across other features
No self-evident way to determine when a relevant attribute is too correlated with proscribed features
Not a meaningful question when dealing with a large set of attributes
Discovering unobserved differences in performance
Skewed sample
Tainted examples
Coping with observed differences in performance
Limited features
Sample size disparity
Understanding the causes of disparities in predicted outcome
Proxies
Running example: Hiring ad for (fictitious?) AI startup
Note: random variables in the same probability space
Notation: $\mathbb{P}_a\{E\}=\mathbb{P}\{E\mid A=a\}.$
Score function is any random variable $R=r(X,A)\in[0,1].$
Can be turned into (binary) predictor by thresholding
Example: Bayes optimal score given by $r(x, a) = \mathbb{E}[Y\mid X=x, A=a]$
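A minimal sketch of both ideas for discrete features: the cell mean of $Y$ is used as an empirical stand-in for $\mathbb{E}[Y\mid X=x, A=a]$ (the true Bayes optimal score is a population quantity), and the score is then thresholded. The records and the threshold $0.5$ are assumptions for illustration.

```python
from collections import defaultdict

def empirical_score(records):
    """Estimate r(x, a) = E[Y | X=x, A=a] by the mean of Y within each (x, a) cell."""
    total = defaultdict(float)
    count = defaultdict(int)
    for x, a, y in records:
        total[(x, a)] += y
        count[(x, a)] += 1
    return {cell: total[cell] / count[cell] for cell in count}

def threshold(score, t):
    """Binary predictor: 1 when the score is at least t, else 0."""
    return 1 if score >= t else 0

# Hypothetical (x, a, y) observations, invented for illustration.
records = [(0, "a", 0), (0, "a", 1), (1, "a", 1),
           (0, "b", 0), (1, "b", 1), (1, "b", 1)]
r = empirical_score(records)
predictions = {cell: threshold(s, t=0.5) for cell, s in r.items()}
```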
Independence: $C$ independent of $A$
Separation: $C$ independent of $A$ conditional on $Y$
Sufficiency: $Y$ independent of $A$ conditional on $C$
Lots of other criteria are related to these
Require $C$ and $A$ to be independent, denoted $C\bot A$
That is, for all groups $a,b$ and all values $c$:
$\mathbb{P}_a\{C = c\} = \mathbb{P}_b\{C = c\}$
Sometimes called demographic parity, statistical parity
When $C$ is a binary $0/1$ variable, this means
$\mathbb{P}_a\{C = 1\} = \mathbb{P}_b\{C = 1\}$ for all groups $a,b.$
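On a finite sample, the binary case can be checked by comparing empirical acceptance rates. A minimal sketch with hypothetical decisions and group labels:

```python
import numpy as np

def acceptance_rates(c, a):
    """Return {group: empirical P(C = 1 | A = group)} for binary decisions c."""
    c, a = np.asarray(c), np.asarray(a)
    return {str(g): float(c[a == g].mean()) for g in np.unique(a)}

# Hypothetical decisions and group labels.
c = np.array([1, 0, 1, 1, 0, 0, 1, 0])
a = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])

rates = acceptance_rates(c, a)
gap = abs(rates["a"] - rates["b"])  # 0 would mean exact independence on this sample
```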
Approximate versions:
Post-processing: Feldman, Friedler, Moeller, Scheidegger, Venkatasubramanian (2014)
Training time constraint: Calders, Kamiran, Pechenizkiy (2009)
Pre-processing: Via representation learning — Zemel, Yu, Swersky, Pitassi, Dwork (2013) and Louizos, Swersky, Li, Welling, Zemel (2016); Via feature adjustment — Lum-Johndrow (2016)
Many more...
Ignores possible correlation between $Y$ and $A$.
In particular, rules out perfect predictor $C=Y.$
Permits laziness:
Accept the qualified in one group, random people in other
Allows trading false negatives for false positives.
Conflates desirable long-term goal with algorithmic constraint
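The laziness objection can be made concrete with invented data: the decisions below accept exactly the qualified people in group "a" and an arbitrary half of group "b". Acceptance rates match, so independence holds on this sample, yet the error rates differ sharply.

```python
import numpy as np

def rate(c, mask):
    """Mean of binary decisions c restricted to a boolean mask."""
    return float(c[mask].mean())

# Hypothetical data: Y is qualification, C is the decision.
y = np.array([1, 1, 0, 0, 1, 1, 0, 0])
a = np.array(["a"] * 4 + ["b"] * 4)
c = np.array([1, 1, 0, 0, 1, 0, 1, 0])  # careful in "a", arbitrary in "b"

accept = {g: rate(c, a == g) for g in ("a", "b")}
tpr = {g: rate(c, (a == g) & (y == 1)) for g in ("a", "b")}
fpr = {g: rate(c, (a == g) & (y == 0)) for g in ("a", "b")}
```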
Require $R$ and $A$ to be independent conditional on target variable $Y$,
denoted $R\bot A \mid Y$
That is, for all groups $a,b$ and all values $r$ and $y$:
$\mathbb{P}_a\{R = r\mid Y=y\} = \mathbb{P}_b\{R = r\mid Y=y\}$
Proposed in H, Price, Srebro (2016);
Zafar, Valera, Rodriguez, Gummadi (2016)
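For a score with finitely many values, separation can be checked empirically by comparing $\mathbb{P}_a\{R=r\mid Y=y\}$ across groups. The sketch below reports the largest gap; the data and the helper names are hypothetical, and every (group, label) cell is assumed to be non-empty.

```python
import numpy as np

def conditional_score_dist(r, a, y, group, label):
    """Empirical P(R = value | A = group, Y = label) for each observed score value."""
    r, a, y = map(np.asarray, (r, a, y))
    cell = r[(a == group) & (y == label)]
    return {float(v): float(np.mean(cell == v)) for v in np.unique(r)}

def separation_gap(r, a, y):
    """Largest difference in P(R = r | Y = y) between any two groups."""
    a, y = np.asarray(a), np.asarray(y)
    gap = 0.0
    for label in np.unique(y):
        dists = [conditional_score_dist(r, a, y, g, label) for g in np.unique(a)]
        for v in dists[0]:
            probs = [d[v] for d in dists]
            gap = max(gap, max(probs) - min(probs))
    return gap

# Hypothetical two-valued score for two groups.
a = np.array(["a"] * 6 + ["b"] * 6)
y = np.array([1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0])
r = np.array([0.8, 0.8, 0.2, 0.8, 0.2, 0.2, 0.8, 0.2, 0.2, 0.8, 0.2, 0.8])

gap = separation_gap(r, a, y)  # 0 would mean separation holds on this sample
```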
Optimality compatibility
$R=Y$ is allowed
Penalizes laziness
Incentive to reduce errors uniformly in all groups
Recall, neither of these is achieved by independence.
Method from H, Price, Srebro (2016):
Post-processing correction of score function
Post-processing: Any thresholding of $R$ (possibly depending on $A$)
No retraining/changes to $R$
Given score $R$, plot (TPR, FPR) for all possible thresholds
Look at ROC curve for each group
Feasible region: Trade-offs realizable in all groups
Given cost for (FP, FN), calculate optimal point in feasible region
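A simplified sketch of the first two steps, on hypothetical scores: compute the ROC points of each group over a threshold grid, then look for group-specific thresholds that land both groups on the same point. It searches deterministic thresholds only; randomizing between thresholds, which enlarges the feasible region, and the cost-based choice of the final point are omitted.

```python
import numpy as np

def roc_points(scores, y, thresholds):
    """Return one (FPR, TPR) pair per threshold, predicting 1 when score >= t."""
    scores, y = np.asarray(scores), np.asarray(y)
    points = []
    for t in thresholds:
        pred = scores >= t
        tpr = float(pred[y == 1].mean())  # true positive rate
        fpr = float(pred[y == 0].mean())  # false positive rate
        points.append((fpr, tpr))
    return points

thresholds = [0.25, 0.45, 0.65, 0.85]
# Hypothetical scores and outcomes for two groups.
roc_a = roc_points([0.9, 0.7, 0.4, 0.2], [1, 1, 0, 0], thresholds)
roc_b = roc_points([0.8, 0.6, 0.5, 0.3], [1, 0, 1, 0], thresholds)

# Threshold pairs (t_a, t_b) at which both groups sit at the same ROC point.
matches = [(ta, tb)
           for ta, pa in zip(thresholds, roc_a)
           for tb, pb in zip(thresholds, roc_b)
           if pa == pb]
```

Each matching pair satisfies separation for the resulting binary predictor while leaving the score $R$ itself unchanged.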
Optimality preservation: If $R$ is close to Bayes optimal, then the output of postprocessing is close to optimal among all separated scores.
This does not mean it's necessarily good!
Alternatives to post-processing:
(1) Collect more data.
(2) Achieve constraint at training time.
Fix function class ${\cal H}$ and loss function $\ell$, then solve \[ \min_{h\in{\cal H}}\mathbb{E}\ell(h(X, A), Y) \] \[ \text{ s.t. } h(X,A)\bot A\mid Y \]
Highly intractable.
Hence, consider moment relaxation of separation:
\[
\sigma_{RA}\sigma_{Y}^2 = \sigma_{RY}\sigma_{YA}
\]
where $\sigma_{UV}=\mathbb{E}(U-\mathbb{E}U)(V-\mathbb{E}V)$ is the covariance.
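The residual of this moment condition is easy to compute on a sample. The sketch below assumes $A$ is encoded numerically (here $0/1$) and uses hypothetical data; it confirms that the perfect predictor $R=Y$ has zero residual, while the score $R=A$ does not.

```python
import numpy as np

def cov(u, v):
    """Covariance E[(U - EU)(V - EV)] computed over the sample."""
    u, v = np.asarray(u, float), np.asarray(v, float)
    return float(np.mean((u - u.mean()) * (v - v.mean())))

def separation_moment_gap(r, a, y):
    """sigma_RA * sigma_Y^2 - sigma_RY * sigma_YA; zero when the relaxation holds."""
    return cov(r, a) * cov(y, y) - cov(r, y) * cov(y, a)

# Hypothetical data with A encoded as 0/1.
a = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y = np.array([1, 1, 0, 0, 1, 0, 0, 0])

gap_perfect = separation_moment_gap(y, a, y)  # score R = Y
gap_group = separation_moment_gap(a, a, y)    # score R = A
```

A penalty on the squared residual is one way such a condition could enter a training objective; that choice is an assumption, not something the slides specify.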
For the purpose of predicting $Y$,
we don't need to see $A$ when we have $R.$
Note: Sufficiency satisfied by Bayes optimal score $r(x,a)=\mathbb{E}[Y\mid X=x,A=a].$
Sufficiency implied by calibration by group:
\[
\mathbb{P}\{ Y = 1 \mid R = r, A = a \} = r
\]
Calibration by group can be achieved by
various standard calibration methods
(if necessary, applied for each group).
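Before recalibrating, it helps to measure how far a score is from calibration by group. A minimal sketch for a score with finitely many values, on hypothetical data: tabulate the empirical $\mathbb{P}\{Y=1\mid R=r, A=a\}$ and compare it with $r$.

```python
import numpy as np

def calibration_table(r, y, a):
    """Empirical P(Y = 1 | R = r, A = a) for each observed (score, group) pair."""
    r, y, a = map(np.asarray, (r, y, a))
    table = {}
    for g in np.unique(a):
        for v in np.unique(r[a == g]):
            cell = y[(a == g) & (r == v)]
            table[(float(v), str(g))] = float(cell.mean())
    return table

# Hypothetical scores, outcomes, and groups.
r = np.array([0.5, 0.5, 0.5, 0.5, 1.0, 1.0, 0.5, 0.5, 0.0, 0.0])
y = np.array([1, 0, 1, 0, 1, 1, 0, 1, 0, 0])
a = np.array(["a"] * 6 + ["b"] * 4)

table = calibration_table(r, y, a)
# Largest deviation |P(Y=1 | R=r, A=a) - r| over all cells.
max_deviation = max(abs(p - v) for (v, g), p in table.items())
```

For continuous scores the same idea is usually applied after binning; the bin width is then a further assumption.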
Given uncalibrated score $R$, fit a sigmoid function $S = \frac{1}{1+\exp(\alpha R + \beta)}$ against target $Y$
For instance by minimizing log loss $-\mathbb{E}[Y\log S + (1-Y)\log(1-S)]$
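A minimal sketch of this fit using plain gradient descent on the log loss. The optimizer, step size, iteration count, and data are all assumptions chosen for illustration; any convex solver would serve. With the parameterization above, $S$ increases in $R$ when $\alpha<0$.

```python
import numpy as np

def sigmoid_score(r, alpha, beta):
    """S = 1 / (1 + exp(alpha * R + beta)), the form used above."""
    return 1.0 / (1.0 + np.exp(alpha * r + beta))

def log_loss(s, y):
    """Empirical log loss -mean[Y log S + (1 - Y) log(1 - S)]."""
    s = np.clip(s, 1e-12, 1 - 1e-12)  # guard against log(0)
    return float(-np.mean(y * np.log(s) + (1 - y) * np.log(1 - s)))

def fit_sigmoid(r, y, lr=0.5, steps=5000):
    """Fit alpha and beta by gradient descent on the log loss."""
    r, y = np.asarray(r, float), np.asarray(y, float)
    alpha, beta = 0.0, 0.0
    for _ in range(steps):
        # Per-sample derivative of the loss with respect to alpha * r + beta.
        residual = y - sigmoid_score(r, alpha, beta)
        alpha -= lr * np.mean(residual * r)
        beta -= lr * np.mean(residual)
    return alpha, beta

# Hypothetical uncalibrated scores and binary outcomes.
r = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
y = np.array([0, 0, 1, 0, 0, 1, 1, 0, 1])

alpha, beta = fit_sigmoid(r, y)
initial_loss = log_loss(sigmoid_score(r, 0.0, 0.0), y)
fitted_loss = log_loss(sigmoid_score(r, alpha, beta), y)
```

For calibration by group, the same fit would be run separately on each group's scores and outcomes.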
Any two of the three criteria we saw are
mutually exclusive except in degenerate cases.
Proof pointed out by Shira Mitchell.
A counterexample for three-valued $Y$ is
due to Shira Mitchell and Jackie Shadlen.
Variants observed by
Chouldechova (2016);
Kleinberg, Mullainathan, Raghavan (2016).
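As one instance of the incompatibility claim, here is a standard short argument for the pair independence and sufficiency. It is offered as a sketch and is not necessarily the proof shown in the talk. Suppose $R\bot A$ and $Y\bot A\mid R$. Then for all $a, r, y$ with $\mathbb{P}\{R=r, Y=y\}>0$:
\[
\mathbb{P}\{A=a\mid R=r, Y=y\} = \mathbb{P}\{A=a\mid R=r\} = \mathbb{P}\{A=a\}
\]
The first equality is sufficiency and the second is independence. Hence $A\bot (R,Y)$, and in particular $A\bot Y$. So whenever $A$ and $Y$ are not independent (base rates differ across groups), independence and sufficiency cannot both hold.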
Poster session on Wed Dec 6th 6:30–10:30p @ Pacific Ballroom #74
ProPublica's main charge:
Black defendants face higher false positive rate.
Northpointe's main defense:
Scores are calibrated by group.
Corbett-Davies, Pierson, Feller, Goel, Huq (2017):
Neither calibration nor equality of false positive rates
rule out blatantly unfair practices.
Can we address the shortcomings of
independence, separation, sufficiency
with other criteria?
There's a fundamental issue...
All criteria we've seen so far are observational.
Passive observation of the world
No what-if scenarios or interventions
This leads to inherent limitations
Anything you can write down as a probability statement involving $X, A, C, Y.$
BTW, what we saw only used $A, C, Y.$
There are two scenarios with identical joint distributions,
but completely different interpretations for fairness.
In particular, no observational criterion can distinguish the two scenarios.
Answer to substantive social questions not
always provided by observational data.
This is part of what motivates causal reasoning.
Directed graphical model with extra structure
Structural equation: $V \leftarrow f_V(U, W, N_V)$
Describes how data is generated from independent noise variables $\{N_V\}$
Inspired by Pearl's analysis of Bickel's UC Berkeley sex bias study.
Gender bias in admissions explained by influence of gender on department choice.
Formally, assuming plausible causal graph,
only path from $A$ (gender) to decision goes through department
And, we decide that this is okay.
In Scenario II, only path from $A$ to $R^*$ goes through CS:
In Scenario I, there is a path from $A$ to $R^*$ through pinterest:
Structural equation: $V \leftarrow f_V(U, W, N_V)$
Intervention $\mathrm{do}(W\!\!\leftarrow\!\! w)$: Replace $W$ by $w$ in all structural equations
New structural equation: $V \leftarrow f_V(U, {\color{red}w}, N_V)$
Allows setting variables against their natural inclination.
Average causal effect of $A$ on score $R$
$\mathbb{E}[ R \mid \mathrm{do}(A=a) ] - \mathbb{E}[R \mid \mathrm{do}(A=b) ]$
Average causal effect in context $X=x$
$\mathbb{E}[ R \mid \mathrm{do}(A=a), X=x ] - \mathbb{E}[R \mid \mathrm{do}(A=b), X=x ]$
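The intervention and the average causal effect can be illustrated by simulating a toy structural causal model. Everything here is assumed for the example: the graph ($A\to X\to R$ plus $A\to R$), the linear equations, and the coefficients. The intervention replaces the structural equation for $A$ by a constant, and reusing the same seed holds the noise variables fixed across the two runs.

```python
import numpy as np

def simulate(n, rng, do_a=None):
    """Sample from a toy structural causal model with independent noise terms.

    Passing do_a replaces the structural equation for A by the constant do_a,
    which implements the intervention do(A <- do_a).
    """
    n_a, n_x, n_r = rng.random(n), rng.normal(size=n), rng.normal(size=n)
    a = (n_a < 0.5).astype(float) if do_a is None else np.full(n, float(do_a))
    x = 2.0 * a + n_x                  # X <- f_X(A, N_X)
    r = 0.5 * x + 1.0 * a + 0.1 * n_r  # R <- f_R(X, A, N_R)
    return a, x, r

# Same seed in both runs, so only the value of A differs.
_, _, r1 = simulate(10_000, np.random.default_rng(0), do_a=1)
_, _, r0 = simulate(10_000, np.random.default_rng(0), do_a=0)

# E[R | do(A=1)] - E[R | do(A=0)]; analytically 0.5 * 2.0 + 1.0 = 2.0 here.
ace = float(r1.mean() - r0.mean())
```

In practice the structural equations are unknown, which is why the estimate depends on causal assumptions rather than on simulation.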
But can we actually intervene on sensitive attributes (gender, race)?
Practically, generally speaking, no!
Is it conceptually possible and meaningful? Perhaps sometimes.
Consider proxies instead of underlying sensitive attributes
Kilbertus, Rojas-Carulla, Parascandolo, H, Janzing, Schölkopf (2017)
Closely related: Nabi, Shpitser (2017)
Interventions on proxies often more feasible:
What would've happened had I been of a different gender when applying to this job?
Leads to notion of counterfactual fairness in Kusner, Loftus, Russell, Silva (2017).
See talk at NIPS on Wednesday 4:50p, Hall C
Also, Russell, Kusner, Loftus, Silva (2017).
Poster session Wed 6:30p, Pacific Ballroom #191
Inspect meaning of features | No causal inference necessary |
Inspect paths in causal model | Qualitative causal understanding |
Estimate average causal effects | Causal inference and assumptions |
Estimate individual level counterfactuals | Strong quantitative causal understanding |
Insights often depend strongly on model and assumptions!
Idea: match similar units in treatment and control group
Use matching for estimating causal effect
Variety of techniques, such as propensity scores
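A minimal sketch of the matching idea, using 1-nearest-neighbor matching on a single covariate rather than propensity scores. The data and the function name `matched_effect` are hypothetical, and the estimate is only meaningful if the covariate captures all confounding.

```python
import numpy as np

def matched_effect(x_t, y_t, x_c, y_c):
    """Average effect on the treated via 1-nearest-neighbor matching.

    Each treated unit is paired (with replacement) with the control unit
    whose covariate value is closest, and outcome differences are averaged.
    """
    x_t, y_t, x_c, y_c = (np.asarray(v, float) for v in (x_t, y_t, x_c, y_c))
    nearest = np.abs(x_t[:, None] - x_c[None, :]).argmin(axis=1)
    return float(np.mean(y_t - y_c[nearest]))

# Hypothetical covariates and outcomes.
x_treated, y_treated = [1.0, 2.0, 3.0], [5.0, 6.0, 7.0]
x_control, y_control = [1.1, 2.2, 2.9, 10.0], [3.0, 4.0, 5.0, 0.0]

effect = matched_effect(x_treated, y_treated, x_control, y_control)
```

The control unit at $x=10$ is never used because no treated unit resembles it.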
Closely related to individual fairness.
Dwork-H-Pitassi-Reingold-Zemel (2011)
Assume task specific dissimilarity measure $d(x,x')$
Require similar individuals map to similar distributions over outcomes via map $M\colon{\cal X}\to\Delta({\cal O})$:
$D(M(x), M(x')) \le d(x, x')$
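The Lipschitz condition can be checked pairwise on a finite set of individuals. In the sketch below, total variation distance is chosen for $D$, individuals are described by a single number, and $d(x,x')=|x-x'|$; all three are assumptions for illustration, since choosing $d$ is the task-specific part.

```python
import numpy as np
from itertools import combinations

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * float(np.abs(np.asarray(p, float) - np.asarray(q, float)).sum())

def lipschitz_violations(xs, M, d, D=total_variation, tol=1e-12):
    """Index pairs (i, j) where D(M(x_i), M(x_j)) exceeds d(x_i, x_j)."""
    return [(i, j) for i, j in combinations(range(len(xs)), 2)
            if D(M(xs[i]), M(xs[j])) > d(xs[i], xs[j]) + tol]

xs = [0.0, 0.1, 0.9]            # hypothetical individuals
d = lambda x, y: abs(x - y)     # assumed dissimilarity measure

# Two hypothetical maps to distributions over outcomes {reject, accept}.
smooth = lambda x: [1.0 - x, x]                            # accept with probability x
step = lambda x: [0.0, 1.0] if x >= 0.05 else [1.0, 0.0]   # hard cutoff

ok = lipschitz_violations(xs, smooth, d)
bad = lipschitz_violations(xs, step, d)
```

The hard cutoff treats the nearly identical individuals `0.0` and `0.1` completely differently, which is exactly what the condition rules out.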
We'll barely even scratch the surface
See Hand (2010) for more.
Measurement affects scale of data
Nominal, ordinal, interval, ratio scales
How does scale affect the interpretation of statistical analyses?
Most of the scales used widely and effectively by psychologists are ordinal scales. In the strictest propriety the ordinary statistics involving means and standard deviations ought not to be used with these scales, for these statistics imply a knowledge of something more than the relative rank order of data. On the other hand, for this “illegal” statisticizing there can be invoked a kind of pragmatic sanction: in numerous instances it leads to fruitful results.
Explicit distinction between empirical relational system and numerical relational system
Formal representation results (e.g., isomorphism exists)
Example:
Often "pragmatic": Measurement procedure defines the concept
Latent variable models figure prominently (e.g., item-response models, Rasch models)
Establishing validity of measurement is difficult, and often subjective
Observational criteria can help discover discrimination, but are insufficient on their own.
No conclusive proof of (un-)fairness
Causal viewpoint can help articulate problems, organize assumptions
Social questions start with measurement
Human scrutiny and expertise irreplaceable
ML is domain-specific: We need to understand legal and social context
Besides inspecting models, scrutinize data and how it was generated
Besides static one-shot problems, study long-term effects, feedback loops, and interventions
Establish qualitative understanding of when/why ML is the right tool for the application
Establish understanding of what constitutes negligence