Aligning Artificial Intelligence with Human Values: Challenges from Computational Social Choice Theory
Submitted for PH3C5 (Philosophy of Computing and Artificial Intelligence) Project.
Artificial Intelligence systems (henceforth, “AIs”) differ fundamentally from traditional algorithms because their underlying logic is not explicitly written out or easily traceable. In conventional systems, biased parameters or backdoors may be attributed to developer intentions or neglect; in AIs, by contrast, outcomes can emerge unpredictably, and accountability for unintended consequences becomes correspondingly complicated. This is because modern AIs rely on machine learning: instead of operating on a fixed set of rules crafted by developers, they learn from large datasets and adjust their parameters autonomously to optimise for performance objectives using data and feedback (Russell & Norvig, 2021, p. 670). This process introduces an element of opacity, and with it the concern that AIs may diverge from human values or intentions in unforeseen ways. But this value alignment problem, of “achieving agreement between our true preferences and the objective we put into the machine” (Russell & Norvig, 2021, p. 23), raises a crucial question: how do we decide on the values or rules we encode?
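The gap that Russell and Norvig describe, between our true preferences and the objective we actually encode, can be made concrete in a short sketch. The scenario and functions below are invented purely for illustration and do not come from any of the cited texts: a system faithfully maximises the proxy objective it is given, and thereby misses what we really wanted.

```python
# Purely illustrative: a "true" preference that values engagement but heavily
# penalises polarisation, versus the proxy objective we actually encoded
# (engagement alone). Both functions are invented for this sketch.

def true_preference(action: float) -> float:
    engagement = action
    polarisation = action ** 2
    return engagement - 2.0 * polarisation

def proxy_objective(action: float) -> float:
    return action

# The machine optimises whatever objective it was given, not what we meant.
candidate_actions = [i / 100 for i in range(201)]  # actions in [0, 2]
chosen = max(candidate_actions, key=proxy_objective)
preferred = max(candidate_actions, key=true_preference)

print(f"Action chosen under the proxy objective: {chosen:.2f}")    # 2.00
print(f"Action we would actually have preferred: {preferred:.2f}")  # 0.25
```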
Computational social choice theory – the intersection of computer science and social choice theory, “the design and analysis of methods of collective decision making” (Chevaleyre, Endriss, Lang, & Maudet, 2007) – provides a relevant framework for addressing this challenge, as it offers structured methods for aggregating individual preferences in ways that could guide AIs towards alignment with human interests.
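To make the idea of aggregating individual preferences concrete, the sketch below implements one standard rule from the social choice literature, the Borda count, over a small hypothetical profile of user rankings. The rule and the profile are chosen purely for illustration and are not drawn from the papers discussed below.

```python
# Toy preference aggregation: the Borda count over hypothetical user rankings.
from collections import defaultdict

def borda(profile: list[list[str]]) -> dict[str, int]:
    """Score each alternative: a voter's top choice of m alternatives gets m-1
    points, the next gets m-2, and so on; scores are summed across voters."""
    scores: dict[str, int] = defaultdict(int)
    for ranking in profile:
        m = len(ranking)
        for position, alternative in enumerate(ranking):
            scores[alternative] += (m - 1) - position
    return dict(scores)

# Three hypothetical users ranking three candidate AI behaviours.
profile = [
    ["cautious", "neutral", "permissive"],
    ["permissive", "neutral", "cautious"],
    ["neutral", "cautious", "permissive"],
]
print(borda(profile))  # {'cautious': 3, 'neutral': 4, 'permissive': 2}
```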
One paper in this line of research is AI Alignment and Social Choice: Fundamental Limitations and Policy Implications (Mishra, 2023). Mishra draws on foundational results in social choice theory to examine limitations inherent in Reinforcement Learning from Human Feedback (RLHF), the preeminent method for mitigating harms from current AIs, for example by ensuring that Large Language Models (LLMs) comply with user instructions and avoid biased outputs. In particular, he uses Arrow’s and Sen’s impossibility theorems to argue that RLHF is not a scalable solution for achieving universal AI alignment: no democratic voting rule can simultaneously satisfy fairness, respect for individual preferences, and alignment with diverse user values without imposing a dictatorial outcome or violating personal ethical preferences. Mishra therefore concludes that, rather than aiming for universally aligned AI, developers should be incentivised to work on smaller models tailored to a narrower set of homogeneous users, where individual preferences can be better accommodated and transparent voting rules can enhance accountability.
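The kind of difficulty that Arrow’s theorem formalises can be seen in the classic Condorcet cycle. The sketch below is my own illustration rather than code from Mishra’s paper: with three hypothetical voters, pairwise majority voting produces a cycle, so there is no coherent social ranking to hand to an AI.

```python
# The standard Condorcet example: pairwise majority voting yields a cycle.
from itertools import permutations

profile = [
    ["A", "B", "C"],  # voter 1 prefers A > B > C
    ["B", "C", "A"],  # voter 2 prefers B > C > A
    ["C", "A", "B"],  # voter 3 prefers C > A > B
]

def majority_prefers(x: str, y: str) -> bool:
    """True if a strict majority of voters rank x above y."""
    wins = sum(ranking.index(x) < ranking.index(y) for ranking in profile)
    return wins > len(profile) / 2

for x, y in permutations("ABC", 2):
    if majority_prefers(x, y):
        print(f"A majority prefers {x} to {y}")
# Output: A beats B, B beats C, and C beats A, which is a cycle.
```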
Similar challenges are explored in Aligned with Whom? Direct and Social Goals for AI Systems (Korinek & Balwit, 2024). The authors frame the AI alignment problem in terms of the principal-agent problem: the potential conflict between an entity delegating a task (the “principal”) and the entity charged with carrying it out (the “agent”), which may pursue its own priorities contrary to the principal’s interests. In this framing, the AI is the agent. The alignment problem then consists in identifying the principal’s desired goals, conveying them to the AI, and ensuring that the agent takes the actions that correctly implement the transmitted goals. But the question remains: who is the principal? In the authors’ terminology, direct alignment is when the principal is the operator or owner of the AI, and alignment ensures it pursues that individual’s specific goal, such as a social networking entrepreneur seeking to maximise user engagement at the expense of increasing political polarisation and reducing social welfare. Social alignment instead treats society as a whole, or a representative aggregation of society’s values, as the principal, so that the AI acts in a way that benefits the broader collective.
In the language of this latter paper, Mishra can be interpreted as suggesting that social alignment is impossible because we cannot reliably identify society’s desired goals, since aggregating individual preferences cannot produce a full set of rational social preferences to convey to a general AI; and that we should instead focus alignment efforts on the direct alignment of more specialised systems. Korinek and Balwit are aware of these issues, but nevertheless write, “However, that [Arrow’s theorem] does not imply that we need to give up on social alignment entirely.” They maintain that we can work within the constraints of social choice theory to achieve partial social alignment, identifying the most important areas where there is broad consensus in society despite disagreement over how to rank choices. Though the authors do not say so explicitly, it is clearly in everybody’s interest to avoid extinction; and it will also be generally agreed, they claim, that loss of life and harmful discrimination should be avoided. AIs should respect the resulting partial ordering of social preferences, the minimum hierarchy of widely agreed-upon values. The authors concede that there will inevitably be “situations in which society genuinely disagrees, so social preferences do not provide instructions for what a socially aligned AI should do.” In such cases, they suggest that AIs should adhere to a rights-based approach, embedding respect for fundamental rights as a fallback when social preferences do not provide clear guidance.
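One simple way to picture such a partial ordering, offered here as my own illustration rather than the authors’ construction, is to retain only those pairwise comparisons on which every individual agrees and to leave contested pairs unordered. The outcomes and rankings below are hypothetical.

```python
# Keep only the comparisons on which every individual agrees: the unanimity
# relation, which in general yields only a partial order.
from itertools import combinations

# Hypothetical individual rankings over four outcomes (best to worst).
rankings = [
    ["avoid extinction", "avoid loss of life", "policy X", "policy Y"],
    ["avoid extinction", "avoid loss of life", "policy Y", "policy X"],
    ["avoid extinction", "policy X", "avoid loss of life", "policy Y"],
]

def unanimous_pairs(rankings):
    """Return pairs (x, y) such that every individual ranks x above y."""
    outcomes = rankings[0]
    pairs = []
    for x, y in combinations(outcomes, 2):
        if all(r.index(x) < r.index(y) for r in rankings):
            pairs.append((x, y))
        elif all(r.index(y) < r.index(x) for r in rankings):
            pairs.append((y, x))
    return pairs

for better, worse in unanimous_pairs(rankings):
    print(f"Everyone ranks '{better}' above '{worse}'")
# 'avoid extinction' is unanimously ranked above everything, but the two
# policies remain unordered relative to each other: only a partial order.
```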
However, Mishra’s analysis raises challenges for this rights-based approach. He invokes Sen’s impossibility theorem, which demonstrates that a social choice mechanism cannot simultaneously satisfy minimal respect for individual rights and Pareto efficiency (the requirement that if everyone prefers one option over another, the aggregation ranks the favoured option higher). In this context, individual rights refer to “protected domains”: areas of personal choice where an individual’s preferences are upheld without interference from others. Sen’s principle of minimal liberalism holds that each person should have autonomy over certain private matters, provided they are harmless to others; for instance, society should permit me to paint my walls pink even if a majority of the community would prefer me to paint them white. Users of AIs may have similar personal ethical boundaries and expect AIs to respect their preferences on issues within their protected domain that they consider harmless. But Sen’s theorem implies that it is impossible to respect such harmless preferences in protected domains for multiple individuals while also achieving Pareto efficiency.
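Sen’s result can be seen concretely in his classic “Lady Chatterley’s Lover” example. The sketch below follows the standard textbook presentation (the code is my own, not Mishra’s): granting each individual decisiveness over their protected domain while enforcing Pareto efficiency forces a cyclic, and hence incoherent, social preference.

```python
# Sen's example: a prudish reader, a lewd reader, and one controversial book.
# Outcomes: the prude reads it, the lewd reader reads it, or nobody reads it.

PRUDE_RANKING = ["nobody reads", "prude reads", "lewd reads"]  # best to worst
LEWD_RANKING = ["prude reads", "lewd reads", "nobody reads"]   # best to worst

def prefers(ranking, x, y):
    """True if the ranking places x strictly above y."""
    return ranking.index(x) < ranking.index(y)

social = []  # social strict preferences (x over y) forced by the two axioms

# Minimal liberalism: each person is decisive over the pair in their own
# protected domain (whether they themselves read the book).
if prefers(PRUDE_RANKING, "nobody reads", "prude reads"):
    social.append(("nobody reads", "prude reads"))
if prefers(LEWD_RANKING, "lewd reads", "nobody reads"):
    social.append(("lewd reads", "nobody reads"))

# Pareto efficiency: if both individuals rank x above y, so must society.
for x in PRUDE_RANKING:
    for y in PRUDE_RANKING:
        if x != y and prefers(PRUDE_RANKING, x, y) and prefers(LEWD_RANKING, x, y):
            social.append((x, y))

print(social)
# The forced preferences cycle: nobody reads > prude reads > lewd reads >
# nobody reads, so no consistent social ranking satisfies both axioms.
```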
This analysis has not focussed on the technical problem of how we convey our values to AIs, but on how societal preferences might be aggregated into the values we choose. Korinek and Balwit suggest that while democratic methods cannot fully aggregate social preferences into a coherent set of rules, certain shared values form a partial ordering of social preferences; and in cases of irreconcilable disagreement, AIs should ensure they respect fundamental rights. However, Mishra’s critique based on Sen’s impossibility theorem indicates that respecting individual rights across multiple users inevitably conflicts with Pareto efficiency, limiting the feasibility of universal social alignment. Instead, we should focus on direct alignment, with individually tailored AIs serving diverse user interests – while mitigating potential externalities – within our pre-existing liberal framework.
Bibliography
Chevaleyre, Y., Endriss, U., Lang, J., & Maudet, N. (2007). A Short Introduction to Computational Social Choice. In SOFSEM 2007: Theory and Practice of Computer Science (pp. 51-69). Berlin, Heidelberg: Springer.
Korinek, A., & Balwit, A. (2024). Aligned with Whom? Direct and Social Goals for AI Systems. In J. B. Bullock, Y.-C. Chen, J. Himmelreich, V. M. Hudson, A. Korinek, M. M. Young, & B. Zhang, The Oxford Handbook of AI Governance (pp. 65-85). Oxford University Press.
Mishra, A. (2023, October). AI Alignment and Social Choice: Fundamental Limitations and Policy Implications. Retrieved from arXiv: https://arxiv.org/pdf/2310.16048
Russell, S., & Norvig, P. (2021). Artificial Intelligence: A Modern Approach. Harlow: Pearson.
Result
Mark: 88% (High 1st)
Feedback:
This project considers the value alignment problem for artificial intelligence in light of results from social choice theory such as Arrow and Sen's impossibility theorems. This is an interesting and valuable connection to make. And the exposition is particularly cogent and well-structured.
If you choose to develop this into a longer essay, you might consider investigating whether the framework presented in Mishra (2023) -- which is still a preprint -- has been developed any further.