Evaluating ICC Performance: Design is Critical
The ICC should carefully apply social science methodology when devising performance indicators. Among other things, it needs to maintain a critical distinction between performance evaluation and impact assessment.
Commentary on the International Criminal Court is heavy with evaluation but light on method. Observers make harsh pronouncements about the Court’s cost, its pace, its conviction rate, and its bias.1 Rarely, though, are gripes about the ICC accompanied by a measured discussion of what we should expect.
The ICC is the first of its kind. That means there are no baselines for comparison. There is also no protocol, or set of best practices, for how to review its work. Yet the criticisms mount, creating a pressing need for internal performance review.
In December 2014, the Assembly of States Parties requested that the Court “intensify its efforts to develop qualitative and quantitative indicators that would allow the Court to demonstrate better its achievements and needs.”2 This move could be a sign that the Court’s honeymoon period, if it ever existed, is over, and it’s now time to start counting beans. The request could also be a subtle way to push the Court to demonstrate its worth in the face of political pressure.
The Court has obliged. Pursuant to the ASP’s request, the ICC is working with the Open Society Justice Initiative to develop indicators of Court performance. The pilot work is detailed in two 2016 reports. The Second Report,3 a twenty-page document on the development of indicators, is helpful in two ways: it provides insight into the Court’s thinking about how it should be evaluated, and it appends a fifty-page annex with raw data on the last three years of Court activity.
Clear from this report, and from the question posed to the ICC Forum penned by Head Prosecutor Fatou Bensouda, is that many methodological questions remain unanswered. How should the Court quantify fairness? How can it possibly evaluate whether its own leadership is effective? Can contextual factors be captured by quantitative measures or benchmarks?
Luckily, cutting-edge social science research offers answers to these kinds of questions. The answers always start in the same place: design of inquiry.
Methodical research design is not glamorous but is absolutely crucial—and social scientists watch from the sidelines as practitioners routinely mess it up. At its most fundamental level, research design is about matching the questions you want to ask to the methods you will use to answer those questions. It also requires thinking about the audience for your work. Most importantly, design means planning before you waste time collecting data or creating indicators.
How should the ICC do this work? I make four suggestions.
Suggestion One: Maintain a Distinction Between Performance Evaluation and Impact Assessment
The first step in designing an examination of ICC operations is to separate “performance evaluation” from “impact assessment.” Though these terms sound vaguely similar—or like corporate jargon plastered on a PowerPoint slide—there is a real and meaningful difference between the two concepts.
Evaluation determines “the worth of a thing,”4 which involves weighing “strengths and weaknesses of programs, policies, personnel, productions, and organizations to improve their effectiveness.”5 This lies in contrast to impact assessment, or the “tracking of change as it relates to an identifiable source or set of factors.”6 Where the first is about applying standards to facts, the second is about determining the causal power of an intervention.
To appreciate the difference, consider schools. Evaluating school performance means ensuring teachers are hired and compensated properly, students are treated fairly, the curriculum is chosen carefully, parents’ concerns are addressed, grades are awarded following equally applied procedures, and students’ standardized test scores meet certain thresholds. Closely monitoring a school’s performance in these areas is different from assessing the social impact of a school. Assessing social impact requires one to look outward: Do schools improve the neighborhoods in which they reside? Are schools helping students maintain gainful employment? Do school expulsion policies exacerbate incarceration? Evaluation is about the quality of internal performance; impact assessment is about external effects.
Embedded in the word evaluation is value. Evaluating the performance of a person or an organization means making judgments based on normative criteria about what is good. We do this all the time: humans are steady-state evaluators. That food server is good, that garbage truck is moving too slow, that government office is inefficient, and so on.
Impact assessment requires much more meticulous observation about what causes what. We know smoking leads to cancer based on painstaking medical research starting in the 1950s that established statistical correlations between addictive smoking and the incidence of lung cancer. Decades later, clinical studies established linkages between the inhalation of smoke and biological mutations in the body. We can be sure of the impact of smoking because scientists discovered correlations and then came to understand the biological mechanisms that produce this correlation. These two types of causal inference—statistical correlations and the identification of mechanisms—are essential to impact assessment.
When examining the ICC, observers exhibit two tendencies. The first is to conduct performance evaluation of the Court based on unspecified criteria. Trials are too long, or they are too costly, or the rights of the defendant are inadequate. However, criminal courts in the most developed countries in the world may be subject to the same criticisms. What is a good benchmark length for an investigation or an appeals process? What is the ideal expenditure-to-conviction ratio? Evaluating without stated baselines leaves analysis unanchored; and, many times, unmoored from reality.
The second tendency among observers is to blur the lines between performance evaluation and causal impact assessment or to evaluate based on impact. This amounts to thick consequentialism: the ICC is good if and only if it can stop violence on the ground or, at the very least, do no harm in a situation of interest. This argument is powerful because it simplifies complex interventions; nevertheless, the consequentialist standard is probably unhelpful for evaluation. No organization or set of interventions can produce an unblemished record of successes. No criminal court can deter violence in every instance.
This is why evaluation and impact assessment should remain distinct. An organization can perform well based on internal standards but have no impact on its external environment. Or it can perform poorly and have huge impacts. These are separate issues.
A 2015 Open Society Justice Initiative consultancy report on ICC performance does not recognize the distinction, urging the Court to create “impact indicators” in order to explore, among other things, “the Court’s legacy in the countries where it operates and beyond, including its deterrent effect.”7 The ICC seems not to have heeded this advice, which is for the better. In its two reports on performance, the Court emphasizes four criteria by which to evaluate its operations:
- that the proceedings are expeditious, fair, and transparent;
- that its leadership and management are effective;
- that it ensures adequate security for its work; and
- that victims have adequate access to the Court.
These criteria direct the Court inward, restricting it to evaluating operational conduct, and preclude judgment based strictly on the consequences of ICC interventions. This is good.
However, this list of goals, along with the Court’s work on indicators so far, combines a number of overlapping evaluative criteria. Some lend themselves to quantification and some do not. For instance, the first goal combines a concern for expeditiousness which could be measured temporally, with a concern for fairness which cannot be measured directly. This creates two potential problems. First, it invites arbitrary quantitative benchmarking; and, second, it introduces a number of overlapping and complex evaluative standards that must be meticulously parsed.
Suggestion Two: Carefully Design a Few Key Benchmarks
First, the ICC’s listed goals embody four areas of focus: trial proceedings, leadership, security, and victims. The Second Report presents fifty pages of tabled data.
The sheer number of indicators in these tables may impress a data scientist, but it is probably dizzying to most people. Under a section labeled “fairness and expeditiousness of proceedings,” the report includes summary information on seven different phases of every ongoing proceeding at the ICC: confirmation, trial preparation, trial, judgment, sentencing, reparations, and final appeal. Under the confirmation stage alone, there are ten indicators; and, if each phase is included, the total number of indicators comes to sixty-three. This is a lot, and it is not obvious why some of these indicators are useful for evaluation.
However, we should resist the urge to be too critical at this stage. The Court is still assembling all of this raw data in one place and workshopping ideas about how to aggregate this data into more substantial indicators of performance.8
In this process, those staffers tasked with producing indicators should keep a few elements of design in mind. First, less is more. Among methodologists, there is something known as Goodhart’s Law or the “tendency of a measure to become a target.”9 Indicators are a powerful technology.10 When we produce measures of performance, those measures come to shape the way tasks are performed. For instance, if students are admitted to university on the basis of the SAT, schools will start to orient their secondary education around improving SAT scores.
This same process could occur at the ICC. If we produce a battery of indicators about the speed of trial proceedings, and those indicators are leveraged as performance benchmarks, then it will create a strong incentive to hasten proceedings at every stage to improve trial expeditiousness scores. In this instance, speedier trials may be a desired outcome; but the more benchmarks that are produced, the higher the risk that doing the job of prosecuting war criminals will be reduced to checking boxes on a form. Though it should continue to collect as much data as possible, the ICC staff should construct and publicize only a few key quantitative indicators of performance. Doing so will preserve its agency and protect it from the mire of audit culture.
Second, in constructing a crucial few indicators, plan carefully. Start by drawing diagrams that break big concepts into component parts and link those parts to observable Court activities. This is called operationalizing a concept. Take, for example, the notion of effective leadership and management. What are the components of this concept? Based on the presentation of data in the 2016 Court report, the components appear to be: budget implementation, human resources, and staff diversity represented by geographic and gender balance. Within the geographical balance component category, many countries are listed as “under-represented” in staff.
There are two problems of operationalization in this construct. The first is that geographical balance is not intuitively related to effective leadership. Geographical balance, in my mind, is a component of representativeness not effectiveness. An argument could be made that these ideas are cousins; but, in current form, there is a mismatch between the parent concept and one of its components. The second issue is that the method for determining that certain countries are under-represented on staff remains unspecified. No clear link is drawn between the conceptual component—geographical balance—and its level of representation measure. A diagram outlining the logic of conceptualization and indicator choice, of the type often used in studies of democracy,11 could easily address these issues.
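One way to make the logic of such a diagram explicit is to write the concept tree down as data. The sketch below is a hypothetical illustration, not the Court's scheme: the component names follow the report as described above, while the indicator names, the `rule` field, and the `unspecified_links` helper are assumptions introduced here. It simply flags any indicator whose measurement rule has not been stated.

```python
# Hypothetical concept tree: parent concept -> components -> indicators.
# Indicator names and "rule" texts are illustrative assumptions, not taken from the report.
concept_tree = {
    "effective leadership and management": {
        "budget implementation": [
            {"indicator": "share of approved budget spent", "rule": "spent / approved"},
        ],
        "human resources": [
            {"indicator": "staff vacancy rate", "rule": "vacant posts / approved posts"},
        ],
        "geographical balance": [
            # No rule is given for deciding when a country counts as under-represented.
            {"indicator": "under-represented countries", "rule": None},
        ],
    }
}


def unspecified_links(tree):
    """Return (concept, component, indicator) triples that lack a measurement rule."""
    gaps = []
    for concept, components in tree.items():
        for component, indicators in components.items():
            for item in indicators:
                if not item["rule"]:
                    gaps.append((concept, component, item["indicator"]))
    return gaps


for gap in unspecified_links(concept_tree):
    print("missing measurement rule:", " -> ".join(gap))
```

Writing the tree out this way also makes a mismatch between a parent concept and one of its components easier to spot, because each component has to be placed under a named parent.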
Third, during the design process, don’t favor indicators just because they are easily quantified. One reason the Court may have chosen to present indicators of the staff’s geographical composition is that it’s easily measured and expressed in numbers. There are fourteen staff members from Italy. Showing this is much easier than trying to measure a concept as large as effective leadership. Even more difficult is measuring other outlined evaluative criteria like victim access or fairness of proceedings. For instance, in her question to the ICC Forum, Head Prosecutor Bensouda worries that the “subjectivity” of fairness “makes it an inherently difficult value to measure.”
This is a reasonable concern. But the inherent difficulty of measuring concepts like fairness or access does not mean that it cannot be done. Social scientists possess reliable indicators of highly complex phenomena, like state repression,12 democracy,13 and judicial independence.14 These are all built atop hundreds of subjective coding decisions, as are other regularly referenced measures of fairness. Electoral fairness, for instance, depends on the judgment of election monitors and monitoring agencies.15 How does one create usable indicators of such complex concepts?
The answer is the final point about design: when it comes to complex concepts or conceptual components, be inductive. To again reference the question written to the ICC Forum, Head Prosecutor Bensouda states, “Before fairness can be measured, there must be a shared understanding of what it means.” I understand the logic behind this statement, which reflects an impulse toward deductive reasoning: first we define and then we measure. However, it is not technically true that we must define something before we can measure it. Some interpretive concepts are innately unsuitable for top-down measurement. Just like Plato’s interlocutors in The Republic could not define justice, we probably can’t arrive at a universal definition of concepts like fairness.
What we can do is ask people (participants, staff, affected communities) whether they think certain proceedings are fair. This is an inductive process. It is possible to construct performance indicators by modeling people’s responses to survey questions alongside other information like expert evaluations. For example, Bayesian statistical models can estimate underlying or latent traits in a population based on various sources of available data. Used by the most well-designed data projects in the world,16 Bayesian models do not assume a proper or universal definition of various concepts. Instead, they take thousands of individually recorded judgments and use them to generate estimates. In some ways, performing Bayesian statistics is the mathematical equivalent of analyzing connotation.
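As a deliberately simplified sketch of this inductive idea, the example below pools yes/no survey answers about whether a proceeding was fair and uses them to update a Beta prior. The Beta-Binomial model is an assumption chosen for brevity and is far simpler than the latent-variable models referred to above; the respondent groups and counts are invented for illustration.

```python
import random


def beta_posterior(fair_votes, total_votes, prior_a=1.0, prior_b=1.0):
    """Conjugate update: Beta(prior_a, prior_b) prior plus binomial survey data."""
    return prior_a + fair_votes, prior_b + (total_votes - fair_votes)


def summarize(a, b, draws=20000, seed=0):
    """Posterior mean (exact) and a 95% credible interval (Monte Carlo estimate)."""
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    lower = samples[int(0.025 * draws)]
    upper = samples[int(0.975 * draws)]
    return a / (a + b), lower, upper


# Invented survey counts: (respondents answering "fair", respondents asked).
survey = {
    "defense counsel": (18, 40),
    "judges and staff": (45, 60),
    "outside experts": (30, 50),
}

for group, (fair, total) in survey.items():
    a, b = beta_posterior(fair, total)
    mean, lower, upper = summarize(a, b)
    print(f"{group}: mean={mean:.2f}, 95% interval=({lower:.2f}, {upper:.2f})")
```

No definition of fairness is supplied anywhere in the code; the estimate is built entirely from recorded judgments, and the interval narrows as more respondents are surveyed.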
At relatively low cost, the Court could seek out high-level statisticians to conduct surveys and build, from the bottom up, indicators of complex concepts like trial fairness or transparency. This could be part of a much-needed larger strategy of “reckoning with complexity” in international criminal law.17 But to do so, the Court will first have to address a third issue of design: who is the audience?
Suggestion Three: Know Your Audience
Among the four goals outlined by the Court lurk six big evaluative criteria: trial expeditiousness, trial fairness, transparency of proceedings, leadership effectiveness, security, and victim access. If one is to assess perceptions of how fair or accessible the ICC is, one must first ask: fair or accessible to whom?
This is where evaluation gets political. Who is the Court’s master? For whom is this performance evaluation meant? Whose perceptions of fairness should count?
As I see it, the ICC has five important audiences:
- The Assembly of States Parties
- Outside jurists and experts
- Judges, staff, and counsel who have direct experience with the Court
- Compliance partners, who are actors within states that have the power to promote international criminal accountability18
- Victims and affected populations in situation countries
If the ICC intends to build indicators based on a combination of assembled data and stakeholder feedback, as I have suggested, it would be helpful to match evaluative aims to the audience of interest. For example, it strikes me that those most qualified to answer questions about expeditiousness and effective leadership are Groups 2 and 3. The same goes for trial fairness. It would be most enlightening to know how defense counsel perceives Court proceedings, in comparison to judges, staff, and observers. Groups 3 and 4 are probably in the best position to answer survey questions about operational security. These evaluations would be easy to implement because they require only that the ICC survey its own employees, or those with whom it regularly interacts.
The promise of survey-based responses is greatest in relation to Groups 4 and 5 that include compliance partners and victims. While knowing what staff and outside experts think of victim access could certainly yield interesting results, the crucial evaluations should come from those people in situation countries directly affected by ICC interventions.
There are two ways to meaningfully access the affected population. The first is to interview or survey victim participants or local ICC partners. Good work of this type is already underway. A study by the Human Rights Center at the Berkeley School of Law was based on interviews with 622 victims registered with the ICC. Among other things, it found that victims want more contact with the Court; and that they possess “insufficient knowledge to make informed decisions about their participation in ICC cases.”19 Another study performed by the ICTY in coordination with independent experts at University of North Texas asked tribunal witnesses to evaluate their experience testifying as well as their perceptions of ICTY effectiveness, administration of justice, and fairness. The results are surprisingly positive, with a majority of witnesses reporting that they think the ICTY has done a “good job.”20 Because these two studies are directly relevant to evaluative criteria being considered by the ICC, the Court might do well to borrow from their approach.
It will also be necessary to conduct random surveys of the wider population, especially in those areas being examined or investigated. Based on its listed evaluative criteria, the ICC seems particularly concerned about victim access. This means addressing the following question: “Does the population of victims in a situation country have sufficient opportunity to engage with the Court?” Based on a sample of already-registered victims, we cannot know how many other victims were denied access or were generally unaware of the ICC’s involvement. Only by randomly drawing samples of the population at large can we determine how many people in the wider population were victimized but did not access the Court. While more difficult, this kind of work is certainly possible. For example, The Hague Institute for Global Justice conducted a random survey in four regions most devastated by 2007–2008 election violence in Kenya. The evidence shows that around half of the respondents witnessed or were victimized by violence; members of that half were much more favorable to the Court than the average Kenyan respondent.21
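The arithmetic behind this kind of estimate can be sketched briefly. The example assumes a simple random sample in which each respondent reports whether they were victimized and whether they engaged with the Court. The field names, helper functions, and responses are all hypothetical, and a real survey would need a proper sampling design and weights.

```python
import math


def proportion_with_ci(successes, n, z=1.96):
    """Sample proportion with a normal-approximation 95% confidence interval."""
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - z * se), min(1.0, p + z * se)


def access_gap(responses):
    """Among respondents reporting victimization, estimate the share with no Court contact."""
    victims = [r for r in responses if r["victimized"]]
    no_access = sum(1 for r in victims if not r["engaged_with_court"])
    return proportion_with_ci(no_access, len(victims))


# Hypothetical responses: 200 people, 100 victimized, 20 of whom engaged with the Court.
responses = (
    [{"victimized": True, "engaged_with_court": True}] * 20
    + [{"victimized": True, "engaged_with_court": False}] * 80
    + [{"victimized": False, "engaged_with_court": False}] * 100
)

share, low, high = access_gap(responses)
print(f"victims without access: {share:.0%} (95% CI {low:.0%} to {high:.0%})")
```

A sample drawn only from registered victims would, by construction, contain no respondents in the "victimized but no contact" category, which is why the estimate requires sampling from the population at large.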
While survey research is costly, academic institutions and research partners can shoulder some of the financial burden. Moreover, survey-based evaluations will be an increasingly valuable investment over time for three reasons. First, as mentioned before, they are more flexible because they do not rely on top-down definitions of fuzzy concepts like fairness or effectiveness. Thus, survey evaluations can be deployed relatively quickly without waiting for groups of people to agree on definitions. Second, once designed, surveys can be re-used to attain data from anonymous respondents in various audiences across contexts. One could use the same survey in many different countries. Third, surveys help produce comparative benchmarks. If the ICC wants to create performance indicators that can serve as a guideline to future practice, it is absolutely necessary to establish comparable baselines. What is an appropriate fairness score? Is trial fairness improving? These questions can only be answered if there is more than one data point produced by administering the same data-generating instrument—repeating surveys—at different points in time.
Some people may be inclined to interrupt here. If the evaluative ideal I’m outlining were reached, the ICC would possess rigorously designed performance indicators that draw on feedback from important audiences. It might also convert these indicators into benchmarks against which it continually evaluates its own performance over time. However, a skeptic could claim this all amounts to little more than navel-gazing. Even if the ICC performs its work well, it does not mean that it has a positive impact on conflict-affected countries or on global politics as a whole.
According to the Second Report, civil society urged the ICC to “give serious attention to the development of indicators that measure and facilitate improvement in achieving a broader sense of impact in situation countries on the ground.”22 Shouldn’t the Court really focus its energies on assessing its broader impact on the deterrence of atrocity, on reconciliation, or on peace?
Suggestion Four: Assist, But Do Not Conduct Impact Assessment
My answer is: No. The ICC should not perform impact assessments, which should be kept separate from performance evaluation. Furthermore, it is good that the Court has so far approached this issue with caution, promising to consider impact assessment in the future but ultimately not moving on it. Why?
First, expecting a justice institution to assess its own impact is abnormal. The US Supreme Court does not publish research on how its decisions affect society. The Department of Justice conducts audits of DOJ operations to root out misconduct, and it also publishes statistical reports through the Bureau of Justice Statistics; but it is not tasked with assessing the larger impact of its operations on the deterrence of crime or recidivism. This is a good model for the ICC. Evaluate performance and furnish scholars and observers with statistics, but do not perform causal studies.
Second, impact assessments conducted by the ICC would likely be biased toward positive results. That is not to challenge the integrity of ICC leadership or staff; it is only to recognize that, especially when funding or support is at stake, it is nearly impossible to remain objective.
And third, impact assessment is very hard and requires expert training in causal inference, time, and investment. That kind of work should be left to social scientists.
Academic researchers, using both advanced qualitative and quantitative methods, are already producing a wealth of new ICC impact studies. These can be split into three types: those that focus on the legal effects of ICC interventions, those that focus on the Court’s deterrence of atrocity crimes, and those that focus on the impact of the ICC on political violence.
With regard to legal impacts, evidence suggests that actors operating in the shadow of the Court change behavior to appear compliant with international criminal law. The Colombian government made many alterations to its Special Peace Jurisdiction because of the OTP’s monitoring during an extended preliminary examination.23 Sarah Nouwen notes how Sudan and Uganda both established special courts to try atrocity crimes, but ultimately argues that these courts were established to under-perform.24 In this, she sees a blind spot in the complementarity regime.
Other studies show that the ICC can have more indirect legal impacts. In a forthcoming article, Florencia Montal and I find that ICC investigations are statistically correlated with more prosecutions of state agents—like police and security forces—for human rights crimes. To show this, we used a statistical matching procedure that compares countries with ICC investigations to similar countries that are not subject to ICC intervention. There is no legal reason to expect a relationship between ICC investigations and domestic human rights prosecutions. The latter are normally for crimes that do not reach the level of atrocity and are technically outside of the ICC’s jurisdiction. However, the correlation between investigation and human rights trials is strong. The reason is that domestic reformer coalitions are emboldened by the ICC presence in a country. They lobby for more local accountability, they push for judicial reform, and they file more legal cases. The government responds with more prosecutions. Because this was an unforeseen impact of ICC intervention, we call it “unintended positive complementarity.”25
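The general idea of matching can be illustrated with a toy example. This is a generic nearest-neighbor sketch (one-to-one, with replacement, on a single covariate) and not the procedure used in the study described above; the units, the covariate, and the prosecution counts are invented.

```python
def nearest_neighbor_match(treated, controls, on="covariate"):
    """Pair each treated unit with the control closest on one covariate (with replacement)."""
    pairs = []
    for unit in treated:
        match = min(controls, key=lambda c: abs(c[on] - unit[on]))
        pairs.append((unit, match))
    return pairs


def average_difference(pairs, outcome="prosecutions"):
    """Mean outcome gap between treated units and their matched controls."""
    gaps = [t[outcome] - c[outcome] for t, c in pairs]
    return sum(gaps) / len(gaps)


# Invented data: "covariate" stands in for something like a prior conflict-intensity score.
treated = [
    {"name": "A", "covariate": 0.8, "prosecutions": 12},
    {"name": "B", "covariate": 0.5, "prosecutions": 7},
]
controls = [
    {"name": "X", "covariate": 0.75, "prosecutions": 6},
    {"name": "Y", "covariate": 0.45, "prosecutions": 4},
    {"name": "Z", "covariate": 0.10, "prosecutions": 1},
]

pairs = nearest_neighbor_match(treated, controls)
for t, c in pairs:
    print(f"{t['name']} matched to {c['name']}")
print("average difference in prosecutions:", average_difference(pairs))
```

Matching only balances the covariates it is given, so a difference computed this way remains a correlation unless the matched units are comparable in every other relevant respect.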
New research also yields a much more nuanced understanding of atrocity-crime deterrence. Early research on the deterrence question was primarily hypothetical or based squarely on theoretical models. Much of it argued that the ICC could not possibly deter atrocities because those committing such offenses would be insensitive to the prospect of punishment.26
However, scholarly impact assessors are using statistical analysis of observed data to challenge excessively rationalist accounts of deterrence. Ben Appel finds that average levels of repressive violence decrease in states after they ratify the Rome Statute, and that repressive violence is also lower in those states than in states that have not ratified it.27 Courtney Hillebrecht presents evidence that government-sponsored killing decreased in Libya following the referral to the ICC.28 And Jo and Simmons discover that, among all civil war states, Rome Statute ratification is associated with roughly 50% fewer civilian killings by state governments. Furthermore, direct intervention by the ICC is associated with almost 60% fewer targeted killings by both government and rebel forces. The authors conclude that violent actors change behavior not only because they fear legal punishment but also because of the informal sanctions associated with being targeted by the ICC. The latter process they call “social deterrence.”
These findings should not be taken as evidence that ICC involvement in a country is universally positive. Other data scientists discover little relationship between ICC intervention and violence.29 The ICC also has mixed impacts on larger processes of organized political violence. The Court was not established to end war, but it regularly gets involved in civil war states. Researchers argue convincingly that the ICC affects the resolution of civil war in a non-linear fashion. For instance, Mark Kersten contends that ICC arrest warrants in Uganda encouraged LRA leaders to come to the negotiating table; but, because the warrants could not be dissolved, they also stood in the way of a peace settlement.30 Hillebrecht and Straus contend that state leaders simply use the Court to constrain their main political opponents.31 And other research suggests that the ICC’s impact on political violence might change with its stages of involvement, varying between preliminary examinations, investigation, and trial phases.32 Much remains to be explored.
This is just a brief survey, but one thing seems certain. For all its faults and missteps, the ICC has definitely sent shockwaves through global society. It does alter political behavior. This raises a puzzle that circles back to the issue of performance indicators: Are the ICC’s on-the-ground impacts dependent on how well it performs its functions? Many critical voices charge that the Court has fallen far short of its operational expectations; yet new impact studies suggest that the Court exerts measurable effects on legal developments, patterns of violence, and political conflict. What does it mean if the ICC has these impacts despite sub-par performance?
It’s quite possible that, so far, the ICC is more important for what it is than what it actually does. The mere existence of the Court sends resonant signals of accountability across the globe. A second possibility is that political actors will adjust, learn from Kenya’s obstructionist example, and begin exploiting the ICC’s shortcomings.
It’s not the job of the ICC itself to sort out these possibilities. Instead, the Court must focus on improving its own practices. It needs to look inward and design well-conceived performance indicators. In terms of knowledge, this will produce increasing returns. As time goes on, trained impact assessors can use reliable indicators to establish whether effective or fair Court operations yield greater impacts on the ground and ultimately contribute to a better world.
Endnotes
For examples, see International Criminal Court: 12 Years, $1 Billion, 2 Convictions, Forbes, Mar. 12, 2014, available online; Why is the International Criminal Court so Bad at Prosecuting War Criminals?, Wilson Q. (Jun. 15, 2015), available online.
Program Evaluation, in The International Encyclopedia of Education Evaluation 42 (Herbert J. Walberg & Geneva D. Haertel eds., 1990).
Monitoring Democracy: When International Election Observation Works, and Why It Often Fails (2012).
The New Terrain of International Law: Courts, Politics, Rights 53 (2014).
In the Shadow of the ICC: Colombia and International Criminal Justice (May 26, 2011), available online; From Law versus Politics to Law in Politics: A Pragmatist Assessment of the ICC’s Impact, 32 Am. U. Int’l L. Rev. 645 (2016), Lexis/Nexis paywall.
Complementarity in the Line of Fire: The Catalysing Effect of the International Criminal Court in Uganda and Sudan (Nov. 7, 2013).
Unintended Positive Complementarity: Why International Criminal Court Investigations Increase Domestic Human Rights Prosecutions, Am. J. Int’l L. (forthcoming 2017), SSRN paywall. Earlier version (Jan. 20, 2015), available online, archived.
For reviews of the literature, see The ICC and the Prevention of Atrocities: Criminological Perspectives, 17 Hum. Rts. Rev. 286 (2016), SpringerLink paywall. Earlier version: The Hague Institute for Global Justice, Working Paper 8 (Apr. 2015), available online, archived; Punishing Genocidaires: A Deterrent Effect or Not?, 8 Hum. Rts. Rev. 319 (2007), SpringerLink paywall.
International Tribunals and Human Security (2016).
Justice In Conflict: The Effects of the International Criminal Court’s Interventions on Ending Wars and Building Peace (Aug. 2016).
Beyond Deterrence: The ICC Effect in the OTP, openDemocracy, Feb. 19, 2015, available online (last visited Jun. 27, 2017); Unpacking the Deterrent Effect of the International Criminal Court: Lessons From Kenya, St. John’s L. Rev. (forthcoming Dec. 2016), available online.
Suggested Citation for this Comment:
Evaluating ICC Performance: Design is Critical, ICC Forum (Jul. 10, 2017), available at http://iccforum.com/performance#Dancy.
Suggested Citation for this Issue Generally:
How can the Performance of the ICC be Properly Assessed?, ICC Forum (Jul. 10, 2017), available at http://iccforum.com/performance.