An explainer of the survey statistics

The authoritarian origins of well-organised opposition parties: The rise of Chadema in Tanzania

On the 18th of December 2018, the abovementioned article was published online in African Affairs, ahead of print publication in 2019.

You can see the article (behind a paywall) at this link.

Survey statistics are often treated with suspicion in Tanzania. Popular discussions about what can reasonably be inferred from them are sometimes ill-informed, or even misinformed. In part, this is because surveyors do not explain what they did, what methods they used, and what the results mean. When they do, the language is often technical and mystifying.

Below is a primer that explains in plain language what the statistics presented in this article show, what methods were used to collect them, and what we can infer from them.

Headline finding

The survey data presented in this article suggests that Chadema canvassed at least as many people as CCM did in the last month of the 2015 general election campaign. Applied statistics deals in probabilities, so the data enables us to reach this conclusion with confidence, but not with certainty.

Relevance

These findings contradict the received wisdom that CCM is better-organised on the ground than any other political party in Tanzania. Equally, they demonstrate the scale of Chadema’s party-building between 2003 and 2015.

Context

This does not mean that Chadema ran as strong a ground campaign as CCM in all respects. For example, as I write in my doctoral thesis, Tanzanian election campaigns are rally-intensive. The rally makes up a much larger portion of campaign contact than canvassing does.

Who did this survey?

The survey data presented in this article was collected by Ipsos. The survey had two rounds. The first was an omnibus survey, meaning that multiple clients added questions to one survey. The second round was conducted just for this project. The survey questions referred to in this article were commissioned by Dan Paget.

How many rounds did this survey have? When did it take place?

One pre-election round. Field work for this round was completed by the 22nd of September 2015.

One post-election round. Field work for this round was conducted between the 1st and 10th of November 2015.

How many people were surveyed in each round of this survey?

2,000 respondents were surveyed in the pre-election round.

1,000 respondents were surveyed in the post-election round.

The sample of the second round was reduced in order to lower the total survey cost.

Was this survey representative, and why?

Data scientists are rarely interested in the answers that their respondents give to questions per se. They are normally interested in those answers as a way to make inferences about the answers that the population as a whole (in this case, the country) would have given. In other words, they are interested in the sample as representative of the population. Why should some 2,000 interviewees tell anyone about what the average Tanzanian thinks? The answer provided by applied statistics is that samples become representative of populations when they are randomly selected from those populations. If every Tanzanian were assigned a number, a random-number machine generated 2,000 numbers from that range, and the 2,000 people with the corresponding numbers were selected for inclusion in the sample, that would be a truly random selection process, and the sample would most probably be representative of Tanzania.
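For readers who like to see an idea in code, here is a minimal Python sketch of the numbered-lottery procedure just described. The population size of 50 million is a round figure invented for illustration, not a census count, and the function name is my own.

```python
import random


def simple_random_sample(population_size, sample_size, seed=None):
    """Draw `sample_size` distinct ID numbers from 1..population_size.

    Every ID has the same chance of selection, which is what makes
    this a simple random sample.
    """
    rng = random.Random(seed)
    return rng.sample(range(1, population_size + 1), sample_size)


# Hypothetical population size, for illustration only.
ids = simple_random_sample(population_size=50_000_000, sample_size=2_000, seed=1)
print(f"Selected {len(ids)} people, {len(set(ids))} of them distinct")
```

No real survey can be run this way, because no list of every Tanzanian with a number attached exists for surveyors to draw from. That is why practical surveys approximate the ideal instead.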

Even then, random selection only makes it more likely that a sample is representative of a population as a whole. For example, could a random selection process like the one described above that selects 2,000 respondents from the population of Tanzania produce an unrepresentative sample in which all the respondents are male and over 80 years old? Yes it could, but this would be extremely unlikely.
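To give a feel for how unlikely such a freak sample is, the sketch below works out the chance that all 2,000 respondents are male. It assumes, purely for illustration, that exactly half of the population is male and that each respondent is drawn independently. The chance is too small for a computer to store as an ordinary number, so the code reports it as a power of ten.

```python
import math


def log10_probability_all_in_group(group_share, sample_size):
    """Return log10 of the chance that every respondent falls in one group.

    Assumes independent draws, so the chance is group_share ** sample_size.
    Working with logarithms avoids the result underflowing to zero.
    """
    return sample_size * math.log10(group_share)


# Assumed share of men in the population: 0.5 (illustrative, not census data).
log10_p = log10_probability_all_in_group(group_share=0.5, sample_size=2_000)
print(f"Chance of an all-male sample: roughly 10 to the power {log10_p:.0f}")
```

Adding the condition that every respondent is also over 80 years old would make the chance smaller still.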

Data scientists deal with these issues of probability by treating summary statistics from surveys as estimates, called ‘point estimates’ in data science jargon. To take an invented example, if a summary statistic of the average (mean) age in a representative survey of Ugandans were 24, this would be the point estimate of the real age of the average Ugandan. Data scientists assign these point estimates confidence intervals. A confidence interval is a pair of lower and upper bounds that flank a point estimate and express the range of values within which the real population statistic probably falls. To continue with the same (made-up) example, imagine that the lower and upper bounds of the 95% confidence interval for Ugandan age in the sample were 22 and 26. This would mean that, based on the survey data, we could be 95% confident that the real average (mean) age of Ugandans fell somewhere between 22 and 26. Strictly speaking, the 95% describes the method: if the survey were repeated many times, about 95% of the intervals calculated in this way would contain the real average.

In the article of interest, survey statistics are reported with standard errors. A 95% confidence interval extends approximately 1.96 standard errors either side of the point estimate.
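The arithmetic can be sketched in a few lines of Python. The example reuses the invented Ugandan figure of 24 and assumes a standard error of 1.0, which I have made up so that the bounds come out close to 22 and 26.

```python
def confidence_interval_95(point_estimate, standard_error, z=1.96):
    """Return the (lower, upper) bounds of a 95% confidence interval.

    The interval extends z standard errors either side of the estimate.
    """
    margin = z * standard_error
    return point_estimate - margin, point_estimate + margin


# Invented example: mean age of 24 with an assumed standard error of 1.0.
lower, upper = confidence_interval_95(point_estimate=24.0, standard_error=1.0)
print(f"95% confidence interval: {lower:.2f} to {upper:.2f}")
```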

What was the selection procedure for this survey?

No survey employs truly random sampling methods, but for this survey, a selection procedure was designed to approximate the random ideal as closely as possible. I have generalised from some of the specifics of Ipsos’ selection procedure to ensure that I do not give away their trade secrets.

First, 200 wards were randomly selected from across Tanzania, with the chances of selection adjusted for the population size of each ward. These served as the primary sampling units. The selection procedure was stratified so that a representative share of wards was chosen across Tanzania’s regions, and a representative mixture of urban and rural wards was chosen. The sampling frame and stratification were constructed using publicly available data from the 2012 census, conducted by the National Bureau of Statistics.
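The sketch below illustrates the general idea of stratified selection with chances proportional to population size. It is not Ipsos’ procedure, whose details I have not given. The ward records, field names and allocation rule (each stratum gets wards in proportion to its population, rounded) are all assumptions of mine, and the toy data is invented.

```python
import random


def pps_sample(wards, n, rng):
    """Pick n distinct wards; at each draw, bigger wards are more likely."""
    remaining = list(wards)
    chosen = []
    for _ in range(min(n, len(remaining))):
        weights = [w["population"] for w in remaining]
        pick = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        chosen.append(remaining.pop(pick))
    return chosen


def stratified_pps(wards, total_n, seed=None):
    """Allocate wards to (region, urban/rural) strata by population share."""
    rng = random.Random(seed)
    strata = {}
    for w in wards:
        strata.setdefault((w["region"], w["urban"]), []).append(w)
    grand_total = sum(w["population"] for w in wards)
    selected = []
    for members in strata.values():
        share = sum(w["population"] for w in members) / grand_total
        # Rounding means the allocations may not always sum exactly to total_n.
        selected.extend(pps_sample(members, round(total_n * share), rng))
    return selected


# Invented toy frame: 2 regions x urban/rural, 5 wards in each stratum.
wards = [
    {"name": f"{region}-{'U' if urban else 'R'}-{i}", "region": region,
     "urban": urban, "population": 1_000 * i}
    for region in ("A", "B") for urban in (True, False) for i in range(1, 6)
]
selected = stratified_pps(wards, total_n=8, seed=42)
print(sorted(w["name"] for w in selected))
```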

A survey team visited each ward in person. They employed random-walk procedures to select each respondent. This involves randomly selecting a number of starting points in the ward, and giving the interviewers instructions like ‘go down this road and enter house number x on the left. After that, skip y houses and enter the next one.’ It also involves a random procedure to select a member of the household to interview. 2,000 people were interviewed in total.

These principles and methods mirror those adopted by Afrobarometer in most respects.
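As a rough illustration of a random-walk instruction of this kind, the Python sketch below picks houses along a route and then a household member. The route, the values of x and y, and the use of a simple equal-chance draw within the household are hypothetical choices of mine, not the rules that the survey teams followed.

```python
import random


def random_walk_households(houses, start_index, skip, n_interviews):
    """Start at one house, then skip `skip` houses between each visit."""
    step = skip + 1
    return houses[start_index::step][:n_interviews]


def pick_respondent(household_members, rng):
    """Choose one household member, each with an equal chance."""
    return rng.choice(household_members)


# Hypothetical road with 50 numbered houses.
houses = list(range(1, 51))
visited = random_walk_households(houses, start_index=2, skip=4, n_interviews=5)
print("Houses visited:", visited)

rng = random.Random(7)
members = ["Asha", "Juma", "Neema"]  # invented names
respondent = pick_respondent(members, rng)
print("Respondent:", respondent)
```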

How was the second round conducted?

The data presented in this article is not from that first round of field work, which took place in early to mid-September 2015, before the election. During that field work, respondents were asked whether they would be willing to be called back by the surveyors. They gave a phone number and a series of further phone numbers belonging to members of their households or to neighbours. After the election, a subsample of interviewees was randomly selected from the original sample. They were telephoned and surveyed over the phone. In total, the recontact rate was 82%, a good rate by industry standards. 1,000 respondents were re-contacted in total.

Which survey questions does the article report on?

In the post-election survey, respondents were asked the questions presented below, among others. I provide the questions in English and their Swahili translations below in italics. These questions were translated professionally and those translations were checked by an independent third party.

'Agents from which political parties have spoken to you face-to-face in the last month, if any?'

'Je ni wakala/wajumbe/wawakilishi/makada wa chama kipi cha kisiasa waliozungumza na wewe ana kwa ana hivi karibuni, kama wapo?'

The interviewers recorded whichever party the respondent named. If the respondent said UKAWA, they were then asked the following question in order to clarify: ‘Do you mean UKAWA, or a party within UKAWA?’

'Akisema UKAWA, muulize kama anamaanisha UKAWA au chama kilichopo ndani ya UKAWA?'

Respondents who specified a party within UKAWA were marked down as having been contacted by that party.

For each party that they listed in answer to this question, they were then asked:

'How many times were you spoken to face-to-face by a [Party X] agent in the month before election day? (READ OUT) once, twice, three times, four times, five or more times, or no times?'

'Je ni mara ngapi mwakilishi wa [chama x] alizungumza na wewe ana kwa ana katika mwezi mmoja kabla ya siku ya uchaguzi? Mara moja, mara mbili, mara tatu, mara nne, mara tano au zaidi, au hakuna muda?'

How and why were the responses weighted?

Despite the best attempts to generate a random sample, there were small discrepancies between the survey sample and the Tanzanian population in the gender balance, age balance and regional balance of the sample. In particular, Zanzibari residents are over-represented in the sample. To correct for this, and make the sample as representative as possible, the survey sample was weighted, meaning that the answers of respondents from under-represented groups counted for proportionally more, while the answers of respondents from over-represented groups counted for proportionally less. This is conventional survey procedure. To read more about weighting practice, follow this link. However, to avoid the suspicion that the weighting may have involved ‘hocus-pocus’ that allowed me to manipulate the results in some way, I present both the weighted and unweighted figures below.
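The logic of weighting can be shown with a toy example in Python. Everything in it is invented: ten respondents, two groups, and made-up population shares. It is a sketch of one common approach (each respondent’s weight is their group’s population share divided by its sample share), not a description of the exact weighting scheme applied to this survey.

```python
from collections import Counter


def poststratification_weights(groups, population_shares):
    """Weight each respondent by population share / sample share of their group."""
    counts = Counter(groups)
    n = len(groups)
    return [population_shares[g] / (counts[g] / n) for g in groups]


def weighted_share(flags, weights):
    """Weighted proportion of respondents for whom the flag is True."""
    return sum(w for f, w in zip(flags, weights) if f) / sum(weights)


# Invented sample in which Zanzibar is over-represented (40% of the sample).
groups = ["Zanzibar"] * 4 + ["Mainland"] * 6
canvassed = [True, True, False, False, True, False, False, False, False, False]
population_shares = {"Zanzibar": 0.1, "Mainland": 0.9}  # made-up shares

weights = poststratification_weights(groups, population_shares)
unweighted = sum(canvassed) / len(canvassed)
weighted = weighted_share(canvassed, weights)
print(f"Unweighted: {unweighted:.1%}, weighted: {weighted:.1%}")
```

In this toy case the over-represented group reported more canvassing, so weighting pulls the overall figure down.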

What do the results show?

A weighted 9.9 percent of respondents reported being canvassed by CCM in the last month of the campaign (unweighted, this figure is 11.6 percent). As a proportion of the 22.75 million people that were registered to vote in the 2015 election, this is equivalent to approximately 2.25 million people.

The standard error is 1.0 percent, so the 95% confidence interval for this statistic runs from approximately 7.9 percent to 11.9 percent.

A weighted 11.8 percent of respondents reported being canvassed by Chadema in the last month of the campaign (unweighted, the figure is 11.5 percent). As a proportion of the 22.75 million people that were registered to vote in the 2015 election, this is equivalent to approximately 2.68 million people.

The standard error is 1.2 percent, so the 95% confidence interval for this statistic runs from approximately 9.4 percent to 14.2 percent.
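Readers who wish to check this arithmetic can do so with the short Python sketch below. It uses only the figures reported above (the weighted percentages, the standard errors and the 22.75 million registered voters); the function and variable names are my own.

```python
REGISTERED_VOTERS = 22_750_000
Z_95 = 1.96  # standard errors either side of the estimate for a 95% interval


def summarise(share_pct, se_pct):
    """Convert a weighted percentage and its standard error into headline figures."""
    margin = Z_95 * se_pct
    return {
        "people": REGISTERED_VOTERS * share_pct / 100,
        "lower_pct": share_pct - margin,
        "upper_pct": share_pct + margin,
    }


ccm = summarise(share_pct=9.9, se_pct=1.0)
chadema = summarise(share_pct=11.8, se_pct=1.2)

for party, result in (("CCM", ccm), ("Chadema", chadema)):
    print(
        f"{party}: about {result['people'] / 1e6:.2f} million people; "
        f"95% interval {result['lower_pct']:.1f} to {result['upper_pct']:.1f} percent"
    )
```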

If one examines the data for the volume of contacts (i.e. the number of people contacted multiplied by the number of times that they were contacted), rather than the proportion of the sample that was contacted, the same patterns emerge.
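One way to calculate a volume of contacts from the answer categories in the survey question is sketched below in Python. The answers in the example are invented, and counting ‘five or more times’ as exactly five is a simplifying assumption of mine, which makes the total a lower bound.

```python
# Number of contacts assumed for each answer category.
# "five or more times" is counted as 5, so totals are a lower bound.
TIMES = {
    "no times": 0,
    "once": 1,
    "twice": 2,
    "three times": 3,
    "four times": 4,
    "five or more times": 5,
}


def contact_volume(answers, weights=None):
    """Sum the (optionally weighted) number of contacts across respondents."""
    if weights is None:
        weights = [1.0] * len(answers)
    return sum(TIMES[a] * w for a, w in zip(answers, weights))


# Invented answers from four respondents about one party.
answers = ["once", "twice", "five or more times", "no times"]
volume = contact_volume(answers)
print("Total contacts reported:", volume)
```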

How should these results be interpreted?

The weighted point estimate for Chadema contact is higher than the weighted estimate for CCM contact. However, it would be premature to conclude that Chadema canvassed more people than CCM did. The confidence intervals for these statistics overlap, and so the difference may be due to sampling error. In other words, I cannot rule out the possibility that the difference in the canvassing rates reported in the survey arose because the random sample of Tanzanians surveyed was not (quite) representative of the population of Tanzania.
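The overlap can be checked directly. The Python sketch below rebuilds the two 95% confidence intervals from the reported weighted percentages and standard errors, and tests whether they share any values. Comparing intervals in this way is a simple and cautious rule of thumb rather than a formal test of the difference.

```python
def interval(point_estimate, standard_error, z=1.96):
    """Return the (lower, upper) bounds of a 95% confidence interval."""
    return (point_estimate - z * standard_error,
            point_estimate + z * standard_error)


def overlaps(a, b):
    """True if two (lower, upper) intervals share at least one value."""
    return a[0] <= b[1] and b[0] <= a[1]


ccm_ci = interval(9.9, 1.0)       # weighted CCM figure and standard error
chadema_ci = interval(11.8, 1.2)  # weighted Chadema figure and standard error

print("CCM interval:", ccm_ci)
print("Chadema interval:", chadema_ci)
print("Intervals overlap:", overlaps(ccm_ci, chadema_ci))
```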

However, the balance of probability strongly suggests that Chadema canvassed at least as many people as CCM did in the last month of the 2015 campaign.

Can I access the raw data and examine these statistics myself?

I will make the data file with the relevant questions available, so that there is complete transparency about the data.

I won't provide access to the whole data-set, because there are answers to other survey questions in that data-set that I wish to analyse and publish on before I release them to the public.

Unfortunately, I am employed on temporary contracts as a teacher at two universities at the moment, and neither of these universities provides me with access to a Stata licence as a temporary employee. I need to use Stata or an equivalent programme to open the file and extract the relevant data. Therefore, I can't separate the relevant data from the main file and upload it here immediately. I will ask a friend for a favour and get that done, and then make the file available on this website. Please be patient in the meantime.