### Statistical Indexing and Probabilistic Weighting

In statistics and research design, an **index** is a composite statistic: a measure of changes in a representative group of individual data points or, in other words, a compound measure that aggregates multiple indicators. Indexes, also known as composite indicators, summarize and rank specific observations.

Much data in the social sciences and in sustainability research is represented by indices such as:

- the Gender Gap Index,
- the Human Development Index, or
- the Dow Jones Industrial Average.

The ‘Report by the Commission on the Measurement of Economic Performance and Social Progress’, written by Joseph Stiglitz, Amartya Sen, and Jean-Paul Fitoussi in 2009, suggests that these measures have grown dramatically in recent years because of three concurring factors:

- improvements in the level of literacy (including statistical literacy),
- increased complexity of modern societies and economies, and
- widespread availability of information technology.

According to Earl Babbie, items in indexes are usually weighted equally unless there is a reason not to (for example, if two items reflect essentially the same aspect of a variable, they could have a weight of 0.5 each).
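The weighting rule can be illustrated with a minimal sketch. The function name `index_score`, the four-item questionnaire, and the 0/1 item scores are hypothetical and chosen only for illustration; they do not come from Babbie.

```python
def index_score(item_scores, weights=None):
    """Return the weighted sum of item scores.

    When no weights are given, every item gets the same weight of 1.0,
    which mirrors the default of equal weighting.
    """
    if weights is None:
        weights = [1.0] * len(item_scores)
    if len(weights) != len(item_scores):
        raise ValueError("one weight is required per item")
    return sum(w * s for w, s in zip(weights, item_scores))


# Hypothetical respondent scored 0 or 1 on four items.
scores = [1, 0, 1, 1]

# Default: all items count equally.
equal = index_score(scores)  # 3.0

# Assumption: items 3 and 4 capture the same aspect of the variable,
# so each receives a weight of 0.5 and together they count as one item.
adjusted = index_score(scores, [1.0, 1.0, 0.5, 0.5])  # 2.0
```

Halving the two overlapping items keeps that aspect from being counted twice in the total.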

According to the same author, constructing an index involves four steps. First, items should be selected based on their content validity, unidimensionality, the degree of specificity with which a dimension is to be measured, and their amount of variance. Items should be empirically related to one another, which leads to the second step of examining their multivariate relationships. Third, index scores are designed, which involves determining their score ranges and the weights for the items. Finally, indexes should be validated, which involves testing whether they can predict indicators related to the measured variable that were not used in their construction.

A handbook for the construction of composite indicators was published jointly by the OECD and the European Commission’s Joint Research Centre in 2008. The handbook, officially endorsed by the OECD high-level statistical committee, describes ten recursive steps for developing an index:

- Step 1: Theoretical framework
- Step 2: Data selection
- Step 3: Imputation of missing data
- Step 4: Multivariate analysis
- Step 5: Normalisation
- Step 6: Weighting
- Step 7: Aggregating indicators
- Step 8: Sensitivity analysis
- Step 9: Link to other measures
- Step 10: Visualisation
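Steps 5 to 7 can be sketched in a few lines. The example below is only one possible set of modelling choices, assumed here for illustration: min-max normalisation, fixed weights, and linear (weighted-sum) aggregation. The function names and the two-indicator data set are hypothetical and are not taken from the handbook.

```python
def min_max_normalise(values):
    """Rescale raw values to the range [0, 1] (Step 5, one common choice)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant indicator carries no information; map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def composite_index(indicators, weights):
    """Aggregate normalised indicators with a weighted sum (Steps 6 and 7).

    indicators: dict mapping indicator name -> list of raw values, one per unit
    weights:    dict mapping indicator name -> weight
    """
    names = list(indicators)
    n_units = len(indicators[names[0]])
    normalised = {name: min_max_normalise(indicators[name]) for name in names}
    return [
        sum(weights[name] * normalised[name][i] for name in names)
        for i in range(n_units)
    ]


# Hypothetical data: two indicators observed for three units.
indicators = {
    "income": [10.0, 20.0, 30.0],
    "schooling": [12.0, 8.0, 10.0],
}
weights = {"income": 0.5, "schooling": 0.5}

index_values = composite_index(indicators, weights)
```

Replacing the normalisation or the aggregation rule (for example, with a geometric mean) can change the ranking of units, which is the kind of dependence the sensitivity analysis in Step 8 is meant to expose.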

As the list suggests, constructing a composite indicator requires many modelling choices, which makes the use of such indicators controversial. The delicate issue of assigning and validating weights has been discussed in the literature.

A sociological reading of the nature of composite indicators is offered by Paul-Marie Boulanger, who sees these measures at the intersection of three movements:

- the democratisation of expertise: the concept that tackling societal and environmental issues requires more knowledge than experts alone can provide, a line of thought that connects to the concept of the extended peer community developed by post-normal science;
- the impulse to create a new public through a process of social discovery, which can be traced back to the work of pragmatists such as John Dewey;
- the semiotics of Charles Sanders Peirce, according to which a composite indicator (CI) is not just a sign or a number, but suggests an action or a behaviour.

#### Basic Probabilistic Weighting Model

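The basic weighting function used is one developed in earlier work, and may be expressed as follows:

$$
w(\mathbf{x}) = \log \frac{P(\mathbf{x} \mid R)\, P(\mathbf{0} \mid \bar{R})}{P(\mathbf{x} \mid \bar{R})\, P(\mathbf{0} \mid R)}
$$

where **x** is a vector of information about the document, **0** is a reference vector representing a zero-weighted document, and $R$ and $\bar{R}$ are relevance and non-relevance respectively.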

For example, each component of **x** may represent the presence/absence of a query term in the document or its document frequency; **0** would then be the “natural” zero vector representing all query terms absent.

In this formulation, independence assumptions (or, indeed, Cooper’s assumption of “linked dependence”) lead to the decomposition of $w$ into *additive* components such as individual term weights. In the presence/absence case, the resulting weighting function is the Robertson/Sparck Jones formula for a term-presence-only weight, as follows:

$$
w = \log \frac{p\,(1 - q)}{q\,(1 - p)}
$$

where $p = P(\text{term present} \mid R)$ and $q = P(\text{term present} \mid \bar{R})$.

With a suitable estimation method, this becomes:

$$
w = \log \frac{(r + 0.5)\,/\,(R - r + 0.5)}{(n - r + 0.5)\,/\,(N - n - R + r + 0.5)}
$$

where $N$ is the number of indexed documents, $n$ the number containing the term, $R$ the number of known relevant documents, and $r$ the number of these containing the term. This approximates to inverse collection frequency (ICF) when there is no relevance information. It will be referred to below (with or without relevance information) as $w^{(1)}$.
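A minimal sketch of the estimated term weight follows. It assumes the 0.5 point correction commonly used with the Robertson/Sparck Jones estimate; the function name `rsj_weight` and the collection sizes in the usage lines are hypothetical.

```python
import math


def rsj_weight(N, n, R=0, r=0):
    """Estimated term-presence weight with a 0.5 correction (assumed form).

    N: number of indexed documents
    n: number of documents containing the term
    R: number of known relevant documents (0 if no relevance information)
    r: number of known relevant documents containing the term
    """
    if not (0 <= r <= R and r <= n <= N and R - r <= N - n):
        raise ValueError("inconsistent document counts")
    odds_relevant = (r + 0.5) / (R - r + 0.5)
    odds_non_relevant = (n - r + 0.5) / (N - n - R + r + 0.5)
    return math.log(odds_relevant / odds_non_relevant)


# Without relevance information (R = r = 0) the weight reduces to
# log((N - n + 0.5) / (n + 0.5)), an ICF-like quantity.
rare = rsj_weight(N=1000, n=10)
common = rsj_weight(N=1000, n=500)

# Hypothetical feedback: 4 of 5 known relevant documents contain the term.
with_feedback = rsj_weight(N=1000, n=10, R=5, r=4)
```

With no relevance information the rarer term receives the larger weight. In this made-up case, the term occurs in most of the known relevant documents, so adding the relevance information raises its weight further.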

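If we deal with within-document term frequencies rather than merely presence and absence of terms, then the formula corresponding to the term-presence-only weight would be as follows:

$$
w = \log \frac{p_{tf}\, q_0}{q_{tf}\, p_0}
$$

where $p_{tf} = P(\text{term present with frequency } tf \mid R)$, $q_{tf}$ is the corresponding probability for $\bar{R}$, and $p_0$ and $q_0$ are those for term absence.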