The Wall Street Journal reports on a funny statistical exercise involving the search for life on a billion other planets. Whether a planet has "life on it" is a random variable. Certain characteristics of the planet are observable, such as its distance from the closest star, its diameter and other attributes. Based on these observable planetary attributes, what is your best guess of whether that planet has life on it? After all, during this time of budget constraints, NASA should only send space probes to places that have a high index score.
A "propensity score" scholar would implement the following strategy. Take a random sample of planets (say 100 of them) for which we know whether there is life on the planet or not and for which we know a vector of planetary attributes. Call this vector Z. It will include attributes such as distance to the closest star, diameter, density, etc.
Define Life = a dummy variable that equals 1 if there is life on a specific planet and 0 otherwise. Use linear regression to estimate a linear probability model of the form:
Life = constant + b*Z + U (equation #1)
This yields an estimate of "b" which we call "b_hat". Think of "b_hat" as an estimate of the slope: as a specific Z attribute, such as distance to the closest star, increases, how much does the probability of Life change? If the probability goes down sharply, then "b_hat" will be negative and large in absolute value. The estimates in "b_hat" represent index weights that allow the researcher to collapse the Z vector into a single index for predicting which of the billion planets are most likely to be home to life.
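To make the estimation step concrete, here is a minimal sketch of fitting the linear probability model by ordinary least squares in plain Python. Everything in it is an assumption for illustration: the sample of 100 planets is simulated, the two attributes (distance and diameter) are in arbitrary units, the "true" relationship is invented, and the helper names `solve_linear_system` and `fit_linear_probability_model` are hypothetical.

```python
import random

def solve_linear_system(a, y):
    """Solve a x = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    # Augmented matrix so each row operation updates both sides at once.
    m = [row[:] + [y[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x

def fit_linear_probability_model(z_rows, life):
    """OLS of Life on a constant and Z; returns [constant_hat, b_hat_1, ...]."""
    x_rows = [[1.0] + list(z) for z in z_rows]
    k = len(x_rows[0])
    # Normal equations: (X'X) coefficients = X'Life.
    xtx = [[sum(r[i] * r[j] for r in x_rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * v for r, v in zip(x_rows, life)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Simulated sample (an assumption): Z = (distance to star, diameter).
rng = random.Random(0)
z_sample = [(rng.uniform(0.0, 1.0), rng.uniform(0.0, 1.0)) for _ in range(100)]
# Invented truth: life is likelier close to the star; diameter is irrelevant.
life_sample = [1 if rng.random() < 0.8 - 0.6 * d else 0 for d, _ in z_sample]

coefficients = fit_linear_probability_model(z_sample, life_sample)
constant_hat, b_hat_distance, b_hat_diameter = coefficients
print(constant_hat, b_hat_distance, b_hat_diameter)
```

In practice one would reach for a regression library rather than hand-rolled elimination; the point here is only that "b_hat" comes out of solving the normal equations on a sample where Life is actually observed.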
Now that we have estimated this equation, the researcher can form the following prediction index:
Probability of Life on planet j = constant_hat + b_hat*Z_j where Z_j is planet j's observable attributes and constant_hat is the estimated constant from equation #1
Sort this index from highest to lowest and the astrobiologists are ready to explore the universe! I acknowledge that it takes time and effort to collect the Z_j vector for each of a billion planets.
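The scoring and sorting step can be sketched in a few lines. The coefficient values, the attribute names, and the three planets below are all hypothetical placeholders, not estimates from any data. The sketch also carries along the estimated constant from equation #1; since it is the same for every planet, it does not change the ordering.

```python
# Hypothetical estimates, chosen only for illustration.
constant_hat = 0.8
b_hat = {"distance_to_star": -0.6, "diameter": 0.1}

# Hypothetical Z_j vectors for three made-up planets (arbitrary units).
planets = {
    "planet_a": {"distance_to_star": 0.2, "diameter": 0.5},
    "planet_b": {"distance_to_star": 0.9, "diameter": 0.4},
    "planet_c": {"distance_to_star": 0.5, "diameter": 1.0},
}

def life_index(z_j):
    """Collapse a planet's attribute vector into one predicted-probability index."""
    return constant_hat + sum(b_hat[name] * z_j[name] for name in b_hat)

# Highest index first: these are the planets to visit first.
ranking = sorted(planets, key=lambda name: life_index(planets[name]), reverse=True)
for name in ranking:
    print(name, round(life_index(planets[name]), 2))
```

One caveat of the linear probability model is that nothing keeps the index between 0 and 1 for planets with extreme attributes, so it is safer to read it as a ranking score than as a literal probability.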
Now, there is only one problem here. Earth is the only planet that we know has Life. With a single data point, it is difficult to estimate the "Life statistical model" presented in equation #1 above. Without such estimates, the astrobiologists must be simply making up their "b_hat" estimates and that isn't very scientific.
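A small sketch shows why a sample of one gets us nowhere. It assumes a single attribute, distance to the star measured in astronomical units (so Earth's value is 1.0), and checks two things: the X'X matrix behind the regression is singular, and wildly different coefficient pairs fit the lone data point equally well. The variable names are hypothetical.

```python
# The entire sample: Earth, with Life = 1.
earth_row = [1.0, 1.0]  # [constant term, distance to the Sun in AU]
earth_life = 1.0

# X'X for a one-row design matrix; OLS needs this to be invertible.
xtx = [[earth_row[i] * earth_row[j] for j in range(2)] for i in range(2)]
determinant = xtx[0][0] * xtx[1][1] - xtx[0][1] * xtx[1][0]
print(determinant)  # prints 0.0, so there is no unique solution

# Two contradictory stories about distance, both fitting Earth exactly.
candidates = [(1.0, 0.0), (3.0, -2.0)]  # (constant, b)
fitted = [constant + b * earth_row[1] for constant, b in candidates]
print(fitted)  # prints [1.0, 1.0]
```

The first candidate says distance does not matter at all and the second says it matters a great deal, yet the data cannot tell them apart.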