Data Collection and Statistical Computations

DATA COLLECTION

The indoor radon database presented in the part of this web site entitled "Radon Concentrations Across Ohio" was developed as follows. Government agencies, university researchers, and commercial testing companies that have made indoor radon measurements in Ohio were contacted and asked for copies of their data. The resulting data sets were combined into a unified database that covers 1494 zip code areas in Ohio. Most of the data (about 90 percent) were provided by companies engaged in radon testing; the rest come from surveys conducted by university researchers and various government agencies. The database includes test results from several kinds of detectors, set out by both owners and professional technicians in a variety of building and room types, during all seasons, and with test periods of different durations. However, the vast majority of the data (more than 95 percent) come from houses where the tests were done by the homeowners using either charcoal canisters or alpha-track detectors. Despite the eclectic nature of the data, the measurements for a given county or zip code area probably provide a good estimate of the year-round average radon concentration in the living areas of houses.

About 91 percent of the data were sent on computer disks and so were transferred without transcription errors. The remainder were provided as printed tables. Most of these were read directly into ASCII computer files using a text scanner, so transcription errors were again avoided. Only about 2 percent of the data had to be typed into the database by hand. These data have been triple-checked for accuracy and so are believed to be free of transcription errors.

STATISTICAL COMPUTATIONS

Measures of Central Tendency (or "Average")

1. MEDIAN (Md)

Md = that value of radon concentration which divides a distribution so that equal numbers of values are both larger and smaller. The median is also known as the second quartile (i.e., the value corresponding to the 50th percentile of the distribution of radon measurements). If the number of measurements (n) is odd for a given zip code area or county, then the median is the central value after the measurements have been arranged in order of increasing magnitude. If n is even then the median is the arithmetic mean of the two central values.

2. ARITHMETIC MEAN (AM)

AM = (1/n) • Σ Xi

where Xi is an individual radon measurement, n is the total number of measurements in a zip code area or county, and Σ denotes summation over all n measurements.

3. GEOMETRIC MEAN (GM)

GM = (Π Xi)^(1/n) = e^([Σ ln Xi]/n)

where Xi, n, and Σ are as previously defined, Π denotes consecutive multiplication, ln is the natural logarithm, and e is the Napierian constant with a value of 2.7182818.

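Note: zero values for Xi are not allowed in the geometric mean because the logarithm of zero is undefined (and a single zero would force the product to zero). All such values in the radon database were therefore changed to 0.1, the minimum detectable level of radon.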
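The three measures of central tendency defined above can be sketched in Python as follows. This is only an illustration of the definitions, not the code used to build the database: the function names, the constant MIN_DETECTABLE, and the sample readings are all hypothetical.

```python
import math

# Value substituted for zero readings, as described in the note above.
MIN_DETECTABLE = 0.1


def median(values):
    """Central value for odd n; arithmetic mean of the two central values for even n."""
    xs = sorted(values)
    n = len(xs)
    mid = n // 2
    if n % 2 == 1:
        return xs[mid]
    return (xs[mid - 1] + xs[mid]) / 2


def arithmetic_mean(values):
    """Sum of the measurements divided by their number."""
    return sum(values) / len(values)


def geometric_mean(values):
    """e raised to the arithmetic mean of the natural logarithms."""
    # Zeros are replaced first because ln(0) is undefined.
    adjusted = [MIN_DETECTABLE if x == 0 else x for x in values]
    return math.exp(sum(math.log(x) for x in adjusted) / len(adjusted))


# Hypothetical readings for one zip code area (one zero value included).
readings = [0.0, 1.0, 2.0, 4.0, 8.0]

print("Md =", median(readings))
print("AM =", arithmetic_mean(readings))
print("GM =", geometric_mean(readings))
```

The geometric mean is computed through logarithms rather than by direct multiplication because the product of many measurements can become too large or too small to represent accurately.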

Measures of Peripheral Tendency (or "Limits")

1. FIRST QUARTILE (Q1)

Q1 = that value of radon concentration which divides a distribution so that 25 percent of the values are smaller and the rest are larger (i.e., Q1 corresponds to the 25th percentile of a distribution).

If n equals, say, 100 and the measurements have been arranged in order of increasing magnitude, then Q1 would be the 25th value. For other values of n, Q1 must be calculated by linearly interpolating between the two closest percentiles. For example, if n = 17, then the two percentiles closest to 25 for the rank-ordered measurements are 23.53 for the 4th measurement (X4; i.e., 23.53 = [4/17] • 100) and 29.41 for the 5th measurement (X5). The first quartile would then be calculated as follows:

Q1 = X4 + ([25-23.53]•[(X5-X4)/(29.41-23.53)])

2. THIRD QUARTILE (Q3)

Q3 = that value of radon concentration which divides a distribution so that 75 percent of the values are smaller and the rest are larger (i.e., Q3 corresponds to the 75th percentile of a distribution).

The calculation of Q3 follows the same linear interpolation procedure described for Q1. The only difference is that Q3 is calculated from the two percentiles closest to 75 whenever no measurement falls exactly on the 75th percentile.

3. MAXIMUM (Max)

Max = the single largest radon concentration for a zip code area or county.

4. MINIMUM (Min)

Min = the single smallest radon concentration for a zip code area or county.
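The interpolation procedure described for Q1 and Q3 can be sketched in Python as follows. This is an illustrative sketch, not the code used to build the database: the function name percentile_value and the sample data are hypothetical. Returning the smallest measurement when the target percentile lies below the percentile of the first ranked value is an assumption, since the text does not cover that case.

```python
def percentile_value(values, p):
    """Value at percentile p, where the k-th ranked value (k = 1..n)
    is assigned the percentile (k / n) * 100, as in the n = 17 example."""
    xs = sorted(values)
    n = len(xs)
    # Rank (1-based) of the last value whose percentile is at or below p.
    k = int(p * n // 100)
    if k < 1:
        return xs[0]   # assumption: p lies below the first value's percentile
    if k >= n:
        return xs[-1]  # p is at or beyond the last value's percentile
    lower_pct = 100 * k / n
    upper_pct = 100 * (k + 1) / n
    x_k, x_next = xs[k - 1], xs[k]
    # Linear interpolation between the two closest percentiles.
    return x_k + (p - lower_pct) * (x_next - x_k) / (upper_pct - lower_pct)


# Hypothetical data: 17 measurements with values 1, 2, ..., 17.
data = list(range(1, 18))

q1 = percentile_value(data, 25)  # interpolated between X4 and X5
q3 = percentile_value(data, 75)  # interpolated between X12 and X13
print("Q1 =", q1, "Q3 =", q3)
print("Max =", max(data), "Min =", min(data))
```

Library routines such as numpy.percentile use different default conventions for assigning percentiles to ranked values, so their results may differ slightly from the procedure described here.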

Measures of Dispersion (or "Spread")

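1. STANDARD DEVIATION about the Arithmetic Mean (SD) = the square root of the variance, that is, of the averaged squared deviations (Xi - AM)² of the individual measurements from the arithmetic mean, where all symbols are as previously defined.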

2. INTERQUARTILE RANGE (IR) = Q3 - Q1

3. SIMPLE RANGE (SR) = Max - Min
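The three measures of dispersion can be sketched in Python as follows. The function names and sample values are hypothetical. The text does not preserve the divisor used in the SD formula, so the ddof parameter (1 for the sample form with divisor n - 1, 0 for the population form with divisor n) is an assumption and not a statement of how the database values were computed.

```python
import math


def standard_deviation(values, ddof=1):
    """Standard deviation about the arithmetic mean.

    ddof=1 divides by n - 1; ddof=0 divides by n. Which one the
    source used is not known, so both are offered here.
    """
    n = len(values)
    am = sum(values) / n
    squared_deviations = sum((x - am) ** 2 for x in values)
    return math.sqrt(squared_deviations / (n - ddof))


def interquartile_range(q1, q3):
    """IR = Q3 - Q1."""
    return q3 - q1


def simple_range(values):
    """SR = Max - Min."""
    return max(values) - min(values)


# Hypothetical measurements.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

print("SD (n - 1) =", standard_deviation(sample))
print("SD (n)     =", standard_deviation(sample, ddof=0))
print("SR         =", simple_range(sample))
print("IR         =", interquartile_range(1.5, 4.0))
```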

Discussion

The "arithmetic mean" (AM) is one of three measures of average concentration provided. It is the most common way of calculating an average. The "geometric mean" (GM), however, is a better measure of average for radon data. The distribution of radon concentrations tends to be asymmetrical around the average: that is, the difference between the average and first quartile is less than the difference between the average and third quartile. As a result of this asymmetry, the larger concentrations have a disproportionately greater weight than the smaller ones in the calculation of the arithmetic mean. This problem is largely avoided by the geometric mean and, thus, this statistic is the one recommended generally by radon experts for estimating the "average" radon concentration in an area. The "median" (Md) is a common measure of average but is inferior to the arithmetic and geometric means because it is based only on the central one or two concentrations. In general, however, the median and geometric mean will tend to have similar values.
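The effect of asymmetry on the three averages can be seen with a small made-up sample. The sketch below uses Python's standard statistics module; the seven values are hypothetical and are not taken from the radon database.

```python
import statistics

# Hypothetical, right-skewed sample: most values are small, one is large.
sample = [1, 2, 2, 3, 4, 5, 40]

am = statistics.mean(sample)            # pulled upward by the value 40
md = statistics.median(sample)          # central value of the ranked data
gm = statistics.geometric_mean(sample)  # requires Python 3.8 or later

print("AM =", round(am, 2))  # about 8.14
print("Md =", md)            # 3
print("GM =", round(gm, 2))  # about 3.71
```

In this sample the single large value raises the arithmetic mean well above both the median and the geometric mean, which remain close to each other.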

The other statistics are useful for describing the variability or spread in the radon concentrations. The "maximum" (Max) concentration, although always of interest, can be misleading. It is based on the single most extreme value in a distribution and so may represent erroneous or otherwise questionable data. The "minimum" (Min) concentration is reported only for the zip code areas because it is always close to zero where there are large numbers of measurements (as in the case of counties). Because the maximum is an unstable statistic, so too is the "simple range" (SR). Two better measures of radon variability are the "first quartile" (Q1) and "third quartile" (Q3). The "interquartile range" (IR) is a stable and meaningful statistic that indicates the "common" range of concentrations (i.e., the range of the central 50 percent of the distribution).

The reliability of radon statistics depends on how representative the test measurements are for a given county or zip code area, and this, in turn, depends on the number of measurements and how they were made. Because of the eclectic nature of the database compiled for this web site, it is difficult to generalize about the reliability of the individual statistics, but it would seem that at least two or three dozen measurements are required for fully trustworthy results. This requirement is certainly met for all the counties, but many zip code areas have very few measurements and so their statistics should be viewed with caution. In the maps, the geometric mean is displayed only for those zip code areas with 10 or more measurements.