CH-3 – Image Analysis – Part-2

Important Note: This article is part of the series in which discuss theory of Video Stream Matching. – How the KS Test Works?

In looking at a list of numbers, for example, the controlB group results from the second example:

controlB={1.26, 0.34, 0.70, 1.75, 50.57, 1.55, 0.08, 0.42, 0.50, 3.20, 0.15, 0.49, 0.95, 0.24, 1.37, 0.17, 6.98, 0.10, 0.94, 0.38}

It is hard to see the general situation. Thus descriptive statistics were developed to reduce the list of all the data items to a few simpler numbers. Thus perhaps better interpret data set from the following:



Standard Deviation = 11.2

We can see from this that something is abnormal. For normally distributed data you should expect about 15% of the data to lie more than 1 standard deviation below the mean (i.e., below 3.61-11.2=-7.59), but no data are that small, in fact no datum is even negative. Similarly only about 2% of the data should be more than 2 standard deviations above the mean (i.e., above 3.61+2×11.2=26.01), but in fact one data-point (50.57) way beyond that (hence an “outlier”). – Cumulative Fraction Function & Empirical Distribution Function

The cumulative fraction function and the empirical distribution function are two names for the same thing: a graphical display of how the data is distributed. If you sort the controlB dataset from small to large you get: [25]

sorted controlB={0.08, 0.10, 0.15, 0.17, 0.24, 0.34, 0.38, 0.42, 0.49, 0.50, 0.70, 0.94, 0.95, 1.26, 1.37, 1.55, 1.75, 3.20, 6.98, 50.57}

Evidently no data lies strictly below 0.08, 5%=.05=1/20 of the data is strictly smaller that 0.10, 10%=.10=2/20 of the data is strictly smaller than 0.15, 15%=.15=3/20 of the data is strictly smaller than 0.17… There are 17 data points smaller than , and hence we’d say that the cumulative fraction of the data smaller than is .85=17/20. For any number x, the cumulative fraction is the fraction of the data that is strictly smaller than x. Below is the plot of the cumulative fraction for our control data. Each step in the plot corresponds to a data-point.

You can see with a glance that the vast majority of the data is scrunched into a small fraction of the plot on the far left. This is a sign of a non-normal distribution of the data.

In order to better see the data distribution, it would be nice to scale the x-axis differently, using more space to display small x data points. Since all the data are positive you can use a “log” scale. (Since the logarithm of negative numbers and even zero is undefined, it is not possible to use a log scale if any of the data are zero or negative.)

Since many measured quantities are guaranteed positive (the width of a leaf, the weight of the mouse, [H+]) log scales are common in science. Here is the result of using a log scale:

You can now see that the median (the point that divides the data set evenly into two: half above the median, half below the median) is a bit below 1.

Now plot the cumulative fraction of the treatment group on the same graph as we plotted the control cumulative fraction. (Use a dashed line to display the treatment group so distinguish it from the control group.)

Figure 3.3

You can see that the control and treatment datasets span much the same range of values (from about .1 to about 50). But for most any x value, the fraction of the treatment group that is strictly less than x is clearly less than the fraction of the control group that is less than x.

That is, by-and-in-large the treatment values are larger than the control values for the same cumulative fraction. For example, the median (cumulative fraction =.5) for the control is clearly less than one whereas the median for the treatment is more than 1.

The KS-test uses the maximum vertical deviation between the two curves as the statistic D. In this case the maximum deviation occurs near x=1 and has D=.45. (The fraction of the treatment group that is less then one is 0.2 (4 out of the 20 values); the fraction of the control group that is less than one is 0.65 (13 out of the 20 values). Thus the maximum difference in cumulative fraction is D=.45.)

Ali Imran
Over the past 20+ years, I have been working as a software engineer, architect, and programmer, creating, designing, and programming various applications. My main focus has always been to achieve business goals and transform business ideas into digital reality. I have successfully solved numerous business problems and increased productivity for small businesses as well as enterprise corporations through the solutions that I created. My strong technical background and ability to work effectively in team environments make me a valuable asset to any organization.

Leave a Reply