Does nearest neighbor distance approach farthest neighbor distance in high-dimensional space?
I was reading a classical ANN paper that discusses the cases when the distance difference between the nearest neighbor and the farthest neighbor becomes negligible in high-dimensional space. In Section 3.5.3, it provides an interesting manually constructed example, but without detailed proof steps. I was writing this blog to complement the process.
Problem: Let Xm=(X1,X2,⋯,Xm) be a random m-dimensional vector, such that X1=U1 and Xi=Ui+Xi−1/2, where Ui∼Uniform(0,i) is a continuous uniform random variable. Now, we sample two random vectors Pm,Qm independently from the distribution of Xm. What is the expected L2 distance and variance of distance between Pm and Qm?
The conclusion in the paper is 91m2+31m+o(m), but my results show the more accurate form should be 91m2+271m+o(m).
Solution: The expected distance between Pm and Qm could be computed as follows:
Note that Cov(Pi,Qi)=0 by independence. Then, our next step is to compute Var(Xi).
It is not hard to see Xi−1 and Ui are independent. Therefore, Var(Xi)=Var(Ui)+Var(Xi−1)/4. Since Var(Ui)=i/12, we have Var(Xi)=i/12+Var(Xi−1)/4. By solving the above recurrence relation, we have:
Var(Xi)=271(1/4)i+91i−271
How to solve the recurrence (I do not know how, following is GPT’s answer but I verified):
We first solve the homogeneous part of the recurrence, which is Var(Xi)=Var(Xi−1)/4. The characteristic equation for this homogeneous part is r=1/4, so the general solution to the homogeneous part is C(1/4)i for some constant C.
Next, we find a particular solution to the non-homogeneous recurrence. We can try a particular solution of the form Var(Xi)=A∗i+B for some constants A and B. Substituting this into the original recurrence gives us:
We need to match the coefficients of i and the constant terms on both sides of the equation. This gives us the following system of equations:
{A=121+41AB=41(B−A)
Then A=91 and B=−271. So the non-homogeneous solution is Var(Xi)=91i−271.
Putting everything together, the general solution to the original recurrence is Var(Xi)=C(1/4)i+91i−271. Since Var(X1)=1/12, we can solve for C to get C=271. Therefore, the final solution to the recurrence is:
Var(Xi)=271(1/4)i+91i−271
Now back to our goal, we need to summarize 2Var(Xi) for i=1,2,⋯,m to get the expected distance between Pm and Qm:
I am not writing the variance of distance here since it is much more complicated. Anyway, the difference of our results does not affect the main conclusion of the paper.