We'll first review what the shortest splitline algorithm is, then
explain how to use fancy algorithms techniques from computational geometry,
to program it so it is guaranteed to run quickly (even on worst-case input)
and with high accuracy (in fact exact if we had exact real arithmetic and knew the
exact country-shape and people locations).
This page is intended for perfectionists.
Even fairly crude algorithmic approaches suffice
to deliver acceptable accuracy in acceptable time on realistic input,
so the methods discussed here
are not really necessary.
The shortest splitline algorithm (SSA)
Was invented by Warren D. Smith in the early 2000s.
It is a very simple mechanical procedure that inputs the
shape of a country and the locations of its inhabitants, and a positive number N,
and outputs
a unique subdivision of that country into N equipopulous districts.
Formal recursive formulation of shortest splitline districting algorithm:
ShortestSplitLine( State, N ){
If N=1 then output entire state as the district;
A = floor(N/2);
B = ceiling(N/2);
find shortest splitline resulting in A:B population ratio
(breaking ties, if any, as described in notes);
Use it to split the state into the two HemiStates SA and SB;
ShortestSplitLine( SB, B );
ShortestSplitLine( SA, A );
}
Notes:
Since the Earth is round, when we say "line" we more precisely mean
"great circle." If there is an exact length-tie for "shortest" then break that tie by using
the line closest to North-South orientation, and if it's still a tie, then use the Westernmost of
the tied dividing lines.
If the state is convex, then a line will always split it into exactly two pieces
(each itself convex). However, for nonconvex states, a line could split it into more than
two connected pieces e.g. by "cutting off several bumps." (We expect that will occur
rarely, but it is certainly mathematically possible.) In either case the splitline's
"length" means the distance between the two furthest-apart points of the line that both lie
within the region being split.
If anybody's residence is split in two by one of the splitlines (which would happen,
albeit very rarely) then they are automatically declared to lie in
the most-western (or if line is EW, then northern) of the two districts.
(An alternative idea would be to permit such voters to choose which district they want to be in.)
Example:
The picture shows
Louisiana districted using the
shortest splitline algorithm using year-2000 census data.
Want N=7 districts.
Split at top level: 7 = 4+3. This is the NNE-directed splitline.
Split at 2nd level: 7 = (2+2) + (1+2).
Split at 3rd level: 7 = ((1+1) + (1+1)) + ((1) + (1+1)).
result: 7 districts, all exactly equipopulous.
How to make a computer run the SSA very quickly
Theorem (single split-stage of SSA):
Given as input:
The V vertices of the border of the country (assumed to be a simple
polygon or collection of simple polygons, can have polygonal holes, use
anticlockwise order for outer boundaries and clockwise for inner boundaries).
Let V=V_{R}+V_{C}
where V_{R}
is the number of "reflex" vertices (i.e. bent "the wrong way")
and V_{C} is the number of "convex" vertices (bending toward the interior of the country)
of the country-border.
P people (specified as XY coordinates on the plane);
note the people and polygon vertices can also be specified on the sphere
using XYZ coordinates or latitude-longitude, and our methods will still work
provided all these V+P points lie within the interior of some hemisphere.
A randomized algorithm exists (on a pointer machine with real arithmetic)
to find the (exact) "shortest splitline" which splits the population in any specified
ratio, in
O( [ V + h(P+V_{R}) α(P+V_{R}) ] logV )
steps, while consuming O(P+V) memory cells.
Here α(x) is an
inverseAckermann
function (extremely slowly-growing)
while h(P) is the largest possible number of halving lines
for a P-point set in the plane.
Tamal Dey proved
that h(P)≤2^{5/3}P^{4/3},
which was improved to by
Khovanova & Yang 2012 to
h(P)≤[(40/3)(P-1)P^{3}]^{1/3}
(which reduces the constant factor to 74.7% of Dey's
value).
A conjecture of Erdös et al says
h(P)=P^{1+o(1)}.
In the other direction, point sets constructed by Geza Toth
show h(P)≥Pexp((lnP)^{1/2}C)
for some constant C>0.
[G.Toth: Point sets with many k-sets,
Discrete & Computational Geometry 26,2 (2001) 187-194.]
Thus given Erdös's conjecture our runtime bound is
O( [ V + (P+V_{R})^{1.001} ] logV )
steps.
Corollary (full SSA):
Given also an integer N with 2≤N≤P, we can run the whole shortest splitline
algorithm to subdivide the country into N equipopulous (to within 1 person) districts, in
O( [V + N + h(P+V_{R}) α(P+V_{R})] logV logN )
steps while consuming O(P+V+N) memory cells.
Given Erdös's conjecture this runtime bound is
O( [V + N + (P+V_{R})^{1.001}] logV logN ).
Conjectural Further Speedup:
If the country is specified using only a single simple polygon (now no polygonal
"holes" nor multiple connected components are allowed; the word "simple" means "not self-crossing")
then the "logV" in all
the runtime bounds above, can be omitted (i.e. replaced by "1").
If we do allow holes and multiple components then still it ought to be possible
to attain
O( [(V + N) logV + h(P+V_{R}) α(P+V_{R})] logN )
Optimality:
These algorithm runtimes and space-usages obviously are optimal to within logarithmic
and inverse-Ackermanic factors and a factor of h(P)/P.
All of these are slow growing.
If this full scheme is programmed it should be extremely
fast and accurate, perhaps running in only a few minutes for an entire country
using IEEE double precision real arithmetic
to find the exact splitline to 9 decimal places, i.e. centimeter accuracy.
Proof (how the algorithm works):
We simply combine a lot of known algorithmic results from computational geometry.
If we are on the sphere (round earth), then
we can perform a gnomonic (central)
projection to
map everything onto a flat plane in such a way that
geodesic segments on the sphere
correspond to line segments on the plane.
As the central direction of the projection we can use the direction from
the Earth-center
to the center of the smallest sphere enclosing all the V+P points.
This smallest enclosing sphere may be found in O(V+P) steps
(Megiddo 1983; or it is simpler to use the
approach of Welzl 1991 based on adding the points one at a time in random order).
Then the projection
can be done in O(1) steps per point, i.e. also in O(V+P) steps.
Find the convex hull of the V border-points of the country (and the P people,
except since the people are assumed to lie inside the country they can be ignored
for this purpose). This takes O(VlogV) steps. We now have a subdivision
of that convex hull polygon into polygonal
regions, some of which are inside the country and some of which are not.
Also subdivide all of these regions further by
triangulating them by diagonals
in O(VlogV) steps as in Fournier & Montuno 1984.
Apply the preprocessing of
Guibas, Hershberger, Leven, Sharir, Tarjan 1987
to all those triangulated polygons.
This paper in O(V) time builds a data structure in O(V) memory cells
allowing arbitrary future point+direction
queries to be answered in O(logV) steps per query,
where each answer says which polygonal
region the query-point is located in, and which border
line-segment it sees if it looks
in the given direction.
Use the algorithm of Har-Peled 2000 (viewed using "point-line duality"; see also Chan 1999)
to "visit" each pair AB of people forming a
(desired population ratio) splitline for the P people. This
algorithm runs in O(α(P)h(P)) steps.
For each such pair AB, find the two points where the line through AB hits the convex hull
(this may be done in O(logV) time per query using a procedure akin to "binary search")
then use the GHLST data structure to "look inward along AB" from those two points
to find the two furthest-apart points on AB which lie within the country.
To be more careful and exact:
"the" splitline corresponding to a person-pair AB is actually a
family of
splitlines (since the lines do not actually go exactly through a person, and we
allow 1 extra person on one side of the line than the other).
Each family consists of all the
lines lying between two particular lines, and
our description last step
was really only about finding the bounding-lines for each such family.
We can use Har-Peled's methods to visit each such family.
There actually is not
just one point on the country-border on each end of this line, but rather
the whole line-family intersects the country border (at each end) at two sets
S_{A} and S_{B} of
line segments (and partial line segments). We then need to find the shortest
line within that whole family. It is best to
include the reflex vertices of the country-border polygon(s) as artificial extra "people"
(used for the purpose of finding splitline family-boundaries AB, where A and/or B are
allowed now to be artificial people, but not used for
the purpose of reckoning population-counts on each side of the line; Har-Peled's algorithm
can be made to handle this refinement).
This is why the terms in the runtime bounds arising from Har-Peled's algorithm
involve not P but rather P+V_{R}.
The task of finding the shortest splitting line-segment in the family
may be accomplished by building the
Voronoi diagram of all the L line segments in
S_{A} (using construction algorithms by Yap 1987,
or Boissonnat et al 1992, preprocessing them for fast point-location using Kirkpatrick
1983 or Sarnak-Tarjan 1986) in O(LlogL) time, then
querying using all the points in S_{B}
[each query takes O(logL) steps; also I think
all this can be done using GHLST's data structure]
to find the closest S_{A}-point to each
(and vice versa with the roles of A & B interchanged)
and computing exact distances from points to line-segments (and using the true-spherical
distance formula if problem originally posed on round earth).
Note that due to including all the reflex boundary-vertices as "artificial people"
each such family only intersects the boundary polygons on convex chains
and of course we discard out all but the outermost two chains for this purpose.
Notes on the conjectural further speedup:
A V-vertex simple polygon can be triangulated in O(V), not VlogV time using
an extremely complicated algorithm by Chazelle 1991
(see also Amato, Goodrich, Ramos 2000).
The GHLST data structure then can be built in O(V), not VlogV, time for a simple polygon.
The convex hull of a single simple polygon can be found in O(V), not VlogV time
using a method related to the "Graham scan."
The Voronoi diagram of the vertices of a convex V-vertex polygon
can be built in O(V), not VlogV time using a method of
Aggarwal & Shor 1987.
(This is relevant to the final step of
our algorithm involving convex subchains S_{A} and S_{B} of the
polygonal boundary.)
This suggests that the distance between two convex polygonal chains, with V vertices in all,
both inward-bending, should be computable in
O(V) not VlogV time.
(That was already known using 2D
convex-programming methods if they both are outward-bending and
with disjoint convex hulls.)
I suspect by some "moving fingers" methods the
need for the GHLST data structure can be eliminated
and so can the need for the binary-search-like procedure on the convex hull,
but we nevertheless could obtain O(1) query-time in an "amortized" sense.
Notes for practical programmers (as opposed to
algorithm-theoreticians who want theoretical guarantees):
In practice I would recommend a number of shortcuts to make programming simpler
without costing much:
I would not use the whole set of people, I'd use some random subset with
only, say, (PNlogP)^{1/2} people, which ought to provide good enough
accuracy almost certainly comparable or smaller than errors the census has anyway.
(And if you want more pseudo-accuracy, later refinement could be done
using the full population.)
I'd skip the final step (or rather, use some simplified form of it).
That is, the whole point that really, any two people-subsets separable by
a splitline are in general
separable by a family of splitlines, is for real populations in real countries,
not going to matter much since it only allows improving splitline lengths by tiny amounts.
since the families will be very "thin" and only intersect the country boundaries on very short
subsegments. If so, these can be dealt with by brute force rather than the fancy
Voronoi-diagram building and point-location (asymptotically efficient) method.
The whole (very complicated) GHLST data structure
could be omitted if your country borders are
not insanely
wiggly, by keeping track of a number of "fingers" on the borders which all move
around slightly as Har-Peled's lines move around slightly (he visits them in a
nice order which doesn't do a lot of "motion").
I'd not even attempt to program the "further speedup."
References
A. Aggarwal, L. Guibas, J. Saxe, P. Shor:
A Linear time algorithm for computing the Voronoi diagram of a convex polygon,
Discrete & Computational Geometry 4 (1989) 591-604.
See also L.Paul Chew:
Building Voronoi Diagrams for Convex Polygons in Linear Expected Time,
(1990) Dartmouth Dept. of Math & Computer Science PCS-TR90-147.
Steven J. Fortune: A sweepline algorithm for Voronoi Diagrams, Algorithmica 2,2 (1987) 153-174.
A.Fournier & D.Y.Montuno: Triangulating simple polygons and similar problems,
ACM Trans.Graphics 3,2 (1984) 153-174.
[O(NlogN) algorithm and O(N) if begin from a "trapezoidation."]
Leonidas Guibas, John Hershberger, Daniel Leven, Micha Sharir, Robert E. Tarjan:
Linear-time algorithms for visibility and shortest path problems inside
triangulated simple polygons,
Algorithmica 2,1-4 (1987) 209-233.
Emo Welzl:
Smallest Enclosing Disks (Balls and Ellipsoids), pp.359-370 in H. Maurer (Ed.),
New Results and New Trends in Computer Science,
Springer Lecture Notes in Computer Science #555, (1991).
See also Bernd Gärtner: Fast
and Robust Smallest Enclosing Balls,
Proc. 7th Annual European Symposium on Algorithms (ESA), Springer
Lecture Notes in Computer Science #1643, pp.325-338.
Chee K. Yap:
An O(n log n) Algorithm
for the Voronoi Diagram of
a Set of Simple Curve Segments,
Discrete & Comput. Geometry 2,4 (1987) 365-393.