Connected Components on 1000x1000x1000 Datasets
W. Randolph Franklin & Eric Landis
Rensselaer Polytechnic Institute & U Maine/Orono
http://wrfranklin.org/
frankwrATrpiDOTedu or wrfATecseDOTrpiDOTedu or mailATwrfranklinDOTorg or franklinATcsDOTrpiDOTedu
Thanks to NSF CCR0306502 and DMS0327634.
The Problem
 Understanding how a block of concrete breaks up as it is
being crushed to failure.
 2d slices thru Xray scan data might not convey the 3D nature of the pieces.
 Thesis: use connected components stats.
Some Slices
Before thresholding
After thresholding (done by another program).
The Proposed Solution's Problem
 This is a classic unionfind problem for
compiling 1957 Fortran EQUIVALENCE statements.
EQUIVALENCE (A,B), (C,D)
EQUIVALENCE (C,B)
 Trivia: Equivalence statements were in Fortran even before subroutines.
 Solve by building a forest of connected components and
squashing trees to minimize depth.
 Tarjan: T=θ(N α*(N))
 What is the constant factor?
 No efficient implementations found in literature.
Our Special Case
 Our elements have a Euclidean metric: connectivity is implicit.
 Union operations are not random, but occur only as the data is read.
 Large datasets require minimizing space and time constant factors.
 Processing external or internal data sequentially would minimize
thrashing. (In fact, this was not a problem).
No toy datasets.
Data Structures
 The elements are 1000^{3} voxels in E^{3}.
 There is some correlation between elements.
 Don't store individuals; store 1D runs of elements.
(Efficiency is inputdependent).
 Run (x,y, z_{lo}, z_{hi}): fixed (x,y). z_{lo}<=z<=z_{hi}.
 Each connected component is a tree of runs.
 Each voxel is connected to either its 6 or its 26 neighbors.
 Don't store edges; infer them.
 If universe is N^{3}, and M=# runs,
Storage (bytes) = 12M + 4N^{2}.
Reading Input
 Data is read row (fixed (x,y)), by row.
 Each row is grouped into runs.
 The runs are inserted into a big run array in order by
(x,y,z_{lo}), indexed by an array of dope vectors.
 The (x,y) row of runs is compared against the (x1,y),
(x,y1), and maybe (x1,y1) rows to find adjacencies.
 That takes linear time.
 When two runs are adjacent, compute union area.
A_{union} = A_{1} + A_{2}  2 A_{overlap}
 Trees are squashed when traversed.
Code and Algorithm Testing
How do we know that it works?
 Test examples with known number of components.
 Rotate datasets around the grand diagonal: that completely changes the runs, but the components should be invariant.
Three Experiment Sets
 1024 x 1088 x 1088 3D concrete scan
 18573 x 19110 2D map scan
 3D random tests
 All this required > 1000 program executions.
Big Example
 Universe: 1024x1088x1088 = 1,212,153,856
 Fraction of voxels that are elements: 50%
 N runs: 20,216,828; Mean elements per run: 30
 N components: 4,539,562
 N pairs of adjacent runs united: 29,978,799
 N already in same component: 14,301,533
 Largest component: 2993 runs, 23675 elements.
 2GHz IBM t43p laptop; SuSE linux and gcc 4.1.0.
 Virtual memory used: 340MB
 Elapsed (also CPU) time: 51 secs. (I/O insignif.)
Effect of Different Thesholds
What value makes the data the richest?
Components' Fractal Dimension

 Computed on approx 512^{3} data.
 26connectivity.
 d=3 ⇒
area ∝ volume^{2/3}.
 That's not happening here.

Component Volume Histogram

N_{V} ≈ ∝ V^{2.5}

2D Special Case

 USGS map, 20" square, scanned at 1000 dpi.
 18573x19110 pixels.
 32858 comps, 118.371 runs/component
 Time: 25 cpu secs on 1600 MHz Pentium.
 Output image is a 2000x2000 detail.

Random Universe Experiments
 Assume each voxel is an element independently with probability p
 Vary universe size and connectivity (6 or 26).
 Perform multiple experiments and average.
 Measure component number and size.
Number of Random Components
 Max number occurs at p=0.2 for 6connectivity,
 or at p=0.05 for 26connectivity.
 Independent of universe size.
Repeatability
 For p=0.5 for 6connectivity, do 100 experiments for each of many universe sizes.
 Plot number of components.
Size of Random Components
 Relation of size to p is not clear.
 Theory??
Future Applications
 Porosity: Oilbearing rock has voids; do they connect?
 (Electrical or heat) conductivity: between opposite faces of universe of random elements, such as an aerogel or random fibers.
 Strength of an object formed by random elements.
Possible computer experiments: What is lightest/cheapest object satisfying some given min or max property?
Interactive visualization: Small examples would be fast, perhaps 0.1 secs for 100^{3} universe. Show components immediately as the user changes inputs.