Connected Components on 1000x1000x1000 Datasets
W. Randolph Franklin & Eric Landis
Rensselaer Polytechnic Institute & U Maine/Orono
http://wrfranklin.org/
frankwrATrpiDOTedu or wrfATecseDOTrpiDOTedu or mailATwrfranklinDOTorg or franklinATcsDOTrpiDOTedu
Thanks to NSF CCR-0306502 and DMS-0327634.
The Problem
- Understanding how a block of concrete breaks up as it is
being crushed to failure.
- 2-d slices thru X-ray scan data might not convey the 3D nature of the pieces.
- Thesis: use connected components stats.
Some Slices
Before thresholding
After thresholding (done by another program).
The Proposed Solution's Problem
- This is a classic union-find problem for
compiling 1957 Fortran EQUIVALENCE statements.
EQUIVALENCE (A,B), (C,D)
EQUIVALENCE (C,B)
- Trivia: Equivalence statements were in Fortran even before subroutines.
- Solve by building a forest of connected components and
squashing trees to minimize depth.
- Tarjan: T=θ(N α*(N))
- What is the constant factor?
- No efficient implementations found in literature.
Our Special Case
- Our elements have a Euclidean metric: connectivity is implicit.
- Union operations are not random, but occur only as the data is read.
- Large datasets require minimizing space and time constant factors.
- Processing external or internal data sequentially would minimize
thrashing. (In fact, this was not a problem).
No toy datasets.
Data Structures
- The elements are 10003 voxels in E3.
- There is some correlation between elements.
- Don't store individuals; store 1-D runs of elements.
(Efficiency is input-dependent).
- Run (x,y, zlo, zhi): fixed (x,y). zlo<=z<=zhi.
- Each connected component is a tree of runs.
- Each voxel is connected to either its 6 or its 26 neighbors.
- Don't store edges; infer them.
- If universe is N3, and M=# runs,
Storage (bytes) = 12M + 4N2.
Reading Input
- Data is read row (fixed (x,y)), by row.
- Each row is grouped into runs.
- The runs are inserted into a big run array in order by
(x,y,zlo), indexed by an array of dope vectors.
- The (x,y) row of runs is compared against the (x-1,y),
(x,y-1), and maybe (x-1,y-1) rows to find adjacencies.
- That takes linear time.
- When two runs are adjacent, compute union area.
Aunion = A1 + A2 - 2 Aoverlap
- Trees are squashed when traversed.
Code and Algorithm Testing
How do we know that it works?
- Test examples with known number of components.
- Rotate datasets around the grand diagonal: that completely changes the runs, but the components should be invariant.
Three Experiment Sets
- 1024 x 1088 x 1088 3D concrete scan
- 18573 x 19110 2D map scan
- 3D random tests
- All this required > 1000 program executions.
Big Example
- Universe: 1024x1088x1088 = 1,212,153,856
- Fraction of voxels that are elements: 50%
- N runs: 20,216,828; Mean elements per run: 30
- N components: 4,539,562
- N pairs of adjacent runs united: 29,978,799
- N already in same component: 14,301,533
- Largest component: 2993 runs, 23675 elements.
- 2GHz IBM t43p laptop; SuSE linux and gcc 4.1.0.
- Virtual memory used: 340MB
- Elapsed (also CPU) time: 51 secs. (I/O insignif.)
Effect of Different Thesholds
What value makes the data the richest?
Components' Fractal Dimension
|
- Computed on approx 5123 data.
- 26-connectivity.
- d=3 ⇒
area ∝ volume2/3.
- That's not happening here.
|
Component Volume Histogram
|
NV ≈ ∝ V-2.5
|
2D Special Case
|
- USGS map, 20" square, scanned at 1000 dpi.
- 18573x19110 pixels.
- 32858 comps, 118.371 runs/component
- Time: 25 cpu secs on 1600 MHz Pentium.
- Output image is a 2000x2000 detail.
|
Random Universe Experiments
- Assume each voxel is an element independently with probability p
- Vary universe size and connectivity (6 or 26).
- Perform multiple experiments and average.
- Measure component number and size.
Number of Random Components
- Max number occurs at p=0.2 for 6-connectivity,
- or at p=0.05 for 26-connectivity.
- Independent of universe size.
Repeatability
- For p=0.5 for 6-connectivity, do 100 experiments for each of many universe sizes.
- Plot number of components.
Size of Random Components
- Relation of size to p is not clear.
- Theory??
Future Applications
- Porosity: Oil-bearing rock has voids; do they connect?
- (Electrical or heat) conductivity: between opposite faces of universe of random elements, such as an aerogel or random fibers.
- Strength of an object formed by random elements.
Possible computer experiments: What is lightest/cheapest object satisfying some given min or max property?
Interactive visualization: Small examples would be fast, perhaps 0.1 secs for 1003 universe. Show components immediately as the user changes inputs.