Connected Components on 1000x1000x1000 Datasets

W. Randolph Franklin & Eric Landis

Rensselaer Polytechnic Institute & U Maine/Orono

http://wrfranklin.org/

frankwrATrpiDOTedu or wrfATecseDOTrpiDOTedu or mailATwrfranklinDOTorg or franklinATcsDOTrpiDOTedu

Thanks to NSF CCR-0306502 and DMS-0327634.

The Problem

Understanding how a block of concrete breaks up as it is being crushed to failure.
2-d slices thru X-ray scan data might not convey the 3D nature of the pieces.
Thesis: use connected components stats.

Some Slices

Before thresholding

After thresholding (done by another program).

The Proposed Solution's Problem

This is a classic union-find problem for compiling 1957 Fortran EQUIVALENCE statements.
```
      EQUIVALENCE (A,B), (C,D)
      EQUIVALENCE (C,B)
```
Trivia: Equivalence statements were in Fortran even before subroutines.
Solve by building a forest of connected components and squashing trees to minimize depth.
Tarjan: T=θ(N α*(N))
What is the constant factor?
No efficient implementations found in literature.

Our Special Case

Our elements have a Euclidean metric: connectivity is implicit.
Union operations are not random, but occur only as the data is read.
Large datasets require minimizing space and time constant factors.
Processing external or internal data sequentially would minimize thrashing. (In fact, this was not a problem).

No toy datasets.

Data Structures

The elements are 1000³ voxels in E³.
There is some correlation between elements.
Don't store individuals; store 1-D runs of elements. (Efficiency is input-dependent).
Run (x,y, z_lo, z_hi): fixed (x,y). z_lo<=z<=z_hi.
Each connected component is a tree of runs.
Each voxel is connected to either its 6 or its 26 neighbors.
Don't store edges; infer them.
If universe is N³, and M=# runs,

Storage (bytes) = 12M + 4N².

Reading Input

Data is read row (fixed (x,y)), by row.
Each row is grouped into runs.
The runs are inserted into a big run array in order by (x,y,z_lo), indexed by an array of dope vectors.
The (x,y) row of runs is compared against the (x-1,y), (x,y-1), and maybe (x-1,y-1) rows to find adjacencies.
That takes linear time.
When two runs are adjacent, compute union area. A_union = A₁ + A₂ - 2 A_overlap
Trees are squashed when traversed.

Code and Algorithm Testing

How do we know that it works?

Test examples with known number of components.
Rotate datasets around the grand diagonal: that completely changes the runs, but the components should be invariant.

Three Experiment Sets

1024 x 1088 x 1088 3D concrete scan
18573 x 19110 2D map scan
3D random tests
All this required > 1000 program executions.

Big Example

Universe: 1024x1088x1088 = 1,212,153,856
Fraction of voxels that are elements: 50%
N runs: 20,216,828; Mean elements per run: 30
N components: 4,539,562
N pairs of adjacent runs united: 29,978,799
N already in same component: 14,301,533
Largest component: 2993 runs, 23675 elements.
2GHz IBM t43p laptop; SuSE linux and gcc 4.1.0.
Virtual memory used: 340MB
Elapsed (also CPU) time: 51 secs. (I/O insignif.)

Effect of Different Thesholds

What value makes the data the richest?

Components' Fractal Dimension

Computed on approx 512³ data.
26-connectivity.
d=3 ⇒
area ∝ volume^2/3.
That's not happening here.

Component Volume Histogram

N_V ≈ ∝ V^-2.5

2D Special Case

USGS map, 20" square, scanned at 1000 dpi.
18573x19110 pixels.
32858 comps, 118.371 runs/component
Time: 25 cpu secs on 1600 MHz Pentium.
Output image is a 2000x2000 detail.

Random Universe Experiments

Assume each voxel is an element independently with probability p
Vary universe size and connectivity (6 or 26).
Perform multiple experiments and average.
Measure component number and size.

Number of Random Components

Max number occurs at p=0.2 for 6-connectivity,
or at p=0.05 for 26-connectivity.
Independent of universe size.

Repeatability

For p=0.5 for 6-connectivity, do 100 experiments for each of many universe sizes.
Plot number of components.

Size of Random Components

Relation of size to p is not clear.
Theory??

Future Applications

Porosity: Oil-bearing rock has voids; do they connect?
(Electrical or heat) conductivity: between opposite faces of universe of random elements, such as an aerogel or random fibers.
Strength of an object formed by random elements.

Possible computer experiments: What is lightest/cheapest object satisfying some given min or max property?

Interactive visualization: Small examples would be fast, perhaps 0.1 secs for 100³ universe. Show components immediately as the user changes inputs.

2006 FWCG, Smith College

Connected Components on 1000³ Datasets

Connected Components on 1000x1000x1000 Datasets

W. Randolph Franklin & Eric Landis

Rensselaer Polytechnic Institute & U Maine/Orono

http://wrfranklin.org/

frankwrATrpiDOTedu or wrfATecseDOTrpiDOTedu or mailATwrfranklinDOTorg or franklinATcsDOTrpiDOTedu

Thanks to NSF CCR-0306502 and DMS-0327634.

The Problem

Some Slices

The Proposed Solution's Problem

Our Special Case

Data Structures

Reading Input

Code and Algorithm Testing

Three Experiment Sets

Big Example

Effect of Different Thesholds

Components' Fractal Dimension

Component Volume Histogram

2D Special Case

Random Universe Experiments

Number of Random Components

Repeatability

Size of Random Components

Future Applications

2006 FWCG, Smith College

Connected Components on 10003 Datasets

Connected Components on 1000x1000x1000 Datasets

W. Randolph Franklin & Eric Landis

Rensselaer Polytechnic Institute & U Maine/Orono

http://wrfranklin.org/

frankwrATrpiDOTedu or wrfATecseDOTrpiDOTedu or mailATwrfranklinDOTorg or franklinATcsDOTrpiDOTedu

Thanks to NSF CCR-0306502 and DMS-0327634.

The Problem

Some Slices

The Proposed Solution's Problem

Our Special Case

Data Structures

Reading Input

Code and Algorithm Testing

Three Experiment Sets

Big Example

Effect of Different Thesholds

Components' Fractal Dimension

Component Volume Histogram

2D Special Case

Random Universe Experiments

Number of Random Components

Repeatability

Size of Random Components

Future Applications

Connected Components on 1000³ Datasets