Connected Components on 1000x1000x1000 Datasets

W. Randolph Franklin & Eric Landis

Rensselaer Polytechnic Institute & U Maine/Orono

http://wrfranklin.org/

frankwrATrpiDOTedu or wrfATecseDOTrpiDOTedu or mailATwrfranklinDOTorg or franklinATcsDOTrpiDOTedu

Thanks to NSF CCR-0306502 and DMS-0327634.

The Problem

  • Understanding how a block of concrete breaks up as it is being crushed to failure.
  • 2-d slices thru X-ray scan data might not convey the 3D nature of the pieces.
  • Thesis: use connected components stats.

     

Some Slices

Before thresholding

After thresholding (done by another program).

The Proposed Solution's Problem

  • This is a classic union-find problem for compiling 1957 Fortran EQUIVALENCE statements.
          EQUIVALENCE (A,B), (C,D)
          EQUIVALENCE (C,B)
    
  • Trivia: Equivalence statements were in Fortran even before subroutines.
  • Solve by building a forest of connected components and squashing trees to minimize depth.
  • Tarjan: T=θ(N α*(N))
  • What is the constant factor?
  • No efficient implementations found in literature.

Our Special Case

  • Our elements have a Euclidean metric: connectivity is implicit.
  • Union operations are not random, but occur only as the data is read.
  • Large datasets require minimizing space and time constant factors.
  • Processing external or internal data sequentially would minimize thrashing. (In fact, this was not a problem).
No toy datasets.

Data Structures

  • The elements are 10003 voxels in E3.
  • There is some correlation between elements.
  • Don't store individuals; store 1-D runs of elements. (Efficiency is input-dependent).
  • Run (x,y, zlo, zhi): fixed (x,y). zlo<=z<=zhi.
  • Each connected component is a tree of runs.
  • Each voxel is connected to either its 6 or its 26 neighbors.
  • Don't store edges; infer them.
  • If universe is N3, and M=# runs,

    Storage (bytes) = 12M + 4N2.

Reading Input

  • Data is read row (fixed (x,y)), by row.
  • Each row is grouped into runs.
  • The runs are inserted into a big run array in order by (x,y,zlo), indexed by an array of dope vectors.
  • The (x,y) row of runs is compared against the (x-1,y), (x,y-1), and maybe (x-1,y-1) rows to find adjacencies.
  • That takes linear time.
  • When two runs are adjacent, compute union area.
    Aunion = A1 + A2 - 2 Aoverlap
  • Trees are squashed when traversed.

Code and Algorithm Testing

How do we know that it works?
  • Test examples with known number of components.
  • Rotate datasets around the grand diagonal: that completely changes the runs, but the components should be invariant.

Three Experiment Sets

  • 1024 x 1088 x 1088 3D concrete scan
  • 18573 x 19110 2D map scan
  • 3D random tests
  • All this required > 1000 program executions.

Big Example

  • Universe: 1024x1088x1088 = 1,212,153,856
  • Fraction of voxels that are elements: 50%
  • N runs: 20,216,828; Mean elements per run: 30
  • N components: 4,539,562
  • N pairs of adjacent runs united: 29,978,799
  • N already in same component: 14,301,533
  • Largest component: 2993 runs, 23675 elements.
  • 2GHz IBM t43p laptop; SuSE linux and gcc 4.1.0.
  • Virtual memory used: 340MB
  • Elapsed (also CPU) time: 51 secs. (I/O insignif.)

Effect of Different Thesholds

What value makes the data the richest?

Components' Fractal Dimension

  • Computed on approx 5123 data.
  • 26-connectivity.
  • d=3 ⇒
    area ∝ volume2/3.
  • That's not happening here.

Component Volume Histogram

NV ≈ ∝ V-2.5

2D Special Case

  • USGS map, 20" square, scanned at 1000 dpi.
  • 18573x19110 pixels.
  • 32858 comps, 118.371 runs/component
  • Time: 25 cpu secs on 1600 MHz Pentium.
  • Output image is a 2000x2000 detail.

Random Universe Experiments

  • Assume each voxel is an element independently with probability p
  • Vary universe size and connectivity (6 or 26).
  • Perform multiple experiments and average.
  • Measure component number and size.

Number of Random Components

  • Max number occurs at p=0.2 for 6-connectivity,
  • or at p=0.05 for 26-connectivity.
  • Independent of universe size.

Repeatability

  • For p=0.5 for 6-connectivity, do 100 experiments for each of many universe sizes.
  • Plot number of components.

Size of Random Components

  • Relation of size to p is not clear.
  • Theory??

Future Applications

  • Porosity: Oil-bearing rock has voids; do they connect?
  • (Electrical or heat) conductivity: between opposite faces of universe of random elements, such as an aerogel or random fibers.
  • Strength of an object formed by random elements.

Possible computer experiments: What is lightest/cheapest object satisfying some given min or max property?

  • Interactive visualization: Small examples would be fast, perhaps 0.1 secs for 1003 universe. Show components immediately as the user changes inputs.