(back to Connected Components main page)

# Conditions of Use and Availability

1. Connect was developed by W. Randolph Franklin, with encouragement from George Nagy, and with discussions and test data from Eric Landis, George Nagy, and Tong Zhang.
2. Material created by WR Franklin is Copyright © 1994-2006, W. Randolph Franklin under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License.
3. A distribution tarball, corrected on May 19, 2018, is here.

# Environment

1. Connect is implemented using Linux.
2. Connect is the user-callable program. It is a zsh function that calls a C++ program, connect2, and processes its output for the user.
3. Smaller auxiliary functions are written either in C++ or as zsh shell functions.

# Input Format

1. The data represents a rectangular solid of size NX x NY x NZ voxels.
2. The order of voxels follows the C standard: Z dimension varying fastest, then Y, then X.
3. Connect connects the 0 voxels into components.
4. The input data file is in binary format, and is a string of bits, one per voxel. Regardless of the machine's endianness, the first voxel in each byte corresponds to the most significant bit (value 128), the second to 64, and so on.
5. There are no other chars, including newlines, in the file.
6. If NZ is not a multiple of 8, then one byte will contain both the last few voxels of one row, and also the first few voxels of the next row. I.e., the rows are not padded.
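The indexing rules above can be illustrated with a small helper. This is a hypothetical function of my own (not part of the connect distribution), showing how to locate one voxel's bit under these conventions:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustration only (not from the connect source): voxels are ordered with
// z varying fastest, then y, then x; bits are packed 8 per byte with the
// most significant bit (value 128) first, and rows are not padded.
bool voxel(const std::vector<uint8_t>& data, int x, int y, int z,
           int NY, int NZ) {
    size_t i = (static_cast<size_t>(x) * NY + y) * NZ + z;  // linear voxel index
    return (data[i / 8] >> (7 - i % 8)) & 1;                // MSB-first in each byte
}
```

For example, with NZ=12 (not a multiple of 8), the voxel at (0,1,0) has linear index 12, so it lives in the high half of byte 1, which also holds the last four voxels of the previous row.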

# Output Format

1. The output lists some comments, then summary info about the data, and finally the components. Each component contains summary info, followed by its runs.
2. Three lines of comments for humans start the output.
3. After that, the output is a sequence of ASCII integers, separated into lines with a linefeed.
4. Consecutive numbers on one line are separated by one space (the usual case), or one tab (if mentioned below).
5. Each connected component that connect finds is described as a set of runs. One run is a sequence of consecutive voxels with fixed x and y, and occupying a range of z.
6. The coordinate system labels the voxels, not the gaps between them. I.e., a run from ZL=0 to ZH=2 has 3 voxels.
7. The first line of the output is a summary line with the following fields.
1. NX, the size of the universe in the x-direction.
2. NY
3. NZ
4. Connectivity, either 6 or 26.
5. Number of empty voxels (the ones being connected).
6. Number of runs.
7. Number of components in the universe.
8. Number of components actually written; see below.
8. Each component is on one line. These summary fields are first:
1. Number of runs in this component.
2. Volume (number of voxels).
3. Surface area.
9. Then there are 4 numbers for each run in the component.
1. X.
2. Y.
3. ZL.
4. ZH.
10. The separator before each run's x is a tab instead of a space.
11. To save space, components that are considered uninteresting are not written out. This includes components of volume 1 and components of volume greater than 100,000. This may be changed by editing routine write_components.
12. The components are written in descending order of volume.
13. Example:
    Connected components
    NX NY NZ connectivity #voxels #runs #components #comps_written:
    For each component: #runs, vol, area, (x, y, zl, zh)+
    512 544 544 6 76073721 2615676 534723 163110
    6027 47382 37206	161 197 178 178	162 197 177 179 ....
    4263 8478 27820	492 171 248 249	492 172 249 249 ....
    3871 8435 23908	499 305 158 158	499 315 173 174	500 ...
    ...
    2 2 10	0 2 360 360	0 3 360 360
    1 2 10	0 3 334 335
    1 2 10	0 3 315 316
    1 2 10	0 3 237 238
    1 2 10	0 3 234 235
    1 2 10	0 2 218 219
    1 2 10	0 0 267 268
    1 2 10	0 0 255 256
    1 2 10	0 0 248 249
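Since the format is whitespace-delimited, a component line is easy to parse mechanically. Here is a minimal C++ sketch (the struct and function names are mine, not part of the distribution):

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical types for one parsed component line (field names are mine).
struct Run { int x, y, zl, zh; };
struct Component { int nruns, volume, area; std::vector<Run> runs; };

// Parse one component line: "#runs vol area" followed by 4 ints per run.
// operator>> skips tabs and spaces alike, so the tab before each run's x
// needs no special handling.
Component parse_component(const std::string& line) {
    std::istringstream in(line);
    Component c;
    in >> c.nruns >> c.volume >> c.area;
    for (int i = 0; i < c.nruns; ++i) {
        Run r;
        in >> r.x >> r.y >> r.zl >> r.zh;
        c.runs.push_back(r);
    }
    return c;
}
```

A simple consistency check on a parsed line is that the volume field equals the sum of zh - zl + 1 over its runs, since the coordinates label voxels, not gaps.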



# How to Run

1. Connect2 gets info from its command line, thus:
 connect2 NX NY NZ CONNECTIVITY [KEY]  < INFILE > OUTFILE


where
1. NX, NY, and NZ are the dimensions of the input data.
2. CONNECTIVITY, which must be either 6 or 26, is the desired connectivity of the output components.
3. KEY is optional, and defaults to 0. If nonzero, it causes the following nonstandard action.
Instead of writing the data file described above to cout, write only the number of components found. This is useful for statistical analyses.
2. Of course, you may also pipe the data thus:
  gunzip < INFILE.gz | connect2 NX NY NZ CONNECTIVITY | gzip > OUTFILE.gz


3. You may use shell functions such as the following.
  function do512() {
    gunzip < binvol/p512.gz | connect2 512 544 544 6 | gzip >! /tmp/p512.out.gz
  }

4. Connect2 writes statistics to cerr (stderr).
5. Connect duplicates cerr to file stats.

# Auxiliary Programs

Here are some of the available auxiliary programs.

1. fns contains many zsh functions that call, and pipe together, the following programs. They show the exact sequence of commands used to calculate many of the results. Warning: some of the programs may have been changed since some of the functions were written.
2. char2bit: Reads cin, assumed to be an ASCII file of 0 and 1 chars, and packs the data by writing bits to cout.
3. bit2char: Reads cin, assumed to be a binary file, and converts each bit to an ASCII char, 0 or 1, writing that to cout.
4. rotatexyz: Rotates the data volume about the grand diagonal. This internally stores the volume with 1 byte per voxel.
All these rotation programs have the volume dimensions hard-coded into the source (since recompiling is so easy).
5. rotatexyz2: Rotates the data volume about the grand diagonal. This internally stores the volume with 1 bit per voxel.
6. rotateyzx2, rotatezxy2: like rotatexyz2, but with different sizes compiled in, so all 3 can be run together.
7. hist counts the number of 0s and 1s in a binary voxel file. This is useful since many programs, like pgmhist and xv, seem to invert the meanings of 0 and 1.
8. comp2pgm converts a specified tile of an output components file to a PGM file. Then, xv can be used to randomly color it for display. Currently, you must delete the first 3 lines of the components file before passing it to comp2pgm. This will be fixed eventually.

Later versions may rename some of the these programs (for consistency).
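As an illustration of the bit/char convention that char2bit and bit2char share, here is a sketch of the conversion bit2char performs (my own code, not the distributed source):

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Sketch of bit2char's conversion (illustration only): each input byte
// expands to 8 ASCII chars, most significant bit first, matching the
// bit order of the input format read by connect2.
std::string bits_to_chars(const std::vector<uint8_t>& bytes) {
    std::string out;
    for (uint8_t b : bytes)
        for (int i = 7; i >= 0; --i)
            out += ((b >> i) & 1) ? '1' : '0';
    return out;
}
```

char2bit is the inverse, packing 8 chars back into one byte; chaining the two should reproduce the original ASCII data, which is the test described in the Notes section below.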

# Method and Reasons

1. The data is stored in a binary file as bits since that is much faster to read. Also the file is 1/8 as large, tho compression would shrink the difference.
2. main calls, i.a., initialize and process_data.
3. process_data loops, processing the rows one-by-one with process_one_row.
4. A row of the data has fixed x and y, and 0 <= z < nz.
5. process_one_row calls read_row to read a row of data, then runnify_row to combine it into runs of adjacent 0 voxels. The relevant elements of a run are x, y, zl, zh.
6. Finally, process_one_row calls connect_row to connect the runs in this row to the runs in the rows to the left (y-1), below (x-1), and, if we're doing 26-connectivity, to the two diagonal rows down-left and down-right.
7. process_one_row tries to connect to those other 2 or 4 rows whether or not they exist (they may be off the edge of the universe). If a row does not exist, connect_row detects this and immediately returns.
8. The data structure for a connected component is a tree of runs. The root run of the tree is flagged. It is the highest numbered run in the component. Each other run has a pointer either to the root run, or to another higher-numbered run of the component.
9. connect_row compares the two rows to find any pairs of overlapping runs. This is done by zigzagging up the runs of the two rows, calling test_and_process_adjacency for any pair of runs that might overlap. The zigzag works as follows.
1. Compare the first run of the first row with the first run of the second row.
2. See which of these two runs has the lower zh.
3. For that row, advance to the next run.
10. test_and_process_adjacency takes two runs that are possibly adjacent. The definition of this depends on whether we are doing 6-connectivity or 26-connectivity. When the two runs are adjacent, their corresponding components must be unified into one component, as follows.
1. find_root_of_run traces the trees from both current runs to the trees' roots. This should not be far since we will be looking only at runs in the current x-plane and the one below.
2. If the two roots are different, then the lower-numbered root is made to point to the higher-numbered root, preserving the invariant that a component's runs point to higher-numbered runs. We are building up lists of runs from low x to high x by adding new runs at the list heads.
3. Finally, update_root makes a second pass from each of the two adjacent runs down to the root, making each run point directly to the new root. This flattens the tree, making future traversals faster.
4. Note that, now, other runs in the components may not point directly to the root, even if they once did, because the root may have changed.
5. There is some trickiness here. update_root(run,newroot) traces down from run to its root, making every run, including the run's root, point to newroot. This is ok if this whole component is being connected to another, different, component. However, if this component's root was already newroot, then this makes the root's root positive, when it should be negative. In fact, this creates an infinite loop. Therefore, every pair of calls to update_root is always followed by a fixup, making the root's root negative.
11. Since the root run of a component doesn't need a pointer to another run, that space is used to store the negative of the component's surface area. The area is much easier to calculate as the adjacent runs are processed than to calculate later. The negative is used to flag that this is a root.
12. In test_and_process_adjacency, when two runs in the same component are determined to be adjacent, that component's area is reduced by twice the overlap length. Note the '+1' since the coordinate refers to the voxel, and not to the gap between adjacent voxels.
13. When two runs that were not known to be in the same component are determined to be adjacent, which means that two components are being unified, then the area of the combined component is the sum of the areas of the two original components, minus twice the overlap area.
14. However, if we're doing 26-connectivity and the two current rows are diagonally adjacent (which info is passed along in argument diag), then adjacent runs have no overlapping area.
15. check_tree_depth_and_flatten iterates down thru the runs, rolling up any long chains of runs and flattening the trees to be only 1-deep. Since we are iterating from top to bottom, whenever we process a run, it should never be more than 2 from its root. We check this and report any errors, which would indicate a conceptual problem with my thinking. We also do a consistency check that each run's root is higher-numbered than the run.
16. assemble_components counts the number of components and assembles the runs of each component into a linked list, thus:
1. Iterate thru the runs. If a run is a root, then add it to the array of component roots, and increment the number of components. Otherwise, add it to the linked list of runs in this component. The list's head is in array compruns.
2. sort_components creates compsort, a dope vector with the numbers of all the interesting components. Components whose volume is one, or whose volume is greater than 100K are not considered interesting. The reasoning is this. The unit volume components are probably noise. In the largest dataset used to test connect, the very large component was the empty volume exterior to the object whose cracks we are interested in.
3. Then sort_components sorts compsort inversely by component volume. Routine compar does the actual comparison for qsort.
17. write_components writes the output header lines, containing comments for any people reading the file, followed by various summary numbers.
18. Then write_components iterates thru the components, writing their runs to cout.
19. Each component's header and all its runs are written out on one line. The line is variable length, possibly quite long. However, this makes it easy to use utility programs to find the i-th component, or to extract only the header info for statistical analysis.
20. Altho some obsolete text processing programs have hardcoded maximum line lengths of, say, 1024 chars, this is not a problem with current operating systems and utilities.
21. connect, the user-callable zsh function that calls connect2, duplicates cerr to the file stats as well as to the terminal.
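The run-tree bookkeeping described above can be sketched as follows. This is a simplified illustration under assumed names (RunForest, link, unify are mine, not the actual connect2 source); in particular, it stops tracing at the negative root entry, so the fixup that restores the root's negative area is folded into the last line of unify:

```cpp
#include <cassert>
#include <vector>

// Sketch (not the connect2 source) of the run tree: entry i holds either
// the index of a higher-numbered run in the same component, or, for a
// root, the negative of the component's surface area.
struct RunForest {
    std::vector<long> link;

    int find_root(int r) const {                 // cf. find_root_of_run
        while (link[r] > 0) r = static_cast<int>(link[r]);
        return r;
    }
    void flatten(int r, int root) {              // cf. update_root, simplified
        while (link[r] > 0) {
            int next = static_cast<int>(link[r]);
            link[r] = root;                      // point directly at the new root
            r = next;
        }
    }
    // Unify the components of adjacent runs a and b; the combined area is the
    // sum of the two areas (or the one area, if they were already the same
    // component), minus twice the overlap of the two runs.
    void unify(int a, int b, long overlap) {
        int ra = find_root(a), rb = find_root(b);
        long area = -link[ra] + (ra == rb ? 0 : -link[rb]) - 2 * overlap;
        int root = ra > rb ? ra : rb;            // higher-numbered root wins
        if (ra != rb) link[ra < rb ? ra : rb] = root;
        flatten(a, root);
        flatten(b, root);
        link[root] = -area;                      // root's entry stays negative
    }
};
```

For example, with runs {0,2} forming one component (root 2, area 10) and run {1} another (area 6), unifying runs 0 and 1 with overlap 2 leaves run 2 as the single root, holding area 10 + 6 - 4 = 12.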

# Notes

1. The initial development was in Visual C++ under Windows95, but was quickly changed to Linux, as it is so much easier to use.
2. Connect assumes that someone else has already thresholded the voxels as empty or solid.
3. When the input was an ASCII file of 0 and 1 chars, 2/3 of the total time was just reading the data. That's why I switched to binary data.
4. Concerning the endian problem: It's necessary only that connect2, char2bit, and bit2char be consistent among themselves. To be safe when exchanging data with other users, you might use ASCII.
5. Alternatively, given a binary file of uncertain endianness, you might process it, then flip the bit order within each byte and reprocess it. When the bits are ordered correctly the data is less random, so the file will be more connected.
6. Why the output format is ASCII instead of binary: 1) some numbers will be large enough that 4-byte ints are required, but 2) most ints are small. Therefore a binary file would be no smaller than an ASCII one. Processing it would be faster, but I ignore that for simplicity.
7. The output data format is designed to be easily human-readable as well as easily machine-readable. The former implies using special spacing; the latter implies including counts of the lengths.
8. All the runs for one component are written to one line to make it easier to page thru the output file, say to find the 10,000th component by size. It also makes it easier to extract just the component volumes and areas.
9. I tested char2bit (formerly called pack) and bit2char (formerly called unpack) by chaining them thus:
       gunzip < ascvol/New.gz | sum
       gunzip < ascvol/New.gz | pack | bit2char | sum


ascvol/New.gz has 544x544x100 = 29593600 voxels. The sums agreed.
10. Converting input from ASCII to bin: Compare the output component size histogram before and after for ascvol/New data (544x544x100). Output is the same.
However, the time is reduced by a factor of 4 to 5, which is excellent. Nevertheless, read_row is still perhaps 75% of the total time, depending on gprof's accuracy. (Many calls to short routines may be faking it out.)
11. On 14-Aug-99, I rotated the meaning of the coordinates to make more sense. Replace all refs to x by y, to y by z, and to z by x. Now, x varies slowest and z fastest. Regression test this on the 544x544x512 case. The stats on the components are the same.
12. Compiling with -O3, instead of -g -pg, doubles the speed.
13. I tested writing the output to a Win95 compressed disk in Linux, since this disk happened to have a lot of space. Computing and writing the 11MB output from p512 took 560 cpu secs on a 233MHz Pentium. Running the same program writing to /tmp took only 53 cpu secs. Conclusion: don't write to compressed Win95 partitions since it is so slow.
14. I tested the effect of compiling in larger arrays than necessary (on dataset p512). With arrays about twice as large as necessary: 53 secs on a 233MHz Pentium. With them just the necessary size: 53 secs on a 233MHz Pentium. Conclusion: no change.
15. I rewrote rotatexyz to store the voxels as bits not chars, to save space and prevent excessive paging. Now, the CPU time to rotate p512 is 128 secs on a 233MHz Pentium.
16. I tested the universe-rotation programs rotatexyz2, rotateyzx2, and rotatezxy2 by rotating p512 3 times with them and comparing the result to the original data. It matched. Conclusion: the rotation programs are probably correct.
17. I tested connect2 by comparing the list of components' volumes and areas calculated on the original p512 data, again on the rotated data, and again on the twice-rotated data. All three were the same. Since the runs are quite different on the rotated data, this gives me considerable confidence that the program is working.
18. I successfully repeated that test with 26-connectivity on the p512 data.
19. I tested not using update_root to flatten the trees, on the p512 data. This increased the time by 50%, and made the average number of pointers followed in find_root_of_run rise from 1.4 to 49. Conclusion: update_root is very useful.
20. Alternate versions of the data structures are possible, which would trade off space versus time in either direction.
21. On 13-Sept-99 I renamed the CC program from connect.cc to connect2.cc, and renamed the zsh function from process to connect.
22. When I ran connect reading compressed p512 data from a Unix partition, instead of uncompressed data from a vfat partition, the CPU time thru processing planes fell from 43.7 to 37.7. Conclusion: don't do I/O to VFAT partitions when doing time tests.
23. Now I can't reproduce that?? First, reading a file seems to cache it so that reading it again is faster. Second, I dunno.
24. On 16-Sept-99 I renamed the program from concrete to connect, which is more logical. Then I reran the p512 case and compared the output components. The files were identical.
25. On 20-Sept-99 I changed the output format to add comment lines, and the input to specify nx etc as command-line args. I also changed the sort to occur inside the C++ program instead of externally. This is faster and makes the program more portable.

# Version History

1. 1, 29 Sept 99: First public release.
2. 1.1, 3 Oct 99: Minor changes. Add a command-line option to write only the number of components. Do random volumes. More little stats analysis functions.
3. 1.2, 27 Oct 99: Correct an error where run #0, and all other runs that it pointed to in its component, were not written (because 0 was incorrectly used as an end-of-list indicator). Add info on the 2D blacklayer test.
4. 1.3, Sep 2006: Update for non-backward-compatible changes in the C++ language and libraries.

# Why Unix Makes Me Productive

Look at this one-line zsh function.

     function procout() {
       tail +2 | cut -f 1 | cut -d' ' -f 2-3 | sort -nr | uniq -c
     }


It reads the connect output file and produces the list of component volumes and areas, as follows.

1. Delete the first line.
2. In each line, delete everything after the first tab.
3. In each line delete everything except the second and third fields, delimited by spaces.
4. Do a reverse numeric sort on the first field.
5. Search for groups of consecutive identical lines. Replace each group with one copy of the line, preceded by a count of the number of occurrences.