# Software

## 1 Intro

Here we summarize our software that is freely available for nonprofit education and research. Our software -

1. is often by far the fastest,

2. can often process larger datasets, and

3. runs well in parallel on shared memory machines.

Collaborators include:

2. Daniel Benedetti

3. Chaulio Ferreira

4. Thiago L. Gomes

5. David Hedin

6. D Keane

7. Eric Landis

8. Wenli Li

9. Ed Nagy

10. George Nagy

11. Guilherme Pena

12. Salles Viana Gomes de Magalhaes

13. Tong Zhang

### 1.1 Notes

3. As this software has been produced and documented over decades, details can be obsolescent.

4. The SW is research quality, not commercial quality. We write it to prove that our ideas work. We work in specialized environments using tools that please us.

5. Systems change, bit rot, old programs that compiled once upon a time might not compile now. Both C++ and Fortran have changed incompatibly.

6. From time to time, we reorganize this site. It might get better, but old links might break.

7. Nevertheless, since we want people to see how good our algorithms are, please write with any questions, and we'll try to help.

## 2 Recent software

### 2.1 3D-EPUG-Overlay

3D-EPUG-Overlay is an algorithm for exactly computing the intersection (it can also compute general overlays) of pairs of 3D triangulated meshes. Each input mesh may represent a single solid or a more complex polygonal subdivision (representing multiple solids, each one identified by a label). 3D-EPUG-Overlay employs a combination of several techniques to ensure its correctness and efficiency:

1. Multiple precision rational numbers, to exactly represent the coordinates of the objects and completely eliminate roundoff errors during the computations.

2. Simulation of Simplicity, a symbolic perturbation technique, to ensure that all geometric degeneracies (special cases) are properly handled.

3. Simple data representations and local information, to simplify the correct processing of the data and make the algorithm more parallelizable.

4. A uniform grid, to efficiently index the data, and accelerate some of the steps of the algorithm such as testing pairs of triangles for intersection or locating points in the mesh.

5. An arithmetic filter (implemented using the excellent CGAL Interval arithmetic number type) is employed to accelerate the computation with the rationals (while still guaranteeing exactness), avoiding the usage of expensive operations with multiple-precision numbers during the evaluation of the predicates.

6. Parallel programming, to accelerate the intersection process and explore better the ubiquitous parallel computing capability of current hardware.

Experiments showed that it was up to 101 times faster than the algorithm available in LibiGL (the state of art algorithm for exact mesh intersection) and, also, it had a performance that was comparable to a parallel inexact algorithm that was specifically developed to be very fast. Besides being fast, 3D-EPUG-Overlay was more memory efficient than the other evaluated methods. Furthermore, in all test cases the result obtained by 3D-EPUG-Overlay matched the reference solution.

This publication list includes also our preceding 2D overlay algorithm.

Contact: Salles

### 2.2 ParCube

ParCube finds the pairwise intersections in a set of millions of congruent cubes. This operation is required when computing boolean combinations of meshes or polyhedra in CAD/CAM and additive manufacturing, and in determining close points in a 3D set. ParCube is very compact because it is uses a uniform grid with a functional programming API. ParCube is very fast; even single threaded it usually beats CGAL's elapsed time, sometimes by a factor of 3. Also because it is FP, ParCube parallelizes very well. On an Nvidia GPU, processing 10M cubes to find 6M intersections, it took 0.33 elapsed seconds, beating CGAL by a factor of 131. ParCube is independent of the specific parallel architecture, whether shared memory multicore Intel Xeon using either OpenMP or TBB, or Nvidia GPUs with thousands of cores. We expect the principles used in ParCube to apply to other computational geometry problems. Efficiently finding all bipartite intersections would be an easy extension.

Contact: WRF

## 3 Terrain

### 3.1 Hydrography

#### 3.1.1 EMFlow

EMFlow is a very efficient algorithm and implementation to compute the drainage network (i.e. the flow direction and flow accumulation) on huge terrains stored in external memory. Its utility lies in processing the large volume of high resolution terrestrial data newly available, which internal memory algorithms cannot handle efficiently.

EMFlow computes the flow direction using an adaptation of our previous method RWFlood that uses a flooding process to quickly remove internal depressions or basins. Flooding, proceeding inward from the outside of the terrain, works oppositely to the common method of computing downhill flow from the peaks.

To reduce the total number of I/O operations, EMFlow adopts a new strategy to subdivide the terrain into islands that are processed separately. The terrain cells are grouped into blocks that are stored in a special data structure managed as a cache memory.

EMFlow’s execution time was compared against the two most recent and most efficient published methods: TerraFlow and r.watershed.seg. It was, on average, 27 times faster than both methods, and EMFlow could process larger datasets. Processing a 50000x50000 terrain on a machine with 2GB of internal memory took only 3000 seconds, compared to 87000 seconds for TerraFlow while r.watershed.seg failed on terrains larger than 15000x15000. On very small, say 1000x1000 terrains, EMFlow takes under a second, compared to 6 to 20 seconds, so it could be a component of a future interactive system where a user could modify terrain and immediately see the new hydrography.

#### 2012

Contact: Thiago Gomes.

#### 3.1.2 RWFLOOD

RWFLOOD is a new and faster internal memory method to compute the drainage network, that is, the flow direction and accumulation on terrains represented by raster elevation matrix. The main idea is to surround the terrain by water (as an island) and then to raise the outside water level step by step, with depressions filled when the water reaches their boundary. This process avoids the very time-consuming depression filling step used by most of the methods to compute flow routing, that is, the flow direction and accumulated flow. The execution time of our method is very fast, and linear in the terrain size. Tests have shown that our method can process huge terrains more than 100 times faster than other recent methods.

This won the Best Paper Award (2nd place) at AGILE'2012 15th AGILE 2012 international conference on Geographic Information Science.

Code: tbd

Contact: Salles.

### 3.2 Visibility

#### 3.2.1 TiledVS (Tiled Viewshed)

TiledVS is a better algorithm and implementation for external memory viewshed computation. It is about four times faster than the most recent and most efficient published methods. Ours is also much simpler. Since processing large datasets can take hours, this improvement is significant. This algorithm was able to process a terrain with 10 billion points in one hour and a half on a modest computer with only 512 MB of RAM. It is 10 times faster than our previous algorithm (EMViewshed).

To reduce the total number of I/O operations, our method is based on subdividing the terrain into blocks which are stored in a special data structure managed as a cache memory. The viewshed is that region of the terrain that is visible by a fixed observer, who may be on or above the terrain. Its applications range from visual nuisance abatement to radio transmitter siting and surveillance.

#### 2016

Contact: Chaulio R. Ferreira.

#### 3.2.2 Parallel Multiple Observer Siting on Terrain

This is the optimization and parallelization of the multiple observer siting program, originally developed by Franklin and Vogt. Siting is a compute-intensive application with a large amount of inherent parallelism. The advantage of parallelization is not only a faster program but also the ability to solve bigger problems. We have parallelized the program using two different techniques: OpenMP, using multi-core CPUs, and CUDA, using a general purpose graphics processing unit (GPGPU). Experiment results show that both techniques are very effective. Using the OpenMP program, we are able to site tens of thousands of observers on a 16385 × 16385 terrain in less than 2 minutes, on our workstation with two CPUs and one GPU. The CUDA program achieves the same in about 30 seconds.

Contact: Wenli.

#### 3.2.3 An efficient GPU multiple-observer siting method based on sparse-matrix multiplication

This is an efficient parallel heuristic for siting observers on raster terrains. More specifically, the goal is to choose the smallest set of points on a terrain such that observers located in these points are able to visualize at least a given percentage of the terrain. This problem is NP-Hard and has several applications such as determining the best places to position (site) communication or monitoring towers on a terrain. Since siting observers is a massive operation, its solution requires a huge amount of processing time even to obtain an approximate solution using a heuristic. This is still more evident when processing high resolution terrains that have become available due to modern data acquiring technologies such as LIDAR and IFSAR.

Our new implementation uses dynamic programming and CUDA to accelerate the swap local search heuristic, which was proposed in previous works. Also, to efficiently use the parallel computing resources of GPUs, we adapted some techniques previously developed for sparse-dense matrix multiplication. Our parallel algorithm is up to 7352 times faster than a sequential algorithm. The reasons for this speedup is not only the use of GPU, but also because of the use of techniques originally designed for sparse-dense matrix multiplication.

We compared this new method with previous parallel implementations and the new method is much more efficient than the previous ones. It can process much larger terrains (the older methods are restrictive about terrain size) and it is faster.

Code: Guilherme Pena.

#### 2013

Contact: Guilherme Pena

### 3.3 Triangulated Irregular Network (TIN)

TIN is the current version of my efficient Triangulated Irregular Network computation program, for input points on a grid. Degeneracies, such as included points falling on the edge of the square, or on an existing edge, are handled. TIN uses an incremental greedy method. It stops when a predetermined maximum absolute error is reached, or when a given number of points have been included.

Due to its compact and simple data structures, TIN can process, in memory, a 10800x10800 grid of points on a PC.

In contrast to other TIN implementations, my TIN operates incrementally by repeatedly inserting the worst point. Therefore the ordered list of points that it selects can be used for progressive transmission of the surface.

Code

Contact: WRF

## 4 2D geometry

### 4.1 UPLAN

UPLAN is an efficient algorithm for path planning on road networks with polygonal constraints. U PLAN is based on the A* algorithm and it uses a uniform grid for efficiently indexing the obstacles. UPLAN can efficiently process very many polygonal obstacles. UPLAN was the 2nd place finisher in the ACM GISCUP 2015 competition.

Contact: Salles.

### 4.2 EPUG-Overlay

EPUG-Overlay (Exact Parallel Uniform Grid Overlay) is an algorithm to overlay two maps that is fast and parallel, has no roundoff errors, and is freely available. EPUG-Overlay combines several novel aspects. It represents coordinates with rational numbers, thereby ensuring exact computations with no roundoff errors and the ensuing sliver problems and topological impossibilities. For efficiency, EPUG-Overlay performs the map overlay in parallel, thereby utilizing the ubiquitous multicore architecture. Our application goes beyond merely using existing packages, which are inefficient when used in parallel on large problems. Indeed, overlaying two maps with 53,000,000 edges and 730,000 faces took only 322 elapsed seconds (plus 116 seconds for I/O) on a dual 8-core 3.1 GHz Intel Xeon E5-2687 workstation. In contrast, GRASS, executing sequentially and generating roundoff errors, takes 5300 seconds.

The overlay operation combines two input maps (planar graphs) containing faces (polygons) separated by polyline edges (chains), into a new map, each of whose faces is the intersection of one face from each input map. Floating point roundoff errors can cause an edge intersection to be missed or the computed intersection point be in a wrong face, leading to a topological inconsistency. Thus, a program might fail to compute a valid output map at all, using any amount of time. This gets worse when the inputs are bigger or have slivers. Heuristics can ameliorate this problem, but only to an extent.

By representing each coordinate as a vulgar fraction, with multiprecision numerator and denominator, the computation is exact. EPUG-Overlay also executes various useful subproblems very quickly, such as locating a set of points in a planar graph and finding all the intersections among a large set of small edges. EPUG-Overlay is built on our earlier sequential floating-point algorithm that found the areas of the overlay polygons, without finding the polygons themselves.

Contact: Salles.

### 4.3 Grid-Gen

Grid-Gen is an efficient heuristic for map simplification. Grid-Gen deals with a variation of the generalization problem where the idea is to simplify the polylines of a map without changing the topological relationships between these polylines or between the lines and control points. Grid-Gen uses a uniform grid to accelerate the simplification process and can handle a map with more than 3 million polyline points and 10 million control points in 9 seconds in a Lenovo T430s laptop.

It was the fourth fastest algorithm in the GISCUP 2014, but it was the only one to process all datasets without creating any topological error.

Contact: Salles.

### 4.4 ANOTB

This 1973 Fortran IV program computes any of the 16 possible boolean combinations of two polygons. The polygons may have multiple separate loops of edges, which may be nested.

The algorithm starts at an intersection point between an edge of polygon A and an edge of polygon B. It uses a decision table to pick which of the four possible directions to go. At each successive intersection, it also uses a decision table to pick the next direction.

Then it processes edge loops that have no intersections with the other polygon. Each loop is either included in its original direction, included in the other direction, or excluded.

Anotb source. Compiler errors are caused by the Fortran language's incompatible changes since I wrote this.

Contact: WRF

### 4.5 TESSEL

This 1972 program takes a set of edges that form a planar graph.

Consider a planar graph of vertices, edges, and faces. If all that is known is the vertex locations and which vertices are at the ends of each edge, then this program will find the faces.

Tessel source. Compiler errors are caused by the Fortran language's incompatible changes since I wrote this.

Contact: WRF

### 4.6 PNPOLY

PNPOLY (1970) tests whether a test point is contained in a given polygon. It is only 8 lines of executable C code.

Contact: WRF

## 5 3D geometry

### 5.1 NearptD

NearptD is a very fast parallel nearest neighbor algorithm and implementation, which has processed $10^7$ points in $E^6$ and $184\cdot10^6$ points in $E^3$. It uses 1/5 the space and as little as 1/100 the preprocessing time as FLANN (a well-known approximate nearest neighbor program). Up to $E^4$ , its query time is also faster, by up to a factor of 100. NearptD uses Nvidia Thrust and CUDA in C++ to perform parallel preprocessing and querying of large point cloud data. Nearest neighbor searching is needed by many applications, such as collision detection, computer vision or machine learning. This implementation is an extension of the Nearpt3 algorithm performed in parallel on the GPU for a variable number of dimensions. NearptD shows that a uniform grid can outperform a kd-tree for preprocessing and searching large datasets.

### 5.2 PinMesh

PinMesh is a very fast algorithm with implementation to preprocess a polyhedral mesh, also known as a multi-material mesh, in order to perform 3D point location queries. PinMesh combines several innovative components to efficiently handle the largest available meshes. Because of a 2-level uniform grid, the expected preprocessing time is linear in the input size, and the code parallelizes well on a shared memory machine. Querying time is almost independent of the dataset size. PinMesh uses exact arithmetic with rational numbers to prevent roundoff errors, and symbolic perturbation with Simulation of Simplicity (SoS) to handle geometric degeneracies or special cases. PinMesh is intended to be a subroutine in more complex algorithms. It can preprocess a dataset and perform 1 million queries up to 27 times faster than RCT (Relative Closest Triangle), the current fastest algorithm. Preprocessing a sample dataset with 50 million triangles took only 14 elapsed seconds on a 16-core Xeon processor. The mean query time was 0.6 μs.

Contact: Salles.

### 5.3 UNION3

UNION3 computes mass properties of the union of many identical axis-aligned cubes. This implementation has processed 20,000,000 cubes on a dual processor Pentium Xeon workstation in about one elapsed hour. The ideas would extend to the union of general polyhedra. UNION3 works by combining three techniques. The first is a set of local topological formulae for computing polyhedral mass properties. No global topology is used, such as complete face or edge information, or edge loops and face shells. For example, the volume of a multiple component, concave, rectilinear object is :math: sum s_i x_i y_i z_i  , summed over the vertices :math: (x_i , y_i, z_i) , where each :math:s_i is a function of the directions of the incident edges. The second technique is a 3-D grid of uniform cells, which is overlaid on the data. Each cell has a list of edges, faces, and cubes that overlap it. Only pairs of objects in the same cell are tested for intersection. Experimentally, the uniform grid is quite tolerant of varying densities in the input. The third technique, building on the second, is the covered-cell concept. When a cell is completely contained within a cube, its overlap list is cleared, and objects in it are not intersected against each other (unless they both also overlap the same other, noncovered, cell). These techniques combine to reduce the expected computation time to linear in the number of cubes, regardless of the number of intersections. In the slowest step of UNION3, which is testing pairs of objects in each cell for intersection, no inter-cell communication is required. Therefore UNION3 parallelizes quite well.

Contact: WRF

### 5.4 CONNECT

Connect is a very efficient implementation on a laptop computer of the union-find algorithm for finding connected components when the input is more than a 1000x1000x1000 3D box of binary voxels. Each voxel may be connected either to its 6 orthogonal neighbors, or to all 26 neighbors. Component properties, such as area, and volume are also computed. This implementation is fast enough to permit experimental studies of random connectivity.

Determining the connected components of a 2D or 3D array of bits is a required operation in domains such as the following:

• Investigating the distribution of cracks in concrete as it is stressed to failure (which is what motivated this research),

• Extraction of connected components in image processing, and

• Determination of porosity of an underground fluid reservoir.

Connect is very space- and time-efficient. The largest test case had a universe of size 1024x1088x1088, which has 1,212,153,856 voxels, about 1/2 empty and found the 4,539,562 components. The implementation environment was a 2GHz IBM T43p laptop with SuSE linux and gcc 4.1.0. The virtual memory used, which is a function of the space preallocated for runs and components, which was sized for this example, was only 340MB. The program's elapsed time was equal to its CPU time, which was 51 CPU seconds. Smaller examples run proportionately faster, in perhaps 0.1 seconds for a 100x100x100 universe. Very complicated cases would run more slowly.

Contact: WRF