Computer vision is an interdisciplinary field of artificial intelligence (AI) that trains computers to interpret digital images or videos. Recent advances in imaging, such as sub-meter resolution and 3D imagery, produce huge datasets that challenge geospatial data analysts trying to identify features. Applying deep learning and computer vision to geospatial analytics can shorten both model development and the time-to-decision for enterprise solutions.
How Does Computer Vision Work?
Computer vision is used to identify patterns, features, and components in images. Take, for example, the brown tabby cat lying in the grass in Figure 1.
Figure 1. Computer Vision Overview (Source for the cat image: https://commons.wikimedia.org/wiki. This diagram is based on a similar diagram available at https://algorithmxlab.com/blog/computer-vision/)
With human vision, our eyes are the sensors and our brains process and interpret the images. Compared to machines, humans have an enormous ability to recognize patterns and features. For example, if we have seen a picture of a cat in a book, we can easily recognize a real cat when we see one, even if features such as color or breed are different. This task is much more difficult for machines.
Geospatial Analytics Applications
Geospatial analytics, like many other domains, can benefit from deep learning and computer vision, as shown in Figure 2.
Figure 2. Examples of Computer Vision Applications (images source)
Retail examples that you may have tried include checkout-free shopping, as rolled out by Amazon Go, and virtual mirrors that enable you to select the proper clothing size without measuring or trying anything on. Tools have also been developed for retail security to help prevent shoplifting. Table 1 ranks the top 10 suppliers of machine vision software in the United States.
Table 1. Top 10 Suppliers of Machine Vision Software in the United States
Of note for geospatial applications and remote sensing, Harris Geospatial Solutions ranks fourth in the list.
Defense and intelligence (D&I) is one of the key application areas for computer vision and deep learning products. Here is a list of D&I domains with strong geospatial components:
- Warfare systems
- Logistics and transportation
- Targeting systems (enhancement and automation)
- Combat simulations and training
- Threat monitoring and situational awareness
- Homeland security (weapons/vehicle searches, facility protection)
Some current trends in this area include wide adoption of UAV imagery for deep learning, access to sub-meter commercial imagery from spaceborne and airborne instruments, vast amounts of 3D imagery and 3D geometry data for deep learning, and the blending of traditional imagery from cameras or other sensors with Internet of Things (IoT) and GPS devices to build a common operating picture.
What is Deep Learning?
Figure 3 illustrates the hierarchy of artificial intelligence, machine learning, and deep learning. Deep learning is a subset of machine learning, which in turn is a subset of AI.
Figure 3. Hierarchy of AI, ML and DL (Source)
Figure 4 compares traditional machine learning and deep learning. Traditional machine learning critically depends on a person (a subject matter expert, data scientist, etc.) to determine which features are most important for describing the input image. These features are then passed to a classifier to identify the type of object.
Figure 4. Machine Learning versus Deep Learning (Source)
But deep learning, using any combination of structured or unstructured datasets, can itself perform feature extraction and classification in combination until it finds a solution.1
Convolutional neural networks (CNNs) are a central part of deep learning. Figure 5 illustrates the basic CNN architecture, which usually has five layers.
Figure 5. Architecture of a basic Convolutional Neural Network (Source: Phung and Rhee, Appl. Sci. 2019, 9, 4500)
The input layer matches the size of the input images, so image scaling may be required depending on the network. The convolution layer convolves the input with kernels whose weights are shared across all image positions, and the pooling layer reduces the image size while trying to preserve the information it contains. These three layers comprise the Feature Extraction component, which creates feature maps that are then passed to the fully connected layer. The fully connected and output layers comprise the Classification component; its output is the classification result.
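The feature-extraction stage described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the architecture from Figure 5: the 6x6 image, the hand-written vertical-edge kernel, and the function names are all assumptions made for the example, and a real network would learn its kernel weights during training.

```python
def conv2d(image, kernel):
    """Slide one kernel over a 2D image (no padding, stride 1).

    Like most deep learning libraries, this computes cross-correlation
    (the kernel is not flipped).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # The same kernel weights are reused at every position (shared weights).
            acc = sum(image[i + di][j + dj] * kernel[di][dj]
                      for di in range(kh) for dj in range(kw))
            row.append(acc)
        out.append(row)
    return out


def relu(fmap):
    """Set negative responses to zero."""
    return [[max(0.0, v) for v in row] for row in fmap]


def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling; incomplete trailing windows are dropped."""
    out = []
    for i in range(0, len(fmap) - size + 1, size):
        row = []
        for j in range(0, len(fmap[0]) - size + 1, size):
            row.append(max(fmap[i + di][j + dj]
                           for di in range(size) for dj in range(size)))
        out.append(row)
    return out


# Hypothetical 6x6 input: dark left half, bright right half.
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]

# Hand-written kernel that responds to dark-to-bright vertical edges.
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]

feature_map = relu(conv2d(image, kernel))  # 4x4 feature map
pooled = max_pool2d(feature_map)           # 2x2 after pooling
```

The 6x6 input shrinks to a 4x4 feature map and then to a 2x2 pooled map, while the strong response at the edge survives the pooling step. In a full CNN, the pooled maps would be flattened and passed to the fully connected layer for classification.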
Figure 6 is a roadmap of the evolution of detection methods. On the left are the traditional detection methods such as the Viola–Jones detector (designed specifically for face detection), the Histogram of Oriented Gradients (HOG) detector (which uses a histogram of the intensity gradients to describe the shape and appearance of an object), and the Deformable Part Model (DPM) detector (which uses separate parts of the image to determine whether they are part of the object). There are many others, but these are the major traditional methods.
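To make the HOG idea concrete, the sketch below builds a magnitude-weighted histogram of gradient orientations for a single image cell. It is a simplified illustration only: the full HOG detector also interpolates votes between neighboring bins and normalizes histograms over blocks of cells, and the cell values, bin count, and function name here are assumptions chosen for the example.

```python
import math


def orientation_histogram(cell, bins=9):
    """Histogram of unsigned gradient orientations (0-180 degrees) for one cell.

    Each interior pixel votes for the bin of its gradient direction,
    weighted by its gradient magnitude.
    """
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    height, width = len(cell), len(cell[0])
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Central differences approximate the intensity gradient.
            gx = cell[y][x + 1] - cell[y][x - 1]
            gy = cell[y + 1][x] - cell[y - 1][x]
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            # The extra modulo guards against an angle that rounds to 180.
            hist[int(angle // bin_width) % bins] += magnitude
    return hist


# Hypothetical 4x4 cells: one with a vertical edge, one with a horizontal edge.
vertical_edge = [[0, 0, 10, 10] for _ in range(4)]
horizontal_edge = [[0] * 4, [0] * 4, [10] * 4, [10] * 4]

hist_vertical = orientation_histogram(vertical_edge)
hist_horizontal = orientation_histogram(horizontal_edge)
```

The two cells put all of their votes into different bins (bin 0 for the vertical edge, bin 4 for the horizontal edge), which is what lets a histogram of this kind describe the shape of an object independently of its absolute brightness.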
Figure 6. Evolution of Detection Methods (Source: Murthy et al. Appl. Sci. 2020, 10(9), 3280; https://doi.org/10.3390/app10093280)
In 2012, Alex Krizhevsky and colleagues created AlexNet to compete in the ImageNet Large Scale Visual Recognition Challenge. It was one of the first deep convolutional neural networks to gain wide attention. After that, the field exploded, with more sophisticated algorithms created each year, aided by computational hardware that became more affordable and accessible. Examples of these deep learning detection methods are shown on the right of Figure 6 and are divided by number of stages. The two-stage detectors separate the localization task from classification: they first propose a bounding box and then classify the object within it. The one-stage detectors skip the separate proposal step and apply classification directly to a dense set of possible locations.
Computer Vision Use Cases in Geospatial Applications
Figure 7 shows typical computer vision use cases in geospatial applications. Image classification returns the class label that the neural network was trained to recognize, while object detection also provides the location (the bounding box) of each object of that class.
Figure 7. Typical computer vision use cases in Geospatial Applications (Source)
Segmentation processes images at the pixel level to identify the group each pixel belongs to. Semantic segmentation distinguishes an object class from its background; an example is distinguishing roads from other terrain features. Instance segmentation identifies individual objects within a group of objects; for example, delineating each separate building within a cluster of buildings.
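The difference between the two outputs can be shown with a small sketch. Here a binary semantic mask (1 = building, 0 = background) is split into instances with connected-component labeling. This is a classical post-processing technique swapped in for illustration, not how learned instance segmentation networks operate, and it only separates objects that do not touch. The mask values and function name are hypothetical.

```python
from collections import deque


def label_instances(mask):
    """Give each 4-connected group of foreground pixels its own positive ID."""
    height, width = len(mask), len(mask[0])
    labels = [[0] * width for _ in range(height)]
    next_id = 0
    for r in range(height):
        for c in range(width):
            if mask[r][c] == 1 and labels[r][c] == 0:
                # Found an unlabeled foreground pixel: start a new instance.
                next_id += 1
                labels[r][c] = next_id
                queue = deque([(r, c)])
                while queue:
                    y, x = queue.popleft()
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if (0 <= ny < height and 0 <= nx < width
                                and mask[ny][nx] == 1 and labels[ny][nx] == 0):
                            labels[ny][nx] = next_id
                            queue.append((ny, nx))
    return labels, next_id


# Semantic segmentation output: every building pixel carries the same class value.
semantic_mask = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0],
]

# Instance-style output: each separate building gets its own ID.
instance_labels, building_count = label_instances(semantic_mask)
```

In the semantic mask both buildings share the value 1, whereas in `instance_labels` the first building is labeled 1 and the second is labeled 2, so they can be counted and measured individually.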
The use cases in Figure 7 pertain to two-dimensional data. However, 3D geospatial data has become more available and accessible over the last few years. This data includes latitude and longitude (i.e., the X, Y coordinates of 2D data) but also adds a vertical coordinate, allowing for 3D point clouds. The 3D geospatial data is derived from 3D sensors such as LiDAR (Light Detection And Ranging).
Figure 8. LiDAR scanning performed with a multicopter UAV (Source)
The LiDAR instrument fires rapid laser pulses at a surface, sometimes at 150,000 pulses per second, and a sensor detects the reflected light. Each pulse can have multiple returns depending on the survey target; for example, a pulse fired into a dense forest or a canopy may produce several returns (see Figure 9).
Figure 9. LiDAR Normalized Digital Surface Model application (Source 1 and 2)
There, the first return shows the top of the canopy, and later ones relate to different depths inside the canopy depending on how dense it is. The first return is usually the most important because it reveals key information about the terrain (such as the height of a canopy or the top of buildings). The fourth return in this case is the ground surface, indicating that the canopy here was not that dense. Sometimes you can determine the ground height using deep learning for 3D geospatial data.
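As a worked example of how returns translate into height information, the sketch below subtracts the last return of each pulse from its first return. It assumes that the last return reached the ground, which holds for the sparse canopy described here but not in general (in dense vegetation no return may reach the ground). The elevations and pulse names are made-up values in meters.

```python
def height_above_last_return(returns):
    """First-return elevation minus last-return elevation for one pulse.

    Assumes the returns are ordered from first to last and that the
    last return came from the ground surface.
    """
    if not returns:
        raise ValueError("a pulse needs at least one return")
    return returns[0] - returns[-1]


# Hypothetical return elevations (meters above a common datum) for two pulses.
pulses = {
    "open_ground": [312.0],                       # single return from bare ground
    "sparse_canopy": [330.5, 324.0, 318.2, 312.0],  # four returns, last one is ground
}

heights = {name: height_above_last_return(r) for name, r in pulses.items()}
```

The canopy pulse yields a height of 18.5 m and the bare-ground pulse yields 0.0 m. Applying the same subtraction across a gridded surface is the idea behind a normalized digital surface model: surface elevation minus ground elevation.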
As with 2D data, 3D data can be used for object segmentation, detection, and classification, but it is more challenging to work with because it is extremely noisy and can be incomplete. A recent review paper by Li et al. (2020), focused on deep learning (DL) for LiDAR point clouds, summarizes some of the challenges related to 3D object detection (see Figure 10).
Figure 10. Deep Learning for 3D Geospatial Data (Source)
I will discuss an example of Use Case (b) in Figure 10, which is 3D Object Detection. LiDAR collects billions of points in different scenes, and this data can become overwhelming if it is not screened and filtered by pre-processing. Figure 11 shows three deep learning frameworks frequently used for analyzing 3D data for object detection/localization.
Figure 11. Deep learning models for 3D object detection/localization
The first framework begins with scene segmentation and coarse localization of the object. Features are then extracted for the proposed region, and the localization and classification are predicted by a bounding-box prediction network. The second framework uses a voxel-based network. (A voxel is the three-dimensional counterpart of a 2D pixel.) The point cloud is divided into voxels, much like Lego blocks, and these are passed to a VoxelNet to produce the detection/localization. The third method applies view-based networks, where the point clouds are decomposed and mapped into red, green, and blue channels. This RGB-map is passed to a deep learning algorithm, for example YOLO (You Only Look Once), to produce the detection/localization.
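The voxel step in the second framework can be illustrated with a short sketch that groups points by the cubic cell they fall into. This covers only the partitioning stage that precedes a voxel-based network, not VoxelNet itself; the point coordinates, the 1.0 unit voxel size, and the function name are assumptions for the example.

```python
import math
from collections import defaultdict


def voxelize(points, voxel_size):
    """Group (x, y, z) points by the integer index of the voxel containing them."""
    voxels = defaultdict(list)
    for x, y, z in points:
        # floor() keeps indexing consistent for negative coordinates as well.
        key = (math.floor(x / voxel_size),
               math.floor(y / voxel_size),
               math.floor(z / voxel_size))
        voxels[key].append((x, y, z))
    return dict(voxels)


# Hypothetical point cloud (coordinates in meters).
points = [
    (0.1, 0.2, 0.0),
    (0.4, 0.1, 0.3),
    (1.2, 0.1, 0.0),
    (1.4, 0.3, 0.2),
    (2.6, 2.7, 1.1),
]

voxels = voxelize(points, voxel_size=1.0)

# Number of points per occupied voxel, a simple per-voxel feature.
occupancy = {key: len(pts) for key, pts in voxels.items()}
```

Only occupied voxels are stored, which matters because LiDAR point clouds are sparse and most of the 3D grid is empty. A voxel-based network would then compute learned features from the points in each occupied voxel instead of the simple count used here.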
Geospatial Data Processing Pipelines
It is hard to comprehend how big the data is in geospatial applications. Even a very small swath of a footprint captured with a multispectral sensor produces a vast amount of data. For example, a single hyperspectral image can be hundreds of megabytes. Creating a robust pipeline to ingest and stage geospatial data for a deep learning algorithm is a massive endeavor. Fortunately, there are readily available tools, including Google BigQuery, Amazon Athena, Google Earth, PostGIS (an extension of the PostgreSQL database), and GDAL, that can prepare and stage the data for deep learning at an enterprise level (see Figure 12).
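A quick back-of-the-envelope calculation shows how an image reaches hundreds of megabytes. The dimensions below are hypothetical (a 1000 x 1000 pixel scene with 224 spectral bands stored as 16-bit samples) and the result is the uncompressed size, ignoring file headers and metadata.

```python
def raster_size_bytes(rows, cols, bands, bytes_per_sample):
    """Uncompressed size of a raster: one sample per pixel per band."""
    return rows * cols * bands * bytes_per_sample


# Assumed scene: 1000 x 1000 pixels, 224 bands, 2 bytes (16 bits) per sample.
size_bytes = raster_size_bytes(rows=1000, cols=1000, bands=224, bytes_per_sample=2)
size_megabytes = size_bytes / 1e6

print(f"{size_megabytes:.0f} MB")
```

Under these assumptions a single scene occupies 448 MB, so an archive of a few thousand scenes already runs to terabytes, which is why staging the data with purpose-built tools matters.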
Figure 12. Geospatial data processing tools
Summary
Volumes of fine-scale (sub-meter resolution) data from satellite, aerial, and drone imagery are rapidly increasing. It is becoming more challenging for the remote sensing community to process this amount of data with traditional analytics methods. Object detection methods and popular open-source computer vision tools are helping to overcome these challenges. Computer vision and deep learning techniques are being adopted in many geospatial and remote sensing tasks (e.g., classification, object detection, segmentation), making it possible to provide analytics products in a timely fashion.
1. The two approaches can be combined; anecdotally, researchers who craft features for deep learning to use have seen a 10% improvement. Still, it is remarkable that deep learning without feature engineering works as well as it does.