IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, VOL. 25, 2024
Integrating LiDAR and Aerial Imagery for Object
Height Estimation: A Review of Procedural and
AI-Enhanced Methods
Jesus Guerrero, Graduate Student Member, IEEE
Abstract—This work presents a procedural method for extracting object heights from LiDAR and aerial imagery, and reviews where LiDAR and imagery processing is heading. State-of-the-art object segmentation allows object heights to be extracted without a deep learning background. Engineers will maintain and reprocess world-scale geospatial data across generations, using both procedural methods such as the one described here and the newer, AI-driven methods we review. State-of-the-art methods are moving beyond analysis and into generative AI. We therefore cover both a procedural methodology and emerging approaches built on language models, including point cloud, imagery, and text encodings that enable spatially aware AI.
Index Terms—GeoAI, LiDAR, Segmentation, Transformers
I. INTRODUCTION
This research originated in a project to estimate tree canopy heights for San Antonio, TX; however, the resulting workflow applies to any object class. We limit ourselves to aerial LiDAR and aerial imagery. The workflow can be reproduced in any geospatial editor to obtain object heights. The end result is a table of individual objects that can include height, location, area, perimeter length, and object type.
We surveyed the literature for solutions to this tree height problem [2]. Better results are possible with more detailed remote sensing data; additional bands and newer models are a contemporary solution. But LiDAR and imagery are readily obtainable and can be updated at any time: drones and satellites can survey repeatedly and in any part of the world, and an area can be reflown for a few thousand dollars per kilometer. Using this procedural method, any geospatial researcher can reproduce the workflow, and not only for trees. In addition, this work discusses the future of geospatial AI with LiDAR and imagery [5].
A. Technical Background
The geospatial field is merging with modern AI methods. A few years ago most models were RNNs, CNNs, and classical machine learning. Now engineers are using advanced methods to reshape a niche field: GeoAI. Though the field has existed all along, it has never looked as it does now. Embedding methods such as Word2Vec did not exist then; today nearly any modality can be embedded as a vector and plugged into deep learning models.
Point clouds created from LiDAR are part of this embedding trend [14]. Each point carries bands detailing its values. Figure 1 shows an example point cloud in which the points have Cartesian coordinates for 3D awareness. These coordinates are usable in both procedural and deep learning workflows. Using LiDAR libraries we can attach RGB and infrared values to each point. Such point clouds can carry up to five bands [15]: the coordinates, elevation, RGB colors, and infrared, plus a final classification flag such as power line, tree, or building.
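As a minimal sketch of how these per-point bands might be read in practice, the snippet below uses the laspy library to load a tile and stack coordinates, color, infrared, and classification into one per-point feature array. The file name is a placeholder, and the presence of color and NIR bands depends on the LiDAR product.

```python
import numpy as np
import laspy  # pip install laspy

# Hypothetical input tile; band availability depends on the LiDAR product.
las = laspy.read("tile_2024.las")

xyz = np.vstack([las.x, las.y, las.z]).T              # Cartesian coordinates, (N, 3)
rgb = np.vstack([las.red, las.green, las.blue]).T     # colors, if the flight attached them
nir = np.asarray(las.nir).reshape(-1, 1)              # infrared band, if present
cls = np.asarray(las.classification).reshape(-1, 1)   # e.g. 2=ground, 5=high vegetation, 6=building

# One row per point: x, y, z, r, g, b, nir, class
points = np.hstack([xyz, rgb, nir, cls]).astype(np.float32)
print(points.shape)
```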
These bands and the classification flag can be flattened in the same way as the patches in Vision Transformers [4] and fed to neural models for many tasks. The same is true for imagery: we take the RGB bands of the image and flatten them for a model like SAM (Segment Anything Model) [7].
Fig. 1 Point Clouds
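To ground the flattening analogy, the toy snippet below reshapes an RGB tile into the kind of flattened patch tokens a Vision Transformer consumes. The patch size and tile shape are arbitrary illustrative values.

```python
import numpy as np

def image_to_patch_tokens(img, patch=16):
    """Flatten an (H, W, C) image into ViT-style tokens of shape (n_patches, patch*patch*C)."""
    h, w, c = img.shape
    h, w = h - h % patch, w - w % patch          # crop to a multiple of the patch size
    img = img[:h, :w]
    tokens = (img.reshape(h // patch, patch, w // patch, patch, c)
                 .transpose(0, 2, 1, 3, 4)
                 .reshape(-1, patch * patch * c))
    return tokens

tokens = image_to_patch_tokens(np.zeros((512, 512, 3), dtype=np.float32))
print(tokens.shape)   # (1024, 768)
```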
For this research we convert the LiDAR values into elevation maps and then subtract heights. These maps are low-resolution grayscale rasters in which each pixel value is the height at that point. Existing image-and-text to text models can interpret those values.
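As an illustrative sketch (not the exact tooling used here), classified returns can be binned into such a grayscale elevation map with plain NumPy. The cell size is a placeholder; production workflows typically use dedicated LiDAR gridding tools.

```python
import numpy as np

def points_to_elevation_map(points, cell_size=1.0):
    """Grid LiDAR points into a max-elevation raster (a simple surface model).

    points: (N, 3) array of x, y, z in projected map units.
    Returns a 2D array where each pixel holds the highest z in that cell
    and NaN where no points fell (the "blank white" no-data areas).
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    col = ((x - x.min()) / cell_size).astype(int)
    row = ((y.max() - y) / cell_size).astype(int)    # row 0 at the north edge
    grid = np.full((row.max() + 1, col.max() + 1), np.nan, dtype=np.float32)
    for r, c, h in zip(row, col, z):                 # keep the highest return per cell
        if np.isnan(grid[r, c]) or h > grid[r, c]:
            grid[r, c] = h
    return grid
```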
II. RESEARCH QUESTIONS
As an introduction to GeoAI we start with a procedural way of estimating object height, then present modern methods. Both are reviewed and discussed. Much of what is written here will likely appear in practice in the coming months; given the pace of machine learning, this type of technology is inevitable.
TABLE I: Research Questions
RQ1  How to get object heights using LiDAR & aerial imagery?
RQ2  How can we get object heights using only LiDAR?
RQ3  What is the future of LiDAR and imagery processing?
A. Methods
RQ1: How to get object heights using LiDAR & imagery?
Typically LiDAR points are already classified by type, such as trees, ground, buildings, and power lines. Without this, an intermediate point classification step would be needed; point classification is a small, largely solved problem. Labelling a cluster of points as one object is the bigger issue. The fulcrum of the method used here is segmentation: we combine the individual objects detected by SAM with the LiDAR class to create tabular data:
Fig. 2 Results
The two inputs, LiDAR and imagery, are processed separately and then combined. Two elevation models are created from the LiDAR by height value: one for the ground and one for the objects. The object raster is null outside the object points; the blank white regions simply mean no data exists there.
The ground elevation is occluded beneath the object point clouds. To remedy this we take the nearest neighboring ground points around the missing heights to interpolate a ground height. We then subtract the ground height from the object height and union the resulting differences into an object table. Once segmentation is done, the area of each object can be summarized with three metrics: the minimum, maximum, and mean height of the object, in feet above sea level. We include a picture of some results by object:
Fig. 3 Procedural Method
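A minimal sketch of this step, assuming the object and ground elevation rasters described above and a list of boolean SAM masks aligned to the same grid: fill the occluded ground cells from their nearest neighbors, subtract, and summarize each object with min/max/mean height. The function and field names are illustrative, not from the original pipeline.

```python
import numpy as np
from scipy import ndimage

def fill_nearest(raster):
    """Fill NaN cells with the value of the nearest valid cell (nearest-neighbor ground fill)."""
    gaps = np.isnan(raster)
    # indices of the nearest non-NaN cell for every cell
    idx = ndimage.distance_transform_edt(gaps, return_distances=False, return_indices=True)
    return raster[tuple(idx)]

def object_height_table(object_dsm, ground_dtm, masks):
    """object_dsm/ground_dtm: 2D arrays on the same grid; masks: boolean arrays, one per SAM object."""
    ndsm = object_dsm - fill_nearest(ground_dtm)   # object height above the interpolated ground
    rows = []
    for i, m in enumerate(masks):
        vals = ndsm[m & ~np.isnan(ndsm)]
        if vals.size:
            rows.append({"object_id": i,
                         "min_height": float(vals.min()),
                         "max_height": float(vals.max()),
                         "mean_height": float(vals.mean()),
                         "area_cells": int(m.sum())})
    return rows
```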
RQ2: How can we get object heights using only LiDAR?
Encoding LiDAR bands into transformer blocks is the latest trend [14]. Typical LiDAR data includes only elevation and per-point classifications. Given how easy it has become to create custom architectures, we can repurpose existing LiDAR for transformers. Most LiDAR datasets were originally used with convolutional neural networks and models like PointNet++ [10].
Instead of using aerial imagery and LiDAR separately, current research is finding the best results when they are combined: the RGB colors of the imagery are fused with the LiDAR elevation and infrared into five bands. Given this combination of bands, remote sensing object heights can become more accurate than ever.
Fig. 4 LiDAR Point Classification
After adding an adapter, transformer blocks can be trained with MLP layers to perform almost any task, including multi-modal LLMs that can take LiDAR as input [8]. These generative models use adapters similar to the original patch embedding of the Vision Transformer, except that the additional bands are flattened. Some of these models will be released to the public soon.
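To make the adapter idea concrete, here is a minimal, untrained PyTorch sketch: per-point band features (such as the eight values assembled earlier) are projected by an MLP adapter into token embeddings and passed through a small transformer encoder. This illustrates the pattern only; it is not a published architecture, and all sizes are arbitrary.

```python
import torch
import torch.nn as nn

class PointTokenAdapter(nn.Module):
    """Flatten per-point LiDAR bands into transformer tokens via a small MLP adapter."""
    def __init__(self, in_bands=8, d_model=256, n_layers=4, n_heads=8):
        super().__init__()
        self.adapter = nn.Sequential(          # MLP adapter: bands -> embedding
            nn.Linear(in_bands, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model),
        )
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, 1)      # e.g. regress a height per point

    def forward(self, points):                 # points: (batch, n_points, in_bands)
        tokens = self.adapter(points)
        tokens = self.encoder(tokens)
        return self.head(tokens).squeeze(-1)   # (batch, n_points)

# toy forward pass on random "points"
model = PointTokenAdapter()
out = model(torch.randn(2, 1024, 8))
print(out.shape)  # torch.Size([2, 1024])
```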
RQ3: What is the future of LiDAR and imagery processing?
For now, the pipeline shown for RQ1 is the most accessible way to get object heights. It requires little deep learning education; researchers can use the method in this paper with only geospatial knowledge. Machine learning is becoming more accessible month by month, and the field is accelerating as it becomes easier to enter.
Just a few years ago transformers existed only for NLP. Given how quickly vision transformers made their way into generative AI, no modality of data will be exempt from being placed into transformers and therefore LLMs. The same will happen with remote sensing generative AI: today it is starting with ground-based point clouds, but it will soon extend to aerial LiDAR. We will see more remote sensing problems solved with transformers and NLP, and LiDAR and imagery are no exception. Several regularly updated surveys track the latest research on imagery and LiDAR; Table II lists where the SOTA can be found:
TABLE II: Resources for Imagery & LiDAR
Focus                 Listing
Imagery Techniques    https://github.com/satellite-image-deep-learning/techniques
LiDAR Techniques      https://github.com/szenergy/awesome-lidar
Remote Sensing LLMs   https://github.com/ZhanYang-nwpu/Awesome-Remote-Sensing-Multimodal-Large-Language-Model
III. LIMITATIONS
For the lifetime of object detection research, accuracy on aerial imagery has been a struggle. Libraries like DeepForest [12] and foundational models such as SAM still struggle to reach SOTA results. For many years vision researchers have been trying to segment objects from satellite imagery; with so many use cases for up-to-date object maps, it is a hot topic. The typical result today is about 70% recall for objects, even with SAM. For many problems and methods this is the constraining value, and it is the constraining value for this research as well.
We found these libraries work best at different spatial resolutions. A fine 10 cm resolution was good for small objects but missed all the bigger ones; 30 cm was great for larger objects but missed the small ones. We took a happy medium and settled on 20 cm. Notice in Fig. 5 that the bigger tree canopies have not been detected. This is due to running SAM at 20 cm: it is the right spatial resolution for the average tree, but not for the bigger ones. In addition, most of these models perform better in cities; geography and location make a huge difference.
Fig. 5 SAM Results
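For reference, masks like those in Fig. 5 can be produced with SAM's automatic mask generator on an RGB tile resampled to the chosen ground sample distance. The checkpoint path and tile file name below are placeholders.

```python
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Load a pretrained SAM checkpoint (placeholder path) and build the mask generator.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

# Aerial tile already resampled to ~20 cm ground sample distance (placeholder file).
tile = cv2.cvtColor(cv2.imread("tile_20cm.tif"), cv2.COLOR_BGR2RGB)

masks = mask_generator.generate(tile)   # list of dicts with 'segmentation', 'area', 'bbox', ...
print(len(masks), "objects segmented")
```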
Two solutions are to fine-tune SAM/DeepForest or to ensemble results across different resolutions (see the sketch after Table III). We found that coarser resolutions, such as 50 cm, do pick up the bigger objects but also hallucinate objects; at 20 cm there are fewer false positives but larger objects are missed. Other engineers recommend fine-tuning to the specific city or rural area being inferenced. Other RGB segmentation methods exist, as listed in the table below; the recall numbers shown are averaged over resolution, geography, and object type.
TABLE III: RGB Detection Models
Model                   Recall   Can fine-tune?
RetinaNet (trainable)   65%+     Yes
SAM (foundational)      68%      Yes
DeepForest (trees)      69%      Yes
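A hedged sketch of the multi-resolution ensembling mentioned above, assuming both sets of masks have been resampled to a common grid: keep all fine-resolution masks and add coarse-resolution masks only when they do not overlap an existing detection. The IoU threshold and helper names are illustrative choices, not values from this work.

```python
import numpy as np

def mask_iou(a, b):
    """Intersection-over-union of two boolean masks on the same grid."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def ensemble_masks(fine_masks, coarse_masks, iou_thresh=0.3):
    """Keep all fine (e.g. 20 cm) masks; add coarse (e.g. 50 cm) masks that
    cover objects the fine pass missed."""
    merged = list(fine_masks)
    for cm in coarse_masks:
        if all(mask_iou(cm, fm) < iou_thresh for fm in merged):
            merged.append(cm)   # a large object only the coarse pass detected
    return merged
```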
IV. FUTURE RESEARCH DIRECTIONS
The future of remote sensing is geospatial LLMs. The architectures are already there, and given the state of remote sensing today there are many untried projects. What is different now is the quality of point, RGB, and text embeddings; transformer architectures have become very effective. Papers like LiDAR-LLM [16] report promising results, showing improving n-gram metrics, classification accuracies, and sequential adherence.
More papers like these will be published soon; work building on PointNet++ [10], point transformers [15], and many others is on the way. These transformers come as compact model scripts of pure tensor manipulation. It has become simple to create such models because of the effectiveness of attention; we have not seen such an effect on AI since the introduction of RNNs and CNNs.
Fig. 6 Example Point Cloud LLMs [16]
The main issue is that everything is so new; these architectures are not yet heavily trained. LiDAR encoders flatten the bands and classifications: the .las file is passed through an MLP and into the transformer as an embedding. With the versatility of attention and the simplicity of the architecture, what we need now is datasets. Datasets are the future of geospatial LLMs: we need examples that take geospatial data and text as input and produce geospatial data and text as output.
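As a purely hypothetical sketch of what such a dataset record could look like, each JSON-lines entry might pair a point cloud tile and an instruction with the expected text or geometry output. None of these field names or paths come from an existing benchmark.

```python
import json

# Hypothetical instruction-tuning record pairing geo data with text.
record = {
    "point_cloud": "tiles/san_antonio_0415.las",        # input geo data
    "image": "tiles/san_antonio_0415_rgb.tif",          # optional paired imagery
    "instruction": "List every tree taller than 10 m with its height and location.",
    "response": {
        "text": "Found 3 trees taller than 10 m.",
        "geometries": [
            {"type": "Point", "coordinates": [-98.49, 29.42], "height_m": 12.4}
        ],
    },
}

with open("geollm_train.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```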
V. CONCLUSION
We gave an introduction to GeoAI through a procedural method for extracting object heights. The current state of remote sensing is moving from these methods toward generative ones. Despite the speed of the field, we are not yet extracting geometries from aerial data using NLP. The latest literature shows we are just getting past the LiDAR embedding stage, but there have not yet been any groundbreaking language model developments here. Ground-based LiDAR is far more popular due to its use in robotics. Research is at the cusp of creating geospatial LLMs that will turn procedural methods into NLP ones.
REFERENCES
[1] Activevisionlab/awesome-llm-3d, 06 2024.
[2] Robin M. Cole. satellite-image-deep-learning, 04 2023.
[3] Sijun Dong, Libo Wang, Bo Du, and Xiaoliang Meng. Changeclip: Remote sensing change detection with multimodal vision-language representation learning. ISPRS Journal of Photogrammetry and Remote Sensing, 208:53–69, 02 2024.
[4] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv:2010.11929 [cs], 10 2020.
[5] Esmaeil Farshi, Anahita Arad, and John Smith. Revolutionizing visual description: Integrating lidar data for enhanced language models.
[6] Haonan Guo, Xin Su, Chen Wu, Bo Du, Liangpei Zhang, and Deren Li. Remote sensing chatgpt: Solving remote sensing tasks with chatgpt and visual models, 01 2024.
[7] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything, 2023.
[8] lzw-lzw. lzw-lzw/awesome-remote-sensing-vision-language-models, 06 2024.
[9] Xianzheng Ma, Yash Bhalgat, Brandon Smart, Shuai Chen, Xinghui Li, Jian Ding, Jindong Gu, Dave Zhenyu Chen, Songyou Peng, Jia-Wang Bian, Philip H. Torr, Marc Pollefeys, Matthias Nießner, Ian D. Reid, Angel X. Chang, Iro Laina, and Victor Adrian Prisacariu. When llms step into the 3d world: A survey and meta-analysis of 3d tasks via multi-modal large language models, 05 2024.
[10] Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation, 2017.
[11] Jamie Tolan, Hung-I Yang, Benjamin Nosarzewski, Guillaume Couairon, Huy V. Vo, John Brandt, Justine Spore, Sayantan Majumdar, Daniel Haziza, Janaki Vamaraju, Theo Moutakanni, Piotr Bojanowski, Tracy Johns, Brian White, Tobias Tiecke, and Camille Couprie. Very high resolution canopy height maps from rgb imagery using self-supervised vision transformer and convolutional decoder trained on aerial lidar. Remote Sensing of Environment, 300:113888, 01 2024.
[12] Ben G. Weinstein, Sergio Marconi, Mélaine Aubry-Kientz, Grégoire Vincent, Henry Senyondo, and Ethan White. Deepforest: A Python package for rgb deep learning tree crown delineation. Methods in Ecology and Evolution, 11:1743–1751, 10 2020.
[13] Congcong Wen, Yuan Hu, Xiang Li, Zhenghang Yuan, and Xiao Xiang Zhu. Vision-language models in remote sensing: Current progress and future trends, 05 2023.
[14] Runsen Xu, Xiaolong Wang, Tai Wang, Yilun Chen, Jiangmiao Pang, and Dahua Lin. Pointllm: Empowering large language models to understand point clouds, 08 2023.
[15] Jian Yang, Ruilin Gan, Binhan Luo, Ao Wang, Shuo Shi, and Lin Du. An improved method for individual tree segmentation in complex urban scenes based on using multispectral lidar by deep learning. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 17:6561–6576, 2024.
[16] Senqiao Yang, Jiaming Liu, Ray Zhang, Mingjie Pan, Zoey Guo, Xiaoqi Li, Zehui Chen, Peng Gao, Yandong Guo, and Shanghang Zhang. Lidar-llm: Exploring the potential of large language models for 3d lidar understanding, 12 2023.
[17] Yang Zhan, Zhitong Xiong, and Yuan Yuan. Skyeyegpt: Unifying remote sensing vision-language tasks via instruction tuning with large language model, 01 2024.
[18] Wei Zhang, Miaoxin Cai, Tong Zhang, Yin Zhuang, and Xuerui Mao. Earthgpt: A universal multi-modal large language model for multi-sensor image comprehension in remote sensing domain, 03 2024.
[19] ZhanYang. Zhanyang-nwpu/awesome-remote-sensing-multimodal-large-language-model, 06 2024.
[20] Yue Zhou, Litong Feng, Yiping Ke, Xue Jiang, Junchi Yan, Xue Yang, and Wayne Zhang. Towards vision-language geo-foundation model: A survey, 06 2024.
