Tingde Liu, M. Sc.
Artem Leichter, M. Sc.
Institut für Kartographie und Geoinformatik, Leibniz Universität Hannover
We introduce MMS-LLM, a multi-modal large language model capable of understanding LiDAR point clouds of objects. It perceives object types, geometric structures, and appearance without concerns for ambiguous depth, occlusion, or viewpoint dependency. This study aims to improve PointLLM’s ability to process LiDAR point clouds without relying on color information. We used typical laser scanning information (Intensity) to replace the missing color details in LiDAR point clouds. We designed a framework to automatically extract point cloud instances and generate text instructions. Using this framework, we created a new dataset with 4.1K LiDAR point cloud instances, 4.1K simple point-to-text instruction pairs, and 3.6K complex instruction pairs. Through this dataset, we fine-tuned the LLM. To rigorously evaluate the perceptual and generalization capabilities of the fine-tuned LLM, we employed an evaluation method based on GPT-4/ChatGPT. Experimental results demonstrate that our trained MMS-LLM 7B v1 outperforms the existing PointLLM 7B v1.2 in handling LiDAR point cloud data.
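The idea of substituting intensity for color can be illustrated with a minimal sketch. It assumes the point encoder expects six channels per point (xyz plus three color channels) and that intensity is min-max scaled and copied into all three color channels. The function name `intensity_to_pseudo_rgb` and the scaling choice are hypothetical, not taken from the released code.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
Row = Tuple[float, float, float, float, float, float]


def intensity_to_pseudo_rgb(
    points: Sequence[Point], intensity: Sequence[float]
) -> List[Row]:
    """Return (x, y, z, r, g, b) rows where r = g = b = scaled intensity.

    Hypothetical helper: the actual mapping used by MMS-LLM may differ.
    """
    if len(points) != len(intensity):
        raise ValueError("points and intensity must have the same length")
    if not points:
        return []
    lo, hi = min(intensity), max(intensity)
    span = hi - lo
    rows: List[Row] = []
    for (x, y, z), value in zip(points, intensity):
        # Min-max scale to [0, 1]; a constant-intensity cloud maps to 0.0.
        gray = (value - lo) / span if span > 0 else 0.0
        rows.append((x, y, z, gray, gray, gray))
    return rows
```

The sketch is kept dependency-free for clarity; on real scans with millions of points the same operation would normally be vectorized.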
- [2025-03-11] We have released the code for data generation and training. You can use this code to create your own dataset.
- 💬 Dialogue Examples
- 🔍 Overview
- 📦 Training and Evaluation
- 🚀 Demo
- 🗺️ Roadmap
- 📚 Related Work
- 🤝 Contributing
- 👏 Acknowledgements
Please refer to our paper for more results.
The table compares our model, MMS-LLM 7B v1, with PointLLM 7B v1.2 and the control group Reference 7B v1 on the ikgc17 test dataset. As shown in the table, MMS-LLM 7B v1 outperforms PointLLM 7B v1.2 in both classification and captioning tasks. This indicates that our targeted fine-tuning of the projector and LLM significantly improved the model's performance on LiDAR point cloud tasks.
We tested our code in the following environment:
- Ubuntu 20.04
- NVIDIA Driver: 515.65.01
- CUDA 11.7
- Python 3.10.13
- PyTorch 2.0.1
- Transformers 4.28.0.dev (transformers.git@cae78c46)
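To compare a local setup against the tested versions, a small check like the following may help. It is only a sketch: the package names `torch` and `transformers` are the usual PyPI distribution names, the helper `check_environment` is hypothetical, and a mismatch does not necessarily mean the code will fail.

```python
import sys
from importlib import metadata

# Versions taken from the tested environment listed above.
EXPECTED = {"torch": "2.0.1", "transformers": "4.28.0"}
EXPECTED_PYTHON = "3.10.13"


def check_environment(expected=EXPECTED):
    """Return {name: (installed_version_or_None, tested_version)}."""
    report = {"python": (sys.version.split()[0], EXPECTED_PYTHON)}
    for name, wanted in expected.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed in this environment
        report[name] = (found, wanted)
    return report


if __name__ == "__main__":
    for name, (found, wanted) in check_environment().items():
        print(f"{name}: installed={found} tested={wanted}")
```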
To start:
- Clone this repository.
git clone https://github.com/TingdeLiu/MMS-LLM.git
cd MMS-LLM/PointLLM
- Install packages.
conda create -n mmsllm python=3.10 -y
conda activate mmsllm
pip install --upgrade pip
pip install -e .
# * for training
pip install ninja
pip install flash-attn
This dataset, used in MMS-LLM, comes from the Institute of Cartography and Geoinformatics (IKG). IKG researchers continuously collect urban data in Hanover using a mobile mapping system (MMS) that includes two laser scanners, two cameras, and an IMU/GNSS device. The dataset, collected in the Linden-Nord district, contains approximately 0.9 billion points. The raw point cloud data was colored and enriched with semantic labels via label transfer.
The image below shows the visualization of one of the point clouds.
cd create_dataset/pointcloud
python semantic.py
Inspired by Leichter et al., 2021, we propose an instance extraction strategy. Our workflow, shown in the image below, consists of four sequential processing steps.
cd create_dataset/pointcloud
python instance.py
Inspired by Cap3D, we collect point cloud projections from different viewpoints.
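As a rough illustration of multi-view projection, the sketch below rotates an object about its vertical axis in equal steps and drops the depth axis each time (an orthographic view). The number of views, the orthographic model, and the function name `project_views` are assumptions made for this example; the repository's `project.py` may work differently.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def project_views(
    points: Sequence[Point], num_views: int = 4
) -> List[List[Tuple[float, float]]]:
    """Orthographic 2D projections from evenly spaced yaw angles.

    Hypothetical helper: each view rotates the cloud about the z (up) axis
    and keeps (rotated y, z) as image coordinates, discarding depth.
    """
    views = []
    for k in range(num_views):
        yaw = 2.0 * math.pi * k / num_views
        c, s = math.cos(yaw), math.sin(yaw)
        view = []
        for x, y, z in points:
            horizontal = s * x + c * y  # rotated y axis = image horizontal
            view.append((horizontal, z))  # z stays the image vertical
        views.append(view)
    return views
```

Each returned view can then be rasterized into an image, for example by binning the 2D coordinates into a pixel grid.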
cd create_dataset/instruction
python project.py
Inspired by Automatic Generation of Large Point Cloud Training Datasets Using Label Transfer, we project the point cloud onto the Street View HD image.
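Projecting points into an image can be sketched with a simple pinhole camera model. This is an illustrative assumption: the actual camera model, calibration, and coordinate conventions of the MMS cameras are not stated here, and the function `project_to_image` and its parameters are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def project_to_image(
    points_cam: Sequence[Point],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    width: int,
    height: int,
) -> List[Tuple[float, float]]:
    """Project camera-frame points (z pointing forward) to pixel coordinates.

    Points behind the camera or outside the image bounds are dropped.
    """
    pixels = []
    for x, y, z in points_cam:
        if z <= 0:
            continue  # behind the camera plane, not visible
        u = fx * x / z + cx
        v = fy * y / z + cy
        if 0 <= u < width and 0 <= v < height:
            pixels.append((u, v))
    return pixels
```

In practice the points would first be transformed from the scanner frame into the camera frame using the extrinsic calibration, which is omitted here.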
cd create_dataset/instruction
python get_image-full.py
cd create_dataset/instruction
python transform.py
We use InternVL to generate the complex instructions.
cd create_dataset/instruction
python complex_instruction_generate.py
- In the PointLLM/data folder, create a directory named anno_data.
- Our instruction-following data includes both the simple-description and complex instructions.
- The simple-description data has 4K samples and the complex instructions have 3.6K samples.
- Both training sets are based on the ikgc17 dataset.
- The complex instructions are generated with InternVL.
- Put the data files in the anno_data directory. The directory should look like this:
PointLLM/data/anno_data
├── ikgc17_brief_description_filter.json
├── ikgc17_brief_description.json
└── ikgc17_complex_instruction.json
- Note that ikgc17_brief_description_filter.json is filtered from ikgc17_brief_description.json by removing the 262 objects we reserved as the validation set. To reproduce the results in our paper, use ikgc17_brief_description_filter.json for training. ikgc17_complex_instruction.json contains objects from the training set.
- Download the reference ground truth ikgc17_brief_description_val.json, which we use for the benchmarks on the ikgc17 dataset, and put it in PointLLM/data/anno_data. We also provide the IDs of the 262 objects we filtered out during training, which can be used to evaluate on all 262 objects.
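The relationship between the full and the filtered annotation files can be sketched as a simple ID-based filter. The schema is an assumption: the example treats the annotation file as a list of entries that each carry an `object_id` field, and `filter_annotations` is a hypothetical helper, not part of the released code.

```python
from typing import Dict, Iterable, List


def filter_annotations(
    annotations: Iterable[Dict], held_out_ids: Iterable[str]
) -> List[Dict]:
    """Drop every entry whose object_id is in the held-out validation set.

    Assumes each entry is a dict with an "object_id" key (hypothetical schema).
    """
    held_out = set(held_out_ids)  # set lookup keeps the filter linear
    return [e for e in annotations if e.get("object_id") not in held_out]
```

With real files, the inputs would come from `json.load` on ikgc17_brief_description.json and on the list of 262 validation IDs, and the result would be written back with `json.dump`.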
- In the PointLLM folder, create a directory named checkpoints.
- Download the pre-trained LLM and point encoder: PointLLM_7B_v1.1_init or PointLLM_13B_v1.1_init. Put them in the checkpoints directory.
- Note that "v1.1" above means we use the Vicuna-v1.1 checkpoints; you do not need to download the original LLaMA weights again.
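Before starting training, it can be useful to confirm that the data and checkpoint folders are in place. The sketch below checks the paths described in the setup steps relative to the PointLLM folder; the helper `missing_paths` is hypothetical, and it only tests for existence, not for file integrity.

```python
from pathlib import Path
from typing import List, Sequence

# Paths relative to the PointLLM folder, as described in the setup steps.
REQUIRED = [
    "data/anno_data/ikgc17_brief_description_filter.json",
    "data/anno_data/ikgc17_brief_description.json",
    "data/anno_data/ikgc17_complex_instruction.json",
    "checkpoints",
]


def missing_paths(root: str, required: Sequence[str] = REQUIRED) -> List[str]:
    """Return the required relative paths that do not exist under root."""
    base = Path(root)
    return [rel for rel in required if not (base / rel).exists()]


if __name__ == "__main__":
    missing = missing_paths(".")
    print("all paths present" if not missing else f"missing: {missing}")
```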
- For stage-1 training, simply run:
cd PointLLM
scripts/IKGPointLLM_train_stage1.sh
- After stage-1 training, start stage-2 training:
scripts/IKGPointLLM_train_stage2.sh
- Run the following commands to infer the results.
- Use different commands for inference on different benchmarks (MMSLLM_7B_v1 as an example):
cd PointLLM
export PYTHONPATH=$PWD
# Open Vocabulary Classification on ikgc17
python pointllm/eval/eval_objaverse.py --model_name model/MMSLLM_7B_v1 --task_type classification --prompt_index 0 # or --prompt_index 1
# Object captioning on ikgc17
python pointllm/eval/eval_objaverse.py --model_name model/MMSLLM_7B_v1 --task_type captioning --prompt_index 2
- Please check the default command-line arguments of these two scripts. You can specify different prompts, data paths, and other parameters.
- After inference, the results will be saved in {model_name}/evaluation as a dict with the following format:
{
  "prompt": "",
  "results": [
    {
      "object_id": "",
      "ground_truth": "",
      "model_output": "",
      "label_name": ""
    }
  ]
}
- Get your OpenAI API key at https://platform.openai.com/api-keys.
- Run the following commands to evaluate the model outputs in parallel with ChatGPT/GPT-4 (which costs approximately $1.5 to $2.2 USD).
cd PointLLM
export PYTHONPATH=$PWD
export OPENAI_API_KEY=sk-****
# Open Vocabulary Classification on Objaverse
python pointllm/eval/evaluator.py --results_path /path/to/model_output --model_type gpt-4-0613 --eval_type open-free-form-classification --parallel --num_workers 15
# Object captioning on Objaverse
python pointllm/eval/evaluator.py --results_path /path/to/model_output --model_type gpt-4-0613 --eval_type object-captioning --parallel --num_workers 15
# Close-set Zero-shot Classification on ModelNet40
python pointllm/eval/evaluator.py --results_path /path/to/model_output --model_type gpt-3.5-turbo-0613 --eval_type modelnet-close-set-classification --parallel --num_workers 15
- The evaluation script supports interruption and resumption. You can interrupt the evaluation at any time with Ctrl+C; this saves the temporary results. If an error occurs during the evaluation, the script also saves the current state. You can resume the evaluation from where it left off by running the same command again.
- The evaluation results will be saved in {model_name}/evaluation as another dict. Some of the metrics are explained as follows:
"average_score": The GPT-evaluated captioning score we report in our paper.
"accuracy": The classification accuracy we report in our paper, including random choices made by ChatGPT when model outputs are vague or ambiguous and ChatGPT outputs "INVALID".
"clean_accuracy": The classification accuracy after removing those "INVALID" outputs.
"total_predictions": The number of predictions.
"correct_predictions": The number of correct predictions.
"invalid_responses": The number of "INVALID" outputs by ChatGPT.
# Some other statistics for calling OpenAI API
"prompt_tokens": The total number of tokens of the prompts for ChatGPT/GPT-4.
"completion_tokens": The total number of tokens of the completion results from ChatGPT/GPT-4.
"GPT_cost": The API cost of the whole evaluation process, in US Dollars 💵.
An interactive inference notebook is provided at demo.ipynb. It covers:
- Loading a trained checkpoint
- Visualising a LiDAR point cloud
- Running captioning, classification, and multi-turn dialogue
conda activate mmsllm
cd MMS-LLM/PointLLM
jupyter notebook ../demo.ipynb
If you find our work and this codebase helpful, please consider starring this repo 🌟 and citing:
@mastersthesis{liu2025mmsllm,
title = {MMS-LLM: Empowering Large Language Models to Understand LiDAR Point Clouds},
author = {Tingde Liu},
school = {Institut f{\"u}r Kartographie und Geoinformatik, Leibniz Universit{\"a}t Hannover},
year = {2025}
}
- Release MMS-LLM 7B v1 pre-trained checkpoint on Hugging Face
- Release IKG dataset card on Hugging Face (format specification, class taxonomy, sample statistics)
- Upgrade LLM backbone from Vicuna-7B to LLaMA 3 / Vicuna-13B
- Release the full IKG urban point cloud annotation data (subject to data-sharing agreement)
- Support additional input formats beyond .npy (.las, .pcd, .ply)
- Add data augmentation (random rotation, jitter, scale) for training robustness
- Extend from object-level to scene-level understanding of full MMS scans
- Support change detection queries between two point clouds of the same location
- Grounding: output 3D bounding boxes alongside text descriptions
- Publish full evaluation results and model outputs for reproducibility
- Add support for evaluating with open-source LLM judges (e.g. LLaMA 3) as alternatives to GPT-4
- Multi-GPU training guide (tested configuration for 4× A100)
- Quantised inference (INT4/INT8) for deployment on consumer GPUs
- Docker image for one-command environment setup
Together, let's make LLMs for 3D great!
- Point-Bind & Point-LLM: aligns point clouds with ImageBind and leverages ImageBind-LLM to reason over multi-modal input without training on 3D instruction data.
- 3D-LLM: employs 2D foundation models to encode multi-view images of 3D point clouds.
- Determination of Parking Space and Its Concurrent Usage Over Time Using Semantically Segmented Mobile Mapping Data: proposes a processing pipeline to extract car bounding boxes from a given 3D point cloud.
We welcome contributions! Please read CONTRIBUTING.md for guidelines on reporting issues, setting up the development environment, and submitting pull requests.
- LLaVA: Our codebase is built upon LLaVA.
- Vicuna: We use the Vicuna-7B and Vicuna-13B checkpoints.
- Objaverse: We use models of the Objaverse dataset for training and evaluation.
- Cap3D: We use the Cap3D captioning data for our data generation.
- ULIP-2: We use ULIP-2 for pre-training our point cloud encoder.
- InternVL: We use InternVL captioning data for our data generation.









