Advanced visual perception in mining – multimodal fusion and enhancement

- Organization: The Australasian Institute of Mining and Metallurgy
- Pages: 4
- File Size: 229 KB
- Publication Date: Sep 1, 2024
Abstract
Multi-sensor fusion visual perception technology significantly enhances perception capabilities in complex and harsh environments by combining visual data from different sensors. It is especially useful in scenarios where a single traditional visible-light sensor, such as a standard camera, struggles because of poor lighting, adverse visual conditions, or other obstructions. In environments with low light, smoke, or dust, the performance of conventional visible-light sensors degrades, limiting their application. By integrating data from sensors such as radar and infrared thermal imaging, multi-sensor fusion technology provides a more comprehensive and complementary visual solution. This fusion not only enhances the robustness of visual systems but also greatly expands their potential applications in fields such as visual surveillance. In recent years, deep learning methods have become the mainstream approach to multi-sensor fusion, demonstrating significant advantages in adaptive feature selection over traditional machine learning methods. However, deep learning-based multi-sensor fusion algorithms still face several challenges.
First, network structures are often designed with redundancy and lack effective screening of the useful components within multimodal information. Second, mainstream fusion algorithms focus excessively on improving display effects without adequately considering the needs of downstream applications and tasks. Third, existing fusion perception algorithms are generally designed for open visual scenes and lack targeted designs for the complex visual degradation factors present in mine tunnel environments.
Current deep learning methods for designing multi-sensor fusion networks rely heavily on manual experience to create fusion modules. This reliance increases the complexity of network design and can lead to redundancy. Redundant network modules are difficult to identify in current end-to-end learning pipelines; they significantly slow down network inference and can interfere with the output of fusion perception. Additionally, training these fusion modules typically requires large amounts of well-annotated multimodal data, further increasing implementation difficulty and cost.
For instance, the fusion network design proposed by Li and Wu (2018) focuses on extracting
multimodal fusion features, while Zhao et al (2020) introduced a feature decomposition mechanism
in the feature fusion and extraction modules. Researchers like Zhang and Ma (2021), Xu et al (2020),
and Liu et al (2017) have attempted to improve network structures through dense cascades and the
introduction of residual connections. Although these methods have improved the handling of
multimodal information to some extent, they still face limitations in distinguishing between useful and
redundant information, and redundancy persists in their network designs. Ideally, fusion networks should be designed more intelligently, with the network structure guided heuristically by how important each modality's information is to the current task. This would reduce information redundancy and the number of network weights, and it calls for new algorithms or frameworks capable of automatically identifying and optimising fusion strategies, thereby reducing dependence on manual experience while improving fusion efficiency and network performance.
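
To make the preceding ideas concrete, the sketch below shows one minimal way a two-modality (visible and infrared) fusion block with a residual connection and a learned modality-importance gate could be written in PyTorch. It is an illustrative assumption only: the module names, channel sizes and gating scheme are not taken from this paper or from the cited works.

```python
# Illustrative sketch only: a minimal visible/infrared feature-fusion block with a
# learned modality-importance gate and a residual connection. All module names and
# hyper-parameters are assumptions for illustration, not the network proposed in
# the paper or in the cited works.
import torch
import torch.nn as nn


class GatedFusionBlock(nn.Module):
    def __init__(self, channels: int = 32):
        super().__init__()
        # Per-modality encoders: visible images have 3 channels, infrared has 1.
        self.vis_encoder = nn.Sequential(
            nn.Conv2d(3, channels, kernel_size=3, padding=1), nn.ReLU(inplace=True)
        )
        self.ir_encoder = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1), nn.ReLU(inplace=True)
        )
        # A lightweight gate predicts a per-channel weight for each modality from
        # globally pooled features, approximating importance-based screening.
        self.gate = nn.Sequential(nn.Linear(2 * channels, 2 * channels), nn.Sigmoid())
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, visible: torch.Tensor, infrared: torch.Tensor) -> torch.Tensor:
        f_vis = self.vis_encoder(visible)                         # (B, C, H, W)
        f_ir = self.ir_encoder(infrared)                          # (B, C, H, W)
        pooled = torch.cat([f_vis.mean(dim=(2, 3)),
                            f_ir.mean(dim=(2, 3))], dim=1)        # (B, 2C)
        weights = self.gate(pooled).unsqueeze(-1).unsqueeze(-1)   # (B, 2C, 1, 1)
        weighted = torch.cat([f_vis, f_ir], dim=1) * weights      # gated features
        fused = self.fuse(weighted)
        # Residual connection: keep the visible-stream features as a shortcut so
        # the fused output is no worse than the unimodal path at initialisation.
        return fused + f_vis


if __name__ == "__main__":
    block = GatedFusionBlock(channels=32)
    rgb = torch.randn(2, 3, 64, 64)      # batch of visible-light images
    thermal = torch.randn(2, 1, 64, 64)  # batch of infrared thermal images
    print(block(rgb, thermal).shape)     # torch.Size([2, 32, 64, 64])
```

Here the gate is simply a stand-in for the importance-based screening discussed above; in practice the weighting could equally be learned per spatial location or discovered automatically by an architecture-search procedure rather than designed by hand.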
Current deep learning designs for multi-sensor fusion algorithms often overlook the requirements of the downstream applications and tasks that consume the fused output.