Speaker
Description
Conventional deep-learning-based structural damage recognition from images usually requires well-designed network architectures for each damage category, along with complicated procedures of hyperparameter tuning and retraining. Recently, pre-trained large foundation models, both unimodal (vision) and multimodal (vision-language), have been developed to capture fundamental knowledge from large-scale visual and linguistic datasets, offering great potential for structural damage diagnosis that makes full use of structural inspection images and texts. This study introduced a large-model pipeline for structural damage diagnosis toward intelligent structural inspection, comprising a large vision model for visual damage segmentation and a vision-language model for linguistic damage description. First, a large vision model based on DINO was proposed via cross-level feature alignment and contrastive learning for universal structural damage segmentation; its recognition accuracy and robustness were validated on various types of structural components and surface damage from actual bridges and buildings. Then, a large vision-language model based on miniGPT-v2 was developed via domain-knowledge embedding and cross-modal learning, enabling multi-round human-agent dialogue that describes the details of structural damage. The model's capacity and generalization ability were further demonstrated on several downstream vision-language cross-modal tasks: image captioning, visual question answering, and visual grounding. The results preliminarily indicated the feasibility and effectiveness of the proposed large-model paradigm for structural damage diagnosis.
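As a rough illustration only (not the study's implementation, whose details are not given in this description), contrastive learning of the kind mentioned for cross-level feature alignment typically pulls paired feature vectors together and pushes unpaired ones apart via an InfoNCE-style loss. The function names and toy vectors below are hypothetical:

```python
# Minimal InfoNCE-style contrastive loss sketch (illustrative only):
# anchors[i] is assumed to be the positive pair of positives[i];
# all other entries serve as negatives.
import math

def cosine(u, v):
    # Cosine similarity between two feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(anchors, positives, temperature=0.1):
    # Average cross-entropy of matching anchors[i] to positives[i]
    # against all candidates in positives (in-batch negatives).
    loss = 0.0
    for i, a in enumerate(anchors):
        logits = [cosine(a, p) / temperature for p in positives]
        m = max(logits)  # stabilize log-sum-exp
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += -(logits[i] - log_sum)
    return loss / len(anchors)
```

Correctly aligned feature pairs yield a lower loss than mismatched ones, which is the signal that drives the alignment during training.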