Multimodal automatic depression detection (ADD) has attracted significant research interest because of its potential to provide fast, objective, and reliable assessments. Despite advances in the field, several challenges persist. First, many existing methods model features in fixed-size segments, which cannot adapt to temporal variations in depressive emotional expression and therefore lose critical information about depressive behavior. Second, although attention-based detection methods extract features effectively, they incur substantial computational overhead. Finally, multimodal ensemble processes often suffer from modality inertia and modality forgetting, which compromise model stability and performance. To address these challenges, we propose a multimodal ADD network called the fully convolutional adaptive ensemble network (FC-AEN), which comprises three key modules. The first, deformable patch embedding (DPE), dynamically adjusts the time range of feature segments during training so that depressive behaviors are captured comprehensively. The second, the fully convolutional spatiotemporal detector (FCSTD), uses a fully convolutional design to improve computational efficiency and handle larger data scales. The third, adaptive dynamic ensemble (ADE), addresses the imbalanced multimodal ensemble problem through alternating training and gradient adjustment. We conducted experiments on four datasets (AVEC2013, AVEC2014, AVEC2019, and CMDep) and achieved new state-of-the-art (SOTA) performance on three of them. Specifically, we achieved a mean absolute error (MAE) of 4.91 on AVEC2019, a relative reduction of 14.9%; an MAE of 5.11 on AVEC2013, a relative reduction of 5.0%; and an MAE of 6.04 on CMDep, a relative reduction of 3.4%.
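
The reported metrics can be made concrete with a minimal sketch of how MAE and a relative reduction are typically computed. It assumes that each relative reduction is measured against the previous best MAE on the same dataset, which the abstract does not state explicitly; the function names and the toy values are illustrative and are not taken from the paper.

```python
from typing import Sequence


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference between true and predicted depression scores."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def relative_reduction(baseline_mae: float, new_mae: float) -> float:
    """Fraction by which new_mae improves on baseline_mae.

    Assumption: the baseline is the previous best MAE on the same dataset.
    """
    return (baseline_mae - new_mae) / baseline_mae


def implied_baseline(new_mae: float, reduction: float) -> float:
    """Invert relative_reduction: the baseline MAE implied by a reported reduction."""
    return new_mae / (1.0 - reduction)


# Toy values for illustration only (not results from the paper).
toy_true = [1.0, 2.0, 3.0]
toy_pred = [2.0, 2.0, 5.0]
toy_mae = mean_absolute_error(toy_true, toy_pred)
toy_gain = relative_reduction(baseline_mae=5.0, new_mae=4.0)
```

Under the stated assumption, `implied_baseline` could be applied to a reported MAE and its relative reduction to estimate the prior result being compared against, although rounding in the reported percentages limits the precision of any such estimate.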