Accurate detection of histopathological cancer subtypes is crucial for personalized treatment, and deep learning methods based on histopathology images have become an effective solution to this problem. However, existing deep learning methods for histopathology image classification often suffer from high computational complexity, fail to account for the variability of different tissue regions, and struggle to attend to local and global information simultaneously. To address these issues, we propose a coarse-to-fine inference based vision transformer (ViT) network (CFI-ViT) for pathological image detection of gastric cancer subtypes. CFI-ViT combines global attention with a discriminative, differentiable region-selection module to achieve two-stage inference. In the coarse inference stage, a ViT model with relative position embedding extracts global information from the input image. If the critical information is not sufficiently identified, the differentiable module extracts discriminative local image regions for fine-grained screening in the fine inference stage. The effectiveness and superiority of the proposed CFI-ViT method are validated on three gastric cancer pathological image datasets: one private dataset clinically collected from Yunnan Cancer Hospital in China and two publicly available datasets, HE-GHI-DS and TCGA-STAD. Experimental results demonstrate that CFI-ViT achieves superior recognition accuracy and generalization performance compared with traditional methods, while using only 80% of the computational resources required by a standard ViT model.
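The two-stage control flow described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function and parameter names (`coarse_to_fine_infer`, `CONFIDENCE_THRESHOLD`, `top_k`, `score_fn`) are hypothetical, the models are stand-in callables returning class logits, and the paper's learned differentiable region-selection module is replaced here by a generic per-patch scoring function.

```python
import math

# Hypothetical early-exit threshold; the paper's actual criterion for
# "critical information not sufficiently identified" may differ.
CONFIDENCE_THRESHOLD = 0.9

def softmax(logits):
    """Convert raw class logits to probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def coarse_to_fine_infer(patches, coarse_model, fine_model, score_fn, top_k=4):
    """Two-stage inference: exit early if the coarse (global) model is
    confident; otherwise re-classify only the most discriminative patches.

    patches      -- iterable of image patches (any representation)
    coarse_model -- callable: patches -> class logits (global ViT stand-in)
    fine_model   -- callable: patches -> class logits (fine-stage stand-in)
    score_fn     -- callable: patch -> discriminativeness score
                    (stand-in for the differentiable selection module)
    """
    coarse_probs = softmax(coarse_model(patches))
    if max(coarse_probs) >= CONFIDENCE_THRESHOLD:
        return coarse_probs, "coarse"  # early exit saves the fine stage
    # Keep only the top-k most discriminative patches for fine screening.
    ranked = sorted(patches, key=score_fn, reverse=True)[:top_k]
    fine_probs = softmax(fine_model(ranked))
    return fine_probs, "fine"
```

Skipping the fine stage whenever the coarse pass is already confident is what yields the computational savings the abstract reports: only ambiguous slides pay for the second, region-level pass.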