Foot contact plays a critical role in human interaction with the world, and thus exploring foot contact can advance our understanding of human movement and physical interaction. Despite its importance, existing methods often approximate foot contact using a zero-velocity constraint and focus on joint-level contact, failing to capture the detailed interaction between the foot and the world. Dense estimation of foot contact is crucial for accurately modeling this interaction, yet predicting dense foot contact from a single RGB image remains largely underexplored. There are two main challenges in learning dense foot contact estimation. First, shoes exhibit highly diverse appearances, making it difficult for models to generalize across different styles. Second, the ground often has a monotonous appearance, making it difficult to extract informative features. To tackle these issues, we present a FEet COntact estimation (FECO) framework that learns dense foot contact with shoe style-invariant and ground-aware learning. To overcome the challenge of shoe appearance diversity, our approach incorporates shoe style adversarial training that enforces shoe style-invariant features for contact estimation. To effectively utilize ground information, we introduce a ground feature extractor that captures ground properties based on spatial context. As a result, our proposed method achieves robust foot contact estimation regardless of shoe appearance and effectively leverages ground information. Code will be released.
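The exact form of the shoe style adversarial training is not spelled out above, so the sketch below only illustrates one common way style-invariant learning of this kind is implemented: a gradient reversal layer feeding a small style discriminator. All module names, dimensions, and the number of style classes are assumptions made for illustration, not the released FECO implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; flips (and scales) the gradient in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reversed gradient flows back into the feature extractor; no gradient for lambd.
        return -ctx.lambd * grad_output, None


class ShoeStyleAdversary(nn.Module):
    """Hypothetical shoe-style discriminator. Training the backbone through the gradient
    reversal layer pushes it toward features from which shoe style cannot be recovered,
    i.e. shoe style-invariant features."""

    def __init__(self, feat_dim=256, num_styles=10, lambd=1.0):
        super().__init__()
        self.lambd = lambd
        self.classifier = nn.Sequential(
            nn.Linear(feat_dim, 128),
            nn.ReLU(),
            nn.Linear(128, num_styles),
        )

    def forward(self, feat, style_labels):
        # feat: (B, feat_dim) image feature; style_labels: (B,) shoe-style class indices.
        reversed_feat = GradientReversal.apply(feat, self.lambd)
        logits = self.classifier(reversed_feat)
        return F.cross_entropy(logits, style_labels)
```

In a full training loop, this auxiliary loss would be added to the contact estimation loss, so the backbone is penalized whenever shoe style can still be recovered from its features.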
We leverage 10 datasets with diverse foot interactions, including our proposed COFE dataset, which provides in-the-wild joint-level foot contact annotations.
Our method first applies low-level style randomization to the input image and encodes it into an image feature using a ViT backbone. From the image feature, shoe style and shoe content randomization are performed with random shoe images from the UT Zappos50K dataset to produce a shoe style-invariant feature. This feature is then processed by a ground feature encoder to extract a ground feature, which is used to predict a pixel height map and the ground normal. Finally, the ground feature and the shoe style-invariant feature are fused into a contact feature, which is decoded to produce the final foot contact prediction.
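As a rough illustration of the data flow described above, the sketch below wires the stages together with simple placeholder layers. Every class, layer choice, and tensor shape here is an assumption made for readability, not the actual FECO architecture; for example, the shoe style/content randomization is crudely approximated by feature mixing.

```python
import torch
import torch.nn as nn


class FECOSketch(nn.Module):
    """Data-flow sketch of the pipeline described above (placeholder layers, not the real model)."""

    def __init__(self, feat_dim=256, patch=16, num_foot_points=1000):
        super().__init__()
        # Stand-in for the ViT backbone: patch embedding + one transformer encoder layer.
        self.patch_embed = nn.Conv2d(3, feat_dim, kernel_size=patch, stride=patch)
        self.encoder = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        # Stand-in for shoe style/content randomization: mixes in a random shoe feature.
        self.shoe_embed = nn.Conv2d(3, feat_dim, kernel_size=patch, stride=patch)
        # Ground feature encoder with auxiliary heads (pixel height map, ground normal).
        self.ground_encoder = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.height_head = nn.Linear(feat_dim, 1)   # height per patch token
        self.normal_head = nn.Linear(feat_dim, 3)   # single ground normal per image
        # Contact decoder: fused feature -> dense per-point foot contact probability.
        self.contact_decoder = nn.Linear(2 * feat_dim, num_foot_points)

    def forward(self, image, shoe_image):
        # 1. Low-level style randomization on the input image is assumed to be applied
        #    beforehand (e.g. in the dataloader), so we start from the image itself.
        tokens = self.patch_embed(image).flatten(2).transpose(1, 2)   # (B, N, C)
        img_feat = self.encoder(tokens)

        # 2. Shoe style/content randomization with a random UT Zappos50K shoe image,
        #    crudely approximated here by feature mixing -> shoe style-invariant feature.
        shoe_feat = self.shoe_embed(shoe_image).flatten(2).transpose(1, 2).mean(1, keepdim=True)
        invariant_feat = img_feat + 0.1 * shoe_feat

        # 3. Ground feature extraction, supervised by a pixel height map and ground normal.
        ground_feat = self.ground_encoder(invariant_feat)
        height_map = self.height_head(ground_feat)             # (B, N, 1)
        ground_normal = self.normal_head(ground_feat.mean(1))  # (B, 3)

        # 4. Fuse ground and style-invariant features, then decode dense foot contact.
        fused = torch.cat([ground_feat, invariant_feat], dim=-1).mean(1)
        contact = torch.sigmoid(self.contact_decoder(fused))   # (B, num_foot_points)
        return contact, height_map, ground_normal


# Example with dummy tensors (shapes only):
# contact, height, normal = FECOSketch()(torch.randn(1, 3, 224, 224), torch.randn(1, 3, 224, 224))
```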
There are many excellent works that we would like to share.
Reducing Domain Gap by Reducing Style Bias.
Progressive Random Convolutions for Single Domain Generalization.
DECO: Dense Estimation of 3D Human-Scene Contact In The Wild.
Learning Dense Hand Contact Estimation from Imbalanced Data.
@article{jung2025feco,
  title={Shoe Style-Invariant and Ground-Aware Learning for Dense Foot Contact Estimation},
  author={Jung, Daniel Sungho and Lee, Kyoung Mu},
  journal={arXiv preprint arXiv:2511.22184},
  year={2025}
}