Synthesizing novel views in urban settings is crucial for applications such as autonomous driving and virtual tours. Unlike object-level or indoor scenarios, outdoor settings pose unique challenges, including larger scenes, frame-to-frame inconsistencies caused by moving vehicles, and noisy camera poses. This paper introduces a method that addresses these challenges for view synthesis in outdoor scenarios, building on the neural point light field scene representation and using both 2D image data and 3D point cloud information. Our method efficiently removes dynamic objects from the scene and jointly refines camera poses to recover clean views. We achieve this by estimating optical flow for the input video sequence and masking out moving objects during training. By learning a consistent geometric representation in the neural point light field, the masked-out areas are correctly recovered in both training and unseen views, without leaving black areas. Moreover, the learned geometry allows us to extrapolate from the current camera trajectory and recover plausible extended views. Additionally, we propose to optimize the camera poses simultaneously with the scene representation, accommodating the noisy camera pose inputs typical of real-world applications. Through validation on real-world urban datasets, we demonstrate stable and satisfactory results in synthesizing novel views of urban scenes.
inproceedings
BibTeXKey: DSC24
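
A minimal sketch (in PyTorch) of the dynamic-object masking idea described in the abstract: pixels whose estimated optical flow deviates strongly from the flow induced by camera ego-motion are treated as dynamic and excluded from the photometric training loss. The ego-motion flow input, the residual threshold, and the function names are illustrative assumptions, not the authors' exact formulation.

```python
import torch

def dynamic_object_mask(optical_flow, ego_flow, threshold=2.0):
    """Return a boolean mask that is True for (likely) static pixels.

    optical_flow: (H, W, 2) flow estimated between consecutive frames.
    ego_flow:     (H, W, 2) flow induced by camera motion alone
                  (e.g., from depth and relative pose); an assumption here.
    threshold:    residual magnitude in pixels above which a pixel is
                  treated as dynamic; a hypothetical value.
    """
    residual = torch.linalg.norm(optical_flow - ego_flow, dim=-1)  # (H, W)
    return residual < threshold  # True = static, keep in the loss

def masked_photometric_loss(rendered, target, static_mask):
    """L2 photometric loss computed over static pixels only."""
    per_pixel = ((rendered - target) ** 2).mean(dim=-1)  # (H, W)
    return per_pixel[static_mask].mean()
```

In such a setup, the masked regions contribute no gradient, and the consistent geometry learned across frames fills them in when rendering trained or unseen views.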