Autonomous 3D reconstruction, the process whereby an agent produces its own representation of the world, is an extremely challenging area in both vision and robotics. However, 3D reconstructions can grant robots the understanding of the world necessary for collaboration and high-level goal execution. This thesis therefore explores methods that enable modern robotic systems to autonomously and collaboratively achieve an understanding of the world.

In the real world, reconstructing a 3D scene requires a nuanced understanding of the environment. Moreover, it is not enough to simply "understand" the world; autonomous agents must be capable of actively acquiring this understanding. Achieving all of this using simple monocular sensors is extremely challenging. Agents must be able to determine which areas of the world are navigable, how egomotion affects reconstruction, and how other agents may be leveraged to provide an advantage. All of this must be considered in addition to the traditional 3D reconstruction issues of correspondence estimation, triangulation and data association.

Simultaneous Localisation and Mapping (SLAM) solutions are not well suited to autonomous multi-agent reconstruction: they typically require the sensors to be in constant communication, scale poorly with the number of agents (or map size) and require expensive optimisations. Instead, this thesis develops more proactive techniques from the ground up. First, an autonomous agent must be able to actively select what it is going to reconstruct. Known as view-selection, or Next-Best View (NBV), this has recently become an active topic in autonomous robotics and forms the first contribution of this thesis. Second, once a view is selected, an autonomous agent must be able to plan a trajectory to arrive at that view.
This problem, known as path-planning, is a core topic in robotics and forms the second contribution of this thesis. Finally, the 3D reconstruction must be anchored to a globally consistent map that corresponds to the real world. This is addressed as a floorplan localisation problem, an emerging field for the vision community, and forms the third contribution of this thesis.

To give autonomous agents the ability to actively select what data to process, this thesis discusses the NBV problem in the context of Multi-View Stereo (MVS). The proposed approach massively reduces the computing resources required for any given 3D reconstruction. More importantly, it autonomously selects the views that improve the reconstruction the most. All of this is done using the sensor poses alone; the images are not used for view-selection and are only loaded into memory once they have been selected for reconstruction. Experimental evaluation shows that NBV applied to this problem can achieve results comparable to the state-of-the-art using as little as 3.8% of the views.

To provide the ability to execute an autonomous 3D reconstruction, this thesis proposes a novel computer-vision-based goal-estimation and path-planning approach. The method proposed in the previous chapter is extended into a continuous pose-space. The resulting view then becomes the goal of a Scenic Pathplanner that plans a trajectory between the current robot pose and the NBV. This is done using an NBV-based pose-space that biases the paths towards areas of high information gain. Experimental evaluation shows that the Scenic Planning enables similar performance to state-of-the-art batch approaches using less than 3% of the views, which corresponds to 2.7 × 10⁻⁴% of the possible stereo pairs (using a naive interpretation of plausible stereo pairs).
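To illustrate what pose-only view-selection can look like, the sketch below greedily scores candidate views using nothing but their camera poses. The scoring (baseline plus viewing-angle novelty) and the function `select_views` are hypothetical stand-ins for illustration, not the thesis's actual NBV criterion:

```python
import numpy as np

def select_views(positions, view_dirs, k, w_base=1.0, w_angle=1.0):
    """Greedy pose-only view selection (illustrative scoring).

    positions: (N, 3) camera centres; view_dirs: (N, 3) viewing directions.
    Greedily picks k views, each maximising baseline and viewing-angle
    novelty relative to the views already selected -- no image data needed.
    """
    view_dirs = view_dirs / np.linalg.norm(view_dirs, axis=1, keepdims=True)
    selected = [0]  # seed with the first pose
    while len(selected) < k:
        best, best_score = None, -np.inf
        for i in range(len(positions)):
            if i in selected:
                continue
            # novelty with respect to the closest already-selected view
            baseline = np.min(np.linalg.norm(positions[selected] - positions[i], axis=1))
            angle = np.min(np.arccos(np.clip(view_dirs[selected] @ view_dirs[i], -1.0, 1.0)))
            score = w_base * baseline + w_angle * angle
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected
```

Because only poses are scored, the images themselves need never be loaded until a view is actually chosen for reconstruction, which is where the resource savings described above come from.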
Comparison against length-based path-planning approaches shows that the Scenic Pathplanner produces more complete and more accurate maps with fewer frames. The ability of the Scenic Pathplanner to generalise to live scenarios is also demonstrated using low-cost robotic platforms.

Finally, to allow global consistency and provide a basis for indoor robot-human interaction, this thesis proposes a novel human-inspired floorplan localisation approach. This method builds on the intuition that humans use semantic cues, such as doors and windows, to localise within a floorplan. These semantic cues are extracted from an RGB image, presented as a novel sensor modality called Semantic Detection and Ranging (SeDAR), and used as observations within a Monte-Carlo Localisation (MCL) framework. Experimental evaluation shows that SeDAR-based MCL can outperform state-of-the-art MCL when using range measurements. It is also demonstrated that the semantic cues alone are sufficient for localisation, as this approach achieves results comparable to the state-of-the-art without range measurements.

When combined, these contributions address some of the most fundamental issues facing autonomous and collaborative robots. They advance the fields of 3D Reconstruction, Path-planning and Localisation by allowing autonomous agents to reconstruct complex scenes. The field of 3D Reconstruction is advanced by demonstrating that intelligent view-selection can drastically improve the performance of established methods. The field of Path-planning is advanced by establishing that proactive behaviours can be encoded into low-cost robotics, such that high-level goals result in emergent strategies for collaboration. Finally, the field of Localisation is advanced by validating that human-inspired localisation based on distinctive semantic landmarks is an effective alternative to traditional scan-matching.
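The semantic-cue intuition behind SeDAR-based MCL can be sketched as a particle-weighting step: each pose hypothesis is scored by how well the landmark labels it would see agree with the labels detected in the camera frame, with no range measurements involved. The function `sedar_weights`, the floorplan representation, and the agreement score below are illustrative assumptions, not the thesis's actual observation model:

```python
import math

def sedar_weights(particles, floorplan, observed_labels,
                  fov=math.radians(90), r_max=5.0):
    """Hypothetical semantic-label MCL weighting step (labels only, no range).

    particles: list of (x, y, theta) pose hypotheses
    floorplan: list of (x, y, label) semantic landmarks, e.g. doors/windows
    observed_labels: semantic labels detected in the current camera frame
    Returns (x, y, theta, weight) tuples; the weight measures agreement
    between the labels the particle would see and those actually detected.
    """
    observed = set(observed_labels)
    weighted = []
    for (x, y, th) in particles:
        expected = set()
        for (lx, ly, label) in floorplan:
            dx, dy = lx - x, ly - y
            if math.hypot(dx, dy) > r_max:
                continue  # landmark too far away to be detected
            bearing = math.atan2(dy, dx) - th
            bearing = math.atan2(math.sin(bearing), math.cos(bearing))  # wrap to [-pi, pi]
            if abs(bearing) <= fov / 2:
                expected.add(label)
        hits = len(expected & observed)
        misses = len(observed - expected)
        weight = (1 + hits) / (1 + hits + misses)  # simple agreement likelihood
        weighted.append((x, y, th, weight))
    return weighted
```

In a full MCL loop these weights would drive resampling and be interleaved with a motion model; the sketch only shows why label agreement alone can discriminate between pose hypotheses.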
The experiments in this thesis demonstrate that autonomous agents can navigate unknown complex scenes using simple monocular cameras. This thesis lays the foundation for autonomous, collaborative 3D reconstruction that goes beyond simple SLAM-based solutions and enables high-level collaboration towards a common goal.