New computing applications, such as deep neural network (DNN) training and inference, have been a driving force reshaping the semiconductor industry. The data-intensive nature of DNN workloads leads to high computational cost and complexity, and a variety of hardware accelerators have been developed to run these models more efficiently. For example, NVIDIA GPUs, Google TPUs, and near-memory computing architectures improve DNN performance and energy efficiency compared to conventional CPUs. Beyond these approaches, in-memory computing (IMC) can circumvent the fundamental von Neumann bottleneck and enable highly parallel computation, leading to even higher hardware efficiency and performance. This thesis examines several aspects of IMC accelerators based on emerging memory devices such as resistive random-access memory (RRAM), which can offer high computation density, throughput, and energy efficiency.

We first present a reconfigurable IMC design that accelerates general arithmetic and logic functions. The system consists of small look-up tables (LUTs), a memory block, and search auxiliary blocks integrated into an RRAM crossbar array. External data accesses and data conversions are eliminated, allowing operations to be performed fully in memory. Logic and arithmetic functions such as AND, addition, and multiplication are implemented through search and write-back steps, and a compact instruction set is demonstrated on this architecture through circuit-level simulations. Performance evaluations show that the proposed IMC architecture suits data-intensive tasks while maintaining low power consumption.

Next, we discuss DNN accelerator designs using a tiled IMC architecture. Popular models, including VGG-16 and MobileNet, are successfully mapped and tested on the RRAM-based tiled IMC architecture. We analyze the effects of finite RRAM array size and of partial sums (Psums) quantized by the limited precision of the analog-to-digital converters (ADCs), and develop methods that address these challenges while preserving DNN accuracy and the performance gains of IMC.
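The impact of finite array size and ADC-limited Psum precision can be illustrated with a short numerical sketch. The snippet below is a simplified model rather than the simulator used in this work: the tile size, ADC bit width, per-tile quantization range, and all function names are illustrative assumptions.

```python
import numpy as np

def quantize(x, bits, x_max):
    """Uniform symmetric quantizer emulating a limited-precision ADC."""
    levels = 2 ** bits - 1
    step = 2 * x_max / levels
    return np.clip(np.round(x / step), -(levels // 2), levels // 2) * step

def tiled_matvec(weights, activations, array_size=128, adc_bits=5):
    """Matrix-vector product split across crossbar tiles of `array_size` rows.

    Each tile produces an analog partial sum (Psum) that is digitized by an
    ADC with `adc_bits` of precision before digital accumulation.
    """
    result = np.zeros(weights.shape[1])
    for start in range(0, weights.shape[0], array_size):
        tile_w = weights[start:start + array_size]
        tile_x = activations[start:start + array_size]
        psum = tile_x @ tile_w                      # analog partial sum of one tile
        x_max = np.abs(psum).max() + 1e-12          # illustrative per-tile ADC range
        result += quantize(psum, adc_bits, x_max)   # digitize, then accumulate
    return result

# Compare quantized, tiled Psums against the ideal full-precision product.
rng = np.random.default_rng(0)
W = rng.normal(size=(512, 64))
x = rng.normal(size=512)
ideal = x @ W
approx = tiled_matvec(W, x, array_size=128, adc_bits=5)
print("relative Psum error:", np.linalg.norm(approx - ideal) / np.linalg.norm(ideal))
```

Because each tile's Psum is digitized before accumulation, reducing the ADC precision or the tile size changes the relative error reported by the example, which is the trade-off examined in this part of the thesis.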
For practical IMC implementations and to support larger models, we develop a Tiled Architecture for In-memory Computing and Heterogeneous Integration (TAICHI), a general IMC DNN accelerator design. TAICHI is based on tiled RRAM crossbar arrays heterogeneously integrated with local arithmetic units and global co-processors, allowing the same chip to map different models efficiently while maintaining high energy efficiency and throughput. A hierarchical mesh network-on-chip facilitates communication among the clusters in TAICHI and balances reconfigurability against efficiency. Detailed implementations of the circuit components are presented, and system performance is benchmarked at several technology nodes. The heterogeneous design also allows the system to accommodate models larger than the on-chip storage capacity, making the hardware future-proof.

Large-scale IMC accelerators face two technological challenges: high ADC overhead and device variability. We note that both can be addressed by restricting neuron activations to single-bit values, i.e., spikes, and by employing binary weights. Based on these principles, we propose an efficient hardware implementation of binary-weight spiking neural networks (BSNNs) that can be realized with current RRAM devices and simple circuits. Binary activations also open opportunities for intra- and inter-layer data routing and for neuron circuit design optimizations; a minimal sketch of such a neuron is given below. Through high-precision backpropagation through time (HP-BPTT) and a proper neuron design, we show that BSNNs can achieve accuracies comparable to floating-point models. With these co-designs, the proposed architecture achieves high energy efficiency and accuracy on common SNN datasets, and the robustness of the BSNN model against device non-idealities is further verified through experimental chip measurements.

Finally, we discuss opportunities to further enhance IMC architecture performance, including pipelining optimizations, mapping strategies, and BSNN training optimization strategies.
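To make the binary-weight spiking neuron concrete, the following minimal sketch models one timestep of a leaky integrate-and-fire (LIF) layer with ±1 weights and single-bit spike activations. The decay factor, threshold, reset scheme, and layer sizes are illustrative assumptions rather than the parameters used in this work.

```python
import numpy as np

def bsnn_lif_step(spikes_in, weights_binary, v_mem, decay=0.9, threshold=1.0):
    """One timestep of a leaky integrate-and-fire layer with binary weights.

    spikes_in      : (n_in,)        single-bit input spikes (0 or 1)
    weights_binary : (n_in, n_out)  weights constrained to {-1, +1}
    v_mem          : (n_out,)       membrane potentials carried across timesteps
    Returns the output spikes and the updated membrane potentials.
    """
    current = spikes_in @ weights_binary          # accumulation needs no multipliers
    v_mem = decay * v_mem + current               # leaky integration
    spikes_out = (v_mem >= threshold).astype(float)
    v_mem = np.where(spikes_out > 0, 0.0, v_mem)  # reset neurons that fired
    return spikes_out, v_mem

# Run a random binary-weight layer for a few timesteps.
rng = np.random.default_rng(1)
W = np.sign(rng.normal(size=(64, 10)))            # binary (+/-1) weights
v = np.zeros(10)
for t in range(8):
    spikes_in = (rng.random(64) < 0.2).astype(float)   # Bernoulli input spikes
    spikes_out, v = bsnn_lif_step(spikes_in, W, v)
    print(f"t={t}: {int(spikes_out.sum())} output spikes")
```

Because the weights are restricted to ±1 and the activations to single-bit spikes, the accumulation reduces to signed additions, which is why such networks map naturally onto current RRAM devices and simple peripheral circuits.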