Images are used everywhere because they can convey a story, a fact, or an imagined scene without any words; the human brain can also extract knowledge from images faster than from text. However, creating an image from scratch is not only time-consuming but also a tedious task that requires skill, since an image contains rich features and fine-grained details such as color, brightness, saturation, luminance, texture, and shadow. Sketch-to-image synthesis offers a way to generate an image quickly and without artistic training: hand sketches are much easier to produce because they capture only the key structural information and can be drawn in little time. Yet, because sketches are typically simple, rough, black and white, and often imperfect, converting a sketch into a photorealistic image is far from trivial. This challenge has attracted considerable research attention, and much work has been devoted to generating photorealistic images from sketches. Nevertheless, the generated images still suffer from issues such as unnaturalness, ambiguity, distortion, and, most importantly, the difficulty of handling complex inputs with multiple objects. Most of these problems stem from converting a sketch into an image directly in one shot.

To this end, in this dissertation we propose a new framework that divides the problem into sub-problems, enabling the generation of high-quality photorealistic images even from complicated sketches. Instead of directly mapping the input sketch to an image, we map the sketch to an intermediate result, namely a mask map, through instance and semantic segmentation performed at two levels: background segmentation and foreground segmentation. The background segmentation is formed based on the context of the detected foreground objects, and various natural scenes are supported for both indoor and outdoor settings. Then the foreground segmentation process begins, where each detected object is sequentially and semantically added to the constructed segmented background. Next, the mask map is converted into an image through an image-to-image translation model. Finally, a post-processing stage further enhances the synthesized image via background improvement and human face refinement. This design not only produces better results but also makes it possible to generate images from complicated sketches with multiple objects.

We further improve the framework with size and scene sensing. For size awareness, during the instance segmentation stage the objects' sizes may be modified based on the surrounding environment and their respective size priors, so that the output reflects reality and appears more natural. For scene awareness in the background improvement step, the scene is first inferred from the context and then classified by a scene classifier, after which a scene image is selected; the generated objects are then placed on the chosen scene image at pre-defined snapping points so that each object lands in a proper location and realism is maintained.
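The following is a minimal structural sketch of the staged pipeline described above. All helper names (segment_foreground, build_background_mask, paste_object_mask, translate_mask_to_image, refine_image) are hypothetical placeholders for the dissertation's stages, not calls to an existing library; the canvas size is likewise illustrative.

    # Hypothetical, high-level sketch of the staged sketch-to-image pipeline.
    # Every helper below is a placeholder for a stage described in the text,
    # not an existing library function.

    def segment_foreground(sketch):
        """Instance segmentation: detect each sketched object and return
        (class label, size-adjusted mask, location) entries."""
        raise NotImplementedError

    def build_background_mask(objects, canvas_size):
        """Semantic background segmentation inferred from the context of the
        detected foreground objects (indoor vs. outdoor scene layout)."""
        raise NotImplementedError

    def paste_object_mask(mask_map, obj):
        """Sequentially and semantically add one object mask to the
        constructed segmented background."""
        raise NotImplementedError

    def translate_mask_to_image(mask_map):
        """Image-to-image translation from the composed mask map to RGB."""
        raise NotImplementedError

    def refine_image(image, objects):
        """Post-processing: scene-aware background improvement and
        human face refinement."""
        raise NotImplementedError

    def sketch_to_image(sketch, canvas_size=(512, 512)):
        objects = segment_foreground(sketch)                      # foreground level
        mask_map = build_background_mask(objects, canvas_size)    # background level
        for obj in objects:                                       # sequential insertion
            mask_map = paste_object_mask(mask_map, obj)
        image = translate_mask_to_image(mask_map)
        return refine_image(image, objects)

The point of the sketch is only to show how the one-shot mapping is replaced by a chain of smaller sub-problems, each of which can be improved independently.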
Furthermore, since generated images have improved over time regardless of the input modality, it has sometimes become hard to distinguish synthetic images from genuine ones. While this improvement benefits content creation and the media, it also poses a serious threat to legitimacy, authenticity, and security. An automatic detection system for AI-generated images is therefore a legitimate need; such a system can also serve as an evaluation tool for image synthesis models regardless of the input modality. Indeed, AI-generated images usually bear explicit or implicit artifacts introduced during the generation process. Prior research has focused on detecting synthetic images produced by one specific model, or by similar models with similar architectures, which raises a generalization problem. To tackle it, we propose to fine-tune a pre-trained Convolutional Neural Network (CNN) on a newly collected dataset of AI-generated images drawn from different synthesis architectures and different input modalities, i.e., text, sketch, and other sources (another image or a mask), in order to improve generalization across tasks and architectures.

In summary, our contribution is two-fold. First, we generate high-quality, realistic images from simple, rough, black-and-white sketches, compiling a new dataset of sketch-like images for training. Second, since artificial images carry both benefits and risks in the real world, we build an automated system that detects and localizes synthetic images among genuine ones, collecting a large dataset of generated and real images to train a CNN model.
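As a minimal sketch of the detection idea, the snippet below fine-tunes a pre-trained CNN backbone for binary real-vs-generated classification. It assumes a recent PyTorch/torchvision installation and a hypothetical folder layout (data/train/{real,generated}); the backbone choice and hyperparameters are illustrative, not the dissertation's actual configuration.

    # Fine-tuning sketch: pre-trained ResNet-18 as a real-vs-generated classifier.
    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader
    from torchvision import datasets, models, transforms

    preprocess = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])

    # Hypothetical directory layout: data/train/{real,generated}/*.png
    train_set = datasets.ImageFolder("data/train", transform=preprocess)
    train_loader = DataLoader(train_set, batch_size=32, shuffle=True)

    # Pre-trained backbone; replace the classifier head with a 2-way output.
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    model.fc = nn.Linear(model.fc.in_features, 2)

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    model.train()
    for epoch in range(5):
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()

Training such a classifier on images pooled from multiple generators and input modalities, rather than from a single model family, is what targets the generalization problem discussed above.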