Search

Your search for author "Park, Jongsoo" returned 289 results.

Search Results

1. Context Parallelism for Scalable Million-Token Inference

2. The Llama 3 Herd of Models

3. Wukong: Towards a Scaling Law for Large-Scale Recommendation

4. Disaggregated Multi-Tower: Topology-aware Modeling Technique for Efficient Large-Scale Recommendation

7. MTrainS: Improving DLRM training efficiency using heterogeneous memories

8. With Shared Microexponents, A Little Shifting Goes a Long Way

9. RecD: Deduplication for End-to-End Deep Learning Recommendation Model Training Infrastructure

10. DHEN: A Deep and Hierarchical Ensemble Network for Large-Scale Click-Through Rate Prediction

15. First-Generation Inference Accelerator Deployment at Facebook

16. Low-Precision Hardware Architectures Meet Recommendation Model Inference at Scale

17. Alternate Model Growth and Pruning for Efficient Training of Recommendation Systems

18. Software-Hardware Co-design for Fast and Scalable Training of Deep Learning Recommendation Models

19. Efficient Soft-Error Detection for Low-precision Deep Learning Recommendation Models

20. FBGEMM: Enabling High-Performance Low-Precision Deep Learning Inference

21. Mixed-Precision Embedding Using a Cache

22. Adaptive Dense-to-Sparse Paradigm for Pruning Online Recommendation System with Non-Stationary Data

23. Post-Training 4-bit Quantization on Embedding Tables

24. Deep Learning Recommendation Model for Personalization and Recommendation Systems

25. A Study of BFLOAT16 for Deep Learning Training

26. Spatial-Winograd Pruning Enabling Sparse Winograd Convolution

28. Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations and Hardware Implications

29. On Periodic Functions as Regularizers for Quantization of Neural Networks

30. Glow: Graph Lowering Compiler Techniques for Neural Networks

33. Two-step approach to scheduling quantum circuits

34. Enabling Sparse Winograd Convolution by Native Pruning

38. Faster CNNs with Direct Sparse Convolutions and Guided Pruning

49. With Shared Microexponents, A Little Shifting Goes a Long Way

50. Parallel Efficient Sparse Matrix-Matrix Multiplication on Multicore Platforms
