Profile picture of Yoonsoo Kim

Yoonsoo Kim

AI Research Engineer / Kaggle Grandmaster / Creator Of This Website

Links: Github / Kaggle / LinkedIn / Old Blog

I like to code and create new things that bring value to someone.

I first encountered programming while majoring in Economics at Seoul National University. I found it fascinating to design logic that precisely controls the results. AI greatly expands the breadth of tasks achievable by programming, so I started to self-study AI. Then I met Kaggle, where I could develop ML algorithms to solve various problems and compete with people around the world. I was almost addicted to the catharsis that comes from inventing new ideas to boost my program's performance and its ranking on the leaderboard. After 2 years of Kaggling, I became a Competition Grandmaster with a highest rank of 17 (top 0.01%), the youngest in South Korea.

I have experience building high-performing ML models on image, text, tabular, and sequential data, as well as working RL agents on tasks beyond standard benchmarks. I also have experience building full-stack web apps with Next.js.

Below, I will describe my experiences with programming in chronological order.

2017

👨‍💻 Sep~Dec: My First Encounter with Programming

I first encountered programming in a university lecture. I learned the basics of Python and used it to solve problems. I found it interesting to create something new based on logic and see it work exactly as I instructed, so I started searching for what I could do with programming.

2018

🌐 Jan~Feb: Web Development with PHP

I found some free online courses for web development. After following them, I started to develop a web forum for sharing images and videos. For the tech stack, I used PHP and MySQL with the Laravel framework. I had fun building the logic and designing the website, and after about 1~2 months of dedicated work, I was able to deploy a working site. However, since I took a mostly copy-and-paste approach to coding, the code was hard to maintain and upgrade, so I closed the website and moved on from this project.

🤖 Apr: Machine Learning Course

Back in 2016, I watched the Go match between Lee Sedol and AlphaGo and was very impressed by the potential of artificial intelligence. After moving away from the web development project, I searched for ways to learn AI. Eventually, I found the Machine Learning course on Coursera taught by Professor Andrew Ng of Stanford University. I took the course and first encountered concepts such as supervised learning (multiple linear regression, logistic regression, neural networks, and decision trees), unsupervised learning (clustering, dimensionality reduction, recommender systems), and some of the best practices used in Silicon Valley for artificial intelligence and machine learning innovation (evaluating and tuning models, taking a data-centric approach to improving performance, and more). I liked the course and decided to take another recent course by Andrew Ng.

🧠 May: Deep Learning Course

The famous 'Deep Learning Specialization' on Coursera had just launched at the time. I took the course and learned about basic neural networks, convolutional neural networks, sequence models, and practical skills for deep learning. The course wasn't math-heavy, and I was able to see how the systems work without getting lost in mathematical details and proofs. After taking these two courses (Machine Learning, Deep Learning) by Andrew Ng, I was ready to dive into the field of AI.

Certificate

📒 Jul~Aug: Applied Data Science with Python Course

I wanted to implement and build something out of the knowledge I gained from Andrew Ng's two courses, but I found that I had little practice coding AI with Python. So I took the 'Applied Data Science with Python' specialization taught by the University of Michigan on Coursera. I learned to apply statistical, machine learning, information visualization, and text analysis techniques to gain new insight into data. I became more comfortable handling data with Python after finishing this course.

Certificate

📄 Aug: Opened Personal Blog DataPlayground

I opened a blog to keep track of what I learned in the field of AI. I used 'GitHub Pages' for hosting and 'Jekyll' for static site generation, customized the design based on the 'Minimal Mistakes' theme, and used Markdown for writing the posts.

📈 Sep~Nov: Worked on Predicting Stock Price with Deep Learning

Like many others, I was curious whether deep learning could predict future stock prices from historical price data. With the skills I had learned, I built and trained deep learning models to predict stock prices, and then built an automatic trading bot on top of them. I failed to make a profit, but it was good practice for me. In retrospect, the task was too hard for someone with as little experience as I had at the time.

2019

🤼 Jan: Joined First Kaggle Competition

While searching for places to apply and upgrade my AI skills, I found Kaggle, the largest global AI competition platform. At first sight, the prize money was surprisingly large compared to other competitions I'd seen. It also drew my attention that I could compete with people around the world just by building AI algorithms from my own computer. I participated in some tutorial competitions, and then, full of anticipation, I started to join real Kaggle competitions. Here, I'll only introduce competitions where I earned at least a silver medal.

🥈 Apr~May: First Silver Medal - Kaggle LANL Earthquake Prediction

  • Timeline: 2019/01/11~06/04
  • Prize Money: $50,000
  • Host: Los Alamos National Laboratory
  • Goal: Predict when an earthquake will occur in a laboratory experiment, using sequential signal data
  • Team: Solo
  • Result: 🥈 Top 2.76% (125/4521)
  • Solution: Generated features from sequential data and selected important ones / Trained an ensemble of gradient boosting machines

This was the competition where I earned my first silver medal. Honestly, I didn't expect it, but the leaderboard shakeup worked in my favor. Through this competition, I got familiar with processing large data with pandas, setting up experiment workflows, configuring and training SOTA GBMs, and ensembling models.
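To make the recipe above concrete, here is a minimal sketch of the general approach (not my actual competition code): aggregate each raw acoustic segment into summary statistics, then fit a gradient boosting model. The specific features, window sizes, and hyperparameters are illustrative placeholders.

```python
# Sketch: segment-level aggregate features + LightGBM regressor (illustrative only)
import numpy as np
import pandas as pd
import lightgbm as lgb

def make_features(segment: np.ndarray) -> dict:
    """Summarize one 1-D acoustic segment into aggregate statistics."""
    s = pd.Series(segment)
    return {
        "mean": s.mean(),
        "std": s.std(),
        "max": s.max(),
        "q95": s.quantile(0.95),
        "abs_mean": s.abs().mean(),
        "roll_std_100_mean": s.rolling(100).std().mean(),  # rolling-window feature
    }

def build_dataset(segments):
    """segments: list of (signal_chunk, time_to_failure) pairs prepared elsewhere."""
    X = pd.DataFrame([make_features(sig) for sig, _ in segments])
    y = np.array([ttf for _, ttf in segments])
    return X, y

# X_train, y_train = build_dataset(train_segments)
# model = lgb.LGBMRegressor(n_estimators=2000, learning_rate=0.01, num_leaves=31)
# model.fit(X_train, y_train)
```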

🥈 May~Jun: Kaggle Instant Gratification

  • Timeline: 2019/05/18~06/21
  • Prize Money: $5,000
  • Host: Kaggle
  • Goal: Predict a target that was generated by a method hidden by the hosts
  • Team: Solo
  • Result: 🥈 Top 0.76% (14/1832)
  • Solution: Reverse-engineered the data generation method and submitted the best possible answer

Out of many participants, only a few (<5) found the perfect solution. Although the solution amounted to reverse-engineering synthetic data, which is hardly useful in the real world, I gained some confidence in competing in Kaggle competitions.

🥈 Jul~Oct: Kaggle IEEE-CIS Fraud Detection

  • Timeline: 2019/07/16~10/04
  • Prize Money: $20,000
  • Host: IEEE Computational Intelligence Society
  • Goal: Detect fraudulent transactions
  • Team: Solo
  • Result: 🥈 Top 0.49% (31/6381)
  • Solution: Grouped customers by feature combinations and used the groups to generate features / Trained an ensemble of gradient boosting machines

Although the top solutions involved reverse-engineering customer groups, which was not intended by the hosts, I gained skill in efficiently inspecting, handling, and generating features from tabular data with Python. I also got more comfortable with GBMs and hyperparameter tuning.

📚 Sep: Started Summarizing Papers

While participating in Kaggle competitions, whenever I wanted to understand the SOTA approach to a problem or boost performance, I naturally came to read papers. I started to summarize some of them on my blog to get a better understanding of the content. I had a hard time with the math-heavy parts, but most of the time I was able to grasp the main idea of a paper. Apart from competition-related papers, I also searched for and read some classics. After reading and summarizing them, I became comfortable using papers as another resource for problem-solving.

🥈 Oct~Nov: Kaggle Understanding Clouds from Satellite Images

  • Timeline: 2019/08/17~11/19
  • Prize Money: $10,000
  • Host: Max Planck Institute for Meteorology
  • Goal: Segment types of clouds in satellite images
  • Team: Solo
  • Result: 🥈 Top 3.51% (54/1538)
  • Solution: BCE+Dice loss to optimize the Dice coefficient / Added a classification branch to the UNet architecture and multiplied it with the segmentation output / Excluded masks with low probability or small area

I learned to finetune SOTA pretrained image models and do multilabel image segmentation. I also became comfortable constructing multi-stage pipelines including preprocessing, training deep learning models, postprocessing, ensembling, and pseudo labeling.
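As a concrete illustration of the loss mentioned above, here is a minimal BCE+Dice sketch, assuming logits and binary masks of shape [B, C, H, W]; the weighting and epsilon are illustrative, not my exact settings.

```python
# Sketch: BCE + Dice loss for multilabel segmentation (illustrative values)
import torch
import torch.nn.functional as F

def bce_dice_loss(logits: torch.Tensor, targets: torch.Tensor,
                  bce_weight: float = 0.5, eps: float = 1e-6) -> torch.Tensor:
    bce = F.binary_cross_entropy_with_logits(logits, targets)
    probs = torch.sigmoid(logits)
    dims = (2, 3)                                    # reduce over spatial dimensions
    intersection = (probs * targets).sum(dims)
    union = probs.sum(dims) + targets.sum(dims)
    dice = (2 * intersection + eps) / (union + eps)  # soft Dice per sample/class
    return bce_weight * bce + (1 - bce_weight) * (1 - dice.mean())
```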

2020

🥈 Jan~Mar: Kaggle Deepfake Detection Challenge

  • Timeline: 2019/12/12~2020/4/24
  • Prize Money: $1,000,000
  • Host: DFDC (Deepfake Detection Challenge; built by AWS, Facebook, Microsoft, the Partnership on AI's Media Integrity Steering Committee, and academics)
  • Goal: Detect whether a video contains a deepfake
  • Team: 4
  • Result: 🥈 Top 1.32% (30/2265)
  • Solution: Extracted frames from videos and performed image classification / UNet with a classification branch as the model architecture / Added margin when extracting detected faces from frames / Heavy augmentation / Focused on model generalization

I spent more than 3 months on this competition and tried a lot of experiments, most of which failed, but I succeeded in reaching 2nd on the public leaderboard. However, as expected, generalizability was the issue, and I slipped to 30th on the private leaderboard. Nevertheless, this was the competition where I learned and got used to many computer vision techniques and to constructing complex pipelines efficiently. I moved from Keras to PyTorch for easier customization and used modular Python files, rather than only Jupyter notebooks, for easier code maintenance. I diagnosed models by inspecting examples they struggled with and by extracting Grad-CAMs. I chose a pretrained face detection model and incorporated it into the pipeline. To evaluate the models more reliably, I searched for and downloaded other deepfake datasets and tested the models against those samples.
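As an illustration of the "add margin around detected faces" step, here is a minimal sketch; the face detector itself is an external pretrained model, and the margin value is an illustrative placeholder.

```python
# Sketch: expand a detected face box by a relative margin before cropping
import numpy as np

def crop_face_with_margin(frame: np.ndarray, box, margin: float = 0.3) -> np.ndarray:
    """box = (x1, y1, x2, y2) from a face detector; expand by `margin` on each side."""
    h, w = frame.shape[:2]
    x1, y1, x2, y2 = box
    bw, bh = x2 - x1, y2 - y1
    x1 = max(int(x1 - margin * bw), 0)
    y1 = max(int(y1 - margin * bh), 0)
    x2 = min(int(x2 + margin * bw), w)
    y2 = min(int(y2 + margin * bh), h)
    return frame[y1:y2, x1:x2]
```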

๐Ÿข Oct: Joined Upstage Challenge Team Internship

Thanks to my achievements in Kaggle competitions, I was given the opportunity to join Upstage as a Challenges team intern when the company was founded. There, I mainly participated in Kaggle competitions with my team leader, who was a Kaggle Competition Grandmaster. During my 1 year of internship at Upstage, I became a Kaggle Grandmaster and had several opportunities to give presentations and interviews about my experience.

🥇 Oct~Nov: First Gold Medal - Kaggle Mechanisms of Action (MoA) Prediction

  • Timeline: 2020/09/04~12/01
  • Prize Money: $30,000
  • Host: Laboratory for Innovation Science at Harvard
  • Goal: Predict the MoA (Mechanism of Action) of drugs from biological activity data
  • Team: 2
  • Result: 🥇 Top 0.39% (17/4273)
  • Solution: Feedforward neural network with custom architecture / Label smoothing, input clipping, and input dropout to increase generalizability / Pseudo-label training for a last bit of performance boost / Ensemble

It was the first time I teamed up with my team leader. We had a hard time since the models' scores were inconsistent, but we tried everything we could and were able to get a decent position on the final leaderboard. It was my first gold medal in a Kaggle competition, and I remember that moment like it was yesterday. With this gold medal, I reached the Kaggle Competition Master tier.
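Here is a minimal sketch of the ingredients listed in the solution (label smoothing, input clipping, and input dropout); the layer sizes, clipping range, and smoothing value are illustrative placeholders, not our exact architecture.

```python
# Sketch: feedforward net with input dropout + input clipping, and smoothed BCE loss
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoANet(nn.Module):
    def __init__(self, n_features: int, n_targets: int, hidden: int = 1024,
                 input_dropout: float = 0.2):
        super().__init__()
        self.input_dropout = nn.Dropout(input_dropout)   # dropout applied to raw inputs
        self.net = nn.Sequential(
            nn.BatchNorm1d(n_features),
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.BatchNorm1d(hidden), nn.Dropout(0.3),
            nn.Linear(hidden, n_targets),
        )

    def forward(self, x):
        x = torch.clamp(x, -3.0, 3.0)                    # input clipping
        x = self.input_dropout(x)
        return self.net(x)

def smoothed_bce(logits, targets, smoothing: float = 0.001):
    """Label smoothing for multi-label BCE."""
    targets = targets * (1 - smoothing) + 0.5 * smoothing
    return F.binary_cross_entropy_with_logits(logits, targets)
```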

🥇 Nov~Dec: Kaggle Riiid Answer Correctness Prediction

  • Timeline: 2020/10/06~2021/01/08
  • Prize Money: $100,000
  • Host: Riiid AIEd Challenge
  • Goal: Predict whether a student will answer questions correctly, given the student's history
  • Team: 2
  • Result: 🥇 Top 0.21% (7/3395)
  • Solution: Custom neural network architecture combining a transformer encoder and an LSTM / Feature engineering / Ensemble / Making sure the data pipeline is identical at train and test time

It was the first time I used a big machine (4× RTX 3090 + 64-core Ryzen Threadripper), and I learned to utilize it efficiently. My team leader had competed in a similar competition before, and I learned his approach for dealing with tabular-sequential problems.
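A minimal sketch of the transformer-encoder + LSTM hybrid described in the solution follows; the embeddings and sizes are illustrative placeholders, and the real model also consumed the engineered features mentioned above.

```python
# Sketch: transformer encoder + LSTM over a student's interaction history
import torch
import torch.nn as nn

class EncoderLstm(nn.Module):
    def __init__(self, n_questions: int, d_model: int = 128, n_heads: int = 4,
                 n_layers: int = 2):
        super().__init__()
        self.q_emb = nn.Embedding(n_questions, d_model)
        self.ans_emb = nn.Embedding(3, d_model)        # previous answer: 0/1, 2 = padding
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.lstm = nn.LSTM(d_model, d_model, batch_first=True)
        self.head = nn.Linear(d_model, 1)

    def forward(self, question_ids, prev_answers):
        x = self.q_emb(question_ids) + self.ans_emb(prev_answers)
        # NOTE: in practice a causal attention mask is needed so that a step
        # cannot attend to future interactions
        x = self.encoder(x)               # global attention over the history
        x, _ = self.lstm(x)               # local sequential modeling on top
        return self.head(x).squeeze(-1)   # per-step correctness logit
```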

2021

🥈 Jan~Feb: Kaggle Rainforest Connection Species Audio Detection

  • Timeline: 2020/11/18~2021/02/18
  • Prize Money: $15,000
  • Host: Rainforest Connection
  • Goal: Predict which of 24 bird/frog species are present in a 10-second WAV file
  • Team: Solo
  • Result: 🥈 Top 2.80% (32/1143)
  • Solution: Frequency crop / Custom loss functions to maximize the metric / Iterative pseudo labeling / Audio augmentations / Positive GAP

I only had 2~3 weeks to compete, and it was the first time I got my hands on audio data. But like all Kaggle competitions, it was a good place to learn new concepts rapidly, and I was able to construct an audio classification pipeline that worked decently. I also got my hands dirty with PyTorch coding, implementing custom losses and models, which definitely improved my PyTorch skills. Time was tight, but I had fun implementing various ideas in this competition.
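As an illustration of the audio front end, here is a minimal sketch of a log-mel spectrogram with a frequency crop around the band where a target species vocalizes; the sample rate, FFT parameters, and crop indices are illustrative placeholders.

```python
# Sketch: mel-spectrogram + frequency crop for audio classification (illustrative values)
import torch
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=48000, n_fft=2048, hop_length=512, n_mels=256,
    f_min=40, f_max=15000,
)

def spec_with_freq_crop(waveform: torch.Tensor, mel_lo: int, mel_hi: int) -> torch.Tensor:
    """waveform: [channels, samples]; keep only mel bins mel_lo..mel_hi."""
    spec = mel(waveform)              # [channels, n_mels, time]
    spec = torch.log1p(spec)          # log-compress amplitudes
    return spec[:, mel_lo:mel_hi, :]  # frequency crop
```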

🥇 Feb~Mar: First Solo Gold Medal - Kaggle RANZCR CLiP - Catheter and Line Position Challenge

  • Timeline: 2020/12/15~2021/03/17
  • Prize Money: $50,000
  • Host: Royal Australian & NZ College of Radiologists
  • Goal: Classify the presence and correct placement of different types of catheters on chest X-rays
  • Team: Solo
  • Result: 🥇 Top 0.71% (11/1547)
  • Solution: Downconv to utilize high resolution / UNet pretraining to utilize the catheter position annotations / Pseudo labeling on an external dataset

It was my first solo gold medal in a Kaggle competition. Having participated in previous image competitions, I knew the baseline approach and had a code base for this task, so I could start a step ahead. I tried many different approaches, including contrastive self-supervised pretraining, which was hot at that time (and didn't help in my case), and found some successful methods that brought me high up the leaderboard. Upstage provided me with some spare machines, so I could use multiple computers for ensembling and pseudo labeling at the end stage of the competition. I had an opportunity to give a talk about my experience in this competition at a public presentation session hosted by my company (video link below).
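For the "downconv" idea in the solution, here is a rough sketch of how I'd summarize it: a small learnable strided-convolution stem placed in front of a pretrained backbone, so that high-resolution X-rays are downsampled by learned convolutions rather than by naive resizing. The backbone name and stem design are illustrative, not the exact module I used.

```python
# Sketch: learned downsampling stem in front of a pretrained backbone (illustrative)
import timm
import torch.nn as nn

class DownConvModel(nn.Module):
    def __init__(self, backbone: str = "resnet200d", num_classes: int = 11):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 3, kernel_size=3, stride=2, padding=1),  # learned 2x downsample
            nn.BatchNorm2d(3),
            nn.SiLU(),
        )
        self.backbone = timm.create_model(backbone, pretrained=True,
                                          num_classes=num_classes)

    def forward(self, x):        # x: high-resolution batch, e.g. 1024x1024
        x = self.stem(x)         # -> 512x512 before the pretrained network
        return self.backbone(x)
```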

๐Ÿ† Apr~May: 1st Place - Kaggle Shopee - Price Match Guarantee

  • Timeline: 2021/03/09~05/11
  • Prize Money: $30,000
  • Host: Shopee
  • Goal: Group identical products in an online shopping mall, based on each product's title and image
  • Team: 2
  • Result: FIRST PLACE 🏆 Top 0.04% (1/2426)
  • Solution: Finetuned pretrained image & text encoders with ArcFace and used cosine similarity as the metric / Combined image & text embeddings with concat & union methods / Iterative Neighborhood Blending (INB)

It was the first 1st-place finish in a Kaggle competition for both me and my team leader. We continuously implemented new ideas and slowly climbed the leaderboard whenever one of many succeeded. We were able to play with various ideas since we had a stable baseline pipeline for both image and text models beforehand. We reached 1st place one week before the deadline and kept it through the final leaderboard. I think it was our previous experience and our attitude of not giving up that brought us 1st place in this competition. We gave a talk about our experience in this competition (link below).
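To illustrate the ArcFace part of the solution, here is a minimal sketch of an ArcFace head used during finetuning; the scale and margin values are illustrative. At inference the head is discarded and products are matched by cosine similarity between the normalized embeddings.

```python
# Sketch: ArcFace margin head for metric-learning finetuning (illustrative values)
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceHead(nn.Module):
    def __init__(self, emb_dim: int, n_classes: int, scale: float = 30.0,
                 margin: float = 0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(n_classes, emb_dim))
        nn.init.xavier_uniform_(self.weight)
        self.scale, self.margin = scale, margin

    def forward(self, embeddings: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # cosine similarity between L2-normalized embeddings and class centers
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        target = F.one_hot(labels, cosine.size(1)).bool()
        # add the angular margin only to the target class, then scale
        logits = torch.where(target, torch.cos(theta + self.margin), cosine) * self.scale
        return F.cross_entropy(logits, labels)  # pulls same-product embeddings together
```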

🥇 Jul~Aug: Achieving Kaggle Competition Grandmaster - Kaggle CommonLit Readability Prize

  • Timeline: 2021/05/04~08/03
  • Prize Money: $60,000
  • Host: CommonLit
  • Goal: Rate the complexity of literary passages for grades 3-12 classroom use
  • Team: 2
  • Result: 🥇 Top 0.22% (8/3633)
  • Solution: Finetuned pretrained language transformers / Incorporated textstat and GPT-2 features at the transformer input / Stage-2 training with a GBM (residual stacking) / Automatic hyperparameter tuning / No dropout in transformers

Getting a gold medal in this competition, I reached a total of 5 gold medals (including 1 solo gold) and achieved Kaggle Competition Grandmaster! There were 233 Grandmasters globally and 5 in Korea at that time, and I became the youngest Korean Grandmaster. I didn't expect this when I started my journey with Kaggle, but thanks to my team leader and the wonderful computers the company provided, I got the results I had dreamed of. CommonLit was not an easy competition, since the variance between validation and test scores was large, and it was hard to come up with working ideas because pretrained transformers were already very powerful. We were in the silver medal zone on the public leaderboard, but got a gold medal on the private (final) leaderboard, thanks to intense ensembling that ensured stability across the data. We tried the PyTorch Lightning framework for this competition, which was itself challenging since we needed to change the code base, but it turned out quite useful as the number of GPUs and machines grew. I gave an in-person interview with MBN for achieving many gold medals in Kaggle competitions, and a personal interview with KED for achieving Kaggle Competition Grandmaster.
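As a small illustration of the "no dropout in transformers" trick mentioned in the solution, here is a sketch of how it can be done with Hugging Face Transformers; the model name is just an example.

```python
# Sketch: disable dropout inside a pretrained transformer for regression finetuning
from transformers import AutoConfig, AutoModelForSequenceClassification

config = AutoConfig.from_pretrained("roberta-large")   # example backbone
config.num_labels = 1                        # single regression target (readability score)
config.hidden_dropout_prob = 0.0             # turn off dropout in hidden layers
config.attention_probs_dropout_prob = 0.0    # turn off attention dropout

model = AutoModelForSequenceClassification.from_pretrained("roberta-large", config=config)
```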

๐Ÿ† Aug: 1st at 2021 SNU FastMRI Challenge

By coincidence, I spotted a poster in a school corridor announcing that the Department of Electrical and Computer Engineering at Seoul National University was hosting a competition. The task was to transform undersampled multi-coil k-space brain MRI data into an image visually similar to the fully sampled original. I didn't know anything about the theory behind MRI, but short lectures on the theory were provided that I could make use of. In about 10 days, I was able to make a winning solution, thanks to my prior experience in deep learning. However, my model was neither novel nor SOTA, and I learned that there is a lot of room for customizing the model and improving the data to enhance performance, which can only be achieved when you understand the theory well. I wasn't planning to pursue this field, so I left it at that.

🥇 Aug~Sep: Kaggle Optiver Realized Volatility Prediction

  • Timeline: 2021/06/29~2022/01/11
  • Prize Money: $100,000
  • Host: Optiver
  • Goal: From 10 minutes of stock order book and trade data, predict the price volatility for the next 10 minutes
  • Team: Solo
  • Result: 🥇 Top 0.05% (2/3852)
  • Solution: Generated rolling aggregation features from sequential data / Custom feedforward neural network ensemble

It was the last Kaggle competition I participated in as an Upstage intern. It was kind of a disappointing competition, since there was potential data leakage that all the top solutions had to use. Still, the task itself was interesting to me, since I was interested in predicting future markets with AI.
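For reference, the target quantity itself is simple to write down: realized volatility is the square root of the sum of squared log returns over a window. Here is a minimal sketch of that computation plus the kind of rolling aggregation feature the solution relied on; the column name and window sizes are illustrative assumptions.

```python
# Sketch: realized volatility and simple rolling-window volatility features
import numpy as np
import pandas as pd

def realized_volatility(prices: pd.Series) -> float:
    log_returns = np.log(prices).diff().dropna()
    return float(np.sqrt((log_returns ** 2).sum()))

def rolling_vol_features(book: pd.DataFrame, windows=(100, 300, 600)) -> dict:
    """book: order-book snapshots with a precomputed weighted-average-price column
    'wap' (an assumed column name)."""
    return {f"vol_last_{w}": realized_volatility(book["wap"].tail(w)) for w in windows}
```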

🧑‍🤝‍🧑 Oct: Joined Deepest

Deepest is a deep learning club at Seoul National University whose members range from undergraduate and graduate students to engineers at startups and firms. After finishing the internship at Upstage, I joined Deepest and met people who were passionate about deep learning.

📈 Nov~2022 Feb: Built Automatic Cryptocurrency Trading Bot

After finishing the internship, I decided to take a break from competitions and do some interesting projects. Then I thought, 'Why don't I revisit stock price prediction?'. I had failed to make it work a few years earlier, but since I had more experience now, I thought I might give it another go. Instead of the stock market, I chose the cryptocurrency market, since it had a better API. I had some modeling ideas, but constructing an automatic trading system was hard. There were seemingly minor details that turned out to matter a lot in real trading, and it took quite a lot of time to figure them out one by one. I kept the system running for about 2 months, saw profit initially, but lost more later. I found that the cryptocurrency market is very volatile and that I needed to be more conservative with my strategies. I also looked for other approaches I could take, and decided to study Reinforcement Learning.

2022

🤓 Feb~Mar: Introduction to Reinforcement Learning

I searched online for the best way to learn reinforcement learning, found the wonderful course by David Silver, and followed it. The concepts were confusing at first, but whenever I got confused, I could come back to the course or search online. After finishing the course, I had a basic theoretical background, but I didn't know how to implement RL algorithms in real projects such as cryptocurrency trading. So I took the Udacity RL nanodegree (Certificate, Project1 Code, Project2 Code, Project3 Code). I gained some experience implementing RL algorithms while coding the projects, and they were kind of fun, watching the agents improve over time. After completing the course, I tried to implement an RL agent for cryptocurrency trading, but failed. Debugging RL algorithms was HARD. I took my hands off RL cryptocurrency trading temporarily.

🏠 Apr~Jul: Revisiting Web Development & Building This Website

I always wanted to create my own project that uses AI and deploy it onto the web. The web is the platform where I can present what I create to people all around the world, so I revisited web development, which I had left a few years ago. This time, I took a Udemy course to get background knowledge of web development. Since the field changes rapidly, after finishing the Udemy course I searched for, chose, and took recent online courses on YouTube. At first, I learned HTML, CSS, and JavaScript. Then I learned React for the frontend, Node.js (Express) for the backend, and MongoDB for the database. I was able to make a full-stack website with this tech stack, but I had some difficulties, mainly with authentication and client-side state management. I found that Next.js suits my needs, since it lets developers choose between static site generation, client-side rendering, and server-side rendering. For authentication, I implemented it from scratch for some toy projects, but eventually moved to next-auth. For CSS, I tried Bootstrap, Material UI, CSS-in-JS, and others, but found that tailwindcss enhances the development experience. I was using mongoose as the ORM, but moved to prisma since it supports various databases and works well with Next.js. I took care of SEO via meta tags and a dynamic sitemap. Also, to decrease loading time, I utilized state, client caching, and CDN caching through the Vercel edge network. Finally, my tech stack for web development is as follows: TypeScript, Next.js, next-auth, prisma, tailwindcss, and Vercel.

When I was editing my LinkedIn profile, I thought I might use my web development skills to build a portfolio website where I could update my portfolio and write posts the way I liked. It would be even better if other people could use it too. So I created this website, called "Portfoly". It was fun to create the website and customize it the way I liked.

🎮 Jun~Jul: RL For Games - Kaggle Kore 2022

For a team project at Deepest, I chose a Kaggle RL competition and gathered members. Our team consisted of me, 4 master's students, and 1 machine learning research engineer. The competition runs as follows: competitors submit an agent that takes the current game state as input and outputs an action to take; submitted agents are then automatically and periodically played against each other to earn scores, and the agent with the highest score at the end of the competition wins. It seemed fun and a good opportunity to upgrade my RL skills, so I ambitiously started to train agents. However, the game setting was hard for RL training, since the action space was huge and incorporating all available information into the input space was not straightforward. We struggled to beat a strong rule-based agent, and in the last week, I managed to train an agent that beats it. It did not perform well against the top rule-based agents on the leaderboard, but I can definitely say that I learned many details of implementing and training RL agents through this project.

Sep~: Upstage

I joined Upstage as an AI Research Engineer on the Challenges team, where all of the members had extensive Kaggle experience. The Challenges team was a functional team targeting model performance improvements for various company products, and it also participated in competitions when relevant.

At first, I worked on improving the performance of the parser in the DRP (Detector, Recognizer, Parser) OCR pipeline. It was my first time working on OCR, but since the goal was to improve model performance, just as in Kaggling, I was able to help make the project successful.

2023

Jan~Feb: RecSys

I was interested in recommendation systems, where there are many potential paths to the goal. Our team was able to spend about 3 weeks participating in a Kaggle competition related to recsys and internalizing the lessons learned into our company's recsys product.

We participated in the OTTO – Multi-Objective Recommender System competition and achieved 16/2574 in 3 weeks. I personally had no experience in recsys and learned SOTA recsys techniques. I learned how to implement a high-performing two-stage system: candidate extraction and reranking. I also learned how to handle huge tabular datasets efficiently.

We internalized the solution learned from the competition into one of our recsys products.
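To make the two-stage idea concrete, here is a minimal sketch: generate candidates from co-visitation counts, then rerank them with a learning-to-rank GBM. The window size, candidate count, and features are illustrative simplifications of what an actual solution would use.

```python
# Sketch: co-visitation candidate generation + GBM reranking (illustrative)
from collections import Counter, defaultdict
import lightgbm as lgb

# Stage 1: candidate generation from co-visitation counts
def build_covisitation(sessions):
    """sessions: iterable of item-id lists; count items appearing near each other."""
    covis = defaultdict(Counter)
    for items in sessions:
        for i, a in enumerate(items):
            for b in items[max(0, i - 5): i + 5]:   # nearby items in the same session
                if a != b:
                    covis[a][b] += 1
    return covis

def candidates_for(session_items, covis, top_k: int = 50):
    scores = Counter()
    for item in session_items:
        scores.update(covis.get(item, {}))
    return [item for item, _ in scores.most_common(top_k)]

# Stage 2: rerank the candidates with a ranking GBM; features per
# (session, candidate) pair are engineered elsewhere
# ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=500)
# ranker.fit(X_train, y_train, group=group_sizes)
```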

Mar: ICDAR HierText

ICDAR (International Conference on Document Analysis and Recognition) is the biggest conference directly related to OCR. The OCR system was one of our company's main products, and I had a chance to participate in one of the competitions held by ICDAR with some coworkers. The ICDAR 2023 Competition on Hierarchical Text Detection and Recognition, held by Google Research, targets detecting and recognizing text while also extracting its hierarchy (word -> line -> paragraph).

It was my first time working on a detector system, but within 1 month, I was able to construct a 1st-place solution. To address hierarchical text detection, we implemented a two-step approach: first, we perform multi-class semantic segmentation where the classes are word, line, and paragraph regions; then, we use the predicted probability maps to extract and organize these entities hierarchically. Later, with company support, I attended the ICDAR 2023 conference in San Jose with coworkers.
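As an illustration of the second step, here is a minimal sketch of turning per-class probability maps into a hierarchy: binarize each map, extract connected components, and nest children (e.g. words) under the parent (e.g. a line) that covers most of their area. The thresholds are illustrative, and the real post-processing was considerably more involved.

```python
# Sketch: probability maps -> connected components -> containment-based hierarchy
import cv2
import numpy as np

def extract_regions(prob_map: np.ndarray, thresh: float = 0.5):
    """prob_map: [H, W] float map for one class -> list of boolean component masks."""
    binary = (prob_map > thresh).astype(np.uint8)
    n, labels = cv2.connectedComponents(binary)
    return [labels == i for i in range(1, n)]

def group_by_containment(children, parents, min_overlap: float = 0.5):
    """Assign each child mask to the parent mask covering most of its area."""
    grouping = {}
    for ci, child in enumerate(children):
        best, best_ov = None, 0.0
        for pi, parent in enumerate(parents):
            ov = (child & parent).sum() / max(child.sum(), 1)
            if ov > best_ov:
                best, best_ov = pi, ov
        grouping[ci] = best if best_ov >= min_overlap else None
    return grouping

# words = extract_regions(word_prob); lines = extract_regions(line_prob)
# word_to_line = group_by_containment(words, lines)
```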

Apr~May: ChatGPT & AutoGPT

ChatGPT had become very popular, and our team had a chance to investigate its applications. In particular, we customized AutoGPT: we added more functions the agent could use and integrated it into the Slack workspace the company was using.

It was a good opportunity to experience the potential capabilities of Large Language Models (LLMs), but we found many limitations of AutoGPT, such as cost, running time, and context length.

Jun~Aug: LLM TF

Then, 4 of our team members were assigned to an LLM task force to test the possibility of training a competitive LLM on our own. Our initial mission was to rank high on the Hugging Face Open LLM Leaderboard. We were not familiar with training LLMs, but like Kagglers do, we approached the task as if it were a Kaggle competition.

We started off by instruction-tuning LLaMA models: first a 7B model, then 30B, then 70B. Within a month and a half, we succeeded in training the best-performing model under 65B (News Article (Korean) / Model). Soon, we took first place on the Open LLM Leaderboard with an instruction-tuned LLaMA-70B (News Article (Korean) / Model).
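For readers curious what instruction tuning looks like mechanically, here is a minimal supervised fine-tuning sketch with Hugging Face Transformers. The base model name, prompt template, and hyperparameters are placeholders, not our actual recipe, and a padding data collator would be needed for real batching.

```python
# Sketch: supervised instruction tuning of a causal LM (illustrative placeholders)
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "meta-llama/Llama-2-7b-hf"        # example base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def format_example(ex):
    """Turn an {instruction, output} pair into tokenized LM training data."""
    text = f"### Instruction:\n{ex['instruction']}\n\n### Response:\n{ex['output']}"
    tokens = tokenizer(text, truncation=True, max_length=2048)
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

# train_dataset = raw_dataset.map(format_example)   # e.g. a datasets.Dataset
args = TrainingArguments(output_dir="sft-out", per_device_train_batch_size=1,
                         gradient_accumulation_steps=16, num_train_epochs=3,
                         learning_rate=2e-5, bf16=True)
# trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
# trainer.train()
```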

Sep~: Training SOLAR

For the company, 1st place on the Open LLM Leaderboard was a huge achievement, and an LLM team was newly organized. The Challenges team was renamed the Foundation Model team as part of the LLM Engine team, and our LLM was branded SOLAR. With my teammates, I continued to improve the performance of SOLAR while keeping it small and efficient. We started to pretrain and finetune the model with a huge number of GPUs. I gained experience in preprocessing and handling terabytes of text data and in training LLMs efficiently on large-scale multi-node hardware.

We also started to train models that are good at both Korean and English, to target Korean customers.

Dec: SOLAR 10.7B

We re-took 1st place on the Open LLM Leaderboard with a finetuned version of our own pretrained model, SOLAR-10.7B. This was the result of our team's growing capability in pretraining, not just finetuning.

2024

Jan: SOLAR with Longer Context

We extended the context length of SOLAR from 4k to 64k so that it can process and generate long texts. Handling long texts is crucial, especially when using LLMs in a service. With my teammate, I researched previous studies, constructed an evaluation pipeline, prepared data, trained the model, and assessed its performance against other models. Our model surpassed previous SOTA models and served as the foundation for many of our company's products.

© 2024 Yoonsoo.