Be your rock: anyone can build a large-scale recommendation system
2022-01-27 05:02:46 【ByteDance Technical Team】
What is personalized recommendation? In short, it is recommending to each user the items that user will like. Over the past ten years, as the mobile internet has grown rapidly, personalized recommendation has played an important role. Take the operation of a content product as an example: the user-growth team acquires new users through advertising and other channels to raise DAU; the product and engineering teams distribute content that interests users to increase retention and time spent; the monetization team distributes ads users may be interested in to improve revenue per unit of traffic; and that revenue funds further user growth, forming a positive cycle. Personalized recommendation technology runs through every link of this loop and has become the high-speed growth engine of many companies.
How is personalized recommendation done? Usually, a business first defines several optimization objectives (for example, video watch time, likes, and shares, or e-commerce clicks, add-to-carts, and purchases), then builds one or more models to predict those objectives, and finally fuses the predicted scores of the multiple objectives to produce the ranking. For a recommendation system, the core work is building accurate prediction models. Over the years, the industry's recommendation models have been evolving toward large scale, real time, and refinement. Large scale means large data and large models: training sets reach tens of billions or even trillions of samples, and a single model reaches the TB or even 10 TB+ level. Real time means features, models, and candidates are updated in real time. Refinement shows up in feature engineering, model structure, optimization methods, and many other aspects, with innovative ideas emerging constantly.
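The fusion step at the end of that pipeline can be sketched in miniature. This is a hedged illustration, not the platform's actual method: the objectives, weights, predicted scores, and the weighted-product fusion form are all made-up assumptions.

```python
# Minimal sketch of multi-objective score fusion for ranking.
# Objectives, weights, and predicted scores below are illustrative.

def fuse_scores(predictions, weights):
    """Combine per-objective predicted scores via a weighted product."""
    score = 1.0
    for objective, p in predictions.items():
        score *= p ** weights.get(objective, 1.0)
    return score

# Hypothetical per-candidate predictions from the objective models.
candidates = {
    "video_a": {"finish": 0.30, "like": 0.05, "share": 0.01},
    "video_b": {"finish": 0.25, "like": 0.09, "share": 0.02},
}
weights = {"finish": 1.0, "like": 0.5, "share": 0.25}

ranked = sorted(candidates,
                key=lambda c: fuse_scores(candidates[c], weights),
                reverse=True)
print(ranked)
```

A weighted sum is an equally common alternative; in practice the fusion form and its weights are tuned through A/B experiments.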
Implementing a large-scale recommendation system poses great engineering challenges. This article focuses on the part you care about most, the training and serving systems, and describes the challenges encountered in building them and what we have done about them. For any company, building such a system from zero is far from easy, and the investment is very large. Inside ByteDance, after years of exploration and accumulation, thousands of engineers continuously iterate on and optimize the recommendation system. So, what problems will you run into when building one? Let's start with a story:
The story of Company A
Company A is an e-commerce company. Its product has 3 million DAU and a 10-person algorithm team. They are building a recommendation system and have run into plenty of trouble. Let's look at the details.
Company A wants to train a click-through-rate model. They see about 100 million impressions and 1 million clicks per day, and they want to train on 3 months of data, which comes to roughly 9 billion samples. They designed 200 features, including user ID, item ID, and the user's click sequence, and want to assign each feature value a 16-dimensional embedding vector, which puts the model size at roughly 500 GB. After analysis, they concluded that they need distributed training and distributed model storage, so they surveyed some open-source options:
- TensorFlow: Google's open-source machine learning system. Its Partitioned Variables can store embeddings in a distributed fashion, enabling large-scale training. But because the table size is fixed, there is a risk of hash collisions.
- PyTorch: Facebook's open-source machine learning system. It synchronizes parameters with Ring All-Reduce, which requires each machine to hold all the parameters, making it hard to train very large models.
- XDL: an open-source machine learning system from China, with a self-developed PS, TensorFlow as the training engine, and some ready-to-use recommendation models. Functionally it can handle large-scale training, but its open-source support is weak, so using it in production is risky.
- Angel: an open-source machine learning system from China. Its distinguishing feature is tight integration with the big-data system Spark: data preprocessing and feature engineering are done in Spark, while a self-developed Parameter Server with PyTorch embedded as the training engine can train very large models. However, it is hard to keep Angel's online and offline features consistent, so it is only suitable as an offline training platform.
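To illustrate the hash-collision risk noted for TensorFlow's fixed-size tables above, here is a minimal sketch; the table size and ID range are made-up values:

```python
# Why a fixed-size hashed table risks collisions: with far more IDs
# than rows, distinct IDs are forced onto the same row (pigeonhole),
# so unrelated feature values end up sharing one embedding vector.
TABLE_SIZE = 1_000           # fixed number of rows (made-up value)

def row_of(feature_id: int) -> int:
    return hash(feature_id) % TABLE_SIZE

occupied = {row_of(i) for i in range(100_000)}  # 100k IDs, 1k rows
print(len(occupied))         # can never exceed TABLE_SIZE
```

A collision-free alternative is a true hash table keyed by the raw ID, which is essentially what the self-developed parameter servers discussed later provide.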
After comparison, Company A chose TensorFlow for distributed training. However, they found training to be very slow: even with a lot of resources, it still took 5 days to get through 3 months of data. They spent a great deal of time studying TensorFlow and profiling the training process, and found several problems:
- TensorFlow's distributed runtime performs poorly: it creates a separate pair of send/recv ops per feature to connect a worker to the PS, so a single worker alone generates 200 send/recv pairs with the PS. This makes scheduling hard for the TensorFlow runtime and slows down distributed training.
- CPU utilization during training is very unstable; the CPUs appear underused.
- Some operators are particularly slow, presumably due to memory bandwidth limits.
- Although network bandwidth is not saturated, adding more machines no longer improves training speed.
- Browsing the TensorFlow website, they found that TF has recently introduced various distribution strategies, each corresponding to a different training-cluster topology. They were confused and did not know which one to choose.
Although they found many performance problems, optimizing them was not easy. After a period of hard work, they fixed some of the issues and compressed the training time from 5 days to 3, which was barely acceptable. However, 40 hours into a run, the training job died because one machine went OOM. They retried a few times and found the training success rate quite low. Analysis pointed to two main causes:
- TF builds the cluster from a static topology configuration and does not support dynamic membership. That means when a ps or worker hangs and restarts with a different IP or port (for example, after a machine crash), training cannot continue.
- TF's checkpoint contains only the parameters stored on the PS, not the state of the workers. It is not a globally consistent checkpoint, so exactly-once semantics cannot be achieved.
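The static-topology problem can be seen in the cluster spec itself. Below is a sketch of the kind of static configuration TensorFlow's distributed runtime reads from the TF_CONFIG environment variable; the hostnames and ports are made up for illustration:

```python
import json
import os

# Static cluster spec of the kind TensorFlow's distributed runtime
# reads from the TF_CONFIG environment variable. Hostnames and ports
# here are made up for illustration.
tf_config = {
    "cluster": {
        "ps":     ["ps0.example.com:2222", "ps1.example.com:2222"],
        "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
    },
    "task": {"type": "worker", "index": 0},
}
os.environ["TF_CONFIG"] = json.dumps(tf_config)

# If ps1 crashes and comes back on a different host or port, every
# process still holds this stale address list, so the job cannot
# resume without rewriting the spec and restarting everything.
print(json.loads(os.environ["TF_CONFIG"])["cluster"]["ps"])
```

Dynamic membership, as Monolith does later via service discovery, avoids baking addresses into the job in this way.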
Building good fault tolerance is no small challenge. All they could do was isolate a dedicated cluster to keep training as stable as possible. Since it could not be shared with other jobs, resource utilization was naturally much lower.
After several twists and turns, they barely managed to train a 500 GB model. They wanted to push it online for serving, so they turned to the design of the online system. After discussion, they concluded that the serving system must meet the following requirements:
- Distributed: recommendation models are dominated by embeddings and easily reach the TB level. Considering future model iterations, distributed serving is a must.
- Low latency: the latency of a single prediction should be as low as possible; for the fine-ranking model it should generally stay within 80 ms. Complex deep models may require GPU serving plus a series of performance optimizations.
- High availability: a few nodes going down must not affect online stability. This is usually solved with multiple replicas and requires support from the scheduling system.
- Low jitter: operations such as model updates, bringing models online, and taking them offline must not cause latency jitter.
- A/B testing: recommendation systems iterate quickly, algorithm engineers run many A/B experiments, and the traffic of experiment groups is adjusted dynamically, so the online system must support dynamic scheduling of models and services.
At present, no open-source system meets all of the requirements above; the major companies all build their own, at considerable cost. Company A has limited staff and little experience, so it could only apply model compression until a single machine could serve the model, which means the model cannot be very complex.
After the model went online, Company A ran into a new problem: how to update it. Regular full retraining is very expensive, and even more so when multiple A/B-test models are online at once. So updates need to be at least daily and ideally real time, but incremental and real-time updates are not easy to implement. In fact, Company A still faces many more problems: how to keep online and offline features consistent; what to do when the upstream data stream is unstable; how to handle the model's unbounded growth; how to train on mixed data from multiple scenarios; how to deal with very large candidate sets; how to handle long delays in conversion events; and so on.
From Company A's story, you can see that developing a large-scale recommendation system is genuinely difficult and expensive. So, is there a product that covers the whole pipeline, from data validation, feature engineering, and model development to online serving and A/B testing, so a business can easily build a first-class recommendation system and never hit Company A's headaches? Yes.
Since establishing Volcano Engine, we have been working to open ByteDance's recommendation technology to external customers. Today, Volcano Engine's intelligent recommendation platform can help you solve these difficulties and pain points. The platform has also opened a number of free slots for enterprises; details are at the end of this article.
Next, let us introduce the large-scale recommendation training and serving solution in the intelligent recommendation platform. We call it Monolith (as in a rock), and we hope it can become the solid foundation on which everyone builds their recommendation systems. The architecture diagram is below:
As the diagram shows, Monolith is a Parameter Server (PS) architecture. Let's walk through how this architecture works.
Batch / Incremental training
- When a Worker or PS starts, it registers with ZooKeeper (ZK), including its (server_type, index). The Worker then requests the registration information from ZK and builds the cluster information, achieving dynamic membership, which is the foundation of fault tolerance.
- During training, the Worker reads data from standard input or from files, pulls parameters from the PS, runs the forward/backward computation to obtain gradients, and pushes them to the PS.
- After the PS receives the gradients, it updates its internal weights with the optimizer and, at the same time, records which entries were updated. A TF Session on the PS sends the updated parameters to the Online PS, achieving real-time incremental updates. In addition, feature filtering, feature eviction, and similar mechanisms run on the PS.
- During or at the end of training, checkpoints can be written. To speed this up, Monolith does not reuse TF's saveable but instead uses an estimator saving listener with streaming, multi-threaded reads and writes, which greatly improves performance. To reduce checkpoint size, expired features are evicted.
- Loading the saved_model: the Entry is essentially TF Serving. It loads the non-embedding part from HDFS and registers with ZK so the layer above can do load balancing. The Online PS likewise registers with ZK first and then loads parameters from HDFS; during loading it strips the optimizer's auxiliary parameters, converts fp32 to fp16, applies quantization compression, and so on.
- For each request, the Entry randomly picks a group of Online PS instances, fetches the embeddings from them, and completes the prediction. Entry and Online PS are both multi-replica; as long as one replica survives, the service is available. The Online PS is multi-shard, so it can serve very large models. Multiple shards can be deployed on one machine, and Entry and Online PS can also be co-located.
- For systems with high demands on model freshness, the Training PS communicates with the Online PS directly via RPC, shortening the interval between a sample arriving and its feedback reaching the online model to the order of minutes.
- The Online PS can communicate with the Training PS and accept its parameter updates; the Entry can automatically read updated parameters from HDFS, achieving minute-level incremental parameter updates.
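The pull/compute/push cycle in the steps above can be sketched in miniature. Everything here is a toy stand-in: a single-process in-memory "PS", plain SGD, and a linear model, rather than Monolith's actual distributed implementation.

```python
# Toy, single-process sketch of the Worker/PS pull-compute-push cycle.
# A real setup is distributed over RPC; the in-memory "PS", plain SGD,
# and linear model here are illustrative stand-ins.

class ParameterServer:
    def __init__(self, lr=0.1):
        self.weights = {}        # sparse parameter: key -> scalar weight
        self.lr = lr
        self.touched = set()     # keys updated since last sync (so only
                                 # changed entries need pushing online)

    def pull(self, keys):
        return {k: self.weights.setdefault(k, 0.0) for k in keys}

    def push(self, grads):
        for k, g in grads.items():
            self.weights[k] -= self.lr * g   # optimizer step on the PS
            self.touched.add(k)

ps = ParameterServer()

def worker_step(features, label):
    w = ps.pull(features)                    # pull parameters from the PS
    pred = sum(w[k] for k in features)       # forward pass (linear model)
    err = pred - label                       # gradient of squared error
    ps.push({k: err for k in features})      # push gradients back

for _ in range(200):
    worker_step(["user:1", "item:7"], 1.0)
print(round(sum(ps.weights.values()), 3))    # converges toward the label
```

The `touched` set mirrors the "record which entries were updated" step: it is what makes minute-level incremental pushes to the Online PS cheap, since only changed keys need to travel.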
In summary, Monolith covers training, serving, parameter synchronization, and more; it is a complete system.
Compared with other systems in the industry, Monolith has met many of these challenges head-on and has the following characteristics:
Solving the TensorFlow PS communication bottleneck
In industrial-grade recommendation models, we often use hundreds or even thousands of feature types, and each feature type needs a hash table to store its embeddings. Directly creating one hash table per feature type and looking up hundreds of tables at once leads to two problems:
- The connections between PS and Workers produce too many send/recv ops, which greatly hurts the runtime efficiency of the distributed system.
- These ops bloat the model graph with too many nodes, making the graph too large and training initialization too slow.
To address this, we optimized at the framework level: hash tables with identical configurations (same dim, same optimizer parameters) are merged at the Python API level to reduce the number of tables, and Monolith further consolidates the communication ops. This greatly reduces the send/recv ops and resolves native TensorFlow PS's communication problem.
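The merging idea can be sketched as follows. It assumes, hypothetically, that keys are disambiguated by folding the feature slot into the key; the class and method names are illustrative, not Monolith's actual API.

```python
# Sketch of merging isomorphic embedding tables: feature slots with the
# same dim and optimizer settings share one table, with the slot folded
# into the key so the same raw ID stays distinct across slots.
from collections import defaultdict

class MergedHashTable:
    def __init__(self, dim):
        self.dim = dim
        self.table = defaultdict(lambda: [0.0] * dim)

    def lookup(self, slot: str, feature_id: int):
        # The (slot, id) pair disambiguates IDs across feature slots.
        return self.table[(slot, feature_id)]

merged = MergedHashTable(dim=16)
vec_user = merged.lookup("user_id", 42)
vec_item = merged.lookup("item_id", 42)   # same raw ID, different row
print(len(merged.table))                  # 2 entries, 1 shared table
```

With, say, 200 identically configured slots sharing one table, the framework only needs one consolidated pair of communication ops instead of 200.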
For asynchronous training, Monolith prefetches variables and embeddings and applies gradient updates asynchronously. For most models this makes more effective use of bandwidth and CPU, improving training speed and resource utilization.
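The effect of prefetching can be sketched with a background thread that pulls the embeddings for the next batch while the current batch is being computed; the sleeps below stand in for a real PS round-trip and a real forward/backward pass.

```python
# Sketch of embedding prefetch: pull embeddings for batch N+1 in the
# background while computing on batch N, overlapping network IO with
# computation. Sleeps are stand-ins for real IO and compute time.
from concurrent.futures import ThreadPoolExecutor
import time

def pull_embeddings(batch_id):
    time.sleep(0.05)                 # pretend PS network round-trip
    return f"embeddings-{batch_id}"

def train_step(embeddings):
    time.sleep(0.05)                 # pretend forward/backward compute

executor = ThreadPoolExecutor(max_workers=1)
future = executor.submit(pull_embeddings, 0)             # warm up batch 0
start = time.monotonic()
for batch_id in range(1, 6):
    emb = future.result()                                # prefetched batch
    future = executor.submit(pull_embeddings, batch_id)  # prefetch next
    train_step(emb)                                      # overlaps with IO
elapsed = time.monotonic() - start
executor.shutdown()
# Fully serial pull+compute would take about 0.5 s here; the overlap
# brings it close to 0.3 s.
print(f"{elapsed:.2f}s")
```

Monolith applies the same idea on the gradient side ("post-push"), so neither pulls nor pushes sit on the critical path of the compute graph.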
All-round fault tolerance
Built on service discovery, both Workers and PS nodes can recover quickly after a failure. For Workers: Monolith's workers do not communicate with each other directly, so one worker's failure does not affect the others; in addition, each worker's input progress is persisted, so it is not lost when the worker fails unexpectedly. When a PS shard node fails, Monolith supports partial-recovery and full-recovery modes depending on whether the task is offline or online, making a deliberate trade-off between correctness and recovery speed.
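Persisting a worker's input progress can be sketched as follows; the file path, format, and helper names are illustrative assumptions, not Monolith's implementation.

```python
# Sketch of persisting a worker's input progress so an unexpected
# failure does not lose it: record the last processed offset, and on
# restart resume from there. Path and format are illustrative.
import json
import os
import tempfile

progress_file = os.path.join(tempfile.gettempdir(), "worker0_progress.json")

def save_progress(offset):
    # Write-then-rename so a crash mid-write cannot corrupt the record.
    tmp = progress_file + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"offset": offset}, f)
    os.replace(tmp, progress_file)

def load_progress():
    if not os.path.exists(progress_file):
        return 0
    with open(progress_file) as f:
        return json.load(f)["offset"]

start = load_progress()
for offset in range(start, start + 100):
    pass  # ... consume the sample at `offset` ...
save_progress(start + 100)
print(load_progress())
```

In a real cluster the progress record would live in durable shared storage rather than a local temp file, so a replacement worker on another machine can pick it up.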
Monolith fills the gap left by open-source software in distributed serving, providing an inference service for TB-scale models. It supports multiple replicas and high availability, and while the Training PS is training, it synchronizes freshly updated embeddings to the Serving PS at minute granularity, achieving near-real-time parameter updates that improve the product's recommendation quality.
Beyond the solution to the TensorFlow PS communication bottleneck described above, Monolith has also been carefully optimized in the Parameter Server framework, the underlying hash table design, network transmission, multi-threaded acceleration, op fusion, instruction-set acceleration, and other directions, with considerable performance gains. Taking asynchronous training as an example, the optimizations include:
- Network communication optimization: embedding prefetch and gradient post-push make network IO asynchronous with the graph's forward/backward computation; separation of control flow and data flow, compressed transmission, and other optimizations are also supported.
- Memory optimization: feature filtering, feature compression, feature eviction, and similar means can greatly reduce memory usage in both the training and serving stages.
- Computation optimization: hot-spot code is optimized with the AVX instruction set, time-consuming ops are fine-tuned, and manual op fusion and other means accelerate the forward/backward computation.
- Other aspects: multi-threading optimization, fine-grained lock design, asynchrony between IO and computation, and so on.
At present, Monolith, delivered through the recommendation platform, has been applied successfully in e-commerce, community, video, and many other industry scenarios, and its quality, stability, and performance have been fully validated. Going forward, we will keep iterating rapidly and continuously improve the user experience and platform features.
Thank you for reading this far. ByteDance's intelligent recommendation platform is now open to business partners through Volcano Engine. If your business wants to apply recommendation algorithms to drive growth but building a recommendation system gives you a headache, try the Volcano Engine intelligent recommendation platform. For more details, click the portal link:
It is worth mentioning that the intelligent recommendation platform is currently offering 30 slots for business partners to use for free, with the free period running until November 30, 2021. If you want to claim a slot, please scan the QR code below to sign up as soon as possible:
Finally
Lastly, a word about us: we are the Volcano Engine intelligent recommendation team, committed to enabling enterprises around the world to have a first-class recommendation system. We warmly welcome people working on machine learning systems, recommendation architecture, and recommendation algorithms to join us. Locations: Beijing, Shenzhen, Hangzhou, Singapore. Send your resume to [email protected] with the subject line: Name - Years of experience - Volcano Engine intelligent recommendation - Target position. We look forward to working with you!