
Be a rock for you: everyone can build a large-scale recommendation system

2022-01-27 05:02:46 ByteDance Technical Team


What is personalized recommendation? In short, it is recommending to each user the items they like. Over the past decade, as the mobile internet has developed rapidly, personalized recommendation has come to play an important role. Take the operation of a content product as an example: the user-growth team acquires new users through advertising and other channels to drive DAU; the product and engineering teams distribute content that users find interesting to increase retention and time spent; the monetization team distributes advertisements users may be interested in to improve revenue per unit of traffic; and that revenue funds further user growth, forming a positive cycle. Personalized recommendation technology runs through every link of this cycle and has become the high-speed growth engine of many companies.

How is personalized recommendation done? Usually, a business first defines multiple optimization objectives (for example, watch time, likes, and shares for video; clicks, add-to-cart, and purchases for e-commerce), then builds one or more models to predict these objectives, and finally combines the predicted scores of the objectives to produce the ranking. The core work of a recommendation system is building accurate prediction models. Over the years, the industry's recommendation models have been evolving toward larger scale, real-time updates, and greater refinement. Large scale means large data and model sizes: training sets reach tens of billions or even trillions of samples, and a single model can reach 1 TB or even more than 10 TB. Real time means that features, models, and candidates are updated in real time. Refinement shows up in feature engineering, model structure, and optimization methods, where innovative ideas emerge one after another.

Implementing a large-scale recommendation system poses great engineering challenges. This article focuses on the part you probably care about most, the training and serving systems, and introduces the challenges encountered in building them and what we have done. For any company, building such a system from scratch is by no means easy, and the investment is very large. Inside ByteDance, after years of exploration and accumulation, thousands of engineers continuously iterate on and optimize the recommendation system. So, what problems will you encounter when building a recommendation system? Let's start with a story:

The story of Company A

Company A is an e-commerce company. Its product has 3 million DAU and a 10-person algorithm team. They are building a recommendation system and have run into a lot of trouble along the way. Let's look at the details.

Company A wants to train a click-through-rate model. With 100 million impressions and 1 million clicks per day, they want to train on 3 months of data, for a sample size of 9 billion. They designed 200 features, including user ID, item ID, the user's click sequence, and so on, and want to assign each feature value a 16-dimensional embedding vector, which puts the model size at roughly 500 GB. After analysis, they concluded that distributed training and distributed model storage were necessary, so they investigated some open-source options:
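As a back-of-envelope check of the numbers above, the sketch below reproduces the sample count and a model size in the stated ballpark. The distinct-ID count and the assumption of fp32 weights plus two Adam slots are hypothetical figures chosen for illustration, not numbers from Company A:

```python
# Back-of-envelope estimate of Company A's training-set and model size.
impressions_per_day = 100_000_000          # 1e8 impressions per day
days = 90                                  # 3 months of data
samples = impressions_per_day * days       # 9 billion training samples

num_ids = 2_500_000_000                    # assumed distinct feature values across 200 features
dim = 16                                   # embedding dimension
bytes_per_param = 4 + 8                    # fp32 weight + two fp32 Adam slots (assumption)
model_bytes = num_ids * dim * bytes_per_param

print(samples)                             # 9000000000
print(model_bytes / 1e9)                   # 480.0, i.e. roughly 500 GB
```

With a few billion distinct IDs, the embedding tables plus optimizer state land in the hundreds of gigabytes, far beyond one machine's memory, which is why distributed storage is unavoidable.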

  • TensorFlow: Google's open-source machine learning system. Partitioned variables can be used to store embeddings in a distributed fashion, enabling large-scale training. However, because the table size is fixed, there is a risk of hash collisions.
  • PyTorch: Facebook's open-source machine learning system. It uses ring all-reduce to synchronize parameters, which requires a single machine to hold all parameters, making it hard to train very large models.
  • XDL: a machine learning system open-sourced in China, with a self-developed parameter server (PS), TensorFlow as the training engine, and some out-of-the-box recommendation models. Functionally it supports large-scale training, but its open-source community support is weak, making it risky to use in production.
  • Angel: a machine learning system open-sourced in China, whose distinguishing feature is close integration with the big-data system Spark: Spark handles data preprocessing and feature engineering, while a self-developed parameter server embeds PyTorch as the training engine, so it can train very large models. However, Angel makes it hard to guarantee consistency between online and offline features, so it is only suitable as an offline training platform.

After comparison, Company A chose TensorFlow for distributed training. However, when training the model they found it was very slow: even with a large amount of resources, it still took 5 days to get through 3 months of data. They spent a lot of time studying TensorFlow and profiling the training process, and found several problems:

  • TensorFlow's distributed runtime performs poorly: it creates a separate pair of send/recv ops for each feature to connect a worker and a PS, so a single worker produces 200 send/recv pairs with each PS. This makes scheduling difficult for the TensorFlow runtime and slows down distributed training.
  • CPU usage during training is very unstable; the CPU does not appear to be fully utilized.
  • Some operators are particularly slow, presumably related to memory bandwidth.
  • Although network bandwidth is not saturated, adding more machines no longer improves training speed.
  • Browsing the TF website, they found that TF has recently introduced various distribution strategies, corresponding to different training-cluster topologies. They were confused and did not know which one to choose.

Although they found many performance problems, optimizing them was not easy. After a period of hard work, they fixed some of them and compressed the training time from 5 days to 3, which was barely acceptable. However, at the 40th hour of training, the job died because one machine ran out of memory (OOM). They tried a few more times and found the training success rate was quite low. Analysis revealed the main reasons:

  • TF builds its cluster from a static topology configuration and does not support dynamic membership. That means when a PS or worker restarts after hanging, training cannot continue if its IP or port changes (for example, after a machine crash).
  • TF's checkpoint contains only the parameters stored on the PS, not the state of the workers. It is not a globally consistent checkpoint, so exactly-once semantics cannot be achieved.

The challenge of good fault tolerance is not small. All they could do was isolate a separate cluster to keep training as stable as possible. Since it could not be co-located with other tasks, resource utilization was naturally much lower.

After several twists and turns, they barely managed to train a 500 GB model. They then wanted to push the model online for serving, so they turned to the design of the online system. After discussion, they decided the serving system had to meet the following requirements:

  • Distributed: recommendation models are dominated by large embedding tables, so models easily reach the TB level. Considering future model iterations, distributed serving must be supported.
  • Low latency: the latency of a single prediction should be as low as possible; the fine-ranking model should generally be kept within 80 ms. Complex deep models may need to be served on GPUs, with a series of performance optimizations.
  • High availability: a small number of failed nodes must not affect online stability. This is usually solved with multiple replicas and requires support from the scheduling system.
  • Low jitter: operations such as model updates and bringing services online or offline must not cause latency jitter.
  • A/B testing: recommendation systems iterate quickly, and algorithm engineers run many A/B experiments. Experiment-group traffic is adjusted dynamically, so the online system needs to support dynamic scheduling of models and services.

At present, no open-source system meets all of the above requirements; major companies build their own, and the actual investment is substantial. With limited manpower and little experience, Company A could only compress the model by various means so that a single machine could serve it, which meant the model could not be too complex.

After the model went online, Company A encountered a new problem: how to update it. Regular full retraining is very expensive, and even more so if several A/B-test models are online at the same time. So they needed at least daily incremental updates, and real-time updates would be even better. But incremental and real-time updates are not easy to implement. In fact, many more problems await Company A, such as: how to ensure consistency between online and offline features; what to do when the upstream data stream is unstable; how to handle the ever-growing model; how to train on data from many scenarios together; how to deal with very large candidate sets; and how to handle the long delays of conversion events.

Our work

Company A's story shows that developing a large-scale recommendation system is genuinely difficult and costly. So, is there a product that directly covers the whole pipeline of data validation, feature engineering, model development, online serving, A/B testing, and more, letting a business easily build a first-class recommendation system without running into Company A's headaches? Yes, there is.

Since launching Volcano Engine, we have been working to open ByteDance's recommendation technology to external customers. Now, Volcano Engine's intelligent recommendation platform can help you solve these difficulties and pain points. The platform has also opened a number of slots for enterprises to use for free; details are at the end of this article.

Next, let us introduce the platform's large-scale recommendation training and serving solution. We call it Monolith ("rock"), in the hope that it can become a solid foundation for everyone building a recommendation system. Here is the architecture diagram:

As the diagram shows, Monolith is a PS (parameter server) architecture. Let's take a look at how it works:

Batch / incremental training

  • When a worker or PS starts, it registers with ZooKeeper (ZK), including its (server_type, index). The worker then requests the registration information from ZK and generates the cluster configuration, achieving dynamic membership, which is the basis of fault tolerance.
  • Once training starts, a worker reads data from standard input or from files, pulls parameters from the PS, performs the forward/backward computation to obtain gradients, and pushes them to the PS.
  • After the PS receives the gradients, it updates its internal weights with the optimizer and records which parameters were updated. A TF session on the PS sends the updated parameters to the Online PS, achieving real-time incremental updates. In addition, feature filtering, feature eviction, and so on also take place on the PS.
  • Checkpoints can be written during or at the end of training. To speed up checkpointing, Monolith does not reuse TF's saveable; instead it uses an estimator saving listener with streaming, multi-threaded reads and writes, which greatly improves performance. To reduce checkpoint size, expired features are evicted.
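The pull/compute/push cycle in the steps above can be sketched with a toy in-process parameter server. This is a minimal illustration of the data flow only; the class, its plain-SGD update, and the lazy table creation are assumptions for the sketch, not Monolith's actual implementation:

```python
import numpy as np

class ToyPS:
    """A minimal in-process stand-in for one parameter-server shard."""
    def __init__(self, dim=16, lr=0.01):
        self.tables = {}                  # feature id -> embedding vector
        self.dim, self.lr = dim, lr

    def pull(self, ids):
        # Lazily create embeddings for unseen ids, then return a batch.
        return np.stack([self.tables.setdefault(i, np.zeros(self.dim))
                         for i in ids])

    def push(self, ids, grads):
        # Plain SGD here; a real PS would apply Adam/Adagrad and record
        # which ids changed so they can be synced to the Online PS.
        for i, g in zip(ids, grads):
            self.tables[i] = self.tables[i] - self.lr * g

ps = ToyPS()
ids = [7, 42, 7]                          # one training batch's feature ids
emb = ps.pull(ids)                        # forward pass fetches parameters
grads = np.ones_like(emb)                 # pretend backward produced these
ps.push(ids, grads)                       # gradients flow back to the PS
```

In the real system the pull and push cross the network, which is exactly why the number of send/recv ops and the overlap of IO with compute matter so much.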

Online inference

  • Load the saved_model. The Entry is essentially TF Serving: it loads the non-embedding part from HDFS and registers with ZK so the upper layer can do load balancing. The Online PS likewise registers with ZK first and then loads parameters from HDFS, removing the optimizer's auxiliary parameters and converting fp32 to fp16 during loading, with quantization, compression, and so on applied.
  • For a request, the Entry randomly selects a group of Online PS shards, fetches the embeddings from them, and completes the prediction. Both Entry and Online PS are replicated; as long as one replica survives, the service is available. The Online PS is sharded, so it can serve very large models. Multiple shards can be deployed on one machine, and Entry and Online PS can also be co-located.
  • For models with high real-time requirements, the Training PS communicates directly with the Online PS via RPC, shortening the interval between a sample arriving and its feedback reaching the online model to the order of minutes.
  • The Online PS can communicate with the Training PS and accept its parameter updates; the Entry can automatically read updated parameters from HDFS, achieving minute-level incremental parameter updates.
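The load-time slimming mentioned in the first step (drop optimizer slots, cast fp32 to fp16) can be sketched as below. The checkpoint layout and the "/Adam" slot-naming convention are hypothetical placeholders, not Monolith's actual format:

```python
import numpy as np

def load_for_serving(checkpoint):
    """Sketch of Online PS load-time slimming: drop the optimizer's
    auxiliary slots and cast embedding weights to fp16."""
    serving = {}
    for name, value in checkpoint.items():
        if "/Adam" in name:                       # auxiliary optimizer state
            continue                              # not needed for inference
        serving[name] = value.astype(np.float16)  # halves the memory footprint
    return serving

# A hypothetical checkpoint: one embedding table plus its Adam slots.
ckpt = {
    "user_id/embedding": np.zeros((4, 16), np.float32),
    "user_id/embedding/Adam_m": np.zeros((4, 16), np.float32),
    "user_id/embedding/Adam_v": np.zeros((4, 16), np.float32),
}
slim = load_for_serving(ckpt)                     # only the fp16 weights remain
```

Dropping the two Adam slots and halving the weight precision cuts the serving footprint to roughly one sixth of the training-time parameter state, which is what makes TB-scale training models servable.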

In summary, Monolith comprises training, serving, parameter synchronization, and more; it is a complete system.

Compared with other systems in the industry, Monolith has successfully met many challenges and has the following characteristics:

Solving the TensorFlow PS communication bottleneck

Industrial-grade recommendation models often use hundreds or even thousands of features, and each type of feature needs a hash table to store its embeddings. Directly creating one hash table per feature and looking up hundreds of tables simultaneously leads to two problems:

  1. The connections between PS and workers produce too many send/recv ops, greatly hurting the runtime efficiency of the distributed system.
  2. These ops bloat the model graph with too many nodes, making the graph too large and training initialization too slow.

To address these problems, we optimized at the framework level: for hash tables with identical configuration (same dimension, same optimizer parameters), Monolith merges them at the Python API level to reduce the table count, and further consolidates the communication ops, greatly reducing the number of send/recv ops and solving native TensorFlow's PS communication problem.
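One common way to realize such a merge is to give identically configured features a slot index and pack it into the lookup key, so all of them share one physical table and one batched lookup. The feature names and bit layout below are hypothetical; this is a sketch of the idea, not Monolith's actual key scheme:

```python
# Features with identical config (same dim, same optimizer) share one
# physical hash table; a slot index disambiguates their raw ids.
features = ["user_id", "item_id", "click_seq"]    # assume all dim=16, same optimizer
SLOT_BITS = 8                                     # supports up to 256 merged features

def merged_key(slot_index, raw_id):
    # Pack (feature slot, raw id) into a single key for the shared table.
    return (raw_id << SLOT_BITS) | slot_index

sample = {"user_id": 12345, "item_id": 678, "click_seq": 42}
# One batched lookup against one table instead of three separate tables:
keys = [merged_key(i, sample[f]) for i, f in enumerate(features)]
```

With the merge, a worker issues one pull/push per shared table rather than one per feature, which is what collapses the hundreds of send/recv pairs per worker down to a handful.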

For asynchronous training, Monolith prefetches variables and embeddings and applies gradient updates asynchronously. For most models, this makes more effective use of bandwidth and CPU, improving training speed and resource utilization.
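The prefetching idea can be sketched with a background thread that pulls the embeddings for upcoming batches while the trainer works on the current one. This thread-and-queue structure is an illustrative assumption, not Monolith's internal mechanism:

```python
import queue
import threading

def prefetch(batches, pull_fn, depth=2):
    """Overlap PS pulls with compute: a background thread fetches
    embeddings for future batches while the caller consumes the
    current one (a minimal sketch of embedding prefetch)."""
    q = queue.Queue(maxsize=depth)

    def producer():
        for b in batches:
            q.put((b, pull_fn(b)))        # network IO happens off the critical path
        q.put(None)                       # end-of-stream marker

    threading.Thread(target=producer, daemon=True).start()
    while (item := q.get()) is not None:
        yield item                        # (batch, embeddings) ready for forward/backward

# Toy pull function standing in for a PS round trip:
pulled = list(prefetch([[1, 2], [3]], pull_fn=lambda ids: [i * 1.0 for i in ids]))
```

Gradient post-push is the mirror image: the push to the PS is issued asynchronously so the next batch's forward pass does not wait on it.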

Comprehensive fault tolerance

Built on service discovery, whether a worker or a PS fails, it can recover quickly. Workers in Monolith do not communicate directly with each other, so one worker's failure does not affect the others. A worker's input progress is persisted, so when a worker fails unexpectedly, its input progress is not lost. When a PS shard fails, Monolith supports either partial or full recovery depending on whether the task is offline or online, making a trade-off between correctness and recovery speed.
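The "input progress is persisted" part can be sketched as an atomic write of the worker's read offset, so a restarted worker resumes where it left off. The file format and location here are made up for the sketch; Monolith's real progress storage is an internal detail:

```python
import json
import os
import tempfile

def save_progress(path, worker_index, file_offset):
    """Persist a worker's input progress atomically: write to a temp
    file, then rename, so a crash never leaves a torn progress file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"worker": worker_index, "offset": file_offset}, f)
    os.replace(tmp, path)                 # atomic on POSIX filesystems

def load_progress(path):
    if not os.path.exists(path):
        return 0                          # fresh start: read from the beginning
    with open(path) as f:
        return json.load(f)["offset"]

p = os.path.join(tempfile.gettempdir(), "worker0.progress")
save_progress(p, 0, 12800)                # worker 0 has consumed 12800 records
```

On restart, the worker calls `load_progress` and seeks its input stream to the saved offset, which is what keeps one worker's crash from forcing the whole job to replay data.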

Distributed Serving

Monolith fills the gap left by open-source software in distributed serving, providing inference for TB-scale models. It supports multiple replicas and high availability, and during training the Training PS syncs freshly updated embeddings to the Serving PS at minute-level intervals, achieving near-real-time parameter updates and improving the product's recommendation quality.

Performance optimization

Beyond the solution to the TensorFlow PS communication bottleneck described above, Monolith has also been carefully optimized in the parameter server framework, the underlying hash table design, network transport, multi-threaded acceleration, op fusion, instruction-set acceleration, and other directions, achieving considerable performance gains. Taking asynchronous training as an example, the optimizations cover:

  • Network communication optimization: embedding prefetch and gradient post-push make network IO asynchronous with the graph's forward/backward computation; separating control flow from data flow, compressed transmission, and other optimizations are also supported.
  • Memory optimization: feature filtering, feature compression, feature eviction, and other means greatly reduce memory usage in both the training and serving stages.
  • Compute optimization: hot-spot code is optimized with the AVX instruction set, time-consuming ops are individually tuned, and manual op fusion and other means accelerate the forward/backward computation.
  • Other aspects: multi-threading optimization, fine-grained lock design, asynchrony between IO and computation, and so on.

At present, Monolith has been successfully applied through the recommendation platform in e-commerce, community, video, and many other industry scenarios, and its effectiveness, stability, and performance have been fully validated. Going forward, we will keep iterating rapidly and continuously improve the user experience and platform features.

A gift

Thank you for reading this far. ByteDance's intelligent recommendation platform is now open to business partners through Volcano Engine. If your business wants to apply recommendation algorithms to drive growth but building a recommendation system is a headache, try the Volcano Engine intelligent recommendation platform. Click the portal link for more details:

It is worth mentioning that the intelligent recommendation platform is currently offering 30 free slots to business partners, free of charge until November 30, 2021. If you want to claim a slot, please scan the QR code below to sign up as soon as possible:

Finally

Finally, let us introduce ourselves: we are the Volcano Engine intelligent recommendation team, committed to enabling enterprises around the world to have a top-tier recommendation system. We warmly welcome students working on machine learning systems, recommendation architecture, and recommendation algorithms to join us. Locations: Beijing, Shenzhen, Hangzhou, Singapore. Send your resume to [email protected] with the subject line: Name - Years of experience - Volcano Engine intelligent recommendation - Target position. We look forward to working with you!

Copyright notice
Author: ByteDance Technical Team. Please include the original link when reprinting. Thank you.
