Episode 1 Deep Learning and Natural Language Processing 101
October 02, 2020 | 42 min
Podcast Host – Vinayak Joglekar, CTO at Synerzip
Podcast Guest – Krishnakumar Bhavsar
Brief Summary
As information becomes more and more valuable, Deep learning and Natural Language Processing (NLP) continue to get the limelight in the world of technology. Our co-host Vinayak Joglekar, CTO at Synerzip, is joined by Krishnakumar Bhavsar, an AI and NLP expert, to look at the basic concepts and the implementation of DL and NLP. They also discuss an interesting case study on processing more than 200 resumes in less than 60 minutes using neural networks.
Podcast Transcript
Madhura Gaikwad
Hello everyone. I am Madhura Gaikwad, and you’re listening to ZipRadio podcasts powered by Synerzip. On this podcast channel, we will host discussions on topics based on technology and agile trends with our CTO and Co-founder at Synerzip, Vinayak Joglekar. In the first episode, Vinayak is in conversation with Krishnakumar Bhavsar on deep learning and natural language processing. Krishnakumar is an expert in artificial intelligence, deep learning and natural language processing. So let’s see what they have in store for us today. Welcome onboard guys.
Vinayak Joglekar:
Thank you. Thanks for the introduction. As we all know, deep learning is a hot topic, and it is an extension of machine learning. In a few words, deep learning has to do with multiple passes of machine learning. One learning pass could be treated as the step where you give some input to the learning algorithm and you get some output. If you take this output from the first pass and give it as an input to the second pass, then this becomes a case of deep learning. Typically, such multiple passes are characteristic, as we will learn from Krishnakumar, of what are called neural networks, and there are very popular neural network packages or libraries such as TensorFlow. We will be talking more about those, but Krishnakumar, for the sake of clarity and for those of our listeners who are not familiar with machine learning and deep learning: as I understand it, if you do the machine learning part multiple times, that is deep learning. Is my understanding correct?
Krishnakumar Bhavsar
Yes, I would broadly agree with you on this. Machine learning is something people have been doing, in the form of statistical analysis, since the 1950s. When we started doing the same kind of analysis with machines, using a lot more data, things such as regression algorithms, joint analysis, clustering algorithms, or monitoring time series, all of that done through computers is what we started calling machine learning. It is nothing but a base of probability and statistics.
Vinayak Joglekar:
Yeah. I remember doing cluster analysis and factor analysis way back in college; it was part of our statistics course. Machine learning is nothing but old wine in new bottles, right?
Krishnakumar Bhavsar
Exactly. Yeah, exactly.
Vinayak Joglekar:
Then what is deep learning?
Krishnakumar Bhavsar
Okay. So deep learning is a concept that was introduced somewhere in the late 1980s, where the idea was that we can take the output of one pass and feed it back, so it’s basically a feedback mechanism that lets us continuously improve something. When you go through one single pass of a heavy algorithm, what you get is usually a fit: if you imagine that you run a regression analysis, then what you essentially get, on an X and Y axis of your parameters, is a line or perhaps a curve for your current target variable, and the model tries to predict values based on that. So what is deep learning? Deep learning tries to go beyond that. Okay, I ran this algorithm once and I got this output; how do I feed it back so that I learn more from my mistakes? That way, whatever smooth curve or line I currently have as my predictor, I can bend it up and down in places, allow some more possibilities of error, so that I can be more accurate. This was introduced, but we lacked the processing capabilities. We understood neural nets, how neurons in our brains communicate with signals and all of that, but we did not have the processing capabilities to actually implement it. Fast forward to 2012, and suddenly everyone realizes that now we have the processing capabilities.
And then we started having libraries, like the TensorFlow library you mentioned. They all started coming up; Google started with that. So, as you already said, it’s nothing but a feedback mechanism. You take input, you run the algorithm, and you feed the result to the next pass. Each pass is slightly tweaked, not the same, so it’s layered; it has multiple layers. Earlier, when you ran the algorithm, only one single pass, one single iteration, would be done and you would get the output. Now you have two passes, three passes, so the actual output is getting refined again and again, like in an oil refinery: first you get diesel, then kerosene, then petrol. It is going through filters.
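For readers who want to see the “multiple passes” idea in code, here is a minimal TensorFlow/Keras sketch of a network with stacked layers, where each layer refines the output of the previous one. The layer sizes, activations, and synthetic data are illustrative assumptions, not anything from the project discussed here.

```python
# Minimal sketch of a "multi-pass" (multi-layer) network in TensorFlow/Keras.
# Layer sizes, activations, and the synthetic data are illustrative assumptions.
import numpy as np
import tensorflow as tf

# Synthetic regression data: 1000 samples, 10 input features.
X = np.random.rand(1000, 10).astype("float32")
y = (X.sum(axis=1) + np.random.normal(0, 0.1, 1000)).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(32, activation="relu"),  # first "pass"
    tf.keras.layers.Dense(16, activation="relu"),  # second "pass" refines the first
    tf.keras.layers.Dense(1),                      # final prediction
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, verbose=0)  # errors are fed back to adjust every layer
```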
Vinayak Joglekar:
Right. So one of the things I have understood from what you just described is that, let’s say you have a set of points on an X-Y graph, there is a predictor which tries to connect those points in one pass. That would be a fairly smooth curve, but it is not exactly predicting your variable. So you pass it to a second pass to correct the mistakes, or errors, from your first pass, and then you get a better fit in the second pass, and so on and so forth. With every pass, you become better and better. What is helping us is that this has become more democratized, because computing power is now available to everyone, and you can start using this. That, I think, is the current state and part of the current hype: now we have companies like Synerzip doing it, and we don’t need a supercomputer to engage in such things. It is the same as with machine learning, which conceptually existed earlier but became more popular because of the availability of computing power, and the same is true of deep learning. So my next question is about the problems deep learning typically solves. Just for the audience’s benefit, I will describe what I know about the type of problems deep learning can solve. One of the popular problems is image recognition: there are pictures of cats and dogs, and you are supposed to identify which picture is a cat and which is a dog. The way deep learning is useful here is that you take a picture and divide it into multiple small windows, or convolutions. Those convolutions are processed, or looked at, in the first pass to identify edges: wherever, within a particular convolution, you see a change of colour, it will be identified as an edge.
Once you identify the edges, you have the second pass, in which those edges are joined together to identify lines or triangles or circles, or some kind of shapes that eventually form the body of a cat. Maybe in the third pass you start identifying certain features of the cat, such as the eyes or whiskers or ears, and so on and so forth. After multiple passes like that, you finally identify a cat as a cat and a dog as a dog. This is what I understand to be a typical application of deep learning to this particular problem. Can you tell me how deep learning, with TensorFlow or any other type of neural network library you are familiar with, is used here? Can you explain a little more about what happens in these convolutional neural networks?
Krishnakumar Bhavsar
So, as you say, a convolution is a window, which means it is a physical window moving across your image.
Whatever you have as an image is on a horizontal plane, an X and Y axis; you have a flat image on your machine. What you do as a convolution is run a smaller frame across the entire image, starting at (0, 0). If I have a 2×2 frame, then it covers 0 to 2 on the x-axis and 0 to 2 on the y-axis, that is, 4 pixels, and I keep moving it horizontally and vertically to identify whether there is any difference from one frame to the next. This particular small frame is what we call a convolution.
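A minimal sketch of that sliding-window idea, assuming a tiny random grayscale image and a crude “edge” test based on the spread of pixel values inside each 2×2 frame:

```python
# Illustrative 2x2 window sliding over a small grayscale image (values are assumptions).
import numpy as np

image = np.random.randint(0, 256, size=(6, 6))  # a tiny 6x6 grayscale image

window = 2
for row in range(image.shape[0] - window + 1):
    for col in range(image.shape[1] - window + 1):
        patch = image[row:row + window, col:col + window]  # the 2x2 "convolution" frame
        # A large spread of pixel values inside the patch hints at an edge.
        if patch.max() - patch.min() > 100:
            print(f"possible edge near ({row}, {col})")
```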
Vinayak Joglekar (09:45):
Okay. But my next question: wouldn’t this be very difficult because of the variation in sizes? The pictures of cats and dogs will vary in size and aspect ratio. Also, in a colour image, the cat might be under a tree or running down a hill, so how do you find where the cat is in the picture? How do you identify and label that? Because for training you need to label that data; otherwise the computer would not know where the cat is in the picture. How do you prepare this data? I think you can’t feed pictures in just as they are; you might need to pre-process them first. So how do you do that?
Krishnakumar Bhavsar (10:38):
So you’re right. In any kind of data analysis, be it image processing, text processing, or simple statistical processing, there is always some amount of pre-processing required. Pre-processing means you take the data, massage it, and prepare it in a format that can be fed into your computer. So what are the different things you do in pre-processing? You do data standardization, you remove certain outliers, sometimes you have to fill in missing data, and you have to convert between formats. Those are the things you normally do statistically. Now, for image processing, what exactly will we do for the problem we are discussing, cats and dogs? First is the colour: we need to standardize the colour, so we will probably grayscale the entire image. The other standardization we need, as you said, is size, because the sizes are different. So we may decide on a particular size, say 500×500 pixels, compress all the images to that size, and standardize them by making them all grayscale.
When you convert these to grayscale, the variation that the algorithm has to understand, the complexity, is certainly reduced, which means your algorithm will converge faster and your edges will be detected faster. And since we are at it, I would also like to talk about how exactly the algorithm will learn this, because you have to have a huge dataset. When you said the cat might be in a tree or on a hill, I absolutely agree; that is why we need as big a dataset as possible, for training as well as for testing. So let me explain what train and test mean. For running any kind of analytics algorithm, you need quite a lot of data for the algorithm to learn the fit, and you usually divide that data into a training set and a testing set.
What that means is that you expose the algorithm only to the training part, and you use the testing part to test the results of the actual fitted algorithm. Usually this ratio is 70:30 or 60:40, depending on the size of the data you have. For this image classification problem we are talking about, cats and dogs, if you look at the Google image recognition datasets or Kaggle, I’m sure we can find thousands of images of cats and dogs. Lots of people are trying to solve this particular problem; I think it is one of the basic problems people attempt when they are learning deep learning. So yes, pre-processing is involved, and you need a lot of data for the algorithm to be as generic as possible. Otherwise, you will tend to overfit.
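A hedged sketch of the preprocessing and split just described: grayscaling and resizing images to 500×500, then doing a 70:30 train/test split. The folder name and the “cat in the filename” labelling convention are assumptions for illustration.

```python
# Sketch of grayscale + resize preprocessing and a 70:30 train/test split.
import glob
import numpy as np
from PIL import Image
from sklearn.model_selection import train_test_split

def preprocess(path, size=(500, 500)):
    img = Image.open(path).convert("L")   # grayscale to reduce colour variation
    img = img.resize(size)                # standardize all images to one size
    return np.asarray(img, dtype="float32") / 255.0

paths = glob.glob("cats_and_dogs/*.jpg")              # hypothetical folder
labels = [1 if "cat" in p else 0 for p in paths]      # hypothetical naming scheme
images = np.stack([preprocess(p) for p in paths])

X_train, X_test, y_train, y_test = train_test_split(
    images, labels, test_size=0.3, random_state=42)   # 70:30 split
```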
Vinayak Joglekar (13:31):
What is overfitting? My understanding of overfitting is that your predictor will fit the training data very well, but in its desperation to fit the training data, it ends up being so specific to the training data that it stops being good for the future data points that come up in your test data or live data. So how do you stop that from happening; how do you avoid overfitting? The other side is that if you try too hard not to overfit, you might end up underfitting, in which case your predictor doesn’t even predict your training data accurately, and then you’re unlikely to predict your future dataset accurately either. So you have to walk a tightrope between underfitting and overfitting. You don’t want to underfit or overfit, so how do you manage that?
Krishnakumar Bhavsar (14:28):
Sure. While we are talking about overfitting and underfitting, I will also introduce a few related concepts: precision, recall, and accuracy. As you correctly say, when you have an overfitting model, it is going to predict all your training data, and accuracy will be very high on that data. When you say accuracy, that means if you have fed a hundred examples to your model and it identifies 90 of them correctly, you are 90% accurate.
Now we have this PR measure, precision and recall. Precision looks only at the examples the model flagged as positive: out of everything it called a cat, how many really were cats? That is the ratio of true positives to true positives plus false positives. Recall, on the other hand, looks at all the actual positives: out of all the cats that were really there, how many did the model manage to find? That is the ratio of true positives to true positives plus false negatives. So precision is hurt by false positives, recall is hurt by false negatives, and accuracy simply counts how many examples overall were classified correctly.
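For reference, the standard formulas, computed here from made-up confusion-matrix counts:

```python
# Standard accuracy, precision, and recall from a confusion matrix.
# The counts below are made-up numbers purely for illustration.
tp, fp, fn, tn = 70, 10, 30, 90   # true/false positives and negatives

accuracy = (tp + tn) / (tp + fp + fn + tn)   # fraction of all examples classified correctly
precision = tp / (tp + fp)                   # of everything flagged positive, how much was right
recall = tp / (tp + fn)                      # of all actual positives, how many were found

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```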
Vinayak Joglekar
Yeah. So I would say that if I flag a hundred examples and 80 of them are correctly identified, the precision is 80%. But if there are actually 200 positives and I only manage to find 100 of them, right or wrong, then my recall is only 50%. So when I try to improve my precision by fitting very accurately, I might end up reducing my recall, because in trying not to make a mistake, I might end up not identifying some positives as positives.
In that case I improve my precision but lose out on recall. So these are two measures which can conflict with each other, which is all well and good, but tell me what this has to do with what we were talking about, overfitting and underfitting. I understand that by measuring these you will know whether a model is overfitting or underfitting, but what do you do to make sure it is not overfitting or underfitting?
Krishnakumar Bhavsar
The initial thing to test is whether your model is overfitting or underfitting. If the precision and recall on your training dataset are very high, but on your testing dataset they are very low, that means you have an overfitting model.
This means either your dataset is imbalanced, i.e., you have not properly randomized the samples between your training set and your test set; the sampling itself has gone bad. So the first thing you check is your sampling. The second thing you check, if you are using a deep learning algorithm, is how many passes you are running and how many nodes are involved. If you have a dataset of, say, a million records and the number of columns, or independent variables, is less than a hundred, then I think if you are running more than two passes with more than five nodes, it is definitely going to overfit. The more you try to fit to the set that has been provided, the more it will overfit.
But in the same case, with a million records and fewer than a hundred parameters, if you ran only a single pass, it would underfit; it would be too generic.
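A quick sketch of that overfitting check, comparing training accuracy against held-out test accuracy; the model and synthetic data are placeholders, not the setup discussed here.

```python
# Compare metrics on the training set against the held-out test set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
train_acc, test_acc = model.score(X_train, y_train), model.score(X_test, y_test)

# A large gap (e.g. 0.99 on train vs 0.75 on test) suggests overfitting;
# low scores on both suggest underfitting.
print(f"train={train_acc:.2f} test={test_acc:.2f}")
```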
Vinayak Joglekar (18:41):
Oh, so that is how you find the balance. It has to do with the total size of your sample and how random it is, because if it’s not evenly spread between the training and test data, you are likely to go wrong, and also with the number of passes. If you have more records, you should take advantage of them by having more passes; but if you don’t have that many records in the training set, you should not add too many passes just to improve accuracy, because then you will end up overfitting. So this is all very good. Now, let us come back to the problem that you tried to solve at Synerzip with the help of deep learning. Can you describe the problem? And I will chime in, because I was also part of this.
Krishnakumar Bhavsar (19:31):
Absolutely. Let us first look at what exactly the business problem was that we had in hand. I think, Vinayak, you are better placed to tell the audience what exactly the problem was.
Vinayak Joglekar (19:45):
Okay. So let me take a shot at it and you can chime in. We have resumes which are parsed in our system, and we need to recognize and use features inside the resumes. A resume may have different types of features: high-level features and low-level features. For example, take education in a resume. Education has multiple records: there will be a schooling record, a college record, a post-graduation or university record, maybe three records in education. Each record will have low-level features such as the year of passing, the degree or certification obtained, and the university or institute it was obtained from. These low-level features are easy to identify because we have what are known as gazetteers, or lists, and then you are just doing a matching. You will know that two zero one eight, 2018, is a year because it matches the gazetteer. Similarly, you will know that BE or BTech is a graduation degree because it matches an entry in the list.
Similarly, you will probably know that IIT is an institute because it matches an entry in the list. So the low-level features are easy to identify. But then you need to conclude that since this particular record has a degree and an institute or a college, it must be an education record. That was being done on the basis of rules, and we were going crazy, because those rules were running into long, long lists. Every candidate has a different way of writing things in their resume, and trying to write a rule generalized enough to take care of every way in which, for example, education is written by every candidate, somebody writes the institute first, somebody does not write the year of passing, all sorts of confusion would happen, and you would end up not identifying education as education. That was the problem we started with, and we thought we would give it a shot using TensorFlow, machine learning, neural nets. That’s how I would describe the problem.
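A simplified sketch of the gazetteer-matching idea, assuming small hand-made lists rather than the actual GATE gazetteers used in the project:

```python
# Hedged sketch of gazetteer-style matching for low-level features.
YEARS = {str(y) for y in range(1980, 2026)}
DEGREES = {"BE", "BTECH", "ME", "MTECH", "MBA", "BSC", "MSC"}
INSTITUTES = {"IIT", "NIT", "BITS"}

def tag_token(token):
    t = token.strip(".,").upper()
    if t in YEARS:
        return "YEAR"
    if t in DEGREES:
        return "DEGREE"
    if t in INSTITUTES:
        return "INSTITUTE"
    return None

line = "BE Computer Engineering, IIT Bombay, 2018"
print([(tok, tag_token(tok)) for tok in line.split() if tag_token(tok)])
# -> [('BE', 'DEGREE'), ('IIT', 'INSTITUTE'), ('2018', 'YEAR')]
# A record containing both a DEGREE and an INSTITUTE is a strong hint of an education section.
```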
So you have top-level features such as the objective, experience, education, project details, and so on, and we need to identify these top-level features based on the input of the low-level features: technology, year, degree, specific functional skills, certain keywords which act as anchors such as “responsibilities” or designations, names of companies, and so on. On the basis of these well-identified low-level features, and also the format of the resume itself, whether there is a table or a paragraph or some bold text, all of this information goes in, and on the basis of it we try to identify the high-level features. So that is the problem. Have I missed something?
Krishnakumar Bhavsar (23:12):
Sure. In fact, you gave a very nice, detailed answer. Still, what I would really like to tell the audience before we get into the details is that the problem really was classifying resumes. The idea behind this was that an HR organization, a human recruitment firm, or an agency which does recruitment gets hundreds of thousands of resumes every day, and they need to categorize or classify each of these resumes, to know exactly which bucket to put the information into. Even if we talk about just technology resumes, I can very well say that we will have at least 50 different buckets into which we can classify them. Now, to classify these resumes, as you said, what we call a high-level feature is one of the top-level 7-8 entities in which the entire resume resides; mainly, I would call them full name, objective, overview, education, company experience, skill set, and the projects you worked on.
This is true of pretty much any kind of resume you download from any portal. And beneath these higher-level entities, every resume also has, at a granular level, some medium-level and lower-level entities, which are what actually propagate up to give us the output for identifying the higher-level entities. Now, as you correctly said, we are going to use all of these entities, and you have already introduced them, so how about we move on and see how we modeled this and how we went about it?
Okay, so let’s talk about that. What we did was, as Vinayak has already mentioned, we had already done a lot of work with GATE, natural language processing, and ontology-related work to identify the lower-level features. Just to repeat, the kind of lower- and medium-level features we are talking about are your first name, last name, technology names, the address that goes on the left-hand side of your resume, and then labels like “project name” followed by the actual project name, or “education” followed by your actual education. These are your low- and medium-level features, and this is the low-hanging fruit that you can easily get, as you mentioned. So we identify all these entities using natural language processing and computational linguistics. Having got all of these entities, along with that, we also consider the properties of the document in terms of its structure. That is another lower-level entity we consider.
When we combined all of this, we actually came up with around 96 such features, or entities, that we can identify at a lower level. And at the top level, all of these entities feed into the top 8 entities I just told you about: full name, objective, education, and so on. Once we had this, we first had to decide what to do with all this data: should we try to model it in a statistical way, or go the NLP way? We decided to model it in a statistical way, and within that, we decided to work at the offset level. That means we would try to predict each and every offset with…
Vinayak Joglekar:
For the benefit of the audience, an offset is a position where a character can reside within a resume. So if a resume has 1,000 words and each word is 10 characters, there will be 10,000 offsets, right? Offset number one will have certain properties, such as whether each of the 96 features we just talked about is present or not present at offset number one. Similarly, we’ll have 10,000 offsets, 10,000 rows, and each row will have 96 columns, and in each column you will have a yes-or-no type of answer: whether this feature exists or does not exist. Is that how we modeled the problem?
Krishnakumar Bhavsar (27:38):
Yes, exactly. That is how we decided to model the problem. We also talked about standardization. We thought the location of these elements was important: people will not write their full name at the end or bottom of the resume, nor will they write the objective or the overview at the bottom, and similarly the footnote can never come at the start of the resume. That means we had to do something about the size of the resume, so we decided to stretch all the resumes to 20,000 offsets. The premise for this was that we had already looked at thousands of resumes, and we saw that the average length of a resume is anywhere between 10,000 and 20,000 offsets. That’s why we decided that with 20,000 offsets as the maximum, all the resumes could fit into it and we would not lose any data. The idea was not to lose any data; that is why we took 20,000 as the maximum offset. Having decided that, we stretched all our training and testing data to be of size 20,000, that is, from 0 to 20,000, every resume would have 20,000 characters. That is what we standardized on.
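A sketch of how such a per-offset feature matrix could be built, padded to a fixed 20,000 offsets with 96 binary feature columns; the feature indices and helper function below are hypothetical illustrations, not the project’s actual code.

```python
# Per-offset binary feature matrix, padded to a fixed 20,000 offsets.
import numpy as np

MAX_OFFSETS = 20_000
NUM_FEATURES = 96

def build_offset_matrix(feature_spans):
    """feature_spans: {feature_index: [(start_offset, end_offset), ...]} from the NLP step."""
    matrix = np.zeros((MAX_OFFSETS, NUM_FEATURES), dtype=np.int8)
    for feat_idx, spans in feature_spans.items():
        for start, end in spans:
            matrix[start:min(end, MAX_OFFSETS), feat_idx] = 1
    return matrix  # offsets past the end of a short resume simply stay all-zero (padding)

# Hypothetical example: feature 3 (say, "degree") found at offsets 120-126.
m = build_offset_matrix({3: [(120, 126)]})
print(m.shape)  # (20000, 96)
```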
Having done that, we ran it through multiple algorithms. Finally, we converged on a multinomial neural network algorithm, which gave us very good accuracy on the first pass. We saw that for the smaller high-level features, the shorter ones like the first name, the full name, or the overview (“you mean the shorter, limited ones?” Yes, exactly), the accuracy was very high. When the number of characters was small, not more than about 1,000 characters, for example a full name would have fewer than a hundred characters, whereas a project could run into hundreds or even thousands of characters (“so the shorter it is, the better the accuracy?”), the accuracy was very high for the smaller or shorter high-level entities we wanted to identify. But with the larger entities, although the location looked roughly correct, we used to get these small outlier kinds of problems.
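As a rough illustration of a multinomial (softmax) classifier over the per-offset features, here is a Python/Keras sketch; the actual work was done in R, so the layer sizes, settings, and synthetic data below are assumptions.

```python
# Rough illustration of a multinomial (softmax) classifier over per-offset features.
import numpy as np
import tensorflow as tf

NUM_FEATURES, NUM_CLASSES = 96, 8   # 96 low-level features, 8 high-level entities

# Synthetic stand-in for (offsets x features) rows and their entity labels.
X = np.random.randint(0, 2, size=(50_000, NUM_FEATURES)).astype("float32")
y = np.random.randint(0, NUM_CLASSES, size=50_000)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(NUM_FEATURES,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),  # one probability per entity class
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=2, batch_size=512, verbose=0)
```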
Vinayak Joglekar: (30:04):
Yeah. So it was identifying the project section, for example, correctly, where it was supposed to be, but at the same time there were some false positives, short ones of one or two characters, sprinkled all over the resume. It would have those false positives, outliers as you would say, and that was spoiling the accuracy.
Krishnakumar Bhavsar
Correct. So, as Vinayak already said, you have identified an almost 5,000-character-long project section, but in between you find lots of false positives, places it has not been able to identify correctly. To rectify this problem, we decided that we had probably not given enough priority to the adjoining entities, the things around each offset; we had given priority only to the location, but not to what is around it. So we decided that in the next pass we would include the location as well as what is around each of those offsets, whether the neighbours are part of the same entity, the same high-level feature, or of different high-level features, and also the length of that particular feature.
Vinayak Joglekar:
I think the length was very valuable, because there are certain features that are by their nature very short, and certain features that are by their nature long. A project section is not short, and the first name or the full name cannot be long. But you could not have provided the length without having it go through the first pass, right? So the first pass output was the length of the identified feature, and that output goes back into the second pass, because now we have the information about the length of the identified feature, right?
Krishnakumar Bhavsar
The identified block, right. So now that we knew the length of the identified block, we decided that blocks which are really small in size cannot be projects. That’s when the algorithm actually understood what is around each offset and what the length of the block is, and it became even better in the second pass. So in the first pass we got very good accuracy on the smaller, shorter entities, and in the second pass we got very good accuracy on the larger, bigger entities.
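A sketch of that second-pass feature idea, augmenting each offset with the predictions around it and the length of the contiguous block it falls in; the window size and the encoding are assumptions for illustration.

```python
# Augment first-pass predictions with neighbouring context and block length.
import numpy as np

def second_pass_features(first_pass_labels, window=50):
    """first_pass_labels: 1-D array of predicted class ids per offset from pass one."""
    n = len(first_pass_labels)
    block_len = np.zeros(n, dtype=np.int32)

    # Length of the contiguous block of identical labels around each offset.
    start = 0
    for i in range(1, n + 1):
        if i == n or first_pass_labels[i] != first_pass_labels[start]:
            block_len[start:i] = i - start
            start = i

    # Neighbouring context: the labels a little before and after each offset.
    prev_label = np.roll(first_pass_labels, window)
    next_label = np.roll(first_pass_labels, -window)
    return np.column_stack([first_pass_labels, prev_label, next_label, block_len])

feats = second_pass_features(np.array([7, 7, 7, 2, 2, 7, 7, 7, 7, 7]))
print(feats[:, -1])  # block lengths: tiny blocks (here the lone 2s) are unlikely to be projects
```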
Vinayak Joglekar:
But then what happened to the smaller ones?
Krishnakumar Bhavsar
The smaller ones, yeah, here is the catch. We decided that we would stick to the first-pass output for the smaller entities, because that was giving us better results; since we were using location and surrounding context in the second pass, it was garbling the smaller entities in the second pass.
Vinayak Joglekar:
Now that’s interesting. So the first pass would correctly identify the email address or the name, the short ones, and that was taken as the output for the short ones, but for the long ones, like the experience record or the project record, we would depend on the second pass.
Krishnakumar Bhavsar (33:18):
Absolutely. Yep. That’s how we went about it.
Vinayak Joglekar (33:21):
So that is very interesting, so, you know, having done this can you tell us a little bit about the type of tools that we used for the analysis as well the visualization?
Krishnakumar Bhavsar (33:32):
Sure. So initially we started by trying to identify only the gaps between the sections. We thought that if we just identified the gaps properly, we would be able to identify the sections properly, and that binary classification algorithms would work well for that. But having seen the results, we quickly understood that what we were trying to identify was really not what should be done. The prominent reason was that the data was extremely imbalanced: there were hardly any gaps between sections or between entities. So if you try to identify the gaps, the set of actual true positives you are trying to identify is very small, and even if your precision is very high, your recall is very low; it is a classic example of that. So then we decided that, instead of trying to identify the gaps, we should try to identify the top-level entities.
And then we went with neural networks. We applied the algorithm and looked at the results through something like a dispersion plot: what exactly the annotations are and where exactly they appear. Think of it as a horizontal bar in which, when I click on “project”, it shows me where all the project sections are in this particular resume, from offset 0 to 20,000. That is the type of visualization we used, and it really helped us. The programming languages we used were mostly R and Java. Java was for the initial processing of the resume; because we used GATE, we did the initial NLP part in Java using GATE, and we also used Java for converting the GATE document into our statistical table. Another important thing I would like to discuss is a particular problem we had when running a MySQL database in the pre-processing step: we realized that dumping the data straight into a file on disk was much faster than using a relational database.
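A sketch of that kind of horizontal-bar visualization using matplotlib, with made-up entity spans along the 0–20,000 offset axis:

```python
# For one resume, show where each predicted high-level entity sits along the offset axis.
import matplotlib.pyplot as plt

predicted_spans = {                      # entity -> list of (start_offset, length)
    "full name": [(0, 40)],
    "education": [(1200, 900)],
    "project":   [(4000, 5200), (9800, 300)],
}

fig, ax = plt.subplots(figsize=(10, 2))
for row, (entity, spans) in enumerate(predicted_spans.items()):
    ax.broken_barh(spans, (row - 0.4, 0.8))
ax.set_yticks(range(len(predicted_spans)))
ax.set_yticklabels(predicted_spans.keys())
ax.set_xlim(0, 20_000)
ax.set_xlabel("offset")
plt.show()
```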
Vinayak Joglekar (33:21)
Could you tell us the order of magnitude? How much more time did MySQL take compared to putting it in a .csv file on disk?
Krishnakumar Bhavsar:
Sure, I mean, you would be quite surprised by the numbers. If I had a 15,000- or 20,000-offset resume, that is 15,000 characters, putting it into a MySQL database in the format we wanted, with 96 features, so just creating 15,000 rows with 96 columns, was taking us close to about eight hours for one document. But dumping the same thing to disk was done in two seconds.
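A minimal sketch of the dump-to-disk approach: writing one resume’s 15,000 × 96 offset/feature rows to a CSV file in a single bulk write.

```python
# Write the offset/feature rows straight to a CSV file instead of row-by-row DB inserts.
import csv
import numpy as np

matrix = np.random.randint(0, 2, size=(15_000, 96))   # stand-in for one resume's features

with open("resume_features.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([f"feature_{i}" for i in range(96)])
    writer.writerows(matrix.tolist())
# A single bulk file write avoids the per-row transaction overhead of the database.
```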
Vinayak Joglekar (36:41):
Oh my God! And loading that kind of .csv data into an R matrix is very easy, right?
Krishnakumar Bhavsar:
Absolutely! Loading it into a data frame is a kid’s job.
Vinayak Joglekar
So finally, I have one last question, and then you can add anything I’ve missed. Now that I have this, how would I use it? Suppose a new resume comes in: how is the inference done on the new resume to identify the high-level features, given that we have these 96 low-level features, which are very accurately known?
Krishnakumar Bhavsar:
At the end of the day, this is all mathematics. The final outcome of any deep learning or machine learning algorithm is an equation. You have an equation with the parameters that you have, and for each parameter you have an associated coefficient; that coefficient is called a theta. Each coefficient is assigned based on the importance of that particular feature in deriving the final outcome. That means if I have 96 features, I get 96 coefficients, one for each feature. Every time a new resume comes in, I convert it into the format I have, the CSV format, for each offset, I run this equation, and it gives me the final answer: which particular class this offset belongs to.
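An illustrative version of that “it’s just an equation” point: with learned coefficients (thetas) in hand, classifying one offset reduces to a dot product and a softmax. The values below are random placeholders.

```python
# Inference for one offset: coefficients times features, then softmax over classes.
import numpy as np

theta = np.random.randn(8, 96)          # one coefficient per (class, feature)
bias = np.random.randn(8)
x = np.random.randint(0, 2, size=96)    # the 96 binary features of one offset

scores = theta @ x + bias
probs = np.exp(scores - scores.max())
probs /= probs.sum()                    # softmax over the 8 high-level entity classes
print("predicted class:", probs.argmax())
```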
Vinayak Joglekar
So it would tell me that this particular offset, since it has these 96 feature values, looks like it belongs to the project feature, and then when you string together a few hundred of those, it starts looking like a project. Is that correct? So this is interesting: inference can be done even on a mobile phone, right? Once you know the thetas, it doesn’t require much processing power.
Krishnakumar Bhavsar:
It absolutely doesn’t require any special processing power. It’s just a simple mathematical equation that you could run even on a calculator.
Vinayak Joglekar
Yeah, so that is wonderful. Inference can be done using a low-power front end, at the edge, and you only do the training at the back end. Now, when you add some more resumes to your training set, you need to retrain, right? So how frequently does this need to be retrained?
Krishnakumar Bhavsar:
It depends on your results. If you had a good enough sample to start with, then as long as your results are not deteriorating, I would suggest you continue using your current model. The other, better option is to find an incremental model that you can train incrementally. You can have something like a weekly batch, in which every Friday or Saturday evening you put some resumes in and train it incrementally, and it gives you the result. That means you do not have to go through the full cycle: whatever was learned is already there, and you are incrementally adding some more information, or knowledge, to the same model so that it can get better at predicting what is coming.
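A sketch of that weekly incremental-training idea using a model that supports partial_fit (scikit-learn’s SGDClassifier here); this is one possible approach with assumed shapes and data, not necessarily the one used in the project.

```python
# Incrementally update a classifier with each week's new batch of resumes.
import numpy as np
from sklearn.linear_model import SGDClassifier

NUM_FEATURES, NUM_CLASSES = 96, 8
model = SGDClassifier(loss="log_loss")

def weekly_update(model, new_X, new_y, first_call=False):
    if first_call:
        model.partial_fit(new_X, new_y, classes=np.arange(NUM_CLASSES))
    else:
        model.partial_fit(new_X, new_y)   # keeps what was learned, adds the new batch
    return model

# Hypothetical first batch, then a later week's batch of new resumes.
X0, y0 = np.random.rand(10_000, NUM_FEATURES), np.random.randint(0, NUM_CLASSES, 10_000)
X1, y1 = np.random.rand(2_000, NUM_FEATURES), np.random.randint(0, NUM_CLASSES, 2_000)
model = weekly_update(model, X0, y0, first_call=True)
model = weekly_update(model, X1, y1)
```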
Vinayak Joglekar:
So how long would it take? Let’s say there are 1,000 resumes and 96 features, and each resume has 20,000 rows. So you are talking about 20 million rows with 96 features each. How long did it take, in the first instance, to process that using the neural network algorithm?
Krishnakumar Bhavsar:
So the initial multinomial model we ran took us, for the first pass, six and a half hours for 1,000 resumes with 20,000 offsets each, which is about 20 million records. That is quite quick, and it was not done on any kind of supercomputer; it was done on my 16GB MacBook Pro. So you can understand what we are talking about: with the processing power now available in commodity hardware, we can very well use these new techniques, and the performance is very much acceptable.
Vinayak Joglekar:
Great. Which means if you take a delta of one week, which would be, let’s say, a hundred or two hundred resumes, it should be done in less than an hour.
Krishnakumar Bhavsar:
Yeah. It should be less than an hour. Yes. For an incremental training model.
Vinayak Joglekar:
Okay. Have I not asked you something that I should have asked?
Krishnakumar Bhavsar:
I think we discussed quite a lot of things, in fact more than I had planned for.
Vinayak Joglekar:
Oh, good, good. Thanks for coming; it was wonderful to have you here, and we wish you good luck in your future career.
Krishnakumar Bhavsar:
Thank you very much Vinayak.
Madhura Gaikwad (41:38):
Thanks Vinayak. And thank you, Krishnakumar. Those were some great insights and I’m sure our audience will find them interesting. Thank you everyone for joining this episode. If you’re looking to accelerate your product roadmap, visit our website, www.synerzip.com, for more information. Stay tuned to future radio episodes for more expert insights on technology and agile trends. Thank you.