PyCaret - Why you don’t need to be an engineer to be a Data Scientist | Moez Ali

In this podcast, we spoke with Moez Ali, the creator and founder of PyCaret. PyCaret is an open-source, low-code Machine Learning library in Python. It enables you to automate ML pipelines from pre-processing to deployment in very few lines of code. PyCaret 1.0 was launched in April 2020, and its 2.0 version, released just a couple of weeks back, is already getting a lot of traction.

We talked about Moez’s life experiences on three continents and how they led to the idea of creating PyCaret. We also touched upon his beliefs about Data Science and how he thinks PyCaret fits into a landscape of many pre-existing libraries. You’ll come to know the inspiring story of how Moez, who is not originally from a technical background, created his own Python package. Plus, there’s an exciting rapid-fire round at the end. Oh, there’s a lot more, read on!

Theme Introduction: In this podcast series, we will have light and casual chats with Founders, Tech Leads, Hiring Managers/HRs, Data Scientists, Researchers, and Community Builders about how they use AI to solve real business problems. This podcast will bring you closer to all these amazing companies and get you excited to work with them. So don’t forget to stay tuned for all the upcoming episodes of the series, and make sure to subscribe and follow us.
PyCaret Journey

Moez’s career background

Kunaal Naik: Moez, welcome to the CLL Podcast! Congratulations on the launch of PyCaret 2.0. We’ll talk more about the latest version, but first, let’s look at your journey up to the inception of PyCaret. You did your Masters in Economics from the University of Karachi, then qualified as a Chartered Management Accountant with CIMA, worked in Healthcare Analytics in East Africa, moved to Canada, and earned your Chartered Professional Accountant designation in 2016. You worked in analytics roles at SickKids and Scotiabank, and you are now leading analytics at PwC Canada. Along the way, you also founded PyCaret. So how has the journey been, Moez?

Moez Ali: Thank you Kunaal, I am extremely thrilled to be here. The journey was exciting. I have been working in reporting and analytics roles since I started my career in 2008. And I think about ten years ago was when all the Business Process Automation started to happen. Very early in my career, I realized that the things we do in any department deal with reporting and data. I realized way back when my career started that things would change because of the way Process Automation was affecting operational tasks. And if you think of accounting, there could be nothing more repetitive than accounting. So I think that’s what got my attention.

I am a Chartered Accountant, and I practiced it for a few years. Prior to that, my college education was in Medicine and Healthcare. So I didn’t have any formal Computer Science education, but I was always influenced by the way technology can change things. I am always curious and a very good reader, so I would keep an eye on what’s going on and what’s happening.

So when I moved to Canada from Africa 4 years ago, I saw an opportunity to revamp my career, and that’s when I did my second master’s degree, at Queen’s University, last year. As I was going through the program, the PyCaret idea started. The last two-and-a-half years have been extremely crazy because PyCaret is like a full-time thing to me, except that it is not a paid job since it’s open source. It still takes a lot of time because we are a growing community, and I am investing a lot of time in building a team that is truly diverse, with a balanced set of skills in both software engineering and data science, and that shares a vision: to help PyCaret integrate more into the community. Because without a community, projects die out like any other open-source project. The only difference between good and great open-source projects is community.

So, it was an excellent journey. And if you ask me if I would repeat it, I would say yes without hesitation.

Kunaal: Absolutely, Moez. Great answer!

Also, thanks to Chayan Kathuria for putting in all the research for this podcast’s script.

Challenges in current role

Kunaal: So, Moez, you’re also currently working as the Head of Centre of Excellence at PwC. What are the challenges you face in this role?

Moez: Yes, so I do a lot of academic and research consulting, which is not just limited to PwC. At the end of the day, it boils down to a lack of clarity and misplaced expectations among business people as to what analytics and data science can solve. I think there’s a lot of myth in that area, and expectations are not clear. The gap between the technology teams and the teams that have the business requirements is the most problematic thing in today’s analytics space. And that’s the biggest challenge currently.

Kunaal: Yes, I agree with you, Moez, on this. I was in one such client interaction a few years back and faced the same issue. They just gave us a business problem and asked us to solve it using Machine Learning. And they didn’t have any background data to give us to get to that solution. So that was a big eye-opener for me.

Moez: Yeah, and I also think it’s great that AI and Deep Learning are solving some great challenges and getting a buzz around them. But just because Deep Learning is so cool doesn’t mean it can replace Machine Learning, business knowledge, or statistical knowledge. So I think that’s where we also need to educate young Data Scientists that Deep Learning is not a substitute for Machine Learning or Classical Statistics.

Academia and Industry gap

Kunaal: Absolutely, I think you brought up a great point, Moez. A lot of focus is given to AI & ML and not on thinking about business problems. Do you think there’s a gap between what Universities are teaching and what is being used in the industry?

Moez: Yes. I think the way Data Science is being taught is very unfortunate. 80% of the courses are teaching Python instead of Data Science. So you rightly pointed out. A typical Data Science education should treat Python/R as a prerequisite, not as part of the curriculum. It should train people on how to frame a business problem as a Machine Learning problem, because that’s the biggest challenge.

For example, if your client gives you an inventory management problem, your primary responsibility as a Data Scientist is to help them simplify the problem and then frame that inventory management problem as some kind of forecasting problem. That’s the primary role of Data Scientists. But the entire focus in Universities is on something that is just 5–10% of the entire process.

Kunaal: I agree with you, Moez. 80% of the courses are ML/AI/Python-based. They should focus on Business thinking, framing your business problem statements, and data preparation.

Time devoted to PyCaret

Kunaal: Okay, so you mentioned that PyCaret is like a full-time job to you, along with PwC. So how much time is devoted to creating and sustaining PyCaret currently?

Moez: When I started this, it was a full-time job. I was sometimes working 18–20 hours a day, especially during last summer. These days, after the second release of PyCaret, I have good support from the community. As we are talking now, I have developers already working on the code. And maybe in a few weeks, we will make PyCaret a public organization. So I am hoping to get more developers to contribute, and then my time would be limited to vision and leadership. But the way I managed it up till last year was only by working excess hours. There’s no substitute for hard work. It’s impossible to create something like this in a 7-hour workday.

Moez’s non-tech background impact on PyCaret

Kunaal: Great. So, you’re not originally from a tech background. Did that add to extra efforts like reading a lot and revisiting stuff?

Moez: Yeah, so not many people know this, and I am not shy to confess that when I started with the idea of PyCaret, I didn’t know Python. I had done R for a couple of months but didn’t have programming experience. So I think if you have an idea of what to do and are passionate about it, you don’t need to exactly know how to do it in the beginning. Because eventually, you will figure out the way to do it.

First, I didn’t have a computer science background, and when I started building PyCaret, I built it in Jupyter notebooks. When we released PyCaret 1.0, we didn’t have unit tests, CI/CD pipelines, etc. All of that I have learned throughout my journey. You don’t have to be perfect on day one. So obviously, I had to do a lot of digging on Google and Stack Overflow. Not knowing how to code should not be a barrier to doing something. And I repeatedly say that even if you don’t know how to code, you should know how to think in code; writing it down is not the big deal. I am not a software engineer, but I have a lot of software engineers coming in and improving the codebase without impacting the functionality of the product.

So I think coding is a communication skill, a way to communicate with machines. In whatever field you are in, you should at least be able to think in code, as it will enable you to think logically and build your problem-solving skills.

Kunaal: Absolutely. So for the audience reading, you saw how Moez has worked on this problem statement, even without knowing Python initially. So it is important to have a vision in mind and then work towards solving it rather than trying to learn so many things before you even apply that stuff.

And another thing that was called out here is that, while he was building PyCaret, there was a requirement later in the process to bring in Unit Testing, CI/CD, and so on, and that’s where we step into the software engineering part. So for those who are learning only machine learning and AI: at some point, if you are creating something big, you will also need to test your code rigorously before it goes into production. Those are some skills you also need to pick up while doing machine learning or data science.

Moez: Absolutely, I think that’s a very good point you’ve mentioned, Kunaal. I want to add that Data Science is a glorified software engineering problem. To eventually do data science not as a citizen data scientist, not as a marketing, sales, or financial data scientist, but as part of a centralized data science team responsible for managing infrastructure and setup, you have to learn software engineering. At scale, there is no way you’d be able to make it without software engineering skills.

Marketing strategy behind PyCaret

Kunaal: Agreed, Moez. So, this is the first time I’m talking to somebody who has created a Python package, and I’ve always been fascinated with the creation process of Python packages.

But I wanted to bring up something that is just as important for a package to get the recognition it deserves, and that’s marketing. So I wanted to know: when you created the product, did you have a marketing strategy to put the finished version out in the market?

Moez: Interesting. This is a good question because every time anybody asks me about it, my answer is: to you, PyCaret is a package because people use it as a Python package. For me, it’s a product where the package is only 10% of it. There’s a lot of work that goes on behind the scenes, and that’s not part of the package; that’s part of the product. For example, all the content that goes online, the website itself, all the images, artifacts, videos, and everything else. It’s a lot of work and has nothing to do with code.

So over this last year, I not only learned Python and developed PyCaret, but I also had to pick up WordPress, Joomla, and 3–4 other tools to build the website. And then, somewhere in early January, I did a soft release, a kind of 1.0 announcement, and got about 100–200 views on LinkedIn. And then I realized there are millions of projects, and this project would be just like the others if we didn’t put the word out. So, at the end of the day, it comes down to how well you can put the word out on LinkedIn, Twitter, and social media. I think 70% of my hard work went into strategizing social media: coming up with the content, the strategy, the distribution, and all that.

And if you look at our website and content, it’s not a coincidence. It’s a very thoughtful and very intentional design. Our content is written and proofread by 3–4 citizen data scientists so that it stays beginner-friendly.

Kunaal: Absolutely. At CLL, we are also working at a similar pace, so I feel you in terms of the marketing strategy: creating images, short videos, and promotions, and then getting onto Medium. All of that work is overwhelming at times, but it is also an enjoyable process because we love putting good content out.

PyCaret product-market-fit measure

Kunaal: Another thing I wanted to ask on this, Moez, is that PyCaret came out of some inspiration, but did you measure product-market fit for the open-source product?

Moez: No, I didn’t compare, and I didn’t do any form of needs analysis or gap analysis, because, as I mentioned, it started in a Jupyter notebook. When I was building it, I had no idea that we would be using PyCaret on GPUs one year later, and now we are doing so much cool engineering stuff. If I had had any idea, I would have done some needs analysis. But over time, with the 2.0 release, PyCaret has taken its own position.

So people tend to compare it with Scikit-Learn and other Machine Learning libraries, but I don’t think that’s the right comparison because PyCaret is not an algorithmic library. We don’t have algorithms, and we don’t intend to include our own algorithms. I see it as a replacement for the code you would write if you were to work directly with Scikit-Learn, XGBoost, LightGBM, etc. So I think it has its own unique position and space in the open-source ecosystem.

PyCaret vs. AutoML frameworks

Kunaal: Okay. So you’re saying it doesn’t compare directly with ML frameworks like Scikit-learn or XGBoost. But there’s some automation going on behind it. How does that automation compare to existing AutoML frameworks like H2O, TPOT, or AWS Autopilot?

Moez: I never created PyCaret with the intention of a complete autopilot. In the pre-1.0 release, we had an AutoML feature, which was pretty similar to H2O and DataRobot: you have a library of models and, based on some kind of search-space optimization, you iterate over that library and create different models. That’s what AutoML is. I think Google is the exception, as it’s using reinforcement learning to define that space. But I think the initial search space is still randomly defined with all the other tools, be they open-source or paid. It’s not using Transfer Learning or genetic algorithms.

With PyCaret, I never aimed to create AutoML because there are already enough AutoMLs. I think H2O is pretty solid. I created PyCaret for an in-person modeling experience. So you’re sitting in front of the computer, you are coding, and your aim is to train a classifier that you can use to predict new data sets, whether it’s for production or for your school work. And as you are doing that, the amount of boilerplate code that you have to maintain is a challenge. Because 90% of the students are not coming from a computer science or quantitative background, they would have an R file or a Python file somewhere, and they would just copy the code.

But I don’t think 90% of your time should be spent just coding so that you can get one number out at the end of your experiment. I wanted to replace this process for those academicians, students, researchers, and data scientists. So I replaced the need for 10 or 20 lines of code just to achieve a simple thing.

And if you read PyCaret code, it’s like the English language. We should always put comments in our work, but with PyCaret, when I put comments, my comments don’t make sense: when the code is create_model(), the comment on that line is just “creating a model.” So rather than spending 50% of your time troubleshooting your coding errors, you can think more about creative solutions to your problem.
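To make that readability point concrete, here is a minimal sketch of the kind of workflow Moez describes, assuming the PyCaret 2.x classification API and its bundled ‘diabetes’ demo dataset (dataset and target column taken from the PyCaret documentation examples; adjust them for your own data):

```python
# Minimal sketch of a PyCaret 2.x classification workflow (illustrative, not from the podcast).
from pycaret.datasets import get_data
from pycaret.classification import setup, create_model, predict_model

df = get_data('diabetes')                       # sample dataset shipped with PyCaret
exp = setup(data=df, target='Class variable',   # preprocessing + cross-validation setup in one call
            session_id=123)
lr = create_model('lr')                         # trains and cross-validates a logistic regression
holdout_preds = predict_model(lr)               # scores the held-out split
```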

So I never wanted to be in the space of developing world-class AutoML solutions because I’m not a huge believer in machines replacing everything. Don’t get me wrong, I think AutoML is good, but I think it’s over-hyped. It is like simply asking the machine to perform something under the hood. So if you do not have dozens and dozens of use cases in your back pocket, it’s not a financially viable decision to buy AutoML software, because these Machine Learning Platform-as-a-Service offerings are very expensive. And I don’t think it will replace production-level data science teams.

Significant additions in PyCaret 2.0

Kunaal: Agreed. So now you have moved from PyCaret 1.0 to 2.0, and the whole idea is that you keep making data science simpler. So what are the significant features you introduced in this release, and what features in the pipeline do you want to add in the future?

Moez: Okay. So I think the biggest change for me from 1.0 to 2.0 was that we had a lot of people logging issues. And those errors came from the fact that I never considered that people would use PyCaret on the command line, in SageMaker, or in other environments. At that time, I thought it was for the Jupyter notebook only. But after releasing 1.0, I realized that the target audience is much broader than just students. We realized we were using HTML heavily for interactive display, which was causing failures outside the notebook. So in 2.0, we made it compatible with the command line. Now you can pass a parameter when you initialize the setup and use it in the terminal, in Spyder, or wherever you want. That, I think, has just increased the market size and the number of users for us.
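For context, a hedged sketch of what this looks like: to my reading, the parameter Moez refers to is the `html` flag on `setup()` (with `silent=True` to skip the interactive data-type confirmation), which lets PyCaret 2.x run from a plain script or terminal; verify the flags against your installed version.

```python
# Sketch: running PyCaret 2.x outside a notebook (terminal, Spyder, SageMaker script).
# The html/silent flags are my reading of the feature described above.
from pycaret.datasets import get_data
from pycaret.classification import setup, compare_models

df = get_data('diabetes')
exp = setup(data=df, target='Class variable',
            html=False,    # disable the notebook-style HTML/interactive display
            silent=True)   # skip the interactive data-type confirmation prompt
best = compare_models()    # now works in a plain Python script
```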

The second biggest thing in 2.0 is the integration with MLflow. MLflow, which was created by Databricks, gives you the end-to-end ability to not just run models but also to maintain the life cycle of the model. So we integrated PyCaret with MLflow on the back end. Now you can use PyCaret not just to train your models but also to store the artifacts and manage the deployment, which makes it more accessible and practical. If you use a Jupyter notebook to train your models, how would you keep track of all the metadata? Data Science is a bit different from Software Engineering as it involves experimentation, and because of that, you generate a lot of metadata. If you train 50 models, you need to store thousands of data points in terms of metadata: all the models’ hyperparameters, the pre-processing pipelines you used to build each model, and the metrics you used to measure accuracy. One way is to run it in a notebook, copy the numbers into Excel, and keep storing it all in a table. You can do that for maybe five models, but that would be a very stupid idea if you are a regular user. So for me, that’s a game-changer, because I’m not just the creator of it; I use it at my work, and I know a couple of companies using it already. It’s a wonderful feature.
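As a rough illustration of the MLflow integration he describes, assuming the PyCaret 2.x `log_experiment` and `experiment_name` setup flags (the experiment name below is a placeholder):

```python
# Sketch of PyCaret 2.x experiment logging via MLflow; 'diabetes-demo' is a made-up name.
from pycaret.datasets import get_data
from pycaret.classification import setup, compare_models

df = get_data('diabetes')
exp = setup(data=df, target='Class variable',
            log_experiment=True,             # log runs to the MLflow backend
            experiment_name='diabetes-demo')
best = compare_models()                      # each trained model is recorded as an MLflow run
# Afterwards, running `mlflow ui` in a terminal shows metrics, hyperparameters, and artifacts.
```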

Integration with Data Analytics tools

Kunaal: Absolutely. I’ve also tried integrating PyCaret with PowerBI, and the process was seamless. The way it was done, the simplicity with which I wrote the code and then just ran it, and it all worked together, was great. Did you think about integrating it initially, or did it come up later because of the use cases you’re dealing with in your current roles?

Moez: No, even when we released 1.0, we had these integrations in our minds. We knew it was working with Alteryx 9, PowerBI, QlikSense, and other tools. But the fact is that I was always focused on a group that did not come from a computer science background, which meant I was targeting citizen data scientists. And I know visualization is one big part of the job, even if you’re a citizen data scientist. So I think it came naturally and was very much intentional from the beginning. And I don’t think that machine learning in PowerBI was a very popular idea before PyCaret.

Kunaal: We’d love to see some of these use cases come up, especially when we’re trying to do cluster analysis and show how different clusters are moving for different customers, so that everybody can look at it and make the necessary targeting decisions based on that cluster analysis.
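As a hedged sketch of that kind of cluster analysis with PyCaret’s clustering module (inside Power BI the same few lines would go into a Python script step on the query’s DataFrame; the demo dataset and cluster count below are purely illustrative):

```python
# Illustrative cluster analysis with pycaret.clustering; 'jewellery' is a PyCaret demo dataset.
from pycaret.datasets import get_data
from pycaret.clustering import setup, create_model, assign_model

df = get_data('jewellery')
exp = setup(data=df, normalize=True, session_id=123)
kmeans = create_model('kmeans', num_clusters=4)  # cluster count chosen for illustration only
labelled = assign_model(kmeans)                  # original rows plus a 'Cluster' label column
```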

How does PyCaret fit in the existing toolbox?

Kunaal: Okay. So for the many data scientists who are currently not working with PyCaret, what’s in it for them? How do you see current data scientists augmenting their entire data science journey with PyCaret?

Moez: I don’t think it has to replace anything you’re currently doing. It’s just one more tool in your toolbox. The thing is that it would save you a lot of time. And I think in the future it wouldn’t be about who can build the models; it would be more about who can do it efficiently, because I’m sure companies would care about that.

There’s no learning curve because, under the hood, it’s the same Scikit-learn, XGBoost, LightGBM, etc. that you’ve been using. What it does is save you time. When you start your training process, you don’t have to set up your cross-validation strategy every time, you don’t have to calculate metrics and log them, and you don’t have to write ten lines of extra code to create a Pandas DataFrame to present the results. You don’t have to write three lines of code every time you want an AUC plot or a confusion matrix, and you don’t have to write 15 lines of code to ship your model to AWS S3, GCP, or Azure blob storage.
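For instance, the one-liners he alludes to look roughly like this, assuming a fitted PyCaret model `lr` (such as the one from the earlier sketch) and the PyCaret 2.x plotting and deployment helpers; the S3 bucket name is a placeholder:

```python
# Rough sketch of the plotting and deployment shortcuts mentioned above.
from pycaret.classification import plot_model, deploy_model

plot_model(lr, plot='auc')                # AUC curve without hand-written matplotlib code
plot_model(lr, plot='confusion_matrix')   # confusion matrix in a single call
deploy_model(lr, model_name='lr-demo',    # pushes the model to cloud storage
             platform='aws',
             authentication={'bucket': 'my-example-bucket'})  # placeholder bucket
```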

So I think there’s nothing new. There is no learning curve, or a very minimal one. So there is nothing to lose for data scientists. We have created it in a way that you can use it in your existing ecosystem, even to the point where, if you don’t have anything else, you can take PyCaret and write the code in your SQL Server, literally. And with its low code, you are writing create, tune, and ensemble for modeling. They are not even lines; they’re just words. So I think not using it because you think it would do you any harm is a very stupid idea.

Funding plans for PyCaret

Kunaal: Okay. So I wanted to move our focus to how PyCaret sustains itself financially, as it’s open source. Do you plan to bootstrap it or get some funding for it?

Moez: I have self-funded everything, including website domains, licenses, registrations, and videos. I haven’t asked for funding from anybody, and nobody has shown interest either. But going forward, especially after the 2.0 release, we have a few tech players reaching out to us. And I think we have good prospects of getting support from these big companies in terms of money and developer support to keep it going.

Let’s have some fun moments with Moez.

Why the name ‘PyCaret’?

Kunaal: We’ll just take a couple more questions. I’m very curious to know why the name PyCaret?

Moez: So when I was in school, I was very impressed by the work of Dr. Max Kuhn, who built the caret package in R. “Caret” stands for Classification And Regression Training, and “Py” represents Python. When we started this, I had no idea that it would expand to Unsupervised Learning, NLP, and all that. So Caret is just a representation of classification and regression training, obviously influenced by Dr. Max Kuhn’s work in R.

How to contribute to PyCaret?

Kunaal: Great. So we also want to know, as a CLL community, how can we contribute to PyCaret? What are the ways that you think will help you in terms of improving PyCaret?

Moez: Absolutely! So, until the first release of PyCaret, there was a contributions page on our website, but I had no idea what to do with people reaching out to contribute. We were not that mature on GitHub. We were not that organized. But now, as we speak, most things are streamlined.

So if you are into coding and technical things and want to contribute, we need you. Go to our GitHub; there are open issues, Help Wanted tags, and urgent items. I also pin things at the top of our GitHub, which are my personal messages/requests or calls for help from experts. There is an active sprint for the next 2.1 release, which will end this month, so you can contribute there. And there will be major refactoring projects, and I’m already forming a team, so feel free to reach out to me on LinkedIn. If you are an experienced, seasoned software engineer and can give your time, I think you should.

And if you are not a software engineer, you can still contribute a lot by contributing content and improving the documentation. And I consider content as a very big part of PyCaret. So if you’re not into coding but can write, then reach out to me. That’s a very important area to contribute as well.

Kunaal: Awesome. So, for the audience reading this: you saw that there are a lot of ways you can contribute. And obviously, many use cases can be built with PyCaret that can help the community learn how to use it. So if you have those use cases, please keep them coming.

Okay, Moez, we have come to the end of this podcast. It was an amazing session!

Moez: Okay, so thanks, Kunaal, for having me here. It was a pleasure and fun to be here tonight. And to all the readers, make sure you subscribe to Co-learning Lounge and their YouTube channel as well as PyCaret’s YouTube channel. People here at Co-Learning Lounge are doing fantastic work building tech communities across the globe.

Kunaal: Thank you. And we just wanted to call out Chayan Kathuria’s effort and Yogesh Kothiya’s effort in putting all of this together. Thank you.

Listen to the full episode here 🎧

If you like what we do and want to know more about our community 👥 then please consider sharing, following, and joining it. It is completely FREE.

Also, don’t forget to show your love ❤️ by clapping 👏 for this article, and let us know your views 💬 in the comments.

Join here: https://blogs.colearninglounge.com/join-us