16. Reuben Tan on Data Lake, Data Literacy and Click through rate

· AI,podcast


Podcast with Reuben Tan Part 1


Reuben shares his backstory of how he got into the digital AI space. He believes companies undergoing digital transformation should start by thinking about data storage solutions. Building a data lake also requires one to consider scalability. Data literacy is key to making digital transformation succeed, yet it is often neglected. Reuben also shares a digital marketing use case to illustrate the importance of picking the right metrics to drive change.

[00:00:00] Andrew Liew Weida: Hi, everyone. Welcome to the AI of mankind show, where I share anything interesting about mankind. I'm your host for this season. My name is Andrew Liew. I have worked across four continents and 12 international cities. I have also worked in tech startups across a range of roles, from selling products, making customers happy, figuring out fundraising, and making finance tick, to building teams and developing sticky products. Apart from building startups, I've also worked in Fortune 500 companies as a chief data scientist or technologist and people leader. You can call me a jack of all trades or a master of learning. I hope to make this podcast show a great learning experience for us. In each season, there is a series of interesting topics on which I invite guests to share their views about their life and interests.

[00:01:09] Andrew Liew Weida: Now let the show begin.

[00:01:26] Andrew Liew Weida: Hi, everyone. Allow me to introduce my guest today. Reuben is a data scientist currently at Crimson Logic. He completed his undergraduate studies in mathematics and economics at the Nanyang Technological University, NTU for short, and started his career at DBS Bank as a data scientist and machine learning engineer.

[00:01:46] Andrew Liew Weida: He's familiar with natural language processing and computer vision, and has dabbled in speech-to-text software. As a young professional in technology, Reuben constantly upskills himself, is genuinely curious, and enjoys learning new things. Outside of work, he volunteers to give back to the community. During his free time, he enjoys learning or reading about economics, philosophy, and sustainability, as well as watching films and TV. Let's welcome Reuben.

[00:02:26] Andrew Liew Weida: Hey, Reuben. Welcome to the show. So tell me about yourself. How did you get from where you finished school to where you are today?

[00:02:36] Reuben Tan: All right. Thanks, Andrew. Honored indeed to be on your show. No problem. With regards to that question, I think my story begins back when I was 17. I've always been interested in computers and computer gaming, so I picked up a project on cybersecurity at the Defence Science Organisation, aka DSO. At that point in time, I was thinking that coding can't be that hard. I got my first taste of programming. I had to teach myself some C++ and Python, and at that point in time, honestly, I was not very good at it and I struggled a lot. This led me to believe that I was not cut out for coding. I ended up doing math and economics, and a couple of the basic modules there had coding. I began to find out through my peers that this struggle is normal; it is something that you don't adapt to very easily. My prior experience helped me a lot, and soon I began to enjoy it. I felt that it was very similar to solving puzzles, and I had a knack for it. So in my second year I participated in a hackathon that was organized by DBS. This was still in the early days of e-payments, and PayLah! was still relatively new. GrabPay and ShopeePay weren't out there yet. So my team came up with the idea of having an app to do the reservation, the ordering, and the payment for you, all in one integrated platform.

[00:03:46] Reuben Tan: So let's say I'm a lazy customer. I would be able to reserve a table for two at, let's say, 7:30 PM. And I already know what the two of us want to order. So we just pull out the app and reserve. Then all we do is come in, sit down, and eat without any hassle. After crunching some numbers and some hacking for a couple of days, my team ended up winning, and we were offered some seed money to implement our idea. From this, I gained a bit more confidence in my coding ability, as well as learned how to pitch and share ideas, because it was not just about my own ability. I was looking at other people over there, looking at what they did, how they pitched, how they presented, and how they thought. It was a very good experience. Next, when I graduated, I felt that I was interested in using my skill set in an applicable industry, and I came across the DBS SEED program. After my tests and interviews, I found myself placed in the institutional banking tech analytics team. There, I had the role of a data scientist, business analyst, and machine learning engineer. As I didn't have a CS background, I had to put in quite a lot of effort to pick up a couple of things like SQL, but with some perseverance and help from senior colleagues I could familiarize myself with the Hadoop ecosystem. One of my early projects was a classification problem on free text.

[00:05:02] Reuben Tan: It was actually a mix of English and Hong Kong Chinese characters. This was in 2018. I think it was relatively new, and I was not as resourceful as I am now. So both of us had to struggle to figure it out, because we didn't really have a lot of ideas at that point in time. My background was more mathematics than data science. My economics background had a strong focus on regression, and I was quite strong at that, but I did not really know much about NLP. So we ended up doing some research and trial and error. We eventually got our first success, which was a simple SVM model. After a lot more practice and research, we started to ensemble the models and use a different cleaning process for the data, and bit by bit the accuracy improved. I found myself enjoying this iterative process, and I started branching out a bit more into other aspects of data science and machine learning. Eventually I got a project. It was a new challenge, but I was already familiar with self-learning. I had a creed to move fast and break things, like what Mark Zuckerberg says at Facebook.

[00:05:57] Reuben Tan: Bit by bit I explored different domains in ML and data science. After close to 3 years, I felt I still had much to learn about the IT role, specifically on the architectural side. I found an opportunity at Crimson Logic, and they accepted me under the chief architect office as a data scientist.

[00:06:16] Andrew Liew Weida: Tell me more about what you are doing in your position at Crimson Logic. What are the multiple hats that you're wearing?

[00:06:23] Reuben Tan: My team at Crimson Logic is a bit smaller and a bit younger in the data journey, so I have to wear multiple hats. I have a data scientist hat and an architect hat, as well as data analyst and data engineer hats. In other words, I'm handling the project a lot by myself, which I think is very fun. I enjoy this, as I feel I have a lot of ownership and control over the steps, and it allows me to flex my knowledge, which I feel is very important and very applicable across all sorts of software engineering, not just data in general. I'm working on data virtualization and dashboards with some prescriptive analytics, as well as some kind of partner automation in AWS. So one [00:07:00] of the things that I picked up at Crimson Logic was AWS. I felt that learning AWS is very fun, and I feel that knowing at least one cloud platform is very beneficial to any technology career. It leads you to learn about the entire software architecture, and it will empower you in your discussions with your architect and your infra team. So whether you are in a new startup or an older company undergoing digital transformation, it should add a bit to your skill set. In this age of technological disruption, it is digitize or die, and there are going to be a lot more opportunities for you to help digitize if you are familiar with at least one cloud software architecture, as the rest can also be easily picked up.

[00:07:38] Andrew Liew Weida: Yeah. I agree with you that these days it's very useful to wear multiple hats and try all sorts of different paths or tasks to gain the end-to-end skill of deploying machine learning or artificial intelligence systems. In your case, what I want to ask is this: the problem sets that you are building these solutions and systems for, are they internal organizational problems or external-facing problems? For example, are you solving, let's say, HR, finance, or marketing problems internally, or are you working on, let's say, security or payment products? Which one are you really working on?

[00:08:17] Reuben Tan: Luckily, I'm doing more internal stuff. It has also gotten me a bit more familiar with the company's data.

[00:08:22] Andrew Liew Weida: I remember one of the Crimson Logic projects was two-factor authentication; that was probably many years back. So now that you say it's in shipping, apparently the company pivots a lot. And with people like you, they're able to pivot faster and be more agile. Coming back to companies like Crimson Logic being able to pivot a lot, and you having been at DBS and worked with a lot of different teams: what are the challenges that companies face when they do digital transformation, when they deploy AI or work with data?

[00:08:54] Reuben Tan: I believe the first thing to do is to start thinking of data storage solutions, and I feel that they need to start building a data lake. People don't know the value of data until they or someone else truly needs it. That is, of course, provided it is legal and somebody else needs the data. I think The Economist magazine said in 2018 that data is the new oil and, like oil, the more we have the better. Next, they need to start training employees to be data literate: not just to have a basic understanding of data science, but also to understand what data means and what their specific data means. This applies to both tech and business. The business side should understand how the data is structured and the kind of data they have. The tech side should understand more about the business application. Don't just think of the software or the database, but of what this data actually means. From there, I think business decisions can be made after some research and after consulting whatever available data they have and, if possible, making projections, even if they're simple. But I think the most important part is to choose metrics with care. One of the most important things I learned in economics is that people respond to incentives. They will game the system. For example, say you're running a marketing campaign and your metric is simply the click-through rate. Then the team has an incentive to minimize the number of emails sent. They lose some potential business because fewer emails now go out, and they only send to those they're very sure will click. From there, this data point looks very high. So in this case, you are actually losing revenue overall. It's not good business.
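
The click-through-rate point can be sketched with a few lines of arithmetic. This is a minimal illustration, not anything shown in the episode: the click probabilities, the `REVENUE_PER_CLICK` value, and the `campaign_summary` helper are all hypothetical, chosen only to show how sending fewer emails can raise the rate while lowering expected revenue.

```python
# Hypothetical recipients: each has an assumed probability of clicking.
# All numbers here are illustrative, not data from the episode.
click_probabilities = [0.9, 0.8, 0.3, 0.2, 0.1, 0.1, 0.05, 0.05]
REVENUE_PER_CLICK = 10.0  # assumed flat revenue per click


def campaign_summary(probabilities):
    """Return expected clicks, click-through rate, and revenue for one send."""
    emails_sent = len(probabilities)
    expected_clicks = sum(probabilities)
    return {
        "emails_sent": emails_sent,
        "expected_clicks": expected_clicks,
        "click_through_rate": expected_clicks / emails_sent,
        "expected_revenue": expected_clicks * REVENUE_PER_CLICK,
    }


# Strategy A: send to everyone on the list.
send_all = campaign_summary(click_probabilities)

# Strategy B: "game" the metric by sending only to near-certain clickers.
send_safe = campaign_summary([p for p in click_probabilities if p >= 0.8])

for name, result in [("send to all", send_all), ("send to sure clickers", send_safe)]:
    print(name, result)
```

Under these assumed numbers, the narrower send reports a much higher click-through rate but fewer expected clicks in total, which is the trade-off Reuben describes.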

[00:10:23] Andrew Liew Weida: You mentioned four points. The first one is about the data: don't be too late to start building the data lake, so that when somebody needs the data, you are not scrambling everywhere. The second point is enabling the company to upskill their people to be data literate. The third one is so that they can be data-driven. And the fourth one is choosing the right metrics because, as economists know (I'm also trained in economics), people respond to incentives.

[00:10:53] Andrew Liew Weida: Now I want to drill a bit further into the process of collecting data. Considering that you have been in two companies, what is the common thing that people always get wrong in terms of building a data lake or aggregating data from different sources?

[00:11:13] Reuben Tan: I feel that with the data lake, you need to be able to scale, because whether you are currently small or big, eventually you're going to get big. And normally in data, the bigger, the better. If I'm not wrong, right now Hadoop is still the most prevalent data technology out there for data lakes and warehouses. And I believe if you do it on AWS, there's a whole array of solutions to help you. I think there's Kinesis for real-time data, and there's Lake Formation for helping you get your data into the lake itself. There are also other products, Snowmobile and Snowball, to help you import already existing data and migrate it over to the server. After you build your data lake, you should be able to start doing projects. And I feel that when it comes to doing projects, it's okay to start small, even if they're very simple. For example, before jumping into data science or AI, you can start with business intelligence. Dashboards are an amazing and helpful tool for C-level executives to make a decision, because they are very summarized, very concise, and very fast.

[00:12:21] Andrew Liew Weida: One of the challenges that most companies have when they start to build the data lake is trying to figure out the different tools to piece together. That's where you recommend a few different products, like Snowmobile. Now coming back to the second point, I want to ask: what does data literacy mean to you, and how can we train people to have data literacy?

[00:12:46] Reuben Tan: To me, data literacy means that they understand the basics. Nothing too advanced; maybe basic statistics. In other words, they need to know basic projections and basic A/B testing, all the more common, simple stuff. They also need to be very comfortable with data wrangling, in the sense that they should be able to open up Excel, roughly filter for things here and there, and know how to use data to answer questions. That's the first part of data literacy: understanding how data works. The second part would be to actually know your own data. Let's say you have a product, and its data is sitting in a database somewhere. You need to know how the database comes together. And if you're going to take this data and merge it with some other existing data that you have, how are the joins going to come together?
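
As a small illustration of the "how are the joins going to come together" question, here is a sketch in plain Python. The `products` and `sales` tables, their column names, and the `inner_join` helper are all made up for this example; a real project would more likely use SQL or a dataframe library.

```python
# Hypothetical tables; names and values are made up for illustration.
products = [
    {"product_id": 1, "name": "Widget"},
    {"product_id": 2, "name": "Gadget"},
    {"product_id": 3, "name": "Gizmo"},
]
sales = [
    {"product_id": 1, "units": 5},
    {"product_id": 1, "units": 2},
    {"product_id": 2, "units": 7},
    {"product_id": 4, "units": 1},  # no matching product row
]


def inner_join(left, right, key):
    """Join two lists of dicts on a shared key, keeping only matching rows."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [
        {**left_row, **right_row}
        for left_row in left
        for right_row in index.get(left_row[key], [])
    ]


joined = inner_join(products, sales, "product_id")

# Keys present on only one side silently drop out of an inner join,
# which is worth checking before trusting the merged result.
product_ids = {row["product_id"] for row in products}
sales_ids = {row["product_id"] for row in sales}
unmatched_products = product_ids - sales_ids
unmatched_sales = sales_ids - product_ids

print(joined)
print(unmatched_products, unmatched_sales)
```

Knowing your own data, in this sense, includes knowing which keys line up across tables and which rows would be lost or duplicated when they are merged.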

[00:13:39] Andrew Liew Weida: Now that you say that, it rings a bell from when we were working together before. In terms of data literacy, you mentioned it's really about helping people understand how to get the data, where the sources come from, and how to read that data. What does the data mean? For example, is it continuous data or discrete data? Continuous means like 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, and discrete could be, let's say, male and female: 0 is male, 1 is female. It may seem simple to us, but to the layman, when they start on this, it's like when we were learning our ABCs as babies, right? Non-data, non-technical business people have to understand how that language translates to data; it means a different thing. So we all know that money is most likely a continuous data value and gender is a discrete or discontinuous kind of value.

[00:14:38] Andrew Liew Weida: And coming back to the third point, or the fourth point, you mentioned choosing the metrics with care. There are two parts here when thinking about metrics: one is getting clarity about what a metric means, and the other is selecting metrics. I think the most famous management guru, Peter Drucker, mentioned that you can't manage what you cannot measure.

[00:14:59] Reuben Tan: Actually, I agree, but what I am saying is that choosing the metrics with care is like measuring with care, because you need to know what you're measuring and you need to know your objective. So you have to craft it in a way that what you're measuring fits what you're trying to maximize or minimize.

[00:15:17] Andrew Liew Weida: Yeah. Do you have a good classic example that you always use?

[00:15:20] Reuben Tan: I think the previous example that I gave, with regards to click-through rate, was good enough. I think it is in fact a common mistake. A lot of business users I have encountered, and not just me, a couple of my friends have encountered this as well, are very insistent on tracking click-through rate. As the data scientist, I feel like I ethically cannot go and do that. I do not want to game the system. I do not want to purposely send only to those that I know are more responsive.

[00:15:46] Reuben Tan: I do want to maximize the potential revenue, so I have to go and educate them. I have to try to convince them. I say to them, hey, we shouldn't be using this rate as the success metric. What we can do instead is a forecast of usage if we do not run a campaign. Then, after we run the campaign, we go and measure the difference and see if the result is statistically significant. Maybe that would be a better way to measure success.
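
One way to sketch the "measure the difference and check significance" idea is shown below. Note the swap: instead of a forecast, this example assumes a holdout group that did not receive the campaign, and compares it with the campaign group using a standard two-proportion z-test. The counts and the `two_proportion_z_test` helper are hypothetical, and this is not necessarily the method Reuben used.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(successes_a, total_a, successes_b, total_b):
    """Two-sided z-test for a difference between two conversion rates."""
    rate_a = successes_a / total_a
    rate_b = successes_b / total_b
    # Pooled rate under the null hypothesis that both groups behave the same.
    pooled = (successes_a + successes_b) / (total_a + total_b)
    std_error = sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (rate_b - rate_a) / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return rate_b - rate_a, z, p_value


# Hypothetical counts: customers who used the product in each group.
holdout_users, holdout_total = 100, 1000    # did not receive the campaign
campaign_users, campaign_total = 150, 1000  # received the campaign

uplift, z, p_value = two_proportion_z_test(
    holdout_users, holdout_total, campaign_users, campaign_total
)
print(f"uplift={uplift:.3f}, z={z:.2f}, p={p_value:.4f}")
```

A small p-value here would suggest the uplift is unlikely to be noise, which ties success to the change in usage rather than to how selectively the emails were sent.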

[00:16:12] Andrew Liew Weida: So you mentioned click-through rate. I have personally dealt with marketers before, and they feel that revenue is way too far down what they call the funnel for them to track. But most companies should care about tracking revenue, profits, or costs, because these are the more tangible measures that, from an economic perspective, drive the entire operation of a company. Interestingly, the perverse thing about selecting a metric as an incentive measure is that, if you think about it, it's easier for marketers to game the click-through rate than to game revenue. If I know that the KPI, or key performance indicator, on my job that's being measured to decide my bonuses [00:17:00] or whether I should stay in the company is a metric that I can control, I will be more incentivized to tell you, hey Reuben, this is the right metric to use. What do you have to say about that?

[00:17:11] Reuben Tan: I believe in economics terms, this would be called moral hazard. This is their incentive, so they have no choice. They have to push it. Maybe my argument makes sense to them, and they think, oh yeah, he's making sense, but oh no, I know that if he does this, my KPI will go down, even if it benefits the company. So I think sometimes we actually have to help the business people, these marketers. We have to go and bring it to the attention of their supervisors: hey, maybe you shouldn't be tracking them on this. Maybe this is not the right way. We have to constantly challenge preexisting notions of what is a good metric, because things will always change.

[00:17:48] Andrew Liew Weida: So your preferred approach would normally be to go to them and discuss these particular metrics, in our case click-through rate, and say, hey, maybe this click-through rate doesn't really serve the purpose of what we are trying to solve, and then get their buy-in and change it. So out of 100 cases or 10 cases, how many actually work, and why did it work or not work?

[00:18:13] Reuben Tan: Oh, I wish I had the number on that now, but I think my sample size is a little bit too small. Obviously, you have to keep escalating to go and convince more and more people. To ensure that they do change this KPI, we have to deal with their supervisor. And I think sometimes a supervisor cannot make that kind of judgment call. So you have to keep going up the levels. At some point, it's just a bit too difficult to push any further. No choice; we have to compromise. At the end of the day, it is not a horrible metric. It's somewhat decent, but it's just not the best metric in my opinion.

[00:18:47] Andrew Liew: Hi everyone, thanks for tuning into this episode. We have come to the end of part 1 with Reuben. In the next episode, we will continue with Reuben in part 2, in which he shares his views on leadership and cost-benefit scalability analysis. Reuben gives an example of the concept of decreasing returns to scale in the application of AI. Lastly, Reuben shares his views on the future of AI and the 2 schools of thought.

[00:19:08] Andrew Liew: If this is the first time you are tuning in, remember to subscribe to this show. If you have subscribed to this show and love this episode, please share it with your friends, family, and acquaintances. See you later and see you soon.