You've probably heard a lot about data science, artificial intelligence and big data. Frankly, these areas have attracted a lot of hype, and that hype has inflated expectations about what data science and data can actually accomplish. Overall, this has been bad for the field. It is useful to think about the questions you can ask to separate the hype of data science from the reality.

The first question is always "What is the question you're trying to answer with the data?" If someone comes to talk to you about a big data project, an artificial intelligence project, or a data science project, and they start with the newest technology for distributed computing and machine learning and throw a bunch of buzzwords at you, that is the question to ask. It narrows the conversation and filters out a lot of the hype around tools and technologies. Those can be very interesting and fun to talk about, and we like to talk about them too, but they are not going to add value to your organization on their own.


The second question to ask yourself, once you've identified the question you're trying to answer with the data, is, "Do you have the data to actually answer that question?" Often the question you want to answer and the data you have are not compatible with each other. So you have to ask yourself, "Can we get the data in such a way that we can answer the question we want to answer?" Sometimes the answer is simply no, in which case you have to give up (for now). Bottom line: to decide whether a project is hype or reality, you have to decide whether the data people are trying to use is actually relevant to the question they are trying to answer.

The third thing to ask yourself is, "If you could answer the question with the data you have, could you even use the answer in a meaningful way?" This question goes back to the Netflix competition, which produced a solution to the problem of predicting what videos people would like to watch. It was a very, very good solution, but it could not be implemented with the computing resources Netflix had in a way that was financially viable. Even though they had a specific question, the right data, and an answer, they couldn't actually implement the results of what they found out.

If you ask yourself these three questions, you will be able to decipher very quickly whether a data science project is all hype or whether it is a real contribution that can actually move your organization forward.
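The three questions can be read as an ordered checklist, where the first unmet check is the reason to pause. The following is only a minimal sketch of that idea; the names `ProjectProposal` and `first_unmet_check` are hypothetical and invented for illustration, and in practice each field is a human judgment rather than something a program can decide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProjectProposal:
    """Hypothetical record of how a proposal fares on the three questions."""
    question: Optional[str]   # the specific question the data should answer
    data_can_answer: bool     # is the available data relevant to that question?
    answer_is_usable: bool    # could the organization act on the answer?


def first_unmet_check(proposal: ProjectProposal) -> Optional[str]:
    """Return the first question the proposal fails, or None if it passes all three."""
    if not proposal.question or not proposal.question.strip():
        return "What is the question you're trying to answer with the data?"
    if not proposal.data_can_answer:
        return "Do you have the data to actually answer that question?"
    if not proposal.answer_is_usable:
        return "Could you use the answer in a meaningful way?"
    return None


# A proposal that names tools but no question stops at the first check.
buzzword_pitch = ProjectProposal(question="", data_can_answer=True, answer_is_usable=True)
print(first_unmet_check(buzzword_pitch))
```

The checks run in the same order as the questions above, because a project with no stated question cannot be judged on its data or on the usefulness of its answer.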

How do you determine the success of a data science project?

Small businesses rarely use cutting-edge technology, simply because it is not within their budgets, expertise or resources. However, almost all are called upon to experiment with such technology, because if they don't, someone else will, and ultimately whoever does will gain in competitiveness, cost or profitability.

Defining success for an AI project (more precisely, a data science or machine learning project) is a crucial part of managing a data science experiment.

Of course, success is often context-specific. However, some aspects of success are general enough to merit discussion. My list of hallmarks of success includes:


* The creation of new knowledge.
* Decisions or policies are made based on the outcome of the experiment.
* A report, presentation or app with impact is created.
* You learn that the data cannot answer the question you are asking.

Some more negative outcomes are: decisions that ignore clear evidence from the data, results that are equivocal and shed no light in either direction, and uncertainty that prevents the creation of new knowledge.

Let's talk first about some of the positive outcomes.

New knowledge seems ideal to me. However, new knowledge is not necessarily important knowledge.

If it produces actionable decisions or policies, even better. (Wouldn't it be great if there were evidence-based policy, like the evidence-based medicine movement that has transformed medicine?) Having our data science products make a big (positive) impact is, of course, the ideal. Creating reusable code or applications is a great way to increase the impact of a project.

Finally, the last point is perhaps the most controversial.

I consider a data science project to be successful if we can demonstrate that the data cannot answer the questions being asked. I remember a friend telling a story about the company he worked for. They hired many expensive data science consultants to help use their data to inform pricing. However, the prediction results were not helping.

They could see that the data could not answer the question being studied. There was too much noise, and the measurements did not accurately capture what was needed. Sure, the result was not optimal, as they still needed to know how to price things, but it did save money on consultants. Since then, I have heard this story repeated almost identically by friends in different industries.

Also Read:
* How the Biggest Companies in the World Design Machine Learning Applications
* What Is Open Innovation In Data Science?

Most Related Articles

Business

How the Biggest Companies in the World Design Machine Learning Applications

I'm often asked, "what kind of machine learning project should I work on?"

And I usually answer with "follow your curiosity."

Why?

Because of how experimental machine learning is, it's in your best interest to figure things out through tinkering. By trying things which might not work.

However, machine learning projects are no longer works of magic. The device you're reading this on probably uses machine learning in several different ways you're not aware of (see Apple's implicit machine learning below).

That being said, this issue of ML Monthly (April 2021 edition) collects different design best practices from companies using machine learning at world-scale proportions.

And after reading through them, you'll start to notice there are many overlaps in how things are done. This is a good thing. Because the overlaps are what you can use for your own projects.

As models and machine learning code become more and more reproducible, you'll notice an overarching theme here: machine learning is an infrastructure problem.

Which is something you've known all along, "how do I get data from one place to another in the fastest, most efficient way possible?"

If you're considering working on your own machine learning projects, read through each of the guidelines below and try the materials in the bonuses section, but remember, none of these will replace the knowledge you gain from experimenting yourself (guidelines, schmuidelines).

Note: I have used the terms machine learning and artificial intelligence (AI) interchangeably throughout this article. You can read "machine learning system" as "AI system" and vice versa.

Apple's Human Interface Guidelines for Machine Learning

I'm writing these lines on an Apple MacBook in a library where I can see at least 6 other Apple logos.
This morning I watched two people in front of me pay for their coffee using their iPhones.

Apple devices are everywhere.

And they all use machine learning in many different ways, to enhance photos, to preserve battery life, to enable voice searches with Siri, to suggest words for quick type.

Apple's Human Interface Guidelines for Machine Learning share how they think about and how they encourage developers to think about using machine learning in their applications.

They start with two high-level questions and break it down from there:

* What is the role of machine learning in your app?
* What are the inputs and outputs?

For the role of machine learning in your app, they go on to ask, is it critical (need to have) or complementary (nice to have)? Is it private or public? Is it visible or invisible? Dynamic or static?

For the inputs and outputs (I'm a big fan of this analogy because it's similar to a ML model's inputs and outputs) they discuss what a person will put into your system and what your system will show them.

Does a person give a model explicit feedback? As in, do they tell your model if it's right or wrong? Or does your system gather implicit feedback (feedback which doesn't require a person to do any extra work other than use the app)?

Questions to think about when asking what role machine learning plays in your app/feature. Source: https://developer.apple.com/design/human-interface-guidelines/machine-learning/overview/roles/

Google's People and AI Research (PAIR)

Google's design principles for AI can be found in their People and AI Research (PAIR) guidebook.

The PAIR guidebook also comes along with a great glossary of many different machine learning terms you'll come across in the field (there's a lot).
It breaks down designing an AI project into six sections.

User Needs + Defining Success

* Where's the intersection of what AI is capable of and what the people using your service require?
* Should you automate (remove a painful task) or augment (improve) with AI?
* What's the ideal outcome?

Data Collection + Evaluation

* Turn a person's requirements into data requirements (it all starts with the data)
* Where does your data come from? (is it responsibly sourced?)
* Build, fit and tune your model (good models start with good data)

Mental Models (setting expectations)

* What does a person believe your ML system can achieve?

Explainability + Trust

* AI systems are probability-based (and may give strange results), how can this be explained?
* What information should a person know about how a ML model made a decision? (confidence levels, "we're showing you this because you liked that...")

Feedback + Control

* How can a person give feedback to help your system improve?

Errors + Graceful Failure

* What is an "error" and what is a "failure"? (a self-driving car stopping at a green light could be an error but running a red light could be a failure)
* ML systems aren't perfect and your system will eventually fail, what do you do when it does?

Each section comes with a worksheet to practice what you've learned.

A trend you'll notice after going through the guidelines (especially PAIR) is setting expectations. Being upfront with what your system is capable of.
If a person expects your system to be magic (as ML is often portrayed) but isn't aware of its limitations, they may be let down.

Microsoft's design guidelines for Human-AI interaction

Microsoft's design guidelines for Human-AI interaction tackle the problem in four stages:

* Initially (what should a person know when they first use your system?)
* During interaction (what should happen whilst a person is using your service?)
* When wrong (what happens when your system is wrong?)
* Over time (how does your system improve over time?)

You'll notice Microsoft's guidelines take you on a walk in a person using your ML system's shoes. And again we see a trend.

Problem → Create solution (ML or not) → Set expectations → Allow feedback → Have a mechanism for when it's wrong → Improve over time (go back to the start).

Microsoft's guidelines for Human-AI interaction cards, starting with initial stages through to what to do as a person interacts with your machine learning system over time.
Source: https://www.microsoft.com/en-us/research/project/guidelines-for-human-ai-interaction/

Facebook's Field Guide to Machine Learning

While previous resources have taken the approach of an overall ML system, Facebook's Field Guide to Machine Learning focuses more on the modelling side of things.

Their video series breaks a machine learning modelling project into six parts:

* Problem definition: what problem are you trying to solve?
* Data: what data do you have?
* Evaluation: what defines success?
* Features: what features of the data best align with your measure of success?
* Model: what model best suits the problem and data you have?
* Experimentation: how can you iterate and improve upon the previous steps?

But as the modelling side of things in machine learning gets more accessible (thanks to pretrained models, existing codebases, etc), it's important to keep in mind all of the other parts of machine learning.

I used Facebook's Field Guide to Machine Learning as the outline of the Zero to Mastery Data Science and Machine Learning Course. You can also read an expanded version of these steps on my blog.

Spotify's 3 Principles for Designing ML-Powered Products

How do you build a service which provides music to 250 million users across the world?

You start by going manual before you go magic (principle 3) and you continually ask the right questions (principle 2) to identify where the people using your service are facing friction (principle 1).

The sentence above is a play on words of Spotify's three principles for designing machine learning-powered products.

Principle 1: Identify friction and automate it away

Anywhere a person struggles in pursuit of their goals whilst using your service can be considered friction.

Imagine a person searching for new music on Spotify but unable to find anything which suits their tastes.
Doing so could hurt someone's experience.

Spotify realized this and used machine learning-based recommendation systems to create Discover Weekly (what I'm currently listening to), a playlist which refreshes with new music every week.

And in my case, it looks like they must've adhered to their other two principles whilst building it because these tracks I'm listening to are bangers.

Principle 2: Ask the right questions

Ask. Ask. Ask. If you don't know, you could end up designing a product in the wrong direction.

Much like many of the other guideline steps above challenge you to think from the person using your service's point of view, this is the goal of asking the right questions: find out what issues your customers are having and see if you can solve them using machine learning.

Principle 3: Go manual before you go magical

Found a source of friction?

Can you solve it without machine learning?

How about starting with a heuristic (an idea of how things should work)?

Like if you were Spotify and trying to build a playlist of new music someone was interested in, how do you classify something as new?

Your starting heuristic could be: anything older than 30 days wouldn't be classified as new.

After testing multiple heuristics and hypotheses (a manual process) you could then again review whether or not machine learning could help. And because of your experiments, you'd be doing so from a very well-informed point of view.

From Big Data to Good Data by Andrew Ng

Andrew Ng presented a talk at Scale's recent conference on the movement of ML systems from big data to good data.
And Roboflow did a great summary of the main points, all of which talk to the things we've discussed above.

Some of my favourites include:

* Getting to deployment is a starting point rather than the finish line (closing the proof of concept and production gap)
* From big data to good data (MLOps' most important task is ensuring high-quality data in all phases of the ML project lifecycle and not all companies have access to big data)
* Freeze your codebase and iterate on your data (for many problems the model is a solved problem, the data is what's needed)

Andrew Ng on the importance of thinking about good data as well as big data. Source: https://scale.com/events/transform/videos/big-data-to-good-data

Learning more

The above are all guidelines on how to think about building ML-powered systems. But they don't show you tools or how to go about doing so.

The following are extra resources I'd recommend for filling the gaps left by the above.

Choose one and read through/work through all the materials/labs whilst building your own ML-powered project.

* Engineering best practices for machine learning (Software Engineering 4 Machine Learning): a thorough guide on developing software systems with machine learning components.
* Machine Learning Engineering Book by Andriy Burkov: a one stop shop for many of the guidelines and steps discussed above, I have this book on my desk and use it as a reference.
* CS329s: Machine Learning System Design, an entire Stanford course covering all of the steps that go into designing a machine learning-powered system. Led by Chip Huyen with guest lectures (including one from yours truly) by engineers from many different machine learning companies.
* Full Stack Deep Learning: machine learning doesn't stop once a model is built (and after reading the above, you know the model is a small part of the entire system). Full Stack Deep Learning introduces many of the steps around model building such as data storage, data manipulation, data versioning (notice the emphasis on data), model deployment as well as different tools for implementing them.
* Made with ML MLOps curriculum: MLOps = machine learning operations. Made with ML MLOps is made by Goku Mohandas in apprenticeship style, "here's how I would build an ML-powered service and how you can too".
* LJ Miranda's outstanding blog post on software engineering skills for data scientists: if I was to write a blog post specifically on going from building models (in notebooks) to writing full-stack code, this would be it.

[This post originally appeared as the April 2021 issue of Machine Learning Monthly, a monthly newsletter I write containing the latest and greatest (but not always latest) of the machine learning field.]

Daniel Morales

Jul 09, 2021

Data Science

Business

What Are the Expected Results of a Data Science Project?

From a company perspective, data science projects should always be viewed as experiments. Remember that we are talking about science, and science bases many of its theories on the results of a series of experiments. Many companies start from the wrong assumption: that the results are an exact science with a single, true answer. The reality is that many data science projects fail because they lack the iteration needed once the first results are obtained, and because they do not adopt a scientific, that is, experimental, approach to the process. But what are the expected results? The most common results expected from a data science project are:

* APIs
* Integrations
* Applications and/or Platforms
* Reports
* Presentations

1- APIs

APIs are a set of subroutines, functions and procedures (or methods, in object-oriented programming) provided by a certain library to be used by other software as an abstraction layer. This sounds a bit confusing, but it's easier than it sounds: it is code that can be called from other machines. Here the requests are not made by a user from a browser, but by one program to another program, and the result is structured data that can then be read and processed.

APIs are the most common expected outcome for a company that intends to develop a data science project, particularly for predictive models, as they make it easy to integrate the solution with the company's current internal programs without worrying about deep integration or the incompatibility of programming languages, devices or internal systems. This is how a company with a strong web, IoT, or mobile presence can start using a data science solution immediately without investing large resources.
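To make the program-to-program idea concrete, here is a minimal sketch of a prediction API using only the Python standard library. Everything specific in it is an assumption made for illustration: the `predict` function is a hypothetical stand-in for a trained model (a fixed linear score, not a real one), and the feature names `visits` and `returning`, the response key `purchase_probability`, and port 8000 are invented.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features: dict) -> float:
    """Hypothetical stand-in for a trained model: a fixed linear score capped at 1.0."""
    score = 0.1 + 0.02 * features.get("visits", 0) + 0.3 * features.get("returning", 0)
    return round(min(score, 1.0), 3)


def handle_prediction_request(body: str) -> tuple[int, str]:
    """Turn a JSON request body into an (HTTP status, JSON response body) pair."""
    try:
        features = json.loads(body)
        if not isinstance(features, dict):
            raise ValueError("expected a JSON object")
        result = predict(features)
    except (ValueError, TypeError) as exc:
        # Malformed JSON or non-numeric feature values are the caller's error.
        return 400, json.dumps({"error": str(exc)})
    return 200, json.dumps({"purchase_probability": result})


class PredictionHandler(BaseHTTPRequestHandler):
    """Thin HTTP wrapper: another program POSTs JSON and reads JSON back."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_prediction_request(self.rfile.read(length).decode("utf-8"))
        data = payload.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port: int = 8000) -> None:
    """Start the server; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), PredictionHandler).serve_forever()
```

Calling `serve()` starts the service locally, after which any client in any language can POST a JSON object to it. Keeping the request handling in a plain function, separate from the HTTP wrapper, also makes the behaviour easy to document and test.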
The requirements for an API, as an expected outcome of a data science experiment, are:

* Well-written, understandable and reproducible help pages or documentation
* The code must be well documented
* The code must be version-controlled

2- Integrations

Integrations are a bit more complex from a technical point of view, since they involve integrating a solution within the company's current systems. So, for example, if all the current web development was done in Java, and the machine learning solution was delivered in Python, the software engineering team will have to figure out how to integrate both languages into the stack, either through internal microservices or other application coupling techniques. This is harder still if the company has a monolithic stack; in that case, you should opt for an API.

3- Applications and Platforms

Another common solution is to create an external service, completely separate from the company's main service. The service is hosted on a different domain, and users access it only to obtain the results of the predictive modeling. This is common for companies that use engineering as marketing, making predictive programs to add leads to their pipeline of prospects. The requirements for these applications and web pages, as an expected outcome of a data science experiment, are:

* Ease of use of the tool
* Help pages or documentation
* The code must be well documented
* The code must be version-controlled

4- Reports

Here we would no longer be talking about predictive models, but about data analysis in general. The most common and expected result is a report or series of reports that explain the historical data, the reasons behind the historical results, and the conclusions drawn from them. They are usually full of statistical data useful for decision making. They are useful for management, marketing or human resources committees.
There are many formats for this, but the ideal would be not only to present the report, but to have the opportunity to make a presentation and tell a story about the data. Ideally, the reports provided to you should be:

* Clearly written
* Include a narrative around the data
* Creation of an analytical dataset
* Analysis
* Clear and even interactive graphics
* Concise conclusions
* Omitting unnecessary details
* Reproducible

5- Presentations

Presentations are where data scientists tell stories with data. A detailed but conclusive report on historical data is expected for decision making. This helps any area of the company, and can be presented at any business committee. The exact same criteria apply to presentations:

Clarity:

* Include a narrative around the data
* Creation of an analytical dataset
* Analysis
* Concise conclusions
* Clear and even interactive graphics
* Omitting unnecessary details
* Reproducible

Conclusion

As we can see, there are several types of results we can expect from a data science project. We see the importance of having clear objectives and knowing what we can achieve, but most important of all: take the project as an experiment, not as a magic solution to all our problems.

Daniel Morales

Jul 09, 2021

Business

Data Democratization and AI in the Financial Sector - Podcast

In this blog post we will talk about the democratization of data in the financial sector. The format will be a bit different than usual, as it is an interview with our CEO Dimitry Kushelevsky given to PrivacyLabs.ai. The interview was given in Podcast format, and the original audio can be found here: https://www.buzzsprout.com/1769590/8683204-data-democratization-and-ai-in-the-financial-sector-with-dimitry-kushelevsky

You can also find the PrivacyLabs.ai post here: https://privacylabs.ai/data-democratization-and-ai-in-the-financial-sector/

Paul Starrett

Hello, everybody. Welcome to another podcast by PrivacyLabs. My name is Paul Starrett. I am the founder of PrivacyLabs. Remember, PrivacyLabs. is one word. And this podcast today is going to be with Dimitry Kushelevsky. And this is in a series of podcasts on privacy preservation, and democratization of data, which is the focus of this podcast and similar technology specifically, generally within the area of machine learning and artificial intelligence. Just a little bit of background on Dimitry and myself, we had the pleasure of meeting through an investment group about three months ago, we both are advisors in various capacities for a company called Ealax.com company that specializes synthetic data for financial crime. But since then Dimitry and I have had many conversations around this topic. And I thought it would be wonderful to tap his brain for this area in democratization since his company datasource.ai is specializes in that. And his background is really perfect for this topic. So we’ll be talking with him about that. And I think without further ado, Dimitry, if you introduce yourself and your company, and then we’ll just dive right in.

Dimitry Kushelevsky

That’s great. Well, well, thanks again, for involving me in your podcast. It’s, it’s an honor, and I am most happy to continue our conversation, which has been very productive and very engaging so far. So let’s see. So where do I begin?
So, as you mentioned, I am the CEO and co founder in datasource.ai, a startup that we were started with the sole purpose of democratizing AI, more specifically, data science in the form of machine learning, and making its incredible capabilities available to the entire world. Right now, it really is what I loosely describe as a 1% problem, 1% versus 99%. It seems that many people, many business organizations, many individuals in tech, are already very familiar with the concept of AI and what it can bring the specific benefits that it can bring, as far as improving their operations as far as bringing additional revenues and boosting their potential leads to boost their profits. The bottom line, if you will, however, very few companies out there really can boast that they’ve actually taken a serious strategic approach to deploying AI algorithms in their software infrastructure and their software stack. And, you know, it is, like I said, it’s more of a 1% ai problem where a handful of the visionary companies with typically with big budgets, they’re typically, you know, multinational global corporations, they realize that there is a great deal to be gained with very low potential risk at the same time. So they seemed perfectly comfortable spending some money on developing a data science team and making their you know, I should say, becoming an early adopter of AI when it comes to actual implementation of various AI algorithms, as well as data science tools overall in their in their operating infrastructure. Meanwhile, the mainstream of the business organizations out there are still very much left on the outside looking in. 
So far, if if a company wanted to deploy any serious AI capabilities in their software infrastructure, that pretty much by default, required that they hire and either hire an in house data science team, and acquire an actual infrastructure engineering team that would develop a physical as well as base software infrastructure to run data science and AI algorithms. And that of course, costs quite a bit of money. And it does require a considerable amount of expertise, which today is still in a great deal of deficit. It’s still fairly hard to come by. And the schools of course, the power universities across the globe are producing data scientists as quickly as they can. But there is still a pretty significant deficit for that area for that specialization. So that, where does that leave the, basically the 99%, as I refer to them, so far, most of them simply have not been able to, to even seriously play around with AI and machine learning capabilities. And they’ve basically been doing what they’ve been doing for the last 20-30 years. Most of them, you know, who, who did want to, who did want to do some sort of a decision making implement some sort of a automated decision making in their software stack, they typically use rules based software, that is, of course, very limited, because it’s not, it’s not based on the dynamics of the immediate situation in the immediate scenario at hand. So to use a very common example, if you have an ecommerce store it, it, of course, can have some basic rules, rule sets, but built in baked into a script, that would, that would tell the machine or the controller to perform a certain task, whenever a visitor comes to, you know, looking for a specific recommendation, or looking for, you know, looking to do something in their store or purchase something in their store. That’s great. 
But of course, that if you have a rules based algorithm, that’s not based, that’s not using AI, in essence, you’re trying to trying to serve as this potential client by looking in the rearview mirror. And, of course, there’s only so much that you can do, of course, you know, the really cool part of machine learning and AI, is that you can actually have a machine or an algorithm monitor all the real time details surrounding this particular visit, or in my fictitious example of an ecommerce store. And based on what it’s seeing, it can make a real time decision that is a lot more likely going to result in in the in the purchase or in the customer being delighted, because he or she managed to get a great recommendation, when perhaps they least expected it. So anyway, the long story short, and that is, by deploying AI by by utilizing the toolkits that are available with machine learning data science, and other affiliated technologies in that space. Very few people today argue that there is nothing nothing to be gained. However, very at the same time, very few people, especially the smaller and medium sized businesses with typically tighter budgets and, and more limiting real human resources constraints, they’re typically locked out, you know, just costs too much. And they just don’t have that kind of those kinds of resources and expertise to, you know, to throw into data science or AI or machine learning. So that’s basically where we come in, we are trying to bring the both the price points associated with AI and machine learning down to a point where a typical, you know, middle of the road, SMB business, should be able to afford it. And at the same time we are performing, we have implemented a number of unique features, such as automation, that would make it very easy for that type of a user that type of client to actually implement elements, functional elements of AI and machine learning in their infrastructure. 
Without that requirement that I mentioned before, without requiring that the they hire onboard data scientists or spend a lot of money on a data science infrastructure to complement their existing operational infrastructure. So that’s, in essence, what we’re trying to do and we’re hoping that ultimately we can deliver a tidal wave of benefits to a very large number of of people and businesses that otherwise until now have been unable to, to access them.

Paul Starrett

Great, no, I and that’s a great lead in actually, I think you stated the the existing state where things are the 1%, and then the lockout, if you will, of the remaining 99%. And I think it’d be helpful to get down under the hood a bit more into what datasource.ai does. If listeners aren’t familiar, there’s a company called Kaggle, which was recently, I guess they were purchased by Google. And Kaggle, what they do is they put out a challenge or a problem, and they ask for people to submit to kaggle solutions. And if they are, if their solution is chosen, they’re given a cash reward. Often that’s, you know, 50,000, 100,000, it’s quite a bit of money. But the idea is to get all of these contributors who are competing for that prize. And in so doing, they’re getting really this very high quality very sort of, well, sort of the competition has drawn out the best of the, the those who are contributing what we call sort of the crowdsourcing. And what you’re doing datasource.ai is taking the concept and making it much more available, kind of the Henry Ford, if you will, you’re, you’re allowing it to come to the masses. And so you have a smaller sometimes, you know, the cash prize, if you will, could be 5000, it could be free, really depends. But the idea is that this the SMB, the small to medium sized business, then has access, they put up a cash mount, like $5,000, I’m just picking names out of that are numbers out of a hat, you they then come to you, and then you get this competition.
And I think that let me know if I haven’t stated that properly, but also need you to state of I think you’ve got quite a few projects, going.Dimitry KushelevskyWe got it, we are definitely turning some heads and attracting, frankly, a lot of heavyweights in the data science community who, as we’ve already demonstrated, who are happy to contribute their skills and the energy and creativity to, you know, to help us become successful. Yeah, we’ve done a number of projects, as you mentioned, that, in essence, our data science competitions, but so far, or most of them, were did not have a cash prize associated with them, we just wanted to, you know, to try out our, our platform to make sure that the features and automation and other capabilities are working as, as planned. And at the same time, we wanted to test just the general assumption behind our business model, which is, you know, there is a very committed very high energy, very vibrant community supporting data science, as well as implementations of AI and machine learning in the mainstream businesses and other organizations. So, so far, we’ve been very, very pleased with what we observed, we are actually beginning to monetize our, our platform now. So it’s very exciting time as well, because I want to to offer actual cash prizes, to, to the winners of the of the most successful algorithms that our contestants have got submitted. And also what we’re doing, you know, thank you for bringing up Kaggle. While the concept behind crowdsourcing AI or machine learning algorithms is actually quite similar between what we do and what Kaggle does. But there are certainly a number of unique capabilities, starting from the differential between the markets, the target markets that they focus their offerings toward, versus what we’re trying to do. 
So as I mentioned earlier, we’re really looking to bring it down to both a very low price point, as well as a very low requirement of, of the expertise and other dedicated resources that a any given client would have to have on board in order to use our system. But in order to ultimately, you know, develop a high quality machine learning algorithm and implemented in the, in their software infrastructure. Typically Kaggle project still would require data scientists onboard those data scientists will typically come with the project, you know, the customer, the client would be expected to bring him in the cash prizes with Kaggle are significantly greater, I’d say typically on the order of magnitude greater versus our target cash prize values. So by doing so, once again, we’re trying to really bring all these great benefits of AI and machine learning and data science into the global mainstream. So obviously, we’re you know that that entails that we would try to turn it into a very much a high volume, low barrier to entry type business model, and want to have lots and lots of businesses, you know, who could, who could, you know, realize very quickly that, hey, I can actually for for very little money. And without having to go and hire dedicated data scientists, to my team, I can actually go and develop one or more machine learning algorithms that are going to be high quality, they’re going to be designed by humans, by expert humans. And they are extremely likely based on that indicators that we’ve seen from the earlier deployments, they’re extremely likely to improve our business and grow our bottom line, which is ultimately what we’re trying to do. I mean, you know, ultimately, as far as our purpose goes, that we behind our company, behind both of our, but myself and my co founder, Daniel, we are really trying to, you know, we are passionate, obviously, we’re passionate about AI and data science and machine learning. 
And we are really focused on bringing all those great capabilities, all those great, fairly easily attainable benefits that the, you know, that customers can utilize, right down to the average business, the average organization around the globe, no matter what their budget, no matter what their size, no matter what their, what their ability is to, you know, to hire on board expertise and other resources. So that’s obviously because of that deep desire that Daniel and I share, and have shared from the very beginning, we have developed and launched a platform that is highly automated already. Although, of course, without question, as we progress as we grow. And we have additional developer resources, of course, we’re going to continue to, to enhance it. And, you know, and to add additional features and capabilities that that are only planning today. And the ultimate benefit is that as we get more and more clients, utilizing our platform to crowdsource high value, high capability, high quality machine learning algorithms, as they deploy those algorithms, they will undoubtedly be getting very impressive results based on everything we’ve seen in all the studies we’ve read so far, they are really setting themselves up for a great deal of additional success, even if they’re a successful company already. So that, of course, is why Daniel and I are very excited to be doing what we’re doing. And we’re even more. So more. So we’re even more excited about the future that, you know, that this technology holds that we could potentially bring to the mainstream business customers around the globe as we grow as a company.Paul StarrettYes, and that’s, that’s great. 
I it it let’s it leads me to think of the the crowdsourcing, it’s not only does the individual company get the benefit of the, of your platform and your expertise between you and your co founder, in addition to all of the teams that are competing, to satisfy some goal that the competition so to speak, is put to, there’s also, this is going to lead into, I think the part here where we’re gonna get into the challenges that come with this, that what you can do is you can have, let’s say different companies that are perhaps in the same vertical the same domain, share your information, to gain the synergy across their different insights, learn from the machine learning efforts. The problem is, especially in highly regulated industries, with if getting the data is the big problem. And the one of the biggest barriers there, of course, is privacy regulation and data protection laws. And the idea there is that there are techniques, there are a solutions that allow you to essentially create a different data set that’s called there’s various things here now, it’s a big, it’s a fairly large topic. We cover this, I just finished a podcast with Patricia Thaine, which you’ll find on our website which discusses privacy preservation technologies in the grand scheme. But for right now here with machine learning, we’re going to focus on synthetic data. What that is, is is a method by which an algorithm will take the original data that contains private sensitive data. And it replicates it. But it leaves behind any remnants of the sensitivity, or of the privacy of the underlying data, thereby kind of lifting it up and out of those concerns. So now you can share it it’s not a panacea, there’s a thing called the privacy budget, which says that the more that you remove the privacy and sensitive information, the less valuable your data becomes to a machine or machine learning algorithm. And it’s not a it’s not a simple process, but it’s very doable. 
And so Dimitry, I think, you know, Ealax company mentioned earlier, they do this, and be able to do it for things like a banking and financial services. And I know, Dimitry, you personally have quite a, quite a bit of background in this area of financial services. What is your perspective on the promise of synthetic data and your thoughts on what it is and, and, and how we expect to see that utilized not only for a company to do it just for the internal purposes, but then perhaps to share it with other?Dimitry KushelevskyYeah, absolutely. So without question financial, the financial vertical financial industry is one of the one of the verticals, that is really, really well positioned to take advantage of AI and the power of the capabilities that that it can bring to them, again, with, you know, with the help of a company like ours, for a very low cost and a very low resource requirement. And, again, it seems that, I guess, because the financial industry is so close to business, and so, so close to recognizing the the material aspect of what this kind of technology these technologies can bring, they they’re getting it, you know, they clearly they’re, they’re sensing that this is not just a fad, AI is here to stay. And, again, there’s they’re seeing like the smaller local institutions are seeing that the the larger brands in their industry are deploying AI either I would say the, you know, the larger financial vertical representatives are among those early adopters who, you know, who have done some strategic early deployments, and they actually have benefited from them pretty significantly. So, you know, what, what does what does does the future hold or what does what kind of capabilities, what kind of benefits does does it hold for for Finance? 
Well, there are so many great applications, right, I normally start looking at any business opportunity or even a use case scenario by by examining the what what the customer’s needs are, and in this case, in the financial vertical, the customer’s needs are quite extensive, right, they are the most of the banking institutions and financial institutions already have considerable amount of data that they have been collecting about their customers just as a part of their day to day operations. And of course, because they are required to do so by law, right. So, for one, they already have a great important ingredient that many representatives of other verticals may or may not always have. So, they have the data, they also have very specific means such as they want to remain competitive, they want to, they want to be able to offer new services, they want to target their, their marketing and other customer focused materials better. And ultimately, they of course, they want to save on their operations as well. Another another huge opportunity for the financial industry across the board, of course, is something that we discussed earlier. Is, is the fraud and, you know, criminal activity prevention. So AI, of course, I’m you know, I’m I’m very excited, you know, banner waving, waving, you know, person in the AI ecosystem. So yes, I do admit that I might be a bit biased here. But AI, I really would, would strongly submit that AI provides a tremendous opportunity, perhaps much more powerful than any other source of tools available today, to address all of these use case scenarios, and they’re really exciting part to me here is that we would be, by developing AI algorithms and other AI based solutions, we could directly and very positively impact you know, those customers and meet their needs. You know. So that’s, that’s really exciting part, ultimately, everything has to, you know, begin and end with the customer. 
So anytime that we have, we have a customer who already has a demonstrated set of needs that can directly impact their, their business in a very positive manner. Of course, any business person will be very excited to offer their platform or their solution to help their their users get and get exactly that effect. So, yeah, there’s a, there’s a lot, a lot to do a lot of opportunity. But of course, there is always, as always, there is a challenge. And the challenge is quite significant in financial spaces, that has to do with regulation. And it has to do with the severe privacy protection regulations that virtually all the financial institutions have to abide by across the globe. Right. So that is one big challenge that that without, with that, unless we find a way to solve it as an industry, I think, you know, Ai, and machine learning and data science will be extremely limited in terms of the depth and breadth of those benefits that we can deliver. So having having companies like Ealax around producing very close proxies for the customer’s actual original data, however, without disclosing any of the any of the private or personal or confidential information associated with the bank, or its customers, or without with institutional risk customers, could may very well be the difference between all those institutions, being able to take advantage of these great, but your business benefits and not being able to do so. So it’s really quite a big development.Paul StarrettYes, I agree. And I think I wanted to sort of slip in an elevator pitch that I have to kind of encapsulate what you said about, you know, how data is becoming much more vexing even for the midsize, and small companies. Because, as we know, the the amount of data that companies generate is growing exponentially every year. And the only way to really wrangle it is with with machine learning. That’s all you’re left with. So it becomes the new normal becomes the best practice. 
I think one unique thing we can share with our listeners is that synthetic data does allow us to drop out the sensitive or private information, though, again, I want to emphasize it's not a panacea. There are some knobs to turn, and there is some loss of insight. No free lunch, right? Exactly, exactly. So with privacy budgets, you've got to pay somewhere. But I think generally it's very much a net gain. And there's an upside as well: with synthetic data, you can actually gain more insights from the underlying data, above and beyond what you'd expect to build into a machine learning model from that data, because the synthetic data can generate new types of transactions and new types of scenarios that a machine learning algorithm can then use. Some other issues around regulation have to do with explainability of machine learning: how is it working? Do we know what the machine learning model is doing? You can add metrics and other information into this synthetic data that help you establish explainability, which is a very big piece of the privacy regulations and so forth. GDPR has specific requirements around that, as do most laws. And just for a pitch of my own, to blow my own horn here and pay some bills, that's what PrivacyLabs does. I have a background in machine learning and law, so I'm able to help bring things together, get the explainability in there, and make sure that the compliance professionals understand the technology and what's happening, and that it all comes together in a profitable and compliant way. So that's kind of our role in this. And I, of course, look forward to working with you, Ealax and other companies to bring this to the market. 
So really the goal here is the democratization of data, and I think maybe we can finish on this topic. We've basically covered the idea that the individual institution, whether small or midsize, is where the need is most vexing. The data is getting bigger, faster and more complex, and machine learning really is the best way to save money, reduce risk and so forth. But there is also the ability to make a better world, and Dimitry, this is absolutely a big piece of what's in your heart. Again, could we have, let's say, financial services institutions share all of their data to build, for example, a fraud machine learning model that is a sort of superset of all of the intelligence that has come from all of them? I think that when we get into things like synthetic data, that becomes much more realistic. And you have this sort of crowdsourcing in its own right, in that regard.

Dimitry Kushelevsky: And you get to use the wisdom of the crowd to solve some of the biggest challenges that are dogging the entire industry across the globe. So yeah, this is one of the many excellent value points behind the entire technology.

Paul Starrett: Yes, yes. For those who are a little more savvy in the direction of data science, there is a thing called transfer learning, where, typically with deep learning neural networks, you're able to take prior models that have been built and leverage that background. Transformers are a typical example. But that's just an aside for those of us who are a little more into data science. 
I think that kind of rounds it up. Again, the idea here was the democratization of data sharing: being able to leverage crowdsourcing around a specific problem for a company, so that it can enter the market and remain competitive by having access to machine learning, but also the ability to share information within a domain for the common good. So I think we've done a great job, frankly, in what is roughly half an hour.

Dimitry Kushelevsky: There's a lot of ground to cover. For someone like yourself, I'm sure you know it, there's a great temptation to get into the weeds, because there are so many great use cases and so many great applications, and ultimately so many incredible business and personal benefits that we can deliver to literally billions of people out there with this type of technology. That, of course, is very, very exciting. And, frankly, that's, I think, very much a part of our future. I just read a PwC-sourced study recently where they claim that by the year 2030, AI is expected to add a little over $15 trillion to the global economy. Yes, 15 trillion dollars. It's incredible, just absolutely incredible. Even today, closer to home, so to speak, or closer to our timeframe, the machine learning industry is measured at somewhere between nine and 10 billion dollars. Obviously COVID kind of played with those numbers, like with any other numbers, but I believe that's still more or less where we are today. But the really exciting news, and I believe this study, if I'm not mistaken, came out of McKinsey, is that they are actually forecasting a 39% compound annual growth rate for the foreseeable future. I believe that by 2026 or 2027, they're expecting this number to grow to around 120 to 127 billion dollars. 
So it I mean, these are astronomical numbers. You know, and you mentioned earlier that, yes, there are certainly multiple applications that are multiple entrants into the AI and machine learning sourcing space. And I’m, I’m certain there will be more I don’t think it’s that big a reach to to forecast that it’s going to get better and better and bigger and more. You know, that densely populated as far as the AI industry goes. But my you know, the way I see it is there’s so much great potential, it’s truly just an ideal, you know, textbook case of plenty for mentality, it’s something that we are going to, we can, we can build new solutions within to develop a tremendous amount of value added to, you know, to literally millions, if not billions of customers. So there’s plenty of, there’s plenty of good to be done, you know, that’s a really, really exciting part. For everyone who is already in this space or is considering, you know, entering it, including the folks who are potentially going to be our future customers, we welcome them to come and check us out. And, and, you know, we offer a free consultation for anybody who’s interested in exploring what, you know, what we offer, and how it may be able to benefit their their business, their operations or, you know, overcome any other challenges that they might be facing?Paul StarrettYes, yes. And I did want to sneak in here one more comment about an and then I’m gonna ask you to retreat for your, your closing thoughts on what you think we haven’t haven’t covered or something you think needs to be emphasized. But I think one of the other things that we keep talking about synthetic data. And I just want to iterate the reason we say that is because Gartner has predicted this 60% of machine learning will be based on synthetic data by 2024. That’s right around the corner. 
So I think that kind of gives us a sense of, there’s an there’s an area, and I’ll make this brief because it’s a technical area, that the software development lifecycle has really moved to what they call an agile framework, which requires very quick turnaround. And that is the new normal for the development of anything, any kind of software or any solution that’s being used by enterprise. And the problem is, is that to get the data, it takes a long time, contracts and laws and other things require months. And you don’t have that time when you have an agile process in software development that requires a daily kind of turnaround. So this synthetic data allows you to generate that data much more quickly and get get to pay dirt. I just wanted to do that. That’s a very new hot topic that we’ve kind of tripped over here from other discussions. So other than that, I’m going to finish here. I will anything, Dimitry, you think we should, you know, we’ve got a few minutes here. Anything you think that we should know, that we haven’t discussed or anything you want to emphasize?Dimitry KushelevskyYeah, well, the one of the most interesting challenges that we are up against right now is, is we rather obviously, don’t want to boil the ocean, if, if you if you know what I mean, there are so many great use case scenarios, there are so many great applications for AI for machine learning for, you know, quite literally running data science competition, that we you know, we have to be very judicious as far as which ones we pursue, it was a great temptation between both founders to try and just go after every interesting opportunity, every challenge that has a real business need and real data behind it, that the customer may already have a potential customer. 
But we find ourselves deliberately keeping ourselves disciplined: we're trying to validate the major assumptions that will determine our go-to-market and our business evolution trajectory for the foreseeable future. So yeah, it's a great problem to have. And with that in mind, I want to welcome anybody who's interested in playing in this space to check us out and discuss with one of our experts, or one of us directly, what we can do and how, in specific terms, AI and machine learning can help them overcome their challenges, grow their business, bolster their bottom line or take better care of their customers. So once again, we would love to welcome additional people who are either as excited about AI as we are, or perhaps just intrigued and, if nothing else, want to say, hey, let's talk and see what this technology may have in store for us and our business. So again, I welcome people who are listening to this, or who are intrigued by the potential benefits they can gain with AI, data science and machine learning, to come visit us. If they are intrigued by what you and I just discussed, or by the content we've posted on our web page, I, of course, would love to chat with them, and they can just click on the free consultation and schedule a few minutes to chat with us. Every single conversation is very interesting to us because, again, it helps us triangulate the most promising opportunities for us to deliver maximum value. 
That way we don't wind up boiling the ocean, but we ultimately wind up meeting the requirements of our mission and helping businesses accomplish their goals for success, and hopefully better than any other alternatives out there in the marketplace, which I do strongly believe we can. So thank you for the opportunity.

Paul Starrett: Yes, no, my pleasure. And just so people know, the website is datasource.ai. It's all one word, no hyphens, no dots or anything in the name: datasource.ai. And I believe it is dimitry@datasource.ai?

Dimitry Kushelevsky: Yeah, dimitry@datasource.ai. Believe me, having just my first name in this email address is a blessing, as you know, because I have a long Ukrainian last name that would confuse anybody. So, yes, I, of course, would welcome anyone who wants to reach out and connect with me directly.

Paul Starrett: Great, or they can go to your website, as you indicated. Great. Well, listen, I'm just going to close out here with some thoughts on PrivacyLabs' role in this. The process of bringing artificial intelligence or machine learning into your enterprise infrastructure, in one form or another, is a horizontally active topic. And that's where we can help: to look at the security requirements and the compliance. I have an attorney who's specialized in compliance law. I'm much more technical, but I can help discuss the topics with the compliance folks and help scope things. One thing we do in PrivacyLabs is work with partner companies like OneTrust, BigID and TrustArc, and another, one of my favorites, is Centrl. We can use those tools to help herd the cats and bring everything together. 
We specialize in machine learning, automation and audit, so that we can make sure that everything's going the way it would be expected, either by a regulator or to make sure you're covered legally at some level. So that's kind of what we do. And again, Dimitry, thank you so much. I think we'll close out here.

Dimitry Kushelevsky: I wanted to give you a quick plug, Paul, because I deeply appreciate what you do as far as opening the gates for potentially a very large number of business owners and business executives who, because of you and your work, will be able to take advantage of what we offer. So I really appreciate having met you and having had a bunch of really productive conversations already. And I look forward to continuing very much along the same lines.

Paul Starrett: Thank you. Those are kind words, and I wouldn't disagree with you, if I say so myself. I think we've really positioned ourselves, and it's usually with my guidance directly, personally. Yes, we're sort of the concierge, if you will, to help people get in and cover all the bases horizontally and peripherally. So great. With that said, we will close ourselves out here. And Dimitry, we will have another podcast soon, probably one of the updates or some other vertical or something. But thank you again. And thank you, listeners. I hope that you learned a lot, and watch for future podcasts from us. Thanks. Thanks all.

Daniel Morales

Jul 09, 2021

Business

How to Build Your Data Analytics Team

Peer reviewed by Kat Holmes, Data Director, ITV

As businesses recognize the decisive power of data to achieve business goals, most hope to put data in the driver's seat of their business and product strategies. This entails putting together a strong data team that can effectively propagate its insights across different areas of the business. Unfortunately, this is no easy task.

To be truly data driven, companies need to build three capabilities: data strategy, data governance and data analytics.

3 pillars for data-driven companies (image from Pitch)

Strategy: Data strategy is your organization's roadmap for using data to achieve its goals. It requires a clear understanding of the data needs inherent to the business strategy. Why are you collecting data? Are you trying to make money, save money, manage risk, deliver exceptional customer experience, or all of the above?

Governance: Data governance is a collection of processes, roles, policies, standards and metrics that ensure the efficient use of information in enabling your organization to achieve its goals. A well-crafted data governance strategy ensures that data in your company is trusted, accurate and available.

Analytics: The term data analytics refers to the process of analyzing raw data to draw conclusions about the information it contains. Typically, those involved with data analytics in an organization are data engineers, data analysts and data scientists.

Ultimately, your ability to leverage data will depend on these three pillars. If you're reading this and realizing that your organization possesses none of these, don't worry. That's why we're here. A good place to start is to build a strong analytics team, one that is closely tied with the strategic goals of your business. 
It is the first pillar of your data organization, and the focus of this article.

When building a data analytics team, heads of data typically grapple with the following questions:

How big should this team be?
How many data engineers, data analysts and data scientists should it include?
How does the team interact with the rest of the organization?
Which structure suits the data team: centralized or embedded?

They rightly do so; having a strong data team is no longer a luxury, but essential to the very survival of a company today.

Let's start with the basics, though.

Where are you in your data journey?

Before building a data team, it's important that you realize where you are in your "data journey", because this will directly affect the structure of your team. This part is thus dedicated to a simplified data maturity assessment. Beware: company size and data maturity are two different things. Your organization can be large but immature on a data level.

Data maturity is the journey towards seeing tangible value from your data assets. We propose a simple framework of data maturity assessment, in which you measure your ability to understand your past, know your present and predict your future. What do I mean by this?

Well, in most companies each department has its own set of KPIs that support the execution of the corporate strategy. It's not enough just to define them: they must also be clearly tracked, and you must also have the ability to predict future outcomes against these KPIs. This ability rests on a clear knowledge of your present, which, in turn, builds on a strong understanding of the past. Do this, and you have found a simple way to assess your data maturity. For example, if you're unable to identify the revenue drivers for your company (your past), it means you need to work on your data maturity by bringing visibility to your business before you seek to predict future outcomes. We don't recommend skipping steps. 
It's like Maslow's hierarchy of needs, but for data.

Read Also: What Are the Expected Results of a Data Science Project?

Data hierarchy of needs (image by Louise de Leyritz)

Let's look at a couple of practical examples:

Marketing ROI. Define your ROI, across multiple channels, using an identified attribution model. Then understand its evolution over the previous 12 months, and especially its drivers: performing channels, time of the year, product and so on (past). Then track its evolution on a daily, weekly or monthly basis with a reporting tool you trust (present). Finally, forecast your marketing budget using predictive models built on this history (future).

Customer Satisfaction. Define your customer satisfaction measure. Is it NPS or CSAT? Everyone in your company should share a common understanding of how it is computed. As with our previous example, compute its evolution over the previous 12 months and find its drivers (past). Then track the satisfaction of your customers daily with trusted dashboards, and identify actions to take from today to increase it (present). Your understanding of the past and present state of customer satisfaction will allow you to predict churn efficiently (future).

Understanding your past and present is commonly referred to as performing descriptive analytics. Descriptive analytics helps an organization understand its performance by providing context to help key stakeholders interpret information. This context is usually in the form of data visualization, including graphs, dashboards, reports and charts. When you are analysing data to forecast the future, you're engaging in predictive analytics. The idea with predictive analytics is to take historical data and feed it into a machine learning model that learns key patterns, then apply this model to current data in the hope that it will forecast the future. 
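The past, present and future steps of the Marketing ROI example above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a recommended model: the monthly spend and revenue figures are made up, `roi` and `linear_forecast` are hypothetical helper names, and a straight-line trend stands in for whatever predictive model a data scientist would actually choose.

```python
from statistics import mean
from typing import Sequence

# Hypothetical monthly figures for one marketing channel (illustrative only).
spend = [1000.0, 1200.0, 1100.0, 1300.0]
revenue = [3000.0, 3900.0, 3600.0, 4500.0]


def roi(revenue_amount: float, spend_amount: float) -> float:
    """Return on investment, defined here as (revenue - spend) / spend."""
    return (revenue_amount - spend_amount) / spend_amount


def linear_forecast(values: Sequence[float]) -> float:
    """Fit y = a + b*t by least squares over t = 0..n-1 and predict t = n."""
    n = len(values)
    t_mean = (n - 1) / 2
    y_mean = mean(values)
    slope_num = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(values))
    slope_den = sum((t - t_mean) ** 2 for t in range(n))
    slope = slope_num / slope_den
    intercept = y_mean - slope * t_mean
    return intercept + slope * n


# Past: how ROI evolved month by month (descriptive analytics).
monthly_roi = [roi(r, s) for r, s in zip(revenue, spend)]

# Present: the latest tracked value, as a dashboard would show it.
current_roi = monthly_roi[-1]

# Future: a naive trend extrapolation (predictive analytics).
next_roi = linear_forecast(monthly_roi)

print("Past ROI by month:", [round(x, 2) for x in monthly_roi])
print("Current ROI:", round(current_roi, 2))
print("Forecast ROI for next month:", round(next_roi, 2))
```

The first two printed values are descriptive analytics; only the last one is predictive, and it is only as trustworthy as the history it extrapolates from, which is why we suggest securing the past and present first.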
We’ll use the terms descriptive and predictive analytics throughout the article to refer to understanding the past and present, and to predicting the future, respectively.

If you realize that your organization is not fully mature (i.e., you don’t have a clear understanding of your past and present), here are our recommendations for the next steps of your data team.

Key players on a data analytics team

A data analytics team is usually composed of four core functions plus a head of data analytics, detailed below.

Data engineer: They are responsible for designing, building and maintaining datasets that can be leveraged in data projects. As such, data engineers work closely with both data scientists and data analysts. We also include the new role of analytics engineer here, although, in practice, this role lies between analytics and engineering.

Data scientist: They use advanced mathematics, statistics and programming tools to build predictive models. The roles of data scientists and data analysts are quite similar, but data scientists focus more on predictive analytics than on descriptive analytics.

Data analyst: They use data to perform reporting and direct analysis. Whereas data scientists and engineers typically interact with data in its raw or unrefined state, analysts work with data that has already been cleaned and transformed into more user-friendly formats.

Business analyst/ops analyst: They help the organization improve its processes and systems. They focus on dashboarding, answer business questions and propose their own interpretation of the results. They are agile and straddle the line between IT and the business, helping to bridge the gap and improve efficiency. They frequently work with a specific business area such as marketing or finance, and their SQL literacy can range from basic dashboarding to advanced analysis.

Head of data analytics: They provide strategic oversight to the data team.
Their goal is to create an environment that allows all parties to access the data they need painlessly, to build the skills the business needs to draw meaningful insights from the data, and to ensure data governance. They also act as a bridge between the data team and the main business unit, acting both as a visionary and as a technical lead.

How large should the team be?

Different companies build data teams of different sizes; no one size fits all. We studied the data team structures of 300+ companies in the 300–1000 employee range and derived the following insights:

As a general rule, you should aim for 5–10% of the employees in your company to be data-analysis savvy. Some companies, such as Amazon or Facebook, are training a huge portion of their employees, but we excluded them from our analysis.

The first hires of a brand-new data team are often a data engineer and a data analyst. With just these two roles, organizations can already engage in some basic descriptive analytics.

When building a larger team, think in terms of the skill set you need. A typical data project requires the following skills: database, software development, machine learning, visualization, collaboration and communication. It is very rare to find individuals who possess all of them, so you should be aware of which skills each candidate brings to the table. Regardless of how many people you decide to hire, your team as a whole should ideally cover this skill set.

Where you are in your data journey also affects who you hire and at which stage. Generally, data analysts focus on understanding the past. That is, they take the data you have and try to understand the drivers of growth and other metrics. Business analysts/ops analysts are oriented towards the present (dashboarding). Finally, data scientists focus on predicting future outcomes.
So, if you have trouble understanding your past, hire a data analyst ahead of a data scientist.

What should ultimately guide the size of your data team is the number of business problem statements and the complexity of the most serious problems. Look at the size of your roadmap and establish how many people you need to complete your data projects within a reasonable amount of time. If you realize it would take your data team more than a year to complete its projects, then it’s probably time to expand the team.

We also encourage you to look at your run vs build ratio. Members of your data team ‘run’ when they work on daily business operations, focusing on the present performance of the organization. They ‘build’ when they work on long-term projects, such as adding new features to the product. Your data team should be running 2/3 of the time and building 1/3 of the time. If your data team spends all its time on day-to-day needs, you are jeopardising the future of your company, and it is probably time to expand the team.

Finally, you might have to make some project-specific hires. If you’re a fintech running a fraud detection project, or a company specialising in dispatching for logistics, you might want to hire someone who knows the specifics of your industry.

How does the data team integrate with the company?

There is no perfect structure for an analytics team, and your structure is likely to change many times. If your data team structure hasn’t changed in the past 2 years, it is likely to be sub-optimal. Why? Because the data needs of your company evolve rapidly, and the structure of your data team has to adapt with them. Also keep in mind that the more static your organization, the harder the next change will be.
For this reason, we don’t prescribe a given structure, but rather present the most common models and how they suit different types of businesses.

The very first step when structuring your data team is to find the data people who already exist in your organization. They are not only the people with the term “data” in their title: they could be any employee who is not afraid of data analysis or already has SQL skills, such as business analysts/ops analysts. If you don’t take the time to locate pre-existing data people carefully, you are likely to end up with an unplanned data team structure that is unlikely to fit your business needs.

Centralized model

Figure: Centralized model for data teams. Image by Louise de Leyritz.

The centralized model is the most straightforward structure to implement, and it is usually the first step for companies that aim to be data-driven. There are, however, a few drawbacks to this model, which are listed below. This structure usually leads to a centralized data “platform”, where the data team has access to all the data and serves the whole organization across a variety of projects. All data engineers, analysts and scientists within this team are managed directly by the head of data. With this structure, the data team reports on a dotted line to data stakeholders based in business units, in a consultant/client-type relationship.

This flexible model adapts to the continuously evolving needs of a growing business. If you’re at the beginning of your data journey, that is, you still struggle to have a clear vision of your past and present, this is the structure we recommend. The data team’s first projects will seek to bring visibility to the business, ensuring all departments in your organization have KPIs and dashboards they can trust.
This kind of structure is particularly good for analytics where reusability and data governance are important.

Advantages

✅ The data team can help with other teams’ projects while working on its own agenda.
✅ The team can prioritise projects across the company.
✅ There are more opportunities for talent and skill set development in a centralized team. The data team works on a broader variety of projects, and data engineers, scientists and analysts can benefit from their peers’ insights.
✅ The head of data has a centralized view of the company’s strategy and can assign data people to the projects best suited to their capabilities.
✅ It encourages career growth, as data engineers, scientists and analysts have a clear view of the path to more senior roles.

Drawbacks

❌ High chance of disconnect between the data analytics team and other business units. In this model, data engineers and data scientists are not immersed in the day-to-day activities of other teams, making it difficult for them to identify the most relevant problems to tackle.
❌ Risk of the analytics group being reduced to a “support” function, with other departments not taking on their own responsibilities.
❌ As the data team serves the rest of the business, other business units might feel that their needs are not properly addressed, or that the planning process is too bureaucratic and slow.

Decentralized/Embedded model

Figure: Decentralized model for data teams. Image by Louise de Leyritz.

In a decentralized model, each department hires its “own” data people, while the data platform remains centralized. In this model, data analysts and scientists focus on the problems faced by their specific business unit, with little interaction with data people from other areas of the company.
With this structure, data analysts report directly to the head of their respective business unit.

Advantages

✅ Embedded teams of data people are agile and responsive, because they are dedicated to their respective business functions and have good domain knowledge.
✅ Product managers can assign data tasks to the people most qualified to work on them.
✅ Business data teams don’t have to fight for resources to build their data projects, because the resources sit within the teams.

Drawbacks

❌ Lack of a single source of truth and duplication of data content.
❌ Data people end up working on redundant issues due to a lack of communication between different teams.
❌ The creation of silos erodes productivity, since data people can’t draw on their colleagues’ expertise as they do in the centralized model.
❌ This model makes it harder to staff data people optimally across different projects.
❌ Business managers, who usually lack a technical background, will find it hard to manage data people and assess the quality of their work.

Federated model/Centre of excellence

A federated model is most suited to companies that have reached data maturity, have a clear data strategy and engage in predictive analytics.

Figure: Centre of excellence model. Image by Louise de Leyritz.

In the Centre of Excellence (COE) model, data people are embedded in business units, but a centralized group that provides leadership, support and training remains. Even though data analysts and scientists are deployed across business departments, you still have a data leader (or a core group of data leaders, depending on company size) who prioritizes and supervises data projects. This ensures that the most beneficial data projects are tackled first.

This strategy is most suited to larger, enterprise-scale companies with a clear data roadmap. The centre of excellence model entails a larger data team, as you need data scientists both in the COE and in the different business branches.
If you are a small or medium company, your needs might not require a data team of this size.

This approach retains the advantages of both the centralized and the embedded models. It is a more balanced structure in which the data team’s actions are coordinated, while the data experts remain embedded in business units.

Again, it’s extremely important that you know who your data people are. When building a centralized team at the beginning of your data journey, make sure you don’t have business analysts/ops analysts embedded in other departments. Otherwise, you will end up with an unwanted mixed model, creating complete chaos in your organization. When creating a COE, you need to ensure it is wanted and planned.

Advantages

✅ The Centre of Excellence model provides the advantages of both the centralized and the embedded models.

It still presents some drawbacks, though:

Drawbacks

❌ This model requires an additional layer of coordination and communication to ensure alignment between the COE and the business units.
❌ It is not fit for purpose for small and medium-sized organizations, which therefore cannot easily tap into the benefits of this hub-and-spoke model.

Final words

Building a strong analytics team is a key pillar if your company is to become data-driven. The extent to which you will extract business value from data ultimately depends on the strength of this team, and on how symbiotic it is with the rest of your business. There is no made-to-order advice for the size, composition and structure of your data team. That’s why you need to understand the data maturity level of your organization, so that you can build a data team suited to your business’s needs and aligned with your business strategy.

At Castor, we write about all the processes involved when leveraging data assets: from the modern data stack, to data team composition, to data governance.
Our blog covers the technical and the less technical aspects of creating tangible value from data.

At Castor, we are building a data documentation tool for the Notion, Figma, Slack generation. Or, data-wise, for the Fivetran, Looker, Snowflake and DBT aficionados. We designed our catalog to be easy to use, delightful and friendly.

Want to check it out? Reach out to us and we will show you a demo.

Originally published at https://www.castordoc.com.

Daniel Morales

Jul 09, 2021

Separating Hype From Value In Artificial Intelligence
