By Phoebe Wong & Robert Bennett

To be a real “full-stack” data scientist, or what many bloggers and employers call a “unicorn,” you have to master every step of the data science process, all the way from storing your data to putting your finished product (typically a predictive model) into production.

But the bulk of data science training focuses on machine/deep learning techniques; data management knowledge is often treated as an afterthought.

Data science students usually learn modeling skills with processed and cleaned data in text files stored on their laptops, ignoring how the data sausage is made. Students often don’t realize that in industry settings, getting raw data from various sources ready for modeling is usually 80% of the work.

And because enterprise projects usually involve a massive amount of data that a local machine is not equipped to handle, the entire modeling process often takes place in the cloud, with most of the applications and databases hosted on servers in data centers elsewhere.

Even after students land jobs as data scientists, data management often becomes something that a separate data engineering team takes care of.

As a result, too many data scientists know too little about data storage and infrastructure, often to the detriment of their ability to make the right decisions at their jobs.

The goal of this article is to provide a roadmap of what a data scientist in 2019 should know about data management, from types of databases and where and how data is stored and processed to the current commercial options, so that aspiring “unicorns” can dive deeper on their own, or at least learn enough to sound like one at interviews and cocktail parties.

The Rise of Unstructured Data & Big Data Tools

IBM 305 RAMAC (Source: WikiCommons)

The story of data science is really the story of data storage. In the pre-digital age, data was stored in our heads, on clay tablets, or on paper, which made aggregating and analyzing data extremely time-consuming.

In 1956, IBM introduced the first commercial computer with a magnetic hard drive, the 305 RAMAC. The entire unit required 30 ft x 50 ft of physical space and weighed over a ton, and companies could lease it for $3,200 a month to store up to 5 MB of data.

In the 60 years since, the price per gigabyte of DRAM has dropped from a whopping $2.64 billion in 1965 to $4.90 in 2017. Besides becoming orders of magnitude cheaper, data storage also became much denser and smaller in size.

A disk platter in the 305 RAMAC stored a hundred bits per square inch, compared to over a trillion bits per square inch in a typical disk platter today.

This combination of dramatically reduced cost and size in data storage is what makes today’s big data analytics possible.

With ultra-low storage costs, building the data science infrastructure to collect and extract insights from huge amounts of data became a profitable approach for businesses.

And with the profusion of IoT devices that constantly generate and transmit users’ data, businesses are collecting data on an ever increasing number of activities, creating a massive amount of high-volume, high-velocity, and high-variety information assets (or the “three Vs of big data”).

Most of these activities (e.g. emails, videos, audio, chat messages, social media posts) generate unstructured data, which accounts for almost 80% of total enterprise data today and has grown twice as fast as structured data over the past decade.

125 Exabytes of enterprise data was stored in 2017; 80% was unstructured data. (Source: Credit Suisse)

This massive data growth dramatically transformed the way data is stored and analyzed, as the traditional tools and approaches were not equipped to handle the “three Vs of big data.” New technologies were developed with the ability to handle the ever increasing volume and variety of data, and at a faster speed and lower cost.

These new tools also have profound effects on how data scientists do their job — allowing them to monetize the massive data volume by performing analytics and building new applications that were not possible before. Below are the major big data management innovations that we think every data scientist should know about.

Relational Databases & NoSQL

Relational Database Management Systems (RDBMS) emerged in the 1970’s to store data as tables with rows and columns, using Structured Query Language (SQL) statements to query and maintain the database.

A relational database is basically a collection of tables, each with a schema that rigidly defines the attributes and types of data that it stores, as well as keys that identify specific rows and link related tables to facilitate access.
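To make the idea of a rigid schema and keys concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and sample rows are assumptions made up for illustration; they are not taken from any system discussed in this article.

```python
import sqlite3

# In-memory database; table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.execute("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,   -- key that uniquely identifies each row
        name        TEXT NOT NULL
    )
""")
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        amount      REAL NOT NULL CHECK (amount >= 0)
    )
""")

conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
conn.execute("INSERT INTO orders VALUES (100, 1, 25.0)")


def try_insert(sql):
    """Return True if the row is accepted, False if the schema rejects it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False


# An order pointing at a customer that does not exist violates the foreign key.
orphan_accepted = try_insert("INSERT INTO orders VALUES (101, 999, 10.0)")
# A negative amount violates the CHECK constraint.
negative_accepted = try_insert("INSERT INTO orders VALUES (102, 1, -5.0)")

# The shared key lets us join the two tables.
rows = conn.execute(
    "SELECT c.name, o.amount FROM orders o JOIN customers c USING (customer_id)"
).fetchall()
print(orphan_accepted, negative_accepted, rows)
```

Both bad rows are rejected by the schema itself rather than by application code, which is the kind of guarantee that makes relational databases attractive for business records.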

The RDBMS landscape was once ruled by Oracle and IBM, but today many open source options, like MySQL, SQLite, and PostgreSQL, are just as popular.

RDBMS ranked by popularity (Source: DB-Engines)

Relational databases found a home in the business world due to some very appealing properties. Data integrity is absolutely paramount in relational databases.

RDBMS satisfy the requirements of Atomicity, Consistency, Isolation, and Durability (or ACID-compliant) by imposing a number of constraints to ensure that the stored data is reliable and accurate, making them ideal for tracking and storing things like account numbers, orders, and payments. But these constraints come with costly tradeoffs.
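As a rough illustration of the "A" in ACID, the sketch below uses Python's built-in sqlite3 module to run a two-step money transfer inside a transaction. The accounts table, the balances, and the transfer helper are hypothetical; the point is only that when the second step fails, the first step is undone as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY, "
    "balance REAL NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 50.0)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money between accounts; both updates commit together or not at all."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return {account id: balance} for all accounts."""
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))


ok = transfer(conn, 1, 2, 30.0)        # succeeds
failed = transfer(conn, 1, 2, 500.0)   # would overdraw account 1, so it is rolled back
print(ok, failed, balances(conn))
```

In the failed transfer, the credit to account 2 has already been applied when the debit violates the CHECK constraint; the rollback removes it, so no money appears out of nowhere.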

Because of the schema and type constraints, RDBMS are terrible at storing unstructured or semi-structured data.

The rigid schema also makes RDBMS more expensive to set up, maintain, and grow. Setting up an RDBMS requires users to have specific use cases in mind in advance; any changes to the schema are usually difficult and time-consuming.

In addition, traditional RDBMS were designed to run on a single computer node, which means their speed is significantly slower when processing large volumes of data. Sharding RDBMS in order to scale horizontally while maintaining ACID compliance is also extremely challenging. All these attributes make traditional RDBMS ill-equipped to handle modern big data.

By the mid-2000’s, the existing RDBMS could no longer handle the changing needs and exponential growth of a few very successful online businesses, and many non-relational (or NoSQL) databases were developed as a result (here’s a story on how Facebook dealt with the limitations of MySQL when their data volume started to grow).

Without any known solutions at the time, these online businesses invented new approaches and tools to handle the massive amount of unstructured data they collected: Google created GFS, MapReduce, and BigTable; Amazon created Dynamo (the precursor to DynamoDB); Yahoo created Hadoop; Facebook created Cassandra and Hive; LinkedIn created Kafka.

Some of these businesses open sourced their work; some published research papers detailing their designs, resulting in a proliferation of databases with the new technologies, and NoSQL databases emerged as a major player in the industry.

An explosion of database options since the 2000’s. (Source: Korflatis et al., 2016)

NoSQL databases are schema agnostic and provide the flexibility needed to store and manipulate large volumes of unstructured and semi-structured data.

Users don’t need to know what types of data will be stored during set-up, and the system can accommodate changes in data types and schema.

Designed to distribute data across different nodes, NoSQL databases are generally more horizontally scalable and fault-tolerant.

However, these performance benefits also come with a cost: many NoSQL databases are not fully ACID compliant, and data consistency is not guaranteed. They instead provide “eventual consistency”: while old data is being overwritten, they may temporarily return results that are slightly out of date.

For example, Google’s search engine index can’t overwrite its data while people are simultaneously searching a given term, so it doesn’t give us the most up-to-date results when we search, but it gives us the latest, best answer it can.

While this setup won’t work in situations where data consistency is absolutely necessary (such as financial transactions), it’s just fine for tasks that require speed rather than pin-point accuracy.
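The eventual consistency behavior described above can be sketched with a toy in-memory store. This is a simplified model written for illustration, not how any particular NoSQL database is implemented: the class name, the replica count, and the explicit replicate() step (which a real system would run in the background) are all assumptions.

```python
from collections import deque


class EventuallyConsistentStore:
    """Toy key-value store: writes hit one replica and reach the others later."""

    def __init__(self, n_replicas=3):
        self.replicas = [{} for _ in range(n_replicas)]
        self.pending = deque()  # replication messages not yet delivered

    def write(self, key, value):
        self.replicas[0][key] = value  # acknowledged immediately
        for i in range(1, len(self.replicas)):
            self.pending.append((i, key, value))

    def read(self, key, replica):
        return self.replicas[replica].get(key)

    def replicate(self):
        """Deliver all queued updates to the other replicas."""
        while self.pending:
            i, key, value = self.pending.popleft()
            self.replicas[i][key] = value


store = EventuallyConsistentStore()
store.write("price", 10)
store.replicate()
store.write("price", 12)

stale = store.read("price", replica=2)   # replica 2 has not seen the new value yet
fresh = store.read("price", replica=0)
store.replicate()
converged = [store.read("price", replica=i) for i in range(3)]
print(stale, fresh, converged)
```

A reader that hits replica 2 before replication catches up sees the old price, which is acceptable for a search index but not for an account balance.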

There are now several different categories of NoSQL, each serving some specific purposes. Key-Value Stores, such as Redis, DynamoDB, and Cosmos DB, store only key-value pairs and provide basic functionality for retrieving the value associated with a known key.

They work best with a simple database schema and when speed is important. Wide Column Stores, such as Cassandra, Scylla, and HBase, store data in column families or tables, and are built to manage petabytes of data across a massive, distributed system.

Document Stores, such as MongoDB and Couchbase, store data in XML or JSON format, with the document name as key and the contents of the document as value.

The documents can contain many different value types, and can be nested, making them particularly well-suited to manage semi-structured data across distributed systems.

Graph Databases, such as Neo4J and Amazon Neptune, represent data as a network of related nodes or objects in order to facilitate data visualizations and graph analytics.

Graph databases are particularly useful for analyzing the relationships between heterogeneous data points, such as in fraud prevention or Facebook’s friends graph.
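To compare the four NoSQL categories side by side, the sketch below shapes the same small data set the way each family would typically hold it. Plain Python structures stand in for the databases, and all keys, names, and values are made up for illustration; real products add indexing, distribution, and query languages on top of these basic shapes.

```python
# One user record, shaped the way each NoSQL family would typically hold it.
# Plain Python structures stand in for the databases; all names are made up.

# Key-value store: an opaque value looked up by a known key.
key_value = {"user:42": '{"name": "Ada", "city": "London"}'}

# Document store: a nested, self-describing document addressed by its key.
documents = {
    "user:42": {
        "name": "Ada",
        "address": {"city": "London"},
        "tags": ["admin", "beta"],
    }
}

# Wide-column store: row key -> column family -> columns; rows may differ in columns.
wide_column = {
    "user:42": {
        "profile": {"name": "Ada", "city": "London"},
        "activity": {"logins": 17},
    },
    "user:43": {"profile": {"name": "Bob"}},
}

# Graph database: nodes plus edges; queries follow relationships.
nodes = {42: "Ada", 43: "Bob", 44: "Cy"}
friend_edges = {(42, 43), (43, 44)}


def friends_of(node_id):
    """Return the names of nodes directly connected to node_id."""
    linked = {b for a, b in friend_edges if a == node_id}
    linked |= {a for a, b in friend_edges if b == node_id}
    return sorted(nodes[n] for n in linked)


print(documents["user:42"]["address"]["city"])  # nested lookup in a document
print(friends_of(43))                           # one-hop traversal in a graph
```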

MongoDB is currently the most popular NoSQL database, and it has delivered substantial value for some businesses that had been struggling to handle their unstructured data with the traditional RDBMS approach.

Here are two industry examples: after MetLife spent years trying to build a centralized customer database on an RDBMS that could handle all its insurance products, someone at an internal hackathon built one with MongoDB within hours, and it went to production in 90 days.

YouGov, a market research firm that collects 5 gigabits of data an hour, saved 70 percent of the storage capacity it formerly used by migrating from an RDBMS to MongoDB.

Data Warehouse, Data Lake, & Data Swamp

As data sources continued to grow, performing data analytics with multiple databases became inefficient and costly. One solution, the Data Warehouse, emerged in the 1980’s; it centralizes an enterprise’s data from all of its databases.

A Data Warehouse supports the flow of data from operational systems to analytics/decision systems by creating a single repository of data from various sources (both internal and external). In most cases, a Data Warehouse is a relational database that stores processed data optimized for gathering business insights.

It collects data with predetermined structure and schema coming from transactional systems and business applications, and the data is typically used for operational reporting and analysis.

But data that goes into a data warehouse needs to be processed before it gets stored, and with today’s massive amount of unstructured data, that can take significant time and resources.

In response, businesses started maintaining Data Lakes in the 2010's, which store all of an enterprise’s structured and unstructured data at any scale. Data Lakes store raw data, and could be set up without having to first define the data structure and schema.

Data Lakes allow users to run analytics without having to move the data to a separate analytics system, enabling businesses to gain insights from new sources of data that were not available for analysis before, for instance by building machine learning models using data from log files, click-streams, social media, and IoT devices.

By making all of the enterprise data readily available for analysis, data scientists could answer a new set of business questions, or tackle old questions with new data.
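The difference between the two approaches is often summarized as "schema-on-write" (warehouse) versus "schema-on-read" (lake). The sketch below illustrates that contrast under simplifying assumptions: the events, field names, and the WAREHOUSE_SCHEMA dictionary are hypothetical, and Python lists stand in for the actual storage systems.

```python
import json

# Hypothetical raw events from different sources; field names are illustrative.
raw_events = [
    '{"user": "ada", "amount": 25.0}',
    '{"user": "bob", "clicks": ["home", "cart"]}',
    '{"device": "sensor-7", "temp_c": 21.5}',
]

WAREHOUSE_SCHEMA = {"user": str, "amount": float}


def load_into_warehouse(lines):
    """Schema-on-write: only rows that match the predefined schema are stored."""
    table, rejected = [], []
    for line in lines:
        record = json.loads(line)
        fits = set(record) == set(WAREHOUSE_SCHEMA) and all(
            isinstance(record[col], typ) for col, typ in WAREHOUSE_SCHEMA.items()
        )
        (table if fits else rejected).append(record)
    return table, rejected


def load_into_lake(lines):
    """Schema-on-read: keep every raw line untouched; interpret it later."""
    return list(lines)


def read_from_lake(lake, field):
    """Apply a schema only at query time by pulling out one field where present."""
    return [rec[field] for rec in map(json.loads, lake) if field in rec]


table, rejected = load_into_warehouse(raw_events)
lake = load_into_lake(raw_events)
print(len(table), len(rejected), len(lake))
print(read_from_lake(lake, "temp_c"))
```

The warehouse loader keeps one clean row and turns away the other two, while the lake keeps everything, so the sensor reading is still there when someone later thinks to ask for it.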

Data Warehouse and Data Lake Comparisons (Source: AWS)

A common challenge with the Data Lake architecture is that without the appropriate data quality and governance framework in place, when terabytes of structured and unstructured data flow into the Data Lakes, it often becomes extremely difficult to sort through their content.

The Data Lakes could turn into Data Swamps as the stored data become too messy to be usable. Many organizations are now calling for more data governance and metadata management practices to prevent Data Swamps from forming.

Distributed & Parallel Processing: Hadoop, Spark, & MPP

While storage and computing needs grew by leaps and bounds in the last several decades, traditional hardware has not advanced enough to keep up.

Enterprise data no longer fits neatly in standard storage, and most big data analytics tasks might take weeks or months to complete on a standard computer, or might simply not be possible at all.

To overcome this deficiency, many new technologies have evolved to include multiple computers working together, distributing the database to thousands of commodity servers. When a network of computers is connected and works together to accomplish the same task, the computers form a cluster.

A cluster can be thought of as a single computer, but it can dramatically improve performance, availability, and scalability over a single, more powerful machine, and at a lower cost, by using commodity hardware.

Apache Hadoop is an example of a distributed data infrastructure that leverages clusters to store and process massive amounts of data, and it is what enables the Data Lake architecture.

Evolution of database technologies (Source: Business Analytic 3.0)

When you think Hadoop, think “distribution.” Hadoop consists of three main components: Hadoop Distributed File System (HDFS), a way to store and keep track of your data across multiple (distributed) physical hard drives; MapReduce, a framework for processing data across distributed processors; and Yet Another Resource Negotiator (YARN), a cluster management framework that orchestrates the distribution of things such as CPU usage, memory, and network bandwidth allocation across distributed computers.

Hadoop’s processing layer is an especially notable innovation: MapReduce is a two step computational approach for processing large (multi-terabyte or greater) data sets distributed across large clusters of commodity hardware in a reliable, fault-tolerant way.

The first step is to distribute your data across multiple computers (Map), with each performing a computation on its slice of the data in parallel.

The next step is to combine those results in a pair-wise manner (Reduce). Google published a paper on MapReduce in 2004, which got picked up by Yahoo programmers who implemented it in the open source Apache environment in 2006, providing every business the capability to store an unprecedented volume of data using commodity hardware.
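The two steps can be sketched in a few lines of plain Python with the classic word-count example. This is a single-process illustration of the idea, not Hadoop's API: the function names and the two text chunks are assumptions, and the intermediate "shuffle" step that groups values by key is written out explicitly, whereas a framework would handle it between the Map and Reduce phases.

```python
from collections import defaultdict
from functools import reduce


def map_phase(chunk):
    """Map: each worker turns its slice of the data into (key, value) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Group all values by key so each key can be reduced independently."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine each key's values pair-wise into a single result."""
    return {
        key: reduce(lambda a, b: a + b, values)
        for key, values in grouped.items()
    }


# Each string stands in for the slice of data held by one machine.
chunks = ["big data big tools", "data lake data swamp"]
mapped = [map_phase(c) for c in chunks]
counts = reduce_phase(shuffle(mapped))
print(counts)
```

Because each call to map_phase sees only its own chunk and each key is reduced independently, both phases can be spread over many machines without coordination beyond the shuffle.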

Even though there are many open source implementations of the idea, the Google brand name MapReduce has stuck around, kind of like Jacuzzi or Kleenex.

Hadoop is built for batch computations: it scans massive amounts of data from disk in a single operation, distributes the processing across multiple nodes, and stores the results back on disk.

Querying zettabytes of indexed data that would take 4 hours to run in a traditional data warehouse environment could be completed in 10–12 seconds with Hadoop and HBase. Hadoop is typically used to generate complex analytics models or high volume data storage applications such as retrospective and predictive analytics; machine learning and pattern matching; customer segmentation and churn analysis; and active archives.

But MapReduce processes data in batches and is therefore not suitable for processing real-time data. Apache Spark, which began as a research project at UC Berkeley in 2009 and became an Apache project in 2013, was built to fill that gap.

Spark is a parallel data processing tool that is optimized for speed and efficiency by processing data in-memory. It operates under the same MapReduce principle, but runs much faster by completing most of the computation in memory and only writing to disk when memory is full or the computation is complete.

This in-memory computation allows Spark to “run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.” However, when the data set is so large that insufficient RAM becomes an issue (usually hundreds of gigabytes or more), Hadoop MapReduce might outperform Spark.

Spark also has an extensive set of data analytics libraries covering a wide range of functions: Spark SQL for SQL and structured data, MLlib for machine learning, Spark Streaming for stream processing, and GraphX for graph analytics.

Since Spark’s focus is on computation, it does not come with its own storage system and instead runs on a variety of storage systems such as Amazon S3, Azure Storage, and Hadoop’s HDFS.

In an MPP system, all the nodes are interconnected and data could be exchanged across the network (Source: IBM)

Hadoop and Spark are not the only technologies that leverage clusters to process large volumes of data.

Another popular computational approach to distributed query processing is called Massively Parallel Processing (MPP).

Similar to MapReduce, MPP distributes data processing across multiple nodes, and the nodes process the data in parallel for faster speed. But unlike Hadoop, MPP is used in RDBMS and utilizes a “shared-nothing” architecture: each node processes its own slice of the data with its own multi-core processors, making MPP databases many times faster than traditional RDBMS.
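A shared-nothing aggregation can be sketched as follows: rows are partitioned by key so that each node owns a disjoint slice, every node aggregates only its own slice, and a coordinator merges the partial results. This is a toy model under stated assumptions, not the internals of any MPP product. The node count, the sales rows, and the function names are made up, and a thread pool merely stands in for separate machines.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

N_NODES = 4  # illustrative cluster size


def partition(rows, n_nodes):
    """Assign each row to exactly one node by its key."""
    shards = [[] for _ in range(n_nodes)]
    for region_id, amount in rows:
        shards[region_id % n_nodes].append((region_id, amount))
    return shards


def local_aggregate(shard):
    """Runs on one node and touches only that node's shard."""
    totals = Counter()
    for region_id, amount in shard:
        totals[region_id] += amount
    return totals


def mpp_sum_by_region(rows, n_nodes=N_NODES):
    """Equivalent of SELECT region, SUM(amount) ... GROUP BY region."""
    shards = partition(rows, n_nodes)
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        partials = list(pool.map(local_aggregate, shards))
    merged = Counter()
    for partial in partials:  # coordinator merges the per-node results
        merged.update(partial)
    return dict(merged)


sales = [(1, 10), (2, 5), (1, 7), (5, 3), (2, 1)]
print(mpp_sum_by_region(sales))
```

Because all rows for a given region land on the same node, no node ever needs another node's data to compute its part of the answer.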

Some MPP databases, like Pivotal Greenplum, have mature machine learning libraries that allow for in-database analytics. However, as with traditional RDBMS, most MPP databases do not support unstructured data, and even structured data will require some processing to fit the MPP infrastructure; therefore it takes additional time and resources to set up the data pipeline for an MPP database.

Since MPP databases are ACID-compliant and deliver much faster speed than traditional RDBMS, they are usually employed in high-end enterprise data warehousing solutions such as Amazon Redshift, Pivotal Greenplum, and Snowflake. As an industry example, the New York Stock Exchange receives four to five terabytes of data daily and conducts complex analytics, market surveillance, capacity planning and monitoring.

The company had been using a traditional database that couldn’t handle the workload: data took hours to load, and query speed was poor. Moving to an MPP database reduced their daily analysis run time by eight hours.

Cloud Services

Another innovation that completely transformed enterprise big data analytics capabilities is the rise of cloud services.

In the bad old days before cloud services were available, businesses had to buy on-premises data storage and analytics solutions from software and hardware vendors, usually paying upfront perpetual software license fees and annual hardware maintenance and service fees. On top of those were the costs of power, cooling, security, disaster protection, IT staff, etc., for building and maintaining the on-premises infrastructure.

Even when it was technically possible to store and process big data, most businesses found it cost prohibitive to do so at scale.

Scaling with on-premises infrastructure also requires an extensive design and procurement process, which takes a long time to implement and requires substantial upfront capital. Many potentially valuable data collection and analytics possibilities were ignored as a result.

“As a Service” providers: e.g. Infrastructure as a Service (IaaS) and Storage as a Service (STaaS) (Source: IMELGRAT.ME)

The on-premises model began to lose market share quickly when cloud services were introduced in the late 2000’s; the global cloud services market has been growing 15% annually over the past decade.

Cloud service platforms provide subscriptions to a variety of services (from virtual computing to storage infrastructure to databases), delivered over the internet on a pay-as-you-go basis, offering customers rapid access to flexible and low-cost storage and virtual computing resources.

Cloud service providers are responsible for all of their hardware and software purchases and maintenance, and usually have a vast network of servers and support staff to provide reliable services.

Many businesses discovered that they could significantly reduce costs and improve operational efficiencies with cloud services, and are able to develop and productionize their products more quickly with the out-of-the-box cloud resources and their built-in scalability.

By removing the upfront costs and time commitment needed to build on-premises infrastructure, cloud services also lower the barriers to adopting big data tools, and have effectively democratized big data analytics for small and medium-sized businesses.

There are several cloud services models, with public clouds being the most common. In a public cloud, all hardware, software, and other supporting infrastructure are owned and managed by the cloud service provider.

Customers share the cloud infrastructure with other “cloud tenants” and access their services through a web browser.

A private cloud is often used by organizations with special security needs such as government agencies and financial institutions. In a private cloud, the services and infrastructure are dedicated solely to one organization and are maintained on a private network.

The private cloud can be on-premises, or hosted by a third-party service provider elsewhere. Hybrid clouds combine private clouds with public clouds, allowing organizations to reap the advantages of both.

In a hybrid cloud, data and applications can move between private and public clouds for greater flexibility: e.g. the public cloud could be used for high-volume, lower-security data, and the private cloud for sensitive, business-critical data like financial reporting.

The multi-cloud model involves multiple cloud platforms, each delivering a specific application service. A multi-cloud can be a combination of public, private, and hybrid clouds to achieve the organization’s goals. Organizations often choose multi-cloud to suit their particular business, locations, and timing needs, and to avoid vendor lock-in.


Most Related Articles

Top 2 Online Data Science Courses to Improve your Career in 2023

The discipline of Data Science is expanding quickly and has enormous promise. It is used in various sectors, including manufacturing, retail, healthcare, and finance. Today, a wide variety of online Data Science courses are accessible. With so many choices, you might need help choosing the best one. This article will summarize the best Data Science programs so you can choose the program that's best for you. What Is a Data Science Course?The theoretical ideas of data science are taught to novices in a Data Science course. Additionally, you'll learn about the steps involved in Data Science, such as mathematical and statistical analysis, data preparation & staging, data interpretation, data visualization, and methods for presenting data insights in an organizational context. Advanced subjects, such as employing neural networks to develop recommendation engines, are covered in more specialized courses.Why Data Science?In the expanding field of data science, a data scientist earns one of the best jobs. Data science gained popularity and started to be utilised in an expanding number of applications when big data appeared and the necessity to manage these massive volumes of data arose. Data science, which enables companies to derive conclusions on the basis and take measures based on those conclusions, is one of the primary applications of artificial intelligence. Data Scientists are in great demand due to Data Science's importance for all industries. There is fierce rivalry everywhere. But if you can get an advantage over your competitors, you may easily land lucrative positions in demand. Taking data science courses online might give you that advantage. Data science involves:● Analytical capabilities.● A foundational understanding of the field.● Practical abilities to produce outcomes. To understand data science, you don't need to spend years working with big data or have a tonne of expertise in the software sector. 
You may always study from the greatest online Data Science courses and create a way to join this area while working. These are the top data science programs you can take to further your career and understand the subject. Let's discover more about the top data science courses available online. 2 Online Data Science Courses for 2023 to Advance Your Career1. Program for Business Analytics CertificationThis online Data Science course lasts three months and calls for 8 to 10 hours of study per week. The course created for analytics aspirants is one of the greatest data science courses in India and one of the market's top data science courses. It has more than 100 hours of material. The Data Science course was built with the help of business professionals from organizations like Flipkart, Gardener, and Actify. This is one of the finest online courses for learning the fundamentals of data science since it offers committed mentor assistance, prompt doubt resolution services, and live sessions with subject matter specialists. Students will gain knowledge in statistics, optimization, business problem-solving, and predictive modeling via this course. This online data science course was created for managers, engineers, recent graduates, software and IT workers, and marketing and salespeople. Students will concentrate on corporate problem-solving, insights, and narrative for the very first 3 weeks of the course. In this portion, you will discover how to formulate hypotheses, comprehend business issues, and concentrate on narrative. The following four weeks will be devoted to understanding statistics, optimization, and exploratory data analysis. A case study assignment will also be included. You will study several machine learning approaches to evaluate data and provide insights during the last five weeks, which will be devoted to predictive analysis. 
There will be three initiatives at the industry level: uber supply-demand gap, customer creditworthiness, and market mix modeling for e-commerce. Students who take this business analyst course have access to various options, including the ability to apply for managerial, business analyst, and data analyst employment. 2. Data Science Master's DegreeIt is among the top online courses for Data Science. This master’s in Data Science program lasts for 18 months and is delivered online. If you engage in expert online data science courses from a recognised and trusted provider, you may put your talents to the test on real assignments.This course is one of the best Data Science courses since it offers a variety of distinctive characteristics. There are several specialization options available for the course. Business intelligence/data analytics, Natural Language Processing, and Deep Learning,Business Analyst course, and Data Engineering are the available specialization areas. In addition to these specializations, the course offers its students a platform to study more than 14 programming languages and technologies that are utilized in the diverse area of data science, as well as industry mentoring and committed career assistance. One of the greatest data science courses in India, it includes tools like Python, Tableau, Hadoop, MySQL, Hive, Excel, PowerBI, MongoDB, Shiny, Keras, TensorFlow, PySpark, HBase, and Apache Airflow. More than 400 hours of learning material are planned for the online data science course. You will get a thorough understanding of data science and topics linked to it through these videos and publications, enabling you to succeed in any data science interviews. For the students to have a practical and hands-on understanding of all the tools and ideas covered in the course, the online data course includes more than ten industrial projects and case studies. The students may study various subjects, languages, and tools throughout the course. 
The first four weeks will be devoted to learning the fundamentals of Python and how to use Excel to deal with data. The 11 weeks will be devoted to teaching students how to utilize all the tools needed for data science and how to prepare and work with the provided data. You will acquire in-depth information on Python, Excel, and SQL in this part. Learning about machine learning and its many algorithms will be the main emphasis of the next nine weeks. ConclusionThe best online Data Science courses provide a good introduction to the subject, which is how the article can be summed up. They go through the fundamentals of data science, such as handling data, cleansing data, and doing statistical analysis. They also provide a more thorough examination of Data Science and machine learning Anyone interested in pursuing a career in data science must take these courses.

nikos_datasource

Apr 02, 2020

Data Science

Machine Learning

Model Evaluation Metrics in Machine Learning

CreditsPredictive models have become a trusted advisor to many businesses and for a good reason. These models can “foresee the future”, and there are many different methods available, meaning any industry can find one that fits their particular challenges.When we talk about predictive models, we are talking either about a regression model (continuous output) or a classification model (nominal or binary output). In classification problems, we use two types of algorithms (dependent on the kind of output it creates):Class output: Algorithms like SVM and KNN create a class output. For instance, in a binary classification problem, the outputs will be either 0 or 1. However, today we have algorithms that can convert these class outputs to probability.Probability output: Algorithms like Logistic Regression, Random Forest, Gradient Boosting, Adaboost, etc. give probability outputs. Converting probability outputs to class output is just a matter of creating a threshold probability.IntroductionWhile data preparation and training a machine learning model is a key step in the machine learning pipeline, it’s equally important to measure the performance of this trained model. 
How well the model generalizes on the unseen data is what defines adaptive vs non-adaptive machine learning models.By using different metrics for performance evaluation, we should be in a position to improve the overall predictive power of our model before we roll it out for production on unseen data.Without doing a proper evaluation of the ML model using different metrics, and depending only on accuracy, it can lead to a problem when the respective model is deployed on unseen data and can result in poor predictions.This happens because, in cases like these, our models don’t learn but instead memorize;hence, they cannot generalize well on unseen data.Model Evaluation MetricsLet us now define the evaluation metrics for evaluating the performance of a machine learning model, which is an integral component of any data science project. It aims to estimate the generalization accuracy of a model on the future (unseen/out-of-sample) data.Confusion MatrixA confusion matrix is a matrix representation of the prediction results of any binary testing that is often used to describe the performance of the classification model (or “classifier”) on a set of test data for which the true values are known.The confusion matrix itself is relatively simple to understand, but the related terminology can be confusing.Confusion matrix with 2 class labels.Each prediction can be one of the four outcomes, based on how it matches up to the actual value:True Positive (TP): Predicted True and True in reality.True Negative (TN): Predicted False and False in reality.False Positive (FP): Predicted True and False in reality.False Negative (FN): Predicted False and True in reality.Now let us understand this concept using hypothesis testing.A Hypothesis is speculation or theory based on insufficient evidence that lends itself to further testing and experimentation. 
With further testing, a hypothesis can usually be proven true or false.

A null hypothesis is a hypothesis that says there is no statistically significant relationship between the two variables in the hypothesis. It is the hypothesis that the researcher is trying to disprove.

Ideally, we would always reject the null hypothesis when it is false, and we would accept the null hypothesis when it is indeed true.

Even though hypothesis tests are meant to be reliable, there are two types of errors that can occur. These errors are known as Type I and Type II errors.

For example, when examining the effectiveness of a drug, the null hypothesis would be that the drug does not affect a disease.

Type I Error: equivalent to a False Positive (FP).

The first kind of error that is possible involves the rejection of a null hypothesis that is true.

Let’s go back to the example of a drug being used to treat a disease. If we reject the null hypothesis in this situation, then we claim that the drug does have some effect on a disease. But if the null hypothesis is true, then, in reality, the drug does not combat the disease at all. The drug is falsely claimed to have a positive effect on a disease.

Type II Error: equivalent to a False Negative (FN).

The other kind of error occurs when we accept a false null hypothesis. This sort of error is called a Type II error and is also referred to as an error of the second kind.

If we think back again to the scenario in which we are testing a drug, what would a Type II error look like?
A Type II error would occur if we accepted that the drug has no effect on the disease, but in reality, it did.

A sample Python implementation of the confusion matrix.

    import warnings
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn import model_selection
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import confusion_matrix

    # In a Jupyter notebook, also run: %matplotlib inline

    # Ignore warnings
    warnings.filterwarnings('ignore')

    # Load the iris dataset (the file has no header row)
    url = "http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
    df = pd.read_csv(url, header=None)
    X = df.iloc[:, 0:4]
    y = df.iloc[:, 4]

    # Test size
    test_size = 0.33
    # Fixed seed so the split is reproducible
    seed = 7

    # Split data into train and test set
    X_train, X_test, y_train, y_test = model_selection.train_test_split(
        X, y, test_size=test_size, random_state=seed)

    # Train model
    model = LogisticRegression()
    model.fit(X_train, y_train)
    pred = model.predict(X_test)

    # Construct the confusion matrix (rows: actual, columns: predicted)
    labels = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
    cm = confusion_matrix(y_test, pred, labels=labels)
    print(cm)

    # Plot the matrix with one tick per class
    fig = plt.figure()
    ax = fig.add_subplot(111)
    cax = ax.matshow(cm)
    plt.title('Confusion matrix')
    fig.colorbar(cax)
    ax.set_xticks(range(len(labels)))
    ax.set_yticks(range(len(labels)))
    ax.set_xticklabels(labels)
    ax.set_yticklabels(labels)
    plt.xlabel('Predicted Values')
    plt.ylabel('Actual Values')
    plt.show()

Confusion matrix with 3 class labels.

The diagonal elements represent the number of points for which the predicted label is equal to the true label, while anything off the diagonal was mislabeled by the classifier. Therefore, the higher the diagonal values of the confusion matrix the better, indicating many correct predictions.

In the run pictured above (exact counts depend on the split and library version), the classifier predicted all the 13 setosa and 18 virginica plants in the test data perfectly. However, it incorrectly classified 4 of the versicolor plants as virginica.

There is also a list of rates that are often computed from a confusion matrix for a binary classifier:

1. Accuracy

Overall, how often is the classifier correct?

Accuracy = (TP+TN)/total

When our classes are roughly equal in size, we can use accuracy, which gives us the proportion of correctly classified values.

Accuracy is a common evaluation metric for classification problems. It’s the number of correct predictions made as a ratio of all predictions made.

Misclassification Rate (Error Rate): Overall, how often is it wrong? Since accuracy is the percent we correctly classified (success rate), it follows that our error rate (the percentage we got wrong) can be calculated as follows:

Misclassification Rate = (FP+FN)/total

We use the sklearn module to compute the accuracy of a classification task, as shown below.

    # Import modules
    import warnings
    from sklearn import datasets, model_selection
    from sklearn.linear_model import LogisticRegression

    # Ignore warnings
    warnings.filterwarnings('ignore')

    # Load the iris dataset
    iris = datasets.load_iris()
    # Create feature matrix
    X = iris.data
    # Create target vector
    y = iris.target

    # Test size
    test_size = 0.33
    # Fixed seed so the split is reproducible
    seed = 7

    # Cross-validation settings: 10 unshuffled folds
    # (random_state only applies when shuffle=True, so it is omitted here)
    kfold = model_selection.KFold(n_splits=10)

    # Model instance
    model = LogisticRegression()

    # Evaluate model performance with cross-validation
    scoring = 'accuracy'
    results = model_selection.cross_val_score(model, X, y, cv=kfold, scoring=scoring)
    print('Accuracy - val set: %.2f%% (%.2f)' % (results.mean()*100, results.std()))

    # Split data
    X_train, X_test, y_train, y_test = model_selection.train_test_split(
        X, y, test_size=test_size, random_state=seed)
    # Fit model
    model.fit(X_train, y_train)
    # Accuracy on test set
    result = model.score(X_test, y_test)
    print("Accuracy - test set: %.2f%%" % (result*100.0))

The classification accuracy is 88% on the validation set.

2. Precision

When it predicts yes, how often is it correct?

Precision = TP/predicted yes

When we have a class imbalance, accuracy can become an unreliable metric for measuring our performance. For instance, if we had a 99/1 split between two classes, A and B, where the rare event, B, is our positive class, we could build a model that was 99% accurate by just saying everything belonged to class A. Clearly, we shouldn’t bother building a model if it doesn’t do anything to identify class B; thus, we need different metrics that will discourage this behavior. For this, we use precision and recall instead of accuracy.

3. Recall or Sensitivity

When it’s actually yes, how often does it predict yes?

True Positive Rate = TP/actual yes

Recall gives us the true positive rate (TPR), which is the ratio of true positives to everything positive.

In the case of the 99/1 split between classes A and B, the model that classifies everything as A would have a recall of 0% for the positive class, B (precision would be undefined, 0/0). Precision and recall provide a better way of evaluating model performance in the face of a class imbalance. They will correctly tell us that the model has little value for our use case.

Just like accuracy, both precision and recall are easy to compute and understand but require thresholds. Besides, precision and recall each consider only half of the confusion matrix: precision uses the predicted-positive cells (TP and FP), while recall uses the actual-positive cells (TP and FN).

4. F1 Score

The F1 score is the harmonic mean of the precision and recall, where an F1 score reaches its best value at 1 (perfect precision and recall) and worst at 0.

F1 = 2 * (Precision * Recall) / (Precision + Recall)

Why harmonic mean? Since the harmonic mean of a list of numbers skews strongly toward the least elements of the list, it tends (compared to the arithmetic mean) to mitigate the impact of large outliers and aggravate the impact of small ones.

An F1 score punishes extreme values more.
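To see how the harmonic mean punishes an imbalance between precision and recall, the following minimal sketch compares it with the arithmetic mean. The helper names `arithmetic_mean` and `f1_score_from` and the precision/recall pairs are illustrative assumptions, not taken from any library or from the datasets used in this article.

```python
def arithmetic_mean(precision, recall):
    """Plain average of precision and recall."""
    return (precision + recall) / 2


def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall (defined as 0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical (precision, recall) pairs, from balanced to extremely unbalanced
pairs = [(0.9, 0.9), (0.9, 0.5), (0.9, 0.1), (1.0, 0.0)]

for p, r in pairs:
    print(
        f"precision={p:.1f} recall={r:.1f} "
        f"arithmetic={arithmetic_mean(p, r):.2f} F1={f1_score_from(p, r):.2f}"
    )
```

With precision 0.9 and recall 0.1, the arithmetic mean is still 0.5, while the F1 score drops to 0.18: the weaker of the two values dominates the result.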
Ideally, an F1 Score could be an effective evaluation metric in the following classification scenarios:

When FP and FN are equally costly, meaning that missing true positives and finding false positives impact the model in almost the same way, as in a cancer detection classification example
Adding more data doesn’t effectively change the outcome
TN is high (like with flood predictions, cancer predictions, etc.)

A sample Python implementation of the F1 score.

    import warnings
    import pandas
    from sklearn import model_selection
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import precision_score, recall_score, f1_score

    warnings.filterwarnings('ignore')

    # Load the Pima Indians diabetes dataset (the file has no header row)
    url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
    dataframe = pandas.read_csv(url, header=None)
    dat = dataframe.values
    X = dat[:, :-1]
    y = dat[:, -1]

    test_size = 0.33
    seed = 7
    model = LogisticRegression()

    # Split data
    X_train, X_test, y_train, y_test = model_selection.train_test_split(
        X, y, test_size=test_size, random_state=seed)
    model.fit(X_train, y_train)

    # Predict class labels for the test set
    pred = model.predict(X_test)

    # precision: tp / (tp + fp)
    precision = precision_score(y_test, pred)
    print('Precision: %f' % precision)
    # recall: tp / (tp + fn)
    recall = recall_score(y_test, pred)
    print('Recall: %f' % recall)
    # f1: 2 tp / (2 tp + fp + fn)
    f1 = f1_score(y_test, pred)
    print('F1 score: %f' % f1)

5. Specificity

When it’s no, how often does it predict no?

True Negative Rate = TN/actual no

It is the true negative rate or the proportion of true negatives to everything that should have been classified as negative.

Note that, together, specificity and sensitivity consider the full confusion matrix: specificity uses the actual-negative cells (TN and FP), and sensitivity uses the actual-positive cells (TP and FN).

6. Receiver Operating Characteristics (ROC) Curve

Measuring the area under the ROC curve is also a very useful method for evaluating a model.
By plotting the true positive rate (sensitivity) versus the false positive rate (1 - specificity), we get the Receiver Operating Characteristic (ROC) curve. This curve allows us to visualize the trade-off between the true positive rate and the false positive rate.

In a good ROC curve, the curve bends toward the top-left corner. The dashed diagonal line represents random guessing (no predictive value) and is used as a baseline; anything below that is considered worse than guessing.

A sample Python implementation of the ROC curves.

    # Classification: area under the ROC curve
    import warnings
    import pandas
    import matplotlib.pyplot as plt
    from sklearn import model_selection
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score, roc_curve

    warnings.filterwarnings('ignore')

    # Load the Pima Indians diabetes dataset (the file has no header row)
    url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
    dataframe = pandas.read_csv(url, header=None)
    dat = dataframe.values
    X = dat[:, :-1]
    y = dat[:, -1]

    test_size = 0.33
    seed = 7
    model = LogisticRegression()

    # Split data
    X_train, X_test, y_train, y_test = model_selection.train_test_split(
        X, y, test_size=test_size, random_state=seed)
    model.fit(X_train, y_train)

    # Predict probabilities
    probs = model.predict_proba(X_test)
    # Keep probabilities for the positive outcome only
    probs = probs[:, 1]

    auc = roc_auc_score(y_test, probs)
    print('AUC - Test Set: %.2f%%' % (auc*100))

    # Calculate ROC curve
    fpr, tpr, thresholds = roc_curve(y_test, probs)
    # Plot the no-skill baseline
    plt.plot([0, 1], [0, 1], linestyle='--')
    # Plot the ROC curve for the model
    plt.plot(fpr, tpr, marker='.')
    plt.xlabel('False positive rate')
    plt.ylabel('Sensitivity/ Recall')
    # Show the plot
    plt.show()

In the example above, the AUC is relatively close to 1 and greater than 0.5.
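To make the construction of the curve concrete, here is a minimal sketch that builds ROC points by sweeping a decision threshold over the predicted scores and then integrates the area with the trapezoid rule. The function names `roc_points` and `auc_trapezoid` and the four-sample toy data are assumptions for illustration; in practice `sklearn.metrics.roc_curve` and `roc_auc_score` do this work.

```python
import numpy as np


def roc_points(y_true, scores):
    """Return (fpr, tpr) arrays obtained by thresholding scores from high to low."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    # Start above every score (nothing predicted positive), then lower the threshold
    thresholds = np.concatenate(([np.inf], np.sort(np.unique(scores))[::-1]))
    n_pos = np.sum(y_true == 1)
    n_neg = np.sum(y_true == 0)
    fpr, tpr = [], []
    for t in thresholds:
        pred_pos = scores >= t
        tp = np.sum(pred_pos & (y_true == 1))
        fp = np.sum(pred_pos & (y_true == 0))
        tpr.append(tp / n_pos)  # sensitivity / recall
        fpr.append(fp / n_neg)  # 1 - specificity
    return np.array(fpr), np.array(tpr)


def auc_trapezoid(fpr, tpr):
    """Area under the piecewise-linear ROC curve."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))


# Toy labels and scores (hypothetical values, not from the diabetes dataset)
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

fpr, tpr = roc_points(y_true, scores)
auc_value = auc_trapezoid(fpr, tpr)
print("FPR:", fpr)
print("TPR:", tpr)
print("AUC:", auc_value)
```

Each threshold yields one (FPR, TPR) point, so the curve summarizes every possible cut-off at once; for this toy data the area comes out to 0.75.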
A perfect classifier will have the ROC curve go up along the Y-axis and then along the top of the plot, parallel to the X-axis.

Log Loss

Log Loss is the most important classification metric based on probabilities.

As the predicted probability of the true class gets closer to zero, the loss increases sharply and without bound.

It measures the performance of a classification model where the prediction input is a probability value between 0 and 1. For binary labels y and predicted probabilities p of the positive class, it is the average of -(y * log(p) + (1 - y) * log(1 - p)) over all samples. Log loss increases as the predicted probability diverges from the actual label. The goal of any machine learning model is to minimize this value. As such, smaller log loss is better, with a perfect model having a log loss of 0.

A sample Python implementation of the Log Loss.

    # Classification: log loss
    import warnings
    import pandas
    from sklearn import model_selection
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import log_loss

    warnings.filterwarnings('ignore')

    # Load the Pima Indians diabetes dataset (the file has no header row)
    url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
    dataframe = pandas.read_csv(url, header=None)
    dat = dataframe.values
    X = dat[:, :-1]
    y = dat[:, -1]

    test_size = 0.33
    seed = 7
    model = LogisticRegression()

    # Split data
    X_train, X_test, y_train, y_test = model_selection.train_test_split(
        X, y, test_size=test_size, random_state=seed)
    model.fit(X_train, y_train)

    # Log loss is defined on predicted probabilities
    probs = model.predict_proba(X_test)
    print("Logloss: %.2f" % log_loss(y_test, probs))

    # Hard 0/1 labels are scored as fully confident predictions,
    # so every misclassification is penalized very heavily
    pred = model.predict(X_test)
    print("Logloss (hard labels): %.2f" % log_loss(y_test, pred))

In the reported run, scoring the hard labels from model.predict gave a log loss of 8.02.

Jaccard Index

Jaccard Index is one of the simplest ways to calculate and find out the accuracy of a classification ML model. Let’s understand it with an example. Suppose we have a labeled test set, with labels as:

y = [0,0,0,0,0,1,1,1,1,1]

And our model has predicted the labels as:

y1 = [1,1,0,0,0,1,1,1,1,1]

Venn diagram of the labels of the test set and the labels of the predictions, showing their intersection and union.

Jaccard Index or Jaccard similarity coefficient is a statistic used in understanding the similarities between sample sets.
The measurement emphasizes the similarity between finite sample sets and is formally defined as the size of the intersection divided by the size of the union of the two labeled sets:

J(y, y1) = |y ∩ y1| / |y ∪ y1| = |y ∩ y1| / (|y| + |y1| - |y ∩ y1|)

Jaccard Index or Intersection over Union (IoU)

So, for our example, we can see that the intersection of the two sets is equal to 8 (since eight values are predicted correctly) and the union is 10 + 10 - 8 = 12. So, the Jaccard index gives us the accuracy as:

J(y, y1) = 8 / 12 ≈ 0.67

So, the accuracy of our model, according to the Jaccard Index, becomes 0.67, or 67%.

The higher the Jaccard index, the higher the accuracy of the classifier.

A sample Python implementation of the Jaccard index. Note that this function compares sets of distinct values, so it illustrates the formula on two small sets rather than the label-by-label example above.

    import numpy as np

    def compute_jaccard_similarity_score(x, y):
        # Treat both inputs as sets of distinct values
        intersection_cardinality = len(set(x).intersection(set(y)))
        union_cardinality = len(set(x).union(set(y)))
        return intersection_cardinality / float(union_cardinality)

    score = compute_jaccard_similarity_score(np.array([0, 1, 2, 5, 6]), np.array([0, 2, 3, 5, 7, 9]))
    print("Jaccard Similarity Score : %s" % score)

Jaccard Similarity Score : 0.375

Kolmogorov-Smirnov chart

K-S or Kolmogorov-Smirnov chart measures the performance of classification models. More accurately, K-S is a measure of the degree of separation between positive and negative distributions.

The cumulative frequency for the observed and hypothesized distributions is plotted against the ordered frequencies. The vertical double arrow indicates the maximal vertical difference.

The K-S is 100 if the scores partition the population into two separate groups in which one group contains all the positives and the other all the negatives. On the other hand, if the model cannot differentiate between positives and negatives, then it is as if the model selects cases randomly from the population.
The K-S would be 0.

In most classification models the K-S will fall between 0 and 100, and the higher the value, the better the model is at separating the positive from negative cases.

The K-S may also be used to test whether two underlying one-dimensional probability distributions differ. It is a very efficient way to determine if two samples are significantly different from each other.

A sample Python implementation of the Kolmogorov-Smirnov test.

    from scipy.stats import kstest
    import random

    # Number of random values to draw
    N = 10

    # Draw N values uniformly from [0, 1)
    actual = []
    for i in range(N):
        actual.append(random.random())
    print(actual)

    # One-sample K-S test against the standard normal distribution
    x = kstest(actual, "norm")
    print(x)

The null hypothesis used here assumes that the numbers follow the normal distribution. The test returns the K-S statistic and a p-value. If the p-value is < alpha, we reject the null hypothesis.

Alpha is defined as the probability of rejecting the null hypothesis given the null hypothesis (H0) is true. For most practical applications, alpha is chosen as 0.05.

Gain and Lift Chart

Gain or Lift is a measure of the effectiveness of a classification model calculated as the ratio between the results obtained with and without the model. Gain and lift charts are visual aids for evaluating the performance of classification models. However, in contrast to the confusion matrix, which evaluates models on the whole population, a gain or lift chart evaluates model performance in a portion of the population.

The higher the lift (i.e. the further up it is from the baseline), the better the model.

The following gains chart, run on a validation set, shows that with 50% of the data, the model contains 90% of targets. Adding more data adds a negligible increase in the percentage of targets included in the model.

Gain/lift chart

Lift charts are often shown as a cumulative lift chart, which is also known as a gains chart.
Therefore, gains charts are sometimes (perhaps confusingly) called “lift charts”, but they are more accurately cumulative lift charts.

One of their most common uses is in marketing, to decide if a prospective client is worth calling.

Gini Coefficient

The Gini coefficient or Gini Index is a popular metric for imbalanced class values. The coefficient ranges from 0 to 1 where 0 represents perfect equality and 1 represents perfect inequality. Here, if the value of an index is higher, then the data will be more dispersed.

The Gini coefficient can be computed from the area under the ROC curve (AUC) using the following formula:

Gini Coefficient = (2 * AUC) - 1

Conclusion

Understanding how well a machine learning model is going to perform on unseen data is the ultimate purpose behind working with these evaluation metrics. Metrics like accuracy, precision, and recall are good ways to evaluate classification models for balanced datasets, but if the data is imbalanced and there’s a class disparity, then other methods like ROC/AUC and the Gini coefficient perform better in evaluating the model performance.

Well, this concludes this article. I hope you guys have enjoyed reading it. Feel free to share your comments/thoughts/feedback in the comment section.

Thanks for reading!

Juan Guillermo Gómez Ramírez

Apr 02, 2020

Data Science

DataSource AI Hosts KTM AG's Inaugural AI Challenge: "Code the Light Fantastic"

DataSource AI announces the launch of the KTM AG inaugural AI Challenge, an unprecedented 3-month online competition that aims to revolutionise two-wheeler innovation through artificial intelligence and deep learning.

KTM AG is a global frontrunner in two-wheeler innovation, known for pushing the boundaries of what's possible in the world of motorcycles. With a rich history of groundbreaking engineering and a commitment to cutting-edge technology, KTM AG has set new standards in performance, design, and safety. As a global leader in two-wheeler innovation, KTM AG invites participants to embark on this groundbreaking innovation journey.

At the core of this competition lies a challenge set to redefine the future of motorcycle lighting systems. Participants are tasked with developing an algorithm for a high-beam lighting system utilizing a pixel matrix. Participants can find detailed guidelines in the Datathon competition.

The datathon unfolds in a 3-tiered cascade model:

This Code Challenge by KTM AG promises not only substantial rewards but also an exciting opportunity to shape the future of two-wheeler technology, along with supporting the participants to upscale and test their knowledge in a global AI competition.

The cumulative budget for this remarkable Code Challenge by KTM AG is a substantial €24,000, motivating participants with not only the opportunity to push the boundaries of two-wheeler technology but also significant rewards for those who rise to the occasion. With cumulative prizes, contestants have the chance to potentially take home a maximum reward of €10,800 in addition to contributing to cutting-edge advancements in the field.

We invite all aspiring innovators, data scientists, and AI enthusiasts to join us in this journey to "Code the Light Fantastic."

For more information, rules, and registration details, please register here.

About DataSource: At DataSource AI, we are driven by a singular mission - to democratise the immense power of data science and AI/ML for businesses of all sizes and budgets. We facilitate AI competitions, for businesses of all sizes and budgets by harnessing our extensive data expert community that's collaborating over our intelligent AI algorithm crowdsourcing platform. Our community is at the heart of what we do. We've built a diverse and talented pool of data experts who are passionate about solving real-world problems. They collaborate, ideate, and innovate, driving forward the frontiers of data science.

nikos_datasource

Apr 02, 2020

Data Science

Data-Driven Creativity: Enhancing Video Content through Data Science

In the age of digital marketing and content creation, data-driven creativity is becoming an increasingly important concept. It's the fusion of artistic vision with the insights gleaned from data science to enhance the impact and effectiveness of video content. This 2500-word blog will explore how data science can be leveraged to elevate video content creation, ensuring that it not only engages but also resonates with the intended audience.

Introduction to Data-Driven Creativity

Data-driven creativity marks a groundbreaking shift in video content creation, blending artistic vision with the insights provided by data science. This combination allows creators to break free from conventional creative limits, using data analytics to develop content that is both visually captivating and strategically significant. By delving into viewer behavior, preferences, and interactions, creators can refine their stories and visuals, achieving a deeper connection with their audience. This technique effectively transforms data into a guide for storytelling, steering content towards increased relevance and attractiveness. Consequently, video content becomes a more potent medium for engaging viewers and delivering impactful messages. Fundamentally, data-driven creativity is about converting data points into compelling stories and turning analytical insights into creative masterpieces, thereby redefining the standards of digital video content.

Understanding the Role of Data in Video Content Creation

Exploring the Role of Data in Video Content Creation ventures into the rapidly growing realm of data-driven creativity, where data science emerges as a key instrument in enriching video content. In this realm, data transcends mere figures to become a narrative element, providing rich insights into what audiences prefer, how they behave, and emerging trends. Utilizing data, video creators can break free from conventional creative constraints, shaping their stories to more deeply connect with viewers. This process involves a detailed examination of viewer interactions, demographics, and feedback to hone storytelling skills, aiming to create videos that are not only watched but also emotionally impactful and memorable. Data-driven creativity is a fusion of art and science, where each view, reaction, and comment plays a role in directing the trajectory of video content, enhancing its relevance, engagement, and effect. This marks a transformative phase in content creation, where data equips creators to weave narratives that are not just creatively rich but also finely tuned to the dynamic preferences and interests of their audience.

The Process of Gathering and Analyzing Data

Collecting and analyzing data forms the foundation of data-driven creativity, especially in the realm of video content enhancement. This process involves the acquisition of key information, including audience demographics, interaction metrics, and performance measures, utilizing sophisticated tools and technologies. These range from social media analytics to advanced data mining applications designed to track a broad spectrum of viewer interactions. Once collected, this data undergoes thorough analysis to identify trends, preferences, and behaviors within the target audience. Such analysis equips content creators with insightful knowledge, allowing them to adjust their video content for greater appeal and connection with their audience. Leveraging these insights, creators can modify elements such as the tone, style, and themes of their content, revolutionizing storytelling methods and ensuring their content is both captivating and impactful. This integration of data science with creative storytelling heralds a transformative phase in video content production, where analytical findings significantly enhance artistic expression.

Tailoring Content to Audience Preferences

Adapting content to audience preferences through data-driven creativity signifies a vital evolution in video content production. By incorporating data science, creators gain profound insights into audience behaviors, likes, and engagement patterns. This approach facilitates the creation of content that better resonates with viewers, ensuring everything from the plot to visual elements aligns with their interests. Utilizing analytics such as viewer habits and interaction rates, creators can pinpoint engaging aspects for better video content. Using a high-quality video editor tool is important to make the video look better. This knowledge allows precise adjustments, making the content not only captivating but also highly relevant. Ultimately, incorporating data in video content creation leads to more impactful and resonant viewer experiences, forging a deeper bond between the audience and the content.

Enhancing Storytelling with Data Insights

Utilizing data insights to enhance storytelling is a groundbreaking method in video content production. Termed data-driven creativity, this technique blends the storytelling craft with data science accuracy. Content creators leverage analysis of viewer engagement, preferences, and behavior to fine-tune their narratives, ensuring a deeper connection with their audience. This integration results in not only engaging narratives but also ones that are in tune with audience interests and emerging trends. Insights from data grant a clearer understanding of what truly engages viewers, empowering creators to optimize their storytelling for the greatest effect. This modern approach reinvents traditional storytelling into an experience that's both more impactful and centered around the audience, with each creative decision being shaped and enriched by data.

Using Data to Predict Future Trends

Utilizing data for future trends in data-driven creativity marks a revolutionary step in improving video content via data science. This technique focuses on analyzing viewer interactions, demographic information, and behavioral tendencies to predict future content direction. Using data enables creators to be proactive, crafting video content that resonates with emerging audience preferences and interests. Such a forward-thinking approach guarantees ongoing relevance in a dynamic digital world and fosters innovation and leadership among content creators. The blend of data analytics and artistic insight leads to the production of not just captivating but also pioneering videos, demonstrating the significant role of data in shaping the future of video content creation.

Balancing Creativity and Data

Achieving a harmonious blend of creativity and data in video content production is both subtle and potent. Data-driven creativity embodies the convergence of artistic flair and data analytics, providing an innovative method to boost video effectiveness. By weaving in data analysis, video creators unlock insights into what their audience prefers and how they behave, guiding their artistic choices. This integration results in content that is not only enthralling but also deeply meaningful to viewers. It is essential, however, to ensure that data serves as a guide, not a ruler, in the creative journey. This equilibrium keeps the content fresh and appealing while aligning it thoughtfully with data-driven knowledge. In essence, data-driven creativity in video content merges the narrative craft with analytical insights, culminating in videos that are both compelling and influential.

Overcoming Challenges in Data-Driven Creativity

Overcoming hurdles in data-driven creativity necessitates a nuanced integration of data science into the creation of video content. It involves striking a delicate balance between analytical methodologies and artistic expression, ensuring that data serves as an informative tool rather than a constraint on creativity. Accurate interpretation of data empowers content creators to avoid formulaic outputs, utilizing insights to enrich storytelling and enhance audience engagement. This intricate process demands a comprehensive understanding of the artistry of video creation and the scientific principles behind data analysis. Ethical considerations, including respecting audience privacy and obtaining data consent, are pivotal in this approach. Innovative strategies within data-driven creativity empower creators to produce content that forges deeper connections with viewers, setting new benchmarks in the digital landscape. Embracing these challenges is essential for unlocking the full potential of data-enhanced video content.

Ethical Considerations in Data-Driven Creativity

In the domain of data-driven creativity, ethical considerations play a crucial role, especially when utilizing data science to enhance video content. While utilizing data insights can enhance creative processes, it is essential to address privacy concerns and ensure transparent, responsible data usage. Achieving the right equilibrium between creativity and ethical considerations becomes paramount as brands employ data to customize video content. Upholding user privacy and securing informed consent are fundamental principles in ethical data-driven creativity, fostering trust among audiences. Moreover, there is an obligation to avoid perpetuating biases and stereotypes in content creation, championing inclusivity and diversity. Ethical practices not only maintain brand integrity but also contribute to a positive and respectful digital environment for consumers.

Tools and Resources for Data-Driven Video Creation

Explore the potential of data-driven creativity using state-of-the-art tools and resources for crafting videos. In the current digital landscape, integrating data science and video content is transforming the landscape of creative processes. Immerse yourself in a domain where insights derived from data direct every facet of video production. These tools empower creators to customize content according to audience preferences, ensuring that each video is not only visually captivating but also strategically aligned. From scriptwriting informed by analytics to incorporating personalized visual elements, the utilization of data science takes video content to unprecedented levels. Delve into the crossroads of technology and creativity, where strategies driven by data redefine storytelling, captivating audiences in a personalized and meaningful manner.

The Future of Data-Driven Creativity in Video Content

The evolution of data-driven creativity in video content is set to transform our interaction with digital media. Through the incorporation of data science, creators gain valuable insights into viewer preferences, behavior, and trends. This collaboration enables a personalized and captivating viewing experience, heightening audience engagement. With the utilization of data-driven creativity, content producers can shape videos to suit the unique preferences of their target audience, resulting in more impactful storytelling and brand communication. As technology progresses, we anticipate a shift towards highly personalized content, driven by data insights, leading to innovative approaches in video production. This convergence of creativity and data science holds significant promise for the future development of video content within the digital landscape.

Conclusion

In summary, the convergence of data-driven insights and creative components represents a transformative shift in the realm of video content creation. The fusion of Data Science and creativity provides content producers with the tools to precisely tailor videos to audience preferences, resulting in more impactful and engaging content. Leveraging the potential of data facilitates a deeper comprehension of viewer behavior, enabling targeted storytelling. Amidst the digital landscape, the symbiosis of data and creativity not only elevates video content but also fosters innovation and personalized experiences. Looking ahead, embracing Data-Driven Creativity becomes crucial for maintaining a leading edge in the continually evolving landscape of video content creation.

nikos_datasource

Apr 02, 2020

Everything a Data Scientist Should Know About Data Management

Contents Outline

Admond Lee

Everything a Data Scientist Should Know About Data Management

Related Posts

Categories

Join Competition

nikos_datasource

Juan Guillermo Gómez Ramírez

nikos_datasource

nikos_datasource
