A Beginner’s Guide to Apache Spark

Dilyan Kovachev
Apr 10, 2020

Apache Spark vs. Hadoop MapReduce — pros, cons, and when to use which



What is Apache Spark?

The company founded by the creators of Spark — Databricks — summarizes its functionality best in their Gentle Intro to Apache Spark eBook (highly recommended read; a link to the PDF download is provided at the end of this article):

“Apache Spark is a unified computing engine and a set of libraries for parallel data processing on computer clusters. As of the time of this writing, Spark is the most actively developed open source engine for this task, making it the de facto tool for any developer or data scientist interested in Big Data. Spark supports multiple widely used programming languages (Python, Java, Scala, and R), includes libraries for diverse tasks ranging from SQL to streaming and machine learning, and runs anywhere from a laptop to a cluster of thousands of servers. This makes it an easy system to start with and scale up to Big Data processing on an incredibly large scale.”

What is Big Data?

Let’s look at Gartner’s widely used definition of Big Data, so we can later understand how Spark is designed to tackle many of the challenges of working with Big Data in real-time at scale:
“Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.”


The Complex World of Big Data

Note:
The key takeaway here is that the “Big” in Big Data is not just about volume. You’re not just getting a lot of data; it is also arriving fast, in real time, in complex formats, and from a variety of sources. Hence the 3 Vs of Big Data: Volume, Velocity, and Variety.

Why do most Big Data Analytics companies get a “spark in their eye” when they hear about all of Spark’s useful functionalities?

Based on my preliminary research, there are three main components that make Apache Spark the leader in working efficiently with Big Data at scale, and that motivate many large companies working with huge amounts of unstructured data to adopt it into their stack.

  1. Spark is a unified, one-stop-shop for working with Big Data — “Spark is designed to support a wide range of data analytics tasks, ranging from simple data loading and SQL queries to machine learning and streaming computation, over the same computing engine and with a consistent set of APIs. The main insight behind this goal is that real-world data analytics tasks — whether they are interactive analytics in a tool, such as a Jupyter notebook, or traditional software development for production applications — tend to combine many different processing types and libraries. Spark’s unified nature makes these tasks both easier and more efficient to write” (Databricks eBook). For example, if you load data using a SQL query and then evaluate a machine learning model over it using Spark’s ML library, the engine can combine these steps into one scan over the data (see the PySpark sketch after this list). In this sense, Spark gives Big Data workloads the same kind of unification that Data Scientists get from a consistent set of libraries in Python or R, and that Web Developers get from frameworks such as Node.js or Django.

  2. Spark optimizes its core engine for computational efficiency — “by this, we mean that Spark only handles loading data from storage systems and performing computation on it, not permanent storage as the end in itself. Spark can be used with a wide variety of persistent storage systems, including cloud storage systems such as Azure Storage and Amazon S3, distributed file systems such as Apache Hadoop, key-value stores such as Apache Cassandra, and message buses such as Apache Kafka. However, Spark neither stores data long-term itself nor favors one over another. The key motivation here is that most data already resides in a mix of storage systems. Data is expensive to move, so Spark focuses on performing computations over the data, no matter where it resides” (Databricks eBook). Spark’s focus on computation makes it different from earlier big data software platforms such as Apache Hadoop. Hadoop included both a storage system (the Hadoop file system, designed for low-cost storage over clusters of commodity servers) and a computing system (MapReduce), which were closely integrated together. However, this choice makes it hard to run one of the systems without the other, or, more importantly, to write applications that access data stored anywhere else. While Spark runs well on Hadoop storage, it is now also used broadly in environments where the Hadoop architecture does not make sense, such as the public cloud (where storage can be purchased separately from computing) or streaming applications.

  3. Spark’s libraries give it a very wide range of functionalities — Today, Spark’s standard libraries are the bulk of the open source project. The Spark core engine itself has changed little since it was first released, but the libraries have grown to provide more and more types of functionality, turning it into a multifunctional data analytics tool. Spark includes libraries for SQL and structured data (Spark SQL), machine learning (MLlib), stream processing (Spark Streaming and the newer Structured Streaming), and graph analytics (GraphX). Beyond these libraries, there are hundreds of open source external libraries ranging from connectors for various storage systems to machine learning algorithms.
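To make the “unified engine” idea from point 1 concrete, here is a minimal PySpark sketch in which the same SparkSession loads data, filters it with a SQL query, and fits an MLlib model on the result, all without leaving Spark. The file path, column names, and choice of model are placeholders for illustration.

```python
# Minimal sketch of Spark's "unified engine": one SparkSession handles data
# loading, SQL, and machine learning over the same data. The file path and
# the column names ("x1", "x2", "label") are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("unified-example").getOrCreate()

# 1. Load structured data (this could just as well come from S3, HDFS, or a database).
df = spark.read.csv("events.csv", header=True, inferSchema=True)

# 2. Filter and shape it with a SQL query.
df.createOrReplaceTempView("events")
training = spark.sql("SELECT x1, x2, label FROM events WHERE label IS NOT NULL")

# 3. Train an MLlib model on the result -- no export/import step in between.
assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
model = LogisticRegression(maxIter=10).fit(assembler.transform(training))

print(model.coefficients)
spark.stop()
```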

Apache Spark vs. Hadoop MapReduce…Which one should you use?

The short answer is — it depends on the particular needs of your business, but based on my research, it seems like 7 out of 10 times the answer will be — Spark. Hadoop MapReduce’s strength is linear, batch processing of huge datasets, while Spark delivers fast performance, iterative processing, real-time analytics, graph processing, machine learning, and more.

The great news is that Spark is fully compatible with the Hadoop ecosystem and works smoothly with the Hadoop Distributed File System (HDFS), Apache Hive, and others. So, when the data is too big for Spark to handle entirely in memory, Hadoop can help overcome that hurdle via its HDFS functionality. Below is a visual example of how Spark and Hadoop can work together:


The image above demonstrates how Spark uses the best parts of Hadoop: HDFS for reading and storing data, MapReduce for optional processing, and YARN for resource allocation.
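As a rough sketch of that division of labor, the snippet below lets YARN allocate the cluster resources while Spark reads its input from, and writes its output back to, HDFS. The hdfs:// paths and the YARN-managed cluster are assumptions for illustration, not details from the article.

```python
# Sketch of Spark using Hadoop for storage (HDFS) and resource management (YARN).
# The "yarn" master and the hdfs:// paths assume an already-configured Hadoop
# cluster; replace them with values for your own environment.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("spark-on-hadoop")
         .master("yarn")            # let YARN allocate the executors
         .getOrCreate())

# Read raw data that already lives in HDFS ...
logs = spark.read.text("hdfs:///data/raw/logs")

# ... run the computation in Spark (here: keep only lines containing "ERROR") ...
errors = logs.filter(logs.value.contains("ERROR"))

# ... and write the result back to HDFS.
errors.write.mode("overwrite").parquet("hdfs:///data/processed/errors")
```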

Next, I will try to highlight Spark’s many advantages over Hadoop MapReduce by performing a brief head-to-head comparison between the two.


Speed
  • Apache Spark — a lightning-fast cluster computing tool. Spark runs applications up to 100x faster in memory and 10x faster on disk than Hadoop MapReduce by reducing the number of read/write cycles to disk and storing intermediate data in memory.
  • Hadoop MapReduce — MapReduce reads and writes from disk, which slows down the processing speed and overall efficiency.
Ease of Use
  • Apache Spark — Spark’s APIs provide many high-level operators on RDDs (Resilient Distributed Datasets), so common transformations can be expressed in just a few lines of code (see the word-count sketch below).
  • Hadoop — In MapReduce, developers need to hand-code every operation, which can make it more difficult to use for complex projects at scale.
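To see what those high-level operators look like in practice, here is the canonical word-count example written against the RDD API; the input path is a placeholder. In classic Hadoop MapReduce, the same logic typically requires a separate mapper class, reducer class, and driver configuration.

```python
# Word count with Spark's high-level RDD operators: a handful of chained calls
# instead of hand-coded Mapper/Reducer classes. "input.txt" is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("word-count").getOrCreate()
sc = spark.sparkContext

counts = (sc.textFile("input.txt")
            .flatMap(lambda line: line.split())      # "map" phase: emit words
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))        # "reduce" phase: sum counts

print(counts.take(10))                               # show the first 10 (word, count) pairs
```
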
Handling Large Sets of Data
  • Apache Spark — since Spark is optimized for speed and computational efficiency by storing most of the data in memory and not on disk, it can underperform Hadoop MapReduce when the size of the data becomes so large that insufficient RAM becomes an issue.
  • Hadoop — Hadoop MapReduce allows parallel processing of huge amounts of data. It breaks a large chunk into smaller ones to be processed separately on different data nodes. In case the resulting dataset is larger than available RAM, Hadoop MapReduce may outperform Spark. It’s a good solution if the speed of processing is not critical and tasks can be left running overnight to generate results in the morning.
Functionality

Apache Spark is the uncontested winner in this category. Below is a list of the many Big Data Analytics tasks where Spark outperforms Hadoop:
  • Iterative processing. If the task is to process data again and again — Spark defeats Hadoop MapReduce. Spark’s Resilient Distributed Datasets (RDDs) enable multiple map operations in memory, while Hadoop MapReduce has to write interim results back to disk (see the caching sketch after this list).
  • Near real-time processing. If a business needs immediate insights, then they should opt for Spark and its in-memory processing.
  • Graph processing. Spark’s computational model is good for iterative computations that are typical in graph processing. And Apache Spark has GraphX — an API for graph computation.
  • Machine learning. Spark has MLlib — a built-in machine learning library, while Hadoop needs a third-party library to provide one. MLlib has out-of-the-box algorithms that also run in memory.
  • Joining datasets. Due to its speed, Spark can perform joins faster, though Hadoop MapReduce may be better when joining very large datasets that require a lot of shuffling and sorting.
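As a small illustration of the iterative-processing point above, the sketch below caches an RDD in memory and reuses it across several passes; with MapReduce, each pass would be a separate job that re-reads its input from disk. The data and the loop are made up purely for illustration.

```python
# Iterative processing on a cached RDD: the data is materialized once, kept in
# memory, and reused on every pass instead of being re-read from disk each time.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iterative-example").getOrCreate()
sc = spark.sparkContext

data = sc.parallelize(range(1, 1_000_001)).cache()   # toy dataset, kept in memory

guess = 0.0
for i in range(5):                                    # five passes over the same RDD
    guess = data.filter(lambda x: x > guess).mean()   # each pass refines the guess
    print(f"iteration {i}: guess = {guess:,.1f}")
```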

Below is a visual summary of Spark’s many capabilities and its compatibility with other Big Data engines and programming languages:

  1. Spark Core — Spark Core is the base engine for large-scale parallel and distributed data processing. Further, additional libraries which are built on top of the core allow diverse workloads for streaming, SQL, and machine learning. It is responsible for memory management and fault recovery, scheduling, distributing and monitoring jobs on a cluster & interacting with storage systems.

  2. Cluster management — A cluster manager is used to acquire cluster resources for executing jobs. Spark Core runs over diverse cluster managers, including Hadoop YARN, Apache Mesos, Amazon EC2, and Spark’s built-in standalone cluster manager. The cluster manager handles resource sharing between Spark applications. Independently of the cluster manager, Spark can access data in HDFS, Cassandra, HBase, Hive, Alluxio, and any Hadoop data source.

  3. Spark Streaming — Spark Streaming is the component of Spark used to process real-time streaming data (a minimal Structured Streaming sketch follows this list).

  4. Spark SQL: Spark SQL is the Spark module that integrates relational processing with Spark’s functional programming API. It supports querying data either via SQL or via the Hive Query Language (HiveQL). The DataFrame and Dataset APIs of Spark SQL provide a higher level of abstraction for structured data.

  5. GraphX: GraphX is the Spark API for graphs and graph-parallel computation. It extends the Spark RDD with a Resilient Distributed Property Graph.

  6. MLlib (Machine Learning): MLlib stands for Machine Learning Library; it is Spark’s built-in library for performing machine learning at scale.
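To tie a couple of these components together, below is a minimal Structured Streaming sketch. It uses the built-in rate source (a test source that generates timestamped rows), aggregates the stream with the DataFrame API from Spark SQL, and prints running counts to the console; a real job would read from a source such as Kafka or a directory of files instead.

```python
# Minimal Structured Streaming sketch: the built-in "rate" test source generates
# (timestamp, value) rows, which are counted in 10-second event-time windows and
# printed to the console. A production job would read from Kafka, files, etc.
from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("streaming-example").getOrCreate()

stream = (spark.readStream
          .format("rate")               # test source: rowsPerSecond synthetic rows
          .option("rowsPerSecond", 10)
          .load())

# Count the generated rows in 10-second event-time windows.
counts = stream.groupBy(window(stream.timestamp, "10 seconds")).count()

query = (counts.writeStream
         .outputMode("complete")        # re-emit the full aggregation each trigger
         .format("console")
         .start())

query.awaitTermination()
```
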
Conclusion

With the massive explosion of Big Data and the exponentially increasing speed of computational power, tools like Apache Spark and other Big Data Analytics engines will soon be indispensable to Data Scientists and will quickly become the industry standard for performing Big Data Analytics and solving complex business problems at scale in real-time. For those interested in diving deeper into the technology behind all those functionalities, please click on the link below and download Databricks’s eBook — “A Gentle Intro to Apache Spark”, or check out “Big Data Analytics on Apache Spark”.