
Parquet File Format – Everything You Need to Know!

Data has grown exponentially over the past few years, and one of the biggest challenges has become finding the optimal way to store all of these different data flavors. Unlike in the (not so distant) past, when relational databases were considered the only way to go, organizations now also want to analyze raw data – think of social media sentiment analysis, audio/video files, and so on – which often can't be stored in a traditional (relational) way, or storing it that way would require significant effort and time, increasing the overall time-to-analysis.

Another challenge was to somehow stick to the traditional approach of storing data in a structured way, but without designing complex and time-consuming ETL workloads just to move that data into an enterprise data warehouse. Also, what if half of the data professionals in your organization are skilled in, say, Python (data scientists, data engineers), while the other half are skilled in SQL (data engineers, data analysts)? Would you insist that the "Pythonistas" learn SQL? Or vice versa?

Or, would you prefer a storage option that plays to the strengths of your entire data team? I have good news for you – something like that has existed since 2013, and it's called Apache Parquet!

In short

Before I show you the ins and outs of the Parquet file format, here are (at least) five main reasons why Parquet is considered the de facto standard for storing data nowadays:

  • Data compression – by applying various encoding and compression algorithms, Parquet files provide reduced memory consumption (a short example follows this list)
  • Columnar storage – this is crucial in analytical workloads, where fast data read operations are the key requirement. More on this later in the article…
  • Language agnostic – as mentioned earlier, developers may use different programming languages to manipulate the data in Parquet files
  • Open-source format – meaning, you are not locked in with a specific vendor
  • Supports complex data types
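
To make this tangible, here is a minimal sketch of writing and reading a Parquet file from Python, assuming pandas and pyarrow are installed (the article itself doesn't prescribe any particular library – any Parquet-aware tool would do):

```python
import pandas as pd  # assumes pandas + pyarrow are installed

# A tiny sales table, similar to the example used later in the article
df = pd.DataFrame({
    "Product": ["T-Shirts", "Socks", "Balls", "T-Shirts"],
    "Country": ["United States", "Germany", "United States", "France"],
    "Amount": [120, 40, 75, 90],
})

# Write to Parquet – pandas delegates to pyarrow and applies
# Snappy compression by default
df.to_parquet("sales.parquet", compression="snappy")

# Read it back
print(pd.read_parquet("sales.parquet"))
```

The very same file could then be read from Spark, R, DuckDB, or any other Parquet-aware engine – which is exactly the language-agnostic point from the list above.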

Row storage

We have mentioned that Parquet is a column-based storage format. However, to understand the benefits of using the Parquet file format, we first need to draw the boundaries between the row-based and column-based methods of storing the data.

In traditional, row-based storage, the data is stored as a set of rows, something like this:

Image by author

Now, some common questions your users might ask when we talk about OLAP scenarios are:

  • How many balls did we sell?
  • How many users in the United States have purchased T-shirts?
  • How much did the client Maria Adams spend?
  • How many sales do we have on January 2?

To be able to answer these questions, the engine must scan each and every row from the beginning to the very end! So, to answer the question of how many users in the United States bought T-shirts, the engine has to do something like this:

Image by author

Essentially, we need the information from only two columns: Product (T-Shirts) and Country (United States), but the engine will still scan all five columns! This is not the most efficient solution – I think we can agree on that…
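
Just to illustrate the point (in a deliberately naive, pure-Python way – this is not how a real engine is implemented), here is what "touch every row, every field" looks like; the dates, the other customers, and the amounts are made-up sample values:

```python
# Row-based thinking: every row, with all of its fields, gets touched,
# even though only Product and Country matter for this question.
rows = [
    {"Date": "01/01", "Product": "T-Shirts", "Country": "United States", "Customer": "Maria Adams", "Amount": 120},
    {"Date": "01/01", "Product": "Socks", "Country": "Germany", "Customer": "Hans Meyer", "Amount": 40},
    {"Date": "01/02", "Product": "T-Shirts", "Country": "United States", "Customer": "John Doe", "Amount": 90},
]

t_shirts_sold_in_us = 0
for row in rows:  # scan every row from beginning to end
    if row["Product"] == "T-Shirts" and row["Country"] == "United States":
        t_shirts_sold_in_us += 1

print(t_shirts_sold_in_us)  # 2
```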

Column storage

Now, let's examine how column-based storage works. As you might assume, the approach is 180 degrees different:

Image by author

In this case, each column is a separate entity – meaning, each column is physically separated from the other columns! Going back to our previous business question: the engine can now scan only the columns that the query needs (Product and Country) and skip scanning the unnecessary ones. In most cases, this should improve the performance of analytical queries.
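
With a columnar format, the reader can ask for just those two columns. Assuming the sales.parquet file from the earlier sketch, pandas (via pyarrow) will read only the requested columns:

```python
import pandas as pd

# Only the Product and Country columns are read – the other columns
# in the file are never scanned
df = pd.read_parquet("sales.parquet", columns=["Product", "Country"])
print(df)
```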

OK, that's fine, but columnar storage existed before Parquet and still exists outside of it. So, what is so special about the Parquet format?

Parquet is a columnar format that stores data in row groups

Wait, what?! Wasn't it complicated enough already? Don't worry, it's much easier than it sounds.

Let’s go back to our previous example, describing how Parquet stores the same data:

Image by author

Let's stop for a moment and explain the illustration above, because that's exactly how a Parquet file is structured (some other things were intentionally omitted, but we'll come to them soon). Columns are still stored as separate units, but Parquet introduces an additional structure called a row group.

Why is this additional structure so important?

You'll need to wait a little for the answer :). In OLAP scenarios, we are mainly concerned with two concepts: projection and predicate(s). Projection refers to the SELECT part of a SQL statement – which columns the query needs. Going back to our previous example, we only need the Product and Country columns, so the engine can skip scanning the remaining ones.

Predicate(s) refer to the WHERE clause in SQL – which rows satisfy the criteria defined in the query. In our case, we are only interested in T-Shirts, so the engine can completely skip scanning row group 2, where all the values in the Product column equal Socks!

Image by author
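
Both concepts can be pushed down to the Parquet reader itself. Here is a minimal sketch with pyarrow, again assuming the sales.parquet file from the earlier example (how aggressively row groups are skipped depends on the engine and on the file's statistics):

```python
import pyarrow.parquet as pq

table = pq.read_table(
    "sales.parquet",
    columns=["Product", "Country"],          # projection: only these columns
    filters=[("Product", "=", "T-Shirts")],  # predicate: lets the reader skip
                                             # row groups that cannot match
)
print(table.to_pandas())
```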

Let’s stop here quickly, because I want you to be aware of the difference between various types of storage in terms of the work the engine needs to perform:

  • Row store – the engine needs to scan all 5 columns and all 6 rows
  • Column store – the engine needs to scan 2 columns and all 6 rows
  • Column store with row groups – the engine needs to scan 2 columns and 4 rows

Obviously, this is an oversimplified example with only 6 rows and 5 columns, where you won't notice any performance difference between these three storage options. However, in real life, when you deal with much larger amounts of data, the difference becomes far more obvious.

Now, the fair question would be: how does Parquet "know" which row groups to skip and which to scan?

The Parquet file contains metadata

This means that every Parquet file contains "data about data" – information such as the minimum and maximum values in a specific column within a certain row group. In addition, every Parquet file contains a footer, which keeps the information about the format version, schema information, column metadata, and so on. You can find more details about Parquet metadata types here.
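
If you want to peek into that metadata yourself, here is a small sketch with pyarrow: it writes a file with deliberately tiny row groups and then prints the footer information, including the per-row-group min/max statistics (the file name and values are just for illustration):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "Product": ["T-Shirts", "T-Shirts", "Socks", "Socks", "Balls", "Balls"],
    "Amount": [120, 90, 40, 30, 75, 60],
})

# Force small row groups so the file contains more than one of them
pq.write_table(table, "sales_rg.parquet", row_group_size=3)

# The footer holds "data about data": format version, schema, number of
# row groups, and per-column statistics for every row group
meta = pq.ParquetFile("sales_rg.parquet").metadata
print(meta.num_row_groups)          # 2
print(meta.format_version)
stats = meta.row_group(0).column(0).statistics
print(stats.min, stats.max)         # min/max of Product within row group 0
```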

Important: To optimize performance and eliminate unnecessary data structures (row groups and columns), the engine first needs to "get familiar" with the data, so it reads the metadata first. This is not a slow operation, but it still takes a certain amount of time. Therefore, if you are querying data from many small Parquet files, query performance will degrade, because the engine has to read the metadata from each individual file. So, you are better off merging multiple smaller files into one larger file (but still not too large :)…

I hear you, I hear you: Nicholas, what is "small" and what is "big"? Unfortunately, there is no single "golden" number here, but, as an example, Microsoft Azure Synapse Analytics recommends that an individual Parquet file should be at least a few hundred MB in size.
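
If you do end up with lots of tiny files, one possible way to compact them is sketched below with pyarrow – the landing_zone folder is hypothetical, and the sketch assumes all of the files share the same schema:

```python
import glob

import pyarrow as pa
import pyarrow.parquet as pq

# Read every small file and write them back out as one bigger file,
# so the engine only has to read a single footer
small_files = glob.glob("landing_zone/*.parquet")
tables = [pq.read_table(f) for f in small_files]
pq.write_table(pa.concat_tables(tables), "compacted.parquet")
```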

What else is there?

Here is a simplified, high-level illustration of the Parquet file format:

Image by author

Can it be better than this? Yes, with data compression

OK, we've explained how skipping the scan of unnecessary data structures (row groups and columns) can benefit your queries and improve overall performance. But it's not only about that – remember when I told you at the very beginning that one of the main advantages of the Parquet format is the reduced memory footprint of the file? This is achieved by applying various compression algorithms.
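
As a quick illustration, you can write the same data with different compression codecs and compare the resulting file sizes – the actual numbers depend heavily on your data, so treat this purely as a sketch:

```python
import os

import pandas as pd

df = pd.read_parquet("sales.parquet")  # the sample file from earlier

for codec in [None, "snappy", "gzip", "zstd"]:
    path = f"sales_{codec or 'uncompressed'}.parquet"
    df.to_parquet(path, compression=codec)
    print(codec, os.path.getsize(path), "bytes")
```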

I’ve written about various types of data compression in Power BI (and tabular models in general), so maybe it’s a good idea to read this article first.

There are two main encoding types that enable Parquet to compress the data and achieve amazing space savings:

  • Dictionary encoding – Parquet creates a dictionary of the distinct values in the column, and afterwards replaces the "real" values with index values from the dictionary. Going back to our example, this process looks something like this:
Image by author

You might be thinking: why this overhead, when the product names are short anyway? Fine, but now imagine that you store the detailed description of the product, such as: "Long-arm T-Shirt with application on the neck". And now imagine that you've sold this product millions of times… Yep, instead of repeating the value "Long-arm… blah blah" over and over, Parquet stores only the index value (an integer instead of text).
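
Here is a rough way to observe that effect with pyarrow – writing the same, highly repetitive column with dictionary encoding switched on and off (compression is set to NONE to isolate the effect of the encoding; the row count and file names are arbitrary):

```python
import os

import pyarrow as pa
import pyarrow.parquet as pq

# One long, repetitive text value – a perfect candidate for dictionary encoding
description = "Long-arm T-Shirt with application on the neck"
table = pa.table({"ProductDescription": [description] * 1_000_000})

pq.write_table(table, "dict_on.parquet", use_dictionary=True, compression="NONE")
pq.write_table(table, "dict_off.parquet", use_dictionary=False, compression="NONE")

print("dictionary on: ", os.path.getsize("dict_on.parquet"), "bytes")
print("dictionary off:", os.path.getsize("dict_off.parquet"), "bytes")
```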

Can it be better than this? Yes, with the Delta Lake file format

Wait, what is the Delta Lake format now? Isn't this an article about Parquet?!

So, in plain English: Delta Lake is nothing more than "Parquet on steroids". When I say "steroids", the main one is the versioning of Parquet files. It also stores a transaction log, which makes it possible to track all the changes applied to the Parquet files. This is also known as ACID-compliant transactions.

Since it supports not only ACID transactions, but also time travel (rollbacks, audit trails, etc.) and DML (Data Manipulation Language) statements such as INSERT, UPDATE, and DELETE, you won't be wrong if you think of Delta Lake as a "data warehouse on the data lake". Examining the pros and cons of the Lakehouse concept is beyond the scope of this article, but if you want to dig deeper, I suggest you read this article from Databricks.
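
For completeness, here is a minimal sketch using the deltalake package (the delta-rs Python bindings) – just one of several ways to work with Delta Lake, and my own assumption, since the article doesn't prescribe any particular tooling:

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake  # pip install deltalake

df = pd.DataFrame({"Product": ["T-Shirts", "Socks"], "Amount": [120, 40]})

# Under the hood this still writes Parquet files, plus a _delta_log folder
# with a transaction log that records every change
write_deltalake("sales_delta", df)
write_deltalake("sales_delta", df, mode="append")  # a second, versioned commit

dt = DeltaTable("sales_delta")
print(dt.version())                                      # latest version (1)
print(DeltaTable("sales_delta", version=0).to_pandas())  # time travel to version 0
```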

Conclusion

We evolve! And, just like us, data is constantly evolving. Therefore, new flavors of data require new ways of storing them. The Parquet file format is one of the most efficient storage options in the current data landscape, since it provides multiple benefits – both in terms of memory consumption, by leveraging various compression algorithms, and in terms of fast query processing, by enabling the engine to skip scanning unnecessary data.

Thank you for reading!
