On April 3, 2023, pandas 2.0 was released, marking a significant change in how pandas represents data in memory. Despite this change, the release was designed to avoid breaking existing pandas code or disrupting users' familiarity with pandas.
Pandas has been decoupling from its initial reliance on NumPy and gradually incorporating Apache Arrow as a backend for DataFrame and Series operations. Apache Arrow provides better support for data types such as strings and missing values, which were limitations in NumPy. This change has been implemented in stages: the ExtensionArray API (introduced back in pandas 0.24) made alternative array backends possible, a dedicated string data type followed in 2020 with pandas 1.0, and an Arrow-backed version of it arrived in pandas 1.3. The extension of Apache Arrow support to all data types in pandas 2.0 marks a major milestone in the evolution of the library and justifies the release of a new major version.
By default, pandas 2.0 will continue to use the original data types when creating Series or DataFrame objects. This means that the changes in the way that data types are handled are optional and can be explicitly specified when needed, rather than being applied globally to all operations. This approach helps to ensure backwards compatibility for existing code, as it allows users to continue using the original data types by default, while also providing the flexibility to use the new Arrow-backed data types when desired.
There are some incompatibilities, however. Check out the documentation for specifics.
Apache Arrow is an open-source software framework that provides a standard, language-independent way of representing data in memory. It enables efficient, high-performance interactions between different systems and programming languages. PyArrow is the Python implementation of Apache Arrow and provides a set of APIs for working with Arrow data structures, as well as for reading and writing Arrow data from various data sources, such as CSV and Parquet.
Starting with version 1.5, Apache Arrow support for all data types was added to pandas (though it was deemed experimental at the time). In pandas 2.0, this support was made official, with the pyarrow backend providing substantial memory usage reduction and performance improvements when dealing with non-numeric data types (see these examples). This is especially true when compared to the numpy backend, which was not optimized for these types. By using Arrow, pandas can rely on a standard method for handling missing values that is consistent across all data types, without having to create separate solutions for each type.
In pandas 2.0, a new dtype_backend option was added to specify the data type backend of the returned DataFrame. Setting dtype_backend to "numpy_nullable" returns a DataFrame backed by NumPy nullable dtypes, whereas setting it to "pyarrow" returns a pyarrow-backed nullable ArrowDtype DataFrame. These options provide a more flexible and powerful way to work with missing values consistently across all data types, and can be used with functions such as read_csv, read_xml, read_sql, and convert_dtypes.
The advantage of using nullable types is that it avoids cases or operations where None and NaN are treated differently, leading to undesired outcomes. For example, type casting, sorting, grouping and aggregation, can produce different results depending on whether the missing values are represented as NaN or None (read more here). With nullable dtypes, missing values are represented uniformly as <NA>, which can simplify data manipulation and analysis.
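The normalization is easy to see in a small sketch:

```python
import numpy as np
import pandas as pd

# With the plain object dtype, the two missing markers stay distinct objects.
raw = pd.Series(["a", None, np.nan], dtype="object")
print(raw[1] is None, raw[2] is not None)  # True True

# With a nullable dtype, both are normalized to the single marker pd.NA,
# displayed as <NA>, so downstream operations see one kind of missing value.
clean = pd.Series(["a", None, np.nan], dtype="string")
print(clean.isna().tolist())  # [False, True, True]
print(clean[1] is pd.NA and clean[2] is pd.NA)  # True
```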
Additionally, ArrowDtype provides a more efficient and memory-friendly alternative to NumPy dtypes, and can support features like zero-copy data transfer between Python and other programming languages.
Being able to use different numpy numeric data types as indexes in pandas offers several benefits, such as enhanced flexibility, improved performance, and better compatibility with other data types and libraries. Until recently, pandas only supported int64, uint64, and float64 as index dtypes, but it now allows all numpy numeric dtypes: int8, int16, int32, int64, uint8, uint16, uint32, uint64, float32, and float64.
This is useful, for example, for indexing large datasets: If you have a large dataset with millions or billions of rows, using a smaller numeric dtype for Index can significantly reduce memory usage. For example, using int32 instead of int64 for Index can save half of the memory for integer-based Indexes.
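A quick sketch of the saving (pandas ≥ 2.0, which preserves the smaller dtype instead of upcasting the index to int64):

```python
import numpy as np
import pandas as pd

n = 1_000_000
idx64 = pd.Index(np.arange(n, dtype="int64"))
idx32 = pd.Index(np.arange(n, dtype="int32"))  # kept as int32 since pandas 2.0

print(idx64.memory_usage())  # 8 bytes per label
print(idx32.memory_usage())  # 4 bytes per label: half the footprint
```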
Pandas by default uses an eager copying strategy, where modifications to data often result in the data being duplicated. When operations such as filtering, slicing, or modifying values are performed on a DataFrame or Series, pandas typically creates a new copy of the underlying data, even when the modification is minor, because it cannot always tell whether a result shares memory with its source. Chained operations compound the problem: each step produces a new intermediate object carrying its own copy of the data, rather than modifying the original in place. This can be quite inefficient and resource-consuming.
Copy-on-Write is a technique that avoids unnecessary copying by deferring the copy until a modification is actually made, and then copying only the affected data instead of the entire dataset. This reduces memory usage, speeds up operations, and protects the integrity of the original data, making it a useful approach in many programming scenarios.
Copy-on-Write improvements have been introduced in pandas 2.0, but are not enabled by default. However, it is possible to enable them either locally or globally, depending on your needs. Check out the commands here.
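As a brief sketch, the feature can be switched on globally with a single option (or locally via an option context), after which derived objects only copy their data at the moment they are modified:

```python
import pandas as pd

# Global opt-in (Copy-on-Write is off by default in pandas 2.0).
pd.set_option("mode.copy_on_write", True)

df = pd.DataFrame({"a": [1, 2, 3]})
subset = df["a"]        # no data copied at this point
subset.iloc[0] = 100    # the copy happens here, only for the modified data
print(df["a"].iloc[0])  # 1 -- the original DataFrame is untouched

# Local opt-in for a single block of code:
with pd.option_context("mode.copy_on_write", True):
    pass  # operations here use Copy-on-Write
```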
Pandas 2 introduces more efficient memory management options that can greatly simplify the work of users that handle large amounts of data. For those who do not work with such large dataframes, the new version should not introduce noticeable changes.