What’s new in 2.2.0 (Month XX, 2024)#

These are the changes in pandas 2.2.0. See Release notes for a full changelog including other versions of pandas.

Enhancements#

Calamine engine for read_excel()#

The calamine engine was added to read_excel(). It uses python-calamine, which provides Python bindings for the Rust library calamine. This engine supports Excel files (.xlsx, .xlsm, .xls, .xlsb) and OpenDocument spreadsheets (.ods) (GH 50395).

There are two advantages of this engine:

  1. Calamine is often faster than the other engines: some benchmarks show it reading up to 5x faster than ‘openpyxl’, 20x faster than ‘odf’, 4x faster than ‘pyxlsb’, and 1.5x faster than ‘xlrd’. However, ‘openpyxl’ and ‘pyxlsb’ are faster when reading only a few rows from large files, because they iterate over rows lazily.

  2. Calamine supports recognizing datetime values in .xlsb files, unlike ‘pyxlsb’, the only other engine in pandas that can read .xlsb files.

pd.read_excel("path_to_file.xlsb", engine="calamine")

For more, see Calamine (Excel and ODS files) in the user guide on IO tools.

Series.struct accessor for PyArrow structured data#

The Series.struct accessor provides attributes and methods for processing Series with struct[pyarrow] dtype. For example, Series.struct.explode() converts PyArrow structured data to a pandas DataFrame. (GH 54938)

In [1]: import pyarrow as pa

In [2]: series = pd.Series(
   ...:     [
   ...:         {"project": "pandas", "version": "2.2.0"},
   ...:         {"project": "numpy", "version": "1.25.2"},
   ...:         {"project": "pyarrow", "version": "13.0.0"},
   ...:     ],
   ...:     dtype=pd.ArrowDtype(
   ...:         pa.struct([
   ...:             ("project", pa.string()),
   ...:             ("version", pa.string()),
   ...:         ])
   ...:     ),
   ...: )
   ...: 

In [3]: series.struct.explode()
Out[3]: 
   project version
0   pandas   2.2.0
1    numpy  1.25.2
2  pyarrow  13.0.0


Other enhancements#

  • Series.attrs / DataFrame.attrs now use a deep copy when propagating attrs (GH 54134).

  • read_csv() now supports the on_bad_lines parameter with engine="pyarrow" (GH 54480)

  • ExtensionArray._explode() interface method added to allow extension type implementations of the explode method (GH 54833)

  • ExtensionArray.duplicated() added to allow extension type implementations of the duplicated method (GH 55255)

  • DataFrame.apply now allows the usage of numba (via engine="numba") to JIT compile the passed function, allowing for potential speedups (GH 54666)

  • Implement masked algorithms for Series.value_counts() (GH 54984)
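As an illustration of the last item, value counting on a nullable ("masked") dtype now runs through dedicated masked algorithms internally; the user-facing behavior is unchanged, so a rough sketch is simply:

```python
import pandas as pd

# value_counts on a nullable (masked) integer Series; pandas 2.2 routes
# this through masked algorithms for better performance.
s = pd.Series([1, 2, 2, None, 2], dtype="Int64")
counts = s.value_counts()  # missing values are excluded by default
```

Pass dropna=False to include the <NA> entries in the counts.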

Notable bug fixes#

These are bug fixes that might have notable behavior changes.

merge() and DataFrame.join() now consistently follow documented sort behavior#

In previous versions of pandas, merge() and DataFrame.join() did not always return a result that followed the documented sort behavior. pandas now follows the documented sort behavior in merge and join operations (GH 54611).

As documented, sort=True sorts the join keys lexicographically in the resulting DataFrame. With sort=False, the order of the join keys depends on the join type (how keyword):

  • how="left": preserve the order of the left keys

  • how="right": preserve the order of the right keys

  • how="inner": preserve the order of the left keys

  • how="outer": sort keys lexicographically

One example with changing behavior is inner joins with non-unique left join keys and sort=False:

In [4]: left = pd.DataFrame({"a": [1, 2, 1]})

In [5]: right = pd.DataFrame({"a": [1, 2]})

In [6]: result = pd.merge(left, right, how="inner", on="a", sort=False)

Old Behavior

In [5]: result
Out[5]:
   a
0  1
1  1
2  2

New Behavior

In [7]: result
Out[7]: 
   a
0  1
1  2
2  1
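For comparison, sort=True always sorts the join keys lexicographically regardless of the join type, so the same inner join with sort=True returns the keys in sorted order:

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2, 1]})
right = pd.DataFrame({"a": [1, 2]})

# With sort=True the join keys are always sorted lexicographically,
# independent of the `how` keyword.
sorted_result = pd.merge(left, right, how="inner", on="a", sort=True)
```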

merge() and DataFrame.join() no longer reorder levels when levels differ#

In previous versions of pandas, merge() and DataFrame.join() would reorder index levels when joining on two indexes with different levels (GH 34133).

In [8]: left = pd.DataFrame({"left": 1}, index=pd.MultiIndex.from_tuples([("x", 1), ("x", 2)], names=["A", "B"]))

In [9]: right = pd.DataFrame({"right": 2}, index=pd.MultiIndex.from_tuples([(1, 1), (2, 2)], names=["B", "C"]))

In [10]: result = left.join(right)

Old Behavior

In [5]: result
Out[5]:
       left  right
B A C
1 x 1     1      2
2 x 2     1      2

New Behavior

In [11]: result
Out[11]: 
       left  right
A B C             
x 1 1     1      2
  2 2     1      2
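Code that relied on the old level ordering can restore it explicitly with DataFrame.reorder_levels(). A minimal sketch, using the level names from the example above on a hand-built frame:

```python
import pandas as pd

# A frame shaped like the join result above, with index levels A, B, C
df = pd.DataFrame(
    {"left": [1, 1], "right": [2, 2]},
    index=pd.MultiIndex.from_tuples(
        [("x", 1, 1), ("x", 2, 2)], names=["A", "B", "C"]
    ),
)

# Reorder the index levels back to the pre-2.2 ordering (B, A, C)
reordered = df.reorder_levels(["B", "A", "C"])
```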

Backwards incompatible API changes#

Increased minimum versions for dependencies#

Some minimum supported versions of dependencies were updated. If installed, we now require:

Package    Minimum Version    Required    Changed

X          X

For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.

Package    Minimum Version    Changed

X

See Dependencies and Optional dependencies for more.

Other API changes#

Deprecations#

Deprecate alias M in favour of ME for offsets#

The alias M for offsets is deprecated in favour of ME; please use ME for “month end” instead of M (GH 9586).

For example:

Previous behavior:

In [7]: pd.date_range('2020-01-01', periods=3, freq='M')
Out [7]:
DatetimeIndex(['2020-01-31', '2020-02-29', '2020-03-31'],
              dtype='datetime64[ns]', freq='M')

Future behavior:

In [12]: pd.date_range('2020-01-01', periods=3, freq='ME')
Out[12]: DatetimeIndex(['2020-01-31', '2020-02-29', '2020-03-31'], dtype='datetime64[ns]', freq='ME')

Other Deprecations#

Performance improvements#

Bug fixes#

Categorical#

  • Categorical.isin() raising InvalidIndexError for categorical containing overlapping Interval values (GH 34974)

Datetimelike#

  • Bug in DatetimeIndex.union() returning object dtype for tz-aware indexes with the same timezone but different units (GH 55238)

Timedelta#

  • Bug in the rendering (__repr__) of TimedeltaIndex and Series with timedelta64 values of non-nanosecond resolution: entries that are all multiples of 24 hours failed to use the compact representation used in the nanosecond case (GH 55405)

Timezones#

Numeric#

  • Bug in read_csv() with engine="pyarrow" causing rounding errors for large integers (GH 52505)

Conversion#

Strings#

Interval#

Indexing#

Missing#

MultiIndex#

I/O#

  • Bug in read_csv() where on_bad_lines="warn" would write to stderr instead of raising a Python warning; this now yields an errors.ParserWarning (GH 54296)

  • Bug in read_csv() with engine="pyarrow" where usecols wasn’t working with a CSV with no headers (GH 54459)

  • Bug in read_excel() with engine="xlrd" (xls files) erroring when the file contains NaNs/Infs (GH 54564)

  • Bug in to_excel() with OdsWriter (ods files) writing boolean/string values (GH 54994)

Period#

Plotting#

Groupby/resample/rolling#

Reshaping#

Sparse#

ExtensionArray#

Styler#

Other#

  • Bug in cut() incorrectly allowing cutting of timezone-aware datetimes with timezone-naive bins (GH 54964)

Contributors#