What’s new in 2.2.0 (Month XX, 2024)#
These are the changes in pandas 2.2.0. See Release notes for a full changelog including other versions of pandas.
Enhancements#
Calamine engine for read_excel()#
The calamine engine was added to read_excel().
It uses python-calamine, which provides Python bindings for the Rust library calamine.
This engine supports Excel files (.xlsx, .xlsm, .xls, .xlsb) and OpenDocument spreadsheets (.ods) (GH 50395).
There are two advantages of this engine:
Calamine is often faster than other engines. Some benchmarks show results up to 5x faster than ‘openpyxl’, 20x faster than ‘odf’, 4x faster than ‘pyxlsb’, and 1.5x faster than ‘xlrd’. However, ‘openpyxl’ and ‘pyxlsb’ are faster when reading a few rows from large files because they iterate over rows lazily.
Calamine supports the recognition of datetime in .xlsb files, unlike ‘pyxlsb’, which is the only other engine in pandas that can read .xlsb files.
pd.read_excel("path_to_file.xlsb", engine="calamine")
For more, see Calamine (Excel and ODS files) in the user guide on IO tools.
Series.struct accessor for PyArrow structured data#
The Series.struct accessor provides attributes and methods for processing
data in a Series with struct[pyarrow] dtype. For example,
Series.struct.explode() converts PyArrow structured data to a pandas
DataFrame. (GH 54938)
In [1]: import pyarrow as pa
In [2]: series = pd.Series(
...: [
...: {"project": "pandas", "version": "2.2.0"},
...: {"project": "numpy", "version": "1.25.2"},
...: {"project": "pyarrow", "version": "13.0.0"},
...: ],
...: dtype=pd.ArrowDtype(
...: pa.struct([
...: ("project", pa.string()),
...: ("version", pa.string()),
...: ])
...: ),
...: )
...:
In [3]: series.struct.explode()
Out[3]:
project version
0 pandas 2.2.0
1 numpy 1.25.2
2 pyarrow 13.0.0
Other enhancements#
Series.attrs / DataFrame.attrs now uses a deepcopy for propagating attrs (GH 54134).
read_csv() now supports on_bad_lines parameter with engine="pyarrow". (GH 54480)
ExtensionArray._explode() interface method added to allow extension type implementations of the explode method (GH 54833)
ExtensionArray.duplicated() added to allow extension type implementations of the duplicated method (GH 55255)
DataFrame.apply now allows the usage of numba (via engine="numba") to JIT compile the passed function, allowing for potential speedups (GH 54666)
Implement masked algorithms for Series.value_counts() (GH 54984)
Notable bug fixes#
These are bug fixes that might have notable behavior changes.
merge() and DataFrame.join() now consistently follow documented sort behavior#
In previous versions of pandas, merge() and DataFrame.join() did not
always return a result that followed the documented sort behavior. pandas now
follows the documented sort behavior in merge and join operations (GH 54611).
As documented, sort=True sorts the join keys lexicographically in the resulting
DataFrame. With sort=False, the order of the join keys depends on the
join type (how keyword):
how="left": preserve the order of the left keys
how="right": preserve the order of the right keys
how="inner": preserve the order of the left keys
how="outer": sort keys lexicographically
One example with changing behavior is inner joins with non-unique left join keys
and sort=False:
In [4]: left = pd.DataFrame({"a": [1, 2, 1]})
In [5]: right = pd.DataFrame({"a": [1, 2]})
In [6]: result = pd.merge(left, right, how="inner", on="a", sort=False)
Old Behavior
In [5]: result
Out[5]:
a
0 1
1 1
2 2
New Behavior
In [7]: result
Out[7]:
a
0 1
1 2
2 1
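The documented rules can also be seen in a small case with unique keys. The following is a minimal sketch, not taken from the release notes: the frames `left` and `right` and their column names are made up for illustration. It shows only two of the rules, namely that `how="left"` with `sort=False` keeps the order of the left keys and that `sort=True` sorts the join keys.

```python
import pandas as pd

# Hypothetical frames for illustration; the left keys are deliberately unsorted.
left = pd.DataFrame({"key": [3, 1, 2], "lval": ["c", "a", "b"]})
right = pd.DataFrame({"key": [2, 3, 4], "rval": ["x", "y", "z"]})

# how="left" with sort=False keeps the order of the left keys.
left_order = pd.merge(left, right, how="left", on="key", sort=False)["key"].tolist()

# sort=True sorts the join keys in the result.
sorted_order = pd.merge(left, right, how="left", on="key", sort=True)["key"].tolist()

print(left_order)    # [3, 1, 2]
print(sorted_order)  # [1, 2, 3]
```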
merge() and DataFrame.join() no longer reorder levels when levels differ#
In previous versions of pandas, merge() and DataFrame.join() would reorder
index levels when joining on two indexes with different levels (GH 34133).
In [8]: left = pd.DataFrame({"left": 1}, index=pd.MultiIndex.from_tuples([("x", 1), ("x", 2)], names=["A", "B"]))
In [9]: right = pd.DataFrame({"right": 2}, index=pd.MultiIndex.from_tuples([(1, 1), (2, 2)], names=["B", "C"]))
In [10]: result = left.join(right)
Old Behavior
In [5]: result
Out[5]:
left right
B A C
1 x 1 1 2
2 x 2 1 2
New Behavior
In [11]: result
Out[11]:
left right
A B C
x 1 1 1 2
2 2 1 2
Backwards incompatible API changes#
Increased minimum versions for dependencies#
Some minimum supported versions of dependencies were updated. If installed, we now require:
Package | Minimum Version | Required | Changed |
|---|---|---|---|
X | X |
For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.
Package | Minimum Version | Changed |
|---|---|---|
X |
See Dependencies and Optional dependencies for more.
Other API changes#
Deprecations#
Deprecate alias M in favour of ME for offsets#
The alias M is deprecated in favour of ME for offsets. Please use ME for “month end” instead of M (GH 9586).
For example:
Previous behavior:
In [7]: pd.date_range('2020-01-01', periods=3, freq='M')
Out[7]:
DatetimeIndex(['2020-01-31', '2020-02-29', '2020-03-31'],
dtype='datetime64[ns]', freq='M')
Future behavior:
In [12]: pd.date_range('2020-01-01', periods=3, freq='ME')
Out[12]: DatetimeIndex(['2020-01-31', '2020-02-29', '2020-03-31'], dtype='datetime64[ns]', freq='ME')
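For code that has to run on pandas versions both before and after this deprecation, one option is to pass an offset object instead of an alias string. The following is a minimal sketch under that assumption; it is a suggestion, not something these release notes prescribe. `pd.offsets.MonthEnd` is the offset that the month-end alias stands for.

```python
import pandas as pd

# Passing the offset object avoids spelling the alias as either "M" or "ME".
idx = pd.date_range("2020-01-01", periods=3, freq=pd.offsets.MonthEnd())

# Render the dates as strings so the result is easy to compare.
month_ends = [ts.strftime("%Y-%m-%d") for ts in idx]
print(month_ends)  # ['2020-01-31', '2020-02-29', '2020-03-31']
```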
Other Deprecations#
Changed Timedelta.resolution_string() to return min, s, ms, us, and ns instead of T, S, L, U, and N, for compatibility with respective deprecations in frequency aliases (GH 52536)
Deprecated allowing non-keyword arguments in DataFrame.to_clipboard(). (GH 54229)
Deprecated allowing non-keyword arguments in DataFrame.to_csv() except path_or_buf. (GH 54229)
Deprecated allowing non-keyword arguments in DataFrame.to_dict(). (GH 54229)
Deprecated allowing non-keyword arguments in DataFrame.to_excel() except excel_writer. (GH 54229)
Deprecated allowing non-keyword arguments in DataFrame.to_gbq() except destination_table. (GH 54229)
Deprecated allowing non-keyword arguments in DataFrame.to_hdf() except path_or_buf. (GH 54229)
Deprecated allowing non-keyword arguments in DataFrame.to_html() except buf. (GH 54229)
Deprecated allowing non-keyword arguments in DataFrame.to_json() except path_or_buf. (GH 54229)
Deprecated allowing non-keyword arguments in DataFrame.to_latex() except buf. (GH 54229)
Deprecated allowing non-keyword arguments in DataFrame.to_markdown() except buf. (GH 54229)
Deprecated allowing non-keyword arguments in DataFrame.to_parquet() except path. (GH 54229)
Deprecated allowing non-keyword arguments in DataFrame.to_pickle() except path. (GH 54229)
Deprecated allowing non-keyword arguments in DataFrame.to_string() except buf. (GH 54229)
Deprecated allowing non-keyword arguments in DataFrame.to_xml() except path_or_buffer. (GH 54229)
Deprecated automatic downcasting of object-dtype results in Series.replace() and DataFrame.replace(); explicitly call result = result.infer_objects(copy=False) instead. To opt in to the future version, use pd.set_option("future.no_silent_downcasting", True) (GH 54710)
Deprecated downcasting behavior in Series.where(), DataFrame.where(), Series.mask(), DataFrame.mask(), Series.clip(), DataFrame.clip(); in a future version these will not infer object-dtype columns to non-object dtype, or all-round floats to integer dtype. Call result.infer_objects(copy=False) on the result for object inference, or explicitly cast floats to ints. To opt in to the future version, use pd.set_option("future.no_silent_downcasting", True) (GH 53656)
Deprecated including the groups in computations when using DataFrameGroupBy.apply() and DataFrameGroupBy.resample(); pass include_groups=False to exclude the groups (GH 7155)
Deprecated not passing a tuple to DataFrameGroupBy.get_group or SeriesGroupBy.get_group when grouping by a length-1 list-like (GH 25971)
Deprecated strings S, U, and N denoting units in to_timedelta() (GH 52536)
Deprecated strings T, S, L, U, and N denoting frequencies in Minute, Second, Milli, Micro, Nano (GH 52536)
Deprecated strings T, S, L, U, and N denoting units in Timedelta (GH 52536)
Deprecated the extension test classes BaseNoReduceTests, BaseBooleanReduceTests, and BaseNumericReduceTests; use BaseReduceTests instead (GH 54663)
Deprecated the option mode.data_manager and the ArrayManager; only the BlockManager will be available in future versions (GH 55043)
Deprecated downcasting the results of DataFrame.fillna(), Series.fillna(), DataFrame.ffill(), Series.ffill(), DataFrame.bfill(), Series.bfill() in object-dtype cases. To opt in to the future version, use pd.set_option("future.no_silent_downcasting", True) (GH 54261)
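Two of the deprecations above mostly affect call style. The following is a minimal sketch of calls that avoid them, using a made-up frame `df` and an in-memory buffer: only the first argument of `DataFrame.to_csv()` stays positional, and unit strings are written in lower case (`"min"`, `"s"`) rather than as `T` or `S`.

```python
import io

import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# path_or_buf may stay positional; every other argument is passed by keyword.
buf = io.StringIO()
df.to_csv(buf, sep=";", index=False)
csv_text = buf.getvalue()

# Lower-case unit strings instead of the deprecated upper-case ones.
ninety_minutes = pd.Timedelta(90, unit="min")
five_seconds = pd.to_timedelta(5, unit="s")

print(csv_text.splitlines())  # ['a;b', '1;x', '2;y']
print(ninety_minutes, five_seconds)
```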
Performance improvements#
Performance improvement in concat() with axis=1 and objects with unaligned indexes (GH 55084)
Performance improvement in to_dict() on converting DataFrame to dictionary (GH 50990)
Performance improvement in DataFrame.groupby() when aggregating pyarrow timestamp and duration dtypes (GH 55031)
Performance improvement in DataFrame.sort_index() and Series.sort_index() when indexed by a MultiIndex (GH 54835)
Performance improvement in Index.difference() (GH 55108)
Performance improvement in Series.duplicated() for pyarrow dtypes (GH 55255)
Performance improvement when indexing with more than 4 keys (GH 54550)
Performance improvement when localizing time to UTC (GH 55241)
Bug fixes#
Bug in AbstractHolidayCalendar where timezone data was not propagated when computing holiday observances (GH 54580)
Bug in pandas.core.window.Rolling where duplicate datetimelike indexes are treated as consecutive rather than equal with closed='left' and closed='neither' (GH 20712)
Bug in pandas.api.types.is_string_dtype() when checking whether an object array with no elements is of the string dtype (GH 54661)
Bug in DataFrame.apply() where passing raw=True ignored args passed to the applied function (GH 55009)
Bug in pandas.DataFrame.melt() where it would not preserve the datetime (GH 55254)
Bug in pandas.read_excel() with an ODS file without cached formatted cell for float values (GH 55219)
Categorical#
Datetimelike#
Bug in DatetimeIndex.union() returning object dtype for tz-aware indexes with the same timezone but different units (GH 55238)
Timedelta#
Bug in rendering (__repr__) of TimedeltaIndex and Series with timedelta64 values with non-nanosecond resolution entries that are all multiples of 24 hours failing to use the compact representation used in the nanosecond cases (GH 55405)
Timezones#
Numeric#
Bug in read_csv() with engine="pyarrow" causing rounding errors for large integers (GH 52505)
Conversion#
Bug in Series.convert_dtypes() not converting all NA column to null[pyarrow] (GH 55346)
Strings#
Interval#
Bug in Interval __repr__ not displaying UTC offsets for Timestamp bounds. Additionally the hour, minute and second components will now be shown. (GH 55015)
Bug in IntervalIndex.get_indexer() with datetime or timedelta intervals incorrectly matching on integer targets (GH 47772)
Bug in IntervalIndex.get_indexer() with timezone-aware datetime intervals incorrectly matching on a sequence of timezone-naive targets (GH 47772)
Bug in setting values on a Series with an IntervalIndex using a slice incorrectly raising (GH 54722)
Indexing#
Bug in Index.difference() not returning a unique set of values when other is empty or other is considered non-comparable (GH 55113)
Bug in setting Categorical values into a DataFrame with numpy dtypes raising RecursionError (GH 52927)
Missing#
MultiIndex#
Bug in MultiIndex.get_indexer() not raising ValueError when method provided and index is non-monotonic (GH 53452)
I/O#
Bug in read_csv() where on_bad_lines="warn" would write to stderr instead of raising a Python warning. This now yields an errors.ParserWarning (GH 54296)
Bug in read_csv() with engine="pyarrow" where usecols wasn’t working with a csv with no headers (GH 54459)
Bug in read_excel(), with engine="xlrd" (xls files) erroring when file contains NaNs/Infs (GH 54564)
Bug in to_excel(), with OdsWriter (ods files) writing boolean/string value (GH 54994)
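Because on_bad_lines="warn" now goes through Python's warning machinery, the message can be captured with the standard warnings module. The following is a minimal sketch with made-up CSV text. The third line has one field too many, so it is skipped. Whether `parser_warnings` ends up non-empty depends on the installed pandas version, since versions affected by the bug write to stderr instead.

```python
import io
import warnings

import pandas as pd

# Hypothetical input: the row "3,4,5" has three fields but the header has two.
csv_data = "a,b\n1,2\n3,4,5\n6,7\n"

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure no warning is filtered out
    df = pd.read_csv(io.StringIO(csv_data), on_bad_lines="warn")

# Keep only the warnings raised by the parser.
parser_warnings = [
    w for w in caught if issubclass(w.category, pd.errors.ParserWarning)
]

print(df["a"].tolist())  # the bad row is dropped: [1, 6]
print(len(parser_warnings))
```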
Period#
Plotting#
Bug in DataFrame.plot.box() with vert=False and a matplotlib Axes created with sharey=True (GH 54941)
Groupby/resample/rolling#
Reshaping#
Bug in concat() ignoring sort parameter when passed DatetimeIndex indexes (GH 54769)
Bug in merge() returning columns in incorrect order when left and/or right is empty (GH 51929)
Sparse#
ExtensionArray#
Styler#
Other#
Bug in cut() incorrectly allowing cutting of timezone-aware datetimes with timezone-naive bins (GH 54964)