目录

Pandas 入门

读取数据集

1
2
import pandas as pd
reviews = pd.read_csv("~/Programming/datasets/winemag-data-130k-v2.csv")

分组和排序

对分数分组求和

1
reviews.groupby("points").points.count()

Output:

1
2
3
4
5
6
7
8
9
points
80       397
81       692
82      1836
...
98        77
99        33
100       19
Name: points, dtype: int64

获得相同评分葡萄酒的最低价格

1
reviews.groupby("points").price.min()

Output:

1
2
3
4
5
6
7
8
9
points
80      5.0
81      5.0
82      4.0
...
98     50.0
99     44.0
100    80.0
Name: price, dtype: float64

从每个酒庄中选择第一瓶葡萄酒的名称

1
reviews.groupby('winery').apply(lambda df: df.title.iloc[0])

Output:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
winery
1+1=3                                     1+1=3 NV Rosé Sparkling (Cava)
10 Knots                            10 Knots 2010 Viognier (Paso Robles)
100 Percent Wine              100 Percent Wine 2015 Moscato (California)
1000 Stories           1000 Stories 2013 Bourbon Barrel Aged Zinfande...
1070 Green                  1070 Green 2011 Sauvignon Blanc (Rutherford)
...
Órale                       Órale 2011 Cabronita Red (Santa Ynez Valley)
Öko                    Öko 2013 Made With Organically Grown Grapes Ma...
Ökonomierat Rebholz    Ökonomierat Rebholz 2007 Von Rotliegenden Spät...
àMaurice               àMaurice 2013 Fred Estate Syrah (Walla Walla V...
Štoka                                    Štoka 2009 Izbrani Teran (Kras)
Length: 16757, dtype: object

根据国家和省份挑选最好的葡萄酒

1
reviews.groupby(["country", "province"]).apply(lambda df:df.loc[df.points.idxmax()])

Output:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
                            Unnamed: 0  ...                winery
country   province                      ...
Argentina Mendoza Province       82754  ...  Bodega Catena Zapata
          Other                  78303  ...                Colomé
Armenia   Armenia                66146  ...              Van Ardi
Australia Australia Other        37882  ...       Marquis Philips
          New South Wales        85337  ...            De Bortoli
                                ...  ...                   ...
Uruguay   Juanico                 9133  ...        Familia Deicas
          Montevideo             15750  ...                 Bouza
          Progreso               93103  ...                Pisano
          San Jose               39898  ...        Castillo Viejo
          Uruguay                39361  ...               Narbona
[425 rows x 14 columns]

agg() 方法可以在一个 dataframe 上运行一些一维的函数

1
reviews.groupby(['country']).price.agg([len, min, max])

Output:

1
2
3
4
5
6
7
8
9
                            len   min     max
country
Argentina                3800.0   4.0   230.0
Armenia                     2.0  14.0    15.0
Australia                2329.0   5.0   850.0
...
US                      54504.0   4.0  2013.0
Ukraine                    14.0   6.0    13.0
Uruguay                   109.0  10.0   130.0

多索引:

使用 groupby() 可能会产生多索引

1
2
countries_reviewed = reviews.groupby(['country', 'province']).description.agg([len])
countries_reviewed

Output:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
                             len
country   province
Argentina Mendoza Province  3264
          Other              536
Armenia   Armenia              2
Australia Australia Other    245
          New South Wales     85
                          ...
Uruguay   Juanico             12
          Montevideo          11
          Progreso            11
          San Jose             3
          Uruguay             24
[425 rows x 1 columns]

多索引转回常规索引

1
countries_reviewed.reset_index()

Output:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
       country          province   len
0    Argentina  Mendoza Province  3264
1    Argentina             Other   536
2      Armenia           Armenia     2
3    Australia   Australia Other   245
4    Australia   New South Wales    85
..         ...               ...   ...
420    Uruguay           Juanico    12
421    Uruguay        Montevideo    11
422    Uruguay          Progreso    11
423    Uruguay          San Jose     3
424    Uruguay           Uruguay    24
[425 rows x 3 columns]

使用 sort_values() 对数据排序(默认为升序排序)

1
2
countries_reviewed = countries_reviewed.reset_index()
countries_reviewed.sort_values(by='len')

Output:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
          country               province    len
179        Greece  Muscat of Kefallonian      1
192        Greece          Sterea Ellada      1
194        Greece                 Thraki      1
354  South Africa             Paardeberg      1
40         Brazil       Serra do Sudeste      1
..            ...                    ...    ...
409            US                 Oregon   5373
227         Italy                Tuscany   5897
118        France               Bordeaux   5941
415            US             Washington   8639
392            US             California  36247
[425 rows x 3 columns]

降序排序

1
countries_reviewed.sort_values(by='len', ascending=False)

Output:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
          country         province    len
392            US       California  36247
415            US       Washington   8639
118        France         Bordeaux   5941
227         Italy          Tuscany   5897
409            US           Oregon   5373
..            ...              ...    ...
101       Croatia              Krk      1
247   New Zealand        Gladstone      1
357  South Africa  Piekenierskloof      1
63          Chile          Coelemu      1
149        Greece           Beotia      1
[425 rows x 3 columns]

重新按照索引排序

1
countries_reviewed.sort_index()

Output:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
       country          province   len
0    Argentina  Mendoza Province  3264
1    Argentina             Other   536
2      Armenia           Armenia     2
3    Australia   Australia Other   245
4    Australia   New South Wales    85
..         ...               ...   ...
420    Uruguay           Juanico    12
421    Uruguay        Montevideo    11
422    Uruguay          Progreso    11
423    Uruguay          San Jose     3
424    Uruguay           Uruguay    24
[425 rows x 3 columns]

同时使用多个列排序

1
countries_reviewed.sort_values(by=['country', 'len'])

Output:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
       country          province   len
1    Argentina             Other   536
0    Argentina  Mendoza Province  3264
2      Armenia           Armenia     2
6    Australia          Tasmania    42
4    Australia   New South Wales    85
..         ...               ...   ...
421    Uruguay        Montevideo    11
422    Uruguay          Progreso    11
420    Uruguay           Juanico    12
424    Uruguay           Uruguay    24
419    Uruguay         Canelones    43
[425 rows x 3 columns]

数据类型和缺失值

获取 price 列的数据类型

1
reviews.price.dtype

Output:

1
dtype('float64')

查看 dataframe 中每一列的数据类型

1
reviews.dtypes

Output:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
Unnamed: 0                 int64
country                   object
description               object
designation               object
points                     int64
price                    float64
province                  object
region_1                  object
region_2                  object
taster_name               object
taster_twitter_handle     object
title                     object
variety                   object
winery                    object
dtype: object

使用 astype() 转换数据类型

1
reviews.points.astype('float64')

Output:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
0         87.0
1         87.0
2         87.0
3         87.0
4         87.0
          ...
129966    90.0
129967    90.0
129968    90.0
129969    90.0
129970    90.0
Name: points, Length: 129971, dtype: float64

缺少值的条目被赋予值 NaN,缩写为 “Not a Number”

查看 country 为 NaN 的行

1
reviews[pd.isnull(reviews.country)]

Output:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
        Unnamed: 0 country  ...             variety                           winery
913            913     NaN  ...             Chinuri               Gotsa Family Wines
3131          3131     NaN  ...           Red Blend                Barton & Guestier
4243          4243     NaN  ...            Ojaleshi  Kakhetia Traditional Winemaking
9509          9509     NaN  ...         White Blend                         Tsililis
9750          9750     NaN  ...          Chardonnay                         Ross-idi
            ...     ...  ...                 ...                              ...
124176      124176     NaN  ...           Red Blend                Les Frères Dutruy
129407      129407     NaN  ...  Cabernet Sauvignon                      El Capricho
129408      129408     NaN  ...         Tempranillo                      El Capricho
129590      129590     NaN  ...           Red Blend                        Büyülübağ
129900      129900     NaN  ...              Merlot                           Psagot
[63 rows x 14 columns]

使用 fillna() 填充缺失的行

1
reviews.region_2.fillna("Unknown)

Output:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
0                   Unknown
1                   Unknown
2         Willamette Valley
3                   Unknown
4         Willamette Valley
                ...
129966              Unknown
129967         Oregon Other
129968              Unknown
129969              Unknown
129970              Unknown
Name: region_2, Length: 129971, dtype: object

使用 replace(old_val, new_val) 方法替换非空值

1
reviews.taster_twitter_handle.replace("@kerinokeefe", "@kerino")

Output:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
0             @kerino
1          @vossroger
2          @paulgwine
3                 NaN
4          @paulgwine
             ...
129966            NaN
129967     @paulgwine
129968     @vossroger
129969     @vossroger
129970     @vossroger
Name: taster_twitter_handle, Length: 129971, dtype: object

重命名和合并

重命名一列

1
reviews.rename(columns={'points': 'score'})

Output:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
        Unnamed: 0  ...                                    winery
0                0  ...                                   Nicosia
1                1  ...                       Quinta dos Avidagos
2                2  ...                                 Rainstorm
3                3  ...                                St. Julian
4                4  ...                              Sweet Cheeks
            ...  ...                                       ...
129966      129966  ...  Dr. H. Thanisch (Erben Müller-Burggraef)
129967      129967  ...                                  Citation
129968      129968  ...                           Domaine Gresser
129969      129969  ...                      Domaine Marcel Deiss
129970      129970  ...                          Domaine Schoffit
[129971 rows x 14 columns]

重命名可以选择 index 或者 colum 参数

1
reviews.rename(index={0: 'firstEntry', 1: 'secondEntry'})

Output:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
             Unnamed: 0  ...                                    winery
firstEntry            0  ...                                   Nicosia
secondEntry           1  ...                       Quinta dos Avidagos
2                     2  ...                                 Rainstorm
3                     3  ...                                St. Julian
4                     4  ...                              Sweet Cheeks
                 ...  ...                                       ...
129966           129966  ...  Dr. H. Thanisch (Erben Müller-Burggraef)
129967           129967  ...                                  Citation
129968           129968  ...                           Domaine Gresser
129969           129969  ...                      Domaine Marcel Deiss
129970           129970  ...                          Domaine Schoffit
[129971 rows x 14 columns]

因为很少重命名索引值,通常情况下 set_index() 更好用

给行索引和列索引添加名称属性

1
reviews.rename_axis("wines", axis='rows').rename_axis("fields", axis='columns')

Output:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
fields  Unnamed: 0  ...                                    winery
wines               ...
0                0  ...                                   Nicosia
1                1  ...                       Quinta dos Avidagos
2                2  ...                                 Rainstorm
3                3  ...                                St. Julian
4                4  ...                              Sweet Cheeks
            ...  ...                                       ...
129966      129966  ...  Dr. H. Thanisch (Erben Müller-Burggraef)
129967      129967  ...                                  Citation
129968      129968  ...                           Domaine Gresser
129969      129969  ...                      Domaine Marcel Deiss
129970      129970  ...                          Domaine Schoffit
[129971 rows x 14 columns]

组合有三个核心方法concat(), join()merge()

concat(): 给定一个元素列表,此函数将沿着一个 axis 将这些元素混合在一起

1
2
3
canadian_youtube
british_youtube
pd.concat([canadian_youtube, british_youtube])

Output:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
          video_id  ...                                        description
0      n1WpP7iowLc  ...  Eminem's new track Walk on Water ft. Beyoncé i...
1      0dBIkQ4Mz1M  ...  STill got a lot of packages. Probably will las...
2      5qpjK5DgCt4  ...  WATCH MY PREVIOUS VIDEO  \n\nSUBSCRIBE  http...
3      d380meD0W0M  ...  I know it's been a while since we did this sho...
4      2Vv-BfVoq4g  ...  🎧: https://ad.gt/yt-perfect\n💰: https://atlant...
            ...  ...                                                ...
40876  sGolxsMSGfQ  ...  🚨 NEW MERCH! http://amzn.to/annoyingorange 🚨➤ ...
40877  8HNuRNi8t70  ...   Retrouvez vos programmes préférés : https://...
40878  GWlKEM3m2EE  ...  Find out more about Kingdom Hearts 3: https://...
40879  lbMKLzQ4cNQ  ...  Peter Navarro isnt talking so tough now. Ana ...
40880  POTgw38-m58  ...  藝人:李妍瑾、玉兔、班傑、LaLa、小優、少少專家:陳筱屏(律師)Wendy(心理師)、羅...
[40881 rows x 16 columns]

          video_id  ...                                        description
0      Jw1Y-zhQURU  ...  Click here to continue the story and make your...
1      3s1rvMFUweQ  ...  Musical guest Taylor Swift performs Ready for...
2      n1WpP7iowLc  ...  Eminem's new track Walk on Water ft. Beyoncé i...
3      PUTEiSjKwJU  ...  Salford drew 4-4 against the Class of 92 and F...
4      rHwDegptbI4  ...  Dashcam captures truck's near miss with child ...
            ...  ...                                                ...
38911  l884wKofd54  ...  NEW SONG - MOVE TO MIAMI feat. Pitbull (Click ...
38912  IP8k2xkhOdI  ...  THE OFFICIAL UP WITH IT MUSIC VIDEO!Get my new...
38913  Il-an3K9pjg  ...  Get 2002 by Anne-Marie HERE  http://ad.gt/200...
38914  -DRsfNObKIQ  ...  Eleni Foureira represented Cyprus at the first...
38915  4YFo4bdMO8Q  ...  Debut album 'Light of Mine' out now: http://ky...
[38916 rows x 16 columns]

          video_id  ...                                        description
0      n1WpP7iowLc  ...  Eminem's new track Walk on Water ft. Beyoncé i...
1      0dBIkQ4Mz1M  ...  STill got a lot of packages. Probably will las...
2      5qpjK5DgCt4  ...  WATCH MY PREVIOUS VIDEO  \n\nSUBSCRIBE  http...
3      d380meD0W0M  ...  I know it's been a while since we did this sho...
4      2Vv-BfVoq4g  ...  🎧: https://ad.gt/yt-perfect\n💰: https://atlant...
            ...  ...                                                ...
38911  l884wKofd54  ...  NEW SONG - MOVE TO MIAMI feat. Pitbull (Click ...
38912  IP8k2xkhOdI  ...  THE OFFICIAL UP WITH IT MUSIC VIDEO!Get my new...
38913  Il-an3K9pjg  ...  Get 2002 by Anne-Marie HERE  http://ad.gt/200...
38914  -DRsfNObKIQ  ...  Eleni Foureira represented Cyprus at the first...
38915  4YFo4bdMO8Q  ...  Debut album 'Light of Mine' out now: http://ky...
[79797 rows x 16 columns]

join(): 可以组合具有共同索引的不同 dataframe 对象

1
2
3
4
left = canadian_youtube.set_index(['title', 'trending_date'])
right = british_youtube.set_index(['title', 'trending_date'])

left.join(right, lsuffix='_CAN', rsuffix='_UK')

Output:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
                                                                 video_id_CAN  ... description_UK
title                                              trending_date               ...
!! THIS VIDEO IS NOTHING BUT PAIN !! | Getting ... 18.04.01       PNn8sECd7io  ...            NaN
#1 Fortnite World Rank - 2,323 Solo Wins!          18.09.03       DvPW66IFhMI  ...            NaN
#1 Fortnite World Rank - 2,330 Solo Wins!          18.10.03       EXEaMjFeiEk  ...            NaN
#1 MOST ANTICIPATED VIDEO (Timber Frame House R... 17.20.12       bYvQmusLaxw  ...            NaN
                                                   17.21.12       bYvQmusLaxw  ...            NaN
                                                                       ...  ...            ...
😲She Is So Nervous But BLOWS The ROOF After Tak... 18.02.05       WttN1Z0XF4k  ...            NaN
                                                   18.29.04       WttN1Z0XF4k  ...            NaN
                                                   18.30.04       WttN1Z0XF4k  ...            NaN
🚨 BREAKING NEWS 🔴 Raja Live all Slot Channels W... 18.07.05       Wt9Gkpmbt44  ...            NaN
🚨Active Shooter at YouTube Headquarters - LIVE ... 18.04.04       Az72jrKbANA  ...            NaN
[40900 rows x 28 columns]

其他

显示所有列

1
pd.set_option("display.max_columns", None)