Pandas

pandas is the standard tool across research and industry for working with tabular data.

And in python we will do import pandas as pd, and read data as df = pd.read_csv('data.csv').

Arrange data in a tabular format.
Extract useful information filtered by specific conditions.
Apply numerical operations using NumPy to our data.
Perform vectorized computations to speed up our analysis.

DataFrame

When using pandas, the data is stored in a DataFrame, which is a tabular data structure. It is similar to a spreadsheet or a SQL table, but it can also be used to store more complex data types, such as strings or lists.

	Landmark	Type	Height	Year Built
0	Sather Gate	Gate	30	1910.0
1	Campanile	Tower	307	1914.0
2	Doe Library	Library	80	1911.0
3	Memorial Glade	Open Space	0	NaN
4	Sproul Plaza	Plaza	0	1962.0

Common Operations

Create a DataFrame

pd.dataFrame(data)
pd.read_csv("xxx.csv", index_col='yy')

Quick review

df.head()
df.tail()
df.sample()
 
df.info()  # a summary including the number of rows, columns, data types, and memory usage. 
df.describe()  # statistics like mean, standard deviation, and percentiles.
 
df.shape()  # the number of rows and columns.
df.zize()  # the total number of elements
 
df.value_counts['x']()  # unique values in a column.
df.unique['x']()  # an array of unique values in a column.

Extraction

df[1:]  # rows 1 to the end
df['x', 'y']  # columns x and y
 
df.loc[[1, 2]]  # by label, inclusive
df.loc[:, 'x']
 
df.iloc[[1,2]]  # by position, exclusive
df.iloc[1:, [2,4]]
 
df[df['x'] > 0]  # filter
df[df['x'] > 0 and df['y'] < 0]
df[df['x'].isin(['x1', 'x2'])]
df.loc[df['x'] > 0]

Modification

df['x'] = [0, 2, 1]
df['y'] = df['x'] * 0.5
 
df.drop(columns=['x'], inplace=True)  # default False
df.drop(index=[0, 1], inplace=True)
 
df.sort_values(by='x', ascending=True)  # default True
df.sort_values(by=['x', 'y'], ascending=[True, False])
 
df.isnull()  # find missing values
df.fillna(0)  # fill
df.dropna()

Aggregation
- Basic: .sum(), .mean(), .median(), .min(), .max(), .count(), .nunique(), prod()
- Statistical: .var(), .std(), sem()(standard error of the mean), .skew()(skewness), .kurt()(kurtosis)
- Logical: .all(), .any()
- Indexing: .first(), .last(), .idxmin(), .idxmax()
- In-Built agg: sum(), max(), min()
- Numpy agg: np.sum(), np.max(), np.min(), np.mean()
- Pandas agg: .agg('sum','mean', 'min', 'max', 'first', 'last')
```
df.groupby('x').agg('sum')
```

Join

# 1. Inner join
df1.join(df2.set_index('x1'), on='x2', how='inner')
df1.merge(df2, how='inner', left_on='x1', right_on='x2')
# 2. Outer join
df1.merge(df2, how='outer', left_on='x1', right_on='x2')

inner	outer	left	right
only both	all both	all left	all right

🌲vsk_dl_notes

Explorer

Data Processing

Pandas

DataFrame

Common Operations

Visualization

Pytorch

Graph View

Table of Contents