30 things you can do with Pandas

Hello everyone! Today I want to write about the pandas library. Here are 30 things you can do with pandas to better understand your data. First things first, let's import the pandas library:

import pandas as pd

df = pd.read_csv('test.csv')  # read a test file into a DataFrame

(1) Read in a CSV dataset

pd.read_csv("csv_file")  # the older pd.DataFrame.from_csv was removed in pandas 1.0
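To try this without a file on disk, here is a small sketch that reads CSV text from memory. The column names and values are made up for illustration; `read_csv` accepts a file-like object as well as a path.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file.
csv_text = "name,size,job\nAnn,5,engineer\nBob,3,teacher\n"

# read_csv takes a path or any file-like object, so StringIO works here.
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```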

(2) Read in an Excel dataset

pd.read_excel("excel_file")

(3) Write your data frame directly to csv

df.to_csv("data.csv", sep=",", index=False)
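As a quick check of what gets written, the sketch below leaves out the file path so that `to_csv` returns the CSV text as a string, then reads it back. The sample frame is an assumption, not data from this post.

```python
import io

import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"name": ["Ann", "Bob"], "size": [5, 3]})

# With no path given, to_csv returns the CSV as a string.
csv_text = df.to_csv(sep=",", index=False)
print(csv_text)

# Reading the text back should reproduce the original frame.
restored = pd.read_csv(io.StringIO(csv_text))
```

Passing `index=False` keeps the row index out of the output, which is why the round trip gives back the same two columns.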

(4) Create a dataframe from data with column names

pd.DataFrame(data, columns=["col_a", "col_b"])  # one name per column in data
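For example, assuming `data` is a list of rows (the values and column names here are invented), it could look like this:

```python
import pandas as pd

# Hypothetical rows: each inner list is one record.
data = [["Ann", 5], ["Bob", 3]]

# The columns list must have one name per field in each row.
df = pd.DataFrame(data, columns=["name", "size"])
print(df)
```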

(5) Get Data type for all the columns

df.dtypes

(6) Basic dataset feature info

df.info()

(7) Basic dataset statistics

print(df.describe())

(8) List the column names

df.columns

(9) Drop missing data

df.dropna(axis=0, how='any')

(10) Replace missing data

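df.fillna(value)  # to swap specific values instead, use df.replace(to_replace=old_value, value=new_value)

(11) Check for NaNs

pd.isnull(df)  # or df.isnull(); True wherever a value is missing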
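Here is a minimal sketch that puts the missing-data steps together on a tiny made-up frame: counting NaNs, dropping rows that contain them, and filling them with a default. The fill value of 0.0 is only an assumption for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in "size".
df = pd.DataFrame({"name": ["Ann", "Bob", "Cid"], "size": [5.0, np.nan, 3.0]})

# Count missing values per column.
missing_per_column = df.isnull().sum()

# Drop every row that has at least one missing value.
dropped = df.dropna(axis=0, how="any")

# Fill missing values column by column; 0.0 is an arbitrary choice here.
filled = df.fillna({"size": 0.0})
```

None of these calls change `df` itself; each returns a new object.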

(12) Drop a feature

df.drop('feature_variable_name', axis=1)

(13) Convert object type to float

pd.to_numeric(df["feature_name"], errors='coerce')
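A short sketch of what `errors='coerce'` does, using invented string values: anything that cannot be parsed becomes NaN instead of raising an error.

```python
import pandas as pd

# Hypothetical column of strings, one of which is not a number.
df = pd.DataFrame({"feature_name": ["1.5", "2", "n/a"]})

# Unparseable entries are turned into NaN rather than raising.
converted = pd.to_numeric(df["feature_name"], errors="coerce")
print(converted)
```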

(14) Convert data frame to numpy array

df.to_numpy()  # as_matrix() was removed in pandas 1.0

(15) Get first “n” rows of a data frame

df.head(n)

(16) Get last “n” rows of a data frame

df.tail(n)

(17) Get data by feature name

df[feature_name]  # or df.loc[:, feature_name]; df.loc[label] on its own selects rows

(18) Apply a function to a data frame

df["height"].apply(lambda height: 2 * height)
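A small runnable version, assuming a "height" column with made-up values:

```python
import pandas as pd

# Hypothetical heights in centimetres.
df = pd.DataFrame({"height": [150, 180]})

# apply calls the function once per value in the column.
doubled = df["height"].apply(lambda height: 2 * height)
print(doubled)
```

For simple arithmetic like this, `df["height"] * 2` gives the same result; `apply` is more useful when the function has no vectorized equivalent.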

(19) Renaming a column

df.rename(columns = {df.columns[2]:'size'}, inplace=True)
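The sketch below renames the third column by position, as above, but returns a new frame instead of using `inplace=True`. The frame and its column names are assumptions for the example.

```python
import pandas as pd

# Hypothetical frame whose third column has an unhelpful name.
df = pd.DataFrame({"name": ["Ann"], "job": ["engineer"], "sz": [5]})

# df.columns[2] looks up the current name of the third column.
renamed = df.rename(columns={df.columns[2]: "size"})
print(renamed.columns.tolist())
```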

(20) Count categories of categorical variable

df["job"].value_counts()
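As an illustration with invented job titles, `value_counts` returns the categories sorted from most to least frequent, and `normalize=True` gives shares instead of counts:

```python
import pandas as pd

# Hypothetical categorical column.
df = pd.DataFrame(
    {"job": ["engineer", "teacher", "engineer", "nurse", "engineer", "teacher"]}
)

counts = df["job"].value_counts()                # absolute counts
shares = df["job"].value_counts(normalize=True)  # fractions that sum to 1
print(counts)
```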

(21) Get the unique entries of a column

df["name"].unique()

(22) Accessing sub-data frames

new_df = df[["name", "size"]]

(23) Summary information about your data

# Sum of values in a data frame
df.sum()
# Lowest value of a data frame
df.min()
# Highest value
df.max()
# Index of the lowest value
df.idxmin()
# Index of the highest value
df.idxmax()
# Statistical summary of the data frame, with quartiles, median, etc.
df.describe()
# Average values
df.mean()
# Median values
df.median()
# Correlation between columns
df.corr()
# To get these values for only one column, just select it like this
df["size"].median()
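One thing to watch: in recent pandas versions, calls such as `mean()` and `corr()` raise an error if the frame also contains text columns. A minimal sketch of one workaround, on a made-up frame, is to select the numeric columns first:

```python
import pandas as pd

# Hypothetical frame mixing a text column with two numeric ones.
df = pd.DataFrame(
    {"name": ["Ann", "Bob", "Cid"], "size": [5, 3, 8], "height": [150, 180, 165]}
)

# Keep only numeric columns before computing statistics.
numeric = df.select_dtypes(include="number")

means = numeric.mean()   # average of each numeric column
corr = numeric.corr()    # pairwise correlation matrix

# Single-column statistics work directly on the selected column.
median_size = df["size"].median()
largest_row = df["size"].idxmax()  # index label of the largest value
```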

(24) Sorting your data

df.sort_values(by="size", ascending=False)
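On a DataFrame, `sort_values` needs to know which column to sort on, which is what the `by` argument is for. A small sketch with invented data:

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"name": ["Ann", "Bob", "Cid"], "size": [5, 3, 8]})

# Sort rows by "size", largest first; the original index labels are kept.
sorted_df = df.sort_values(by="size", ascending=False)
print(sorted_df)
```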

(25) Boolean indexing

df[df["size"] == 5]
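Conditions can also be combined. The sketch below uses a made-up frame and shows one filter on a single column and one that combines two masks with `&`; each condition needs its own parentheses.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame(
    {
        "name": ["Ann", "Bob", "Cid", "Dee"],
        "size": [5, 3, 8, 5],
        "job": ["engineer", "teacher", "engineer", "nurse"],
    }
)

# Rows where size is exactly 5.
size_is_five = df[df["size"] == 5]

# Rows where both conditions hold; & is the element-wise "and".
mask = (df["size"] >= 5) & (df["job"] == "engineer")
big_engineers = df[mask]
```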

(26) Selecting values

df.loc[[0], ['size']]

(27) Cross-frequency tables between two variables
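Note that `loc` is indexed with square brackets, not called like a function. What comes back depends on whether you pass lists or single labels, as this sketch on a made-up frame shows:

```python
import pandas as pd

# Hypothetical data with the default integer index.
df = pd.DataFrame({"name": ["Ann", "Bob"], "size": [5, 3]})

sub_frame = df.loc[[0], ["size"]]  # lists of labels -> a 1x1 DataFrame
value = df.loc[0, "size"]          # single labels   -> a scalar
column = df.loc[:, "size"]         # all rows, one column -> a Series
```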

pd.crosstab(df["y"],df["z"])
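To see what the table looks like, here is a sketch with two invented categorical columns. Each cell counts how many rows have that combination of values.

```python
import pandas as pd

# Hypothetical categorical columns "y" and "z".
df = pd.DataFrame(
    {"y": ["a", "a", "b", "b", "b"], "z": ["u", "v", "u", "u", "v"]}
)

# Rows are the values of y, columns are the values of z.
table = pd.crosstab(df["y"], df["z"])
print(table)
```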

(28) Plot function for numeric columns

df["size"].plot()

(29) Get shape (row,columns) of the DataFrame

df.shape

(30) Get n randomly selected rows from the DataFrame

df.sample(n)
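If you want the same rows each time you run the code, you can pass a seed through `random_state`. A minimal sketch, with a made-up frame and an arbitrary seed of 0:

```python
import pandas as pd

# Hypothetical frame with ten rows.
df = pd.DataFrame({"size": range(10)})

# Sampling is without replacement by default; the seed makes it repeatable.
sample_a = df.sample(n=3, random_state=0)
sample_b = df.sample(n=3, random_state=0)
print(sample_a)
```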

There are many more useful things in pandas. We’ll see more about them in upcoming posts.

"Happy Reading, Happy Learning"
