Data Manipulation with Pandas

Pandas is a Python package built on top of NumPy, with plotting support that relies on Matplotlib. It is used for data manipulation and visualization, and it is widely used across the Python data science community. Tabular (rectangular) data is the most common form of data in data analysis, and pandas is designed to work with it in the form of data frames. When you first receive a dataset, you want to explore it quickly and get a sense of its contents. For that, pandas provides several methods and attributes. To use pandas in a Python file, first import it:
import pandas as pd 
To read a CSV file with pandas, use:
pd.read_csv("csv_file_location")
Once the data is loaded, you can run the following methods on it. Try them yourself.
Let us consider a small data frame and see what each method returns.
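The data frame from the original images is not reproduced here, so the sketch below builds a small stand-in. The pets table, its column names, and its values are all invented for illustration.

```python
import pandas as pd

# Hypothetical sample data, invented for this illustration.
pets = pd.DataFrame({
    "name": ["Bella", "Milo", "Luna", "Rocky", "Coco", "Simba"],
    "species": ["dog", "cat", "cat", "dog", "rabbit", "cat"],
    "age": [3, 5, 2, 7, 1, 4],
    "weight_kg": [24.5, 4.2, 3.8, 30.1, 1.6, 5.0],
})

print(pets)
```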
The first is head(), which returns the first few rows of the data frame (five by default). It is most useful when the data frame has many rows; with only a few rows it makes little difference.
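As a rough sketch of head(), assuming a made-up pets table (the names and values are hypothetical):

```python
import pandas as pd

# Hypothetical sample data, invented for this illustration.
pets = pd.DataFrame({
    "name": ["Bella", "Milo", "Luna", "Rocky", "Coco", "Simba"],
    "age": [3, 5, 2, 7, 1, 4],
})

first_five = pets.head()   # first five rows by default
first_two = pets.head(2)   # or ask for a specific number of rows

print(first_two)
```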
Another is the info() method. It displays the column names, the data type each column contains, and the number of non-null values per column, which shows whether any values are missing.
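A minimal sketch of info(), assuming a hypothetical table with one missing value. By default info() prints straight to the screen; the buffer here is only a convenience for capturing the text.

```python
import io

import pandas as pd

# Hypothetical sample data; one weight is deliberately missing.
pets = pd.DataFrame({
    "name": ["Bella", "Milo", "Luna"],
    "age": [3, 5, 2],
    "weight_kg": [24.5, None, 3.8],
})

# info() writes its report to standard output unless a buffer is given.
buffer = io.StringIO()
pets.info(buf=buffer)
summary = buffer.getvalue()

print(summary)
```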
A data frame's shape attribute is a tuple holding the number of rows followed by the number of columns. Because shape is an attribute and not a method, it is written without parentheses.
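For instance, with a hypothetical table of six rows and two columns, shape could be read like this:

```python
import pandas as pd

# Hypothetical sample data, invented for this illustration.
pets = pd.DataFrame({
    "name": ["Bella", "Milo", "Luna", "Rocky", "Coco", "Simba"],
    "age": [3, 5, 2, 7, 1, 4],
})

# shape is an attribute, so there are no parentheses after it.
n_rows, n_cols = pets.shape

print(pets.shape)
```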
describe() is another method; it computes summary statistics, such as the mean and median, for the numeric columns. It gives a quick overview of the numeric variables. Individual summary statistics are also available as their own methods, such as median, mode, min, max, std, and var.
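A short sketch of describe() next to a single-statistic method, again on made-up data. In the describe() output the median appears in the row labeled 50%.

```python
import pandas as pd

# Hypothetical sample data, invented for this illustration.
pets = pd.DataFrame({
    "name": ["Bella", "Milo", "Luna", "Rocky", "Coco", "Simba"],
    "age": [3, 5, 2, 7, 1, 4],
})

# describe() summarizes numeric columns only, so "name" is left out.
summary = pets.describe()

# Individual statistics can be computed on a single column.
median_age = pets["age"].median()

print(summary)
```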
A data frame has three main components: values, index, and columns.
The values attribute holds the data frame's values as a two-dimensional NumPy array.
The index and columns attributes hold the row labels and column names respectively. The index labels the rows, and its labels can be numeric, as in the default integer index.
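The three components can be inspected directly. The sketch below assumes a tiny hypothetical table with the default integer index.

```python
import pandas as pd

# Hypothetical sample data, invented for this illustration.
pets = pd.DataFrame({
    "name": ["Bella", "Milo", "Luna"],
    "age": [3, 5, 2],
})

values = pets.values     # two-dimensional NumPy array of the data
columns = pets.columns   # column names
index = pets.index       # row labels (0, 1, 2 by default)

print(values)
print(columns)
print(index)
```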
Sorting and Subsetting
To sort the rows of a data frame by the values in a column, use:
dataframe_name.sort_values("name_of_column", ascending=True)
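As an illustration on a hypothetical pets table, sorting by a single column in either direction might look like this:

```python
import pandas as pd

# Hypothetical sample data, invented for this illustration.
pets = pd.DataFrame({
    "name": ["Bella", "Milo", "Luna", "Rocky", "Coco", "Simba"],
    "age": [3, 5, 2, 7, 1, 4],
})

# sort_values returns a new data frame; the original is left unchanged.
youngest_first = pets.sort_values("age", ascending=True)
oldest_first = pets.sort_values("age", ascending=False)

print(youngest_first)
```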
You can also sort by one column first and then by another by passing a list of column names:
dataframe_name.sort_values(["name_of_column", "name_of_second_column"], ascending=[True, True])
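A sketch of a two-column sort on made-up data: rows are ordered by species alphabetically, and within each species by age from oldest to youngest. The ascending list gives one direction per column.

```python
import pandas as pd

# Hypothetical sample data, invented for this illustration.
pets = pd.DataFrame({
    "name": ["Bella", "Milo", "Luna", "Rocky", "Coco", "Simba"],
    "species": ["dog", "cat", "cat", "dog", "rabbit", "cat"],
    "age": [3, 5, 2, 7, 1, 4],
})

# Column names go in a list; ascending takes one flag per column.
by_species_then_age = pets.sort_values(
    ["species", "age"], ascending=[True, False]
)

print(by_species_then_age)
```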
To select a column from the data frame, put its name in square brackets after the data frame; to select several columns, put a list of names inside the brackets.
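The original screenshots for this step are not available, so the following is only a sketch of column selection on a hypothetical table. A single name gives a Series, while a list of names gives a data frame.

```python
import pandas as pd

# Hypothetical sample data, invented for this illustration.
pets = pd.DataFrame({
    "name": ["Bella", "Milo", "Luna"],
    "species": ["dog", "cat", "cat"],
    "age": [3, 5, 2],
})

names = pets["name"]                   # one column: a Series
names_and_ages = pets[["name", "age"]] # several columns: a DataFrame

print(names_and_ages)
```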

You can also use logical conditions to select only the rows that satisfy them, for example all rows where a column's value is above some threshold.
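One way this kind of filtering could look, assuming a hypothetical pets table. Each condition produces a column of True/False values, and conditions are combined with & (and) or | (or), each wrapped in parentheses.

```python
import pandas as pd

# Hypothetical sample data, invented for this illustration.
pets = pd.DataFrame({
    "name": ["Bella", "Milo", "Luna", "Rocky", "Coco", "Simba"],
    "species": ["dog", "cat", "cat", "dog", "rabbit", "cat"],
    "age": [3, 5, 2, 7, 1, 4],
})

# Rows where age is greater than 3.
older_pets = pets[pets["age"] > 3]

# Rows where both conditions hold.
older_cats = pets[(pets["species"] == "cat") & (pets["age"] > 3)]

print(older_cats)
```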
If you want to filter on multiple values of a categorical variable, the easiest way is to use the isin() method, which takes a list of the values to keep.
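A brief sketch of isin() on made-up data, keeping only the rows whose species is one of two chosen values:

```python
import pandas as pd

# Hypothetical sample data, invented for this illustration.
pets = pd.DataFrame({
    "name": ["Bella", "Milo", "Luna", "Rocky", "Coco", "Simba"],
    "species": ["dog", "cat", "cat", "dog", "rabbit", "cat"],
})

# isin() returns True for rows whose value appears in the list.
is_dog_or_rabbit = pets["species"].isin(["dog", "rabbit"])
dogs_and_rabbits = pets[is_dog_or_rabbit]

print(dogs_and_rabbits)
```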
You can also add a new column to your data frame, for example one computed from an existing column. This is also called mutating or transforming the data frame.
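As a sketch, assuming a hypothetical table with an age column in years, a derived column can be added by assigning to a new column name:

```python
import pandas as pd

# Hypothetical sample data, invented for this illustration.
pets = pd.DataFrame({
    "name": ["Bella", "Milo", "Luna"],
    "age": [3, 5, 2],
})

# Assigning to a new name creates the column; the arithmetic is applied row by row.
pets["age_months"] = pets["age"] * 12

print(pets)
```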
There is a lot more to pandas. Other topics and approaches will be added shortly.
PS: This article is based on the DataCamp data science course, and most of the images belong to DataCamp.
