Python Pandas for Data Analysis (Learn what Matters)

Published
5 min read

What is Pandas?

Pandas is a Python library designed to handle structured data easily. It offers powerful data structures, namely Series and DataFrames, to manage and manipulate data effectively. With Pandas, you can clean, analyze, and even visualize data in a few lines of code.

Why Use Pandas?

  • Efficient Data Handling: Vectorized, column-wise operations make it practical to process large in-memory datasets.

  • Versatility: Handle various data formats, including CSV, JSON, Excel, and SQL.

  • Integration: Works seamlessly with other data science libraries like NumPy, SciPy, and scikit-learn.

Installing Pandas

To start using Pandas, install it using pip:

pip install pandas

Getting Started with Pandas

By convention, Pandas is imported under the alias pd:

import pandas as pd

Pandas Data Structures: Series and DataFrames

Pandas has two main data structures:

i. Series - A one-dimensional array with an index.

ii. DataFrame - A tabular, spreadsheet-like structure made of rows and columns.

  • Series

A Pandas Series is a one-dimensional array that can hold any data type.

# Creating a Series
data = pd.Series([10, 20, 30, 40])
print(data)

# output -
# 0    10
# 1    20
# 2    30
# 3    40
# dtype: int64

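The index of a Series does not have to be the default 0, 1, 2, ... positions. A minimal sketch, assuming hypothetical labels "a" to "d" that are not part of the original example, shows how values can then be looked up by label:

```python
import pandas as pd

# Hypothetical labels used only for illustration
scores = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

# Look up a value by its label rather than its position
print(scores["c"])        # 30

# Select several labels at once; the result is another Series
subset = scores[["a", "d"]]
print(subset.tolist())    # [10, 40]
```
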
  • DataFrame

A DataFrame is a two-dimensional structure (table) with labeled axes (rows and columns). It’s the most commonly used structure in Pandas.

import pandas as pd

dic1 = {
    "name":['sutapa', 'gojo', 'Suguru', 'levi', 'naruto'],
    "marks": [89, 56, 34, 90, 32],
    "city": ["Kalyani", "Shibuiya", "Keisen", "Wall Maria", "Konoha"]
}

df = pd.DataFrame(dic1)

print(df)

# output -
#      name  marks        city
# 0  sutapa     89     Kalyani
# 1    gojo     56    Shibuiya
# 2  Suguru     34      Keisen
# 3    levi     90  Wall Maria
# 4  naruto     32      Konoha

# Gives a table-like structure

Note: A DataFrame is a collection of Series; each column is a Series.

print(df['name'])

#output - 
#0    sutapa
#1      gojo
#2    Suguru
#3      levi
#4    naruto
#Name: name, dtype: object

print(type(df['name']))
# output - <class 'pandas.core.series.Series'> (each column of a DataFrame is a Series)

# The same column can also be selected through df.columns
print(df[df.columns[0]])
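The relationship also works in the other direction: a DataFrame can be assembled from individual Series. The sketch below is only an illustration; the variable names and the shortened sample data are assumptions, not part of the original example.

```python
import pandas as pd

# Two Series that share the same default index (hypothetical sample data)
names = pd.Series(["sutapa", "gojo", "levi"])
marks = pd.Series([89, 56, 90])

# Each dictionary key becomes a column name
students = pd.DataFrame({"name": names, "marks": marks})

print(students.shape)                             # (3, 2)
print(isinstance(students["marks"], pd.Series))   # True
```
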

Exporting a DataFrame to a CSV File

The DataFrame.to_csv() method (called as df.to_csv()) exports a DataFrame to a CSV (Comma-Separated Values) file. A CSV file is plain text, but spreadsheet programs such as Excel can open it.

df.to_csv('students.csv')
# This creates 'students.csv' in the current working directory
# with the same data as 'df' (the index is written as the first column)

Without Index: To write the .csv file without the index column, pass index=False:

df.to_csv('students_NoIndex.csv', index=False)
# The file is written without the index column

Without Headers: To write the .csv file without the header row, pass header=False:

df.to_csv('students_NoHeaders.csv', header=False)

Specifying Columns: Pass a list to columns to save only specific columns of the DataFrame.

# Exporting only specific columns
df.to_csv('students_SpecificColumns.csv', columns=['name', 'city'])
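To check what these options do without opening a file, to_csv() can be called with no file path, in which case it returns the CSV text as a string. The following is a small sketch with a shortened, hypothetical version of the student data:

```python
import pandas as pd

df = pd.DataFrame({"name": ["sutapa", "gojo"], "marks": [89, 56]})

# With no path argument, to_csv() returns the CSV text as a string
csv_text = df.to_csv(index=False)
print(csv_text)
# name,marks
# sutapa,89
# gojo,56

# Options can be combined: no index, no header, only the 'name' column
names_only = df.to_csv(index=False, header=False, columns=["name"])
print(names_only)
# sutapa
# gojo
```
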

Reading Files in Pandas

  1. Reading a CSV File

The most commonly used function is pd.read_csv() for reading CSV files, but Pandas also provides functions to read Excel, JSON, SQL, and other formats.

import pandas as pd

# Read a CSV file
df = pd.read_csv('data.csv')

  2. Reading an Excel File

# Requires an Excel engine such as openpyxl to be installed
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')

  3. Reading a JSON File

df = pd.read_json('data.json')

  4. Reading a Plain Text File

# Reading a tab-separated text file
df = pd.read_csv('data.txt', sep='\t')

  5. Reading a SQL File
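
Pandas reads SQL data through a database connection rather than from a file path. The original gives no example here, so the following is only a sketch under assumptions: it uses Python's built-in sqlite3 module with an in-memory database, and the table name students and its rows are made up for illustration.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database so the example needs no external file
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, marks INTEGER)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?)",
    [("sutapa", 89), ("gojo", 56)],
)
conn.commit()

# Run a query and load the result into a DataFrame
sql_df = pd.read_sql_query("SELECT name, marks FROM students", conn)
conn.close()

print(sql_df)
```

For other databases, the same call is typically given a connection created for that database instead of the sqlite3 one.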

Some Extra Functions In Pandas

  • .head() & .tail() - quickly preview the first and last rows of a DataFrame (5 rows by default)

# Show the first 5 rows
df.head()

# Show the last 5 rows
df.tail()

# Show only the first two rows
df.head(2)

# Show only the last two rows
df.tail(2)

  • .describe() - provides a quick statistical summary of the numerical columns in a DataFrame

# Summarize the numerical columns with 'count', 'mean', 'std', 'min', quartiles, and 'max'
df.describe()
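
The result of describe() is itself a DataFrame, so individual statistics can be pulled out by label. A minimal sketch, assuming a shortened, hypothetical version of the student data:

```python
import pandas as pd

df = pd.DataFrame({"name": ["sutapa", "gojo", "levi"], "marks": [89, 56, 90]})

# By default only the numeric column 'marks' is summarized
summary = df.describe()

# Rows are the statistic names, columns are the original column names
print(summary.loc["mean", "marks"])   # mean of 89, 56 and 90
print(summary.loc["max", "marks"])    # 90.0
```
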

Creating Random Series & DataFrame

  1. Random Series using pd.Series & np.random

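     import pandas as pd
     import numpy as np

     # np.random.rand(6) draws 6 floats uniformly from [0, 1)
     random_series = pd.Series(np.random.rand(6))
     print(random_series)

     # output (values differ on every run) -
     # 0    0.740176
     # 1    0.527784
     # 2    0.751823
     # 3    0.774766
     # 4    0.416964
     # 5    0.816490
     # dtype: float64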
    
  2. Random DataFrame using pd.DataFrame & np.random

     import pandas as pd
     import numpy as np

     # np.random.rand(4, 6) produces a 4-row, 6-column array of floats in [0, 1)
     random_dataframe = pd.DataFrame(np.random.rand(4, 6))
     print(random_dataframe)

     # output (values differ on every run) -
     #
     #           0         1         2         3         4         5
     # 0  0.344185  0.204012  0.034631  0.176049  0.795942  0.045355
     # 1  0.128220  0.390108  0.532313  0.969730  0.877553  0.508515
     # 2  0.732563  0.080579  0.589364  0.077826  0.708684  0.215115
     # 3  0.508971  0.029887  0.487922  0.316253  0.827630  0.901426
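
Because the values above change on every run, it can help to make them repeatable. One way, sketched here as an assumption rather than something the original uses, is NumPy's seeded generator np.random.default_rng; the seed 42 is arbitrary.

```python
import numpy as np
import pandas as pd

# A seeded generator makes the "random" values repeatable (seed 42 is arbitrary)
rng = np.random.default_rng(42)
first = pd.DataFrame(rng.random((4, 6)))

# Re-creating the generator with the same seed yields identical numbers
rng = np.random.default_rng(42)
second = pd.DataFrame(rng.random((4, 6)))

print(first.equals(second))   # True
print(first.shape)            # (4, 6)
```
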
    

Converting a DataFrame to a NumPy Array

Use the to_numpy() method:

data = {
    'A': [1, 2, 3],
    'B': ['a', 'b', 'c']
}

df = pd.DataFrame(data)

# Convert to NumPy array
numpy_array = df.to_numpy()
print(numpy_array)

output - 
[[1 'a']
 [2 'b']
 [3 'c']]
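
When the columns hold different types, as with the integers and strings above, the resulting array has to use a common type. The short sketch below reuses the same data to inspect the array's dtype and to show that selecting only the numeric column keeps a numeric dtype:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": ["a", "b", "c"]})

# Mixed integer and string columns fall back to the generic object dtype
mixed = df.to_numpy()
print(mixed.dtype)          # object

# Selecting only the numeric column keeps a numeric dtype
numeric = df[["A"]].to_numpy()
print(numeric.shape)        # (3, 1)
print(numeric.dtype.kind)   # i  (a signed-integer dtype)
```
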

Converting a Series to a NumPy Array

series = pd.Series([10, 20, 30])

# Convert Series to NumPy array
numpy_array = series.to_numpy()
print(numpy_array)

output - 
[10 20 30]

Transposing a DataFrame

Use the .T attribute:

data = {
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9]
}

df = pd.DataFrame(data)

# Transpose the DataFrame
df_transposed = df.T
print(df_transposed)

output - 
   0  1  2
A  1  2  3
B  4  5  6
C  7  8  9

Note: For a two-dimensional structure such as a DataFrame, axis=0 refers to the rows (the index) and axis=1 refers to the columns.

# Sort the column labels in descending order (axis=1)
df.sort_index(axis=1, ascending=False)

# output -
#    C  B  A
# 0  7  4  1
# 1  8  5  2
# 2  9  6  3

# The columns are in descending order because we passed axis=1

# Sort the row labels in descending order (axis=0)
df.sort_index(axis=0, ascending=False)

# output -
#    A  B  C
# 2  3  6  9
# 1  2  5  8
# 0  1  4  7

# The rows are in descending order
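
The same axis convention applies to other DataFrame methods. As a small illustration (sum() is not covered in the text above and is used here only to make the two axes visible), the same A/B/C data gives:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})

# axis=0 works down the rows, giving one result per column
print(df.sum(axis=0).tolist())   # [6, 15, 24]

# axis=1 works across the columns, giving one result per row
print(df.sum(axis=1).tolist())   # [12, 15, 18]
```
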