Data Handling Using Pandas #1 - vol. 1

Python for Data Analysis.

Hello reader! To cut a long story short, the first question that probably comes to mind is: what is data science, or data analytics? In simple words, it is the process of analyzing a large set of data points to get answers to questions related to that dataset.

What's the need for data analytics? It arises from having to handle huge volumes of data, which is a concern for large business organizations, communities and consumers. Data handling is an important part of analyzing data, because data is not always available in the desired format.
We all know that data is stored in different formats, such as .csv files, Excel files or HTML files. pandas has become a buzzword in the Python community, and it is an important tool in the field of data science.

Now, let's look at the main features of pandas:

  • Pandas has built-in functionality for things like easy grouping, easy joins of data, and rolling windows.
  • The DataFrame object helps a lot in keeping track of our data.
  • With a pandas DataFrame, we can have different data types (float, int, string, datetime, etc.) all in one place.
  • Good IO capabilities: easily pull data from a MySQL database directly into a DataFrame.
  • With pandas, you can use patsy for R-style formula syntax when doing regressions.
  • Tools for loading data into in-memory data objects from different file formats.
  • Data alignment and integrated handling of missing data.
  • Reshaping and pivoting of data sets.
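To make a few of these features concrete, here is a minimal sketch of grouping, joining and a rolling window. The tables, column names and values are made up for illustration and are not from any real dataset.

```python
import pandas as pd

# Hypothetical sales data, invented for this example
sales = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune", "Delhi"],
    "units": [10, 20, 30, 40],
})
regions = pd.DataFrame({
    "city": ["Pune", "Delhi"],
    "region": ["West", "North"],
})

# Grouping: total units sold per city
totals = sales.groupby("city")["units"].sum()

# Join: attach the region of each city to every sales row
joined = sales.merge(regions, on="city", how="left")

# Rolling window: mean of the current row and the one before it
rolling_mean = sales["units"].rolling(window=2).mean()

print(totals)
print(joined)
print(rolling_mean)
```

The first value of the rolling mean is missing (NaN) because a window of two rows is not yet full there, which is also a small taste of how pandas handles missing data.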

So, let's talk about data structures in pandas.

The two important data structures of pandas are Series and DataFrame.

1. Series?

A Series is a one-dimensional, array-like structure with homogeneous data. Some basic features of a Series are:

  • Homogeneous data
  • Size Immutable
  • Values of Data Mutable

2. DataFrame?

A DataFrame is a two-dimensional, table-like structure with heterogeneous data. Basic features of a DataFrame are:

  • Heterogeneous data
  • Size Mutable
  • Data Mutable
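A quick sketch of the difference between the two, assuming some made-up names and values: a Series carries a single dtype, while each column of a DataFrame can have its own, and a DataFrame can grow and have its values changed.

```python
import pandas as pd

# A Series has one dtype for all of its values
s = pd.Series([1.5, 2.5, 3.5])
print(s.dtype)

# A DataFrame can hold a different dtype in each column
# (names and numbers here are invented for illustration)
df = pd.DataFrame({
    "name": ["Asha", "Ravi"],
    "age": [28, 35],
    "score": [88.5, 92.0],
})
print(df.dtypes)

# Size mutable: adding a column changes the shape
df["passed"] = [True, True]

# Data mutable: an existing value can be overwritten
df.loc[0, "age"] = 29
print(df)
```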

So, let's understand the pandas Series. It is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.). Mixed values are stored under the single object dtype, so a Series still has one dtype. A Series can be created using the constructor. Syntax: pandas.Series(data, index, dtype, copy). A Series can also be created from an ndarray, a dictionary or a scalar value.

A Series can be created using:

  • Array
  • Dict
  • Scalar value or constant
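The three routes listed above can be sketched as follows. The variable names and values are only illustrative.

```python
import numpy as np
import pandas as pd

# From a NumPy array, with custom index labels
from_array = pd.Series(np.array([10, 20, 30]), index=["a", "b", "c"])

# From a dict: the keys become the index
from_dict = pd.Series({"x": 1.0, "y": 2.0})

# From a scalar: the value is repeated for every index label
from_scalar = pd.Series(5, index=[0, 1, 2])

print(from_array)
print(from_dict)
print(from_scalar)
```

When a scalar is passed, an index has to be supplied so that pandas knows how many times to repeat the value.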

Now for the pandas DataFrame. A DataFrame can be created from the following:

  • Lists
  • dict
  • Series
  • Numpy ndarrays
  • Another DataFrame
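Here is a small sketch with one DataFrame built from each of the inputs listed above. All names and values are invented for the example.

```python
import numpy as np
import pandas as pd

# From a list of lists: each inner list is one row
from_lists = pd.DataFrame([["Asha", 28], ["Ravi", 35]], columns=["name", "age"])

# From a dict: each key is a column name
from_dict = pd.DataFrame({"name": ["Asha", "Ravi"], "age": [28, 35]})

# From a Series: the Series index becomes the row index
ages = pd.Series([28, 35], index=["Asha", "Ravi"])
from_series = pd.DataFrame({"age": ages})

# From a NumPy ndarray: a 2 x 3 array gives 2 rows and 3 columns
from_ndarray = pd.DataFrame(np.arange(6).reshape(2, 3), columns=["a", "b", "c"])

# From another DataFrame: copy=True asks for an independent copy
from_frame = pd.DataFrame(from_dict, copy=True)

print(from_lists)
print(from_series)
print(from_ndarray)
```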

Now, let's look at binary operations. It is possible to perform add, subtract, multiply and divide operations on a DataFrame. pandas provides methods such as add(), sub(), mul() and div(), along with related methods such as radd() and rsub(), for carrying out binary operations on DataFrames. Of these, add(), sub(), mul() and div() perform the basic mathematical operations of addition, subtraction, multiplication and division of two DataFrames.

The methods rsub() and radd() stand for right-side subtraction and right-side addition: the calling DataFrame becomes the right-hand operand, so df1.rsub(df2) computes df2 - df1. Since all of these operations act on two DataFrames, they are known as binary operations.
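A minimal sketch of these methods on two small DataFrames, df1 and df2, whose values are made up for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pd.DataFrame({"a": [10, 20], "b": [30, 40]})

added = df1.add(df2)          # same as df1 + df2
subtracted = df1.sub(df2)     # same as df1 - df2
multiplied = df1.mul(df2)     # same as df1 * df2
divided = df2.div(df1)        # same as df2 / df1

# Reversed versions: the caller goes on the right-hand side
reverse_sub = df1.rsub(df2)   # same as df2 - df1
reverse_add = df1.radd(df2)   # same as df2 + df1

print(subtracted)
print(reverse_sub)
```

For addition the order does not change the result, but for subtraction sub() and rsub() give results with opposite signs.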

Now, what's the role of Boolean indexing in data handling? Boolean indexing is a type of indexing that uses the actual values of the data in the DataFrame, by way of a boolean vector, to select rows. To access a DataFrame with a boolean index, we have to create a DataFrame whose index contains boolean values, that is, True or False.
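As a rough sketch, assuming a made-up table of names and scores, here is a DataFrame with a boolean index, followed by the more common variant where the boolean vector is computed from the data itself:

```python
import pandas as pd

# The index of this DataFrame holds True/False labels
# (names and scores are invented for illustration)
df = pd.DataFrame(
    {"name": ["Asha", "Ravi", "Meena"], "score": [88, 45, 72]},
    index=[True, False, True],
)

# Select every row whose index label is True
labelled_true = df.loc[True]

# Boolean vector built from the values: one True/False per row
mask = df["score"] >= 80
high_scores = df[mask]

print(labelled_true)
print(high_scores)
```

In day-to-day work the second form, a mask built from a condition on a column, is usually what people mean by boolean indexing.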

Concatenation of DataFrames

How do I concatenate DataFrames in pandas? When we concatenate DataFrames, we choose the axis. axis=0 (the default) tells pandas to stack the second DataFrame UNDER the first one. pandas aligns the columns by name and stacks them accordingly. axis=1 places the columns of the second DataFrame to the RIGHT of the first DataFrame.
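Both directions can be sketched with pd.concat. The two DataFrames here, first and second, are hypothetical and share the same column names.

```python
import pandas as pd

first = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
second = pd.DataFrame({"a": [5, 6], "b": [7, 8]})

# axis=0: stack second under first; ignore_index renumbers the rows 0..3
stacked = pd.concat([first, second], axis=0, ignore_index=True)

# axis=1: put the columns of second to the right of first
side_by_side = pd.concat([first, second], axis=1)

print(stacked)
print(side_by_side)
```

Note that with axis=1 the column names are kept as they are, so this example ends up with two columns called a and two called b.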

In the above discussion I talked about csv files, so what exactly is a CSV file? CSV stands for comma-separated values. It is a simple file format used to store tabular data, such as a spreadsheet or database. A CSV file stores tabular data, i.e. numbers and text, in plain text. Each line of the file is a data record.
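To see this in action without needing a file on disk, here is a small sketch that reads CSV text from an in-memory string. The records are made up; with a real file you would pass its path to pd.read_csv instead.

```python
import io

import pandas as pd

# Each line is one record; the first line holds the column names
csv_text = "name,age\nAsha,28\nRavi,35\n"

# io.StringIO lets read_csv treat the string like an open file
df = pd.read_csv(io.StringIO(csv_text))
print(df)

# Writing back out: index=False leaves the row numbers out of the CSV
round_trip = df.to_csv(index=False)
print(round_trip)
```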

That's all for now, reader. I will be back with another article. Stay tuned for vol. 2 🤞

Lastly,

"The most damaging phrase in the language is... it's always been done this way." - Grace Hopper