Python for data analysis: a beginner's guide
If your professor asked you to analyze data for your thesis, class project or research assignment, Python is the tool you should learn. It is free, widely used in both academia and industry, and has libraries that turn complex analysis into a few lines of code. This guide will get you from zero to producing real results for your university project.
Why Python for data analysis
| Alternative | Limitation |
|---|---|
| Excel | Struggles with large datasets, limited statistical functions, hard to reproduce |
| SPSS | Expensive license, closed ecosystem, no programming skills gained |
| R | Steeper learning curve, smaller job market outside statistics |
| Python | Free, massive ecosystem, transferable job skill, handles any dataset size |
Python is not just for computer science students. Business, engineering, social sciences and health sciences programs increasingly require data analysis, and Python is the most practical choice across all of them.
Setting up your environment
The fastest way to start is with Google Colab — it runs in your browser, requires no installation and gives you free access to computing resources.
Option 1: Google Colab (recommended for beginners)
- Go to colab.research.google.com
- Sign in with your Google account
- Create a new notebook
- Start writing Python code immediately
Option 2: Local installation with Anaconda
- Download Anaconda from anaconda.com
- Run the installer and accept the default settings (Anaconda's installer itself recommends against adding it to PATH on Windows)
- Open Jupyter Notebook from the Anaconda Navigator
- Create a new Python 3 notebook
Both options give you Jupyter Notebooks, which let you write code, see results and add explanations in the same document — perfect for academic work.
The three essential libraries
You need three libraries for 90% of university data analysis tasks:
1. Pandas — data manipulation
Pandas is your spreadsheet on steroids. It lets you load, clean, filter, group and transform data.
```python
import pandas as pd

# Load your data
df = pd.read_csv('survey_results.csv')

# See the first 5 rows
df.head()

# Basic statistics
df.describe()

# Filter rows
engineering_students = df[df['major'] == 'Engineering']

# Group and calculate
avg_by_major = df.groupby('major')['gpa'].mean()
```
Key operations you will use constantly:
- `df.head()` — Preview your data
- `df.describe()` — Summary statistics (mean, std, min, max)
- `df.groupby()` — Group data by categories
- `df.dropna()` — Remove rows with missing values
- `df.value_counts()` — Count occurrences of each value
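To make these operations concrete, here is a minimal sketch on a tiny made-up dataset (the majors and GPA values below are invented for illustration; in your own project they would come from your CSV file):

```python
import pandas as pd

# A tiny, invented dataset for demonstration purposes
demo = pd.DataFrame({
    'major': ['Engineering', 'Biology', 'Engineering', 'Biology', 'History'],
    'gpa': [3.2, 3.8, None, 3.5, 3.0],
})

# Count how many respondents fall in each major
counts = demo['major'].value_counts()

# Drop the row with the missing GPA, then average GPA per major
avg_by_major = demo.dropna().groupby('major')['gpa'].mean()

print(counts)
print(avg_by_major)
```

Running this prints two respondents for Engineering and one for History, and an average GPA per major computed only from complete rows.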
2. Matplotlib and Seaborn — visualization
Matplotlib creates charts. Seaborn makes them look better with less code.
```python
import matplotlib.pyplot as plt
import seaborn as sns

# Bar chart of average GPA by major
plt.figure(figsize=(10, 6))
sns.barplot(data=df, x='major', y='gpa')
plt.title('Average GPA by Major')
plt.xlabel('Major')
plt.ylabel('GPA')
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('gpa_by_major.png', dpi=150)
plt.show()
```
Chart types for common academic needs:
| Analysis need | Chart type | Seaborn function |
|---|---|---|
| Compare categories | Bar chart | sns.barplot() |
| Show distribution | Histogram | sns.histplot() |
| Show relationship between two variables | Scatter plot | sns.scatterplot() |
| Show correlation matrix | Heatmap | sns.heatmap() |
| Show trends over time | Line chart | sns.lineplot() |
| Compare distributions | Box plot | sns.boxplot() |
3. SciPy — statistical tests
When your thesis requires hypothesis testing, SciPy provides the statistical functions.
```python
from scipy import stats

# T-test: compare means of two groups
group_a = df[df['method'] == 'traditional']['score']
group_b = df[df['method'] == 'experimental']['score']

t_stat, p_value = stats.ttest_ind(group_a, group_b)

print(f'T-statistic: {t_stat:.4f}')
print(f'P-value: {p_value:.4f}')

if p_value < 0.05:
    print('Statistically significant difference')
else:
    print('No statistically significant difference')
```
Common statistical tests:
| Test | Use case | Function |
|---|---|---|
| T-test | Compare means of two groups | stats.ttest_ind() |
| Chi-square | Test independence of categorical variables | stats.chi2_contingency() |
| ANOVA | Compare means of three or more groups | stats.f_oneway() |
| Pearson correlation | Measure linear relationship | stats.pearsonr() |
| Shapiro-Wilk | Test normality of data | stats.shapiro() |
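For instance, the chi-square test of independence takes a contingency table of observed counts rather than raw columns. The counts below are invented to illustrate the call, imagining rows as two study methods and columns as pass/fail outcomes:

```python
import numpy as np
from scipy import stats

# Hypothetical contingency table: rows = study methods, columns = pass/fail
observed = np.array([[30, 10],
                     [20, 20]])

chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(f'Chi-square: {chi2:.2f}, p-value: {p_value:.4f}, degrees of freedom: {dof}')
```

Note that `chi2_contingency` also returns the expected counts, which you can report to show the test's assumptions were met (conventionally, all expected counts of at least 5).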
A complete workflow example
Here is a realistic example: analyzing survey data from a university research project about study habits.
```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Step 1: Load data
df = pd.read_csv('study_habits_survey.csv')

# Step 2: Clean data
df = df.dropna()  # Remove incomplete responses
df = df[df['hours_per_week'] > 0]  # Remove invalid entries

# Step 3: Descriptive statistics
print("=== Descriptive Statistics ===")
print(f"Total respondents: {len(df)}")
print(f"Average study hours: {df['hours_per_week'].mean():.1f}")
print(f"Average GPA: {df['gpa'].mean():.2f}")

# Step 4: Visualization
plt.figure(figsize=(8, 5))
sns.scatterplot(data=df, x='hours_per_week', y='gpa', hue='major')
plt.title('Study Hours vs GPA by Major')
plt.xlabel('Study Hours per Week')
plt.ylabel('GPA')
plt.tight_layout()
plt.savefig('study_hours_vs_gpa.png', dpi=150)
plt.show()

# Step 5: Statistical test
correlation, p_value = stats.pearsonr(df['hours_per_week'], df['gpa'])
print(f"\nPearson correlation: {correlation:.4f}")
print(f"P-value: {p_value:.4f}")
```
This workflow — load, clean, describe, visualize, test — applies to virtually any dataset you encounter in your academic career.
Handling common data problems
| Problem | Solution |
|---|---|
| Missing values | df.dropna() or df.fillna(df.mean(numeric_only=True)) |
| Duplicate rows | df.drop_duplicates() |
| Wrong data types | df['column'] = pd.to_numeric(df['column']) |
| Outliers | Filter with df[df['value'] < threshold] |
| Inconsistent text | df['column'] = df['column'].str.lower().str.strip() |
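The fixes in the table chain together naturally. Here is a short sketch on invented messy responses, showing text normalization, type conversion (with `errors='coerce'` so invalid entries become NaN instead of raising an error), and deduplication:

```python
import pandas as pd

# Hypothetical messy survey responses (invented for illustration)
messy = pd.DataFrame({
    'major': ['  Engineering', 'biology ', 'Engineering', 'Engineering'],
    'gpa': ['3.2', '3.8', 'n/a', '3.2'],
})

# Normalize text: lower-case and strip stray whitespace
messy['major'] = messy['major'].str.lower().str.strip()

# Convert GPA strings to numbers; entries like 'n/a' become NaN
messy['gpa'] = pd.to_numeric(messy['gpa'], errors='coerce')

# Drop rows that failed conversion, then remove exact duplicates
clean = messy.dropna().drop_duplicates()

print(clean)
```

Of the four rows, one is dropped for an unconvertible GPA and one as a duplicate, leaving two clean rows.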
Always document what you cleaned and why. Professors and thesis advisors want to see that your data preparation was methodical, not arbitrary.
Presenting results in your thesis
When including Python analysis in your academic work:
- Export charts as high-resolution images — use `plt.savefig('chart.png', dpi=300)` for print quality
- Include your code in an appendix — or provide a link to your Colab notebook or GitHub repository
- Interpret every chart — Never include a visualization without explaining what it shows
- Report statistical results properly — Include test name, test statistic, p-value and degrees of freedom
- Use APA format for tables — If your institution requires it, format statistical tables according to APA 7th edition guidelines
Need help with data analysis for your thesis or research project? At Folium Labs we handle everything from survey design to statistical analysis and data visualization. Get professional results that meet your institution's academic standards.
Next steps
Once you are comfortable with the basics, explore these libraries to expand your capabilities:
- Scikit-learn — Machine learning (classification, regression, clustering)
- Statsmodels — Advanced statistical models (regression analysis, time series)
- Plotly — Interactive charts for web-based presentations
- NLTK — Text analysis and natural language processing
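As a taste of what Scikit-learn offers, here is a minimal linear regression sketch (it assumes `scikit-learn` is installed, and the study-hours data is synthetic, generated only so the example runs on its own):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: GPA loosely increasing with weekly study hours
rng = np.random.default_rng(0)
hours = rng.uniform(0, 30, size=(100, 1))
gpa = 2.5 + 0.04 * hours[:, 0] + rng.normal(0, 0.2, 100)

# Fit a line and report the slope and goodness of fit
model = LinearRegression().fit(hours, gpa)
print(f'Estimated slope: {model.coef_[0]:.3f}')
print(f'R^2: {model.score(hours, gpa):.3f}')
```

The same `fit`/`predict`/`score` interface carries over to Scikit-learn's classification and clustering models, which is why it is a natural next step after Pandas.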
Python is a skill that pays dividends long after graduation. The analysis techniques you learn for your thesis are the same ones companies use to make data-driven decisions.
Struggling with your research methodology or data analysis? Explore our research support services and let our team guide you through the process.