Python for data analysis: a beginner's guide
If your professor asked you to analyze data for your thesis, class project or research assignment, Python is the tool you should learn. It is free, widely used in both academia and industry, and has libraries that turn complex analysis into a few lines of code. This guide will get you from zero to producing real results for your university project.
Why Python for data analysis
| Alternative | Limitation |
|---|---|
| Excel | Struggles with large datasets, limited statistical functions, hard to reproduce |
| SPSS | Expensive license, closed ecosystem, no programming skills gained |
| R | Steeper learning curve, smaller job market outside statistics |
| Python | Free, massive ecosystem, transferable job skill, handles any dataset size |
Python is not just for computer science students. Business, engineering, social sciences and health sciences programs increasingly require data analysis, and Python is the most practical choice across all of them.
Setting up your environment
The fastest way to start is with Google Colab — it runs in your browser, requires no installation and gives you free access to computing resources.
Option 1: Google Colab (recommended for beginners)
- Go to colab.research.google.com
- Sign in with your Google account
- Create a new notebook
- Start writing Python code immediately
Option 2: Local installation with Anaconda
- Download Anaconda from anaconda.com
- Run the installer and accept the default settings (Anaconda's installer itself recommends against adding it to PATH on Windows)
- Open Jupyter Notebook from the Anaconda Navigator
- Create a new Python 3 notebook
Both options give you Jupyter Notebooks, which let you write code, see results and add explanations in the same document — perfect for academic work.
The three essential libraries
You need three libraries for 90% of university data analysis tasks:
1. Pandas — data manipulation
Pandas is your spreadsheet on steroids. It lets you load, clean, filter, group and transform data.
```python
import pandas as pd

# Load your data
df = pd.read_csv('survey_results.csv')

# See the first 5 rows
df.head()

# Basic statistics
df.describe()

# Filter rows
engineering_students = df[df['major'] == 'Engineering']

# Group and calculate
avg_by_major = df.groupby('major')['gpa'].mean()
```
Key operations you will use constantly:
- `df.head()` — Preview your data
- `df.describe()` — Summary statistics (mean, std, min, max)
- `df.groupby()` — Group data by categories
- `df.dropna()` — Remove rows with missing values
- `df.value_counts()` — Count occurrences of each value
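To make these operations concrete, here is a minimal sketch on a tiny made-up dataset (the majors and GPA values below are invented for illustration; in your own project they would come from your CSV file):

```python
import pandas as pd

# A tiny, invented dataset for demonstration purposes
demo = pd.DataFrame({
    'major': ['Engineering', 'Biology', 'Engineering', 'Biology', 'History'],
    'gpa': [3.2, 3.8, None, 3.5, 3.0],
})

# Count how many respondents fall in each major
counts = demo['major'].value_counts()

# Drop the row with the missing GPA, then average GPA per major
avg_by_major = demo.dropna().groupby('major')['gpa'].mean()

print(counts)
print(avg_by_major)
```

Running this prints two respondents for Engineering and one for History, and an average GPA per major computed only from complete rows.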
2. Matplotlib and Seaborn — visualization
Matplotlib creates charts. Seaborn makes them look better with less code.
```python
import matplotlib.pyplot as plt
import seaborn as sns

# Bar chart of average GPA by major
plt.figure(figsize=(10, 6))
sns.barplot(data=df, x='major', y='gpa')
plt.title('Average GPA by Major')
plt.xlabel('Major')
plt.ylabel('GPA')
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('gpa_by_major.png', dpi=150)
plt.show()
```
Chart types for common academic needs:
| Analysis need | Chart type | Seaborn function |
|---|---|---|
| Compare categories | Bar chart | sns.barplot() |
| Show distribution | Histogram | sns.histplot() |
| Show relationship between two variables | Scatter plot | sns.scatterplot() |
| Show correlation matrix | Heatmap | sns.heatmap() |
| Show trends over time | Line chart | sns.lineplot() |
| Compare distributions | Box plot | sns.boxplot() |
3. SciPy — statistical tests
When your thesis requires hypothesis testing, SciPy provides the statistical functions.
```python
from scipy import stats

# T-test: compare means of two groups
group_a = df[df['method'] == 'traditional']['score']
group_b = df[df['method'] == 'experimental']['score']

t_stat, p_value = stats.ttest_ind(group_a, group_b)

print(f'T-statistic: {t_stat:.4f}')
print(f'P-value: {p_value:.4f}')

if p_value < 0.05:
    print('Statistically significant difference')
else:
    print('No statistically significant difference')
```
Common statistical tests:
| Test | Use case | Function |
|---|---|---|
| T-test | Compare means of two groups | stats.ttest_ind() |
| Chi-square | Test independence of categorical variables | stats.chi2_contingency() |
| ANOVA | Compare means of three or more groups | stats.f_oneway() |
| Pearson correlation | Measure linear relationship | stats.pearsonr() |
| Shapiro-Wilk | Test normality of data | stats.shapiro() |
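For instance, the chi-square test of independence takes a contingency table of observed counts rather than raw columns. The counts below are invented to illustrate the call, imagining rows as two study methods and columns as pass/fail outcomes:

```python
import numpy as np
from scipy import stats

# Hypothetical contingency table: rows = study methods, columns = pass/fail
observed = np.array([[30, 10],
                     [20, 20]])

chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(f'Chi-square: {chi2:.2f}, p-value: {p_value:.4f}, degrees of freedom: {dof}')
```

Note that `chi2_contingency` also returns the expected counts, which you can report to show the test's assumptions were met (conventionally, all expected counts of at least 5).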
A complete workflow example
Here is a realistic example: analyzing survey data from a university research project about study habits.
```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Step 1: Load data
df = pd.read_csv('study_habits_survey.csv')

# Step 2: Clean data
df = df.dropna()  # Remove incomplete responses
df = df[df['hours_per_week'] > 0]  # Remove invalid entries

# Step 3: Descriptive statistics
print("=== Descriptive Statistics ===")
print(f"Total respondents: {len(df)}")
print(f"Average study hours: {df['hours_per_week'].mean():.1f}")
print(f"Average GPA: {df['gpa'].mean():.2f}")

# Step 4: Visualization
plt.figure(figsize=(8, 5))
sns.scatterplot(data=df, x='hours_per_week', y='gpa', hue='major')
plt.title('Study Hours vs GPA by Major')
plt.xlabel('Study Hours per Week')
plt.ylabel('GPA')
plt.tight_layout()
plt.savefig('study_hours_vs_gpa.png', dpi=150)
plt.show()

# Step 5: Statistical test
correlation, p_value = stats.pearsonr(df['hours_per_week'], df['gpa'])
print(f"\nPearson correlation: {correlation:.4f}")
print(f"P-value: {p_value:.4f}")
```
This workflow — load, clean, describe, visualize, test — applies to virtually any dataset you encounter in your academic career.
Handling common data problems
| Problem | Solution |
|---|---|
| Missing values | df.dropna() or df.fillna(df.mean(numeric_only=True)) |
| Duplicate rows | df.drop_duplicates() |
| Wrong data types | df['column'] = pd.to_numeric(df['column']) |
| Outliers | Filter with df[df['value'] < threshold] |
| Inconsistent text | df['column'] = df['column'].str.lower().str.strip() |
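The fixes in the table chain together naturally. Here is a short sketch on invented messy responses, showing text normalization, type conversion (with `errors='coerce'` so invalid entries become NaN instead of raising an error), and deduplication:

```python
import pandas as pd

# Hypothetical messy survey responses (invented for illustration)
messy = pd.DataFrame({
    'major': ['  Engineering', 'biology ', 'Engineering', 'Engineering'],
    'gpa': ['3.2', '3.8', 'n/a', '3.2'],
})

# Normalize text: lower-case and strip stray whitespace
messy['major'] = messy['major'].str.lower().str.strip()

# Convert GPA strings to numbers; entries like 'n/a' become NaN
messy['gpa'] = pd.to_numeric(messy['gpa'], errors='coerce')

# Drop rows that failed conversion, then remove exact duplicates
clean = messy.dropna().drop_duplicates()

print(clean)
```

Of the four rows, one is dropped for an unconvertible GPA and one as a duplicate, leaving two clean rows.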
Always document what you cleaned and why. Professors and thesis advisors want to see that your data preparation was methodical, not arbitrary.
Presenting results in your thesis
When including Python analysis in your academic work:
- Export charts as high-resolution images — use `plt.savefig('chart.png', dpi=300)` for print quality
- Include your code in an appendix — or provide a link to your Colab notebook or GitHub repository
- Interpret every chart — Never include a visualization without explaining what it shows
- Report statistical results properly — Include test name, test statistic, p-value and degrees of freedom
- Use APA format for tables — If your institution requires it, format statistical tables according to APA 7th edition guidelines
Need help with data analysis for your thesis or research project? At Folium Labs we handle everything from survey design to statistical analysis and data visualization. Get professional results that meet your institution's academic standards.
Next steps
Once you are comfortable with the basics, explore these libraries to expand your capabilities:
- Scikit-learn — Machine learning (classification, regression, clustering)
- Statsmodels — Advanced statistical models (regression analysis, time series)
- Plotly — Interactive charts for web-based presentations
- NLTK — Text analysis and natural language processing
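As a taste of what Scikit-learn offers, here is a minimal linear regression sketch (it assumes `scikit-learn` is installed, and the study-hours data is synthetic, generated only so the example runs on its own):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: GPA loosely increasing with weekly study hours
rng = np.random.default_rng(0)
hours = rng.uniform(0, 30, size=(100, 1))
gpa = 2.5 + 0.04 * hours[:, 0] + rng.normal(0, 0.2, 100)

# Fit a line and report the slope and goodness of fit
model = LinearRegression().fit(hours, gpa)
print(f'Estimated slope: {model.coef_[0]:.3f}')
print(f'R^2: {model.score(hours, gpa):.3f}')
```

The same `fit`/`predict`/`score` interface carries over to Scikit-learn's classification and clustering models, which is why it is a natural next step after Pandas.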
Python is a skill that pays dividends long after graduation. The analysis techniques you learn for your thesis are the same ones companies use to make data-driven decisions.
Struggling with your research methodology or data analysis? Explore our research support services and let our team guide you through the process.