Here’s a detailed list of the top 30 interview questions for an Entry-Level Data Analyst, along with well-structured answers to help you prepare effectively.
1. Technical & SQL Questions
Q1. What is the difference between Data Analysis and Data Analytics?
Answer:
Data Analysis focuses on inspecting, cleaning, and transforming data to find patterns and answer specific questions.
Data Analytics is broader, involving data collection, analysis, visualization, and decision-making using insights.
Q2. Explain the data cleaning process. How do you handle missing values?
Answer:
Steps in data cleaning:
Remove duplicates – Drop redundant rows.
Handle missing values – Fill with mean/median (numerical) or mode (categorical), or drop if insignificant.
Correct inconsistencies – Fix typos, standardize formats.
Remove outliers – Using IQR or Z-score.
Q3. What are the different types of data sampling techniques?
Answer:
Random Sampling – Every record has an equal chance of selection.
Stratified Sampling – Divides data into groups (strata) and samples from each.
Cluster Sampling – Divides data into clusters and randomly selects clusters.
Systematic Sampling – Selects every n-th record from a list.
Q4. Write a SQL query to find the second-highest salary from an Employee table.
Answer:
SELECT MAX(salary) AS second_highest_salary FROM Employee WHERE salary < (SELECT MAX(salary) FROM Employee);
(Alternatively, use RANK()
or DENSE_RANK()
in modern SQL.)
Q5. What is the difference between WHERE and HAVING in SQL?
Answer:
WHERE filters rows before grouping (used with SELECT
, UPDATE
, etc.).
HAVING filters after grouping (used with GROUP BY
).
Q6. How do you join tables in SQL? Explain INNER, LEFT, RIGHT, and FULL JOIN.
Answer:
INNER JOIN → Returns matching rows from both tables.
LEFT JOIN → All rows from the left table + matched rows from the right (null if no match).
RIGHT JOIN → All rows from the right table + matched rows from the left.
FULL JOIN → All rows from both tables (null where no match).
Q7. What are aggregate functions in SQL? Give examples.
Answer:
Functions that perform calculations on a set of values:
COUNT()
– Number of rows.
SUM()
– Total of values.
AVG()
– Average.
MAX()/MIN()
– Highest/lowest value.
Q8. How would you optimize a slow-running SQL query?
Answer:
Indexing – Add indexes on frequently queried columns.
Avoid SELECT *
– Fetch only needed columns.
Limit subqueries – Replace with JOINs where possible.
Partition large tables – Split data for faster access.
Q9. What is normalization in databases? Explain 1NF, 2NF, and 3NF.
Answer:
1NF – Each column has atomic values (no repeating groups).
2NF – Meets 1NF + no partial dependency (all non-key columns depend on the full primary key).
3NF – Meets 2NF + no transitive dependency (non-key columns don’t depend on other non-key columns).
Q10. How do you use GROUP BY and ORDER BY in SQL?
Answer:
GROUP BY → Groups rows with the same values (used with aggregate functions).
ORDER BY → Sorts results in ascending (ASC
) or descending (DESC
) order.
2. Excel & Data Visualization
Q11. How do you use VLOOKUP and INDEX-MATCH in Excel?
Answer:
VLOOKUP:
excel
=VLOOKUP(lookup_value, table_array, col_index, [range_lookup])
(Finds a value in the first column of a range.)
INDEX-MATCH (More flexible)
excel
Copy
=INDEX(column_to_return, MATCH(lookup_value, lookup_column, 0))
Q12. What are pivot tables, and how do you use them?
Answer:
Summarizes large datasets by dragging fields into Rows, Columns, Values, and Filters.
Used for grouping, summing, averaging, and cross-tabulation.
Q13. How do you handle duplicate data in Excel?
Answer:
Remove Duplicates (Data tab → Remove Duplicates).
Conditional Formatting → Highlight duplicates.
COUNTIF formula → Identify duplicates before removal.
Q14. What are the best practices for data visualization?
Answer:
Choose the right chart (Bar for comparisons, Line for trends, Pie for proportions).
Label clearly (titles, axes, legends).
Avoid clutter (limit colors, remove unnecessary elements).
Use consistent scales.
Q15. Explain the difference between Bar Chart, Histogram, and Pie Chart.
Answer:
Bar Chart → Compares categories (discrete data).
Histogram → Shows distribution (continuous data in bins).
Pie Chart → Shows parts of a whole (percentages).
3. Python/R & Data Analysis
Q16. What Python/R libraries have you used for data analysis?
Answer:
Pandas (Data manipulation).
NumPy (Numerical operations).
Matplotlib/Seaborn (Visualization).
Scikit-learn (Machine learning).
Q17. How do you read and manipulate a CSV file in Python?
Answer:
python
import pandas as pd df = pd.read_csv("data.csv") df.head() # View first 5 rows df.dropna() # Remove missing values df.groupby("column").mean() # Aggregate
Q18. What is the difference between List, Tuple, and Dictionary in Python?
Answer:
List → Mutable, ordered [1, 2, 3]
.
Tuple → Immutable, ordered (1, 2, 3)
.
Dictionary → Key-value pairs {"key": "value"}
.
Q19. How do you handle missing data in Pandas?
Answer:
Drop rows: df.dropna()
.
Fill with mean: df.fillna(df.mean())
.
Forward fill: df.fillna(method='ffill')
.
Q20. Write a Python code to filter rows where age > 30 in a DataFrame.
Answer:
python
filtered_df = df[df["age"] > 30]
4. Statistics & Problem-Solving
Q21. What is the difference between mean, median, and mode?
Answer:
Mean → Average (sum of values / count).
Median → Middle value in sorted data.
Mode → Most frequent value.
Q22. Explain correlation vs. causation.
Answer:
Correlation → Variables move together (e.g., ice cream sales & drowning).
Causation → One variable directly affects another (e.g., smoking causes cancer).
Q23. What is hypothesis testing? Explain p-value and significance level.
Answer:
Hypothesis Testing → Tests if a claim (e.g., "Drug X works") is statistically significant.
p-value → Probability of observing results if the null hypothesis is true.
Significance level (α) → Threshold (usually 0.05) to reject the null hypothesis.
Q24. How do you identify outliers in a dataset?
Answer:
Boxplot (IQR method) → Outliers are outside 1.5 * IQR.
Z-score → Values beyond ±3 standard deviations.
Q25. What is A/B testing, and how is it used in data analysis?
Answer:
Compares two versions (A & B) to see which performs better (e.g., website layouts).
Uses hypothesis testing (t-tests) to check statistical significance.
5. Behavioral & Situational Questions
Q26. Tell me about a data-related project you worked on. What insights did you find?
Answer: (Example for freshers)
"In a college project, I analyzed COVID-19 data using Python. I found that countries with early lockdowns had lower death rates. I visualized trends using Matplotlib."
Q27. How do you prioritize tasks when working with multiple datasets?
Answer:
Understand business goals (what’s most impactful?).
Check deadlines.
Start with high-priority, low-complexity tasks.
Q28. Describe a time when you had to explain complex data to a non-technical person.
Answer:
"I presented sales trends to marketing teams using simple bar charts instead of raw SQL outputs, focusing on key takeaways like ‘Q4 sales dropped by 15%’."