U.S. Department of Education

Policy Priority: The Department of Education is focused on ensuring that parents, students, and policymakers are able to
use its publicly available data to take meaningful action to improve outcomes.
Supporting Decision-Making for Student Sub-populations and their Families
Problem: Certain mobile or disconnected student populations entering or reentering the community could greatly benefit
from data and resources to support their wellbeing and success. Such students and families often lack information that is
necessary to distinguish between their school options, access services, and identify affordable housing near high-quality
schools and in safe neighborhoods that have access to transit and employment.
Identifying Equity Scores and Gaps
Problem: Within and across school districts and communities there are significant disparities in outcomes (e.g. achievement,
graduation rate) between different student groups, in whole or in part due to inequitable access to resources (e.g. per-pupil
expenditures, rigorous coursework, effective teachers). Decision-makers and parents would benefit from information that
could help them understand these inequities and/or identify where gaps between groups of students may exist.
Your Task
You are a data scientist hired by a non-profit organization whose mission is to increase college graduation rates for
underprivileged populations. Through advocacy and targeted outreach programs, your organization strives to identify and
alleviate barriers to educational achievement.
Your team is committed to developing a more data-based approach to decision making. As a prelude to future analyses,
you are requested to analyze the data to identify clusters of similar colleges and universities.
A Few Tips
1. Clustering Algorithm: K-means is a powerful and recommended clustering algorithm, but the dimensionality of the data is
very high. You may need to use dimension reduction/feature extraction methods to preprocess the data. At the end of the day,
the choice of clustering technique(s) is yours.
2. Data Preparation: What variables to use (you obviously don’t have to use them all)? How will you deal with missing
values? Categorical variables? Normalization or scaling? These are all very subjective questions you need to figure out
as a data scientist. In addition to being completely data driven, you may also want to look into the educational theories
related to the problem statement and technical characteristics of the algorithm(s) you’re using.
3. Is it possible to explain what each cluster represents? Did you retain or prepare a set of features that enables a
meaningful interpretation of the clusters? Do the compositions of the clusters seem to make sense?
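The tips above can be sketched as a single preprocessing and clustering pipeline. This is only one possible approach, not a required solution. It assumes Python with pandas and scikit-learn, and it runs on a small synthetic table. Every column name except UNITID is a hypothetical placeholder, and the choices of median imputation, three principal components, and four clusters are arbitrary illustrations.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for the real data; all columns except UNITID are hypothetical.
df = pd.DataFrame({
    "UNITID": np.arange(100000, 100000 + n),
    "tuition": rng.normal(20000, 8000, n),
    "admission_rate": rng.uniform(0.1, 1.0, n),
    "grad_rate": rng.uniform(0.2, 0.95, n),
    "control": rng.choice(["public", "private_nonprofit", "for_profit"], n),
})
# Simulate missing values in one numeric column.
df.loc[rng.choice(n, 20, replace=False), "tuition"] = np.nan

numeric_cols = ["tuition", "admission_rate", "grad_rate"]
categorical_cols = ["control"]

# Numeric columns: fill gaps with the median, then standardize so that
# no variable dominates the Euclidean distances used by K-means.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: one-hot encode. sparse_threshold=0 forces dense output for PCA.
preprocess = ColumnTransformer(
    [
        ("num", numeric, numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,
)

model = Pipeline([
    ("preprocess", preprocess),
    ("pca", PCA(n_components=3)),  # dimension reduction before clustering
    ("kmeans", KMeans(n_clusters=4, n_init=10, random_state=0)),
])

# UNITID is an identifier, not a feature, so it is excluded from the inputs.
labels = model.fit_predict(df[numeric_cols + categorical_cols])
```

In a real analysis, the number of components and clusters would need to be justified, for example with explained variance and silhouette scores. One-hot columns mixed with standardized numeric columns also deserve a second look, since they affect distances differently.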
The annotated analytical process and the reproducible code.
In your submission, you may want to include an array of cluster labels corresponding to UNITID (the unique
college/university I.D. variable). Note: Due to the presence of missing data, some observations may be omitted prior
to clustering.
A brief explanation (interpretation) of the clusters.
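As a rough illustration of these deliverables, the sketch below drops incomplete rows before clustering, maps the labels back to UNITID so that omitted institutions remain visible, and summarizes each cluster by its feature means as a starting point for interpretation. It assumes Python with pandas and scikit-learn. The data are synthetic, and the feature names net_price and completion_rate and the choice of three clusters are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 120

# Synthetic stand-in for the real data; feature names are hypothetical.
df = pd.DataFrame({
    "UNITID": np.arange(200000, 200000 + n),
    "net_price": rng.normal(15000, 5000, n),
    "completion_rate": rng.uniform(0.2, 0.9, n),
})
df.loc[rng.choice(n, 15, replace=False), "net_price"] = np.nan

features = ["net_price", "completion_rate"]

# Omit observations with missing values prior to clustering.
complete = df.dropna(subset=features).copy()

X = StandardScaler().fit_transform(complete[features])
complete["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Cluster labels keyed by UNITID; omitted institutions get a missing label.
labels = df[["UNITID"]].merge(
    complete[["UNITID", "cluster"]], on="UNITID", how="left"
)
labels["cluster"] = labels["cluster"].astype("Int64")

# Per-cluster means in original units, as a starting point for interpretation.
profile = complete.groupby("cluster")[features].mean()
omitted = int(labels["cluster"].isna().sum())
```

Reporting the number of omitted institutions alongside the labels makes it clear how much of the population the clusters actually describe.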
You can choose to work on this assignment individually or in a team (team size <= 3). If you want to work in a larger group,
email Lukas.
Your submission should be in .html or .pdf format; .ipynb or .Rmd files will not be accepted. This will
demonstrate how you communicate your code/analysis with others who may not have access to your data. If you
choose to work in a team, only one of the team members needs to submit the assignment.
Your work will be evaluated on three simple criteria: (a) the implementation process of the clustering, (b) the clarity of your