The Jackson Hole Higher Education Group, Inc.

CyberCampus Project

Technical Document 3.2

Student Segmentation Analysis

This paper describes work in progress on the Higher Education simulation project funded by the Alfred P. Sloan Foundation. Contents may not be used or cited without permission. Limited distribution is provided to obtain comments and criticisms, and to assist potential development partners. Copyright © 1998 by The Jackson Hole Higher Education Group, Inc.

Table of Contents

1. Introduction

2. Student Segment Data

3. Student Performance Variables

3.1. Academic Rating

3.2. Extracurricular Activities

3.3. Athletics

4. Segmentation

5. Student and Institutional Behavior by Segment

5.1. Data Extraction Procedure

5.2. Projecting Results to the Average Institution

5.3. Illustration of Results

6. Application Ratios and Yield Rates

7. Family Income Statistics

Endnotes

1. Introduction

Higher Education is a computer-based simulation game, currently under development, that invites both institutional professionals and interested laypersons to take on leadership challenges in a college or university setting. Players set, monitor, and modify a variety of institutional parameters and policies, allocate resources as they see fit, and watch as results continually unfold. The game provides an opportunity to experiment and succeed or fail in a safe and entertaining fantasy environment. While Higher Education is necessarily a caricature of real academic life, it is grounded in authentic data and will provide serious lessons in higher education. The game will be driven by a sophisticated simulation engine that models six broad areas:

  1. enrollment management

  2. resource allocation and finance

  3. academic operations

  4. physical plant activities

  5. performance indicators

  6. initialization procedures

Models 1-5 are described in Technical Documents 2.1 through 2.5, prepared during the project’s preliminary design and prototyping phase. An overview is provided in Technical Document 2.0. The engine’s development was supported by the Sloan Foundation and the Spencer Foundation.

This paper defines the undergraduate student segments needed for the enrollment management model. It extends the concepts described originally in Technical Document 2.1, "Enrollment Management," and reports preliminary empirical analysis. The work done so far will allow us to populate the CyberCampus undergraduate enrollment database. However, opportunities for refinement exist and should be taken up if time and resources permit. Additional work is needed to fully implement the financial aid model, and this paper will eventually be revised to include the additional material.

After a brief overview of our student database, the paper defines performance indices for academic ability, extracurricular activities, and athletic performance, the three variables identified in Technical Document 2.1 as driving the student segmentation structure. (The graduate and non-matriculated student segments will depend only on academic ability.) We next develop a decision rule, based on the three indices, for assigning students to segments. Then we extract the data for applications, admissions offers, matriculations, and financial aid offers and awards for each institutional segment, student segment, and gender-ethnic group.

Technical Document 3.1, "Initialization," describes how the student segment information will be used by CyberCampus. It also describes our institutional database, which will be referenced occasionally herein.

2. Student Segment Data

The CyberCampus demand model is structured around (a) seven student segments, ranging from the highly sought-after "Blue Chip" group to "Stretch" candidates, and (b) four gender-ethnic categories (minority/non-minority crossed with gender). While institution-oriented data sources like IPEDS don’t get to this level of detail, the necessary information can be derived from the National Education Longitudinal Study (NELS), data from which were supplied to us by Penn’s Institute for Research on Higher Education (IRHE). Data on the schools comprising CyberCampus’s seven institutional segments also were supplied by IRHE.

The NELS dataset provides a rich array of variables for the 8,018 survey respondents who reported attendance at a four-year postsecondary institution. Included are:

  • FICE codes for the two most preferred institutions to which applications were sent,
  • whether admission was offered by each of these institutions,
  • whether financial aid was sought and given by each of them,
  • the institution actually attended,
  • an extensive array of academic, extracurricular, and socioeconomic variables, and
  • gender and ethnicity.

IRHE supplied the institutional segment number for each FICE code mentioned in the dataset, provided the school was among the 1,200 in our institutional database.

3. Student Performance Variables

CyberCampus segments students according to academic, extracurricular, and athletic ratings. Each segment is further divided in terms of gender and ethnicity. We also calculate certain family income variables associated with each segment.

This section provides the definitions for the academic, extracurricular, and athletic ratings. The results of these calculations were then converted to a 0-10 scale and stored for use by the segmentation algorithm (see Endnote 1).

The performance ratings are based on the variables and weights shown in Figure 1. Variable selection was dictated by the NELS data structure. The weights were determined by judgment. It may be possible to estimate the weights statistically in the context of an empirically-optimized segmentation analysis, but this is currently beyond our scope.

3.1 Academic Rating

Academic rating depends on a student’s SAT score (ACT scores have been translated to the SAT scale), high school GPA, and highest score among all the advanced placement (AP) exams taken. The highest AP score is coded on a 0-5 scale, so the values were divided by 5 before being multiplied by the AP weight. Standard deviations were used to normalize SAT and GPA because they are continuous variables. Missing data on SAT and GPA were handled by regressing each variable on the other and using the resulting prediction whenever one of the pair was missing. The record was discarded if both variables were missing. The regression equations are (see Endnote 2):

SAT = 0.4 + 578.98 GPA; GPA = 1.59 + 0.0015 SAT

Missing values for High-AP were set to 0 ("no AP taken").
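The missing-data handling described above can be sketched in a few lines. This is only an illustration, not the project's actual spreadsheet logic: the function name `fill_academic_inputs` and the use of `None` for a missing value are assumptions, while the coefficients are the ones reported in the text. The weighting and standard-deviation normalization are left out because the weights appear only in Figure 1.

```python
from typing import Optional, Tuple

# Regression coefficients as reported in the text.
SAT_INTERCEPT, SAT_SLOPE = 0.4, 578.98    # SAT = 0.4 + 578.98 * GPA
GPA_INTERCEPT, GPA_SLOPE = 1.59, 0.0015   # GPA = 1.59 + 0.0015 * SAT


def fill_academic_inputs(
    sat: Optional[float], gpa: Optional[float], high_ap: Optional[float]
) -> Optional[Tuple[float, float, float]]:
    """Return (sat, gpa, ap_fraction), or None if the record is discarded.

    None stands for a missing value (an assumption of this sketch).
    """
    if sat is None and gpa is None:
        return None  # both missing: the record is discarded
    if sat is None:
        sat = SAT_INTERCEPT + SAT_SLOPE * gpa   # predict SAT from GPA
    elif gpa is None:
        gpa = GPA_INTERCEPT + GPA_SLOPE * sat   # predict GPA from SAT
    ap = 0.0 if high_ap is None else high_ap    # missing AP means "no AP taken"
    return sat, gpa, ap / 5.0                   # AP is coded 0-5, rescaled to 0-1
```

A record with an SAT score but no GPA keeps its SAT and receives the predicted GPA; a record with neither is dropped before any rating is computed.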

3.2 Extracurricular Activities

A student’s extracurricular rating is the weighted sum of the eight variables shown under that heading in Figure 1. The student’s record was discarded if any of the four school-based variables (the four variables on the left) or "Time spent on extracurricular activities" was missing. The three community service variables were treated as providing bonus points, with missing values set to zero. Normalization was achieved by dividing each variable by its range.

Figure 1: Student Segmentation Variables and Weights

3.3 Athletics

A student’s athletics rating is the weighted sum of three variables: participation in a team sport, participation in an individual sport, and being named as a most valuable player (MVP). The record was discarded if either participation variable was missing, but the MVP variable was set to zero if missing. Once again the range was used for normalization.
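A range-normalized weighted sum of this kind might look like the sketch below. The weights and ranges here are hypothetical placeholders (the real weights are in Figure 1 and the ranges depend on the NELS coding), and the variable names are invented for illustration. The conversion to the 0-10 scale is not shown because the text does not specify it.

```python
from typing import Dict, Optional

# Hypothetical weights and value ranges, for illustration only.
WEIGHTS = {"team_sport": 0.4, "individual_sport": 0.4, "mvp": 0.2}
RANGES = {"team_sport": 1.0, "individual_sport": 1.0, "mvp": 1.0}
REQUIRED = ("team_sport", "individual_sport")  # record discarded if missing


def athletics_rating(record: Dict[str, Optional[float]]) -> Optional[float]:
    """Weighted sum of range-normalized variables, or None if discarded."""
    if any(record.get(name) is None for name in REQUIRED):
        return None
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = record.get(name)
        if value is None:
            value = 0.0  # only the MVP variable can be missing at this point
        total += weight * value / RANGES[name]
    return total
```

The same pattern would apply to the extracurricular rating, with the four school-based variables and time spent on activities as required fields and the three community service variables defaulting to zero.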

4. Segmentation

The next step was to use the three performance variables to create the segmentation structure. The decision rule was derived with one eye on the logic behind the student segment names and the other on getting the kind of membership distribution described below. An empirically optimized decision rule may be achievable and this would certainly be desirable if time and resources permit.

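The decision rule is as follows, where v1, v2, and v3 denote the academic, extracurricular, and athletic ratings, respectively, and the conditions are tested in order: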

stuSeg = 1 if v1 ≥ 5.5 and v2 ≥ 5.5 else (Blue Chip)
stuSeg = 2 if v1 ≥ 7.0 else (Scholar)
stuSeg = 3 if v1 ≥ 4.0 and v2 ≥ 5.5 else (Extracurricular)
stuSeg = 4 if v3 ≥ 5.0 else (Athlete)
stuSeg = 5 if (v1 ≥ 4.0 and v2 ≥ 5) or v1 ≥ 5.5 else (Balanced)
stuSeg = 6 if v1 ≥ 4.0 else (Average)
stuSeg = 7. (Stretch)
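The cascade can be written as a short function. This is a sketch rather than the project's own implementation: it assumes that v1, v2, and v3 are the academic, extracurricular, and athletic ratings on the 0-10 scale, and it reads the inner condition of the "Balanced" rule as a logical "and", which is an interpretation of the rule as printed.

```python
def assign_segment(v1: float, v2: float, v3: float) -> int:
    """Return the student segment number (1-7); conditions are tested in order."""
    if v1 >= 5.5 and v2 >= 5.5:
        return 1  # Blue Chip
    if v1 >= 7.0:
        return 2  # Scholar
    if v1 >= 4.0 and v2 >= 5.5:
        return 3  # Extracurricular
    if v3 >= 5.0:
        return 4  # Athlete
    # Assumption: the inner condition is read as "v1 >= 4.0 and v2 >= 5".
    if (v1 >= 4.0 and v2 >= 5.0) or v1 >= 5.5:
        return 5  # Balanced
    if v1 >= 4.0:
        return 6  # Average
    return 7      # Stretch
```

Because the tests run in sequence, a student with a high athletic rating who already qualifies as a Blue Chip, Scholar, or Extracurricular is never assigned to the Athlete segment.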

The diagram illustrates the decision rule graphically as it applies to the academic and extracurricular indices. Membership in the "Athlete" segment (#4) requires an athletic rating of five or more, and it applies only to students who failed to qualify for the three preceding segments.

The segment membership percentages came out as shown in the figure. We had no normative information to guide us, but started with the idea that the four specialized segments (1-4) should aggregate roughly to the membership of the two general ones (5-6) and that the "Stretch" segment should be somewhat smaller. In retrospect, it may be desirable to revise the decision rule to produce more evenly balanced segments. This would mitigate the small-sample problems, discussed later, that have arisen in subcategories.

5. Student and Institutional Behavior by Segment

With the segment definitions in hand, the next step was to estimate the distribution of applications, admissions offers, matriculations, financial aid requests, and financial aid awards, for each institutional segment, by student segment and gender-ethnic group. While all these results are derived from the NELS database, one must remember that applications, matriculations, and financial aid requests represent student behavior while admissions offers and aid awards are the consequences of institutional decisions.

5.1 Data Extraction Procedure

To understand the data structure, imagine a table whose columns are the institutional segments and rows the student segments and gender-ethnic groups. Figures 3 and 4 have this structure. The data are extracted cell by cell, according to the following procedure:

a. Select all records in the ratings file that meet the student segment and gender-ethnic criteria.

b. From these, select all records for which the institutional segment code of "institution attended" or either of the two "applied to" schools equals the target institutional segment. ("Institution attended" may have been listed as one of the first two "applied-to" choices, but if different we know the student must have applied.)

c. Sum the respondent weights for all the selected records. (These weights are provided by NELS as projection factors to the universe.)

d. Divide each cell by the grand total for the column to get the fraction of an institutional segment’s applications (or admissions offers, etc.) accounted for by the student segment and gender-ethnic group.
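Steps a through d amount to a weighted cross-tabulation followed by column normalization. The sketch below shows one way to express that for applications; the record layout (keys `stu_seg`, `group`, `weight`, and `inst_segs`) is hypothetical and does not reflect the actual NELS file structure.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Cell = Tuple[int, str, int]  # (student segment, gender-ethnic group, institutional segment)


def application_shares(records: Iterable[dict]) -> Dict[Cell, float]:
    """Sum respondent weights per cell, then divide by each column total.

    Each record is a dict with hypothetical keys:
      "stu_seg"   - student segment number
      "group"     - gender-ethnic group label
      "weight"    - NELS respondent weight (projection factor)
      "inst_segs" - institutional segments of the school attended and of
                    the two "applied to" schools
    """
    cells: Dict[Cell, float] = defaultdict(float)
    column_totals: Dict[int, float] = defaultdict(float)
    for rec in records:
        # A respondent counts once per institutional segment (steps a-c).
        for inst_seg in set(rec["inst_segs"]):
            cells[(rec["stu_seg"], rec["group"], inst_seg)] += rec["weight"]
            column_totals[inst_seg] += rec["weight"]
    # Step d: express each cell as a fraction of its column total.
    return {cell: w / column_totals[cell[2]] for cell, w in cells.items()}
```

The fractions in each institutional-segment column sum to one, which is what allows them to be rescaled later by institution-based totals.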

We recognize that this procedure truncates the distribution of applications to the respondent’s first two preferences, or these two plus the institution attended, but must accept it given the NELS survey design. Similar problems arise in connection with the other variables. Further truncation occurs for the financial aid variable, because our NELS data do not include information about aid offers and awards for the institutions actually attended if different from the two "applied to" schools. As shown in the chart, however, the number of applications per respondent reported in NELS averages fewer than two for every student segment except segment two. These data were obtained by asking directly about the number of applications submitted, so they are not subject to truncation. They suggest that truncation is not a serious problem.

5.2 Projecting Results to the Average Institution

Our institutional database contains school-by-school data for undergraduate applications, admissions offers, and matriculations. The first three lines of Figure 2 present the average of these figures for each institutional segment. While equivalent numbers can be calculated from the NELS data, the institution-based estimates are sure to be more reliable. That is why step "d," above, divides the extracted sums by their column totals. Multiplying the resulting fractions by the overall average for each institutional segment (in Figure 2) produces the requisite per-institution figures for applications, offers, and matriculations by student segment and gender-ethnic group.
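The projection step is a simple rescaling, sketched below. The function name and the cell labels are hypothetical, and the numbers used in the test are invented rather than taken from Figure 2.

```python
from typing import Dict, Tuple

Cell = Tuple[str, str]  # (student segment, gender-ethnic group)


def project_to_average_institution(
    column_shares: Dict[Cell, float], institutional_average: float
) -> Dict[Cell, float]:
    """Scale NELS-derived column fractions by the institution-based average.

    column_shares holds one institutional segment's fractions (summing to 1);
    institutional_average is that segment's average applications, offers,
    or matriculations per school from the institutional database.
    """
    return {cell: share * institutional_average
            for cell, share in column_shares.items()}
```

NELS thus supplies only the shape of the distribution across student categories, while the institutional database supplies its overall level.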

Figure 2: Average Applications, Offers, and Matriculations per School
(based on institutional data)

Figure 2 also shows the applications ratios and yield rates computed from the aforementioned averages. They play no role in the subsequent calculations but they do provide a useful reference point.

5.3 Illustration of Results

Figure 3 presents the results for student applications. Similar tables have been prepared for admissions offers and matriculations. These tables will be stored in the CyberCampus database.

Figure 3: Distributions of Applications Across Student Categories

According to these results, Segment-1 institutions (the so-called "super-medallions") get an average of 119 applications from white male blue-chip students, 2519 from scholars, and none from extracurriculars. Some 350 applications come from white female blue-chips, 32 from minority male blue-chips, and none from minority female blue-chips.

The zero for white male extracurriculars illustrates the small-sample problem mentioned earlier. Despite the large size of the NELS dataset (8,018 respondents, of which more than 6,000 met our data acceptance criteria), some cells in the table are sparsely populated and thus subject to large sampling errors. For example, even one person reporting an application to a medallion institution would have changed the projection by a material amount. Figure 4 provides the information on sample sizes for the applications results. The data show, for example, that the difference between the blue chip and extracurricular super-medallion cells is only three respondents. It may prove desirable to reset the segmentation criteria to more evenly balance the segment memberships and/or smooth the results judgmentally before entering them in the CyberCampus database.

Figure 4: Sample Sizes for the Applications Results

Figure 5 uses the NELS data to address a different question: "How do members of a given student segment and gender-ethnic group distribute their applications among schools in the seven institutional segments?" Here the division is row by row rather than in terms of column totals. There is no reason to supplement the NELS data because the results need not be projected to the universe of institutions.

Figure 5: Distributions of Applications Across Institutional Segments

According to the table, the NELS white male blue chip students submitted 1.1% of their applications to super medallion schools (instSeg 1), 4.6% to medallion schools (instSeg 2), 25.8% to name brand schools (instSeg 3), and so on. Scholars, on the other hand, were much more likely to favor the super-medallions and medallions. We doubt that the differences between the blue chips and scholars are as great as indicated, and once again suspect sampling errors due to sparse data. On the other hand, the table does illuminate the broad patterns of student application behavior. In particular, the mass of applications shifts rightward as one moves down the segmentation hierarchy.

6. Application Ratios and Yield Rates

Section 7 of Technical Document 3.1 describes how CyberCampus will use the results of Section 5.2 to determine the applications ratios and yield rates for the player-generated institution (PGI). These calculations are not strictly relevant to the present discussion. However, it will be useful to illustrate the result in juxtaposition with Figure 3.

Calculation of the PGI’s application figures and yield rates proceeds in four stages, which are illustrated in Figure 6. First we compute a weighted average of the applications, offers, and matriculations for the seven institutional segments as shown in the first three columns. Next we compute the PGI’s overall applications ratio, shown at the bottom of the fourth column, based on the totals for applications and matriculations. The third step computes the percentage distribution of applications (also in the fourth column) by dividing each cell in the first column by the column total. The final step computes the yield rates (column 5) by dividing each row’s matriculations by the associated figure for offers.
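The four stages can be sketched as follows. The data layout and function name are assumptions made for illustration, and the institutional-segment weights are normalized to sum to one, which the text does not state explicitly. The applications ratio is taken as total applications divided by total matriculations.

```python
from typing import Dict, List, Tuple

Triple = Tuple[float, float, float]  # (applications, offers, matriculations)


def pgi_ratios(
    tables: Dict[int, Dict[str, Triple]], weights: Dict[int, float]
) -> Tuple[float, Dict[str, float], Dict[str, float]]:
    """Return (applications ratio, application shares, yield rates).

    tables maps institutional segment -> row label -> per-school figures;
    weights maps institutional segment -> weight derived from the player's
    specifications (hypothetical inputs).
    """
    total_weight = sum(weights[seg] for seg in tables)
    # Stage 1: weighted average of the segment tables, row by row.
    blended: Dict[str, List[float]] = {}
    for seg, table in tables.items():
        w = weights[seg] / total_weight
        for row, triple in table.items():
            acc = blended.setdefault(row, [0.0, 0.0, 0.0])
            for i, value in enumerate(triple):
                acc[i] += w * value
    total_apps = sum(v[0] for v in blended.values())
    total_matrics = sum(v[2] for v in blended.values())
    # Stage 2: overall applications ratio.
    app_ratio = total_apps / total_matrics
    # Stage 3: percentage distribution of applications across rows.
    app_share = {row: v[0] / total_apps for row, v in blended.items()}
    # Stage 4: yield rate per row (matriculations divided by offers).
    yield_rate = {row: (v[2] / v[1] if v[1] else 0.0)
                  for row, v in blended.items()}
    return app_ratio, app_share, yield_rate
```

Rows with no offers are given a yield of zero here to avoid division by zero; sparse cells of that kind may call for the judgmental smoothing discussed in Section 5.3.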

Figure 6: Illustration of Application Ratios and Yield Rates for the PGI
(based on an assumed set of player specifications for the PGI)

The CyberCampus player determines targets for undergraduate matriculations and admissions through the initial conditions or by his or her decisions as the game progresses. The target matriculations number gets multiplied by the overall applications ratio to determine total applications, which are then distributed across student segments and gender-ethnic groups using the calculated percentages. The game’s admissions algorithm produces figures for each student segment and gender-ethnic group. Applying the computed yield rates to these results produces the requisite figures for matriculations. Changes in the PGI initial specification and its evolution over time will change the weights, which will produce interesting variations in the number and distribution of applications, offers, and yields.
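The flow from the player's target to applications and matriculations can be sketched in two small functions. The function names are hypothetical, and the admissions algorithm itself is not modeled: its offer counts are simply taken as an input.

```python
from typing import Dict


def project_applications(
    target_matriculations: float,
    applications_ratio: float,
    application_share: Dict[str, float],
) -> Dict[str, float]:
    """Total applications = target x ratio, spread by the computed shares."""
    total_applications = target_matriculations * applications_ratio
    return {row: total_applications * share
            for row, share in application_share.items()}


def project_matriculations(
    offers: Dict[str, float], yield_rate: Dict[str, float]
) -> Dict[str, float]:
    """Apply each row's yield rate to the offers the admissions step produced."""
    return {row: offers[row] * yield_rate[row] for row in offers}
```

For instance, a target of 500 matriculations with an applications ratio of 8 implies 4,000 applications, which are then divided among the student categories according to their shares.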

7. Family Income Statistics

The undergraduate financial aid algorithm, described in Technical Document 2.1 on Enrollment Management, requires the mean and standard deviation of family income by student segment and gender-ethnic group. These data have been derived from the NELS database, and the results are illustrated in Figure 7.

Figure 7: Average Income and Standard Deviation of Income

Two transformations were applied to the raw data before Figure 7 was calculated. First, the NELS income codes were transformed to dollar amounts, in thousands, by using the mid-points of cells. (The open-ended upper cell, "above $150,000," was transformed to 250.) Then we took the square root of the decoded figures in order to compensate for skewness in the income distribution. Hence the numbers should be interpreted as the mean and standard deviation of the square root of income expressed in thousands of dollars.
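The two transformations can be sketched as follows. The income codes and midpoints below are hypothetical stand-ins for the NELS coding, except for the open-ended upper cell, which is set to 250 as described above. The sketch uses the sample standard deviation and ignores respondent weights; the text does not say which convention was used.

```python
import math
import statistics
from typing import Iterable, Tuple

# Hypothetical income codes mapped to cell midpoints, in thousands of dollars.
# Only the top value (250 for "above $150,000") comes from the text.
MIDPOINTS = {1: 12.5, 2: 30.0, 3: 62.5, 4: 112.5, 5: 250.0}


def sqrt_income_stats(codes: Iterable[int]) -> Tuple[float, float]:
    """Mean and sample standard deviation of sqrt(income in $ thousands)."""
    roots = [math.sqrt(MIDPOINTS[code]) for code in codes]
    return statistics.mean(roots), statistics.stdev(roots)
```

Note that squaring the resulting mean does not recover mean income, since the average of square roots is not the square root of the average.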

We may wish to smooth some of the figures, although rebalancing the student segments may stabilize them somewhat. (Income data from independent sources to help with the smoothing would be welcome.) Stabilizing the data also may allow us to break out the averages by institutional segment, based on student applications, rather than using the overall figure for each student category. (The standard deviations probably would remain pooled.) This would add another interesting element to the game’s representation of the market for students (see Endnote 3).

Endnotes:

  1. The ratings are calculated and stored, and the segmentation decision rule is applied, in '.NELS-stuSegs.Setup'. The files 'NELS-Academics', 'NELS-Extracurricular Activities', and 'NELS-Athletics' must be open when working with the 'Setup' file. Then PASTE-SPECIAL-values the segment assignments column (SCORES: H) to the same column in '.NELS-stuSegs'.
  2. Memory limitations led us to calculate the regressions based on only the first 6013 of the 8018 records in the dataset. This shouldn't change the coefficients materially but it would be good to correct the problem if the system is recomputed.
  3. The results were calculated in '.NELS-stuSegs'. File '.NELS-Applied_Admit_Fin Aid' must be open to run '.NELS-stuSegs'. Follow the "run-macro" instructions above the tables.