Category: Statistics

Culture of Evidence … Does it Exist? Could it Exist??

Perhaps you are like me … when a phrase is used extensively enough, I develop an allergic-type reaction to it.  “Awesome” is such a phrase, though my fellow educators do not use it nearly as much as our students do.  However, we do use ‘culture of evidence’, and I have surely developed a reaction to that one.

Part of my reaction goes back to a prior phrase … “change the culture”, used quite a few years ago to describe the desire to alter other people’s beliefs as well as their behavior.  Education is based on a search for truth, which necessarily implies individual responsibility for one’s beliefs and choices.  Since I don’t work for BuzzFeed or Complete College America, my priority is education in this classic sense.

The phrase “culture of evidence” continues to be used in education, directed at colleges in particular.  One part of this is a good thing, of course … encouraging the use of data to analyze problems.  However, that is not what the phrase means.  It’s not like people say “apply the scientific method to education”; I can get behind that, though we need to remember that a significant portion of our work will remain more artistic and intuitive than scientific.  [Take a look at https://www.innovativeeducators.org/products/assessing-summer-bridge-developing-a-culture-of-evidence-to-support-student-success for example.]

No, this ‘culture of evidence’ is not a support for the scientific method.  Instead, there are two primary components to the idea:

  • Accountability
  • Justification by data

Every job and profession comes with the need for accountability; that’s fine, though accountability is the minor emphasis of ‘culture of evidence’.

The primary idea is the justification by data; take a look at the student affairs professional viewpoint (https://www.naspa.org/publications/books/building-a-culture-of-evidence-in-student-affairs-a-guide-for-leaders-and-p) and the Achieving The Dream perspective (http://achievingthedream.org/focus-areas/culture-of-evidence-inquiry).

All of this writing about “culture of evidence” suggests that the goal is to use statistical methodologies in support of institutional mission.  That gives it a scientific sound, but does it make any sense at all?

First of all, the classic definition of culture (as used in the phrase) speaks to shared patterns:

Culture: the set of shared attitudes, values, goals, and practices that characterizes an institution or organization  (Merriam-Webster online dictionary)

In an educational institution, how many members of the organization will be engaged with the ‘evidence’ as justification, and how are they involved?  The predominant role is one of data collection … providing organizational data points that somebody else will use to justify what the organization wants to justify.  How can we say ‘culture of evidence’ when the shared practice is recording data?  For most people, it’s just part of their job responsibilities … nothing more.

Secondly, what is this ‘evidence’?  There is an implication that measurements are possible for all aspects of the institutional mission.  You’ve seen this — respected institutions are judged as ‘failures’ because the available measurements are negative.  I’m reminded of an old quote … the difference between the importance of measurements and measuring what is important.

There is also the problem of talking about ‘evidence’ without the use of statistical thinking or designs.  As statisticians, we know that ‘statistics’ is used to better understand problems and questions … but the outcome of statistics is frequently that we have more questions to consider.

No, I think this “culture of evidence” phrase describes both an impossible condition and an undesirable goal.  We can’t measure everything, and we can’t all be statisticians.  Nor should we want judgments about the quality of an institution to be reduced to summative measures of a limited set of variables covering a limited range of ‘outputs’ in education.

The ‘culture of evidence’ phrase and its derivatives (‘evidentiary basis’, for example) are used to suggest a scientific practice without any commitment to the scientific method.  As normally practiced, ‘culture of evidence’ often conflicts with the scientific method (it supports pre-determined answers or solutions) and has little to do with institutional culture.

Well, this is what happens when I have an allergic reaction to the written word … I have a need to write about it!

 Join Dev Math Revival on Facebook:

Factors in Student Performance: Improving Research

Our data work and our research in collegiate mathematics education tend to be simple in design and ambiguous in results.  We often see good or great initial results with a project, only to see regression toward the mean over time (or worse).  I’d like to propose a more complete analysis of the problem space.

The typical data collection or research design involves measuring student characteristics … test scores, HS GPA, prior college work, grades in math classes, etc.  For classical laboratory research, this would be equivalent to measuring the subjects without measuring the treatment effects directly.

So, think about measurements for our ‘treatments’.  If we are looking into the effectiveness of math courses, the treatments are the net results of the course and the delivery of that course.  Since we often disaggregate the data by course, we at least ‘control’ for course effects.  However, we are not very sophisticated in measuring the delivery of the course, in spite of the fact that we have data available to provide some level of measurement.

As an example, we offer many sections of pre-calculus at my college.  Over a period of 4 years, there might be 20 distinct faculty who teach this course.  A few of these faculty only teach one section in one semester; however, the more typical situation is that a faculty member routinely teaches the same course … and develops a relatively consistent delivery treatment.

We often presume (implicitly) that the course outcomes students experience are relatively stable across instructor treatment.  This presumption is easily disproved, and easily compensated for.
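As a sketch of how a department might estimate these instructor treatment measures, the snippet below computes each instructor's pass rate and the enrollment-weighted course mean from pass/fail records. The record layout and the instructor labels are hypothetical, not taken from the post's data.

```python
# Sketch: estimating each instructor's 'treatment measure' as a pass rate,
# plus the enrollment-weighted course mean.  Data are illustrative only.
from collections import defaultdict

# (instructor, passed) pairs; 1 = pass (2.0/C or better), 0 = otherwise
records = [
    ("A", 1), ("A", 1), ("A", 0), ("A", 1),   # instructor A: 3/4 pass
    ("B", 1), ("B", 0), ("B", 0), ("B", 0),   # instructor B: 1/4 pass
    ("C", 1), ("C", 1),                        # instructor C: 2/2 pass
]

totals = defaultdict(lambda: [0, 0])           # instructor -> [passes, enrollments]
for instructor, passed in records:
    totals[instructor][0] += passed
    totals[instructor][1] += 1

pass_rates = {i: p / n for i, (p, n) in totals.items()}
# Weighting by enrollment is just the overall pass rate across all records
course_mean = sum(passed for _, passed in records) / len(records)

print(pass_rates)    # e.g. {'A': 0.75, 'B': 0.25, 'C': 1.0}
print(course_mean)   # 0.6
```

With real registrar data, the same grouping would run over several semesters so each instructor's rate reflects a stable delivery pattern rather than one section.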

Here is a typical graph of instructor variation in treatment within one course:

[Figure: pass rates by instructor for one course, with the weighted course mean shown as a horizontal line]
We have pass rates ranging from about 40% to about 90%, with the weighted course mean represented by the horizontal line at about 65%.  As a statistician, I am not viewing either extreme as good or bad (as a mathematician, I might view both as ‘bad’); I am viewing these pass rates as a measure of the instructor treatment in this course.  Ideally, we would have more than one treatment measure, but this one measure (instructor pass rate) is a good place to start for practitioner ‘research’.  In analyzing student results, the statistical issue is:

Does a group of students (identified by some characteristic) experience results which are significantly different from the treatment measure as estimated by the instructor pass rate?

The data set then includes a treatment measure as well as the measurements about students.  In regression, we include this ‘instructor pass rate’ as a variable.  When there is substantial variation in instructor treatment measures, that variable is often the strongest correlate with success.  If we attempt to measure student results without controlling for this treatment, we can report false positives or false negatives due to this confounding variable.

Another tool, then, is to compute the ‘gain’ for each student.  We use the typical binary coding (1 = pass with 2.0/C; 0 = otherwise), but then subtract the instructor treatment measure.  Examples:

  • Student passes, instructor pass rate = .64 … gain = 1-.64 = .36
  • Student does not pass, instructor pass rate = .64 … gain = 0-.64 = -.64
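The two bullet examples above can be expressed as a one-line computation; this is a minimal sketch (the rounding just avoids floating-point noise in the printed values):

```python
# Sketch: a student's 'gain' is the pass indicator minus the instructor's
# treatment measure (that instructor's pass rate).
def gain(passed: int, instructor_pass_rate: float) -> float:
    """passed is coded 1 (2.0/C or better) or 0 (otherwise)."""
    return round(passed - instructor_pass_rate, 4)

print(gain(1, 0.64))   # 0.36  -- student passes a section that passes 64%
print(gain(0, 0.64))   # -0.64 -- student does not pass the same section
```

A positive gain means the student did better than the section's typical result; a negative gain means worse. Averaging gains over a group of students gives a treatment-adjusted success measure for that group.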

When we analyze something like placement test scores versus success, we can graph this gain by the test score:

[Figure: mean ‘gain’ by ACT Math score, with confidence intervals]
This ‘gain’ value for each score shows that there is no significant change in student results until the ACT Math score reaches 26 (well above the cutoff of 22).   This graph is from Minitab, which does not report the n values for each group; as you’d expect, the large confidence interval for a score of 28 is due to the small n (6 in this case).
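The same gain-by-score summary can be reproduced outside Minitab. This sketch groups gains by placement score and attaches a normal-approximation 95% interval to each group mean; the scores and gain values below are made-up illustrations, not the post's data.

```python
# Sketch: mean 'gain' by placement score, with a rough 95% interval per group.
# The (score, gain) pairs are illustrative only.
from collections import defaultdict
from math import sqrt

data = [(22, -0.64), (22, 0.36), (22, -0.64), (24, 0.36), (24, -0.64),
        (26, 0.36), (26, 0.36), (26, 0.36), (26, -0.64)]

groups = defaultdict(list)
for score, g in data:
    groups[score].append(g)

for score in sorted(groups):
    gains = groups[score]
    n = len(gains)
    mean = sum(gains) / n
    # sample standard deviation, then a normal-approximation interval
    sd = sqrt(sum((g - mean) ** 2 for g in gains) / (n - 1)) if n > 1 else 0.0
    half = 1.96 * sd / sqrt(n)
    print(f"score {score}: n={n}, mean gain={mean:.2f} +/- {half:.2f}")
```

Unlike the Minitab output, this approach also reports n for each group, so a wide interval (like the score-28 group above) is immediately explained by its small sample.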

That conclusion is hidden if we look only at the pass rate, instead of the ‘gain’.  This graph shows an apparent ‘decreased’ outcome for scores of 24 & 25 … which have an equal value in the ‘gain’ graph above:

[Figure: pass rate by ACT Math score]
The main point of this post is not how our pre-calculus course is doing, or how good our faculty are.  The issue is ‘treatment measures’ separate from student measures.  One of the primary weaknesses of educational research is that we generally do not control for treatments when comparing subjects; that is a fundamental defect which needs to be corrected before we can have stable research results which can help practitioners.

This is one of the reasons why we should not trust the ‘results’ reported by change agents such as Complete College America, or Jobs For the Future, or even the Community College Research Center.  Not only do treatment measures vary by instructor within one institution; I am pretty sure they also vary across institutions and regions.  Unless we can show that there is no significant variation in treatment results, there is no way to trust ‘results’ which reach conclusions based only on student measures.


Statistics: No Box-and-Whiskers; A Better Histogram

Many of you know that I have ‘been around’ for a long time.  My first statistics course was around 1970, and I started teaching some statistics in 1973.  I’ve had some concerns about a tool invented about that time (box and whisker plots), and want to propose a replacement graphic.

Here are two box & whisker plots (done in horizontal format, which I prefer):

[Figures: box-and-whisker plots of wait times and of HDL values]
There are two basic flaws in the box & whisker display:

  1. The display implies information about variation, when the underlying summary (quartiles) does not contain it.
  2. The display requires the reader to invert the visual relationship: a larger ‘box’ means a smaller density, and a smaller ‘box’ means a larger density.
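Flaw 1 is easy to demonstrate: two data sets can share an identical five-number summary (and hence identical basic box plots) while having very different shapes. This is a constructed illustration, not data from the post:

```python
# Sketch of flaw 1: identical five-number summaries, very different shapes.
import numpy as np

# The values at the min, quartile, median, and max positions are shared;
# the remaining values cluster just above each anchor in `a`, and just
# below the next anchor in `b`.
a = np.array([0, 1, 2, 3, 4, 25, 26, 27, 28, 29, 50,
              51, 52, 53, 54, 75, 76, 77, 78, 79, 100])
b = np.array([0, 21, 22, 23, 24, 25, 46, 47, 48, 49, 50,
              71, 72, 73, 74, 75, 96, 97, 98, 99, 100])

summary = [0, 25, 50, 75, 100]  # percentiles: min, Q1, median, Q3, max
print(np.percentile(a, summary))   # identical summaries ...
print(np.percentile(b, summary))   # ... so identical box plots

counts_a, edges = np.histogram(a, bins=8, range=(0, 100))
counts_b, _ = np.histogram(b, bins=8, range=(0, 100))
print(counts_a)   # e.g. [5 0 5 0 5 0 5 1]
print(counts_b)   # e.g. [1 4 1 4 1 4 1 5]
```

A box plot would present these two data sets as the same picture; the histogram counts show the densities are nearly opposite.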

Here are the underlying data sets, presented in histogram format (which is not perfect, but avoids both of those issues):

[Figure: histogram of HDL values]

[Figure: histogram of wait times]

Some of the problems with box plots are well documented; a number of more sophisticated displays have been used.  See http://vita.had.co.nz/papers/boxplots.pdf. These better displays are seldom used, especially in introductory statistics courses.

The main attraction of the box plot was that it provided an easy visual display of five numbers: minimum, first quartile, median, third quartile, maximum.  The problem with creating a visual display of such simple summary data is that it will always imply more information than exists in the summary.  We’ve got a solution at hand, much simpler than the alternatives that have been used (which are based on maintaining the box concept):

Replace basic box-and-whisker plots with a “quartiled histogram”.

A quartiled histogram adds the quartile markers to a normal histogram display.  Here are two examples; compare these to the box plots above:

[Figure: quartiled histogram of HDL values]

[Figure: quartiled histogram of wait times]

The quartiled histogram combines the basic histogram with a simplified cumulative frequency chart — without losing the independent information of each category.
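The ingredients of a quartiled histogram are just ordinary histogram counts plus the three quartile positions; any plotting tool can then draw the quartiles as vertical markers over the bars (for example, matplotlib's axvline on top of a hist). This sketch computes both pieces with made-up wait-time data:

```python
# Sketch: the two ingredients of a 'quartiled histogram' --
# histogram bin counts, plus quartile positions to overlay as markers.
# Data are illustrative only.
import numpy as np

data = np.array([3, 5, 7, 8, 9, 10, 11, 12, 14, 15, 18, 22])

counts, edges = np.histogram(data, bins=5)
q1, median, q3 = np.percentile(data, [25, 50, 75])

print("bin edges:", edges)
print("bin counts:", counts)
print("quartile markers:", q1, median, q3)
```

Because the bars carry the density information directly, the reader no longer has to invert a box's size, and the quartile markers still deliver the five-number summary that made box plots attractive.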

Perhaps a basic box and whisker plot works when the audience is sophisticated in understanding statistics (researchers, statisticians, etc.).  Because of known perceptual weaknesses, I think we would be better served either to not cover box & whisker plots in intro classes, or to cover them briefly with a caution that they are to be avoided in favor of more sophisticated displays.
