A colleague related the following story: He was taking notes at a meeting that was attended by a fairly large group of people (about 20). As each person made a comment or presented information, he recorded the two-letter initials of the person who spoke. After the meeting was over, he was surprised to discover that all of the initials of the people in the room were unique! Nowhere in his notes did he write "JS said..." and later wonder "Was that Jim Smith or Joyce Simpson?"

My colleague asked, "If 20 random people are in a room, do they usually have different initials or is it common for two people to share a pair of initials?" In other words, was his experience typical or a rare occurrence?

The Distribution of Initials at a Large US Software Company

In order to answer the question, it is necessary to know the distribution of initials in his workplace.

Clearly, the distribution of initials depends on the population of the people in the workplace. In some cultures, names that begin with X or Q are rare, whereas in other cultures names that begin with those letters (when phonetically translated into English) are more common.

SAS is a large US software company with a diverse base of employees, so I decided to download the names of 4,502 employees who work with me in Cary, NC, and write a DATA step program that extracts the first and last initials of each name.
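The DATA step itself is not shown in this article. A minimal sketch of how such a step might look follows. The input data set Names, the variables FirstName and LastName, and the three sample rows are assumptions for illustration; only the names Employees, I1, and I2 come from the statements used in this article.

```sas
/* Hypothetical input: one row per employee (sample rows for illustration) */
data Names;
   length FirstName LastName $20;
   input FirstName $ LastName $;
   datalines;
Jim Goodnight
John Sall
Rick Wicklin
;

/* Extract the first and last initials as one-character variables */
data Employees;
   set Names;
   length I1 I2 $1;
   I1 = upcase(substr(left(FirstName), 1, 1));   /* first initial */
   I2 = upcase(substr(left(LastName), 1, 1));    /* last initial  */
   keep I1 I2;
run;
```

The LEFT and UPCASE calls guard against leading blanks and mixed case in the raw names. Real name data might need more care, for example for hyphenated or multi-part surnames.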

You can use the FREQ procedure to compute the frequencies of the first initial (I1), the last initial (I2), and the frequency of the initials taken as a pair. The following statements output the frequency of the initials in decreasing order:

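proc freq data=Employees order=freq;
   tables I1 / out=I1Freq;          /* first initials */
   tables I2 / out=I2Freq;          /* last initials  */
   /* SPARSE writes zero-count combinations to the output data set */
   tables I1*I2 / out=InitialFreq missing sparse noprint;
run;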

As an example, I can display the relative frequency for my initials (RW) as well as the initials of the SAS cofounders, Jim Goodnight and John Sall:

data SASUSER.InitialFreq;
   set InitialFreq;
   Initials = I1 || I2;   /* concatenate first and last initials */
run;

proc print data=SASUSER.InitialFreq(where=(Initials="RW" | Initials="JG" | Initials="JS"));
run;

The initials "JS" are the most frequent initials in my workplace, with 61 employees (1.35%) having those initials. The initials "JG" are also fairly common; they are the 10th most popular initials. My initials are less common and are shared by only 0.4% of my colleagues.

If you want to conduct your own analysis, you can download a comma-separated file that contains the initials and frequencies.


You can use PROC SGPLOT to display bar charts for the first and last initials.
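The plotting statements are not shown in this article. One way to draw such charts, sketched here under the assumption that the I1Freq and I2Freq data sets from the PROC FREQ step exist and contain the Percent variable that PROC FREQ writes, is the following. The titles and axis labels are illustrative choices.

```sas
/* Bar chart of first initials, most common on the left */
title "Distribution of First Initials";
proc sgplot data=I1Freq;
   vbar I1 / response=Percent categoryorder=respdesc;
   xaxis display=(nolabel);
   yaxis label="Percent of Employees";
run;

/* Bar chart of last initials, most common on the left */
title "Distribution of Last Initials";
proc sgplot data=I2Freq;
   vbar I2 / response=Percent categoryorder=respdesc;
   xaxis display=(nolabel);
   yaxis label="Percent of Employees";
run;
title;   /* clear the title */
```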

The bar charts show that J, M, S, D, and C are the most common initials for first names, whereas S, B, H, M, and C are the most common initials for last names.

In contrast, U, Q, and X are initials that do not appear often for either first or last names. For first initials, the 10 least popular initials cumulatively occur less than 5% of the time. For last initials, the 10 least popular initials cumulatively occur about 8% of the time.

Clearly, the distribution of initials is far from uniform. However, for the note-taker, the important issue is the distribution of pairs of initials.

The Distribution of Two-Letter Initials

By using the PROC FREQ output, you can analyze the distribution at my workplace of the frequencies of the 26^2 = 676 pairs of initials:

More than 30% of the frequencies are zero. For example, there is no one at my workplace with initials YV, XU, or QX.

If you disregard the initials that do not appear, then the quantiles of the remaining observations (expressed as percentages of the 4,502 employees) are as follows: the lower quartile is 0.044, the median is 0.133, and the upper quartile is 0.333.

Three pairs are more prevalent than the others. The initials JM, JB, and JS each occur more than 1% of the time.
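As a sketch of how quantiles like these could be computed, the following step summarizes the Percent variable for the pairs that occur at least once. It assumes the InitialFreq data set from the PROC FREQ step, which contains the Count and Percent variables; the output data set name PctQuantiles is hypothetical.

```sas
/* Quartiles of the percentages, excluding pairs that never occur */
proc means data=InitialFreq(where=(Count > 0)) Q1 Median Q3 Max maxdec=3;
   var Percent;
   output out=PctQuantiles Q1=Q1 Median=Median Q3=Q3 Max=Max;
run;
```

Filtering with a WHERE= data set option leaves InitialFreq itself unchanged, so the zero-count pairs remain available for other summaries.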

The distribution of two-letter initials is summarized by the following box plot:


Visualizing the Proportions of Two-Letter Initials


With the help of a SAS Global Forum paper that shows how to use PROC SGPLOT to create a heat map, I created a plot that shows the distribution of two-letter initials in my workplace.

When I create a heat map, I often use the quartiles of the response variable to color the cells in the heat map. For these data, I used five colors: white to indicate pairs of initials that are not represented at my workplace, and a blue-to-red color scheme (obtained from colorbrewer.org) to indicate the quartiles of the remaining pairs. Blue indicates pairs of initials that are uncommon, and red indicates pairs that occur frequently.

In terms of counts, blue indicates pairs of initials that are shared by either one or two individuals, and red indicates 18 or more individuals.
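The quartile-binning idea can be sketched as follows. This is not the program used for the plot in this article: it substitutes PROC RANK for the binning and the HEATMAPPARM statement (available in PROC SGPLOT in recent SAS 9.4 releases) for the plot, and it uses the default group colors rather than the colorbrewer.org scheme. The names Ranked and Quartile are hypothetical; InitialFreq, I1, I2, Count, and Percent come from the PROC FREQ step.

```sas
/* Assign each observed pair to a quartile group (0 = rarest, 3 = most common). */
/* Pairs with Count = 0 are excluded, so their cells are simply not drawn.      */
proc rank data=InitialFreq(where=(Count > 0)) out=Ranked groups=4;
   var Percent;
   ranks Quartile;
run;

/* One cell per pair of initials, colored by quartile group */
proc sgplot data=Ranked;
   heatmapparm x=I1 y=I2 colorgroup=Quartile;
   xaxis label="First Initial";
   yaxis label="Last Initial";
run;
```

Because many pairs share the same small count, the four groups that PROC RANK forms will not be exactly equal in size.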

The heat map shows several interesting features of the distribution of pairs of initials:

Although W and N are not uncommon first initials (1.7% and 1.4%, respectively) and D and F are not uncommon last initials (5.0% and 3.2%, respectively), there is no one at my workplace with the initials ND or WF.

There are 89 individuals at my workplace who have a unique pair of initials, including YX, XX, and QZ.

You can download the SAS program that is used to produce the analysis in this article.

The Probability of Matching Initials

Computing the probability that a group of people have similar characteristics is called a "birthday-matching problem" because the most famous example is "If there are N people in a room, what is the chance that two of them have the same birthday?"

In Chapter 13 of my book, Statistical Programming with SAS/IML Software, I examine the birthday-matching problem. I review the well-known solution under the usual assumption that birthdays are uniformly distributed throughout the year, but then go on to compare that solution to the more realistic case in which birthdays are distributed in a manner that is consistent with empirical birth data from the National Center for Health Statistics (NCHS).
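For readers who have not seen the uniform-case solution, here is a minimal sketch. It computes the probability that no two of N people match by multiplying the chances that each successive person avoids the categories already taken, and then subtracts that product from 1. The data set name MatchProb and the variable names are hypothetical. With N = 23 and d = 365, the result is the familiar value of just over one half (about 0.507).

```sas
data MatchProb;
   N = 23;          /* number of people in the room              */
   d = 365;         /* number of equally likely categories       */
   pNoMatch = 1;
   do k = 0 to N-1;
      /* the next person must avoid the k categories already taken */
      pNoMatch = pNoMatch * (d - k) / d;
   end;
   pMatch = 1 - pNoMatch;   /* probability of at least one match */
   drop k;
run;

proc print data=MatchProb noobs;
run;
```

Changing d to 676 and N to 20 gives the corresponding answer under the simplifying assumption that all pairs of initials are equally likely, which the data in this article show is not the case.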

Obviously, you can do a similar analysis for the "initial-matching problem." Specifically, you can use the actual distribution of initials at SAS to investigate the question, "What is the chance that two people in a room of 20 randomly chosen SAS employees share initials?" Come back next Wednesday to find out the answer!