Section A:

1.

The levels of measurement used can be factors in

determining what measures are used in the descriptive analysis of the data.

Each measure has its merits and can be used effectively if following the

proper criteria. For example, when working with nominal data, since there are

no clearly defined intervals we cannot obtain a mean that would provide logical

data. However, the mode, which represents the category with the largest volume of samples in the data, is still relevant to nominal data and can give us pertinent

information, in this case which option or category most people in a sample

choose. In order to effectively obtain a median, which is the value that separates

the lower half of the data from the upper half, the data must be ordered in a meaningful way, where one category ranks above another. Thus, the median, which is not useful for nominal data, is a suitable descriptive tool for ordinal data, as a way of

finding the “positional” average of a sample. Going further into interval and

ratio data, the mean of the data now becomes available due to the clearly

defined equal intervals and the ability to do arithmetic on the data. When

categories are equidistant, comparing them numerically becomes possible and

averaging them out yields the mean. The mean is a common tool for measuring an

average in data sets and is crucial in obtaining further data like the standard

deviation, which gives us a more complete analysis. It is important to note that we can use measures besides the defaults for each level of measurement.

If presenting the median for ratio data offers an important insight into a

trend, it should also be mentioned. There are also cases of instability for measures: the mode of a set of data can become unstable when two or more

categories are equally common. When data has a distribution that is

concentrated on both tails and relatively low in the center (a bathtub

distribution) the median becomes susceptible to drastic changes from

proportionally insignificant alterations in the tails. Lastly, the mean is also unstable in the presence of extreme outliers: it can be skewed considerably by a few samples lying unusually far from the rest of the data. To avoid

using unstable means, taking trimmed means or using another measure such as the

median is often the preferred solution. One last important distinction is the

option of sometimes treating ordinal data as interval in order to obtain the

mean. This can be useful, but it only works if the distance between

categories can be estimated to be roughly the same. This is another reason

determining the level of measurement is crucial, as it presents us with which

tools can be used to provide an accurate analysis as well as some alternatives

in cases of instability. When considering measures of dispersion for nominal sets

of data, the index of diversity tells us how likely it is that two cases drawn

at random from a sample will come from different categories. As an example, if

a random survey was taken of a class of 40 about their country of origin and

the IOD was 0.723, we could express it as a percentage and say that there is a 72.3%

chance that two students drawn at random will be from different countries. The

index of qualitative variation is a similar measure for nominal data but it

is scaled to run between 0 and 1. For most small samples of data the two measures will differ, but as samples become larger the two values

begin to correlate more closely. Both measures also do not rank categories

against each other and hence, order is not important when obtaining these measures.
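A minimal Python sketch of these two measures (the survey counts are invented, and the simple with-replacement formula is assumed; textbook variants may differ slightly):

```python
from collections import Counter

def index_of_diversity(cases):
    """Chance that two cases drawn at random fall in different categories."""
    n = len(cases)
    return 1 - sum((c / n) ** 2 for c in Counter(cases).values())

def iqv(cases):
    """Index of qualitative variation: IOD rescaled so that a perfectly
    even split across k categories scores 1."""
    k = len(set(cases))
    return index_of_diversity(cases) * k / (k - 1)

# hypothetical survey of 40 students' countries of origin
countries = ["CA"] * 20 + ["US"] * 10 + ["FR"] * 10
print(round(index_of_diversity(countries), 3))  # 0.625
print(round(iqv(countries), 4))  # 0.9375
```

Note that the IQV is never smaller than the IOD, consistent with the rescaling described above.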

It should also be noted that both IOD and IQV do not take into consideration

the number of categories present; the third measure, entropy, is useful because it codes categories in binary and so keeps track of how many categories are used in total through the number of bits required. Entropy itself measures “the number of bits of information we need, on average, to eliminate uncertainty about where cases are found” (p. 39). The interquartile and interdecile range are similar

methods of measure, typically used for ordinal data, providing ranges either

in quartiles (fourths) or deciles (tenths) to show where samples of data are

concentrated. They also help focus on key areas such as the centre of the data,

without being heavily affected by outliers. The Median Absolute Deviation is

also a sufficient tool when working with ordinal data, as it is resistant to

extreme values and provides a good substitute for standard deviations when they

are unstable. The MAD is a measure of spread around the centre of a distribution. Lastly, the standard deviation is the measure used for interval

and ratio data. It has advantages over other measures: it considers all cases, and it is expressed in meaningful units that can be compared

across samples. The standard deviation is however susceptible to extreme

outliers. Measures of association deal with reducing errors in our predictions,

based on data sets. For example, to predict people's voting preference from ethnicity, we would use lambda. If the result was, say, 0.28, that would mean we could reduce

our error in predicting voting preference by ethnicity by 28%. If we used the

gamma measure and it turned out to be .34, we would then say we can reduce our

errors in predicting the direction of pairs by 34%. The Somers' d measure would give us a similar answer; however, some consider it more accurate because it assesses the independent variable in all cases it might influence. Finally, the Q value tells us the percentage by which we can “reduce our errors in guessing whether pairs favour one form of association (or the other) if we know how the variables are linked” (p. 145).
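The lambda example above can be made concrete with a short sketch (the crosstab counts below are invented for illustration; rows are IV categories, columns are DV categories):

```python
def lambda_pre(table):
    """Goodman-Kruskal lambda from a crosstab of cell counts:
    rows = independent variable, columns = dependent variable."""
    n = sum(sum(row) for row in table)
    # E1: errors made guessing the modal DV category for every case
    col_totals = [sum(col) for col in zip(*table)]
    e1 = n - max(col_totals)
    # E2: errors made guessing the modal DV category within each IV row
    e2 = sum(sum(row) - max(row) for row in table)
    return (e1 - e2) / e1

# hypothetical ethnicity-by-party counts
print(lambda_pre([[30, 10],
                  [10, 30]]))  # 0.5: prediction errors reduced by 50%
```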

2.

PRE measures help us reduce errors by determining the way two variables are linked, which then allows us to predict the outcome of the data more accurately. Understanding the relationships between variables helps us make more correct predictions of their outcomes and thus lowers how many errors we make. If we wanted to predict party voted for by a person's

ethnicity, we could obtain a lambda between the two variables. Hypothetically

if this lambda came out to be 0.43, we could reduce our errors by 43% in

predicting party voted for based on a person’s ethnicity if we knew how the

variables are linked. When using the gamma measure, the focus is on the

concentration of concordant and discordant pairs, and compares them. Gamma

reduces our error in predicting the direction of pairs, if the link between

variables is known. Somers' d is a modified version of gamma, taking into consideration the variable T, “representing the pairs of cases tied on the dependent variable, but not the independent” (p. 137). Lastly, there is the measure Q, which

does not measure concordant or discordant pairs but instead measures whether

pairs favour one type of association or the other. It reduces our errors in the

sense that we can predict which type of association more pairs will favour, if

we know how the two variables are linked.
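As a sketch of the pair-counting behind gamma (the brute-force loop and the (x, y) ordinal scores are my own invention for illustration):

```python
from itertools import combinations

def gamma(pairs):
    """Goodman-Kruskal gamma from (x, y) ordinal scores, one tuple per case."""
    c = d = 0
    for (x1, y1), (x2, y2) in combinations(pairs, 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            c += 1   # concordant: both variables move the same way
        elif s < 0:
            d += 1   # discordant: they move in opposite directions
    return (c - d) / (c + d)  # tied pairs count toward neither

data = [(1, 1), (1, 2), (2, 1), (2, 2), (3, 3)]
print(round(gamma(data), 3))  # 0.667
```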

3.

Conditional tables are tables used in analyzing the

relationships between variables when there may be hidden variables present that

affect the data but are not initially known and need to be tested. Most often they are used to see whether a third variable is linked with the original independent and

dependent variables. As a broad example, imagine a study looking at the link

between alcohol consumption and income. We notice there is a trend between the

two variables, that as alcoholism becomes a greater problem, income goes down.

Now, it may seem reasonable to say people who drink constantly do not earn as much,

but that alone does not tell you why. By incorporating another variable (test

factor) and tables showing its association with the original variables, we may

be able to gain new insight on the relationships between the three, a process

called “specification”. Say, for our example, we started looking at education as our third variable, and found that those who have no education are more prone

to be alcoholics and in turn make less money. This could suggest a spurious

relationship between the variables, both possibly being caused by a lack of

education. Examples like this, where many factors could be hidden require

conditional tables to clearly demonstrate the effect third variables may have

in situations as well as possibly bring to light some causal links. In conditional

tables we see that by controlling for a third variable, we may see a change in how the relationships between variables are represented, and that our picture of the correlation between the variables can be made more accurate by analyzing hidden

variables. In our example, we may find that the better the education someone

has, the higher their income and the lower their chance at consuming high

concentrations of alcohol on average. We may also find a distorted relationship, one where “the original table shows an effect in the opposite direction to what we obtain after controlling for a third variable” (p. 189).
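The logic of conditional tables can be sketched directly (all the counts below are fabricated to make the spurious pattern obvious):

```python
from collections import Counter

def crosstab(cases, row_key, col_key):
    """Cell counts for a two-way table from a list of case dicts."""
    return Counter((case[row_key], case[col_key]) for case in cases)

# hypothetical cases where education drives both drinking and income
data = (
    [{"edu": "low", "drinks": "heavy", "income": "low"}] * 30
    + [{"edu": "low", "drinks": "light", "income": "low"}] * 10
    + [{"edu": "high", "drinks": "heavy", "income": "high"}] * 10
    + [{"edu": "high", "drinks": "light", "income": "high"}] * 30
)

# zero-order table: drinking looks strongly linked to income
print(crosstab(data, "drinks", "income"))

# conditional tables: within each education level the link vanishes,
# suggesting the original relationship is spurious
for level in ("low", "high"):
    subset = [case for case in data if case["edu"] == level]
    print(level, crosstab(subset, "drinks", "income"))
```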

Section B:

1.

Graphs are mainly used as visual representations of

numerical data; they organize and present data in more readily digestible

images as opposed to rows of numbers. Using graphs is advantageous when trying

to show links between variables, trends or skews in curves, or when you want to

compare categories visually side by side. They provide an easy and effective

summation of data that might involve countless samples and allow multiple ways

to present themselves. Commonly used graphs like bar charts or histograms are

very concise and organized when representing categorical data. The three types

of variables shown in bar charts or histograms are discrete variables, which have distinct columns; continuous variables, which are drawn as lines that can take on any value within their range; and discrete-continuous variables, which have distinct categories that follow along an ordered range. In these types of graphs the

mode can usually be found most easily: it is simply the highest bar among the

categories. The median and mean cannot be seen as simply but their general area

can be assumed through the concentration of data in the middle of the graphs.
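A quick numeric sketch of reading these three measures from one batch of values (the values are invented, with a long right tail so the mean is pulled upward):

```python
from statistics import mean, median, mode

# hypothetical positively skewed values: most cases low, a few far right
values = [1, 2, 2, 2, 3, 3, 4, 5, 9, 12]
print(mode(values), median(values), mean(values))  # 2 3.0 4.3
```

The output follows the mode < median < mean ordering typical of a positive skew.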

In single-peaked continuous curves, if there is a positive skew the order of the measures of central tendency is usually mode, median, then mean, whereas if it is negatively skewed, it is the opposite order. The boxplot is

another visual tool used to represent data, it is a representation of the

five-number summary of a data set and contains the minimum, the first quartile,

the median, the third quartile, and the maximum. The box starts at the first quartile and stretches up to the third quartile; it has a thick bold black line through the centre where the median is, and usually has two lines (whiskers) stretching out to the maximum on the upper half and the minimum on the lower half. The boxplot also typically contains

circles on the outer ends signifying outliers. An advantage of boxplots is that

they can be readily compared to other boxplots, similarities or differences can

be spotted easily just by seeing if their medians line up, or minimums and

maximums differ greatly, making it an efficient tool when analyzing multiple

sets of data. Back-to-back histograms are a style of graph made for direct

comparison. Containing two sets of data, they allow us to compare the proportions of

both samples side by side according to the percentage on the left hand side of

the graph, which is very useful when comparing groups like males and females.
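Returning to the boxplot, the five-number summary it draws can be computed directly (a sketch; quartile conventions differ between texts, and the median-of-halves rule is assumed here):

```python
from statistics import median

def five_number_summary(values):
    """Minimum, Q1, median, Q3, maximum using median-of-halves quartiles."""
    s = sorted(values)
    mid = len(s) // 2
    lower = s[:mid]                                 # half below the median
    upper = s[mid + 1:] if len(s) % 2 else s[mid:]  # half above the median
    return s[0], median(lower), median(s), median(upper), s[-1]

print(five_number_summary([1, 3, 4, 6, 7, 8, 10, 12, 15]))
# (1, 3.5, 7, 11.0, 15)
```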

Population pyramids are a similar type of graph, but they usually measure ages

with both sets of data on either side of a center column showing the age

ranges. For ratio or interval level data, scatterplots can be used to show how

closely correlated two variables are, which is the visual counterpart of r. If, for example, a set of data has an r of 1, the variables are perfectly positively correlated, whereas if they have a correlation of -1, the two variables are perfectly negatively correlated (one gets higher, the other lower). There are also cases where r = 0, which simply means there is no

correlation between the two variables. Boxplots can also be used if we want to

show some significant data, but do not need to show every value of the

variables. We can also use line graphs to chart specific relationships between

variables, for example to see the average change in y when x changes by 3. Even

bar charts can be used since they can provide the mean scores with the heights

of the bars. We can also use association plots to display data obtained from

crosstabulations, where importance is placed on cases within cells. Association

tables show lines for heavy cells (shaded), which have observed values greater than expected, and light cells, which have expected values greater than observed. Association tables, unsurprisingly, are used to determine the association between two variables, and more specifically the strength of that association, by using standardized residuals. If a value beyond plus or minus 2 SRs is obtained, then there is evidence of an association between the two variables. Mosaic plots are another

type of visual representation of the association between two variables. Cases are represented by rectangles arranged by rows and columns, whose sizes are proportional to the number of samples in their category (the more samples in a

category, the larger the rectangle). The standardized residuals in mosaic plots

are read through the patterns given in the key. Mosaic plots also present

extreme values in an easily read way, with SRs greater than + or - 2. Lastly, the double decker graph is a plot containing bars whose widths represent the subcategories, while the dependent variable is the scale that measures the height of the bars. The unique feature of the double decker is that it allows for more than one independent variable at the bottom of the graph, showing the changes across categories as different independent variables are considered and highlighting the associations between multiple variables. We can also obtain

more specific data about how variables are related, such as how a unit of change in x affects y. The double decker can show spurious relations or

distorted relations depending on which independent variables are looked at,

which can be advantageous in a situation where one independent variable does

not provide a complete picture of what is presented in the data. Scatterplots

are also used for regression results, with trend lines going through the

centroid. With multiple regression we can obtain the effect of each predictor with the others controlled. In multiple regression, we can even compare

the effects of all the independent variables with the effects of the individual

predictors.
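The r that a scatterplot visualizes can be computed from raw values; a minimal sketch (the short example lists are invented):

```python
from statistics import mean, pstdev

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (pstdev(xs) * pstdev(ys))

print(round(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]), 6))  # 1.0
print(round(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]), 6))  # -1.0
```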

Section C:

(An important clarification: when referring to standard deviations of distributions, I know it is more correct to call them standard errors. I am aware of the difference and hope I will not be penalized for referring to them simply as standard deviations.)

1.

The four common sampling distributions are the normal,

the t, the chi-square, and the f distribution. The normal distribution is the

most commonly used one; it is continuous, unimodal, and symmetric. The standard normal curve has a mean of 0 and a standard deviation of 1, although other normal curves differ. The normal distribution is also mesokurtic, meaning its peak is of moderate height, neither unusually flat nor unusually sharp. The Central Limit Theorem tells us that “any sampling distribution of a mean for any variable with a definable SD, will be normal if the sample size is large enough” (p. 90). This means we can

identify the proportion of samples that lie within certain ranges away from the

mean, assuming that samples are large enough for the sampling distribution to be considered normal and

they also have a definable standard deviation. Using that information we can

determine 68% of all cases lie within 1 standard deviation away from the mean,

or 95% lie within 2 SDs of the mean. For smaller samples we typically use the t distribution, which is also unimodal and symmetric but has heavier tails and a lower peak. However, as sample size increases, t distributions begin to resemble the normal distribution closely. T distributions also have degrees of freedom, which become important in assessing p-values and how significant specific data is. P-values mark the point at which values display significant signs

of association between variables. The standard chi-square distribution (with one degree of freedom) is unimodal and right-skewed, with a peak at 0, a mean of 1, and a variance of 2. The chi-square distribution becomes more normal as its degrees of freedom increase; that is to say, its peak grows, the skew diminishes, and it becomes more symmetric as df goes up. The three

distributions also have another relation to each other, “the t distribution

results from dividing the normal distribution by a chi square” (p. 101). Chi-square

distributions are efficient for two-way crosstabulations because they keep

track of the independent rows and columns used. Finally, the f distribution is continuous and positively skewed, but as the degrees of freedom for both the numerator and denominator become higher, f distributions become more symmetric. The f distribution has two typical uses: the first is significance testing for comparing means in ANOVA, and the second is checking the significance of a set of predictors in regression. ANOVA is usually used to compare the means of subgroups under a null hypothesis that there is no difference. If the means are different enough, the null will be abandoned. The f distribution shows us which results should be treated as significant in ANOVA and regression.
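The 68% and 95% figures quoted for the normal curve can be recovered from the error function; a small sketch:

```python
from math import erf, sqrt

def normal_within(z):
    """Proportion of a normal distribution within z SDs of its mean."""
    return erf(z / sqrt(2))

print(round(normal_within(1), 3))  # 0.683 -> about 68% within 1 SD
print(round(normal_within(2), 3))  # 0.954 -> about 95% within 2 SDs
```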

2.

The Bayesian approach to predicting probabilities is

one that asks the direct question of “how likely is it that the true difference

between groups, or the association between variables, is such-and-such, or lies

in a such-and-such range” (p. 116). This contrasts with the standard model of statistical inference, which starts with a null hypothesis stating there is no difference. The Bayesian method starts out with a set of prior probabilities: data

on how likely certain outcomes are based on past information. These priors are

the first bit of information used in setting up the analysis; then, given the data we obtain ourselves, we build a set of posterior probabilities. Posterior probabilities are the more closely examined of the two, because they incorporate the actual data, which could, for example, swamp out irrelevant personal priors that are not supported by the data.

With the revised probabilities, “an estimate of the likely difference between

groups, or the likely association between variables” (p. 126) can be obtained, as well as a statement of “how far from our best estimates the truth is likely to be” (p. 126). The range we estimate the true value is likely to fall in is known as the credible interval; it refers to the region of a distribution that our real figure lies in, along with how likely that is. For example, if we take the central 90% of our posterior, then we could say that, given the priors, it is 90% likely the true figure lies somewhere in that range. This is unlike the confidence interval used in the

standard approach which would be referring to 90% of all samples lying within

that range. The main methodological difference between the standard approach to inference and the Bayesian method is that the Bayesian approach uses prior probabilities, updating them with data to estimate the association directly, while the standard approach builds a null hypothesis that says there will be no difference and then tries to either reject or retain it.
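The prior-to-posterior updating described here can be sketched with a tiny discrete example (the two hypotheses and every number are invented for illustration):

```python
def posterior(priors, likelihoods):
    """Bayes' rule over discrete hypotheses: prior times likelihood, renormalized."""
    unnorm = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# two hypotheses about a coin: fair, or biased 80% toward heads
priors = [0.5, 0.5]
# likelihood of seeing 3 heads in a row under each hypothesis
likelihoods = [0.5 ** 3, 0.8 ** 3]
print([round(p, 3) for p in posterior(priors, likelihoods)])  # [0.196, 0.804]
```

Even with even priors, the data shift most of the probability onto the biased-coin hypothesis, which is the sense in which the data can swamp the priors.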

Section D:

1.

In regression analysis when the link between an

independent variable (IV) and a dependent variable (DV) is not linear we must analyze

the data and choose which option suits each case best. When dealing with a trend line that flattens out, we can truncate the independent variable where it begins to flatten; the flat categories are collapsed (summed up as a single new category) so that the remaining trend becomes linear. Another case

is when we encounter exponential curves, which arise when the dependent variable rises or falls at a rate proportional to its current level rather than in step with the IV. Some dependent variables grow by a constant percentage as time passes; these curves become linear when we take the logarithm of the DV. Another possible case is one where the DV has a changing slope through different parts of the graph; here a second b value is usually taken at the point where the slope changes, giving our linear equation a separate slope for each segment.
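The logarithm trick for exponential curves can be checked numerically (the 10%-per-step growth series is invented):

```python
from math import log

# hypothetical DV growing by a constant 10% per time step
ys = [100 * 1.1 ** t for t in range(5)]
logs = [log(y) for y in ys]

# raw differences grow, but log differences are constant, so a plot of
# log(y) against t is a straight line
diffs = [round(logs[t + 1] - logs[t], 6) for t in range(4)]
print(diffs)  # [0.09531, 0.09531, 0.09531, 0.09531]
```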

2.

In multiple regression, to estimate the effect of an IV with the others held constant, we typically use software that conducts matrix

manipulations. These manipulations regress each independent variable onto the others to create residuals that are independent of the other independent variables. The residuals represent “the part of the variation in x1 that is independent of X2” (p. 226). We become liable to obtain biased coefficients when we leave out relevant predictors because, if the omitted IV is associated with other IVs in the equation, the portion of their variation that is correlated with the missing variable will be used in predicting y. To prevent a biased result, all

relevant predictors must be included because the residuals that are produced

for the dependent variable are affected by the other independent variables. We

must also try to avoid irrelevant predictors, as they will skew our estimates. An irrelevant predictor's presence reduces the residuals of the predictors it is correlated with, which in turn reduces the amount of information used to predict the dependent variable, thus producing inaccurate

estimates.
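The residualizing step described above can be sketched with two predictors (the data are constructed so y depends on x1 and x2 exactly; all numbers invented):

```python
def fit(xs, ys):
    """Ordinary least squares slope and intercept of y on a single x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return b, my - b * mx

# hypothetical data: y = 2*x1 + 3*x2 exactly, with x1 and x2 correlated
x1 = [1, 2, 3, 4, 5]
x2 = [1, 1, 2, 2, 3]
y = [2 * a + 3 * b for a, b in zip(x1, x2)]

# residualize x1 on x2: the part of x1's variation independent of x2
b12, a12 = fit(x2, x1)
resid = [v - (a12 + b12 * u) for v, u in zip(x1, x2)]

# regressing y on these residuals recovers x1's effect with x2 controlled
b_partial, _ = fit(resid, y)
print(round(b_partial, 6))  # 2.0
```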

3.