Vocabulary and Methods for describing data
Statistics- The Science of collecting, analyzing, and drawing conclusions from data.
Population- the entire collection of individuals or objects about which information is desired
Census- performed to gather information about the entire population
Sample- a subset of the population selected for study in some defined manner (ideally at random)
Descriptive statistics-the methods of organizing and summarizing data
Variable-any characteristic that may change from one individual to another
Data-the actual observations you collected from your sample or population
Types of variables
Categorical variable(Qualitative)-data is sorted into categories that have no meaningful numerical value(ex. hair color, eye color, race)
Numerical variable(Quantitative)-observations or measurements take on meaningful numeric values; it makes sense to average these numbers together; two types: discrete and continuous(ex. height, number of followers)
Discrete numerical variable-data can only take on specific values in the domain of the variable; usually counts of items(ex. number of pets you own)
Continuous numerical variable-data can take on any value in the domain of the variable; usually measurements of something(ex. time until phone battery dies)
Univariate-data that describes ONE characteristic of the population
Bivariate-data that describes TWO characteristics of the population
Multivariate-data that describes MORE THAN TWO characteristics of the population
Types of Categorical Graphs
Bar Graphs
-Used for categorical variable
-Bars DO NOT touch
-Categorical variable usually on x-axis and frequency is on the y-axis
-Label both axes and include a title
-To describe: Comment on which occurred the most often and least often
-You can make a double bar graph or segmented bar graph for bivariate categorical data sets
Pie Graph
-Used for categorical data
-Unless you are making one for a big project or official report, it is acceptable to approximate the size of each section
-Label each section, including the percentage, or include a key if there isn't enough space to label directly on the graph
-To describe: Comment on which occurred the most often and least often
Types of Numerical Graphs
DotPlot
-Used with numerical data (either discrete or continuous)
-Made by putting dots on a number line
-Put a title and label on the x-axis(No y-axis)
-You can make comparative dotplots by using the same axis for multiple groups
Types of Distributions
Symmetrical
-Refers to data in which both sides are approximately the same when the graph is folded vertically down the middle
Bell Curve
-a special type of symmetrical curve
-a very important distribution that occurs often in the real world
Uniform
-Rectangular shaped
-A special type of symmetrical curve
-refers to data in which every value has approximately equal frequency
Skewed Left/Right
-Refers to data in which one side is longer than the other
-The direction of the skew is on the side with the longer tail
-The mean is pulled in the direction of the skew
-Left skew = negative skew
-Right skew = positive skew
Bimodal
-Distribution with two distinct peaks
-Usually caused by two distinctly different averages within the same population
-Could be symmetrical by chance, but doesn't need to be
SOCS
Shape
-Symmetrical, Uniform, Normal, Skewed Left/Right, Bimodal
Outlier
-Outliers are values that are far away from the rest of the data
-If there isn't one make sure to state it
-State anything unusual such as gaps or clusters
Center
-Describe the mean or median of the data
Spread
-Describe the IQR, Range or Standard deviation
+CONTEXT
-Reference the specific problem situation using complete sentences
More Graphs for Numerical Data
Stem Plots
-Used with univariate, numerical data
-You must have a key so that we know how to read the numbers
-You can split stems when you have a long list of leaves(If you split one you have to split them all)
-You can make a comparative stem plot to compare two different groups
Histograms
-Used with Numerical data
-Bars touch on Histograms
-Two types Discrete and Continuous
-Discrete(Bars are centered over discrete values)
-Continuous(Bars cover a class or interval of values)
-For comparative histograms(use two separate graphs with the same scale on the horizontal axis)
Measures of Central tendency & variability
Some Vocab
Parameter-a fixed value about a population; typically unknown
Statistic-a value calculated from a sample that is used to estimate the parameter
Measures of central tendency
Median-the middle of the data; 50th percentile; the (n+1)/2 term when the data points are listed in order from least to greatest
Mean-the arithmetic average
Mode-the observation that occurs the most often; if all values only occur once there is no mode; there can be more than one mode
Resistant-when a statistic is not affected by outliers(it resists the effect of outliers)
How does the shape of the graph affect the mean and median?
-In a symmetrical distribution the mean is approximately equal to the median
-The mean is pulled in the direction of the skew, whether left or right
-The mean is not always a good representation of data because it can be pulled by outliers
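A small sketch of these ideas, using Python's standard statistics module. The data values are made up for illustration; they are chosen so that one large value creates a right skew.

```python
import statistics

# Hypothetical right-skewed data: most values are small, one is very large.
incomes = [30, 32, 35, 36, 38, 40, 250]

mean_income = statistics.mean(incomes)      # pulled toward the long right tail
median_income = statistics.median(incomes)  # the (n+1)/2 = 4th ordered value

# A data set can have more than one mode.
modes = statistics.multimode([1, 2, 2, 3, 3])

print("mean:", mean_income)
print("median:", median_income)
print("modes:", modes)
```

Here the median stays near the bulk of the data while the mean is dragged toward the outlier, which is why the median is called resistant.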
Variability-how spread out the data is
-variability is important because it allows us to distinguish between usual and unusual values
-In some situations we want more variability and in others we want less
Measures of variability
-Range (Non-resistant)
-IQR (Q3-Q1) (Resistant)
-Standard deviation (Non-resistant)
-Variance (Non-resistant)
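As a rough illustration, the four measures above could be computed like this. The data are made up, and the quartile method is an assumption: statistics.quantiles uses its default "exclusive" method here, and calculators or other software may use slightly different quartile conventions.

```python
import statistics

data = [2, 4, 4, 5, 7, 9, 11]  # hypothetical sample

data_range = max(data) - min(data)

# quantiles(..., n=4) returns the three cut points Q1, median, Q3.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

sample_sd = statistics.stdev(data)      # sample SD (divides by n - 1)
sample_var = statistics.variance(data)  # sample variance = SD squared

print("range:", data_range)
print("IQR:", iqr)
print("SD:", sample_sd)
print("variance:", sample_var)
```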
Linear transformation
-When adding a constant to or subtracting a constant from every value, measures of central tendency(mean and median) are affected, but measures of variability (SD, IQR, Range) are not affected
-When multiplying or dividing every value by a constant, measures of central tendency and measures of variability are all affected
-Formula: y = a + bx
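A quick check of the rule y = a + bx on made-up numbers. The values of a, b, and x are assumptions chosen only for illustration.

```python
import statistics

x = [10, 12, 15, 18, 20]  # hypothetical data
a, b = 5, 2               # transformation y = a + b*x
y = [a + b * xi for xi in x]

mean_x, sd_x = statistics.mean(x), statistics.stdev(x)
mean_y, sd_y = statistics.mean(y), statistics.stdev(y)

# The mean is shifted by a and scaled by b; the SD is only scaled by |b|.
print("mean of x:", mean_x, "mean of y:", mean_y)
print("SD of x:", sd_x, "SD of y:", sd_y)
```

Adding a moves the center but not the spread, while multiplying by b changes both.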
Linear Combination Rule
-To find the mean for the sum(or difference), add(or subtract) the two means
-To find the standard deviation of the sum (or difference) of two independent variables, ALWAYS add the variances, then take the square root
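A minimal worked example of the combination rule. The means and standard deviations are invented, and the two variables are assumed to be independent.

```python
import math

# Hypothetical independent random variables X and Y.
mean_x, sd_x = 50, 3
mean_y, sd_y = 20, 4

mean_sum = mean_x + mean_y
mean_diff = mean_x - mean_y

# Variances add for both the sum and the difference.
sd_sum = math.sqrt(sd_x ** 2 + sd_y ** 2)
sd_diff = math.sqrt(sd_x ** 2 + sd_y ** 2)

print("mean of X + Y:", mean_sum, "SD:", sd_sum)
print("mean of X - Y:", mean_diff, "SD:", sd_diff)
```

Note that the standard deviations themselves are never added or subtracted directly.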
How do I know the difference between each rule?
-Linear transformations are when you add, subtract, multiply, or divide by a constant to all numbers in the data set
-Linear combinations are when both numbers are variables(both numbers can change)
Boxplots
Advantages of Boxplots
-Easy to make
-Displays Outliers
-Construction is not subjective
-Useful for comparative displays
Disadvantages of boxplots
-Does not retain individual observations
-Is not good for small data sets (n < 10)
Important terms / formulas
-Interquartile Range (IQR)(the range of the middle 50% of the data)
-Q1 = First quartile = 25th percentile
-Q3 = Third quartile = 75th percentile
-IQR = Q3 - Q1
-Lower Fence = Q1 - 1.5(IQR)
-Upper Fence = Q3 + 1.5(IQR)
-Five-number summary = Min, Q1, Median, Q3, Max
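The formulas above can be sketched in code as follows. The data set is hypothetical, and the quartiles come from statistics.quantiles with its default method, which is one of several conventions and may differ slightly from a calculator.

```python
import statistics

data = [1, 3, 4, 5, 5, 6, 7, 8, 9, 10, 30]  # hypothetical data

q1, median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Any value outside the fences is flagged as an outlier.
outliers = [x for x in data if x < lower_fence or x > upper_fence]

five_number_summary = (min(data), q1, median, q3, max(data))

print("five-number summary:", five_number_summary)
print("fences:", lower_fence, upper_fence)
print("outliers:", outliers)
```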
Z- scores and the empirical rule
Z-scores
-Standardized score
-has mean = 0 & SD = 1
-Z is the number of Standard Deviations x is away from the mean
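A short sketch of standardizing, using z = (x - mean) / SD. The scores are made up, and the population SD (pstdev) is used here as an assumption so that the z-scores come out with SD exactly 1.

```python
import statistics

scores = [70, 75, 80, 85, 90]  # hypothetical test scores

mean = statistics.mean(scores)
sd = statistics.pstdev(scores)  # population SD of this data set

# Each z-score is the number of SDs the score lies from the mean.
z_scores = [(x - mean) / sd for x in scores]

print("z-scores:", z_scores)
print("mean of z:", statistics.mean(z_scores))
print("SD of z:", statistics.pstdev(z_scores))
```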
Empirical rule
-Can only be used with normal curve
-Approximately 68% of the observations are within 1 SD of the mean
-Approximately 95% of the observations are within 2 SD of the mean
-Approximately 99.7% of the observations are within 3 SD of the mean
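These percentages can be checked against the standard normal curve. This sketch uses NormalDist from Python's statistics module; the helper name proportion_within is made up for this example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def proportion_within(k):
    """Area under the normal curve within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

within_1, within_2, within_3 = (proportion_within(k) for k in (1, 2, 3))

print("within 1 SD:", within_1)
print("within 2 SD:", within_2)
print("within 3 SD:", within_3)
```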
Normal Curve
-Bell shaped symmetrical curve
-Transition points between cupping upward & downward occur at mean + SD and mean - SD
-As the Standard deviation increases, the curve flattens and spreads
-As the Standard deviation decreases, the curve gets taller and thinner
Percentile-the percent of the population that is less than(or equal to) that value.
Cumulative Relative Frequency Graph(Ogives)
Cumulative relative frequency plots allow us to answer questions about percentiles
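A rough sketch of the numbers behind an ogive. The frequency table is invented; each cumulative relative frequency is the proportion of observations at or below the upper bound of its class.

```python
from itertools import accumulate

# Hypothetical frequency table: upper bound of each class and its count.
upper_bounds = [10, 20, 30, 40, 50]
counts = [4, 6, 10, 6, 4]

total = sum(counts)
cumulative_rel_freq = [c / total for c in accumulate(counts)]

# Each pair is a point on the ogive: (value, proportion at or below it).
for bound, crf in zip(upper_bounds, cumulative_rel_freq):
    print(bound, round(crf, 3))
```

Reading the third pair, a value of 30 sits at roughly the 67th percentile of this made-up data set.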
Linear Regression & New Graphs
Scatterplots
-Described by form (Linear or Non-Linear), direction (Positive or Negative), and strength (Weak, Moderate, or Strong).
-Sketches should include a maximum and minimum value for the X and Y axis.
Correlation Coefficient
-Also known as the (r) value
-It is a quantitative assessment of the strength & direction of the linear relationship between bivariate, quantitative data.
-The strength is measured on a scale of -1 to 1. As a rough guideline, a value in (0, .5) or (-.5, 0) is considered weak, (.5, .8) or (-.8, -.5) is considered moderate, and anything from .8 to 1 or -.8 to -1 is considered strong. A value of zero means there is no linear relationship.
-It has a formula, but it should rarely have to be used by hand because the calculator computes it automatically.
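For reference, here is a minimal sketch of how r could be computed by hand. It assumes the common z-score form r = (1/(n-1)) * sum(z_x * z_y), which is not stated in these notes; the data are made up.

```python
import statistics

x = [1, 2, 3, 4, 5]  # hypothetical explanatory values
y = [2, 4, 5, 4, 5]  # hypothetical response values

n = len(x)
mean_x, mean_y = statistics.mean(x), statistics.mean(y)
sd_x, sd_y = statistics.stdev(x), statistics.stdev(y)

# Multiply the z-scores of each (x, y) pair, add them up, divide by n - 1.
r = sum(
    ((xi - mean_x) / sd_x) * ((yi - mean_y) / sd_y)
    for xi, yi in zip(x, y)
) / (n - 1)

print("r =", r)
```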
LSRL
-Used for bivariate, numerical data
-The line that gives the best fit for the data set
-Always define your x and y variables
-We use X to predict Y
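A minimal sketch of fitting the LSRL y(hat) = a + bx. It assumes the standard least-squares formulas b = sum((x - x(bar))(y - y(bar))) / sum((x - x(bar))^2) and a = y(bar) - b*x(bar), which these notes do not state; the data and the function name predict are made up.

```python
x = [1, 2, 3, 4, 5]  # hypothetical explanatory values
y = [2, 4, 5, 4, 5]  # hypothetical response values

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)

b = sxy / sxx            # slope
a = mean_y - b * mean_x  # y-intercept

def predict(x_value):
    """Predicted y for a given x using the fitted line."""
    return a + b * x_value

print("slope:", b, "intercept:", a)
print("prediction at x = 3:", predict(3))
```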
Response Variables and terms
-The Explanatory Variable explains the other one (it is used to explain or predict changes in the response variable; the independent variable)
-The Response Variable is a response to the other one (the dependent variable)
-Extrapolation is using the LSRL to predict y for values of x outside of the range of the x values in the original data set
-Influential point is a point that significantly impacts the slope of the LSRL. When removed the slope significantly changes
-Outlier: in a regression setting, an outlier is a data point that is far away from the LSRL relative to the other points
-Lurking Variable is a different outside variable that causes both x and y to change. CORRELATION DOES NOT IMPLY CAUSATION
Residuals
-A residual is the vertical deviation between an observation & the LSRL
-The sum of the residuals from the LSRL is always zero
-A Residual Plot is a scatterplot of the (x, residual) pairs. Its purpose is to tell if the model is appropriate. If no pattern exists between the points, the model is appropriate
Coefficient of Determination
-Is the proportion of variation in y that can be attributed to an approximate linear relationship between X and Y
-Remains the same even if you switch the X and Y
-Non-Resistant because it is affected by outliers
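A small sketch tying residuals and r^2 together on made-up data. It assumes the standard least-squares slope and intercept formulas and computes r^2 as 1 - SSE/SST, where SSE is the sum of squared residuals and SST is the total variation in y; those names are not used in these notes.

```python
x = [1, 2, 3, 4, 5]  # hypothetical explanatory values
y = [2, 4, 5, 4, 5]  # hypothetical response values

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

# Fit the LSRL y(hat) = a + b*x.
b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
    (xi - mean_x) ** 2 for xi in x
)
a = mean_y - b * mean_x

# Residual = actual y minus predicted y.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

sse = sum(res ** 2 for res in residuals)     # variation left unexplained
sst = sum((yi - mean_y) ** 2 for yi in y)    # total variation in y
r_squared = 1 - sse / sst

print("residuals:", residuals)
print("sum of residuals:", sum(residuals))
print("r^2:", r_squared)
```

Plotting the (x, residual) pairs from this list gives the residual plot.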
Regression Facts
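-The r value is not affected by linear transformations such as changing units (multiplying one variable by a negative constant flips its sign but not its strength)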
-The value of r measures the extent to which X and Y are linearly related
-The point (X(bar), Y(bar)) is on every least-squares regression line
-The LSRL and the Correlation coefficient are both Non-Resistant
Interpretations
-Slope: For each increase in the x-variable of one x-unit, there is a predicted increase/decrease in the y-variable of b y-units.
-Y-intercept: When the x-variable is equal to 0, the predicted y-variable is equal to a y-units.
-Correlation coefficient(r): There is a strength(strong, moderate, or weak), direction(positive or negative), linear relationship between x-variable and y-variable.
-Coefficient of determination(r^2): r^2% of the variation in the y-variable can be explained by the approximate linear relationship between x-variable and y-variable.
-Residual(resid = y - y(hat)): The actual y-variable is |resid| y-units above (if +) or below (if -) the predicted y-variable
Non-Linear
If the residual plot is not randomly scattered (it shows a pattern), then you might need to replace y with log(y) or x with log(x) until you get a residual plot that is appropriate.
Simulations
A simulation is an imitation of chance behavior, based on a model that accurately reflects the experiment under consideration.
We can get randomly generated digits that can be used to model experiments through a digit table
Setting up a simulation
- State the model, defining the key components
- State what one trial would be, including a stopping rule if necessary
- State what you would record
- Conduct the trials and record observations
- Summarize and draw conclusion
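The steps above can be sketched with a random number generator standing in for a digit table. The scenario is invented for illustration: each cereal box has a 20% chance of containing a prize, and we want to know how many boxes it takes to get the first prize.

```python
import random

rng = random.Random(42)  # fixed seed so the run can be repeated

def one_trial():
    """One trial: open boxes until the first prize appears.

    Model: digits 0-1 mean 'prize' (20% chance), digits 2-9 mean 'no prize'.
    Stopping rule: stop at the first prize.
    Record: the number of boxes opened.
    """
    boxes_opened = 0
    while True:
        boxes_opened += 1
        digit = rng.randint(0, 9)  # one random digit, like reading a digit table
        if digit <= 1:
            return boxes_opened

# Conduct many trials and record the observations.
results = [one_trial() for _ in range(1000)]

# Summarize and draw a conclusion.
average_boxes = sum(results) / len(results)
print("average number of boxes until the first prize:", average_boxes)
```

With many trials the average should settle close to the long-run expected number of boxes for this model.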
Sampling Design
The different ways to collect data are surveys, experiments and observational studies (Observational in the present, Retrospective looking at the past and Prospective following subjects into the future)
-Population: The entire group of individuals that we want information about.
-Census: A Complete count of the population or information about the entire population.
-Sample: A part of the population that we examine in order to gather information about the entire population.
-Sampling Design: the method we use to choose a sample from a population. We use a sampling frame, which is a list of every individual in the population. Depending on how we choose our sample, certain biases may be present or certain groups may not be fully represented. This means that we have to choose carefully whom we add to our sample in order to avoid drawing conclusions from misleading or unrepresentative data.
Sampling Designs
-Simple Random Sample: every individual has an equal chance of being selected. Every set of n individuals has an equal chance of being selected.
-Stratified Random Sample: population is divided into groups called strata. SRSs are selected from each group.
-Systematic Random Sample: sample is selected following a systematic approach. Randomly select a starting point between 1 and n, then select every nth individual after that until the goal is reached.
-Cluster Random Sample: randomly pick a location and sample all individuals in that location.
-Multistage Sample: any combination of other sample methods. Selects successively smaller groups within population in stages.
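A rough sketch of the first four designs using Python's random module. The population is a made-up list of ID numbers 1 to 100, and the strata and clusters are formed arbitrarily just to show the mechanics.

```python
import random

rng = random.Random(0)            # fixed seed so the run can be repeated
population = list(range(1, 101))  # hypothetical sampling frame: IDs 1..100

# Simple random sample: every set of 10 individuals is equally likely.
srs = rng.sample(population, 10)

# Stratified: split into strata, then take an SRS of 5 from each stratum.
strata = {"low": population[:50], "high": population[50:]}
stratified = [unit for group in strata.values() for unit in rng.sample(group, 5)]

# Systematic: random starting point among the first k, then every kth individual.
k = 10
start = rng.randrange(k)
systematic = population[start::k]

# Cluster: randomly pick one cluster and sample everyone in it.
clusters = [population[i:i + 10] for i in range(0, 100, 10)]
cluster_sample = rng.choice(clusters)

print("SRS:", srs)
print("stratified:", stratified)
print("systematic:", systematic)
print("cluster:", cluster_sample)
```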
Advantages of Sample Design Types
-Simple Random Sample: Unbiased, easy to design, and the formulas for the SD and confidence intervals are simpler.
-Stratified Random Sample: Unbiased, less variability, easy if strata already exist.
-Systematic Random Sample: Unbiased, Sample frame is not needed, easier and more efficient.
-Cluster: Unbiased, Sample frame is not needed, easier and more efficient.
Disadvantages of Sample Design Types
-Simple Random Sample: More variability, may not be representative, sampling frame is required.
-Stratified Random Sample: Difficult to execute if strata are not already present, formulas are more complicated, a sample frame is required.
-Systematic Random Sample: More variability, can be confounded by trend or cycle, formulas are more complicated.
-Cluster: More variability, clusters may not be representative, formulas are more complicated.
Bias
A Bias is a systematic error in measuring the estimate that would repeatedly cause the data to be wrong. There are several types of biases that result from different circumstances.
-Voluntary Response Bias: people select themselves to participate in the study thus are not randomly chosen. This type of bias tends to favor those with stronger responses to the topic of the study.
-Nonresponse Bias: individuals are randomly chosen to be part of the sample but refuse to cooperate with the study. Note that this bias cannot exist along with voluntary response bias, as the subjects are RANDOMLY chosen and do not volunteer.
-Convenience Bias: subjects are chosen based on the simplicity or convenience of the situation. These subjects are therefore not chosen at random.
-Undercoverage Bias: some groups are left out of the selection process, whether on purpose or accidentally.
-Response Bias: occurs when the response of the subject is influenced by the behavior of the subject or the interviewer (for example, a subject who does not answer truthfully).
-Wording of Question Bias: occurs when the wording of the question given to the subject influences the answer of the subject. Questions should be kept neutral and should be appropriate to the population being surveyed.
Experimental Design
Experiments actively impose randomly assigned treatments in order to observe the response to those treatments.
-Experimental Units: the individuals or objects to which the treatments are given.
-Factor: the explanatory variable, which is what is tested/changed.
-Levels: the specific values or types of the factor.
-Response Variable: the result of the treatment; it is what we measure at the end of the experiment.
-Treatments: the specific conditions imposed on the units. They are the same as the levels when there is only one factor and are combinations of levels if there are multiple factors.
-Control Group: a group that receives no treatment or a placebo, used as a baseline for comparing the treatments.
-Placebo: a dummy treatment that has no active effect on the subject. Placebos are not required in every experiment.
-Blinding: when the units or the evaluators do not know which treatment the subject received.
-Double Blinding: when both the subject and the evaluator are unaware of which treatment was given.
-Confounding Variable: a variable whose effect on the response variable cannot be separated from the effect of the factor.
-Well-designed experiments are able to show causation because random assignment spreads the effects of potential confounding variables evenly across the treatment groups.
Principles of Experimental designs
-Control: limit the effects of extraneous variables on the response.
-Randomization: the use of chance to assign subjects treatments.
-Replication: the imposing of the experiment on many subjects to quantify the nature of the variation.
Types of Experimental Design
-Completely Randomized Design: experimental units are assigned to treatments completely at random.
-Randomized Block Design: experimental units are blocked into homogeneous groups, then randomly assigned treatments within each block.
-Matched Pair Design: a type of block design that compares two methods/treatments, using either pairs of similar units or the same unit twice.
-Randomization: reduces potential confounding by spreading uncontrolled variables evenly throughout the treatment groups.
-Blocking: reduces variability.
-Variability: is controlled by sample size. Larger samples produce statistics with less variability.
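A minimal sketch of random assignment for the first two designs. The unit names, the two treatments, and the two blocks are all invented for illustration.

```python
import random

rng = random.Random(1)  # fixed seed so the run can be repeated
units = [f"unit{i}" for i in range(1, 13)]  # 12 hypothetical experimental units

# Completely randomized design: shuffle all units, then split into two groups.
shuffled = units[:]
rng.shuffle(shuffled)
completely_randomized = {"treatment": shuffled[:6], "control": shuffled[6:]}

# Randomized block design: randomize separately within each homogeneous block.
blocks = {"block_A": units[:6], "block_B": units[6:]}
randomized_block = {}
for name, members in blocks.items():
    members = members[:]  # copy so the original block list is not reordered
    rng.shuffle(members)
    randomized_block[name] = {"treatment": members[:3], "control": members[3:]}

print(completely_randomized)
print(randomized_block)
```

In the block design each block contributes equally to both treatment groups, which is how blocking removes the block-to-block variability from the comparison.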
Probability
Fundamental Counting Principle
The fundamental counting principle is a rule used to count the total number of possible outcomes in a situation. It states that if there are n ways of doing something, and m ways of doing another thing after that, then there are n × m ways to perform both of these actions.
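A tiny check of the principle by listing every outcome. The shirts and pants are a made-up example with n = 3 and m = 2.

```python
from itertools import product

shirts = ["red", "blue", "green"]  # n = 3 ways to pick a shirt
pants = ["jeans", "khakis"]        # m = 2 ways to pick pants

# product pairs every shirt with every pair of pants.
outfits = list(product(shirts, pants))

print("number of outfits:", len(outfits))
print("n x m:", len(shirts) * len(pants))
```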
Permutations
A counting problem where the order of the elements does matter and each element cannot be used more than once. nPr
Combinations
A counting problem where the order of the elements does not matter and each element cannot be used more than once. nCr
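A short sketch comparing nPr and nCr, using nPr = n!/(n-r)! and nCr = n!/(r!(n-r)!). The values n = 5 and r = 3 are arbitrary; math.perm and math.comb require Python 3.8 or later.

```python
import math
from itertools import combinations, permutations

n, r = 5, 3

n_p_r = math.perm(n, r)  # order matters: n! / (n - r)!
n_c_r = math.comb(n, r)  # order does not matter: n! / (r! * (n - r)!)

# Listing the arrangements directly gives the same counts.
listed_permutations = list(permutations(range(n), r))
listed_combinations = list(combinations(range(n), r))

print("5P3 =", n_p_r, "listed:", len(listed_permutations))
print("5C3 =", n_c_r, "listed:", len(listed_combinations))
```

Each combination of 3 elements can be ordered in 3! = 6 ways, which is why nPr is 6 times nCr here.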