Overview of Data Collection and Analysis Services
When writing a dissertation, students often face the question of where to get the data and how to analyse it. Some data is only available through subscription to paid services such as Thomson One Banker, Compustat, Fame, or Bloomberg. Other data is available for free but must be aggregated from multiple sources, which are not easy to find. Our data collection services are designed to help students collect the data for their dissertation. In addition to our dissertation writing services, we also offer statistical analysis of data. Our team of professional writers is experienced with most statistical software including SPSS, Stata, Eviews, R, Minitab and Matlab, among others. Below is just a brief sample of statistical methods of data analysis we offer.
What is the Purpose of Data Analysis?
Data analysis is used in dissertations and reports to test hypotheses, evaluate relationships between variables and examine the dynamics of variables and their patterns. Without data analysis, an empirical dissertation is unlikely to receive a pass mark. However, there are different methods by which data analysis can be done. These include quantitative and qualitative methods. The list below provides examples of quantitative methods of data analysis with statistical software.
How can 15 Writers help?
We offer a wide array of data collection and data analysis services for dissertations. These include primary data collection through surveys or interviews, secondary data collection from both free and subscription-based databases, and analysis of data in most popular statistical software such as SPSS, Stata, Eviews, Matlab, Minitab and R. If you don’t know which methods of data analysis would be best for your dissertation, our team of professional writers will help you decide. We also invite you to try our free Methodology Generator Tool to determine an optimal design for your research. If you study economics, our free panel data resource will help you retrieve the most popular economic time-series from the World Bank, automatically converted into a panel form.
How It Works
Send us your requirements to get a price
Make payment for your order
An expert writer is assigned to you
Your paper is sent to you on or before your deadline
We understand how stressful university can be, which is why we've made our process as easy as possible. Simply send us your requirements to get a quote, make payment and the work will begin. You will be assigned a writer who is a qualified expert within your field. They will work on the order, carefully following the guidelines you have sent. Once the work is finished, it is checked by our in-house quality control team, and then emailed to you on or before the deadline you have requested. All orders will be 100% original, and covered by our guarantees. Click on the button below to get started today.
Statistical Methods of Data Analysis in SPSS, Stata and Eviews
Descriptive and Graphical Analysis
Descriptive statistics are summary statistics that describe features of data such as its central tendency and variance. These statistics usually include the maximum and minimum value, standard deviation, variance, range, skewness and kurtosis and central tendency indicators such as the mean, median and mode. These statistics are used to characterise a data set so that it can be easily understood and interpreted in statistical analysis.
Researchers also need descriptive statistics for early detection of potential outliers. For example, if there is a large range between the maximum and minimum, a high standard deviation and high kurtosis, it is very likely that the data contains outliers. Descriptive statistics can also indicate whether the data follows a normal distribution. In normally distributed data, the mean and median values will coincide. The mean value of a variable is its arithmetic average, that is, the sum of values divided by the number of observations. Meanwhile, the median is the central value in the data set arranged from the lowest to the highest. The normality of distribution is more precisely detected by observing the values of the skewness and kurtosis and the estimated value of the Jarque-Bera statistic. Volatility of data is best described by the standard deviation. If the data is normally distributed, around 95% of the observations will lie within two standard deviations from the mean and about 99.7% of the observations will lie within three standard deviations.
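As a rough illustration of the statistics described above, the following Python sketch computes them using the standard library only. The `describe` function and the sample values are hypothetical. The Jarque-Bera statistic is computed from the textbook formula JB = n/6 × (S² + (K − 3)²/4) with population moments; statistical packages may apply small-sample adjustments, so their output can differ slightly.

```python
import statistics

def describe(values):
    """Return basic descriptive statistics for a list of numbers."""
    n = len(values)
    mean = statistics.fmean(values)
    # Population (divide-by-n) central moments, as used in the textbook JB formula.
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2  # equals 3 for a normal distribution
    jarque_bera = n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)
    return {
        "n": n,
        "mean": mean,
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        "std_dev": statistics.stdev(values),  # sample standard deviation
        "skewness": skewness,
        "kurtosis": kurtosis,
        "jarque_bera": jarque_bera,
    }

# Hypothetical data with one suspiciously large value.
sample = [2.1, 2.4, 2.2, 2.6, 2.3, 2.5, 2.0, 9.8]
summary = describe(sample)
for name, value in summary.items():
    print(f"{name}: {value}")
```

In this made-up sample the single large value pulls the mean above the median and produces positive skewness and kurtosis above 3, which is the pattern that usually signals outliers.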
In terms of the statistical software, Stata provides the smallest default set of descriptive statistics, which includes only the mean, maximum and minimum values, standard deviation and the number of observations for each variable. However, this list can be expanded using additional options. Eviews provides a wider set of descriptive statistics by default. In addition to the ones from Stata, it includes the skewness, kurtosis and the Jarque-Bera statistic. As for SPSS, it provides the same statistics as Eviews except the Jarque-Bera test, while the range and standard error can be added optionally. Along with that, Stata and SPSS allow for presenting frequency tables for dummy and categorical variables, which is a useful option when primary data is analysed.
Along with descriptive statistics, graphical analysis is frequently used to detect long-term trends in data and visually assess its stationarity. Stationarity implies a consistent long-term mean and equally spread variance around this value. It can be easily detected with basic line charts. However, if the researcher wants to graphically assess the relationship between two variables, a scatter plot will be a more suitable option. Pie charts are generally used for visualising structure and constituents. Bar charts are used to present discrete categories. Correlograms are used to observe patterns of serial correlation in data. Q-Q and P-P plots are valuable instruments for detecting deviations from normal distribution in the data.
Most statistical software packages, including Stata, Eviews and SPSS, have sufficient instruments for graphical analysis and are able to construct the described graphs.
Time-Series Analysis (AR, MA, ARIMA, ARCH and GARCH models)
Time-series analysis is distinguished by its focus on one variable at a time. It is a univariate analysis that explores properties of a variable that varies across time. Thus, it can be applied to most economic and financial time-series but it is not common for survey data where mostly cross-sectional observations are present.
Time-series models are generally divided into those that model current behaviour as a function of past values and those that model volatility of time-series. The former includes the autoregressive (AR), moving average (MA) and ARIMA model that combines both AR and MA terms. Volatility of time-series is mostly analysed using ARCH and GARCH type models.
AR models are estimated by regressing the current values of a variable on past (lagged) values of the same variable. MA models are estimated by regressing the present values of a variable on random shocks. When both lagged values and random shocks are included in the regression specification, the ARMA or ARIMA model is derived.
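The autoregressive part can be sketched in a few lines of Python. The example below simulates a hypothetical AR(1) series with a known coefficient of 0.6 and then recovers it by regressing each value on its first lag. The function names, the seed and the coefficient are assumptions made for illustration. MA terms are not covered here because the random shocks are unobserved and are normally estimated by maximum likelihood in software such as Stata or Eviews.

```python
import random

def simulate_ar1(phi, n, seed=0):
    """Generate y_t = phi * y_{t-1} + e_t with standard normal shocks."""
    rng = random.Random(seed)
    y = [0.0]
    for _ in range(n - 1):
        y.append(phi * y[-1] + rng.gauss(0.0, 1.0))
    return y

def fit_ar1(y):
    """OLS regression of y_t on a constant and y_{t-1}; returns (intercept, phi_hat)."""
    lagged, current = y[:-1], y[1:]
    n = len(lagged)
    mean_lag = sum(lagged) / n
    mean_cur = sum(current) / n
    sxx = sum((a - mean_lag) ** 2 for a in lagged)
    sxy = sum((a - mean_lag) * (b - mean_cur) for a, b in zip(lagged, current))
    phi_hat = sxy / sxx
    return mean_cur - phi_hat * mean_lag, phi_hat

series = simulate_ar1(phi=0.6, n=5000)
intercept, phi_hat = fit_ar1(series)
print(f"estimated AR(1) coefficient: {phi_hat:.3f}")
```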
ARCH is one of the simplest models of volatility where the variance of residuals from an intercept-only or ARIMA model is regressed on the lagged squared residuals. ARCH models are a special case of a more extended GARCH type model where the variance of residuals is regressed not only on the lagged squared residuals but also on the lagged variance term. ARCH and GARCH models are widely used in the financial industry since most asset prices are volatile and exhibit volatility clustering.
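The relationship between the two models can be shown with the variance recursion itself. The sketch below is not an estimation routine: the parameters `omega`, `alpha` and `beta` and the residuals are hypothetical values chosen by hand, whereas in practice they would be estimated by maximum likelihood. Setting `beta` to zero reduces GARCH(1,1) to ARCH(1).

```python
def conditional_variance(residuals, omega, alpha, beta=0.0):
    """GARCH(1,1) recursion: s2_t = omega + alpha * e_{t-1}^2 + beta * s2_{t-1}.

    With beta = 0 this is the ARCH(1) model. Requires alpha + beta < 1.
    """
    # Start from the unconditional (long-run) variance.
    variances = [omega / (1.0 - alpha - beta)]
    for shock in residuals[:-1]:
        variances.append(omega + alpha * shock ** 2 + beta * variances[-1])
    return variances

# Hypothetical residuals with one large shock in the second period.
residuals = [0.1, -2.0, 0.3, 0.2]
arch = conditional_variance(residuals, omega=0.2, alpha=0.5)
garch = conditional_variance(residuals, omega=0.2, alpha=0.5, beta=0.3)
print("ARCH(1):   ", [round(v, 3) for v in arch])
print("GARCH(1,1):", [round(v, 3) for v in garch])
```

In both series the variance jumps in the period after the large shock. With the lagged variance term included, the GARCH variance stays elevated for longer, which is how the model captures volatility clustering.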
Both Eviews and Stata are well equipped for running time-series models and allow for estimating ARCH and GARCH models conveniently. They also offer several options to extend the GARCH models. As for SPSS, it does not offer an intuitive, built-in option for running volatility models. However, this can be worked around by using extensions.
Regression Analysis (OLS, Logistic Regressions, Panel Regression, Quantile Regression, and Ordinal Regression)
Regression analysis is a popular technique of quantitative data exploration when more than one variable is involved. It is used to model the linear associations between the dependent variable and one or several explanatory, or independent variables. Depending on the type of data used, several types of regressions can be singled out.
The most common type of regression is an ordinary least squares (OLS) regression. It is employed when the dependent variable is continuous. The relationship between this variable and a set of independent variables is assumed to be linear. The model is also assumed to be linear in parameters. It is an optimal instrument for analysing instantaneous and lagged relationships between economic, financial and social variables. The main criterion is that the data should be quantitative and measurable on a ratio scale, although categorical variables transformed into dummy variables can also be used in such regressions. OLS regressions can be applied to both time-series and cross-sections.
While it is perfectly fine to include dummy variables as explanatory, or independent, variables in OLS regressions, they cannot be used as dependent variables. If a binary variable needs to be used as the dependent variable, then a binary choice model such as a logistic regression should be run instead of the basic OLS model. The two most popular types of binary regressions are logit and probit models. In logit models, the coefficients are log-odds, and their exponentials are interpreted as odds ratios. In general, the results are usually consistent between these models, and the choice between them depends on the field of study.
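To make the interpretation of logit coefficients concrete, the sketch below fits a logit model with one explanatory variable using Newton-Raphson iterations. The `fit_logit` function and the data on study hours and pass outcomes are hypothetical. This is a minimal illustration under those assumptions and not a substitute for the estimation routines in Stata, Eviews or SPSS.

```python
import math

def fit_logit(x, y, iterations=25):
    """Fit P(y = 1) = 1 / (1 + exp(-(a + b * x))) by Newton-Raphson; returns (a, b)."""
    a = b = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = p * (1.0 - p)
            g0 += yi - p           # score for the intercept
            g1 += (yi - p) * xi    # score for the slope
            h00 += w               # elements of the information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system for the parameter update.
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

# Hypothetical data: hours studied and whether the exam was passed (1) or not (0).
hours = [1, 2, 3, 4, 5, 6, 7, 8]
passed = [0, 0, 1, 0, 1, 0, 1, 1]
intercept, slope = fit_logit(hours, passed)
odds_ratio = math.exp(slope)
print(f"slope (log-odds): {slope:.3f}, odds ratio: {odds_ratio:.3f}")
```

The slope is the change in the log-odds of passing for each additional hour, and its exponential is the corresponding odds ratio.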
As was discussed above, OLS regressions can be applied to time-series and cross-sections but what if the data contains both? Such a dataset is known as panel data and applying a pooled OLS regression may result in weak findings as heterogeneity between the cross-sections or specific time-periods would be ignored. There are two popular ways to deal with the problem of heterogeneity in the panel data. The first one is to estimate a fixed effect (FE) panel regression. This is equivalent to adding dummy variables for each cross-section or each time period depending on what differences the researcher intends to model. An alternative way is to run a random effect (RE) panel regression. The latter is estimated by adding a new stochastic term that accounts for heterogeneity in cross-sections or time periods. An optimal method is usually selected on the basis of the Hausman test.
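The equivalence between a fixed effect regression and entity dummies can be illustrated with the so-called within transformation, in which each entity's own mean is subtracted from its observations before running OLS. The data below are hypothetical and were constructed so that the true slope is 2 within each firm while the firms have very different intercepts. The function names are assumptions for illustration.

```python
from collections import defaultdict

def ols_slope(x, y):
    """Slope from a simple OLS regression of y on x with an intercept."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sxx

def within_transform(groups, values):
    """Subtract each entity's own mean from its observations."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for g, v in zip(groups, values):
        totals[g] += v
        counts[g] += 1
    return [v - totals[g] / counts[g] for g, v in zip(groups, values)]

# Hypothetical panel: two firms, three periods each, y = firm effect + 2 * x.
firm = ["A", "A", "A", "B", "B", "B"]
x = [1, 2, 3, 4, 5, 6]
y = [12, 14, 16, -12, -10, -8]

pooled = ols_slope(x, y)
fixed_effects = ols_slope(within_transform(firm, x), within_transform(firm, y))
print(f"pooled OLS slope: {pooled:.3f}")
print(f"fixed effect slope: {fixed_effects:.3f}")
```

In this constructed example the pooled slope even has the wrong sign because it confuses differences between firms with the effect of x, while the fixed effect estimate recovers the slope of 2.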
Quantile regression is often applied when the assumptions of a linear OLS do not hold. Unlike traditional OLS regression that measures the conditional mean effect of one variable on the other, quantile regressions are used to model the median effect or an alternative quantile that is unachievable with linear OLS. Quantile regressions are less sensitive to the presence of outliers in data.
It was noted that linear OLS regressions work well with quantitative data. However, much of the primary data collected in surveys is ordinal, such as responses ordered on a Likert scale. Ordinal regression is a type of regression analysis that allows for establishing the relationship between such ordinal variables.
When it comes to choosing appropriate software for running regression analysis, it is important to keep in mind that SPSS is limited in that it does not offer functionality for working with panel regressions. This, however, can easily be done in Eviews and Stata, which besides panel regressions also allow for estimating ordinal, logistic, quantile and OLS models. However, SPSS has sufficient instruments for standard OLS and ordinal regressions and is a great tool for analysing primary data.
Correlation Analysis (Pearson, Spearman, and Kendall Tau)
Correlation analysis is a statistical procedure which determines whether there is an association between a pair of variables without implication for causality between them. Besides detecting co-movements in variables, correlation analysis is also often used for detecting multicollinearity issues arising when independent variables in a regression are strongly correlated.
There are different methods of establishing a correlation between two variables including the Pearson, Spearman and Kendall Tau rank coefficients. The Pearson correlation is applied only to quantitative variables measured on a ratio scale. Such variables should be continuous. Examples are stock prices and macroeconomic variables. When the relationship between two quantitative variables is not linear but monotonic, the Spearman correlation would be a better option. Lastly, if the explored variables are ordinal, such as the Likert scale data, the Kendall Tau coefficient is an optimal solution for establishing correlation.
Correlation coefficients are in a range from -1 to 1 indicating extreme negative and extreme positive correlation respectively. A zero coefficient implies that there is no correlation between the variables at all. Generally, absolute values of correlation coefficient in excess of 0.7 indicate strong correlation between variables. The values around 0.5 indicate only moderate correlation.
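The difference between a linear and a monotonic relationship can be seen in a small Python sketch. It computes the Pearson coefficient directly and the Spearman coefficient as the Pearson correlation of the ranks. The data are hypothetical (y is simply the cube of x) and the function names are assumptions for illustration. Kendall Tau is not covered here.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical monotonic but non-linear relationship: y = x cubed.
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]
print(f"Pearson:  {pearson(x, y):.3f}")
print(f"Spearman: {spearman(x, y):.3f}")
```

Because y always rises with x, the Spearman coefficient equals 1, while the Pearson coefficient is lower because the relationship is not a straight line.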
SPSS provides the most convenient way for estimating correlation between variables. By selecting the correlation analysis, the user can choose the type of data and type of test in the same window. In Eviews, a default correlation analysis implies estimation of Pearson’s coefficient. However, the calculation of the Spearman and Kendall Tau rank coefficients can be done in the menu of covariance analysis which is not very convenient. As for Stata, the procedure is similar to Eviews where the Pearson correlation is used as a default option, and the Spearman and Kendall Tau coefficients are available in the menu of non-parametric tests.
Parametric vs Non-Parametric Tests
When conducting hypothesis testing, researchers are always faced with the decision of whether to use parametric or non-parametric tests. A parametric test is a procedure of checking a hypothesis assuming that the data is distributed in line with some theoretical distribution such as the normal distribution. Non-parametric tests are employed when parametric tests cannot be used. Most non-parametric tests employ a certain method of ranking the observations and testing for anomalies of the distribution. While parametric tests explore group means, non-parametric tests examine group medians.
Parametric tests are usually preferred because they have greater statistical power when their distributional assumptions hold. On the other hand, non-parametric tests are necessary in some cases when the distribution is not known. This necessity can be connected with non-normality of the distribution, an unknown distributional form or a very small sample size. In addition to this, non-parametric tests could be applied when the sample contains outliers or “out of range” values that cannot be removed.
A one-sample t-test is employed to check whether a sample mean is significantly different from some hypothesised value such as the population mean. Its non-parametric equivalent is the Wilcoxon signed rank test. A two-sample t-test is used to compare the means of two groups. The Mann-Whitney test is its non-parametric equivalent. If there are more than two groups, a one-way ANOVA test is run. Its non-parametric equivalent is the Kruskal-Wallis test. ANOVA requires the dependent variable to be normally distributed whereas the Kruskal-Wallis test does not have this requirement.
Even though most statistical software could be used to conduct hypothesis testing, it is less convenient in Eviews because these tests are difficult to find in menus. A one-sample t-test is called the simple hypothesis test, whereas a two-sample t-test and ANOVA are called Equality tests by Classification. Both Stata and SPSS are much more convenient for conducting hypothesis testing. SPSS has a separate menu titled “Compare means” where all these tests can be found. Stata has two sub-menus for parametric and non-parametric tests.
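The two-sample case can be sketched as follows. The example computes a two-sample t statistic and the Mann-Whitney U statistic for two hypothetical groups. The t statistic is computed in Welch's form, which does not assume equal variances; this variant is an assumption of the sketch. P-values are omitted because they require the t distribution and the U tables that statistical software provides.

```python
import math
import statistics

def welch_t(a, b):
    """Two-sample t statistic without assuming equal variances (Welch's form)."""
    var_a = statistics.variance(a) / len(a)
    var_b = statistics.variance(b) / len(b)
    return (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(var_a + var_b)

def mann_whitney_u(a, b):
    """U statistic for sample a: pairs (x in a, y in b) with x > y, ties count 0.5."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)

# Hypothetical scores for two groups of five respondents.
group_a = [5.1, 4.9, 5.6, 5.8, 6.0]
group_b = [4.2, 4.4, 4.8, 4.1, 4.5]

t_stat = welch_t(group_a, group_b)
u_stat = mann_whitney_u(group_a, group_b)
print(f"t statistic: {t_stat:.3f}")
print(f"U statistic: {u_stat} out of {len(group_a) * len(group_b)} pairs")
```

The t statistic compares the group means relative to their sampling variability, whereas U only uses the ordering of the observations, which is why it is less affected by extreme values.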
Cointegration, VAR and VECM
An OLS regression works well with stationary data. If the data are non-stationary, an OLS regression may produce spurious results and the residuals are likely to suffer from serial correlation. Thus, in OLS regressions transformation of variables is often necessary. However, this method, as was noted before, allows for measuring only instantaneous or short-term relationships between variables. When a long-term relationship needs to be assessed, cointegration analysis is conducted.
There are two common methods of cointegration analysis. The first one is based on OLS regressions and is known as the Engle-Granger methodology. The second method is based on vector autoregressive (VAR) analysis and is known as the Johansen cointegration test. The main advantage of the Johansen test over the Engle-Granger methodology is that the latter allows for detecting only one cointegrating equation and all assumptions of OLS regressions must be addressed. The Johansen test is able to detect more than one cointegrating equation and it has fewer assumptions.
In both methods, the data are first tested for stationarity using unit root tests. If the variables are non-stationary but their first differences are stationary, they are said to be integrated of order 1. Cointegration between two variables can be detected only if they are integrated of the same order, be it order 1 or higher.
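The idea of a variable being integrated of order 1 can be illustrated informally. The sketch below simulates a hypothetical random walk and compares the first-order autoregressive coefficient of the level with that of the first difference. This is only a visual aid under assumed simulated data and not a formal unit root test: tests such as the augmented Dickey-Fuller test rely on special critical values that are provided by Stata and Eviews.

```python
import random

def lag1_slope(y):
    """Slope from an OLS regression of y_t on a constant and y_{t-1}."""
    lagged, current = y[:-1], y[1:]
    n = len(lagged)
    mean_lag = sum(lagged) / n
    mean_cur = sum(current) / n
    sxx = sum((a - mean_lag) ** 2 for a in lagged)
    sxy = sum((a - mean_lag) * (b - mean_cur) for a, b in zip(lagged, current))
    return sxy / sxx

# Hypothetical random walk: each value is the previous value plus a random shock.
rng = random.Random(42)
walk = [0.0]
for _ in range(1999):
    walk.append(walk[-1] + rng.gauss(0.0, 1.0))

# First difference: the change from one period to the next.
difference = [b - a for a, b in zip(walk, walk[1:])]

print(f"lag coefficient, level:            {lag1_slope(walk):.3f}")
print(f"lag coefficient, first difference: {lag1_slope(difference):.3f}")
```

A coefficient close to 1 in levels together with a coefficient close to 0 after differencing is the typical pattern of a series that is non-stationary in levels but stationary in first differences.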
If there is no cointegration between variables, short-term relationships between their dynamics are analysed using an unrestricted VAR model. However, if there is at least one cointegrating equation detected, further analysis is conducted with the Vector Error Correction Model (VECM).
A VAR is usually estimated with stationary variables, which are modelled as functions of their own lagged values and the lagged values of other variables in the system. Thus, a VAR model does not have to be backed up by theory and no issues with endogeneity have to be resolved as in traditional OLS regressions. The results of a VAR model are best interpreted graphically using Impulse Response Functions (IRFs). The optimal number of lags for the variables in the system is usually chosen based on information criteria (IC) such as Akaike IC or Schwarz IC. In contrast to VAR, a VECM is estimated with non-stationary variables.
Both Stata and Eviews offer a rich set of instruments to work with cointegration tests, VAR and VECM models. Meanwhile, SPSS does not have enough built-in functions for running VAR and VECM models. Thus, the choice of statistical software will be critical in this case.
Factor Analysis (FA) and Principal Component Analysis (PCA)
In many cases, when a regression model is formulated, some independent variables may be highly correlated. Possible solutions for removing this multicollinearity problem might be to omit some of the correlated variables or form composite scores from the correlated variables. However, in this case, the transformed model would explain less variance of the dependent variable. Another solution may be to create latent variables that influence the correlated variables and use them in the model instead. There are two ways to do this, namely through the Principal Component Analysis (PCA) and the Factor Analysis (FA).
These two methods are similar in that they both reduce the data to a few key factors. They will not work well if the initial inputs are not highly correlated. If the correlation between the variables is very high, the outcomes of the PCA and the FA may be quite similar. However, a key difference between these two methods is the way in which the new elements, namely Components in the PCA and Factors in the FA are created.
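Why these methods depend on correlated inputs can be shown for the simplest case of PCA on two standardised variables. Their correlation matrix has eigenvalues 1 + |r| and 1 − |r|, so the first component explains a share of (1 + |r|)/2 of the total variance. The sketch below uses hypothetical data and function names, and it covers PCA only. Factor analysis requires a different estimation procedure.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def first_component_share(x, y):
    """Share of total variance explained by the first principal component.

    For two standardised variables the correlation matrix [[1, r], [r, 1]]
    has eigenvalues 1 + r and 1 - r.
    """
    r = pearson(x, y)
    eigenvalues = sorted([1 + r, 1 - r], reverse=True)
    return eigenvalues[0] / sum(eigenvalues)

# Hypothetical inputs: b moves closely with a, while c is unrelated to a.
a = [1, 2, 3, 4, 5]
b = [2.1, 3.9, 6.2, 7.8, 10.1]
c = [2, 1, 3, 1, 2]

share_correlated = first_component_share(a, b)
share_uncorrelated = first_component_share(a, c)
print(f"share explained, correlated inputs:   {share_correlated:.3f}")
print(f"share explained, uncorrelated inputs: {share_uncorrelated:.3f}")
```

With highly correlated inputs, one component captures almost all of the variance. With uncorrelated inputs, the first component explains only half, so replacing the two variables with a single component would discard a large part of the information.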
All three software packages allow for conducting both the FA and the PCA. SPSS contains the PCA as a part of the FA. Meanwhile, Stata has separate options for conducting the FA and the PCA as parts of multivariate analysis. To conduct the FA and PCA in Eviews, it is necessary to open a group of variables that needs to be reduced. The choice of the software in this case is a matter of preference as the functionality is more or less the same.