Prepare the ship data stored in “ship.csv” for future analytics tasks using functions and methods of NumPy and pandas, respectively.
(a) Design your own Python program to carry out the following tasks:
(i) Read in “ship.csv” as a pandas DataFrame called “ship”. Six observations have “.” in MS and Y to indicate missing values, so declare this character as a missing-value marker when reading the file.
(ii) Since the variable names of this dataset are rather short and do not really describe the nature of the variables, rename the ship types to “types”, construction years to “c_years”, operation periods to “o_periods”, the aggregated months of service to “s_months”, and the number of incidents to “incidents”.
(iii) For a better understanding of the data, compute the average service months and the average number of incidents for every combination of the categories in types and operation periods. The averages should be rounded to the nearest integers. Store the resulting table in an object named “ship_group”.
(iv) Replace the missing values in the variables “s_months” and “incidents” with the respective means of the other ships that share the same type AND the same operation period. Add comments to explain your Python program as well.
(v) Construct a Python program to save the target variable “incidents” in a pandas DataFrame named “Y”.
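The steps in part (a) can be sketched as follows. This is only an illustration under stated assumptions: the six-row CSV text stands in for “ship.csv”, and the original column names T, C and P are hypothetical (only MS and Y are named in the text). The rounded table keeps a float dtype so that empty groups, if any, do not raise an error.

```python
import io

import pandas as pd

# Hypothetical stand-in for "ship.csv"; column names T, C and P are assumed.
raw = io.StringIO(
    "T,C,P,MS,Y\n"
    "A,60,60,100,0\n"
    "A,65,60,300,4\n"
    "A,70,60,.,.\n"
    "B,60,75,200,2\n"
    "B,65,75,400,6\n"
    "B,70,75,.,.\n"
)

# (i) Treat "." as a missing value while reading.
ship = pd.read_csv(raw, na_values=".")

# (ii) Rename the short column names to descriptive ones.
ship = ship.rename(columns={
    "T": "types",
    "C": "c_years",
    "P": "o_periods",
    "MS": "s_months",
    "Y": "incidents",
})

# (iii) Group means per (type, operation period), rounded to whole numbers.
cols = ["s_months", "incidents"]
ship_group = ship.groupby(["types", "o_periods"])[cols].mean().round(0)

# (iv) Fill each missing value with the mean of its own group.
# transform("mean") returns one row per original row, aligned on the index.
group_means = ship.groupby(["types", "o_periods"])[cols].transform("mean")
ship[cols] = ship[cols].fillna(group_means)

# (v) Keep the target variable in its own DataFrame.
Y = ship[["incidents"]].copy()

print(ship_group)
print(ship)
```

With the real file, `pd.read_csv("ship.csv", na_values=".")` would replace the in-memory text.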
(b) Except for the months of service and number of incidents, all the other variables, including “types”, “c_years”, and “o_periods” are actually nominal and not interval/ratio.
(i) Perform an appropriate data type conversion for these variables so that they are recognised as categorical variables.
(ii) Construct Python code to convert all categorical variables to dummy variables and save the result as a pandas DataFrame named “X”.
(iii) Researchers suggest that the aggregated months of service of each ship must be scaled down due to its wide range of values. Perform a log-transformation of this variable in the DataFrame and name the transformed variable “log_s_months”. The transformed variable should be attached to both DataFrames “X” and “ship”.
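One possible way to approach part (b) is sketched below on a small made-up DataFrame (the values are illustrative, not taken from the ship data). It assumes that all dummy levels are kept; whether to drop one level per variable is a modelling choice the text does not settle.

```python
import numpy as np
import pandas as pd

# Illustrative data only; the real "ship" DataFrame comes from part (a).
ship = pd.DataFrame({
    "types": ["A", "A", "B", "B"],
    "c_years": [60, 65, 60, 65],
    "o_periods": [60, 75, 60, 75],
    "s_months": [100.0, 300.0, 200.0, 400.0],
    "incidents": [0.0, 4.0, 2.0, 6.0],
})

# (i) Mark the nominal variables as categorical.
cat_cols = ["types", "c_years", "o_periods"]
ship = ship.astype({col: "category" for col in cat_cols})

# (ii) One 0/1 column per category level.
X = pd.get_dummies(ship[cat_cols], dtype=int)

# (iii) Natural-log transform, attached to both DataFrames.
ship["log_s_months"] = np.log(ship["s_months"])
X["log_s_months"] = ship["log_s_months"]

print(ship.dtypes)
print(X.head())
```

Note that `np.log` requires strictly positive service months, which is why the missing values are imputed before this step.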
(c) Normally, we would split the DataFrame into training and testing datasets to evaluate the predictive power of the model. Study the dataset carefully and explain why it is not sensible to split the DataFrame here, and why we should use the entire dataset for training instead.
(d) We shall now save the prepared DataFrame “ship” as a new csv text file called “ship_prepared.csv”. Furthermore, we shall also create a database called “ship.db” and export the DataFrame to the database as tables. Write a Python program to carry out these two tasks.
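A minimal sketch of part (d), assuming the standard-library sqlite3 module is an acceptable database backend and that the table is called “ship” (the text does not name the table). The files are written to a temporary directory here only to keep the example self-contained.

```python
import os
import sqlite3
import tempfile

import pandas as pd

# Illustrative data only; the real "ship" DataFrame comes from the earlier parts.
ship = pd.DataFrame({
    "types": ["A", "B"],
    "s_months": [100.0, 200.0],
    "incidents": [0.0, 2.0],
})

# Temporary directory is an assumption of this sketch, not part of the task.
out_dir = tempfile.mkdtemp()
csv_path = os.path.join(out_dir, "ship_prepared.csv")
db_path = os.path.join(out_dir, "ship.db")

# Save as CSV without the row index.
ship.to_csv(csv_path, index=False)

# Create the SQLite database file and export the DataFrame as a table.
conn = sqlite3.connect(db_path)
ship.to_sql("ship", conn, if_exists="replace", index=False)

# Read the row count back as a quick sanity check.
n_rows = conn.execute("SELECT COUNT(*) FROM ship").fetchone()[0]
conn.close()

print(n_rows)
```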
In their book Generalized Linear Models (New York: Chapman & Hall, 1983), P. McCullagh and J.A. Nelder used Poisson regression to study the ship dataset. Poisson regression is a special case of generalized linear models in which the target variable, or dependent variable, is Poisson distributed.
Since one of the main application areas of Poisson regression is fitting linear models to count data, we can use it to predict the number of incidents (which is itself a count) given some input variables. Mathematically, Poisson regression is a linear model in which the expected value of the target variable Y is calculated by
log E(Y) = β0 + β1X1 + β2X2 + … + βkXk
where β0 is the intercept and β1, β2, …, βk are the coefficients of the independent variables X1, X2, …, Xk. E(Y) is the predicted, or expected, value of Y, which is transformed by the natural logarithm function.
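To make the role of the logarithm concrete, the short sketch below evaluates the linear predictor for a few rows and maps it back to E(Y) with the exponential function. The coefficient values and the input matrix are made up for illustration.

```python
import numpy as np

# Hypothetical intercept and coefficients (not estimated from the ship data).
beta0 = 0.5
beta = np.array([0.2, -0.1])

# Three observations with two input variables each.
X = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
])

# Linear predictor: this is log E(Y).
log_mu = beta0 + X @ beta

# Inverting the natural log gives E(Y), which is always positive.
mu = np.exp(log_mu)

print(log_mu)
print(mu)
```

Because E(Y) is obtained through the exponential, a one-unit increase in Xj multiplies the expected count by exp(βj) rather than adding βj to it.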
(a) Find the corresponding scikit-learn module on the official scikit-learn website and discuss the module, the estimator, the fit and predict functions, and their parameters in your own words.
(b) Analyse the data by fitting a Poisson regression based on the DataFrames X and Y generated in Question 1. Follow the instructions on the official website and report the parameters of the estimated model. Create a Python program to fit a Poisson regression and generate a table or a DataFrame to present the coefficients with the corresponding labels.
(c) The deviance D between Y and its expected value E(Y), estimated by the model constructed in (b), measures the goodness of fit of the model. The lower the deviance, the better the model. Below is the equation of how it should be calculated. If Y = 0, the expression log[Y/exp(E(Y))] will be taken as zero. Write your own Python program to compute D without using the score() function of the scikit-learn package.
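Parts (b) and (c) might be approached along the following lines. This is a sketch under several assumptions: the data are synthetic, `PoissonRegressor` from `sklearn.linear_model` is taken to be the estimator in question, `alpha=0.0` switches off its default L2 penalty, and the deviance uses the standard Poisson form D = 2 Σ [Y log(Y/E(Y)) − (Y − E(Y))], with the log term set to zero when Y = 0. The equation intended by the assignment is not reproduced in the text and may differ.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import PoissonRegressor
from sklearn.metrics import mean_poisson_deviance

# Synthetic stand-ins for the X and Y DataFrames of Question 1.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "types_B": rng.integers(0, 2, size=40),
    "log_s_months": rng.uniform(4.0, 9.0, size=40),
})
rate = np.exp(-3.0 + 0.5 * X["log_s_months"].to_numpy())
Y = pd.DataFrame({"incidents": rng.poisson(rate)})

# Fit the Poisson regression; alpha=0.0 means no regularisation (assumption).
model = PoissonRegressor(alpha=0.0, max_iter=1000)
model.fit(X, Y["incidents"])

# Coefficient table with labels.
coef_table = pd.DataFrame({
    "label": ["intercept", *X.columns],
    "coefficient": [model.intercept_, *model.coef_],
})

# Manual deviance, without model.score().
y = Y["incidents"].to_numpy(dtype=float)
mu = model.predict(X)  # predict() returns E(Y), not log E(Y)

# Where y == 0 the ratio is set to 1 so that y * log(ratio) is exactly 0.
ratio = np.where(y > 0, y / mu, 1.0)
D = 2.0 * np.sum(y * np.log(ratio) - (y - mu))

# Cross-check against scikit-learn's mean deviance, scaled by sample size.
D_check = len(y) * mean_poisson_deviance(y, mu)

print(coef_table)
print(D, D_check)
```

With a full set of dummy columns for every categorical variable, the unpenalised coefficients are not uniquely determined, so the choice of `alpha` (or dropping one level per variable) affects the reported table.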