Lately I have been rather productive in my programming and frustrated at the same time. Trying to create a demographics summary table proved to be a lesson in frustration with R. Since I love R, this was disheartening. I did eventually find the reporttools package, which does make a great LaTeX table, but only in LaTeX. The tables package also looks great, but it was not entirely what I was looking for either, so I did the first logical thing for an R user faced with this sort of problem: I created a package to fill in the missing functionality.
The new package is dostats. The package serves two purposes:
- Create summaries of vectors through the dostats function.
- Manipulate functions.
The package started out with the dostats function for creating more informative summary tables. It works much like the tables package, but it is designed to work with plyr functions. The idea is to pass in a vector as the first argument; the remaining arguments are functions that compute statistics on the vector. For example:
dostats(rnorm(100), mean, sd, N = length)
## mean sd N
## 1 0.0775 0.8975 100
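The idea behind the call above can be sketched in base R. This is only an illustration, not the package's actual code: summarize_vector is a hypothetical stand-in for dostats, and it assumes every function returns a single value.

```r
# Hypothetical stand-in for dostats(); a sketch of the idea only.
summarize_vector <- function(x, ...) {
  funs <- list(...)
  # Recover the expressions as typed (mean, sd, ...) to use as default labels.
  exprs <- as.list(substitute(list(...)))[-1]
  labels <- vapply(exprs, function(e) paste(deparse(e), collapse = ""),
                   character(1), USE.NAMES = FALSE)
  # Explicit names such as N = length override the default labels.
  supplied <- names(funs)
  if (!is.null(supplied)) {
    labels[supplied != ""] <- supplied[supplied != ""]
  }
  # Apply each function to the vector and collect the results in one row.
  values <- lapply(funs, function(f) f(x))
  names(values) <- labels
  as.data.frame(values)
}

res <- summarize_vector(c(2, 4, 4, 6), mean, sd, N = length)
print(res)
```

Each statistic becomes one column of a one-row data frame, which is what makes the result easy to stack across variables.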
A renaming construct is also built in, so a named argument such as N = length sets the name of the resulting column. The construct is convenient because dostats can then be passed directly as an argument to
ldply, such as
ldply(mtcars, dostats, mean, sd, IQR)
## .id mean sd IQR
## 1 mpg 20.0906 6.0269 7.375
## 2 cyl 6.1875 1.7859 4.000
## 3 disp 230.7219 123.9387 205.175
## 4 hp 146.6875 68.5629 83.500
## 5 drat 3.5966 0.5347 0.840
## 6 wt 3.2172 0.9785 1.029
## 7 qsec 17.8487 1.7869 2.008
## 8 vs 0.4375 0.5040 1.000
## 9 am 0.4062 0.4990 1.000
## 10 gear 3.6875 0.7378 1.000
## 11 carb 2.8125 1.6152 2.000
This makes for a more logical summary data.frame object that has usable columns, each with the same data type. Unfortunately, this does not work for every data set. The above example only has numerical data. In a data frame with categorical data, the categorical columns would be passed to the same functions, which are rarely meaningful for them. Another limitation is that the results of each function must have the same dimension for every variable. For this reason I introduced functions that filter by the variable class.
- class.stats: creates a dostats function for a given class.
- integer.stats: predefined class stats for integer variables.
- numeric.stats: for numeric variables, which also includes integer variables.
- factor.stats: for factors.
When a class.stats function is passed to ldply, variables not matching that class are silently removed.
ldply(iris, numeric.stats, mean, sd)
## .id mean sd
## 1 Sepal.Length 5.843 0.8281
## 2 Sepal.Width 3.057 0.4359
## 3 Petal.Length 3.758 1.7653
## 4 Petal.Width 1.199 0.7622
ldply(iris, factor.stats, N = length)
## .id N
## 1 Species 150
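The class filtering shown above can be approximated in base R. The following is a minimal sketch under stated assumptions, not the package's implementation: stats_for_class and numeric_only are hypothetical names, and non-matching variables are dropped by returning NULL for them.

```r
# Hypothetical helper; a sketch of the filtering idea, not class.stats() itself.
stats_for_class <- function(test) {
  function(x, ...) {
    # Variables that fail the class test contribute nothing to the result.
    if (!test(x)) return(NULL)
    funs <- list(...)
    as.data.frame(lapply(funs, function(f) f(x)))
  }
}

numeric_only <- stats_for_class(is.numeric)

# Apply to every column, then drop the NULL entries before stacking the rows.
pieces <- lapply(iris, numeric_only, mean = mean, sd = sd)
result <- do.call(rbind, Filter(Negate(is.null), pieces))
print(result)
```

The Species column is a factor, so it fails is.numeric and never reaches mean or sd.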
You can also chain these calls together to compute on subsets using ddply:
ddply(iris, .(Species), ldply, numeric.stats,
mean, median, sd)
## Species .id mean median sd
## 1 setosa Sepal.Length 5.006 5.00 0.3525
## 2 setosa Sepal.Width 3.428 3.40 0.3791
## 3 setosa Petal.Length 1.462 1.50 0.1737
## 4 setosa Petal.Width 0.246 0.20 0.1054
## 5 versicolor Sepal.Length 5.936 5.90 0.5162
## 6 versicolor Sepal.Width 2.770 2.80 0.3138
## 7 versicolor Petal.Length 4.260 4.35 0.4699
## 8 versicolor Petal.Width 1.326 1.30 0.1978
## 9 virginica Sepal.Length 6.588 6.50 0.6359
## 10 virginica Sepal.Width 2.974 3.00 0.3225
## 11 virginica Petal.Length 5.552 5.55 0.5519
## 12 virginica Petal.Width 2.026 2.00 0.2747
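For comparison, a similar grouped summary can be put together in base R without plyr. This is a rough sketch rather than an equivalent of the call above: it hard-codes the three statistics and selects the numeric columns by position.

```r
# Split the four numeric columns of iris by species.
by_species <- split(iris[, 1:4], iris$Species)

# Summarize every variable within each group.
summaries <- lapply(by_species, function(d) {
  data.frame(variable = names(d),
             mean = sapply(d, mean),
             median = sapply(d, median),
             sd = sapply(d, sd),
             row.names = NULL)
})

# Prepend the group label to each piece and stack the pieces.
result <- do.call(rbind, Map(cbind, Species = names(summaries), summaries))
print(result)
```

The plyr version is shorter because ddply handles the splitting, labelling, and stacking in one call.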
Passing all these functions around also requires some extra function manipulation functions. That is a mouthful, but it is something we do in R.
R lacks a function composition function, so I created one. Writing function(x) any(is.na(x)) is just too long to type, and I find myself doing things like this far too often. The word “function” is long and takes up lots of space. It is much easier to write compose(any, is.na), which creates a new function testing whether there are any missing values. Composition comes in two forms. The compose form takes any number of arguments and nests them, with the rightmost being the innermost and the leftmost the outermost. The easy way to remember the order is that the functions read the same as they would in the nested call.
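The nesting order can be illustrated with a small base R sketch. The name compose_sketch is hypothetical and this is not the package's implementation; it only assumes that each function takes the previous result as its single argument.

```r
# Hypothetical composition helper: the rightmost function is applied first.
compose_sketch <- function(...) {
  funs <- list(...)
  function(x) {
    # Walk the functions from right to left, feeding each result to the next.
    Reduce(function(value, f) f(value), rev(funs), x)
  }
}

any_missing <- compose_sketch(any, is.na)
any_missing(c(1, NA, 3))   # TRUE
any_missing(c(1, 2, 3))    # FALSE
```

Reading compose_sketch(any, is.na) left to right matches the written-out call any(is.na(x)).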
Composition and dostats only operate on the first argument, which necessitates functions for manipulating arguments.
wargs: creates a new function with changed defaults. For example,
wargs(mean, na.rm=TRUE) creates a new function that automatically removes missing values.
onarg: specifies which argument is treated as the first argument of the function. For example,
onarg(rep, 'times') makes the number of repetitions the first argument.
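Both helpers can be approximated with do.call in base R. The names with_args and on_arg are hypothetical, and the sketch differs from the description above in one respect: with_args fixes the given arguments outright rather than changing defaults that a caller could still override.

```r
# Hypothetical analogue of wargs(): always append the preset arguments.
with_args <- function(f, ...) {
  preset <- list(...)
  function(...) do.call(f, c(list(...), preset))
}

# Hypothetical analogue of onarg(): route the first argument to a named one.
# The formal is called .first so that it does not capture a named x argument.
on_arg <- function(f, arg) {
  function(.first, ...) {
    do.call(f, c(stats::setNames(list(.first), arg), list(...)))
  }
}

mean_no_na <- with_args(mean, na.rm = TRUE)
mean_no_na(c(1, NA, 3))    # 2

rep_times <- on_arg(rep, "times")
rep_times(3, x = "a")      # "a" "a" "a"
```

Functions built this way take their data as the first argument, which is the shape that composition and dostats expect.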
One example of this that is included in dostats is the %contains% operator, which takes its arguments in the reverse order of %in%.
There will likely be more functions as I come across the need for them. If you have an idea for something that should be included, submit it to the issue tracker.