Thomas Lumley

Mentioned 1

A complete guide to carrying out complex survey analysis using R As survey analysis continues to serve as a core component of sociological research, researchers are increasingly relying upon data gathered from complex surveys to carry out traditional analyses. Complex Surveys is a practical guide to the analysis of this kind of data using R, the freely available and downloadable statistical programming language. As creator of the specific survey package for R, the author provides the ultimate presentation of how to successfully use the software for analyzing data from complex surveys while also utilizing the most current data from health and social sciences studies to demonstrate the application of survey research methods in these fields. The book begins with coverage of basic tools and topics within survey analysis such as simple and stratified sampling, cluster sampling, linear regression, and categorical data regression. Subsequent chapters delve into more technical aspects of complex survey analysis, including post-stratification, two-phase sampling, missing data, and causal inference. Throughout the book, an emphasis is placed on graphics, regression modeling, and two-phase designs. In addition, the author supplies a unique discussion of epidemiological two-phase designs as well as probability-weighting for causal inference. All of the book's examples and figures are generated using R, and a related Web site provides the R code that allows readers to reproduce the presented content. Each chapter concludes with exercises that vary in level of complexity, and detailed appendices outline additional mathematical and computational descriptions to assist readers with comparing results from various software systems. Complex Surveys is an excellent book for courses on sampling and complex surveys at the upper-undergraduate and graduate levels. It is also a practical reference guide for applied statisticians and practitioners in the social and health sciences who use statistics in their everyday work.

Mentioned in questions and answers.

I am coding a survey that outputs a .csv file. Within this csv I have some entries that are space delimited, which represent multi-select questions (e.g. questions with more than one response). In the end I want to parse these space delimited entries into their own columns and create headers for them so i know where they came from.

For example I may start with this (note that the multiselect columns have an _M after them):

```
Q1, Q2_M, Q3, Q4_M
6, 1 2 88, 3, 3 5 99
6, , 3, 1 2
```

and I want to go to this:

```
Q1, Q2_M_1, Q2_M_2, Q2_M_88, Q3, Q4_M_1, Q4_M_2, Q4_M_3, Q4_M_5, Q4_M_99
6, 1, 1, 1, 3, 0, 0, 1, 1, 1
6,,,,3,1,1,0,0,0
```

I imagine this is a relatively common issue to deal with but I have not been able to find it in the R section. Any ideas how to do this in R after importing the .csv ? My general thoughts (which often lead to inefficient programs) are that I can: (1) pull column numbers that have the special suffix with grep() (2) loop through (or use an apply) each of the entries in these columns and determine the levels of responses and then create columns accordingly (3) loop through (or use an apply) and place indicators in appropriate columns to indicate presence of selection

I appreciate any help and please let me know if this is not clear.

I am not entirely sure what you trying to do respectively what your reasons are for coding like this. Thus my advice is more general – so just feel to clarify and I will try to give a more concrete response.

1) I say that you are coding the survey on your own, which is great because it means you have influence on your .csv file. I would NEVER use different kinds of separation in the same .csv file. Just do the naming from the very beginning, just like you suggested in the second block. Otherwise you might geht into trouble with checkboxes for example. Let's say someone checks 3 out of 5 possible answers, the next only checks 1 (i.e. "don't know") . Now it will be much harder to create a spreadsheet (data.frame) type of results view as opposed to having an empty field (which turns out to be an NA in R) that only needs to be recoded.

2) Another important question is whether you intend to do a panel survey(i.e longitudinal study asking the same participants over and over again) . That (among many others) would be a good reason to think about saving your data to a MySQL database instead of .csv . RMySQL can connect directly to the database and access its tables and more important its VIEWS. Views really help with survey data since you can rearrange the data in different views, conditional on many different needs.

3) Besides all the personal / opinion and experience, here's some (less biased) literature to get started: Complex Surveys: A Guide to Analysis Using R (Wiley Series in Survey Methodology

The book is comparatively simple and leaves out panel surveys but gives a lot of R Code and examples which should be a practical start.

To prevent re-inventing the wheel you might want to check LimeSurvey, a pretty decent (not speaking of the templates :) ) tool for survey conductors. Besides I TYPO3 CMS extensions pbsurvey and ke_questionnaire (should) work well too (only tested pbsurvey).