A Study Journey of Cleaning Data from a Household Survey in Rural China

  • 08 November 2023  |  Stories

The summer vacation of 2023 was full of opportunities and challenges for me, because during this holiday I tried a new field of scientific research: cleaning data from a household survey in rural China.
The dataset, named the China Rural Development Survey, comes from the project on High-quality development of agriculture and rural areas in China. It is a panel dataset based on long-term follow-up surveys conducted by the research team headed by Dr. Linxiu Zhang, Director of UNEP-IEMP. Since 2005, the research team has conducted six waves of follow-up surveys covering the full sample of 2,000 households in 100 villages in 25 counties across 5 provinces in China. This panel dataset contains rich information, especially on the employment of rural labor, agricultural production, the land rental market and plot investment, the rural environment, and public investment. The project aims to provide science-based policy recommendations for the high-quality and sustainable development of agriculture and rural areas in China.

Before I started, I believed that my study of microeconomics in school and my prior experience with economics projects would make this work straightforward. However, once I delved into the subject matter, I recognized the true complexity of the task at hand. Because the dataset is so large, the work was divided up: I was assigned to clean the data on the financial assets of rural households, working with a group of master's and Ph.D. students. To clean the data correctly and efficiently, we were all trained for three days on the questionnaire and on using the Stata software.

Stata is a statistical analysis tool that is highly effective for cleaning raw data and for helping data scientists address complex questions. Stata’s capabilities in handling large volumes of data are akin to those of a superhero. Here, I will share my experience of learning how to use Stata and how it has become a valuable assistant.

Firstly, understanding the fundamental concepts of Stata is crucial. Becoming familiar with Stata commands and data manipulation techniques is akin to acquainting oneself with the tools in a handyman's toolkit. During my initial training, Professor Yunli Bai taught us a great deal about operating Stata. Specifically, she explained the importance of data cleaning, how to open a do-file and write code, and how to handle outliers. It can be straightforward to begin, but writing multiple complicated commands can lead to confusion. As a result, it took me a couple of days of practice, with my teacher’s assistance, before I could work independently.

Next, I delved into data cleaning. Stata's capabilities enabled me to identify outliers and missing information in the household questionnaire dataset. It was akin to donning special spectacles to unveil concealed patterns. This stage was pivotal in determining which issues had to be resolved before proceeding further. However, mastering this technique was challenging, as even the slightest operational error can cause the data cleaning to fail. I sought my teacher's advice on several occasions and, with patience, eventually learned the method and found great enjoyment in it.

Data cleaning is a basic and important part of data analysis. Using my newly acquired Stata skills, I located missing information, fixed logical inconsistencies, and organized the code and data properly. It was like tidying up a cluttered room, ensuring that each item had its assigned spot. Stata’s helpful features, such as loops and conditional statements, simplified repetitive cleaning. Once the cleaning was over, I reviewed the new dataset to make sure all variables were accurate. It was like reviewing an important document: you need to spot any errors before showing your results. It's fantastic to see how Stata simplifies testing and confirming the accuracy of the cleaned data.
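The do-files from this project are not reproduced here. As a rough illustration only, written in Python rather than Stata, the sketch below shows how a loop and a few conditional statements might recode missing values, flag suspicious entries, and run a logic check. The variable names (`hh_id`, `deposits`, `has_loan`, `loan_amount`), the sample rows, and the missing-value codes are all hypothetical and do not come from the actual survey.

```python
# Hypothetical placeholder codes that stand for "no answer" in the raw data.
MISSING_CODES = {-99, -999}


def clean_records(records, numeric_vars):
    """Return cleaned copies of the records plus a list of flagged problems."""
    cleaned, problems = [], []
    for row in records:
        row = dict(row)  # work on a copy so the raw data stays untouched
        for var in numeric_vars:  # loop: apply the same rule to every variable
            value = row.get(var)
            if value in MISSING_CODES:  # conditional: recode placeholder codes
                row[var] = None
            elif value is not None and value < 0:
                problems.append((row["hh_id"], var, "negative value"))
        # Logic check: a household reporting no loan should not report an amount.
        if row.get("has_loan") == 0 and (row.get("loan_amount") or 0) > 0:
            problems.append((row["hh_id"], "loan_amount", "amount without loan"))
        cleaned.append(row)
    return cleaned, problems


def count_missing(records, numeric_vars):
    """Count missing (None) values per variable, for a final review."""
    return {v: sum(1 for r in records if r.get(v) is None) for v in numeric_vars}


# Invented sample rows, for illustration only.
records = [
    {"hh_id": 1, "deposits": 5000, "has_loan": 0, "loan_amount": 0},
    {"hh_id": 2, "deposits": -99, "has_loan": 0, "loan_amount": 2000},
    {"hh_id": 3, "deposits": -50, "has_loan": 1, "loan_amount": None},
]
numeric_vars = ["deposits", "loan_amount"]

cleaned, problems = clean_records(records, numeric_vars)
missing = count_missing(cleaned, numeric_vars)
```

In this sketch, flagged entries are only listed, not changed automatically, so a person can check each one against the questionnaire before deciding how to correct it.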

Reflecting on this study journey, my most valuable takeaways are the importance of meticulousness, methodical troubleshooting, and perseverance. Having cleaned rural household survey data with Stata, I have come to appreciate how critical exactness in data organization is for researchers. The learning process showed me that cleaning data isn’t just a task; it’s a vital step in ensuring precise analyses and outcomes. Stata proved to be a helpful companion, streamlining the process and revealing the stories concealed in the farmers' survey responses. Throughout this study experience, I discovered that I have a strong interest in handling data. As I finish this part, I thank my teachers for the knowledge and fresh ideas I gained from learning about farmers and rural communities.

Author: Yingfei Cao, Senior high school student, Beijing Chaoyang RCF Dongba School