class: clear, title-slide, inverse, center, top, middle

# Foundations Lab 1 - Data Structures
----
### Jeanne McClure, Jennifer Houchins
### September 12, 2022

---

# Recap from Orientation lab

--

- Well Documented

--

- Clear Communication

--

- Reproducible Theory

???

In the orientation lab we learned about reproducible research: keep your code well documented, communicate clearly, and follow the reproducible theory.

We also learned about the open movement and its importance for reproducible research. From your pre-reading, Krumm (2018) sums it up nicely, stating, "The basic idea behind the open data movement is that anyone can access or use a data set, and key to this movement is not just accessibility but usability. Making open data usable means making it accessible in machine-readable, structured, granular, and well-documented formats."

---

# Agenda

.pull-left[**Part 1: Conceptual Overview**
- Types of Data Used in LA (Sources)
- Characteristics of Data (Format)
]

.pull-right[**Part 2: Code-Along**
- Read in Data
- View Data

.footnote[Pre-Reading: [Learning Analytics Goes to School (Ch. 2, pp. 28-41) by Andrew Krumm, Barbara Means, Marie Bienkowski](https://github.com/laser-institute/essential-readings/blob/main/foundation_labs/foundlab_1/krumm_2018.pdf)
]
]

???

In this session's conceptual overview we are going to learn more about the types and characteristics of data structures common in Learning Analytics. Many of these you will be familiar with, and some may be new.

In our code-along we will continue learning how to read in data using various packages from the `tidyverse`. We will also learn to view data as both a tibble and a data frame.

---

class: clear, inverse, middle, center

Part 1:
----
Data Structures: Conceptual Overview

---

# Types of Data Used in LA (Sources)

.center[
<img src="img/data_LA.png" height="400px"/>
]

???

Although we are going to talk about these types of data in silos, in practice researchers find value in using them together. The types of data we are looking at are digital learning environments, administrative data, and sensors and multimodal data.

These types of data categories typically show:

- Indicators of a school's success and progress
- Student progress, student behavior, and student learning
- Demographic or social media sources

---

# Digital Learning Environments

.panelset[

.panel[.panel-name[Definition]
.pull-left[Digital learning environments include any set of technology-based methods that can be applied to support learning and instruction (Ifenthaler, 2017).
]
.pull-right[
<img src="img/digital_LA.png" height="300px"/>
]
]

.panel[.panel-name[Games+]
.pull-left[**Games and Simulations:**
- Capture real-time interactions
- Cognitive outcomes
- Behavior outcomes
- Affective outcomes
]
.pull-right[
<img src="img/crystal_island.png" height="300px"/>
]
]

.panel[.panel-name[ITS]
.pull-left[**Intelligent Tutoring Systems**
1. An expert, or domain model
2. A student model
3. An instructional, or pedagogical model
]
.pull-right[
<img src="img/duolingo.png" height="350px"/>
]
]

.panel[.panel-name[LMS]
.pull-left[Learning Management Systems - "web-based systems that allow instructors and students to share instructional materials, make class announcements, submit and return course assignments, and communicate with each other online."
(Krumm et al., 2018)
]
.pull-right[
<img src="img/canvas.png" height="300px"/>
]
]

.panel[.panel-name[MOOCs]
.pull-left[
- Free/freemium courses, certificates, PD
  - edX
  - Coursera
  - Friday Institute
- Collect granular data like ITS
]
.pull-right[
<img src="img/fi_mooc.png" height="300px"/>
]
]
]

???

**Definition tab**

Digital learning environments include any set of technology-based methods that can be applied to support learning and instruction. According to Krumm et al., digital learning environments have fueled data-intensive research in education more than any other data source. Within the DLE category we include:

- Games and simulations
- Intelligent tutoring systems
- Learning management systems, and
- MOOCs

**Games tab**

The data collected in games and simulations capture real-time interactions along with cognitive, behavioral, and affective outcomes. These data can be used for:

- Early, iterative user-testing
- Surfacing basic player interactions for improved core mechanics
- UI/UX and learning design
- Structure discovery and relationship mining

Crystal Island: Uncharted Discovery is an educational video game created by a team of educators and computer scientists at the Center for Educational Informatics, aimed at teaching upper elementary science and focusing on landforms, navigation, and modeling. In the game, students play as shipwrecked adventurers on a volcanic island. To escape the island, they must complete a series of quests that test their critical thinking skills and teach them content-related information. Upon completion of the quests, the players gain access to a new area of the island, which contains a multi-skill quest that requires students to use the knowledge and skills they learned during the previous quests. Once the final quest is completed, the players gain access to a communication device which allows them to call the outside world for rescue. Crystal Island fuses automated reasoning and human language technologies with game technologies and new media to investigate high-impact adaptive learning environments.

**Intelligent tutoring tab**

Intelligent tutoring systems (ITS for short) are a type of DLE that applies AI to students' interactions in the digital system. An ITS includes three main models:

1. an expert, or domain, model, which organizes the skills and strategies in the domain,
2. a student model of what a student understands about the domain, inferred from their performance on learning tasks, and
3. an instructional, or pedagogical, model that includes common mistakes and misconceptions along with a corresponding feedback strategy.

Intelligent tutoring systems are online teaching tools that can generate student feedback with no human interaction. The same data that the ITS uses to figure out how to respond to a student's actions can also be used by human analysts to gain a detailed picture of the learning processes and the behaviors learners engage in.

The Duolingo app uses an AI technique known as spaced repetition. This approach delivers personalized language lessons over longer intervals for optimal learning rather than cramming lessons into a short period of time.

**LMS tab**

We are all probably pretty familiar with learning management systems (LMS). A learning management system is integrated online software used for creating, delivering, tracking, and reporting educational courses and outcomes. An LMS contributes trace data, survey data, discussion posts, open-ended questions, and video diaries.
Tools allow instructors to provide feedback to students. LMSs are often used as *student-facing repositories* of digital resources and activities. At North Carolina State University we use Moodle. What LMS do you use at your institution? What types of data does it generate?

**MOOCs tab**

Raise your hand if you have ever participated in a MOOC. The Friday Institute has many free MOOCs available, and you can also take MOOCs through edX and Coursera. These offer a major source of data because of their large numbers of participants.

The Friday Institute's free MOOCs offer professional development for in-service and pre-service K-12 educators. For more information, check out the website included in the speaker notes: https://place.fi.ncsu.edu/

Data from MOOCs allow a closer look at how participants interact within the course's social network. Have you ever pulled data from your own MOOC or learning management system? What types of insights did you analyze?

---

# Administrative Data

.panelset[

.panel[.panel-name[Definition]
.pull-left[Types of state and federal systems used in schools and districts to collect and store information to deliver some *sort* of service.]
.pull-right[
<img src="img/Admin_LA.png" height="300px"/>
]
]

.panel[.panel-name[SIS]
.pull-left[Student Information Systems:
- Student level
- Academic performance
- Teacher- and administrator-facing repository
- Technology-based intervention
]
.pull-right[
<img src="img/powerschool.png" height="300px"/>
]
]

.panel[.panel-name[SLDS]
.pull-left[
- Connects state-wide
- Accountability and reporting
]
.pull-right[
<img src="img/slds.png" height="300px"/>
]
]

???

**Definition tab**

Administrative data are the second type of data fueling data-intensive research. They are used in schools, at the district level, and throughout most states. There are two main types:

- Student information systems (SIS)
- Statewide longitudinal data systems (SLDS)

**Student information systems tab**

Student information systems hold student demographic and learning-outcome data, at the student level and as teacher- and administrator-facing repositories.

- A ready source of data on educational outcomes (e.g., grades) and demographic information, which can play a role in evaluations of technology-based interventions as well as early warning system research.

**Statewide longitudinal data systems tab**

**Statewide longitudinal data systems** are a continuum of data that includes standardized test performance, attendance, and major behavioral infractions, tied to a unique statewide identifier for every student. An SLDS can link the state K-12 system with the higher education system and stores each student's demographic characteristics and enrollment history, including all scores on state accountability tests.

Krumm et al. additionally mention the importance of this data infrastructure for examining "educational outcomes at the scale of an entire state" (p. 34). State-level and university-based researchers increasingly leverage these data for accountability and reporting purposes as well as district- and school-improvement purposes. Typically they are not available to the public, but rather obtained through an IRB and an MOA with DPI (lots of acronyms).

The North Carolina public education statewide longitudinal data system (SLDS) is presented through publicly available resources on public primary, secondary, and higher education, made available to the public through the National Center for Education Statistics (NCES): https://childandfamilypolicy.duke.edu/north-carolina-education-research-data/

Or a free version through EPIC at UNC:
https://childandfamilypolicy.duke.edu/north-carolina-education-research-data/ ] --- # Sensors and Multimodal .panelset[ .panel[.panel-name[Sensors and recording devices] .pull-left[ - Location - Physical movement - Speech data ] .pull-right[ - Capture biometric data - Parse audio - Video recordings using machine learning and artificial intelligence techniques. ] ] .panel[.panel-name[Multimodal] - Identifying multiple affective - Engagement-related states - Detect facial expressions - Speech recognition - Speech prosody ] ??? **sensor and recording tab** How many of you have an apple watch, or a fitbit or Samsung? These watches measure different metrics surrounding our health and fitness. The data can then be used to explain or predict our **health** Senors and video recordings collect data by capturing biometric data, parse audio and from video recordings using machine learning and artificial intelligence techniques. **Multimodal-tab** Krumm et al. suggest that multimodal investigation identify multiple affective and engagement-related states, detect facial expressions and use audio, and video, for speech recognition and speech prosody. - Understanding what learners do as they engage in learning tasks can drive digital learning environment adaptations. ] --- # Characteristics of Data .panelset[ .panel[.panel-name[Types] .center[ <img src="img/data_types.png" height="400px"/> ] ] .panel[.panel-name[Structured] <table> <thead> <tr> <th style="text-align:right;"> student_id </th> <th style="text-align:left;"> course_id </th> <th style="text-align:right;"> total_points_possible </th> <th style="text-align:right;"> total_points_earned </th> <th style="text-align:right;"> percentage_earned </th> <th style="text-align:left;"> subject </th> <th style="text-align:left;"> semester </th> <th style="text-align:left;"> section </th> <th style="text-align:left;"> Gradebook_Item </th> <th style="text-align:left;"> Grade_Category </th> <th style="text-align:right;"> FinalGradeCEMS </th> <th style="text-align:right;"> Points_Possible </th> <th style="text-align:right;"> Points_Earned </th> <th style="text-align:left;"> Gender </th> <th style="text-align:right;"> q1 </th> <th style="text-align:right;"> q2 </th> <th style="text-align:right;"> q3 </th> <th style="text-align:right;"> q4 </th> <th style="text-align:right;"> q5 </th> <th style="text-align:right;"> q6 </th> <th style="text-align:right;"> q7 </th> <th style="text-align:right;"> q8 </th> <th style="text-align:right;"> q9 </th> <th style="text-align:right;"> q10 </th> <th style="text-align:right;"> TimeSpent </th> <th style="text-align:right;"> TimeSpent_hours </th> <th style="text-align:right;"> TimeSpent_std </th> <th style="text-align:right;"> int </th> <th style="text-align:right;"> pc </th> <th style="text-align:right;"> uv </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 43146 </td> <td style="text-align:left;"> FrScA-S216-02 </td> <td style="text-align:right;"> 3280 </td> <td style="text-align:right;"> 2220 </td> <td style="text-align:right;"> 0.6768293 </td> <td style="text-align:left;"> FrScA </td> <td style="text-align:left;"> S216 </td> <td style="text-align:left;"> 02 </td> <td style="text-align:left;"> POINTS EARNED & TOTAL COURSE POINTS </td> <td style="text-align:left;"> NA </td> <td style="text-align:right;"> 93.45372 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> NA </td> <td style="text-align:left;"> M </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 4 </td> 
<td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 1555.1667 </td> <td style="text-align:right;"> 25.9194450 </td> <td style="text-align:right;"> -0.1805150 </td> <td style="text-align:right;"> 5.0 </td> <td style="text-align:right;"> 4.5 </td> <td style="text-align:right;"> 4.333333 </td> </tr> <tr> <td style="text-align:right;"> 44638 </td> <td style="text-align:left;"> OcnA-S116-01 </td> <td style="text-align:right;"> 3531 </td> <td style="text-align:right;"> 2672 </td> <td style="text-align:right;"> 0.7567261 </td> <td style="text-align:left;"> OcnA </td> <td style="text-align:left;"> S116 </td> <td style="text-align:left;"> 01 </td> <td style="text-align:left;"> ATTEMPTED </td> <td style="text-align:left;"> NA </td> <td style="text-align:right;"> 81.70184 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:left;"> F </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 1382.7001 </td> <td style="text-align:right;"> 23.0450017 </td> <td style="text-align:right;"> -0.3078031 </td> <td style="text-align:right;"> 4.2 </td> <td style="text-align:right;"> 3.5 </td> <td style="text-align:right;"> 4.000000 </td> </tr> <tr> <td style="text-align:right;"> 47448 </td> <td style="text-align:left;"> FrScA-S216-01 </td> <td style="text-align:right;"> 2870 </td> <td style="text-align:right;"> 1897 </td> <td style="text-align:right;"> 0.6609756 </td> <td style="text-align:left;"> FrScA </td> <td style="text-align:left;"> S216 </td> <td style="text-align:left;"> 01 </td> <td style="text-align:left;"> POINTS EARNED & TOTAL COURSE POINTS </td> <td style="text-align:left;"> NA </td> <td style="text-align:right;"> 88.48758 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> NA </td> <td style="text-align:left;"> M </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 860.4335 </td> <td style="text-align:right;"> 14.3405583 </td> <td style="text-align:right;"> -0.6932595 </td> <td style="text-align:right;"> 5.0 </td> <td style="text-align:right;"> 4.0 </td> <td style="text-align:right;"> 3.666667 </td> </tr> <tr> <td style="text-align:right;"> 47979 </td> <td style="text-align:left;"> OcnA-S216-01 </td> <td style="text-align:right;"> 4562 </td> <td style="text-align:right;"> 3090 </td> <td style="text-align:right;"> 0.6773345 </td> <td style="text-align:left;"> OcnA </td> <td style="text-align:left;"> S216 </td> <td style="text-align:left;"> 01 </td> <td style="text-align:left;"> POINTS EARNED & TOTAL COURSE 
POINTS </td> <td style="text-align:left;"> NA </td> <td style="text-align:right;"> 81.85260 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:left;"> M </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 1598.6166 </td> <td style="text-align:right;"> 26.6436100 </td> <td style="text-align:right;"> -0.1484470 </td> <td style="text-align:right;"> 5.0 </td> <td style="text-align:right;"> 3.5 </td> <td style="text-align:right;"> 5.000000 </td> </tr> <tr> <td style="text-align:right;"> 48797 </td> <td style="text-align:left;"> PhysA-S116-01 </td> <td style="text-align:right;"> 2207 </td> <td style="text-align:right;"> 1910 </td> <td style="text-align:right;"> 0.8654282 </td> <td style="text-align:left;"> PhysA </td> <td style="text-align:left;"> S116 </td> <td style="text-align:left;"> 01 </td> <td style="text-align:left;"> POINTS EARNED & TOTAL COURSE POINTS </td> <td style="text-align:left;"> NA </td> <td style="text-align:right;"> 84.00000 </td> <td style="text-align:right;"> 438 </td> <td style="text-align:right;"> 399 </td> <td style="text-align:left;"> F </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 1481.8000 </td> <td style="text-align:right;"> 24.6966667 </td> <td style="text-align:right;"> -0.2346629 </td> <td style="text-align:right;"> 3.8 </td> <td style="text-align:right;"> 3.5 </td> <td style="text-align:right;"> 3.500000 </td> </tr> <tr> <td style="text-align:right;"> 51943 </td> <td style="text-align:left;"> FrScA-S216-03 </td> <td style="text-align:right;"> 4208 </td> <td style="text-align:right;"> 3596 </td> <td style="text-align:right;"> 0.8545627 </td> <td style="text-align:left;"> FrScA </td> <td style="text-align:left;"> S216 </td> <td style="text-align:left;"> 03 </td> <td style="text-align:left;"> POINTS EARNED & TOTAL COURSE POINTS </td> <td style="text-align:left;"> NA </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> NA </td> <td style="text-align:left;"> F </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> 3.4501 </td> <td style="text-align:right;"> 0.0575017 </td> <td style="text-align:right;"> -1.3257521 </td> <td style="text-align:right;"> 4.6 </td> <td style="text-align:right;"> 4.0 </td> <td style="text-align:right;"> 4.000000 </td> </tr> </tbody> </table> ] .panel[.panel-name[Unstructured] .center[ <img 
src="img/unstructured.png" height="400px"/>
]
]

.panel[.panel-name[Semi-Structured]
.center[
<img src="img/semi.png" height="400px"/>
]
]

.panel[.panel-name[Meta-Data]
.center[
<img src="img/meta_dat.png" height="300px"/>
]
]

???

**TYPES TAB**

In data-intensive research we might acquire data from an LMS, an SIS, or another source. The data may be too big for typical database software tools to capture, store, manage, and analyze. Those data are similar but differ in (1) the tasks that students are engaged in and (2) how data from those tasks are collected and stored by the technology.

We identify four main types of data structures:

- structured
- unstructured
- semi-structured, and
- meta-data

**STRUCTURED TAB**

Structured data are quantitative in nature. Here we see the rows and columns nice and tidy.

- Data are tidy when each column is a variable, each row is an observation, and each cell is a single value. We typically have tidy data in Excel files or SQL databases that we can import with `readr` functions, often building the file path with the `here` package.
- Tabular, relational data. Examples might include:
  - performance data (current and prior)
  - demographic data (a student's ethnic group or disability status)

Each field (variable) is discrete and can be accessed separately or jointly along with data from other fields (variables).

**UNSTRUCTURED TAB**

Unstructured data are qualitative in nature and include:

- Non-relational data
- Sensor data, web logs, clickstreams, keystroke capture data, emails, images, and videos

Unstructured data are often text heavy, with relationships between items (e.g., audio, video, dates) that do not fit neatly into a SQL database.

**SEMI-STRUCTURED TAB**

Semi-structured data include qualitative, triangulated data (e.g., survey and interview answers). Semi-structured data are a form of structured data that does not conform to the formal structure of data models associated with relational databases or other forms of data tables. Semi-structured data can also include log data. It is a self-describing structure that contains tags; examples include social media data, JSON and XML files, and social network data.

**META-DATA TAB**

Not in the readings, but something to think about, is metadata: fields such as the dates, times, and locations of photos and videos, and other structured data about data.

- the contexts and purposes for which the data were collected

You can read more about metadata standards from the Canadian Heritage Information Network. The link is in the presenter notes. https://www.canada.ca/en/heritage-information-network/services/collections-documentation-standards/chin-guide-museum-standards/metadata-data-structure.html
]

---

class: clear, inverse, middle, center

Part 2:
----
Code-Along

---

# Set Working Directory


```r
# sets your working directory
setwd("/path/to/my/directory")

# tells you what working directory you are in
getwd()
```

???

Although we are using projects, we want to make sure you are aware of setting the working directory. Continuing on from the orientation code-along, we will use "good programming practices" to learn more about reading in data. Again we are using a script file in RStudio. You will become familiar with R Markdown files in the next foundation lab.

The default working directory is the folder RStudio opens in every time you start it. You can change the default working directory from the RStudio menu under Tools -> Global Options -> click "Browse" to select the default working directory you want. Although you can also set your working directory from the Session menu, it is better to set it with code, for reproducibility, using the `setwd()` function.
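If you prefer a project-based workflow (described next), here is a minimal sketch using the `here` package; it assumes `here` is installed and that the lab is opened as an RStudio Project containing the same `data/` folder used later in this deck.

```r
# A minimal sketch, assuming the here package is installed and the lab
# is opened as an RStudio Project: here() builds paths from the project
# root, so the code runs on any machine without a setwd() call.
library(here)

here()                                 # prints the project root
here("data", "sci-online-classes.csv") # builds a path to a file in the data/ folder
```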
You can also create a project and connect to your github account. Then open up the project which also sets the directory. --- # Read in Data .panelset[ .panel[.panel-name[package] ```r # Load tidyverse package *library(tidyverse) # could load only readr *library(readr) ``` ] .panel[.panel-name[functions] Common readr functions <img src="img/readr_function.png" height="300px"/> ] .panel[.panel-name[readr] # Using the readr function . Use `read_csv()` function to read in CSV. .pull-left[ ```r *online_classes <- read_csv("data/sci-online-classes.csv") head(online_classes, n=5) ``` ``` ## # A tibble: 5 × 30 ## student_id course_id total_points_possi… total_points_ea… percentage_earn… ## <dbl> <chr> <dbl> <dbl> <dbl> ## 1 43146 FrScA-S216-02 3280 2220 0.677 ## 2 44638 OcnA-S116-01 3531 2672 0.757 ## 3 47448 FrScA-S216-01 2870 1897 0.661 ## 4 47979 OcnA-S216-01 4562 3090 0.677 ## 5 48797 PhysA-S116-01 2207 1910 0.865 ## # … with 25 more variables: subject <chr>, semester <chr>, section <chr>, ## # Gradebook_Item <chr>, Grade_Category <lgl>, FinalGradeCEMS <dbl>, ## # Points_Possible <dbl>, Points_Earned <dbl>, Gender <chr>, q1 <dbl>, ## # q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>, q7 <dbl>, q8 <dbl>, ## # q9 <dbl>, q10 <dbl>, TimeSpent <dbl>, TimeSpent_hours <dbl>, ## # TimeSpent_std <dbl>, int <dbl>, pc <dbl>, uv <dbl> ``` ] .pull-right[ <img src="img/delimiter.png" height="300px"/> ] ] .panel[.panel-name[readxl] # read in excel files Read in with `readxl` package ```r *library(readxl) csss_tweets <- read_excel("data/csss_tweets.xlsx") head(csss_tweets, n = 5) ``` ``` ## # A tibble: 5 × 91 ## user_id status_id created_at screen_name text source ## <chr> <chr> <dttm> <chr> <chr> <chr> ## 1 1331246991762976769 136572200862… 2021-02-27 17:54:35 InnerSchol… "@We… Twitt… ## 2 1331246991762976769 136572187371… 2021-02-27 17:54:03 InnerSchol… "@Bo… Twitt… ## 3 1331246991762976769 136572178780… 2021-02-27 17:53:42 InnerSchol… "@Co… Twitt… ## 4 1331246991762976769 136572174606… 2021-02-27 17:53:32 InnerSchol… "@Co… Twitt… ## 5 1331246991762976769 136572164488… 2021-02-27 17:53:08 InnerSchol… "Ano… Twitt… ## # … with 85 more variables: display_text_width <dbl>, reply_to_status_id <chr>, ## # reply_to_user_id <chr>, reply_to_screen_name <chr>, is_quote <lgl>, ## # is_retweet <lgl>, favorite_count <dbl>, retweet_count <dbl>, ## # quote_count <lgl>, reply_count <lgl>, hashtags <lgl>, symbols <lgl>, ## # urls_url <lgl>, urls_t.co <lgl>, urls_expanded_url <lgl>, media_url <lgl>, ## # media_t.co <lgl>, media_expanded_url <lgl>, media_type <lgl>, ## # ext_media_url <lgl>, ext_media_t.co <lgl>, ext_media_expanded_url <lgl>, … ``` ] .panel[.panel-name[functions] Check what sheets are included. 

```r
?read_excel

*excel_sheets("data/csss_tweets.xlsx")
```

```
## [1] "Sheet1"
```

```r
csss_tweets <- read_excel("data/csss_tweets.xlsx", sheet = "Sheet1")

head(csss_tweets, n = 5)
```

```
## # A tibble: 5 × 91
##   user_id             status_id     created_at          screen_name text  source
##   <chr>               <chr>         <dttm>               <chr>       <chr> <chr>
## 1 1331246991762976769 136572200862… 2021-02-27 17:54:35 InnerSchol…  "@We… Twitt…
## 2 1331246991762976769 136572187371… 2021-02-27 17:54:03 InnerSchol…  "@Bo… Twitt…
## 3 1331246991762976769 136572178780… 2021-02-27 17:53:42 InnerSchol…  "@Co… Twitt…
## 4 1331246991762976769 136572174606… 2021-02-27 17:53:32 InnerSchol…  "@Co… Twitt…
## 5 1331246991762976769 136572164488… 2021-02-27 17:53:08 InnerSchol…  "Ano… Twitt…
## # … with 85 more variables: display_text_width <dbl>, reply_to_status_id <chr>,
## #   reply_to_user_id <chr>, reply_to_screen_name <chr>, is_quote <lgl>,
## #   is_retweet <lgl>, favorite_count <dbl>, retweet_count <dbl>,
## #   quote_count <lgl>, reply_count <lgl>, hashtags <lgl>, symbols <lgl>,
## #   urls_url <lgl>, urls_t.co <lgl>, urls_expanded_url <lgl>, media_url <lgl>,
## #   media_t.co <lgl>, media_expanded_url <lgl>, media_type <lgl>,
## #   ext_media_url <lgl>, ext_media_t.co <lgl>, ext_media_expanded_url <lgl>, …
```
]

???

**PACKAGES-TAB**

In order to read in data we must load packages that allow us to do so, as we did in the orientation lab. Although there are read-in capabilities in the `utils` package in base R, we are going to use a more modern approach. One of the packages in the tidyverse allows for reading in comma-delimited files (CSV).

*Reminder if needed - this was discussed in the orientation lab.* R comes with several packages loaded from base R when you start a session. Packages are collections of functions, objects, and data sets organized around a theme or purpose. You may find a package whose functions are built for correlations, or a package with functions that produce better tables, like the `kableExtra` package. Or you may find a package that makes wrangling (processing) data easier than base R. That is the suite of packages we are loading here: the tidyverse. The tidyverse is a suite of packages that share a common philosophy and are designed to work together (think Google or Microsoft product suites), covering data importing, data wrangling, and plotting.

**FUNCTIONS-TAB**

Common arguments used when you read in data are:

1. `file`: the path to the data. It should be in (or relative to) the working directory.
2. `col_names`: whether the first row holds the variable names. If TRUE, the first row of the input is used for the column names and is not included in the data frame. If FALSE, column names are generated automatically: X1, X2, X3, etc.
3. `na`: the strings that represent missing values. Commonly you will see a capital NA, but you may also see quotation marks or only a period, depending on the original database (Excel, SAS, Stata, etc.).
4. `skip`: lets you skip down to where your data actually start.
5. `col_types`: not typically needed but good to know. R guesses what column types you have when it reads in the file, using the default `guess_max`. This is convenient (and fast) but not robust; if the guessing fails, you'll need to increase `guess_max` or supply the correct types yourself.
6. `guess_max`: if the guessing fails, increase this number beyond the 1000-row default so the column types are guessed correctly.

**READR-TAB**

`readr` is a quick way to read in comma-separated values (CSV) and tab-separated values (TSV) files. We read in from the path, combining the arguments above as needed; a sketch follows in these notes.
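A minimal, illustrative sketch of those arguments combined in one `read_csv()` call (the file name and argument values here are hypothetical, not the lab data):

```r
# Illustrative only: the file name and argument values below are
# hypothetical, shown just to make the read_csv() arguments concrete.
library(readr)

my_data <- read_csv(
  "data/my-export.csv",          # hypothetical file inside the data/ folder
  col_names = TRUE,              # first row holds the variable names
  na = c("", "NA", "."),         # strings treated as missing values
  skip = 0,                      # rows to skip before the data start
  col_types = cols(
    student_id = col_double(),   # set a couple of types explicitly...
    course_id  = col_character()
  ),                             # ...remaining columns are still guessed
  guess_max = 1000               # rows used when guessing column types
)
```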
My data is stored in a folder called data. I then use the assignment operator to save it as a new object called `online_classes`.

- (Instructor demonstrates importing from a file.) A quick way to see where your data file is stored is by clicking on the file in the Files pane and then choosing Import Dataset.

**READXL-TAB**

To read in an Excel file we need to install the `readxl` package and then read in with the `read_excel()` function. We can see that it is an Excel file from the .xlsx extension.

**FUNCTIONS-TAB**

You can check what sheets are included in your Excel file with the `excel_sheets()` function. Then you can read in only that sheet. Our data include only one sheet, so we do not really need this function, but it's good to know.
]

---

# Read in data cont.

.panelset[

.panel[.panel-name[path]

Read in from a path

```r
sci_online_classes <- read_csv("data/sci-online-classes.csv")
```
]

.panel[.panel-name[url]

Read in from a URL with the `read_csv2()` function.

```r
*air_quality <- read_csv2("https://www4.stat.ncsu.edu/~online/datasets/AirQualityUCI.csv")

head(air_quality, n = 3)
```

```
## # A tibble: 3 × 17
##   Date       Time   `CO(GT)` `PT08.S1(CO)` `NMHC(GT)` `C6H6(GT)` `PT08.S2(NMHC)`
##   <chr>      <chr>     <dbl>         <dbl>      <dbl>      <dbl>           <dbl>
## 1 10/03/2004 18.00…      2.6          1360        150       11.9            1046
## 2 10/03/2004 19.00…      2            1292        112        9.4             955
## 3 10/03/2004 20.00…      2.2          1402         88        9               939
## # … with 10 more variables: `NOx(GT)` <dbl>, `PT08.S3(NOx)` <dbl>,
## #   `NO2(GT)` <dbl>, `PT08.S4(NO2)` <dbl>, `PT08.S5(O3)` <dbl>, T <dbl>,
## #   RH <dbl>, AH <dbl>, ...16 <lgl>, ...17 <lgl>
```
]

.panel[.panel-name[.dta file]

Read in a Stata file with the `read_dta()` function

```r
library(haven)

*gpa_dt <- read_dta("data/GPA3.dta")
```
]
]

???

**PATH-TAB**

You can read in from your computer's file path. If you aren't sure of the path, hold down Shift on your keyboard and right-click the file to copy its path. You would then need to change the backslashes to forward slashes so that R can read it in.

**URL-TAB**

Any of the `read_*` functions can take a URL in place of a local path. We use `read_csv2()` here because this particular file uses semicolons rather than commas as the delimiter.

**.dta-TAB**

You may want to read in a SAS, SPSS, or Stata file. To do this we use the `haven` package: https://haven.tidyverse.org/

---

class: inverse, clear, center

## .font130[.pull-left[**What's next?**]]

<br/><br/><br/><br/><br/>

.pull-left-wide[.left[.font100[
- Make sure to complete the R Programming primer: [Work with Data](https://rstudio.cloud/learn/primers/2)
- Complete the badge requirement document: [Foundations badge - Data Sources](https://github.com/laser-institute/foundational-skills/tree/master/foundation_lab_1/lab1_badge).
]]
]

## .font175[.center[Thank you! Any questions?]]

---

<img src="img/team_2022.png" height="550px"/>