Rstudioconf2020 Agenda

Apologies for some of the tables needing to be horizontally scrolled. My Hugo theme has a fixed width and I don’t know enough CSS to adjust this in short order. Hopefully, folks will find this helpful. You can access the script I used to clean and tidy the data on my GitHub.

Day 1

Session 1
Education | Production | Case Study | Programming
11:30 AM-11:52 AM
Title: Meet You Where You R | Deploying End-To-End Data Science with Shiny, Plumber, and Pins | Professional Case Studies | Simplified Data Quality Monitoring of Dynamic Longitudinal Data: A Functional Programming Approach
Speaker: Lauren Chadwick | Alex Gold | Katie Masiello | Jacqueline Gutman
SpeakerInfo: Customer Success Representative, RStudio | Solutions Engineer, RStudio | Customer Success Representative, RStudio | Flatiron Health
Abstract: At RStudio, we wake up and go to bed thinking about the positive impact that open source work and data science have had and can have on the world. To maximize this impact, we find three areas of investment absolutely critical to ensure our open source community keeps up with the world’s changes and outlives us all: 1. Find ways to make R more approachable. 2. Enable teams of all types & sizes (educational, professional, etc.) to be able to leverage the work they’re doing in R, and effortlessly communicate that work to others. 3. Extend the language so our open-source community can continue to be at the forefront of innovation, no matter their preference of tool or language. Underpinning these investments is also the core belief that every data scientist, regardless of skill level, use-case, or professional experience, is an asset to our community. Whether you’re a student currently learning R, a Python fan looking to become multilingual, or the Head of Data Science at NASA, we want you to become a part of our journey. In return, we’ll do our best to ensure that journey is a fulfilling endeavor. This presentation will take a deeper dive into the ways in which you can utilize RStudio's educational offerings and enterprise toolchain in personal, educational, and corporate settings. | It’s easier than ever to craft a complete R-centric data science pipeline thanks to packages like Shiny, Plumber, and Pins. In this talk, you’ll learn how to use R to bring your modeling and visualization work into production. You’ll walk away with recipes, tips, and tricks to deploy data, models, and apps to ensure your work is as impactful as possible. | The path to becoming a world-class, data-driven organization is daunting. The challenges you will likely face along the way can be thorny, and in some cases, seem outright impossible to overcome. How do you get teams that traditionally butt heads, such as IT and data science, to complement each other and work in unison? How can you efficiently scale the scope and reach of your data products as requirements change? Your time should be spent doing truly valuable work instead of updating charts and reports. How do you prevent the support structure behind your platform from toppling like a house of cards? Despite these challenges, we think that the end result is worth it: an organization that is equipped to make important decisions, with confidence, using data analysis that comes from a sustainable environment. We see this outcome every day. | Ensuring the quality of data we deliver to customers or provide as inputs to models is often one of the most under-appreciated and yet time-consuming responsibilities of a modern data scientist. This task is challenging enough when working with static data, but when we have access to dynamic, longitudinal, continuously updating data, that complexity can become an asset. We will demonstrate how to simplify data quality monitoring of dynamic data with a functional programming approach that enables early and actionable detection of data quality concerns. Using purrr as well as tidyr and nested tibbles, we will illustrate the five key pillars of enjoyable, user-friendly data quality monitoring with relevant R code: Readability, Reproducibility, Efficiency, Robustness, and Compositionality. Readability: FP empowers us to abstract away from the mechanics and implementation of comparing two or more related datasets and move towards declaring the intent of features and metrics we want to compare. Reproducibility: By avoiding side-effects and dependencies on external states and inputs, and using functional units which can be easily tested over a variety of inputs, FP reduces the burden to create reproducible code. Perhaps more importantly, FP supports not just reproducibility of results, but reproducibility of workflows that can be continually applied to dynamic datasets. Efficiency: FP enables more efficient code through lazy evaluation, caching, and simplifying implementation over parallel backends. Robustness: FP allows greater testability of our code through modularization and elegant error-handling, with customized fail-safes for data that differs in expected ways over time. Compositionality: FP encourages higher-level reasoning with functions, which in turn drives both readability--through higher-level, more abstract code--and robustness, through modifying function behavior in case errors are encountered.
11:53 AM-12:15 PM
Title: Data Science Education in 2022 | We’re hitting R a million times a day so we made a talk about it | How Vibrant Emotional Health Connected Siloed Data Sources and Streamlined Reporting Using R | vctrs: Creating custom vector classes with the vctrs package
Speaker: Carl Howe | Heather Nolis | Sean Murphy | Jesse Sadler
SpeakerInfo: Director of Education, RStudio; Greg Wilson, Data Scientist & Professional Educator, RStudio | Machine Learning Engineer, T-Mobile; Dr. Jacqueline Nolis, Principal Data Scientist, Nolis, LLC | Senior Data Scientist, Vibrant | Lecturer, Loyola Marymount University
Abstract: More people are learning data science every day, and there are more ways for them to learn than ever before. To understand where we are and where we might be going, this talk looks at what data science education could look like two years from now: far enough away that we can dream, but close enough that we can only dream a little. We explore the balance between automated and collaborative learning, different ways to deliver different kinds of lessons to different kinds of people, and ways in which our tools and practices could improve. | Often reserved for Elite Engineers, production can be a perilous place for R users - but never fear! For the past year, we at T-Mobile have been sludging through production outages, nation-wide product launches, and all of the muck that floods from R models being hit over a million times every day. From “we’re strictly a java shop” to a devops team that proudly states “we support Java, node, and R,” this talk will cover the technical hiccups, interdisciplinary communication struggles, and an open-source R package {loadtest} that’s changed the way our team views performance testing. You too can dazzle your enterprise with the power of R. | Vibrant Emotional Health is the mental health not-for-profit behind the US National Suicide Prevention Lifeline, New York City's NYC Well program, and various other emotional health contact center programs and direct services. We engage in emotionally charged conversations with people experiencing a wide variety of mental health and emotional concerns. Our programs vary in scope and in resources, and span several technologies. In addition, our data collection and reporting requirements change dynamically in response to emerging clinical needs and reporting requirements from our sponsors. In short, the data we collect is complex, often unstructured, and stored in a variety of sources. In this context, R Markdown Documents have allowed us to interface directly with multiple databases, Google Sheets, APIs, CSVs, and JSON stores to generate integrated reports. Organizing these reports into R packages with accompanying functions that standardize the calculation of KPIs and apply consistent themes across analyses has allowed us to improve the clarity and aesthetics of our reporting while reducing manual work that was previously needed to produce these reports. Building on this framework, we have developed functions to standardize data connections, create reusable data visualizations, and generate reproducible analyses in response to ad hoc analytic requests. These same functions also facilitate the creation of Shiny dashboards where core visualizations that were previously only available in static reports can be manipulated directly by end users to explore clinical and operational trends. These dashboards also facilitate self-service reporting by end users. We present here the framework we have developed for our organization-wide and program-specific packages, the types of functions and artifacts they include, and our plans for future development. | The base R types of vectors enable the representation of an amazingly wide array of data types. There is so much you can do with R. However, there may be times when your data does not fit into one of the base types and/or you want to add metadata to vectors. vctrs is a developer-focused package that provides a clear path for creating your own S3-vector class, while ensuring that the classes you build integrate into user expectations for how vectors work in R. This presentation will discuss the why and how of using vctrs through the example of debkeepr, a package for integrating historical non-decimal currencies such as pounds, shillings, and pence into R. The presentation will provide a step-by-step process for developing various types of vectors and thinking through the design process of how vectors of different classes should work together.
12:16 PM-12:38 PM
Title: Data science education as an economic and public health intervention in East Baltimore | Growth Hacking with R - Product Analytics at Scale using R and RStudio | Building a new data science pipeline for the FT with RStudio Connect | Asynchronous programming in R
Speaker: Jeff Leek | Andrew Mangano | George Kastrinakis | Winston Chang
SpeakerInfo: Professor of Biostatistics, Johns Hopkins Bloomberg School of Public Health; Chief Data Scientist, Problem Forward Data Science | Data Intelligence Lead, Salesforce | Senior Data Scientist, Financial Times | Software Engineer, RStudio
Abstract: NA | Salesforce is not only a cloud software solution out of the box, but also a highly customizable platform that can be modified for a wide range of use cases. In addition to complexity, customer trust is our #1 company value and customer data privacy is abstracted from everyone outside of the customer. Product and Growth Analytics is an emerging field separate from business analytics and data science and focuses on building software products that improve user retention and engagement. Companies like Facebook and AirBnB have robust data science teams focused on product analytics. At Salesforce however, given the scale, customization, and privacy values, product data science is not so straightforward. Utilizing R and RStudio tools for collaboration and reproducible analytics, the Data Intelligence team is able to solve complex problems at enterprise scale. This talk will preview anonymized predictive and growth analytics work while also highlighting how we work and collaborate across platforms and languages (Python via reticulate). | We have recently implemented a new Data Science workflow and pipeline, using RStudio Connect and Google Cloud Services. This has vastly decreased our pipeline complexity, allowing us to bring our models and products into scheduled production more quickly. In addition, our workflow, working closely together as a team on all projects on a regular two-week sprint cycle, has increased the range of projects we have been able to take on and complete. To detail some of the key lessons we’ve learned (and some of the difficulties!), we’ll walk you through one of our recent sprints, where we productionalised the generation of a suite of behavioural and demographic features so that they can be more easily plugged in to a range of models and used across the business by the FT’s platform and product teams. | Writing regular R code is straightforward: you tell R to do something, it does it, and then it returns control back to you. This is called synchronous programming. However, if you use R to coordinate threads, processes, or network communication, the regular model may be unable to do what you want, or it may only be able to do it with a significant performance penalty. In this talk I'll explain how asynchronous programming with the later package can handle these kinds of programming problems. I'll also show how to provide a synchronous interface for asynchronous code, so that users will have a simple, familiar way to use your code.
12:39 PM-12:59 PM
Title: Of Teacups, Giraffes, & R Markdown | Plumber: Growing Up | How to win an AI Hackathon, without using AI | Azure Pipelines and GitHub Actions
Speaker: Desiree De Leon | James Blair | Colin Gillespie | Jim Hester
SpeakerInfo: PhD Student, Emory University | Solutions Engineer, RStudio | Jumping Rivers | Software Engineer, RStudio
Abstract: How do you make your R Markdown lessons feel friendly for learners you’ll never meet? How do you make it engaging so they sit and stay a while? How do you make it memorable so they come back to visit again? In this talk, I’ll share lessons learned from my experience of making a series of online statistics modules (co-authored by Hasse Walum) that feel accessible and fun-- housed entirely in an R Markdown site, complete with a whimsical, illustrated narrative about teacup giraffes. I’ll show how adding good characters with your audience in mind, good design, and good play helped me make the most of HTML output. To help you get started, I’ll share resources that Alison Hill and I have developed--including a series of cookbooks and out-of-the-box templates-- so that you will have a leg up on applying these ideas to R Markdown collections of your own. | Plumber is a package that allows R users to create APIs out of R functions. This flexible approach allows R processes to be accessed by toolchains and frameworks outside of R. In this talk, we'll look at recent developments in the Plumber package along with some Plumber best practices. | Anyone reading a newspaper or listening to the news is led to believe that AI is the solution to all problems. From self-driving cars to detecting disease to catching fraud, there doesn’t seem to be a situation that AI can’t tackle. Once “big data” is thrown into the mix, the AI solution is all but certain. But is AI always needed? Over the last eighteen months, Jumping Rivers has entered (and won) four Hackathons. All Hackathons were characterised by “big data” and the need to improve prediction. All Hackathons were won without using AI (or any sort of machine learning). This talk will focus on one particular competition around reducing leakage at Northumbrian Water. Using a combination of R, Shiny, and tidyverse (and a few other tricks), we were able to demonstrate within the short Hackathon time frame that clear presentation of data to the front line engineers was more likely to reduce leakage than simply providing vague estimates of a potential future leak. | Open source R packages on GitHub often take advantage of continuous integration services to automatically check their packages for errors. This is very useful to catch things quickly, as well as increasing confidence for proposed changes, as the Pull Requests can be checked before they are merged. Travis-CI and Appveyor are the most popular current methods. However, newer services, Azure Pipelines and GitHub Actions, show promise for being more powerful and simpler to configure and debug. I will discuss these services and demonstrate some of their capabilities and how to configure them for your own use in packages and reports.

Break For Lunch

Session 2
2:15 PM-2:37 PM
Title: If you build it, they will come...but then what? Facilitating communities of practice in R | Accelerating Analytics with Apache Arrow | 15 Years of R in Quantitative Finance | Production-grade Shiny Apps with golem
Speaker: Dr. Kate Hertweck | Neal Richardson | Mr Brandon Farr | Colin Fay
SpeakerInfo: Bioinformatics Training Manager, Fred Hutchinson Cancer Research Center | Director of Engineering, Ursa Labs / RStudio | Sr Quant, Copper Rock Capital Partners | Data Scientist & R Hacker, ThinkR
Abstract: Why did you learn R? Chances are good that if you're an attendee of rstudio::conf, you've found a community of R coders who are willing to share their knowledge and learn with you. While it's possible to develop expert R coding skills in isolation, most software development and data analysis projects benefit from groups of people working collaboratively, and R communities are unparalleled in their inclusivity and commitment to learning collectively. Such communities, whether they support R coders at a single institution, geographic region, or online, require deliberate planning and effort to develop and sustain. How do you create a group culture that encompasses R users of various skill levels who may be working on diverse problems? How do you assess what members of a community need or prefer? How do you encourage investment and cohesion so the group will sustain itself? This talk will describe potential pitfalls and impediments to creating and facilitating cooperative learning communities for R coding, and will allow you to identify potential strategies for overcoming these challenges so you can continue giving back to the R communities that supported you along the way. | The Apache Arrow project is a cross-language development platform for in-memory data designed to improve system performance, memory use, and interoperability. This talk presents recent developments in the 'arrow' package, which provides an R interface to the Arrow C++ library. We'll cover the goals of the broader Arrow project, how to get started with the 'arrow' package in R, some general concepts for working with data efficiently in Arrow, and a brief overview of upcoming features. | Use of R in the investment industry is established and growing. This talk will discuss changes seen in 15 years of practice within asset management firms. I hope discussion of lessons learned and recommendations will benefit those currently in finance and those interested in hearing how the flexibility of R manifests in the financial world. | Shiny is an amazing tool when it comes to creating web applications with R. Almost anybody can get a small Shiny App in a matter of minutes, provided they have a basic knowledge of R. As of today, we can safely say that it has become the de-facto tool for web applications in the R world. Building a proof-of-concept application is easy, but things change when the application becomes larger and more complex, and especially when it comes to sending that app to production. Until recently there hasn't been any real framework for building and deploying production-grade Shiny Apps. This is where 'golem' comes into play: offering Shiny developers an opinionated framework for creating production-ready Shiny Applications. With 'golem', Shiny developers now have a toolkit for making a stable, easy-to-maintain, and robust production web application with R. 'golem' has been developed to abstract away the most common engineering tasks (for example, module creation, addition of external CSS or JavaScript file, ...), so you can focus on what matters: building the application. And once your application is ready to be deployed, 'golem' guides you through testing, and brings you tools for deploying to common platforms. In this talk, Colin and Vincent will present the 'golem' package, first talking about the "why 'golem'?", then presenting the general philosophy behind this framework, and help you get started building your first Shiny App with 'golem'.
2:38 PM-3:00 PM
Title: Embracing R in the Geospatial Community | Updates on Spark, MLflow, and the broader ML ecosystem | Deep Learning Extraction for Counterparty Risk Signals from a Corpus of Millions of Documents | Making the Shiny Contest
Speaker: Tina Cormier | Javier Luraschi | Moody Hadi | Mine Çetinkaya-Rundel
SpeakerInfo: Data Scientist, Remote Sensing, Indigo | RStudio | Group Manager, S&P Global - Market Intelligence | Educator, RStudio
Abstract: Geospatial analysts work in a wide range of positions within almost every industry. They work in government, non-profit, academic, and private institutions using geospatial data and technology to answer questions about the environment, agriculture, climate, urban planning and design, marketing, public health, transportation, and myriad other topics. A typical day may include data prep/cleaning, field work, cartography, image analysis, vector analysis, feature engineering, modeling, or database management. This diverse group necessarily uses a diverse set of tools. In this talk, we will explore how R fits into the spatial analyst’s toolkit. What does the geo community think of R? Who uses it? What groups avoid it? What geo-packages are used most? How can we, as a community, make R more appealing for geospatial scientists? | NA | China has been experiencing rapid growth over the last decade due to economically friendly reforms and a growing skilled and young population. With this increasing growth, China’s interconnectedness with the global economy has increased significantly. In parallel to this economic evolution, technology has experienced rapid acceleration, which has enabled firms and governments to track and record vast amounts of data. The side effect of this unstructured big data growth is that datasets may be polluted, meaning information can be conflicting, missing, and/or unreliable. This creates a gap in the ability to provide transparency to the exposed firms importing from China: both timely early warning signals and wide coverage of small- and medium-sized enterprises (SMEs). We have been able to address this problem for our end-users by using deep learning to extract information value and opinion from a public corpus to create the needed transparency. Our data science & machine learning stack uses connect, shiny, reticulate, tensorflow and scikit-learn to build the interactive solution for our clients and deploy it using spark and airflow. | In January 2019 RStudio launched the first-ever Shiny contest to recognize outstanding Shiny applications and to share them with the community. We received 136 submissions for the contest and reviewing them was incredibly inspiring and humbling. In this talk, we shine a spotlight on the backstage: the inspiration behind the contest, the process of evaluation, what we learned about Shiny developers and how we can better support them, and what we learned about running contests and how we hope to improve the Shiny Contest experience. We also highlight some of the winning apps as well as the newly revamped Shiny Gallery, which features many noteworthy contest submissions. Finally, we introduce the new process for submitting your apps to the Shiny Gallery and, of course, to Shiny Contest 2020!
3:01 PM-3:23 PM
Title: The development of "datos" package for the R4DS Spanish translation | What's new in TensorFlow for R | Rpanda trading simulation - from an idea to a multi-user shiny app | Styling Shiny apps with Sass and Bootstrap 4
Speaker: Riva Quiroga | Daniel Falbel | Nima Safaian | Joe Cheng
SpeakerInfo: Editor, The Programming Historian | Software Engineer, RStudio | Partner, rpanda | CTO, RStudio
Abstract: NA | TensorFlow is the most popular open-source platform for machine learning and its ecosystem is evolving incredibly fast. In this talk we will explore what's new in TensorFlow 2.0 as well as how to build data pre-processing pipelines using the tfdatasets package and how to use pre-trained models with tfhub. | The idea of rpanda commodities trading simulation was many years in the making. As energy trading professionals working in the industry, we had developed insights around how to make risk/reward market calls, and what skills make someone an exceptional commodities trader. Traders are one of the most expensive seats in terms of monetizing value from the assets. We developed rpanda as a simulated environment which replicates closely how real-life physical commodities trading works in order to assist talent development and selection, both in academics and enterprise. My co-founder and I did not know how to design production-ready software, but we always had used R/Shiny for market analysis in our corporate jobs. Rather than hiring expensive app developers, we decided to do it ourselves. We used the RStudio development stack, such as RStudio Connect, and open source tools like plumber to turn our idea into a production-ready app that is used by University of Alberta classes. In this presentation, we share our journey, technical challenges, and how we overcame them. | Customizing the style--fonts, colors, margins, spacing--of Shiny apps has always been possible, but never as easy as we’d like it to be. Canned themes like those in the shinythemes package can easily make apps look slightly less generic, but that’s small consolation if your goal is to match the visual style of your university, corporation, or client. In theory, one can "just" use CSS to customize the appearance of your Shiny app, the same as any other web application. But in practice, the use of large CSS frameworks like Bootstrap means significant CSS expertise is required to comprehensively change the look of an app. Relief is on the way. As part of a round of upgrades to Shiny’s UI, we’ve made fundamental changes to the way R users can interact with CSS, using new R packages we’ve created around Sass and Bootstrap 4. In this talk, we’ll show some of the features of these packages and tell you how you can take advantage of them in your apps.
3:24 PM-3:44 PM
Title: R: Then and Now | Deep Learning with R | The good, the bad and the ugly: What I learned while consulting across the business as a data scient | Reproducible Shiny apps with shinymeta
Speaker: Jared Lander | Paige Bailey | Ben Barnard | Dr Carson Sievert
SpeakerInfo: Chief Data Scientist, Lander Analytics | Product Manager, Google | Data Scientist, Wells Fargo | Software Engineer, RStudio
Abstract: R has changed a lot since the meetup was founded 10 years ago. Back then we were using base graphics (or lattice) and the apply family of functions and we didn't have pipes. At the time there were an impressive 1,800 packages on CRAN; now there are over 15,000, extending R's reach far beyond its traditional domain of statistics and machine learning into publishing, website building and video generation. The community has grown and changed dramatically during that time, with the New York meetup alone going from 25 to over 10,000 members. During this talk we go through a then-and-now of R code and community to palpably see how everything has changed. | NA | A collection of data science stories about current problems that data scientists might face while working in academia, industry, and government. Some lessons learned, some situations avoided, what I learned, and how I survived my journey. First, I discuss the struggle of advocating for R when senior leaders decide Python is the only appropriate product. Then, I describe why donut charts are superior to pie charts, and why we should all be using them. Finally, the case of the uncatchable “drive-by” stakeholder and where to find them. The fight is real, and the path is long for the evangelical data scientist. | Shiny makes it easy to take domain logic from an existing R script and wrap some reactive logic around it to produce an interactive webpage where others can quickly explore different variables, parameter values, models/algorithms, etc. Although the interactivity is great for many reasons, once an interesting result is found, it’s more difficult to prove the correctness of the result since: (1) the result can only be (easily) reproduced via the Shiny app and (2) the relevant domain logic which produced the result is obscured by Shiny’s reactive logic. The R package shinymeta provides tools for capturing and exporting domain logic for execution outside of a Shiny runtime (so that others can reproduce Shiny-based result(s) from a new R session).

Break for Snack

Session 3
Learning and Using R | Programming | Pharma | Case Study
4:00 PM-4:22 PM
Title: Flipbooks | Getting things logged | Approaches to Assay Processing Package Validation | Journalism with RStudio, R, and the tidyverse
Speaker: Evangeline Reynolds | Gergely Daroczi | Mr Ellis Hughes | Larry Fenn
SpeakerInfo: Dr., University of Denver | Senior Director of Data Operations, System1 | Fred Hutch Cancer Research Center | Journalist, Associated Press
AbstractGood examples facilitate accomplishing new or unpracticed tasks in a programmatic workflow. Tools for communicating examples have improved in recent years. Especially embraced are tools that show code and its resultant output immediately thereafter --- the case of `Jupytr` notebooks and `Rmarkdown` documents. But creators using these tools often must choose between big-picture or narrow-focus demonstration; creators tend to either demo a complete code pipeline that accomplishes a realistic task or instead demonstrate a minimal example which makes clear the behavior of a particular function, but how it might be used in a larger project isn't clear. Flipbooks help address this problem, allowing the creator to present a full demonstration which accomplishes a real task, and gives the viewer the opportunity to focus on unfamiliar steps. A set of flipbook building functions parse code in a data manipulation or visualization pipeline and then build it back up incrementally. Aligned superimposition of new code and output atop previous code and output makes it easy to identify how each code change triggers changes in output. The presentation will guide attendees in creating their own Flipbooks (with Xaringan slides) or mini Flipbooks (gif output).One of the greatest strength of R is the ease and speed of developing a prototype (let it be a report or dashboard, a statistical model or rule-based automation to solve a business problem etc), but deploying to production is not a broadly discussed topic despite its importance. 
This hands-on talk focuses on best practices and actual R packages to help transforming the prototypes developed by business analysts and data scientist into production jobs running in a secured and monitored environment that is easy to maintain -- discussing the importance of logging, securing credentials, effective helper functions to connect to database, open-source and SaaS job schedulers, dockerizing the run environment and scaling infrastructure.In this talk I will discuss the steps that have been created for validating internally generated R packages at SCHARP (Statistical Center for HIV/AIDS Research and Prevention) and the lessons learned while creating packages as a team. Housed within Fred Hutch, SCHARP is an instrumental partner in the research and clinical trials surrounding HIV prevention and vaccine development. Part of SCHARP’s work involves analyzing experimental biomarkers and endpoints which change as the experimental question, analysis methods, antigens measured, and assays evolve. Maintaining a validated code base that is rigid in its output format, but flexible enough to cater a variety of inputs with minimal custom coding has proven to be important for reproducibility and scalability. SCHARP has developed several key steps in the creation, validation, and documentation of R packages that take advantage of R’s packaging functionality. First, the programming team works with leadership to define specifications and lay out a roadmap of the package at the functional level. Next, statistical programmers work together to develop the package, taking advantage of the rich R ecosystem of packages for development such as roxygen2, devtools, usethis, and testthat. Once the code has been developed, the package is validated to ensure it passes all specifications using a combination of testthat and rmarkdown. Finally, the package is made available for use across the team on live data. 
These procedures set up a framework for validating assay processing packages that furthers the ability of Fred Hutch to provide world-class support for our clinical trials. | The Associated Press data team primarily uses R and the tidyverse as the main tool for doing data processing and analysis. In this talk, some of the technology behind the published stories will be showcased: - Using dbplyr to work off a hosted database containing 380 million opioid records to identify "pill mills". - Using open-sourced AP style templates for R Markdown and ggplot to quickly produce graphics and reports off breaking news. - Using R Markdown and htmlwidgets to give reporters and editors interactive reports to identify reporting leads.
4:23 PM-4:45 PM
Title: Learning R with humorous side projects | Technical debt is a social problem | Building a native iPad dashboard using plumber and RStudio Connect in Pharma | Putting the Fun in Functional Data: A tidy pipeline to identify routes in NFL tracking data
Speaker: Ryan Timpe | Mr. Gordon Shotwell | Aymen Waqar | Dani Chu
SpeakerInfo: Senior Data Scientist The LEGO Group | Senior Data Scientist Socure | Data Science Manager Astellas Pharma US | Quantitative Analyst - Statistics NHL Seattle
Abstract: What should you name a new dinosaur discovery, according to neural networks? Which season of The Golden Girls should you watch when playing a drinking game? How can you build a LEGO set for the lowest price? R is constantly evolving, so as users, we’re constantly learning. Over the past few years, I’ve found that working on side projects is great for hands-on learning - and for me, the more absurd the project, the better. Side projects provide a safe, low-stakes environment to learn new packages and methodologies before using them in work or in production. Sharing those projects can help publicize the package and increase its accessibility, benefiting both the original author and future users. In this talk, I’ll share my experiences with side projects for learning state-of-the-art data science tools and growing as an R user, including how one project helped me land my dream job. | Technical debt is a big problem for the R community. Even though R has excellent support for testing, documentation and packaging code, it has the reputation that it is not suitable for production applications because data scientists don’t pay enough attention to technical debt within their codebases. Most people think of technical debt as an engineering problem. We choose to make our current work cheaper at the expense of needing to do more work down the road. But when you look closely at the root causes of technical debt they are almost always about interpersonal relationships. Developers have trouble empathizing with other users of their code and so don’t spend the time to make that code easy for future developers to use and understand. In this talk I argue that we should think about technical debt as a social problem because it gives us insight into why it’s so hard to pay back. 
I then provide a practical roadmap of how to introduce best practices into your data science team. | As companies are becoming aware of the need to embrace data-driven solutions, R has gained huge momentum over recent years. Getting the insights to users has become a very important factor of a data scientist's work. While our world has advanced there is a need to build not only web applications, but also applications on mobile that are available offline. We would like to share with you how within months we have gone from nothing to a production-ready application that handles 500 concurrent users in healthcare. There are plenty of challenges to solve including restricted environments, internal processes and user availability. We will show you how to overcome them and iterate fast, navigating through complex infrastructure and integrating with proxy architecture to serve applications to end users in a compliant manner. With RStudio Connect and Plumber you can deploy a scalable REST API that can feed insights to your users. This allows you to go one step further and implement native applications for tablets and smartphones. With the right tools, mindset and priorities you can achieve personal success by introducing a digital transformation within your organization, starting with something as small as converting a business-critical Excel file that is slow, difficult to edit and maintain, to a robust application. Step by step your organization will evolve and become empowered by your insights, uncovering even more untapped potential. | Currently in football many hours are spent watching game film to manually label the routes run on passing plays. Using tracking data, each route can be described as a sequence of spatial-temporal measurements that varies in length depending on the duration of the play. This data can be conveniently analyzed using nested columns in tidyr and purrr. We demonstrate how model-based curve clustering using Bernstein polynomial basis functions (i.e. 
Bézier curves) fit using the Expectation Maximization algorithm can cluster route trajectories. Each cluster can then be labelled to obtain route names for each route and create route trees for all receivers. The clusters and routes can be visualized nicely using ggplot and seen developing over time using gganimate.
4:46 PM-5:08 PM
Title: Toward a grammar of psychological experiments | Future: Simple Async, Parallel & Distributed Processing in R - What's Next? | FlatironKitchen: How we overhauled a Frankensteinian SQL workflow with the tidyverse | R + Tidyverse in Sports
Speaker: Danielle Navarro | Henrik Bengtsson | Nathaniel Phillips | Namita Nandakumar
SpeakerInfo: University of New South Wales | Associate Professor University of California, San Francisco | Senior Data Scientist Roche FlatironKitchen: How we overhauled a Frankensteinian SQL workflow with the tidyverse to enable fast, reproducible, elegant analyses of electronic health records. | Quantitative Analyst Philadelphia Eagles
Abstract: Why does a psychological scientist learn a programming language? While motivations are many and varied the two most prominent are data analysis and data collection. The R programming language is well placed to address the first need, but there are fewer options for programming behavioural experiments within the R ecosystem. The simplest experimental designs can be recast as surveys, for which there are many options, but studies in cognitive psychology, psychophysics or developmental psychology typically require more flexibility. In this talk I outline the design principles behind xprmntr, an R package that provides wrappers to a JavaScript library (jsPsych) for constructing web-based psychology experiments and uses the plumber package to call server side R code as needed. In doing so, I discuss limitations to the current implementation and what a "grammar of experiments" might look like. | Future is a minimal and unifying framework for asynchronous, parallel, and distributed computing in R. It is designed for robustness, consistency, scalability, extendability, and adoptability - all in the spirit of "developer writes code once, user runs it anywhere". It is being used in production for high-performance computing and asynchronous UX, among other things. In this talk, I will discuss common feature requests, recent progress we have made, and what is in the pipeline. | The increasing availability of real-world electronic health record (EHR) data is revolutionising how pharma companies are developing Personalized Healthcare (PHC) solutions. However, the scale and complexity of EHR data pose major challenges in deriving fit-for-purpose insights systematically and efficiently. The conventional approach, where siloed programmers write (or copy and paste) thousands of lines of undocumented, untested, unconnected SAS and SQL code for every research project is bad for business and ultimately for patients. 
Our team threw out the conventional approach and turned to R and the tidyverse. The result is FlatironKitchen, a modern R package enabling end-to-end EHR analyses in a cohesive, user-centric platform. FlatironKitchen allows users to “pipe their way” from database connections, to calculating derived variables, to running statistical analyses, to creating stunning visualisations. All of the technical details are both fully documented and seamlessly automated, allowing users to focus on only meaningful functions that are fit-for-purpose to EHR analyses. The result: FlatironKitchen code is so simple it actually tells a step-by-step, human readable story about what the data scientist is doing -- a far cry from the Frankensteinian SQL/SAS code from the past. FlatironKitchen represents the best of both worlds in pharmaceutical data science. It gives expert data scientists a library of unit-tested, customisable functions for implementing existing procedures and designing new ones. Simultaneously, it enables those who are ‘coding insecure’ to -- finally -- work directly with data by reducing barriers. FlatironKitchen’s simple, easy-to-use syntax, combined with its training library of tutorials, vignettes and lessons made possible through RMarkdown has shown itself to be truly empowering. In addition to showcasing FlatironKitchen, we share lessons learned, and give a call to action for other pharma companies to embrace R. | This talk will use a case study, most likely in hockey, to showcase the many ways in which R and the tidyverse can be used to analyze sports data as well as the unique priorities and considerations that are involved in applying statistical tools to sports problems.
5:09 PM-5:29 PM
Title: R/RMarkdown for interactive clinical trial reporting | Parallel computing with R using foreach, future, and other packages | Using R to Create Reproducible Engineering Test Reports | Making better spaghetti (plots): Exploring the individuals in longitudinal data with the brolgar package
Speaker: Frank Harrell | Bryan Lewis | Ms. Ana Alyeska Santos | Dr. Nicholas Tierney
SpeakerInfo: NA | NA | Quality Engineer I Biosense Webster, Inc. | Lecturer Monash University
Abstract: NA | Steve Weston's foreach package defines a simple but powerful framework for map/reduce and list-comprehension-style parallel computation in R. One of its great innovations is the ability to support many interchangeable back-end computing systems so that *the same R code* can run sequentially, in parallel on your laptop, or across a supercomputer. Recent new packages like the future package define elegant new programming approaches that can use the foreach framework to run across a wide variety of parallel computing systems. This talk introduces the basics of the foreach and future packages with examples using a variety of back-end systems including MPI, Redis and R's default parallel package clusters. | Engineers at Biosense Webster, a Johnson and Johnson medical device company that specializes in diagnosing and treating cardiac arrhythmias, write multiple test reports to comply with FDA regulatory standards. These intricate reports require 36 hours of an engineer’s time on average, constraining the engineers from completing investigations and studies in a timely manner. Writing scripts in R that create reproducible reports can significantly reduce the time spent by an engineer creating these reports, allowing them to do a much more thorough investigation with a larger scope. Through Shiny, engineers could conveniently have their parameters and recorded data processed and stored in a database by accessing a web link and filling out the required information within a user-friendly interface. Upon the generation of the report, accurate and properly formatted test reports, compliant to both the company and FDA regulatory standards, are produced through Rmarkdown and knitr knitting all the outputs with complete data analysis tools such as normality plots and process capability measurements to a Word document that follows company required headers, footers, and headings. 
The reproducible report creation shown in this report can be extended to other types of test reports and protocols. The pilot phase that has been conducted has shown that complete report production has been decreased from 36 hours to an hour. | There are two main challenges of working with longitudinal (panel) data: 1) Visualising the data, and 2) Understanding the model. Visualising longitudinal data is challenging as you often get a “spaghetti plot”, where a line is drawn for each individual. When overlaid in one plot, it can have the appearance of a bowl of spaghetti. With even a small number of subjects, these plots are too overloaded to be read easily. For similar reasons, it is difficult to relate the model predictions back to the individual and keep the context of what the model means for the individual. For both visualisation and modelling, it is challenging to capture interesting or unusual individuals, which are often lost in the noise. Better tools, and a more diverse set of grammar and verbs are needed to visualise and understand longitudinal data and models, to capture the individual experiences. In this talk, I introduce the R package, **brolgar** (BRowse over Longitudinal data Graphically and Analytically in R), which provides new tools, verbs, and grammar to identify and summarise interesting individual patterns in longitudinal data. This package extends upon ggplot2 with custom facets, and the new tidyverts time series packages to efficiently explore longitudinal data.

Day 2

Session 4
10:30 AM-10:52 AM
Title: Branding and Packaging Reports with R Markdown | Building a Medical Device with R | The Glamour of Graphics | RMarkdown Driven Development
Speaker: Dr. Jake Thompson | Ron Keizer | William Chase | Emily Riederer
SpeakerInfo: Senior Psychometrician Accessible Teaching, Learning, and Assessment Systems | Chief Science Officer InsightRX | Data analyst University of Pennsylvania | Analytics Manager Capital One
Abstract: The creation of research reports and manuscripts is a critical aspect of the work conducted by organizations and individual researchers. Most often, this process involves copying and pasting output from many different analyses into a separate document. Especially in organizations that produce annual reports for repeated analyses, this process can also involve applying incremental updates to annual reports. It is important to ensure that all relevant tables, figures, and numbers within the text are updated appropriately. Done manually, these processes are often error prone and inefficient. R Markdown is ideally suited to support these tasks. With R Markdown, users are able to conduct analyses directly in the document or read in output from a separate analysis pipeline. Tables, figures, and in-line results can then be dynamically populated and automatically numbered to ensure that everything is correctly updated when new data is provided. Additionally, the appearance of documents rendered with R Markdown can be customized to meet specific branding and formatting requirements of organizations and journals. In this presentation, we will present one implementation of customized R Markdown reports used for Accessible Teaching, Learning, and Assessment Systems (ATLAS) at the University of Kansas. A publicly available R package, ratlas, provides both Microsoft Word and LaTeX templates for different types of projects at ATLAS with their own unique formatting requirements. We will discuss how to create brand-specific templates, as well as how to incorporate the templates into an R package that can be used to unify report creation across an organization. We will also describe other components of branding reports beyond R Markdown templates, including customized ggplot2 themes, which can also be wrapped into the R package. 
Finally, we will share lessons learned from incorporating the R package workflow into an existing reporting pipeline. | The InsightRX precision dosing platform tailors in-patient drug doses to individual patients' characteristics and biomarkers, leveraging pharmacological models of drug metabolism and drug effects. These models are implemented in R, exposed through APIs, and called from a cloud-based web application. The core of our pharmacokinetic/pharmacodynamic simulation functionality is available open source at `` and ``. As a regulated device in Europe (and soon to be in the US) used in over 100 hospitals, the platform is necessarily developed under "design control", meaning that strict product planning and engineering practices are required. This has implications for how the application and APIs are developed and deployed, such as strict version control workflows and implementation of rigorous testing procedures. To meet the requirements for high availability and horizontal scaling, we use a combination of Plumber and OpenCPU, hosted on RStudio Connect and AWS Fargate/ECS, which cater to the various needs of the development and production environments. | I see a lot of ugly charts. This is to be expected as I work with a lot of academics and data scientists, neither of whom have been trained in how to design attractive charts. I myself produced many ugly charts during my years as a research scientist, when the design process basically came down to random tweaking until things "looked good". If only I could go back and tell young inexperienced me that there was a better way. In this talk, I will present that better way--a series of design principles that can take any chart from drab to fab. Rather than applying these techniques willy nilly, I will show how they form a layered "Glamour of Graphics" that is structured and can be easily applied to any chart. 
This Glamour of Graphics has some simple implementations in ggplot, where we will replace geoms, aesthetics, and scales with typography, color, and layout. Finally, I will discuss why looks matter when it comes to charts, and how by following the Glamour of Graphics you can design charts that are more persuasive and more accurately perceived. | RMarkdown enables analysts to engage with code interactively, embrace literate programming, and rapidly produce a wide variety of high-quality data products such as documents, emails, dashboards, and websites. However, RMarkdown is less commonly explored and celebrated for the important role it can play in helping R users grow into developers. In this talk, I will provide an overview of RMarkdown Driven Development: a workflow for converting one-off analysis into a well-engineered and well-designed R package with deep empathy for user needs. We will explore how the methodical incorporation of good coding practices such as modularization and testing naturally evolves a single-file RMarkdown into an R project or package. Along the way, we will discuss big-picture questions like “optimal stopping” (why some data products are better left as single files or projects) and concrete details such as the {here} and {testthat} packages which can provide step-change improvements to project sustainability.
10:53 AM-11:15 AM
Title: Don’t repeat yourself, talk to yourself! Repeated reporting in the R universe. | Development of a web-based clinical decision support application for platelet transfusion management | 3D ggplots with rayshader | renv: Project Environments for R
Speaker: Sharla Gelfand | Justin Juskewitch | Dr. Tyler Morgan-Wall | Mr. Kevin Ushey
SpeakerInfo: R and Shiny Developer | Transfusion medicine and clinical informatics fellow Mayo Clinic Development of a web-based clinical decision support application for platelet transfusion management using R and the Tidyverse | Dr. Institute for Defense Analyses | renv: Project Environments for R RStudio, Inc
Abstract: If you’re responsible for analyses that need updating or repeating on a semi-regular basis, you might find yourself doing the same work over and over again. The principle of "don’t repeat yourself" from software engineering motivates us to use functions and packages, the core of repetition in the R universe. For analyses, it can be difficult to know how to use this principle and move beyond "copying and pasting scripts and changing the data file and the object names and updating the dates and results in RMarkdown", especially when there’s some element of human intervention required, whether it be for validating assumptions or cleaning artisanal data. This talk will focus on those next steps, showcasing opportunities to stop repeating yourself and instead anticipate the needs of and communicate effectively with your future self (or the next person with your job!) using project-oriented workflows, clever interactivity, templated analyses, functions, and yes, your own packages. | Blood product transfusion is a high risk and costly medical procedure. Platelets (blood cells that initiate clotting) are a rare and expensive blood product with a short shelf life. Proper management of platelet transfusions is essential to clinical care, particularly for patients who have developed antibodies against specific platelet types due to pregnancy or past transfusions. By providing platelets that avoid a patient’s known antibodies, improved patient outcomes and better inventory management of a rare blood product are achieved. To address this need, we used R, Tidyverse, and several key packages (Shiny, shinydashboard, dplyr, purrr, httr, officer, flextables, futures) to develop a web-based application (PLTVXM) to help guide platelet inventory selection. PLTVXM queries information on available/pending platelet inventory (and eligible donors) from reports that run in our institutional reporting tool Tableau® via a Tableau Server REST API. 
Patient antibody and blood type information is securely retrieved from a clinical data lake via an in-house R package (“dart”) and a custom institutional API. The retrieved data is processed by a published algorithm implemented in R and incorporates user input to present sortable tables of patient-specific compatible platelet inventory (and donors) for consideration. The requisite documentation for platelet product reservation or donor recruitment is then autogenerated using institutional form templates. PLTVXM is deployed on an RStudio Connect server which allows seamless integration with our institution’s Active Directory identity management infrastructure. The pilot version of PLTVXM was created by physicians without formal computer programming training in two weeks. After successful demonstration, PLTVXM was approved for clinical validation and future use in our practice. Our experience highlights how R can facilitate creation of dynamic web-based applications for a wide range of business (or clinical) needs. | Learn how a single line of code can transform your data visualizations into stunning 3D using the rayshader package. In this talk, I will show how you can use rayshader to create beautiful 3D figures and animations to help promote your research and analyses to the public. Find out how to use principles of cinematography to take users on a 3D tour of your data, scripted entirely within R. Leaving the 3D pie charts in the pantry at home, I will discuss how to build interpretable, engaging, and informative plots using all three dimensions. | The renv package helps you create reproducible environments for your R projects. With renv, you can make your R projects more: Isolated: Installing a new or updated package for one project won’t break your other projects, and vice versa. Portable: Easily transport your projects from one computer to another, even across different platforms. renv makes it easy to install the packages your project depends on. 
Reproducible: renv records the exact package versions you depend on, and ensures those exact versions are the ones that get installed wherever you go. In this presentation, I'll introduce renv and some of its main workflows.
11:16 AM-11:38 AM
Title: How Rmarkdown changed my life | Forecasting Platelet Blood Bag Demand to Reduce Inventory Wastage at the Stanford Blood Center | Designing Effective Visualizations | RStudio 1.3 Sneak Preview
Speaker: Rob Hyndman | Qian Zhao | Miriah Meyer | Jonathan McPherson
SpeakerInfo: Professor of Statistics Monash University | Ph.D. student Stanford University | Professor | Software Engineer RStudio
Abstract: Over the last few years, Rmarkdown seems to have taken over my life, or at least my written communication. These days I use Rmarkdown to maintain my website, write my blog, write textbooks, write academic papers, prepare slides for talks, keep my CV up-to-date, help my students write theses, prepare university policy documents, write letters, prepare exams, write reports for clients, and more. I haven't quite got to the point of using it for shopping lists, but perhaps that's my next Rmarkdown template. I will reflect on the journey in getting to this point, what I've lost and what I've gained. I will also speculate on what might be next in the Rmarkdownification of my life. | The Stanford Blood Center collects and distributes blood products to Stanford Hospital. One of these is platelets, a vital clot-forming blood component with a limited shelf life of a few days. Previous work (Guan et al., 2017) formulated an optimization problem using features aggregated from the available data to solve the problem of reducing waste. An R package was created for a three-day ordering strategy but has not been put into production due to lack of human trust in modelling accuracy. In summer 2019, the Stanford Data Science for Social Good team decided to make use of additional patient-level data and models to predict platelet consumption rather than relying solely on aggregated data. Modeling the transfusion recipients into different subpopulations allows for finer-grained predictions on a patient level. We make extensive use of R packages, such as the tidyverse and R Shiny, to conduct exploratory data analysis, build models, and create a user-intuitive dashboard. The Shiny dashboard is designed to display consumption predictions aggregated across all models, consumption predictions for each subpopulation, and historical performance of the model, thereby serving as a valuable tool in building the trust necessary for adopting the algorithmic ordering strategies. 
Reference: Guan, L., Tian, X., et al. (2017). “Big data modeling to predict platelet usage and minimize wastage in a tertiary care system.” PNAS 114 (43): 11368-11373. | University of Utah | RStudio 1.3, currently available as a preview release, includes a number of new capabilities that will help you be more productive in R. It's also more configurable, accessible, and flexible. In this talk, you'll learn to take advantage of these new tools.
11:39 AM-11:59 AM
Title: One R Markdown Document, Fourteen Demos | Shiny New Things: Using R to Bridge the Gap in EMR Reporting | Tidyverse 2019-2020 | Using Jupyter with RStudio Server Pro
Speaker: Yihui Xie | Mr. Brendan Graham | Hadley Wickham | Karl Feinauer
SpeakerInfo: Software Engineer and Data Scientist RStudio | Healthcare Data Analyst Children's Hospital of Philadelphia | Chief Scientist | Software Engineer RStudio
Abstract: R Markdown is a document format based on the R language and Markdown to intermingle computing with narratives in the same document. With this simple format, you can actually do a lot of things. For example, you can generate reports dynamically (no need to cut-and-paste any results because all results can be dynamically generated from R), write papers and books, create websites, and make presentations. In this talk, I'll use a single R Markdown document to give demos of the R packages rmarkdown, bookdown for authoring books, blogdown for creating websites, rticles for writing journal papers, xaringan for making slides, flexdashboard for generating dashboards, learnr for tutorials, rolldown for storytelling, and the integration between Shiny and R Markdown. To make the best use of your time during the presentation, I recommend that you take a look at the rmarkdown website in advance. | Electronic Medical Records (EMRs) are a treasure trove of information, but tend to fall disappointingly short when it comes to visualizing and reporting data in a user friendly and intuitive manner. Building reports in an EMR can be a frustrating experience; the developer is at the mercy of how the data is stored within the EMR and the available EMR reporting tools can be bland and uninspiring. But reporting on data in the EMR doesn't have to be this way! Combining the data-rich EMR with R's robust reporting capabilities benefits both developers and consumers of data. This talk will describe how a cross-departmental project team uses an internal R package, RMarkdown reports scheduled via RStudio Connect, and an interactive flexdashboard app to quickly implement solutions to gaps in the reporting capabilities of the EMR. The flexibility of R relative to EMR reporting tools facilitates a design thinking approach to reporting allowing for more user input, customization and quick iteration. 
Furthermore, the web-based app we developed is able to be embedded within the EMR itself allowing for a more streamlined workflow. | RStudio | This talk is for R admins who want to learn how to set up Jupyter notebooks on RStudio Server Pro. We'll cover prerequisites, basic configuration, best practices for management, Jupyter Lab, and more.

Break for Lunch

Session 5
Modeling | Organizational Thinking | Programming | ggplot2
1:00 PM-1:22 PM
Title: MLOps for R with Azure Machine Learning | Small Team, Big Value: Using R to Design Visualizations | Auto-magic package development: Building an R API for building Vega-Lite Specs | Best practices for programming with ggplot2
Speaker: David Smith | Ian Lyttle | Alicia Schep | Dewey Dunnington
SpeakerInfo: Cloud Advocate Microsoft | Schneider Electric | Senior Data Scientist Outlier AI | Ph.D. Candidate Dalhousie University
Abstract: Azure Machine Learning service (Azure ML) is Microsoft’s cloud-based machine learning platform that enables data scientists and their teams to carry out end-to-end machine learning workflows at scale. With Azure ML's new open-source R SDK and R capabilities, you can take advantage of the platform’s enterprise-grade features to train, tune, manage and deploy R-based machine learning models and applications. In this talk, the attendees will learn how to: • Carry out ML workflows using the authoring experience of their choice, from no-code to code-first options that include Azure ML’s drag-and-drop visual interface for defining workflows and RStudio Server on the Data Science Instance, a hosted VM workstation, for using the Azure ML R SDK from the RStudio browser-based interface. • Use the Azure ML R SDK to manage cloud resources and train, hyperparameter tune, and log and visualize metrics for their models at scale on Azure compute. • Build ML Pipelines in R for defining and orchestrating reusable and reproducible ML workflows. • Deploy, manage, and monitor their R ML models and applications as web services on Azure Container Instance and Azure Kubernetes Service, with an emphasis on robust DevOps and CI/CD for orchestrating and streamlining their end-to-end data science development lifecycle. | Many R users can feel isolated due to the prevalence of Python or Tableau at their institutions. This talk will focus on how we use R to develop reference implementations of visualizations (using ggplot2), and to develop corporate-themed color maps (using the colorspace package) to bring value to the entire institution. Color maps can be translated into a variety of formats, for Tableau, Qlik Sense, d3, etc., and deployed independently from R. For visualizations, our goal is to translate ggplot2 objects to Vega-Lite specifications, using a package we are developing: ggvega. Vega-Lite visualizations are web-native, and are rendered independently from R. 
Specifications can be designed to be extensible to new data, allowing them to serve as templates, to be deployed and updated for use outside of R. Of course, despite isolation within an institution, our work with the larger R open-source communities provides a foundation on which to build; in fact, we have a lot of company and are having a lot of fun. | Vega-Lite is a high-level grammar of interactive graphics implemented in JavaScript; it renders interactive visualizations in the browser based on a JSON specification. In Python and JavaScript, the Altair and vega-lite-api packages have demonstrated how the development of APIs to build Vega-Lite graphics can be partially automated based on the Vega-Lite JSON schema, which describes the required format for a Vega-Lite JSON specification. This talk will describe the development of the ‘vlbuildr’ package for building Vega-Lite specifications in R and the ‘vlmetabuildr’ package for building the ‘vlbuildr’ package. The ‘vlbuildr’ package seeks to provide a pipe-friendly, “R-like” functional interface for building up simple to complex specifications for Vega-Lite graphics, which can in turn be rendered as an HtmlWidget by the ‘vegawidget’ R package. Building such an API in a fully automated way from the Vega-Lite schema presents considerable challenges, so the approach taken here was to rely on partial automation. Human judgement dictates the basic contours of the API, such as what groups of functions to include and how various types of building blocks will go together. The part that is automated is filling in many details such as the different variants of a group of functions, the exact parameters needed for each function, and the documentation of those parameters -- the parts that would be extremely tedious to port over! | The ggplot2 package is widely acknowledged as a powerful, dynamic, and easy-to-learn graphics framework when used in an interactive environment. 
Using ggplot2 in a package or Shiny app environment adds several constraints which are sometimes circumvented using ggplot2 behaviour that may change in the future. Some best practices include (1) using the `.data` pronoun to refer to the layer data within `aes()` and `vars()` instead of the original variable name, (2) ensuring that `plot()` methods that use ggplot2 explicitly `print()` one or more ggplot objects, (3) defining extension themes that modify a complete theme within ggplot2 (like `theme_gray()`), and (4) testing graphical output using the vdiffr package. Collectively, these practices result in better error messages with unexpected user input and ensure compatibility with most versions of ggplot2, including those to come in the future.
1:23 PM-1:45 PM
Title: Totally Tidy Tuning Techniques | UnicoRns are real | Bridging the gap between SQL and R: Introducing queryparser and tidyquery | Spruce up your ggplot2 visualizations with formatted text
Speaker: Max Kuhn | Dr. Travis Gerke | Ian Cook | Claus Wilke
SpeakerInfo: Applied Machine Learning RStudio | Scientific Director of Collaborative Data Services Moffitt Cancer Center | Curriculum Developer Cloudera | Professor of Integrative Biology The University of Texas at Austin
Abstract: Many models have structural parameters that cannot be directly estimated from the data. These tuning parameters can have a significant effect on model performance and require some mechanism for finding reasonable values. The tune and workflow packages enable tidymodels users to optimize these parameters using a variety of efficient grid search methods as well as with iterative search techniques (such as Bayesian optimization). | Common advice from experienced data scientists to job-seekers is to avoid job postings that describe a "data science unicorn": someone who has experience performing an unrealistically large array of technical and business-related job duties. Seeking a unicorn is viewed as a potential indicator that the company fails to understand their data science needs, and that new hires will not be poised for success due to lacking support and resources [Robinson & Nolis, 2019]. The R language, particularly when used with RStudio products, has evolved to enable production-level activities in the areas of data wrangling, reporting/dashboarding, database/software engineering, machine learning, and web application development. It is increasingly plausible that a data scientist will be able to efficiently perform a wide variety of job functions with experience only in a single language (R). Indeed, even entry level R users may tread into "unicorn" territory. Current standards for data scientist job descriptions and salaries do not accommodate this nuance, leaving both job-seekers and hiring managers unable to distinguish job requirements which should be read as warning signs from listings which are idyllic matches for the modern R unicorn. In this talk, we present data aggregated from several large compensation analytics companies which summarize current benchmarks for data science job descriptions and corresponding salary ranges. 
We then suggest job description language to target modern R users, considering both job duty compatibility and job post findability. These descriptions are presented with likely salary range pairings. Attention is given to deviations from traditional degree requirements, years of experience, and demands for multiple programming language literacy which may lack relevance for the R unicorn. Our overarching goal is to provide job description templates which encourage optimal matchmaking between R job seekers and organizations in need of their talents. | Like it or not, SQL is the closest thing we have to a universal language for working with structured data. Celebrating its 50th birthday in 2020, SQL today integrates with thousands of applications and has millions of users worldwide. Data analysts using SQL represent a large audience of potential R users motivated to expand their data science skills. But learning R can be frustrating for SQL users. One major frustration is the inability to directly query R data frames with SQL SELECT statements. Eager to use R for tasks that are not possible with SQL (like data visualization and machine learning), these users are dismayed to find that they must first learn an unfamiliar syntax for data manipulation. The popularity of the sqldf package (which automatically exports an R data frame into an embedded database, then runs a SQL query on it) demonstrates this frustration. But now there is a way to directly query an R data frame without moving the data out of R. In this talk, I introduce tidyquery, a new R package that runs SQL queries directly on R data frames. tidyquery is powered by dplyr and by queryparser, a new pure-R, no-dependency SQL query parser. | The ggtext package provides various functions to add formatted text to ggplot2 figures, both in the form of plot or axis labels and in the form of text labels or text boxes inside the plot panel. 
Text formatting can be achieved through a small subset of markdown, HTML, and CSS directives. Features currently supported include italics, bold, super- and sub-script, as well as changing font size, font family, and color. Basic support for adding images to formatted text is also available.
1:46 PM-2:08 PM
Title: Neural Networks for Longitudinal Data Analysis | Data Science in Meatspace | List-columns in data.table: Reducing the cognitive and computational burden when working with comple | The little package that could: taking visualizations to the next level with the scales package
Speaker: Dr. Sydeaka Watson | BenJoaquin Gouverneur | Tyson Barrett | Dana Seidel
SpeakerInfo: Senior Data Scientist Korelasi Data Insights; Elicit Insights | Manager, Datalab Plenty Unlimited, Inc. | Research Assistant Professor Utah State University | Senior Data Scientist Plenty Unlimited
Abstract: Longitudinal data (or panel data) arise when observations are recorded on the same individuals at multiple points in time. For example, a longitudinal baseball study might track individual player characteristics (team affiliation, age, height, weight, etc.) and outcomes (batting average, stolen bases, runs, strikeouts, etc.) over multiple seasons, where the number of seasons could vary across players. Neural network frameworks such as Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRUs) can flexibly accommodate this data structure while preserving and exploiting temporal relationships. In this presentation, we highlight the use of neural networks for longitudinal data analysis with tensorflow and keras in R. | The Data Science community is dominated by folks doing amazing work with data that starts in and never leaves **cyberspace**. This talk is about best practices and playbooks for doing data science that involves **meatspace** (the opposite of cyberspace) and why R is such a great language for working with data that originated in the physical world. While the concrete examples in this talk will mostly come from the **manufacturing** space, where I have the most experience, I believe the themes are relevant to many meatspace workflows. We'll talk through effective playbooks that can help you navigate common tasks throughout the life-cycle of a project. We’ll also weave in how R’s glorious package ecosystem, including `tidyverse`, can be combined with other languages like `python`, and with enterprise products like **RStudio Connect** to great effect. Specifically, we'll discuss practices in these areas: * best practices for **data collection** in meatspace * the importance of quantifying **measurement system error** * collecting the correct data for training **computer vision** models * the rarely discussed cost of **maintaining models** in production | The use of list-columns in data frames and tibbles is well documented (e.g. 
Bryan, 2018), providing a cognitively efficient way to organize results of complex data (e.g. several statistical models, groupings of text, data summaries, or even graphics) with corresponding data. For example, one can store student information within classrooms, player information within teams, or analyses within groups. This allows the data to be of variable sizes without overly complicating or adding redundancies to the structure of the data. In turn, this can improve the reliability to appropriately analyze the data. Because of its efficiency and speed, being able to use data.table to work with list-columns would be beneficial in many data contexts (e.g. to reduce memory usage in large data sets). Herein, I demonstrate how one can create list-columns in a data table using the by argument in data.table and purrr::map(). I compare the behavior of the data.table approaches to the dplyr::group_nest() function and tidyr::unnest(), two of the several powerful tidyverse nesting and unnesting functions. Results using bench::mark() show the speed and efficiency of using data.table to work with list-columns. | Precise axes, proper data transformation, and informative visual data mappings are critical components to any polished visualization. The scales package, the unsung hero behind ggplot2’s scale_* infrastructure, includes functions to help any R user manipulate and polish their visualizations. In this presentation, we will explore the functionality of this small but mighty package: demonstrating its functions for polishing guides, e.g. breaks and labels, managing data transformations, and for mapping aesthetic palettes to data.
2:09 PM-2:29 PM
Title: Stochastic Block Models with R: Statistically rigorous clustering with rigorous code | Value in Data Science Beyond Models in Production | Advances in tidyeval | Extending your ability to extend ggplot2
Speaker: Nick Strayer | Eduardo Ariño de la Rubia | Lionel Henry | Thomas Lin Pedersen
SpeakerInfo: PhD Candidate Vanderbilt University | NA | Advances in tidyeval RStudio | RStudio
Abstract: Often a machine learning research project starts with brainstorming, continues to one-off scripts while an idea forms, and finally a package is written to disseminate the product. In this talk I will share my experience rethinking this process by spreading the package writing across the whole process. While there are cognitive overheads involved with setting up a package framework, I will argue that these overheads can actually serve as a scaffolding for not only good code but robust research practices. The end result of this experiment is the SBMR package: a native R package written to fit and investigate the results of Bipartite Stochastic Block Models that forms the backbone of my PhD dissertation. By going over the ups and downs of this process I hope to leave the audience with inspiration for moving the package writing process closer to the start of their projects and melding research and code more closely to improve both. | ML in production is one of the most obvious ways that data science organizations create value in business. However, these models are at the very end of a long story of how quantitative research changes and enhances organizations. In this talk I will discuss how I have found DS organizations to be truly transformative outside of ML in the loop. Bio: Eduardo Ariño de la Rubia is a DS manager and educator. He loves R and RStudio. He has a Masters in Negotiation, Conflict Resolution and Peacebuilding, which is probably the most useful training he could have received. | In tidyverse grammars such as dplyr you can refer to the columns in your data frames as if they were objects in the workspace. This syntax is optimised for interactivity and is a great fit for data analysis, but it makes it harder to write functions and reuse code. 
In this talk we present some advances in the tidy eval framework that make it easier to program around tidyverse pipelines without having to learn a lot of theory. | The ggplot2 package continues to be one of the most used frameworks for producing graphics in R. While being extremely flexible, the package itself can be constrained by the different types of graphic elements and statistic transformations available. Instead of continuing to add new features, the development in recent years has focused on making ggplot2 extensible by other packages, thus distributing development and maintenance. Despite the best of intentions, ggplot2 can feel daunting to extend, due to unusual idiosyncrasies, a foreign object system, and a partly obscured rendering model. This talk intends to remove the mystery of extending ggplot2 by describing the basic ways that it can be extended and showcasing a couple of simple extensions that can be built with very little code. Lastly, it will include discussions of some best practices and gotchas that may come in handy when you start out.

Break for Snack

Session 6
Lightning Talks-Room 3 | Lightning Talks-Room 2 | Panel-Room 1
2:45 PM-3:30 PM
Title: Various | Various | Career Advice for Data Scientists
Speaker: Various | Various | Jen Hecht
SpeakerInfo: NA | NA | VP of People Operations RStudio

Breakout of Lightning Talks

Session 6 - Lightning Talks
Lightning Talks-Room 3 | Lightning Talks-Room 2
2:45 PM-2:50 PM
Title: Making a tidy dress | `livecode`: broadcast your live coding sessions from and to RStudio
Speaker: Dr. Amelia McNamara | Colin Rundel
SpeakerInfo: Assistant Professor of Computer & Information Sciences University of St Thomas | Lecturer University of Edinburgh
Abstract: After at least a year of dreaming about it, I finally produced the #rstats/#tidyverse dress of my dreams. This involved designing fabric, getting it custom printed, making a pattern from an existing garment, and sewing the dress. I learned a lot of useful lessons during this project, including "do unit tests" (make a practice dress) and "document your work" (get your BFF to take pictures of you). | In this talk we will demonstrate `livecode`, a new R package for broadcasting code for live code demonstrations. This package implements a simple web server (using `httpuv`) to dynamically publish the content of a code file (i.e. `.R` or `.Rmd`) as you edit it live. This enables your students to have near real-time access to your code as you write it. The broadcast file can be viewed with any web browser, but the package is specifically designed to be used within RStudio, leveraging its built-in viewer. This gives students direct access to the shared code within the IDE, allowing direct copying into their own source files and/or the console and thereby improving their ability to interact and experiment with your code.
2:50 PM-2:55 PM
Title: Datasets in Reproducible Research with 'pins' | A high school student’s journey to bring R into the classroom.
Speaker: Javier Luraschi | Jay Campanell
SpeakerInfo: RStudio | Liberal Arts and Sciences Academy
Abstract: Open source code is an essential piece in making science reproducible. Tools like 'rmarkdown' and GitHub facilitate running and sharing outcomes with colleagues and with the broad scientific community at large. However, it is less clear what tools should be used to retrieve, store and share datasets; while it is possible to make datasets part of your workflows today, it is usually hard and we are often left with manually sharing or downloading links to datasets. Not only that, but it's also hard to share or discover datasets. In this talk we will introduce for the first time the 'pins' package, a package designed to pin, discover and share resources. This means you can use 'pins' to simplify your data science workflows by easily fetching resources from GitHub, Kaggle, CRAN and RStudio Connect. We will present a 'pin' as a generic resource that can contain tabular datasets like CSVs, unstructured data like JSON files, image archives as ZIP files and so on. This talk will be highly interactive, showing you how to get started by installing 'pins' from CRAN, retrieve and cache resources, and share and discover useful and fun data resources to improve and enhance your day-to-day workflows. | My 8th grade capstone project introduced me to R. The project was a data visualization about breakfast tacos. I used R and other web-based tools. My lightning talk will focus on my experience using R for class projects and getting the support from my parents to help integrate R into the classroom. I will show how students can get started when they have no clue how to use R. I will talk about the project’s toolkit, which includes RStudio Cloud, Google Sheets, a Chromebook, measurement tools, and my phone, and about how R is being used in my school.
2:55 PM-3:00 PM
Title: Becoming an R blogger | Course Material Creation in the R Ecosystem
Speaker: Rebecca Barter | Kelly Bodwin
SpeakerInfo: UC Berkeley | Assistant Professor of Statistics California Polytechnic State University - San Luis Obispo
Abstract: Blogging is an excellent way to learn, improve your communication skills, and gain exposure in the R and data science communities. In this talk, I will discuss how and why I started blogging, and why you should too. I will guide you through choosing topics, writing your blog using RStudio and blogdown, hosting it on netlify, and sharing your blog with the world. This talk is for you if you've wanted to start a blog on R, data science, or to showcase your data analyses, but don't know where to start. | In this talk, I will introduce a suite of three packages designed to aid course material creation in R: {demoR} for displaying code in knitted R Markdown with custom highlighting and formatting; {shindig} for shortcuts to creating simple educational Shiny apps; and {curricular} for easy creation of syllabi, homework exercises, exams, etc. Together, we will explore how these new tools - in conjunction with other existing resources - have been used to create a clean and consistent ecosystem for my R-based Introductory Statistics course. I will share some metrics on student outcomes, as well as my own experiences with the advantages and challenges in building the course.
3:00 PM-3:05 PM
Title: Mexican electoral quick count night with R | Data Science for Software Engineers: busting software myths with R
Speaker: MA Maria Ortiz Mancera | Mx. Yim Register
SpeakerInfo: Advisor CONABIO | RStudio/University of Washington
Abstract: In Mexico the elections take place on a Sunday, and the official results are presented a week later. To prevent unjustified victory claims during that period, the electoral authority organizes a quick count the same night of the election. The quick count consists of selecting a random sample of the polling stations and estimating the percentage of votes in favor of each candidate. With highly competitive electoral processes the quick count has become very important: the rapidity and precision of its results foster an environment of trust, and it serves as a tool against fraud. In this application reproducibility is very important. On the scientific side, it is crucial to examine the veracity and robustness of the conclusions of the methodologies. However, in this case, reproducibility is more important still, as it helps to achieve transparency in the electoral procedure. Anyone can download the sample and compute the same results that were announced the night of the election. This transparency fosters trust in institutions and gives legitimacy to the outcome of the quick count. We believe that developing an R package with detailed vignettes made the procedure accessible for the public. The package also facilitated code development and estimation on the election night, when the models were run with partial samples every five minutes, for three different state elections and for the presidential election. Our models were one of 9 different approaches to do the estimation, and yet our code is the only one publicly available; we are championing more openness on procedures by sharing our experience. 
As for the model, we developed Bayesian hierarchical models that include demographic and geographic covariates. The purpose of the models is to reduce the biases associated with such covariates: complete samples are rarely available in time to publish the results, so the results are announced using partial samples, which are biased. | The software engineering world is full of claims about best practices, languages, packages, styles, and workflows, but most software engineering students are never taught how to find, read, and interpret actual evidence on those topics. Is agile development really the secret to success? Do some languages actually cause more defects than others? This talk describes a series of meaningful lessons that explore research in software engineering for the beginner R programmer by teaching students to interpret and replicate research findings while learning meaningful results for their field in addition to common statistical methods. The lessons serve as a primer for software engineers to participate in a data-driven society; from advertising and business to combating misinformation and helping user experience.
3:05 PM-3:10 PM
Title: Rproject templates to automate and standardize your workflow | Learn to teach, for goodness sake.
Speaker: Caroline Ledbetter | Mike K Smith
SpeakerInfo: Sr Professional Research Assistant University of Colorado | Senior Director, Statistics Pfizer
Abstract: Many teams and organizations have tasks and structures that are standard across projects. Lack of consistency and documentation can lead to lost productivity when team members join collaborations or previous work is consulted by your future self. Setting up folder structures can be particularly tedious. This talk will demonstrate using RStudio project templates as part of an organizational package to automatically set up file structures, establish git repositories and add standardized readme files. It will also show how including report templates for R Markdown files can lead to more consistent and professional reports. Project info can be optionally stored so that project information can be easily added automatically to the top of reports and included in snippets for code file headers. Creating standard, easy to implement documentation and procedures can be particularly effective in encouraging skeptical collaborators to use git and R Markdown. Organizational packages can also be a great place to house functions that are specific and common to an organization's needs. The talk will showcase this functionality using the CIDAtools package that we developed. While the CIDAtools package was developed to address issues that sometimes arise from the less structured environment of academia, the tools presented can be equally useful in an industry setting. | Even though I’ve completed 4 marathons, you certainly shouldn’t come to me for a training plan on how to achieve your goals for any race you’re about to run. So why do we often turn to “experienced R users” to help us learn R or train an organization? The RStudio certified trainers have been taught modern, evidence-based teaching practices which they use in planning training sessions in order to help delegates achieve THEIR learning goals effectively in a given time-frame. 
My talk will illustrate some of these teaching concepts and how, by becoming a certified trainer, you can help others learn about R more effectively.
3:10 PM-3:15 PM
Title: Sound annotation with Shiny and wavesurfer | Learning by Teaching: Mentoring at the R4DS Online Learning Community
Speaker: Athos Damiani | Mr. Jon Harmon
SpeakerInfo: Founder/Statistician R6 | Senior Data Scientist Macmillan Learning
Abstract: We have observed huge improvements in Machine Learning tools, but the main effort has been to help with the steps that come after the dataset is annotated. We still struggle to build a trustworthy pipeline to make these annotations. The package wavesurfer brings to R users the ability to annotate audio files with ease and reliability, using the friendly user interface of Shiny to make this hard and laborious part of the project more joyful and efficient. | I host a weekly R Office Hour on the R4DS Online Learning Community Slack. By doing so, I have learned more about R than I ever would have thought. Here I'll present concrete examples of how R users can participate in the R community to expand their skills. R users of all skill levels can develop their skills by helping one another learn. Committing to help people with their coding challenges leads to exploration of answers in areas you might otherwise not examine.
3:15 PM-3:20 PM
Title: Every voice matters: An analysis of @WeAreRLadies | Peer review in data science courses
Speaker: Katherine Simeon | Therese Anders
SpeakerInfo: Northwestern University | Postdoctoral research fellow Hertie School
Abstract: As a rotating curation, @WeAreRLadies is a Twitter account that has a different curator (i.e., tweeter) each week with a mission to highlight female and minority genders and their work in R. So far, curators have tweeted from 18 different countries and represent a variety of domains and levels of R expertise, ranging from R novices to those developing their own packages. With 45 R-Ladies curators to date, the account has become a popular R-related Twitter resource, gaining more than 13,000 followers in the past year and hundreds of interactions each week. This talk will present a text analysis and reflection on over a year of Twitter text data from @WeAreRLadies. As the founder and maintainer of this account, I witness firsthand the bidirectional relationship between one’s learning journey and their use of R. In this talk, I will attempt to quantify this through a text analysis that explores how one’s experiences learning and using R relate to how they talk (or tweet) about it. By analyzing tweet text as well as other metrics provided by Twitter (e.g., number of likes, replies, and clicks), I will showcase different ways curators have engaged with the R Twitter community and explore how account engagement has changed as the number of curators and followers continues to grow. I will also discuss how curators’ different areas of expertise have resulted in tweets and discussions that both demonstrate the variety of tools available in R, and spotlight unifying ideas and best practices in R programming. Finally, I will reflect on lessons learned and future directions for @WeAreRLadies, as well as its contribution to the R-Ladies Global initiative. 
Overall, this talk will discuss how diverse perspectives of @WeAreRLadies curators have enriched the conversations in the R Twitter community by validating different learning journeys and by promoting and amplifying underrepresented voices. | Peer review enables instructors of large data science classes to provide substantive feedback to students beyond what is feasible with standard code review via automated grading and continuous integration. It facilitates peer learning, which is shown in literature to have positive learning outcomes, and can reduce the burden of grading by course staff. The ghclass package provides a suite of functions to manage courses via GitHub repositories. The package has recently been supplemented with the functionality to implement peer review. Developed during my 2019 summer internship with RStudio in collaboration with my mentor Mine Çetinkaya-Rundel, the peer review functions in ghclass interface with the GitHub API to create review repositories, move files between authors and reviewers, submit feedback, and collect grades. In this presentation, I will give a demonstration of the peer review functions in ghclass. A set of six functions allows instructors to 1) create a random review roster, 2) set up the review repository infrastructure within a GitHub organization, 3) move assignments from authors to reviewers, 4) collect grades, 5) return the feedback, and 6) obtain a rating of the review from the authors. I reflect on the pedagogy of implementing peer review in introductory data science classes and talk about lessons learned from a real-world test run of the package in the Fall semester 2019 at the University of Edinburgh, conducted by Mine Çetinkaya-Rundel. The presentation highlights ghclass as an R command-line based, open source, low profile, and powerful solution to enable peer review in classes ranging from a size of two to approximately 400 students.
3:20 PM-3:25 PM
Title: Lessons about R I learned from my cat | The Five Principles of Data Science Education
Speaker: Amanda Gadrow | Hunter Glanz
SpeakerInfo: Director of QA and Support RStudio | Dr. California Polytechnic State University
Abstract: Forming good development habits for R projects is pretty straight-forward if you follow the lessons I've learned from my cat, whose advice includes "be lazy", "keep your claws sharp", and "land on your feet". Attendees of this talk will learn how to make life easier on colleagues and their future selves by using simple software engineering best practices to build their current projects. Each point will come with cat photos and code samples, the two best parts of the Internet! | In this talk, I will outline a unified philosophy of data science education, and provide tips and tools for implementing these principles in the classroom using R and RStudio. Although data science as a professional discipline is well-established, its pedagogy is still in a period of growth. Even within a single university, multiple data science courses may be offered across different departments leading to inevitable redundancy of efforts amidst rich domain-specific innovations. My experience as an instructor in many such courses has led me to five principles that transcend domain, context, and choice of language: reproducibility, communication, version control, practical application, and data ethics. For each of these full-stack themes, I will share examples of how to leverage tools in R and RStudio to enhance learning.
3:25 PM-3:30 PM
Title: NA | TidyBlocks: using the language of the tidyverse in a blocks-based interface
Speaker: NA | Maya Gans
Abstract: NA | As an intern at RStudio, I developed a blocks-based coding language mimicking the verb-driven programming of the tidyverse. Blocks-based coding environments are a popular way to introduce programming to novices. Instead of typing in code, users click blocks together to create loops, conditionals, and expressions. Studies have shown that students are more successful and more interested in coding when introduced through a block-based language like Scratch or Snap! rather than a text-based language. However, it's much easier to express control flow with these tools than to manipulate data: adding 1 to a variable requires several steps, and there are no built-in capabilities for working with tabular data. On the other hand, R's tidyverse libraries provide a predictable, consistent grammar for doing these tasks. Tabular data can be imported and transformed using verbs like filter, select, and summarize, and functions can be strung together using pipes, which users can think of as meaning "and then". The talk will include a demo of TidyBlocks and a description of how we're testing and improving it.