2  Importing Data into Positron for SQL Analysis

First, I list the data files in my local directory “SyntheaCovid19”:

cd /Users/nnthieu/SyntheaData/SyntheaCovid19
ls
allergies.csv
careplans.csv
conditions.csv
covid_cons.csv
devices.csv
encounters.csv
imaging_studies.csv
immunizations.csv
medications.csv
observations.csv
organizations.csv
patients.csv
payer_transitions.csv
payers.csv
procedures.csv
providers.csv
supplies.csv
Synthea COVID-19 Analysis.html
synthea19.duckdb

A SQL database named “synthea19.duckdb” already exists. If it has not been created yet, I create it as follows:

# Load necessary libraries
library(DBI)
library(duckdb)
library(RPostgres)
library(tools)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

The R command con <- dbConnect(duckdb::duckdb(), dbdir = "/Users/nnthieu/SyntheaData/SyntheaCovid19/synthea19.duckdb") establishes a connection to a DuckDB database using the DBI and duckdb packages. Here, dbConnect() creates the connection, while duckdb::duckdb() specifies the DuckDB driver. The argument dbdir points to the location of the database file on the user’s computer; if the file already exists, DuckDB opens it, and if not, it creates a new one. The resulting connection object, stored in con, allows the user to run SQL queries, manage tables, and interact with the database directly from R.

con <- dbConnect(duckdb::duckdb(), dbdir = "/Users/nnthieu/SyntheaData/SyntheaCovid19/synthea19.duckdb")

The command dbExecute(con, "INSTALL postgres; LOAD postgres;") instructs DuckDB, through the active database connection con, to install and activate the PostgreSQL extension. Using dbExecute() from the DBI package, the SQL statements "INSTALL postgres" and "LOAD postgres" are executed within DuckDB. The installation step ensures that the PostgreSQL extension is available, while the loading step makes it active for the current session. Once loaded, this extension allows DuckDB to connect to and query data directly from PostgreSQL databases, enabling seamless data integration and transfer between DuckDB and PostgreSQL systems.

# Load the PostgreSQL extension
dbExecute(con, "INSTALL postgres; LOAD postgres;")
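The PostgreSQL extension is not actually required for the DuckDB-only workflow that follows, but once loaded it can attach an external PostgreSQL database. As a hypothetical sketch (the connection-string values dbname, host, and user below are placeholders, not a real server):

```r
# Hypothetical sketch: attach a PostgreSQL database through the extension.
# The connection-string values (dbname, host, user) are placeholders.
dbExecute(con, "ATTACH 'dbname=mydb host=localhost user=postgres' AS pg (TYPE postgres)")

# Tables in the attached database can then be queried with ordinary DuckDB SQL,
# e.g. dbGetQuery(con, "SELECT * FROM pg.some_table LIMIT 5")
```

If you only work with the local .duckdb file, this step can be skipped entirely.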

This R code imports all CSV files from the folder /Users/nnthieu/SyntheaData/SyntheaCovid19 into a DuckDB database. First, the list.files() function collects the full paths of all .csv files in the directory. Then, the for loop iterates through each file. For each one, tools::file_path_sans_ext(basename(file)) extracts the file name without its extension to use as the database table name. Before creating the new table, dbExecute() runs a SQL command to drop any existing table with the same name to avoid conflicts. Next, a SQL query is constructed using CREATE TABLE ... AS SELECT * FROM read_csv_auto('%s'), which reads the CSV file directly into DuckDB and automatically infers column types. This query is then executed, creating a new table in the database for each CSV file. In summary, the code automates the process of loading multiple CSV files into DuckDB as separate tables.

# Collect the full paths of all CSV files in the data directory
csv_files <- list.files("/Users/nnthieu/SyntheaData/SyntheaCovid19", pattern = "\\.csv$", full.names = TRUE)

for (file in csv_files) {
  # Use the file name (without extension) as the table name
  table_name <- tools::file_path_sans_ext(basename(file))
  
  # Drop the table if it already exists (quoted to be safe with unusual names)
  dbExecute(con, sprintf('DROP TABLE IF EXISTS "%s"', table_name))
  
  # Create the table directly from the CSV, letting DuckDB infer column types
  query <- sprintf('CREATE TABLE "%s" AS SELECT * FROM read_csv_auto(\'%s\')', table_name, file)
  dbExecute(con, query)
}

In case the database already exists, I connect to it as follows:

# Close any existing connections first
if (exists("con")) {
  try(dbDisconnect(con), silent = TRUE)
}

# Connect to DuckDB database
con <- dbConnect(
  duckdb::duckdb(),
  dbdir = "/Users/nnthieu/SyntheaData/SyntheaCovid19/synthea19.duckdb"
)
# Load the PostgreSQL extension
dbExecute(con, "INSTALL postgres; LOAD postgres;")
[1] 0

I test whether the database connection works properly.

# Verify tables
dbListTables(con)
 [1] "allergies"         "careplans"         "conditions"       
 [4] "devices"           "encounters"        "imaging_studies"  
 [7] "immunizations"     "medications"       "observations"     
[10] "organizations"     "patients"          "payer_transitions"
[13] "payers"            "procedures"        "providers"        
[16] "supplies"         
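As a further sanity check, one could count the rows in each imported table. A small sketch, reusing the connection con and the table list from above:

```r
# Count rows in every table to confirm the CSV imports look reasonable
for (tbl in dbListTables(con)) {
  n <- dbGetQuery(con, sprintf('SELECT COUNT(*) AS n FROM "%s"', tbl))$n
  cat(tbl, ":", n, "rows\n")
}
```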

Now I run a first SQL query to check that my connection setup works correctly.

query <- "SELECT * FROM conditions"
df <- dbGetQuery(con, query)
head(df)
       START       STOP                              PATIENT
1 2020-03-01 2020-03-30 1ff7f10f-a204-4bb1-aa72-dd763fa99482
2 2020-03-01 2020-03-01 1ff7f10f-a204-4bb1-aa72-dd763fa99482
3 2020-03-01 2020-03-30 1ff7f10f-a204-4bb1-aa72-dd763fa99482
4 2020-02-12 2020-02-26 9bcf6ed5-d808-44af-98a0-7d78a29ede72
5 2020-03-13 2020-04-14 9bcf6ed5-d808-44af-98a0-7d78a29ede72
6 2020-03-13 2020-04-14 9bcf6ed5-d808-44af-98a0-7d78a29ede72
                             ENCOUNTER      CODE              DESCRIPTION
1 60584c6a-e26b-4176-a660-86ad84c7ceca 386661006          Fever (finding)
2 60584c6a-e26b-4176-a660-86ad84c7ceca 840544004       Suspected COVID-19
3 60584c6a-e26b-4176-a660-86ad84c7ceca 840539006                 COVID-19
4 5d9dfe80-7edf-4181-9753-770e361934d1  44465007          Sprain of ankle
5 a78e78d9-33bb-40bc-9e42-e47ff7e910cc  49727002          Cough (finding)
6 a78e78d9-33bb-40bc-9e42-e47ff7e910cc 248595008 Sputum finding (finding)
  days_duration
1            29
2             0
3            29
4            14
5            32
6            32
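With the connection verified, more substantive SQL runs the same way. For example, here is a sketch of an aggregate query that tallies the most frequent diagnoses in the conditions table shown above:

```r
# Tally the ten most frequent condition descriptions
query <- "
  SELECT DESCRIPTION, COUNT(*) AS n_cases
  FROM conditions
  GROUP BY DESCRIPTION
  ORDER BY n_cases DESC
  LIMIT 10
"
dbGetQuery(con, query)
```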

2.0.1 Summary

This chapter provides a step-by-step guide to setting up a database and preparing it for SQL analysis in Positron. It begins by showing how to create a new DuckDB database and load the Synthea CSV files into it as tables. Once the database is in place, the chapter demonstrates how to connect to it from Positron and run SQL queries against it. By the end of the chapter, readers have a complete setup for efficient SQL-based data analysis within the Positron environment.

Notice that DuckDB is strict about concurrent access to a database file. If you re-run the project repeatedly, you may get connection errors because a previous connection is still open. In that case, disconnect explicitly before reconnecting, or, as a last resort, completely close and reopen Positron.
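A cleaner alternative to restarting the IDE is to release the connection explicitly at the end of each session. dbDisconnect() with shutdown = TRUE closes the connection and shuts down the embedded DuckDB instance, so the database file can be reopened cleanly next time:

```r
# Release the connection and shut down the embedded DuckDB instance,
# freeing the database file for the next session
dbDisconnect(con, shutdown = TRUE)
```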