The R command con <- dbConnect(duckdb::duckdb(), dbdir = "/Users/nnthieu/SyntheaData/SyntheaCovid19/synthea.duckdb") establishes a connection to a DuckDB database using the DBI and duckdb packages. Here, dbConnect() creates the connection, while duckdb::duckdb() specifies the DuckDB driver. The argument dbdir points to the location of the database file on the user’s computer; if the file already exists, DuckDB opens it, and if not, it creates a new one. The resulting connection object, stored in con, allows the user to run SQL queries, manage tables, and interact with the database directly from R.
library(DBI)

con <- dbConnect(duckdb::duckdb(), dbdir = "/Users/nnthieu/SyntheaData/SyntheaCovid19/synthea.duckdb")
The command dbExecute(con, "INSTALL postgres; LOAD postgres;") instructs DuckDB, through the active database connection con, to install and activate the PostgreSQL extension. Using dbExecute() from the DBI package, the SQL statements "INSTALL postgres" and "LOAD postgres" are executed within DuckDB. The installation step downloads the extension if it is not already available, while the loading step activates it for the current session. Once loaded, this extension allows DuckDB to connect to and query data directly from PostgreSQL databases, enabling data integration and transfer between the two systems.
# Load the PostgreSQL extension
dbExecute(con, "INSTALL postgres; LOAD postgres;")
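As an illustration of what the extension enables, the sketch below attaches a hypothetical PostgreSQL database and queries one of its tables through DuckDB. The connection string, database name, and table name (mydb, some_table) are placeholders for your own setup, not part of this project.

# Sketch: attach a PostgreSQL database (connection details are placeholders)
dbExecute(con, "ATTACH 'dbname=mydb user=postgres host=127.0.0.1' AS pg (TYPE postgres)")

# Query a PostgreSQL table directly from DuckDB (table name is illustrative)
dbGetQuery(con, "SELECT * FROM pg.public.some_table LIMIT 5")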
This R code imports all CSV files from the folder /Users/nnthieu/SyntheaData/SyntheaCovid19 into a DuckDB database. First, the list.files() function collects the full paths of all .csv files in the directory. Then, the for loop iterates through each file. For each one, tools::file_path_sans_ext(basename(file)) extracts the file name without its extension to use as the database table name. Before creating the new table, dbExecute() runs a SQL command to drop any existing table with the same name to avoid conflicts. Next, a SQL query is constructed using CREATE TABLE ... AS SELECT * FROM read_csv_auto('%s'), which reads the CSV file directly into DuckDB and automatically infers column types. This query is then executed, creating a new table in the database for each CSV file. In summary, the code automates the process of loading multiple CSV files into DuckDB as separate tables.
csv_files <- list.files(
  "/Users/nnthieu/SyntheaData/SyntheaCovid19",
  pattern = "\\.csv$",
  full.names = TRUE
)

for (file in csv_files) {
  table_name <- tools::file_path_sans_ext(basename(file))

  # Drop table if it exists
  dbExecute(con, sprintf("DROP TABLE IF EXISTS %s", table_name))

  # Create table from CSV
  query <- sprintf("CREATE TABLE %s AS SELECT * FROM read_csv_auto('%s')", table_name, file)
  dbExecute(con, query)
}
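To confirm the import worked, you can list the tables now present in the database. This quick check is not part of the original loop; it simply shows one table per CSV file.

# List the tables created from the CSV files
dbListTables(con)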
In case the database already exists, I connect to it as follows:
# Close any existing connections first
if (exists("con")) {
  try(dbDisconnect(con), silent = TRUE)
}

# Connect to DuckDB database
con <- dbConnect(
  duckdb::duckdb(),
  dbdir = "/Users/nnthieu/SyntheaData/SyntheaCovid19/synthea19.duckdb"
)
# Load the PostgreSQL extension
dbExecute(con, "INSTALL postgres; LOAD postgres;")
[1] 0
I test whether the database connection works properly.
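A minimal sketch of such a test: run a trivial query, then preview one of the imported tables. The table name patients is a standard Synthea export and is assumed here; yours may differ.

# A trivial query returns a value if the connection is live
dbGetQuery(con, "SELECT 1 AS ok")

# Preview a few rows; "patients" is a typical Synthea table (assumed)
dbGetQuery(con, "SELECT * FROM patients LIMIT 5")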
This chapter provides a step-by-step guide on setting up a database and preparing it for SQL analysis in Positron. It begins by explaining how to create a new database and define the necessary tables within it. Once the database structure is established, the chapter demonstrates how to connect the database to Positron, enabling users to perform SQL queries and analyses. By the end of the chapter, readers will have a complete setup that allows them to efficiently conduct SQL-based data analysis within the Positron environment.
Note that DuckDB locks the database file, allowing only one writable connection at a time. If you re-run the project while an earlier connection is still open, the new connection will fail. In that case, completely close the R session (for example, quit and reopen RStudio or Positron) before reconnecting for SQL analysis.
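A lighter-weight alternative, assuming the con object is still in your workspace, is to release the lock explicitly before reconnecting; in the duckdb package, shutdown = TRUE also stops the embedded database instance.

# Close the connection and shut down the embedded DuckDB instance,
# releasing the lock on the database file
dbDisconnect(con, shutdown = TRUE)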