← Back to Prompts

base-R

A comprehensive reference guide for mastering base R programming, data structures, and statistical computing.

by OpenPrompts_Bot
--- name: base-r description: Provides base R programming guidance covering data structures, data wrangling, statistical modeling, visualization, and I/O, using only packages included in a standard R installation --- # Base R Programming Skill A comprehensive reference for base R programming — covering data structures, control flow, functions, I/O, statistical computing, and plotting. ## Quick Reference ### Data Structures ```r # Vectors (atomic) x <- c(1, 2, 3) # numeric y <- c("a", "b", "c") # character z <- c(TRUE, FALSE, TRUE) # logical # Factor f <- factor(c("low", "med", "high"), levels = c("low", "med", "high"), ordered = TRUE) # Matrix m <- matrix(1:6, nrow = 2, ncol = 3) m[1, ] # first row m[, 2] # second column # List lst <- list(name = "ali", scores = c(90, 85), passed = TRUE) lst$name # access by name lst[[2]] # access by position # Data frame df <- data.frame( id = 1:3, name = c("a", "b", "c"), value = c(10.5, 20.3, 30.1), stringsAsFactors = FALSE ) df[df$value > 15, ] # filter rows df$new_col <- df$value * 2 # add column ``` ### Subsetting ```r # Vectors x[1:3] # by position x[c(TRUE, FALSE)] # by logical x[x > 5] # by condition x[-1] # exclude first # Data frames df[1:5, ] # first 5 rows df[, c("name", "value")] # select columns df[df$value > 10, "name"] # filter + select subset(df, value > 10, select = c(name, value)) # which() for index positions idx <- which(df$value == max(df$value)) ``` ### Control Flow ```r # if/else if (x > 0) { "positive" } else if (x == 0) { "zero" } else { "negative" } # ifelse (vectorized) ifelse(x > 0, "pos", "neg") # for loop for (i in seq_along(x)) { cat(i, x[i], "\n") } # while while (condition) { # body if (stop_cond) break } # switch switch(type, "a" = do_a(), "b" = do_b(), stop("Unknown type") ) ``` ### Functions ```r # Define my_func <- function(x, y = 1, ...) { result <- x + y return(result) # or just: result } # Anonymous functions sapply(1:5, function(x) x^2) # R 4.1+ shorthand: sapply(1:5, \(x) x^2) # Useful: do.call for calling with a list of args do.call(paste, list("a", "b", sep = "-")) ``` ### Apply Family ```r # sapply — simplify result to vector/matrix sapply(lst, length) # lapply — always returns list lapply(lst, function(x) x[1]) # vapply — like sapply but with type safety vapply(lst, length, integer(1)) # apply — over matrix margins (1=rows, 2=cols) apply(m, 2, sum) # tapply — apply by groups tapply(df$value, df$group, mean) # mapply — multivariate mapply(function(x, y) x + y, 1:3, 4:6) # aggregate — like tapply for data frames aggregate(value ~ group, data = df, FUN = mean) ``` ### String Operations ```r paste("a", "b", sep = "-") # "a-b" paste0("x", 1:3) # "x1" "x2" "x3" sprintf("%.2f%%", 3.14159) # "3.14%" nchar("hello") # 5 substr("hello", 1, 3) # "hel" gsub("old", "new", text) # replace all grep("pattern", x) # indices of matches grepl("pattern", x) # logical vector strsplit("a,b,c", ",") # list("a","b","c") trimws(" hi ") # "hi" tolower("ABC") # "abc" ``` ### Data I/O ```r # CSV df <- read.csv("data.csv", stringsAsFactors = FALSE) write.csv(df, "output.csv", row.names = FALSE) # Tab-delimited df <- read.delim("data.tsv") # General df <- read.table("data.txt", header = TRUE, sep = "\t") # RDS (single R object, preserves types) saveRDS(obj, "data.rds") obj <- readRDS("data.rds") # RData (multiple objects) save(df1, df2, file = "data.RData") load("data.RData") # Connections con <- file("big.csv", "r") chunk <- readLines(con, n = 100) close(con) ``` ### Base Plotting ```r # Scatter plot(x, y, main = "Title", xlab = "X", ylab = "Y", pch = 19, col = "steelblue", cex = 1.2) # Line plot(x, y, type = "l", lwd = 2, col = "red") lines(x, y2, col = "blue", lty = 2) # add line # Bar barplot(table(df$category), main = "Counts", col = "lightblue", las = 2) # Histogram hist(x, breaks = 30, col = "grey80", main = "Distribution", xlab = "Value") # Box plot boxplot(value ~ group, data = df, col = "lightyellow", main = "By Group") # Multiple plots par(mfrow = c(2, 2)) # 2x2 grid # ... four plots ... par(mfrow = c(1, 1)) # reset # Save to file png("plot.png", width = 800, height = 600) plot(x, y) dev.off() # Add elements legend("topright", legend = c("A", "B"), col = c("red", "blue"), lty = 1) abline(h = 0, lty = 2, col = "grey") text(x, y, labels = names, pos = 3, cex = 0.8) ``` ### Statistics ```r # Descriptive mean(x); median(x); sd(x); var(x) quantile(x, probs = c(0.25, 0.5, 0.75)) summary(df) cor(x, y) table(df$category) # frequency table # Linear model fit <- lm(y ~ x1 + x2, data = df) summary(fit) coef(fit) predict(fit, newdata = new_df) confint(fit) # t-test t.test(x, y) # two-sample t.test(x, mu = 0) # one-sample t.test(before, after, paired = TRUE) # Chi-square chisq.test(table(df$a, df$b)) # ANOVA fit <- aov(value ~ group, data = df) summary(fit) TukeyHSD(fit) # Correlation test cor.test(x, y, method = "pearson") ``` ### Data Manipulation ```r # Merge (join) merged <- merge(df1, df2, by = "id") # inner merged <- merge(df1, df2, by = "id", all = TRUE) # full outer merged <- merge(df1, df2, by = "id", all.x = TRUE) # left # Reshape wide <- reshape(long, direction = "wide", idvar = "id", timevar = "time", v.names = "value") long <- reshape(wide, direction = "long", varying = list(c("v1", "v2")), v.names = "value") # Sort df[order(df$value), ] # ascending df[order(-df$value), ] # descending df[order(df$group, -df$value), ] # multi-column # Remove duplicates df[!duplicated(df), ] df[!duplicated(df$id), ] # Stack / combine rbind(df1, df2) # stack rows (same columns) cbind(df1, df2) # bind columns (same rows) # Transform columns df$log_val <- log(df$value) df$category <- cut(df$value, breaks = c(0, 10, 20, Inf), labels = c("low", "med", "high")) ``` ### Environment & Debugging ```r ls() # list objects rm(x) # remove object rm(list = ls()) # clear all str(obj) # structure class(obj) # class typeof(obj) # internal type is.na(x) # check NA complete.cases(df) # rows without NA traceback() # after error debug(my_func) # step through browser() # breakpoint in code system.time(expr) # timing Sys.time() # current time ``` ## Reference Files For deeper coverage, read the reference files in `references/`: ### Function Gotchas & Quick Reference (condensed from R 4.5.3 Reference Manual) Non-obvious behaviors, surprising defaults, and tricky interactions — only what Claude doesn't already know: - **data-wrangling.md** — Read when: subsetting returns wrong type, apply on data frame gives unexpected coercion, merge/split/cbind behaves oddly, factor levels persist after filtering, table/duplicated edge cases. - **modeling.md** — Read when: formula syntax is confusing (`I()`, `*` vs `:`, `/`), aov gives wrong SS type, glm silently fits OLS, nls won't converge, predict returns wrong scale, optim/optimize needs tuning. - **statistics.md** — Read when: hypothesis test gives surprising result, need to choose correct p.adjust method, clustering parameters seem wrong, distribution function naming is confusing (`d`/`p`/`q`/`r` prefixes). - **visualization.md** — Read when: par settings reset unexpectedly, layout/mfrow interaction is confusing, axis labels are clipped, colors don't look right, need specialty plots (contour, persp, mosaic, pairs). - **io-and-text.md** — Read when: read.table silently drops data or misparses columns, regex behaves differently than expected, sprintf formatting is tricky, write.table output has unwanted row names. - **dates-and-system.md** — Read when: Date/POSIXct conversion gives wrong day, time zones cause off-by-one, difftime units are unexpected, need to find/list/test files programmatically. - **misc-utilities.md** — Read when: do.call behaves differently than direct call, need Reduce/Filter/Map, tryCatch handler doesn't fire, all.equal returns string not logical, time series functions need setup. ## Tips for Writing Good R Code - Use `vapply()` over `sapply()` in production code — it enforces return types - Prefer `seq_along(x)` over `1:length(x)` — the latter breaks when `x` is empty - Use `stringsAsFactors = FALSE` in `read.csv()` / `data.frame()` (default changed in R 4.0) - Vectorize operations instead of writing loops when possible - Use `stop()`, `warning()`, `message()` for error handling — not `print()` - `<<-` assigns to parent environment — use sparingly and intentionally - `with(df, expr)` avoids repeating `df$` everywhere - `Sys.setenv()` and `.Renviron` for environment variables FILE:references/misc-utilities.md # Miscellaneous Utilities — Quick Reference > Non-obvious behaviors, gotchas, and tricky defaults for R functions. > Only what Claude doesn't already know. --- ## do.call - `do.call(fun, args_list)` — `args` must be a **list**, even for a single argument. - `quote = TRUE` prevents evaluation of arguments before the call — needed when passing expressions/symbols. - Behavior of `substitute` inside `do.call` differs from direct calls. Semantics are not fully defined for this case. - Useful pattern: `do.call(rbind, list_of_dfs)` to combine a list of data frames. --- ## Reduce / Filter / Map / Find / Position R's functional programming helpers from base — genuinely non-obvious. - `Reduce(f, x)` applies binary function `f` cumulatively: `Reduce("+", 1:4)` = `((1+2)+3)+4`. Direction matters for non-commutative ops. - `Reduce(f, x, accumulate = TRUE)` returns all intermediate results — equivalent to Python's `itertools.accumulate`. - `Reduce(f, x, right = TRUE)` folds from the right: `f(x1, f(x2, f(x3, x4)))`. - `Reduce` with `init` adds a starting value: `Reduce(f, x, init = v)` = `f(f(f(v, x1), x2), x3)`. - `Filter(f, x)` keeps elements where `f(elem)` is `TRUE`. Unlike `x[sapply(x, f)]`, handles `NULL`/empty correctly. - `Map(f, ...)` is a simple wrapper for `mapply(f, ..., SIMPLIFY = FALSE)` — always returns a list. - `Find(f, x)` returns the **first** element where `f(elem)` is `TRUE`. `Find(f, x, right = TRUE)` for last. - `Position(f, x)` returns the **index** of the first match (like `Find` but returns position, not value). --- ## lengths - `lengths(x)` returns the length of **each element** of a list. Equivalent to `sapply(x, length)` but faster (implemented in C). - Works on any list-like object. Returns integer vector. --- ## conditions (tryCatch / withCallingHandlers) - `tryCatch` **unwinds** the call stack — handler runs in the calling environment, not where the error occurred. Cannot resume execution. - `withCallingHandlers` does NOT unwind — handler runs where the condition was signaled. Can inspect/log then let the condition propagate. - `tryCatch(expr, error = function(e) e)` returns the error condition object. - `tryCatch(expr, warning = function(w) {...})` catches the **first** warning and exits. Use `withCallingHandlers` + `invokeRestart("muffleWarning")` to suppress warnings but continue. - `tryCatch` `finally` clause always runs (like Java try/finally). - `globalCallingHandlers()` registers handlers that persist for the session (useful for logging). - Custom conditions: `stop(errorCondition("msg", class = "myError"))` then catch with `tryCatch(..., myError = function(e) ...)`. --- ## all.equal - Tests **near equality** with tolerance (default `1.5e-8`, i.e., `sqrt(.Machine$double.eps)`). - Returns `TRUE` or a **character string** describing the difference — NOT `FALSE`. Use `isTRUE(all.equal(x, y))` in conditionals. - `tolerance` argument controls numeric tolerance. `scale` for absolute vs relative comparison. - Checks attributes, names, dimensions — more thorough than `==`. --- ## combn - `combn(n, m)` or `combn(x, m)`: generates all combinations of `m` items from `x`. - Returns a **matrix** with `m` rows; each column is one combination. - `FUN` argument applies a function to each combination: `combn(5, 3, sum)` returns sums of all 3-element subsets. - `simplify = FALSE` returns a list instead of a matrix. --- ## modifyList - `modifyList(x, val)` replaces elements of list `x` with those in `val` by **name**. - Setting a value to `NULL` **removes** that element from the list. - **Does** add new names not in `x` — it uses `x[names(val)] <- val` internally, so any name in `val` gets added or replaced. --- ## relist - Inverse of `unlist`: given a flat vector and a skeleton list, reconstructs the nested structure. - `relist(flesh, skeleton)` — `flesh` is the flat data, `skeleton` provides the shape. - Works with factors, matrices, and nested lists. --- ## txtProgressBar - `txtProgressBar(min, max, style = 3)` — style 3 shows percentage + bar (most useful). - Update with `setTxtProgressBar(pb, value)`. Close with `close(pb)`. - Style 1: rotating `|/-\`, style 2: simple progress. Only style 3 shows percentage. --- ## object.size - Returns an **estimate** of memory used by an object. Not always exact for shared references. - `format(object.size(x), units = "MB")` for human-readable output. - Does not count the size of environments or external pointers. --- ## installed.packages / update.packages - `installed.packages()` can be slow (scans all packages). Use `find.package()` or `requireNamespace()` to check for a specific package. - `update.packages(ask = FALSE)` updates all packages without prompting. - `lib.loc` specifies which library to check/update. --- ## vignette / demo - `vignette()` lists all vignettes; `vignette("name", package = "pkg")` opens a specific one. - `demo()` lists all demos; `demo("topic")` runs one interactively. - `browseVignettes()` opens vignette browser in HTML. --- ## Time series: acf / arima / ts / stl / decompose - `ts(data, start, frequency)`: `frequency` is observations per unit time (12 for monthly, 4 for quarterly). - `acf` default `type = "correlation"`. Use `type = "partial"` for PACF. `plot = FALSE` to suppress auto-plotting. - `arima(x, order = c(p,d,q))` for ARIMA models. `seasonal = list(order = c(P,D,Q), period = S)` for seasonal component. - `arima` handles `NA` values in the time series (via Kalman filter). - `stl` requires `s.window` (seasonal window) — must be specified, no default. `s.window = "periodic"` assumes fixed seasonality. - `decompose`: simpler than `stl`, uses moving averages. `type = "additive"` or `"multiplicative"`. - `stl` result components: `$time.series` matrix with columns `seasonal`, `trend`, `remainder`. FILE:references/data-wrangling.md # Data Wrangling — Quick Reference > Non-obvious behaviors, gotchas, and tricky defaults for R functions. > Only what Claude doesn't already know. --- ## Extract / Extract.data.frame Indexing pitfalls in base R. - `m[j = 2, i = 1]` is `m[2, 1]` not `m[1, 2]` — argument names are **ignored** in `[`, positional matching only. Never name index args. - Factor indexing: `x[f]` uses integer codes of factor `f`, not its character labels. Use `x[as.character(f)]` for label-based indexing. - `x[[]]` with no index is always an error. `x$name` does partial matching by default; `x[["name"]]` does not (exact by default). - Assigning `NULL` via `x[[i]] <- NULL` or `x$name <- NULL` **deletes** that list element. - Data frame `[` with single column: `df[, 1]` returns a **vector** (drop=TRUE default for columns), but `df[1, ]` returns a **data frame** (drop=FALSE for rows). Use `drop = FALSE` explicitly. - Matrix indexing a data frame (`df[cbind(i,j)]`) coerces to matrix first — avoid. --- ## subset Use interactively only; unsafe for programming. - `subset` argument uses **non-standard evaluation** — column names are resolved in the data frame, which can silently pick up wrong variables in programmatic use. Use `[` with explicit logic in functions. - `NA`s in the logical condition are treated as `FALSE` (rows silently dropped). - Factors may retain unused levels after subsetting; call `droplevels()`. --- ## match / %in% - `%in%` **never returns NA** — this makes it safe for `if()` conditions unlike `==`. - `match()` returns position of **first** match only; duplicates in `table` are ignored. - Factors, raw vectors, and lists are all converted to character before matching. - `NaN` matches `NaN` but not `NA`; `NA` matches `NA` only. --- ## apply - On a **data frame**, `apply` coerces to matrix via `as.matrix` first — mixed types become character. - Return value orientation is transposed: if FUN returns length-n vector, result has dim `c(n, dim(X)[MARGIN])`. Row results become **columns**. - Factor results are coerced to character in the output array. - `...` args cannot share names with `X`, `MARGIN`, or `FUN` (partial matching risk). --- ## lapply / sapply / vapply - `sapply` can return a vector, matrix, or list unpredictably — use `vapply` in non-interactive code with explicit `FUN.VALUE` template. - Calling primitives directly in `lapply` can cause dispatch issues; wrap in `function(x) is.numeric(x)` rather than bare `is.numeric`. - `sapply` with `simplify = "array"` can produce higher-rank arrays (not just matrices). --- ## tapply - Returns an **array** (not a data frame). Class info on return values is **discarded** (e.g., Date objects become numeric). - `...` args to FUN are **not** divided into cells — they apply globally, so FUN should not expect additional args with same length as X. - `default = NA` fills empty cells; set `default = 0` for sum-like operations. Before R 3.4.0 this was hard-coded to `NA`. - Use `array2DF()` to convert result to a data frame. --- ## mapply - Argument name is `SIMPLIFY` (all caps) not `simplify` — inconsistent with `sapply`. - `MoreArgs` must be a **list** of args not vectorized over. - Recycles shorter args to common length; zero-length arg gives zero-length result. --- ## merge - Default `by` is `intersect(names(x), names(y))` — can silently merge on unintended columns if data frames share column names. - `by = 0` or `by = "row.names"` merges on row names, adding a "Row.names" column. - `by = NULL` (or both `by.x`/`by.y` length 0) produces **Cartesian product**. - Result is sorted on `by` columns by default (`sort = TRUE`). For unsorted output use `sort = FALSE`. - Duplicate key matches produce **all combinations** (one row per match pair). --- ## split - If `f` is a list of factors, interaction is used; levels containing `"."` can cause unexpected splits unless `sep` is changed. - `drop = FALSE` (default) retains empty factor levels as empty list elements. - Supports formula syntax: `split(df, ~ Month)`. --- ## cbind / rbind - `cbind` on data frames calls `data.frame(...)`, not `cbind.matrix`. Mixing matrices and data frames can give unexpected results. - `rbind` on data frames matches columns **by name**, not position. Missing columns get `NA`. - `cbind(NULL)` returns `NULL` (not a matrix). For consistency, `rbind(NULL)` also returns `NULL`. --- ## table - By default **excludes NA** (`useNA = "no"`). Use `useNA = "ifany"` or `exclude = NULL` to count NAs. - Setting `exclude` non-empty and non-default implies `useNA = "ifany"`. - Result is always an **array** (even 1D), class "table". Convert to data frame with `as.data.frame(tbl)`. - Two kinds of NA (factor-level NA vs actual NA) are treated differently depending on `useNA`/`exclude`. --- ## duplicated / unique - `duplicated` marks the **second and later** occurrences as TRUE, not the first. Use `fromLast = TRUE` to reverse. - For data frames, operates on whole rows. For lists, compares recursively. - `unique` keeps the **first** occurrence of each value. --- ## data.frame (gotchas) - `stringsAsFactors = FALSE` is the default since R 4.0.0 (was TRUE before). - Atomic vectors recycle to match longest column, but only if exact multiple. Protect with `I()` to prevent conversion. - Duplicate column names allowed only with `check.names = FALSE`, but many operations will de-dup them silently. - Matrix arguments are expanded to multiple columns unless protected by `I()`. --- ## factor (gotchas) - `as.numeric(f)` returns **integer codes**, not original values. Use `as.numeric(levels(f))[f]` or `as.numeric(as.character(f))`. - Only `==` and `!=` work between factors; factors must have identical level sets. Ordered factors support `<`, `>`. - `c()` on factors unions level sets (since R 4.1.0), but earlier versions converted to integer. - Levels are sorted by default, but sort order is **locale-dependent** at creation time. --- ## aggregate - Formula interface (`aggregate(y ~ x, data, FUN)`) drops `NA` groups by default. - The data frame method requires `by` as a **list** (not a vector). - Returns columns named after the grouping variables, with result column keeping the original name. - If FUN returns multiple values, result column is a **matrix column** inside the data frame. --- ## complete.cases - Returns a logical vector: TRUE for rows with **no** NAs across all columns/arguments. - Works on multiple arguments (e.g., `complete.cases(x, y)` checks both). --- ## order - Returns a **permutation vector** of indices, not the sorted values. Use `x[order(x)]` to sort. - Default is ascending; use `-x` for descending numeric, or `decreasing = TRUE`. - For character sorting, depends on locale. Use `method = "radix"` for locale-independent fast sorting. - `sort.int()` with `method = "radix"` is much faster for large integer/character vectors. FILE:references/dates-and-system.md # Dates and System — Quick Reference > Non-obvious behaviors, gotchas, and tricky defaults for R functions. > Only what Claude doesn't already know. --- ## Dates (Date class) - `Date` objects are stored as **integer days since 1970-01-01**. Arithmetic works in days. - `Sys.Date()` returns current date as Date object. - `seq.Date(from, to, by = "month")` — "month" increments can produce varying-length intervals. Adding 1 month to Jan 31 gives Mar 3 (not Feb 28). - `diff(dates)` returns a `difftime` object in days. - `format(date, "%Y")` for year, `"%m"` for month, `"%d"` for day, `"%A"` for weekday name (locale-dependent). - Years before 1CE may not be handled correctly. - `length(date_vector) <- n` pads with `NA`s if extended. --- ## DateTimeClasses (POSIXct / POSIXlt) - `POSIXct`: seconds since 1970-01-01 UTC (compact, a numeric vector). - `POSIXlt`: list with components `$sec`, `$min`, `$hour`, `$mday`, `$mon` (0-11!), `$year` (since 1900!), `$wday` (0-6, Sunday=0), `$yday` (0-365). - Converting between POSIXct and Date: `as.Date(posixct_obj)` uses `tz = "UTC"` by default — may give different date than intended if original was in another timezone. - `Sys.time()` returns POSIXct in current timezone. - `strptime` returns POSIXlt; `as.POSIXct(strptime(...))` to get POSIXct. - `difftime` arithmetic: subtracting POSIXct objects gives difftime. Units auto-selected ("secs", "mins", "hours", "days", "weeks"). --- ## difftime - `difftime(time1, time2, units = "auto")` — auto-selects smallest sensible unit. - Explicit units: `"secs"`, `"mins"`, `"hours"`, `"days"`, `"weeks"`. No "months" or "years" (variable length). - `as.numeric(diff, units = "hours")` to extract numeric value in specific units. - `units(diff_obj) <- "hours"` changes the unit in place. --- ## system.time / proc.time - `system.time(expr)` returns `user`, `system`, and `elapsed` time. - `gcFirst = TRUE` (default): runs garbage collection before timing for more consistent results. - `proc.time()` returns cumulative time since R started — take differences for intervals. - `elapsed` (wall clock) can be less than `user` (multi-threaded BLAS) or more (I/O waits). --- ## Sys.sleep - `Sys.sleep(seconds)` — allows fractional seconds. Actual sleep may be longer (OS scheduling). - The process **yields** to the OS during sleep (does not busy-wait). --- ## options (key options) Selected non-obvious options: - `options(scipen = n)`: positive biases toward fixed notation, negative toward scientific. Default 0. Applies to `print`/`format`/`cat` but not `sprintf`. - `options(digits = n)`: significant digits for printing (1-22, default 7). Suggestion only. - `options(digits.secs = n)`: max decimal digits for seconds in time formatting (0-6, default 0). - `options(warn = n)`: -1 = ignore warnings, 0 = collect (default), 1 = immediate, 2 = convert to errors. - `options(error = recover)`: drop into debugger on error. `options(error = NULL)` resets to default. - `options(OutDec = ",")`: change decimal separator in output (affects `format`, `print`, NOT `sprintf`). - `options(stringsAsFactors = FALSE)`: global default for `data.frame` (moot since R 4.0.0 where it's already FALSE). - `options(expressions = 5000)`: max nested evaluations. Increase for deep recursion. - `options(max.print = 99999)`: controls truncation in `print` output. - `options(na.action = "na.omit")`: default NA handling in model functions. - `options(contrasts = c("contr.treatment", "contr.poly"))`: default contrasts for unordered/ordered factors. --- ## file.path / basename / dirname - `file.path("a", "b", "c.txt")` → `"a/b/c.txt"` (platform-appropriate separator). - `basename("/a/b/c.txt")` → `"c.txt"`. `dirname("/a/b/c.txt")` → `"/a/b"`. - `file.path` does NOT normalize paths (no `..` resolution); use `normalizePath()` for that. --- ## list.files - `list.files(pattern = "*.csv")` — `pattern` is a **regex**, not a glob! Use `glob2rx("*.csv")` or `"\\.csv$"`. - `full.names = FALSE` (default) returns basenames only. Use `full.names = TRUE` for complete paths. - `recursive = TRUE` to search subdirectories. - `all.files = TRUE` to include hidden files (starting with `.`). --- ## file.info - Returns data frame with `size`, `isdir`, `mode`, `mtime`, `ctime`, `atime`, `uid`, `gid`. - `mtime`: modification time (POSIXct). Useful for `file.info(f)$mtime`. - On some filesystems, `ctime` is status-change time, not creation time. --- ## file_test - `file_test("-f", path)`: TRUE if regular file exists. - `file_test("-d", path)`: TRUE if directory exists. - `file_test("-nt", f1, f2)`: TRUE if f1 is newer than f2. - More reliable than `file.exists()` for distinguishing files from directories. FILE:references/io-and-text.md # I/O and Text Processing — Quick Reference > Non-obvious behaviors, gotchas, and tricky defaults for R functions. > Only what Claude doesn't already know. --- ## read.table (gotchas) - `sep = ""` (default) means **any whitespace** (spaces, tabs, newlines) — not a literal empty string. - `comment.char = "#"` by default — lines with `#` are truncated. Use `comment.char = ""` to disable (also faster). - `header` auto-detection: set to TRUE if first row has **one fewer field** than subsequent rows (the missing field is assumed to be row names). - `colClasses = "NULL"` **skips** that column entirely — very useful for speed. - `read.csv` defaults differ from `read.table`: `header = TRUE`, `sep = ","`, `fill = TRUE`, `comment.char = ""`. - For large files: specifying `colClasses` and `nrows` dramatically reduces memory usage. `read.table` is slow for wide data frames (hundreds of columns); use `scan` or `data.table::fread` for matrices. - `stringsAsFactors = FALSE` since R 4.0.0 (was TRUE before). --- ## write.table (gotchas) - `row.names = TRUE` by default — produces an unnamed first column that confuses re-reading. Use `row.names = FALSE` or `col.names = NA` for Excel-compatible CSV. - `write.csv` fixes `sep = ","`, `dec = "."`, and uses `qmethod = "double"` — cannot override these via `...`. - `quote = TRUE` (default) quotes character/factor columns. Numeric columns are never quoted. - Matrix-like columns in data frames expand to multiple columns silently. - Slow for data frames with many columns (hundreds+); each column processed separately by class. --- ## read.fwf - Reads fixed-width format files. `widths` is a vector of field widths. - **Negative widths skip** that many characters (useful for ignoring fields). - `buffersize` controls how many lines are read at a time; increase for large files. - Uses `read.table` internally after splitting fields. --- ## count.fields - Counts fields per line in a file — useful for diagnosing read errors. - `sep` and `quote` arguments match those of `read.table`. --- ## grep / grepl / sub / gsub (gotchas) - Three regex modes: POSIX extended (default), `perl = TRUE`, `fixed = TRUE`. They behave differently for edge cases. - **Name arguments explicitly** — unnamed args after `x`/`pattern` are matched positionally to `ignore.case`, `perl`, etc. Common source of silent bugs. - `sub` replaces **first** match only; `gsub` replaces **all** matches. - Backreferences: `"\\1"` in replacement (double backslash in R strings). With `perl = TRUE`: `"\\U\\1"` for uppercase conversion. - `grep(value = TRUE)` returns matching **elements**; `grep(value = FALSE)` (default) returns **indices**. - `grepl` returns logical vector — preferred for filtering. - `regexpr` returns first match position + length (as attributes); `gregexpr` returns all matches as a list. - `regexec` returns match + capture group positions; `gregexec` does this for all matches. - Character classes like `[:alpha:]` must be inside `[[:alpha:]]` (double brackets) in POSIX mode. --- ## strsplit - Returns a **list** (one element per input string), even for a single string. - `split = ""` or `split = character(0)` splits into individual characters. - Match at beginning of string: first element of result is `""`. Match at end: no trailing `""`. - `fixed = TRUE` is faster and avoids regex interpretation. - Common mistake: unnamed arguments silently match `fixed`, `perl`, etc. --- ## substr / substring - `substr(x, start, stop)`: extracts/replaces substring. 1-indexed, inclusive on both ends. - `substring(x, first, last)`: same but `last` defaults to `1000000L` (effectively "to end"). Vectorized over `first`/`last`. - Assignment form: `substr(x, 1, 3) <- "abc"` replaces in place (must be same length replacement). --- ## trimws - `which = "both"` (default), `"left"`, or `"right"`. - `whitespace = "[ \\t\\r\\n]"` — customizable regex for what counts as whitespace. --- ## nchar - `type = "bytes"` counts bytes; `type = "chars"` (default) counts characters; `type = "width"` counts display width. - `nchar(NA)` returns `NA` (not 2). `nchar(factor)` works on the level labels. - `keepNA = TRUE` (default since R 3.3.0); set to `FALSE` to count `"NA"` as 2 characters. --- ## format / formatC - `format(x, digits, nsmall)`: `nsmall` forces minimum decimal places. `big.mark = ","` adds thousands separator. - `formatC(x, format = "f", digits = 2)`: C-style formatting. `format = "e"` for scientific, `"g"` for general. - `format` returns character vector; always right-justified by default (`justify = "right"`). --- ## type.convert - Converts character vectors to appropriate types (logical, integer, double, complex, character). - `as.is = TRUE` (recommended): keeps characters as character, not factor. - Applied column-wise on data frames. `tryLogical = TRUE` (R 4.3+) converts "TRUE"/"FALSE" columns. --- ## Rscript - `commandArgs(trailingOnly = TRUE)` gets script arguments (excluding R/Rscript flags). - `#!` line on Unix: `/usr/bin/env Rscript` or full path. - `--vanilla` or `--no-init-file` to skip `.Rprofile` loading. - Exit code: `quit(status = 1)` for error exit. --- ## capture.output - Captures output from `cat`, `print`, or any expression that writes to stdout. - `file = NULL` (default) returns character vector. `file = "out.txt"` writes directly to file. - `type = "message"` captures stderr instead. --- ## URLencode / URLdecode - `URLencode(url, reserved = FALSE)` by default does NOT encode reserved chars (`/`, `?`, `&`, etc.). - Set `reserved = TRUE` to encode a URL **component** (query parameter value). --- ## glob2rx - Converts shell glob patterns to regex: `glob2rx("*.csv")` → `"^.*\\.csv$"`. - Useful with `list.files(pattern = glob2rx("data_*.RDS"))`. FILE:references/modeling.md # Modeling — Quick Reference > Non-obvious behaviors, gotchas, and tricky defaults for R functions. > Only what Claude doesn't already know. --- ## formula Symbolic model specification gotchas. - `I()` is required to use arithmetic operators literally: `y ~ x + I(x^2)`. Without `I()`, `^` means interaction crossing. - `*` = main effects + interaction: `a*b` expands to `a + b + a:b`. - `(a+b+c)^2` = all main effects + all 2-way interactions (not squaring). - `-` removes terms: `(a+b+c)^2 - a:b` drops only the `a:b` interaction. - `/` means nesting: `a/b` = `a + b %in% a` = `a + a:b`. - `.` in formula means "all other columns in data" (in `terms.formula` context) or "previous contents" (in `update.formula`). - Formula objects carry an **environment** used for variable lookup; `as.formula("y ~ x")` uses `parent.frame()`. --- ## terms / model.matrix - `model.matrix` creates the design matrix including dummy coding. Default contrasts: `contr.treatment` for unordered factors, `contr.poly` for ordered. - `terms` object attributes: `order` (interaction order per term), `intercept`, `factors` matrix. - Column names from `model.matrix` can be surprising: e.g., `factorLevelName` concatenation. --- ## glm - Default `family = gaussian(link = "identity")` — `glm()` with no `family` silently fits OLS (same as `lm`, but slower and with deviance-based output). - Common families: `binomial(link = "logit")`, `poisson(link = "log")`, `Gamma(link = "inverse")`, `inverse.gaussian()`. - `binomial` accepts response as: 0/1 vector, logical, factor (second level = success), or 2-column matrix `cbind(success, failure)`. - `weights` in `glm` means **prior weights** (not frequency weights) — for frequency weights, use the cbind trick or offset. - `predict.glm(type = "response")` for predicted probabilities; default `type = "link"` returns log-odds (for logistic) or log-rate (for Poisson). - `anova(glm_obj, test = "Chisq")` for deviance-based tests; `"F"` is invalid for non-Gaussian families. - Quasi-families (`quasibinomial`, `quasipoisson`) allow overdispersion — no AIC is computed. - Convergence: `control = glm.control(maxit = 100)` if default 25 iterations isn't enough. --- ## aov - `aov` is a wrapper around `lm` that stores extra info for balanced ANOVA. For unbalanced designs, Type I SS (sequential) are computed — order of terms matters. - For Type III SS, use `car::Anova()` or set contrasts to `contr.sum`/`contr.helmert`. - Error strata for repeated measures: `aov(y ~ A*B + Error(Subject/B))`. - `summary.aov` gives ANOVA table; `summary.lm(aov_obj)` gives regression-style summary. --- ## nls - Requires **good starting values** in `start = list(...)` or convergence fails. - Self-starting models (`SSlogis`, `SSasymp`, etc.) auto-compute starting values. - Algorithm `"port"` allows bounds on parameters (`lower`/`upper`). - If data fits too exactly (no residual noise), convergence check fails — use `control = list(scaleOffset = 1)` or jitter data. - `weights` argument for weighted NLS; `na.action` for missing value handling. --- ## step / add1 - `step` does **stepwise** model selection by AIC (default). Use `k = log(n)` for BIC. - Direction: `direction = "both"` (default), `"forward"`, or `"backward"`. - `add1`/`drop1` evaluate single-term additions/deletions; `step` calls these iteratively. - `scope` argument defines the upper/lower model bounds for search. - `step` modifies the model object in place — can be slow for large models with many candidate terms. --- ## predict.lm / predict.glm - `predict.lm` with `interval = "confidence"` gives CI for **mean** response; `interval = "prediction"` gives PI for **new observation** (wider). - `newdata` must have columns matching the original formula variables — factors must have the same levels. - `predict.glm` with `type = "response"` gives predictions on the response scale (e.g., probabilities for logistic); `type = "link"` (default) gives on the link scale. - `se.fit = TRUE` returns standard errors; for `predict.glm` these are on the **link** scale regardless of `type`. - `predict.lm` with `type = "terms"` returns the contribution of each term. --- ## loess - `span` controls smoothness (default 0.75). Span < 1 uses that proportion of points; span > 1 uses all points with adjusted distance. - Maximum **4 predictors**. Memory usage is roughly **quadratic** in n (1000 points ~ 10MB). - `degree = 0` (local constant) is allowed but poorly tested — use with caution. - Not identical to S's `loess`; conditioning is not implemented. - `normalize = TRUE` (default) standardizes predictors to common scale; set `FALSE` for spatial coords. --- ## lowess vs loess - `lowess` is the older function; returns `list(x, y)` — cannot predict at new points. - `loess` is the newer formula interface with `predict` method. - `lowess` parameter is `f` (span, default 2/3); `loess` parameter is `span` (default 0.75). - `lowess` `iter` default is 3 (robustifying iterations); `loess` default `family = "gaussian"` (no robustness). --- ## smooth.spline - Default smoothing parameter selected by **GCV** (generalized cross-validation). - `cv = TRUE` uses ordinary leave-one-out CV instead — do not use with duplicate x values. - `spar` and `lambda` control smoothness; `df` can specify equivalent degrees of freedom. - Returns object with `predict`, `print`, `plot` methods. The `fit` component has knots and coefficients. --- ## optim - **Minimizes** by default. To maximize: set `control = list(fnscale = -1)`. - Default method is Nelder-Mead (no gradients, robust but slow). Poor for 1D — use `"Brent"` or `optimize()`. - `"L-BFGS-B"` is the only method supporting box constraints (`lower`/`upper`). Bounds auto-select this method with a warning. - `"SANN"` (simulated annealing): convergence code is **always 0** — it never "fails". `maxit` = total function evals (default 10000), no other stopping criterion. - `parscale`: scale parameters so unit change in each produces comparable objective change. Critical for mixed-scale problems. - `hessian = TRUE`: returns numerical Hessian of the **unconstrained** problem even if box constraints are active. - `fn` can return `NA`/`Inf` (except `"L-BFGS-B"` which requires finite values always). Initial value must be finite. --- ## optimize / uniroot - `optimize`: 1D minimization on a bounded interval. Returns `minimum` and `objective`. - `uniroot`: finds a root of `f` in `[lower, upper]`. **Requires** `f(lower)` and `f(upper)` to have opposite signs. - `uniroot` with `extendInt = "yes"` can auto-extend the interval to find sign change — but can find spurious roots for functions that don't actually cross zero. - `nlm`: Newton-type minimizer. Gradient/Hessian as **attributes** of the return value from `fn` (unusual interface). --- ## TukeyHSD - Requires a fitted `aov` object (not `lm`). - Default `conf.level = 0.95`. Returns adjusted p-values and confidence intervals for all pairwise comparisons. - Only meaningful for **balanced** or near-balanced designs; can be liberal for very unbalanced data. --- ## anova (for lm) - `anova(model)`: sequential (Type I) SS — **order of terms matters**. - `anova(model1, model2)`: F-test comparing nested models. - For Type II or III SS use `car::Anova()`. FILE:references/statistics.md # Statistics — Quick Reference > Non-obvious behaviors, gotchas, and tricky defaults for R functions. > Only what Claude doesn't already know. --- ## chisq.test - `correct = TRUE` (default) applies Yates continuity correction for **2x2 tables only**. - `simulate.p.value = TRUE`: Monte Carlo with `B = 2000` replicates (min p ~ 0.0005). Simulation assumes **fixed marginals** (Fisher-style sampling, not the chi-sq assumption). - For goodness-of-fit: pass a vector, not a matrix. `p` must sum to 1 (or set `rescale.p = TRUE`). - Return object includes `$expected`, `$residuals` (Pearson), and `$stdres` (standardized). --- ## wilcox.test - `exact = TRUE` by default for small samples with no ties. With ties, normal approximation used. - `correct = TRUE` applies continuity correction to normal approximation. - `conf.int = TRUE` computes Hodges-Lehmann estimator and confidence interval (not just the p-value). - Paired test: `paired = TRUE` uses signed-rank test (Wilcoxon), not rank-sum (Mann-Whitney). --- ## fisher.test - For tables larger than 2x2, uses simulation (`simulate.p.value = TRUE`) or network algorithm. - `workspace` controls memory for the network algorithm; increase if you get errors on large tables. - `or` argument tests a specific odds ratio (default 1) — only for 2x2 tables. --- ## ks.test - Two-sample test or one-sample against a reference distribution. - Does **not** handle ties well — warns and uses asymptotic approximation. - For composite hypotheses (parameters estimated from data), p-values are **conservative** (too large). Use `dgof` or `ks.test` with `exact = NULL` for discrete distributions. --- ## p.adjust - Methods: `"holm"` (default), `"BH"` (Benjamini-Hochberg FDR), `"bonferroni"`, `"BY"`, `"hochberg"`, `"hommel"`, `"fdr"` (alias for BH), `"none"`. - `n` argument: total number of hypotheses (can be larger than `length(p)` if some p-values are excluded). - Handles `NA`s: adjusted p-values are `NA` where input is `NA`. --- ## pairwise.t.test / pairwise.wilcox.test - `p.adjust.method` defaults to `"holm"`. Change to `"BH"` for FDR control. - `pool.sd = TRUE` (default for t-test): uses pooled SD across all groups (assumes equal variances). - Returns a matrix of p-values, not test statistics. --- ## shapiro.test - Sample size must be between 3 and 5000. - Tests normality; low p-value = evidence against normality. --- ## kmeans - `nstart > 1` recommended (e.g., `nstart = 25`): runs algorithm from multiple random starts, returns best. - Default `iter.max = 10` — may be too low for convergence. Increase for large/complex data. - Default algorithm is "Hartigan-Wong" (generally best). Very close points may cause non-convergence (warning with `ifault = 4`). - Cluster numbering is arbitrary; ordering may differ across platforms. - Always returns k clusters when k is specified (except Lloyd-Forgy may return fewer). --- ## hclust - `method = "ward.D2"` implements Ward's criterion correctly (using squared distances). The older `"ward.D"` did not square distances (retained for back-compatibility). - Input must be a `dist` object. Use `as.dist()` to convert a symmetric matrix. - `hang = -1` in `plot()` aligns all labels at the bottom. --- ## dist - `method = "euclidean"` (default). Other options: `"manhattan"`, `"maximum"`, `"canberra"`, `"binary"`, `"minkowski"`. - Returns a `dist` object (lower triangle only). Use `as.matrix()` to get full matrix. - `"canberra"`: terms with zero numerator and denominator are **omitted** from the sum (not treated as 0/0). - `Inf` values: Euclidean distance involving `Inf` is `Inf`. Multiple `Inf`s in same obs give `NaN` for some methods. --- ## prcomp vs princomp - `prcomp` uses **SVD** (numerically superior); `princomp` uses `eigen` on covariance (less stable, N-1 vs N scaling). - `scale. = TRUE` in `prcomp` standardizes variables; important when variables have very different scales. - `princomp` standard deviations differ from `prcomp` by factor `sqrt((n-1)/n)`. - Both return `$rotation` (loadings) and `$x` (scores); sign of components may differ between runs. --- ## density - Default bandwidth: `bw = "nrd0"` (Silverman's rule of thumb). For multimodal data, consider `"SJ"` or `"bcv"`. - `adjust`: multiplicative factor on bandwidth. `adjust = 0.5` halves the bandwidth (less smooth). - Default kernel: `"gaussian"`. Range of density extends beyond data range (controlled by `cut`, default 3 bandwidths). - `n = 512`: number of evaluation points. Increase for smoother plotting. - `from`/`to`: explicitly bound the evaluation range. --- ## quantile - **Nine** `type` options (1-9). Default `type = 7` (R default, linear interpolation). Type 1 = inverse of empirical CDF (SAS default). Types 4-9 are continuous; 1-3 are discontinuous. - `na.rm = FALSE` by default — returns NA if any NAs present. - `names = TRUE` by default, adding "0%", "25%", etc. as names. --- ## Distributions (gotchas across all) All distribution functions follow the `d/p/q/r` pattern. Common non-obvious points: - **`n` argument in `r*()` functions**: if `length(n) > 1`, uses `length(n)` as the count, not `n` itself. So `rnorm(c(1,2,3))` generates 3 values, not 1+2+3. - `log = TRUE` / `log.p = TRUE`: compute on log scale for numerical stability in tails. - `lower.tail = FALSE` gives survival function P(X > x) directly (more accurate than 1 - pnorm() in tails). - **Gamma**: parameterized by `shape` and `rate` (= 1/scale). Default `rate = 1`. Specifying both `rate` and `scale` is an error. - **Beta**: `shape1` (alpha), `shape2` (beta) — no `mean`/`sd` parameterization. - **Poisson `dpois`**: `x` can be non-integer (returns 0 with a warning for non-integer values if `log = FALSE`). - **Weibull**: `shape` and `scale` (no `rate`). R's parameterization: `f(x) = (shape/scale)(x/scale)^(shape-1) exp(-(x/scale)^shape)`. - **Lognormal**: `meanlog` and `sdlog` are mean/sd of the **log**, not of the distribution itself. --- ## cor.test - Default method: `"pearson"`. Also `"kendall"` and `"spearman"`. - Returns `$estimate`, `$p.value`, `$conf.int` (CI only for Pearson). - Formula interface: `cor.test(~ x + y, data = df)` — note the `~` with no LHS. --- ## ecdf - Returns a **function** (step function). Call it on new values: `Fn <- ecdf(x); Fn(3.5)`. - `plot(ecdf(x))` gives the empirical CDF plot. - The returned function is right-continuous with left limits (cadlag). --- ## weighted.mean - Handles `NA` in weights: observation is dropped if weight is `NA`. - Weights do not need to sum to 1; they are normalized internally. FILE:references/visualization.md # Visualization — Quick Reference > Non-obvious behaviors, gotchas, and tricky defaults for R functions. > Only what Claude doesn't already know. --- ## par (gotchas) - `par()` settings are per-device. Opening a new device resets everything. - Setting `mfrow`/`mfcol` resets `cex` to 1 and `mex` to 1. With 2x2 layout, base `cex` is multiplied by 0.83; with 3+ rows/columns, by 0.66. - `mai` (inches), `mar` (lines), `pin`, `plt`, `pty` all interact. Restoring all saved parameters after device resize can produce inconsistent results — last-alphabetically wins. - `bg` set via `par()` also sets `new = FALSE`. Setting `fg` via `par()` also sets `col`. - `xpd = NA` clips to device region (allows drawing in outer margins); `xpd = TRUE` clips to figure region; `xpd = FALSE` (default) clips to plot region. - `mgp = c(3, 1, 0)`: controls title line (`mgp[1]`), label line (`mgp[2]`), axis line (`mgp[3]`). All in `mex` units. - `las`: 0 = parallel to axis, 1 = horizontal, 2 = perpendicular, 3 = vertical. Does **not** respond to `srt`. - `tck = 1` draws grid lines across the plot. `tcl = -0.5` (default) gives outward ticks. - `usr` with log scale: contains **log10** of the coordinate limits, not the raw values. - Read-only parameters: `cin`, `cra`, `csi`, `cxy`, `din`, `page`. --- ## layout - `layout(mat)` where `mat` is a matrix of integers specifying figure arrangement. - `widths`/`heights` accept `lcm()` for absolute sizes mixed with relative sizes. - More flexible than `mfrow`/`mfcol` but cannot be queried once set (unlike `par("mfrow")`). - `layout.show(n)` visualizes the layout for debugging. --- ## axis / mtext - `axis(side, at, labels)`: `side` 1=bottom, 2=left, 3=top, 4=right. - Default gap between axis labels controlled by `par("mgp")`. Labels can overlap if not managed. - `mtext`: `line` argument positions text in margin lines (0 = adjacent to plot, positive = outward). `adj` controls horizontal position (0-1). - `mtext` with `outer = TRUE` writes in the **outer** margin (set by `par(oma = ...)`). --- ## curve - First argument can be an **expression** in `x` or a function: `curve(sin, 0, 2*pi)` or `curve(x^2 + 1, 0, 10)`. - `add = TRUE` to overlay on existing plot. Default `n = 101` evaluation points. - `xname = "x"` by default; change if your expression uses a different variable name. --- ## pairs - `panel` function receives `(x, y, ...)` for each pair. `lower.panel`, `upper.panel`, `diag.panel` for different regions. - `gap` controls spacing between panels (default 1). - Formula interface: `pairs(~ var1 + var2 + var3, data = df)`. --- ## coplot - Conditioning plots: `coplot(y ~ x | a)` or `coplot(y ~ x | a * b)` for two conditioning variables. - `panel` function can be customized; `rows`/`columns` control layout. - Default panel draws points; use `panel = panel.smooth` for loess overlay. --- ## matplot / matlines / matpoints - Plots columns of one matrix against columns of another. Recycles `col`, `lty`, `pch` across columns. - `type = "l"` by default (unlike `plot` which defaults to `"p"`). - Useful for plotting multiple time series or fitted curves simultaneously. --- ## contour / filled.contour / image - `contour(x, y, z)`: `z` must be a matrix with `dim = c(length(x), length(y))`. - `filled.contour` has a non-standard layout — it creates its own plot region for the color key. **Cannot use `par(mfrow)` with it**. Adding elements requires the `plot.axes` argument. - `image`: plots z-values as colored rectangles. Default color scheme may be misleading; set `col` explicitly. - For `image`, `x` and `y` specify **cell boundaries** or **midpoints** depending on context. --- ## persp - `persp(x, y, z, theta, phi)`: `theta` = azimuthal angle, `phi` = colatitude. - Returns a **transformation matrix** (invisible) for projecting 3D to 2D — use `trans3d()` to add points/lines to the perspective plot. - `shade` and `col` control surface shading. `border = NA` removes grid lines. --- ## segments / arrows / rect / polygon - All take vectorized coordinates; recycle as needed. - `arrows`: `code = 1` (head at start), `code = 2` (head at end, default), `code = 3` (both). - `polygon`: last point auto-connects to first. Fill with `col`; `border` controls outline. - `rect(xleft, ybottom, xright, ytop)` — note argument order is not the same as other systems. --- ## dev / dev.off / dev.copy - `dev.new()` opens a new device. `dev.off()` closes current device (and flushes output for file devices like `pdf`). - `dev.off()` on the **last** open device reverts to null device. - `dev.copy(pdf, file = "plot.pdf")` followed by `dev.off()` to save current plot. - `dev.list()` returns all open devices; `dev.cur()` the active one. --- ## pdf - Must call `dev.off()` to finalize the file. Without it, file may be empty/corrupt. - `onefile = TRUE` (default): multiple pages in one PDF. `onefile = FALSE`: one file per page (uses `%d` in filename for numbering). - `useDingbats = FALSE` recommended to avoid issues with certain PDF viewers and pch symbols. - Default size: 7x7 inches. `family` controls font family. --- ## png / bitmap devices - `res` controls DPI (default 72). For publication: `res = 300` with appropriate `width`/`height` in pixels or inches (with `units = "in"`). - `type = "cairo"` (on systems with cairo) gives better antialiasing than default. - `bg = "transparent"` for transparent background (PNG supports alpha). --- ## colors / rgb / hcl / col2rgb - `colors()` returns all 657 named colors. `col2rgb("color")` returns RGB matrix. - `rgb(r, g, b, alpha, maxColorValue = 255)` — note `maxColorValue` default is 1, not 255. - `hcl(h, c, l)`: perceptually uniform color space. Preferred for color scales. - `adjustcolor(col, alpha.f = 0.5)`: easy way to add transparency. --- ## colorRamp / colorRampPalette - `colorRamp` returns a **function** mapping [0,1] to RGB matrix. - `colorRampPalette` returns a **function** taking `n` and returning `n` interpolated colors. - `space = "Lab"` gives more perceptually uniform interpolation than `"rgb"`. --- ## palette / recordPlot - `palette()` returns current palette (default 8 colors). `palette("Set1")` sets a built-in palette. - Integer colors in plots index into the palette (with wrapping). Index 0 = background color. - `recordPlot()` / `replayPlot()`: save and restore a complete plot — device-dependent and fragile across sessions. FILE:assets/analysis_template.R # ============================================================ # Analysis Template — Base R # Copy this file, rename it, and fill in your details. # ============================================================ # Author : # Date : # Data : # Purpose : # ============================================================ # ── 0. Setup ───────────────────────────────────────────────── # Clear environment (optional — comment out if loading into existing session) rm(list = ls()) # Set working directory if needed # setwd("/path/to/your/project") # Reproducibility set.seed(42) # Libraries — uncomment what you need # library(haven) # read .dta / .sav / .sas # library(readxl) # read Excel files # library(openxlsx) # write Excel files # library(foreign) # older Stata / SPSS formats # library(survey) # survey-weighted analysis # library(lmtest) # Breusch-Pagan, Durbin-Watson etc. # library(sandwich) # robust standard errors # library(car) # Type II/III ANOVA, VIF # ── 1. Load Data ───────────────────────────────────────────── df <- read.csv("your_data.csv", stringsAsFactors = FALSE) # df <- readRDS("your_data.rds") # df <- haven::read_dta("your_data.dta") # First look — always run these dim(df) str(df) head(df, 10) summary(df) # ── 2. Data Quality Check ──────────────────────────────────── # Missing values na_report <- data.frame( column = names(df), n_miss = colSums(is.na(df)), pct_miss = round(colMeans(is.na(df)) * 100, 1), row.names = NULL ) print(na_report[na_report$n_miss > 0, ]) # Duplicates n_dup <- sum(duplicated(df)) cat(sprintf("Duplicate rows: %d\n", n_dup)) # Unique values for categorical columns cat_cols <- names(df)[sapply(df, function(x) is.character(x) | is.factor(x))] for (col in cat_cols) { cat(sprintf("\n%s (%d unique):\n", col, length(unique(df[[col]])))) print(table(df[[col]], useNA = "ifany")) } # ── 3. Clean & Transform ───────────────────────────────────── # Rename columns (example) # names(df)[names(df) == "old_name"] <- "new_name" # Convert types # df$group <- as.factor(df$group) # df$date <- as.Date(df$date, format = "%Y-%m-%d") # Recode values (example) # df$gender <- ifelse(df$gender == 1, "Male", "Female") # Create new variables (example) # df$log_income <- log(df$income + 1) # df$age_group <- cut(df$age, # breaks = c(0, 25, 45, 65, Inf), # labels = c("18-25", "26-45", "46-65", "65+")) # Filter rows (example) # df <- df[df$year >= 2010, ] # df <- df[complete.cases(df[, c("outcome", "predictor")]), ] # Drop unused factor levels # df <- droplevels(df) # ── 4. Descriptive Statistics ──────────────────────────────── # Numeric summary num_cols <- names(df)[sapply(df, is.numeric)] round(sapply(df[num_cols], function(x) c( n = sum(!is.na(x)), mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE), median = median(x, na.rm = TRUE), min = min(x, na.rm = TRUE), max = max(x, na.rm = TRUE) )), 3) # Cross-tabulation # table(df$group, df$category, useNA = "ifany") # prop.table(table(df$group, df$category), margin = 1) # row proportions # ── 5. Visualization (EDA) ─────────────────────────────────── par(mfrow = c(2, 2)) # Histogram of main outcome hist(df$outcome_var, main = "Distribution of Outcome", xlab = "Outcome", col = "steelblue", border = "white", breaks = 30) # Boxplot by group boxplot(outcome_var ~ group_var, data = df, main = "Outcome by Group", col = "lightyellow", las = 2) # Scatter plot plot(df$predictor, df$outcome_var, main = "Predictor vs Outcome", xlab = "Predictor", ylab = "Outcome", pch = 19, col = adjustcolor("steelblue", alpha.f = 0.5), cex = 0.8) abline(lm(outcome_var ~ predictor, data = df), col = "red", lwd = 2) # Correlation matrix (numeric columns only) cor_mat <- cor(df[num_cols], use = "complete.obs") image(cor_mat, main = "Correlation Matrix", col = hcl.colors(20, "RdBu", rev = TRUE)) par(mfrow = c(1, 1)) # ── 6. Analysis ─────────────────────────────────────────────── # ·· 6a. Comparison of means ·· t.test(outcome_var ~ group_var, data = df) # ·· 6b. Linear regression ·· fit <- lm(outcome_var ~ predictor1 + predictor2 + group_var, data = df) summary(fit) confint(fit) # Check VIF for multicollinearity (requires car) # car::vif(fit) # Robust standard errors (requires lmtest + sandwich) # lmtest::coeftest(fit, vcov = sandwich::vcovHC(fit, type = "HC3")) # ·· 6c. ANOVA ·· # fit_aov <- aov(outcome_var ~ group_var, data = df) # summary(fit_aov) # TukeyHSD(fit_aov) # ·· 6d. Logistic regression (binary outcome) ·· # fit_logit <- glm(binary_outcome ~ x1 + x2, # data = df, # family = binomial(link = "logit")) # summary(fit_logit) # exp(coef(fit_logit)) # odds ratios # exp(confint(fit_logit)) # OR confidence intervals # ── 7. Model Diagnostics ───────────────────────────────────── par(mfrow = c(2, 2)) plot(fit) par(mfrow = c(1, 1)) # Residual normality shapiro.test(residuals(fit)) # Homoscedasticity (requires lmtest) # lmtest::bptest(fit) # ── 8. Save Output ──────────────────────────────────────────── # Cleaned data # write.csv(df, "data_clean.csv", row.names = FALSE) # saveRDS(df, "data_clean.rds") # Model results to text file # sink("results.txt") # cat("=== Linear Model ===\n") # print(summary(fit)) # cat("\n=== Confidence Intervals ===\n") # print(confint(fit)) # sink() # Plots to file # png("figure1_distributions.png", width = 1200, height = 900, res = 150) # par(mfrow = c(2, 2)) # # ... your plots ... # par(mfrow = c(1, 1)) # dev.off() # ============================================================ # END OF TEMPLATE # ============================================================ FILE:scripts/check_data.R # check_data.R — Quick data quality report for any R data frame # Usage: source("check_data.R") then call check_data(df) # Or: source("check_data.R"); check_data(read.csv("yourfile.csv")) check_data <- function(df, top_n_levels = 8) { if (!is.data.frame(df)) stop("Input must be a data frame.") n_row <- nrow(df) n_col <- ncol(df) cat("══════════════════════════════════════════\n") cat(" DATA QUALITY REPORT\n") cat("══════════════════════════════════════════\n") cat(sprintf(" Rows: %d Columns: %d\n", n_row, n_col)) cat("══════════════════════════════════════════\n\n") # ── 1. Column overview ────────────────────── cat("── COLUMN OVERVIEW ────────────────────────\n") for (col in names(df)) { x <- df[[col]] cls <- class(x)[1] n_na <- sum(is.na(x)) pct <- round(n_na / n_row * 100, 1) n_uniq <- length(unique(x[!is.na(x)])) na_flag <- if (n_na == 0) "" else sprintf(" *** %d NAs (%.1f%%)", n_na, pct) cat(sprintf(" %-20s %-12s %d unique%s\n", col, cls, n_uniq, na_flag)) } # ── 2. NA summary ──────────────────────────── cat("\n── NA SUMMARY ─────────────────────────────\n") na_counts <- sapply(df, function(x) sum(is.na(x))) cols_with_na <- na_counts[na_counts > 0] if (length(cols_with_na) == 0) { cat(" No missing values. \n") } else { cat(sprintf(" Columns with NAs: %d of %d\n\n", length(cols_with_na), n_col)) for (col in names(cols_with_na)) { bar_len <- round(cols_with_na[col] / n_row * 20) bar <- paste0(rep("█", bar_len), collapse = "") pct_na <- round(cols_with_na[col] / n_row * 100, 1) cat(sprintf(" %-20s [%-20s] %d (%.1f%%)\n", col, bar, cols_with_na[col], pct_na)) } } # ── 3. Numeric columns ─────────────────────── num_cols <- names(df)[sapply(df, is.numeric)] if (length(num_cols) > 0) { cat("\n── NUMERIC COLUMNS ────────────────────────\n") cat(sprintf(" %-20s %8s %8s %8s %8s %8s\n", "Column", "Min", "Mean", "Median", "Max", "SD")) cat(sprintf(" %-20s %8s %8s %8s %8s %8s\n", "──────", "───", "────", "──────", "───", "──")) for (col in num_cols) { x <- df[[col]][!is.na(df[[col]])] if (length(x) == 0) next cat(sprintf(" %-20s %8.3g %8.3g %8.3g %8.3g %8.3g\n", col, min(x), mean(x), median(x), max(x), sd(x))) } } # ── 4. Factor / character columns ─────────── cat_cols <- names(df)[sapply(df, function(x) is.factor(x) | is.character(x))] if (length(cat_cols) > 0) { cat("\n── CATEGORICAL COLUMNS ────────────────────\n") for (col in cat_cols) { x <- df[[col]] tbl <- sort(table(x, useNA = "no"), decreasing = TRUE) n_lv <- length(tbl) cat(sprintf("\n %s (%d unique values)\n", col, n_lv)) show <- min(top_n_levels, n_lv) for (i in seq_len(show)) { lbl <- names(tbl)[i] cnt <- tbl[i] pct <- round(cnt / n_row * 100, 1) cat(sprintf(" %-25s %5d (%.1f%%)\n", lbl, cnt, pct)) } if (n_lv > top_n_levels) { cat(sprintf(" ... and %d more levels\n", n_lv - top_n_levels)) } } } # ── 5. Duplicate rows ──────────────────────── cat("\n── DUPLICATES ─────────────────────────────\n") n_dup <- sum(duplicated(df)) if (n_dup == 0) { cat(" No duplicate rows.\n") } else { cat(sprintf(" %d duplicate row(s) found (%.1f%% of data)\n", n_dup, n_dup / n_row * 100)) } cat("\n══════════════════════════════════════════\n") cat(" END OF REPORT\n") cat("══════════════════════════════════════════\n") # Return invisibly for programmatic use invisible(list( dims = c(rows = n_row, cols = n_col), na_counts = na_counts, n_dupes = n_dup )) } FILE:scripts/scaffold_analysis.R #!/usr/bin/env Rscript # scaffold_analysis.R — Generates a starter analysis script # # Usage (from terminal): # Rscript scaffold_analysis.R myproject # Rscript scaffold_analysis.R myproject outcome_var group_var # # Usage (from R console): # source("scaffold_analysis.R") # scaffold_analysis("myproject", outcome = "score", group = "treatment") # # Output: myproject_analysis.R (ready to edit) scaffold_analysis <- function(project_name, outcome = "outcome", group = "group", data_file = NULL) { if (is.null(data_file)) data_file <- paste0(project_name, ".csv") out_file <- paste0(project_name, "_analysis.R") template <- sprintf( '# ============================================================ # Project : %s # Created : %s # ============================================================ # ── 0. Libraries ───────────────────────────────────────────── # Add packages you need here # library(ggplot2) # library(haven) # for .dta files # library(openxlsx) # for Excel output # ── 1. Load Data ───────────────────────────────────────────── df <- read.csv("%s", stringsAsFactors = FALSE) # Quick check — always do this first cat("Dimensions:", dim(df), "\\n") str(df) head(df) # ── 2. Explore / EDA ───────────────────────────────────────── summary(df) # NA check na_counts <- colSums(is.na(df)) na_counts[na_counts > 0] # Key variable distributions hist(df$%s, main = "Distribution of %s", xlab = "%s") if ("%s" %%in%% names(df)) { table(df$%s) barplot(table(df$%s), main = "Counts by %s", col = "steelblue", las = 2) } # ── 3. Clean / Transform ────────────────────────────────────── # df <- df[complete.cases(df), ] # drop rows with any NA # df$%s <- as.factor(df$%s) # convert to factor # ── 4. Analysis ─────────────────────────────────────────────── # Descriptive stats by group tapply(df$%s, df$%s, mean, na.rm = TRUE) tapply(df$%s, df$%s, sd, na.rm = TRUE) # t-test (two groups) # t.test(%s ~ %s, data = df) # Linear model fit <- lm(%s ~ %s, data = df) summary(fit) confint(fit) # ANOVA (multiple groups) # fit_aov <- aov(%s ~ %s, data = df) # summary(fit_aov) # TukeyHSD(fit_aov) # ── 5. Visualize Results ────────────────────────────────────── par(mfrow = c(1, 2)) # Boxplot by group boxplot(%s ~ %s, data = df, main = "%s by %s", xlab = "%s", ylab = "%s", col = "lightyellow") # Model diagnostics plot(fit, which = 1) # residuals vs fitted par(mfrow = c(1, 1)) # ── 6. Save Output ──────────────────────────────────────────── # Save cleaned data # write.csv(df, "%s_clean.csv", row.names = FALSE) # Save model summary to text # sink("%s_results.txt") # summary(fit) # sink() # Save plot to file # png("%s_boxplot.png", width = 800, height = 600, res = 150) # boxplot(%s ~ %s, data = df, col = "lightyellow") # dev.off() ', project_name, format(Sys.Date(), "%%Y-%%m-%%d"), data_file, # Section 2 — EDA outcome, outcome, outcome, group, group, group, group, # Section 3 group, group, # Section 4 outcome, group, outcome, group, outcome, group, outcome, group, outcome, group, outcome, group, # Section 5 outcome, group, outcome, group, group, outcome, # Section 6 project_name, project_name, project_name, outcome, group ) writeLines(template, out_file) cat(sprintf("Created: %s\n", out_file)) invisible(out_file) } # ── Run from command line ───────────────────────────────────── if (!interactive()) { args <- commandArgs(trailingOnly = TRUE) if (length(args) == 0) { cat("Usage: Rscript scaffold_analysis.R <project_name> [outcome_var] [group_var]\n") cat("Example: Rscript scaffold_analysis.R myproject score treatment\n") quit(status = 1) } project <- args[1] outcome <- if (length(args) >= 2) args[2] else "outcome" group <- if (length(args) >= 3) args[3] else "group" scaffold_analysis(project, outcome = outcome, group = group) } FILE:README.md # base-r-skill GitHub: https://github.com/iremaydas/base-r-skill A Claude Code skill for base R programming. --- ## The Story I'm a political science PhD candidate who uses R regularly but would never call myself *an R person*. I needed a Claude Code skill for base R — something without tidyverse, without ggplot2, just plain R — and I couldn't find one anywhere. So I made one myself. At 11pm. Asking Claude to help me build a skill for Claude. If you're also someone who Googles `how to drop NA rows in R` every single time, this one's for you. 🫶 --- ## What's Inside ``` base-r/ ├── SKILL.md # Main skill file ├── references/ # Gotchas & non-obvious behaviors │ ├── data-wrangling.md # Subsetting traps, apply family, merge, factor quirks │ ├── modeling.md # Formula syntax, lm/glm/aov/nls, optim │ ├── statistics.md # Hypothesis tests, distributions, clustering │ ├── visualization.md # par, layout, devices, colors │ ├── io-and-text.md # read.table, grep, regex, format │ ├── dates-and-system.md # Date/POSIXct traps, options(), file ops │ └── misc-utilities.md # tryCatch, do.call, time series, utilities ├── scripts/ │ ├── check_data.R # Quick data quality report for any data frame │ └── scaffold_analysis.R # Generates a starter analysis script └── assets/ └── analysis_template.R # Copy-paste analysis template ``` The reference files were condensed from the official R 4.5.3 manual — **19,518 lines → 945 lines** (95% reduction). Only the non-obvious stuff survived: gotchas, surprising defaults, tricky interactions. The things Claude already knows well got cut. --- ## How to Use Add this skill to your Claude Code setup by pointing to this repo. Then Claude will automatically load the relevant reference files when you're working on R tasks. Works best for: - Base R data manipulation (no tidyverse) - Statistical modeling with `lm`, `glm`, `aov` - Base graphics with `plot`, `par`, `barplot` - Understanding why your R code is doing that weird thing Not for: tidyverse, ggplot2, Shiny, or R package development. --- ## The `check_data.R` Script Probably the most useful standalone thing here. Source it and run `check_data(df)` on any data frame to get a formatted report of dimensions, NA counts, numeric summaries, and categorical breakdowns. ```r source("scripts/check_data.R") check_data(your_df) ``` --- ## Built With Help From - Claude (obviously) - The official R manuals (all 19,518 lines of them) - Mild frustration and several cups of coffee --- ## Contributing If you spot a missing gotcha, a wrong default, or something that should be in the references — PRs are very welcome. I'm learning too. --- *Made by [@iremaydas](https://github.com/iremaydas) — PhD candidate, occasional R user, full-time Googler of things I should probably know by now.*
Added on March 31, 2026