Data Manipulation in Clojure Compared to R and Python (codewithkira.com)
ertucetin 1 day ago [-]
I’ve built many different kinds of software (backend, frontend, 3D games, cli tools, code editor, and more) with Clojure and have been using it for over a decade now.

I can confidently say that, among the list I mentioned, it’s the best for data manipulation/transformation. Thanks to the author for presenting it clearly and showing how the libraries and code look across different languages, all of which do a great job.

But Clojure has its own special place (maybe in my heart as well :). I think Clojure should be used more in the data science space. Thanks to the JVM, it can be very performant (I’m looking at you, Python).

hatmatrix 21 hours ago [-]
There was XLISP-STAT before R, but the scientists have spoken. They don't like the parentheses.
iLemming 5 hours ago [-]
If you compare every single language Clojure can emit - Clojure&Java, Clojurescript&Javascript, babashka&Bash, Clojure-Dart&Dart, Jank&C++, Fennel&Lua (even though technically Fennel is only Clojure-like) - the number of delimiters (and often even number of parens) would be universally higher than in the Clojure code. I guarantee it. Clojure has a lower delimiter-to-structure ratio. Java has parens that exist purely for syntactic obligations - `(if (`, `for (`, etc. It's not that Clojure has fewer parens absolutely, yet it has no wasted ones - that is 100% true.
teleforce 21 hours ago [-]
All the comparisons are with untyped scripting languages, perhaps for faster development and a more intuitive ecosystem that increases developer productivity.

In the age of IntelliSense, auto-completion and AI-assisted coding, is the choice of an untyped scripting language still justifiable for an increase in productivity at the expense of safety and reliability?

If you're building a data system not just for exploratory work, surely modern compiled and typed systems languages like Rust and D make more sense for safety and reliability for the end users?

Even more so with D, where you can even have scripting capability for the exploratory and prototyping stage with its built-in REPL facility [1],[2]. This is feasible due to its very fast compile times, unlike Rust. It has a more intuitive "Pythonic" syntax compared to other typed languages [3]. You can also program with the GC on by default if you choose to. Apparently, you can have your cake and eat it too.

[1] drepl:

https://github.com/dlang-community/drepl

[2] Why I use the D programming language for scripting:

https://opensource.com/article/21/1/d-scripting

[3] All in on DLang: Why I pivoted to D for web, teaching, and graphics in 2025 and beyond! [PDF]

https://dconf.org/2025/slides/shah.pdf

zelphirkalt 12 hours ago [-]
One general challenge with statically, strongly typed languages is that one can quickly get to a local optimum, but that local optimum might lack some flexibility that is needed later on and is only discovered after some usage and seeing many use cases. Then a big refactoring is ahead, possibly even of the core types of the project. If that is allowed and such flexibility is introduced, expressing it in types often becomes quite complex, which, without a lot of care, will impact the users of the project. The users need to adhere to the same types, and there might then be quite some ceremony around constructing something of the correct type to use it with the project.

It is safer, but it is not without its downsides. It demands a careful design to make something people will enjoy using.

iLemming 5 hours ago [-]
> with its built-in REPL facility

brotha... please.. you're making me laugh so hard, my parentheses are shaking and getting unbalanced. Non-lispy langs that "provide REPL facilities" and Lisp dialects share the word "REPL" the way a rowboat and an aircraft carrier share the word "vessel". A Lisp REPL is an architectural relationship between you and a running system. Dlang's REPL is a nice sketchpad - worse than Python's. It's in the same "sketchpad league" with C#'s, Ruby's, Kotlin's and Node's. Clojure REPL is closer to what Smalltalk had - you're always inside the machine.

geokon 18 hours ago [-]
It's a bit apples to oranges.

If you're "building a data system not just for exploratory work" then you're probably not going to be using any of the presented options. However, in my experience Clojure has an ecosystem where it is very easy to transition from exploring/playing with data at the REPL to a more robust "pro" setup that's designed to scale, handle failures, etc.

teleforce 15 hours ago [-]
I understand the sentiment but I disagree with the approach: it's probably efficient for exploratory work but not effective for everything else, including prototyping and systems development.

For any engineering work, including software engineering, you choose the best tool for the job. In D you have a high-performance tool capable of bit shifting, string processing and array manipulation (to name a few), suited to everything from scripts to highly concurrent low-latency applications (see the presentation in ref [3] above by Prof. Shah from Yale).

It's a shame that properly typed programming languages are being ignored just because of programmers' locally sub-optimal preferences and limited exposure. The productivity gain from using typical scripting languages, including Python, is diminishing every day with the proliferation of IntelliSense, auto-complete and AI-assisted coding.

For production code, systems based on scripting languages, if they ever make it to production (and most do, e.g. Airbnb, Twitter, Shopify, GitHub), will be a maintenance headache and a user nightmare unless the support is great or the company is a unicorn start-up. The last thing you want is for the saved e-claim form that you spent many hours preparing to disappear entirely because the system cannot recall the saved version. Granted, this can happen for many reasons, but most of the problematic production systems are written in scripting languages, including Python, because these are the only languages the programmers know and are familiar with. Adding insult to injury, the readily available so-called "batteries included" libraries are convenient but ironically written in other compiled but unsafe systems languages, namely C/C++.

geokon 15 hours ago [-]
I think you're going to have trouble convincing people that a compile-loop language is going to be on par with a REPL/interactive setup. You can look at an extreme example like MATLAB. With all your tools you're never going to reach the same level of interactive productivity with D for the subset of problems it addresses.

You can have all your tools dump out and rewrite the oodles of boilerplate your typed languages require - but at the end of the day you have to read all that junk... or not? and just vibecode and #yolo it? But then you're back to "safety and reliability" problems and you haven't won anything

Also "safety and reliability" are just non-goals in a lot of contexts. My shitty plotting script doesn't care about "safety". It's not sitting on the network. It's reliable enough for the subset of inputs I provide it. I don't need to handle every conceivable corner case. I have other things to do

> Adding insult to injury, the readily available libraries are convenient but ironically written in other compiled but unsafe systems languages, namely C/C++

No one cares if you leak memory in some corner case with some esoteric inputs. And no one is worried your BLAS bindings are going to leak your secrets. These are just not objectives

teleforce 14 hours ago [-]
My point is that Dlang scales from beginner to expert, from scripting to highly concurrent low-latency applications. Why settle for sub-optimal scripting languages if you can have the real deal with much better performance and freely available open source?

In the automotive world, if you can afford it, you need a daily driver for the job and supermarket runs, a weekend supercar for fun and showing off, and an off-road 4x4 for overnight camping. But in the software world D can cater for almost everything with free open-source compilers and minimal productivity overhead, and it is much cheaper to host as well [1].

Funny you mentioned BLAS, since the Dlang BLAS implementation has also surpassed the run-of-the-mill high-performance BLAS libraries that these scripting languages can only dream of (Matlab calling third-party Fortran code, no less) [2].

[1] Saving Money by Switching from PHP to D:

https://dlang.org/blog/2019/09/30/saving-money-by-switching-...

[2] Numeric age for D: Mir GLAS is faster than OpenBLAS and Eigen:

http://blog.mir.dlang.io/glas/benchmark/openblas/2016/09/23/...

physPop 8 hours ago [-]
Python isn't untyped, it's dynamically typed: https://stackoverflow.com/questions/2351190/static-dynamic-v...
zmmmmm 21 hours ago [-]
Seems like it's going to be a tough sell to get people to want to write

    (tc/select-rows ds #(> (% "year") 2008))
instead of

    filter(ds, year > 2008)
They seem to ignore the existence of Spark, so even if you specifically want to use the JVM it feels clearer and simpler:

    ds.filter(r => r.year > 2008)
aphyr 18 hours ago [-]
You're right, that is longer! I get why though; `filter` is a clojure.core function name people don't necessarily feel comfortable shadowing, and the Clojure and Spark versions make it clear what's a symbol in local scope versus a field in the dataset. I don't think it'd be hard to make a little wrapper for this sort of thing though! Here's an example which turns any symbols not in local scope into field lookups on an implicit row variable.

    (require '[clojure.walk :refer [postwalk]])

    (defmacro filter
      [ds & anaphoric-pred]
      (let [row-name (gensym 'row)
            pred     (postwalk (fn [form]
                                 (if (and (symbol? form) (nil? (resolve form)))
                                   `(get ~row-name ~(str form))
                                   form))
                       anaphoric-pred)]
        `(tc/select-rows ~ds (fn [~row-name] ~@pred))))
Now you can write

    (filter ds (> year 2008))
And it'll expand to the tc form:

    (pprint (macroexpand '(filter ds (> year 2008))))
    => (tc/select-rows ds (fn [row2411] (> (get row2411 "year") 2008)))
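For comparison, the same "bare names become field lookups" idea can be sketched in Python. This is only an illustration under stated assumptions, not part of tablecloth or pandas: `filter_rows` is a hypothetical helper, the rows are made-up sample data, and the predicate is passed as a string because Python has no macros. Names given as keyword arguments play the role of local scope; everything else falls back to the row's fields.

```python
def filter_rows(rows, expr, **local_scope):
    """Keep the rows for which `expr` is true (hypothetical helper).

    Names in `expr` are looked up in `local_scope` first and fall back to
    the row's fields, mirroring the macro's local-versus-field rule.
    """
    code = compile(expr, "<filter>", "eval")
    kept = []
    for row in rows:
        # Explicit locals win over fields with the same name.
        namespace = {**row, **local_scope}
        # Builtins are emptied out; this is still not safe for untrusted input.
        if eval(code, {"__builtins__": {}}, namespace):
            kept.append(row)
    return kept


# Made-up sample rows, for illustration only.
rows = [
    {"year": 2007, "species": "Adelie"},
    {"year": 2009, "species": "Gentoo"},
]
cutoff = 2008
recent = filter_rows(rows, "year > cutoff", cutoff=cutoff)
```

Unlike the macro, this resolves names at run time for every row, and because it relies on `eval` it should only ever see expressions you wrote yourself.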
geokon 18 hours ago [-]
In my experience the advantage comes when you have a few more lines of code

The Clojure pipelining makes code much more readable. Granted dplyr has them too, but tidyverse pipes always felt like a hack on top of R (though my experience is dated here). While in Clojure I always feel like I'm playing with the fundamental language data-types/protocols. I can extend things in any way I want

hrrld 6 hours ago [-]
In practice we use `ds/filter-column` and `ds/filter` much more than `select-rows`.

The sell isn't about typing a few more or a few less characters, it's about doing data science functionally.

condwanaland 21 hours ago [-]
Couldn't agree more. R and dplyr's ability to pass column names as unquoted objects actually reduces cognitive load for new people so much (pure anecdata, nothing to back this up except lots of teaching people).

And that's on top of the vastly simpler syntax compared to what's being shown here

iLemming 6 hours ago [-]
> vastly simpler syntax

I've been programming for decades. I've used dozens of different, at times enormously esoteric languages. At one point I built ERPs in a language where operators were abbreviated Russian terms. After just a few years of using Lisp dialects I am absolutely convinced - there's no simpler and more readable syntax than Lisp's. Anyone who doesn't see that has, in my eyes, just not made the distinction between familiarity and simplicity.

They're measuring how quickly their eyes can parse something they've already seen a thousand times, and calling that readability. But readability isn't recognition speed - it's the cognitive distance between the code and the computation it describes. And on that measure, Lisp is essentially lossless. There's no syntactic residue. No ceremony the language demands for its own sake. What you write is the structure of the thing, all the way down.

"You get used to it. I don't even see the code. All I see is blonde, brunette, redhead..." I don't look at Matrix feeling puzzled anymore. I see the truth.

People who bounce off the parentheses are reacting to something real: it doesn't look like what they already know. But that's not the language failing them. That's just the last bit of the old syntax dying. Give it a few months of structural editing and a proper REPL workflow, and you won't see parentheses anymore - you'll see shape. You'll see depth. And going back to anything else will feel like someone handed you a map drawn in crayon and called it a feature.

manudaro 17 hours ago [-]
The Clojure tablecloth performance numbers here are pretty surprising, usually see Python/polars dominating these benchmarks. Been running similar transformations on transit data feeds and polars consistently outperforms pandas by 3x-5x on the group-by operations, but hadn't considered Clojure for the pipeline. Anyone actually using tablecloth in production data workflows?
olivia-banks 24 hours ago [-]
Having "NA" being treated as nil/null/None by default seems like it would cause the Namibia problem!
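A minimal sketch of that failure mode, using only Python's standard `csv` module. The sample data and the `read_rows` helper are made up for illustration and do not reproduce any particular library's parser; the point is only that a default list of missing-value markers containing "NA" collides with Namibia's ISO country code.

```python
import csv
import io

# Made-up sample data: "NA" is Namibia's ISO 3166-1 alpha-2 country code.
RAW = "country,code\nNamibia,NA\nNorway,NO\nUnknown,\n"


def read_rows(text, missing_tokens):
    """Parse CSV text, mapping any cell listed in `missing_tokens` to None."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {key: (None if value in missing_tokens else value)
         for key, value in row.items()}
        for row in reader
    ]


# A default that treats "NA" as missing silently erases Namibia's code.
lossy = read_rows(RAW, missing_tokens={"", "NA"})

# Restricting the missing markers to the empty string keeps it intact.
safe = read_rows(RAW, missing_tokens={""})
```

Here `lossy[0]["code"]` ends up as `None` while `safe[0]["code"]` stays `"NA"`. pandas exposes the same choice through the `keep_default_na` and `na_values` parameters of `read_csv`.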
daslu 7 hours ago [-]
Great post.

The way Tablecloth unifies column processing and row processing in a functional way is so elegant.

__mharrison__ 1 day ago [-]
Good pandas and polars code should also be written in an immutable way...
epgui 1 day ago [-]
Good Python code can exist, but Python makes it so easy to write bad code that good Python rarely exists.
nxpnsv 1 day ago [-]
Agree. While it is common to see code like these pandas examples, it is very possible to write these manipulations so that they return a new frame or view without changing the inputs.
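To make "return a new frame without changing the inputs" concrete, here is a small sketch in plain Python, with a list of dicts standing in for a data frame so that it runs without pandas. The helpers `with_column` and `where` and the sample rows are hypothetical, chosen only for illustration.

```python
def with_column(rows, name, fn):
    """Return new rows with `name` computed by `fn`; the input is not modified."""
    return [{**row, name: fn(row)} for row in rows]


def where(rows, pred):
    """Return a new list holding only the rows that satisfy `pred`."""
    return [row for row in rows if pred(row)]


# Made-up sample rows, for illustration only.
original = [
    {"bill_length": 28.5, "bill_depth": 17.0},
    {"bill_length": 41.2, "bill_depth": 18.3},
]

# Each step builds a new value; `original` is left untouched throughout.
short_bills = where(
    with_column(original, "ratio", lambda r: r["bill_length"] / r["bill_depth"]),
    lambda r: r["bill_length"] < 30,
)
```

In pandas itself the closest equivalents are chained calls such as `DataFrame.assign` and `DataFrame.query`, which return new frames rather than assigning into columns in place.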
QubridAI 23 hours ago [-]
Interesting perspective. Clojure’s immutable, functional approach makes data wrangling feel very different from the more imperative style of R and Python.
thrawa8387336 23 hours ago [-]
I always wished Incanter took off.
soumyaskartha 1 day ago [-]
Clojure never got the data science crowd even though the language is genuinely good for it. Always felt like a distribution problem more than a technical one.
levocardia 1 day ago [-]
In this very post you can see why: the dplyr code is just so much more readable. Like a lot of python, dplyr reads almost like pseudocode: take this dataset, select the columns that start with "bill", then filter so that bill_length is less than 30. So simple and so little fluff!
hatmatrix 21 hours ago [-]
Julia's Tidier.jl ecosystem is getting there too. It uses macros to mimic this 'special' evaluation framework of R, so the code is also readable in a similar way.
erichocean 1 day ago [-]
> is just so much more readable

I thought that too before I learned Clojure, now I find them equally readable.

lemming 20 hours ago [-]
I'm very familiar with Clojure, but even I can't make a good argument that:

    (tc/select-rows ds #(> (% "year") 2008))
is more, or at least as, intuitive as:

    filter(ds, year > 2008)
as cited above. I think there's a good argument to be made that Clojure's data processing abilities, particularly around immutable data, make a compelling case in spite of the syntax. The REPL is great too, and the JVM is fast. But I still to this day imagine infix comparisons in my head and then mentally move the comparator to the front of the list to make sure I get it right.
erichocean 13 hours ago [-]
How about this?

    (filter ds (> year 2008))
That's a trivial Clojure macro to make work if it's what you find "intuitive."
Capricorn2481 19 hours ago [-]
I am really not in data science, and I have decent Clojure experience. Is there a reason anyone would pick Clojure over something like K? From what I understand, those array languages are really good for writing safe but efficient code on rectangular data.
asa400 1 day ago [-]
Unfortunately, having to mess around with a JVM is a tough sell for a lot of data analysis folks. I'm not saying it's rational or right, but a lot of people hear "JVM" and they go "no thank you". Personally I think it's a non-issue, but you have to meet people where they are.
pjmlp 24 hours ago [-]
The irony given the mess of Python setup where there are companies whose business is to solve Python tooling.
asa400 18 hours ago [-]
Oh, I completely agree. Like I said, it's not rational, but it is what it is.
cmiles74 23 hours ago [-]
I dunno, if you can slog through the Python ecosystem then the JVM is starting to look not so bad. Plus with Clojure you don't need to deal with the headache and heartache that is Maven.
KingMob 17 hours ago [-]
I think that's true for only a limited subset of programs, though. The Clojure lib ecosystem is nowhere near the size of the broader Java ecosystem, so you frequently end up pulling Maven deps to plug holes anyway.
pjmlp 13 hours ago [-]
That is the goal of a polyglot runtime, and why Clojure was designed to be a hosted language that embraces the platform, unlike others that make their tiny island.
famicom0 1 day ago [-]
Meanwhile, I find it very annoying to deal with the litany of Python versions and the distinction between global packages and user packages, and needing to manage virtual environments just to run scripts. That being said, I am not an expert but that's always been my experience when I need to do anything Python related.
packetlost 23 hours ago [-]
idk, I don't think I've had to do anything beyond install the JVM to work with Clojure. I'm not really a fan of the clj command's flag choices though (-M, -X, etc. all make no sense)
KingMob 17 hours ago [-]
It's unfortunate, but people's associations with Java the lang bleed into their beliefs about the JVM, one of the most heavily-optimized VMs on the planet.

There's some historical cruft (especially the memory model), but picking the JVM as a target is a great decision (especially with Graal offering even more options).

pjmlp 13 hours ago [-]
Exactly, especially because there isn't THE JVM, rather a bunch of versions each with their own approaches to GC, JIT, JIT caches, ahead of time compilation.

Only .NET follows up on it at scale.
