This really doesn't have anything to do with C#. This is your classic nvarchar vs varchar issue (or unicode vs ASCII). The same thing happens if you mix collations.
I'm not sure why anyone would choose varchar for a column in 2026 unless you have some sort of ancient backwards-compatibility situation.
dspillett 22 hours ago [-]
> I'm not sure why anyone would choose varchar for a column in 2026
The same string takes roughly half the storage space, meaning more rows per page and therefore a smaller working set needed in memory for the same queries and less IO. Also, any indexes on those columns will also be similarly smaller. So if you are storing things that you know won't break out of the standard ASCII set⁰, stick with [VAR]CHARs¹, otherwise use N[VAR]CHARs.
Of course if you can guarantee that your stuff will be used on recent enough SQL Server versions that are configured to support UTF8 collations, then default to that instead unless you expect data in a character set where that might increase the data size over UTF16. You'll get the same size benefit for pure ASCII without losing wider character set support.
Furthermore, if you are using row or page compression it doesn't really matter: your wide-character strings will effectively be UTF8 encoded anyway. But be aware that there is a CPU hit for processing compressed rows and pages every access because they remain compressed in memory as well as on-disk.
--------
[0] Codes with fixed ranges, etc.
[1] Some would put that the other way around, as “use NVARCHAR if you think there might be any non-ASCII characters”, but defaulting to NVARCHAR and moving to VARCHAR only if you are confident is the safer approach IMO.
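A quick way to see the size difference described above is to compare DATALENGTH for the same literal typed as varchar and as nvarchar. Below is a minimal C#/Dapper sketch (the connection string is a placeholder); with a UTF-8 collation on SQL Server 2019+, a varchar column keeps the one-byte-per-character size for ASCII while still accepting the full Unicode range.

    using System;
    using Dapper;
    using Microsoft.Data.SqlClient;

    class SizeDemo
    {
        static void Main()
        {
            // Placeholder connection string; any SQL Server instance will do.
            using var conn = new SqlConnection("Server=.;Database=tempdb;Integrated Security=true;TrustServerCertificate=true");

            // DATALENGTH returns the storage size of an expression in bytes.
            var row = conn.QuerySingle(@"
                SELECT DATALENGTH('Hello, world')  AS VarcharBytes,   -- 12: one byte per character
                       DATALENGTH(N'Hello, world') AS NvarcharBytes   -- 24: two bytes per character");

            Console.WriteLine($"varchar: {row.VarcharBytes} bytes, nvarchar: {row.NvarcharBytes} bytes");
        }
    }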
gfody 17 hours ago [-]
utf16 is more efficient if you have non-english text, utf8 wastes space with longer multi-byte sequences. but the real reason to always use nvarchar is that it remains sargable when varchar parameters are implicitly cast to nvarchar.
tialaramex 13 hours ago [-]
UTF-16 is maybe better if your text is mostly made of codepoints which need 3 UTF-8 code units but only one (thus 2 bytes) UTF-16 code unit. This is extremely rare for general text and so you definitely shouldn't begin by assuming UTF-16 is a good choice without having collected actual data.
downsplat 10 hours ago [-]
The old defense of 16-bit chars, popping up in 2026 still! Utf8 is efficient enough for all general purpose uses.
If you're storing gigabytes of non-latin-alphabet text, and your systems are constrained enough that it makes a difference, 16-bit is always there. But I'd still recommend anyone starting a system today to not worry and use utf8 for everything.
What do you mean by non-english text? I don't think "Ä" will be more efficient in utf16 than in utf8. Or do you mean utf16 wins in cases of non-latin scripts with variable width? I always had the impression that utf8 wins on the vast majority of symbols, and that in case of very complex variable-width char sets it depends on the wideness whether utf16 can accommodate it. On a tangent, I wonder if emojis would fit that bill too.
Tuna-Fish 13 hours ago [-]
Japanese, Chinese, Korean and Indic scripts are mostly 2 bytes per character on UTF-16 and mostly 3 bytes per character in UTF-8.
divingdragon 12 hours ago [-]
Really, as an East Asian language user the rest of the comments here make me want to scream.
exceptione 5 hours ago [-]
I am not sure if you mean me, as I just asked a question. I wonder what the best way is to handle this disparity for international software. It seems like either you punish the Latin alphabets, or the others.
gfody 10 hours ago [-]
hn often makes me want to scream
SigmundA 10 hours ago [-]
The non-sargability is an optimizer deficiency IMO. It could attempt the cast just like this article is doing manually in code: if that succeeds, use the index; if it fails, scan and cast a million times the other way.
gfody 10 hours ago [-]
implicit casts should only widen, to avoid quiet information loss; if the optimizer behaved as you suggest, the query could return incorrect results, and potentially more rows than expected, with even worse consequences
SigmundA 7 hours ago [-]
It should not return incorrect results: if the nvarchar only contains ascii it will cast perfectly, and if it doesn't, then do the slow scan path. It's a simple check, and the same work it's doing for every row in the current behavior, except done once and more restricted. Can you give me an example of an incorrect result here?
I am not talking about the default cast behavior from nvarchar to varchar, but a specific narrow check the optimizer can use to make a decision in the plan, ascii or not, with no information loss, because it will do the same thing as before if the one-time parameter check does not pass.
By far the most common cause of this situation is using ascii only in an nvarchar, because, like in this example, the client language is using an nvarchar equivalent for all strings, which is pretty much universal nowadays, and that is the default conversion when using a sql client library. One must remember to explicitly cast rather than have the db do it for you, which is the expected behavior and the source of much confusion.
This would be purely an optimization fast-path check, otherwise falling back to the current slow path: correct results always, with much faster results if only ascii is present in the string.
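For what it's worth, the same one-time check can be done on the client today rather than waiting for the optimizer. A minimal Dapper sketch, with the helper, table and column names invented for the example: inspect the parameter once, and if it is pure ascii send it as varchar (DbString with IsAnsi = true) so the index seek survives; otherwise send it as nvarchar and accept the scan.

    using System.Linq;
    using Dapper;
    using Microsoft.Data.SqlClient;

    class AsciiFastPath
    {
        // One-time check on the parameter value: ASCII-only text converts to varchar losslessly.
        static bool IsAsciiOnly(string value) => value.All(c => c <= 0x7F);

        // Hypothetical helper: table, column and parameter names are illustrative only.
        static object CodeParam(string code) =>
            IsAsciiOnly(code)
                ? new { Code = (object)new DbString { Value = code, IsAnsi = true, Length = 20 } } // varchar: index seek
                : new { Code = (object)code };                                                     // nvarchar: scan fallback

        static void Main()
        {
            using var conn = new SqlConnection("Server=.;Database=AppDb;Integrated Security=true;TrustServerCertificate=true");
            var rows = conn.Query("SELECT * FROM dbo.Customer WHERE Code = @Code", CodeParam("ABC"));
        }
    }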
beart 23 hours ago [-]
I agree with your first point. I've seen this same issue crop up in several other ORMs.
As to your second point: VARCHAR uses N + 2 bytes whereas NVARCHAR uses N*2 + 2 bytes for storage (at least on SQL Server). The vast majority of character fields in databases I've worked with do not need to store unicode values.
wvenable 23 hours ago [-]
> The vast majority of character fields in databases I've worked with do not need to store unicode values.
This has not been my experience at all. Exactly the opposite, in fact. ASCII is dead.
SigmundA 23 hours ago [-]
Vast majority of text fields I see are coded values that are perfectly fine using ascii, but I deal mostly with English language systems.
Text fields that users can type into directly, especially multiline ones, tend to need unicode, but they are far fewer.
simonask 22 hours ago [-]
English has plenty of Unicode — claiming otherwise is such a cliché…
Unicode is a requirement everywhere human language is used, from Earth to the Boötes Void.
Slothrop99 17 hours ago [-]
Just to be pedantic, those characters are in 'ANSI'/CP1252 and would be fine in a varchar on many systems.
Not that I disagree — Win32/C#/Java/etc have 16-bit characters, your entire system is already 'paying the price', so weird to get frugal here.
simonask 16 hours ago [-]
My comment contains two glyphs that are not in CP1252.
zabzonk 17 hours ago [-]
> Unicode is a requirement everywhere human language is used
Strange then how it was not a requirement for many, many years.
procaryote 11 hours ago [-]
It was a mess back then though. Unicode fixed that.
zabzonk 11 hours ago [-]
I'm not convinced that Unicode fixed anything. I was kind of hoping, way back when, that everyone would adopt ASCII, as a step to a more united world. But things seem to have got more differentiated, and made things much more difficult.
NegativeLatency 21 hours ago [-]
Also less awkward to make it right the first time, instead of explaining why someone can’t type their name or an emoji
SigmundA 19 hours ago [-]
Specifically not talking about a name field
SigmundA 20 hours ago [-]
I am talking about coded values, like Status = 'A', 'B' or 'C'
Taking double the space for this stuff is a waste of resources, and nobody usually cares about extended characters here, in English-language systems at least; they just want something more readable than integers when querying and debugging the data. End users will see longer descriptions joined from code tables or from app caches, which can have unicode.
wvenable 17 hours ago [-]
It's way better to just use a DBMS that supports enums. I know SQL Server isn't one of those, but I still don't store my coded values as strings.
andy81 11 hours ago [-]
The way to do enums in SQL (generally, not just MSSQL) is another table. It's better that they don't offer several ways to do the same thing.
sgarland 3 hours ago [-]
While I generally would prefer lookup tables, it's much easier to sell dev teams on "it looks and acts like a string - you don't have to change anything."
SigmundA 10 hours ago [-]
Mostly agree; separate tables can have multiple attributes besides a text description, and can easily be exposed for modification in the application so users or administrators can add and modify codes.
A common extra attribute for a coded value is something for deprecation / soft delete, so that it can be marked as no longer valid for future data while existing data can keep that code; also date ranges it's valid for, parent-child code relationships, etc.
Enums would be a good feature, but they have a much more limited use case: static values you know ahead of time that will have no other attributes, where values cannot be removed even if never used or old data migrated to new values.
Common real-world codes like US postal state can take advantage of there being agreed-upon codes, such as 'NY' for 'New York'.
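A sketch of the kind of code table described above, with all table and column names invented for the example: a short readable code as the key, a description, a soft-delete flag so a deprecated code stays valid on existing rows, and an optional validity range. The DDL is run through Dapper here just to keep the example in C#.

    using Dapper;
    using Microsoft.Data.SqlClient;

    class CodeTableSetup
    {
        static void Main()
        {
            using var conn = new SqlConnection("Server=.;Database=AppDb;Integrated Security=true;TrustServerCertificate=true");
            conn.Execute(@"
                CREATE TABLE dbo.StateCode (
                    Code         char(2)      NOT NULL PRIMARY KEY,  -- 'NY', 'CA', ...
                    Description  varchar(100) NOT NULL,              -- 'New York', ...
                    IsDeprecated bit          NOT NULL DEFAULT 0,    -- soft delete: existing rows stay valid
                    ValidFrom    date         NULL,                  -- optional validity range
                    ValidTo      date         NULL
                );
                CREATE TABLE dbo.Address (
                    AddressId int IDENTITY PRIMARY KEY,
                    Street    nvarchar(200) NOT NULL,                -- free text typed by users
                    State     char(2) NOT NULL
                              REFERENCES dbo.StateCode (Code)        -- readable in ad-hoc queries
                );");
        }
    }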
SigmundA 11 hours ago [-]
How do you store them? Also enums are not user configurable normally. It would be a good feature to have them, but they don't work well in many cases.
Typical code tables with code, description and anything else needed for that value which the user can configure in the app.
Sure, you can use integers instead of codes, but now all your results look like 1, 2, 3, 4 for all your coded columns when trying to debug or write ad-hoc stuff. Also, ints are not variable length, so you're wasting space for short codes, and you have to know ahead of time whether it's only going to be 1, 2, 4 or 8 bytes.
wvenable 2 hours ago [-]
Enums are for non user-configurable values.
For configurable values, obviously you use a table. But those should have an auto-integer primary key and if you need the description, join for it.
Ints are by far the more efficient way to store and query these values -- the length of the string is stored as an int, and variable-length values really complicate storage and access. If you think strings save space or time, that is not right.
SigmundA 2 hours ago [-]
>Enums are for non user-configurable values
In the systems I work with most coded values are user configurable.
>But those should have an auto-integer primary key and if you need the description, join for it.
Not ergonomic when querying data or debugging: things like postal state are 11 instead of 'NY'.
select * from addresses where state = 11, no thanks.
Your whole result set becomes a bunch of ints that can be easily transposed, causing silly errors. Of course I have seen systems that use guids to avoid collision, boy is that fun. Just use varchar or char if you're penny-pinching and ok with fixed sizes.
>the length of the string is stored as an int
No, it's stored as a smallint (2 bytes). So a single-character code is 3 bytes rather than a 4-byte int, and 2 chars is the same as an int. They do not complicate storage or access in any meaningful way.
You could use smallint or tinyint for your primary key and I could use char(2) and char(1) and get readable codes if I wanted to really save space.
kstrauser 18 hours ago [-]
Those are all single byte characters in UTF-8.
SigmundA 10 hours ago [-]
We are talking nvarchar here; yes, UTF-8 solves this issue completely, and MSSQL supports it nowadays with varchar.
simonask 16 hours ago [-]
No. Look closer.
croes 15 hours ago [-]
But nvarchar is UTF-16
psidebot 21 hours ago [-]
Some examples of coded fields that may be known to be ascii: order name, department code, business title, cost center, location id, preferred language, account type…
_3u10 23 hours ago [-]
Generally if it stores user input it needs to support Unicode. That said UTF-8 is probably a way better choice than UTF-16/UCS-2
Dwedit 18 hours ago [-]
The one place UTF-16 massively wins is text that would be two bytes as UTF-16, but three bytes as UTF-8. That's mainly Chinese, Japanese, Korean, etc...
SigmundA 23 hours ago [-]
UTF-8 is a relatively new thing in MSSQL and had lots of issues initially, I agree it's better and should have been implemented in the product long ago.
I have avoided it and have not followed if the issues are fully resolved, I would hope they are.
kstrauser 22 hours ago [-]
> UTF-8 is a relatively new thing in MSSQL and had lots of issues initially, I agree it's better and should have been implemented in the product long ago.
Their insistence on making the rest of the world go along with their obsolete pet scheme would be annoying if I ever had to use their stuff for anything ever. UTF-8 was conceived in 1992, and here we are in 2026 with a reasonably popular database still considering it the new thing.
da_chicken 19 hours ago [-]
I would be more critical of Microsoft choosing to support UCS-2/UTF-16 if Microsoft hadn't completed their implementation of Unicode support in the 90s and then been pretty consistent with it.
Meanwhile Linux had a years long blowout in the early 2000s over switching to UTF-8 from Latin-1. And you can still encounter Linux programs that choke on UTF-8 text files or multi-byte characters 30 years later (`tr` being the one I can think of offhand). AFAIK, a shebang is still incompatible with a UTF-8 byte order mark. Yes, the UTF-8 BOM is both optional and unnecessary, but it's also explicitly allowed by the spec.
downsplat 10 hours ago [-]
It's not really a Linux vs MS thing though. When Unicode first came out, it was 16-bit, so all the early adopters went with that. That includes Java, Windows, JavaScript, the ICU libraries, LibreOffice and its predecessors, .NET, the C language (remember wchar_t?), and probably a few more.
Utf8 turned out to be the better approach, and it's slowly taking over, but it was not only Linux/Unix that pushed it ahead; the entire networking world did, especially http. Props also to early perl for jumping straight to utf8.
Still... Utf8's superiority was clear enough by 2005 or so, MS could and should have seen it by then instead of waiting until 2019 to add utf8 collations to its database. Funny to see Sql Server falling behind good old Mysql on such a basic feature.
wvenable 4 hours ago [-]
Database systems are inherently conservative -- once you add something you have to support it forever. Microsoft went hog wild on XML in the database and I haven't seen it used in over a decade now.
recursive 20 hours ago [-]
In 92 it was a conference talk. In 98 it was adopted by the IETF. Point probably stands though.
swasheck 19 hours ago [-]
the data types were introduced with SQL Server 7 (1998) so i’m not sure it’s accurate to state that it’s considered as the new thing.
Also UTF-8 is actually just a varchar collation so you don't use nvarchar with that, lol?
croes 15 hours ago [-]
Since MS SQL Server 2019 varchar supports unicode so now it’s the opposite, you use nvarchar instead of varchar for backwards compatibility reasons.
applfanboysbgon 23 hours ago [-]
I think this is a rather pertinent showcase of the danger of outsourcing your thinking to LLMs. This article strongly indicates to me that it is LLM-written, and it's likely the LLM diagnosed the issue as being a C# issue. When you don't understand the systems you're building with, all you can do is take the plausible-sounding generated text about what went wrong for granted, and then I suppose regurgitate it on your LLM-generated portfolio website in an ostensible show of your profound architectural knowledge.
ziml77 22 hours ago [-]
This is not at all just an LLM thing. I've been working with C# and MS SQL Server for many years and never even considered this could be happening when I use Dapper. There's likely code I have deployed running suboptimally because of this.
And it's not like I don't care about performance. If I see a small query taking more than a fraction of a second when testing in SSMS, or if I see a larger query taking more than a few seconds, I will dig into the query plan and try to make changes to improve it. For code that I took from testing in SSMS and moved into a Dapper query, I wouldn't have noticed performance issues from that move if the slowdown was never particularly large.
cosmez 22 hours ago [-]
This is a common issue, and most developers I worked with are not aware of it until they see the performance issues.
Most people are not aware of how Dapper maps types under the hood; once you know, you start being careful about it.
Nothing to do with LLMs, just plain old learning through mistakes.
keithnz 22 hours ago [-]
actually, LLMs do way better: with Dapper, the LLM generates code to specify types for strings
paulsutter 22 hours ago [-]
Utf8 solved this completely. It works with any length unicode and on average takes up almost as little storage as ascii.
Utf16 is brain dead and an embarrassment
wvenable 21 hours ago [-]
Blame the Unicode consortium for not coming up with UTF-8 first (or, really, at all). And for assuming that 65536 code points would be enough for everyone.
So many problems could be solved with a time machine.
kstrauser 20 hours ago [-]
The first draft of Unicode was in 1988. Thompson and Pike came up with UTF-8 in 1992, made an RFC in 1998. UTF-16 came along in 1996, made an RFC in 2000.
The time machine would've involved Microsoft saying "it's clear now that UCS-2 was a bad idea, so let's start migrating to something genuinely better".
wvenable 16 hours ago [-]
I don't think it was clear at the time that UTF-8 would take off. UCS-2 and then UTF-16 was well established by 2000 in both Microsoft technologies and elsewhere (like Java). Linux, despite the existence of UTF-8, would still take years to get acceptable internationalization support. Developing good and secure internationalization is a hard problem -- it took a long time for everyone.
It's now 2026, everything always looks different in hindsight.
kstrauser 15 hours ago [-]
I don’t remember it quite that way. Localization was a giant question, sure. Are we using C or UTF-8 for the default locale? That had lots of screaming matches. But in the network service world, I don’t remember ever hearing more than a token resistance against choosing UTF-8 as the successor to ASCII. It was a huge win, especially since ASCII text is already valid UTF-8 text. Make your browser default to parsing docs with that encoding and you can still parse all existing ASCII docs with zero changes! That was a huge, enormous selling point.
Windows is far from a niche player, to be sure. Yet it seems like literally every other OS but them was going with one encoding for everything, while they went in a totally different direction that got complaints even then. I truly believe they thought they’d win that battle and eventually everyone else would move to UTF-16 to join them. Meanwhile, every other OS vendor was like, nah, no way we’re rewriting everything from scratch to work with a not-backward compatible encoding.
wvenable 4 hours ago [-]
Microsoft did the hard work of supporting Unicode when UTF-8 didn't exist (and mostly when UTF-16 didn't exist).
Any system that continued with only ASCII well into the 2000s could mostly just jump into UTF-8 without issue. Doing nothing for non-English users for almost two decades turned out to be a solid plan long term. Microsoft certainly didn't have that option.
gpvos 17 hours ago [-]
MS could easily have added proper UTF-8 support in the early 2000s instead of the late 2010s.
kstrauser 17 hours ago [-]
Yep. It would've been a better landing pad than UTF-16 since they had to migrate off UCS-2 anyway.
Dwedit 18 hours ago [-]
It gets worse for UTF-16, Windows will let you name files using unpaired surrogates, now you have a filename that exists on your disk that cannot be represented in UTF-8 (nor compliant UTF-16 for that matter). Because of that, there's yet another encoding called WTF-8 that can represent the arbitrary invalid 16-bit values.
SigmundA 23 hours ago [-]
Yes I have run into this regardless of client language and I consider it a defect in the optimizer.
wvenable 23 hours ago [-]
I wouldn't consider it a defect in the optimizer; it's doing exactly what it's told to do. It cannot convert an nvarchar to varchar -- that's a narrowing conversion. All it can do is convert the other way and lose the ability to use the index. If you think that there is no danger converting an nvarchar that contains only ASCII to varchar then I have about 70+ different collations that say otherwise.
SigmundA 20 hours ago [-]
Can you give an example of what's dangerous about converting an nvarchar with only ascii (0-127) and then using the index, otherwise falling back to a scan?
If we simply went to UTF-8 collation using varchar then this wouldn't be an issue either, which is why you would use varchar in 2026, best of both worlds so to speak.
wvenable 17 hours ago [-]
For a literal/parameter that happens to be ASCII, a person might know it would fit in varchar, but the optimizer has to choose a plan that stays correct in the general case, not just for that one runtime value. By telling SQL server the parameter is a nvarchar value, you're the one telling it that might not be ASCII.
munch117 14 hours ago [-]
Making a plan that works for the general case, but is also efficient, is rather trivial. Here's pseudocode from spending two minutes on the problem:
    # INPUT: lookfor: unicode
    var lower, upper: ascii
    lower = ascii_lower_bound(lookfor)
    upper = ascii_upper_bound(lookfor)
    for candidate: ascii in index_lookup(lower .. upper):
        if expensive_correct_compare_equal(candidate.field, lookfor):
            yield candidate
The magic is to have functions ascii_lower_bound and ascii_upper_bound, that compute an ASCII string such that all ASCII strings that compare smaller (greater) cannot be equal to the input. Those functions are not hard to write. Although you might have to implement versions for each supported locale-dependent text comparison algorithm, but still, not a big deal.
Worst case, 'lower' and 'upper' span the whole table - could happen if you have some really gnarly string comparison rules to deal with. But then you're no worse off than before. And most of the time you'll have lower==upper and excellent performance.
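A minimal sketch of the bound functions, under the big simplifying assumption of a binary (ordinal) collation, where two strings are only equal if they are identical: an ascii-only lookup value is then its own lower and upper bound, and a value containing any non-ascii character can never match an ascii column, so an equality lookup can short-circuit to an empty result. Locale-aware collations would need real bound computations that follow the collation's comparison rules, as noted above. The helper name and shape are invented for the example.

    static class AsciiRange
    {
        // Returns the ASCII seek range for an equality lookup under a binary (ordinal)
        // collation, or null when no ASCII value can compare equal to lookfor.
        // Real, locale-aware collations would need genuine bound computations.
        public static (string Lower, string Upper)? AsciiBounds(string lookfor)
        {
            foreach (char c in lookfor)
            {
                if (c > 0x7F)
                {
                    return null; // no ASCII candidate: equality can short-circuit to an empty set
                }
            }
            return (lookfor, lookfor); // lower == upper: a single index seek
        }
    }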
jstrong 16 hours ago [-]
optimizer can't inspect the value? pretty dumb optimizer, then.
zabzonk 14 hours ago [-]
It's not "the value", it's "the values".
wvenable 15 hours ago [-]
Running the optimizer for every execution of the same query is... not very optimal.
Also, the simpler and maybe better approach is just to make the decision every time as an operation in the plan: attempt the cast on the parameter; if it fails, then scan and cast every row the other way; if it succeeds, then use the index. This isn't hard and adds one extra cast attempt on the slow path; otherwise it does transparently what everyone has to do manually in their code, like in this article.
I'm not sure it makes sense to add more checks and another operation to every single query just for the case where the user explicitly mislabels the types. You're going to slow down everything everywhere (slightly) for a pretty obscure case. I suspect, in the long term, this would be a bad choice.
SigmundA 1 hours ago [-]
The check is added if it sees a varchar column and nvarchar parameter predicate on it.
It currently just does a scan in that situation, which is orders of magnitude more expensive, with a cast for every row, vs a single extra cast check on the single parameter value that may avoid all those other casts in a common situation.
There is no planning overhead, it's already detecting the situation. The execution overhead is a single extra cast on top of the cast per row, so n+1 vs n with the potential to eliminate n with a very common charset.
briHass 22 hours ago [-]
I've found and fixed this bug before. There are two other ways to handle it:
Dapper has a static configuration for things like TypeMappers, and you can change the default mapping for string to use varchar with: Dapper.SqlMapper.AddTypeMap(typeof(string),System.Data.DbType.AnsiString). I typically set that in the app startup, because I avoid NVARCHAR almost entirely (to save the extra byte per character, since I rarely need anything outside of ANSI.)
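As a minimal sketch of that configuration, plus the per-parameter DbString alternative Dapper also offers (AddTypeMap and DbString are standard Dapper APIs; the connection string, table and column names here are placeholders):

    using System.Data;
    using Dapper;
    using Microsoft.Data.SqlClient;

    class DapperVarcharSetup
    {
        static void Main()
        {
            // Once at startup: map .NET strings to varchar (AnsiString)
            // instead of Dapper's default nvarchar (String).
            SqlMapper.AddTypeMap(typeof(string), DbType.AnsiString);

            using var conn = new SqlConnection("Server=.;Database=AppDb;Integrated Security=true;TrustServerCertificate=true");

            // Per-parameter alternative: DbString when only some columns are varchar.
            // Length should match the column definition so cached plans are reused.
            var orders = conn.Query(
                "SELECT * FROM dbo.Orders WHERE OrderNumber = @OrderNumber",
                new { OrderNumber = new DbString { Value = "SO-12345", IsAnsi = true, Length = 20 } });
        }
    }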
Or, one could use stored procedures. Assuming you take in a parameter that is the correct type for your indexed predicate, the conversion happens once when the SPROC is called, not done by the optimizer in the query.
I still have mixed feelings about overuse of SQL stored procedures, but this is a classic example of where one of their benefits is revealed: they are a defined interface for the database, where DB-specific types can be handled instead of polluting your code with specifics about your DB.
(This is also a problem for other type mismatches like DateTime/Date, numeric types, etc.)
ziml77 21 hours ago [-]
Sprocs are how I handle complex queries rather than embedding them in our server applications. It's definitely saved me from running into problems like this. And it comes with another advantage of giving DBAs more control to manage performance (DBAs do not like hearing that they can't take care of a performance issue that's cropped up because the query is compiled into an application)
bonesss 12 hours ago [-]
As a general issue of hygiene I tend to wrap any ORM and access it through an internal interface.
1) The joy of writing and saying DapperWrapper can’t be overstated.
2) in conjunction with meaningful domain types it lets you handle these issues across the app at a single point of control, and capture more domain semantics for testing.
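One possible shape for such a wrapper, as a sketch with invented names (IDbGateway, DapperWrapper): the application only sees a narrow interface, and the wrapper is the single place that knows every string parameter should go out as ANSI.

    using System.Collections.Generic;
    using System.Data;
    using Dapper;
    using Microsoft.Data.SqlClient;

    // Hypothetical interface: the rest of the app never touches Dapper directly.
    public interface IDbGateway
    {
        IEnumerable<T> Query<T>(string sql, object param = null);
    }

    // Single point of control for DB-specific concerns such as parameter typing.
    public sealed class DapperWrapper : IDbGateway
    {
        private readonly string _connectionString;

        public DapperWrapper(string connectionString)
        {
            _connectionString = connectionString;
            // Applied once for the process: send .NET strings as varchar so varchar indexes stay seekable.
            SqlMapper.AddTypeMap(typeof(string), DbType.AnsiString);
        }

        public IEnumerable<T> Query<T>(string sql, object param = null)
        {
            using var conn = new SqlConnection(_connectionString);
            return conn.Query<T>(sql, param); // buffered by default, so it is safe to return after dispose
        }
    }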
diath 21 hours ago [-]
It's weird that the article does not show any benchmarks but crappy descriptions like "milliseconds to microseconds" and "tens of thousands to single digits". This is the kind of vague performance description LLMs like to give when you ask them about performance differences between solutions and don't explicitly ask for a benchmark suite.
pllbnk 14 hours ago [-]
I disagree. I think it's a nice discovery; many might be unaware of it and would otherwise spend a lot of time tracking down the performance issue independently. I also disagree that a rigorous benchmark is needed for every single performance-related blog post, because good benchmarks are difficult to write; you have to account for multiple variables. Here, the author just said "trust me, it's much faster", and I trust them because they explained the reasoning behind the degradation.
nmeofthestate 12 hours ago [-]
The writing style certainly screams LLM.
_vertigo 16 hours ago [-]
> No schema changes. No new indexes. No query rewrites. Just telling Dapper the correct parameter type.
pllbnk 14 hours ago [-]
Are we automatically discarding everything that might or might not have been written or assisted by an LLM? I get it when the articles are the type of meaningless self improvement or similar kind of word soup. However, if hypothetically an author uses LLM assistance to improve their styling to their liking, I see nothing wrong with that as long as the core message stands out.
rmunn 12 hours ago [-]
I've seen so many LLM-generated articles by this point that obviously had no human editing done beforehand — just prompt and slap it onto the Web — that it makes me wonder every time. If I read this article, will I actually learn only truth? Or are there some key parts of this article that are actually false because the LLM hallucinated them, and the human involved didn't bother to double-check the article before publishing it?
If someone was just using the LLM for style, that's fine. But if they were using it for content, I just can't trust that it's accurate. And the time cost for me to read the article just isn't worth it if there's a chance it's wrong in important ways, so when I see obvious signs of LLM use, I just skip and move on.
Now, if someone acknowledged their LLM use up front and said "only used for style, facts have been verified by a human" or whatever, then I'd have enough confidence in the article to spend the time to read it. But unacknowledged LLM use? Too great a risk of uncorrected hallucinations, in my experience, so I'll skip it.
maciekkmrk 21 hours ago [-]
Interesting problem, but the AI prose makes me not want to read to the end.
downsplat 13 hours ago [-]
Did this post come out of a freezer from 1998? Who on earth creates databases in Latin1 in 2026?
Nevermind, looks like Sql Server didn't add utf8 collations until 2019 (!) and for decades people had to choose column by column between the 16-bit overhead of "nvarchar" and latin1... And still do if they want a bit of backwards compatibility. Amazing.
rmunn 12 hours ago [-]
"Just use Postgres" (which defaults to UTF-8 encoding unless specifically configured to use something else) is looking like better and better advice every day.
Doesn't help those tied to legacy systems that would require a huge, expensive effort to upgrade, though. Sorry, folks. There's a better system, you know it's a better system, and you can't use it because switching is too expensive? I've been there (not databases, in my case) and it truly sucks.
elmigranto 13 hours ago [-]
Third party dependencies are very easy: you just have to intimately know how it is implemented in addition to knowing your own code and stack, and then you are golden!
Nothing to learn, just focus on making your app, it’s all taken care of by This One Simple Package ;)
These things are far from free, however much our tooling presents them as "just nuget it or whatever".
DeathMetal3000 4 hours ago [-]
I’m sure writing their own ORM would have given them instantaneous insight into this issue and introduced no other challenges. Open source developers hate this one weird trick!
elmigranto 3 hours ago [-]
Especially for things used directly, you need to understand both your own and the third-party code, roughly to the same level. With your own code, you only care for your own use case; with third-party code, you have to kind of get everyone else's.
Depending on what you do and the dependency's scope, either way can make sense.
smithkl42 23 hours ago [-]
Been bit by that before: it's not just an issue with Dapper, it can also hit you with Entity Framework.
pjmlp 16 hours ago [-]
I never had this issue with Dapper; as others point out, it's a holding-it-wrong problem.
andrelaszlo 22 hours ago [-]
I thought, having just read the title, that maybe it's time to upgrade if you're still on Ubuntu 6.06.
jiggawatts 23 hours ago [-]
This feels like a bug in the SQL query optimizer rather than Dapper.
It ought to be smart enough to convert a constant parameter to the target column type in a predicate constraint and then check for the availability of a covering index.
valiant55 23 hours ago [-]
There's a data type precedence that it uses to determine which value should be cast[0]. Nvarchar is higher precedence, therefore the varchar value is "lifted" to an nvarchar value first. This wouldn't be an issue if the types were reversed.
[0] https://learn.microsoft.com/en-us/sql/t-sql/data-types/data-...
It's the optimizer caching the query plan as a parameterized query. It's not re-planning the index lookup on every execution.
SigmundA 23 hours ago [-]
The parameter type is part of the cache identity, nvarchar and varchar would have two cache entries with possibly different plans.
beart 22 hours ago [-]
How do you safely convert a 2 byte character to a 1 byte character?
jiggawatts 22 hours ago [-]
Easily! If it doesn't convert successfully because it includes characters outside the range of the target codepage, then the equality condition is necessarily false, and the engine should short-circuit and return an empty set.
adzm 23 hours ago [-]
even better is Entity Framework and how it handles null strings by creating some strange predicates in SQL that end up being unable to seek into string indexes
enord 23 hours ago [-]
This is due to utf-16, an unforgivable abomination.
bunbun69 12 hours ago [-]
AI slop article
Also no meaningful benchmarking was done
mvdtnz 20 hours ago [-]
This is a really interesting blog post - the kind of old school stuff the web used to be riddled with. I must say - would it have been that hard to just write this by hand? The AI adds nothing here but the same annoying old AI-isms that distract from the piece.
ltbarcly3 20 hours ago [-]
Life is too short to use SQL Server. I know people that use it will swear it's "not bad anymore" but yes it is.
bni 9 hours ago [-]
yes it is