6.16

Knowledge Graphs

RDF triples, semantic web, Wikidata, entity resolution.

Knowledge Graphs — Brief ☧

"In whom are hid all the treasures of wisdom and knowledge."

— Colossians 2:3 (KJV)



Q: Think about how you naturally describe facts: "Paris is the capital of France," "Water boils at 100 degrees," "Aspirin treats headaches." Each fact has the same shape — a subject, a relationship, and an object. What if you stored millions of these facts in a structured database where you could query them, traverse connections, and discover new relationships?

A: That is a knowledge graph — a database of (subject, predicate, object) triples. Each triple is an edge in a graph: the subject and object are nodes, and the predicate is the labeled edge connecting them.

(Paris,         capitalOf,  France)
(Aspirin,       TREATS,     Headache)
(Inflammation,  TREATS,     IL-6)        ← from SemMedDB

Knowledge graphs power Google's Knowledge Panel, Amazon's product recommendations, and biomedical research tools. They turn unstructured knowledge into something a computer can traverse and reason over.
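The triple model is small enough to sketch in a few lines of Python. This is a toy store with a pattern-matching query, purely illustrative (the names and data here are assumptions, not the project's actual storage format):

```python
# Toy triple store: each fact is a (subject, predicate, object) tuple.
triples = {
    ("Paris", "capitalOf", "France"),
    ("Aspirin", "TREATS", "Headache"),
    ("Ibuprofen", "TREATS", "Headache"),
}

def query(subject=None, predicate=None, obj=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o) for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]

# "What treats headaches?" is one wildcard query over the graph.
treatments = {s for (s, _, _) in query(predicate="TREATS", obj="Headache")}
```

Real systems index all three positions so any wildcard pattern is fast, but the query shape is the same.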

Q: You mentioned SemMedDB — what is that, and how does it connect to this project?

A: SemMedDB is a massive biomedical knowledge graph containing 3.38 million concept-predicate-concept triples, automatically extracted from millions of PubMed research abstracts. It captures relationships like "Drug X TREATS Disease Y" and "Gene A INHIBITS Protein B."

We used our FPGA to compute all pairwise domain intersections across the TREATS and INHIBITS predicates — 950 million pairs tested in 3,097 seconds (307K intersections/sec). This found 6.04 million statistically significant associations, including Inflammation ↔ IL-6.

Q: 950 million pairs sounds enormous. How is that even feasible?

A: Each concept's "domain" — the set of other concepts it relates to — is stored as a bitmask. Checking whether two concepts share unexpected overlap is just a bitwise AND followed by a popcount (count the 1-bits). Our FPGA performs these operations on hierarchical bitmasks across 8 parallel lanes, turning what would be a days-long computation into under an hour. The algorithm is essentially streaming intersection at massive scale.

Key Results

Metric                Value
Dataset               SemMedDB (395K CUIs, 3.38M domains)
Pairs tested          950 million
Time (8-lane FPGA)    3,097 seconds
Throughput            307K intersections/sec
Significant hits      6.04 million
Top hit               Inflammation ↔ IL-6 (-log10(p) = 2785)

These numbers tell a compelling story. The SemMedDB dataset contains 395,000 medical concepts (CUIs), and we needed to test every pair of concepts across the TREATS and INHIBITS predicates for statistically significant overlap — nearly a billion pairs. At 307,000 intersections per second, the 8-lane FPGA configuration completed this in just under an hour. The top hit, Inflammation paired with IL-6 (a key inflammatory cytokine), had a negative log10 p-value of 2785 — an astronomically significant association that confirms known biology. Finding hits like this at scale is exactly the kind of task that motivates building specialized hardware: the computation is simple (bitwise AND + popcount), massively repetitive (nearly a billion times), and perfectly parallel.
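The brief does not say which test produces the p-values; a common choice for set-overlap significance is the hypergeometric tail test (the basis of Fisher's exact test), which asks how likely the observed overlap would be if the two domains were random subsets. A minimal stdlib sketch on small toy numbers:

```python
from math import comb

def hypergeom_sf(k: int, N: int, K: int, n: int) -> float:
    """P(overlap >= k) when a random n-subset of an N-element universe
    is intersected with a fixed K-subset: the 'random baseline' null."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)

# Toy example: universe of 100 concepts, two domains of 10 concepts each,
# observed overlap of 5.  Expected overlap under the null: 10*10/100 = 1.
p = hypergeom_sf(5, N=100, K=10, n=10)
```

At SemMedDB scale the p-values underflow floats, so a real pipeline would work in log space, but the test itself is the same comparison of observed popcount against the random baseline.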

Connection to our system: Knowledge graphs are the real-world application that motivates our entire FPGA architecture. Each medical concept is represented as a domain — a bitmask where each bit indicates whether that concept participates in a particular relationship. Computing the intersection of two concepts' domains (a bitwise AND) reveals their shared relationships. The popcount of the result tells you how many relationships they share, and comparing this to a random baseline gives you statistical significance. Our hierarchical domain structure makes this especially efficient: the Level 2 summary bits let the FPGA skip entire regions of the domain that have no set bits, avoiding wasted computation. This screening step is what makes the 950-million-pair computation feasible — most pairs are quickly dismissed at the summary level without ever examining the full domain.

Learn more in the deep version

Related: Tensor Networks | Streaming Intersection


Soli Deo Gloria

Self-Check 1/1

Knowledge graphs store facts as subject-predicate-_____ triples.