Knowledge Graphs — Brief ☧
"In whom are hid all the treasures of wisdom and knowledge."
— Colossians 2:3 (KJV)
Q: Think about how you naturally describe facts: "Paris is the
capital of France," "Water boils at 100 degrees," "Aspirin treats
headaches." Each fact has the same shape — a subject, a relationship,
and an object. What if you stored millions of these facts in a
structured database where you could query them, traverse connections,
and discover new relationships?
A: That is a knowledge graph — a database of
(subject, predicate, object) triples. Each triple is an edge in a
graph: the subject and object are nodes, and the
predicate is the labeled edge connecting them.
(Paris, capitalOf, France)
(Aspirin, TREATS, Headache)
(Inflammation, TREATS, IL-6) ← from SemMedDB

Knowledge graphs power Google's Knowledge Panel, Amazon's product
recommendations, and biomedical research tools. They turn unstructured
knowledge into something a computer can traverse and reason over.
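To make the triple structure concrete, here is a minimal sketch in Python. It stores the example triples as plain tuples and indexes them by subject so that a query follows edges instead of scanning every fact. The names `outgoing` and `objects_of` are illustrative assumptions, not part of SemMedDB or of our system.

```python
from collections import defaultdict

# Triples as (subject, predicate, object) tuples, taken from the examples above.
triples = [
    ("Paris", "capitalOf", "France"),
    ("Aspirin", "TREATS", "Headache"),
]

# Index outgoing edges by subject: each entry is a (predicate, object) pair.
outgoing = defaultdict(list)
for subj, pred, obj in triples:
    outgoing[subj].append((pred, obj))

def objects_of(subject, predicate):
    """Return every object linked to `subject` by `predicate`."""
    return [obj for pred, obj in outgoing.get(subject, []) if pred == predicate]

print(objects_of("Aspirin", "TREATS"))  # ['Headache']
```

A production triple store would also index by predicate and by object, but the subject index is enough to show how a labeled edge becomes a lookup.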
Q: You mentioned SemMedDB — what is that, and how does it connect
to this project?
A: SemMedDB is a massive biomedical knowledge graph containing
3.38 million concept-predicate-concept triples, automatically
extracted from millions of PubMed research abstracts. It captures
relationships like "Drug X TREATS Disease Y" and "Gene A INHIBITS
Protein B."
We used our FPGA to compute all pairwise domain intersections across
TREATS and INHIBITS predicates — 950 million pairs tested in
3,097 seconds (307K intersections/sec). This found 6.04 million
statistically significant associations, including Inflammation ↔ IL-6.
Q: 950 million pairs sounds enormous. How is that even feasible?
A: Each concept's "domain" — the set of other concepts it relates
to — is stored as a bitmask. Checking whether
two concepts share unexpected overlap is just a bitwise AND followed
by a popcount (count the 1-bits). Our FPGA performs these operations
on hierarchical bitmasks across 8 parallel lanes,
turning what would be a days-long computation into under an hour. The
algorithm is essentially streaming
intersection at massive scale.
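The AND-then-popcount step can be sketched in a few lines of software. This is only an illustration of the operation, assuming each related concept has been assigned a small integer index; it says nothing about how the FPGA lanes are actually wired. The helper names `make_domain` and `shared_count` are hypothetical.

```python
def make_domain(indices):
    """Build a bitmask with one bit set per related-concept index."""
    mask = 0
    for i in indices:
        mask |= 1 << i
    return mask

def shared_count(domain_a, domain_b):
    """Bitwise AND, then popcount: how many related concepts both domains share."""
    return bin(domain_a & domain_b).count("1")

a = make_domain([1, 4, 7, 9])
b = make_domain([4, 9, 12])
print(shared_count(a, b))  # 2 (indices 4 and 9 are in both)
```

Python integers are arbitrary precision, so one integer can stand in for a domain of any width. Hardware would split the same mask into fixed-width words and process them in parallel.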
Key Results
| Metric | Value |
|---|---|
| Dataset | SemMedDB (395K CUIs, 3.38M domains) |
| Pairs tested | 950 million |
| Time (8-lane FPGA) | 3,097 seconds |
| Throughput | 307K intersections/sec |
| Significant hits | 6.04 million |
| Top hit | Inflammation ↔ IL-6 (-log10(p) = 2785) |
The dataset contains about 395,000 medical concepts (the CUIs in the table), and every pair of concept domains under the TREATS and INHIBITS predicates had to be tested for statistically significant overlap: 950 million pairs in all. At roughly 307,000 intersections per second, the 8-lane FPGA configuration finished in 3,097 seconds, or just under 52 minutes. The top hit, Inflammation paired with IL-6 (a key inflammatory cytokine), had a -log10(p) of 2785, an extremely strong association that agrees with known biology. This is the kind of workload that motivates specialized hardware: each operation is simple (bitwise AND plus popcount), it repeats nearly a billion times, and every pair can be tested independently of the others.
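A -log10(p) in the thousands cannot be computed by forming p directly, because the value underflows double precision. The sketch below shows one common way to score an overlap in log space. It assumes a hypergeometric tail test, which is our illustration only: this brief does not state which test produced the reported values. The function names and the input numbers are hypothetical.

```python
import math

def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def neg_log10_p(universe, size_a, size_b, overlap):
    """-log10 of P(X >= overlap) when size_b items are drawn at random
    from `universe` items, of which size_a are marked (assumed test)."""
    lo = max(overlap, size_a + size_b - universe, 0)
    hi = min(size_a, size_b)
    if lo > hi:
        return math.inf  # overlap larger than either set: probability zero
    log_denominator = log_comb(universe, size_b)
    log_terms = [
        log_comb(size_a, i) + log_comb(universe - size_a, size_b - i) - log_denominator
        for i in range(lo, hi + 1)
    ]
    # Log-sum-exp keeps the sum stable when individual terms would underflow.
    peak = max(log_terms)
    log_p = peak + math.log(sum(math.exp(t - peak) for t in log_terms))
    return max(0.0, -log_p / math.log(10))

# Illustrative numbers only, not taken from SemMedDB.
print(neg_log10_p(universe=10, size_a=3, size_b=3, overlap=3))  # log10(120), about 2.08
```

In this toy case two 3-element sets drawn from 10 items coincide completely, which happens by chance 1 time in 120. The same function returns large values without underflow when the sets and the overlap are much bigger.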
Connection to our system: Knowledge graphs are the real-world application that motivates our entire FPGA architecture. Each medical concept is represented as a domain: a bitmask in which each bit indicates whether that concept participates in a particular relationship. The intersection of two concepts' domains (a bitwise AND) reveals their shared relationships. The popcount of the result gives the number of shared relationships, and comparing that count to a random baseline yields its statistical significance.

Our hierarchical domain structure makes this especially efficient. The Level 2 summary bits let the FPGA skip entire regions of the domain that have no set bits, avoiding wasted computation. This screening step is what makes the 950-million-pair computation feasible: most pairs are dismissed at the summary level without the full domain ever being examined.
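The summary-level screening can be illustrated with a simplified software model. The sketch assumes a single summary level with one bit per 64-bit block. The real hierarchy, its number of levels, and its block width are not specified in this brief, so `BLOCK_BITS`, `HierDomain`, and `intersect_count` are hypothetical names chosen for the example.

```python
BLOCK_BITS = 64  # assumed block width; the source does not specify one

class HierDomain:
    """Sparse blocks plus a summary mask with one bit per non-empty block."""
    def __init__(self, indices):
        self.blocks = {}   # block number -> BLOCK_BITS-wide bitmask
        self.summary = 0   # bit b is set when block b has any set bit
        for i in indices:
            block, offset = divmod(i, BLOCK_BITS)
            self.blocks[block] = self.blocks.get(block, 0) | (1 << offset)
            self.summary |= 1 << block

def intersect_count(a, b):
    """Return (shared bit count, number of blocks actually examined)."""
    live = a.summary & b.summary      # blocks that are non-empty in both
    count = 0
    examined = 0
    while live:
        low = live & -live            # isolate the lowest set summary bit
        block = low.bit_length() - 1
        count += bin(a.blocks[block] & b.blocks[block]).count("1")
        examined += 1
        live ^= low                   # clear that bit and continue
    return count, examined

x = HierDomain([3, 70, 5000])
y = HierDomain([70, 9000])
z = HierDomain([200000])
print(intersect_count(x, y))  # (1, 1): one shared bit, one block examined
print(intersect_count(x, z))  # (0, 0): dismissed at the summary level
```

The second call shows the screening effect: the two summaries have no bit in common, so the pair is rejected without reading a single block.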
Soli Deo Gloria