Resource summary
Language Identification and
Modelling in Specialized Hardware
- Introduction
- Natural language processing
- Tasks
- Language identification
- Is essential to using
the Web as a corpus
- Language Modelling
- Challenge
- Large data sizes
- Solution
- Use specialized hardware
- Graphics processing units - GPUs
- Application
- Neural Networks
- Parsing
- Field-programmable
gate arrays - FPGAs
- Features
- Fast
- Customizable
- Application
- Encoding Grammars
- Research
- Idea
- Repurpose Network
security hardware
- Application specific
integrated circuit for
network monitoring
- Deterministic
pushdown
transducer
- A finite state transducer (FST) is a finite
state automaton (FSA) that produces
output as well as reading input, i.e. a
finite state machine with output.
- Recursive finite-domain programs can be
characterized by finite-state transducers
that are augmented with a pushdown store.
Such transducers are called pushdown
transducers.
- Has a stack
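To make the definition above concrete, here is a minimal sketch of a deterministic pushdown transducer emulated in Python. It is a toy built for illustration, not the hardware's programming model: the function name `bracket_spans` and the choice of parentheses as input symbols are assumptions. It reads input left to right, pushes on an opening bracket, pops on a closing one, and emits the matched span as output.

```python
def bracket_spans(text):
    """Toy pushdown transducer (hypothetical example).

    Reads characters one at a time, keeps a stack of unmatched '('
    positions, and emits a (start, end) span each time a ')' closes one.
    """
    stack = []   # pushdown store: positions of unmatched '('
    output = []  # output tape: matched spans, in the order they close
    for pos, ch in enumerate(text):
        if ch == "(":
            stack.append(pos)                 # push
        elif ch == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {pos}")
            start = stack.pop()               # pop
            output.append((start, pos + 1))   # output matched span
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return output


spans = bracket_spans("a(b(c)d)e")
print(spans)
```

A plain finite-state transducer could not do this for arbitrary nesting depth, because it has no store in which to remember how many brackets are still open.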
- Programmable
- Executes
regular
expressions
- POSIX
- When matched
- Outputs constant to CPU
- Use stack
- Push
- Pop
- Output matched span
- Halt
- No user-accessible
arithmetic
- Applications
- Reason
- Do not easily
map to regular
expressions
- Tasks
- Language
Identification
- Use model of Lui
and Baldwin, 2012
for 97 languages
- Naive Bayes
model
- Feature
strings are
converted to
literal regular
expressions
- Collect
feature
counts
- Emulate
automata
on CPU
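The pipeline above (feature strings turned into literal regular expressions, feature counts collected, then a Naive Bayes decision) can be sketched on the CPU as follows. Everything numeric here is a made-up toy: the three features, the three languages, and the probabilities are hypothetical and are not the parameters of the Lui and Baldwin (2012) model, which covers 97 languages.

```python
import math
import re

# Hypothetical toy parameters, for illustration only.
FEATURES = ["the", "sch", "ij"]
LOG_PRIOR = {lang: math.log(1 / 3) for lang in ("en", "de", "nl")}
LOG_LIKELIHOOD = {
    "en": {"the": math.log(0.7), "sch": math.log(0.1), "ij": math.log(0.2)},
    "de": {"the": math.log(0.1), "sch": math.log(0.7), "ij": math.log(0.2)},
    "nl": {"the": math.log(0.1), "sch": math.log(0.3), "ij": math.log(0.6)},
}

# Each feature string becomes a literal regular expression. The lookahead
# makes findall() report overlapping occurrences as well.
PATTERNS = {f: re.compile("(?=" + re.escape(f) + ")") for f in FEATURES}


def feature_counts(text):
    """Count how often each feature string occurs in the text."""
    return {f: len(p.findall(text)) for f, p in PATTERNS.items()}


def identify(text):
    """Multinomial Naive Bayes: log prior plus count-weighted log likelihoods."""
    counts = feature_counts(text)
    scores = {
        lang: LOG_PRIOR[lang]
        + sum(counts[f] * LOG_LIKELIHOOD[lang][f] for f in FEATURES)
        for lang in LOG_PRIOR
    }
    return max(scores, key=scores.get)


print(identify("the cat and the dog"))
```

In this sketch the counting step is the part that maps naturally onto pattern-matching hardware, while the weighted sum needs arithmetic and therefore stays on the CPU.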
- Language
modelling
- Using back-off models
- Using Telescoping series
- A telescoping series is any series where
nearly every term cancels with a
preceding or following term. For
instance, the series
Notes:
- https://en.wikipedia.org/wiki/Telescoping_series
- http://math.oregonstate.edu/home/programs/undergrad/CalculusQuestStudyGuides/SandS/SeriesTests/telescoping.html
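A standard textbook telescoping series, given here as an illustration (it is an assumption that this is the kind of example the note had in mind), splits each term by partial fractions so that neighbouring terms cancel and only the first and last pieces survive:

```latex
\[
\sum_{k=1}^{n} \frac{1}{k(k+1)}
  = \sum_{k=1}^{n} \left( \frac{1}{k} - \frac{1}{k+1} \right)
  = 1 - \frac{1}{n+1}
\]
```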
- Collapse prob and
back off into single
function
- Preserves
sentence-level
probabilities
- Sends just one value Q per token
rather than a probability and back-offs
- Saves CPU workload
and communications
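One way such a collapse can work is sketched below on a toy bigram model. This is an assumed construction, and the exact definition of Q in the paper may differ: each matched n-gram's value is its probability, multiplied by the back-off weights of all its suffixes and divided by the back-off weights of all suffixes of its context. Across a sentence these factors cancel like a telescoping product, so the sentence-level score is unchanged. The model values, `ORDER`, and all function names are hypothetical, and the toy omits a begin-of-sentence token.

```python
import math

# Hypothetical toy back-off model: n-gram -> (probability, back-off weight).
# N-grams absent from the table are treated as having back-off weight 1.
MODEL = {
    ("a",): (0.5, 0.8),
    ("b",): (0.3, 0.7),
    ("</s>",): (0.2, 1.0),
    ("a", "b"): (0.4, 1.0),
    ("b", "</s>"): (0.5, 1.0),
}
ORDER = 2  # bigram model


def backoff(ngram):
    return MODEL[ngram][1] if ngram in MODEL else 1.0


def longest_match(context, word):
    """Longest n-gram ending in `word` that is present in the model."""
    ngram = tuple(context) + (word,)
    while ngram and ngram not in MODEL:
        ngram = ngram[1:]
    if not ngram:
        raise KeyError(word)
    return ngram


def standard_prob(context, word):
    """Classic back-off: matched probability times the back-off weights
    of every context that was longer than the matched one."""
    ngram = longest_match(context, word)
    prob = MODEL[ngram][0]
    full = tuple(context)
    for start in range(len(full) - (len(ngram) - 1)):
        prob *= backoff(full[start:])
    return prob


def q(ngram):
    """Collapsed value: depends only on the matched n-gram itself."""
    value = MODEL[ngram][0]
    for start in range(len(ngram)):
        value *= backoff(ngram[start:])      # suffixes of the n-gram
    context = ngram[:-1]
    for start in range(len(context)):
        value /= backoff(context[start:])    # suffixes of its context
    return value


def sentence_scores(tokens):
    """Return (standard back-off product, product of collapsed values)."""
    standard, collapsed = 1.0, 1.0
    for i, word in enumerate(tokens):
        context = tokens[max(0, i - (ORDER - 1)):i]
        standard *= standard_prob(context, word)
        collapsed *= q(longest_match(context, word))
    return standard, collapsed


print(sentence_scores(["b", "a", "b", "</s>"]))
```

Per-token values differ between the two schemes, but the sentence products agree, which is why only one number per matched n-gram has to be sent back.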
- Simplified query
- For each word match as
much context as possible
- Use greedy matching
- Match as much
leading context as
possible
- Scanning until a
match is found
- Report the longest match
- Resume scanning
- The longest matching
N-gram will be reported
- Use "greedy"
regular
expressions
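The greedy matching described above can be emulated on the CPU with a short loop. This is a sketch under assumptions: the n-gram set, `ORDER`, and the function name are hypothetical, and plain tuple lookups stand in for the greedy regular expressions the hardware would run. For each token it tries the longest candidate first and reports the first one found, so the reported n-gram is always the longest match.

```python
# Hypothetical toy set of n-grams present in the language model.
NGRAMS = {("a",), ("b",), ("c",), ("a", "b"), ("a", "b", "c")}
ORDER = 3  # longest n-gram length in the model


def longest_matches(tokens):
    """For each token, report the longest known n-gram ending at it."""
    reported = []
    for end in range(1, len(tokens) + 1):
        match = None
        # Try the most leading context first, then shrink it.
        for start in range(max(0, end - ORDER), end):
            candidate = tuple(tokens[start:end])
            if candidate in NGRAMS:
                match = candidate
                break
        reported.append(match)  # None if even the unigram is unknown
    return reported


print(longest_matches(["a", "b", "c", "b"]))
```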
- Experiments
- Performance Evaluation
- One core
- Language
identification
- 2.4 times faster
- Than the fastest
CPU program
- More details in
the paper
- Tested against
some existing
models
- CLD2
- C++
- Original
- Python
- Java
- Language
modelling
- 1.8 to 6 times faster
- Than CPU programs
- KenLM
- DALM
- Part of speech
- In corpus linguistics, part-of-speech tagging (POS tagging or POST), also called grammatical
tagging or word-category disambiguation, is the process of marking up a word in a text (corpus)
as corresponding to a particular part of speech, based on both its definition, as well as its
context—i.e. relationship with adjacent and related words in a phrase, sentence, or paragraph. A
simplified form of this is commonly taught to school-age children, in the identification of words
as nouns, verbs, adjectives, adverbs, etc.
Notes:
- https://en.wikipedia.org/wiki/Part-of-speech_tagging
- Tarari T2540
PCI express
- Controlled by a
single-threaded CPU
program
- Performed
arithmetic
- Scalable up to
4 devices